Classifying Lung Adenocarcinoma and Squamous Cell Carcinoma using RNA-Seq Data

Authors

  • Zhengyan Huang
  • Li Chen
  • Chi Wang

Keywords:

LUAD, LUSC, Principal Components, LASSO, K-Nearest Neighbors

Abstract

Background: Lung adenocarcinoma (LUAD) and lung squamous cell carcinoma (LUSC) are
two primary subtypes of non-small cell lung carcinoma (NSCLC). Currently, the most widely
used method to discriminate between LUAD and LUSC is hematoxylin-eosin (HE) staining.
However, this method is not always able to precisely diagnose LUAD or LUSC. More accurate
diagnostic approaches are highly desired.
Methods: We propose using gene expression profiles to determine a patient’s NSCLC subtype.
We leveraged RNA-Seq data from The Cancer Genome Atlas (TCGA) and randomly split
the data into training and testing subsets. To construct classifiers based on the training data, we
considered three methods: logistic regression on principal components (PCR), logistic regression
with LASSO shrinkage (LASSO), and k-nearest neighbors (KNN). The performance of the
classifiers was evaluated and compared on the testing data.
Results: All gene expression-based classifiers show high accuracy in discriminating between
LUSC and LUAD. The classifier obtained by LASSO has the smallest overall misclassification
rate of 3.42% (95% CI: 3.25%-3.60%) when using 0.5 as the cutoff value for the predicted
probability of belonging to a subtype, followed by classifiers obtained by PCR (4.36%,
95% CI: 4.23%-4.49%) and KNN (8.70%, 95% CI: 8.57%-8.83%). The LASSO classifier also
has the highest average area under the receiver operating characteristic curve (AUC) value of
0.993, compared to PCR (0.987) and KNN (0.965).
Conclusions: Our results suggest that mRNA expression levels are highly informative for classifying
NSCLC subtypes and could be used to assist clinical diagnosis.
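
The workflow described under Methods can be illustrated with a minimal Python sketch. It is not the authors' implementation: scikit-learn, the synthetic data standing in for the TCGA expression matrix, the 70/30 split, and every tuning value (10 principal components, C=0.1, 5 neighbors) are assumptions made only for illustration. The function names `build_classifiers` and `evaluate` are hypothetical.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for a samples-by-genes expression matrix (assumption:
# real TCGA data would be loaded and normalized here instead).
X, y = make_classification(n_samples=300, n_features=200, n_informative=20,
                           class_sep=1.5, random_state=0)


def build_classifiers():
    """Return the three classifier types named in the abstract.

    All hyperparameters are illustrative guesses, not the published settings.
    """
    return {
        # Logistic regression on principal components (PCR)
        "PCR": make_pipeline(StandardScaler(), PCA(n_components=10),
                             LogisticRegression(max_iter=1000)),
        # Logistic regression with an L1 (LASSO) penalty
        "LASSO": make_pipeline(StandardScaler(),
                               LogisticRegression(penalty="l1",
                                                  solver="liblinear", C=0.1)),
        # k-nearest neighbors
        "KNN": make_pipeline(StandardScaler(),
                             KNeighborsClassifier(n_neighbors=5)),
    }


def evaluate(X, y, seed, cutoff=0.5):
    """Fit on a random training split; score each classifier on the test split."""
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=0.3, stratify=y, random_state=seed)
    results = {}
    for name, clf in build_classifiers().items():
        clf.fit(X_tr, y_tr)
        prob = clf.predict_proba(X_te)[:, 1]   # predicted probability of class 1
        pred = (prob >= cutoff).astype(int)    # apply the probability cutoff
        results[name] = {
            "error": float(np.mean(pred != y_te)),   # misclassification rate
            "auc": float(roc_auc_score(y_te, prob)),
        }
    return results


results = evaluate(X, y, seed=0)
for name, metrics in results.items():
    print(f"{name}: error={metrics['error']:.3f}, AUC={metrics['auc']:.3f}")
```

Calling `evaluate` with many different seeds and summarizing the per-split errors is one way to obtain an average rate with an interval estimate; the abstract does not state how its confidence intervals were computed, so that step is left out here.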

Published

2017-09-19