

Using Gene Expression Data to Predict Clinical Information in Seven Human Cancers

Nathan Abell

Dataset Overview

References and Acknowledgements

Future Directions

[email protected]

Department of Genetics

Stanford University School of Medicine

In this project, it quickly became clear that very low-dimensional sets of genes, forming coherent signatures, could represent the disease subtype in all studied tissues. Several other phenotypes, such as progesterone receptor status in breast cancers, were also easily predictable. Quantitative outcomes such as survival time or age of disease onset were much harder: predictions were rarely accurate to within years of their target. Variable reduction was clearly the crucial step; once it was done, many classification and regression algorithms performed similarly well (or poorly).

To proceed further, I would start by examining the selected genes to see whether they are shared across tissues or private to a single tissue. I would also add more tissues, incorporate matched normal tissue, and include additional data types such as copy number variation.

[1] RG Verhaak, KA Hoadley, E Purdom, et al. Integrated genomic analysis identifies clinically relevant subtypes of glioblastoma characterized by abnormalities in PDGFRA, IDH1, EGFR, and NF1. Cancer Cell. 2010;17(1):98-110.
[2] KA Hoadley, C Yau, DM Wolf, et al. Multiplatform analysis of 12 cancer types reveals molecular classification within and across tissues of origin. Cell. 2014;158(4):929-44.
[3] https://cancergenome.nih.gov
[4] J Friedman, T Hastie, R Tibshirani. Regularization Paths for Generalized Linear Models via Coordinate Descent. Journal of Statistical Software. 2010;33(1):1-22.
[5] https://CRAN.R-project.org/package=e1071
[6] WN Venables, BD Ripley. Modern Applied Statistics with S. Fourth Edition. Springer, New York. 2002. ISBN 0-387-95457-0
[7] A Liaw, M Wiener. Classification and Regression by randomForest. R News. 2002;2(3):18-22.

Background

Statistical Approach

Predicting Clinical Attributes

Clinical Outcomes

Fig. 1: Ten human tissues with the indicated number of samples in the Genomic Data Commons

Fig. 2: Representative Pearson correlation heatmap between lung cancers revealing the extent of gene expression correlation

Fig. 3: Distributions of clinical outcomes in breast and kidney tumors

Fig. 4: Visual overview of the procedure applied to each tissue separately

Fig. 6A: ROC plots for two example predictions: left, breast cancer progesterone receptor; right, bladder cancer stage (early vs late)

Feature Selection Across Tissues

Fig. 5A: Principal component analysis before (above) and after (right) LASSO variable reduction for breast tumors colored by histology

Fig. 5B: LASSO regularization path, dashed lines showing estimates for optimal values of lambda by the misclassification rate

Fig. 5C: Fitted LASSO parameters for disease sub-type in all tissues

Tissue Type   Sample Size
Bladder       414
Brain         667
Breast        1102
Kidney        891
Lung          1035
Prostate      495
Skin          103

[Fig. 4 flowchart labels: Normalization; Split 70/30; LASSO; then LASSO, PCR, SVR]

[Fig. 5A panel: PCA scatter plot. x-axis: PC1 (16.9% explained var.); y-axis: PC2 (6.0% explained var.); groups: Infiltrating Ductal Carcinoma, Infiltrating Lobular Carcinoma, NA]

[Fig. 5B plot: Misclassification Error (y-axis, 0.1 to 0.4) against log(Lambda) (x-axis, -5 to -1); top-axis labels run from 155 down to 1]

[Fig. 5A panel: PCA scatter plot. x-axis: PC1 (11.5% explained var.); y-axis: PC2 (4.7% explained var.); groups: Infiltrating Ductal Carcinoma, Infiltrating Lobular Carcinoma, NA]

Tissue     λ        CV Accuracy
Bladder    0.0672   0.9035
Brain      0.0091   0.9310
Breast     0.0235   0.9811
Kidney     0.0234   0.9503
Lung       0.0444   0.9628
Prostate   0.0796   0.9651
Skin       0.1090   0.8961

[Fig. 3 panels: bar charts of sample counts per category]
A. histological_type. Kidney: Kidney Clear Cell Renal Carcinoma, Kidney Papillary Renal Cell Carcinoma, Kidney Chromophobe. Breast: Infiltrating Carcinoma NOS, Infiltrating Ductal Carcinoma, Infiltrating Lobular Carcinoma, Medullary Carcinoma, Metaplastic Carcinoma, Mixed Histology (please specify), Mucinous Carcinoma, Other, specify, NA.
B. stage_event_pathologic_stage. Kidney: Stage I, Stage II, Stage III, Stage IV. Breast: Stage I, Stage IA, Stage IB, Stage II, Stage IIA, Stage IIB, Stage III, Stage IIIA, Stage IIIB, Stage IIIC, Stage IV, Stage X, NA.
C. hemoglobin_result, kidney: Elevated, Low, Normal.
D. breast_carcinoma_progesterone_receptor_status, breast: Indeterminate, Negative, Positive, NA.
E. age_at_initial_pathologic_diagnosis, histograms for kidney and breast (panel titles printed as "stage: kidney" and "stage: breast").

[Fig. 4 flowchart labels, continued: Logistic, LDA, SVM, RF; Validation]

All tissues responded similarly to the LASSO, with very robust performance for classifiers, particularly for disease subtype (Fig. 5C). In a multinomial context, the LASSO generally helps separate the desired groups. However, this did not extend to quantitative responses, which failed to show the normal regularization path (Fig. 5B).
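To illustrate how the LASSO penalty shrinks coefficients to exactly zero as lambda grows, here is a minimal pure-Python sketch of cyclic coordinate descent with soft-thresholding. It is a simplification and not the poster's analysis: the poster reports multinomial fits with the R package glmnet, whereas this sketch assumes a centered continuous response, centered predictors, no intercept, and a toy design matrix invented for the example. The function names `soft_threshold` and `lasso_cd` are hypothetical.

```python
def soft_threshold(z, lam):
    """Shrink z toward zero by lam; values within [-lam, lam] become exactly 0."""
    if z > lam:
        return z - lam
    if z < -lam:
        return z + lam
    return 0.0


def lasso_cd(X, y, lam, n_iter=200):
    """Minimise (1/2n)*||y - X b||^2 + lam*||b||_1 by cyclic coordinate descent.

    X is a list of rows. Assumes y and the columns of X are centered,
    so no intercept is fitted.
    """
    n, p = len(X), len(X[0])
    beta = [0.0] * p
    resid = list(y)  # residual y - X beta, with beta starting at zero
    col_sq = [sum(X[i][j] ** 2 for i in range(n)) / n for j in range(p)]
    for _ in range(n_iter):
        for j in range(p):
            if col_sq[j] == 0.0:
                continue  # constant (all-zero) column carries no signal
            # Average product of column j with the partial residual
            # (the residual with column j's own contribution added back).
            rho = sum(X[i][j] * (resid[i] + X[i][j] * beta[j]) for i in range(n)) / n
            new = soft_threshold(rho, lam) / col_sq[j]
            delta = new - beta[j]
            if delta != 0.0:
                for i in range(n):
                    resid[i] -= X[i][j] * delta
                beta[j] = new
    return beta


# Toy data (invented): three mutually orthogonal "genes", four samples.
# The response depends strongly on gene 0 and only weakly on gene 1.
X = [[1, 1, 1], [-1, 1, -1], [1, -1, -1], [-1, -1, 1]]
y = [2.1, -1.9, 1.9, -2.1]

# Number of non-zero coefficients along a small path of lambda values.
path = {lam: sum(b != 0.0 for b in lasso_cd(X, y, lam)) for lam in (0.05, 0.5, 2.5)}
```

On this toy design the set of selected variables shrinks as lambda increases, which is the behaviour that the top axis of a regularization-path plot typically summarises. Choosing lambda by cross-validated misclassification error, as in Fig. 5B, would sit on top of a fit like this one.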

Despite their heterogeneity, known cancers share one key property: genetic and transcriptomic abnormalities. To study these, the Genomic Data Commons (GDC) has aggregated and standardized tens of thousands of experimental datasets from dozens of human cancers [1-3]. Here, we describe a pipeline for predicting specific clinical features (ranging from blood tests to pathological features and survival outcomes) from all available gene expression data for seven human cancers.

Each tissue consists of a set of samples (Fig. 1), each with ~60,000 expression measurements. Many of these measurements are highly correlated, for biological and experimental reasons (Fig. 2). This presents an immediate problem, as many predictors are almost perfectly collinear. Reducing the large set of genes to a representative set of variables is a crucial first task.
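As a small illustration of the collinearity problem, the sketch below computes Pearson correlations between expression vectors and flags near-collinear gene pairs. The gene names, expression values, and the 0.95 threshold are all invented for the example; the poster does not state how correlated predictors were identified beyond the heatmap in Fig. 2.

```python
import math


def pearson(x, y):
    """Pearson correlation between two equal-length expression vectors."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    dx = [a - mean_x for a in x]
    dy = [b - mean_y for b in y]
    cov = sum(a * b for a, b in zip(dx, dy))
    norm = math.sqrt(sum(a * a for a in dx) * sum(b * b for b in dy))
    # A constant vector has no defined correlation; report 0.0 in that case.
    return cov / norm if norm > 0 else 0.0


def collinear_pairs(genes, threshold=0.95):
    """Return (gene_i, gene_j, r) for every pair with |r| >= threshold."""
    names = sorted(genes)
    pairs = []
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            r = pearson(genes[a], genes[b])
            if abs(r) >= threshold:
                pairs.append((a, b, r))
    return pairs


# Toy data: hypothetical gene names, five samples each.
toy = {
    "geneA": [1.0, 2.0, 3.0, 4.0, 5.0],
    "geneB": [2.1, 4.0, 6.2, 7.9, 10.0],  # almost a multiple of geneA
    "geneC": [5.0, 1.0, 4.0, 2.0, 3.0],
}
flagged = collinear_pairs(toy)
```

With ~60,000 measurements per sample, an all-pairs scan like this is quadratic in the number of genes, which is one practical reason to prefer a penalized selection step over explicit pairwise filtering.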

Some samples in each tissue are annotated with clinical information, such as disease subtype or survival time. These include continuous, binomial, and multinomial response variables (Fig. 3), with some variability between datasets. We therefore focus on subsets of these attributes.

I began by normalizing each dataset for factors such as sequencing depth and variance. Then, I separated each tissue into training and validation sets, using only the training sets for model fitting. Using cross-validation, I obtained very small subsets of variables with non-zero LASSO coefficients for each tissue, and used them to train a variety of models (also by cross-validation within the training set), chosen according to whether the response was continuous or categorical. This was largely done using packages in R, including glmnet, MASS, e1071, and randomForest [4-7].
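The split-then-cross-validate structure described above can be sketched as follows. This is an illustration under assumptions, not the poster's code: the 70/30 proportion comes from the flowchart labels, but stratifying the split by class, the use of five folds, the random seeds, and the toy label counts are all choices made for this example.

```python
import random
from collections import defaultdict


def stratified_split(labels, train_frac=0.7, seed=0):
    """Return (train_idx, valid_idx), splitting each class separately."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for i, lab in enumerate(labels):
        by_class[lab].append(i)
    train, valid = [], []
    for lab in sorted(by_class):
        idx = by_class[lab]
        rng.shuffle(idx)
        cut = round(train_frac * len(idx))
        train.extend(idx[:cut])
        valid.extend(idx[cut:])
    return sorted(train), sorted(valid)


def kfold(indices, k=5, seed=0):
    """Yield (fit_idx, heldout_idx) pairs that partition `indices` into k folds."""
    rng = random.Random(seed)
    shuffled = list(indices)
    rng.shuffle(shuffled)
    folds = [shuffled[f::k] for f in range(k)]
    for f in range(k):
        heldout = folds[f]
        fit = [i for g, fold in enumerate(folds) if g != f for i in fold]
        yield fit, heldout


# Toy labels (invented counts) for two breast histology classes.
labels = ["ductal"] * 70 + ["lobular"] * 30
train_idx, valid_idx = stratified_split(labels)

# Cross-validation folds are drawn from the training indices only,
# so the validation samples never influence variable or model selection.
cv_folds = list(kfold(train_idx, k=5))
```

Keeping `valid_idx` out of every fold is the point of the design: both the LASSO variable selection and the downstream model tuning see only training samples, so validation accuracy remains an honest estimate.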

For all disease subtype attributes (Fig. 5C), validation accuracy was over 0.9 in each tissue. I therefore attempted other, more complex categorical responses; two are shown in Fig. 6A. Progesterone receptor status in breast cancer was very predictable from gene expression, while bladder tumor stage was much more difficult to predict. Across all models built, classifier performance varied substantially, though it was always better than that of regression-based predictions.
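For readers unfamiliar with how ROC plots such as those in Fig. 6A are built, the sketch below computes the false-positive and true-positive rates over all score thresholds and the area under the resulting curve. The scores and labels are invented toy values, and the function names `roc_curve` and `auc` are local helpers defined here, not calls into any library used for the poster.

```python
def roc_curve(scores, labels):
    """Return (fpr, tpr) lists, sweeping the decision threshold from high to low.

    labels use 1 for positive and 0 for negative; both classes must be present.
    """
    pos = sum(labels)
    neg = len(labels) - pos
    ranked = sorted(zip(scores, labels), key=lambda t: -t[0])
    fpr, tpr = [0.0], [0.0]
    tp = fp = 0
    i = 0
    while i < len(ranked):
        s = ranked[i][0]
        # Samples with tied scores cross the threshold together.
        while i < len(ranked) and ranked[i][0] == s:
            if ranked[i][1] == 1:
                tp += 1
            else:
                fp += 1
            i += 1
        fpr.append(fp / neg)
        tpr.append(tp / pos)
    return fpr, tpr


def auc(fpr, tpr):
    """Area under the ROC curve by the trapezoidal rule."""
    return sum(
        (fpr[k] - fpr[k - 1]) * (tpr[k] + tpr[k - 1]) / 2
        for k in range(1, len(fpr))
    )


# Toy classifier scores (invented) for six validation samples.
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3]
labels = [1, 1, 0, 1, 0, 0]
fpr, tpr = roc_curve(scores, labels)
area = auc(fpr, tpr)
```

Here `fpr` and `tpr` correspond to the P(FP) and P(TP) axes of the poster's plots. An easily predicted response would give a curve hugging the top-left corner, while a hard one would lie closer to the diagonal.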

[Fig. 6A plots: two ROC panels. x-axis: P(FP), 0.0 to 1.0; y-axis: P(TP), 0.0 to 1.0; curves in each panel: Multinomial LASSO, Linear Discriminant Analysis, Support Vector Machine (Gaussian Kernel), Random Forests]