An analytic approach for interpretable predictive models in high dimensional data, in the presence of interactions with exposures
Sahir Rai Bhatnagar, PhD Candidate
Joint with Yi Yang, Mathieu Blanchette, Luigi Bouchard, Celia Greenwood
Biostatistics, McGill University
Preprint available at sahirbhatnagar.com
Simple Rule 11:
Simulated Data ≠ Real Data
Motivation
one predictor variable at a time
[Figure: Predictor Variable and Phenotype, with separate tests Test 1 through Test 5]
a network based view
[Figure: Predictor Variable and Phenotype, with a single test (Test 1)]
system level changes due to environment
[Figure: Predictor Variable, Phenotype and Environment (A, B), with a single test (Test 1)]
Motivating Dataset: Newborn epigenetic adaptations to gestational diabetes exposure (Luigi Bouchard, USherbrooke)
Environment: Gestational Diabetes
Large Data: Child’s epigenome (p ≈ 450k)
Phenotype: Obesity measures
Differential Correlation between environments
[Figure: (a) Gestational diabetes affected pregnancy; (b) Controls]
NIH MRI brain study
Environment: Age
Large Data: Cortical Thickness (p ≈ 80k)
Phenotype: Intelligence
Goals of this study
Objective
(i) Can clustering that incorporates known covariate or exposure information improve prediction models?
(ii) Can the resulting clusters provide an easier route to interpretation?
Methods
ECLUST - our proposed method: 2 steps
Original Data (split by exposure: E = 0, E = 1)
1a) Gene Similarity
1b) Cluster Representation (n × 1 vector per cluster)
2) Penalized Regression: Y (n × 1) ∼ cluster representatives + cluster representatives × E
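Read as a formula, step 2 regresses the response on the cluster representatives and on their interactions with the exposure. One possible way to write this is sketched here, assuming K clusters. The symbols Z_k (representative of cluster k), \beta, \alpha and \varepsilon do not appear on the slides and are introduced only for illustration, and the main effect of E is an added assumption.

```latex
Y_{n \times 1} = \beta_0 + \sum_{k=1}^{K} \beta_k Z_k + \beta_E E
               + \sum_{k=1}^{K} \alpha_k \,(Z_k \circ E) + \varepsilon ,
\qquad Z_k \in \mathbb{R}^{n}, \quad E \in \{0, 1\}^{n}
```

Here Z_k \circ E is the element-wise product, so each \alpha_k captures how the effect of cluster k differs between E = 0 and E = 1.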
the objective of statistical methods is the reduction of data. A quantity of data . . . is to be replaced by relatively few quantities which shall adequately represent . . . the relevant information contained in the original data.
- Sir R. A. Fisher, 1922
Step 1a: Method to detect gene clusters
(i) Hierarchical clustering (average linkage) with TOM [1] scoring dissimilarity [2]:
|TOM_{E=1} − TOM_{E=0}|
(ii) Number of clusters chosen using the dynamicTreeCut algorithm [3]
[1] Ravasz et al., Science (2002)
[2] Klein Oros et al., Frontiers in Genetics (2016)
[3] Langfelder and Zhang, Bioinformatics (2008)
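A minimal Python sketch of step 1a follows; it is an illustration, not the R implementation mentioned later in the talk. Several choices are assumptions that the slides do not specify: the adjacency is |correlation| raised to `power`, rows of the difference matrix are compared with Euclidean distance to obtain a distance between features, and a fixed number of clusters (`fcluster` with `maxclust`) stands in for dynamicTreeCut. The function names and the toy data are hypothetical.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import pdist


def tom_similarity(x, power=6):
    """Topological overlap matrix of an (n samples x p features) array.

    Assumption: adjacency a_ij = |cor(x_i, x_j)| ** power.
    """
    adj = np.abs(np.corrcoef(x, rowvar=False)) ** power
    np.fill_diagonal(adj, 0.0)
    shared = adj @ adj                      # l_ij: strength through shared neighbours
    degree = adj.sum(axis=1)                # k_i: total connectivity of feature i
    denom = np.minimum.outer(degree, degree) + 1.0 - adj
    tom = (shared + adj) / denom
    np.fill_diagonal(tom, 1.0)
    return tom


def eclust_step_1a(x, e, n_clusters, power=6):
    """Cluster features on |TOM(E=1) - TOM(E=0)| with average linkage."""
    diff = np.abs(tom_similarity(x[e == 1], power) - tom_similarity(x[e == 0], power))
    # Assumption: features are compared through their rows of the difference matrix.
    tree = linkage(pdist(diff, metric="euclidean"), method="average")
    # Stand-in for dynamicTreeCut: cut the tree into a fixed number of clusters.
    labels = fcluster(tree, t=n_clusters, criterion="maxclust")
    return diff, labels


# Toy data: features 0-7 are correlated only among exposed subjects (E = 1).
rng = np.random.default_rng(0)
n, p = 400, 20
e = np.repeat([0, 1], n // 2)
x = rng.normal(size=(n, p))
latent = rng.normal(size=n)
x[e == 1, :8] += 3.0 * latent[e == 1][:, None]

diff, labels = eclust_step_1a(x, e, n_clusters=2)
print(labels)
```

In this toy example the eight exposure-sensitive features should end up in one cluster and the remaining features in the other. A TOM computed on all subjects pooled would blur the contrast that the difference matrix isolates.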
Step 1b: Cluster Representation
(i) Average [4]
(ii) 1st Principal Component [5]
[4] Hastie et al., Genome Biology (2001); Park et al., Biostatistics (2007)
[5] Kendall, A Course in Multivariate Analysis (1957)
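Both cluster summaries reduce each cluster to a single n × 1 vector. A small sketch of the two options follows, assuming a data matrix `x` with one column per feature and an integer cluster label per column. The function names and the toy data are hypothetical, and the sign of a principal component is arbitrary.

```python
import numpy as np


def cluster_average(x, labels):
    """One column per cluster: the mean of that cluster's features for each subject."""
    return np.column_stack([x[:, labels == k].mean(axis=1) for k in np.unique(labels)])


def cluster_first_pc(x, labels):
    """One column per cluster: subject scores on the cluster's first principal component."""
    reps = []
    for k in np.unique(labels):
        block = x[:, labels == k]
        centered = block - block.mean(axis=0)
        u, s, _ = np.linalg.svd(centered, full_matrices=False)
        reps.append(u[:, 0] * s[0])          # scores = centered data projected on PC 1
    return np.column_stack(reps)


# Toy data: 50 subjects, 5 features, two clusters of sizes 3 and 2.
rng = np.random.default_rng(1)
x = rng.normal(size=(50, 5))
labels = np.array([1, 1, 1, 2, 2])

avg = cluster_average(x, labels)     # shape (50, 2)
pc1 = cluster_first_pc(x, labels)    # shape (50, 2)
print(avg.shape, pc1.shape)
```

The average weights every feature in a cluster equally, while the first principal component weights features by how much they contribute to the cluster's dominant direction of variation.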
Step 2: Variable Selection
(i) Linear effects: Lasso, Elastic Net [6]
(ii) Non-linear effects: MARS [7]
[6] Tibshirani, JRSSB (1996); Zou and Hastie, JRSSB (2005)
[7] Friedman, Annals of Statistics (1991)
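For the linear case, step 2 can be sketched as a lasso fit on the cluster representatives, the exposure, and their products. The code below is a bare-bones coordinate-descent lasso written in NumPy for illustration only; it is not the software used in the talk, it does not cover the Elastic Net or MARS, and the fixed penalty `lam`, the function names and the simulated data are assumptions.

```python
import numpy as np


def interaction_design(reps, e):
    """Columns: cluster representatives, E, then representatives x E."""
    e = np.asarray(e, dtype=float)
    return np.column_stack([reps, e, reps * e[:, None]])


def soft_threshold(z, lam):
    return np.sign(z) * max(abs(z) - lam, 0.0)


def lasso_cd(x, y, lam, n_iter=200):
    """Minimise (1/2n) * ||y - b0 - x b||^2 + lam * ||b||_1 by coordinate descent."""
    n, p = x.shape
    x_mean, y_mean = x.mean(axis=0), y.mean()
    xc, yc = x - x_mean, y - y_mean
    col_ss = (xc ** 2).sum(axis=0) / n
    beta = np.zeros(p)
    resid = yc.copy()
    for _ in range(n_iter):
        for j in range(p):
            if col_ss[j] == 0.0:
                continue
            resid += xc[:, j] * beta[j]      # put back column j's current contribution
            rho = xc[:, j] @ resid / n
            beta[j] = soft_threshold(rho, lam) / col_ss[j]
            resid -= xc[:, j] * beta[j]
    return y_mean - x_mean @ beta, beta


# Toy data: 3 cluster representatives; only cluster 0 matters, and only when E = 1.
rng = np.random.default_rng(2)
n, k = 300, 3
reps = rng.normal(size=(n, k))
e = rng.integers(0, 2, size=n)
y = 2.0 * reps[:, 0] * e + 0.5 * rng.normal(size=n)

design = interaction_design(reps, e)   # columns: Z0, Z1, Z2, E, Z0*E, Z1*E, Z2*E
intercept, beta = lasso_cd(design, y, lam=0.1)
print(np.round(beta, 2))
```

In this simulation the largest coefficient should be the one on the Z0 × E column (index 4), which is the kind of exposure-dependent effect the method is meant to surface. In practice the penalty would be chosen by cross-validation rather than fixed.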
Simulation Study
Simulated TOM by Exposure Status
[Figure: (a) TOM(X_{E=1}); (b) TOM(X_{E=0})]
Difference of TOMs
[Figure: (a) |TOM(X_{E=1}) − TOM(X_{E=0})|]
TOM based on all subjects
[Figure: (a) TOM(X_{all})]
Real Data Analysis
Gestational Diabetes: Prediction Performance
Gestational Diabetes: Interpretation of Clusters with IPA
• Canonical Pathways: 1,25-dihydroxyvitamin D3 Biosynthesis – vitamin D associated with obesity
• Diseases and Disorders: Hepatic System Disease – metabolism of glucose and lipids
• Physiological System Development and Function:
  (i) Behavior and neurodevelopment – associated with obesity
  (ii) Embryonic and organ development – GD associated with macrosomia
NIHPD: Age
NIHPD: Income
Final Remarks
Discussion and Contributions
• Large system-wide changes are observed in many environments (DNA methylation, cortical thickness, gene expression)
• Environment dependent clustering can improve prediction performance in high dimensional settings (n << p)
• Clusters can be interpreted but require much more expert knowledge
• Leverages existing computationally fast algorithms and can run on a laptop computer (p ≈ 10k)
• Software implementation in R: sahirbhatnagar.com
Limitations
• There must be a high-dimensional signature of the exposure
• Covariance estimation
• Currently limited to binary environment
• Interpretation can be difficult
Acknowledgements
• Dr. Celia Greenwood
• Dr. Blanchette and Dr. Yang
• Dr. Luigi Bouchard, André Anne Houde
• Dr. Steele, Dr. Kramer, Dr. Abrahamowicz
• Maxime Turgeon, Kevin McGregor, Lauren Mokry, Dr. Forest
• Greg Voisin, Dr. Forgetta, Dr. Klein
• Mothers and children from the study