An analytic approach for interpretable predictive models in high dimensional data, in the presence...

An analytic approach for interpretable predictive models in high dimensional data, in the presence of interactions with exposures Sahir Rai Bhatnagar, PhD Candidate Joint with Yi Yang, Mathieu Blanchette, Luigi Bouchard, Celia Greenwood Biostatistics, McGill University preprint available at sahirbhatnagar.com

Page 1: An analytic approach for interpretable predictive models in high dimensional data, in the presence of interactions with exposures

An analytic approach for interpretable predictive models in high dimensional data, in the presence of interactions with exposures

Sahir Rai Bhatnagar, PhD Candidate
Joint with Yi Yang, Mathieu Blanchette, Luigi Bouchard, Celia Greenwood

Biostatistics, McGill University
preprint available at sahirbhatnagar.com

Page 2: An analytic approach for interpretable predictive models in high dimensional data, in the presence of interactions with exposures

Simple Rule 11:

Simulated Data ≠ Real Data

0/21

Page 4: An analytic approach for interpretable predictive models in high dimensional data, in the presence of interactions with exposures

Motivation

Page 5: An analytic approach for interpretable predictive models in high dimensional data, in the presence of interactions with exposures

one predictor variable at a time

[Figure labels: Predictor Variable, Phenotype; Test 1, Test 2, Test 3, Test 4, Test 5]

1/21

Page 7: An analytic approach for interpretable predictive models in high dimensional data, in the presence of interactions with exposures

a network based view

[Figure labels: Predictor Variable, Phenotype; Test 1]

2/21

Page 10: An analytic approach for interpretable predictive models in high dimensional data, in the presence of interactions with exposures

system level changes due to environment

[Figure labels: Predictor Variable, Phenotype, Environment (A, B); Test 1]

3/21

Page 12: An analytic approach for interpretable predictive models in high dimensional data, in the presence of interactions with exposures

Motivating Dataset: Newborn epigenetic adaptations to gestational diabetes exposure (Luigi Bouchard, USherbrooke)

Environment: Gestational Diabetes

Large Data: Child’s epigenome (p ≈ 450k)

Phenotype: Obesity measures

4/21

Page 13: An analytic approach for interpretable predictive models in high dimensional data, in the presence of interactions with exposures

Differential Correlation between environments

(a) Gestational diabetes affected pregnancy (b) Controls
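One way to make the idea of differential correlation concrete is to compute a feature-by-feature correlation matrix within each environment and subtract the two. The sketch below is illustrative only: the function name, the simulated toy data, and the choice of Pearson correlation are assumptions and are not taken from the slides.

```python
import numpy as np

def differential_correlation(X, E):
    """Return per-environment correlation matrices and their difference.

    X: (n, p) array of features; E: length-n array of 0/1 exposure labels.
    Pearson correlation is an assumption made for this sketch.
    """
    E = np.asarray(E)
    corr_exposed = np.corrcoef(X[E == 1], rowvar=False)
    corr_control = np.corrcoef(X[E == 0], rowvar=False)
    return corr_exposed, corr_control, corr_exposed - corr_control

# Hypothetical toy data: features 0 and 1 are correlated only when E = 1.
rng = np.random.default_rng(0)
n, p = 400, 4
E = np.repeat([0, 1], n // 2)
X = rng.normal(size=(n, p))
shared = rng.normal(size=n // 2)
X[E == 1, 0] = shared + 0.1 * rng.normal(size=n // 2)
X[E == 1, 1] = shared + 0.1 * rng.normal(size=n // 2)

corr_exposed, corr_control, diff = differential_correlation(X, E)
print(np.round(diff, 2))
```

In this toy setting the entry of `diff` for features 0 and 1 is large, while entries for features that behave the same in both environments stay near zero.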

5/21

Page 14: An analytic approach for interpretable predictive models in high dimensional data, in the presence of interactions with exposures

NIH MRI brain study

Environment: Age

Large Data: Cortical Thickness (p ≈ 80k)

Phenotype: Intelligence

6/21

Page 15: An analytic approach for interpretable predictive models in high dimensional data, in the presence of interactions with exposures

Goals of this study

Objective

(i) Can clustering that incorporates known covariate or exposure information improve prediction models?

(ii) Can the resulting clusters provide an easier route to interpretation?

7/21

Page 17: An analytic approach for interpretable predictive models in high dimensional data, in the presence of interactions with exposures

Methods

Page 18: An analytic approach for interpretable predictive models in high dimensional data, in the presence of interactions with exposures

ECLUST - our proposed method: 2 steps

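Original Data, split by exposure: E = 0 and E = 1

1a) Gene Similarity

1b) Cluster Representation (each cluster summarized as an n × 1 vector)

2) Penalized Regression: Y_{n×1} ∼ (cluster representations) + (cluster representations) × E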

8/21

Page 24: An analytic approach for interpretable predictive models in high dimensional data, in the presence of interactions with exposures

the objective of statistical methods is the reduction of data. A quantity of data . . . is to be replaced by relatively few quantities which shall adequately represent . . . the relevant information contained in the original data.

- Sir R. A. Fisher, 1922

8/21

Page 25: An analytic approach for interpretable predictive models in high dimensional data, in the presence of interactions with exposures

Step 1a: Method to detect gene clusters

(i) Hierarchical clustering (average linkage) with TOM¹ scoring dissimilarity²:

|TOM_{E=1} − TOM_{E=0}|

(ii) Number of clusters chosen using dynamicTreeCut algorithm³

[Figure: ECLUST workflow up to step 1a) Gene Similarity]

¹ Ravasz et al., Science (2002)
² Klein Oros et al., Frontiers in Genetics (2016)
³ Langfelder and Zhang, Bioinformatics (2008)
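Step 1a can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' R implementation: the adjacency is taken to be |Pearson correlation| raised to a power (the slides do not specify it), the difference matrix is used directly as the dissimilarity as the slide states, and a fixed number of clusters stands in for the dynamicTreeCut algorithm. All function names and the toy data are hypothetical.

```python
import numpy as np

def tom_similarity(X, power=6):
    """Topological overlap matrix (TOM) for the columns of X (n x p).

    Assumption: adjacency = |Pearson correlation| ** power.
    """
    adj = np.abs(np.corrcoef(X, rowvar=False)) ** power
    np.fill_diagonal(adj, 0.0)
    shared = adj @ adj                      # l_ij = sum_u a_iu * a_uj
    k = adj.sum(axis=1)                     # connectivity of each feature
    tom = (shared + adj) / (np.minimum.outer(k, k) + 1.0 - adj)
    np.fill_diagonal(tom, 1.0)
    return tom

def tom_difference(X, E, power=6):
    """|TOM_{E=1} - TOM_{E=0}|, each TOM computed within one exposure group."""
    E = np.asarray(E)
    return np.abs(tom_similarity(X[E == 1], power) - tom_similarity(X[E == 0], power))

def average_linkage(D, n_clusters):
    """Naive agglomerative clustering with average linkage on dissimilarity D.

    Stops at a fixed number of clusters (a stand-in for dynamicTreeCut).
    """
    clusters = [[i] for i in range(D.shape[0])]
    while len(clusters) > n_clusters:
        best = (np.inf, 0, 1)
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = D[np.ix_(clusters[a], clusters[b])].mean()
                if d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    labels = np.empty(D.shape[0], dtype=int)
    for label, members in enumerate(clusters):
        labels[members] = label
    return labels

# Hypothetical toy data: features 0-2 share a signal only when E = 1.
rng = np.random.default_rng(1)
n, p = 200, 6
E = np.repeat([0, 1], n // 2)
X = rng.normal(size=(n, p))
X[E == 1, :3] += 2.0 * rng.normal(size=(n // 2, 1))

D = tom_difference(X, E)
labels = average_linkage(D, n_clusters=2)
print(np.round(D, 2))
print(labels)
```

The naive clustering loop is cubic in the number of features and is only meant for small examples; an optimized hierarchical clustering routine would be needed at the scale discussed in the talk.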

9/21

Page 26: An analytic approach for interpretable predictive models in high dimensional data, in the presence of interactions with exposures

Step 1b: Cluster Representation

(i) Average⁴

(ii) 1st Principal Component⁵

[Figure: ECLUST workflow up to step 1b) Cluster Representation]

⁴ Hastie et al., Genome Biology (2001); Park et al., Biostatistics (2007)
⁵ Kendall, A Course in Multivariate Analysis (1957)
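Both cluster summaries reduce each cluster of features to a single n × 1 vector. The following is a minimal sketch, assuming features are the columns of `X` and `labels` gives each column's cluster; the function names and toy data are hypothetical.

```python
import numpy as np

def cluster_average(X, labels):
    """One column per cluster: the mean of that cluster's features."""
    labels = np.asarray(labels)
    return np.column_stack(
        [X[:, labels == c].mean(axis=1) for c in np.unique(labels)]
    )

def cluster_first_pc(X, labels):
    """One column per cluster: scores on that cluster's 1st principal component.

    The sign of each column is arbitrary, as with any principal component.
    """
    labels = np.asarray(labels)
    scores = []
    for c in np.unique(labels):
        block = X[:, labels == c]
        centered = block - block.mean(axis=0)
        U, s, _ = np.linalg.svd(centered, full_matrices=False)
        scores.append(U[:, 0] * s[0])       # PC1 scores = u_1 * sigma_1
    return np.column_stack(scores)

# Hypothetical toy data: 5 features in 2 clusters.
rng = np.random.default_rng(2)
X = rng.normal(size=(50, 5))
labels = [0, 0, 0, 1, 1]

avg = cluster_average(X, labels)
pcs = cluster_first_pc(X, labels)
print(avg.shape, pcs.shape)
```

The average keeps the original measurement scale, while the first principal component captures the direction of greatest variance within the cluster.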

10/21

Page 27: An analytic approach for interpretable predictive models in high dimensional data, in the presence of interactions with exposures

Step 2: Variable Selection

(i) Linear effects: Lasso, Elastic Net⁶

(ii) Non-linear effects: MARS⁷

[Figure: ECLUST workflow up to step 2) Penalized Regression]

⁶ Tibshirani, JRSSB (1996); Zou and Hastie, JRSSB (2005)
⁷ Friedman, Annals of Statistics (1991)
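For the linear case, Step 2 amounts to fitting a lasso on the cluster summaries and their products with E. The sketch below covers only the lasso (not Elastic Net or MARS), using a plain coordinate descent solver written for illustration. The inclusion of a main effect for E, the fixed penalty `lam`, the function names, and the simulated data are all assumptions.

```python
import numpy as np

def interaction_design(Z, E):
    """Columns: cluster summaries Z, exposure E (assumed), and products Z * E."""
    E = np.asarray(E, dtype=float).reshape(-1, 1)
    return np.hstack([Z, E, Z * E])

def soft_threshold(z, t):
    return np.sign(z) * max(abs(z) - t, 0.0)

def lasso_cd(X, y, lam, n_iter=200):
    """Lasso by cyclic coordinate descent.

    Minimizes (1 / 2n) * ||y - b0 - X b||^2 + lam * ||b||_1.
    """
    n, p = X.shape
    x_mean, y_mean = X.mean(axis=0), y.mean()
    Xc, yc = X - x_mean, y - y_mean
    col_sq = (Xc ** 2).sum(axis=0) / n
    beta = np.zeros(p)
    resid = yc.copy()                       # residual for the current beta
    for _ in range(n_iter):
        for j in range(p):
            if col_sq[j] == 0.0:
                continue
            rho = Xc[:, j] @ resid / n + col_sq[j] * beta[j]
            new = soft_threshold(rho, lam) / col_sq[j]
            resid += Xc[:, j] * (beta[j] - new)
            beta[j] = new
    intercept = y_mean - x_mean @ beta
    return intercept, beta

# Hypothetical toy data: only the interaction of cluster 0 with E drives Y.
rng = np.random.default_rng(3)
n = 200
Z = rng.normal(size=(n, 3))                 # three cluster summaries
E = rng.integers(0, 2, size=n)
y = 2.0 * Z[:, 0] * E + 0.1 * rng.normal(size=n)

X = interaction_design(Z, E)                # columns 0-2: Z, 3: E, 4-6: Z * E
intercept, beta = lasso_cd(X, y, lam=0.05)
print(np.round(beta, 2))
```

In practice the penalty would typically be chosen by cross-validation rather than fixed in advance.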

11/21

Page 28: An analytic approach for interpretable predictive models in high dimensional data, in the presence of interactions with exposures

Simulation Study

Page 29: An analytic approach for interpretable predictive models in high dimensional data, in the presence of interactions with exposures

Simulated TOM by Exposure Status

(a) TOM(X_{E=1}) (b) TOM(X_{E=0})

12/21

Page 30: An analytic approach for interpretable predictive models in high dimensional data, in the presence of interactions with exposures

Difference of TOMs

(a) |TOM(X_{E=1}) − TOM(X_{E=0})|

13/21

Page 31: An analytic approach for interpretable predictive models in high dimensional data, in the presence of interactions with exposures

TOM based on all subjects

(a) TOM(X_{all})

14/21

Page 32: An analytic approach for interpretable predictive models in high dimensional data, in the presence of interactions with exposures

Real Data Analysis

Page 33: An analytic approach for interpretable predictive models in high dimensional data, in the presence of interactions with exposures

Gestational Diabetes: Prediction Performance

15/21

Page 34: An analytic approach for interpretable predictive models in high dimensional data, in the presence of interactions with exposures

Gestational Diabetes: Interpretation of Clusters with IPA

• Canonical Pathways: 1,25-dihydroxyvitamin D3 Biosynthesis – vitamin D associated with obesity

• Diseases and Disorders: Hepatic System Disease – metabolism of glucose and lipids

• Physiological System Development and Function:
  (i) Behavior and neurodevelopment – associated with obesity
  (ii) Embryonic and organ development – GD associated with macrosomia

16/21

Page 37: An analytic approach for interpretable predictive models in high dimensional data, in the presence of interactions with exposures

NIHPD: Age

17/21

Page 38: An analytic approach for interpretable predictive models in high dimensional data, in the presence of interactions with exposures

NIHPD: Income

18/21

Page 39: An analytic approach for interpretable predictive models in high dimensional data, in the presence of interactions with exposures

Final Remarks

Page 40: An analytic approach for interpretable predictive models in high dimensional data, in the presence of interactions with exposures

Discussion and Contributions

• Large system-wide changes are observed in many environments (DNA methylation, cortical thickness, gene expression)

• Environment dependent clustering can improve prediction performance in high dimensional settings (n << p)

• Clusters can be interpreted but require much more expert knowledge

• Leverages existing computationally fast algorithms and can run on a laptop computer (p ≈ 10k)

• Software implementation in R: sahirbhatnagar.com

19/21

Page 45: An analytic approach for interpretable predictive models in high dimensional data, in the presence of interactions with exposures

Limitations

• There must be a high-dimensional signature of the exposure

• Covariance estimation

• Currently limited to binary environment

• Interpretation can be difficult

20/21

Page 49: An analytic approach for interpretable predictive models in high dimensional data, in the presence of interactions with exposures

Acknowledgements

• Dr. Celia Greenwood

• Dr. Blanchette and Dr. Yang

• Dr. Luigi Bouchard, André Anne Houde

• Dr. Steele, Dr. Kramer, Dr. Abrahamowicz

• Maxime Turgeon, Kevin McGregor, Lauren Mokry, Dr. Forest

• Greg Voisin, Dr. Forgetta, Dr. Klein

• Mothers and children from the study

21/21