An analytic approach for interpretable predictive models in high dimensional data, in the presence...

An analytic approach for interpretable predictive models in high dimensional data, in the presence of interactions with exposures

Sahir Rai Bhatnagar, PhD Candidate
Joint with Yi Yang, Mathieu Blanchette, Luigi Bouchard, Celia Greenwood

Biostatistics, McGill University
preprint available at sahirbhatnagar.com

Simple Rule 11:

Simulated Data ≠ Real Data

Motivation

one predictor variable at a time

[Diagram: Predictor Variable vs. Phenotype, tested separately (Test 1, Test 2, Test 3, Test 4)]

a network based view

[Diagram: Test 1]

system level changes due to environment

[Diagram: Predictor Variable, Phenotype, Environment (Test 1)]

Motivating Dataset: Newborn epigenetic adaptations to gestational diabetes exposure (Luigi Bouchard, USherbrooke)

Environment: Gestational Diabetes

Large Data: Child’s epigenome (p ≈ 450k)

Phenotype: Obesity measures

Differential Correlation between environments

[Figure: (a) Gestational diabetes affected pregnancy; (b) Controls]

NIH MRI brain study

Environment: Age

Large Data: Cortical Thickness (p ≈ 80k)

Phenotype: Intelligence

Goals of this study

Objective

(i) Whether clustering that incorporates known covariate or exposure information can improve prediction models

(ii) Whether the resulting clusters can provide an easier route to interpretation

Methods

ECLUST - our proposed method: 2 steps

[Diagram: Original Data → 1a) Gene Similarity → 1b) Cluster Representation (one n × 1 summary per cluster) → 2) Penalized Regression]

Y (n × 1) ~ cluster representatives + cluster representatives × E

the objective of statistical methods is the reduction of data. A quantity of data . . . is to be replaced by relatively few quantities which shall adequately represent . . . the relevant information contained in the original data.

- Sir R. A. Fisher, 1922

Step 1a: Method to detect gene clusters

(i) Hierarchical clustering (average linkage) with TOM [1] scoring dissimilarity [2]:

|TOM_{E=1} − TOM_{E=0}|

(ii) Number of clusters chosen using the dynamicTreeCut algorithm [3]

[Diagram: Original Data → 1a) Gene Similarity]

[1] Ravasz et al., Science (2002)
[2] Klein Oros et al., Frontiers in Genetics (2016)
[3] Langfelder and Zhang, Bioinformatics (2008)
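A minimal sketch of the exposure-dependent score in Step 1a, under stated assumptions. The slide does not give the TOM formula, so the code uses the common unsigned weighted form (soft-thresholded absolute correlations, power `beta = 6`); the function names, the power, and the toy data are all illustrative and not taken from the talk or its software. How the resulting matrix is passed to the clustering routine is not specified on the slide and is left out here.

```python
import math

def pearson(x, y):
    """Pearson correlation of two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def tom(variables, beta=6):
    """Unsigned TOM for a list of variables (each a list of observations).

    Assumed form: TOM_ij = (sum_u a_iu * a_uj + a_ij) / (min(k_i, k_j) + 1 - a_ij).
    """
    p = len(variables)
    # Soft-thresholded adjacency a_ij = |cor_ij|^beta, with a zero diagonal.
    a = [[0.0] * p for _ in range(p)]
    for i in range(p):
        for j in range(i + 1, p):
            a[i][j] = a[j][i] = abs(pearson(variables[i], variables[j])) ** beta
    k = [sum(row) for row in a]        # connectivity of each variable
    t = [[1.0] * p for _ in range(p)]  # diagonal set to 1 by convention
    for i in range(p):
        for j in range(i + 1, p):
            # The zero diagonal means terms with u = i or u = j contribute nothing.
            shared = sum(a[i][u] * a[u][j] for u in range(p))
            t[i][j] = t[j][i] = (shared + a[i][j]) / (min(k[i], k[j]) + 1.0 - a[i][j])
    return t

def tom_difference(exposed, unexposed, beta=6):
    """Entrywise |TOM_{E=1} - TOM_{E=0}| for the same variables in two groups."""
    t1, t0 = tom(exposed, beta), tom(unexposed, beta)
    p = len(t1)
    return [[abs(t1[i][j] - t0[i][j]) for j in range(p)] for i in range(p)]

# Toy data: three variables, four subjects per exposure group.
c1, c2, c3 = [1, 1, -1, -1], [1, -1, 1, -1], [1, -1, -1, 1]
exposed = [c1, c1, c3]    # variables 0 and 1 move together when E = 1
unexposed = [c1, c2, c3]  # all three are uncorrelated when E = 0
diff = tom_difference(exposed, unexposed)
print(diff)  # [[0.0, 1.0, 0.0], [1.0, 0.0, 0.0], [0.0, 0.0, 0.0]]
```

Only the pair (0, 1) changes its topological overlap between the two groups, so it is the only non-zero entry. In practice the number of clusters would come from a tree-cutting rule such as dynamicTreeCut, which this sketch does not attempt to reproduce.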

Step 1b: Cluster Representation

(i) Average [4]

(ii) 1st Principal Component [5]

[Diagram: Original Data → 1a) Gene Similarity → 1b) Cluster Representation (one n × 1 summary per cluster)]

[4] Hastie et al., Genome Biology (2001), Park et al., Biostatistics (2007)
[5] Kendall, A Course in Multivariate Analysis (1957)
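The two cluster summaries in Step 1b can be sketched as follows. This is an illustration only: the function names and the toy cluster are hypothetical, and the first principal component is obtained by plain power iteration on the sample covariance matrix rather than by whatever routine the R implementation uses. It assumes the cluster has non-zero variance.

```python
import math

def cluster_average(columns):
    """Per-subject mean across the variables of one cluster (an n x 1 summary)."""
    n = len(columns[0])
    return [sum(col[i] for col in columns) / len(columns) for i in range(n)]

def first_pc_scores(columns, iters=200):
    """Per-subject scores on the first principal component of one cluster."""
    n, p = len(columns[0]), len(columns)
    means = [sum(col) / n for col in columns]
    centered = [[v - m for v in col] for col, m in zip(columns, means)]
    # Sample covariance matrix of the cluster's variables (p x p).
    cov = [[sum(a * b for a, b in zip(ci, cj)) / (n - 1) for cj in centered]
           for ci in centered]
    # Power iteration from a positive start vector converges to the leading
    # eigenvector; its overall sign is arbitrary, as with any PCA.
    v = [1.0 / math.sqrt(p)] * p
    for _ in range(iters):
        w = [sum(cov[i][j] * v[j] for j in range(p)) for i in range(p)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    return [sum(centered[j][i] * v[j] for j in range(p)) for i in range(n)]

# Toy cluster: two variables measured on four subjects, the second is 2x the first.
cluster = [[1.0, 2.0, 3.0, 4.0], [2.0, 4.0, 6.0, 8.0]]
avg = cluster_average(cluster)   # [1.5, 3.0, 4.5, 6.0]
pc1 = first_pc_scores(cluster)   # proportional to the centered first variable
```

Either summary reduces a cluster of many variables to a single n × 1 column, which is what Step 2 takes as input.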

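Step 2: Variable Selection

(i) Linear effects: Lasso, Elastic Net [6]

(ii) Non-linear effects: MARS [7]

[Diagram: Original Data → 1a) Gene Similarity → 1b) Cluster Representation (one n × 1 summary per cluster) → 2) Penalized Regression]

Y (n × 1) ~ cluster representatives + cluster representatives × E

[6] Tibshirani, JRSSB (1996), Zou and Hastie, JRSSB (2005)
[7] Friedman, Annals of Statistics (1991)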
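As a rough illustration of Step 2 for the linear case, the sketch below builds a design matrix from cluster representatives, the exposure E, and their products, then fits a lasso by cyclic coordinate descent. Several things here are assumptions rather than the talk's method: including a main effect for E, the helper names, the penalty value, and the toy data. A real analysis would use an established lasso or elastic net implementation with cross-validated tuning.

```python
def soft_threshold(z, lam):
    """Lasso shrinkage operator S(z, lam)."""
    if z > lam:
        return z - lam
    if z < -lam:
        return z + lam
    return 0.0

def center(v):
    m = sum(v) / len(v)
    return [a - m for a in v]

def interaction_design(reps, exposure):
    """Columns: each cluster representative, then E, then each representative x E."""
    main = [list(r) for r in reps]
    inter = [[a * e for a, e in zip(r, exposure)] for r in reps]
    return main + [list(exposure)] + inter

def lasso_cd(columns, y, lam, sweeps=500):
    """Minimise (1/2n)||y - Xb||^2 + lam * ||b||_1 with an unpenalised intercept.

    The intercept is handled by centering y and every column.
    """
    n = len(y)
    cols = [center(c) for c in columns]
    resid = center(y)  # residual when all coefficients are zero
    beta = [0.0] * len(cols)
    for _ in range(sweeps):
        for j, x in enumerate(cols):
            xx = sum(a * a for a in x) / n
            if xx == 0.0:
                continue  # constant column carries no information
            # Inner product of x_j with the partial residual (x_j's own fit added back).
            z = sum(a * r for a, r in zip(x, resid)) / n + xx * beta[j]
            new = soft_threshold(z, lam) / xx
            delta = new - beta[j]
            if delta != 0.0:
                resid = [r - delta * a for r, a in zip(resid, x)]
            beta[j] = new
    return beta

# Toy data: one cluster representative, binary exposure, pure interaction effect.
rep = [1, -1, 1, -1, 1, -1, 1, -1]
E = [0, 0, 0, 0, 1, 1, 1, 1]
y = [1 + 2 * r * e for r, e in zip(rep, E)]
X = interaction_design([rep], E)
beta = lasso_cd(X, y, lam=0.1)  # order: [rep, E, rep x E]
```

On this toy data the fit converges to roughly 0 for the representative and for E, and 1.8 for the product term: the interaction is retained and shrunk from its true value of 2 by the penalty. With a very large penalty every coefficient is exactly zero.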

Simulation Study

Simulated TOM by Exposure Status

[Figure: (a) TOM(X_{E=1}); (b) TOM(X_{E=0})]

Difference of TOMs

[Figure: |TOM(X_{E=1}) − TOM(X_{E=0})|]

TOM based on all subjects

[Figure: TOM(X_all)]

Real Data Analysis

Gestational Diabetes: Prediction Performance

Gestational Diabetes: Interpretation of Clusters with IPA

• Canonical Pathways: 1,25-dihydroxyvitamin D3 Biosynthesis – vitamin D associated with obesity

• Diseases and Disorders: Hepatic System Disease – metabolism of glucose and lipids

• Physiological System Development and Function:
  (i) Behavior and neurodevelopment – associated with obesity
  (ii) Embryonic and organ development – GD associated with macrosomia

NIHPD: Age

NIHPD: Income

Final Remarks

Discussion and Contributions

• Large system-wide changes are observed in many environments (DNA methylation, cortical thickness, gene expression)

• Environment dependent clustering can improve prediction performance in high dimensional settings (n << p)

• Clusters can be interpreted but require much more expert knowledge

• Leverages existing computationally fast algorithms and can run on a laptop computer (p ≈ 10k)

• Software implementation in R: sahirbhatnagar.com

Limitations

• There must be a high-dimensional signature of the exposure

• Covariance estimation

• Currently limited to binary environment

• Interpretation can be difficult

Acknowledgements

• Dr. Celia Greenwood

• Dr. Blanchette and Dr. Yang

• Dr. Luigi Bouchard, André Anne Houde

• Dr. Steele, Dr. Kramer, Dr. Abrahamowicz

• Maxime Turgeon, Kevin McGregor, Lauren Mokry, Dr. Forest

• Greg Voisin, Dr. Forgetta, Dr. Klein

• Mothers and children from the study
