Post on 12-Apr-2017

An analytic approach for interpretable predictive models in high dimensional data, in the presence of interactions with exposures

Sahir Rai Bhatnagar, PhD Candidate

Joint with Yi Yang, Mathieu Blanchette, Luigi Bouchard, Celia Greenwood

Biostatistics, McGill University

Preprint available at sahirbhatnagar.com

Simple Rule 11:

Simulated Data ≠ Real Data

Motivation

one predictor variable at a time

[Figure: predictor variables tested against the phenotype one at a time (Test 1 to Test 5)]

a network based view

[Figure: diagram labelled Predictor Variable and Phenotype, with a single test (Test 1)]

system level changes due to environment

[Figure: diagram labelled Predictor Variable, Phenotype and Environment (A, B), with a single test (Test 1)]

Motivating Dataset: Newborn epigenetic adaptations to gestational diabetes exposure (Luigi Bouchard, USherbrooke)

Environment: Gestational Diabetes

Large Data: Child’s epigenome (p ≈ 450k)

Phenotype: Obesity measures

Differential Correlation between environments

[Figure panels: (a) Gestational diabetes affected pregnancy, (b) Controls]

NIH MRI brain study

Environment: Age

Large Data: Cortical Thickness (p ≈ 80k)

Phenotype: Intelligence

Goals of this study

Objective

(i) Whether clustering that incorporates known covariate or exposure information can improve prediction models

(ii) Whether the resulting clusters can provide an easier route to interpretation

Methods

ECLUST - our proposed method: 2 steps

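Original Data, split by exposure: E = 0 and E = 1

1a) Gene Similarity

1b) Cluster Representation (each cluster summarized as an n × 1 vector)

2) Penalized Regression: Y (n × 1) ~ cluster representations + cluster representations × E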

the objective of statistical methods is the reduction of data. A quantity of data . . . is to be replaced by relatively few quantities which shall adequately represent . . . the relevant information contained in the original data.

- Sir R. A. Fisher, 1922

Step 1a: Method to detect gene clusters

(i) Hierarchical clustering (average linkage) with a topological overlap matrix (TOM) [1] scoring dissimilarity [2]:

|TOM_{E=1} − TOM_{E=0}|

(ii) Number of clusters chosen using the dynamicTreeCut algorithm [3]

[1] Ravasz et al., Science (2002)

[2] Klein Oros et al., Frontiers in Genetics (2016)

[3] Langfelder and Zhang, Bioinformatics (2008)
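
The difference of TOMs in Step 1a can be sketched numerically as follows. This is a minimal illustration, not the authors' R implementation: it assumes an unsigned adjacency of the form |correlation|^power (the power of 6 is an assumption), and the function names `tom` and `tom_difference` as well as the toy data are hypothetical.

```python
import numpy as np

def tom(x, power=6):
    """Topological overlap matrix of the columns of an n x p matrix.

    Assumption: unsigned adjacency a_ij = |cor(x_i, x_j)|^power.
    """
    adj = np.abs(np.corrcoef(x, rowvar=False)) ** power
    np.fill_diagonal(adj, 0.0)
    overlap = adj @ adj                       # l_ij = sum_u a_iu * a_uj
    k = adj.sum(axis=0)                       # connectivity of each variable
    denom = np.minimum.outer(k, k) + 1.0 - adj
    t = (overlap + adj) / denom
    np.fill_diagonal(t, 1.0)
    return t

def tom_difference(x, e, power=6):
    """|TOM_{E=1} - TOM_{E=0}| for a binary exposure vector e."""
    e = np.asarray(e)
    return np.abs(tom(x[e == 1], power) - tom(x[e == 0], power))

# Toy data: the first 3 variables are correlated only among exposed subjects.
rng = np.random.default_rng(0)
n, p = 60, 8
e = np.repeat([0, 1], n // 2)
x = rng.normal(size=(n, p))
latent = rng.normal(size=n)
x[e == 1, :3] += 2.0 * latent[e == 1][:, None]

d = tom_difference(x, e)
print(np.round(d[:4, :4], 2))
```

In this toy example the entries of `d` among the first three variables should stand out from the rest, which is the kind of exposure-dependent structure the clustering step is meant to pick up. The resulting matrix would then be handed to an average-linkage hierarchical clustering routine and a tree-cutting step, neither of which is shown here.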

Step 1b: Cluster Representation

(i) Average [4]

(ii) 1st Principal Component [5]

[4] Hastie et al., Genome Biology (2001); Park et al., Biostatistics (2007)

[5] Kendall, A Course in Multivariate Analysis (1957)
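
The two cluster summaries in Step 1b can be sketched as below. This is an illustration under stated assumptions, not the authors' code: cluster memberships are taken as a given label vector, and the names `cluster_average` and `cluster_first_pc` are hypothetical.

```python
import numpy as np

def cluster_average(x, labels):
    """One n x 1 column per cluster: the mean of the cluster's variables."""
    labels = np.asarray(labels)
    return np.column_stack(
        [x[:, labels == c].mean(axis=1) for c in np.unique(labels)]
    )

def cluster_first_pc(x, labels):
    """One n x 1 column per cluster: scores on the first principal component."""
    labels = np.asarray(labels)
    cols = []
    for c in np.unique(labels):
        block = x[:, labels == c]
        centered = block - block.mean(axis=0)
        u, s, _ = np.linalg.svd(centered, full_matrices=False)
        cols.append(u[:, 0] * s[0])           # sign of a component is arbitrary
    return np.column_stack(cols)

# Toy data: 20 subjects, 5 variables assigned to 2 clusters.
rng = np.random.default_rng(1)
x = rng.normal(size=(20, 5))
labels = [0, 0, 0, 1, 1]

avg = cluster_average(x, labels)
pcs = cluster_first_pc(x, labels)
print(avg.shape, pcs.shape)
```

Either summary reduces p variables to one column per cluster, so the regression in the next step works with far fewer predictors than the original data.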

Step 2: Variable Selection

(i) Linear effects: Lasso, Elastic Net [6]

(ii) Non-linear effects: MARS [7]

[6] Tibshirani, JRSSB (1996); Zou and Hastie, JRSSB (2005)

[7] Friedman, Annals of Statistics (1991)
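
For the linear case in Step 2, the sketch below builds a design matrix of cluster representations, the exposure, and their products, then fits a lasso by plain coordinate descent. Several things here are assumptions made for illustration: the inclusion of a main effect for E, the penalty value, the simulated data, and the names `interaction_design` and `lasso_cd`. A small hand-written solver stands in for the established lasso software a real analysis would use; Elastic Net and MARS are not covered.

```python
import numpy as np

def interaction_design(z, e):
    """Columns: cluster representations, E, and each representation times E."""
    e = np.asarray(e, dtype=float).reshape(-1, 1)
    return np.hstack([z, e, z * e])

def soft_threshold(v, lam):
    return np.sign(v) * max(abs(v) - lam, 0.0)

def lasso_cd(x, y, lam, n_iter=200):
    """Lasso on standardized columns, minimizing
    (1 / (2n)) * ||y - x b||^2 + lam * ||b||_1 by coordinate descent.
    Assumes no column of x is constant."""
    n, p = x.shape
    xs = (x - x.mean(axis=0)) / x.std(axis=0)
    resid = y - y.mean()
    beta = np.zeros(p)
    for _ in range(n_iter):
        for j in range(p):
            resid = resid + xs[:, j] * beta[j]      # drop column j's contribution
            rho = xs[:, j] @ resid / n
            beta[j] = soft_threshold(rho, lam)      # unit-variance column
            resid = resid - xs[:, j] * beta[j]
    return beta

# Toy data: 4 cluster representations; cluster 0 has a main effect and
# cluster 1 acts only through its interaction with the exposure.
rng = np.random.default_rng(2)
n, k = 300, 4
z = rng.normal(size=(n, k))
e = np.tile([0, 1], n // 2)
y = 1.0 * z[:, 0] + 3.0 * z[:, 1] * e + 0.5 * rng.normal(size=n)

design = interaction_design(z, e)
beta = lasso_cd(design, y, lam=0.1)
print(np.round(beta, 2))
```

The coefficients are on the standardized scale. In the printed vector, positions 0 to 3 are main effects, position 4 is E, and positions 5 to 8 are the interactions, so the simulated signal sits at positions 0 and 6.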

Simulation Study

Simulated TOM by Exposure Status

[Figure panels: (a) TOM(X_{E=1}), (b) TOM(X_{E=0})]

Difference of TOMs

[Figure: |TOM(X_{E=1}) − TOM(X_{E=0})|]

TOM based on all subjects

[Figure: TOM(X_{all})]

Real Data Analysis

Gestational Diabetes: Prediction Performance

Gestational Diabetes: Interpretation of Clusters with IPA

• Canonical Pathways: 1,25-dihydroxyvitamin D3 Biosynthesis – vitamin D associated with obesity

• Diseases and Disorders: Hepatic System Disease – metabolism of glucose and lipids

• Physiological System Development and Function:

(i) Behavior and neurodevelopment – associated with obesity

(ii) Embryonic and organ development – GD associated with macrosomia

NIHPD: Age

NIHPD: Income

Final Remarks

Discussion and Contributions

• Large system-wide changes are observed in many environments (DNA methylation, cortical thickness, gene expression)

• Environment dependent clustering can improve prediction performance in high dimensional settings (n << p)

• Clusters can be interpreted but require much more expert knowledge

• Leverages existing computationally fast algorithms and can run on a laptop computer (p ≈ 10k)

• Software implementation in R: sahirbhatnagar.com

Limitations

• There must be a high-dimensional signature of the exposure

• Covariance estimation

• Currently limited to binary environment

• Interpretation can be difficult

Acknowledgements

• Dr. Celia Greenwood

• Dr. Blanchette and Dr. Yang

• Dr. Luigi Bouchard, André Anne Houde

• Dr. Steele, Dr. Kramer, Dr. Abrahamowicz

• Maxime Turgeon, Kevin McGregor, Lauren Mokry, Dr. Forest

• Greg Voisin, Dr. Forgetta, Dr. Klein

• Mothers and children from the study