A Model for Interpretable High Dimensional Interactions
Sahir Rai Bhatnagar
Joint work with Yi Yang, Mathieu Blanchette and Celia Greenwood
Poster Number 67
Motivation
one predictor variable at a time
[Figure: each predictor variable is tested against the phenotype separately (Tests 1 to 5)]
a network based view
[Figure: predictor variables viewed as a network and tested against the phenotype (Test 1)]
system level changes due to environment
[Figure: predictor variable network under environments A and B, tested against the phenotype (Test 1)]
Motivating Dataset: Newborn epigenetic adaptations to gestational diabetes exposure (Luigi Bouchard, Sherbrooke)
• Environment: gestational diabetes
• Large data: child’s epigenome (p ≈ 450k)
• Phenotype: obesity measures
Differential Correlation between environments
(a) Gestational diabetes affected pregnancy (b) Controls
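As a rough illustration of what comparing correlations between environments involves, the sketch below computes a Pearson correlation matrix within each environment group and takes the absolute difference. The function names (`pearson`, `corr_matrix`, `diff_corr`) and the toy data are assumptions made for illustration; they are not taken from the study data or from the authors' software.

```python
import math

def pearson(x, y):
    """Pearson correlation between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def corr_matrix(rows):
    """Correlation matrix of the columns of `rows` (one row per subject)."""
    cols = list(zip(*rows))
    p = len(cols)
    return [[pearson(cols[i], cols[j]) for j in range(p)] for i in range(p)]

def diff_corr(X, E):
    """Absolute difference between the E = 1 and E = 0 correlation matrices."""
    X0 = [row for row, e in zip(X, E) if e == 0]
    X1 = [row for row, e in zip(X, E) if e == 1]
    C0, C1 = corr_matrix(X0), corr_matrix(X1)
    p = len(C0)
    return [[abs(C1[i][j] - C0[i][j]) for j in range(p)] for i in range(p)]

# Hypothetical toy data: two predictors that are negatively correlated
# when E = 0 and positively correlated when E = 1.
X = [[1, 3], [2, 2], [3, 1], [1, 1], [2, 2], [3, 3]]
E = [0, 0, 0, 1, 1, 1]
D = diff_corr(X, E)
```

Large off-diagonal entries of `D` flag pairs of predictors whose correlation changes between the two environments.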
formal statement of initial problem
• n: number of subjects
• p: number of predictor variables
• $X_{n \times p}$: high dimensional data set (p >> n)
• $Y_{n \times 1}$: phenotype
• $E_{n \times 1}$: environmental factor that has a widespread effect on X and can modify the relation between X and Y
Objective
• Which elements of X are associated with Y in a way that depends on E?
Methods
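ECLUST - our proposed method: 3 phases
1) Gene similarity: computed from the original data separately for E = 0 and E = 1
2) Cluster representation: each cluster is summarized by an n × 1 vector
3) Penalized regression: $Y_{n \times 1}$ ∼ cluster representatives + cluster representatives × E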
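Phases 2 and 3 can be sketched roughly as follows, assuming the clustering step has already produced groups of column indices. The per-subject average used as the cluster summary is an assumption (other summaries are possible), and `cluster_representatives`, `design_matrix`, and the toy data are hypothetical names and values, not the eclust package API. The sketch stops at building the regression inputs and does not fit the penalized model.

```python
def cluster_representatives(X, clusters):
    """Summarize each cluster by the per-subject mean of its columns.

    Returns one row per subject and one column per cluster, so each
    cluster is reduced to a single n x 1 vector.
    """
    reps = []
    for row in X:
        reps.append([sum(row[j] for j in cols) / len(cols) for cols in clusters])
    return reps

def design_matrix(reps, E):
    """Columns: cluster representatives, E, then each representative times E."""
    return [r + [e] + [v * e for v in r] for r, e in zip(reps, E)]

# Hypothetical toy input: 4 subjects, 4 predictors, two clusters of columns.
X = [[1.0, 3.0, 10.0, 20.0],
     [2.0, 4.0, 30.0, 10.0],
     [0.0, 2.0, 0.0, 4.0],
     [5.0, 5.0, 8.0, 8.0]]
E = [0, 1, 0, 1]
clusters = [[0, 1], [2, 3]]  # assumed output of the clustering phase

reps = cluster_representatives(X, clusters)
Z = design_matrix(reps, E)
```

With p predictors collapsed into a handful of cluster representatives, the matrix `Z` has far fewer columns than the original data plus all p interaction terms would.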
the objective of statistical methods is the reduction of data. A quantity of data . . . is to be replaced by relatively few quantities which shall adequately represent . . . the relevant information contained in the original data.
- Sir R. A. Fisher, 1922
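Model
$$g(\mu) = \underbrace{\beta_0 + \beta_1 X_1 + \cdots + \beta_p X_p + \beta_E E}_{\text{main effects}} + \underbrace{\alpha_{1E}(X_1 E) + \cdots + \alpha_{pE}(X_p E)}_{\text{interactions}}$$
Reparametrization¹: $\alpha_{jE} = \gamma_{jE}\beta_j\beta_E$.
Strong heredity principle²: $\alpha_{jE} \neq 0 \Rightarrow \beta_j \neq 0$ and $\beta_E \neq 0$
¹ Choi et al. 2010, JASA
² Chipman 1996, Canadian Journal of Statistics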
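A small sketch can make the link between the reparametrization and the strong heredity principle concrete: because each interaction coefficient is a product that includes both main effects, it is forced to zero whenever either main effect is zero. The function names and the numeric values below are hypothetical and chosen only for illustration.

```python
def interaction_coefs(beta, beta_E, gamma):
    """alpha_jE = gamma_jE * beta_j * beta_E for each predictor j."""
    return [g * b * beta_E for g, b in zip(gamma, beta)]

def satisfies_strong_heredity(beta, beta_E, alpha):
    """True if every nonzero alpha_jE has nonzero beta_j and nonzero beta_E."""
    return all(a == 0 or (b != 0 and beta_E != 0) for a, b in zip(alpha, beta))

beta = [0.5, 0.0, -1.2]   # hypothetical main effects
beta_E = 2.0              # hypothetical environment effect
gamma = [1.0, 3.0, 0.0]   # hypothetical gamma_jE values

alpha = interaction_coefs(beta, beta_E, gamma)
heredity_ok = satisfies_strong_heredity(beta, beta_E, alpha)
```

Here the second interaction is zero because its main effect is zero, even though its gamma is not, and the third is zero because its gamma is zero. Coefficients built this way always pass the check, whereas a freely chosen alpha (for example a nonzero interaction paired with a zero main effect) would not.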
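Strong Heredity Model with Penalization
$$\arg\min_{\beta_0, \beta, \gamma} \; \frac{1}{2}\|Y - g(\mu)\|^2 + \lambda_\beta (w_1\beta_1 + \cdots + w_q\beta_q + w_E\beta_E) + \lambda_\gamma (w_{1E}\gamma_{1E} + \cdots + w_{qE}\gamma_{qE})$$
$$w_j = \left|\frac{1}{\beta_j}\right|, \quad w_{jE} = \left|\frac{\beta_j \beta_E}{\alpha_{jE}}\right|$$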
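To show how the pieces of this criterion fit together, the sketch below evaluates it for a given set of coefficients. Several choices here are assumptions rather than statements from the slides: an identity link (continuous outcome, so the fitted mean is compared with Y directly), absolute values on the penalized coefficients as in standard lasso-type penalties, weights computed from some initial coefficient estimates, and $w_E = |1/\beta_E|$ by analogy with $w_j$. All function names and numbers are hypothetical.

```python
def adaptive_weights(beta_init, beta_E_init, alpha_init):
    """w_j = |1 / beta_j| and w_jE = |beta_j * beta_E / alpha_jE|.

    Inputs are initial estimates (assumed nonzero); w_E = |1 / beta_E|
    is an assumption made by analogy with w_j.
    """
    w = [abs(1.0 / b) for b in beta_init]
    w_E = abs(1.0 / beta_E_init)
    w_int = [abs(b * beta_E_init / a) for b, a in zip(beta_init, alpha_init)]
    return w, w_E, w_int

def objective(Y, X, E, beta0, beta, beta_E, gamma,
              w, w_E, w_int, lam_beta, lam_gamma):
    """Half the squared error plus the two weighted penalties."""
    rss = 0.0
    for y, x, e in zip(Y, X, E):
        mu = beta0 + beta_E * e
        for xj, bj, gj in zip(x, beta, gamma):
            # main effect plus interaction with alpha_jE = gamma_jE * beta_j * beta_E
            mu += bj * xj + gj * bj * beta_E * xj * e
        rss += (y - mu) ** 2
    pen_beta = sum(wj * abs(bj) for wj, bj in zip(w, beta)) + w_E * abs(beta_E)
    pen_gamma = sum(wj * abs(gj) for wj, gj in zip(w_int, gamma))
    return 0.5 * rss + lam_beta * pen_beta + lam_gamma * pen_gamma

# Hypothetical toy problem: two subjects, one predictor.
X = [[1.0], [2.0]]
E = [0, 1]
Y = [1.0, 5.0]
w, w_E, w_int = adaptive_weights([0.5], 2.0, [0.25])
value = objective(Y, X, E, beta0=0.0, beta=[1.0], beta_E=1.0, gamma=[0.5],
                  w=w, w_E=w_E, w_int=w_int, lam_beta=0.1, lam_gamma=0.2)
```

The two tuning parameters act separately: `lam_beta` controls sparsity of the main effects and `lam_gamma` controls sparsity of the interactions.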
Results
Simulation Study: Jaccard Index and test set MSE
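For readers unfamiliar with the two metrics, a minimal sketch follows. The Jaccard index compares the set of selected variables with the set of truly active ones, and the test set MSE measures prediction error on held-out data. The function names and example values are illustrative assumptions and are not results from the simulation study.

```python
def jaccard(selected, truth):
    """Size of the intersection over size of the union of two index sets."""
    a, b = set(selected), set(truth)
    if not a and not b:
        return 1.0  # convention chosen here: two empty sets count as identical
    return len(a & b) / len(a | b)

def mse(y_true, y_pred):
    """Mean squared error between observed and predicted values."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

# Hypothetical example: variables 2 and 3 are correctly selected,
# variable 1 is a false positive and variable 4 is missed.
j = jaccard([1, 2, 3], [2, 3, 4])
err = mse([1.0, 2.0, 3.0], [1.0, 2.0, 5.0])
```

A Jaccard index of 1 means the selected set matches the true set exactly, and 0 means the two sets share no variables.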
Open source software
• Software implementation in R: http://sahirbhatnagar.com/eclust/
• Allows user specified interaction terms
• Automatically determines the optimal tuning parameters through
cross validation
• Can also be applied to genetic data
Conclusions
Conclusions and Contributions
• Large system-wide changes are observed in many environments
• These changes can possibly be exploited to aid the analysis of large data
• We develop and implement a multivariate penalization procedure for
predicting a continuous or binary disease outcome while detecting
interactions between high dimensional data (p >> n) and an
environmental factor.
• Dimension reduction is achieved through leveraging the
environmental-class-conditional correlations
• Also, we develop and implement a strong heredity framework
within the penalized model
• R software: http://sahirbhatnagar.com/eclust/
Limitations
• There must be a high-dimensional signature of the exposure
• Clustering is unsupervised
• Two tuning parameters
• Need more samples . . . Got data? (Poster 67)
acknowledgements
• Dr. Celia Greenwood
• Dr. Blanchette and Dr. Yang
• Dr. Luigi Bouchard, Andre Anne Houde
• Dr. Steele, Dr. Kramer, Dr. Abrahamowicz
• Maxime Turgeon, Kevin McGregor, Lauren Mokry, Dr. Forest
• Greg Voisin, Dr. Forgetta, Dr. Klein
• Mothers and children from the study