Post on 09-Apr-2017
Building a search engine to find environmental and phenotypic factors
associated with disease and health
Chirag J PatelNortheast Big Data Innovation Hub Workshop
02/24/17
chirag@hms.harvard.edu@chiragjp
www.chiragjpgroup.org
P = G + EType 2 Diabetes
CancerAlzheimer’s
Gene expression
Phenotype Genome
Variants
Environment
Infectious agentsNutrientsPollutants
Drugs
We are great at G investigation!
over 2400 Genome-wide Association Studies (GWAS)
https://www.ebi.ac.uk/gwas/
G
Nothing comparable to elucidate E influence!
E: ???We lack high-throughput methods and data to discover new E in P…
until now!
σ2Gσ2P H2 =
Heritability (H2) is the range of phenotypic variability attributed to genetic variability in a
population
Indicator of the proportion of phenotypic differences attributed to G.
Eye colorHair curliness
Type-1 diabetesHeight
SchizophreniaEpilepsy
Graves' diseaseCeliac disease
Polycystic ovary syndromeAttention deficit hyperactivity disorder
Bipolar disorderObesity
Alzheimer's diseaseAnorexia nervosa
PsoriasisBone mineral density
Menarche, age atNicotine dependence
Sexual orientationAlcoholism
LupusRheumatoid arthritis
Crohn's diseaseMigraine
Thyroid cancerAutism
Blood pressure, diastolicBody mass index
DepressionCoronary artery disease
InsomniaMenopause, age at
Heart diseaseProstate cancer
QT intervalBreast cancer
Ovarian cancerHangoverStrokeAsthma
Blood pressure, systolicHypertensionOsteoarthritis
Parkinson's diseaseLongevity
Type-2 diabetesGallstone diseaseTesticular cancer
Cervical cancerSciatica
Bladder cancerColon cancerLung cancerLeukemia
Stomach cancer
0 25 50 75 100Heritability: Var(G)/Var(Phenotype) Source: SNPedia.com
G estimates for burdensome diseases are low and variable: massive opportunity for E discovery
Type 2 Diabetes
Heart Disease
Asthma
Eye colorHair curliness
Type-1 diabetesHeight
SchizophreniaEpilepsy
Graves' diseaseCeliac disease
Polycystic ovary syndromeAttention deficit hyperactivity disorder
Bipolar disorderObesity
Alzheimer's diseaseAnorexia nervosa
PsoriasisBone mineral density
Menarche, age atNicotine dependence
Sexual orientationAlcoholism
LupusRheumatoid arthritis
Crohn's diseaseMigraine
Thyroid cancerAutism
Blood pressure, diastolicBody mass index
DepressionCoronary artery disease
InsomniaMenopause, age at
Heart diseaseProstate cancer
QT intervalBreast cancer
Ovarian cancerHangoverStrokeAsthma
Blood pressure, systolicHypertensionOsteoarthritis
Parkinson's diseaseLongevity
Type-2 diabetesGallstone diseaseTesticular cancer
Cervical cancerSciatica
Bladder cancerColon cancerLung cancerLeukemia
Stomach cancer
0 25 50 75 100Heritability: Var(G)/Var(Phenotype) Source: SNPedia.com
G estimates for complex traits are low and variable: massive opportunity for high-throughput E discovery
σ2E : Exposome!
Enhance accessibility of clinical and environmental data, and analytic artificial intelligence tools!
How can we drive discovery of environmental factors (E) in disease phenotypes (P)?
Enhance accessibility of large open data and tools to drive discovery of
environmental factors (E) in disease phenotypes (P)
Greg Cooper, MD, PhD Pittsburgh
Vasant Honavar, PhD Penn State
Noémie Elhadad, PhD Columbia
Chirag Patel, PhD Harvard
Where do we get disease (P) data?
wearable.com
Where do we get disease P data?Health record data from your doctor!
• Longitudinal data on millions of patients
• diagnoses, prescriptions, lab reports, notes
• Sitting there in institutional IT infrastructure
• OHDSI provides a unified model to access data across institutions, enhancing the scientific process!
Noémie Elhadad, PhD Columbia
Capitalize on digitalized health record data (from around the world)!
High-powered dataset(s) for discovery.
Where do we get environmental (E) data?
Examples of sources of disparate external exposome datasets available in the Exposome Data Warehouse
• Geological
• NASA - Cloud and Atmosphere Profiles
• NOAA Climate Data
• Pollution
• EPA Air Quality Surveillance Data Mart, or AirData,
• Socio-Economic
• US Census American Community Survey (ACS)
• Epidemiological
• CDC Wonder, USDA Food Atlas
A key challenge: mashing up Exposome Data Warehouse with patient
data from OHDSI
f(location, time)PM2.5income
Pollen count
EPA AirData American CommunitySurvey
NOAA Climate
home zipcodeencounter time
0.50
gini index
0.20
0.1
hh income (K)
100
50
3030
15
Wind
0
individualn
…
..
..
..
..
3/5/1998yes
10
E?age
no
95376
20
sex
M
Temp
individual2
PM2.5
individual1
75 55
02215
1/1/2016 23
Pb
individual3
21
70
F
M
zip
35
Time(E)
2312/11/2015
0
yes
50
302124
OHDSI EPA NOAA Census
milli
ons
of p
atie
nts
Will it work? yep!
Does temperature (and weather) influence asthma-related pediatric ER visits?
• Children <= 17 y/o with >=1 ICD9 code corresponding to 493.*
• N=56K, >84K ER visits
• Weather station data
• (daily temperature, wind, humidity)
• Case-crossover design (only investigated cases)
Yeran Li, PhD (MS, HSPH) Chirag Lakhani, PhD (HMS)
Yun Wang, PhD (Post-doc, HSPH)
Prevalence of asthma attack varies across the US
Temperature(F)20
4060
80
Lag
01
23
45
6
Relative risk
0.951.00
1.05
20 40 60 80
0.9
1.1
1.3
Overall temperature effect
Mean temperature (F)
Rel
ativ
e ris
k
●
●
●●
● ● ●
0 1 2 3 4 5 6
0.90
1.00
1.10
Lag effect at T=60F
Lag(day)
Rel
ativ
e R
isk
20 40 60 80
0.90
1.00
1.10
Temperature effect at Lag=0
Mean temperature (F)
Rel
ativ
e R
isk
Does temperature influence asthma ER visits?: yes! Relative risk of asthma attack by mean temperature
Rates of asthma attacks depend on season?: yes!
0.9
1.0
1.1
20 40 60 80
season Fall Spring Summer Winter
Temperature)effects)vary)at)different)regions)(use)NOAA)defini9on,)ER)kids))
h@ps://www.ncdc.noaa.gov/monitoringEreferences/maps/usEclimateEregions.php)
sout
h
north
east
sout
heas
t
cent
ral
sout
hwes
t
wn_
cent
ral
en_c
entra
l
west
north
west
Region counts by NOAA
Num
of o
bs
020
000
6000
0
51052
78126
57238
68719
15256 18050
44551
22923
13895
south northeast southeast central southwest wn_central en_central west northwest
0.5
1.0
1.5
2.0
20 40 60 80 20 40 60 80 20 40 60 80 20 40 60 80 20 40 60 80 20 40 60 80 20 40 60 80 20 40 60 80 20 40 60 80Temperature
Risk
weather effect: different weather zones by NOAA
Rates of asthma attacks dependent on region?: yes!
Does temperature (and weather) influence asthma-related ER visits in kids?: the tip of the iceberg!
Temperature(F)20
4060
80
Lag
01
23
45
6
Relative risk
0.951.00
1.05
20 40 60 80
0.9
1.1
1.3
Overall temperature effect
Mean temperature (F)
Rel
ativ
e ris
k
●
●
●●
● ● ●
0 1 2 3 4 5 6
0.90
1.00
1.10
Lag effect at T=60F
Lag(day)
Rel
ativ
e R
isk
20 40 60 80
0.90
1.00
1.10
Temperature effect at Lag=0
Mean temperature (F)
Rel
ativ
e R
isk
What other scientific questions? •what is influence of pollen?•what is the influence of air pollution?•what about adults?
Can we replicate the analysis? •different populations•using different data•with different analysts
ExposomeDBEnvironmental Protection Agency
AirData
US CensusSocioeconomic data
National Oceanic and Atmospheric Administration (NOAA)Climate Data
Observational Health Data Sciences and Informatics
(OHDSI)
Harvard Medical School Columbia University P&S
Geotemporal database Network of observational data
Causal Analytics
A2
A3
A1
AX Aim X
Analytics Workshop
Hands-on workshops for leveraging work in
Aims 1-3
Training Resources
Jupyter notebooksStudent internships at Pitt,
Harvard, PSU, and Columbia P&S
OHDSI Node
A4
University of Pittsburgh Penn State University
Integrating the ExposomeData Warehouse with OHDSI and causal modeling tools to drive and demonstrate
discovery.
Many hypotheses are possible to address: useful for public health & planning!
What is the effect of air pollution levels in disease?
Do adverse weather conditions influence hospital use?
What pharmaceutical drugs lead to adverse health outcomes?
How does socioeconomic context influence hospital use, disease rates, and recovery?
We will harness tools in machine learning extract signal from noise!
Greg Cooper, MD, PhD Pittsburgh
Vasant Honavar, PhD Penn State
SES
PM2.5
asthma
oF
bayesian networks
socio-economic
case-crossoverweather
air pollution
Sci Rep 2016
http://www.ccd.pitt.edu/tools/
Integrating the ExposomeDB with OHDSI and analytics to drive and demonstrate discovery by the community!
(trainees especially welcome)
• 2-day hands-on workshop in New York or Boston
• remote “exchange” internship program and 2-week immersion
• dissemination of electronic training resources
Many hypotheses are possible to address: useful for can we build a machine learning predictor to estimate E?
Eye colorHair curliness
Type-1 diabetesHeight
SchizophreniaEpilepsy
Graves' diseaseCeliac disease
Polycystic ovary syndromeAttention deficit hyperactivity disorder
Bipolar disorderObesity
Alzheimer's diseaseAnorexia nervosa
PsoriasisBone mineral density
Menarche, age atNicotine dependence
Sexual orientationAlcoholism
LupusRheumatoid arthritis
Crohn's diseaseMigraine
Thyroid cancerAutism
Blood pressure, diastolicBody mass index
DepressionCoronary artery disease
InsomniaMenopause, age at
Heart diseaseProstate cancer
QT intervalBreast cancer
Ovarian cancerHangoverStrokeAsthma
Blood pressure, systolicHypertensionOsteoarthritis
Parkinson's diseaseLongevity
Type-2 diabetesGallstone diseaseTesticular cancer
Cervical cancerSciatica
Bladder cancerColon cancerLung cancerLeukemia
Stomach cancer
0 25 50 75 100Heritability: Var(G)/Var(Phenotype) Source: SNPedia.com
σ2E : Exposome!
Harvard DBMI Isaac KohaneSusanne ChurchillNathan PalmerSunny AlvearMichal Preminger
Chirag J Patelchirag@hms.harvard.edu
@chiragjpwww.chiragjpgroup.org
ThanksRagGroup Chirag Lakhani Yeran Li Shreyas BhaveRolando Acosta
Noémie Elhadad (Columbia) Vasant Honavar (PSU)
Greg Cooper (Pitt) George Hripcsak (Columbia)
René Baston Katie Naum
Kathleen McKeown
NIH Common FundBig Data to Knowledge