Selecting Initial GWAS and replication studies David Hunter Harvard School of Public Health
Selecting Initial GWAS and replication studies
David Hunter
Harvard School of Public Health
Brigham and Women’s Hospital
Broad Institute of MIT and Harvard
Initial Study for GWAS
• Cases and controls well matched with respect to ancestry to minimize population stratification (restriction to one self-identified group)
• Genomic control or other methods, e.g. Eigenstrat (Price et al, 2006), may compensate for looser matching
Control of population stratification, e.g. hair color in Nurses’ Health Study (European ancestry)
Chi-squared inflation factors and Q-Q plots of –log10 p-values with no adjustment for population stratification, and adjusting for the top four and top fifty eigenvectors (Price et al, 2006)
45, 19 and 19 SNPs (respectively) with p < 10^-7 not shown
Kraft P, unpublished
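The chi-squared inflation factor mentioned on this slide can be sketched as follows. This is a minimal illustration of the usual genomic-control definition (observed median chi-squared divided by the null median for 1 degree of freedom), not the analysis behind the slide; the function names and the synthetic p-values are assumptions made for the example.

```python
from statistics import NormalDist, median

# Median of a chi-squared distribution with 1 degree of freedom.
CHI2_1DF_MEDIAN = 0.4549364


def chi2_from_p(p):
    """Convert a two-sided p-value to the equivalent 1-df chi-squared statistic."""
    z = NormalDist().inv_cdf(1.0 - p / 2.0)
    return z * z


def inflation_factor(p_values):
    """Genomic-control lambda: observed median chi-squared over the null median."""
    return median(chi2_from_p(p) for p in p_values) / CHI2_1DF_MEDIAN


# Synthetic data (hypothetical): evenly spaced p-values mimic a null scan.
n = 1001
null_p = [(i - 0.5) / n for i in range(1, n + 1)]
# Shrinking every p-value mimics systematic inflation, e.g. from stratification.
inflated_p = [p ** 1.2 for p in null_p]

lambda_null = inflation_factor(null_p)
lambda_inflated = inflation_factor(inflated_p)
print(round(lambda_null, 3), round(lambda_inflated, 3))
```

A value near 1 suggests little systematic inflation; values well above 1 suggest stratification or other bias affecting many SNPs at once.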
Article: Genome-wide association study of 14,000 cases of seven common diseases and 3,000 shared controls
The Wellcome Trust Case Control Consortium
Nature 447, 661-678 (7 June 2007) | doi:10.1038/nature05911; Received 26 March 2007; Accepted 11 May 2007
Conclusions
Broad matching on ancestry and region adequate for discovery of strongest hits
Statistical methods for control of population stratification (within populations of European ancestry) adequate to assist in discovery of strongest hits
Will more rigorous designs permit discovery of weaker associations?
When signal-to-noise is low, how does noise due to multiple comparisons compare with noise due to poor matching of controls? False negatives are the biggest problem (false positives can be dealt with via replication).
Criteria for follow-up of initial reports of genotype–phenotype associations
Replication studies should be of sufficient sample size to convincingly distinguish the proposed effect from no effect
Replication studies should preferably be conducted in independent data sets, to avoid the tendency to split one well-powered study into two less conclusive ones
The same or a very similar phenotype should be analysed
A similar population should be studied, and notable differences between the populations studied in the initial and attempted replication studies should be described
Similar magnitude of effect and significance should be demonstrated, in the same direction, with the same SNP or a SNP in perfect or very high linkage disequilibrium with the prior SNP (r² close to 1.0)
Statistical significance should first be obtained using the genetic model reported in the initial study
When possible, a joint or combined analysis should lead to a smaller P-value than that seen in the initial report
A strong rationale should be provided for selecting SNPs to be replicated from the initial study, including linkage-disequilibrium structure, putative functional data or published literature
Replication reports should include the same level of detail for study design and analysis plan as reported for the initial study
Chanock, Manolio et al. Nature, June 7th 2007
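The first criterion (a replication sample large enough to distinguish the proposed effect from no effect) can be checked with a rough power calculation. The sketch below uses a normal approximation for a two-sided comparison of allele frequencies between cases and controls; the function name is hypothetical, and the example frequencies and sample sizes are borrowed from the rs1219648 table later in this deck purely for illustration. Treating the initially observed frequencies as the true ones tends to be optimistic.

```python
from math import sqrt
from statistics import NormalDist


def allelic_power(freq_cases, freq_controls, n_cases, n_controls, alpha=0.05):
    """Approximate power of a two-sided allele-frequency comparison."""
    norm = NormalDist()
    # Each person contributes two alleles, hence 2 * n in the denominators.
    variance = (freq_cases * (1 - freq_cases) / (2 * n_cases)
                + freq_controls * (1 - freq_controls) / (2 * n_controls))
    z_effect = abs(freq_cases - freq_controls) / sqrt(variance)
    z_critical = norm.inv_cdf(1 - alpha / 2)
    # Probability that the test statistic exceeds the critical value
    # (the negligible opposite tail is ignored).
    return norm.cdf(z_effect - z_critical)


# Assumed true frequencies: 45.54% in cases, 38.47% in controls.
power_small = allelic_power(0.4554, 0.3847, 302, 594)
power_large = allelic_power(0.4554, 0.3847, 1145, 1141)
print(round(power_small, 2), round(power_large, 3))
```

Rerunning with a smaller assumed frequency difference shows how quickly the required replication sample grows for weaker effects.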
Initial Study for GWAS: technical issues
• Standard advice – case and control samples handled exactly the same at every stage
• Source of DNA
– Blood/buffy coat: mostly good results
– Buccal cell: variable results (Feigelson et al. CEBP, 2007 - encouraging)
– Whole genome amplified DNA (Affy OK, Illumina in development)
Replication studies
For statistical replication, prefer:
• Similar phenotype
• Similar ancestry
For generalizability, prefer:
• Different populations
• Different ancestry backgrounds (may also help with fine mapping)
Study design?
Prospective
• Protect from survivor bias
• Protect from selection bias
• Interpretability of gene-environment analyses
• Possibility of interpretable biomarkers
Study quality?
Importance depends on strength of signal
• To date – little apparent relation between probability of replication and quality
• May matter more for weak signals
• Sample size may trump quality (within limits)
NCI BPC3 Results: 7,909 cases, 8,683 controls

Cohort            Genotype  Cases / Controls  OR (99% CI)        P-value
All (phet=0.483)  CC        5,566 / 6,666     Ref.               4.00 x 10^-19
                  AC        2,064 / 1,842     1.33 (1.20-1.46)
                  AA        279 / 175         1.87 (1.44-2.42)
ACS               CC        871 / 955         Ref.               2.63 x 10^-5
                  AC        238 / 166         1.56 (1.17-2.08)
                  AA        21 / 9            2.61 (0.92-7.37)
ATBC              CC        606 / 623         Ref.               0.012
                  AC        312 / 260         1.23 (0.95-1.60)
                  AA        45 / 25           1.81 (0.94-3.51)
EPIC              CC        551 / 869         Ref.               0.258
                  AC        169 / 233         1.17 (0.87-1.58)
                  AA        12 / 12           1.57 (0.53-4.59)
HPFS              CC        495 / 545         Ref.               3.63 x 10^-3
                  AC        157 / 114         1.53 (1.07-2.19)
                  AA        11 / 6            2.09 (0.56-7.80)
MEC               CC        1,426 / 1,565     Ref.               2.58 x 10^-7
                  AC        728 / 614         1.32 (1.11-1.58)
                  AA        146 / 88          1.89 (1.30-2.75)
PHS               CC        801 / 1,123       Ref.               0.013
                  AC        200 / 220         1.27 (0.96-1.69)
                  AA        21 / 15           2.06 (0.83-5.12)
PLCO              CC        816 / 986         Ref.               0.014
                  AC        260 / 235         1.33 (1.02-1.72)
                  AA        23 / 20           1.39 (0.63-3.10)

rs1447295: overall p (trend) = 4 x 10^-19
Schumacher et al. Can Res, April 2007
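As a reading aid for the BPC3 table, the sketch below computes crude (unadjusted) odds ratios from the pooled genotype counts, with a Woolf-type confidence interval on the log scale. This is an illustration only: the published estimates come from the authors' adjusted models, so the crude values are expected to be close but not identical. The function name is hypothetical.

```python
from math import exp, log, sqrt
from statistics import NormalDist


def crude_odds_ratio(case_exp, ctrl_exp, case_ref, ctrl_ref, level=0.99):
    """Crude odds ratio versus the reference genotype, with a Woolf CI."""
    odds_ratio = (case_exp * ctrl_ref) / (case_ref * ctrl_exp)
    # Standard error of log(OR) from the four cell counts.
    se = sqrt(1 / case_exp + 1 / ctrl_exp + 1 / case_ref + 1 / ctrl_ref)
    z = NormalDist().inv_cdf(0.5 + level / 2)
    return odds_ratio, exp(log(odds_ratio) - z * se), exp(log(odds_ratio) + z * se)


# Pooled ("All") counts from the table: cases / controls by genotype.
cc_cases, cc_controls = 5566, 6666
ac_cases, ac_controls = 2064, 1842
aa_cases, aa_controls = 279, 175

or_ac = crude_odds_ratio(ac_cases, ac_controls, cc_cases, cc_controls)
or_aa = crude_odds_ratio(aa_cases, aa_controls, cc_cases, cc_controls)
print([round(x, 2) for x in or_ac])
print([round(x, 2) for x in or_aa])
```

The crude AC and AA estimates land near the reported 1.33 and 1.87, which is a quick consistency check on the reconstructed counts.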
Forest plots of the per-allele odds ratios for each of the five SNPs reaching genome-wide significance for breast cancer: a, rs2981582; b, rs3803662; c, rs889312; d, rs13281615; and e, rs3817198. Easton et al. Nature, May 2007
FGFR2
Cancer Genetic Markers of Susceptibility (CGEMS): http://cgems.cancer.gov
General Strategy for Multistage Analysis of Prostate & Breast Cancer
• Initial GWAS study: 1,150 cases / 1,150 controls; 540,000 tag SNPs
• Follow-up Study #1: 4,500 cases / 4,500 controls; ~28,000 SNPs
• Follow-up Study #2: 3,500 cases / 3,500 controls; at least 1,500 SNPs
• Fine mapping: 30 ± 20 loci
Committed Studies CGEMS
Prostate Cancer: PLCO (GWAS), ACS, HPFS, PHS, ATBC, CeRePP, EPIC, MEC
Breast Cancer: NHS (GWAS), PLCO, WHI, Polish C/C, ACS, EPIC, MEC
CGEMS: caBIG Posting
• Pre-computed analysis: no restrictions
• Raw genotype, case/control status, age (in 5 yrs), family Hx (+/-): registration
• Association tests: Prostate 10/06, Breast 04/07; ~528,000 SNPs (Illumina 550k)
http://cgems.cancer.gov/data
http://cgems.cancer.gov
Instant Replication!
Additional In silico replication possibilities
dbGAP ncbi.nlm.nih.gov/dbgap
Framingham nhlbi.nih.gov/about/framingham
WTCCC wtccc.org.uk
DGI broad.mit.edu/diabetes
[Figure: log10(p-value) plotted across chromosomes 1-22 and X (p and q arms), with the FGFR2 locus labeled]
The six SNPs with the smallest P values of the 528,173 tested among 1,145 cases of postmenopausal invasive breast cancer and 1,141 controls (full results available at http://cgems.cancer.gov ).

    SNP ID      χ²*    P*         ORhet*  ORhomo*  Chromosome  Gene
1.  rs10510126  25.37  0.0000031  0.59    0.62     10
2.  rs1219648   23.56  0.0000076  1.24    1.81     10          FGFR2
3.  rs17157903  23.39  0.0000083  1.60    0.79     7           RELN
4.  rs2420946   23.17  0.0000095  1.25    1.81     10          FGFR2
5.  rs7696175   22.40  0.0000137  1.38    0.86     4           TLR1, TLR6
6.  rs12505080  21.99  0.0000168  1.21    0.52     4

*From analyses adjusting for age, matching factors (see Methods), and three eigenvectors of the principal components identified by Eigenstrat. P value obtained by a score test with 2 df.
Hunter et al, Nat Gen, May 2007
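The P values in this table can be reproduced from the chi-squared statistics, because the upper-tail probability of a chi-squared distribution with 2 degrees of freedom has the closed form exp(-x/2). The sketch below does that and compares each P value with a Bonferroni threshold for 528,173 tests; the Bonferroni yardstick is an assumption added for illustration, not a criterion stated on the slide.

```python
from math import exp


def p_value_2df(chi2):
    """Upper-tail probability of a chi-squared statistic with 2 degrees of freedom."""
    return exp(-chi2 / 2.0)


# Chi-squared statistics from the table of the six top-ranked SNPs.
top_hits = {
    "rs10510126": 25.37,
    "rs1219648": 23.56,
    "rs17157903": 23.39,
    "rs2420946": 23.17,
    "rs7696175": 22.40,
    "rs12505080": 21.99,
}

n_tests = 528173
bonferroni = 0.05 / n_tests  # illustrative threshold, roughly 9.5e-8

p_values = {snp: p_value_2df(x) for snp, x in top_hits.items()}
passing = [snp for snp, p in p_values.items() if p < bonferroni]

for snp, p in p_values.items():
    print(f"{snp}: p = {p:.2e}")
print("below Bonferroni threshold:", passing)
```

None of the six clears this threshold in the initial scan alone, which is the case for carrying top-ranked SNPs into replication rather than relying on a single-stage significance cutoff.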
[Figure 2: Scatterplot of P values for the FGFR2 locus from the GWAS. x-axis: position 123.2 to 123.4; y-axis: log10(p-value), 0 to -6.]
Results of associations of rs1219648 in the Nurses’ Health Study, Nurses’ Health Study 2, and the PLCO study.

Study population                      Allele frequency (%)   ORhet (95% CI)     ORhomo (95% CI)    Ptrend
(N cases / N controls)                Cases     Controls
Nurses’ Health Study (1,145/1,141)    45.54     38.47        1.24 (1.04-1.50)   1.81 (1.43-2.31)   2.0 x 10^-6
Nurses’ Health Study 2 (302/594)      48.18     40.57        1.29 (0.95-1.75)   1.93 (1.31-2.86)   0.002
PLCO (919/922)                        44.50     41.49        1.06 (0.86-1.30)   1.22 (0.94-1.58)   0.13
ACS CPS-II (555/556)                  44.95     37.41        1.32 (1.02-1.72)   2.06 (1.42-2.97)   0.0002
Pooled estimates (2,921/3,213)                               1.20 (1.07-1.34)   1.64 (1.42-1.90)   1.1 x 10^-10
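To see how per-study results combine into a pooled estimate, the sketch below applies a fixed-effect, inverse-variance meta-analysis to the heterozygote odds ratios in the table, recovering each standard error from the reported 95% confidence interval. This is an assumed method chosen for illustration; the slide does not say how its pooled estimates were computed, and the function name is hypothetical.

```python
from math import exp, log, sqrt


def pool_fixed_effect(estimates, z=1.96):
    """Inverse-variance pooling of odds ratios given as (OR, CI low, CI high)."""
    total_weight = 0.0
    weighted_sum = 0.0
    for odds_ratio, ci_low, ci_high in estimates:
        beta = log(odds_ratio)
        # A 95% CI spans 2 * 1.96 standard errors on the log scale.
        se = (log(ci_high) - log(ci_low)) / (2 * z)
        weight = 1.0 / se ** 2
        total_weight += weight
        weighted_sum += weight * beta
    pooled_beta = weighted_sum / total_weight
    pooled_se = sqrt(1.0 / total_weight)
    return (exp(pooled_beta),
            exp(pooled_beta - z * pooled_se),
            exp(pooled_beta + z * pooled_se))


# ORhet with 95% CI for rs1219648, one tuple per study in the table.
or_het = [
    (1.24, 1.04, 1.50),  # Nurses' Health Study
    (1.29, 0.95, 1.75),  # Nurses' Health Study 2
    (1.06, 0.86, 1.30),  # PLCO
    (1.32, 1.02, 1.72),  # ACS CPS-II
]

pooled, low, high = pool_fixed_effect(or_het)
print(round(pooled, 2), round(low, 2), round(high, 2))
```

The result sits close to the reported pooled ORhet of 1.20 (1.07-1.34), and the narrower interval shows why a joint analysis can reach a smaller P value than any single study.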
UNFINISHED AGENDA
Where is the causal variant?
What does this tell us about mechanisms of breast carcinogenesis?
THE HITS KEEP COMING….
UNFINISHED EPIDEMIOLOGIC/PUBLIC HEALTH AGENDA
Gene-environment interaction: what do the genes tell us about environmental exposures?
Gene-gene interaction
Pathway analysis
Clinical implications – risk stratification for screening? Intervention?
Health policy implications?
Much of the substrate data – publicly available or relatively cheap.
NHS/HPFS/PHS GENETIC STUDIES
Immaculata De Vivo, Peter Kraft, Hardeep Ranu, Crystal Arnone, Carolyn Guo, Pati Soule, Craig Labadie, Jiali Han, Monica Macgrath, Chunyan He, Patrick Dennett, David Cox, Tim Niu, Aditi Hazra, Fred Schumacher
NHS/HPFS: Sue Hankinson, Shelley Tworoger, Eric Rimm, Frank Hu, Meir Stampfer, Walt Willett, Frank Speizer, Charles Fuchs, Ed Giovannucci, Andy Chan, Debra Schaumberg, Fran Grodstein, Jae Hee Kang
PHS: Jing Ma, Mike Gaziano, P Ridker
NCI BPC3 STEERING COMMITTEE:
Harvard: David Hunter, Michael Gaziano, Julie Buring, Graham Colditz, Walter Willett
EPIC, CEPH, Cambridge: Elio Riboli, Rudolf Kaaks, Federico Canzian, Gilles Thomas
ACS: Michael Thun, Heather Feigelson, Jeanne Calle
NCI: Richard Hayes, Demetrius Albanes, Bob Hoover, Stephen Chanock; Program - Mukesh Verma
MEC & Broad: Brian Henderson, Laurence Kolonel, David Altshuler, Malcolm Pike
SECRETARIAT: David Hunter, Elio Riboli
GENOMICS subgroup: David Altshuler (Chair), Steve Chanock, Gilles Thomas
STATISTICS subgroup: Dan Stram (Chair), Peter Kraft, Rudolf Kaaks, Paul Pharoah, Malcolm Pike, Gilles Thomas, Sholom Wacholder
GENOTYPING subgroup: Chris Haiman (Chair), Federico Canzian, Alison Dunning, Steve Chanock, David Cox, David Hunter, Loic LeMarchand, James Mackay
PUBLICATIONS COMMITTEE: Michael Thun (Chair), Elio Riboli, Brian Henderson, David Hunter, Graham Colditz, Richard Hayes, Demetrius Albanes
Cohorts and centers: Harvard cohorts, ACS cohort, EPIC cohorts, Multiethnic Cohort, PLCO cohort, ATBC cohort, CEPH, Broad Institute, NCI Core Gen Facility
CGEMS Acknowledgements
NCI: Stephen Chanock, Gilles Thomas, Robert Hoover, Joseph Fraumeni, Daniela Gerhard, Kevin Jacobs, Zhaoming Wang, Meredith Yeager, Robert Welch, Richard Hayes, Sholom Wacholder, Nilanjan Chatterjee, Kai Yu, Margaret Tucker, Marianne Rivera-Silva, NCICB
HSPH: David Hunter, Peter Kraft, Fred Schumacher, David Cox
ACS: Heather Feigelson, Carmen Rodriguez, Eugenia Calle, Michael Thun
PLCO: Regina Ziegler, Chris Berg, Saundra Buys, Chris MacCarty
Selecting initial and replication samples from existing studies
I. What studies of the same phenotype exist?
II. Can a consortium or collaborative approach provide a study with adequate power for the initial GWAS, along with pre-planned replication studies?
III. Do any of these studies have pre-existing data that would increase power e.g. “free” controls for a prior GWAS of another phenotype?
IV. Is the phenotype defined in the same or similar manner?
V. Are covariate data available, and defined similarly?
VI. Do any of the studies have additional phenotypic information e.g. biomarkers that would create opportunities for “added value” analyses, if these are the subjects of the GWAS?