Selecting Initial GWAS and replication studies David Hunter Harvard School of Public Health

Selecting Initial GWAS and replication studies

David Hunter
Harvard School of Public Health
Brigham and Women’s Hospital
Broad Institute of MIT and Harvard

Initial Study for GWAS

• Cases and controls well matched with respect to ancestry to minimize population stratification

(restriction to one self-identified group)

• Genomic control or other methods

e.g. Eigenstrat (Price et al, 2006), may compensate for looser matching

Control of population stratification, e.g. hair color in Nurses’ Health Study (European ancestry)

Chi-squared inflation factors and Q-Q plots of –log10 p-values with no adjustment for population stratification and adjusting for the top four and fifty eigenvectors (Price et al, 2006)

45, 19 and 19 SNPs (respectively) with p < 10^-7 not shown

Kraft P, unpublished
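The chi-squared inflation factor mentioned on this slide can be sketched as follows. This is a minimal illustration of genomic control, assuming 1-df association statistics; the simulated statistics and the 10% inflation are hypothetical and are not the Nurses' Health Study data.

```python
import random
import statistics

# Median of a chi-squared distribution with 1 degree of freedom.
CHI2_1DF_MEDIAN = 0.4549364

def inflation_factor(chi2_stats):
    """Genomic-control lambda: observed median statistic / null median."""
    return statistics.median(chi2_stats) / CHI2_1DF_MEDIAN

def genomic_control(chi2_stats):
    """Divide every statistic by lambda (lambda is floored at 1)."""
    lam = max(1.0, inflation_factor(chi2_stats))
    return [x / lam for x in chi2_stats]

# Hypothetical data: null 1-df statistics, then the same statistics
# inflated by 10% to mimic mild population stratification.
rng = random.Random(0)
null_stats = [rng.gauss(0.0, 1.0) ** 2 for _ in range(20000)]
inflated = [1.10 * x for x in null_stats]

lam_null = inflation_factor(null_stats)
lam_inflated = inflation_factor(inflated)
corrected = genomic_control(inflated)

print(f"lambda (null):      {lam_null:.3f}")
print(f"lambda (inflated):  {lam_inflated:.3f}")
print(f"lambda (corrected): {inflation_factor(corrected):.3f}")
```

A lambda close to 1 suggests little residual stratification; adjusting for eigenvectors, as in Eigenstrat, is a different technique and is not shown here.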

Article: Genome-wide association study of 14,000 cases of seven common diseases and 3,000 shared controls
The Wellcome Trust Case Control Consortium
Nature 447, 661-678 (7 June 2007) | doi:10.1038/nature05911; Received 26 March 2007; Accepted 11 May 2007

http://www.nature.com.ezp1.harvard.edu/nature/journal/v447/n7145/full/nature05911.html#The%20Wellcome%20Trust%20Case%20Control%20Consortium



Conclusions

Broad matching on ancestry and region adequate for discovery of strongest hits

Statistical methods for control of population stratification (within populations of European ancestry) adequate to assist in discovery of strongest hits

Will more rigorous designs permit discovery of weaker associations?

When the signal-to-noise ratio is low, how does noise due to multiple comparisons compare with noise due to poor matching of controls? False negatives are the biggest problem (false positives can be dealt with via replication).

Criteria for follow-up of initial reports of genotype–phenotype associations

Replication studies should be of sufficient sample size to convincingly distinguish the proposed effect from no effect

Replication studies should preferably be conducted in independent data sets, to avoid the tendency to split one well-powered study into two less conclusive ones

The same or a very similar phenotype should be analysed

A similar population should be studied, and notable differences between the populations studied in the initial and attempted replication studies should be described

Similar magnitude of effect and significance should be demonstrated, in the same direction, with the same SNP or a SNP in perfect or very high linkage disequilibrium with the prior SNP (r2 close to 1.0)

Statistical significance should first be obtained using the genetic model reported in the initial study

When possible, a joint or combined analysis should lead to a smaller P-value than that seen in the initial report

A strong rationale should be provided for selecting SNPs to be replicated from the initial study, including linkage-disequilibrium structure, putative functional data or published literature

Replication reports should include the same level of detail for study design and analysis plan as reported for the initial study

Chanock, Manolio et al. Nature, June 7th 2007
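The first criterion (a replication sample large enough to distinguish the proposed effect from no effect) can be illustrated with a rough power calculation. This is a sketch under stated assumptions: a per-allele odds ratio, a two-sided comparison of allele frequencies with a normal approximation, and hypothetical inputs (control allele frequency 0.40, odds ratio 1.2, alpha 0.05). None of these values come from the slides.

```python
from statistics import NormalDist

def allele_freq_in_cases(p_control, odds_ratio):
    """Risk-allele frequency in cases implied by a per-allele odds ratio."""
    odds = odds_ratio * p_control / (1.0 - p_control)
    return odds / (1.0 + odds)

def replication_power(p_control, odds_ratio, n_cases, n_controls, alpha):
    """Approximate power of a two-sided allele-frequency comparison."""
    p_case = allele_freq_in_cases(p_control, odds_ratio)
    # Each subject contributes two alleles.
    se = (p_case * (1 - p_case) / (2 * n_cases)
          + p_control * (1 - p_control) / (2 * n_controls)) ** 0.5
    z_effect = abs(p_case - p_control) / se
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    # The negligible probability of rejecting in the wrong direction is ignored.
    return NormalDist().cdf(z_effect - z_crit)

# Hypothetical comparison of a small and a large replication study.
power_small = replication_power(0.40, 1.2, 500, 500, 0.05)
power_large = replication_power(0.40, 1.2, 4500, 4500, 0.05)

print(f"power with 500 cases / 500 controls:   {power_small:.2f}")
print(f"power with 4500 cases / 4500 controls: {power_large:.2f}")
```

Under these assumptions the small study has roughly even odds of missing a real effect, which is the situation the criteria warn against.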

Initial Study for GWAS: technical issues

• Standard advice – case and control samples handled exactly the same at every stage

• Source of DNA
  – Blood/buffy coat: mostly good results
  – Buccal cell: variable results (Feigelson et al. CEBP, 2007: encouraging)
  – Whole genome amplified DNA (Affy OK, Illumina in development)

Replication studies

For statistical replication, prefer:
• Similar phenotype
• Similar ancestry

For generalizability, prefer:
• Different populations
• Different ancestry backgrounds (may also help with fine mapping)

Study design?

Prospective

• Protect from survivor bias

• Protect from selection bias

• Interpretability of gene-environment analyses

• Possibility of interpretable biomarkers

Study quality?

Importance depends on strength of signal

• To date – little apparent relation between probability of replication and quality

• May matter more for weak signals

• Sample size may trump quality (within limits)

NCI BPC3 Results: 7909 cases, 8683 controls

Cohort            Genotype  Cases / Controls  OR (99% CI)       P-value
All (phet=0.483)  CC        5,566 / 6,666     Ref.              4.00 x 10^-19
                  AC        2,064 / 1,842     1.33 (1.20-1.46)
                  AA        279 / 175         1.87 (1.44-2.42)
ACS               CC        871 / 955         Ref.              2.63 x 10^-5
                  AC        238 / 166         1.56 (1.17-2.08)
                  AA        21 / 9            2.61 (0.92-7.37)
ATBC              CC        606 / 623         Ref.              0.012
                  AC        312 / 260         1.23 (0.95-1.60)
                  AA        45 / 25           1.81 (0.94-3.51)
EPIC              CC        551 / 869         Ref.              0.258
                  AC        169 / 233         1.17 (0.87-1.58)
                  AA        12 / 12           1.57 (0.53-4.59)
HPFS              CC        495 / 545         Ref.              3.63 x 10^-3
                  AC        157 / 114         1.53 (1.07-2.19)
                  AA        11 / 6            2.09 (0.56-7.80)
MEC               CC        1,426 / 1,565     Ref.              2.58 x 10^-7
                  AC        728 / 614         1.32 (1.11-1.58)
                  AA        146 / 88          1.89 (1.30-2.75)
PHS               CC        801 / 1,123       Ref.              0.013
                  AC        200 / 220         1.27 (0.96-1.69)
                  AA        21 / 15           2.06 (0.83-5.12)
PLCO              CC        816 / 986         Ref.              0.014
                  AC        260 / 235         1.33 (1.02-1.72)
                  AA        23 / 20           1.39 (0.63-3.10)

rs1447295: overall P for trend = 4 x 10^-19

Schumacher et al. Can Res, April 2007
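The genotype-specific odds ratios in the BPC3 table can be approximated directly from the pooled case and control counts. The sketch below computes crude (unadjusted) odds ratios with a Woolf-type 99% confidence interval from the "All" rows. It assumes no adjustment for cohort or other covariates, so its output is expected to differ slightly from the published estimates; the function name is illustrative.

```python
import math
from statistics import NormalDist

def crude_odds_ratio(cases_geno, controls_geno, cases_ref, controls_ref,
                     level=0.99):
    """Crude odds ratio versus the reference genotype, with a Woolf CI."""
    odds_ratio = (cases_geno * controls_ref) / (cases_ref * controls_geno)
    # Standard error of the log odds ratio from the four cell counts.
    se = math.sqrt(1 / cases_geno + 1 / controls_geno
                   + 1 / cases_ref + 1 / controls_ref)
    z = NormalDist().inv_cdf(0.5 + level / 2)
    lower = math.exp(math.log(odds_ratio) - z * se)
    upper = math.exp(math.log(odds_ratio) + z * se)
    return odds_ratio, lower, upper

# "All" rows of the table: CC is the reference genotype.
or_ac = crude_odds_ratio(2064, 1842, 5566, 6666)
or_aa = crude_odds_ratio(279, 175, 5566, 6666)

print("AC vs CC: %.2f (%.2f-%.2f)" % or_ac)
print("AA vs CC: %.2f (%.2f-%.2f)" % or_aa)
```

The crude AC estimate comes out near 1.34, close to the reported 1.33, which suggests that the adjustment used in the published analysis had little effect on this contrast.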

a, rs2981582; b, rs3803662; c, rs889312; d, rs13281615; and e, rs3817198

Forest plots of the per-allele odds ratios for each of the five SNPs reaching genome-wide significance for breast cancer. Easton et al. Nature, May 2007

FGFR2

Cancer Genetic Markers of Susceptibility (CGEMS):

http://cgems.cancer.gov

Initial GWAS Study: 1150 cases / 1150 controls; 540,000 Tag SNPs

Follow-up Study #1: 4500 cases / 4500 controls; ~28,000 SNPs

Follow-up Study #2: 3500 cases / 3500 controls; at least 1,500 SNPs

Fine Mapping: 30 ± 20 loci

General Strategy for Multistage analysis of Prostate & Breast Cancer
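The SNP counts in the multistage design imply how aggressively each stage filters. The short sketch below only does that arithmetic; the slides do not state the selection rule, so reading the fraction as a rank-based p-value cutoff is an assumption made here for illustration.

```python
# SNP counts per stage, as listed on the slide.
stages = [
    ("Initial GWAS", 540_000),
    ("Follow-up Study #1", 28_000),
    ("Follow-up Study #2", 1_500),
]

def carry_forward_fractions(stage_list):
    """Fraction of SNPs advanced from each stage to the next."""
    pairs = zip(stage_list, stage_list[1:])
    return [(a, b, n_b / n_a) for (a, n_a), (b, n_b) in pairs]

fractions = carry_forward_fractions(stages)
for source, target, frac in fractions:
    print(f"{source} -> {target}: {frac:.1%} of SNPs carried forward")
```

If SNPs were ranked purely by p-value and nearly all were null, advancing about 5% per stage would correspond to a per-stage threshold near p = 0.05, far more liberal than a genome-wide threshold. That is the point of replicating in successive independent samples.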

Committed Studies CGEMS

Prostate Cancer: PLCO (GWAS), ACS, HPFS, PHS, ATBC, CeRePP, EPIC, MEC

Breast Cancer: NHS (GWAS), PLCO, WHI, Polish C/C, ACS, EPIC, MEC

CGEMS: caBIG Posting Pre-Computed Analysis

Pre-computed analysis: no restrictions

Raw genotype, case/control, age (in 5 yrs), family Hx (+/-): registration

http://cgems.cancer.gov/data

Association tests: Prostate 10/06, Breast 04/07; ~528,000 SNPs (Illumina 550k)

Instant Replication!

Additional In silico replication possibilities

dbGAP ncbi.nlm.nih.gov/dbgap

Framingham nhlbi.nih.gov/about/framingham

WTCCC wtccc.org.uk

DGI broad.mit.edu/diabetes

[Figure: genome-wide plot of log10(p-value) by chromosome (1-22 and X, p and q arms), with the FGFR2 locus labeled.]

The six SNPs with the smallest P values of the 528,173 tested among 1,145 cases of postmenopausal invasive breast cancer and 1,141 controls (full results available at http://cgems.cancer.gov ).

   SNP ID       Chi-squared*  P*         ORhet*  ORhomo*  Chromosome  Gene
1. rs10510126   25.37         0.0000031  0.59    0.62     10
2. rs1219648    23.56         0.0000076  1.24    1.81     10          FGFR2
3. rs17157903   23.39         0.0000083  1.60    0.79     7           RELN
4. rs2420946    23.17         0.0000095  1.25    1.81     10          FGFR2
5. rs7696175    22.40         0.0000137  1.38    0.86     4           TLR1, TLR6
6. rs12505080   21.99         0.0000168  1.21    0.52     4

*From analyses adjusting for age, matching factors (see Methods), and three eigenvectors of the principal components identified by Eigenstrat. P value obtained by a score test with 2df.
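The P values in this table follow directly from the 2-df score statistics, because the upper tail of a chi-squared distribution with 2 degrees of freedom is exp(-x/2). The sketch below reproduces them and compares each against a Bonferroni threshold for 528,173 tests; the 0.05 family-wise alpha is an assumption made here, not a figure from the slides.

```python
import math

def p_value_2df(chi2):
    """Upper-tail probability of a chi-squared statistic with 2 df."""
    return math.exp(-chi2 / 2.0)

N_TESTS = 528_173
ALPHA = 0.05  # assumed family-wise error rate
bonferroni_threshold = ALPHA / N_TESTS

# Score statistics copied from the table.
top_hits = {
    "rs10510126": 25.37,
    "rs1219648": 23.56,
    "rs17157903": 23.39,
    "rs2420946": 23.17,
    "rs7696175": 22.40,
    "rs12505080": 21.99,
}

p_values = {snp: p_value_2df(x) for snp, x in top_hits.items()}
passing = [snp for snp, p in p_values.items() if p < bonferroni_threshold]

for snp, p in p_values.items():
    print(f"{snp}: P = {p:.2e}")
print(f"Bonferroni threshold: {bonferroni_threshold:.2e}")
print(f"SNPs passing: {passing}")
```

None of the six SNPs clears that threshold in the initial scan alone, so the case for FGFR2 rests on the replication results.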

Hunter et al, Nat Gen, May 2007

http://cgems.cancer.gov/


[Figure 2: log10(p-value) plotted against chromosomal position (axis ticks 123.2 to 123.4) around FGFR2.]

Scatterplot of P values for the FGFR2 locus from the GWAS.

Results of associations of rs1219648 in the Nurses’ Health Study, Nurses’ Health Study 2, and the PLCO study.

Study population                     Allele frequency (%)  ORhet             ORhomo            Ptrend
(N cases/N controls)                 Cases    Controls     (95% CI)          (95% CI)
Nurses’ Health Study (1,145/1,141)   45.54    38.47        1.24 (1.04-1.50)  1.81 (1.43-2.31)  2.0 x 10^-6
Nurses’ Health Study 2 (302/594)     48.18    40.57        1.29 (0.95-1.75)  1.93 (1.31-2.86)  0.002
PLCO (919/922)                       44.50    41.49        1.06 (0.86-1.30)  1.22 (0.94-1.58)  0.13
ACS CPS-II (555/556)                 44.95    37.41        1.32 (1.02-1.72)  2.06 (1.42-2.97)  0.0002
Pooled estimates (2,921/3,213)                             1.20 (1.07-1.34)  1.64 (1.42-1.90)  1.1 x 10^-10
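The pooled odds ratios can be approximated from the study-specific estimates by inverse-variance weighting on the log scale. The slides do not say how the pooled estimates were obtained, so the fixed-effect method below is an assumption. Standard errors are recovered from the reported 95% confidence intervals, which introduces some rounding error.

```python
import math

Z_95 = 1.959964  # standard normal quantile for a two-sided 95% interval

def log_or_and_se(odds_ratio, lower, upper):
    """Log odds ratio and its standard error, recovered from a 95% CI."""
    se = (math.log(upper) - math.log(lower)) / (2 * Z_95)
    return math.log(odds_ratio), se

def fixed_effect_pool(estimates):
    """Inverse-variance weighted pooled odds ratio with a 95% CI."""
    total_weight = 0.0
    weighted_sum = 0.0
    for odds_ratio, lower, upper in estimates:
        beta, se = log_or_and_se(odds_ratio, lower, upper)
        weight = 1.0 / se ** 2
        total_weight += weight
        weighted_sum += weight * beta
    pooled = weighted_sum / total_weight
    pooled_se = 1.0 / math.sqrt(total_weight)
    return (math.exp(pooled),
            math.exp(pooled - Z_95 * pooled_se),
            math.exp(pooled + Z_95 * pooled_se))

# (OR, lower, upper) for NHS, NHS2, PLCO and ACS CPS-II, from the table.
het = [(1.24, 1.04, 1.50), (1.29, 0.95, 1.75),
       (1.06, 0.86, 1.30), (1.32, 1.02, 1.72)]
homo = [(1.81, 1.43, 2.31), (1.93, 1.31, 2.86),
        (1.22, 0.94, 1.58), (2.06, 1.42, 2.97)]

pooled_het = fixed_effect_pool(het)
pooled_homo = fixed_effect_pool(homo)

print("Pooled ORhet:  %.2f (%.2f-%.2f)" % pooled_het)
print("Pooled ORhomo: %.2f (%.2f-%.2f)" % pooled_homo)
```

Under this assumption the pooled point estimates land close to the reported 1.20 and 1.64.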


UNFINISHED AGENDA
Where is the causal variant?
What does this tell us about mechanisms of breast carcinogenesis?

THE HITS KEEP COMING….

UNFINISHED EPIDEMIOLOGIC/PUBLIC HEALTH AGENDA
Gene-environment interaction: what do the genes tell us about environmental exposures?
Gene-gene interaction
Pathway analysis
Clinical implications – risk stratification for screening? Intervention?
Health policy implications?

Much of the substrate data – publicly available or relatively cheap.

http://www.boston.com/sports/baseball/redsox/articles/2007/04/23/recounting_3d_inning_barrage_is_a_real_blast

NHS/HPFS/PHS GENETIC STUDIES

Immaculata De Vivo, Peter Kraft, Hardeep Ranu, Crystal Arnone, Carolyn Guo, Pati Soule, Craig Labadie, Jiali Han, Monica Macgrath, Chunyan He, Patrick Dennett, David Cox, Tim Niu, Aditi Hazra, Fred Schumacher

NHS/HPFS: Sue Hankinson, Shelley Tworoger, Eric Rimm, Frank Hu, Meir Stampfer, Walt Willett, Frank Speizer, Charles Fuchs, Ed Giovannucci, Andy Chan, Debra Schaumberg, Fran Grodstein, Jae Hee Kang

PHS: Jing Ma, Mike Gaziano, P Ridker

NCI BPC3 STEERING COMMITTEE:
Harvard: David Hunter, Michael Gaziano, Julie Buring, Graham Colditz, Walter Willett
EPIC, CEPH, Cambridge: Elio Riboli, Rudolf Kaaks, Federico Canzian, Gilles Thomas
ACS: Michael Thun, Heather Feigelson, Jeanne Calle
NCI: Richard Hayes, Demetrius Albanes, Bob Hoover, Stephen Chanock; Program - Mukesh Verma
MEC & Broad: Brian Henderson, Laurence Kolonel, David Altshuler, Malcolm Pike

SECRETARIAT: David Hunter, Elio Riboli

GENOMICS subgroup: David Altshuler (Chair), Steve Chanock, Gilles Thomas

STATISTICS subgroup: Dan Stram (Chair), Peter Kraft, Rudolf Kaaks, Paul Pharoah, Malcolm Pike, Gilles Thomas, Sholom Wacholder

Harvard cohorts

ACS cohort

EPIC cohorts

Multiethnic Cohort

PLCO cohort

ATBC cohort

CEPH

BROAD INSTITUTE

NCI Core Gen Facility

Genotyping subgroup: Chris Haiman (Chair), Federico Canzian, Alison Dunning, Steve Chanock, David Cox, David Hunter, Loic LeMarchand, James Mackay

PUBLICATIONS COMMITTEE: Michael Thun (Chair), Elio Riboli, Brian Henderson, David Hunter, Graham Colditz, Richard Hayes, Demetrius Albanes

CGEMS Acknowledgements

NCI: Stephen Chanock, Gilles Thomas, Robert Hoover, Joseph Fraumeni, Daniela Gerhard, Kevin Jacobs, Zhaoming Wang, Meredith Yeager, Robert Welch, Richard Hayes, Sholom Wacholder, Nilanjan Chatterjee, Kai Yu, Margaret Tucker, Marianne Rivera-Silva, NCICB

HSPH: David Hunter, Peter Kraft, Fred Schumacher, David Cox

ACS: Heather Feigelson, Carmen Rodriguez, Eugenia Calle, Michael Thun

PLCO: Regina Ziegler, Chris Berg, Saundra Buys, Chris MacCarty

Selecting initial and replication samples from existing studies

I. What studies of the same phenotype exist?

II. Can a consortium or collaborative approach provide a study with adequate power for the initial GWAS, along with pre-planned replication studies?

III. Do any of these studies have pre-existing data that would increase power e.g. “free” controls for a prior GWAS of another phenotype?

IV. Is the phenotype defined in the same or similar manner?

V. Are covariate data available, and defined similarly?

VI. Do any of the studies have additional phenotypic information e.g. biomarkers that would create opportunities for “added value” analyses, if these are the subjects of the GWAS?
