
Page 1: Association Analysis Association Analysis Outlineegp.gs.washington.edu/.../download/Day2_145pm_Crawford.pdf · 2009. 8. 24. · Dana Crawford, PhD Vanderbilt University Center for


Association Analysis

University of Louisville
Center for Genetics and Molecular Medicine

January 11, 2008

Dana Crawford, PhD
Vanderbilt University
Center for Human Genetics Research

Association Analysis Outline

• Study Design
• SNPs versus Haplotypes
• Analysis Methods
• Candidate Gene
• Whole Genome Analysis
• Replication and Function

Study Design
Does your trait or phenotype have a genetic component?

• Segregation analysis

• Recurrence risks

• Heritability

• Other sources of evidence for a genetic component

Classic Segregation Analysis

• Determines if a major gene is involved

• Compares data to Mendelian models, such as
  – Autosomal dominant
  – Autosomal recessive
  – X-linked

• Results can be used as parameters for linkage analysis (e.g. parametric LOD)

• Subject to ascertainment bias

Note: More complex methods needed for complex traits


Recurrence Risks

The chance that a disease present in the family will recur in that family

“Lightning striking twice”

If recurrence risk is greater in the family compared with unrelated individuals, the disease has a “genetic” component

Suggests familial aggregation

Recurrence Risks

Measured using the risk ratio ($\lambda$)

Sibling risk ratio = $\lambda_s$

$\lambda_s = \dfrac{\text{sibling recurrence risk}}{\text{population prevalence}}$

Cystic fibrosis $\lambda_s = 0.25/0.0004 = 625$

Huntington disease $\lambda_s = 0.50/0.0001 = 5000$
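The risk ratio above is a single division, so it is easy to check by hand or in code. The following is a minimal sketch; the function name `sibling_risk_ratio` is made up for illustration, and the inputs are the recurrence risks and prevalences quoted on this slide.

```python
def sibling_risk_ratio(sibling_recurrence_risk: float,
                       population_prevalence: float) -> float:
    """Return the sibling risk ratio: recurrence risk / population prevalence."""
    if population_prevalence <= 0:
        raise ValueError("population prevalence must be positive")
    return sibling_recurrence_risk / population_prevalence


# Inputs taken from the slide (recurrence risk, population prevalence).
examples = {
    "Cystic fibrosis": (0.25, 0.0004),
    "Huntington disease": (0.50, 0.0001),
}

for disease, (risk, prevalence) in examples.items():
    print(disease, round(sibling_risk_ratio(risk, prevalence)))
```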

Recurrence Risks: Complex traits

$\lambda$ here is for first-degree relatives
Merikangas and Risch (2003) Science 302:599-601.

Heritability

Think “twin studies”

The proportion of phenotypic variation in a population attributable to genetic variation

Quantitative traits

Heritability measured as $h^2$

(Can also be family studies)


Heritability and Quantitative Traits

Determined by genes and environment

[Figure: height for boys and girls, with panels for Mexican Americans, Blacks, and Whites]

Example: Height

NHANES 1971-1974 versus NHANES 1999-2002

Freedman et al (2006) Obesity 14:301-308

Heritability and Quantitative Traits

Trait variation = genetic + environment
$\sigma_T^2 = \sigma_G^2 + \sigma_E^2$

Genetic variation = additive + dominant
$\sigma_G^2 = \sigma_a^2 + \sigma_d^2$

Environmental variation = familial/household + random/individual
$\sigma_E^2 = \sigma_f^2 + \sigma_e^2$

Broad-sense heritability
$h_B^2 = \sigma_G^2 / \sigma_T^2$

Narrow-sense heritability
$h_N^2 = \sigma_a^2 / \sigma_T^2$

Heritability and Twins Studies

$h^2 = 2(r_{MZ} - r_{DZ})$,

where $r$ is the correlation coefficient of the trait within twin pairs

Monozygotic twins share the same genetic material (~100%)

Dizygotic twins share about half their genetic material (~50%)

Heritability and Twins Studies

Trait             r(MZ)   r(DZ)   Reference
Cholesterol       0.76    0.39    Fenger et al
SBP               0.60    0.32    Evans et al
BMI               0.67    0.32    Schousboe et al
Perceived pitch   0.67    0.44    Drayna et al
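As a worked illustration, the twin-study formula h2 = 2(rMZ - rDZ) can be applied to the correlations in this table. This is only a sketch: the function name `falconer_h2` is chosen here for illustration, and the estimates are simple arithmetic on the tabulated values, not figures reported by the cited studies.

```python
def falconer_h2(r_mz: float, r_dz: float) -> float:
    """Heritability estimate from twin correlations: h2 = 2 * (r_MZ - r_DZ)."""
    return 2.0 * (r_mz - r_dz)


# (r_MZ, r_DZ) pairs copied from the table above.
twin_correlations = {
    "Cholesterol": (0.76, 0.39),
    "SBP": (0.60, 0.32),
    "BMI": (0.67, 0.32),
    "Perceived pitch": (0.67, 0.44),
}

h2_estimates = {
    trait: round(falconer_h2(r_mz, r_dz), 2)
    for trait, (r_mz, r_dz) in twin_correlations.items()
}

for trait, h2 in h2_estimates.items():
    print(f"{trait}: h2 = {h2:.2f}")
```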


Heritability: Is everything genetic?

Trait           r(MZ)   r(DZ)   Reference
Vote choice     0.81    0.69    Hatemi et al
Religiousness   0.62    0.42    Koenig et al

Other Evidence For A Genetic Component

Monogenic disorders

Example:
Phenotype of interest is sensitivity to warfarin dosing, but there are no heritability estimates

Solution:
Rare, familial disorder of warfarin resistance

Other Evidence For A Genetic Component

Case Reports

Example:
Phenotype of interest is susceptibility to Neisseria meningitidis (prevalence: 1/100,000)

Solution:
Case report of recurrent N. meningitidis in patient

Other Evidence For A Genetic Component

• Animal models

• Biochemistry or biological pathways

• Expression data

• Previous genetic association studies

Other good arguments…


Study Design
How well can you diagnose the disease or measure the trait?

• Narrow definitions better than all-inclusive definitions
  There are many paths that lead to the same phenotype

• Avoid misclassification and measurement error
  Direct measurement versus recall/survey data or indirect proxies

• Be aware of age of onset
  Can your control become a case over time?

Arguably most important step in study design

Target Phenotypes
Disease or Quantitative trait?

[Figure: diagram with nodes MI, CRP, LDL-C, IL6, LDLR, acute illness, and diet]

Carlson et al. (2004) Nature 429:446-452

Note: SNPs associated with quantitative traits may not be associated with clinical endpoint

Study Design
How many cases and controls will you need to detect an association?

Statistical Power

• Null hypothesis: all alleles confer equal risk

• Given that a risk allele exists, how likely is a study to reject the null?

• Study sample size is ideally determined before you begin to recruit and genotype

• Statistical significance
  – Significance = p(false positive)
  – Traditional threshold 5%

• Statistical power
  – Power = 1 - p(false negative)
  – Traditional threshold 80%

• Traditional thresholds balance confidence in results against reasonable sample size
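To make the significance and power thresholds concrete, here is a rough sketch of a power calculation for a single SNP. It is not the method used by Quanto or the Genetic Power Calculator: it assumes a simple two-sided two-proportion z-test on allele counts (two independent alleles per person), and the function names and the example allele frequencies are invented for illustration.

```python
from math import sqrt
from statistics import NormalDist

_STD_NORMAL = NormalDist()


def allelic_test_power(p_case, p_control, n_cases, n_controls, alpha=0.05):
    """Approximate power of a two-sided z-test comparing allele frequencies."""
    for p in (p_case, p_control):
        if not 0.0 < p < 1.0:
            raise ValueError("allele frequencies must lie strictly between 0 and 1")
    n1, n2 = 2 * n_cases, 2 * n_controls  # two alleles per person (assumption)
    z_crit = _STD_NORMAL.inv_cdf(1.0 - alpha / 2.0)
    p_pooled = (n1 * p_case + n2 * p_control) / (n1 + n2)
    se_null = sqrt(p_pooled * (1.0 - p_pooled) * (1.0 / n1 + 1.0 / n2))
    se_alt = sqrt(p_case * (1.0 - p_case) / n1 + p_control * (1.0 - p_control) / n2)
    diff = abs(p_case - p_control)
    # Probability that the test statistic lands in either rejection tail.
    upper_tail = 1.0 - _STD_NORMAL.cdf((z_crit * se_null - diff) / se_alt)
    lower_tail = _STD_NORMAL.cdf((-z_crit * se_null - diff) / se_alt)
    return upper_tail + lower_tail


def cases_needed(p_case, p_control, target_power=0.80, alpha=0.05, step=10):
    """Smallest multiple of `step` cases (with equal controls) reaching the target power."""
    if p_case == p_control:
        raise ValueError("no effect to detect: allele frequencies are equal")
    n = step
    while allelic_test_power(p_case, p_control, n, n, alpha) < target_power:
        n += step
    return n


# Hypothetical frequencies: risk allele at 25% in cases and 20% in controls.
print(cases_needed(0.25, 0.20))
```

With no frequency difference the function returns roughly the significance threshold itself, which is a quick sanity check on the approximation.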

Study Design
What are the thresholds/variables in a general power calculation?

Note: Significance threshold for 1 SNP tested


Study Design

Power Calculation Resources

• Quanto (hydra.usc.edu/gxe/)
  Supports quantitative and discrete traits (unrelated and family based)

• Genetic Power Calculator (pngu.mgh.harvard.edu/~purcell/gpc/)
  Supports discrete traits, variance components, and quantitative traits for linkage and association studies

(List of other software: linkage.rockefeller.edu/soft/)

Study Design
How can you maximize power for your study?

• Large sample size
  – Better estimate of variability or risk
  – Chance of misclassification / measurement error

• Large genetic effect size
  – SNP risk allele with large odds ratio or explains a lot of trait variance
  – This is unknown at beginning of study

• Risk SNP is common
  – This is unknown at beginning of study
  – Calculate power for a range of common MAFs (5-45%)

• Genotype the risk SNP directly
  – Risk SNP is unknown at beginning of study
  – Remember tagSNPs are imperfect proxies
  – Adjust sample size by $1/r^2$
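The 1/r2 adjustment in the last bullet can be sketched as follows. The function name `tag_adjusted_sample_size` and the example numbers are hypothetical; the only rule taken from the slide is that the sample size needed when genotyping the risk SNP directly is multiplied by 1/r2 when a tagSNP is used instead.

```python
import math


def tag_adjusted_sample_size(n_direct: int, r2: float) -> int:
    """Inflate a sample size by 1/r2 to compensate for an imperfect tagSNP."""
    if not 0.0 < r2 <= 1.0:
        raise ValueError("r2 must be in the interval (0, 1]")
    # Round before taking the ceiling to avoid floating-point overshoot.
    return math.ceil(round(n_direct / r2, 9))


# Hypothetical: 1000 cases suffice if the risk SNP is genotyped directly.
for r2 in (1.0, 0.8, 0.5):
    print(r2, tag_adjusted_sample_size(1000, r2))
```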

Study Design

Power calculation example:
Cases: Adverse reaction (wheezing) to flu vaccination
Controls: Vaccinated children with no adverse reactions

[Figure: sample size (cases, 0 to 160) versus genotype relative risk (2 to 6, additive model), one curve per MAF (0.05, 0.1, 0.15, 0.2, 0.25)]

Calculated using Quanto 1.1.1

Study Design

Power calculation example:
Immunogenicity to influenza A (H5N1) vaccine

[Figure: sample size (0 to 900) versus R2 (0.01 to 0.49, additive model)]

Calculated using Quanto 1.1.1


Study Design
Why are you considering an association study instead of linkage?

• Linkage analysis is powerful for disorders with
  – Discernible pattern of inheritance
  – Rare alleles with large genetic effect sizes
  – High penetrance

• Not powerful for disorders that
  – have a complex pattern of inheritance
  – are common
  – have many risk alleles with small effect sizes
  – have low penetrance

Study Design

Common variant/common disease hypothesis

• Common genetic variants confer susceptibility

• Risk-conferring alleles are ancient and common across most populations

• Each risk-conferring allele has a small effect

• Multiple risk alleles are expected for a common disease, along with environmental factors

Study Design
Should you design a candidate gene or whole genome study?

• Candidate gene association study
  – Interrogate specific genes or regions
  – Based on previous knowledge or biological plausibility
  – Hypothesis testing

• Whole genome association study
  – Interrogate the “entire” genome
  – No previous knowledge required
  – Hypothesis generation

Candidate gene association studies

• Choose gene based on previous knowledge
  – Gene function
  – Biological pathway
  – Previous linkage or association study

• Choose DNA variations for genotyping
  – Direct association approach
  – Indirect association approach


Direct Candidate Gene Association Study

Genotype “functional” SNPs

Collins et al (1997) Science 278:1580-1581

Example: Nonsynonymous SNPs

Direct Candidate Gene Association Study

Botstein and Risch (2003) Nat Genet 33 Suppl:228-37.

Problem: We don’t know what is functional and what is not functional

Direct Candidate Gene Association Study

What would we miss?

Functional synonymous SNPs in MDR1 alter P-glycoprotein activity

Komar (2007) Science 315:466-467

Direct Candidate Gene Association Study

What would we miss?

• 99% of the human genome is non-coding

• Non-coding SNPs or DNA variations in
  – Introns
  – Intergenic regulatory regions


Indirect Candidate Gene Association Study

• Genotype a fraction of all SNPs regardless of “function”

• Rely on SNP-SNP correlations (linkage disequilibrium) to capture information for SNPs not genotyped

Kruglyak (2005) Nat Genet 37:1299-1300

Indirect Candidate Gene Association Study

Linkage disequilibrium (LD)

Measured by $r^2$

$r^2 = \dfrac{[f(A_1B_1) - f(A_1)f(B_1)]^2}{f(A_1)f(A_2)f(B_1)f(B_2)}$

$r^2 = 0$: SNPs are independent
$r^2 = 1$: SNPs are perfectly correlated AND have the same minor allele frequency
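The r2 formula above can be checked numerically. This is a minimal sketch for two biallelic SNPs; the function name `ld_r2` and the example frequencies are made up, and f(A2) and f(B2) are taken to be 1 - f(A1) and 1 - f(B1).

```python
def ld_r2(f_a1b1: float, f_a1: float, f_b1: float) -> float:
    """r2 between two biallelic SNPs from haplotype and allele frequencies."""
    f_a2, f_b2 = 1.0 - f_a1, 1.0 - f_b1
    denominator = f_a1 * f_a2 * f_b1 * f_b2
    if denominator <= 0.0:
        raise ValueError("both SNPs must be polymorphic")
    d = f_a1b1 - f_a1 * f_b1  # deviation from independence
    return d * d / denominator


# Independent SNPs: haplotype frequency equals the product of allele frequencies.
print(ld_r2(0.3 * 0.4, 0.3, 0.4))
# Perfectly correlated SNPs with equal allele frequencies.
print(ld_r2(0.3, 0.3, 0.3))
# A1 always occurs with B1, but allele frequencies differ, so r2 stays below 1.
print(ld_r2(0.3, 0.3, 0.5))
```

The third call illustrates the point on the slide: r2 reaches 1 only when the two SNPs also have the same minor allele frequency.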

Indirect Candidate Gene Association Study

Using LD to pick “tagSNPs”

CRP, European-descent: 10 SNPs with MAF >5%

CRP, European-descent: 4 tagSNPs at $r^2 > 0.80$
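One way to picture tagSNP selection is a greedy pass over pairwise r2 values: repeatedly pick the SNP that covers the most still-untagged SNPs at the chosen threshold. The sketch below is an assumption about how such a selection could work, not necessarily the procedure behind the CRP example, and the SNP names and r2 values are invented.

```python
from typing import Dict, List, Tuple


def greedy_tag_snps(snps: List[str],
                    pair_r2: Dict[Tuple[str, str], float],
                    threshold: float = 0.8) -> List[str]:
    """Pick tagSNPs so every SNP has r2 >= threshold with at least one tag."""
    if not 0.0 < threshold <= 1.0:
        raise ValueError("threshold must be in the interval (0, 1]")

    def r2(a: str, b: str) -> float:
        if a == b:
            return 1.0
        # Pairs may be stored in either order; unlisted pairs count as r2 = 0.
        return pair_r2.get((a, b), pair_r2.get((b, a), 0.0))

    untagged = list(snps)
    tags: List[str] = []
    while untagged:
        # Choose the SNP covering the most untagged SNPs (ties: input order).
        best = max(untagged,
                   key=lambda s: sum(r2(s, t) >= threshold for t in untagged))
        tags.append(best)
        untagged = [t for t in untagged if r2(best, t) < threshold]
    return tags


# Hypothetical pairwise r2 values for five SNPs.
example_r2 = {("s1", "s2"): 0.95, ("s1", "s3"): 0.90,
              ("s2", "s3"): 0.85, ("s4", "s5"): 0.30}
print(greedy_tag_snps(["s1", "s2", "s3", "s4", "s5"], example_r2))
```

Because r2 patterns differ between populations, running the same selection on frequencies from another population can return a different and larger tag set.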

Indirect Candidate Gene Association Study

“tagSNPs” are population specific

CRP, European-descent: 4 tagSNPs

CRP, African-descent: 10 tagSNPs


Indirect Candidate Gene Association Study

• “tagSNPs” are population specific

• Merge sets for “cosmopolitan” set

http://gvs.gs.washington.edu/GVS/

Indirect Candidate Gene Association Study

Multiple testing

• Testing many SNPs for association with disease status

• No consensus on correcting p-value
  – Bonferroni
  – False Discovery Rate

• Need to replicate findings in independent study
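The two corrections named above can be sketched in a few lines. The Bonferroni rule compares each p-value with alpha divided by the number of tests; the false discovery rate version shown here is the Benjamini-Hochberg step-up procedure, which is one common choice and an assumption on my part since the slide does not name a specific method. The p-values are invented.

```python
from typing import List


def bonferroni(p_values: List[float], alpha: float = 0.05) -> List[bool]:
    """Reject test i when p_i <= alpha / (number of tests)."""
    m = len(p_values)
    return [p <= alpha / m for p in p_values]


def benjamini_hochberg(p_values: List[float], q: float = 0.05) -> List[bool]:
    """Benjamini-Hochberg step-up procedure controlling the FDR at level q."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    cutoff_rank = 0
    for rank, index in enumerate(order, start=1):
        if p_values[index] <= rank * q / m:
            cutoff_rank = rank  # keep the largest rank that passes
    rejected = [False] * m
    for rank, index in enumerate(order, start=1):
        if rank <= cutoff_rank:
            rejected[index] = True
    return rejected


# Hypothetical p-values for ten SNPs tested in one candidate gene.
p_values = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205, 0.212, 0.216]
print(sum(bonferroni(p_values)), "SNPs pass Bonferroni")
print(sum(benjamini_hochberg(p_values)), "SNPs pass Benjamini-Hochberg")
```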

Indirect Candidate Gene Association Study: Pros and Cons

• Can interrogate all common SNPs in gene

• SNPs must be known and genotypes available to calculate LD and pick tagSNPs

• Multiple testing within a gene

• Limited to previous knowledge

Whole Genome Association Study

• Can now genotype 100K – 1 million SNPs

• Coverage depends on platform and chip
  – tagSNPs capturing HapMap common SNPs
  – Genic SNPs overrepresented
  – Conserved non-coding SNPs represented
  – Evenly spaced across genome

Platforms: Illumina Infinium assay, Affymetrix GeneChips


Whole Genome Association Study

• Same study design and challenges as candidate gene
  – Mostly case-control (retrospective)
  – Multiple testing

• Data storage and higher-order interaction testing issues

• Hypothesis generation tool (replication)

Manolio et al. Nature Reviews Genetics 7, 812–820 (October 2006)

Case/Control Study Designs: Pros and Cons
For either candidate gene or whole genome

Study          Pros                Cons
Case/Control   Easier to collect   Subject to bias
               Less expensive      No risk estimates
Prospective    Risk estimates      Harder to collect
                                   More expensive
                                   Subject to bias

For rare outcomes, case/control design may be only option

Case/Control Study Designs: Pros and Cons

Types of bias

• Bias in selection of cases
  – Those that are currently living
  – Miss fatal or short episodes of disease
  – Might miss mild diseases
  – Referral/admission bias

• Non-response bias
• Exposure suspicion bias
• Family information bias
• Recall bias

Manolio et al. Nature Reviews Genetics 7, 812–820 (October 2006)

Often ignored in genetic association studies


Analysis Methods

Genotype QC

• Test for departures from Hardy-Weinberg Equilibrium

• Test for gender inconsistencies

• Eliminate very rare SNPs (no power)

• Eliminate SNPs with low genotyping efficiency

• Eliminate samples with low genotyping efficiency
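The first QC step can be illustrated with a chi-square goodness-of-fit test for Hardy-Weinberg proportions. This is a sketch under simple assumptions (one biallelic SNP, 1 degree of freedom, genotype counts invented for the example); the function name `hwe_chi_square` is hypothetical, and an exact test may be preferable when counts are small.

```python
from math import erfc, sqrt
from typing import Tuple


def hwe_chi_square(n_aa: int, n_ab: int, n_bb: int) -> Tuple[float, float]:
    """Chi-square statistic and p-value (1 df) for departure from HWE."""
    n = n_aa + n_ab + n_bb
    if n == 0:
        raise ValueError("no genotypes supplied")
    p = (2 * n_aa + n_ab) / (2 * n)  # frequency of allele A
    q = 1.0 - p
    if p == 0.0 or q == 0.0:
        raise ValueError("SNP is monomorphic; HWE test is undefined")
    expected = (n * p * p, 2 * n * p * q, n * q * q)
    observed = (n_aa, n_ab, n_bb)
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    # Survival function of a chi-square variable with 1 degree of freedom.
    p_value = erfc(sqrt(chi2 / 2.0))
    return chi2, p_value


# Hypothetical genotype counts: one SNP in HWE, one with no heterozygotes.
print(hwe_chi_square(25, 50, 25))
print(hwe_chi_square(50, 0, 50))
```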

Analysis Methods
What statistical methods do you use to analyze your data?

• SNP by SNP (borrowed from epidemiology)
  – Chi-square and Fisher’s exact (2x2 table, 2x3 table)
  – Logistic and linear regression (covariates)

• Haplotypes
  – Haplo.stats and regression

• Interactions
  – Traditional regression
  – MDR (Ritchie et al)

Analysis Methods

               Case   Control
Minor allele   A      B
Major allele   C      D

Odds ratio (OR) = ratio of the odds of the minor allele in cases (A/C) to the odds in controls (B/D)

$\mathrm{OR} = \dfrac{A/C}{B/D} = \dfrac{A \times D}{B \times C}$
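The odds ratio from the 2x2 allele table can be computed directly, as in the sketch below. The confidence interval uses the standard log-odds (Woolf) approximation, which is an addition not stated on the slide; the function name `allelic_odds_ratio` and the allele counts are hypothetical.

```python
from math import exp, log, sqrt
from statistics import NormalDist
from typing import Tuple


def allelic_odds_ratio(a: int, b: int, c: int, d: int,
                       confidence: float = 0.95) -> Tuple[float, float, float]:
    """OR = (a*d)/(b*c) with an approximate confidence interval.

    a, b: minor allele counts in cases and controls
    c, d: major allele counts in cases and controls
    """
    if min(a, b, c, d) <= 0:
        raise ValueError("all four cells must be positive")
    odds_ratio = (a * d) / (b * c)
    se_log_or = sqrt(1 / a + 1 / b + 1 / c + 1 / d)  # Woolf standard error
    z = NormalDist().inv_cdf(0.5 + confidence / 2.0)
    lower = exp(log(odds_ratio) - z * se_log_or)
    upper = exp(log(odds_ratio) + z * se_log_or)
    return odds_ratio, lower, upper


# Hypothetical allele counts: 30/70 minor/major in cases, 20/80 in controls.
print(allelic_odds_ratio(30, 20, 70, 80))
```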

Analysis Methods

The Case/Control Study

For genotypes, set homozygous for major allele (A) as “referent” genotype, and calculate 2 odds ratios:

       Case   Control
Aa     A      B
AA     C      D

       Case   Control
aa     A      B
AA     C      D


Analysis Methods

Case/control: Interpretation of Odds Ratio

1.0 – Referent
>1.0 – Greater odds of disease compared with controls
<1.0 – Lesser odds of disease compared with controls

Confidence Intervals: probably contain true OR

OR does not measure risk*

Prospective cohort

• Disease free at beginning of study

• Followed over time for disease (“incident”)

• Follow “exposed” and “unexposed” groups

• Gold-standard study design

Analysis Methods

Analysis Methods

Prospective cohort

            Case   Control   Total
Exposed     A      B         (A+B)
Unexposed   C      D         (C+D)

Risk Ratio (RR) = incidence of disease in the exposed, A/(A+B), divided by incidence in the unexposed, C/(C+D)

$\mathrm{RR} = \dfrac{A/(A+B)}{C/(C+D)}$

Prospective Study: Interpretation of Risk Ratio

1.0 – Referent
>1.0 – Risk for disease increases
<1.0 – Risk for disease decreases

Confidence Intervals: probably contain true RR

*For rare diseases, OR ~ RR
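The footnote that OR approximates RR for rare diseases can be seen by computing both from the same cohort table. The sketch below uses invented counts, one rare outcome and one common outcome, and hypothetical function names.

```python
def risk_ratio(a: int, b: int, c: int, d: int) -> float:
    """RR = [a/(a+b)] / [c/(c+d)] for exposed (a, b) and unexposed (c, d)."""
    return (a / (a + b)) / (c / (c + d))


def cohort_odds_ratio(a: int, b: int, c: int, d: int) -> float:
    """OR = (a*d)/(b*c) computed from the same table."""
    return (a * d) / (b * c)


# Rare outcome: 2 of 1000 exposed and 1 of 1000 unexposed develop disease.
rare = (2, 998, 1, 999)
# Common outcome: 40 of 100 exposed and 20 of 100 unexposed develop disease.
common = (40, 60, 20, 80)

for label, table in (("rare", rare), ("common", common)):
    print(label, round(risk_ratio(*table), 3), round(cohort_odds_ratio(*table), 3))
```

Both tables have the same risk ratio, but the odds ratio drifts away from it as the outcome becomes more common.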

Analysis Methods


Case/control: Matching

Age Gender Race

Warning: Can “over match” and miss describing an interesting factor

Bad Example:
Cases: Adults with heart disease
Controls: Newborns without heart disease

Analysis Methods
Case/control: Stratifying

Age Gender Race

Warning: Need sufficient sample size to stratify or split the data into males and females

Ex. Cases with heart disease
    Age-matched controls without heart disease
    (Exposure: smoking status)

Stratify for Gender Specific Risks

Analysis Methods

Problems in Case/Control genetic association studies –

• “Confounding” by race or ancestry

• AKA population stratification

• Solutions:
  – Match
  – Stratify
  – Adjust (using genetic markers)
  – “Trios”

Cardon and Palmer (2003) Lancet 361:598-604

Analysis Methods

• Given
  – Height as “target” or “dependent” variable
  – Sex as “explanatory” or “independent” variable

• Fit regression model

$\text{height} = \beta \cdot \text{sex} + \varepsilon$

Analysis Methods

Regression


Analysis Methods

• Given
  – Quantitative “target” or “dependent” variable $y$
  – Quantitative or binary “explanatory” or “independent” variables $x_i$

• Fit regression model
  $y = \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_i x_i + \varepsilon$

Regression

• Works best for normal y and x
• Can include covariates
• Fit regression model
  $y = \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_i x_i + \varepsilon$
• Estimate errors on β’s
• Use t-statistic to evaluate significance of β’s
• Use F-statistic to evaluate model overall
• Use $R^2$ to evaluate variance explained by model
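The regression steps listed above can be sketched for the single-predictor case, in the spirit of the height and sex example. This is an illustration only: it fits ordinary least squares with an intercept (an assumption, since the slide's formula does not show one), the function name `simple_ols` is hypothetical, and the six height values are invented.

```python
from math import inf, sqrt
from typing import Dict, Sequence


def simple_ols(x: Sequence[float], y: Sequence[float]) -> Dict[str, float]:
    """Fit y = intercept + slope * x and report slope SE, t-statistic, and R2."""
    n = len(x)
    if n != len(y) or n < 3:
        raise ValueError("need at least 3 paired observations")
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sst = sum((yi - mean_y) ** 2 for yi in y)
    if sxx == 0 or sst == 0:
        raise ValueError("x and y must both vary")
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    sse = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))
    se_slope = sqrt(sse / (n - 2) / sxx)  # standard error of the slope
    t_stat = slope / se_slope if se_slope > 0 else inf
    return {"slope": slope, "intercept": intercept, "se_slope": se_slope,
            "t": t_stat, "r2": 1.0 - sse / sst}


# Hypothetical data: sex coded 0/1, height in cm.
sex = [0, 0, 0, 1, 1, 1]
height = [160, 162, 164, 174, 176, 178]
fit = simple_ols(sex, height)
print({name: round(value, 3) for name, value in fit.items()})
```

With a binary predictor the slope is simply the difference between the two group means, which makes the output easy to verify by hand.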


Analysis Methods

Coding Genotypes

Genotype   Dominant   Additive   Recessive
AA         1          2          1
AG         1          1          0
GG         0          0          0

Genotype can be re-coded in any number of ways for regression analysis
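The three codings in the table follow from counting copies of one allele. Below is a small sketch; the function name `code_genotype` is hypothetical, and it assumes genotypes are written as two-letter strings with A as the coded allele, as in the table.

```python
from typing import Dict


def code_genotype(genotype: str, coded_allele: str = "A") -> Dict[str, int]:
    """Return dominant, additive, and recessive codes for a two-letter genotype."""
    if len(genotype) != 2:
        raise ValueError("genotype must have exactly two alleles, e.g. 'AG'")
    copies = genotype.count(coded_allele)
    return {
        "dominant": int(copies >= 1),   # at least one copy
        "additive": copies,             # number of copies: 0, 1, or 2
        "recessive": int(copies == 2),  # two copies required
    }


for genotype in ("AA", "AG", "GG"):
    print(genotype, code_genotype(genotype))
```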


Example of gene-environment interaction and traditional regression

Analysis Methods
Statistical Packages for Genetic Association Studies

• Candidate gene association study
  – SAS/Genetics
  – STATA
  – SPSS
  – R
  – PLINK

• Whole genome association study
  – R
  – PLINK

Analysis Methods

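Whole genome in PLINK (pngu.mgh.harvard.edu/~purcell/plink/)

• Can adjust for population stratification
• Can add covariates

[Figure: genome-wide association plot with MHC removed; labeled signals at $P < 1 \times 10^{-100}$ and $P < 2 \times 10^{-11}$; genome-wide significance at $P < 5 \times 10^{-8}$, with a threshold line at $P = 5 \times 10^{-8}$]

Plenge et al 2007 NEJM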


SNPs versus Haplotypes

• There is no right answer: explore both

• The only thing that matters is the correlation between the assayed variable and the causal variable

• Sometimes the best assayed variable is a SNP, sometimes a haplotype

SNPs versus Haplotypes

Statistical Packages for Genetic Association Studies with haplotypes

• Haplo.stats (haplotype regression)
  Lake et al, Hum Hered. 2003;55(1):56-65.

• PHASE (case/control haplotype)
  Stephens et al, Am J Hum Genet. 2005 Mar;76(3):449-62.

• Haploview (case/control SNP analysis)
  Barrett et al, Bioinformatics. 2005 Jan 15;21(2):263-5.

• SNPHAP (haplotype regression?)
  Sham et al, Behav Genet. 2004 Mar;34(2):207-14.

Analysis Methods
Multiple testing

• Bonferroni correction
  – Too conservative because the SNPs tested may not be independent (LD)
  – How many independent tests did you do?
  – See Conneely and Boehnke AJHG (in press)

• False Discovery Rate
  – Also has arbitrary threshold

• Best bet is replication

Statistical Replication

CRP SNPs and CRP levels in NHANES III

[Figure: change in ln(CRP) per copy relative to H2 (0 to 0.6) for haplotypes H2, H5, H6, H7, and H8, with series for Black, Mexican-American, and White participants]

Results Consistent with CARDIA

Carlson et al. AJHG 2005;77:64-77

Crawford et al Circulation 2006; 114:2458-2465


Functional Replication

• Statistical replication is not always possible

• Association may imply mechanism

• Test for mechanism at the bench
  – Is predicted effect in the right direction?
  – Dissect haplotype effects to define functional SNPs

Functional Replication

CRP Evolutionary Conservation

• TATA box: 1697
• Transcript start: 1741
• CRP promoter region (bp 1444-1650) >75% conserved in mouse

Functional Replication
Low CRP Levels Associated with H1-4

• USF1 (Upstream Stimulating Factor)
  – Polymorphism at 1440 alters USF1 binding site

                1420      1430      1440
H1-4  gcagctacCACGTGcacccagatggcCACTCGtt
H7-8  gcagctacCACGTGcacccagatggcCACTAGtt
H5-6  gcagctacCACGTGcacccagatggcCACTTGtt

Functional Replication
High CRP Levels Associated with H6

• USF1 (Upstream Stimulating Factor)
  – Polymorphism at 1421 alters another USF1 binding site

                1420      1430      1440
H1-4  gcagctacCACGTGcacccagatggcCACTCGtt
H7-8  gcagctacCACGTGcacccagatggcCACTAGtt
H5    gcagctacCACGTGcacccagatggcCACTTGtt
H6    gcagctacCACATGcacccagatggcCACTTGtt


Functional Replication

CRP Promoter Luciferase Assay

[Figure: bar chart of fold change over H1-3 (0.0 to 4.0) for constructs H1-3, H4, H5, H6, H7-8, empty, and SV40p]

Carlson et al, AJHG v77 p64

Association Analysis Outline

• Study Design
• SNPs versus Haplotypes
• Analysis Methods
• Candidate Gene
• Whole Genome Analysis
• Replication and Function