
1

Human Genetics

Genetic Epidemiology

2

Family trees can have a lot of nuts

3

Genetic Epidemiology - Aims

1. Gene detection

2. Gene characterization

mode of inheritance

allele frequencies

→ prevalence, attributable risk

4

Genetic Epidemiology - Methods

• Aggregation

• Segregation

• Co-segregation

• Association

5

Segregation

Can the dichotomy or trichotomy be explained by Mendelian segregation?

• Dichotomy: affected and unaffected, or two distributions, determined by a dominant or recessive allele

• Trichotomy (also possible): three distributions

6

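Likelihood (parameter(s); data) ∝ Probability (data | parameter(s))

The joint probability of the genotypes and phenotypes of all the members of a pedigree can be written as

$$\prod_{i \in \text{founders}} P(G_i) \prod_{j \in \text{nonfounders}} P(G_j \mid G_{f_j}, G_{m_j}) \prod_{\text{observed}} P(Y \mid G)$$

where $f_j$ and $m_j$ denote the father and mother of nonfounder $j$. The likelihood is

$$L(\text{parameters}; Y) = \sum_{G_1} \sum_{G_2} \cdots \sum_{G_n} \prod_{i \in \text{founders}} P(G_i) \prod_{j \in \text{nonfounders}} P(G_j \mid G_{f_j}, G_{m_j}) \prod_{\text{observed}} P(Y \mid G).$$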
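The sum over genotypes can be made concrete with a small brute-force sketch. It assumes a nuclear family (two founders plus their children), a single diallelic locus, Hardy-Weinberg founder genotype probabilities, and Mendelian transmission; the function names, the allele frequency `q`, and the `penetrance` mapping are illustrative choices, not part of the slides.

```python
from itertools import product

GENOTYPES = ("AA", "Aa", "aa")

# Mendelian probability that a parent with a given genotype transmits allele A
TRANSMIT_A = {"AA": 1.0, "Aa": 0.5, "aa": 0.0}


def founder_prob(g, q):
    """P(G_i) for a founder, assuming Hardy-Weinberg proportions (q = freq of A)."""
    return {"AA": q * q, "Aa": 2 * q * (1 - q), "aa": (1 - q) ** 2}[g]


def child_prob(g, gf, gm):
    """P(G_j | G_f, G_m) built from the two parental transmission probabilities."""
    tf, tm = TRANSMIT_A[gf], TRANSMIT_A[gm]
    return {
        "AA": tf * tm,
        "Aa": tf * (1 - tm) + (1 - tf) * tm,
        "aa": (1 - tf) * (1 - tm),
    }[g]


def likelihood(phenotypes, q, penetrance):
    """Likelihood of a nuclear family by summing over all genotype assignments.

    phenotypes: [father, mother, child, ...]; each entry is True (affected),
    False (unaffected) or None (unobserved).
    penetrance: dict genotype -> P(affected | genotype).
    """
    total = 0.0
    for genos in product(GENOTYPES, repeat=len(phenotypes)):
        gf, gm = genos[0], genos[1]
        term = founder_prob(gf, q) * founder_prob(gm, q)   # founders
        for g in genos[2:]:                                # nonfounders
            term *= child_prob(g, gf, gm)
        for g, y in zip(genos, phenotypes):                # observed phenotypes
            if y is not None:
                term *= penetrance[g] if y else 1 - penetrance[g]
        total += term
    return total
```

For example, `likelihood([False, False, True], 0.1, {"AA": 1.0, "Aa": 0.0, "aa": 0.0})` evaluates a fully penetrant recessive model for two unaffected parents with one affected child. Enumeration grows as 3^n, so this is only practical for very small pedigrees.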

7

Transmission Probabilities

                                      Value if there is Mendelian segregation
P(AA transmits A) = $\tau_{AA\,A}$    1
P(Aa transmits A) = $\tau_{Aa\,A}$    ½
P(aa transmits A) = $\tau_{aa\,A}$    0

8

Ascertainment

• We examine segregating sibships

• The proportion of sibs affected is larger than expected on the basis of Mendelian inheritance

• The likelihood must be conditional on the mode of ascertainment

• We need to know the proband sampling frame

9

Cosegregation

• Chromosome segments are transmitted

• Cosegregation is caused by linked loci

ultimate statistical proof of genetic etiology

10

Methods of Linkage Analysis

• Trait model-based (parametric): assume a genetic model underlying the trait

• Trait model-free (non-parametric): no assumptions about the genetic model underlying the trait

• Ascertainment is often not an issue for locus detection by linkage analysis

11

Model-based Linkage Analysis

• All parameters other than the recombination fraction are assumed known

• If founder marker genotypes are known or can be inferred exactly,

→ no increase in Type 1 error

→ smallest Type 2 error when the model is correct

• If founder marker genotypes are unknown, we can

1) estimate them

2) use a database

12

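$$L(\text{parameters}; Y) = \sum_{G_1} \sum_{G_2} \cdots \sum_{G_n} \prod_{i \in \text{founders}} P(G_i) \prod_{j \in \text{nonfounders}} P(G_j \mid G_{f_j}, G_{m_j}) \prod_{\text{observed}} P(Y \mid G).$$

$P(G_j \mid G_{f_j}, G_{m_j})$ is expressed as a function of 2-locus transmission probabilities:

$$\tau_{\frac{AB}{ab}\,AB} = \tau_{\frac{AB}{ab}\,ab} = \frac{1-\theta}{2} \quad \text{and} \quad \tau_{\frac{AB}{ab}\,Ab} = \tau_{\frac{AB}{ab}\,aB} = \frac{\theta}{2}$$

where $\theta$ is the recombination fraction.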
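The 2-locus transmission probabilities for a double heterozygote can be tabulated directly. This is a minimal sketch assuming phase AB/ab is known; `theta` is an illustrative name for the recombination fraction, and the function is not taken from the slides.

```python
def gamete_probs(theta):
    """Gamete probabilities for a double heterozygote with phase AB/ab.

    theta is the recombination fraction between the two loci (0 <= theta <= 0.5).
    Non-recombinant gametes (AB, ab) each have probability (1 - theta) / 2;
    recombinant gametes (Ab, aB) each have probability theta / 2.
    """
    if not 0.0 <= theta <= 0.5:
        raise ValueError("theta must lie in [0, 0.5]")
    nonrec = (1 - theta) / 2
    rec = theta / 2
    return {"AB": nonrec, "ab": nonrec, "Ab": rec, "aB": rec}
```

At `theta = 0.5` the four gametes are equally likely (no linkage); as `theta` approaches 0 the recombinant gametes vanish (complete linkage).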

13

Model-free Linkage Analysis

Identity-in-state versus Identity-by-descent

Two alleles are identical by descent if they are copies of the same parental allele

[Figure: pedigree diagram showing genotypes A1A1, A1A2 (first row) and A1A2, A1A2 (second row), with alleles marked IBD]

14

Sib pairs share

• 0, 1 or 2 alleles identical by descent at a marker locus

• 0, 1 or 2 alleles identical by descent at a trait locus

Linkage

The average proportion shared at any particular locus is 1/2
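The 1/2 figure can be checked by enumerating every way two sibs can inherit alleles at one locus. This is a small sketch; the function name is illustrative and it assumes each parent transmits either of its two alleles with probability 1/2, independently for each sib.

```python
from collections import Counter
from fractions import Fraction
from itertools import product


def sib_pair_ibd_distribution():
    """Distribution of the number of alleles (0, 1, 2) a sib pair shares IBD."""
    counts = Counter()
    # Each index is 0 or 1: which of the parent's two alleles was transmitted.
    # Order: father -> sib 1, mother -> sib 1, father -> sib 2, mother -> sib 2.
    for f1, m1, f2, m2 in product((0, 1), repeat=4):
        shared = int(f1 == f2) + int(m1 == m2)
        counts[shared] += 1
    total = sum(counts.values())
    return {k: Fraction(v, total) for k, v in sorted(counts.items())}


dist = sib_pair_ibd_distribution()
# Proportion shared is (number shared) / 2; average it over the distribution.
mean_proportion = sum(Fraction(k, 2) * p for k, p in dist.items())
```

The enumeration gives probabilities 1/4, 1/2, 1/4 for sharing 0, 1, 2 alleles, so the mean proportion shared is 1/2.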

15

Relative Pair Model-Free Linkage Analysis

• We correlate relative-pair similarity (dissimilarity) for the trait of interest with relative-pair similarity (dissimilarity) for a marker

• Linkage between a trait locus and a marker locus → positive correlation

• Affected relative pair analysis: Do affected relative pairs share more marker alleles than expected if there is no linkage?

• No controls!

16

Association

• Causes of association between a marker and a disease

  • chance

  • stratification, population heterogeneity

  • very close linkage

  • pleiotropy

17

Causes of Allelic Association

Heterogeneity/stratification

Simpson's paradox: If we mix two populations that have both different disease prevalence and different marker allele prevalence, and there is no association between the disease and marker allele in each population, there will be an association between the disease and the marker allele in the mixed population.

This allelic association is nuisance association

The best solution to avoid this confounding is to study only ethnically homogeneous populations
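The stratification effect described under Simpson's paradox can be checked numerically. The sketch below uses two hypothetical populations (sizes, prevalences and marker frequencies are made up for illustration) in which disease and marker are independent, and compares the odds ratio within each population to the odds ratio after mixing.

```python
def independent_table(n, prevalence, marker_freq):
    """Expected 2x2 counts when disease (D) and marker (M) are independent."""
    return {
        ("D", "M"): n * prevalence * marker_freq,
        ("D", "-"): n * prevalence * (1 - marker_freq),
        ("-", "M"): n * (1 - prevalence) * marker_freq,
        ("-", "-"): n * (1 - prevalence) * (1 - marker_freq),
    }


def odds_ratio(t):
    """Odds ratio for disease given marker presence."""
    return (t[("D", "M")] * t[("-", "-")]) / (t[("D", "-")] * t[("-", "M")])


# Hypothetical populations: both prevalence and marker frequency differ.
pop1 = independent_table(10_000, prevalence=0.20, marker_freq=0.50)
pop2 = independent_table(10_000, prevalence=0.02, marker_freq=0.10)
mixed = {cell: pop1[cell] + pop2[cell] for cell in pop1}

or_pop1, or_pop2, or_mixed = odds_ratio(pop1), odds_ratio(pop2), odds_ratio(mixed)
```

Within each population the odds ratio is 1 (no association), while the mixed population shows an odds ratio well above 1, purely because the population with the higher prevalence also has the higher marker frequency.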

(Tight) Linkage

Imagine a number of generations ago, a normal allele d mutated to a disease allele D on a particular chromosome on which the allele at a marker locus was A1

A1 d → (mutation) → A1 D

This chromosome is passed down through the generations, and now there are many copies. If the distance between D and A1 is small, recombinations are unlikely, so most D chromosomes carry A1

This is the type of allelic association we are interested in

19

Guarding Against Stratification

• Three solutions:

• use a homogeneous population

• use family-based controls

• use genomic control

20

Matching on Ethnicity

• Close relatives are the best controls, but can lead to overmatching

• Cases and control family members must have the same family history of disease

[Figure: Siblings; Cousins]

21

Transmission Disequilibrium Test (TDT)

• A design that uses pseudosibs as controls

• Cases and their parents are typed for markers

[Pedigree: father A1A2, mother A2A2; case (child) A1A2]

Transmitted genotype is A1A2

Untransmitted genotype is A2A2

Father transmits A1, does not transmit A2

Mother transmits A2, does not transmit A2

(uninformative in terms of alleles)

22

• Build up a 2 x 2 table:

                       Transmitted
                       A1     A2
Untransmitted   A1     a      b
                A2     c      d

• The counts a and d come from homozygous parents

• The counts b and c come from heterozygous parents

• McNemar's test:

$$\chi^2_1 = \frac{(b - c)^2}{b + c}$$
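The table and McNemar statistic can be computed from one (transmitted, untransmitted) allele pair per parent. This is a minimal sketch; the function name and the `"A1"`/`"A2"` string coding are illustrative assumptions.

```python
def tdt(parent_transmissions):
    """Build the TDT 2 x 2 table and McNemar statistic.

    parent_transmissions: iterable of (transmitted, untransmitted) allele pairs,
    one per parent of a case, with alleles coded "A1" or "A2".
    Returns (counts, statistic); statistic is None if no parent is heterozygous.
    """
    # Cell labels follow the table: columns = transmitted, rows = untransmitted.
    cell = {
        ("A1", "A1"): "a",  # homozygous A1A1 parent
        ("A2", "A1"): "b",  # heterozygous parent transmitting A2
        ("A1", "A2"): "c",  # heterozygous parent transmitting A1
        ("A2", "A2"): "d",  # homozygous A2A2 parent
    }
    counts = {"a": 0, "b": 0, "c": 0, "d": 0}
    for transmitted, untransmitted in parent_transmissions:
        counts[cell[(transmitted, untransmitted)]] += 1
    b, c = counts["b"], counts["c"]
    statistic = (b - c) ** 2 / (b + c) if b + c > 0 else None
    return counts, statistic
```

For the family in the pedigree example, `tdt([("A1", "A2"), ("A2", "A2")])` places the father in cell c and the mother in cell d; only the father contributes to the test.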

23

Genomic Control

• Calculate an association statistic for a candidate locus

• Calculate the same association statistic, from the same sample, for a set of unlinked loci

• Determine significance by reference to the results for the unlinked loci
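One way to make "reference to the results for the unlinked loci" concrete is a median-based inflation factor. The sketch below assumes the association statistic is a 1 d.f. chi-square and rescales it by the ratio of the observed median at the unlinked loci to the theoretical median; the function name and this particular estimator are assumptions for illustration, not a method stated on the slide.

```python
from statistics import median

# Approximate median of a chi-square distribution with 1 d.f.
CHI2_1DF_MEDIAN = 0.4549


def genomic_control(candidate_stat, null_stats):
    """Rescale a 1 d.f. chi-square statistic using unlinked (null) loci.

    candidate_stat: statistic at the candidate locus.
    null_stats: the same statistic computed, in the same sample, at unlinked loci.
    Returns (adjusted_statistic, inflation_factor).
    """
    inflation = median(null_stats) / CHI2_1DF_MEDIAN
    inflation = max(1.0, inflation)  # never deflate the statistic
    return candidate_stat / inflation, inflation
```

The adjusted statistic would then be referred to the usual 1 d.f. chi-square distribution.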

24

Linkage Between a Marker and a Disease

• Intrafamilial association

• Typically no population association

• Not affected by population stratification

• Population association if very close

25

Association versus Linkage

Allelic Association                          Linkage
• Association at the population level        Intrafamilial association
• Pinpoints alleles                          Pinpoints loci
• More powerful                              Less powerful
• More tests required                        Fewer tests required
• More sensitive to mistyping                Less sensitive to mistyping
• Sensitive to population stratification     Not sensitive to population stratification

• Which is better?

26

What is the Best Design and Analysis?

• If heterogeneity / stratification could be an issue, genome scan desired: large extended pedigrees, type all (founders and nonfounders) for 200-400 equi-spaced markers, for linkage analysis

• If heterogeneity / stratification is a non-issue: unrelated cases and controls for association analysis (genome scan?)

Note: cost, burden of multiple testing

A wise investigator, like a wise investor, would hedge bets with a judicious mix

27

Case-Control Data

• Consider a particular marker allele, A1, sample of cases and controls:

             Number of A1 alleles
             0      1      2      Total
Cases        r0     r1     r2     R
Controls     s0     s1     s2     S
Total        n0     n1     n2     N

28

• Consider the probability structure:

             Number of A1 alleles
             0      1      2
Cases        p0     p1     p2
Controls     q0     q1     q2

• Cochran-Armitage trend: test the null hypothesis

p2 + ½p1 = q2 + ½q1

without assuming the two alleles a person has are independent

Sasieni (1997) Biometrics 53:1253-1261

29

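$$Y_1 = \frac{\left[(\hat p_2 + \tfrac{1}{2}\hat p_1) - (\hat q_2 + \tfrac{1}{2}\hat q_1)\right]^2}{\left(\dfrac{1}{R} + \dfrac{1}{S}\right)\dfrac{1}{N^2}\left[N\left(n_2 + \tfrac{1}{4}n_1\right) - \left(n_2 + \tfrac{1}{2}n_1\right)^2\right]}$$

asymptotically has a χ² distribution with 1 d.f.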
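The trend statistic can be computed directly from the genotype counts in the case-control table. This is a minimal sketch under the assumption that the variance term is the pooled-sample variance of the allele dosage score (0, ½, 1) scaled by 1/R + 1/S; the function name is illustrative.

```python
def cochran_armitage_trend(cases, controls):
    """Cochran-Armitage trend statistic for genotype counts.

    cases = (r0, r1, r2) and controls = (s0, s1, s2): numbers of individuals
    carrying 0, 1 or 2 copies of allele A1.
    """
    r0, r1, r2 = cases
    s0, s1, s2 = controls
    R, S = r0 + r1 + r2, s0 + s1 + s2
    N = R + S
    n1, n2 = r1 + s1, r2 + s2
    # Difference in estimated A1 allele frequency: (p2 + p1/2) - (q2 + q1/2)
    diff = (r2 + 0.5 * r1) / R - (s2 + 0.5 * s1) / S
    # Null variance of the difference, from the pooled genotype counts
    variance = (1 / R + 1 / S) * (N * (n2 + 0.25 * n1) - (n2 + 0.5 * n1) ** 2) / N**2
    return diff**2 / variance
```

The result is referred to a chi-square distribution with 1 d.f. Because the variance uses genotype (not allele) counts, the statistic does not rely on Hardy-Weinberg proportions.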

30

Cochran-Armitage Trend Test

• Does not assume independence of alleles within a person

• Does assume independence of genotypes from person to person

• Is not valid if there is population stratification

• The increased variance due to stratification can be estimated from a random set of markers that are independent of the disease: genomic control. Devlin and Roeder (1999) Biometrics 55:997-1004

31

Case-only Studies

• Look at departure from

A1A1     A1A*        A*A*
p²       2p(1-p)     (1-p)²

where p = P(A1) = p2 + ½p1

• Hardy-Weinberg Disequilibrium (HWD) test statistic:

$$\frac{\left[\hat p_2 - (\hat p_2 + \tfrac{1}{2}\hat p_1)^2\right]^2}{\text{estimated variance}} \to \chi^2_1$$

• Suggested as

  • more powerful (only cases needed)

  • more precise (signal decreases faster with distance from the causative locus)
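A sketch of the case-only HWD statistic follows. The slide does not specify the "estimated variance"; the code assumes the usual large-sample null variance of the disequilibrium coefficient, p²(1-p)²/R, which makes the statistic coincide with the Pearson goodness-of-fit test for Hardy-Weinberg proportions. The function name is illustrative.

```python
def hwd_test_cases(r0, r1, r2):
    """HWD test statistic from case genotype counts (0, 1, 2 copies of A1)."""
    R = r0 + r1 + r2
    p2_hat, p1_hat = r2 / R, r1 / R
    p_hat = p2_hat + 0.5 * p1_hat          # estimated A1 allele frequency
    d_hat = p2_hat - p_hat**2              # departure from Hardy-Weinberg
    # ASSUMPTION: null variance of d_hat under Hardy-Weinberg equilibrium
    variance = (p_hat * (1 - p_hat)) ** 2 / R
    return d_hat**2 / variance
```

For example, `hwd_test_cases(10, 20, 70)` measures the excess of A1A1 homozygotes relative to the expectation p̂² among 100 cases.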

32

Case-only Studies

• No controls

• No power in the case of a multiplicative model:

$$P(\text{affected} \mid A_1A^*)^2 = P(\text{affected} \mid A_1A_1)\, P(\text{affected} \mid A^*A^*)$$

• there must be a difference in HWD between cases and controls

• therefore we consider this HWD trend test:

$$Y_2 = \frac{\left\{\left[\hat p_2 - (\hat p_2 + \tfrac{1}{2}\hat p_1)^2\right] - \left[\hat q_2 - (\hat q_2 + \tfrac{1}{2}\hat q_1)^2\right]\right\}^2}{\text{estimated variance}}$$

33

$$Y_1 = \frac{\hat b^2}{\operatorname{var}(\hat b)}$$

$$Y_2 = \frac{\hat d^2}{\operatorname{var}(\hat d)}$$

34

Weighted average of the Cochran-Armitage trend test and the HWD trend test statistics

$$Y = \frac{\left[w\,|\hat b| + (1-w)\,|\hat d|\right]^2}{w^2 \operatorname{var}(\hat b) + (1-w)^2 \operatorname{var}(\hat d) + 2w(1-w)\operatorname{cov}(|\hat b|, |\hat d|)}$$

We want to give more weight to b or d, whichever yields the larger signal

Therefore take

$$w = \frac{Y_1}{Y_1 + Y_2}$$
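The weighting scheme can be sketched as a small function. It takes the two estimated differences and their variance and covariance estimates as inputs, since the slides do not give those estimators; the function and argument names are illustrative assumptions.

```python
def weighted_average_statistic(b, d, var_b, var_d, cov_abs):
    """Weighted average of the trend (b) and HWD trend (d) signals.

    b, d: estimated differences used in the two component tests.
    var_b, var_d: their estimated variances.
    cov_abs: estimated covariance of |b| and |d|.
    """
    y1 = b**2 / var_b                      # Cochran-Armitage trend statistic
    y2 = d**2 / var_d                      # HWD trend statistic
    if y1 + y2 == 0:
        return 0.0                         # no signal from either component
    w = y1 / (y1 + y2)                     # more weight to the larger signal
    numerator = (w * abs(b) + (1 - w) * abs(d)) ** 2
    denominator = w**2 * var_b + (1 - w) ** 2 * var_d + 2 * w * (1 - w) * cov_abs
    return numerator / denominator
```

When one component carries no signal the statistic reduces to the other: with `d = 0` the weight is 1 and the result equals `b**2 / var_b`. Because the weight is data-dependent, the null distribution is not a standard chi-square and has to be calibrated separately.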

35

• To investigate the null distribution of this average we simulate many different situations (sample sizes up to 10,000 cases and 10,000 controls) and generate $\hat p_0, \hat p_1, \hat p_2$ for cases and $\hat q_0, \hat q_1, \hat q_2$ for controls

• For all situations considered, the distribution is well approximated by a Gamma distribution

36

• As the sample size and marker allele frequency increase, the largest mean and the smallest variance occur for 10,000 cases and 10,000 controls, and for a marker allele frequency 0.5

• For 10,000 cases and 10,000 controls, and marker allele frequency 0.5, the upper tail of the distribution is well approximated by a Gamma distribution with mean μ = 1.78 and variance σ² = 3.45
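To work with the fitted Gamma distribution, the stated mean and variance can be converted to shape and scale parameters by moment matching. The sketch below also estimates an upper quantile by simulation with the standard library; the function names, sample size and seed are illustrative choices.

```python
import random


def gamma_from_moments(mean, variance):
    """Shape and scale of a Gamma distribution with the given mean and variance."""
    shape = mean**2 / variance
    scale = variance / mean
    return shape, scale


def simulated_upper_quantile(shape, scale, prob, n=100_000, seed=0):
    """Monte Carlo estimate of the prob-quantile of Gamma(shape, scale)."""
    rng = random.Random(seed)
    draws = sorted(rng.gammavariate(shape, scale) for _ in range(n))
    return draws[min(n - 1, int(prob * n))]


shape, scale = gamma_from_moments(1.78, 3.45)
```

A call such as `simulated_upper_quantile(shape, scale, 0.95)` then gives an approximate critical value for the weighted average at the corresponding significance level.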

37

• We develop a prediction equation to determine percentiles of the null distribution for smaller sample sizes and marker allele frequencies

• We base goodness of fit on the root mean squared error (RMSE) of $\log_e \alpha$, calculated for various sample size combinations, from the variance among 50 replicate samples:

$$\text{RMSE} = \left[\frac{1}{50}\sum \left(\log_e \hat\alpha - \log_e \alpha\right)^2\right]^{1/2}$$

38

• With ~90% confidence, the true $\log_e \alpha$ lies in the interval $\log_e \hat\alpha \pm 1.645\,(\text{RMSE})$, i.e., $\hat\alpha$ is within $e^{1.645(\text{RMSE})}$-fold of the true α

• For total sample size (R + S) 200 or larger and α = 0.0001 or larger, in the very worst case (R = S = 100, α = 0.0001) with 90% confidence $\hat\alpha$ could differ from the true α by a factor of at most ~ 4.8

• The average RMSE is 0.35, corresponding to being between 78% and 122% of the true α with 90% confidence
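The relation between the RMSE of log α and the fold range can be sketched in a few lines. The function names are illustrative, and the replicate estimates passed to `rmse_log_alpha` would come from the simulations, which are not reproduced here.

```python
import math


def rmse_log_alpha(estimates, true_alpha):
    """Root mean squared error of log(alpha-hat) around log(true alpha)."""
    squared = [(math.log(a) - math.log(true_alpha)) ** 2 for a in estimates]
    return math.sqrt(sum(squared) / len(squared))


def fold_range(rmse, z=1.645):
    """Lower and upper multiplicative bounds exp(-z*rmse), exp(+z*rmse).

    z = 1.645 corresponds to a two-sided interval with ~90% coverage
    under a normal approximation on the log scale.
    """
    factor = math.exp(z * rmse)
    return 1 / factor, factor
```

Read the other way, a worst-case factor of about 4.8 corresponds to an RMSE of `math.log(4.8) / 1.645`, a little under 1 on the log scale.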

39

POWER

Genetic Models Simulated

                        Probability of being affected given
     Model              A1A1     A1A*      A*A*
1    Recessive 1        1.00     0.10      0.10
2    Recessive 2        1.00     0.05      0.05
3    Additive           1.00     0.50      0.00
4    Multiplicative     0.81     0.045     0.0025

• Marker loci placed at distances 0 – 6 cM from the disease susceptibility locus

• For type I error, no association between the disease and marker loci

• Each simulated population contains 500,000 individuals allowed to randomly mate for 50 generations after the appearance of a disease mutation
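The four simulated penetrance models can be used to illustrate why a case-only HWD test has no power under the multiplicative model. The sketch assumes the typed marker is the susceptibility locus itself and that the general population is in Hardy-Weinberg equilibrium with A1 frequency `p`; the function name and the value of `p` are illustrative.

```python
# Penetrances (A1A1, A1A*, A*A*) for the four simulated models
MODELS = {
    "Recessive 1": (1.00, 0.10, 0.10),
    "Recessive 2": (1.00, 0.05, 0.05),
    "Additive": (1.00, 0.50, 0.00),
    "Multiplicative": (0.81, 0.045, 0.0025),
}


def hwd_in_cases(penetrances, p):
    """HWD coefficient among cases, given penetrances and A1 frequency p.

    ASSUMPTION: population genotype frequencies are p^2, 2p(1-p), (1-p)^2.
    """
    f2, f1, f0 = penetrances
    weights = (f2 * p**2, f1 * 2 * p * (1 - p), f0 * (1 - p) ** 2)
    total = sum(weights)                       # disease prevalence
    c2, c1, _ = (w / total for w in weights)   # genotype frequencies in cases
    return c2 - (c2 + 0.5 * c1) ** 2


hwd_by_model = {name: hwd_in_cases(f, p=0.1) for name, f in MODELS.items()}
```

Under the multiplicative penetrances the coefficient is zero (up to rounding), so cases remain in Hardy-Weinberg proportions, whereas the recessive models produce an excess of A1A1 homozygotes among cases.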

40

Tests Performed

Homogeneous populations

• HWD, cases only

• Allele test

• Allele test x HWD in cases

• HWD trend test

• Cochran-Armitage trend test

• Cochran-Armitage trend test x HWD trend test

• Weighted average

Population stratification

• Cochran-Armitage trend test with genomic control

• Product of this and the HWD trend test

• Weighted average with genomic control

41

Type I error, homogeneous population

[Figure legend: ∆ HWD test, cases only; ▲ product of the allele test and HWD test]

42

Type I error, population stratification

[Figure legend: ○ allele test; ◊ Cochran-Armitage trend test; ▲ product of the allele test and HWD test; ■ weighted average test; ● product of the Cochran-Armitage trend test and the HWD test]

43

Power, homogeneous population

[Figure legend: ■ weighted average test]

44

Power, population stratification

[Figure legend: □ HWD trend test; ♦ CA test with genomic control; ■ weighted average with genomic control]

45

Conclusions

• Under recessive inheritance, the weighted average has better performance than either the Cochran-Armitage trend test or the HWD trend test

• Has good performance for other models as well

• The product of the Cochran-Armitage trend test statistic and the HWD test statistic (cases only) has better power, but has inflated Type I error if there is population stratification

• The weighted average has good overall properties and automatically controls for marker mistyping

46

With acknowledgment to

Kijoung Song

47

Can we use evolutionary models, when we have large amounts of genetic data on a sample of cases and controls, to obtain a more powerful way of detecting loci involved in the etiology of disease?

Will these models bear fruit or nuts?