1
Human Genetics
Genetic Epidemiology
2
Family trees can have a lot of nuts
3
Genetic Epidemiology - Aims
1. Gene detection
2. Gene characterization
mode of inheritance
allele frequencies
→ prevalence, attributable risk
4
Genetic Epidemiology - Methods
• Aggregation
• Segregation
• Co-segregation
• Association
5
Segregation
Can the dichotomy or trichotomy be explained by Mendelian segregation?
Dichotomy: affected and unaffected, or two distributions, determined by a dominant or recessive allele
Also possible: three distributions (trichotomy)
6
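Likelihood (parameter(s); data) ∝ Probability (data | parameter(s))
The joint probability of the genotypes and phenotypes of all the members of a pedigree can be written as
$$\prod_{i \in \text{founders}} P(G_i) \prod_{j \in \text{nonfounders}} P(G_j \mid G_{f_j}, G_{m_j}) \prod_{k \in \text{observed}} P(Y_k \mid G_k)$$
Summing over the genotypes of all n pedigree members gives the likelihood:
$$L(\text{parameters};\, Y) = \sum_{G_1} \sum_{G_2} \cdots \sum_{G_n} \prod_{i \in \text{founders}} P(G_i) \prod_{j \in \text{nonfounders}} P(G_j \mid G_{f_j}, G_{m_j}) \prod_{k \in \text{observed}} P(Y_k \mid G_k).$$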
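As a minimal sketch of how this sum over genotypes works, the following evaluates the likelihood for a single nuclear family with one diallelic trait locus and a binary phenotype. The function names, the Hardy-Weinberg founder prior, and the example penetrances are illustrative assumptions, not part of the slides.

```python
from itertools import product

GENOTYPES = ("AA", "Aa", "aa")

# Mendelian probabilities that each parental genotype transmits allele A.
TAU = {"AA": 1.0, "Aa": 0.5, "aa": 0.0}

def founder_prior(g, q):
    """P(G) for a founder, assuming Hardy-Weinberg proportions (an assumption);
    q is the frequency of allele A."""
    return {"AA": q * q, "Aa": 2 * q * (1 - q), "aa": (1 - q) ** 2}[g]

def transmission(child, father, mother):
    """P(G_child | G_father, G_mother) built from the transmission probabilities."""
    tf, tm = TAU[father], TAU[mother]
    return {"AA": tf * tm,
            "Aa": tf * (1 - tm) + (1 - tf) * tm,
            "aa": (1 - tf) * (1 - tm)}[child]

def penetrance(y, g, risk):
    """P(Y | G) for a binary phenotype (1 = affected); risk maps genotype -> P(affected)."""
    return risk[g] if y == 1 else 1 - risk[g]

def nuclear_family_likelihood(y_father, y_mother, y_children, q, risk):
    """Sum the joint probability over all genotype configurations."""
    total = 0.0
    for gf, gm in product(GENOTYPES, repeat=2):
        parents = (founder_prior(gf, q) * founder_prior(gm, q)
                   * penetrance(y_father, gf, risk)
                   * penetrance(y_mother, gm, risk))
        # Given the parental genotypes the children are independent,
        # so the sum over each child's genotype factorizes.
        kids = 1.0
        for y in y_children:
            kids *= sum(transmission(gc, gf, gm) * penetrance(y, gc, risk)
                        for gc in GENOTYPES)
        total += parents * kids
    return total

# Hypothetical fully penetrant dominant model with allele frequency 0.1.
dominant = {"AA": 1.0, "Aa": 1.0, "aa": 0.0}
lik = nuclear_family_likelihood(1, 0, [1, 0], q=0.1, risk=dominant)
```

Because every factor is a proper probability, the likelihoods of all possible phenotype configurations of a family sum to 1, which is a convenient sanity check.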
7
Transmission Probabilities
Transmission probability                   Value if there is Mendelian segregation
P(AA transmits A) = $\tau_{AA\,A}$         1
P(Aa transmits A) = $\tau_{Aa\,A}$         ½
P(aa transmits A) = $\tau_{aa\,A}$         0
8
Ascertainment
• We examine segregating sibships
• The proportion of sibs affected is larger than expected on the basis of Mendelian inheritance
• The likelihood must be conditional on the mode of ascertainment
• We need to know the proband sampling frame
9
Cosegregation
• Chromosome segments are transmitted
• Cosegregation is caused by linked loci
→ ultimate statistical proof of genetic etiology
10
Methods of Linkage Analysis
• Trait model-based (parametric) – assume a genetic model underlying the trait
• Trait model-free (non-parametric) – no assumptions about the genetic model underlying the trait
• Ascertainment is often not an issue for locus detection by linkage analysis
11
Model-based Linkage Analysis
• If founder marker genotypes are unknown, we can
1) estimate them
2) use a database
• If founder marker genotypes are known or can be inferred exactly,
→ no increase in Type 1 error
→ smallest Type 2 error when the model is correct
• All parameters other than the recombination fraction are assumed known
12
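$$L(\text{parameters};\, Y) = \sum_{G_1} \sum_{G_2} \cdots \sum_{G_n} \prod_{i \in \text{founders}} P(G_i) \prod_{j \in \text{nonfounders}} P(G_j \mid G_{f_j}, G_{m_j}) \prod_{k \in \text{observed}} P(Y_k \mid G_k).$$
$P(G_j \mid G_{f_j}, G_{m_j})$ is expressed as a function of 2-locus transmission probabilities, with θ the recombination fraction:
$$\tau_{\frac{AB}{ab}\,AB} = \tau_{\frac{AB}{ab}\,ab} = \frac{1-\theta}{2} \quad \text{and} \quad \tau_{\frac{AB}{ab}\,Ab} = \tau_{\frac{AB}{ab}\,aB} = \frac{\theta}{2}$$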
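A small sketch of the 2-locus transmission probabilities for a doubly heterozygous parent with haplotypes AB and ab, as a function of the recombination fraction θ. The function name is hypothetical; the values follow the standard relationship in which non-recombinant gametes each have probability (1 − θ)/2 and recombinant gametes each have probability θ/2.

```python
def gamete_probabilities(theta):
    """Gamete probabilities for an AB/ab parent given recombination fraction theta."""
    if not 0.0 <= theta <= 0.5:
        raise ValueError("recombination fraction must lie in [0, 0.5]")
    non_recombinant = (1.0 - theta) / 2.0
    recombinant = theta / 2.0
    return {"AB": non_recombinant, "ab": non_recombinant,
            "Ab": recombinant, "aB": recombinant}

linked = gamete_probabilities(0.1)     # tightly linked loci
unlinked = gamete_probabilities(0.5)   # free recombination: all four gametes equal
```

With θ = 0.5 every gamete has probability 0.25, so the marker carries no information about transmission at the trait locus.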
13
Model-free Linkage Analysis
Identity-in-state versus Identity-by-descent
Two alleles are identical by descent if they are copies of the same parental allele
[Figure: pedigree with parents A1A1 and A1A2 and offspring A1A2 and A1A2, labeled IBD]
14
Sib pairs share
0, 1 or 2 alleles identical by descent at a marker locus
0, 1 or 2 alleles identical by descent at a trait locus
Linkage
The average proportion shared at any particular locus is 1/2
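The figure of 1/2 can be checked by enumeration. This sketch labels the four parental alleles distinctly (an illustrative device, not something taken from the slides) and counts how many alleles two sibs share identical by descent over all equally likely transmissions.

```python
from fractions import Fraction
from itertools import product

def sib_pair_ibd_distribution():
    """Distribution of the number of alleles (0, 1, 2) a sib pair shares IBD."""
    # Father carries alleles 0 and 1, mother carries alleles 2 and 3.
    transmissions = list(product((0, 1), (2, 3)))  # (paternal, maternal) allele
    counts = {0: 0, 1: 0, 2: 0}
    for sib1, sib2 in product(transmissions, repeat=2):
        shared = int(sib1[0] == sib2[0]) + int(sib1[1] == sib2[1])
        counts[shared] += 1
    total = sum(counts.values())
    return {k: Fraction(v, total) for k, v in counts.items()}

dist = sib_pair_ibd_distribution()
# Expected proportion of alleles shared IBD (number shared divided by 2).
mean_proportion = sum(Fraction(k, 2) * p for k, p in dist.items())
```

The enumeration gives probabilities 1/4, 1/2 and 1/4 for sharing 0, 1 and 2 alleles, and a mean proportion of 1/2.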
15
Relative Pair Model-Free Linkage Analysis
• We correlate relative-pair similarity (dissimilarity) for the trait of interest with relative-pair similarity (dissimilarity) for a marker
• Affected relative pair analysis: Do affected relative pairs share more marker alleles than expected if there is no linkage?
• No controls!
• Linkage between a trait locus and a marker locus
→ positive correlation
16
Association
• Causes of association between a marker and a disease
  • chance
  • stratification, population heterogeneity
  • very close linkage
  • pleiotropy
17
Causes of Allelic Association
Heterogeneity/stratification
Simpson's paradox: If we mix two populations that have both different disease prevalence and different marker allele prevalence, and there is no association between the disease and marker allele in each population, there will be an association between the disease and the marker allele in the mixed population.
This allelic association is nuisance association
The best solution to avoid this confounding is to study only ethnically homogeneous populations
(Tight) Linkage
Imagine a number of generations ago, a normal allele d mutated to a disease allele D on a particular chromosome on which the allele at a marker locus was A1
A1 d → (mutation) → A1 D
This chromosome is passed down through the generations, and now there are many copies. If the distance between D and A1 is small, recombinations are unlikely, so most D chromosomes carry A1
This is the type of allelic association we are interested in
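The stratification argument can be made concrete with a small numerical sketch. The population sizes, prevalences and allele frequencies below are made-up illustrative values: within each population disease and marker are independent (odds ratio 1), yet the pooled table shows an association.

```python
def joint_table(size, disease_prev, allele_freq):
    """Expected counts keyed by (diseased, carries_allele), assuming independence."""
    table = {}
    for d in (0, 1):
        for a in (0, 1):
            p_d = disease_prev if d else 1 - disease_prev
            p_a = allele_freq if a else 1 - allele_freq
            table[(d, a)] = size * p_d * p_a
    return table

def odds_ratio(table):
    """Disease-marker odds ratio of a 2 x 2 table."""
    return (table[(1, 1)] * table[(0, 0)]) / (table[(1, 0)] * table[(0, 1)])

# Two hypothetical populations differing in both prevalence and allele frequency.
pop1 = joint_table(1000, disease_prev=0.10, allele_freq=0.2)
pop2 = joint_table(1000, disease_prev=0.30, allele_freq=0.6)
mixed = {key: pop1[key] + pop2[key] for key in pop1}

or_pop1, or_pop2, or_mixed = odds_ratio(pop1), odds_ratio(pop2), odds_ratio(mixed)
```

Here the odds ratio is 1 in each population but about 1.67 in the mixture, purely because the population with the higher prevalence also has the higher allele frequency.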
19
Guarding Against Stratification
• Three solutions:
  • use a homogeneous population
  • use family-based controls
  • use genomic control
20
Matching on Ethnicity
• Close relatives are the best controls, but can lead to overmatching
• Cases and control family members must have the same family history of disease
[Figure panels: Siblings, Cousins]
21
Transmission Disequilibrium Test (TDT)
• A design that uses pseudosibs as controls
• Cases and their parents are typed for markers
[Pedigree: father A1A2, mother A2A2, affected child A1A2]
Transmitted genotype is A1A2
Untransmitted genotype is A2A2
Father transmits A1, does not transmit A2
Mother transmits A2, does not transmit A2 (uninformative in terms of alleles)
22
• Build up a 2 x 2 table:
                         Transmitted
                         A1      A2
Untransmitted    A1      a       b
                 A2      c       d
• The counts a and d come from homozygous parents
• The counts b and c come from heterozygous parents
• McNemar's test:
$$\chi^2_1 = \frac{(b - c)^2}{b + c}$$
Genomic Control
• Calculate an association statistic for a candidate locus
• Calculate the same association statistic, from the same sample, for a set of unlinked loci
• Determine significance by reference to the results for the unlinked loci
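One common way to carry out the last step, sketched here under the assumption that the association statistic is a 1 d.f. chi-square, is to estimate a variance inflation factor from the median of the statistics at the unlinked loci and divide the candidate statistic by it. The function names are hypothetical, and this median-based estimator is one variant, not necessarily the one intended on the slide.

```python
from statistics import median

# Median of a chi-square distribution with 1 degree of freedom.
CHI2_1DF_MEDIAN = 0.4549

def inflation_factor(null_statistics):
    """Estimate the inflation factor from statistics at unlinked loci.

    The estimate is floored at 1 so the candidate statistic is never inflated.
    """
    return max(1.0, median(null_statistics) / CHI2_1DF_MEDIAN)

def genomic_control_adjust(candidate_statistic, null_statistics):
    """Scale the candidate-locus statistic by the estimated inflation factor."""
    return candidate_statistic / inflation_factor(null_statistics)

# Made-up null statistics whose median is twice the theoretical median.
null_stats = [0.2, 0.5, 2 * CHI2_1DF_MEDIAN, 1.4, 3.0]
adjusted = genomic_control_adjust(8.0, null_stats)
```

In this toy example the inflation factor is 2, so a raw statistic of 8.0 is reduced to 4.0 before being referred to the chi-square distribution.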
24
Linkage Between a Marker and a Disease
• Intrafamilial association
• Typically no population association
• Not affected by population stratification
• Population association if very close
25
Association versus Linkage
Allelic Association                            Linkage
• Association at the population level          Intrafamilial association
• Pinpoints alleles                            Pinpoints loci
• More powerful                                Less powerful
• More tests required                          Fewer tests required
• More sensitive to mistyping                  Less sensitive to mistyping
• Sensitive to population stratification       Not sensitive to population stratification
• Which is better?
26
What is the Best Design and Analysis?
• If heterogeneity / stratification could be an issue, genome scan desired: large extended pedigrees, type all (founders and non-founders) for 200-400 equi-spaced markers, for linkage analysis
• If heterogeneity / stratification is a non-issue: unrelated cases and controls for association analysis (genome scan?)
Note: cost, burden of multiple testing
A wise investigator, like a wise investor, would hedge bets with a judicious mix
27
Case-Control Data
• Consider a particular marker allele, A1, sample of cases and controls:
              Number of A1 alleles
              0      1      2      Total
Cases         r0     r1     r2     R
Controls      s0     s1     s2     S
Total         n0     n1     n2     N
28
• Consider the probability structure:
              Number of A1 alleles
              0      1      2
Cases         p0     p1     p2
Controls      q0     q1     q2
• Cochran-Armitage trend: test the null hypothesis
p2 + ½p1 = q2 + ½q1
without assuming the two alleles a person has are independent
Sasieni (1997) Biometrics 53:1253-1261
29
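$$Y = \frac{\left[(\hat p_2 + \tfrac{1}{2}\hat p_1) - (\hat q_2 + \tfrac{1}{2}\hat q_1)\right]^2}{\left(\frac{1}{R} + \frac{1}{S}\right)\frac{1}{N^2}\left[N\left(n_2 + \tfrac{1}{4} n_1\right) - \left(n_2 + \tfrac{1}{2} n_1\right)^2\right]}$$
asymptotically has a χ² distribution with 1 d.f.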
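The trend statistic can be computed directly from the genotype counts r0, r1, r2 (cases) and s0, s1, s2 (controls). This is a sketch of that calculation as read from the slide, with a hypothetical function name and made-up counts.

```python
def cochran_armitage_trend(cases, controls):
    """Trend statistic from genotype counts ordered by number of A1 alleles (0, 1, 2)."""
    r0, r1, r2 = cases
    s0, s1, s2 = controls
    R, S = r0 + r1 + r2, s0 + s1 + s2
    N = R + S
    n1, n2 = r1 + s1, r2 + s2
    # Difference in estimated A1 allele frequency between cases and controls.
    diff = (r2 + 0.5 * r1) / R - (s2 + 0.5 * s1) / S
    # Null variance of the difference, based on pooled genotype counts.
    variance = (1 / R + 1 / S) * (N * (n2 + 0.25 * n1) - (n2 + 0.5 * n1) ** 2) / N ** 2
    return diff ** 2 / variance

# Hypothetical counts for 100 cases and 100 controls.
y = cochran_armitage_trend(cases=(10, 40, 50), controls=(30, 50, 20))
```

For these made-up counts the allele frequency difference is 0.25 and the statistic is about 23.7. Because the variance is built from genotype counts rather than allele counts, no Hardy-Weinberg assumption is needed.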
30
Cochran-Armitage Trend Test
• Does not assume independence of alleles within a person
• Does assume independence of genotypes from person to person
• Is not valid if there is population stratification
• The increased variance due to stratification can be estimated from a random set of markers that are independent of the disease: genomic control. Devlin and Roeder (1999) Biometrics 55:997-1004
31
Case-only Studies
• Look at departure from
Genotype      A1A1     A1A*        A*A*
Frequency     p²       2p(1-p)     (1-p)²
where p = P(A1) = p2 + ½p1
• Hardy-Weinberg Disequilibrium (HWD) test statistic:
$$\frac{\left[\hat p_2 - (\hat p_2 + \tfrac{1}{2}\hat p_1)^2\right]^2}{\text{estimated variance}} \to \chi^2_1$$
• Suggested as
  • more powerful (only cases needed)
  • more precise (signal decreases faster with distance from the causative locus)
32
Case-only Studies
• No controls
• No power in the case of a multiplicative model
$$P(\text{affected} \mid A_1A_*)^2 = P(\text{affected} \mid A_1A_1)\, P(\text{affected} \mid A_*A_*)$$
• there must be a difference in HWD between cases and controls
• therefore we consider this HWD trend test:
$$Y = \frac{\left\{\left[\hat p_2 - (\hat p_2 + \tfrac{1}{2}\hat p_1)^2\right] - \left[\hat q_2 - (\hat q_2 + \tfrac{1}{2}\hat q_1)^2\right]\right\}^2}{\text{estimated variance}}$$
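A short numerical sketch of why a multiplicative model gives no HWD signal in cases. It assumes the general population is in Hardy-Weinberg equilibrium and that the marker is itself the susceptibility locus; the allele frequency 0.3 and the function name are illustrative choices, while the penetrances are those of the multiplicative and first recessive models simulated later in the talk.

```python
def case_hwd_coefficient(p, penetrances):
    """HWD coefficient p2 - (p2 + p1/2)^2 among cases.

    p is the population frequency of A1 (population assumed in HWE);
    penetrances = (f_A1A1, f_A1A*, f_A*A*).
    """
    f2, f1, f0 = penetrances
    # Unnormalized genotype frequencies among affected individuals.
    w2 = p * p * f2
    w1 = 2 * p * (1 - p) * f1
    w0 = (1 - p) ** 2 * f0
    total = w2 + w1 + w0
    p2, p1 = w2 / total, w1 / total
    return p2 - (p2 + 0.5 * p1) ** 2

multiplicative = case_hwd_coefficient(0.3, (0.81, 0.045, 0.0025))
recessive = case_hwd_coefficient(0.3, (1.00, 0.10, 0.10))
```

Under the multiplicative penetrances the coefficient is zero up to rounding error, whereas under the recessive penetrances it is clearly positive.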
33
$$Y_1 = \frac{\hat b^2}{\mathrm{var}(\hat b)}$$
$$Y_2 = \frac{\hat d^2}{\mathrm{var}(\hat d)}$$
34
Weighted average of the Cochran-Armitage trend test and the HWD trend test statistics
$$Y = \frac{\left[w\,|\hat b| + (1 - w)\,|\hat d|\right]^2}{w^2\,\mathrm{var}(\hat b) + (1 - w)^2\,\mathrm{var}(\hat d) + 2w(1 - w)\,\mathrm{cov}(|\hat b|, |\hat d|)}$$
We want to give more weight to b or d, whichever yields the larger signal
Therefore take
$$w = \frac{Y_1}{Y_1 + Y_2}$$
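A sketch of how the weighted average could be assembled once the two signals and their variances are available. The function name and argument list are hypothetical, and the covariance of |b| and |d| is taken as an input because the slides do not give an estimator for it.

```python
def weighted_average_statistic(b, d, var_b, var_d, cov_abs):
    """Return (Y, w) for trend signal b and HWD trend signal d.

    cov_abs is the covariance of |b| and |d|, supplied by the caller.
    """
    y1 = b * b / var_b          # Cochran-Armitage trend statistic
    y2 = d * d / var_d          # HWD trend statistic
    if y1 + y2 == 0:
        raise ValueError("both signals are zero; the weight is undefined")
    w = y1 / (y1 + y2)          # more weight to whichever signal is larger
    numerator = (w * abs(b) + (1 - w) * abs(d)) ** 2
    denominator = (w * w * var_b + (1 - w) ** 2 * var_d
                   + 2 * w * (1 - w) * cov_abs)
    return numerator / denominator, w

# Made-up inputs: when d carries no signal, all weight goes to b.
y_only_b, w_only_b = weighted_average_statistic(2.0, 0.0, 1.0, 1.0, 0.0)
y_equal, w_equal = weighted_average_statistic(2.0, 2.0, 1.0, 1.0, 0.0)
```

Because w depends on the data, Y does not follow a standard chi-square distribution, which is why its null distribution has to be studied by simulation.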
35
• To investigate the null distribution of this average we simulate many different situations – sample sizes up to 10,000 cases and 10,000 controls – and generate $\hat p_0, \hat p_1, \hat p_2$ for cases and $\hat q_0, \hat q_1, \hat q_2$ for controls
• For all situations considered, the distribution is well approximated by a Gamma distribution
36
• As the sample size and marker allele frequency increase, the largest mean and the smallest variance occur for 10,000 cases and 10,000 controls, and for a marker allele frequency 0.5
• For 10,000 cases and 10,000 controls, and marker allele frequency 0.5, the upper tail of the distribution is well approximated by a Gamma distribution with mean μ = 1.78 and variance σ2 = 3.45
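The Gamma distribution quoted above can be put in shape and scale form by matching moments, since a Gamma distribution has mean shape × scale and variance shape × scale². This is a sketch with a hypothetical function name; it only converts the two quoted numbers and does not reproduce the fitting procedure used for the upper tail.

```python
def gamma_from_moments(mean, variance):
    """Shape and scale of the Gamma distribution with the given mean and variance."""
    shape = mean ** 2 / variance
    scale = variance / mean
    return shape, scale

shape, scale = gamma_from_moments(1.78, 3.45)
```

For mean 1.78 and variance 3.45 this gives a shape of about 0.918 and a scale of about 1.938.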
37
• We develop a prediction equation to determine percentiles of the null distribution for smaller sample sizes and marker allele frequencies
• We base goodness of fit on the root mean squared error (RMSE) of log_e α, calculated for various sample size combinations, from the variance among 50 replicate samples:
$$\text{RMSE} = \left[\frac{1}{50}\sum_{k=1}^{50}\left(\log_e \hat\alpha_k - \log_e \alpha\right)^2\right]^{1/2}$$
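As a minimal sketch of this goodness-of-fit measure, the function below computes the RMSE on the log scale from a list of replicate estimates of α. The function name and the toy inputs are assumptions for illustration.

```python
import math

def rmse_log_alpha(alpha_hats, alpha_true):
    """Root mean squared error of log(alpha_hat) about log(alpha_true)."""
    squared_errors = [(math.log(a) - math.log(alpha_true)) ** 2
                      for a in alpha_hats]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

# Toy check: 50 replicates that are all off by a factor of e give RMSE 1.
replicates = [0.001 * math.e] * 50
rmse = rmse_log_alpha(replicates, alpha_true=0.001)
```

Working on the log scale means the error is measured as a multiplicative factor, which suits very small significance levels.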
38
• With ~90% confidence, the true log_e α lies in the interval log_e α̂ ± 1.645(RMSE), i.e., α̂ is within e^{±1.645(RMSE)}-fold of the true α
• For total sample size (R + S) 200 or larger and α = 0.0001 or larger, in the very worst case (R = S = 100, α = 0.0001) with 90% confidence α̂ could differ from the true α by a factor of at most ~ 4.8
• The average RMSE is 0.35, corresponding to α̂ being between 78% and 122% of the true α with 90% confidence
39
POWER
Genetic Models Simulated
                        Probability of being affected given
    Model               A1A1     A1A*     A*A*
1   Recessive 1         1.00     0.10     0.10
2   Recessive 2         1.00     0.05     0.05
3   Additive            1.00     0.50     0.00
4   Multiplicative      0.81     0.045    0.0025
• Marker loci placed at distances 0 – 6 cM from the disease susceptibility locus
• For type I error, no association between the disease and marker loci
• Each simulated population contains 500,000 individuals allowed to randomly mate for 50 generations after the appearance of a disease mutation
40
Tests Performed
Homogeneous populations
• HWD, cases only
• Allele test
• Allele test x HWD in cases
• HWD trend test
• Cochran-Armitage trend test
• Cochran-Armitage trend test x HWD trend test
• Weighted average
Population stratification
• Cochran-Armitage trend test with genomic control
• Product of this and the HWD trend test
• Weighted average with genomic control
41
Type I error, homogeneous population
∆ HWD test, cases only
▲ product of the allele test and HWD test
42
Type I error, population stratification
○ allele test
◊ Cochran-Armitage trend test
▲ product of the allele test and HWD test
■ weighted average test
● product of the Cochran-Armitage trend test and the HWD test
43
Power, homogeneous population
■ weighted average test
44
Power, population stratification
□ HWD trend test
♦ CA test with genomic control
■ weighted average with genomic control
45
Conclusions
• Under recessive inheritance, the weighted average has better performance than either the Cochran-Armitage trend test or the HWD trend test
• Has good performance for other models as well
• The product of the Cochran-Armitage trend test statistic and the HWD test statistic (cases only) has better power, but has inflated Type I error if there is population stratification
• The weighted average has good overall properties and automatically controls for marker mistyping
46
With acknowledgment to
Kijoung Song
47
Can we use evolutionary models, when we have large amounts of genetic data on a sample of cases and controls, to obtain a more powerful way of detecting loci involved in the etiology of disease?
Will these models bear fruit or nuts?