Post on 19-Dec-2015
AARHUS UNIVERSITET
Faculty of Agricultural Sciences
Analysis of whole genome association studies in pedigreed populations
Goutam Sahana
Genetics and Biotechnology Faculty of Agricultural Sciences
Aarhus University, 8830 Tjele, Denmark
Concept of mapping
Identification of the genetic variant underlying disease susceptibility or a trait value (= causal variant)
Evidence for the location of the gene
Approaches to Mapping
1. Candidate gene studies
   • Association
   • Resequencing approaches
2. Genome-wide studies
   • Linkage analysis
   • Genome-wide association studies (linkage disequilibrium, LD mapping)
Linkage mapping
Look for marker alleles that are correlated with the phenotype within a pedigree
Different alleles can be connected with the trait in different pedigrees
Association mapping
Marker alleles are correlated with a trait on a population level
Can detect association by looking at unrelated individuals from a population
Does not necessarily imply that markers are linked to (are close to) genes influencing the trait.
Linkage vs. association
[Figure: effect size plotted against frequency of the causal variant. Region labels: Very difficult; Linkage analysis; Unlikely to exist; Association study.]
Modified from D. Altschuler
Linkage vs. association
Potential advantage                                                        Linkage   Association
No prior information regarding gene function required                     +         +
Localization to small genomic region                                      -         +
Not susceptible to effects of stratification                              +         -/+
Sufficient power to detect common alleles of modest effect (MAF > 5%)     -/+       +
Ability to detect rare alleles (MAF < 1%)                                 +         -
Tools for analysis available                                              +         +/-
Hirschhorn & Daly, Nature Rev. Genet. 2005
Allelic Association
Direct association: the allele of interest is itself involved in the phenotype
Indirect association: the allele itself is not involved, but is in LD with the functional variant
Spurious association: confounding factors (e.g., population stratification)
Linkage disequilibrium
Non-random association between alleles at different loci. Loci are in LD if alleles are present on haplotypes in different proportions than expected based on allele frequencies
Two alleles that are in LD occur together more often than would be expected by chance
Linkage disequilibrium
Locus A: alleles A and a; frequencies $p_A$ and $p_a$
Locus B: alleles B and b; frequencies $p_B$ and $p_b$
Possible haplotypes:    AB            Ab            aB            ab
Expected frequencies:   $p_A p_B$     $p_A p_b$     $p_a p_B$     $p_a p_b$
Observed frequencies:   $p_{AB}$      $p_{Ab}$      $p_{aB}$      $p_{ab}$
$D = p_{AB} - p_A p_B \neq 0$
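As a rough illustration of the definition of D, the sketch below computes D from an observed haplotype frequency and the two allele frequencies. It also reports |D'| and r², two standard normalizations of D that are not defined on the slide; the function name and the example frequencies are hypothetical.

```python
def ld_measures(p_AB, p_A, p_B):
    """Return (D, |D'|, r^2) for two biallelic loci.

    p_AB is the observed frequency of haplotype AB; p_A and p_B are the
    frequencies of alleles A and B. Both loci are assumed polymorphic.
    """
    D = p_AB - p_A * p_B
    # Largest value |D| can take given the allele frequencies.
    if D >= 0:
        d_max = min(p_A * (1 - p_B), (1 - p_A) * p_B)
    else:
        d_max = min(p_A * p_B, (1 - p_A) * (1 - p_B))
    d_prime = abs(D) / d_max if d_max > 0 else 0.0
    r2 = D ** 2 / (p_A * (1 - p_A) * p_B * (1 - p_B))
    return D, d_prime, r2


# Hypothetical example: alleles A and B always occur on the same haplotype.
print(ld_measures(0.5, 0.5, 0.5))
# Hypothetical example: haplotype frequency equals the product of allele frequencies.
print(ld_measures(0.25, 0.5, 0.5))
```

When the observed haplotype frequency equals the product of the allele frequencies, D is zero and the loci are in linkage equilibrium.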
LD variation across genome
The extent of LD is highly variable across the genome
The determinants of LD are not fully understood.
Factors that are believed to influence LD
• Genetic drift
• Population growth
• Admixture or migration
• Selection
• Variable recombination rates
Haplotype
Genotypes
Locus1   2 4
Locus2   1 3
Locus3   3 2
Locus4   4 1
Locus5   2 3
Locus6   1 2
Identification of phase
Haplotypes
Haplotype 1   4 1 3 1 2 2
Haplotype 2   2 3 2 4 3 1
Software: PHASE, BEAGLE
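To show why phase has to be inferred, the sketch below enumerates every haplotype pair that is compatible with the unphased genotypes in the example above. This is brute-force enumeration for illustration only; it is not the statistical method used by PHASE or BEAGLE, and the function name is hypothetical.

```python
from itertools import product


def compatible_haplotype_pairs(genotypes):
    """Enumerate all unordered haplotype pairs consistent with unphased genotypes.

    genotypes is a list of (allele, allele) tuples, one per locus.
    """
    pairs = set()
    for choice in product((0, 1), repeat=len(genotypes)):
        # At each locus, one allele goes to haplotype 1 and the other to haplotype 2.
        hap1 = tuple(g[c] for g, c in zip(genotypes, choice))
        hap2 = tuple(g[1 - c] for g, c in zip(genotypes, choice))
        pairs.add(frozenset((hap1, hap2)))
    return pairs


# Genotypes at the six loci from the slide.
slide_genotypes = [(2, 4), (1, 3), (3, 2), (4, 1), (2, 3), (1, 2)]
pairs = compatible_haplotype_pairs(slide_genotypes)
slide_pair = frozenset(((4, 1, 3, 1, 2, 2), (2, 3, 2, 4, 3, 1)))

print(len(pairs))           # number of candidate phase configurations
print(slide_pair in pairs)  # the haplotype pair shown on the slide is one of them
```

With k heterozygous loci there are 2^(k-1) candidate configurations, which is why phasing software relies on pedigree or population information rather than enumeration.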
Haplotype-based analysis
Increased ability to identify regions that are shared identical by descent among affected individuals
Haplotypes may be the causative ‘composite allele’ rather than a particular nucleotide at a particular SNP
Haplotype analysis is meaningful only if the SNPs are themselves in LD
Monogenic trait
Mutation in single gene is both necessary and sufficient to produce the phenotype or to cause the disease
The impact of the gene on genetic risk is the same in all families
Follows a clear segregation pattern in families
Typically rare in population
Complex trait
Multiple genes lead to genetic predisposition to a phenotype
Pedigree reveals no Mendelian pattern
Any particular gene mutation is neither sufficient nor necessary to explain the phenotype
Environment has major contribution
We study the relative impact of individual genes on the phenotype
Some examples
Disease              Mendelian (M) / Complex (C)   No. of genes   Incidence (in 100,000)
Cystic fibrosis      M                             1              40
Huntington disease   M                             1              5-10
Diabetes, type 2     C                             ?              10,000-20,000
Alzheimer            C                             ?              20,000
Schizophrenia        C                             ?              1000
Quantitative Trait
A biological trait that shows continuous variation rather than falling into distinct categories
Quantitative trait locus (QTL): a genetic locus that is associated with variation in such a quantitative trait
Assessing genetic contributions to complex traits
Continuous characters (weight, blood pressure)
• Heritability: proportion of observed variance in phenotype explained by genetic factors
Discrete characters (disease)
• Relative risk ratio: λ = risk to a relative of an affected individual / risk in the general population
• λ encompasses all genetic and environmental effects, not just those due to any single locus
Factors that influence identification of allelic association
• Effect size
• Linkage disequilibrium
• Disease and marker allele frequencies
• Sample size
Reviewed by Zondervan & Cardon, Nature Rev. Genet. 2004
Sample size
Zondervan & Cardon (Nature Rev. Genet. 2004)
Disease allele freq.   Marker allele freq.   Odds ratio 3.0   Odds ratio 2.0   Odds ratio 1.3
0.2                    0.2                   150              360              2900
0.2                    0.5                   430              1250             11,000
0.05                   0.2                   1170             4150             40,000
0.05                   0.5                   4200             15,000           160,000
No. of cases = no. of controls; D’ = 0.7; power 80%; α = 0.001
Population stratification
Consider two case/control samples, genotyped at a marker with alleles M and m

Sample A
             M      m      Freq.
Affected     50     50     0.10
Unaffected   450    450    0.90
Freq.        0.50   0.50
χ²: NS

Sample B
             M      m      Freq.
Affected     1      9      0.01
Unaffected   99     891    0.99
Freq.        0.10   0.90
χ²: NS
Population stratification
Sample A and Sample B combined
             M      m      Freq.
Affected     51     59     0.055
Unaffected   549    1341   0.945
Freq.        0.30   0.70
χ² = 14.8
P < 0.001
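The arithmetic behind the stratification example can be checked with a short sketch. It computes the Pearson χ² statistic (1 df, no continuity correction) for Sample A, Sample B and the pooled sample, using the counts from the slides. The function names are hypothetical, and the 1-df p-value uses the identity P(χ²₁ > x) = erfc(√(x/2)).

```python
import math


def chi2_2x2(a, b, c, d):
    """Pearson chi-square (1 df, no continuity correction) for [[a, b], [c, d]]."""
    n = a + b + c + d
    row1, row2 = a + b, c + d
    col1, col2 = a + c, b + d
    if 0 in (row1, row2, col1, col2):
        raise ValueError("every row and column must contain observations")
    return n * (a * d - b * c) ** 2 / (row1 * row2 * col1 * col2)


def p_value_1df(chi2):
    """Upper-tail probability of a chi-square distribution with 1 df."""
    return math.erfc(math.sqrt(chi2 / 2.0))


# Counts from the slides: (affected M, affected m, unaffected M, unaffected m).
sample_a = (50, 50, 450, 450)
sample_b = (1, 9, 99, 891)
pooled = tuple(x + y for x, y in zip(sample_a, sample_b))

for name, counts in (("A", sample_a), ("B", sample_b), ("A + B", pooled)):
    stat = chi2_2x2(*counts)
    print(name, round(stat, 1), p_value_1df(stat))
```

Neither sample shows any association on its own, yet pooling two samples that differ in both disease frequency and allele frequency produces a significant statistic.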
Dealing with population structure
Genomic control (Devlin and Roeder, 1999)
• Stratification inflates the distribution of the test statistic by a factor λ; λ is estimated from the data
• [Figure: χ² relative to E(χ²) at unlinked ‘null’ markers and at the test locus, without and with stratification]
• Adjust test statistics
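A minimal sketch of the genomic control adjustment follows. It assumes 1-df χ² statistics and estimates λ as the median statistic at the null markers divided by the median of a χ² distribution with 1 df (about 0.455). The median estimator and the floor of 1.0 are common choices rather than details given on the slide, and the function names and example statistics are hypothetical.

```python
from statistics import median

# Median of a chi-square distribution with 1 degree of freedom.
CHI2_1DF_MEDIAN = 0.4549364


def genomic_control_lambda(null_chi2):
    """Estimate the inflation factor from statistics at unlinked 'null' markers."""
    lam = median(null_chi2) / CHI2_1DF_MEDIAN
    # Assumption: never deflate, so lambda is bounded below by 1.
    return max(lam, 1.0)


def adjust_statistic(chi2, lam):
    """Scale the test-locus statistic by the estimated inflation factor."""
    return chi2 / lam


# Hypothetical null-marker statistics whose median is twice the expected median.
null_stats = [0.2, 0.9098728, 3.0]
lam = genomic_control_lambda(null_stats)
print(lam, adjust_statistic(14.8, lam))
```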
Dealing with population structure
Structured association (Pritchard et al., 2000)
• Discover structure from a set of unlinked markers, i.e. assign probabilities of ancestry from k populations to each individual, and then control for it.
Association analysis approaches
Case-control studies
• Marker allele frequencies are determined in a group of affected individuals and compared with allele frequencies in a control population
Family-based methods
• Based on unequal transmission of alleles from parents to a single affected child in each family. Associations are summed over many unrelated families
Case-control studies: χ² test

Genotypes (2x3 contingency table)
        11    12    22    Total
Case    n11   n12   n22   N
Ctrl    m11   m12   m22   M
Total   T11   T12   T22   N+M

Alleles (2x2 contingency table)
        1     2     Total
Case    n1    n2    2N
Ctrl    m1    m2    2M
Total   T1    T2    2(N+M)

Test of independence:
$\chi^2 = \sum (O - E)^2 / E$ with 2 or 1 df
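The two tests can be sketched as follows: a generic Pearson χ² for any contingency table, plus a helper that collapses genotype counts into allele counts so the same function serves both the 2x3 and the 2x2 table. The genotype counts are invented for illustration, the function names are hypothetical, and the sketch assumes no row or column is empty.

```python
def pearson_chi2(table):
    """Return (chi-square, df) for a test of independence on an r x c table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            chi2 += (observed - expected) ** 2 / expected
    df = (len(row_totals) - 1) * (len(col_totals) - 1)
    return chi2, df


def allele_counts(n11, n12, n22):
    """Collapse genotype counts (11, 12, 22) into counts of alleles 1 and 2."""
    return [2 * n11 + n12, n12 + 2 * n22]


# Hypothetical genotype counts for 100 cases and 100 controls.
cases = (30, 50, 20)
controls = (20, 50, 30)

genotype_test = pearson_chi2([list(cases), list(controls)])
allele_test = pearson_chi2([allele_counts(*cases), allele_counts(*controls)])
print(genotype_test, allele_test)
```

The allele-based test treats the two alleles of an individual as independent observations, which is an additional assumption compared with the genotype-based test.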
Family based tests
Genotypes from independent family trios where the child is affected
Use the non-transmitted genotypes or alleles as internal controls to the transmitted ones
Family-based association studies
[Figure: family trio with parental alleles 1 2 and 3 4 and an affected child (? ?). Labels in the figure: 1 4 transmitted; 2 3 non-transmitted; 1 4 control.]
Is an allele transmitted to affected offspring more often than it is not transmitted?
TDT: Transmission Disequilibrium Test
[Figure: trio with parents G/G and G/g and an affected child G/g]

                    Non-transmitted
                    G      g
Transmitted   G     a      b
              g     c      d

$TDT_G = (T_G - NT_G)^2 / (T_G + NT_G) = (b - c)^2 / (b + c) \sim \chi^2_1$
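The statistic above can be sketched in a few lines. The code tallies, for each parent, which allele was transmitted and which was not, and then uses only the off-diagonal cells b and c, which come from heterozygous parents. The function names and the transmission counts are hypothetical.

```python
def tdt_counts(transmissions):
    """Tally (transmitted, non_transmitted) allele pairs into cells a, b, c, d."""
    cells = {("G", "G"): 0, ("G", "g"): 0, ("g", "G"): 0, ("g", "g"): 0}
    for transmitted, non_transmitted in transmissions:
        cells[(transmitted, non_transmitted)] += 1
    return cells[("G", "G")], cells[("G", "g")], cells[("g", "G")], cells[("g", "g")]


def tdt_statistic(b, c):
    """TDT statistic (b - c)^2 / (b + c), approximately chi-square with 1 df."""
    if b + c == 0:
        raise ValueError("no informative (heterozygous) parents")
    return (b - c) ** 2 / (b + c)


# Hypothetical data: 60 heterozygous parents transmit G, 40 transmit g,
# and 25 homozygous G/G parents are uninformative.
data = [("G", "g")] * 60 + [("g", "G")] * 40 + [("G", "G")] * 25
a, b, c, d = tdt_counts(data)
print(a, b, c, d, tdt_statistic(b, c))
```

Homozygous parents fall in cells a and d and do not contribute to the statistic.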
TDT: Transmission Disequilibrium Test
Multiallelic markers        ETDT (Sham & Curtis, 1995)
Missing parent genotypes    TRANSMIT (Clayton, 1999)
Haplotypes                  TDTHAP (Clayton & Jones, 1999)
Sibs                        TDT/STDT (Spielman & Ewens, 1998)
Pedigrees                   PBAT (Martin et al., 2000)
Quantitative traits         QTDT (Abecasis et al., 2000)
Some limitations
• Subjects: random or structured family
• Parents not available
• Difficult when there are very many genes individually of small effect
• Environmental influence may obscure genetic effects
• Genetic heterogeneity underlying disease phenotype
• Hidden (unaccounted) relationship
Complex pedigree
Non-independence among pedigree members
The polygenic relationship alone is not sufficient
Association analysis should account for the point-wise relationship among individuals
Identical-by-descent probabilities
Methods
• Combined linkage and LD
• Generalized linear models
• Mixed-model (Yu et al. 2006)
• Bayesian approach
Combined linkage and LD
• Polygene: the whole relationship in the pedigree is used
• Identical-by-descent coefficients were estimated for the point-wise relationship
Phenotype = Fixed factors + Polygene + Haplotype
Phase determination - GDQTL
QTL mapping - DMU
QTL for Clinical Mastitis in cattle
[Figure: LRT (axis from 0 to 16) plotted against map position in Morgan (0.0 to 1.0). Profiles shown in successive builds of the slide: LA; LA and LD; LA, LD and LD/LA.]
Simulation
• 100 half-sib families (dairy cattle pedigree)
• 2000 progeny
• 5 chromosomes, 100 cM each
• 5000 SNPs
• 15 QTL (1 QTL - 10%, 4 QTL - 5%, 10 QTL - 2%), explaining 50% of the genetic variance
• Heritability - 30%
Generalized linear models
Phenotype= Sire-family + genotype
Software - TASSEL
http://www2.maizegenetics.net/index.php?page=bioinformatics/tassel
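As a rough sketch of the model Phenotype = Sire-family + genotype, the code below estimates the additive effect of a single SNP with sire family fitted as a fixed effect. It centers genotype codes and phenotypes within each sire family and then takes the least-squares slope, which gives the same SNP estimate as fitting one intercept per family. This is not TASSEL's implementation; the function name, the 0/1/2 genotype coding and the data are assumptions made for illustration.

```python
from collections import defaultdict


def within_family_snp_effect(records):
    """Least-squares SNP effect with sire family as a fixed effect.

    records is an iterable of (sire_id, genotype_code, phenotype), where
    genotype_code counts copies of one allele (assumed 0/1/2 coding).
    """
    by_sire = defaultdict(list)
    for sire, genotype, phenotype in records:
        by_sire[sire].append((genotype, phenotype))

    sxy = 0.0
    sxx = 0.0
    for members in by_sire.values():
        g_mean = sum(g for g, _ in members) / len(members)
        y_mean = sum(y for _, y in members) / len(members)
        for g, y in members:
            # Centering within family removes the sire-family effect.
            sxy += (g - g_mean) * (y - y_mean)
            sxx += (g - g_mean) ** 2
    if sxx == 0:
        raise ValueError("no within-family genotype variation")
    return sxy / sxx


# Hypothetical progeny: phenotype = family mean + 2 per allele copy.
progeny = [
    ("sire1", 0, 10.0), ("sire1", 1, 12.0), ("sire1", 2, 14.0),
    ("sire2", 0, 20.0), ("sire2", 2, 24.0),
]
print(within_family_snp_effect(progeny))
```

Fitting the sire family removes differences between family means, so only within-family contrasts contribute to the SNP estimate.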
Mixed-model (Yu et al. 2006)
Phenotype= Fixed factors + SNP + Population + polygene
SAS mixed model (Gael Pressoir)
[Diagram labels attached to the model terms: Relationship; 0 1 2; STRUCTURE]
Bayesian approach
Phenotype= Fixed factors + Polygene + Allele or Haplotype
• All markers are fitted simultaneously; search for the marker combination that explains the trait variation
• Avoid multiple testing
Software – iBays (Janss LLG, 2007)
Multiple testing
Performing one test at an alpha level of 0.05 implies a 5% chance of rejecting a true null hypothesis (false positive, FP)
Performing 100 tests at α = 0.05 when all 100 H0 are true, we expect 5 of the tests to give FP results
$\Pr(\text{at least one FP}) = 1 - \Pr(\text{no FP}) = 1 - 0.95^{100} = 0.994$ (if the tests are independent)
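The calculation above generalizes to any number of independent tests. The short sketch below reproduces the slide's figures; the function names are hypothetical.

```python
def expected_false_positives(alpha, n_tests):
    """Expected number of false positives when every null hypothesis is true."""
    return alpha * n_tests


def prob_at_least_one_fp(alpha, n_tests):
    """Probability of at least one false positive among independent tests."""
    return 1.0 - (1.0 - alpha) ** n_tests


print(expected_false_positives(0.05, 100))
print(round(prob_at_least_one_fp(0.05, 100), 3))
```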
Multiple testing
Bonferroni correction
• Rejection level of each test is α/m
Permutation test
False discovery rate (FDR)
• What proportion of rejections occur when H0 is true?
• Of all the times you reject H0, how often is H0 true?
• q value (Storey et al. PNAS 2003)
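A minimal sketch of two of these corrections follows: the Bonferroni per-test threshold α/m, and FDR-adjusted p-values using the Benjamini-Hochberg step-up procedure. Benjamini-Hochberg is used here as a simpler stand-in for the q value of Storey et al., which additionally estimates the proportion of true null hypotheses. The function names and p-values are hypothetical.

```python
def bonferroni_threshold(alpha, m):
    """Per-test rejection level that keeps the familywise error rate at alpha."""
    return alpha / m


def benjamini_hochberg(p_values):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Hypothetical p-values from three tests.
p = [0.01, 0.04, 0.03]
threshold = bonferroni_threshold(0.05, len(p))
print([pv <= threshold for pv in p])
print(benjamini_hochberg(p))
```

Bonferroni controls the chance of any false positive, whereas the FDR approach controls the expected proportion of false positives among the rejected hypotheses, so it is less conservative when many markers are tested.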