Introduction to Gene-Finding: Linkage and Association Danielle Dick, Sarah Medland, (Ben Neale)

Introduction to Gene-Finding: Linkage and Association

Danielle Dick, Sarah Medland, (Ben Neale)

Aim of QTL mapping…

LOCALIZE and then IDENTIFY a locus that regulates a trait (QTL)

• Locus: Nucleotide or sequence of nucleotides with variation in the population, with different variants associated with different trait levels.

Location and Identification

• Linkage

• localize region of the genome where a QTL that regulates the trait is likely to be harboured

• Family-specific phenomenon: Affected individuals in a family share the same ancestral predisposing DNA segment at a given QTL

Location and Identification

• Association

• identify a QTL that regulates the trait• Population-specific phenomenon: Affected

individuals in a population share the same ancestral predisposing DNA segment at a given QTL

Linkage

Overview

Progress of the Human Genome Project

Human Chromosome 4

Genetic markers (DNA polymorphisms)

ATGCTTGCCACGCE

ATGCTTGCCATGCE

Single Nucleotide Polymorphism

ATGCTTGCCACGCE

ATGCTTCTTGCCATGCE

Microsatellite Markerscan be di(2), tri(3), or tetra (4)nucleotide repeats

DNA polymorphisms

Can occur in gene, but be silent

Can change gene product (protein) Alter amino acid sequence (a lot or a little)

Can regulate gene product Upregulate or downregulate protein production Turn off or on gene

Can occur in noncoding region This happens most often!

Mutations

How do we map genes?

Deviation from Mendel’s Independent Assortment Law Aa & Bb = ¼ AB, ¼ Ab, ¼ aB, ¼ ab

We’re looking for variation from this

Recombination

Recombination

Another way of introducing genetic diversity

Allows us to map genes!

Crossovers more likely to occur between genes that are further away; likelihood of a recombination event is proportional to the distance Interference – tend not to see 2 crossovers in a small area

Alleles that are very close together are more likely to stay together, don’t assort independently

Linkage Mapping (is a marker “linked” to the disease gene)

Collect families with affected individuals

Genome Scan - Test markers evenly spaced across the entire genome (~every 10cM, ~400 markers)

Lod score (“log of the odds”) – what are the odds of observing the family marker data if the marker is linked to the disease (less recombination than expected) compared to if the marker is not linked to the disease

Thomas Hunt Morgan – discoverer of linkage

Linkage = Co-segregation

A2A4

A3A4

A1A3

A1A2

A2A3

A1A2 A1A4 A3A4 A3A2

Marker allele A1

cosegregates withdominant disease

Lod scores

>3.0 evidence for linkage

<-2.0 can rule out linkage

In between – inconclusive, collect more families

Linkage = Co-segregation

A2A4

A3A4

A1A3

A1A2

A2A3

A1A2 A1A4 A3A4 A3A2

•Parametric Linkage used very successfully to map disease genes for Mendelian disorders

•Problematic for complex disorders: requires disease model, penetrance, assumes gene of major effect, phenotypic precision

Nonparametric Linkage

Based on allele-sharing

More appropriate for phenotypes with multiple genes of small effect, environment, no disease model assumed

Basic unit of data: affected relative (often sibling) pairs

x

1/4 1/4 1/4 1/4

IDENTITY BY DESCENT

Sib 1

Sib 2

4/16 = 1/4 sibs share BOTH parental alleles IBD = 2

8/16 = 1/2 sibs share ONE parental allele IBD = 1

4/16 = 1/4 sibs share NO parental alleles IBD = 0

2

2

2

2

1

1

1 1

1

1

1 10

0

0

0

Genotypic similarity between relatives

IBS Alleles shared Identical By State “look the same”, may have the

same DNA sequence but they are not necessarily derived from a

known common ancestor - focus for associationIBD Alleles shared

Identical By Descent

are a copy of the

same ancestor allele

- focus for linkage

M1

Q1

M2

Q2

M3

Q3

M3

Q4

M1

Q1

M3

Q3

M1

Q1

M3

Q4

M1

Q1

M2

Q2

M3

Q3

M3

Q4

IBS IBD

2 1

Genotypic similarity – basic principals Loci that are close together are more likely to be

inherited together than loci that are further apart

Loci are likely to be inherited in context – ie with their surrounding loci

Because of this, knowing that a loci is transmitted from a common ancestor is more informative than simply observing that it is the same allele Critical to have parental data when possible

Linkage Markers…

For disease traits (affected/unaffected)

Affected sib pairs selected

IBD = 2IBD = 1IBD = 0

1000

250

750

500

Expected 1 2 3 127 310

Markers

For continuous measures

Unselected sib pairs1.00

0.25

0.75

0.50

IBD = 0 IBD = 1 IBD = 2

Co

rre

lati

on

be

twee

n s

ibs

0.00

So how does all this fit into Mx?

IDENTITY BY DESCENT

Sib 1

Sib 2

4/16 = 1/4 sibs share BOTH parental alleles IBD = 2

8/16 = 1/2 sibs share ONE parental allele IBD = 1

4/16 = 1/4 sibs share NO parental alleles IBD = 0

2

2

2

2

1

1

1 1

1

1

1 10

0

0

0

T2

AEA C

a ac e

T1

1 111

E

e

1

C

c

1

1 or .5

1

In biometrical modeling A is correlated at 1 for MZ twins and .5 for DZ twins .5 is the average genome-wide sharing of genes

between full siblings (DZ twin relationship)

In linkage analysis we will be estimating an additional variance component Q For each locus under analysis the coefficient of

sharing for this parameter will vary for each pair of siblings The coefficient will be the probability that the pair of

siblings have both inherited the same alleles from a common ancestor π̂

Q A C E

PTwin1

E C A Q

PTwin2

MZ=1.0 DZ=0.5

MZ & DZ = 1.0

1 1 1 11 1 1 1

q a c e e c a q

π̂

Linkage

How do we do this?

1.Genotyping data.

Break down of time spent during a linkage/ association study

Cleaning &preparinggenotype dataRuning linkageanalyses

Estimatingsignificance

Microsatellite data Ideally positioned at equal genetic distances

across chromosome Mostly di/tri nucleotide repeats

http://research.marshfieldclinic.org/genetics/GeneticResearch/screeningsets.asp

Microsatellite data Raw data consists of allele lengths/calls (bp) Different primers give different lengths

So to compare data you MUST know which primers were used

http://research.marshfieldclinic.org/genetics/GeneticResearch/screeningsets.asp

Binning Raw allele lengths are converted to allele

numbers or lengths Example:D1S1646 tri-nucleotide repeat size

range130-150 Logically: Work with binned lengths Commonly: Assign allele 1 to 130 allele, 2 to 133 allele … Commercially: Allele numbers often assigned based on

reference populations CEPH. So if the first CEPH allele was 136 that would be assigned 1 and 130 & 133 would assigned the next free allele number

Conclusions: whenever possible start from the RAW allele size and work with allele length

Error checking

After binning check for errors Family relationships (GRR, Rel-pair) Mendelian Errors (Sib-pair) Double Recombinants (MENDEL, ASPEX,

ALEGRO) An iterative process

‘Clean’ data ped file

Family, individual, father, mother, sex, dummy, genotypes

Estimating genotypic sharing… The ped file is used with ‘map’ files to obtain

estimates of genotypic sharing between relatives at each of the locations under analysis

Estimating genotypic sharing…

Merlin will give you probabilities of sharing 0, 1, 2 alleles for every pair of individuals.

Estimating genotypic sharing… Output


Why isn’t P0, P1, P2 exact for everyone?


Why isn’t P0, P1, P2 exact for everyone?

-missing parental genotypes-low informativeness at marker

1/2 2/2

1/2 2/2

Q A C E

PTwin1

E C A Q

PTwin2

MZ=1.0 DZ=0.5

MZ & DZ = 1.0

1 1 1 11 1 1 1

q a c e e c a q

π̂

Genotypic similarity between relatives

IBD Alleles shared Identical By Descent are a copy of the same ancestor allele

Pairs of siblings may share 0, 1 or 2 alleles IBD

The probability of a pair of relatives being IBD is known as pi-hat

M1

Q1

M2

Q2

M3

Q3

M3

Q4

M1

Q1

M3

Q3

M1

Q1

M3

Q4

M1

Q1

M2

Q2

M3

Q3

M3

Q4

IBS IBD

2 1

ˆ ( 2) .5* ( 1)p IBD p IBDπ = +

Estimating genotypic sharing… Output ˆ ( 2) .5* ( 1)p IBD p IBDπ = +

ˆ ?π =

ˆ ?π =

Distribution of pi-hat

Adult Dutch DZ pairs: distribution of pi-hat at 65 cM on chromosome 19 < 0.25: IBD=0 group > 0.75: IBD=2 group others: IBD=1 group pi65cat= (0,1,2)

π̂

π̂π̂

Linkage Analyses

Advantage Systematically scan the genome

Disadvantages Not very powerful Need hundreds – thousands of family member Broad peaks

Lod scores

1cM = 1MB1MB=1000kb1kb=1000bp1cM = 1,000,000 bp

Strategy

0

0.5

1

1.5

2

2.5

0 10 20 30 40 50 60 70 80 90 100 110 120 130 140 150 160cM

Lod Scores

Wave 1

Wave 2

Combined

1. Ascertain families with multiple affecteds

2. Linkage analyses to identify chromosomal regions

3. Association analyses to identify specific genes

allele-sharing among affecteds within a family

Gene A Gene B Gene C

Linkage vs. Association

Linkage analyses look for relationship between a marker and disease within a family (could be different marker in each family)

Association analyses look for relationship between a marker and disease between families (must be same marker in all families)

Allelic Association:Allelic Association:Extension of linkage to the populationExtension of linkage to the population

3/5 2/6

3/2 5/2

3/5 2/6

3/6 5/6

Both families are ‘linked’ with the marker, but a different allele is involved

Allelic AssociationAllelic AssociationExtension of linkage to the populationExtension of linkage to the population

3/6 2/4

3/2 6/2

3/5 2/6

3/6 5/6

All families are ‘linked’ with the markerAllele 6 is ‘associated’ with disease

4/6 2/6

6/6 6/6

Localization

Linkage analysis yields broad chromosome regions harbouring many genes Resolution comes from recombination events (meioses)

in families assessed ‘Good’ in terms of needing few markers, ‘poor’ in terms

of finding specific variants involved

Association analysis yields fine-scale resolution of genetic variants Resolution comes from ancestral recombination events ‘Good’ in terms of finding specific variants, ‘poor’ in

terms of needing many markers

Allelic AssociationAllelic AssociationThree Common FormsThree Common Forms

• Direct Association• Mutant or ‘susceptible’ polymorphism• Allele of interest is itself involved in phenotype

• Indirect Association• Allele itself is not involved, but a nearby correlated

marker changes phenotype

• Spurious association• Apparent association not related to genetic aetiology

(most common outcome…)

Indirect and Direct Allelic Association

D

*

Measure disease relevance (*) directly, ignoring correlated markers nearby

Semantic distinction between Linkage Disequilibrium: correlation between (any) markers in populationAllelic Association: correlation between marker allele and trait

Direct Association

M1 M2 Mn

Assess trait effects on D via correlated markers (Mi) rather than susceptibility/etiologic variants.

D

Indirect Association & LD

Decay of Linkage Disequilibrium

Reich et al., Nature 2001

Average Levels of LD along chromosomes

0.00

0.25

0.50

0.75

1.00

0 5 10 15 20 25 30

CEPHW.EurEstonian

Chr22

Dawson et alNature 2002

Characterizing Patterns of Linkage Disequilibrium

Average LD decay vs physical distance

0.00

0.25

0.50

0.75

1.00

0 5 10 15 20 25 30

Mean trends along chromosomes

Haplotype Blocks

Linkage Disequilibrium Maps & Allelic Association

Primary Aim of LD maps: Use relationships amongst background markers (M1, M2, M3, …Mn) to learn something about D for association studies

Something = * Efficient association study design by reduced genotyping* Predict approx location (fine-map) disease loci * Assess complexity of local regions* Attempt to quantify/predict underlying (unobserved) patterns···

Marker 1 2 3 n

LD

D

Deliverables: Sets of haplotype tagging SNPs

1. Human Genome Project Good for consensus, not good for individual differences

2. Identify genetic variants Anonymous with respect to traits.

3. Assay genetic variants Verify polymorphisms, catalogue correlations amongst sites Anonymous with respect to traits

Sept 01 Feb 02 April 04 Oct 04

April 1999 – Dec 01

Oct 2002 - present

Building Haplotype Maps for Gene-finding

Haplotype Tagging for Efficient Genotyping

Cardon & Abecasis, TIG 2003

• Some genetic variants within haplotype blocks give redundant information

• A subset of variants, ‘htSNPs’, can be used to ‘tag’ the conserved haplotypes with little loss of information (Johnson et al., Nat Genet, 2001)

• … Initial detection of htSNPs should facilitate future genetic association studies

HapMap Strategy

Samples Four populations, small samples

Genotyping 5 kb initial density across genome (600K

markers) Subsequent focus on low LD regions Recent NIH RFA for deeper coverage

Hapmap validating millions of SNPs.

Are they the right SNPs?

0

0.1

0.2

0.3

0.4

0.5

0.6

1-10% 11-20% 21-30% 31-40% 41-50%

Minor allele frequency

Population frequency

0

0.1

0.2

0.3

0.4

0.5

0.6

1-10% 11-20% 21-30% 31-40% 41-50%



Frequency of public markers

Expected frequency in population

Distribution of allele frequencies in public markers is biased toward common alleles

Phillips et al. Nat Genet 2003

Updated with phase 2—more similar to expectation

Summary of Role of Linkage Disequilibrium on Association Studies Marker characterization is becoming extensive and

genotyping throughput is high

Tagging studies will yield panels for immediate use Need to be clear about assumptions/aims of each panel

Density of eventual Hapmap probably cover much of genome in high LD, but not all

Challenges

Just having more markers doesn’t mean that success rate will improve Expectations of association success via LD are too high.

Two types of association studies Case-control Family-based

Allelic AssociationAllelic Association

3/6

2/43/2

6/23/5

2/6

3/6 5/6

Allele 6 is ‘associated’ with disease

4/62/6

6/6

6/6

3/4

5/2

Controls Cases

Main BlameMain Blame

Primary Concern with Case-Control Analyses

Population stratification

Analysis of mixed samples having different allele frequencies is a primary concern in human genetics, as it leads to false evidence for allelic association.

Population Stratification

Leads to spurious association

Requirements: Group differences in allele frequencies AND Group differences in outcome

In epidemiology, this is a classic matching problem, with genetics as a confounding variable

M m Freq. Affected 51 59 .055 Unaffected 549 1341 .945 .30 .70

Population Stratification

Sample ‘B’ M m Freq. Affected 1 9 .01 Unaffected 99 891 .99 .10 .90 χ2

1 is n.s.

+

χ21 = 14.84, p < 0.001

Spurious Association

Family-based association methods

TDT – Transmission Disequilibrium Test

1/2 3/3

2/3

•50/50 chance the 2 is transmitted•Looking for overtransmission of a particular allele across affected individuals (undertransmission to unaffecteds)

TDT Advantages/Disadvantages

Detection/elimination of genotyping errors causes bias (Gordon et al., 2001)Uses only heterozygous parents Inefficient for genotyping

3 individuals yield 2 founders: 1/3 information not used

Can be difficult/impossible to collect

Late-onset disorders, psychiatric conditions, pharmacogenetic applications

Robust to stratificationGenotyping error detectable via Mendelian inconsistenciesEstimates of haplotypes possible

Advantages

Disadvantages

Association studies < 2000: TDT

• TDT virtually ubiquitous over past decadeGrant, manuscript referees & editors mandated design

• View of case/control association studies greatly diminished due to perceived role of stratification

• Case/controls, using extra genotyping• +families, when available

Association Studies 2000+ :Return to population

Detecting and Controlling for Detecting and Controlling for Population Stratification with Genetic MarkersPopulation Stratification with Genetic Markers

Idea• Take advantage of availability of large N genetic markers

• Use case/control design

• Genotype genetic markers across genome (Number depends on different factors)

• Look if any evidence for background population substructure exists and account for it

Two types of association studies Case-control

Adv: more powerful Disadv: population stratification

limited by case/control definition

Family-based Adv: population stratification not a problem Disadv: less powerful, hard to collect parents for some

phenotypes

Association Analyses vs Linkage Advantage

More powerful

Disadvantage Not systematic (in the past)

Now! Genome wide association scans

Current Association Study Challenges1) Genome-wide screen or candidate geneGenome-wide screen

Hypothesis-free High-cost: large genotyping

requirements Multiple-testing issues

Possible many false positives, fewer misses

Candidate gene Hypothesis-driven Low-cost: small genotyping

requirements Multiple-testing less

important Possible many misses,

fewer false positives

Current Association Study Challenges2) What constitutes a replication?

GOLD Standard for association studies

Replicating association results in different laboratories is often seen as most compelling piece of evidence for ‘true’ finding

But…. in any sample, we measureMultiple traitsMultiple genesMultiple markers in genes

and we analyse all this using multiple statistical tests

What is a true replication?

What is a true replication?

Association to same trait, but different gene

Association to same trait, same gene, different SNPs (or haplotypes)

Association to same trait, same gene, same SNP – but in opposite direction (protective disease)

Association to different, but correlated phenotype(s)

No association at all

Genetic heterogeneity

Allelic heterogeneity

Allelic heterogeneity/pop differences

Phenotypic heterogeneity

Sample size too small

Replication Outcome Explanation

There exist 6+ million putative SNPs in the public domain. Are they the right markers?

Current Association Study Challenges3) Do we have the best set of genetic markers

0

0.1

0.2

0.3

0.4

0.5

0.6

1-10% 11-20% 21-30% 31-40% 41-50%



0

0.1

0.2

0.3

0.4

0.5

0.6

1-10% 11-20% 21-30% 31-40% 41-50%



Frequency of public markers

Expected frequency in population

Allele frequency distribution is biased toward common alleles

Current Association Study Challenges3) Do we have the best set of genetic markers

Tabor et al, Nat Rev Genet 2003

Greatest power comes from markers that match allele freq with trait loci

Disease AlleleFrequency

Marker Allele Frequency

0.1 0.3 0.5 0.7 0.9

0.1 248 626 1306 2893 10830

0.3 1018 238 466 996 3651

0.5 2874 702 267 556 2002

0.7 9169 2299 925 337 1187

0.9 73783 18908 7933 3229 616

s = 1.5, = 5 x 10-8, Spielman TDT (Müller-Myhsok and Abel, 1997)

Current Association Study Challenges4) Integrating the sampling, LD and genetic effectsQuestions that don’t stand alone:

How much LD is needed to detect complex disease genes?

What effect size is big enough to be detected?

How common (rare) must a disease variant(s) be to be identifiable?

What marker allele frequency threshold should be used to find complex disease genes?

Complexity of System•In any indirect association study, we measure marker alleles that are correlated with trait variants…

We do not measure the trait variants themselves

•But, for study design and power, we concern ourselves with frequencies and effect sizes at the trait locus….

This can only lead to underpowered studies and inflated expectations

•We should concern ourselves with the apparent effect size at the marker, which results from

1) difference in frequency of marker and trait alleles2) LD between the marker and trait loci3) effect size of trait allele

Practical Implications of Allele Frequencies

‘Strongest argument for using common markers is not CD-CV. It is practical:

For small effects, common markers are the only ones for which sufficient sample sizes can be collected

There are situations where indirect association analysis will not work Discrepant marker/disease freqs, low LD, heterogeneity, … Linkage approach may be only genetics approach in these cases

At present, no way to know when association will/will not work Balance with linkage

Allele based test? 2 alleles 1 df

E(Y) = a + bX X = 0/1 for presence/absence

Genotype-based test? 3 genotypes 2 df

E(Y) = a + b1A+ b2D A = 0/1 additive (hom); W = 0/1 dom (het)

Haplotype-based test? For M markers, 2M possible haplotypes 2M -1 df

E(Y) = a + bH H coded for haplotype effects

Multilocus test? Epistasis, G x E interactions, many possibilities

Current Association Study Challenges5) How to analyse the data

Candidate genes: a few tests (probably correlated)

Linkage regions: 100’s – 1000’s tests (some correlated)

Whole genome association: 100,000s – 1,000,000s tests (many correlated)

What to do? Bonferroni (conservative) False discovery rate? Permutations?….Area of active research

Current Association Study Challenges6) Multiple Testing

Despite challenges: upcoming association studies hold some promise Availability of millions of genetic markers

Genotyping costs decreasing rapidly Cost per SNP: 2001 ($0.25) 2003 ($0.10) 2004

($0.01)

Background LD patterns being characterized International HapMap and other projects

Genome Wide Association Studies (GWAS) Underway: Genetic Analysis Information Network (GAIN)

Psoriasis, ADHD, Schizophrenia, Bipolar Disorder, Depression, Type 1 Diabetes

Welcome Trust Case Control Consortium Bipolar Disorder, Coronary Artery Disease, Crohn’s disease,

Rheumatoid Arthristis, Type 1 Diabetes, Type 2 Diabetes

Genes, Environment, & Health Initiative (Gene/Environment Association Studies: GENEVA) Addiction, diabetes, Heart Disease, Oral Clefts, Maternal

Metabolism and Birth Weight, Lung Cancer, Pre-Term Birth, Dental Carries

Genes, Environment, & Development Initiative (GEDI)

Introduction to Gene-Finding: Linkage and Association Danielle Dick, Sarah Medland, (Ben Neale)

Documents

Transcript of Introduction to Gene-Finding: Linkage and Association Danielle Dick, Sarah Medland, (Ben Neale)