March 2006 Vineet Bafna
CSE280b: Population Genetics
Vineet Bafna/Pavel Pevzner
www.cse.ucsd.edu/classes/sp05/cse291
Population Genetics
• Individuals in a species (population) are phenotypically different.
• Often these differences are inherited (genetic).
• Studying these differences is important!
• Q: How predictive are these differences?
EX: Population Structure
• 377 locations (loci) were sampled in 1000 people from 52 populations.
• 6 genetic clusters were obtained, 5 of which corresponded to major geographic regions (Rosenberg et al., Science 2002)
• Genetic differences can predict ethnicity.
[Figure: genetic clusters labeled by region: Africa, Eurasia, East Asia, America, Oceania]
Scope of these lectures
• Basic terminology
• Key principles
  – Sources of variation
  – HW equilibrium
  – Linkage
  – Coalescent theory
  – Recombination/Ancestral Recombination Graph
  – Haplotypes/Haplotype phasing
  – Population sub-structure
  – Structural polymorphisms
  – Medical genetics basis: Association mapping/pedigree analysis
Alleles
• Genotype: genetic makeup of an individual
• Allele: a specific variant at a location
  – The notion of alleles predates the concepts of gene and DNA.
  – Initially, alleles referred to variants that described a measurable phenotype (round/wrinkled seed)
  – Now, an allele might be a nucleotide on a chromosome, with no measurable phenotype.
• Humans are diploid: they have 2 copies of each chromosome.
  – They may have heterozygosity/homozygosity at a location
  – Other organisms (plants) have higher forms of ploidy.
  – Additionally, some sites might have 2 allelic forms, or even many allelic forms.
What causes variation in a population?
• Mutations (may lead to SNPs)
• Recombinations
• Other genetic events (gene conversion)
• Structural Polymorphisms
Single Nucleotide Polymorphisms
000001010111000110100101000101010010000000110001111000000101100110
Infinite Sites Assumption: Each site mutates at most once
Short Tandem Repeats
Sequence                     Repeats
GCTAGATCATCATCATCATTGCTAG    4
GCTAGATCATCATCATTGCTAGTTA    3
GCTAGATCATCATCATCATCATTGC    5
GCTAGATCATCATCATTGCTAGTTA    3
GCTAGATCATCATCATTGCTAGTTA    3
GCTAGATCATCATCATCATCATTGC    5
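The repeat counts on this slide can be recovered mechanically. The sketch below is only an illustration: it assumes the repeat unit is CAT (the slide does not name the unit), and the helper name `longest_repeat_run` is made up for this example.

```python
import re

def longest_repeat_run(seq, unit):
    """Return the largest number of back-to-back copies of `unit` in `seq`."""
    runs = re.findall(f"(?:{re.escape(unit)})+", seq)
    return max((len(run) // len(unit) for run in runs), default=0)

# The six sequences from the slide, one per chromosome.
sequences = [
    "GCTAGATCATCATCATCATTGCTAG",
    "GCTAGATCATCATCATTGCTAGTTA",
    "GCTAGATCATCATCATCATCATTGC",
    "GCTAGATCATCATCATTGCTAGTTA",
    "GCTAGATCATCATCATTGCTAGTTA",
    "GCTAGATCATCATCATCATCATTGC",
]

# Assumed repeat unit: CAT.
counts = [longest_repeat_run(s, "CAT") for s in sequences]
print(counts)  # [4, 3, 5, 3, 3, 5]
```

The resulting vector of counts is the kind of length vector that the next slide treats as a fingerprint.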
STR can be used as a DNA fingerprint
• Consider a collection of regions with variable length repeats.
• Variable length repeats will lead to variable length DNA
• Vector of lengths is a finger-print
[Figure: matrix of repeat counts, indexed by loci and individuals]
Recombination
[Figure: parental chromosomes 00000000 and 11111111 recombine to give 00011111]
Gene Conversion
• Gene Conversion versus crossover
  – Hard to distinguish in a population
Structural polymorphisms
• Large scale structural changes (deletions/insertions/inversions) may occur in a population.
Topic 1: Basic Principles
• In a ‘stable’ population, the distribution of alleles obeys certain laws
  – Not really, and the deviations are interesting
• HW Equilibrium
  – (due to mixing in a population)
• Linkage (dis)-equilibrium
  – Due to recombination
Hardy Weinberg equilibrium
• Consider a locus with 2 alleles, A, a
• p (respectively, q) is the frequency of A (resp. a) in the population
• 3 Genotypes: AA, Aa, aa
• Q: What is the frequency of each genotype?
If various assumptions are satisfied (such as random mating, no natural selection), then
• $P_{AA} = p^2$
• $P_{Aa} = 2pq$
• $P_{aa} = q^2$
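As a quick numeric check of the three genotype frequencies, here is a minimal sketch. The function names and the value p = 0.3 are illustrative choices, not part of the lecture.

```python
def hardy_weinberg(p):
    """Genotype frequencies at a bi-allelic locus with allele A at frequency p."""
    q = 1.0 - p
    return {"AA": p * p, "Aa": 2 * p * q, "aa": q * q}

def allele_frequency_of_A(genotypes):
    """Frequency of A among the chromosomes carried by these genotypes."""
    # Each AA individual contributes two copies of A, each Aa individual one.
    return genotypes["AA"] + 0.5 * genotypes["Aa"]

freqs = hardy_weinberg(0.3)
print(freqs)
print(sum(freqs.values()))           # the three frequencies sum to 1 (up to rounding)
print(allele_frequency_of_A(freqs))  # recovers p = 0.3 (up to rounding)
```

The last line shows that the frequency of A computed from the genotypes is again p, which is the algebra p^2 + pq = p(p + q) = p.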
Hardy Weinberg: why?
• Assumptions:
  – Diploid
  – Sexual reproduction
  – Random mating
  – Bi-allelic sites
  – Large population size, …
• Why? Each individual randomly picks his two chromosomes. Therefore, Prob. (Aa) = pq+qp = 2pq, and so on.
Hardy Weinberg: Generalizations
• Multiple alleles with frequencies $\theta_1, \theta_2, \ldots, \theta_H$
  – By HW,
    $\Pr[\text{homozygous genotype } i] = \theta_i^2$
    $\Pr[\text{heterozygous genotype } i, j] = 2\theta_i\theta_j$
• Multiple loci?
Hardy Weinberg: Implications
• The allele frequency does not change from generation to generation. Why?
• It is observed that 1 in 10,000 Caucasians have the disease phenylketonuria. The disease mutation(s) are all recessive. What fraction of the population carries the disease mutation?
• Males are 100 times more likely to have the “red” type of color blindness than females. Why?
• Conclusion: While the HW assumptions are rarely satisfied, the principle is still important as a baseline assumption, and significant deviations are interesting.
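The two numeric questions above can be worked through under HW. This sketch rests on two assumptions that the slide does not state: "carries" is read as "is a heterozygous carrier", and the "red" color-blindness allele is treated as X-linked recessive, so males (one X) are affected with probability q and females (two X) with probability q^2.

```python
import math

# Phenylketonuria: affected individuals are homozygous recessive, so q^2 = 1/10,000.
disease_frequency = 1 / 10_000
q = math.sqrt(disease_frequency)   # frequency of the disease allele, about 0.01
p = 1 - q
carrier_frequency = 2 * p * q      # heterozygotes under HW, about 0.0198 (roughly 2%)
print(q, carrier_frequency)

# Color blindness (assumed X-linked recessive): Pr[male affected] = q_x and
# Pr[female affected] = q_x^2, so the male/female ratio is 1 / q_x.
male_to_female_ratio = 100
q_x = 1 / male_to_female_ratio     # implied allele frequency, 0.01
print(q_x, q_x ** 2)               # male and female frequencies of the trait
```

Under these assumptions, about 1 person in 50 carries a phenylketonuria allele even though only 1 in 10,000 is affected.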
Recombination
[Figure: parental chromosomes 00000000 and 11111111 recombine to give 00011111]
What if there were no recombinations?
• Life would be simpler
• Each individual sequence would have a single parent (even for higher ploidy)
• The relationship is expressed as a tree.
The Infinite Sites Assumption
[Figure: tree of sequences; each edge is labeled with the site that mutates on it]
0 0 0 0 0 0 0 0   (ancestral sequence)
  edge 3: 0 0 1 0 0 0 0 0
    edge 8: 0 0 1 0 0 0 0 1
    edge 5: 0 0 1 0 1 0 0 0
• The different sites are linked. A 1 in position 8 implies 0 in position 5, and vice versa.
• Some phenotypes could be linked to the polymorphisms
• Some of the linkage is “destroyed” by recombination
Infinite sites assumption and Perfect Phylogeny
• Each site is mutated at most once in the history.
• All descendants must carry the mutated value, and all others must carry the ancestral value
[Figure: tree with mutation i on one edge; sequences below that edge have 1 in position i, all others have 0 in position i]
Perfect Phylogeny
• Assume an evolutionary model in which no recombination takes place, only mutation.
• The evolutionary history is explained by a tree in which every mutation is on an edge of the tree. All the species in one sub-tree contain a 0, and all species in the other contain a 1. Such a tree is called a perfect phylogeny.
The 4-gamete condition
• A column i partitions the set of species into two sets $i_0$ and $i_1$
• A column is homogeneous w.r.t. a set of species if it has the same value for all species in the set. Otherwise, it is heterogeneous.
• EX: i is heterogeneous w.r.t. {A,D,E}
     i
  A  0
  B  0
  C  0
  D  1
  E  1
  F  1
  $i_0$ = {A,B,C}, $i_1$ = {D,E,F}
4 Gamete Condition
• 4 Gamete Condition
  – There exists a perfect phylogeny if and only if for all pairs of columns (i,j), j is not heterogeneous w.r.t. at least one of $i_0$, $i_1$.
  – Equivalent to
  – There exists a perfect phylogeny if and only if for all pairs of columns (i,j), the 4 rows (0,0), (0,1), (1,0), (1,1) do not all occur.
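The second form of the condition translates directly into a pairwise test. The following is a minimal sketch; the function name and the two small matrices are illustrative, and rows are assumed to be equal-length lists of 0/1 values.

```python
from itertools import combinations

def four_gamete_violation(rows):
    """Return the first column pair (i, j) showing all four gametes, or None."""
    n_cols = len(rows[0])
    for i, j in combinations(range(n_cols), 2):
        gametes = {(row[i], row[j]) for row in rows}
        if len(gametes) == 4:  # (0,0), (0,1), (1,0), (1,1) all present
            return (i, j)
    return None

compatible = [
    [1, 1, 0, 0, 0],
    [0, 0, 1, 0, 0],
    [1, 1, 0, 1, 0],
    [0, 0, 1, 0, 1],
    [1, 0, 0, 0, 0],
]
conflicting = [[0, 0], [0, 1], [1, 0], [1, 1]]

print(four_gamete_violation(compatible))   # None
print(four_gamete_violation(conflicting))  # (0, 1)
```

The test looks at every pair of columns, so it takes time proportional to (number of rows) × (number of columns)^2.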
4-gamete condition: proof (only if)
• Depending on the edge on which mutation j occurs, column j is homogeneous w.r.t. either $i_0$ or $i_1$.
• (only if) Every perfect phylogeny satisfies the 4-gamete condition
• (if) If the 4-gamete condition is satisfied, does a perfect phylogeny exist?
[Figure: tree in which the edge carrying mutation i separates $i_0$ from $i_1$; mutation j lies on an edge inside one of the two parts]
Handling recombination
• A tree is not sufficient as a sequence may have 2 parents
• Recombination leads to loss of correlation between columns
Linkage (Dis)-equilibrium (LD)
• Consider sites A & B
• Case 1: No recombination
• Each new individual chromosome chooses a parent from the existing ‘haplotype’
  A B
  0 1
  0 1
  0 0
  0 0
  1 0
  1 0
  1 0
  1 0
  New individual: 1 0
Linkage (Dis)-equilibrium (LD)
• Consider sites A & B
• Case 2: diploidy and recombination
• Each new individual chooses a parent from the existing alleles
  A B
  0 1
  0 1
  0 0
  0 0
  1 0
  1 0
  1 0
  1 0
  New individual: 1 1
Linkage (Dis)-equilibrium (LD)
• Consider sites A & B
• Case 1: No recombination
• Each new individual chooses a parent from the existing ‘haplotype’
  – Pr[A,B = 0,1] = 0.25
• Linkage disequilibrium
• Case 2: Extensive recombination
• Each new individual simply chooses an allele independently at each site
  – Pr[A,B = 0,1] = 0.125
• Linkage equilibrium
  A B
  0 1
  0 1
  0 0
  0 0
  1 0
  1 0
  1 0
  1 0
LD
• In the absence of recombination,
  – Correlation between columns
  – The joint probability Pr[A=a,B=b] is different from P(a)P(b)
• With extensive recombination
  – Pr(a,b)=P(a)P(b)
Measures of LD
• Consider two bi-allelic sites with alleles marked with 0 and 1
• Define
  – $P_{00}$ = Pr[Allele 0 in locus 1, and 0 in locus 2]
  – $P_{0*}$ = Pr[Allele 0 in locus 1]
• Linkage equilibrium if $P_{00} = P_{0*} P_{*0}$
• $D = |P_{00} - P_{0*} P_{*0}| = |P_{01} - P_{0*} P_{*1}| = \ldots$
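To make D concrete, the sketch below evaluates it on the eight two-site haplotypes used in the earlier LD slides (A,B = 0 1, 0 1, 0 0, 0 0, 1 0, 1 0, 1 0, 1 0). The function name `ld_d` is an illustrative choice.

```python
def ld_d(haplotypes, a, b):
    """D = |P_ab - P_a* P_*b| for alleles a (first site) and b (second site)."""
    n = len(haplotypes)
    p_ab = sum(1 for h in haplotypes if h == (a, b)) / n
    p_a = sum(1 for h in haplotypes if h[0] == a) / n
    p_b = sum(1 for h in haplotypes if h[1] == b) / n
    return abs(p_ab - p_a * p_b)

haplotypes = [(0, 1), (0, 1), (0, 0), (0, 0), (1, 0), (1, 0), (1, 0), (1, 0)]

# The value is the same whichever pair of alleles is used.
for a in (0, 1):
    for b in (0, 1):
        print(a, b, ld_d(haplotypes, a, b))  # 0.125 each time
```

Here Pr[A,B = 0,1] is 0.25 while the product of the marginals is 0.5 × 0.25 = 0.125, which is the gap that D measures.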
LD over time
• With random mating, and fixed recombination rate r between the sites, Linkage Disequilibrium will disappear
  – Let $D^{(t)}$ = LD at time t
  – $P_{00}^{(t)} = (1-r)\,P_{00}^{(t-1)} + r\,P_{0*}^{(t-1)} P_{*0}^{(t-1)}$
  – $D^{(t)} = P_{00}^{(t)} - P_{0*}^{(t)} P_{*0}^{(t)} = P_{00}^{(t)} - P_{0*}^{(t-1)} P_{*0}^{(t-1)}$ (HW)
  – $D^{(t)} = (1-r)\,D^{(t-1)} = (1-r)^t D^{(0)}$
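The geometric decay can be checked numerically. In this sketch the starting value D(0) = 0.125 and the rate r = 0.01 are arbitrary illustrative choices, and the function names are made up for the example.

```python
import math

def ld_closed_form(d0, r, t):
    """D(t) = (1 - r)^t * D(0)."""
    return (1 - r) ** t * d0

def ld_by_iteration(d0, r, t):
    """Apply the one-generation recurrence D(t) = (1 - r) * D(t-1) t times."""
    d = d0
    for _ in range(t):
        d = (1 - r) * d
    return d

def generations_to_halve(r):
    """Number of generations after which D has dropped to half its value."""
    return math.log(0.5) / math.log(1 - r)

d0, r = 0.125, 0.01
for t in (0, 10, 100, 1000):
    print(t, ld_closed_form(d0, r, t), ld_by_iteration(d0, r, t))
print(generations_to_halve(r))  # roughly 69 generations for r = 0.01
```

Sites that recombine more often (larger r) lose their LD faster, which is the link between LD and distance used in the following slides.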
LD over distance
• Assumption
  – Recombination rate increases linearly with distance
  – LD decays exponentially with distance.
• The assumption is reasonable, but recombination rates vary from region to region, adding to complexity
• This simple fact is the basis of disease association mapping.
LD and disease mapping
• Consider a mutation that is causal for a disease.
• The goal of disease gene mapping is to discover which gene (locus) carries the mutation.
• Consider every polymorphism, and check:
  – There might be too many polymorphisms
  – Multiple mutations (even at a single locus) that lead to the same disease
• Instead, consider a dense sample of polymorphisms that span the genome
LD can be used to map disease genes
• LD decays with distance from the disease allele.
• By plotting LD, one can short list the region containing the disease gene.
[Figure: marker alleles 0 1 1 0 0 1 shown against disease labels D N N D D N, with LD plotted along the chromosome]
LD and disease gene mapping problems
• Marker density?
• Complex diseases
• Population sub-structure
Population Genetics
• Often we look at these equilibria (Linkage/HW) and their deviations in specific populations
• These deviations offer insight into evolution.
• However, what is Normal?
• A combination of empirical (simulation) and theoretical insight helps distinguish between expected and unexpected.
Topic 2: Simulating population data
• We described various population genetic concepts (HW, LD), and their applicability
• The values of these parameters depend critically upon the population assumptions.
  – What if we do not have infinite populations
  – No random mating (Ex: geographic isolation)
  – Sudden growth
  – Bottlenecks
  – Ad-mixture
• It would be nice to have a simulation of such a population to test various ideas. How would you do this simulation?
Wright Fisher Model of Evolution
• Fixed population size from generation to generation
• Random mating
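A forward simulation of this model might look like the sketch below. It makes assumptions the slide leaves open: a single bi-allelic site, no mutation or selection, and each chromosome in the next generation copying a uniformly random chromosome from the current one. The function name and parameters are illustrative.

```python
import random

def wright_fisher(n_chromosomes, p0, generations, seed=None):
    """Track the frequency of one allele in a fixed-size, randomly mating population."""
    rng = random.Random(seed)
    count = round(p0 * n_chromosomes)
    trajectory = [count / n_chromosomes]
    for _ in range(generations):
        p = count / n_chromosomes
        # Each new chromosome picks a random parent; it carries the allele with probability p.
        count = sum(1 for _ in range(n_chromosomes) if rng.random() < p)
        trajectory.append(count / n_chromosomes)
    return trajectory

trajectory = wright_fisher(n_chromosomes=200, p0=0.5, generations=50, seed=1)
print(trajectory[0], trajectory[-1])
```

Running it with different seeds shows the allele frequency drifting even though nothing favors either allele, and a frequency that reaches 0 or 1 stays there.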
Coalescent model
• Insight 1:
  – Separate the genealogy from allelic states (mutations)
  – First generate the genealogy (who begat whom)
  – Assign an allelic state (0) to the ancestor. Drop mutations on the branches.
Coalescent theory
• Insight 2:
  – Much of the genealogy is irrelevant, because it disappears.
  – Better to go backwards
Coalescent theory (Kingman)
• Input
  – (Fixed population (N individuals), random mating)
• Consider 2 individuals.
  – Probability that they coalesce in the previous generation (have the same parent) = $\frac{1}{N}$
• Probability that they do not coalesce after t generations = $\left(1 - \frac{1}{N}\right)^t \approx e^{-t/N}$
Coalescent theory
• Consider k individuals.
  – Probability that no pair coalesces after 1 generation $\approx 1 - \binom{k}{2}/N$
  – Probability that no pair coalesces after t generations $\approx \left(1 - \frac{\binom{k}{2}}{N}\right)^t \approx e^{-\binom{k}{2} t/N} = e^{-\binom{k}{2}\tau}$
  – $\tau = t/N$ is time in units of N generations
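How good the exponential approximation is can be checked numerically. The sketch below compares the two expressions for a few sample sizes; the values N = 10,000 and t = 1,000 are arbitrary illustrative choices, and the function names are made up for the example.

```python
import math

def no_coalescence_discrete(n_pop, k, t):
    """(1 - C(k,2)/N)^t: no pair among k lineages coalesces in t generations."""
    return (1 - math.comb(k, 2) / n_pop) ** t

def no_coalescence_exponential(n_pop, k, t):
    """exp(-C(k,2) * tau) with tau = t/N, the continuous-time approximation."""
    tau = t / n_pop
    return math.exp(-math.comb(k, 2) * tau)

n_pop, t = 10_000, 1_000
for k in (2, 5, 10):
    print(k, no_coalescence_discrete(n_pop, k, t), no_coalescence_exponential(n_pop, k, t))
```

For k much smaller than N the two columns agree to several decimal places; the case k = 2 reduces to the two-individual formula.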
Coalescent approximation
• Insight 3:
  – Topology is independent of coalescent times
  – If you have n individuals, generate a random binary topology
• Iterate (until one individual)
  – Pick a pair at random, and coalesce
• Insight 4:
  – To generate coalescent times, there is no need to go back generation by generation
Coalescent approximation
• At any step, there are 1 <= k <= n individuals
• To generate time to coalesce (k to k-1 individuals)
  – Pick a number from exponential distribution with rate k(k-1)/2
  – Mean time to coalescence = 2/(k(k-1))
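Insights 3 and 4 together give a short simulation: draw an exponential waiting time, then merge a random pair. The following is a minimal sketch under those two rules; the function name and the representation of a lineage as a list of sample indices are illustrative choices.

```python
import random

def simulate_coalescent(n, seed=None):
    """Return a list of (time, merged_samples) events, with time in units of N generations."""
    rng = random.Random(seed)
    lineages = [[i] for i in range(n)]  # each lineage is the list of samples below it
    events = []
    time = 0.0
    while len(lineages) > 1:
        k = len(lineages)
        # Waiting time while there are k lineages: exponential with rate k(k-1)/2.
        time += rng.expovariate(k * (k - 1) / 2)
        # Topology: pick a pair uniformly at random and coalesce it.
        i, j = rng.sample(range(k), 2)
        merged = sorted(lineages[i] + lineages[j])
        lineages = [l for idx, l in enumerate(lineages) if idx not in (i, j)]
        lineages.append(merged)
        events.append((time, merged))
    return events

events = simulate_coalescent(6, seed=42)
for time, merged in events:
    print(round(time, 3), merged)
```

The population size N never appears because time is measured in units of N generations.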
Typical coalescents
• 4 random examples with n=6 (Note that we do not need to specify N. Why?)
• Expected time to coalesce?
Coalescent properties
• Expected time for the last step (k = 2): $2/(2 \cdot 1) = 1$
• The last step is half of the total time to coalesce
• Studying larger number of individuals does not change numbers tremendously
• EX: Number of mutations in a population is proportional to the total branch length of the tree
  – $E(T_{tot}) = \sum_{k=2}^{n} k \cdot \frac{2}{k(k-1)} = 2\sum_{k=1}^{n-1} \frac{1}{k}$
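These properties follow from the mean waiting time 2/(k(k-1)) per step, and a few lines of arithmetic make them visible. This is a sketch of that arithmetic, with illustrative function names; the total branch length weights each step by the k lineages alive during it.

```python
def expected_tmrca(n):
    """Expected time for n lineages to reach one ancestor (units of N generations)."""
    return sum(2 / (k * (k - 1)) for k in range(2, n + 1))

def expected_total_branch_length(n):
    """Expected sum of all branch lengths: k lineages persist during step k."""
    return sum(k * 2 / (k * (k - 1)) for k in range(2, n + 1))

last_step = 2 / (2 * 1)  # mean time for the final two lineages to coalesce
for n in (2, 6, 20, 100, 1000):
    tmrca = expected_tmrca(n)
    print(n, round(tmrca, 4), round(last_step / tmrca, 4),
          round(expected_total_branch_length(n), 4))
```

The expected time to the common ancestor equals 2(1 - 1/n), so it approaches 2 and the last step's share approaches one half, while the total branch length grows only logarithmically with n.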
Variants (exponentially growing populations)
• If the population is growing exponentially, the branch lengths become similar, or even star-like. Why?
• With appropriate scaling of time, the same process can be extended to various scenarios: male-female, hermaphrodite, segregation, migration, etc.
Simulating population data
• Generate a coalescent (Topology + Branch lengths)
• For each branch length, drop mutations at the given mutation rate
• Generate sequence data
• Note that the resulting sequences admit a perfect phylogeny.
• Given such sequence data, can you reconstruct the coalescent tree? (Only the topology, not the branch lengths)
• Also, note that all pairs of positions are correlated (should have high LD).
Coalescent with Recombination
• An individual may have one parent, or 2 parents
ARG: Coalescent with recombination
• Given: mutation rate, recombination rate, population size 2N (diploid), sample size n.
• How can you generate the ARG (topology + branch lengths) efficiently?
• How will you generate sequences for n individuals?
• Given sequence data, can you reconstruct the ARG (topology)?
Recombination
• Define r as the probability of recombining.
  – Note that the parameter $\rho$ is a scaled value which will be defined later
• Assume k individuals in a generation. The following might happen:
  1. An individual arises because of a recombination event between two individuals (it will have 2 parents).
  2. Two individuals coalesce
  3. Neither (each individual has a distinct parent)
  4. Multiple events (low probability)
Recombination
• We ignore the case of multiple (> 1) events in one generation
• Pr (No recombination) = $1 - kr$
• Pr (No coalescence) = $1 - \binom{k}{2}/(2N)$
• Consider scaled time in units of 2N generations. Thus the number of individuals increases with rate $2Nkr$, and decreases with rate $\binom{k}{2}$
• The value 2rN is usually small, and therefore, the process will ultimately coalesce to a single individual (MRCA)
ARG
• Let k = n
• Define $\rho = 4rN$
• Iterate until k = 1
  – Choose time from an exponential distribution with rate $\frac{k\rho}{2} + \binom{k}{2}$
  – Pick event as recombination with probability $\frac{\rho}{\rho + (k-1)}$
  – If event is recombination, choose an individual to recombine, and a position, else choose a pair to coalesce.
  – Update k, and continue
• What is the flaw in this procedure?
Simulating sequences on the ARG
• Generate topology and branch lengths as before
• For each recombination, generate a position.
• Next generate mutations at random on branch lengths
  – For a mutation, select a position as well.
Recombination events and $\rho$
• Given $\rho$, n, can you compute the expected number of recombination events?
• It can be shown that $E(n, \rho) = \rho \log(n)$
• The question that people are really interested in
  – Given a set of sequences from a population, compute the recombination rate $\rho$
  – Given a population, reconstruct the most likely history (as an ancestral recombination graph)
• We will address this question in subsequent lectures
An algorithm for constructing a perfect phylogeny
• We will consider the case where 0 is the ancestral state, and 1 is the mutated state. This will be fixed later.
• In any tree, each node (except the root) has a single parent.
  – It is sufficient to construct a parent for every node.
• In each step, we add a column and refine some of the nodes containing multiple children.
• Stop if all columns have been considered.
Inclusion Property
• For any pair of columns i, j
  – i < j if and only if $i_1 \supseteq j_1$
• Note that if i < j then the edge containing i is an ancestor of the edge containing j
[Figure: edge i lies above edge j on a path from the root]
Example
  1 2 3 4 5
A 1 1 0 0 0
B 0 0 1 0 0
C 1 1 0 1 0
D 0 0 1 0 1
E 1 0 0 0 0
[Figure: root r with leaves A, B, C, D, E as its children]
Initially, there is a single clade r, and each node has r as its parent
Sort columns
• Sort columns according to the inclusion property (note that the columns are already sorted here).
• This can be achieved by considering the columns as binary representations of numbers (most significant bit in row 1) and sorting in decreasing order
  1 2 3 4 5
A 1 1 0 0 0
B 0 0 1 0 0
C 1 1 0 1 0
D 0 0 1 0 1
E 1 0 0 0 0
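The sorting rule can be sketched in a few lines: read each column top to bottom as a binary number and order the columns by decreasing value. The function names are illustrative; the matrix is the one from the example slide.

```python
def column_value(rows, c):
    """Read column c top to bottom as a binary number (row 1 is the most significant bit)."""
    return int("".join(str(row[c]) for row in rows), 2)

def sort_columns(rows):
    """Return column indices (0-based) in decreasing order of their binary value."""
    n_cols = len(rows[0])
    return sorted(range(n_cols), key=lambda c: column_value(rows, c), reverse=True)

rows = [
    [1, 1, 0, 0, 0],  # A
    [0, 0, 1, 0, 0],  # B
    [1, 1, 0, 1, 0],  # C
    [0, 0, 1, 0, 1],  # D
    [1, 0, 0, 0, 0],  # E
]
print([column_value(rows, c) for c in range(5)])  # [21, 20, 10, 4, 2]
print(sort_columns(rows))                         # [0, 1, 2, 3, 4], already sorted
```

A column whose set of 1s contains another column's set of 1s always gets the larger number, so ancestors come before descendants in the sorted order.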
Add first column
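• In adding column i
  – Check each edge and decide which side you belong.
  – Finally add a node if you can resolve a clade
  1 2 3 4 5
A 1 1 0 0 0
B 0 0 1 0 0
C 1 1 0 1 0
D 0 0 1 0 1
E 1 0 0 0 0
[Figure: tree after adding column 1. A new node u hangs below r on the edge labeled 1; A, C, E are children of u, and B, D remain children of r]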
Adding other columns
• Add other columns on edges using the ordering property
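  1 2 3 4 5
A 1 1 0 0 0
B 0 0 1 0 0
C 1 1 0 1 0
D 0 0 1 0 1
E 1 0 0 0 0
[Figure: final tree. Edge 1 leads from r to the clade {A, C, E}; below it, edge 2 leads to {A, C} and edge 4 leads to C. Edge 3 leads from r to {B, D}; below it, edge 5 leads to D]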
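One way to reproduce this tree in code is sketched below. It is a row-by-row variant rather than the column-by-column refinement described on the slides: after sorting the columns, each row's 1-entries are read in sorted order as its path of mutated edges from the root. The function name and the returned dictionaries are illustrative choices, and 0 is assumed to be the ancestral state.

```python
def build_perfect_phylogeny(names, rows):
    """Return (edge_parent, leaf_parent); raise ValueError if the rows conflict."""
    n_cols = len(rows[0])
    order = sorted(
        range(n_cols),
        key=lambda c: int("".join(str(row[c]) for row in rows), 2),
        reverse=True,
    )
    edge_parent = {}  # column label (1-based) -> label of the edge above it, "r" at the root
    leaf_parent = {}  # species name -> label of the lowest mutated edge above it
    for name, row in zip(names, rows):
        node = "r"
        for c in order:
            if row[c] == 1:
                label = c + 1
                # The edge for this column must hang below the same edge in every row.
                if edge_parent.setdefault(label, node) != node:
                    raise ValueError(f"column {label} conflicts with an earlier row")
                node = label
        leaf_parent[name] = node
    return edge_parent, leaf_parent

names = ["A", "B", "C", "D", "E"]
rows = [
    [1, 1, 0, 0, 0],
    [0, 0, 1, 0, 0],
    [1, 1, 0, 1, 0],
    [0, 0, 1, 0, 1],
    [1, 0, 0, 0, 0],
]
edge_parent, leaf_parent = build_perfect_phylogeny(names, rows)
print(edge_parent)  # {1: 'r', 2: 1, 3: 'r', 4: 2, 5: 3}
print(leaf_parent)  # {'A': 2, 'B': 3, 'C': 4, 'D': 5, 'E': 1}
```

On the example matrix the output matches the final tree: edges 1 and 3 leave the root, 2 sits below 1, 4 below 2, and 5 below 3.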
Unrooted case
• Switch the values in each column, so that 0 is the majority element.
• Apply the algorithm for the rooted case
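The column switch can be sketched as follows. The function name is illustrative, and the handling of ties (a column with equally many 0s and 1s is left unchanged) is an assumption, since the slide does not say what to do in that case.

```python
def make_zero_majority(rows):
    """Flip every column in which 1 is the strict majority, so that 0 becomes the majority."""
    n_rows = len(rows)
    n_cols = len(rows[0])
    # Assumption: ties are left as they are.
    flip = [2 * sum(row[c] for row in rows) > n_rows for c in range(n_cols)]
    return [[1 - v if flip[c] else v for c, v in enumerate(row)] for row in rows]

rows = [
    [1, 0],
    [1, 0],
    [1, 1],
    [0, 0],
]
print(make_zero_majority(rows))  # [[0, 0], [0, 0], [0, 1], [1, 0]]
```

After this step the majority value in each column plays the role of the ancestral state, and the rooted algorithm can be run on the result.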