Post on 18-Dec-2015
Chap. 6. Molecular Phylogeny
Charles Darwin, 1859: natural selection
Evolution: change in the frequency of genes in a population
Heritable changes in a population over many generations
Process of mutation with selection
Two essential factors that define evolution:
Error-prone self-replication
Variation in success at self-replication
Evolution
Self-replication: whatever is evolving must have the ability to make copies of itself
Development, aging, etc. of an individual are not evolution
Genes can self-replicate in the context of the cells that they reside in
A “replicator” can self-replicate
Asexual organisms like bacteria can self-replicate
Sexual organisms also replicate, but the offspring inherit genes from both parents
Dawkins focused on genes rather than organisms as the fundamental replicators
Error-prone Self-replication
Error-prone: copies are not always identical to the originals
Perfect copies will not foster evolution
In fact, current genes arose through gradual changes, as copies of previous versions with slight errors
Errors are essential for evolution, provided they occur not too frequently
Error-prone Self-replication
Cell replication
Replication: one double-stranded DNA molecule becomes two identical double-stranded DNA molecules
Each of the two daughter DNAs contains one mother strand (semi-conservative replication)
Replication step 1: separate the two DNA strands at the origin of replication
Replication step 2: synthesize new DNA on both template strands at the same time
DNA polymerase extends new chains only in the 5’ to 3’ direction
On the original 3’-5’ template, the new (leading) strand is synthesized continuously
On the original 5’-3’ template, the new (lagging) strand is synthesized discontinuously, in Okazaki fragments of 1000-2000 bp
Replication step 3: proofread and repair
Detect mutations, which occur about once in $10^4$ to $10^5$ bases
Mismatch repair in E. coli
(a) Newly synthesized DNA (red) has a mismatch (G-T).
(b) MutH, MutS, and MutL link the mismatch with the nearest methylation site (blue).
(c) An exonuclease removes the mismatched stretch from the red strand.
(d) DNA polymerase replaces it.
How to find the origin/termination sites?
Chargaff parity rules (CPR), 1951: # of A = # of T; # of C = # of G
CPR I: holds for double-stranded DNA
Obvious from the complementary relationship
CPR II: holds approximately within a single strand of DNA
Cause is not known yet
Violation is called ‘skew’
GC skew: (G-C)/(G+C)
GC skew
The GC skew changes sign at the ori and ter sites, so the cumulative GC skew reaches its maximum or minimum there
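A minimal sketch of how GC skew might be scanned along a sequence to look for a turning point. It assumes non-overlapping windows and a running (cumulative) sum of the windowed skew; the function names, window size, and toy sequence are illustrative and not from the slides.

```python
def gc_skew(window):
    """Return (G - C) / (G + C) for one window; 0.0 if it has no G or C."""
    g = window.count("G")
    c = window.count("C")
    return (g - c) / (g + c) if (g + c) else 0.0


def cumulative_gc_skew(seq, window=1000):
    """Running sum of the GC skew over non-overlapping windows."""
    seq = seq.upper()
    total, curve = 0.0, []
    for start in range(0, len(seq), window):
        total += gc_skew(seq[start:start + window])
        curve.append(total)
    return curve


# Toy sequence (hypothetical): a C-rich half followed by a G-rich half.
toy = "ACCCT" * 40 + "AGGGT" * 40
curve = cumulative_gc_skew(toy, window=50)

# The window where the cumulative curve bottoms out marks the switch in skew.
turning_point = curve.index(min(curve))
print(curve, turning_point)
```

On a real genome the window size is a tuning choice, and the extremes only suggest candidate ori/ter positions.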
Oligomer skew: $f_i$ is the # of occurrences of oligomer $i$ in a segment
$OA_i = \ln(f_i / f_{i'})$
Most organisms can increase exponentially
If all organisms survived and multiplied at the same rate, there would be no change in the frequency of the variants, and thus no evolution
Growth is limited by food, space, predators, etc.
When population size is limited, not all variants survive
This opens the possibility of natural selection
Also, chance effects exist
A population with two equally frequent variants will not stay that way, even if both variants have the same fitness
This chance effect is called random drift; eventually one variant takes over the whole population
This implies that evolution can occur even without natural selection, referred to as neutral evolution
Variation
Any change in a gene sequence that is passed on to offspring
Caused by
Damage to the DNA molecule (from radiation, etc.)
Errors in replication
Point mutation: the simplest form of mutation; occurs all over DNA sequences
Transition: mutation within the purines (A, G) or within the pyrimidines (C, T/U)
Transversion: mutation between the two nucleotide groups (purine to pyrimidine or vice versa)
Effects depend on where mutations occur
Non-coding region: no effect on proteins, and neutral
But may have significant effects if occurring in a control region
Coding region
Synonymous substitution: the mutation does not change the amino acid (AA)
Non-synonymous substitution: the AA is replaced by another, or a stop codon is introduced
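The classification above can be sketched in a few lines. This is only an illustration under stated assumptions: it uses the standard genetic code (written compactly in TCAG order), treats U as T, and the function names are made up for this example.

```python
from itertools import product

PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}

# Standard genetic code, codons enumerated in TCAG order; "*" marks a stop codon.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def classify_substitution(ref, alt):
    """Label a single-base change as a transition or a transversion."""
    ref = ref.upper().replace("U", "T")
    alt = alt.upper().replace("U", "T")
    if ref == alt:
        return "none"
    same_group = {ref, alt} <= PURINES or {ref, alt} <= PYRIMIDINES
    return "transition" if same_group else "transversion"


def codon_effect(ref_codon, alt_codon):
    """Label a codon change by its effect on the encoded amino acid."""
    before, after = CODON_TABLE[ref_codon], CODON_TABLE[alt_codon]
    if before == after:
        return "synonymous"
    if after == "*":
        return "non-synonymous: stop codon introduced"
    return "non-synonymous: amino acid replaced"


print(classify_substitution("A", "G"), classify_substitution("A", "C"))
print(codon_effect("CTT", "CTC"), "|", codon_effect("TGG", "TGA"))
```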
Mutation
Models of nucleotide substitution
[Figure: the four nucleotides A, G, T, C; A↔G and T↔C are transitions, and changes between a purine and a pyrimidine are transversions]
Jukes and Cantor one-parameter model of nucleotide substitution (α = β)
Kimura model of nucleotide substitution (assumes α ≠ β)
Models: Jukes-Cantor (JC), Kimura 2P, Tamura
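As a rough illustration of how the one-parameter and two-parameter models differ, the sketch below builds a substitution rate matrix. It assumes the usual reading of the symbols, with α as the transition rate and β as the transversion rate; the function names and the example rates are hypothetical.

```python
NUCLEOTIDES = "AGCT"
PURINES = {"A", "G"}


def is_transition(x, y):
    """True when both bases are purines or both are pyrimidines."""
    return (x in PURINES) == (y in PURINES)


def kimura_rate_matrix(alpha, beta):
    """Two-parameter model: rate alpha for transitions, beta for transversions."""
    q = {}
    for x in NUCLEOTIDES:
        row = {y: (alpha if is_transition(x, y) else beta)
               for y in NUCLEOTIDES if y != x}
        # The diagonal entry makes each row sum to zero.
        row[x] = -sum(row.values())
        q[x] = row
    return q


def jukes_cantor_rate_matrix(alpha):
    """One-parameter model: the special case alpha = beta."""
    return kimura_rate_matrix(alpha, alpha)


k2p = kimura_rate_matrix(alpha=0.2, beta=0.05)   # example rates only
jc = jukes_cantor_rate_matrix(alpha=0.1)
print(k2p["A"])
print(jc["A"])
```

Each base has one possible transition and two possible transversions, so the total rate of change away from any base is α + 2β.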
Indel mutation
Small indels of a single base or a few bases are frequent
Caused by slippage during DNA replication
Particularly frequent in repeated sequences
GCGC…: slight slippage causes insertion of an extra GC or deletion of one
The CAG repeat region in the huntingtin gene can expand, causing Huntington’s disease
Indels cause a frame shift if their length is not a multiple of three
Gene inversion: whole genes are copied to offspring in the reverse direction
Translocation: whole genes can be deleted from one genome and inserted into another
Mutation
Orthologs: members of a gene (protein) family in various organisms. This tree shows globin orthologs.
Mutation Example
Paralogs: members of a gene (protein) family within a species. This tree shows human globin paralogs.
Globin phylogeny by Dayhoff (1972)
Globin phylogeny by Dayhoff in evolutionary time (1972)
Mature insulin consists of an A chain and B chain heterodimer connected by disulfide bridges
The signal peptide and C peptide are cleaved, and their sequences display fewer functional constraints.
Note the sequence divergence in the disulfide loop region of the A chain
Historical background: insulin
By the 1950s, it became clear that amino acid substitutions occur nonrandomly.
For example, Sanger and colleagues noted that most amino acid changes in the insulin A chain are restricted to a disulfide loop region.
Such differences are called “neutral” changes (Kimura, 1968; Jukes and Cantor, 1969)
Subsequent studies at the DNA level showed that the rate of nucleotide (and of amino acid) substitution is about six- to ten-fold higher in the C peptide, relative to the A and B chains.
[Figure: number of nucleotide substitutions/site/year for the regions of insulin; values shown: $0.1 \times 10^{-9}$, $0.1 \times 10^{-9}$, $1 \times 10^{-9}$]
Surprisingly, insulin from the guinea pig (and from the related coypu) evolves seven times faster than insulin from other species. Why?
The answer is that guinea pig and coypu insulin do not bind two zinc ions, while insulin molecules from most other species do. There was a relaxation of the structural constraints on these molecules, and so the genes diverged rapidly.
Historical background: insulin
Guinea pig and coypu insulin have undergone an extremely rapid rate of evolutionary change
Arrows indicate positions at which guinea pig insulin (A chain and B chain) differs from both human and mouse
In the 1960s, sequence data were accumulated for small, abundant proteins such as globins, cytochromes c, and fibrinopeptides. Some proteins appeared to evolve slowly, while others evolved rapidly.
Linus Pauling, Emanuel Margoliash and others proposed the hypothesis of a molecular clock:
Molecular clock hypothesis
For every given protein, the rate of molecular evolution is approximately constant in all evolutionary lineages
[Figure: corrected amino acid changes per 100 residues (m) plotted against millions of years since divergence; Dickerson (1971)]
If protein sequences evolve at constant rates, they can be used to estimate the times that sequences diverged. This is analogous to dating geological specimens by radioactive decay.
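A small worked example of the dating idea, under the assumption of a strictly constant rate on both lineages. The function name and the input numbers are hypothetical, chosen only to show the arithmetic: the observed divergence accumulates along two branches, so it is divided by twice the rate.

```python
def divergence_time_years(subs_per_site, rate_per_site_per_year):
    """Time since two lineages split, assuming one constant rate on both branches."""
    # Differences accumulate along both lineages, hence the factor of 2.
    return subs_per_site / (2.0 * rate_per_site_per_year)


# Hypothetical input: 0.16 substitutions per site between two sequences,
# with an assumed rate of 1e-9 substitutions/site/year.
t = divergence_time_years(0.16, 1e-9)
print(f"about {t / 1e6:.0f} million years")
```

If the rate differs between lineages, as with guinea pig insulin, this simple estimate does not hold.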
Molecular clock hypothesis: implications
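[Figure: two example phylogenetic trees drawn against a time axis, one with taxa A to I and one with taxa A to E, with numeric branch lengths and a scale bar of one unit]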
Molecular phylogeny uses trees to depict evolutionary relationships among organisms. These trees are based upon DNA and protein sequence data.
Population Genetics
Genealogical Tree
Evolutionary tree of a gene without recombination (mtDNA, Y chromosome)
Starting from the current generation, all copies can be traced back to a single copy of the gene: the coalescence process
Example: human mtDNA is traced back to an African woman 200,000 years ago (1996)
Coalescence Model
Assumptions
Constant population of N throughout time
Each individual is equally fit (same expected number of offspring), so an individual is equally likely to have any of the individuals in the previous generation as mother
Pick two individuals in the present generation
Prob. of having the same mother = 1/N
Prob. that their most recent common ancestor lived T generations ago:
$P(T) = (1 - 1/N)^{T-1} (1/N) \approx e^{-T/N} / N$
The time to coalescence of the lines of descent of any two individuals is approximately exponentially distributed, with a mean time until coalescence of N generations
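The coalescence-time formula can be checked with a short simulation. This is a sketch under the slide's assumptions (constant N, each individual picks its mother uniformly at random); the function names, the seed, and the choice N = 50 are illustrative.

```python
import math
import random


def coalescence_time(n_pop, rng):
    """Generations back until two lineages pick the same mother."""
    t = 1
    # Each generation back, both lineages choose a mother uniformly among N.
    while rng.randrange(n_pop) != rng.randrange(n_pop):
        t += 1
    return t


def p_exact(t, n_pop):
    """P(T) = (1 - 1/N)^(T-1) * (1/N)."""
    return (1 - 1 / n_pop) ** (t - 1) / n_pop


def p_approx(t, n_pop):
    """Exponential approximation e^(-T/N) / N."""
    return math.exp(-t / n_pop) / n_pop


rng = random.Random(0)
N = 50
times = [coalescence_time(N, rng) for _ in range(20000)]
mean_time = sum(times) / len(times)
print(f"mean coalescence time: {mean_time:.1f} generations (N = {N})")
print(p_exact(10, N), p_approx(10, N))
```

The simulated mean should land close to N, and the exact and approximate probabilities agree more closely as N grows.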
Coalescence
Mitochondrial Eve
Used a highly variable non-coding part, called the D-loop
The average # of sites with a difference: 61.1 out of 16,553 bases
The average pairwise difference is 76.7 between Africans, and 38.5 between non-Africans
Different divergent populations have existed in Africa for much longer
A relatively small population left Africa and spread through the rest of the world
The earliest branch point: 170,000 ± 50,000 years ago
Non-African migration: 52,000 ± 27,000 years ago
Purple/Green – all Africans
Yellow/blue – non-Africans
Fixation in Neutral Model
Mutation 1 does not survive to the present generation
Mutation 2 has a chance to spread to the entire population (fixed)
Most mutations die out
If a mutation is neutral, what is the prob. of becoming fixed, $p_{fix}$?
Assume N copies of a gene and that each one is equally likely to mutate
Prob. that the mutation occurred in the gene copy that is the ancestor of the present generation is $p_{fix} = 1/N$
A new mutation takes place with prob. u per gene copy
The rate of fixation of new mutations is the rate at which mutations occur, multiplied by the prob. that each mutation is fixed:
$u_{fix} = (Nu) \times p_{fix} = u$
This shows that the rate of fixation of neutral mutations is equal to the underlying mutation rate and is independent of the population size
Fixation in Neutral Model
The number of copies of a mutation in the population changes on a random basis
If there are m copies of a neutral mutant sequence in one generation,
the number of copies in the next generation is n ≈ m
Wright-Fisher model
Each copy of the gene in the next generation is randomly selected from the genes in the previous generation
Prob. that a selected copy is the mutant: a = m/N; prob. that it is not: 1 - a
Prob. of n mutant copies in the next generation: $p(n) = \binom{N}{n} a^n (1-a)^{N-n}$
The mean value: Na = m
Simulation with N = 200 over 2,000 generations
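A minimal sketch of such a simulation. Each generation draws N copies, each a mutant with prob. a = m/N, which gives the binomial distribution for n. The starting count of 100, the seeds, and the function names are assumptions for illustration, and the last function estimates the fixation prob. of a single new neutral mutation for a small N.

```python
import random


def next_generation(m, n_pop, rng):
    """Draw the mutant count n in the next generation (binomial with a = m/N)."""
    a = m / n_pop
    return sum(1 for _ in range(n_pop) if rng.random() < a)


def simulate(m0, n_pop, generations, rng):
    """Track the number of mutant copies over time; 0 and N are absorbing."""
    counts = [m0]
    for _ in range(generations):
        counts.append(next_generation(counts[-1], n_pop, rng))
    return counts


def fixation_fraction(n_pop, trials, rng):
    """Fraction of single new mutations that spread to all N copies."""
    fixed = 0
    for _ in range(trials):
        m = 1
        while 0 < m < n_pop:
            m = next_generation(m, n_pop, rng)
        fixed += (m == n_pop)
    return fixed / trials


trajectory = simulate(m0=100, n_pop=200, generations=2000, rng=random.Random(1))
print("final mutant count:", trajectory[-1])

estimate = fixation_fraction(n_pop=20, trials=4000, rng=random.Random(2))
print(f"estimated p_fix = {estimate:.3f}, expected 1/N = {1 / 20:.3f}")
```

With no selection in the model, the trajectory wanders by drift alone until the mutant is either lost or fixed.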