Phylogenetic Tree Construction Dong Xu Computer Science Department 271C Life Sciences Center 1201...
-
date post
21-Dec-2015 -
Category
Documents
-
view
218 -
download
0
Transcript of Phylogenetic Tree Construction Dong Xu Computer Science Department 271C Life Sciences Center 1201...
Phylogenetic Tree Construction
Dong Xu
Computer Science Department271C Life Sciences Center
1201 East Rollins RoadUniversity of Missouri-Columbia
Columbia, MO 65211-2060E-mail: [email protected]
573-882-7064 (O)http://digbio.missouri.edu
Outline
Evolution theory Concept of phylogeny Molecular clock Types of trees UPGMA Parsimony Maximum likelihood An example for bird flu
Evolution
Many theories of evolutionBasic idea:
speciation events lead to creation of different species
Any two species share a (possibly distant) common ancestor
Evolutionary Events
Extinction: A new node u is created at the end of a lineage, no new lineage is started from u
Speciation: A new node u is created at the end of a lineage, and two new lineages are started from u
Hybridization: A new node u is created when two lineages combine (diploid or polyploid)when one lineage creates u and the new lineage from
u has double the number of homologs (auto-polyploid)
Tree of Life
http://tolweb.org/
Toxonomy
Glycine maxTaxonomy ID: 3847Genbank common name: soybeanRank: speciesGenetic code: Translation table 1 (Standard)Mitochondrial genetic code: Translation table 1 (Standard)Other names:common name:soybeansLineage( full )
cellular organisms; Eukaryota; Viridiplantae; Streptophyta; Streptophytina; Embryophyta; Tracheophyta; Euphyllophyta; Spermatophyta; Magnoliophyta; eudicotyledons; core eudicotyledons; rosids; eurosids I; Fabales; Fabaceae; Papilionoideae; Phaseoleae; Glycine
Kingdom Plantae
Evolutionary tree of plants From primitive more advanced
traits
Non-vascular
Greenalgaancestor
_______moncot __________
Dicot
Vascular
Flowers
Gymnosperms
Monocot vs. dicot plants (1)
FEATURE MONOCOTS DICOTS
Cotyledons 1 2
Leaf venation parallel broad
Root system Fibrous Tap
Number of floral parts
In 3’s In 4’s or 5’s
Vascular bundle position
Scattered Arranged in a circle
Woody or herbaceous
Herbaceous Either
Number of cotyledons: one vs. two
Monocot vs. dicot plants (2)
Leaf venation pattern: Monocot is parallel Dicot is net pattern
Monocot vs. dicot plants (3)
Flower parts: Monocot: in groups of three Dicot: in groups of four or five
Monocot vs. dicot plants (4)
Outline
Evolution theory Concept of phylogeny Molecular clock Types of trees UPGMA Parsimony Maximum likelihood An example for bird flu
Phylogenies (1)
A phylogeny is a tree that describes the sequence of speciation events that lead to the forming of a set of current day species
AardvarkBison Chimp Dog Elephant
Phylogenies (2)
Leafs - current day species Nodes - hypothetical most recent common
ancestors Edges length - “time” from one speciation
to the next
Primate Evolution
Tree Terminology
a b c d
{a,b}
{a,b,c}
{a,b,c,d} root
leaf
internal nodecluster
edge
Rooted trees Single common ancestor Requires more information
Unrooted trees Objects are leaves Internal nodes are some common ancestors Insufficient information to tell whether not not a given
internal node is a common ancestor of any 2 leaves
Rooted/Unrooted Tree
Motivation
Understand the lineage of different species
Organizing principle to sort species into a taxonomy
Understand how various functions evolved
Understand forces and constraints on evolution
Perform multiple sequence alignment
Predict gene function (phylogenetic footprint)
Tree Basis
Phylogenies are reconstructed based on comparisons between present-day objects
Two main aspectsTopology
How its interior nodes connect to one another and to the leaves
Distance An estimate of the evolutionary distance between
the nodes
Assumptions
homology reflects common ancestry single common ancestor treelike relationship exists positional homology independent processes no reversals or convergence molecular clock
Outline
Evolution theory Concept of phylogeny Molecular clock Types of trees UPGMA Parsimony Maximum likelihood An example for bird flu
Molecular Clock Theory (1)
For any given protein, accepted mutations in the amino acid sequence for the protein occur at constant rate
Accepted = mutations that allow protein to function without death
Implication
# of accepted mutations proportional to length of time interval
i.e. relatively constant rate of accepted mutations within a protein
Rate of accepted mutations maybe different for different proteins (depending on their tolerance for mutations)
Different parts of a protein may evolve at different rates
Thus, if A and B differ by k accepted mutations, then roughly k/2 mutations have occurred since divergence
Molecular Clock Theory (2)
Science vol. 289
Molecular clock
Outline
Evolution theory Concept of phylogeny Molecular clock Types of trees UPGMA Parsimony Maximum likelihood An example for bird flu
Species/Gene Trees (1)
Species tree (how are my species related?) contains only one representative from each specieswhen did speciation take place?all nodes indicate speciation events
Gene tree (how are my genes related?)normally contains a number of genes from a single
speciesnodes relate either to speciation or gene duplication
events
• Your sequence data may not have the same phylogenetic history as the species from which they were isolated
•Different genes evolve at different speeds, and there is always the possibility of horizontal gene transfer (hybridization, vector mediated DNA movement, or direct uptake of DNA).
Species/Gene Trees (2)
Morphological vs. Molecular
Classical phylogenetic analysis: morphological featuresnumber of legs, lengths of legs, etc.
Modern biological methods allow to use molecular featuresGene sequences
Protein sequences
Dangers in Molecular Phylogenies
Gene/protein sequence can be homologous for different reasons:
Orthologs -- sequences diverged after a speciation event
Paralogs -- sequences diverged after a duplication event
Xenologs -- sequences diverged after a horizontal transfer (e.g., by virus)
Ultrametric trees (1)
A metric on a set of objects O given by the assignment of a real number d(x,y) to every pair x,y in O
An ultrametric has to fulfill the additional requirement
An ultrametric tree is characterized by the three point condition
Ultrametric trees (2)
Additive Trees
Generalization of ultrametric trees# of mutations were assumed to be proportional to
temporal distance of a node to ancestor
Also assumed, mutations took place at same rate in all branches
Additive trees model different rates of mutation along different branches
Additivity
In “real” tree, distances between species are the sum of distances between intermediate nodes
ab
c
i
j
k
cbkjd
cakid
bajid
),(
),(
),(
)),(),(),((),( jidkjdkid21
kmd
m
c =
Phylogeny Construction
parsimony methods: fewest changes likelihood methods: maximize the
probability distance methods: based on pairwise
evolutionary distances (sequence similarity, nucleotide composition, etc.)
Outline
Evolution theory Concept of phylogeny Molecular clock Types of trees UPGMA Parsimony Maximum likelihood An example for bird flu
UPGMA
UPGMA is the unweighted pair group method with arithmetic mean
Distance matrix can come from (e.g) DNA-DNA hybridization, or be constructed from sequence data etc.
Iteratively group the most closely related groups. The average distance between elements in two groups is the distance between the groups.
UPGMA Procedure
1. find closest pair of units (species, to start with)
2. connect this pair, defining an evolutionary unit (branch)
3. compute distances from the ancestor of this unit to all other ungrouped units --Branch length is distance/2
4. go back to #1 and repeat
Evolutionary distances among
primates (1)
Human Chimp Gorilla Orang
Chimp 1.45
Gorilla 1.51 1.57
Orang 2.98 2.98 3.04
Rhesus 7.51 7.55 7.39 7.10
nucleotide substitutions per 100 sites
Humans and chimps are closest: lump them and recompute distances
H C
H-C Gorilla Orang
Gorilla 1.54
Orang 2.98 3.04
Rhesus 7.53 7.39 7.10
H Ce.g., (H-C) to gorilla distance = (H-G+C-G)/2= (1.51+1.57)/2 = 1.54Gorilla is closest to H-C clade(((H, C), 1.45), G, 1.54)
G
Evolutionary distances among
primates (2)
H-C-G Orang
Orang 3
Rhesus 7.46 7.10
Human-Chimp-Gorilla is closerto Orang than to Rhesus
H CGOR
Evolutionary distances among
primates (3)
UPGMA Clustering
Let Ci and Cj be clusters, define distance between them to be
When we combine two cluster, Ci and Cj, to form a new cluster Ck, then
i jCp Cqji
ji qpdCC
1CCd ),(
||||),(
||||
),(||),(||),(
ji
ljjliilk CC
CCdCCCdCCCd
UPGMA: conclusions
UPGMA gives branch lengths or evolutionary distances as well as branching order
if (a big if) mutations occur at a constant rate, we can estimate dates of divergence from sequence differences
Outline
Evolution theory Concept of phylogeny Molecular clock Types of trees UPGMA Parsimony Maximum likelihood An example for bird flu
t1
t2
t3
1 three-taxa tree
t2
t4
t1
t3
t1
t2
t3
t4
t2
t3
t1
t4
1*(2*3-3) = 3 four-taxa trees
Possible Evolutionary Tree (1)
Taxa (n) rooted
(2n-3)!/(2n-2(n-2)!)
unrooted
(2n-5)!/(2n-3(n-3)!)
2 1 1
3 3 1
4 15 3
5 105 15
6 954 105
7 10,395 954
8 135,135 10,395
9 2,027,025 135,135
10 34,459,425 2,027,025
Possible Evolutionary Tree (2)
Taxa (n) Unrooted/rooted
2
2 1/1
3 1/3
4 3/15
43Taxa (n):
Possible Evolutionary Tree (3)
Maximum parsimony (1)
Minimizes the number of steps required to generate the observed variation in the sequences
Guaranteed to find the "best" tree - danger of over-fitting the data
Columns representing greater variation dominate
Works best for small, highly conserved sequences
Begin with a multiple sequence alignment Identify informative sites within the
sequences Tree requiring smallest number of changes
identified Repeat over all informative sites Length = sum of the # of steps in each
branch Choose tree with smallest length
Maximum parsimony (2)
Sequence position and character
Taxa 1 2 3 4 5 6 7 8 9
1 A A G A G T G C A
2 A G C C G T G C G
3 A G A T A T C C A
4 A G A G A T C C G
Maximum parsimony (3)
A
B
C
D
A
C
B
D
A
D
B
C
A ACGA
B ATGC
C GTGC
D GCAA
Tree 1
Tree 2
Tree 3
1 2 3 4 Total
Tree 1 1 2 1 2 6
Tree 2 2 2 1 2 7
Tree 3 2 1 1 1 5
Maximum parsimony (4)
Parsimony on genomic sequence
Site Human Chimp Gorilla Orang Recent branch
34 A G A G human-gorilla
560 C C A A human-chimp
1287 * * T T human-chimp
3057-
3060
**** **** TAAT TAAT human-chimp
5153 A C C A chimp-gorilla
human-chimp chimp-gorilla human-gorilla
12 3 4
Outline
Evolution theory Concept of phylogeny Molecular clock Types of trees UPGMA Parsimony Maximum likelihood An example for bird flu
Probabilistic Approaches to Phylogeny (1)
Notation and definitions:
Let P(x•|T,t•) denote the probability of a set of data given a tree, where:
x• denotes n sequences
T denotes a tree with n leaves with sequence j at leaf j
t• denotes the edge lengths of the tree
The definition of P(x•|T,t•) depends on our choice of model of evolution.
Let P(x|y,t) denote the probability that sequence y evolves into x along an edge of length t.
Assume that we can define P(x|y,t).
If we can do this for each edge of T we can calculate the probability of T.
Probabilistic Approaches to Phylogeny (2)
Ridiculously simplistic model of evolution:1. Every site is independent
2. Deletions and insertions do not occur
3. Substitution accounts for all evolution
Let P(b|a, t) denote the probability of the substitution of residue b for residue a over an edge length of t.
Extending to aligned gapless sequences x and y,
P(x | y, t) = uP(xu|yu, t), where u indexes over sites
Probabilistic Approaches to Phylogeny (3)
P(x1,.., x5|T, t•) = P(x1|x4,t1)P(x2|x4,t2)P(x3|x5,t3)P(x4|x5,t4)P(x5)
x5
x3
x2
x4
x1
t1
t2 t3
t4
root
Probabilistic Approaches to Phylogeny (4)
Divergence time estimates for major groups. Thick bars on branches denote fossil record of fungi; solid circles are calibration points. From Heckman et al. 2001. Science 293: 1132
Confidence Assessment
Bootstrap values
Bootstrapping is a statistical technique that can use random resampling of data to determine sampling error for tree topologies
Bootstrapping phylogenies
Characters are resampled with replacement to create many bootstrap replicate data sets
Each bootstrap replicate data set is analysed (e.g. with parsimony, distance, ML etc.)
Agreement among the resulting trees is summarized with a majority-rule consensus tree
Frequencies of occurrence of groups, bootstrap proportions (BPs), are a measure of support for those groups
Outline
Evolution theory Concept of phylogeny Molecular clock Types of trees UPGMA Parsimony Maximum likelihood An example for bird flu
Avian Influenza Viruses
Single strand Negative RNA Fragmented Polymorphic
2003/2004 H5N1 Pandemic
Highly pathogenic; can be transmitted to people and some cases are fatal
Virus: 8 genomic segments (PB1, PB2, PA, HA, NP, NA, M, and NS) and genetic reassortment
DNA sequences o South China Agricultural University, China o Genbank
Sources and evolution of flu viruses?
Outbreak History
HA
HA gene
Other 6 segments (excluding PA) have a similar tree structure
Our analyses suggest that 2003-04 H5N1 pandemic be caused by multiple independent transmissions with multiple genotypes from genetic reassortments.
PA gene
(2001)
Reading Assignments
Suggested reading: Chapter 14 in “Warren J. Ewens and
Gregory R. Grant: Statistical Methods in Bioinformatics – An Introduction. Springer. 2001”.
Optional reading: Chapter 17 in “Dan Gusfield: Algorithms on
Strings, Trees, and Sequences. Cambridge University Press. 1997”.
Develop a program that implement the UPGMA algorithm
1. Modify your code in the assignment for global alignment.
2. Use edit distance (match 1; otherwise 0) with gap penalty –1 – k (k is gap size) for pairwise sequence alignment.
3. Use the sequence identity as the tree distance between leaves.
4. Input format: FASTA in one file.
5. Output format: (((a, b), d1), c, d2))…
Project Assignment