Phylogenetic Tree Construction Dong Xu Computer Science Department 271C Life Sciences Center 1201 East Rollins Road University of Missouri-Columbia Columbia, MO 65211-2060 E-mail: [email protected] 573-882-7064 (O) http://digbio.missouri.edu
Date posted: 21-Dec-2015

Phylogenetic Tree Construction

Dong Xu

Computer Science Department
271C Life Sciences Center
1201 East Rollins Road
University of Missouri-Columbia
Columbia, MO 65211-2060
E-mail: [email protected]
573-882-7064 (O)
http://digbio.missouri.edu


Outline

Evolution theory
Concept of phylogeny
Molecular clock
Types of trees
UPGMA
Parsimony
Maximum likelihood
An example for bird flu

Evolution

Many theories of evolution

Basic idea:

speciation events lead to creation of different species

Any two species share a (possibly distant) common ancestor


Evolutionary Events

Extinction: A new node u is created at the end of a lineage, no new lineage is started from u

Speciation: A new node u is created at the end of a lineage, and two new lineages are started from u

Hybridization: A new node u is created either when two lineages combine (diploid or polyploid), or when one lineage creates u and the new lineage from u has double the number of homologs (auto-polyploid)


Tree of Life

http://tolweb.org/

Taxonomy

Glycine max
Taxonomy ID: 3847
Genbank common name: soybean
Rank: species
Genetic code: Translation table 1 (Standard)
Mitochondrial genetic code: Translation table 1 (Standard)
Other names: common name: soybeans
Lineage (full):

cellular organisms; Eukaryota; Viridiplantae; Streptophyta; Streptophytina; Embryophyta; Tracheophyta; Euphyllophyta; Spermatophyta; Magnoliophyta; eudicotyledons; core eudicotyledons; rosids; eurosids I; Fabales; Fabaceae; Papilionoideae; Phaseoleae; Glycine

Kingdom Plantae

Evolutionary tree of plants, from primitive to more advanced traits

[Figure: plant evolutionary tree. Labels: green alga ancestor, non-vascular, vascular, gymnosperms, flowers, monocot, dicot]

Monocot vs. dicot plants (1)

FEATURE                    MONOCOTS     DICOTS
Cotyledons                 1            2
Leaf venation              Parallel     Net
Root system                Fibrous      Tap
Number of floral parts     In 3’s       In 4’s or 5’s
Vascular bundle position   Scattered    Arranged in a circle
Woody or herbaceous        Herbaceous   Either

Monocot vs. dicot plants (2)

Number of cotyledons: one vs. two

Monocot vs. dicot plants (3)

Leaf venation pattern:
Monocot is parallel
Dicot is net pattern

Monocot vs. dicot plants (4)

Flower parts:
Monocot: in groups of three
Dicot: in groups of four or five


Phylogenies (1)

A phylogeny is a tree that describes the sequence of speciation events that led to the formation of a set of current-day species

[Figure: example tree with leaves Aardvark, Bison, Chimp, Dog, Elephant]

Phylogenies (2)

Leaves - current-day species
Nodes - hypothetical most recent common ancestors
Edge lengths - “time” from one speciation to the next


Primate Evolution

Tree Terminology

[Figure: rooted tree with leaves a, b, c, d; internal nodes (clusters) {a,b} and {a,b,c}; root {a,b,c,d}. Labeled parts: root, leaf, internal node, cluster, edge]

Rooted/Unrooted Tree

Rooted trees
Single common ancestor
Requires more information

Unrooted trees
Objects are leaves
Internal nodes are some common ancestors
Insufficient information to tell whether or not a given internal node is a common ancestor of any 2 leaves


Motivation

Understand the lineage of different species

Organizing principle to sort species into a taxonomy

Understand how various functions evolved

Understand forces and constraints on evolution

Perform multiple sequence alignment

Predict gene function (phylogenetic footprint)

Tree Basis

Phylogenies are reconstructed based on comparisons between present-day objects

Two main aspects

Topology: how its interior nodes connect to one another and to the leaves

Distance: an estimate of the evolutionary distance between the nodes

Assumptions

homology reflects common ancestry
single common ancestor
treelike relationship exists
positional homology
independent processes
no reversals or convergence
molecular clock


Molecular Clock Theory (1)

For any given protein, accepted mutations in the amino acid sequence of the protein occur at a constant rate

Accepted = mutations that allow protein to function without death

Implication

# of accepted mutations proportional to length of time interval

i.e. relatively constant rate of accepted mutations within a protein

Molecular Clock Theory (2)

The rate of accepted mutations may be different for different proteins (depending on their tolerance for mutations)

Different parts of a protein may evolve at different rates

Thus, if A and B differ by k accepted mutations, then roughly k/2 mutations have occurred in each lineage since divergence
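As a rough illustration of the k/2 reasoning, the sketch below turns an observed number of differences into a divergence-time estimate under a strict clock. The function name `divergence_time`, the rate parameter, and the numbers in the usage line are assumptions made for illustration; the slides do not give a specific rate.

```python
def divergence_time(k: float, rate_per_lineage: float) -> float:
    """Estimate time since divergence under a strict molecular clock.

    k is the number of accepted mutations separating A and B. Each lineage
    is assumed to have accumulated about k/2 of them at the same constant
    rate (mutations per unit time per lineage).
    """
    if rate_per_lineage <= 0:
        raise ValueError("rate_per_lineage must be positive")
    per_lineage = k / 2.0  # mutations on each branch since the split
    return per_lineage / rate_per_lineage


# Hypothetical numbers: 20 differences, 0.5 accepted mutations per million years
print(divergence_time(20, 0.5))
```

The estimate is only as good as the constant-rate assumption, which is why the rate has to be calibrated per protein.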

Molecular clock

[Figure from Science vol. 289]


Species/Gene Trees (1)

Species tree (how are my species related?)
contains only one representative from each species
when did speciation take place?
all nodes indicate speciation events

Gene tree (how are my genes related?)
normally contains a number of genes from a single species
nodes relate either to speciation or gene duplication events

Species/Gene Trees (2)

• Your sequence data may not have the same phylogenetic history as the species from which they were isolated

• Different genes evolve at different speeds, and there is always the possibility of horizontal gene transfer (hybridization, vector-mediated DNA movement, or direct uptake of DNA).

Morphological vs. Molecular

Classical phylogenetic analysis: morphological features
number of legs, lengths of legs, etc.

Modern biological methods allow the use of molecular features
Gene sequences
Protein sequences


Dangers in Molecular Phylogenies

Gene/protein sequence can be homologous for different reasons:

Orthologs -- sequences diverged after a speciation event

Paralogs -- sequences diverged after a duplication event

Xenologs -- sequences diverged after a horizontal transfer (e.g., by virus)

Ultrametric trees (1)

A metric on a set of objects O is given by the assignment of a real number d(x,y) to every pair x,y in O such that

$$d(x,y) \ge 0, \qquad d(x,y) = 0 \iff x = y$$
$$d(x,y) = d(y,x)$$
$$d(x,y) \le d(x,z) + d(z,y)$$

Ultrametric trees (2)

An ultrametric has to fulfill the additional requirement

$$d(x,y) \le \max\bigl(d(x,z),\, d(z,y)\bigr)$$

An ultrametric tree is characterized by the three point condition: for any three leaves x, y, z, the two largest of d(x,y), d(x,z), d(y,z) are equal
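The three point condition can be checked directly on a distance matrix. The sketch below is a minimal illustration; the function name `is_ultrametric`, the nested-dict layout, and the two small matrices are assumptions, not data from the slides.

```python
from itertools import combinations


def is_ultrametric(d, labels, tol=1e-9):
    """Return True if every triple of labels satisfies the three point
    condition: the two largest of the three pairwise distances are equal."""
    for x, y, z in combinations(labels, 3):
        _, middle, largest = sorted([d[x][y], d[x][z], d[y][z]])
        if abs(largest - middle) > tol:
            return False
    return True


# Hypothetical distances for illustration
ultra = {
    "a": {"a": 0, "b": 2, "c": 6},
    "b": {"a": 2, "b": 0, "c": 6},
    "c": {"a": 6, "b": 6, "c": 0},
}
not_ultra = {
    "a": {"a": 0, "b": 2, "c": 5},
    "b": {"a": 2, "b": 0, "c": 6},
    "c": {"a": 5, "b": 6, "c": 0},
}

print(is_ultrametric(ultra, "abc"), is_ultrametric(not_ultra, "abc"))
```

A matrix that passes this check can be represented by a tree in which every leaf is the same distance from the root, which is the situation a strict molecular clock would produce.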

Additive Trees

Generalization of ultrametric trees

In ultrametric trees, the # of mutations was assumed to be proportional to the temporal distance of a node to its ancestor

Also assumed: mutations took place at the same rate in all branches

Additive trees model different rates of mutation along different branches

Additivity

In a “real” tree, distances between species are the sum of distances between intermediate nodes

[Figure: leaves i, j, k joined at internal node m by edges of length a, b, c respectively]

$$d(i,j) = a + b$$
$$d(i,k) = a + c$$
$$d(j,k) = b + c$$
$$d(k,m) = \tfrac{1}{2}\bigl(d(i,k) + d(j,k) - d(i,j)\bigr) = c$$
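The same algebra recovers all three edge lengths from the three pairwise distances. The sketch below is illustrative only: the function name `three_leaf_branch_lengths` is an assumption, and the Human, Chimp, Gorilla distances are borrowed from the primate table in these slides purely as sample input.

```python
def three_leaf_branch_lengths(d_ij, d_ik, d_jk):
    """Solve d(i,j)=a+b, d(i,k)=a+c, d(j,k)=b+c for the edge lengths
    a, b, c that join leaves i, j, k to the internal node m."""
    a = 0.5 * (d_ij + d_ik - d_jk)
    b = 0.5 * (d_ij + d_jk - d_ik)
    c = 0.5 * (d_ik + d_jk - d_ij)
    return a, b, c


# i = Human, j = Chimp, k = Gorilla (substitutions per 100 sites)
a, b, c = three_leaf_branch_lengths(1.45, 1.51, 1.57)
print(a, b, c)
```

Unlike the ultrametric case, the three edge lengths need not be equal, so different lineages are allowed to evolve at different rates.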

Phylogeny Construction

parsimony methods: fewest changes
likelihood methods: maximize the probability
distance methods: based on pairwise evolutionary distances (sequence similarity, nucleotide composition, etc.)


UPGMA

UPGMA is the unweighted pair group method with arithmetic mean

The distance matrix can come from (e.g.) DNA-DNA hybridization, or be constructed from sequence data, etc.

Iteratively group the most closely related groups. The average distance between elements in two groups is the distance between the groups.


UPGMA Procedure

1. find closest pair of units (species, to start with)

2. connect this pair, defining an evolutionary unit (branch)

3. compute distances from the ancestor of this unit to all other ungrouped units (branch length is distance/2)

4. go back to #1 and repeat
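The four steps above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not a reference implementation: the function name `upgma`, the pair-keyed distance dictionary, and the string labels for merged clusters are all choices made here, and the input is the primate distance table from these slides.

```python
from itertools import combinations


def upgma(names, dist):
    """Cluster leaves by UPGMA.

    dist maps a pair of leaf names (a, b) to their distance.
    Returns the merges in order as (cluster_i, cluster_j, distance);
    the branch height of each new ancestor is distance / 2.
    """
    clusters = {name: (name,) for name in names}  # label -> member leaves
    d = {frozenset(pair): value for pair, value in dist.items()}
    merges = []
    while len(clusters) > 1:
        # 1. find the closest pair of units
        i, j = min(combinations(clusters, 2), key=lambda p: d[frozenset(p)])
        dij = d[frozenset((i, j))]
        # 2. connect this pair into a new unit
        new = f"({i}, {j})"
        ni, nj = len(clusters[i]), len(clusters[j])
        # 3. size-weighted average distance from the new unit to the others
        for k in clusters:
            if k not in (i, j):
                d[frozenset((new, k))] = (
                    ni * d[frozenset((i, k))] + nj * d[frozenset((j, k))]
                ) / (ni + nj)
        clusters[new] = clusters.pop(i) + clusters.pop(j)
        merges.append((i, j, dij))
        # 4. repeat until a single unit remains
    return merges


names = ["Human", "Chimp", "Gorilla", "Orang", "Rhesus"]
dist = {
    ("Human", "Chimp"): 1.45,
    ("Human", "Gorilla"): 1.51, ("Chimp", "Gorilla"): 1.57,
    ("Human", "Orang"): 2.98, ("Chimp", "Orang"): 2.98, ("Gorilla", "Orang"): 3.04,
    ("Human", "Rhesus"): 7.51, ("Chimp", "Rhesus"): 7.55,
    ("Gorilla", "Rhesus"): 7.39, ("Orang", "Rhesus"): 7.10,
}

for left, right, distance in upgma(names, dist):
    print(f"join {left} + {right} at distance {distance:.2f}, height {distance / 2:.3f}")
```

Storing cluster sizes alongside the labels is what makes the update size-weighted; dropping the `ni` and `nj` factors would give a different method (a plain average of the two merged clusters).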

Evolutionary distances among primates (1)

         Human  Chimp  Gorilla  Orang
Chimp    1.45
Gorilla  1.51   1.57
Orang    2.98   2.98   3.04
Rhesus   7.51   7.55   7.39     7.10

(nucleotide substitutions per 100 sites)

Humans and chimps are closest: lump them and recompute distances

[Figure: partial tree joining H and C]

Evolutionary distances among primates (2)

         H-C    Gorilla  Orang
Gorilla  1.54
Orang    2.98   3.04
Rhesus   7.53   7.39     7.10

e.g., (H-C) to gorilla distance = (H-G + C-G)/2 = (1.51 + 1.57)/2 = 1.54

Gorilla is closest to H-C clade

(((H, C), 1.45), G, 1.54)

[Figure: partial tree joining H, C, and G]

Evolutionary distances among primates (3)

         H-C-G  Orang
Orang    3.00
Rhesus   7.48   7.10

e.g., (H-C-G) to Rhesus distance = (H-R + C-R + G-R)/3 = (7.51 + 7.55 + 7.39)/3 = 7.48

Human-Chimp-Gorilla is closer to Orang than to Rhesus

[Figure: tree with leaves H, C, G, O, R]

UPGMA Clustering

Let Ci and Cj be clusters, define distance between them to be

$$d(C_i, C_j) = \frac{1}{|C_i|\,|C_j|} \sum_{p \in C_i} \sum_{q \in C_j} d(p, q)$$

When we combine two clusters, Ci and Cj, to form a new cluster Ck, then

$$d(C_k, C_l) = \frac{|C_i|\, d(C_i, C_l) + |C_j|\, d(C_j, C_l)}{|C_i| + |C_j|}$$
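A quick numerical check shows that the update rule gives the same value as averaging all leaf-to-leaf distances directly. The helper names `average_distance` and `merged_distance` are assumptions for this sketch; the distances are the Human, Chimp, and Gorilla to Rhesus values from the primate table.

```python
def average_distance(ci, cj, d):
    """d(Ci, Cj): mean of all leaf-to-leaf distances between two clusters."""
    total = sum(d[frozenset((p, q))] for p in ci for q in cj)
    return total / (len(ci) * len(cj))


def merged_distance(ci, cj, cl, d):
    """d(Ck, Cl) for Ck = Ci + Cj, using the size-weighted update rule."""
    ni, nj = len(ci), len(cj)
    return (ni * average_distance(ci, cl, d)
            + nj * average_distance(cj, cl, d)) / (ni + nj)


d = {
    frozenset(("H", "R")): 7.51,
    frozenset(("C", "R")): 7.55,
    frozenset(("G", "R")): 7.39,
}
hc, g, r = ("H", "C"), ("G",), ("R",)

direct = average_distance(hc + g, r, d)   # average over all three leaves
updated = merged_distance(hc, g, r, d)    # (2 * 7.53 + 1 * 7.39) / 3
print(direct, updated)
```

The weights |Ci| and |Cj| are what keep every leaf counted equally; averaging the two cluster distances without them would give (7.53 + 7.39)/2 instead.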


UPGMA: conclusions

UPGMA gives branch lengths or evolutionary distances as well as branching order

if (a big if) mutations occur at a constant rate, we can estimate dates of divergence from sequence differences


Possible Evolutionary Tree (1)

[Figure: the single unrooted tree on taxa t1, t2, t3]

1 three-taxa tree

[Figure: the three unrooted trees on taxa t1, t2, t3, t4]

1*(2*3-3) = 3 four-taxa trees

Possible Evolutionary Tree (2)

Number of rooted trees: $\frac{(2n-3)!}{2^{n-2}\,(n-2)!}$

Number of unrooted trees: $\frac{(2n-5)!}{2^{n-3}\,(n-3)!}$

Taxa (n)   Rooted        Unrooted
2          1             1
3          3             1
4          15            3
5          105           15
6          945           105
7          10,395        945
8          135,135       10,395
9          2,027,025     135,135
10         34,459,425    2,027,025
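The counts in the table follow directly from the two formulas. The sketch below evaluates them with integer arithmetic; the function names are assumptions, and the n = 2 unrooted case is handled separately because the formula is only defined for n of at least 3.

```python
from math import factorial


def rooted_trees(n: int) -> int:
    """Number of rooted bifurcating trees on n >= 2 labeled taxa."""
    return factorial(2 * n - 3) // (2 ** (n - 2) * factorial(n - 2))


def unrooted_trees(n: int) -> int:
    """Number of unrooted bifurcating trees on n >= 2 labeled taxa."""
    if n == 2:
        return 1  # a single edge; the factorial formula needs n >= 3
    return factorial(2 * n - 5) // (2 ** (n - 3) * factorial(n - 3))


for n in range(2, 11):
    print(n, rooted_trees(n), unrooted_trees(n))
```

The number of unrooted trees on n taxa equals the number of rooted trees on n - 1 taxa, which is why the two columns are shifted copies of each other. The rapid growth is the reason exhaustive search is only feasible for small n.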

Possible Evolutionary Tree (3)

Taxa (n)   Unrooted/rooted
2          1/1
3          1/3
4          3/15


Maximum parsimony (1)

Minimizes the number of steps required to generate the observed variation in the sequences

Guaranteed to find the "best" tree - danger of over-fitting the data

Columns representing greater variation dominate

Works best for small, highly conserved sequences

Maximum parsimony (2)

Begin with a multiple sequence alignment
Identify informative sites within the sequences
Tree requiring the smallest number of changes is identified
Repeat over all informative sites
Length = sum of the # of steps in each branch
Choose tree with smallest length

Maximum parsimony (3)

Sequence position and character

Taxa  1  2  3  4  5  6  7  8  9
1     A  A  G  A  G  T  G  C  A
2     A  G  C  C  G  T  G  C  G
3     A  G  A  T  A  T  C  C  A
4     A  G  A  G  A  T  C  C  G

Maximum parsimony (4)

Sequences:
A  ACGA
B  ATGC
C  GTGC
D  GCAA

[Figure: the three possible trees. Tree 1 groups (A, B) and (C, D); Tree 2 groups (A, C) and (B, D); Tree 3 groups (A, D) and (B, C)]

Changes required at each site:

        1  2  3  4  Total
Tree 1  1  2  1  2  6
Tree 2  2  2  1  2  7
Tree 3  2  1  1  1  5
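The per-site counts in this example can be reproduced programmatically. The slides do not say how the counts were obtained; the sketch below uses Fitch's small-parsimony algorithm, which is one standard way to count the minimum number of changes on a fixed tree. The nested-tuple tree encoding and the function names are assumptions made for illustration.

```python
def fitch_site(tree, states):
    """Return (candidate_states, changes) for one site.

    tree is a leaf name or a pair (left, right); states maps leaf -> base.
    """
    if isinstance(tree, str):
        return {states[tree]}, 0
    left, right = tree
    left_set, left_changes = fitch_site(left, states)
    right_set, right_changes = fitch_site(right, states)
    common = left_set & right_set
    if common:  # children agree on at least one state: no extra change
        return common, left_changes + right_changes
    return left_set | right_set, left_changes + right_changes + 1


def tree_length(tree, seqs):
    """Total number of changes over all sites of the alignment."""
    n_sites = len(next(iter(seqs.values())))
    return sum(
        fitch_site(tree, {name: s[u] for name, s in seqs.items()})[1]
        for u in range(n_sites)
    )


seqs = {"A": "ACGA", "B": "ATGC", "C": "GTGC", "D": "GCAA"}
trees = {
    "Tree 1": (("A", "B"), ("C", "D")),
    "Tree 2": (("A", "C"), ("B", "D")),
    "Tree 3": (("A", "D"), ("B", "C")),
}

lengths = {name: tree_length(t, seqs) for name, t in trees.items()}
print(lengths)
print("most parsimonious:", min(lengths, key=lengths.get))
```

Each tree is rooted on its middle edge here only to give the recursion a starting point; the parsimony length does not depend on where the root is placed.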

Parsimony on genomic sequence

Site        Human  Chimp  Gorilla  Orang  Recent branch
34          A      G      A        G      human-gorilla
560         C      C      A        A      human-chimp
1287        *      *      T        T      human-chimp
3057-3060   ****   ****   TAAT     TAAT   human-chimp
5153        A      C      C        A      chimp-gorilla

human-chimp chimp-gorilla human-gorilla

12 3 4


Probabilistic Approaches to Phylogeny (1)

Notation and definitions:

Let P(x•|T,t•) denote the probability of a set of data given a tree, where:

x• denotes n sequences

T denotes a tree with n leaves with sequence j at leaf j

t• denotes the edge lengths of the tree

The definition of P(x•|T,t•) depends on our choice of model of evolution.

Probabilistic Approaches to Phylogeny (2)

Let P(x|y,t) denote the probability that sequence y evolves into x along an edge of length t.

Assume that we can define P(x|y,t).

If we can do this for each edge of T we can calculate the probability of T.

Probabilistic Approaches to Phylogeny (3)

Ridiculously simplistic model of evolution:

1. Every site is independent

2. Deletions and insertions do not occur

3. Substitution accounts for all evolution

Let P(b|a, t) denote the probability of the substitution of residue b for residue a over an edge length of t.

Extending to aligned gapless sequences x and y,

$$P(x \mid y, t) = \prod_u P(x_u \mid y_u, t)$$

where u indexes over sites

Probabilistic Approaches to Phylogeny (4)

[Figure: rooted tree with root x5; x5 has children x4 (edge t4) and x3 (edge t3); x4 has children x1 (edge t1) and x2 (edge t2)]

P(x1,.., x5|T, t•) = P(x1|x4,t1)P(x2|x4,t2)P(x3|x5,t3)P(x4|x5,t4)P(x5)
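To make the product formula concrete, the sketch below evaluates it for the five-node tree. The slides leave the substitution model P(b|a, t) unspecified; the Jukes-Cantor model used here, the uniform prior for P(x5), and the short sequences and edge lengths are all assumptions chosen for illustration.

```python
import math


def jc_prob(b, a, t):
    """P(b | a, t) under the Jukes-Cantor model (an assumed model),
    with t measured in expected substitutions per site."""
    e = math.exp(-4.0 * t / 3.0)
    return 0.25 + 0.75 * e if a == b else 0.25 - 0.25 * e


def seq_prob(x, y, t):
    """P(x | y, t): product over independent, gapless sites."""
    if len(x) != len(y):
        raise ValueError("sequences must be aligned and gapless")
    return math.prod(jc_prob(xu, yu, t) for xu, yu in zip(x, y))


def tree_prob(x, t):
    """P(x1..x5 | T, t) for the five-node tree, with ancestors x4, x5 given."""
    prior = 0.25 ** len(x[5])  # P(x5): uniform base frequencies (assumption)
    return (seq_prob(x[1], x[4], t[1]) * seq_prob(x[2], x[4], t[2])
            * seq_prob(x[3], x[5], t[3]) * seq_prob(x[4], x[5], t[4]) * prior)


# Hypothetical sequences and edge lengths
x = {1: "ACGT", 2: "ACGA", 3: "TCGA", 4: "ACGA", 5: "ACGA"}
t = {1: 0.1, 2: 0.1, 3: 0.3, 4: 0.2}
print(tree_prob(x, t))
```

In practice the ancestral sequences x4 and x5 are not observed, so a full likelihood calculation would sum this quantity over all possible ancestral states; the sketch covers only the formula as written on the slide.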


Divergence time estimates for major groups. Thick bars on branches denote fossil record of fungi; solid circles are calibration points. From Heckman et al. 2001. Science 293: 1132


Confidence Assessment

Bootstrap values

Bootstrapping is a statistical technique that can use random resampling of data to determine sampling error for tree topologies


Bootstrapping phylogenies

Characters are resampled with replacement to create many bootstrap replicate data sets

Each bootstrap replicate data set is analysed (e.g. with parsimony, distance, ML etc.)

Agreement among the resulting trees is summarized with a majority-rule consensus tree

Frequencies of occurrence of groups, bootstrap proportions (BPs), are a measure of support for those groups
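The resampling step can be sketched as follows. The function name `bootstrap_replicate` and the dictionary layout are assumptions; the alignment is the four-taxon, nine-site example from the maximum parsimony slides, reused here only as sample input.

```python
import random


def bootstrap_replicate(alignment, rng):
    """Resample alignment columns (characters) with replacement.

    alignment maps taxon name -> aligned sequence; all sequences must have
    the same length. The replicate has the same number of columns.
    """
    length = len(next(iter(alignment.values())))
    cols = [rng.randrange(length) for _ in range(length)]
    return {name: "".join(seq[c] for c in cols) for name, seq in alignment.items()}


alignment = {
    "1": "AAGAGTGCA",
    "2": "AGCCGTGCG",
    "3": "AGATATCCA",
    "4": "AGAGATCCG",
}

rng = random.Random(0)  # fixed seed so the run is repeatable
replicates = [bootstrap_replicate(alignment, rng) for _ in range(100)]
print(replicates[0])
```

Each replicate would then be passed to the tree-building method of choice, and the bootstrap proportion of a group is the fraction of replicate trees that contain it. Whole columns are resampled, so the same column index is applied to every taxon.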


Avian Influenza Viruses

Single strand
Negative RNA
Fragmented
Polymorphic

2003/2004 H5N1 Pandemic

Highly pathogenic; can be transmitted to people and some cases are fatal

Virus: 8 genomic segments (PB1, PB2, PA, HA, NP, NA, M, and NS) and genetic reassortment

DNA sequences
o South China Agricultural University, China
o Genbank

Sources and evolution of flu viruses?


Outbreak History

HA gene

[Figure: phylogenetic tree of the HA gene]

The other 6 segments (excluding PA) have a similar tree structure

PA gene

[Figure: phylogenetic tree of the PA gene; label: (2001)]

Our analyses suggest that the 2003-04 H5N1 pandemic was caused by multiple independent transmissions with multiple genotypes arising from genetic reassortment.


Reading Assignments

Suggested reading: Chapter 14 in Warren J. Ewens and Gregory R. Grant, Statistical Methods in Bioinformatics: An Introduction. Springer, 2001.

Optional reading: Chapter 17 in Dan Gusfield, Algorithms on Strings, Trees, and Sequences. Cambridge University Press, 1997.

Project Assignment

Develop a program that implements the UPGMA algorithm

1. Modify your code in the assignment for global alignment.

2. Use edit distance (match 1; otherwise 0) with gap penalty –1 – k (k is gap size) for pairwise sequence alignment.

3. Use the sequence identity as the tree distance between leaves.

4. Input format: FASTA in one file.

5. Output format: (((a, b), d1), c, d2)…