
MAVID: Constrained Ancestral Alignment of Multiple Sequences

Authors: Nicholas Bray and Lior Pachter

Outline

• AVID
• MAVID
– Progressive alignment
– Constraints
– Tree Building
– Experimental Results

AVID: A Global Alignment Program

• Fast
• Memory efficient
• Practical for sequence alignments of large genomic regions
• Sensitive in finding homologous regions
• Specific, avoiding false-positive problems

Algorithm

• Repeat Masking (Optional)
• Finding Matches Using Suffix Trees
• Anchor Selection
• Recursion

[Figure: AVID flowchart. Boxes: Repeat Masking, Match finding, Anchor selection, Enough anchors?, Split sequences using anchors, Recursion, Base pair alignment]

Repeat Masking (Optional)

• RepeatMasker (http://ftp.genome.washington.edu/RM/RepeatMasker.html)

• Repeat matches
• Clean matches

Finding Matches Using Suffix Trees

• Maximal repeated substring (match)
– A repeated substring such that every longer substring containing it is not repeated in the string
• Maximal matches between two sequences
– Pairs of matching subsequences whose flanking bases are mismatches
• Transform

[Figure slides: Maximal repeated substring; Maximal matches between two sequences; Transform]
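As a rough illustration of the definition of maximal matches between two sequences, the sketch below enumerates them by brute force on short strings. It is not the suffix-tree method the slides refer to, only a slow stand-in that produces the same kind of output; the function name `maximal_matches` and the `min_len` parameter are made up for this example.

```python
def maximal_matches(s, t, min_len=2):
    """Return (i, j, length) for every maximal exact match between s and t.

    A match is maximal when it cannot be extended left or right: the
    flanking bases differ, or an end of a sequence is reached.
    Brute force over all start pairs; for illustration only.
    """
    matches = []
    for i in range(len(s)):
        for j in range(len(t)):
            # Skip start pairs that could be extended to the left.
            if i > 0 and j > 0 and s[i - 1] == t[j - 1]:
                continue
            # Extend to the right while the bases agree.
            k = 0
            while i + k < len(s) and j + k < len(t) and s[i + k] == t[j + k]:
                k += 1
            if k >= min_len:
                matches.append((i, j, k))
    return matches


print(maximal_matches("ACGTT", "TACGA"))  # [(0, 1, 3)]
```

Here the only match of length two or more is "ACG", starting at position 0 of the first string and position 1 of the second.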

Anchor Selection

• Eliminate noisy matches (those less than half the length of the longest match)

• The remaining matches are ordered by
– Long clean -> short clean -> long repeat -> short repeat
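The filtering and ordering step can be sketched as follows. The tuple layout `(start_a, start_b, length, is_repeat)` and the function name are assumptions for this example, and "long before short" is read here as a plain sort by length within the clean and repeat groups, which the slide does not spell out.

```python
def filter_and_order(matches):
    """matches: list of (start_a, start_b, length, is_repeat) tuples (hypothetical layout)."""
    if not matches:
        return []
    longest = max(m[2] for m in matches)
    # Drop noisy matches: those shorter than half the longest match.
    kept = [m for m in matches if 2 * m[2] >= longest]
    # Clean matches first, then repeat matches; longer before shorter in each group.
    return sorted(kept, key=lambda m: (m[3], -m[2]))


example = [
    (0, 0, 10, False),
    (20, 22, 4, False),   # shorter than half of 10, so it is dropped
    (40, 41, 8, True),
    (60, 66, 6, False),
    (80, 90, 5, True),
]
for m in filter_and_order(example):
    print(m)
```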

Anchor Selection

• A variant of Smith-Waterman algorithm (no overlapping)

• Gap score: 0
• Mismatch score: −∞ (an infinite penalty)
• Match score:
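With a gap score of 0 and mismatches forbidden, choosing anchors amounts to picking a non-overlapping, consistently ordered subset of matches with the largest total match score. The sketch below does this with a simple quadratic dynamic program. It is not AVID's implementation, and because the match score is not given in the slide, the code assumes the score of a match is its length.

```python
def select_anchors(matches):
    """matches: list of (start_a, start_b, length). Returns the best-scoring chain.

    Assumption for this sketch: the score of a match is its length.
    """
    ms = sorted(matches)
    n = len(ms)
    if n == 0:
        return []
    best = [m[2] for m in ms]  # best chain score ending at match i
    prev = [-1] * n            # predecessor of match i in that chain
    for i in range(n):
        a_i, b_i, len_i = ms[i]
        for j in range(i):
            a_j, b_j, len_j = ms[j]
            # j may precede i only if it ends before i starts in both sequences.
            if a_j + len_j <= a_i and b_j + len_j <= b_i:
                if best[j] + len_i > best[i]:
                    best[i] = best[j] + len_i
                    prev[i] = j
    # Trace back from the highest-scoring chain end.
    i = max(range(n), key=best.__getitem__)
    chain = []
    while i != -1:
        chain.append(ms[i])
        i = prev[i]
    return chain[::-1]


print(select_anchors([(0, 0, 5), (6, 20, 4), (10, 8, 6), (18, 15, 3), (3, 3, 4)]))
# [(0, 0, 5), (10, 8, 6), (18, 15, 3)]
```

The match `(6, 20, 4)` is left out because keeping it would block the longer, consistently ordered matches that follow it in the second sequence.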

Recursion

Condition

• There are still significant matches
  – The anchor set is >50% of the length of the sequence
    • Recursion
  – Otherwise
    • Needleman-Wunsch algorithm
• No significant matches
  – Short sequence (<4kb)
    • Needleman-Wunsch algorithm
  – Long sequence
    • Trivial alignment (gap)
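The decision rules on this slide can be written as one small function. This is only a restatement of the bullets, not AVID's code: the function and parameter names are invented, a single `seq_len` stands in for "the length of the sequence", and 4 kb is taken to mean 4000 bases.

```python
def next_step(has_significant_matches, anchor_len, seq_len, short_limit=4000):
    """Return which action the slide's rules select for one pair of subsequences."""
    if has_significant_matches:
        # Anchors cover more than half of the sequence: split and recurse.
        if anchor_len > 0.5 * seq_len:
            return "recurse"
        return "needleman-wunsch"
    # No significant matches: align short regions exactly, gap out long ones.
    if seq_len < short_limit:
        return "needleman-wunsch"
    return "trivial-gap"


print(next_step(True, 6000, 10000))   # recurse
print(next_step(False, 0, 2500))      # needleman-wunsch
print(next_step(False, 0, 50000))     # trivial-gap
```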

• Rapidly aligning multiple large genomic regions

• Incorporating biologically meaningful heuristics

• Sound alignment strategies

Method

• Core: progressive ancestral alignment, which incorporates preprocessed constraints

• Terminology
– Match
  • Similar (may not exactly match) region between two sequences
– Constraint
  • The order of positions of alignment

Standard progressive alignment

• Compute the distance matrix by aligning all pairs of sequences
• Build a phylogenetic tree (guide tree) from the distance matrix
– Cluster
– Midpoint method
• Progressively align the sequences according to the branching order in the guide tree
– Aligning two alignments
– An alignment is viewed as a sequence

Method

Key difference

• Instead of aligning alignments, we first infer ancestral sequences of alignments using maximum-likelihood estimation within a probabilistic evolutionary model

• Maximum-likelihood estimation
– A popular statistical method used to make inferences about parameters of the underlying probability distribution of a given data set
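To make the idea concrete, here is a minimal sketch of maximum-likelihood inference of one ancestral base from two descendant bases. The slides do not name the evolutionary model, so the Jukes-Cantor model is used purely as an assumption; gaps and larger trees are ignored, and the function names are invented.

```python
import math


def jc_prob(x, y, t):
    """Jukes-Cantor probability that base x becomes base y along a branch of length t."""
    e = math.exp(-4.0 * t / 3.0)
    return 0.25 + 0.75 * e if x == y else 0.25 - 0.25 * e


def ml_ancestor(left, right, t_left, t_right):
    """Most likely ancestral base for one column with two descendant bases."""
    return max(
        "ACGT",
        key=lambda x: jc_prob(x, left, t_left) * jc_prob(x, right, t_right),
    )


# The descendant on the shorter branch dominates the estimate.
print(ml_ancestor("A", "G", 0.05, 0.5))  # A
```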

Key difference

• The ancestral sequences are then aligned with AVID

• The scores of the Smith-Waterman step are assigned according to the branch length of the two alignments

• The alignment of the ancestral sequences is then used to glue the two alignments together. Gaps in the ancestral sequences lead to gaps in the multiple alignment

[Figure: Alignment A and Alignment B joined through the alignment of Ancestral A and Ancestral B]
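The gluing step can be sketched as below. This is a simplified illustration rather than MAVID's code: it assumes each ancestral sequence has exactly one character per column of its alignment, and that the pairwise alignment of the two ancestors is given as two gapped strings of equal length. The function name `glue` is invented.

```python
def glue(rows_a, rows_b, anc_a, anc_b, gap="-"):
    """Merge two alignments using a pairwise alignment of their ancestral sequences.

    rows_a, rows_b: lists of equal-length aligned strings.
    anc_a, anc_b:   gapped ancestral sequences of equal length (assumed to have
                    one non-gap character per column of the matching alignment).
    """
    out_a = [[] for _ in rows_a]
    out_b = [[] for _ in rows_b]
    ia = ib = 0  # next unused column of each input alignment
    for ca, cb in zip(anc_a, anc_b):
        # A gap in an ancestor becomes a gap column for all of its descendants.
        for row, out in zip(rows_a, out_a):
            out.append(gap if ca == gap else row[ia])
        for row, out in zip(rows_b, out_b):
            out.append(gap if cb == gap else row[ib])
        if ca != gap:
            ia += 1
        if cb != gap:
            ib += 1
    return ["".join(r) for r in out_a + out_b]


merged = glue(["AC-T", "ACGT"], ["AGT", "A-T"], "ACGT", "A-GT")
print("\n".join(merged))
```

In this example the gap in the second ancestor at column 2 inserts a gap column into both rows of the second alignment.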

AVID with preprocessed data

• Gene predictions using GENSCAN
• Protein alignments using BLAT
• Finding exon matches without using a suffix tree
• In addition, the exon matches can be used to shape the final multiple alignment

MAVID (Constraints, Tree Building, and Experimental Results)

Speaker: 羅正偉, 2005/12/07

Constraints(1/3)

Notation: a_i ≤ b_j

This means that position i in sequence a must appear before position j in sequence b in the multiple sequence alignment.

Constraints(2/3)

Suppose a third sequence c gives the constraints a_i ≤ c_x and c_y ≤ b_j. If x ≤ y, then a_i ≤ c_x ≤ c_y ≤ b_j, and so a_i ≤ b_j by transitivity.
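The transitivity argument can be sketched in a few lines. The representation is an assumption made for this example: a pair `(i, x)` stands for "position i of a comes no later than position x of c", and a pair `(y, j)` stands for "position y of c comes no later than position j of b".

```python
def implied_constraints(a_le_c, c_le_b):
    """Derive constraints between a and b through a third sequence c.

    a_le_c: pairs (i, x) meaning a_i <= c_x.
    c_le_b: pairs (y, j) meaning c_y <= b_j.
    Returns the set of pairs (i, j) meaning a_i <= b_j.
    """
    return {(i, j) for i, x in a_le_c for y, j in c_le_b if x <= y}


print(sorted(implied_constraints([(10, 5), (30, 40)], [(7, 12), (50, 60)])))
# [(10, 12), (10, 60), (30, 60)]
```

Note that `(10, 60)` adds nothing once `(10, 12)` is known, since position 12 of b already precedes position 60 of b. Removing this kind of redundancy is the motivation for prime constraints.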

Constraints(3/3)

• The above information can be used in the alignment of the ancestral sequences by requiring potential anchors between the sequences to satisfy the constraints.

Prime Constraints(1/4)

• Consider every triplet of sequences (a, b, c) with a in u, b in v, and c not in x.

• Every triplet can provide potential constraints for the alignment.

• If there are n sequences, there are O(n^3) such triplets.

[Figure: tree showing nodes u and v]

Too many constraints!

• Actually, we don’t need to find all possible constraints, many of which will be redundant.

• Instead, we wish to find a set of prime constraints

• In this set, no constraint is implied by the others.

• Such a set can be inferred from the homology map.

Illustration

• If there are m sets of orthologous exons, then at node x there can be at most O(m) prime constraints.

• The set of all prime constraints can be found in O(mk^2) time, where k is the number of leaves below x.

• Matches between the ancestral sequences that are inconsistent with this set of constraints can be filtered out in O(N log m) time, where N is the total number of matches.

• For typical values of m and k, the time taken computing and utilizing the constraints is negligible.
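One way to reach the O(N log m) filtering bound is a binary search per match over the sorted constraints. The sketch below is an assumption about how that could look, not MAVID's data structure. It handles only constraints of the form a_i ≤ b_j, and a match `(p, q)` is taken to align position p of a with position q of b. If p ≤ i, the match forces q ≤ j, so it is rejected when q is larger than the tightest such j.

```python
from bisect import bisect_left


def filter_matches(matches, constraints):
    """Keep the matches (p, q) that violate no constraint (i, j), i.e. a_i <= b_j."""
    cons = sorted(constraints)
    starts = [i for i, _ in cons]
    # tightest[k] = smallest j among constraints k, k+1, ... (suffix minimum).
    tightest = [j for _, j in cons]
    for k in range(len(cons) - 2, -1, -1):
        tightest[k] = min(tightest[k], tightest[k + 1])
    kept = []
    for p, q in matches:
        k = bisect_left(starts, p)  # first constraint with i >= p
        if k < len(cons) and q > tightest[k]:
            continue  # inconsistent with at least one constraint
        kept.append((p, q))
    return kept


print(filter_matches(
    [(5, 8), (8, 20), (20, 50), (25, 70), (40, 100)],
    [(10, 12), (30, 60)],
))
# [(5, 8), (20, 50), (40, 100)]
```

Sorting the m constraints costs O(m log m) once, after which each of the N matches is checked with one binary search.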

Tree Building(1/3)

• Most multiple alignment programs require pairwise alignments of all the sequences to build an initial guide tree. (Quadratic number of sequence alignments)

• We utilize an iterative method to obtain a guide tree using only a linear number of alignments.

Tree Building(2/3)

• The initial guide tree is selected randomly from the set of complete binary trees.

• The sequences are aligned using this random tree, and then a phylogenetic tree is inferred from the resulting multiple alignment.

• The above process is iterated until the alignment and tree are satisfactory.

Tree Building(3/3)

• Instead of computing all pairwise alignments, only O(nk) alignments are necessary to perform n iterations with k sequences.
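The counts behind this comparison are easy to check. A binary guide tree with k leaves has k - 1 internal nodes, and each internal node needs one alignment, so n iterations need n(k - 1) alignments, compared with k(k - 1)/2 for all pairs. The function names below are invented for this example.

```python
def pairwise_alignments(k):
    """Alignments needed to compare all pairs of k sequences."""
    return k * (k - 1) // 2


def iterative_alignments(k, n_iter):
    """Alignments needed for n_iter progressive passes over a binary guide tree."""
    return n_iter * (k - 1)


k = 21
print(pairwise_alignments(k))      # 210
print(iterative_alignments(k, 3))  # 60
```

The iterative approach uses fewer alignments whenever the number of iterations is below k/2.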

• We found that for typical alignment problems, only a small number of iterations were necessary.

Experimental Results 1

• A human, mouse, and rat whole-genome multiple alignment.
– A homology map for the genomes was built by C. Dewey, and was used to generate gene anchors and constraints.
– Human chromosome 20 was chosen because it aligns almost completely with mouse chromosome 2.

Experimental Results 1 (cont.)

Coverage of human chromosome 20 RefSeq exons by the MAVID alignments. Of a total of 3927 exons, only six were not in the homology map. A total of 53.5% of the exons were covered by precomputed exon anchors in either mouse or rat. The remaining exons are mostly aligned by MAVID, resulting in 93.6% of the exons covered by alignment in either mouse or rat.

Experimental Results 2

• Alignment of 21 Organisms
– We aligned 1.8 Mb of human sequence together with the homologous regions from 20 other organisms, for a total of 23 Mb of sequence.
– Baboon, cat, chicken, chimp, cow, dog, dunnart, fugu, hedgehog, horse, lemur, macaque, mouse, opossum, pig, platypus, rabbit, rat, tetraodon, and zebrafish.

Experimental Results 2(cont.)

• The MAVID alignments were compared with MLAGAN, version 1.1 (Brudno et al. 2003).

• MLAGAN is the only other program we know of that is able to align the 21 sequences in a reasonable period of time.

Experimental Results 2(cont.)

• MAVID and MLAGAN both aligned sequences correctly.

• MAVID took 40 min, while MLAGAN took roughly 6 h.
