Post on 18-Jan-2018
1
MAVID: Constrained Ancestral Alignment of Multiple Sequences
Authors: Nicholas Bray and Lior Pachter
2
Outline
• AVID
• MAVID
– Progressive alignment
– Constraints
– Tree Building
– Experimental Results
3
AVID: A Global Alignment Program
• Fast
• Memory efficient
• Practical for sequence alignments of large genomic regions
• Sensitive in finding homologous regions
• Specific, avoiding false-positive problems
4
Algorithm
• Repeat Masking (Optional)
• Finding Matches Using Suffix Trees
• Anchor Selection
• Recursion
5
[Figure: flowchart of the AVID algorithm with the boxes Repeat masking, Match finding, Anchor selection, "Enough anchors?", Split sequences using anchors, Recursion, and Base pair alignment]
6
Repeat Masking (Optional)
• RepeatMasker (http://ftp.genome.washington.edu/RM/RepeatMasker.html)
• Repeat matches
• Clean matches
[Figure: repeat matches and clean matches]
7
Finding Matches Using Suffix Trees
8
Finding Matches Using Suffix Trees
• Maximal repeated substring (Match)
– Every longer substring that contains it is not repeated in the string
• Maximal matches between two sequences
– Pairs of matching subsequences whose flanking bases are mismatches
• Transform
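The definition of a maximal match between two sequences can be illustrated with a brute-force sketch. This is not the suffix-tree method named on the slide; it is a quadratic-time illustration, and the function name `maximal_matches` and the `min_len` parameter are assumptions made for this example.

```python
def maximal_matches(s, t, min_len=1):
    """Return (i, j, length) for every maximal exact match between s and t.

    Brute force for illustration only; AVID finds matches with suffix trees.
    """
    matches = []
    for i in range(len(s)):
        for j in range(len(t)):
            if s[i] != t[j]:
                continue
            # Left-maximal: the bases just before the match must differ,
            # unless one of the sequences starts here.
            if i > 0 and j > 0 and s[i - 1] == t[j - 1]:
                continue
            # Extend to the right until the flanking bases mismatch
            # or one sequence ends.
            k = 0
            while i + k < len(s) and j + k < len(t) and s[i + k] == t[j + k]:
                k += 1
            if k >= min_len:
                matches.append((i, j, k))
    return matches
```

For example, with `s = "ACGTT"` and `t = "CGTA"`, the only maximal match of length at least 2 is `CGT`, starting at position 1 in `s` and position 0 in `t`.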
9
[Figure: Maximal repeated substring; Maximal matches between two sequences; Transform]
10
Anchor Selection
• Eliminate noisy matches (those less than half the length of the longest match)
• The remaining matches are ordered by
– Long clean -> short clean -> long repeat -> short repeat
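The filtering and ordering step might be sketched as follows. The `Match` record and its field names are hypothetical, and the sketch assumes that sorting clean matches before repeat matches, longest first within each group, reproduces the stated order.

```python
from typing import NamedTuple


class Match(NamedTuple):
    start_a: int     # start position in the first sequence
    start_b: int     # start position in the second sequence
    length: int
    is_repeat: bool  # True if the match lies in a masked repeat


def filter_and_order(matches):
    """Drop noisy matches, then order clean before repeat, longest first."""
    if not matches:
        return []
    longest = max(m.length for m in matches)
    # Noisy: less than half the length of the longest match.
    kept = [m for m in matches if 2 * m.length >= longest]
    # False sorts before True, so clean matches come first.
    return sorted(kept, key=lambda m: (m.is_repeat, -m.length))
```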
11
Anchor Selection
• A variant of the Smith-Waterman algorithm (no overlapping)
• Gap score: 0
• Mismatch score: ∞
• Match score:
10 bp
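One way to picture anchor selection with free gaps and no overlaps is as a chaining problem: pick a set of matches that appear in the same order in both sequences and do not overlap, maximizing the total score. The sketch below is a simple quadratic dynamic program, not the authors' implementation, and it assumes each match scores its own length because the match score on the slide is not fully legible. The name `select_anchors` is hypothetical.

```python
def select_anchors(matches):
    """matches: list of (start_a, start_b, length) with length >= 1.

    Returns the non-overlapping, collinear chain with the largest total length.
    """
    ms = sorted(matches)  # by start in the first sequence
    n = len(ms)
    if n == 0:
        return []
    best = [m[2] for m in ms]  # best chain score ending at each match
    prev = [-1] * n            # back-pointers for chain recovery
    for i in range(n):
        ai, bi, li = ms[i]
        for j in range(i):
            aj, bj, lj = ms[j]
            # Match j must end before match i starts in both sequences.
            if aj + lj <= ai and bj + lj <= bi and best[j] + li > best[i]:
                best[i] = best[j] + li
                prev[i] = j
    i = max(range(n), key=best.__getitem__)
    chain = []
    while i != -1:
        chain.append(ms[i])
        i = prev[i]
    return chain[::-1]
```

Gaps between consecutive anchors cost nothing here, which mirrors the gap score of 0 on the slide.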
12
Recursion
13
Condition
• There are still significant matches
– The anchor set is >50% of the length of the sequence
• Recursion
– Otherwise
• Needleman-Wunsch algorithm
• No significant matches
– Short sequence (<4 kb)
• Needleman-Wunsch algorithm
– Long sequence
• Trivial alignment (gap)
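The decision rule on this slide can be written as a small function. This is a sketch of the rule as read from the slide, not of the AVID source code; the function name, the argument names, and the use of a single sequence length are assumptions.

```python
def next_step(has_significant_matches, anchor_len, seq_len, short_cutoff=4000):
    """Choose what to do with a region between anchors.

    anchor_len: total length covered by the selected anchors (bp)
    seq_len: length of the sequence being aligned (bp)
    short_cutoff: the 4 kb threshold from the slide
    """
    if has_significant_matches:
        if anchor_len > 0.5 * seq_len:
            return "recurse"
        return "needleman-wunsch"
    if seq_len < short_cutoff:
        return "needleman-wunsch"
    return "trivial-gap"
```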
14
MAVID
• Rapidly aligning multiple large genomic regions
• Incorporating biologically meaningful heuristics
• Sound alignment strategies
15
Method
• Core: progressive ancestral alignment, which incorporates preprocessed constraints
• Terminology
– Match
• Similar (may not exactly match) region between two sequences
– Constraint
• The order of positions in the alignment
16
Standard progressive alignment
• Compute the distance matrix by aligning all pairs of sequences
• Build a phylogenetic tree (guide tree) from the distance matrix
– Cluster
– Midpoint method
• Progressively align the sequences according to the branching order in the guide tree
– Aligning two alignments
– An alignment is viewed as a sequence
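As an illustration of building a guide tree by clustering a distance matrix, here is a minimal UPGMA-style sketch. The slide does not say which clustering method is used, so UPGMA is an assumption, as are the function name and the distances in the usage note. The returned nested tuples give the branching order.

```python
from itertools import combinations


def upgma(names, dist):
    """names: list of sequence names.
    dist: dict mapping (name1, name2) -> pairwise distance, one entry per pair.
    Returns the guide tree as nested tuples.
    """
    d = {frozenset(pair): value for pair, value in dist.items()}
    clusters = {name: 1 for name in names}  # cluster -> number of leaves
    while len(clusters) > 1:
        # Merge the closest pair of clusters.
        a, b = min(combinations(clusters, 2), key=lambda p: d[frozenset(p)])
        merged = (a, b)
        na, nb = clusters.pop(a), clusters.pop(b)
        # Distance to the new cluster is the size-weighted average.
        for c in clusters:
            d[frozenset((merged, c))] = (
                na * d[frozenset((a, c))] + nb * d[frozenset((b, c))]
            ) / (na + nb)
        clusters[merged] = na + nb
    return next(iter(clusters))
```

With made-up distances in which mouse and rat are closest, human next, and chicken farthest, the result is `('chicken', ('human', ('mouse', 'rat')))`, so mouse and rat are aligned first.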
17
Method
18
Key difference
• Instead of aligning alignments, we first infer ancestral sequences of alignments using maximum-likelihood estimation within a probabilistic evolutionary model
• Maximum-likelihood estimation
– A popular statistical method used to make inferences about parameters of the underlying probability distribution of a given data set
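To make the idea concrete, the sketch below infers the most likely ancestral base for one alignment column with two descendants. It assumes the Jukes-Cantor substitution model, which is a stand-in chosen for simplicity; the slides do not state which evolutionary model MAVID uses, and gaps are ignored. The function names are hypothetical.

```python
import math

BASES = "ACGT"


def jc_prob(child, parent, t):
    """Jukes-Cantor probability of seeing `child` given `parent`
    after a branch of length t (expected substitutions per site)."""
    decay = math.exp(-4.0 * t / 3.0)
    if child == parent:
        return 0.25 + 0.75 * decay
    return 0.25 - 0.25 * decay


def ml_ancestor(x, y, t1, t2):
    """Most likely ancestral base given child bases x and y
    on branches of length t1 and t2."""
    return max(BASES, key=lambda z: jc_prob(x, z, t1) * jc_prob(y, z, t2))
```

When the two children disagree, the base on the shorter branch wins, for example `ml_ancestor("A", "G", 0.1, 0.5)` returns `"A"`.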
19
Key difference
• The ancestral sequences are then aligned with AVID
• The scores of the Smith-Waterman step are assigned according to the branch length of the two alignments
• The alignment of the ancestral sequences is then used to glue the two alignments together. Gaps in the ancestral sequences lead to gaps in the multiple alignment
20
[Figure: Alignment A and Alignment B, their ancestral sequences Ancestral A and Ancestral B, and AVID aligning the two ancestral sequences]
21
AVID with preprocessed data
• Gene predictions using GENSCAN
• Protein alignments using BLAT
• Finding exon matches without using suffix trees
• In addition, the exon matches can be used to shape the final multiple alignment
22
MAVID (Constraints, Tree building, and Experimental results)
Speaker: 羅正偉
2005/12/07
23
Constraints(1/3)
Notation: a_i ≤ b_j
This means that position i in sequence a must appear before position j in sequence b in the multiple sequence alignment.
24
Constraints(2/3)
[Figure: sequences a, c, and b, with positions a_i, c_x, c_y, and b_j marked]
If x ≤ y, then a_i ≤ c_x ≤ c_y ≤ b_j, and so a_i ≤ b_j by transitivity.
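The transitivity argument can be sketched in a few lines. The input format is hypothetical: each known constraint is stored as a pair of positions.

```python
def implied_constraints(a_to_c, c_to_b):
    """a_to_c: pairs (i, x) meaning position i of a comes no later than position x of c.
    c_to_b: pairs (y, j) meaning position y of c comes no later than position j of b.
    Returns the pairs (i, j) such that position i of a must come
    no later than position j of b.
    """
    # Positions within c are ordered, so x <= y links the two constraints.
    return {(i, j) for i, x in a_to_c for y, j in c_to_b if x <= y}
```

For example, `implied_constraints([(5, 100)], [(120, 40), (90, 7)])` yields `{(5, 40)}`: the constraint through position 120 of c applies because 100 ≤ 120, while the one through position 90 does not.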
25
Constraints(3/3)
• The above information can be used in the alignment of the ancestral sequences by requiring potential anchors between the sequences to satisfy the constraints.
26
Prime Constraints(1/4)
• Consider every triplet of sequences (a, b, c) with a in u, b in v, and c not in x.
• Every triplet can provide potential constraints for the alignment.
• If there are n sequences, there are O(n^3) such triplets.
Too many constraints!
[Figure: tree node x with subtrees u and v]
27
Prime Constraints(2/4)
• Actually, we don’t need to find all possible constraints, many of which will be redundant.
• Instead, we wish to find a set of prime constraints
• In this set, no constraint is implied by the others.
• Such a set can be inferred from the homology map.
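Redundancy can be illustrated for a single pair of sequences. Writing a constraint as (i, j) for "position i of a comes no later than position j of b", the constraint (i, j) is implied by another constraint (i2, j2) whenever i ≤ i2 and j2 ≤ j. The sketch below keeps only constraints that no other constraint implies. It does not use the homology map and is not the authors' procedure; the function name is hypothetical.

```python
def prime_constraints(constraints):
    """constraints: iterable of (i, j) pairs for one pair of sequences.
    Returns the constraints not implied by any other, sorted by i.
    """
    prime = []
    smallest_j = None
    # Visit larger i first; for equal i, smaller j first.
    for i, j in sorted(set(constraints), key=lambda c: (-c[0], c[1])):
        # Every constraint seen so far has i2 >= i, so (i, j) is
        # redundant unless its j beats all of them.
        if smallest_j is None or j < smallest_j:
            prime.append((i, j))
            smallest_j = j
    return prime[::-1]
```

Given `[(5, 10), (3, 12), (7, 20), (2, 4), (7, 25)]`, the result is `[(2, 4), (5, 10), (7, 20)]`; for instance (3, 12) is dropped because (5, 10) already implies it.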
28
Illustration
29
Prime Constraints(3/4)
• If there are m sets of orthologous exons, then at node x there can be at most O(m) prime constraints.
• The sets of all prime constraints can be found in O(mk^2) time, where k is the number of leaves below x.
30
Prime Constraints(4/4)
• Matches between the ancestral sequences that are inconsistent with this set of constraints can be filtered out in time O(N log m), where N is the total number of matches.
• For typical values of m and k, the time taken computing and utilizing the constraints is negligible.
31
Tree Building(1/3)
• Most multiple alignment programs require pairwise alignments of all the sequences to build an initial guide tree. (Quadratic number of sequence alignments)
• We utilize an iterative method to obtain a guide tree using only a linear number of alignments.
32
Tree Building(2/3)
• The initial guide tree is selected randomly from the set of complete binary trees.
• The sequences are aligned using this random tree, and then a phylogenetic tree is inferred from the resulting multiple alignment.
• The above process is iterated until the alignment and tree are satisfactory.
33
Tree Building(3/3)
• Instead of computing all pairwise alignments, only O(nk) alignments are necessary to perform n iterations with k sequences.
• We found that for typical alignment problems, only a small number of iterations were necessary.
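The saving can be checked with simple arithmetic. A binary guide tree with k leaves has k - 1 internal nodes, so one progressive pass needs k - 1 alignments, and n passes need n(k - 1), which is O(nk). All-pairs alignment needs k(k - 1)/2. The function names and the choice of three iterations in the usage note are illustrative assumptions.

```python
def all_pairs_alignments(k):
    """Pairwise alignments needed to fill a full distance matrix."""
    return k * (k - 1) // 2


def iterative_alignments(k, n):
    """Alignments needed for n progressive passes over a binary guide tree
    with k leaves (one alignment per internal node per pass)."""
    return n * (k - 1)
```

For 21 sequences, all-pairs alignment needs 210 alignments, while three iterations of the progressive method would need 60.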
34
Experimental Results 1
• A human, mouse, and rat whole-genome multiple alignment.
– A homology map for the genomes was built by C. Dewey, and was used to generate gene anchors and constraints.
– Chromosome 20 was chosen because it aligns almost completely with mouse chromosome 2.
35
Experimental Results 1 (cont.)
Coverage of human chromosome 20 RefSeq exons by the MAVID alignments. Of a total of 3927 exons, only six were not in the homology map. A total of 53.5% of the exons were covered by precomputed exon anchors in either mouse or rat. The remaining exons are mostly aligned by MAVID, resulting in 93.6% of the exons covered by alignment in either mouse or rat.
36
Experimental Results 2
• Alignment of 21 Organisms
– We aligned 1.8 Mb of human sequence together with the homologous regions from 20 other organisms, for a total of 23 Mb of sequence.
– Baboon, cat, chicken, chimp, cow, dog, dunnart, fugu, hedgehog, horse, lemur, macaque, mouse, opossum, pig, platypus, rabbit, rat, tetraodon, and zebrafish.
37
Experimental Results 2(cont.)
• The MAVID alignments were compared with MLAGAN, version 1.1 (Brudno et al. 2003).
• MLAGAN is the only other program we know of that is able to align the 21 sequences in a reasonable period of time.
38
Experimental Results 2(cont.)
• MAVID and MLAGAN both aligned sequences correctly.
• MAVID took 40 min, while MLAGAN took roughly 6 h.