Picking Alignments from (Steiner) Trees
Fumei Lam
Marina Alexandersson
Lior Pachter
Alignment
Pair Hidden Markov Models
Steiner Networks
ATCG--GA-CGTCA
M
X
Y
biologically meaningful
fast alignments based on HMM structure
Some basic definitions:
Let G be a graph and S ⊆ V(G). A k-spanner for S is a subgraph G' ⊆ G such that for any u, v ∈ S the length of the shortest path between u and v in G' is at most k times the distance between u and v in G.
Let V(G) = R^2 and E(G) = horizontal and vertical line segments. A Manhattan network is a 1-spanner for a set S of points in R^2. Vertices in the Manhattan network that are not in S are called Steiner points.
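The definition above can be checked mechanically on small inputs. The following is a minimal sketch, assuming integer coordinates and a network given as a list of axis-parallel segments; the names `unit_edges`, `network_distance` and `is_manhattan_network` are hypothetical helpers introduced only for illustration. It tests whether every pair of points in S is joined by a path whose length equals their L1 distance.

```python
from collections import deque
from itertools import combinations


def unit_edges(segments):
    """Split axis-parallel integer segments into unit grid edges (adjacency map)."""
    adj = {}
    for (x1, y1), (x2, y2) in segments:
        if x1 != x2 and y1 != y2:
            raise ValueError("segments must be horizontal or vertical")
        dx = (x2 > x1) - (x2 < x1)  # step direction along x: -1, 0 or 1
        dy = (y2 > y1) - (y2 < y1)  # step direction along y: -1, 0 or 1
        x, y = x1, y1
        while (x, y) != (x2, y2):
            nxt = (x + dx, y + dy)
            adj.setdefault((x, y), set()).add(nxt)
            adj.setdefault(nxt, set()).add((x, y))
            x, y = nxt
    return adj


def network_distance(adj, src):
    """Breadth-first search; every edge has length 1."""
    dist = {src: 0}
    queue = deque([src])
    while queue:
        p = queue.popleft()
        for q in adj.get(p, ()):
            if q not in dist:
                dist[q] = dist[p] + 1
                queue.append(q)
    return dist


def is_manhattan_network(segments, points):
    """True if the network is a 1-spanner for `points` under the L1 metric."""
    adj = unit_edges(segments)
    for u, v in combinations(points, 2):
        l1 = abs(u[0] - v[0]) + abs(u[1] - v[1])
        # A missing point yields None, which never equals l1.
        if network_distance(adj, u).get(v) != l1:
            return False
    return True


S = [(0, 0), (2, 1), (1, 2)]
good = [((0, 0), (1, 0)), ((1, 0), (1, 2)), ((1, 1), (2, 1))]
detour = [((0, 0), (2, 0)), ((2, 0), (2, 2)), ((2, 2), (1, 2))]
print(is_manhattan_network(good, S), is_manhattan_network(detour, S))
```

In the first network the junction at (1, 1) is not in S, so it plays the role of a Steiner point. The second network connects the same points but forces a detour between (0, 0) and (1, 2), so it is not a 1-spanner.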
Example:
S: red points
Manhattan network
Steiner point
[Gudmundsson-Levcopoulos-Narasimhan 2001] Find the shortest Manhattan network connecting the points
4-approximation in O(n^3) and 8-approximation in O(n log n)
[Gudmundsson-Levcopoulos-Narasimhan 2001] proof outline: 1. it suffices to work on the Hanan grid
A(v) = {u : v is the topmost node below and to the left of u}
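Step 1 restricts attention to the Hanan grid, which is obtained by drawing a horizontal and a vertical line through every input point and keeping the intersections. A minimal sketch of that construction follows; the function name `hanan_grid` and the tuple representation of points are assumptions made for illustration.

```python
def hanan_grid(points):
    """Return (vertices, edges) of the Hanan grid of a set of 2D points."""
    xs = sorted({x for x, _ in points})
    ys = sorted({y for _, y in points})
    # Every crossing of a vertical and a horizontal line is a grid vertex.
    vertices = [(x, y) for x in xs for y in ys]
    edges = []
    for y in ys:  # horizontal edges between consecutive x-coordinates
        for a, b in zip(xs, xs[1:]):
            edges.append(((a, y), (b, y)))
    for x in xs:  # vertical edges between consecutive y-coordinates
        for a, b in zip(ys, ys[1:]):
            edges.append(((x, a), (x, b)))
    return vertices, edges


vertices, edges = hanan_grid([(0, 0), (2, 1), (1, 2)])
print(len(vertices), len(edges))
```

For n points with distinct coordinates the grid has n^2 vertices, so candidate Steiner points come from a finite set rather than from the whole plane.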
[Gudmundsson-Levcopoulos-Narasimhan 2001] proof outline: 2. Construct local slides (for all four orientations)
v
slide
[Gudmundsson-Levcopoulos-Narasimhan 2001] proof outline: 3. Solve each slide
The minimum slide arborescence problem:
Lingas-Pinter-Rivest-Shamir 1982
O(n^3) optimal solution using dynamic programming
[Gudmundsson-Levcopoulos-Narasimhan 2001] proof outline: 4. Proof of correctness
u
v
b
a
What is an alignment?
ATCG--GACATTACC-ACAC-GTCA-GATTA-CAAC
M
X
Y
M = (mis)match
X = insert seq1
Y = insert seq2
Pair HMMs
Simple sequence-alignment PHMM
M X YM M Y M
Hidden sequence:
AA
TC
C-
GG
-T
-C
GA
Observed sequence:
ATCGGACGTCA
Hidden alignment:
ATCG--GAC-GTCA
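The relationship between alignment columns and hidden states can be made concrete with a short sketch. It assumes the convention stated earlier (M = (mis)match, X = insert seq1, Y = insert seq2) and represents an alignment as two gapped rows of equal length; `hidden_states` is a hypothetical helper name. The example rows are built from the seven column pairs listed under the hidden sequence.

```python
def hidden_states(row1, row2, gap="-"):
    """Map each alignment column to a PHMM state label (M, X or Y)."""
    if len(row1) != len(row2):
        raise ValueError("alignment rows must have equal length")
    states = []
    for a, b in zip(row1, row2):
        if a == gap and b == gap:
            raise ValueError("a column cannot be a gap in both sequences")
        if b == gap:
            states.append("X")  # residue only in seq1
        elif a == gap:
            states.append("Y")  # residue only in seq2
        else:
            states.append("M")  # match or mismatch
    return "".join(states)


# Columns AA, TC, C-, GG, -T, -C, GA written as two gapped rows.
print(hidden_states("ATCG--G", "AC-GTCA"))  # MMXMYYM
```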
Pair HMMs
transition probabilities
output probabilities
Using the Pair HMM
In practice, we have an observed sequence
ATCGGACGTCA
for which we wish to infer the underlying hidden states
One solution: among all possible sequences of hidden states, determine the most likely (Viterbi algorithm).
ATCG--GAC-GTCA
MMXMYYM
Viterbi in PHMM ≡ Needleman-Wunsch
[State diagram: states M, X, Y; six edges each labeled 1-3]
Match prob: pm
Mismatch prob: pr
Gap prob: pg
Match score: log(pm)
Mismatch score: log(pr)
Gap score: log(pg)
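The scores above plug directly into a Needleman-Wunsch recursion. The sketch below is a simplified illustration, not the PHMM from the slides: it assumes a single dynamic-programming table with a linear gap score and ignores state-to-state transition probabilities, and the function name and the numeric values of pm, pr and pg are made up for the example.

```python
import math


def needleman_wunsch(s, t, pm, pr, pg):
    """Global alignment with scores log(pm), log(pr), log(pg).

    Returns (best log-score, state string over M/X/Y).
    """
    match, mismatch, gap = math.log(pm), math.log(pr), math.log(pg)
    n, m = len(s), len(t)
    score = [[0.0] * (m + 1) for _ in range(n + 1)]
    back = [[None] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):  # leading residues of s aligned to gaps
        score[i][0], back[i][0] = score[i - 1][0] + gap, "X"
    for j in range(1, m + 1):  # leading residues of t aligned to gaps
        score[0][j], back[0][j] = score[0][j - 1] + gap, "Y"
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if s[i - 1] == t[j - 1] else mismatch
            candidates = [
                (score[i - 1][j - 1] + sub, "M"),
                (score[i - 1][j] + gap, "X"),
                (score[i][j - 1] + gap, "Y"),
            ]
            # Ties resolve to the first candidate, i.e. M before X before Y.
            score[i][j], back[i][j] = max(candidates, key=lambda c: c[0])
    # Trace back from the bottom-right corner to recover the state path.
    states = []
    i, j = n, m
    while i > 0 or j > 0:
        state = back[i][j]
        states.append(state)
        if state == "M":
            i, j = i - 1, j - 1
        elif state == "X":
            i -= 1
        else:
            j -= 1
    return score[n][m], "".join(reversed(states))


best, path = needleman_wunsch("ACGT", "AGT", pm=0.9, pr=0.05, pg=0.01)
print(path)  # MXMM
```

Because the scores are logarithms of probabilities, maximizing their sum corresponds to maximizing a product of probabilities, which is the sense in which the Viterbi path and the optimal Needleman-Wunsch alignment coincide.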
Want to take into account that the sequences are genomic sequences:
Example: a pair of syntenic genomic regions
YX
PHMM
M
X
Y
YX
PHMM
• A property of “single sequence” states is that all paths in the Viterbi graph between two vertices have the same weight
Strategy for Alignment
GATTACATTGATCAGACAGGTGAAGA
GATCTTCATGTAG
The CD4 region
[Figure: human sequence vs. mouse sequence, each axis running from 0 to 50000]
5’ 3’
Exon 1 Exon 2 Exon 3 Exon 4
Intron 1 Intron 2 Intron 3
Branchpoint: CTGAC
Splice site: CAG
Splice site: GGTGAG
Translation initiation: ATG
Stop codon: TAG/TGA/TAA
Suggests a new Steiner problem
Find the shortest 1-spanner connecting reds to blues
Generalizes the Manhattan network problem (all points red and blue)
Generalizes the Rectilinear Steiner Arborescence problem
History of the Rectilinear Steiner Arborescence Problem
1985, Trubin - polynomial time algorithm
1992, Rao-Sadayappan-Hwang-Shor - error in Trubin
2000, Shi and Su - NP-complete!
Results for unlabeled problem
• An O(n^3) 2-approximation algorithm (implemented)
• An O(n log n) 4-approximation algorithm
• Testing on CD4 region in human/mouse
• Implementation (SLIM) http://bio.math.berkeley.edu/slim/
• SLIM for SLAM (in progress) http://bio.math.berkeley.edu/slam/
TAAT GTATTGAGGTATTGAG TGAA
CTG GTTGGTCCTCAG GTG TGTC
ATGTCCACGG
GA GT TACA TC
TTGTACACGGCA G
T GT ACGCT GG
ATGTAAC
ACATGTA
X
CNS
Y
M
D
I
The Viterbi graph for a more complicated alignment
PHMM
Comparison and Analysis of Performance
Our method has two main steps (L = length of seqs, n = number of HSPs):
1. Building the network: O(n^3) or O(n log n)
2. Running the Viterbi algorithm for the HMM on the network: O(nL) worst case
• Banding algorithms are O(L^2) worst case for step 2.
• Chaining algorithms are O(n^2) in the case where gap penalties can depend on the sequences.
• These strategies do not generalize well for more sophisticated HMMs.
Summary
Thanks: Nick Bray and Simon Cawley
SLIM (network build): http://bio.math.berkeley.edu/slim/
SLAM (alignment): http://bio.math.berkeley.edu/slam/
ATCG--GA-CGTCA
M
X
Y
Software: