Picking Alignments from (Steiner) Trees


Page 1: Picking Alignments from (Steiner) Trees

Picking Alignments from (Steiner) Trees

Fumei Lam

Marina Alexandersson

Lior Pachter

Page 2: Picking Alignments from (Steiner) Trees

Alignment

Pair Hidden Markov Models

Steiner Networks

ATCG--G
A-CGTCA

M

X

Y

biologically meaningful

fast alignments based on HMM structure

Page 3: Picking Alignments from (Steiner) Trees

Some basic definitions:

Let G be a graph and S ⊆ V(G). A k-spanner for S is a subgraph G' ⊆ G such that for any u, v ∈ S the length of the shortest path between u and v in G' is at most k times the distance between u and v in G.

Let V(G) = R^2 and E(G) = horizontal and vertical line segments. A Manhattan network is a 1-spanner for a set S of points in R^2. Vertices in the Manhattan network that are not in S are called Steiner points.
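The definition can be checked mechanically on small examples. The sketch below is an illustration only, not code from the talk: it assumes the network is given as a list of axis-parallel segments whose endpoints are the network's vertices, and the function names `is_manhattan_network` and `steiner_points` are made up for this example. Since no rectilinear path can be shorter than the L1 distance, being a 1-spanner amounts to every pair of points in S having a network path of exactly that length.

```python
import heapq
from itertools import combinations

def shortest_path_lengths(edges, source):
    """Dijkstra over a network given as axis-parallel segments (p, q)."""
    adj = {}
    for p, q in edges:
        if p[0] != q[0] and p[1] != q[1]:
            raise ValueError("edge is not horizontal or vertical")
        w = abs(p[0] - q[0]) + abs(p[1] - q[1])
        adj.setdefault(p, []).append((q, w))
        adj.setdefault(q, []).append((p, w))
    dist = {source: 0}
    heap = [(0, source)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist.get(u, float("inf")):
            continue  # stale heap entry
        for v, w in adj.get(u, []):
            nd = d + w
            if nd < dist.get(v, float("inf")):
                dist[v] = nd
                heapq.heappush(heap, (nd, v))
    return dist

def is_manhattan_network(edges, S):
    """True if every pair in S is joined by a path of length equal to its L1 distance."""
    for u, v in combinations(S, 2):
        l1 = abs(u[0] - v[0]) + abs(u[1] - v[1])
        if shortest_path_lengths(edges, u).get(v, float("inf")) != l1:
            return False
    return True

def steiner_points(edges, S):
    """Network vertices that are not in S."""
    return {p for e in edges for p in e} - set(S)
```

For S = {(0,0), (2,1)} and the L-shaped network (0,0)-(2,0)-(2,1), the check succeeds and (2,0) is reported as the only Steiner point.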

Page 4: Picking Alignments from (Steiner) Trees

Example:

S: red points

Manhattan network

Steiner point

Page 5: Picking Alignments from (Steiner) Trees

[Gudmundsson-Levcopoulos-Narasimhan 2001] Find the shortest Manhattan network connecting the points

4-approximation in O(n^3) and 8-approximation in O(n log n)

Page 6: Picking Alignments from (Steiner) Trees

[Gudmundsson-Levcopoulos-Narasimhan 2001] proof outline: 1. it suffices to work on the Hanan grid
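As a reminder of what the Hanan grid is: draw a horizontal and a vertical line through every point of S and take all intersections as vertices. A minimal sketch follows; the function name `hanan_grid` and the representation of edges as pairs of neighbouring grid points are choices made for this example.

```python
def hanan_grid(points):
    """Return (vertices, edges) of the Hanan grid of a set of (x, y) points."""
    xs = sorted({x for x, _ in points})
    ys = sorted({y for _, y in points})
    vertices = [(x, y) for x in xs for y in ys]
    edges = []
    # Horizontal edges between consecutive x-coordinates on each grid row.
    for y in ys:
        for x0, x1 in zip(xs, xs[1:]):
            edges.append(((x0, y), (x1, y)))
    # Vertical edges between consecutive y-coordinates on each grid column.
    for x in xs:
        for y0, y1 in zip(ys, ys[1:]):
            edges.append(((x, y0), (x, y1)))
    return vertices, edges
```

For n points in general position the grid has n^2 vertices, so restricting the search to it keeps the problem finite.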

Page 7: Picking Alignments from (Steiner) Trees

[Gudmundsson-Levcopoulos-Narasimhan 2001] proof outline: 2. Construct local slides (for all four orientations)

A(v) = {u : v is the topmost node below and to the left of u}

v

slide
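The sets A(v) can be computed directly from the definition. This is a quadratic-time sketch for one of the four orientations; it assumes "below and to the left" means both coordinates are less than or equal, and it breaks ties among equally high candidates by taking the rightmost one. Both are assumptions of this example, since the slide does not spell them out.

```python
from collections import defaultdict

def slide_sets(points):
    """Map each v to A(v) = [u : v is the topmost point below and to the left of u]."""
    A = defaultdict(list)
    for u in points:
        # Candidates: other points weakly below and to the left of u (assumption).
        candidates = [v for v in points
                      if v != u and v[0] <= u[0] and v[1] <= u[1]]
        if not candidates:
            continue
        # Topmost candidate; ties go to the rightmost one (assumption).
        v = max(candidates, key=lambda p: (p[1], p[0]))
        A[v].append(u)
    return dict(A)
```

For the points (0,0), (1,2), (2,1), (3,3) this gives A((0,0)) = [(1,2), (2,1)] and A((1,2)) = [(3,3)].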

Page 8: Picking Alignments from (Steiner) Trees

[Gudmundsson-Levcopoulos-Narasimhan 2001] proof outline: 3. Solve each slide

The minimum slide arborescence problem:

Lingas-Pinter-Rivest-Shamir 1982

O(n^3) optimal solution using dynamic programming

Page 9: Picking Alignments from (Steiner) Trees

[Gudmundsson-Levcopoulos-Narasimhan 2001] proof outline: 4. Proof of correctness

u

v

b

a

Page 10: Picking Alignments from (Steiner) Trees

What is an alignment?

ATCG--GACATTACC-AC
AC-GTCA-GATTA-CAAC

Page 11: Picking Alignments from (Steiner) Trees

M

X

Y

M = (mis)match
X = insert seq1
Y = insert seq2

Pair HMMs

Simple sequence-alignment PHMM

Page 12: Picking Alignments from (Steiner) Trees

Pair HMMs

Hidden sequence (transition probabilities): M M X M Y Y M

Emitted pairs (output probabilities): AA TC C- GG -T -C GA

Observed sequences:

ATCGG
ACGTCA

Hidden alignment:

ATCG--G
AC-GTCA
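The link between a hidden state sequence and an alignment can be made concrete. The sketch below (the function name is invented for this example) walks a state string over the two observed sequences: M consumes a letter from both, X a letter from seq1 only, Y a letter from seq2 only.

```python
def alignment_from_states(states, seq1, seq2):
    """Expand a path of M/X/Y states into the two rows of the alignment."""
    row1, row2 = [], []
    i = j = 0
    for s in states:
        if s == "M":      # (mis)match: emit one letter from each sequence
            row1.append(seq1[i]); row2.append(seq2[j])
            i += 1; j += 1
        elif s == "X":    # insert in seq1: gap in seq2
            row1.append(seq1[i]); row2.append("-")
            i += 1
        elif s == "Y":    # insert in seq2: gap in seq1
            row1.append("-"); row2.append(seq2[j])
            j += 1
        else:
            raise ValueError(f"unknown state {s!r}")
    if i != len(seq1) or j != len(seq2):
        raise ValueError("state path does not consume both sequences")
    return "".join(row1), "".join(row2)
```

With the state path MMXMYYM and the sequences ATCGG and ACGTCA this returns the rows ATCG--G and AC-GTCA.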

Page 13: Picking Alignments from (Steiner) Trees

Using the Pair HMM

In practice, we have the observed sequences

ATCGG
ACGTCA

for which we wish to infer the underlying hidden states.

One solution: among all possible sequences of hidden states, determine the most likely (Viterbi algorithm).

ATCG--G
AC-GTCA

MMXMYYM

Page 14: Picking Alignments from (Steiner) Trees

Viterbi in PHMM ≡ Needleman-Wunsch

[Figure: PHMM with states M, X and Y; the transition labels appear as "1-3" in the extracted text]

Match prob: pm

Mismatch prob: pr

Gap prob: pg

Match score: log(pm)
Mismatch score: log(pr)
Gap score: log(pg)
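One way to read the equivalence is as a Needleman-Wunsch recursion whose scores are the log-probabilities listed above. The following sketch assumes a linear gap cost and ignores the transition probabilities, which cannot be recovered from the slide; the function name and the returned state string are choices made for this example.

```python
import math

def needleman_wunsch(seq1, seq2, pm, pr, pg):
    """Global alignment with scores log(pm), log(pr), log(pg); returns (score, states)."""
    match, mismatch, gap = math.log(pm), math.log(pr), math.log(pg)
    n, m = len(seq1), len(seq2)
    score = [[0.0] * (m + 1) for _ in range(n + 1)]
    back = [[None] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0], back[i][0] = i * gap, "X"
    for j in range(1, m + 1):
        score[0][j], back[0][j] = j * gap, "Y"
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if seq1[i - 1] == seq2[j - 1] else mismatch
            options = [
                (score[i - 1][j - 1] + sub, "M"),  # (mis)match
                (score[i - 1][j] + gap, "X"),      # letter of seq1 against a gap
                (score[i][j - 1] + gap, "Y"),      # letter of seq2 against a gap
            ]
            # On ties the first option wins, so M is preferred over X over Y.
            score[i][j], back[i][j] = max(options, key=lambda o: o[0])
    # Trace back from the corner to recover the most likely state path.
    states, i, j = [], n, m
    while i > 0 or j > 0:
        s = back[i][j]
        states.append(s)
        if s == "M":
            i, j = i - 1, j - 1
        elif s == "X":
            i -= 1
        else:
            j -= 1
    return score[n][m], "".join(reversed(states))
```

The full table has (n+1)(m+1) cells, which is the quadratic cost that restricting the search to a sparse network is meant to avoid.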

Page 15: Picking Alignments from (Steiner) Trees

Want to take into account that the sequences are genomic sequences:

Example: a pair of syntenic genomic regions

Page 16: Picking Alignments from (Steiner) Trees

YX

PHMM

M

X

Y

Page 17: Picking Alignments from (Steiner) Trees

YX

PHMM

• A property of “single sequence” states is that all paths in the Viterbi graph between two vertices have the same weight

Page 18: Picking Alignments from (Steiner) Trees

Strategy for Alignment

GATTACATTGATCAGACAGGTGAAGA

GATCTTCATGTAG

Page 19: Picking Alignments from (Steiner) Trees

The CD4 region

[Figure: the CD4 region, human vs. mouse; both axes run from 0 to 50000]

Page 20: Picking Alignments from (Steiner) Trees

5’ 3’

Exon 1 Exon 2 Exon 3 Exon 4

Intron 1 Intron 2 Intron 3

Branchpoint: CTGAC

Splice site: CAG

Splice site: GGTGAG

Translation initiation: ATG

Stop codon: TAG/TGA/TAA

Page 21: Picking Alignments from (Steiner) Trees

Suggests a new Steiner problem

Find the shortest 1-spanner connecting reds to blues

Page 22: Picking Alignments from (Steiner) Trees

Generalizes the Manhattan network problem (all points red and blue)

Generalizes the Rectilinear Steiner Arborescence problem

Page 23: Picking Alignments from (Steiner) Trees

History of the Rectilinear Steiner Arborescence Problem

1985, Trubin - polynomial time algorithm

1992, Rao-Sadayappan-Hwang-Shor - error in Trubin

2000, Shi and Su - NP-complete!

Page 24: Picking Alignments from (Steiner) Trees

Results for unlabeled problem

• An O(n^3) 2-approximation algorithm (implemented)

• An O(n log n) 4-approximation algorithm

• Testing on CD4 region in human/mouse

• Implementation (SLIM): http://bio.math.berkeley.edu/slim/

• SLIM for SLAM (in progress) http://bio.math.berkeley.edu/slam/

Page 25: Picking Alignments from (Steiner) Trees

[Figure: sequence fragments with state labels X, CNS, Y, M, D, I]

Page 26: Picking Alignments from (Steiner) Trees

The Viterbi graph for a more complicated alignment

PHMM

Page 27: Picking Alignments from (Steiner) Trees

Comparison and Analysis of Performance

Our method has two main steps (L = length of seqs, n = #HSP):

1. Building the network: O(n^3) or O(n log n)

2. Running the Viterbi algorithm for the HMM on the network: O(nL) worst case

• Banding algorithms are O(L^2) worst case for step 2.

• Chaining algorithms are O(n^2) in the case where gap penalties can depend on the sequences.

• These strategies do not generalize well for more sophisticated HMMs.

Page 28: Picking Alignments from (Steiner) Trees

Summary

ATCG--G
A-CGTCA

M

X

Y

Software:

SLIM (network build): http://bio.math.berkeley.edu/slim/

SLAM (alignment): http://bio.math.berkeley.edu/slam/

Thanks: Nick Bray and Simon Cawley