Picking Alignments from (Steiner) Trees

Post on 12-Jan-2016


Transcript of Picking Alignments from (Steiner) Trees

Picking Alignments from (Steiner) Trees

Fumei Lam

Marina Alexandersson

Lior Pachter

Alignment

Pair Hidden Markov Models

Steiner Networks

ATCG--G
A-CGTCA

[Figure: pair HMM with states M, X, Y]

biologically meaningful

fast alignments based on HMM structure

Some basic definitions:

Let G be a graph and S ⊆ V(G). A k-spanner for S is a subgraph G' ⊆ G such that for any u, v ∈ S the length of the shortest path between u and v in G' is at most k times the distance between u and v in G.

Let V(G) = R² and E(G) = horizontal and vertical line segments. A Manhattan network is a 1-spanner for a set S of points in R². Vertices in the Manhattan network that are not in S are called Steiner points.
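The definition above can be checked mechanically on small inputs. The following is a minimal sketch, assuming the network is given as a list of axis-parallel segments that are already split at every junction (so every vertex is a segment endpoint) and that coordinates are integers. The function names are illustrative and do not come from the talk.

```python
import heapq
from itertools import combinations


def l1(p, q):
    """Manhattan (L1) distance between two points of the plane."""
    return abs(p[0] - q[0]) + abs(p[1] - q[1])


def shortest_paths(segments, source):
    """Dijkstra over the graph whose edges are the given segments."""
    adj = {}
    for p, q in segments:
        # Assumption: every segment is horizontal or vertical.
        assert p[0] == q[0] or p[1] == q[1], "segment must be axis-parallel"
        adj.setdefault(p, []).append((q, l1(p, q)))
        adj.setdefault(q, []).append((p, l1(p, q)))
    dist = {source: 0}
    heap = [(0, source)]
    while heap:
        d, p = heapq.heappop(heap)
        if d > dist.get(p, float("inf")):
            continue  # stale heap entry
        for q, w in adj.get(p, []):
            if d + w < dist.get(q, float("inf")):
                dist[q] = d + w
                heapq.heappush(heap, (d + w, q))
    return dist


def is_manhattan_network(segments, S):
    """True if every pair in S is joined by a path of length equal to its L1 distance."""
    for u, v in combinations(S, 2):
        if shortest_paths(segments, u).get(v, float("inf")) != l1(u, v):
            return False
    return True


def steiner_points(segments, S):
    """Vertices of the network that are not in S."""
    return {p for seg in segments for p in seg} - set(S)
```

For S = {(0,0), (1,1), (2,0)} and the segments (0,0)-(1,0), (1,0)-(2,0), (1,0)-(1,1), the check succeeds and (1,0) is the only Steiner point.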

Example:

[Figure: the points of S in red, a Manhattan network connecting them, and one of its Steiner points]

[Gudmundsson-Levcopoulos-Narasimhan 2001] Find the shortest Manhattan network connecting the points

4-approximation in O(n³) and 8-approximation in O(n log n)

[Gudmundsson-Levcopoulos-Narasimhan 2001] proof outline: 1. it suffices to work on the Hanan grid
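The Hanan grid of a point set is obtained by drawing a horizontal and a vertical line through every point and keeping the intersections inside the bounding box. A minimal sketch of that construction follows; the function name and the return format are assumptions made for illustration.

```python
def hanan_grid(S):
    """Return (vertices, edges) of the Hanan grid of the point set S.

    Vertices are all (x, y) with x an x-coordinate and y a y-coordinate
    occurring in S; edges join neighbouring vertices on each grid line.
    """
    xs = sorted({x for x, _ in S})
    ys = sorted({y for _, y in S})
    vertices = [(x, y) for x in xs for y in ys]
    edges = []
    # Horizontal edges between consecutive x-coordinates on each row.
    for y in ys:
        for x0, x1 in zip(xs, xs[1:]):
            edges.append(((x0, y), (x1, y)))
    # Vertical edges between consecutive y-coordinates on each column.
    for x in xs:
        for y0, y1 in zip(ys, ys[1:]):
            edges.append(((x, y0), (x, y1)))
    return vertices, edges
```

Three points in general position, for example (0,0), (2,1) and (1,3), give a 3 × 3 grid with 9 vertices and 12 edges.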

A(v) = {u : v is the topmost node below and to the left of u}

[Gudmundsson-Levcopoulos-Narasimhan 2001] proof outline: 2. Construct local slides (for all four orientations)

[Figure: a slide at node v]

[Gudmundsson-Levcopoulos-Narasimhan 2001] proof outline: 3. Solve each slide

The minimum slide arborescence problem:

Lingas-Pinter-Rivest-Shamir 1982

O(n³) optimal solution using dynamic programming

[Gudmundsson-Levcopoulos-Narasimhan 2001] proof outline: 4. Proof of correctness

[Figure: nodes u, v, a, b]

What is an alignment?

ATCG--GACATTACC-AC
AC-GTCA-GATTA-CAAC

M

X

Y

M = (mis)match

X = insert seq1

Y = insert seq2
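Reading the M/X/Y labels off a gapped alignment is mechanical: a column with two residues is M, a column with a gap in the second row is X, and a column with a gap in the first row is Y. A small sketch, with an illustrative function name:

```python
def state_path(row1, row2):
    """Label each alignment column with M, X or Y.

    M = (mis)match, X = residue only in sequence 1, Y = residue only in sequence 2.
    """
    if len(row1) != len(row2):
        raise ValueError("alignment rows must have equal length")
    states = []
    for a, b in zip(row1, row2):
        if a == "-" and b == "-":
            raise ValueError("a column cannot consist of two gaps")
        if b == "-":
            states.append("X")
        elif a == "-":
            states.append("Y")
        else:
            states.append("M")
    return "".join(states)


print(state_path("ATCG--G", "AC-GTCA"))  # MMXMYYM
```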

Pair HMMs

Simple sequence-alignment PHMM

[Figure: PHMM with states M, X, Y]

Hidden sequence:

M M X M Y Y M

AA TC C- GG -T -C GA

Observed sequences:

ATCGG
ACGTCA

Hidden alignment:

ATCG--G
AC-GTCA

Pair HMMs

transition probabilities

output probabilities

Using the Pair HMM

In practice, we have the observed sequences

ATCGG
ACGTCA

for which we wish to infer the underlying hidden states.

One solution: among all possible sequences of hidden states, determine the most likely (Viterbi algorithm).

ATCG--G
AC-GTCA

MMXMYYM

Viterbi in PHMM ≡ Needleman-Wunsch

[Figure: PHMM state diagram with states M, X, Y and labeled transitions]

Match prob: pm

Mismatch prob: pr

Gap prob: pg

Match score: log(pm)

Mismatch score: log(pr)

Gap score: log(pg)
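With the scores log(pm), log(pr) and log(pg), the most likely state path can be found by the usual Needleman-Wunsch recursion. The sketch below is a simplification: it uses one score per column and ignores separate transition probabilities, so it illustrates the correspondence rather than a full PHMM Viterbi. The function name and the numeric values in the example are assumptions.

```python
import math


def needleman_wunsch(x, y, pm, pr, pg):
    """Global alignment with log-probability scores.

    Returns (best score, state path over M/X/Y).
    """
    match, mismatch, gap = math.log(pm), math.log(pr), math.log(pg)
    n, m = len(x), len(y)
    score = [[0.0] * (m + 1) for _ in range(n + 1)]
    back = [[None] * (m + 1) for _ in range(n + 1)]
    # First column: only residues of x (state X); first row: only y (state Y).
    for i in range(1, n + 1):
        score[i][0], back[i][0] = i * gap, "X"
    for j in range(1, m + 1):
        score[0][j], back[0][j] = j * gap, "Y"
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = match if x[i - 1] == y[j - 1] else mismatch
            candidates = [
                (score[i - 1][j - 1] + s, "M"),
                (score[i - 1][j] + gap, "X"),
                (score[i][j - 1] + gap, "Y"),
            ]
            score[i][j], back[i][j] = max(candidates, key=lambda c: c[0])
    # Trace back from the bottom-right corner to recover the state path.
    states, i, j = [], n, m
    while i > 0 or j > 0:
        state = back[i][j]
        states.append(state)
        if state == "M":
            i, j = i - 1, j - 1
        elif state == "X":
            i -= 1
        else:
            j -= 1
    return score[n][m], "".join(reversed(states))


# Hypothetical probabilities, chosen only for illustration.
best, path = needleman_wunsch("ATCGG", "ACGTCA", pm=0.6, pr=0.2, pg=0.1)
```

Because the scores are logarithms of probabilities, maximizing the sum of scores is the same as maximizing the product of the probabilities along the path.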

Want to take into account that the sequences are genomic sequences:

Example: a pair of syntenic genomic regions

[Figure: model combining single-sequence states X and Y with PHMM blocks (states M, X, Y)]

• A property of “single sequence” states is that all paths in the Viterbi graph between two vertices have the same weight
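This property can be illustrated on a toy model. The sketch assumes that a step consuming a residue of either sequence adds the log of a background probability q for that residue, and that transitions between single-sequence states add nothing. Both the distribution q and the function name are hypothetical. Under these assumptions every monotone path between two corners consumes the same residues, so the weights agree up to floating-point rounding.

```python
import math
from itertools import combinations


def path_weights(x, y, q):
    """Weights of all monotone paths from (0, 0) to (len(x), len(y)).

    A step along x adds log q[x_i]; a step along y adds log q[y_j].
    """
    n, m = len(x), len(y)
    weights = []
    # Choose which of the n + m steps consume a residue of x.
    for x_steps in combinations(range(n + m), n):
        x_steps = set(x_steps)
        i = j = 0
        w = 0.0
        for step in range(n + m):
            if step in x_steps:
                w += math.log(q[x[i]])
                i += 1
            else:
                w += math.log(q[y[j]])
                j += 1
        weights.append(w)
    return weights


# Hypothetical background distribution, for illustration only.
q = {"A": 0.3, "C": 0.2, "G": 0.2, "T": 0.3}
ws = path_weights("GAT", "CA", q)
print(len(ws), max(ws) - min(ws) < 1e-9)
```

Since the order of the steps does not matter in such a region, a single path through it is enough to carry the Viterbi score.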

Strategy for Alignment

GATTACATTGATCAGACAGGTGAAGA

GATCTTCATGTAG

The CD4 region

[Figure: plot of the CD4 region; axes: human and mouse, each from 0 to 50000]

5’ 3’

Exon 1 Exon 2 Exon 3 Exon 4

Intron 1 Intron 2 Intron 3

Branchpoint: CTGAC

Splice site: CAG

Splice site: GGTGAG

Translation initiation: ATG

Stop codon: TAG/TGA/TAA

Suggests a new Steiner problem

Find the shortest 1-spanner connecting reds to blues

Generalizes the Manhattan network problem (all points red and blue)

Generalizes the Rectilinear Steiner Arborescence problem

History of the Rectilinear Steiner Arborescence Problem

1985, Trubin - polynomial time algorithm

1992, Rao-Sadayappan-Hwang-Shor - error in Trubin

2000, Shi and Su - NP-complete!

Results for unlabeled problem

• An O(n³) 2-approximation algorithm (implemented)

• An O(n log n) 4-approximation algorithm

• Testing on CD4 region in human/mouse

• Implementation (SLIM) http://bio.math.berkeley.edu/slim/

• SLIM for SLAM (in progress) http://bio.math.berkeley.edu/slam/

TAAT GTATTGAGGTATTGAG TGAA

CTG GTTGGTCCTCAG GTG TGTC

ATGTCCACGG

GA GT TACA TC

TTGTACACGGCA G

T GT ACGCT GG

ATGTAAC

ACATGTA

The Viterbi graph for a more complicated alignment PHMM

[Figure: PHMM with states X, CNS, Y, M, D, I]

Comparison and Analysis of Performance

Our method has two main steps (L = length of the sequences, n = number of HSPs):

1. Building the network: O(n³) or O(n log n)

2. Running the Viterbi algorithm for the HMM on the network: O(nL) worst case

• Banding algorithms are O(L²) worst case for step 2.

• Chaining algorithms are O(n²) in the case where gap penalties can depend on the sequences.

• These strategies do not generalize well for more sophisticated HMMs.
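The saving in step 2 comes from evaluating the dynamic program only on the part of the Viterbi graph covered by the network instead of on all of it. The sketch below illustrates that idea and is not the SLIM/SLAM implementation. It assumes the network is given as a set of allowed cells (i, j), uses simple match/mismatch/gap scores in place of a full HMM, and all names are hypothetical.

```python
import math


def restricted_alignment_score(x, y, allowed, match=1.0, mismatch=-1.0, gap=-2.0):
    """Best score of an alignment path from (0, 0) to (len(x), len(y))
    that only visits cells in `allowed`; -inf if no such path exists.

    The work is proportional to the number of allowed cells (plus sorting).
    """
    n, m = len(x), len(y)
    best = {(0, 0): 0.0}  # the start cell is always available
    # Lexicographic order guarantees predecessors are processed first.
    for i, j in sorted(allowed):
        if (i, j) == (0, 0) or not (0 <= i <= n and 0 <= j <= m):
            continue
        candidates = []
        if (i - 1, j - 1) in best:  # diagonal step: (mis)match
            s = match if x[i - 1] == y[j - 1] else mismatch
            candidates.append(best[(i - 1, j - 1)] + s)
        if (i - 1, j) in best:  # residue of x only
            candidates.append(best[(i - 1, j)] + gap)
        if (i, j - 1) in best:  # residue of y only
            candidates.append(best[(i, j - 1)] + gap)
        if candidates:
            best[(i, j)] = max(candidates)
    return best.get((n, m), -math.inf)


# Only the main diagonal is allowed: 5 cells instead of 25.
diagonal = {(i, i) for i in range(5)}
print(restricted_alignment_score("ACGT", "ACGT", diagonal))  # 4.0
```

When the allowed set is the whole grid, this reduces to ordinary global alignment; a banding strategy corresponds to allowing a strip around the diagonal.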

Summary

Thanks: Nick Bray and Simon Cawley

Software:

SLIM (network build): http://bio.math.berkeley.edu/slim/

SLAM (alignment): http://bio.math.berkeley.ed/slam/

ATCG--G
A-CGTCA

M

X

Y