Picking Alignments from (Steiner) Trees

Fumei Lam

Marina Alexandersson

Lior Pachter

Alignment

Pair Hidden Markov Models

Steiner Networks

ATCG--GA-CGTCA

biologically meaningful

fast alignments based on HMM structure

Some basic definitions:

Let G be a graph and S ⊆ V(G). A k-spanner for S is a subgraph G' ⊆ G such that for any u, v ∈ S the length of the shortest path between u and v in G' is at most k times the distance between u and v in G.
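The k-spanner definition can be checked directly on a small weighted graph. The following is a minimal sketch, not code from the talk: the names `dijkstra` and `is_k_spanner` and the adjacency-dictionary representation are assumptions made for illustration.

```python
import heapq

def dijkstra(adj, source):
    """Shortest-path distances from source; adj maps node -> {neighbor: weight}."""
    dist = {source: 0.0}
    heap = [(0.0, source)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist.get(u, float("inf")):
            continue  # stale heap entry
        for v, w in adj.get(u, {}).items():
            nd = d + w
            if nd < dist.get(v, float("inf")):
                dist[v] = nd
                heapq.heappush(heap, (nd, v))
    return dist

def is_k_spanner(G, H, S, k):
    """True if subgraph H stretches no distance between points of S by more than k."""
    inf = float("inf")
    for u in S:
        dist_G = dijkstra(G, u)
        dist_H = dijkstra(H, u)
        for v in S:
            if v != u and dist_H.get(v, inf) > k * dist_G.get(v, inf):
                return False
    return True

# Hypothetical example: G is a unit-weight 4-cycle a-b-c-d-a, H drops the edge d-a.
G = {"a": {"b": 1, "d": 1}, "b": {"a": 1, "c": 1},
     "c": {"b": 1, "d": 1}, "d": {"c": 1, "a": 1}}
H = {"a": {"b": 1}, "b": {"a": 1, "c": 1},
     "c": {"b": 1, "d": 1}, "d": {"c": 1}}
```

In this example H is a 3-spanner for S = {a, d} but not a 2-spanner, since the only a-d path left in H has length 3 while the distance in G is 1.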

Let V(G) = R² and E(G) = horizontal and vertical line segments. A Manhattan network is a 1-spanner for a set S of points in R². Vertices in the Manhattan network that are not in S are called Steiner points.

Example:

S: red points

Manhattan network

Steiner point

[Gudmundsson-Levcopoulos-Narasimhan 2001] Find the shortest Manhattan network connecting the points

4-approximation in O(n^3) and 8-approximation in O(n log n)

[Gudmundsson-Levcopoulos-Narasimhan 2001] proof outline: 1. it suffices to work on the Hanan grid
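The Hanan grid is obtained by drawing a horizontal and a vertical line through every input point and keeping the intersections. The sketch below builds it for a small point set; the function name `hanan_grid` and the example coordinates are assumptions for illustration, not part of the talk.

```python
from itertools import product

def hanan_grid(points):
    """Return (nodes, edges) of the Hanan grid of a set of (x, y) points.

    edges maps a pair of adjacent grid nodes to the length of the segment
    joining them.
    """
    xs = sorted({x for x, _ in points})
    ys = sorted({y for _, y in points})
    nodes = set(product(xs, ys))
    edges = {}
    # Horizontal segments between consecutive x-coordinates on each row.
    for y in ys:
        for i in range(len(xs) - 1):
            edges[((xs[i], y), (xs[i + 1], y))] = xs[i + 1] - xs[i]
    # Vertical segments between consecutive y-coordinates on each column.
    for x in xs:
        for j in range(len(ys) - 1):
            edges[((x, ys[j]), (x, ys[j + 1]))] = ys[j + 1] - ys[j]
    return nodes, edges

# Hypothetical input set S.
S = [(0, 0), (2, 1), (1, 3)]
nodes, edges = hanan_grid(S)
steiner_candidates = nodes - set(S)  # grid nodes that are not input points
```

The full grid is itself a Manhattan network for S, but usually far from the shortest one; the approximation algorithms select a subset of these segments.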

A(v) = {u : v is the topmost node below and to the left of u}

[Gudmundsson-Levcopoulos-Narasimhan 2001] proof outline: 2. Construct local slides (for all four orientations)

[Gudmundsson-Levcopoulos-Narasimhan 2001] proof outline: 3. Solve each slide

The minimum slide arborescence problem:

Lingas-Pinter-Rivest-Shamir 1982

O(n^3) optimal solution using dynamic programming

[Gudmundsson-Levcopoulos-Narasimhan 2001] proof outline: 4. Proof of correctness

What is an alignment?

ATCG--GACATTACC-ACAC-GTCA-GATTA-CAAC

M = (mis)match

X = insert seq1

Y = insert seq2

Pair HMMs

Simple sequence-alignment PHMM

M X YM M Y M

Hidden sequence:

Observed sequence:

ATCGGACGTCA

Hidden alignment:

ATCG--GAC-GTCA

Pair HMMs

transition probabilities

output probabilities

Using the Pair HMM

In practice, we have an observed sequence

ATCGGACGTCA

for which we wish to infer the underlying hidden states

One solution: among all possible sequences of hidden states, determine the most likely (Viterbi algorithm).

ATCG--GAC-GTCA

MMXMYYM

Viterbi in PHMM ≡ Needleman Wunsch

Match prob: pm

Mismatch prob: pr

Gap prob: pg

Match score: log(pm)

Mismatch score: log(pr)

Gap score: log(pg)
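The equivalence between Viterbi decoding and Needleman-Wunsch can be illustrated with a global alignment whose scores are the logarithms of the probabilities above. This is a minimal sketch under stated assumptions: the numeric values of pm, pr and pg are hypothetical, gaps are scored linearly, and transition probabilities are ignored.

```python
import math

def needleman_wunsch(a, b, pm=0.6, pr=0.1, pg=0.15):
    """Global alignment maximizing the sum of log-probability scores.

    Returns (row_a, row_b, states, score) with states over M (match or
    mismatch), X (insert in seq1) and Y (insert in seq2).
    pm, pr, pg are illustrative values, not taken from the talk.
    """
    match, mismatch, gap = math.log(pm), math.log(pr), math.log(pg)
    n, m = len(a), len(b)
    score = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = score[i - 1][0] + gap
    for j in range(1, m + 1):
        score[0][j] = score[0][j - 1] + gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + sub,
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)
    # Trace back from the bottom-right corner to recover the hidden states.
    row_a, row_b, states = [], [], []
    i, j = n, m
    while i > 0 or j > 0:
        sub = match if i > 0 and j > 0 and a[i - 1] == b[j - 1] else mismatch
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub:
            row_a.append(a[i - 1]); row_b.append(b[j - 1]); states.append("M")
            i, j = i - 1, j - 1
        elif i > 0 and (j == 0 or score[i][j] == score[i - 1][j] + gap):
            row_a.append(a[i - 1]); row_b.append("-"); states.append("X")
            i -= 1
        else:
            row_a.append("-"); row_b.append(b[j - 1]); states.append("Y")
            j -= 1
    return ("".join(reversed(row_a)), "".join(reversed(row_b)),
            "".join(reversed(states)), score[n][m])

row_a, row_b, states, total = needleman_wunsch("ACGT", "AGT")
```

Because the scores are log probabilities, the maximized sum is the log probability of the most likely state path under these simplified parameters.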

Want to take into account that the sequences are genomic sequences:

Example: a pair of syntenic genomic regions

• A property of “single sequence” states is that all paths in the Viterbi graph between two vertices have the same weight

Strategy for Alignment

GATTACATTGATCAGACAGGTGAAGA

GATCTTCATGTAG

The CD4 region

5’ 3’

Exon 1 Exon 2 Exon 3 Exon 4

Intron 1 Intron 2 Intron 3

Branchpoint

Splice site: CAG

Splice site: GGTGAG

Translation initiation: ATG

Stop codon: TAG/TGA/TAA

Suggests a new Steiner problem

Find the shortest 1-spanner connecting reds to blues

Generalizes the Manhattan network problem (all points red and blue)

Generalizes the Rectilinear Steiner Arborescence problem

History of the Rectilinear Steiner Arborescence Problem

1985, Trubin - polynomial time algorithm

1992, Rao-Sadayappan-Hwang-Shor - error in Trubin

2000, Shi and Su - NP complete!

Results for unlabeled problem

• An O(n^3) 2-approximation algorithm (implemented)

• An O(n log n) 4-approximation algorithm

• Testing on CD4 region in human/mouse

• Implementation (SLIM) http://bio.math.berkeley.edu/slim/

• SLIM for SLAM (in progress) http://bio.math.berkeley.edu/slam/

[Figure: The Viterbi graph for a more complicated alignment]

Comparison and Analysis of Performance

Our method has two main steps (L = length of seqs, n = #HSP):

1. Building the network: O(n^3) or O(n log n)

2. Running the Viterbi algorithm for the HMM on the network: O(nL) worst case

• Banding algorithms are O(L^2) worst case for step 2.

• Chaining algorithms are O(n^2) in the case where gap penalties can depend on the sequences.

• These strategies do not generalize well for more sophisticated HMMs.

Summary

Thanks: Nick Bray and Simon Cawley

Software:

SLIM (network build): http://bio.math.berkeley.edu/slim/

SLAM (alignment): http://bio.math.berkeley.edu/slam/
