Week 5 - University of Waterloobrowndg/482S14/Notes/Week5.pdf · CS 482/682, Fall 2014 Week 5 1...


Week 5

Topics for this week:
•  How to find good anchors: suffix trees
•  Their description, and an easy quadratic-time algorithm

The big idea for this week of the course:

Suffix trees can do many string operations you might think are very hard, in linear time.


One possible alignment anchor

Given 2 sequences, S and T:

Maximal unique matches (mums): short sequences found in both sequences, with some requirements:
•  Found exactly once in each sequence
•  If you move to the left or right of each mum, the next characters in S and T differ.

No restriction on the length of mums.

How to find them?


One more inspiration

(And this one is simpler)

Suppose we’re given a sequence S, and allowed to preprocess it.

Users are allowed to ask the question: “Is sequence T found in S?” for strings T of arbitrary length.

We want to answer the question in O(|T|) time, not O(|S|) or something complicated depending on how many times T appears in S.

This is a pattern search problem.


Solving this problem

Classical string matching:
•  Preprocess the pattern, O(|T|) time.
•  Examine the text, O(|S|) time.
•  Total is O(|S|+|T|).
•  If we have a new text, S’, examining it will take O(|S’|) time.

But we’re dealing with the reverse situation, with multiple patterns and only one text.

Classical string matching doesn’t work here. One solution is suffix trees.


What are suffix trees?

They store a very large amount of information about strings while still keeping within linear space requirements.

Suffix trees allow the storage of all of the substrings of a given string in linear space.

Users can search for new strings in the original string, in time proportional to the length of the new string.

All of this would be useless, except that they can also be built in linear time.

We won’t see that algorithm; see Gusfield for details.


Formally

The suffix tree for a string S of size m is a rooted graph theoretic tree.

Edges of the tree are labelled with substrings of S. Paths from the root are labelled with the concatenation of their edge labels.

The label of the path from the root, r, to any leaf x is a suffix of the string S. Each of the m suffixes of the string is in the tree, so it has m leaves.

Each internal node has at least 2 children.

The substrings labelling each edge out of a node must all begin with different letters.


As a picture

Here is the suffix tree for GAAGAT.

[Figure: suffix tree for GAAGAT]

Some notation: An edge is labelled with a substring of the original string. A node’s label is the concatenation of all edge labels for the path leading to that node.


Some basic facts

1)  If we’re really going to include all of the suffixes, then the last letter must be unique (as in the example). Else, suffix may end at an internal node!

2)  We’re going to assume throughout that the alphabet size is constant. This is quite important.

3)  The suffix tree is unique except for ordering of edges.

4)  We’ll often talk (especially as we build them) about paths that end partway in an edge. For example, the path labeled AAG ends partway through an edge.


Who cares?

Every path from the root of the suffix tree is labelled with a prefix of some root-to-leaf path label.

That is, they’re labelled with prefixes of suffixes of the string S.

Or, in other words, they’re labelled with substrings of S.

Example of using this: Is the string x a substring of S?

Start at the root, and follow paths labelled by the characters of x. If you can get to the end of x, then yes, it is.
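The walk described above can be sketched in a few lines. This is only an illustration under simplifying assumptions: it uses an uncompressed suffix trie (one character per edge, quadratic space) rather than a true linear-space suffix tree, and the names build_suffix_trie and is_substring are made up for this sketch. The query itself still takes time proportional to the length of x.

```python
def build_suffix_trie(s):
    """Insert every suffix of s into a nested-dict trie (one character per edge)."""
    root = {}
    for i in range(len(s)):
        node = root
        for ch in s[i:]:
            node = node.setdefault(ch, {})
    return root


def is_substring(trie, x):
    """Walk from the root along the characters of x; succeed if we never get stuck."""
    node = trie
    for ch in x:
        if ch not in node:
            return False
        node = node[ch]
    return True


trie = build_suffix_trie("GAAGAT$")
print(is_substring(trie, "AAG"))  # True
print(is_substring(trie, "AGT"))  # False
```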


Better than hash tables?

Well, that’s a fair question. Hash tables are certainly easier to understand.

And, one can produce a hash table of all length k strings in O(m) time and look up a k-length string x in O(k) time, finding all p places where string x is found in O(p) time.

This is exactly the bound for suffix trees, as well.

But what if you don’t know how long the string x is going to be?

Um…well, that’s a problem.

And most other string matching tricks don’t work for it either.
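For comparison, here is a minimal sketch of the hash-table approach for a fixed k. The function name build_kmer_index is an assumption of this sketch, and Python string slicing makes the build cost O(mk) rather than O(m), which is the same thing when k is treated as a constant. The limitation discussed above is visible in the code: the index only answers queries of length exactly k.

```python
from collections import defaultdict


def build_kmer_index(s, k):
    """Map every length-k substring of s to the list of its start positions (0-based)."""
    index = defaultdict(list)
    for i in range(len(s) - k + 1):
        index[s[i:i + k]].append(i)
    return index


index = build_kmer_index("AAGATATGATAGGAT", 3)
print(index.get("GAT", []))  # [2, 7, 12]
print(index.get("GATA", []))  # [] because the table only holds 3-mers
```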


How are we going to do this?

The linear-time algorithm for building suffix trees is not, really, all that bad. But it is time-consuming.

We’ll start with a quadratic-time algorithm and talk about the uses of suffix trees.

Then we’ll finish with some applications. These aren’t easy the first time through.

Chapters 5-8 of Gusfield’s book are excellent. It’s strongly recommended that you look at them.

Much of what can be done with suffix trees can also be done with suffix arrays.


A lot of notation

We’re going to say that the string S = s1s2…sm$, where the $ character exists only at the end.

Throughout, we’ll use letters like a,b,c to represent letters from the alphabet.

We’ll represent nodes with letters in italic like u, v, and w.

We’ll represent strings of letters either by their actual sequence, or by Greek letters like α, β, and γ. To represent a substring of S, we may use S[i..j] instead of si…sj.


First, a simple algorithm

Given: A string S of length m over a finite alphabet. The last character of S is a unique $ character.

We’ll build the suffix tree from right to left.

Begin with this tree:

[Figure: a root with a single edge labelled $]

Then, for i = m downto 1:
•  Follow the letters of S[i…m] along the edges of the tree T.
•  When we reach a point where no path exists, break the current edge (if we stopped partway along it) and add a new edge for what is left.
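The right-to-left insertion described above can be sketched as follows. This is one possible rendering, not the course's reference code: the Node class, the function names, and the choice to store each edge as a pair of indices into the string (rather than a copied substring) are assumptions of this sketch. Positions are 0-based, so the loop runs from the last position down to 0.

```python
class Node:
    def __init__(self):
        # first character of the edge label -> (start, end, child);
        # the edge label is s[start:end]
        self.edges = {}


def build_suffix_tree(s):
    """Naive quadratic construction: insert suffixes from shortest to longest."""
    assert s.endswith("$") and s.count("$") == 1
    root = Node()
    n = len(s)
    for i in range(n - 1, -1, -1):
        node, pos = root, i  # pos: next character of the suffix still to match
        while True:
            ch = s[pos]
            if ch not in node.edges:
                # No path continues with ch: hang the rest of the suffix here.
                node.edges[ch] = (pos, n, Node())
                break
            start, end, child = node.edges[ch]
            k = 0  # number of characters matched along this edge
            while start + k < end and s[start + k] == s[pos + k]:
                k += 1
            if start + k == end:
                # The whole edge matched: keep walking from the child node.
                node, pos = child, pos + k
                continue
            # Mismatch partway along the edge: split it and add a new leaf edge.
            mid = Node()
            mid.edges[s[start + k]] = (start + k, end, child)
            mid.edges[s[pos + k]] = (pos + k, n, Node())
            node.edges[ch] = (start, start + k, mid)
            break
    return root


def leaf_labels(s, node, prefix=""):
    """Path labels of all leaves below node; each should be a suffix of s."""
    if not node.edges:
        return [prefix]
    labels = []
    for start, end, child in node.edges.values():
        labels.extend(leaf_labels(s, child, prefix + s[start:end]))
    return labels


s = "GAAGAT$"
tree = build_suffix_tree(s)
print(sorted(leaf_labels(s, tree)))
```

Because the final $ is unique, no suffix can run out of characters while it is still matching an existing path, which is why the inner loop never reads past the end of the string.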


One round

Add AGAAGAT to this tree.

First, follow the edges for A and for GA from the root.

Then split after the A since the only path in the tree is for T, and we have an A, instead.

Add a new edge for AGAT.

[Figure: the current suffix tree, for GAAGAT]


New tree

This yields this new tree.

Note: There are 3 places where “AGAT” is sitting off by itself. Maybe this repetition is something to think about.

[Figure: the suffix tree after adding AGAAGAT, with a new edge labelled AGAT]


Obvious runtime

Obviously, this algorithm has runtime O(m^2), since it’s doing O(m) work in each phase.

But quadratic work on, say, a genome would be unacceptable.

First observation (not immediately obvious): This need not take quadratic space.

Why? Because each edge doesn’t need to be labelled with a string, but just with coordinates in the sequence.


Example:

Here’s the same suffix tree, in linear space.

(Somewhere, you have to store the string itself…)

[Figure: the same suffix tree, with each edge labelled by a coordinate range in the string, such as 1-2, 4-5, 3-6, 2, or 6]


One application of suffix trees

What’s the longest sequence common to both S1 and S2?

Build a suffix tree for S=S1#S2$. For suffixes that start inside S1, stop them at the # sign.

This is called a generalized suffix tree. (Generalized because it contains suffixes of S1 and of S2.)

Color all nodes v of the tree:
•  red if v’s label is a substring of S1
•  blue if it’s a substring of S2
•  purple if it’s a substring of both

We want the lowest purple node, that is, the purple node with the longest path label.
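The coloring idea can be tried out with a small sketch. To stay short, it uses an uncompressed generalized suffix trie (quadratic space, so suitable only for toy inputs) instead of a linear-size suffix tree; the function name and the dictionary layout are assumptions of this sketch. Stopping the suffixes of S1 at the # sign is modelled by inserting the suffixes of each string separately.

```python
def longest_common_substring(s1, s2):
    """Colour the nodes of a generalized suffix trie and return the deepest purple label."""
    assert not set("#$") & set(s1 + s2)
    root = {"kids": {}, "colours": set()}
    # Suffixes of s1 stop where '#' would be; suffixes of s2 stop where '$' would be.
    for text, colour in ((s1, "red"), (s2, "blue")):
        for i in range(len(text)):
            node = root
            for ch in text[i:]:
                node = node["kids"].setdefault(ch, {"kids": {}, "colours": set()})
                node["colours"].add(colour)
    best = ""
    stack = [(root, "")]
    while stack:
        node, label = stack.pop()
        if len(label) > len(best):
            best = label
        for ch, kid in node["kids"].items():
            # Purple nodes are substrings of both strings; every prefix of a
            # purple label is purple too, so only purple children are explored.
            if kid["colours"] == {"red", "blue"}:
                stack.append((kid, label + ch))
    return best


print(longest_common_substring("GATTACA", "TTACGA"))  # TTAC
```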


Maximal repeats

Another application of suffix trees

Given: String S.

Find all substrings of S with at least two copies in S such that, if both copies are extended by one character on either side, the extensions differ.

Example: AAGATATGATAGGAT$

Maximal repeats include:
•  GATA (as in positions 3 and 8)
•  GAT (as in positions 3 and 13)

And so on.
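To pin down the definition, here is a brute-force sketch that checks it directly on every pair of equal substrings. It is far slower than anything one would run on real data and is meant only as a reference; the function name is made up, and treating the first position as having no left character (so it differs from every letter) is an assumption of this sketch.

```python
def maximal_repeats_bruteforce(s):
    """All substrings with two occurrences that differ on both the left and the right."""
    assert s.endswith("$") and s.count("$") == 1
    n = len(s)
    found = set()
    for length in range(1, n):
        for i in range(n - length):
            for j in range(i + 1, n - length):
                if s[i:i + length] != s[j:j + length]:
                    continue
                # The very first position has no left character; None differs from any letter.
                left_i = s[i - 1] if i > 0 else None
                if left_i != s[j - 1] and s[i + length] != s[j + length]:
                    found.add(s[i:i + length])
    return found


repeats = maximal_repeats_bruteforce("AAGATATGATAGGAT$")
print("GATA" in repeats, "GAT" in repeats)  # True True
```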


Suffix trees are magic…

These can be identified in O(m) time. We’ll prove this slowly.

1)  If α is a maximal repeat, then there is an internal node with that label.

Proof: Since α is a maximal repeat, two of its occurrences are followed by different letters. Hence, it’s a bifurcation point in the tree.

2)  There are at most m maximal repeats.

Proof: The tree has m+1 leaves, and hence at most m internal nodes.

OK, so which internal nodes correspond to maximal repeats?


Finding maximal repeats

Consider a node x with path label α.

Now, consider all suffixes of S that begin with α (all paths from the root that go through x).

Let the “left character” of a suffix be the character before that suffix in S.

(For the string “banana$”, the left character of anana$ is “b”)

Theorem: α is a maximal repeat iff α is a node in the tree and there are at least two suffixes that begin with α and have different left characters.


Proof of theorem

(⇒) Assume α is a maximal repeat. Then, as noted before, it is a node in the tree. Also, it occurs twice with different preceding characters.

(⇐) Assume two different suffixes that begin with α have left characters d and e, so S contains dα and eα. Since α is a node, more than one letter follows α, say b and c. If dα and eα are followed by different letters, then clearly α is maximal. If they’re both followed by the same letter (b), there must be a third suffix that begins with α and continues with c. Its left character can’t equal both d and e, so pairing it with whichever of dα or eα has a different left character gives two copies of α that differ on both sides.


Finding maximal repeats

“Left-diverse” nodes (with descendants having different left characters) are easily found:
•  Do a DFS on the tree.
•  For each leaf, identify its left character.
•  For each internal node:
   –  If any of its children is left-diverse, so is it.
   –  Else, if all children have the same left character, propagate it.
   –  Else, it’s left-diverse.

Easily performed in O(m) time.

And left-diverse nodes are exactly what we want!
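The bottom-up propagation can be sketched as below. As a simplification, the sketch runs on an uncompressed suffix trie, so it is quadratic rather than O(m); a trie node with two or more children plays the role of an internal suffix tree node. The DIVERSE marker, the "left" key stored at each leaf, and the use of None for the missing left character of the whole string are all assumptions of this sketch. The recursion depth grows with the string length, so it is meant for short examples only.

```python
DIVERSE = object()  # marker: the leaves below have at least two different left characters


def maximal_repeats(s):
    """Labels of branching, left-diverse nodes in the suffix trie of s."""
    assert s.endswith("$") and s.count("$") == 1
    root = {}
    for i in range(len(s)):
        node = root
        for ch in s[i:]:
            node = node.setdefault(ch, {})
        # The leaf for suffix i records that suffix's left character.
        node["left"] = s[i - 1] if i > 0 else None
    repeats = []

    def visit(node, label):
        """Return the single left character shared below node, or DIVERSE."""
        if "left" in node:  # a leaf: it was reached by reading the final $
            return node["left"]
        results = [visit(kid, label + ch) for ch, kid in node.items()]
        if DIVERSE in results or len(set(results)) > 1:
            # Only branching nodes with a non-empty label are suffix tree internal nodes.
            if len(node) > 1 and label:
                repeats.append(label)
            return DIVERSE
        return results[0]  # all children agree: propagate the character upward

    visit(root, "")
    return repeats


print(sorted(maximal_repeats("banana$")))  # ['a', 'ana']
```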


Recall from before

Large-scale global alignment

Idea:
•  Pick some “anchors” through which the true alignment is very likely to fall.
•  Align the regions between the anchors either recursively or just using classical global alignment tools.

Can we use suffix trees to choose good global anchors?

First program that does so: MUMmer by Delcher et al. (1999)


Seeds come from suffix trees

Where MUMmer differs from other programs is in the seeds it uses, which come from a use of suffix trees.

A seed match is a maximal unique match, or MUM.

That is, it’s a string that occurs exactly one time in each genome, and which can’t be extended either way and still be a match.

What is that in a suffix tree?


How to find mums?

Just as for the longest common substring, build a suffix tree for S#T$.

A mum is a maximal match between the two sequences…
•  Can’t extend on the right: it is a node in the tree
•  It occurs once in S and once in T: the node has exactly 2 leaf descendants, one from S and one from T
•  Can’t extend on the left: the two descendants have different left characters

All mums between S and T can be found in … linear time.
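The three conditions can be checked directly on a small example. The sketch below is not how MUMmer itself is implemented: it uses an uncompressed suffix trie of S#T$ (quadratic, toy inputs only), and the function name, the "leaf" key, and the 0-based positions in the output are assumptions of this sketch.

```python
def find_mums(s, t):
    """Return (mum, position in s, position in t) triples, with 0-based positions."""
    assert not set("#$") & set(s + t)
    text = s + "#" + t + "$"
    root = {}
    for i in range(len(text)):
        node = root
        for ch in text[i:]:
            node = node.setdefault(ch, {})
        node["leaf"] = i  # the leaf remembers where its suffix starts
    mums = []

    def visit(node, label):
        """Return the start positions of all suffixes below node."""
        if "leaf" in node:
            return [node["leaf"]]
        starts = []
        for ch, kid in node.items():
            starts.extend(visit(kid, label + ch))
        # A branching node cannot be extended to the right; exactly two leaves
        # below it means the label occurs exactly twice in S#T$.
        if len(node) > 1 and label and len(starts) == 2:
            i, j = sorted(starts)
            one_in_each = i < len(s) < j
            left_i = text[i - 1] if i > 0 else None
            # Different left characters: the match cannot be extended to the left.
            if one_in_each and left_i != text[j - 1]:
                mums.append((label, i, j - len(s) - 1))
        return starts

    visit(root, "")
    return mums


print(sorted(find_mums("AACGT", "TACGG")))  # [('ACG', 1, 1), ('T', 4, 0)]
```

In this example CG also occurs once in each string, but both copies are preceded by A, so it is reported only as part of the longer match ACG.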


Other sequence

Advantages of this approach:
•  Matches aren’t based on k-mers for fixed k, so can be quite long and of high confidence.
•  Anchoring approach reduces number of subalignments by a lot.

Disadvantages:
•  There’s no reason to assume that the alignment is optimal in any sense.
•  Bad anchor choosing might result in a hopelessly awful alignment. This does happen sometimes.

But quite successful in practice, and many extensions exist.


Post-mortem on suffix trees

That’s all we’re going to do with suffix trees.

Most commonly used in 3 areas of bioinformatics:
1)  Repeat finding
2)  Very-large-scale sequence alignment
3)  Regulatory element detection

Proposed for sequence assembly, and suffix arrays are currently (2014) used in this area.

Next topic: sequence annotation.


1-slide review of pairwise alignment

Global alignment: runtime is O(nm) with linear or affine gap penalties

Local alignment: Algorithm similar, but only find highest-scoring regions

Heuristic local alignment: Require that an alignment have a high-scoring core, of some sort, in order to possibly be found

Heuristic global alignment: Chain together high-scoring cores, using traditional alignment between the cores

All are based on underlying probabilistic models of genomic or protein sequence