Advanced Algorithm Design and Analysis (Lecture 5)
SW5, fall 2005. Simonas Šaltenis, E1-215b, simas@cs.aau.dk
AALG, lecture 5, © Simonas Šaltenis, 2005 2
Text-Search Data Structures
Goals of the lecture:
• Finish discussing text-searching algorithms: Boyer-Moore-Horspool
• Dictionary ADT for strings: understand the principles of tries, compact tries, and Patricia tries
• Text-searching data structures: understand and be able to analyze text-searching algorithms using the suffix tree and the Pat tree
Reverse naïve algorithm
Why not compare starting from the end of P? (The idea of Boyer and Moore.)
Reverse-Naive-Search(T, P)
  for s ← 0 to n – m
    j ← m – 1                        // start from the end
    // check if T[s..s+m–1] = P[0..m–1]
    while T[s+j] = P[j] do
      j ← j – 1
      if j < 0 then return s
  return –1
Its running time is exactly the same as that of the naïve algorithm…
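As a sketch, the pseudocode above translates to Python roughly as follows (the function name and 0-based indexing are my choices):

```python
def reverse_naive_search(T, P):
    """Return the first shift s where P occurs in T, or -1.

    Characters are compared right to left within each window,
    as in Reverse-Naive-Search above."""
    n, m = len(T), len(P)
    for s in range(n - m + 1):
        j = m - 1                          # start from the end of P
        while j >= 0 and T[s + j] == P[j]:
            j -= 1
        if j < 0:                          # all m characters matched
            return s
    return -1
```

For example, `reverse_naive_search("detective date", "date")` returns 10, the 0-based position of the only occurrence.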
Occurrence heuristic
Boyer and Moore added two heuristics to the reverse naïve algorithm to obtain an O(n+m) algorithm, but it is complex.
Horspool suggested using just a modified occurrence heuristic: after a mismatch, align T[s+m–1] with the rightmost occurrence of that letter in the pattern P[0..m–2].
Examples: • T= “detective date” and P= “date”
• T= “tea kettle” and P= “kettle”
Shift table
In preprocessing, compute the shift table of the size |Σ|.
Example: P = “kettle”: shift[e] = 4, shift[l] = 1, shift[t] = 2, shift[k] = 5, shift[any other letter] = 6
Example: P = “pappar” What is the shift table?
shift[w] = m – 1 – max{ i < m–1 | P[i] = w }   if w occurs in P[0..m–2],
shift[w] = m                                   otherwise.
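The shift table can be computed in one pass over P[0..m–2]; here is a small Python sketch of the formula above (returning a dictionary instead of a |Σ|-sized array, which is my simplification):

```python
def horspool_shift_table(P):
    """Shift table per the formula above: for each character w in
    P[0..m-2], shift[w] = m - 1 - (rightmost index of w in P[0..m-2]);
    any other character shifts by m."""
    m = len(P)
    shift = {}
    for k in range(m - 1):        # k = 0 .. m-2; later k overwrites earlier
        shift[P[k]] = m - 1 - k
    return shift                  # use shift.get(c, m) for absent letters
```

For P = “kettle” this yields exactly the values on the slide, e.g. `horspool_shift_table("kettle")["e"] == 4`.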
Boyer-Moore-Horspool Alg.
BMH-Search(T, P)
  // compute the shift table for P
  for c ← 0 to |Σ| – 1
    shift[c] ← m                     // default values
  for k ← 0 to m – 2
    shift[P[k]] ← m – 1 – k
  // search
  s ← 0
  while s ≤ n – m do
    j ← m – 1                        // start from the end
    // check if T[s..s+m–1] = P[0..m–1]
    while T[s+j] = P[j] do
      j ← j – 1
      if j < 0 then return s
    s ← s + shift[T[s+m–1]]          // shift by the last letter
  return –1
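Putting the two pieces together, this is a sketch of BMH-Search in Python (0-based indexing, and a dictionary shift table instead of a |Σ|-sized array, both my choices):

```python
def bmh_search(T, P):
    """Boyer-Moore-Horspool: first shift where P occurs in T, or -1."""
    n, m = len(T), len(P)
    # preprocessing: shift table over the characters of P[0..m-2]
    shift = {}
    for k in range(m - 1):
        shift[P[k]] = m - 1 - k
    # search
    s = 0
    while s <= n - m:
        j = m - 1                          # start from the end
        while j >= 0 and T[s + j] == P[j]:
            j -= 1
        if j < 0:
            return s
        s += shift.get(T[s + m - 1], m)    # shift by the window's last letter
    return -1
```

On the earlier example, `bmh_search("tea kettle", "kettle")` returns 4.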
BMH Analysis
Worst-case running time:
• Preprocessing: O(|Σ|+m)
• Searching: O(nm). What input gives this bound?
• Total: O(nm)
Space: O(|Σ|), independent of m
On real-world data sets it is very fast.
Comparison
Let’s compare the algorithms. Criteria:
• Worst-case running time (preprocessing and searching)
• Expected running time
• Space used
• Implementation complexity
Dictionary ADT for Strings
Dictionary ADT for strings – stores a set of text strings:
• search(x): checks if string x is in the set
• insert(x): inserts a new string x into the set
• delete(x): deletes the string equal to x from the set
Assumptions, notation:
• n strings, N characters in total
• m: length of x
• d = |Σ|: size of the alphabet
BST of Strings
We can, of course, use binary search trees. Some issues:
• Keys are of varying length
• Many strings share similar prefixes (beginnings), so there is potential for saving space
Let’s count comparisons of characters:
• What is the worst-case running time of searching for a string of length m?
Tries
Trie – a data structure for storing a set of strings (the name comes from the word “retrieval”). Let’s assume all strings end with “$”, a character not in Σ.
[Figure: a trie storing the strings below; each edge is labeled with one character and every string ends with $]
Set of strings: {bear, bid, bulk, bull, sun, sunday}
Tries II
Properties of a trie:
• A multi-way tree
• Each node has from 1 to d children
• Each edge of the tree is labeled with a character
• Each leaf node corresponds to a stored string, which is the concatenation of the characters on the path from the root to that leaf
Search and Insertion in Tries
The search algorithm just follows the path down the tree (starting with Trie-Search(root, P[0..m]))
Trie-Search(t, P[k..m])  // searches for P in t
  if t is a leaf then return true
  else if t.child(P[k]) = nil then return false
  else return Trie-Search(t.child(P[k]), P[k+1..m])
How would deletion work?
Trie-Insert(t, P[k..m])  // inserts string P into t
  if t is not a leaf then          // otherwise P is already present
    if t.child(P[k]) = nil then
      create a new child of t and a “branch” starting with that child, storing P[k..m]
    else Trie-Insert(t.child(P[k]), P[k+1..m])
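The two procedures above can be sketched in Python using a hash table of child pointers (one of the node-structure options discussed below; class and function names are mine):

```python
class TrieNode:
    def __init__(self):
        self.children = {}           # character -> TrieNode

def trie_insert(root, word):
    """Insert word, terminated with '$' (assumed not in the alphabet)."""
    node = root
    for c in word + "$":
        node = node.children.setdefault(c, TrieNode())

def trie_search(root, word):
    """True iff word was inserted: follow one edge per character."""
    node = root
    for c in word + "$":
        if c not in node.children:
            return False
        node = node.children[c]
    return True
```

The "$" terminator is what lets “sun” be found even though it is a prefix of the stored string “sunday”.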
Trie Node Structure
“Implementation detail”: what is the node structure? That is, what is the complexity of the t.child(c) operation?
• An array of child pointers of size d: waste of space, but child(c) is O(1)
• A hash table of child pointers: less waste of space, child(c) is expected O(1)
• A list of child pointers: compact, but child(c) is O(d) in the worst case
• A binary search tree of child pointers: compact, and child(c) is O(lg d) in the worst case
Analysis of the Trie
Size: O(N) in the worst case
Search, insertion, and deletion (string of length m), depending on the node structure: O(dm), O(m lg d), or O(m)
Compare with the string BST
Observation: having chains of one-child nodes is wasteful
Compact Tries
Compact trie: replace each chain of one-child nodes with a single edge labeled with a string. Each non-leaf node (except the root) then has at least two children.
[Figure: the trie from the previous slide and its compact version; the one-child chains collapse into string-labeled edges: root edges b and sun; under b: ear$, id$, ul; under ul: k$ and l$; under sun: day$ and $]
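One way to see the compression is to build an ordinary nested-dictionary trie and then collapse its one-child chains. A Python sketch (names are mine; edge labels become plain strings here rather than the (from, to) index pairs used on the next slide):

```python
def build_trie(words):
    """Uncompressed trie as nested dicts; '$' terminates each word."""
    root = {}
    for w in words:
        node = root
        for c in w + "$":
            node = node.setdefault(c, {})
    return root

def compress(node):
    """Collapse chains of one-child nodes into string-labeled edges.
    Returns {label: subtree}; an empty dict is a leaf."""
    out = {}
    for c, child in node.items():
        label = c
        while len(child) == 1:           # one-child chain: extend the label
            (c2, child), = child.items()
            label += c2
        out[label] = compress(child)
    return out
```

For the string set of the previous slide this reproduces the compact trie in the figure: root edges “b” and “sun”, then “ear$”, “id$”, “ul”, and so on.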
Compact Tries II
Implementation: the strings are stored externally in one array; edges are labeled with indices (from, to) into that array.
A compact trie can be used to do word matching: find where a given word appears in the text.
• Use the compact trie to “store” all the words in the text
• Each leaf in the compact trie has a list of indices in the text where the corresponding word appears
Word Matching with Tries
To find a word P: at each node (after matching P[0..k–1]), follow the edge (i, j) such that P[k..k+j–i] = T[i..j]. If there is no such edge, P does not occur in T; otherwise, report all starting indices of P when a leaf is reached.
T: they think that we were there and there
[Figure: compact trie of the words of T; edges labeled with (from, to) index pairs into T, leaves storing occurrence lists, e.g. the leaf for “there” holds the list 25, 35]
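A simplified Python sketch of word matching: for brevity it uses an uncompressed trie instead of the compact trie in the figure, with 1-based occurrence lists stored at the “$” leaves (names are mine):

```python
def build_word_trie(text):
    """Trie of the words of text; each '$' leaf carries the list of
    1-based starting positions of that word in the text."""
    root = {}
    pos = 1
    for word in text.split(" "):
        node = root
        for c in word:
            node = node.setdefault(c, {})
        node.setdefault("$", []).append(pos)
        pos += len(word) + 1          # +1 for the separating space
    return root

def word_match(root, word):
    """All starting positions of word, or [] if it does not occur."""
    node = root
    for c in word:
        if c not in node:
            return []
        node = node[c]
    return node.get("$", [])
```

For the text above, searching for “there” returns the list 25, 35, matching the figure.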
Word Matching with Tries II
Building a compact trie for a given text: how do you do that? Describe the compact-trie insertion procedure. Running time: O(N).
Complexity of word matching: O(m).
What if the text is in external memory? In the worst case we do O(m) I/O operations just to access single characters of the text, which is not efficient.
Patricia trie
Patricia trie: a compact trie where each edge’s label (from, to) is replaced by (T[from], to – from + 1)
T: they think that we were there and there
[Figure: the Patricia trie for T; each edge now stores only its first character and its length, e.g. (t,2), (a,3), (w,2), (r,3), and the leaves keep the same occurrence lists]
Querying Patricia Trie
Word prefix query: find all words in T that start with P[0..m–1]
Patricia-Search(t, P, k)  // searches for P in t
01 if t is a leaf then
02   j ← the first index in t.list
03   if T[j..j+m–1] = P[0..m–1] then
04     return t.list                // exact match
05 else if there is a child edge (P[k], s) then
06   if k + s < m then
07     return Patricia-Search(t.child(P[k]), P, k+s)
08   else go to any descendant leaf of t and do the check of line 03; if it succeeds, return the lists of all descendant leaves of t, otherwise return nil
09 else return nil                  // nothing is found
Analysis of the Patricia Trie
Idea of the Patricia trie: postpone the actual comparison with the text until the end. If the text is in external memory, only O(1) I/Os are performed (provided the trie fits in main memory).
Build a Patricia Trie for word matching:
Usually binary Patricia tries are used:
• Consider the binary encoding of the text (and the queries)
• Each node in the tree has two children (left for 0, right for 1)
• Edges are labeled just with skip values (in bits)
T: føtex har haft en fødselsdag
Text-Search Problem
Input: text T = “carrara”, pattern P = “ar”
Output: all occurrences of P in T
Reformulate the problem: find all suffixes of T that have P as a prefix! We already saw how to do a word prefix query.
Suffixes of T: carrara, arrara, rrara, rara, ara, ra, a
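This reformulation is easy to check directly in Python (a brute-force sketch, not the suffix-tree algorithm):

```python
T = "carrara"
suffixes = [T[i:] for i in range(len(T))]
# suffixes == ["carrara", "arrara", "rrara", "rara", "ara", "ra", "a"]

# every occurrence of P is exactly a suffix that has P as a prefix
P = "ar"
occurrences = [i + 1 for i, s in enumerate(suffixes) if s.startswith(P)]
# occurrences == [2, 5]  (1-based, as on the slides)
```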
Suffix Trees
Suffix tree – a compact trie (or similar structure) of all suffixes of the text. A Patricia trie of suffixes is sometimes called a Pat tree.
T: carrara$ (positions 1..8)
[Figure: the suffix tree of T as a compact trie with string edge labels (e.g. carrara$, rara$, a$), and the corresponding Pat tree with (first character, length) edge labels such as (a,1), (r,1), (c,8), (r,5); the leaves store the starting positions 1..7 of the suffixes]
Pat Trees: Analysis
Text search for P is then a prefix query. Running time: O(m + z), where z is the number of answers.
Just O(1) I/Os if the text is in external memory (independent of z)!
The size of the Pat tree: O(N). Why?
Advantage of compression: the size of the simple trie of suffixes would be, in the worst case, N + (N–1) + (N–2) + … + 1 = O(N²).
Constructing Suffix Trees
The naïve algorithm: insert all suffixes one after another: O(N²)
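The naïve construction can be sketched with an uncompressed trie of suffixes, which also illustrates the O(N²) size bound (a compact trie or Pat tree would collapse the chains). The names and the “#” leaf marker are my choices:

```python
def build_suffix_trie(T):
    """Naive O(N^2) construction: insert every suffix of T + '$' into
    an uncompressed nested-dict trie; each leaf stores the suffix's
    1-based starting position under the marker key '#'."""
    root = {}
    T = T + "$"
    for i in range(len(T)):
        node = root
        for c in T[i:]:
            node = node.setdefault(c, {})
        node["#"] = i + 1            # unique leaf: '$' ends each suffix
    return root

def suffix_trie_search(root, P):
    """All 1-based starting positions of P in T: walk P down the trie,
    then collect the positions of every suffix below that node."""
    node = root
    for c in P:
        if c not in node:
            return []
        node = node[c]
    out, stack = [], [node]
    while stack:
        n = stack.pop()
        for k, v in n.items():
            if k == "#":
                out.append(v)
            else:
                stack.append(v)
    return sorted(out)
```

For T = “carrara”, searching for “ar” yields the positions 2 and 5, matching the leaves in the figure.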
Clever algorithms achieve O(N) (McCreight, Ukkonen): scan the text from left to right, using additional suffix links in the tree.
Question: What does the Pat tree look like after inserting the first five suffixes using the naïve algorithm?
T: Honolulu$ (positions 1..9)