Download - Text Indexing S. Srinivasa Rao April 19, 2007 [based on slides by Paolo Ferragina]

Text Indexing

S. Srinivasa RaoApril 19, 2007

[based on slides by Paolo Ferragina]

Why string data are interesting ?

They are ubiquitous: Digital libraries and product catalogues Electronic white and yellow pages Specialized information sources (e.g. Genomic or Patent

dbs) Web pages repositories Private information dbs ...

String collections are growing at a staggering

rate:

...more than 10Tb of textual data in the web...more than 15Gb of base pairs in the genomic dbs

The need for an “index” Brute-force scanning is not a viable approach:

– Fast single searches– Multiple simple searches for complex queries

The index is a basic block of any IR system.

An IR system also encompasses:

– IR models– Ranking algorithms– Query languages and operations– ...

We will concentrate only on “index design” !!We will concentrate only on “index design” !!

Two families of indexes

Types of data

Linguistic or tokenizable textRaw sequence of characters or bytes

Word-based queryCharacter-based query

Types of query

Two indexing approaches :• Word-based indexes, here a concept of “word” must be devised !

» Inverted files, Signature files or Bitmaps.

• Full-text indexes, no constraint on text and queries !» Suffix Array, Suffix tree, Hybrid indexes, or String B-tree.

DNA sequencesAudio-video filesExecutables

Arbitrary substringComplex matches

Exact wordWord prefix or suffixPhrase

Word-based: Inverted files (or lists)

Now is the timefor all good men

to come to the aidof their country

Doc #1

It was a dark andstormy night in

the country manor. The time was past midnight

Doc #2

Query answering is a two-phase process: midnight AND time

Term N docs Tot Freqa 1 1aid 1 1all 1 1and 1 1come 1 1country 2 2dark 1 1for 1 1good 1 1in 1 1is 1 1it 1 1manor 1 1men 1 1midnight 1 1night 1 1now 1 1of 1 1past 1 1stormy 1 1the 2 4their 1 1time 2 2to 1 2was 1 2

Doc # Freq2 11 11 12 11 11 12 12 11 11 12 11 12 12 11 12 12 11 11 12 12 11 22 21 11 12 11 22 2

Vocabulary Postings

2

Full-text indexes

Their need is pervasive:– Raw data: DNA sequences, Audio-Video files, ...– Linguistic texts: data mining, statistics, ... – Vocabulary for Inverted Lists– Xpath queries on XML documents– Intrusion detection, Anti-viruses, ...

Classes of indexes: Suffix array, Suffix tree (variants) Multi-level indexes: Short Pat array B-tree based data structures: Prefix B-tree, String B-tree

Terminology An alphabet, denoted ∑ is a set of (ordered) characters.

A string S is an array of characters, S[1,n] = S[1] S[2] … S[n].

S[i,j] = S[i] … S[j] is a substring of S.

S[1,j] is a prefix of S; S[i,n] is a suffix of S.

∑* denotes all strings over alphabet ∑.

Lexicographic order:– Example: For ∑ = { a, b, c, …, z }, where a < b < c < … < z,

the lexicographic order is the same as in a dictionary.

SUF(T) = Sorted set of suffixes of T

Indexed string matching problem

Let T be a set of K strings in ∑*, where N is the total length of all strings in T.

String matching query on T: Given a pattern P find all occurrences of P in the strings in T.

Static problem: Store T in a data structure that supports string matching queries. Such a data structure is called a full-text index.

Dynamic version: Supports also insertions and deletions of strings in the full-text index.

A simple but crucial observation

Pattern P[1,p] occurs at position i of T[1,n] iff P[1,p] is a prefix of the suffix T[i,n]

TPi

T[i,n]

Occurrences of P in T = All suffixes of T having P as a prefix

T = This is a visual example This is a visual example This is a visual example

3,6,12

Can transform the string matching problem to a “prefix matching problem” over all the suffixes.

Suffix Array [Manber-Myers, 90]

Suffix array: an array of pointers to all the suffixes in the text in their lexicographic order.

T = mississippi#

#i#ippi#issippi#ississippi#mississippi#pi#ppi#sippi#sissippi#ssippi#ssissippi#

SUF(T)

Suffix Array• SA: array of ints, 4N bytes• Text T: N bytes 5N bytes of space occupancy

(N2) space

SA

121185211097463

T = mississippi#

suffix pointer

5

Two key properties [Manber-Myers, 90]

Prop 1. All suffixes in SUF(T) having prefix P are contiguous.Prop 2. Starting position is the lexicographic one of P.

P=si

T = mississippi#


SUF(T)SA

121185211097463

T = mississippi#

Searching in Suffix Array [Manber-Myers, 90]

Indirected binary search on SA: O(p log2 N) time

T = mississippi#SA

121185211097463

si

P is larger

2 accesses for binary step

Searching in Suffix Array [Manber-Myers, 90]

Indirected binary search on SA: O(p log2 N) time

T = mississippi#SA

121185211097463

si

P is smaller

Listing the occurrences [Manber-Myers, 90]

Brute-force comparison: O(p x occ) time

T = mississippi# 4 6 7SA

121185211097463

si

occ=2

Suffix Array search• O (p (log2 N + occ)) time

• O (log2 N + occ) in practice

External memory• Simple disk paging for SA

• O ((p/B) (log2 N + occ)) I/Os

issippiP is not a prefix

P is a prefixsissippi

P is a prefixsippi

logB N+ occ/B

121185211097463

121185211097463

SA

121185211097463

Lcp

00140010213

Lcp[1,n-1] stores the longest-common-prefix between suffixes adjacent in SA

Output-sensitive retrieval

T = mississippi# 4 6 7


SUF(T)

P=si

Suffix Array search• O ((p/B) log2 N + (occ/B)) I/Os• 9 N bytes of space

Scan Lcp until Lcp[i] < p

+ : incremental search

base B : tricky !!

Compare against P

00140010213

121185211097463

121185211097463

00140010213

occ=2

q

Incremental search (case 1)

Incremental search using the LCP array: no rescanning of pattern chars

SA

Ran

ge M

inim

a

i

j

The cost: O (1) memory accesses

PMin Lcp[i,q-1]

> P’s

known induct.

< P’s

> P’s

q



SA

i

j

Min Lcp[i,q-1]

The cost: O (1) memory accesses

Ran

ge M

inim

a

known induct.

< P’s

> P’s

P



SA

i

j

q

Min Lcp[i,q-1]

Suffix char > Pattern char

Suffix char < Pattern char

Suffix Array search• O(log2 N) binary steps• O(p) total char-cmp for routing O((p/B) + log2 N + (occ/B)) I/Os

base B : more tricky (we will not consider

this)

L

The cost: O(L) char cmp

Ran

ge M

inim

a

< P’s

> P’s

P

Summary: suffix array

[Manber-Myers, 90]

Space: O(N)

String matching queries: O(p + log N + occ)

Can be constructed O(N log N) time.

Static.

SA

Hybrid Index: Short Pat array

Exploit internal memory: sample the suffix array and copy something in memory

M

Disk

P

binary-search inside

SA + Supra-index• O((p/B) + log2 (N/s) + (occ/B)) I/Os

Parameter s depends on M and influences both performance and space !!

(only a huristic)

s

Copy a prefix ofmarked suffixes

Tries Trie (name from the word “retrieval”): a data

structure for storing a set of strings– Let’s assume that all strings end with “$” (not

in )

b s

e

ar

$

i

d

$

u

l

k

$ $

l

u

n

d

a

y

$

$

Set of strings: {bear, bid, bulk, bull, sun, sunday}

Tries

Properties of a trie:– A multi-way tree. – Each node has from 1 to 1

children.– Each edge of the tree is labeled with

a character.– Each leaf node corresponds to the

stored string, which is a concatenation of characters on a path from the root to this node.

Analysis of the Trie

Given k strings of total length N: Size:

– O(N) in the worst-case Search, insertion, and deletion

(string of length m): – O(m) (assuming ∑ is constant)

Observation:– Having chains of one-child nodes is

wasteful

Compact Tries Compact Trie:

– Replace a chain of one-child nodes with an edge labeled with a string

– Each non-leaf node (except root) has at least two children

b s

e

ar

$

i

d

$

u

l

k

$ $

l

u

n

d

a

y

$

$

b sun

ear$id$ ul

k$l$

day$$

Compact Tries Implementation:

– Strings are external to the structure in one array, edges are labeled with indices in the array (from, to)

– Improves the space to O(k)

b sun

ear$id$ ul

k$l$

day$$

1 2 3 4 5 6 7 8 9 10 12 14 16 18 20 22 24 26 28 30 32 34 36 38 40

bear$bid$bulk$bull$sun$sunday$T:

1,120,22

2,57,9 11,12

13,1418,19

27,305,5

Patricia Tries Patricia trie:

– a compact trie where each edge’s label (from, to) is replaced by (T[from], to – from + 1)

1 2 3 4 5 6 7 8 9 10 12 14 16 18 20 22 24 26 28 30 32 34 36 38 40

bear bid bulk bull sun sunday T:

b,1s,3

e,4i,3 u,2

k,2l,2

d,4_,1

1,120,22

2,57,9 11,12

13,1418,19

27,305,5

Suffix Trees [McCreight, 76]

Suffix tree – a compact trie (or similar structure) of all suffixes of the text– Patricia trie of suffixes is sometimes called a

Pat tree

carrara$

1 2 3 4 5 6 7 8

a r

r$

carrara$

rara$ a$

rara$a

$ ra$

2 5

71

6 4

3

(a,1) (r,1)

(r,1)($,1)

(c,8)

(r,5) (a,2)

2 5

71

6 4

3

(r,5)(a,1)

(r,3)($,1)

Search in suffix trees

T = abababbc# 1 3 5 7 9

3

24

c

cc

b

a

bb

b

bb

a2

4

13 5

ab

c

a

b

b

c

a

b

c

bP = ba

cb

0

1

(5,8)O(N) space

O(p) time

What about ST in external memory ?– Unbalanced tree topology – Updates

CPAT tree ~ 5N on average

No (p/B), possibly no (occ/B), mainly static and space costly

76 8

c

Packing ?! (p) I/Os(occ) I/Os??

Search is a path traversal

24

and O(occ) time

- Large space ~ 15N

b

a

Summary: compact trie

K strings of total length N:

Space: O(K)

String matching queries: O(p + occ) time

Can be constructed O(N) time.

Update(S): O(|S|) time

Summary: suffix tree

[McCreight, 76]

For a string of length n: Space: O(n)

String matching queries: O(p + occ)

Can be constructed O(n) time.

Static.

The String B-tree (An I/O-efficient full-text index !!)

[Ferragina-Grossi, 95]

The prologue

We are left with many open issues:– Suffix Array: updates– Hybrid: Heuristic tuning of the performance– Suffix tree: difficult packing and (p) I/Os

B-tree is ubiquitous in large-scale applications:– Atomic keys: integers, reals, ...– Prefix B-tree: bounded length keys ( 255 chars)

Suffix trees + B-trees ? String B-tree [Ferragina-Grossi, 95]

Index unbounded length keys Good worst-case I/O-bounds in search and update

Some considerations

Strings have arbitrary length:– Disk page cannot ensure the storage of (B) strings– M may be unable to store even one single string

String storage:– Pointers allow to fit (B) strings per disk page– String comparison needs disk access and may be

expensiveString pointers organization seen so far:

Suffix array: simple but static and not optimal Patricia trie: sophisticated and much efficient

(optimal ?)Recall the problem: is a string collection

Search( P[1,p] ): retrieve all occurrences of P in ’s strings

Update( S[1,t] ): insert or delete a text S from

1º step: B-tree on string pointers

AATCAGCGAATGCTGCTT CTGTTGATGA 1 3 5 7 9 11 13 15 17 19 20 22 24 26 28 30

Disk

29 1 9 5 2 26 10 4 7 13 20 16 28 8 25 6 12 15 22 18 3 27 24 11 14 21 17 23

29 2 26 13 20 25 6 18 3 14 21 23

29 13 20 18 3 23

P = ATO((p/B) log2 B) I/Os

O(logB N) levels

Search(P) •O ((p/B) log2 N) I/Os•O (occ/B) I/Os

It is dynamic !! O(s (s/B) log2 N) I/Os

+ B

2º step: The Patricia trie

1

AGAAGA

AGAAGG

AGAC

GCGCAGA

GCGCAGG

GCGCGGA

GCGCGGGA

65

3

6

4

0

Disk

AG

A

4 5 6 72 3

(1; 1,3) (4; 1,4)

(6; 5,6)(5; 5,6)(2; 1,2) (3; 4,4)

(1; 6,6) (2; 6,6)(4; 7,7) (5;

7,7)(7; 7,8)

(6; 7,7)

G

2º step: The Patricia trie

1

AGAAGA

AGAAGG

AGAC

GCGCAGA

GCGCAGG

GCGCGGA

GCGCGGGA

65

3

Disk

A0

G

4 5 62 3

A

AA

A4

G

A6

7

GGG

C

Space PT• O(k) , not O(N) Two-phase search:

P = GCACGCAC

G

1

G5 7

A

Search(P):• First phase: no string access• Second phase: O(p/B) I/Os

mismatch

P’s position

max LCPwith P

Just one string is

checked !!

GC

3º step: B-tree + Patricia tree

AATCAGCGAATGCTGCTT CTGTTGATGA 1 3 5 7 9 11 13 15 17 19 20 22 24 26 28 30

Disk

29 1 9 5 2 26 10 4 7 13 20 16 28 8 25 6 12 15 22 18 3 27 24 11 14 21 17 23

29 2 26 13 20 25 6 18 3 14 21 23

29 13 20 18 3 23

PT PT PT

PT PT PT PT PT PT

PT

Search(P) •O((p/B) logB N) I/Os•O(occ/B) I/Os

Insert(S)O( s (s/B) logB N) I/Os

O(p/B) I/Os

O(logB N) levels

P = AT

+

PTLevel i

Search(P) • O(logB N) I/Os just

to go to the leaf level

4º step: Incremental Search

Max_lcp(i)

P

Level i+1 PT PT

First case

Leaf Level

adjacent

Level i+1 PT

InductiveStep

PTLevel i

Search(P) • O(p/B + logB N) I/Os• O(occ/B) I/Os

4º step: Incremental Search

Max_lcp(i)

P

Max_lcp(i+1)

skip Max_lcp(i)

i-th step:

O(( lcp i+1 – lcp i )/B + 1) I/Os

No rescanning

Second case

Summary: String B-tree

String B-tree performance: [Ferragina-Grossi, 95]

– Search(P) takes O(p/B + logB N + occ/B) I/Os

– Update(S) takes O( s logB N ) I/Os– Space is (N/B) disk pages

Using the String B-tree in internal memory:– Search(P) takes O(p + log2 N + occ) time

– Update(S) takes O( s log2 N ) time– Space is (N) bytes It is a sort of dynamic suffix array

Summary

Indexed string matching problem – Word-based– Full-text

Internal memory data structures– Suffix array– Suffix tree

External memory data structures– Patricia trie and Pat tree– Short Pat array– String B-tree