Text Indexing S. Srinivasa Rao April 19, 2007 [based on slides by Paolo Ferragina]

41
Text Indexing S. Srinivasa Rao April 19, 2007 [based on slides by Paolo Ferragina]
  • date post

    21-Dec-2015
  • Category

    Documents

  • view

    221
  • download

    2

Transcript of Text Indexing S. Srinivasa Rao April 19, 2007 [based on slides by Paolo Ferragina]

Page 1: Text Indexing S. Srinivasa Rao April 19, 2007 [based on slides by Paolo Ferragina]

Text Indexing

S. Srinivasa RaoApril 19, 2007

[based on slides by Paolo Ferragina]

Page 2: Text Indexing S. Srinivasa Rao April 19, 2007 [based on slides by Paolo Ferragina]

Why string data are interesting ?

They are ubiquitous: Digital libraries and product catalogues Electronic white and yellow pages Specialized information sources (e.g. Genomic or Patent

dbs) Web pages repositories Private information dbs ...

String collections are growing at a staggering

rate:

...more than 10Tb of textual data in the web...more than 15Gb of base pairs in the genomic dbs

Page 3: Text Indexing S. Srinivasa Rao April 19, 2007 [based on slides by Paolo Ferragina]

The need for an “index” Brute-force scanning is not a viable approach:

– Fast single searches– Multiple simple searches for complex queries

The index is a basic block of any IR system.

An IR system also encompasses:

– IR models– Ranking algorithms– Query languages and operations– ...

We will concentrate only on “index design” !!We will concentrate only on “index design” !!

Page 4: Text Indexing S. Srinivasa Rao April 19, 2007 [based on slides by Paolo Ferragina]

Two families of indexes

Types of data

Linguistic or tokenizable textRaw sequence of characters or bytes

Word-based queryCharacter-based query

Types of query

Two indexing approaches :• Word-based indexes, here a concept of “word” must be devised !

» Inverted files, Signature files or Bitmaps.

• Full-text indexes, no constraint on text and queries !» Suffix Array, Suffix tree, Hybrid indexes, or String B-tree.

DNA sequencesAudio-video filesExecutables

Arbitrary substringComplex matches

Exact wordWord prefix or suffixPhrase

Page 5: Text Indexing S. Srinivasa Rao April 19, 2007 [based on slides by Paolo Ferragina]

Word-based: Inverted files (or lists)

Now is the timefor all good men

to come to the aidof their country

Doc #1

It was a dark andstormy night in

the country manor. The time was past midnight

Doc #2

Query answering is a two-phase process: midnight AND time

Term N docs Tot Freqa 1 1aid 1 1all 1 1and 1 1come 1 1country 2 2dark 1 1for 1 1good 1 1in 1 1is 1 1it 1 1manor 1 1men 1 1midnight 1 1night 1 1now 1 1of 1 1past 1 1stormy 1 1the 2 4their 1 1time 2 2to 1 2was 1 2

Doc # Freq2 11 11 12 11 11 12 12 11 11 12 11 12 12 11 12 12 11 11 12 12 11 22 21 11 12 11 22 2

Vocabulary Postings

2

Page 6: Text Indexing S. Srinivasa Rao April 19, 2007 [based on slides by Paolo Ferragina]

Full-text indexes

Their need is pervasive:– Raw data: DNA sequences, Audio-Video files, ...– Linguistic texts: data mining, statistics, ... – Vocabulary for Inverted Lists– Xpath queries on XML documents– Intrusion detection, Anti-viruses, ...

Classes of indexes: Suffix array, Suffix tree (variants) Multi-level indexes: Short Pat array B-tree based data structures: Prefix B-tree, String B-tree

Page 7: Text Indexing S. Srinivasa Rao April 19, 2007 [based on slides by Paolo Ferragina]

Terminology An alphabet, denoted ∑ is a set of (ordered) characters.

A string S is an array of characters, S[1,n] = S[1] S[2] … S[n].

S[i,j] = S[i] … S[j] is a substring of S.

S[1,j] is a prefix of S; S[i,n] is a suffix of S.

∑* denotes all strings over alphabet ∑.

Lexicographic order:– Example: For ∑ = { a, b, c, …, z }, where a < b < c < … < z,

the lexicographic order is the same as in a dictionary.

SUF(T) = Sorted set of suffixes of T

Page 8: Text Indexing S. Srinivasa Rao April 19, 2007 [based on slides by Paolo Ferragina]

Indexed string matching problem

Let T be a set of K strings in ∑*, where N is the total length of all strings in T.

String matching query on T: Given a pattern P find all occurrences of P in the strings in T.

Static problem: Store T in a data structure that supports string matching queries. Such a data structure is called a full-text index.

Dynamic version: Supports also insertions and deletions of strings in the full-text index.

Page 9: Text Indexing S. Srinivasa Rao April 19, 2007 [based on slides by Paolo Ferragina]

A simple but crucial observation

Pattern P[1,p] occurs at position i of T[1,n] iff P[1,p] is a prefix of the suffix T[i,n]

TPi

T[i,n]

Occurrences of P in T = All suffixes of T having P as a prefix

T = This is a visual example This is a visual example This is a visual example

3,6,12

Can transform the string matching problem to a “prefix matching problem” over all the suffixes.

Page 10: Text Indexing S. Srinivasa Rao April 19, 2007 [based on slides by Paolo Ferragina]

Suffix Array [Manber-Myers, 90]

Suffix array: an array of pointers to all the suffixes in the text in their lexicographic order.

T = mississippi#

#i#ippi#issippi#ississippi#mississippi#pi#ppi#sippi#sissippi#ssippi#ssissippi#

SUF(T)

Suffix Array• SA: array of ints, 4N bytes• Text T: N bytes 5N bytes of space occupancy

(N2) space

SA

121185211097463

T = mississippi#

suffix pointer

5

Page 11: Text Indexing S. Srinivasa Rao April 19, 2007 [based on slides by Paolo Ferragina]

Two key properties [Manber-Myers, 90]

Prop 1. All suffixes in SUF(T) having prefix P are contiguous.Prop 2. Starting position is the lexicographic one of P.

P=si

T = mississippi#

#i#ippi#issippi#ississippi#mississippi#pi#ppi#sippi#sissippi#ssippi#ssissippi#

SUF(T)SA

121185211097463

T = mississippi#

Page 12: Text Indexing S. Srinivasa Rao April 19, 2007 [based on slides by Paolo Ferragina]

Searching in Suffix Array [Manber-Myers, 90]

Indirected binary search on SA: O(p log2 N) time

T = mississippi#SA

121185211097463

si

P is larger

2 accesses for binary step

Page 13: Text Indexing S. Srinivasa Rao April 19, 2007 [based on slides by Paolo Ferragina]

Searching in Suffix Array [Manber-Myers, 90]

Indirected binary search on SA: O(p log2 N) time

T = mississippi#SA

121185211097463

si

P is smaller

Page 14: Text Indexing S. Srinivasa Rao April 19, 2007 [based on slides by Paolo Ferragina]

Listing the occurrences [Manber-Myers, 90]

Brute-force comparison: O(p x occ) time

T = mississippi# 4 6 7SA

121185211097463

si

occ=2

Suffix Array search• O (p (log2 N + occ)) time

• O (log2 N + occ) in practice

External memory• Simple disk paging for SA

• O ((p/B) (log2 N + occ)) I/Os

issippiP is not a prefix

P is a prefixsissippi

P is a prefixsippi

logB N+ occ/B

121185211097463

121185211097463

Page 15: Text Indexing S. Srinivasa Rao April 19, 2007 [based on slides by Paolo Ferragina]

SA

121185211097463

Lcp

00140010213

Lcp[1,n-1] stores the longest-common-prefix between suffixes adjacent in SA

Output-sensitive retrieval

T = mississippi# 4 6 7

#i#ippi#issippi#ississippi#mississippi#pi#ppi#sippi#sissippi#ssippi#ssissippi#

SUF(T)

P=si

Suffix Array search• O ((p/B) log2 N + (occ/B)) I/Os• 9 N bytes of space

Scan Lcp until Lcp[i] < p

+ : incremental search

base B : tricky !!

Compare against P

00140010213

121185211097463

121185211097463

00140010213

occ=2

Page 16: Text Indexing S. Srinivasa Rao April 19, 2007 [based on slides by Paolo Ferragina]

q

Incremental search (case 1)

Incremental search using the LCP array: no rescanning of pattern chars

SA

Ran

ge M

inim

a

i

j

The cost: O (1) memory accesses

PMin Lcp[i,q-1]

> P’s

known induct.

< P’s

> P’s

Page 17: Text Indexing S. Srinivasa Rao April 19, 2007 [based on slides by Paolo Ferragina]

q

Incremental search (case 2)

Incremental search using the LCP array: no rescanning of pattern chars

SA

i

j

Min Lcp[i,q-1]

The cost: O (1) memory accesses

Ran

ge M

inim

a

known induct.

< P’s

> P’s

P

Page 18: Text Indexing S. Srinivasa Rao April 19, 2007 [based on slides by Paolo Ferragina]

Incremental search (case 3)

Incremental search using the LCP array: no rescanning of pattern chars

SA

i

j

q

Min Lcp[i,q-1]

Suffix char > Pattern char

Suffix char < Pattern char

Suffix Array search• O(log2 N) binary steps• O(p) total char-cmp for routing O((p/B) + log2 N + (occ/B)) I/Os

base B : more tricky (we will not consider

this)

L

The cost: O(L) char cmp

Ran

ge M

inim

a

< P’s

> P’s

P

Page 19: Text Indexing S. Srinivasa Rao April 19, 2007 [based on slides by Paolo Ferragina]

Summary: suffix array

[Manber-Myers, 90]

Space: O(N)

String matching queries: O(p + log N + occ)

Can be constructed O(N log N) time.

Static.

Page 20: Text Indexing S. Srinivasa Rao April 19, 2007 [based on slides by Paolo Ferragina]

SA

Hybrid Index: Short Pat array

Exploit internal memory: sample the suffix array and copy something in memory

M

Disk

P

binary-search inside

SA + Supra-index• O((p/B) + log2 (N/s) + (occ/B)) I/Os

Parameter s depends on M and influences both performance and space !!

(only a huristic)

s

Copy a prefix ofmarked suffixes

Page 21: Text Indexing S. Srinivasa Rao April 19, 2007 [based on slides by Paolo Ferragina]

Tries Trie (name from the word “retrieval”): a data

structure for storing a set of strings– Let’s assume that all strings end with “$” (not

in )

b s

e

ar

$

i

d

$

u

l

k

$ $

l

u

n

d

a

y

$

$

Set of strings: {bear, bid, bulk, bull, sun, sunday}

Page 22: Text Indexing S. Srinivasa Rao April 19, 2007 [based on slides by Paolo Ferragina]

Tries

Properties of a trie:– A multi-way tree. – Each node has from 1 to 1

children.– Each edge of the tree is labeled with

a character.– Each leaf node corresponds to the

stored string, which is a concatenation of characters on a path from the root to this node.

Page 23: Text Indexing S. Srinivasa Rao April 19, 2007 [based on slides by Paolo Ferragina]

Analysis of the Trie

Given k strings of total length N: Size:

– O(N) in the worst-case Search, insertion, and deletion

(string of length m): – O(m) (assuming ∑ is constant)

Observation:– Having chains of one-child nodes is

wasteful

Page 24: Text Indexing S. Srinivasa Rao April 19, 2007 [based on slides by Paolo Ferragina]

Compact Tries Compact Trie:

– Replace a chain of one-child nodes with an edge labeled with a string

– Each non-leaf node (except root) has at least two children

b s

e

ar

$

i

d

$

u

l

k

$ $

l

u

n

d

a

y

$

$

b sun

ear$id$ ul

k$l$

day$$

Page 25: Text Indexing S. Srinivasa Rao April 19, 2007 [based on slides by Paolo Ferragina]

Compact Tries Implementation:

– Strings are external to the structure in one array, edges are labeled with indices in the array (from, to)

– Improves the space to O(k)

b sun

ear$id$ ul

k$l$

day$$

1 2 3 4 5 6 7 8 9 10 12 14 16 18 20 22 24 26 28 30 32 34 36 38 40

bear$bid$bulk$bull$sun$sunday$T:

1,120,22

2,57,9 11,12

13,1418,19

27,305,5

Page 26: Text Indexing S. Srinivasa Rao April 19, 2007 [based on slides by Paolo Ferragina]

Patricia Tries Patricia trie:

– a compact trie where each edge’s label (from, to) is replaced by (T[from], to – from + 1)

1 2 3 4 5 6 7 8 9 10 12 14 16 18 20 22 24 26 28 30 32 34 36 38 40

bear bid bulk bull sun sunday T:

b,1s,3

e,4i,3 u,2

k,2l,2

d,4_,1

1,120,22

2,57,9 11,12

13,1418,19

27,305,5

Page 27: Text Indexing S. Srinivasa Rao April 19, 2007 [based on slides by Paolo Ferragina]

Suffix Trees [McCreight, 76]

Suffix tree – a compact trie (or similar structure) of all suffixes of the text– Patricia trie of suffixes is sometimes called a

Pat tree

carrara$

1 2 3 4 5 6 7 8

a r

r$

carrara$

rara$ a$

rara$a

$ ra$

2 5

71

6 4

3

(a,1) (r,1)

(r,1)($,1)

(c,8)

(r,5) (a,2)

2 5

71

6 4

3

(r,5)(a,1)

(r,3)($,1)

Page 28: Text Indexing S. Srinivasa Rao April 19, 2007 [based on slides by Paolo Ferragina]

Search in suffix trees

T = abababbc# 1 3 5 7 9

3

24

c

cc

b

a

bb

b

bb

a2

4

13 5

ab

c

a

b

b

c

a

b

c

bP = ba

cb

0

1

(5,8)O(N) space

O(p) time

What about ST in external memory ?– Unbalanced tree topology – Updates

CPAT tree ~ 5N on average

No (p/B), possibly no (occ/B), mainly static and space costly

76 8

c

Packing ?! (p) I/Os(occ) I/Os??

Search is a path traversal

24

and O(occ) time

- Large space ~ 15N

b

a

Page 29: Text Indexing S. Srinivasa Rao April 19, 2007 [based on slides by Paolo Ferragina]

Summary: compact trie

K strings of total length N:

Space: O(K)

String matching queries: O(p + occ) time

Can be constructed O(N) time.

Update(S): O(|S|) time

Page 30: Text Indexing S. Srinivasa Rao April 19, 2007 [based on slides by Paolo Ferragina]

Summary: suffix tree

[McCreight, 76]

For a string of length n: Space: O(n)

String matching queries: O(p + occ)

Can be constructed O(n) time.

Static.

Page 31: Text Indexing S. Srinivasa Rao April 19, 2007 [based on slides by Paolo Ferragina]

The String B-tree (An I/O-efficient full-text index !!)

[Ferragina-Grossi, 95]

Page 32: Text Indexing S. Srinivasa Rao April 19, 2007 [based on slides by Paolo Ferragina]

The prologue

We are left with many open issues:– Suffix Array: updates– Hybrid: Heuristic tuning of the performance– Suffix tree: difficult packing and (p) I/Os

B-tree is ubiquitous in large-scale applications:– Atomic keys: integers, reals, ...– Prefix B-tree: bounded length keys ( 255 chars)

Suffix trees + B-trees ? String B-tree [Ferragina-Grossi, 95]

Index unbounded length keys Good worst-case I/O-bounds in search and update

Page 33: Text Indexing S. Srinivasa Rao April 19, 2007 [based on slides by Paolo Ferragina]

Some considerations

Strings have arbitrary length:– Disk page cannot ensure the storage of (B) strings– M may be unable to store even one single string

String storage:– Pointers allow to fit (B) strings per disk page– String comparison needs disk access and may be

expensiveString pointers organization seen so far:

Suffix array: simple but static and not optimal Patricia trie: sophisticated and much efficient

(optimal ?)Recall the problem: is a string collection

Search( P[1,p] ): retrieve all occurrences of P in ’s strings

Update( S[1,t] ): insert or delete a text S from

Page 34: Text Indexing S. Srinivasa Rao April 19, 2007 [based on slides by Paolo Ferragina]

1º step: B-tree on string pointers

AATCAGCGAATGCTGCTT CTGTTGATGA 1 3 5 7 9 11 13 15 17 19 20 22 24 26 28 30

Disk

29 1 9 5 2 26 10 4 7 13 20 16 28 8 25 6 12 15 22 18 3 27 24 11 14 21 17 23

29 2 26 13 20 25 6 18 3 14 21 23

29 13 20 18 3 23

P = ATO((p/B) log2 B) I/Os

O(logB N) levels

Search(P) •O ((p/B) log2 N) I/Os•O (occ/B) I/Os

It is dynamic !! O(s (s/B) log2 N) I/Os

+ B

Page 35: Text Indexing S. Srinivasa Rao April 19, 2007 [based on slides by Paolo Ferragina]

2º step: The Patricia trie

1

AGAAGA

AGAAGG

AGAC

GCGCAGA

GCGCAGG

GCGCGGA

GCGCGGGA

65

3

6

4

0

Disk

AG

A

4 5 6 72 3

(1; 1,3) (4; 1,4)

(6; 5,6)(5; 5,6)(2; 1,2) (3; 4,4)

(1; 6,6) (2; 6,6)(4; 7,7) (5;

7,7)(7; 7,8)

(6; 7,7)

G

Page 36: Text Indexing S. Srinivasa Rao April 19, 2007 [based on slides by Paolo Ferragina]

2º step: The Patricia trie

1

AGAAGA

AGAAGG

AGAC

GCGCAGA

GCGCAGG

GCGCGGA

GCGCGGGA

65

3

Disk

A0

G

4 5 62 3

A

AA

A4

G

A6

7

GGG

C

Space PT• O(k) , not O(N) Two-phase search:

P = GCACGCAC

G

1

G5 7

A

Search(P):• First phase: no string access• Second phase: O(p/B) I/Os

mismatch

P’s position

max LCPwith P

Just one string is

checked !!

GC

Page 37: Text Indexing S. Srinivasa Rao April 19, 2007 [based on slides by Paolo Ferragina]

3º step: B-tree + Patricia tree

AATCAGCGAATGCTGCTT CTGTTGATGA 1 3 5 7 9 11 13 15 17 19 20 22 24 26 28 30

Disk

29 1 9 5 2 26 10 4 7 13 20 16 28 8 25 6 12 15 22 18 3 27 24 11 14 21 17 23

29 2 26 13 20 25 6 18 3 14 21 23

29 13 20 18 3 23

PT PT PT

PT PT PT PT PT PT

PT

Search(P) •O((p/B) logB N) I/Os•O(occ/B) I/Os

Insert(S)O( s (s/B) logB N) I/Os

O(p/B) I/Os

O(logB N) levels

P = AT

+

Page 38: Text Indexing S. Srinivasa Rao April 19, 2007 [based on slides by Paolo Ferragina]

PTLevel i

Search(P) • O(logB N) I/Os just

to go to the leaf level

4º step: Incremental Search

Max_lcp(i)

P

Level i+1 PT PT

First case

Leaf Level

adjacent

Page 39: Text Indexing S. Srinivasa Rao April 19, 2007 [based on slides by Paolo Ferragina]

Level i+1 PT

InductiveStep

PTLevel i

Search(P) • O(p/B + logB N) I/Os• O(occ/B) I/Os

4º step: Incremental Search

Max_lcp(i)

P

Max_lcp(i+1)

skip Max_lcp(i)

i-th step:

O(( lcp i+1 – lcp i )/B + 1) I/Os

No rescanning

Second case

Page 40: Text Indexing S. Srinivasa Rao April 19, 2007 [based on slides by Paolo Ferragina]

Summary: String B-tree

String B-tree performance: [Ferragina-Grossi, 95]

– Search(P) takes O(p/B + logB N + occ/B) I/Os

– Update(S) takes O( s logB N ) I/Os– Space is (N/B) disk pages

Using the String B-tree in internal memory:– Search(P) takes O(p + log2 N + occ) time

– Update(S) takes O( s log2 N ) time– Space is (N) bytes It is a sort of dynamic suffix array

Page 41: Text Indexing S. Srinivasa Rao April 19, 2007 [based on slides by Paolo Ferragina]

Summary

Indexed string matching problem – Word-based– Full-text

Internal memory data structures– Suffix array– Suffix tree

External memory data structures– Patricia trie and Pat tree– Short Pat array– String B-tree