Text Indexing
S. Srinivasa RaoApril 19, 2007
[based on slides by Paolo Ferragina]
Why string data are interesting ?
They are ubiquitous: Digital libraries and product catalogues Electronic white and yellow pages Specialized information sources (e.g. Genomic or Patent
dbs) Web pages repositories Private information dbs ...
String collections are growing at a staggering
rate:
...more than 10Tb of textual data in the web...more than 15Gb of base pairs in the genomic dbs
The need for an “index” Brute-force scanning is not a viable approach:
– Fast single searches– Multiple simple searches for complex queries
The index is a basic block of any IR system.
An IR system also encompasses:
– IR models– Ranking algorithms– Query languages and operations– ...
We will concentrate only on “index design” !!We will concentrate only on “index design” !!
Two families of indexes
Types of data
Linguistic or tokenizable textRaw sequence of characters or bytes
Word-based queryCharacter-based query
Types of query
Two indexing approaches :• Word-based indexes, here a concept of “word” must be devised !
» Inverted files, Signature files or Bitmaps.
• Full-text indexes, no constraint on text and queries !» Suffix Array, Suffix tree, Hybrid indexes, or String B-tree.
DNA sequencesAudio-video filesExecutables
Arbitrary substringComplex matches
Exact wordWord prefix or suffixPhrase
Word-based: Inverted files (or lists)
Now is the timefor all good men
to come to the aidof their country
Doc #1
It was a dark andstormy night in
the country manor. The time was past midnight
Doc #2
Query answering is a two-phase process: midnight AND time
Term N docs Tot Freqa 1 1aid 1 1all 1 1and 1 1come 1 1country 2 2dark 1 1for 1 1good 1 1in 1 1is 1 1it 1 1manor 1 1men 1 1midnight 1 1night 1 1now 1 1of 1 1past 1 1stormy 1 1the 2 4their 1 1time 2 2to 1 2was 1 2
Doc # Freq2 11 11 12 11 11 12 12 11 11 12 11 12 12 11 12 12 11 11 12 12 11 22 21 11 12 11 22 2
Vocabulary Postings
2
Full-text indexes
Their need is pervasive:– Raw data: DNA sequences, Audio-Video files, ...– Linguistic texts: data mining, statistics, ... – Vocabulary for Inverted Lists– Xpath queries on XML documents– Intrusion detection, Anti-viruses, ...
Classes of indexes: Suffix array, Suffix tree (variants) Multi-level indexes: Short Pat array B-tree based data structures: Prefix B-tree, String B-tree
Terminology An alphabet, denoted ∑ is a set of (ordered) characters.
A string S is an array of characters, S[1,n] = S[1] S[2] … S[n].
S[i,j] = S[i] … S[j] is a substring of S.
S[1,j] is a prefix of S; S[i,n] is a suffix of S.
∑* denotes all strings over alphabet ∑.
Lexicographic order:– Example: For ∑ = { a, b, c, …, z }, where a < b < c < … < z,
the lexicographic order is the same as in a dictionary.
SUF(T) = Sorted set of suffixes of T
Indexed string matching problem
Let T be a set of K strings in ∑*, where N is the total length of all strings in T.
String matching query on T: Given a pattern P find all occurrences of P in the strings in T.
Static problem: Store T in a data structure that supports string matching queries. Such a data structure is called a full-text index.
Dynamic version: Supports also insertions and deletions of strings in the full-text index.
A simple but crucial observation
Pattern P[1,p] occurs at position i of T[1,n] iff P[1,p] is a prefix of the suffix T[i,n]
TPi
T[i,n]
Occurrences of P in T = All suffixes of T having P as a prefix
T = This is a visual example This is a visual example This is a visual example
3,6,12
Can transform the string matching problem to a “prefix matching problem” over all the suffixes.
Suffix Array [Manber-Myers, 90]
Suffix array: an array of pointers to all the suffixes in the text in their lexicographic order.
T = mississippi#
#i#ippi#issippi#ississippi#mississippi#pi#ppi#sippi#sissippi#ssippi#ssissippi#
SUF(T)
Suffix Array• SA: array of ints, 4N bytes• Text T: N bytes 5N bytes of space occupancy
(N2) space
SA
121185211097463
T = mississippi#
suffix pointer
5
Two key properties [Manber-Myers, 90]
Prop 1. All suffixes in SUF(T) having prefix P are contiguous.Prop 2. Starting position is the lexicographic one of P.
P=si
T = mississippi#
#i#ippi#issippi#ississippi#mississippi#pi#ppi#sippi#sissippi#ssippi#ssissippi#
SUF(T)SA
121185211097463
T = mississippi#
Searching in Suffix Array [Manber-Myers, 90]
Indirected binary search on SA: O(p log2 N) time
T = mississippi#SA
121185211097463
si
P is larger
2 accesses for binary step
Searching in Suffix Array [Manber-Myers, 90]
Indirected binary search on SA: O(p log2 N) time
T = mississippi#SA
121185211097463
si
P is smaller
Listing the occurrences [Manber-Myers, 90]
Brute-force comparison: O(p x occ) time
T = mississippi# 4 6 7SA
121185211097463
si
occ=2
Suffix Array search• O (p (log2 N + occ)) time
• O (log2 N + occ) in practice
External memory• Simple disk paging for SA
• O ((p/B) (log2 N + occ)) I/Os
issippiP is not a prefix
P is a prefixsissippi
P is a prefixsippi
logB N+ occ/B
121185211097463
121185211097463
SA
121185211097463
Lcp
00140010213
Lcp[1,n-1] stores the longest-common-prefix between suffixes adjacent in SA
Output-sensitive retrieval
T = mississippi# 4 6 7
#i#ippi#issippi#ississippi#mississippi#pi#ppi#sippi#sissippi#ssippi#ssissippi#
SUF(T)
P=si
Suffix Array search• O ((p/B) log2 N + (occ/B)) I/Os• 9 N bytes of space
Scan Lcp until Lcp[i] < p
+ : incremental search
base B : tricky !!
Compare against P
00140010213
121185211097463
121185211097463
00140010213
occ=2
q
Incremental search (case 1)
Incremental search using the LCP array: no rescanning of pattern chars
SA
Ran
ge M
inim
a
i
j
The cost: O (1) memory accesses
PMin Lcp[i,q-1]
> P’s
known induct.
< P’s
> P’s
q
Incremental search (case 2)
Incremental search using the LCP array: no rescanning of pattern chars
SA
i
j
Min Lcp[i,q-1]
The cost: O (1) memory accesses
Ran
ge M
inim
a
known induct.
< P’s
> P’s
P
Incremental search (case 3)
Incremental search using the LCP array: no rescanning of pattern chars
SA
i
j
q
Min Lcp[i,q-1]
Suffix char > Pattern char
Suffix char < Pattern char
Suffix Array search• O(log2 N) binary steps• O(p) total char-cmp for routing O((p/B) + log2 N + (occ/B)) I/Os
base B : more tricky (we will not consider
this)
L
The cost: O(L) char cmp
Ran
ge M
inim
a
< P’s
> P’s
P
Summary: suffix array
[Manber-Myers, 90]
Space: O(N)
String matching queries: O(p + log N + occ)
Can be constructed O(N log N) time.
Static.
SA
Hybrid Index: Short Pat array
Exploit internal memory: sample the suffix array and copy something in memory
M
Disk
P
binary-search inside
SA + Supra-index• O((p/B) + log2 (N/s) + (occ/B)) I/Os
Parameter s depends on M and influences both performance and space !!
(only a huristic)
s
Copy a prefix ofmarked suffixes
Tries Trie (name from the word “retrieval”): a data
structure for storing a set of strings– Let’s assume that all strings end with “$” (not
in )
b s
e
ar
$
i
d
$
u
l
k
$ $
l
u
n
d
a
y
$
$
Set of strings: {bear, bid, bulk, bull, sun, sunday}
Tries
Properties of a trie:– A multi-way tree. – Each node has from 1 to 1
children.– Each edge of the tree is labeled with
a character.– Each leaf node corresponds to the
stored string, which is a concatenation of characters on a path from the root to this node.
Analysis of the Trie
Given k strings of total length N: Size:
– O(N) in the worst-case Search, insertion, and deletion
(string of length m): – O(m) (assuming ∑ is constant)
Observation:– Having chains of one-child nodes is
wasteful
Compact Tries Compact Trie:
– Replace a chain of one-child nodes with an edge labeled with a string
– Each non-leaf node (except root) has at least two children
b s
e
ar
$
i
d
$
u
l
k
$ $
l
u
n
d
a
y
$
$
b sun
ear$id$ ul
k$l$
day$$
Compact Tries Implementation:
– Strings are external to the structure in one array, edges are labeled with indices in the array (from, to)
– Improves the space to O(k)
b sun
ear$id$ ul
k$l$
day$$
1 2 3 4 5 6 7 8 9 10 12 14 16 18 20 22 24 26 28 30 32 34 36 38 40
bear$bid$bulk$bull$sun$sunday$T:
1,120,22
2,57,9 11,12
13,1418,19
27,305,5
Patricia Tries Patricia trie:
– a compact trie where each edge’s label (from, to) is replaced by (T[from], to – from + 1)
1 2 3 4 5 6 7 8 9 10 12 14 16 18 20 22 24 26 28 30 32 34 36 38 40
bear bid bulk bull sun sunday T:
b,1s,3
e,4i,3 u,2
k,2l,2
d,4_,1
1,120,22
2,57,9 11,12
13,1418,19
27,305,5
Suffix Trees [McCreight, 76]
Suffix tree – a compact trie (or similar structure) of all suffixes of the text– Patricia trie of suffixes is sometimes called a
Pat tree
carrara$
1 2 3 4 5 6 7 8
a r
r$
carrara$
rara$ a$
rara$a
$ ra$
2 5
71
6 4
3
(a,1) (r,1)
(r,1)($,1)
(c,8)
(r,5) (a,2)
2 5
71
6 4
3
(r,5)(a,1)
(r,3)($,1)
Search in suffix trees
T = abababbc# 1 3 5 7 9
3
24
c
cc
b
a
bb
b
bb
a2
4
13 5
ab
c
a
b
b
c
a
b
c
bP = ba
cb
0
1
(5,8)O(N) space
O(p) time
What about ST in external memory ?– Unbalanced tree topology – Updates
CPAT tree ~ 5N on average
No (p/B), possibly no (occ/B), mainly static and space costly
76 8
c
Packing ?! (p) I/Os(occ) I/Os??
Search is a path traversal
24
and O(occ) time
- Large space ~ 15N
b
a
Summary: compact trie
K strings of total length N:
Space: O(K)
String matching queries: O(p + occ) time
Can be constructed O(N) time.
Update(S): O(|S|) time
Summary: suffix tree
[McCreight, 76]
For a string of length n: Space: O(n)
String matching queries: O(p + occ)
Can be constructed O(n) time.
Static.
The String B-tree (An I/O-efficient full-text index !!)
[Ferragina-Grossi, 95]
The prologue
We are left with many open issues:– Suffix Array: updates– Hybrid: Heuristic tuning of the performance– Suffix tree: difficult packing and (p) I/Os
B-tree is ubiquitous in large-scale applications:– Atomic keys: integers, reals, ...– Prefix B-tree: bounded length keys ( 255 chars)
Suffix trees + B-trees ? String B-tree [Ferragina-Grossi, 95]
Index unbounded length keys Good worst-case I/O-bounds in search and update
Some considerations
Strings have arbitrary length:– Disk page cannot ensure the storage of (B) strings– M may be unable to store even one single string
String storage:– Pointers allow to fit (B) strings per disk page– String comparison needs disk access and may be
expensiveString pointers organization seen so far:
Suffix array: simple but static and not optimal Patricia trie: sophisticated and much efficient
(optimal ?)Recall the problem: is a string collection
Search( P[1,p] ): retrieve all occurrences of P in ’s strings
Update( S[1,t] ): insert or delete a text S from
1º step: B-tree on string pointers
AATCAGCGAATGCTGCTT CTGTTGATGA 1 3 5 7 9 11 13 15 17 19 20 22 24 26 28 30
Disk
29 1 9 5 2 26 10 4 7 13 20 16 28 8 25 6 12 15 22 18 3 27 24 11 14 21 17 23
29 2 26 13 20 25 6 18 3 14 21 23
29 13 20 18 3 23
P = ATO((p/B) log2 B) I/Os
O(logB N) levels
Search(P) •O ((p/B) log2 N) I/Os•O (occ/B) I/Os
It is dynamic !! O(s (s/B) log2 N) I/Os
+ B
2º step: The Patricia trie
1
AGAAGA
AGAAGG
AGAC
GCGCAGA
GCGCAGG
GCGCGGA
GCGCGGGA
65
3
6
4
0
Disk
AG
A
4 5 6 72 3
(1; 1,3) (4; 1,4)
(6; 5,6)(5; 5,6)(2; 1,2) (3; 4,4)
(1; 6,6) (2; 6,6)(4; 7,7) (5;
7,7)(7; 7,8)
(6; 7,7)
G
2º step: The Patricia trie
1
AGAAGA
AGAAGG
AGAC
GCGCAGA
GCGCAGG
GCGCGGA
GCGCGGGA
65
3
Disk
A0
G
4 5 62 3
A
AA
A4
G
A6
7
GGG
C
Space PT• O(k) , not O(N) Two-phase search:
P = GCACGCAC
G
1
G5 7
A
Search(P):• First phase: no string access• Second phase: O(p/B) I/Os
mismatch
P’s position
max LCPwith P
Just one string is
checked !!
GC
3º step: B-tree + Patricia tree
AATCAGCGAATGCTGCTT CTGTTGATGA 1 3 5 7 9 11 13 15 17 19 20 22 24 26 28 30
Disk
29 1 9 5 2 26 10 4 7 13 20 16 28 8 25 6 12 15 22 18 3 27 24 11 14 21 17 23
29 2 26 13 20 25 6 18 3 14 21 23
29 13 20 18 3 23
PT PT PT
PT PT PT PT PT PT
PT
Search(P) •O((p/B) logB N) I/Os•O(occ/B) I/Os
Insert(S)O( s (s/B) logB N) I/Os
O(p/B) I/Os
O(logB N) levels
P = AT
+
PTLevel i
Search(P) • O(logB N) I/Os just
to go to the leaf level
4º step: Incremental Search
Max_lcp(i)
P
Level i+1 PT PT
First case
Leaf Level
adjacent
Level i+1 PT
InductiveStep
PTLevel i
Search(P) • O(p/B + logB N) I/Os• O(occ/B) I/Os
4º step: Incremental Search
Max_lcp(i)
P
Max_lcp(i+1)
skip Max_lcp(i)
i-th step:
O(( lcp i+1 – lcp i )/B + 1) I/Os
No rescanning
Second case
Summary: String B-tree
String B-tree performance: [Ferragina-Grossi, 95]
– Search(P) takes O(p/B + logB N + occ/B) I/Os
– Update(S) takes O( s logB N ) I/Os– Space is (N/B) disk pages
Using the String B-tree in internal memory:– Search(P) takes O(p + log2 N + occ) time
– Update(S) takes O( s log2 N ) time– Space is (N) bytes It is a sort of dynamic suffix array
Summary
Indexed string matching problem – Word-based– Full-text
Internal memory data structures– Suffix array– Suffix tree
External memory data structures– Patricia trie and Pat tree– Short Pat array– String B-tree
Top Related