Indexed Search Tree (Trie)

23
Indexed Search Tree (Trie) Nelson Padua-Perez Chau-Wen Tseng Department of Computer Science University of Maryland, College Park

description

Indexed Search Tree (Trie). Nelson Padua-Perez Chau-Wen Tseng Department of Computer Science University of Maryland, College Park. Indexed Search Tree ( Trie). Special case of tree Applicable when Key C can be decomposed into a sequence of subkeys C 1 , C 2 , … C n - PowerPoint PPT Presentation

Transcript of Indexed Search Tree (Trie)

Page 1: Indexed Search Tree (Trie)

Indexed Search Tree (Trie)

Nelson Padua-Perez

Chau-Wen Tseng

Department of Computer Science

University of Maryland, College Park

Page 2: Indexed Search Tree (Trie)

Indexed Search Tree (Trie)

Special case of tree

Applicable when Key C can be decomposed into a sequence of subkeys C1, C2, … Cn

Redundancy exists between subkeys

ApproachStore subkey at each node

Path through trie yields full key

ExampleHuffman tree

C3

C1

C2

C4C3

Page 3: Indexed Search Tree (Trie)

Tries

Useful for searching strings String decomposes into sequence of letters

Example

“ART” “A” “R” “T”

Can be very fastLess overhead than hashing

May reduce memoryExploiting redundancy

May require more memoryExplicitly storing substrings

S

A

R

TE

“ART”

Page 4: Indexed Search Tree (Trie)

Types of Tries

StandardSingle character per node

CompressedEliminating chains of nodes

CompactStores indices into original string(s)

SuffixStores all suffixes of string

Page 5: Indexed Search Tree (Trie)

Standard Tries

ApproachEach node (except root) is labeled with a character

Children of node are ordered (alphabetically)

Paths from root to leaves yield all input strings

Trie for Morse Code

Page 6: Indexed Search Tree (Trie)

Standard Trie Example

For strings{ a, an, and, any, at }

Page 7: Indexed Search Tree (Trie)

Standard Trie Example

For strings{ bear, bell, bid, bull, buy, sell, stock, stop }

a

e

b

r

l

l

s

u

l

l

y

e t

l

l

o

c

k

p

i

d

Page 8: Indexed Search Tree (Trie)

Standard Tries

Node structureValue between 1…m

Reference to m children

Array or linked list

ExampleClass Node {

Letter value; // Letter V = { V1, V2, … Vm }Node child[ m ];

}

Page 9: Indexed Search Tree (Trie)

Standard Tries

EfficiencyUses O(n) space

Supports search / insert / delete in O(dm) time

For

n total size of strings indexed by trie

d length of the parameter string

m size of the alphabet

Page 10: Indexed Search Tree (Trie)

Word Matching Trie

Insert words into trie

Each leaf stores occurrences of word in the text

s e e b e a r ? s e l l s t o c k !

s e e b u l l ? b u y s t o c k !

b i d s t o c k !

a

a

h e t h e b e l l ? s t o p !

b i d s t o c k !

0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23

24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46

47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68

69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86

a r87 88

a

e

b

l

s

u

l

e t

e

0, 24

o

c

i

l

r

6

l

78

d

47, 58l

30

y

36l

12k

17, 40,51, 62

p

84

h

e

r

69

a

Page 11: Indexed Search Tree (Trie)

Compressed Trie

ObservationInternal node v of T is redundant if v has one child and is not the root

ApproachA chain of redundant nodes can be compressed

Replace chain with single node

Include concatenation of labels from chain

ResultInternal nodes have at least 2 children

Some nodes have multiple characters

Page 12: Indexed Search Tree (Trie)

Compressed Trie

e

b

ar ll

s

u

ll y

ell to

ck p

id

a

e

b

r

l

l

s

u

l

l

y

e t

l

l

o

c

k

p

i

d

Node selain root yang hanya punya 1 anak perlu dikompress

Page 13: Indexed Search Tree (Trie)

Compact Tries

Compact representation of a compressed trie

ApproachFor an array of strings S = S[0], … S[s-1]

Store ranges of indices at each node

Instead of substring

Represent as a triplet of integers (i, j, k)

Such that X = s[i][j..k]

Example: S[0] = “abcd”, (0,1,2) = “bc”

PropertiesUses O(s) space, where s = # of strings in the array

Serves as an auxiliary index structure

Page 14: Indexed Search Tree (Trie)

Compact Representation

Example

s e e

b e a r

s e l l

s t o c k

b u l l

b u y

b i d

h e

b e l l

s t o p

0 1 2 3 4a rS[0] =

S[1] =

S[2] =

S[3] =

S[4] =

S[5] =

S[6] =

S[7] =

S[8] =

S[9] =

0 1 2 3 0 1 2 3

1, 1, 1

1, 0, 0 0, 0, 0

4, 1, 1

0, 2, 2

3, 1, 2

1, 2, 3 8, 2, 3

6, 1, 2

4, 2, 3 5, 2, 2 2, 2, 3 3, 3, 4 9, 3, 3

7, 0, 3

0, 1, 1

e

b s

u

e

to

ar ll

id

ll y ll ck p

hear

e

Page 15: Indexed Search Tree (Trie)

Suffix Trie

Compressed trie of all suffixes of text

Example: “IPDPS”Suffixes

IPDPS

PDPS

DPS

PS

S

Useful for finding pattern in any part of textOccurrence prefix of some suffix

Example: find PDP in IPDPS

D

P

S

P

I

P

D

S

P

S

D

PS

S

Page 16: Indexed Search Tree (Trie)

Suffix Trie Example

e nimize

nimize ze

zei mi

mize nimize ze

m i n i z em i0 1 2 3 4 5 6 7

7, 7 2, 7

2, 7 6, 7

6, 7

4, 7 2, 7 6, 7

1, 1 0, 1

minimize

inimize

nimize

imize

mize

ize

ze

e

Page 17: Indexed Search Tree (Trie)

Encoding Trie (1)Encode adalah merepresentasikan alphabet ke binary

An encoding trie represents a prefix codeEach leaf stores a character

The code word of a character is given by the path from the root to the leaf storing the character (0 for a left child and 1 for a right child

a

b c

d e00 010 011 10 11

a b c d e

0

0 1

0 1

1

0 1

Page 18: Indexed Search Tree (Trie)

Encoding Trie (2)Given a text string X, we want to find a prefix code for the characters of X that yields a small encoding for X

Frequent characters should have long code-wordsRare characters should have short code-words

ExampleX = abracadabraT1 encodes X into 29 bits: 010 11 011 010 00 010 10 010 11 011 010T2 encodes X into 24 bits: 00 10 11 00 010 00 011 00 10 11 00

c

a r

d b a

c d

b r

T1T2

00 10 010 011 11

a b c d r

010 11 00 10 011

a b c d r

0 1

0 10 1

0 1

0 1

0 1

0 1

0 1

Page 19: Indexed Search Tree (Trie)

Huffman’s Tree

a b c d r

5 2 1 1 2

X = abracadabra

Frequencies

r adc b52 21 1

c ardb

2

52 2

•Given a string X, Huffman’s algorithm construct a prefix code• the minimizes the size of the encoding of X

c

ar

d

b 25

4

2

a5

6

c d

b 2

4r

a

11

6

c d

b 2

4r

: 0 110 10 0 1110 0 1111 0 110 10 0: 23 digit

Page 20: Indexed Search Tree (Trie)

Tries and Web Search Engines

Search engine indexCollection of all searchable words

Stored in compressed trie

Each leaf of trie Associated with a word

List of pages (URLs) containing that word

Called occurrence list

Trie is kept in memory (fast)

Occurrence lists kept in external memoryRanked by relevance

Page 21: Indexed Search Tree (Trie)

Computational Biology

DNASequence of 4 different nucleotides (ATCG)

Portions of DNA sequence produce proteins (genes)

GenomeMaster DNA sequence for organism

For Human

46 chromosomes

3 billion nucleotides

Page 22: Indexed Search Tree (Trie)
Page 23: Indexed Search Tree (Trie)

Tries and Computational Biology

ESTsFragments of expressed DNA

Indicator for genes (& location)

5.5 million sequences at NIH

ESTmapperBuild suffix trie of genome

8 hours, 60 Gbytes

Search for ESTs in suffix trie

11 hours w/ 8 processor Sun

Search genome w/ BLAST 5+ years (predicted)

Genome

ESTs

Suffix tree

Mapping

Gene