Special Topics in Computer Science The Art of Information Retrieval Chapter 8: Indexing and...


Page 1: Special Topics in Computer Science The Art of Information Retrieval Chapter 8: Indexing and Searching Alexander Gelbukh .

Special Topics in Computer Science

The Art of Information Retrieval

Chapter 8: Indexing and Searching

Alexander Gelbukh

www.Gelbukh.com


Previous Chapter: Conclusions

Text transformation: meaning instead of strings
o Lexical analysis
o Stopwords
o Stemming

POS, WSD, syntax, semantics

Ontologies to collate similar stems

Text compression
o Searchable (compress the query, then search)
o Random access
o Word-based statistical methods (Huffman)

Index compression


Previous Chapter: Research topics

All computational linguistics
o Improved POS tagging
o Improved WSD

Uses of thesaurus
o for user navigation
o for collating similar terms

Better compression methods
o Searchable compression
o Random access


Types of searching

Sequential
o Small texts
o Volatile, or space limited

Indexed
o Semi-static
o Space overhead

First, we discuss indexed searching, then sequential


Inverted files

Vocabulary: ~sqrt(n) (Heaps’ law). 1 GB of text → about 5 MB of vocabulary

Occurrences: about 40% of n (stopwords not indexed)
o positions (word, char), files, sections...
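
The two parts of an inverted file (vocabulary and occurrences) can be illustrated with a minimal sketch. It assumes the whole text fits in memory, that words are runs of `\w` characters, and that occurrences are stored as character positions; the function name `build_inverted_index` is made up for this example.

```python
import re
from collections import defaultdict

def build_inverted_index(text):
    """Map each lowercased word (vocabulary) to its character positions (occurrences)."""
    index = defaultdict(list)
    for match in re.finditer(r"\w+", text.lower()):
        index[match.group()].append(match.start())
    return dict(index)

index = build_inverted_index("to be or not to be")
vocabulary = sorted(index)    # the distinct words
occurrences = index["to"]     # where "to" starts in the text
```

Storing word numbers, file names or section identifiers instead of character positions only changes what is appended to each list.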


Compression: Block addressing

Block addressing: 5% overhead
o 256, 64K, ... blocks (pointers of 1, 2, ... bytes)
o Equal size (faster search) or logical sections (retrieval units)
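
A rough sketch of block addressing, assuming the blocks are logical sections given as a list of strings (the names `build_block_index` and `find_word` are hypothetical). The index stores only block numbers, so exact positions have to be recovered by scanning the candidate blocks sequentially.

```python
import re
from collections import defaultdict

def build_block_index(blocks):
    """Map each word to the sorted list of block numbers that contain it."""
    index = defaultdict(set)
    for block_id, block in enumerate(blocks):
        for word in re.findall(r"\w+", block.lower()):
            index[word].add(block_id)
    return {word: sorted(ids) for word, ids in index.items()}

def find_word(blocks, index, word):
    """Return (block number, offset) pairs, scanning only the blocks listed in the index."""
    word = word.lower()
    hits = []
    for block_id in index.get(word, []):
        # Sequential search inside the candidate block.
        for match in re.finditer(r"\w+", blocks[block_id].lower()):
            if match.group() == word:
                hits.append((block_id, match.start()))
    return hits

blocks = ["the cat sat", "on the mat", "a cat nap"]
index = build_block_index(blocks)
cat_hits = find_word(blocks, index, "cat")
```

The index gets smaller because each word needs at most one entry per block, at the price of the extra scan at query time.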


Searching in inverted files

Vocabulary search
o Separate file
o Many searching techniques
o Lexicographic: log V (voc. size) = ½ log n (Heaps)
o Hashing is not good for prefix search

Retrieval of occurrences

Manipulation with occurrences: ~sqrt (n) (Heaps, Zipf)
o Boolean operations. Context search
o Merging. One list is shorter (Zipf law)

Only inverted files allow both space and time to be sublinear

Suffix trees and signature files don’t
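
A small sketch of lexicographic vocabulary search, assuming the vocabulary is kept as a sorted Python list (the names `lookup` and `prefix_range` are illustrative). Binary search takes about log V steps, and all words sharing a prefix are adjacent in sorted order, which is what a hash table cannot offer.

```python
from bisect import bisect_left

def lookup(vocabulary, word):
    """Binary search in a sorted vocabulary; returns the index of word or -1."""
    i = bisect_left(vocabulary, word)
    return i if i < len(vocabulary) and vocabulary[i] == word else -1

def prefix_range(vocabulary, prefix):
    """Words starting with prefix form one contiguous slice of the sorted vocabulary."""
    lo = bisect_left(vocabulary, prefix)
    hi = lo
    while hi < len(vocabulary) and vocabulary[hi].startswith(prefix):
        hi += 1
    return vocabulary[lo:hi]

vocabulary = sorted(["search", "index", "suffix", "inverted", "indexing"])
matches = prefix_range(vocabulary, "ind")
```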


Building inverted file: 1

Infinite memory?

Use trie to store vocabulary
o append positions

O(n)
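
The in-memory construction can be sketched with a trie made of nested dictionaries. This is only an illustration under assumptions: words are runs of alphanumeric characters, and the key `"$"` (never part of a word here) holds the position list of the word ending at that node. Each character is visited once, which gives the O(n) bound.

```python
def build_trie_index(text):
    """Single pass over the text: walk the trie per character, append positions at word ends."""
    root = {}
    word_start = None
    for pos, ch in enumerate(text + " "):   # trailing space flushes the last word
        if ch.isalnum():
            if word_start is None:
                word_start, node = pos, root
            node = node.setdefault(ch.lower(), {})
        elif word_start is not None:
            node.setdefault("$", []).append(word_start)
            word_start = None
    return root

def occurrences(trie, word):
    """Follow the word through the trie and return its position list (empty if absent)."""
    node = trie
    for ch in word.lower():
        node = node.get(ch)
        if node is None:
            return []
    return node.get("$", [])

trie = build_trie_index("to be or not to be")
positions_of_be = occurrences(trie, "be")
```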


Building inverted file: 2

Finite memory? (M = memory size)

Fill the memory

Write partial index; n/M pieces

Merge partial indices (hierarchically): n log (n/M)

Insertion: index the new text (size n'), then merge. n + n' log (n'/M)

Deleting: eliminate every occurrence. n

Very fast creation/maintenance
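
The finite-memory scheme might look as follows. This is a sketch under simplifying assumptions: the memory limit M is modelled as a maximum number of words per partial index (`max_words`), and the partial indices are kept in a Python list rather than written to disk. All names are hypothetical.

```python
import re
from collections import defaultdict

def partial_indices(text, max_words):
    """Yield one small index per chunk of at most max_words words."""
    current, count = defaultdict(list), 0
    for match in re.finditer(r"\w+", text.lower()):
        current[match.group()].append(match.start())
        count += 1
        if count == max_words:          # "memory" is full: emit the partial index
            yield dict(current)
            current, count = defaultdict(list), 0
    if count:
        yield dict(current)

def merge_two(a, b):
    """Merge two partial indices; b covers later text, so its lists are appended."""
    return {word: a.get(word, []) + b.get(word, [])
            for word in sorted(set(a) | set(b))}

def merge_hierarchically(parts):
    """Merge neighbours pairwise until one index remains (about log2(n/M) rounds)."""
    parts = list(parts)
    if not parts:
        return {}
    while len(parts) > 1:
        parts = [merge_two(parts[i], parts[i + 1]) if i + 1 < len(parts) else parts[i]
                 for i in range(0, len(parts), 2)]
    return parts[0]

index = merge_hierarchically(partial_indices("to be or not to be", max_words=2))
```

Because only neighbouring pieces are merged, every occurrence list stays sorted by position without any extra sorting step.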


Suffix trees

Text as one long string. No words.
o Genetic databases
o Complex queries
o Compacted trie structure
o Problem: space

For text retrieval, inverted files are better


Suffix array

All suffixes (by position) in lexicographic order

Allows binary search

Much less space: 40% n

Supra-index: sampling, for better disk access
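
A toy suffix array, to make the binary search concrete. The construction below simply sorts all suffixes and is only meant for short strings (it is not one of the efficient construction algorithms); the function names are illustrative.

```python
def build_suffix_array(text):
    """Naive construction: sort all suffix start positions lexicographically."""
    return sorted(range(len(text)), key=lambda i: text[i:])

def find_all(text, suffix_array, pattern):
    """Binary-search the range of suffixes that start with pattern; return their positions."""
    m = len(pattern)
    lo, hi = 0, len(suffix_array)
    while lo < hi:      # first suffix whose m-character prefix is >= pattern
        mid = (lo + hi) // 2
        if text[suffix_array[mid]:suffix_array[mid] + m] < pattern:
            lo = mid + 1
        else:
            hi = mid
    start = lo
    hi = len(suffix_array)
    while lo < hi:      # first suffix whose m-character prefix is > pattern
        mid = (lo + hi) // 2
        if text[suffix_array[mid]:suffix_array[mid] + m] <= pattern:
            lo = mid + 1
        else:
            hi = mid
    return sorted(suffix_array[start:lo])

text = "banana"
suffix_array = build_suffix_array(text)
positions = find_all(text, suffix_array, "ana")
```

The array stores one integer per suffix and nothing else, which is where the space saving over a suffix tree comes from.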


Searching. Construction

Searching

Patterns, prefixes, phrases. Not only words

Suffix tree: O(m), but: space (m = query size)

Suffix array: O(log n) (n = database size)

Construction of arrays: sorting
o Large text: n² log(M)/M, more than for inverted files
o Skip details

Addition: n n' log(M)/M

Deletion: n


Signature files

Usually worse than inverted files

Words are mapped to bit patterns

Blocks are mapped to ORs of their word patterns

If a block contains a word, all bits of that word's pattern are set in the block signature

Sequential search for blocks

False drops!
o Design of the hash function
o Have to traverse the block

Good to search ANDs or proximity queries
o bit patterns are ORed
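
A minimal signature-file sketch. The signature width (64 bits), the number of bits per word (3) and the use of SHA-256 as the hash are arbitrary assumptions made for the example; a real design would tune them to control false drops.

```python
import hashlib
import re

SIGNATURE_BITS = 64   # assumed signature width
BITS_PER_WORD = 3     # assumed number of bits set per word

def word_signature(word):
    """Hash a word to a bit pattern with up to BITS_PER_WORD bits set."""
    digest = hashlib.sha256(word.lower().encode("utf-8")).digest()
    sig = 0
    for i in range(BITS_PER_WORD):
        sig |= 1 << (digest[i] % SIGNATURE_BITS)
    return sig

def block_signature(block):
    """OR together the patterns of all words in the block."""
    sig = 0
    for word in re.findall(r"\w+", block):
        sig |= word_signature(word)
    return sig

def search(blocks, signatures, words):
    """AND query: a block qualifies if it has every bit of the ORed query pattern."""
    query = 0
    for word in words:
        query |= word_signature(word)
    wanted = {w.lower() for w in words}
    hits = []
    for block_id, sig in enumerate(signatures):   # sequential scan of the signatures
        if sig & query == query:                  # candidate, possibly a false drop
            if wanted <= set(re.findall(r"\w+", blocks[block_id].lower())):
                hits.append(block_id)             # confirmed by traversing the block
    return hits

blocks = ["the cat sat", "on the mat", "a cat nap"]
signatures = [block_signature(b) for b in blocks]
hits = search(blocks, signatures, ["the", "cat"])
```

The final check against the block text is what removes false drops; without it the function would return candidates only.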


Boolean operations

Merging file (occurrences) lists
o AND: to find repetitions

According to query syntax tree

Complexity linear in intermediate results
o Can be slow if they are huge

There are optimization techniques
o E.g.: merge small list with a big one by searching
o This is a usual case (Zipf)
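
The two ways of computing an AND can be compared in a short sketch, assuming both occurrence lists are sorted lists of integers (function names are made up here). The first is the linear merge; the second looks up each element of the short list in the long one by binary search.

```python
from bisect import bisect_left

def intersect_merge(a, b):
    """Classic merge: time linear in len(a) + len(b)."""
    i = j = 0
    out = []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return out

def intersect_by_search(short, long):
    """Search each element of the short list in the long one: about len(short) * log(len(long))."""
    out = []
    lo = 0
    for x in short:
        lo = bisect_left(long, x, lo)   # never look left of the previous hit
        if lo == len(long):
            break
        if long[lo] == x:
            out.append(x)
    return out

rare = [3, 40, 77]
frequent = list(range(0, 100, 2))
common = intersect_by_search(rare, frequent)
```

When one list is much shorter than the other, the second variant touches only a small part of the long list.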


Sequential search

Necessary part of many algorithms (e.g., block addressing)

Brute force: O(nm) worst-case, O(n) on average

Knuth-Morris-Pratt: linear worst case, but the same average

Boyer-Moore: n log(m) / m. Not all chars are examined!
o If some part of the pattern was compared, no need to compare inside it: you analyze the pattern once

Shift-Or: uses logical operations on all 32 bits in parallel

BDM: automaton. Complexity same as Boyer-Moore

Combination of BDM with bit parallelism
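
As an illustration of bit parallelism, here is a sketch of Shift-Or. Bit i of `state` is 0 exactly when the first i+1 pattern characters match the text ending at the current position. Python integers are unbounded, so this version is not limited to 32-character patterns as a machine-word implementation would be; the function name is an assumption.

```python
def shift_or_search(text, pattern):
    """Return the start positions of all exact occurrences of pattern in text."""
    m = len(pattern)
    if m == 0:
        return []
    all_ones = (1 << m) - 1
    # masks[ch] has a 0 bit at every position where the pattern contains ch.
    masks = {}
    for i, ch in enumerate(pattern):
        masks[ch] = masks.get(ch, all_ones) & ~(1 << i)
    state = all_ones
    hits = []
    for pos, ch in enumerate(text):
        # One shift and one OR update all m prefix-match flags at once.
        state = ((state << 1) | masks.get(ch, all_ones)) & all_ones
        if state & (1 << (m - 1)) == 0:
            hits.append(pos - m + 1)
    return hits

starts = shift_or_search("abracadabra", "abra")
```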


Approximate string matching

Match with k errors

Levenshtein distance

Dynamic programming: O(mn); improved versions O(kn)

Automaton: non-deterministic
o Convert to deterministic: O(n), but huge structure
o Bit-parallel: O(n), the fastest known

Filtering: sublinear!
o k errors cannot alter all k+1 segments of the pattern
o multipattern exact search; detect suspicious places
o uses approximate algorithm only when needed
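
The O(mn) dynamic programming approach can be sketched as follows. It keeps a single column of the edit-distance matrix and lets a match start anywhere in the text by resetting the top cell to 0. The function name and the convention of reporting end positions (exclusive) are choices made for this example.

```python
def approximate_search(text, pattern, k):
    """Return end positions (exclusive) where pattern matches a substring with at most k errors."""
    m = len(pattern)
    column = list(range(m + 1))     # distances of pattern prefixes to the empty text
    hits = []
    for j, ch in enumerate(text, start=1):
        previous_diagonal = column[0]
        column[0] = 0               # a match may start at any text position
        for i in range(1, m + 1):
            cost = 0 if pattern[i - 1] == ch else 1
            new_value = min(previous_diagonal + cost,   # match or substitution
                            column[i] + 1,              # extra character in the text
                            column[i - 1] + 1)          # missing pattern character
            previous_diagonal = column[i]
            column[i] = new_value
        if column[m] <= k:
            hits.append(j)
    return hits

exact_ends = approximate_search("abracadabra", "abra", 0)
fuzzy_ends = approximate_search("hello world", "wrld", 1)
```

With k = 0 this degenerates to exact search; larger k reports every text position where some substring ending there is within k edits of the pattern.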


Regular expressions

Regular expressions
o Automaton: O(m 2^m) + O(n) – bad for long patterns
o Bit-parallel (simulates non-deterministic)

Using indices to search for words with errors
o Inverted files: search in vocabulary, then each word
o Suffix trees and Suffix arrays: the same algorithms!


Structural queries

Ad-hoc index for structure

Indexing tags as words
o Inverted files are good since they store occurrences in order


Search over compression

Improves both space AND time (fewer disk operations)

Compress query and search
o Huffman compression with words as symbols and byte-oriented codes (most frequent words get shorter codes)
o Search each word in the vocabulary to get its code
o More sophisticated algorithms

Compressed inverted files: less disk → less time

Text and index compression can be combined


...compression

Suffix trees can be compressed almost to size of suffix arrays

Suffix arrays can’t be compressed (almost random), but can be constructed over compressed text
o instead of Huffman, use a code that respects alphabetic order
o almost the same compression

Signature files are sparse, so can be compressed
o ratios up to 70%


Research topics

Perhaps, new details in integration of compression and search

“Linguistic” indexing: allowing linguistic variations
o Search in plural or only singular
o Search with or without synonyms


Conclusions

Inverted files seem to be the best option

Other structures are good for specific cases
o Genetic databases

Sequential searching is an integral part of many indexing-based search techniques
o Many methods to improve sequential searching

Compression can be integrated with search


Thank you!

Till compensation lecture?