Special Topics in Computer Science Advanced Topics in Information Retrieval Lecture 4 (book chapter...


Special Topics in Computer Science

Advanced Topics in Information Retrieval

Lecture 4 (book chapter 8):

Indexing and Searching

Alexander Gelbukh

www.Gelbukh.com


Previous Chapter: Conclusions

Main measures: Precision & Recall.
o For sets
o Rankings are evaluated through initial subsets
There are measures that combine them into one
o Involve user-defined preferences
Many (other) characteristics
o An algorithm can be good at some and bad at others
o Averages are used, but are not always meaningful
Reference collections with known answers exist for evaluating new algorithms


Previous Chapter: Research topics

Different types of interfaces
Interactive systems:
o What measures to use?
o Such as informativeness


Types of searching

Indexed
o Semi-static
o Space overhead
Sequential
o Small texts
o Volatile, or space limited
Combined
o Index into large portions, then sequential inside portion
o Best combination of speed / overhead


Inverted files

Vocabulary: sqrt (n). Heaps’ law. 1 GB of text: 5M
Occurrences: n * 40% (stopwords)
o positions (word, char), files, sections...
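
The two parts of an inverted file (vocabulary and occurrences) can be sketched as follows. This is a minimal in-memory illustration, not the book's implementation: the function name `build_inverted_index`, the sample text, and the stopword set are assumptions made for the example, and occurrences are stored as word positions.

```python
import re
from collections import defaultdict

def build_inverted_index(text, stopwords=frozenset()):
    """Map each non-stopword to the list of its word positions in the text."""
    index = defaultdict(list)
    for position, match in enumerate(re.finditer(r"\w+", text.lower())):
        word = match.group()
        if word not in stopwords:
            index[word].append(position)
    return dict(index)

# Illustrative data (assumed, not from the lecture).
text = "This is a text. A text has many words. Words are made from letters."
stopwords = {"a", "is", "are", "this", "has", "from"}
index = build_inverted_index(text, stopwords)

vocabulary = sorted(index)        # the vocabulary part
occurrences = index["text"]       # the occurrences of one word
```

Storing character offsets, file numbers, or section numbers instead of word positions only changes what is appended to each list.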


Compression: Block addressing

Block addressing: 5% overhead
o 256, 64K, ... blocks (1, 2, ... bytes)
o Equal size (faster search) or logical sections (retrieval units)
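
Block addressing can be sketched like this: the index stores only block numbers, and exact positions are recovered by a sequential scan of the candidate blocks. The code is a rough sketch under stated assumptions: equal-size blocks measured in characters, words made of ASCII letters only, and the hypothetical names `build_block_index` and `search_blocks`.

```python
import re
from collections import defaultdict

def build_block_index(text, block_size):
    """Map each word to the sorted block numbers in which it starts."""
    index = defaultdict(set)
    for match in re.finditer(r"[a-z]+", text.lower()):
        index[match.group()].add(match.start() // block_size)
    return {word: sorted(blocks) for word, blocks in index.items()}

def is_whole_word(text, pos, length):
    """True if text[pos:pos+length] is not glued to a neighbouring letter."""
    end = pos + length
    before = pos > 0 and "a" <= text[pos - 1] <= "z"
    after = end < len(text) and "a" <= text[end] <= "z"
    return not before and not after

def search_blocks(text, index, word, block_size):
    """Return character offsets of word, scanning only the candidate blocks."""
    lowered, word = text.lower(), word.lower()
    hits = []
    for block in index.get(word, []):
        start = block * block_size
        stop = min(start + block_size, len(lowered))
        for pos in range(start, stop):      # sequential search inside the block
            if lowered.startswith(word, pos) and is_whole_word(lowered, pos, len(word)):
                hits.append(pos)
    return hits

text = "This is a text. A text has many words. Words are made from letters."
block_index = build_block_index(text, block_size=16)
```

With few blocks, each pointer fits in one or two bytes, which is where the space saving comes from; the price is the scan inside each candidate block.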


Searching in inverted files

Vocabulary search
o Separate file
o Many searching techniques
o Lexicographic: log V (voc. size) = ½ log n (Heaps)
o Hashing is not good for prefix search
Retrieval of occurrences
Manipulation of occurrences: ~sqrt (n) (Heaps, Zipf)
o Boolean operations. Context search
o Merging occurrences
o For AND: one list is usually shorter (Zipf law), so sublinear!
Only inverted files allow both sublinear space & time
o Suffix trees and signature files don’t
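
Lexicographic vocabulary search, including the prefix search that hashing cannot support, might look like the sketch below. It assumes the vocabulary is a sorted Python list held in memory; `prefix_range` is a hypothetical helper name, and the upper bound relies on the assumption that no word contains the highest Unicode code point.

```python
from bisect import bisect_left

def prefix_range(vocabulary, prefix):
    """Return all words of a sorted vocabulary that start with prefix.

    Two binary searches, so the cost is O(log V) plus the size of the answer.
    """
    lo = bisect_left(vocabulary, prefix)
    hi = bisect_left(vocabulary, prefix + "\U0010ffff")
    return vocabulary[lo:hi]

vocabulary = sorted(["letters", "made", "many", "text", "words"])
```

An exact-word lookup is the special case where the returned range contains the word itself.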


Building inverted file: 1

Infinite memory? Use trie to store vocabulary. O(n)
o append positions
Finite memory? Build in chunks, merge. Almost O(n)
Insertion: index + merge. Deleting: O(n). Very fast.
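
The "build in chunks, merge" idea can be sketched as follows. This is only an illustration under simplifying assumptions: a dict stands in for the trie, partial indices stay in memory instead of being written to disk, and each partial index is merged into the running result one at a time rather than pairwise. The names `index_chunk`, `merge_indices`, and `build_in_chunks` are made up for the example.

```python
import re
from collections import defaultdict

def index_chunk(words, offset):
    """Index one chunk; positions are global (offset + local position)."""
    partial = defaultdict(list)
    for i, word in enumerate(words):
        partial[word].append(offset + i)
    return partial

def merge_indices(left, right):
    """Merge two partial indices; right must cover later text than left."""
    merged = {word: list(positions) for word, positions in left.items()}
    for word, positions in right.items():
        merged.setdefault(word, []).extend(positions)   # lists stay sorted
    return merged

def build_in_chunks(text, chunk_words):
    """Build an inverted index while indexing only chunk_words words at a time."""
    words = re.findall(r"\w+", text.lower())
    index = {}
    for offset in range(0, len(words), chunk_words):
        partial = index_chunk(words[offset:offset + chunk_words], offset)
        index = merge_indices(index, partial)
    return index
```

Insertion of new text follows the same pattern: index the new portion, then merge it into the existing index.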


Suffix trees

Text as one long string. No words.
o Genetic databases
o Complex queries
o Compacted trie structure
o Problem: space
For text retrieval, inverted files are better


Info for tree comes from the text itself


Suffix array

All suffixes (by position) in lexicographic order
Allows binary search
Much less space: 40% n
Supra-index: sampling, for better disk access
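
A suffix array and its binary search can be sketched as below. The construction shown is the naive one (sort all suffixes directly), which is fine for a small example but is not the construction the chapter describes for large texts; every character position is indexed here, and the function names are assumptions.

```python
def build_suffix_array(text):
    """Start positions of all suffixes, in lexicographic order of the suffixes."""
    return sorted(range(len(text)), key=lambda i: text[i:])

def find_occurrences(text, suffix_array, pattern):
    """Binary search for all suffixes that start with pattern."""
    m = len(pattern)
    lo, hi = 0, len(suffix_array)
    while lo < hi:                      # first suffix whose m-prefix >= pattern
        mid = (lo + hi) // 2
        if text[suffix_array[mid]:suffix_array[mid] + m] < pattern:
            lo = mid + 1
        else:
            hi = mid
    first = lo
    hi = len(suffix_array)
    while lo < hi:                      # first suffix whose m-prefix > pattern
        mid = (lo + hi) // 2
        if text[suffix_array[mid]:suffix_array[mid] + m] <= pattern:
            lo = mid + 1
        else:
            hi = mid
    return sorted(suffix_array[first:lo])

text = "abracadabra"
suffix_array = build_suffix_array(text)
```

Because any substring is a prefix of some suffix, the same search answers queries for patterns, prefixes, and phrases, not only whole words.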


Suffix tree and suffix array: Searching. Construction

Searching
o Patterns, prefixes, phrases. Not only words
o Suffix tree: O(m), but: space (m = query size)
o Suffix array: O(log n) (n = database size)
Construction of arrays: sorting
o Large text: n^2 log (M)/M (M = main memory size), more than for inverted files
o Skip details
Addition: n n' log (M)/M (n' is the size of the new portion)
Deletion: n


Signature files

Usually worse than inverted files
Words are mapped to bit patterns
Blocks are mapped to ORs of their word patterns
If a block contains a word, all bits of its pattern are set
Sequential search for blocks
False drops!
o Design of the hash function
o Have to traverse the block
Good to search ANDs or proximity queries
o bit patterns are ORed
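
The signature-file scheme above can be sketched as follows. The signature width, the number of bits per word, and the use of SHA-256 as the hash are arbitrary assumptions chosen for illustration, and the function names are hypothetical. The final check inside the block is what filters out false drops.

```python
import hashlib
import re

SIGNATURE_BITS = 64   # assumed signature width
BITS_PER_WORD = 3     # assumed number of bits set per word

def word_signature(word):
    """Map a word to a bit pattern with at most BITS_PER_WORD bits set."""
    digest = hashlib.sha256(word.encode("utf-8")).digest()
    signature = 0
    for i in range(BITS_PER_WORD):
        signature |= 1 << (digest[i] % SIGNATURE_BITS)
    return signature

def block_signature(block):
    """OR together the patterns of all words in the block."""
    signature = 0
    for word in re.findall(r"\w+", block.lower()):
        signature |= word_signature(word)
    return signature

def search(blocks, signatures, word):
    """Return numbers of blocks that really contain word."""
    word = word.lower()
    pattern = word_signature(word)
    hits = []
    for number, (block, signature) in enumerate(zip(blocks, signatures)):
        if signature & pattern == pattern:                  # candidate block
            if word in re.findall(r"\w+", block.lower()):   # rule out false drops
                hits.append(number)
    return hits

blocks = ["This is a text.", "A text has many words.", "Words are made from letters."]
signatures = [block_signature(block) for block in blocks]
```

For an AND query, the patterns of the query words would be ORed into one pattern before the sequential pass over the block signatures.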


False drop: letters in 2nd block


Boolean operations

Merging file (occurrences) lists
o AND: to find repetitions
According to query syntax tree
Complexity linear in intermediate results
o Can be slow if they are huge
There are optimization techniques
o E.g.: merge small list with a big one by searching
o This is a usual case (Zipf)
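
The optimization mentioned above, merging a small list with a big one by searching, might be sketched like this. It assumes both occurrence lists are sorted lists of integers in memory; `intersect` is a name chosen for the example.

```python
from bisect import bisect_left

def intersect(list_a, list_b):
    """AND of two sorted lists: binary-search each item of the shorter list
    in the longer one, never moving backwards in the longer list."""
    short_list, long_list = sorted((list_a, list_b), key=len)
    result = []
    lo = 0
    for item in short_list:
        lo = bisect_left(long_list, item, lo)
        if lo == len(long_list):
            break                       # nothing larger remains in the long list
        if long_list[lo] == item:
            result.append(item)
    return result
```

The cost is roughly (length of the short list) times log (length of the long list), instead of the sum of both lengths for a plain linear merge.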


Sequential search

Necessary part of many algorithms (e.g., block addressing)
Brute force: O(nm) worst-case, O(n) on average
MANY faster algorithms, but more complicated
o See the book
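
For reference, the brute-force method is short enough to write out. This is a plain sketch with an assumed function name; the inner loop usually stops after a character or two, which is why the average cost is O(n) even though the worst case is O(nm).

```python
def brute_force_search(text, pattern):
    """Try every start position and compare the pattern character by character."""
    n, m = len(text), len(pattern)
    hits = []
    for start in range(n - m + 1):
        offset = 0
        while offset < m and text[start + offset] == pattern[offset]:
            offset += 1
        if offset == m:                 # all m characters matched
            hits.append(start)
    return hits
```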


Approximate string matching

Match with k errors, select the one with min k
Levenshtein distance between strings s1 and s2
o The minimum number of editing operations to make one from another
o Symmetric for standard sets of operations
o Operations: deletion, addition, change
o Sometimes weighted
Solution: dynamic programming. O(mn), O(kn)
o m, n are lengths of the two strings
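
The O(mn) dynamic-programming solution can be sketched as follows, with unit cost for each operation (the weighted variant would replace the three `+ 1` terms). The function names and the sample vocabulary are assumptions for illustration; only two rows of the table are kept at a time.

```python
def levenshtein(s1, s2):
    """Minimum number of deletions, additions, and changes turning s1 into s2."""
    previous = list(range(len(s2) + 1))     # distance from "" to each prefix of s2
    for i, c1 in enumerate(s1, start=1):
        current = [i]                       # distance from s1[:i] to ""
        for j, c2 in enumerate(s2, start=1):
            current.append(min(
                previous[j] + 1,                # delete c1
                current[j - 1] + 1,             # add c2
                previous[j - 1] + (c1 != c2),   # change c1 into c2, or match
            ))
        previous = current
    return previous[-1]

def closest_word(query, vocabulary):
    """Select the vocabulary word with the minimum number of errors."""
    return min(vocabulary, key=lambda word: levenshtein(query, word))
```

Calling `closest_word` on the vocabulary of an inverted file is the simplest form of searching an index for words with errors.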


Regular expressions

Regular expressions
o Automaton: O (m 2^m) + O (n) – bad for long patterns
o There are better methods, see book
Using indices to search for words with errors
o Inverted files: search in vocabulary
o Suffix trees and suffix arrays: the same algorithms as for search without errors! Just allow deviations from the path


Search over compression

Improves both space AND time (fewer disk operations)
Compress query and search
o Huffman compression, words as symbols, bytes (frequencies: most frequent shorter)
o Look up each query word in the vocabulary to get its code
o More sophisticated algorithms
Compressed inverted files: less disk, so less time
Text and index compression can be combined
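
The "compress the query and search" idea can be sketched with a word-based Huffman code. This is a simplified illustration, not the byte-oriented scheme the slide refers to: codes are bit strings, the compressed text is kept as a list of codewords rather than packed bytes, and all function names are assumptions.

```python
import heapq
import re
from collections import Counter
from itertools import count

def huffman_codes(frequencies):
    """Binary Huffman code over words: more frequent words get shorter codes."""
    tiebreak = count()                  # keeps heap entries comparable
    heap = [(freq, next(tiebreak), {word: ""}) for word, freq in frequencies.items()]
    heapq.heapify(heap)
    if len(heap) == 1:                  # a single word still needs one bit
        return {word: "0" for word in heap[0][2]}
    while len(heap) > 1:
        freq1, _, codes1 = heapq.heappop(heap)
        freq2, _, codes2 = heapq.heappop(heap)
        merged = {word: "0" + code for word, code in codes1.items()}
        merged.update({word: "1" + code for word, code in codes2.items()})
        heapq.heappush(heap, (freq1 + freq2, next(tiebreak), merged))
    return heap[0][2]

def compress(words, codes):
    """Replace every word by its codeword."""
    return [codes[word] for word in words]

def search_compressed(compressed, codes, query_words):
    """Compress the query, then match codewords without decompressing the text."""
    if any(word not in codes for word in query_words):
        return []                       # a word outside the vocabulary cannot occur
    target = [codes[word] for word in query_words]
    k = len(target)
    return [i for i in range(len(compressed) - k + 1) if compressed[i:i + k] == target]

words = re.findall(r"\w+", "to be or not to be".lower())
codes = huffman_codes(Counter(words))
compressed = compress(words, codes)
```

Matching at codeword boundaries avoids false matches that could appear when searching a raw bit stream at arbitrary bit offsets.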


...compression

Suffix trees can be compressed almost to size of suffix arrays
Suffix arrays can’t be compressed (almost random), but can be constructed over compressed text
o instead of Huffman, use a code that respects alphabetic order
o almost the same compression
Signature files are sparse, so can be compressed
o ratios up to 70%


Research topics

Perhaps, new details in integration of compression and search
“Linguistic” indexing: allowing linguistic variations
o Search in plural or only singular
o Search with or without synonyms


Conclusions

Inverted files seem to be the best option
Other structures are good for specific cases
o Genetic databases
Sequential searching is an integral part of many indexing-based search techniques
o Many methods to improve sequential searching
Compression can be integrated with search


Thank you!

Till April 26, 6 pm