Document Retrieval Problems
![Page 1: Document Retrieval Problems](https://reader036.fdocuments.in/reader036/viewer/2022062322/56814889550346895db59ec2/html5/thumbnails/1.jpg)
Document Retrieval Problems
S. Muthukrishnan
![Page 2: Document Retrieval Problems](https://reader036.fdocuments.in/reader036/viewer/2022062322/56814889550346895db59ec2/html5/thumbnails/2.jpg)
Storyline
• Zvi Galil gave a talk on the 13th on 13 open problems he posed 13 years ago in string matching.
– Update on the status of open problems.
• Eric Allender invited me to give a string matching talk at Rutgers U.
• Gives me a chance to look through 30 years of history.
Fernand Braudel
History may be divided into three movements: what moves rapidly, what moves slowly, and what appears not to move at all.
![Page 3: Document Retrieval Problems](https://reader036.fdocuments.in/reader036/viewer/2022062322/56814889550346895db59ec2/html5/thumbnails/3.jpg)
The Key Problem
• Occurrence listing: given a set of documents D to be preprocessed, a query is to list all the locations in the documents where a given pattern occurs.
D = { aabaa, abaaa, bc }, d1 = aabaa, d2 = abaaa, d3 = bc
P = aa
O = { (1,1), (1,4), (2,3), (2,4) }
• Document listing: given a set of documents D to be preprocessed, a query is to list all the documents in which a given pattern occurs.
D = { aabaa, abaaa, bc }, d1 = aabaa, d2 = abaaa, d3 = bc
P = aa
O = { 1, 2 }
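To make the two problem statements concrete, here is a minimal brute-force sketch in Python. It only illustrates the definitions on the example documents; it does no preprocessing and is not the algorithm discussed in the talk. The function names are hypothetical.

```python
def occurrence_listing(docs, pattern):
    """Return 1-based (document, position) pairs where pattern starts."""
    hits = []
    for d, text in enumerate(docs, start=1):
        start = text.find(pattern)
        while start != -1:
            hits.append((d, start + 1))
            # Step by one character so overlapping matches are found too.
            start = text.find(pattern, start + 1)
    return hits


def document_listing(docs, pattern):
    """Return 1-based indices of the documents that contain pattern."""
    return [d for d, text in enumerate(docs, start=1) if pattern in text]


DOCS = ["aabaa", "abaaa", "bc"]
print(occurrence_listing(DOCS, "aa"))  # [(1, 1), (1, 4), (2, 3), (2, 4)]
print(document_listing(DOCS, "aa"))    # [1, 2]
```

Both functions scan every document on every query, which is exactly the cost that preprocessing is meant to avoid.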
Muthu:
Use this problem to frame the discussion.
![Page 4: Document Retrieval Problems](https://reader036.fdocuments.in/reader036/viewer/2022062322/56814889550346895db59ec2/html5/thumbnails/4.jpg)
Occurrence Vs Document Listing
• Given n documents of total length N, occurrence listing can be solved with
– O(N) preprocessing, and
– O(m + output) time for a query pattern of size m.
– Elegant 1973 paper by Weiner introduced suffix trees and solved this problem: optimal, output sensitive.
• No such optimal result for document listing.
– O( (m+out) log n ) time query processing.
– log n loglog n by fractional cascading.
muthu: assuming you don’t hastily give the answer without looking at the entire document or the pattern!
![Page 5: Document Retrieval Problems](https://reader036.fdocuments.in/reader036/viewer/2022062322/56814889550346895db59ec2/html5/thumbnails/5.jpg)
Other Document Listing Problems
• Find all documents that contain at least K occurrences of the given pattern. (mining)
• Find all documents that contain two occurrences of the pattern separated by at most distance d. (proximity repeat)
• Find all documents that do NOT contain the given pattern. (negative query)
• Find all documents that contain pattern P but not Q. (boolean query)
• Combinations thereof…
Muthu:
Normally, negative queries are not selective, but they work within a selected subset or in conjunction with other patterns.
![Page 6: Document Retrieval Problems](https://reader036.fdocuments.in/reader036/viewer/2022062322/56814889550346895db59ec2/html5/thumbnails/6.jpg)
Nature of Document Retrieval Problems
• Document listing versions are natural.
– Occurrence listing versions primarily studied in Computational Biology and Data Mining.
• No optimal algorithms previously known.
– Bounds are off by factors of log n … n in the worst case depending on the problem.
• We will provide (near) optimal algorithms.
– Optimal algorithm for key document listing problem.
Muthu:
Motivated the discussion with this problem,
It is also framed in history.
Theory following Practice?
Inverted word index + variants, in IR.
![Page 7: Document Retrieval Problems](https://reader036.fdocuments.in/reader036/viewer/2022062322/56814889550346895db59ec2/html5/thumbnails/7.jpg)
Talk Overview
• Optimal algorithm for the document listing problem.
– List all documents that contain the given pattern.
• Efficient algorithm for the document mining problem.
– List all documents that contain at least K occurrences of the given pattern.
• Techniques.
– Colored range query data structural problems.
![Page 8: Document Retrieval Problems](https://reader036.fdocuments.in/reader036/viewer/2022062322/56814889550346895db59ec2/html5/thumbnails/8.jpg)
Preamble: Occurrence Listing
• Construct a suffix tree (compressed trie) of all the documents.
D = { abaa, aabaa, bc }
S = { abaa#, baa#, aa#, a#, aabaa#, bc#, c# }
Suffix tree leaves, left to right, labeled (document, position):
a#: (1,4), (2,5)
aa#: (1,3), (2,4)
aabaa#: (2,1)
abaa#: (1,1), (2,2)
baa#: (1,2), (2,3)
bc#: (3,1)
c#: (3,2)
http://commfaculty.fullerton.edu/lester/writings/1000_words.html
![Page 9: Document Retrieval Problems](https://reader036.fdocuments.in/reader036/viewer/2022062322/56814889550346895db59ec2/html5/thumbnails/9.jpg)
Preamble: Occurrence Listing
• Find all occurrences of pattern aa.
– Trace down the path aa and report all the leaves [Weiner 73].
Input: D = { abaa, aabaa, bc }
Output: (1,3), (2,4), (2,1)
(Figure: suffix tree of D; the path aa leads to the leaves labeled (1,3), (2,4) and (2,1).)
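The lookup can be sketched in Python as follows. This is a simplified stand-in, not Weiner's construction: it assumes a sorted list of all suffixes (a naive suffix array) instead of a suffix tree, so it needs quadratic space and the query is not output sensitive. All names are hypothetical; the '#' terminator is the one used on the slide.

```python
from bisect import bisect_left, bisect_right


def build_suffix_index(docs):
    """Sorted list of (suffix, document, position), 1-based labels."""
    entries = []
    for d, text in enumerate(docs, start=1):
        for i in range(len(text)):
            entries.append((text[i:] + "#", d, i + 1))
    entries.sort()  # '#' sorts before letters, matching the leaf order
    return entries


def suffix_range(index, pattern):
    """Half-open range [lo, hi) of suffixes that start with pattern."""
    # Truncating sorted strings to a fixed length keeps them sorted.
    keys = [suffix[:len(pattern)] for suffix, _, _ in index]
    return bisect_left(keys, pattern), bisect_right(keys, pattern)


def occurrences(index, pattern):
    """All (document, position) labels in the pattern's suffix range."""
    lo, hi = suffix_range(index, pattern)
    return [(d, pos) for _, d, pos in index[lo:hi]]


INDEX = build_suffix_index(["abaa", "aabaa", "bc"])
print(occurrences(INDEX, "aa"))  # [(1, 3), (2, 4), (2, 1)]
```

The contiguous range returned by `suffix_range` plays the role of the subtree below the path aa: every leaf in that subtree is one occurrence.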
![Page 10: Document Retrieval Problems](https://reader036.fdocuments.in/reader036/viewer/2022062322/56814889550346895db59ec2/html5/thumbnails/10.jpg)
Document Listing
• Find all documents that contain pattern aa.
– Trace down the path aa and report the distinct “colors” on leaves.
Input: D = { abaa, aabaa, bc }
Colors: 1, 2, 3
Output sought: 1, 2
(Figure: suffix tree of D with leaf colors, left to right: 1, 2 | 1, 2 | 2 | 1, 2 | 1, 2 | 3 | 3.)
Challenge: Avoid reporting duplicate colors.
muthu:
Use hot pink sparingly
![Page 11: Document Retrieval Problems](https://reader036.fdocuments.in/reader036/viewer/2022062322/56814889550346895db59ec2/html5/thumbnails/11.jpg)
Document Listing: Our Approach
(Figure: suffix tree of D with leaf colors, left to right: 1, 2 | 1, 2 | 2 | 1, 2 | 1, 2 | 3 | 3.)
Leaf colors as an array: 1 2 1 2 2 1 2 1 2 3 3
Colored range query: Return distinct colors in given range.
Mathematics is the art of giving the same name to different things. --- Jules Henri Poincare
![Page 12: Document Retrieval Problems](https://reader036.fdocuments.in/reader036/viewer/2022062322/56814889550346895db59ec2/html5/thumbnails/12.jpg)
Document Listing: Our Approach
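Position: 1 2 3 4 5 6 7 8 9 10 11
Color: 1 2 1 2 2 1 2 1 2 3 3
List distinct colors.
Position: 1 2 3 4 5 6 7 8 9 10 11
Chain: -1 -1 1 2 4 3 5 6 7 -1 10
Each chain entry is the previous position holding the same color, or -1 if there is none.
List numbers less than 3. Colors do not matter anymore.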
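The chaining step can be sketched in Python as below. It assumes 1-based positions as on the slide; the names are hypothetical. The scan-based query is only there to show why the threshold test yields each color exactly once; it still touches every position in the range.

```python
def build_chain(colors):
    """chain[i] = previous 1-based position with the same color, else -1."""
    last_seen = {}
    chain = []
    for pos, color in enumerate(colors, start=1):
        chain.append(last_seen.get(color, -1))
        last_seen[color] = pos
    return chain


def distinct_colors_by_scan(colors, chain, l, r):
    """Distinct colors in positions l..r (1-based, inclusive).

    A position is the first occurrence of its color inside the range
    exactly when its chain entry points to the left of l.
    """
    return [colors[i - 1] for i in range(l, r + 1) if chain[i - 1] < l]


COLORS = [1, 2, 1, 2, 2, 1, 2, 1, 2, 3, 3]
CHAIN = build_chain(COLORS)
print(distinct_colors_by_scan(COLORS, CHAIN, 3, 5))  # [1, 2]
```

The range 3..5 is the one reached by pattern aa in the running example, so the query reports documents 1 and 2.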
![Page 13: Document Retrieval Problems](https://reader036.fdocuments.in/reader036/viewer/2022062322/56814889550346895db59ec2/html5/thumbnails/13.jpg)
Document Listing: Our Approach
Position: 1 2 3 4 5 6 7 8 9 10 11
Chain: -1 -1 1 2 4 3 5 6 7 -1 10
List numbers less than 3.
R = [l,r]. Find all integers smaller than x in A[l,r]:
1. Perform rangemin(R) to determine i such that A[i] is smallest in A[l,r].
2. If A[i] is smaller than x, report A[i] and recurse on A[l,i-1] and A[i+1,r]; otherwise stop.
O(1) time per rangemin query, so O(output) time overall.
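A minimal Python sketch of this recursion follows. It assumes a sparse table for the range-minimum structure, which gives O(1) queries but O(N log N) preprocessing rather than the O(N) bound of the talk. Indices are 0-based inside the helpers, and all names are hypothetical.

```python
def build_sparse_table(a):
    """table[j][i] = index of a minimum of a[i : i + 2**j]."""
    n = len(a)
    table = [list(range(n))]
    j = 1
    while (1 << j) <= n:
        prev, half = table[j - 1], 1 << (j - 1)
        row = []
        for i in range(n - (1 << j) + 1):
            left, right = prev[i], prev[i + half]
            row.append(left if a[left] <= a[right] else right)
        table.append(row)
        j += 1
    return table


def range_min(a, table, l, r):
    """Index of a minimum of a[l..r] (inclusive) with two table lookups."""
    j = (r - l + 1).bit_length() - 1
    left, right = table[j][l], table[j][r - (1 << j) + 1]
    return left if a[left] <= a[right] else right


def report_smaller(a, table, l, r, x):
    """Indices i in [l, r] with a[i] < x, using O(1 + output) range-min calls."""
    out, stack = [], [(l, r)]
    while stack:
        lo, hi = stack.pop()
        if lo > hi:
            continue
        i = range_min(a, table, lo, hi)
        if a[i] < x:  # otherwise nothing in a[lo..hi] qualifies
            out.append(i)
            stack.append((lo, i - 1))
            stack.append((i + 1, hi))
    return sorted(out)


def list_distinct_colors(colors, chain, table, l, r):
    """Distinct colors in 1-based positions l..r via the chain array."""
    return [colors[i] for i in report_smaller(chain, table, l - 1, r - 1, l)]


COLORS = [1, 2, 1, 2, 2, 1, 2, 1, 2, 3, 3]
CHAIN, _last = [], {}
for _pos, _c in enumerate(COLORS, start=1):
    CHAIN.append(_last.get(_c, -1))  # previous position of the same color
    _last[_c] = _pos
TABLE = build_sparse_table(CHAIN)
print(list_distinct_colors(COLORS, CHAIN, TABLE, 3, 5))  # [1, 2]
```

Each range-min call either reports a new color or ends a branch, and every reported color spawns at most two further calls, which is where the output-sensitive bound comes from.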
![Page 14: Document Retrieval Problems](https://reader036.fdocuments.in/reader036/viewer/2022062322/56814889550346895db59ec2/html5/thumbnails/14.jpg)
Document Listing: Summary
• Given a set of documents of total size N, document listing problem can be solved in
– O(N) time and space for preprocessing, and
– O(m + output) time for a query of size m.
– Uses Weiner’s O(N) time suffix tree construction.
• Overview of techniques
– Reduce the problem to colored range searching.
– “Chain” occurrences of suffixes from each document.
Necessity is not necessarily the mother of invention. Ruth Benedict in Patterns of Culture.
Muthu: Now, let us get started with fun stuff.
![Page 15: Document Retrieval Problems](https://reader036.fdocuments.in/reader036/viewer/2022062322/56814889550346895db59ec2/html5/thumbnails/15.jpg)
Document Mining
• Find all documents that contain at least K occurrences of given pattern.
Find colors that appear at least K times in this range.
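As a reference point for the mining query, here is a naive Python baseline. It is a sketch of the problem statement only, not of the talk's data structure: it counts every position in the range, so its cost depends on the range length rather than on the output size. The function name is hypothetical.

```python
from collections import Counter


def colors_with_at_least_k(colors, l, r, k):
    """Colors occurring at least k times in 1-based positions l..r."""
    counts = Counter(colors[l - 1:r])
    return sorted(color for color, n in counts.items() if n >= k)


# Leaf colors of the running example; positions 3..5 are the leaves below aa.
COLORS = [1, 2, 1, 2, 2, 1, 2, 1, 2, 3, 3]
print(colors_with_at_least_k(COLORS, 3, 5, 2))  # [2]
```

In the running example only document 2 (aabaa) contains aa at least twice, which matches the result.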
![Page 16: Document Retrieval Problems](https://reader036.fdocuments.in/reader036/viewer/2022062322/56814889550346895db59ec2/html5/thumbnails/16.jpg)
Document Mining: First Approach
• Fix K.
Chain to the Kth occurrence of red to the left.
Given range [l,r], determine all numbers in A[l,r] that are less than l.
Does not work: output * K
Yesterday it worked
Today it is not working
Windows is like that.
![Page 17: Document Retrieval Problems](https://reader036.fdocuments.in/reader036/viewer/2022062322/56814889550346895db59ec2/html5/thumbnails/17.jpg)
Document Mining: Second Approach
• Given a set of colored intervals to be preprocessed, query is some interval I and we must determine the distinct colored intervals that are contained in I.
Chain to the Kth occurrence of red to the left. Replace by red intervals.
No optimal results known
![Page 18: Document Retrieval Problems](https://reader036.fdocuments.in/reader036/viewer/2022062322/56814889550346895db59ec2/html5/thumbnails/18.jpg)
Document Mining: Fixed K
Mark Least Common Ancestor (L,R) with red color.
(Figure: leaves L and R in the suffix tree.)
Each query: Find the set of distinct colors in a subtree.
O(N) preprocessing, O(m + output) time per query.
![Page 19: Document Retrieval Problems](https://reader036.fdocuments.in/reader036/viewer/2022062322/56814889550346895db59ec2/html5/thumbnails/19.jpg)
Document Mining: Variable K
• K is part of the query: o(NK) preprocessing?
(Figure: occurrences numbered 1, 2, 3, …, K, K+1, K+2, …, 2K-1.)
• For a fixed K, all LCAs lie in paths separated by K occurrences.
• Suffices to keep the lowest in each path.
muthu: that deserves the hot pink.
![Page 20: Document Retrieval Problems](https://reader036.fdocuments.in/reader036/viewer/2022062322/56814889550346895db59ec2/html5/thumbnails/20.jpg)
Document Mining: Variable K
• For a fixed K, find the lowest LCA on each of the paths separated by O(K) occurrences of each document.
• Preprocessing time: bin searching paths.
• Query processing in O(m + output) time.
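(Preprocessing-time formula not recoverable from the extraction; the surviving fragments involve N, K, log factors and a big-O bound.)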
![Page 21: Document Retrieval Problems](https://reader036.fdocuments.in/reader036/viewer/2022062322/56814889550346895db59ec2/html5/thumbnails/21.jpg)
Summary
• Solving other document listing problems.
– Optimal for negative query: list absent colors.
– (Near) optimal for proximity repeats: structural properties of “gaps.”
– Best known for two patterns: breaking the quadratic preprocessing bottleneck.
• Techniques: Chaining, Colored range queries (7+ such problems in the paper), Combinatorial structure.
Muthu: Solving these colored range searching problems is of independent interest….
muthu:
Hope that whetted your appetite for algorithmics.
![Page 22: Document Retrieval Problems](https://reader036.fdocuments.in/reader036/viewer/2022062322/56814889550346895db59ec2/html5/thumbnails/22.jpg)
Discussion
• “non” local chaining?
– Find documents in which no two occurrences of the pattern are within distance K. OPEN
• Try it in IPScope: Interactive Patents Analysis System.