Efficient Processing of Complex Features for Information Retrieval Dissertation by Trevor Strohman...
![Page 1: Efficient Processing of Complex Features for Information Retrieval Dissertation by Trevor Strohman Presented by Laura C. Vandivier For ITCS6050, UNCC,](https://reader036.fdocuments.in/reader036/viewer/2022062421/56649d5f5503460f94a405a5/html5/thumbnails/1.jpg)
Efficient Processing of Complex Features for Information Retrieval
Dissertation by Trevor Strohman
Presented by Laura C. Vandivier
For ITCS6050, UNCC, Fall 2008
![Page 2: Efficient Processing of Complex Features for Information Retrieval Dissertation by Trevor Strohman Presented by Laura C. Vandivier For ITCS6050, UNCC,](https://reader036.fdocuments.in/reader036/viewer/2022062421/56649d5f5503460f94a405a5/html5/thumbnails/2.jpg)
Overview
• Indexing
• Ranking
• Query Expansion
• Query Evaluation
• Tupleflow
![Page 3: Efficient Processing of Complex Features for Information Retrieval Dissertation by Trevor Strohman Presented by Laura C. Vandivier For ITCS6050, UNCC,](https://reader036.fdocuments.in/reader036/viewer/2022062421/56649d5f5503460f94a405a5/html5/thumbnails/3.jpg)
Topics Not Covered
• Binned Probabilities
• Score-Sorted Index Optimization
• Document-Sorted Index Optimization
• Navigational Search with Complex Features
![Page 4: Efficient Processing of Complex Features for Information Retrieval Dissertation by Trevor Strohman Presented by Laura C. Vandivier For ITCS6050, UNCC,](https://reader036.fdocuments.in/reader036/viewer/2022062421/56649d5f5503460f94a405a5/html5/thumbnails/4.jpg)
Document Indexing
• Inverted List: A mapping from a single word to the set of documents that contain the word
• Inverted Index: A set of inverted lists
![Page 5: Efficient Processing of Complex Features for Information Retrieval Dissertation by Trevor Strohman Presented by Laura C. Vandivier For ITCS6050, UNCC,](https://reader036.fdocuments.in/reader036/viewer/2022062421/56649d5f5503460f94a405a5/html5/thumbnails/5.jpg)
Inverted Index
• Contain one inverted list for each term in the document collection
• Often omit frequently occurring words such as “a,” “and” and “the.”
![Page 6: Efficient Processing of Complex Features for Information Retrieval Dissertation by Trevor Strohman Presented by Laura C. Vandivier For ITCS6050, UNCC,](https://reader036.fdocuments.in/reader036/viewer/2022062421/56649d5f5503460f94a405a5/html5/thumbnails/6.jpg)
Inverted Index Example
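Sample Documents
1. Cats, dogs, dogs.
2. Dogs, cats, sheep.
3. Whales, sheep, goats.
4. Fish, whales, whales.

Inverted Index

| Term | Documents |
|---|---|
| cats | 1, 2 |
| dogs | 1, 2 |
| fish | 4 |
| goats | 3 |
| sheep | 2, 3 |
| whales | 3, 4 |

| Query | Answer |
|---|---|
| cats | 1, 2 |
| sheep + dogs | 2 |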
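The example above can be reproduced with a few lines of Python. This is a minimal sketch, not the dissertation's implementation: the names `DOCS`, `tokenize`, `build_inverted_index`, and `query_and` are illustrative, and "sheep + dogs" is interpreted as requiring both terms.

```python
from collections import defaultdict
import re

# Sample documents from the slide, keyed by document number.
DOCS = {
    1: "Cats, dogs, dogs.",
    2: "Dogs, cats, sheep.",
    3: "Whales, sheep, goats.",
    4: "Fish, whales, whales.",
}

def tokenize(text):
    """Lowercase the text and split it into alphabetic words."""
    return re.findall(r"[a-z]+", text.lower())

def build_inverted_index(docs):
    """Map each term to the sorted list of documents that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in tokenize(text):
            index[term].add(doc_id)
    return {term: sorted(ids) for term, ids in index.items()}

def query_and(index, terms):
    """Return the documents that contain every query term."""
    postings = [set(index.get(term, [])) for term in terms]
    return sorted(set.intersection(*postings)) if postings else []

index = build_inverted_index(DOCS)
print(index["cats"])                        # [1, 2]
print(query_and(index, ["sheep", "dogs"]))  # [2]
```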
![Page 7: Efficient Processing of Complex Features for Information Retrieval Dissertation by Trevor Strohman Presented by Laura C. Vandivier For ITCS6050, UNCC,](https://reader036.fdocuments.in/reader036/viewer/2022062421/56649d5f5503460f94a405a5/html5/thumbnails/7.jpg)
Expanding Inverted Indexes
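• Include term frequency
More occurrences of a term suggest the document is “about” that term

Each entry is (document, count):

| Term | Postings |
|---|---|
| cats | (1,1) (2,1) |
| dogs | (1,2) (2,1) |
| fish | (4,1) |
| goats | (3,1) |
| sheep | (2,1) (3,1) |
| whales | (3,1) (4,2) |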
![Page 8: Efficient Processing of Complex Features for Information Retrieval Dissertation by Trevor Strohman Presented by Laura C. Vandivier For ITCS6050, UNCC,](https://reader036.fdocuments.in/reader036/viewer/2022062421/56649d5f5503460f94a405a5/html5/thumbnails/8.jpg)
Expanding Inverted Indexes (cont.)
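• Add word position information
Facilitates phrase searching

Each entry is (document, count): word positions, using the sample documents from the earlier example:

| Term | Postings |
|---|---|
| cats | (1,1): 1 · (2,1): 2 |
| dogs | (1,2): 2,3 · (2,1): 1 |
| fish | (4,1): 1 |
| goats | (3,1): 3 |
| sheep | (2,1): 3 · (3,1): 2 |
| whales | (3,1): 1 · (4,2): 2,3 |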
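To illustrate why positions facilitate phrase searching, here is a small sketch of a positional index over the same four sample documents. The function names are hypothetical, positions are 1-based word offsets, and the term count is simply the length of each position list.

```python
from collections import defaultdict
import re

DOCS = {
    1: "Cats, dogs, dogs.",
    2: "Dogs, cats, sheep.",
    3: "Whales, sheep, goats.",
    4: "Fish, whales, whales.",
}

def build_positional_index(docs):
    """Map term -> {doc_id: [1-based word positions]}."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        words = re.findall(r"[a-z]+", text.lower())
        for pos, term in enumerate(words, start=1):
            index[term].setdefault(doc_id, []).append(pos)
    return dict(index)

def phrase_search(index, phrase):
    """Return documents where the phrase words occur at consecutive positions."""
    terms = phrase.lower().split()
    if not terms or any(term not in index for term in terms):
        return []
    hits = []
    for doc_id, starts in index[terms[0]].items():
        for start in starts:
            # Word k of the phrase must sit exactly k positions after the first.
            if all(start + k in index[t].get(doc_id, [])
                   for k, t in enumerate(terms)):
                hits.append(doc_id)
                break
    return sorted(hits)

pindex = build_positional_index(DOCS)
print(pindex["dogs"])                        # {1: [2, 3], 2: [1]}
print(phrase_search(pindex, "cats sheep"))   # [2]
print(phrase_search(pindex, "sheep cats"))   # []
```

A count-only index could say that document 2 contains both "cats" and "sheep", but not that they are adjacent and in that order.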
![Page 9: Efficient Processing of Complex Features for Information Retrieval Dissertation by Trevor Strohman Presented by Laura C. Vandivier For ITCS6050, UNCC,](https://reader036.fdocuments.in/reader036/viewer/2022062421/56649d5f5503460f94a405a5/html5/thumbnails/9.jpg)
Inverted Index Statistics
• Compressed inverted indexes containing only word counts
– 5% of the document collection in size
– Built and queried faster
• Compressed inverted indexes containing word counts and positions
– 20% of the document collection in size
– Essential for high effectiveness, even in queries not using phrases
![Page 10: Efficient Processing of Complex Features for Information Retrieval Dissertation by Trevor Strohman Presented by Laura C. Vandivier For ITCS6050, UNCC,](https://reader036.fdocuments.in/reader036/viewer/2022062421/56649d5f5503460f94a405a5/html5/thumbnails/10.jpg)
Document Ranking
• Documents returned in order of relevance
• Perfect ranking impossible
• Retrieval systems calculate the probability that a document is relevant
![Page 11: Efficient Processing of Complex Features for Information Retrieval Dissertation by Trevor Strohman Presented by Laura C. Vandivier For ITCS6050, UNCC,](https://reader036.fdocuments.in/reader036/viewer/2022062421/56649d5f5503460f94a405a5/html5/thumbnails/11.jpg)
Computing Relevance
• Assume “bag of words” with term independence
• Simple estimation: # occurrences / document length
• Problems
1. If a document does not contain all words of a multi-word query, it will not be retrieved: a document containing none of the query words scores the same as a document containing some of them.
2. All words are treated equally. For the query “Maltese falcon”, document(maltese:2, falcon:1) = document(maltese:1, falcon:2) for documents of similar length.
• Smoothing can help
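The first problem, and how smoothing addresses it, can be seen in a short sketch. The unsmoothed score multiplies the simple estimate (# occurrences / document length) across query terms. The smoothed version mixes in a collection-wide estimate by linear interpolation; that particular smoothing method, the weight `lam=0.5`, and the toy documents are assumptions for illustration, not something the slides specify.

```python
from collections import Counter

def tokens(text):
    return text.lower().split()

def unsmoothed_score(query, doc):
    """Product of (# occurrences / document length) over the query terms."""
    counts, length = Counter(doc), len(doc)
    score = 1.0
    for term in query:
        score *= counts[term] / length
    return score

def smoothed_score(query, doc, collection, lam=0.5):
    """Interpolate the document estimate with a collection-wide estimate.
    Linear interpolation and lam=0.5 are illustrative assumptions."""
    counts, length = Counter(doc), len(doc)
    coll_counts, coll_length = Counter(collection), len(collection)
    score = 1.0
    for term in query:
        p_doc = counts[term] / length
        p_coll = coll_counts[term] / coll_length
        score *= (1 - lam) * p_doc + lam * p_coll
    return score

query = tokens("maltese falcon")
doc_none = tokens("the cat sat on the mat")          # contains no query words
doc_some = tokens("the maltese dog sat on the mat")  # contains one query word
collection = doc_none + doc_some + tokens("the maltese falcon is a film")

# Without smoothing, both documents score zero and cannot be told apart.
print(unsmoothed_score(query, doc_none) == unsmoothed_score(query, doc_some))  # True
# With smoothing, the partial match ranks above the non-match.
print(smoothed_score(query, doc_some, collection)
      > smoothed_score(query, doc_none, collection))                           # True
```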
![Page 12: Efficient Processing of Complex Features for Information Retrieval Dissertation by Trevor Strohman Presented by Laura C. Vandivier For ITCS6050, UNCC,](https://reader036.fdocuments.in/reader036/viewer/2022062421/56649d5f5503460f94a405a5/html5/thumbnails/12.jpg)
Computing Relevance (cont.)
• Add additional features
– Position/field in document, e.g. title
– Proximity of query terms
– Combinations
![Page 13: Efficient Processing of Complex Features for Information Retrieval Dissertation by Trevor Strohman Presented by Laura C. Vandivier For ITCS6050, UNCC,](https://reader036.fdocuments.in/reader036/viewer/2022062421/56649d5f5503460f94a405a5/html5/thumbnails/13.jpg)
Computing Relevance (cont.)
Add query-independent information
• # links from other documents
• URL depth
Shorter = general, longer = specific
• User clicks
May match expectations but not relevance
• Dwell time
• Document quality models
Unusual term distribution implies poor grammar, so the document is not a good retrieval candidate
![Page 14: Efficient Processing of Complex Features for Information Retrieval Dissertation by Trevor Strohman Presented by Laura C. Vandivier For ITCS6050, UNCC,](https://reader036.fdocuments.in/reader036/viewer/2022062421/56649d5f5503460f94a405a5/html5/thumbnails/14.jpg)
Query Expansion
Stemming
Groups words that express the same concept based on natural language rules. Ex.: run, runs, running, ran
• Aggressive Stemmer
May group words that are not related. Ex.: marine, marinate
• Conservative Stemmer
May fail to group words that are related. Ex.: run, ran
• Statistical Stemmer
Uses word co-occurrence data to determine whether words are related. Would probably avoid the marine, marinate mistake.
![Page 15: Efficient Processing of Complex Features for Information Retrieval Dissertation by Trevor Strohman Presented by Laura C. Vandivier For ITCS6050, UNCC,](https://reader036.fdocuments.in/reader036/viewer/2022062421/56649d5f5503460f94a405a5/html5/thumbnails/15.jpg)
Query Expansion (cont.)
Synonyms
Group terms that express the same concept
• Problem
May be different depending on context
– US: president = head of state = commander in chief
– UK: prime minister = head of state
– Corporation: president = chief executive (maybe)
• Solutions
– Include synonyms in query but prefer term matches
– Use context from the whole query: “president of canada” → “prime minister”
![Page 16: Efficient Processing of Complex Features for Information Retrieval Dissertation by Trevor Strohman Presented by Laura C. Vandivier For ITCS6050, UNCC,](https://reader036.fdocuments.in/reader036/viewer/2022062421/56649d5f5503460f94a405a5/html5/thumbnails/16.jpg)
Query Expansion (cont.)
Relevance Feedback
User selects relevant documents and they are used to find similar documents.

Pseudo Relevance Feedback
System assumes the first few documents retrieved are relevant and uses them to search for more. No user involvement, so not as precise.
![Page 17: Efficient Processing of Complex Features for Information Retrieval Dissertation by Trevor Strohman Presented by Laura C. Vandivier For ITCS6050, UNCC,](https://reader036.fdocuments.in/reader036/viewer/2022062421/56649d5f5503460f94a405a5/html5/thumbnails/17.jpg)
Evaluation
• Effectiveness
• Efficiency
![Page 18: Efficient Processing of Complex Features for Information Retrieval Dissertation by Trevor Strohman Presented by Laura C. Vandivier For ITCS6050, UNCC,](https://reader036.fdocuments.in/reader036/viewer/2022062421/56649d5f5503460f94a405a5/html5/thumbnails/18.jpg)
Effectiveness
• Precision
# of relevant results / # results
• Success
Whether the first document was relevant
• Recall
# relevant docs found / # relevant docs that exist
• Mean Average Precision (MAP)
Average precision over all relevant documents
• Normalized Discounted Cumulative Gain (NDCG)
Calculates using sum over result ranks
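The three simplest measures translate directly into code. The sketch below assumes a ranked list of document identifiers and a set of known relevant documents; the identifiers are made up for illustration.

```python
def precision(ranked, relevant):
    """# of relevant results / # of results."""
    if not ranked:
        return 0.0
    return sum(1 for doc in ranked if doc in relevant) / len(ranked)

def recall(ranked, relevant):
    """# of relevant docs found / # of relevant docs that exist."""
    if not relevant:
        return 0.0
    return sum(1 for doc in set(ranked) if doc in relevant) / len(relevant)

def success(ranked, relevant):
    """Whether the first returned document was relevant."""
    return bool(ranked) and ranked[0] in relevant

ranked = ["d3", "d1", "d7", "d2"]   # hypothetical system output, best first
relevant = {"d1", "d2", "d5"}       # hypothetical relevance judgments

print(precision(ranked, relevant))  # 0.5
print(success(ranked, relevant))    # False
```

Here two of the four results are relevant (precision 0.5), two of the three relevant documents were found (recall 2/3), and the top result is not relevant.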
![Page 19: Efficient Processing of Complex Features for Information Retrieval Dissertation by Trevor Strohman Presented by Laura C. Vandivier For ITCS6050, UNCC,](https://reader036.fdocuments.in/reader036/viewer/2022062421/56649d5f5503460f94a405a5/html5/thumbnails/19.jpg)
Calculating MAP
Assume a retrieval set of 10 documents in which the documents at ranks 1, 5, 7, 8 and 10 are relevant.

| Rank | Precision |
|---|---|
| 1 | 1/1 = 1 |
| 5 | 2/5 = .4 |
| 7 | 3/7 = .43 |
| 8 | 4/8 = .5 |
| 10 | 5/10 = .5 |

If there were only 5 relevant documents, then (1 + .4 + .43 + .5 + .5) / 5 = .57

If we retrieved only 5 of 6 relevant documents, then (1 + .4 + .43 + .5 + .5) / 6 = .47
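The calculation can be checked with a short function that works from exact fractions (for example 2/5 = 0.4 at rank 5). This is a sketch for a single query; MAP is then the mean of this value over a set of queries. The function name and signature are illustrative.

```python
def average_precision(relevant_ranks, total_relevant):
    """Average the precision at each rank where a relevant document appears.
    Relevant documents that were never retrieved contribute zero."""
    ranks = sorted(relevant_ranks)
    # The k-th relevant document (1-based) found at rank r gives precision k / r.
    precisions = [(k + 1) / rank for k, rank in enumerate(ranks)]
    return sum(precisions) / total_relevant

ranks = [1, 5, 7, 8, 10]
print(round(average_precision(ranks, 5), 2))  # 0.57
print(round(average_precision(ranks, 6), 2))  # 0.47
```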
![Page 20: Efficient Processing of Complex Features for Information Retrieval Dissertation by Trevor Strohman Presented by Laura C. Vandivier For ITCS6050, UNCC,](https://reader036.fdocuments.in/reader036/viewer/2022062421/56649d5f5503460f94a405a5/html5/thumbnails/20.jpg)
NDCG
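• Uses graded relevance values rather than a binary relevant/not relevant judgment, with 0 being not relevant and 4 being most relevant.
• Calculated as $N \sum_{i} \frac{2^{r(i)} - 1}{\log(1 + i)}$, where $i$ is the rank, $r(i)$ is the relevance value at that rank, and $N$ is a normalization constant.

Example: three ranked result lists over ranks i = 1 to 20, marked relevant or not relevant at each rank in the original figure.

| Result list | MAP | NDCG |
|---|---|---|
| 1 | 1.00 | 1.00 |
| 2 | .51 | .79 |
| 3 | .33 | .55 |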
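A sketch of the NDCG formula follows. It assumes the normalization constant is one over the score of the ideal (best possible) ordering of the same judgments, a common convention that makes a perfect ranking score 1.0; with that choice the base of the logarithm cancels out. The relevance lists are invented for illustration.

```python
import math

def dcg(relevances):
    """Sum of (2^r(i) - 1) / log(1 + i) over ranks i = 1, 2, ..."""
    return sum((2 ** r - 1) / math.log(1 + i)
               for i, r in enumerate(relevances, start=1))

def ndcg(relevances):
    """Normalize by the DCG of the ideal ordering (assumed choice of N)."""
    ideal = dcg(sorted(relevances, reverse=True))
    return dcg(relevances) / ideal if ideal > 0 else 0.0

print(ndcg([4, 3, 2, 0]))        # 1.0
print(ndcg([0, 2, 3, 4]) < 1.0)  # True
```

Putting the most relevant documents last lowers the score because each gain is divided by a larger log(1 + i).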
![Page 21: Efficient Processing of Complex Features for Information Retrieval Dissertation by Trevor Strohman Presented by Laura C. Vandivier For ITCS6050, UNCC,](https://reader036.fdocuments.in/reader036/viewer/2022062421/56649d5f5503460f94a405a5/html5/thumbnails/21.jpg)
Efficiency
• Throughput
# of queries processed per second. Comparisons must use identical systems.
• Latency
Time between when the user issues a query and the system delivers a response. < 150 ms is considered “instantaneous”.
• Generally, improving one implies worsening the other
![Page 22: Efficient Processing of Complex Features for Information Retrieval Dissertation by Trevor Strohman Presented by Laura C. Vandivier For ITCS6050, UNCC,](https://reader036.fdocuments.in/reader036/viewer/2022062421/56649d5f5503460f94a405a5/html5/thumbnails/22.jpg)
Measuring Efficiency
• Direct
Attempt to create a real-world system and measure statistics. Straightforward but limited to experimenter access.
• Simulation
System operation is simulated in software. Repeatable but only as good as its model.
![Page 23: Efficient Processing of Complex Features for Information Retrieval Dissertation by Trevor Strohman Presented by Laura C. Vandivier For ITCS6050, UNCC,](https://reader036.fdocuments.in/reader036/viewer/2022062421/56649d5f5503460f94a405a5/html5/thumbnails/23.jpg)
Query Evaluation
• Document-at-a-time
Evaluate each term for a document before moving to the next document.
• Term-at-a-time
Evaluate each document for a term before moving to the next term.
![Page 24: Efficient Processing of Complex Features for Information Retrieval Dissertation by Trevor Strohman Presented by Laura C. Vandivier For ITCS6050, UNCC,](https://reader036.fdocuments.in/reader036/viewer/2022062421/56649d5f5503460f94a405a5/html5/thumbnails/24.jpg)
Document-at-a-Time
• Produces complete document scores early so can quickly display partial results.
• Can incrementally fetch the inverted list data so uses less memory.
![Page 25: Efficient Processing of Complex Features for Information Retrieval Dissertation by Trevor Strohman Presented by Laura C. Vandivier For ITCS6050, UNCC,](https://reader036.fdocuments.in/reader036/viewer/2022062421/56649d5f5503460f94a405a5/html5/thumbnails/25.jpg)
Document-at-a-Time Algorithm
    procedure DocumentAtATimeRetrieval(Q)
        L ← Array()
        R ← PriorityQueue()
        for all terms w_i in Q do
            l_i ← InvertedList(w_i)
            L.add( l_i )
        end for
        for all documents D in the collection do
            for all inverted lists l_i in L do
                s_D ← s_D + f(Q,C,w_i)(c(w_i;D))    # Update the document score
            end for
            s_D ← s_D · d(Q,C)(|D|)    # Multiply by a document-dependent factor
            R.add( s_D, D )
        end for
        return the top n results from R
    end procedure
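The pseudocode can be turned into a small runnable sketch. Two stand-ins are assumptions made only for illustration: the term scoring function f is replaced by the raw term count, and the document-dependent factor d by division by document length. `build_index` is a hypothetical helper playing the role of InvertedList.

```python
import heapq
from collections import Counter

DOCS = {
    1: "cats dogs dogs",
    2: "dogs cats sheep",
    3: "whales sheep goats",
    4: "fish whales whales",
}

def build_index(docs):
    """term -> {doc_id: count}; stands in for InvertedList(w)."""
    index = {}
    for doc_id, text in docs.items():
        for term, count in Counter(text.split()).items():
            index.setdefault(term, {})[doc_id] = count
    return index

def document_at_a_time(query, docs, index, n):
    lists = [index.get(term, {}) for term in query]   # L
    results = []                                      # R
    for doc_id, text in docs.items():                 # outer loop: documents
        score = 0.0
        for postings in lists:                        # inner loop: inverted lists
            score += postings.get(doc_id, 0)          # f: raw count (assumed)
        score = score / len(text.split())             # d: length factor (assumed)
        results.append((score, doc_id))               # score is complete here
    return heapq.nlargest(n, results)

index = build_index(DOCS)
top = document_at_a_time(["dogs", "cats"], DOCS, index, 2)
print([doc_id for _, doc_id in top])  # [1, 2]
```

Each document's score is final as soon as its inner loop ends, which is what allows partial results to be shown early.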
![Page 26: Efficient Processing of Complex Features for Information Retrieval Dissertation by Trevor Strohman Presented by Laura C. Vandivier For ITCS6050, UNCC,](https://reader036.fdocuments.in/reader036/viewer/2022062421/56649d5f5503460f94a405a5/html5/thumbnails/26.jpg)
Term-at-a-Time
• Does not jump between inverted lists so saves branching.
• Inner loop iterates over documents so is executed for a long time, thus is easier to optimize.
• Efficient query processing strategies have been developed for term-at-a-time.
• Preferred for efficient system implementation.
![Page 27: Efficient Processing of Complex Features for Information Retrieval Dissertation by Trevor Strohman Presented by Laura C. Vandivier For ITCS6050, UNCC,](https://reader036.fdocuments.in/reader036/viewer/2022062421/56649d5f5503460f94a405a5/html5/thumbnails/27.jpg)
Term-at-a-Time Algorithm

    procedure TermAtATimeRetrieval(Q)
        A ← HashTable()
        for all terms w_i in Q do
            l_i ← InvertedList(w_i)
            for all documents D in l_i do
                A[D] ← A[D] + f(Q,C,w_i)(c(w_i;D))    # Update the accumulator for D
            end for
        end for
        R ← PriorityQueue()
        for all accumulators A[D] in A do
            s_D ← A[D] · d(Q,C)(|D|)    # Normalize the accumulator value
            R.add( s_D, D )
        end for
        return the top n results from R
    end procedure
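A matching sketch of term-at-a-time evaluation is shown below, under the same illustrative assumptions: the raw term count stands in for f, division by document length stands in for d, and `build_index` is a hypothetical helper. The accumulator table A is a dictionary keyed by document.

```python
import heapq
from collections import Counter, defaultdict

DOCS = {
    1: "cats dogs dogs",
    2: "dogs cats sheep",
    3: "whales sheep goats",
    4: "fish whales whales",
}

def build_index(docs):
    """term -> {doc_id: count}; stands in for InvertedList(w)."""
    index = {}
    for doc_id, text in docs.items():
        for term, count in Counter(text.split()).items():
            index.setdefault(term, {})[doc_id] = count
    return index

def term_at_a_time(query, doc_lengths, index, n):
    accumulators = defaultdict(float)                 # A
    for term in query:                                # outer loop: terms
        for doc_id, count in index.get(term, {}).items():
            accumulators[doc_id] += count             # f: raw count (assumed)
    # Scores are only final after every term has been processed.
    results = [(score / doc_lengths[doc_id], doc_id)  # d: length factor (assumed)
               for doc_id, score in accumulators.items()]
    return heapq.nlargest(n, results)

index = build_index(DOCS)
lengths = {doc_id: len(text.split()) for doc_id, text in DOCS.items()}
top = term_at_a_time(["dogs", "cats"], lengths, index, 2)
print([doc_id for _, doc_id in top])  # [1, 2]
```

Only documents that appear in at least one inverted list receive an accumulator, and each list is read straight through without jumping between lists.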
![Page 28: Efficient Processing of Complex Features for Information Retrieval Dissertation by Trevor Strohman Presented by Laura C. Vandivier For ITCS6050, UNCC,](https://reader036.fdocuments.in/reader036/viewer/2022062421/56649d5f5503460f94a405a5/html5/thumbnails/28.jpg)
Optimization Types
• Unoptimized
• Unsafe
• Set Safe
• Rank Safe
• Score Safe
![Page 29: Efficient Processing of Complex Features for Information Retrieval Dissertation by Trevor Strohman Presented by Laura C. Vandivier For ITCS6050, UNCC,](https://reader036.fdocuments.in/reader036/viewer/2022062421/56649d5f5503460f94a405a5/html5/thumbnails/29.jpg)
Unoptimized
• Compare the query to each document and calculate the score.
• Sort the documents. Documents with the same score may appear in any order.
• Return results in ranked order. “Top k documents” could be different.
![Page 30: Efficient Processing of Complex Features for Information Retrieval Dissertation by Trevor Strohman Presented by Laura C. Vandivier For ITCS6050, UNCC,](https://reader036.fdocuments.in/reader036/viewer/2022062421/56649d5f5503460f94a405a5/html5/thumbnails/30.jpg)
Optimized
• Unsafe
Documents returned have no guaranteed set of properties.
• Set Safe
Documents are guaranteed to be in the result set but may not be in the same order as the unoptimized results.
• Rank Safe
Documents are guaranteed to be in the result set and in the correct order, but document scores may not be the same as the unoptimized results.
• Score Safe
Documents are guaranteed to be in the result set and have the same scores as the unoptimized results.
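One way to read these levels is as progressively stricter comparisons against the unoptimized output. The hypothetical checker below compares the results of a single query; it can only show which level an optimization is consistent with on that query, since a real guarantee has to hold for every query. It also ignores the freedom to reorder tied scores.

```python
def classify(unoptimized, optimized):
    """Compare two ranked lists of (doc_id, score) pairs for one query."""
    same_set = {d for d, _ in unoptimized} == {d for d, _ in optimized}
    same_order = [d for d, _ in unoptimized] == [d for d, _ in optimized]
    same_scores = unoptimized == optimized
    if same_scores:
        return "score safe"   # same documents, same order, same scores
    if same_order:
        return "rank safe"    # same documents and order, scores differ
    if same_set:
        return "set safe"     # same documents, order differs
    return "unsafe"           # result sets differ

baseline = [("d1", 3.0), ("d2", 2.0), ("d3", 1.0)]
print(classify(baseline, [("d1", 0.9), ("d2", 0.5), ("d3", 0.1)]))  # rank safe
print(classify(baseline, [("d2", 2.0), ("d1", 3.0), ("d3", 1.0)]))  # set safe
print(classify(baseline, [("d1", 3.0), ("d2", 2.0), ("d9", 1.0)]))  # unsafe
```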
![Page 31: Efficient Processing of Complex Features for Information Retrieval Dissertation by Trevor Strohman Presented by Laura C. Vandivier For ITCS6050, UNCC,](https://reader036.fdocuments.in/reader036/viewer/2022062421/56649d5f5503460f94a405a5/html5/thumbnails/31.jpg)
Tupleflow
Distributed computing framework for indexing.
• Flexibility
Settings made in parameter files, no code changes required
• Scalability
Independent tasks spread across processors
• Disk abstraction
Streaming data model
• Low abstraction penalty
Code handles custom hashing, sorting and serialization
![Page 32: Efficient Processing of Complex Features for Information Retrieval Dissertation by Trevor Strohman Presented by Laura C. Vandivier For ITCS6050, UNCC,](https://reader036.fdocuments.in/reader036/viewer/2022062421/56649d5f5503460f94a405a5/html5/thumbnails/32.jpg)
Traditional Indexing Approach
Create a word occurrence model by counting the unique terms in each document.
• Serial processing
Parse one document, move to the next
• Large memory requirements for unique word hash over large document set
Words, misspellings, numbers, URLs, etc.
• Different code required for each document type
Documents, web pages, databases, etc.
![Page 33: Efficient Processing of Complex Features for Information Retrieval Dissertation by Trevor Strohman Presented by Laura C. Vandivier For ITCS6050, UNCC,](https://reader036.fdocuments.in/reader036/viewer/2022062421/56649d5f5503460f94a405a5/html5/thumbnails/33.jpg)
Tupleflow Approach
Break processing into steps
• Count terms (countsMaker)
• Sort terms
• Combine counts (countsReducer)
![Page 34: Efficient Processing of Complex Features for Information Retrieval Dissertation by Trevor Strohman Presented by Laura C. Vandivier For ITCS6050, UNCC,](https://reader036.fdocuments.in/reader036/viewer/2022062421/56649d5f5503460f94a405a5/html5/thumbnails/34.jpg)
Tupleflow Example
Input: “The cat in the hat.”

| countsMaker (Word, Count) | sort (Word, Count) | countsReducer (Word, Count) |
|---|---|---|
| the, 1 | cat, 1 | cat, 1 |
| cat, 1 | hat, 1 | hat, 1 |
| in, 1 | in, 1 | in, 1 |
| the, 1 | the, 1 | the, 2 |
| hat, 1 | the, 1 | |
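The three stages can be mimicked with plain Python functions. This is not the Tupleflow API; the function names simply mirror the stage names on the slide, and each stage consumes the tuples produced by the one before it.

```python
import re
from itertools import groupby

def counts_maker(text):
    """Emit one (word, 1) tuple per word occurrence, in document order."""
    return [(word, 1) for word in re.findall(r"[a-z]+", text.lower())]

def sort_tuples(tuples):
    """Sort by word so that equal words become adjacent."""
    return sorted(tuples, key=lambda t: t[0])

def counts_reducer(sorted_tuples):
    """Sum the counts of adjacent tuples for the same word.
    Relies on the input already being sorted by word."""
    return [(word, sum(count for _, count in group))
            for word, group in groupby(sorted_tuples, key=lambda t: t[0])]

made = counts_maker("The cat in the hat.")
reduced = counts_reducer(sort_tuples(made))
print(reduced)  # [('cat', 1), ('hat', 1), ('in', 1), ('the', 2)]
```

Because the reducer only ever looks at adjacent tuples, it needs no hash table of every unique word, which is the memory cost the traditional approach pays.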
![Page 35: Efficient Processing of Complex Features for Information Retrieval Dissertation by Trevor Strohman Presented by Laura C. Vandivier For ITCS6050, UNCC,](https://reader036.fdocuments.in/reader036/viewer/2022062421/56649d5f5503460f94a405a5/html5/thumbnails/35.jpg)
Tupleflow Execution Graph
• Single Processor
filenames → read text → parse text → count words
• Multiple Processors
filenames → three parallel branches of (read text → parse text → count words) → combine counts
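The multiple-processor graph can be sketched as independent per-file pipelines whose partial counts are merged in a final step. Everything here is an assumption for illustration: the in-memory `FILES` dictionary stands in for real files, and a thread pool stands in for the separate processors or machines a real deployment would use.

```python
import re
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

# Hypothetical in-memory "files" so the sketch runs without touching disk.
FILES = {
    "a.txt": "The cat in the hat.",
    "b.txt": "The hat fits the cat.",
    "c.txt": "A cat sat.",
}

def read_text(name):
    return FILES[name]

def parse_text(text):
    return re.findall(r"[a-z]+", text.lower())

def count_words(words):
    return Counter(words)

def pipeline(name):
    """One branch of the graph: read text -> parse text -> count words."""
    return count_words(parse_text(read_text(name)))

def combine_counts(partials):
    """Merge the per-branch counts into one total."""
    total = Counter()
    for partial in partials:
        total.update(partial)
    return total

def run(filenames, workers=3):
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return combine_counts(pool.map(pipeline, filenames))

totals = run(list(FILES))
print(totals["the"], totals["cat"])  # 4 3
```

The branches share no state, so the only coordination point is `combine_counts`, matching the single "combine counts" node in the graph.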
![Page 36: Efficient Processing of Complex Features for Information Retrieval Dissertation by Trevor Strohman Presented by Laura C. Vandivier For ITCS6050, UNCC,](https://reader036.fdocuments.in/reader036/viewer/2022062421/56649d5f5503460f94a405a5/html5/thumbnails/36.jpg)
Summary
Document indexing and querying are time and resource intensive tasks. Optimizing and parallelizing wherever possible is essential to minimize resources and maximize efficiency. Tupleflow is one example of efficient indexing by parallelization.
![Page 37: Efficient Processing of Complex Features for Information Retrieval Dissertation by Trevor Strohman Presented by Laura C. Vandivier For ITCS6050, UNCC,](https://reader036.fdocuments.in/reader036/viewer/2022062421/56649d5f5503460f94a405a5/html5/thumbnails/37.jpg)
Questions?