Search Engines Session 11 LBSC 690 Information Technology.
Search Engines
Session 11
LBSC 690
Information Technology
Muddiest Points
• MySQL
• What’s Joomla for?
• PHP arrays and loops
Agenda
• The search process
• Information retrieval
• Recommender systems
• Evaluation
The Memex Machine
Information Hierarchy
Data → Information → Knowledge → Wisdom
(each level more refined and abstract than the last)
| | IR | Databases |
|---|---|---|
| What we're retrieving | Mostly unstructured. Free text with some metadata. | Structured data. Clear semantics based on a formal model. |
| Queries we're posing | Vague, imprecise information needs (often expressed in natural language). | Formally (mathematically) defined queries. Unambiguous. |
| Results we get | Sometimes relevant, often not. | Exact. Always correct in a formal sense. |
| Interaction with system | Interaction is important. | One-shot queries. |
| Other issues | Effectiveness and usability are critical. | Concurrency, recovery, atomicity are critical. |
Information “Retrieval”
• Find something that you want
  – The information need may or may not be explicit
• Known item search
  – Find the class home page
• Answer seeking
  – Is Lexington or Louisville the capital of Kentucky?
• Directed exploration
  – Who makes videoconferencing systems?
The Big Picture
• The four components of the information retrieval environment:
  – User (user needs)
  – Process
  – System
  – Data

User and process are what we care about; system and data are what computer geeks care about!
Information Retrieval Paradigm

(Diagram: two routes to Document Delivery. Search: Query → Select. Browse: Document → Examine.)
Supporting the Search Process
(Diagram: Source Selection → Query Formulation → Query → Search → Ranked List → Selection → Document → Examination → Document Delivery, with the IR System performing Search. Feedback loops support Query Reformulation and Relevance Feedback, and Source Reselection; the user's steps are annotated Predict, Nominate, and Choose.)
Supporting the Search Process
(Diagram: Source Selection → Query Formulation → Query → Search → Ranked List → Selection → Document → Examination → Document Delivery. On the system side, Acquisition builds the Collection, and Indexing builds the Index that Search uses.)
Human-Machine Synergy
• Machines are good at:
  – Doing simple things accurately and quickly
  – Scaling to larger collections in sublinear time
• People are better at:
  – Accurately recognizing what they are looking for
  – Evaluating intangibles such as “quality”
• Both are pretty bad at:
  – Mapping consistently between words and concepts
Search Component Model
(Diagram: on the query side, an Information Need passes through Query Formulation to a Query, and a Representation Function produces the Query Representation; on the document side, a Representation Function produces the Document Representation from the Document. A Comparison Function maps the two representations to a Retrieval Status Value, which Human Judgment assesses for Utility. The two pipelines are labeled Query Processing and Document Processing.)
Ways of Finding Text
• Searching metadata
  – Using controlled or uncontrolled vocabularies
• Searching content
  – Characterize documents by the words they contain
• Searching behavior
  – User-Item: Find similar users
  – Item-Item: Find items that cause similar reactions
Two Ways of Searching
(Diagram contrasting two paths, each ending in a Retrieval Status Value:)

• Free text: the Author writes the document using terms to convey meaning; the Free-Text Searcher constructs a query from terms that may appear in documents; Content-Based Query-Document Matching compares the Query Terms with the Document Terms.
• Controlled vocabulary: the Indexer chooses appropriate concept descriptors; the Controlled-Vocabulary Searcher constructs a query from available concept descriptors; Metadata-Based Query-Document Matching compares the Query Descriptors with the Document Descriptors.
“Exact Match” Retrieval
• Find all documents with some characteristic
  – Indexed as “Presidents -- United States”
  – Containing the words “Clinton” and “Peso”
  – Read by my boss
• A set of documents is returned
  – Hopefully, not too many or too few
  – Usually listed in date or alphabetical order
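The set-returning behavior of exact-match retrieval can be sketched with an inverted index and set intersection; the index contents below are hypothetical.

```python
# "Exact match" (Boolean) retrieval: intersect posting sets so that
# only documents containing ALL query terms are returned.
# The index contents are hypothetical.
index = {
    "clinton": {1, 3, 4},
    "peso": {3, 4, 7},
}

def boolean_and(index, terms):
    """Return the set of documents that contain every query term."""
    postings = [index.get(t, set()) for t in terms]
    if not postings:
        return set()
    result = set(postings[0])
    for p in postings[1:]:
        result &= p
    return result

matches = boolean_and(index, ["clinton", "peso"])  # {3, 4}
```

Note that the result is an unordered set: nothing in the query says which of documents 3 and 4 should be listed first, which is exactly the limitation ranked retrieval addresses.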
The Perfect Query Paradox
• Every information need has a perfect document set
  – Finding that set is the goal of search
• Every document set has a perfect query
  – AND every word in document 1 to get a query for document 1
  – Repeat for each document in the set
  – OR those document queries together to get the set query
• The problem isn’t the system … it’s the query!
Queries on the Web (1999)
• Low query construction effort
  – 2.35 (often imprecise) terms per query
  – 20% use operators
  – 22% are subsequently modified
• Low browsing effort
  – Only 15% view more than one page
  – Most look only “above the fold”
  – One study showed that 10% don’t know how to scroll!
Types of User Needs
• Informational (30-40% of AltaVista queries)
  – What is a quark?
• Navigational
  – Find the home page of United Airlines
• Transactional
  – Data: What is the weather in Paris?
  – Shopping: Who sells a Vaio Z505RX?
  – Proprietary: Obtain a journal article
Ranked Retrieval
• Put most useful documents near top of a list
  – Possibly useful documents go lower in the list
• Users can read down as far as they like
  – Based on what they read, time available, ...
• Provides useful results from weak queries
  – Untrained users find exact match harder to use
Similarity-Based Retrieval
• Assume “most useful” = most similar to query
• Weight terms based on two criteria:
  – Repeated words are good cues to meaning
  – Rarely used words make searches more selective
• Compare weights with query
  – Add up the weights for each query term
  – Put the documents with the highest total first
Simple Example: Counting Words
Documents:

1: Nuclear fallout contaminated Texas.
2: Information retrieval is interesting.
3: Information retrieval is complicated.

Query: recall and fallout measures for information retrieval

Term-document matrix (1 = the term occurs):

| term | 1 | 2 | 3 | Query |
|---|---|---|---|---|
| nuclear | 1 | | | |
| fallout | 1 | | | 1 |
| Texas | 1 | | | |
| contaminated | 1 | | | |
| interesting | | 1 | | |
| complicated | | | 1 | |
| information | | 1 | 1 | 1 |
| retrieval | | 1 | 1 | 1 |
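A minimal Python sketch of this counting scheme, with tokenization simplified to lowercasing and stripping trailing periods:

```python
from collections import Counter

docs = {
    1: "Nuclear fallout contaminated Texas.",
    2: "Information retrieval is interesting.",
    3: "Information retrieval is complicated.",
}
query = "recall and fallout measures for information retrieval"

def tokenize(text):
    """Very simple tokenizer: lowercase, strip trailing periods."""
    return [w.strip(".").lower() for w in text.split()]

# Score each document by the total count of query terms it contains.
query_terms = set(tokenize(query))
scores = {
    doc_id: sum(n for term, n in Counter(tokenize(text)).items()
                if term in query_terms)
    for doc_id, text in docs.items()
}
ranking = sorted(scores, key=scores.get, reverse=True)
```

Documents 2 and 3 each match two query terms (“information” and “retrieval”) and tie; document 1 matches only “fallout” and ranks last. Breaking such ties is one motivation for the term weighting discussed next.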
Discussion Point: Which Terms to Emphasize?
• Major factors
  – Uncommon terms are more selective
  – Repeated terms provide evidence of meaning
• Adjustments
  – Give more weight to terms in certain positions (title, first paragraph, etc.)
  – Give less weight to each term in longer documents
  – Ignore documents that try to “spam” the index (invisible text, excessive use of the “meta” field, …)
“Okapi” Term Weights
$$ w_{i,j} = \frac{TF_{i,j}}{1.5 \cdot \frac{L_j}{\bar{L}} + 0.5 + TF_{i,j}} \cdot \log\!\left(\frac{N - DF_i + 0.5}{DF_i + 0.5}\right) $$

where $TF_{i,j}$ is the frequency of term $i$ in document $j$, $L_j$ is the length of document $j$, $\bar{L}$ is the average document length, $DF_i$ is the number of documents containing term $i$, and $N$ is the number of documents in the collection. The first factor is the TF component; the second is the IDF component.

(Plots: the Okapi TF component versus raw TF for $L_j/\bar{L}$ = 0.5, 1.0, and 2.0; the classic and Okapi IDF components versus raw DF.)
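A hedged Python sketch of the Okapi-style weight on this slide (a length-normalized TF factor times an RSJ-style IDF factor); the constants 1.5 and 0.5 follow the slide.

```python
import math

def okapi_weight(tf, doc_len, avg_doc_len, df, n_docs):
    """Okapi-style term weight: a length-normalized TF component
    multiplied by an RSJ-style IDF component."""
    tf_component = tf / (1.5 * (doc_len / avg_doc_len) + 0.5 + tf)
    idf_component = math.log((n_docs - df + 0.5) / (df + 0.5))
    return tf_component * idf_component

# The TF component saturates: repeated occurrences add evidence, but
# with diminishing returns. Rarer terms (smaller df) get a larger IDF,
# making them more selective.
```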
Index Quality
• Crawl quality
  – Comprehensiveness, dead links, duplicate detection
• Document analysis
  – Frames, metadata, imperfect HTML, …
• Document extension
  – Anchor text, source authority, category, language, …
• Document restriction (ephemeral text suppression)
  – Banner ads, keyword spam, …
Other Web Search Quality Factors
• Spam suppression
  – “Adversarial information retrieval”
  – Every source of evidence has been spammed (text, queries, links, access patterns, …)
• “Family filter” accuracy
  – Link analysis can be very helpful
Indexing Anchor Text
• A type of “document expansion”
  – Terms near links describe content of the target
• Works even when you can’t index content
  – Image retrieval, uncrawled links, …
Information Retrieval Types
Source: Ayse Goker
Expanding the Search Space
Scanned Docs
Identity: Harriet
“… Later, I learned that John had not heard …”
Page Layer Segmentation

• Document image generation model
  – A document consists of many layers, such as handwriting, machine-printed text, background patterns, tables, figures, noise, etc.
Searching Other Languages
(Diagram: Query Formulation → Query → Query Translation → Translated Query → Search → Ranked List → Selection → Document → Examination → Document → Use, with a Query Reformulation loop. Supporting elements shown: MT, translated “headlines,” and English definitions.)
Speech Retrieval Architecture
(Diagram components: Speech Recognition, Boundary Tagging, Content Tagging, Query Formulation, Automatic Search, Interactive Selection.)
High Payoff Investments
(Plot: the searchable fraction, i.e. words recognized accurately out of words produced, as a function of transducer capabilities, comparing OCR, MT, handwriting, and speech.)
http://www.ctr.columbia.edu/webseek/
Color Histogram Example
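A common way to use color histograms for content-based image retrieval is histogram intersection; the tiny quantization scheme and pixel data below are made up for illustration.

```python
# Compare two images by color histogram intersection.
# Pixels are hypothetical (R, G, B) tuples with channel values 0-255.

def histogram(pixels, bins=4):
    """Quantize each channel into `bins` levels and count colors."""
    h = {}
    for r, g, b in pixels:
        key = (r * bins // 256, g * bins // 256, b * bins // 256)
        h[key] = h.get(key, 0) + 1
    return h

def intersection(h1, h2):
    """Normalized histogram intersection: 1.0 means identical histograms."""
    total = sum(h1.values())
    overlap = sum(min(n, h2.get(color, 0)) for color, n in h1.items())
    return overlap / total

a = [(255, 0, 0)] * 8 + [(0, 0, 255)] * 2  # mostly red, some blue
b = [(255, 0, 0)] * 5 + [(0, 255, 0)] * 5  # half red, half green
```

Here `intersection(histogram(a), histogram(b))` is 0.5: only the five red pixels of `b` overlap with the eight red pixels of `a`. The measure ignores spatial layout entirely, which is both its main weakness and the reason it is cheap to compute.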
Rating-Based Recommendation
• Use ratings to describe objects
  – Personal recommendations, peer review, …
• Beyond topicality:
  – Accuracy, coherence, depth, novelty, style, …
• Has been applied to many modalities
  – Books, Usenet news, movies, music, jokes, beer, …
Using Positive Information

Ratings (letter grades) for attractions; columns: Small World, Space Mtn, Mad Tea Pty, Dumbo, Speedway, Cntry Bear:

Joe: D A B D ? ?
Ellen: A F D F
Mickey: A A A A A A
Goofy: D A C
John: A C A C A
Ben: F A F
Nathan: D A A
Using Negative Information

Ratings (letter grades) for attractions; columns: Small World, Space Mtn, Mad Tea Pty, Dumbo, Speedway, Cntry Bear:

Joe: D A B D ? ?
Ellen: A F D F
Mickey: A A A A A A
Goofy: D A C
John: A C A C A
Ben: F A F
Nathan: D A A
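The prediction idea behind these two slides can be sketched as user-based collaborative filtering. Grades are mapped to numbers; because the column alignment of the sparser rows in the original matrix is unclear, the ratings below are a hypothetical subset chosen for illustration.

```python
# User-based collaborative filtering over letter-grade ratings.
# The ratings data is illustrative, loosely based on the slide's matrix.
GRADES = {"A": 4, "B": 3, "C": 2, "D": 1, "F": 0}

ratings = {  # user -> {attraction: numeric grade}
    "Joe":    {"SmallWorld": 1, "SpaceMtn": 4, "MadTeaPty": 3, "Dumbo": 1},
    "Mickey": {"SmallWorld": 4, "SpaceMtn": 4, "MadTeaPty": 4,
               "Dumbo": 4, "Speedway": 4, "CntryBear": 4},
    "John":   {"SmallWorld": 4, "MadTeaPty": 2, "Dumbo": 4, "Speedway": 2},
}

def agreement(u, v):
    """Similarity: negative mean absolute difference on co-rated items."""
    common = set(u) & set(v)
    if not common:
        return None
    return -sum(abs(u[i] - v[i]) for i in common) / len(common)

def predict(user, item, ratings):
    """Predict a rating from the most similar user who rated the item."""
    candidates = [
        (agreement(ratings[user], r), r[item])
        for name, r in ratings.items()
        if name != user and item in r
           and agreement(ratings[user], r) is not None
    ]
    return max(candidates)[1] if candidates else None
```

With this data, Mickey agrees with Joe more closely than John does, so Joe's missing Speedway rating is predicted from Mickey's. Using negative information means disagreements (like Joe's D against Mickey's A on Small World) also count against a neighbor's influence.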
Problems with Explicit Ratings
• Cognitive load: people don’t like to provide ratings
• Rating sparsity: enough raters are needed before recommendations can be made
• New items cannot be recommended until some users have rated them
Putting It All Together
(Table comparing Free Text, Behavior, and Metadata as sources of evidence along five dimensions: Topicality, Quality, Reliability, Cost, Flexibility.)
Evaluation
• What can be measured that reflects the searcher’s ability to use a system? (Cleverdon, 1966)
  – Coverage of Information
  – Form of Presentation
  – Effort required/Ease of Use
  – Time and Space Efficiency
  – Recall and Precision (together: effectiveness)
Evaluating IR Systems
• User-centered strategy
  – Given several users, and at least 2 retrieval systems
  – Have each user try the same task on both systems
  – Measure which system works the “best”
• System-centered strategy
  – Given documents, queries, and relevance judgments
  – Try several variations on the retrieval system
  – Measure which ranks more good docs near the top
Which is the Best Rank Order?
(Figure: six candidate rank orders, labeled A through F, with relevant documents marked.)
Precision and Recall
• Precision
  – How much of what was found is relevant?
  – Often of interest, particularly for interactive searching
• Recall
  – How much of what is relevant was found?
  – Particularly important for law, patents, and medicine
Measures of Effectiveness

Recall = |Rel ∩ Ret| / |Rel|

Precision = |Rel ∩ Ret| / |Ret|

(Venn diagram: the Relevant and Retrieved sets; RelRet is their intersection.)
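The two measures in code, using Python sets; the document IDs are hypothetical.

```python
def precision_recall(retrieved, relevant):
    """Set-based precision and recall, per the formulas above the figure."""
    rel_ret = retrieved & relevant  # relevant documents that were retrieved
    precision = len(rel_ret) / len(retrieved) if retrieved else 0.0
    recall = len(rel_ret) / len(relevant) if relevant else 0.0
    return precision, recall

# 4 documents retrieved, 2 of them relevant -> precision 0.5;
# 2 of the 5 relevant documents found -> recall 0.4.
p, r = precision_recall({1, 2, 3, 4}, {2, 4, 6, 8, 10})
```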
Precision-Recall Curves
(Plot: precision versus recall, both axes from 0 to 1. Source: Ellen Voorhees, NIST)
Affective Evaluation
• Measure stickiness through frequency of use
  – Non-comparative, long-term
• Key factors (from cognitive psychology):
  – Worst experience
  – Best experience
  – Most recent experience
• Highly variable effectiveness is undesirable
  – Bad experiences are particularly memorable
Example Interfaces
• Google: keyword in context
• Microsoft Live: query refinement suggestions
• Exalead: faceted refinement
• Clusty: clustered results
• Kartoo: cluster visualization
• WebBrain: structure visualization
• Grokker: “map view”
• PubMed: related article search
Summary
• Search is a process engaged in by people
• Human-machine synergy is the key
• Content and behavior offer useful evidence
• Evaluation must consider many factors
Before You Go
On a sheet of paper, answer the following (ungraded) question (no names, please):
What was the muddiest point in today’s class?