Reach me on Twitter: @matifq Email [email protected] 1
INFORMATION RETRIEVALA Look into the Science of Web Search Engines
Muhammad Atif Qureshi
Reach me on Twitter: @matifq Email [email protected]
2
Contents
Story Mode Learning Learning by Imagination Appendix
Reach me on Twitter: @matifq Email [email protected]
3
Story Mode Learning(Borrowed from Prof. Jimmy Lin,
University of Maryland, Scientist in Twitter)
Reach me on Twitter: @matifq Email [email protected]
4
Information Retrieval Systems Information
What is “information”? Retrieval
What do we mean by “retrieval”? What are different types information
needs? Systems
How do computer systems fit into the human information seeking process?
Reach me on Twitter: @matifq Email [email protected]
5
What is Information?
What do you think? There is no “correct” definition Cookie Monster’s definition:
“news or facts about something” Different approaches:
Philosophy Psychology Linguistics Electrical engineering Physics Computer science Information science
Reach me on Twitter: @matifq Email [email protected]
6
Dictionary says…
Oxford English Dictionary information: informing, telling; thing told,
knowledge, items of knowledge, news knowledge: knowing familiarity gained by
experience; person’s range of information; a theoretical or practical understanding of; the sum of what is known
Random House Dictionary information: knowledge communicated or
received concerning a particular fact or circumstance; news
Reach me on Twitter: @matifq Email [email protected]
7
Intuitive Notions
Information must Be something, although the exact nature
(substance, energy, or abstract concept) is not clear;
Be “new”: repetition of previously received messages is not informative
Be “true”: false or counterfactual information is “mis-information”
Be “about” something
Robert M. Losee. (1997) A Discipline Independent Definition of Information. Journal of the American Society for Information Science,
48(3), 254-269.
Reach me on Twitter: @matifq Email [email protected]
8
Three Views of Information
Information as process Information as communication Information as message transmission
and reception
Reach me on Twitter: @matifq Email [email protected]
9
One View
Information = characteristics of the output of a process Tells us something about the process and the
input
Information-generating process do not occur in isolation
ProcessInput
Input
Input
Output
Output
Output
Process1 Process2Input Output…
Reach me on Twitter: @matifq Email [email protected]
10
Where’s the human?
If a tree falls in the forest, and no one is around to hear it, is information transmitted?
In the “information as process”: Yes, but that’s not very interesting to us
We’re concerned about information for human consumption Transmission of information from one
person to another Recording of information Reconstruction of stored information
Reach me on Twitter: @matifq Email [email protected]
11
Another View
Information science is characterized by “the deliberate (purposeful) structure of the message by the sender in order to affect the image structure of the recipient” This implies that the sender has knowledge of the
recipient's structure Text = “a collection of signs purposefully
structured by a sender with the intention of changing image-structure of a recipient”
Information = “the structure of any text which is capable of changing the image-structure of a recipient”
Reach me on Twitter: @matifq Email [email protected]
12
Transfer of Information
Communication = transmission of information
Thoughts
Words
Sounds
Thoughts
Words
Sounds
Encoding Decoding
Speech
Writing
Telepathy?
Reach me on Twitter: @matifq Email [email protected]
13
Information Theory
Better called “communication theory” Developed by Claude Shannon in 1940’s
Concerned with the transmission of electrical signals over wires
How do we send information quickly and reliably? Underlies modern electronic communication:
Voice and data traffic… Over copper, fiber optic, wireless, etc.
Famous result: Channel Capacity Theorem Formal measure of information in terms of entropy
Information = “reduction in surprise”
Reach me on Twitter: @matifq Email [email protected]
14
The Noisy Channel Model
Communication = producing the same message at the destination that was sent at the source The message must be encoded for
transmission across a medium (called channel)
But the channel is noisy and can distort the message
Semantics (meaning) is irrelevant
Source Destination
channelmessage Receiver messageTransmitter
noise
Reach me on Twitter: @matifq Email [email protected]
15
A Synthesis
Information retrieval as communication over time and space, across a noisy channelSource Destination
Transmitter Receiverchannelmessage message
noise
Sender Recipient
Encoding Decodingstoragemessage message
noise
indexing/writing retrieval/reading
Reach me on Twitter: @matifq Email [email protected]
16
“Retrieval?”
“Fetch something” that’s been stored Recover a stored state of knowledge Search through stored messages to find
some messages relevant to the task at hand
Sender Recipient
Encoding Decodingstoragemessage message
noise
indexing/writing Retrieval/reading
Reach me on Twitter: @matifq Email [email protected]
17
What is IR?
Information retrieval is a problem-oriented discipline, concerned with the problem of the effective and efficient transfer of desired information between human generator and human user
Anomalous States of Knowledge as a Basis for Information Retrieval. (1980) Nicholas J. Belkin. Canadian Journal of Information Science, 5, 133-
143.
Reach me on Twitter: @matifq Email [email protected]
18
Types of Information Needs
Retrospective “Searching the past” Different queries posed against a static
collection Time invariant
Prospective “Searching the future” Static query posed against a dynamic
collection Time dependent
Reach me on Twitter: @matifq Email [email protected]
19
Retrospective Searches (I)
Ad hoc retrieval: find documents “about this”
Known item search
Directed exploration
Identify positive accomplishments of the Hubble telescope since it was launched in 1991.
Compile a list of mammals that are considered to be endangered, identify their habitat and, if possible, specify what threatens them.
Find Jimmy Lin’s homepage.
What’s the ISBN number of “Modern Information Retrieval”?
Who makes the best chocolates?
What video conferencing systems exist for digital reference desk services?
Reach me on Twitter: @matifq Email [email protected]
20
Retrospective Searches (II)
Question answeringWho discovered Oxygen?When did Hawaii become a state?Where is Ayer’s Rock located?What team won the World Series in 1992?
“Factoid”
What countries export oil?Name U.S. cities that have a “Shubert” theater.
“List”
Who is Aaron Copland?What is a quasar?
“Definition”
Reach me on Twitter: @matifq Email [email protected]
21
Prospective “Searches”
Filtering Make a binary decision about each
incoming document
Routing Sort incoming documents into different
bins?
Spam or not spam?
Categorize news headlines: World? Nation? Metro? Sports?
Reach me on Twitter: @matifq Email [email protected]
22
What types of information?
Text (Documents and portions thereof) XML and structured documents Images Audio (sound effects, songs, etc.) Video Source code Applications/Web services
Reach me on Twitter: @matifq Email [email protected]
23
Content-Based Search
This is a relative new concept! What else would you search on? What’s more effective? Why is this hard in many applications?
Reach me on Twitter: @matifq Email [email protected]
24
Interesting Examples
Google image search
Google video search
Query by humming
http://images.google.com/
http://www.cs.cornell.edu/Info/Faculty/bsmith/query-by-humming.html
http://video.google.com/
Reach me on Twitter: @matifq Email [email protected]
25
What about databases?
What are examples of databases? Banks storing account information Retailers storing inventories Universities storing student grades
What exactly is a (relational) database? Think of them as a collection of tables They model some aspect of “the world”
Reach me on Twitter: @matifq Email [email protected]
26
A (Simple) Database Example
Department ID DepartmentEE Electrical EngineeringHIST HistoryCLIS Information Studies
Course ID Course Namelbsc690 Information Technologyee750 Communicationhist405 American History
Student ID Course ID Grade1 lbsc690 901 ee750 952 lbsc690 952 hist405 803 hist405 904 lbsc690 98
Student ID Last Name First Name Department ID email1 Arrows John EE jarrows@wam2 Peters Kathy HIST kpeters2@wam3 Smith Chris HIST smith2002@glue4 Smith John CLIS js03@wam
Student Table
Department Table Course Table
Enrollment Table
Reach me on Twitter: @matifq Email [email protected]
27
Database Queries
What would you want to know from a database? What classes is John Arrow enrolled in? Who has the highest grade in LBSC 690? Who’s in the history department? Of all the non-CLIS students taking LBSC
690 with a last name shorter than six characters and were born on a Monday, who has the longest email address?
Reach me on Twitter: @matifq Email [email protected]
28
Databases vs. IR
Other issues
Interaction with system
Results we get
Queries we’re posing
What we’re retrieving
IRDatabases
Issues downplayed.Concurrency, recovery, atomicity are all critical.
Interaction is important.
One-shot queries.
Sometimes relevant, often not.
Exact. Always correct in a formal sense.
Vague, imprecise information needs (often expressed in natural language).
Formally (mathematically) defined queries. Unambiguous.
Mostly unstructured. Free text with some metadata.
Structured data. Clear semantics based on a formal model.
Reach me on Twitter: @matifq Email [email protected]
29
The Big Picture
The four components of the information retrieval environment: User Process System Collection
What computer geeks care about!What we care about!
Reach me on Twitter: @matifq Email [email protected]
30
The Information Retrieval Cycle
SourceSelection
Search
Query
Selection
Ranked List
Examination
Documents
Delivery
Documents
QueryFormulation
Resource
query reformulation,vocabulary learning,relevance feedback
source reselection
Reach me on Twitter: @matifq Email [email protected]
31
Supporting the Search Process
SourceSelection
Search
Query
Selection
Ranked List
Examination
Documents
Delivery
Documents
QueryFormulation
Resource
Indexing Index
Acquisition Collection
Reach me on Twitter: @matifq Email [email protected]
32
Simplification?Source
Selection
Search
Query
Selection
Ranked List
Examination
Documents
Delivery
Documents
QueryFormulation
Resource
query reformulation,vocabulary learning,relevance feedback
source reselection
Is this itself a vast simplification?
Reach me on Twitter: @matifq Email [email protected]
33
Tackling the IR Challenge
Divide and conquer! Strategy: use encapsulation to limit
complexity Approach:
Define interfaces (input and output) for each component
Define the functions performed by each component
Study each component in isolation Repeat the process within components as
needed Make sure that this decomposition makes
sense Result: a hierarchical decomposition
Reach me on Twitter: @matifq Email [email protected]
34
Where do we make the cut?
Study the IR black box in isolation Simple behavior: in goes query, out comes
documents Optimize the quality of documents that come out
Study everything else around the black box Put the human back in the loop!
Search
Query
Ranked List
Reach me on Twitter: @matifq Email [email protected]
36
Inside The IR Black Box
DocumentsQuery
Hits
RepresentationFunction
RepresentationFunction
Query Representation Document Representation
ComparisonFunction Index
37
Reach me on Twitter: @matifq Email [email protected]
The Central Problem in IR
Information Seeker Authors
Concepts Concepts
Query Terms Document Terms
Do these represent the same concepts?
Reach me on Twitter: @matifq Email [email protected]
39
Imagine a System
We have 1000s of web pages, what make these web pages different? May be different key terms or key words
occurring in different web pages (e.g., sports, education, video sharing)
Reach me on Twitter: @matifq Email [email protected]
40
Realize Query Needs
What do we expect when query? Query can be single word (no order), collection of
words i.e., free sentence (order does not matter) or strict phrase (order matters e.g., "I love Pakistan")
How to manage data of web pages Bag of words data structure with/without position of
words/terms (simply, posting list of words/terms) What’s the best match?
We have many matching results, but what’s the order?
Reach me on Twitter: @matifq Email [email protected]
41
Order of Matching Results
How could we rank web pages? Via query content matching score against
web pages i.e., content based methods Via importance of web pages i.e., link
based methods
Reach me on Twitter: @matifq Email [email protected]
42
What does Content Tell?
Content Information: Rare terms give more information than
frequent terms as common terms do not differentiate well between the content of documents (Information entropy)
So what does common words make? Stop words (extreme case, e.g., it, a, the) or
words with lesser importance (e.g., word science inside scientific documents)
Reach me on Twitter: @matifq Email [email protected]
43
Ranking Methods
Content based methods: Examples: Tf-idf with cosine similarity,
bm25, etc. Link based methods:
Examples: PageRank, HITS, etc.
Reach me on Twitter: @matifq Email [email protected]
44
What is More in Ranking?
What other measures we can take for ranking better? Combining content based methods with link
based methods
How about learning to rank by user click through data (apply machine learning)
How about learning from social web (apply social science theories)
Reach me on Twitter: @matifq Email [email protected]
45
Lots of Web Pages
How about scalability?
We have too many words, can we limit them? Example: Is Studying conceptually different from
study or studies? may be not (concept called stemming could simply everything to simple concept study)
Stemming may not be sufficient then how about clustering web pages into topics i.e., (terms study, science, arts, university, school, college would single concept or a topic may be called as topic education)
Reach me on Twitter: @matifq Email [email protected]
46
Is it sufficient?
Can we feel confident about how Web Search Engine works? No, it was just a summary for the day
Reach me on Twitter: @matifq Email [email protected]
50
Outline
What is Research? How to prepare yourself for IR research?
Reach me on Twitter: @matifq Email [email protected]
52
What is Research?
Research Discover new knowledge Seek answers to questions
Basic research Goal: Expand man’s knowledge (e.g., which genes control
social behavior of honey bees? ) Often driven by curiosity (but not always) High impact examples: relativity theory, DNA, …
Applied research Goal: Improve human condition (i.e., improve the world)
(e.g., how to cure cancers?) Driven by practical needs High impact examples: computers, transistors,
vaccinations, … The boundary is vague; distinction isn’t important
Reach me on Twitter: @matifq Email [email protected]
53
Why Research?
Amount of knowledge
Advancement of Technology
Utility of Applications
Basic ResearchApplied Research
ApplicationDevelopment
Curiosity
Reach me on Twitter: @matifq Email [email protected]
54
Where’s IR Research?
Amount of knowledge
Advancement of Technology
Utility of Applications
Quality of Life
Basic ResearchApplied Research
ApplicationDevelopment
Information Science
Computer Science
Reach me on Twitter: @matifq Email [email protected]
55
Research Process
Identification of the topic (e.g., Web search)
Hypothesis formulation (e.g., algorithm X is better than Y=state-of-the-art)
Experiment design (measures, data, etc) (e.g., retrieval accuracy on a sample of web data)
Test hypothesis (e.g., compare X and Y on the data)
Draw conclusions and repeat the cycle of hypothesis formulation and testing if necessary (e.g., Y is better only for some queries, now what?)
Reach me on Twitter: @matifq Email [email protected]
56
Typical IR Research Process
Look for a high-impact topic (basic or applied) New problem: define/frame the problem Identify weakness of existing solutions if any Propose new methods Choose data sets (often a main challenge) Design evaluation measures (can be very
difficult) Run many experiments (need to have clear
research hypotheses) Analyze results and repeat the steps above if
necessary Publish research results
Reach me on Twitter: @matifq Email [email protected]
57
Research Methods
Exploratory research: Identify and frame a new problem (e.g., “a survey/outlook of personalized search”)
Constructive research: Construct a (new) solution to a problem (e.g., “a new method for expert finding”)
Empirical research: evaluate and compare existing solutions (e.g., “a comparative evaluation of link analysis methods for web search”)
The “E-C-E cycle”: exploratoryconstructiveempiricalexploratory…
Reach me on Twitter: @matifq Email [email protected]
58
Types of Research Questions and Results
Exploratory (Framework): What’s out there?
Descriptive (Principles): What does it look like? How does it work?
Evaluative (Empirical results): How well does a method solve a problem?
Explanatory (Causes): Why does something happen the way it happens?
Predictive (Models): What would happen if xxx ?
Reach me on Twitter: @matifq Email [email protected]
59
Solid and High Impact Research Solid work:
A clear hypothesis (research question) with conclusive result (either positive or negative)
Clearly adds to our knowledge base (what can we learn from this work?)
Implications: a solid, focused contribution is often better than a non-conclusive broad exploration
High impact = high-importance-of-problem * high-quality-of-solution high impact = open up an important problem high impact = close a problem with the best solution high impact = major milestones in between Implications: question the importance of the problem and
don’t just be satisfied with a good solution, make it the best
Reach me on Twitter: @matifq Email [email protected]
61
What it Takes to do Research? Curiosity: allow you to ask questions Critical thinking: allow you to challenge
assumptions Learning: take you to the frontier of
knowledge Persistence: so that you don’t give up Respect data and truth: ensure your
research is solid Communication: allow you to publish
your work …
Reach me on Twitter: @matifq Email [email protected]
62
Learning about IR (1/2)
Start with an IR text book (e.g., Manning et al., Grossman & Frieder, a forth-coming book from UMass,…)
Then read “Readings in IR” by Karen Sparck Jones, Peter Willett
And read papers recommended in the following article: http://www.sigir.org/forum/2005D/2005d_sigirforum_moffat.pdf
Read other papers published in recent IR/IR-related conferences
Reach me on Twitter: @matifq Email [email protected]
63
Learning about IR (2/2)
Getting more focused Choose your favorite sub-area (e.g., retrieval models) Extend your knowledge about related topics (e.g.,
machine learning, statistical modeling, optimization) Stay in frontier:
Keep monitoring literature in both IR and related areas Broaden your view: Keep an eye on
Industry activities Read about industry trends Try out novel prototype systems
Funding trends Read request for proposals
Reach me on Twitter: @matifq Email [email protected]
64
Critical Thinking
Develop a habit of asking questions, especially why questions
Always try to make sense of what you have read/heard; don’t let any question pass by
Get used to challenging everything Practical advice
Question every claim made in a paper or a talk (can you argue the other way?)
Try to write two opposite reviews of a paper (one mainly to argue for accepting the paper and the other for rejecting it)
Force yourself to challenge one point in every talk that you attend and raise a question
Reach me on Twitter: @matifq Email [email protected]
65
Respect Data and Truth
Be honest with the experiment results Don’t throw away negative results! Try to learn from negative results
Don’t twist data to fit your hypothesis; instead, let the hypothesis choose data
Be objective in data analysis and interpretation; don’t mislead readers
Aim at understanding/explanation instead of just good results
Be careful not to over-generalize (for both good and bad results); you may be far from the truth
Reach me on Twitter: @matifq Email [email protected]
66
Communications
General communication skills: Oral and written Formal and informal Talk to people with different level of
backgrounds Be clear, concise, accurate, and adaptive
(elaborate with examples, summarize by abstraction)
English proficiency Get used to talking to people from
different fields
Reach me on Twitter: @matifq Email [email protected]
67
Persistence
Work only on topics that you are passionate about
Work only on hypotheses that you believe in
Don’t draw negative conclusions prematurely and give up easily positive results may be hidden in negative
results In many cases, negative results don’t
completely reject a hypothesis Be comfortable with criticisms about
your work (learn from negative reviews of a rejected paper)
Think of possibilities of repositioning a work
Reach me on Twitter: @matifq Email [email protected]
68
Optimize Your Training
Know your strengths and weaknesses strong in math vs. strong in system
development creative vs. thorough …
Train yourself to fix weaknesses Find strategic partners Position yourself to take advantage of
your strengths
Reach me on Twitter: @matifq Email [email protected]
69
Thank You
Reach me on Twitter: @matifqEmail me: [email protected]
Top Related