Information Retrieval

Post on 13-May-2015

2.759 views 1 download

Tags:

description

This talk features the basics behind the science of Information Retrieval with a story-mode on information and its various aspects. It then takes you through a quick journey into the process behind building of the search engine.

Transcript of Information Retrieval

Reach me on Twitter: @matifq Email maqureshi@iba.edu.pk 1

INFORMATION RETRIEVALA Look into the Science of Web Search Engines

Muhammad Atif Qureshi

Reach me on Twitter: @matifq Email maqureshi@iba.edu.pk

2

Contents

Story Mode Learning Learning by Imagination Appendix

Reach me on Twitter: @matifq Email maqureshi@iba.edu.pk

3

Story Mode Learning(Borrowed from Prof. Jimmy Lin,

University of Maryland, Scientist in Twitter)

Reach me on Twitter: @matifq Email maqureshi@iba.edu.pk

4

Information Retrieval Systems Information

What is “information”? Retrieval

What do we mean by “retrieval”? What are different types information

needs? Systems

How do computer systems fit into the human information seeking process?

Reach me on Twitter: @matifq Email maqureshi@iba.edu.pk

5

What is Information?

What do you think? There is no “correct” definition Cookie Monster’s definition:

“news or facts about something” Different approaches:

Philosophy Psychology Linguistics Electrical engineering Physics Computer science Information science

Reach me on Twitter: @matifq Email maqureshi@iba.edu.pk

6

Dictionary says…

Oxford English Dictionary information: informing, telling; thing told,

knowledge, items of knowledge, news knowledge: knowing familiarity gained by

experience; person’s range of information; a theoretical or practical understanding of; the sum of what is known

Random House Dictionary information: knowledge communicated or

received concerning a particular fact or circumstance; news

Reach me on Twitter: @matifq Email maqureshi@iba.edu.pk

7

Intuitive Notions

Information must Be something, although the exact nature

(substance, energy, or abstract concept) is not clear;

Be “new”: repetition of previously received messages is not informative

Be “true”: false or counterfactual information is “mis-information”

Be “about” something

Robert M. Losee. (1997) A Discipline Independent Definition of Information. Journal of the American Society for Information Science,

48(3), 254-269.

Reach me on Twitter: @matifq Email maqureshi@iba.edu.pk

8

Three Views of Information

Information as process Information as communication Information as message transmission

and reception

Reach me on Twitter: @matifq Email maqureshi@iba.edu.pk

9

One View

Information = characteristics of the output of a process Tells us something about the process and the

input

Information-generating process do not occur in isolation

ProcessInput

Input

Input

Output

Output

Output

Process1 Process2Input Output…

Reach me on Twitter: @matifq Email maqureshi@iba.edu.pk

10

Where’s the human?

If a tree falls in the forest, and no one is around to hear it, is information transmitted?

In the “information as process”: Yes, but that’s not very interesting to us

We’re concerned about information for human consumption Transmission of information from one

person to another Recording of information Reconstruction of stored information

Reach me on Twitter: @matifq Email maqureshi@iba.edu.pk

11

Another View

Information science is characterized by “the deliberate (purposeful) structure of the message by the sender in order to affect the image structure of the recipient” This implies that the sender has knowledge of the

recipient's structure Text = “a collection of signs purposefully

structured by a sender with the intention of changing image-structure of a recipient”

Information = “the structure of any text which is capable of changing the image-structure of a recipient”

Reach me on Twitter: @matifq Email maqureshi@iba.edu.pk

12

Transfer of Information

Communication = transmission of information

Thoughts

Words

Sounds

Thoughts

Words

Sounds

Encoding Decoding

Speech

Writing

Telepathy?

Reach me on Twitter: @matifq Email maqureshi@iba.edu.pk

13

Information Theory

Better called “communication theory” Developed by Claude Shannon in 1940’s

Concerned with the transmission of electrical signals over wires

How do we send information quickly and reliably? Underlies modern electronic communication:

Voice and data traffic… Over copper, fiber optic, wireless, etc.

Famous result: Channel Capacity Theorem Formal measure of information in terms of entropy

Information = “reduction in surprise”

Reach me on Twitter: @matifq Email maqureshi@iba.edu.pk

14

The Noisy Channel Model

Communication = producing the same message at the destination that was sent at the source The message must be encoded for

transmission across a medium (called channel)

But the channel is noisy and can distort the message

Semantics (meaning) is irrelevant

Source Destination

channelmessage Receiver messageTransmitter

noise

Reach me on Twitter: @matifq Email maqureshi@iba.edu.pk

15

A Synthesis

Information retrieval as communication over time and space, across a noisy channelSource Destination

Transmitter Receiverchannelmessage message

noise

Sender Recipient

Encoding Decodingstoragemessage message

noise

indexing/writing retrieval/reading

Reach me on Twitter: @matifq Email maqureshi@iba.edu.pk

16

“Retrieval?”

“Fetch something” that’s been stored Recover a stored state of knowledge Search through stored messages to find

some messages relevant to the task at hand

Sender Recipient

Encoding Decodingstoragemessage message

noise

indexing/writing Retrieval/reading

Reach me on Twitter: @matifq Email maqureshi@iba.edu.pk

17

What is IR?

Information retrieval is a problem-oriented discipline, concerned with the problem of the effective and efficient transfer of desired information between human generator and human user

Anomalous States of Knowledge as a Basis for Information Retrieval. (1980) Nicholas J. Belkin. Canadian Journal of Information Science, 5, 133-

143.

Reach me on Twitter: @matifq Email maqureshi@iba.edu.pk

18

Types of Information Needs

Retrospective “Searching the past” Different queries posed against a static

collection Time invariant

Prospective “Searching the future” Static query posed against a dynamic

collection Time dependent

Reach me on Twitter: @matifq Email maqureshi@iba.edu.pk

19

Retrospective Searches (I)

Ad hoc retrieval: find documents “about this”

Known item search

Directed exploration

Identify positive accomplishments of the Hubble telescope since it was launched in 1991.

Compile a list of mammals that are considered to be endangered, identify their habitat and, if possible, specify what threatens them.

Find Jimmy Lin’s homepage.

What’s the ISBN number of “Modern Information Retrieval”?

Who makes the best chocolates?

What video conferencing systems exist for digital reference desk services?

Reach me on Twitter: @matifq Email maqureshi@iba.edu.pk

20

Retrospective Searches (II)

Question answeringWho discovered Oxygen?When did Hawaii become a state?Where is Ayer’s Rock located?What team won the World Series in 1992?

“Factoid”

What countries export oil?Name U.S. cities that have a “Shubert” theater.

“List”

Who is Aaron Copland?What is a quasar?

“Definition”

Reach me on Twitter: @matifq Email maqureshi@iba.edu.pk

21

Prospective “Searches”

Filtering Make a binary decision about each

incoming document

Routing Sort incoming documents into different

bins?

Spam or not spam?

Categorize news headlines: World? Nation? Metro? Sports?

Reach me on Twitter: @matifq Email maqureshi@iba.edu.pk

22

What types of information?

Text (Documents and portions thereof) XML and structured documents Images Audio (sound effects, songs, etc.) Video Source code Applications/Web services

Reach me on Twitter: @matifq Email maqureshi@iba.edu.pk

23

Content-Based Search

This is a relative new concept! What else would you search on? What’s more effective? Why is this hard in many applications?

Reach me on Twitter: @matifq Email maqureshi@iba.edu.pk

24

Interesting Examples

Google image search

Google video search

Query by humming

http://images.google.com/

http://www.cs.cornell.edu/Info/Faculty/bsmith/query-by-humming.html

http://video.google.com/

Reach me on Twitter: @matifq Email maqureshi@iba.edu.pk

25

What about databases?

What are examples of databases? Banks storing account information Retailers storing inventories Universities storing student grades

What exactly is a (relational) database? Think of them as a collection of tables They model some aspect of “the world”

Reach me on Twitter: @matifq Email maqureshi@iba.edu.pk

26

A (Simple) Database Example

Department ID DepartmentEE Electrical EngineeringHIST HistoryCLIS Information Studies

Course ID Course Namelbsc690 Information Technologyee750 Communicationhist405 American History

Student ID Course ID Grade1 lbsc690 901 ee750 952 lbsc690 952 hist405 803 hist405 904 lbsc690 98

Student ID Last Name First Name Department ID email1 Arrows John EE jarrows@wam2 Peters Kathy HIST kpeters2@wam3 Smith Chris HIST smith2002@glue4 Smith John CLIS js03@wam

Student Table

Department Table Course Table

Enrollment Table

Reach me on Twitter: @matifq Email maqureshi@iba.edu.pk

27

Database Queries

What would you want to know from a database? What classes is John Arrow enrolled in? Who has the highest grade in LBSC 690? Who’s in the history department? Of all the non-CLIS students taking LBSC

690 with a last name shorter than six characters and were born on a Monday, who has the longest email address?

Reach me on Twitter: @matifq Email maqureshi@iba.edu.pk

28

Databases vs. IR

Other issues

Interaction with system

Results we get

Queries we’re posing

What we’re retrieving

IRDatabases

Issues downplayed.Concurrency, recovery, atomicity are all critical.

Interaction is important.

One-shot queries.

Sometimes relevant, often not.

Exact. Always correct in a formal sense.

Vague, imprecise information needs (often expressed in natural language).

Formally (mathematically) defined queries. Unambiguous.

Mostly unstructured. Free text with some metadata.

Structured data. Clear semantics based on a formal model.

Reach me on Twitter: @matifq Email maqureshi@iba.edu.pk

29

The Big Picture

The four components of the information retrieval environment: User Process System Collection

What computer geeks care about!What we care about!

Reach me on Twitter: @matifq Email maqureshi@iba.edu.pk

30

The Information Retrieval Cycle

SourceSelection

Search

Query

Selection

Ranked List

Examination

Documents

Delivery

Documents

QueryFormulation

Resource

query reformulation,vocabulary learning,relevance feedback

source reselection

Reach me on Twitter: @matifq Email maqureshi@iba.edu.pk

31

Supporting the Search Process

SourceSelection

Search

Query

Selection

Ranked List

Examination

Documents

Delivery

Documents

QueryFormulation

Resource

Indexing Index

Acquisition Collection

Reach me on Twitter: @matifq Email maqureshi@iba.edu.pk

32

Simplification?Source

Selection

Search

Query

Selection

Ranked List

Examination

Documents

Delivery

Documents

QueryFormulation

Resource

query reformulation,vocabulary learning,relevance feedback

source reselection

Is this itself a vast simplification?

Reach me on Twitter: @matifq Email maqureshi@iba.edu.pk

33

Tackling the IR Challenge

Divide and conquer! Strategy: use encapsulation to limit

complexity Approach:

Define interfaces (input and output) for each component

Define the functions performed by each component

Study each component in isolation Repeat the process within components as

needed Make sure that this decomposition makes

sense Result: a hierarchical decomposition

Reach me on Twitter: @matifq Email maqureshi@iba.edu.pk

34

Where do we make the cut?

Study the IR black box in isolation Simple behavior: in goes query, out comes

documents Optimize the quality of documents that come out

Study everything else around the black box Put the human back in the loop!

Search

Query

Ranked List

Reach me on Twitter: @matifq Email maqureshi@iba.edu.pk

35

The IR Black Box

DocumentsQuery

Hits

Reach me on Twitter: @matifq Email maqureshi@iba.edu.pk

36

Inside The IR Black Box

DocumentsQuery

Hits

RepresentationFunction

RepresentationFunction

Query Representation Document Representation

ComparisonFunction Index

37

Reach me on Twitter: @matifq Email maqureshi@iba.edu.pk

The Central Problem in IR

Information Seeker Authors

Concepts Concepts

Query Terms Document Terms

Do these represent the same concepts?

Reach me on Twitter: @matifq Email maqureshi@iba.edu.pk

38

Learning by Imagination

Reach me on Twitter: @matifq Email maqureshi@iba.edu.pk

39

Imagine a System

We have 1000s of web pages, what make these web pages different? May be different key terms or key words

occurring in different web pages (e.g., sports, education, video sharing)

Reach me on Twitter: @matifq Email maqureshi@iba.edu.pk

40

Realize Query Needs

What do we expect when query? Query can be single word (no order), collection of

words i.e., free sentence (order does not matter) or strict phrase (order matters e.g., "I love Pakistan")

How to manage data of web pages Bag of words data structure with/without position of

words/terms (simply, posting list of words/terms) What’s the best match?

We have many matching results, but what’s the order?

Reach me on Twitter: @matifq Email maqureshi@iba.edu.pk

41

Order of Matching Results

How could we rank web pages? Via query content matching score against

web pages i.e., content based methods Via importance of web pages i.e., link

based methods

Reach me on Twitter: @matifq Email maqureshi@iba.edu.pk

42

What does Content Tell?

Content Information: Rare terms give more information than

frequent terms as common terms do not differentiate well between the content of documents (Information entropy)

So what does common words make? Stop words (extreme case, e.g., it, a, the) or

words with lesser importance (e.g., word science inside scientific documents)

Reach me on Twitter: @matifq Email maqureshi@iba.edu.pk

43

Ranking Methods

Content based methods: Examples: Tf-idf with cosine similarity,

bm25, etc. Link based methods:

Examples: PageRank, HITS, etc.

Reach me on Twitter: @matifq Email maqureshi@iba.edu.pk

44

What is More in Ranking?

What other measures we can take for ranking better? Combining content based methods with link

based methods

How about learning to rank by user click through data (apply machine learning)

How about learning from social web (apply social science theories)

Reach me on Twitter: @matifq Email maqureshi@iba.edu.pk

45

Lots of Web Pages

How about scalability?

We have too many words, can we limit them? Example: Is Studying conceptually different from

study or studies? may be not (concept called stemming could simply everything to simple concept study)

Stemming may not be sufficient then how about clustering web pages into topics i.e., (terms study, science, arts, university, school, college would single concept or a topic may be called as topic education)

Reach me on Twitter: @matifq Email maqureshi@iba.edu.pk

46

Is it sufficient?

Can we feel confident about how Web Search Engine works? No, it was just a summary for the day

Reach me on Twitter: @matifq Email maqureshi@iba.edu.pk

47

Guess! what next you would see? ?

Reach me on Twitter: @matifq Email maqureshi@iba.edu.pk

48

Our search engineYes we are making it

Reach me on Twitter: @matifq Email maqureshi@iba.edu.pk

49

Appendix

Reach me on Twitter: @matifq Email maqureshi@iba.edu.pk

50

Outline

What is Research? How to prepare yourself for IR research?

Reach me on Twitter: @matifq Email maqureshi@iba.edu.pk

51

What is Research?

Reach me on Twitter: @matifq Email maqureshi@iba.edu.pk

52

What is Research?

Research Discover new knowledge Seek answers to questions

Basic research Goal: Expand man’s knowledge (e.g., which genes control

social behavior of honey bees? ) Often driven by curiosity (but not always) High impact examples: relativity theory, DNA, …

Applied research Goal: Improve human condition (i.e., improve the world)

(e.g., how to cure cancers?) Driven by practical needs High impact examples: computers, transistors,

vaccinations, … The boundary is vague; distinction isn’t important

Reach me on Twitter: @matifq Email maqureshi@iba.edu.pk

53

Why Research?

Amount of knowledge

Advancement of Technology

Utility of Applications

Basic ResearchApplied Research

ApplicationDevelopment

Curiosity

Reach me on Twitter: @matifq Email maqureshi@iba.edu.pk

54

Where’s IR Research?

Amount of knowledge

Advancement of Technology

Utility of Applications

Quality of Life

Basic ResearchApplied Research

ApplicationDevelopment

Information Science

Computer Science

Reach me on Twitter: @matifq Email maqureshi@iba.edu.pk

55

Research Process

Identification of the topic (e.g., Web search)

Hypothesis formulation (e.g., algorithm X is better than Y=state-of-the-art)

Experiment design (measures, data, etc) (e.g., retrieval accuracy on a sample of web data)

Test hypothesis (e.g., compare X and Y on the data)

Draw conclusions and repeat the cycle of hypothesis formulation and testing if necessary (e.g., Y is better only for some queries, now what?)

Reach me on Twitter: @matifq Email maqureshi@iba.edu.pk

56

Typical IR Research Process

Look for a high-impact topic (basic or applied) New problem: define/frame the problem Identify weakness of existing solutions if any Propose new methods Choose data sets (often a main challenge) Design evaluation measures (can be very

difficult) Run many experiments (need to have clear

research hypotheses) Analyze results and repeat the steps above if

necessary Publish research results

Reach me on Twitter: @matifq Email maqureshi@iba.edu.pk

57

Research Methods

Exploratory research: Identify and frame a new problem (e.g., “a survey/outlook of personalized search”)

Constructive research: Construct a (new) solution to a problem (e.g., “a new method for expert finding”)

Empirical research: evaluate and compare existing solutions (e.g., “a comparative evaluation of link analysis methods for web search”)

The “E-C-E cycle”: exploratoryconstructiveempiricalexploratory…

Reach me on Twitter: @matifq Email maqureshi@iba.edu.pk

58

Types of Research Questions and Results

Exploratory (Framework): What’s out there?

Descriptive (Principles): What does it look like? How does it work?

Evaluative (Empirical results): How well does a method solve a problem?

Explanatory (Causes): Why does something happen the way it happens?

Predictive (Models): What would happen if xxx ?

Reach me on Twitter: @matifq Email maqureshi@iba.edu.pk

59

Solid and High Impact Research Solid work:

A clear hypothesis (research question) with conclusive result (either positive or negative)

Clearly adds to our knowledge base (what can we learn from this work?)

Implications: a solid, focused contribution is often better than a non-conclusive broad exploration

High impact = high-importance-of-problem * high-quality-of-solution high impact = open up an important problem high impact = close a problem with the best solution high impact = major milestones in between Implications: question the importance of the problem and

don’t just be satisfied with a good solution, make it the best

Reach me on Twitter: @matifq Email maqureshi@iba.edu.pk

60

How to Prepare Yourself for IR

Research?

Reach me on Twitter: @matifq Email maqureshi@iba.edu.pk

61

What it Takes to do Research? Curiosity: allow you to ask questions Critical thinking: allow you to challenge

assumptions Learning: take you to the frontier of

knowledge Persistence: so that you don’t give up Respect data and truth: ensure your

research is solid Communication: allow you to publish

your work …

Reach me on Twitter: @matifq Email maqureshi@iba.edu.pk

62

Learning about IR (1/2)

Start with an IR text book (e.g., Manning et al., Grossman & Frieder, a forth-coming book from UMass,…)

Then read “Readings in IR” by Karen Sparck Jones, Peter Willett

And read papers recommended in the following article: http://www.sigir.org/forum/2005D/2005d_sigirforum_moffat.pdf

Read other papers published in recent IR/IR-related conferences

Reach me on Twitter: @matifq Email maqureshi@iba.edu.pk

63

Learning about IR (2/2)

Getting more focused Choose your favorite sub-area (e.g., retrieval models) Extend your knowledge about related topics (e.g.,

machine learning, statistical modeling, optimization) Stay in frontier:

Keep monitoring literature in both IR and related areas Broaden your view: Keep an eye on

Industry activities Read about industry trends Try out novel prototype systems

Funding trends Read request for proposals

Reach me on Twitter: @matifq Email maqureshi@iba.edu.pk

64

Critical Thinking

Develop a habit of asking questions, especially why questions

Always try to make sense of what you have read/heard; don’t let any question pass by

Get used to challenging everything Practical advice

Question every claim made in a paper or a talk (can you argue the other way?)

Try to write two opposite reviews of a paper (one mainly to argue for accepting the paper and the other for rejecting it)

Force yourself to challenge one point in every talk that you attend and raise a question

Reach me on Twitter: @matifq Email maqureshi@iba.edu.pk

65

Respect Data and Truth

Be honest with the experiment results Don’t throw away negative results! Try to learn from negative results

Don’t twist data to fit your hypothesis; instead, let the hypothesis choose data

Be objective in data analysis and interpretation; don’t mislead readers

Aim at understanding/explanation instead of just good results

Be careful not to over-generalize (for both good and bad results); you may be far from the truth

Reach me on Twitter: @matifq Email maqureshi@iba.edu.pk

66

Communications

General communication skills: Oral and written Formal and informal Talk to people with different level of

backgrounds Be clear, concise, accurate, and adaptive

(elaborate with examples, summarize by abstraction)

English proficiency Get used to talking to people from

different fields

Reach me on Twitter: @matifq Email maqureshi@iba.edu.pk

67

Persistence

Work only on topics that you are passionate about

Work only on hypotheses that you believe in

Don’t draw negative conclusions prematurely and give up easily positive results may be hidden in negative

results In many cases, negative results don’t

completely reject a hypothesis Be comfortable with criticisms about

your work (learn from negative reviews of a rejected paper)

Think of possibilities of repositioning a work

Reach me on Twitter: @matifq Email maqureshi@iba.edu.pk

68

Optimize Your Training

Know your strengths and weaknesses strong in math vs. strong in system

development creative vs. thorough …

Train yourself to fix weaknesses Find strategic partners Position yourself to take advantage of

your strengths

Reach me on Twitter: @matifq Email maqureshi@iba.edu.pk

69

Thank You

Reach me on Twitter: @matifqEmail me: maqureshi@iba.edu.pk