Information Retrieval

eftekhari@cse.shirazu.ac.ir

What is IR?

IR is concerned with the representation, storage, organization, and accessing of information items. [Salton]

Information includes text, audio, images, ….

For simplicity, we consider only text.

The subject is therefore text information retrieval.

[Figure: User, Request, Language, Documents, Information Space]

The role of databases

Databases hold specific data items

Organization is explicit

Keys relate items to each other

Queries are constrained, but effective in retrieving the data that is there

Databases generally respond to specific queries with specific results

Searching for items not anticipated by the designers can be difficult

Information vs. Information sources

User needs information

Distinguish data, information, knowledge

Information sources

Very well organized, indexed, controlled

Totally unorganized, uncharacterized, uncontrolled

Something in between

Connect the two in a way that matches information needs to information available.

The Web

Extreme opposite of a database

No organization, no overall structure, no index or key to the content

Searching and browsing are supported, but generally are not complete. (You will not know if you got every good response to your request. You may be able to tell that you got the response that meets your need, but may not know if you got the best response available.)

Each HTML page is considered as a document

Information Retrieval vs. Data Retrieval

Data retrieval consists of determining which documents of the collection contain the keywords in the user query.

Information retrieval should “interpret” the contents of the documents in the collection and retrieve all the documents that are relevant to the user query while retrieving as few non-relevant documents as possible.

A General text-information retrieval model

An information retrieval model is a quadruple <D, Q, F, R(qi, dj)>, where

D is a set composed of logical views (or representations) for the documents in the collection

Q is a set composed of logical views (or representations) for the user information needs called “queries”

F is a framework for modeling document representations, queries and their relationships

R(qi, dj) is a ranking function that associates a real number with a query qi in Q and a document representation dj in D. (It is a similarity measure that maps a query to the documents most similar to it.)

Retrieval models

Probabilistic IR (Bayesian, Naïve Bayes): compute the probability of relevance of a document to a given query.

Statistical IR (vector space, concept space).

Machine-learning-based techniques (extracting knowledge or identifying patterns)

Symbolic learning (ID3)

Neural networks (anywhere they are required)

Evolution-based algorithms (for adapting F as the matching function)

The effectiveness of an IR system depends on the ability of the document representation to capture the “meaning” of the documents with respect to the users’ needs

Text retrieval Overall Architecture

[Figure: block diagram with the components Users, Queries (Q), Relevance Feedback, Matching Algorithm (R), Document Representation (D), Documents and Retrieved Documents, divided into a User Side and an Information Space, with the label F.]

Preparing queries and documents

Convert file format.

Text segmentation.

Term extraction

Stemming, eliminating stop words

Term weighting

Phrase construction

Storing indexed documents

Queries go through similar stages, but instead of being stored the result is passed to the matching algorithm.
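The preparation stages can be illustrated with a minimal Python sketch. It covers only term extraction, stop-word elimination, stemming and raw term counts. The stop-word list and the `crude_stem` suffix stripper are hypothetical stand-ins chosen for this example; they are not the method of any particular IR system.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny stop-word list; real systems use larger ones.
STOP_WORDS = {"a", "an", "the", "it", "of", "and", "is", "in", "to"}

def crude_stem(word):
    """Strip a few common English suffixes (illustrative stand-in for a real stemmer)."""
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def prepare(text):
    """Term extraction, stop-word elimination, stemming, then raw term counts."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    terms = [crude_stem(t) for t in tokens if t not in STOP_WORDS]
    return Counter(terms)

# Documents would be stored after this step; queries go straight to matching.
doc_terms = prepare("The indexing of documents and the ranking of retrieved documents")
query_terms = prepare("ranked documents")
```

Running the same `prepare` function on both documents and queries is what makes the later matching step meaningful: "ranked" in the query and "ranking" in the document both reduce to the same term.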

The Retrieval Process

[Figure: the retrieval process. Components: User Interface, Text Operations, Query Operations, Searching, Ranking, Indexing, DB Manager Module, Index, Text Database. Flow labels: user's need, text, logical view, query, user's feedback, inverted file, retrieved docs, ranked docs.]

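Vector space

The vector space model was introduced in the 1960s (Salton, Cornell Univ., SMART system).

Dj = (Wj1, Wj2, …, Wjt); if the kth term does not occur in Dj, then Wjk = 0.

Qj = (wj1, wj2, …, wjt)

Sparse term-document matrix (rows are documents, columns are terms):

$$
\begin{array}{c|cccc}
 & T_1 & T_2 & \cdots & T_t \\ \hline
D_1 & W_{11} & W_{12} & \cdots & W_{1t} \\
D_2 & W_{21} & W_{22} & \cdots & W_{2t} \\
\vdots & \vdots & \vdots & & \vdots \\
D_N & W_{N1} & W_{N2} & \cdots & W_{Nt}
\end{array}
$$

Queries are laid out the same way (rows are queries, columns are terms):

$$
\begin{array}{c|cccc}
 & T_1 & T_2 & \cdots & T_t \\ \hline
Q_1 & w_{11} & w_{12} & \cdots & w_{1t} \\
Q_2 & w_{21} & w_{22} & \cdots & w_{2t} \\
\vdots & \vdots & \vdots & & \vdots \\
Q_N & w_{N1} & w_{N2} & \cdots & w_{Nt}
\end{array}
$$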

Term Weighting

$$W_{ij} = L_{ij} \cdot G_i \cdot N_j$$

Global $G_i$

• None: $1$

• IDF: $\dfrac{F_i}{n_i}$

• Entropy: $1 + \sum_{j=1}^{N} \dfrac{\frac{f_{ij}}{F_i}\,\log\frac{f_{ij}}{F_i}}{\log N}$

• IDFB: $\log\dfrac{N}{n_i}$

• IDFP: $\log\dfrac{N - n_i}{n_i}$

Normalization $N_j$

• None: $1$

• Cosine: $\dfrac{1}{\sqrt{\sum_{i=0}^{t} (G_i L_{ij})^2}}$

• PUQN: $\dfrac{1}{(1 - \mathit{slop}) \cdot \mathit{Pivot} + \mathit{slop} \cdot l_j}$

Local $L_{ij}$ (if $f_{ij} = 0$ then $0$, else:)

• Binary: $1$

• TF: $f_{ij}$

• Log: $\log(1 + f_{ij})$

• LOGN: $\dfrac{1 + \log f_{ij}}{1 + \log a_j}$

• ATF1: $0.5 + 0.5\,\dfrac{f_{ij}}{x_j}$
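As an illustration of the product W_ij = L_ij · G_i · N_j, the following Python sketch combines one choice from each family: the Log local weight, the IDFB global weight and cosine normalization. It assumes the usual reading of the symbols (f_ij is the count of term i in document j, n_i the number of documents containing term i, N the number of documents), uses natural logarithms, and runs on made-up toy counts.

```python
import math

# Toy term-frequency table: counts[j][term] = f_ij (hypothetical data).
counts = [
    {"retrieval": 2, "index": 1},
    {"retrieval": 1, "database": 3},
    {"database": 1},
]
N = len(counts)  # number of documents in the collection

def local_log(f):
    """Local weight 'Log': log(1 + f_ij), and 0 when the term is absent."""
    return math.log(1 + f) if f > 0 else 0.0

def global_idfb(term):
    """Global weight 'IDFB': log(N / n_i)."""
    n_i = sum(1 for doc in counts if term in doc)
    return math.log(N / n_i)

def weights(doc):
    """W_ij = L_ij * G_i * N_j with cosine normalization as N_j."""
    raw = {t: local_log(f) * global_idfb(t) for t, f in doc.items()}
    norm = math.sqrt(sum(w * w for w in raw.values()))
    return {t: (w / norm if norm else 0.0) for t, w in raw.items()}

weighted = [weights(doc) for doc in counts]
```

With cosine normalization every document vector ends up with unit length, so long documents do not dominate a later inner-product comparison simply because they contain more words.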

Vector space graphical representation

Example:

D1 = 2T1 + 3T2 + 5T3

D2 = 3T1 + 7T2 + T3

Q = 0T1 + 0T2 + 2T3

[Figure: D1, D2 and Q drawn as vectors in the three-dimensional term space with axes T1, T2 and T3.]

Is D1 or D2 more similar to Q?

How to measure the degree of similarity? Distance? Angle? Projection?

Similarity Measure (Matching Function) F

Similarity between document Di and query Q can be computed as the inner vector product:

$$\mathrm{sim}(D_i, Q) = \sum_{k=1}^{t} d_{ik}\, q_k$$

where dik is the weight of Term k in document i and qk is the weight of Term k in the query

Binary example over the terms retrieval, database, architecture, computer, text, management, information:

D = 1, 1, 1, 0, 1, 1, 0

Q = 1, 0, 1, 0, 0, 1, 1

sim(D, Q) = 3
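The inner product is short enough to check by hand; the following Python sketch does so for the binary example and for the weighted vectors D1, D2 and Q from the graphical example. The function name `inner_product` is just a label chosen here.

```python
def inner_product(d, q):
    """sim(D, Q) = sum over k of d_k * q_k for two equal-length weight vectors."""
    return sum(dk * qk for dk, qk in zip(d, q))

# Binary vectors from the slide.
D = [1, 1, 1, 0, 1, 1, 0]
Q = [1, 0, 1, 0, 0, 1, 1]
binary_sim = inner_product(D, Q)

# Weighted vectors D1, D2 and Q over the terms T1, T2, T3.
sim_d1 = inner_product([2, 3, 5], [0, 0, 2])
sim_d2 = inner_product([3, 7, 1], [0, 0, 2])
```

For binary vectors the inner product simply counts the terms shared by document and query; for weighted vectors it is unbounded and favors long documents, which is one motivation for normalized measures.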

Cosine Similarity measure

$$\mathrm{CosSim}(d_j, q) = \frac{\vec{d_j} \cdot \vec{q}}{|\vec{d_j}|\,|\vec{q}|} = \frac{\sum_{i=1}^{t} w_{ij}\, w_{iq}}{\sqrt{\sum_{i=1}^{t} w_{ij}^{2} \cdot \sum_{i=1}^{t} w_{iq}^{2}}}$$

D1 = 2T1 + 3T2 + 5T3, CosSim(D1, Q) = 10 / √((4+9+25)(0+0+4)) = 0.81

D2 = 3T1 + 7T2 + 1T3, CosSim(D2, Q) = 2 / √((9+49+1)(0+0+4)) = 0.13

Q = 0T1 + 0T2 + 2T3

[Figure: D1, D2 and Q drawn as vectors in the term space with axes T1, T2 and T3.]
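The two cosine values can be reproduced with a few lines of Python. This is a sketch for dense weight vectors of equal length; returning 0.0 for an all-zero vector is a convention chosen here, not something the slides specify.

```python
import math

def cos_sim(d, q):
    """Cosine of the angle between two weight vectors."""
    dot = sum(a * b for a, b in zip(d, q))
    norm = math.sqrt(sum(a * a for a in d)) * math.sqrt(sum(b * b for b in q))
    return dot / norm if norm else 0.0

Q = [0, 0, 2]
cos_d1 = cos_sim([2, 3, 5], Q)
cos_d2 = cos_sim([3, 7, 1], Q)
```

Because the result depends only on direction, doubling every weight in Q leaves both values unchanged, and D1 is ranked above D2 for this query.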

Similarity Measures

Each measure is given twice: first with di and qk as vectors, then with di and qk as sets of keywords.

Inner Product:

$$\sum_{k=1}^{t} d_{ik}\, q_k \qquad\qquad |d_i \cap q_k|$$

Cosine:

$$\frac{\sum_{k=1}^{t} d_{ik}\, q_k}{\sqrt{\sum_{k=1}^{t} d_{ik}^{2} \, \sum_{k=1}^{t} q_k^{2}}} \qquad\qquad \frac{|d_i \cap q_k|}{\sqrt{|d_i|}\,\sqrt{|q_k|}}$$

Jaccard:

$$\frac{\sum_{k=1}^{t} d_{ik}\, q_k}{\sum_{k=1}^{t} d_{ik}^{2} + \sum_{k=1}^{t} q_k^{2} - \sum_{k=1}^{t} d_{ik}\, q_k} \qquad\qquad \frac{|d_i \cap q_k|}{|d_i \cup q_k|}$$
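A small Python sketch of the Jaccard measure in both its set form and its vector form follows. It reuses the binary document and query from the inner-product example; the assignment of the seven vector positions to the terms retrieval, database, architecture, computer, text, management, information is an assumption made for this illustration.

```python
def jaccard_sets(d, q):
    """|d ∩ q| / |d ∪ q| for two keyword sets."""
    union = d | q
    return len(d & q) / len(union) if union else 0.0

def jaccard_vectors(d, q):
    """sum(d*q) / (sum(d^2) + sum(q^2) - sum(d*q)) for two weight vectors."""
    dot = sum(a * b for a, b in zip(d, q))
    denom = sum(a * a for a in d) + sum(b * b for b in q) - dot
    return dot / denom if denom else 0.0

doc_set = {"retrieval", "database", "architecture", "text", "management"}
query_set = {"retrieval", "architecture", "management", "information"}
set_score = jaccard_sets(doc_set, query_set)

doc_vec = [1, 1, 1, 0, 1, 1, 0]
query_vec = [1, 0, 1, 0, 0, 1, 1]
vec_score = jaccard_vectors(doc_vec, query_vec)
```

For binary vectors the two forms coincide, since the sum of squares counts the keywords and the inner product counts the shared ones.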

Comments on Vector Space Models

Simple, mathematically based approach.

Considers both local (tf) and global (idf) word occurrence frequencies.

Provides partial matching and ranked results.

Tends to work quite well in practice despite obvious weaknesses.

Allows efficient implementation for large document collections.

Problems with Vector Space

There is no real theoretical basis for the assumption of a term space

it is more for visualization than having any real basis

most similarity measures work about the same regardless of model

Terms are not really orthogonal dimensions

Terms are not independent of all other terms

Semantic IR

Users and authors (or indexers) use different vocabularies

Polysemy problem (words having multiple meanings)

Synonymy problem (multiple words having the same meaning)

Using a dictionary of synonymy and polysemy inside the IR system.

Latent Semantic Indexing (LSI), using Singular Value Decomposition (SVD)

Identifying the correlation between terms by means of singular values (e.g. “car” and “auto” both occurring alongside “gasoline”, … in different docs)

SVD provides a solution to this, and in doing so the full decomposition captures all the info in the original array, without loss.

Keeping only the largest singular values reduces the size of the matrix to operate on. (Deals with non-sparse parts)

Places similar elements closer to each other.

Allows the reconstruction of the original matrix from the reduced form, with some loss of precision.
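The two statements about loss can be made precise with the standard SVD identities. The notation below (A for the term-document matrix, r for its rank, k for the number of singular values kept) is introduced here for illustration and does not appear on the slides.

```latex
% Full SVD: exact, no information is lost.
A = U \Sigma V^{T}, \qquad
\Sigma = \operatorname{diag}(\sigma_1, \dots, \sigma_r), \quad
\sigma_1 \ge \sigma_2 \ge \dots \ge \sigma_r > 0

% Truncated SVD used by LSI: keep only the k largest singular values.
A \approx A_k = U_k \Sigma_k V_k^{T}, \qquad k < r

% A_k is the best rank-k approximation of A in the Frobenius norm,
% and the reconstruction error is the energy of the discarded values.
\lVert A - A_k \rVert_F^{2} = \sum_{i=k+1}^{r} \sigma_i^{2}
```

Documents and queries are then compared in the k-dimensional space, where terms that tend to co-occur with the same neighbours end up close together.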

Information Retrieval Systems

Information retrieval (IR) systems use a simpler data model than database systems

Information organized as a collection of documents

Documents are unstructured, no schema

Information retrieval locates relevant documents, on the basis of user input such as keywords or example documents

e.g., find documents containing the words “database systems”

Can be used even on textual descriptions provided with non-textual data such as images

IR on Web documents has become extremely important

E.g. Google, AltaVista, …

Information Retrieval Systems (Cont.)

Differences from database systems

IR systems don’t deal with transactional updates (including concurrency control and recovery)

Database systems deal with structured data, with schemas that define the data organization

IR systems deal with some querying issues not generally addressed by database systems

Approximate searching by keywords

Ranking of retrieved answers by estimated degree of relevance

Query Modification Process

F: accepts relevance judgement from the user and produces as output sets of relevant and nonrelevant documents

G: implements the feedback formula (for rewriting the original query)

[Figure: feedback loop with the boxes Retrieval Process, F and G and the flow labels original query Q, ranked output, relevancy judgement, rel. & nonrel. documents, reformulated query Q’.]

The Effect of Relevance Feedback

[Figure: relevant documents (marked x) and nonrelevant documents plotted around the original query, which retrieved five documents, and around the reformulated query.]

The Basic Idea of Query Modification

Terms that occur in relevant documents are added to the original query vectors, or the weight of such terms is increased by an appropriate factor in constructing the new query statements

Terms occurring in nonrelevant documents are deleted from the original query statements, or the weight of such terms is appropriately reduced
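One common way to turn these two rules into a formula is a Rocchio-style update, sketched below in Python. The slides do not give a feedback formula, so the form of the update, the constants `alpha`, `beta`, `gamma` and the toy vectors are all assumptions for illustration only.

```python
def reformulate(query, relevant, nonrelevant, alpha=1.0, beta=0.75, gamma=0.25):
    """Rocchio-style update: q' = alpha*q + beta*mean(relevant) - gamma*mean(nonrelevant).

    Vectors are dicts mapping term -> weight; terms whose new weight is not
    positive are dropped from the reformulated query.
    """
    terms = set(query)
    for doc in relevant + nonrelevant:
        terms |= set(doc)

    def mean_weight(docs, term):
        return sum(d.get(term, 0.0) for d in docs) / len(docs) if docs else 0.0

    new_query = {}
    for t in terms:
        w = (alpha * query.get(t, 0.0)
             + beta * mean_weight(relevant, t)
             - gamma * mean_weight(nonrelevant, t))
        if w > 0:
            new_query[t] = w
    return new_query

q = {"jaguar": 1.0}
rel = [{"jaguar": 1.0, "car": 1.0}]        # judged relevant by the user
nonrel = [{"jaguar": 1.0, "cat": 1.0}]     # judged nonrelevant
q_new = reformulate(q, rel, nonrel)
```

In this toy run "car" is added to the query, "jaguar" gains weight, and "cat" is excluded because its weight would be negative.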

Keyword Search

In full text retrieval, all the words in each document are considered to be keywords. We use the word term to refer to the words in a document

Information-retrieval systems typically allow query expressions formed using keywords and the logical connectives and, or, and not

Ands are implicit, even if not explicitly specified

Ranking of documents on the basis of estimated relevance to a query is critical

Relevance ranking is based on factors such as

Term frequency

– Frequency of occurrence of query keyword in document

Inverse document frequency

– How many documents the query keyword occurs in

» Fewer documents give more importance to the keyword

Hyperlinks to documents

– More links to a document imply the document is more important

Relevance Ranking Using Terms

TF-IDF (Term frequency/Inverse Document frequency) ranking:

Let n(d) = number of terms in the document d

n(d, t) = number of occurrences of term t in the document d.

Then relevance of a document d to a term t

$$r(d, t) = \log\left(1 + \frac{n(d, t)}{n(d)}\right)$$

The log factor is to avoid excessive weightage to frequent terms

And relevance of document to query Q

$$r(d, Q) = \sum_{t \in Q} \frac{r(d, t)}{n(t)}$$

where n(t) is the number of documents that contain term t
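The two formulas translate directly into code. The Python sketch below scores three made-up documents against a two-term query, taking n(t) to be the number of documents containing t and using natural logarithms; the tiny corpus and the function names are assumptions for this example.

```python
import math

# Hypothetical corpus: document id -> list of terms.
docs = {
    "d1": "database systems and database design".split(),
    "d2": "information retrieval systems".split(),
    "d3": "web search engines".split(),
}

def r_term(d, t):
    """r(d, t) = log(1 + n(d, t) / n(d))."""
    words = docs[d]
    return math.log(1 + words.count(t) / len(words))

def n_docs(t):
    """n(t): number of documents that contain term t."""
    return sum(1 for words in docs.values() if t in words)

def r_query(d, query):
    """r(d, Q) = sum over t in Q of r(d, t) / n(t); unseen terms contribute 0."""
    return sum(r_term(d, t) / n_docs(t) for t in query if n_docs(t) > 0)

query = ["database", "systems"]
ranking = sorted(docs, key=lambda d: r_query(d, query), reverse=True)
```

Here d1 comes first because it contains the rarer term "database" twice, d2 matches only the more common term "systems", and d3 matches nothing.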

Relevance Ranking Using Terms (Cont.)

Most systems add to the above model

Words that occur in title, author list, section headings, etc. are given greater importance

Words whose first occurrence is late in the document are given lower importance

Very common words such as “a”, “an”, “the”, “it” etc. are eliminated

Called stop words

Proximity: if keywords in query occur close together in the document, the document has higher importance than if they occur far apart

Documents are returned in decreasing order of relevance score

Usually only top few documents are returned, not all

Relevance Using Hyperlinks

When using keyword queries on the Web, the number of documents is enormous (many billions)

Number of documents relevant to a query can be enormous if only term frequencies are taken into account

Most of the time people are looking for pages from popular sites

Idea: use popularity of Web site (e.g. how many people visit it) to rank site pages that match given keywords

Problem: hard to find actual popularity of site

Solution: next slide

Relevance Using Hyperlinks (Cont.)

Solution: use number of hyperlinks to a site as a measure of the popularity or prestige of the site

Count only one hyperlink from each site (why?)

Popularity measure is for site, not for individual page

Most hyperlinks are to root of site

Site-popularity computation is cheaper than page popularity computation

Refinements

When computing prestige based on links to a site, give more weightage to links from sites that themselves have higher prestige

Definition is circular

Set up and solve system of simultaneous linear equations

Above idea is basis of the Google PageRank ranking mechanism
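The circular definition can be resolved numerically. The Python sketch below approximates the solution of the linear system by repeated substitution (power iteration) instead of solving it directly, in the spirit of PageRank. The damping factor of 0.85, the treatment of sites without outgoing links and the three-site graph are assumptions made for this illustration.

```python
def prestige(links, damping=0.85, iterations=50):
    """Iteratively estimate site prestige from a dict: site -> set of linked sites."""
    sites = sorted(set(links) | {t for targets in links.values() for t in targets})
    n = len(sites)
    rank = {s: 1.0 / n for s in sites}
    for _ in range(iterations):
        new = {s: (1 - damping) / n for s in sites}
        for s in sites:
            targets = links.get(s, set())
            if targets:
                # A site passes its prestige on in equal shares to the sites it links to.
                share = damping * rank[s] / len(targets)
                for t in targets:
                    new[t] += share
            else:
                # A site with no outgoing links spreads its prestige evenly.
                for t in sites:
                    new[t] += damping * rank[s] / n
        rank = new
    return rank

# One hyperlink counted per pair of sites, as suggested on the slide.
site_links = {"a": {"c"}, "b": {"c"}, "c": {"a"}}
scores = prestige(site_links)
```

Site c is linked from two sites and ends up with the highest score; a outranks b because its single incoming link comes from the prestigious site c.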

Relevance Using Hyperlinks (Cont.)

Connections to social networking theories that ranked prestige of people

E.g. the president of the US has a high prestige since many people know him

Someone known by multiple prestigious people has high prestige

Hub and authority based ranking

A hub is a page that stores links to many pages (on a topic)

An authority is a page that contains actual information on a topic

Each page gets a hub prestige based on prestige of authorities that it points to

Each page gets an authority prestige based on prestige of hubs that point to it

Again, prestige definitions are cyclic, and can be got by solving linear equations

Use authority prestige when ranking answers to a query
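The mutual definition of hub and authority prestige can be approximated by alternating the two updates until the scores settle. The Python sketch below does this with a fixed number of rounds and length normalization after each step; the iteration count, the normalization and the toy link graph are assumptions for illustration.

```python
import math

def hits(links, iterations=50):
    """Return (hub, authority) scores for a dict: page -> list of pages it links to."""
    pages = sorted(set(links) | {t for targets in links.values() for t in targets})
    hub = {p: 1.0 for p in pages}
    auth = {p: 1.0 for p in pages}
    for _ in range(iterations):
        # Authority prestige: sum of the hub prestige of pages pointing to the page.
        auth = {p: sum(hub[s] for s in pages if p in links.get(s, ())) for p in pages}
        norm = math.sqrt(sum(v * v for v in auth.values())) or 1.0
        auth = {p: v / norm for p, v in auth.items()}
        # Hub prestige: sum of the authority prestige of the pages it points to.
        hub = {p: sum(auth[t] for t in links.get(p, ())) for p in pages}
        norm = math.sqrt(sum(v * v for v in hub.values())) or 1.0
        hub = {p: v / norm for p, v in hub.items()}
    return hub, auth

page_links = {"h1": ["x", "y"], "h2": ["x"]}
hub_scores, auth_scores = hits(page_links)
```

Page x is pointed to by both hubs and gets the highest authority score, while h1 is the better hub because it points to both authorities.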

Similarity Based Retrieval

Similarity based retrieval - retrieve documents similar to a given document

Similarity may be defined on the basis of common words

E.g. find the k terms in the given document A with highest r(A, t) and use these terms to find relevance of other documents; each of the terms carries a weight of r(A, t)

Similarity can be used to refine answer set to keyword query

User selects a few relevant documents from those retrieved by keyword query, and system finds other documents similar to these

Synonyms and Homonyms

Synonyms

E.g. document: “motorcycle repair”, query: “motorcycle maintenance”

need to realize that “maintenance” and “repair” are synonyms

System can extend query as “motorcycle and (repair or maintenance)”

Homonyms

E.g. “object” has different meanings as noun/verb

Can disambiguate meanings (to some extent) from the context

Extending queries automatically using synonyms can be problematic

Need to understand intended meaning in order to infer synonyms

Or verify synonyms with user

Synonyms may have other meanings as well

Indexing of Documents

An inverted index maps each keyword Ki to a set of documents Si that contain the keyword

Documents identified by identifiers

Inverted index may record

Keyword locations within document to allow proximity based ranking

Counts of number of occurrences of keyword to compute TF

and operation: Finds documents that contain all of K1, K2, ..., Kn.

Intersection S1 ∩ S2 ∩ ..... ∩ Sn

or operation: documents that contain at least one of K1, K2, …, Kn

union, S1 ∪ S2 ∪ ..... ∪ Sn

Each Si is kept sorted to allow efficient intersection/union by merging

“not” can also be efficiently implemented by merging of sorted lists
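A minimal Python sketch of an inverted index with sorted posting lists and merge-based and, or and not follows. The three toy documents and the function names are made up for this example; positions and occurrence counts are left out to keep it short.

```python
def build_index(docs):
    """Map each keyword to a sorted list of identifiers of documents containing it."""
    index = {}
    for doc_id in sorted(docs):
        for term in set(docs[doc_id].lower().split()):
            index.setdefault(term, []).append(doc_id)
    return index

def intersect(a, b):
    """and: merge two sorted lists, keeping identifiers present in both."""
    i = j = 0
    out = []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i]); i += 1; j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return out

def union(a, b):
    """or: merge two sorted lists, keeping each identifier once."""
    i = j = 0
    out = []
    while i < len(a) or j < len(b):
        if j == len(b) or (i < len(a) and a[i] < b[j]):
            out.append(a[i]); i += 1
        elif i == len(a) or b[j] < a[i]:
            out.append(b[j]); j += 1
        else:
            out.append(a[i]); i += 1; j += 1
    return out

def difference(a, b):
    """a and not b: walk both sorted lists once."""
    out = []
    j = 0
    for x in a:
        while j < len(b) and b[j] < x:
            j += 1
        if j == len(b) or b[j] != x:
            out.append(x)
    return out

docs = {1: "database systems", 2: "information retrieval systems", 3: "database retrieval"}
index = build_index(docs)
```

Each operation touches every posting at most once, which is why keeping the lists sorted pays off.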

Measuring Retrieval Effectiveness

IR systems save space by using index structures that support only approximate retrieval. May result in:

false negative (false drop) - some relevant documents may not be retrieved.

false positive - some irrelevant documents may be retrieved.

For many applications a good index should not permit any false drops, but may permit a few false positives.

Relevant performance metrics:

Precision - what percentage of the retrieved documents are relevant to the query.

Recall - what percentage of the documents relevant to the query were retrieved.

Performance Evaluation of Information Retrieval Systems

Why is System Evaluation Needed?

There are many retrieval systems on the market, which one is the best?

When the system is in operation, is the performance satisfactory? Does it deviate from the expectation?

To fine tune a query to obtain the best result (for a particular set of documents and application)

To determine the effects of changes made to an existing system (system A versus system B)

Efficiency: speed

Effectiveness: how good is the result?

Difficulties in Evaluating IR System

Effectiveness is related to relevancy of items retrieved

Relevancy is not a binary evaluation but a continuous function

Relevancy, from a human judgement standpoint, is

subjective - depends upon a specific user’s judgement

situational - relates to user’s requirement

cognitive - depends on human perception and behavior

temporal - changes over time

Retrieval Effectiveness - Precision and Recall

$$\text{recall} = \frac{\text{number of relevant documents retrieved}}{\text{total number of relevant documents}}$$

$$\text{precision} = \frac{\text{number of relevant documents retrieved}}{\text{total number of documents retrieved}}$$

[Figure: the entire document collection, with the set of relevant documents overlapping the set of retrieved documents.]

              retrieved                not retrieved
relevant      retrieved & relevant     not retrieved but relevant
irrelevant    retrieved & irrelevant   not retrieved & irrelevant
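Both measures follow directly from the overlap of the retrieved and relevant sets. The Python sketch below computes them for made-up document identifiers; returning `None` when a measure is undefined is a choice made here for illustration.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for two sets of document identifiers.

    A measure is reported as None when its denominator is zero.
    """
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else None
    recall = hits / len(relevant) if relevant else None
    return precision, recall

retrieved = {1, 2, 3, 4}        # what the system returned (hypothetical)
relevant = {2, 4, 5, 6, 7}      # what the judges marked relevant (hypothetical)
p, r = precision_recall(retrieved, relevant)
```

Two of the four retrieved documents are relevant, so precision is 0.5, and two of the five relevant documents were found, so recall is 0.4.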

Precision and Recall

Precision

evaluates the correlation of the query to the database

an indirect measure of the completeness of indexing algorithm

Recall

the ability of the search to find all of the relevant items in the database

Among three numbers, only two are always available

total number of items retrieved

number of relevant items retrieved

total number of relevant items is usually not available

Unfortunately, precision and recall affect each other in the opposite direction! Given a system:

Broadening a query will increase recall but lower precision

Increasing the number of documents returned has the same effect

Total Number of Relevant Items

Problem: which documents are actually relevant, and which are not

Usual solution: human judges

Create a corpus of documents and queries, with humans deciding which documents are relevant to which queries

In an uncontrolled environment (e.g., the web), it is unknown.

Two possible approaches to get estimates

Sampling across the database and performing relevance judgment on the returned items

Apply different retrieval algorithms to the same database for the same query. The aggregate of relevant items is taken as the total number of relevant documents in the collection

Relationship between Recall and Precision

[Figure: precision (vertical axis, 0 to 1) against recall (horizontal axis, 0 to 1). The high-precision, low-recall region is labeled "Return mostly relevant documents but miss many relevant ones"; the high-recall, low-precision region is labeled "Return most of the relevant documents but include many junks"; the upper right corner is labeled "The ideal".]

Fallout Rate

Problems with precision and recall:

A query on “Hong Kong” will return most relevant documents but it doesn’t tell you how good or how bad the system is! (What is the chance that a randomly picked document is relevant to the query?)

number of irrelevant documents in the collection is not taken into account

recall is undefined when there is no relevant document in the collection

precision is undefined when no document is retrieved

$$\text{Fallout} = \frac{\text{no. of nonrelevant items retrieved}}{\text{total no. of nonrelevant items in the collection}}$$

A good system should have high recall and low fallout
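Unlike precision and recall, fallout needs the size of the whole collection. The Python sketch below computes it for a hypothetical collection of 100 documents, of which 5 are relevant and 14 were retrieved; all numbers and identifiers are invented for the example.

```python
def fallout(retrieved, relevant, collection_size):
    """Nonrelevant items retrieved divided by all nonrelevant items in the collection."""
    nonrelevant_total = collection_size - len(relevant)
    if nonrelevant_total == 0:
        return None  # undefined only if every item in the collection is relevant
    nonrelevant_retrieved = len(retrieved - relevant)
    return nonrelevant_retrieved / nonrelevant_total

relevant = {1, 2, 3, 4, 5}
retrieved = set(range(1, 15))           # 14 documents, containing all 5 relevant ones
score = fallout(retrieved, relevant, collection_size=100)
```

The system retrieved 9 of the 95 nonrelevant documents, so fallout is about 0.09 while recall is 1.0, the combination the slide describes as desirable.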

Fallout (cont)

Fallout can be viewed as the counterpart of recall for nonrelevant items

It is very unlikely to have a 0/0 situation

the number of non-relevant items in a collection can safely be assumed to be non-zero.

It is the probability that a nonrelevant item is retrieved. (Recall: the probability that a relevant item is retrieved.)

Among three measures, precision, recall and fallout, fallout is least sensitive to the accuracy of the search process

A good system should have high recall and low fallout

Computation of Recall and Precision

Suppose: total no. of relevant docs = 5

n     doc #   relevant   Recall   Precision
1     588     x          0.2      1.00
2     589     x          0.4      1.00
3     576                0.4      0.67
4     590     x          0.6      0.75
5     986                0.6      0.60
6     592     x          0.8      0.67
7     984                0.8      0.57
8     988                0.8      0.50
9     578                0.8      0.44
10    985                0.8      0.40
11    103                0.8      0.36
12    591                0.8      0.33
13    772     x          1.0      0.38
14    990                1.0      0.36

n = 1: R = 1/5 = 0.2; p = 1/1 = 1

n = 2: R = 2/5 = 0.4; p = 2/2 = 1

n = 3: R = 2/5 = 0.4; p = 2/3 = 0.67

n = 13: R = 5/5 = 1; p = 5/13 = 0.38
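The table can be generated by walking down the ranked list and counting relevant documents seen so far. The Python sketch below uses the document numbers and relevance marks from the slide; the function name is chosen for this example.

```python
ranking = [588, 589, 576, 590, 986, 592, 984, 988, 578, 985, 103, 591, 772, 990]
relevant = {588, 589, 590, 592, 772}    # the five documents marked x

def recall_precision_at_each_rank(ranking, relevant):
    """Return one (n, doc, recall, precision) row per position in the ranked list."""
    rows = []
    hits = 0
    for n, doc in enumerate(ranking, start=1):
        if doc in relevant:
            hits += 1
        rows.append((n, doc, hits / len(relevant), hits / n))
    return rows

rows = recall_precision_at_each_rank(ranking, relevant)
for n, doc, recall, precision in rows:
    print(f"{n:>2} {doc:>4} {recall:.1f} {precision:.2f}")
```

Recall can only stay level or rise as n grows, while precision drops whenever a nonrelevant document is added and recovers slightly at each relevant one.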

Computation of Recall and Precision

[Figure: precision (vertical axis, 0.2 to 1.0) plotted against recall (horizontal axis, 0.2 to 1.0) for the same ranked list; the plotted points are labeled with their rank n (1, 2, 3, 4, 5, 6, 7, 12, 13, 200).]

Compare Two or More Systems

Computing recall and precision values for two or more systems

Superimposing the results in the same graph

The curve closest to the upper right-hand corner of the graph indicates the best performance

TREC (Text REtrieval Conference) Benchmark

[Figure: precision (0 to 1) plotted against recall (0.1 to 1) for two systems, labeled Stem and Thesaurus.]

Web Crawling

Web crawlers are programs that locate and gather information on the Web

Recursively follow hyperlinks present in known documents, to find other documents

Starting from a seed set of documents

Fetched documents

Handed over to an indexing system

Can be discarded after indexing, or stored as a cached copy

Crawling the entire Web would take a very large amount of time

Search engines typically cover only a part of the Web, not all of it

Take months to perform a single crawl

Web Crawling (Cont.)

Crawling is done by multiple processes on multiple machines, running in parallel

Set of links to be crawled stored in a database

New links found in crawled pages added to this set, to be crawled later

Indexing process also runs on multiple machines

Creates a new copy of index instead of modifying old index

Old index is used to answer queries

After a crawl is “completed” new index becomes “old” index

Multiple machines used to answer queries

Indices may be kept in memory

Queries may be routed to different machines for load balancing
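The core crawling loop (a set of links still to be crawled, plus a record of links already seen) can be sketched in a few lines of Python. To keep the example self-contained it crawls a hypothetical in-memory link graph named `WEB` instead of fetching real pages, and it runs in a single process rather than in parallel.

```python
from collections import deque

# Hypothetical link graph standing in for the Web: page -> pages it links to.
WEB = {
    "seed": ["a", "b"],
    "a": ["b", "c"],
    "b": ["seed"],
    "c": [],
    "island": ["a"],   # never linked from the seed set, so never found
}

def crawl(seeds, fetch_links, max_pages=100):
    """Breadth-first crawl from the seed set; returns pages in the order fetched."""
    frontier = deque(seeds)      # links waiting to be crawled
    seen = set(seeds)            # links already queued, to avoid refetching
    fetched = []
    while frontier and len(fetched) < max_pages:
        url = frontier.popleft()
        fetched.append(url)      # a real crawler would hand the page to the indexer here
        for link in fetch_links(url):
            if link not in seen:
                seen.add(link)
                frontier.append(link)
    return fetched

order = crawl(["seed"], lambda url: WEB.get(url, []))
```

The page "island" illustrates why search engines cover only part of the Web: a page that no known document links to is never discovered.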

Browsing

Storing related documents together in a library facilitates browsing

users can see not only requested document but also related ones.

Browsing is facilitated by classification system that organizes logically related documents together.

Organization is hierarchical: classification hierarchy

Page 51: A Classification Hierarchy For A Library System

Page 52: Classification DAG

Documents can reside in multiple places in a hierarchy in an information retrieval system, since physical location is not important.

The classification hierarchy is thus a directed acyclic graph (DAG)
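A small sketch of why the structure is a DAG rather than a tree: once a category may have more than one parent, it can be reached along several browsing paths. The category names in `PARENTS` and the helper names are invented for this illustration.

```python
# Hypothetical categories; each maps to its parent categories.
PARENTS = {
    "books": [],
    "science": ["books"],
    "engineering": ["books"],
    "math": ["science"],
    "computer science": ["science", "engineering"],  # two parents: a DAG, not a tree
    "algorithms": ["math", "computer science"],
}

def ancestors(category, parents=PARENTS):
    """Return every category reachable by following parent links upward."""
    found = set()
    stack = list(parents[category])
    while stack:
        node = stack.pop()
        if node not in found:
            found.add(node)
            stack.extend(parents[node])
    return found

def browse_paths(category, parents=PARENTS):
    """List every root-to-category path a user could browse along."""
    if not parents[category]:
        return [[category]]
    return [path + [category]
            for parent in parents[category]
            for path in browse_paths(parent, parents)]

paths = browse_paths("algorithms")
```

Here `paths` holds three distinct routes to "algorithms", which a strict tree could not represent without copying the node.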

Page 53: A Classification DAG For A Library Information Retrieval System

Page 54: Web Directories

A Web directory is just a classification directory on Web pages

E.g. Yahoo! Directory, Open Directory Project

Issues:

What should the directory hierarchy be?

Given a document, which nodes of the directory are categories relevant to the document?

Often done manually

Classification of documents into a hierarchy may be done based on term similarity
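One way to read "based on term similarity" is cosine similarity between a document's term counts and a term profile per directory node. The sketch below assumes that reading; the profiles, the `threshold` value, and the function names are illustrative assumptions, not something the slides specify.

```python
import math
from collections import Counter

def term_vector(text):
    """Bag-of-words term counts for a piece of text."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two term-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

def classify(doc_text, category_profiles, threshold=0.2):
    """Return the categories whose profile is similar enough, best first."""
    doc = term_vector(doc_text)
    scores = {name: cosine(doc, term_vector(profile))
              for name, profile in category_profiles.items()}
    relevant = [name for name, score in scores.items() if score >= threshold]
    return sorted(relevant, key=lambda name: -scores[name])

# Hypothetical category profiles (assumption).
profiles = {
    "sports": "football match team goal league",
    "computing": "software computer program code algorithm",
}
labels = classify("the team scored a goal in the football match", profiles)
```

Because `classify` returns a list, a document can land in several nodes at once, which matches the DAG view of the directory.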

Page 55: What is Computational Creativity?

Computational Creativity is a small sub-field of artificial intelligence

Its focus is the study and support, through computational methods, of behaviour which, in humans, would be deemed “creative”

Ranging from intelligent digital libraries to systems which create music, art, scientific theories, etc.

Page 56: Work in CCC

Work in CCC falls into various categories:

literary forensics (Dr Peter Smith/Dr Gea De Jong)

computer-based musicology (Dr Geraint Wiggins/Tim Crawford/David Lewis/Michael Gale)

intelligent digital signal & score processing (Dr Michael Casey/Dr Geraint Wiggins/Dr Darrell Conklin/Dave Meredith/Miguel Ferrand)

computational music cognition (Dr Geraint Wiggins/Dr Andrés Melo/Dave Meredith/Marcus Pearce/Miguel Ferrand)

intelligent composition and performance systems (Dr Geraint Wiggins/Dr Darrell Conklin/Dr John Drever/Marcus Pearce/Tak-Shing Chan/Prof Simon Emmerson/Prof Denis Smalley)

formal models of creative systems (Dr Geraint Wiggins)

Page 57: Content-Based Information Retrieval

Driven by large volumes of multimedia.

Search terabytes of sound and images by similarity.

International standardisation makes it work globally (like the WWW).

Page 58: MPEG-7 International Standard

ISO/IEC JTC 1/SC 29/WG 11 [MPEG]

ISO-15938 2001 Part 4 (Audio) [MPEG-7]

Multimedia Content Description Interface

Page 59: Audio Information Retrieval

MPEG-7 Database

A pre-indexed collection of sounds

Page 60: Audio Information Retrieval

[Diagram: Audio Query, Extract, Segment, Match, MPEG-7 Database, Result List]

A sound or scene or list of sounds

Page 61: Audio Information Retrieval

[Diagram: Audio Query, Extract, Segment, Match, MPEG-7 Database, Result List]

Extract: feature extraction from audio.

Page 62: Audio Information Retrieval

[Diagram: Audio Query, Extract, Segment, Match, MPEG-7 Database, Result List]

Segment: partitioning of audio into chunks.

Page 63: Audio Information Retrieval

[Diagram: Audio Query, Extract, Segment, Match, MPEG-7 Database, Result List]

Match: find similar chunks of audio

Page 64: Audio Information Retrieval

[Diagram: Audio Query, Extract, Segment, Match, MPEG-7 Database, Result List, Creativity Support Application, User, Collect, Relate]

Use results for creativity support
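The extract, segment and match steps of the audio slides can be illustrated with a toy pipeline. Everything concrete here is an assumption made for the sketch: the two features (mean energy and zero-crossing rate) are simple stand-ins and are not MPEG-7 descriptors, the "database" is a short Python list, and the sounds are synthetic sine tones.

```python
import math

def segment(samples, chunk_size):
    """Partition samples into consecutive fixed-size chunks (ragged tail dropped)."""
    return [samples[i:i + chunk_size]
            for i in range(0, len(samples) - chunk_size + 1, chunk_size)]

def extract(chunk):
    """Toy features for a chunk of at least two samples (NOT MPEG-7 descriptors)."""
    energy = sum(x * x for x in chunk) / len(chunk)
    crossings = sum(1 for a, b in zip(chunk, chunk[1:]) if (a < 0) != (b < 0))
    return (energy, crossings / (len(chunk) - 1))

def match(query_features, database, k=1):
    """Return the k pre-indexed sounds nearest to the query in feature space."""
    ranked = sorted(database, key=lambda entry: math.dist(query_features, entry[1]))
    return [name for name, _ in ranked[:k]]

def tone(freq, n=200, rate=1000):
    """Synthetic sine tone used as stand-in audio."""
    return [math.sin(2 * math.pi * freq * t / rate) for t in range(n)]

# Pre-indexed collection of sounds: (name, features of the first chunk).
database = [(name, extract(segment(tone(freq), 200)[0]))
            for name, freq in [("low hum", 20), ("mid tone", 100), ("high whine", 400)]]

query = extract(segment(tone(110), 200)[0])   # audio query, features extracted
result = match(query, database)               # result list
```

The database stores features rather than raw audio, which is the sense in which the collection is pre-indexed: matching only compares small feature tuples.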

Page 65: Arabic IR

Page 66: Creating the Database

Method 1: Without Morphology

Index the text based on the form of the word

Method 2: With Morphology

Index the text based on the stem of the word
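The difference between the two methods shows up in what a query term can reach. The sketch below uses English words and a deliberately crude suffix stripper, `toy_stem`, purely for illustration; it is not an Arabic morphological analyzer, and the sample documents are invented.

```python
def toy_stem(word):
    """Crude suffix stripper for illustration only (not real morphology)."""
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[:-len(suffix)]
    return word

def build_index(docs, normalize=lambda word: word):
    """Inverted index: normalized term -> set of document ids."""
    index = {}
    for doc_id, text in docs.items():
        for word in text.lower().split():
            index.setdefault(normalize(word), set()).add(doc_id)
    return index

docs = {1: "indexing documents", 2: "indexed document", 3: "index"}

form_index = build_index(docs)             # Method 1: index by word form
stem_index = build_index(docs, toy_stem)   # Method 2: index by stem
```

With the form-based index, the term "index" reaches only document 3; with the stem-based index it reaches all three, since "indexing" and "indexed" reduce to the same stem. The query term has to be stemmed the same way before lookup.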

Page 67: Retrieval Systems

Monolingual retrieval system

Arabic query

Returns Arabic documents

Cross-lingual retrieval system (Arabic Translingual System)

English query

Translated using an online dictionary with human selection of terms

Returns Arabic documents and translations
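Dictionary-based query translation for the cross-lingual case might look like the sketch below. The three dictionary entries are a toy sample, and the `choose` callback is a hypothetical stand-in for the human selection of terms mentioned above; by default it simply keeps the first candidate.

```python
# Hypothetical toy dictionary: English term -> candidate Arabic translations.
TOY_DICTIONARY = {
    "book": ["كتاب"],
    "library": ["مكتبة"],
    "school": ["مدرسة"],
}

def translate_query(query, dictionary,
                    choose=lambda term, candidates: candidates[:1]):
    """Translate an English query term by term.

    Returns (translated_terms, untranslated_terms). The choose callback
    picks among candidate translations for each term.
    """
    translated, untranslated = [], []
    for term in query.lower().split():
        candidates = dictionary.get(term, [])
        if candidates:
            translated.extend(choose(term, candidates))
        else:
            untranslated.append(term)   # no dictionary entry: left for the user
    return translated, untranslated

arabic_terms, missing = translate_query("school library budget", TOY_DICTIONARY)
```

The translated terms can then be run against the Arabic index exactly as a monolingual query would be; terms without a dictionary entry are reported rather than silently dropped.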

Page 68: Monolingual retrieval system

[Diagram: Arabic query, Morphology, Retrieve text, Display]

Page 69: Monolingual retrieval system

Enter query, select data source, search

Page 70: Monolingual retrieval system

List of documents and top document returned

Page 71: Arabic Translingual System

[Diagram: English query, Translate, Arabic query, Morphology, Retrieve text, Display/Translate]

Page 72: Arabic Translingual System

Type query in English, select Translate option

Page 73: Arabic Translingual System

Double click on any word to see dictionary entry

Page 74: Arabic Translingual System

Click on translation button for gisting translation

Page 75: Translingual System

Include syntax in translation system

Expand the bi-directional dictionaries

Improve Onomasticon

Perform automatic disambiguation of translated queries in cross-language system (necessary for TREC-9)

Using ontology?

Participate in TREC

Page 76: Distributed IR

[Diagram: an information need (?) routed to Engine 1, Engine 2, Engine 3, Engine 4, ..., Engine n]

Common scenarios:
• Multiple partitions, single service
• Independent engines, single organization
• Independent engines, affiliated organizations
• Independent engines, unaffiliated organizations

Defining dimensions:
• Cooperative vs. uncooperative engines
• Centralized vs. decentralized solutions
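For the centralized case, a broker can send the information need to every engine and merge the result lists. The sketch below assumes cooperative engines that return `(doc_id, score)` pairs; because independent engines score on different scales, it applies min-max normalization before merging. That normalization is one simple option chosen for illustration, and the two engines are hypothetical stand-ins.

```python
def normalize(results):
    """Min-max normalize one engine's (doc_id, score) list to the range [0, 1]."""
    if not results:
        return []
    scores = [score for _, score in results]
    lo, hi = min(scores), max(scores)
    span = hi - lo
    return [(doc_id, (score - lo) / span if span else 1.0)
            for doc_id, score in results]

def federated_search(query, engines, k=3):
    """Centralized broker: query every engine, merge normalized scores, keep top k."""
    merged = {}
    for engine in engines:
        for doc_id, score in normalize(engine(query)):
            # A document returned by several engines keeps its best score.
            merged[doc_id] = max(score, merged.get(doc_id, 0.0))
    return sorted(merged, key=lambda doc_id: (-merged[doc_id], doc_id))[:k]

# Hypothetical engines with incompatible score scales (assumption).
def engine_a(query):
    return [("a1", 12.0), ("a2", 6.0), ("a3", 0.0)]

def engine_b(query):
    return [("b1", 0.9), ("a2", 0.8), ("b2", 0.4)]

top = federated_search("information retrieval", [engine_a, engine_b])
```

With uncooperative engines that expose only ranks, the broker would have to merge on rank position instead of score, which is a different design from the one sketched here.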

Page 77: Any Questions?