
- 1 -

Information Retrieval Basics

- 2 -

Why is IR hard?

• IR is the process of matching Q vs D
• IR is hard for several reasons ..
  – Information is complex; we need to represent it, and computers are not good at doing this.
  – Context varies … what is a sponge?
  – Opinion varies … Good news? Honest? Funny?

- 3 -

Why is IR hard ?

– Semantics … e.g. bank note, Bank of England, West Bank
– Information needs must be expressed as a query, in a search box, but often we don't know what we want!
– We have problems …
  • Verbalising information needs
  • Understanding query syntax
  • Understanding search engines
– .. All mentioned earlier under "information seeking"

- 4 -

Users

Who are they?
What are they doing?
What are they using?

- 5 -

Users

• You have seen HCI already...
• Users are concerned with results.
  – They rarely understand or consider the underlying mechanism.
  – Even more rarely make use of sophisticated search tools.
• 1% use advanced search facilities.
• 10% use query syntax - often incorrectly!
• Average length of search queries is currently around 2.5 words!
• Accuracy (precision) is generally more important than quantity (recall).
  – Although there are special applications such as patent search where high recall is very important.

- 6 -

When they are searching, what do they want to find?

… blogs, tweets, babies, prices, dogs, sales, maps, products, recipes, news, answers, movies, books, flight schedules, mail, news stories, video clips, papers, calendar entries, goals, appointments, what their friends spoke about last night … and WWW pages …

- 7 -

Where are they from?

… US, Canada, UK, Norway, France, Sweden, Australia, Japan, Korea, Ireland, South Africa, India, Germany, Singapore, Russia, Italy, Brazil, Portugal, China, Mexico, SE Asia, New Zealand, Finland, …

- 8 -

How good is the User?

• User information needs must be expressed as a query…
  – But users don't often know what they want
  – or can't articulate the information need
  – or don't understand query syntax
  – or don't understand search engines
• Users also…
  – Don't give much explicit feedback
  – Don't look beyond the first page of results
  – Can't adequately express an information need as a query…

- 9 -

User Queries are…

• Misspelled
• Ambiguous
• Context sensitive
  – Novel information is better?
• Representative of different types of search request
  – Fact search, homepage finding, general…
• Usually textual in nature

- 10 -

… query log extract …

• responsibility • electronic parts tv repair do it yourself • gordon • can i find information about susan b. anthony • coupons off the web • nissan • cad • another word for hue • how do i reprogram a gm keyless remote • supervisory training programs ontario • where can i find pictures of 1989 z24 • alison norris • how to make man made boulders • godzilla soundtrack • spring lake nc • absolutelymale day • away in a manger • toysoldier • mastercraft boats • where can i find spanish recipes • tobacco • history of environmentalism silent spring • psychology of lying • country song lyrics • where can i find information about depression • alexander graham bell • friends scripts • booman • what is kwanza • fishing • where can i find art in 1964 • hubble • yellow pages of san diego california • project change forms • antique clocks • golden pacific systems • metallica • karagoz • javascipts • encoders • cheverlay • christian books • margaret kempe • angel policeman • pottery portland maine • woodstock sucks

- 11 -

What are people searching now?

http://trends.google.com/trends

- 12 -

What devices are they using?

- 13 -

Web Browsers & OS

http://www.w3counter.com/globalstats.php

- 14 -

Countries & Resolutions

http://www.w3counter.com/globalstats.php

- 15 -

- 16 -

Access Devices

• Desktop Computer
  – What resolution? What plug-ins?
• Mobile Device
  – What resolution? Support zooming? Location services? Bluetooth services?
• Games Console/TV
  – Integrated browsers are becoming the norm, so what level of interactivity?

- 17 -

Access Devices?

- 18 -

Ok, so that is users.. Users search using text. It is easier to express an information need that way… so let us look at text!

- 19 -

Text is:

A lot of IR difficulty arises because of the nature of text:

• word tokens from a surprisingly small lexicon or dictionary
• each of which independently conveys some meaning
• whose morphology is changed as words are concatenated into units of dialogue called sentences
• for which there is a grammar of allowable syntactic combinations to which sentences conform
• which are in turn concatenated to make prose which makes up documents
• which can reach a large enough size that they may be structurally organised
• and typically this is a hierarchy, to ease user navigation through the document
  – chapters, sections, sub-sections, paragraphs, sentences, clauses, phrases, words, morphemes, letters…

- 20 -

Problems with Text

Tokens are Words are Terms…
• Can be polysemous
  – e.g. BAR
• 'SMITH' is NOT 'Smith' is NOT 'smith'
  – or
• 'plane' is NOT 'aircraft'
  – or
• "cooking" is NOT "cooked" is NOT "cookery"

- 21 -

Information Retrieval

• Early days of IR
  – Indexing into an internal representation was a manual process
    • Keywords, abstraction and classification
  – Not possible now
• Nowadays IR is automatic
  – Documents converted into an internal representation
    • Not the original document, as this is inefficient
    • Rather, terms are removed and only important terms or concepts indexed
    • What terms are important?

- 22 -

Distribution of Word Frequencies

Zipf's Law:  $f(r) \times r = \text{constant}$

(the frequency $f(r)$ of a term multiplied by its frequency rank $r$ is roughly constant across the vocabulary)

- 23 -

Synthetic Power Law Data on Linear Axis

[Figure: synthetic power-law data plotted on linear axes — WebSites vs Visitors]

- 24 -

Synthetic Power Law Data on Log-Log Axis

[Figure: the same synthetic power-law data plotted on log-log axes — WebSites vs Visitors]

- 25 -

Genuine linkage distributions

[Figure: log-log plot of genuine on-site and off-site link distributions — x-axis: Outdegree, y-axis: Number of Web Pages. Off-site correlation = 0.9005, on-site correlation = 0.8542]

- 26 -

What does this tell us about text?

• We can make two observations relating to term importance:
  – Terms below the lower bound are considered too rare to be of benefit to the retrieval process.
    • They may be removed, but in practice this does not happen.
  – Terms above the upper bound are considered to occur too frequently to be of benefit.
    • They are usually removed from the internal document representation.
      – These words are referred to as stopwords.

- 27 -

Inverse Document Frequency

• How then do we allocate an importance weight to words in a document collection?
• We use a formula like IDF (Inverse Document Frequency)
  – allocates term importance which is inversely proportional to the total number of documents containing that term.
  – Two core principles:
    • The higher the document frequency (df), the less discriminating that term is…
    • The lower the document frequency, the more discriminating that term is.

!!"

#$$%

&=

jj df

Nidf log
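As a small illustration of the formula (the collection size N and the df values below are invented, not taken from any real collection), a rare term ends up with a much higher IDF than a near-ubiquitous one:

```python
import math

N = 1_000_000                                   # documents in the (invented) collection
df = {"the": 950_000, "retrieval": 12_000, "zipf": 40}

for term, df_j in df.items():
    idf_j = math.log(N / df_j)                  # idf_j = log(N / df_j)
    print(f"{term:<10} df={df_j:<8} idf={idf_j:.3f}")
# "the" -> ~0.05, "retrieval" -> ~4.4, "zipf" -> ~10.1 : rarer terms are more discriminating
```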

- 28 -

Stopword Removal

• We want to remove high frequency words from the indexing process.
  – Can be done automatically using a predefined stopword list.
  – These stopwords have high DF values.
• Benefits?
  – Smaller index (30-50% smaller)
• Problems?
  – Removing stopwords does cause difficulties in dealing with some valid queries… an example: "to be or not to be"…
  – Phrase searching can be affected!

- 29 -

Stopwords for English

a about above across after again against all almost alone along also although always am among an and another any anybody anyone anything anywhere apart are around as aside at away be because been before behind being below besides between beyond both but by can cannot could deep did do does doing done down downwards during each either else enough etc even ever every everybody everyone except far few for forth from get gets got had hardly has have having her here herself him himself his how however if in indeed instead into inward is it its itself just kept many maybe might mine more most mostly much must myself near neither next no nobody none nor not nothing nowhere of off often on only onto or other others ought our ours out outside over own per please plus quite rather really said seem self selves several shall she should since so some somebody somewhat still such than that the their theirs them themselves then there therefore these they this thorough thoroughly those through thus to together too toward towards under until up upon very was well were what whatever when whenever where whether which while who whom whose will with within without would yet young your yourself

- 30 -

Stemming

• Words can appear in different forms
  – Walk, walking, walks, walker
• We need some way to recognise common concept roots..
• The solution is stemming
  – Not a perfect solution…
    • Policy / police
    • Arm / army
    • Organisation / organ

- 31 -

Stemming

• Here the indexing terms are word stems, not words.
• Must happen to both documents and queries.
• Language dependent: available for most languages
  – A lot of development needed to make a new one.

Computer, Computing, Computational, Compute  →  comput

- 32 -

Porter Stemming

Porter's (1980) algorithm is popular:

• remove plurals, -ED, -ING
• terminal Y -> I when another vowel in stem
• map double suffixes to single ... -ISATION
• deal with -IC, -FULL, -NESS
• take off -ANT, -ENCE
• remove -E if word > 2

The code is available in most programming languages for downloading…
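For example, NLTK ships a ready-made implementation of Porter's algorithm (this assumes the nltk package is installed; it is not part of these notes):

```python
# Porter stemming via NLTK -- the computer/computing/computational/compute family
# all conflate to the stem "comput", as in the earlier example.
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
for word in ["computer", "computing", "computational", "compute", "walking", "walks"]:
    print(f"{word:<15} -> {stemmer.stem(word)}")
```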

- 33 -

Stemmers

• Language dependent ... English, American, French, Norwegian
• High cost in generating a stemming algorithm

"May I have information on the computational complexity of nearest neighbour problems in graph theory. This will give us:"

INFORM, COMPUT, COMPLEX, NEAR, NEIGHBOUR, PROBLEM, GRAPH, THEORI.

- 34 -

Summary

May I have information on the computational complexity of nearest neighbour problems in graph theory.

  ↓ Document Tokenisation & Term Normalisation

may i have information on the computational complexity of nearest neighbour problems in graph theory

  ↓ Stopword Removal

information computational complexity nearest neighbour problems graph theory

  ↓ Stemming

INFORM, COMPUT, COMPLEX, NEAR, NEIGHBOUR, PROBLEM, GRAPH, THEORI
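Putting the three stages together, here is a minimal sketch of the same pipeline in Python. The tiny stopword set is illustrative only (a real system would use a full list like the one shown earlier), and the exact stems produced depend on the stemmer used, so the output only approximates the slide's INFORM, COMPUT, … list.

```python
import re
from nltk.stem import PorterStemmer          # assumes the nltk package is installed

STOPWORDS = {"may", "i", "have", "on", "the", "of", "in"}   # illustrative subset only
stemmer = PorterStemmer()

text = ("May I have information on the computational complexity "
        "of nearest neighbour problems in graph theory.")

tokens = re.findall(r"[a-z]+", text.lower())            # tokenisation & normalisation
content = [t for t in tokens if t not in STOPWORDS]     # stopword removal
stems = [stemmer.stem(t) for t in content]              # stemming
print(stems)    # e.g. ['inform', 'comput', ...] -- close to, but not identical to, the slide
```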

- 35 -

IR models

• Having turned a document into a set of terms, how do we do retrieval?
• Need to model, mathematically, the retrieval process and from that derive retrieval implementations. So there is a taxonomy of very many IR models …

[Figure: taxonomy of IR models by user task]
• Retrieval
  – Classical Models: Boolean, Vector, Probabilistic
    • Boolean extensions: Extended Boolean, Fuzzy
    • Vector extensions: Generalised Vector, Latent Semantic Indexing, Neural Networks
    • Probabilistic extensions: Inference Network, Belief Network
  – Structured Models: Non-overlapping lists, Proximal Nodes
• Browsing
  – Flat, Structure Guided, Hypertext

- 36 -

Boolean IR

information AND retrieval NOT management
A AND B NOT C

[Venn diagram: sets A, B and C of documents containing each query term; the relevant documents are those in A and B but not in C]
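Since each circle in the diagram is just the set of documents containing a term, Boolean retrieval is plain set algebra. A small sketch with invented document IDs:

```python
# A, B, C are the documents containing "information", "retrieval" and "management";
# the IDs are invented purely to illustrate the set operations.
A = {1, 2, 3, 5, 8}     # information
B = {2, 3, 5, 9}        # retrieval
C = {3, 7}              # management

relevant = (A & B) - C  # information AND retrieval NOT management
print(sorted(relevant)) # [2, 5]
```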

- 37 -

Limitations of Boolean IR

• complexity of query formulation for multi-concept topics.
• Boolean logic is intimidating and off-putting.
• no control over the size of the output produced.
• Boolean formulations are restrictive and not powerful for subtle queries.
• no adequate ranking of output in decreasing probability of relevance.
• batch process with no feedback from user back into the search.
• no differentiation among terms in the query.

- 38 -

A Document Vector (Boolean)

Original text:
web directories are comprised of a structured hierarchy of pages each of which contains many links to other web pages based on the content of these pages these usually have been painstakingly handcrafted by people which make them very expensive to maintain and grow in line with the ever expanding web however they do act as excellent starting points for a user to browse the web if one views the web as a book then the web directory is like the table of contents with a high level overview of the contents of the www if you are just browsing a non-fictional book using the table of contents is a great way to quickly locate the desired section

After term cleaning and stopword removal:
web directories comprised structured hierarchy pages contains links other web pages based content pages painstakingly handcrafted people make expensive maintain grow line expanding web act excellent starting points user browse web views web book web directory table contents high level overview contents www browsing non fictional book table contents great way quickly locate desired section

Doc Vector (sorted unique terms):
act, based, book, browse, browsing, comprised, contains, content, contents, desired, directories, directory, excellent, expanding, expensive, fictional, great, grow, handcrafted, hierarchy, high, level, line, links, locate, maintain, make, non, other, overview, pages, painstakingly, people, points, quickly, section, starting, structured, table, user, views, way, web, www

- 39 -

A Document Vector (term-weighted)

The same document and Doc Vector as before, but instead of a binary 0/1 entry each term now carries its term frequency (TF) — the number of times it occurs in the document — so, for example, the frequently repeated term "web" gets a much larger value than the terms that occur only once.

- 40 -

How to implement term weighting … we need inverted files

• Inverted files allow for fast searching...
  – avoid having to search the entire document collection at query time.
• BOOK EXAMPLE: if you are looking through a history book for references to X, two choices:
  – Read or scan through each page and pull out references to the topic, or
  – Look at the index at the back of the book and it will point you at the relevant pages.
• An inverted file is to a Search Engine what an index is to a book.

- 41 -

Terms are the record keys

- 42 -

Conventional Inverted Index

[Figure: documents crawled by the Search Engine are converted into an internal representation to aid faster searching — a term–document matrix (T × D) addressed by TERM IDs and Doc IDs. The sample documents talk about wanting to do information management / retrieval / indexing / categorisation / filtering / routing / clustering / extraction / summarisation, all based on text content which is more than just the words used in a document: even when discussing a plane crash, the words we may use would be aeroplane, plane, aircraft, flight, airplane, crash, accident, disaster, even airbus & boeing… how should IR systems handle this?]

In Boolean IR, the term–document matrix contains 0s or 1s. In a term-weighted model, actual term weights are stored instead of the binary values of Boolean IR.

- 43 -

Conventional Inverted Index

This inverted file structure will allow us to generate a list of relevant documents by following these simple steps:

1. Accept a query, perhaps process it.
2. For each query term:
   – access the dictionary and get a listing of all the documents that contain that term
   – store the documents in a set identified by the query term.
3. And finally, by using set theory (using set intersection or difference operators) we can generate a list of relevant documents… e.g. A AND B NOT C

The final ranking of a document can be based on a count of the number of sets containing that particular document, or some other more complex techniques…
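A toy version of those steps in Python, with the inverted file held as a dict from term to a posting set of document IDs (the four documents are invented examples):

```python
docs = {
    1: "the cat sat on the mat",
    2: "my dog chased the cat",
    3: "dogs and cats make good pets",
    4: "a dog in the fog",
}

# Build the inverted file: term -> set of Doc IDs containing that term.
index = {}
for doc_id, text in docs.items():
    for term in set(text.split()):
        index.setdefault(term, set()).add(doc_id)

cat = index.get("cat", set())
dog = index.get("dog", set())
print(cat & dog)     # cat AND dog -> {2}
print(cat - dog)     # cat NOT dog -> {1}
print(cat | dog)     # cat OR dog  -> {1, 2, 4}
```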

- 44 -

Using a Conventional Inv. Index (1)

[Figure: the query term CAT is looked up in the Terms list to get its Term ID (CAT = 45); that row of the term–document matrix yields the Doc IDs of the relevant documents.]

Remember, the matrix is sorted termwise (a-z) to support fast identification of documents containing a given term.

- 45 -

Using a Conventional Inv. Index (2)

[Figure: looking up CAT (Term ID 45) in the matrix returns the relevant Doc IDs {3, 56, 67}, which are returned to the user.]

- 46 -

[Figure: for the query "CAT or DOG" (Term IDs 45 and 62) the matrix gives set(CAT) = {3, 56, 67} and set(DOG) = {56, 57}; Boolean algebra (union) produces the relevant Doc IDs {3, 56, 57, 67}, which are returned to the user.]

- 47 -

[Figure: for the query "CAT and DOG" the same two sets are retrieved, set(CAT) = {3, 56, 67} and set(DOG) = {56, 57}; Boolean algebra (intersection) produces the relevant Doc IDs {56}, returned to the user.]

- 48 -

Location Based Indexing

We may want to return documents that contain many terms in close proximity, so this may be necessary…

- 49 -

Desirable Features of an IR system

• Above and beyond what Boolean IR has to offer, we want:
  – ranked output rather than sets.
  – relevance feedback.
  – query modification/expansion.
• Done by incorporating term weights
  – Weights calculated using frequencies of occurrence in natural language
    • Most large text collections (in one language) will have the same statistical characteristics

- 50 -

How to realise non-Boolean retrieval

What we do know about text is that:
  – the most frequent words are function words
  – the least frequent words are obscure
  – mid-range words are content-bearing

• There are two (reasonable) assumptions we must make about the frequencies of words in text:
  – The more a document contains a given word, the more that document is about a concept represented by that word.
  – The more rarely a term occurs in individual documents in a collection, the more discriminating that term is.
• How to make these 'reasonable assumptions' into retrieval algorithms … we model the retrieval process, mathematically.

- 51 -

Vector Space Model

• Around since the 60s
  – Formulated by Gerry Salton at Cornell
• Relatively simple statistical model
  – Based on the two assumptions just mentioned
• Assigns non-binary weights to terms
  – In both docs and queries
• Non-binary weights are used to calculate the degree of similarity of docs to a query
  – Generating ranked output
    • The aim of which is to satisfy a user's info need

- 52 -

Vector Space Model

Doc A : cat[0.8], dog[0.1]
Doc B : cat[0.5], dog[0.9]
Query : cat[0.8], dog[0.7]

[Figure: a two-dimensional term space with axes cat and dog, each running from 0 to 1; Doc A is plotted as a vector.]

- 53 -

Vector Space Model

[Figure: the same cat/dog term space, now with Doc B plotted alongside Doc A.]

- 54 -

Vector Space Model

[Figure: the angle δ between the vectors for Doc A and Doc B; cos δ gives the similarity of A and B.]

- 55 -

Vector Space Model

[Figure: the Query vector (cat[0.8], dog[0.7]) is now plotted in the same space as Doc A and Doc B.]

- 56 -

Vector Space Model

[Figure: the angles Φ and δ between the Query vector and the two document vectors; cos Φ and cos δ, each in [0..1], measure how similar each document is to the query.]

- 57 -

Vector Space Model

• In reality there are more than two terms in a language
• You need an axis for each unique term in your collection
  – Millions
• How is it usually implemented?
  – TF-IDF

- 58 -

TF-IDF

Recall the two (reasonable) assumptions about the frequencies of words in text:
  – The more a document contains a given word, the more that document is about a concept represented by that word.
    • TF value
  – The more rarely a term occurs in individual documents in a collection, the more discriminating that term is.
    • IDF value.. We saw this before..
• Calculate TF-IDF values for query and document terms

- 59 -

TF-IDF (Ranked Output)

TF : Term Frequency… the number of times a term occurs in a document.
DF : Document Frequency… the number of documents a term occurs in.
IDF : Inverse Document Frequency… how important the term is to the document within the whole collection.

w_ij is the weight assigned to a term T_j in a document D_i.
tf_ij = frequency of term T_j in document D_i.
N = number of documents in the collection.
df_j = number of documents where term T_j occurs at least once.

Calculated for each unique term in each doc and query (mostly zero).. This creates a document vector.

!!"

#$$%

&'=

jijij df

Ntfw log
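A minimal sketch of this weighting in Python; the collection size, tf and df values are invented for illustration:

```python
import math

N = 50_000                                        # documents in the (invented) collection
doc_tf = {"gates": 2, "microsoft": 2, "computer": 2, "the": 41}
df = {"gates": 32, "microsoft": 210, "computer": 4_800, "the": 49_900}

weights = {t: tf * math.log(N / df[t]) for t, tf in doc_tf.items()}   # w = tf * log(N/df)
for term, w in sorted(weights.items(), key=lambda kv: -kv[1]):
    print(f"{term:<10} w={w:.3f}")                # rare terms dominate; "the" is near zero
```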

- 60 -

TF-IDF similarity

• Simple approach : dot product (angle of vectors)

• For length normalisation, use cosine similarity

$SIM(Doc_i, Query_j) = \sum_{k=1}^{t} \left( Term_{ik} \times QTerm_{jk} \right)$   (dot product only)

$COSINE(Doc_i, Query_j) = \frac{\sum_{k=1}^{t} \left( Term_{ik} \times QTerm_{jk} \right)}{\sqrt{\sum_{k=1}^{t} Term_{ik}^{2}} \times \sqrt{\sum_{k=1}^{t} QTerm_{jk}^{2}}}$   (dot product divided by the product of the Euclidean lengths)
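Applied to the cat/dog vectors from the earlier Vector Space Model figures, the two measures look like this in Python:

```python
import math

def dot(x, y):
    return sum(a * b for a, b in zip(x, y))

def cosine(x, y):
    # dot product divided by the product of the Euclidean lengths
    return dot(x, y) / (math.sqrt(dot(x, x)) * math.sqrt(dot(y, y)))

doc_a = [0.8, 0.1]          # (cat, dog) weights for Doc A
doc_b = [0.5, 0.9]          # Doc B
query = [0.8, 0.7]          # Query

print("SIM(A,Q)    =", round(dot(doc_a, query), 3))
print("SIM(B,Q)    =", round(dot(doc_b, query), 3))
print("COSINE(A,Q) =", round(cosine(doc_a, query), 3))
print("COSINE(B,Q) =", round(cosine(doc_b, query), 3))
```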

- 61 -

Finally with TF-IDF

The advantages of the vector model are:
• Its term-weighting scheme improves retrieval performance over Boolean IR.
• It allows retrieval of documents that approximately match a query.
• It is easy to sort documents according to their degree of similarity.

BUT it does not allow for dependencies between terms … e.g. "information retrieval"…

Let's have a look at it in operation…

- 62 -

Step 1 – generate doc Vectors

Three example documents:
• "MultiMedia Information Systems (CA4, CAE4 & CL4), a final year undergraduate course for the B.Sc. in Computer Applications and the B.Sc. in Applied Computational Linguistics, delivered to the full-time and to the part-time classes"
• "The history of the Personal Computer since 1985 is focussed on one company. That company is Microsoft. Microsoft was founded by Bill Gates in the late 70s. Gates was not a computer graduate, but had been using computers since he…"
• "Jaguar-Racing's recruitment drive has continued, the team announcing today the appointment of Italian Guenther Steiner as its new Managing Director."

Convert each into a Document Vector of unique terms and generate a similar-sized vector of TFs. For the Microsoft document (the 746th document, of 45432 documents in total):

Term        TF
1985         1
70s          1
bill         1
company      2
computer     2
computers    1
focussed     1
founded      1
gates        2
graduate     1
history      1
late         1
microsoft    2
personal     1
since        2

- 63 -

Step 2, index the documents (a)

[Figure: the 15-term Document Vector is merged into the index structures —
• the Document List (d0, d1, …, d743, d744, d745), where a new entry (d745, with its URL) is added for this document;
• the Term List (aardvark, aardwolf, aargghh, …, gatering, gaters, gates, gateses, gateshead, gatesian, gatesqueak, gatesville, gatesworld, graduate, …, zzz), in which each entry carries a Term ID (the terms around "gates" have IDs 8231-8240, with gates = 8233) and a DF count; 'up to' 15 new entries may be added to the term list.]

- 64 -

Step 2, index the documents (b)

And update the term–document matrix (15 updates): the matrix has a column per Term ID (0, 1, 2, …, 8233, …, 8240, …) and a row per document (D0, D1, D2, …, d745); each of the 15 terms of the new document gets its TF stored in row d745 — for example, TF = 2 in the column for term 8233 ("gates").

- 65 -

And set the weights for each term

0 1 2 … 8233 … 8240 … …

D0 0.1 0.2

D1 0.4 0.11

D2 0.1

d745 0.21 0.34

( ) !!"

#$$%

&'=

jijij df

Ntfw loglog ( ) !"

#$%

&'=3245432

log2log8233,745dw
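Reproducing that worked example in code (assuming the figures on the slide are tf = 2, N = 45432 and df = 32; the base of the logarithm is not specified, so natural log is used here):

```python
import math

tf, N, df = 2, 45_432, 32
w = math.log(tf) * math.log(N / df)     # w = log(tf) * log(N / df)
print(round(w, 3))
```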

- 66 -

Step 3, process a query (1)

Query: Gates, Microsoft

(a) Using the Term List, convert the query terms to Term IDs: 8233, 16031

(b) Using the TF-IDF formula $w_{ij} = \log(tf_{ij}) \times \log\left(\frac{N}{df_j}\right)$, together with the query-term factor $\frac{N - df_j}{df_j}$, generate a tf-idf score for each query term:

    8233  → 0.32
    16031 → 0.13

(c) Generate a small matrix containing only the query terms and the documents containing these terms:

             8233    16031
    D1       0.11    0.23
    D212     0.02    0.45
    D745     0.21    0.34
    D5612    0.12    0.16

- 67 -

Step 3, process a query (2)

Taking a shortcut, we multiply the stored document term weights by the query term weights and add the results to produce a document relevance score (really we would use cosine similarity):

             8233 (× 0.32)    16031 (× 0.13)
    D1       0.11 × 0.32      0.23 × 0.13
    D212     0.02 × 0.32      0.45 × 0.13
    D745     0.21 × 0.32      0.34 × 0.13
    D5612    0.12 × 0.32      0.16 × 0.13

- 68 -

Step 3, process a query (2)

(As above — the first factor in each product, e.g. 0.11, 0.02, 0.21, 0.12, is the document's stored weight for that term.)

- 69 -

Step 3, process a query (2)

(As above — the second factor in each product, 0.32 or 0.13, is the query term's weight.)

- 70 -

Step 3, process a query (2)

             8233 (× 0.32)    16031 (× 0.13)    Score
    D1       0.11 × 0.32      0.23 × 0.13       0.0352 + 0.0299 = 0.0651
    D212     0.02 × 0.32      0.45 × 0.13       0.0064 + 0.0585 = 0.0649
    D745     0.21 × 0.32      0.34 × 0.13       0.0674 + 0.0442 = 0.1116
    D5612    0.12 × 0.32      0.16 × 0.13       0.0384 + 0.0208 = 0.0592

Rank in decreasing order of relevance and present the results to the user…

Search Results (4 documents relevant)
  1. D745 - ……………………………………………
  2. D1 - ……………………………………………
  3. D212 - ……………………………………………
  4. D5612 - ……………………………………………
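The same shortcut scoring, reproduced in a few lines of Python using the weights from the small matrix above:

```python
doc_weights = {                       # (weight for term 8233, weight for term 16031)
    "D1":    (0.11, 0.23),
    "D212":  (0.02, 0.45),
    "D745":  (0.21, 0.34),
    "D5612": (0.12, 0.16),
}
query_weights = (0.32, 0.13)          # weights of the two query terms

scores = {doc: sum(dw * qw for dw, qw in zip(ws, query_weights))
          for doc, ws in doc_weights.items()}
for rank, (doc, score) in enumerate(sorted(scores.items(), key=lambda kv: -kv[1]), 1):
    print(rank, doc, round(score, 4)) # D745 is ranked first, as on the slide
```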

- 71 -

A Simple IR system

[Figure: architecture of a simple IR system — (1) an Indexer processes documents from the WWW to build the Term List (TL), Document List (DL), the TF matrix and the DF counts; (2) using the term weighting formula these are turned into the TF-IDF matrix; (3) the Search Engine's Query Manager uses the TF-IDF matrix (and a similarity matrix) to answer queries.]

- 72 -

BM25

• There are alternatives to TF-IDF which come from the other models of retrieval.

To index terms in a document:

$W_{ij} = \frac{(k_1 + 1)\, tf_{ij}}{K + tf_{ij}}$,  where  $K = k_1 \left( (1 - b) + b \, \frac{l_i}{advl} \right)$

To index terms in a query:

$w_{qj} = \frac{tf_{qj}}{k_3 + tf_{qj}} \, \ln\!\left(\frac{N - df_j}{df_j}\right)$

• tf_ij indicates the within-document frequency of term j in document i.
• b, k1, k3 are parameters.
• K captures the ratio between the length of document i, measured by l_i (the sum of its tf_ij), and the collection mean, denoted by advl.
• df_j indicates the number of documents in the collection containing term j.
• N is the number of documents in the collection.
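A sketch of these two weights in Python, as reconstructed above; the parameter values (k1 = 1.2, b = 0.75, k3 = 8) and the document/collection statistics in the example are illustrative assumptions, and published BM25 variants differ in detail:

```python
import math

def bm25_doc_weight(tf_ij, l_i, advl, k1=1.2, b=0.75):
    K = k1 * ((1 - b) + b * l_i / advl)        # length-normalised k1
    return (k1 + 1) * tf_ij / (K + tf_ij)

def bm25_query_weight(tf_qj, N, df_j, k3=8.0):
    return (tf_qj / (k3 + tf_qj)) * math.log((N - df_j) / df_j)

# e.g. a term with tf 2 in a 300-term document (collection average 400 terms),
# occurring in 32 of 45432 documents and once in the query:
score = bm25_doc_weight(2, 300, 400) * bm25_query_weight(1, 45_432, 32)
print(round(score, 3))
```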

- 73 -

Term Weighting Retrieval

• Some comments…
  – TF-IDF
    • Not always used as shown
      – Doc length normalisation
      – Applying TF only
        » E.g. for anchor text surrogates
      – Applying only IDF to query terms…
  – BM25
    • Parameters should be optimised for different collections
      – TV news is not web pages is not medical texts.
  – Other approaches
    • There's a whole gamut of techniques and formulae and algorithms in the information retrieval research field, but that's enough of them.