Information Retrieval Basics - DCU School of Computing, asmeaton/CA652/IRBasics.pdf
- 2 -
Why is IR hard ?
• IR is the process of matching Q vs D
• IR is hard for several reasons …
– Information is complex; we need to represent it, and computers are not good at doing this.
– Context varies … what is a sponge ?
– Opinion varies … Good news ? Honest ? Funny ?
- 3 -
Why is IR hard ?
– Semantics … bank note, Bank of England, West Bank
– Information needs must be expressed as a query, in a search box, but often we don’t know what we want !
– We have problems …
• Verbalising information needs
• Understanding query syntax
• Understanding search engines
– … all mentioned earlier under “information seeking”
- 5 -
Users
• You have seen HCI already...
• Users are concerned with results.
– They rarely understand or consider the underlying mechanism.
– Even more rarely do they make use of sophisticated search tools.
• 1% use advanced search facilities.
• 10% use query syntax - often incorrectly!
• The average length of a search query is currently around 2.5 words!
• Accuracy (precision) is generally more important than quantity (recall).
– Although there are special applications, such as patent search, where high recall is very important.
- 6 -
When they are searching, what do they want to find?
… blogs, tweets, babies, prices, dogs, sales, maps, products, recipes, news, answers, movies, books, flight schedules, mail, news stories, video clips, papers, calendar entries, goals, appointments, what their friends spoke about last night … and WWW pages …
- 7 -
Where are they from?
… US, Canada, UK, Norway, France, Sweden, Australia, Japan, Korea, Ireland, South Africa, India, Germany, Singapore, Russia, Italy, Brazil, Portugal, China, Mexico, SE Asia, New Zealand, Finland, …
- 8 -
How good is the User?
• User information needs need to be expressed as a query…
– But users often don’t know what they want
– or can’t articulate the information need
– or don’t understand query syntax
– or don’t understand search engines
• Users also…
– Don’t give much explicit feedback
– Don’t look beyond the first page of results
– Can’t adequately express an information need as a query…
- 9 -
User Queries are…
• Misspelled
• Ambiguous
• Context sensitive
– Novel information is better?
• Representative of different types of search request
– Fact search, homepage finding, general…
• Usually textual in nature
- 10 -
… query log extract …
• responsibility
• electronic parts tv repair do it yourself
• gordon
• can i find informationabout susan b. anthony
• coupons off the web
• nissan
• cad
• another word for hue
• how do i reprogram a gm keyless remote
• supervisory training programs ontario
• where can i find pictures of 1989 z24
• alison norris
• how to make man made boulders
• godzilla soundtrack
• spring lake nc
• absolutelymale day
• away in a manger
• toysoldier
• mastercraft boats
• where can i find spanish recipes
• tobacco
• history of environmentalism silent spring
• psychology of lying
• country song lyrics
• where can i find information about depression
• alexander graham bell
• friends scripts
• booman
• what is kwanza
• fishing
• where can i find art in 1964
• hubble
• yellow pages of san diego california
• project change forms
• antique clocks
• golden pacific systems
• metallica
• karagoz
• javascipts
• encoders
• cheverlay
• christian books
• margaret kempe
• angel policeman
• pottery portland maine
• woodstock sucks.
- 16 -
Access Devices
• Desktop Computer
– What resolution? What plug-ins?
• Mobile Device
– What resolution? Support zooming? Location services? Bluetooth services?
• Games Console/TV
– Integrated browsers are becoming the norm, so what level of interactivity?
- 18 -
Ok, so that is users. Users search using text. It is easier to express an information need that way… so let us look at text!
- 19 -
Text is:
A lot of IR difficulty arises because of the nature of text:
• word tokens from a surprisingly small lexicon or dictionary,
• each of which independently conveys some meaning,
• whose morphology is changed as words are concatenated into units of dialogue called sentences,
• for which there is a grammar of allowable syntactic combinations to which sentences conform,
• which are in turn concatenated to make prose which makes up documents,
• which can reach a large enough size that they may be structurally organized,
• and typically this is a hierarchy, to ease user navigation through the document:
– chapters, sections, sub-sections, paragraphs, sentences, clauses, phrases, words, morphemes, letters…
- 20 -
Problems with Text
Tokens are Words are Terms…
• Can be polysemous
– e.g. BAR
• ‘SMITH’ is NOT ‘Smith’ is NOT ‘smith’
– or
• ‘plane’ is NOT ‘aircraft’
– or
• “cooking” is NOT “cooked” is NOT “cookery”
- 21 -
Information Retrieval
• Early days of IR
– Indexing into an internal representation was a manual process
• Keywords, abstraction and classification
– Not possible now
• Nowadays IR is automatic
– Documents are converted into an internal representation
• Not the original document, as this is inefficient
• Rather, terms are extracted and only important terms or concepts are indexed
• What terms are important?
- 23 -
Synthetic Power Law Data on Linear Axis

[Figure: a synthetic power-law distribution plotted on linear axes, labelled “Web Sites” (0–120) and “Visitors” (0–60).]
- 25 -
Genuine linkage distributions

[Figure: on-site and off-site outdegree distributions plotted on log-log axes - number of web pages (1–10000) against outdegree (1–1000). Off-site correlation = 0.9005; on-site correlation = 0.8542.]
- 26 -
What does this tell us about text?
• We can make two observations relating to term importance:
– Terms below the lower bound are considered too rare to be of benefit to the retrieval process.
• They may be removed, but in practice this does not happen.
– Terms above the upper bound are considered to occur too frequently to be of benefit.
• They are usually removed from the internal document representation.
– These words are referred to as stopwords.
- 27 -
Inverse Document Frequency
• How then do we allocate an importance weight to words in a document collection?
• We use a formula like IDF (Inverse Document Frequency)
– allocates a term importance which is inversely proportional to the number of documents containing that term.
– Two core principles:
• The higher the document frequency (df) of a term, the less discriminating that term is…
• The lower the document frequency of a term, the more discriminating that term is.

idf_j = log( N / df_j )
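The slides give only the formula; a minimal Python sketch of it follows. The logarithm base is not fixed by the formula - base 10 is assumed here for illustration.

```python
import math

def idf(N, df):
    """Inverse document frequency: log of (collection size / document frequency)."""
    return math.log10(N / df)

# A term appearing in 10 of 1,000,000 documents is far more
# discriminating than one appearing in 100,000 of them.
print(idf(1_000_000, 10))       # rare term -> high idf
print(idf(1_000_000, 100_000))  # common term -> low idf
```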
- 28 -
Stopword Removal
• We want to remove high frequency words from the indexing process.
– Can be done automatically using a predefined stopword list.
• These stopwords have high DF values.
• Benefits?
– Smaller index (30-50% smaller)
• Problems?
– Removing stopwords does cause difficulties in dealing with some valid queries… an example: “to be or not to be”…
– Phrase searching can be affected!
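A minimal sketch of stopword removal, using a small illustrative stopword set (not the full English list). Note how the valid query “to be or not to be” vanishes entirely, as the slide warns.

```python
# Illustrative stopword set -- real systems use a full predefined list.
STOPWORDS = {"a", "an", "and", "are", "be", "is", "not", "of", "or", "the", "to"}

def remove_stopwords(tokens):
    return [t for t in tokens if t not in STOPWORDS]

print(remove_stopwords("the history of the personal computer".split()))
print(remove_stopwords("to be or not to be".split()))  # everything removed!
```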
- 29 -
Stopwords for English
a about above across after again against all almost alone along also although always am among an and another any anybody anyone anything anywhere apart are around as aside at away be because been before behind being below besides between beyond both but by can cannot could deep did do does doing done down downwards during each either else enough etc even ever every everybody everyone except far few for forth from get gets got had hardly has have having her here herself him himself his how however if in indeed instead into inward is it its itself just kept many maybe might mine more most mostly much must myself near neither next no nobody none nor not nothing nowhere of off often on only onto or other others ought our ours out outside over own per please plus quite rather really said seem self selves several shall she should since so some somebody somewhat still such than that the their theirs them themselves then there therefore these they this thorough thoroughly those through thus to together too toward towards under until up upon very was well were what whatever when whenever where whether which while who whom whose will with within without would yet young your yourself
- 30 -
Stemming
• Words can appear in different forms
– Walk, walking, walks, walker
• We need some way to recognise common concept roots…
• The solution is stemming
– Not a perfect solution…
• Policy / police
• Arm / army
• Organisation / organ
- 31 -
Stemming
• Here the indexing terms are word stems, not words.
• Must happen to both documents and queries.
• Language dependent: available for most languages.
– A lot of development is needed to make a new one.

Computer, Computing, Computational, Compute → comput
- 32 -
Porter Stemming
Porter's (1980) algorithm is popular:
• remove plurals, -ED, -ING
• terminal Y -> I when another vowel in stem
• map double suffixes to single ... -ISATION
• deal with -IC, -FULL, -NESS
• take off -ANT, -ENCE
• remove -E if word > 2
The code is available for downloading in most programming languages…
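To give a feel for suffix stripping, here is a toy sketch - NOT the real Porter algorithm, which also checks the “measure” of the remaining stem and applies rules in ordered steps. The suffix list below is illustrative only.

```python
# Toy suffix-stripper in the spirit of Porter's rules (illustrative only).
SUFFIXES = ["ational", "isation", "ing", "ness", "ence", "ant", "ed", "s"]

def crude_stem(word):
    word = word.lower()
    for suffix in SUFFIXES:
        # Only strip if at least a 3-letter stem remains.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

for w in ["walks", "walking", "walked", "walker"]:
    print(crude_stem(w))
```

Note that “walker” is untouched - conflating it with “walk” needs more rules, illustrating the slide's point that stemming is not a perfect solution.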
- 33 -
Stemmers
• Language dependent ... English, American, French, Norwegian
• High cost in generating a stemming algorithm

“May I have information on the computational complexity of nearest neighbour problems in graph theory.” This will give us:

INFORM, COMPUT, COMPLEX, NEAR, NEIGHBOUR, PROBLEM, GRAPH, THEORI.
- 34 -
Summary
Original: May I have information on the computational complexity of nearest neighbour problems in graph theory.

After Document Tokenisation & Term Normalisation: may i have information on the computational complexity of nearest neighbour problems in graph theory

After Stopword Removal: information computational complexity nearest neighbour problems graph theory

After Stemming: INFORM, COMPUT, COMPLEX, NEAR, NEIGHBOUR, PROBLEM, GRAPH, THEORI
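The three pipeline stages above can be sketched in code. The stopword set is a subset of the English list for this sentence, and the stem table simply stands in for a real stemmer (e.g. Porter) - the mappings are taken from the slide's own example.

```python
# Pipeline sketch: tokenise/normalise -> stopword removal -> stemming.
STOPWORDS = {"may", "i", "have", "on", "the", "of", "in"}
STEMS = {"information": "inform", "computational": "comput",
         "complexity": "complex", "nearest": "near",
         "problems": "problem", "theory": "theori"}

def index_terms(text):
    tokens = text.lower().replace(".", "").split()      # tokenise & normalise
    tokens = [t for t in tokens if t not in STOPWORDS]  # stopword removal
    return [STEMS.get(t, t) for t in tokens]            # "stemming" by table lookup

print(index_terms("May I have information on the computational "
                  "complexity of nearest neighbour problems in graph theory."))
```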
- 35 -
IR models

• Having turned a document into a set of terms, how do we do retrieval?
• We need to model, mathematically, the retrieval process and from that derive retrieval implementations. So there is a taxonomy of very many IR models…

[Diagram: taxonomy of IR models, organised by user task.]
– Retrieval
• Classical Models: Boolean (→ Extended Boolean, Fuzzy), Vector (→ Generalised Vector, Latent Semantic Indexing, Neural Networks), Probabilistic (→ Inference Network, Belief Network)
• Structured Models: Non-overlapping lists, Proximal Nodes
– Browsing: Flat, Structure Guided, Hypertext
- 37 -
Limitations of Boolean IR
• complexity of query formulation for multi-concept topics.
• Boolean logic is intimidating and off-putting.
• no control over the size of the output produced.
• Boolean formulations are restrictive and not powerful for subtle queries.
• no adequate ranking of output in decreasing probability of relevance.
• batch process with no feedback from the user back into the search.
• no differentiation among terms in the query.
- 38 -
A Document Vector (Boolean)

Original text:
web directories are comprised of a structured hierarchy of pages each of which contains many links to other web pages based on the content of these pages these usually have been painstakingly handcrafted by people which make them very expensive to maintain and grow in line with the ever expanding web however they do act as excellent starting points for a user to browse the web if one views the web as a book then the web directory is like the table of contents with a high level overview of the contents of the www if you are just browsing a non-fictional book using the table of contents is a great way to quickly locate the desired section

After term cleaning and stopword removal:
web directories comprised structured hierarchy pages contains links other web pages based content pages painstakingly handcrafted people make expensive maintain grow line expanding web act excellent starting points user browse web views web book web directory table contents high level overview contents www browsing non fictional book table contents great way quickly locate desired section

Doc Vector (the unique terms, alphabetically): act, based, book, browse, browsing, comprised, contains, content, contents, desired, directories, directory, excellent, expanding, expensive, fictional, great, grow, handcrafted, hierarchy, high, level, line, links, locate, maintain, make, non, other, overview, pages, painstakingly, people, points, quickly, section, starting, structured, table, user, views, way, web, www
- 39 -
A Document Vector (term-weighted)

The same web-directories text, term cleaning and stopword removal as on the previous slide. Now the doc vector pairs each unique term with its term frequency (TF) in the document, e.g. web → 6, contents → 3, book → 2, act → 1.
- 40 -
How to implement term weighting … we need inverted files
• Inverted files allow for fast searching...
– avoid having to search the entire document collection at query time.
• BOOK EXAMPLE: if you are looking through a history book for references to X, you have two choices:
– Read or scan through each page and pull out references to the topic, or
– Look at the index at the back of the book, and it will point you at the relevant pages.
• An inverted file is to a Search Engine what an index is to a book.
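In its simplest form, an inverted file is a mapping from each term to the set of documents containing it. A minimal Python sketch, with made-up toy documents:

```python
from collections import defaultdict

# Toy collection: doc ID -> list of index terms.
docs = {
    1: ["cat", "sat", "mat"],
    2: ["dog", "sat", "log"],
    3: ["cat", "dog"],
}

# Build the inverted index: term -> set of doc IDs (a posting list).
index = defaultdict(set)
for doc_id, terms in docs.items():
    for term in terms:
        index[term].add(doc_id)

print(sorted(index["cat"]))  # [1, 3]
print(sorted(index["sat"]))  # [1, 2]
```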
- 42 -
Conventional Inverted Index

[Diagram: documents flow into the search engine, which builds an internal representation - a term-document matrix keyed by term IDs (T) and document IDs (D) - to aid faster searching.]

We want to do information management / retrieval / indexing / categorisation / filtering / routing / clustering / extraction / summarisation, and all of this is based on text content, which is more than just the words used in a document. Even discussing a plane crash, the words we may use would be aeroplane, plane, aircraft, flight, airplane, crash, accident, disaster, even Airbus & Boeing… How should IR systems handle this?

In Boolean IR, the term-document matrix contains 0s or 1s. In a term-weighted model, actual term weights are stored instead of the binary values of Boolean IR.
- 43 -
Conventional Inverted Index
This inverted file structure will allow us to generate a list of relevant documents by following these simple steps:
1. Accept a query, perhaps process it.
2. For each query term:
– access the dictionary and get a listing of all the documents that contain that term
– store the documents in a set identified by the query term.
3. Finally, using set theory (set intersection or difference operators) we can generate a list of relevant documents… e.g. A AND B NOT C
The final ranking of a document can be based on a count of the number of sets containing that particular document, or some other more complex techniques…
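Step 3 maps directly onto set operations. A sketch with made-up posting lists for three query terms A, B, C:

```python
# Posting lists (doc-ID sets) for three query terms -- illustrative values.
postings = {
    "A": {1, 2, 5, 7},
    "B": {2, 3, 5},
    "C": {5, 9},
}

# Query: A AND B NOT C  ->  intersection, then set difference.
result = (postings["A"] & postings["B"]) - postings["C"]
print(sorted(result))  # [2]
```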
- 44 -
Using a Conventional Inv. Index (1)

[Diagram: the query term CAT is looked up in the term list, giving term ID 45; that row of the term-document matrix yields the relevant document IDs.]

Remember, the matrix is sorted termwise (a-z) to support fast identification of documents containing a given term.
- 45 -
Using a Conventional Inv. Index (2)

[Diagram: looking up CAT (term ID 45) returns the relevant document IDs {3, 56, 67}, which are returned to the user.]
- 46 -
[Diagram: the query “CAT or DOG” (term IDs 45 and 62) retrieves set(CAT) = {3, 56, 67} and set(DOG) = {56, 57}; Boolean algebra (union) gives the relevant doc IDs {3, 56, 57, 67}, returned to the user.]
- 47 -
[Diagram: the query “CAT and DOG” (term IDs 45 and 62) retrieves set(CAT) = {3, 56, 67} and set(DOG) = {56, 57}; Boolean algebra (intersection) gives the relevant doc IDs {56}, returned to the user.]
- 48 -
Location Based Indexing
We may want to return documents that contain many terms in close proximity, so storing term positions in the index may be necessary…
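A sketch of what such a positional index could look like: term → document → list of positions, which makes proximity (e.g. adjacency) checks possible. The documents and the phrase query are made up for illustration.

```python
from collections import defaultdict

def build_positional_index(docs):
    """term -> doc_id -> sorted list of positions where the term occurs."""
    index = defaultdict(lambda: defaultdict(list))
    for doc_id, tokens in docs.items():
        for pos, term in enumerate(tokens):
            index[term][doc_id].append(pos)
    return index

docs = {1: ["new", "york", "pizza"], 2: ["pizza", "in", "new", "york"]}
idx = build_positional_index(docs)

# Documents where "new" is immediately followed by "york":
hits = [d for d in idx["new"]
        if any(p + 1 in idx["york"].get(d, []) for p in idx["new"][d])]
print(sorted(hits))  # [1, 2]
```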
- 49 -
Desirable Features of an IR system
• Above and beyond what Boolean IR has to offer, we want:
– ranked output rather than sets.
– relevance feedback.
– query modification/expansion.
• Done by incorporating term weights
– Weights calculated using frequencies of occurrence in natural language
• Most large text collections (in one language) will have the same statistical characteristics
- 50 -
How to realise non-Boolean retrieval
What we do know about text is that:
– the most frequent words are function words
– the least frequent words are obscure
– mid-range words are content-bearing
• There are two (reasonable) assumptions we must make about the frequencies of words in text:
– The more a document contains a given word, the more that document is about a concept represented by that word.
– The more rarely a term occurs in individual documents in a collection, the more discriminating that term is.
• How do we make these ‘reasonable assumptions’ into retrieval algorithms? We model the retrieval process, mathematically.
- 51 -
Vector Space Model
• Around since the 60s
– Formulated by Gerry Salton at Cornell
• Relatively simple statistical model
– Based on the two assumptions just mentioned
• Assigns non-binary weights to terms
– In both docs and queries
• Non-binary weights are used to calculate the degree of similarity of docs to a query
– Generating ranked output
• The aim of which is to satisfy a user’s info need
- 52 -
Vector Space Model
cat
dog0,0
1
1
Doc A : cat[0.8], dog[0.1]Doc B : cat[0.5], dog[0.9]Query : cat[0.8], dog[0.7]
A
- 53 -
Vector Space Model
dog0,0
1
1
Doc A : cat[0.8], dog[0.1]Doc B : cat[0.5], dog[0.9]Query : cat[0.8], dog[0.7]
A
B
cat
- 54 -
Vector Space Model
dog0,0
1
1
cos δ
A
BDoc A : cat[0.8], dog[0.1]Doc B : cat[0.5], dog[0.9]Query : cat[0.8], dog[0.7]
Similarity of A and B
δ
cat
- 55 -
Vector Space Model
dog0,0
1
1
Doc A : cat[0.8], dog[0.1]Doc B : cat[0.5], dog[0.9]Query : cat[0.8], dog[0.7]
A
B
Query
cat
- 56 -
Vector Space Model
dog0,0
1
1
cos Φ cos δ
Doc A : cat[0.8], dog[0.1]Doc B : cat[0.5], dog[0.9]Query : cat[0.8], dog[0.7]
A
B
Query
cos Φ, δ = [0..1]
cat
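The geometry above can be checked numerically. A small sketch computing the cosine similarity of the Query against Docs A and B, using the weights from the figure:

```python
import math

def cosine(u, v):
    """Cosine of the angle between vectors u and v."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

doc_a = (0.8, 0.1)   # (cat, dog)
doc_b = (0.5, 0.9)
query = (0.8, 0.7)

print(round(cosine(query, doc_a), 3))
print(round(cosine(query, doc_b), 3))
# B scores higher than A: the query leans towards "dog" almost as much as "cat".
```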
- 57 -
Vector Space Model
• In reality there are more than two terms in a language
• You need an axis for each unique term in your collection
– Millions
• How is it usually implemented?
– TF-IDF
- 58 -
TF-IDF
Recall the two (reasonable) assumptions about the frequencies of words in text:
– The more a document contains a given word, the more that document is about a concept represented by that word.
• TF value
– The more rarely a term occurs in individual documents in a collection, the more discriminating that term is.
• IDF value… we saw this before.
• Calculate TF-IDF values for query and document terms
- 59 -
TF-IDF (Ranked Output)
TF : Term Frequency… the number of times a term occurs in a document.
DF : Document Frequency… the number of documents a term occurs in.
IDF : Inverse Document Frequency… how important the term is to a document in the whole collection.

w_ij = tf_ij × log( N / df_j )

w_ij is the weight assigned to term T_j in document D_i.
tf_ij = frequency of term T_j in document D_i.
N = number of documents in the collection.
df_j = number of documents where term T_j occurs at least once.
Calculated for each unique term in each doc and query (mostly zero)… this creates a document vector.
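The weighting formula as a small Python sketch (log base 10 assumed for illustration; the formula does not fix the base). It shows how IDF dampens high-TF but common terms:

```python
import math

def tf_idf(tf, df, N):
    """w_ij = tf_ij * log(N / df_j)."""
    return tf * math.log10(N / df)

N = 1000  # illustrative collection size
# A term occurring 5 times in the doc but in only 10 docs overall
# outweighs one occurring 10 times in the doc but in 500 docs.
print(tf_idf(5, 10, N))
print(tf_idf(10, 500, N))
```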
- 60 -
TF-IDF similarity
• Simple approach : dot product (angle of vectors)

SIM(Doc_i, Query_j) = Σ_{k=1..t} ( Term_ik × QTerm_jk )            [Dot Product only]

• For length normalisation, use cosine similarity: the dot product divided by the product of the Euclidean lengths

COSINE(Doc_i, Query_j) = Σ_{k=1..t} ( Term_ik × QTerm_jk ) / ( √( Σ_{k=1..t} Term_ik² ) × √( Σ_{k=1..t} QTerm_jk² ) )
- 61 -
Finally with TF-IDF
The advantages of the vector model are:
• Its term-weighting scheme improves retrieval performance over Boolean IR.
• It allows retrieval of documents that approximately match a query.
• It is easy to sort documents according to their degree of similarity.
BUT it does not allow for dependencies between terms … e.g. “information retrieval”…
Let’s have a look at it in operation…
- 62 -
Step 1 – generate doc Vectors

Three example documents (of 45432 in total):

“MultiMedia Information Systems (CA4, CAE4 & CL4), a final year undergraduate course for the B.Sc. in Computer Applications and the B.Sc. in Applied Computational Linguistics, delivered to the full-time and to the part-time classes.”

“The history of the Personal Computer since 1985 is focussed on one company. That company is Microsoft. Microsoft was founded by Bill Gates in the late 70s. Gates was not a computer graduate, but had been using computers since he…”

“Jaguar-Racing’s recruitment drive has continued, the team announcing today the appointment of Italian Guenther Steiner as its new Managing Director.”

Taking the second document (the 746th document), convert it into a Document Vector of unique terms and generate a similar-sized vector of TFs:

Terms: 1985, 70s, bill, company, computer, computers, focussed, founded, gates, graduate, history, late, microsoft, personal, since
TFs:   1, 1, 1, 2, 2, 1, 1, 1, 2, 1, 1, 1, 2, 1, 2
- 63 -
Step 2, index the documents (a)

[Diagram: the document vector for d745 is merged into the index structures:]
• A new entry (d745, with its URL) is added to the document list, which now runs d0 … d745.
• ‘Up to’ 15 new entries are added to the term list (… gates, graduate, … among entries running from aardvark to zzz), each term carrying a term ID and its DF - the number of documents containing it.
- 64 -
Step 2, index the documents (b)

And update the term-document matrix (15 updates - one per unique term in the document vector of d745).

[Table: the term-document matrix of raw TFs, with term-ID columns 0, 1, 2, …, 8233, …, 8240, … and document rows d0, d1, d2, …, d745, …; e.g. row d745 now holds a TF in each of its 15 term columns, including tf = 2 in column 8233 (gates).]
- 65 -
And set the weights for each term

[Table: the TF matrix is converted into a matrix of term weights (e.g. row d745 now holds 0.21 and 0.34 in its query-relevant columns), using the weighting formula:]

w_ij = log( tf_ij ) × log( N / df_j )

e.g. w_745,8233 = log( 2 ) × log( 45432 / 32 )
- 66 -
Step 3, process a query (1)

Query: Gates, Microsoft

(a) Using the Term List, convert the query terms to term IDs: 8233, 16031

(b) Using the TF-IDF formula w_ij = log( tf_ij ) × log( N / df_j ), generate a score for each query term:
8233 → 0.32
16031 → 0.13

(c) Generate a small matrix containing only the query terms and the documents containing these terms:

        8233   16031
D1      0.11   0.23
D212    0.02   0.45
D745    0.21   0.34
D5612   0.12   0.16
- 67 -
Step 3, process a query (2)

Taking a shortcut, we multiply the document term weights by the query term weights and add the results to produce a document relevance score (really we use cosine similarity):

        8233 (×0.32)   16031 (×0.13)
D1      0.11 × 0.32  +  0.23 × 0.13  =  0.0352 + 0.0299 = 0.0651
D212    0.02 × 0.32  +  0.45 × 0.13  =  0.0064 + 0.0585 = 0.0649
D745    0.21 × 0.32  +  0.34 × 0.13  =  0.0672 + 0.0442 = 0.1114
D5612   0.12 × 0.32  +  0.16 × 0.13  =  0.0384 + 0.0208 = 0.0592

Rank in decreasing order of relevance and present the results to the user…

Search Results ( 4 documents relevant )
1. D745 - ……………………………………………
2. D1 - ……………………………………………
3. D212 - ……………………………………………
4. D5612 - ……………………………………………
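The worked example above can be reproduced in a few lines, using the weights from the small query-term matrix:

```python
# Score = sum over query terms of (document term weight x query term weight).
query_weights = {8233: 0.32, 16031: 0.13}
doc_weights = {
    "D1":    {8233: 0.11, 16031: 0.23},
    "D212":  {8233: 0.02, 16031: 0.45},
    "D745":  {8233: 0.21, 16031: 0.34},
    "D5612": {8233: 0.12, 16031: 0.16},
}

scores = {
    doc: sum(w * query_weights[t] for t, w in weights.items())
    for doc, weights in doc_weights.items()
}
ranking = sorted(scores, key=scores.get, reverse=True)
print(ranking)  # ['D745', 'D1', 'D212', 'D5612']
```

Note how close D1 (0.0651) and D212 (0.0649) are - small weight differences decide the order.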
- 71 -
A Simple IR system

[Diagram: (1) an indexer processes documents from the WWW into a TF matrix, maintaining the term list (TL) and document list (DL); (2) a second indexing pass uses the term weighting formula and the DF values to produce the TF-IDF matrix; (3) the query manager of the search engine matches queries against the TF-IDF matrix to compute a similarity matrix and return ranked results.]
- 72 -
BM25

• There are alternatives to TF-IDF which come from the other models of retrieval.

To index terms in a document:

W_ij = ( (k1 + 1) × tf_ij ) / ( K + tf_ij )    where K = k1 × ( (1 − b) + b × ( l_i / avdl ) )

To index terms in a query:

w_qj = ( (k3 + 1) × tf_qj ) / ( k3 + tf_qj ) × ln( (N − df_j) / df_j )

• tf_ij indicates the within-document frequency of term j in document i.
• b, k1, k3 are parameters.
• K represents the ratio between the length of document i, measured by l_i (the sum of tf_ij), and the collection mean, denoted by avdl.
• df_j indicates the collection-wide document frequency of term j.
• N is the number of documents in the collection.
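The two BM25 weights can be sketched directly from the formulas. The parameter values below (k1 = 1.2, b = 0.75, k3 = 7) are common defaults in the literature, not values fixed by the slide - as noted later, they should be tuned per collection.

```python
import math

def bm25_doc_weight(tf, doc_len, avdl, k1=1.2, b=0.75):
    """Document-side weight: ((k1+1)*tf) / (K + tf), with K from doc length."""
    K = k1 * ((1 - b) + b * (doc_len / avdl))
    return ((k1 + 1) * tf) / (K + tf)

def bm25_query_weight(tf_q, df, N, k3=7):
    """Query-side weight: ((k3+1)*tf_q) / (k3+tf_q) * ln((N - df) / df)."""
    return ((k3 + 1) * tf_q) / (k3 + tf_q) * math.log((N - df) / df)

# Weight of a term appearing twice in an average-length document:
print(bm25_doc_weight(tf=2, doc_len=100, avdl=100))
# The same tf in a document twice the average length scores lower:
print(bm25_doc_weight(tf=2, doc_len=200, avdl=100))
```

Note the saturation effect: unlike raw TF, the document weight grows ever more slowly as tf increases, approaching k1 + 1.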
- 73 -
Term Weighting Retrieval
• Some comments…
– TF-IDF
• Not always used as shown
– Doc length normalisation
– Applying TF only
» E.g. for anchor text surrogates
– Applying only IDF to query terms…
– BM25
• Parameters should be optimised for different collections
– TV news is not web pages is not medical texts.
– Other approaches
• There’s a whole gamut of techniques, formulae and algorithms in the information retrieval research field, but that’s enough of them.