An overview of the technology used Information Retrieval Louise Guthrie University of Sheffield.


Page 1: An overview of the technology used Information Retrieval Louise Guthrie University of Sheffield.

An overview of the technology used Information Retrieval

Louise Guthrie

University of Sheffield

Page 2: An overview of the technology used Information Retrieval Louise Guthrie University of Sheffield.

4/15/2013

What is Information Retrieval (IR)?

• Retrieval of unstructured data

• Most often - Retrieval of Text

• Retrieval of Videos

• Retrieval of Images

Page 3: An overview of the technology used Information Retrieval Louise Guthrie University of Sheffield.

4/15/2013

Retrieval of Text Documents

• Searching for precedent in legal cases

• Searching files on your computer

• Searching on the web

• Siri

Page 4: An overview of the technology used Information Retrieval Louise Guthrie University of Sheffield.

4/15/2013

Information Retrieval

[Figure: the query “Give me all documents where Enron executives discuss the company stock” posed against a collection of documents]

Page 5: An overview of the technology used Information Retrieval Louise Guthrie University of Sheffield.

4/15/2013

Information Retrieval

[Figure: a QUERY matched against a collection of DOCUMENTS]

Page 6: An overview of the technology used Information Retrieval Louise Guthrie University of Sheffield.

4/15/2013

Concerns of an IR system

• How do you represent the text?

• How do you represent the query?

• How do you decide the documents to return?
– how do you find them efficiently?
– how do you decide what is presented first to the user?

• How do you evaluate the system?

Page 7: An overview of the technology used Information Retrieval Louise Guthrie University of Sheffield.

4/15/2013

Finding Relevant Documents

• All systems want to return documents that satisfy the query

• To satisfy a natural language query perfectly requires understanding

• Understanding is still a research topic

Page 8: An overview of the technology used Information Retrieval Louise Guthrie University of Sheffield.

4/15/2013

How is the text represented?

• Bag of words approach
• Pay no heed to inter-word relations:
– syntax, semantics
• Bag does characterise document
• Not perfect, words are
– ambiguous
– used in different forms or synonymously
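
As a minimal sketch of the bag of words idea, the snippet below counts words and ignores their order. The tokenisation rule (lower-case, letters only) and the two example sentences are assumptions made for illustration.

```python
import re
from collections import Counter

def bag_of_words(text):
    """Lower-case the text, split it into words, and count each word."""
    return Counter(re.findall(r"[a-z]+", text.lower()))

# Two hypothetical sentences with the same words in a different order.
bag_a = bag_of_words("The dog bit the man")
bag_b = bag_of_words("The man bit the dog")

# Word order (syntax) is discarded, so the two bags are identical
# even though the sentences mean different things.
print(bag_a == bag_b)
print(bag_a["the"])
```

The two sentences get the same representation, which is exactly the loss of inter-word relations the slide points out.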

Page 9: An overview of the technology used Information Retrieval Louise Guthrie University of Sheffield.

4/15/2013

Still work to be done in IR

The following query:

“The destruction of the amazon rain forests”

would not trigger a system to find an article about

“Brazilian jungles being destroyed”

Page 10: An overview of the technology used Information Retrieval Louise Guthrie University of Sheffield.

4/15/2013

Forms of query/retrieval system

• Boolean
– Rooted in commercial systems from the 1970s
– Spotlight on the Mac
– Westlaw, the system used to find legal cases
• Ranked retrieval
– Long championed by academics

Page 11: An overview of the technology used Information Retrieval Louise Guthrie University of Sheffield.

4/15/2013

Boolean searching

• To find articles about the destruction of the Amazon rain forest
– “amazon” & “rain forest*” & (“destroy” | “destruction”)
• Break collection into two unordered sets
– Documents that match the query
– Documents that don’t
• User has complete control but…
– …not easy to use.
– Often results in too many or too few results
– AND gives too few; OR too many
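
A rough sketch of Boolean matching with Python sets, where AND is set intersection and OR is set union. The four documents are hypothetical, and the phrase and wildcard parts of the query ("rain forest*") are simplified to two plain terms.

```python
# Hypothetical toy collection: document id -> text.
docs = {
    1: "destruction of the amazon rain forest continues",
    2: "loggers destroy rain forest in the amazon basin",
    3: "amazon opens a new book warehouse",
    4: "rain destroys crops",
}

# Map each term to the set of documents that contain it.
index = {}
for doc_id, text in docs.items():
    for term in text.split():
        index.setdefault(term, set()).add(doc_id)

def postings(term):
    """Set of document ids containing the term (empty if unseen)."""
    return index.get(term, set())

# amazon AND rain AND forest AND (destroy OR destruction)
matches = (postings("amazon") & postings("rain") & postings("forest")
           & (postings("destroy") | postings("destruction")))

terms = ["amazon", "rain", "forest", "destroy", "destruction"]
too_few = set.intersection(*(postings(t) for t in terms))   # everything ANDed
too_many = set.union(*(postings(t) for t in terms))         # everything ORed

print(sorted(matches), sorted(too_few), sorted(too_many))
```

The result is an unordered set of matching documents; nothing says which match is better. Document 4 contains "destroys", which does not match "destroy" exactly.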

Page 12: An overview of the technology used Information Retrieval Louise Guthrie University of Sheffield.

Considerations in “matching” the query

• Should the system match upper and lower case letters – amazon and Amazon?

• Should destroying match destroy?

4/15/2013

Page 13: An overview of the technology used Information Retrieval Louise Guthrie University of Sheffield.

4/15/2013

Matching different forms of a word

• Matching the query term “forests”
– to “forest” and “forested”
• Stemmers remove affixes
– removal of suffixes - worker
– prefixes? - megavolt
– infixes? - un-bloody-likely
• Stick with suffixes

Page 14: An overview of the technology used Information Retrieval Louise Guthrie University of Sheffield.

4/15/2013

Plural stemmer

• Plurals in English
– If word ends in “ies” but not “eies”, “aies”
• “ies” -> “y”
– If word ends in “es” but not “aes”, “ees”, “oes”
• “es” -> “e”
– If word ends in “s” but not “us” or “ss”
• “s” -> “”
– First applicable rule is the one used

Page 15: An overview of the technology used Information Retrieval Louise Guthrie University of Sheffield.

4/15/2013

Plural stemmer examples

Forests - ? Statistics - ?

Queries - ? Foes - ?

Does - ? Is - ?

Plus - ? Plusses - ?
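
The plural stemmer rules can be sketched in a few lines. One assumption is made here: a rule counts as "applicable" only when the ending matches and none of its exceptions do, so a word such as "foes" falls through to the final rule. Under the other reading (stop at the first matching ending) "foes" and "does" would be left unchanged.

```python
# (ending, exceptions, replacement), tried in order.
RULES = [
    ("ies", ("eies", "aies"), "y"),
    ("es", ("aes", "ees", "oes"), "e"),
    ("s", ("us", "ss"), ""),
]

def plural_stem(word):
    """Apply the first applicable plural rule; return the word unchanged if none applies."""
    w = word.lower()
    for ending, exceptions, replacement in RULES:
        if w.endswith(ending) and not w.endswith(exceptions):
            return w[: len(w) - len(ending)] + replacement
    return w

for word in ["Forests", "Statistics", "Queries", "Foes",
             "Does", "Is", "Plus", "Plusses"]:
    print(word, "->", plural_stem(word))
```

Running it shows both the sensible cases ("queries" to "query") and the odd ones ("is" to "i", "plusses" to "plusse") that a purely rule-based stemmer produces.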

Page 16: An overview of the technology used Information Retrieval Louise Guthrie University of Sheffield.

4/15/2013

For other endings - When to strip, when to stop?

• “ed”, “ing”, “ational”, “ation”, “able”, “ism”, etc, etc.

• What about
– “bring”, “table”, “prism”, “bed”, “thing”?
• Use a dictionary as well?
– “Buttered”

Page 17: An overview of the technology used Information Retrieval Louise Guthrie University of Sheffield.

4/15/2013

Is stemming used?

• Research says it is useful
• Web search engines hardly use it
– Why?
• Unexpected results
– computer, computation, computing, computational, etc.
• Foreign languages?

Page 18: An overview of the technology used Information Retrieval Louise Guthrie University of Sheffield.

4/15/2013

white sand vs. white sands

http://www.google.com/search?client=safari&rls=en&q=white+sand&ie=UTF-8&oe=UTF-8

http://www.google.com/search?client=safari&rls=en&q="white+sand"&ie=UTF-8&oe=UTF-8

Page 19: An overview of the technology used Information Retrieval Louise Guthrie University of Sheffield.

RANKED RETRIEVAL

4/15/2013

Page 20: An overview of the technology used Information Retrieval Louise Guthrie University of Sheffield.

4/15/2013

Ranked retrieval

• The user’s query is one or more words in natural language

• A similarity score is calculated between query and every document

• Sort documents by their score

• Present top scoring documents to the user
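
The four steps above can be sketched as a small loop. The similarity function here is only a stand-in (number of shared words), and the documents are hypothetical; any real scoring function could be substituted.

```python
def similarity(query_terms, doc_terms):
    """Stand-in score: how many distinct query words occur in the document."""
    return len(set(query_terms) & set(doc_terms))

def ranked_retrieval(query, docs, top_k=10):
    """Score every document against the query and return the best ids first."""
    query_terms = query.lower().split()
    scored = []
    for doc_id, text in docs.items():
        s = similarity(query_terms, text.lower().split())
        if s > 0:  # documents sharing no word with the query are dropped
            scored.append((s, doc_id))
    # Highest score first; ties broken by document id for a stable order.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [doc_id for _, doc_id in scored[:top_k]]

# Hypothetical collection.
docs = {
    "d1": "amazon rain forest destruction",
    "d2": "rain in the forest",
    "d3": "stock market news",
}
results = ranked_retrieval("destruction amazon rain forests", docs)
print(results)
```

Unlike Boolean search, the output is an ordered list, so the user sees the most similar documents first.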

Page 21: An overview of the technology used Information Retrieval Louise Guthrie University of Sheffield.

4/15/2013

Common techniques

• Stop word removal
– From fixed list
– “destruction amazon rain forests”

• Stemming

• Match lower and upper case
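
A small sketch of stop word removal combined with case folding. The stop list below is a deliberately tiny, hypothetical one; real lists contain hundreds of entries.

```python
# Hypothetical, very short stop list used only for this illustration.
STOP_WORDS = {"a", "about", "all", "and", "are", "as", "at", "of", "the"}

def preprocess(text):
    """Lower-case the text and drop any word found in the stop list."""
    return [w for w in text.lower().split() if w not in STOP_WORDS]

terms = preprocess("The destruction of the Amazon rain forests")
print(" ".join(terms))
```

Applied to the earlier query, this leaves the four content words shown on the slide.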

Page 22: An overview of the technology used Information Retrieval Louise Guthrie University of Sheffield.

4/15/2013

Stop Word Examples

a        always    are      being     co
about    am        around   below     con
above    among     as       beside    could
across   amongst   at       besides   couldn’t
after    amount    back     between   cry
again    and       because  both      detail
all      another   become   bottom    done

Page 23: An overview of the technology used Information Retrieval Louise Guthrie University of Sheffield.

4/15/2013

Measuring Similarity

• In place of understanding, we try to measure the similarity of a document to the query and we return to the user the most similar ones

• There are MANY different measures of similarity

Page 24: An overview of the technology used Information Retrieval Louise Guthrie University of Sheffield.

How do we assign a score?

• A first try - Jaccard Coefficient
– jaccard(A,B) = |A ∩ B| / |A ∪ B|
– jaccard(A,A) = 1
– jaccard(A,B) = 0 if A ∩ B = ∅
• Doesn’t consider how many times a word occurs
• Doesn’t take into account that rare terms are more informative
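
A minimal sketch of the Jaccard coefficient on sets of words. The handling of two empty sets (returning 1.0) is a convention assumed here; the example strings reuse the rain forest query from earlier.

```python
def jaccard(a, b):
    """|A intersect B| / |A union B| for two collections of words."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0  # assumed convention: two empty sets are identical
    return len(a & b) / len(a | b)

query = "the destruction of the amazon rain forests".split()
article = "brazilian jungles being destroyed".split()

print(jaccard(query, query))      # identical sets
print(jaccard(query, article))    # no shared words
print(jaccard("a b c".split(), "b c d".split()))  # 2 shared out of 4 distinct
```

Because the inputs are turned into sets, a word that occurs ten times counts the same as a word that occurs once, which is the first weakness listed above.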

4/15/2013

Page 25: An overview of the technology used Information Retrieval Louise Guthrie University of Sheffield.

A better idea – use frequency counts

4/15/2013

           Antony and Cleopatra  Julius Caesar  The Tempest  Hamlet  Othello  Macbeth
Antony     157                   73             0            0       0        0
Brutus     4                     157            0            1       0        0
Caesar     232                   227            0            2       1        1
Calpurnia  0                     10             0            0       0        0
Cleopatra  57                    0              0            0       0        0
mercy      2                     0              3            5       5        1
worser     2                     0              1            1       1        0

Page 26: An overview of the technology used Information Retrieval Louise Guthrie University of Sheffield.

4/15/2013

TF

• The more often a term is used in a document
– the more likely the document is about that term
– Depends on document length

– Harman, D. (1992): Ranking algorithms, in Frakes, W. & Baeza-Yates, B. (eds.), Information Retrieval: Data Structures & Algorithms: 363-392

• Watch out for mistake: not unique terms.

• Problems with spamming

Page 27: An overview of the technology used Information Retrieval Louise Guthrie University of Sheffield.

4/15/2013

IDF

• Some query terms better than others?

• In general, fair to say that…
– “amazon” > “forest” “destruction” > “rain”
• Inverse document frequency (idf)
– n: Number of documents term occurs in
– N: Number of documents in collection
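
As a sketch, one common way to turn n and N into a weight is log(N / n). This particular form is an assumption; several variants of idf are in use.

```python
import math

def idf(n, N):
    """Inverse document frequency, assuming the common log(N / n) form.

    n: number of documents the term occurs in (must be > 0)
    N: number of documents in the collection
    """
    return math.log(N / n)

# Hypothetical counts in a 1000-document collection.
rare = idf(10, 1000)      # term found in few documents
common = idf(500, 1000)   # term found in half the documents
everywhere = idf(1000, 1000)
print(rare, common, everywhere)
```

A term that occurs in every document gets a weight of zero, and the weight grows as the term becomes rarer.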

Page 28: An overview of the technology used Information Retrieval Louise Guthrie University of Sheffield.

4/15/2013

The scoring

• For each document

– Term frequency (tf)
• t: Number of times term occurs in document
• dl: Length of document (number of terms)
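
A rough sketch of a tf•idf score built from t, dl, n and N. Both the length normalisation (t / dl) and the idf form (log(N / n)) are simple assumptions chosen for illustration, not necessarily the exact formula behind this slide. Note that dl counts every term occurrence, not just unique terms.

```python
import math
from collections import Counter

def tf(t, dl):
    """Length-normalised term frequency (assumed form: t / dl)."""
    return t / dl

def tf_idf_score(query_terms, doc_terms, doc_freq, N):
    """Sum of tf * idf over the query terms that occur in the document."""
    counts = Counter(doc_terms)
    dl = len(doc_terms)  # all term occurrences, not unique terms
    total = 0.0
    for term in query_terms:
        n = doc_freq.get(term, 0)
        if n == 0 or counts[term] == 0:
            continue
        total += tf(counts[term], dl) * math.log(N / n)
    return total

# Hypothetical three-document collection.
docs = {
    "d1": "amazon forest destruction amazon".split(),
    "d2": "rain rain forest".split(),
    "d3": "rain stock".split(),
}
N = len(docs)
doc_freq = Counter(term for terms in docs.values() for term in set(terms))

query = ["amazon", "rain"]
scores = {d: tf_idf_score(query, terms, doc_freq, N) for d, terms in docs.items()}
print(sorted(scores, key=scores.get, reverse=True))
```

Here "amazon" occurs in only one document, so the document that uses it twice outranks the documents that merely mention the more common "rain".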

Page 29: An overview of the technology used Information Retrieval Louise Guthrie University of Sheffield.

4/15/2013

Very successful

• Simple, but effective

• Core of most weighting functions
– tf (term frequency)
– idf (inverse document frequency)

Page 30: An overview of the technology used Information Retrieval Louise Guthrie University of Sheffield.

4/15/2013

Robertson’s BM25

– Q is a query containing terms T
– w is a form of IDF
– k1, b, k2, k3 are parameters.
– tf is the document term frequency.
– qtf is the query term frequency.
– dl is the document length (arbitrary units).
– avdl is the average document length.
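
One way to combine the symbols above, assuming the commonly published Okapi form of BM25. The form of w, the default parameter values, and the example numbers are all assumptions for illustration.

```python
import math

def bm25_weight(n, N):
    """w: an IDF-like weight (assumed Robertson/Sparck Jones form, no relevance data).

    It can become negative for terms found in more than half the documents.
    """
    return math.log((N - n + 0.5) / (n + 0.5))

def bm25(query_tf, doc_tf, dl, avdl, doc_freq, N,
         k1=1.2, b=0.75, k2=0.0, k3=7.0):
    """Score one document against a query.

    query_tf: term -> qtf, doc_tf: term -> tf, doc_freq: term -> n.
    """
    K = k1 * ((1 - b) + b * dl / avdl)  # document length normalisation
    score = 0.0
    for term, qtf in query_tf.items():
        tf = doc_tf.get(term, 0)
        n = doc_freq.get(term, 0)
        if tf == 0 or n == 0:
            continue
        w = bm25_weight(n, N)
        score += w * ((k1 + 1) * tf / (K + tf)) * ((k3 + 1) * qtf / (k3 + qtf))
    # Length correction term, scaled by the number of query terms;
    # with k2 = 0 it has no effect.
    score += k2 * len(query_tf) * (avdl - dl) / (avdl + dl)
    return score

# Hypothetical numbers: the term occurs in 10 of 1000 documents.
common = dict(avdl=100, doc_freq={"amazon": 10}, N=1000)
base = bm25({"amazon": 1}, {"amazon": 3}, dl=100, **common)
more_tf = bm25({"amazon": 1}, {"amazon": 6}, dl=100, **common)
longer = bm25({"amazon": 1}, {"amazon": 3}, dl=400, **common)
print(base, more_tf, longer)
```

With these settings a higher tf raises the score, but with diminishing returns, and the same tf in a longer document scores lower.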

Page 31: An overview of the technology used Information Retrieval Louise Guthrie University of Sheffield.

4/15/2013

Getting the balance

• Documents with all the query terms?
• Just those with high tf•idf terms?
• Just nouns?
• Just noun phrases?
• With ancestor or children terms?

Page 32: An overview of the technology used Information Retrieval Louise Guthrie University of Sheffield.

Other considerations

• Should spelling be corrected?

• Should we remove punctuation?

• Should Feb. 20, 1980 match 2/20/80?

4/15/2013

Page 33: An overview of the technology used Information Retrieval Louise Guthrie University of Sheffield.

4/15/2013

Spamming the tf weight

A white font can be used to spam the tf weight

SEX SEXY MONICA LEWINSKY JENNIFER LOPEZ CLAUDIA SCHIFFER CINDY CRAWFORD JENNIFER ANNISTON GILLIAN ANDERSON MADONNA NIKI TAYLOR ELLE MACPHERSON KATE MOSS CAROL ALT TYRA BANKS FREDERIQUE KATHY IRELAND PAM ANDERSON KAREN MULDER VALERIA MAZZA SHALOM HARLOW AMBER VALLETTA LAETITA CASTA BETTIE PAGE HEIDI KLUM PATRICIA FORD DAISY FUENTES KELLY BROOK SEX SEXY MONICA LEWINSKY JENNIFER LOPEZ CLAUDIA SCHIFFER CINDY CRAWFORD JENNIFER ANNISTON GILLIAN ANDERSON MADONNA NIKI TAYLOR ELLE MACPHERSON KATE MOSS CAROL ALT TYRA BANKS FREDERIQUE KATHY IRELAND PAM ANDERSON KAREN MULDER VALERIA MAZZA SHALOM HARLOW AMBER VALLETTA LAETITA CASTA BETTIE PAGE HEIDI KLUM PATRICIA FORD DAISY FUENTES KELLY BROOK SEX SEXY MONICA LEWINSKY JENNIFER LOPEZ CLAUDIA SCHIFFER CINDY CRAWFORD JENNIFER ANNISTON GILLIAN ANDERSON MADONNA NIKI TAYLOR ELLE MACPHERSON KATE MOSS CAROL ALT TYRA BANKS FREDERIQUE KATHY IRELAND PAM ANDERSON KAREN MULDER VALERIA MAZZA SHALOM HARLOW AMBER VALLETTA LAETITA CASTA BETTIE PAGE HEIDI KLUM PATRICIA FORD DAISY FUENTES KELLY BROOK SEX SEXY MONICA LEWINSKY JENNIFER LOPEZ CLAUDIA SCHIFFER CINDY CRAWFORD JENNIFER ANNISTON GILLIAN ANDERSON MADONNA NIKI TAYLOR ELLE MACPHERSON KATE MOSS CAROL ALT TYRA BANKS FREDERIQUE KATHY IRELAND PAM ANDERSON KAREN MULDER VALERIA MAZZA SHALOM HARLOW AMBER VALLETTA LAETITA CASTA BETTIE PAGE HEIDI KLUM PATRICIA FORD DAISY FUENTES KELLY BROOK

Page 34: An overview of the technology used Information Retrieval Louise Guthrie University of Sheffield.

4/15/2013

IDF and collection context

• IDF sensitive to the document collection content
– General newspapers
• “amazon” > “forest” “destruction” > “rain”
– Amazon book store press releases
• “forest” “destruction” > “rain” > “amazon”

Page 35: An overview of the technology used Information Retrieval Louise Guthrie University of Sheffield.

4/15/2013

Models

• Mathematically modelling the retrieval process
– So as to better understand it
– Draw on work of others
• Vector space
• Probabilistic

Page 36: An overview of the technology used Information Retrieval Louise Guthrie University of Sheffield.

4/15/2013

A vector space model

• In a vector space model, similarity is measured as the cosine of the angle between the two vectors (one is the document vector and one is the query vector)

• The value of each component of the vector might be
– 0 or 1
– TF
– TF•IDF
– some other weight
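
A minimal sketch of cosine similarity, using raw term counts from the frequency table shown earlier as document vectors. The binary query vector for "Brutus Caesar" is a hypothetical example.

```python
import math

def cosine(u, v):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

# Component order: Antony, Brutus, Caesar, Calpurnia, Cleopatra, mercy, worser
antony_and_cleopatra = [157, 4, 232, 0, 57, 2, 2]
julius_caesar = [73, 157, 227, 10, 0, 0, 0]
the_tempest = [0, 0, 0, 0, 0, 3, 1]

# Hypothetical query "Brutus Caesar" as a 0/1 vector.
query = [0, 1, 1, 0, 0, 0, 0]

for name, vec in [("Antony and Cleopatra", antony_and_cleopatra),
                  ("Julius Caesar", julius_caesar),
                  ("The Tempest", the_tempest)]:
    print(name, round(cosine(query, vec), 3))
```

Because the cosine depends on direction rather than length, a long play is not favoured simply for containing more words.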

Page 37: An overview of the technology used Information Retrieval Louise Guthrie University of Sheffield.

A possible Antony and Cleopatra vector

4/15/2013

           Antony and Cleopatra  Julius Caesar  The Tempest  Hamlet  Othello  Macbeth
Antony     157                   73             0            0       0        0
Brutus     4                     157            0            1       0        0
Caesar     232                   227            0            2       1        1
Calpurnia  0                     10             0            0       0        0
Cleopatra  57                    0              0            0       0        0
mercy      2                     0              3            5       5        1
worser     2                     0              1            1       1        0

(0, ..., 0, 157, 0, ..., 0, 4, 0, ..., 0, 232, 0, ..., 0, 57, 0, ..., 0, 2, 0, ..., 0, 2)

Page 38: An overview of the technology used Information Retrieval Louise Guthrie University of Sheffield.

4/15/2013

Authority

• In classic IR
– authority not so important
• On the web
– very important
• Query “Harvard”
– Dwane’s Harvard home page
– The Harvard University home page

Page 39: An overview of the technology used Information Retrieval Louise Guthrie University of Sheffield.

4/15/2013

Simple methods

• URL length

• Domain name

Page 40: An overview of the technology used Information Retrieval Louise Guthrie University of Sheffield.

4/15/2013

Research interests/publications/lecturing/supervising

My publications list probably does a reasonable job of ostensively defining my interests and past activities. For those with a preference for more explicit definitions, see below.

I'm now working at the CIIR with an interest in automatically constructed categorisations and means of explaining these constructions to users. I also plan to work some more on a couple aspects of my thesis that look promising.

Supervised by Keith van Rijsbergen in the Glasgow IR group, I finished my Ph.D. in 1997 looking at the issues surrounding the use of Word Sense Disambiguation applied to IR: a number of publications have resulted from this work. While doing my Ph.D., I was fortunate enough to apply for and get a small grant which enabled Ross Purves and I to investigate the use of IR ranked retrieval in the field of avalanche forecasting (snow avalanches that is), this resulted in a paper in JDoc. At Glasgow, I also worked on a number of TREC submissions and also co-wrote the guidelines for creating the very short queries introduced in TREC-6. Finally, I was involved in lecturing work on the AIS Advanced MSc course writing and presenting two short courses on Implementation and NLP applied to IR. I also supervised/co-supervised four MSc students. The work of three of these bright young things have been published in a number of good conferences.

My first introduction to IR was the building (with Iain Campbell) of an interface to a probabilistic IR system


Hubs and Authorities

Authorities - sites that other web pages link to frequently on a particular topic

Hubs - sites that tend to cite authorities
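
The two definitions reinforce each other, and that can be sketched as an iteration in the style of the HITS algorithm: a page's authority score is the sum of the hub scores of pages linking to it, and its hub score is the sum of the authority scores of pages it links to. The link graph and page names below are hypothetical.

```python
import math

def normalise(scores):
    """Scale the scores to unit Euclidean length (left as-is if all zero)."""
    length = math.sqrt(sum(v * v for v in scores.values())) or 1.0
    return {page: v / length for page, v in scores.items()}

def hubs_and_authorities(links, iterations=20):
    """links: page -> list of pages it links to. Returns (hub, authority) scores."""
    pages = set(links)
    for targets in links.values():
        pages.update(targets)
    hub = {p: 1.0 for p in pages}
    auth = {p: 1.0 for p in pages}
    for _ in range(iterations):
        # Authority: total hub score of the pages that link to p.
        auth = normalise({p: sum(hub[q] for q, targets in links.items()
                                 if p in targets) for p in pages})
        # Hub: total authority score of the pages that p links to.
        hub = normalise({p: sum(auth[t] for t in links.get(p, ()))
                         for p in pages})
    return hub, auth

# Hypothetical link graph.
links = {
    "portal": ["university_home", "other_university"],
    "directory": ["university_home", "other_university"],
    "fan_page": ["university_home"],
}
hub, auth = hubs_and_authorities(links)
print(max(auth, key=auth.get), max(hub, key=hub.get))
```

In this toy graph the page that everyone links to ends up as the top authority, and the pages that link to several authorities score higher as hubs than the page that links to only one.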


Page 41: An overview of the technology used Information Retrieval Louise Guthrie University of Sheffield.

4/15/2013

Evaluation

• Measure how well an IR system is doing
– Effectiveness
• Number of relevant documents retrieved
– Also
• Speed
• Storage requirements
• Usability

Page 42: An overview of the technology used Information Retrieval Louise Guthrie University of Sheffield.

4/15/2013

Search Engine Technology - What we want

Page 43: An overview of the technology used Information Retrieval Louise Guthrie University of Sheffield.

4/15/2013

Search Engine Technology - What we get

Recall 3 / 11

Page 44: An overview of the technology used Information Retrieval Louise Guthrie University of Sheffield.

4/15/2013

Search Engine Technology

Precision - 3 / 9
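
The two figures on these slides (recall 3 / 11, precision 3 / 9) follow from three counts: 11 relevant documents, 9 retrieved documents, and 3 documents in both sets. A small sketch with hypothetical document ids:

```python
def precision_recall(retrieved, relevant):
    """Precision = hits / retrieved, recall = hits / relevant."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

# Hypothetical ids: 11 relevant documents, 9 retrieved, 3 in common.
relevant = set(range(1, 12))
retrieved = {1, 2, 3} | set(range(100, 106))

precision, recall = precision_recall(retrieved, relevant)
print(f"precision = {precision:.3f}, recall = {recall:.3f}")
```

Precision only needs judgements on what was returned; recall needs the full set of relevant documents, which is why it is harder to measure.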

Page 45: An overview of the technology used Information Retrieval Louise Guthrie University of Sheffield.

How do we tell what is good?

• In small closed collections, human judgments are used.

• On large collections the overlap of systems is used.

• Search engines don’t say very much about how they work - the number of click-throughs may certainly be a factor.

4/15/2013

Page 46: An overview of the technology used Information Retrieval Louise Guthrie University of Sheffield.

Effectiveness

• Precision is easy
– P at rank 10.
• Recall is hard
– Total number of relevant documents?

4/15/2013

Page 47: An overview of the technology used Information Retrieval Louise Guthrie University of Sheffield.

4/15/2013

Test collections

• Test collection
– Set of documents (few thousand to few million)
– Set of queries (50-400)
– Set of relevance judgements
• Humans check all documents!
• Use pooling
– Take top 100 from every submission
– Remove duplicates
– Manually assess these only.
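
Pooling as described above can be sketched in a few lines: take the top results from each submitted ranking and merge them into one set, which removes duplicates. The run contents are hypothetical, and a depth of 2 is used instead of 100 to keep the example small.

```python
def build_pool(submissions, depth=100):
    """Union of the top `depth` documents from every submitted ranking."""
    pool = set()
    for ranking in submissions:
        pool.update(ranking[:depth])  # a set drops duplicates automatically
    return pool

# Two hypothetical system rankings for the same query.
run_a = ["d1", "d2", "d3"]
run_b = ["d2", "d4", "d5"]

pool = build_pool([run_a, run_b], depth=2)
print(sorted(pool))
```

Only the documents in the pool go to the human assessors; everything outside it is treated as unjudged.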

Page 48: An overview of the technology used Information Retrieval Louise Guthrie University of Sheffield.

4/15/2013

Test collections

• Small collections (~3 MB)
– Cranfield, NPL, CACM - title (& abstract)
• Medium (~4 GB)
– TREC - full text
• Large (~100 GB)
– VLC track of TREC
• Compare with reality (~40 TB)
– CIA, GCHQ, large search services

Page 49: An overview of the technology used Information Retrieval Louise Guthrie University of Sheffield.

HOW DOCUMENTS ARE STORED

4/15/2013

Page 50: An overview of the technology used Information Retrieval Louise Guthrie University of Sheffield.

4/15/2013

Index the documents

Document 1: It’s the end of the world as we know it.

Document 2: The end of the world is near.

[Figure: an index built from the two documents. Each unique term (it, end, the, of, world, is, ...) is stored once (“Don’t duplicate”) together with the number of documents containing it (#docs), followed by one entry per document giving the document ID and the term frequency. “It’s” is expanded to “it is”.]

• The index stores unique terms for fast retrieval.

• Each word points to documents containing the term and its term frequency.
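
A rough sketch of such an index for the two example documents. The tokenisation (lower-casing, expanding "it's" to "it is", keeping only letters) is an assumption for illustration.

```python
import re
from collections import Counter, defaultdict

# Contractions to expand before indexing (only the one from the example).
EXPANSIONS = {"it's": "it is"}

def tokenize(text):
    """Lower-case, expand known contractions, and split into words."""
    text = text.lower().replace("\u2019", "'")  # treat curly apostrophes as straight
    for short, full in EXPANSIONS.items():
        text = text.replace(short, full)
    return re.findall(r"[a-z]+", text)

def build_index(docs):
    """Map each unique term to {document id: term frequency}."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for term, freq in Counter(tokenize(text)).items():
            index[term][doc_id] = freq
    return index

docs = {
    1: "It's the end of the world as we know it.",
    2: "The end of the world is near.",
}
index = build_index(docs)

for term in ("it", "the", "is", "near"):
    postings = index[term]
    # term, number of documents containing it, then document id: term frequency
    print(term, len(postings), postings)
```

Each term is stored once, and answering a query only requires looking up the query terms rather than scanning every document.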

Page 51: An overview of the technology used Information Retrieval Louise Guthrie University of Sheffield.

A FEW REFERENCES

4/15/2013

Page 52: An overview of the technology used Information Retrieval Louise Guthrie University of Sheffield.

4/15/2013

Don’t need Boolean?

• Ranking found to be better than Boolean
• But lack of specificity in ranking
– destruction AND (amazon OR south american) AND rain forest
– destruction, amazon, south american, rain forest
– Jansen, B.J., Spink, A., Bateman, J., and Saracevic, T. (1998): Real Life Information Retrieval: A Study Of User Queries On The Web, in SIGIR Forum: A Publication of the Special Interest Group on Information Retrieval, 32(1): 5-17

Page 53: An overview of the technology used Information Retrieval Louise Guthrie University of Sheffield.

4/15/2013

Model references

– wx,y - weight of vector element

• Vector space
– Salton, G. & Lesk, M.E. (1968): Computer evaluation of indexing and text processing. Journal of the ACM, 15(1): 8-36

– Any of the Salton SMART books

Page 54: An overview of the technology used Information Retrieval Louise Guthrie University of Sheffield.

4/15/2013

Is Stemming beneficial?

Research says ‘sort of’; see:

Hull, D.A. (1996): Stemming algorithms: A case study for detailed evaluation, in Journal of the American Society for Information Science, 47(1): 70-84.

Page 55: An overview of the technology used Information Retrieval Louise Guthrie University of Sheffield.

4/15/2013

Reference for BM25

• Popular weighting scheme
– Robertson, S.E., Walker, S., Beaulieu, M.M., Gatford, M., Payne, A. (1995): Okapi at TREC-4, in NIST Special Publication 500-236: The Fourth Text REtrieval Conference (TREC-4): 73-96