McGettrick Query Expansion

Query Expansion
By: Sean McGettrick

Transcript of McGettrick Query Expansion

8/3/2019 McGettrick Query Expansion

http://slidepdf.com/reader/full/mcgettrick-query-expansion 1/24

What is Query Expansion?

Query expansion is the term for a search engine adding search terms to a user's weighted search.

The goal is to improve precision and/or recall.

Example: User query: "car"; expanded query: "car cars automobile automobiles auto", etc.
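
The "car" example can be sketched in a few lines of Python. This is a minimal illustration, not a description of any particular search engine: the thesaurus contents and the names `TOY_THESAURUS` and `expand_query` are assumptions made up for this sketch.

```python
# Toy thesaurus: the entries are illustrative assumptions, not a real resource.
TOY_THESAURUS = {
    "car": ["cars", "automobile", "automobiles", "auto"],
}


def expand_query(query, thesaurus):
    """Return the query terms followed by any related terms from the thesaurus."""
    expanded = []
    for term in query.lower().split():
        for candidate in [term] + thesaurus.get(term, []):
            if candidate not in expanded:  # keep order, skip duplicates
                expanded.append(candidate)
    return expanded


print(" ".join(expand_query("car", TOY_THESAURUS)))
# prints: car cars automobile automobiles auto
```

Terms with no thesaurus entry pass through unchanged, so the expanded query is never smaller than the original.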

Classes of Query Expansion

Human and/or computer-generated thesauri

Relevance feedback

Automatic query expansion

Query Expansion Issues

Two major issues

Which terms to include?

Which terms to weight more?

Concept-Based vs. Term-Based Query Expansion

Is it better to expand based upon the individual terms in the query, or the overall concept of the query?

Relevance of Query Expansion

Query expansion is very important on the web.

The amount of information on the web is always increasing.

In 1999, Google had 135 million pages. It now has over 3 billion.

Search engine users follow specific trends with their searches.

2-3 words

Broad search terms

Do not like to expand their queries, either through refining search terms or using Boolean operators

Thesauri

What is a thesaurus in the IR world?

"Any data structure that defines semantic relatedness between words."

Schutze and Pedersen (1997)

Often more complex than normal thesauri.

Thought to be too broad to be useful.

The Need For Thesauri

It was naturally assumed that pulling words from a thesaurus would increase:

The number of documents retrieved.

Possibly precision.

The car example: "car" vs. "car, auto, automobile, vehicle, sedan, etc."

Which would retrieve the largest number of documents?

Is larger necessarily better?
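
The two closing questions can be made concrete with a small experiment. The sketch below is only an illustration: the four-document collection, the relevance judgments in `RELEVANT`, and the helper names `retrieve` and `precision_recall` are all assumptions invented for this example.

```python
# Hypothetical four-document collection; texts and relevance labels are made up.
DOCS = {
    "d1": "used car for sale",
    "d2": "automobile repair manual",
    "d3": "vehicle registration for boats and trailers",
    "d4": "auto loan rates",
}
RELEVANT = {"d1", "d2"}  # assumed judgments for a user interested in cars


def retrieve(terms, docs):
    """Boolean OR retrieval: ids of documents containing any query term."""
    wanted = set(terms)
    return {doc_id for doc_id, text in docs.items() if wanted & set(text.split())}


def precision_recall(retrieved, relevant):
    """Precision = hits / retrieved, recall = hits / relevant."""
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


narrow = retrieve(["car"], DOCS)
broad = retrieve(["car", "auto", "automobile", "vehicle", "sedan"], DOCS)
print(sorted(narrow), precision_recall(narrow, RELEVANT))
# prints: ['d1'] (1.0, 0.5)
print(sorted(broad), precision_recall(broad, RELEVANT))
# prints: ['d1', 'd2', 'd3', 'd4'] (0.5, 1.0)
```

In this toy setup the expanded query retrieves more documents and raises recall, but precision drops, which is one way the larger result set can fail to be better.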

Human & Automatically Generated Thesauri

Earliest work began in the 1950s.

H.P. Luhn

Thesaurofacet - detailed list of engineering terms

Largely used in such industries as medicine, aerospace, and other technological fields.

Drawbacks of Handcrafted Thesauri

Cost

Development.

Maintenance.

Cost often outweighs benefit.

Time

It often takes a long time for thesauri to develop.

Hard to keep up with the pace of scientific and technological development.

Automatically Generated Thesauri

Need grew from limitations of handcrafted thesauri.

Removes the cost of having experts generate thesauri.

Automatically Generated Thesauri

3 Steps.

Extract word co-occurrences.

Define word similarities.

Based upon word co-occurrence or lexical relationship.

Cluster words based upon their similarities.

Not proven very successful.

As late as 1990 many industries were still using handcrafted thesauri.
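
The three steps can be sketched as follows. This is one possible reading, not the method of any cited system: it assumes document-level co-occurrence counts, cosine similarity between co-occurrence vectors, and a simple single-link grouping with a threshold. The toy `DOCS` list and all function names are hypothetical.

```python
import math
from collections import defaultdict
from itertools import combinations

# Made-up six-document collection used only for illustration.
DOCS = [
    "car engine repair",
    "automobile engine repair",
    "car engine oil",
    "automobile engine oil",
    "bread flour recipe",
    "cake flour recipe",
]


def cooccurrence_vectors(docs):
    """Step 1: count how often each pair of distinct words shares a document."""
    vectors = defaultdict(lambda: defaultdict(int))
    for doc in docs:
        for a, b in combinations(sorted(set(doc.split())), 2):
            vectors[a][b] += 1
            vectors[b][a] += 1
    return vectors


def cosine(u, v):
    """Step 2: cosine similarity between two sparse co-occurrence vectors."""
    dot = sum(u[k] * v[k] for k in u if k in v)
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def cluster(vectors, threshold):
    """Step 3: single-link grouping of words whose similarity meets the threshold."""
    clusters = []
    for word in sorted(vectors):
        matches = [c for c in clusters
                   if any(cosine(vectors[word], vectors[w]) >= threshold for w in c)]
        merged = {word}.union(*matches)
        clusters = [c for c in clusters if c not in matches] + [merged]
    return clusters


vectors = cooccurrence_vectors(DOCS)
clusters = cluster(vectors, threshold=0.9)
print([sorted(c) for c in clusters])
```

In this toy collection "car" and "automobile" never appear in the same document, yet they land in one cluster because they co-occur with the same words. The threshold of 0.9 is arbitrary, and results on real collections depend heavily on that choice.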

Relevance Feedback

Began in the 1960s.

Significant improvement in recall and precision over early query expansion work.

Basic process as follows.

The user creates their initial query, which returns an initial result set.

The user then selects a list of documents that are relevant to their search.

The system then re-weights and/or expands the query based upon the terms in the documents.
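
The expansion half of this loop can be sketched as below. It is a deliberately simple stand-in, assuming raw term frequency as the selection criterion and no stopword handling; real systems typically weight terms more carefully. The function name `feedback_expand` and the sample documents are hypothetical.

```python
from collections import Counter


def feedback_expand(query_terms, selected_docs, num_new_terms=2):
    """Append the most frequent unseen terms from the user-selected documents."""
    counts = Counter()
    for text in selected_docs:
        counts.update(t for t in text.lower().split() if t not in query_terms)
    # Sort by descending count, then alphabetically, for a deterministic result.
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return list(query_terms) + [term for term, _ in ranked[:num_new_terms]]


# Documents the user marked as relevant after the initial query "jaguar".
selected = ["jaguar car dealer", "jaguar car review"]
print(feedback_expand(["jaguar"], selected))
# prints: ['jaguar', 'car', 'dealer']
```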

Relevance Feedback Models

Many different types of models.

Depend on methods and theories behind them.

Vector Space.

Probabilistic.

Boolean.

"Ide dec-hi" Method

In this method, all the top-ranked relevant documents are used, as is the highest-ranked non-relevant document.

The non-relevant document is used as a point in the vector space from which the feedback query is moved away.

Up to 160% improvement over non-expanded queries.
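
As usually stated, the Ide dec-hi update adds the vectors of all relevant documents to the query vector and subtracts the vector of the single highest-ranked non-relevant document. The sketch below assumes raw term counts as vector weights and drops terms whose weight falls to zero or below; both choices, and the names `to_vector` and `ide_dec_hi`, are assumptions of this example rather than details taken from the slides.

```python
from collections import Counter


def to_vector(text):
    """Represent a text as a sparse term-count vector (assumed weighting)."""
    return Counter(text.lower().split())


def ide_dec_hi(query_vec, relevant_vecs, top_nonrelevant_vec):
    """new_query = query + sum(relevant docs) - highest-ranked non-relevant doc."""
    new_query = Counter(query_vec)
    for vec in relevant_vecs:
        for term, weight in vec.items():
            new_query[term] += weight
    for term, weight in top_nonrelevant_vec.items():
        new_query[term] -= weight
    # Dropping non-positive weights is a convention assumed for this sketch.
    return {term: w for term, w in new_query.items() if w > 0}


query = to_vector("jaguar")
relevant = [to_vector("jaguar car dealer"), to_vector("jaguar car engine")]
nonrelevant = to_vector("jaguar cat habitat")
print(ide_dec_hi(query, relevant, nonrelevant))
```

Here the terms "cat" and "habitat" end up with negative weight and are dropped, while "car", "dealer", and "engine" enter the query, pulling it toward the relevant documents and away from the non-relevant one.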

Interactive Query Expansion

Uses a thesaurus.

After initial query is submitted, the system returns a list of associated and relevant words derived from both the result set and a thesaurus.

Useful, but more research is needed.

Pseudo-relevance Feedback

Grew from problems involved in implementing relevance feedback systems.

Users do not like to give manual feedback to the system.

Automatic Query Expansion

The process of automatic query expansion using computer-generated thesauri.

Works somewhat like pseudo-relevance feedback.

Implementation not as useful, but still widely researched.

Term Co-occurrence Measures

Process of developing relationships between words based upon their co-occurrence in documents.

Clustering

Documents that share a significant number of terms are grouped together.

A thesaurus is then generated from the terms in these categories.

Categories sometimes too narrow or broad.

Does not account for synonyms.

Lexical Co-Occurrence Measures

Instead of looking at the frequency of terms in a document, the proximity of words in a document is looked at.

Context of words becomes important.

Some performance improvement shown in small document collections.

Not quite as good as relevance feedback, but better than pseudo-relevance feedback.
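
One way to capture proximity is to count word pairs only when they fall within a fixed-size window of each other. The sketch below assumes a window measured in token positions and whitespace tokenization; the function name `window_cooccurrence` and the window size are illustrative choices, not something the slides specify.

```python
from collections import Counter


def window_cooccurrence(text, window=2):
    """Count unordered word pairs that occur within `window` positions of each other."""
    tokens = text.lower().split()
    counts = Counter()
    for i, left in enumerate(tokens):
        # Look only at the next `window` tokens to the right of position i.
        for right in tokens[i + 1 : i + 1 + window]:
            if left != right:
                counts[tuple(sorted((left, right)))] += 1
    return counts


pairs = window_cooccurrence("red car fast engine blue truck", window=2)
print(pairs[("car", "red")])    # prints: 1
print(pairs[("red", "truck")])  # prints: 0
```

"red" and "truck" share the document but are too far apart to be counted, which is the difference from a document-level co-occurrence measure.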

Current State of Query Expansion

Query expansion technology has reached somewhat of a plateau.

This is due to limiting factors of relevance feedback and word co-occurrence.

Current research is attempting to refine previous research in the field.

Where To Go From Here?

Grammatical Based Thesauri

Syntactical relationship between words

Words placed into classes

Some improvement on small document collections. Failed on larger ones.

AI Searching

Mostly theory

Intelligent Agents

Could be customized to reflect specific needs of the user

Next logical step in IR, but still far off from commercial use

Works Cited

Attardi, G., S. Di Marco and F. Sebastiani. 1998. Automated Generation of Category-Specific Thesauri for Interactive Query Expansion.

Grefenstette, G. 1992. Use of Syntactic Context to Produce Term Association Lists for Text Retrieval. In Proceedings of the 15th Annual International ACM-SIGIR Conference on Research and Development in Information Retrieval, Copenhagen, Denmark, ed. N. Belkin, P. Ingwersen and A. M. Pejtersen: pp. 89-97. New York: ACM Press.

Ide, E. 1971. New Experiments in Relevance Feedback. In G. Salton. The SMART Retrieval System: Experiments in automatic document processing. Englewood Cliffs, NJ: Prentice-Hall.

Qiu, Y. 1993. Concept Based Query Expansion. In Proceedings of SIGIR-93, 16th ACM International Conference on Research and Development in Information Retrieval.

Schutze, H. and J. Pedersen. 1997. A Cooccurrence-based Thesaurus and Two Applications to Information Retrieval. Information Processing and Management 33, no. 3: pp. 307-318.

Walker, D. 2001. Query Expansion Using Thesauri.