CS621 : Artificial Intelligence
Pushpak Bhattacharyya, CSE Dept., IIT Bombay
Lecture 27: Towards more intelligent search

Transcript of CS621 : Artificial Intelligence

Page 1: CS621 : Artificial Intelligence

CS621 : Artificial Intelligence

Pushpak Bhattacharyya, CSE Dept., IIT Bombay

Lecture 27: Towards more intelligent search

Page 2: CS621 : Artificial Intelligence

Desired Features of the Search Engines

• Meaning based
– More relevant results

• Multilingual
– Query in English, e.g.
– Fetch document in Hindi, e.g.
– Show it in English

Page 3: CS621 : Artificial Intelligence

Precision (P) and Recall (R)

• Tradeoff between P and R

[Venn diagram] Actual relevant set (A) and Obtained set (O); their intersection is the shaded area (S).

P = S/O

R = S/A
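The two ratios can be checked with a few lines of code; a minimal sketch, with document-id sets standing in for the Venn regions (the ids are illustrative):

```python
def precision_recall(obtained, actual):
    """P = |S|/|O|, R = |S|/|A|, where S is the intersection of the two sets."""
    s = obtained & actual
    p = len(s) / len(obtained) if obtained else 0.0
    r = len(s) / len(actual) if actual else 0.0
    return p, r

# Example: 4 documents retrieved, 5 relevant, 2 in common.
p, r = precision_recall({"d1", "d2", "d3", "d4"}, {"d1", "d2", "d5", "d6", "d7"})
print(p, r)  # 0.5 0.4
```

Retrieving more documents tends to raise R (more of A is covered) while lowering P (more of O is irrelevant), which is the tradeoff above.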

Page 4: CS621 : Artificial Intelligence

The UNL System: An Overview

Page 5: CS621 : Artificial Intelligence

Building blocks of UNL

• Universal Words (UWs)
• Relations
• Attributes
• Knowledge Base

Page 6: CS621 : Artificial Intelligence

UNL Graph

[Graph] The entry node forward(icl>send), marked @entry and @past, is linked by the relation agt to he(icl>person), by obj to minister(icl>person), and by gol to mail(icl>collection); both mail and minister carry @def.

He forwarded the mail to the minister.

Page 7: CS621 : Artificial Intelligence

UNL Expression

agt(forward(icl>send).@entry.@past, he(icl>person))

obj(forward(icl>send).@entry.@past, minister(icl>person))

gol(forward(icl>send).@entry.@past, mail(icl>collection).@def)

Page 8: CS621 : Artificial Intelligence

Universal Word (UW)

• vocabulary of UNL
• represents a concept

– Basic UW (an English word/compound word/phrase with no restrictions, i.e. no Constraint List)
– Restricted UW (with a Constraint List)

• Examples:

"crane(icl>device)", "crane(icl>bird)": nouns

"crane(icl>do)": verb ("crane the neck")

Page 9: CS621 : Artificial Intelligence

Desirable features of UWs

• Expressibility: able to represent any concept in a language

• Economy: only enough to disambiguate the head word

• Formal situatedness: every UW should be defined in the UNL Knowledge-Base

Page 10: CS621 : Artificial Intelligence

UNL Knowledge Base

• A semantic network comprising every possible UW

• A lattice structure

Page 11: CS621 : Artificial Intelligence

Enconversion

Input Sentence/Query → UNL expression

Page 12: CS621 : Artificial Intelligence

Enconversion process

• Analysis at 3 levels
– Morphological
– Syntactic
– Semantic

• Crucial role of disambiguation
– Sense (I bank with the bank on the river bank)
– Part of speech
– Attachment (I saw the boy with a telescope)

Page 13: CS621 : Artificial Intelligence

Deconversion

UNL expression → Output sentence

Page 14: CS621 : Artificial Intelligence

Deconversion process

• Syntax Planning
• Case marking
• Morphology

[Example] UNL graph: win, marked @entry and @past, with agt → Brazil, ptn → Japan, obj → match

Syntax planning: Braajila jaapaan mecha jiit
Case marking and morphology: Braajila ne jaapaan ke saatha mecha jiitaa
("Brazil won the match against Japan")

Page 15: CS621 : Artificial Intelligence

Application: meaning based multilingual search

Page 16: CS621 : Artificial Intelligence

Application: meaning based multilingual search

Page 17: CS621 : Artificial Intelligence

Top Level Description of the Methodology

• Documents represented as meaning graphs
• Queries converted to meaning graphs
• Matching on meaning graphs
• Retrieved document (a collection of meaning graphs) displayed in the language of interest

Page 18: CS621 : Artificial Intelligence

System Constituents: 1/2

• Search Front
– Crawler
– Indexer (3 level)
• On Expression
• On Concept
• On Keywords

Page 19: CS621 : Artificial Intelligence

System Constituents: 2/2

• Language Front
– EnConverter (analyses sentence to UNL)
– DeConverter (generates sentence from UNL)
– Stemmer and Morphology Analyser
– Parser
– Word Sense Disambiguator
• Needs wordnets

Page 20: CS621 : Artificial Intelligence

Overall Architecture: Failsafe Search Strategy

[Architecture diagram] The query is stemmed and enconverted to UNL (with WSD and Query Expansion); the HTML corpus is likewise enconverted and indexed (with Lucene). The search engine then attempts a Complete UNL Match; on failure a Partial UNL Match; then a UW Match; and finally falls back to keyword search over the index. Retrieved UNL documents are deconverted into the search results.

Page 21: CS621 : Artificial Intelligence

Indexing and Failsafe Search Strategy: 1/2

• The indexer creates a three-level index in the form of:

a. UNL expressions (phrasal and sentential concepts)

b. Universal Words (lexical concepts)

c. Keywords/Stem Words (using stemmers and Lucene)

Page 22: CS621 : Artificial Intelligence

Indexing and Failsafe Search Strategy: 2/2

• This enables a failsafe search strategy: 

- Complete expression matching, else

- Partial expression matching, else

- Universal Word (UW) matching, else

- Search on Keywords/Stem Words
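The cascade can be sketched as a list of matchers tried in order of decreasing precision; the matcher functions below are hypothetical stand-ins for the four levels:

```python
def failsafe_search(query, matchers):
    """Try each matcher in order; fall back to the next one on an empty result."""
    for name, match in matchers:
        hits = match(query)
        if hits:
            return name, hits
    return "none", []

# Hypothetical matchers standing in for the four levels of the strategy.
matchers = [
    ("complete-expression", lambda q: []),         # no full UNL-expression match
    ("partial-expression",  lambda q: []),         # no partial match either
    ("uw",                  lambda q: ["agro4"]),  # a Universal Word matches
    ("keyword",             lambda q: ["agro1", "agro4"]),
]
print(failsafe_search("farmer", matchers))  # ('uw', ['agro4'])
```

Because each level only fires when all more precise levels fail, the engine degrades gracefully instead of returning nothing.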

Page 23: CS621 : Artificial Intelligence

Indexing

• UNL Document Index: keeps information about each UNL document

Field     Description
docid     UNL document id
orilink   Link to original document
language  Language of original document
numlines  Number of sentences in the document

• UNL Index: stores the actual index of UNL expressions

Field     Description
rel       Relation of a UNL expression
uw1       First Universal Word of the relation
uw2       Second Universal Word of the relation
uwid1     Id of uw1
uwid2     Id of uw2
docid     UNL document id in which the above fields occur
sent      Sentence number in which the above fields occur

Page 24: CS621 : Artificial Intelligence

Index of UNL Expressions

• Each entry (a UNL expression (rel, uw1, uw2)) points to the pairs of document id and sentence number where it occurs, e.g. (agro4, 1), (agro7, 20), (agro2, 12).

rel   uw1                                        uw2
mod   support(icl>help)                          financial(mod<thing)
mod   performance(icl>operation)                 agriculture(icl>activity)
mod   government(icl>governmental organization)  australian(mod<thing)
and   strategy(icl>idea)                         trade(icl>activity)
...   ...                                        ...

• Sample UNL expressions:

mod:02(support(icl>help):4T, financial(mod<thing):4J)
mod:01(government(icl>governmental organization):5L.@def, australian(mod<thing):5A)
and:04(strategy(icl>idea):1D.@entry.@pl, trade(icl>activity):16)
mod:05(performance(icl>operation):2B.@entry.@topic, agriculture(icl>activity):2X)

Page 25: CS621 : Artificial Intelligence

Index of UWs

• Each UW points to the pairs of document id and sentence number where it occurs.

UW                                         Postings (docid: sentences)
support(icl>help)                          agro4: 1; agro7: 20
financial(mod<thing)                       agro4: 1; agro5: 15; agro7: 20
government(icl>governmental organization)  agro2: 8,16..; agro4: 1,22,26..; agro5: 3,4,5..; agro6: 3,5,22..; agro7: 3,21,32..
performance(icl>operation)                 agro2: 12; agro4: 1
indian(aoj>thing)                          agro2: 12; agro3: 15; agro4: 1
marketing(icl>commerce)                    agro1: 13; agro4: 6; agro5: 24; agro6: 6
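Such a postings table is a plain inverted index from UWs to (document id, sentence number) pairs; a minimal sketch using entries of the kind shown above:

```python
from collections import defaultdict

# Postings index: UW -> list of (document id, sentence number) pairs,
# mirroring the UW index above (the entries here are illustrative).
index = defaultdict(list)

def add(uw, docid, sent):
    index[uw].append((docid, sent))

add("support(icl>help)", "agro4", 1)
add("support(icl>help)", "agro7", 20)
add("financial(mod<thing)", "agro4", 1)

# UW match: look up the query concept and return its postings.
print(index["support(icl>help)"])  # [('agro4', 1), ('agro7', 20)]
```

A UW-level match is then a single dictionary lookup, which is what makes the middle level of the failsafe strategy cheap.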

Page 26: CS621 : Artificial Intelligence

Sophisticated matching

• Complete set of expressions matching
• Weighted expression matching
• Partial set of expressions matching
• Complete UW matching
• Headword matching (equivalent to keyword)
• Restriction matching
• Attribute matching

Page 27: CS621 : Artificial Intelligence

Keyword-based Matching needs morphology for Indian languages

[Flow] User input "ganne" → Stemmer → "gannaa" → Lucene index → output: all documents containing gannoM, gannaa, ganne

Page 28: CS621 : Artificial Intelligence

Multilingual Keyword Search

• A UW dictionary based approach
• Given a query in a language, generates a multilingual query using UW dictionaries
• Example:
– Monolingual Query: Farmer
– Multilingual Query: Farmer, शेतकरी (Marathi), किसान (Hindi)

[Flow] Query → Preprocessor (with Stemmer) → Multilingual Keyword Generator, backed by the UW Dictionary database → Multilingual Query

• Provides multilingual capability at the keyword level to the search engine

Page 29: CS621 : Artificial Intelligence

Experimentation

• Chosen Domain: Agriculture
• Languages: English, Hindi, Marathi
• Document base: Pesticides and Diseases
• Word order sensitive
– "Money lenders exploit farmers" vs. "farmers exploit money lenders"

• For CLIR: tested on
– Hindi and Marathi query retrieval from English documents
– Display in Hindi/Marathi

Page 30: CS621 : Artificial Intelligence

System Interface

• Agricultural Search Engine

Page 31: CS621 : Artificial Intelligence

Wordnet Sub-graph (Hindi)

[Figure] Hindi WordNet sub-graph for गाय ("cow"), showing SYNONYMY (गाय, गऊ, गैया), HYPERNYMY (quadruped, mammal), HYPONYMY, MERONYMY (udder, tail), ANTONYMY (bull), ABILITY VERB (to ruminate), GLOSS, and POLYSEMY links between its two senses: the milch animal ("a vegetarian female quadruped famous for its milk; Hindus revere and worship the cow") and the simple, docile person ("one who quietly accepts whatever is said to him").

Page 32: CS621 : Artificial Intelligence

Wordnet Sub-graph (Marathi)

[Figure] Marathi WordNet sub-graph for घोडा ("horse"), showing SYNONYMY (घोडा, अश्व), HYPERNYMY, HYPONYMY, MERONYMY, HOLONYMY, GLOSS, and POLYSEMY links between its two senses: the quadruped used for carrying loads, pulling carts and riding ("Arabian horses have been famous since ancient times") and the chess piece ("the knight moves two and a half squares").

Page 33: CS621 : Artificial Intelligence

Semantically Relatable Set (SRS) Based Search

(Please look up publications under www.cse.iitb.ac.in/~pb for descriptions of SRS and SRS based search)

Page 34: CS621 : Artificial Intelligence

What is SRS

• SRSs are UNL expressions without the semantic relations

• E.g., "the first non-white president of USA"
– (the, president)
– (president, of, USA)
– (first, president)
– (non-white, president)

Page 35: CS621 : Artificial Intelligence

SRS Based matching

• Complete SRS match
– All the SRSs of the query should match with the SRSs of the sentence

• Partial SRS match
– Not all of the query SRSs need to match the sentence SRSs
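Treating SRSs as tuples, the two match modes reduce to a subset test and an intersection test; a sketch, reusing the SRSs from the earlier example:

```python
def complete_srs_match(query_srs, sentence_srs):
    """All query SRSs must occur among the sentence's SRSs."""
    return set(query_srs) <= set(sentence_srs)

def partial_srs_match(query_srs, sentence_srs):
    """At least one query SRS occurs among the sentence's SRSs."""
    return bool(set(query_srs) & set(sentence_srs))

sent = [("the", "president"), ("president", "of", "USA"), ("first", "president")]
q1 = [("president", "of", "USA")]
q2 = [("president", "of", "USA"), ("non-white", "president")]
print(complete_srs_match(q1, sent), complete_srs_match(q2, sent), partial_srs_match(q2, sent))
# True False True
```

Complete matching is the high-precision mode; partial matching trades precision for recall, which is the same tradeoff the failsafe UNL strategy exploits.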

Page 36: CS621 : Artificial Intelligence

System Architecture

Page 37: CS621 : Artificial Intelligence

Experimental Setup

• Text Retrieval Conference (TREC) data was used.
• TREC provides the gold standard for queries and relevant documents:

Query Number   Document-ID       Relevance Score
8              WSJ911010-0114    1
8              WSJ911011-0085    0
21             AP880304-0049     1
21             AP880304-0192     0

Table: Relevance Judgments in TREC

• We chose 1919 documents and the first 250 queries.
– Mostly from the AP newswire, Wall Street Journal and the Ziff data.

Page 38: CS621 : Artificial Intelligence

Experiment Process

• Lucene with tf-idf as the keyword based search engine (baseline)
• SRS based search as the competing method
• Compared both search methods on various parameters

Page 39: CS621 : Artificial Intelligence

Precision Comparison

• Shows that SRS search filters out non-relevant documents much more effectively than the keyword based tf-idf search.

Page 40: CS621 : Artificial Intelligence

Recall Comparison

• tf-idf consistently outperforms the SRS search engine here.

Page 41: CS621 : Artificial Intelligence

Mean Average Precision (MAP) Comparison

• MAP contains both recall and precision oriented aspects and is also sensitive to entire ranking.

• SRS Search could not perform here because of the low recall.

MAP = (1/R) · Σ_{r=1}^{N} P(r) · rel(r)

where N is the number of retrieved documents, R the number of relevant documents, P(r) the precision at rank r, and rel(r) = 1 if the document at rank r is relevant, else 0; the mean is taken over all queries.
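For a single query, the average precision behind MAP can be computed directly from the ranking; a small sketch (the document ids are illustrative):

```python
def average_precision(ranking, relevant):
    """AP = (1/R) * sum over ranks r of P(r) * rel(r); MAP averages this over queries."""
    R = len(relevant)
    hits, total = 0, 0.0
    for r, doc in enumerate(ranking, start=1):
        if doc in relevant:
            hits += 1
            total += hits / r  # P(r) evaluated at each relevant rank
    return total / R if R else 0.0

# Relevant docs d1, d3 retrieved at ranks 1 and 3: AP = (1/2)(1/1 + 2/3) = 0.8333...
ap = average_precision(["d1", "d2", "d3"], {"d1", "d3"})
print(round(ap, 3))  # 0.833
```

Because unretrieved relevant documents contribute zero terms, low recall directly depresses AP, which is why the SRS system's MAP suffers below.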

Page 42: CS621 : Artificial Intelligence

Reasons for poor Recall: Word Divergence 1/2

• Inflectional Morphology Divergence
– Query: "child abuse"; Query SRS: (child, abuse)
– Sentence: "children are abused"; Sentence SRS: (children, abused)

• Derivational Morphology Divergence
– Query: "debt rescheduling"; Query SRS: (debt, rescheduling)
– Sentence: "rescheduling of debt"; Sentence SRS: (rescheduling, of, debt)
– Query: "polluted water"; Query SRS: (polluted, water)
– Sentence: "water pollution has increased in the city"; Sentence SRS: (water, pollution)

Page 43: CS621 : Artificial Intelligence

Reasons for poor Recall: Word Divergence 2/2

• Synonymy Divergence
– Query: "antitrust cases"; Query SRS: (antitrust, cases)
– Sentence: "An antitrust lawsuit was charged today"; Sentence SRS: (antitrust, lawsuit)

• Hypernymy Divergence
– Query has keyword "car", while the document has keyword "automobile".

• Hyponymy Divergence
– Query can be "car" whereas the document might contain "minicar".

Page 44: CS621 : Artificial Intelligence

Physical Separation Divergence

• Physical Separation Divergence
– Query: "antitrust lawsuit"; Query SRS: (antitrust, lawsuit)
– Sentence: "The federal lawsuit represents the largest antitrust action"
– Sentence SRSs: (lawsuit, represents), (represents, action), (antitrust, action)

Page 45: CS621 : Artificial Intelligence

Solutions for Divergences

Page 46: CS621 : Artificial Intelligence

Solution to Morphological Divergence

• Stemming
– All words in the document and the query SRSs are stemmed before matching.
– Gets the base form based on WordNet, while keeping the tag of the word unchanged.
• children_NN is stemmed to child_NN, but childish_JJ is not stemmed to child_NN

Page 47: CS621 : Artificial Intelligence

Solution to Synonymy-Hypernymy-Hyponymy Divergence

• Find related words from the WordNet
• Algorithm Outline

1. Get synonyms
2. Get hypernyms up to depth 2
3. Get hyponyms up to depth 2
4. Repeat steps 1, 2 and 3 for all synonyms
5. All these words are the related words

• Found related words for all words in the corpus (nouns and verbs).
• Calculated similarity between a word and its related words.
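The outline above can be sketched with a toy graph standing in for WordNet; the synonym, hypernym and hyponym entries below are invented for illustration:

```python
# Toy stand-in for WordNet (all entries invented for illustration).
synonyms = {"car": {"auto", "automobile"}, "automobile": {"car", "auto"}}
hypernym = {"car": "motor_vehicle", "motor_vehicle": "vehicle", "vehicle": "conveyance"}
hyponyms = {"car": {"minicar", "cab"}, "minicar": {"microcar"}}

def up(word, depth):
    """Collect hypernyms up to the given depth."""
    out, w = set(), word
    for _ in range(depth):
        w = hypernym.get(w)
        if w is None:
            break
        out.add(w)
    return out

def down(word, depth):
    """Collect hyponyms up to the given depth."""
    out, frontier = set(), {word}
    for _ in range(depth):
        frontier = set().union(*(hyponyms.get(w, set()) for w in frontier))
        out |= frontier
    return out

def related(word):
    """Synonyms, plus hypernyms/hyponyms to depth 2 for the word and each synonym."""
    words = {word} | synonyms.get(word, set())
    rel = set(words)
    for w in words:
        rel |= up(w, 2) | down(w, 2)
    return rel - {word}

print(sorted(related("car")))
# ['auto', 'automobile', 'cab', 'microcar', 'minicar', 'motor_vehicle', 'vehicle']
```

In the real system the three lookups would be WordNet calls; limiting depth to 2 keeps the expansion from drifting to overly general concepts like "conveyance".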

Page 48: CS621 : Artificial Intelligence

SRS Tuning

• Deals with the "Other Divergences" problem.
• Enriches the SRSs in the corpus.
• Basically adds new SRSs by applying augment rules on existing SRSs.

Page 49: CS621 : Artificial Intelligence

Sample Rules I

Rule: (N1, N2) => (N2(J), N1)

Sentence: “water pollution”

Sentence SRS: (water_N, pollution_N)

Tuned SRS: (polluted_J, water_N)

Page 50: CS621 : Artificial Intelligence

Sample Rules II

• Rule: (V, N) => (N, V(N))
• Sentence: "destroy city"
• Sentence SRS: (destroy_V, city_N)
• Augmented SRS: (city_N, destruction_N)

Page 51: CS621 : Artificial Intelligence

Sample Rules III

• Rule: (N1, of, N2) => (N2, N1)
– Sentence: "rescheduling of debt"
– Sentence SRS: (rescheduling_N, of, debt_N)
– Augmented SRS: (debt_N, rescheduling_N)

• Rule: (N1, of, N2) => (N2(J), N1)
– Sentence: "cup of gold"
– Sentence SRS: (cup_N, of, gold_N)
– Augmented SRS: (golden_J, cup_N)

Page 52: CS621 : Artificial Intelligence

Sample Rules IV

• Rule: (V, for, N) => (N, V(N))
– Sentence: "applied for a certificate"
– Sentence SRS: (applied_V, for, certificate_N)
– Augmented SRS: (certificate_N, application_N)

• Rule: (J, for, N-ANIMATE) => (N, J(N))
– Sentence: "famous for her painting"
– Sentence SRS: (famous_J, for, painting_N)
– Augmented SRS: (painting_N, fame_N)
– Sentence: "It is good for John"
– Sentence SRS: (good_J, for, John_N)
– Augmented SRS: (John_N, goodness_N) X
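Rules of this kind can be implemented as pattern rewrites over SRS tuples. A sketch of Rules II and III, with a small hypothetical derivational lexicon supplying the V(N) and N(J) forms:

```python
# Hypothetical derivational lexicon for the V -> V(N) and N -> N(J) mappings.
nominal = {"destroy": "destruction", "apply": "application"}
adjectival = {"pollution": "polluted", "gold": "golden"}

def tune(srs):
    """Apply the augment rules to one SRS; return any new SRSs produced."""
    out = []
    if len(srs) == 3 and srs[1] == "of":           # (N1, of, N2) => (N2, N1)
        n1, _, n2 = srs
        out.append((n2, n1))
        if n2 in adjectival:                       # (N1, of, N2) => (N2(J), N1)
            out.append((adjectival[n2], n1))
    elif len(srs) == 2 and srs[0] in nominal:      # (V, N) => (N, V(N))
        v, n = srs
        out.append((n, nominal[v]))
    return out

print(tune(("cup", "of", "gold")))  # [('gold', 'cup'), ('golden', 'cup')]
print(tune(("destroy", "city")))    # [('city', 'destruction')]
```

The augmented SRSs are added to the index alongside the originals, so a query phrased either way can still match.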

Page 53: CS621 : Artificial Intelligence

Getting the Derived Form Using the Porter Stemmer

• Let the word be "national_J". We want the noun form.

• Step 1. Get the stem using Porter
– "national" -> "nat"

• Step 2. Get all nouns from WordNet which start with "nat"
– "nature", "natural", "nation", "nationhood", "native" etc.

• Step 3. Get the words which have the largest lexicographical match with "national"
– "nation", "nationhood"

• Choose any one of them
– "nation_N"
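The three steps can be sketched as follows; a trivial suffix stripper stands in for the Porter stemmer and a short noun list stands in for WordNet (both are simplifying assumptions):

```python
# Small noun list standing in for WordNet's noun inventory.
NOUNS = ["nature", "nation", "nationhood", "native", "navy"]

def stem(word):
    """Crude suffix stripper standing in for the Porter stemmer."""
    for suf in ("ional", "ation", "al", "s"):
        if word.endswith(suf):
            return word[: -len(suf)]
    return word

def common_prefix_len(a, b):
    n = 0
    while n < min(len(a), len(b)) and a[n] == b[n]:
        n += 1
    return n

def derived_noun(word):
    s = stem(word)                                  # Step 1: "national" -> "nat"
    cands = [n for n in NOUNS if n.startswith(s)]   # Step 2: nouns with that stem
    best = max(common_prefix_len(word, n) for n in cands)
    return [n for n in cands if common_prefix_len(word, n) == best]  # Step 3

print(derived_noun("national"))  # ['nation', 'nationhood']
```

The longest-prefix filter is what rules out "nature" and "native", which share the stem but diverge from "national" earlier.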

Page 54: CS621 : Artificial Intelligence

New System Architecture

Page 55: CS621 : Artificial Intelligence

New Formulation for Sentence Relevance

r(s) = max_{srs ∈ q} [ weight(srs) · max_{srs' ∈ s} t(srs, srs') ] / max_{srs ∈ q} weight(srs)

where the SRS Similarity t() is calculated as

t(srs, srs') = t(cw1, cw1') · equal(fw, fw') · t(cw2, cw2')

t(w1, w2) is calculated using the similarity measure discussed.

t(cw1, cw1') and equal(fw, fw') become 1 while matching (FW, CW)s and (CW, CW)s respectively.
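The SRS similarity t(srs, srs') multiplies two content-word similarities by a function-word equality check; a sketch, treating each SRS as (cw1, fw, cw2) with fw possibly absent (None) and an illustrative word-similarity table:

```python
# Illustrative word-level similarity table; identical words score 1.0.
word_sim = {("case", "lawsuit"): 0.8}

def t_word(w1, w2):
    if w1 == w2:
        return 1.0
    return word_sim.get((w1, w2), word_sim.get((w2, w1), 0.0))

def t_srs(srs1, srs2):
    """t(srs, srs') = t(cw1, cw1') * equal(fw, fw') * t(cw2, cw2')."""
    cw11, fw1, cw12 = srs1
    cw21, fw2, cw22 = srs2
    equal = 1.0 if fw1 == fw2 else 0.0
    return t_word(cw11, cw21) * equal * t_word(cw12, cw22)

print(t_srs(("antitrust", None, "case"), ("antitrust", None, "lawsuit")))      # 0.8
print(t_srs(("rescheduling", "of", "debt"), ("rescheduling", "for", "debt")))  # 0.0
```

A single mismatched function word zeroes the whole product, so the formulation stays strict on structure while being soft on content-word choice.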

Page 56: CS621 : Artificial Intelligence

Recall Comparison

– Dramatic improvement in recall

Page 57: CS621 : Artificial Intelligence

Precision Comparison

• Drop observed in precision, but still higher than TF-IDF
• Non-relevant documents effectively filtered out

Page 58: CS621 : Artificial Intelligence

Mean Average Precision (MAP) Comparison

• MAP better than TF-IDF

Page 59: CS621 : Artificial Intelligence

Highlight of Results

• Recall of the enhanced system dramatically improved from 0.102 to 0.362 (comparable to TF-IDF)

• Significant rise in MAP (0.149 from 0.054)

Page 60: CS621 : Artificial Intelligence

Searching with Enriched Information and Structures

Verma Kamaljeet S. M.Tech Project, 2008

(advised by Prof. Pushpak Bhattacharyya)

Page 61: CS621 : Artificial Intelligence

Motivation

• Language Modeling Approach is popular
– Solid theoretical foundations
– Promising empirical retrieval performance

• Semantic Smoothing incorporates synonym, context and sense information to produce more accurate results

• SRSs (Semantically Relatable Sequences)
– Are usually unambiguous and should give precise results
– Ideal candidates for Semantic Smoothing

Page 62: CS621 : Artificial Intelligence

SRS Language Model

• Two Goals
– Incorporate synonym and sense information in the model
• e.g. query "case", documents "lawsuit"

– Use the contextual information present in SRS tuples
• e.g. SRS "instrument case":
p(container | "instrument case") > p(lawsuit | "instrument case")
• SRS "antitrust case":
p(lawsuit | "antitrust case") > p(container | "antitrust case")

Page 63: CS621 : Artificial Intelligence

SRS Language Model

• Query

– q = (q1,q2,…..,qn)• Corpus C

– documents d1,d2…..• Key Notion

– Document to Query translation or Query Generation– p(q / d)

• Query terms assumed to be independent

j

j dqpdqp )/()/(
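Under the independence assumption the query likelihood is a product of per-term probabilities, usually computed in log space; a sketch with a hypothetical unigram document model:

```python
import math

def query_loglikelihood(query, doc_model):
    """log p(q|d) = sum_j log p(q_j|d), assuming independent query terms."""
    return sum(math.log(doc_model[q]) for q in query)

# Hypothetical smoothed unigram model of one document (probabilities sum to 1).
doc_model = {"space": 0.4, "program": 0.3, "launch": 0.3}
score = query_loglikelihood(["space", "program"], doc_model)
print(round(score, 4))  # log(0.4) + log(0.3) = -2.1203
```

Documents are then ranked by this score; smoothing (two-stage, semantic, etc.) is what keeps p(q_j | d) nonzero for terms absent from the document.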

Page 64: CS621 : Artificial Intelligence

High Level System Architecture

[Architecture diagram] Raw Documents → SRS Generator → SRS Documents → Indexer → SRS Index and Word Index; a Translation Probability Estimator builds the Translation Matrix; the Searcher answers queries from the indices and the Evaluator scores the results.

Page 65: CS621 : Artificial Intelligence

High Level System Architecture

[Architecture diagram, identical to the previous slide but with an added SRS Pruning component.]

Page 66: CS621 : Artificial Intelligence

Indexing

• Word Index
– Stop words are removed
– Stemming done (Porter Stemmer)

• SRS Index
– Useful SRSs are indexed

[Diagram] SRS vocabulary V_SRS (SRS1, SRS2, SRS3), document set V_d (D1, D2) and word vocabulary V_w (w1, w2, w3)

Page 67: CS621 : Artificial Intelligence

SRS pruning

• Goal
– Identification of "Good" SRSs

• PoS-tag based pruning
– SRSs with 2 <= length <= 4 kept
– Starting with NN, ending with NN
– Starting with NN/JJ, ending with NN
– Starting with NN/JJ/DT, ending with NN

• Results
– Did not improve

Page 68: CS621 : Artificial Intelligence

Translation Probabilities

• EM Training
– Starts with an initial guess
– Iteratively improves the guess by increasing the likelihood until it converges

• Update Equations

E-step: p^(n)(w) = (1 − β) p^(n)(w | θ_k) / [ (1 − β) p^(n)(w | θ_k) + β p(w | C) ]

M-step: p^(n+1)(w | θ_k) = c(w, D_k) p^(n)(w) / Σ_i c(w_i, D_k) p^(n)(w_i)
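The two updates can be run directly on term counts; a sketch on toy data, where β and the counts below are purely illustrative:

```python
def em_translation_probs(counts, p_bg, beta=0.5, iters=50):
    """EM for p(w|theta_k): counts = c(w, D_k), p_bg = background model p(w|C)."""
    words = list(counts)
    total = sum(counts.values())
    p = {w: counts[w] / total for w in words}  # initial guess
    for _ in range(iters):
        # E-step: probability that w came from the SRS model, not the background
        hidden = {w: (1 - beta) * p[w] / ((1 - beta) * p[w] + beta * p_bg[w])
                  for w in words}
        # M-step: re-estimate p(w|theta_k) from the expected counts
        norm = sum(counts[w] * hidden[w] for w in words)
        p = {w: counts[w] * hidden[w] / norm for w in words}
    return p

counts = {"space": 30, "program": 20, "the": 50}      # toy c(w, D_k)
p_bg = {"space": 0.01, "program": 0.01, "the": 0.5}   # toy background p(w|C)
p = em_translation_probs(counts, p_bg)
print(p["space"] > p["the"])  # the background absorbs "the": True
```

The effect is that common background words like "the" lose probability mass to the topical terms, which is exactly what the translation tables on the next slide show.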

Page 69: CS621 : Artificial Intelligence

SRS Translation Probabilities

SRS: {space, program}

Term        Prob.
space       0.0266
program     0.0229
launch      0.0169
technology  0.0161
orbit       0.0148
astronaut   0.0148
mission     0.0139
NASA        0.0136
satellite   0.0134
earth       0.0132

Page 70: CS621 : Artificial Intelligence

Experimental Setup

• Document Collection
– TREC AP89
• 84,678 documents
• 145,349 distinct words
• 180.1 unique words per document on average

• Criteria
– TREC collections are popular and many published results exist

• Queries
– TREC queries 1-50
• TREC queries have title, description, narrative and concept sections
• Used only the title section

Page 71: CS621 : Artificial Intelligence

Experimental Setup

• Language Modeling Toolkit
– Dragon Toolkit
• Java based toolkit
• Implements the baseline models
• Designed to support language modeling IR

Page 72: CS621 : Artificial Intelligence

Results

Comparison of the SRS Model to the Okapi, Two-Stage and MWE Models

Model MAP Recall P@10 P@100

Okapi 0.186 1627 0.259 0.139

Two-Stage 0.187 1623 0.259 0.139

MWE 0.204 1809 0.272 0.142

SRS 0.205 1836 0.262 0.150

Page 73: CS621 : Artificial Intelligence

Results

Comparison of the SRS Model with the MWE Model at λ = 1

Metric MWE SRS Improv.

MAP 0.077 0.098 +27.27%

Recall 1289 1413 +9.62%

P@10 0.130 0.168 +29.23%

P@100 0.091 0.104 +14.29%

Page 74: CS621 : Artificial Intelligence

Results

Metric  MWE  SRS  SRS+MWE  vs. MWE  vs. SRS

MAP 0.204 0.205 0.217 +6.37% +5.85%

Recall 1809 1836 1865 +3.1% +1.58%

P@10 0.272 0.262 0.277 +1.84% +5.73%

P@100 0.142 0.150 0.153 +7.75% +2.0%

Page 75: CS621 : Artificial Intelligence

Some important tasks done

• Made simple yet effective changes to the SRS Generation module.
– Raw document to SRS document conversion time reduced from 5-10 minutes to 40 seconds - 2 minutes
– Increased testing corpus size from 1818 to more than 84,000 documents

• Proposed and implemented an entirely new searching strategy
– SRS based context-sensitive semantic smoothing
– Decent SRS pruning module to identify "good" SRSs
– Results were the best amongst all the Language Modeling approaches (Two-Stage, Word Translation, MWE Topic Signature)

Page 76: CS621 : Artificial Intelligence

Conclusions

• Novel approach to context-sensitive semantic smoothing

• Semantically Relatable Sequences (SRSs) are used
– Semantically related, not necessarily consecutive words
– Context leads to more accurate results
– Not all SRSs are useful indexing units (SRS Pruning)

• A mixture model of the SRS Translation Model and the two-stage language model is effective

• NLP-inspired patterns in the language modeling approach hold the promise of better IR performance

Page 77: CS621 : Artificial Intelligence

Future Work

• SRS Pruning
– Other schemes like tf-idf

• Complex combination of MWE and SRSs
– Mixture model of MWE, SRSs and baseline models

• Learning of the mixture weight coefficient
• Improvement of the SRS generation
• Experimenting with other NLP patterns

Page 78: CS621 : Artificial Intelligence

Thank You

Page 79: CS621 : Artificial Intelligence

References

1. J. Lafferty and C. Zhai, “Document Language Models, Query Models, and Risk Minimization for Information Retrieval,” Proc. 24th Ann. Int'l ACM Conf. Research and Development in Information Retrieval (SIGIR '01), pp. 111-119, 2001.

2. X. Zhou, X. Hu, X. Zhang, "Topic Signature Language Models for Ad hoc Retrieval," IEEE Transactions on Knowledge and Data Engineering, vol. 19, no. 9, pp. 1276-1287, 2007.

3. R. Mohanty, M.K. Prasad, L. Narayanswamy and P. Bhattacharyya, “Semantically Relatable Sequences in the Context of Interlingua Based Machine Translation”, International Conference on Natural Language Processing, 2007.

4. S. Khaitan, K. Verma, R. Mohanty and P. Bhattacharyya, “Exploiting Semantic Proximity for Information Retrieval”, IJCAI 2007 Workshop on Cross Lingual Information Access, 2007.

5. J. Ponte and W.B. Croft, “A Language Modeling Approach to Information Retrieval,” Proc. 21st Ann. Int'l ACM Conf. Research and Development in Information Retrieval (SIGIR '98), pp. 275-281, 1998.

6. C. Zhai and J. Lafferty, “A Study of Smoothing Methods for Language Models Applied to Ad Hoc Information Retrieval,” Proc. 24th Ann. Int'l ACM Conf. Research and Development in Information Retrieval (SIGIR '01), pp. 334-342, 2001.

7. C. Zhai and J. Lafferty, “Model-Based Feedback in the Language Modeling Approach to Information Retrieval,” Proc. 10th Int'l Conf. Information and Knowledge Management (CIKM '01), pp. 403-410, 2001.

Page 80: CS621 : Artificial Intelligence

References

8. C. Zhai and J. Lafferty, “Two-Stage Language Models for Information Retrieval,” Proc. ACM Conf. Research and Development in Information Retrieval (SIGIR '02), 2002.

9. A. Berger and J. Lafferty, “Information Retrieval as Statistical Translation,” Proc. 22nd Ann. Int'l ACM Conf. Research and Development in Information Retrieval (SIGIR '99), pp. 222-229, 1999.

10. A.P. Dempster, N.M. Laird, and D.B. Rubin, “Maximum Likelihood from Incomplete Data via the EM Algorithm,” J. Royal Statistical Soc., vol. 39, pp. 1-38, 1977.

11. X. Zhou, X. Hu, X. Zhang, X. Lin, and I.-Y. Song, “Context-Sensitive Semantic Smoothing for the Language Modeling Approach to Genomic IR,” Proc. 29th Ann. Int'l ACM Conf. Research and Development on Information Retrieval (SIGIR '06), pp.70-77, Aug. 2006.

12. X. Zhou, X. Zhang, and X. Hu, “The Dragon Toolkit Developer Guide,” Data Mining and Bioinformatics Laboratory, Drexel Univ., http://www.dragontoolkit.org/tutorial.pdf, 2007.

13. S.E. Robertson et al., “Okapi at TREC-4,” Proc. Fourth Text Retrieval Conf. (TREC '95), 1995.

14. F. Smadja, “Retrieving Collocations from Text: Xtract,” Computational Linguistics, vol. 19, no. 1, pp. 143-177, 1993.

Page 81: CS621 : Artificial Intelligence

Presentation End

Page 82: CS621 : Artificial Intelligence

Translation Probabilities

• D_k: the set of documents containing SRS_k

• Not all terms in D_k center on SRS_k
– Some terms address the issue of other SRSs
– Some represent the background information

• A generative model similar to [7] is used
– Mixture model of the SRS translation model and the background collection model
– θ_k is the set of parameters of the model of SRS_k

p(w) = (1 − β) p(w | θ_k) + β p(w | C)

Page 83: CS621 : Artificial Intelligence

Translation Probabilities

• Log-likelihood of generating D_k:

log p(D_k) = Σ_w c(w, D_k) · log p(w)

• c(w, D_k) is the frequency of the term w in D_k

• Goal
– Estimate the translation probabilities by maximizing the log-likelihood
– Expectation Maximization

Page 84: CS621 : Artificial Intelligence

Non-Interpolated Average Precision

• Formula

AP(Q) = (1/|Rel|) · Σ_{D ∈ Rel} |{D' ∈ Rel : r(D') ≤ r(D)}| / r(D)

• Where r(D) is the rank of the document D and Rel is the set of relevant documents for a query Q.

• To obtain the MAP score, we average the non-interpolated average precision across all the queries of the collection.

Page 85: CS621 : Artificial Intelligence

Two Stage Language Model & Okapi

• TSLM Formula

p(Q | D) = Π_{q ∈ Q} [ (1 − λ) · (tf(q, D) + μ p(q | C)) / (|D| + μ) + λ p(q | C) ]

• Okapi Model

sim(Q, D) = Σ_{q ∈ Q} log( (N − df(q) + 0.5) / (df(q) + 0.5) ) · tf(q, D) / (0.5 + 1.5 · |D| / avg_dl + tf(q, D))

• tf(q, D) is the term frequency of q in document D
• df(q) is the document frequency of q
• N is the number of documents in the collection
• avg_dl is the average document length in the collection
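The Okapi baseline is straightforward to implement from the formula above (as reconstructed here); a sketch with invented collection statistics:

```python
import math

def okapi_sim(query, doc_tf, doc_len, avg_dl, df, N):
    """Simplified Okapi score of a document for a query, per the formula above."""
    score = 0.0
    for q in query:
        tf = doc_tf.get(q, 0)
        if tf == 0:
            continue  # absent terms contribute nothing
        idf = math.log((N - df[q] + 0.5) / (df[q] + 0.5))
        score += idf * tf / (0.5 + 1.5 * doc_len / avg_dl + tf)
    return score

# Illustrative statistics: the df values are invented; N matches the AP89 size.
df = {"space": 100, "program": 400}
s = okapi_sim(["space", "program"], {"space": 3, "program": 1},
              doc_len=120, avg_dl=150, df=df, N=84678)
print(s > 0)  # rare terms with positive tf yield a positive score
```

The idf factor rewards rare terms, while the length-normalised tf factor saturates, so repeating a term many times gives diminishing returns.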