Date posted: 20-Dec-2015

Mehran Sahami, Timothy D. Heilman

A Web-based Kernel Function for Measuring the Similarity of Short Text Snippets

Introduction

Wish to determine how similar two short text snippets are.

High degree of semantic similarity:
  • United Nations Secretary General vs Kofi Annan
  • AI vs Artificial Intelligence

Share terms:
  • graphical models vs graphical interface

Related Work

  • Query expansion techniques
  • Other means of determining query similarity
  • Set overlap (intersection)
  • SVM for text classification
  • Latent Semantic Kernels (LSK)
  • Semantic Proximity Matrix
  • Cross-lingual techniques

A New Similarity Function

Let $x$ represent a short text snippet (query) to a search engine $S$.

Let $R(x)$ be the set of $n$ retrieved documents $d_1, d_2, \ldots, d_n$.

Compute the TFIDF term vector $v_i$ for each document $d_i \in R(x)$.

Truncate each vector $v_i$ to include its $m$ highest weighted terms.
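The truncation step can be sketched in Python as follows. This is a minimal illustration, assuming each TFIDF vector is stored as a dict from term to weight; the function name `truncate_vector` and the example weights are hypothetical and do not come from the slides.

```python
import heapq

def truncate_vector(vector, m):
    """Keep only the m highest-weighted terms of a sparse term vector.

    `vector` maps term -> TFIDF weight (an assumed representation).
    """
    # nlargest returns the m (term, weight) pairs with the largest weights.
    top = heapq.nlargest(m, vector.items(), key=lambda item: item[1])
    return dict(top)

# Hypothetical weights for one retrieved document.
v = {"kernel": 2.1, "similarity": 1.7, "the": 0.1, "snippet": 1.2}
print(truncate_vector(v, 2))  # {'kernel': 2.1, 'similarity': 1.7}
```

Truncation keeps the vectors small, so the later centroid and dot-product steps only touch the most informative terms of each document.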

Normalize

Let $C(x)$ be the centroid of the L2 normalized vectors $v_i$:

$$C(x) = \frac{1}{n} \sum_{i=1}^{n} \frac{v_i}{\|v_i\|_2}$$

Let $QE(x)$ be the L2 normalization of the centroid $C(x)$:

$$QE(x) = \frac{C(x)}{\|C(x)\|_2}$$
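The two normalization steps can be sketched as below. This is an illustrative sketch only, assuming the truncated term vectors are dicts from term to weight; the names `l2_normalize` and `query_expansion` and the sample vectors are hypothetical.

```python
import math

def l2_norm(vector):
    """Euclidean length of a sparse vector stored as term -> weight."""
    return math.sqrt(sum(w * w for w in vector.values()))

def l2_normalize(vector):
    """Scale a sparse vector to unit L2 length (empty if it has zero length)."""
    norm = l2_norm(vector)
    if norm == 0.0:
        return {}
    return {term: w / norm for term, w in vector.items()}

def query_expansion(vectors):
    """QE(x): the L2-normalized centroid of the L2-normalized vectors v_i."""
    n = len(vectors)
    centroid = {}
    for v in vectors:
        for term, w in l2_normalize(v).items():
            centroid[term] = centroid.get(term, 0.0) + w / n
    return l2_normalize(centroid)

# Two hypothetical document vectors retrieved for one snippet.
docs = [{"kofi": 3.0, "annan": 4.0}, {"annan": 1.0}]
qe = query_expansion(docs)
print(qe)
```

Normalizing each document vector before averaging keeps long documents from dominating the centroid, and the final normalization gives every snippet a unit-length expansion.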

Kernel Function

$$K(x, y) = QE(x) \cdot QE(y)$$
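As a small sketch of the kernel itself, assuming the query expansions are already available as unit-length sparse vectors (dicts from term to weight), the dot product can be computed like this. The function name `kernel` and the example expansions are hypothetical.

```python
def kernel(qe_x, qe_y):
    """K(x, y) = QE(x) . QE(y) for sparse vectors stored as dicts."""
    # Iterate over the smaller vector; only shared terms contribute.
    if len(qe_x) > len(qe_y):
        qe_x, qe_y = qe_y, qe_x
    return sum(w * qe_y.get(term, 0.0) for term, w in qe_x.items())

# Hypothetical unit-length expansions of two snippets.
qe_a = {"annan": 0.8, "un": 0.6}
qe_b = {"annan": 0.6, "secretary": 0.8}
print(kernel(qe_a, qe_b))
```

Because both expansions have unit L2 length, the kernel is the cosine of the angle between them, so with non-negative term weights it lies between 0 and 1.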

Initial Results with Kernel

Three genres of text snippet matching:
  • Acronyms
  • Individuals and their positions
  • Multi-faceted terms

Acronyms

Text1                                      Text2   Kernel  Cosine  Set Overlap
Support vector machine                     SVM     0.812   0.0     0.110
Portable document format                   PDF     0.732   0.0     0.060
Artificial intelligence                    AI      0.831   0.0     0.255
Artificial insemination                    AI      0.391   0.0     0.000
term frequency inverse document frequency  tf idf  0.831   0.0     0.125
term frequency inverse document frequency  tfidf   0.507   0.0     0.060
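The 0.0 values in the Cosine column can be reproduced with a short sketch, assuming that column is the cosine similarity of the snippets' own bag-of-words vectors (an assumption about how it was computed; lowercasing and whitespace tokenization are also illustrative choices). An acronym and its expansion share no terms, so their dot product is zero.

```python
import math
from collections import Counter

def cosine(text1, text2):
    """Cosine similarity of raw term-count vectors of two snippets."""
    a = Counter(text1.lower().split())
    b = Counter(text2.lower().split())
    # Only terms present in both snippets contribute to the dot product.
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = (math.sqrt(sum(c * c for c in a.values()))
            * math.sqrt(sum(c * c for c in b.values())))
    return dot / norm if norm else 0.0

print(cosine("Support vector machine", "SVM"))          # 0.0
print(cosine("graphical models", "graphical interface"))
```

The second call shows the opposite failure: "graphical models" and "graphical interface" score well above zero on shared terms alone.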

Individuals and their positions

Multi-faceted terms

Related Query Suggestion

Kernel function $K(u, q_i)$ for $q_i \in Q$, where:

  • $u$ is any newly issued user query
  • $Q$ is a repository of approximately 116 million popular user queries issued in 2003, determined by sampling anonymized web search logs from the Google search engine

Algorithm

Given: user query $u$ and list of matched queries $q_i$ from repository

Output: list $Z$ of queries to suggest

Initialize suggestion list $Z$

Sort kernel scores $K(u, q_i)$ in descending order to produce an ordered list $L = (q_1, q_2, \ldots, q_k)$ of corresponding queries

MAX is set to the maximum number of suggestions

Post-Filter

|q| denotes the number of terms in query q
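The sort-and-select procedure can be sketched as follows. The exact post-filter condition did not survive in this transcript, so the sketch takes the filter as a pluggable predicate. The names `suggest_queries`, `keep`, and `adds_new_terms` are hypothetical, and `adds_new_terms` is an illustrative stand-in, not the filter from the slides.

```python
def suggest_queries(scored_queries, max_suggestions, keep=None):
    """Return up to max_suggestions queries, highest kernel score first.

    scored_queries: list of (query, kernel_score) pairs for a user query u.
    keep: optional post-filter, keep(candidate, accepted) -> bool
          (a hypothetical hook; the slide's own condition is not shown).
    """
    # Sort by kernel score in descending order to obtain the ordered list L.
    ranked = sorted(scored_queries, key=lambda pair: pair[1], reverse=True)
    suggestions = []  # the suggestion list Z
    for query, _score in ranked:
        if len(suggestions) >= max_suggestions:
            break
        if keep is None or keep(query, suggestions):
            suggestions.append(query)
    return suggestions

def adds_new_terms(candidate, accepted):
    """Assumed example filter: the candidate must contain at least one
    term that is missing from each already accepted suggestion."""
    terms = set(candidate.split())
    return all(terms - set(z.split()) for z in accepted)

# Kernel scores taken from the "california lottery" example in the slides.
scored = [
    ("california lottery super lotto plus", 0.778),
    ("california lotto home", 0.812),
    ("winning lotto numbers in california", 0.792),
]
print(suggest_queries(scored, 2, keep=adds_new_terms))
```

Passing the filter as a function keeps the ranking loop independent of whatever redundancy test is used, so a different condition on |q| can be swapped in without touching the loop.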

Evaluation of Query Suggestion System

1. suggestion is totally off topic.
2. suggestion is not as good as original query.
3. suggestion is basically same as original query.
4. suggestion is potentially better than original query.
5. suggestion is fantastic - should suggest this query since it might help a user find what they're looking for if they issued it instead of the original query.

Evaluations

Original Query      Suggested Queries                    Kernel Score  Human Rating
california lottery  california lotto home                0.812         3
                    winning lotto numbers in california  0.792         5
                    california lottery super lotto plus  0.778         3
valentines day      2003 valentine's day                 0.832         3
                    valentine day card                   0.822         4
                    valentines day greeting cards        0.758         4
                    I love you valentine                 0.736         2
                    new valentine one                    0.671         1

Average ratings at various kernel thresholds

Average ratings versus average number of query suggestions

Application in QA

K("Who shot Abraham Lincoln", "John Wilkes Booth") = 0.730

K("Who shot Abraham Lincoln", "Abraham Lincoln") = 0.597

Conclusion

A new kernel function for measuring the semantic similarity between pairs of short text snippets

A first direction for further work is improving the generation of query expansions, with the goal of improving the match score for the kernel function

Term Weighting Scheme

The weight $w_{i,j}$ associated with the term $t_i$ in document $d_j$ is defined to be:

$$w_{i,j} = tf_{i,j} \times \log\left(\frac{N}{df_i}\right)$$

where $tf_{i,j}$ is the frequency of $t_i$ in $d_j$, $N$ is the total number of documents, and $df_i$ is the total number of documents that contain $t_i$.
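A minimal sketch of this weighting, assuming documents arrive as lists of tokens and that the logarithm is the natural log (the slide does not state a base); the function name `tfidf_vectors` and the toy documents are hypothetical.

```python
import math
from collections import Counter

def tfidf_vectors(documents):
    """Compute w[i,j] = tf[i,j] * log(N / df[i]) for tokenized documents."""
    n_docs = len(documents)
    term_counts = [Counter(doc) for doc in documents]
    # df[i]: number of documents that contain term i.
    doc_freq = Counter()
    for counts in term_counts:
        doc_freq.update(counts.keys())
    return [
        {term: tf * math.log(n_docs / doc_freq[term])
         for term, tf in counts.items()}
        for counts in term_counts
    ]

# Two toy documents (hypothetical).
docs = [["kernel", "kernel", "text"], ["text", "snippet"]]
vectors = tfidf_vectors(docs)
print(vectors[0])
```

A term that occurs in every document, such as "text" here, gets weight 0 because log(N / df) = log(1) = 0, which is how the scheme discounts uninformative terms.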

Lp Norm

Given by:

$$\|x\|_p = \left( \sum_{i=1}^{n} |x_i|^p \right)^{1/p}$$

Most common cases:
  • $p = 1$: this is the L1 norm, which is also called the Manhattan distance
  • $p = 2$: this is the L2 norm, which is also called the Euclidean distance
  • $p = \infty$: this is the L∞ norm, also called the infinity norm or the Chebyshev norm
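The three common cases can be checked with a short sketch of the standard Lp norm definition. The function name `lp_norm` and the sample vector are hypothetical, and the p = ∞ case is handled separately as the largest absolute component.

```python
import math

def lp_norm(x, p):
    """Lp norm of a sequence of numbers: (sum |x_i|^p)^(1/p)."""
    if p == math.inf:
        # Limit case: the largest absolute component.
        return max((abs(v) for v in x), default=0.0)
    if p < 1:
        raise ValueError("p must be >= 1")
    return sum(abs(v) ** p for v in x) ** (1.0 / p)

x = [3.0, -4.0]
print(lp_norm(x, 1))         # 7.0  (L1, Manhattan)
print(lp_norm(x, 2))         # 5.0  (L2, Euclidean)
print(lp_norm(x, math.inf))  # 4.0  (L-infinity, Chebyshev)
```

The L2 case is the one used in the normalization of the term vectors and of the centroid.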