Post on 20-Dec-2015
Mehran Sahami Timothy D. Heilman
A Web-based Kernel Function for Measuring the Similarity of Short Text Snippets
Introduction
Wish to determine how similar two short text snippets are.
High degree of semantic similarity
  United Nations Secretary General vs Kofi Annan
  AI vs Artificial Intelligence
Share terms
  graphical models vs graphical interface
Related Work
Query expansion techniques
Other means of determining query similarity
Set overlap (intersection)
SVM for text classification
Latent Semantic Kernels (LSK)
Semantic Proximity Matrix
Cross-lingual techniques
A New Similarity Function
Let $x$ represent a short text snippet (query) to a search engine S
Let $R(x)$ be the set of n retrieved documents $d_1, d_2, \ldots, d_n$
Compute the TFIDF term vector $v_i$ for each document $d_i \in R(x)$
Truncate each vector $v_i$ to include its m highest weighted terms
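The truncation step can be sketched in Python as follows. This is a minimal illustration, not the authors' implementation: the function name `truncate_top_m`, the dictionary representation of a term vector, and the example weights are all assumptions made for the sketch.

```python
def truncate_top_m(weights, m):
    """Keep only the m highest-weighted terms of a sparse term vector.

    `weights` maps term -> TFIDF weight (a hypothetical representation).
    """
    # Sort by weight, largest first; break ties alphabetically so the
    # result is deterministic.
    top = sorted(weights.items(), key=lambda kv: (-kv[1], kv[0]))[:m]
    return dict(top)


# Made-up weights for one retrieved document.
v_i = {"kernel": 2.1, "similarity": 1.7, "snippet": 0.9, "the": 0.1}
print(truncate_top_m(v_i, 2))
```

Truncation keeps each document vector small, so the later centroid computation only touches a bounded number of terms per document.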
Normalize
Let $C(x)$ be the centroid of the L2 normalized vectors $v_i$:
$$C(x) = \frac{1}{n} \sum_{i=1}^{n} \frac{v_i}{\|v_i\|_2}$$
Let $QE(x)$ be the L2 normalization of the centroid $C(x)$:
$$QE(x) = \frac{C(x)}{\|C(x)\|_2}$$
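A minimal Python sketch of the centroid and normalization steps is given here. It assumes term vectors are stored as dictionaries, and it assumes the kernel is the inner product of the two query expansions, which this section does not state explicitly. The function names `l2_normalize`, `query_expansion`, and `kernel` are illustrative only.

```python
import math


def l2_normalize(vec):
    """Scale a sparse vector (term -> weight) to unit L2 norm."""
    norm = math.sqrt(sum(w * w for w in vec.values()))
    if norm == 0.0:
        return dict(vec)  # leave an all-zero or empty vector unchanged
    return {t: w / norm for t, w in vec.items()}


def query_expansion(doc_vectors):
    """Centroid of the L2-normalized document vectors, then L2-normalized."""
    n = len(doc_vectors)
    centroid = {}
    for v in doc_vectors:
        for t, w in l2_normalize(v).items():
            centroid[t] = centroid.get(t, 0.0) + w / n
    return l2_normalize(centroid)


def kernel(qe_x, qe_y):
    """Inner product of two query expansions (assumed kernel definition)."""
    return sum(w * qe_y.get(t, 0.0) for t, w in qe_x.items())


# Made-up vectors standing in for the truncated TFIDF vectors of R(x).
qe_x = query_expansion([{"a": 3.0, "b": 4.0}, {"a": 1.0}])
qe_y = query_expansion([{"c": 2.0}])
print(kernel(qe_x, qe_x), kernel(qe_x, qe_y))
```

Because both expansions have unit length, the score of a snippet with itself is 1 (up to rounding), and two expansions with no terms in common score 0.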
Initial Results with Kernel
Three genres of text snippet matching
  Acronyms
  Individuals and their positions
  Multi-faceted terms
Acronyms
Text1                                        Text2     Kernel   Cosine   Set Overlap
Support vector machine                       SVM       0.812    0.0      0.110
Portable document format                     PDF       0.732    0.0      0.060
Artificial intelligence                      AI        0.831    0.0      0.255
Artificial insemination                      AI        0.391    0.0      0.000
term frequency inverse document frequency    tf idf    0.831    0.0      0.125
term frequency inverse document frequency    tfidf     0.507    0.0      0.060
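The zero Cosine scores can be reproduced with a short sketch. It assumes the Cosine column is the plain cosine similarity over the snippets' own whitespace-separated terms. The tokenization and the function name `cosine` are assumptions, not taken from the slides.

```python
import math
from collections import Counter


def cosine(text1, text2):
    """Cosine similarity of raw term-count vectors of two snippets."""
    a = Counter(text1.lower().split())
    b = Counter(text2.lower().split())
    dot = sum(a[t] * b[t] for t in a)  # Counter returns 0 for missing terms
    norm = (math.sqrt(sum(c * c for c in a.values()))
            * math.sqrt(sum(c * c for c in b.values())))
    return dot / norm if norm else 0.0


print(cosine("Support vector machine", "SVM"))
print(cosine("graphical models", "graphical interface"))
```

An acronym and its expansion share no surface terms, so the first score is exactly 0, while the second pair scores about 0.5 on the strength of one shared word. This is the mismatch that motivates expanding snippets through search results before comparing them.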
Related Query Suggestion
Kernel function $K(u, q_i)$ for $q_i \in Q$
u is any newly issued user query
Q is a repository of approximately 116 million popular user queries issued in 2003, determined by sampling anonymized web search logs from the Google search engine
Algorithm
Given: user query $u$ and list of matched queries from repository
Output: list $Z$ of queries to suggest
Initialize suggestion list $Z = \emptyset$
Sort kernel scores $K(u, q_i)$ in descending order to produce an ordered list $L = (q_1, q_2, \ldots, q_k)$ of corresponding queries $q_i$
MAX is set to the maximum number of suggestions
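The steps listed on this slide can be sketched in Python as follows. The sketch covers only what is shown here: sorting by kernel score and capping the list at MAX. Any further filtering of candidates is not described in this section and is left out. The function name `suggest` and the (query, score) pair representation are assumptions.

```python
def suggest(scored_queries, max_suggestions):
    """Return up to `max_suggestions` queries, highest kernel score first.

    `scored_queries` is a list of (query, kernel_score) pairs for the
    matched repository queries (a hypothetical representation).
    """
    # Sort kernel scores in descending order to get the ordered list L.
    ordered = sorted(scored_queries, key=lambda pair: pair[1], reverse=True)
    suggestions = []  # the suggestion list Z, initially empty
    for query, _score in ordered:
        if len(suggestions) >= max_suggestions:
            break
        suggestions.append(query)
    return suggestions


matched = [
    ("california lottery super lotto plus", 0.778),
    ("california lotto home", 0.812),
    ("winning lotto numbers in california", 0.792),
]
print(suggest(matched, 2))
```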
Evaluation of Query Suggestion System
1. suggestion is totally off topic.
2. suggestion is not as good as original query.
3. suggestion is basically same as original query.
4. suggestion is potentially better than original query.
5. suggestion is fantastic - should suggest this query since it might help a user find what they're looking for if they issued it instead of the original query.
Evaluations
Original Query        Suggested Queries                       Kernel Score   Human Rating
california lottery    california lotto home                   0.812          3
california lottery    winning lotto numbers in california     0.792          5
california lottery    california lottery super lotto plus     0.778          3
valentines day        2003 valentine's day                    0.832          3
valentines day        valentine day card                      0.822          4
valentines day        valentines day greeting cards           0.758          4
valentines day        I love you valentine                    0.736          2
valentines day        new valentine one                       0.671          1
Average ratings versus average number of query suggestions
Application in QA
K("Who shot Abraham Lincoln", "John Wilkes Booth") = 0.730
K("Who shot Abraham Lincoln", "Abraham Lincoln") = 0.597
Conclusion
A new kernel function for measuring the semantic similarity between pairs of short text snippets
A first direction for future work is improving the generation of query expansions, with the goal of improving the match score for the kernel function
Term Weighting Scheme
The weight $w_{i,j}$ associated with the term $t_i$ in document $d_j$ is defined to be:
$$w_{i,j} = tf_{i,j} \times \log\left(\frac{N}{df_i}\right)$$
where $tf_{i,j}$ is the frequency of $t_i$ in $d_j$, $N$ is the total number of documents, and $df_i$ is the total number of documents that contain $t_i$
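The weighting scheme can be illustrated with a small Python sketch. It assumes documents are already tokenized into lists of terms and uses the natural logarithm, since the slide does not specify a base. The function name `tfidf_weights` and the toy corpus are illustrative only.

```python
import math
from collections import Counter


def tfidf_weights(documents):
    """Compute w[i,j] = tf[i,j] * log(N / df[i]) for every document.

    `documents` is a list of token lists; returns one term -> weight
    dictionary per document.
    """
    n_docs = len(documents)
    df = Counter()
    for doc in documents:
        df.update(set(doc))  # count each term once per document
    vectors = []
    for doc in documents:
        tf = Counter(doc)
        vectors.append({t: tf[t] * math.log(n_docs / df[t]) for t in tf})
    return vectors


docs = [["kernel", "kernel", "text"], ["text", "snippet"]]
vectors = tfidf_weights(docs)
print(vectors[0])
```

A term that occurs in every document, such as "text" in this toy corpus, gets weight 0 because log(N / df) is log(1).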