Named Entity Recognition in Query Jiafeng Guo, Gu Xu, Xueqi Cheng, Hang Li (ACM SIGIR 2009) Speaker:...
Named Entity Recognition in Query
Jiafeng Guo, Gu Xu, Xueqi Cheng, Hang Li (ACM SIGIR 2009)
Speaker: Yi-Lin, Hsu
Advisor: Dr. Koh, Jia-ling
Date: 2009/11/16
Outline
• Introduction to NERQ
• NERQ Problem
• Implementation
• WSLDA
• Experimental Results
• Conclusion and Future Work
2009/10/22 2
Introduction to NERQ
• Named entity recognition (NER) is a subtask of information extraction that seeks to locate and classify atomic elements in text into predefined categories such as the names of persons, organizations, and locations, expressions of time, quantities, monetary values, percentages, etc.
Introduction to NERQ
• NERQ involves two tasks:
  – 1. Detection of the named entity in a given query
  – 2. Classification of the named entity into predefined classes
  – Example: mine movie titles
  – Applications: Web search, etc.
• Challenges
  – Queries are usually very short
  – Queries are not necessarily in standard form
Query Data
• New data source for NER
  – About 70% of search queries contain named entities.
  – Rich context for determining the classes of entities.
• Query context
  – “harry potter walkthrough” → “harry potter cheats” (contexts in the same class)
• Wisdom of crowds
• Very large-scale data that keeps growing
• Frequent updates with emerging named entities
NERQ Problem
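• A query containing one named entity is represented as a triple (e, t, c):
  – e: the named entity
  – t: the context of e, written as α#β, where α and β are the left and right contexts and # is a placeholder for the named entity
  – c: the class of e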
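The triple representation can be illustrated with a small sketch. It assumes whitespace tokenization and lowercasing, and the names `Triple` and `make_context` are hypothetical helpers introduced here, not part of the original slides. The class label "Game" in the example is likewise only an illustrative assumption.

```python
from typing import NamedTuple, Optional


class Triple(NamedTuple):
    entity: str   # e: the named entity
    context: str  # t: the query with the entity replaced by "#"
    cls: str      # c: the class of the entity


def make_context(query: str, entity: str) -> Optional[str]:
    """Return the context of `entity` in `query`, or None if it does not occur.

    Assumption: tokens are split on whitespace and compared case-insensitively.
    """
    q_tokens = query.lower().split()
    e_tokens = entity.lower().split()
    n = len(e_tokens)
    for i in range(len(q_tokens) - n + 1):
        if q_tokens[i:i + n] == e_tokens:
            # Keep the left context, the placeholder, and the right context.
            return " ".join(q_tokens[:i] + ["#"] + q_tokens[i + n:])
    return None


# Illustrative triple; the class label is an assumption for this example.
context = make_context("harry potter walkthrough", "harry potter")
triple = Triple("harry potter", context, "Game")
```

Here the left context α is empty and the right context β is "walkthrough", so the context string is "# walkthrough".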
Probabilistic Approach
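• The most likely triple for a query q, restricted to the triples in G(q):
$$(e,t,c)^* = \arg\max_{(e,t,c)} \Pr(q,e,t,c) = \arg\max_{(e,t,c)} \Pr(q \mid e,t,c)\,\Pr(e,t,c) = \arg\max_{(e,t,c) \in G(q)} \Pr(e,t,c) \quad (1)$$
• Factorization of the joint probability (the last step makes an assumption: the context depends only on the class):
$$\Pr(e,t,c) = \Pr(e)\,\Pr(c \mid e)\,\Pr(t \mid e,c) = \Pr(e)\,\Pr(c \mid e)\,\Pr(t \mid c) \quad (2)$$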
Topic Model for NERQ
• Given training data T = {(e_i, t_i, c_i) | i = 1..N}, the learning problem can be formalized as:
$$\max \prod_{i=1}^{N} \Pr(e_i, t_i, c_i)$$
Implementation
• Offline Training
• Online Prediction
Offline Training
• Seeds: named entities with known classes (e.g. "Harry Potter": movie)
• Scan the query log with the seed named entities and collect the queries that contain them
  – e.g. "Harry Potter trail", "Harry Potter walk through", "Harry Potter cheats"
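The seed-based scan of the query log can be sketched as follows. This is only an illustration under simple assumptions: the log is a list of query strings, matching is done on lowercased whitespace tokens, and `context_of` and `collect_contexts` are hypothetical helper names.

```python
from collections import Counter, defaultdict


def context_of(query, entity):
    """Replace the entity span in the query with '#'; None if it is absent."""
    q, e = query.lower().split(), entity.lower().split()
    for i in range(len(q) - len(e) + 1):
        if q[i:i + len(e)] == e:
            return " ".join(q[:i] + ["#"] + q[i + len(e):])
    return None


def collect_contexts(query_log, seeds):
    """Map each seed entity to a Counter of the contexts it appears in."""
    contexts = defaultdict(Counter)
    for query in query_log:
        for entity in seeds:
            t = context_of(query, entity)
            if t is not None:
                contexts[entity][t] += 1
    return contexts


# Toy query log built from the examples on the slide.
query_log = [
    "Harry Potter trail",
    "Harry Potter walk through",
    "Harry Potter cheats",
    "harry potter cheats",
    "new moon",
]
seeds = {"harry potter": "movie"}  # seed entity -> known class
contexts = collect_contexts(query_log, seeds)
```

The collected contexts per seed entity are the raw material from which class-specific contexts are later learned.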
Offline Training
• A query (e.g. "Harry Potter trails", "New Moon") is decomposed into named entity, context, and class
• Pr(e): the total frequency of queries containing e in the query log
• Pr(c|e): estimated by WS-LDA
• Pr(t|c): fixed
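The frequency-based estimate of Pr(e) might look like the sketch below. Normalizing the counts so that they sum to one over the candidate entities is an assumption made here for illustration, as are the function name `estimate_entity_prior` and the toy counts.

```python
from collections import Counter


def estimate_entity_prior(query_counts, entities):
    """Estimate Pr(e) from the total frequency of queries containing e.

    query_counts: dict mapping a query string to its frequency in the log.
    Assumption: counts are normalized over the given candidate entities.
    """
    freq = Counter()
    for query, count in query_counts.items():
        tokens = query.lower().split()
        for e in entities:
            e_tokens = e.lower().split()
            n = len(e_tokens)
            # Count the query once if the entity occurs as a token span.
            if any(tokens[i:i + n] == e_tokens
                   for i in range(len(tokens) - n + 1)):
                freq[e] += count
    total = sum(freq.values())
    if total == 0:
        return {e: 0.0 for e in entities}
    return {e: freq[e] / total for e in entities}


# Toy frequencies, invented for this example.
query_counts = {
    "harry potter trails": 30,
    "harry potter cheats": 10,
    "new moon": 60,
}
prior = estimate_entity_prior(query_counts, ["harry potter", "new moon"])
```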
Online Prediction
• Example query: "harry potter trails"
• Find the most likely triple (e,t,c) in G(q)
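Online prediction can be sketched as an exhaustive search over candidate triples. The sketch assumes that G(q) is enumerated by treating every contiguous token span of the query as a candidate entity, and that each triple is scored with Pr(e) Pr(c|e) Pr(t|c). All probability tables below are toy values invented for illustration, and `candidate_triples` and `predict` are hypothetical names.

```python
def candidate_triples(query, classes):
    """Enumerate candidate triples: every token span is a candidate entity."""
    tokens = query.lower().split()
    for i in range(len(tokens)):
        for j in range(i + 1, len(tokens) + 1):
            e = " ".join(tokens[i:j])
            t = " ".join(tokens[:i] + ["#"] + tokens[j:])
            for c in classes:
                yield e, t, c


def predict(query, p_e, p_c_given_e, p_t_given_c):
    """Return the highest-scoring triple and its score, or (None, 0.0)."""
    best, best_score = None, 0.0
    for e, t, c in candidate_triples(query, list(p_t_given_c)):
        score = (p_e.get(e, 0.0)
                 * p_c_given_e.get(e, {}).get(c, 0.0)
                 * p_t_given_c[c].get(t, 0.0))
        if score > best_score:
            best, best_score = (e, t, c), score
    return best, best_score


# Toy probability tables (assumed values, not taken from the paper).
p_e = {"harry potter": 0.4, "harry": 0.1}
p_c_given_e = {
    "harry potter": {"Movie": 0.6, "Game": 0.3, "Book": 0.1},
    "harry": {"Movie": 0.5, "Game": 0.5},
}
p_t_given_c = {
    "Movie": {"# trails": 0.2, "#": 0.1},
    "Game": {"# trails": 0.01, "# cheats": 0.3},
    "Book": {"#": 0.2},
}

best, score = predict("harry potter trails", p_e, p_c_given_e, p_t_given_c)
```

With these toy tables the span "harry potter" with context "# trails" scores highest under the Movie class. Spans that are not known entities, or contexts unseen for a class, receive a score of zero and are never selected.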
WSLDA
• Introduce weak supervision
  – LDA log likelihood + soft constraints:
$$L(w, y) = \log p(w) + C(y)$$
    (first term: LDA probability; second term: soft constraints)
  – Soft constraints:
$$C(y) = \sum_i y_i \bar{z}_i$$
    where $\bar{z}_i$ is the document probability on the i-th class and $y_i$ is the document binary label on the i-th class
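A small numeric sketch may help make the soft constraint concrete. It assumes that the document probability on the i-th class is the fraction of the document's topic assignments that fall on class i, and it adds a weight `lam` on the constraint term as an extra assumption (with `lam=1.0` the objective is the plain sum of the two terms). The function names and the numbers are illustrative only.

```python
def soft_constraint(topic_assignments, labels, num_classes):
    """C(y) = sum_i y_i * z_bar_i for a single document.

    topic_assignments: class index assigned to each word of the document.
    labels: binary label y_i per class (1 if the class is allowed).
    Assumption: z_bar_i is the empirical fraction of assignments to class i.
    """
    n = len(topic_assignments)
    if n == 0:
        raise ValueError("document has no topic assignments")
    z_bar = [sum(1 for z in topic_assignments if z == i) / n
             for i in range(num_classes)]
    return sum(y * zb for y, zb in zip(labels, z_bar))


def ws_objective(log_likelihood, topic_assignments, labels, num_classes,
                 lam=1.0):
    """LDA log likelihood plus the (optionally weighted) soft constraint."""
    return log_likelihood + lam * soft_constraint(
        topic_assignments, labels, num_classes)


# Four words assigned to classes 0, 0, 1, 2; only class 0 is labeled.
c_single = soft_constraint([0, 0, 1, 2], [1, 0, 0, 0], num_classes=4)
# Same assignments, classes 0 and 1 both labeled.
c_double = soft_constraint([0, 0, 1, 2], [1, 1, 0, 0], num_classes=4)
objective = ws_objective(-10.0, [0, 0, 1, 2], [1, 0, 0, 0], 4, lam=2.0)
```

The constraint grows when more of the document's probability mass sits on its labeled classes, which is how the labels softly steer the topics toward the predefined classes.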
WSLDA
• Objective function:
Experiments
• A real data set consisting of 6 billion queries
• 930 million unique queries
• Four semantic classes: “Movie”, “Game”, “Book”, and “Music”
• 4 human annotators
• 180 named entities were selected from the web sites of Amazon, GameSpot, and Lyrics
• 120 for training and 60 for testing
• Finally, we obtained 432,304 contexts and about 1.5 million named entities
Experiments
• Randomly sampled 400 queries from the recognition results (0.14 million) for evaluation.
Example Queries
pics of fight club          | braveheart quote
watch gladiator online      | american beauty company
12 angry men characters     | mario kart guide
pc mass effect              | crysis mods
mother teresa images        | condemned screenshots
4 minutes lyric             | king kong
the black swan summary      | blackwater novel
new moon                    | rehab the song
nineteen minutes synopsis   | umbrella chords
all summer long video       | girlfriend lyrics
Experiments
• The performance of NERQ is evaluated in terms of Top N accuracy.
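Top N accuracy can be computed as in the sketch below, assuming the usual definition: a query counts as correct if its ground-truth answer appears among the top N ranked predictions. The function name and the toy data are illustrative.

```python
def top_n_accuracy(ranked_predictions, gold, n):
    """Fraction of queries whose gold answer is among the top n predictions.

    ranked_predictions: one ranked list of predictions per query.
    gold: the ground-truth answer for each query, in the same order.
    """
    if len(ranked_predictions) != len(gold):
        raise ValueError("predictions and gold must have the same length")
    hits = sum(1 for preds, g in zip(ranked_predictions, gold)
               if g in preds[:n])
    return hits / len(gold)


# Toy example with four queries; labels are placeholders.
ranked = [["a", "b", "c"], ["b", "a"], ["c", "a", "b"], ["x"]]
gold = ["a", "a", "b", "y"]
acc_top1 = top_n_accuracy(ranked, gold, 1)
acc_top3 = top_n_accuracy(ranked, gold, 3)
```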
Experiments
• We performed experiments to compare the WS-LDA approach with two baseline methods: Determ and LDA.
• Determ learns the contexts of a certain class by simply aggregating all the contexts of named entities belonging to that class.
• LDA and WS-LDA take a probabilistic approach.
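The Determ baseline described above can be sketched as a simple aggregation. The data layout (a Counter of contexts per entity and a list of classes per entity), the function name, and the toy entries are all assumptions made for this illustration.

```python
from collections import Counter, defaultdict


def determ_contexts(entity_contexts, entity_classes):
    """Aggregate the contexts of all entities belonging to each class."""
    class_contexts = defaultdict(Counter)
    for entity, contexts in entity_contexts.items():
        for c in entity_classes.get(entity, ()):
            # Every context of the entity is credited to each of its classes.
            class_contexts[c].update(contexts)
    return class_contexts


# Toy data; counts and class memberships are invented for the example.
entity_contexts = {
    "harry potter": Counter({"# trails": 3, "# cheats": 2}),
    "new moon": Counter({"# trails": 1, "# summary": 4}),
}
entity_classes = {
    "harry potter": ["Movie", "Game", "Book"],
    "new moon": ["Movie", "Book"],
}
class_contexts = determ_contexts(entity_contexts, entity_classes)
```

In this toy run the context "# cheats" is credited to Movie and Book as well as Game, because "harry potter" belongs to all three classes. A probabilistic model can instead assign each context to classes with different weights.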
Experiments
• Learned contexts of each class (Movie, Game, Book, Music), compared across Determ, LDA, and WS-LDA
• Table 5: Comparisons on Learned Named Entities of Each Class (P@N), with columns Movie, Game, Book, Music, and Average-Class
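P@N (precision at N) is the fraction of the top N ranked items that are correct. The sketch below assumes this standard definition; the ranked list and the set of correct entities are toy data for an imagined Movie class.

```python
def precision_at_n(ranked_items, relevant, n):
    """Fraction of the top n ranked items that belong to `relevant`."""
    if n <= 0:
        raise ValueError("n must be positive")
    top = ranked_items[:n]
    return sum(1 for item in top if item in relevant) / n


# Toy ranking of entities learned for one class (illustrative only).
ranked_movies = ["gladiator", "braveheart", "mario kart", "king kong"]
true_movies = {"gladiator", "braveheart", "king kong"}
p_at_2 = precision_at_n(ranked_movies, true_movies, 2)
p_at_4 = precision_at_n(ranked_movies, true_movies, 4)
```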
Experiments
• Comparisons between WS-LDA and LDA
Conclusion
• Formalized the problem of NERQ
• Proposed a novel method for NERQ
• Developed a new topic model called WSLDA
• Future work:
  – We plan to add more classes and conduct further experiments.
  – The proposed method focuses on queries with a single named entity.
  – Some queries contain named entities outside the predefined classes (e.g. "american beauty company").
  – Some contexts were not learned by our approach because they are uncommon (e.g. "lyrics for # by chris brown").