IJCAI 2015 Presentation: Did you know?- Mining Interesting Trivia for Entities from Wikipedia
Did you know?- Mining Interesting Trivia for Entities from Wikipedia
Abhay Prakash1, Manoj K. Chinnakotla2, Dhaval Patel1, Puneet Garg2
1Indian Institute of Technology Roorkee, India 2Microsoft, India
Did you know?
The Dark Knight (2008): To prepare for the Joker's role, Heath Ledger lived alone in a hotel
room for a month, formulating the character’s posture, voice, and personality.
IJCAI-15: IJCAI-15 is the first IJCAI edition in South America, and the southernmost edition ever.
Argentina: In 2001, Argentina had 5 Presidents in 10 days!
Tom Hanks: Tom Hanks has an asteroid named after him: “12818 tomhanks”
What is Trivia?
Definition: Trivia is any fact about an entity which is interesting due to any of the following characteristics:
Unusualness, Uniqueness, Unexpectedness, Weirdness
But isn't interestingness subjective? Yes! For the current work, we take a majoritarian view of interestingness.
Why Trivia?
Trivia Follow-on Engagement
[Chart: daily trivia follow-on engagement values (0-14), 2/14/2015 to 3/14/2015]
• Helps draw user attention and improves user engagement with the experience
• Appeals to users' sense of novelty, curiosity and inquisitiveness
Trivia Click Through
[Chart: daily trivia click-through rate (0.00%-0.80%), 2/14/2015 to 3/14/2015]
Trivia Curation
Manual process: hard to scale across a large number of entities
Wikipedia Trivia Miner (WTM)
Automatically mine trivia for entities from unstructured text of Wikipedia
Why Wikipedia? Reliable for factual correctness,
with an ample number of interesting trivia (56/100 in our experiment)
Learn a model of interestingness for target domain
Use the interestingness model to rank sentences from Wikipedia
Interestingness Model
Collect ratings from humans → Train a model
Harness publicly available sources → Train a model
System Architecture
[Diagram. Train Phase: Human Voted Trivia Source → Filtering & Grading → Feature Extraction → SVMrank → Model. Retrieval Phase: Knowledge Base → Candidate Selection → Candidates' Source → Feature Extraction → Top-K Interesting Trivia from Candidates]
Wikipedia Trivia Miner (WTM)
[WTM architecture diagram, Training Phase highlighted: Learn Interestingness Model]
Filtering & Grading
Crawled trivia from IMDB's top 5K movies, 99K trivia in total
Filter facts with lower reliability: number of votes < 5
Likeness Ratio (L.R.) = (# of Interesting Votes) / (# of Total Votes)
Convert this skewed distribution into grades
Sample trivia for the movie 'Batman Begins' [screenshot taken from IMDB]
[Histogram: %age Coverage vs. Likeness Ratio, highly skewed: 39.56, 30.33, 17.08, 4.88, 3.57, 1.74, 1.06, 0.65, 0.6, 0.33, 0.21]
Filtering & Grading (Contd.)
High support required for high L.R.: for L.R. > 0.6, # of votes >= 100
Graded by percentile cutoff into 5 grades: [90,100], [75,90), [25,75), [10,25), [0,10)
6163 samples from 846 movies
Trivia Grade | Frequency
4 (Very Interesting) | 706
3 (Interesting) | 1091
2 (Ambiguous) | 2880
1 (Boring) | 945
0 (Very Boring) | 541
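The filtering-and-grading pipeline above can be sketched in a few lines. The function names are illustrative; the thresholds (votes < 5, extra support for L.R. > 0.6, and the percentile cutoffs) come from the slides.

```python
# Sketch of the Filtering & Grading step: compute each trivia item's
# Likeness Ratio, drop low-support items, then map percentile ranks
# to the five grades described above. Function names are illustrative.

def likeness_ratio(interesting_votes, total_votes):
    """L.R. = # of interesting votes / # of total votes."""
    return interesting_votes / total_votes

def filter_trivia(items, min_votes=5, high_lr=0.6, high_lr_min_votes=100):
    """Keep (interesting, total) pairs with enough votes; demand extra
    support (>= 100 votes) when L.R. exceeds 0.6. Returns kept L.R.s."""
    kept = []
    for interesting, total in items:
        if total < min_votes:
            continue
        lr = likeness_ratio(interesting, total)
        if lr > high_lr and total < high_lr_min_votes:
            continue
        kept.append(lr)
    return kept

def grade(lrs):
    """Map each L.R. to a grade via percentile cutoffs:
    [90,100] -> 4, [75,90) -> 3, [25,75) -> 2, [10,25) -> 1, [0,10) -> 0."""
    ranked = sorted(lrs)
    n = len(ranked)
    grades = []
    for lr in lrs:
        # Percentile rank: fraction of items strictly below this L.R.
        pct = 100.0 * sum(1 for x in ranked if x < lr) / n
        if pct >= 90:
            g = 4
        elif pct >= 75:
            g = 3
        elif pct >= 25:
            g = 2
        elif pct >= 10:
            g = 1
        else:
            g = 0
        grades.append(g)
    return grades
```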
Feature Engineering
Bucket | Feature | Significance | Sample features | Example Trivia

Unigram (U) Features:
- Each word's TF-IDF | Identifies important words which make the trivia interesting | "stunt", "award", "improvise" | "Tom Cruise did all of his own stunt driving."

Linguistic (L) Features:
- Superlative words | Show extremeness (uniqueness) | "best", "longest", "first" | "The longest animated Disney film since Fantasia (1940)."
- Contradictory words | Opposing ideas can spark intrigue and interest | "but", "although", "unlike" | "The studios wanted Matthew McConaughey for the lead role, but James Cameron insisted on Leonardo DiCaprio."
- Root word (main verb) | Captures the core activity discussed in the sentence | root_gross | "Gravity grossed $274 Mn in North America"
- Subject word (first noun) | Captures the core thing discussed in the sentence | subj_actor | "The actors snorted crushed B vitamins for scenes involving cocaine"
- Readability | Complex and lengthy trivia are hardly interesting | FOG Index binned into 3 bins
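A minimal, stdlib-only sketch of the Linguistic (L) bucket above. The word lists and the 3-bin FOG cutoffs here are illustrative placeholders, not the paper's actual lexicons or bins; a real system would use a POS tagger for superlatives.

```python
# Sketch of Linguistic (L) features: superlative and contradictory word
# counts, plus a rough Gunning FOG readability bin. Word lists and bin
# cutoffs are illustrative, not the paper's.

SUPERLATIVES = {"best", "longest", "first", "most", "greatest", "highest"}
CONTRADICTORY = {"but", "although", "unlike", "however", "despite"}

def syllables(word):
    """Crude syllable count: number of vowel runs."""
    count, prev_vowel = 0, False
    for ch in word.lower():
        is_vowel = ch in "aeiouy"
        if is_vowel and not prev_vowel:
            count += 1
        prev_vowel = is_vowel
    return max(count, 1)

def linguistic_features(sentence):
    words = [w.strip(".,!?\"'()").lower() for w in sentence.split()]
    words = [w for w in words if w]
    if not words:
        return {"superlatives": 0, "contradictory": 0, "fog_bin": 0}
    complex_words = sum(1 for w in words if syllables(w) >= 3)
    # Gunning FOG for one sentence: 0.4 * (words/sentences + 100*complex/words)
    fog = 0.4 * (len(words) + 100.0 * complex_words / len(words))
    return {
        "superlatives": sum(1 for w in words if w in SUPERLATIVES),
        "contradictory": sum(1 for w in words if w in CONTRADICTORY),
        "fog_bin": 0 if fog < 8 else (1 if fog < 14 else 2),
    }
```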
Feature Engineering (Contd…)
Bucket | Feature | Significance | Sample features | Example Trivia

Entity (E) Features:
- Named entities (NER) | Generic NEs capture general about-ness | MONEY, ORGANIZATION, PERSON, DATE, TIME and LOCATION | "The guns in the film were supplied by Aldo Uberti Inc., a company in Italy." → ORGANIZATION and LOCATION
- Related entities | Capture specific about-ness (entities resolved using DBpedia) | entity_producer, entity_director | "According to Victoria Alonso, Rocket Raccoon and Groot were created through a mix of motion-capture and rotomation VFX." → entity_producer, entity_character
- Entity linking before (L) parsing | Captures the generalized story of the sentence | subj_entity_producer | [Same trivia as above] "According to entity_producer, …"; subj_Victoria → subj_entity_producer
- Focus entities | Capture the core entities being talked about | underroot_entity_producer | [Same trivia as above] underroot_entity_producer, underroot_entity_character
Domain Independence of Features
All the features are automatically generated and domain-independent
Entity features are automatically generated using attribute:value pairs from DBpedia
For a match of 'value' in a sentence, the match is replaced by entity_'attribute'
Unigram (U) and Linguistic (L) features are clearly domain-independent
[Figure: DBpedia (attribute:value) pairs for Batman Begins, with sample trivia for Batman Begins]
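The replacement rule above can be sketched as follows. `link_entities` is a hypothetical helper name, and matching longer values first (so a full name wins over a contained substring) is one reasonable design choice, not necessarily the paper's.

```python
# Sketch of Entity-feature generation: given DBpedia-style attribute:value
# pairs for the target entity, replace value matches in a sentence with
# entity_<attribute>. Longest values are matched first.

import re

def link_entities(sentence, attributes):
    """Replace occurrences of each attribute's value with entity_<attribute>."""
    # Sort by value length, descending, so longer matches are applied first.
    for attr, value in sorted(attributes.items(), key=lambda kv: -len(kv[1])):
        sentence = re.sub(re.escape(value), "entity_" + attr, sentence)
    return sentence
```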
Interestingness Ranking Model
Given facts (sentences) along with their interestingness grades, learn a model of interestingness that ranks sentences by how interesting they are
Use the RankSVM model
INPUT FOR TRAINING (MOVIE_ID | FEATURES | GRADE):
1 | 1:1 5:2 … | 4
1 | … | 2
1 | … | 1
2 | … | 4
2 | … | 3
2 | … | 1
2 | … | 1

MODEL BUILT (hyperplane) [image taken and modified from Wikipedia]

INPUT FOR RANKING (MOVIE_ID | FEATURES) with OUTPUT OF RANKING (SCORE):
1 | 1:1 5:2 … | 1.7
1 | … | 2.4
2 | … | 1.2
2 | … | 2.7
2 | … | 0.13
3 | … | 3.1
3 | … | 1.3
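SVMrank (Joachims' ranking SVM) reads training data as one line per example, with examples grouped by a query id; here each movie is a query, so the ranker learns only from grade differences among sentences of the same movie. A minimal serializer for the table above might look like this (the helper name is illustrative):

```python
# Sketch: serialize (movie_id, features, grade) rows into SVMrank's input
# format, "<grade> qid:<movie_id> <feature_index>:<value> ...".
# Feature indices must be emitted in increasing order.

def to_svmrank_lines(rows):
    """rows: list of (movie_id, {feature_index: value}, grade) tuples."""
    lines = []
    for movie_id, features, grade in rows:
        feats = " ".join("%d:%g" % (i, v) for i, v in sorted(features.items()))
        lines.append("%d qid:%d %s" % (grade, movie_id, feats))
    return lines
```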
Interestingness Model: Cross Validation Results
NDCG@10 by feature group:
Unigram (U): 0.934 | Linguistic (L): 0.919 | Entity (E): 0.929 | U+L: 0.9419 | U+E: 0.944 | WTM (U+L+E): 0.951
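NDCG@10, the metric reported above, rewards placing high-grade sentences near the top. A standard computation sketch:

```python
# NDCG@k: discounted cumulative gain of the predicted order, normalized
# by the DCG of the ideal (grade-sorted) order.

import math

def dcg(grades, k):
    return sum((2 ** g - 1) / math.log2(i + 2) for i, g in enumerate(grades[:k]))

def ndcg(predicted_grades, k):
    """predicted_grades: true grades listed in the order the model ranked them."""
    best = dcg(sorted(predicted_grades, reverse=True), k)
    return dcg(predicted_grades, k) / best if best > 0 else 0.0
```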
Interestingness Model: Feature Weights
Rank | Feature | Group
1 | subj_scene | Linguistic
2 | subj_entity_cast | Linguistic + Entity
3 | entity_produced_by | Entity
4 | underroot_unlinked_organization | Linguistic + Entity
6 | root_improvise | Linguistic
7 | entity_character | Entity
8 | MONEY | Entity (NER)
14 | stunt | Unigram
16 | superPOS | Linguistic
17 | subj_actor | Linguistic
Entity linking leads to better generalization; otherwise these features would have been subj_wolverine, etc.
[WTM architecture diagram, Retrieval Phase highlighted]
Retrieval Phase: Get Trivia from a Wikipedia Page
Candidate Selection: Sentence Extraction
Crawled only the text within paragraph tags <p>…</p>
Sentence detection took each sentence for further processing
Removed sentences with missing context
E.g. "It really reminds me of my childhood."
Co-reference resolution to find out-links to different sentences
Remove a sentence if its out-link is not the target entity
"Hanks revealed that he signed onto the film after an hour and a half of reading the script. He initially ..."
The first 'he' is not an out-link, and 'the film' points to the target entity; the second 'He' is an out-link. First sentence kept, second removed.
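A stdlib-only sketch of the extraction side of Candidate Selection: keep only `<p>` text, split it into sentences, and drop sentences that open with a bare pronoun. The pronoun check is a crude stand-in for the co-reference step above (the real system uses proper co-reference resolution), and the names here are illustrative.

```python
# Sketch of Candidate Selection: extract text inside <p>...</p> only,
# split into sentences, and drop likely missing-context sentences.

from html.parser import HTMLParser
import re

class ParagraphText(HTMLParser):
    """Collects only the text that appears inside <p> tags."""
    def __init__(self):
        super().__init__()
        self.in_p, self.chunks = False, []
    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self.in_p = True
    def handle_endtag(self, tag):
        if tag == "p":
            self.in_p = False
    def handle_data(self, data):
        if self.in_p:
            self.chunks.append(data)

PRONOUNS = {"it", "he", "she", "they", "this", "that"}

def candidates(html):
    parser = ParagraphText()
    parser.feed(html)
    text = " ".join(parser.chunks)
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    # Drop sentences that open with a bare pronoun (likely missing context).
    return [s for s in sentences if s.split()[0].lower() not in PRONOUNS]
```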
Evaluation Dataset
20 New Movie Pages from Wikipedia
No. of Sentences: 2928
No. of Positive Sentences: 791
Judged (crowd-sourced) by 5 judges
Two scale voting
Boring / Interesting
Majority voting for class rating
Statistically significant?
100 trivia items from IMDB were also judged by the same 5 judges
Mechanism I: majority voting of the IMDB crowd vs. Mechanism II: crowd-sourcing by 5 judges
Agreement between two mechanisms = Substantial (Kappa Value = 0.618)
Kappa | Agreement
< 0 | Less than chance agreement
0.01-0.20 | Slight agreement
0.21-0.40 | Fair agreement
0.41-0.60 | Moderate agreement
0.61-0.80 | Substantial agreement
0.81-0.99 | Almost perfect agreement
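The agreement value above is Cohen's kappa, which discounts agreement expected by chance. A minimal computation over two mechanisms' labels (1 = Interesting, 0 = Boring):

```python
# Cohen's kappa between two raters/mechanisms over the same items:
# kappa = (observed agreement - chance agreement) / (1 - chance agreement).

from collections import Counter

def cohens_kappa(a, b):
    n = len(a)
    po = sum(1 for x, y in zip(a, b) if x == y) / n           # observed agreement
    ca, cb = Counter(a), Counter(b)
    pe = sum(ca[l] * cb[l] for l in set(a) | set(b)) / n ** 2  # chance agreement
    return (po - pe) / (1 - pe)
```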
Comparative Baselines
I. Random [Baseline I]:
- 10 sentences picked randomly from Wikipedia
II. CS + Random:
- Candidates selected (removes sentences like "It really reminds me of my childhood.")
III. CS + supPOS(Best) [Baseline II]:
- Candidates Selected
- Ranked by No. of Superlative Words
supPOS (Best Case):
Rank | # of sup. words | Class
1 | 2 | Interesting
2 | 2 | Boring
3 | 1 | Interesting
4 | 1 | Interesting
5 | 1 | Interesting
6 | 1 | Boring
7 | 1 | Boring
Results: Precision@10
P@10 by approach: Random 0.25 | CS+Random 0.30 | supPOS (Best Case) 0.34 | WTM (U) 0.34 | WTM (U+L+E) 0.45
CS+Random > Random
Shows significance of Candidate Selection
WTM (U+L+E) >> WTM (U)
Shows significance of Engineered Linguistic (L) and Entity (E) Features
Results: Recall@K
supPOS is limited to one kind of trivia
WTM captures varied types
62% recall by rank 25
Performance Comparison
supPOS is better till rank 3
Soon after rank 3, WTM beats supPOS
[Line chart: % Recall (0-70) vs. Rank (0-25) for supPOS (Best Case), WTM, and Random]
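Precision@10 and Recall@K, as reported in the results above, reduce to simple counts over the ranked, crowd-judged list. A sketch, with relevance meaning a sentence judged Interesting:

```python
# Precision@K: fraction of the top-K ranked sentences that are relevant.
# Recall@K: fraction of all relevant sentences that appear in the top K.

def precision_at_k(relevant_flags, k):
    """relevant_flags: booleans in ranked order (True = judged Interesting)."""
    return sum(relevant_flags[:k]) / k

def recall_at_k(relevant_flags, k):
    total = sum(relevant_flags)
    return sum(relevant_flags[:k]) / total if total else 0.0
```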
Qualitative Analysis
Result | Movie | Trivia | Description
WTM wins (supPOS misses) | Interstellar (2014) | "Paramount is providing a virtual reality walkthrough of the Endurance spacecraft using Oculus Rift technology." | Due to Organization being the subject, plus (U) features (technology, reality, virtual)
WTM wins (supPOS misses) | Gravity (2013) | "When the script was finalized, Cuarón assumed it would take about a year to complete the film, but it took four and a half years." | Due to Entity.Director, subject (the script), root word (assume) and (U) features (film, years)
WTM's bad | Elf (2003) | "Stop motion animation was also used." | Candidate Selection failed
WTM's bad | Rio 2 (2014) | "Rio 2 received mixed reviews from critics." | Root verb "receive" has high weight in the model
Qualitative Analysis (Contd.)
Result | Movie | Trivia | Description
supPOS wins (WTM misses) | The Incredibles (2004) | "Humans are widely considered to be the most difficult thing to execute in animation." | Presence of 'most', absence of any entity, vague root word (consider)
supPOS's bad | Lone Survivor (2013) | "Most critics praised Berg's direction, as well as the acting, story, visuals and battle sequences." | Here 'most' does not show degree but generality
Our Contributions
Introduced a novel research problem
Mining Interesting Facts for Entities from Unstructured Text
Proposed a novel approach “Wikipedia Trivia Miner (WTM)”
Mines the top-k interesting trivia for movie entities, ranked by their interestingness
For movie entities, we leverage already-available user-generated trivia data from IMDB to learn interestingness
All the data and code used in this paper have been made publicly available for research purposes at https://github.com/abhayprakash/WikipediaTriviaMiner_SharedResources/
Acknowledgements
The first author's travel was supported by travel grants from Xerox Research Centre India, IIT Roorkee, IJCAI and Microsoft Research India.