Mining Interesting Trivia for Entities from Wikipedia PART-II
Mining Interesting Trivia for Entities from Wikipedia
Presented By: Abhay Prakash, En. No. 10211002, IIT Roorkee
Supervised By: Dr. Dhaval Patel, Assistant Professor, IIT Roorkee
Dr. Manoj K. Chinnakotla, Applied Researcher, Microsoft India
Publication Accepted
[1] Abhay Prakash, Manoj K. Chinnakotla, Dhaval Patel, Puneet Garg: "Did You Know? Mining Interesting Trivia for Entities from Wikipedia". In 24th International Joint Conference on Artificial Intelligence (IJCAI), 2015.
Conference Rating: A*
Introduction: Problem Statement
Definition: Trivia are facts about an entity which are interesting due to any of the following characteristics: unusualness, uniqueness, unexpectedness or weirdness. They generally appear in "Did you know?" articles.
E.g. "To prepare for the Joker's role, Heath Ledger secluded himself in a hotel room for a month." [Batman Begins]
It is unusual for an actor/human to seclude himself for a month.
Problem Statement: For a given entity, mine the top-k interesting trivia from its Wikipedia page, where a trivia is considered interesting if, when it is shown to N persons, more than N/2 persons find it interesting. For evaluation on the unseen set, we chose N = 5 (statistical significance discussed in the mid-term evaluation).
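The majority-voting rule above can be sketched as a few lines of Python; this is an illustrative reading of the definition, not the authors' code:

```python
# Sketch of the interestingness criterion: a trivia counts as interesting
# if more than N/2 of the N human judges vote it interesting.
def is_interesting(votes, n=5):
    """votes: list of booleans, one per human judge (len(votes) == n)."""
    assert len(votes) == n
    return sum(votes) > n / 2

# With N = 5, at least 3 of 5 judges must find the trivia interesting.
label = is_interesting([True, True, True, False, False])  # True: 3 > 2.5
```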
Wikipedia Trivia Miner (WTM)
Based on an ML approach to mine trivia from unstructured text:
- Trains a ranker using sample trivia of the target domain (experiments with Movie entities and Celebrity entities)
- Harnesses the trained ranker to mine trivia from an entity's Wikipedia page, retrieving the top-k standalone interesting sentences
Why Wikipedia?
- Reliable for factual correctness
- Ample number of interesting trivia (56/100 in our experiment)
System Architecture
Filtering & Grading: Filters out noisy samples and assigns each sample a grade, as required by the ranker.
Interestingness Ranker: Extracts features from the samples/candidates; trains the ranker (SVMrank) and ranks the candidates.
Candidate Selection: Identifies candidate sentences from Wikipedia.
[Architecture diagram. Train Phase: Human-Voted Trivia Source → Train Dataset → Filtering & Grading → Feature Extraction → SVMrank → Model. Retrieval Phase: Candidates' Source → Candidate Selection → Feature Extraction (using the Knowledge Base) → ranking with the trained model → Top-K Interesting Trivia from Candidates.]
Execution Phases
Train Phase:
- Crawls and prepares the training data
- Featurizes the training data
- Trains SVMrank to build a model
Retrieval Phase:
- Crawls the entity's Wikipedia text
- Identifies candidate sentences for trivia
- Featurizes the candidates
- Ranks the candidates using the already-built model
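SVMrank learns by reducing ranking to pairwise classification: within each query (here, each entity), every pair of candidates with different grades yields one training example whose features are the difference of the two candidates' vectors. A minimal sketch of that reduction, assuming plain numeric feature lists (not the authors' code):

```python
# Pairwise reduction used by ranking SVMs: for each pair of differently
# graded samples within one query, emit (x1 - x2, sign(g1 - g2)).
from itertools import combinations

def pairwise_examples(samples):
    """samples: list of (grade, feature_vector) for one entity/query.
    Returns (diff_vector, label) pairs; label +1 if the first outranks
    the second, -1 otherwise."""
    out = []
    for (g1, x1), (g2, x2) in combinations(samples, 2):
        if g1 == g2:
            continue  # equal grades carry no ordering information
        diff = [a - b for a, b in zip(x1, x2)]
        out.append((diff, 1 if g1 > g2 else -1))
    return out

pairs = pairwise_examples([(3, [1.0, 0.0]), (1, [0.0, 1.0]), (1, [0.5, 0.5])])
```

A linear classifier trained on these difference vectors yields a weight vector whose dot product with a candidate's features serves as the ranking score.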
Feature Engineering

Bucket | Feature | Significance | Sample features | Example Trivia
Unigram (U) Features | Each word's TF-IDF | Identifies important words which make the trivia interesting | "stunt", "award", "improvise" | "Tom Cruise did all of his own stunt driving."
Linguistic (L) Features | Superlative Words | Show extremeness (uniqueness) | "best", "longest", "first" | "The longest animated Disney film since Fantasia (1940)."
Linguistic (L) Features | Contradictory Words | Opposing ideas can spark intrigue and interest | "but", "although", "unlike" | "The studios wanted Matthew McConaughey for the lead role, but James Cameron insisted on Leonardo DiCaprio."
Linguistic (L) Features | Root Word (Main Verb) | Captures the core activity being discussed in the sentence | root_gross | "Gravity grossed $274 Mn in North America."
Linguistic (L) Features | Subject Word (First Noun) | Captures the core thing being discussed in the sentence | subj_actor | "The actors snorted crushed B vitamins for scenes involving cocaine."
Linguistic (L) Features | Readability (FOG Index, binned into 3 bins) | Complex and lengthy trivia are hardly ever interesting | --- | ---
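The readability feature uses the Gunning FOG index, 0.4 × (words/sentence + 100 × complex words/words), where a complex word has three or more syllables. A hedged sketch follows; the slides do not give WTM's exact syllable counter or bin boundaries, so a common vowel-group heuristic and illustrative cut-offs are used:

```python
import re

# Gunning FOG index with a simple vowel-group syllable heuristic.
def syllables(word):
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

def fog_index(text):
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z]+", text)
    complex_words = [w for w in words if syllables(w) >= 3]
    return 0.4 * (len(words) / len(sentences)
                  + 100 * len(complex_words) / len(words))

def fog_bin(text):
    """Bin the FOG score into 3 coarse buckets (illustrative boundaries)."""
    score = fog_index(text)
    return 1 if score < 8 else (2 if score < 12 else 3)
```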
Feature Engineering (Contd…)

Bucket | Feature | Significance | Sample features | Example Trivia
Entity (E) Features | Generic NEs | Capture general about-ness | MONEY, ORGANIZATION, PERSON, DATE, TIME and LOCATION | "The guns in the film were supplied by Aldo Uberti Inc., a company in Italy." → ORGANIZATION, LOCATION
Entity (E) Features | Related Entities | Capture specific about-ness (entities resolved using DBpedia) | entity_producer, entity_director | "According to Victoria Alonso, Rocket Raccoon and Groot were created through a mix of motion-capture and rotomation VFX." → entity_producer, entity_character
Entity (E) Features | Entity Linking before (L) Parsing | Captures the generalized story of the sentence | subj_entity_producer | [The same trivia above] → "According to entity_producer, …" → subj_entity_producer instead of subj_Victoria
Entity (E) Features | Focus Entities | Capture the core entities being talked about | underroot_entity_producer | [The same trivia above] → underroot_entity_producer, underroot_entity_character
Feature Engineering: Example
Ex. "According to Victoria Alonso, Rocket Raccoon and Groot were created through a mix of motion-capture and rotomation VFX."
Features extracted: 18025 (U) + 5 (L) + 4686 (E) columns in total over all train data.

create: 0.25 | mix: 0.75 | motion: 0.96 | capture: 0.4 | rotomation: 0.85 | VFX: 0.75 | root_create: 1 | supPOS: 0 | subj_entity_producer: 1 | FOG: 3
contradictory: 0 | entity_producer: 1 | entity_character: 1 | underroot_entity_producer: 1 | underroot_entity_character: 1

All remaining features have value 0: entity_actor = 0, award = 0, subj_actor = 0, root_win = 0, ….
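Since almost all of the ~22k columns are zero for any one sentence, a sparse representation is natural. An illustrative sketch of assembling one candidate's vector as a {feature name: value} dict (the TF-IDF weights here are made up for the example):

```python
# Assemble a sparse feature vector for one candidate sentence,
# mirroring the worked example above (hypothetical helper, not WTM code).
def make_feature_vector(tfidf, root, subj, n_superlatives, fog_bin, entities):
    vec = dict(tfidf)                  # Unigram (U): word -> TF-IDF weight
    vec["root_" + root] = 1            # Linguistic (L): main verb
    vec["subj_" + subj] = 1            # Linguistic (L): first noun
    vec["supPOS"] = n_superlatives     # Linguistic (L): superlative count
    vec["FOG"] = fog_bin               # Linguistic (L): binned readability
    for e in entities:                 # Entity (E): linked DBpedia roles
        vec["entity_" + e] = 1
    return vec

vec = make_feature_vector(
    tfidf={"create": 0.25, "mix": 0.75, "VFX": 0.75},
    root="create", subj="entity_producer",
    n_superlatives=0, fog_bin=3,
    entities=["producer", "character"])
```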
Comparative Approaches
I. Random [Baseline I]:
- 10 sentences picked randomly from the Wikipedia page
II. CS + Random:
- Candidates selected (standalone, context-independent sentences), i.e., sentences like "it really reminds me of my childhood" are removed
- 10 sentences picked randomly from the candidates
III. CS + supPOS (Best) [Baseline II]:
- Candidates selected, then ranked by the number of superlative words
- Among sentences with the same number of superlative words, the interesting ones are deliberately placed first (best case)
supPOS (Best Case)
Rank | # of sup. words | Class
1 | 2 | Interesting
2 | 2 | Boring
3 | 1 | Interesting
4 | 1 | Interesting
5 | 1 | Interesting
6 | 1 | Boring
7 | 1 | Boring
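The supPOS baseline above can be sketched in a few lines. A real implementation would count superlatives via POS tags (JJS/RBS from a tagger); here a tiny hand-picked lexicon stands in for the tagger, so this is an assumption-laden sketch rather than the baseline's actual code:

```python
# supPOS baseline sketch: rank candidate sentences by superlative count.
SUPERLATIVES = {"best", "longest", "first", "largest", "most"}  # stand-in lexicon

def sup_count(sentence):
    return sum(1 for w in sentence.lower().split()
               if w.strip(".,") in SUPERLATIVES)

def rank_by_superlatives(sentences):
    return sorted(sentences, key=sup_count, reverse=True)

ranked = rank_by_superlatives([
    "It grossed well in North America.",
    "The longest animated Disney film since Fantasia.",
])
```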
Variants of WTM
I. WTM (U):
- Candidates selected
- ML ranking of candidates using only Unigram (U) features
II. WTM (U+L+E):
- Candidates selected
- ML ranking of candidates using all features: Unigram (U) + Linguistic (L) + Entity (E)
Results: P@10
Metric is Precision at 10 (P@10): out of the top 10 ranked candidates, how many are actually interesting.
P@10 by approach: Random 0.25, CS+Random 0.30, supPOS (Best Case) 0.34, WTM (U) 0.34, WTM (U+L+E) 0.45.
CS+Random > Random: shows the significance of Candidate Selection.
WTM (U+L+E) >> WTM (U): shows the significance of the engineered Linguistic (L) and Entity (E) features.
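P@10 itself is a one-liner; a sketch of the metric with an illustrative label sequence (the judgments are invented for the example):

```python
# Precision at k: fraction of the top-k ranked candidates that human
# judges labeled interesting (1) rather than boring (0).
def precision_at_k(ranked_labels, k=10):
    top = ranked_labels[:k]
    return sum(top) / len(top)

# A ranking with 4 interesting trivia in its top 10 scores P@10 = 0.4.
p = precision_at_k([1, 0, 1, 1, 0, 0, 1, 0, 0, 0])  # 0.4
```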
Results: Recall@K
supPOS is limited to one kind of trivia; WTM captures varied types, reaching 62% recall by rank 25.
Performance comparison: supPOS is better till rank 3; soon after rank 3, WTM beats supPOS.
[Chart: % Recall vs. Rank (0–25) for supPOS (Best Case), WTM and Random.]
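Recall@K complements P@10: it measures what fraction of all interesting trivia on the page appear within the top k ranks. A minimal sketch with invented labels:

```python
# Recall at k: of all interesting trivia available, the fraction that
# the ranker surfaces within its top-k candidates.
def recall_at_k(ranked_labels, k):
    total = sum(ranked_labels)
    return sum(ranked_labels[:k]) / total if total else 0.0

labels = [1, 0, 1, 0, 0, 1, 0, 1]   # 4 interesting trivia in total
r = recall_at_k(labels, 5)          # 2 of the 4 found by rank 5 -> 0.5
```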
Sensitivity to Training Size
Current results are reported with 6163 training trivia.
WTM's precision increases with training size, a desirable property: precision can be improved simply by taking more training data.
WTM's Domain Independence
Experiment on the Celebrity domain to justify the claim of domain independence.
Dataset: Crawled trivia for the top 1000 movie celebrities from IMDb and performed 5-fold testing.
Train dataset: 4459 trivia (106 entities)
Test dataset: 500 trivia (10 entities)
The only features doubtful of being domain dependent are the Entity features:
Unigram (U) Features: all words
Linguistic (L) Features: subj_actor, root_reveal, subj_scene, but, best, FOG_index = 7.2
Entity (E) Features: entity_producer, entity_director, …
WTM's Domain Independence (Contd…)
Entity features are domain independent too:
- Entity features are automatically generated using attribute:value pairs in DBpedia. When a 'value' matches in a sentence, the match is replaced by entity_'attribute'.
- Unigram (U) and Linguistic (L) features are clearly domain independent.

Example of Entity Features in the Celebrity Domain:
FEATURE | ENTITY | TRIVIA
entity_partner | Johnny Depp | Engaged to Amber Heard [January 17, 2014].**
entity_citizenship | Nicole Kidman | First Australian actress to win the Best Actress Academy Award.
** After entity linking, the sentence is parsed as "Engaged to entity_partner".
Entity Feature Generation from DBpedia:
Movie Domain (e.g., Batman Begins (2005)): Director: Christopher Nolan → entity_director; Producer: Larry J. Franco → entity_producer
Celebrity Domain (e.g., Angelina Jolie): Partner: Brad Pitt → entity_partner; birthPlace: California → entity_birthPlace
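The substitution step described above (replace a matched DBpedia 'value' with entity_'attribute') can be sketched with a plain string replacement; a real system would need fuzzy matching and tokenization, so treat this as a simplified illustration:

```python
# Sketch of entity-feature generation: each DBpedia attribute:value pair
# of the entity turns matches of the value in a sentence into
# entity_<attribute> tokens.
def link_entities(sentence, attributes):
    """attributes: dict mapping DBpedia attribute -> value for the entity."""
    for attribute, value in attributes.items():
        sentence = sentence.replace(value, "entity_" + attribute)
    return sentence

linked = link_entities(
    "Engaged to Amber Heard.",
    {"partner": "Amber Heard"})
print(linked)  # "Engaged to entity_partner."
```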
Feature Contribution (Movie v/s Celeb.)

Celebrity Domain:
Rank | Feature | Group
1 | win | Unigram
3 | magazine | Unigram
4 | superPOS | Linguistic
5 | MONEY | Entity (NER)
6 | entity_alternativenames | Entity
7 | root_engage | Linguistic
14 | subj_earnings | Linguistic
15 | subj_entity_children | Linguistic + Entity
18 | entity_birthplace | Entity
19 | subj_unlinked_location | Linguistic + Entity

Movie Domain:
Rank | Feature | Group
1 | subj_scene | Linguistic
2 | subj_entity_cast | Linguistic + Entity
3 | entity_produced_by | Entity
4 | underroot_unlinked_organization | Linguistic + Entity
6 | root_improvise | Linguistic
7 | entity_character | Entity
8 | MONEY | Entity (NER)
14 | stunt | Unigram
16 | superPOS | Linguistic
17 | subj_actor | Linguistic

Top features: our advanced features are useful and intuitive for humans too.
Entity linking leads to better generalization (instead of entity_wolverine, the model gets entity_cast).
Results: P@10 (Celebrity Domain)
P@10 by approach: Random 0.39, supPOS (Best Case) 0.54, WTM (U) 0.58, WTM (U+L+E) 0.71.
Again WTM (U+L+E) >> WTM (U): significance of the advanced (L) and (E) features.
Hence, the features and the approach are domain independent: for entities of any domain, just replace the train data (sample trivia).
Dissertation Contribution
- Identified, defined and provided a novel research problem, rather than only providing solutions to an existing problem
- Proposed a domain-independent system, "Wikipedia Trivia Miner (WTM)", to mine the top-k trivia for any given entity based on their interestingness
- Engineered features that capture the 'about-ness' of a sentence and generalize which sentences are interesting
- Proposed a mechanism to prepare ground truth for the test set that is cost-effective yet statistically significant
Future Work
- New features to increase ranking quality. Unusualness: probability of occurrence of the sentence in the considered domain. Fact popularity: lesser-known trivia could be more interesting to the majority of people.
- Trying deep learning, which could be helpful as in the case of sarcasm detection.
- Generating questions from mined trivia, to present trivia in question form.
- Obtaining personalized interesting trivia: in this dissertation we took interestingness based on majority voting; ranking could instead be based on user demographics.