IJCAI 2015 Presentation: Did you know?- Mining Interesting Trivia for Entities from Wikipedia
Did you know?- Mining Interesting Trivia for Entities from Wikipedia
Abhay Prakash1, Manoj K. Chinnakotla2, Dhaval Patel1, Puneet Garg2
1Indian Institute of Technology Roorkee, India 2Microsoft, India
Did you know?
The Dark Knight (2008): To prepare for the Joker's role, Heath Ledger lived alone in a hotel
room for a month, formulating the character’s posture, voice, and personality.
IJCAI-15: IJCAI-15 is the first IJCAI edition in South America, and the southernmost edition ever.
Argentina: In 2001, Argentina had 5 Presidents in 10 days!
Tom Hanks: Tom Hanks has an asteroid named after him: “12818 tomhanks”
What is Trivia?
Definition: Trivia is any fact about an entity which is interesting due to any of the following characteristics:
Unusualness, Uniqueness, Unexpectedness, Weirdness
But isn't interestingness subjective? Yes! For the current work, we take a majoritarian view of interestingness.
Why Trivia?
Trivia Follow-on Engagement
[Chart: daily trivia follow-on engagement values (0-14), 2/14/2015 to 3/14/2015]
• Helps draw user attention and improves user engagement with the experience
• Appeals to users' sense of novelty, curiosity and inquisitiveness
Trivia Click Through
[Chart: daily trivia click-through rate (0.00%-0.80%), 2/14/2015 to 3/14/2015]
Trivia Curation
Manual process: hard to scale across a large number of entities
Wikipedia Trivia Miner (WTM)
Automatically mine trivia for entities from unstructured text of Wikipedia
Why Wikipedia? Reliable for factual correctness,
with an ample number of interesting trivia (56/100 in our experiment)
Learn a model of interestingness for target domain
Use the interestingness model to rank sentences from Wikipedia
Interestingness Model
Collect ratings from humans → Train a model
Harness publicly available sources → Train a model
System Architecture
[Diagram. Train Phase: Human Voted Trivia Source → Filtering & Grading → Feature Extraction → SVMrank → Model. Retrieval Phase: Knowledge Base → Candidate Selection → Candidates' Source → Feature Extraction → Top-K Interesting Trivia from Candidates]
Wikipedia Trivia Miner (WTM)
[WTM architecture diagram, Training Phase highlighted: Learn Interestingness Model]
Filtering & Grading
Crawled trivia from IMDB's top 5K movies, 99K trivia in total
Filter facts with lower reliability: number of votes < 5
Likeness Ratio (L.R.) = (# of Interesting Votes) / (# of Total Votes)
Convert this skewed distribution into grades
Sample trivia for the movie 'Batman Begins' [screenshot taken from IMDB]
[Histogram: %age Coverage vs. Likeness Ratio, highly skewed: 39.56, 30.33, 17.08, 4.88, 3.57, 1.74, 1.06, 0.65, 0.6, 0.33, 0.21]
Filtering & Grading (Contd.)
High support required for high L.R.: for L.R. > 0.6, # of votes >= 100
Graded by percentile cutoff into 5 grades: [90,100], [75,90), [25,75), [10,25), [0,10)
6163 samples from 846 movies
Trivia Grade | Frequency
4 (Very Interesting) | 706
3 (Interesting) | 1091
2 (Ambiguous) | 2880
1 (Boring) | 945
0 (Very Boring) | 541
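The filtering-and-grading pipeline above can be sketched in a few lines. The function names are illustrative; the thresholds (votes < 5, extra support for L.R. > 0.6, and the percentile cutoffs) come from the slides.

```python
# Sketch of the Filtering & Grading step: compute each trivia item's
# Likeness Ratio, drop low-support items, then map percentile ranks
# to the five grades described above. Function names are illustrative.

def likeness_ratio(interesting_votes, total_votes):
    """L.R. = # of interesting votes / # of total votes."""
    return interesting_votes / total_votes

def filter_trivia(items, min_votes=5, high_lr=0.6, high_lr_min_votes=100):
    """Keep (interesting, total) pairs with enough votes; demand extra
    support (>= 100 votes) when L.R. exceeds 0.6. Returns kept L.R.s."""
    kept = []
    for interesting, total in items:
        if total < min_votes:
            continue
        lr = likeness_ratio(interesting, total)
        if lr > high_lr and total < high_lr_min_votes:
            continue
        kept.append(lr)
    return kept

def grade(lrs):
    """Map each L.R. to a grade via percentile cutoffs:
    [90,100] -> 4, [75,90) -> 3, [25,75) -> 2, [10,25) -> 1, [0,10) -> 0."""
    ranked = sorted(lrs)
    n = len(ranked)
    grades = []
    for lr in lrs:
        # Percentile rank: fraction of items strictly below this L.R.
        pct = 100.0 * sum(1 for x in ranked if x < lr) / n
        if pct >= 90:
            g = 4
        elif pct >= 75:
            g = 3
        elif pct >= 25:
            g = 2
        elif pct >= 10:
            g = 1
        else:
            g = 0
        grades.append(g)
    return grades
```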
Feature Engineering
Bucket | Feature | Significance | Sample features | Example Trivia

Unigram (U) Features:
- Each word's TF-IDF | Identifies important words which make the trivia interesting | "stunt", "award", "improvise" | "Tom Cruise did all of his own stunt driving."

Linguistic (L) Features:
- Superlative words | Show extremeness (uniqueness) | "best", "longest", "first" | "The longest animated Disney film since Fantasia (1940)."
- Contradictory words | Opposing ideas can spark intrigue and interest | "but", "although", "unlike" | "The studios wanted Matthew McConaughey for the lead role, but James Cameron insisted on Leonardo DiCaprio."
- Root word (main verb) | Captures the core activity discussed in the sentence | root_gross | "Gravity grossed $274 Mn in North America"
- Subject word (first noun) | Captures the core thing discussed in the sentence | subj_actor | "The actors snorted crushed B vitamins for scenes involving cocaine"
- Readability | Complex and lengthy trivia are hardly interesting | FOG Index binned into 3 bins
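A minimal, stdlib-only sketch of the Linguistic (L) bucket above. The word lists and the 3-bin FOG cutoffs here are illustrative placeholders, not the paper's actual lexicons or bins; a real system would use a POS tagger for superlatives.

```python
# Sketch of Linguistic (L) features: superlative and contradictory word
# counts, plus a rough Gunning FOG readability bin. Word lists and bin
# cutoffs are illustrative, not the paper's.

SUPERLATIVES = {"best", "longest", "first", "most", "greatest", "highest"}
CONTRADICTORY = {"but", "although", "unlike", "however", "despite"}

def syllables(word):
    """Crude syllable count: number of vowel runs."""
    count, prev_vowel = 0, False
    for ch in word.lower():
        is_vowel = ch in "aeiouy"
        if is_vowel and not prev_vowel:
            count += 1
        prev_vowel = is_vowel
    return max(count, 1)

def linguistic_features(sentence):
    words = [w.strip(".,!?\"'()").lower() for w in sentence.split()]
    words = [w for w in words if w]
    if not words:
        return {"superlatives": 0, "contradictory": 0, "fog_bin": 0}
    complex_words = sum(1 for w in words if syllables(w) >= 3)
    # Gunning FOG for one sentence: 0.4 * (words/sentences + 100*complex/words)
    fog = 0.4 * (len(words) + 100.0 * complex_words / len(words))
    return {
        "superlatives": sum(1 for w in words if w in SUPERLATIVES),
        "contradictory": sum(1 for w in words if w in CONTRADICTORY),
        "fog_bin": 0 if fog < 8 else (1 if fog < 14 else 2),
    }
```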
Feature Engineering (Contd…)
Bucket | Feature | Significance | Sample features | Example Trivia

Entity (E) Features:
- Named entities (NER) | Generic NEs capture general about-ness | MONEY, ORGANIZATION, PERSON, DATE, TIME and LOCATION | "The guns in the film were supplied by Aldo Uberti Inc., a company in Italy." → ORGANIZATION and LOCATION
- Related entities | Capture specific about-ness (entities resolved using DBpedia) | entity_producer, entity_director | "According to Victoria Alonso, Rocket Raccoon and Groot were created through a mix of motion-capture and rotomation VFX." → entity_producer, entity_character
- Entity linking before (L) parsing | Captures the generalized story of the sentence | subj_entity_producer | [Same trivia as above] "According to entity_producer, …"; subj_Victoria → subj_entity_producer
- Focus entities | Capture the core entities being talked about | underroot_entity_producer | [Same trivia as above] underroot_entity_producer, underroot_entity_character
Domain Independence of Features
All the features are automatically generated and domain-independent
Entity features are automatically generated using attribute:value pairs from DBpedia
For a match of 'value' in a sentence, the match is replaced by entity_'attribute'
Unigram (U) and Linguistic (L) features are clearly domain-independent
[Figure: DBpedia (attribute:value) pairs for Batman Begins, with sample trivia for Batman Begins]
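The replacement rule above can be sketched as follows. `link_entities` is a hypothetical helper name, and matching longer values first (so a full name wins over a contained substring) is one reasonable design choice, not necessarily the paper's.

```python
# Sketch of Entity-feature generation: given DBpedia-style attribute:value
# pairs for the target entity, replace value matches in a sentence with
# entity_<attribute>. Longest values are matched first.

import re

def link_entities(sentence, attributes):
    """Replace occurrences of each attribute's value with entity_<attribute>."""
    # Sort by value length, descending, so longer matches are applied first.
    for attr, value in sorted(attributes.items(), key=lambda kv: -len(kv[1])):
        sentence = re.sub(re.escape(value), "entity_" + attr, sentence)
    return sentence
```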
Interestingness Ranking Model
Given facts (sentences) along with their interestingness grades, learn a model of interestingness that ranks sentences by how interesting they are
Use the RankSVM model
INPUT FOR TRAINING (MOVIE_ID | FEATURES | GRADE):
1 | 1:1 5:2 … | 4
1 | … | 2
1 | … | 1
2 | … | 4
2 | … | 3
2 | … | 1
2 | … | 1

MODEL BUILT (hyperplane) [image taken and modified from Wikipedia]

INPUT FOR RANKING (MOVIE_ID | FEATURES) with OUTPUT OF RANKING (SCORE):
1 | 1:1 5:2 … | 1.7
1 | … | 2.4
2 | … | 1.2
2 | … | 2.7
2 | … | 0.13
3 | … | 3.1
3 | … | 1.3
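SVMrank (Joachims' ranking SVM) reads training data as one line per example, with examples grouped by a query id; here each movie is a query, so the ranker learns only from grade differences among sentences of the same movie. A minimal serializer for the table above might look like this (the helper name is illustrative):

```python
# Sketch: serialize (movie_id, features, grade) rows into SVMrank's input
# format, "<grade> qid:<movie_id> <feature_index>:<value> ...".
# Feature indices must be emitted in increasing order.

def to_svmrank_lines(rows):
    """rows: list of (movie_id, {feature_index: value}, grade) tuples."""
    lines = []
    for movie_id, features, grade in rows:
        feats = " ".join("%d:%g" % (i, v) for i, v in sorted(features.items()))
        lines.append("%d qid:%d %s" % (grade, movie_id, feats))
    return lines
```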
Interestingness Model: Cross Validation Results
NDCG@10 by feature group:
Unigram (U): 0.934 | Linguistic (L): 0.919 | Entity (E): 0.929 | U+L: 0.9419 | U+E: 0.944 | WTM (U+L+E): 0.951
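NDCG@10, the metric reported above, rewards placing high-grade sentences near the top. A standard computation sketch:

```python
# NDCG@k: discounted cumulative gain of the predicted order, normalized
# by the DCG of the ideal (grade-sorted) order.

import math

def dcg(grades, k):
    return sum((2 ** g - 1) / math.log2(i + 2) for i, g in enumerate(grades[:k]))

def ndcg(predicted_grades, k):
    """predicted_grades: true grades listed in the order the model ranked them."""
    best = dcg(sorted(predicted_grades, reverse=True), k)
    return dcg(predicted_grades, k) / best if best > 0 else 0.0
```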
Interestingness Model: Feature Weights
Rank | Feature | Group
1 | subj_scene | Linguistic
2 | subj_entity_cast | Linguistic + Entity
3 | entity_produced_by | Entity
4 | underroot_unlinked_organization | Linguistic + Entity
6 | root_improvise | Linguistic
7 | entity_character | Entity
8 | MONEY | Entity (NER)
14 | stunt | Unigram
16 | superPOS | Linguistic
17 | subj_actor | Linguistic
Entity linking leads to better generalization; otherwise these features would have been subj_wolverine, etc.
[WTM architecture diagram, Retrieval Phase highlighted]
Retrieval Phase: Get Trivia from a Wikipedia Page
Candidate Selection: Sentence Extraction
Crawled only the text within paragraph tags <p>…</p>
Sentence detection took each sentence for further processing
Removed sentences with missing context
E.g. "It really reminds me of my childhood."
Co-reference resolution to find out-links to different sentences
Remove a sentence if its out-link is not the target entity
"Hanks revealed that he signed onto the film after an hour and a half of reading the script. He initially ..."
The first 'he' is not an out-link, and 'the film' points to the target entity; the second 'He' is an out-link. First sentence kept, second removed.
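A stdlib-only sketch of the extraction side of Candidate Selection: keep only `<p>` text, split it into sentences, and drop sentences that open with a bare pronoun. The pronoun check is a crude stand-in for the co-reference step above (the real system uses proper co-reference resolution), and the names here are illustrative.

```python
# Sketch of Candidate Selection: extract text inside <p>...</p> only,
# split into sentences, and drop likely missing-context sentences.

from html.parser import HTMLParser
import re

class ParagraphText(HTMLParser):
    """Collects only the text that appears inside <p> tags."""
    def __init__(self):
        super().__init__()
        self.in_p, self.chunks = False, []
    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self.in_p = True
    def handle_endtag(self, tag):
        if tag == "p":
            self.in_p = False
    def handle_data(self, data):
        if self.in_p:
            self.chunks.append(data)

PRONOUNS = {"it", "he", "she", "they", "this", "that"}

def candidates(html):
    parser = ParagraphText()
    parser.feed(html)
    text = " ".join(parser.chunks)
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    # Drop sentences that open with a bare pronoun (likely missing context).
    return [s for s in sentences if s.split()[0].lower() not in PRONOUNS]
```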
Evaluation Dataset
20 New Movie Pages from Wikipedia
No. of Sentences: 2928
No. of Positive Sentences: 791
Judged (crowd-sourced) by 5 judges
Two scale voting
Boring / Interesting
Majority voting for class rating
Statistically significant?
100 trivia items from IMDB were also judged by the same 5 judges
Mechanism I: majority voting of the IMDB crowd vs. Mechanism II: crowd-sourcing by 5 judges
Agreement between two mechanisms = Substantial (Kappa Value = 0.618)
Kappa | Agreement
< 0 | Less than chance agreement
0.01-0.20 | Slight agreement
0.21-0.40 | Fair agreement
0.41-0.60 | Moderate agreement
0.61-0.80 | Substantial agreement
0.81-0.99 | Almost perfect agreement
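The agreement value above is Cohen's kappa, which discounts agreement expected by chance. A minimal computation over two mechanisms' labels (1 = Interesting, 0 = Boring):

```python
# Cohen's kappa between two raters/mechanisms over the same items:
# kappa = (observed agreement - chance agreement) / (1 - chance agreement).

from collections import Counter

def cohens_kappa(a, b):
    n = len(a)
    po = sum(1 for x, y in zip(a, b) if x == y) / n           # observed agreement
    ca, cb = Counter(a), Counter(b)
    pe = sum(ca[l] * cb[l] for l in set(a) | set(b)) / n ** 2  # chance agreement
    return (po - pe) / (1 - pe)
```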
Comparative Baselines
I. Random [Baseline I]:
- 10 sentences picked randomly from Wikipedia
II. CS + Random:
- Candidates selected (removes sentences like "It really reminds me of my childhood.")
III. CS + supPOS(Best) [Baseline II]:
- Candidates Selected
- Ranked by No. of Superlative Words
supPOS (Best Case):
Rank | # of sup. words | Class
1 | 2 | Interesting
2 | 2 | Boring
3 | 1 | Interesting
4 | 1 | Interesting
5 | 1 | Interesting
6 | 1 | Boring
7 | 1 | Boring
Results: Precision@10
P@10 by approach: Random 0.25 | CS+Random 0.30 | supPOS (Best Case) 0.34 | WTM (U) 0.34 | WTM (U+L+E) 0.45
CS+Random > Random
Shows significance of Candidate Selection
WTM (U+L+E) >> WTM (U)
Shows significance of Engineered Linguistic (L) and Entity (E) Features
Results: Recall@K
supPOS is limited to one kind of trivia
WTM captures varied types
62% recall by rank 25
Performance Comparison
supPOS is better till rank 3
Soon after rank 3, WTM beats supPOS
[Line chart: % Recall (0-70) vs. Rank (0-25) for supPOS (Best Case), WTM, and Random]
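Precision@10 and Recall@K, as reported in the results above, reduce to simple counts over the ranked, crowd-judged list. A sketch, with relevance meaning a sentence judged Interesting:

```python
# Precision@K: fraction of the top-K ranked sentences that are relevant.
# Recall@K: fraction of all relevant sentences that appear in the top K.

def precision_at_k(relevant_flags, k):
    """relevant_flags: booleans in ranked order (True = judged Interesting)."""
    return sum(relevant_flags[:k]) / k

def recall_at_k(relevant_flags, k):
    total = sum(relevant_flags)
    return sum(relevant_flags[:k]) / total if total else 0.0
```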
Qualitative Analysis
Result | Movie | Trivia | Description
WTM wins (supPOS misses) | Interstellar (2014) | "Paramount is providing a virtual reality walkthrough of the Endurance spacecraft using Oculus Rift technology." | Due to Organization being the subject, plus (U) features (technology, reality, virtual)
WTM wins (supPOS misses) | Gravity (2013) | "When the script was finalized, Cuarón assumed it would take about a year to complete the film, but it took four and a half years." | Due to Entity.Director, subject (the script), root word (assume) and (U) features (film, years)
WTM's bad | Elf (2003) | "Stop motion animation was also used." | Candidate Selection failed
WTM's bad | Rio 2 (2014) | "Rio 2 received mixed reviews from critics." | Root verb "receive" has high weight in the model
Qualitative Analysis (Contd.)
Result | Movie | Trivia | Description
supPOS wins (WTM misses) | The Incredibles (2004) | "Humans are widely considered to be the most difficult thing to execute in animation." | Presence of 'most', absence of any entity, vague root word (consider)
supPOS's bad | Lone Survivor (2013) | "Most critics praised Berg's direction, as well as the acting, story, visuals and battle sequences." | Here 'most' does not show degree but generality
Our Contributions
Introduced a novel research problem
Mining Interesting Facts for Entities from Unstructured Text
Proposed a novel approach “Wikipedia Trivia Miner (WTM)”
Mines the top-k interesting trivia for movie entities, ranked by their interestingness
For movie entities, we leverage already-available user-generated trivia data from IMDB to learn interestingness
All the data and code used in this paper have been made publicly available for research purposes at https://github.com/abhayprakash/WikipediaTriviaMiner_SharedResources/
Acknowledgements
The first author's travel was supported by travel grants from Xerox Research Centre India, IIT Roorkee, IJCAI and Microsoft Research India.