Mining temporal footprints from Wikipedia

[email protected], School of Computer Science. Dublin, 23/08/2014. Presentation, 1st AHA! Workshop, COLING 2014. Michele Filannino, Goran Nenadic

description

Discovery of temporal information is key for organising knowledge, and therefore the task of extracting and representing temporal information from texts has received increasing interest. In this paper we focus on the discovery of temporal footprints from encyclopaedic descriptions. Temporal footprints are time-line periods that are associated with the existence of specific concepts. Our approach relies on the extraction of date mentions and the prediction of lower and upper boundaries that define temporal footprints. We report on several experiments on persons’ pages from Wikipedia in order to illustrate the feasibility of the proposed methods.

Transcript of Mining temporal footprints from Wikipedia

Page 1: Mining temporal footprints from Wikipedia

[email protected], School of Computer Science

Dublin, 23/08/2014

presentation 1st AHA! Workshop, COLING 2014

Mining temporal footprints from Wikipedia

Michele Filannino, Goran Nenadic

Page 2: Mining temporal footprints from Wikipedia

introduction

■ Temporal information is crucial for organising structured and unstructured data
■ Several temporal information extraction (TIE) systems are nowadays available
● thanks to TempEval challenge series

Page 3: Mining temporal footprints from Wikipedia

ManTIME

URL: http://www.cs.man.ac.uk/~filannim/mantime.html

Page 4: Mining temporal footprints from Wikipedia

Test with long text

Page 5: Mining temporal footprints from Wikipedia

temporal footprint

A temporal footprint is a continuous period on the time-line that temporally defines the existence of a particular concept.

Immanuel Kant, Paul Guyer, and Allen W Wood. 1998. Critique of pure reason. Cambridge University Press.

Page 6: Mining temporal footprints from Wikipedia

problem

■ input: textual description of a concept
■ output: prediction of a temporal interval

Can we predict temporal footprints from encyclopaedic descriptions of concepts?

Page 7: Mining temporal footprints from Wikipedia

Examples of temporal footprints

[Figure: temporal footprints drawn on a time-line from 1000 to 2000, with a legend distinguishing objects, persons and historical periods.]

Object: Web, Cellphone, Computer, Car, Bicycle, Arming sword
Person: Richard Feynman, Carl Friedrich Gauss, Galileo Galilei, Leonardo da Vinci, Christopher Columbus, Genghis Khan
Historical period: French Revolution, Age of Enlightenment, Renaissance, High Middle Ages

Page 12: Mining temporal footprints from Wikipedia

methodology

1. date mention extraction
2. outlier filtering
3. normal distribution fitting
4. prediction

Page 13: Mining temporal footprints from Wikipedia

date mentions extraction

[Figure: histogram of the extracted date mentions; x-axis: time (in years), 1360 to 1810; y-axis: freq, 0.000 to 0.050.]
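The extraction step can be sketched with a regular expression, in the spirit of the RegEx strategy listed later in the deck. This is a minimal sketch under assumptions: the pattern `YEAR_PATTERN` and the helper names `extract_years` and `year_frequencies` are hypothetical, and the pattern only covers four-digit years from 1000 to 2099, not full date expressions.

```python
import re
from collections import Counter

# Hypothetical pattern (an assumption, not the authors' expression):
# standalone four-digit numbers between 1000 and 2099.
YEAR_PATTERN = re.compile(r"\b(1\d{3}|20\d{2})\b")


def extract_years(text):
    """Return every year-like mention in the text, in order of appearance."""
    return [int(match) for match in YEAR_PATTERN.findall(text)]


def year_frequencies(years):
    """Relative frequency of each year, i.e. the values a histogram would plot."""
    counts = Counter(years)
    total = sum(counts.values())
    return {year: count / total for year, count in sorted(counts.items())}


text = ("Galileo Galilei (1564-1642) was born in Pisa. "
        "In 1610 he published Sidereus Nuncius.")
years = extract_years(text)
frequencies = year_frequencies(years)
```

A full temporal tagger such as HeidelTime would replace `extract_years` in the last strategy; the rest of the pipeline only needs the list of years.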

Page 14: Mining temporal footprints from Wikipedia

outlier filtering

Gamma parameter controls the outlier region’s boundaries.

[Figure: two histograms of the date mentions, with the γ param. marked; x-axis: time (in years), 1360 to 1810; y-axis: freq, 0.000 to 0.050.]
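The slide does not state how the outlier region is computed, only that gamma sets its boundaries. The sketch below therefore swaps in a common robust rule as an assumption: a year is kept when it lies within `gamma` times the median absolute deviation (MAD) of the median year. The function name `filter_outliers` and the default `gamma=3.0` are hypothetical.

```python
from statistics import median


def filter_outliers(years, gamma=3.0):
    """Keep years within gamma * MAD of the median.

    Assumed rule for illustration only; the authors' actual criterion
    for the outlier region is not given on the slide.
    """
    if not years:
        return []
    centre = median(years)
    mad = median(abs(year - centre) for year in years)
    if mad == 0:
        # Spread cannot be estimated, so nothing is treated as an outlier.
        return list(years)
    return [year for year in years if abs(year - centre) <= gamma * mad]


mentions = [1564, 1581, 1589, 1592, 1610, 1616, 1633, 1642, 2014]
kept = filter_outliers(mentions, gamma=3.0)
```

A larger `gamma` widens the accepted region and keeps more distant mentions; a smaller one is stricter.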

Page 15: Mining temporal footprints from Wikipedia

normal distribution fitting

Alpha and Beta parameters control the size and offset of the gaussian bell.

[Figure: Gaussian bell fitted to the date-mention histogram, with the α param. marked; x-axis: time (in years), 1360 to 1810; y-axis: freq, 0.000 to 0.050.]

Page 17: Mining temporal footprints from Wikipedia

normal distribution fitting

[Figure: Gaussian bell fitted to the date-mention histogram, with the β param. marked; x-axis: time (in years), 1360 to 1810; y-axis: freq, 0.000 to 0.050.]
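One way to read the fitting and prediction steps is sketched below. It assumes that alpha scales the width of the bell in units of the standard deviation and that beta shifts its centre in years; the slide only says the two parameters control size and offset, so this mapping, the function name `predict_footprint` and the default values are assumptions.

```python
from statistics import mean, pstdev


def predict_footprint(years, alpha=1.0, beta=0.0):
    """Predict (lower, upper) bounds from a normal fit of the year mentions.

    Assumed reading of the slide: alpha scales the half-width in standard
    deviations, beta offsets the centre in years.
    """
    mu = mean(years)          # centre of the fitted bell
    sigma = pstdev(years)     # spread of the fitted bell
    centre = mu + beta
    half_width = alpha * sigma
    return round(centre - half_width), round(centre + half_width)


lower, upper = predict_footprint([1600, 1610, 1620], alpha=2.0)
shifted = predict_footprint([1600, 1610, 1620], alpha=2.0, beta=10.0)
```

With this reading, tuning alpha trades off intervals that are too narrow against intervals that are too wide, while beta corrects a systematic shift between the mentions and the true period.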

Page 19: Mining temporal footprints from Wikipedia

error measure

[Figure: the gold and prediction intervals on a time-line, with their overlap and union marked.]

Fatima De Carvalho. 1996. Histogrammes et indices de proximité en analyse données symboliques. Actes de l’école d’été sur l’analyse des données symboliques. LISE-CEREMADE, Université de Paris IX Dauphine, pages 101–127.
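The figure labels suggest an error of the form 1 - overlap / union between the gold and predicted intervals. The slide does not spell out the formula, so this is an inferred sketch, but it reproduces the E values reported on the results slides (0.204 for Galileo Galilei, 0.159 for Robin Williams, 0.92 for Christopher Columbus). The function name `distance_error` is hypothetical.

```python
def distance_error(gold, prediction):
    """Interval distance: 1 - overlap / union, in [0, 1].

    0 means the predicted interval matches the gold one exactly;
    1 means the two intervals do not overlap at all.
    """
    gold_start, gold_end = gold
    pred_start, pred_end = prediction
    overlap = max(0, min(gold_end, pred_end) - max(gold_start, pred_start))
    # Span covered by both intervals; equals the union whenever they overlap.
    union = max(gold_end, pred_end) - min(gold_start, pred_start)
    if union == 0:
        return 0.0  # both intervals are the same single point
    return 1 - overlap / union


galileo = distance_error((1564, 1642), (1556, 1654))
williams = distance_error((1951, 2014), (1953, 2006))
columbus = distance_error((1451, 1506), (1366, 2057))
```

For Galileo the overlap is 78 years and the union 98 years, giving 1 - 78/98, which rounds to 0.204.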

Page 21: Mining temporal footprints from Wikipedia

strategies

A. RegEx
B. RegEx + Filtering
C. RegEx + Filtering + Gaussian fitting
D. HeidelTime + Filtering + Gaussian fitting

Page 22: Mining temporal footprints from Wikipedia

evaluation

■ subject: people
■ lived from 1000 AD to 2014
● text from Wikipedia web pages
● year of birth and death from DBpedia
■ 228,824 people collected
■ simple definition of temporal footprint
● birth and death dates

Page 23: Mining temporal footprints from Wikipedia

People per textual length

[Figure: histogram of the number of people (#people, 0 to 500) by textual length (#words, 0 to 3750).]

Page 24: Mining temporal footprints from Wikipedia

aggregate results

Strategy                                     Mean Distance Error   Standard Deviation
RegEx                                        0.2636                0.3409
RegEx + Filtering                            0.2596                0.3090
RegEx + Filtering + Gaussian fitting         0.3503                0.2430
HeidelTime + Filtering + Gaussian fitting    0.5980                0.2470

Page 25: Mining temporal footprints from Wikipedia

results

[Figure: MDE (0.0 to 1.0) against #words (1112 to 27804) for the four strategies: RegEx; RegEx + Filtering; RegEx + Filtering + Gaussian fitting; HeidelTime + Filtering + Gaussian fitting.]

Page 27: Mining temporal footprints from Wikipedia

results

■ Galileo Galilei (1564-1642), prediction: 1556-1654, E: 0.204

Page 28: Mining temporal footprints from Wikipedia

results

■ Robin Williams (1951-2014), prediction: 1953-2006, E: 0.159

Page 29: Mining temporal footprints from Wikipedia

other types of temporal footprint?

■ Christopher Columbus will die in 2057 ?!

Prediction: 1366-2057 (gold: 1451-1506), E: 0.92

Page 33: Mining temporal footprints from Wikipedia

physical existence vs. social coverage

■ Anne Frank’s footprint is shifted into the future

Page 36: Mining temporal footprints from Wikipedia

conclusions

■ how does the methodology behave on different languages? how on different sources?
■ oracle-like side-effect behaviour:
• Apple Inc. will be closed down this year
• Stanford University will be closed down in 2029
■ Future work
• mixture of normal distributions

Page 37: Mining temporal footprints from Wikipedia

Thank you.

Page 38: Mining temporal footprints from Wikipedia

Questions?

Contact: [email protected]
Visit: tinyurl.com/temporal-footprints