Annotating streams of heterogeneous data for topic generation
Giuseppe Rizzo
@giusepperizzo
February 6, 2013, VU University Amsterdam, NL
Spotting entities while reading a document
➢ Names of people, locations, organizations, etc.
➢ Named entities are fundamental keys for topic understanding
➢ But the same location name can refer to different places
source: http://goo.gl/kVzlK
A Web of Linked Entities
➢ GGG (Giant Global Graph) http://goo.gl/fH3h
➢ Nodes are Web entities
➢ Entities provide disambiguation pointers
➢ Entities can be referred to unambiguously (disambiguated)
➢ Entities as centroids for topic generation and understanding
source: http://wole2013.eurecom.fr
source: http://wole2012.eurecom.fr
Entity extractors
[Figure labels: Web API, disambiguation, URI]
Diversity
Extractor | Language | Granularity | Entity position | Classification schema | Number of classes | Response format | Quota (calls/day)
AlchemyAPI | EN, FR, DE, IT, PT, RU, SP, SW | OEN | N/A | Alchemy | 324 | JSON, MicroFormat, XML, RDF | 30000
DBpedia Spotlight | EN | OEN | char offset | DBpedia, FreeBase, Schema.org | 320 | HTML, JSON, RDF, XML | unlimited
Extractiv | EN | OEN | word offset | Extractiv | 34 | HTML, JSON, RDF, XML | 3000
Lupedia | EN, FR, IT | OEN | range of chars | DBpedia, LinkedMDB | 319 | HTML, JSON, RDFa, XML | unlimited
OpenCalais | EN, FR, SP | OEN | char offset | OpenCalais | 95 | JSON, MicroFormat | 50000
Saplo | EN, SW | OED | N/A | Saplo | 5 | JSON | 1333
SemiTags | DE, NL | OED | char offset | CoNLL-3 | 4 | XML | unlimited
Wikimeta | EN, FR, SP | OEN | POS offset | ESTER | 7 | JSON, XML | unlimited
Yahoo! | EN | OEN | range of chars | Yahoo | 13 | JSON, XML | 5000
Zemanta | EN | OED | N/A | FreeBase | 81 | XML, JSON, RDF | 10000
Harmonizing annotations
http://nerd.eurecom.fr
ontology [1]
REST API [2]
UI [3]
[1] http://nerd.eurecom.fr/ontology
[2] http://nerd.eurecom.fr/api/application.wadl
[3] http://nerd.eurecom.fr
NERD Ontology
NERD type Occurrence
Person 10
Organization 10
Country 6
Company 6
Location 6
Continent 5
City 5
RadioStation 5
Album 5
Product 5
... ...
The NERD ontology has been integrated into the NIF project, developed in the context of the EU FP7 project LOD2: Creating Knowledge out of Interlinked Data
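One way to picture the alignment of extractor-specific types to a shared ontology is a lookup table. The sketch below is only illustrative: the `ALIGNMENT` entries, the extractor-side labels, the `to_nerd_type` helper and the `Thing` fallback are assumptions made for the example, not the actual NERD mappings; only the target type names come from the table above.

```python
# Hypothetical alignment table: (extractor, extractor-specific type) -> shared type.
# The extractor-side labels are illustrative, not taken from the real services.
ALIGNMENT = {
    ("alchemyapi", "City"): "City",
    ("alchemyapi", "Company"): "Company",
    ("opencalais", "Person"): "Person",
    ("opencalais", "Country"): "Country",
    ("wikimeta", "ORG"): "Organization",
    ("wikimeta", "LOC"): "Location",
}

FALLBACK_TYPE = "Thing"  # assumed generic type for labels without a mapping


def to_nerd_type(extractor: str, native_type: str) -> str:
    """Return the shared type for an extractor-specific label."""
    return ALIGNMENT.get((extractor.lower(), native_type), FALLBACK_TYPE)
```

With such a table, annotations coming from different services can be compared on the same set of types before any fusion step.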
ETAPE2012
➢ DGA (French radio transcripts)
– Train: 7h 50m
– Dev: 3h
– Eval: 3h
➢ ELDA (French TV transcripts)
– Train: 18h 10m
– Dev: 7h 55m
– Eval: 7h 55m
➢ Annotation schema Quaero: 32 classes
We can do better: combined
(e_A1, t_A1, URI_A1, si_A1, ei_A1)
(e_A2, t_A2, URI_A2, si_A2, ei_A2)
(e_A3, t_A3, URI_A3, si_A3, ei_A3)
...
(e_N1, t_N1, URI_N1, si_N1, ei_N1)
(e_N2, t_N2, URI_N2, si_N2, ei_N2)
extraction → cleaning → fusion
Fusion: when at least 2 extractors classify the same entity with a different type, we apply a preferred selection order (learning rules): Wikimeta, AlchemyAPI, OpenCalais, Lupedia
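The preferred selection order can be sketched as a small fusion function. This is a minimal illustration, not the actual NERD implementation: the `Annotation` record, the reading of `si`/`ei` as start and end indices, and the choice to group annotations by identical span are all assumptions. The sketch always keeps the annotation from the highest-ranked extractor for a span, which gives the same type as the rule above whenever the extractors already agree.

```python
from typing import NamedTuple, Optional


class Annotation(NamedTuple):
    extractor: str      # service that produced the annotation
    entity: str         # surface form (e)
    type: str           # assigned type (t)
    uri: Optional[str]  # disambiguation URI, if any
    start: int          # assumed meaning of si
    end: int            # assumed meaning of ei


# Preferred selection order from the slide, highest priority first.
PRIORITY = ["wikimeta", "alchemyapi", "opencalais", "lupedia"]


def _rank(extractor: str) -> int:
    """Position in the priority list; unknown extractors rank last."""
    name = extractor.lower()
    return PRIORITY.index(name) if name in PRIORITY else len(PRIORITY)


def fuse(annotations):
    """Keep one annotation per text span, chosen by extractor priority."""
    best = {}
    for ann in annotations:
        key = (ann.start, ann.end)
        current = best.get(key)
        if current is None or _rank(ann.extractor) < _rank(current.extractor):
            best[key] = ann
    return sorted(best.values(), key=lambda a: a.start)
```

Overlapping but non-identical spans are not merged here; in the pipeline above that would presumably be the job of the cleaning step.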
ETAPE2012
… but it introduced systematic errors
Extractor         SER       prec      recall    F1        %correct
alchemyapi        37.71%    47.95%    5.45%     9.68%     5.45%
lupedia           39.49%    22.87%    1.56%     2.91%     1.56%
opencalais        37.47%    41.69%    3.53%     6.49%     3.53%
wikimeta          36.67%    19.40%    4.25%     6.95%     4.25%
combined (nerd)   86.85%    35.31%    17.69%    23.44%    17.69%
SER: Slot Error Rate
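For readers less familiar with these measures, the sketch below shows how they are commonly derived from slot counts. It assumes the standard definitions of precision, recall and F1, and the unweighted form of the slot error rate (substitutions, deletions and insertions over the number of reference slots); the exact weighting used in the ETAPE evaluation may differ, and the function names are illustrative.

```python
def precision_recall_f1(correct: int, hypothesized: int, reference: int):
    """Precision, recall and F1 from slot counts."""
    precision = correct / hypothesized if hypothesized else 0.0
    recall = correct / reference if reference else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


def slot_error_rate(substitutions: int, deletions: int,
                    insertions: int, reference: int) -> float:
    """Unweighted slot error rate; `reference` must be positive."""
    return (substitutions + deletions + insertions) / reference
```

Because insertions count as errors but do not increase the number of reference slots, this rate can exceed 100% when a system produces many spurious annotations.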
ETAPE2012
Gazetteers: combined+
(e_A1, t_A1, URI_A1, si_A1, ei_A1)
(e_A2, t_A2, URI_A2, si_A2, ei_A2)
...
(e_N1, t_N1, URI_N1, si_N1, ei_N1)
[Figure labels: learned model, created static rules, fusion, POS tagger, apply rules]
Conflicts handled by priority selection: own, Wikimeta, AlchemyAPI, OpenCalais, Lupedia
(e_1, t_1, URI_1, si_1, ei_1)
ETAPE2012
Over-estimated training model
Run          SER        prec      recall    F1        %correct
combined     86.85%     35.31%    17.69%    23.44%    17.69%
combined+    188.81%    15.13%    28.40%    19.45%    28.40%
SER: Slot Error Rate
ETAPE2012
General NER limitations
➢ Performance drops
– with common settings using off-the-shelf models, when annotating corpora that differ from the data the model was trained on (empirically, recall drops by ~20%)
– with noisy texts such as transcripts and microposts
➢ Lack of knowledge for particular categories, in particular Event
Participation at the #MSM2013 challenge
➢ English Twitter posts
– Train: 2815 posts
– Eval: 1526 posts
➢ Annotation schema: 4 classes
➢ Objective: perform better than the Stanford CRF, properly trained with the challenge settings
Class   prec      recall    F1
LOC     80.12%    57.76%    67.63%
MISC    68.18%    31.51%    43.10%
ORG     83.28%    50.71%    63.04%
PER     79.93%    70.72%    75.04%
4-fold cross validation over the training set: provisional results of the Stanford CRF
ongoing
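The 4-fold cross validation over the training posts can be pictured with a simple index split. This is a sketch under assumptions: it uses contiguous folds without shuffling or stratification by class, and the slides do not say how the folds were actually drawn. The `k_fold_splits` name is illustrative.

```python
def k_fold_splits(n_items: int, k: int = 4):
    """Yield (train_indices, test_indices) for each of k contiguous folds."""
    # Spread the remainder over the first folds so sizes differ by at most one.
    fold_sizes = [n_items // k + (1 if i < n_items % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_items))
        yield train, test
        start += size
```

Each post is used for testing exactly once, and the per-class scores are then averaged over the folds.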
Poor performance in spotting events
➢ Exploit large domain knowledge represented by the Eventmedia dataset [1]
➢ EventSpotter
– Entities classified according to the LODE ontology
– Spotting according to the event name, agents, temporal and geospatial information
– Confidence computed according to the similarity between the text surrounding the spotted entity and the event description
– Disambiguation provided by the event URIs (nodes of the Eventmedia graph)
[1] http://eventmedia.eurecom.fr/sparql
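The confidence based on text similarity could look like the following sketch. The slides do not name the similarity measure, so cosine similarity over bag-of-words counts is an assumption here, as are the tokenizer and the function names.

```python
import math
import re
from collections import Counter


def bag_of_words(text: str) -> Counter:
    """Lowercased word counts; a deliberately simple tokenizer."""
    return Counter(re.findall(r"\w+", text.lower()))


def cosine_similarity(a: Counter, b: Counter) -> float:
    """Cosine of the angle between two word-count vectors."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def confidence(context: str, event_description: str) -> float:
    """Score in [0, 1]: how closely the context matches the event description."""
    return cosine_similarity(bag_of_words(context), bag_of_words(event_description))
```

A spotted mention whose surrounding text shares no words with the event description gets a score of 0, while identical texts get a score of 1.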
Entities for concept mining
➢ Used to annotate textual data
– news articles, and ...
➢ Video transcripts:
– video segmentation (MediaFragment)
– MediaFragment annotation
– indexing
– topic model generation
➢ Microposts:
– text understanding
– topic model generation
Media Fragment Enricher
source: http://goo.gl/BMZK3
Joint work between the University of Southampton and EURECOM
Annotating social streams
➢ Live and fresh breaking news: microposts
➢ Media items, such as pictures and videos, are usually attached to microposts
➢ Grouping microposts:
– Entity labels
– Entity classes
– Latent Dirichlet allocation (LDA)
– Density-based micropost proximity (text similarity, entity similarity, temporal distance)
➢ Create textual storyboards from vox populi
➢ Visually describe the created storyboards
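The proximity between two microposts, combining text similarity, entity similarity and temporal distance, can be sketched as a weighted score. Everything specific below is an assumption made for illustration: the `Micropost` record, Jaccard overlap as the similarity measure, the one-hour time window and the weights. A density-based clustering algorithm could then use `1 - proximity` as its distance.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass
class Micropost:
    text: str
    entities: frozenset
    posted_at: datetime


def jaccard(a: set, b: set) -> float:
    """Overlap of two sets; two empty sets are treated as dissimilar."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def proximity(p: Micropost, q: Micropost,
              time_window_s: float = 3600.0,
              weights=(0.4, 0.4, 0.2)) -> float:
    """Weighted mix of text, entity and temporal closeness, in [0, 1]."""
    text_sim = jaccard(set(p.text.lower().split()), set(q.text.lower().split()))
    entity_sim = jaccard(set(p.entities), set(q.entities))
    gap = abs((p.posted_at - q.posted_at).total_seconds())
    time_sim = max(0.0, 1.0 - gap / time_window_s)  # linear decay to 0
    w_text, w_entity, w_time = weights
    return w_text * text_sim + w_entity * entity_sim + w_time * time_sim
```

Posts that share words and entities and are close in time score near 1; unrelated posts far apart in time score 0.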
Centroids for topic generation
➢ Each cloud represents a topic
➢ A topic is depicted by an entity
➢ Leaves are media items, which visually represent the microposts
➢ Each leaf can belong to many topics
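The structure described above, with entities as topic centroids and media items as leaves that may belong to several topics, maps naturally onto a dictionary of sets. The sketch below assumes each micropost is given as a pair of entity labels and attached media items; the `build_topics` name and the input shape are illustrative.

```python
from collections import defaultdict


def build_topics(microposts):
    """Map each entity label to the media items of the microposts mentioning it.

    `microposts` is an iterable of (entities, media_items) pairs.
    """
    topics = defaultdict(set)
    for entities, media_items in microposts:
        for entity in entities:
            topics[entity].update(media_items)
    return dict(topics)
```

A media item attached to a micropost that mentions two entities ends up as a leaf of both topics.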
Topic storyboard
➢ Visual summary of the topic
➢ Topic is labelled with an entity
➢ A poster picture is displayed according to the relevance of the micropost in the generated topic
➢ If the entity points to a LOD resource, a textual description is displayed
Outlook
➢ Modelling heterogeneous data with entities
➢ Linking entities according to the topics extracted from the text
➢ Enhancing topic modelling with the GGG
➢ Providing visual storyboards tailored to the extracted topics
Thanks for your time and attention
http://www.slideshare.net/giusepperizzo
Agenda:
– Web of Linked Entities
– Aligning annotations
– Combining performances of 3rd-party entity extractors
– Spotting events
– Annotating media fragments (MFs) and microposts for topic generation
– Topic storyboard generation