Connecting political data to media data
Laura Hollink
VU University Amsterdam, Web & Media group
ASCoR Spring Colloquium ‘Big Data at the University of Amsterdam’, February 18, 2014
Laura Hollink, Damir Juric, Geert-Jan Houben
Martijn Kleppe, Max Kemman, Henri Beunders
Johan Oomen, Jaap Blom
Funded by Clarin-NL
Questions we want to answer
• Which events have attracted a lot of media attention?
• What are the differences between different media? E.g. in different newspapers, or newspapers vs. radio bulletins?
• Has the coverage changed over time?
• How are the events visualized (photos, layout of newspaper, etc.)?
Transcriptions of all 9,294 meetings of the Dutch parliament between 1945 and 1995, consisting of 1,208,903 speeches.
Archives of hundreds of newspapers, amounting to tens of millions of articles, published between 1618 and 1995.
(We only use 1945-1995)
Roughly 1.8 million news bulletins between 1937 and 1984.
(We only use 1945-1995)
PoliMedia methods
Step 1: Translate the Dutch parliamentary debates to the standard structured web format RDF
[Figure: RDF graph of the debate nl.proc.sgd.d.194519460000002 (rdf:type Debate), built from XML by the War in Parliament project. The graph shows the debate metadata with links to the original sources, the part-of structure and chronological order of speeches, and the speaker of a speech with his role and party. The politician node has a dc:source link to http://resolver.politicalmashup.nl/nl.m.00064.]
Modeling the debates as events
• An event has a date, a location, actors, and possibly sub-events.
• We build on the Simple Event Model (SEM).
• Links to the original sources
• Reusing existing vocabularies
[Figure: metadata of the debate nl.proc.sgd.d.194519460000002, of rdf:type Debate. Properties shown: dc:id, dc:source (twice), dc:publisher, dc:language, dc:date, dc:title. Values shown: nl.proc.sgd.d.19720000002; http://resolver.politicalmashup.nl/nl.proc.sgd.d.194519460000002; http://statengeneraaldigitaal.nl/; http://resolver.kb.nl/resolve?urn=sgd:mpeg21:19451946:0000002:pdf; "Handelingen Verenigde Vergadering..."; Dutch; 1945-11-20.]
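The event view described above (a date, a location, actors, and possibly sub-events) can be sketched as a small recursive data structure. This is only an illustration of the idea, not the Simple Event Model itself: the class, its field names, and the method are assumptions, and the identifier Speaker_00064 is as read from the diagram.

```python
from dataclasses import dataclass, field
from datetime import date
from typing import List, Optional


@dataclass
class Event:
    """An event with a date, a place, actors, and possibly sub-events."""
    identifier: str
    event_type: str
    when: Optional[date] = None
    place: Optional[str] = None          # not given on the slides, left empty
    actors: List[str] = field(default_factory=list)
    sub_events: List["Event"] = field(default_factory=list)

    def all_actors(self) -> List[str]:
        """Actors of this event and, recursively, of all its sub-events."""
        names = list(self.actors)
        for sub in self.sub_events:
            names.extend(sub.all_actors())
        return names


# A debate containing a part-of-debate, which contains one speech.
speech = Event("nl.proc.sgd.d.194519460000002.1.2", "Speech",
               actors=["Speaker_00064"])
part = Event("nl.proc.sgd.d.194519460000002.1", "PartOfDebate",
             sub_events=[speech])
debate = Event("nl.proc.sgd.d.194519460000002", "Debate",
               when=date(1945, 11, 20), sub_events=[part])

print(debate.all_actors())
```

Nesting sub-events this way means that a question asked about a whole debate, such as who spoke in it, can be answered by walking down to the individual speeches.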
• The part-of structure and chronological order of the debates.
[Figure: the debate nl.proc.sgd.d.194519460000002 (dc:title "Handelingen Verenigde Vergadering...") hasPart nl.proc.sgd.d.194519460000002.1 (rdf:type PartOfDebate), which is linked by hasSubsequentPartOfDebate to nl.proc.sgd.d.194519460000002.2. The part of the debate in turn hasPart nl.proc.sgd.d.194519460000002.1.1 (rdf:type DebateContext, hasText "De voorzitter opent de vergadering…") and nl.proc.sgd.d.194519460000002.1.2 (rdf:type Speech, hasSpokenText "Mijnheer de Voorzitter, de Commissie van …"). The speech is linked by hasSubsequentSpeech to nl.proc.sgd.d.194519460000002.1.3.]
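The part-of structure and the chronological order can be sketched as plain subject-predicate-object triples. This is a minimal illustration, not the project's conversion code: the identifiers and property names come from the slide, the arrangement of edges is reconstructed from the diagram and may differ in detail, and the helper functions and the list-of-tuples storage are assumptions.

```python
# Triples (subject, predicate, object) with identifiers and property names
# from the slide. Plain tuples are an assumption for illustration; a real
# setup would use an RDF library and full URIs.
DEBATE = "nl.proc.sgd.d.194519460000002"

TRIPLES = [
    (DEBATE, "rdf:type", "Debate"),
    (DEBATE, "hasPart", DEBATE + ".1"),
    (DEBATE + ".1", "rdf:type", "PartOfDebate"),
    (DEBATE + ".1", "hasSubsequentPartOfDebate", DEBATE + ".2"),
    (DEBATE + ".1", "hasPart", DEBATE + ".1.1"),
    (DEBATE + ".1", "hasPart", DEBATE + ".1.2"),
    (DEBATE + ".1.1", "rdf:type", "DebateContext"),
    (DEBATE + ".1.2", "rdf:type", "Speech"),
    (DEBATE + ".1.2", "hasSubsequentSpeech", DEBATE + ".1.3"),
]


def objects(subject, predicate, triples=TRIPLES):
    """Return all objects for a given subject and predicate."""
    return [o for s, p, o in triples if s == subject and p == predicate]


def descendants(node, triples=TRIPLES):
    """Collect everything reachable from `node` via hasPart, depth-first."""
    found = []
    for part in objects(node, "hasPart", triples):
        found.append(part)
        found.extend(descendants(part, triples))
    return found


def chain(start, predicate, triples=TRIPLES):
    """Follow a 'subsequent' predicate from `start` in chronological order."""
    order = [start]
    seen = {start}
    while True:
        nxt = objects(order[-1], predicate, triples)
        if not nxt or nxt[0] in seen:  # stop at the end or on a cycle
            return order
        order.append(nxt[0])
        seen.add(nxt[0])


print(descendants(DEBATE))
print(chain(DEBATE + ".1.2", "hasSubsequentSpeech"))
```

Keeping hasPart and the "subsequent" links as separate predicates lets the hierarchy and the speaking order be queried independently.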
• The different roles and parties that a speaker can have in his/her career.
[Figure: the speech nl.proc.sgd.d.194519460000002.1.2 (rdf:type Speech) with its text (hasSpokenText "Mijnheer de Voorzitter, de Commissie van …"), a coveredIn link to http://resolver.kb.nl/resolve?urn=ddd:011198136:mpeg21:a0525:ocr, and a sem:hasActor link to Speaker_00064. Further properties shown: hasSpeaker, hasParty, hasRole, hasAcronym, hasFullName, foaf:firstName, foaf:lastName, rdfs:label. Further values shown: Politician, Joannes Antonius James, Barge, member_of_parliament, Party_kvp, Party, KVP, Katholieke Volkspartij.]
Step 2: Linking speeches in the debate to the newspaper articles that cover them
We created a linking method to deal with our two challenges:
1. How to link documents that are so different in nature?
2. Can we use the structure of the debates: people, chronological order of speeches, introductions to each new topic, etc.?
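[Figure: linking pipeline. Starting from the debates, two steps (detect topics in speeches, detect named entities in speeches) yield topics and named entities. Together with the name of the speaker and the date of the debate, these feed the step "create queries". The queries are used to search the newspaper archive, which yields candidate articles. The final step, "rank candidate articles", produces the links between speeches and articles.]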
Intuition 1: The name of the speaker should appear in the article and the article should be published within a week of the debate
Intuition 2: The more the article and the speech overlap in terms of topics and named entities, the more closely they are related.
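The two intuitions can be sketched as a filter followed by a ranking. This is a rough illustration under stated assumptions, not the PoliMedia implementation: the slides do not specify the ranking function, so the overlap score here is simply a count of shared terms. "Within a week" is read as zero to seven days after the debate, and the classes, article ids, and sample texts are invented for the example.

```python
from dataclasses import dataclass
from datetime import date, timedelta


@dataclass
class Speech:
    speaker: str        # name of the speaker
    debate_date: date   # date of the debate
    terms: frozenset    # topics and named entities detected in the speech


@dataclass
class Article:
    url: str
    published: date
    text: str


def is_candidate(speech, article, window_days=7):
    """Intuition 1: the speaker is named and the article appears within a week."""
    delta = article.published - speech.debate_date
    in_window = timedelta(0) <= delta <= timedelta(days=window_days)
    return in_window and speech.speaker.lower() in article.text.lower()


def overlap_score(speech, article):
    """Intuition 2: count the speech terms that also occur in the article."""
    text = article.text.lower()
    return sum(1 for term in speech.terms if term.lower() in text)


def rank_candidates(speech, articles):
    """Keep the candidate articles and sort them by descending overlap."""
    candidates = [a for a in articles if is_candidate(speech, a)]
    return sorted(candidates, key=lambda a: overlap_score(speech, a), reverse=True)


# Invented sample data, for illustration only.
speech = Speech("Barge", date(1945, 11, 20), frozenset({"commissie", "begroting"}))
articles = [
    Article("a1", date(1945, 11, 21), "Barge sprak over de begroting"),
    Article("a2", date(1945, 11, 22), "Barge en de commissie over de begroting"),
    Article("a3", date(1945, 12, 20), "Barge over de begroting"),       # too late
    Article("a4", date(1945, 11, 21), "De begroting en de commissie"),  # no speaker
]
print([a.url for a in rank_candidates(speech, articles)])  # ['a2', 'a1']
```

Separating the hard filter from the soft score mirrors the pipeline: the filter keeps the candidate set small, and the ranking decides which candidates become links.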
Evaluation: what do we use to rank the candidate articles?
• Experiment on 150 <newspaper article, speech in debate> pairs, 2 raters, K = 0.5
• Compare text of candidate articles to:
  • Setting 1: Named Entities in speech
  • Setting 2: Named Entities + Topics in speech
  • Setting 3: Named Entities + Topics in speech and larger part-of-debate
Score                               Setting 1   Setting 2   Setting 3
I don’t know                        0.14        0.15        0.08
0 - unrelated                       0.38        0.23        0.12
1 - related                         0.29        0.36        0.36
2 - explicit mention of the debate  0.19        0.26        0.44
1 + 2                               0.48        0.62        0.80
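The last row of the table is the sum of the rows for scores 1 and 2, that is, the share of rated pairs judged at least related. The short check below recomputes it from the values in the table and confirms that each column sums to 1. The variable names are chosen for the example.

```python
# Fractions copied from the results table, one value per setting (1, 2, 3).
unknown = [0.14, 0.15, 0.08]    # "I don't know"
unrelated = [0.38, 0.23, 0.12]  # score 0
related = [0.29, 0.36, 0.36]    # score 1
explicit = [0.19, 0.26, 0.44]   # score 2: explicit mention of the debate

# Row "1 + 2": share of pairs judged at least related.
at_least_related = [round(r + e, 2) for r, e in zip(related, explicit)]

# Each column should account for all rated pairs.
totals = [round(sum(col), 2) for col in zip(unknown, unrelated, related, explicit)]

print(at_least_related)  # [0.48, 0.62, 0.8]
print(totals)            # [1.0, 1.0, 1.0]
```

Read this way, adding topics and the larger part-of-debate raises the share of at-least-related articles from 0.48 to 0.80.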
Results
• An open data set of Dutch parliamentary debates,
• with almost 3 million links between 450,000 speeches and the URLs of 1.5 million newspaper articles and radio bulletins at the National Library,
• accessible through a Web demonstrator and through a SPARQL endpoint.
Demo
SPARQL endpoint
• A service to query a knowledge base using the SPARQL query language.
“All speeches with more than 60 associated news items.”
SELECT ?speech ?no_news_items {
  {
    SELECT ?speech (COUNT(?news) AS ?no_news_items)
    WHERE {
      ?speech <http://purl.org/linkedpolitics/nl/polivoc#coveredAt> ?news .
    }
    GROUP BY ?speech
  }
  FILTER (?no_news_items > 60)
}
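A query like the one above is typically sent to an endpoint over HTTP, following the SPARQL 1.1 Protocol, which accepts the query text in a `query` parameter. The sketch below only builds the request URL and does not contact any server. The endpoint address is a placeholder, since the slides do not give it, and the result variable is spelled consistently as ?no_news_items.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

# Placeholder: the slides do not state the address of the endpoint.
ENDPOINT = "http://example.org/sparql"

QUERY = """
SELECT ?speech ?no_news_items {
  { SELECT ?speech (COUNT(?news) AS ?no_news_items)
    WHERE { ?speech <http://purl.org/linkedpolitics/nl/polivoc#coveredAt> ?news . }
    GROUP BY ?speech }
  FILTER (?no_news_items > 60)
}
"""


def build_request_url(endpoint, query):
    """Encode the query as the `query` parameter of a SPARQL GET request."""
    return endpoint + "?" + urlencode({"query": query.strip()})


url = build_request_url(ENDPOINT, QUERY)

# Round-trip check: decoding the URL gives back the original query text.
decoded = parse_qs(urlsplit(url).query)["query"][0]
print(decoded == QUERY.strip())
```

To run the query for real, the URL could be opened with `urllib.request`, asking for JSON results via an `Accept: application/sparql-results+json` header, assuming the endpoint supports that format.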
Reflection: to what extent can we answer these questions?
• Which events have attracted a lot of media attention?
• What are the differences between different media? E.g. in different newspapers, or newspapers vs. radio bulletins?
• Has the coverage changed over time?
• How are the events visualized (photos, layout of newspaper, etc.)?
Future work
• More types of links
  • From just “coveredIn” to “quotedIn”, “coveredIn”, “backgroundOf”, “talksAbout”
• More types of media
• More types of (political) events.
Project ‘Talk of Europe / Traveling Clarin Campus’
2014-2015
Funded by CLARIN-ERIC
From left to right: Max Kemman, Marnix van Berchum, Laura Hollink, Astrid van Aggelen, Steven Krauwer, Henri Beunders. (Unfortunately, Martijn Kleppe and Johan Oomen were not present to join the group pic.)
Plans of ‘ToE/TTC’
1. Publish proceedings of the EU parliamentary debates in RDF
   • hosted by DANS
2. Organize 3 workshops/hackathons/‘Traveling Clarin Campuses’ in which we invite international partners to work with the data.
3. In collaboration with international partners:
   • enrich with annotations, e.g. topics, structured data about people, parties, etc.
   • link to national datasets, e.g. media or national parliaments