Semanticnews 230913-final
Transcript of Semanticnews 230913-final
Mark A Greenwood, Jonathon Hare, David R Newman, Wim Peters
Semantic Media @ The British Library, Monday 23rd September 2013
The Project Vision
• Semantic News is a 6-month project:
  • June to November 2013
  • Two 50% FTEs (1 Southampton, 1 Sheffield)
• An interactive ‘second screen’ to provide contextual information on Question Time questions
• Use multiple data sources
• Perform named entity recognition
• Exploit Linked Open Datasets
• Towards an almost real-time system
Where is the Data? (1)
• Question Time in 2010
  • 34 episodes, 163 questions
• BBC Subtitles
  • XML encoded
  • Broadcast as the subtitles stream
Where is the Data? (2)
• BBC Programmes Data
  • XML encoded
  • Information about the programme (panellists, topics, broadcast dates, etc.)
• Tweets
  • Taken from the Twitter ‘Garden Hose’ (10% stream)
Pre-parsing Subtitles Data
• Raw XML subtitles
  • Remove duplicate words
  • Parse into CSV
    • time offset
    • sentence
• Break into questions
  • BBC Programmes data provides question time offsets
  • Compare with subtitles time offsets and split
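The subtitle steps above could be sketched roughly as follows. This is only an illustration, not the project's code: it assumes "duplicate words" means a word repeated immediately after itself, that offsets are in seconds, and that the CSV has `offset` and `sentence` columns. The sample rows and question offsets are invented.

```python
import bisect
import csv
import io


def collapse_repeats(text):
    """Drop a word when it immediately repeats the previous word (case-insensitive)."""
    out = []
    for word in text.split():
        if not out or out[-1].lower() != word.lower():
            out.append(word)
    return " ".join(out)


def split_into_questions(rows, question_offsets):
    """Assign each (offset, sentence) row to the latest question starting at or before it."""
    starts = sorted(question_offsets)
    groups = {i: [] for i in range(len(starts))}
    for offset, sentence in rows:
        idx = bisect.bisect_right(starts, offset) - 1
        if idx >= 0:  # rows before the first question start are skipped
            groups[idx].append(sentence)
    return groups


# Hypothetical sample data standing in for the parsed subtitle CSV.
SUBTITLES_CSV = """offset,sentence
95,Welcome welcome to Question Time
130,Our first first question tonight
420,Let us move on
"""

rows = [
    (float(r["offset"]), collapse_repeats(r["sentence"]))
    for r in csv.DictReader(io.StringIO(SUBTITLES_CSV))
]
# Hypothetical question start offsets, as the programmes data might provide them.
questions = split_into_questions(rows, [120.0, 400.0])
```

Using a sorted list of start offsets with `bisect` keeps the split independent of the order in which subtitle rows arrive.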
Pre-parsing Twitter Data
• Twitter ‘Garden Hose’ for 2010 Dataset
• Used Apache Hadoop and filtered on:
  • @bbcqt, @bbcquestiontime
  • #bbcqt, #bbcquestiontime, #questiontime
  • “Question Time” “David Dimbleby”
• Collated JSON results and imported into OpenRefine
  • Removed irrelevant fields
  • Filtered out tweets that did not contain “bbc”
  • Exported as CSV
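The same filtering logic can be illustrated in plain Python, without Hadoop or OpenRefine. This sketch makes several assumptions: matching is a case-insensitive substring test, the two quoted phrases are treated as independent search terms, and only the `created_at` and `text` fields are kept. The sample tweets are invented.

```python
import csv
import io
import json

# Search terms taken from the slide, lower-cased for case-insensitive matching.
TERMS = [
    "@bbcqt", "@bbcquestiontime",
    "#bbcqt", "#bbcquestiontime", "#questiontime",
    "question time", "david dimbleby",
]


def matches_terms(text):
    """True if the tweet text contains any of the search terms."""
    lowered = text.lower()
    return any(term in lowered for term in TERMS)


def filter_tweets(json_lines):
    """Keep tweets that match a search term and also mention 'bbc'; drop other fields."""
    kept = []
    for line in json_lines:
        tweet = json.loads(line)
        text = tweet.get("text", "")
        if matches_terms(text) and "bbc" in text.lower():
            kept.append({"created_at": tweet.get("created_at", ""), "text": text})
    return kept


def to_csv(tweets):
    """Serialise the kept tweets as CSV text with a header row."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=["created_at", "text"])
    writer.writeheader()
    writer.writerows(tweets)
    return buf.getvalue()


# Hypothetical sample input: one JSON object per line.
sample = [
    '{"created_at": "t1", "text": "Watching #bbcqt tonight", "lang": "en"}',
    '{"created_at": "t2", "text": "Question Time is on", "lang": "en"}',
    '{"created_at": "t3", "text": "Unrelated tweet", "lang": "en"}',
]
kept = filter_tweets(sample)
```

Note that the second tweet matches a search phrase but is still dropped by the “bbc” filter, which mirrors the two-stage filtering described on the slide.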
Information Extraction with GATE
● General Architecture for Text Engineering (GATE)
● Developed by the University of Sheffield since 2000
● Used by many researchers, scientists and organisations all over the world
● Includes various components for language processing
  ● Parsers, machine learning tools, stemmers, IR tools, IE components for various languages...
● Also provides tools for visualising and manipulating text, annotations, ontologies, parse trees, etc., and tools for evaluation
Linguistic pre-processing
● Techniques
  ● Tokenization
  ● Sentence Splitting
  ● Language Identification
  ● POS tagging
  ● Morphological analysis
● Adapted for use with social media like Twitter
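To show why tokenization needs adapting for Twitter, here is a small regex-based sketch that keeps mentions, hashtags and URLs as single tokens. It is not GATE's tokenizer or sentence splitter; the patterns are simplified assumptions chosen for illustration.

```python
import re

# Alternatives are tried in order, so Twitter-specific tokens win over plain words.
TOKEN_RE = re.compile(
    r"""
    https?://\S+        # URLs kept whole
  | [@#]\w+             # mentions and hashtags kept whole
  | \w+(?:'\w+)*        # words, allowing internal apostrophes
  | [^\w\s]             # any other single non-space character
    """,
    re.VERBOSE,
)


def tokenize(text):
    """Split tweet text into tokens, preserving @mentions, #hashtags and URLs."""
    return TOKEN_RE.findall(text)


def split_sentences(text):
    """Naive sentence splitter: break on whitespace that follows '.', '!' or '?'."""
    return re.split(r"(?<=[.!?])\s+", text.strip())
```

For example, `tokenize("@bbcqt Great debate tonight! #bbcqt")` keeps `@bbcqt` and `#bbcqt` intact, where a tokenizer written for newswire text would typically split off the `@` and `#`.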
Named Entity Recognition
● Approaches
  ● Gazetteer lookup
  ● JAPE grammars
  ● Co-reference
● Types
  ● Location: countries, regions, cities, etc.
  ● Organisation: names of companies, government organisations, committees, agencies, universities, etc.
  ● Person: names of people
  ● Date: absolute dates like ‘October 2012’ or ‘2007’, as well as relative dates, such as ‘last year’
  ● Measurements: e.g. “8,596 km”, “one fifth”, percentages and probabilities
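Of the approaches listed, gazetteer lookup is the simplest to illustrate. The sketch below does a greedy longest-match lookup of token sequences against a tiny hand-made list. The entries and the `lookup` function are hypothetical and do not reflect how GATE's gazetteer components are implemented.

```python
# Hypothetical gazetteer: lower-cased token tuples mapped to entity types.
GAZETTEER = {
    ("ken", "clarke"): "Person",
    ("labour",): "Organisation",
    ("london",): "Location",
}
MAX_LEN = max(len(key) for key in GAZETTEER)


def lookup(tokens):
    """Return (start, end, type) spans, preferring the longest match at each position."""
    annotations = []
    i = 0
    while i < len(tokens):
        for n in range(min(MAX_LEN, len(tokens) - i), 0, -1):
            key = tuple(t.lower() for t in tokens[i:i + n])
            if key in GAZETTEER:
                annotations.append((i, i + n, GAZETTEER[key]))
                i += n  # continue after the matched span
                break
        else:
            i += 1  # no entry starts here; move on one token
    return annotations


spans = lookup("Ken Clarke attacked Labour in London".split())
```

Trying the longest candidate first means a multi-word name such as “Ken Clarke” is annotated as one entity rather than as separate tokens.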
Enrichment: LODIE
● Under constant development in various projects
● Associates the most probable LOD URI with named entities
● Disambiguation against DBpedia
● Various techniques to enhance recall
Enrichment: LODIE
“Ken Clarke: The Labour plotters hide behind the knife and stab with the cloak! Brilliant!!”
“Hain just lost Labour votes by supporting the £25k benefits of an extremist.”
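As a rough illustration of picking the most probable URI for a mention such as “Labour” in the tweets above, the sketch below scores each candidate by how many of its keywords appear in the surrounding text. This is a toy stand-in and not LODIE's actual method: the candidate list, the keyword sets and the scoring rule are all assumptions, and the URIs are included only as plausible DBpedia-style identifiers.

```python
# Hypothetical candidates: (URI, hand-picked context keywords) per mention.
CANDIDATES = {
    "labour": [
        ("http://dbpedia.org/resource/Labour_Party_(UK)",
         {"uk", "westminster", "miliband", "votes"}),
        ("http://dbpedia.org/resource/Labour_Party_(Ireland)",
         {"ireland", "dublin", "dail"}),
    ],
}


def disambiguate(mention, context_tokens):
    """Return the candidate URI whose keywords overlap most with the context.

    Ties go to the earlier candidate; unknown mentions return None.
    """
    context = {t.lower() for t in context_tokens}
    best_uri, best_score = None, -1
    for uri, keywords in CANDIDATES.get(mention.lower(), []):
        score = len(keywords & context)
        if score > best_score:
            best_uri, best_score = uri, score
    return best_uri


tweet = "Hain just lost Labour votes by supporting the £25k benefits of an extremist."
uri = disambiguate("Labour", tweet.split())
```

A production system would draw candidates and context evidence from the knowledge base itself rather than from a hand-written table.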
Representing Extracted Information
Conceptualising a Question
http://www.youtube.com/watch?v=O3l9Mi-KylI
Show Me The Data!
• Use (Linked) Open Data Datasets
  • Crime Data
  • Election Data (constituencies, majorities, etc.)
  • MP voting records
  • School league tables
  • NHS performance league tables
  • Economic Figures (GDP, Inflation, Unemployment)
• Compare and contrast
Let’s have some questions from our audience.