Semanticnews 230913-final

Mark A Greenwood, Jonathon Hare, David R Newman, Wim Peters, SemanticMedia@TheBritishLibrary, Monday 23rd September 2013

description

Slides presented about the SemanticNews project at the SemanticMedia@TheBritishLibrary event on 23rd September 2013.

Transcript of Semanticnews 230913-final

Page 1: Semanticnews 230913-final

Mark A Greenwood, Jonathon Hare, David R Newman, Wim Peters

SemanticMedia@TheBritishLibrary, Monday 23rd September 2013

Page 2: Semanticnews 230913-final
Page 3: Semanticnews 230913-final
Page 4: Semanticnews 230913-final

The Project Vision
• Semantic News is a 6-month project:
  • June to November 2013
  • Two 50% FTEs (1 Southampton, 1 Sheffield)
• An interactive ‘second screen’ to provide contextual information on Question Time questions
• Use multiple data sources
• Perform named entity recognition
• Exploit Linked Open Datasets
• Towards an almost real-time system

Page 5: Semanticnews 230913-final

Where is the Data? (1)
• Question Time in 2010
  • 34 episodes, 163 questions
• BBC Subtitles
  • XML encoded
  • Broadcast as the subtitles stream

Page 6: Semanticnews 230913-final

Where is the Data? (2)
• BBC Programmes Data
  • XML encoded
  • Information about the programme (panellists, topics, broadcast dates, etc.)
• Tweets
  • Taken from the Twitter ‘Garden Hose’ (10% stream)

Page 7: Semanticnews 230913-final

Pre-parsing Subtitles Data
• Raw XML subtitles
  • Remove duplicate words
  • Parse into CSV
    • time offset
    • sentence
• Break into questions
  • BBC Programmes data provides question time offsets
  • Compare with subtitles time offsets and split
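
The two pre-parsing steps on this slide can be sketched roughly as follows. This is a minimal illustration, not the project's code: it assumes "duplicate words" means immediately repeated words, that the CSV rows are already loaded as (offset in seconds, sentence) pairs, and that the question start offsets from the BBC Programmes data are sorted. The function names are hypothetical.

```python
import bisect


def remove_duplicate_words(sentence):
    """Collapse immediately repeated words, e.g. 'good good evening' -> 'good evening'.

    Assumption: only consecutive repeats count as duplicates.
    """
    kept = []
    for word in sentence.split():
        if not kept or word.lower() != kept[-1].lower():
            kept.append(word)
    return " ".join(kept)


def split_into_questions(subtitles, question_offsets):
    """Group (offset_seconds, sentence) rows under the question that precedes them.

    question_offsets must be sorted ascending. Rows before the first
    question start are collected under index -1 (the programme introduction).
    """
    groups = {}
    for offset, sentence in subtitles:
        # Index of the last question whose start offset is <= this row's offset.
        index = bisect.bisect_right(question_offsets, offset) - 1
        groups.setdefault(index, []).append(remove_duplicate_words(sentence))
    return groups
```

Calling `split_into_questions(rows, [60, 400])` would place every sentence from second 60 up to (but not including) second 400 under question 0, and everything from second 400 onwards under question 1.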

Page 8: Semanticnews 230913-final

Pre-parsing Twitter Data
• Twitter ‘Garden Hose’ for 2010 Dataset
• Used Apache Hadoop and filtered on:
  • @bbcqt, @bbcquestiontime
  • #bbcqt, #bbcquestiontime, #questiontime
  • “Question Time” “David Dimbleby”
• Collated JSON results and imported into OpenRefine
  • Removed irrelevant fields
  • Filtered out tweets that did not contain “bbc”
  • Exported as CSV
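
The filtering logic above can be approximated in a few lines of plain Python. This is only a sketch: the project ran the first filter on Apache Hadoop and the second in OpenRefine, whereas this example does both in one pass over line-delimited JSON. It assumes each search term matches independently and case-insensitively, and that tweets carry the standard `text` and `created_at` fields; the function names are hypothetical.

```python
import json

# Search terms taken from the slide; matching is a case-insensitive substring test.
HANDLES = ("@bbcqt", "@bbcquestiontime")
HASHTAGS = ("#bbcqt", "#bbcquestiontime", "#questiontime")
PHRASES = ("question time", "david dimbleby")


def matches_question_time(text):
    """Return True if the tweet text contains any of the search terms."""
    lowered = text.lower()
    return any(term in lowered for term in HANDLES + HASHTAGS + PHRASES)


def filter_tweets(json_lines):
    """Yield (created_at, text) for tweets that pass both filters.

    The second condition mirrors the OpenRefine step that dropped
    tweets not containing "bbc".
    """
    for line in json_lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines in the input
        tweet = json.loads(line)
        text = tweet.get("text", "")
        if matches_question_time(text) and "bbc" in text.lower():
            yield tweet.get("created_at"), text
```

Keeping only two fields per tweet corresponds to the "removed irrelevant fields" step; the resulting pairs could then be written out with the standard `csv` module.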

Page 9: Semanticnews 230913-final

Information Extraction with GATE
● General Architecture for Text Engineering (GATE)
● Developed by the University of Sheffield since 2000
● Used by many researchers, scientists and organisations all over the world
● Includes various components for language processing
  ● Parsers, machine learning tools, stemmers, IR tools, IE components for various languages...
● Also includes tools for visualising and manipulating text, annotations, ontologies, parse trees, etc., and tools for evaluation

Page 10: Semanticnews 230913-final

Linguistic pre-processing
● Techniques
  ● Tokenization
  ● Sentence Splitting
  ● Language Identification
  ● POS tagging
  ● Morphological analysis
● Adapted for use with social media like Twitter
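
To illustrate what adapting tokenization to Twitter can involve, here is a small regex-based tokenizer that keeps URLs, @mentions and #hashtags as single tokens. It is an illustrative sketch only and does not reproduce GATE's tokenizer; the pattern and the `tokenize` name are assumptions made for this example.

```python
import re

# Alternatives are tried left to right, so the Twitter-specific forms come first.
TOKEN_PATTERN = re.compile(
    r"""
    https?://\S+               # URLs
    | [@#]\w+                  # @mentions and #hashtags kept whole
    | £?\d+(?:[.,]\d+)*\w*     # numbers such as 8,596 or £25k
    | \w+(?:['’]\w+)*          # words, allowing internal apostrophes
    | [^\w\s]                  # any other single punctuation character
    """,
    re.VERBOSE,
)


def tokenize(text):
    """Split tweet text into tokens without breaking hashtags or mentions."""
    return TOKEN_PATTERN.findall(text)
```

A tokenizer that splits on punctuation alone would separate `#` from `bbcqt`, losing the hashtag that later processing stages might rely on.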

Page 11: Semanticnews 230913-final

Named Entity Recognition
● Approaches
  ● Gazetteer lookup
  ● JAPE grammars
  ● Co-reference
● Types
  ● Location: countries, regions, cities etc.
  ● Organisation: names of companies, government organisations, committees, agencies, universities, etc.
  ● Person: names of people
  ● Date: absolute dates like ‘October 2012’ or ‘2007’, as well as relative dates, such as ‘last year’.
  ● Measurements: e.g. “8,596 km”, “one fifth”, percentages and probabilities
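
Of the approaches listed, gazetteer lookup is the simplest to sketch: tokens are matched against lists of known names, preferring the longest match. The example below is a toy illustration and not GATE's gazetteer component; the entries in `GAZETTEER` and the function name are made up for the example.

```python
# Toy gazetteer: lower-cased token sequences mapped to an entity type.
GAZETTEER = {
    ("labour",): "Organisation",
    ("ken", "clarke"): "Person",
    ("house", "of", "commons"): "Organisation",
    ("sheffield",): "Location",
}
MAX_ENTRY_LENGTH = max(len(entry) for entry in GAZETTEER)


def gazetteer_lookup(tokens):
    """Return (start, end, type) spans, preferring the longest match at each position."""
    lowered = [token.lower() for token in tokens]
    annotations = []
    position = 0
    while position < len(tokens):
        longest = min(MAX_ENTRY_LENGTH, len(tokens) - position)
        for length in range(longest, 0, -1):
            candidate = tuple(lowered[position:position + length])
            if candidate in GAZETTEER:
                annotations.append((position, position + length, GAZETTEER[candidate]))
                position += length  # skip past the matched span
                break
        else:
            position += 1  # no entry starts at this token
    return annotations
```

Pure lookup cannot recognise names that are absent from the lists or resolve ambiguous ones, which is where grammar rules and co-reference come in.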

Page 12: Semanticnews 230913-final

Enrichment: LODIE
● Under constant development in various projects
● Associates the most probable LOD URI with named entities
● Disambiguation against DBpedia
● Various techniques to enhance recall
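
The idea of picking the "most probable" URI for a mention can be illustrated with a very simple scoring scheme: choose the candidate whose description shares the most words with the text around the mention. This is not LODIE's algorithm, which the slides do not describe; the scoring method, the `disambiguate` function and the `example.org` URIs are all placeholders invented for this sketch.

```python
def disambiguate(mention_context, candidates):
    """Pick the candidate URI whose description overlaps most with the context.

    candidates maps a URI to a short textual description of that resource.
    Returns None when there are no candidates. Ties go to the first candidate.
    """
    context_words = set(mention_context.lower().split())
    best_uri, best_score = None, -1
    for uri, description in candidates.items():
        score = len(context_words & set(description.lower().split()))
        if score > best_score:
            best_uri, best_score = uri, score
    return best_uri


# Placeholder candidates for the mention "Labour" (not real DBpedia data).
CANDIDATES = {
    "http://example.org/resource/Labour_Party": "british political party votes election",
    "http://example.org/resource/Labour_economics": "work employment wages economics",
}
```

With the tweet text "Hain just lost Labour votes" as context, the word "votes" tips the choice towards the political party candidate.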

Page 13: Semanticnews 230913-final

Enrichment: LODIE

“Ken Clarke: The Labour plotters hide behind the knife and stab with the cloak! Brilliant!!”

“Hain just lost Labour votes by supporting the £25k benefits of an extremist.”

Page 14: Semanticnews 230913-final

Representing Extracted Information

Page 15: Semanticnews 230913-final

Conceptualising a Question

http://www.youtube.com/watch?v=O3l9Mi-KylI

Page 16: Semanticnews 230913-final

Show Me The Data!
• Use (Linked) Open Data Datasets
  • Crime Data
  • Election Data (constituencies, majorities, etc.)
  • MP voting records
  • School league tables
  • NHS performance league tables
  • Economic Figures (GDP, Inflation, Unemployment)
• Compare and contrast

Page 17: Semanticnews 230913-final

Let’s have some questions from our audience.