Tools for (Almost) Real-Time Social Media Analysis

University of Sheffield, NLP

Dr. Diana Maynard

Dept of Computer Science, University of Sheffield, UK

19 March 2015, Vienna


We are all connected to each other...

● Information, thoughts and opinions are shared prolifically on the social web these days

● 72% of online adults use social networking sites


Your grandmother is three times as likely to use a social networking site now as in 2009


There are hundreds of tools for social media analytics

● Most of them are commercial and not freely available

● The research tools tend to focus on specific topics and scenarios, and aren't easily adaptable

● The analysis they do often doesn't go much beyond number crunching, e.g.

– look at number of tweets, retweets, favourites

– filter by hashtag or keyword for topic categorisation

– use off-the-shelf sentiment tools

– use counts of word length, POS categories etc

– very little semantics, don't deal with variation, ambiguity, slang, sarcasm etc.


Analysing Social Media is harder than it sounds

There are lots of things to think about!


Analysing language in social media is hard

● Grundman:politics makes #climatechange scientific issue,people don’t like knowitall rational voice tellin em wat 2do

● @adambation Try reading this article , it looks like it would be really helpful and not obvious at all. http://t.co/mo3vODoX

● Want to solve the problem of #ClimateChange? Just #vote for a #politician! Poof! Problem gone! #sarcasm #TVP #99%

● Human Caused #ClimateChange is a Monumental Scam! http://www.youtube.com/watch?v=LiX792kNQeE … F**k yes!! Lying to us like MOFO's Tax The Air We Breath! F**k Them!


Let's search for keywords like “Arctic”

Oops!


Seems like we need something to help!

How about NLP?


It is difficult to access unstructured information efficiently

Information extraction tools can help you:

● Save time and money on management of text and data from multiple sources

● Find hidden links scattered across huge volumes of diverse information

● Integrate structured data from a variety of sources

● Interlink text and data

● Collect information and extract new facts


What is Entity Recognition?

● Entity Recognition is about recognising and classifying key Named Entities and terms in the text

● A Named Entity is a Person, Location, Organisation, Date etc.

● A term is a key concept or phrase that is representative of the text

● Entities and terms may be described in different ways but refer to the same thing. We call this co-reference.

Mitt Romney, the favorite to win the Republican nomination for president in 2012

(Annotated: Person, Term, Date)

The GOP tweeted that they had knocked on 75,000 doors in Ohio the day prior.

(Annotated: Organisation, Location, co-reference)


What is Event Recognition?

● An event is an action or situation relevant to the domain expressed by some relation between entities or terms.

● It is always grounded in time, e.g. the performance of a band, an election, the death of a person

Mitt Romney, the favorite to win the Republican nomination for president in 2012

(Annotated: Person, Event, Date; two Relations)


Why are Entities and Events Useful?

● They can help answer the “Big 5” journalism questions (who, what, when, where, why)

● They can be used to categorise the texts in different ways

– look at all texts about Obama.

● They can be used as targets for opinion mining

– find out what people think about President Obama

● When linked to an ontology and/or combined with other information, they can be used for reasoning about things not explicit in the text

– seeing how opinions about different American presidents have changed over the years


Approaches to Information Extraction

Knowledge Engineering

● rule based

● developed by experienced language engineers

● make use of human intuition

● easier to understand results

● development could be very time consuming

● some changes may be hard to accommodate

Learning Systems

● use statistics or other machine learning

● developers do not need LE expertise

● requires large amounts of annotated training data

● some changes may require re-annotation of the entire training corpus


Seems like we need a tool to do this clever stuff for us.

How about GATE?


What is GATE?

GATE is an NLP toolkit developed at the University of Sheffield over the last 20 years.

It's open source and freely available. http://gate.ac.uk

• components for language processing, e.g. parsers, machine learning tools, stemmers, IR tools, IE components for various languages...

• tools for visualising and manipulating text, annotations, ontologies, parse trees, etc.

• various information extraction tools

• evaluation and benchmarking tools


GATE components

● Language Resources (LRs), e.g. lexicons, corpora, ontologies

● Processing Resources (PRs), e.g. parsers, generators, taggers

● Visual Resources (VRs), i.e. visualisation and editing components

● Algorithms are separated from the data, which means:

– the two can be developed independently by users with different expertise.

– alternative resources of one type can be used without affecting the other, e.g. a different visual resource can be used with the same language resource


ANNIE

• ANNIE is GATE's rule-based IE system

• It uses the language engineering approach (though we also have tools in GATE for ML)

• Distributed as part of GATE

• Uses a finite-state pattern-action rule language, JAPE

• ANNIE contains a reusable and easily extendable set of components:

– generic preprocessing components for tokenisation, sentence splitting etc

– components for performing NE on general open domain text


ANNIE Modules


Document with Tokens


Document with Sentences


Gazetteer editor

(Screenshot labels: definition file entries; entries for selected list)


Named Entity Grammars

• Hand-coded rules written in JAPE applied to annotations to identify NEs

• Phases run sequentially and constitute a cascade of FSTs over annotations

• Annotations from format analysis, tokeniser, splitter, POS tagger, morphological analysis, gazetteer etc.

• Because phases are sequential, annotations can be built up over a period of phases, as new information is gleaned

• Standard named entities: persons, locations, organisations, dates, addresses, money

• Basic NE grammars can be adapted for new applications, domains and languages
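
The idea of a cascade of phases building up annotations can be sketched in a few lines. This is not JAPE and not how ANNIE is implemented; it is a rough Python analogue in which the title list, the two phase functions and the example sentence are all invented for illustration.

```python
# Toy cascade: each phase reads the annotations made so far and adds new ones.
# Annotations are dicts with a type and a token span (start inclusive, end exclusive).
TITLES = {"Dr", "Mr", "Mrs", "Prof"}  # hypothetical gazetteer list


def gazetteer_phase(tokens, annotations):
    """Phase 1: mark tokens that appear in the title list."""
    for i, tok in enumerate(tokens):
        if tok.rstrip(".") in TITLES:
            annotations.append({"type": "Title", "start": i, "end": i + 1})
    return annotations


def person_phase(tokens, annotations):
    """Phase 2: a Title followed by capitalised tokens becomes a Person."""
    new = []
    for ann in annotations:
        if ann["type"] != "Title":
            continue
        end = ann["end"]
        while end < len(tokens) and tokens[end][:1].isupper():
            end += 1
        if end > ann["end"]:
            new.append({"type": "Person", "start": ann["start"], "end": end})
    return annotations + new


def run_cascade(text, phases):
    """Run the phases in order over whitespace-separated tokens."""
    tokens = text.split()
    annotations = []
    for phase in phases:
        annotations = phase(tokens, annotations)
    return tokens, annotations


tokens, anns = run_cascade(
    "Yesterday Dr. Diana Maynard spoke in Vienna",
    [gazetteer_phase, person_phase],
)
people = [" ".join(tokens[a["start"]:a["end"]]) for a in anns if a["type"] == "Person"]
print(people)
```

The second phase only works because the first has already produced Title annotations, which is the point of running phases sequentially.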


Document with NEs


Coreference


Right, so we have a technology, and we have a tool to apply the technology.

Now how do we do it?


Framework

● Data collection (via Twitter streaming API)

● Documents stored as JSON and processed (annotated) via GCP (the GATE Cloud Paralleliser)

● Documents indexed via MIMIR

● Search and visualisation via MIMIR/Prospector


Live streaming (coming soon)

● If we want to process the tweets in real time, we can use the Twitter streaming client to feed the incoming tweets to a message queue.

● A separate process then reads messages from the queue, annotates them and pushes them into Mimir.

● If the rate of incoming tweets exceeds the capacity of the processing side, we can simply launch more instances of the message consumer across different machines to scale the capacity.

● Query and visualisation can then be performed as before on whatever data we currently have available
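
The queue-and-consumers design described above might look roughly like this. It is a single-process sketch using Python's standard library: the in-memory queue stands in for a real message queue, the list stands in for the Mimir index, and `annotate` is a placeholder rather than a GATE pipeline. All names are assumptions.

```python
import queue
import threading


def annotate(tweet):
    """Placeholder for the annotation pipeline: just pulls out hashtags."""
    return {"text": tweet, "hashtags": [w for w in tweet.split() if w.startswith("#")]}


def consumer(q, index, lock):
    """Read tweets from the queue, annotate them, push them to the index."""
    while True:
        tweet = q.get()
        if tweet is None:  # sentinel: no more tweets for this consumer
            q.task_done()
            break
        doc = annotate(tweet)
        with lock:
            index.append(doc)  # stand-in for pushing into the index
        q.task_done()


def run(tweets, n_consumers=2):
    """Feed tweets to the queue and process them with n_consumers workers."""
    q = queue.Queue()
    index, lock = [], threading.Lock()
    workers = [
        threading.Thread(target=consumer, args=(q, index, lock))
        for _ in range(n_consumers)
    ]
    for w in workers:
        w.start()
    for t in tweets:
        q.put(t)
    for _ in workers:
        q.put(None)  # one sentinel per consumer
    for w in workers:
        w.join()
    return index


indexed = run(
    ["Vote now #GE2015", "Fracking U-turn #climatechange", "Hello"],
    n_consumers=3,
)
print(len(indexed))
```

Raising `n_consumers` is the in-process equivalent of launching more consumer instances; across machines the same pattern needs a networked queue.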


Let's look at some examples

(For anyone who grew up in the UK): “Here's one I prepared earlier”


DecarboNet project: what do people think about climate change?

And how much do we really know about it?

How do we know what's really true?

“It's cold in my flat”

https://www.youtube.com/watch?v=mxXiZB5i8pc


Political Futures Tracker Application

● Example of using the technology on a real scenario - analysing political tweets in the run-up to the UK elections

● Project funded by Nesta http://www.nesta.org.uk/

● Series of blog posts about the project, leading up to the election, see e.g.

http://www.nesta.org.uk/blog/silver-surfers-and-westminster-twitterati


Twitter collection

● collected Tweets using Twitter's “statuses/filter” streaming API

● can follow up to 5000 user IDs and receive their tweets in real time

● collected all tweets and retweets posted by these users

● also retweets of, and replies to, any tweet posted by these users
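
The three kinds of tweet collected can be told apart from the tweet JSON. The sketch below assumes the classic Twitter tweet object layout (`user.id`, `retweeted_status`, `in_reply_to_user_id`); the function name and the sample tweets are made up, and no API call is made.

```python
def is_relevant(tweet, followed_ids):
    """True if the tweet is by, a retweet of, or a reply to a followed user."""
    author = tweet.get("user", {}).get("id")
    if author in followed_ids:
        return True  # tweet or retweet posted by a followed user
    retweeted = tweet.get("retweeted_status")
    if retweeted and retweeted.get("user", {}).get("id") in followed_ids:
        return True  # someone else retweeting a followed user
    # someone else replying to a followed user
    return tweet.get("in_reply_to_user_id") in followed_ids


followed = {1, 2}  # hypothetical user IDs
sample = [
    {"user": {"id": 1}, "text": "posted by a followed account"},
    {"user": {"id": 9}, "retweeted_status": {"user": {"id": 2}}},
    {"user": {"id": 9}, "in_reply_to_user_id": 1},
    {"user": {"id": 9}, "in_reply_to_user_id": None},
]
kept = [t for t in sample if is_relevant(t, followed)]
print(len(kept))
```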


Twitter collection (2)

● Initial list of 506 UK MPs' Twitter accounts, extracted from a CSV file made available by BBC News Labs and cleaned

● Also added list of UK election candidates collected and made available at https://yournextmp.com, and updated periodically

– 1,504 on 13th January 2015

– 1,811 on 2nd February 2015

● Added 21 official party accounts

● Total number of accounts followed at 16th Feb: 1,894

– 444 MPs standing again are included in both the MP and candidate lists

https://yournextmp.com/
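
The total is consistent with the figures above if the 2nd February candidate count is used and the MPs appearing in both lists are counted once. That this is how the total was derived is an assumption; the check is simply:

```python
mps = 506             # initial list of MPs' accounts
candidates = 1811     # candidate list as of 2nd February 2015
party_accounts = 21   # official party accounts
in_both_lists = 444   # MPs standing again, present in both lists

total = mps + candidates + party_accounts - in_both_lists
print(total)  # 1894
```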


Tweets per hour collected

● Government U-turn on fracking: http://www.climateactionprogramme.org/news/fracking_blocked_across_uk_after_government_forced_into_u_turn

● Douglas Carswell's accidental “Hello Kitty” tweet: http://www.bbc.co.uk/news/uk-england-essex-31112032


Longer web documents

● Also crawled websites of UK political parties (Con, Lab, LD, Green, UKIP, BNP, SNP, PC, plus the NI parties and various smaller parties)

● Initial crawl on 28th-29th October retrieved 375MB (compressed)

● Re-crawled regularly to pick up new pages


Politician and candidate annotation

● Acquired and corrected a list of UK MPs and election candidates and their affiliations, twitter accounts and DBpedia URIs

● Converted to gazetteers so that MPs in various forms (name or twitter handle) can be recognised in tweets and annotated with the relevant info (URI, full name, constituency etc.)
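
A gazetteer lookup of this kind can be sketched as a dictionary from surface forms to the associated information. This is not the GATE gazetteer format; the politician, handle, URI and constituency below are invented placeholders.

```python
import re

# Hypothetical entry, reachable by name or by Twitter handle.
JANE = {
    "full_name": "Jane Example",
    "uri": "http://example.org/resource/Jane_Example",
    "constituency": "Exampleton",
}
GAZETTEER = {"jane example": JANE, "@jane_example": JANE}


def annotate_politicians(text, gazetteer):
    """Return character-offset annotations for every gazetteer match in text."""
    annotations = []
    lowered = text.lower()
    for surface, info in gazetteer.items():
        # Match whole words only, case-insensitively.
        pattern = r"(?<!\w)" + re.escape(surface) + r"(?!\w)"
        for m in re.finditer(pattern, lowered):
            annotations.append({"start": m.start(), "end": m.end(), **info})
    return sorted(annotations, key=lambda a: a["start"])


text = "Great speech by Jane Example today, thanks @jane_example!"
found = annotate_politicians(text, GAZETTEER)
for a in found:
    print(text[a["start"]:a["end"]], "->", a["uri"])
```

Both mentions resolve to the same URI, which is what lets later steps treat a name and a handle as the same person.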


Recognition of MPs / Candidates


Topic Recognition

● A set of themes was taken from the categories used on http://www.gov.uk

● For each theme, a gazetteer list was developed containing typical keywords and phrases representative of that theme

● e.g. “asylum seeker” indicates the topic “borders and immigration”

● Each list was expanded via:

– automatic term recognition (based on tf.idf) over a corpus of manifestos and other political documents

– manual additions

● Each list also contains potential head terms and modifiers which can be expanded into longer terms on the fly from the text during analysis stage

● e.g. “terrorist” can modify many other words (terrorist attack, terrorist threat, ...)
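
For readers unfamiliar with tf.idf, here is a minimal sketch of the scoring idea. It is a textbook formulation over single words, not the term recognition tool used in the project, and the three toy documents are invented; real term recognition would also handle multi-word terms.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Return one {term: score} dict per (non-empty) document."""
    tokenised = [doc.lower().split() for doc in docs]
    n_docs = len(tokenised)
    # Number of documents each term appears in.
    doc_freq = Counter(term for toks in tokenised for term in set(toks))
    scores = []
    for toks in tokenised:
        counts = Counter(toks)
        scores.append({
            term: (count / len(toks)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return scores


docs = [
    "asylum seeker policy and border control",
    "nhs funding and hospital waiting times",
    "border control and asylum policy",
]
scores = tf_idf(docs)
# Terms sorted by how characteristic they are of the first document.
ranked = sorted(scores[0], key=scores[0].get, reverse=True)
print(ranked)
```

Words that occur in every document, such as "and" here, score zero, while words concentrated in one document score highest and are candidates for that theme's list.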


Topic recognition

This term is found by first recognising the head word “job” from a list under the theme “employment” and matching against its root form in the text, i.e. “job”.

It is then extended to include the adjectival modifier “British”, which is not present in a list anywhere.
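
The head-word expansion described above can be sketched as follows, assuming the text has already been tokenised, lemmatised and POS-tagged. The tag names follow the Penn Treebank convention; the function name and the example sentence are assumptions.

```python
def expand_head_terms(tagged_tokens, head_roots):
    """Find head words by root form and extend them leftwards over adjectives.

    tagged_tokens: list of (word, root, pos) triples
    head_roots: dict mapping a head word's root form to its theme
    """
    terms = []
    for i, (word, root, pos) in enumerate(tagged_tokens):
        if root not in head_roots:
            continue
        start = i
        # Absorb adjectival modifiers immediately before the head word.
        while start > 0 and tagged_tokens[start - 1][2] == "JJ":
            start -= 1
        phrase = " ".join(w for w, _, _ in tagged_tokens[start:i + 1])
        terms.append((phrase, head_roots[root]))
    return terms


sentence = [
    ("We", "we", "PRP"),
    ("need", "need", "VBP"),
    ("more", "more", "JJR"),
    ("British", "British", "JJ"),
    ("jobs", "job", "NNS"),
]
terms = expand_head_terms(sentence, {"job": "employment"})
print(terms)  # [('British jobs', 'employment')]
```

Only the head word needs to be in a list; the modifier is picked up from the text itself.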


Sentiment annotation

● Annotations are created over the whole sentence and contain the following features:

– sentiment_kind: optimism / pessimism

– holder: the person holding the opinion (MP's name)

– holder_URI: the URI of the holder

– target: the target of the opinion, e.g. MP or topic

– target_URI: if appropriate, the URI of the target

– score: a positive/negative value reflecting the strength of opinion

– sarcasm: yes / no (whether sarcasm is present)

– sentiment_string: the main word(s) that contain sentiment

● These annotations and features will be used as input to MIMIR to facilitate analysis/aggregation
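
To make the feature set concrete, it can be written down as a record type and aggregated. The field names are taken from the list above; the Python types, the example values and the aggregation function are assumptions for illustration only.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Optional


@dataclass
class SentimentAnnotation:
    sentiment_kind: str        # "optimism" or "pessimism"
    holder: str                # person holding the opinion
    holder_URI: str
    target: str                # e.g. an MP or a topic
    target_URI: Optional[str]  # only if appropriate
    score: float               # positive/negative strength of opinion
    sarcasm: bool
    sentiment_string: str      # main word(s) carrying the sentiment


def mean_score_by_target(annotations):
    """Average the opinion score for each target."""
    by_target = defaultdict(list)
    for ann in annotations:
        by_target[ann.target].append(ann.score)
    return {target: sum(s) / len(s) for target, s in by_target.items()}


# Invented example annotations.
anns = [
    SentimentAnnotation("optimism", "Jane Example", "http://example.org/Jane_Example",
                        "NHS", None, 0.5, False, "proud"),
    SentimentAnnotation("pessimism", "John Sample", "http://example.org/John_Sample",
                        "NHS", None, -1.0, False, "crisis"),
]
means = mean_score_by_target(anns)
print(means)  # {'NHS': -0.25}
```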


Positive opinion about science and innovation


GATE Mímir: Answering Questions Google Can't


GATE Mimir

● can be used to index and search over text, annotations, semantic metadata (concepts and instances)

● allows queries that arbitrarily mix full-text, structural, linguistic and semantic annotations

● is open source


Show me:

● all documents mentioning a temperature between 30 and 90 degrees F (expressed in any unit)

● all abstracts written in French on Patent Documents from the last 6 months which mention any form of the word “transistor” in the English abstract

● the names of the patent inventors of those abstracts

● all documents mentioning steel industries in the UK, along with their location


Search news articles for politicians born in Sheffield


Document Indexing with MIMIR

● MIMIR allows for indexing and querying text, annotations and semantic knowledge

– this gives a rich source of data for analysis

● Currently we have used MIMIR to index

– the raw collected text

– annotations provided by Twitter and by the applications


Examples of Mimir queries on our corpus

● Get all documents which talk about the borders/immigration topic

{Topic theme = "borders_and_immigration"}

● Get all documents where the author of the document is a candidate

{DocumentAuthor sparql = "?c nesta:candidate ?author_uri"}

● Get all documents where the author is an MP standing for re-election for the same seat

{DocumentAuthor sparql="?c nesta:candidate ?author_uri . ?c dbp-prop:mp ?author_uri"}

● Get all documents where the author is a candidate for the Sheffield Hallam constituency

{DocumentAuthor sparql="<http://dbpedia.org/resource/Sheffield_Hallam_(UK_Parliament_constituency)> nesta:candidate ?author_uri"}
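
Queries like these lend themselves to being generated from parameters. The helpers below only build query strings in the shape shown above; they do not talk to Mimir, they do no escaping, and their names are hypothetical.

```python
def topic_query(theme):
    """Query for documents annotated with the given topic theme."""
    return '{Topic theme = "%s"}' % theme


def candidate_for_constituency_query(constituency_uri):
    """Query for documents whose author is a candidate for the constituency."""
    sparql = "<%s> nesta:candidate ?author_uri" % constituency_uri
    return '{DocumentAuthor sparql="%s"}' % sparql


q1 = topic_query("borders_and_immigration")
q2 = candidate_for_constituency_query(
    "http://dbpedia.org/resource/Sheffield_Hallam_(UK_Parliament_constituency)"
)
print(q1)
print(q2)
```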


What do the different parties talk about? Conservative vs Labour

Transport, Europe and employment are mentioned more by Conservatives

NHS is mentioned more by Labour


Topics mentioned by UKIP


Topics mentioned by the Green Party

Expected topics are high on the list


How old are MPs?


Where did people tweet about the economy?


Measuring engagement about climate change

● We also used the tools to measure how engaged both MPs and the public are about the topic of climate change

● Comparison of climate change with other political topics

● Theory is that people are quite apathetic about most political topics in general

● But people are more enthusiastic about climate change, because it's something they can actively do something about

● Results showed that climate change is not frequently tweeted about by most politicians apart from the Green Party, but is in top 3 topics for most of the engagement criteria we applied (retweets, replies, sentiment, optimism, @mentions, URLs)

● Climate change tweets contained the highest number of URLs - direct engagement with additional information
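
One of the engagement criteria, average retweets per tweet, can be computed per topic and used to rank topics. The sketch below uses invented counts and an assumed record layout; it is not the project's analysis code and the numbers are not results.

```python
from collections import defaultdict


def average_retweets_by_topic(tweets):
    """Mean retweet count per topic; a tweet counts towards each of its topics."""
    totals = defaultdict(lambda: [0, 0])  # topic -> [retweet sum, tweet count]
    for tweet in tweets:
        for topic in tweet["topics"]:
            totals[topic][0] += tweet["retweets"]
            totals[topic][1] += 1
    return {topic: total / n for topic, (total, n) in totals.items()}


# Hypothetical annotated tweets.
tweets = [
    {"topics": ["climate_change"], "retweets": 10},
    {"topics": ["climate_change", "economy"], "retweets": 20},
    {"topics": ["economy"], "retweets": 0},
    {"topics": ["economy"], "retweets": 4},
]
averages = average_retweets_by_topic(tweets)
ranking = sorted(averages, key=averages.get, reverse=True)
print(averages)
print(ranking)
```

The same pattern applies to the other criteria (replies, @mentions, URLs) by swapping the field that is summed.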


Average retweets per tweet


Terms that co-occur with environmental topic


Terms that co-occur with environment topic


Terms that co-occur with immigration topic


Climate change tweets often express sentiment


Summary

● Once you have the indexed data, you can carry on doing all kinds of interesting comparisons and analysis.

● Simple analysis tools can give you pretty pictures, but you can do much more interesting things when you delve a bit deeper and make use of information not explicit in the text

● For this you need both NLP and Linked Open Data

● Our tools are all freely available and open source
