SwiftRiver 2011 Overview

SWIFT RIVER 2011 Jon Gosier, Director of Product http://swiftly.org @swiftriver @jongos An Ushahidi Initiative

description

A look at the SwiftRiver platform and the progress of its various APIs and apps.

Transcript of SwiftRiver 2011 Overview

Page 1: SwiftRiver 2011 Overview

SWIFT RIVER 2011

Jon Gosier, Director of Product

http://swiftly.org

@swiftriver @jongos

An Ushahidi Initiative

Page 2: SwiftRiver 2011 Overview

Initial development began during the Haiti earthquake, one of Ushahidi’s largest deployments to date.

The objective became to offer smart tools for curating real-time data of all types (Email, Twitter, SMS, Web feeds).

Page 3: SwiftRiver 2011 Overview

PLATFORM GOALS

‣ Democratize access to intelligence tools

‣ Structure unstructured data feeds

‣ Data-mine overwhelming realtime datasets

‣ Surface signal & suppress noise

‣ Identify and rate authoritative users & sources

‣ Easy-to-use tools & applications for curating data on the user’s terms

Page 4: SwiftRiver 2011 Overview

“It’s not information overload. It’s filter failure.”

- Clay Shirky

Page 5: SwiftRiver 2011 Overview

Sweeper - User Interface

Page 6: SwiftRiver 2011 Overview

The Brain as an API

‣ Breaks content (data) into pieces

‣ Analyzes pieces separately

‣ Conditionally prioritizes

‣ Learns from experience

‣ Processing is distributed

‣ Recombination of pieces

Page 7: SwiftRiver 2011 Overview

APIs

‣ Tagging API - parses text and adds taxonomy

‣ Location API - detects origin location of content

‣ Influence API - measures influence of content online

‣ Reputation API - stores information about user behavior

‣ Duplication Filter API - receives feeds and filters out duplicate content to cut down on retweets

Applications

‣ Sweeper - sweep, structure and sort realtime data-streams

‣ SwiftMeme - meme/keyword tracker, content discovery

‣ SwiftMail - sort email by relevance
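
The slides do not show how the Duplication Filter API works internally. As a rough illustration only, a minimal sketch of retweet de-duplication might normalize each item (strip leading "RT @name:" chains, lowercase, collapse whitespace) and keep the first item seen for each normalized form. The function names and the normalization rule are assumptions, not the actual SwiftRiver implementation.

```python
import re

# Hypothetical rule: one or more leading "RT @name:" prefixes mark a retweet.
_RT_PREFIX = re.compile(r"^(?:RT\s+@\w+:?\s*)+", re.IGNORECASE)


def normalize(text):
    """Strip leading retweet prefixes, lowercase, and collapse whitespace."""
    stripped = _RT_PREFIX.sub("", text.strip())
    return " ".join(stripped.lower().split())


def filter_duplicates(items):
    """Return items in order, dropping any whose normalized text was already seen."""
    seen = set()
    unique = []
    for item in items:
        key = normalize(item)
        if key not in seen:
            seen.add(key)
            unique.append(item)
    return unique


feed = [
    "Flooding reported near the river",
    "RT @abc: Flooding reported near the river",
    "RT @xyz: RT @abc: flooding reported near the  river",
    "Roads closed downtown",
]
unique_feed = filter_duplicates(feed)
```

Exact-match normalization like this misses edited or truncated retweets; a production filter would likely need fuzzier matching.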

Page 8: SwiftRiver 2011 Overview
Page 9: SwiftRiver 2011 Overview
Page 10: SwiftRiver 2011 Overview

Product (RED)

Page 11: SwiftRiver 2011 Overview

Queensland - ABC Australia Deployment

Page 12: SwiftRiver 2011 Overview

RIVER ID

Global Trust and Reputation Server

Web Services

Page 13: SwiftRiver 2011 Overview

WHAT IS RIVER ID?

• Opt-in product for Ushahidi deployments

• Collect information on all contributors

• Use contributions to build trust profile

• Use trust profile to help validate information in the future

• Global trust bank built on OAuth standards

Page 14: SwiftRiver 2011 Overview
Page 15: SwiftRiver 2011 Overview

REVERBERATIONS

Measuring Influence

Web Services

Page 16: SwiftRiver 2011 Overview

INFORMATION REVERBERATES

• Good information and bad information spread the same way

• Reverberations tracks influence

• Breadcrumb trails for information and content

Page 17: SwiftRiver 2011 Overview
Page 18: SwiftRiver 2011 Overview
Page 19: SwiftRiver 2011 Overview
Page 20: SwiftRiver 2011 Overview

AUTO-TAGGING

SILCC: SwiftRiver Language Computation Core

Web Services

Page 21: SwiftRiver 2011 Overview

WHAT IS SILCC?

• Swift Language Computation Component
• One of the SwiftRiver Web Services
• Open Web API
• Semantic Tagging of Short Text
• Natural Language Processing
• Multilingual
• Multiple sources (Twitter, email, SMS, blogs etc.)
• Active Learning capability

Page 22: SwiftRiver 2011 Overview

Swiftriver SiLCC Dataflow

Components:

• SiSLS - Swiftriver Source Library Service
• SiLCC - Swiftriver Language Computational Core
• SLISa - Swiftriver Language Improvement Service

Diagram notes:

• Content Items coming from the SiSLS have, where SiSLS integration is enabled, global trust values added to the object model.
• The text of the content is sent to the SiLCC.
• An API key is sent along with the text to ensure that the SiLCC is not open to any malicious usage.
• Using NLP, the SiLCC extracts nouns and other keywords from the text.
• There is still a bit of ambiguity around what the NLP should extract from the text, but at its simplest, all the nouns would be a good start.
• The SiLCC sends back a list of tags that are added to the Content Item.
• The lists of tags sent back from the SiLCC can be added to the Content Item along with any that were extracted from the source data by the parser.
• Although the NLP tags have now been applied, the SLISa is now responsible for applying instance-specific tagging corrections.
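
The dataflow above can be sketched in a few lines. This is a toy model under stated assumptions: `ContentItem`, `silcc_tag`, `apply_corrections` and `process` are hypothetical names, the "NLP" step is a placeholder that treats capitalized words as tags, and the real SiLCC call would go over the web API rather than a local function.

```python
from dataclasses import dataclass, field


@dataclass
class ContentItem:
    text: str
    tags: list = field(default_factory=list)  # tags already found by the source parser


def silcc_tag(text, api_key):
    """Stand-in for the SiLCC call: requires a key, returns candidate tags."""
    if not api_key:
        raise PermissionError("an API key is required")
    # Placeholder for NLP: treat capitalized words as tags (assumption).
    return [w.strip(".,!?") for w in text.split() if w[:1].isupper()]


def apply_corrections(tags, corrections):
    """Stand-in for SLISa: map a tag to a replacement, or to None to drop it."""
    corrected = []
    for tag in tags:
        replacement = corrections.get(tag, tag)
        if replacement is not None and replacement not in corrected:
            corrected.append(replacement)
    return corrected


def process(item, api_key, corrections):
    """Tag the item via the SiLCC stand-in, merge with parser tags, then correct."""
    nlp_tags = silcc_tag(item.text, api_key)
    merged = item.tags + [t for t in nlp_tags if t not in item.tags]
    item.tags = apply_corrections(merged, corrections)
    return item


item = ContentItem("Aid convoy reaches Jacmel after Earthquake", tags=["haiti"])
process(item, api_key="demo-key", corrections={"Aid": None, "Earthquake": "earthquake"})
```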

Page 23: SwiftRiver 2011 Overview

OUR GOALS

• Simple tagging of short snippets of text
• Rapid tagging for high volume environments
• Simple API, easy to use
• Learns from user feedback
• Routing of messages to upstream services
• Semantic classification
• Sorts rapid streams into buckets
• Clusters like messages
• Visual effects
• Cross-referencing

Page 24: SwiftRiver 2011 Overview

WHAT IT’S NOT

• Does not do deep analysis of text
• Only identifies words within original text

Page 25: SwiftRiver 2011 Overview

HOW DOES IT WORK?

• Step 1: Lexical Analysis
• Step 2: Parsing into constituent parts
• Step 3: Part of Speech tagging
• Step 4: Feature extraction
• Step 5: Compute using feature weights
• Let's examine each one in turn...

Page 26: SwiftRiver 2011 Overview

STEP 1: LEXICAL ANALYSIS

• For news headlines and email subjects this is trivial: just split on spaces.
• For Twitter this is more complex...
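
A minimal sketch of the "split on spaces" case, and of why the same approach falls short for tweets. The function name `tokenize_plain` is illustrative and not part of the SiLCC library.

```python
def tokenize_plain(text):
    """Split plain text such as a headline or email subject on whitespace."""
    return text.split()


headline_tokens = tokenize_plain("PBS addresses mental health needs in Haiti")

# The same split applied to a tweet leaves markup glued to the words:
# "@PIH:", "#Haiti", "earthquake..." and the URL all come back as raw tokens.
tweet_tokens = tokenize_plain("RT @PIH: #Haiti earthquake... http://bit.ly/bNhyK6")
```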

Page 27: SwiftRiver 2011 Overview

TWEET ANALYSIS

• Tweets are surprisingly complex
• Only 140 characters but many features
• Emergent features from community (e.g. hashtags)
• Let's take a look at a typical tweet...

Page 28: SwiftRiver 2011 Overview

TWEET ANALYSIS

The typical Tweet: “RT @directrelief: RT @PIH: PBS @NewsHour addresses mental health needs in the aftermath of the #Haiti earthquake #health #earthquake... http://bit.ly/bNhyK6”

• RT indicates a “re-tweet”
• @name indicates who the original tweeter was
• Multiple embedded retweets
• Hashtags (e.g. #Haiti) can play two roles, as a tag and as part of the sentence

Page 29: SwiftRiver 2011 Overview

TWEET ANALYSIS 2

• Two or more hashtags within a tweet (e.g. #health and #earthquake)
• Continuation dots “...” indicate that there was more text that didn’t fit into the 140-character limit somewhere in its history
• URLs: many tweets contain one or more URLs

As we can see, this simple tweet contains no fewer than 7 different features, and that’s not all!

Page 30: SwiftRiver 2011 Overview

TWEET ANALYSIS 3

We want to break up the tweet into the following parts:

{
  'text': ['PBS addresses mental health needs in the aftermath of the Haiti earthquake'],
  'hashtags': ['#Haiti', '#health', '#earthquake'],
  'names': ['@directrelief', '@PIH', '@NewsHour'],
  'urls': ['http://bit.ly/bNhyK6'],
}
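
One way to arrive at the structure above is sketched below. This is not the actual TweetParser code; `parse_tweet` and its rules are assumptions chosen to reproduce the slide's example: "RT" is dropped, @names and URLs are collected, a hashtag inside the sentence contributes its word to the text, and hashtags trailing after the last ordinary word are kept as tags only.

```python
import re


def parse_tweet(tweet):
    """Split a tweet into text, hashtags, names and urls (illustrative rules)."""
    hashtags, names, urls = [], [], []
    pieces = []  # (is_hashtag, word) candidates for the text, in order
    for token in tweet.split():
        if token == "RT":
            continue
        if token.startswith(("http://", "https://")):
            urls.append(token)
        elif token.startswith("@"):
            names.append("@" + re.sub(r"\W+", "", token))
        elif token.startswith("#"):
            word = re.sub(r"\W+", "", token)
            hashtags.append("#" + word)
            pieces.append((True, word))
        else:
            word = token.rstrip(".")  # drop continuation dots
            if word:
                pieces.append((False, word))
    # Hashtags after the last ordinary word are tags only, not part of the text.
    last_plain = max(
        (i for i, (is_tag, _) in enumerate(pieces) if not is_tag), default=-1
    )
    text = " ".join(word for _, word in pieces[: last_plain + 1])
    return {"text": [text], "hashtags": hashtags, "names": names, "urls": urls}


tweet = (
    "RT @directrelief: RT @PIH: PBS @NewsHour addresses mental health needs "
    "in the aftermath of the #Haiti earthquake #health #earthquake... "
    "http://bit.ly/bNhyK6"
)
parsed = parse_tweet(tweet)
```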

Page 31: SwiftRiver 2011 Overview

TWEET ANALYSIS 4

Why do we want to break up the tweet into parts (parsing)?

• Because we want to further process the grammatically correct English text
• Part of speech tagging would otherwise be corrupted by words it cannot recognize (e.g. URLs, hashtags, @names etc.)
• We want to save the hashtags for later use
• Many of the features are irrelevant to the task of identifying tags (e.g. dots, punctuation, @name, RT)

Page 32: SwiftRiver 2011 Overview

TWEET ANALYSIS 5

• We now take the “text” portion of the tweet and perform part of speech tagging on it
• After part of speech tagging, we perform feature extraction
• Features are now passed through the keyword classifier, which returns a list of keywords / tags
• Finally we combine these tags with the hashtags we saved earlier to give the complete tag set
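
The four steps above can be wired together as in the sketch below. The part-of-speech tagger, feature extractor and classifier are passed in as plain functions, and the toy versions used here are placeholders (assumptions), not the NLTK tagger or the trained classifier the slides refer to.

```python
def tag_tweet(parsed, pos_tagger, extract_features, classify):
    """POS-tag the text, extract features, classify, then merge in hashtags."""
    words = parsed["text"][0].split()
    tagged = pos_tagger(words)            # [(word, pos), ...]
    features = extract_features(tagged)   # one feature dict per word
    keywords = [w for (w, _), f in zip(tagged, features) if classify(f)]
    hashtags = [h.lstrip("#") for h in parsed["hashtags"]]
    combined = []
    for tag in keywords + hashtags:
        # De-duplicate case-insensitively, keeping first occurrence.
        if tag.lower() not in (t.lower() for t in combined):
            combined.append(tag)
    return combined


# Toy stand-ins so the sketch runs on its own.
def toy_pos_tagger(words):
    return [(w, "NNP" if w[:1].isupper() else "X") for w in words]


def toy_features(tagged):
    return [{"pos": pos} for _, pos in tagged]


def toy_classifier(features):
    return features["pos"] == "NNP"


parsed = {
    "text": ["PBS addresses mental health needs in the aftermath of the Haiti earthquake"],
    "hashtags": ["#Haiti", "#health", "#earthquake"],
}
tags = tag_tweet(parsed, toy_pos_tagger, toy_features, toy_classifier)
```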

Page 33: SwiftRiver 2011 Overview

HEADLINE AND EMAIL SUBJECT ANALYSIS

• This is much simpler to do
• It's a subset of the steps in Tweet Analysis
• There is no parsing since there are no hashtags, @names etc.

Page 34: SwiftRiver 2011 Overview

FEATURE EXTRACTION

• For the active learning algorithm we need to extract features to use in classification
• These features should be subject/domain independent
• We therefore never use the actual words as features
• Using the words themselves would, for example, give artificially high weights to words such as “earthquake”
• We don't want these artificial weights, as we can’t foresee future disasters and we want to be as generic with classification as possible
• The use of training sets does allow for domain customization where necessary

Page 35: SwiftRiver 2011 Overview

FEATURE EXTRACTION

• Capitalization of individual words: either first caps or all caps. This is an important indicator of proper nouns or other important words that make good tag candidates
• Position in text: tags seem to have a greater preponderance near the beginning of text
• Part of speech: nouns and proper nouns are particularly important, but so are some adjectives and adverbs
• Capitalization of entire text: sometimes the whole text is capitalized, and this should reduce the overall weighting of other features
• Length of the text: in shorter texts the words are more likely to be tags
• The parts of speech of the previous and next words (effectively this means we are using trigrams, or a window of 3)
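
A sketch of what a per-word feature extractor along these lines could look like. The feature names, the thirds used for position, and the 8-word cutoff for "short" text are assumptions for illustration; note that the word itself is never emitted as a feature, in keeping with the domain-independence goal.

```python
def extract_features(tagged):
    """tagged: list of (word, pos) pairs for one text. Returns one dict per word."""
    letters = [c for w, _ in tagged for c in w if c.isalpha()]
    text_all_caps = bool(letters) and all(c.isupper() for c in letters)
    n = len(tagged)
    features = []
    for i, (word, pos) in enumerate(tagged):
        if word.isupper() and len(word) > 1:
            caps = "all"
        elif word[:1].isupper():
            caps = "first"
        else:
            caps = "none"
        if i < n / 3:
            position = "start"
        elif i < 2 * n / 3:
            position = "middle"
        else:
            position = "end"
        features.append({
            "caps": caps,
            "position": position,
            "pos": pos,
            "text_all_caps": text_all_caps,
            "short_text": n <= 8,  # hypothetical cutoff
            "prev_pos": tagged[i - 1][1] if i > 0 else "<s>",
            "next_pos": tagged[i + 1][1] if i < n - 1 else "</s>",
        })
    return features


tagged = [("PBS", "NNP"), ("addresses", "VBZ"), ("Haiti", "NNP"), ("earthquake", "NN")]
features = extract_features(tagged)
```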

Page 36: SwiftRiver 2011 Overview

TRAINING

• Requires user-reviewed examples
• Lexical analysis, parsing and feature extraction on the examples
• Multinomial naïve Bayes algorithm
• Note: classification is performed at the word level
• For each word in the text, we classify it as either a keyword or not
• This has the pleasant side effect of providing several training examples from each user-reviewed text
• Even with fewer than 50 reviewed texts the results are comparable to the simple approach of using nouns only
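
To make the word-level classification concrete, here is a small pure-Python multinomial naïve Bayes over feature tokens with Laplace smoothing. It is a sketch, not the SiLCC training code: the class name, the "name=value" token encoding and the tiny training set are all assumptions.

```python
import math
from collections import Counter, defaultdict


class WordNaiveBayes:
    """Classifies one word at a time as keyword (True) or not (False)."""

    def __init__(self, alpha=1.0):
        self.alpha = alpha                        # Laplace smoothing constant
        self.class_counts = Counter()
        self.token_counts = defaultdict(Counter)
        self.vocab = set()

    @staticmethod
    def _tokens(features):
        # Encode each feature as a "name=value" token.
        return [f"{k}={v}" for k, v in sorted(features.items())]

    def train(self, examples):
        """examples: iterable of (feature_dict, is_keyword) pairs, one per word."""
        for features, label in examples:
            self.class_counts[label] += 1
            for tok in self._tokens(features):
                self.token_counts[label][tok] += 1
                self.vocab.add(tok)

    def log_score(self, features, label):
        total = sum(self.class_counts.values())
        score = math.log(self.class_counts[label] / total)
        denom = sum(self.token_counts[label].values()) + self.alpha * len(self.vocab)
        for tok in self._tokens(features):
            score += math.log((self.token_counts[label][tok] + self.alpha) / denom)
        return score

    def is_keyword(self, features):
        return self.log_score(features, True) > self.log_score(features, False)


examples = [
    ({"caps": "first", "pos": "NNP"}, True),
    ({"caps": "all", "pos": "NNP"}, True),
    ({"caps": "none", "pos": "NN"}, True),
    ({"caps": "none", "pos": "VBZ"}, False),
    ({"caps": "none", "pos": "IN"}, False),
    ({"caps": "none", "pos": "DT"}, False),
]
model = WordNaiveBayes()
model.train(examples)
```

The sketch assumes both classes appear in the training data; with an empty class the prior would be zero and `log_score` would fail.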

Page 37: SwiftRiver 2011 Overview

ACTIVE LEARNING

• The API also provides a method for users to send back corrected text
• The corrected text is saved and then used in the next iteration of training
• User may optionally specify a corpus for the example to go into
• Training can be performed using any combination of corpora
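
A sketch of the feedback loop described above: corrected examples are stored under a named corpus, and a training set is assembled from any combination of corpora. `FeedbackStore`, the `tell` signature and the "default" corpus name are assumptions made for illustration; the slides do not show the real API.

```python
from collections import defaultdict


class FeedbackStore:
    """Keeps user-corrected (text, tags) examples grouped by corpus."""

    def __init__(self):
        self._corpora = defaultdict(list)

    def tell(self, text, tags, corpus="default"):
        """Save a corrected example for the next training iteration."""
        self._corpora[corpus].append((text, list(tags)))

    def training_set(self, *corpora):
        """Examples from the named corpora, or from all corpora if none are named."""
        names = corpora or tuple(self._corpora)
        return [ex for name in names for ex in self._corpora.get(name, [])]


store = FeedbackStore()
store.tell("Flooding in Brisbane", ["Flooding", "Brisbane"], corpus="floods")
store.tell("Quake hits Haiti", ["Quake", "Haiti"], corpus="earthquakes")
store.tell("Roads closed", ["Roads"])
combined = store.training_set("floods", "earthquakes")
```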

Page 38: SwiftRiver 2011 Overview

DEVELOPER FRIENDLY

• Two levels of API: the web API and the internal Python API
• Either one may be used, but most users will use the web API
• Design is highly modular and maintainable
• For very rapid backend processing the native Python API can be used

Page 39: SwiftRiver 2011 Overview

PYTHON CLASSES

Most of the classes that make up the library are divided into three types:

1) Tokenizers 2) Parsers 3) Taggers

All three types have consistent APIs and are interchangeable.

Page 40: SwiftRiver 2011 Overview

PYTHON API

• A tagger calls a parser
• A parser calls a tokenizer
• Output of the tokenizer goes into the parser
• Output of the parser goes into the tagger
• Output of the tagger goes to the user!
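
The call chain can be pictured with three toy classes. These are deliberately named `Toy...` to make clear they are not the library's classes; the method names `tokenize`, `parse` and `tag` and the tagging rule are assumptions.

```python
class ToyTokenizer:
    def tokenize(self, text):
        return text.split()


class ToyParser:
    def __init__(self, tokenizer=None):
        self.tokenizer = tokenizer or ToyTokenizer()

    def parse(self, text):
        # The parser calls the tokenizer and sorts tokens into parts.
        tokens = self.tokenizer.tokenize(text)
        return {
            "text": [t for t in tokens if not t.startswith("#")],
            "hashtags": [t for t in tokens if t.startswith("#")],
        }


class ToyTagger:
    def __init__(self, parser=None):
        self.parser = parser or ToyParser()

    def tag(self, text):
        # The tagger calls the parser, tags the text part, and adds hashtags.
        parsed = self.parser.parse(text)
        tags = [w for w in parsed["text"] if w[:1].isupper()]
        return tags + [h.lstrip("#") for h in parsed["hashtags"]]


tags = ToyTagger().tag("Floods hit Brisbane #qldfloods")
```

Because each layer only depends on the method of the layer below, a different tokenizer or parser can be swapped in through the constructor.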

Page 41: SwiftRiver 2011 Overview

CLASSES

• BasicTokenizer – This is used for splitting basic (non-tweet) text into individual words
• TweetTokenizer – This is used to tokenize a tweet; it may also be used to tokenize plain text, since plain text is a subset of tweets
• TweetParser – Calls the TweetTokenizer and then parses the output (see previous example)
• TweetTagger – Calls the TweetTokenizer and then tags the output of the text part and adds the hashtags
• BasicTagger – Calls the BasicTokenizer and then tags the text; should only be used for non-tweet text; uses simple Part of Speech to identify tags
• BayesTagger – Same as BasicTagger but uses weights from the naïve Bayes training algorithm

Page 42: SwiftRiver 2011 Overview

DEPENDENCIES

• Part of speech tagging is currently performed by the Python NLTK
• The Web API uses the Pylons web framework

Page 43: SwiftRiver 2011 Overview

CURRENT STATUS

• Tag method of API is ready for use; individual deployments can choose between using the BasicTagger or the BayesTagger
• Tell method (for user feedback) will be ready by the time you read this!
• Training is possible on corpora of tagged data in .csv format (see examples in distribution)

Page 44: SwiftRiver 2011 Overview

CURRENT LIMITATIONS

• Only English text is supported at the moment
• Tags are always one of the words in the supplied text, i.e. they can never be a word not in the supplied text
• Very few training examples exist at the moment

Page 45: SwiftRiver 2011 Overview

GEO DICT

LOCATION DISAMBIGUATION

Web Services

Page 46: SwiftRiver 2011 Overview

WHAT IS GEO DICT?

• For auto-mapping data

• Reverse lookup lat/lon from place names

• Works with data from Twitter, Email, RSS, SMS

• SVM or Naive Bayes/Fisher Classification

• Database of global place names and their corresponding lat/lon
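
The lookup idea can be illustrated with a tiny in-memory gazetteer. The three entries and their rounded coordinates are illustrative stand-ins for the global place-name database, and plain substring matching is an assumption; it does none of the disambiguation or classification the slide mentions.

```python
# Hypothetical miniature gazetteer: lowercase place name -> (lat, lon), rounded.
GAZETTEER = {
    "port-au-prince": (18.59, -72.31),
    "brisbane": (-27.47, 153.03),
    "nairobi": (-1.29, 36.82),
}


def locate(text):
    """Return every known place name found in the text with its coordinates."""
    lowered = text.lower()
    return {name: coords for name, coords in GAZETTEER.items() if name in lowered}


hits = locate("Flooding in Brisbane continues")
```

Substring matching will produce false positives on real data (a place name inside a longer word, or a name shared by several places), which is presumably where the classification step comes in.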

Page 47: SwiftRiver 2011 Overview

ITEM CLASSIFICATION

• Feature Extraction: Bag of Words, String Kernels

• Higher Level Features: Topic Modeling

• Linguistic pre-processing: lemmatization, stemming

• Natural Language Processing

• Named Entity Recognition

• Multi-class classification: One-vs.-One, One-vs.-All

Page 48: SwiftRiver 2011 Overview

SWIFT RIVER

Jon Gosier, Director of Product

http://swiftly.org

@swiftriver @jongos

An Ushahidi Initiative