SwiftRiver 2011 Overview
SWIFT RIVER 2011
Jon Gosier, Director of Product
http://swiftly.org
@swiftriver @jongos
An Ushahidi Initiative
Initial development began during the 2010 Haiti earthquake, one of Ushahidi’s largest deployments to date.
The objective became to offer smart tools for curating real-time data of all types (email, Twitter, SMS, web feeds).
PLATFORM GOALS
‣ Democratize access to intelligence tools
‣ Structure unstructured data feeds
‣ Data-mine overwhelming realtime datasets
‣ Surface signal & suppress noise
‣ Identify and rate authoritative users & sources
‣ Easy-to-use tools & applications for curating data on the user’s terms
“It’s not information overload. It’s filter failure.”
- Clay Shirky
Sweeper - User Interface
The Brain as an API
‣ Breaks content (data) into pieces
‣ Analyzes pieces separately
‣ Conditionally prioritizes
‣ Learns from experience
‣ Processing is distributed
‣ Recombination of pieces
APIs
‣ Tagging API - parses text and adds taxonomy
‣ Location API - detects origin location of content
‣ Influence API - measures influence of content online
‣ Reputation API - stores information about user behavior
‣ Duplication Filter API - receives feeds and filters out duplicate content to cut down on retweets
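To illustrate what a duplication filter of this kind does, here is a minimal Python sketch: it normalizes away retweet markers and @names before hashing, so near-identical retweets collapse to a single item. The normalization rules and function names are assumptions for illustration, not the actual Ushahidi implementation.

```python
import hashlib
import re

def normalize(text):
    """Strip RT markers, @names, and extra whitespace so near-identical
    retweets hash to the same value (hypothetical normalization rules)."""
    text = re.sub(r'\bRT\b', '', text)
    text = re.sub(r'@\w+:?', '', text)
    return re.sub(r'\s+', ' ', text).strip().lower()

def filter_duplicates(items):
    """Keep only the first occurrence of each normalized text."""
    seen, unique = set(), []
    for item in items:
        key = hashlib.sha1(normalize(item).encode('utf-8')).hexdigest()
        if key not in seen:
            seen.add(key)
            unique.append(item)
    return unique
```

Hashing the normalized text keeps the memory cost per seen item constant, which matters when sweeping high-volume realtime feeds.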
Applications
‣ Sweeper - sweep, structure and sort realtime data-streams
‣ SwiftMeme - meme/keyword tracker, content discovery
‣ SwiftMail - sort email by relevance
Product (RED)
Queensland - ABC Australia Deployment
RIVER ID
Global Trust and Reputation Server
Web Services
WHAT IS RIVER ID?
• Opt-in product for Ushahidi deployments
• Collect information on all contributors
• Use contributions to build trust profile
• Use trust profile to help validate information in the future
• Global trust bank built on OAuth standards
REVERBERATIONS
Measuring Influence
Web Services
INFORMATION REVERBERATES
• Good information and bad information spread the same way
• Reverberations tracks influence
• Breadcrumb trails for information and content
AUTO-TAGGING
SiLCC: SwiftRiver Language Computation Core
Web Services
WHAT IS SILCC?
• Swift Language Computation Component
• One of the SwiftRiver Web Services
• Open Web API
• Semantic tagging of short text
• Natural Language Processing
• Multilingual
• Multiple sources (Twitter, email, SMS, blogs etc.)
• Active Learning capability
Swiftriver SiLCC Dataflow

The dataflow involves three services:
‣ SiSLS - Swiftriver Source Library Service
‣ SiLCC - Swiftriver Language Computational Core
‣ SLISa - Swiftriver Language Improvement Service

Content Items coming from the SiSLS have global trust values added to the object model where SiSLS integration is enabled.

The text of the content is sent to the SiLCC. An API key is sent along with the text to ensure that the SiLCC is not open to any malicious usage. Using NLP, the SiLCC extracts nouns and other keywords from the text and sends back a list of tags that are added to the Content Item, along with any tags that were extracted from the source data by the parser.

Although the NLP tags have now been applied, the SLISa is responsible for applying instance-specific tagging corrections.

There is still a bit of ambiguity around what the NLP should extract from the text, but at its most simple, all the nouns would be a good start.
OUR GOALS
• Simple tagging of short snippets of text
• Rapid tagging for high-volume environments
• Simple API, easy to use
• Learns from user feedback
• Routing of messages to upstream services
• Semantic classification
• Sorts rapid streams into buckets
• Clusters like messages
• Visual effects
• Cross-referencing
WHAT IT’S NOT
• Does not do deep analysis of text
• Only identifies words within the original text
HOW DOES IT WORK?
• Step 1: Lexical analysis
• Step 2: Parsing into constituent parts
• Step 3: Part-of-speech tagging
• Step 4: Feature extraction
• Step 5: Compute using feature weights

Let’s examine each one in turn...
STEP 1: LEXICAL ANALYSIS
• For news headlines and email subjects this is trivial: just split on spaces.
• For Twitter this is more complex...
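The trivial case can be sketched in one line of Python; the function name is hypothetical:

```python
def lex_headline(text):
    """Lexical analysis for headlines / email subjects:
    simply split on whitespace."""
    return text.split()
```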
TWEET ANALYSIS
• Tweets are surprisingly complex
• Only 140 characters but many features
• Emergent features from community (e.g. hashtags)
• Let’s take a look at a typical tweet...
TWEET ANALYSIS
The typical Tweet: “RT @directrelief: RT @PIH: PBS @NewsHour addresses mental health needs in the aftermath of the #Haiti earthquake #health #earthquake... http://bit.ly/bNhyK6”
• RT indicates a “re-tweet”
• @name indicates who the original tweeter was
• Multiple embedded retweets
• Hashtags (e.g. #Haiti) can play two roles, as a tag and as part of the sentence
TWEET ANALYSIS 2
• Two or more hashtags within a tweet (e.g. #health and #earthquake)
• Continuation dots “...” indicate that there was more text that didn’t fit into the 140-character limit somewhere in its history
• URLs: many tweets contain one or more URLs

As we can see, this simple tweet contains no fewer than 7 different features, and that’s not all!
TWEET ANALYSIS 3
We want to break up the tweet into the following parts:

{
  'text': ['PBS addresses mental health needs in the aftermath of the Haiti earthquake'],
  'hashtags': ['#Haiti', '#health', '#earthquake'],
  'names': ['@directrelief', '@PIH', '@NewsHour'],
  'urls': ['http://bit.ly/bNhyK6'],
}
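A rough Python sketch of this parsing step using regular expressions. The exact rules, such as dropping trailing hashtags from the text while keeping embedded ones as plain words, are inferred from the example on this slide and are not the actual SiLCC code:

```python
import re

def parse_tweet(tweet):
    """Break a tweet into text, hashtags, @names and URLs (a sketch)."""
    urls = re.findall(r'https?://\S+', tweet)
    names = re.findall(r'@\w+', tweet)
    hashtags = re.findall(r'#\w+', tweet)
    # remove URLs, RT markers, @names and continuation dots from the text
    text = re.sub(r'https?://\S+', '', tweet)
    text = re.sub(r'\bRT\b|@\w+:?|\.\.\.', '', text)
    # hashtags at the end of the text are pure tags: strip them off
    while True:
        stripped = re.sub(r'#\w+\s*$', '', text).rstrip()
        if stripped == text:
            break
        text = stripped
    # an embedded hashtag keeps its word as part of the sentence
    text = text.replace('#', '')
    text = re.sub(r'\s+', ' ', text).strip()
    return {'text': [text], 'hashtags': hashtags, 'names': names, 'urls': urls}
```

Applied to the example tweet, this reproduces the structure shown above.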
TWEET ANALYSIS 4
Why do we want to break up the tweet into parts (parsing)?
• Because we want to further process the grammatically correct English text
• Part-of-speech tagging would otherwise be corrupted by words it cannot recognize (e.g. URLs, hashtags, @names etc.)
• We want to save the hashtags for later use
• Many of the features are irrelevant to the task of identifying tags (e.g. dots, punctuation, @names, RT)
TWEET ANALYSIS 5
• We now take the “text” portion of the tweet and perform part-of-speech tagging on it
• After part-of-speech tagging, we perform feature extraction
• Features are then passed through the keyword classifier, which returns a list of keywords / tags
• Finally we combine these tags with the hashtags we saved earlier to give the complete tag set
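The steps above might be sketched as follows. The `pos_tag_stub` is a stand-in for a real part-of-speech tagger (the SiLCC uses the NLTK); treating capitalized words as proper nouns is purely illustrative, as are the function names:

```python
def pos_tag_stub(tokens):
    """Stand-in POS tagger: capitalized words are treated as proper
    nouns (NNP), everything else is left untagged."""
    return [(t, 'NNP' if t[:1].isupper() else 'UNK') for t in tokens]

def extract_tags(parsed):
    """Tag the grammatical text, then merge with the saved hashtags."""
    tokens = parsed['text'][0].split()
    keywords = [w for w, tag in pos_tag_stub(tokens) if tag == 'NNP']
    hashtag_words = [h.lstrip('#') for h in parsed['hashtags']]
    # final tag set = extracted keywords + hashtags, de-duplicated in order
    seen, tags = set(), []
    for t in keywords + hashtag_words:
        if t.lower() not in seen:
            seen.add(t.lower())
            tags.append(t)
    return tags
```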
HEADLINE AND EMAIL SUBJECT ANALYSIS
• This is much simpler to do
• It’s a subset of the steps in Tweet Analysis
• There is no parsing since there are no hashtags, @names etc.
FEATURE EXTRACTION
• For the active learning algorithm we need to extract features to use in classification
• These features should be subject/domain independent
• We therefore never use the actual words as features
• Using words would, for example, give artificially high weights to words such as “earthquake”
• We don’t want these artificial weights because we can’t foresee future disasters and we want classification to be as generic as possible
• The use of training sets does allow for domain customization where necessary
FEATURE EXTRACTION
• Capitalization of individual words: either first-letter caps or all caps; this is an important indicator of proper nouns or other important words that make good tag candidates
• Position in text: tags seem to have a greater preponderance near the beginning of the text
• Part of speech: nouns and proper nouns are particularly important, but so are some adjectives and adverbs
• Capitalization of the entire text: sometimes the whole text is capitalized, and this should reduce the overall weighting of other features
• Length of the text: in shorter texts the words are more likely to be tags
• The parts of speech of the previous and next words (effectively this means we are using trigrams, or a window of 3)
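The feature list above could be computed per word roughly like this. The feature names and thresholds (e.g. "near the start" meaning the first third of the text) are illustrative choices, not the actual implementation:

```python
def word_features(words, i, pos_tags):
    """Domain-independent features for the word at index i, following
    the list on this slide (a sketch; names/thresholds are illustrative)."""
    w = words[i]
    return {
        'first_cap': w[:1].isupper(),          # first-letter capitalization
        'all_caps': w.isupper(),               # whole word capitalized
        'near_start': i < len(words) // 3,     # position in text
        'pos': pos_tags[i],                    # part of speech
        'whole_text_caps': all(x.isupper() for x in words),
        'short_text': len(words) <= 8,         # length of the text
        'prev_pos': pos_tags[i - 1] if i > 0 else 'START',
        'next_pos': pos_tags[i + 1] if i < len(words) - 1 else 'END',
    }
```

Note that the word itself never appears among the features, which keeps the classifier domain-independent as described above.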
TRAINING
• Requires user-reviewed examples
• Lexical analysis, parsing and feature extraction on the examples
• Multinomial naïve Bayes algorithm
• NB: the granularity we are classifying at is the word level
• For each word in the text, we classify it as either a keyword or not
• This has the pleasant side effect of providing several training examples from each user-reviewed text
• Even with fewer than 50 reviewed texts the results are comparable to the simple approach of using nouns only
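A toy version of the word-level naïve Bayes idea, where each word is one training example labelled keyword / not-keyword. This is a from-scratch sketch with add-one smoothing, not the SiLCC training code:

```python
import math
from collections import defaultdict

class WordNB:
    """Tiny naïve Bayes classifier over per-word feature dicts."""
    def __init__(self):
        self.counts = {True: defaultdict(int), False: defaultdict(int)}
        self.totals = {True: 0, False: 0}

    def train(self, features, is_keyword):
        """Count one labelled word (feature dict) per call."""
        self.totals[is_keyword] += 1
        for name, value in features.items():
            self.counts[is_keyword][(name, value)] += 1

    def predict(self, features):
        """Return True if the word is more likely a keyword."""
        scores = {}
        for label in (True, False):
            # log prior + log likelihoods with add-one smoothing
            score = math.log(self.totals[label] + 1)
            for name, value in features.items():
                c = self.counts[label][(name, value)]
                score += math.log((c + 1) / (self.totals[label] + 2))
            scores[label] = score
        return scores[True] > scores[False]
```

Because every word of a reviewed text is an example, even a handful of reviewed texts yields dozens of training instances, which matches the observation above.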
ACTIVE LEARNING
• The API also provides a method for users to send back corrected text
• The corrected text is saved and then used in the next iteration of training
• Users may optionally specify a corpus for the example to go into
• Training can be performed using any combination of corpora
DEVELOPER FRIENDLY
• Two levels of API: the web API and the internal Python API
• Either one may be used, but most users will use the web API
• Design is highly modular and maintainable
• For very rapid backend processing, the native Python API can be used
PYTHON CLASSES
Most of the classes that make up the library are divided into three types:
1) Tokenizers 2) Parsers 3) Taggers
All three types have consistent APIs and are interchangeable.
PYTHON API
• A tagger calls a parser
• A parser calls a tokenizer
• Output of the tokenizer goes into the parser
• Output of the parser goes into the tagger
• Output of the tagger goes to the user!
CLASSES
• BasicTokenizer - splits basic (non-tweet) text into individual words
• TweetTokenizer - tokenizes a tweet; it may also be used to tokenize plain text, since plain text is a subset of tweets
• TweetParser - calls the TweetTokenizer and then parses the output (see previous example)
• TweetTagger - calls the TweetParser, then tags the output of the text part and adds the hashtags
• BasicTagger - calls the BasicTokenizer and then tags the text; should only be used for non-tweet text; uses simple part-of-speech information to identify tags
• BayesTagger - same as BasicTagger but uses weights from the naïve Bayes training algorithm
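A skeleton of the tokenizer → tagger chain with hypothetical method names. The capitalization heuristic stands in for the part-of-speech logic the real BasicTagger uses:

```python
class BasicTokenizer:
    """Splits plain (non-tweet) text into individual words."""
    def tokenize(self, text):
        return text.split()

class BasicTagger:
    """Calls a tokenizer, then tags capitalized words as keywords.
    A sketch of the chained design; the interchangeable-tokenizer
    constructor argument illustrates why consistent APIs matter."""
    def __init__(self, tokenizer=None):
        self.tokenizer = tokenizer or BasicTokenizer()

    def tag(self, text):
        tokens = self.tokenizer.tokenize(text)
        return [t for t in tokens if t[:1].isupper()]
```

Because every tokenizer exposes the same `tokenize` interface, a TweetTokenizer could be swapped in without changing the tagger.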
DEPENDENCIES
• Part-of-speech tagging is currently performed by the Python NLTK
• The Web API uses the Pylons web framework
CURRENT STATUS
• The Tag method of the API is ready for use; individual deployments can choose between using the BasicTagger or the BayesTagger
• The Tell method (for user feedback) will be ready by the time you read this!
• Training is possible on corpora of tagged data in .csv format (see examples in the distribution)
CURRENT LIMITATIONS
• Only English text is supported at the moment
• Tags are always one of the words in the supplied text, i.e. they can never be a word not in the supplied text
• Very few training examples exist at the moment
GEO DICT
LOCATION DISAMBIGUATION
Web Services
WHAT IS GEO DICT?
• For auto-mapping data
• Reverse lookup lat/lon from place names
• Works with data from Twitter, Email, RSS, SMS
• SVM or Naive Bayes/Fisher Classification
• Database of global place names with corresponding lat/lon
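A toy illustration of the reverse lookup: scan text for known place names and return their coordinates. The gazetteer entries and function name here are illustrative, not the actual Geo Dict database:

```python
# Illustrative gazetteer: place name -> (lat, lon). The real Geo Dict
# uses a database of global place names, not a hard-coded dict.
GAZETTEER = {
    'port-au-prince': (18.54, -72.34),
    'brisbane': (-27.47, 153.03),
}

def geolocate(text):
    """Return (place, lat, lon) for the first known place name
    found in the text, or None if no place name matches."""
    lowered = text.lower()
    for place, (lat, lon) in GAZETTEER.items():
        if place in lowered:
            return place, lat, lon
    return None
```

In practice the classifiers mentioned above (SVM or naïve Bayes/Fisher) would disambiguate between places that share a name; this sketch only shows the lookup itself.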
ITEM CLASSIFICATION
• Feature Extraction: Bag of Words, String Kernels
• Higher Level Features: Topic Modeling
• Linguistic pre-processing: lemmatization, stemming
• Natural Language Processing
• Named Entity Recognition
• Multi-class classification: One-vs.-One, One-vs.-All
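To illustrate two of the ideas above, here is a minimal sketch of bag-of-words feature extraction plus a One-vs.-All decision rule. The scorer interface is hypothetical:

```python
from collections import Counter

def bag_of_words(tokens):
    """Bag-of-words features: word counts, with word order discarded."""
    return Counter(tokens)

def one_vs_all_predict(scorers, features):
    """One-vs.-All multi-class decision: each class has its own binary
    scorer; the class whose scorer is most confident wins."""
    return max(scorers, key=lambda label: scorers[label](features))
```

With One-vs.-One, by contrast, a binary classifier would be trained for every pair of classes and the winner decided by vote.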
SWIFT RIVER
Jon Gosier, Director of Product
http://swiftly.org
@swiftriver @jongos
An Ushahidi Initiative