Page 1: Social Media, Cloud Computing, Machine Learning, Open Source, and Big Data Analytics (Chicago Summit)

Joe Olson
Data Architect
Smart Chicago Collaborative
27 Mar 2014
[email protected]

(All the cool buzzwords in one place!)

Social Media, Cloud Computing, Machine Learning, Open Source, and Big Data Analytics

Page 2

Social Media - Twitter

• What can we learn from Twitter?

• 400 million tweets per day (source: http://articles.washingtonpost.com/2013-03-21/business/37889387_1_tweets-jack-dorsey-twitter)

• 218 million users (source: http://techcrunch.com/2013/10/03/bweeting/)

• Excellent source of sentiment

• Excellent source of big data

• Prototyping

• Modeling natural language

• Resume padding

Page 3

Social Media - Twitter

• How do we get at the data?

• Twitter provided APIs:

• https://dev.twitter.com/docs

• Streaming

• Set up a real time data stream (json) based on keywords

• REST (v1.1)

• Make REST requests, and get results

• Possible parameters:

• Geospatial bounding box

• By time

• By user, hashtag, retweets etc

• Fire hose

• Big $$$. Big data
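The keyword and bounding-box parameters listed above can be illustrated with a small client-side filter. This is only a sketch of the selection logic applied to tweets that have already been received; it does not call any Twitter API. The `text` and `geo` field names follow the JSON tweet example in this deck, while the `coordinates` key with `[lat, lon]` ordering and the sample box are assumptions made for illustration.

```python
from typing import Optional, Sequence, Tuple

# (west_lon, south_lat, east_lon, north_lat); an illustrative box, not an official boundary.
BoundingBox = Tuple[float, float, float, float]


def in_bounding_box(lon: float, lat: float, box: BoundingBox) -> bool:
    """Return True if the point lies inside the box (edges included)."""
    west, south, east, north = box
    return west <= lon <= east and south <= lat <= north


def matches(tweet: dict, keywords: Sequence[str],
            box: Optional[BoundingBox] = None) -> bool:
    """Return True if the tweet text contains any keyword and, when a box is
    given, the tweet carries coordinates that fall inside it."""
    text = tweet.get("text", "").lower()
    if not any(k.lower() in text for k in keywords):
        return False
    if box is None:
        return True
    geo = tweet.get("geo")
    if not geo:
        # Most tweets carry no coordinates, so a geographic filter drops them.
        return False
    lat, lon = geo["coordinates"]  # assumed [lat, lon] order
    return in_bounding_box(lon, lat, box)
```

A geographic filter is strict by design here: a tweet with `"geo": null` never matches a box, which is why so few tweets survive location-based selection.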

Page 4

Social Media - Twitter

• Information & Obstacles

• Who

• What

• At best: Plain English (!)

• Worse: (Spanish or Arabic or Portuguese...)

• Worst: “Textspeak” symbols :-0, UTF8 chars, etc.

• Absolute Worst: combination of all of them

• Where

• 1-2% with latitude / longitude

• Geocode

• When

Page 5

Social Media - Twitter

JSON Tweet example:

    "created_at":"Sun Oct 27 13:57:40 +0000 2013",
    "id":394462908261740540,
    "text":"Flu :(",
    "source":"<a href=\"http://twitmania.com\" rel=\"nofollow\">TwitMania™</a>",
    "user":{
        "id":594141140,
        "name":"Yultiana Farida N",
        "screen_name":"yultiana",
        "followers_count":231,
        "friends_count":252,
        "created_at":"Tue May 29 23:58:25 +0000 2012",
        "statuses_count":2397,
    },
    "geo":null,
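As a rough illustration of pulling the who / what / where / when out of a tweet like the one above, the sketch below parses a trimmed copy of the example with Python's standard library. The `RAW` string is a cut-down, well-formed version of the slide's excerpt, and the `summarize` helper and timestamp format string are assumptions made for this example.

```python
import json
from datetime import datetime

# Trimmed, well-formed copy of the example tweet (fields taken from the slide).
RAW = '''{
  "created_at": "Sun Oct 27 13:57:40 +0000 2013",
  "id": 394462908261740540,
  "text": "Flu :(",
  "user": {"id": 594141140, "screen_name": "yultiana", "followers_count": 231},
  "geo": null
}'''

# Format assumed from the created_at value shown in the example.
TWITTER_TIME_FORMAT = "%a %b %d %H:%M:%S %z %Y"


def summarize(raw: str) -> dict:
    """Extract the who / what / where / when fields from one tweet."""
    tweet = json.loads(raw)
    return {
        "who": tweet["user"]["screen_name"],
        "what": tweet["text"],
        "where": tweet["geo"],  # None when the tweet carries no coordinates
        "when": datetime.strptime(tweet["created_at"], TWITTER_TIME_FORMAT),
    }


summary = summarize(RAW)
```

Here `summary["where"]` is `None`, the common case for tweets, which is what makes geocoding from other clues necessary.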

Page 6

Cloud Computing

• What does cloud computing bring to the table?

• Amazon’s EC2:

• Commoditized hardware

• Low cost

• Only charged for resources you use

• No long term commitments

• Scalable

• "Throwaway" mentality

**IF** you play by their rules!

Page 7

Cloud Computing – AWS

• Tools

• Virtual Machines

• # of processors, RAM, OS, disk capacity and I/O – all configurable

• Price range: $.02/hr - $4.60/hr

• Licensed OSes cost 50% more than Linux OSes

• Archive Storage

• S3 / Glacier

• Work Queues

• SQS

• Data Stores

• DynamoDB (key-value store), Redshift (analysis store)

• Virtual Networking

• Routers, VPN gateways, access control lists, etc.

• APIs

• Command line

• HTTPS REST

• Native programming languages (Python, bash, PHP, Java, etc.)

Ideal for rapid prototyping / proofs of concept

Page 8

Cloud Computing – AWS

• APIs

• Basic

• Start an instance (and start billing)

• Stop an instance (stop billing)

• Insert item into queue

• Remove item from queue

• Write to backup store

• Ultra advanced

• Reserved vs. on demand vs. spot instances

• Price can drop as much as 80% due to market demand

• Instance can disappear at any time

Page 9

Big Data Analytics

• Can we skirt the “big data” problem by distilling millions and millions of “noise” tweets down into a more desirable data set?

• Enrich in real time, rather than on archived data, and avoid the overhead of map/reduce?

• Possible enrichment of raw data:

• Classification – separate tweets into “relevant” and “irrelevant”

• Geocoding – improve on the 1-2%?

• Aggregation -> map reduce

• Mapping -> Reduce Function -> Output

• AWS – Elastic Map Reduce

• Clustering
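The mapping -> reduce function -> output flow can be sketched in a few lines of plain Python. This is an in-memory toy, not Elastic MapReduce: the `KEYWORDS` tuple and the function names are hypothetical, and a real job would distribute the map and reduce steps across machines.

```python
from collections import Counter
from functools import reduce
from typing import Dict, Iterable, List, Tuple

KEYWORDS = ("flu", "cold")  # illustrative keys only


def map_tweet(tweet: dict) -> List[Tuple[str, int]]:
    """Map step: emit a (keyword, 1) pair for every keyword found in the text."""
    text = tweet["text"].lower()
    return [(k, 1) for k in KEYWORDS if k in text]


def reduce_counts(acc: Counter, pairs: List[Tuple[str, int]]) -> Counter:
    """Reduce step: fold the emitted pairs into running totals per key."""
    for key, n in pairs:
        acc[key] += n
    return acc


def count_keywords(tweets: Iterable[dict]) -> Dict[str, int]:
    """Output: total number of tweets mentioning each keyword."""
    return dict(reduce(reduce_counts, map(map_tweet, tweets), Counter()))
```

Because the aggregate is a running total, the same reduce function could also be applied to tweets as they arrive, which is the real-time alternative to batch processing raised above.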

Page 10

Machine Learning

• Classification: relevant, or irrelevant?

• Human trained model

• Once model is established, bounce new data off it for classification

• Validation of model

• Accuracy = (Total # of classifications – Mismatches between machine and human) / Total # of classifications

• Crowdsourcing – AWS Mechanical Turk

• Improve model by feeding disagreements back into the model

• Our best text classification model to date: accuracy in the low 90% range
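The accuracy measure on this slide, and the idea of feeding disagreements back into the model, can be sketched as follows. The function names and label strings are hypothetical; the source does not show how the validation was actually implemented.

```python
from typing import List, Sequence, Tuple


def accuracy(human: Sequence[str], machine: Sequence[str]) -> float:
    """(total classifications - machine/human mismatches) / total classifications."""
    if len(human) != len(machine):
        raise ValueError("label lists must have the same length")
    if not human:
        raise ValueError("at least one classification is required")
    mismatches = sum(h != m for h, m in zip(human, machine))
    return (len(human) - mismatches) / len(human)


def disagreements(texts: Sequence[str], human: Sequence[str],
                  machine: Sequence[str]) -> List[Tuple[str, str]]:
    """Return (text, human_label) pairs where the machine disagreed with the
    human; these are candidates for the next round of training data."""
    return [(t, h) for t, h, m in zip(texts, human, machine) if h != m]
```

For example, four tweets with one mismatch give an accuracy of 0.75, and the single mismatched tweet is returned with its human label for retraining.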

Page 11

Open Source

• Friendly to the commoditized computing paradigm

• Don’t have to worry about licensing issues

• Contributes to the “throwaway” discipline

• Don’t have to re-invent the wheel (collaboration)

• Solutions applicable to all parts of the architecture

• Acquire data: Node.js – non blocking

• Analyze data: R – statistical engine

• Store and query data: MongoDB (document store) or Riak (key-value database)

Page 12

Architecture

• We know Twitter is providing a mountain of data from all parts of the world

• We know Amazon is providing a framework of low cost, on-demand, no commitment computing

• Open source is providing a rich tool set

• Goals:

• Architect with cost in mind!

• Enrichment - Real time and after-the-fact enrichment (open data)

• Scalable

• Decoupled

• Service based

• Rapid development

• Prove the concepts

Page 13

Architecture - Acquire

• Acquire the data from Twitter

• If classifying in real time:

• Store then classify?

• Classify then store?

• Tools

• Twitter streaming API

• Keywords

• Node.js

• Several different packages to interface with Twitter APIs

• Amazon

• EC2

• SQS (?) Extremely useful, but drives the cost up

Page 14

Architecture - Analyze

• Classification interface

• Service based – HTTP REST

• Push or pull?

• Push – classifiers listen on port 80

• Pull – classifier starts pulling from an established work queue

• Both highly scalable and flexible with respect to cost.

• Stateless

• R

• Human trained machine learning packages available

• Cloud friendly – no licenses

• Automatable – from install, configuration, execution
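The pull model can be illustrated with a stateless worker that drains a work queue. This sketch uses Python's in-process `queue.Queue` as a stand-in for a hosted queue such as SQS, and `toy_classifier` is a placeholder keyword rule, not the trained R model described in this deck.

```python
import queue
from typing import Callable, List, Tuple


def toy_classifier(text: str) -> str:
    """Placeholder for a trained model; this keyword rule is an assumption
    made for illustration only."""
    return "relevant" if "food poisoning" in text.lower() else "irrelevant"


def pull_worker(work: "queue.Queue[str]",
                classify: Callable[[str], str]) -> List[Tuple[str, str]]:
    """Pull items until the queue is empty and classify each one.

    The worker keeps no state between items, so any number of identical
    workers can drain the same queue in parallel."""
    results = []
    while True:
        try:
            text = work.get_nowait()
        except queue.Empty:
            return results
        results.append((text, classify(text)))
        work.task_done()
```

Since each worker is stateless, scaling up or down is a matter of starting or stopping workers, which fits the cost-driven, throwaway use of instances described earlier.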

Page 15

Architecture - Store

• Store JSON as an object (document store) or normalize (relational database)?

• Relational databases

• disk I/O intensive – not cloud friendly

• allow complex indexing

• Easy to get a business intelligence front end on them

• Requires a schema / ETL

• Key-value document stores

• Designed to be scalable – doesn’t need fast disks

• Indexing is not nearly as flexible as RDBMS

• More difficult to front a UI – no “drag and drop” tools

• No schema / ETL needed.

• Not as mature

• MongoDB / Riak
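To make the trade-off concrete, the sketch below stores the same tweet both ways: as one JSON document, and as normalized rows in a relational schema. SQLite stands in for a relational database purely for illustration, and the table layout and `normalize` helper are assumptions, not the schema used in the project.

```python
import json
import sqlite3

tweet = {
    "created_at": "Sun Oct 27 13:57:40 +0000 2013",
    "id": 394462908261740540,
    "text": "Flu :(",
    "user": {"id": 594141140, "screen_name": "yultiana", "followers_count": 231},
    "geo": None,
}

# Document-store style: keep the whole object as-is, no schema or ETL step.
document = json.dumps(tweet)


def normalize(t: dict):
    """Relational style: split one tweet into a user row and a tweet row (the ETL step)."""
    user = t["user"]
    user_row = (user["id"], user["screen_name"], user["followers_count"])
    tweet_row = (t["id"], user["id"], t["created_at"], t["text"])
    return user_row, tweet_row


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, screen_name TEXT, "
             "followers_count INTEGER)")
conn.execute("CREATE TABLE tweets (id INTEGER PRIMARY KEY, "
             "user_id INTEGER REFERENCES users(id), created_at TEXT, text TEXT)")
user_row, tweet_row = normalize(tweet)
conn.execute("INSERT INTO users VALUES (?, ?, ?)", user_row)
conn.execute("INSERT INTO tweets VALUES (?, ?, ?, ?)", tweet_row)
```

The document form needs no upfront design but is harder to query flexibly; the normalized form requires the schema and the `normalize` step, and in return supports joins and indexes.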

Page 16

Architecture – Presentation

• Least need for cloud friendly scalability here?

• Options

• Licensed BI software – Tableau, Endeca, Jaspersoft, Pentaho

• Open source BI software – SpagoBI

• Roll your own - PHP, Ruby, Visual Basic, Javascript, etc

• Connect to an existing system instead?

Page 17

Costs – Real Time Classification

• Number of tweets collected per day: 1,000,000 (comfortable - .25% of daily tweet volume)

• Machine used on EC2 to acquire (node.js): micro

• $.02/hr * 24 hrs = $.48/day

• Machine used on EC2 to classify (R): small (x2)

• $.06/hr * 24 hrs = $1.44/day * 2 = $2.88/day

• Machine used on EC2 to store (MongoDB): large

• $.24/hr * 24 hrs = $5.76/day

• Machine used on EC2 for GUI (Apache): small

• $.06/hr * 24 hrs = $1.44/day

$0.48 + $2.88 + $5.76 + $1.44 = $10.56/day; $10.56 / 1,000,000 tweets = $.00001056/tweet (about .001 cents/tweet)

Can add more zeros if you relax real-time classification (spot instances)
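The daily cost arithmetic on this slide can be reproduced with a short script. The hourly prices and instance counts are the figures quoted on the slide; the `FLEET` structure and variable names are made up for this example. Note that dividing a dollar total by the tweet count gives dollars per tweet, so the figure in cents is 100 times larger.

```python
HOURS_PER_DAY = 24
TWEETS_PER_DAY = 1_000_000

# (role, hourly price in USD, instance count), as quoted on the slide.
FLEET = [
    ("acquire (node.js, micro)", 0.02, 1),
    ("classify (R, small)", 0.06, 2),
    ("store (MongoDB, large)", 0.24, 1),
    ("GUI (Apache, small)", 0.06, 1),
]


def daily_cost(fleet) -> float:
    """Sum hourly price * instance count * 24 hours over every role."""
    return sum(price * count * HOURS_PER_DAY for _, price, count in fleet)


total_per_day = daily_cost(FLEET)                   # about 10.56 USD
dollars_per_tweet = total_per_day / TWEETS_PER_DAY  # about 0.00001056 USD
cents_per_tweet = dollars_per_tweet * 100           # about 0.001 cents
```

Changing a price or instance count in `FLEET` shows how sensitive the per-tweet figure is to the storage machine, which accounts for more than half of the daily total.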

Page 18

Costs - Archive

• Size of average tweet: 2.5 KB

• Cost to archive:

• S3: $.095 per GB/month

• $0.0000002 per tweet per month

• Glacier: $.01 per GB/month

• $0.00000002 per tweet per month

• Compression will add even more zeros, but will require more computing power and add latency to post-collection data analysis. Can be automated.
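The per-tweet archive figures follow from the average tweet size and the storage price. The sketch below assumes the quoted prices are US dollars per GB per month and that 1 GB is 1024 * 1024 KB; both are assumptions, since the slide does not state the units in full.

```python
TWEET_KB = 2.5             # average tweet size from the slide
KB_PER_GB = 1024 * 1024    # assumes binary gigabytes


def cost_per_tweet_month(price_per_gb_month: float) -> float:
    """Monthly storage cost in USD for one average-sized tweet."""
    return price_per_gb_month * TWEET_KB / KB_PER_GB


s3_cost = cost_per_tweet_month(0.095)      # roughly 0.0000002 USD
glacier_cost = cost_per_tweet_month(0.01)  # roughly 0.00000002 USD
```

Under these assumptions one GB holds a little over 400,000 tweets, which is where the extra zeros come from.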

Page 19

Use Cases

• Foodborne Chicago (http://foodborne.smartchicagoapps.org/)

• Public-private partnership with City of Chicago Dept. of Public Health and Smart Chicago Collaborative

• Reach out to city residents on Twitter tweeting about food poisoning symptoms, in an attempt to get them to log information in the City’s 311 database (via the Open311 API)

• Once in the 311 database, it follows established City workflows, and becomes actionable

• Numbers (1 year):

• 2,390 tweets classified as related to food poisoning

• 282 tweets responded to

• 205 reports submitted

• 145 inspections

• Real time classification examples:

• “Ugh! I got food poisoning from the McDonalds’s on Halstead!”

http://184.73.52.31/cgi-bin/R/fp_classifier?text=Ugh!%20I%20got%20food%20poisoning%20from%20McDonalds%20on%20Halstead

• “U of Chicago releases a new paper on the effects of food poisoning”

http://184.73.52.31/cgi-bin/R/fp_classifier?text=U%20of%20Chicago%20releases%20new%20paper%20on%20the%20effects%20of%20food%20poisoning

• Video: http://www.youtube.com/watch?v=RNf9XQ_25Yw&feature=youtu.be
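The classifier request URLs in the examples above pass the tweet text as a percent-encoded `text` query parameter. A minimal sketch of building such a URL is shown below; the endpoint is copied from the slide and may no longer be reachable, so the code only constructs the string and does not send a request. The `classifier_url` helper is hypothetical.

```python
from urllib.parse import quote

# Endpoint as shown on the slide; treated here as an opaque string.
CLASSIFIER_ENDPOINT = "http://184.73.52.31/cgi-bin/R/fp_classifier"


def classifier_url(text: str) -> str:
    """Build a GET URL with the tweet text percent-encoded in the query.

    '!' is left unescaped to match the form of the URLs on the slide."""
    return f"{CLASSIFIER_ENDPOINT}?text={quote(text, safe='!')}"


url = classifier_url("Ugh! I got food poisoning from McDonalds on Halstead")
```

Because the whole request fits in a URL, the classifier stays stateless and can be called from any HTTP client.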

Page 20

Use Cases

• Disease Tracker

• Large scale attempt to track disease occurrences in the United States.

• Sponsored by the Dept. of HHS

• Approximately 1 million tweets a day (cold, flu) classified in real time

• EC2 scalable instances

• Geolocation

• Cost to run for 6 months: $850

Page 21

Future Directions

• Turnkey service

• Can all this functionality be abstracted down to a pushbutton service?

• Open data

• Can you advertise the data collected, how you enriched it, and allow others to come along and enrich it as well?

• General purpose bridge between Twitter and issue tracking databases

• Big industry problem

Page 22

Github Sources

• Tweet Collector

• https://github.com/smartchicago/TweetCollector

• Classifier Code

• https://github.com/corynissen/foodborne_classifier