
Processing Twitter Data with MongoDB

Xiaoxiao Liu

Issue with Facebook Data

● Originally, I planned to do this project with Facebook data.

- Facebook Graph API

- Third-party Java library: restFB

● I was interested in doing social network analysis, so I needed to get users' information, their friends' information, and the relationships between these users.

However.....

Limitation of Graph API:

As stated by Facebook: “This will only return any friends who have used (via Facebook Login) the app making the request.”

(In this case, the app is the Graph API itself.)

Only one friend showed up :(

Only I myself showed up!

[Diagram: User → Friend 1, Friend 2, …, Friend n → each friend's friends; requesting friends of friends raised an authorization exception.]

What else can I do?

● Twitter!

- Midterm election

- Tweets related to voting

Data Source: Twitter

● Twitter REST APIs

- The REST APIs provide programmatic access to read and write Twitter data: author a new Tweet, read author profile and follower data, and more. The REST API identifies Twitter applications and users using OAuth; responses are available in JSON.

– Rate limits:

- Search is rate-limited at 180 queries per 15-minute window for the time being, but Twitter may adjust that over time.

The Search API

● The Twitter Search API is part of Twitter’s v1.1 REST API. It allows queries against the indices of recent or popular Tweets and behaves similarly to, but not exactly like, the Search feature available in Twitter mobile or web clients.

● Geolocation:

The search operator “near” isn’t available in the API, but there is a more precise way to restrict a query to a given location: the geocode parameter, specified with the template “latitude,longitude,radius”.
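For instance, with Twitter4J (the library used later in this deck) the geocode restriction can be attached to a query as below; the coordinates and radius are only an illustrative example, not values from the original slides.

import twitter4j.GeoLocation;
import twitter4j.Query;

// Restrict a keyword search to a 50-mile radius around an example point.
// The (latitude, longitude, radius) values here are illustrative only.
Query query = new Query("vote");
query.setGeoCode(new GeoLocation(41.88, -87.63), 50, Query.MILES);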

Twitter4J

I used a third-party Java library called Twitter4J. This library makes it easy to integrate a Java application with the Twitter service.

To use this library, simply download it and add the .jar file to the classpath.

The Java application takes:

● QueryString: the search keyword

● QueryDate: search for tweets sent on a certain day

It reports back how many tweets were gathered. A minimal sketch of this follows below.
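My original gathering code isn't reproduced in this transcript; the following is only a minimal Twitter4J sketch of the same idea (a keyword plus a date window, then a count of tweets gathered). It assumes OAuth credentials are supplied via a twitter4j.properties file on the classpath.

import twitter4j.Query;
import twitter4j.QueryResult;
import twitter4j.Status;
import twitter4j.Twitter;
import twitter4j.TwitterException;
import twitter4j.TwitterFactory;

public class TweetGatherer {
    public static void main(String[] args) throws TwitterException {
        // Credentials are read from twitter4j.properties on the classpath.
        Twitter twitter = TwitterFactory.getSingleton();

        Query query = new Query("governor");   // QueryString: the search keyword
        query.setSince("2014-11-03");          // QueryDate: tweets sent on a certain day
        query.setUntil("2014-11-04");
        query.setCount(100);                   // max results per page

        int gathered = 0;
        QueryResult result;
        do {
            result = twitter.search(query);
            for (Status status : result.getTweets()) {
                System.out.println(status.getUser().getScreenName() + "\t" + status.getText());
                gathered++;
            }
            query = result.nextQuery();        // page through the remaining results
        } while (query != null);

        System.out.println("Gathered " + gathered + " tweets");  // report back the count
    }
}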

Search Keywords

● 11/1/2014 – 11/4/2014 (Election Day)

- quinn (Democratic candidate's last name)

- rauner (Republican candidate's last name)

- democrat

- republican

- governor

● 11/3/2014 – 11/4/2014 (Election Day)

- election

I stored the data in a txt file with a weird format.

Why MongoDB?

● My needs:

- My input data is basically tweets.

- I need to run word count.

- I need to query the tweets with different keywords.

- I do not want to split one tweet across several columns.

● MongoDB is great for modeling many kinds of entities:

Form data: MongoDB makes it easy to evolve the structure of form data over time

Blogs / user-generated content: can keep data with complex relationships together in one object

Messaging: vary message meta-data easily per message or message type without needing to maintain separate collections or schemas

System configuration: just a nice object graph of configuration values, which is very natural in MongoDB

Log data of any kind: structured log data is the future

Graphs: just objects and pointers – a perfect fit

Location based data: MongoDB understands geo-spatial coordinates and natively supports geo-spatial indexing

MongoDB

● Document-Oriented Storage

JSON-style documents with dynamic schemas offer simplicity and power.

● Full Index Support: index on any attribute, just like you're used to.

● Querying: rich, document-based queries.

● Map/Reduce

Flexible aggregation and data processing.

- I wanted to re-run my Java code to gather the tweets again and store them in JSON format this time.

- Unfortunately, it did not work out: "You cannot use the Search API to find Tweets older than about a week."

- I wrote another Java application to convert that txt file to a JSON file (a sketch under an assumed format follows below):

{"user_name": "xyz", "tweet": "whatever tweet text"}
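The deck doesn't show the weird txt format, so this converter is only a sketch under an assumed layout: one tweet per line, with the user name and tweet text separated by a tab. The real conversion would depend on the actual file. mongoimport, shown next, accepts exactly this one-JSON-document-per-line output by default.

import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;

public class TxtToJson {
    public static void main(String[] args) throws IOException {
        try (BufferedReader in = Files.newBufferedReader(Paths.get("tweets.txt"));
             BufferedWriter out = Files.newBufferedWriter(Paths.get("tweets.json"))) {
            String line;
            while ((line = in.readLine()) != null) {
                // Assumed (not the actual) format: user_name<TAB>tweet text
                String[] parts = line.split("\t", 2);
                if (parts.length < 2) continue;               // skip malformed lines
                out.write(String.format("{\"user_name\": \"%s\", \"tweet\": \"%s\"}",
                        escape(parts[0]), escape(parts[1])));
                out.newLine();                                // one JSON document per line for mongoimport
            }
        }
    }

    // Minimal escaping so quotes and backslashes in tweets don't break the JSON
    private static String escape(String s) {
        return s.replace("\\", "\\\\").replace("\"", "\\\"");
    }
}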

Import data to MongoDB:

mongoimport --db mydb --collection tweets --file tweets.json

{"user_name": "xyz", "tweet": "whatever tweet text"}

● Run mongo shell

● Structure/Schema

● Run map-reduce to count words (a sketch follows below)
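The slides ran the word count from the mongo shell; as a rough equivalent in Java (keeping to the language used elsewhere in this deck), the MongoDB Java driver can submit the same JavaScript map and reduce functions. The word_counts output collection name is my own choice, not from the slides.

import com.mongodb.client.MongoClient;
import com.mongodb.client.MongoClients;
import com.mongodb.client.MongoCollection;
import org.bson.Document;

public class WordCount {
    public static void main(String[] args) {
        try (MongoClient client = MongoClients.create("mongodb://localhost:27017")) {
            MongoCollection<Document> tweets =
                    client.getDatabase("mydb").getCollection("tweets");

            // Emit (word, 1) for every whitespace-separated token in the tweet text.
            String map = "function() {"
                       + "  this.tweet.split(/\\s+/).forEach(function(w) {"
                       + "    if (w.length > 0) emit(w.toLowerCase(), 1);"
                       + "  });"
                       + "}";
            // Sum the 1s emitted for each word.
            String reduce = "function(key, values) { return Array.sum(values); }";

            // Iterating runs the job; results are also written to the word_counts collection.
            for (Document d : tweets.mapReduce(map, reduce).collectionName("word_counts")) {
                System.out.println(d.toJson());   // e.g. { "_id": "vote", "value": 42.0 }
            }
        }
    }
}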

Relevant keywords: voting, vote, wage, citizens, #democrats, #politics, #rockthevote

Possibly relevant keywords: shit, stupid, protect, fuck

Interesting Finds

● Robert Quinn kisses the bicep after that quarterback sack.

(keywords: bicep, quarterback)

Interesting Finds

● @EliseStefanik REPUBLICAN WOMEN Set to Make History Tonight http://t.co/eQOWGBznv8 via @gatewaypundit @JoniErnst @EliseStefanik @MiaBLo…

● @m_silverberg

-Wifi for media at the Bruce Rauner party is $50 a pop...

-Every TV station in Illinois about to go live at 5 from Bruce Rauner's election night party.

User Who Sent the Most Tweets

● Code (a sketch follows below)

● Result
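The code screenshot isn't preserved in this transcript, and it isn't clear whether the original used map-reduce or the aggregation pipeline; as one hedged sketch, an aggregation with the Java driver that groups tweets by user and keeps the top sender could look like this:

import static com.mongodb.client.model.Accumulators.sum;
import static com.mongodb.client.model.Aggregates.group;
import static com.mongodb.client.model.Aggregates.limit;
import static com.mongodb.client.model.Aggregates.sort;
import static com.mongodb.client.model.Sorts.descending;

import com.mongodb.client.MongoClient;
import com.mongodb.client.MongoClients;
import com.mongodb.client.MongoCollection;
import java.util.Arrays;
import org.bson.Document;

public class TopTweeter {
    public static void main(String[] args) {
        try (MongoClient client = MongoClients.create("mongodb://localhost:27017")) {
            MongoCollection<Document> tweets =
                    client.getDatabase("mydb").getCollection("tweets");

            Document topUser = tweets.aggregate(Arrays.asList(
                    group("$user_name", sum("tweetCount", 1)),   // count tweets per user
                    sort(descending("tweetCount")),              // most active user first
                    limit(1)                                     // keep only the top sender
            )).first();

            System.out.println(topUser.toJson());   // e.g. { "_id": "...", "tweetCount": ... }
        }
    }
}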

@grammy620: Vote for @JeanneShaheen and this will continue! http://t.co/AleJxTqS1n CLOSE OUR BORDERS! #NHsen Stop the Obama Agenda

@DJGalaxieIL: Vote for Quinn tomorrow!!!!!!!!!!!!!!!!!!!

@Williamjkelly: @progressIL Why I'm NOT drinking the Rauner Kool-Aid http://t.co/kI0H0ohlSN

@haydeevilma06: RT @FitzGeraldForOH: You’re ready to vote, and we’re ready to help you find out where! http://t.co/iOj3wFnf3I

Relevant Users:

Word Count for Keyword "democrat"

● Code (a sketch follows below)

● Result
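Again the original code isn't preserved here; assuming the map-reduce output landed in the word_counts collection used in the earlier sketch, looking up a single keyword is a one-document query:

import com.mongodb.client.MongoClient;
import com.mongodb.client.MongoClients;
import org.bson.Document;

public class KeywordCount {
    public static void main(String[] args) {
        try (MongoClient client = MongoClients.create("mongodb://localhost:27017")) {
            // word_counts is the assumed output collection of the map-reduce job above.
            Document d = client.getDatabase("mydb").getCollection("word_counts")
                    .find(new Document("_id", "democrat"))
                    .first();
            System.out.println(d);   // e.g. Document{{_id=democrat, value=...}}
        }
    }
}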

Word Count for Keyword “republican”

● Result:

@Tigerfists88: Pres. ✰#Obama Brings The Jobless Rate From 10.1% to 5.9% despite republican obstacles http://t.co/852t9ANDF1 #TheyMad #news #p2 #TFB Obama

The “Big Data” Ecosystem at LinkedIn

Roshan Sumbaly, Jay Kreps, and Sam Shah

● This paper describes the systems that engender effortless ingress and egress out of the Hadoop system and presents case studies of how data mining applications are built at LinkedIn.

● Kafka, Azkaban

● Ingress, egress

● For egress, three main mechanisms are necessary:

- 70% is key-value access – Voldemort

– 20% is stream-oriented access – Kafka

– Multidimensional or OLAP access – Avatara

Given the high velocity of feature development and the difficulty in accurately gauging capacity needs, these systems are all horizontally scalable.

These systems are run as a multitenant service where no real stringent capacity planning needs to be done: rebalancing data is a relatively cheap operation, engendering rapid capacity changes as needs arise.

Ingress

● Kafka is a distributed publish-subscribe system that persists messages in a write-ahead log, partitioned and distributed over multiple brokers.

● It allows data publishers to add records to a log.

● Each of these logs is referred to as a topic.

● Example: search. The search service would produce these records and publish them to a topic named “SearchQueries”, where any number of subscribers might read these messages (see the sketch after this list).

● All Kafka topics support multiple subscribers, as it is common to have many different subscribing systems.

● Kafka supports distributing data consumption within each of these subscribing systems, because many of these feeds are too large to be handled by a single machine.
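The paper predates today's Kafka Java client, so this is not LinkedIn's code; it is just a present-day sketch of the publish side of the “SearchQueries” example, with the broker address, key, and payload made up for illustration.

import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class SearchQueryPublisher {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "broker1:9092");   // illustrative broker address
        props.put("key.serializer",
                "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer",
                "org.apache.kafka.common.serialization.StringSerializer");

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // The search service appends each query event to the "SearchQueries" topic;
            // any number of subscribing systems can then read the log at their own pace.
            producer.send(new ProducerRecord<>("SearchQueries",
                    "member-123",                                   // made-up key
                    "{\"query\": \"data engineer\", \"ts\": 1}"));  // made-up payload
        }
    }
}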

Ingress: Data Evolution

● Two solutions:

1. Simply load the data streams in whatever form they appear.

2. Manually map the source data into a stable, well-thought-out schema and perform whatever transformations are necessary to support it.

● LinkedIn's solution:

It retains the same structure throughout the data pipeline and enforces compatibility and other correctness conventions on changes to this structure.

– A schema is maintained for each topic in a single consolidated schema registry.

– If data is published to a topic with an incompatible schema, it is rejected.

– If it is published with a new backwards-compatible schema, the schema evolves automatically.

– Each schema also goes through a review process to help ensure consistency with the rest of the activity data model.

Ingress: Hadoop Load

● The activity data generated and stored on Kafka is pulled into Hadoop using a map-only job that runs every 10 minutes on a dedicated ETL Hadoop cluster as part of an Azkaban workflow.

● First, the job reads the Kafka log offsets and checks for any new topics.

● Then it starts a fixed number of mapper tasks to pull data into HDFS partition files and finally registers them with LinkedIn's various systems.

● The ETL workflow also runs an aggregator job every day to combine and de-duplicate the data saved throughout the day into another HDFS location, and to run predefined retention policies on a per-topic basis. (This combining and cleanup prevents having many small files.)

Egress

● The results of workflows are usually pushed to other systems, either back for online serving or as a derived dataset for further consumption.

● The workflows append an extra job at the end of their pipeline for data delivery out of Hadoop.

Egress: Key-Value

● Voldemort is a distributed key-value store, akin to Amazon's Dynamo, with a simple get(key) and put(key, value) interface (sketched after this list).

● Tuples are grouped together into logical stores.

● Each key is replicated to multiple nodes depending on the preconfigured replication factor of its corresponding store.

● Every node is further split into logical partitions.
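As a sketch of that get/put interface, based on Voldemort's standard client API (the bootstrap URL and store name below are invented for the example, not taken from the paper):

import voldemort.client.ClientConfig;
import voldemort.client.SocketStoreClientFactory;
import voldemort.client.StoreClient;
import voldemort.client.StoreClientFactory;

public class VoldemortExample {
    public static void main(String[] args) {
        // Bootstrap URL and store name are illustrative only.
        StoreClientFactory factory = new SocketStoreClientFactory(
                new ClientConfig().setBootstrapUrls("tcp://localhost:6666"));
        StoreClient<String, String> client = factory.getStoreClient("people-you-may-know");

        client.put("member:123", "recommended-members-blob");   // replicated per the store's replication factor
        String value = client.getValue("member:123");           // simple key-based read
        System.out.println(value);
    }
}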

Egress: Stream

● The ability to publish data to Kafka is implemented as a Hadoop OutputFormat.

● Each MapReduce slot acts as a Kafka producer that emits messages, throttling as necessary to avoid overwhelming the Kafka brokers.

● As Kafka is a pull-based queue, the consuming application can read messages at its own pace.

Egress: OLAP

● Avatara is a system that moves cube generation to a high-throughput offline system and query serving to a low-latency system.

● By separating the two systems, we lose some freshness of data, but are able to scale them independently.

● This independence also shields the query layer from the performance impact that would otherwise occur due to concurrent cube computation.

Applications

● Key-value

- People You May Know

– Collaborative Filtering

– Skill Endorsements

– Related Searches

Applications

● Stream

- News Feed Updates

– Email

– Relationship Strength

Applications

● OLAP

- Who viewed my profile?

– Who's viewed this job?

Thank you!