MUTUAL FOLLOWERS (originally MUTUAL FRIENDS)
By Adam Kuns & Ryan Chesla
The Unified Logging Infrastructure for Data Analytics
1. Introduction - logging based on sessions
2. Scribe/Zookeeper - takes data from web servers and writes it into HDFS, in a per-category, per-hour directory, i.e. /logs/category/YYYY/MM/DD/HH
3. Motivation - different apps had different schemas for logging, making it hard to query. A unified format fixed this.
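The per-category, per-hour directory layout can be illustrated with a small sketch. It uses only the JDK; the class name LogPaths, the method hourlyDir and the category "client_events" are names chosen for this illustration, not taken from Scribe or from the slides.

```java
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;

class LogPaths {
    // Formats the hour bucket as YYYY/MM/DD/HH.
    private static final DateTimeFormatter HOUR_DIR =
            DateTimeFormatter.ofPattern("yyyy/MM/dd/HH");

    // Builds /logs/<category>/YYYY/MM/DD/HH for the hour containing the given time.
    static String hourlyDir(String category, LocalDateTime time) {
        return "/logs/" + category + "/" + time.format(HOUR_DIR);
    }

    public static void main(String[] args) {
        // "client_events" is a hypothetical category name used for illustration.
        System.out.println(hourlyDir("client_events", LocalDateTime.of(2014, 11, 5, 9, 30)));
    }
}
```

For 09:30 on 2014-11-05 this yields /logs/client_events/2014/11/05/09, so all messages logged in the same hour for a category end up in one directory.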
The Unified Logging Infrastructure for Data Analytics (cont.)
4. Data logged by client events:
   (client, page, section, component, element, action)
   (client, page, section, component, *, action)
   (client, page, section, *, *, action)
   (client, page, *, *, *, action)
   Oink schedules common query jobs in advance (counts)
5. Applications: summary statistics, user modeling, funnel analytics, event counting
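The four event patterns listed above can be read as one fully specified event name plus three progressively coarser roll-ups. A minimal sketch of how such roll-up keys could be generated from a six-part event name follows; the class and method names are hypothetical, and this is not the code used in the logging infrastructure itself.

```java
import java.util.ArrayList;
import java.util.List;

class EventRollups {
    // Returns the full event name followed by three coarser patterns, in which
    // element, then component, then section are replaced by "*".
    static List<String> rollups(String client, String page, String section,
                                String component, String element, String action) {
        String[] parts = {client, page, section, component, element, action};
        List<String> out = new ArrayList<>();
        out.add(String.join(", ", parts));
        // Index 4 is element, 3 is component, 2 is section.
        for (int i = 4; i >= 2; i--) {
            parts[i] = "*";
            out.add(String.join(", ", parts));
        }
        return out;
    }
}
```

Counting events under each of the returned keys gives counts at every level of the hierarchy, which is the kind of common query a scheduler such as Oink could run in advance.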
Original Proposal
Develop a system to find “mutual friends” between users of Facebook using Hadoop and MapReduce.
Issues with Graph API 2.0
As of version 2.0 of the Graph API, the friends object returns only those friends of a person who also use the app.
Since none of our friends use this app, the objects returned from Facebook were empty, making our original proposal infeasible.
Possible Workarounds
We did find some workarounds, but they were not feasible within the scope of this project:
● The first option would require accessing the TaggableFriends object, but access to this object requires Facebook App approval.
● The second would be to classify our app as a canvas app so that we could access the game invites list for Facebook games (basically lying by saying our app was a Facebook game).
New Proposal
● Develop a system to find “mutual followers” between users of Twitter (replacing “mutual friends” between users of Facebook) using Hadoop and MapReduce.
● In addition to our original proposal, we decided to also store user data in HBase on the initial pulls, so that we can later look up user information based on the output of our data mining MapReduce job.
Twitter Search API
● For our programs, Twitter’s Search API fit our needs, since it let us query a specific user’s information.
● The Streaming API would be better suited for tweets.
● Drawback: We would often run into the request limit; this would be solved by using the Firehose API.
Populating the HDFS
In order to populate the HDFS with data from Twitter, we decided to use Twitter4J, an unofficial Java library for the Twitter API.
This provided an easy-to-use, object-oriented approach to pulling data into the HDFS by way of a Java program.
Twitter4J API
● Using Twitter4J, we can pull user-specific information on a per-user basis.
● In our case, we use the showUser function to create a User object for the user we are currently querying.
● Using this User object, we can call many functions to pull multiple details about a particular user.
Populating the HDFS (cont.)
Once we have the user information (both the users’ and their followers’ information), we create a flat file containing the two users’ follower lists and populate HBase with information for the two users and all of their followers.
MapReduce Input File Example
line 1: 123 574 234 920 984
line 2: 658 997 111 322 123 125
The first ID on each line is the user ID; the remaining IDs are follower IDs.
MapReduce Input File
MapReduce Algorithm
Our MapReduce program “Mutual Followers” reads this text file from the HDFS.
Each line in the text file represents a user’s follower list.
Each line consists of tab-delimited user IDs. The first ID is the user.
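To make the line format concrete, here is a minimal sketch of writing and reading one such line in plain Java. The class FollowerLines and its methods are hypothetical helpers for illustration, not part of our program or of Twitter4J.

```java
import java.util.ArrayList;
import java.util.List;

class FollowerLines {
    // Builds one input line: the user's ID first, then each follower ID, tab delimited.
    static String toLine(long userId, List<Long> followerIds) {
        StringBuilder sb = new StringBuilder(Long.toString(userId));
        for (long id : followerIds) {
            sb.append('\t').append(id);
        }
        return sb.toString();
    }

    // Parses a line back into IDs; index 0 is the user, the rest are followers.
    static List<Long> parseLine(String line) {
        List<Long> ids = new ArrayList<>();
        String trimmed = line.trim();
        if (trimmed.isEmpty()) {
            return ids; // a blank line carries no IDs
        }
        for (String token : trimmed.split("\\s+")) {
            ids.add(Long.parseLong(token));
        }
        return ids;
    }
}
```

Keeping the user ID as the first token means a line needs no separate header, and a reader can treat every token uniformly when that is convenient.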
MapReduce Split Phase
Splits input file by each new line in the text.
Done implicitly by MapReduce (no programmer intervention)
Each line represents a person and their followers
Mapper Function
The mapper function tokenizes each line, outputting key-value pairs where:
Key = followerID
Value = 1
Note: The user IDs of the users we are querying are also emitted as key-value pairs, in case the users we are querying follow each other.
{key, value} = {followerID, 1}
123 1
456 1
999 1
Mapper Code
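A minimal sketch of the mapper logic described above, written in plain Java so it runs with the JDK alone. It deliberately leaves out Hadoop's Mapper class and Writable types, and the class name MutualFollowersMap is an assumption for illustration; it is not a copy of our actual mapper.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.StringTokenizer;

class MutualFollowersMap {
    // Emits one (ID, 1) pair per token on the line. The first token (the user
    // being queried) is emitted too, so users who follow each other are counted.
    static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> out = new ArrayList<>();
        StringTokenizer tokens = new StringTokenizer(line); // splits on tabs and spaces
        while (tokens.hasMoreTokens()) {
            out.add(new SimpleEntry<String, Integer>(tokens.nextToken(), 1));
        }
        return out;
    }
}
```

In a real Hadoop job the same loop would sit inside the map method and write each pair to the job context instead of collecting it in a list.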
Reducer Function
Aggregates key-value pairs. If the sum of the values for a given key is equal to the number of users we’re comparing (in our case 2), then we output that follower ID.
if (count == 2)
    // output follower ID
*Note: our program will work for comparing any number of users; we just need to change the if statement accordingly (e.g. if (count == 100)).
Reducer Code
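A minimal sketch of the reducer logic described above, again in plain Java without Hadoop's Reducer class. The grouping loop is a stand-in for the shuffle phase, the class and method names are hypothetical, and the sketch assumes each ID appears at most once per input line and that no line is blank.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class MutualFollowersReduce {
    // Reduce step for one key: true when the summed values equal the number
    // of users being compared, i.e. the ID appeared on every user's line.
    static boolean isMutual(Iterable<Integer> values, int numUsers) {
        int count = 0;
        for (int v : values) {
            count += v;
        }
        return count == numUsers;
    }

    // Simulates map + shuffle + reduce over the input lines (one line per user).
    static List<String> mutualFollowers(List<String> lines) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (String line : lines) {
            for (String id : line.trim().split("\\s+")) {
                grouped.computeIfAbsent(id, k -> new ArrayList<>()).add(1);
            }
        }
        List<String> out = new ArrayList<>();
        for (Map.Entry<String, List<Integer>> e : grouped.entrySet()) {
            if (isMutual(e.getValue(), lines.size())) {
                out.add(e.getKey());
            }
        }
        return out;
    }
}
```

Passing the number of users as a parameter, rather than hard-coding 2, covers the case of comparing more than two users without editing the if statement. For the two example input lines shown earlier (users 123 and 658), the only ID present on both lines is 123.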
MapReduce Output
Below is the output from the MapReduce job, returning the user IDs that are mutual followers.
HBase
The MapReduce job provides the mutual user_ids.
From there we can use those IDs to index into our HBase system and get real user data!
Populating HBase
Circling back to our first program: along with creating the flat file for the MapReduce program, we also inserted the users’ information and their followers’ information into HBase at the same time, using the HBase Java API.
HBase Data Model
Row ID = user_ID + timestamp (since user ID is unique)
Two Column Families:
1. User Info (name, profile, picture, etc.)
- All this user info is related and would usually be queried at the same time, so they will be grouped into the same column family
2. Followers
HBase Structure
Row ID: ajkuns
  Time Stamp: t3
  ColumnFamily User Info: userInfo:id = “123”, userInfo:name = “ajkuns”, userInfo:bio = “hello i am adam”, userInfo:profilePic = “http://www.cute_kittens.jpg”
  ColumnFamily Followers: Followers:1 = “123”, Followers:2 = “456”, Followers:3 = “999”, Followers:4 = “11”
Row ID: RyanChesla_
  Time Stamp: t4
  ColumnFamily User Info: userInfo:id = “456”, userInfo:name = “RyanChesla_”, userInfo:bio = “hi everybody!”, userInfo:profilePic = “http://www.volcano.jpg”
  ColumnFamily Followers: Followers:1 = “9”, Followers:2 = “599”
Note: the number of followers per user can vary so we took that into consideration
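Because HBase column qualifiers do not have to be declared ahead of time, a varying number of followers can be handled by generating one qualifier per follower. A minimal sketch of that naming scheme in plain Java follows; it builds the qualifier-to-value map only and does not call the HBase client API, and the class and method names are hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class TwitterUserRow {
    // Produces Followers:1, Followers:2, ... in input order, one column per follower ID.
    static Map<String, String> followerColumns(List<String> followerIds) {
        Map<String, String> columns = new LinkedHashMap<>();
        for (int i = 0; i < followerIds.size(); i++) {
            columns.put("Followers:" + (i + 1), followerIds.get(i));
        }
        return columns;
    }
}
```

Each entry of the returned map would correspond to one cell written to the Followers column family for that user's row, so a user with two followers gets two cells and a user with four gets four.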
HBase Contents
scan 'TwitterUser'
Future Work
Go N levels deep of followers
(find mutual followers of your followers’ followers!)