Social Media Analytic Project Report: MIS 587, Business Intelligence
Team Intelligence Commerce
Firoz Pathan, Kaustubh Khole, Lathika Amin, Rishi Mittal, Shraddha Patil, Ziyin Peng
5/2/2012
Introduction
This project analyzed collected tweet data for a given scenario in order to answer a set of
questions. We collected tweets about situational comedy shows aired globally between 31st March and
7th April, 2012, using the Twitter Streaming API with keywords related to each show. Tweets were
collected for the following shows: 1) How I Met Your Mother 2) Family Guy 3) South Park 4) The
Simpsons 5) Curb Your Enthusiasm 6) It’s Always Sunny in Philadelphia 7) Futurama 8) Modern Family
9) The Big Bang Theory 10) Two and a Half Men. The analysis was done with dashboards and charts
built in the Microsoft BI Suite, and with Gephi for network analysis.
Data Collection
Data collection ran from 31st March to 7th April, 2012, using the Twitter Streaming API with the
external Java library “Twitter4J”. Over this one-week period we collected over 1.5 million tweets
from twitter.com. The raw data arrived in JSON format, and we used the Twitter4J library to parse
it into separate attributes, e.g. username, userid, tweet, language, location, etc. The refined
data was imported into Microsoft SQL Server in the E-Commerce Lab at the Eller College of
Management, University of Arizona. The whole process ran on a remote server in the lab, accessed
over a VPN connection: data was collected from twitter.com, parsed from JSON and imported into the
SQL Server database in real time.
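The parsing step can be pictured with a small sketch. The team did this in Java with Twitter4J; the Python below is only an illustration of flattening one raw JSON status into the attributes listed above. The sample record and the JSON field names (`text`, `user.screen_name`, `user.lang`, `user.location`) are assumptions, not taken from the report.

```python
import json

# Hypothetical sample record. The field names are assumed, not taken from the report.
RAW_TWEET = """{
  "text": "Watching #himym tonight",
  "user": {"id": 12345, "screen_name": "example_user",
           "lang": "en", "location": "Tucson, AZ"}
}"""


def parse_tweet(raw):
    """Flatten one raw JSON status into the attributes named in the report."""
    status = json.loads(raw)
    user = status.get("user", {})
    return {
        "username": user.get("screen_name"),
        "userid": user.get("id"),
        "tweet": status.get("text"),
        "language": user.get("lang"),
        # An empty location string is stored as None, i.e. NULL in the database.
        "location": user.get("location") or None,
    }


row = parse_tweet(RAW_TWEET)
```

Each resulting dictionary corresponds to one row that could then be inserted into the database table.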
Collection Criteria
Tweets were selected for collection by keyword. We covered a total of 10 sitcoms with an average
of 3 keywords per sitcom. These keywords include:
For How I Met Your Mother:
himym, how%20i%20met%20your%20mother, barney, thebrocode, the%20bro%20code, brocode,
bro%20code, broslife
For Modern Family:
Modernfamily, modern%20family, dunphyfamily, dunphy%20family
For The Big Bang Theory:
Thebigbangtheory, the%20big%20bang%20theory, leonard%20hofstadter, Sheldon,
sheldon_cooper_, tbbt, bigbang_cbs
For The Simpsons:
Thesimpsons, the%20simpsons, simpsons, homerjsimpson, marjoriesimpson, marge%20simpson,
homer%20simpson
For Two and A Half Men:
Twoandahalfmen, two%20and%20a%20half%20men, twoandhalfmen
For South Park:
Southpark, south%20park
For Family Guy:
Familyguy, family%20guy, griffin%20family, peter%20griffin, family_guy
For It’s Always Sunny in Philadelphia:
Itsalwayssunnyinphiladelphia
For Curb Your Enthusiasm:
Curbyourenthusiasm, larry%20david
For Pretty Little Liars:
Prettylittleliars, pretty%20little%20liars, spencer%20hastings, hanna%20martin, allthingsppl, pll
For Futurama:
futurama
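The keywords above are written in URL-encoded form, where `%20` stands for a space. As a minimal sketch, assuming the keywords are kept in a dictionary per show (the dictionary layout and function name are hypothetical), they can be decoded and turned into a keyword-to-show lookup like this:

```python
from urllib.parse import unquote

# Keywords copied from the lists above for two of the shows.
TRACK_KEYWORDS = {
    "How I Met Your Mother": [
        "himym", "how%20i%20met%20your%20mother", "barney", "thebrocode",
        "the%20bro%20code", "brocode", "bro%20code", "broslife",
    ],
    "South Park": ["Southpark", "south%20park"],
}


def build_keyword_lookup(keywords_by_show):
    """Decode each keyword, lowercase it, and map it back to its show."""
    lookup = {}
    for show, encoded_keywords in keywords_by_show.items():
        for encoded in encoded_keywords:
            lookup[unquote(encoded).lower()] = show
    return lookup


keyword_to_show = build_keyword_lookup(TRACK_KEYWORDS)
```

Lowercasing is a design choice made here so that `Southpark` and `southpark` are treated as the same keyword.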
Data Cleansing
Data cleansing was the second step of the project, and we found it to be the toughest part of the
entire project. The quality of the data directly determines the value of the analysis.
We first used DQ Analyzer from Ataccama for data profiling. DQ Analyzer can connect to SQL Server
to pull data from our sitcom database. Its results show the statistics, patterns, masks, trends,
and frequency distribution of the data.
From the data profiles, we learned the percentage of NULL values in the database and which data
was unstructured. Based on this, we worked out a series of cleansing methods suited to the data in
each attribute.
Data cleansing was done mainly through SQL manipulation. We queried tweet contents to verify
whether each tweet was relevant to the selected sitcom. If not, we flagged the tweet and kept it
out of the refined dataset.
Another example is the location attribute. Because Twitter does not force users to enter valid
location data, users can type anything they want, or nothing. We discovered that over 80% of users
have no location or an invalid location, and the majority of them do so on purpose.
Our data cleansing philosophy can be illustrated as follows:
For each tweet, we extract the keywords it contains from our keyword base. If the tweet matches the
main keyword, like ‘himym’ for How I Met Your Mother, we consider it a valid tweet. Otherwise, if
the tweet matches two or more marginal keywords, like ‘brocode’ or ‘barney’, we also consider it
relevant. In all other cases, we flag the tweet as irrelevant.
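The rule above can be sketched in a few lines. The report states that the cleansing was done in SQL, so the Python below is only an illustration of the logic, not the team's implementation. It assumes single-word keywords and whole-word matching; apart from ‘himym’, ‘brocode’ and ‘barney’, the split into main and marginal keywords is a guess.

```python
import re

# Assumption: 'himym' is the main keyword (as stated in the report); the
# marginal set is illustrative, only 'brocode' and 'barney' are named there.
MAIN_KEYWORDS = {"himym"}
MARGINAL_KEYWORDS = {"barney", "brocode", "broslife"}


def is_relevant(tweet_text):
    """Relevant if a main keyword matches, or two or more marginal ones do."""
    # Split into lowercase word tokens so '#HIMYM' matches 'himym'.
    tokens = set(re.findall(r"[a-z0-9_]+", tweet_text.lower()))
    if tokens & MAIN_KEYWORDS:
        return True
    return len(tokens & MARGINAL_KEYWORDS) >= 2
```

For example, a tweet that only mentions "Barney" is flagged as irrelevant, since the name alone could refer to something other than the show.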
The data cleansing process was difficult, but we successfully refined the dataset for about 20,000
tweets.
Star Schema
The Star Schema design is closely related to the ERD model. We extended our original table into an
ERD by adding relevant attributes to describe each existing attribute.
The ERD model is illustrated as follows:
After this, the Star Schema was built based on the ERD and the questions that we wanted to ask.
The Star Schema has 6 dimensions and one factless fact table, which can be described as below:
ETL Process
The clean data was then loaded into the dimensional model. The dimensional model is used for
Business Intelligence analysis, OLAP, and reporting services.
We used the Microsoft Business Intelligence tools to run the queries for the analytical
components. The analytical questions are addressed in the following sections.
The Analysis
1. Which is the most popular sitcom among all?
Results: The highest number of tweets was collected for the sitcom “Family Guy”, followed by “How
I Met Your Mother”. Network analysis using Gephi confirmed the same result.
2. Popularity of a particular sitcom in a country
Results: This analysis shows the popularity of each sitcom within a country. E.g. for Argentina,
Futurama was the most popular, followed by How I Met Your Mother. For China, surprisingly, only
three sitcoms were tweeted about; of these, Family Guy was the most popular, followed by How I Met
Your Mother. The number of tweets was very low for China, since social websites like Twitter and
Facebook are banned there. The analysis for India shows the most interest in Family Guy, followed
by How I Met Your Mother and South Park. Again, the number of tweets for India was low.
3. Popularity of a particular sitcom in a state of the US
Results: This analysis shows the popularity of each sitcom within a US state. E.g. in California,
The Big Bang Theory was the most popular, followed by Family Guy. In Michigan, Family Guy was the
most popular, followed by South Park.
4. Most popular language used for a particular sitcom
Results: The most popular language was English for all the sitcoms except The Big Bang Theory,
where Dutch led. Spanish was the second most popular language, except for The Big Bang Theory,
where the second most popular language was Dutch. The least used languages were Arabic, Magyar
(Hungarian), Chinese, Danish and Polish.
5. Most active day of the week for a particular sitcom
Results: The number of tweets over the collection period shows an unusual trend. There are fewer
tweets on weekends than on weekdays. Sunday, Monday, Wednesday and Thursday had the highest numbers
of tweets, but Tuesday showed a sudden decline. Wednesday had the highest number of all.
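A per-weekday count of this kind can be sketched as follows. The team produced these figures with the Microsoft BI tools; the Python below only illustrates the grouping step, and the sample dates are hypothetical values from within the collection period.

```python
from collections import Counter
from datetime import date

# date.weekday() returns 0 for Monday through 6 for Sunday.
DAY_NAMES = ["Monday", "Tuesday", "Wednesday", "Thursday",
             "Friday", "Saturday", "Sunday"]

# Hypothetical sample of tweet dates, not real data from the project.
sample_dates = [date(2012, 3, 31), date(2012, 4, 1),
                date(2012, 4, 4), date(2012, 4, 4)]


def tweets_per_weekday(dates):
    """Count how many tweets fall on each day of the week."""
    return Counter(DAY_NAMES[d.weekday()] for d in dates)


counts = tweets_per_weekday(sample_dates)
```

Note that a one-week collection window covers most weekdays only once, so each count reflects a single day rather than an average over several weeks.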
Input to GEPHI
The nodes represent sitcoms, and the number of users watching a sitcom determines its popularity:
the bigger the node, the more popular the sitcom. We had 10 nodes in all, each with its respective
size. The nodes were joined by weighted edges, whose weights indicate how strong or weak the
relationship between two sitcoms is. Every edge has a source and a destination, and its thickness
was calculated as follows: the thicker the edge, the larger the number of viewers common to both
nodes (sitcoms).
Example:
Node Sitcom1 = Number of users watching Sitcom1
Edge (Sitcom1, Sitcom2) = Weight, calculated as the number of common viewers of Sitcom1 and Sitcom2
Input to GEPHI was in GDF format.
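As a minimal sketch of how such a GDF file could be produced, the Python below writes one node per sitcom and one weighted edge per pair of sitcoms that share viewers. The sample viewer sets, the node ids (`s1`, `s2`, ...) and the custom `viewers` column are assumptions made for illustration; the report does not show the actual file.

```python
from itertools import combinations

# Hypothetical sample: the set of users who tweeted about each sitcom.
viewers = {
    "Family Guy": {"u1", "u2", "u3"},
    "South Park": {"u2", "u3"},
    "Futurama": {"u3", "u4"},
}


def build_gdf(viewers_by_show):
    """Return GDF text: a node section followed by a weighted edge section."""
    names = sorted(viewers_by_show)
    ids = {name: f"s{i}" for i, name in enumerate(names, start=1)}
    lines = ["nodedef>name VARCHAR,label VARCHAR,viewers INTEGER"]
    for name in names:
        # Node value = number of users watching the sitcom.
        lines.append(f"{ids[name]},{name},{len(viewers_by_show[name])}")
    lines.append("edgedef>node1 VARCHAR,node2 VARCHAR,weight DOUBLE")
    for a, b in combinations(names, 2):
        # Edge weight = number of viewers common to both sitcoms.
        common = len(viewers_by_show[a] & viewers_by_show[b])
        if common:
            lines.append(f"{ids[a]},{ids[b]},{common}")
    return "\n".join(lines) + "\n"


gdf_text = build_gdf(viewers)
```

Pairs with no common viewers are left out here, which is a design choice rather than something the report specifies. Labels containing commas or apostrophes would need quoting, which this sketch does not handle.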
Network Graphs Generated
In the graph below, the edge between Family Guy and South Park is very thick. Hence the largest
number of users watched both Family Guy and South Park, followed by Futurama and How I Met Your
Mother. Also, most of the users who watched any of the other sitcoms also watched Family Guy.
Lessons Learned
The Social Media Analysis project was interesting and challenging in that it deals with Big Data
and sophisticated network graph theory. To successfully complete the project and fulfil the
deliverables, the team needs commitment, a strong desire to take on tasks, willingness to accept
challenges, good communication, efficient group work and careful time allocation.
Team Intelligence Commerce is a group of talented people from the MIS program. They were active
during each meeting and willing to take on the tasks assigned to them.
For future improvement, the team needs to work as a more integrated unit, and time commitment is
also key to the success of this project.
Appendix
1. Number of Tweets for a Sitcom as per Country
2. Tweet Trend as per Days
Spain
3. Given a sitcom and state, tweet trend for a sitcom on the basis of a given day
How I Met Your Mother, California
Family Guy, New York
The Big Bang Theory, California
The Simpsons, Massachusetts
How I Met Your Mother