Social Media Analytic - University of Arizonaziyinpeng/project/Intelligence Commerce Social...Social...

Social Media Analytic Project Report: MIS 587, Business Intelligence

Team Intelligence Commerce: Firoz Pathan, Kaustubh Khole, Lathika Amin, Rishi Mittal, Shraddha Patil, Ziyin Peng

5/2/2012


Introduction

This project analyzed tweets collected for a given scenario in order to answer a set of questions. We collected tweets about situation comedies aired globally, from 31 March to 7 April 2012, using the Twitter Streaming API with keywords related to each show. Tweets about the following shows were collected: 1) How I Met Your Mother 2) Family Guy 3) South Park 4) The Simpsons 5) Curb Your Enthusiasm 6) It's Always Sunny in Philadelphia 7) Futurama 8) Modern Family 9) The Big Bang Theory 10) Two and a Half Men. The analysis was carried out with dashboards and charts built in the Microsoft BI suite, and with Gephi for network analysis.

Data Collection

Data collection ran from 31 March to 7 April 2012 using the Twitter Streaming API through the external Java library “Twitter4J”. Over this one-week period we collected more than 1.5 million tweets from twitter.com. The raw data arrived in JSON format, and we used Twitter4J to parse it into separate attributes, e.g. username, userid, tweet, language, and location. The parsed data was imported into Microsoft SQL Server in the E-Commerce Lab at the Eller College of Management, University of Arizona. The whole process ran on a remote server in the lab, accessed over a VPN connection: tweets were collected from twitter.com, parsed from JSON, and imported into the SQL Server database in real time.
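The parsing step can be sketched as follows. The team did this in Java with Twitter4J; the version below is only a minimal Python illustration using the standard `json` module. The field names (`text`, `user.screen_name`, `user.id`, `user.lang`, `user.location`) are assumed to follow the classic Twitter status JSON layout, and the sample payload is hypothetical, not taken from the collected data.

```python
import json

def parse_tweet(raw_json):
    """Flatten one raw tweet (a JSON string) into the attributes named in the report."""
    status = json.loads(raw_json)
    user = status.get("user") or {}
    return {
        "username": user.get("screen_name"),
        "userid": user.get("id"),
        "tweet": status.get("text"),
        "language": user.get("lang"),
        # Users may leave the location blank; store that as None (NULL in SQL).
        "location": user.get("location") or None,
    }

# Hypothetical sample payload for illustration only.
sample = json.dumps({
    "text": "Watching himym tonight",
    "user": {"screen_name": "example_user", "id": 12345, "lang": "en", "location": ""},
})
row = parse_tweet(sample)
print(row)
```

Mapping an empty location to `None` at parse time would make the share of missing locations directly visible as NULL values in later data profiling.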

Collection Criteria

Tweets were selected for collection by keyword. We tracked a total of 10 sitcoms with an average of 3 keywords per sitcom. The keywords were:

For How I Met Your Mother:

himym, how%20i%20met%20your%20mother, barney, thebrocode, the%20bro%20code, brocode,

bro%20code, broslife

For Modern Family:

Modernfamily, modern%20family, dunphyfamily, dunphy%20family

For The Big Bang Theory:

Thebigbangtheory, the%20big%20bang%20theory, leonard%20hofstadter, Sheldon,

sheldon_cooper_, tbbt, bigbang_cbs

For The Simpsons:

Thesimpsons, the%20simpsons, simpsons, homerjsimpson, marjoriesimpson, marge%20simpson,

homer%20simpson

For Two and A Half Men:

Twoandahalfmen, two%20and%20a%20half%20men, twoandhalfmen

For South Park:


Southpark, south%20park

For Family Guy:

Familyguy, family%20guy, griffin%20family, peter%20griffin, family_guy

For It’s Always Sunny in Philadelphia:

Itsalwayssunnyinphiladelphia

For Curb Your Enthusiasm:

Curbyourenthusiasm, larry%20david

For Pretty Little Liars:

Prettylittleliars, pretty%20little%20liars, spencer%20hastings, hanna%20martin, allthingsppl, pll

For Futurama:

futurama
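The keywords above are stored URL-encoded, with `%20` standing for a space. As a small sketch of how such a list could be turned back into plain search phrases, the snippet below uses Python's standard `urllib.parse.unquote`. The helper name `decode_keywords` is hypothetical, and lowercasing the keywords is an assumption made here for case-insensitive matching, not something the report states.

```python
from urllib.parse import unquote

def decode_keywords(raw):
    """Split a comma-separated keyword line and decode %20 into spaces."""
    return [unquote(k.strip()).lower() for k in raw.split(",") if k.strip()]

# Keyword line for How I Met Your Mother, copied from the list above.
himym_raw = ("himym, how%20i%20met%20your%20mother, barney, thebrocode, "
             "the%20bro%20code, brocode, bro%20code, broslife")
himym_keywords = decode_keywords(himym_raw)
print(himym_keywords)
```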

Data Cleansing

Data cleansing was the second stage of the project, and the toughest part of it: the quality of the data directly determines the value of the analysis.

We first profiled the data with DQ Analyzer from Ataccama. DQ Analyzer can connect to SQL Server to pull data from our sitcom database. Its results report the statistics, patterns, masks, trends, and frequency distributions of the data.

From the data profiles, we learned the percentage of NULL values in the database and the extent of unstructured data. This let us work out a cleansing method for each attribute based on the data it contains.

Data cleansing was done mainly through SQL manipulation. We queried tweet contents to verify whether each tweet was relevant to the selected sitcom. If it was not, we set a flag on it and kept it out of the refined dataset.

Another example is the location attribute. Because Twitter does not force users to enter valid location data, users can type anything they want, or nothing. We discovered that over 80% of users have no location or an invalid one, and that the majority of them do this on purpose.

Our data cleansing philosophy can be illustrated as follows:

For each tweet, we extract the keywords it contains from our keyword base. If the tweet matches the main keyword, such as ‘himym’ for How I Met Your Mother, we consider it a valid tweet. Otherwise, if the tweet matches two or more marginal keywords, such as ‘brocode’ or ‘barney’, we also consider it relevant. In all other cases, we flag it as irrelevant.
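The relevance rule above can be sketched in a few lines. The team implemented it in SQL; this is only a Python illustration of the same logic. The function name `is_relevant`, the use of case-insensitive substring matching, and the example tweets are assumptions; the split into main and marginal keywords uses only the examples the text gives for How I Met Your Mother.

```python
def is_relevant(tweet_text, main_keywords, marginal_keywords):
    """Return True if the tweet matches a main keyword or at least two marginal ones."""
    text = tweet_text.lower()
    # Rule 1: any main keyword makes the tweet valid on its own.
    if any(k in text for k in main_keywords):
        return True
    # Rule 2: otherwise require two or more distinct marginal keywords.
    hits = sum(1 for k in marginal_keywords if k in text)
    return hits >= 2

MAIN = ("himym",)                 # main keyword named in the text
MARGINAL = ("brocode", "barney")  # marginal keywords named in the text

print(is_relevant("New HIMYM episode tonight!", MAIN, MARGINAL))
print(is_relevant("barney quoting the brocode again", MAIN, MARGINAL))
print(is_relevant("barney the dinosaur", MAIN, MARGINAL))
```

With plain substring matching, overlapping keywords such as ‘brocode’ and ‘thebrocode’ would count as two hits for a single word, so a fuller marginal list would need deduplication or token-based matching.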

The data cleansing process was tough, but we successfully refined the dataset to about 20,000 tweets.

Star Schema

The star schema design is closely related to the ERD model. We extended our original table into an ERD by adding relevant attributes that describe each existing attribute.

The ERD model is illustrated as follows:

After this, the star schema was built based on the ERD and on the questions we wanted to ask. The star schema has 6 dimensions and one factless fact table, which can be described as below:
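The original schema diagram is not reproduced in this text. Purely to illustrate what a factless fact table is, the sketch below builds a tiny star schema in an in-memory SQLite database: the fact table holds only foreign keys (one row per tweet, no numeric measures), and analysis is done by counting rows. The table names, the two dimensions shown (sitcom and language), and all rows are hypothetical; they are not the team's actual six dimensions or data.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dim_sitcom   (sitcom_id   INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE dim_language (language_id INTEGER PRIMARY KEY, name TEXT);
-- Factless fact table: foreign keys only, one row per tweet.
CREATE TABLE fact_tweet (
    sitcom_id   INTEGER REFERENCES dim_sitcom(sitcom_id),
    language_id INTEGER REFERENCES dim_language(language_id)
);
-- Made-up rows for illustration.
INSERT INTO dim_sitcom   VALUES (1, 'Family Guy'), (2, 'Futurama');
INSERT INTO dim_language VALUES (1, 'English'), (2, 'Spanish');
INSERT INTO fact_tweet   VALUES (1, 1), (1, 1), (1, 2), (2, 1);
""")

# With no measures stored, every question becomes a COUNT(*) over the fact rows.
rows = conn.execute("""
    SELECT s.name, l.name, COUNT(*) AS tweets
    FROM fact_tweet f
    JOIN dim_sitcom   s ON s.sitcom_id   = f.sitcom_id
    JOIN dim_language l ON l.language_id = f.language_id
    GROUP BY s.name, l.name
    ORDER BY s.name, l.name
""").fetchall()
print(rows)
```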


ETL Process

The clean data was then loaded into the dimensional model, which is used for business intelligence analysis, OLAP, and reporting services.

We used the Microsoft Business Intelligence tools to run the queries for the analytical components. The analytical questions are addressed in the following sections.


The Analysis

1. Which is the most popular sitcom of all?

Results: The highest number of tweets was collected for the sitcom “Family Guy”, followed by “How I Met Your Mother”. The network analysis in Gephi confirmed the same result.

2. Popularity of a particular sitcom in a country

Results: This analysis showed the popularity of each sitcom in a given country. E.g. for Argentina, Futurama was the most popular, followed by How I Met Your Mother. For China, surprisingly, only three sitcoms were viewed; of these, Family Guy was the most popular, followed by How I Met Your Mother. The number of tweets was very low for China because social websites such as Twitter and Facebook are banned there. The analysis for India shows the greatest enthusiasm for Family Guy, followed by How I Met Your Mother and South Park. Again, the number of tweets for India was low.


3. Popularity of a particular sitcom in a US state

Results: This question let us identify the most popular sitcoms in a given US state. E.g. in California, The Big Bang Theory was the most popular, followed by Family Guy. In Michigan, Family Guy was the most popular, followed by South Park.

4. Most popular language used in a particular sitcom

Results: As expected, the most popular language was English for all the sitcoms except The Big Bang Theory, where Dutch led the popularity. Spanish was the second most popular language, except for The Big Bang Theory, where the second most popular language was Dutch. The least used languages were Arabic, Magyar (Hungarian), Chinese, Danish, and Polish.

5. Most active day of the week for a particular sitcom

Results: The number of tweets over the collection period shows an unusual trend. There are fewer tweets on weekends than on weekdays. Sunday, Monday, Wednesday and Thursday had the highest numbers of tweets, but Tuesday showed an abrupt decline. Wednesday had the highest number of all.
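A day-of-week breakdown like the one above comes down to grouping tweet timestamps by weekday. The team produced it with the Microsoft BI tools; the snippet below is only a minimal Python sketch of the same grouping. The timestamp format and the four sample timestamps are assumptions made for illustration, not collected data.

```python
from collections import Counter
from datetime import datetime

DAY_NAMES = ["Monday", "Tuesday", "Wednesday", "Thursday",
             "Friday", "Saturday", "Sunday"]

def tweets_per_weekday(timestamps):
    """Count tweets per day of the week from 'YYYY-MM-DD HH:MM:SS' strings."""
    counts = Counter()
    for ts in timestamps:
        created = datetime.strptime(ts, "%Y-%m-%d %H:%M:%S")
        counts[DAY_NAMES[created.weekday()]] += 1  # weekday(): Monday is 0
    return counts

# Made-up timestamps inside the collection period (31 March to 7 April 2012).
sample = ["2012-03-31 10:00:00", "2012-04-01 20:15:00",
          "2012-04-04 09:00:00", "2012-04-04 21:30:00"]
per_day = tweets_per_weekday(sample)
print(per_day.most_common())
```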

Input to GEPHI

The nodes represent sitcoms, and the number of users watching a sitcom determines its popularity: the bigger the node, the more popular the sitcom. We had 10 nodes in all, each with its respective size. The nodes were joined by edges, each with an associated weight that determines how strong or weak the relationship between two sitcoms is. Every edge has a source and a destination, and its thickness is calculated as follows: the thicker the edge, the more viewers the two nodes (sitcoms) have in common.

Example:

Node Sitcom1 = Number of users watching Sitcom1

Edge (Sitcom1, Sitcom2) = Weight, calculated as the number of common viewers of Sitcom1 and Sitcom2

Input to GEPHI was in GDF format.
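The node and edge definitions above can be turned into a GDF file with a short script. The following is a sketch under stated assumptions, not the team's actual export: the function name `build_gdf`, the `users` node column, the `s0`, `s1`, ... node ids, and the three sample users are all hypothetical, and "watching" a sitcom is approximated as "tweeted about it".

```python
from itertools import combinations

def build_gdf(user_sitcoms):
    """Build GDF text from a mapping of user id -> set of sitcoms the user tweeted about."""
    node_size, edge_weight = {}, {}
    for sitcoms in user_sitcoms.values():
        for s in sitcoms:
            node_size[s] = node_size.get(s, 0) + 1          # users per sitcom
        for pair in combinations(sorted(sitcoms), 2):
            edge_weight[pair] = edge_weight.get(pair, 0) + 1  # common viewers
    ids = {name: f"s{i}" for i, name in enumerate(sorted(node_size))}
    lines = ["nodedef>name VARCHAR,label VARCHAR,users INT"]
    for name in sorted(node_size):
        lines.append(f"{ids[name]},{name},{node_size[name]}")
    lines.append("edgedef>node1 VARCHAR,node2 VARCHAR,weight DOUBLE")
    for (a, b) in sorted(edge_weight):
        lines.append(f"{ids[a]},{ids[b]},{edge_weight[(a, b)]}")
    return "\n".join(lines)

# Made-up users for illustration only.
sample_users = {
    "u1": {"Family Guy", "South Park"},
    "u2": {"Family Guy", "South Park", "Futurama"},
    "u3": {"Family Guy"},
}
gdf = build_gdf(sample_users)
print(gdf)
```

In Gephi, the `users` column could then drive node size and the `weight` column edge thickness, matching the description above.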

Network Graphs Generated:

In the graph below, the edge between Family Guy and South Park is very thick. Hence a large number of users watched both Family Guy and South Park, followed by Futurama and How I Met Your Mother. Also, most of the users who watched any of the other sitcoms watched Family Guy as well.


Lessons Learned

The Social Media Analysis project was interesting and challenging because it deals with Big Data and sophisticated network graph theory. To complete the project successfully and fulfil the deliverables, the team needs commitment, a strong desire to take on tasks, willingness to accept challenges, good communication, efficient group work, and good time allocation.

Team Intelligence Commerce is a group of talented people from the MIS program. They were active in every meeting and willing to take on the tasks assigned to them.

For future improvement, the team needs to become more integrated; time commitment is also key to the success of this project.


Appendix

1. Number of Tweets for a Sitcom as per Country

2. Tweet Trend as per Days

Spain

3. Given a sitcom and state, tweet trend for the sitcom on the basis of a given day

How I Met Your Mother, California

Family Guy, New York

The Big Bang Theory, California

The Simpsons, Massachusetts

How I Met Your Mother
