551 Final Report

TERM PAPER

A proposed framework to capture supporter’s behavior through Sentiment Analysis and Text mining

Sojib Ahmed

14 December 2015

1. Introduction

Social Network Analysis (SNA) provides various methods for understanding the dynamics of users in a network. Text mining, a key component of SNA, has proven valuable for extracting user behavior and motivation quickly and efficiently. In this study, we try to determine how sports fans behave on social media based on the current performance of the team they support. We apply sentiment analysis to data collected from Twitter to gauge the reaction of fans around the globe. Since Twitter has grown into a massive platform where users from around the world share their views, it offers a suitable setting for our research and for forming an overall picture of what fans are happy or unhappy about.

Sentiment analysis makes it possible to understand the relationships among the factors present in a network, and applying it to large datasets gives a clearer picture of a large group of supporters. Its use with Twitter data is not confined to studying one particular distribution; it can also describe the “joint distribution” of, and the relevant “association” between, sets of words that are dominant in the Twitter feed. Analyzing tweets with sentiment analysis can therefore be very helpful for predicting positive or negative correlation among Twitter participants; our focus is mainly on sports supporters.

The objective of this project is to provide a framework for conducting sentiment analysis and predicting the behavior of sports fans on social media based on their team’s performance. During the data collection stage, we structured our dataset so that it covers fans’ behavior over a longitudinal timeframe. This is an important and interesting part of our study, since it lets us identify how behavior changes over a period of time.
In order to achieve this, we collected our data at successive intervals after each game was played. The data used in the study covered five games in total and spanned almost a month, so that broader insight into supporters’ behavior could be obtained. Particular focus was given to word clustering and graphic visualization as a way to identify the motivational factors that drive users toward making positive versus negative remarks. Graphic visualization also enabled us to locate the major subjects that users focus on depending on the performance of their team.

Our initial plan was to carry out a hypothesis test between two datasets, a positive word cluster and a negative word cluster, using UCINET, and to identify the subjects and words that are significant for understanding the positive or negative sentiment present in the data. However, based on expert opinion, and given the confounding factors present and the bias among supporters, the hypothesis test was replaced with graphic visualization, since it provided a clear-cut analysis and representation of how supporters behave. Moreover, the presence of lurking variables (noise in the data) could easily shift the hypothesis test results and lead to false assumptions about our data. Nevertheless, we believe that analysis through graph visualization enabled us to understand the positive and negative sentiment and to focus only on the subjects and words that are dominant.

In this report, we provide a general overview of how the generated graphs can help us understand the behavior of supporters. A correct execution of the study will provide greater insight into user behavior and into how sentiment depends on a team’s performance.
Furthermore, the framework can be applied to other studies that observe the impact of positive and negative sentiment on growth and sustainability; the subject could be an artist, a clothing brand, a newly launched product or service, or a new automobile about to hit the market.

In the remainder of the report, we first present a literature review that focuses on different aspects of social network analysis and on the benefits of analysis through text mining. We then discuss the motivation and approach that drove us to conduct this study. Next, we discuss the methods used to conduct text mining and sentiment analysis. The methods section proceeds step by step to show how the tools and techniques were applied to generate our results. The results section reports the main findings: we include the important graphs, interpret them, and explain why the results obtained are useful. In the conclusion section we cover the benefits as well as the challenges and limitations of the study.

2. Literature Review

In this section, we cover insights from the social network literature to provide an intuition for the factors that motivate user behavior from a social network viewpoint. While conducting the literature review, we came across various papers that focus on the business sector and on how business operations are influenced by the behavior of social network users. Although not much work has been done to address fans’ reactions in sports based on team performance, we can draw an analogy from a business firm’s perspective to understand the effect of consumers’ actions on online social networks. We also discuss a few papers that feature the capabilities of text mining and sentiment analysis and their application in various sectors.

2.1 “The (Real) World Is Not Enough:” Motivational Drivers and User Behavior in Virtual Worlds

This study describes user behavior in terms of social influence and identifies the drivers that trigger user participation from a business viewpoint. It portrays the motivational drivers responsible for user engagement in a “marketing-relevant context”. The study finds that in a social network, “socializing, creativity and escape emerge as individual drivers”. It also identifies various important characteristics of a user. Given the variability among actors present in a network, the study does a decent job of capturing distinct motivational drivers and presenting them in segments. Although the context of this paper differs considerably from our focus, we believe it helped us grasp the importance of social network analysis from a slightly different angle.
2.2 “Comparing Twitter and Facebook user behavior: Privacy and other aspects”

When it comes to online social media, Facebook and Twitter are the two most common social networks that come to mind. The paper compares user behavior between Facebook and Twitter. While a primary part of the study focuses on user privacy, the paper also provides some interesting facts on other aspects of user behavior: friend overlap, friend distribution, and user activity. In a nutshell, it gives a rough idea of the common factors present in widely used social media such as Facebook and Twitter.

2.3 “Experimental Evidence of Massive-scale Emotional Contagion through Social Networks”

The objective of the study was to find out whether negative and positive emotional states can be transferred to other people through emotional contagion, which leads people to experience similar emotions without their awareness. Facebook, the largest online social network, was chosen to carry out the study. The purpose of the experiment was to test for emotional contagion among individuals by controlling their exposure to emotional content in their Facebook news feed. The study covered both positive and negative emotions by conducting two parallel experiments: one reduced exposure to friends’ positive emotional content, and the other reduced exposure to friends’ negative emotional content.

The results indicate emotional contagion. Reducing positive content in the news feed led people to use a greater proportion of negative words and a smaller proportion of positive words in their status updates. The reverse pattern was observed when negative content was reduced. Hence, the results support the claim that emotional contagion can arise through a social network without in-person interaction.

The results highlight several features of emotional contagion. Mere exposure or non-exposure to a friend’s emotional expression in the news feed is sufficient to affect one’s emotion. Moreover, nonverbal cues are not required for contagion, since textual content alone can produce it. Imitation alone cannot explain the “cross-emotional encouragement effect” (e.g., a reduction in negative posts resulted in an increase in positive posts). The effect sizes were also found to be similar when reducing positivity and when reducing negativity.
Exposure to fewer emotional posts led people to be less expressive (the “withdrawal effect”) over the next couple of days, which reveals how emotional expression influences social interactions online.

2.4 “Sentiment Analyses and Opinion Mining”

This paper explains sentiment analysis and how opinion mining is applied to gather sentiment data on a user group. Opinion mining captures the positive, negative or neutral perception that people on the social web hold of a specific commodity, issue or individual. The authors assert that opinion mining is important for gathering significant information from the huge collection of data on the web. The study discusses the importance of opinion mining for Twitter data and reviews numerous sentiment tools for Twitter data extraction. Twitter is the most popular microblog and receives over “500 million tweets” per day.

2.5 Tactics of Twitter Data Extraction for Opinion Mining

The paper highlights the available tools for Twitter data extraction and opinion mining. Twitter was chosen for sentiment analysis in this study. Both programming and non-programming approaches can be used to extract data from Twitter. Among programming techniques, the R language is popular and runs on Windows, Linux, Mac, etc. The programming approach offers several advantages, such as applying quality control rules, language detection algorithms, and blacklisting of spam words. However, non-programming techniques are simpler to implement and more flexible: ready-made sentiment tools are less time consuming than the programming approach and are efficient at extracting Twitter data. The sentiment tools mentioned in the study are given below:

• “Sentiment140” helps to discover the current sentiment of recent tweets on a topic.

• “Sentiment viz” focuses on distinctive visualization techniques.

• “Topsy” is a social media analytics tool that can deliver a greater number of tweets than any other sentiment tool.

• “Trackur” can track anything that is said on Twitter, Facebook, Google+, Tumblr, etc.

• “Tweet archivist” is used for analyzing tweets and exporting them to an Excel sheet.

Each sentiment tool has unique functions and purposes, and users have to choose particular tools based on their own requirements.

To our knowledge, numerous studies have focused on sentiment analysis through text mining, but few address sports fans. Hence, we believe this is a great opportunity to explore whether negative sentiment is more dominant among a group of users, especially among fans of a sports team. The five papers selected for the literature review have also helped us grasp how user motivation works in large social networks such as Twitter and Facebook.

3. Approach

Before setting up the approach for our study, we focused on the challenges the study might pose. First, we figured that making appropriate sense of the data and feeding it into the model effectively would be a big task, since an incorrect execution of this part would ruin the entire model. Therefore, importance was given to several open-ended questions that helped us use the data effectively in the model. The major ones are as follows:

• How and when to collect data and describe their properties – positive or negative?

• How to create a text document that encompasses a fan’s reaction over a long time range?

• How to segregate the data into positive and negative corpus?

• How to reduce confounds present in the data?

• How to develop graphical illustration?

• How to interpret the graphs/results?

• How to evaluate the model/framework we created?

While conducting our study, the above challenges were deemed most important. However, there were several parameters and sub-branches of these open-ended questions that we had to deal with while carrying out the methodology. As previously mentioned, the project seeks to deliver an appropriate framework for understanding sports fan behavior on social networks based on team performance. A secondary objective of this project was to show that negative remarks or news in a social network have a higher dispersion rate (spread) than positive remarks or news. In other words, users feel more driven to criticize; a similar drive is not observed when a performance is praiseworthy.

We selected a sports team that has a fair share of fans around the globe and plays games week in, week out: Manchester United, a soccer team based in England that plays in the top division of the English league. Based on its rich history and worldwide fan base, we felt it was a reasonable choice for the study. Next, we followed the team’s results over a span of five games and collected Twitter feeds, mainly through the twitteR package in the R programming language. We wanted the game results to vary (wins and losses), but unfortunately that was not the case: three of the five games were drawn and the last two were lost. Therefore we could only compare how the supporters behave after a draw and after a loss. In order to generate the framework, we took the following steps.

• Ran sentiment analysis on the data (Twitter feeds) and calculated the sentiment score.

• Created two corpuses – basically text documents containing each category of tweets (positive and negative).

• Created a third corpus by combining the positive and negative documents to evaluate the weight between positive and negative sentiment.

• Processed the corpuses in order to get rid of extraneous text present in the tweets (empty spaces, special characters, etc.); this was achieved by processing the corpuses in R with the various commands that are available.

• Finally after streamlining the data we explored what was present in each corpus and studied them visually with graphs and various diagrams.

• Identified high-frequency words that were dominant in the positive, negative and combined corpus through a cluster dendrogram.
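The last step in the list above, grouping frequent words with a cluster dendrogram, can be sketched on a small made-up matrix. This is a minimal illustration in base R, assuming a hypothetical table of word counts per document; the words, counts and the choice of two clusters are invented for the example and are not taken from the study's data.

```r
# Hypothetical word counts (rows = words, columns = documents); illustration only
toy_counts <- matrix(
  c(9, 8, 9, 8,
    8, 9, 8, 9,
    1, 0, 1, 0,
    0, 1, 0, 1,
    5, 5, 4, 5),
  nrow = 5, byrow = TRUE,
  dimnames = list(c("wolfsburg", "champions", "win", "goal", "fans"),
                  paste0("doc", 1:4))
)

# Distance between words, based on how often each appears in each document
toy_dist <- dist(toy_counts, method = "euclidean")

# Agglomerative clustering with Ward's method
toy_fit <- hclust(toy_dist, method = "ward.D")

# Cut the tree into two groups of words
toy_groups <- cutree(toy_fit, k = 2)
print(toy_groups)
```

Words that occur with similar counts across documents end up in the same group. Calling plot(toy_fit, hang = -1) and then rect.hclust(toy_fit, k = 2, border = "red") would draw the dendrogram with a border around each group.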

The above approach was applied successfully, and we were able to compare visually the dominant words present in all cluster types and thus explain what in particular provokes negative or positive sentiment among the fans. Observing the behavior over a longitudinal timeframe enabled us to compare how fans’ sentiment and perception change over time and how much they depend on the performance of the team they support.

4. Methods

For the majority of our study we used the R programming language for Twitter data mining, along with various packages to carry out the sentiment analysis and visualize the results with graphs and figures. The following packages (figure 4.1) were used in this study.

    library('twitteR')     # access the Twitter API
    library('base64enc')   # tools for handling base64 encoding
    library('httpuv')      # HTTP server, used for browser-based OAuth
    library('RCurl')       # HTTP client; set SSL options globally
    library('igraph')      # create graphs
    library('plyr')        # splitting, applying and combining data
    library('stringr')     # wrapper for common string operations
    library('ggplot2')     # create plots
    library('wordcloud')   # create word clouds
    library('tm')          # text mining
    library('SnowballC')   # remove common word endings (ing, ed)
    library('biclust')     # find clusters
    library('cluster')     # find groups in data

figure 4.1: R packages used

First, we created a Twitter account and a Twitter app to retrieve Twitter data through the native Twitter API. We established the connection with the following commands.

    KEY = "YOUR KEY"
    SECRET = "YOUR CONSUMER SECRET"
    setup_twitter_oauth(KEY, SECRET)

The “key” and “secret” values are replaced with the credentials provided by the Twitter API. After connecting to Twitter and loading the necessary packages into R, we moved to the data collection phase for our sentiment analysis. We began by loading into R the “positive” and “negative” word-bank lexicons provided by Hu and Liu.

    pos = readLines("[Directory]\\positive_words.txt")
    neg = readLines("[Directory]\\negative_words.txt")

Across the five games that were covered, a total of 40,000 tweets were collected in multiple sessions. In each session we collected 2,000 tweets.

    mufc_tweets = searchTwitter("manchester united", n=2000, lang="en")

The first set of data was collected within an hour after the first game. The following three sets were collected at 12-hour intervals. The data for the remaining games were collected in the same fashion. In the next step we extracted the text from the collected tweets.

    mufc_txt = sapply(mufc_tweets, function(x) x$getText())

The obtained text was then merged into a single vector, and a function was applied that counted, for each tweet, the “token” words present in the lexicon. We loaded the function (Appendix 1) that calculates the sentiment score and ran our “score.sentiment” function.

    scores = score.sentiment(mufc_txt, pos, neg, .progress='text')

The following equation was used to calculate the sentiment score:

\[ \text{Score} = \text{Number of positive words} - \text{Number of negative words} \]
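As a worked illustration of this score, the sketch below applies the same subtraction to a single made-up tweet in base R. The two short word lists and the helper name score_tweet are hypothetical stand-ins for the Hu and Liu lexicons and the function in Appendix 1, chosen only to keep the example self-contained.

```r
# Toy lexicons (hypothetical; the study used the Hu and Liu word lists)
pos_words <- c("good", "great", "win", "brilliant")
neg_words <- c("bad", "boring", "poor", "disappointing")

# Score one tweet: number of positive words minus number of negative words
score_tweet <- function(tweet, pos_words, neg_words) {
  clean <- tolower(gsub("[[:punct:]]", "", tweet))   # strip punctuation, lowercase
  words <- unlist(strsplit(clean, "\\s+"))           # split on whitespace
  sum(words %in% pos_words) - sum(words %in% neg_words)
}

example_tweet <- "Poor defending and a boring game, but a great goal!"
example_score <- score_tweet(example_tweet, pos_words, neg_words)
print(example_score)
```

The example tweet contains one positive word ("great") and two negative words ("poor", "boring"), so its score is 1 - 2 = -1.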

We flagged tweets whose score was 2 or more as “very positive” and tweets whose score was -2 or less as “very negative”. Because the score is the number of positive words minus the number of negative words, a “very positive” tweet contains at least two more positive words than negative words, and a “very negative” tweet at least two more negative words than positive words.

    scores$very.pos = as.numeric(scores$score >= 2)
    scores$very.neg = as.numeric(scores$score <= -2)

The set of parameters were:

\[ \text{very positive} = (\text{score} \geq 2) \]
\[ \text{very negative} = (\text{score} \leq -2) \]
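A small sketch of how these two thresholds divide a set of scored tweets is given below. The data frame is invented for illustration, and writing the two groups to text files with writeLines is an assumption made for the example, not the procedure used in the study.

```r
# Hypothetical tweets and scores, for illustration only
toy <- data.frame(
  text  = c("tweet a", "tweet b", "tweet c", "tweet d", "tweet e"),
  score = c(3, 1, 0, -2, -4),
  stringsAsFactors = FALSE
)

# Flag tweets using the same thresholds as above
toy$very.pos <- as.numeric(toy$score >= 2)
toy$very.neg <- as.numeric(toy$score <= -2)

# Pull out the text of each group
positive_tweets <- toy$text[toy$very.pos == 1]
negative_tweets <- toy$text[toy$very.neg == 1]

# Write each group to its own text file (temporary directory for the example)
out_dir <- tempdir()
writeLines(positive_tweets, file.path(out_dir, "positive_example.txt"))
writeLines(negative_tweets, file.path(out_dir, "negative_example.txt"))
```

Tweets with a score between -1 and 1 ("tweet b" and "tweet c" here) fall into neither group and are left out of both files.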

We then created a data frame and saved all the scores along with the tweets into an Excel file, so that the tweets were safely stored and could be used later for the final analysis. We applied an Excel “if statement” to separate the very positive and very negative tweets and copied each category (positive/negative) into text files (.txt). Each set of data generated two text documents (positive and negative). For the final analysis, there were 17 text documents in the positive corpus and 18 in the negative corpus. The combined corpus had 35 documents in total. The analysis was done in three separate phases: phase one studied only the positive documents, phase two analyzed the negative documents, and the last phase analyzed the combined corpus that contained both positive and negative documents.

Each corpus was loaded with the following commands:

    cname <- file.path("C:\\Users\\sojibahm\\Documents\\Tweets", "postivetexts")
    docs <- Corpus(DirSource(cname))

Commands to remove punctuation, special characters and blanks from the text documents:

    docs <- tm_map(docs, removePunctuation)
    # replace a pattern with a space while keeping the documents as tm objects
    toSpace <- content_transformer(function(x, pattern) gsub(pattern, " ", x))
    docs <- tm_map(docs, toSpace, "/")
    docs <- tm_map(docs, toSpace, "@")
    docs <- tm_map(docs, toSpace, "\\|")

Commands to remove numbers, convert all the text to lowercase and remove stop-words such as “a”, “is”, etc.:

    docs <- tm_map(docs, removeNumbers)
    docs <- tm_map(docs, content_transformer(tolower))
    docs <- tm_map(docs, removeWords, stopwords("english"))

Command to remove uninformative tokens that are of no interest to us, for example “https”:

    docs <- tm_map(docs, removeWords, c("https…", " httpst…"))

Command to make certain groups of words stick together so they preserve a particular entity or meaning:

    toLvg <- content_transformer(function(x) gsub("louis van gaal", "lvg", x))
    docs <- tm_map(docs, toLvg)

Command to remove word endings such as “ing”, “ed”, “s”, etc.:

    docs <- tm_map(docs, stemDocument)

Creating a Document Term Matrix (DTM):

    DTM <- DocumentTermMatrix(docs)

Remove sparse terms with a sparsity threshold of 0.5, which drops terms that are absent from at least half of the documents. This enabled us to focus on the words that matter and to reduce confounds.

    DTMS <- removeSparseTerms(DTM, 0.5)

Calculate and arrange words by frequency:

    freq <- colSums(as.matrix(DTMS))
    ord <- order(freq)

Arrange by decreasing frequency:

    freq <- sort(colSums(as.matrix(DTM)), decreasing=TRUE)

Histogram of the words that appear in the documents more than 150 times:

    wf <- data.frame(word=names(freq), freq=freq)   # word-frequency table for plotting
    p <- ggplot(subset(wf, freq>150), aes(word, freq))
    p <- p + geom_bar(stat="identity")
    p <- p + theme(axis.text.x=element_text(angle=55, hjust=1))
    p
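To make the sparsity filter and the frequency count more concrete, the sketch below rebuilds both steps by hand in base R on four made-up documents. It does not use the tm package; it assumes that the filter keeps a term only when the term appears in more than (1 - sparse) of the documents, which is how removeSparseTerms is understood to behave. The documents and variable names are hypothetical.

```r
# Hypothetical cleaned documents, for illustration only
toy_docs <- c(
  doc1 = "lvg fans win",
  doc2 = "fans win goal",
  doc3 = "fans bore draw",
  doc4 = "fans win lvg"
)

# Build a document-term matrix by hand: rows = documents, columns = terms
tokens <- strsplit(toy_docs, "\\s+")
terms  <- sort(unique(unlist(tokens)))
toy_dtm <- t(vapply(
  tokens,
  function(tok) as.integer(table(factor(tok, levels = terms))),
  integer(length(terms))
))
colnames(toy_dtm) <- terms

# Keep only terms present in more than (1 - sparse) of the documents
sparse   <- 0.5
doc_freq <- colSums(toy_dtm > 0)
toy_dtms <- toy_dtm[, doc_freq > nrow(toy_dtm) * (1 - sparse), drop = FALSE]

# Total frequency of each retained term, most frequent first
toy_freq <- sort(colSums(toy_dtms), decreasing = TRUE)
print(toy_freq)
```

With four documents and a threshold of 0.5, only terms found in at least three documents survive: "fans" (four documents) and "win" (three). "lvg" appears in only two documents and is dropped along with the words that occur once.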

Generating a word-cloud with words that were mentioned 35 times or more across the documents:

    wordcloud(names(freq), freq, min.freq=35, scale=c(5, .1), colors=brewer.pal(6, "Dark2"))

Generating a cluster dendrogram with red borders around five separate clusters (‘k’ denotes the number of clusters):

    d <- dist(t(DTMS), method="euclidean")
    fit <- hclust(d=d, method="ward.D")
    fit
    plot(fit, hang=-1)                    # draw the dendrogram first
    groups <- cutree(fit, k=5)
    rect.hclust(fit, k=5, border="red")   # then add borders to the existing plot

After running all the documents, the results obtained were one histogram, one word-cloud and one cluster dendrogram for each of the three corpora (positive, negative and combined). The graphs and the visual outputs obtained are explained in the following results section.

5. Results

After applying our methodology, we generated three types of visual representation, which were sufficient to form an understanding of what the fans are generally discussing on social media.

5.1 Positive Sentiment

The positive corpus histogram (figure 5.1.1) and word-cloud (figure 5.1.2) show that the most frequently mentioned words were rather vague and unclear: ‘fans’, ‘will’, ‘win’, etc. This suggests that the performance was not satisfactory and that supporters did not show much positive emotion. Studying the word-cloud, we can sense that the positive emotions are not tied to anything especially praiseworthy. The cluster dendrogram (figure 5.1.3), which clusters words together to indicate a dominant subject within the text, also fails to provide any concrete example of subjects on which fans show their positive emotions.

figure 5.1.1: positive sentiment histogram

figure 5.1.2: positive sentiment word-‐cloud

figure 5.1.3: positive sentiment cluster dendrogram

5.2 Negative Sentiment

Unlike the positive sentiment results, the negative results are consistent with the reactions of the fans. The histogram (figure 5.2.1) points to the two defeats suffered against “Wolfsburg” and “Bournemouth”, as these words have relatively more mentions than the other words. The word-cloud (figure 5.2.2) also highlights the words “bore” and “disappoint”, which suggests that the games were boring and the results disappointing. The dendrogram (figure 5.2.3) correctly points to the elimination from the Champions League following the defeat against Wolfsburg, and the fans were clear in showing their negative emotions in the tweets they posted.

figure 5.2.1: Negative sentiment histogram

figure 5.2.2: Negative sentiment word-cloud

figure 5.2.3: Negative sentiment cluster dendrogram

5.3 Combined Sentiment

Studying the combined sentiment allowed us to evaluate positive and negative sentiment side by side. From the graphs obtained we can see that negative emotions had greater emphasis than positive emotions. Both the histogram (figure 5.3.1) and the word-cloud (figure 5.3.2) highlight the words that were present in the negative tweet documents, such as “Wolfsburg”, “Bournemouth” and “Champions League”. The cluster dendrogram (figure 5.3.3) shows results similar to those for negative sentiment. Thus, we can conclude that the fans showed greater intent to share their negative emotions than their positive emotions.

figure 5.3.1: Combined sentiment histogram

figure 5.3.2: Combined sentiment word-‐cloud

figure 5.3.3: Combined sentiment dendrogram

6. Conclusion

The major shortcomings of this study lie in the dataset. We must consider that the data we used spans several weeks and as a result carries more confounding factors. The results therefore give us only a very current picture of fans’ behavior. The model is likely to work better at highlighting important sentiment features of fans with a longer longitudinal dataset, i.e. data collected over a period of several months rather than only five games. Moreover, more variation in game results (wins and losses rather than only draws and losses) would capture a broader view of the behavior of fans and make the results more robust.

Finally, studying our sentiment analysis, we can conclude that the model performed reasonably well in capturing fans’ behavior on Twitter. By visually studying the clusters and graphs obtained through sentiment analysis, we can reasonably infer the fans’ feelings and their current experience with the team. We identified multiple events about which fans expressed their sentiment. This indicates that the framework, if further enhanced, can play an integral role in capturing the motivational factors responsible for fans’ behavior in a social network. Through longitudinal analysis of users’ sentiment we may uncover supplementary parameters that are not visible at this moment. The study also suggests that text mining on social media such as Twitter is a useful tool for covering a large population and for generating predictions about different dynamics of user behavior and sentiment.

7. Reference

[1] Easley, David, and Jon Kleinberg. Networks, Crowds, and Markets: Reasoning About a Highly Connected World. Manhattan: Cambridge UP (2010), 2009. Print.

[2] Buccafurri, Francesco, Gianluca Lax, Serena Nicolazzo, and Antonino Nocera. "Comparing Twitter and Facebook User Behavior: Privacy and Other Aspects." Computers in Human Behavior 52 (2015): 87-95. Web.

[3] Eisenbeiss, Maik, Boris Blechschmidt, Klaus Backhaus, and Philipp Alexander Freund. "“The (Real) World Is Not Enough:” Motivational Drivers and User Behavior in Virtual Worlds." Journal of Interactive Marketing 26.1 (2012): 4-20. Web.

[4] Curras-Perez, Rafael, Carla Ruiz-Mafe, and Silvia Sanz-Blas. "Determinants of User Behavior and Recommendation in Social Networks." Industrial Management & Data Systems 114.9 (2014): 1477-498. Web.

[5] Adam D. I. Kramer, Jamie E. Guillory, Jeffrey T. Hancock. "Experimental Evidence of Massive-scale Emotional Contagion through Social Networks - Hiduth.com." Hiduth.com. N.p., 16 June 2015. Web. 30 Nov. 2015.

[6] Chatterjee, Ram, and Monika Goyal. "Tactics of Twitter Data Extraction for Opinion Mining." 2015 2nd International Conference on Computing for Sustainable Global Development.

[7] Bing Liu. "Sentiment Analysis and Subjectivity." Invited Chapter for the Handbook of Natural Language Processing, Second Edition. March, 2010.

Appendix 1. Sentiment function

    # requires the plyr (laply) and stringr (str_split) packages
    score.sentiment = function(sentences, pos.words, neg.words, .progress='none')
    {
      scores = laply(sentences, function(sentence, pos.words, neg.words) {
        # remove punctuation, control characters and digits
        sentence = gsub("[[:punct:]]", "", sentence)
        sentence = gsub("[[:cntrl:]]", "", sentence)
        sentence = gsub('\\d+', '', sentence)

        # convert to lowercase, returning NA if tolower() fails
        tryTolower = function(x) {
          y = NA
          try_error = tryCatch(tolower(x), error=function(e) e)
          if (!inherits(try_error, "error"))
            y = tolower(x)
          return(y)
        }
        sentence = sapply(sentence, tryTolower)

        # split into words (str_split is in the stringr package)
        word.list = str_split(sentence, "\\s+")
        words = unlist(word.list)

        # flag the words that appear in each lexicon
        pos.matches = match(words, pos.words)
        neg.matches = match(words, neg.words)
        pos.matches = !is.na(pos.matches)
        neg.matches = !is.na(neg.matches)

        # score = positive matches minus negative matches
        score = sum(pos.matches) - sum(neg.matches)
        return(score)
      }, pos.words, neg.words, .progress=.progress)

      scores.df = data.frame(text=sentences, score=scores)
      return(scores.df)
    }
