551 Final Report


Page 1: 551 Final Report

TERM PAPER

A Proposed Framework to Capture Supporters' Behavior through Sentiment Analysis and Text Mining

Sojib Ahmed

14 December 2015

1. Introduction

Social Network Analysis (SNA) offers a range of methods for understanding the dynamics of the users present in a network. Text mining, a key component of SNA, has proven valuable for extracting user behavior and motivation quickly and efficiently. In this study, we try to determine how sports fans behave on social media in response to the current performance of the team they support. We apply sentiment analysis to data collected from Twitter to gauge the reactions of fans around the globe. Because Twitter has grown into a massive platform where users from around the world share their views, it offers a suitable setting for our research and for forming an overall picture of what fans are happy or unhappy about.

Sentiment analysis makes it possible to understand the relationships among the factors present in a network. Applied to large datasets, it gives a finer-grained picture of a large group of supporters. With Twitter data, it is not confined to a single distribution; it also supports analysis of the “joint distribution” of, and the “association” between, the words that dominate the Twitter feed. Analyzing tweets with sentiment analysis can therefore help to identify positive and negative tendencies among Twitter participants; our focus is mainly on sports supporters.

The objective of this project is to provide a framework for conducting sentiment analysis and predicting the behavior of sports fans on social media based on their team's performance. During the data collection stage, we wanted to structure our dataset so that it captures fans' behavior over a longitudinal timeframe. This is an important and interesting part of our study, since it lets us identify how behavior changes over time. To achieve this, we collected our


data after each game, at successive intervals. The data used in the study covered five games in total, spanning almost a month, so that a broader insight into supporters' behavior could be obtained. Particular attention was given to word clustering and graphical visualization as a way to identify the motivational factors that drive users toward positive versus negative remarks.

Graphical visualization also enabled us to locate the major subjects that users focus on, depending on their team's performance. Our initial plan was to carry out a hypothesis test between two datasets, a positive word cluster and a negative word cluster, using UCINET, and to identify the subjects/words that are significant for understanding the positive or negative sentiment present in the dataset. However, based on expert opinion, and given the confounding factors present and the bias among supporters, the hypothesis test was replaced with graphical visualization, since it provided a clear-cut analysis and representation of how supporters behave. Moreover, the presence of lurking variables (noisy data) could easily shift the hypothesis test results and lead us to false conclusions about our data. We believe that analysis through graph visualization has enabled us to understand the positive and negative sentiment and to focus only on the subjects/words that are dominant. In this report, we give a general overview of how the generated graphs can help us understand the behavior of supporters.

A correct execution of the study will provide greater insight into users' behavior and into how their sentiment depends on their team's performance. Furthermore, the framework can be applied in other studies that observe the impact of positivity and negativity on growth and sustainability; the subject could be an artist, a clothing brand, a newly launched product or service, or a new automobile about to hit the market.

In the remainder of the report, we first review literature on different aspects of social network analysis and on the benefits of analysis through text mining. We then discuss the motivation and approach that drove us to conduct this study. Next, we discuss the methods used to conduct text mining and sentiment analysis. The methods section proceeds step by step, in detail, to show how the tools and techniques were applied to generate our results. The results section reports the main findings and interprets the graphs obtained through those methods. We include the important graphs, interpret them, and convey why the results obtained are useful


and what they mean. In the conclusion section, we cover the benefits as well as the challenges and limitations of the study.

2. Literature Review

In this section, we draw on insights from the social network literature to build intuition about the factors that motivate user behavior from a social network viewpoint. While conducting the literature review, we came across various papers that focus on the business sector and on how business operations are strongly influenced by the behavior of social network users. Although little work has addressed fans' reactions to team performance in sports, we can draw an analogy from a business firm's perspective to understand the effect of consumers' actions on online social networks. We also discuss a few papers that cover the capabilities of text mining and sentiment analysis and their application in various sectors.

2.1 “The (Real) World Is Not Enough:” Motivational Drivers and User Behavior in Virtual Worlds

This study uses social influence to describe user behavior and the drivers that trigger user participation from a business viewpoint. It portrays the motivational drivers responsible for user engagement in a “marketing-relevant context”. The study finds that in a social network, “socializing, creativity and escape emerge as individual drivers”. It also identifies various important characteristics of users; given the variability among the actors present in a network, the study does a decent job of capturing distinct motivational drivers and presenting them in segments. Although the focus of that paper differs considerably from ours, we believe it helped us grasp the importance of social network analysis from a slightly different context.

2.2 “Comparing Twitter and Facebook user behavior: Privacy and other aspects”

When it comes to online social media, Facebook and Twitter are the two networks that most commonly come to mind. The paper compares user behavior on Facebook and Twitter. While a primary part of the study focuses on users' privacy, the paper also provides some interesting facts about user behavior from other angles: friend overlap, friend distribution, and user activity. In a nutshell, it gives a rough idea of the common factors present in widely used social media such as Facebook and Twitter.


2.3 “Experimental Evidence of Massive-scale Emotional Contagion through Social Networks”

The objective of the study was to find out whether negative and positive emotional states can be transferred to other people through emotional contagion, leading people to experience similar emotions without their awareness. Facebook, the largest online social network, was chosen for the study. The purpose of the experiment was to test for emotional contagion among Facebook users by controlling their exposure to emotional content in the News Feed. The experiment covered both positive and negative emotions by running two parallel conditions: reducing exposure to friends' positive emotional content in one, and reducing exposure to friends' negative emotional content in the other. The results indicate emotional contagion. When positive content in the News Feed was reduced, people used a larger proportion of negative words and a smaller proportion of positive words in their status updates. The reverse pattern was observed when negative content was reduced. Hence, the results support the claim that emotional contagion can arise through social networks without in-person interaction.

The results highlight several properties of emotional contagion. Mere exposure, or non-exposure, to a friend's emotional expression in the News Feed is sufficient to affect one's own emotion. Moreover, nonverbal cues are not required for contagion, since textual content alone can produce it. Imitation alone cannot explain the “cross-emotional encouragement effect” (e.g., a reduction in negative posts resulted in an increase in positive posts). The effect sizes for reducing positivity and reducing negativity were also similar. Exposure to fewer emotional posts led people to be less expressive (“withdrawal effect”) over the following days, which shows how emotional expression influences social interaction online.

2.4 “Sentiment Analyses and Opinion Mining”

This paper explains sentiment analysis and how opinion mining is applied to gather sentiment data on a user group. Opinion mining captures the positive, negative, or neutral perception that people on the social web hold about a specific commodity, issue, or individual. The authors assert that opinion mining is important for gathering significant information from the huge collection of data on the web. The study discusses the importance of opinion mining for Twitter data and reviews numerous sentiment tools for Twitter data extraction. Twitter is the most popular microblog, receiving over “500 million tweets” per day.


2.5 Tactics of Twitter Data Extraction for Opinion Mining

The paper highlights the available tools for Twitter data extraction and opinion mining. Twitter was chosen for sentiment analysis in this study. Both programming and non-programming approaches can be used to extract data from Twitter. R is a popular language for the programming approach and runs on Windows, Linux, Mac, etc. The programming method offers numerous advantages, such as applying quality control rules, language detection algorithms, and blacklisting of spam words. However, non-programming techniques are simpler to implement and more flexible. Sentiment tools are less time-consuming than the programming method and are very efficient at extracting Twitter data. The sentiment tools mentioned in the study are listed below:

• “Sentiment140” helps to discover the current sentiment or recent tweets on a topic.
• “Sentiment viz” focuses on distinctive visualization techniques.
• “Topsy” is a social media analytics tool that can deliver a greater number of tweets than any other sentiment tool.
• “Trackur” can track anything that is said on Twitter, Facebook, Google+, Tumblr, etc.
• “Tweet archivist” is used for analyzing tweets and exporting them to an Excel sheet.

Each sentiment tool has unique functions and purposes, and users have to choose particular tools based on their own requirements.

To our knowledge, numerous studies have focused on sentiment analysis through text mining. We believe this is a good opportunity to explore whether negative sentiment is more dominant among a group of users, especially among sports team fans. The five papers selected for the literature review also helped us grasp how user motivation works in large social networks such as Twitter and Facebook.

3. Approach

Before setting up the approach for our study, we considered the challenges the study might pose. First, we figured that making appropriate sense of the data and feeding it into the model effectively would be a big task, since an incorrect execution of this part would ruin the entire model. We therefore gave importance to several open-ended questions that helped us use the data effectively in the model. The major ones are as follows:

• How and when to collect data and describe their properties (positive or negative)?
• How to create a text document that encompasses fans' reactions over a long time range?
• How to segregate the data into positive and negative corpora?
• How to reduce the confounds present in the data?
• How to develop graphical illustrations?
• How to interpret the graphs/results?
• How to evaluate the model/framework we created?

While conducting our study, we deemed the above challenges the most important. However, there were several parameters and sub-branches of these open-ended questions that we had to deal with while carrying out the methodology. As previously mentioned, the project seeks to deliver an appropriate framework for understanding sports fans' behavior on social networks based on their team's performance. A secondary objective was to show that negative remarks/news in a social network have a higher dispersion rate (spread) than positive remarks/news. In other words, users feel more driven to criticize; a similar drive is not observed when a performance is praiseworthy.

We selected a sports team that has a fair share of fans around the globe and plays frequent games week in, week out. We selected Manchester United, a soccer team based in England that plays in the top division of the English league. Given its rich history and worldwide fan base, we felt it was a reasonable choice for the study. Next, we followed the team's results over a span of five games and collected Twitter feeds, mainly through the “twitteR” package in the R programming language. We wanted the game results to vary (wins and losses), but unfortunately that was not the case: three of the five games were drawn and the last two were lost. We could therefore only compare how supporters behave after a draw and after a loss.

To generate the framework, we took the following steps.

• Ran sentiment analysis on the data (Twitter feeds) and calculated the sentiment score.
• Created two corpora, i.e. sets of text documents, one for each category of tweets (positive and negative).
• Created a third corpus by combining the positive and negative documents, to evaluate the weight between positive and negative sentiment.
• Processed the corpora to remove extraneous text present in the tweets (empty spaces, special characters, etc.); this was done in R with the available text-processing commands.
• After streamlining the data, explored what was present in each corpus and studied it visually with graphs and diagrams.
• Identified high-frequency words that were dominant in the positive, negative, and combined corpora through cluster dendrograms.

The above approach was applied successfully, and we were able to compare visually the dominant words present in each corpus and thus explain what in particular provokes negative or positive sentiment among the fans. Observing the behavior over a longitudinal timeframe enabled us to compare how fans' sentiment and perception change over time and how much they depend on the performance of the team they support.

4. Methods

For most of our study we used the R programming language for Twitter data mining, together with various packages to carry out the sentiment analysis and then visualize the results with graphs and figures. The following packages (figure 4.1) were used in this study.

    library('twitteR')    # access to the Twitter API
    library('base64enc')  # tools for handling base64 encoding
    library('httpuv')     # HTTP and WebSocket server support
    library('RCurl')      # HTTP client (SSL settings)
    library('igraph')     # create graphs
    library('plyr')       # splitting, applying and combining data
    library('stringr')    # wrapper for common string operations
    library('ggplot2')    # create plots
    library('wordcloud')  # create word-clouds
    library('tm')         # text mining
    library('SnowballC')  # stemming: remove common word endings (ing, ed)
    library('biclust')    # find clusters
    library('cluster')    # find groups in data

 figure  4.1:  R  packages  used  

 


First, we created a Twitter account and a Twitter app in order to retrieve Twitter data through the native Twitter API. We established the connection with the following commands.

    KEY = "YOUR KEY"
    SECRET = "YOUR CONSUMER SECRET"
    setup_twitter_oauth(KEY, SECRET)

The “KEY” and “SECRET” values are replaced with the credentials provided by the Twitter API for the app.

After connecting to Twitter and loading the necessary packages into R, we moved to the data collection phase for our sentiment analysis. We began by loading the “positive” and “negative” word-bank lexicons provided by Hu and Liu into R.

    pos = readLines("[Directory]\\positive_words.txt")
    neg = readLines("[Directory]\\negative_words.txt")

Across the five games covered, a total of 40,000 tweets were collected in multiple sessions. In each session we collected 2,000 tweets.

    mufc_tweets = searchTwitter("manchester united", n=2000, lang="en")

The first set of data was collected within an hour after the first game. The following three sets were collected at 12-hour intervals. The data for the remaining games were collected in the same fashion. In the next step, we extracted the text from the collected tweets.

    mufc_txt = sapply(mufc_tweets, function(x) x$getText())

The obtained text was then merged into a single vector, and a function was applied that counted the lexicon “token” words carried by each tweet. We loaded the function (Appendix 1) that calculates the sentiment score and ran our “score.sentiment” function.

    scores = score.sentiment(mufc_txt, pos, neg, .progress='text')

The following equation was used to calculate the sentiment score:

   

$$\text{Score} = \text{Number of positive words} - \text{Number of negative words}$$
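As a hedged illustration of this score, the sketch below applies it to a single made-up tweet in base R. The word lists and the tweet are invented for illustration; they are not taken from the Hu and Liu lexicon files or from the collected data.

```r
# Toy lexicons (hypothetical; the study used the Hu and Liu word lists)
pos.words <- c("great", "win", "love")
neg.words <- c("bore", "disappoint", "poor")

# A made-up tweet, lowercased and split on whitespace
tweet <- "Great win but a poor second half"
words <- strsplit(tolower(tweet), "\\s+")[[1]]

# Score = number of positive words - number of negative words
n.pos <- sum(words %in% pos.words)   # matches "great" and "win"
n.neg <- sum(words %in% neg.words)   # matches "poor"
score <- n.pos - n.neg
print(score)
```

Here the tweet contains two positive words and one negative word, so its score is 1.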


We then set thresholds on the score to locate strongly worded tweets: a tweet was labeled “very positive” if its score was 2 or higher, and “very negative” if its score was -2 or lower.

    scores$very.pos = as.numeric(scores$score >= 2)
    scores$very.neg = as.numeric(scores$score <= -2)

The thresholds were:

$$\text{very positive}: \ \text{score} \geq 2, \qquad \text{very negative}: \ \text{score} \leq -2$$
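The thresholds can be illustrated on a toy data frame. This is only a sketch: the tweets and scores are invented, and the variable names `pos.tweets` and `neg.tweets` are hypothetical helpers that do not appear in the study.

```r
# Hypothetical scores for four made-up tweets
scores <- data.frame(
  text  = c("great win love it", "poor boring game",
            "match at 3pm", "awful awful display"),
  score = c(3, -2, 0, -2),
  stringsAsFactors = FALSE
)

# Flag tweets at or beyond the thresholds (1 = yes, 0 = no)
scores$very.pos <- as.numeric(scores$score >= 2)
scores$very.neg <- as.numeric(scores$score <= -2)

# Split the tweet text by flag; neutral tweets fall into neither group
pos.tweets <- scores$text[scores$very.pos == 1]
neg.tweets <- scores$text[scores$very.neg == 1]

print(pos.tweets)
print(neg.tweets)
```

A tweet with a score of exactly 2 or -2 is included in its group, because both comparisons are inclusive. Each vector could then be written to a text file, for example with `writeLines()`.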

We then created a data frame and saved all the scores along with the tweets into an Excel file, so that the tweets were safely stored and could be used later for the final analysis. We applied an Excel “IF statement” to separate the very positive and very negative tweets, and copied each category (positive/negative) into text files (.txt). Each set of data generated two text documents (positive and negative). For the final analysis, there were 17 text documents in the positive corpus and 18 in the negative corpus. The combined corpus had 35 documents in total.

The analysis was done in three separate phases: phase one, where only the positive documents were studied; phase two, where the negative documents were analyzed; and a final phase, where the combined corpus containing both positive and negative documents was analyzed.

Each corpus was loaded with the following commands:

    cname <- file.path("C:\\Users\\sojibahm\\Documents\\Tweets", "postivetexts")
    docs <- Corpus(DirSource(cname))

Commands to remove punctuation, special characters, and blanks from the text documents:

    docs <- tm_map(docs, removePunctuation)
    # replace leftover special characters with a space
    toSpace <- content_transformer(function(x, pattern) gsub(pattern, " ", x))
    docs <- tm_map(docs, toSpace, "/")
    docs <- tm_map(docs, toSpace, "@")
    docs <- tm_map(docs, toSpace, "\\|")

Commands to remove numbers, convert all text to lowercase, and remove stop-words such as “a”, “is”, etc.:

    docs <- tm_map(docs, removeNumbers)


    docs <- tm_map(docs, content_transformer(tolower))
    docs <- tm_map(docs, removeWords, stopwords("english"))

Command to remove some ambiguous words that are of no interest to us, for example “https”:

    docs <- tm_map(docs, removeWords, c("https…", " httpst…"))

Command to replace certain word groups with a single token so that they preserve a particular entity or meaning:

    lvg <- content_transformer(function(x) gsub("louis van gaal", "lvg", x))
    docs <- tm_map(docs, lvg)

Command to stem words by removing endings such as “ing”, “ed”, “s”, etc.:

    docs <- tm_map(docs, stemDocument)

Creating a Document Term Matrix (DTM):

    DTM <- DocumentTermMatrix(docs)

Remove sparse terms with a sparsity threshold of 0.5, i.e. terms that are absent from more than about half of the documents. This enabled us to focus on the words that matter and to remove confounds.

    DTMS <- removeSparseTerms(DTM, 0.5)

Calculate and arrange words by frequency:

    freq <- colSums(as.matrix(DTMS))
    ord <- order(freq)

Arrange by decreasing frequency:

    freq <- sort(colSums(as.matrix(DTM)), decreasing=TRUE)

Histogram of the words that appear in the documents more than 150 times:

    wf <- data.frame(word=names(freq), freq=freq)   # word-frequency table used for plotting
    p <- ggplot(subset(wf, freq>150), aes(word, freq))
    p <- p + geom_bar(stat="identity")
    p <- p + theme(axis.text.x=element_text(angle=55, hjust=1))
    p
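To make the sparsity and frequency steps concrete, here is a small base-R sketch on an invented document-term matrix. It mimics the idea behind `removeSparseTerms` (dropping terms that are missing from most documents) without calling tm, so the exact boundary behavior may differ from the package. The counts and terms are made up.

```r
# Toy document-term matrix: 4 documents (rows) x 3 terms (columns)
m <- matrix(c(5, 2, 0,
              3, 1, 0,
              4, 0, 0,
              6, 3, 7),
            nrow = 4, byrow = TRUE,
            dimnames = list(paste0("doc", 1:4),
                            c("fan", "win", "wolfsburg")))

# Share of documents in which each term is absent (its sparsity)
sparsity <- colMeans(m == 0)

# Keep only terms whose sparsity is below 0.5
dense <- m[, sparsity < 0.5, drop = FALSE]

# Total frequency of the remaining terms, highest first
freq <- sort(colSums(dense), decreasing = TRUE)
print(sparsity)
print(freq)
```

In this toy case the third term is dropped even though it has a high count, because it occurs in only one of the four documents. The filter therefore favors terms that are spread across documents over terms concentrated in a single one.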


Generating a word-cloud with words that were mentioned 35 times or more across the documents:

    wordcloud(names(freq), freq, min.freq=35, scale=c(5, .1), colors=brewer.pal(6, "Dark2"))

Generating a cluster dendrogram with red borders around five separate clusters (“k” denotes the number of clusters):

    d <- dist(t(DTMS), method="euclidean")   # distances between terms
    fit <- hclust(d=d, method="ward.D")
    plot(fit, hang=-1)                       # draw the dendrogram first
    groups <- cutree(fit, k=5)               # assign each term to one of 5 clusters
    rect.hclust(fit, k=5, border="red")      # then outline the clusters

After running all the documents, the results obtained were one histogram, one word-cloud, and one cluster dendrogram for each of the three corpora (positive, negative, and combined). The graphs and visual outputs obtained are explained in the following results section.

5. Results

After applying our methodology, we generated three types of visual representation, which were sufficient to form an understanding of what the fans were generally discussing on social media.

5.1 Positive Sentiment

The positive corpus histogram (figure 5.1.1) and word-cloud (figure 5.1.2) show that the most frequently mentioned words were rather vague and generic: ‘fans’, ‘will’, ‘win’, etc. This suggests that the performance was not satisfactory and that supporters expressed little positive emotion. The word-cloud likewise gives the impression that the positive emotions are weak. The cluster dendrogram (figure 5.1.3), which clusters words together to indicate a dominant subject within the text, also fails to reveal any concrete subject about which fans express positive emotions.


 figure  5.1.1:  positive  sentiment  histogram  

 figure  5.1.2:  positive  sentiment  word-­‐cloud    


 figure  5.1.3:  positive  sentiment  cluster  dendrogram  

5.2 Negative Sentiment

Unlike the positive sentiment results, the negative results are in line with the fans' reactions. The histogram (figure 5.2.1) points to the two defeats suffered against “Wolfsburg” and “Bournemouth”, as these words have more mentions than the other words. The word-cloud (figure 5.2.2) also highlights the words “bore” and “disappoint”, which suggests that fans found the games boring and the results disappointing. The dendrogram (figure 5.2.3) picks out the elimination from the Champions League following the defeat against Wolfsburg, and the fans were clear in expressing their negative emotions in the tweets they posted.


 figure  5.2.1:  Negative  sentiment  histogram  

 

 figure 5.2.2: Negative sentiment word-cloud


 figure  5.2.3:  Negative  sentiment  cluster  dendrogram    

5.3 Combined Sentiment

Studying the combined sentiment allowed us to weigh positive and negative sentiment side by side. From the graphs obtained, we can see that negative emotions had greater emphasis than positive emotions. Both the histogram (figure 5.3.1) and the word-cloud (figure 5.3.2) highlight words that were present in the negative tweet documents, such as “Wolfsburg”, “Bournemouth”, and “Champions League”. The cluster dendrogram (figure 5.3.3) shows results similar to those for negative sentiment. Thus, we conclude that the fans showed greater intent to share their negative emotions than their positive emotions.


 figure  5.3.1:  Combined  sentiment  histogram  

 figure  5.3.2:  Combined  sentiment  word-­‐cloud  


 figure  5.3.3:  Combined  sentiment  dendrogram  

6. Conclusion

The major shortcomings of this study lie in the dataset. We must keep in mind that the data span several weeks and are therefore exposed to more confounding factors, while the results reflect only fans' current behavior. The model would likely work better at highlighting important sentiment features with a longer longitudinal dataset, i.e. data collected over several months rather than only five games. Moreover, more variation in game results (wins and losses rather than only draws and losses) would capture a broader picture of fans' behavior and make the results more robust.

Finally, from our sentiment analysis we conclude that the model performed reasonably well at capturing fans' behavior on Twitter. By visually studying the


clusters and graphs obtained through sentiment analysis, we can reasonably infer fans' feelings and their current experience with the team. We identified multiple events in response to which fans expressed their sentiment. This indicates that the framework, if further enhanced, can play an integral role in capturing the motivational factors responsible for fans' behavior on social networks. A longer longitudinal analysis of users' sentiment could uncover additional parameters that are not visible at this moment. The study also suggests that text mining on social media such as Twitter is a useful tool for covering a large population and for making predictions about different dynamics of users' behavior and sentiment.

7. References

[1] Easley, David, and Jon Kleinberg. Networks, Crowds, and Markets: Reasoning About a Highly Connected World. New York: Cambridge UP, 2010. Print.

[2] Buccafurri, Francesco, Gianluca Lax, Serena Nicolazzo, and Antonino Nocera. "Comparing Twitter and Facebook User Behavior: Privacy and Other Aspects." Computers in Human Behavior 52 (2015): 87-95. Web.

[3] Eisenbeiss, Maik, Boris Blechschmidt, Klaus Backhaus, and Philipp Alexander Freund. "“The (Real) World Is Not Enough:” Motivational Drivers and User Behavior in Virtual Worlds." Journal of Interactive Marketing 26.1 (2012): 4-20. Web.

[4] Curras-Perez, Rafael, Carla Ruiz-Mafe, and Silvia Sanz-Blas. "Determinants of User Behavior and Recommendation in Social Networks." Industrial Management & Data Systems 114.9 (2014): 1477-498. Web.

[5] Kramer, Adam D. I., Jamie E. Guillory, and Jeffrey T. Hancock. "Experimental Evidence of Massive-scale Emotional Contagion through Social Networks - Hiduth.com." Hiduth.com. N.p., 16 June 2015. Web. 30 Nov. 2015.

[6] Chatterjee, Ram, and Monika Goyal. "Tactics of Twitter Data Extraction for Opinion Mining." 2015 2nd International Conference on Computing for Sustainable Global Development.

[7] Liu, Bing. "Sentiment Analysis and Subjectivity." Invited chapter for the Handbook of Natural Language Processing, Second Edition. March 2010.


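Appendix

1. Sentiment function

    library(plyr)     # provides laply
    library(stringr)  # provides str_split

    score.sentiment = function(sentences, pos.words, neg.words, .progress='none')
    {
      scores = laply(sentences, function(sentence, pos.words, neg.words) {
        # remove punctuation, control characters and digits
        sentence = gsub("[[:punct:]]", "", sentence)
        sentence = gsub("[[:cntrl:]]", "", sentence)
        sentence = gsub('\\d+', '', sentence)

        # lowercase with error handling: return NA if tolower() fails
        tryTolower = function(x) {
          y = NA
          try_error = tryCatch(tolower(x), error=function(e) e)
          if (!inherits(try_error, "error"))
            y = tolower(x)
          return(y)
        }
        sentence = sapply(sentence, tryTolower)

        # split the sentence into words with str_split (stringr package)
        word.list = str_split(sentence, "\\s+")
        words = unlist(word.list)

        # compare the words against the positive and negative lexicons
        pos.matches = match(words, pos.words)
        neg.matches = match(words, neg.words)

        # match() returns a position or NA; convert to TRUE/FALSE
        pos.matches = !is.na(pos.matches)
        neg.matches = !is.na(neg.matches)

        # score = number of positive matches - number of negative matches
        score = sum(pos.matches) - sum(neg.matches)
        return(score)
      }, pos.words, neg.words, .progress=.progress)

      # one row per tweet: the text and its score
      scores.df = data.frame(text=sentences, score=scores)
      return(scores.df)
    }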