SomeChallengingProblemsin...

53
Some Challenging Problems in Mining Social Media Arizona State University Data Mining and Machine Learning Lab April 22, 2014 1 Some Challenging Problems in Mining Social Media Huan Liu Joint work with Ali Abbasi Shamanth Kumar Fred Morsta?er Reza Zafarani Jiliang Tang

Transcript of SomeChallengingProblemsin...

Page 1: SomeChallengingProblemsin Mining%Social%Media%huanliu/dmml_presentation/2014/...Arizona%State%University% Some%Challenging%Problems%in%Mining%Social%Media% %Data%Mining%and%Machine%Learning%Lab!

Some  Challenging  Problems  in  Mining  Social  Media  Arizona  State  University    Data  Mining  and  Machine  Learning  Lab   April  22,  2014   1  

Some  Challenging  Problems  in  Mining  Social  Media  

Huan  Liu  Joint  work  with  

Ali  Abbasi  Shamanth    Kumar   Fred  Morsta?er  Reza  Zafarani   Jiliang  Tang  

Page 2: SomeChallengingProblemsin Mining%Social%Media%huanliu/dmml_presentation/2014/...Arizona%State%University% Some%Challenging%Problems%in%Mining%Social%Media% %Data%Mining%and%Machine%Learning%Lab!

Some  Challenging  Problems  in  Mining  Social  Media  Arizona  State  University    Data  Mining  and  Machine  Learning  Lab   April  22,  2014   2  2  

Social  Media  Mining  by  Cambridge  University  Press  

h?p://dmml.asu.edu/smm/  

Page 3: SomeChallengingProblemsin Mining%Social%Media%huanliu/dmml_presentation/2014/...Arizona%State%University% Some%Challenging%Problems%in%Mining%Social%Media% %Data%Mining%and%Machine%Learning%Lab!

Some  Challenging  Problems  in  Mining  Social  Media  Arizona  State  University    Data  Mining  and  Machine  Learning  Lab   April  22,  2014   3  

TradiIonal  Media  and  Data  

Broadcast  Media  One-­‐to-­‐Many  

CommunicaHon  Media  One-­‐to-­‐One   TradiIonal  Data  

Page 4: SomeChallengingProblemsin Mining%Social%Media%huanliu/dmml_presentation/2014/...Arizona%State%University% Some%Challenging%Problems%in%Mining%Social%Media% %Data%Mining%and%Machine%Learning%Lab!

Some  Challenging  Problems  in  Mining  Social  Media  Arizona  State  University    Data  Mining  and  Machine  Learning  Lab   April  22,  2014   4  

Social  Media:  Many-­‐to-­‐Many  

•  Everyone  can  be  a  media  outlet  or  producer  •  Disappearing  communicaHon  barrier  •  DisHnct  characterisHcs  – User  generated  content:  Massive,  dynamic,  extensive,  instant,  and  noisy  

–  Rich  user  interacHons:  Linked  data  –  CollaboraHve  environment,  and  wisdom  of  the  crowd  – Many  small  groups  (the  long  tail  phenomenon)  – AQenHon  is  expensive  

 

4  

Page 5: SomeChallengingProblemsin Mining%Social%Media%huanliu/dmml_presentation/2014/...Arizona%State%University% Some%Challenging%Problems%in%Mining%Social%Media% %Data%Mining%and%Machine%Learning%Lab!

Some  Challenging  Problems  in  Mining  Social  Media  Arizona  State  University    Data  Mining  and  Machine  Learning  Lab   April  22,  2014   5  

Unique  Features  of  Social  Media  

•  Novel  phenomena  observed  from  people’s  interac(ons  in  social  media    

•  Unprecedented  opportuniHes  for  interdisciplinary  and  collabora(ve  research  – How  to  use  social  media  to  study  human  behavior?  •  It’s  rich,  noisy,  free-­‐form,  and  definitely  BIG  

– With  so  much  data,  how  can  we  make  sense  of  it?  • PuZng  “bricks”  into  a  useful  (meaningful)  “edifice”  • Developing  new  methods/tools  for  social  media  mining  

Page 6: SomeChallengingProblemsin Mining%Social%Media%huanliu/dmml_presentation/2014/...Arizona%State%University% Some%Challenging%Problems%in%Mining%Social%Media% %Data%Mining%and%Machine%Learning%Lab!

Some  Challenging  Problems  in  Mining  Social  Media  Arizona  State  University    Data  Mining  and  Machine  Learning  Lab   April  22,  2014   6  

Some  Challenges  in  Mining  Social  Media  

•  EvaluaHon  Dilemma  – How  to  evaluate  without  convenHonal  test  data?  

•  Sampling  Bias  – O`en  we  get  a  small  sample  of  (sHll  big)  data.  How  can  we  ensure  if  the  data  can  lead  to  credible  findings?    

•  Noise-­‐Removal  Fallacy  – How  do  we  remove  noise  without  losing  too  much?    

•  Studying  Distrust  in  Social  Media  –  Is  distrust  simply  the  negaHon  of  trust?  Where  to  find  distrust  informaHon  with  “one-­‐way”  relaHons?    

Page 7: SomeChallengingProblemsin Mining%Social%Media%huanliu/dmml_presentation/2014/...Arizona%State%University% Some%Challenging%Problems%in%Mining%Social%Media% %Data%Mining%and%Machine%Learning%Lab!

Some  Challenging  Problems  in  Mining  Social  Media  Arizona  State  University    Data  Mining  and  Machine  Learning  Lab   April  22,  2014   7  7  

•  EvaluaHon  is  important  in  data  mining  – TradiHonal  test  data  is  o`en  not  available  in  social  media  mining  

•  Can  we  evaluate  our  findings  without  ground  truth?  

•  A  case  study  of  Migra(on  in  Social  Media  – Users  are  a  primary  source  of  revenue  – New  social  media  sites  need  to  aQract  users  – ExisHng  sites  need  to  retain  their  users  – CompeHHon  for  precious  aQenHon  

EvaluaIon  Dilemma    

Page 8: SomeChallengingProblemsin Mining%Social%Media%huanliu/dmml_presentation/2014/...Arizona%State%University% Some%Challenging%Problems%in%Mining%Social%Media% %Data%Mining%and%Machine%Learning%Lab!

Some  Challenging  Problems  in  Mining  Social  Media  Arizona  State  University    Data  Mining  and  Machine  Learning  Lab   April  22,  2014   8  

MigraIon  in  Social  Media  

• What  is  migraHon?  – MigraHon  can  be  described  as  the  movement  of  users  away  from  one  locaHon  toward  another,  either  due  to  necessity,  or  aQracHon  to  the  new  environment.  

•  Two  types  of  migraHon    – Site  migraHon  

– AQenHon  migraHon  

Site  2  

Site  3  Site  1  Site  2   Site  3  A`er  

Hme  t  

Site  2  

Site  3  Site  1  

A`er  Hme  t  

Site  2  

Site  3  Site  1  

Page 9: SomeChallengingProblemsin Mining%Social%Media%huanliu/dmml_presentation/2014/...Arizona%State%University% Some%Challenging%Problems%in%Mining%Social%Media% %Data%Mining%and%Machine%Learning%Lab!

Some  Challenging  Problems  in  Mining  Social  Media  Arizona  State  University    Data  Mining  and  Machine  Learning  Lab   April  22,  2014   9  

Obtaining  User  MigraIon  Pa?erns  

•  Goal:  IdenHfying  trends  of  aQenHon  migraHon  of  users  across  the  two  phases  of  the  collected  data.  

•  Process  

Page 10: SomeChallengingProblemsin Mining%Social%Media%huanliu/dmml_presentation/2014/...Arizona%State%University% Some%Challenging%Problems%in%Mining%Social%Media% %Data%Mining%and%Machine%Learning%Lab!

Some  Challenging  Problems  in  Mining  Social  Media  Arizona  State  University    Data  Mining  and  Machine  Learning  Lab   April  22,  2014   10  

Pa?erns  from  ObservaIon  

Page 11: SomeChallengingProblemsin Mining%Social%Media%huanliu/dmml_presentation/2014/...Arizona%State%University% Some%Challenging%Problems%in%Mining%Social%Media% %Data%Mining%and%Machine%Learning%Lab!

Some  Challenging  Problems  in  Mining  Social  Media  Arizona  State  University    Data  Mining  and  Machine  Learning  Lab   April  22,  2014   11  

Facing  an  EvaluaIon  Dilemma  

•  Important  to  know  if  they  are  some  meaningful  paQerns  –  If  yes,  we  invesHgate  further  how  we  use  the  paQerns  for  prevenHon  or  promoHon  

–  If  not,  why  not?  And  what  can  we  do?  •  We  would  like  to  evaluate  migraHon  paQerns,  but  without  ground  truth  

•  How?    – User  study  or  AMT?    

Page 12: SomeChallengingProblemsin Mining%Social%Media%huanliu/dmml_presentation/2014/...Arizona%State%University% Some%Challenging%Problems%in%Mining%Social%Media% %Data%Mining%and%Machine%Learning%Lab!

Some  Challenging  Problems  in  Mining  Social  Media  Arizona  State  University    Data  Mining  and  Machine  Learning  Lab   April  22,  2014   12  

EvaluaIng  Pa?erns’  Validity  

•  One  way  is  to  verify  if  these  paQerns  are  fortuitous    

•  Null  Hypothesis:  Migra(on  in  social  media  is  a  random  process  – GeneraHng  another  similar  dataset  for  comparison  •  PotenHal  migraHng  populaHon  includes  overlapping  users  from  Phase  1  and  Phase  2  •  Shuffled  datasets  are  generated  by  picking  random  acHve  users  from  the  potenHal  migraHng  populaHon  •  The  number  of  random  users  selected  for  each  dataset  is  the  same  as  the  real  migraHng  populaHon  

Page 13: SomeChallengingProblemsin Mining%Social%Media%huanliu/dmml_presentation/2014/...Arizona%State%University% Some%Challenging%Problems%in%Mining%Social%Media% %Data%Mining%and%Machine%Learning%Lab!

Some  Challenging  Problems  in  Mining  Social  Media  Arizona  State  University    Data  Mining  and  Machine  Learning  Lab   April  22,  2014   13  

A  Significance  Test  

Shuffled  dataset  

Observed  migraHon  dataset  

Coefficients  of  user  

aQributes  

Comparing  and  

Sig.  Test  

LogisHc  Regression  

Chi  Square  StaHsHc  

Coefficients  of  user  

aQributes  LogisHc  Regression  

Page 14: SomeChallengingProblemsin Mining%Social%Media%huanliu/dmml_presentation/2014/...Arizona%State%University% Some%Challenging%Problems%in%Mining%Social%Media% %Data%Mining%and%Machine%Learning%Lab!

Some  Challenging  Problems  in  Mining  Social  Media  Arizona  State  University    Data  Mining  and  Machine  Learning  Lab   April  22,  2014   14  

EvaluaIon  Results  

•  Significant  differences  observed  in  StumbleUpon,  TwiQer,  and  YouTube  

•  PaQerns  from  other  sites  are  not  staHsHcally  significant.  PotenHal  cause:    –  Insufficient  Data?  

Page 15: SomeChallengingProblemsin Mining%Social%Media%huanliu/dmml_presentation/2014/...Arizona%State%University% Some%Challenging%Problems%in%Mining%Social%Media% %Data%Mining%and%Machine%Learning%Lab!

Some  Challenging  Problems  in  Mining  Social  Media  Arizona  State  University    Data  Mining  and  Machine  Learning  Lab   April  22,  2014   15  

Summary  

• MiHgaHng  or  promoHng  migraHon  by  targeHng  high  net-­‐worth  individuals  –  IdenHfying  users  with  high  value  to  the  network,  e.g.,  high  network  acHvity,  user  acHvity,  and  external  exposure    

•  Social  media  migraHon  is  first  studied  in  this  work  

•  AlternaHve  evaluaHon  approaches  can  help  address  the  evaluaHon  dilemma    

 Understanding  User  MigraHon  PaQerns  in  Social  Media,  S.  Kumar,  R.  Zafarani,  and  H.  Liu,  AAAI’2010  

Page 16: SomeChallengingProblemsin Mining%Social%Media%huanliu/dmml_presentation/2014/...Arizona%State%University% Some%Challenging%Problems%in%Mining%Social%Media% %Data%Mining%and%Machine%Learning%Lab!

Some  Challenging  Problems  in  Mining  Social  Media  Arizona  State  University    Data  Mining  and  Machine  Learning  Lab   April  22,  2014   16  

Some  Challenges  in  Mining  Social  Media  

•  EvaluaHon  Dilemma  

•  Sampling  Bias  

•  Noise-­‐Removal  Fallacy  

•  Studying  Distrust  in  Social  Media  

Page 17: SomeChallengingProblemsin Mining%Social%Media%huanliu/dmml_presentation/2014/...Arizona%State%University% Some%Challenging%Problems%in%Mining%Social%Media% %Data%Mining%and%Machine%Learning%Lab!

Some  Challenging  Problems  in  Mining  Social  Media  Arizona  State  University    Data  Mining  and  Machine  Learning  Lab   April  22,  2014   17  17  

•  TwiQer  provides  two  main  outlets  for  researchers  to  access  tweets  in  real  Hme:  – Streaming  API  (~1%  of  all  public  tweets,  free)  – Firehose  (100%  of  all  public  tweets,  costly)  

•  Streaming  API  data  is  o`en  used  to  by  researchers  to  validate  hypotheses.  

•  How  well  does  the  sampled  Streaming  API  data  measure  the  true  acHvity  on  TwiQer?  

Sampling  Bias  in  Social  Media  Data  

Page 18: SomeChallengingProblemsin Mining%Social%Media%huanliu/dmml_presentation/2014/...Arizona%State%University% Some%Challenging%Problems%in%Mining%Social%Media% %Data%Mining%and%Machine%Learning%Lab!

Some  Challenging  Problems  in  Mining  Social  Media  Arizona  State  University    Data  Mining  and  Machine  Learning  Lab   April  22,  2014   18  

Facets  of  Twi?er  Data  

•  Compare  the  data  along  different  facets  •  Selected  facets  commonly  used  in  social  media  mining:  – Top  Hashtags  – Topic  ExtracIon  – Network  Measures  – Geographic  DistribuHons  

Page 19: SomeChallengingProblemsin Mining%Social%Media%huanliu/dmml_presentation/2014/...Arizona%State%University% Some%Challenging%Problems%in%Mining%Social%Media% %Data%Mining%and%Machine%Learning%Lab!

Some  Challenging  Problems  in  Mining  Social  Media  Arizona  State  University    Data  Mining  and  Machine  Learning  Lab   April  22,  2014   19  

Preliminary  Results  

Top  Hashtags   Topic  ExtracIon  

•  No  clear  correlaHon  between  Streaming  and  Firehose  data.  

•  Topics  are  close  to  those  found  in  the  Firehose.  

Network  Measures   Geographic  DistribuIons  

•  Found  ~50%  of  the  top  tweeters  by  different  centrality  measures.  

•  Graph-­‐level  measures  give  similar  results  between  the  two  datasets.  

•  Streaming  data  gets  >90%  of  the  geotagged  tweets.  

•  Consequently,  the  distribuHon  of  tweets  by  conHnent  is  very  similar.  

Page 20: SomeChallengingProblemsin Mining%Social%Media%huanliu/dmml_presentation/2014/...Arizona%State%University% Some%Challenging%Problems%in%Mining%Social%Media% %Data%Mining%and%Machine%Learning%Lab!

Some  Challenging  Problems  in  Mining  Social  Media  Arizona  State  University    Data  Mining  and  Machine  Learning  Lab   April  22,  2014   20  

How  are  These  Results?  

•  Accuracy  of  streaming  API  can  vary  with  analysis  to  be  performed  

•  These  results  are  about  single  cases  of  streaming  API  

•  Are  these  findings  significant,  or  just  an  arHfact  of  random  sampling?  

•  How  do  we  verify  that  our  results  indicate  sampling  bias  or  not?  

Page 21: SomeChallengingProblemsin Mining%Social%Media%huanliu/dmml_presentation/2014/...Arizona%State%University% Some%Challenging%Problems%in%Mining%Social%Media% %Data%Mining%and%Machine%Learning%Lab!

Some  Challenging  Problems  in  Mining  Social  Media  Arizona  State  University    Data  Mining  and  Machine  Learning  Lab   April  22,  2014   21  

Histogram  of  JS  Distances  in  Topic  Comparison  

•  This  is  just  one  streaming  dataset  against  Firehose  •  Are  we  confident  about  this  set  of  results?  •  Can  we  leverage  another  streaming  dataset?  •  Unfortunately,  we  cannot  rewind  as  we  have  only  one  streaming  dataset  

Page 22: SomeChallengingProblemsin Mining%Social%Media%huanliu/dmml_presentation/2014/...Arizona%State%University% Some%Challenging%Problems%in%Mining%Social%Media% %Data%Mining%and%Machine%Learning%Lab!

Some  Challenging  Problems  in  Mining  Social  Media  Arizona  State  University    Data  Mining  and  Machine  Learning  Lab   April  22,  2014   22  

VerificaIon  

•  Created  100  of  our  own  “Streaming  API”  results  by  sampling  the  Firehose  data.  

0"

10"

20"

30"

40"

50"

60"

Firehose" Streaming" Random"1" Random"2" …" Random"100"

Num

er&of&tweets&(k

)&

Genera2ng&Random&Samples&

Page 23: SomeChallengingProblemsin Mining%Social%Media%huanliu/dmml_presentation/2014/...Arizona%State%University% Some%Challenging%Problems%in%Mining%Social%Media% %Data%Mining%and%Machine%Learning%Lab!

Some  Challenging  Problems  in  Mining  Social  Media  Arizona  State  University    Data  Mining  and  Machine  Learning  Lab   April  22,  2014   23  

Comparison  with  Random  Samples  

Page 24: SomeChallengingProblemsin Mining%Social%Media%huanliu/dmml_presentation/2014/...Arizona%State%University% Some%Challenging%Problems%in%Mining%Social%Media% %Data%Mining%and%Machine%Learning%Lab!

Some  Challenging  Problems  in  Mining  Social  Media  Arizona  State  University    Data  Mining  and  Machine  Learning  Lab   April  22,  2014   24  

Summary  

•  Streaming  API  data  could  be  biased  in  some  facets  

•  Our  results  were  obtained  with  the  help  of  Firehose  

•  Without  Firehose  data,  it’s  challenging  to  figure  out  which  facets  might  have  bias,  and  how  to  compensate  them  in  search  of  credible  mining  results    

F.  MorstaQer,  J.  Pfeffer,    H.  Liu,  and  K.  Carley.    Is  the  Sample  Good  Enough?  Comparing  Data  from  TwiAer’s  Streaming    API  and  Data  from  TwiAer’s  Firehose.  ICWSM,  2013.      Fred  MorstaQer,  Jürgen  Pfeffer,  Huan  Liu.  When  is  it  Biased?  Assessing  the  Representa(veness  of  TwiAer's  Streaming    API,  WWW  Web  Science  2014.  

Page 25: SomeChallengingProblemsin Mining%Social%Media%huanliu/dmml_presentation/2014/...Arizona%State%University% Some%Challenging%Problems%in%Mining%Social%Media% %Data%Mining%and%Machine%Learning%Lab!

Some  Challenging  Problems  in  Mining  Social  Media  Arizona  State  University    Data  Mining  and  Machine  Learning  Lab   April  22,  2014   25  

Some  Challenges  in  Mining  Social  Media  

•  EvaluaHon  Dilemma  

•  Sampling  Bias  

•  Noise-­‐Removal  Fallacy  

•  Studying  Distrust  in  Social  Media  

Page 26: SomeChallengingProblemsin Mining%Social%Media%huanliu/dmml_presentation/2014/...Arizona%State%University% Some%Challenging%Problems%in%Mining%Social%Media% %Data%Mining%and%Machine%Learning%Lab!

Some  Challenging  Problems  in  Mining  Social  Media  Arizona  State  University    Data  Mining  and  Machine  Learning  Lab   April  22,  2014   26  26  

•  We  o`en  learn  that:  “99%  TwiQer  data  is  useless.”  – “Had  eggs,  sunny-­‐side-­‐up,  this  morning”  – Can  we  remove  noise  as  we  usually  do  in  DM?  

•  What  is  le`  a`er  noise  removal?  – TwiQer  data  can  be  rendered  useless  a`er  convenHonal  noise  removal  

•  As  we  are  certain  there  is  noise  in  data,    how  can  we  remove  it?  

Noise  Removal  Fallacy  

Page 27: SomeChallengingProblemsin Mining%Social%Media%huanliu/dmml_presentation/2014/...Arizona%State%University% Some%Challenging%Problems%in%Mining%Social%Media% %Data%Mining%and%Machine%Learning%Lab!

Some  Challenging  Problems  in  Mining  Social  Media  Arizona  State  University    Data  Mining  and  Machine  Learning  Lab   April  22,  2014   27  

Social  Media  Data  

•  Massive  and  high-­‐dimensional  social  media  data  poses  unique  challenges  to  data  mining  tasks  – Scalability  – Curse  of  dimensionality    

•  Social  media  data  is  inherently  linked  – A  key  difference  between  social  media  data  and  aQribute-­‐value  data    

Page 28: SomeChallengingProblemsin Mining%Social%Media%huanliu/dmml_presentation/2014/...Arizona%State%University% Some%Challenging%Problems%in%Mining%Social%Media% %Data%Mining%and%Machine%Learning%Lab!

Some  Challenging  Problems  in  Mining  Social  Media  Arizona  State  University    Data  Mining  and  Machine  Learning  Lab   April  22,  2014   28  

Feature  SelecIon  of  Social  Data  

•  Feature  selecHon  has  been  widely  used  to  prepare  large-­‐scale,  high-­‐dimensional  data  for  effecHve  data  mining  

•  TradiHonal  feature  selecHon  algorithms  deal  with  only  “flat"  data  (aAribute-­‐value  data).  –  Independent  and  IdenHcally  Distributed  (i.i.d.)  

•  We  need  to  take  advantage  of  linked  data  for  feature  selecHon  

         

Page 29: SomeChallengingProblemsin Mining%Social%Media%huanliu/dmml_presentation/2014/...Arizona%State%University% Some%Challenging%Problems%in%Mining%Social%Media% %Data%Mining%and%Machine%Learning%Lab!

Some  Challenging  Problems  in  Mining  Social  Media  Arizona  State  University    Data  Mining  and  Machine  Learning  Lab   April  22,  2014   29  

RepresentaIon  for  Social  Media  Data  

User-­‐post  relaHons  

1

1 1 1

1

1 1

𝑢↓1 

𝑢↓2 

𝑢↓3 

𝑢↓4 

𝑢↓1  𝑢↓2  𝑢↓3  𝑢↓4 

𝑝↓1 

𝑝↓2 

𝑝↓5  𝑝↓6 

𝑝↓4 

𝑝↓7  𝑝↓8 

𝑓↓𝑚  …. …. …. 𝑐↓𝑘  ….

Page 30: SomeChallengingProblemsin Mining%Social%Media%huanliu/dmml_presentation/2014/...Arizona%State%University% Some%Challenging%Problems%in%Mining%Social%Media% %Data%Mining%and%Machine%Learning%Lab!

Some  Challenging  Problems  in  Mining  Social  Media  Arizona  State  University    Data  Mining  and  Machine  Learning  Lab   April  22,  2014   30  

RepresentaIon  for  Social  Media  Data  

1

1 1 1

1

1 1

𝑢↓1 

𝑢↓2 

𝑢↓3 

𝑢↓4 

𝑢↓1  𝑢↓2  𝑢↓3  𝑢↓4 

𝑝↓1 

𝑝↓2 

𝑝↓5  𝑝↓6 

𝑝↓4 

𝑝↓7  𝑝↓8 

𝑓↓𝑚  …. …. …. 𝑐↓𝑘  ….

User-­‐user  relaHons  

Page 31: SomeChallengingProblemsin Mining%Social%Media%huanliu/dmml_presentation/2014/...Arizona%State%University% Some%Challenging%Problems%in%Mining%Social%Media% %Data%Mining%and%Machine%Learning%Lab!

Some  Challenging  Problems  in  Mining  Social  Media  Arizona  State  University    Data  Mining  and  Machine  Learning  Lab   April  22,  2014   31  

RepresentaIon  for  Social  Media  Data  

1

1 1 1

1

1 1

𝑢↓1 

𝑢↓2 

𝑢↓3 

𝑢↓4 

𝑢↓1  𝑢↓2  𝑢↓3  𝑢↓4 

𝑝↓1 

𝑝↓2 

𝑝↓5  𝑝↓6 

𝑝↓4 

𝑝↓7  𝑝↓8 

𝑓↓𝑚  …. …. …. 𝑐↓𝑘  ….

Social  Context  

Page 32: SomeChallengingProblemsin Mining%Social%Media%huanliu/dmml_presentation/2014/...Arizona%State%University% Some%Challenging%Problems%in%Mining%Social%Media% %Data%Mining%and%Machine%Learning%Lab!

Some  Challenging  Problems  in  Mining  Social  Media  Arizona  State  University    Data  Mining  and  Machine  Learning  Lab   April  22,  2014   32  

Problem  Statement  

•  Given  labeled  data  X  and  its  label  indicator  matrix  Y,  the  dataset  F,  its  social  context  including  user-­‐user  following  relaHonships  S  and  user-­‐post  relaHonships  P,    

•  Select  k  most  relevant  features  from  m  features  on  dataset  F  with  its  social  context  S  and  P  

   

Page 33: SomeChallengingProblemsin Mining%Social%Media%huanliu/dmml_presentation/2014/...Arizona%State%University% Some%Challenging%Problems%in%Mining%Social%Media% %Data%Mining%and%Machine%Learning%Lab!

Some  Challenging  Problems  in  Mining  Social  Media  Arizona  State  University    Data  Mining  and  Machine  Learning  Lab   April  22,  2014   33  

How  to  Use  Link  InformaIon  

•  The  new  quesHon  is  how  to  proceed  with  addiHonal  informaHon  for  feature  selecHon  

•  Two  basic  technical  problems  – RelaHon  extracHon:  What  are  disHncHve  relaHons  that  can  be  extracted  from  linked  data  

– MathemaHcal  representaHon:  How  to  use  these  relaHons  in  feature  selecHon  formulaHon  

•  Do  we  have  theories  to  guide  us?        

Page 34: SomeChallengingProblemsin Mining%Social%Media%huanliu/dmml_presentation/2014/...Arizona%State%University% Some%Challenging%Problems%in%Mining%Social%Media% %Data%Mining%and%Machine%Learning%Lab!

Some  Challenging  Problems  in  Mining  Social  Media  Arizona  State  University    Data  Mining  and  Machine  Learning  Lab   April  22,  2014   34  

𝑢↓1 

𝑢↓2 

𝑢↓3 

𝑢↓4 

𝑝↓1  𝑝↓2 

p3 𝑝↓5 

𝑝↓6 

𝑝↓4 

𝑝↓7 

𝑝↓8 

1. CoPost  2. CoFollowing  3. CoFollowed  4. Following  

RelaIon  ExtracIon  

Page 35: SomeChallengingProblemsin Mining%Social%Media%huanliu/dmml_presentation/2014/...Arizona%State%University% Some%Challenging%Problems%in%Mining%Social%Media% %Data%Mining%and%Machine%Learning%Lab!

Some  Challenging  Problems  in  Mining  Social  Media  Arizona  State  University    Data  Mining  and  Machine  Learning  Lab   April  22,  2014   35  

RelaIons,  Social  Theories,  Hypotheses  

•  Social  correlaHon  theories  suggest  that  the  four  relaHons  may  affect  the  relaHonships  between  posts    

•  Social  correlaHon  theories  – Homophily:  People  with  similar  interests  are  more  likely  to  be  linked  

–  Influence:  People  who  are  linked  are  more  likely  to  have  similar  interests  

•  Thus,  four  relaHons  lead  to  four  hypotheses  for  verificaHon  

 

Page 36: SomeChallengingProblemsin Mining%Social%Media%huanliu/dmml_presentation/2014/...Arizona%State%University% Some%Challenging%Problems%in%Mining%Social%Media% %Data%Mining%and%Machine%Learning%Lab!

Some  Challenging  Problems  in  Mining  Social  Media  Arizona  State  University    Data  Mining  and  Machine  Learning  Lab   April  22,  2014   36  

Modeling  CoFollowing  RelaIon    

•  Two  co-­‐following  users  have  similar  topics  of  interests    

||||

)(^

k

Ffi

T

k

Ffi

k F

fW

F

fTuT kiki

∑∑∈∈ ==)(

Users'  topic  interests      

∑ ∑∈

−++−u Nuu

jiFT

uji

uTuT,

22

^^

1,22

W||)()(||||W||||YWX||min βα

Page 37: SomeChallengingProblemsin Mining%Social%Media%huanliu/dmml_presentation/2014/...Arizona%State%University% Some%Challenging%Problems%in%Mining%Social%Media% %Data%Mining%and%Machine%Learning%Lab!

Some  Challenging  Problems  in  Mining  Social  Media  Arizona  State  University    Data  Mining  and  Machine  Learning  Lab   April  22,  2014   37  

EvaluaIon  Results  on  Digg  

Page 38: SomeChallengingProblemsin Mining%Social%Media%huanliu/dmml_presentation/2014/...Arizona%State%University% Some%Challenging%Problems%in%Mining%Social%Media% %Data%Mining%and%Machine%Learning%Lab!

Some  Challenging  Problems  in  Mining  Social  Media  Arizona  State  University    Data  Mining  and  Machine  Learning  Lab   April  22,  2014   38  

EvaluaIon  Results  on  Digg  

Page 39: SomeChallengingProblemsin Mining%Social%Media%huanliu/dmml_presentation/2014/...Arizona%State%University% Some%Challenging%Problems%in%Mining%Social%Media% %Data%Mining%and%Machine%Learning%Lab!

Some  Challenging  Problems  in  Mining  Social  Media  Arizona  State  University    Data  Mining  and  Machine  Learning  Lab   April  22,  2014   39  

Summary  

•  LinkedFS  is  evaluated  under  varied  circumstances  to  understand  how  it  works.    – Link  informaHon  can  help  feature  selec(on  for  social  media  data.  

•  Unlabeled  data  is  more  o`en  in  social  media,  unsupervised  learning  is  more  sensible,  but  also  more  challenging.  

Jiliang  Tang  and  Huan  Liu.  ``  Unsupervised  Feature  SelecHon  for  Linked  Social  Media  Data'',  the  Eighteenth  ACM  SIGKDD  InternaHonal  Conference  on  Knowledge  Discovery  and  Data  Mining  ,  2012.    Jiliang  Tang,  Huan  Liu.  ``Feature  SelecHon  with  Linked  Data  in  Social  Media'',  SIAM  InternaHonal  Conference  on  Data  Mining,  2012.    

Page 40: SomeChallengingProblemsin Mining%Social%Media%huanliu/dmml_presentation/2014/...Arizona%State%University% Some%Challenging%Problems%in%Mining%Social%Media% %Data%Mining%and%Machine%Learning%Lab!

Some  Challenging  Problems  in  Mining  Social  Media  Arizona  State  University    Data  Mining  and  Machine  Learning  Lab   April  22,  2014   40  

Some  Challenges  in  Mining  Social  Media  

•  EvaluaHon  Dilemma  

•  Sampling  Bias  

•  Noise-­‐Removal  Fallacy  

•  Studying  Distrust  in  Social  Media  

Page 41: SomeChallengingProblemsin Mining%Social%Media%huanliu/dmml_presentation/2014/...Arizona%State%University% Some%Challenging%Problems%in%Mining%Social%Media% %Data%Mining%and%Machine%Learning%Lab!

Some  Challenging  Problems  in  Mining  Social  Media  Arizona  State  University    Data  Mining  and  Machine  Learning  Lab   April  22,  2014   41  41  

Studying  Distrust  in  Social  Media  

Trust  in  Social  CompuIng  

IncorporaIng    Distrust    

Summary  

IntroducIon  

Applying  Trust    

RepresenIng  Trust  

Measuring  Trust    

WWW2014  Tutorial  on  Trust  in  Social  CompuHng  Seoul,  South  Korea.  4/7/14  h?p://www.public.asu.edu/~jtang20/tTrust.htm  

Page 42: SomeChallengingProblemsin Mining%Social%Media%huanliu/dmml_presentation/2014/...Arizona%State%University% Some%Challenging%Problems%in%Mining%Social%Media% %Data%Mining%and%Machine%Learning%Lab!

Some  Challenging  Problems  in  Mining  Social  Media  Arizona  State  University    Data  Mining  and  Machine  Learning  Lab   April  22,  2014   42  

Distrust  in  Social  Sciences  

•   Distrust  can  be  as  important  as  trust      

•   Both  trust  and  distrust  help  a  decision  maker  reduce  the  uncertainty  and  vulnerability  associated  with  decision  consequences  

 •   Distrust  may  play  an  equally  important,  if  not  more,  criHcal  role  as  trust  in  consumer  decisions  

Page 43: SomeChallengingProblemsin Mining%Social%Media%huanliu/dmml_presentation/2014/...Arizona%State%University% Some%Challenging%Problems%in%Mining%Social%Media% %Data%Mining%and%Machine%Learning%Lab!

Some  Challenging  Problems  in  Mining  Social  Media  Arizona  State  University    Data  Mining  and  Machine  Learning  Lab   April  22,  2014   43  

Understandings  of  Distrust  from  Social  Sciences  

•   Distrust  is  the  negaHon  of  trust  ─   Low  trust  is  equivalent  to  high  distrust  ─   The  absence  of  distrust  means  high  trust  ─   Lack  of  the  studying  of  distrust  maQers  liQle  

   

•   Distrust  is  a  new  dimension  of  trust  ─   Trust  and  distrust  are  two  separate  concepts    ─   Trust  and  distrust  can  co-­‐exist  ─   A  study  ignoring  distrust  would  yield  an  incomplete  esHmate  of  the  effect  of  trust            

Page 44: SomeChallengingProblemsin Mining%Social%Media%huanliu/dmml_presentation/2014/...Arizona%State%University% Some%Challenging%Problems%in%Mining%Social%Media% %Data%Mining%and%Machine%Learning%Lab!

Some  Challenging  Problems  in  Mining  Social  Media  Arizona  State  University    Data  Mining  and  Machine  Learning  Lab   April  22,  2014   44  

Distrust  in  Social  Media  

•  Distrust  is  rarely  studied  in  social  media  •  Challenge  1:  Lack  of  computaHonal  understanding  of  distrust  with  social  media  data  – Social  media  data  is  based  on  passive  observaHons  – Lack  of  some  informaHon  social  sciences  use  to  study  distrust  

•  Challenge  2:  Distrust  informaHon  is  usually  not  publicly  available  – Trust  is  a  desired  property  while  distrust  is  an  unwanted  one  for  an  online  social  community  

Page 45: SomeChallengingProblemsin Mining%Social%Media%huanliu/dmml_presentation/2014/...Arizona%State%University% Some%Challenging%Problems%in%Mining%Social%Media% %Data%Mining%and%Machine%Learning%Lab!

Some  Challenging  Problems  in  Mining  Social  Media  Arizona  State  University    Data  Mining  and  Machine  Learning  Lab   April  22,  2014   45  

ComputaIonal  Understanding  of  Distrust    •   Design  computaHonal  tasks  to  help  understand  distrust  with  passively  observed  social  media  data    

§   Task  1:  Is  distrust  the  negaHon  of  trust?    –   If  distrust  is  the  negaHon  of  trust,  distrust  should  be  predictable  from  only  trust  

 §   Task  2:  Can  we  predict  trust  beQer  with  distrust?    –   If  distrust  is  a  new  dimension  of  trust,  distrust  should  have  added  value  on  trust  and  can  improve  trust  predicHon    

•   The  first  step  to  understand  distrust  is  to  make  distrust  computable  by  incorporaHng  distrust  in  trust  models    

Page 46: SomeChallengingProblemsin Mining%Social%Media%huanliu/dmml_presentation/2014/...Arizona%State%University% Some%Challenging%Problems%in%Mining%Social%Media% %Data%Mining%and%Machine%Learning%Lab!

Some  Challenging  Problems  in  Mining  Social  Media  Arizona  State  University    Data  Mining  and  Machine  Learning  Lab   April  22,  2014   46  

Distrust  in  Trust  RepresentaIons  There  are  three  major  ways  to  incorporate  distrust  in  trust  representaHon  – Considering  low  trust  as  distrust        – Adding  signs  to  trust  values  – Adding  a  dimension  in  trust  representaHons  

         

0  

Trust  

Distrust  

1  

0  

Trust  

Distrust  

1  

-­‐1  

0  

Trust  

1  

-­‐1  Distrust  

Page 47: SomeChallengingProblemsin Mining%Social%Media%huanliu/dmml_presentation/2014/...Arizona%State%University% Some%Challenging%Problems%in%Mining%Social%Media% %Data%Mining%and%Machine%Learning%Lab!

Some  Challenging  Problems  in  Mining  Social  Media  Arizona  State  University    Data  Mining  and  Machine  Learning  Lab   April  22,  2014   47  

An  IllustraIon  of  Distrust  in  Trust  RepresentaIons    

•  Considering  low  trust  as  distrust        ─ Weighted  unsigned  network  

   

•  Extending  negaHve  values  ─ Weighted  signed  network  

 

•  Adding  another  dimension    ─  Two-­‐dimensional  unsigned  network  

0.8  

1  

 0.8  

 1    1  

 0  

 -­‐1  

 1  

(0.8,0)  

 (1,0)  

 (0,1)  

 (1,0)  

Page 48: SomeChallengingProblemsin Mining%Social%Media%huanliu/dmml_presentation/2014/...Arizona%State%University% Some%Challenging%Problems%in%Mining%Social%Media% %Data%Mining%and%Machine%Learning%Lab!

Some  Challenging  Problems  in  Mining  Social  Media  Arizona  State  University    Data  Mining  and  Machine  Learning  Lab   April  22,  2014   48  

Task  1:  Is  Distrust  the  NegaIon  of  Trust?  

•   If  distrust  is  the  negaHon  of  trust,  low  trust  is  equivalent  to  distrust    and  distrust  should  be  predictable  from  trust  

                 

•   Given  the  transiHvity  of  trust,  we  resort  to  trust  predicHon  algorithms  to  compute  trust  scores  for  pairs  of  users  in  the  same  trust  network  

Distrust     Low  Trust    

PredicIng    Distrust    

PredicIng    Low  Trust  

IF    

THEN  

Page 49: SomeChallengingProblemsin Mining%Social%Media%huanliu/dmml_presentation/2014/...Arizona%State%University% Some%Challenging%Problems%in%Mining%Social%Media% %Data%Mining%and%Machine%Learning%Lab!

Some  Challenging  Problems  in  Mining  Social  Media  Arizona  State  University    Data  Mining  and  Machine  Learning  Lab   April  22,  2014   49  

EvaluaIon  of  Task  1

§   The  performance  of  using  low  trust  to  predict  distrust  is  consistently  worse  than  randomly  guessing    §   Task  1  fails  to  predict  distrust  with  only  trust;  and  distrust  is  not  the  negaHon  of  trust    

 dTP:  It  uses  trust  propagaHon  to  calculate  trust  scores  for  pairs  of  users    dMF:  It  uses  the  matrix  factorizaHon  based  predictor  to  compute  trust  scores  for  pairs  of  users  dTP-­‐MF:  It  is  the  combinaHon  of  dTP  and  dMF  using  OR  

Page 50: SomeChallengingProblemsin Mining%Social%Media%huanliu/dmml_presentation/2014/...Arizona%State%University% Some%Challenging%Problems%in%Mining%Social%Media% %Data%Mining%and%Machine%Learning%Lab!

Some  Challenging  Problems  in  Mining  Social  Media  Arizona  State  University    Data  Mining  and  Machine  Learning  Lab   April  22,  2014   50  

Task  2:  Can  we  predict  Trust  be?er  with  Distrust  

§   If  distrust  is  not  the  negaHon  of  trust,  distrust  should  provide  addiHonal  informaHon  about  users,  and  could  have    added  value  beyond  trust    §   We  seek  answer  to  whether  using  both  trust  and  distrust  informaHon  can  help  achieve  beQer  performance  than  using  only  trust  informaHon  

§   We  can  add  distrust  propagaHon  in  trust  propagaHon  to  incorporate  distrust        

Page 51: SomeChallengingProblemsin Mining%Social%Media%huanliu/dmml_presentation/2014/...Arizona%State%University% Some%Challenging%Problems%in%Mining%Social%Media% %Data%Mining%and%Machine%Learning%Lab!

Some  Challenging  Problems  in  Mining  Social  Media  Arizona  State  University    Data  Mining  and  Machine  Learning  Lab   April  22,  2014   51  

EvaluaIon  of  Trust  and  Distrust  PropagaIon  

§   IncorporaHng  distrust  propagaHon  into  trust  propagaHon  can  improve  the  performance  of  trust  measurement  §   One  step  distrust  propagaHon  usually  outperforms  mulHple  step  distrust  propagaHon      

Page 52: SomeChallengingProblemsin Mining%Social%Media%huanliu/dmml_presentation/2014/...Arizona%State%University% Some%Challenging%Problems%in%Mining%Social%Media% %Data%Mining%and%Machine%Learning%Lab!

Some  Challenging  Problems  in  Mining  Social  Media  Arizona  State  University    Data  Mining  and  Machine  Learning  Lab   April  22,  2014   52  52  

•  EvaluaHon  Dilemma  •  Sampling  Bias  in  Social  Media  Data  •  Noise  Removal  Fallacy  •  Studying  Distrust  in  Social  Media  

Concluding  Remarks  

Page 53: SomeChallengingProblemsin Mining%Social%Media%huanliu/dmml_presentation/2014/...Arizona%State%University% Some%Challenging%Problems%in%Mining%Social%Media% %Data%Mining%and%Machine%Learning%Lab!

Some  Challenging  Problems  in  Mining  Social  Media  Arizona  State  University    Data  Mining  and  Machine  Learning  Lab   April  22,  2014   53  53  

•  Organizers  for  this  wonderful  opportunity  to  share  our  research  work  

•  Acknowledgments  – Grants  from  NSF,  ONR,  ARO  – DMML  members  and  project  leaders  – Collaborators  

THANKS  to  …  

Ali  Abbasi  Shamanth    Kumar   Fred  Morsta?er  Reza  Zafarani   Jiliang  Tang