Large Scale Social Analytics on Wikipedia, Delicious, and Twitter (presented at IBM NPUC 2010)

Post on 14-Jun-2015


Ed H. Chi, Palo Alto Research Center

Large-Scale Social Analytics in Wikipedia, Delicious, and Twitter

Abstract

We will illustrate an analytical research approach in social computing. Our research in Augmented Social Cognition is aimed at enhancing the ability of a group of people to remember, think, and reason. The drive to build models and theories for social computing research should further our understanding of how network science, behavioral economics, and evolutionary theories could explain how social systems work. Here we will summarize the published research we conducted on large-scale social analytics in Wikipedia, Delicious, and Twitter, and point out how social analytics can help us understand the intricacies of large social systems.

About the Speaker

Ed H. Chi is area manager and principal scientist at Palo Alto Research Center's Augmented Social Cognition Group. He leads the group in understanding how Web2.0 and Social Computing systems help groups of people to remember, think and reason. Ed completed his three degrees (B.S., M.S., and Ph.D.) in 6.5 years from University of Minnesota, and has been doing research on user interface software systems since 1993. He has been featured and quoted in the press, such as the Economist, Time Magazine, LA Times, and the Associated Press. With 20 patents and over 70 research articles, he has won awards for both teaching and research. In his spare time, Ed is an avid Taekwondo martial artist, photographer, and snowboarder.

Transcript of Large Scale Social Analytics on Wikipedia, Delicious, and Twitter (presented at IBM NPUC 2010)

Image from: http://www.flickr.com/photos/ourcommon/480538715/

Ed H. Chi, Principal Scientist and Area Manager

Peter Pirolli, Lichan Hong, Bongwon Suh, Les Nelson, Gregorio Convertino, Sharoda Paul

Interns: Sanjay Kairam, Jilin Chen, Brent Hecht, Michael Bernstein

Alumni: Raluca Budiu, Bryan Pendleton, Niki Kittur, Todd Mytkowicz, Terrell Russell, Brynn Evans, Bryan Chan, KMRC students

Augmented Social Cognition Area, Palo Alto Research Center

2010-10-22 IBM NPUC 2010 2

To: chi@acm.org
From: Brad Barrish <brad@…removed.for.privacy….com>
Subject: Pancreatic cancer
Date: Thu, 1 Feb 2007 21:37:55 PST

Hey Ed. I'm a fellow del.icio.us user and noticed you bookmark a lot of pancreatic cancer stuff. I'm at home with my dad who was diagnosed a little over a year ago and is now at the tale end of things. I've learned a lot through his treatments and about what's out there. I dunno if it's something you or a family member has, but just wanted to drop you an email. Be well.

Brad

Cognition: the ability to remember, think, and reason; the faculty of knowing.

Social Cognition: the ability of a group to remember, think, and reason; the construction of knowledge structures by a group.
  – (not quite the same as in the branch of psychology that studies the cognitive processes involved in social interaction, though included)

Augmented Social Cognition: Supported by systems, the enhancement of the ability of a group to remember, think, and reason; the system-supported construction of knowledge structures by a group.

Citation: Chi, IEEE Computer, Sept 2008

Kudos to Todd Mytkowicz and Rowan Nairn

[Figure: diagram with the labels Topics, Concepts, Users, Documents, Tags (T1…Tn), Encoding, Decoding, and Noise]

H(Tag) shows tag saturation
H(Doc | Tag), browsability

I(Doc; Tag), mutual information
Rise in average tags per bookmark
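The three quantities named on these slides, H(Tag), H(Doc | Tag), and I(Doc; Tag), can be estimated from a list of (document, tag) bookmark pairs. The following is a minimal sketch under stated assumptions, not the analysis pipeline used in the talk: the function names and the toy bookmark data are made up for illustration, and probabilities are plain relative frequencies.

```python
import math
from collections import Counter


def entropy(counts):
    """Shannon entropy in bits of a distribution given as raw counts."""
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)


def tagging_information(bookmarks):
    """Return (H(Tag), H(Doc | Tag), I(Doc; Tag)) for (doc, tag) pairs."""
    pair_counts = Counter(bookmarks)
    tag_counts = Counter(tag for _, tag in bookmarks)
    doc_counts = Counter(doc for doc, _ in bookmarks)

    h_tag = entropy(tag_counts.values())
    h_doc = entropy(doc_counts.values())
    h_joint = entropy(pair_counts.values())

    # Chain rule: H(Doc | Tag) = H(Doc, Tag) - H(Tag)
    h_doc_given_tag = h_joint - h_tag
    # Mutual information: I(Doc; Tag) = H(Doc) - H(Doc | Tag)
    mutual_information = h_doc - h_doc_given_tag
    return h_tag, h_doc_given_tag, mutual_information


# Hypothetical toy data: four documents, two tags.
bookmarks = [("d1", "python"), ("d2", "python"), ("d3", "howto"), ("d4", "howto")]
h_tag, h_doc_given_tag, mutual_information = tagging_information(bookmarks)
```

In this toy case each tag narrows four documents down to two, so knowing the tag removes part, but not all, of the uncertainty about the document. A lower H(Doc | Tag) means a tag pins down a document more precisely.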

[Figure: Semantic Similarity Graph, with the tag nodes Guide, Web, Howto, Tips, Help, Tools, Tip, Tricks, Tutorial, Tutorials, and Reference]

Spreading Activation in a bi-graph
Computation over a very large data set
  – 150 Million+ bookmarks

[Figure: bipartite graph of Tags and URLs, with edges labeled P(URL|Tag) and P(Tag|URL)]
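Spreading activation over the tag/URL bi-graph can be sketched as alternating steps: activation flows from tags to URLs weighted by P(URL|Tag), then back from URLs to tags weighted by P(Tag|URL). This is a small in-memory illustration, assuming both conditional probabilities are estimated as relative bookmark frequencies. The function names and toy data are hypothetical, and the actual computation over 150 million bookmarks would need a distributed implementation.

```python
from collections import Counter, defaultdict


def build_transitions(bookmarks):
    """Estimate P(URL|Tag) and P(Tag|URL) from (url, tag) bookmark pairs."""
    tag_to_url = defaultdict(Counter)
    url_to_tag = defaultdict(Counter)
    for url, tag in bookmarks:
        tag_to_url[tag][url] += 1
        url_to_tag[url][tag] += 1

    def normalize(table):
        # Turn each row of counts into a probability distribution.
        return {
            src: {dst: n / sum(row.values()) for dst, n in row.items()}
            for src, row in table.items()
        }

    return normalize(tag_to_url), normalize(url_to_tag)


def spread(activation, transitions):
    """Push activation one step across the bi-graph."""
    out = defaultdict(float)
    for src, mass in activation.items():
        for dst, p in transitions.get(src, {}).items():
            out[dst] += mass * p
    return dict(out)


def related_tags(seed_tag, bookmarks, rounds=1):
    """Activate one tag and return the activation over tags after `rounds`."""
    p_url_given_tag, p_tag_given_url = build_transitions(bookmarks)
    activation = {seed_tag: 1.0}
    for _ in range(rounds):
        urls = spread(activation, p_url_given_tag)   # tags -> URLs
        activation = spread(urls, p_tag_given_url)   # URLs -> tags
    return activation


# Hypothetical toy bookmarks.
bookmarks = [
    ("u1", "howto"), ("u1", "tutorial"),
    ("u2", "howto"), ("u2", "tips"),
    ("u3", "python"),
]
scores = related_tags("howto", bookmarks)
```

Tags that share URLs with the seed tag receive activation, while tags with no shared URLs (here, the made-up "python") receive none.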

Kudos to Bongwon Suh, Niki Kittur

What drives contributions to Wikipedia?

Conflicts drive most of the contributions to Wikipedia.
  – How do we measure conflicts?

Conflicts cause coordination costs to go up.
  – Measuring coordination costs

[Figure: groups labeled Mediators, Sympathetic to parents, Sympathetic to husband, and Anonymous (vandals/spammers)]

Counting ‘Controversial’ labels
5x cross-validation, R² = 0.897

[Figure: plot of actual controversial revisions against predicted controversial revisions; both axes run from 0 to 10,000]
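A cross-validated R² such as the one reported above can be computed as follows. This is a sketch under several assumptions: "5x" is read as 5-fold cross-validation, the slide does not name the regression model so ordinary least squares on a single hypothetical feature is substituted here, and the data is synthetic.

```python
import random


def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def cross_validated_r2(xs, ys, k=5, seed=0):
    """R^2 of out-of-fold predictions from k-fold cross-validation."""
    indices = list(range(len(xs)))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]

    predictions = [0.0] * len(xs)
    for held_out in folds:
        held = set(held_out)
        train = [i for i in indices if i not in held]
        intercept, slope = fit_line([xs[i] for i in train], [ys[i] for i in train])
        # Predict only the points the model did not see during fitting.
        for i in held_out:
            predictions[i] = intercept + slope * xs[i]

    mean_y = sum(ys) / len(ys)
    ss_res = sum((y - p) ** 2 for y, p in zip(ys, predictions))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return 1.0 - ss_res / ss_tot


# Synthetic data: x is a made-up page metric, y a made-up count of
# controversial revisions that depends on x plus noise.
rng = random.Random(42)
xs = [rng.uniform(0, 100) for _ in range(200)]
ys = [50 * x + rng.gauss(0, 300) for x in xs]
r2 = cross_validated_r2(xs, ys, k=5)
```

Because every prediction comes from a model that never saw that point, the resulting R² measures how well the fit generalizes rather than how well it memorizes.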

[Figure: Number of Articles (Log Scale). Source: http://en.wikipedia.org/wiki/Wikipedia:Modelling_Wikipedia’s_growth]

[Figure: Monthly Edits]

[Figure: Monthly Active Editors (in thousands)]

Preferential Attachment: edits beget edits
  – the more previous edits, the more new edits

Growth rate depends on:
N = current population
r = growth rate of the population

$$\frac{dN}{dt} = r \cdot N$$

(dN/dt: growth rate of population; N: current population)

$$N(t) = N_0 \cdot e^{rt}$$

Ecological population growth model
  – Also depends on environmental conditions
  – K, carrying capacity (due to resource limitation)

$$\frac{dN}{dt} = rN\left(1 - \frac{N}{K}\right)$$
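The difference between the two growth models can be checked numerically. The sketch below integrates dN/dt = rN(1 − N/K) with forward Euler steps and compares the result with the standard closed-form logistic solution. The parameter values are arbitrary illustrations, not values fitted to Wikipedia data.

```python
import math


def exponential(t, n0, r):
    """Unbounded growth: N(t) = N0 * e^(r t)."""
    return n0 * math.exp(r * t)


def logistic_closed_form(t, n0, r, k):
    """Closed-form logistic solution: N(t) = K / (1 + ((K - N0) / N0) * e^(-r t))."""
    return k / (1.0 + ((k - n0) / n0) * math.exp(-r * t))


def logistic_euler(n0, r, k, t_end, dt=0.001):
    """Integrate dN/dt = r N (1 - N/K) with forward Euler steps."""
    n = n0
    for _ in range(int(round(t_end / dt))):
        n += dt * r * n * (1.0 - n / k)
    return n


# Arbitrary illustrative parameters.
n0, r, k = 10.0, 0.5, 1000.0
numeric = logistic_euler(n0, r, k, t_end=10.0)
exact = logistic_closed_form(10.0, n0, r, k)
unbounded = exponential(10.0, n0, r)
```

With these parameters the exponential model overshoots the carrying capacity K, while the logistic curve flattens out and approaches K from below.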

Follows a logistic growth curve

[Figure: New Article]

Biological system
  – Competition increases as the population hits the limits of the ecology
  – Advantage goes to members of the population that have competitive dominance over others

Analogy
  – Limited opportunities to make novel contributions
  – Increased patterns of conflict and dominance

Monthly Ratio of Reverted Edits


Kudos to Brent Hecht, Jilin Chen, Bongwon Suh, Lichan Hong

n = 10,000 users with 5 or more tweets

All Users Who Manually Specified Location

n = 3,311 users with 5 or more tweets

Users w/ No Useful Location Information Manually Entered

Schrute Farms User ID 39111154

User ID 75135928

NONE YA BISNESS!!

User ID 57987417

in jail...smh

not tellin you User ID 130681147

wherever justin wants me to be

User ID 71097545

User ID 77503970

Justin Biebers heart!

User ID 134222427

Jonasbieberland3

Bieber Island User ID 91705969

n = 10,000 users with 5 or more tweets

All Twitter Users

n = 2,965 users with 5 or more tweets

Users w/ Informative Location in the United States

California User ID 125271323

User ID 92455577

Skinny Jeans City, IL

User ID 92455577

Bieberville, California

East Jesus Nowhere, Indiana

User ID 26526957

All 1,698 Fake Locations Yahoo! Geocoder

Justin Biebers heart!


Lat = 36.328785 Lon = -91.700189

Location of Justin Bieber’s Heart (Don’t Tell Your Teenage Daughters)

Country-scale
10-fold cross-validation, multinomial naive Bayes classifier
2.4x better than random

State-scale
20% test set, multinomial naive Bayes classifier
2.2x better than random
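A multinomial naive Bayes classifier of the kind named above can be written in a few lines. This is a minimal sketch, assuming the input is the words of a user's tweets (the slides do not state the features), with add-one smoothing. The class name, training texts, and labels are invented for illustration.

```python
import math
from collections import Counter, defaultdict


class MultinomialNaiveBayes:
    """Bag-of-words multinomial naive Bayes with additive smoothing."""

    def __init__(self, alpha=1.0):
        self.alpha = alpha                       # smoothing constant
        self.word_counts = defaultdict(Counter)  # label -> word -> count
        self.class_counts = Counter()            # label -> number of documents
        self.vocabulary = set()

    def fit(self, documents, labels):
        for text, label in zip(documents, labels):
            words = text.lower().split()
            self.class_counts[label] += 1
            self.word_counts[label].update(words)
            self.vocabulary.update(words)
        return self

    def predict(self, text):
        # Words never seen in training carry no evidence and are skipped.
        words = [w for w in text.lower().split() if w in self.vocabulary]
        total_docs = sum(self.class_counts.values())
        vocab_size = len(self.vocabulary)
        best_label, best_score = None, -math.inf
        for label, doc_count in self.class_counts.items():
            counts = self.word_counts[label]
            denominator = sum(counts.values()) + self.alpha * vocab_size
            score = math.log(doc_count / total_docs)  # log prior
            for word in words:
                score += math.log((counts[word] + self.alpha) / denominator)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Invented training examples.
texts = [
    "traffic on the 101 again",
    "surfing at the beach today",
    "deep dish pizza tonight",
    "cubs game at wrigley",
]
states = ["California", "California", "Illinois", "Illinois"]
model = MultinomialNaiveBayes().fit(texts, states)
guess = model.predict("pizza at wrigley")
```

A "better than random" figure like the ones above would then come from comparing the classifier's held-out accuracy with that of a random baseline.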

Which tweet features are associated with retweets?

Retweet Model
  – # Retweet ~ function(f1, f2, …, fn), where fi are simple features extracted from a tweet

74M tweets from Twitter Stream API
  – Characterization
  – 2~3% sample
  – Hadoop / HBase / MapReduce

Contextual Features
  – # Followees: 395
  – # Followers: 1,400
  – # Favorite: 1,657
  – # Day: (since June 17, 2008)
  – # Past tweets: 21,000

Content Features
  – URL
  – Hashtag
  – Mention
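The simple features f1 … fn that feed the retweet model could be extracted along these lines. This sketch covers only feature extraction, not the model itself. The regular expressions, function names, and feature keys are assumptions made for illustration, and the account age is computed against the talk date purely as an example.

```python
import re
from datetime import date

URL_PATTERN = re.compile(r"https?://\S+")
HASHTAG_PATTERN = re.compile(r"(?<!\w)#\w+")
MENTION_PATTERN = re.compile(r"(?<!\w)@\w+")


def content_features(text):
    """Binary content features: does the tweet contain a URL, hashtag, mention?"""
    # Remove URLs first so a '#fragment' inside a link is not read as a hashtag.
    text_without_urls = URL_PATTERN.sub(" ", text)
    return {
        "has_url": bool(URL_PATTERN.search(text)),
        "has_hashtag": bool(HASHTAG_PATTERN.search(text_without_urls)),
        "has_mention": bool(MENTION_PATTERN.search(text_without_urls)),
    }


def contextual_features(followees, followers, favorites, account_age_days, past_tweets):
    """Features describing the author rather than the tweet text."""
    return {
        "followees": followees,
        "followers": followers,
        "favorites": favorites,
        "account_age_days": account_age_days,
        "past_tweets": past_tweets,
    }


def tweet_features(text, **author):
    """Combine both feature groups into one feature dictionary."""
    features = content_features(text)
    features.update(contextual_features(**author))
    return features


# Author counts taken from the example slide; the tweet text is made up.
features = tweet_features(
    "Slides from #NPUC http://example.com cc @someone",
    followees=395,
    followers=1400,
    favorites=1657,
    account_age_days=(date(2010, 10, 22) - date(2008, 6, 17)).days,
    past_tweets=21000,
)
```

Each tweet then becomes one row of features, and the retweet count can be modeled as a function of those columns.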

Two Types of Features

[Figure: plot with axes labeled Content Factor and Contextual Factor]

Information Streams => Information Overload

ASC Social Recommender Engine

Recommendation Algorithm: Combining Sources and Models

[Figure: diagram with the boxes My Friends’ URLs, Popular URLs, My Friends’ Network and Tweeting Pattern, Social Ranking Model, My Tweets, My Friends’ Tweets, Topic Relevance Model, and Recommendations]

Hadoop Compute Cluster
  – 50 nodes, depending on project requirement
  – ~40TB storage capacity
  – Experience with HBase, Pig, Interaction with Lucene, MySQL

Large-scale crawling and analytics experience with
  – Wikipedia (all edits up to 2009)
  – Delicious data set (200M bookmarks)
  – Twitter (70M+ Tweets)

Experience with Large Scale Social Analytics
  – Example 1: Visual analytics in Wikipedia (wikidashboard.com)
  – Example 2: Search engines for social bookmarks (mrtaggy.com)
  – Example 3: Recommenders for Twitter news (zerozero88.com)

Image from: http://www.flickr.com/photos/ourcommon/480538715/

Research Vision: Understand how social computing systems can enhance the ability of a group of people to remember, think, and reason.

Understand and support Collective Intelligence by modeling social group behaviors and testing prototype tools in Living Labs

http://asc-parc.blogspot.com
http://www.edchi.net
echi@parc.com