The Art of Social Media Analysis with Twitter & Python-OSCON 2012

The Art of Social Media Analysis with Twitter & Python krishna sankar @ksankar http://www.oscon.com/oscon2012/public/schedule/detail/23130

description

Final Slides for my 2012 Tutorial http://goo.gl/fpxVE

Transcript of The Art of Social Media Analysis with Twitter & Python-OSCON 2012

Page 1: The Art of Social Media Analysis with Twitter & Python-OSCON 2012

The Art of Social Media Analysis

with Twitter & Python

krishna sankar @ksankar

http://www.oscon.com/oscon2012/public/schedule/detail/23130

Page 2: The Art of Social Media Analysis with Twitter & Python-OSCON 2012

Intro

API, Objects,…

Twitter Network Analysis Pipeline

@mention network

Growth, weak ties

Retweet analytics, Information contagion

Cliques, social graph

NLP, NLTK, Sentiment Analysis

#tag Network

o  House Rules (1 of 2)
o  Doesn't assume any knowledge of the Twitter API
o  Goal: get everybody on the same page with a working knowledge of the Twitter API
o  To bootstrap your exploration into Social Network Analysis & Twitter
o  Simple programs, to illustrate usage & data manipulation

We will analyze @clouderati, 2072 followers, exploding to ~980,000 distinct users down one level

Page 3: The Art of Social Media Analysis with Twitter & Python-OSCON 2012

Intro

API, Objects,…

Twitter Network Analysis Pipeline

@mention network

Growth, weak ties

Retweet analytics, Information contagion

Cliques, social graph

NLP, NLTK, Sentiment Analysis

#tag Network

o  House Rules (2 of 2)
o  Am using the requests library
o  There are good Twitter frameworks for Python, but wanted to build from the basics. Once one understands the fundamentals, frameworks can help
o  Many areas to explore – not enough time. So decided to focus on social graph, cliques & networkx

We will analyze @clouderati, 2072 followers, exploding to ~980,000 distinct users down one level

Page 4: The Art of Social Media Analysis with Twitter & Python-OSCON 2012

About Me
•  Lead Engineer/Data Scientist/AWS Ops Guy at Genophen.com
o  Co-chair – 2012 IEEE Precision Time Synchronization (http://www.ispcs.org/2012/index.html)
o  Blog: http://doubleclix.wordpress.com/
o  Quora: http://www.quora.com/Krishna-Sankar
•  Prior Gigs
o  Lead Architect (Egnyte)
o  Distinguished Engineer (CSCO)
o  Employee #64439 (CSCO) to #39 (Egnyte) & now #9!
•  Current Focus:
o  Design, build & ops of BioInformatics/Consumer Infrastructure on AWS, MongoDB, Solr, Drupal, GitHub, …
o  Big Data (more of variety, variability, context & graphs than volume or velocity – so far!)
o  Overlay based semantic search & ranking
•  Other related Presentations
o  http://goo.gl/P1rhc Big Data Engineering Top 10 Pragmatics (Summary)
o  http://goo.gl/0SQDV The Art of Big Data (Detailed)
o  http://goo.gl/EaUKH The Hitchhiker's Guide to Kaggle, OSCON 2011 Tutorial

Page 5: The Art of Social Media Analysis with Twitter & Python-OSCON 2012

Twitter Tips – A Baker's Dozen
1.  Twitter APIs are (more or less) congruent & symmetric
2.  Twitter is usually right & simple - recheck when you get unexpected results before blaming Twitter
o  I was getting numbers when I was expecting screen_names in user objects.
o  Was ready to send blasting e-mails to the Twitter team. Decided to check one more time and found that my parameter key was wrong – screen_name instead of user_id
o  Always test with one or two records before a long run! - learned the hard way
3.  Twitter APIs are very powerful – consistent use can bear huge data
o  In a week, you can pull in 4-5 million users & some tweets!
o  Night runs are far faster & error-free
4.  Use a NOSQL data store as a command buffer & data buffer
o  Would make it easy to work with Twitter at scale
o  I use MongoDB
o  Keep the schema simple & no fancy transformation
•  And as far as possible the same as the (json) response
o  Use NOSQL CLI for trimming records et al

The End As The Beginning
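A rough sketch of tip 4 in practice, assuming pymongo and a local MongoDB; the database and collection names below are made up for illustration and are not the tutorial's actual schema:

import json
import pymongo
import requests

conn = pymongo.Connection()                     # pymongo 2.x style connection
db = conn['oscon_demo']                         # hypothetical database name

r = requests.get('https://api.twitter.com/1/followers/ids.json',
                 params={'screen_name': 'clouderati', 'cursor': -1})
doc = json.loads(r.text)                        # keep the schema the same as the (json) response
doc['fetched_at'] = r.headers.get('date')       # only add small bookkeeping fields
db.followers_raw.insert(doc)                    # hypothetical collection, acting as the data buffer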

Page 6: The Art of Social Media Analysis with Twitter & Python-OSCON 2012

Twitter Tips – A Baker's Dozen
5.  Always use a big data pipeline
o  Collect - Store - Transform & Analyze - Model & Reason - Predict, Recommend & Visualize
o  That way you can orthogonally extend, with functional components like command buffers, validation et al
6.  Use a functional approach for a scalable pipeline
o  Compose your big data pipeline with well defined granular functions, each doing only one thing
o  Don't overload the functional components (i.e. no collect, unroll & store as a single component)
o  Have well defined functional components with appropriate caching, buffering, checkpoints & restart techniques
•  This did create some trouble for me, as we will see later
7.  Crawl-Store-Validate-Recrawl-Refresh cycle
o  The equivalent of the traditional ETL
o  Validation stage & validation routines are important
•  Cannot expect perfect runs
•  Cannot manually look at data either, when data is at scale
8.  Have control numbers to validate runs & monitor them
o  I still remember control numbers which start with the number of punch cards in the input deck & then follow that number through the various runs!
o  There will be a separate printout of the control numbers that will be kept in the operations files

Page 7: The Art of Social Media Analysis with Twitter & Python-OSCON 2012

Twitter Tips – A Baker's Dozen
9.  Program defensively
o  More so for REST-based Big Data analytics systems
o  Expect failures at the transport layer & accommodate for them
10.  Have Erlang-style supervisors in your pipeline
o  Fail fast & move on
o  Don't linger and try to fix errors that cannot be controlled at that layer
o  A higher layer process will circle back and do incremental runs to correct missing spiders and crawls
o  Be aware of visibility & lack of context. Validate at the lowest layer that has enough context to take corrective actions
o  I have an example in part 2
11.  Data will never be perfect
o  Know your data & accommodate for its idiosyncrasies
•  for example: 0 followers, protected users, 0 friends, …
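A rough illustration of tips 9-11 combined, assuming the requests library; the helper name and the 'transport_error' marker are mine, not from the tutorial scripts:

import json
import requests

def fetch_followers(user_id):
    # Tips 9/10: expect transport failures, fail fast, tag the record and move on;
    # a later validate/re-crawl pass (the Erlang-style supervisor) fixes it.
    url = 'https://api.twitter.com/1/followers/ids.json'
    try:
        r = requests.get(url, params={'user_id': user_id, 'cursor': -1})
    except requests.exceptions.ConnectionError as err:
        return {'user_id': user_id, 'status_code': 'transport_error', 'error': str(err)}
    # Tip 11: 0 followers or protected users (401) are normal; record them, don't choke on them
    return {'user_id': user_id, 'status_code': r.status_code, 'body': json.loads(r.text)}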

Page 8: The Art of Social Media Analysis with Twitter & Python-OSCON 2012

Twitter Tips – A Baker's Dozen
12.  Check Point frequently (preferably after every API call) & have a re-startable command buffer cache
o  See a MongoDB example in Part 2
13.  Don't bombard the URL
o  Wait a few seconds between successive calls. This will end up with a scalable system, eventually
o  I found 10 seconds to be the sweet spot. 5 seconds gave a retry error. Was able to work with 5 seconds with wait & retry. Then, the rate limit started kicking in!
14.  Always measure the elapsed time of your API runs & processing
o  Kind of an early warning when something is wrong
15.  Develop incrementally; don't forget to check for "cut & paste" errors
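A sketch of tips 12-14 rolled into one helper, using the 10-second pause mentioned above; the function and its parameters are illustrative, not taken from the tutorial code:

import time
import requests

def get_with_pause(url, params, pause=10, rate_limit_wait=300):
    start = time.time()
    r = requests.get(url, params=params)
    while r.status_code == 400 and r.headers.get('x-ratelimit-remaining') == '0':
        # Rate limit kicked in: sleep a while (5 min here) and retry the same call
        time.sleep(rate_limit_wait)
        r = requests.get(url, params=params)
    print 'elapsed %.2f s' % (time.time() - start)   # tip 14: early warning when something is off
    time.sleep(pause)                                # tip 13: don't bombard the URL between calls
    return r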

Page 9: The Art of Social Media Analysis with Twitter & Python-OSCON 2012

Twitter Tips – A Baker's Dozen
16.  The Twitter big data pipeline has lots of opportunities for parallelism
o  Leverage data parallelism frameworks like MapReduce
o  But first:
§  Prototype as a linear system,
§  Optimize and tweak the functional modules & cache strategies,
§  Note down stages and tasks that can be parallelized and
§  Then parallelize them
o  For the example project we will see later, I did not leverage any parallel frameworks, but the opportunities were clearly evident. I will point them out as we progress through the tutorial
17.  Pay attention to handoffs between stages
o  They might require transformation – for example collect & store might store a user list as multiple arrays, while the model requires each user to be a document for aggregation
o  But resist the urge to overload collect with transform
o  i.e. let the collect stage store in arrays, but then have an unroll/flatten stage to transform the array into separate documents
o  Add transformation as a granular function – of course, with appropriate buffering, caching, checkpoints & restart techniques
18.  Have a good log management system to capture and wade through logs

Page 10: The Art of Social Media Analysis with Twitter & Python-OSCON 2012

Twitter Tips – A Baker's Dozen
19.  Understand the underlying network characteristics for the inference you want to make
o  Twitter Network != Facebook Network, Twitter Graph != LinkedIn Graph
o  Twitter Network is more of an Interest Network
o  So, many of the traditional network mechanisms & mechanics, like network diameter & degrees of separation, might not make sense
o  But others, like Cliques and Bipartite Graphs, do

Page 11: The Art of Social Media Analysis with Twitter & Python-OSCON 2012

Twitter Gripes
1.  Need more rich APIs for #tags
o  Somewhat similar to users viz. followers, friends et al
o  Might make sense to make #tags a top level object with its own semantics
2.  HTTP Error Return is not uniform
o  Returns 400 Bad Request instead of 420
o  Granted, there is enough information to figure this out
3.  Need an easier way to get screen_name from user_id
4.  "following" vs. "friends_count" i.e. "following" is a dummy variable
o  There are a few like this, most probably for backward compatibility
5.  Parameter Validation is not uniform
o  Gives "404 Not Found" instead of "406 Not Acceptable" or "413 Too Long" or "416 Range Unacceptable"
6.  Overall more validation would help
o  Granted, it is more of growing pains. Once one comes across a few inconsistencies, the rest is easy to figure out

Page 12: The Art of Social Media Analysis with Twitter & Python-OSCON 2012

A Fork

• Not enough time for both

•  NLP, NLTK & deep into Tweets
o  Sentiment Analysis

• I chose the Social Graph route

Page 13: The Art of Social Media Analysis with Twitter & Python-OSCON 2012

A minute about Twitter as a platform & its evolution

My Wish & Hope
•  I spend a lot of time with Twitter & derive value; the platform is rich & the APIs intuitive
•  I did like the fact that tweets were part of LinkedIn. I still used Twitter more than LinkedIn
o  I don't think showing Tweets in LinkedIn took anything away from the Twitter experience
o  The LinkedIn experience & the Twitter experience are different & distinct. Showing tweets in LinkedIn didn't change that
•  I sincerely hope that the platform grows with a rich developer ecosystem
•  An orthogonally extensible platform is essential
•  Of course, along with a congruent user experience – "… core Twitter consumption experience through consistent tools"

https://dev.twitter.com/blog/delivering-consistent-twitter-experience

"The micro-blogging service must find the right balance of running a profitable business and maintaining a robust developers' community." – Chenda, CBS News

"… we want to make sure that the Twitter experience is straightforward and easy to understand -- whether you're on Twitter.com or elsewhere on the web" – Michael

Page 14: The Art of Social Media Analysis with Twitter & Python-OSCON 2012

Setup
•  For Hands-on Today
o  Python 2.7.3
o  easy_install -v requests
•  http://docs.python-requests.org/en/latest/user/quickstart/#make-a-request
o  easy_install -v requests-oauth
o  Hands-on programs at https://github.com/xsankar/oscon2012-handson
•  For advanced data science with social graphs
o  easy_install -v networkx
o  easy_install -v numpy
o  easy_install -v nltk
•  Not for this tutorial, but good for sentiment analysis et al
o  MongoDB
•  I used MongoDB in AWS m2.xlarge, RAID 10 X 8 X 15 GB EBS
o  graphviz - http://www.graphviz.org/; easy_install pygraphviz
o  easy_install pydot

Page 15: The Art of Social Media Analysis with Twitter & Python-OSCON 2012

Thanks To these Giants …

Page 16: The Art of Social Media Analysis with Twitter & Python-OSCON 2012

Problem Domain For this tutorial •  Data  Science  (trends,  analytics  et  al)  on  Social  Networks  as  

observed  by  Twitter  primitives  o  Not  for  Twitter  based  apps  for  real  time  tweets  o  Not  web  sites  with  real  time  tweets  

•  By  looking  at  the  domain  in  aggregate  to  derive  inferences  &  actionable  recommendations  

•  Which  also  means,  you  need  to  be  deliberate  &  systemic  (  i.e.  not  look  at  a  fluctuation  as  a  trend  but  dig  deeper  before  pronouncing  a  trend)  

Page 17: The Art of Social Media Analysis with Twitter & Python-OSCON 2012

Agenda I.  Mechanics  :  Twitter  API  (1:30  PM  -­‐  3:00  PM)    

o  Essential  Fundamentals  (Rate  Limit,  HTTP  Codes  et  al)  o  Objects  o  API  o  Hands-­‐on  (2:45  PM  -­‐  3:00  PM)  

II.  Break  (3:00  PM  -­‐  3:30  PM)  III.  Twitter  Social  Graph  Analysis  (3:30  PM  -­‐  5:00  PM)  

o  Underlying  Concepts  o  Social  Graph  Analysis  of  @clouderati  

§  Stages,  Strategies  &  Tasks  §  Code  Walk  thru    

Page 18: The Art of Social Media Analysis with Twitter & Python-OSCON 2012

Open  This  First

Page 19: The Art of Social Media Analysis with Twitter & Python-OSCON 2012

Twitter API: Read These First
•  Using the Twitter Brand
o  New logo & associated guidelines: https://twitter.com/about/logos
o  Twitter Rules: https://support.twitter.com/groups/33-report-a-violation/topics/121-guidelines-best-practices/articles/18311-the-twitter-rules
o  Developer Rules of the Road: https://dev.twitter.com/terms/api-terms
•  Read These Links First
1.  https://dev.twitter.com/docs/things-every-developer-should-know
2.  https://dev.twitter.com/docs/faq
3.  Field Guide to Objects https://dev.twitter.com/docs/platform-objects
4.  Security https://dev.twitter.com/docs/security-best-practices
5.  Media Best Practices: https://dev.twitter.com/media
6.  Consolidated Page: https://dev.twitter.com/docs
7.  Streaming APIs https://dev.twitter.com/docs/streaming-apis
8.  How to Appeal (Not that you all would need it!) https://support.twitter.com/articles/72585
•  Only one version of Twitter APIs

Page 20: The Art of Social Media Analysis with Twitter & Python-OSCON 2012

API  Status  Page

•  https://dev.twitter.com/status  •  https://dev.twitter.com/issues  •  https://dev.twitter.com/discussions  

Page 21: The Art of Social Media Analysis with Twitter & Python-OSCON 2012

https://dev.twitter.com/status

http://www.buzzfeed.com/tommywilhelm/google-users-being-total-dicks-about-the-twitter

Page 22: The Art of Social Media Analysis with Twitter & Python-OSCON 2012

Open  This  First •  Install  pre-­‐req  as  per  the  setup  slide  •  Run    

o  oscon2012_open_this_first.py  o  To  test  connectivity  –  “canary  query”  

•  Run  o  oscon2012_rate_limit_status.py  o  Use  http://www.epochconverter.com  to  check  reset_time  

•  Formats  xml,  json,  atom  &  rss  
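A minimal version of the kind of canary query these scripts run, assuming the v1 rate_limit_status endpoint and its usual response fields (remaining_hits, reset_time, etc.); not the tutorial's actual code:

import json
import requests

r = requests.get('https://api.twitter.com/1/account/rate_limit_status.json')
status = json.loads(r.text)
print status['remaining_hits'], 'of', status['hourly_limit'], 'calls left'
print 'resets at', status['reset_time'], '-', status['reset_time_in_seconds']
# paste reset_time_in_seconds into http://www.epochconverter.com to sanity-check it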

Page 23: The Art of Social Media Analysis with Twitter & Python-OSCON 2012

Twitter API (diagram)
•  Twitter REST – Core Data, Core Twitter Objects; Build Profile, Create/Post Tweets, Reply, Favorite, Re-tweet; Rate Limit: 150/350
•  Twitter Search – Search & Trend; Keywords, Specific User, Trends; Rate Limit: Complexity & Frequency
•  Streaming – Near-realtime, High Volume; Public Streams, User Streams, Site Streams; Follow users, topics, data mining
•  Firehose

Page 24: The Art of Social Media Analysis with Twitter & Python-OSCON 2012

Rate  Limit

Page 25: The Art of Social Media Analysis with Twitter & Python-OSCON 2012

Rate Limits
•  By API type & Authentication Mode

API        | No auth                 | Auth    | Error
REST       | 150/hr                  | 350/hr  | 400
Search     | Complexity & Frequency  | -N/A-   | 420
Streaming  | Up to 1%                |         |
Firehose   | none                    | none    |

Page 26: The Art of Social Media Analysis with Twitter & Python-OSCON 2012

Rate Limit Header
{
  "status": "200 OK",
  "vary": "Accept-Encoding",
  "x-frame-options": "SAMEORIGIN",
  "x-mid": "8e775a9323c45f2a541eeb4d2d1eb9b468w81c6",
  "x-ratelimit-class": "api",
  "x-ratelimit-limit": "150",
  "x-ratelimit-remaining": "149",
  "x-ratelimit-reset": "1340467358",
  "x-runtime": "0.04144",
  "x-transaction": "2b49ac31cf8709af",
  "x-transaction-mask": "a6183ffa5f8ca943ff1b53b5644ef114df9d6bba"
}

Page 27: The Art of Social Media Analysis with Twitter & Python-OSCON 2012

Rate  Limit-­‐‑ed  Header •  {  •     "cache-­‐control":  "no-­‐cache,  max-­‐age=300",    •     "content-­‐encoding":  "gzip",    •     "content-­‐length":  "150",    •     "content-­‐type":  "application/json;  charset=utf-­‐8",    •     "date":  "Wed,  04  Jul  2012  00:48:25  GMT",    •     "expires":  "Wed,  04  Jul  2012  00:53:25  GMT",    •     "server":  "tfe",    •     ”…  •     "status":  "400  Bad  Request",    •     "vary":  "Accept-­‐Encoding",    •     "x-­‐ratelimit-­‐class":  "api",    •     "x-­‐ratelimit-­‐limit":  "150",    •     "x-­‐ratelimit-­‐remaining":  "0",    •     "x-­‐ratelimit-­‐reset":  "1341363230",    •     "x-­‐runtime":  "0.01126"  •  }  

Page 28: The Art of Social Media Analysis with Twitter & Python-OSCON 2012

Rate  Limit  Example •  Run  

o  oscon2012_rate_limit_02.py  

•  It  iterates  through  a  list  to  get  followers    •  List  is  2072  long  

Page 29: The Art of Social Media Analysis with Twitter & Python-OSCON 2012

•  {  •     …  •     "date":  "Wed,  04  Jul  2012  00:54:16  GMT",    •  "status":  "200  OK",    •     "vary":  "Accept-­‐Encoding",    •     "x-­‐frame-­‐options":  "SAMEORIGIN",    •     "x-­‐mid":  "f31c7278ef8b6e28571166d359132f152289c3b8",    •     "x-­‐ratelimit-­‐class":  "api",    •     "x-­‐ratelimit-­‐limit":  "150",    •     "x-­‐ratelimit-­‐remaining":  "147",    •     "x-­‐ratelimit-­‐reset":  "1341366831",    •     "x-­‐runtime":  "0.02768",    •     "x-­‐transaction":  "f1bafd60112dddeb",    •     "x-­‐transaction-­‐mask":  "a6183ffa5f8ca943ff1b53b5644ef11417281dbc"  •  }  

Last time, it gave me 5 min. Now the reset timer is 1 hour. 150 calls, not authenticated

Page 30: The Art of Social Media Analysis with Twitter & Python-OSCON 2012

•  {  •     "cache-­‐control":  "no-­‐cache,  max-­‐age=300",    •     "content-­‐encoding":  "gzip",    •     "content-­‐type":  "application/json;  charset=utf-­‐8",    •     "date":  "Wed,  04  Jul  2012  00:55:04  GMT",    •  …  •  "status":  "400  Bad  Request",    •     "transfer-­‐encoding":  "chunked",    •     "vary":  "Accept-­‐Encoding",    •     "x-­‐ratelimit-­‐class":  "api",    •     "x-­‐ratelimit-­‐limit":  "150",    •     "x-­‐ratelimit-­‐remaining":  "0",    •     "x-­‐ratelimit-­‐reset":  "1341366831",    •     "x-­‐runtime":  "0.01342"  •  }  

And Rate Limit kicked in

Page 31: The Art of Social Media Analysis with Twitter & Python-OSCON 2012

API  with  OAuth •  {  •     …  •     "date":  "Wed,  04  Jul  2012  01:32:01  GMT",    •     "etag":  "\"dd419c02ed00fc6b2a825cc27wbe040\"",    •     "expires":  "Tue,  31  Mar  1981  05:00:00  GMT",    •     "last-­‐modified":  "Wed,  04  Jul  2012  01:32:01  GMT",    •     "pragma":  "no-­‐cache",    •     "server":  "tfe",    •  …  •  "status":  "200  OK",    •     "vary":  "Accept-­‐Encoding",    •     "x-­‐access-­‐level":  "read",    •     "x-­‐frame-­‐options":  "SAMEORIGIN",    •     "x-­‐mid":  "5bbb87c04fa43c43bc9d7482bc62633a1ece381c",    •     "x-­‐ratelimit-­‐class":  "api_identified",    •     "x-­‐ratelimit-­‐limit":  "350",    •     "x-­‐ratelimit-­‐remaining":  "349",    •     "x-­‐ratelimit-­‐reset":  "1341369121",    •     "x-­‐runtime":  "0.05539",    •     "x-­‐transaction":  "9f8508fe4c73a407",    •     "x-­‐transaction-­‐mask":  "a6183ffa5f8ca943ff1b53b5644ef11417281dbc"  •  }  

OAuth  “api-­‐identified”  

1  hr  reset  350  calls  

Page 32: The Art of Social Media Analysis with Twitter & Python-OSCON 2012

•  {  •     …  •     "date":  "Thu,  05  Jul  2012  14:56:05  GMT",    •  …  

•     "x-­‐ratelimit-­‐class":  "api_identified",    •     "x-­‐ratelimit-­‐limit":  "350",    •     "x-­‐ratelimit-­‐remaining":  "133",    

•     "x-­‐ratelimit-­‐reset":  "1341500165",    •   …  •  }  •  ********  2416  

•  {  •  …  •     "date":  "Thu,  05  Jul  2012  14:56:18  GMT",    

•  …  •     "status":  "200  OK",    •     ….  •     "x-­‐ratelimit-­‐class":  "api_identified",    

•     "x-­‐ratelimit-­‐limit":  "350",    •     "x-­‐ratelimit-­‐remaining":  "349",    •     "x-­‐ratelimit-­‐reset":  "1341503776",    •  ********  2417  

Rate  Limit  resets  during  consecutive  calls

+1  hour

Page 33: The Art of Social Media Analysis with Twitter & Python-OSCON 2012

Unexplained  Errors •  Traceback  (most  recent  call  last):  •     File  "oscon2012_get_user_info_01.py",  line  39,  in  <module>  •         r  =  client.get(url,  params=payload)  •     File  "build/bdist.macosx-­‐10.6-­‐intel/egg/requests/sessions.py",  line  244,  in  get  •     File  "build/bdist.macosx-­‐10.6-­‐intel/egg/requests/sessions.py",  line  230,  in  request  •     File  "build/bdist.macosx-­‐10.6-­‐intel/egg/requests/models.py",  line  609,  in  send  •  requests.exceptions.ConnectionError:  HTTPSConnectionPool(host='api.twitter.com',  port=443):  Max  

retries  exceeded  with  url:  /1/users/lookup.json?user_id=237552390%2C101237516%2C208192270%2C340183853%2C221203257%2C15254297%2C44614426%2C617136931%2C415810340%2C76071717%2C17351462%2C574253%2C35048243%2C388547381%2C254329657%2C65585979%2C253580293%2C392741693%2C126403390%2C300467007%2C8962882%2C21545799%2C15254346%2C141083469%2C340312913%2C44614485%2C600359770%2C17351519%2C38323042%2C21545828%2C86557546%2C90751854%2C128500592%2C115917681%2C42517364%2C34128760%2C15254397%2C453559166%2C92849025%2C600359811%2C17351556%2C8962952%2C296038349%2C325503810%2C122209166%2C123827693%2C59294611%2C19448725%2C21545881%2C17351581%2C130468677%2C80266144%2C15254434%2C84680859%2C65586084%2C19448741%2C15254438%2C214483879%2C48808878%2C88654768%2C15474846%2C48808887%2C334021563%2C60214090%2C134792126%2C15254464%2C558416833%2C138986435%2C264815556%2C63488965%2C17222476%2C537445328%2C97854214%2C255598755%2C65586132%2C36226009%2C187220954%2C257346383%2C15254493%2C554222558%2C302564320%2C59165520%2C44614626%2C76071907%2C80266213%2C325503825%2C403227628%2C20368210%2C17351666%2C88654836%2C340313077%2C151569400%2C302564345%2C118014971%2C11060222%2C233229141%2C13727232%2C199803906%2C220435108%2C268531201  

While trying to get details of 1,000,000 users, I get this error – usually 10-6 AM PST. Got around it by "Trap & wait 5 seconds". Night runs are relatively error free.

Page 34: The Art of Social Media Analysis with Twitter & Python-OSCON 2012

•  {  •   …  •     "date":  "Fri,  06  Jul  2012  03:41:09  GMT",    •     "expires":  "Fri,  06  Jul  2012  03:46:09  GMT",    •     "server":  "tfe",    •     "set-­‐cookie":  "dnt=;  domain=.twitter.com;  path=/;  expires=Thu,  01-­‐Jan-­‐1970  00:00:00  GMT",    •     "status":  "400  Bad  Request",    •     "vary":  "Accept-­‐Encoding",    •     "x-­‐ratelimit-­‐class":  "api_identified",    •     "x-­‐ratelimit-­‐limit":  "350",    •     "x-­‐ratelimit-­‐remaining":  "0",    •     "x-­‐ratelimit-­‐reset":  "1341546334",    •     "x-­‐runtime":  "0.01918"  •  }  •  Error,  sleeping  •  {  •   …  •   "date":  "Fri,  06  Jul  2012  03:46:12  GMT",    •   …  •   "status":  "200  OK",    •   …  •   "x-­‐ratelimit-­‐class":  "api_identified",    •     "x-­‐ratelimit-­‐limit":  "350",    •     "x-­‐ratelimit-­‐remaining":  "349",    •   …  •  }  

Missed  by  4  min!

OK  after  5  min  sleep

A Day in the Life of Twitter Rate Limit

Page 35: The Art of Social Media Analysis with Twitter & Python-OSCON 2012

Strategies
I have no exotic strategies, so far!
1.  Obvious: track elapsed time & sleep when the rate limit kicks in
2.  Combine authenticated & non-authenticated calls
3.  Use multiple API types
4.  Cache
5.  Store & get only what is needed
6.  Checkpoint & buffer request commands
7.  Distributed data parallelism – for example AWS instances
http://www.epochconverter.com/ <- useful to debug the timer
Please share your tips and tricks for conserving the Rate Limit

Page 36: The Art of Social Media Analysis with Twitter & Python-OSCON 2012

Authentication

Page 37: The Art of Social Media Analysis with Twitter & Python-OSCON 2012

Authentication •  Three  modes  

o  Anonymous  o  HTTP  Basic  Auth  o  OAuth  

•  As  of  Aug  31,  2010,  only  Anonymous  or  OAuth  are  supported  

•   OAuth  enables  the  user  to  authorize  an  application  without  sharing  credentials  

•  Also  has  the  ability  to  revoke  •  Twitter  supports  OAuth  1.0a  •  OAuth  2.0  is  the  new  standard,  much  simpler  

o  No  timeframe  for  Twitter  support,  yet      

Page 38: The Art of Social Media Analysis with Twitter & Python-OSCON 2012

OAuth Pragmatics
•  Helpful Links
o  https://dev.twitter.com/docs/auth/oauth
o  https://dev.twitter.com/docs/auth/moving-from-basic-auth-to-oauth
o  https://dev.twitter.com/docs/auth/oauth/single-user-with-examples
o  http://blog.andydenmark.com/2009/03/how-to-build-oauth-consumer.html
•  Discussion of OAuth's internal mechanisms is better left for another day
•  For headless applications to get an OAuth token, go to https://dev.twitter.com/apps
•  Create an application & get four credential pieces
o  Consumer Key, Consumer Secret, Access Token & Access Token Secret
•  All the frameworks have support for OAuth. So plug in these values & use the framework's calls
•  I used the requests-oauth library like so:

Page 39: The Art of Social Media Analysis with Twitter & Python-OSCON 2012

request-oauth

def get_oauth_client():
    consumer_key = "5dbf348aa966c5f7f07e8ce2ba5e7a3badc234bc"
    consumer_secret = "fceb3aedb960374e74f559caeabab3562efe97b4"
    access_token = "df919acd38722bc0bd553651c80674fab2b465086782Ls"
    access_token_secret = "1370adbe858f9d726a43211afea2b2d9928ed878"
    header_auth = True
    oauth_hook = OAuthHook(access_token, access_token_secret, consumer_key, consumer_secret, header_auth)
    client = requests.session(hooks={'pre_request': oauth_hook})
    return client

def get_followers(user_id):
    url = 'https://api.twitter.com/1/followers/ids.json'
    payload = {"user_id": user_id}  # if cursor is needed {"cursor": -1, "user_id": scr_name}
    r = requests.get(url, params=payload)
    return r

def get_followers_with_oauth(user_id, client):
    url = 'https://api.twitter.com/1/followers/ids.json'
    payload = {"user_id": user_id}  # if cursor is needed {"cursor": -1, "user_id": scr_name}
    r = client.get(url, params=payload)
    return r

Use  the  client  instead  of  requests  

Get  client  using  the  token,  key  &  secret  from  dev.twitter.com/apps  

Ref: http://pypi.python.org/pypi/requests-oauth

Page 40: The Art of Social Media Analysis with Twitter & Python-OSCON 2012

OAuth Authorize screen
•  The user authenticates with Twitter & grants access to Forbes Social
•  Forbes Social doesn't have the user's credentials, but uses OAuth to access the user's account

Page 41: The Art of Social Media Analysis with Twitter & Python-OSCON 2012

HTTP  Status  Codes

Page 42: The Art of Social Media Analysis with Twitter & Python-OSCON 2012

HTTP Status Codes
•  0 Never made it to the Twitter servers - library error
•  200 OK
•  304 Not Modified
•  400 Bad Request
o  Check error message for explanation
o  REST Rate Limit!
•  401 Unauthorized
o  Beware – you could get this for other reasons as well
•  403 Forbidden
o  Hit Update Limit (> max Tweets/day, following too many people)
•  404 Not Found
•  406 Not Acceptable
•  413 Too Long
•  416 Range Unacceptable
•  420 Enhance Your Calm
o  Rate Limited
•  500 Internal Server Error
•  502 Bad Gateway
o  Down for maintenance
•  503 Service Unavailable
o  Overloaded "Fail whale"
•  504 Gateway Timeout
o  Overloaded

https://dev.twitter.com/docs/error-codes-responses

Page 43: The Art of Social Media Analysis with Twitter & Python-OSCON 2012

HTTP  Status  Code  -­‐‑  Example •  {  •     "cache-­‐control":  "no-­‐cache,  max-­‐age=300",    •     "content-­‐encoding":  "gzip",    •     "content-­‐length":  "91",    •     "content-­‐type":  "application/json;  charset=utf-­‐8",    •     "date":  "Sat,  23  Jun  2012  00:06:56  GMT",    •     "expires":  "Sat,  23  Jun  2012  00:11:56  GMT",    •     "server":  "tfe",    •   …  •     "status":  "401  Unauthorized",    •     "vary":  "Accept-­‐Encoding",    •     "www-­‐authenticate":  "OAuth  realm=\"https://api.twitter.com\"",    •     "x-­‐ratelimit-­‐class":  "api",    •     "x-­‐ratelimit-­‐limit":  "0",    •     "x-­‐ratelimit-­‐remaining":  "0",    •     "x-­‐ratelimit-­‐reset":  "1340413616",    •     "x-­‐runtime":  "0.01997"  •  }  •  {  •     "errors":  [  •         {  •             "code":  53,    •             "message":  "Basic  authentication  is  not  supported"  •         }  •     ]  •  }  

Detailed  error  message    in  JSON  !  

I  like  this  

Page 44: The Art of Social Media Analysis with Twitter & Python-OSCON 2012

HTTP  Status  Code  –  Confusing  Example •  {  •  …  •     "pragma":  "no-­‐cache",    •     "server":  "tfe",    •   …    •     "status":  "404  Not  Found",    •     …  •  }  •  {  •     "errors":  [  •         {  •             "code":  34,    •             "message":  "Sorry,  that  page  does  not  exist"  •         }  •     ]  •  }  

•  GET  https://api.twitter.com/1/users/lookup.json?screen_nme=twitterapi,twitter&include_entities=true  

•  Spelling  Mistake  o  Should  be  screen_name  

•  But confusing error!
•  Should be 406 Not Acceptable or 413 Too Long, showing a parameter error

Page 45: The Art of Social Media Analysis with Twitter & Python-OSCON 2012

HTTP  Status  Code  -­‐‑  Example •  {  •     "cache-­‐control":  "no-­‐cache,  no-­‐store,  must-­‐revalidate,  pre-­‐check=0,  post-­‐check=0",    •     "content-­‐encoding":  "gzip",    •     "content-­‐length":  "112",    •     "content-­‐type":  "application/json;charset=utf-­‐8",    •     "date":  "Sat,  23  Jun  2012  01:23:47  GMT",    •     "expires":  "Tue,  31  Mar  1981  05:00:00  GMT",    •  …  •     "status":  "401  Unauthorized",    •     "www-­‐authenticate":  "OAuth  realm=\"https://api.twitter.com\"",    •     "x-­‐frame-­‐options":  "SAMEORIGIN",    •     "x-­‐ratelimit-­‐class":  "api",    •     "x-­‐ratelimit-­‐limit":  "150",    •     "x-­‐ratelimit-­‐remaining":  "147",    •     "x-­‐ratelimit-­‐reset":  "1340417742",    •     "x-­‐transaction":  "d545a806f9c72b98"  •  }  •  {  •     "error":  "Not  authorized",    •     "request":  "/1/statuses/user_timeline.json?user_id=12%2C15%2C20"  •  }  

Sometimes,  the  errors  are  not  correct.  I  got  this  error  for  user_timeline.json  w/  user_id=20,15,12  Clearly  a  parameter  error  (i.e.  more  parameters)  

Page 46: The Art of Social Media Analysis with Twitter & Python-OSCON 2012

Objects

Page 47: The Art of Social Media Analysis with Twitter & Python-OSCON 2012

Twitter Platform Objects (diagram)
•  Users – have Friends & Followers (Follow / Are Followed By)
•  Tweets (Status Updates) – embed Entities; collected into a TimeLine (Temporally Ordered)
•  Entities – hashtags (#), media, urls, user_mentions (@)
•  Places
https://dev.twitter.com/docs/platform-objects

Page 48: The Art of Social Media Analysis with Twitter & Python-OSCON 2012

Tweets
•  A.k.a Status Updates
•  Interesting fields
o  coordinates <- geo location
o  created_at
o  entities (will see later)
o  id, id_str
o  possibly_sensitive
o  user (will see later)
•  perspectival attributes embedded within a child object of an unlike parent – hard to maintain at scale
•  https://dev.twitter.com/docs/faq#6981
o  withheld_in_countries
•  https://dev.twitter.com/blog/new-withheld-content-fields-api-responses
https://dev.twitter.com/docs/platform-objects/tweets

Page 49: The Art of Social Media Analysis with Twitter & Python-OSCON 2012

A word about id, id_str
•  June 1, 2010
o  Snowflake, the id generator service
o  "The full ID is composed of a timestamp, a worker number, and a sequence number"
o  Had problems with JavaScript handling numbers > 53 bits
o  "id": 819797
o  "id_str": "819797"
http://engineering.twitter.com/2010/06/announcing-snowflake.html
https://groups.google.com/forum/?fromgroups#!topic/twitter-development-talk/ahbvo3VTIYI
https://dev.twitter.com/docs/twitter-ids-json-and-snowflake

Page 50: The Art of Social Media Analysis with Twitter & Python-OSCON 2012

Tweets  -­‐‑  example •  Let  us  run  oscon2012-­‐tweets.py  •  Example  of  tweet  

o  coordinates  o  id    o  id_str  

Page 51: The Art of Social Media Analysis with Twitter & Python-OSCON 2012

Users •  followers_count  •  geo_enabled  •  Id,  Id_str  •  name,  screen_name  •  Protected  •  status,  statuses_count  •  withheld_in_countries  

https://dev.twitter.com/docs/platform-objects/users

Page 52: The Art of Social Media Analysis with Twitter & Python-OSCON 2012

Users  –  Let  us  run  some  examples •  Run    

o  oscon_2012_users.py  •  Lookup  users  by  screen_name  

o  oscon12_first_20_ids.py  •  Lookup  users  by  user_id  

•  Inspect the results
o  id, name, status, statuses_count, protected, followers (for top 10 followers), withheld users

•  Can  use  information  for  customizing  the  user’s  screen  in  your  web  app  

Page 53: The Art of Social Media Analysis with Twitter & Python-OSCON 2012

Entities
•  Metadata & Contextual Information
•  You could parse these out of the tweet text yourself, but Entities give them to you already parsed out as structured data
•  REST API/Search API – include_entities=1
•  Streaming API – included by default
•  hashtags, media, urls, user_mentions
https://dev.twitter.com/docs/platform-objects/entities
https://dev.twitter.com/docs/tweet-entities
https://dev.twitter.com/docs/tco-url-wrapper
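A small sketch of pulling entities out of a REST call with include_entities=1; the screen_name and count are arbitrary examples, not from the tutorial scripts:

import json
import requests

url = 'https://api.twitter.com/1/statuses/user_timeline.json'
r = requests.get(url, params={'screen_name': 'oscon', 'count': 5, 'include_entities': 1})
for tweet in json.loads(r.text):
    entities = tweet.get('entities', {})
    print [h['text'] for h in entities.get('hashtags', [])]       # hashtag texts
    print [u['expanded_url'] for u in entities.get('urls', [])]   # t.co links, unwrapped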

Page 54: The Art of Social Media Analysis with Twitter & Python-OSCON 2012

Entities •  Run    

o  oscon2012_entities.py  

•  Inspect  hashtags,  urls  et  al    

Page 55: The Art of Social Media Analysis with Twitter & Python-OSCON 2012

Places
•  attributes
•  bounding_box
•  id (as a string!)
•  country
•  name
https://dev.twitter.com/docs/platform-objects/places
https://dev.twitter.com/docs/about-geo-place-attributes

Page 56: The Art of Social Media Analysis with Twitter & Python-OSCON 2012

Places
•  Can search for tweets near a place like so:
•  Get the lat/long of the convention center [45.52929, -122.66289]
o  Tweets near that place
•  Tweets near San Jose [37.395715, -122.102308]
•  We will not go further here, but it is very useful

Page 57: The Art of Social Media Analysis with Twitter & Python-OSCON 2012

Timelines
•  Collections of tweets ordered by time
•  Use max_id & since_id for navigation
https://dev.twitter.com/docs/working-with-timelines
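A sketch of the max_id style of paging described at the link above; the generator below is mine, not one of the tutorial scripts:

import json
import requests

def walk_timeline(screen_name, pages=3):
    url = 'https://api.twitter.com/1/statuses/user_timeline.json'
    params = {'screen_name': screen_name, 'count': 200}
    for _ in range(pages):
        tweets = json.loads(requests.get(url, params=params).text)
        if not tweets:
            break
        for t in tweets:
            yield t
        # page backwards in time: ask only for tweets older than the oldest one seen
        params['max_id'] = tweets[-1]['id'] - 1

Going forward after a break works the other way round: remember the newest id you already have and pass it as since_id.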

Page 58: The Art of Social Media Analysis with Twitter & Python-OSCON 2012

Other  Objects  &  APIs •  Lists  •  Notifications  •  Friendships/exists  to  see  if  one  follows  the  other  

Page 59: The Art of Social Media Analysis with Twitter & Python-OSCON 2012

Twitter Platform Objects (diagram)
•  Users – have Friends & Followers (Follow / Are Followed By)
•  Tweets (Status Updates) – embed Entities; collected into a TimeLine (Temporally Ordered)
•  Entities – hashtags (#), media, urls, user_mentions (@)
•  Places
https://dev.twitter.com/docs/platform-objects

Page 60: The Art of Social Media Analysis with Twitter & Python-OSCON 2012

Hands-on Exercise (15 min)
•  Setup environment – slide #14
•  Sanity Check Environment & Libraries

o  oscon2012_open_this_first.py  o  oscon2012_rate_limit_status.py  

•  Get  objects  (show  calls)  o  Lookup  users  by  screen_name    -­‐  oscon12_users.py  o  Lookup  users  by  id  -­‐  oscon12_first_20_ids.py  o  Lookup  tweets  -­‐  oscon12_tweets.py  o  Get  entities  -­‐  oscon12_entities.py  

•  Inspect  the  results  •  Explore  a  little  bit  •  Discussion  

Page 61: The Art of Social Media Analysis with Twitter & Python-OSCON 2012

Twitter APIs

Page 62: The Art of Social Media Analysis with Twitter & Python-OSCON 2012

Twitter API (diagram)
•  Twitter REST – Core Data, Core Twitter Objects; Build Profile, Create/Post Tweets, Reply, Favorite, Re-tweet; Rate Limit: 150/350
•  Twitter Search – Search & Trend; Keywords, Specific User, Trends; Rate Limit: Complexity & Frequency
•  Streaming – Near-realtime, High Volume; Public Streams, User Streams, Site Streams; Follow users, topics, data mining
•  Firehose

Page 63: The Art of Social Media Analysis with Twitter & Python-OSCON 2012

Twitter REST API
•  https://dev.twitter.com/docs/api
•  What we have been doing so far is the REST API
•  Request-Response
•  Anonymous or OAuth
•  Rate Limited:
o  150/350

Page 64: The Art of Social Media Analysis with Twitter & Python-OSCON 2012

Twitter Trends
•  oscon2012-trends.py
•  Trends/weekly, Trends/monthly
•  Let us run some examples
o  oscon2012_trends_daily.py
o  oscon2012_trends_weekly.py
•  Trends & hashtags
o  #hashtag euro2012
o  http://hashtags.org/euro2012
o  http://sproutsocial.com/insights/2011/08/twitter-hashtags/
o  http://blog.twitter.com/2012/06/euro-2012-follow-all-action-on-pitch.html
o  Top 10: http://twittercounter.com/pages/100, http://twitaholic.com/

Page 65: The Art of Social Media Analysis with Twitter & Python-OSCON 2012

Brand Rank w/ Twitter
•  Walk through & results of the following
o  oscon2012_brand_01.py
•  Followed 10 user-brands for a few days to find growth
•  Brand Rank
o  Growth of a brand w.r.t. the industry
o  Surge in popularity – could be due to -ve or +ve buzz. Need to understand & correlate using Twitter APIs & metrics
•  API: url='https://api.twitter.com/1/users/lookup.json'
•  payload={"screen_name":"miamiheat,okcthunder,nba,uefacom,lovelaliga,FOXSoccer,oscon,clouderati,googleio,OReillyMedia"}
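A sketch of the kind of daily snapshot oscon2012_brand_01.py presumably takes; the growth calculation at the end is the point, since absolute follower counts differ by orders of magnitude across brands:

import json
import requests

brands = ("miamiheat,okcthunder,nba,uefacom,lovelaliga,"
          "FOXSoccer,oscon,clouderati,googleio,OReillyMedia")
r = requests.get('https://api.twitter.com/1/users/lookup.json',
                 params={'screen_name': brands})
today = dict((u['screen_name'], u['followers_count']) for u in json.loads(r.text))

# 'yesterday' would come from wherever the previous snapshot was stored (e.g. MongoDB);
# comparing as a percentage keeps small and huge brands on the same scale:
# growth_pct = 100.0 * (today[b] - yesterday[b]) / yesterday[b]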

Page 66: The Art of Social Media Analysis with Twitter & Python-OSCON 2012

Brand Rank w/ Twitter
Clouderati is very stable

Page 67: The Art of Social Media Analysis with Twitter & Python-OSCON 2012

•  Google I/O showed a spike on 6/27 - 6/28
•  OReillyMedia shares some spike
•  Looking at a few days' worth of data, our best inference is that "oscon doesn't track with googleio"
•  "Clouderati doesn't track at all"

Brand Rank w/ Twitter – Tech Brands

Page 68: The Art of Social Media Analysis with Twitter & Python-OSCON 2012

•  FOXSoccer & UEFAcom track each other

Brand Rank w/ Twitter – World of Soccer

The numbers seldom decrease, so calculating -ve velocity will not work. OTOH, if you see a -ve velocity, investigate.

Page 69: The Art of Social Media Analysis with Twitter & Python-OSCON 2012

•  NBA, MiamiHeat & okcthunder track each other
•  Used % rather than absolute numbers to compare
•  The hike from 7/6 to 7/10 is interesting.

Brand Rank w/ Twitter – World of Basketball

Page 70: The Art of Social Media Analysis with Twitter & Python-OSCON 2012

•  For some reason, all the numbers are going up 7/6 thru 7/10 – except for clouderati!
•  Is a rising (Twitter) tide lifting all (well, almost all)?

Brand Rank w/ Twitter – Rising Tide …

Page 71: The Art of Social Media Analysis with Twitter & Python-OSCON 2012

Trivia  :  Search  API •  Search(search.twitter.com)  

o  Built  by  Summize  which  was  acquired  by  Twitter  in  2008  

o  Summize  described  itself  as  “sentiment  mining”  

Page 72: The Art of Social Media Analysis with Twitter & Python-OSCON 2012

Search API
•  Very simple
o  GET http://search.twitter.com/search.json?q=<blah>
•  Based on a search criteria
•  "The Twitter Search API is a dedicated API for running searches against the real-time index of recent Tweets"
•  Recent = Last 6-9 days' worth of tweets
•  Anonymous Call
•  Rate Limit
o  Not No. of calls/hour, but Complexity & Frequency
https://dev.twitter.com/docs/using-search
https://dev.twitter.com/docs/api/1/get/search

Page 73: The Art of Social Media Analysis with Twitter & Python-OSCON 2012

Search  API •  Filters  

o  Search  URL  encoded  o  @  =  %40,  #=%23  o   emoticons    :)  and  :(,  o  http://search.twitter.com/search.atom?q=sometimes+%3A)  o  http://search.twitter.com/search.atom?q=sometimes+%3A(  

•  Location  Filters,  date  filters  •  Content  searches  
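A tiny example of the Search API with requests (requests handles the URL encoding mentioned above); the query is an arbitrary illustration:

import json
import requests

r = requests.get('http://search.twitter.com/search.json',
                 params={'q': '#oscon :)', 'rpp': 20})   # hashtag + happy-emoticon filter
for result in json.loads(r.text)['results']:
    print result['from_user'], '-', result['text']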

Page 74: The Art of Social Media Analysis with Twitter & Python-OSCON 2012

Streaming API
•  Not request-response, but a stream
•  Twitter frameworks have the support
•  Rate Limit: up to 1%
•  Stall warning if the client is falling behind
•  Good Documentation Links
o  https://dev.twitter.com/docs/streaming-apis/connecting
o  https://dev.twitter.com/docs/streaming-apis/parameters
o  https://dev.twitter.com/docs/streaming-apis/processing

Page 75: The Art of Social Media Analysis with Twitter & Python-OSCON 2012

Firehose •  ~  400  million  public  tweets/day  •  If  you  are  working  with  Twitter  firehose,  I  envy  you  !  

•  If  you  hit  real  limits,  then  explore  the  firehose  route  •  AFAIK,  it  is  not  cheap,  but  worth  it  

Page 76: The Art of Social Media Analysis with Twitter & Python-OSCON 2012

API Best Practices
1.  Use JSON
2.  Use user_id rather than screen_name
o  user_id is constant while screen_name can change
3.  max_id and since_id
o  For example with direct messages, if you have the last message use since_id for the search
o  max_id controls how far to go back
4.  Cache as much as you can
5.  Set the User-Agent header for debugging
I have listed a few good blogs that have API best practices in the reference section, at the end of this presentation

These are gathered from various books, blogs & other media, I used for this tutorial. See Reference(at the end) for the sources

Page 77: The Art of Social Media Analysis with Twitter & Python-OSCON 2012

Twitter API (diagram)
•  Twitter REST – Core Data, Core Twitter Objects; Build Profile, Create/Post Tweets, Reply, Favorite, Re-tweet; Rate Limit: 150/350
•  Twitter Search – Search & Trend; Keywords, Specific User, Trends; Rate Limit: Complexity & Frequency
•  Streaming – Near-realtime, High Volume; Public Streams, User Streams, Site Streams; Follow users, topics, data mining
•  Firehose

Questions  ?  

Page 78: The Art of Social Media Analysis with Twitter & Python-OSCON 2012

Part II

SNA

Part II Twitter Network Analysis

Page 79: The Art of Social Media Analysis with Twitter & Python-OSCON 2012

[Pipeline diagram]
1.  Collect
2.  Store
3.  Transform & Analyze
4.  Model & Reason
5.  Predict, Recommend & Visualize
Validate Dataset & re-crawl/refresh

Tip 1: Implement a staged pipeline, never a monolith
Tip 3: Keep the schema simple; don't be afraid to transform

Most important & the ugliest slide in this deck!

Page 80: The Art of Social Media Analysis with Twitter & Python-OSCON 2012

Trivia
•  Social Network Analysis originated as Sociometry & the social network was called a sociogram
•  Back then, Facebook was called SocioBinder!
•  Jacob Levy Moreno is considered the originator
o  NYTimes, April 3, 1933, P. 17

Page 81: The Art of Social Media Analysis with Twitter & Python-OSCON 2012

Twitter Networks - Definitions
•  Nodes

o  Users  o  #tags  

•  Edges  o  Follows  o  Friends  o  @mentions  o  #tags  

•  Directed  

Page 82: The Art of Social Media Analysis with Twitter & Python-OSCON 2012

Twitter Networks - Definitions
•  In-degree

o  Followers  

•  Out-­‐Degree  o  Friends/Follow  

•  Centrality  Measures  •  Hubs  &  Authorities  

o  Hubs/Directories  tell  us  where  Authorities  are  

o  “Of  Mortals  &  Celebrities”  is  more  “Twitter-­‐style”  

Page 83: The Art of Social Media Analysis with Twitter & Python-OSCON 2012

Twitter Networks - Properties
•  Concepts From Citation Networks
o  Cocitation
•  Common papers that cite a paper
•  Common Followers
o  C & G (Followed by F & H)
o  Bibliographic Coupling
•  Cite the same papers
•  Common Friends (i.e. follow the same person)
o  D, E, F & H
[Diagram: example follower graph with nodes A–N]

Page 84: The Art of Social Media Analysis with Twitter & Python-OSCON 2012

Twitter Networks - Properties
•  Concepts From Citation Networks
o  Cocitation
•  Common papers that cite a paper
•  Common Followers
o  C & G (Followed by F & H)
o  Bibliographic Coupling
•  Cite the same papers
•  Common Friends (i.e. follow the same person)
o  D, E, F & H follow C
o  H & F follow C & G
•  So H & F have high coupling
•  Hence, if H follows A, we can recommend F to follow A
[Diagram: example follower graph with nodes A–N]

Page 85: The Art of Social Media Analysis with Twitter & Python-OSCON 2012

Twitter Networks - Properties
•  Bipartite/Affiliation Networks
o  Two disjoint subsets
o  The bipartite concept is very relevant to the Twitter social graph
o  Membership in Lists
•  lists vs. users bipartite graph
o  Common #Tags in Tweets
•  #tags vs. members bipartite graph
o  @mention together
•  ? Can this be a bipartite graph
•  ? How would we fold this ?
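A sketch of the #tag-vs-user bipartite idea with networkx; the toy edges are invented, and the "fold" is done with the bipartite projection helper:

import networkx as nx
from networkx.algorithms import bipartite

B = nx.Graph()
users = ['alice', 'bob', 'carol']                    # toy data, not from the crawl
tags = ['#cloud', '#bigdata']
B.add_nodes_from(users, bipartite=0)
B.add_nodes_from(tags, bipartite=1)
B.add_edges_from([('alice', '#cloud'), ('bob', '#cloud'),
                  ('bob', '#bigdata'), ('carol', '#bigdata')])

# Fold onto the user side: two users are connected if they share a #tag
folded = bipartite.weighted_projected_graph(B, users)
print folded.edges(data=True)

The same projection works for lists vs. users or co-@mention graphs; the edge weight counts how many affiliations the pair shares.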

Page 86: The Art of Social Media Analysis with Twitter & Python-OSCON 2012

Other Metrics & Mechanisms
•  Kronecker Graph Models
o  The Kronecker product is a way of generating self-similar matrices
o  Prof. Leskovec et al define the Kronecker product of two graphs as the Kronecker product of their adjacency matrices
o  Application: generating models for analysis, prediction, anomaly detection et al
•  Erdos-Renyi Random Graphs
o  Easy to build a Gn,p graph
o  Assumes equal likelihood of edges between two nodes
o  In a Twitter social network, we can create a more realistic expected distribution (adding the "social reality" dimension) by inspecting the #tags & @mentions
•  Network Diameter
•  Weak Ties
•  Follower velocity (+ve & -ve), Association strength
o  Unfollow is not a reliable measure
o  But an interesting property to investigate when it happens
Not covered here, but potential for an encore!
Ref: Jure Leskovec: Kronecker Graphs, Random Graphs

Page 87: The Art of Social Media Analysis with Twitter & Python-OSCON 2012

Twitter Networks - Properties
•  Twitter != LinkedIn, Twitter != Facebook
•  Twitter Network == Interest Network
•  Be cognizant of the above when you apply traditional network properties to Twitter
•  For example,
o  Six degrees of separation doesn't make sense (most of the time) in Twitter – except maybe for Cliques
o  Is diameter a reliable measure for a Twitter Network? Probably not
o  Do cut sets make sense? Probably not
o  But citation network principles do apply; we can learn from cliques
o  Bipartite graphs do make sense

Page 88: The Art of Social Media Analysis with Twitter & Python-OSCON 2012

Cliques (1 of 2)
•  "Maximal subset of the vertices in an undirected network such that every member of the set is connected by an edge to every other"
•  Cohesive subgroup, closely connected
•  Near-cliques rather than a perfect clique (k-plex, i.e. each member connected to at least n-k others)
•  k-plex cliques to discover sub groups in a sparse network; a 1-plex being the perfect clique
Ref: Networks, An Introduction - Newman

Page 89: The Art of Social Media Analysis with Twitter & Python-OSCON 2012

Cliques (2 of 2)
•  k-core – at least k others in the subset; an (n-k)-plex
•  k-clique – no more than k distance away
o  Path inside or outside the subset
o  k-clan or k-club (path inside the subset)
•  We will apply k-plex Cliques for one of our hands-on exercises
Ref: Networks, An Introduction - Newman
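A sketch of what the clique hunt looks like with networkx on a toy undirected graph; find_cliques returns strict maximal cliques, so a k-plex relaxation would have to be layered on top of it:

import networkx as nx

G = nx.Graph()                                   # cliques are defined on undirected networks
G.add_edges_from([('A', 'B'), ('B', 'C'), ('A', 'C'),   # a 3-clique
                  ('C', 'D'), ('D', 'E')])

for clique in nx.find_cliques(G):                # maximal cliques
    if len(clique) >= 3:
        print clique

print nx.k_core(G, k=2).nodes()                  # k-core: keep nodes with at least k neighbours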

Page 90: The Art of Social Media Analysis with Twitter & Python-OSCON 2012

Sentiment Analysis
•  Sentiment Analysis is an important & interesting area of work on the Twitter platform
o  Collect Tweets
o  Opinion Estimation - pass through a Classifier, Sentiment Lexicons
•  Naïve Bayes/Max Entropy Classifier/SVM
o  Aggregated Text Sentiment/Moving Average
•  I chose not to dive deeper because of time constraints
o  Couldn't do justice to API, Social Network and Sentiment Analysis, all in 3 hrs
•  The next 3 slides have a couple of interesting examples

 

Page 91: The Art of Social Media Analysis with Twitter & Python-OSCON 2012

Sentiment Analysis
•  Twitter Mining for Airline Sentiment
•  Opinion Lexicon - +ve 2000, -ve 4800
http://www.inside-r.org/howto/mining-twitter-airline-consumer-sentiment
http://sentiment.christopherpotts.net/lexicons.html#opinionlexicon

Page 92: The Art of Social Media Analysis with Twitter & Python-OSCON 2012

Need  I  say  more  ?

http://www.economist.com/blogs/schumpeter/2012/06/tracking-social-media?fsrc=scn/gp/wl/bl/moodofthemarket
http://www.relevantdata.com/pdfs/IUStudy.pdf

"A bit of clever math can uncover interesting patterns that are not visible to the human eye"

Page 93: The Art of Social Media Analysis with Twitter & Python-OSCON 2012
Page 94: The Art of Social Media Analysis with Twitter & Python-OSCON 2012

Project  Ideas  

Page 95: The Art of Social Media Analysis with Twitter & Python-OSCON 2012

Interesting Vectors of Exploration
1.  Find trending #tags & then related #tags – using cliques over co-#tag-citation, which infers topics related to trending topics
2.  Related #tag topics over a set of tweets by a user or group of users
3.  Analysis of In/Out flow, Tweet Flow – frequent @mentions
4.  Find affiliation networks by List memberships, #tags or frequent @mentions

Page 96: The Art of Social Media Analysis with Twitter & Python-OSCON 2012

Interesting Vectors of Exploration
5.  Use centrality measures to determine mortals vs. celebrities
6.  Classify Tweet networks/cliques based on message passing characteristics – Tweets vs. Retweets, No. of retweets, …
7.  Retweet Network – measure influence by retweet count & frequency – Information contagion by looking at different retweet network subcomponents – who, when, how much, …

Page 97: The Art of Social Media Analysis with Twitter & Python-OSCON 2012

Twitter Network Graph Analysis

An  Example  

Page 98: The Art of Social Media Analysis with Twitter & Python-OSCON 2012

Analysis Story Board
•  @clouderati is a popular cloud-related Twitter account
•  Goals:
o  Analyze the social graph characteristics of the users who are following the account
•  Dig one level deep, to the followers & friends of the followers of @clouderati
o  How many cliques? How strong are they?
o  Does the @mention network support the clique inferences?
o  What are the retweet characteristics?
o  What does the #tag network graph look like?
In this tutorial
For you to explore !!

Page 99: The Art of Social Media Analysis with Twitter & Python-OSCON 2012

Twitter Analysis Pipeline Story Board – Stages, Strategies, APIs & Tasks
Stage 3
o  Get distinct user list applying the set(union(list)) operation
Stage 4
o  Get & Store User details (distinct user list)
o  Unroll
Stage 5
o  For each @clouderati follower
o  Find friend=follower - set intersection
Stage 6
o  Create social graph
o  Apply network theory
o  Infer cliques & other properties
Note: Needed a command buffer to manage scale (~980,000 users)
Note: Unroll stage took time & missteps

Page 100: The Art of Social Media Analysis with Twitter & Python-OSCON 2012

@clouderati Twitter Social Graph
•  Stats (Retrospect after the runs):
o  Stage 1
•  @clouderati has 2072 followers
o  Stage 2
•  Limiting followers to 5,000 per user
o  Stage 3
•  Digging to the 1st level (set union of followers & friends of the followers of @clouderati) explodes into ~980,000 distinct users
o  MongoDB of the cache and intermediate datasets ~10 GB
o  The database was hosted at AWS (Hi Mem XLarge – m2.xlarge), 8 X 15 GB, RAID 10, opened to the Internet with DB authentication

Page 101: The Art of Social Media Analysis with Twitter & Python-OSCON 2012

Code & Run Walk Through
o  Code:
§  oscon_2012_user_list_spider_01.py
o  Challenges:
§  Nothing fancy
§  Get the record and store
§  Would have had to recurse through a REST cursor if there were more than 5000 followers
§  @clouderati has 2072 followers
o  Interesting Points:
Stage 1
o  Get @clouderati Followers
o  Store in MongoDB

Page 102: The Art of Social Media Analysis with Twitter & Python-OSCON 2012

Code & Run Walk Through
o  Code:
§  oscon_2012_user_list_spider_02.py
§  oscon_2012_twitter_utils.py
§  oscon_2012_mongo.py
§  oscon_2012_validate_dataset.py
o  Challenges:
§  Multiple runs, errors et al!
o  Interesting Points:
§  Set operation between two mongo collections for the restart buffer
§  Protected users, some had 0 followers, or 0 friends
§  Interesting operations for validate, re-crawl and refresh
§  Added "status_code" to differentiate protected users
§  {'$set': {'status_code': '401 Unauthorized,401 Unauthorized'}}
§  Getting friends & followers of 2000 users is the hardest (or so I thought, until I got through the next stage!)
Stage 2
o  Crawl 1 level deep
o  Get friends & followers
o  Validate, re-crawl & refresh

Page 103: The Art of Social Media Analysis with Twitter & Python-OSCON 2012

Validate-­‐‑Recrawl-­‐‑Refresh  Logs •  pymongo  version  =    2.2  •  Connected  to  DB!  •  …  •  2075  •  Error  Friends  :    <type  'exceptions.KeyError'>  •  4ff3cd40e5557c00c7000000  -­‐  none  has  2072  followers  &  0  friends  •  Error  Friends  :    <type  'exceptions.KeyError'>  •  4ff3a958e5557cfc58000000  -­‐  none  has  2072  followers  &  0  friends  •  Error  Friends  :    <type  'exceptions.KeyError'>  •  4ff3ccdee5557c00b6000000  -­‐  none  has  2072  followers  &  0  friends  •  4ff3d3b9e5557c01b900001e  -­‐  371187804  has  0  followers  &  0  friends  •  4ff3d3d8e5557c01b9000048  -­‐  63488295  has  155  followers  &  0  friends  •  4ff3d3d9e5557c01b9000049  -­‐  342712617  has  0  followers  &  0  friends  •  4ff3d3d9e5557c01b900004a  -­‐  21266738  has  0  followers  &  0  friends  •  4ff3d3dae5557c01b900004b  -­‐  204652853  has  0  followers  &  0  friends  •  …  •  4ff475cfe5557c1657000074  -­‐  258944989  has  0  followers  &  0  friends  •  4ff475d3e5557c165700007d  -­‐  327286780  has  0  followers  &  0  friends  •  Looks  like  we  have  132  not  so  good  records  •  Elapsed  Time  =  0.546846  

o  1st run – 132 bad records
o  This is the classic Erlang-style supervisor
o  The crawl continues on transport errors without worrying about retry
o  Validate will re-crawl & refresh as needed

Page 104: The Art of Social Media Analysis with Twitter & Python-OSCON 2012

Code & Run Walk Through

o  Code:
   §  oscon2012_analytics_01.py
o  Challenges:
   §  Figure out the right set operations
o  Interesting Points:
   §  973,323 unique users !
   §  Recursively apply set union over 400,00 lists
   §  Set operations took slightly more than a minute

Stage 3
o  Get the distinct user list by applying the set(union(list)) operation
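The set(union(list)) step boils down to folding every cached follower/friend list into one Python set. A minimal sketch, with illustrative collection and field names:

```python
# Stage 3 sketch: union all cached lists to get the distinct users one level down.
from pymongo import MongoClient

db = MongoClient()["oscon2012"]
distinct_users = set()
for doc in db.t_users.find({}, {"followers": 1, "friends": 1}):
    distinct_users |= set(doc.get("followers", []))
    distinct_users |= set(doc.get("friends", []))
print(len(distinct_users))               # the talk's run ended up with 973,323
```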

Page 105: The Art of Social Media Analysis with Twitter & Python-OSCON 2012

Code & Run Walk Through

o  Code:
   §  oscon2012_analytics_01.py (focus on cmd string creation)
   §  oscon2012_get_user_info_01.py
   §  oscon2012_unroll_user_list_01.py
   §  oscon2012_unroll_user_list_02.py
o  Challenges:
   §  Where do I start ? (in the next few slides)
   §  Took me a few days to get it right (along with my day job!)
   §  Unfortunately I did not employ parallelism & didn't use my MacPro with 32 GB memory, so the runs were long
   §  But learned hard lessons on checkpoint & restart
o  Interesting Points:
   §  Tracking control numbers
   §  Time … a marathon unroll run of 19:33:33 !

Stage 4
o  Get & store user details (distinct user list)
o  Unroll
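The unroll step flattens the array-form lists into one document per edge so the later stages can aggregate per user. A sketch of the idea only: the collection t_edges and the document shape are illustrative, and as the deck notes, the real unroll run took roughly 19.5 hours.

```python
# Unroll sketch: one document per (user, other, relation) edge.
from pymongo import MongoClient

db = MongoClient()["oscon2012"]
for doc in db.t_users.find({}, {"user_id": 1, "followers": 1, "friends": 1}):
    edges = [{"user_id": doc["user_id"], "other": uid, "rel": rel}
             for rel in ("followers", "friends")
             for uid in doc.get(rel, [])]
    if edges:
        db.t_edges.insert_many(edges)    # one user per iteration keeps restarts cheap
```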

Page 106: The Art of Social Media Analysis with Twitter & Python-OSCON 2012

Twitter @ Scale Pattern

•  Challenge:
   o  You want to get screen names, follower counts and other details for a million users
•  Problem:
   o  No easy REST API
   o  https://api.twitter.com/1/users/lookup.json will take 100 user_ids and give details
•  Solution:
   o  This is a scalability challenge. Approach it like so:
   o  Create a command buffer collection in MongoDB, splitting the million user_ids into batches of 100
   o  Have a "done" flag initialized to 0 for checkpoint & restart
   o  After each cmd str is executed, set "done":1
   o  For subsequent runs, ignore "done":1
   o  This also helps in control number tracking
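A sketch of the command-buffer pattern just described, assuming modern pymongo and the v1 users/lookup endpoint. The collection name api_str matches the logs on the following slides; the field and helper names are otherwise illustrative.

```python
# Command-buffer sketch: batch the ids, store pre-built calls with a "done"
# flag, and run the buffer with checkpoint & restart.
import requests
from pymongo import MongoClient

db = MongoClient()["oscon2012"]
LOOKUP = "https://api.twitter.com/1/users/lookup.json?user_id="

def build_command_buffer(user_ids, batch=100):
    for seq_no, i in enumerate(range(0, len(user_ids), batch)):
        ids = ",".join(str(u) for u in user_ids[i:i + batch])
        db.api_str.insert_one({"seq_no": seq_no, "api_str": LOOKUP + ids, "done": 0})

def run_command_buffer():
    # list() snapshots the cursor, sidestepping invalid-cursor surprises mid-run
    for cmd in list(db.api_str.find({"done": 0, "api_str": {"$ne": ""}})):
        resp = requests.get(cmd["api_str"])
        if resp.status_code != 200:      # leave done=0 so a later run retries it
            continue
        db.t_users_info.insert_many(resp.json())
        db.api_str.update_one({"_id": cmd["_id"]}, {"$set": {"done": 1}})
```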

Page 107: The Art of Social Media Analysis with Twitter & Python-OSCON 2012

Control  numbers

Page 108: The Art of Social Media Analysis with Twitter & Python-OSCON 2012

Control Numbers

•  > db.t_users_info.count()
•  8122
•  > db.api_str.count({"done":0,"seq_no":{"$lt":8185}},{"seq_no":1})
•  63
•  > db.api_str.find({"done":0,"seq_no":{"$lt":8185}},{"seq_no":1})
•  { "_id" : ObjectId("4ff4daeae5557c28bf001d53"), "seq_no" : 5433 }
•  { "_id" : ObjectId("4ff4daeae5557c28bf001d59"), "seq_no" : 5439 }
•  { "_id" : ObjectId("4ff4daeae5557c28bf001d5f"), "seq_no" : 5445 }
•  { "_id" : ObjectId("4ff4daebe5557c28bf001d74"), "seq_no" : 5466 }
•  { "_id" : ObjectId("4ff4daece5557c28bf001d7a"), "seq_no" : 5472 }
•  { "_id" : ObjectId("4ff4daece5557c28bf001d80"), "seq_no" : 5478 }
•  { "_id" : ObjectId("4ff4daede5557c28bf001d90"), "seq_no" : 5494 }
•  { "_id" : ObjectId("4ff4daefe5557c28bf001daf"), "seq_no" : 5525 }
•  { "_id" : ObjectId("4ff4daf0e5557c28bf001dba"), "seq_no" : 5536 }
•  { "_id" : ObjectId("4ff4daf1e5557c28bf001dcf"), "seq_no" : 5557 }
•  { "_id" : ObjectId("4ff4daf2e5557c28bf001de9"), "seq_no" : 5583 }
•  { "_id" : ObjectId("4ff4daf2e5557c28bf001def"), "seq_no" : 5589 }
•  { "_id" : ObjectId("4ff4daf4e5557c28bf001e0e"), "seq_no" : 5620 }
•  { "_id" : ObjectId("4ff4daf4e5557c28bf001e14"), "seq_no" : 5626 }
•  { "_id" : ObjectId("4ff4daf6e5557c28bf001e2e"), "seq_no" : 5652 }
•  { "_id" : ObjectId("4ff4daf6e5557c28bf001e39"), "seq_no" : 5663 }
•  { "_id" : ObjectId("4ff4daf8e5557c28bf001e62"), "seq_no" : 5704 }
•  { "_id" : ObjectId("4ff4dafae5557c28bf001e77"), "seq_no" : 5725 }
•  { "_id" : ObjectId("4ff4dafae5557c28bf001e81"), "seq_no" : 5735 }
•  { "_id" : ObjectId("4ff4dawe5557c28bf001e9b"), "seq_no" : 5761 }
•  Type "it" for more
•  > it
•  { "_id" : ObjectId("4ff4dafce5557c28bf001ea6"), "seq_no" : 5772 }
•  { "_id" : ObjectId("4ff4dafce5557c28bf001eac"), "seq_no" : 5778 }
•  { "_id" : ObjectId("4ff4dafde5557c28bf001eb7"), "seq_no" : 5789 }
•  { "_id" : ObjectId("4ff4dafde5557c28bf001ebd"), "seq_no" : 5795 }
•  { "_id" : ObjectId("4ff4dafee5557c28bf001ec8"), "seq_no" : 5806 }
•  { "_id" : ObjectId("4ff4daffe5557c28bf001ed8"), "seq_no" : 5822 }
•  { "_id" : ObjectId("4ff4db00e5557c28bf001eed"), "seq_no" : 5843 }
•  { "_id" : ObjectId("4ff4db00e5557c28bf001ef3"), "seq_no" : 5849 }
•  { "_id" : ObjectId("4ff4db01e5557c28bf001efe"), "seq_no" : 5860 }
•  { "_id" : ObjectId("4ff4db01e5557c28bf001f09"), "seq_no" : 5871 }
•  { "_id" : ObjectId("4ff4db03e5557c28bf001f23"), "seq_no" : 5897 }
•  { "_id" : ObjectId("4ff4db05e5557c28bf001f47"), "seq_no" : 5933 }
•  { "_id" : ObjectId("4ff4db05e5557c28bf001f52"), "seq_no" : 5944 }
•  { "_id" : ObjectId("4ff4db06e5557c28bf001f58"), "seq_no" : 5950 }
•  { "_id" : ObjectId("4ff4db06e5557c28bf001f5e"), "seq_no" : 5956 }
•  { "_id" : ObjectId("4ff4db06e5557c28bf001f69"), "seq_no" : 5967 }
•  { "_id" : ObjectId("4ff4db07e5557c28bf001f74"), "seq_no" : 5978 }
•  { "_id" : ObjectId("4ff4db07e5557c28bf001f7f"), "seq_no" : 5989 }
•  { "_id" : ObjectId("4ff4db0ae5557c28bf001fa8"), "seq_no" : 6030 }
•  { "_id" : ObjectId("4ff4db0ae5557c28bf001fae"), "seq_no" : 6036 }
•  Type "it" for more
•  > it
•  { "_id" : ObjectId("4ff4db0ae5557c28bf001w9"), "seq_no" : 6047 }
•  { "_id" : ObjectId("4ff4db0be5557c28bf001fc4"), "seq_no" : 6058 }
•  { "_id" : ObjectId("4ff4db0be5557c28bf001fca"), "seq_no" : 6064 }
•  { "_id" : ObjectId("4ff4db0de5557c28bf001fe0"), "seq_no" : 6086 }
•  { "_id" : ObjectId("4ff4db0de5557c28bf001fe6"), "seq_no" : 6092 }
•  { "_id" : ObjectId("4ff4db0de5557c28bf001fec"), "seq_no" : 6098 }
•  { "_id" : ObjectId("4ff4db0ee5557c28bf002006"), "seq_no" : 6124 }
•  { "_id" : ObjectId("4ff4db10e5557c28bf002025"), "seq_no" : 6155 }
•  { "_id" : ObjectId("4ff4db12e5557c28bf002044"), "seq_no" : 6186 }
•  { "_id" : ObjectId("4ff4db12e5557c28bf00204a"), "seq_no" : 6192 }
•  { "_id" : ObjectId("4ff4db1ae5557c28bf0020e0"), "seq_no" : 6342 }
•  { "_id" : ObjectId("4ff4db1ae5557c28bf0020e1"), "seq_no" : 6343 }
•  { "_id" : ObjectId("4ff4db2ee5557c28bf002240"), "seq_no" : 6694 }
•  { "_id" : ObjectId("4ff4db34e5557c28bf0022b9"), "seq_no" : 6815 }
•  { "_id" : ObjectId("4ff4db41e5557c28bf00239f"), "seq_no" : 7045 }
•  { "_id" : ObjectId("4ff4db53e5557c28bf0024fe"), "seq_no" : 7396 }
•  { "_id" : ObjectId("4ff4db66e5557c28bf00265d"), "seq_no" : 7747 }
•  { "_id" : ObjectId("4ff4db68e5557c28bf002678"), "seq_no" : 7774 }
•  { "_id" : ObjectId("4ff4db6be5557c28bf0026af"), "seq_no" : 7829 }
•  >

The collection should have 8185 documents, but it has only 8122. Where did the rest go ?

63 of them still have done=0. 8122 + 63 = 8185 ! Aha, mystery solved: they fell through the cracks. Need a catch-all final run.

Page 109: The Art of Social Media Analysis with Twitter & Python-OSCON 2012

Day in the life of a Control Number Detective – Run #1

•  Remember : 973,323 users. So, 9734 cmd strings (100 users per string)
•  > db.api_str.count()
•  9831
•  > db.api_str.count({"done":0})
•  239
•  > db.t_users_info.count()
•  9592
•  > db.api_str.count({"api_str":""})
•  97
•  So we should have 9831 - 97 = 9734 records
•  The second run should generate 9734 - 9592 = 142 calls (i.e. 350 - 142 = 208 rate-limit should remain). Let us see.
•  {
•    …
•    "x-ratelimit-class": "api_identified",
•    "x-ratelimit-limit": "350",
•    "x-ratelimit-remaining": "209",
•    …
•  }
•  Yep, 209 left
•  >
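The rate-limit arithmetic above is read straight off the response headers. A small helper sketch, assuming a requests response object and the v1 header names shown on the slide:

```python
# Helper sketch: report how many calls the rate limit has left.
def rate_limit_left(resp):
    limit = int(resp.headers.get("x-ratelimit-limit", 0))
    remaining = int(resp.headers.get("x-ratelimit-remaining", 0))
    print("rate limit: %d of %d calls left" % (remaining, limit))
    return remaining
```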

Page 110: The Art of Social Media Analysis with Twitter & Python-OSCON 2012

Day in the life of a Control Number Detective – Run #2

•  Remember : 973,323 users. So, 9734 cmd strings (100 users per string)
•  > db.t_users_info.count()
•  9728
•  > db.api_str.count({"api_str":""})
•  97
•  > db.api_str.count({"done":0})
•  103
•  9734 - 9728 = 6, same as 103 - 97 !
•  Run once more !
•  > db.api_str.find({"done":0},{"seq_no":1})
•  …
•  { "_id" : ObjectId("4ff4dbd4e5557c28bf002e22"), "seq_no" : 9736 }
•  { "_id" : ObjectId("4ff4db05e5557c28bf001f47"), "seq_no" : 5933 }
•  { "_id" : ObjectId("4ff4db8be5557c28bf0028f6"), "seq_no" : 8412 }
•  { "_id" : ObjectId("4ff4dba2e5557c28bf002a8c"), "seq_no" : 8818 }
•  { "_id" : ObjectId("4ff4dbaee5557c28bf002b69"), "seq_no" : 9039 }
•  { "_id" : ObjectId("4ff4dbb8e5557c28bf002c1c"), "seq_no" : 9218 }
•  …
•  {
•    …
•    "x-ratelimit-limit": "350",
•    "x-ratelimit-remaining": "344",
•    …
•  }
•  Yep, 6 more records
•  > db.t_users_info.count()
•  9734
•  Good, got 9734 !

Professor Layton would be proud !

In fact, I have all four & plan to spend some time with them & Laphroaig !

Page 111: The Art of Social Media Analysis with Twitter & Python-OSCON 2012

Monitor  runs  &  track  control  numbers

Unroll  run  8:48  PM  to  ~4:08  PM  next  day  !  

Page 112: The Art of Social Media Analysis with Twitter & Python-OSCON 2012

Track errors & the document numbers

Page 113: The Art of Social Media Analysis with Twitter & Python-OSCON 2012

Code & Run Walk Through

o  Code:
   §  oscon2012_find_strong_ties_01.py
   §  oscon2012_social_graph_stats_01.py
o  Challenges:
   §  None. Python set operations made this easy
o  Interesting Points:
   §  Even at this scale, a single machine is not enough
   §  Should have tried data parallelism
      •  This task is well suited to leverage data parallelism as it is commutative & associative
   §  Was getting invalid cursor errors from MongoDB, so had to do the updates in two steps

Stage 5
o  For each @clouderati follower
o  Find friend = follower - set intersection
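Stage 5 is a per-user set intersection: a strong tie is anyone who appears in both the friends and the followers list. A minimal sketch with illustrative collection names; the original script wrote the results back in two passes because of the invalid-cursor issue noted above.

```python
# Stage 5 sketch: strong ties = friends ∩ followers, per user.
from pymongo import MongoClient

db = MongoClient()["oscon2012"]
for doc in list(db.t_users.find({}, {"user_id": 1, "followers": 1, "friends": 1})):
    strong = set(doc.get("followers", [])) & set(doc.get("friends", []))
    db.t_strong_ties.insert_one({"user_id": doc["user_id"],
                                 "strong_ties": sorted(strong),
                                 "count": len(strong)})
```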

Page 114: The Art of Social Media Analysis with Twitter & Python-OSCON 2012

Code & Run Walk Through

o  Code:
   §  oscon2012_find_cliques_01.py
o  Challenges:
   §  Lots of good information hidden in the data !
   §  Memory !
o  Interesting Points:
   §  Graph, list & set operations
   §  networkx has lots of interesting graph algorithms
   §  collections.Counter to the rescue

Stage 6
o  Create social graph
o  Apply network theory
o  Infer cliques & other properties
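A minimal sketch of Stage 6: build an undirected graph from the strong-tie edges, let networkx enumerate the maximal cliques, and use collections.Counter for the size distribution and the "who sits in the most big cliques" question. Collection and field names are illustrative.

```python
# Stage 6 sketch: social graph, cliques, and clique-size distribution.
from collections import Counter
import networkx as nx
from pymongo import MongoClient

db = MongoClient()["oscon2012"]
G = nx.Graph()
for doc in db.t_strong_ties.find():
    for other in doc["strong_ties"]:
        G.add_edge(doc["user_id"], other)

cliques = list(nx.find_cliques(G))                     # maximal cliques
size_dist = Counter(len(c) for c in cliques)           # e.g. {2: ..., 3: ..., ...}
membership = Counter(u for c in cliques if len(c) > 10 for u in c)
print(size_dist)
print(membership.most_common(10))
```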

Page 115: The Art of Social Media Analysis with Twitter & Python-OSCON 2012

Twitter Social Graph Analysis of @clouderati

o  2072 followers; 973,323 unique users one level down, w/ followers/friends trimmed at 5,000
o  Strong ties
   o  follower = friend
   o  235,697 users, 462,419 edges
   o  501,367 cliques
   o  8,906 cliques w/ > 10 users, spanning 253 unique users
   o  GeorgeReese is in 7,973 of them ! See the list for the 1st 125
   o  krishnan 3,446, randy 2,197, joe 1,977, sam 1,937, jp 485, stu 403, urquhart 263, beaker 226, acroll 149, adrian 63, gevaperry 24
o  Of course, clique analysis does not tell us the whole story …

Clique Distribution = {2: 296521, 3: 58368, 4: 36421, 5: 28788, 6: 24197, 7: 20240, 8: 15997, 9: 11929, 10: 6576, 11: 1909, 12: 364, 13: 55, 14: 2}

Page 116: The Art of Social Media Analysis with Twitter & Python-OSCON 2012

Twitter Social Graph Analysis of @clouderati

o  Sorting by followers vs. sorting by strong ties is interesting
   §  Celebrity - very low strong ties
   §  Medium celebrity - medium strong ties
   §  Higher celebrity - low strong ties

Page 117: The Art of Social Media Analysis with Twitter & Python-OSCON 2012

Twitter Social Graph Analysis of @clouderati

o  A higher "Strong Ties" number is interesting
   §  It means a very high follower-friend intersection
   §  Reeves 62%, bgolden 85%
o  But a high clique count with a smaller "Strong Ties" number shows a more cohesive & stronger social graph
   §  e.g. Krishnan - 15% friends-followers
   §  Samj - 33%

Page 118: The Art of Social Media Analysis with Twitter & Python-OSCON 2012

Twitter Social Graph Analysis of @clouderati

o  Ideas for more exploration
   §  Include all followers (instead of stopping at the 5000 cap)
   §  Get tweets & track @mentions
   §  Frequent @mentions show stronger ties
   §  #tag analysis could show some interesting networks

Page 119: The Art of Social Media Analysis with Twitter & Python-OSCON 2012

Twitter Tips – A Baker's Dozen

1.  Twitter APIs are (more or less) congruent & symmetric
2.  Twitter is usually right & simple - recheck when you get unexpected results before blaming Twitter
    o  I was getting numbers when I was expecting screen_names in user objects
    o  Was ready to send blasting e-mails to the Twitter team. Decided to check one more time and found that my parameter key was wrong: screen_name instead of user_id
    o  Always test with one or two records before a long run ! - learned the hard way
3.  Twitter APIs are very powerful – consistent use can bear huge data
    o  In a week, you can pull in 4-5 million users & some tweets !
    o  Night runs are far faster & less error-prone
4.  Use a NOSQL data store as a command buffer & data buffer
    o  Makes it easy to work with Twitter at scale
    o  I use MongoDB
    o  Keep the schema simple & no fancy transformations
       •  And, as far as possible, the same as the (json) response
    o  Use the NOSQL CLI for trimming records et al

The Beginning As The End

Page 120: The Art of Social Media Analysis with Twitter & Python-OSCON 2012

Twitter Tips – A Baker's Dozen

5.  Always use a big data pipeline
    o  Collect - Store - Transform & Analyze - Model & Reason - Predict, Recommend & Visualize
    o  That way you can orthogonally extend, with functional components like command buffers, validation et al
6.  Use a functional approach for a scalable pipeline (see the sketch after this list)
    o  Compose your big data pipeline with well defined granular functions, each doing only one thing
    o  Don't overload the functional components (i.e. no collect, unroll & store as a single component)
    o  Have well defined functional components with appropriate caching, buffering, checkpoints & restart techniques
       •  This did create some trouble for me, as we will see later
7.  Crawl-Store-Validate-Recrawl-Refresh cycle
    o  The equivalent of the traditional ETL
    o  Validation stage & validation routines are important
       •  Cannot expect perfect runs
       •  Cannot manually look at data either, when data is at scale
8.  Have control numbers to validate runs & monitor them
    o  I still remember control numbers, which start with the number of punch cards in the input deck & then follow that number through the various runs !
    o  There would be a separate printout of the control numbers, kept in the operations files
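A toy sketch of what tips 5 and 6 mean in code: granular stages composed into a pipeline, so a command buffer or a validation step can be slotted in without touching the rest. The stage bodies here are placeholders, not the tutorial's actual code.

```python
# Pipeline-composition sketch: small single-purpose stages, composed in order.
def collect(seed):
    return [{"seed": seed, "raw": "..."}]   # stand-in for the Twitter API calls

def store(docs):
    return docs                             # stand-in for the MongoDB write

def unroll(docs):
    return docs                             # stand-in for flatten/reshape

def run_pipeline(seed, stages):
    data = seed
    for stage in stages:
        data = stage(data)                  # each stage boundary is a natural checkpoint
    return data

print(run_pipeline("clouderati", [collect, store, unroll]))
```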

Page 121: The Art of Social Media Analysis with Twitter & Python-OSCON 2012

Twitter Tips – A Baker's Dozen

9.  Program defensively
    o  More so for a REST-based, big data analytics system
    o  Expect failures at the transport layer & accommodate for them
10.  Have Erlang-style supervisors in your pipeline
    o  Fail fast & move on
    o  Don't linger and try to fix errors that cannot be controlled at that layer
    o  A higher layer process will circle back and do incremental runs to correct missing spiders and crawls
    o  Be aware of visibility & lack of context. Validate at the lowest layer that has enough context to take corrective actions
    o  I have an example in part 2
11.  Data will never be perfect
    o  Know your data & accommodate for its idiosyncrasies
       •  for example: 0 followers, protected users, 0 friends, …

Page 122: The Art of Social Media Analysis with Twitter & Python-OSCON 2012

Twitter Tips – A Baker's Dozen

12.  Checkpoint frequently (preferably after every API call) & have a re-startable command buffer cache
    o  See a MongoDB example in Part 2
13.  Don't bombard the URL (a pacing sketch follows this list)
    o  Wait a few seconds between successful calls. You will end up with a scalable system, eventually
    o  I found 10 seconds to be the sweet spot. 5 seconds gave retry errors; I was able to work with 5 seconds plus wait & retry, but then the rate limit started kicking in !
14.  Always measure the elapsed time of your API runs & processing
    o  Kind of an early warning when something is wrong
15.  Develop incrementally; don't fail to check "cut & paste" errors
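A small pacing-and-retry sketch for tip 13. The 10-second figure comes from the slide; the retry shape and the helper name polite_get are illustrative.

```python
# Pacing sketch: sleep between calls and back off a little on failures.
import time
import requests

def polite_get(url, params=None, wait=10, retries=3):
    for attempt in range(retries):
        resp = requests.get(url, params=params)
        if resp.status_code == 200:
            time.sleep(wait)                    # pace the next call
            return resp
        time.sleep(wait * (attempt + 1))        # back off a little more on errors
    return resp
```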

Page 123: The Art of Social Media Analysis with Twitter & Python-OSCON 2012

Twitter Tips – A Baker's Dozen

16.  The Twitter big data pipeline has lots of opportunities for parallelism
    o  Leverage data parallelism frameworks like MapReduce
    o  But first:
       §  Prototype as a linear system,
       §  Optimize and tweak the functional modules & cache strategies,
       §  Note down the stages and tasks that can be parallelized, and
       §  Then parallelize them
    o  For the example project, as we will see later, I did not leverage any parallel frameworks, but the opportunities were clearly evident. I will point them out as we progress through the tutorial
17.  Pay attention to handoffs between stages
    o  They might require transformation – for example, collect & store might store a user list as multiple arrays, while the model requires each user to be a document for aggregation
    o  But resist the urge to overload collect with transform
    o  i.e. let the collect stage store in arrays, but then have an unroll/flatten stage to transform the arrays into separate documents
    o  Add transformation as a granular function – of course, with appropriate buffering, caching, checkpoints & restart techniques
18.  Have a good log management system to capture and wade through logs

Page 124: The Art of Social Media Analysis with Twitter & Python-OSCON 2012

Twitter Tips – A Baker's Dozen

19.  Understand the underlying network characteristics for the inference you want to make
    o  Twitter Network != Facebook Network, Twitter Graph != LinkedIn Graph
    o  The Twitter network is more of an interest network
    o  So, many of the traditional network mechanisms & mechanics, like network diameter & degrees of separation, might not make sense
    o  But others, like cliques and bipartite graphs, do

Page 125: The Art of Social Media Analysis with Twitter & Python-OSCON 2012

Twitter Gripes

1.  Need more rich APIs for #tags
    o  Somewhat similar to users, viz. followers, friends et al
    o  Might make sense to make #tags a top level object with its own semantics
2.  HTTP error returns are not uniform
    o  Returns 400 Bad Request instead of 420
    o  Granted, there is enough information to figure this out
3.  Need an easier way to get screen_name from user_id
4.  "following" vs. "friends_count", i.e. "following" is a dummy variable
    o  There are a few like this, most probably for backward compatibility
5.  Parameter validation is not uniform
    o  Gives "404 Not Found" instead of "406 Not Acceptable" or "416 Range Unacceptable"
6.  Overall, more validation would help
    o  Granted, it is more of growing pains. Once one comes across a few inconsistencies, the rest is easy to figure out

Page 126: The Art of Social Media Analysis with Twitter & Python-OSCON 2012

Thanks To these Giants …

Page 127: The Art of Social Media Analysis with Twitter & Python-OSCON 2012

Thanks To these Giants …

Page 128: The Art of Social Media Analysis with Twitter & Python-OSCON 2012

Thanks To these Giants …

Page 129: The Art of Social Media Analysis with Twitter & Python-OSCON 2012

Thanks To these Giants …

Page 130: The Art of Social Media Analysis with Twitter & Python-OSCON 2012

Thanks To these Giants …

Page 131: The Art of Social Media Analysis with Twitter & Python-OSCON 2012

I had a good time researching & preparing for this Tutorial.

I hope you learned a few new things & have a few vectors to follow.