Katy perry and trend detection red dirt

42
Katy Perry and Trend Detec.on using Ma5 Kirk – Wetpaint

Transcript of Katy perry and trend detection red dirt

Page 1: Katy perry and trend detection   red dirt

Katy  Perry  and    Trend  Detec.on  using  

Ma5  Kirk  –  Wetpaint  

Page 2: Katy perry and trend detection   red dirt

`whoami`    

Former  finance  quant  turned  Ruby  hacker  

twi5er:  @mjkirk  

website:  ma5hewkirk.com  

Page 3: Katy perry and trend detection   red dirt

I  work  for  wetpaint.com  

Page 4: Katy perry and trend detection   red dirt

Who  the  hell  is  Wetpaint?  

Page 5: Katy perry and trend detection   red dirt

wetpaint.com  

Page 6: Katy perry and trend detection   red dirt

WTF  am  I  doing  here  at  RedDirt??!!!  

Page 7: Katy perry and trend detection   red dirt

6  months  ago  we  started  building  an  engine  

Page 8: Katy perry and trend detection   red dirt

To  find  relevant  news  

Page 9: Katy perry and trend detection   red dirt

That  U.lizes  

•  Sta.s.cal  processing  •  Natural  Language  Processing  •  Hardcore  mathema.cs  

And Magic

Page 10: Katy perry and trend detection   red dirt

Ruby  doesn’t  really  have  tools  for  this  sort  of  thing  

Except  for  magic    

Page 11: Katy perry and trend detection   red dirt

Java  Packages  of  help  

•  Apache  commons  math  •  Stanford  CoreNLP  •  OpenNLP  •  Weka  

•  LingPipe  •  Mahout  

•  etc,  etc  

Page 12: Katy perry and trend detection   red dirt

Who  wants  to  write  Java  all  day  long  though?  

Page 13: Katy perry and trend detection   red dirt

Java  +  JRuby  =  Awesome  

Page 14: Katy perry and trend detection   red dirt

So  what  about  Katy  Perry….  

Page 15: Katy perry and trend detection   red dirt

JRuby  helped  us  find  Katy  Perry  

Page 16: Katy perry and trend detection   red dirt

In  the  context  of  Glee  

Page 17: Katy perry and trend detection   red dirt

How  we  went  about  finding  Katy  Perry  

1.  Detect  ac.vity  around  shows  on  Twi5er,  Facebook  etc.  

2.  Extract  a5ributes  about  that  ac.vity  3.  Cluster  everything  together  to  reduce  clu5er  

Page 18: Katy perry and trend detection   red dirt

Back  in  November  she  tweeted  

Oh...My...Gosh...  this  just  brought  a  sweet  tear  to  my  eye!  Teenage  Dream  on  GLEE  makes  my  heart  go  WEEEEE!  

h5p://t.co/8SAFkGl  

Page 19: Katy perry and trend detection   red dirt

Which  fed  into  a  re-­‐tweet  frenzy  

Page 20: Katy perry and trend detection   red dirt

Obviously  that’s  an  outlier  

Page 21: Katy perry and trend detection   red dirt

But  how  would  we  find  that?  

•  Fit  the  Poisson  distribu.on  to  the  last  day  and  figure  out  the  percen.le  of  the  current  data  point.  

•  If  it’s  greater  than  say  95%  there’s  something  weird  going  on  

Page 22: Katy perry and trend detection   red dirt

Poisson  Distribu.on  

Page 23: Katy perry and trend detection   red dirt

Let’s  use  the  Apache  Commons  Math  Package!!!  

Page 24: Katy perry and trend detection   red dirt

require 'java’; require’math.jar’!Poisson = org.apache.commons.math.distribution.PoissonDistributionImpl!

mean = 20 # tweets per five minutes!

fishy = Poisson.new(mean)!

fishy.cumulative_probability(30) # => 98.6%  

Page 25: Katy perry and trend detection   red dirt

Coding  a  Poisson  distribu.on  in  ruby  wouldn’t  be  as  much  fun  

Page 26: Katy perry and trend detection   red dirt

Ok  so  we  know  there’s  something  there.    What  is  it?  

Let’s  assume  we  don’t  know  it  was  Katy  Perry  

Page 27: Katy perry and trend detection   red dirt

Extract  some  a5ributes  

•  Using  a  tool  like  the  Stanford  CoreNLP  

•  Extract  – n-­‐gram  phrases  – words  – urls  

Page 28: Katy perry and trend detection   red dirt

Probably  get  a5ributes  like    

n_grams = ["sweet tear", "teenage dream"]!

words = ["oh", "gosh", "just", "brought", "sweet", "tear", "eye", "teenage", "dream", "glee", "heart", "go", "weeeee"]!

urls = ["http://t.co/8SAFkGl"]!

Page 29: Katy perry and trend detection   red dirt

Stanford  Core  NLP  

require 'java'; require 'nlp.jar’!include_class "edu.stanford.nlp.ie.machinereading.domains.ace.reader.RobustTokenizer”!

# Seriously wtf guys...!

RT = RobustTokenizer!

Page 30: Katy perry and trend detection   red dirt

Stanford  Core  NLP  

tok = RT.new(katy_perry_tweet)!

tokens = tok.tokenize.map(&:to_s)!urls = tokens.select do |t| ! RT.is_url(t) !end!

words = tokens.uniq - urls - punctuation - stopwords!

Page 31: Katy perry and trend detection   red dirt

We  have  a  bag  full  of  words  and  urls.    Now  what?  

Page 32: Katy perry and trend detection   red dirt

Cluster  it!  

•  Li5le  more  of  a  difficult  problem.    In  our  case  we  wrote  our  own  package.  

•  Apache  Commons  math  and  Weka  both  have  k-­‐means  clustering  in  them  

Page 33: Katy perry and trend detection   red dirt
Page 34: Katy perry and trend detection   red dirt

Quickest  solu.on  is  to  use  Apache  Commons  Math  

Page 35: Katy perry and trend detection   red dirt

Build  a  Point  Class  First  

class Point! attr_reader :attrs! include

org.apache.commons.math.stat.clustering.Clusterable!

def initialize(attrs = [])! !@attrs = attrs! end! def distanceFrom(point)! ((point.attrs | @attrs) - (point.attrs &

@attrs)).length! end!

Page 36: Katy perry and trend detection   red dirt

Build  a  Point  Class  First  

def centroidOf(points)! u = Point.new(points.map(&:attributes).flatten.uniq)!

guess = points.first! points.each do |point|! !if u.distanceFrom(point) < u.distanceFrom(best_guess)!

!!guess = point! !end! end! best_guess!

end!end  

Page 37: Katy perry and trend detection   red dirt

Feed  it  into  Apache  Commons  

require 'java'; require 'math.jar’!include_class "org.apache.commons.math.stat.clustering.KMeansPlusPlusClusterer”!

clusterer = KMeansPlusPlusClusterer(java.util.Random.new)!

num_clusters = ? # Depends...!max_iter = -1 # no max!clusterer.cluster(collection_of_points, num_clusters, max_iter)  

Page 38: Katy perry and trend detection   red dirt

Now  that  we’ve  clustered,  our  peeps  can  find  Katy  Perry  

Page 39: Katy perry and trend detection   red dirt

And  now  we  can  have  a  party  

Page 40: Katy perry and trend detection   red dirt

Conclusion  

•  Detect  •  Extract  •  Cluster  

Page 41: Katy perry and trend detection   red dirt

Conclusion  

•  Java  +  JRuby  =  Killer  combo  for  fun  development  of  sta.s.cs  apps.  

Page 42: Katy perry and trend detection   red dirt

THANKS!!!  

If  you  want  to  work  with  Females  18-­‐34.    We’re  Hiring!!!  

h5p://bit.ly/wpdevs