Data Science at Tumblr


Transcript of Data Science at Tumblr

Page 1: Data Science at Tumblr

Data at Tumblr

Adam Laiacano
NYC Data Science Meetup

@adamlaiacano
adamlaiacano.tumblr.com

Page 2: Data Science at Tumblr

What I Needed to Learn When I Started My Job


Page 3: Data Science at Tumblr

Electrical Engineering background
Worked at CBS to learn more about stats / data

Joined Tumblr in August 2011
40th employee, now over 160

About Me


Page 4: Data Science at Tumblr

blogging platform / social network
100,000,000 blogs!

unique signals:
asynchronous following graph

reblogs, likes, replies

About Tumblr


Page 5: Data Science at Tumblr

About You

Country   March   April   May
USA       10000   12000   14000
Canada     7000    6500    5000
France     1200    1400    2000

Country   Month   Value
USA       March   10000
USA       April   12000
USA       May     14000
Canada    March    7000
Canada    April    6500
Canada    May      5000
France    March    1200
France    April    1400
France    May      2000


Page 6: Data Science at Tumblr

About You

(same wide and long tables as on page 5)

Pivot Table!

Page 7: Data Science at Tumblr

About You

(same long and wide tables as on page 5)

Page 8: Data Science at Tumblr

About You

melted <- melt.data.frame(pivoted, id.vars='country')
pivoted <- cast(melted, country~month)

(same long and wide tables as on page 5)
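For readers working in Python rather than R, roughly the same round trip in pandas. A minimal sketch, assuming modern pandas; the column names come from the tables above:

import pandas as pd

pivoted = pd.DataFrame({
    'country': ['USA', 'Canada', 'France'],
    'March':   [10000, 7000, 1200],
    'April':   [12000, 6500, 1400],
    'May':     [14000, 5000, 2000],
})

# wide -> long, like melt.data.frame()
melted = pivoted.melt(id_vars='country', var_name='month', value_name='value')

# long -> wide again, like cast(melted, country~month)
back = melted.pivot(index='country', columns='month', values='value')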


Page 9: Data Science at Tumblr

About You

(same long and wide tables as on page 5)

Page 10: Data Science at Tumblr

Who Cares?

About You

(same long and wide tables as on page 5)

Page 11: Data Science at Tumblr

One more question:


Page 12: Data Science at Tumblr

(image-only slide)

Page 13: Data Science at Tumblr

Hadoop


Page 14: Data Science at Tumblr

What tools we use

What we do with those tools


Page 15: Data Science at Tumblr

Plumbing

John D. Cook "The plumber programmer" November 2011 http://bit.ly/XfcXrt


Page 16: Data Science at Tumblr

1. Record events / actions
2. Store / archive everything
3. Extract information
   a. Reports / BI
   b. Back to Tumblr application

Pipes


Page 17: Data Science at Tumblr

Built-in Variables:
• timestamp
• referring page
• user identifier
• action identifier
• location (city)
• language setting

GiantOctopus: in-house event logging system.

GiantOctopus::log('posts', array('send_to_fb' => 1, 'send_to_twitter' => 0));

Step 1: Log Events
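A minimal sketch of what an event logger along these lines might look like (GiantOctopus itself is in-house PHP; every name in this Python sketch is hypothetical):

import json, time

def log_event(category, data, user_id=None, referrer=None):
    # merge the caller's event-specific fields with the built-in variables
    event = {
        'ts': int(time.time()),   # timestamp
        'user': user_id,          # user identifier
        'referrer': referrer,     # referring page
        'category': category,     # action identifier
    }
    event.update(data)
    # in production this would be appended to a Scribe-bound log file
    print(json.dumps(event))

log_event('posts', {'send_to_fb': 1, 'send_to_twitter': 0}, user_id=42)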


Page 18: Data Science at Tumblr

Scribe

(diagram: Web Servers → Scribe Servers → HDFS; the web servers write to Scribe continuously, and a daily cron job loads the collected logs into HDFS)

Page 19: Data Science at Tumblr

Step 2: Store in Hadoop

One huge computer:
300 TB of hard drive
7.8 TB of RAM
85 x 2 = 170 hex-core processors


Page 20: Data Science at Tumblr

Step 2: Store in Hadoop

One huge computer (same specs as on page 19).

One huge PITA:
awful docs (search-hadoop.com helps)
Java everywhere
fragmented community

Page 21: Data Science at Tumblr

Hadoop

hive

pig

map/reduce


Page 22: Data Science at Tumblr

Hive

"Basically SQL"

Compiles to Java map/reduce

About 100 Hive tables

Each "table" is really a directory of flat files

SELECT root_post_id, COUNT(*) AS likes
FROM posts
WHERE action = 'like'
GROUP BY root_post_id
ORDER BY likes DESC
LIMIT 10;

10 most liked posts


Page 23: Data Science at Tumblr

Hive Partitions

File location in HDFS        Hive partition value
/posts/2013/03/26/*.lzo      dt='2013-03-26'
/posts/2013/03/27/*.lzo      dt='2013-03-27'
/posts/2013/03/28/*.lzo      dt='2013-03-28'

Filtering on the partition column reads only the matching directories (204 mappers):

SELECT action, COUNT(*) AS views
FROM pageviews
WHERE dt = '2012-03-05'
GROUP BY action;

Filtering on a raw timestamp scans every partition (22,895 mappers):

SELECT action, COUNT(*) AS views
FROM pageviews
WHERE ts > 1330927200 AND ts < 1331013600
GROUP BY action;


Page 24: Data Science at Tumblr

Extending Hive: Streaming

add file helpers.py;

FROM users
SELECT TRANSFORM(id, email)
USING 'helpers.py'
AS (id_with_gmail)

• Add all .py files you'll need to the query
• Sends each record to the Python script via stdin
• Can be used as a subquery in a "normal" Hive query

#!/usr/bin/python
#
# helpers.py

import sys, re

gmail = re.compile(r'.*@gmail\.com')

for row in sys.stdin:
    id, email = row.strip().split('\t')
    if gmail.match(email):
        print id
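Because TRANSFORM simply pipes tab-separated rows through stdin, the script can be tested locally before running it in Hive. A sketch (the sample rows are made up):

import subprocess

# two fake (id, email) rows, tab-separated, one per line
sample = b"1\tbob@gmail.com\n2\talice@example.com\n"

p = subprocess.Popen(['python', 'helpers.py'],
                     stdin=subprocess.PIPE, stdout=subprocess.PIPE)
out, _ = p.communicate(sample)
print(out)  # b'1\n' -- only the gmail user's id comes back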


Page 25: Data Science at Tumblr

Pig

posts = LOAD 'posts.tsv' AS ( root_post_id:int, action:chararray);

likes = FILTER posts BY action=='like';

grouped = GROUP likes BY root_post_id;

counted = FOREACH grouped GENERATE group AS root_post_id, COUNT(likes.root_post_id) AS likes;

sorted = ORDER counted BY likes DESC;

top10 = LIMIT sorted 10;

STORE top10 INTO 'top10.csv';

"Basically SQL" if you had toexplain it piece by piece.

"DataBag" == "DataFrame"


Page 26: Data Science at Tumblr

Extending Pig: Python UDFs

Extract word prefixes for type-ahead tag search

def prefixes(input, max_len=3):
    nchar = min(len(input), max_len) + 1
    return [input[:i] for i in range(1, nchar)]

>>> prefixes('museum', max_len=6)
['m', 'mu', 'mus', 'muse', 'museu', 'museum']


Page 27: Data Science at Tumblr

Extending Pig: Python UDFs

Extract word prefixes for type-ahead tag search, with @outputSchema declaring the return type to Pig:

@outputSchema("t:(prefix:chararray)")
def prefixes(input, max_len=3):
    nchar = min(len(input), max_len) + 1
    return [input[:i] for i in range(1, nchar)]

>>> prefixes('museum', max_len=6)
['m', 'mu', 'mus', 'muse', 'museu', 'museum']


Page 28: Data Science at Tumblr

package com.tumblr.swine;

import java.util.ArrayList;
import java.util.List;

public class Prefixes {

    private int maxTermLen;

    public Prefixes() {
        this.maxTermLen = Integer.MAX_VALUE;
    }

    public Prefixes(int maxTermLen) {
        this.maxTermLen = maxTermLen;
    }

    public List<String> get(String s) {
        int size = s.length() < maxTermLen ? s.length() : maxTermLen;
        ArrayList<String> results = new ArrayList<String>();
        for (int i = 1; i < size + 1; i++) {
            results.add(s.substring(0, i));
        }
        return results;
    }
}

Extending Pig: Java UDFs


Page 29: Data Science at Tumblr

(the com.tumblr.swine.Prefixes class from page 28, shown alongside the Pig wrapper below)

package com.tumblr.swine.pig;

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.pig.EvalFunc;
import org.apache.pig.FuncSpec;
import org.apache.pig.data.DataBag;
import org.apache.pig.data.DataType;
import org.apache.pig.data.DefaultBagFactory;
import org.apache.pig.data.Tuple;
import org.apache.pig.data.TupleFactory;
import org.apache.pig.impl.logicalLayer.FrontendException;
import org.apache.pig.impl.logicalLayer.schema.Schema;

public class Prefixes extends EvalFunc<DataBag> {

    public DataBag exec(Tuple input) throws IOException {
        if (input == null || input.size() == 0)
            return null;
        try {
            DataBag output = DefaultBagFactory.getInstance().newDefaultBag();
            String word = (String) input.get(0);
            int max = Integer.MAX_VALUE;
            if (input.size() == 2) {
                max = (Integer) input.get(1);
            }
            com.tumblr.swine.Prefixes prefixes = new com.tumblr.swine.Prefixes(max);
            for (String prefix : prefixes.get(word)) {
                Tuple t = TupleFactory.getInstance().newTuple(1);
                t.set(0, prefix);
                output.add(t);
            }
            return output;
        } catch (Exception e) {
            System.err.println("Prefixes: failed to process input; error - " + e.getMessage());
            return null;
        }
    }

    @Override
    public Schema outputSchema(Schema input) {
        Schema bagSchema = new Schema();
        bagSchema.add(new Schema.FieldSchema("prefix", DataType.CHARARRAY));
        try {
            return new Schema(new Schema.FieldSchema(
                    getSchemaName(this.getClass().getName().toLowerCase(), input),
                    bagSchema, DataType.BAG));
        } catch (FrontendException e) {
            return null;
        }
    }

    @Override
    public List<FuncSpec> getArgToFuncMapping() throws FrontendException {
        List<FuncSpec> funcSpecs = new ArrayList<FuncSpec>(2);
        Schema s = new Schema();
        s.add(new Schema.FieldSchema(null, DataType.CHARARRAY));
        funcSpecs.add(new FuncSpec(this.getClass().getName(), s));
        // Allow specifying optional max length of prefix
        s = new Schema();
        s.add(new Schema.FieldSchema(null, DataType.CHARARRAY));
        s.add(new Schema.FieldSchema(null, DataType.INTEGER));
        funcSpecs.add(new FuncSpec(this.getClass().getName(), s));
        return funcSpecs;
    }
}

Extending Pig: Java UDFs


Page 30: Data Science at Tumblr

HUE

Keeps query history

Preview tables / results

Save queries & templates


Page 31: Data Science at Tumblr

What tools we use

What we do with those tools


Page 32: Data Science at Tumblr

Spam

Classic example of supervised learning

Don't get too clever

Build good tooling!


Page 33: Data Science at Tumblr

Spam: Vowpal Wabbit

Online (continuously learning) system

Updates parameters with every new piece of information

Parallelizable, can run as a service, very fast.

Loss functions:
• squared
• logistic
• hinge
• quantile


Page 34: Data Science at Tumblr

Spam: Vowpal Wabbit

Post:
blog: 'adamlaiacano',
tags: ['free ipad', 'warez'],
location: 'US~NY-New York',
is_suspended: 0 or 1

Model: is_suspended ~ free_ipad + warez + US~NY-New_York + .....

Squared loss function
Very high dimension: L1 regularization to avoid overfitting
Great precision, decent recall
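For illustration, a sketch of turning a post like the one above into a Vowpal Wabbit input line. The exact feature encoding here is an assumption, not Tumblr's:

def to_vw(post):
    # one binary feature per tag, plus one for the location;
    # VW feature names cannot contain spaces
    feats = [t.replace(' ', '_') for t in post['tags']]
    feats.append(post['location'].replace(' ', '_'))
    label = 1 if post['is_suspended'] else 0  # 0/1 labels with squared loss
    return '%d |post %s' % (label, ' '.join(feats))

post = {'blog': 'adamlaiacano',
        'tags': ['free ipad', 'warez'],
        'location': 'US~NY-New York',
        'is_suspended': 1}

print(to_vw(post))  # 1 |post free_ipad warez US~NY-New_York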


Page 35: Data Science at Tumblr

Type-Ahead Search

Most popular tags for any letter combination

Store daily results in distributed Redis cluster

m: [me, model, mine]
mu: [muscle, muscles, music video]
mus: [muscle, muscles, music video]
muse: [muse, museum, nine muses]
museu: [museum, metropolitan museum of art, natural history museum]


Page 36: Data Science at Tumblr

Type-Ahead Search

Only keep popular prefixes: a tag must occur at least 10 times

Only update keys that have changed.

- muse: [muse, museum, nine muses]
+ muse: [muse, museum, arizona muse]
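A sketch of that update step with redis-py, assuming a hypothetical key scheme (tag:<prefix> holding a JSON list of the prefix's top tags):

import json
import redis

r = redis.StrictRedis()

def update_prefix(prefix, top_tags):
    key = 'tag:' + prefix
    new = json.dumps(top_tags)
    old = r.get(key)
    # write only when the top list actually changed
    if old is None or old.decode('utf-8') != new:
        r.set(key, new)

update_prefix('muse', ['muse', 'museum', 'arizona muse'])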


Page 37: Data Science at Tumblr

Questions?

@adamlaiacano

http://adamlaiacano.tumblr.com
