DataEngConf: The Science of Virality at BuzzFeed



Page 1: DataEngConf: The Science of Virality at BuzzFeed
Page 2: DataEngConf: The Science of Virality at BuzzFeed

HISTORY OF VIRALITY

Page 3: DataEngConf: The Science of Virality at BuzzFeed
Page 4: DataEngConf: The Science of Virality at BuzzFeed

THE DATA

Page 5: DataEngConf: The Science of Virality at BuzzFeed

THE DATA: OLD VERSION

Article being viewed

User viewing article

Time of pageview

Referring domain

Page 6: DataEngConf: The Science of Virality at BuzzFeed

THE DATA: NEW VERSION

Article being viewed

Time of pageview

Referring domain

User viewing article

Referring User

Page 7: DataEngConf: The Science of Virality at BuzzFeed

DIFFERENT PERSPECTIVE:

Pageviews are a process on a graph!

Page 8: DataEngConf: The Science of Virality at BuzzFeed

WHAT THE GRAPH LOOKS LIKE:

Page 9: DataEngConf: The Science of Virality at BuzzFeed

WHAT THE PROCESS LOOKS LIKE:

Page 10: DataEngConf: The Science of Virality at BuzzFeed

WHAT THE DATA LOOKS LIKE:

Page 11: DataEngConf: The Science of Virality at BuzzFeed

WHAT CAN YOU DO WITH OLD PAGEVIEWS?

(Educated)

Guess!

Page 12: DataEngConf: The Science of Virality at BuzzFeed

CONNIE

Page 13: DataEngConf: The Science of Virality at BuzzFeed

OLD GRAPH RECONSTRUCTION: MODEL-BASED INFERENCE

Probabilistic: You can infer connections that aren’t there!

Error Prone: Graph statistics can be susceptible to small changes in the graph

Gets larger when the difference in pageview times gets smaller
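A minimal sketch of the idea that a candidate connection scores higher when two pageviews are closer in time. The exponential kernel, the `rate` parameter, and both function names are assumptions made for illustration; the slide's actual formula did not survive extraction, and this greedy guess is not the CONNIE algorithm itself.

```python
import math


def edge_weight(t_parent, t_child, rate=1.0):
    """Unnormalized plausibility that the earlier view led to the later one.

    Assumed exponential kernel: larger when the time difference is smaller.
    """
    dt = t_child - t_parent
    if dt <= 0:
        return 0.0  # a later (or simultaneous) view cannot be the cause
    return rate * math.exp(-rate * dt)


def guess_parents(view_times, rate=1.0):
    """Map each user to the most plausible earlier viewer, or None.

    view_times: dict of user -> time of that user's pageview.
    """
    parents = {}
    for user, t in view_times.items():
        best, best_weight = None, 0.0
        for other, t_other in view_times.items():
            weight = edge_weight(t_other, t, rate)
            if weight > best_weight:
                best, best_weight = other, weight
        parents[user] = best
    return parents


views = {"a": 0.0, "b": 1.0, "c": 1.5}
guessed = guess_parents(views)
```

Because the guess depends only on timing, it can link users who never actually shared with each other, which is the "connections that aren’t there" risk named above.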

Page 14: DataEngConf: The Science of Virality at BuzzFeed

SIMPLIFIED VERSION:

Observe:

Guess:

Page 15: DataEngConf: The Science of Virality at BuzzFeed

SIMPLIFIED VERSION:

Guess:

Reality:

Page 16: DataEngConf: The Science of Virality at BuzzFeed

Check out a toy implementation here!

github.com/akellehe/pyconnie

Page 17: DataEngConf: The Science of Virality at BuzzFeed

NEW GRAPH RECONSTRUCTION: TRIVIAL

These are actually Unique Visitors …

Page 18: DataEngConf: The Science of Virality at BuzzFeed

LIFE IS A LITTLE MESSY…

This is more like what the Pageview graph looks like

Page 19: DataEngConf: The Science of Virality at BuzzFeed

PROBLEM: DATA MUNGING

• Lots of potential for heuristics!
• How do we get promotion attribution from propagations?
• Trees are important: how can we be sure we get them?

Page 20: DataEngConf: The Science of Virality at BuzzFeed

PROBLEM: STREAMLINING ANALYSIS

• How do we work from a common set of definitions?
• How do we avoid repeating analysis?
• How can we streamline data visualization? EDA?
• How do we share optimized analyses? And avoid inefficient (but correct) algorithms?

Page 21: DataEngConf: The Science of Virality at BuzzFeed

DEFINE DATA STRUCTURES!

• All data munging happens “under the hood”
• Data pre-processing is unit-tested
• No room for heuristics: standardization!
• Hard math definitions can be consistency-checked!

Page 22: DataEngConf: The Science of Virality at BuzzFeed

PROPAGATION SET

For one article

For the site (or other set of articles, S)

Page 23: DataEngConf: The Science of Virality at BuzzFeed

PROPAGATION SET

Pageviews to article b in time T

Pageviews to the site in time T

The simplest data structure. Just a representation of the raw pageview logs.

Represented as a generator of UserEdge objects
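A minimal sketch of a propagation set as a generator of `UserEdge` objects. Only the name `UserEdge` comes from the slide; the field names (taken from the "new version" pageview record), the row layout, and the `propagation_set` function with its `article`, `start`, and `end` parameters are assumptions for illustration.

```python
from dataclasses import dataclass
from typing import Iterable, Iterator, Optional, Tuple


@dataclass(frozen=True)
class UserEdge:
    """One pageview: who viewed what, when, and who or what referred them."""
    article: str
    user: str
    timestamp: float
    referring_domain: Optional[str]
    referring_user: Optional[str]


Row = Tuple[str, str, float, Optional[str], Optional[str]]


def propagation_set(
    rows: Iterable[Row],
    article: Optional[str] = None,
    start: float = float("-inf"),
    end: float = float("inf"),
) -> Iterator[UserEdge]:
    """Lazily yield pageviews in the window [start, end).

    With article=None the whole site is covered; otherwise one article.
    """
    for row in rows:
        edge = UserEdge(*row)
        if article is not None and edge.article != article:
            continue
        if start <= edge.timestamp < end:
            yield edge


log_rows = [
    ("b", "u1", 1.0, "facebook.com", None),
    ("b", "u2", 2.0, "facebook.com", "u1"),
    ("c", "u3", 3.0, "twitter.com", None),
    ("b", "u4", 50.0, None, "u2"),
]
edges_for_b = list(propagation_set(log_rows, article="b", start=0.0, end=10.0))
site_edges = list(propagation_set(log_rows, start=0.0, end=10.0))
```

A generator keeps the structure close to the raw logs and avoids loading every pageview into memory at once.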

Page 24: DataEngConf: The Science of Virality at BuzzFeed

PROPAGATION GRAPH

Page 25: DataEngConf: The Science of Virality at BuzzFeed

PROPAGATION GRAPH

Page 26: DataEngConf: The Science of Virality at BuzzFeed

PROPAGATION GRAPH

Page 27: DataEngConf: The Science of Virality at BuzzFeed

INFLUENCE GRAPH

Propagation graph together with a map that measures the influence of the origin user in p on the pageviewing user

Page 28: DataEngConf: The Science of Virality at BuzzFeed

CONSIDER:

Page 29: DataEngConf: The Science of Virality at BuzzFeed

PROPAGATION FOREST

Page 30: DataEngConf: The Science of Virality at BuzzFeed

PROPAGATION FOREST

The propagation graph is great, but we’d also like a concept like unique visitors!

If there is attribution ordering in the graph, we can trace content back to its source!

Page 31: DataEngConf: The Science of Virality at BuzzFeed

PROPAGATION FOREST: FIRST PARENT ATTRIBUTION

n pageviews One UV

Page 32: DataEngConf: The Science of Virality at BuzzFeed

PROPAGATION FOREST

gets the credit
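A minimal sketch of first-parent attribution, assuming each pageview is a `(parent, child, timestamp)` tuple and that the earliest referrer of a user is the one that gets the credit. The tuple layout and the function name are hypothetical.

```python
def first_parent_attribution(edges):
    """Collapse repeated pageviews so each user keeps only the earliest in-edge.

    edges: iterable of (parent, child, timestamp).
    Returns dict child -> (parent, timestamp): n pageviews become one UV.
    """
    first = {}
    for parent, child, timestamp in edges:
        if child not in first or timestamp < first[child][1]:
            first[child] = (parent, timestamp)
    return first


pageviews = [
    ("facebook", "u1", 1.0),
    ("u1", "u2", 2.0),
    ("u3", "u2", 5.0),   # u2 views again via u3: ignored, u1 came first
    ("u1", "u2", 7.0),   # repeat view via u1: ignored
]
attributed = first_parent_attribution(pageviews)
```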

Page 33: DataEngConf: The Science of Virality at BuzzFeed

RESULT: ALL GRAPHS ARE FORESTS

Promotions have 0 indegree, users have 1 indegree

total edges in connected components:

Trees!
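A minimal sketch of a consistency check for this result: a connected component with n nodes is a tree exactly when it has n - 1 edges. The input format (a `child -> parent` dict, so every user already has indegree 1) and the function name are assumptions.

```python
from collections import Counter


def components_are_trees(parent_of):
    """Return dict component representative -> True if that component is a tree.

    parent_of: dict child -> parent (one in-edge per child).
    """
    leader = {}

    def find(node):
        # Union-find lookup with path halving.
        leader.setdefault(node, node)
        while leader[node] != node:
            leader[node] = leader[leader[node]]
            node = leader[node]
        return node

    for child, parent in parent_of.items():
        root_child, root_parent = find(child), find(parent)
        if root_child != root_parent:
            leader[root_child] = root_parent

    node_count = Counter(find(node) for node in list(leader))
    edge_count = Counter(find(child) for child in parent_of)
    return {rep: edge_count[rep] == node_count[rep] - 1 for rep in node_count}


forest = {"u1": "facebook", "u2": "u1", "u3": "twitter"}
with_cycle = {"a": "b", "b": "a"}
forest_check = components_are_trees(forest)
cycle_check = components_are_trees(with_cycle)
```

A component that fails the check has as many edges as nodes, which means it contains a cycle or has lost its promotion root.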

Page 34: DataEngConf: The Science of Virality at BuzzFeed

CAREFUL FOR EDGE CASES: MISSING DATA?

All connected components should be rooted at a promotion source.

What happens if we lose the first edge (e.g. use the wrong T)?

Page 35: DataEngConf: The Science of Virality at BuzzFeed

PROPAGATION FOREST: CYCLE BREAKING

Consider …

Cycle is not broken by first-parent attribution

Traversal algorithms go on forever!

Page 36: DataEngConf: The Science of Virality at BuzzFeed

PROPAGATION FOREST: CYCLE BREAKING

Consider …

As long as they’re not equal, they can be ordered, say

Then, there is a node in the cycle with an out-edge younger than its in-edge:

The original pageview for that node must have been lost. Cut the in-edge (FPA!).
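A minimal sketch of cycle breaking, assuming first-parent attribution has already run, so the input is a `child -> (parent, timestamp)` dict with distinct timestamps. It reads the condition above as "the node's out-edge has an earlier timestamp than its in-edge". Within a cycle, the node with the latest in-edge always satisfies that, so its in-edge is the one cut. The function name and input format are hypothetical.

```python
def break_cycles(parent_of):
    """Return a copy of parent_of with one in-edge cut per cycle.

    parent_of: dict child -> (parent, timestamp), one in-edge per child.
    """
    parent_of = dict(parent_of)
    resolved = set()  # nodes already known to lead to a root
    for start in list(parent_of):
        path, on_path = [], set()
        node = start
        # Walk up the parent pointers until a root, a resolved node, or a repeat.
        while node in parent_of and node not in resolved and node not in on_path:
            path.append(node)
            on_path.add(node)
            node = parent_of[node][0]
        if node in on_path:
            cycle = path[path.index(node):]
            # Latest in-edge in the cycle: this node shared before its recorded
            # view, so its original pageview must have been lost.
            lost_view = max(cycle, key=lambda n: parent_of[n][1])
            del parent_of[lost_view]
        resolved.update(path)
    return parent_of


cyclic = {"b": ("a", 1.0), "c": ("b", 2.0), "a": ("c", 3.0), "d": ("c", 4.0)}
acyclic = break_cycles(cyclic)
```

After the cut, the node that lost its in-edge becomes the root of its component, and traversals terminate.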

Page 37: DataEngConf: The Science of Virality at BuzzFeed

SUCCESS!Cycle-breaking + FPA = Trees!

Each tree is the UV graph downstream from a promotion source: promotion attribution!

Additional Benefits:

Most information diffusion analyses model trees growing on graphs.

Many algorithms simplify when run on trees!

Page 38: DataEngConf: The Science of Virality at BuzzFeed

SUPERTREE

We may want to run an algorithm, or calculate a tree statistic from a whole forest, instead of just one tree. How can we do that?

Merge all the roots (promotion sources) together into one “super-node”

The whole forest becomes a SuperTree!
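A minimal sketch of the merge, assuming the forest is given as a `child -> parent` dict and the result is a `node -> children` adjacency dict. The `SUPER_ROOT` label and both function names are hypothetical.

```python
from collections import defaultdict

SUPER_ROOT = "__super__"  # hypothetical label for the merged promotion sources


def build_supertree(parent_of):
    """Merge every root of the forest into one super-node.

    parent_of: dict child -> parent.
    Returns dict node -> list of children, rooted at SUPER_ROOT.
    """
    children = defaultdict(list)
    for child, parent in parent_of.items():
        children[parent].append(child)

    supertree = {SUPER_ROOT: []}
    for node, kids in children.items():
        if node not in parent_of:
            # A root (promotion source): its children hang off the super-node.
            supertree[SUPER_ROOT].extend(kids)
        else:
            supertree[node] = list(kids)
    return supertree


def tree_size(tree, root):
    """Count nodes reachable from root, iteratively to avoid deep recursion."""
    count, stack = 0, [root]
    while stack:
        node = stack.pop()
        count += 1
        stack.extend(tree.get(node, []))
    return count


forest = {"u1": "facebook", "u2": "u1", "u3": "twitter"}
supertree = build_supertree(forest)
total_nodes = tree_size(supertree, SUPER_ROOT)
```

Any single-tree statistic, here a node count, can then be computed once over the whole forest.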

Page 39: DataEngConf: The Science of Virality at BuzzFeed

SUPERTREE: EXAMPLE

Page 40: DataEngConf: The Science of Virality at BuzzFeed

SUPERTREE: EXAMPLE

Page 41: DataEngConf: The Science of Virality at BuzzFeed

APPLICATION: LARGE SCALE DATA VIS

Page 42: DataEngConf: The Science of Virality at BuzzFeed

WHY IS IT SLOW?

Layouts often consider repelling each node from every other: O(n^2) time complexity

Good for a few thousand nodes

Page 43: DataEngConf: The Science of Virality at BuzzFeed

OPENORD: SIMULATED ANNEALING

Linear main layout

Quadratic settling Phase

Implemented in Gephi

Page 44: DataEngConf: The Science of Virality at BuzzFeed

OPENORD

Good for ~10k Users

Slow for ~100k Users

Messy! (if you skip the quadratic step!)

Page 45: DataEngConf: The Science of Virality at BuzzFeed

TAKE ADVANTAGE OF TREE STRUCTURE!

Traverse the tree to decide where to place nodes!

Page 46: DataEngConf: The Science of Virality at BuzzFeed

H3 LAYOUT

Each parent is in the center of a hemisphere.

Children are laid out on the surface of the hemisphere

They become centers of smaller hemispheres (if they’re parents)

Etc.
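A rough sketch of the hemisphere idea, written in ordinary Euclidean space with every hemisphere opening along +z. It is a simplification for illustration, not the H3 algorithm (which works in hyperbolic space) and not the pyh3 API. The function name, the `shrink` factor, and the golden-angle spacing of children are all assumptions.

```python
import math

GOLDEN_ANGLE = math.pi * (3.0 - math.sqrt(5.0))


def hemisphere_layout(tree, root, radius=1.0, shrink=0.5):
    """Place each node's children on a hemisphere centered on that node.

    tree: dict node -> list of children.
    Returns dict node -> (x, y, z). One traversal, so time is linear in nodes.
    """
    positions = {root: (0.0, 0.0, 0.0)}
    stack = [(root, radius)]
    while stack:
        parent, r = stack.pop()
        kids = tree.get(parent, [])
        px, py, pz = positions[parent]
        for i, kid in enumerate(kids):
            z = (i + 0.5) / len(kids)        # height in (0, 1): upper half only
            ring = math.sqrt(1.0 - z * z)    # radius of the circle at that height
            phi = i * GOLDEN_ANGLE           # spread children around the axis
            positions[kid] = (
                px + r * ring * math.cos(phi),
                py + r * ring * math.sin(phi),
                pz + r * z,
            )
            # Each child becomes the center of a smaller hemisphere.
            stack.append((kid, r * shrink))
    return positions


tree = {"root": ["a", "b", "c"], "a": ["a1", "a2"]}
layout = hemisphere_layout(tree, "root")
```

The point of the sketch is the cost: node placement follows the tree traversal, so no pairwise repulsion between nodes is needed.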

Page 47: DataEngConf: The Science of Virality at BuzzFeed
Page 48: DataEngConf: The Science of Virality at BuzzFeed
Page 49: DataEngConf: The Science of Virality at BuzzFeed
Page 50: DataEngConf: The Science of Virality at BuzzFeed
Page 51: DataEngConf: The Science of Virality at BuzzFeed

A NEW IMPLEMENTATION

pip install pyh3

Page 52: DataEngConf: The Science of Virality at BuzzFeed

WITH D3

Page 53: DataEngConf: The Science of Virality at BuzzFeed

MORE APPLICATIONS

Page 54: DataEngConf: The Science of Virality at BuzzFeed

ATTRIBUTION

Instead of

Page 55: DataEngConf: The Science of Virality at BuzzFeed

CASCADE PREDICTION

Page 56: DataEngConf: The Science of Virality at BuzzFeed

GRAPH AND TEMPORAL PROPERTIES ARE IMPORTANT!

Page 57: DataEngConf: The Science of Virality at BuzzFeed

TEST THE INFLUENTIALS HYPOTHESIS

Page 58: DataEngConf: The Science of Virality at BuzzFeed

IMPROVE CONTENT TARGETING

Page 59: DataEngConf: The Science of Virality at BuzzFeed

FINDING THE CAUSES OF VIRALITY

Consider Fitting a Model:

User features, content features, context features, user-pair features

Page 60: DataEngConf: The Science of Virality at BuzzFeed

UNDER CONSTRUCTION:

Online Regression!

Real-time feature weights tell which features correlate with propagation probabilities!

Drives hypothesis-building!
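A minimal sketch of online regression, assuming a logistic model updated by stochastic gradient descent one observation at a time. The slides do not say which model is used, so the model choice, the class name, and the two toy features are assumptions.

```python
import math


class OnlineLogisticRegression:
    """Logistic regression whose weights are updated after every observation."""

    def __init__(self, n_features, learning_rate=0.1):
        self.weights = [0.0] * n_features
        self.bias = 0.0
        self.learning_rate = learning_rate

    def predict(self, features):
        """Estimated probability that this exposure leads to a propagation."""
        z = self.bias + sum(w * x for w, x in zip(self.weights, features))
        return 1.0 / (1.0 + math.exp(-z))

    def update(self, features, propagated):
        """One SGD step on the log loss for a single (features, label) pair."""
        error = self.predict(features) - propagated
        self.bias -= self.learning_rate * error
        self.weights = [
            w - self.learning_rate * error * x
            for w, x in zip(self.weights, features)
        ]


# Toy stream with two hypothetical features: [shared_by_friend, seen_on_homepage].
model = OnlineLogisticRegression(n_features=2)
for _ in range(200):
    model.update([1.0, 0.0], 1)  # friend shares propagated
    model.update([0.0, 1.0], 0)  # homepage views did not
```

Reading `model.weights` at any point in the stream shows which features currently correlate with propagation. That is correlation rather than proof of cause, which is why it drives hypothesis-building.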

Page 61: DataEngConf: The Science of Virality at BuzzFeed

THE TEAM

Page 62: DataEngConf: The Science of Virality at BuzzFeed