weigend_STATS252_ItamarRosenn_EricSun_2009.05.04.pptx - Slide 1
Data Science at Facebook
Itamar Rosenn, Eric Sun
5/4/09
Facebook Data
▪ Social Graph
▪ 200M+ active users
▪ 100M+ users come to site each day
▪ several hundred thousand new users join each day
▪ hundreds of dimensions per user (numerical, categorical, text)
▪ average user has over 120 friends
▪ friendships on Facebook span many different types of relationships
▪ Social Behavior
▪ Actions: users interact with hundreds of thousands of applications, on and off the site
▪ Interactions: users interact directly with each other via over 100 distinct types of events
▪ Social Content
▪ Photos, Status Updates, Platform Application Content, Events, Posts, Videos, Notes, etc...
Managing Data at Scale
▪ Solution: Hadoop + Hive
▪ HDFS / Hadoop (MapReduce in Java)
▪ MetaStore (metadata management)
▪ HiveQL (SQL-like query language on top of Hadoop + MetaStore)
▪ Data Scale
▪ More than 1PB raw capacity in largest HDFS / Hadoop cluster
▪ Over 2TB uncompressed data collected each day
▪ Dozens of TB worth of data read / written each day via Hadoop + Hive
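As a rough illustration of the MapReduce pattern that Hadoop runs (and that a HiveQL GROUP BY compiles down to), here is a minimal single-machine sketch in Python. The record layout, the event names, and the `events` table in the comment are hypothetical; nothing here reflects Facebook's actual jobs or schemas.

```python
from collections import defaultdict
from itertools import chain

# Roughly what a HiveQL query such as
#   SELECT event_type, COUNT(1) FROM events GROUP BY event_type
# asks for (the table and column names are made up for this sketch).

def map_phase(record):
    """Emit one (key, 1) pair per log record."""
    _user_id, event_type = record
    yield event_type, 1

def shuffle(pairs):
    """Group all emitted values by key, as the framework does between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    """Collapse the values for one key into a single count."""
    return key, sum(values)

def count_events(records):
    mapped = chain.from_iterable(map_phase(r) for r in records)
    return dict(reduce_phase(k, v) for k, v in shuffle(mapped).items())

sample_log = [(1, "wall_post"), (2, "message"), (1, "message"), (3, "photo_upload")]
event_counts = count_events(sample_log)
```

On a real cluster the map and reduce functions run in parallel across machines; the sketch only shows the data flow.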
Data Science - What We Do
Behavioral Analysis
▪ Product Health Metrics
▪ Launch Evaluations
▪ Growth Modeling
▪ User Churn Modeling
▪ Production Incentives
▪ Content Diffusion
Data-Driven Systems
▪ Ad CTR Prediction
▪ PYMK (People You May Know)
▪ Search Ranking
▪ Highlights
Data Infrastructure
▪ Hive
▪ Hadoop
Data Science – Who We Are
▪ Dennis Decoste
▪ Roddy Lindsay
▪ Alex Smith
▪ Thomas Lento
▪ Venky Iyer
▪ Ravi Grover
▪ Cameron Marlow
▪ Lee Byron
▪ Itamar Rosenn
▪ Danny Ferrante
▪ James Mayfield
Maintained Relationships on Facebook
▪ Question: is Facebook increasing the size of people’s personal networks?
▪ Task:
▪ the types of relationships people maintain on the site
▪ the relative size of these groups
Types of Relationships
People you know
▪ Facebook friends = people you’ve met at some point in life
▪ Researchers have estimated this number to be somewhere between 300 and 3,000. (Gladwell, Killworth)
Communication network
▪ Individuals with whom you communicate on a regular basis
▪ Includes your core support network, which may be as low as 3 people
▪ Kossinets and Watts observed communication network size of 10-20
Maintained relationships
▪ Social technologies like Newsfeed or RSS readers allow you to keep up with the things that people you know are doing
▪ This information consumption is a form of relationship management, as it can lead to direct communication in the future
Measuring Network Size on Facebook
Examine the relationships of a random user sample over 30 days on the site. We defined networks in 4 ways:
All friends
▪ The largest representation of a person’s network is the set of people they have verified as friends.
Reciprocal communication
▪ The number of friends with whom the user had reciprocal exchanges via messages, wall posts, or comments. This provides a measure of the user’s core network.
One-way communication
▪ The number of friends to whom the user has reached out via messages, wall posts, or comments.
Maintained relationships
▪ The number of friends whose Newsfeed stories the user has clicked on, or whose profiles the user has visited at least twice
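The four definitions above can be sketched as set operations over one user's 30-day activity. This is a minimal sketch, not the team's actual pipeline: the input structures and the function name are hypothetical, and it assumes the "at least twice" threshold applies only to profile visits, which the slide leaves ambiguous.

```python
def network_sizes(friends, contacted, contacted_by, feed_clicks, profile_visits):
    """Return the size of each of the four networks for one user.

    friends        -- set of confirmed friend ids
    contacted      -- ids the user sent a message, wall post, or comment to
    contacted_by   -- ids who sent the user a message, wall post, or comment
    feed_clicks    -- ids whose News Feed stories the user clicked on
    profile_visits -- dict mapping id -> number of profile visits by the user
    """
    one_way = friends & contacted
    reciprocal = one_way & contacted_by
    # Assumption: the "at least twice" threshold covers profile visits only.
    repeat_visits = {f for f, n in profile_visits.items() if n >= 2}
    maintained = friends & (feed_clicks | repeat_visits)
    return {
        "all_friends": len(friends),
        "reciprocal": len(reciprocal),
        "one_way": len(one_way),
        "maintained": len(maintained),
    }

sizes = network_sizes(
    friends={"a", "b", "c", "d", "e"},
    contacted={"a", "b", "x"},          # "x" is not a friend, so it is ignored
    contacted_by={"a", "c"},
    feed_clicks={"b", "d"},
    profile_visits={"a": 3, "c": 1, "e": 2},
)
```

By construction the reciprocal network is a subset of the one-way network, which is in turn a subset of all friends.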
Findings
▪ Depending on the # of friends a user has, she is passively engaging with 2 to 2.5 times as many people as those with whom she directly communicates
Systemic Effects
▪ The stark contrast between these networks shows the effect of technologies like Newsfeed.
Content Production among New Users
▪ Mission: Give people the power to share and make the world more open and connected.
▪ Question: What mechanisms lead Facebook newcomers to share content on the site?
Content Production
In new users’ first two weeks:
▪ 45% upload a photo
▪ 41% use a 3rd-party app
▪ 30% send a private message
▪ 27% compose a status update
▪ 22% write on a friend’s wall
Production Incentives Hypotheses
▪ H1: Newcomers who receive more feedback on their initial content will go on to contribute more content.
▪ H2: Newcomers whose initial content receives greater distribution will go on to produce more content.
▪ H3: Social learning: Newcomers whose friends share more content will go on to produce more content themselves.
▪ H4: Singling out: Newcomers who are singled out in content that their friends produce will go on to produce more content themselves.
Method
Quantitative
▪ Selected two cohorts: Nov. 5, 2007 (N = 347,403) and Mar. 3, 2008 (N = 254,603)
▪ Observed activity in their first two weeks
▪ Predicted how many photos they would upload between third and fifteenth week on Facebook
Qualitative
▪ 40-minute semi-structured interviews with seven new users
▪ Recorded audio/video and screen
▪ Asked about typical uses of Facebook, content production, social norms, privacy
Features
Independent Variables
H1. Feedback
▪ Comments received
H2. Distribution
▪ # of times content was viewed in Newsfeed
▪ # of friends who viewed content in Newsfeed
H3. Social Learning
▪ Number of friends’ photos seen
H4. Singling Out
▪ Number of times tagged
Controls
▪ Age
▪ Gender
▪ Number of friends
▪ Total pages viewed
▪ Initial engagement with photos:
▪ # of photos uploaded
▪ # of photos viewed
▪ Photo tags created
▪ Photo comments written
Results
Model 1 – Early Uploaders
Intercept: 1.2
Controls                                      Coefficient   % change from int.
Age (in years)                                -0.01         -1.0% ***
Male (0/1)                                     0.48         +39.3% ***
Female (0/1)                                   1.21         +131.2% ***
Pages viewed +                                 0.24         +18.4% ***
Photo pages viewed +                           2.80         +597.4% ***
Photo comments made                            0.15         +11.2% ***
Photo tags created                             0.10         +6.9% ***
Photos uploaded                                0.30         +22.8% ***
Independent Vars                              Coefficient   % change from int.
Comments received (0/1)                        0.09         +6.2% ***
Photo views received                           0.04         +2.6% ***
Photo stories seen                             0.09         +6.1% ***
Photo tags received (0/1)                      0.03         +2.1% (ns)
Model 2 - Everyone
Intercept: 1.9
Controls                                      Coefficient   % change from int.
Age (in years)                                -0.01         -0.7% ***
Male (0/1)                                     0.84         +79.6% ***
Female (0/1)                                   1.43         +169.8% ***
Pages viewed +                                -0.02         -1.6% ***
Photo pages viewed +                           2.35         +408.3% ***
Photo comments made                            0.24         +17.7% ***
Photo tags created                             0.17         +12.6% ***
Early-uploader (0/1)                           0.39         +30.6% ***
Independent Vars                              Coefficient   % change from int.
Photo stories seen × early-uploader            0.15         +10.7% ***
Photo stories seen × non-early-uploader        0.03         +2.2% ***
Photo tags received × early-uploader (0/1)    -0.05         -3.6% (ns)
Photo tags received × non-early-uploader (0/1) 0.10         +7.2% ***
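The slides do not say how the "% change from int." column was derived. As an observation only, the reported values are close to what one gets by treating each coefficient as a change on a base-2 log scale, so that a coefficient b corresponds to a multiplicative change of 2^b. The sketch below checks that reading against a few rows; the base-2 assumption and the function name are ours, and small gaps are expected because the printed coefficients are rounded.

```python
def pct_change_from_coef(coef, base=2.0):
    """Percent change implied by a coefficient on a log scale.

    Assumption: the outcome was modeled on a log scale with the given base,
    so a coefficient b multiplies the expected outcome by base ** b.
    """
    return (base ** coef - 1.0) * 100.0

# (coefficient, percent change reported on the slide)
reported = [
    (0.48, 39.3),    # Male (0/1), Model 1
    (1.21, 131.2),   # Female (0/1), Model 1
    (0.30, 22.8),    # Photos uploaded, Model 1
    (0.39, 30.6),    # Early-uploader (0/1), Model 2
]

for coef, slide_value in reported:
    implied = pct_change_from_coef(coef)
    print(f"coef={coef:+.2f}  implied={implied:+.1f}%  slide={slide_value:+.1f}%")
```

Under the more common natural-log reading (base e), a coefficient of 0.48 would imply roughly +62%, which does not match the slide, so base 2 looks like the better fit for these tables.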
Summary of Results
Hypothesis Early-uploaders Non-early-uploaders
H1. Feedback Support N/A
H2. Distribution Modest Support N/A
H3. Social learning
Support Support
H4. Singling out No Support Support
▪ We learn from our friends. If our friends engage with photos, we do too. Social learning is the main lever for content production.
▪ For new users already uploading photos feedback is associated with increased content production, and distribution is marginally important.
Modeling Contagion Through Newsfeed
▪ How do ideas spread through a social network?
▪ Use Facebook Pages to model diffusion patterns
▪ Compare results with existing models of diffusion
▪ Show how Facebook advertising campaigns may be more successful than off-Facebook advertising campaigns due to Facebook’s inter-connectedness and diffusion properties.
▪ Note: Research based on “old” Facebook (pre-March 2009)
▪ Still relevant: first empirical analysis of large-scale collisions of short chains
Theory of the Influentials
▪ Old Theory: it’s all about the “influentials” (Malcolm Gladwell, etc.)
▪ Idea: reach a tiny group of Influential people, and you’ll reach everyone else through them for free
▪ $1+ billion/year spent on word-of-mouth campaigns targeting Influentials; amount is growing 36% per year (MarketingVOX)
Contagion Theory
▪ Duncan Watts: Anyone can be an influencer.
▪ Ideas don’t spread via influentials. Instead, ideas spread like viruses: either you’re susceptible, or you’re not
▪ Success depends not on how persuasive the early adopter(s) are, but whether everyone else is easily persuaded.
How Do Ideas Spread on Facebook?
▪ News Feed allows for efficient diffusion of ideas
▪ Facebook’s Pages product is one of the most viral features of the site.
▪ People may see multiple friends fan a Page in a single Feed story, so a node in the graph can have multiple parents
Example: Chain of Length 1
▪ Alice fans a Page
▪ Bob sees Alice’s action on his News Feed; Bob fans the Page as well
▪ Charlie sees Alice’s action on his News Feed; Charlie fans the Page as well
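The Alice, Bob, and Charlie example can be represented as a directed graph in which each fan points back to the friends whose action they saw. Below is a minimal sketch, assuming chain length is counted as the number of edges on the longest downstream path (consistent with the "Chain of Length 1" label); the data structure and function names are hypothetical, and "dana" is an invented fourth user added to show a node with two parents.

```python
def build_children(fan_events):
    """Invert {follower: [actors seen in Feed]} into {actor: set(followers)}."""
    children = {}
    for follower, actors in fan_events.items():
        children.setdefault(follower, set())
        for actor in actors:
            children.setdefault(actor, set()).add(follower)
    return children

def max_chain_lengths(fan_events):
    """Longest downstream chain (in edges) started by each user.

    Fanning is time-ordered, so the graph is assumed to be acyclic.
    Recursion is fine for small examples; a large graph would need an
    iterative traversal.
    """
    children = build_children(fan_events)
    memo = {}

    def depth(node):
        if node not in memo:
            memo[node] = max((1 + depth(c) for c in children[node]), default=0)
        return memo[node]

    return {node: depth(node) for node in children}

fan_events = {
    "alice": [],                    # fanned the Page on her own
    "bob": ["alice"],               # saw Alice's action
    "charlie": ["alice"],           # saw Alice's action
    "dana": ["bob", "charlie"],     # saw both in a single Feed story
}
chain_lengths = max_chain_lengths(fan_events)
```

Without "dana", Alice's longest chain has length 1, as in the slide; with her, it grows to 2 through either Bob or Charlie.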
Large-Scale Result: Large Connected Trees of Diffusion
▪ Diffusion chain for Stripy, a cartoon popular in Bosnia (blue) & Slovenia (yellow). Croatia (green) has yet to find its connecting bridge.
Large Connected Clusters
▪ Often, the vast majority of fans can be connected into one cluster; sometimes over 90% of the fans for one particular Page can be connected.
▪ Example: On 8/21/08, 71,090 of 96,922 fans of the Nastia Liukin Page (73.3%) were in one connected cluster.
▪ For Pages created after 7/1/08, the median Page had 69.48% of its Fans in one connected cluster as of 8/19/08.
How Do These Large Clusters Come About?
• Are these large clusters started by “one guy”?
▪ No: across all Pages of meaningful size (>1000 Fans), 14.8% of the Fans in the biggest cluster were “start points.”
▪ The variability in this percentage becomes very small as # fans increases
▪ The average node in the biggest cluster is connected to 2.899 others.
• Large clusters are formed when many long chains of diffusion merge together.
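The cluster statistics quoted above (share of fans in the biggest cluster, share of "start points" within it) can be computed with a union-find pass over the diffusion edges. This is a sketch under assumptions: a "start point" is taken to be a fan with no recorded parent, clusters are connected components ignoring edge direction, and the input format and function name are hypothetical.

```python
def largest_cluster_stats(fan_events):
    """Return (share of fans in the biggest cluster, share of start points in it).

    fan_events maps each fan to the list of friends whose action they saw
    before fanning; an empty list marks a start point.
    """
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]   # path halving
            x = parent[x]
        return x

    for follower, actors in fan_events.items():
        find(follower)
        for actor in actors:
            parent[find(actor)] = find(follower)   # merge the two clusters

    clusters = {}
    for node in list(parent):
        clusters.setdefault(find(node), []).append(node)

    biggest = max(clusters.values(), key=len)
    cluster_share = len(biggest) / len(parent)
    starts = [n for n in biggest if not fan_events.get(n)]
    return cluster_share, len(starts) / len(biggest)

events = {
    "a": [], "b": ["a"], "c": ["b"],   # one chain started by "a"
    "d": [], "e": ["d", "c"],          # "e" merges the chains from "a" and "d"
    "z": [],                           # isolated fan
}
cluster_share, start_share = largest_cluster_stats(events)
```

In this toy input, two separately started chains merge at "e", which is the mechanism the slide describes: the biggest cluster has more than one start point.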
Diffusion Chains on Facebook vs. Real Life
• The connected nature of Facebook (combined with easy methods of communication) makes long diffusion chains possible.
▪ In word-of-mouth studies of information propagation, most people hear of an idea from 1 person and pass it on to 1 other person
▪ Only 38% of paths involve at least four individuals (Brown & Reingen 1987)
▪ On Facebook, 86.4% of paths of Page diffusion involve at least 4 individuals
How are Long Diffusion Chains Created?
• Goal: test whether the Influentials theory or the Contagion theory is more applicable to Facebook
▪ Attempt to predict size of diffusion chains that a particular user will create using characteristics of the user and/or the Page.
▪ If size can be predicted, we can then identify the most influential users.
Data
▪ Data consists of all the associations (actor → follower) for a representative selection of Pages.
▪ Pages were at least 40 days old and had at least 5,000 fans
Prediction Model
Response: max_chain_length
Predictors:
▪ gender
▪ log age
▪ log Facebook_age
▪ log feed_exposure (# friends who saw News Feed story)
▪ log friend_count
▪ log activity_count (wall posts + messages sent + photos added)
▪ log popularity (controls for News Feed exposure via Coefficient)
Method: zero-inflated negative binomial regression
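For readers unfamiliar with the method, a zero-inflated negative binomial model mixes a point mass at zero (users who start no chain at all) with a negative binomial count distribution. The slides do not give the parameterization, so the sketch below assumes the common NB2 form (variance = mu + alpha * mu^2) with a log link on the mean; all function and parameter names are illustrative.

```python
import math

def nb_log_pmf(k, mu, alpha):
    """Log-probability of count k under NB2 with mean mu and dispersion alpha."""
    r = 1.0 / alpha
    return (math.lgamma(k + r) - math.lgamma(r) - math.lgamma(k + 1)
            + r * math.log(r / (r + mu))
            + k * math.log(mu / (r + mu)))

def zinb_pmf(k, mu, alpha, pi_zero):
    """Probability of count k when a share pi_zero of cases are structural zeros."""
    nb = math.exp(nb_log_pmf(k, mu, alpha))
    if k == 0:
        return pi_zero + (1.0 - pi_zero) * nb
    return (1.0 - pi_zero) * nb

def count_mean(intercept, coefs, log_predictors):
    """Log-link mean of the count component: exp(b0 + sum(b_j * log x_j))."""
    return math.exp(intercept + sum(b * x for b, x in zip(coefs, log_predictors)))

def zinb_log_likelihood(counts, mus, alpha, pi_zero):
    """Sum of log-probabilities; fitting would maximize this over the parameters."""
    return sum(math.log(zinb_pmf(k, mu, alpha, pi_zero))
               for k, mu in zip(counts, mus))

# Sanity check on a toy setting: the probabilities should sum to about 1.
total_probability = sum(zinb_pmf(k, mu=2.0, alpha=0.5, pi_zero=0.3) for k in range(500))
```

In practice one would fit such a model with a statistics package rather than by hand; the point here is only to show what the likelihood looks like and where the log-transformed predictors enter.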
Results
• The only consistent coefficient is on feed_exposure (# friends who saw News Feed story).
▪ Coefficient hovers around 1: if News Feed publishes a user’s action to 1% more people, we expect a 1% longer max_chain_length
• Implies that friend_count is not realistically meaningful.
▪ After controlling for distribution and popularity, neither demographic characteristics nor number of Facebook friends seems to play an important role in the prediction of maximum diffusion chain length.
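The "1% more people, 1% longer chain" reading follows from having a log-transformed predictor in a model with a log link: the expected response then scales as exposure raised to the coefficient. A small worked sketch, assuming that log-log form (the function name is ours):

```python
def expected_ratio(coef, pct_increase):
    """Multiplicative change in the expected response when a log-transformed
    predictor rises by pct_increase percent, given its coefficient."""
    return (1.0 + pct_increase / 100.0) ** coef

# With a coefficient of 1, 1% more exposure gives a 1% longer expected chain.
ratio_at_one = expected_ratio(1.0, 1.0)     # 1.01

# For comparison, a hypothetical coefficient of 0.5 would give about 0.5%.
ratio_at_half = expected_ratio(0.5, 1.0)    # about 1.005
```

A coefficient near 1 therefore means expected chain length grows roughly in proportion to News Feed exposure.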
Conclusions
• Facebook News Feed enables long-lasting chains of diffusion that may reach many more people than real-life diffusion chains.
• The Facebook network is very connected: ideas with good receptiveness will attract wide, long connected clusters.
• Long chains are not a function of Facebook age, activity, users’ demographics, or even # of friends: they are related only to exposure.
Contact
www.facebook.com/data
(c) 2007 Facebook, Inc. or its licensors. "Facebook" is a registered trademark of Facebook, Inc.. All rights reserved. 1.0