Xin Fu, Carl Steinbach
Hadoop Summit Tokyo, October 26, 2016
Path to 400M* Members: LinkedIn’s Data Powered Journey
* As of Q2 2016, LinkedIn had 450M members world wide
[Figure: timeline marked 2004, 2009, 2011, 2012, 2015]
Real Time Visualization of New Sign-ups
What Does “Data-Driven” Mean at LinkedIn?
Monitoring & Learning
What is This Phase Comprised of?
● Dashboards
● Reports
● Trend explanation
○ Short term fluctuation: investigation
○ Long term trend: strategic analysis
Past Challenges
Reliability
● Easily broken without operational support; huge time spent on maintenance
Diverse technology
● Self-maintained pipelines
● Various UIs with different visualization capabilities
● Redundant computation
Standardized Reporting Tool
● Reduces dependency on third-party BI tools
● Closer integration with LinkedIn’s ecosystem of experimentation and anomaly detection solutions
Towards Real Time Monitoring
Sign-up dimensions: Country, Platform, Language, Browser, Signup Type, OS
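As a rough illustration of monitoring sign-ups along the dimensions listed above, the sketch below buckets events by minute and by one dimension at a time. The event fields, sample values, and the `count_by_minute` helper are assumptions made for this example, not LinkedIn's actual schema or pipeline.

```python
from collections import Counter

# Hypothetical sign-up events; "time" is milliseconds since the epoch.
SIGNUPS = [
    {"time": 0, "country": "JP", "platform": "mobile", "os": "iOS"},
    {"time": 30_000, "country": "JP", "platform": "desktop", "os": "Windows"},
    {"time": 61_000, "country": "US", "platform": "mobile", "os": "Android"},
]


def count_by_minute(events, dimension):
    """Count events per (minute bucket, dimension value) pair."""
    return Counter((e["time"] // 60_000, e[dimension]) for e in events)


by_country = count_by_minute(SIGNUPS, "country")
by_platform = count_by_minute(SIGNUPS, "platform")
print(by_country)
print(by_platform)
```

A production system would compute these counts incrementally over a stream rather than over an in-memory list; the grouping key is the part that carries over.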
Experimentation & Analysis
What is This Phase Comprised of?
● Experiment design
● Experiment analysis to inform ramp decisions
● Learning from multiple experiments to identify what works and what doesn’t work
Past Challenges
Experiment design
● Interaction between experiments
Experiment analysis and ramp decision
● Manual analysis, extended time-to-decision
● Ramp decisions based on localized metrics
● Reruns sometimes needed due to undetected errors in setup
Worst of all, some ramps happened without A/B testing
● e.g. infrastructural changes
Experimentation Platform @ LinkedIn
● Company-wide platform for A/B testing, ramping, and advanced targeting needs
● Automated reporting and analysis capabilities
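To make "automated analysis" concrete, here is a minimal sketch of one common way to compare a conversion metric between control and treatment: a pooled two-proportion z-test. The slides do not say which statistical methods the platform uses, so the test choice, the function name, and the sample counts are all assumptions.

```python
import math


def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Return (z, two-sided p-value) for the difference in conversion rates.

    conv_a / n_a: conversions and sample size in control.
    conv_b / n_b: conversions and sample size in treatment.
    """
    p_a = conv_a / n_a
    p_b = conv_b / n_b
    # Pooled rate under the null hypothesis that both arms convert equally.
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    # Two-sided p-value from the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


# Hypothetical counts: 10,000 members per arm.
z, p_value = two_proportion_z_test(1000, 10_000, 1100, 10_000)
print(f"z = {z:.2f}, p = {p_value:.4f}")
```

Running this check automatically for every metric on a scorecard is what removes the manual analysis step; a real platform would also need to account for multiple comparisons and interactions between experiments.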
Tiering of Metrics
Metrics at different tiers have:
● Different review processes
● Different levels of visibility in dashboards and experiment scorecards
● Different computation priorities and SLAs in data pipelines
● Different life cycles
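One way to picture metric tiering is as a small configuration table that pipelines and dashboards consult. The sketch below is purely illustrative: the tier names, review labels, SLA hours, and metric names are invented for the example and do not come from the talk.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MetricTier:
    name: str
    review: str          # hypothetical review process for definition changes
    on_scorecard: bool   # shown on experiment scorecards by default
    sla_hours: int       # hypothetical pipeline freshness target


# Hypothetical tier definitions; lower numbers mean higher priority.
TIERS = {
    1: MetricTier("tier-1", "central review", True, 6),
    2: MetricTier("tier-2", "team review", True, 24),
    3: MetricTier("tier-3", "self-serve", False, 72),
}


def computation_order(metric_to_tier):
    """Order metrics so that higher-priority tiers are computed first."""
    return sorted(metric_to_tier, key=lambda m: (metric_to_tier[m], m))


order = computation_order({"ad_clicks": 3, "signups": 1, "profile_views": 2})
print(order)
```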
Backend Infrastructure for Tracking & Instrumentation
InvitationClickEvent()
Scale fact: ~1000 tracking event types, ~20TB per day, hundreds of metrics & data products
Tracking Data Records User Activity
Tracking Data Lifecycle and Teams
Product teams: PMs, Developers, TestEng
Infra teams: Hadoop, Kafka, DWH, ...
Data teams: Analytics, Relevance Engineers, ...
Example: How Do We Track a Profile View?
PageViewEvent Record 1:
{
  "header" : {
    "memberId" : 12345,
    "time" : 1454745292951,
    "appName" : {"string" : "LinkedIn"},
    "pageKey" : "profile_page"
  },
  "trackingInfo" : {"vieweeID" : "23456", ...}
}
-- Load raw page-view events, then keep only views of the profile page.
pageViews = LOAD '/data/tracking/PageViewEvent';
profileViews = FILTER pageViews BY header.pageKey == 'profile_page';
Example: How Do We Track a Profile View?
PageViewEvent Record 1:
{
  "header" : {
    "memberId" : 12345,
    "time" : 1454745292951,
    "appName" : {"string" : "LinkedIn"},
    "pageKey" : "new_profile_page"
  },
  "trackingInfo" : {"vieweeID" : "23456", ...}
}
-- Every consumer must now match both the old and the new page key.
pageViews = LOAD '/data/tracking/PageViewEvent';
profileViews = FILTER pageViews BY header.pageKey == 'profile_page' OR header.pageKey == 'new_profile_page';
At Some Point It Becomes Unmaintainable ...
How Do We Handle Old and New?
Producers Consumers
DALI: A Data Access Layer for LinkedIn
Abstract away underlying physical details to allow users to focus solely on the logical concerns
Logical Tables + Views
Logical FileSystem
We had been working on something that could help...
Data Catalog + Discovery (DALI)
DaliFileSystem Client
Data Source (HDFS)
Data Sink (HDFS)
Processing Engine (MapReduce, Spark, Presto)
DALI Datasets (Tables + Views)
Query Layers (Hive, Pig, Spark)
View Defs + UDFs (Artifactory, Git)
Dataflow APIs (MR, Spark, Scalding)
DALI CLI
DALI: Implementation Details in Context
Solving with DALI Views
Producers Consumers
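The idea behind a view can be sketched in a few lines: the knowledge of which page keys count as a profile view lives in one place, and consumers read a stable logical record instead of filtering raw events themselves. This is a conceptual illustration only, not DALI's API; `PROFILE_PAGE_KEYS`, `profile_view`, and the output field names are hypothetical.

```python
# Hypothetical set of page keys, maintained in one place by the view owner.
PROFILE_PAGE_KEYS = {"profile_page", "new_profile_page"}


def profile_view(page_view_events):
    """Logical 'profile view' records derived from raw PageViewEvent records."""
    for event in page_view_events:
        header = event["header"]
        if header["pageKey"] in PROFILE_PAGE_KEYS:
            yield {
                "viewerId": header["memberId"],
                "vieweeId": event["trackingInfo"]["vieweeID"],
                "time": header["time"],
            }


# Sample raw events modeled on the records shown earlier in the deck.
RAW = [
    {"header": {"memberId": 12345, "time": 1454745292951, "pageKey": "profile_page"},
     "trackingInfo": {"vieweeID": "23456"}},
    {"header": {"memberId": 12345, "time": 1454745292999, "pageKey": "new_profile_page"},
     "trackingInfo": {"vieweeID": "23456"}},
    {"header": {"memberId": 12345, "time": 1454745293100, "pageKey": "feed_page"},
     "trackingInfo": {}},
]

views = list(profile_view(RAW))
print(len(views))
```

When a producer renames the page key again, only `PROFILE_PAGE_KEYS` changes; consumer code that reads `profile_view` stays as it is.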
State of the World Today with DALI
~100 producer views
~200 consumer views
~80 unique tracking event data sources
What’s next?
● Views on streaming data
● Selective materialization and caching
● Open source
At the Core of “Data-Driven” is ....
Used to be Tug of War Between Speed and Quality
Before We Learned that Technology Could Break the Dichotomy Between Speed and Quality
Cultural Aspects: Partnership Between Data Scientists and Engineers
Interesting Challenges
- Metric trade-offs, e.g. engagement vs. monetization
- Real-time everything?
- A/B testing in a social network
- Human judges for personalized search
- Value of an action
It Took a Village
Thanks to all the Data Scientists, Engineers and Product partners at LinkedIn for being part of this great journey!
https://engineering.linkedin.com/data