Path to 400M Members: LinkedIn’s Data Powered Journey

Xin Fu, Carl Steinbach Hadoop Summit Tokyo, October 26, 2016 Path to 400M* Members: LinkedIn’s Data Powered Journey * As of Q2 2016, LinkedIn had 450M members world wide
  • date post

    07-Jan-2017
  • Category

    Technology


Transcript of Path to 400M Members: LinkedIn’s Data Powered Journey

  • Xin Fu, Carl Steinbach

    Hadoop Summit Tokyo, October 26, 2016

    Path to 400M* Members: LinkedIn’s Data Powered Journey

    * As of Q2 2016, LinkedIn had 450M members world wide

  • 2

    [Timeline figure with year labels: 2004, 2009, 2011, 2012, 2015]

  • 3

    Real Time Visualization of New Sign-ups

  • What Does Data-Driven Mean at LinkedIn?

    4

  • What Does Data-Driven Mean at LinkedIn?

    5

  • Monitoring & Learning

    6

  • What is This Phase Comprised of?

    7

    - Dashboards, reports
    - Trend explanation
      - Short-term fluctuation: investigation
      - Long-term trend: strategic analysis

  • Past Challenges

    8

    Reliability
    - Easily broken without operational support; huge time spent in maintenance

    Diverse technology
    - Self-maintained pipelines
    - Various UIs with different visualization capabilities
    - Redundant computation

  • Standardized Reporting Tool

    9

    - Reduces dependency on 3rd party BI tools
    - Closer integration with LinkedIn’s ecosystem of experimentation and anomaly detection solutions

  • Towards Real Time Monitoring

    10

    [Figure: sign-ups broken down by Country, Platform, Language, Browser, Signup Type, and OS]
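As a rough illustration of slicing sign-ups by the dimensions named on this slide, the Python sketch below counts sign-up events per value of one dimension. The event layout and all sample values are invented for illustration; they are not LinkedIn's actual schema or monitoring code.

```python
from collections import Counter

# Hypothetical sign-up events. The keys mirror the dimensions on the slide;
# the values are made up for illustration.
SIGNUPS = [
    {"country": "JP", "platform": "mobile", "language": "ja",
     "browser": "Safari", "signup_type": "organic", "os": "iOS"},
    {"country": "JP", "platform": "desktop", "language": "ja",
     "browser": "Chrome", "signup_type": "invitation", "os": "Windows"},
    {"country": "US", "platform": "mobile", "language": "en",
     "browser": "Chrome", "signup_type": "organic", "os": "Android"},
]


def count_by(events, dimension):
    """Count sign-up events per distinct value of one dimension."""
    return Counter(event[dimension] for event in events)


if __name__ == "__main__":
    for dim in ("country", "platform", "signup_type"):
        print(dim, dict(count_by(SIGNUPS, dim)))
```

A real-time version would presumably apply the same grouping over a stream of events within a time window rather than over an in-memory list.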

  • Experimentation & Analysis

    11

  • What is This Phase Comprised of?

    12

    - Experiment design
    - Experiment analysis to inform ramp decisions
    - Learning from multiple experiments to identify what works and what doesn’t work

  • Past Challenges

    13

    Experiment design
    - Interaction between experiments

    Experiment analysis and ramp decision
    - Manual analysis, extended time-to-decision
    - Ramp decisions based on localized metrics
    - Reruns needed sometimes due to undetected errors in setup

    Worst of all, some ramps happened without A/B testing, e.g. infrastructural changes

  • Experimentation Platform @ LinkedIn

    14

    Company-wide platform for A/B testing, ramping, and advanced targeting needs

    Automated reporting and analysis capabilities
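To make "automated analysis" concrete, here is a minimal sketch of one common way to compare a control and a treatment group in an A/B test: a two-proportion z-test on conversion counts. This is a textbook technique chosen for illustration; the slides do not say which statistical methods the platform uses, and the function name and sample numbers are assumptions.

```python
import math


def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Return (z, two-sided p-value) for the difference in conversion
    rate between variant B and variant A, using a pooled standard error."""
    rate_a = conv_a / n_a
    rate_b = conv_b / n_b
    pooled = (conv_a + conv_b) / (n_a + n_b)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (rate_b - rate_a) / std_err
    # Two-sided p-value under the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


if __name__ == "__main__":
    # Hypothetical counts: 1000/10000 conversions in control,
    # 1100/10000 in treatment.
    z, p = two_proportion_z_test(1000, 10_000, 1100, 10_000)
    print(f"z = {z:.2f}, p = {p:.3f}")
```

The sketch assumes both groups are non-empty and that the pooled rate is strictly between 0 and 1; otherwise the standard error is zero and the division fails.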

  • Tiering of Metrics

    15

    Metrics at different tiers have:
    - Different review processes
    - Different levels of visibility in dashboards and experiment scorecards
    - Different computation priorities and SLAs in data pipelines
    - Different life cycles
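One way to picture metric tiering is as a small policy table that other systems consult. The sketch below is purely hypothetical: the tier numbers, review labels, SLA hours, and metric names are invented, and only the idea that tiers differ in review, visibility, and computation priority comes from the slide.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TierPolicy:
    review_process: str      # how a metric definition gets approved
    on_scorecards: bool      # shown by default in experiment scorecards?
    pipeline_sla_hours: int  # how quickly the pipeline must deliver it


# Hypothetical policies; a lower tier number means higher priority.
TIER_POLICIES = {
    1: TierPolicy("formal review", True, 6),
    2: TierPolicy("team review", True, 24),
    3: TierPolicy("self-service", False, 72),
}

# Hypothetical metrics mapped to tiers.
METRIC_TIERS = {"signups": 1, "profile_views": 2, "adhoc_clicks": 3, "sessions": 1}


def computation_order(metric_tiers):
    """Return metric names ordered by tier, then by name, so that
    higher-priority tiers are computed first."""
    return sorted(metric_tiers, key=lambda name: (metric_tiers[name], name))


if __name__ == "__main__":
    for name in computation_order(METRIC_TIERS):
        policy = TIER_POLICIES[METRIC_TIERS[name]]
        print(name, policy.pipeline_sla_hours)
```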

  • Backend Infrastructure for Tracking & Instrumentation

    16

  • 17

    InvitationClickEvent()

    Scale fact: ~1000 tracking event types, ~20TB per day, hundreds of metrics & data products

    Tracking Data Records User Activity
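For a sense of scale, the quoted ~20TB per day can be converted to an average ingest rate. The calculation below assumes decimal terabytes (10^12 bytes) and a volume spread evenly over the day; neither assumption comes from the slides, and real traffic would have peaks well above the average.

```python
# Assumptions: 1 TB = 10**12 bytes, volume spread evenly across the day.
BYTES_PER_DAY = 20 * 10**12
SECONDS_PER_DAY = 24 * 60 * 60

avg_bytes_per_second = BYTES_PER_DAY / SECONDS_PER_DAY
avg_mb_per_second = avg_bytes_per_second / 10**6

print(f"{avg_mb_per_second:.1f} MB/s")  # prints "231.5 MB/s"
```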

  • Tracking Data Lifecycle and Teams

    18

    Product teams: PMs, Developers, TestEng

    Infra teams: Hadoop, Kafka, DWH, ...

    Data teams: Analytics, Relevance Engineers,...

  • Example: How Do We Track a Profile View?

    19

    PageViewEvent Record 1:
    {
      "header" : {
        "memberId" : 12345,
        "time" : 1454745292951,
        "appName" : {"string" : "LinkedIn"},
        "pageKey" : "profile_page"
      },
      "trackingInfo" : {
        "vieweeID" : "23456",
        ...
      }
    }

    pageViews = LOAD '/data/tracking/PageViewEvent';
    profileViews = FILTER pageViews BY header.pageKey == 'profile_page';

  • Example: How Do We Track a Profile View?

    20

    PageViewEvent Record 1:
    {
      "header" : {
        "memberId" : 12345,
        "time" : 1454745292951,
        "appName" : {"string" : "LinkedIn"},
        "pageKey" : "new_profile_page"
      },
      "trackingInfo" : {
        "vieweeID" : "23456",
        ...
      }
    }

    pageViews = LOAD '/data/tracking/PageViewEvent';
    profileViews = FILTER pageViews BY header.pageKey == 'profile_page'
                                    OR header.pageKey == 'new_profile_page';
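The two profile-view slides show the consumer-side filter growing each time a producer introduces a new page key. The Python sketch below reproduces that pattern on a few in-memory records. The record layout follows the slide's example; the third record and its "feed_page" key are invented to give the filter something to exclude.

```python
# Page keys that every consumer has to know about. This set must be
# updated in every consumer script whenever a producer adds a new key.
PROFILE_PAGE_KEYS = {"profile_page", "new_profile_page"}

# Sample records shaped like the PageViewEvent on the slide.
PAGE_VIEWS = [
    {"header": {"memberId": 12345, "pageKey": "profile_page"},
     "trackingInfo": {"vieweeID": "23456"}},
    {"header": {"memberId": 12346, "pageKey": "new_profile_page"},
     "trackingInfo": {"vieweeID": "23457"}},
    # Hypothetical non-profile page view.
    {"header": {"memberId": 12347, "pageKey": "feed_page"},
     "trackingInfo": {}},
]


def profile_views(events):
    """Keep only the page views whose page key denotes a profile page."""
    return [e for e in events if e["header"]["pageKey"] in PROFILE_PAGE_KEYS]


if __name__ == "__main__":
    print(len(profile_views(PAGE_VIEWS)))  # prints 2
```

If one consumer's copy of the key set is not updated after a producer change, its profile-view counts silently drop, which is the maintenance problem the deck turns to next.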

  • At Some Point It Becomes Unmaintainable ...

    21

  • How Do We Handle Old and New?

    22

    Producers Consumers

  • DALI: A Data Access Layer for LinkedIn

    Abstract away underlying physical details to allow users to focus solely on the logical concerns

    Logical Tables + Views

    Logical FileSystem

    We had been working on something that could help...

  • 24

    Data Catalog + Discovery

    (DALI)

    DaliFileSystem Client

    Data Source (HDFS)

    Data Sink (HDFS)

    Processing Engine (MapReduce, Spark, Presto)

    DALI Datasets (Tables + Views)

    Query Layers (Hive, Pig, Spark)

    View Defs + UDFs (Artifactory, Git)

    Dataflow APIs (MR, Spark, Scalding)

    DALI CLI

    DALI: Implementation Details in Context

  • Solving with DALI Views

    Producers Consumers
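The idea of a view sitting between producers and consumers can be sketched in plain Python: the producer team owns one function that maps raw tracking events to a stable logical record, and consumers read only that record. This is a conceptual illustration, not DALI's API; `ProfileView`, `profile_view_records`, and the sample events are all hypothetical names and data.

```python
from typing import Iterable, Iterator, NamedTuple


class ProfileView(NamedTuple):
    """Stable logical record that consumers depend on."""
    viewer_id: int
    viewee_id: str
    time: int


# Physical detail owned by the producer: which page keys mean "profile view".
_PROFILE_PAGE_KEYS = frozenset({"profile_page", "new_profile_page"})


def profile_view_records(raw_events: Iterable[dict]) -> Iterator[ProfileView]:
    """View definition: translate raw page view events into ProfileView rows.
    A new page key is added here once, instead of in every consumer."""
    for event in raw_events:
        header = event["header"]
        if header["pageKey"] in _PROFILE_PAGE_KEYS:
            yield ProfileView(
                viewer_id=header["memberId"],
                viewee_id=event["trackingInfo"]["vieweeID"],
                time=header["time"],
            )


# Hypothetical raw events shaped like the PageViewEvent example.
RAW_EVENTS = [
    {"header": {"memberId": 12345, "time": 1454745292951,
                "pageKey": "profile_page"},
     "trackingInfo": {"vieweeID": "23456"}},
    {"header": {"memberId": 12346, "time": 1454745293000,
                "pageKey": "new_profile_page"},
     "trackingInfo": {"vieweeID": "23457"}},
    {"header": {"memberId": 12347, "time": 1454745294000,
                "pageKey": "feed_page"},
     "trackingInfo": {}},
]

if __name__ == "__main__":
    for row in profile_view_records(RAW_EVENTS):
        print(row)
```

Consumers written against `ProfileView` keep working when the producer renames or adds page keys, because only the view definition changes.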

  • State of the World Today with Dali

    - ~100 producer views
    - ~200 consumer views
    - ~80 unique tracking event data sources

    What’s next?
    - Views on streaming data
    - Selective materialization and caching
    - Open source

  • At the Core of Data-Driven is ....

    27

  • 28

    Used to be Tug of War Between Speed and Quality

  • 29

    Before We Learned that Technology Could Break the Dichotomy Between Speed and Quality

  • 30

    Cultural Aspects: Partnership Between Data Scientists and Engineers

  • Interesting Challenges

    - Metric trade-off, e.g. engagement vs. monetization
    - Real-time everything?
    - A/B test in a social network
    - Human judge for personalized search
    - Value of an action

    31

  • It Took a Village

    32

    Thanks to all the Data Scientists, Engineers and Product partners at LinkedIn for being part of this great journey!

    https://engineering.linkedin.com/data