Amundsen: A Data Discovery Platform from Lyft Council... · 2019-04-17 · Data Modelers Analysts...

71
April 17th 2019 Jin Hyuk Chang | @jinhyukchang | Engineer, Lyft Tao Feng | @feng-tao | Engineer, Lyft Amundsen: A Data Discovery Platform from Lyft

Transcript of Amundsen: A Data Discovery Platform from Lyft Council... · 2019-04-17 · Data Modelers Analysts...

Page 1: Amundsen: A Data Discovery Platform from Lyft Council... · 2019-04-17 · Data Modelers Analysts Data Scientists General Managers Data Platform Product Engineers Experimenters Managers.

April 17th 2019Jin Hyuk Chang | @jinhyukchang | Engineer, LyftTao Feng | @feng-tao | Engineer, Lyft

Amundsen: A Data Discovery Platform from Lyft

Page 2: Amundsen: A Data Discovery Platform from Lyft Council... · 2019-04-17 · Data Modelers Analysts Data Scientists General Managers Data Platform Product Engineers Experimenters Managers.

Agenda

• Data at Lyft

• Challenges with Data Discovery

• Data Discovery at Lyft

• Demo

• Architecture

• Summary

2

Page 3: Amundsen: A Data Discovery Platform from Lyft Council... · 2019-04-17 · Data Modelers Analysts Data Scientists General Managers Data Platform Product Engineers Experimenters Managers.

Data platform users

3

Data Modelers Analysts Data Scientists GeneralManagers

Data Platform

Engineers ExperimentersProductManagers

Page 4: Amundsen: A Data Discovery Platform from Lyft Council... · 2019-04-17 · Data Modelers Analysts Data Scientists General Managers Data Platform Product Engineers Experimenters Managers.

4

Core Infra high level architecture

Custom apps

Page 5: Amundsen: A Data Discovery Platform from Lyft Council... · 2019-04-17 · Data Modelers Analysts Data Scientists General Managers Data Platform Product Engineers Experimenters Managers.

Data Discovery

5

Page 6: Amundsen: A Data Discovery Platform from Lyft Council... · 2019-04-17 · Data Modelers Analysts Data Scientists General Managers Data Platform Product Engineers Experimenters Managers.

• My first project is to analyze and predict Data council Attendance

• Where is the data?

• What does it mean?

Hi! I am a n00b Data Scientist!

6

Page 7: Amundsen: A Data Discovery Platform from Lyft Council... · 2019-04-17 · Data Modelers Analysts Data Scientists General Managers Data Platform Product Engineers Experimenters Managers.

• Option 1: Phone a friend!

• Option 2: Github search

Status quo

7

Page 8: Amundsen: A Data Discovery Platform from Lyft Council... · 2019-04-17 · Data Modelers Analysts Data Scientists General Managers Data Platform Product Engineers Experimenters Managers.

• What does this field mean?

‒ Does attendance data include employees?

‒ Does it include revenue?

• Let me dig in and understand

Understand the context

8

Page 9: Amundsen: A Data Discovery Platform from Lyft Council... · 2019-04-17 · Data Modelers Analysts Data Scientists General Managers Data Platform Product Engineers Experimenters Managers.

Explore

SELECT

*

FROM

default.my_table

WHERE ds=’2018-01-01’

LIMIT 100;

Page 10: Amundsen: A Data Discovery Platform from Lyft Council... · 2019-04-17 · Data Modelers Analysts Data Scientists General Managers Data Platform Product Engineers Experimenters Managers.

Exploring with SELECT * is EVIL

1. Lack of productivity for data scientists

2. Increased load on the databases

10

Page 11: Amundsen: A Data Discovery Platform from Lyft Council... · 2019-04-17 · Data Modelers Analysts Data Scientists General Managers Data Platform Product Engineers Experimenters Managers.

Data Scientists spend upto 1/3rd time in Data Discovery...

11

• Data discovery

‒ Lack of

understanding of

what data exists,

where, who owns it,

who uses it, and how

to request access.

Page 12: Amundsen: A Data Discovery Platform from Lyft Council... · 2019-04-17 · Data Modelers Analysts Data Scientists General Managers Data Platform Product Engineers Experimenters Managers.

Audience for data discovery

12

Page 13: Amundsen: A Data Discovery Platform from Lyft Council... · 2019-04-17 · Data Modelers Analysts Data Scientists General Managers Data Platform Product Engineers Experimenters Managers.

Data Discovery - User personas

13

Data Modelers Analysts Data Scientists GeneralManagers

Data Platform

Engineers ExperimentersProductManagers

Page 14: Amundsen: A Data Discovery Platform from Lyft Council... · 2019-04-17 · Data Modelers Analysts Data Scientists General Managers Data Platform Product Engineers Experimenters Managers.

3 Data Scientist personas

Power user

● All info in their head● Get interrupted a lot

due to questions

● Lost● Ask “power users” a

lot of questions

● Dependencies landing on time

● Communicating with stakeholders

Noob user Manager

Page 15: Amundsen: A Data Discovery Platform from Lyft Council... · 2019-04-17 · Data Modelers Analysts Data Scientists General Managers Data Platform Product Engineers Experimenters Managers.

Search based Lineage based Network based

Where is the table/dashboard for X?What does it contain?

I am changing a data model, who are the owner and most common users?

I want to follow a power user in my team.

Does this analysis already exist?

This table’s delivery was delayed today, I want to notify everyone downstream.

I want to bookmark tables of interest and get a feed of data delay, schema change, incidents.

Data Discovery answers 3 kinds of questions

Page 16: Amundsen: A Data Discovery Platform from Lyft Council... · 2019-04-17 · Data Modelers Analysts Data Scientists General Managers Data Platform Product Engineers Experimenters Managers.

Meet Amundsen

16

First person to discover the South Pole -Norwegian explorer, Roald Amundsen

Page 17: Amundsen: A Data Discovery Platform from Lyft Council... · 2019-04-17 · Data Modelers Analysts Data Scientists General Managers Data Platform Product Engineers Experimenters Managers.

Landing page optimized for search

Page 18: Amundsen: A Data Discovery Platform from Lyft Council... · 2019-04-17 · Data Modelers Analysts Data Scientists General Managers Data Platform Product Engineers Experimenters Managers.

Search results ranked on relevance and query activity

Page 19: Amundsen: A Data Discovery Platform from Lyft Council... · 2019-04-17 · Data Modelers Analysts Data Scientists General Managers Data Platform Product Engineers Experimenters Managers.

How does search work?

19

Page 20: Amundsen: A Data Discovery Platform from Lyft Council... · 2019-04-17 · Data Modelers Analysts Data Scientists General Managers Data Platform Product Engineers Experimenters Managers.

Relevance - search for “apple” on Google

20

Low relevance High relevance

Page 21: Amundsen: A Data Discovery Platform from Lyft Council... · 2019-04-17 · Data Modelers Analysts Data Scientists General Managers Data Platform Product Engineers Experimenters Managers.

Popularity - search for “apple” on Google

21

Low popularity High popularity

Page 22: Amundsen: A Data Discovery Platform from Lyft Council... · 2019-04-17 · Data Modelers Analysts Data Scientists General Managers Data Platform Product Engineers Experimenters Managers.

Striking the balance

22

Relevance Popularity

● Names, Descriptions, Tags, [owners, frequent users]

● Querying activity● Dashboarding● Different weights for automated vs adhoc

querying

Page 23: Amundsen: A Data Discovery Platform from Lyft Council... · 2019-04-17 · Data Modelers Analysts Data Scientists General Managers Data Platform Product Engineers Experimenters Managers.

Back to mocks...

23

Page 24: Amundsen: A Data Discovery Platform from Lyft Council... · 2019-04-17 · Data Modelers Analysts Data Scientists General Managers Data Platform Product Engineers Experimenters Managers.

Search results ranked on relevance and query activity

Page 25: Amundsen: A Data Discovery Platform from Lyft Council... · 2019-04-17 · Data Modelers Analysts Data Scientists General Managers Data Platform Product Engineers Experimenters Managers.

Detailed description and metadata about data resources

Page 26: Amundsen: A Data Discovery Platform from Lyft Council... · 2019-04-17 · Data Modelers Analysts Data Scientists General Managers Data Platform Product Engineers Experimenters Managers.

Data Preview within the tool

Page 27: Amundsen: A Data Discovery Platform from Lyft Council... · 2019-04-17 · Data Modelers Analysts Data Scientists General Managers Data Platform Product Engineers Experimenters Managers.

Computed stats about column metadata

Disclaimer: these stats are arbitrary.

Page 28: Amundsen: A Data Discovery Platform from Lyft Council... · 2019-04-17 · Data Modelers Analysts Data Scientists General Managers Data Platform Product Engineers Experimenters Managers.

Built-in user feedback

Page 29: Amundsen: A Data Discovery Platform from Lyft Council... · 2019-04-17 · Data Modelers Analysts Data Scientists General Managers Data Platform Product Engineers Experimenters Managers.

Demo

29

Page 30: Amundsen: A Data Discovery Platform from Lyft Council... · 2019-04-17 · Data Modelers Analysts Data Scientists General Managers Data Platform Product Engineers Experimenters Managers.

Open source in mind

• Pluggable code to each micro-services via Python entry point, etc

• Pluggable API endpoint via Blueprint

• Build your ingestion pipeline like a Lego brick

Page 31: Amundsen: A Data Discovery Platform from Lyft Council... · 2019-04-17 · Data Modelers Analysts Data Scientists General Managers Data Platform Product Engineers Experimenters Managers.

Amundsen’s architecture

31

Page 32: Amundsen: A Data Discovery Platform from Lyft Council... · 2019-04-17 · Data Modelers Analysts Data Scientists General Managers Data Platform Product Engineers Experimenters Managers.

32

Postgres Hive Redshift ... PrestoGithubSource

File

Databuilder Crawler

Neo4j ElasticSearch

Metadata Service Search Service

Frontend ServiceML FeatureService

SecurityService

Other Microservices

Metadata Sources

Page 33: Amundsen: A Data Discovery Platform from Lyft Council... · 2019-04-17 · Data Modelers Analysts Data Scientists General Managers Data Platform Product Engineers Experimenters Managers.

1. Frontend Service

33

Page 34: Amundsen: A Data Discovery Platform from Lyft Council... · 2019-04-17 · Data Modelers Analysts Data Scientists General Managers Data Platform Product Engineers Experimenters Managers.

34

Postgres Hive Redshift ... PrestoGithubSource

File

Databuilder Crawler

Neo4j ElasticSearch

Metadata Service Search Service

Frontend ServiceML FeatureService

SecurityService

Other Microservices

Metadata Sources

Page 35: Amundsen: A Data Discovery Platform from Lyft Council... · 2019-04-17 · Data Modelers Analysts Data Scientists General Managers Data Platform Product Engineers Experimenters Managers.

Amundsen table detail page

Page 36: Amundsen: A Data Discovery Platform from Lyft Council... · 2019-04-17 · Data Modelers Analysts Data Scientists General Managers Data Platform Product Engineers Experimenters Managers.

2. Metadata Service

36

Page 37: Amundsen: A Data Discovery Platform from Lyft Council... · 2019-04-17 · Data Modelers Analysts Data Scientists General Managers Data Platform Product Engineers Experimenters Managers.

37

Postgres Hive Redshift ... PrestoGithubSource

File

Databuilder Crawler

Neo4j ElasticSearch

Metadata Service Search Service

Frontend ServiceML FeatureService

SecurityService

Other Microservices

Metadata Sources

Page 38: Amundsen: A Data Discovery Platform from Lyft Council... · 2019-04-17 · Data Modelers Analysts Data Scientists General Managers Data Platform Product Engineers Experimenters Managers.

38

2. Metadata Service

• A thin proxy layer to interact with graph database‒ Currently Neo4j is the default option for graph backend engine‒ Work with the community to support Apache Atlas

• Support Rest API for other services pushing / pulling metadata directly

Page 39: Amundsen: A Data Discovery Platform from Lyft Council... · 2019-04-17 · Data Modelers Analysts Data Scientists General Managers Data Platform Product Engineers Experimenters Managers.

Trade Off #1Why choose Graph database

39

Page 40: Amundsen: A Data Discovery Platform from Lyft Council... · 2019-04-17 · Data Modelers Analysts Data Scientists General Managers Data Platform Product Engineers Experimenters Managers.

Why Graph database?

Page 41: Amundsen: A Data Discovery Platform from Lyft Council... · 2019-04-17 · Data Modelers Analysts Data Scientists General Managers Data Platform Product Engineers Experimenters Managers.

Why Graph database?

Page 42: Amundsen: A Data Discovery Platform from Lyft Council... · 2019-04-17 · Data Modelers Analysts Data Scientists General Managers Data Platform Product Engineers Experimenters Managers.

Trade Off #2Why not propagate the metadata back to source

42

Page 43: Amundsen: A Data Discovery Platform from Lyft Council... · 2019-04-17 · Data Modelers Analysts Data Scientists General Managers Data Platform Product Engineers Experimenters Managers.

Why not propagate the metadata back to source

43

Page 44: Amundsen: A Data Discovery Platform from Lyft Council... · 2019-04-17 · Data Modelers Analysts Data Scientists General Managers Data Platform Product Engineers Experimenters Managers.

Why not propagate the metadata back to source

44

?

?

Page 45: Amundsen: A Data Discovery Platform from Lyft Council... · 2019-04-17 · Data Modelers Analysts Data Scientists General Managers Data Platform Product Engineers Experimenters Managers.

Why not propagate the metadata back to source

45

Page 46: Amundsen: A Data Discovery Platform from Lyft Council... · 2019-04-17 · Data Modelers Analysts Data Scientists General Managers Data Platform Product Engineers Experimenters Managers.

3. Search Service

46

Page 47: Amundsen: A Data Discovery Platform from Lyft Council... · 2019-04-17 · Data Modelers Analysts Data Scientists General Managers Data Platform Product Engineers Experimenters Managers.

47

Postgres Hive Redshift ... PrestoGithubSource

File

Databuilder Crawler

Neo4j ElasticSearch

Metadata Service Search Service

Frontend ServiceML FeatureService

SecurityService

Other Microservices

Metadata Sources

Page 48: Amundsen: A Data Discovery Platform from Lyft Council... · 2019-04-17 · Data Modelers Analysts Data Scientists General Managers Data Platform Product Engineers Experimenters Managers.

3. Search Service

• A thin proxy layer to interact with the search backend‒ Currently it supports Elasticsearch as the search backend.

• Support different search patterns‒ Normal Search: match records based on relevancy‒ Category Search: match records first based on data type, then

relevancy‒ Wildcard Search

48

Page 49: Amundsen: A Data Discovery Platform from Lyft Council... · 2019-04-17 · Data Modelers Analysts Data Scientists General Managers Data Platform Product Engineers Experimenters Managers.

Challenge #1How to make the search result more relevant?

49

Page 50: Amundsen: A Data Discovery Platform from Lyft Council... · 2019-04-17 · Data Modelers Analysts Data Scientists General Managers Data Platform Product Engineers Experimenters Managers.

How to make the search result more relevant?

50

• Define a search quality metric‒ Click-Through-Rate (CTR) over top 5 results

• Search behaviour instrumentation is key

• Couple of improvements:‒ Boost the exact table ranking‒ Support wildcard search (e.g. event_*)‒ Support category search (e.g. column: is_line_ride)

Page 51: Amundsen: A Data Discovery Platform from Lyft Council... · 2019-04-17 · Data Modelers Analysts Data Scientists General Managers Data Platform Product Engineers Experimenters Managers.

4. Data Builder

51

Page 52: Amundsen: A Data Discovery Platform from Lyft Council... · 2019-04-17 · Data Modelers Analysts Data Scientists General Managers Data Platform Product Engineers Experimenters Managers.

52

Postgres Hive Redshift ... PrestoGithubSource

File

Databuilder Crawler

Neo4j ElasticSearch

Metadata Service Search Service

Frontend ServiceML FeatureService

OtherServices

Other Microservices

Metadata Sources

Page 53: Amundsen: A Data Discovery Platform from Lyft Council... · 2019-04-17 · Data Modelers Analysts Data Scientists General Managers Data Platform Product Engineers Experimenters Managers.

Challenge #1Various forms of metadata

53

Page 54: Amundsen: A Data Discovery Platform from Lyft Council... · 2019-04-17 · Data Modelers Analysts Data Scientists General Managers Data Platform Product Engineers Experimenters Managers.

54

Metadata Sources @ Lyft

Page 55: Amundsen: A Data Discovery Platform from Lyft Council... · 2019-04-17 · Data Modelers Analysts Data Scientists General Managers Data Platform Product Engineers Experimenters Managers.

Metadata - Challenges

• No Standardization: No single data model that fits for all data resources‒ A data resource could be a table, an Airflow DAG or a dashboard

• Different Extraction: Each data set metadata is stored and fetched differently‒ Hive Table: Stored in Hive metastore‒ RDBMS(postgres etc): Fetched through DBAPI interface‒ Github source code: Fetched through git hook‒ Mode dashboard: Fetched through Mode API‒ …

55

Page 56: Amundsen: A Data Discovery Platform from Lyft Council... · 2019-04-17 · Data Modelers Analysts Data Scientists General Managers Data Platform Product Engineers Experimenters Managers.

Challenge #2Pull model vs Push model

56

Page 57: Amundsen: A Data Discovery Platform from Lyft Council... · 2019-04-17 · Data Modelers Analysts Data Scientists General Managers Data Platform Product Engineers Experimenters Managers.

Pull model vs. Push model

57

Pull Model Push Model

● Periodically update the index by pulling from the system (e.g. database) via crawlers.

● The system (e.g. database) pushes metadata to a message bus which downstream subscribes to.

Crawler

Database Data graph

Scheduler

Database Message queue

Data graph

Page 58: Amundsen: A Data Discovery Platform from Lyft Council... · 2019-04-17 · Data Modelers Analysts Data Scientists General Managers Data Platform Product Engineers Experimenters Managers.

Pull model vs. push model

58

Pull Model Push Model

● Onus of integration lays on data graph● No interface to prescribe, hard to maintain

crawlers

● Onus of integration lies on database● Message format serves as the interface● Allows for near-real time indexing

Crawler

Database Data graph

Scheduler

Database Message queue

Data graph

Page 59: Amundsen: A Data Discovery Platform from Lyft Council... · 2019-04-17 · Data Modelers Analysts Data Scientists General Managers Data Platform Product Engineers Experimenters Managers.

Pull model vs. push model

59

Pull Model Push Model

● Onus of integration lays on data graph● No interface to prescribe, hard to maintain

crawlers

● Onus of integration lies on database● Message format serves as the interface● Allows for near-real time indexing

Crawler

Database Data graph Database Message queue

Data graph

Preferred if● Near-real time indexing is important● Clean interface doesn’t exist● Other tools like Wherehows are moving

towards Push Model

Preferred if● Waiting for indexing is ok● Working with “strapped” teams● There’s already an interface

Page 60: Amundsen: A Data Discovery Platform from Lyft Council... · 2019-04-17 · Data Modelers Analysts Data Scientists General Managers Data Platform Product Engineers Experimenters Managers.

4. Databuilder

Page 61: Amundsen: A Data Discovery Platform from Lyft Council... · 2019-04-17 · Data Modelers Analysts Data Scientists General Managers Data Platform Product Engineers Experimenters Managers.

Databuilder in action

Page 62: Amundsen: A Data Discovery Platform from Lyft Council... · 2019-04-17 · Data Modelers Analysts Data Scientists General Managers Data Platform Product Engineers Experimenters Managers.

How are we building data? Databuilder

Page 63: Amundsen: A Data Discovery Platform from Lyft Council... · 2019-04-17 · Data Modelers Analysts Data Scientists General Managers Data Platform Product Engineers Experimenters Managers.

How is databuilder orchestrated?

Amundsen uses Apache Airflow to orchestrate Databuilder jobs

Page 64: Amundsen: A Data Discovery Platform from Lyft Council... · 2019-04-17 · Data Modelers Analysts Data Scientists General Managers Data Platform Product Engineers Experimenters Managers.

What’s next?

64

Page 65: Amundsen: A Data Discovery Platform from Lyft Council... · 2019-04-17 · Data Modelers Analysts Data Scientists General Managers Data Platform Product Engineers Experimenters Managers.

Amundsen seems to be more useful than what we thought

• Tremendous success at Lyft

‒ Used by Data Scientists, Engineers, PMs, Ops, even Cust. Service!

• Many organizations have similar problems

‒ Collaborating with ING, WeWork and more

‒ We plan to announce open source soon

65

Page 66: Amundsen: A Data Discovery Platform from Lyft Council... · 2019-04-17 · Data Modelers Analysts Data Scientists General Managers Data Platform Product Engineers Experimenters Managers.

Impact - Amundsen at Lyft

66

Beta release(internal)

Generally Available (GA) release

Alpha release

Page 67: Amundsen: A Data Discovery Platform from Lyft Council... · 2019-04-17 · Data Modelers Analysts Data Scientists General Managers Data Platform Product Engineers Experimenters Managers.

Summary

67

Page 68: Amundsen: A Data Discovery Platform from Lyft Council... · 2019-04-17 · Data Modelers Analysts Data Scientists General Managers Data Platform Product Engineers Experimenters Managers.

Adding more kinds of data resources

PeopleDashboardsData sets

Phase 1(Complete)

Phase 2(In development)

Phase 3(In Scoping)

Streams Schemas Workflows

Page 69: Amundsen: A Data Discovery Platform from Lyft Council... · 2019-04-17 · Data Modelers Analysts Data Scientists General Managers Data Platform Product Engineers Experimenters Managers.

Summary

• Data Discovery adds 30+% more productivity to Data Scientists

• Metadata is key to the next wave of big data applications

• Amundsen - Lyft’s metadata and data discovery platform

• Blog post with more details: go.lyft.com/datadiscoveryblog

69

Page 70: Amundsen: A Data Discovery Platform from Lyft Council... · 2019-04-17 · Data Modelers Analysts Data Scientists General Managers Data Platform Product Engineers Experimenters Managers.

Jin Hyuk Chang | @jinhyukchangTao Feng | @feng-tao

Slides at go.lyft.com/amundsen_datacouncil_2019Blog post at go.lyft.com/datadiscoveryblog

Icons under Creative Commons License from https://thenounproject.com/ 70

Page 71: Amundsen: A Data Discovery Platform from Lyft Council... · 2019-04-17 · Data Modelers Analysts Data Scientists General Managers Data Platform Product Engineers Experimenters Managers.

Backup

71