Democratizing Data at Airbnb

Democratizing Data at Airbnb CHRIS WILLIAMS / JOHN BODLEY / MAY 11, 2017

Transcript of Democratizing Data at Airbnb

Page 1: Democratizing Data at Airbnb


Page 2: Democratizing Data at Airbnb

Airbnb connects people to unique travel experiences

Page 3: Democratizing Data at Airbnb

The problem

Page 4: Democratizing Data at Airbnb

tribal knowledge |ˈtrībəl ˈnäləj | noun

Tribal knowledge is any unwritten information that is not commonly known by others within a company

Page 5: Democratizing Data at Airbnb

Relying on tribal knowledge stifles productivity

Page 6: Democratizing Data at Airbnb

As Airbnb grows, so do the challenges around the volume, complexity, and obscurity of data

Page 7: Democratizing Data at Airbnb

In a large and complex organization, with a sea of data resources, users struggle to find the right data

Page 8: Democratizing Data at Airbnb

Data is often siloed, inaccessible, or lacks context

Page 9: Democratizing Data at Airbnb

I’m a recovering Data Scientist who wants to democratize data, automate common workflows, surface relevant information, and provide context

Page 10: Democratizing Data at Airbnb

200k tables in our Hive data warehouse

Page 11: Democratizing Data at Airbnb

Data resources

Beyond the data warehouse

> 10,000 Superset charts and dashboards

> 6,000 Experiments and metrics

> 6,000 Tableau workbooks and charts

> 1,500 Knowledge posts

Page 12: Democratizing Data at Airbnb

With many more data sources and data types to love

Page 13: Democratizing Data at Airbnb

and most importantly…

Page 14: Democratizing Data at Airbnb

> 3,500 Airbnb employees

Page 15: Democratizing Data at Airbnb

> 20 offices around the world

Portland, San Francisco, Los Angeles, Toronto, New York, Miami, Sao Paulo, Dublin, London, Paris, Barcelona, Berlin, Milan, Copenhagen, New Delhi, Seoul, Beijing, Tokyo, Sydney, Singapore, and Washington, DC

Page 16: Democratizing Data at Airbnb

The mandate

Page 17: Democratizing Data at Airbnb

To democratize data and empower Airbnb employees to be data-informed by aiding with data exploration, discovery, and trust

Page 18: Democratizing Data at Airbnb

The concept

Page 19: Democratizing Data at Airbnb

Search…

Page 20: Democratizing Data at Airbnb

It should be fairly evident what we feed into the search indices

Page 21: Democratizing Data at Airbnb
Page 22: Democratizing Data at Airbnb

But are we missing something?

Page 23: Democratizing Data at Airbnb
Page 24: Democratizing Data at Airbnb

The relevancy of relationships

Nodes and relationships have equal standing

[Diagram labels: created, consumed, Spoke 3]

Page 25: Democratizing Data at Airbnb

The graph

[Graph diagram, built up across pages 25 to 31; relationship labels: created, associated, consumed]

Page 32: Democratizing Data at Airbnb

The construction

Page 33: Democratizing Data at Airbnb

6 Databases

4 APIs

1 Airflow DAG

Page 34: Democratizing Data at Airbnb

6 Databases

4 APIs

1 Airflow DAG

We leverage all these data resources to build a graph in Hive comprising nodes and relationships

The workflow runs every day, though the graph is left to soak to prevent flickering

Page 35: Democratizing Data at Airbnb

Addressing graph flickering

Page 36: Democratizing Data at Airbnb

Addressing graph flickering

The issue is that certain types of relationships are sporadic in nature, causing the graph to flicker

Page 37: Democratizing Data at Airbnb

Persistent vs. transient relationships

Persistent relationships represent a snapshot in time

[Diagram labels: created, Spoke 3]

Page 38: Democratizing Data at Airbnb

Persistent vs. transient relationships

Transient relationships represent events which are somewhat sporadic in nature

[Diagram labels: consumed, Spoke 3; days of the week M, Tu, W, Th, F]
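One way to picture the soaking step is sketched below: a transient relationship is kept in the graph if it was observed on any day within a trailing window, so a "consumed" event seen on Monday and Wednesday does not vanish on Tuesday. The 7-day window, the function name `soak`, and the sample data are assumptions for illustration; the slides do not say how long the graph soaks or how the step is implemented.

```python
from datetime import date, timedelta

# Hypothetical window length; the talk does not give the real value.
SOAK_DAYS = 7

def soak(observations, today, window_days=SOAK_DAYS):
    """Return relationships seen at least once within the trailing window.

    observations: iterable of (day, source_id, rel_type, target_id) tuples,
    one per day on which the transient relationship was observed.
    """
    cutoff = today - timedelta(days=window_days)
    return {
        (source, rel_type, target)
        for day, source, rel_type, target in observations
        if cutoff < day <= today
    }

observations = [
    (date(2017, 5, 8), "jane_doe", "consumed", "dim_users"),   # Monday
    (date(2017, 5, 10), "jane_doe", "consumed", "dim_users"),  # Wednesday
    (date(2017, 4, 3), "jane_doe", "consumed", "12345"),       # over a month earlier
]
edges = soak(observations, today=date(2017, 5, 11))
```

A longer window makes the graph more stable but slower to forget stale usage, so the value is a trade-off rather than a constant.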

Page 39: Democratizing Data at Airbnb

The winding data path

Airflow: Data transfer

Python: Graph datastore

neo4j-driver: Python Neo4j driver

Neo4j: Graph database

GraphAware: Neo4j/Elasticsearch plugin

Elasticsearch: Search engine

Flask: Python web framework

Hive: Data warehouse

Page 49: Democratizing Data at Airbnb

The winding data path

Airflow Data transfer

Python Graph datastore

neo4j-driver Python Neo4j driver

Neo4j Graph database

GraphAware Neo4j/Elasticsearch plugin

Elasticsearch Search engine

Flask Python web framework

Hive Data warehouse

Page 50: Democratizing Data at Airbnb

Why we chose Neo4j for our database

The four main reasons

Logical: Given our data is represented as a graph, it is logical to use a graph database to store the data

Nimble: Performance wins when dealing with connected data versus relational databases

Popular: It is the world’s leading graph database and the community edition is free

Integrative: It integrates well with Python and Elasticsearch

Page 51: Democratizing Data at Airbnb

The Neo4j and Elasticsearch symbiotic relationship

Courtesy of two GraphAware plugins

Neo4j plugin: Provides bi-directional integration which transparently and asynchronously replicates data from Neo4j to Elasticsearch

Elasticsearch plugin: Enables Elasticsearch to consult the Neo4j database during a search query to enrich the search rankings by leveraging the graph topology
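As a rough illustration of what enriching search rankings with graph topology can mean, the sketch below re-scores text-search hits using the number of `consumed` relationships pointing at each resource. The scoring formula, the function name `rerank`, and the sample numbers are all assumptions; the slides do not describe how the GraphAware plugin combines the two signals.

```python
import math

def rerank(hits, consumed_counts):
    """Re-order search hits by text score boosted with graph popularity.

    hits: list of (resource_id, text_score) pairs from the search engine.
    consumed_counts: resource_id -> number of incoming `consumed` relationships.
    The log1p boost is an illustrative choice, not the plugin's formula.
    """
    def boosted(hit):
        resource_id, text_score = hit
        return text_score * (1.0 + math.log1p(consumed_counts.get(resource_id, 0)))

    return [resource_id for resource_id, _ in sorted(hits, key=boosted, reverse=True)]

# Hypothetical hits: the scratch table matches the text slightly better,
# but nobody consumes it.
hits = [("dim_users_tmp", 2.0), ("dim_users", 1.5)]
consumed_counts = {"dim_users": 40, "dim_users_tmp": 0}
ranking = rerank(hits, consumed_counts)
```

The point of the sketch is only that a widely consumed resource can outrank a slightly better textual match.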

Page 52: Democratizing Data at Airbnb

Node label hierarchy

:Entity
  :Org
    :Group
    :User
  :Tableau
    :Workbook
    :Chart
  :Hive
    :Schema
    :Table
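The hierarchy can be captured as a simple parent map, from which the full label path of any node type is derived. This is a minimal sketch: the names `PARENT`, `label_path`, and `label_expression` are hypothetical, and only the labels shown on the slide are included.

```python
# Parent of each label, following the hierarchy on the slide.
PARENT = {
    "Org": "Entity",
    "Group": "Org",
    "User": "Org",
    "Tableau": "Entity",
    "Workbook": "Tableau",
    "Chart": "Tableau",
    "Hive": "Entity",
    "Schema": "Hive",
    "Table": "Hive",
}

def label_path(leaf):
    """Return the labels from the root (:Entity) down to the given leaf label."""
    path = [leaf]
    while path[-1] in PARENT:
        path.append(PARENT[path[-1]])
    return list(reversed(path))

def label_expression(leaf):
    """Render the label path in Cypher notation, e.g. ':Entity:Org:User'."""
    return "".join(":" + label for label in label_path(leaf))
```

For example, `label_expression("User")` produces the `:Entity:Org:User` form used to describe a user node.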

Page 53: Democratizing Data at Airbnb

jane_doe: (:Entity:Org:User {id: 'jane_doe'})

dim_users: (:Entity:Hive:Table {id: 'dim_users'})

12345: (:Entity:Tableau:Chart {id: '12345'})

Page 54: Democratizing Data at Airbnb

MATCH (n:Entity:Org:User {id: '<id>'}) USING INDEX n:User(id) RETURN n
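A query of this shape could be generated per node type, as in the sketch below. The helper name `node_lookup_query` is hypothetical, and the id is passed as a `$id` parameter (standard Cypher parameter syntax) instead of being spliced into the text, which is an assumption about usage rather than something the slide shows.

```python
def node_lookup_query(labels):
    """Build a Cypher lookup for a node identified by its labels and local id.

    labels: full label path, e.g. ["Entity", "Org", "User"]. The index hint
    targets the last (most specific) label, as in the slide's query.
    """
    for label in labels:
        # Labels cannot be passed as query parameters, so validate them strictly.
        if not label.isalnum():
            raise ValueError("unexpected label: %r" % label)
    label_expr = "".join(":" + label for label in labels)
    return (
        "MATCH (n%s {id: $id}) USING INDEX n:%s(id) RETURN n"
        % (label_expr, labels[-1])
    )

query = node_lookup_query(["Entity", "Org", "User"])
params = {"id": "jane_doe"}
```

With the Python Neo4j driver, the pair would typically be executed as `session.run(query, params)`. Note that each label combination yields a different query text.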

Page 55: Democratizing Data at Airbnb

From local to global uniqueness

A mechanism to reference nodes in an abstract manner

GraphAware UUID plugin: Transparently assigns a globally unique UUID property to newly created elements (nodes and relationships) which cannot be changed or deleted

Globally unique: Lets us identify any single node via the Entity label and UUID property alone, which allows for parameterized queries and, in turn, faster query and execution times

Page 56: Democratizing Data at Airbnb

MATCH (n:Entity {uuid: '<uuid>'}) USING INDEX n:Entity(uuid) RETURN n
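Because every node carries the `:Entity` label and a UUID, one constant query text can serve all node types, with only the parameter changing. The sketch below assumes the `$uuid` parameter form; the helper name and the UUID values are hypothetical.

```python
# One constant query text serves every node type; only the parameter varies.
UUID_LOOKUP = "MATCH (n:Entity {uuid: $uuid}) USING INDEX n:Entity(uuid) RETURN n"

def uuid_lookup(uuid):
    """Return the (query, parameters) pair for fetching a node by its UUID."""
    return UUID_LOOKUP, {"uuid": uuid}

# Hypothetical UUID values, for illustration only.
first = uuid_lookup("0b1c7a2e-0000-4000-8000-000000000001")
second = uuid_lookup("0b1c7a2e-0000-4000-8000-000000000002")
```

Since the query text never changes, the database can reuse a cached query plan across lookups, which is one plausible reading of the "faster query and execution times" claim.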

Page 57: Democratizing Data at Airbnb

/api/graph/nodes/org/user/<id>

/api/graph/nodes/<uuid>

/api/graph/relationships/<uuid>/created/<uuid>
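The three endpoint shapes can be told apart with simple patterns, as sketched below in plain Python. This is not the team's Flask code: the regular expressions, route names, and the `resolve` helper are illustrative assumptions, written without Flask so the example stays self-contained.

```python
import re

# Route patterns mirroring the three endpoint shapes on the slide.
# The regular expressions and route names are illustrative assumptions.
ROUTES = [
    ("node_by_id", re.compile(r"^/api/graph/nodes/org/user/(?P<id>[^/]+)$")),
    ("node_by_uuid", re.compile(r"^/api/graph/nodes/(?P<uuid>[^/]+)$")),
    (
        "created_relationship",
        re.compile(
            r"^/api/graph/relationships/(?P<source>[^/]+)/created/(?P<target>[^/]+)$"
        ),
    ),
]

def resolve(path):
    """Return (route name, captured path parameters) or None if nothing matches."""
    for name, pattern in ROUTES:
        match = pattern.match(path)
        if match:
            return name, match.groupdict()
    return None
```

The label-based route needs the node type in the path, while the UUID route does not, which mirrors the move from local to global uniqueness.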

Page 58: Democratizing Data at Airbnb

The frontend

Page 59: Democratizing Data at Airbnb

web app

Page 60: Democratizing Data at Airbnb

Designing the interface and user experience of a data tool should not be an afterthought

Page 61: Democratizing Data at Airbnb

User personas

Daphne Data: Technical data power user; the epitome of a tribal knowledge holder

Manager Mel: Less data literate; needs to keep tabs on her team’s resources

Nathan New: New employee, new team, or new to data; has no idea what’s going on

Page 62: Democratizing Data at Airbnb

Designing for data exploration, discovery, and trust

Search | Resource details & metadata | User data | Group data | Company data

Page 63: Democratizing Data at Airbnb

Page 64: Democratizing Data at Airbnb

Google-esque search filters

Resource details & metadata

Context, context, & context

Page 65: Democratizing Data at Airbnb

Surface relationships, everything’s a link to promote exploration

Metadata & consumption

Description, external link, social

Page 66: Democratizing Data at Airbnb

Column details & value distributions

Table lineage

Enrich metadata on the fly

Page 67: Democratizing Data at Airbnb

Page 68: Democratizing Data at Airbnb

User details & metadata

What they make, what they consume

Page 69: Democratizing Data at Airbnb

Former employees also hold tribal knowledge

Page 70: Democratizing Data at Airbnb

Group overview

Thumbnails for maximum context

Basic organization functionality

Pinterest-like curation & suggested content

Page 71: Democratizing Data at Airbnb

We gather over 15,000 thumbnails from Tableau, Superset, and the Knowledge Repo

Page 72: Democratizing Data at Airbnb

Pinning flow from resource page

Edit mode / draggable grid

Page 73: Democratizing Data at Airbnb

Employees can feel disconnected from Company-level metrics

Page 74: Democratizing Data at Airbnb

The technology stack

Application + dependencies

DOM

Testing: eslint, enzyme, mocha, chai

Application state

Styling: khan/aphrodite

Page 75: Democratizing Data at Airbnb

The challenges

Page 76: Democratizing Data at Airbnb

The challenges

Proxy nodes: Abstracting complexity where necessary while accurately modeling the data ecosystem

Graph merging: Non-trivial Git-like merging of graph updates

Data-dense design: Balancing simplicity and functionality is hard; most internal design resources are not made for data-rich apps

Complex dependencies: An umbrella data tool is vulnerable to changes in upstream resource dependencies
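To give a feel for the graph merging item, the sketch below treats a merge as a diff between the stored graph and a freshly built snapshot, yielding the elements to create and to delete. This is a deliberately simplified assumption: it ignores property changes, UUID preservation, and soaking, which are likely what makes the real merge non-trivial. The function name and the `data_science` group are hypothetical.

```python
def diff_graph(current, incoming):
    """Compare two graph snapshots, each a set of hashable elements.

    Returns (to_add, to_remove): what must be created in, and deleted from,
    the stored graph so that it matches the incoming snapshot.
    """
    return incoming - current, current - incoming

current = {
    ("jane_doe", "created", "12345"),
    ("jane_doe", "consumed", "dim_users"),
}
incoming = {
    ("jane_doe", "created", "12345"),
    ("jane_doe", "associated", "data_science"),
}
to_add, to_remove = diff_graph(current, incoming)
```

Applying only the difference leaves unchanged elements in place, so anything assigned to them on creation stays stable across daily runs.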

Page 77: Democratizing Data at Airbnb

The future

Page 78: Democratizing Data at Airbnb

The future

Gamification: Provide content producers with a sense of value

Alerts & recommendations: Move from active exploration to delivering relevant updates and content suggestions

Certified content: Use certification to build trust and enable users to filter through a sea of stale content

Network analysis: Determine obsolete nodes, critical paths, lines of communication, etc.

Page 79: Democratizing Data at Airbnb

The team

Page 80: Democratizing Data at Airbnb

The Dataportal team

Analytics & Experimentation Products

John Bodley, Software Engineer

Eli Brumbaugh, Experience Designer

Jeff Feng, Product Manager

Michelle Thomas, Software Engineer

Chris Williams, Data Visualization

Page 81: Democratizing Data at Airbnb
Page 82: Democratizing Data at Airbnb

Thank you

Page 83: Democratizing Data at Airbnb

Appendix

Page 84: Democratizing Data at Airbnb

Naturally bidirectional relationships

associated

Dealing with mutual relationships

Page 85: Democratizing Data at Airbnb

Naturally bidirectional relationships

associated

Modeling both creates an unnecessary relationship

associated

Page 86: Democratizing Data at Airbnb

Naturally bidirectional relationships

associated

The most efficient solution is to use a single relationship in the many-to-one direction
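The idea can be sketched in a few lines: each mutual relationship is stored once, pointing from the "many" side to the "one" side, and lookups ignore direction. The user and group ids and the `neighbors` helper are hypothetical, and the choice of users as the "many" side is an assumption.

```python
# Each mutual relationship is stored once, pointing from the "many" side
# (assumed here to be users) to the "one" side (their group).
edges = {
    ("jane_doe", "associated", "data_science"),
    ("john_roe", "associated", "data_science"),
}

def neighbors(node, edges, rel_type="associated"):
    """Return nodes linked to `node` by rel_type, ignoring edge direction."""
    linked = set()
    for source, kind, target in edges:
        if kind != rel_type:
            continue
        if source == node:
            linked.add(target)
        elif target == node:
            linked.add(source)
    return linked
```

In Cypher the same effect comes from an undirected pattern such as `(a)-[:associated]-(b)`, so nothing is lost by storing one direction only.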

Page 87: Democratizing Data at Airbnb

CREATE TABLE nodes (
  labels ARRAY<STRING>,
  id STRING,
  properties STRING
);

Page 88: Democratizing Data at Airbnb

jane_doe: { labels: ['Org', 'User'], id: 'jane_doe' }

dim_users: { labels: ['Hive', 'Table'], id: 'dim_users' }

12345: { labels: ['Tableau', 'Chart'], id: '12345' }
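Rows for the `nodes` table could be assembled as below. This is a sketch under assumptions: the helper name `node_row` is hypothetical, the `properties` column is assumed to hold JSON text (the schema only says STRING), and the `schema` property on `dim_users` is invented for illustration.

```python
import json

def node_row(labels, node_id, properties=None):
    """Build one row for the `nodes` table: labels array, id, JSON properties."""
    return {
        "labels": list(labels),
        "id": node_id,
        # The schema stores properties as a STRING; JSON encoding is assumed.
        "properties": json.dumps(properties or {}, sort_keys=True),
    }

rows = [
    node_row(["Org", "User"], "jane_doe"),
    node_row(["Hive", "Table"], "dim_users", {"schema": "core"}),  # made-up property
    node_row(["Tableau", "Chart"], "12345"),
]
```

The pair (labels, id) identifies a node here, since ids are only unique within a node type.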

Page 89: Democratizing Data at Airbnb

CREATE TABLE relationships (
  source STRUCT<labels:ARRAY<STRING>, id:STRING>,
  target STRUCT<labels:ARRAY<STRING>, id:STRING>,
  type STRING,
  properties STRING
);
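A relationship row references its endpoints by the same (labels, id) pair used to identify nodes. The sketch below mirrors the schema in plain Python; the helper names are hypothetical and JSON encoding of `properties` is an assumption.

```python
import json

def node_ref(labels, node_id):
    """Reference a node by (labels, id), matching the STRUCT in the schema."""
    return {"labels": list(labels), "id": node_id}

def relationship_row(source, target, rel_type, properties=None):
    """Build one row for the `relationships` table."""
    return {
        "source": source,
        "target": target,
        "type": rel_type,
        # Stored as a STRING in the schema; JSON encoding is assumed here.
        "properties": json.dumps(properties or {}, sort_keys=True),
    }

row = relationship_row(
    node_ref(["Org", "User"], "jane_doe"),
    node_ref(["Tableau", "Chart"], "12345"),
    "created",
)
```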

Page 90: Democratizing Data at Airbnb

Efficient data retrieval

Restrictions and workarounds with Neo4j indexes

Problem: Indexes provide for efficient data retrieval, similar to an RDBMS primary key; however, they are only defined for a single label as opposed to our tuple of hierarchical labels

Solution: Create an index for every label keyed by the ID and UUID properties, which together with index hints provides optimal node retrieval
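The workaround amounts to generating one index per label and property, which is easy to script. The sketch below assumes the labels from the node label hierarchy and emits the Neo4j 3.x statement form; the helper name is hypothetical, and the slides do not show the actual statements used.

```python
LABELS = [
    "Entity", "Org", "Group", "User",
    "Tableau", "Workbook", "Chart",
    "Hive", "Schema", "Table",
]

def index_statements(labels, properties=("id", "uuid")):
    """Yield one CREATE INDEX statement per (label, property) pair.

    Uses the Neo4j 3.x form `CREATE INDEX ON :Label(property)`; newer
    versions use `CREATE INDEX FOR (n:Label) ON (n.property)` instead.
    """
    for label in labels:
        for prop in properties:
            yield "CREATE INDEX ON :%s(%s)" % (label, prop)

statements = list(index_statements(LABELS))
```

Each statement pairs with an index hint such as `USING INDEX n:User(id)` at query time, so the planner uses the most specific label's index.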