Zeppelin at Twitter (SF Data Science Meetup, July 2016)

Zeppelin at Twitter
Prasad Wagle, Technical Lead, Data Platform (twitter.com/prasadwagle)
July 13, 2016, SF Data Science Meetup, Galvanize, San Francisco, CA



Twitter Data Pipeline Overview

Production systems

HDFS

Analytics Tools: Presto, Vertica, MySQL, Scalding, Spark, R

Analytics Front-ends: Custom Dashboards, Tableau, Zeppelin, Command line tools

One company-wide server
860 notes
4000 paragraphs
  1500 Vertica, 1500 Presto, 300 MySQL
  400 Markdown
  300 Scalding, Spark, etc.
850 users

Zeppelin Usage Metrics
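
The per-interpreter counts on this slide appear to break the 4000 paragraphs down by interpreter (the slide does not say so explicitly, but the numbers add up to 4000). A minimal sketch of that arithmetic, using only the figures quoted above; the dictionary keys are just labels chosen for illustration:

```python
# Paragraph counts as quoted on the slide, keyed by interpreter label.
paragraphs = {
    "Vertica": 1500,
    "Presto": 1500,
    "MySQL": 300,
    "Markdown": 400,
    "Scalding, Spark, etc": 300,
}

# Total paragraphs across all interpreters.
total = sum(paragraphs.values())

# Share of each interpreter as a percentage of the total.
shares = {name: round(100 * count / total, 1) for name, count in paragraphs.items()}

for name, pct in shares.items():
    print(f"{name}: {pct}%")
```

Read this way, the two SQL engines Vertica and Presto together account for three quarters of all paragraphs.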

Field of (Data) Dreams

Started as hackweek project in Dec 2015, Beta: Jan 27

Number of notes created

Feb - 270, Mar - 350, Apr - 470, May - 560, Jun - 750

Report Creators
  Product managers (dashboards, product analytics)
  Data scientists
  Sales analysts
  Engineers and SREs

Report Viewers
  Anyone in the company

Zeppelin Users

My mind is absolutely blown away by the ease of use, speed, and power of Zeppelin. I've been wanting a tool like this at Twitter my entire time working here.

started playing with @ApacheZeppelin. amazingly addictive!

Thanks for all the updates to Zeppelin - Fabric has fallen in love with it fast (and we're even using it for daily tracking of our OKRs amongst all the other metrics)

Zeppelin Testimonials

Very easy to create and share reports
  Web based

Works seamlessly with analytics engines
  JDBC
  Non-JDBC - Scalding, Spark

Open source (easy to add features)

Reasons for adoption

Drag and drop report builder
  Can create complex queries without SQL knowledge (e.g. Top N)
Polished UI
Filters and other transformations work on extracts
  no new database queries (fast)
Row level permissions (for sales reports)

Tableau

Areas:
  Security
  Stability
  Operations
  Interpreter

Work Done Before Production

Authentication
  Integrated with Twitter’s homegrown single sign-on system

SSL
  Integrated with Twitter’s homegrown key distribution system

Notebook authorization
Data source authorization

Work Done (Security)

Websocket deadlock issue with Jetty 8
  reduce communication
  remove synchronized block (risky, will move to Jetty 9)

Monitoring
Standby server
Backups

Work Done (Stability, Operations)

Scalding
JDBC
  Create JDBC connection for every query to avoid Vertica closed connection issue

Work Done (Interpreter)
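
The connection-per-query idea from this slide can be sketched as follows. This is not Twitter's interpreter code: it is a Python illustration that uses the standard-library sqlite3 module as a stand-in for a JDBC data source, and the names run_query, connect, and db_path are hypothetical. The point is only that no connection outlives a query, so a connection closed by the server while idle can never be reused.

```python
import os
import sqlite3
import tempfile
from contextlib import closing


def run_query(connect, sql, params=()):
    """Open a fresh connection, run one statement, and always close the connection."""
    with closing(connect()) as conn:
        with closing(conn.cursor()) as cur:
            cur.execute(sql, params)
            rows = cur.fetchall()  # empty list for statements without a result set
        conn.commit()
        return rows


# Stand-in data source: a throwaway SQLite file instead of a real Vertica server.
db_path = os.path.join(tempfile.mkdtemp(), "demo.db")


def connect():
    # Hypothetical connection factory; a real setup would call its database driver here.
    return sqlite3.connect(db_path)


run_query(connect, "CREATE TABLE events (name TEXT, n INTEGER)")
run_query(connect, "INSERT INTO events VALUES (?, ?)", ("signup", 3))
result = run_query(connect, "SELECT name, n FROM events")
print(result)
```

The trade-off is connection setup cost on every query. A connection pool that validates connections before handing them out is a common alternative, though the slides do not say whether it was considered.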

Notebook authorization
Data source authorization
Run scheduled notes with a user
Scalding interpreter
Reduce websocket communication
Paragraph footer
Row level permissions

Work Contributed to Apache Project

Stability, Scalability
  Jetty 9

Interpreters
  Scalding, Spark, R
  Multiuser scalability, authentication, integration with Twitter sources
  Use JDBC interpreter instead of Hive

Future / Work in Progress

Features and UX
  Notebook organization (folders)
  Email reports and alerts
  Row level permissions like Tableau

Operations
  Monitoring (end-to-end query)
  Admin (view/stop running jobs, resource usage)
  Failover
  Continuous Integration

Future / Work in Progress

There is a real need for a notebook-style interface for data analysis

Zeppelin is enterprise ready, flexible and easy to use

Zeppelin user and dev community is awesome

Takeaways