Presto - Hadoop Conference Japan 2014

Transcript
Page 1: Presto - Hadoop Conference Japan 2014

Sadayuki Furuhashi

Founder & Software Architect, Treasure Data, Inc.

Presto: Interactive SQL Query Engine for Big Data
Hadoop Conference Japan 2014

Page 2: Presto - Hadoop Conference Japan 2014

A little about me...

> Sadayuki Furuhashi
> github/twitter: @frsyuki

> Treasure Data, Inc.
> Founder & Software Architect

> Open-source hacker
> MessagePack - efficient object serializer
> Fluentd - data collection tool
> ServerEngine - Ruby framework to build multiprocess servers
> LS4 - distributed object storage system
> kumofs - distributed key-value data store

Page 3: Presto - Hadoop Conference Japan 2014

0. Background + Intro

Page 4: Presto - Hadoop Conference Japan 2014

What’s Presto?

A distributed SQL query engine for interactive data analysis against GBs to PBs of data.

Page 5: Presto - Hadoop Conference Japan 2014

Presto’s history

> Fall 2012: Project started at Facebook
> Designed for interactive queries
> with the speed of a commercial data warehouse
> and scalability to the size of Facebook

> 2013 Winter: Open sourced!

> 30+ contributors in 6 months
> including people from outside Facebook

Page 6: Presto - Hadoop Conference Japan 2014

What problems does Presto solve?

> We couldn’t visualize data in HDFS directly using dashboards or BI tools
> because Hive is too slow (not interactive)
> or ODBC connectivity is unavailable/unstable

> We needed to store daily-batch results in an interactive DB (PostgreSQL, Redshift, etc.) for quick responses
> Interactive DBs cost more and are far less scalable

> Some data is not stored in HDFS
> We need to copy the data into HDFS to analyze it


Page 10: Presto - Hadoop Conference Japan 2014

[Diagram: two separate platforms. Batch analysis platform: daily/hourly batch jobs run Hive over HDFS. Visualization platform: results are loaded into PostgreSQL etc. and queried interactively from dashboards and commercial BI tools.]

Page 11: Presto - Hadoop Conference Japan 2014

[Diagram: the same two-platform setup, annotated with its drawbacks. ✓ Less scalable ✓ Extra cost ✓ More work to manage 2 platforms ✓ Can’t query against “live” data directly]

Page 12: Presto - Hadoop Conference Japan 2014

[Diagram: Presto is added next to Hive on the batch analysis platform. Daily/hourly batch jobs still run on Hive, but dashboards now query the data in HDFS directly through Presto, interactively, instead of through a separate PostgreSQL-based platform.]

Page 13: Presto - Hadoop Conference Japan 2014

[Diagram: SQL on any data sets. Presto serves interactive queries to dashboards over HDFS (alongside Hive daily/hourly batch jobs), Cassandra, MySQL, and commercial DBs.]

Page 14: Presto - Hadoop Conference Japan 2014

[Diagram: the same single data analysis platform, now also serving commercial BI tools such as IBM Cognos, Tableau, etc.]

Page 15: Presto - Hadoop Conference Japan 2014

Dashboard on Chart.io: https://chartio.com/

Page 16: Presto - Hadoop Conference Japan 2014

What can Presto do?

> Query interactively (in milliseconds to minutes)
> MapReduce and Hive are still necessary for ETL

> Query using commercial BI tools or dashboards
> Reliable ODBC/JDBC connectivity

> Query across multiple data sources such as Hive, HBase, Cassandra, or even commercial DBs
> Plugin mechanism

> Integrate batch analysis + visualization into a single data analysis platform

Page 17: Presto - Hadoop Conference Japan 2014

Presto’s deployment

> Facebook
> Multiple geographical regions
> scaled to 1,000 nodes
> actively used by 1,000+ employees
> who run 30,000+ queries every day
> processing 1PB/day

> Netflix, Dropbox, Treasure Data, Airbnb, Qubole

> Presto as a Service

Page 18: Presto - Hadoop Conference Japan 2014

Today’s talk

1. Distributed architecture

2. Data visualization - Demo

3. Query Execution - Presto vs. MapReduce

4. Monitoring & Configuration

5. Roadmap - the future

Page 19: Presto - Hadoop Conference Japan 2014

1. Distributed architecture

Page 20: Presto - Hadoop Conference Japan 2014

[Architecture diagram: a Client, a Coordinator with a Connector Plugin, three Workers, Storage / Metadata, and a Discovery Service.]

Page 21: Presto - Hadoop Conference Japan 2014

[Architecture diagram] 1. Servers find each other in the cluster through the Discovery Service

Page 22: Presto - Hadoop Conference Japan 2014

[Architecture diagram] 2. The Client sends a query to the Coordinator using HTTP

Page 23: Presto - Hadoop Conference Japan 2014

[Architecture diagram] 3. The Coordinator builds a query plan; the connector plugin provides metadata (table schema, etc.)

Page 24: Presto - Hadoop Conference Japan 2014

[Architecture diagram] 4. The Coordinator sends tasks to the Workers

Page 25: Presto - Hadoop Conference Japan 2014

[Architecture diagram] 5. Workers read data through the connector plugin

Page 26: Presto - Hadoop Conference Japan 2014

[Architecture diagram] 6. Workers run tasks in memory

Page 27: Presto - Hadoop Conference Japan 2014

[Architecture diagram] 7. The Client gets the result from a Worker


Page 29: Presto - Hadoop Conference Japan 2014

What are Connectors?

> Connectors are plugins to Presto
> written in Java
> They give access to storage and metadata
> provide table schemas to the coordinator
> provide table rows to workers

> Implementations:
> Hive connector
> Cassandra connector
> MySQL through a JDBC connector (prerelease)
> Or your own connector

Page 30: Presto - Hadoop Conference Japan 2014

Hive connector

[Architecture diagram: the Coordinator and Workers use the Hive Connector to reach HDFS and the Hive Metastore; servers find each other through the Discovery Service.]

Page 31: Presto - Hadoop Conference Japan 2014

Cassandra connector

[Architecture diagram: the Coordinator and Workers use the Cassandra Connector to reach Cassandra; servers find each other through the Discovery Service.]

Page 32: Presto - Hadoop Conference Japan 2014

Multiple connectors in a query

[Architecture diagram: the Coordinator and Workers load the Hive Connector (HDFS / Metastore), the Cassandra Connector (Cassandra), and other connectors for other data sources at the same time, all within a single query.]
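Since each connector is mounted as a catalog, a single statement can combine tables from different stores. A hypothetical example (the catalog, schema, and table names are made up for illustration):

    SELECT u.name, count(*) AS c
    FROM hive.web.access_logs a
    JOIN mysql.crm.users u ON a.user_id = u.id
    GROUP BY u.name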

Page 33: Presto - Hadoop Conference Japan 2014

1. Distributed architecture

> 3 types of servers:
> Coordinator, worker, and discovery service

> Gets data/metadata through connector plugins.
> Presto is NOT a database
> Presto provides SQL on top of existing data stores

> Client protocol is HTTP + JSON
> Language bindings:

Ruby, Python, PHP, Java (JDBC), R, Node.js...
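Because the protocol is plain HTTP + JSON, a client needs nothing but an HTTP library. A minimal Python sketch of the statement protocol, assuming a coordinator on localhost:8080 (the user, catalog, and schema values are placeholders):

    # Minimal sketch of Presto's HTTP + JSON client protocol.
    import requests

    def run_query(sql):
        # Submit the query; the coordinator replies with JSON that
        # includes a nextUri to poll until the query is finished.
        resp = requests.post(
            "http://localhost:8080/v1/statement",
            data=sql,
            headers={"X-Presto-User": "demo",
                     "X-Presto-Catalog": "hive",
                     "X-Presto-Schema": "default"},
        ).json()
        rows = []
        while True:
            rows.extend(resp.get("data", []))
            if "nextUri" not in resp:   # no nextUri: the query is done
                return rows
            resp = requests.get(resp["nextUri"]).json()

    print(run_query("SELECT COUNT(1) FROM tbl1"))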

Page 34: Presto - Hadoop Conference Japan 2014

[Architecture diagram: Coordinator HA. The same cluster runs two Coordinators for high availability; clients and workers locate them through the Discovery Service.]

Page 35: Presto - Hadoop Conference Japan 2014

2. Data visualization

Page 36: Presto - Hadoop Conference Japan 2014

The problems with using BI tools

> BI tools need ODBC or JDBC connectivity
> Tableau, IBM Cognos, QlikView, Chart.IO, ...
> JasperSoft, Pentaho, MotionBoard, ...

> ODBC/JDBC is VERY COMPLICATED
> A mature implementation takes a LONG time

Page 37: Presto - Hadoop Conference Japan 2014

A solution: PostgreSQL protocol

> Creating a PostgreSQL protocol gateway

> Using PostgreSQL’s stable ODBC / JDBC driver

https://github.com/treasure-data/prestogres

Page 38: Presto - Hadoop Conference Japan 2014

How does Prestogres work?

1. The client sends a query to the patched pgpool-II gateway: SELECT COUNT(1) FROM tbl1
2. The gateway rewrites it and sends it to PostgreSQL: select run_presto_as_temp_table('presto_result', 'SELECT COUNT(1) FROM tbl1');
3. The run_presto_as_temp_table function runs the query on Presto (via the coordinator) and stores the result in a temporary table
4. The gateway reads the result back: SELECT * FROM presto_result;
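Because the gateway speaks the PostgreSQL wire protocol, any PostgreSQL client library works unchanged. A minimal Python sketch, assuming a Prestogres gateway on localhost:5439 (host, port, user, and database are placeholders):

    # Query Presto through Prestogres with an ordinary PostgreSQL driver.
    import psycopg2

    conn = psycopg2.connect(host="localhost", port=5439,
                            user="presto", dbname="default")
    cur = conn.cursor()
    cur.execute("SELECT COUNT(1) FROM tbl1")   # rewritten and run on Presto
    print(cur.fetchall())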

Page 39: Presto - Hadoop Conference Japan 2014

Demo

Page 40: Presto - Hadoop Conference Japan 2014

2. Data visualization with Presto

> Data visualization tools need an ODBC/JDBC driver
> but implementation takes a LONG time

> A solution is to speak the PostgreSQL protocol
> and reuse PostgreSQL’s stable ODBC/JDBC driver

> Prestogres is already confirmed to work with some commercial BI tools

Page 41: Presto - Hadoop Conference Japan 2014

3. Query Execution

Page 42: Presto - Hadoop Conference Japan 2014

Presto’s execution model

> Presto is NOT MapReduce

> Presto’s query plan is based on a DAG
> more like Apache Tez or traditional MPP databases

Page 43: Presto - Hadoop Conference Japan 2014

How does a query run?

> Coordinator
> SQL Parser
> Query Planner
> Execution Planner

> Workers
> Task execution scheduler

Page 44: Presto - Hadoop Conference Japan 2014

[Diagram: planning pipeline. SQL -> SQL Parser -> AST -> Logical Planner -> Logical Query Plan -> Optimizer -> Distributed Planner -> Distributed Query Plan -> Execution Planner -> Execution Plan. The Logical Planner reads table schemas from the Connector's Metadata; the Execution Planner reads the node list from the NodeManager, backed by the Discovery Server.]

Page 45: Presto - Hadoop Conference Japan 2014

[The same planning pipeline diagram, marking the Query Planner and the steps after it as the focus of today's talk.]

Page 46: Presto - Hadoop Conference Japan 2014

Query Planner

SQL:

SELECT name, count(*) AS c
FROM impressions
GROUP BY name

Table schema:

impressions (name varchar, time bigint)

Logical query plan:

Table scan (name:varchar) -> GROUP BY (name, count(*)) -> Output (name, c)

Distributed query plan (split at Exchange points):

Table scan -> Partial aggregation -> Sink
  -> Exchange -> Final aggregation -> Sink
  -> Exchange -> Output
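You can inspect where Presto draws these boundaries yourself: the EXPLAIN statement prints the plan for a query, and a distributed plan type shows the stage structure (the exact option syntax below is an assumption for this Presto version):

    EXPLAIN (TYPE DISTRIBUTED)
    SELECT name, count(*) AS c
    FROM impressions
    GROUP BY name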

Page 47: Presto - Hadoop Conference Japan 2014

Query Planner - Stages

[Diagram: the distributed plan is cut into stages at the Exchange boundaries. Stage-2: Table scan -> Partial aggregation -> Sink (pipelined aggregation). Stage-1: Exchange -> Final aggregation -> Sink. Stage-0: Exchange -> Output. Each Sink-to-Exchange edge is an inter-worker data transfer.]

Page 48: Presto - Hadoop Conference Japan 2014

Execution Planner

[Diagram: the stage plan combined with the node list (✓ 2 workers). Worker 1 and Worker 2 each get a copy of the Table scan -> Partial aggregation -> Sink stage and the Exchange -> Final aggregation -> Sink stage; the Exchange -> Output stage runs once to collect the results.]

Page 49: Presto - Hadoop Conference Japan 2014

Execution Planner - Tasks

[Diagram: 1 task / worker / stage. Worker 1 and Worker 2 each run one task for every stage of the plan. ✓ All tasks run in parallel.]

Page 50: Presto - Hadoop Conference Japan 2014

Execution Planner - Split

[Diagram: tasks are divided into splits. A table-scan task has many splits, so a worker runs many concurrent threads for it; tasks of intermediate stages have 1 split each (1 thread per worker); the final output stage has 1 split per worker (1 thread per worker).]

Page 51: Presto - Hadoop Conference Japan 2014

MapReduce vs. Presto

[Diagram: In MapReduce, each map and reduce stage writes its data to disk, and each stage waits for the previous one to finish. In Presto, all stages are pipelined (✓ no wait time ✓ no fault-tolerance) and data moves memory-to-memory between tasks (✓ no disk IO ✓ each data chunk must fit in memory).]

Page 52: Presto - Hadoop Conference Japan 2014

3. Query Execution

> SQL is converted into stages, tasks, and splits
> All tasks run in parallel

> No wait time between stages (pipelined)
> If one task fails, all tasks fail at once (the query fails)

> Memory-to-memory data transfer
> No disk IO
> If aggregated data doesn’t fit in memory, the query fails
• Note: the query dies but the worker doesn’t; memory consumption of all queries is fully managed

Page 53: Presto - Hadoop Conference Japan 2014

4. Monitoring & Configuration

Page 54: Presto - Hadoop Conference Japan 2014

Monitoring

> Web UI
> basic query status check

> JMX HTTP API
> GET /v1/jmx/mbean[/{objectName}]
• com.facebook.presto.execution:name=TaskManager
• com.facebook.presto.execution:name=QueryManager
• com.facebook.presto.execution:name=NodeScheduler

> Event notification (remote logging)
> POST http://remote.server/v2/event
• query start, query complete, split complete
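For example, a monitoring script can poll those endpoints with any HTTP client. A small Python sketch against a coordinator on localhost:8080 (host and port are placeholders):

    # Fetch QueryManager metrics from the coordinator's JMX HTTP API.
    import requests

    mbean = "com.facebook.presto.execution:name=QueryManager"
    resp = requests.get("http://localhost:8080/v1/jmx/mbean/" + mbean)
    print(resp.json())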

Page 55: Presto - Hadoop Conference Japan 2014

Configuration

> Execution planning (for the coordinator)
> query.initial-hash-partitions
• max number of hash buckets (= tasks) of a GROUP BY (default: 8)
> node-scheduler.min-candidates
• max number of workers to run a stage in parallel (default: 10)
> node-scheduler.include-coordinator
• whether to run tasks only on workers or also on the coordinator
> query.schedule-split-batch-size
• number of splits of a stage to start at once

Page 56: Presto - Hadoop Conference Japan 2014

Configuration

> Task execution (for workers)
> task.cpu-timer-enabled
• enables detailed statistics (causes some overhead) (default: true)
> task.max-memory
• memory limit of a task, especially for hash tables used by GROUP BY and JOIN operations (default: 256MB)
• enlarge it if you get a “Task exceeded max memory size” error
> task.shard.max-threads
• max number of threads a worker uses to run active splits (default: number of CPU cores * 4)
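Put together, these options live in each server's etc/config.properties. A sketch combining the settings from the last two slides (the values are illustrative, not recommendations):

    # coordinator
    query.initial-hash-partitions=8
    node-scheduler.min-candidates=10
    node-scheduler.include-coordinator=false

    # workers
    task.cpu-timer-enabled=true
    task.max-memory=1GB
    task.shard.max-threads=32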

Page 57: Presto - Hadoop Conference Japan 2014

5. Roadmap

A report from Presto Meetup 2014

http://www.slideshare.net/dain1/presto-meetup-20140514-34731104

"Presto, Past, Present, and Future" by Dain Sundstrom at Facebook

Page 58: Presto - Hadoop Conference Japan 2014

Presto’s future

> Huge JOIN and GROUP BY

> Spill to disk

> Task recovery

> CREATE VIEW (※implemented)

> Native store (※implemented)
> Fast data store inside Presto workers
> to cache hot data

> Authentication and permissions

Page 59: Presto - Hadoop Conference Japan 2014

Presto’s future

> DDL/DML statements
> CREATE TABLE with partitioning
> DELETE and INSERT

> Plugin repository

> CLI plugin manager

> JOIN and aggregation pushdown

> Custom optimizers

Page 60: Presto - Hadoop Conference Japan 2014

Links

> Web site & documentation
> http://prestodb.io

> Mailing list
> https://groups.google.com/group/presto-users

> GitHub
> https://github.com/facebook/presto

> Guidelines for contribution
> https://github.com/facebook/presto/blob/master/CONTRIBUTING.md

Page 61: Presto - Hadoop Conference Japan 2014

Check: www.treasuredata.com

Cloud service for the entire data pipeline, including Presto. We’re hiring!