Speed up Interactive Analytic Queries over Existing Big Data on Hadoop with Presto

Liang-Chi Hsieh, HadoopCon 2014 in Taiwan

The slides for HadoopCon 2014 in Taiwan.

Transcript of Speed up Interactive Analytic Queries over Existing Big Data on Hadoop with Presto

Page 1

Speed up Interactive Analytic Queries over Existing Big Data on Hadoop with Presto

Liang-Chi Hsieh

HadoopCon 2014 in Taiwan

Page 2

In today’s talk

• Introduction of Presto

• Distributed architecture

• Query model

• Deployment and configuration

• Data visualization with Presto - Demo

Page 3

SQL on/over Hadoop

• Hive

  • Mature and proven solution (0.13.x)

  • Drawback: its execution model is based on MapReduce

  • Better execution engines: Hive-Tez and Hive-Spark

• Alternative, usually faster options include

  • Impala, Presto, Drill, ...

Page 4

Presto

• Presto is a distributed SQL query engine optimized for ad-hoc analysis at interactive speed

• Data scale: GBs to PBs

• Deployed at:

  • Facebook, Netflix, Dropbox, Treasure Data, Airbnb, Qubole

Page 5

History of Presto

• Fall 2012

  • Development of Presto started at Facebook

• Spring 2013

  • It was rolled out to the entire company and became a major interactive data warehouse

• Winter 2013

  • Open-sourced

Page 6

The Problems to Solve

• Hive is not optimized for interactive data analysis as the data size grows to petabyte scale

• In practice, reduced data has to be stored in an interactive DB that provides quick query responses

  • Redundant maintenance cost, out-of-date data views, data transfer, ...

• The need to incorporate other data that is not stored in HDFS

Page 7

Typical Batch Data Architecture

[Diagram labels: HDFS, Data Flow, Batch Run, DB, Query]

• Views generated in batch may be out of date

• Batch workflow is too slow

Page 8

Interactive Query on HDFS

[Diagram labels: HDFS, Data Flow, Interactive query, Presto, Query]

Page 9

Interactive Query on HDFS and other Data Sources

[Diagram labels: HDFS, MySQL, Cassandra, Data Flow, Interactive query, Presto, Query]

Page 10

Distributed Architecture

• Coordinator

  • Parsing statements

  • Planning queries

  • Managing Presto workers

• Worker

  • Executing tasks

  • Processing data

Page 12

Storage Plugins

• Connectors

  • Provide interfaces for fetching metadata, getting data locations, and accessing the data

• Current connectors (v0.76)

  • Hive: Hadoop 1.x, Hadoop 2.x, CDH 4, CDH 5

  • Cassandra

  • MySQL

  • Kafka

  • PostgreSQL

Page 14

Presto Clients

• Protocol: HTTP + JSON

• Client libraries available in several programming languages:

  • Python, PHP, Ruby, Node.js, Java, R

• ODBC through Prestogres
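
As a rough illustration of what "HTTP + JSON" means for a client, the sketch below submits a statement and then follows the `nextUri` links in the replies until no more pages remain. The `/v1/statement` endpoint, the `X-Presto-User` header, and the `data`, `nextUri`, and `error` fields reflect the client protocol as commonly described rather than anything stated on the slides, so treat them as assumptions. The user name `demo` and the helper names are hypothetical.

```python
import json
import urllib.request
from typing import Callable, Iterator, Optional


def http_fetch(method: str, url: str, body: Optional[str] = None) -> dict:
    """Send one request to the coordinator and decode the JSON reply."""
    request = urllib.request.Request(
        url,
        data=body.encode("utf-8") if body is not None else None,
        method=method,
        headers={"X-Presto-User": "demo"},  # hypothetical user name
    )
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read().decode("utf-8"))


def run_query(coordinator: str, sql: str,
              fetch: Callable[..., dict] = http_fetch) -> Iterator[list]:
    """Yield result rows, following nextUri until the last page is reached."""
    # Assumed endpoint: the statement is POSTed as the plain-text request body.
    page = fetch("POST", coordinator + "/v1/statement", sql)
    while True:
        if "error" in page:
            raise RuntimeError(page["error"].get("message", "query failed"))
        for row in page.get("data", []):
            yield row
        next_uri = page.get("nextUri")
        if next_uri is None:
            return  # no further pages: the query has finished
        page = fetch("GET", next_uri)
```

The `fetch` parameter exists only so the paging loop can be exercised without a running cluster; a real client would also wait between polls and set catalog and schema headers.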

Page 15

Query Model

• Presto’s execution engine does not use MapReduce

• It employs a custom query and execution engine

• Based on a DAG, which makes it more like Apache Tez, Spark, or MPP databases

Page 16

Query Execution

• Presto executes ANSI-compatible SQL statements

• Coordinator

  • SQL parser

  • Query planner

  • Execution planner

• Workers

  • Task execution scheduler

Page 17

Query Execution

[Diagram labels: AST, Query planner, Query plan, Execution planner, Execution plan, Connector, Metadata, NodeManager]

Page 18

Query Planner

SQL:

SELECT name, count(*) FROM logs GROUP BY name

Logical query plan:

Table scan → GROUP BY → Output

Distributed query plan:

Stage-2: Table scan → Partial aggregation → Output buffer

Stage-1: Exchange client → Final aggregation → Output buffer

Stage-0: Exchange client → Output
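
The split between partial and final aggregation can be mimicked in a few lines of Python. This is only a sketch of the idea for the `GROUP BY name` query above, not Presto's implementation: the function names and the sample rows are made up, and the exchange between stages is reduced to passing Python objects.

```python
from collections import Counter
from typing import Iterable, List


def partial_aggregation(split: Iterable[str]) -> Counter:
    """Stage-2 task: count names within one split of the logs table."""
    return Counter(split)


def final_aggregation(partials: Iterable[Counter]) -> Counter:
    """Stage-1 task: merge the partial counts received from upstream tasks."""
    total: Counter = Counter()
    for partial in partials:
        total.update(partial)  # adds counts key by key
    return total


# Made-up data: two splits of the "name" column, one per scanning task.
splits: List[List[str]] = [["alice", "bob", "alice"], ["bob", "carol"]]
result = final_aggregation(partial_aggregation(s) for s in splits)
print(dict(result))  # {'alice': 2, 'bob': 2, 'carol': 1}
```

Each scanning task only ships one small counter per split to the next stage, which is why the partial step reduces the data moved between stages.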

Page 19

Distributed query plan:

Stage-2: Table scan → Partial aggregation → Output buffer

Stage-1: Exchange client → Final aggregation → Output buffer

Stage-0: Exchange client → Output

[Diagram: the tasks of these stages placed on Worker 1 and Worker 2]

* Tasks run on workers

Page 20

Query Execution on Presto

• SQL is converted into stages, tasks, and drivers

• Tasks operate on splits, which are sections of the data

• The lowest stages retrieve splits from connectors

Page 21

Query Execution on Presto

• Tasks run in parallel

  • Pipelined to reduce wait time between stages

  • If one task fails, the query fails

• No disk I/O

  • If aggregated data does not fit in memory, the query fails

  • May spill to disk in the future
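
The two properties above, pipelining and in-memory-only aggregation, can be imitated with Python generators. This is a toy model under stated assumptions, not how Presto is built: `max_groups` is a hypothetical stand-in for a real memory budget, and `QueryFailed` is an invented exception name.

```python
from typing import Dict, Iterable, Iterator


class QueryFailed(Exception):
    """Raised when the aggregation state outgrows the memory budget."""


def table_scan(splits: Iterable[Iterable[str]]) -> Iterator[str]:
    """Yield rows one at a time so the next operator can start immediately."""
    for split in splits:
        for row in split:
            yield row


def aggregate(rows: Iterable[str], max_groups: int) -> Dict[str, int]:
    """Count rows per key entirely in memory; fail rather than spill to disk."""
    counts: Dict[str, int] = {}
    for key in rows:
        if key not in counts and len(counts) >= max_groups:
            # Hypothetical budget check: one more group would not fit.
            raise QueryFailed("aggregation state exceeds the memory budget")
        counts[key] = counts.get(key, 0) + 1
    return counts
```

For example, `aggregate(table_scan([["a", "b"], ["a"]]), max_groups=2)` consumes rows as they are produced, while the same call with `max_groups=1` raises `QueryFailed` as soon as a second distinct key arrives.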

Page 22

Deployment & Configuration

• Basically, there are four configurations to set up for Presto

  • Node properties: environment configuration specific to each node

  • JVM config

  • Config properties: configuration for the Presto server

  • Catalog properties: configuration for connectors

• Detailed documentation is provided on the Presto site

Page 23

Node Properties

• etc/node.properties

• Minimal configuration:

node.environment=production
node.id=ffffffff-ffff-ffff-ffff-ffffffffffff
node.data-dir=/var/presto/data
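
The `node.id` value shown above is a placeholder; each node needs its own unique identifier. A minimal sketch of generating the file per node follows, assuming a random UUID is an acceptable identifier. The function name `render_node_properties` is hypothetical and not part of Presto.

```python
import uuid


def render_node_properties(environment: str, data_dir: str, node_id: str = "") -> str:
    """Render etc/node.properties; a fresh UUID is used when no node_id is given."""
    node_id = node_id or str(uuid.uuid4())
    lines = [
        f"node.environment={environment}",
        f"node.id={node_id}",
        f"node.data-dir={data_dir}",
    ]
    return "\n".join(lines) + "\n"


# Prints a file body with a different node.id on every run.
print(render_node_properties("production", "/var/presto/data"))
```

Since the identifier should stay stable for a given node, a deployment script would typically generate it once and write it to disk rather than regenerate it on every start.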

Page 24

Config Properties

• etc/config.properties

• Minimal configuration for coordinator:

coordinator=true
node-scheduler.include-coordinator=false
http-server.http.port=8080
task.max-memory=1GB
discovery-server.enabled=true
discovery.uri=http://example.net:8080

Page 25

Config Properties

• Minimal configuration for worker:

coordinator=false
http-server.http.port=8080
task.max-memory=1GB
discovery.uri=http://example.net:8080
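
Comparing the two minimal configurations, the worker differs from the coordinator only in the `coordinator` flag and in omitting the scheduler and discovery-server lines. The sketch below renders either variant from one function to make that difference explicit. The function name and its parameters are hypothetical; the property names are the ones shown on the slides.

```python
def render_config_properties(is_coordinator: bool, discovery_uri: str,
                             http_port: int = 8080,
                             task_max_memory: str = "1GB") -> str:
    """Render a minimal etc/config.properties for a coordinator or a worker."""
    lines = [f"coordinator={'true' if is_coordinator else 'false'}"]
    if is_coordinator:
        # Keep query processing off the coordinator, as in the slide's example.
        lines.append("node-scheduler.include-coordinator=false")
    lines.append(f"http-server.http.port={http_port}")
    lines.append(f"task.max-memory={task_max_memory}")
    if is_coordinator:
        # The coordinator also hosts the discovery service in this setup.
        lines.append("discovery-server.enabled=true")
    lines.append(f"discovery.uri={discovery_uri}")
    return "\n".join(lines) + "\n"


print(render_config_properties(False, "http://example.net:8080"))
```

Both roles point `discovery.uri` at the same address, which is how workers find the coordinator in this minimal setup.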

Page 26

Catalog Properties

• Presto connectors are mounted in catalogs

• Create catalog properties in etc/catalog

• For example, the configuration etc/catalog/hive.properties for the Hive connector:

connector.name=hive-hadoop2
hive.metastore.uri=thrift://example.net:9083

Page 27

Presto’s Roadmap

• In the next year:

• Complex data structures

• Create table with partitioning

• Huge joins and aggregations

• Spill to disk

• Basic task recovery

• Native store

• Authentication & authorization

* Based on the Presto Meetup, May 2014

Page 28

Data Visualization with Presto - Demo

• According to Presto’s roadmap, there will be an official ODBC driver for connecting Presto to major BI tools

• Prestogres provides an alternative solution for now

  • Uses PostgreSQL’s ODBC driver

• It is also not difficult to integrate Presto with other data visualization tools such as Grafana

Page 29

Grafana

• An open source metrics dashboard and graph editor for Graphite, InfluxDB & OpenTSDB

• But we may not be satisfied with these DBs, or may just want to visualize data on HDFS, especially large-scale data

Page 30

Integrating Presto with Grafana

• Presto provides many useful date & time functions

  • current_date → date

  • current_time → time with time zone

  • current_timestamp → timestamp with time zone

  • from_unixtime(unixtime) → timestamp

  • localtime → time

  • now() → timestamp with time zone

  • to_unixtime(timestamp) → double

Page 31

Integrating Presto with Grafana

• Presto also supports many common aggregation functions

  • avg(x) → double

  • count(x) → bigint

  • max(x) → [same as input]

  • min(x) → [same as input]

  • sum(x) → [same as input]

  • ...

Page 32

Integrating Presto with Grafana

• So we implemented a custom datasource for Presto to work with Grafana

• Interactively visualize data on HDFS

[Diagram labels: HDFS, Interactive query, Presto, Grafana]
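
To give a feel for how the date & time and aggregation functions listed earlier fit a dashboard, the sketch below builds a time-bucketed query and reshapes its rows into a series. It is not the datasource from the talk: the table and column names are hypothetical, and the `{"target", "datapoints"}` shape with millisecond timestamps is an assumed target format.

```python
from typing import Dict, List, Sequence


def build_timeseries_sql(table: str, time_col: str, value_col: str,
                         start: int, end: int, interval: int) -> str:
    """Build a query that averages value_col per interval-second bucket.

    Names are interpolated directly, so this must only be called with
    trusted identifiers (a sketch, not production code).
    """
    bucket = f"floor(to_unixtime({time_col}) / {interval}) * {interval}"
    return (
        f"SELECT {bucket} AS bucket, avg({value_col}) AS value "
        f"FROM {table} "
        f"WHERE {time_col} >= from_unixtime({start}) "
        f"AND {time_col} < from_unixtime({end}) "
        f"GROUP BY {bucket} "
        f"ORDER BY bucket"
    )


def to_datapoints(target: str, rows: Sequence[Sequence[float]]) -> Dict[str, object]:
    """Convert (bucket_seconds, value) rows into [value, timestamp_ms] pairs."""
    points: List[List[float]] = [[value, int(bucket) * 1000] for bucket, value in rows]
    return {"target": target, "datapoints": points}
```

A call such as `build_timeseries_sql("hive.default.metrics", "ts", "cpu", 0, 3600, 60)` yields one averaged point per minute for the requested hour, and `to_datapoints("cpu", rows)` turns the result rows into a plottable series.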

Page 33

Demo

Page 34

References

• Martin Traverso, “Presto: Interacting with petabytes of data at Facebook”

• Sadayuki Furuhashi, “Presto: Interactive SQL Query Engine for Big Data”

• Sundstrom, “Presto: Past, Present, and Future”

• “Presto Concepts” in Presto’s documentation
