Introduction to Apache Drill - Big Data Bellevue Meetup 20131023

54
Introduction to Apache Dri Big Data Bellevue Meetup Big Data Bellevue Meetup @tnachen @tnachen Timothy Chen

Transcript of Introduction to Apache Drill - Big Data Bellevue Meetup 20131023

Page 1: Introduction to Apache Drill - Big Data Bellevue Meetup 20131023

Introduction to Apache Drill

Big Data Bellevue MeetupBig Data Bellevue Meetup

@tnachen@tnachenTimothy Chen

Page 2: Introduction to Apache Drill - Big Data Bellevue Meetup 20131023

Motivation

Key Facts

Architecture Overview

Page 3: Introduction to Apache Drill - Big Data Bellevue Meetup 20131023

About me

Page 4: Introduction to Apache Drill - Big Data Bellevue Meetup 20131023
Page 5: Introduction to Apache Drill - Big Data Bellevue Meetup 20131023
Page 6: Introduction to Apache Drill - Big Data Bellevue Meetup 20131023
Page 7: Introduction to Apache Drill - Big Data Bellevue Meetup 20131023
Page 8: Introduction to Apache Drill - Big Data Bellevue Meetup 20131023
Page 9: Introduction to Apache Drill - Big Data Bellevue Meetup 20131023

I Open Source

Page 10: Introduction to Apache Drill - Big Data Bellevue Meetup 20131023
Page 11: Introduction to Apache Drill - Big Data Bellevue Meetup 20131023
Page 12: Introduction to Apache Drill - Big Data Bellevue Meetup 20131023
Page 13: Introduction to Apache Drill - Big Data Bellevue Meetup 20131023
Page 14: Introduction to Apache Drill - Big Data Bellevue Meetup 20131023
Page 15: Introduction to Apache Drill - Big Data Bellevue Meetup 20131023
Page 16: Introduction to Apache Drill - Big Data Bellevue Meetup 20131023
Page 17: Introduction to Apache Drill - Big Data Bellevue Meetup 20131023
Page 18: Introduction to Apache Drill - Big Data Bellevue Meetup 20131023
Page 19: Introduction to Apache Drill - Big Data Bellevue Meetup 20131023
Page 20: Introduction to Apache Drill - Big Data Bellevue Meetup 20131023
Page 21: Introduction to Apache Drill - Big Data Bellevue Meetup 20131023
Page 22: Introduction to Apache Drill - Big Data Bellevue Meetup 20131023

Use Case: Marketing Campaign

Jane, a marketing analystDetermine target segmentsData from different sources

Page 23: Introduction to Apache Drill - Big Data Bellevue Meetup 20131023

Use Case: Crime Detection

• Online purchases• Fraud, billing, etc.• Batch-generated overview• Modes

– Explorative– Alerts

Page 24: Introduction to Apache Drill - Big Data Bellevue Meetup 20131023

Requirements

• Support for different data sources• Support for different query interfaces• Low-latency/real-time• Ad-hoc queries• Scalable, reliable

Page 25: Introduction to Apache Drill - Big Data Bellevue Meetup 20131023
Page 26: Introduction to Apache Drill - Big Data Bellevue Meetup 20131023

Google’s Dremel

http://research.google.com/pubs/pub36632.html

Sergey Melnik, Andrey Gubarev, Jing Jing Long, Geoffrey Romer, Shiva Shivakumar, Matt Tolton, Theo Vassilakis, Proc. of the 36th Int'l Conf on Very Large Data Bases (2010), pp. 330-339

Dremel is a scalable, interactive ad-hoc query system for analysis of read-only nested data. By combining multi-level execution trees and columnar data layout, it is capable of running aggregation queries over trillion-row tables in seconds. The system scales to thousands of CPUs and petabytes of data, and has thousands of users at Google.…

““Dremel is a scalable, interactive ad-hoc query system for analysis of read-only nested data. By combining multi-level execution trees and columnar data layout, it is capable of running aggregation queries over trillion-row tables in seconds. The system scales to thousands of CPUs and petabytes of data, and has thousands of users at Google.…

Page 27: Introduction to Apache Drill - Big Data Bellevue Meetup 20131023

Google’s Dremel

multi-level execution trees

columnar data layout

Page 28: Introduction to Apache Drill - Big Data Bellevue Meetup 20131023

Google’s Dremel

nested data + schema column-striped representation

map nested data to tables

Page 29: Introduction to Apache Drill - Big Data Bellevue Meetup 20131023

Google’s Dremel

experiments:datasets & query performance

Page 30: Introduction to Apache Drill - Big Data Bellevue Meetup 20131023

Apache Drill–key facts

Inspired by Google’s Dremel

Standard SQL 2003 support

Plug-able data sources

Nested data is a first-class citizen

Schema is optional

Community driven, open, 100’s involved

Page 31: Introduction to Apache Drill - Big Data Bellevue Meetup 20131023
Page 32: Introduction to Apache Drill - Big Data Bellevue Meetup 20131023
Page 33: Introduction to Apache Drill - Big Data Bellevue Meetup 20131023

High-level Architecture

Page 34: Introduction to Apache Drill - Big Data Bellevue Meetup 20131023

Principled Query Execution

Source query—what we want to do (analyst friendly)

Logical Plan— what we want to do (language agnostic, computer friendly)

Physical Plan—how we want to do it (the best way we can tell)

Execution Plan—where we want to do it

Page 35: Introduction to Apache Drill - Big Data Bellevue Meetup 20131023

Principled Query Execution

Source Query Parser

Logical Plan Optimizer

Physical Plan Execution

SQL 2003 DrQLMongoQLDSL

scanner APITopologyCFetc.

query: [ { @id: "log", op: "sequence", do: [ { op: "scan", source: “logs” }, { op: "filter", condition: "x > 3” },

parser API

Page 36: Introduction to Apache Drill - Big Data Bellevue Meetup 20131023

Wire-level Architecture

Each node: Drillbit - maximize data locality

Co-ordination, query planning, execution, etc, are distributed

Any node can act as endpoint for a query—foreman

Storage ProcessStorage Process

Drillbit

node

Storage ProcessStorage Process

Drillbit

node

Storage ProcessStorage Process

Drillbit

node

Storage ProcessStorage Process

Drillbit

node

Page 37: Introduction to Apache Drill - Big Data Bellevue Meetup 20131023

Wire-level Architecture

Curator/Zookeeper for ephemeral cluster membership info

Distributed cache (Hazelcast) for metadata, locality information, etc.

Curator/ZkCurator/Zk

Distributed Cache

Distributed Cache

Storage ProcessStorage Process

Drillbit

node

Storage ProcessStorage Process

Drillbit

node

Storage ProcessStorage Process

Drillbit

node

Storage ProcessStorage Process

Drillbit

node

Distributed Cache

Distributed Cache

Distributed Cache

Distributed Cache

Distributed Cache

Distributed Cache

Page 38: Introduction to Apache Drill - Big Data Bellevue Meetup 20131023

Wire-level Architecture

Originating Drillbit acts as foreman: manages query execution, scheduling, locality information, etc.

Streaming data communication avoiding SerDe

Curator/ZkCurator/Zk

Distributed Cache

Distributed Cache

Storage ProcessStorage Process

Drillbit

node

Storage ProcessStorage Process

Drillbit

node

Storage ProcessStorage Process

Drillbit

node

Storage ProcessStorage Process

Drillbit

node

Distributed Cache

Distributed Cache

Distributed Cache

Distributed Cache

Distributed Cache

Distributed Cache

Page 39: Introduction to Apache Drill - Big Data Bellevue Meetup 20131023

Wire-level ArchitectureForeman turns into root of the multi-level execution tree, leafs activate their storage engine interface. node

node node

Curator/ZkCurator/Zk

Page 40: Introduction to Apache Drill - Big Data Bellevue Meetup 20131023
Page 41: Introduction to Apache Drill - Big Data Bellevue Meetup 20131023

On the shoulders of giants …

Jackson for JSON SerDe for metadata

Typesafe HOCON for configuration and module management

Netty4 as core RPC engine, protobuf for communication

Vanilla Java, LArray and Netty ByteBuf for off-heap large data structures

Hazelcast for distributed cache

Netflix Curator on top of Zookeeper for service registry

Optiq for SQL parsing and cost optimization

Parquet (http://parquet.io)/ ORC

Janino for expression compilation

ASM for ByteCode manipulation

Yammer Metrics for metrics

Guava extensively

Carrot HPC for primitive collections

Page 42: Introduction to Apache Drill - Big Data Bellevue Meetup 20131023

Key features

Full SQL – ANSI SQL 2003

Nested Data as first class citizen

Optional Schema

Extensibility Points …

Page 43: Introduction to Apache Drill - Big Data Bellevue Meetup 20131023

Extensibility Points

Source query parser API

Custom operators, UDF logical plan

Serving tree, CF, topology physical plan/optimizer

Data sources &formats scanner API

Source Query Parser

Logical Plan Optimizer

Physical Plan Execution

Page 44: Introduction to Apache Drill - Big Data Bellevue Meetup 20131023

User Interfaces

API—DrillClient

Encapsulates endpoint discovery

Supports logical and physical plan submission, query cancellation, query status

Supports streaming return results

JDBC driver, converting JDBC into DrillClient communication.

REST proxy for DrillClient

Page 45: Introduction to Apache Drill - Big Data Bellevue Meetup 20131023

User Interfaces

Page 46: Introduction to Apache Drill - Big Data Bellevue Meetup 20131023

Let’s get our hands dirty…

Page 47: Introduction to Apache Drill - Big Data Bellevue Meetup 20131023

Demo

Install

Preparation

Usage

$ wget http://people.apache.org/~jacques/apache-drill-1.0.0-m1.rc3/apache-drill-1.0.0-m1-binary-release.tar.gz$ tar -zxf apache-drill-1.0.0-m1-binary-release.tar.gz

$ export JAVA_HOME=/Library/Java/JavaVirtualMachines/jdk1.7.0_11.jdk/Contents/Home$ export DRILL_LOG_DIR=$PWD$ ./bin/drillbit.sh start

$ ./bin/sqlline -u jdbc:drill:schema=parquet-local -n admin -p admin

Page 48: Introduction to Apache Drill - Big Data Bellevue Meetup 20131023

Useful Resources

Getting Started guidehttps://github.com/vrtx/incubator-drill/blob/getting_started/docs/getting_started.rst

Demo HowTohttps://cwiki.apache.org/confluence/display/DRILL/Demo+HowTo

How to build/install Apache Drill on Ubuntu 13.04http://www.confusedcoders.com/bigdata/apache-drill/how-to-build-apache-drill-on-ubuntu-13-04

Page 49: Introduction to Apache Drill - Big Data Bellevue Meetup 20131023

Be a part of it!

Page 50: Introduction to Apache Drill - Big Data Bellevue Meetup 20131023

Status

Heavy development by multiple organizations (MapR, Pentaho, Microsoft, Thoughtworks, XingCloud, etc.)

Currently more than 100k LOC

Alpha available via http://people.apache.org/~jacques/apache-drill-1.0.0-m1.rc3/

Page 51: Introduction to Apache Drill - Big Data Bellevue Meetup 20131023

Kudos to …Julian Hyde, Pentaho

Lisen Mu, XingCloud

Tim Chen, Microsoft

Chris Merrick, RJMetrics

David Alves, UT Austin

Sree Vaadi, SSS

Srihari Srinivasan, ThoughtWorks

Alexandre Beche, CERN

Jason Altekruse, MapRhttp://incubator.apache.org/drill/team.html

• Ben Becker, MapR• Jacques Nadeau, MapR• Ted Dunning, MapR• Keys Botzum, MapR• Jason Frantz• Ellen Friedman• Chris Wensel, Concurrent• Gera Shegalov, Oracle• Ryan Rawson, Ohm Data

Page 52: Introduction to Apache Drill - Big Data Bellevue Meetup 20131023

Contributing

Contributions appreciated—not only code drops …

Test data & test queries

Use case scenarios (textual/SQL queries)

Documentation

Page 53: Introduction to Apache Drill - Big Data Bellevue Meetup 20131023

Engage!

Follow @ApacheDrill on Twitter

Sign up at mailing lists (user | dev) http://incubator.apache.org/drill/mailing-lists.html

Standing G+ hangouts every Tuesday at 18:00 CEThttp://j.mp/apache-drill-hangouts

Keep an eye on http://drill-user.org/

Page 54: Introduction to Apache Drill - Big Data Bellevue Meetup 20131023

Twitter: @tnachen

Email: [email protected]