Drill status report, Drill Meetup March 13


[email protected] Technologies


Vision: A tool for interactive analysis using SQL

• Fast
– Low latency queries
– Columnar, vectorized execution
– Fully pipelined streaming engine
– Complement native interfaces and MapReduce/Hive/Pig
• Open
– Community driven open source project
– Under Apache Software Foundation
• Modern
– Standard ANSI SQL:2003 (select/into)
– Nested/hierarchical data support
– Schema is optional
– Supports RDBMS, Hadoop and NoSQL

Apache Drill: interactive queries, data analyst, reporting (100 ms-20 min)

MapReduce, Hive, Pig: data mining, modeling, large ETL (20 min-20 hr)

Tenets

• The community is core
– Make it easy for the Hadoop community to work with Drill
– Mostly Java, native where it matters
– Clean APIs at every layer allow extensions in other languages: a DSL in Scala, an optimizer in C, UDFs in Python, etc.
• Memory is scarce: keep things compressed wherever and whenever possible
– Work with record batches, not individual records
– Batches should be off heap: allows dropping to native for things like codec or SIMD support and minimizes GC concerns
– Focus on in-memory formats before disk formats
• Embrace key developments from the past decade
– Cache is the new memory: leverage cache-aware algorithms and vectorized operations
– Rise of nested and late-schema data isn’t a fad, Drill must support it
– Support late tuple materialization and column-aware operators
– Provide extended compression interfaces to apply operations on compressed data, e.g. sort an RLE compressed column, or filter a dictionary coded column while maintaining compression
• Operational simplicity
– Single process, No SPOF, Extensible HOCON based modular configuration, No dep…
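
The idea of applying operations to compressed data can be illustrated with a small sketch. This is not Drill code: the function names (`rle_encode`, `sort_rle`, `dict_encode`, `filter_dict_column`) are hypothetical, and plain Python lists stand in for off-heap record batches. It only shows that an RLE column can be sorted, and a dictionary coded column filtered, without decoding every row.

```python
from itertools import groupby


def rle_encode(values):
    """Run-length encode a list into (value, run_length) pairs."""
    return [(v, len(list(group))) for v, group in groupby(values)]


def rle_decode(runs):
    """Expand (value, run_length) pairs back into a flat list."""
    return [v for v, n in runs for _ in range(n)]


def sort_rle(runs):
    """Sort an RLE column without decoding it.

    Runs of the same value are merged by summing their lengths, then the
    distinct values are ordered. The result is still RLE encoded.
    """
    totals = {}
    for value, length in runs:
        totals[value] = totals.get(value, 0) + length
    return sorted(totals.items())


def dict_encode(values):
    """Dictionary encode a list: returns (dictionary, codes)."""
    dictionary = sorted(set(values))
    index = {v: i for i, v in enumerate(dictionary)}
    return dictionary, [index[v] for v in values]


def filter_dict_column(dictionary, codes, predicate):
    """Filter a dictionary coded column while keeping it encoded.

    The predicate runs once per distinct value, not once per row; rows are
    then selected by comparing integer codes only.
    """
    keep = {i for i, v in enumerate(dictionary) if predicate(v)}
    return dictionary, [c for c in codes if c in keep]


runs = rle_encode([3, 3, 1, 1, 1, 3, 2])
sorted_runs = sort_rle(runs)  # [(1, 3), (2, 1), (3, 3)]

dictionary, codes = dict_encode(["us", "de", "us", "fr", "de"])
_, kept = filter_dict_column(dictionary, codes, lambda v: v != "fr")
```

In both cases the work scales with the number of runs or distinct values rather than the number of rows, which is the motivation for column-aware operators.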

High Level Architecture

• By default, Drillbits hold all roles; modules can optionally be disabled.
• Any Drillbit can act as endpoint for a particular query.
• Zookeeper maintains ephemeral cluster membership information only.
• Small distributed cache utilizing embedded Hazelcast maintains information about individual queue depth, cached query plans, metadata, locality information, etc.
• Originating Drillbit acts as foreman, manages all execution for its particular query, scheduling based on priority, queue depth and locality information.
• Drillbit data communication is streaming and avoids any serialization/deserialization.
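
A rough sketch of how a foreman might weigh priority, queue depth and locality when placing work. Everything here is an assumption made for illustration: the `Drillbit` and `Fragment` classes, their fields, and the tie-breaking order are hypothetical and do not describe Drill's actual scheduler.

```python
from dataclasses import dataclass, field


@dataclass
class Drillbit:
    """Hypothetical view of a node as seen through the distributed cache."""
    name: str
    queue_depth: int = 0
    local_blocks: set = field(default_factory=set)


@dataclass
class Fragment:
    """Hypothetical unit of query work that reads one data block."""
    fragment_id: str
    priority: int
    block_id: str


def assign(fragments, drillbits):
    """Place fragments on Drillbits, highest priority first.

    For each fragment, prefer a node that holds the block locally, then the
    node with the shallowest queue; the name is only a deterministic
    tie-breaker.
    """
    assignments = {}
    for frag in sorted(fragments, key=lambda f: -f.priority):
        best = min(
            drillbits,
            key=lambda d: (frag.block_id not in d.local_blocks, d.queue_depth, d.name),
        )
        best.queue_depth += 1  # the chosen node now has one more queued task
        assignments[frag.fragment_id] = best.name
    return assignments


bits = [
    Drillbit("a", local_blocks={"b1"}),
    Drillbit("b", local_blocks={"b2"}),
    Drillbit("c"),
]
frags = [
    Fragment("f1", priority=1, block_id="b1"),
    Fragment("f2", priority=5, block_id="b2"),
    Fragment("f3", priority=1, block_id="b9"),
]
placement = assign(frags, bits)
```

Here `f2` is placed first because of its priority, `f1` and `f2` land on the nodes holding their blocks, and `f3`, which has no local copy anywhere, goes to the node with the shortest queue.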

[Diagram: Zookeeper; three nodes, each with a Drillbit, a Distributed Cache and a Storage Process]

Drillbit Modules

[Diagram: RPC Endpoint; SQL Parser, HiveQL Parser, Pig Parser; Logical Plan; Optimizer; Physical Plan; Scheduler; Foreman; Operators; Storage Engine Interface; DFS Engine, HBase Engine; Distributed Cache]

Life of a SQL Query

• Query: Human or tool written ANSI compliant query
• Logical Plan: Dataflow of what should logically be done
• Physical Plan: How physical and exchange operators should be applied
• Execution Plan: Assignment to particular nodes and cores
• Execution: Actual query execution

Physical Plan versus Execution Plan

Physical Plan (Optimizer)
• Locations of exchanges
• Types and order of physical operators (including spools)
• Which projection of the raw data to utilize
• Query recovery points
• Estimated memory, CPU, bandwidth and IO required for each operation

Execution Plan (Scheduler)
• Field ordering per fragment
• The level of parallelization of each exchange (remotely and locally)
• The scheduling of each query fragment (including any pauses)
• The memory allocation for each task
• The size of record batches
• What disk locations to use for spooling purposes
• When to start various sub-pieces of the query plan
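
The split between the two plans can be sketched as two record types and a function between them: the optimizer supplies operators with resource estimates, and the scheduler turns those into concrete parallelization and per-task memory. The class names, fields and the sizing rule below are all hypothetical, chosen only to make the division of responsibility concrete.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class PhysicalOp:
    """Optimizer output (hypothetical): an operator plus a resource estimate."""
    name: str
    estimated_memory_mb: int


@dataclass(frozen=True)
class ScheduledOp:
    """Scheduler output (hypothetical): concrete width and memory per task."""
    name: str
    parallelization: int
    memory_per_task_mb: int


def schedule(physical_plan, available_cores, task_memory_limit_mb):
    """Turn a physical plan into an execution plan.

    Assumed sizing rule: use enough tasks to keep each one under the memory
    limit, but never more tasks than available cores and never fewer than one.
    """
    execution_plan = []
    for op in physical_plan:
        wanted = math.ceil(op.estimated_memory_mb / task_memory_limit_mb)
        width = max(1, min(available_cores, wanted))
        per_task = math.ceil(op.estimated_memory_mb / width)
        execution_plan.append(ScheduledOp(op.name, width, per_task))
    return execution_plan


physical = [PhysicalOp("scan", 100), PhysicalOp("sort", 4000)]
execution = schedule(physical, available_cores=8, task_memory_limit_mb=512)
```

With these inputs the small scan stays on a single task while the sort is spread across all eight cores at 500 MB each. The physical plan itself is unchanged; only the scheduler's decisions depend on the cluster.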

Status Report and Plan

• The last few months:
– Define a logical plan
– Build a reference interpreter
– Basic SQL Parser
• March/April
– Larger SQL syntax
– Physical plan
– In-memory compressed data interfaces
– Distributed execution focused on large cluster high performance sort, aggregation and join
• Goals: Alpha Q2, Beta Q3

Exciting things to watch/leverage

• Parquet and ORC file formats
– Drill will probably adopt one as its primary format
• Tez/Stinger: Make Hive more SQL’y, add a new execution engine, faster with ORC
– Depending on status and code drop, maybe portions of execution engine can be shared
• 0xdata: Distributed Fork-Join framework plus analytics engine
– Potential for code sharing at a lower level to simplify/combine cluster coordination and distributed cache, ultimately support MPI-lite workloads
• Impala: Hive replacement query engine. Backend entirely in C++, flat data, primarily in-memory datasets when blocking operators required
– Inspiration around external integration with Hive metastore, collaboration on use and extension of Parquet
• Shark+Spark: Scala query engine, record at a time, focused on intermediate resultset caching
– Ideas around adaptive caching, cleaner Scala interfaces
• Tajo: Cleaner APIs, still record at a time execution, very object oriented
– API inspiration, front end test cases, expansion to reference interpreter via code sharing

Community

Shout-outs:
• Julian Hyde @ Pentaho
• Timothy Chen @ Microsoft
• Chris Merrick @ RJMetrics
• David Alves @ UT Austin
• Sree Vaadi @ SSS/NGData

More needed:
• Not just code, we need use cases, query planning, code review, design help, UI, etc.
• Pick a JIRA, write your own JIRA, just say hi!

Join In

• Mailing list: [email protected]
• Twitter: @ApacheDrill
• Source: http://github.com/apache/incubator-drill
• Jira: https://issues.apache.org/jira/browse/DRILL

Upcoming Events
• Meetups: Late April, Hadoop Summit
• Hackathon: May

Work on Apache Drill full time:
• MapR is hiring full-time open source Drill developers
• Come chat with us or write to [email protected]
