OSCON 2013: Apache Drill Workshop > Execution & ValueVectors

1

Apache Drill: Execution

Jacques Nadeau, OSCON July 23, 2013

[email protected] |@intjesus

2

Drill is…

– Optimistic & Pipelined
– Columnar & Late materialized
– Vectorized
– Language Agnostic
– MPP Query Engine

3

Optimistic Execution

Optimistic Recovery
Pipelined Scheduling
Pipelined Communication

4

Optimistic Recovery

Assume failures
Don’t overbuild for them
– The shorter the queries, the less work lost on failure
Graceful management of node failure at a system level
– Individual queries must be rerun
Avoid the overhead of persistence and barriers

5

Pipelined Operators

Pipelining: push data along as soon as it is available
– Cross-operator and cross-node
Straightforward for simple operators like filter, project
Also possible with less common things like sort, radix hash join
– External Sort: merge only what is needed to push first part of data down pipeline
Destination buffering rather than source buffering
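The external sort bullet can be illustrated with a lazy k-way merge. This is a minimal sketch, not Drill's sort operator: the class name `LazyMerge` and the use of in-memory `int[]` runs are assumptions made for illustration. Each call to `next()` does only the work needed to produce one more value, so the first part of the sorted output can be pushed downstream before the rest is merged.

```java
import java.util.Iterator;
import java.util.List;
import java.util.NoSuchElementException;
import java.util.PriorityQueue;

// Hypothetical sketch: lazily merges already-sorted runs.
final class LazyMerge {
    static Iterator<Integer> merge(List<int[]> sortedRuns) {
        // Heap entries are {runIndex, positionInRun}, ordered by the value they point at.
        PriorityQueue<int[]> heap = new PriorityQueue<>(
            (x, y) -> Integer.compare(sortedRuns.get(x[0])[x[1]], sortedRuns.get(y[0])[y[1]]));
        for (int r = 0; r < sortedRuns.size(); r++) {
            if (sortedRuns.get(r).length > 0) {
                heap.add(new int[] {r, 0});
            }
        }
        return new Iterator<Integer>() {
            @Override
            public boolean hasNext() {
                return !heap.isEmpty();
            }

            @Override
            public Integer next() {
                int[] top = heap.poll();
                if (top == null) {
                    throw new NoSuchElementException();
                }
                int[] run = sortedRuns.get(top[0]);
                int value = run[top[1]];
                // Advance within the same run only when the consumer asks for more.
                if (top[1] + 1 < run.length) {
                    heap.add(new int[] {top[0], top[1] + 1});
                }
                return value;
            }
        };
    }
}
```

A downstream operator that only needs the first few values simply stops calling `next()`; the remaining run contents are never touched.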

6

Full pipelining requires query-at-once scheduling

Query at Once: schedule entire query at once
Pros:
– Fastest data movement
– Less herd effect
Cons:
– Poorer workload distribution
– Failure checkpoints hard

Task by Task: schedule each task when all previous tasks are completed
Pros:
– Potentially better workload distribution
– Failure checkpoints straightforward
Cons:
– Slower data movement
– Poorer routing decisions

7

Comparison with MapReduce

Barriers
– Map completion required before shuffle/reduce commencement
– All maps must complete before reduce can start
– In chained jobs, one job must finish entirely before the next one can start
Persistence and Recoverability
– Data is persisted to disk between each barrier
– Serialization and deserialization are required between execution phases

8

Record versus Columnar Representation

[Figure: record layout compared with column layout]

9

Data Format Example

Donut                         Price  Icing
Bacon Maple Bar               2.19   [Maple Frosting, Bacon]
Portland Cream                1.79   [Chocolate]
The Loop                      2.29   [Vanilla, Fruitloops]
Triple Chocolate Penetration  2.79   [Chocolate, Cocoa Puffs]

Record Encoding
Bacon Maple Bar, 2.19, Maple Frosting, Bacon, Portland Cream, 1.79, Chocolate, The Loop, 2.29, Vanilla, Fruitloops, Triple Chocolate Penetration, 2.79, Chocolate, Cocoa Puffs

Columnar Encoding
Bacon Maple Bar, Portland Cream, The Loop, Triple Chocolate Penetration
2.19, 1.79, 2.29, 2.79
Maple Frosting, Bacon, Chocolate, Vanilla, Fruitloops, Chocolate, Cocoa Puffs
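The two encodings can be sketched in Java as follows. All class and method names here are hypothetical. The `icingOffsets` list is also an assumption of this sketch: the slide does not show how the boundaries of each record's icing list are kept once the lists are flattened into one column.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch of record versus columnar encoding for the donut table.
final class DonutEncodings {
    static final class Donut {
        final String name;
        final double price;
        final List<String> icing;

        Donut(String name, double price, String... icing) {
            this.name = name;
            this.price = price;
            this.icing = Arrays.asList(icing);
        }
    }

    /** Record encoding: all fields of one record are written before the next record starts. */
    static List<Object> recordEncode(List<Donut> donuts) {
        List<Object> out = new ArrayList<>();
        for (Donut d : donuts) {
            out.add(d.name);
            out.add(d.price);
            out.addAll(d.icing);
        }
        return out;
    }

    /** Columnar encoding: one contiguous list per field. */
    static final class Columns {
        final List<String> names = new ArrayList<>();
        final List<Double> prices = new ArrayList<>();
        final List<String> icings = new ArrayList<>();
        // Assumption: icingOffsets.get(i) to icingOffsets.get(i + 1) bounds record i's icings.
        final List<Integer> icingOffsets = new ArrayList<>();
    }

    static Columns columnarEncode(List<Donut> donuts) {
        Columns c = new Columns();
        c.icingOffsets.add(0);
        for (Donut d : donuts) {
            c.names.add(d.name);
            c.prices.add(d.price);
            c.icings.addAll(d.icing);
            c.icingOffsets.add(c.icings.size());
        }
        return c;
    }

    static List<Donut> menu() {
        return Arrays.asList(
            new Donut("Bacon Maple Bar", 2.19, "Maple Frosting", "Bacon"),
            new Donut("Portland Cream", 1.79, "Chocolate"),
            new Donut("The Loop", 2.29, "Vanilla", "Fruitloops"),
            new Donut("Triple Chocolate Penetration", 2.79, "Chocolate", "Cocoa Puffs"));
    }
}
```

Reading only the price column touches `prices` and nothing else, which is the property the columnar layout is after.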

10

Places to Apply Columnar

Columnar Storage (on disk)
– Improved compression when similar data is co-located
– Alternative compression techniques: dictionary, RLE, delta
– Avoid column reads when not needed
Columnar Execution (in memory)
– Improved cache locality
– Improved CPU pipelining (especially with things like null checks)
– Can reduce memory copies
– Maintain unusual encoding schemes for direct relational operator use

11

Columnar Execution: When to materialize

Users want rows
Data is columnar
When do you transform?
– On read into memory
– On return to user
– Somewhere in between
Later is generally better
– Not always :)

12

Late Decompression

Don’t necessarily materialize each value
Reduce memory consumption
Reduce CPU cost
Examples: RLE, Bit Dictionary

13

Example: RLE and Sum

Dataset (value, run length)
– 2, 4
– 8, 10
Goal
– Sum all the records
Normal Work
– Decompress & store: 2, 2, 2, 2, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8
– Add: 2 + 2 + 2 + 2 + 8 + 8 + 8 + 8 + 8 + 8 + 8 + 8 + 8 + 8
Optimized Work
– 2 * 4 + 8 * 10
– Less memory, fewer operations
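Both approaches from this slide fit in a few lines of Java. This is a minimal sketch with hypothetical names (`RleSum`, `sumByDecompressing`, `sumOnRuns`); it assumes the RLE data arrives as two parallel arrays of values and run lengths.

```java
// Hypothetical sketch: summing run-length-encoded data with and without decompression.
final class RleSum {
    /** Normal work: expand every run into individual values, then add them one by one. */
    static long sumByDecompressing(int[] values, int[] runLengths) {
        int total = 0;
        for (int len : runLengths) {
            total += len;
        }
        int[] expanded = new int[total];
        int pos = 0;
        for (int i = 0; i < values.length; i++) {
            for (int j = 0; j < runLengths[i]; j++) {
                expanded[pos++] = values[i];
            }
        }
        long sum = 0;
        for (int v : expanded) {
            sum += v;
        }
        return sum;
    }

    /** Optimized work: one multiply-add per run and no expanded buffer. */
    static long sumOnRuns(int[] values, int[] runLengths) {
        long sum = 0;
        for (int i = 0; i < values.length; i++) {
            sum += (long) values[i] * runLengths[i];
        }
        return sum;
    }
}
```

For the dataset above, both methods return 88; the second performs two multiplications and two additions instead of fourteen stores and fourteen additions.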

14

Example: Bitpacked Dictionary VarChar Sort

Dataset:
– Dictionary: [Rupert, Bill, Larry]
– Values: [1,0,1,2,1,2,1,0]
Normal Work:
– Decompress & store: Bill, Rupert, Bill, Larry, Bill, Larry, Bill, Rupert
– Sort: ~24 comparisons of variable-width strings (requiring length lookup and check during comparisons)
Optimized Work:
– Sort dictionary: {Bill: 1, Larry: 2, Rupert: 0}
– Sort bitpacked values
– Work: max 3 string comparisons, ~24 comparisons of fixed-width dictionary bits
– Data in 16 bits as opposed to 368/736 for UTF8/16
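The optimized path can be sketched as below. The names are hypothetical, and for readability the codes are held in a plain `int[]` rather than actually bit-packed into 2 bits each, so this shows the comparison savings but not the memory savings. Strings are compared only while ordering the three dictionary entries; the values themselves are sorted as integers.

```java
import java.util.Arrays;

// Hypothetical sketch: sorting dictionary-encoded strings without materializing them.
final class DictionarySort {
    /** Returns the codes in sorted order, expressed as indexes into sortedDictionary(dict). */
    static int[] sortCodes(String[] dict, int[] codes) {
        // Order the old codes by the strings they stand for (the only string comparisons).
        Integer[] order = new Integer[dict.length];
        for (int i = 0; i < order.length; i++) {
            order[i] = i;
        }
        Arrays.sort(order, (a, b) -> dict[a].compareTo(dict[b]));

        // rank[oldCode] = position of that entry in the sorted dictionary.
        int[] rank = new int[dict.length];
        for (int r = 0; r < order.length; r++) {
            rank[order[r]] = r;
        }

        // Remap and sort using fixed-width integer comparisons only.
        int[] remapped = new int[codes.length];
        for (int i = 0; i < codes.length; i++) {
            remapped[i] = rank[codes[i]];
        }
        Arrays.sort(remapped);
        return remapped;
    }

    static String[] sortedDictionary(String[] dict) {
        String[] sorted = dict.clone();
        Arrays.sort(sorted);
        return sorted;
    }
}
```

With the slide's dataset, the sorted dictionary is [Bill, Larry, Rupert] and the sorted codes are [0, 0, 0, 0, 1, 1, 2, 2], which decode to four Bills, two Larrys, and two Ruperts.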

15

Storage versus Relational operators

How do you write operator implementations for many different data representations?
– If you’re trying to inline, you have to avoid abstractions too complex for the JVM to simplify
Push optimizations to storage layer for things like RLE
– Rare that data is exactly in desired format beyond simplest queries
Define a primary in-memory representation for columnar data
– Support alternative randomly-accessible compression schemes in all operators (such as Dictionary/Bitpacked)

16

Vectorization

Operating on more than one record at the same time
– Old school: use word-sized manipulations when records are stored smaller than word size
– New school: SIMD (single instruction, multiple data) instructions
  • GCC, LLVM and JVM all do various optimizations automatically
  • More can be had by manually coding algorithms
– Logical vectorization:
  • Using general record characteristics to reduce CPU cycles per collection of records
Alternative meaning
– Avoiding branching to speed CPU pipeline, working on large cache-local data in process
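The "old school" bullet can be illustrated with one-bit records packed into 64-bit words, for example a bitmap marking which values are set. This is a sketch with hypothetical names, not Drill code. The word-wise version handles 64 records per loop iteration using `Long.bitCount` and has no per-record branch; it assumes any unused bits in the last word are zero.

```java
// Hypothetical sketch: counting set one-bit records per record versus per 64-bit word.
final class WordSizedCount {
    /** One record per step: test each bit individually. */
    static int countSetBitwise(long[] words, int recordCount) {
        int count = 0;
        for (int i = 0; i < recordCount; i++) {
            if (((words[i >>> 6] >>> (i & 63)) & 1L) != 0) {
                count++;
            }
        }
        return count;
    }

    /** 64 records per step; assumes unused trailing bits are zero. */
    static int countSetWordwise(long[] words) {
        int count = 0;
        for (long w : words) {
            count += Long.bitCount(w);
        }
        return count;
    }
}
```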

17

Drill Columnar Approach

A RecordBatch contains one or more ValueVectors corresponding to each Field within a BatchSchema
Operators can operate directly against a ValueVector or work with an alternative view of the data by leveraging a SelectionVector
Leverage simple vectorization and trust the JIT to optimize SIMD by generating simple buffer-based operations and loops
– Explore performance impact of advanced SIMD in C for specific operators

18

Record Batch

Unit of work for the query system
– Operators always work on a batch of records
All values associated with a particular collection of records
Each record batch must have a single defined schema
– Possibly includes fields that have embedded types if you have a heterogeneous field
Record batches are pipelined between operators and nodes
No more than 65k records
Target single L2 cache (~256k)
Operator reconfiguration is done at RecordBatch boundaries

[Figure: a sequence of RecordBatches, each holding several ValueVectors (VV)]

19

SelectionVector

Selects particular records for consideration by record batch index
Avoids early copying of records after applying filtering
– Maintains random accessibility
All operators need to support SelectionVector access

Donut                         Price  Icing
Bacon Maple Bar               2.19   [Maple Frosting, Bacon]
Portland Cream                1.79   [Chocolate]
The Loop                      2.29   [Vanilla, Fruitloops]
Triple Chocolate Penetration  2.79   [Chocolate, Cocoa Puffs]

Selection Vector: 0, 3
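A selection vector can be sketched as an array of batch indexes that sits beside the untouched column data. The class and method names below are hypothetical, and plain `int` indexes are an assumption made for simplicity. A filter writes only indexes, and a downstream operator reads the original column through them.

```java
import java.util.Arrays;
import java.util.function.DoublePredicate;

// Hypothetical sketch: filtering by index list instead of copying records.
final class SelectionVectorSketch {
    /** Builds the list of batch indexes whose price passes the filter; no values are copied. */
    static int[] select(double[] prices, DoublePredicate filter) {
        int[] selected = new int[prices.length];
        int n = 0;
        for (int i = 0; i < prices.length; i++) {
            if (filter.test(prices[i])) {
                selected[n++] = i;
            }
        }
        return Arrays.copyOf(selected, n);
    }

    /** A downstream operator reads the original column through the selection vector. */
    static double sumSelected(double[] prices, int[] selection) {
        double sum = 0;
        for (int idx : selection) {
            sum += prices[idx];
        }
        return sum;
    }
}
```

With the price column from the table and the slide's selection vector of 0 and 3, `sumSelected` adds only 2.19 and 2.79; the other two prices stay in place and are never copied or removed.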

20

ValueVector

One or more contiguous buffers of data containing values
– Stored in native order
– In-memory representation fully specified for cross-language portability
Associated with a single field
– Synonymous with column in traditional flat tables
Nested fields are separate ValueVectors
Randomly accessible
Defined for each system datatype
Each has Accessor and Mutator
– Primitives and simple primitive “structs” are access interfaces
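As a rough illustration of these properties, here is a minimal fixed-width vector of four-byte integers backed by one contiguous off-heap buffer in native byte order. It is a sketch under stated assumptions, not Drill's ValueVector API: the class name is hypothetical, and the accessor and mutator roles are collapsed into `get` and `set` on one class.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

// Hypothetical sketch of a fixed-width value vector over a single contiguous buffer.
final class IntVectorSketch {
    private static final int WIDTH = 4; // bytes per value

    private final ByteBuffer buffer;

    IntVectorSketch(int valueCount) {
        // Direct (off-heap) buffer using the platform's native byte order.
        buffer = ByteBuffer.allocateDirect(valueCount * WIDTH).order(ByteOrder.nativeOrder());
    }

    /** Mutator role: write a value at a record index. */
    void set(int index, int value) {
        buffer.putInt(index * WIDTH, value);
    }

    /** Accessor role: random access is one multiplication and one read. */
    int get(int index) {
        return buffer.getInt(index * WIDTH);
    }

    int valueCount() {
        return buffer.capacity() / WIDTH;
    }
}
```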

21

Drill DataTypes

MajorType = MinorType + DataMode + (Width|Scale)?

MinorType
– Describes width and nature of data: smallint, bigint, uint32, varchar4 (utf8), var16char4 (utf16)
DataMode:
– Optional (nullable)
– Required (non-nullable)
– Repeated (non item list/array)

22

Traditional 3 value semantics & Drill 4 value

SQL’s 3-Valued Semantics
– True
– False
– Unknown
Drill adds fourth
– Repeated

23

Fixed Value Vectors

24

Nullable Values
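The diagram for this slide is not preserved in the transcript. One common way to lay out nullable fixed-width values, assumed here rather than taken from Drill's specification, is a validity bitmap stored next to the value buffer. The class name is hypothetical, and on-heap arrays are used to keep the sketch short.

```java
// Hypothetical sketch: nullable ints as a validity bitmap plus a value array.
final class NullableIntVectorSketch {
    private final long[] validity; // bit i set means value i is present
    private final int[] values;

    NullableIntVectorSketch(int valueCount) {
        validity = new long[(valueCount + 63) >>> 6];
        values = new int[valueCount]; // every slot starts out null
    }

    void set(int index, int value) {
        values[index] = value;
        validity[index >>> 6] |= 1L << (index & 63);
    }

    void setNull(int index) {
        validity[index >>> 6] &= ~(1L << (index & 63));
    }

    boolean isNull(int index) {
        return (validity[index >>> 6] & (1L << (index & 63))) == 0;
    }

    int get(int index) {
        if (isNull(index)) {
            throw new IllegalStateException("value at index " + index + " is null");
        }
        return values[index];
    }
}
```

Keeping nullability in its own buffer means the value buffer stays fixed-width and randomly accessible, and null checks can be done a word at a time.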

25

Repeated Values

26

Variable Width
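The diagram for this slide is also missing from the transcript. A common layout for variable-width values, assumed here and not confirmed by the source, is a fixed-width offsets buffer pointing into one contiguous data buffer. The class name is hypothetical and the sketch uses on-heap arrays for brevity.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;

// Hypothetical sketch: variable-width UTF-8 values as offsets plus one data buffer.
final class VarCharVectorSketch {
    private final int[] offsets; // offsets[i] to offsets[i + 1] bounds value i
    private final byte[] data;

    VarCharVectorSketch(String... values) {
        offsets = new int[values.length + 1];
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        for (int i = 0; i < values.length; i++) {
            byte[] utf8 = values[i].getBytes(StandardCharsets.UTF_8);
            out.write(utf8, 0, utf8.length);
            offsets[i + 1] = out.size();
        }
        data = out.toByteArray();
    }

    /** Length lookup is a subtraction of two neighboring offsets. */
    int byteLength(int index) {
        return offsets[index + 1] - offsets[index];
    }

    /** Random access: two offset reads locate the bytes for any record. */
    String get(int index) {
        return new String(data, offsets[index], byteLength(index), StandardCharsets.UTF_8);
    }
}
```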

27

Repeated Map

28

Strengths of RecordBatch + ValueVectors

RecordBatch separates high performance/low performance space
– Record-by-record, avoid method invocation
– Batch-by-batch, trust JVM
Avoid serialization/deserialization
Off-heap means large memory footprint without GC woes
Full specification combined with off-heap and batch-level execution allows C/C++ operators as necessary
Random access: sort without restructuring

29

Code Play Time

Get Latest Drill
    git clone git://git.apache.org/incubator-drill.git
    cd incubator-drill/sandbox/prototype
    git checkout 9f69ed0
    mvn clean install

Download OSCON Drill examples:
    git clone https://github.com/jacques-n/oscon-drill.git
    cd oscon-drill
    mvn install
    cd vectors

http://bit.ly/19goc7R

30

Vectors Exercise

Goals
– RPC implementation to minimize data copies and support keeping all data off-heap
– Basic benchmark analysis comparing ValueVectors and straight ProtoBuf encoding
Logic
– C = A + B
– Assume two lists of fixed four-byte integers (list a and list b)
– Send them to remote node
– Remote node decodes them, adds the two numbers together for each record, then returns the list (list c)
– First node sums all returned numbers and verifies expected result
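The arithmetic of the exercise, without the RPC layer, can be sketched locally as below. This is not the code from the workshop repository: the class name, the method names, and the choice of little-endian direct buffers are all assumptions. It only shows that lists a, b, and c can stay in off-heap buffers from encoding through the final sum, with no per-value objects created.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

// Hypothetical local sketch of C = A + B over off-heap four-byte integers (no RPC).
final class AddVectorsSketch {
    private static final int WIDTH = 4;

    /** "First node": encode a list of ints into an off-heap buffer. */
    static ByteBuffer encode(int[] values) {
        ByteBuffer buf = ByteBuffer.allocateDirect(values.length * WIDTH).order(ByteOrder.LITTLE_ENDIAN);
        for (int i = 0; i < values.length; i++) {
            buf.putInt(i * WIDTH, values[i]);
        }
        return buf;
    }

    /** "Remote node": c[i] = a[i] + b[i], read and written directly in the buffers. */
    static ByteBuffer add(ByteBuffer a, ByteBuffer b) {
        if (a.capacity() != b.capacity()) {
            throw new IllegalArgumentException("lists must have the same length");
        }
        ByteBuffer c = ByteBuffer.allocateDirect(a.capacity()).order(ByteOrder.LITTLE_ENDIAN);
        for (int off = 0; off < a.capacity(); off += WIDTH) {
            c.putInt(off, a.getInt(off) + b.getInt(off));
        }
        return c;
    }

    /** "First node" again: sum the returned list to verify the expected result. */
    static long sum(ByteBuffer c) {
        long total = 0;
        for (int off = 0; off < c.capacity(); off += WIDTH) {
            total += c.getInt(off);
        }
        return total;
    }
}
```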

31

Vectors Exercise

├── pom.xml
└── src
    ├── main
    │   ├── java/org/apache/drill/oscon/rpc
    │   │   ├── ClientConnectFuture.java
    │   │   ├── ExampleClient.java
    │   │   ├── ExampleConfig.java
    │   │   └── ExampleServer.java
    │   └── protobuf
    │       └── Example.proto
    └── test/java/org/apache/drill/oscon/rpc
        └── TestRpc.java
