OpenDremel's Metaxa Architecture

Metaxa Architecture

June 22nd

By Camuel, OpenDremel

Meet Metaxa

• Implements Dremel using LAPHROAIG as the execution engine and as the storage backend.

• No distribution: METAXA is a single jar file executed in a single JVM; it produces and executes a single-threaded MAP job.

• All input data resides inside a single LAPHROAIG object.

• Output is one of the following:

– A new LAPHROAIG object.

– Streamed back to the caller.

• Convert-type commands convert a single LAPHROAIG object from popular object serialization formats to the nested columnar Dremel format, or vice versa.

• Query-type commands process LAPHROAIG objects in the nested columnar Dremel format and can either store the result in another object or convert it to a popular object serialization format and stream it back to the user.

• A LAPHROAIG object is a container of other objects: either “serialized objects” or “columnar encoded objects”. The two types of objects are not to be confused.

• Just five use cases:

– Convert “serialized objects” into “columnar encoded objects”.

– Convert “columnar encoded objects” into “serialized objects”.

– Query “columnar encoded objects” with BQL, producing “serialized objects” and streaming them back to the caller.

– Query “columnar encoded objects” with BQL, producing “serialized objects” and saving them as a new LAPHROAIG “container” object.

– Query “columnar encoded objects” with BQL, producing “columnar encoded objects” and saving them as a new LAPHROAIG “container” object.

Use case #1: Convert serialized objects into columnar-encoded objects

[Diagram labels: LAPHROAIG; Metaxa.jar; Convert Command; Hierarchical Schema; Serialized objects (Protobuf, Avro, Thrift); columnar-encoded objects (Tablet).]

Use case #2: Convert columnar-encoded objects into serialized objects

[Diagram labels: LAPHROAIG; Metaxa.jar; Convert Command; Hierarchical; columnar-encoded objects (Tablet).]

Use case #3: Query “columnar encoded objects” with BQL, producing “serialized objects” and streaming them back to the caller

[Diagram labels: LAPHROAIG; Metaxa.jar; BQL Query; Hierarchical; columnar-encoded objects (Tablet).]

Use case #4: Query “columnar encoded objects” with BQL, producing “serialized objects” and saving them

[Diagram labels: LAPHROAIG; Metaxa.jar; BQL Query; Hierarchical; columnar-encoded objects (Tablet).]

Use case #5: Query “columnar encoded objects” with BQL, producing “columnar encoded objects” and saving them

[Diagram labels: LAPHROAIG; Metaxa.jar; BQL Query; columnar-encoded objects (Tablet) as input; columnar-encoded objects (Tablet) as output.]

SerObjs – Serialized Objects

• Data produced by serializing objects with Protobuf, Avro or Thrift.

• Hierarchical data.

• Flat data such as CSV.

• RDBMS-originated data.

• Data from KV stores and document stores.

• Logs.

• The schema may be embedded or provided separately.

Tablet – Columnar-encoded objects

• An immutable chunk of data.

• Logically composed of Slices and can be turned into a Slice series.

• Columnar and Dremel-encoded.

• Consists of a header (called the Tablet Schema) and multiple {byte, word, dword or qword} streams.

• The Tablet Schema describes:

– Tablet columns (multi-dimensional arrays), including metadata, compression and encoding metadata, as well as references to associated dictionaries, rep & def levels, etc.

– The original SerObjs schema and its mapping to tablet columns.

– Future: additional SerObjs schemas and mappings.

• Tablet data is a set of multidimensional arrays of 8-, 16-, 32- or 64-bit elements, denoted byte (b), word (w), double word (dw) and quad word (qw). Each array represents a column and can be accessed independently without incurring access costs for neighboring arrays. Every element is a bit field whose bits carry different pieces of information, for example (multiple) column values, counts (RLE), and rep and def levels.

• A tablet scanner can mask some of the details of column encoding and provide a higher-level interface to the tablet, automatically decoding RLE, dictionaries, and rep & def levels. However, the tablet binary format is a stable interface between Metaxa modules and between different versions of the OpenDremel system.

• Tablets are horizontal partitions of a larger columnar dataset.
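To make the "every element is a bit field" idea concrete, here is a minimal Java sketch of one 32-bit (dword) element that packs a value, an RLE count, and rep and def levels. The class name `DwordElement` and the bit widths (16/10/3/3) are assumptions chosen for illustration; the slides do not specify the actual tablet layout.

```java
/** Hypothetical dword tablet element; the field widths below are assumptions, not the real format. */
final class DwordElement {
    // Assumed layout, high to low bits: rep level (3) | def level (3) | RLE count (10) | value (16).
    static final int VALUE_BITS = 16, COUNT_BITS = 10, DEF_BITS = 3, REP_BITS = 3;
    static final int COUNT_SHIFT = VALUE_BITS;            // 16
    static final int DEF_SHIFT = COUNT_SHIFT + COUNT_BITS; // 26
    static final int REP_SHIFT = DEF_SHIFT + DEF_BITS;     // 29

    private static int mask(int bits) {
        return (1 << bits) - 1;
    }

    /** Packs the four fields into one int; out-of-range inputs are truncated to their field width. */
    static int pack(int rep, int def, int count, int value) {
        return ((rep & mask(REP_BITS)) << REP_SHIFT)
             | ((def & mask(DEF_BITS)) << DEF_SHIFT)
             | ((count & mask(COUNT_BITS)) << COUNT_SHIFT)
             | (value & mask(VALUE_BITS));
    }

    static int value(int element) { return element & mask(VALUE_BITS); }

    // Unsigned shifts (>>>) keep the top field correct even when the sign bit is set.
    static int count(int element) { return (element >>> COUNT_SHIFT) & mask(COUNT_BITS); }

    static int def(int element) { return (element >>> DEF_SHIFT) & mask(DEF_BITS); }

    static int rep(int element) { return (element >>> REP_SHIFT) & mask(REP_BITS); }
}
```

A scanner of the kind described above would hide these shifts and masks and hand out already-decoded values and levels.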

Slice – Columnar-encoded object fraction

• A Slice is a vector (an ordered list of scalars) where each scalar corresponds to the current value of a different tablet column that is being scanned / iterated.

• A Tablet can be broken down into an ordered list of slices and composed back from a series of slices.

• A Slice in Metaxa contains plain integer values (not bit fields) of type b, w, dw and qw.

• A Slice may contain fewer values than there are columns in the tablet. In this case the columns represented in the slice are called “projected columns”.

• A Slice also contains an additional integer field called Level. This Level is also aliased as FetchLevel or SelectLevel, depending on whether the Tablet is being sliced into a series of slices or being reconstructed from a series of slices.
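As a rough illustration of slicing and reconstruction, the following Java sketch iterates several columns in lockstep and emits one slice per row, then rebuilds the projected columns from the slices. It assumes flat (non-repeated) columns, so every slice is given Level 0; the class names `Slice` and `FlatSlicer` and their fields are hypothetical, and real Dremel-encoded columns would need rep-level handling that is left out here.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical slice: one current value per projected column plus the Level field. */
final class Slice {
    final int[] projectedColumns; // indexes of the tablet columns present in this slice
    final long[] values;          // plain integers, wide enough for b, w, dw and qw
    final int level;              // FetchLevel when slicing, SelectLevel when reconstructing

    Slice(int[] projectedColumns, long[] values, int level) {
        this.projectedColumns = projectedColumns;
        this.values = values;
        this.level = level;
    }
}

final class FlatSlicer {
    /** Slices flat columns in lockstep, keeping only the projected columns. */
    static List<Slice> slice(long[][] columns, int[] projected) {
        List<Slice> slices = new ArrayList<>();
        int rows = columns.length == 0 ? 0 : columns[0].length;
        for (int row = 0; row < rows; row++) {
            long[] values = new long[projected.length];
            for (int i = 0; i < projected.length; i++) {
                values[i] = columns[projected[i]][row];
            }
            // Assumption: flat data, so each slice starts a new record (Level 0).
            slices.add(new Slice(projected, values, 0));
        }
        return slices;
    }

    /** Rebuilds the projected columns from a series of slices of the given width. */
    static long[][] rebuild(List<Slice> slices, int width) {
        long[][] columns = new long[width][slices.size()];
        for (int row = 0; row < slices.size(); row++) {
            for (int i = 0; i < width; i++) {
                columns[i][row] = slices.get(row).values[i];
            }
        }
        return columns;
    }
}
```

Projecting columns 0 and 2 of a three-column tablet yields slices with two values each; the unprojected column is never touched.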

Query Plan (QP)

• A QP is a descriptor of a source tablet, a result tablet, a set of scalar transformations, and a DAG of their dataflow interconnections.

• Scalar transformations are of one of the following types:

– Plain transformations: also called expressions; many inputs but one output.

– Predicates: boolean expressions which, when evaluating to false, cancel the issuance of the result slice.

– Aggregates: Count, Sum and Distinct functions. They aggregate slices and, when the last slice in an aggregation group is detected, issue multiple result slices.

• QP input and output is always a slice. Because of predicates, it is possible that for some input slices no output slice is issued. Because of aggregates, it is also possible that for one input slice multiple output slices are issued.

• Input slices contain FetchLevel and output slices contain SelectLevel (according to Appendix D of the Dremel paper).
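The dataflow idea can be sketched in a few lines of Java: scalar transformations are nodes listed in topological order, each reading the input slice or the results of earlier nodes, and a predicate node can cancel the output slice. Everything here (`QueryPlanSketch`, `Node`, the slot numbering) is a hypothetical illustration, and aggregates and Level handling are deliberately omitted.

```java
import java.util.List;
import java.util.Optional;

/** Hypothetical QP core: a DAG of scalar transformations evaluated per input slice. */
final class QueryPlanSketch {
    /** One DAG node: reads already-computed slots and yields one scalar. */
    interface Node {
        long eval(long[] slots);
    }

    private final int inputWidth;     // number of values in the input slice
    private final List<Node> nodes;   // topologically ordered; node i writes slot inputWidth + i
    private final int predicateSlot;  // slot holding 0 (false) or non-zero (true); -1 if no predicate
    private final int[] outputSlots;  // slots copied into the output slice

    QueryPlanSketch(int inputWidth, List<Node> nodes, int predicateSlot, int[] outputSlots) {
        this.inputWidth = inputWidth;
        this.nodes = nodes;
        this.predicateSlot = predicateSlot;
        this.outputSlots = outputSlots;
    }

    /** Returns the output slice, or empty when the predicate cancels it. */
    Optional<long[]> processSlice(long[] inSlice) {
        long[] slots = new long[inputWidth + nodes.size()];
        System.arraycopy(inSlice, 0, slots, 0, inputWidth);
        for (int i = 0; i < nodes.size(); i++) {
            slots[inputWidth + i] = nodes.get(i).eval(slots);
        }
        if (predicateSlot >= 0 && slots[predicateSlot] == 0) {
            return Optional.empty(); // predicate false: no result slice is issued
        }
        long[] out = new long[outputSlots.length];
        for (int i = 0; i < outputSlots.length; i++) {
            out[i] = slots[outputSlots[i]];
        }
        return Optional.of(out);
    }
}
```

For example, with nodes `a + b`, `(a + b) * 2` and `(a + b) > 10`, the third node acts as the predicate and the second as the only output column.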

Conceptual View of Tablet

[Figure: a tablet drawn as Record [0] through Record [5], each holding nested arrays at levels (dimensions) 0, 1 and 2.]

Conceptual View of Tablet Slicing

[Figure: Record [0] with nested arrays at levels (dimensions) 0, 1 and 2, broken into slices labeled [0][1][1], [0][1][0], [0][0][0], [0][0][1], [0][0][2] and [0][2][2].]

Conceptual View of QP

[Figure: Record [0] and Record [1] with nested arrays at levels (dimensions) 0, 1 and 2; expressions Expr (rep=0), Expr (rep=1) and Expr (rep=2); slices labeled [0][1][1] and [0][0][0].]

Compiler: translates BQL into a Query Plan

Requirements:

– Must parse and compile valid BQL as defined by BigQuery.

– Must reject invalid BQL and supply user-friendly error messages.

– Must produce an executable QP object with the following features:

• It is serializable: a pure object model without circular references and without references to “system” objects such as file handles.

• getProcessSliceSource => returns the processSlice text in Java source-code form

• getSourceTablets => returns the tablets to run the QP on

• setResultTablet => sets the result tablet

• setExecutionStatusCode => indicates the status of QP execution

• log => allows logging of important events during QP execution

• getDiagram => returns a graphic image of the QP diagram (for debugging)

– Must provide basic command-line argument handling as well as simple shell functionality.
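The QP object's features could be expressed as a Java interface along the following lines. Only the method names come from the list above; the parameter and return types, the `SimpleQueryPlan` class, and the use of `java.io.Serializable` as a stand-in for the project's own serialization are assumptions made for this sketch.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical shape of the QP object; method names from the requirements, all types assumed. */
interface QueryPlan extends Serializable {
    String getProcessSliceSource();
    List<String> getSourceTablets();       // tablets identified by name here (assumption)
    void setResultTablet(String tabletName);
    void setExecutionStatusCode(int code);
    void log(String event);
    byte[] getDiagram();                   // encoded image bytes (assumption)
}

/** Plain object model: only strings, ints and lists, so no cycles and no system handles. */
final class SimpleQueryPlan implements QueryPlan {
    private static final long serialVersionUID = 1L;
    private final String processSliceSource;
    private final ArrayList<String> sourceTablets;
    private final ArrayList<String> events = new ArrayList<>();
    private String resultTablet;
    private int statusCode;

    SimpleQueryPlan(String processSliceSource, List<String> sourceTablets) {
        this.processSliceSource = processSliceSource;
        this.sourceTablets = new ArrayList<>(sourceTablets);
    }

    public String getProcessSliceSource() { return processSliceSource; }
    public List<String> getSourceTablets() { return new ArrayList<>(sourceTablets); }
    public void setResultTablet(String tabletName) { this.resultTablet = tabletName; }
    public void setExecutionStatusCode(int code) { this.statusCode = code; }
    public void log(String event) { events.add(event); }
    public byte[] getDiagram() { return new byte[0]; } // diagram rendering is out of scope here

    String resultTablet() { return resultTablet; }
    int statusCode() { return statusCode; }
    List<String> events() { return new ArrayList<>(events); }

    /** Serializes the plan to bytes and restores it, to show that it survives a round trip. */
    static SimpleQueryPlan roundTrip(SimpleQueryPlan qp) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
                out.writeObject(qp);
            }
            try (ObjectInputStream in =
                     new ObjectInputStream(new ByteArrayInputStream(bytes.toByteArray()))) {
                return (SimpleQueryPlan) in.readObject();
            }
        } catch (IOException | ClassNotFoundException e) {
            throw new IllegalStateException(e);
        }
    }
}
```

Keeping the plan free of file handles and other system objects is what makes the round trip trivial.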

Vocabulary

• Token - lexeme

• Parse tree – token tree

• AST – Abstract Syntax Tree

• SM – Semantic Model

• ASM – Annotated Semantic Model

• QP – Query Plan

• DAG – Directed Acyclic Graph

• Schema – Metadata about dataset.

Prerequisite Materials (Compiler)

– http://code.google.com/apis/bigquery/docs/query-reference.html

– http://www.antlr.org/

– http://en.wikipedia.org/wiki/Parsing

– http://en.wikipedia.org/wiki/Query_plan

– http://en.wikipedia.org/wiki/Compiler_construction

– http://www.amazon.com/Terence-Parr/e/B001JS3O0U




High-Level Design (verbose)

[Diagram of the compiler pipeline. Labels, in reading order: command-line arguments / shell input; Shell; BQL; Antlr Parser; AST; SemanticParser; SM (Semantic Model: Java object model implemented via Java collections); Semantic Analyzer (validation, resolving references, result schema inference, optimization); QP Generator; QP (Query Plan, includes ResultTablet metadata); SM (Annotated Semantic Model); SerObjs Schema; Result Schema Generator; Result SerObjs Schema; Metadata (file locations and statistics); Optimization Rules; Validation Rules; C / asm Template.]

[Annotated] Semantic Model

• Comprehensively describes the query in every detail.

• Java objects (packed into collections, without spaghetti cyclic references).

• Must be serializable to a file with the SerObjLib framework, and restorable.

• Must be printable in a form a human can comprehend.

• Must be renderable on request as a clear graphic diagram with a legend.

QP: Scalar Transformation functions (Expr)

• A set of primitive, predefined scalar operations and functions applied to xfunc arguments in a particular prescribed order.

• Expressed in valid C or assembly, with some restrictions.

• Purely functional, i.e. side-effect free: no static/global variables and no memory allocations. For performance and brevity, however, they are inlined into a single processSlice function.

• Some functions have a context object where they can store their externalized state between calls. One regular array and one associative array are provided as context for these functions.

– Context-free transformation functions

• One value in, one value out, e.g. a+b

– Scalar context transformation functions

• Many values in, many values out, e.g. sum(a) within links

– Map context transformation functions

• Many values in, many values out (out of sync), e.g. sum(a) group by date
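The three kinds of transformation functions differ only in where their state lives, which the following Java sketch tries to show. The slides say these functions are expressed in C or assembly; Java is used here purely for illustration, and the class `TransformKinds`, the method names, and the use of a `long[]` and a `Map` for the regular and associative context arrays are assumptions.

```java
import java.util.Map;

/** Hypothetical examples of the three transformation kinds; state is always passed in, never global. */
final class TransformKinds {
    /** Context-free: the result depends only on the arguments, e.g. a + b. */
    static long add(long a, long b) {
        return a + b;
    }

    /** Scalar context: one accumulator kept in a regular array slot (ctx[0] is the running sum). */
    static long sum(long[] ctx, long a) {
        ctx[0] += a;
        return ctx[0];
    }

    /** Map context: one accumulator per group key, kept in an associative array. */
    static void sumGroupBy(Map<Long, Long> ctx, long groupKey, long a) {
        ctx.merge(groupKey, a, Long::sum);
    }
}
```

Because the functions themselves hold no state, the same code can be inlined into a generated processSlice body and reused across calls with different context objects.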

QP in C Form

• Generated ProcessSlice(..){..} function.

– Input: inSlice

– Output: outSlice

– Context object for state externalization

• inSlice contains scalar values for every source function and also fetchLevel.

• outSlice must have correct scalar values for every result function and also the correct selectLevel.

– outSlice is guaranteed to preserve its content between calls, so it can be used as a cache for result functions that haven’t changed, and also as a cache for selectLevel if it has not changed.

– outSlice values can also be read (they contain the results of the previous outSlice).

– On the first call, all values in outSlice are guaranteed to be zeros.

QP template (according to Appendix D)

void processSlice(inSlice, outSlice, Context) {

    // Evaluate the where clause. If it evaluates to false, then do:

        outSlice.setSkip();

        outSlice.selectLevel = min(outSlice.selectLevel, inSlice.fetchLevel);

        return;

    // If the where clause evaluates to true, then:

    switch (inSlice.fetchLevel) {

        case 0:

            // Evaluate expressions (xfuncs) with repetition level = 0

        // ...

        case n:

            // Evaluate expressions (xfuncs) with repetition level = n

            // If it is the last slice in the aggregation group, then:

            // the line below will cause additional calls to ProcessSlice

            outSlice.setAdditionalSliceCount(/* Number of slices in aggregation */);

    }

}
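For a concrete feel of the template, here is a small Java rendering for one made-up query, `SELECT id * 10, link + 1 WHERE link > 0`, where `id` is assumed to sit at repetition level 0 and `link` at level 1. The slice classes, field names and the query itself are hypothetical. The sketch reads the switch in the template as falling through from the fetch level to the deepest level, which is an interpretation rather than something the slides state, and it leaves out aggregation and the selectLevel update on the emitting path because the template does not spell them out.

```java
/** Hypothetical processSlice for: SELECT id * 10 AS x, link + 1 AS y WHERE link > 0. */
final class ProcessSliceSketch {
    static final class InSlice {
        long id;        // assumed repetition level 0
        long link;      // assumed repetition level 1
        int fetchLevel;
    }

    static final class OutSlice {
        long x;          // result of the level-0 expression
        long y;          // result of the level-1 expression
        int selectLevel;
        boolean skip;    // true when no result slice should be issued
    }

    static void processSlice(InSlice in, OutSlice out) {
        // Where clause: on false, mark the slice as skipped and lower selectLevel.
        if (!(in.link > 0)) {
            out.skip = true;
            out.selectLevel = Math.min(out.selectLevel, in.fetchLevel);
            return;
        }
        out.skip = false;
        // Assumed fall-through: fetchLevel 0 re-evaluates every expression,
        // fetchLevel 1 only the level-1 expression. Anything not re-evaluated
        // keeps the value cached in outSlice from the previous call.
        if (in.fetchLevel <= 0) {
            out.x = in.id * 10;
        }
        if (in.fetchLevel <= 1) {
            out.y = in.link + 1;
        }
    }
}
```

Calling it with fetchLevel 1 after a fetchLevel 0 call shows the caching behaviour: `x` keeps its earlier value while `y` is recomputed.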

Columnar Abstraction

• Tableton is a set of sequentially accessed multidimensional scalar arrays.

• A Tablet is a serialized, Dremel-encoded columnar dataset of fixed size. Each array in a tablet can be accessed serially and independently, without incurring the cost of buffering neighboring arrays.

• Four types of arrays: bytes (8b), words (16b), dwords (32b), qwords (64b).

• The following operations are defined:

– Parsing Tablet Schema => reading and parsing the tablet header/metadata (also called the tablet schema) and providing an object model for it.

– Reading => converting a Tablet to SerObjs using an FSM for better performance, as described in the Dremel paper (calling callback functions to let them construct SerObjs in various formats).

– Slicing => synchronized multi-array scalar iteration over a Tablet.

– Building Tablet Schema => creating the tablet header/metadata (the tablet schema) with a convenient builder API. Also called the TabletSchema Editor.

– Construction => re-creating a Tablet from slices; this interface is also used for dissecting SerObjs into a tablet.

– Compaction => the constructed Tablet is compressed and a hash key is generated for it; from that point on it is immutable.
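The compaction step can be pictured with standard JDK classes: compress the constructed bytes, derive a hash key from the result, and expose the outcome only through an immutable wrapper. The slides do not name a compression or hashing algorithm, so DEFLATE (`java.util.zip.Deflater`) and SHA-256 are assumptions here, as are the `Compactor` and `CompactedTablet` names.

```java
import java.io.ByteArrayOutputStream;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.zip.Deflater;

/** Hypothetical compaction: compress, hash, then freeze. Algorithms are assumed, not from the slides. */
final class Compactor {
    /** Immutable result: no setters, and the byte array is only handed out as a copy. */
    static final class CompactedTablet {
        private final byte[] compressed;
        final String hashKey;

        CompactedTablet(byte[] compressed, String hashKey) {
            this.compressed = compressed.clone();
            this.hashKey = hashKey;
        }

        byte[] bytes() {
            return compressed.clone();
        }
    }

    static CompactedTablet compact(byte[] raw) {
        Deflater deflater = new Deflater();
        deflater.setInput(raw);
        deflater.finish();
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buffer = new byte[4096];
        while (!deflater.finished()) {
            int n = deflater.deflate(buffer);
            out.write(buffer, 0, n);
        }
        deflater.end();
        byte[] compressed = out.toByteArray();
        return new CompactedTablet(compressed, sha256Hex(compressed));
    }

    /** Hex-encoded SHA-256 digest, used here as the tablet's hash key. */
    static String sha256Hex(byte[] data) {
        try {
            MessageDigest digest = MessageDigest.getInstance("SHA-256");
            StringBuilder hex = new StringBuilder();
            for (byte b : digest.digest(data)) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e); // SHA-256 is required on every Java platform
        }
    }
}
```

Deriving the key from the content means two identical tablets get the same key, which is one plausible reason to tie the key to immutability.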

Tableton

What about other datatypes?

• They are mapped into yet another dimension of a scalar array.

• It is strongly recommended not to use Java strings. They are impossible to work with without incurring the full cost of object lifecycle management.

• It is OK not to support them at all at first, and then gradually add support for them.

• The Java String class goodies will be impossible to support in Metaxa anyway, for performance reasons.

• The same goes for BLOBs, images and any other complex data type: all are mapped to yet another dimension of a scalar array.
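One way to read "yet another dimension of a scalar array" is that a column of strings becomes a two-dimensional byte array, indexed by row and then by byte. The Java sketch below takes that reading; the `StringColumn` class and the choice of UTF-8 are assumptions, and `String` objects appear only at the boundary where data is loaded.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

/** Hypothetical string column stored as bytes[row][byteIndex] instead of String objects. */
final class StringColumn {
    /** Converts strings to their UTF-8 bytes once, at load time. */
    static byte[][] encode(String[] values) {
        byte[][] column = new byte[values.length][];
        for (int row = 0; row < values.length; row++) {
            column[row] = values[row].getBytes(StandardCharsets.UTF_8);
        }
        return column;
    }

    /** Equality test on raw bytes, with no String allocation per row. */
    static boolean equalsAt(byte[][] column, int row, byte[] needle) {
        return Arrays.equals(column[row], needle);
    }
}
```

A scan over such a column compares bytes directly, which is the kind of processing the slide has in mind when it warns against per-value object management.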

Hierarchical vs. Columnar

• Different abstractions / domains / contexts.

• Different schemas.

• Most confusion stems from not differentiating between them!

• Always keep the context in mind when you are developing.

• Don’t think about both at the same time unless you are willing to develop schizophrenia.

• Columnar is not an implementation artifact of hierarchical. Columnar is a whole new model in its own right.

• We must adopt two different vocabularies for these domains. Confusion is notoriously common here.

Hierarchical vs. Columnar

Hierarchical                        | Columnar
A SerObjs in our lingo              | A Tablet in our lingo
Protobuf, Avro, Thrift files        | Dremel-generated tablets
Serialized objects                  | Multi-dimensional arrays
The only user-level abstraction     | User never knows what it is
BQL queries are written against it  | Query plans are executed against it
More frontend-related               | More backend-related
More logical / external format      | More physical / internal format
Hierarchical is queried             | Columns are scanned
SerObjLib component                 | Tableton component

Hierarchical Example

Tableton

Executes QP against tablets

• Requirements

– Must convert the QP into executable bytecode and execute it (not interpret it).

– Must work with the QP in object-model form, but initially compiling and running the QP in Java form will suffice.

– Must not mask data and task parallelism.

• Data parallelism at the tablet level and also at the column level within a tablet.

• Task parallelism across separate QP transformation functions.

– Must be ultra-high performance.

• Latency overhead within a few milliseconds (assuming data in RAM).

• Throughput of multiple GB/sec.

Executor

Vocabulary

• QP – Query Plan

• DAG – Directed Acyclic Graph

• Slot – like a thread (todo)

• Expression – operator tree on scalar arguments and scalar constants

• CF – Context Free (stateless scalar expression)

• FC – Fixed-size Context (scalar expression with an accumulator)

• VC – Variable-size Context (scalar expression with a growing list of accumulators)

Code generation

• [todo] Janino!

• [todo] Explain dynamic Java code generation and compilation.

• [todo] Use code templates! No classes or functions, just a code listing with labels and jumps. The generated code is different every time and no one is going to study it. Put the static portions in a library and pre-compile it regularly. The dynamic portion is just a code snippet.
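The template idea in the last bullet can be sketched as plain string substitution: a fixed skeleton with placeholders, filled with per-query snippets to produce Java source text. The placeholder names `%WHERE%` and `%BODY%`, the `ProcessSliceTemplate` class and the generated method signature are all hypothetical. The sketch stops at producing the source string; handing it to a runtime compiler such as Janino is left out.

```java
/** Hypothetical code template: a static skeleton plus per-query dynamic snippets. */
final class ProcessSliceTemplate {
    // Static portion: written once. Only the two placeholders vary from query to query.
    private static final String TEMPLATE =
        "static void processSlice(long[] in, long[] out) {\n"
      + "    if (!(%WHERE%)) { return; }\n"
      + "    %BODY%\n"
      + "}\n";

    /** Fills the skeleton with a where-expression and an expression-evaluation snippet. */
    static String render(String whereExpr, String bodySnippet) {
        return TEMPLATE
            .replace("%WHERE%", whereExpr)
            .replace("%BODY%", bodySnippet);
    }
}
```

For example, `ProcessSliceTemplate.render("in[0] > 0", "out[0] = in[0] + in[1];")` yields a complete method whose only query-specific parts are the two inserted snippets.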

Thanks (sneak preview of future versions in next slides)

The overall vision for OpenDremel

• An interactive data cloud platform for managing high volumes of static data in the form of serialized objects.

• Compatible with Google tools such as BigQuery, Prediction API, Fusion Tables, Google Storage, etc.

• Aggressively use existing open-source software, preferably Apache-licensed, to quickly “implement” the desired functionality.

Features Backlog

• Processing compressed data directly, without decompressing.

• Macro parallelism: 1) multithreading 2) multi-process 3) multi-node 4) massive clustering

• Micro parallelism: 1) SSE & AVX 2) OpenCL 3) better machine code to leverage ILP 4) light threads for parallel processing of a single tablet 5) LLVM 6) special hardware: GPU & Tilera

• Interactive joins and indexing support, zone maps and global system-recognized dimensions such as time, geography, IP.

• Advanced analytics, statistics and machine learning capabilities.

• Richer SerObjLib, more formats.

• Advanced visualization and streaming.

• Batch data-crunching and map-reduce support.

• Multi-tenancy, resource control, metering and accounting.

• CEP capabilities, fast lookups, and querying of data that is not yet packed into tablets.

• User-defined functions.

• Scratch tables and rolling queries.
