OpenDremel's Metaxa Architecture
Uploaded by camuel-gilyadov
Metaxa Architecture
June 22nd
By Camuel, OpenDremel
Meet Metaxa
• Implements Dremel using LAPHROAIG as the execution engine and as the storage backend.
• No distribution: METAXA is a single jar file executed in a single JVM; it produces and executes a single-threaded MAP job.
• All input data resides inside a single LAPHROAIG object.
• Output is one of the following:
– A new LAPHROAIG object
– Streamed back.
• Convert-type commands convert a single LAPHROAIG object from popular object serialization formats to the nested columnar Dremel format, or vice versa.
• Query-type commands process LAPHROAIG objects in the nested columnar Dremel format and can either store the result in another object or convert it to a popular object serialization format and stream it back to the user.
• A LAPHROAIG object is a container of other “serialized objects” or “columnar encoded objects”. The two types of objects are not to be confused.
• Just five use cases:
– Convert “serialized objects” into “columnar encoded objects”.
– Convert “columnar encoded objects” into “serialized objects”.
– Query “columnar encoded objects” with BQL, producing “serialized objects” and streaming them back to the caller.
– Query “columnar encoded objects” with BQL, producing “serialized objects” and saving them as a new LAPHROAIG “container” object.
– Query “columnar encoded objects” with BQL, producing “columnar encoded objects” and saving them as a new LAPHROAIG “container” object.
Use case #1: Convert serialized objects into columnar-encoded objects
[Diagram: a Convert Command is sent to Metaxa.jar running in LAPHROAIG. Other labels: Hierarchical Schema, Serialized objects (Protobuf, Avro, Thrift), columnar-encoded objects (Tablet).]
Use case #2: Convert columnar-encoded objects into serialized objects
[Diagram: a Convert Command is sent to Metaxa.jar running in LAPHROAIG. Other labels: Hierarchical Schema, Serialized objects (Protobuf, Avro, Thrift), columnar-encoded objects (Tablet).]
Use case #3: Query “columnar encoded objects” with BQL, producing “serialized objects” and streaming them back to the caller.
[Diagram: a BQL Query is sent to Metaxa.jar running in LAPHROAIG. Other labels: Hierarchical Schema, Serialized objects (Protobuf, Avro, Thrift), columnar-encoded objects (Tablet).]
Use case #4: Query “columnar encoded objects” with BQL, producing “serialized objects” and saving them
[Diagram: a BQL Query is sent to Metaxa.jar running in LAPHROAIG. Other labels: Hierarchical Schema, Serialized objects (Protobuf, Avro, Thrift), columnar-encoded objects (Tablet).]
Use case #5: Query “columnar encoded objects” with BQL, producing “columnar encoded objects” and saving them
[Diagram: a BQL Query is sent to Metaxa.jar running in LAPHROAIG. Other labels: columnar-encoded objects (Tablet), shown twice.]
SerObjs – Serialized Objects
• Data produced by serializing objects with Protobuf, Avro or Thrift.
• Hierarchical data.
• Flat data such as CSV.
• RDBMS-originated data.
• Data from KV stores and document stores.
• Logs.
• The schema may be embedded or provided separately.
Tablet – Columnar-encoded objects
• Immutable chunk of data.
• Logically composed of Slices and can be turned into a series of Slices.
• Columnar and Dremel-encoded.
• Consists of a header (called the Tablet Schema) and multiple {byte, word, dword or qword}-streams.
• The Tablet Schema describes
– Tablet columns (multi-dimensional arrays), including metadata, compression and encoding metadata, as well as references to associated dictionaries, rep & def levels, etc.
– The original SerObjs schema and its mapping to tablet columns
– Future: additional SerObjs schemas and mappings
• Tablet data is a set of multidimensional arrays of 8-, 16-, 32- or 64-bit elements, denoted byte (b), word (w), double word (dw) and quad word (qw). Each array represents a column and can be accessed independently without incurring access costs for neighboring arrays. Every element is a bit field in which different bits carry different information, for example (multiple) column values, counts (RLE), and rep and def levels.
• A tablet scanner can mask some of the details of column encoding and provide a higher-level interface to the tablet, automatically decoding RLE, dictionaries and rep & def levels. However, the tablet binary format is a stable interface between Metaxa modules and between different versions of the OpenDremel system.
• Tablets are horizontal partitions of a larger columnar dataset.
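To make the bit-field idea concrete, here is a minimal Java sketch of one possible dw (32-bit) element. The layout (16 value bits, 8 RLE-count bits, 4 rep-level bits, 4 def-level bits) and the class name TabletElement are assumptions made for illustration only; in Metaxa the real layout would be whatever the Tablet Schema records for the column.

```java
// Hypothetical dw (32-bit) element layout, chosen only for illustration:
// bits 0-15 value, bits 16-23 RLE count, bits 24-27 rep level, bits 28-31 def level.
final class TabletElement {
    private TabletElement() {}

    // Packs the four fields into one 32-bit element after range-checking each of them.
    static int pack(int value, int count, int rep, int def) {
        if (value < 0 || value > 0xFFFF) throw new IllegalArgumentException("value out of range");
        if (count < 0 || count > 0xFF) throw new IllegalArgumentException("count out of range");
        if (rep < 0 || rep > 0xF) throw new IllegalArgumentException("rep level out of range");
        if (def < 0 || def > 0xF) throw new IllegalArgumentException("def level out of range");
        return value | (count << 16) | (rep << 24) | (def << 28);
    }

    // Unsigned shifts keep the top field correct even when the sign bit is set.
    static int value(int element) { return element & 0xFFFF; }
    static int count(int element) { return (element >>> 16) & 0xFF; }
    static int rep(int element)   { return (element >>> 24) & 0xF; }
    static int def(int element)   { return (element >>> 28) & 0xF; }

    public static void main(String[] args) {
        int e = pack(1234, 3, 1, 2);
        System.out.println(value(e) + " x" + count(e) + " rep=" + rep(e) + " def=" + def(e));
    }
}
```

A tablet scanner in the sense of the slide would hide these shifts and masks and hand out already decoded values.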
Slice – Columnar-encoded object fraction
• A Slice is a vector (ordered list of scalars) where each scalar is the current value of a different tablet column that is being scanned / iterated.
• A Tablet can be broken down into an ordered list of slices and reassembled from a series of slices.
• A Slice in Metaxa contains plain integer values (not bit fields) of type b, w, dw and qw.
• A Slice may contain fewer values than there are columns in the tablet. In this case the columns represented in the slice are called “projected columns”.
• A Slice also contains an additional integer field called Level. This Level is aliased as FetchLevel or SelectLevel, depending on whether the Tablet is being sliced into a series of slices or reconstructed from a series of slices.
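The slicing and reassembly described above can be sketched in Java as follows. This is a simplified illustration under stated assumptions: columns are flat long arrays of equal length (real tablet columns are multi-dimensional), the per-position levels are supplied by the caller rather than decoded from rep levels, and the names Slice, Slicer, slice and rebuild are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

// One slice: a plain integer per projected column plus the Level field.
final class Slice {
    final long[] values; // values[p] belongs to the p-th projected column
    final int level;     // FetchLevel when slicing, SelectLevel when reconstructing

    Slice(long[] values, int level) {
        this.values = values;
        this.level = level;
    }
}

final class Slicer {
    private Slicer() {}

    // Iterates the projected columns in lockstep and emits one slice per position.
    static List<Slice> slice(long[][] columns, int[] levels, int[] projected) {
        List<Slice> slices = new ArrayList<>();
        for (int i = 0; i < levels.length; i++) {
            long[] values = new long[projected.length];
            for (int p = 0; p < projected.length; p++) {
                values[p] = columns[projected[p]][i];
            }
            slices.add(new Slice(values, levels[i]));
        }
        return slices;
    }

    // Inverse direction: rebuilds the projected columns from a series of slices.
    static long[][] rebuild(List<Slice> slices, int width) {
        long[][] columns = new long[width][slices.size()];
        for (int i = 0; i < slices.size(); i++) {
            for (int c = 0; c < width; c++) {
                columns[c][i] = slices.get(i).values[c];
            }
        }
        return columns;
    }
}
```

Projecting columns 0 and 2 of a three-column input yields slices with two values each; rebuild then returns exactly those two columns.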
Query Plan (QP)
• A QP is a descriptor of a source tablet, a result tablet, a set of scalar transformations and a DAG of their dataflow interconnections.
• Scalar transformations are of one of the following types:
– Plain transformations => also called expressions; many inputs but one output.
– Predicates => boolean expressions which, when they evaluate to false, cancel the issuance of the result slice.
– Aggregates => Count, Sum and Distinct functions; they aggregate slices and then, when the last slice in an aggregation group is detected, issue multiple result slices.
• QP input and output are always slices. Because of predicates, some input slices may produce no output slice. Because of aggregates, one input slice may produce multiple output slices.
• Input slices contain FetchLevel and output slices contain SelectLevel (according to appendix D of the Dremel paper).
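As an illustration of "a set of scalar transformations and a DAG of their dataflow", the following Java sketch evaluates a tiny plan over one input slice. It covers only plain expressions and one predicate; aggregates and levels are left out. The class MiniPlan, its Node interface and the example query (emit a+b where a > b) are assumptions for illustration, not the Metaxa object model.

```java
import java.util.Optional;
import java.util.function.LongBinaryOperator;

final class MiniPlan {
    // A node reads either the input slice or the results of earlier nodes.
    interface Node { long eval(long[] inSlice, long[] computed); }

    static Node column(int index) {
        return (in, computed) -> in[index];
    }

    static Node apply(int left, int right, LongBinaryOperator f) {
        return (in, computed) -> f.applyAsLong(computed[left], computed[right]);
    }

    private final Node[] nodes;   // listed in topological order of the DAG
    private final int predicate;  // node whose value 0 means "cancel the result slice"
    private final int[] outputs;  // nodes whose values form the output slice

    MiniPlan(Node[] nodes, int predicate, int[] outputs) {
        this.nodes = nodes;
        this.predicate = predicate;
        this.outputs = outputs;
    }

    // Returns the output slice, or empty when the predicate evaluates to false.
    Optional<long[]> process(long[] inSlice) {
        long[] computed = new long[nodes.length];
        for (int i = 0; i < nodes.length; i++) {
            computed[i] = nodes[i].eval(inSlice, computed);
        }
        if (computed[predicate] == 0) {
            return Optional.empty();
        }
        long[] out = new long[outputs.length];
        for (int i = 0; i < outputs.length; i++) {
            out[i] = computed[outputs[i]];
        }
        return Optional.of(out);
    }

    // Hypothetical plan: SELECT a + b WHERE a > b.
    static MiniPlan example() {
        Node[] nodes = {
            column(0),                             // 0: a
            column(1),                             // 1: b
            apply(0, 1, Long::sum),                // 2: a + b
            apply(0, 1, (a, b) -> a > b ? 1L : 0L) // 3: a > b
        };
        return new MiniPlan(nodes, 3, new int[] {2});
    }
}
```

With this plan the input slice {5, 3} produces the output slice {8}, while {1, 3} produces no output slice at all.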
Conceptual View of Tablet
[Figure: six records, Record [0] to Record [5], each drawn as nested arrays over levels (dimensions) 0, 1 and 2.]
Conceptual View of Tablet Slicing
[Figure: Record [0] drawn as nested arrays over levels (dimensions) 0, 1 and 2, and cut into slices labelled [0][0][0], [0][0][1], [0][0][2], [0][1][0], [0][1][1] and [0][2][2].]
Conceptual View of QP
[Figure: Record [0] and Record [1] drawn as nested arrays over levels (dimensions) 0, 1 and 2; expressions Expr (rep=0), Expr (rep=1) and Expr (rep=2); slices labelled [0][0][0] and [0][1][1].]
Compiler: Translates BQL into Query Plan
Requirements:
– Must parse and compile valid BQL as defined by BigQuery.
– Must reject invalid BQL with user-friendly error messages.
– Must produce an executable QP object with the following features:
• It is Serializable => no circular references and no references to “system” objects such as file handles; a pure object model
• getProcessSliceSource => returns the processSlice text in Java source-code form
• getSourceTablets => returns the tablets to run the QP on
• setResultTablet => sets the result tablet
• setExecutionStatusCode => indicates the status of QP execution
• log => allows logging important events during QP execution
• getDiagram => returns a graphic image of the QP diagram (for debugging)
– Must provide basic command-line argument handling as well as simple shell functionality.
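The QP object features listed above could be captured in a Java interface along the following lines. The method names come from the slide; every parameter type and return type, the InMemoryQueryPlan class and its helper methods are assumptions made for this sketch. The round trip through Java serialization is only there to show what "Serializable, pure object model" buys.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.util.ArrayList;
import java.util.List;

// Method names follow the slide; all signatures are assumed.
interface QueryPlan extends Serializable {
    String getProcessSliceSource();
    List<String> getSourceTablets();
    void setResultTablet(String tabletKey);
    void setExecutionStatusCode(int code);
    void log(String event);
    byte[] getDiagram();
}

// Holds only strings, ints and lists: no file handles, no cyclic references.
final class InMemoryQueryPlan implements QueryPlan {
    private static final long serialVersionUID = 1L;

    private final String processSliceSource;
    private final ArrayList<String> sourceTablets;
    private final ArrayList<String> events = new ArrayList<>();
    private String resultTablet;
    private int statusCode;

    InMemoryQueryPlan(String processSliceSource, List<String> sourceTablets) {
        this.processSliceSource = processSliceSource;
        this.sourceTablets = new ArrayList<>(sourceTablets);
    }

    @Override public String getProcessSliceSource() { return processSliceSource; }
    @Override public List<String> getSourceTablets() { return new ArrayList<>(sourceTablets); }
    @Override public void setResultTablet(String tabletKey) { resultTablet = tabletKey; }
    @Override public void setExecutionStatusCode(int code) { statusCode = code; }
    @Override public void log(String event) { events.add(event); }
    @Override public byte[] getDiagram() { return new byte[0]; } // rendering is not sketched here

    String resultTablet() { return resultTablet; }
    int statusCode() { return statusCode; }
    List<String> events() { return new ArrayList<>(events); }

    // Serializes the plan to bytes and reads it back, as if shipping it to an executor.
    InMemoryQueryPlan copyViaSerialization() {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
                out.writeObject(this);
            }
            try (ObjectInputStream in =
                     new ObjectInputStream(new ByteArrayInputStream(bytes.toByteArray()))) {
                return (InMemoryQueryPlan) in.readObject();
            }
        } catch (IOException | ClassNotFoundException e) {
            throw new IllegalStateException("query plan is not serializable", e);
        }
    }
}
```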
Vocabulary
• Token – lexeme
• Parse tree – token tree
• AST – Abstract Syntax Tree
• SM – Semantic Model
• ASM – Annotated Semantic Model
• QP – Query Plan
• DAG – Directed Acyclic Graph
• Schema – metadata about a dataset
Prerequisite Materials
– http://code.google.com/apis/bigquery/docs/query-reference.html
– http://www.antlr.org/
– http://en.wikipedia.org/wiki/Parsing
– http://en.wikipedia.org/wiki/Query_plan
– http://en.wikipedia.org/wiki/Compiler_construction
– http://www.amazon.com/Terence-Parr/e/B001JS3O0U
High-Level Design (verbose)
[Diagram of the compiler pipeline: command line arguments / shell input -> Shell -> BQL -> Antlr Parser -> AST -> Semantic Parser -> SM (Semantic Model: Java object model implemented via Java collections) -> Semantic Analyzer (validation, resolving references, result schema inference, optimization) -> Annotated Semantic Model -> QP Generator -> QP (Query Plan, includes ResultTablet metadata). Further labels: SerObjs Schema, Result Schema Generator, Result SerObjs Schema, Metadata (file locations and statistics), Optimization Rules, Validation Rules, C / asm Template.]
[Annotated] Semantic Model
• Comprehensively describes the query in every detail
• Java objects (packed into collections, without spaghetti cyclic references)
• Must be serializable to a file with the SerObjLib framework, and restorable.
• Must be printable in a human-readable form
• Must be renderable on request as a clear graphic diagram with a legend.
QP: Scalar Transformation functions (Expr)
• A set of primitive predefined scalar operations and functions applied to xfunc arguments in a particular prescribed order.
• Expressed in valid C or assembly with some restrictions.
• Purely functional => side-effect free, meaning no static/global variables and no memory allocations. However, for performance and brevity they are inlined into a single processSlice function.
• Some functions have a context object where they can store their externalized state between calls. One regular and one associative array are provided as context for these functions.
– Context-free transformation functions
• One value in, one value out, e.g. a+b
– Scalar context transformation functions
• Many values in, many values out, e.g. sum(a) within links
– Map context transformation functions
• Many values in, many values out (out of sync), e.g. sum(a) group by date
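The three kinds of transformation functions can be sketched in Java as below. The slides state that the real functions are expressed in C or assembly, so this is only an illustration of where the state lives: the class name Xfuncs, the method names and the choice of a long[] as the "regular array" and a Map as the "associative array" are all assumptions.

```java
import java.util.HashMap;
import java.util.Map;

// No static or global state: everything a function remembers lives in the
// context passed in by the caller.
final class Xfuncs {
    private Xfuncs() {}

    // Context-free: a + b
    static long add(long a, long b) {
        return a + b;
    }

    // Scalar (fixed-size) context: running sum kept in one slot of a regular array.
    static long sum(long[] context, int slot, long a) {
        context[slot] += a;
        return context[slot];
    }

    // Map (variable-size) context: sum(a) group by key, kept in an associative array.
    static long sumGroupBy(Map<Long, Long> context, long key, long a) {
        return context.merge(key, a, Long::sum);
    }

    public static void main(String[] args) {
        long[] regular = new long[1];
        Map<Long, Long> associative = new HashMap<>();
        System.out.println(add(2, 3));
        System.out.println(sum(regular, 0, 4));
        System.out.println(sumGroupBy(associative, 20110622L, 7));
    }
}
```

Because the context is handed in from outside, the same function body can be inlined into processSlice without carrying hidden state.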
QP in C Form
• Generated ProcessSlice(..){..} function.
– Input: inSlice
– Output: outSlice
– Context object for state externalization
• inSlice contains scalar values for every source function and also fetchLevel.
• outSlice must have correct scalar values for every result function and also the correct selectLevel.
– outSlice is guaranteed to preserve its content between calls, so it can be used as a cache for result functions that haven’t changed, and also as a cache for selectLevel if it has not changed.
– outSlice values can also be read (it contains the results of the previous outSlice)
– On the first call all values in outSlice are guaranteed to be zero.
QP template (according to appendix D)
void processSlice(inSlice, outSlice, Context) {
    // Evaluate the where clause; if it evaluates to false, skip this slice:
    if (the where clause evaluates to false) {
        outSlice.setSkip;
        outSlice.selectLevel = min(outSlice.selectLevel, inSlice.fetchLevel);
        return;
    }
    // The where clause evaluated to true:
    switch(inSlice.fetchLevel) {
    case 0:
        Evaluate expressions (xfuncs) with repetition level = 0
    ……..
    ……..
    case n:
        Evaluate expressions (xfuncs) with repetition level = n
        If it is the last slice in the aggregation group then:
        // the line below will cause additional calls to ProcessSlice
        outSlice.setAdditionalSliceCount( Number of slices in aggregation
    }
}
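A runnable Java reading of the template, for a hypothetical query with one rep=0 expression (docId * 2), one rep=1 expression (len + 1) and the predicate len > 0, might look like the sketch below. Several points are assumptions rather than something the template states: the switch falls through so that a change at level k re-evaluates every expression with repetition level k or deeper; the switch uses the minimum level accumulated over skipped slices instead of the raw fetchLevel, so that expressions missed while skipping are refreshed; selectLevel starts afresh after an emitted slice; and the Context parameter and the aggregation branch are omitted.

```java
// Hypothetical input: docId has repetition level 0, len has repetition level 1.
final class InSlice {
    long docId;
    long len;
    int fetchLevel;
}

// Keeps its content between calls, so unchanged results act as a cache.
final class OutSlice {
    long docIdTimesTwo; // rep=0 expression result
    long lenPlusOne;    // rep=1 expression result
    int selectLevel;
    boolean skip;
}

final class GeneratedPlan {
    private GeneratedPlan() {}

    static void processSlice(InSlice in, OutSlice out) {
        // Fold in the levels of slices skipped since the last emitted slice;
        // after an emitted slice the level starts afresh (assumption).
        int level = out.skip ? Math.min(out.selectLevel, in.fetchLevel) : in.fetchLevel;
        out.selectLevel = level;

        // WHERE len > 0 (hypothetical predicate)
        if (!(in.len > 0)) {
            out.skip = true;
            return;
        }
        out.skip = false;

        switch (level) {
            case 0:
                out.docIdTimesTwo = in.docId * 2;
                // fall through: deeper levels have changed as well
            case 1:
                out.lenPlusOne = in.len + 1;
                break;
            default:
                throw new IllegalArgumentException("unexpected level " + level);
        }
    }
}
```

The caller is expected to reuse one OutSlice instance across calls and to ignore it whenever skip is set.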
Tableton: Columnar Abstraction
• Tableton is a set of sequentially accessed multidimensional scalar arrays.
• A Tablet is a serialized, Dremel-encoded columnar dataset of fixed size. Each array in a tablet can be independently accessed serially without incurring the cost of buffering neighboring arrays.
• Four types of arrays: bytes, words (16b), dwords (32b), qwords (64b).
• The following operations are defined:
– Parsing Tablet Schema => reading and parsing the tablet header/metadata (also called the tablet schema) and providing an object model for it.
– Reading => converting a Tablet to SerObjs using an FSM for better performance, as described in the Dremel paper (calling callback functions to let them construct SerObjs in various formats)
– Slicing => synchronized multi-array scalar iteration over a Tablet
– Building Tablet Schema => creating the tablet header/metadata (the tablet schema) with a convenient builder API. Also called the TabletSchema Editor.
– Construction => re-creating a Tablet from slices; this interface is also used for dissecting SerObjs into a tablet.
– Compaction => the constructed Tablet is compressed and a hash key is generated for it; from that point on it is immutable.
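The compaction step can be illustrated with standard JDK classes. The slides do not name a compression codec or a hash function, so DEFLATE and SHA-256 are assumptions here, as are the class name CompactedTablet and the decision to hash the compressed bytes rather than the raw ones.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.zip.DeflaterOutputStream;
import java.util.zip.InflaterInputStream;

final class CompactedTablet {
    private final byte[] compressed; // never handed out, so the tablet stays immutable
    private final String hashKey;    // hex digest identifying the compacted tablet

    private CompactedTablet(byte[] compressed, String hashKey) {
        this.compressed = compressed;
        this.hashKey = hashKey;
    }

    // Compresses the constructed tablet bytes and derives the hash key from the result.
    static CompactedTablet compact(byte[] rawTablet) {
        try {
            ByteArrayOutputStream buffer = new ByteArrayOutputStream();
            try (DeflaterOutputStream deflate = new DeflaterOutputStream(buffer)) {
                deflate.write(rawTablet);
            }
            byte[] compressed = buffer.toByteArray();
            byte[] digest = MessageDigest.getInstance("SHA-256").digest(compressed);
            StringBuilder hex = new StringBuilder();
            for (byte b : digest) {
                hex.append(String.format("%02x", b));
            }
            return new CompactedTablet(compressed, hex.toString());
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e);
        }
    }

    String hashKey() {
        return hashKey;
    }

    // Returns a fresh copy of the original bytes on every call.
    byte[] decompress() {
        try (InputStream inflate = new InflaterInputStream(new ByteArrayInputStream(compressed))) {
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            byte[] chunk = new byte[4096];
            int n;
            while ((n = inflate.read(chunk)) != -1) {
                out.write(chunk, 0, n);
            }
            return out.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Since the key depends only on the content, compacting the same bytes twice yields the same key, which makes it usable as an object name in the storage backend.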
What about other datatypes?
• They are mapped into yet another dimension of the scalar array.
• It is strongly recommended not to use Java strings. They are impossible to work with without incurring the full cost of object lifecycle management.
• It is OK not to support them at all at first, and then gradually add support for them.
• In any case, the Java String class goodies will be impossible to support in Metaxa because of performance.
• The same goes for BLOBs, images and any other complex data type. All are mapped to yet another dimension of the scalar array.
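One way to picture "yet another dimension of the scalar array" for strings is a byte array plus an offsets array, so that scans compare raw bytes and never create String objects. The flat offsets encoding, the UTF-8 choice and the class ByteStringColumn are assumptions for this sketch; String appears only at the edge, when values are loaded.

```java
import java.nio.charset.StandardCharsets;

final class ByteStringColumn {
    final byte[] data;   // bytes of all values, back to back
    final int[] offsets; // value i occupies data[offsets[i] .. offsets[i + 1])

    private ByteStringColumn(byte[] data, int[] offsets) {
        this.data = data;
        this.offsets = offsets;
    }

    // Loads values once; after this point no String objects are needed.
    static ByteStringColumn of(String... values) {
        byte[][] encoded = new byte[values.length][];
        int[] offsets = new int[values.length + 1];
        for (int i = 0; i < values.length; i++) {
            encoded[i] = values[i].getBytes(StandardCharsets.UTF_8);
            offsets[i + 1] = offsets[i] + encoded[i].length;
        }
        byte[] data = new byte[offsets[values.length]];
        for (int i = 0; i < values.length; i++) {
            System.arraycopy(encoded[i], 0, data, offsets[i], encoded[i].length);
        }
        return new ByteStringColumn(data, offsets);
    }

    int length(int i) {
        return offsets[i + 1] - offsets[i];
    }

    // Allocation-free equality test against a pre-encoded constant.
    boolean equalsAt(int i, byte[] needle) {
        int start = offsets[i];
        if (length(i) != needle.length) {
            return false;
        }
        for (int k = 0; k < needle.length; k++) {
            if (data[start + k] != needle[k]) {
                return false;
            }
        }
        return true;
    }
}
```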
Hierarchical vs. Columnar
• Different abstractions / domains / contexts
• Different schemas
• Most confusion stems from not differentiating!
• Always keep the context in mind when you are developing…
• Don’t think about both at the same time unless you are willing to develop schizophrenia.
• Columnar is not an implementation artifact of hierarchical. Columnar is a whole new model in its own right.
• We must adopt two different vocabularies for these domains. Confusion is notoriously common here.
Hierarchical vs. Columnar
Hierarchical | Columnar
A SerObjs in our lingo | A Tablet in our lingo
Protobuf, Avro, Thrift files | Dremel-generated tablets
Serialized objects | Multi-dimensional arrays
The only user-level abstraction | User never knows what it is
BQL queries written against it | Query plans executed against it
More frontend-related | More backend-related
More logical / external format | More physical / internal format
Hierarchical is queried | Columns are scanned
SerObjLib component | Tableton component
Hierarchical Example
Executor: Executes QP against tablets
• Requirements
– Must convert the QP into executable bytecode and execute it (not interpret it).
– Must work with the QP as an object model, but initially compiling and running the QP in Java form will suffice.
– Must not mask data and task parallelism.
• Data parallelism at the tablet level and also at the column level within a tablet.
• Task parallelism across separate QP transformation functions
– Must deliver ultra-high performance
• Latency overhead within a few milliseconds (assuming data in RAM).
• Throughput of multiple GB/sec
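Tablet-level data parallelism means that each tablet can be processed independently and the partial results combined afterwards. The slides describe this version of Metaxa as single-threaded, so the following Java sketch is not how Metaxa runs; it only shows, with a hypothetical TabletParallelism class and a one-column "tablet" modelled as a long[], that a per-tablet sum does not need to know about other tablets.

```java
import java.util.Arrays;
import java.util.List;

final class TabletParallelism {
    private TabletParallelism() {}

    // Per-tablet work: a plain sequential scan of one column.
    static long sumTablet(long[] column) {
        return Arrays.stream(column).sum();
    }

    // Tablets are independent, so they may be scanned in parallel and the
    // partial sums combined afterwards.
    static long sumAll(List<long[]> tablets) {
        return tablets.parallelStream()
                      .mapToLong(TabletParallelism::sumTablet)
                      .sum();
    }

    public static void main(String[] args) {
        List<long[]> tablets = Arrays.asList(new long[] {1, 2, 3}, new long[] {10, 20});
        System.out.println(sumAll(tablets));
    }
}
```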
Vocabulary
• QP – Query Plan
• DAG – Directed Acyclic Graph
• Slot – like a thread (todo)
• Expression – operator tree over scalar arguments and scalar constants
• CF – Context Free (stateless scalar expression)
• FC – Fixed-size Context (scalar expression with an accumulator)
• VC – Variable-size Context (scalar expression with a growing list of accumulators)
Code generation
• [todo] Janino!
• [todo] Explain dynamic Java code generation and compilation
• [todo] Use code templates! No classes or functions, just a code listing with labels and jumps. The generated code is different every time and no one is going to study it. Put the static portions in a library and pre-compile it regularly. The dynamic portion is just a code snippet.
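The "code templates" idea can be sketched as plain text substitution: a static skeleton plus per-query snippets. The template text, the placeholder markers and the class SliceCodeTemplate below are assumptions for illustration. Compiling the resulting source at run time, for example with Janino as the todo list suggests, is deliberately not shown.

```java
import java.util.List;

final class SliceCodeTemplate {
    private SliceCodeTemplate() {}

    // Static skeleton; only the marked portions change from query to query.
    private static final String TEMPLATE =
          "static void processSlice(long[] in, long[] out) {\n"
        + "    if (!(/*WHERE*/)) return;\n"
        + "/*BODY*/"
        + "}\n";

    // Fills the skeleton with a where expression and a list of assignment snippets.
    static String generate(String whereExpr, List<String> assignments) {
        StringBuilder body = new StringBuilder();
        for (String assignment : assignments) {
            body.append("    ").append(assignment).append(";\n");
        }
        return TEMPLATE.replace("/*WHERE*/", whereExpr)
                       .replace("/*BODY*/", body.toString());
    }
}
```

The output is ordinary Java source text, so the static parts of the executor can live in a pre-compiled library while only this snippet is generated per query.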
Thanks (sneak preview of future versions in next slides)
The overall vision for OpenDremel
• Interactive data cloud platform for managing high volumes of static data in the form of serialized objects.
• Compatible with Google tools such as BigQuery, Prediction API, Fusion Tables, Google Storage, etc.
• Aggressively use existing open-source software, preferably Apache-licensed, to quickly “implement” the desired functionality.
Features Backlog
• Processing compressed data directly, without decompressing.
• Macro parallelism: 1) multithreading 2) multi-process 3) multi-node 4) massive clustering
• Micro parallelism: 1) SSE & AVX 2) OpenCL 3) better machine code to leverage ILP 4) light threads for parallel processing of a single tablet 5) LLVM 6) special hardware (GPU & Tilera)
• Interactive joins and indexing support, zone maps and global system-recognized dimensions such as time, geography, IP
• Advanced analytics, statistics and machine learning capabilities.
• Richer SerObjLib, more formats
• Advanced visualization and streaming.
• Batch data-crunching and map-reduce support.
• Multi-tenancy, resource control, metering and accounting.
• CEP capabilities, fast lookups, and querying of data that is not yet packed into tablets.
• User-defined functions.
• Scratch tables and rolling queries