Expression evaluation on protobuf data

Faster expression evaluation on Protobuf data

Oscar Moll, MIT DB Group
Work done as an intern at Google (Summer '16)

Protocol buffers = data definition language + serialization spec + compiler + libraries

// Protobuf data definition
message Person {
  string name = 1;
  int32 id = 2;
  string email = 3;
}

// Example C++ user code
Person john;  // Person is implemented as a combination of
              // generated C++ and shared base classes

if (john.ParseFromString(data_buffer)) {
  id = john.id();
  name = john.name();
}

Protocol buffer backward compatibility - Adding fields

// Protobuf data definition with a new field added
message Person {
  string name = 1;
  int32 id = 2;
  string email = 3;
  int32 new_field = 4;
}

// The same old C++ user code, built against the stale data definition,
// still works correctly
Person john;  // Person is implemented as a combination of
              // generated C++ and shared base classes

Protocol buffer backward compatibility - Deprecating fields

// Protobuf data definition removes some old field
message Person {
  string name = 1;
  int32 id = 2;
  // string email = 3;  (removed)
  int32 new_field = 4;
}

// byte overhead is dependent only on fields actually present

// The same old C++ user code, built against the stale data definition,
// still works correctly even though email is missing
Person john;

Expression evaluation on Protobuf data

message TestProto {
  int32 x = 4;
  NestedProto bar = 6;
  int32 y = 7;
  ...
}

message NestedProto {
  int32 bx = 4;
  ...
}

Let foo be a TestProto

We want to eval: {"foo.x > foo.bar.bx", "foo.y > foo.x"}

E.g. "x:1 bar { bx:10 } y:20" => {1 > 10, 20 > 1} => {false, true}

TestProto::ParseFromString example (and baseline)

test::TestProto f;
auto b = f.ParseFromString(data);
if (!b) {
  e.Invalid("parse failed");
  return;
}

output_buffer[0] = f.x() > f.bar().bx();
output_buffer[1] = f.y() > f.x();

Type information received at runtime
Expression list also received at runtime

ParseFromString:
❏ Work proportional to every (nested) value in the protobuf
❏ Also limited by interface (e.g. error checking)
❏ Allocates memory dynamically for every variable-sized value

{"foo.x > foo.bar.bx", "foo.y > foo.x"};

We would prefer to access only what we need from the serialized buffer.

Which fields are needed is expression dependent.

We can do this as follows...

The protobuf wire format is a concatenation of binary tag-value pairs

// example data
x:1 bar { bx:10 } y:20

[Figure: wire layout of the example data: tag 1 | tag <len> ( tag 10 ) | tag 20]

Last binding wins: we must check every tag at the top level

Implementing a general protobuf wire format scanner

void GeneralProtobufScan(input, end, action_table) {
  while (input < end) {
    (field_label, type) = ParseTag(input, end);
    switch (action_table[field_label]) {
      case Action::kStore:
        switch (type) {
          // how to parse varints, nested messages ...
        }
      case Action::kSkip:
        switch (type) {
          // how to skip varints, nested messages ...
        }
      // Other actions (like counting, summing …)
    }
  }
}

message TestProto {
  optional int32 x = 4;
  optional NestedProto bar = 6;
  optional int32 y = 7;
  …
  // padding
  repeated NestedProto pad = 8;
}

message NestedProto {
  optional int32 bx = 4;
  ...
}

Action table:
(x)   4 => store
(bar) 6 => recurse
(y)   7 => store
(pad) 8 => skip
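The action-table scan described above can be sketched as runnable C++. This is a minimal illustration under assumptions, not the implementation from the talk: the `ActionEntry` layout, the use of `std::map` as the table, and the `ParseVarint` helper are all hypothetical, only varint values are stored, and group wire types are rejected.

```cpp
#include <cstdint>
#include <map>

enum class Action { kSkip, kStore, kRecurse };

// Hypothetical table entry: what to do with a field, and where to put it.
struct ActionEntry {
  Action action = Action::kSkip;
  int64_t* slot = nullptr;                                  // target for kStore
  const std::map<uint32_t, ActionEntry>* nested = nullptr;  // table for kRecurse
};
using ActionTable = std::map<uint32_t, ActionEntry>;

// Decodes one base-128 varint at *pos and advances *pos past it.
inline bool ParseVarint(const uint8_t** pos, const uint8_t* end, uint64_t* value) {
  uint64_t result = 0;
  for (int shift = 0; shift < 64 && *pos < end; shift += 7) {
    const uint8_t byte = *(*pos)++;
    result |= static_cast<uint64_t>(byte & 0x7F) << shift;
    if ((byte & 0x80) == 0) { *value = result; return true; }
  }
  return false;  // truncated or over-long varint
}

// Visits every top-level tag; fields missing from the table are skipped.
// Storing on every occurrence gives "last binding wins" for free.
inline bool GeneralProtobufScan(const uint8_t* input, const uint8_t* end,
                                const ActionTable& table) {
  const ActionEntry skip_entry;  // default action: skip
  while (input < end) {
    uint64_t tag = 0;
    if (!ParseVarint(&input, end, &tag)) return false;
    const uint32_t field_label = static_cast<uint32_t>(tag >> 3);
    const uint32_t wire_type = static_cast<uint32_t>(tag & 7);
    const auto it = table.find(field_label);
    const ActionEntry& entry = (it == table.end()) ? skip_entry : it->second;
    uint64_t value = 0;
    switch (wire_type) {
      case 0:  // varint: must be decoded even when skipped
        if (!ParseVarint(&input, end, &value)) return false;
        if (entry.action == Action::kStore) *entry.slot = static_cast<int64_t>(value);
        break;
      case 1:  // fixed64
        if (end - input < 8) return false;
        input += 8;
        break;
      case 2:  // length-delimited: nested message, string, bytes
        if (!ParseVarint(&input, end, &value)) return false;
        if (value > static_cast<uint64_t>(end - input)) return false;
        if (entry.action == Action::kRecurse &&
            !GeneralProtobufScan(input, input + value, *entry.nested)) {
          return false;
        }
        input += value;
        break;
      case 5:  // fixed32
        if (end - input < 4) return false;
        input += 4;
        break;
      default:  // groups are not handled in this sketch
        return false;
    }
  }
  return true;
}
```

With a top-level table mapping 4 and 7 to store, 6 to recurse into a nested table that stores field 4, and no entry for 8, scanning the bytes of "x:1 bar { bx:10 } y:20" fills the three slots with 1, 10 and 20, after which the two comparisons can be evaluated directly.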

Fast expression evaluation via Scanning

# padding elts | ParseFromString* (ns) | Per pad elt | Scanner (ns) | Per pad elt | Speedup
0              | 72                    | -           | 33           | -           | 2.2
8              | 163                   | 11.4        | 53           | 2.5         | 3.1
64             | 535                   | 7.2         | 187          | 2.4         | 2.9
512            | 2578                  | 4.9         | 1197         | 2.3         | 2.2

message TestProto {
  optional int32 x = 4;
  optional NestedProto bar = 6;
  optional int32 y = 7;

  // vary amount of padding,
  // which we need to (partially) parse to skip
  repeated NestedProto pad = 8;
}

CPU: Intel Sandy Bridge (2x8 cores), 2600 MHz, dL1: 32KB, dL2: 256KB, dL3: 20MB. Measuring 1/throughput.
* protoc set to optimize for speed.

Perf stat for fixed32 microbenchmark:

● 4.48 instructions per cycle (i.e. high utilization of the CPU)
● 30% of all instructions are branches (i.e. likely not work efficient)
● 0.03% mispredicted (i.e. checks are predictable)
● At 2 GB/s: pretty good if data comes from 1-10 Gbit Ethernet. But:
○ Far from per-core memory bandwidth (~7-12 GB/s)
○ Single core would have lower throughput than a single PCIe SSD (4 GB/s)
○ Single core would have lower throughput than a 40 Gbit Ethernet link

Generic scanner efficiency. How much faster is still useful?

// using fixed32 helps measure overhead
repeated fixed32 pad = 8;  // was: repeated NestedProto pad = 8;
}

The Protobuf wireformat is a concatenation of (compressed tag, compressed value) pairs

message TestProto {
  optional int32 x = 4;
  optional NestedProto bar = 6;
  optional int32 y = 7;
  ...
}

message NestedProto {
  optional int32 bx = 4;
  ...
}

Tag = Varint(field_label << 3 | type), followed by Encoded(Value)

// example data
x:1 bar { bx:10 } y:20

Varint(4 << 3 | 0) = 32, Varint(10)  // field 4, varint type, value 10

Varint encoding: maps smaller integers to fewer bytes. 1 bit/byte metadata overhead.
1   → 0b 0000 0001  // first byte intact
128 → 0b 1000 0000  // first 7 bits in first byte
      0b 0000 0001  // 8th bit in second byte
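The tag and varint rules above are small enough to write out. The following is a minimal sketch; `MakeTag`, `AppendVarint` and `ParseVarint` are illustrative names chosen here, not functions from the protobuf library.

```cpp
#include <cstdint>
#include <vector>

// Tag = field_label << 3 | wire type (0 = varint, 2 = length-delimited).
constexpr uint32_t MakeTag(uint32_t field_label, uint32_t wire_type) {
  return field_label << 3 | wire_type;
}

// Appends the varint encoding of `value`: 7 payload bits per byte, least
// significant group first, top bit set on every byte except the last.
inline void AppendVarint(uint64_t value, std::vector<uint8_t>& out) {
  while (value >= 0x80) {
    out.push_back(static_cast<uint8_t>((value & 0x7F) | 0x80));
    value >>= 7;
  }
  out.push_back(static_cast<uint8_t>(value));
}

// Decodes one varint at *pos and advances *pos past it.
// Returns false if the input is truncated or longer than 10 bytes.
inline bool ParseVarint(const uint8_t** pos, const uint8_t* end, uint64_t* value) {
  uint64_t result = 0;
  for (int shift = 0; shift < 64 && *pos < end; shift += 7) {
    const uint8_t byte = *(*pos)++;
    result |= static_cast<uint64_t>(byte & 0x7F) << shift;
    if ((byte & 0x80) == 0) {
      *value = result;
      return true;
    }
  }
  return false;
}
```

Encoding 1 yields the single byte 0x01, and encoding 128 yields the two bytes 0x80 0x01, matching the bit patterns shown above. Because tags are themselves varints, any field label below 16 produces a one-byte tag, which is what makes single-byte tag comparisons possible later on.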

Overheads of generic scanning

void GeneralProtobufScan(input, end, action_table) {
  while (input < end) {
    tag = Varint::Parse(input, end);
    wiretype = tag & 0b111;
    field_label = tag >> 3;
    switch (action_table[field_label]) {
      case Action::kStore:
        switch (wiretype) {
          // parse varints, nested messages ...
        }
      case Action::kSkip:
        switch (wiretype) {
          // Amount of skip work varies per type
          len = Varint::Parse(input, end);
          input += len;  // actual work
        }
    }
  }
}

// kTagX = encode(4,0) = (4 << 3 | 0) = 32
if (input < end && *input == kTagX) {
  // parse this one and save it
}

// kTagBar = encode(6,2) = (6 << 3 | 2) = 50
if (input < end && *input == kTagBar) {
  // recurse to get bar.bx
}

// kTagY = encode(7,0) = (7 << 3 | 0) = 56
if (input < end && *input == kTagY) {
  // parse this one and save it
}

// kTagPad = encode(8,2) = (8 << 3 | 2) = 66
while (input < end && *input == kTagPad) {
  // ParseVarint len
  // skip
}

if (input == end) {
  // yay
  return ErrorCode::kOk;
}

Removing overhead via codegen

  repeated NestedProto pad = 8;
}

[Figure: expected wire layout: x | bar <len> ... | y | pad]

if (input < end && *input == kTagX) {
  // parse this one and save it
}

if (input < end && *input == kTagBar) {
}

if (input < end && *input == kTagY) {
}

while (input < end && *input == kTagPad) {
}

if (input == end) {
} else {
  return GeneralProtobufScan(input, end, actions);
}
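As a hand-written stand-in for what the generated code does, the straight-line checks above can be filled in as ordinary C++. This is a sketch under assumptions: the talk emits LLVM IR rather than C++, the `Values` struct and `ScanResult` enum are hypothetical, and instead of continuing with the general scanner from the current position this version simply reports `kFallback` so the caller can rerun a general scan from the start of the buffer.

```cpp
#include <cstdint>

enum class ScanResult { kOk, kFallback };

// Single-byte tags for the expected layout (field_label << 3 | wire type).
constexpr uint8_t kTagX = 4 << 3 | 0;    // 32
constexpr uint8_t kTagBar = 6 << 3 | 2;  // 50
constexpr uint8_t kTagY = 7 << 3 | 0;    // 56
constexpr uint8_t kTagPad = 8 << 3 | 2;  // 66
constexpr uint8_t kTagBx = 4 << 3 | 0;   // 32, inside NestedProto

struct Values { int64_t x = 0, bx = 0, y = 0; };  // hypothetical output slots

static bool ParseVarint(const uint8_t*& p, const uint8_t* end, uint64_t& v) {
  v = 0;
  for (int shift = 0; shift < 64 && p < end; shift += 7) {
    const uint8_t b = *p++;
    v |= static_cast<uint64_t>(b & 0x7F) << shift;
    if ((b & 0x80) == 0) return true;
  }
  return false;
}

// Fast path: assumes fields arrive as x, bar, y, pad*. Anything else,
// including malformed input, is left to the caller's general scanner.
// On kFallback the contents of *out may be partially written.
inline ScanResult SpecializedScan(const uint8_t* input, const uint8_t* end, Values* out) {
  uint64_t v = 0;
  if (input < end && *input == kTagX) {
    ++input;
    if (!ParseVarint(input, end, v)) return ScanResult::kFallback;
    out->x = static_cast<int64_t>(v);
  }
  if (input < end && *input == kTagBar) {
    ++input;
    if (!ParseVarint(input, end, v) || v > static_cast<uint64_t>(end - input)) {
      return ScanResult::kFallback;
    }
    const uint8_t* bar_end = input + v;
    if (input < bar_end && *input == kTagBx) {
      ++input;
      if (!ParseVarint(input, bar_end, v)) return ScanResult::kFallback;
      out->bx = static_cast<int64_t>(v);
    }
    if (input != bar_end) return ScanResult::kFallback;  // unexpected nested layout
  }
  if (input < end && *input == kTagY) {
    ++input;
    if (!ParseVarint(input, end, v)) return ScanResult::kFallback;
    out->y = static_cast<int64_t>(v);
  }
  while (input < end && *input == kTagPad) {  // skip padding elements
    ++input;
    if (!ParseVarint(input, end, v) || v > static_cast<uint64_t>(end - input)) {
      return ScanResult::kFallback;
    }
    input += v;
  }
  return input == end ? ScanResult::kOk : ScanResult::kFallback;
}
```

On the expected layout each tag costs one byte compare, with no table lookup and no wire-type switch. A buffer containing an unexpected field, such as a stale field 5 between x and bar, fails the final `input == end` check and takes the fallback route.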

Handling unexpected inputs

message TestProto {
  optional int32 x = 4;
  optional int32 old = 5;
  optional NestedProto bar = 6;
  optional int32 y = 7;

  repeated NestedProto pad = 8;
}

[Figure: wire layout x | bar <len> ... | y | pad, with an unexpected `old` field present]

Faster expression evaluation on protobuf via llvm codegen

# padding elts | Scanner (ns) | Per pad elt | Codegen (ns) | Per pad elt | Speedup (over scanner)
0              | 33           | -           | 17           | -           | 1.9
8              | 53           | 2.5         | 22           | 0.63        | 2.4
64             | 187          | 2.4         | 61           | 0.69        | 3.1
512            | 1197         | 2.3         | 342          | 0.63        | 3.5

  // varying padding
  repeated NestedProto pad = 8;
}

Updated stats for fixed32 benchmark

Scanner no longer far from per-core memory bandwidth for this microbenchmark

                       | scanner  | codegen
instructions per cycle | 4.48 ipc | 2.68 ipc
branches/instructions  | 30%      | 30%
branch misprediction   | 0.03%    | 0.06%
exercised bandwidth    | 2GB/s    | 7GB/s

● Example of applying LLVM to a problem outside compilers

● In retrospect: some hallmarks of a good candidate problem to apply JITting to:
○ Highly predictable paths given we know some more runtime information

○ Eg. Mostly legal inputs, with stronger properties than guaranteed by the spec.

○ Dealing correctly with input unpredictability by leveraging a fallback path.

● It converts runtime tag decoding to static number of compares (+ compile time encoding)

● It inlines handlers

● Coalesces some length checks

○ It achieves much lower latency and CPU cost than the C++ protobuf library or the standalone scanner (7x and 3.5x respectively) in our microbenchmarks

Summary

My takeaways on using LLVM codegen as an outsider to compilers.

● Well documented instruction set and tutorial (Kaleidoscope)
● Builder API is very helpful.
● Type checks on LLVM code very useful.
● Some hiccups with using some LLVM instructions (coming from C)
○ GEP, alloca (vs scoped local variable), i1 vs boolean (True < False in signed i1).
● Generating debug information requires me to learn a different API (didn't do it)
● Tracking signed vs unsigned integers becomes my problem.
● Can link to C code in host process, but not obvious how to do inlining of useful pre-existing routines written in C

LLVM JIT library takeaways

● JIT latency overhead: for these microbenchmarks ~3ms for plain codegen + mem2reg, up to ~15ms with extra passes such as function inlining and simplify-cfg.
○ 15ms → JIT would be the bottleneck for data below 100MB

● Hard for a beginner to know which passes should be run, in what order (multiple times?)

● Can be hard to intuit when passes fail to optimize critical points (eg, mem2reg: iteration variable not in register)

● Very tempting to make the codegen phase generate inlined code where it matters (but harder to debug, and maintain).
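One way to arrive at a figure of the order of the 100MB quoted above is to ask how many bytes the generated code could have scanned in the time the JIT takes. The talk does not say how its number was derived, so the sketch below is an assumption: it plugs in the ~7 GB/s codegen bandwidth from the fixed32 benchmark, and `BytesScannedDuring` is a hypothetical helper name.

```cpp
// Bytes that could be scanned in the time spent JIT-compiling.
// For inputs smaller than this, compilation takes longer than the scan itself.
constexpr double BytesScannedDuring(double jit_seconds, double bytes_per_second) {
  return jit_seconds * bytes_per_second;
}

// ~15 ms of JIT with extra passes at ~7 GB/s: about 105 MB.
constexpr double kBreakEvenWithPasses = BytesScannedDuring(0.015, 7e9);

// ~3 ms of plain codegen + mem2reg at ~7 GB/s: about 21 MB.
constexpr double kBreakEvenPlain = BytesScannedDuring(0.003, 7e9);
```

Under the same assumption, the cheaper 3 ms pipeline would bring the break-even size down to roughly 21 MB, which is one reason the choice of passes matters.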

Would it be easier to use C as a JIT initially?

● Often found myself making the generated LLVM code catch up to a reference C code.

● Could still "JIT" compile it in process by using llvm + libclang
● Gets you debug symbols for free (does it?)
● Gets you signed vs. unsigned, implicit conversions between integer widths.
● Can continue using existing inlined header-only utilities (e.g. ParseVarint32)
● Learn LLVM best practices from Clang.
● Is there an easy to use builder class for C code in Clang?
