Expression evaluation on protobuf data


Transcript of Expression evaluation on protobuf data

Page 1: Expression evaluation on protobuf data

Faster expression evaluation on Protobuf data

Oscar Moll, MIT DB Group. Work done as an intern at Google (Summer ‘16)

Page 2: Expression evaluation on protobuf data

Protocol buffers = data definition language + serialization spec + compiler + libraries

// Protobuf Data Definition

message Person {
  string name = 1;
  int32 id = 2;
  string email = 3;
}

// example C++ user code.
// Person is implemented as a combo of generated C++ and shared base classes.
Person john;

if (john.ParseFromString(data_buffer)) {
  id = john.id();
  name = john.name();
}
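For the other direction (not shown on the slide), the same generated class provides setters and SerializeToString; a minimal sketch, with made-up field values:

// Hedged sketch: producing data_buffer with the generated setters.
Person john;
john.set_name("John");
john.set_id(1234);
john.set_email("john@example.com");

std::string data_buffer;
if (!john.SerializeToString(&data_buffer)) {
  // handle serialization failure
}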

Page 3: Expression evaluation on protobuf data

Protocol buffer backward compatibility - Adding fields

// Protobuf Data Definition with a new field added

message Person {
  string name = 1;
  int32 id = 2;
  string email = 3;
  int32 new_field = 4;
}

// same old C++ user code using a stale data definition still works correctly.
// Person is implemented as a combo of generated C++ and shared base classes.
Person john;

if (john.ParseFromString(data_buffer)) {
  id = john.id();
  name = john.name();
}

Page 4: Expression evaluation on protobuf data

Protocol buffer backward compatibility - Deprecating fields

// Protobuf Data Definition removes some old field.

message Person {
  string name = 1;
  int32 id = 2;
  string email = 3;
  int32 new_field = 4;
}

// byte overhead depends only on the fields actually present

// same old C++ user code using a stale data definition still works correctly,
// even though email is missing.
Person john;

if (john.ParseFromString(data_buffer)) {
  id = john.id();
  name = john.name();
}

Page 5: Expression evaluation on protobuf data

Expression evaluation on Protobuf data

message TestProto {
  int32 x = 4;
  NestedProto bar = 6;
  int32 y = 7;
  ...
}

message NestedProto {
  int32 bx = 4;
  ...
}

Let foo be a TestProto.

We want to evaluate: {"foo.x > foo.bar.bx", "foo.y > foo.x"};

E.g. "x:1 bar { bx:10 } y:20" => {1 > 10, 20 > 1} => {false, true}

Page 6: Expression evaluation on protobuf data

TestProto::ParseFromString example (and baseline)

test::TestProto f;
auto b = f.ParseFromString(data);
if (!b) {
  e.Invalid("parse failed");
  return;
}

output_buffer[0] = f.x() > f.bar().bx();
output_buffer[1] = f.y() > f.x();

Type information received at runtime. Expression list also received at runtime.

ParseFromString:
❏ work proportional to every (nested) value in the protobuf
❏ also limited by the interface (e.g. error checking)
❏ allocates memory dynamically for every variable-sized value

message TestProto {
  int32 x = 4;
  NestedProto bar = 6;
  int32 y = 7;
  ...
}

message NestedProto {
  int32 bx = 4;
  ...
}

{"foo.x > foo.bar.bx", "foo.y > foo.x"};

Page 7: Expression evaluation on protobuf data

We would prefer to access only what we need from the serialized buffer.

Which fields are needed is expression dependent.

We can do this as follows...

Page 8: Expression evaluation on protobuf data

The protobuf wireformat is a concatenation of binary tag-value pairs

message TestProto {
  int32 x = 4;
  NestedProto bar = 6;
  int32 y = 7;
  ...
}

message NestedProto {
  int32 bx = 4;
  ...
}

// example data: x:1 bar { bx:10 } y:20

Wire layout (diagram): [tag][1]  [tag][<len>: [tag][10]]  [tag][20]

Last binding wins: must check every tag at the top level.
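As a concrete illustration (not on the slide), the example record fits in eight wire bytes; the tag values agree with the kTagX/kTagBar/kTagY constants used later in the deck:

// Worked example for "x:1 bar { bx:10 } y:20" under the TestProto definition above,
// following the standard wire rules (tag = field_label << 3 | wiretype).
const unsigned char kExampleWire[] = {
    0x20, 0x01,    // x:   tag (4 << 3 | 0) = 32, varint value 1
    0x32, 0x02,    // bar: tag (6 << 3 | 2) = 50, length 2
    0x20, 0x0A,    //   nested bx: tag 32, varint value 10
    0x38, 0x14,    // y:   tag (7 << 3 | 0) = 56, varint value 20
};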

Page 9: Expression evaluation on protobuf data

Implementing a general protobuf wireformat scanner

void GeneralProtobufScan(input, end, action_table) {
  while (input < end) {
    (field_label, type) = ParseTag(input, end);
    switch (action_table[field_label]) {
      case Action::kStore:
        switch (type) {
          // how to parse varints, nested messages ...
        }
      case Action::kSkip:
        switch (type) {
          // how to skip varints, nested messages ...
        }
      // Other actions (like counting, summing ...)
    }
  }
}

message TestProto {
  optional int32 x = 4;
  optional NestedProto bar = 6;
  optional int32 y = 7;
  ...
  // padding
  repeated NestedProto pad = 8;
}

message NestedProto {
  optional int32 bx = 4;
  ...
}

Action table:
(x)   4 => store
(bar) 6 => recurse
(y)   7 => store
(pad) 8 => skip

{"foo.x > foo.bar.bx", "foo.y > foo.x"};

Page 10: Expression evaluation on protobuf data

Fast expression evaluation via Scanning

# padding elts   ParseFromString* (ns)   per pad elt (ns)   Scanner (ns)   per pad elt (ns)   Speedup
0                72                      -                  33             -                  2.2
8                163                     11.4               53             2.5                3.1
64               535                     7.2                187            2.4                2.9
512              2578                    4.9                1197           2.3                2.2

message TestProto {
  optional int32 x = 4;
  optional NestedProto bar = 6;
  optional int32 y = 7;
  // vary the amount of padding, which we need to (partially) parse in order to skip
  repeated NestedProto pad = 8;
}

CPU: Intel Sandybridge (2x8 cores), 2600 MHz, dL1: 32KB, dL2: 256KB, dL3: 20MB. Measuring 1/throughput. *protoc set to optimize for speed.

{"foo.x > foo.bar.bx", "foo.y > foo.x"};

Page 11: Expression evaluation on protobuf data

Perf stat for fixed32 microbenchmark:

● 4.48 instructions per cycle (i.e. high utilization of the CPU)
● 30% of all instructions are branches (i.e. likely not work efficient)
● 0.03% mispredicted (i.e. the checks are predictable)
● At 2 GB/s: pretty good if the data comes from a 1-10 Gbit ethernet. But:
  ○ Far from per-core memory bandwidth (~7-12 GB/s)
  ○ A single core would have lower throughput than a single PCIe SSD (4 GB/s)
  ○ A single core would have lower throughput than a 40 Gbit ethernet link

Generic scanner efficiency: how much faster is still useful?

message TestProto {
  optional int32 x = 4;
  optional NestedProto bar = 6;
  optional int32 y = 7;
  // using fixed32 helps measure overhead
  repeated NestedProto → fixed32 pad = 8;
}

Page 12: Expression evaluation on protobuf data

The Protobuf wireformat is a concatenation of (compressed tag, compressed value) pairs

message TestProto {
  optional int32 x = 4;
  optional NestedProto bar = 6;
  optional int32 y = 7;
  ...
}

message NestedProto {
  optional int32 bx = 4;
  ...
}

Tag = Varint(field_label << 3 | type), followed by Encoded(Value)

// example data: x:1 bar { bx:10 } y:20
e.g. for bx:10 (field 4, wiretype 0): Tag = Varint(4 << 3 | 0) = 32, Value = Varint(10)

Varint encoding maps smaller integers to fewer bytes, with 1 bit/byte of metadata overhead.
  1 → 0b 0000 0001   // first byte intact
128 → 0b 1000 0000   // first 7 bits in the first byte
      0b 0000 0001   // 8th bit in the second byte
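For reference, a minimal varint32 decoder along these lines; this is a sketch, not the library's actual Varint::Parse:

#include <cstdint>

// Each byte carries 7 payload bits; a set high bit means "more bytes follow".
// Returns the position just past the varint, or nullptr on truncated/overlong input.
inline const uint8_t* ParseVarint32(const uint8_t* p, const uint8_t* end,
                                    uint32_t* out) {
  uint32_t result = 0;
  for (int shift = 0; p < end && shift < 35; shift += 7) {
    uint8_t byte = *p++;
    result |= uint32_t(byte & 0x7F) << shift;
    if ((byte & 0x80) == 0) {  // continuation bit clear: last byte
      *out = result;
      return p;
    }
  }
  return nullptr;
}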

Page 13: Expression evaluation on protobuf data

Overheads of generic scanning

void GeneralProtobufScan(input, end, action_table) {
  while (input < end) {
    tag = Varint::Parse(input, end);
    wiretype = tag & 0b111;
    field_label = tag >> 3;
    switch (action_table[field_label]) {
      case Action::kStore:
        switch (wiretype) {
          // parse varints, nested messages ...
        }
      case Action::kSkip:
        switch (wiretype) {
          // Amount of skip work varies per type
          len = Varint::Parse(input, end);
          input += len;  // actual work
        }
    }
  }
}

Page 14: Expression evaluation on protobuf data

Removing overhead via codegen

// kTagX = encode(4,0) = (4 << 3 | 0) = 32
if (input < end && *input == kTagX) {
  // parse this one and save it
}

// kTagBar = encode(6,2) = (6 << 3 | 2) = 50
if (input < end && *input == kTagBar) {
  // recurse to get bar.bx
}

// kTagY = encode(7,0) = (7 << 3 | 0) = 56
if (input < end && *input == kTagY) {
  // parse this one and save it
}

// kTagPad = encode(8,2) = (8 << 3 | 2) = 66
while (input < end && *input == kTagPad) {
  // ParseVarint len
  // skip
}

if (input == end) {  // yay
  return ErrorCode::kOk;
}

message TestProto {
  optional int32 x = 4;
  optional NestedProto bar = 6;
  optional int32 y = 7;
  repeated NestedProto pad = 8;
}

Wire layout (diagram): x  bar <len> ...  y  pad

Page 15: Expression evaluation on protobuf data

Handling unexpected inputs

if (input < end && *input == kTagX) {
  // parse this one and save it
}

if (input < end && *input == kTagBar) {
}

if (input < end && *input == kTagY) {
}

while (input < end && *input == kTagPad) {
}

if (input == end) {
} else {
  // unexpected tag: fall back to the general scanner
  return GeneralProtobufScan(input, end, actions);
}

message TestProto {
  optional int32 x = 4;
  optional int32 old = 5;
  optional NestedProto bar = 6;
  optional int32 y = 7;
  repeated NestedProto pad = 8;
}

Wire layout (diagram): x  bar <len> ...  y  old  pad

Page 16: Expression evaluation on protobuf data

Faster expression evaluation on protobuf via LLVM codegen

# padding elts   Scanner (ns)   per pad elt (ns)   Codegen (ns)   per pad elt (ns)   Speedup (over scanner)
0                33             -                  17             -                  1.9
8                53             2.5                22             0.63               2.4
64               187            2.4                61             0.69               3.1
512              1197           2.3                342            0.63               3.5

message TestProto {
  optional int32 x = 4;
  optional NestedProto bar = 6;
  optional int32 y = 7;
  // varying padding
  repeated NestedProto pad = 8;
}

{"foo.x > foo.bar.bx", "foo.y > foo.x"};

Page 17: Expression evaluation on protobuf data

Updated stats for fixed32 benchmark

The codegen'd scanner is no longer far from per-core memory bandwidth for this microbenchmark.

                           scanner    codegen
instructions per cycle     4.48 ipc   2.68 ipc
branches / instructions    30%        30%
branch misprediction       0.03%      0.06%
exercised bandwidth        2 GB/s     7 GB/s

Page 18: Expression evaluation on protobuf data

Summary

● Example of applying LLVM to a problem outside compilers
● In retrospect, some hallmarks of a good candidate problem for JITting:
  ○ Highly predictable paths once we know some more runtime information
  ○ E.g. mostly legal inputs, with stronger properties than the spec guarantees
  ○ Dealing correctly with input unpredictability by leveraging a fallback path
● It converts runtime tag decoding into a static number of compares (+ compile-time encoding)
● It inlines handlers
● It coalesces some length checks
● It achieves much lower latency and CPU cost than the C++ protobuf library or the standalone scanner (7x and 3.5x respectively) in our microbenchmarks

Page 19: Expression evaluation on protobuf data

My takeaways on using LLVM codegen as an outsider to compilers.

● Well documented instruction set and tutorial (Kaleidoscope)
● Builder API is very helpful (see the sketch after this list)
● Type checks on LLVM code are very useful
● Some hiccups with using some LLVM instructions (coming from C):
  ○ GEP, alloca (vs. scoped local variables), i1 vs. boolean (true < false in signed i1)
● Generating debug information requires learning a different API (didn't do it)
● Tracking signed vs. unsigned integers becomes my problem
● Can link to C code in the host process, but it is not obvious how to inline useful pre-existing routines written in C
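To give a flavor of the Builder API, here is a hedged sketch of emitting one tag check like the ones on the codegen slide; builder, ctx, func, and input are assumed to be an already-set-up llvm::IRBuilder, llvm::LLVMContext, the function under construction, and an i8* value, and the bounds check against end is omitted:

// Emit: if (*input == kTagX) goto parse_x; else goto after_x;   (kTagX = 32)
llvm::Value* tag_byte = builder.CreateLoad(builder.getInt8Ty(), input, "tag");
llvm::Value* is_tag_x = builder.CreateICmpEQ(tag_byte, builder.getInt8(32), "is_tag_x");

llvm::BasicBlock* parse_x = llvm::BasicBlock::Create(ctx, "parse_x", func);
llvm::BasicBlock* after_x = llvm::BasicBlock::Create(ctx, "after_x", func);
builder.CreateCondBr(is_tag_x, parse_x, after_x);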

Page 20: Expression evaluation on protobuf data

LLVM JIT library takeaways

● JIT latency overhead: for these microbenchmarks, ~3 ms for plain codegen + mem2reg, up to ~15 ms with extra passes such as function inlining and simplify-cfg.
  ○ 15 ms → the JIT would be the bottleneck for data below ~100 MB (at ~7 GB/s, scanning 100 MB takes about 14 ms)
● Hard for a beginner to know which passes should be run, and in what order (multiple times?) (a minimal pipeline sketch follows below)
● Can be hard to intuit when passes fail to optimize critical points (e.g. mem2reg: iteration variable not kept in a register)
● Very tempting to make the codegen phase generate inlined code where it matters (but harder to debug and maintain)
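As a point of reference for the pass-ordering question, a minimal function-pass pipeline in the style of the Kaleidoscope tutorial might look like the sketch below (legacy pass manager; header locations vary across LLVM versions, and module/func are assumed to already exist):

#include "llvm/IR/LegacyPassManager.h"
#include "llvm/Transforms/Scalar.h"
#include "llvm/Transforms/Utils.h"

llvm::legacy::FunctionPassManager fpm(module);        // module: llvm::Module*
fpm.add(llvm::createPromoteMemoryToRegisterPass());   // mem2reg
fpm.add(llvm::createCFGSimplificationPass());         // simplify-cfg
fpm.doInitialization();
fpm.run(*func);                                       // func: llvm::Function*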

Page 21: Expression evaluation on protobuf data

Would it be easier to use C as a JIT initially?

● Often found myself making the generated LLVM code catch up to a reference C implementation.
● Could still "JIT" compile C in process by using LLVM + libclang
● Gets you debug symbols for free (does it?)
● Gets you signed vs. unsigned and implicit conversions between integer widths
● Can continue using existing inlined header-only utilities (e.g. ParseVarint32)
● Learn LLVM best practices from Clang
● Is there an easy-to-use builder class for C code in Clang?