Story: writing a byte serializer in Go
@dxhuy @huydx
Byte serializer
Problem space
• Writing buffer-control code that needs to persist struct data to disk
• Struct data is simple (will not change in the near future)
• The program needs:
  • Low memory footprint
  • Low CPU usage
Bunches of options
• encoding/gob (based on encoding/binary)
• gogoprotobuf
• capnproto (glycerine/go-capnproto)
• ugorji/go/codec
• mgo.v2/bson
• .....
Some problems
• Some are overly complex
  • Cryptic error messages
• Some are fast, but do not support all data structures (map)
  • flatbuffers (could use a vector instead, but lookup is not O(1))
• All libraries add some abstraction, which makes them hard to debug
  • If a write to disk fails in the middle, some bytes are written, some are not
• Using a library means losing fine-grained control
  • You want special behaviour for some special fields
  • You want special behaviour when it fails
So let's write our own
Serializer anatomy
type A struct {
    Name     string
    BirthDay time.Time
    Phone    string
    Siblings int
    Spouse   bool
    Money    float64
}

Example struct
Struct layout matters: size + order

type A struct {
    Name     string
    BirthDay time.Time
    Phone    string
    Siblings int
    Spouse   bool
    Money    float64
}
In general, there are 2 types
• Dynamic layout: just pass the struct, the serializer does everything for you
  • encoding/gob, encoding/json
  • The library has to figure out the type first, then serialize
• Fixed layout: you have to tell the serializer about your struct first
  • protobuf, capnproto, messagepack...
  • The library already knows the type, and just uses code-gen to serialize
               Dynamic layout                     Fixed layout
Advantages     - Easy to use                      - Easy to optimize
               - Easy to support nested structs   - Manageable protocol file
               - No additional step                 (.proto or .flatbuffer)
Disadvantages  - Harder to optimize               - Needs code generation
               - Needs reflection
                 (performance downgrade)
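As a quick illustration of the dynamic-layout style, encoding/gob can round-trip the example struct with no schema file and no codegen (a minimal sketch; the sample field values are made up):

```go
package main

import (
	"bytes"
	"encoding/gob"
	"fmt"
	"time"
)

type A struct {
	Name     string
	BirthDay time.Time
	Phone    string
	Siblings int
	Spouse   bool
	Money    float64
}

// roundTrip encodes a with gob and decodes it back: gob discovers
// the layout via reflection at runtime, nothing is declared upfront.
func roundTrip(in A) A {
	var buf bytes.Buffer
	if err := gob.NewEncoder(&buf).Encode(in); err != nil {
		panic(err)
	}
	var out A
	if err := gob.NewDecoder(&buf).Decode(&out); err != nil {
		panic(err)
	}
	return out
}

func main() {
	out := roundTrip(A{Name: "huydx", Siblings: 2, Money: 1.5})
	fmt.Println(out.Name, out.Siblings, out.Money)
}
```

The convenience comes at the cost the table lists: reflection on every call, and no control over the wire layout.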
What should we use?
• What we have:
  • Fixed protocol
  • Need low memory footprint / low CPU usage
• So I decided on a serialization method that is:
  • Fixed layout in code
  • But without codegen
func MarshalRaw(a *A, buf *bytes.Buffer) {
	encodeString(a.Name, buf)
	encodeUint64(uint64(a.BirthDay.UnixNano()), buf)
	encodeString(a.Phone, buf)
	encodeUint64(uint64(a.Siblings), buf)
	encodeBool(a.Spouse, buf)
	encodeFloat64(a.Money, buf)
}
Your struct field order is fixed in code
[diagram: bytes laid out in field order — Name | BirthDay | Phone | ...]
Implementation
First try
• Use encoding/binary to convert each type to a byte array
• Write the byte array to the buffer
• For dynamically sized types (vector, map..)
  • Write the size first as an int, then write the payload
  • When decoding, read the size first, then read the payload
uint64

func encodeUint64(v uint64, w io.Writer) error {
	b := [64 / 8]byte{}
	binary.LittleEndian.PutUint64(b[:], v)
	_, err := w.Write(b[:])
	return err
}

func decodeUint64(r io.Reader) (uint64, error) {
	var l uint64
	err := binary.Read(r, binary.LittleEndian, &l)
	if err != nil {
		return 0, err
	}
	return l, nil
}
string

func encodeString(v string, w io.Writer) error {
	l := len(v)
	err := encodeUint16(uint16(l), w)
	if err != nil {
		return err
	}
	_, err = w.Write([]byte(v))
	return err
}

func decodeString(r io.Reader) (string, error) {
	l, err := decodeUint16(r)
	if err != nil {
		return "", err
	}
	b := make([]byte, l)
	// io.ReadFull: a plain r.Read may return fewer than l bytes
	_, err = io.ReadFull(r, b)
	return string(b), err
}
float64

func encodeFloat64(v float64, w io.Writer) error {
	b := [64 / 8]byte{}
	bs := math.Float64bits(v)
	binary.LittleEndian.PutUint64(b[:], bs)
	_, err := w.Write(b[:])
	return err
}

func decodeFloat64(r io.Reader) (float64, error) {
	var l uint64
	err := binary.Read(r, binary.LittleEndian, &l)
	if err != nil {
		return 0, err
	}
	return math.Float64frombits(l), nil
}
It's so simple And does nothing special
It must be fast!
Let's benchmark
• Using https://github.com/alecthomas/go_serialization_benchmarks
• Add our own serialization method and compare it with the others
  • Call it `raw`
• Let's see the result
BenchmarkMsgpMarshal-8                   10000000    161 ns/op    128 B/op    1 allocs/op
BenchmarkMsgpUnmarshal-8                  5000000    307 ns/op    112 B/op    3 allocs/op
BenchmarkVmihailencoMsgpackMarshal-8      1000000   1840 ns/op    368 B/op    6 allocs/op
BenchmarkVmihailencoMsgpackUnmarshal-8    1000000   1874 ns/op    384 B/op   13 allocs/op
BenchmarkRawMarshaller-8                  2000000    826 ns/op    384 B/op   13 allocs/op
BenchmarkRawUnmarshaller-8                2000000    710 ns/op    338 B/op   17 allocs/op
BenchmarkJsonMarshal-8                     500000   2804 ns/op   1232 B/op   10 allocs/op
BenchmarkJsonUnmarshal-8                   500000   2999 ns/op    464 B/op    7 allocs/op
BenchmarkEasyJsonMarshal-8                1000000   1223 ns/op    784 B/op    5 allocs/op
BenchmarkEasyJsonUnmarshal-8              1000000   1351 ns/op    160 B/op    4 allocs/op
BenchmarkBsonMarshal-8                    1000000   1405 ns/op    392 B/op   10 allocs/op
BenchmarkBsonUnmarshal-8                  1000000   1869 ns/op    248 B/op   21 allocs/op
BenchmarkGobMarshal-8                     2000000    903 ns/op     48 B/op    2 allocs/op
BenchmarkGobUnmarshal-8                   2000000    913 ns/op    112 B/op    3 allocs/op
BenchmarkXdrMarshal-8                     1000000   1553 ns/op    456 B/op   21 allocs/op
BenchmarkXdrUnmarshal-8                   1000000   1392 ns/op    240 B/op   11 allocs/op
BenchmarkUgorjiCodecMsgpackMarshal-8      1000000   2190 ns/op   2753 B/op    8 allocs/op
BenchmarkUgorjiCodecMsgpackUnmarshal-8     500000   2207 ns/op   3008 B/op    6 allocs/op
BenchmarkUgorjiCodecBincMarshal-8         1000000   2070 ns/op   2785 B/op    8 allocs/op
BenchmarkUgorjiCodecBincUnmarshal-8        500000   2386 ns/op   3168 B/op    9 allocs/op
BenchmarkSerealMarshal-8                   500000   2563 ns/op    912 B/op   21 allocs/op
BenchmarkSerealUnmarshal-8                 500000   3068 ns/op   1008 B/op   34 allocs/op
BenchmarkBinaryMarshal-8                  1000000   1221 ns/op    256 B/op   16 allocs/op
BenchmarkBinaryUnmarshal-8                1000000   1389 ns/op    335 B/op   22 allocs/op
BenchmarkFlatBuffersMarshal-8             5000000    345 ns/op      0 B/op    0 allocs/op
BenchmarkFlatBuffersUnmarshal-8           5000000    259 ns/op    112 B/op    3 allocs/op
BenchmarkCapNProtoMarshal-8               3000000    423 ns/op     56 B/op    2 allocs/op
BenchmarkCapNProtoUnmarshal-8             5000000    384 ns/op    200 B/op    6 allocs/op
BenchmarkCapNProto2Marshal-8              2000000    695 ns/op    244 B/op    3 allocs/op
BenchmarkCapNProto2Unmarshal-8            2000000    859 ns/op    320 B/op    6 allocs/op
BenchmarkHproseMarshal-8                  1000000   1033 ns/op    479 B/op    8 allocs/op
BenchmarkHproseUnmarshal-8                2000000   1028 ns/op    319 B/op   10 allocs/op
BenchmarkProtobufMarshal-8                2000000    885 ns/op    200 B/op    7 allocs/op
BenchmarkProtobufUnmarshal-8              2000000    641 ns/op    192 B/op   10 allocs/op
BenchmarkGoprotobufMarshal-8              3000000    447 ns/op    312 B/op    4 allocs/op
BenchmarkGoprotobufUnmarshal-8            3000000    592 ns/op    432 B/op    9 allocs/op
BenchmarkGogoprotobufMarshal-8           10000000    131 ns/op     64 B/op    1 allocs/op
BenchmarkGogoprotobufUnmarshal-8         10000000    222 ns/op     96 B/op    3 allocs/op
BenchmarkColferMarshal-8                 10000000    123 ns/op     64 B/op    1 allocs/op
BenchmarkColferUnmarshal-8               10000000    181 ns/op    112 B/op    3 allocs/op
BenchmarkGencodeMarshal-8                10000000    153 ns/op     80 B/op    2 allocs/op
BenchmarkGencodeUnmarshal-8              10000000    172 ns/op    112 B/op    3 allocs/op
BenchmarkGencodeUnsafeMarshal-8          20000000   98.2 ns/op     48 B/op    1 allocs/op
BenchmarkGencodeUnsafeUnmarshal-8        10000000    142 ns/op     96 B/op    3 allocs/op
BenchmarkXDR2Marshal-8                   10000000    151 ns/op     64 B/op    1 allocs/op
BenchmarkXDR2Unmarshal-8                 10000000    145 ns/op     32 B/op    2 allocs/op
BenchmarkGoAvroMarshal-8                   500000   2291 ns/op   1032 B/op   33 allocs/op
BenchmarkGoAvroUnmarshal-8                 300000   5388 ns/op   3440 B/op   89 allocs/op
Not bad
But about 10x slower than BenchmarkGencode

What did we do wrong?
Slow pattern
• Use GODEBUG=allocfreetrace=1 to find redundant allocation patterns
func encodeUint64(v uint64, w io.Writer) error {
	b := [64 / 8]byte{} // escapes to the heap: one allocation per call
	binary.LittleEndian.PutUint64(b[:], v)
	_, err := w.Write(b[:])
	return err
}

func encodeString(v string, w io.Writer) error {
	l := len(v)
	err := encodeUint16(uint16(l), w)
	if err != nil {
		return err
	}
	_, err = w.Write([]byte(v)) // []byte(v) allocates and copies every call
	return err
}
// what []byte(v) does under the hood (Go runtime internals)
func rawbyteslice(size int) (b []byte) {
	cap := roundupsize(uintptr(size))
	p := mallocgc(cap, nil, false)
	if cap != uintptr(size) {
		memclrNoHeapPointers(add(p, uintptr(size)), cap-uintptr(size))
	}
	*(*slice)(unsafe.Pointer(&b)) = slice{p, size, int(cap)}
	return
}
Slow pattern
• Took a look at some fast serializers
  • Just copying bytes around, no allocation
• And in our case, we write to the file right after encoding, so we do not need a buffer per serialization — we just need a global one
Second try
• Prepare a global buffer
  • Grow it if needed
  • Clear the buffer on each run
• Just copy bytes around, no more allocation
var bufferByte = make([]byte, DEFAULT_BUFFER_CAP)

func (rs Raw2Serializer) Marshal(o interface{}) []byte {
	a := o.(*A)
	cleanBuffer()
	idx := 0
	idx += WriteString(idx, a.Name)
	idx += WriteUint64(idx, uint64(a.BirthDay.UnixNano()))
	idx += WriteString(idx, a.Phone)
	idx += WriteUint64(idx, uint64(a.Siblings))
	idx += WriteBool(idx, a.Spouse)
	idx += WriteFloat64(idx, a.Money)
	// copy from a to bufferByte
	return bufferByte[0:idx]
}

Small difference: we need index control to know where to copy, and we need to clean the buffer on each run
func WriteUint64(idx int, n uint64) int {
	if (idx + 8) > currentCap {
		growBufferIfneeded()
	}
	for i := uint(0); i < 8; i++ {
		bufferByte[uint(idx)+i] = byte(n >> (i * 8))
	}
	return 8
}

func WriteString(idx int, s string) int {
	l := len(s)
	if (idx + 8 + l) > currentCap {
		growBufferIfneeded()
	}
	n := WriteUint64(idx, uint64(l))
	// NOTE: copy accepts a string directly, no []byte conversion needed
	copy(bufferByte[idx+n:idx+n+l], s)
	return l + n
}
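The helpers referenced above (cleanBuffer, growBufferIfneeded, currentCap) are not shown on the slides; here is a minimal sketch of what they might look like — the doubling policy and the DEFAULT_BUFFER_CAP value are assumptions:

```go
package main

import "fmt"

const DEFAULT_BUFFER_CAP = 512

var (
	bufferByte = make([]byte, DEFAULT_BUFFER_CAP)
	currentCap = DEFAULT_BUFFER_CAP
)

// cleanBuffer zeroes the shared buffer so stale bytes from the
// previous Marshal call cannot leak into the next one.
func cleanBuffer() {
	for i := range bufferByte {
		bufferByte[i] = 0
	}
}

// growBufferIfneeded doubles the shared buffer, copying existing
// contents; doubling amortizes the occasional reallocation cost
// across many Marshal calls.
func growBufferIfneeded() {
	next := make([]byte, currentCap*2)
	copy(next, bufferByte)
	bufferByte = next
	currentCap = len(next)
}

func main() {
	growBufferIfneeded()
	fmt.Println(currentCap) // 1024
}
```

Note the trade-off a single global buffer implies: Marshal is not safe for concurrent use, which is exactly the thread-safety pitfall called out in the lessons below.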
BenchmarkMsgpMarshal-8                   10000000    158 ns/op    128 B/op    1 allocs/op
BenchmarkMsgpUnmarshal-8                  5000000    335 ns/op    112 B/op    3 allocs/op
BenchmarkVmihailencoMsgpackMarshal-8      1000000   1764 ns/op    368 B/op    6 allocs/op
BenchmarkVmihailencoMsgpackUnmarshal-8    1000000   1779 ns/op    384 B/op   13 allocs/op
BenchmarkRawMarshaller-8                  2000000    806 ns/op    384 B/op   13 allocs/op
BenchmarkRawUnmarshaller-8                2000000    706 ns/op    338 B/op   17 allocs/op
BenchmarkRaw2Marshaller-8                20000000   83.6 ns/op      0 B/op    0 allocs/op
10 times faster! As fast as the fastest, andyleap/gencode
bench again
What I learned
• Hidden allocations reduce performance
• Serialize-to-file pitfalls:
  • Need a thread-safe implementation to prevent a dirty file
  • Need versioning (write the version first, then the payload) for backward compatibility
  • Checksums matter
    • You can calculate the checksum directly from the struct, no need to calculate it from the bytes
    • Use fnv to hash each field and add the results together, instead of using CRC32 over the whole byte array
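A sketch of that per-field checksum idea — hashing each field with FNV and summing, rather than CRC32 over the serialized bytes. The function name and the reduced field set are made up for illustration:

```go
package main

import (
	"fmt"
	"hash/fnv"
)

type A struct {
	Name  string
	Phone string
}

// fieldChecksum hashes each field independently with FNV-1a and adds
// the results together. Computing it straight from the struct avoids
// an extra pass over the serialized byte array just for the checksum.
func fieldChecksum(a *A) uint64 {
	var sum uint64
	for _, field := range []string{a.Name, a.Phone} {
		h := fnv.New64a()
		h.Write([]byte(field))
		sum += h.Sum64()
	}
	return sum
}

func main() {
	a := &A{Name: "huydx", Phone: "555-0100"}
	fmt.Printf("%#x\n", fieldChecksum(a))
}
```

Because addition is commutative, this particular combiner does not detect two fields swapping values; it trades a little strength for speed and simplicity.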
Interesting techniques of other serialization methods
varint (protobuf)
• Available in lots of software (protobuf, sqlite, WebAssembly (LLVM's LEB128), Go's encoding/binary)
• Compresses positive integers (negative numbers in 2's complement would take more bits)
• Idea:
  • most integers in our apps are small ("not very big")
  • use as few bits as possible
  • 7 bits per byte, MSB as the "continuation bit"
• Cons:
  • CPU cost
  • decoding is a bit complex
varint (protobuf)

t := uint64(l)
for t >= 0x80 {
	buf[i] = byte(t) | 0x80
	t >>= 7
	i++
}
buf[i] = byte(t)

Many variants exist (group varint encoding, prefix varint encoding, etc ...)
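A self-contained round-trip sketch of the 7-bits-per-byte scheme above (the standard library offers the same thing as binary.PutUvarint / binary.Uvarint):

```go
package main

import "fmt"

// putUvarint writes v into buf using 7 payload bits per byte, with the
// MSB set on every byte except the last (the "continuation bit").
// It returns the number of bytes written.
func putUvarint(buf []byte, v uint64) int {
	i := 0
	for v >= 0x80 {
		buf[i] = byte(v) | 0x80
		v >>= 7
		i++
	}
	buf[i] = byte(v)
	return i + 1
}

// uvarint decodes a value written by putUvarint, returning the value
// and the number of bytes consumed (0 if the input is truncated).
func uvarint(buf []byte) (uint64, int) {
	var v uint64
	var shift uint
	for i, b := range buf {
		if b < 0x80 {
			return v | uint64(b)<<shift, i + 1
		}
		v |= uint64(b&0x7f) << shift
		shift += 7
	}
	return 0, 0
}

func main() {
	buf := make([]byte, 10)
	n := putUvarint(buf, 300)
	fmt.Println(n, buf[:n]) // 2 [172 2]
}
```

300 fits in two bytes (0xAC 0x02) instead of the eight a fixed uint64 would take — the win when most values are small.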
zigzag encoding (protobuf)
• varint works well only with positive numbers
• Zigzag encoding encodes each negative number as the nearest (in absolute value) positive number:

  0            0
  -1           1
  1            2
  -2           3
  2147483647   4294967294
  -2147483648  4294967295

zigzag = (n << 1) ^ (n >> (BIT_WIDTH - 1))

Remember that arithmetic shift replicates the sign bit:
  n >> (BIT_WIDTH - 1) -> 11111...1 for negative n
  n >> (BIT_WIDTH - 1) -> 00000...0 for positive n
So XORing a negative n flips its leading 1s to 0s
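The formula above, written out for 32-bit values, together with its inverse:

```go
package main

import "fmt"

// zigzag maps signed to unsigned so small-magnitude values stay small:
// 0→0, -1→1, 1→2, -2→3, ... making them varint-friendly.
func zigzag(n int32) uint32 {
	return uint32((n << 1) ^ (n >> 31))
}

// unzigzag inverts the mapping: the low bit selects the sign,
// the remaining bits hold the magnitude.
func unzigzag(u uint32) int32 {
	return int32(u>>1) ^ -int32(u&1)
}

func main() {
	for _, n := range []int32{0, -1, 1, -2, 2147483647, -2147483648} {
		fmt.Println(n, "->", zigzag(n))
	}
}
```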
float reverse (gob)

// floatBits returns a uint64 holding the bits of a floating-point number.
// Floating-point numbers are transmitted as uint64s holding the bits
// of the underlying representation. They are sent byte-reversed, with
// the exponent end coming out first, so integer floating point numbers
// (for example) transmit more compactly. This routine does the
// swizzling.
func floatBits(f float64) uint64 {
	u := math.Float64bits(f)
	var v uint64
	for i := 0; i < 8; i++ {
		v <<= 8
		v |= u & 0xFF
		u >>= 8
	}
	return v
}
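To see why this helps: for integer-valued floats all the interesting bits sit at the exponent end, so after the byte reversal the uint64 is small and encodes in few bytes. A runnable sketch reproducing the swizzle:

```go
package main

import (
	"fmt"
	"math"
)

// floatBits reverses the byte order of a float64's bit pattern,
// as gob does before encoding the result as an integer.
func floatBits(f float64) uint64 {
	u := math.Float64bits(f)
	var v uint64
	for i := 0; i < 8; i++ {
		v <<= 8
		v |= u & 0xFF
		u >>= 8
	}
	return v
}

func main() {
	// 17.0 has bits 0x4031000000000000; reversed, only two low bytes remain.
	fmt.Printf("%#x\n", floatBits(17.0)) // 0x3140
}
```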
unsafe (andyleap/gencode)
v := *(*uint64)(unsafe.Pointer(&(d.Height)))
Unmarshal numbers without a copy or allocation
You can use the same technique for strings too: http://qiita.com/mattn/items/176459728ff4f854b165
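A sketch of the string version of the trick, relying on the string header being a prefix of the slice header. Genuinely unsafe: the returned string must not outlive mutations to the backing slice, and this layout assumption is not guaranteed by the Go spec:

```go
package main

import (
	"fmt"
	"unsafe"
)

// bytesToString reinterprets b's backing array as a string without
// copying. Mutating b afterwards breaks string immutability.
func bytesToString(b []byte) string {
	return *(*string)(unsafe.Pointer(&b))
}

func main() {
	b := []byte("hello")
	fmt.Println(bytesToString(b)) // hello
}
```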
Finally
• Writing your own serialization is not hard, and it's fun
• You can learn a lot from existing methods
• There are tons of techniques that can be used to enhance performance
• When you have no strong preference, use fixed-layout serialization:
  • version-controlled proto file
  • high performance