Story: writing a byte serializer in Go
@dxhuy @huydx
Byte serializer
Problem space
• Writing buffer-control code that needs to persist struct data to disk
• Struct data is simple (will not change in the near future)
• The program needs:
  • Low memory footprint
  • Low CPU usage
Bunches of options
• encoding/gob (based on encoding/binary)
• gogoprotobuf
• capnproto (glycerine/go-capnproto)
• ugorji/go/codec
• mgo.v2/bson
• .....
Some problems
• Some are overly complex
  • Cryptic error messages
• Some are fast, but do not support all data structures (map)
  • flatbuffers (could use a vector instead, but lookup is not O(1))
• All libraries add some abstraction, which makes them hard to debug
  • If a write to disk fails in the middle, some bytes are written, some are not
• Using a library means losing fine-grained control
  • You want special behaviour for some special fields
  • You want special behaviour when it fails
So let's write our own
Serializer anatomy
type A struct {
    Name     string
    BirthDay time.Time
    Phone    string
    Siblings int
    Spouse   bool
    Money    float64
}

Example struct
Struct layout matters: size + order

type A struct {
    Name     string
    BirthDay time.Time
    Phone    string
    Siblings int
    Spouse   bool
    Money    float64
}
In general, there are 2 types
• Dynamic layout: just pass the struct, the serializer does everything for you
  • encoding/gob, encoding/json
  • The library has to figure out the type first, then serialize
• Fixed layout: you have to tell the serializer about your struct first
  • protobuf, capnproto, messagepack...
  • The library already knows the type, and just uses code-gen to serialize
               Dynamic layout                     Fixed layout
Advantages     - Easy to use                      - Easy to optimize
               - Easy to support nested structs   - Manageable protocol file
               - No additional step                 (.proto or .flatbuffer)
Disadvantages  - Harder to optimize               - Needs code generation
               - Needs reflection
                 (performance downgrade)
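As a quick illustration of the dynamic-layout style, encoding/gob can round-trip the example struct with no schema file and no codegen (a minimal sketch; the sample field values are made up):

```go
package main

import (
	"bytes"
	"encoding/gob"
	"fmt"
	"time"
)

type A struct {
	Name     string
	BirthDay time.Time
	Phone    string
	Siblings int
	Spouse   bool
	Money    float64
}

// roundTrip encodes a with gob and decodes it back: gob discovers
// the layout via reflection at runtime, nothing is declared upfront.
func roundTrip(in A) A {
	var buf bytes.Buffer
	if err := gob.NewEncoder(&buf).Encode(in); err != nil {
		panic(err)
	}
	var out A
	if err := gob.NewDecoder(&buf).Decode(&out); err != nil {
		panic(err)
	}
	return out
}

func main() {
	out := roundTrip(A{Name: "huydx", Siblings: 2, Money: 1.5})
	fmt.Println(out.Name, out.Siblings, out.Money)
}
```

The convenience comes at the cost the table lists: reflection on every call, and no control over the wire layout.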
What should we use?
• What we have:
  • Fixed protocol
  • Need low memory footprint / low CPU usage
• So I decided on a serialization method that is:
  • Fixed layout in code
  • But without codegen
func MarshalRaw(a *A, buf *bytes.Buffer) {
	encodeString(a.Name, buf)
	encodeUint64(uint64(a.BirthDay.UnixNano()), buf)
	encodeString(a.Phone, buf)
	encodeUint64(uint64(a.Siblings), buf)
	encodeBool(a.Spouse, buf)
	encodeFloat64(a.Money, buf)
}
Your struct field order is fixed in code
[diagram: bytes laid out in field order — Name | BirthDay | Phone | ...]
Implementation
First try
• Use encoding/binary to convert each type to a byte array
• Write the byte array to the buffer
• For dynamically sized types (vector, map..)
  • Write the size first as an int, then write the payload
  • When decoding, read the size first, then read the payload
uint64

func encodeUint64(v uint64, w io.Writer) error {
	b := [64 / 8]byte{}
	binary.LittleEndian.PutUint64(b[:], v)
	_, err := w.Write(b[:])
	return err
}

func decodeUint64(r io.Reader) (uint64, error) {
	var l uint64
	err := binary.Read(r, binary.LittleEndian, &l)
	if err != nil {
		return 0, err
	}
	return l, nil
}
string

func encodeString(v string, w io.Writer) error {
	l := len(v)
	err := encodeUint16(uint16(l), w)
	if err != nil {
		return err
	}
	_, err = w.Write([]byte(v))
	return err
}

func decodeString(r io.Reader) (string, error) {
	l, err := decodeUint16(r)
	if err != nil {
		return "", err
	}
	b := make([]byte, l)
	// io.ReadFull: a plain r.Read may return fewer than l bytes
	_, err = io.ReadFull(r, b)
	return string(b), err
}
float64

func encodeFloat64(v float64, w io.Writer) error {
	b := [64 / 8]byte{}
	bs := math.Float64bits(v)
	binary.LittleEndian.PutUint64(b[:], bs)
	_, err := w.Write(b[:])
	return err
}

func decodeFloat64(r io.Reader) (float64, error) {
	var l uint64
	err := binary.Read(r, binary.LittleEndian, &l)
	if err != nil {
		return 0, err
	}
	return math.Float64frombits(l), nil
}
It's so simple And does nothing special
It must be fast!
Let's benchmark
• Using https://github.com/alecthomas/go_serialization_benchmarks
• Add our own serialization method and compare it with the others
  • Call it `raw`
• Let's see the result
BenchmarkMsgpMarshal-8                   10000000    161 ns/op    128 B/op    1 allocs/op
BenchmarkMsgpUnmarshal-8                  5000000    307 ns/op    112 B/op    3 allocs/op
BenchmarkVmihailencoMsgpackMarshal-8      1000000   1840 ns/op    368 B/op    6 allocs/op
BenchmarkVmihailencoMsgpackUnmarshal-8    1000000   1874 ns/op    384 B/op   13 allocs/op
BenchmarkRawMarshaller-8                  2000000    826 ns/op    384 B/op   13 allocs/op
BenchmarkRawUnmarshaller-8                2000000    710 ns/op    338 B/op   17 allocs/op
BenchmarkJsonMarshal-8                     500000   2804 ns/op   1232 B/op   10 allocs/op
BenchmarkJsonUnmarshal-8                   500000   2999 ns/op    464 B/op    7 allocs/op
BenchmarkEasyJsonMarshal-8                1000000   1223 ns/op    784 B/op    5 allocs/op
BenchmarkEasyJsonUnmarshal-8              1000000   1351 ns/op    160 B/op    4 allocs/op
BenchmarkBsonMarshal-8                    1000000   1405 ns/op    392 B/op   10 allocs/op
BenchmarkBsonUnmarshal-8                  1000000   1869 ns/op    248 B/op   21 allocs/op
BenchmarkGobMarshal-8                     2000000    903 ns/op     48 B/op    2 allocs/op
BenchmarkGobUnmarshal-8                   2000000    913 ns/op    112 B/op    3 allocs/op
BenchmarkXdrMarshal-8                     1000000   1553 ns/op    456 B/op   21 allocs/op
BenchmarkXdrUnmarshal-8                   1000000   1392 ns/op    240 B/op   11 allocs/op
BenchmarkUgorjiCodecMsgpackMarshal-8      1000000   2190 ns/op   2753 B/op    8 allocs/op
BenchmarkUgorjiCodecMsgpackUnmarshal-8     500000   2207 ns/op   3008 B/op    6 allocs/op
BenchmarkUgorjiCodecBincMarshal-8         1000000   2070 ns/op   2785 B/op    8 allocs/op
BenchmarkUgorjiCodecBincUnmarshal-8        500000   2386 ns/op   3168 B/op    9 allocs/op
BenchmarkSerealMarshal-8                   500000   2563 ns/op    912 B/op   21 allocs/op
BenchmarkSerealUnmarshal-8                 500000   3068 ns/op   1008 B/op   34 allocs/op
BenchmarkBinaryMarshal-8                  1000000   1221 ns/op    256 B/op   16 allocs/op
BenchmarkBinaryUnmarshal-8                1000000   1389 ns/op    335 B/op   22 allocs/op
BenchmarkFlatBuffersMarshal-8             5000000    345 ns/op      0 B/op    0 allocs/op
BenchmarkFlatBuffersUnmarshal-8           5000000    259 ns/op    112 B/op    3 allocs/op
BenchmarkCapNProtoMarshal-8               3000000    423 ns/op     56 B/op    2 allocs/op
BenchmarkCapNProtoUnmarshal-8             5000000    384 ns/op    200 B/op    6 allocs/op
BenchmarkCapNProto2Marshal-8              2000000    695 ns/op    244 B/op    3 allocs/op
BenchmarkCapNProto2Unmarshal-8            2000000    859 ns/op    320 B/op    6 allocs/op
BenchmarkHproseMarshal-8                  1000000   1033 ns/op    479 B/op    8 allocs/op
BenchmarkHproseUnmarshal-8                2000000   1028 ns/op    319 B/op   10 allocs/op
BenchmarkProtobufMarshal-8                2000000    885 ns/op    200 B/op    7 allocs/op
BenchmarkProtobufUnmarshal-8              2000000    641 ns/op    192 B/op   10 allocs/op
BenchmarkGoprotobufMarshal-8              3000000    447 ns/op    312 B/op    4 allocs/op
BenchmarkGoprotobufUnmarshal-8            3000000    592 ns/op    432 B/op    9 allocs/op
BenchmarkGogoprotobufMarshal-8           10000000    131 ns/op     64 B/op    1 allocs/op
BenchmarkGogoprotobufUnmarshal-8         10000000    222 ns/op     96 B/op    3 allocs/op
BenchmarkColferMarshal-8                 10000000    123 ns/op     64 B/op    1 allocs/op
BenchmarkColferUnmarshal-8               10000000    181 ns/op    112 B/op    3 allocs/op
BenchmarkGencodeMarshal-8                10000000    153 ns/op     80 B/op    2 allocs/op
BenchmarkGencodeUnmarshal-8              10000000    172 ns/op    112 B/op    3 allocs/op
BenchmarkGencodeUnsafeMarshal-8          20000000   98.2 ns/op     48 B/op    1 allocs/op
BenchmarkGencodeUnsafeUnmarshal-8        10000000    142 ns/op     96 B/op    3 allocs/op
BenchmarkXDR2Marshal-8                   10000000    151 ns/op     64 B/op    1 allocs/op
BenchmarkXDR2Unmarshal-8                 10000000    145 ns/op     32 B/op    2 allocs/op
BenchmarkGoAvroMarshal-8                   500000   2291 ns/op   1032 B/op   33 allocs/op
BenchmarkGoAvroUnmarshal-8                 300000   5388 ns/op   3440 B/op   89 allocs/op
Not bad
But about 10x slower than BenchmarkGencode

What did we do wrong?
Slow pattern
• Use GODEBUG=allocfreetrace=1 to find redundant allocation patterns
func encodeUint64(v uint64, w io.Writer) error {
	b := [64 / 8]byte{} // escapes to the heap: one allocation per call
	binary.LittleEndian.PutUint64(b[:], v)
	_, err := w.Write(b[:])
	return err
}

func encodeString(v string, w io.Writer) error {
	l := len(v)
	err := encodeUint16(uint16(l), w)
	if err != nil {
		return err
	}
	_, err = w.Write([]byte(v)) // []byte(v) allocates and copies every call
	return err
}
// what []byte(v) does under the hood (Go runtime internals)
func rawbyteslice(size int) (b []byte) {
	cap := roundupsize(uintptr(size))
	p := mallocgc(cap, nil, false)
	if cap != uintptr(size) {
		memclrNoHeapPointers(add(p, uintptr(size)), cap-uintptr(size))
	}
	*(*slice)(unsafe.Pointer(&b)) = slice{p, size, int(cap)}
	return
}
Slow pattern
• Took a look at some fast serializers
  • Just copying bytes around, no allocation
• And in our case, we write to the file right after encoding, so we do not need a buffer per serialization — we just need a global one
Second try
• Prepare a global buffer
  • Grow it if needed
  • Clear the buffer on each run
• Just copy bytes around, no more allocation
var bufferByte = make([]byte, DEFAULT_BUFFER_CAP)

func (rs Raw2Serializer) Marshal(o interface{}) []byte {
	a := o.(*A)
	cleanBuffer()
	idx := 0
	idx += WriteString(idx, a.Name)
	idx += WriteUint64(idx, uint64(a.BirthDay.UnixNano()))
	idx += WriteString(idx, a.Phone)
	idx += WriteUint64(idx, uint64(a.Siblings))
	idx += WriteBool(idx, a.Spouse)
	idx += WriteFloat64(idx, a.Money)
	// copy from a to bufferByte
	return bufferByte[0:idx]
}

Small difference: we need index control to know where to copy, and we need to clean the buffer on each run
func WriteUint64(idx int, n uint64) int {
	if (idx + 8) > currentCap {
		growBufferIfneeded()
	}
	for i := uint(0); i < 8; i++ {
		bufferByte[uint(idx)+i] = byte(n >> (i * 8))
	}
	return 8
}

func WriteString(idx int, s string) int {
	l := len(s)
	if (idx + 8 + l) > currentCap {
		growBufferIfneeded()
	}
	n := WriteUint64(idx, uint64(l))
	// NOTE: copy accepts a string directly, no []byte conversion needed
	copy(bufferByte[idx+n:idx+n+l], s)
	return l + n
}
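The helpers referenced above (cleanBuffer, growBufferIfneeded, currentCap) are not shown on the slides; here is a minimal sketch of what they might look like — the doubling policy and the DEFAULT_BUFFER_CAP value are assumptions:

```go
package main

import "fmt"

const DEFAULT_BUFFER_CAP = 512

var (
	bufferByte = make([]byte, DEFAULT_BUFFER_CAP)
	currentCap = DEFAULT_BUFFER_CAP
)

// cleanBuffer zeroes the shared buffer so stale bytes from the
// previous Marshal call cannot leak into the next one.
func cleanBuffer() {
	for i := range bufferByte {
		bufferByte[i] = 0
	}
}

// growBufferIfneeded doubles the shared buffer, copying existing
// contents; doubling amortizes the occasional reallocation cost
// across many Marshal calls.
func growBufferIfneeded() {
	next := make([]byte, currentCap*2)
	copy(next, bufferByte)
	bufferByte = next
	currentCap = len(next)
}

func main() {
	growBufferIfneeded()
	fmt.Println(currentCap) // 1024
}
```

Note the trade-off a single global buffer implies: Marshal is not safe for concurrent use, which is exactly the thread-safety pitfall called out in the lessons below.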
BenchmarkMsgpMarshal-8                   10000000    158 ns/op    128 B/op    1 allocs/op
BenchmarkMsgpUnmarshal-8                  5000000    335 ns/op    112 B/op    3 allocs/op
BenchmarkVmihailencoMsgpackMarshal-8      1000000   1764 ns/op    368 B/op    6 allocs/op
BenchmarkVmihailencoMsgpackUnmarshal-8    1000000   1779 ns/op    384 B/op   13 allocs/op
BenchmarkRawMarshaller-8                  2000000    806 ns/op    384 B/op   13 allocs/op
BenchmarkRawUnmarshaller-8                2000000    706 ns/op    338 B/op   17 allocs/op
BenchmarkRaw2Marshaller-8                20000000   83.6 ns/op      0 B/op    0 allocs/op
10 times faster! As fast as the fastest, andyleap/gencode
bench again
What I learned
• Hidden allocations reduce performance
• Serialize-to-file pitfalls:
  • Need a thread-safe implementation to prevent a dirty file
  • Need versioning (write the version first, then the payload) for backward compatibility
  • Checksums matter
    • You can calculate the checksum directly from the struct, no need to calculate it from the bytes
    • Use fnv to hash each field and add the results together, instead of using CRC32 over the whole byte array
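A sketch of that per-field checksum idea — hashing each field with FNV and summing, rather than CRC32 over the serialized bytes. The function name and the reduced field set are made up for illustration:

```go
package main

import (
	"fmt"
	"hash/fnv"
)

type A struct {
	Name  string
	Phone string
}

// fieldChecksum hashes each field independently with FNV-1a and adds
// the results together. Computing it straight from the struct avoids
// an extra pass over the serialized byte array just for the checksum.
func fieldChecksum(a *A) uint64 {
	var sum uint64
	for _, field := range []string{a.Name, a.Phone} {
		h := fnv.New64a()
		h.Write([]byte(field))
		sum += h.Sum64()
	}
	return sum
}

func main() {
	a := &A{Name: "huydx", Phone: "555-0100"}
	fmt.Printf("%#x\n", fieldChecksum(a))
}
```

Because addition is commutative, this particular combiner does not detect two fields swapping values; it trades a little strength for speed and simplicity.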
Interesting techniques of other serialization methods
varint (protobuf)
• Available in lots of software (protobuf, sqlite, WebAssembly (LLVM's LEB128), Go's encoding/binary)
• Compresses positive integers (negative numbers in 2's complement would take more bits)
• Idea:
  • most integers in our apps are small ("not very big")
  • use as few bits as possible
  • 7 bits per byte, MSB as the "continuation bit"
• Cons:
  • CPU cost
  • decoding is a bit complex
varint (protobuf)

t := uint64(l)
for t >= 0x80 {
	buf[i] = byte(t) | 0x80
	t >>= 7
	i++
}
buf[i] = byte(t)

Many variants exist (group varint encoding, prefix varint encoding, etc ...)
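A self-contained round-trip sketch of the 7-bits-per-byte scheme above (the standard library offers the same thing as binary.PutUvarint / binary.Uvarint):

```go
package main

import "fmt"

// putUvarint writes v into buf using 7 payload bits per byte, with the
// MSB set on every byte except the last (the "continuation bit").
// It returns the number of bytes written.
func putUvarint(buf []byte, v uint64) int {
	i := 0
	for v >= 0x80 {
		buf[i] = byte(v) | 0x80
		v >>= 7
		i++
	}
	buf[i] = byte(v)
	return i + 1
}

// uvarint decodes a value written by putUvarint, returning the value
// and the number of bytes consumed (0 if the input is truncated).
func uvarint(buf []byte) (uint64, int) {
	var v uint64
	var shift uint
	for i, b := range buf {
		if b < 0x80 {
			return v | uint64(b)<<shift, i + 1
		}
		v |= uint64(b&0x7f) << shift
		shift += 7
	}
	return 0, 0
}

func main() {
	buf := make([]byte, 10)
	n := putUvarint(buf, 300)
	fmt.Println(n, buf[:n]) // 2 [172 2]
}
```

300 fits in two bytes (0xAC 0x02) instead of the eight a fixed uint64 would take — the win when most values are small.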
zigzag encoding (protobuf)
• varint works well only with positive numbers
• Zigzag encoding encodes each negative number as the nearest (in absolute value) positive number:

  0            0
  -1           1
  1            2
  -2           3
  2147483647   4294967294
  -2147483648  4294967295

zigzag = (n << 1) ^ (n >> (BIT_WIDTH - 1))

Remember that arithmetic shift replicates the sign bit:
  n >> (BIT_WIDTH - 1) -> 11111...1 for negative n
  n >> (BIT_WIDTH - 1) -> 00000...0 for positive n
So XORing a negative n flips its leading 1s to 0s
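The formula above, written out for 32-bit values, together with its inverse:

```go
package main

import "fmt"

// zigzag maps signed to unsigned so small-magnitude values stay small:
// 0→0, -1→1, 1→2, -2→3, ... making them varint-friendly.
func zigzag(n int32) uint32 {
	return uint32((n << 1) ^ (n >> 31))
}

// unzigzag inverts the mapping: the low bit selects the sign,
// the remaining bits hold the magnitude.
func unzigzag(u uint32) int32 {
	return int32(u>>1) ^ -int32(u&1)
}

func main() {
	for _, n := range []int32{0, -1, 1, -2, 2147483647, -2147483648} {
		fmt.Println(n, "->", zigzag(n))
	}
}
```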
float reverse (gob)

// floatBits returns a uint64 holding the bits of a floating-point number.
// Floating-point numbers are transmitted as uint64s holding the bits
// of the underlying representation. They are sent byte-reversed, with
// the exponent end coming out first, so integer floating point numbers
// (for example) transmit more compactly. This routine does the
// swizzling.
func floatBits(f float64) uint64 {
	u := math.Float64bits(f)
	var v uint64
	for i := 0; i < 8; i++ {
		v <<= 8
		v |= u & 0xFF
		u >>= 8
	}
	return v
}
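To see why this helps: for integer-valued floats all the interesting bits sit at the exponent end, so after the byte reversal the uint64 is small and encodes in few bytes. A runnable sketch reproducing the swizzle:

```go
package main

import (
	"fmt"
	"math"
)

// floatBits reverses the byte order of a float64's bit pattern,
// as gob does before encoding the result as an integer.
func floatBits(f float64) uint64 {
	u := math.Float64bits(f)
	var v uint64
	for i := 0; i < 8; i++ {
		v <<= 8
		v |= u & 0xFF
		u >>= 8
	}
	return v
}

func main() {
	// 17.0 has bits 0x4031000000000000; reversed, only two low bytes remain.
	fmt.Printf("%#x\n", floatBits(17.0)) // 0x3140
}
```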
unsafe (andyleap/gencode)
v := *(*uint64)(unsafe.Pointer(&(d.Height)))
Unmarshal numbers without a copy or allocation
You can use the same technique for strings too: http://qiita.com/mattn/items/176459728ff4f854b165
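A sketch of the string version of the trick, relying on the string header being a prefix of the slice header. Genuinely unsafe: the returned string must not outlive mutations to the backing slice, and this layout assumption is not guaranteed by the Go spec:

```go
package main

import (
	"fmt"
	"unsafe"
)

// bytesToString reinterprets b's backing array as a string without
// copying. Mutating b afterwards breaks string immutability.
func bytesToString(b []byte) string {
	return *(*string)(unsafe.Pointer(&b))
}

func main() {
	b := []byte("hello")
	fmt.Println(bytesToString(b)) // hello
}
```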
Finally
• Writing your own serialization is not hard, and it's fun
• You can learn a lot from existing methods
• There are tons of techniques that can be used to enhance performance
• When you have no strong preference, use fixed-layout serialization:
  • version-controlled proto file
  • high performance