Post on 16-Jul-2015
Protocol Buffers Overview
Fabrício Epaminondas - @fabricioepa
Senior Software Engineer, Signove
About me
BSc in Computer Science at Federal University of Campina Grande, UFCG.
Recent activities
• Implementation of IEEE Data Exchange Protocol 11073 part 20601
• Data modeling for Bluetooth services
• Data synchronization using REST services
Agenda
Background
What are Protocol Buffers?
How do they work?
Why use Protocol Buffers?
Techniques
Questions
Quick Links
Background
Data Formats in Information Technology
• Typing/interpretation, transmission, storage
Popular data formats...
CSV
• Simple to read/write by applications
• Tabular data structure
• Flat
• No validation
Name, Age, Phone
Fabricio, 26, +558388000000
Kaka, 28, +558388000001
Cafu, 40, +558388000002
Pele, 70, +558388000003
XML
• Markup language for Documents
• Hierarchical structure
• Data validation
• A common standard with great acceptance
<person>
<name>Fabricio</name>
<age>26</age>
<contacts>
<email>
my@email.com
</email>
<phone>999</phone>
</contacts>
</person>
JSON
• Lightweight data-interchange format
• Browser support
• Alternative to XML
{
  "name": "Fabricio",
  "age": 26,
  "contacts": {
    "email": "my@email.com",
    "phone": "999"
  }
}
Comparison
CSV vs. XML vs. JSON, against four criteria:
• Parsing efficiency
• Reusability / model update
• Hierarchical structure
• Small size
Google's Data Interchange
Requirements
We use literally thousands of different data formats to represent:
• networked messages between servers
• index records in repositories
• geospatial datasets
Most of these formats are structured, not flat. This raises an important question…
How do we encode it all?
Requirements:
Hierarchical data structure
Small data size
Parsing performance
Model update: add/ignore fields, modify parser code...
Backwards compatible
What are Protocol Buffers?
A language-neutral, platform-neutral, extensible way of serializing structured data for use in communications protocols, data storage, and more.
It was initially developed at Google to deal with an index server request/response protocol.
How do they work?
You define your structured data format in a descriptor (.proto) file.
The compiler generates source code to easily write and read your structured data to and from a variety of data streams, in a variety of languages.
You can even update your data structure without breaking deployed programs that are compiled against the "old" format.
Writing some code…
.proto:

message Person {
  required string name = 1;
  required int32 id = 2;
  optional string email = 3;

  enum PhoneType {
    MOBILE = 0;
    HOME = 1;
    WORK = 2;
  }

  message PhoneNumber {
    required string number = 1;
    optional PhoneType type = 2 [default = HOME];
  }

  repeated PhoneNumber phone = 4;
}

C++:

// Write
Person person;
person.set_name("John Doe");
person.set_id(1234);
person.set_email("jdoe@example.com");
fstream output("myfile", ios::out | ios::binary);
person.SerializeToOstream(&output);

// Read
fstream input("myfile", ios::in | ios::binary);
Person person;
person.ParseFromIstream(&input);
cout << "Name: " << person.name() << endl;
cout << "E-mail: " << person.email() << endl;
Generated code
Messages
• Immutable
Builders
Enums and Nested Classes
• C++: Person::MOBILE
• Java: Person.PhoneType.MOBILE
Parsing and Serialization
Why use Protocol Buffers?
One of Protocol Buffers’ major design goals is simplicity.
Protocol buffers are a flexible, efficient format.
PB are 3 to 10 times smaller than XML
PB are 20 to 100 times faster than XML
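One reason for the small size is the binary wire format: scalar fields are encoded as base-128 varints, so small numbers cost a single byte instead of markup text. A minimal sketch of the varint encoding (illustrative, not the real library code):

```cpp
#include <cstdint>
#include <vector>

// Base-128 varint: 7 payload bits per byte, continuation bit (0x80)
// set on every byte except the last. Small values need fewer bytes.
std::vector<uint8_t> EncodeVarint(uint64_t value) {
  std::vector<uint8_t> out;
  while (value >= 0x80) {
    out.push_back(static_cast<uint8_t>(value) | 0x80);
    value >>= 7;
  }
  out.push_back(static_cast<uint8_t>(value));
  return out;
}
```

For example, the value 300 encodes to just two bytes (0xAC 0x02), while the same number in an XML element such as <age>300</age> costs 14 bytes of text.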
Comparison
CSV vs. XML vs. JSON vs. PB, against the same four criteria:
• Parsing efficiency
• Reusability / model update
• Hierarchical structure
• Small size
Why use Protocol Buffers?
Using object serialization (as in Java) causes interoperability problems.
In C/C++, raw in-memory data structures can be sent/saved in binary form, but this approach is hard to extend.
Alternatives
Thrift
ASN.1
Java Externalizable
Other IDLs...
• WSDL, XSD, XML
• CORBA, Java-IDL, etc.
Techniques
Backward/Forward compatibility
Updating Message Types
O-O Design
Backward/Forward compatibility
You must not change the tag numbers of any existing fields.
You must not add or delete any required fields.
Consider writing application-specific custom validation routines instead of required fields
You may delete optional or repeated fields.
You may add new optional or repeated fields but you must use fresh tag numbers…
(i.e. tag numbers that were never used in this protocol buffer, not even by deleted fields).
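The rules above can be sketched with a hypothetical schema evolution (the field names here are illustrative, not from the deck's example):

```proto
// Version 1
message Person {
  required string name = 1;
  optional string email = 2;
}

// Version 2: a new optional field with a fresh tag number (3).
// Tags 1 and 2 are untouched, and no required field was added
// or removed, so old and new code interoperate.
message Person {
  required string name = 1;
  optional string email = 2;
  optional string nickname = 3;
}
```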
Backward/Forward compatibility
Old code will simply ignore new fields; for deleted fields it will read default values.
Unknown fields are not discarded, and if the message is later serialized, the unknown fields are serialized along with it
Changing a default value is generally OK, but remember default values are never sent over the wire
The receiver will NOT see the default value that was defined in the sender's code.
New code will also transparently read old messages
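This ignore-and-preserve behavior falls out of the wire format: every field is preceded by a key packing the tag number and a wire type, and the wire type alone tells a parser how many bytes to skip for a field it does not recognize. A sketch of the key layout (bit layout as described in the protobuf encoding docs):

```cpp
#include <cstdint>

// Field key = (tag_number << 3) | wire_type.
// Wire types: 0 = varint, 1 = 64-bit, 2 = length-delimited, 5 = 32-bit.
// An old parser can skip an unknown field because the wire type says
// how to find its end, without understanding the field itself.
constexpr uint32_t MakeKey(uint32_t tag, uint32_t wire_type) {
  return (tag << 3) | wire_type;
}
constexpr uint32_t TagNumber(uint32_t key) { return key >> 3; }
constexpr uint32_t WireType(uint32_t key) { return key & 0x7; }
```

For instance, the key for `required string name = 1` is 0x0A: tag 1, wire type 2 (length-delimited).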
Updating Message Types
Don't change the numeric tags for any existing fields.
Although non-required fields can be removed, it is better to rename the field to something like “DEPRECATED_...” instead.
int32, uint32, int64, uint64, and bool are all compatible: changing a field among these types does not break forward or backward compatibility.
string and bytes are compatible as long as the bytes are valid UTF-8.
More issues are covered in the protobuf manual.
O-O Design
Generated source code of message objects should not be modified
Use wrappers to encapsulate messages
Do not inherit from message objects
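A minimal sketch of the wrapper idea. The Person class below is a simplified hand-written stand-in for the generated message (in real code it would come from person.pb.h), and PersonModel is a hypothetical wrapper name:

```cpp
#include <string>
#include <utility>

// Stand-in for the generated Person message (normally generated by
// protoc into person.pb.h); simplified so this sketch is self-contained.
class Person {
 public:
  const std::string& name() const { return name_; }
  void set_name(const std::string& n) { name_ = n; }
 private:
  std::string name_;
};

// Wrapper: application logic lives here, never inside (or inherited
// from) the generated class. Composition keeps the message field easy
// to swap out when the schema is regenerated.
class PersonModel {
 public:
  explicit PersonModel(Person msg) : msg_(std::move(msg)) {}
  // Domain-level validation that the schema alone cannot express.
  bool IsValid() const { return !msg_.name().empty(); }
  const Person& message() const { return msg_; }
 private:
  Person msg_;
};
```

Inheriting from the message instead would couple application code to generated internals that change whenever the .proto file is recompiled.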
Questions…
Quick Links
• API
▫ http://code.google.com/apis/protocolbuffers/
• Post by Kenton Varda, Protocol Buffers Team
▫ http://google-opensource.blogspot.com/2008/07/protocol-buffers-googles-data.html
• Kevin Weil, Analytics Lead, Twitter
▫ http://www.slideshare.net/kevinweil/protocol-buffers-and-hadoop-at-twitter
• Benchmarks
▫ http://code.google.com/p/thrift-protobuf-compare/wiki/Benchmarking
• Computer World Article
▫ http://www.computerworld.com/s/article/9191098/Twitter_solves_its_data_formatting_challenge