Apache Avro
Introduction and Applications

Chris Cooper
linkedin.com/in/chriscooper
Chicago Hadoop User Group - Jan 2011

Tuesday, February 15, 2011

Bio

• Technologist and Consultant (over 20 years)

• Variety of industries: financial, CRM, travel, robotics, mobile, location, and travel again

• Recently: large data problems with Hadoop, Avro, HBase

• Twitter: @cjcdoomed Web: www.coopertechnical.com

What is Avro?

Avro is...

another data serialization framework

(it’s also a file format and an RPC framework)

Serialization Frameworks

Java Serialization

Protocol Buffers

Thrift

BSON

JSON

JAXB

ASN.1

MessagePack

Kryo

XDR

BERT

SOAP

Pickle

NetStrings

Avro

Serialization Frameworks - Features

• Fast

• Compact

• Cross-platform/Cross-language

Fast...

Compact...

Cross-platform/Cross-language

• Avro: C, C++, Java, PHP, Python, Ruby

• Thrift: C, C++, C#, Erlang, Haskell, Java, Objective C/Cocoa, OCaml, Perl, PHP, Python, Ruby, and Smalltalk Squeak

• Protobuf: ActionScript, C, C++, C#, Clojure, Common Lisp, D, Erlang, Go, Haskell, Java, Lua, Objective C, OCaml, Perl, PHP, Python, R, Ruby, Scala, Visual Basic

Why another Serialization Framework?

• Avro is dynamic

• Rich schema resolution

• Avro object container format

Avro is Dynamic

• Avro, Thrift, and Protobuf all define serialization formats using schemas

• Thrift and Protobuf can only read and write using schemas known at compile time.

• Avro can deal with unknown schemas at runtime
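To make the "unknown schemas at runtime" point concrete, here is a minimal sketch in plain Python (standard library only, not the Avro library). It parses a schema from JSON text at runtime and walks it to encode and decode a record, with no generated classes. It covers only int, long, string, and record; the function names and the two-field UserProfile schema are assumptions made for illustration.

```python
import io
import json


def write_long(out, n):
    """Avro int/long: zigzag-encode, then emit 7 bits per byte, low bits first."""
    n = (n << 1) ^ (n >> 63)  # zigzag; assumes n fits in a signed 64-bit value
    while n > 0x7F:
        out.write(bytes([(n & 0x7F) | 0x80]))  # high bit set: more bytes follow
        n >>= 7
    out.write(bytes([n]))


def read_long(inp):
    shift, acc = 0, 0
    while True:
        byte = inp.read(1)[0]
        acc |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return (acc >> 1) ^ -(acc & 1)  # undo zigzag
        shift += 7


def encode(schema, datum, out):
    """Walk the schema at runtime; no generated classes are involved."""
    if schema in ("int", "long"):
        write_long(out, datum)
    elif schema == "string":
        raw = datum.encode("utf-8")
        write_long(out, len(raw))  # strings are length-prefixed UTF-8
        out.write(raw)
    elif isinstance(schema, dict) and schema.get("type") == "record":
        for field in schema["fields"]:  # fields are written in schema order
            encode(field["type"], datum[field["name"]], out)
    else:
        raise NotImplementedError(f"type not covered by this sketch: {schema!r}")


def decode(schema, inp):
    if schema in ("int", "long"):
        return read_long(inp)
    if schema == "string":
        return inp.read(read_long(inp)).decode("utf-8")
    if isinstance(schema, dict) and schema.get("type") == "record":
        return {f["name"]: decode(f["type"], inp) for f in schema["fields"]}
    raise NotImplementedError(f"type not covered by this sketch: {schema!r}")


# The schema arrives as text at runtime, for example from a file header.
schema = json.loads("""
{"type": "record", "name": "UserProfile",
 "fields": [{"name": "uid", "type": "int"},
            {"name": "name", "type": "string"}]}
""")
buf = io.BytesIO()
encode(schema, {"uid": 1, "name": "al"}, buf)
encoded = buf.getvalue()
decoded = decode(schema, io.BytesIO(encoded))
```

Note that the encoded bytes carry no field names or tags, which is why a reader always needs the writer's schema to make sense of them.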

Avro has Rich Schema Resolution

• Schema is always present

• Read schema and write schema need not be identical (schema projection)

• Schema evolution

Avro Object Container Format

• Schema stored with the data (not at an instance level)

• Objects are stored in blocks that can be compressed

• Sync markers occur between blocks

• Hmmm...sounds like a Hadoop Sequence file!
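The layout described above can be sketched in a few lines of Python. This is a simplified illustration, not the real Avro container format: it stores the schema once in the header, groups records into compressed blocks, and repeats a random 16-byte sync marker after every block. The fixed-width integers, the JSON-lines payload, zlib as the codec, and all function names are assumptions made to keep the sketch short.

```python
import io
import json
import os
import struct
import zlib

MAGIC = b"Obj\x01"  # the real Avro magic bytes; the rest of the layout is simplified


def write_container(out, schema, records, block_size=2):
    sync = os.urandom(16)  # one random marker per file
    schema_bytes = json.dumps(schema).encode("utf-8")
    out.write(MAGIC)
    out.write(struct.pack(">I", len(schema_bytes)))
    out.write(schema_bytes)  # schema is stored once, not per record
    out.write(sync)
    for start in range(0, len(records), block_size):
        block = records[start:start + block_size]
        lines = "\n".join(json.dumps(r) for r in block)
        payload = zlib.compress(lines.encode("utf-8"))  # each block compressed alone
        out.write(struct.pack(">II", len(block), len(payload)))
        out.write(payload)
        out.write(sync)  # marker closes every block


def read_container(inp):
    if inp.read(4) != MAGIC:
        raise ValueError("not a container file")
    (schema_len,) = struct.unpack(">I", inp.read(4))
    schema = json.loads(inp.read(schema_len))
    sync = inp.read(16)
    records = []
    while True:
        head = inp.read(8)
        if not head:  # end of file
            break
        count, size = struct.unpack(">II", head)
        lines = zlib.decompress(inp.read(size)).decode("utf-8").split("\n")
        if len(lines) != count or inp.read(16) != sync:
            raise ValueError("corrupt block")
        records.extend(json.loads(line) for line in lines)
    return schema, records


schema = {"type": "record", "name": "UserProfile",
          "fields": [{"name": "uid", "type": "int"}]}
rows = [{"uid": i} for i in range(5)]
buf = io.BytesIO()
write_container(buf, schema, rows)  # 5 records, 2 per block: 3 blocks
buf.seek(0)
read_schema, read_rows = read_container(buf)
```

Because every block ends with the same marker, a reader dropped at an arbitrary offset can scan forward to the next marker and resume at a block boundary.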

Avro and Hadoop

• Object container format is compressible and splittable

• Schemas specify sort order (fast comparators for free)

• Avro data files are portable across languages

• Avro RPC integrated with Hadoop will allow rolling upgrades and multi-language clients
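The sort-order point can be illustrated with a small Python sketch. An Avro record field may carry an "order" attribute ("ascending" by default, "descending", or "ignore"), and records compare field by field in schema order. The sketch below compares already-decoded records, whereas the "fast comparators" on the slide refer to comparing the binary form directly; the schema and the sample records are hypothetical.

```python
import functools


def compare(schema, a, b):
    """Compare two decoded records field by field, in schema order."""
    for field in schema["fields"]:
        order = field.get("order", "ascending")
        if order == "ignore":
            continue  # this field never affects ordering
        x, y = a[field["name"]], b[field["name"]]
        if x != y:
            result = -1 if x < y else 1
            return -result if order == "descending" else result
    return 0


schema = {
    "type": "record",
    "name": "Score",
    "fields": [
        {"name": "name", "type": "string", "order": "ignore"},
        {"name": "score", "type": "long", "order": "descending"},
        {"name": "uid", "type": "long"},
    ],
}
records = [
    {"name": "b", "score": 1, "uid": 2},
    {"name": "a", "score": 3, "uid": 1},
    {"name": "c", "score": 1, "uid": 1},
]
ranked = sorted(records, key=functools.cmp_to_key(functools.partial(compare, schema)))
names = [r["name"] for r in ranked]
```

Here the highest score comes first, and ties on score fall back to ascending uid.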

Avro Applied

• Hadoop MR

• Middleware - transferring large amounts of data

• Mobile

• Evolving data

• Messaging (JMS, AMQP)

• Complex object hierarchies

Good Meh...

Avro Details

Avro Schemas and Protocols

Thrift

    struct UserProfile {
      1: i32 uid,
      2: string name,
      3: string blurb
    }
    service UserStorage {
      void store(1: UserProfile user),
      UserProfile retrieve(1: i32 uid)
    }

Protobuf

    message SearchRequest {
      required string query = 1;
      optional int32 page_number = 2;
      optional int32 result_per_page = 3;
    }

Avro JSON

    {
      "type": "record",
      "name": "LongList",
      "aliases": ["LinkedLongs"],
      "fields" : [
        {"name": "value", "type": "long"},
        {"name": "next", "type": ["LongList", "null"]}
      ]
    }

Avro IDL

    @namespace("org.apache.avro.test")
    protocol Simple {
      enum Kind {
        FOO,
        BAR, // the bar enum value
        BAZ
      }

      fixed MD5(16);

      record TestRecord {
        @order("ignore")
        string name;

        @order("descending")
        Kind kind;

        MD5 hash;

        union { MD5, null} nullableHash;

        array<long> arrayOfLongs;
      }

      error TestError {
        string message;
      }

      string hello(string greeting);
      TestRecord echo(TestRecord `record`);
      int add(int arg1, int arg2);
      bytes echoBytes(bytes data);
      void `error`() throws TestError;
    }

Avro Types

Simple

• string

• bytes

• int

• long

• float

• double

• boolean

• null

Complex

• record

• array

• map

• union

• fixed

• enum

How Avro Schema Resolution Works

• Schema written along with data

• Reader uses both schemas

• Matching fields (name and type) will get processed, the rest is skipped

• Demo
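The resolution steps above can be sketched in plain Python (standard library only, not the Avro library). The reader walks the data in the writer's field order, keeps the fields whose name and type match its own schema, and skips the rest. Filling a reader-only field from its "default" is part of the Avro specification but not stated on the slide. The sketch omits type promotion, aliases, and nested types, and the field names (including the hypothetical "lang" field) are assumptions for illustration.

```python
import io


def write_long(out, n):
    n = (n << 1) ^ (n >> 63)  # zigzag, then 7 bits per byte
    while n > 0x7F:
        out.write(bytes([(n & 0x7F) | 0x80]))
        n >>= 7
    out.write(bytes([n]))


def read_long(inp):
    shift, acc = 0, 0
    while True:
        byte = inp.read(1)[0]
        acc |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return (acc >> 1) ^ -(acc & 1)
        shift += 7


def write_string(out, text):
    raw = text.encode("utf-8")
    write_long(out, len(raw))
    out.write(raw)


def read_value(kind, inp):
    if kind in ("int", "long"):
        return read_long(inp)
    if kind == "string":
        return inp.read(read_long(inp)).decode("utf-8")
    raise NotImplementedError(kind)


def resolve(writer, reader, inp):
    """Decode data laid out by `writer` into the shape `reader` expects."""
    wanted = {f["name"]: f for f in reader["fields"]}
    result = {}
    for wf in writer["fields"]:  # bytes follow the writer's field order
        value = read_value(wf["type"], inp)  # always consume, even if skipped
        rf = wanted.get(wf["name"])
        if rf is not None and rf["type"] == wf["type"]:
            result[wf["name"]] = value
    for rf in reader["fields"]:  # reader-only fields need a default
        if rf["name"] not in result:
            if "default" not in rf:
                raise ValueError(f"no value or default for {rf['name']}")
            result[rf["name"]] = rf["default"]
    return result


writer = {"type": "record", "name": "UserProfile", "fields": [
    {"name": "uid", "type": "int"},
    {"name": "name", "type": "string"},
    {"name": "blurb", "type": "string"}]}
reader = {"type": "record", "name": "UserProfile", "fields": [
    {"name": "name", "type": "string"},
    {"name": "lang", "type": "string", "default": "en"}]}

buf = io.BytesIO()
write_long(buf, 7)
write_string(buf, "ann")
write_string(buf, "hi")
buf.seek(0)
resolved = resolve(writer, reader, buf)
```

The reader ends up with only "name" and "lang": "uid" and "blurb" are read past and dropped, which is the projection behavior mentioned earlier.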

Avro Java API

• Specific - Classes generated

• Generic - Uses Record class + accessors

• Reflection - Generates the schema from an existing Java object
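The reflection style has a rough analogue in Python, sketched here with standard-library dataclasses rather than the Avro Java API. It derives an Avro-style record schema from an existing class, and the last line shows the generic style for contrast, where a record is just a container accessed by field name. The type mapping (for example Python int to Avro long), the UserProfile class, and the function name are assumptions for illustration.

```python
import dataclasses

# Assumed mapping from Python types to Avro primitive type names.
PRIMITIVES = {int: "long", float: "double", str: "string",
              bytes: "bytes", bool: "boolean"}


@dataclasses.dataclass
class UserProfile:
    uid: int
    name: str
    blurb: str


def schema_for(cls):
    """Build a record schema by inspecting the fields of an existing class."""
    fields = [{"name": f.name, "type": PRIMITIVES[f.type]}
              for f in dataclasses.fields(cls)]
    return {"type": "record", "name": cls.__name__, "fields": fields}


reflected = schema_for(UserProfile)

# Generic style for contrast: no class at all, only field access by name.
generic = {f["name"]: None for f in reflected["fields"]}
```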

Avro RPC

• Transports - currently only HTTP

• Handshake (exchange/verify protocols)

• Asynchronous/Synchronous

Avro Tools

• BinaryFragmentToJsonTool: Converts an input file from Avro binary into JSON.

• DataFileGetSchemaTool: Reads a data file to get its schema.

• DataFileReadTool: Reads a data file and dumps it to JSON.

• DataFileWriteTool: Reads newline-delimited JSON records and writes an Avro data file.

• FromTextTool: Reads a text file into an Avro data file.

• IdlTool: Generates Avro JSON schemata from IDL format files.

• JsonToBinaryFragmentTool: Converts JSON data into the binary form.

• RpcReceiveTool: Receives one RPC call and responds.

• RpcSendTool: Sends a single RPC message.

• ToTextTool: Reads an Avro data file into a plain text file.

References

• Apache Avro - http://avro.apache.org

• Hadoop: The Definitive Guide by Tom White

• http://code.google.com/p/thrift-protobuf-compare/wiki/Benchmarking

• Mailing List:

Avro

Questions?

Thanks!

[email protected]
