Apache Avro: Introduction and Applications

Chris Cooper, linkedin.com/in/chriscooper
Chicago Hadoop User Group - Jan 2011

Tuesday, February 15, 2011

What is Avro?

Avro is...

another data serialization framework

(it’s also a file format and an RPC framework)

Serialization Frameworks

Java Serialization

Protocol Buffers

Thrift

BSON

JSON

JAXB

ASN.1

MessagePack

Kryo

XDR

BERT

SOAP

Pickle

NetStrings

Avro

Serialization Frameworks - Features

• Fast

• Compact

• Cross-platform/Cross-language

Fast...

Compact...

Cross-platform/Cross Language

Avro: C, C++, Java, PHP, Python, Ruby

Thrift: C, C++, C#, Erlang, Haskell, Java, Objective C/Cocoa, OCaml, Perl, PHP, Python, Ruby, and Smalltalk Squeak

Protobuf: ActionScript, C, C++, C#, Clojure, Common Lisp, D, Erlang, Go, Haskell, Java, Lua, Objective C, OCaml, Perl, PHP, Python, R, Ruby, Scala, Visual Basic

Why another Serialization Framework?

• Avro is dynamic

• Rich schema resolution

• Avro object container format

Avro is Dynamic

• Avro, Thrift, and Protobuf all define serialization formats using schemas

• Thrift and Protobuf can only read and write using schemas known at compile time.

• Avro can deal with unknown schemas at runtime
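
To make the contrast concrete, here is a minimal Python sketch of a schema-driven codec: the schema arrives as a JSON string at runtime, and the encoder and decoder simply walk it, with no generated classes. It handles only flat records with `long` and `string` fields, using zig-zag variable-length integers for longs and length-prefixed UTF-8 for strings as in Avro's binary encoding. This is not the Avro library's API; the function names and the `User` schema are made up for illustration.

```python
import io
import json


def write_long(out, n):
    """Write a long as a zig-zag, variable-length integer."""
    z = (n << 1) ^ (n >> 63)  # zig-zag: small magnitudes map to small codes
    while z > 0x7F:
        out.write(bytes([(z & 0x7F) | 0x80]))  # high bit set: more bytes follow
        z >>= 7
    out.write(bytes([z]))


def read_long(inp):
    """Read a zig-zag, variable-length integer."""
    z, shift = 0, 0
    while True:
        b = inp.read(1)[0]
        z |= (b & 0x7F) << shift
        if not b & 0x80:
            break
        shift += 7
    return (z >> 1) ^ -(z & 1)  # undo the zig-zag mapping


def encode(schema, record):
    """Serialize a flat record by walking the schema's fields in order."""
    out = io.BytesIO()
    for field in schema["fields"]:
        value = record[field["name"]]
        if field["type"] == "long":
            write_long(out, value)
        elif field["type"] == "string":
            data = value.encode("utf-8")
            write_long(out, len(data))  # length prefix, then the bytes
            out.write(data)
        else:
            raise ValueError("type not handled in this sketch: %r" % field["type"])
    return out.getvalue()


def decode(schema, payload):
    """Deserialize a flat record using only the schema supplied at runtime."""
    inp = io.BytesIO(payload)
    record = {}
    for field in schema["fields"]:
        if field["type"] == "long":
            record[field["name"]] = read_long(inp)
        elif field["type"] == "string":
            length = read_long(inp)
            record[field["name"]] = inp.read(length).decode("utf-8")
        else:
            raise ValueError("type not handled in this sketch: %r" % field["type"])
    return record


# The schema is plain data, so it can be loaded at runtime (hypothetical example).
schema = json.loads(
    '{"type": "record", "name": "User", "fields": ['
    '{"name": "uid", "type": "long"}, {"name": "name", "type": "string"}]}'
)
payload = encode(schema, {"uid": 64, "name": "ann"})
roundtrip = decode(schema, payload)
```

Note that the payload carries no field names or tags; it is only interpretable together with the schema, which is why the schema must always travel with the data.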

Avro has Rich Schema Resolution

• Schema is always present

• Read schema and write schema need not be identical (schema projection)

• Schema evolution

Avro Object Container Format

• Schema is stored with the data (once per file, not per instance)

• Objects are stored in blocks that can be compressed

• Sync markers occur between blocks

• Hmmm...sounds like a Hadoop Sequence file!
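
The three properties above can be sketched with a toy container in Python. This is a simplified illustration and is not byte-compatible with real Avro data files: the `TOY1` magic, the fixed-width length fields, and the use of JSON lines inside each block are assumptions made to keep the example short. What it does show is the schema written once in the header, records grouped into independently compressed blocks, and a 16-byte sync marker after the header and after every block.

```python
import io
import json
import os
import struct
import zlib

MAGIC = b"TOY1"  # hypothetical magic; real Avro files use a different header


def write_container(schema, records, block_size=2):
    """Write header (magic, schema, sync marker) followed by compressed blocks."""
    out = io.BytesIO()
    sync = os.urandom(16)  # random marker, chosen once per file
    schema_bytes = json.dumps(schema).encode("utf-8")
    out.write(MAGIC)
    out.write(struct.pack(">I", len(schema_bytes)))
    out.write(schema_bytes)  # the schema is stored once, not per record
    out.write(sync)
    for start in range(0, len(records), block_size):
        block = records[start:start + block_size]
        raw = "\n".join(json.dumps(r) for r in block).encode("utf-8")
        data = zlib.compress(raw)  # each block is compressed independently
        out.write(struct.pack(">II", len(block), len(data)))
        out.write(data)
        out.write(sync)  # marker closes every block
    return out.getvalue()


def read_container(blob):
    """Read the schema from the header, then every block in turn."""
    inp = io.BytesIO(blob)
    if inp.read(4) != MAGIC:
        raise ValueError("not a toy container")
    (schema_len,) = struct.unpack(">I", inp.read(4))
    schema = json.loads(inp.read(schema_len).decode("utf-8"))
    sync = inp.read(16)
    records = []
    while True:
        header = inp.read(8)
        if not header:
            break  # clean end of file
        count, size = struct.unpack(">II", header)
        lines = zlib.decompress(inp.read(size)).decode("utf-8").split("\n")
        if len(lines) != count or inp.read(16) != sync:
            raise ValueError("corrupt block")
        records.extend(json.loads(line) for line in lines)
    return schema, records


def next_block_offset(blob, sync, offset):
    """Find where the next block starts at or after an arbitrary byte offset."""
    pos = blob.find(sync, offset)
    return -1 if pos < 0 else pos + len(sync)


schema = {"type": "record", "name": "Point",
          "fields": [{"name": "x", "type": "long"}]}
points = [{"x": i} for i in range(5)]
blob = write_container(schema, points)
read_schema, read_points = read_container(blob)
```

Because a reader dropped at any offset can scan forward with something like `next_block_offset` to the next marker and start decoding there, a large file can be divided among workers without first reading it from the beginning.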

Avro and Hadoop

• Object container format is compressible and splittable

• Schemas specify sort order (fast comparators for free)

• Avro data files are portable across languages

• Avro RPC integrated with Hadoop will allow rolling upgrades and multi-language clients
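
As a rough illustration of schema-specified sort order, the Python sketch below compares two records field by field in schema order, honoring a per-field `order` attribute of `ascending` (the default), `descending`, or `ignore`. It works on already-decoded dictionaries for readability, whereas the "fast comparators" on the slide refer to comparing serialized data; the schema and rows are hypothetical.

```python
import functools


def compare(schema, a, b):
    """Compare two records field by field, in the order the schema lists them."""
    for field in schema["fields"]:
        order = field.get("order", "ascending")
        if order == "ignore":
            continue  # this field never affects the ordering
        x, y = a[field["name"]], b[field["name"]]
        if x != y:
            result = -1 if x < y else 1
            return -result if order == "descending" else result
    return 0  # all compared fields are equal


schema = {"type": "record", "name": "Score", "fields": [
    {"name": "name", "type": "string", "order": "ignore"},
    {"name": "score", "type": "long", "order": "descending"},
    {"name": "uid", "type": "long"},
]}
rows = [
    {"name": "b", "score": 1, "uid": 2},
    {"name": "a", "score": 3, "uid": 1},
    {"name": "c", "score": 3, "uid": 0},
]
ordered = sorted(rows, key=functools.cmp_to_key(lambda a, b: compare(schema, a, b)))
ordered_names = [r["name"] for r in ordered]
```

Here the rows end up sorted by `score` from high to low, with ties broken by ascending `uid`, and `name` plays no part.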

Avro Applied

• Hadoop MR

• Middleware - transferring large amounts of data

• Mobile

• Evolving data

• Messaging (JMS, AMQP)

• Complex object hierarchies

Good Meh...

Avro Details

Avro Schemas and Protocols

Thrift

struct UserProfile {
  1: i32 uid,
  2: string name,
  3: string blurb
}
service UserStorage {
  void store(1: UserProfile user),
  UserProfile retrieve(1: i32 uid)
}

Protobuf

message SearchRequest {
  required string query = 1;
  optional int32 page_number = 2;
  optional int32 result_per_page = 3;
}

Avro JSON

{
  "type": "record",
  "name": "LongList",
  "aliases": ["LinkedLongs"],
  "fields" : [
    {"name": "value", "type": "long"},
    {"name": "next", "type": ["LongList", "null"]}
  ]
}

Avro IDL

@namespace("org.apache.avro.test")
protocol Simple {
  enum Kind {
    FOO,
    BAR, // the bar enum value
    BAZ
  }

  fixed MD5(16);

  record TestRecord {
    @order("ignore")
    string name;

    @order("descending")
    Kind kind;

    MD5 hash;

    union { MD5, null} nullableHash;

    array<long> arrayOfLongs;
  }

  error TestError {
    string message;
  }

  string hello(string greeting);
  TestRecord echo(TestRecord `record`);
  int add(int arg1, int arg2);
  bytes echoBytes(bytes data);
  void `error`() throws TestError;
}

Avro Types

Simple

• string

• bytes

• int

• long

• float

• double

• boolean

• null

Complex

• record

• array

• map

• union

• fixed

• enum

How Avro Schema Resolution Works

• Schema written along with data

• Reader uses both schemas

• Fields that match by name and type are processed; the rest are skipped

• Demo
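
The resolution rule above can be sketched in a few lines of Python. This is a simplified model operating on already-decoded dictionaries, not the Avro library's implementation: it matches fields by name and exact type, drops fields only the writer knows about, and fills reader-only fields from a `default` if one is declared. Type promotion and aliases are left out, and the `User` schemas are hypothetical.

```python
def resolve(writer_schema, reader_schema, datum):
    """Project a record written with writer_schema onto reader_schema."""
    writer_fields = {f["name"]: f for f in writer_schema["fields"]}
    result = {}
    for field in reader_schema["fields"]:
        name = field["name"]
        written = writer_fields.get(name)
        if written is not None and written["type"] == field["type"]:
            result[name] = datum[name]  # matching field: keep the value
        elif written is None and "default" in field:
            result[name] = field["default"]  # reader-only field: use its default
        else:
            raise ValueError("cannot resolve field %r" % name)
    return result  # writer-only fields were never copied, so they are skipped


writer = {"type": "record", "name": "User", "fields": [
    {"name": "uid", "type": "long"},
    {"name": "name", "type": "string"},
    {"name": "blurb", "type": "string"},
]}
reader = {"type": "record", "name": "User", "fields": [
    {"name": "uid", "type": "long"},
    {"name": "email", "type": "string", "default": ""},
]}
resolved = resolve(writer, reader, {"uid": 7, "name": "ann", "blurb": "hi"})
```

In this example the reader sees `uid` from the written data, gets an empty `email` from its own default, and never sees `name` or `blurb`.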

Avro Java API

• Specific - Classes generated

• Generic - Uses Record class + accessors

• Reflection - Generates the schema from an existing Java object

Avro RPC

• Transports - currently only HTTP

• Handshake (exchange/verify protocols)

• Asynchronous/Synchronous

Avro Tools

• BinaryFragmentToJsonTool: Converts an input file from Avro binary into JSON.

• DataFileGetSchemaTool: Reads a data file to get its schema.

• DataFileReadTool: Reads a data file and dumps it to JSON.

• DataFileWriteTool: Reads newline-delimited JSON records and writes an Avro data file.

• FromTextTool: Reads a text file into an Avro data file.

• IdlTool: Generates Avro JSON schemata from IDL format files.

• JsonToBinaryFragmentTool: Converts JSON data into the binary form.

• RpcReceiveTool: Receives one RPC call and responds.

• RpcSendTool: Sends a single RPC message.

• ToTextTool: Reads an Avro data file into a plain text file.

References

• Apache Avro - http://avro.apache.org

• Hadoop: The Definitive Guide by Tom White

• http://code.google.com/p/thrift-protobuf-compare/wiki/Benchmarking

• Mailing List:

Avro

Questions?

Thanks!

ccooper@coopertechnical.com
