Apache Avro: Introduction and Applications
Chris Cooper (linkedin.com/in/chriscooper)
Chicago Hadoop User Group - Jan 2011
Tuesday, February 15, 2011
Bio
• Technologist and consultant (over 20 years)
• Variety of industries: financial, CRM, travel, robotics, mobile, location, and travel again
• Recently: large data problems with Hadoop, Avro, HBase
• Twitter: @cjcdoomed Web: www.coopertechnical.com
What is Avro?
Avro is...
another data serialization framework
(it’s also a file format and an RPC framework)
Serialization Frameworks
Java Serialization
Protocol Buffers
Thrift
BSON
JSON
JAXB
ASN.1
MessagePack
Kryo
XDR
BERT
SOAP
Pickle
NetStrings
Avro
Serialization Frameworks - Features
• Fast
• Compact
• Cross-platform/Cross-language
Fast...
Compact...
Cross-platform/Cross-language
• Avro: C, C++, Java, PHP, Python, Ruby
• Thrift: C, C++, C#, Erlang, Haskell, Java, Objective C/Cocoa, OCaml, Perl, PHP, Python, Ruby, and Smalltalk Squeak
• Protobuf: ActionScript, C, C++, C#, Clojure, Common Lisp, D, Erlang, Go, Haskell, Java, Lua, Objective C, OCaml, Perl, PHP, Python, R, Ruby, Scala, Visual Basic
Why another Serialization Framework?
• Avro is dynamic
• Rich schema resolution
• Avro object container format
Avro is Dynamic
• Avro, Thrift, and Protobuf all define serialization formats using schemas
• Thrift and Protobuf can only read and write using schemas known at compile time.
• Avro can deal with unknown schemas at runtime
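To make the "unknown schemas at runtime" point concrete, here is a minimal Python sketch. It is not the Avro library: the helper names `encode_long` and `encode_datum` and the `User` schema are made up for illustration, and only `long`, `string`, and flat records are covered. The schema arrives as a JSON string at runtime and drives a generic encoder, with no generated classes involved. The byte layout follows the Avro binary encoding: longs are zig-zag varints, strings are a length followed by UTF-8 bytes, and a record is simply its fields in schema order.

```python
import json


def encode_long(n: int) -> bytes:
    """Avro int/long: zig-zag encode, then base-128 varint (low 7 bits first)."""
    z = (n << 1) ^ (n >> 63)  # zig-zag: small negatives become small unsigned values
    out = bytearray()
    while z & ~0x7F:          # more than 7 bits left: emit a continuation byte
        out.append((z & 0x7F) | 0x80)
        z >>= 7
    out.append(z)
    return bytes(out)


def encode_datum(schema, datum) -> bytes:
    """Walk a schema that was parsed at runtime and encode a matching datum."""
    if schema == "long":
        return encode_long(datum)
    if schema == "string":
        raw = datum.encode("utf-8")
        return encode_long(len(raw)) + raw
    if isinstance(schema, dict) and schema.get("type") == "record":
        # A record has no framing of its own: just the fields, in schema order.
        return b"".join(
            encode_datum(field["type"], datum[field["name"]])
            for field in schema["fields"]
        )
    raise ValueError(f"type not covered by this sketch: {schema!r}")


# Hypothetical schema, known only at runtime (e.g. read from a file header).
schema_json = """
{"type": "record", "name": "User",
 "fields": [{"name": "uid", "type": "long"},
            {"name": "name", "type": "string"}]}
"""
schema = json.loads(schema_json)
encoded = encode_datum(schema, {"uid": 27, "name": "foo"})
print(encoded.hex())  # 3606666f6f
```

The five output bytes carry no field names or tags, which is why a reader always needs the writer's schema to make sense of them.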
Avro has Rich Schema Resolution
• Schema is always present
• Read schema and write schema need not be identical (schema projection)
• Schema evolution
Avro Object Container Format
• Schema stored with the data (not at an instance level)
• Objects are stored in blocks that can be compressed
• Sync markers occur between blocks
• Hmmm...sounds like a Hadoop Sequence file!
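The layout described in these bullets can be sketched in plain Python. This is a simplified illustration based on the Avro specification, not the Avro library. The function names `write_container` and `read_container` are hypothetical, and the reader handles only what this writer produces (a single metadata block and the `deflate` codec). The file starts with the magic bytes `Obj\x01`, then a metadata map holding `avro.schema` and `avro.codec`, then a 16-byte sync marker. Each block is an object count, a byte size, the (compressed) objects, and the sync marker again.

```python
import io
import json
import os
import zlib

MAGIC = b"Obj\x01"


def encode_long(n):
    """Zig-zag varint, as used for every length and count in the file."""
    z = (n << 1) ^ (n >> 63)
    out = bytearray()
    while z & ~0x7F:
        out.append((z & 0x7F) | 0x80)
        z >>= 7
    out.append(z)
    return bytes(out)


def decode_long(buf):
    shift = result = 0
    while True:
        b = buf.read(1)[0]
        result |= (b & 0x7F) << shift
        if not b & 0x80:
            break
        shift += 7
    return (result >> 1) ^ -(result & 1)  # undo zig-zag


def write_container(schema, blocks, sync):
    """blocks: a list of lists of already-encoded objects (bytes)."""
    meta = {"avro.schema": json.dumps(schema).encode(), "avro.codec": b"deflate"}
    out = io.BytesIO()
    out.write(MAGIC)
    out.write(encode_long(len(meta)))      # metadata map: one block of entries...
    for key, value in meta.items():
        k = key.encode()
        out.write(encode_long(len(k)) + k + encode_long(len(value)) + value)
    out.write(encode_long(0))              # ...terminated by a zero count
    out.write(sync)
    for objs in blocks:
        co = zlib.compressobj(wbits=-15)   # raw deflate, no zlib header or checksum
        data = co.compress(b"".join(objs)) + co.flush()
        out.write(encode_long(len(objs)) + encode_long(len(data)) + data + sync)
    return out.getvalue()


def read_container(raw):
    buf = io.BytesIO(raw)
    assert buf.read(4) == MAGIC
    meta = {}
    while True:
        count = decode_long(buf)
        if count == 0:
            break
        for _ in range(count):
            key = buf.read(decode_long(buf)).decode()
            meta[key] = buf.read(decode_long(buf))
    sync = buf.read(16)
    blocks = []
    while buf.tell() < len(raw):
        count = decode_long(buf)
        size = decode_long(buf)
        data = zlib.decompress(buf.read(size), wbits=-15)
        assert buf.read(16) == sync        # every block ends with the same marker
        blocks.append((count, data))
    return json.loads(meta["avro.schema"]), blocks


sync = os.urandom(16)
raw = write_container("long", [[encode_long(1), encode_long(2)], [encode_long(-3)]], sync)
schema_back, blocks = read_container(raw)
print(schema_back, blocks)
```

Because the schema is written once in the header and the same sync marker closes every block, a reader dropped at an arbitrary offset can scan forward to the next marker and resume there. That is what makes the format splittable.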
Avro and Hadoop
• Object container format is compressible and splittable
• Schemas specify sort order (fast comparators for free)
• Avro data files are portable across languages
• Avro RPC integrated with Hadoop will allow rolling upgrades and multi-language clients
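As a rough illustration of "schemas specify sort order", the sketch below applies the per-field `order` attribute from the Avro specification (`ascending` by default, `descending`, or `ignore`) to records compared field by field in schema order. The schema, the `compare_records` helper, and the sample rows are assumptions for this example. It works on decoded Python dicts for brevity, whereas the point on the slide is that an implementation can apply the same rules to the encoded data.

```python
from functools import cmp_to_key

# Hypothetical record schema; "order" is the attribute that drives sorting.
schema = {
    "type": "record",
    "name": "TestRecord",
    "fields": [
        {"name": "name", "type": "string", "order": "ignore"},
        {"name": "kind", "type": "long", "order": "descending"},
        {"name": "id", "type": "long"},  # no "order" means ascending
    ],
}


def compare_records(schema, a, b):
    """Compare two records field by field, in schema order."""
    for field in schema["fields"]:
        order = field.get("order", "ascending")
        if order == "ignore":
            continue                      # field does not take part in sorting
        x, y = a[field["name"]], b[field["name"]]
        c = (x > y) - (x < y)
        if c:
            return -c if order == "descending" else c
    return 0


rows = [
    {"name": "a", "kind": 1, "id": 2},
    {"name": "b", "kind": 2, "id": 9},
    {"name": "c", "kind": 1, "id": 1},
]
ordered = sorted(rows, key=cmp_to_key(lambda a, b: compare_records(schema, a, b)))
names = [r["name"] for r in ordered]
print(names)  # ['b', 'c', 'a']
```

The comparator is derived entirely from the schema, so no hand-written comparator class is needed per record type.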
Avro Applied
• Hadoop MR
• Middleware - transferring large amounts of data
• Mobile
• Evolving data
• Messaging (JMS, AMQP)
• Complex object hierarchies
Good Meh...
Avro Details
Avro Schemas and Protocols
Thrift

    struct UserProfile {
      1: i32 uid,
      2: string name,
      3: string blurb
    }
    service UserStorage {
      void store(1: UserProfile user),
      UserProfile retrieve(1: i32 uid)
    }

Protobuf

    message SearchRequest {
      required string query = 1;
      optional int32 page_number = 2;
      optional int32 result_per_page = 3;
    }

Avro JSON

    {
      "type": "record",
      "name": "LongList",
      "aliases": ["LinkedLongs"],
      "fields" : [
        {"name": "value", "type": "long"},
        {"name": "next", "type": ["LongList", "null"]}
      ]
    }

Avro IDL

    @namespace("org.apache.avro.test")
    protocol Simple {
      enum Kind {
        FOO,
        BAR, // the bar enum value
        BAZ
      }

      fixed MD5(16);

      record TestRecord {
        @order("ignore")
        string name;

        @order("descending")
        Kind kind;

        MD5 hash;

        union { MD5, null} nullableHash;

        array<long> arrayOfLongs;
      }

      error TestError {
        string message;
      }

      string hello(string greeting);
      TestRecord echo(TestRecord `record`);
      int add(int arg1, int arg2);
      bytes echoBytes(bytes data);
      void `error`() throws TestError;
    }
Avro Types
Simple
• string
• bytes
• int
• long
• float
• double
• boolean
• null
Complex
• record
• array
• map
• union
• fixed
• enum
How Avro Schema Resolution Works
• Schema written along with data
• Reader uses both schemas
• Matching fields (name and type) will get processed, the rest is skipped
• Demo
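The resolution steps above can be sketched as follows. This is a simplified model on already-decoded Python dicts, not the Avro library: the `resolve_record` helper and both `User` schemas are hypothetical. It keeps fields whose name and type match, drops fields only the writer knows about, and, following the Avro specification, fills in a reader-only field from its `default`. Type promotion, aliases, and nested types are deliberately left out.

```python
# Schema the data was written with (travels with the data).
writer_schema = {
    "type": "record",
    "name": "User",
    "fields": [
        {"name": "uid", "type": "long"},
        {"name": "name", "type": "string"},
        {"name": "blurb", "type": "string"},
    ],
}

# Schema the reading application expects: different order, one field dropped,
# one field added with a default.
reader_schema = {
    "type": "record",
    "name": "User",
    "fields": [
        {"name": "name", "type": "string"},
        {"name": "uid", "type": "long"},
        {"name": "email", "type": "string", "default": ""},
    ],
}


def resolve_record(writer_schema, reader_schema, written):
    """Project a record written with writer_schema onto reader_schema."""
    writer_fields = {f["name"]: f for f in writer_schema["fields"]}
    result = {}
    for rf in reader_schema["fields"]:
        wf = writer_fields.get(rf["name"])
        if wf is not None and wf["type"] == rf["type"]:
            result[rf["name"]] = written[rf["name"]]   # name and type match
        elif wf is None and "default" in rf:
            result[rf["name"]] = rf["default"]         # reader-only field
        else:
            raise ValueError(f"cannot resolve field {rf['name']!r}")
    # Writer-only fields (here: "blurb") are never copied, i.e. skipped.
    return result


written = {"uid": 7, "name": "Ann", "blurb": "hello"}
resolved = resolve_record(writer_schema, reader_schema, written)
print(resolved)  # {'name': 'Ann', 'uid': 7, 'email': ''}
```

A real reader performs the same matching while decoding the bytes, so skipped fields are stepped over in the input rather than materialized first.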
Avro Java API
• Specific - Classes generated
• Generic - Uses Record class + accessors
• Reflection - Generates the schema from an existing Java object
Avro RPC
• Transports - currently only HTTP
• Handshake (exchange/verify protocols)
• Asynchronous/Synchronous
Avro Tools
• BinaryFragmentToJsonTool: Converts an input file from Avro binary into JSON.
• DataFileGetSchemaTool: Reads a data file to get its schema.
• DataFileReadTool: Reads a data file and dumps it to JSON.
• DataFileWriteTool: Reads newline-delimited JSON records and writes an Avro data file.
• FromTextTool: Reads a text file into an Avro data file.
• IdlTool: Generates Avro JSON schemata from IDL format files.
• JsonToBinaryFragmentTool: Converts JSON data into the binary form.
• RpcReceiveTool: Receives one RPC call and responds.
• RpcSendTool: Sends a single RPC message.
• ToTextTool: Reads an Avro data file into a plain text file.
References
• Apache Avro - http://avro.apache.org
• Hadoop: The Definitive Guide by Tom White
• http://code.google.com/p/thrift-protobuf-compare/wiki/Benchmarking
• Mailing List:
Avro
Questions?
Thanks!