Hive - SerDe and LazySerde

Hive – SerDe and LazySerDe Part of Apache Hadoop Hive Project http://hadoop.apache.org/hive Data Infrastructure Team, Facebook Inc. (Slides by Zheng Shao)

description

This is a description of the SerDe layer in the Hadoop Hive project. LazySerDe is a particular implementation of the SerDe interface.

Transcript of Hive - SerDe and LazySerde

Page 2: Hive - SerDe and LazySerde

Where is SerDe?

[Diagram: where SerDe sits in the data path of a Hive job. Node labels: File on HDFS, Writable, Hierarchical Object, Hive Operator, Stream, User Script, Map Output File. Stage labels: Mapper, Reducer. Layer labels: FileFormat / Hadoop Serialization, SerDe, ObjectInspector.]

Examples shown in the figure:

▪ File on HDFS: text rows (imp 1.0 3 54, Imp 0.2 1 33, clk 2.2 8 212, Imp 0.7 2 22) or thrift_record<…> entries
▪ Writable: BytesWritable(\x3F\x64\x72\x00) or Text(‘imp 1.0 3 54’) // UTF8 encoded
▪ Hierarchical Object:
  ▪ Java Object: object of a Java class
  ▪ Standard Object: use ArrayList for struct and array, use HashMap for map
  ▪ LazyObject: lazily-deserialized

Page 3: Hive - SerDe and LazySerde

SerDe, ObjectInspector and TypeInfo

[Diagram: how SerDe, ObjectInspector and TypeInfo relate. The labels recoverable from the figure are listed below.]

▪ SerDe: deserialize (Writable to Hierarchical Object), serialize (Hierarchical Object to Writable), getOI
▪ ObjectInspector1: getType, getFieldOI, getStructField
▪ ObjectInspector2: getType, getMapValueOI, getMapValue
▪ ObjectInspector3: getType
▪ TypeInfo tree: a struct containing a map (string to string), an int, a list of struct (int, int), and a string

Examples shown in the figure:

▪ Writable: BytesWritable(\x3F\x64\x72\x00) or Text(‘a=av:b=bv 23 1:2=4:5 abcd’)
▪ Hierarchical Object as a Java class:
  class HO { HashMap<String, String> a; Integer b; List<ClassC> c; String d; }
  class ClassC { Integer a; Integer b; }
▪ Hierarchical Object as a standard object: List ( HashMap(“a” “av”, “b” “bv”), 23, List(List(1,null),List(2,4),List(5,null)), “abcd”)
▪ Result of getStructField on the first field: HashMap(“a” “av”, “b” “bv”), or the member HashMap<String, String> a
▪ Result of getMapValue: the String object “av”

Page 4: Hive - SerDe and LazySerde

LazySimpleSerDe components

[Diagram: the tree of LazyObjects built for the row byte[](‘a=av:b=bv 23 1:2=4:5 abcd’), with the ObjectInspector attached to each node. Every LazyObject refers to the same byte[] data.]

▪ LazyStruct, LazyStructOI(“ ”)
  ▪ LazyMap, LazyMapOI(“:”, “=”)
    ▪ LazyString keys and values, StandardStringOI
  ▪ LazyInteger, StandardIntegerOI
  ▪ LazyArray, LazyArrayOI(“:”)
    ▪ LazyStruct elements, LazyStructOI(“=”)
      ▪ LazyInteger fields, StandardIntegerOI
  ▪ LazyString, StandardStringOI

Figure labels: Hierarchical Object / LazyObject: one per SerDe instance. LazyObjectInspector: singleton.
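The separator nesting in the figure can be illustrated with a small eager parser. This is only a sketch under stated assumptions: the class `NestedRowSketch` and its methods are hypothetical names, not Hive classes, and the code parses everything up front instead of lazily. It uses " " for the top-level struct, ":" for map entries and array elements, and "=" for map key/value pairs and inner struct fields, as the ObjectInspector labels suggest.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical helper, not part of Hive: eagerly parses the slide's example row. */
final class NestedRowSketch {

    /** Splits s on a single separator character, keeping empty pieces. */
    static List<String> split(String s, char sep) {
        List<String> parts = new ArrayList<>();
        int start = 0;
        for (int i = 0; i <= s.length(); i++) {
            if (i == s.length() || s.charAt(i) == sep) {
                parts.add(s.substring(start, i));
                start = i + 1;
            }
        }
        return parts;
    }

    /** Map entries are separated by ':', key and value by '='. */
    static Map<String, String> parseMap(String s) {
        Map<String, String> map = new LinkedHashMap<>();
        for (String entry : split(s, ':')) {
            List<String> kv = split(entry, '=');
            map.put(kv.get(0), kv.size() > 1 ? kv.get(1) : null);
        }
        return map;
    }

    /** Array elements are separated by ':', struct fields by '='; a missing field becomes null. */
    static List<List<Integer>> parseArrayOfStructs(String s) {
        List<List<Integer>> result = new ArrayList<>();
        for (String element : split(s, ':')) {
            List<String> fields = split(element, '=');
            List<Integer> struct = new ArrayList<>();
            struct.add(Integer.valueOf(fields.get(0)));
            struct.add(fields.size() > 1 ? Integer.valueOf(fields.get(1)) : null);
            result.add(struct);
        }
        return result;
    }

    /** Top-level struct fields are separated by ' ': map, int, array of struct, string. */
    static List<Object> parseRow(String row) {
        List<String> f = split(row, ' ');
        List<Object> out = new ArrayList<>();
        out.add(parseMap(f.get(0)));
        out.add(Integer.valueOf(f.get(1)));
        out.add(parseArrayOfStructs(f.get(2)));
        out.add(f.get(3));
        return out;
    }

    public static void main(String[] args) {
        // prints [{a=av, b=bv}, 23, [[1, null], [2, 4], [5, null]], abcd]
        System.out.println(parseRow("a=av:b=bv 23 1:2=4:5 abcd"));
    }
}
```

The printed structure has the same shape as the standard-object example for this row: a map, an integer, a list of two-field structs with nulls for missing fields, and a string.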

Page 5: Hive - SerDe and LazySerde

LazyPrimitive

▪ LazyString/LazyInteger
  ▪ setAll(byte[] data, int start, int length)
    ▪ LazyString: parse the data and create a String object
    ▪ LazyInteger: parse the data and create an Integer object
  ▪ getObject() – returns the corresponding String/Integer object
▪ Future
  ▪ Replace String/Integer with Text/IntWritable
  ▪ The Text/IntWritable object is owned by the LazyString/LazyInteger object.
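The setAll/getObject contract for a lazy integer might look roughly like the sketch below. `LazyIntSketch` is a hypothetical class written for illustration, not Hive's LazyInteger. It assumes the number is stored as ASCII digits with an optional sign, and it reads the digit bytes directly without building an intermediate String. Returning null for malformed input is also an assumption of this sketch.

```java
import java.nio.charset.StandardCharsets;

/** Hypothetical stand-in for a lazy integer primitive; not Hive's LazyInteger. */
final class LazyIntSketch {
    private Integer value; // null when the bytes are not a valid int

    /** Parses ASCII digits in data[start, start + length) without creating a String. */
    void setAll(byte[] data, int start, int length) {
        value = null;
        if (length <= 0) return;
        int i = start;
        int end = start + length;
        boolean negative = data[i] == '-';
        if (negative || data[i] == '+') i++;
        if (i == end) return;                              // sign with no digits
        long acc = 0;
        for (; i < end; i++) {
            int digit = data[i] - '0';
            if (digit < 0 || digit > 9) return;            // not a digit
            acc = acc * 10 + digit;
            if (acc > (long) Integer.MAX_VALUE + 1) return; // out of int range
        }
        long signed = negative ? -acc : acc;
        if (signed > Integer.MAX_VALUE) return;
        value = (int) signed;
    }

    /** Returns the Integer parsed by the last setAll call, or null. */
    Integer getObject() {
        return value;
    }

    public static void main(String[] args) {
        byte[] row = "a=av:b=bv 23 1:2=4:5 abcd".getBytes(StandardCharsets.US_ASCII);
        LazyIntSketch lazy = new LazyIntSketch();
        lazy.setAll(row, 10, 2);               // the two bytes "23"
        System.out.println(lazy.getObject());  // prints 23
    }
}
```

The same object can be pointed at another slice of the row with a further setAll call, so one instance can serve many rows.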

Page 6: Hive - SerDe and LazySerde

LazyNonPrimitive

▪ LazyStruct/LazyArray/LazyMap
  ▪ setAll(byte[] data, int start, int length)
    ▪ Remember data, start and length, and set parsed to false.
  ▪ getStructField/getArrayElement/getMapValue
    ▪ If not parsed yet, parse the bytes and remember the starting position of each field/element/key/value
    ▪ For Struct/Array, do setAll on the corresponding LazyObject and return it
    ▪ For Map, search for the serialized key and return the corresponding value (after doing a setAll on the value).
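A minimal sketch of the deferred parsing described for the struct case follows. `LazyStructSketch`, `getField` and `parseCount` are hypothetical names, not Hive's API. To stay short, the sketch returns each field as a String rather than calling setAll on a child LazyObject, which is what the slide describes.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch of a lazy struct over a shared byte[]; not Hive's LazyStruct. */
final class LazyStructSketch {
    private final byte separator;
    private byte[] data;
    private int start;
    private int length;
    private boolean parsed;
    private final List<Integer> fieldStarts = new ArrayList<>(); // plus one end sentinel
    int parseCount = 0; // demonstration only: how often the bytes were scanned

    LazyStructSketch(char separator) {
        this.separator = (byte) separator;
    }

    /** Only remembers the slice and marks it unparsed; nothing is scanned here. */
    void setAll(byte[] data, int start, int length) {
        this.data = data;
        this.start = start;
        this.length = length;
        this.parsed = false;
    }

    /** Scans the slice once and records where each field begins. */
    private void parse() {
        fieldStarts.clear();
        fieldStarts.add(start);
        int end = start + length;
        for (int i = start; i < end; i++) {
            if (data[i] == separator) fieldStarts.add(i + 1);
        }
        fieldStarts.add(end + 1); // sentinel, as if a separator followed the last field
        parsed = true;
        parseCount++;
    }

    /** Returns field i as a String, or null if the row has fewer fields. */
    String getField(int i) {
        if (!parsed) parse();
        if (i < 0 || i + 1 >= fieldStarts.size()) return null;
        int from = fieldStarts.get(i);
        int to = fieldStarts.get(i + 1) - 1; // stop before the separator
        return new String(data, from, to - from, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        byte[] row = "a=av:b=bv 23 1:2=4:5 abcd".getBytes(StandardCharsets.UTF_8);
        LazyStructSketch struct = new LazyStructSketch(' ');
        struct.setAll(row, 0, row.length);
        System.out.println(struct.getField(3)); // prints abcd
        System.out.println(struct.getField(1)); // prints 23
    }
}
```

In this sketch the row is scanned for separators only on the first field access after each setAll. Later accesses reuse the recorded positions, and a row whose fields are never read is never scanned.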

Page 7: Hive - SerDe and LazySerde

Why another SerDe?

▪ Functionality:
  ▪ MetadataTypedColumnSetSerDe can only deal with String columns
  ▪ DynamicSerDe can deal with all primitive columns and primitive lists/maps, but it does not fully support nested types yet.
▪ Efficiency:
  ▪ Both MetadataTypedColumnSetSerDe and DynamicSerDe use String.split() and are not efficient for long rows

Page 8: Hive - SerDe and LazySerde

Features of LazySimpleSerDe

▪ Functionality:
  ▪ Fully compatible with MetadataTypedColumnSetSerDe and DynamicSerDe/TCTLSeparatedProtocol
  ▪ Fully supports all nested types (map keys must be primitive)
▪ Efficiency:
  ▪ Fully supports lazy deserialization: a field is deserialized (and its objects created) only when asked for.
  ▪ Reuses multiple levels of LazyObjects.
  ▪ Reads numbers without UTF-8 decoding
  ▪ (TODO) Fully reuse objects: IntWritable for Integer, Text for String
  ▪ (TODO) Write numbers without UTF-8 encoding

Page 9: Hive - SerDe and LazySerde

Profiling result of a mapper

▪ 17%: TrackedRecordReader (should include InputFileFormat and decompression)
▪ 22%: Operator.close
▪ |-12%: DynamicSerDe.serialize (NOTE: This includes UTF-8 encoding)
▪ |- 4%: mapOutputBuffer.collect (should include compression and OutputFileFormat)
▪ 50%: Operator.forward
▪ |-18%: Text.decode (from LazySerDe)
▪ | |- 7%: CharacterSet.decode() (UTF-8 decoding)
▪ | |- 5%: toString() (where we create the string object)
▪ |- 3%: LazyStruct.parse (the code that searches for separators in the row)
▪ |- 3%: Arrays.asList() (from UnionStructOI.getStructFieldData)
▪ |- 8%: GroupByOperator.processHashAggr
▪ |- 3%: HashMap.get() in GroupByOperator

▪ * Performance data from Rodrigo Schmidt

Page 10: Hive - SerDe and LazySerde

TypeInfo String specification

▪ Why not Thrift?
  ▪ Hard to parse
▪ Simple Syntax
  ▪ Type: PrimitiveType | MapType | ArrayType | StructType
  ▪ PrimitiveType: int | bigint | tinyint | smallint | double | string
  ▪ MapType: map<Type, Type>
  ▪ ArrayType: array<Type>
  ▪ StructType: struct< [Name : Type]+ >
  ▪ Example: array<map<string,struct<a:int,b:array<string>,c:double>>>
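Because the grammar is this small, a recursive-descent parser fits in a few dozen lines. The sketch below is an illustration, not Hive's own parser; `TypeStringSketch` and `Node` are hypothetical names. It assumes that struct fields are separated by commas (as in the example, although the StructType rule does not show the comma) and that spaces can be ignored.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical recursive-descent parser for the slide's type grammar; not a Hive class. */
final class TypeStringSketch {
    static final List<String> PRIMITIVES =
        Arrays.asList("int", "bigint", "tinyint", "smallint", "double", "string");

    /** A parsed type: category plus child types (and field names for structs). */
    static final class Node {
        final String category; // a primitive name, "map", "array" or "struct"
        final List<String> fieldNames = new ArrayList<>();
        final List<Node> children = new ArrayList<>();

        Node(String category) { this.category = category; }

        /** Renders the node back into the type-string syntax. */
        @Override public String toString() {
            if (children.isEmpty()) return category;
            StringBuilder sb = new StringBuilder(category).append('<');
            for (int i = 0; i < children.size(); i++) {
                if (i > 0) sb.append(',');
                if (!fieldNames.isEmpty()) sb.append(fieldNames.get(i)).append(':');
                sb.append(children.get(i));
            }
            return sb.append('>').toString();
        }
    }

    private final String s;
    private int pos = 0;

    private TypeStringSketch(String s) { this.s = s; }

    /** Parses a full type string; throws IllegalArgumentException on bad input. */
    static Node parse(String typeString) {
        TypeStringSketch p = new TypeStringSketch(typeString.replace(" ", ""));
        Node n = p.type();
        if (p.pos != p.s.length()) throw new IllegalArgumentException("trailing input at " + p.pos);
        return n;
    }

    private String word() {
        int begin = pos;
        while (pos < s.length()
                && (Character.isLetterOrDigit(s.charAt(pos)) || s.charAt(pos) == '_')) pos++;
        if (begin == pos) throw new IllegalArgumentException("expected a name at " + pos);
        return s.substring(begin, pos);
    }

    private boolean tryConsume(char c) {
        if (pos < s.length() && s.charAt(pos) == c) { pos++; return true; }
        return false;
    }

    private void expect(char c) {
        if (!tryConsume(c)) throw new IllegalArgumentException("expected '" + c + "' at " + pos);
    }

    /** Type: PrimitiveType | MapType | ArrayType | StructType */
    private Node type() {
        String name = word();
        Node node = new Node(name);
        switch (name) {
            case "map":
                expect('<'); node.children.add(type());
                expect(','); node.children.add(type());
                expect('>');
                break;
            case "array":
                expect('<'); node.children.add(type()); expect('>');
                break;
            case "struct":
                expect('<');
                do {
                    node.fieldNames.add(word());
                    expect(':');
                    node.children.add(type());
                } while (tryConsume(','));
                expect('>');
                break;
            default:
                if (!PRIMITIVES.contains(name))
                    throw new IllegalArgumentException("unknown type: " + name);
        }
        return node;
    }
}
```

Calling `TypeStringSketch.parse("array<map<string,struct<a:int,b:array<string>,c:double>>>")` yields a tree whose `toString()` reproduces the input, while an unknown primitive name is rejected with an IllegalArgumentException.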

Page 11: Hive - SerDe and LazySerde

Future of SerDe

▪ HIVE-337 LazySimpleSerDe should support multi-level nested array, map, struct types (Done)
▪ HIVE-136 SerDe should escape some special characters
▪ HIVE-266 Improve SerDe performance by using Text instead of String
▪ HIVE-352 Make Hive support column based storage (Yongqiang He)
▪ HIVE-358 Short-circuiting serialization
▪ Binary-format Lazy SerDe