Hive - SerDe and LazySerDe
Part of Apache Hadoop Hive Project
http://hadoop.apache.org/hive
Data Infrastructure Team, Facebook Inc. (Slides by Zheng Shao)
Where is SerDe?
[Diagram: where SerDe sits in a Hive job. Boxes: File on HDFS, Stream, Writable, Hierarchical Object, Hive Operator, Map Output File, User Script. Stages: Mapper, Reducer. Layers: FileFormat / Hadoop Serialization, SerDe.]
ObjectInspector
▪ Example data: imp 1.0 3 54 / Imp 0.2 1 33 / clk 2.2 8 212 / Imp 0.7 2 22, or thrift_record<…>
▪ Writable: BytesWritable(\x3F\x64\x72\x00) or Text('imp 1.0 3 54') // UTF8 encoded
▪ Java Object: Object of a Java Class
▪ Standard Object: Use ArrayList for struct and array, use HashMap for map
▪ LazyObject: Lazily-deserialized
▪ ObjectInspector1: getType, getFieldOI, getStructField
▪ ObjectInspector2: getType, getMapValueOI, getMapValue
▪ SerDe: deserialize, serialize, getOI
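The point of the ObjectInspector is that an operator can read fields without knowing how the row is stored. The sketch below illustrates that idea only. Every name in it (`StructInspectorSketch`, `AdEventSketch`, and so on) is hypothetical, and the single-method interface is a simplification, not Hive's actual ObjectInspector API.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical, simplified stand-in for a struct ObjectInspector.
// Hive's real interfaces have more methods and different signatures.
interface StructInspectorSketch {
    Object getStructField(Object row, int fieldIndex);
}

// Row stored as a "standard object": a List with one slot per field.
final class StandardStructInspectorSketch implements StructInspectorSketch {
    @Override
    public Object getStructField(Object row, int fieldIndex) {
        return ((List<?>) row).get(fieldIndex);
    }
}

// Row stored as an object of a Java class (made up for this example).
final class AdEventSketch {
    final String type;
    final double cost;

    AdEventSketch(String type, double cost) {
        this.type = type;
        this.cost = cost;
    }
}

// Inspector that knows how to pull fields out of AdEventSketch.
final class AdEventInspectorSketch implements StructInspectorSketch {
    @Override
    public Object getStructField(Object row, int fieldIndex) {
        AdEventSketch e = (AdEventSketch) row;
        switch (fieldIndex) {
            case 0: return e.type;
            case 1: return e.cost;
            default: throw new IndexOutOfBoundsException("field " + fieldIndex);
        }
    }
}

final class ObjectInspectorDemo {
    // An "operator" that only knows the inspector, not the row layout.
    static Object firstField(Object row, StructInspectorSketch oi) {
        return oi.getStructField(row, 0);
    }

    public static void main(String[] args) {
        Object standardRow = new ArrayList<Object>(Arrays.asList("imp", 1.0));
        Object javaRow = new AdEventSketch("imp", 1.0);
        System.out.println(firstField(standardRow, new StandardStructInspectorSketch()));
        System.out.println(firstField(javaRow, new AdEventInspectorSketch()));
    }
}
```

Both calls to `firstField` run the same operator code; only the inspector passed in changes with the row representation.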
SerDe, ObjectInspector and TypeInfo
▪ Writable: BytesWritable(\x3F\x64\x72\x00) or Text('a=av:b=bv 23 1:2=4:5 abcd')
▪ Hierarchical Object, as a Java object:
class HO { HashMap<String, String> a; Integer b; List<ClassC> c; String d; }
class ClassC { Integer a; Integer b; }
▪ Hierarchical Object, as a standard object:
List(HashMap("a" → "av", "b" → "bv"), 23, List(List(1, null), List(2, 4), List(5, null)), "abcd")
▪ TypeInfo: struct of (map of string to string, int, list of struct of (int, int), string)
▪ Navigation example: the row → field a, HashMap("a" → "av", "b" → "bv") → map value "av", a String Object (ObjectInspector3, getType)
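As a rough illustration of how the example row maps to the standard object, here is a minimal sketch. It hard-codes the example's schema and separators (space between fields, ':' between map entries and list elements, '=' inside a map entry or struct). It uses String.split() purely for brevity, and the class name `StandardObjectSketch` is made up for this example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class StandardObjectSketch {
    // Eagerly turns the example row into a "standard object":
    // Lists for struct and array, a HashMap for map.
    static List<Object> deserialize(String row) {
        String[] fields = row.split(" ");

        // Field a: map<string,string>; entries split by ':', key/value by '='.
        Map<String, String> a = new HashMap<>();
        for (String entry : fields[0].split(":")) {
            String[] kv = entry.split("=", 2);
            a.put(kv[0], kv.length > 1 ? kv[1] : null);
        }

        // Field b: int.
        Integer b = Integer.valueOf(fields[1]);

        // Field c: list of struct<int,int>; elements split by ':', members by '='.
        // A missing second member becomes null.
        List<List<Integer>> c = new ArrayList<>();
        for (String element : fields[2].split(":")) {
            String[] members = element.split("=");
            Integer first = Integer.valueOf(members[0]);
            Integer second = members.length > 1 ? Integer.valueOf(members[1]) : null;
            c.add(Arrays.asList(first, second));
        }

        // Field d: string.
        String d = fields[3];

        List<Object> result = new ArrayList<>();
        result.add(a);
        result.add(b);
        result.add(c);
        result.add(d);
        return result;
    }

    public static void main(String[] args) {
        System.out.println(deserialize("a=av:b=bv 23 1:2=4:5 abcd"));
    }
}
```

This version creates every object up front, which is the cost that lazy deserialization tries to avoid.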
LazySimpleSerDe components
▪ Input: byte[]('a=av:b=bv 23 1:2=4:5 abcd')
▪ LazyObject tree: LazyStruct containing a LazyMap (LazyString keys and values), a LazyInteger, a LazyArray of LazyStruct (LazyInteger, LazyInteger), and a LazyString
▪ ObjectInspectors: LazyStructOI(" ") for the row, LazyMapOI(":", "="), StandardIntegerOI, LazyArrayOI(":"), LazyStructOI("=") for the array elements, StandardStringOI
▪ Diagram labels: byte[] data; Hierarchical Object / LazyObject; One Per SerDe instance; LazyObjectInspector; Singleton
LazyPrimitive
▪ LazyString/LazyInteger
▪ setAll(byte[] data, int start, int length)
▪ LazyString: parse the data and create a String object
▪ LazyInteger: parse the data and create an Integer object
▪ getObject() – returns the corresponding String/Integer object
▪ Future
▪ Replace String/Integer with Text/IntWritable
▪ The Text/IntWritable object is owned by the LazyString/LazyInteger object.
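A minimal sketch of the setAll/getObject contract for an integer, assuming the number is stored as ASCII decimal digits. `LazyIntegerSketch` is a hypothetical simplification, not Hive's class; returning null for malformed or out-of-range input is a choice made for this example.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical simplification of a lazy primitive; not Hive's actual class.
final class LazyIntegerSketch {
    private Integer value;

    // Parses ASCII decimal digits straight from the byte range, so no
    // String is created and no UTF-8 decoding is needed.
    // A malformed, empty, or out-of-range input leaves the value null.
    void setAll(byte[] data, int start, int length) {
        value = null;
        int i = start;
        int end = start + length;
        boolean negative = false;
        if (i < end && data[i] == '-') {
            negative = true;
            i++;
        }
        if (i == end) {
            return; // empty range, or a lone minus sign
        }
        long result = 0;
        for (; i < end; i++) {
            int digit = data[i] - '0';
            if (digit < 0 || digit > 9) {
                return; // not a digit
            }
            result = result * 10 + digit;
            if (result > Integer.MAX_VALUE + 1L) {
                return; // magnitude too large for an int
            }
        }
        if (negative) {
            result = -result;
        }
        if (result > Integer.MAX_VALUE) {
            return; // +2147483648 does not fit
        }
        value = (int) result;
    }

    // Returns the parsed Integer, or null if setAll could not parse one.
    Integer getObject() {
        return value;
    }

    public static void main(String[] args) {
        byte[] row = "a=av:b=bv 23 1:2=4:5 abcd".getBytes(StandardCharsets.US_ASCII);
        LazyIntegerSketch n = new LazyIntegerSketch();
        n.setAll(row, 10, 2); // the bytes of "23" inside the row
        System.out.println(n.getObject());
    }
}
```

The same instance can be pointed at a different byte range with another setAll call, which is what makes reuse across rows possible.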
LazyNonPrimitive
▪ LazyStruct/LazyArray/LazyMap
▪ setAll(byte[] data, int start, int length)
▪ Remember data, start and length, and set parsed to false.
▪ getStructField/getArrayElement/getMapValue
▪ If not parsed yet, parse the bytes and remember the starting position of each field/element/key/value
▪ For Struct/Array, do setAll on the corresponding LazyObject and return it
▪ For Map, search for the serialized key and return the corresponding value (after doing a setAll on the value).
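The struct case can be sketched as follows. `LazyStructSketch` is a hypothetical simplification: it returns each field as a String for brevity, where a fuller version would call setAll on a reusable child LazyObject. The `parseCount` counter exists only so the example can show that scanning is deferred and done once.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical simplification of a lazy struct; not Hive's actual class.
final class LazyStructSketch {
    private final byte separator;
    private final int[] fieldStart; // inclusive start of each field
    private final int[] fieldEnd;   // exclusive end of each field
    private int fieldsFound;
    private byte[] data;
    private int start;
    private int length;
    private boolean parsed;
    int parseCount; // exposed only to demonstrate laziness

    LazyStructSketch(byte separator, int fieldCount) {
        this.separator = separator;
        this.fieldStart = new int[fieldCount];
        this.fieldEnd = new int[fieldCount];
    }

    // Only remembers the byte range; no scanning happens here.
    void setAll(byte[] data, int start, int length) {
        this.data = data;
        this.start = start;
        this.length = length;
        this.parsed = false;
    }

    // One pass over the bytes to record where each field begins and ends.
    // Anything after the last declared field is ignored.
    private void parse() {
        int end = start + length;
        int found = 0;
        int fieldBegin = start;
        for (int i = start; i <= end && found < fieldStart.length; i++) {
            if (i == end || data[i] == separator) {
                fieldStart[found] = fieldBegin;
                fieldEnd[found] = i;
                found++;
                fieldBegin = i + 1;
            }
        }
        fieldsFound = found;
        parsed = true;
        parseCount++;
    }

    // Parses on first access, then decodes only the requested field.
    // Returns null for a field that is missing from the row.
    String getField(int index) {
        if (!parsed) {
            parse();
        }
        if (index >= fieldsFound) {
            return null;
        }
        int len = fieldEnd[index] - fieldStart[index];
        return new String(data, fieldStart[index], len, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        byte[] row = "a=av:b=bv 23 1:2=4:5 abcd".getBytes(StandardCharsets.UTF_8);
        LazyStructSketch s = new LazyStructSketch((byte) ' ', 4);
        s.setAll(row, 0, row.length);
        System.out.println(s.getField(1));
        System.out.println(s.getField(3));
    }
}
```

Fields that are never requested are never decoded, so a query that reads one column of a wide row pays only for the separator scan plus that one field.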
Why another SerDe?
▪ Functionality:
▪ MetadataTypedColumnSetSerDe can only deal with String columns
▪ DynamicSerDe can deal with all primitive columns and primitive lists/maps, but it does not fully support nested types yet.
▪ Efficiency:
▪ Both MetadataTypedColumnSetSerDe and DynamicSerDe use String.split() and are not efficient for long rows
Features of LazySimpleSerDe
▪ Functionality:
▪ Fully compatible with MetadataTypedColumnSetSerDe and DynamicSerDe/TCTLSeparatedProtocol
▪ Fully support all nested types (Map Key must be primitive)
▪ Efficiency:
▪ Fully support lazy deserialization - only deserialize the field (and create Objects) when asked.
▪ Reuse multiple-levels of LazyObjects.
▪ Read numbers without UTF-8 decoding
▪ (TODO) Fully reuse objects - IntWritable for Integer, Text for String
▪ (TODO) Write numbers without UTF-8 encoding
Profiling result of a mapper
▪ 17%: TrackedRecordReader (should include InputFileFormat and decompression)
▪ 22%: Operator.close
▪ |-12%: DynamicSerDe.serialize (NOTE: This includes UTF-8 encoding)
▪ |- 4%: mapOutputBuffer.collect (should include compression and OutputFileFormat)
▪ 50%: Operator.forward
▪ |-18%: Text.decode (from LazySerDe)
▪ | |- 7%: CharacterSet.decode() (UTF-8 decoding)
▪ | |- 5%: toString() (where we create the string object)
▪ |- 3%: LazyStruct.parse (the code that searches for separators in the row)
▪ |- 3%: Arrays.asList() (from UnionStructOI.getStructFieldData)
▪ |- 8%: GroupByOperator.processHashAggr
▪ |- 3%: HashMap.get() in GroupByOperator
▪ * Performance Data from Rodrigo Schmidt
TypeInfo String specification
▪ Why not Thrift?
▪ Hard to parse
▪ Simple Syntax
▪ Type: PrimitiveType | MapType | ArrayType | StructType
▪ PrimitiveType: int | bigint | tinyint | smallint | double | string
▪ MapType: map<Type, Type>
▪ ArrayType: array<Type>
▪ StructType: struct< [Name : Type]+ >
▪ Example: array<map<string,struct<a:int,b:array<string>,c:double>>>
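The grammar is small enough for a recursive-descent parser. The sketch below is one possible reading of it, not Hive's parser. `TypeNodeSketch` and `TypeInfoParserSketch` are hypothetical names. The comma between struct fields is an assumption taken from the example rather than from the grammar line, and whitespace is not handled.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical tree node for a parsed type.
final class TypeNodeSketch {
    final String kind; // a primitive name, or "map", "array", "struct"
    final List<String> fieldNames = new ArrayList<>();       // struct only
    final List<TypeNodeSketch> children = new ArrayList<>(); // nested types

    TypeNodeSketch(String kind) {
        this.kind = kind;
    }
}

// Hypothetical recursive-descent parser for the type string grammar.
final class TypeInfoParserSketch {
    private static final List<String> PRIMITIVES =
            Arrays.asList("int", "bigint", "tinyint", "smallint", "double", "string");

    private final String text;
    private int pos;

    private TypeInfoParserSketch(String text) {
        this.text = text;
    }

    // Parses a whole type string; throws IllegalArgumentException on bad input.
    static TypeNodeSketch parse(String text) {
        TypeInfoParserSketch p = new TypeInfoParserSketch(text);
        TypeNodeSketch root = p.parseType();
        if (p.pos != text.length()) {
            throw new IllegalArgumentException("trailing input at " + p.pos);
        }
        return root;
    }

    // Type: PrimitiveType | MapType | ArrayType | StructType
    private TypeNodeSketch parseType() {
        String word = readWord();
        TypeNodeSketch node = new TypeNodeSketch(word);
        switch (word) {
            case "map": // map<Type, Type>
                expect('<');
                node.children.add(parseType());
                expect(',');
                node.children.add(parseType());
                expect('>');
                break;
            case "array": // array<Type>
                expect('<');
                node.children.add(parseType());
                expect('>');
                break;
            case "struct": // struct<Name:Type, ...>, comma assumed from the example
                expect('<');
                do {
                    node.fieldNames.add(readWord());
                    expect(':');
                    node.children.add(parseType());
                } while (tryConsume(','));
                expect('>');
                break;
            default:
                if (!PRIMITIVES.contains(word)) {
                    throw new IllegalArgumentException("unknown type '" + word + "'");
                }
        }
        return node;
    }

    private String readWord() {
        int begin = pos;
        while (pos < text.length() && Character.isLetterOrDigit(text.charAt(pos))) {
            pos++;
        }
        if (begin == pos) {
            throw new IllegalArgumentException("expected a name at " + pos);
        }
        return text.substring(begin, pos);
    }

    private boolean tryConsume(char c) {
        if (pos < text.length() && text.charAt(pos) == c) {
            pos++;
            return true;
        }
        return false;
    }

    private void expect(char c) {
        if (!tryConsume(c)) {
            throw new IllegalArgumentException("expected '" + c + "' at " + pos);
        }
    }

    public static void main(String[] args) {
        TypeNodeSketch t = parse("array<map<string,struct<a:int,b:array<string>,c:double>>>");
        System.out.println(t.kind + " of " + t.children.get(0).kind);
    }
}
```

Each grammar rule maps to one switch branch, which is the "simple syntax" argument for this format: no generated code or schema compiler is needed to read a type string.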
Future of SerDe
▪ HIVE-337 LazySimpleSerDe should support multi-level nested array, map, struct types (Done)
▪ HIVE-136 SerDe should escape some special characters
▪ HIVE-266 Improve SerDe performance by using Text instead of String
▪ HIVE-352 Make Hive support column based storage (Yongqiang He)
▪ HIVE-358 Short-circuiting serialization
▪ Binary-format Lazy SerDe