How to use Parquet as a basis for ETL and analytics
How to use Parquet as a basis for ETL and analytics
Julien Le Dem @J_
Analytics Data Pipeline tech lead, Data Platform
@ApacheParquet
Outline
- Instrumentation and data collection
- Storing data efficiently for analysis
- Openness and interoperability
Instrumentation and data collection
Typical data flow
[Diagram, built up over four slides: instrumented services apply mutations to mutable serving stores, which serve happy users. Data collection taps this flow in two ways: log collection into a streaming log (Kafka, Scribe, Chukwa, ...), which feeds streaming analysis and periodic consolidation, and periodic snapshots pulled from the serving stores together with their schema. Both paths land in storage (HDFS) in a query-efficient format: Parquet. Analysis spans ad-hoc queries (Impala, Hive, Drill, ...), automated dashboards, and batch computation (graph, machine learning, ...) over the stored data, plus streaming computation (Storm, Samza, Spark Streaming, ...) over the log. The result: a happy data scientist.]
Storing data for analysis
Producing a lot of data is easy
Producing a lot of derived data is even easier.
Solution: Compress all the things!
Scanning a lot of data is easy
[Progress bar: 1% completed]
... but not necessarily fast. Waiting is not productive. We want faster turnaround. Compression, but not at the cost of reading speed.
Interoperability is not that easy
We need a storage format that is interoperable with all the tools we use and keeps our options open for the next big thing.
Enter Apache Parquet
Parquet design goals
- Interoperability
- Space efficiency
- Query efficiency
Efficiency
Columnar storage
[Diagram: a logical table (columns a, b, c; rows 1 to 5) can be stored in a row layout (a1 b1 c1, a2 b2 c2, ..., a5 b5 c5) or in a column layout (a1 a2 a3 a4 a5, b1 b2 b3 b4 b5, c1 c2 c3 c4 c5), where each column is then encoded into chunks. Parquet combines the column layout with support for nested schemas.]
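To make the two layouts concrete, here is a minimal sketch (plain Java, purely illustrative, not Parquet's actual storage code) of the same five rows stored both ways:

public class LayoutSketch {
  public static void main(String[] args) {
    // Row layout: all values of one record are adjacent.
    String[] rowLayout = {
      "a1", "b1", "c1", "a2", "b2", "c2", "a3", "b3", "c3",
      "a4", "b4", "c4", "a5", "b5", "c5"
    };
    // Column layout: all values of one column are adjacent, so a scan
    // of column b reads one contiguous range, and similar values sit
    // next to each other, which makes encoding and compression effective.
    String[] columnLayout = {
      "a1", "a2", "a3", "a4", "a5",
      "b1", "b2", "b3", "b4", "b5",
      "c1", "c2", "c3", "c4", "c5"
    };
    System.out.println(rowLayout.length == columnLayout.length); // same data
  }
}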
Parquet nested representation
Schema (borrowed from the Google Dremel paper):
Document
|- DocId
|- Links
|  |- Backward
|  |- Forward
|- Name
   |- Language
   |  |- Code
   |  |- Country
   |- Url
Columns: docid, links.backward, links.forward, name.language.code, name.language.country, name.url
https://blog.twitter.com/2013/dremel-made-simple-with-parquet
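Written out in Parquet's schema syntax, that example looks roughly like this (a sketch following the Dremel paper's example; the UTF8 annotations are an assumption about how the string fields map):

message Document {
  required int64 DocId;
  optional group Links {
    repeated int64 Backward;
    repeated int64 Forward;
  }
  repeated group Name {
    repeated group Language {
      required binary Code (UTF8);
      optional binary Country (UTF8);
    }
    optional binary Url (UTF8);
  }
}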
Statistics for filter and query optimization
- Vertical partitioning (projection push down)
- Horizontal partitioning (predicate push down)
- Read only the data you need!
[Diagram: starting from the full a/b/c table, vertical partitioning keeps only the needed columns, horizontal partitioning keeps only the needed rows; combined, a query reads just the intersection.]
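The talk shows the idea as a diagram; as a sketch of what predicate push down can look like in code with parquet-mr's filter2 API (not shown in the talk; the integer column x is made up, and package names follow the pre-Apache "parquet.*" naming used elsewhere in this deck):

import org.apache.hadoop.mapreduce.Job;
import parquet.filter2.predicate.FilterApi;
import parquet.filter2.predicate.FilterPredicate;
import parquet.filter2.predicate.Operators.IntColumn;
import parquet.hadoop.ParquetInputFormat;

public class PushDownSketch {
  public static void configure(Job job) {
    // Keep only rows with 70 <= x <= 100. Row groups whose column
    // statistics prove no row can match are skipped without being read.
    IntColumn x = FilterApi.intColumn("x");
    FilterPredicate predicate = FilterApi.and(
        FilterApi.gtEq(x, 70),
        FilterApi.ltEq(x, 100));
    ParquetInputFormat.setFilterPredicate(job.getConfiguration(), predicate);
    // Projection push down happens through the requested read schema:
    // columns absent from it are never read off disk.
  }
}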
Properties of efficient encodings
- Minimize CPU pipeline bubbles: highly predictable branching, reduced data dependencies
- Minimize CPU cache misses: reduce the size of the working set
The right encoding for the right job
- Delta encodings: for sorted datasets or signals where the variation is less important than the absolute value (timestamps, auto-generated ids, metrics, ...). Focuses on avoiding branching. (See the sketch after this list.)
- Prefix coding (delta encoding for strings): when dictionary encoding does not work.
- Dictionary encoding: small (60K) set of values (server IP, experiment id, ...)
- Run Length Encoding: repetitive data.
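To see why delta encoding pays off on sorted data, here is a minimal illustration (plain Java, not Parquet's actual encoder): consecutive timestamps differ by small values that need far fewer bits than the absolute values.

import java.util.Arrays;

public class DeltaSketch {
  public static void main(String[] args) {
    // Sorted millisecond timestamps: 64-bit values, but nearly identical.
    long[] timestamps = {1400000000000L, 1400000000123L,
                         1400000000456L, 1400000000789L};
    long[] deltas = new long[timestamps.length - 1];
    for (int i = 1; i < timestamps.length; i++) {
      deltas[i - 1] = timestamps[i] - timestamps[i - 1];
    }
    // Store the first value once, then the deltas: 123, 333, 333.
    // Each delta fits in ~9 bits instead of 64, and the decode loop
    // is a simple running sum with no branches.
    System.out.println(timestamps[0] + " + " + Arrays.toString(deltas));
  }
}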
Interoperability
Interoperable
[Diagram of the stack: Parquet is model agnostic and language agnostic. Object models (Avro, Thrift, Protocol Buffer, Pig Tuple, Hive SerDe, ...) plug in through converters (parquet-avro, parquet-thrift, parquet-proto, parquet-pig, parquet-hive) that handle assembly/striping on top of the column encoding of the Parquet file format. Query execution engines such as Impala (C++) read the same files as the Java object models.]
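As an illustration of the converter layer, here is a sketch writing Avro generic records through parquet-avro (the Point schema is made up; the two-argument constructor matches the 1.x-era API, later superseded by a builder):

import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericRecord;
import org.apache.hadoop.fs.Path;
import parquet.avro.AvroParquetWriter;

public class ConverterSketch {
  public static void main(String[] args) throws Exception {
    Schema schema = new Schema.Parser().parse(
        "{\"type\":\"record\",\"name\":\"Point\",\"fields\":["
        + "{\"name\":\"x\",\"type\":\"int\"},{\"name\":\"y\",\"type\":\"int\"}]}");
    GenericRecord point = new GenericData.Record(schema);
    point.put("x", 1);
    point.put("y", 2);
    // The parquet-avro converter strips the Avro object model into
    // Parquet's columnar format; no Avro binary encoding is written.
    AvroParquetWriter<GenericRecord> writer =
        new AvroParquetWriter<GenericRecord>(new Path("/tmp/points.parquet"), schema);
    writer.write(point);
    writer.close();
  }
}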
Frameworks and libraries integrated with Parquet
- Query engines: Hive, Impala, HAWQ, IBM Big SQL, Drill, Tajo, Pig, Presto
- Frameworks: Spark, MapReduce, Cascading, Crunch, Scalding, Kite
- Data models: Avro, Thrift, Protocol Buffers, POJOs
Schema management
Schema in Hadoop
Hadoop does not define a standard notion of schema, but there are many available:
- Avro
- Thrift
- Protocol Buffers
- Pig
- Hive
- ...
And they are all different.
What they define
- Schema: the structure of a record and constraints on the types
- Row oriented binary format: how records are represented, one at a time
What they *do not* define
Column oriented binary format: Parquet reuses the schema definitions and provides a common column oriented binary format
Example: address book
[Diagram: an AddressBook contains a repeated field "addresses"; each Address has the fields street, city, state, zip, and comment.]
Protocol Buffers

message AddressBook {
  repeated group addresses = 1 {
    required string street = 2;
    required string city = 3;
    required string state = 4;
    required string zip = 5;
    optional string comment = 6;
  }
}

- Allows recursive definitions
- Types: group or primitive
- The binary format refers to field ids only => renaming fields does not impact the binary format
- Requires installing a native compiler separate from your build

Fields have ids and can be optional, required or repeated.
Lists are repeated fields.
Thrift

struct AddressBook {
  1: required list<Address> addresses;
}
struct Address {
  1: required string street;
  2: required string city;
  3: required string state;
  4: required string zip;
  5: optional string comment;
}

- No recursive definitions
- Types: struct, map, list, set, union or primitive
- The binary format refers to field ids only => renaming fields does not impact the binary format
- Requires installing a native compiler separately from the build

Fields have ids and can be optional or required.
Explicit collection types.
Avro

{
  "type": "record",
  "name": "AddressBook",
  "fields": [{
    "name": "addresses",
    "type": {
      "type": "array",
      "items": {
        "type": "record",
        "name": "Address",
        "fields": [
          {"name": "street", "type": "string"},
          {"name": "city", "type": "string"},
          {"name": "state", "type": "string"},
          {"name": "zip", "type": "string"},
          {"name": "comment", "type": ["null", "string"]}
        ]
      }
    }
  }]
}

Explicit collection types.

- Allows recursive definitions
- Types: records, arrays, maps, unions or primitive
- The binary format requires knowing the write-time schema
  ➡ more compact but not self descriptive
  ➡ renaming fields does not impact the binary format
- Code generator in Java (well integrated into the build)

null is a type; an optional field is a union with null.
Write to Parquet
Write to Parquet with Map Reduce

Protocol Buffers:
job.setOutputFormatClass(ProtoParquetOutputFormat.class);
ProtoParquetOutputFormat.setProtobufClass(job, AddressBook.class);

Thrift:
job.setOutputFormatClass(ParquetThriftOutputFormat.class);
ParquetThriftOutputFormat.setThriftClass(job, AddressBook.class);

Avro:
job.setOutputFormatClass(AvroParquetOutputFormat.class);
AvroParquetOutputFormat.setSchema(job, AddressBook.SCHEMA$);
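To put the Avro variant in context, a minimal job driver might look like this (a sketch: the input format, mapper setup, and the Avro-generated AddressBook class are assumptions, not shown in the talk):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import parquet.avro.AvroParquetOutputFormat;

public class WriteAddressBooks {
  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "write-address-books");
    job.setJarByClass(WriteAddressBooks.class);
    // Input format and mapper elided: the mapper emits
    // (Void, AddressBook) pairs, one per record to write.
    job.setOutputFormatClass(AvroParquetOutputFormat.class);
    AvroParquetOutputFormat.setSchema(job, AddressBook.SCHEMA$);
    AvroParquetOutputFormat.setOutputPath(job, new Path(args[0]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}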
Write to Parquet with Scalding

// define the Parquet source
case class AddressBookParquetSource(override implicit val dateRange: DateRange)
  extends HourlySuffixParquetThrift[AddressBook]("/my/data/address_book", dateRange)

// load and transform data
...
pipe.write(AddressBookParquetSource())
Write to Parquet with Pig

...
STORE mydata
  INTO 'my/data'
  USING parquet.pig.ParquetStorer();
Query engines
Scalding

loading:
new FixedPathParquetThrift[AddressBook]("my", "data") {
  val city = StringColumn("city")
  override val withFilter: Option[FilterPredicate] =
    Some(city === "San Jose")
}

operations:
p.map((r) => r.a + r.b)
p.groupBy((r) => r.c)
p.join
...
Pig

loading:
mydata = LOAD 'my/data' USING parquet.pig.ParquetLoader();

operations:
A = FOREACH mydata GENERATE a + b;
B = GROUP mydata BY c;
C = JOIN A BY a, B BY b;
Hive

loading:
create table parquet_table_name (x INT, y STRING)
  ROW FORMAT SERDE 'parquet.hive.serde.ParquetHiveSerDe'
  STORED AS
    INPUTFORMAT "parquet.hive.MapredParquetInputFormat"
    OUTPUTFORMAT "parquet.hive.MapredParquetOutputFormat";

operations: SQL
Impala

loading:
create table parquet_table (x int, y string) stored as parquetfile;
insert into parquet_table select x, y from some_other_table;
select y from parquet_table where x between 70 and 100;

operations: SQL
Drill
SELECT * FROM dfs.`/my/data`
Spark SQL

loading:
val address = sqlContext.parquetFile("/my/data/addresses")

operations:
val result = sqlContext
  .sql("SELECT city FROM addresses WHERE zip == 94707")
result.map((r) => ...)
Community
Parquet timeline
- Fall 2012: Twitter & Cloudera merge efforts to develop columnar formats
- March 2013: OSS announcement; Criteo signs on for Hive integration
- July 2013: 1.0 release. 18 contributors from more than 5 organizations.
- May 2014: Apache Incubator. 40+ contributors, 18 with 1000+ LOC. 26 incremental releases.
- Parquet 2.0 coming as Apache release
Thank you to our contributors
[Chart: contributor growth over time, annotated with the open source announcement and the 1.0 release.]
Get involved

Mailing lists:
- [email protected]

Parquet sync ups:
- Regular meetings on Google Hangout
Questions
Questions.foreach( answer(_) )
@ApacheParquet