Parquet: A Columnar Storage for the People

Parquet: Columnar storage for the people
Julien Le Dem @J_, Processing tools lead, analytics infrastructure at Twitter
Nong Li [email protected], Software engineer, Cloudera Impala
http://parquet.io

description

We would like to introduce Parquet, a columnar file format for Hadoop. Performance and compression benefits of using columnar storage formats for storing and processing large amounts of data are well documented in academic literature as well as several commercial analytical databases. Parquet supports deeply nested structures, efficient encoding and column compression schemes, and is designed to be compatible with a variety of higher-level type systems. It is available as a standalone library, allowing any Hadoop framework or tool to build support for it with minimal dependencies. As of this release, Parquet is supported by Apache Pig, plain Hadoop Map-Reduce, and Cloudera's Impala, and is being put into production at Twitter. We will discuss Parquet's design and share performance numbers.
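To make the columnar idea in the description concrete, here is a minimal Python sketch. It is not Parquet's on-disk format or API. The sample records, the helper names `to_columns` and `run_length_encode`, and the choice of run-length encoding are all assumptions made for illustration. The sketch only shows why grouping values by column tends to help: values in one column share a type and often repeat, so simple encodings work well, and a query can read one column without touching the others.

```python
from itertools import groupby

# Hypothetical records, invented for this illustration.
records = [
    {"user_id": 1, "country": "US", "clicks": 3},
    {"user_id": 2, "country": "US", "clicks": 0},
    {"user_id": 3, "country": "US", "clicks": 7},
    {"user_id": 4, "country": "FR", "clicks": 1},
    {"user_id": 5, "country": "FR", "clicks": 2},
]


def to_columns(rows):
    """Pivot a list of row dicts into a dict of column lists."""
    fields = rows[0].keys()
    return {name: [row[name] for row in rows] for name in fields}


def run_length_encode(values):
    """Collapse consecutive repeats into (value, run_length) pairs.

    A generic stand-in for a column encoding, not Parquet's actual scheme.
    """
    return [(value, len(list(run))) for value, run in groupby(values)]


columns = to_columns(records)

# A query over one field only needs that column's values.
total_clicks = sum(columns["clicks"])

# Repeated values sit next to each other in a column, so they encode compactly.
encoded_country = run_length_encode(columns["country"])

print(total_clicks)
print(encoded_country)
```

In a row-oriented layout the `country` values would be interleaved with `user_id` and `clicks`, so the same runs would not be adjacent and the scan for `clicks` would have to skip over the other fields.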

Transcript of Parquet: A Columnar Storage for the People

Page 1: Parquet: A Columnar Storage for the People

Parquet

Columnar storage for the people

Julien Le Dem @J_ Processing tools lead, analytics infrastructure at Twitter

Nong Li [email protected] Software engineer, Cloudera Impala

http://parquet.io

Page 2: Parquet: A Columnar Storage for the People

Outline

Context from various companies

Early results

Format deep-dive
