Page 1 © Hortonworks Inc. 2011 – 2015. All Rights Reserved
ORC 2015: Faster, Better, Smaller
Prasanth Jayachandran, Apache Hive Team, Hortonworks
@prasanth_j
Apache ORC – Optimized Row-Columnar File
§ Apache TLP – orc.apache.org
§ Type Specific Encodings
§ Came out of Apache Hive
§ Vectorized Readers (Java, C++)
§ Projection and Predicate Pushdown
§ Columnar Storage
§ Block Compression
§ Hive ACID transactions
§ Single SerDe Format
§ Protobuf Metadata Storage
ORC: Format Specification
How does ORC store data?
ORC File Layout
§ File Footer and Postscript
§ Stripes
  § Indexes (row group indexes and Bloom filters, interleaved)
    § Min/Max stats and positions for every 10K rows
  § Data
    § Multiple streams per column, encoded and compressed independently
  § Stripe Footer
    § Locations of streams, type of encoding
§ Full specification at [1]
ORC Writer
Schema: <i:int,m:map<k:string,v:struct<s:string,d:double>>,t:time>
§ One tree writer per flattened column
§ Multiple streams per column
  § PRESENT
  § DATA
  § LENGTH
  § DICTIONARY_DATA
  § SECONDARY
  § ROW_INDEX
  § BLOOM_FILTER
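To make "one tree writer per flattened column" concrete, the sketch below walks the example schema and assigns one column id per node. The `TypeNode` class, the `flatten` helper, the `root` name, and the pre-order numbering are illustrative assumptions, not the writer's actual classes.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class TypeNode:
    """Hypothetical schema node: a field name, a type kind, and child types."""
    name: str
    kind: str
    children: List["TypeNode"] = field(default_factory=list)


def flatten(root: TypeNode) -> List[str]:
    """Return one entry per flattened column, numbered in pre-order (assumed)."""
    columns: List[str] = []

    def visit(node: TypeNode, path: str) -> None:
        full = f"{path}.{node.name}" if path else node.name
        columns.append(f"{len(columns)}: {full} ({node.kind})")
        for child in node.children:
            visit(child, full)

    visit(root, "")
    return columns


# The schema from the slide: <i:int, m:map<k:string, v:struct<s:string, d:double>>, t:time>
schema = TypeNode("root", "struct", [
    TypeNode("i", "int"),
    TypeNode("m", "map", [
        TypeNode("k", "string"),
        TypeNode("v", "struct", [TypeNode("s", "string"), TypeNode("d", "double")]),
    ]),
    TypeNode("t", "time"),
])

columns = flatten(schema)
for line in columns:
    print(line)
```

Under these assumptions the schema flattens to eight columns, so eight tree writers would be created, each owning its own set of streams.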
ORC Data Streams
Schema: <i:int,m:map<k:string,v:struct<s:string,d:double>>,t:time>
§ Streams can be suppressed.
§ Example: PRESENT stream is suppressed when all values in a stripe are non-null.
[Figure: streams IS_PRESENT, DATA, DICTIONARY, LENGTH, SECONDARY; Compression Buffers]
ORC: Features Timeline
How has ORC improved over time?
Timeline
February 2013
§ Stinger Initiative Announcement*
  § Roadmap to improve Apache Hive’s performance by 100x
  § Delivered in 100% Apache Open Source
* http://hortonworks.com/blog/100x-faster-hive/
| 2013 | 2014 | 2015
SQL Engine: Vectorized SQL Engine + Columnar Storage: ORC + Distributed Execution: Apache Tez = 100x
Timeline
March 2013
Optimized Row Columnar (ORC) file format committed to Hive
§ Hive version: 0.11
§ Native data format in Hive
Timeline
March 2013
Predicate Pushdown
§ SARG interface
§ Prune stripes and row groups based on min/max statistics
Improved Run Length Encoding
§ Tighter bit packing
§ Longer runs
§ DELTA, SHORT_REPEATS, DIRECT, PATCHED_BASE
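The min/max pruning behind predicate pushdown can be sketched as follows. This is a minimal illustration for an equality predicate only; `RowGroupStats` and `select_row_groups` are hypothetical names, not the SARG interface itself.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class RowGroupStats:
    """Hypothetical per-row-group index entry holding min/max of one column."""
    minimum: int
    maximum: int


def may_contain_equals(stats: RowGroupStats, value: int) -> bool:
    """A row group can only hold `value` if it lies inside [min, max]."""
    return stats.minimum <= value <= stats.maximum


def select_row_groups(index: List[RowGroupStats], value: int) -> List[int]:
    """Return the positions of row groups that still have to be read."""
    return [i for i, stats in enumerate(index) if may_contain_equals(stats, value)]


index = [
    RowGroupStats(0, 9_999),
    RowGroupStats(10_000, 19_999),
    RowGroupStats(15_000, 29_999),
]
selected = select_row_groups(index, 17_500)
print(selected)
```

Row groups whose range excludes the value are skipped without being decompressed or decoded; overlapping ranges (the second and third entries here) both have to be read.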
Run Length Encoding Improvements
Columns: Dataset | RLE (hive 0.11): Compression Ratio, Encoding Time (ms), Decoding Time (ms) | RLE (hive >= 0.12): Compression Ratio, Encoding Time (ms), Decoding Time (ms)
Twitter Census API ID (24,556,361 records)      | 2.32   | 1770 | 1263 | 6.97   | 1558 | 864
HTTP Archive (bytes.json)                       | 79.4   | 198  | 191  | 200.82 | 263  | 125
Github Archive (root.payload.name.txt.dict-len) | 114.05 | 21   | 15   | 260.73 | 23   | 15
AOL Querylog Epoch (36,389,577 records)         | 2.51   | 553  | 364  | 3.7    | 652  | 246
Reference: https://issues.apache.org/jira/secure/attachment/12596722/ORC-Compression-Ratio-Comparison.xlsx
Timeline
April 2013
Vectorized ORC readers
§ Read and process columns in batches of size 1024
Null stream suppression
§ Suppress PRESENT stream if no nulls in a stripe
§ Enables fast path in vectorization
June 2013
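The interaction between batch processing and null stream suppression can be sketched like this. The function below is a hypothetical illustration, not the Hive reader: it processes a column in batches of 1024 and takes a branch-free path when no PRESENT information exists (modelled here as `present=None`).

```python
from typing import List, Optional

BATCH_SIZE = 1024  # batch size stated on the slide


def add_constant(values: List[float],
                 present: Optional[List[bool]],
                 constant: float) -> List[Optional[float]]:
    """Add `constant` to every value, batch by batch.

    `present` is None when the PRESENT stream was suppressed (no nulls),
    which lets the inner loop skip the per-value null check.
    """
    out: List[Optional[float]] = []
    for start in range(0, len(values), BATCH_SIZE):
        batch = values[start:start + BATCH_SIZE]
        if present is None:
            # Fast path: no null checks inside the loop.
            out.extend(v + constant for v in batch)
        else:
            flags = present[start:start + BATCH_SIZE]
            out.extend(v + constant if ok else None for v, ok in zip(batch, flags))
    return out


print(add_constant([1.0, 2.0], None, 1.0))
print(add_constant([1.0, 2.0], [True, False], 1.0))
```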
Timeline
October 2013
Statistics Interface
§ Writer – Update statistics during load time
§ Reader – ANALYZE TABLE .. NOSCAN
Split Elimination
§ Stripe level column statistics
§ Eliminate stripes that do not satisfy predicate conditions
November 2013
Timeline
February 2014
Zero copy read path
§ HDFS caching APIs to read directly into memory without extra data copies
Serialization Improvements
§ Bit width alignment (trade space for speed)
§ Unrolled bit packing and unpacking
§ Buffered double reader and writer
June 2014
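Bit width alignment can be illustrated with a small packer. The list of aligned widths below is taken from the bit-width axis of the integer read benchmark in this deck; treating it as the exact set the writer uses is an assumption, and `aligned_width`, `pack` and `unpack` are hypothetical helpers rather than ORC code.

```python
from typing import List

# Widths shown on the benchmark's bit-width axis (assumed to be the aligned set).
ALIGNED_WIDTHS = [1, 2, 4, 8, 16, 24, 32, 40, 48, 56, 64]


def aligned_width(bits: int) -> int:
    """Round a required bit width up to the next aligned width."""
    for width in ALIGNED_WIDTHS:
        if width >= bits:
            return width
    raise ValueError("width above 64 bits is not supported")


def pack(values: List[int], width: int) -> bytes:
    """Pack non-negative integers big-endian, `width` bits each, zero-padded."""
    acc = 0
    for v in values:
        if v < 0 or v >= (1 << width):
            raise ValueError(f"{v} does not fit in {width} bits")
        acc = (acc << width) | v
    total_bits = width * len(values)
    pad = (-total_bits) % 8
    return (acc << pad).to_bytes((total_bits + pad) // 8, "big")


def unpack(data: bytes, width: int, count: int) -> List[int]:
    """Inverse of pack(): read `count` values of `width` bits each."""
    acc = int.from_bytes(data, "big")
    total_bits = len(data) * 8
    mask = (1 << width) - 1
    return [(acc >> (total_bits - width * (i + 1))) & mask for i in range(count)]


values = [5, 1, 7, 0, 3, 6, 2, 4]
needed = max(v.bit_length() for v in values)   # 3 bits
aligned = aligned_width(needed)                # rounded up to 4 bits
print(len(pack(values, needed)), len(pack(values, aligned)))
```

The aligned encoding spends one extra byte on these eight values, which is the space given up so that the unpacking loop can work on widths that are easier to unroll.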
Serialization Improvements
[Chart: ORC Read Integer Performance (smaller is better). X axis: Bit Width (1, 2, 4, 8, 16, 24, 32, 40, 48, 56, 64); Y axis: Mean Time (ms), 0 to 1800. Series: hive 0.13 unpacking; hive-1.0 unpacking (new).]
Serialization Improvements
[Chart: ORC Read Double Performance (smaller is better). X axis: Double Read Modes; Y axis: Mean Time (ms).]
§ hive <= 0.13: 241.679
§ buffered + BE: 171.045
§ buffered + LE: 174.163
~1.4x improvement
Timeline
June 2014
Adaptive compression buffer size
§ For tables with >1000 columns, adjust compression buffer size based on available memory
§ Avoids wide table OOMs
Fast stripe level file merging
§ Many small files to few large files
§ No decompression, no decoding
§ ALTER TABLE … CONCATENATE
July 2014
Fast File Merging
[Chart: ETL With File Merging – TPC-H 1000 Scale Lineitem (smaller is better). X axis: CONCAT Supporting File Formats; Y axis: Total Time in seconds.]
§ ORC: Load Time 1091, Merge Time 245, total 1336
§ RCFile: Load Time 651, Merge Time 816, total 1467
~3.33x improvement in merge time
Timeline
July 2014
ORC Padding Improvements
§ Pad bytes to avoid remote HDFS reads
§ Last stripe is adjusted to fit within HDFS block boundary (worst case: 5% wastage)
Decouple stripe size vs block size
§ Smaller stripes (64MB)
§ More stripes per block (4 per block)
§ Better parallelism & split elimination
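The padding decision can be sketched as a small function. Everything here is an assumption made for illustration: the 256MB block size (implied by 4 stripes of 64MB per block), the `place_stripe` helper, and the rule that padding is used only when the leftover space is within a 5% tolerance of the stripe size, with the stripe shrunk to fit otherwise.

```python
from typing import Tuple

MB = 1 << 20


def place_stripe(offset: int, stripe_size: int, block_size: int,
                 tolerance: float = 0.05) -> Tuple[int, int]:
    """Return (padding_bytes, stripe_bytes) for the next stripe (hypothetical rule).

    - Enough room left in the block: write a full stripe, no padding.
    - Only a sliver left (<= tolerance * stripe_size): pad to the block boundary.
    - Otherwise: shrink the stripe so it ends exactly at the block boundary.
    """
    remaining = block_size - (offset % block_size)
    if remaining >= stripe_size:
        return 0, stripe_size
    if remaining <= tolerance * stripe_size:
        return remaining, stripe_size
    return 0, remaining


BLOCK = 256 * MB   # assumed HDFS block size
STRIPE = 64 * MB   # stripe size from the slide

print(BLOCK // STRIPE)                        # stripes per block
print(place_stripe(0, STRIPE, BLOCK))         # fits, no padding
print(place_stripe(254 * MB, STRIPE, BLOCK))  # 2MB left: pad
print(place_stripe(226 * MB, STRIPE, BLOCK))  # 30MB left: shrink stripe
```

Either way no stripe straddles a block boundary, so a task reading the stripe does not need a remote read for its tail.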
Timeline
September 2014
String Dictionary Improvements
§ Row group level checking
§ Remember decision across stripes
§ Avoids expensive RBTree insertions
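One way to read "row group level checking" and "remember decision across stripes" is sketched below. The 0.8 distinct-ratio threshold, the `ColumnWriter` class, and the choice to inspect only the first row group are assumptions for illustration; the slide does not state the actual rule.

```python
from typing import List, Optional


def use_dictionary(values: List[str], threshold: float = 0.8) -> bool:
    """Prefer dictionary encoding when the distinct ratio is at or below threshold."""
    if not values:
        return True
    return len(set(values)) / len(values) <= threshold


class ColumnWriter:
    """Hypothetical string column writer that decides once and remembers."""

    def __init__(self) -> None:
        self.decision: Optional[bool] = None

    def write_stripe(self, row_groups: List[List[str]]) -> str:
        if self.decision is None:
            # Decide from the first row group instead of the whole stripe,
            # so a high-cardinality column stops paying for dictionary inserts early.
            self.decision = use_dictionary(row_groups[0])
        return "DICTIONARY" if self.decision else "DIRECT"


writer = ColumnWriter()
print(writer.write_stripe([["a", "b", "c", "d", "e"]]))  # all distinct
print(writer.write_stripe([["x", "x", "x", "x"]]))       # decision is reused
```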
String Dictionary Improvements
[Chart: String Dictionary Improvements - TPC-H 1000 Scale Lineitem (smaller is better). X axis: Hive Version; Y axis: Load Time in seconds.]
§ hive <= 0.13: 767
§ hive > 0.13: 540
~1.4x improvement
Timeline
September 2014
Improved ZLIB compression
§ Different streams compressed with different zlib strategies/levels
§ Compress integers and doubles differently
§ Data and Dictionary stream - Looks for smaller byte patterns
§ All other streams - Less LZ77, More Huffman
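Per-stream zlib tuning can be tried out with Python's `zlib` module, which exposes the same level and strategy knobs. The mapping from stream kind to settings below is an assumption for illustration: `Z_FILTERED` is zlib's strategy for more Huffman coding and less string matching, which fits the "all other streams" bullet, while the levels are placeholders.

```python
import zlib
from typing import Dict, Tuple

# Hypothetical per-stream settings: (compression level, zlib strategy).
STREAM_SETTINGS: Dict[str, Tuple[int, int]] = {
    "DATA": (zlib.Z_DEFAULT_COMPRESSION, zlib.Z_DEFAULT_STRATEGY),
    "DICTIONARY_DATA": (zlib.Z_DEFAULT_COMPRESSION, zlib.Z_DEFAULT_STRATEGY),
    "LENGTH": (zlib.Z_BEST_SPEED, zlib.Z_FILTERED),
    "PRESENT": (zlib.Z_BEST_SPEED, zlib.Z_FILTERED),
}


def compress_stream(kind: str, data: bytes) -> bytes:
    """Compress one stream with the level/strategy chosen for its kind."""
    level, strategy = STREAM_SETTINGS[kind]
    compressor = zlib.compressobj(level, zlib.DEFLATED, zlib.MAX_WBITS,
                                  zlib.DEF_MEM_LEVEL, strategy)
    return compressor.compress(data) + compressor.flush()


payload = b"ORC stream bytes " * 64
compressed = {kind: compress_stream(kind, payload) for kind in STREAM_SETTINGS}
for kind, blob in compressed.items():
    print(kind, len(payload), "->", len(blob))
```

Any strategy still produces a standard zlib stream, so the reader side needs no per-stream configuration to decompress.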
ZLIB Improvements
[Chart: Data Size Improvements - TPC-H 1000 Scale Lineitem (smaller is better). X axis: File Format + Compression Codec; Y axis: Data Size in GBs.]
§ ORC + (old ZLIB): 178.5
§ ORC + (new ZLIB): 172.2
§ ORC + SNAPPY: 225.1
~4% improvement over old ZLIB; ~1.3x smaller than SNAPPY
ZLIB Improvements
[Chart: Load Time Improvements - TPC-H 1000 Scale Lineitem (smaller is better). X axis: File Format + Compression Codec.]
§ ORC + (old ZLIB): 674
§ ORC + (new ZLIB): 433
§ ORC + SNAPPY: 389
~1.6x improvement over old ZLIB; only ~10% slower than SNAPPY
Timeline
September 2014
ACID transactions
§ Order of millions of rows
§ Not designed for OLTP requirements
§ Streaming Ingest via Flume or Storm
§ Atomically add base and delta directories
§ Minor compaction – Merge many delta files
§ Major compaction – Re-write base files to incorporate delta file changes
Broken pattern: Add Partitions for Atomicity
Timeline
January 2015
hasNull flag in ORC internal index
§ Better pruning of row groups
§ Improves the performance of SELECT .. WHERE column IS NULL;
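How a per-row-group hasNull flag helps an IS NULL predicate can be sketched as follows. `RowGroupEntry` and `row_groups_for_is_null` are hypothetical names; the point is only that min/max alone cannot say whether a row group holds nulls, while the flag can.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class RowGroupEntry:
    """Hypothetical index entry: min/max of non-null values plus a null flag."""
    minimum: str
    maximum: str
    has_null: bool


def row_groups_for_is_null(index: List[RowGroupEntry]) -> List[int]:
    """Row groups that must be read for `WHERE column IS NULL`."""
    return [i for i, entry in enumerate(index) if entry.has_null]


index = [
    RowGroupEntry("1992-01-02", "1992-03-15", has_null=False),
    RowGroupEntry("1992-03-15", "1992-06-01", has_null=True),
    RowGroupEntry("1992-06-01", "1992-09-30", has_null=False),
]
to_read = row_groups_for_is_null(index)
print(to_read)
```

Without the flag, every row group would have to be read because none can be ruled out from min/max statistics.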
hasNull in Index Improvement
[Chart: select * from lineitem where l_shipdate is null (smaller is better). X axis: Hive Version; Y axis: Execution Time in seconds.]
§ hive < 1.1.0: 66.73
§ hive >= 1.1.0: 7.87
Bytes Read: 208.77 GB vs 539 MB
~8.5x improvement
Timeline
February 2015
Bloom Filter Index
§ Much better row group pruning when compared to min/max
§ Bloom filter evaluated after the fast Min/Max based elimination
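The two-stage check (cheap min/max first, Bloom filter second) can be sketched with a toy filter. The `BloomFilter` class, its size, and the SHA-256 based hashing are illustrative choices and not ORC's implementation.

```python
import hashlib
from typing import Iterator


class BloomFilter:
    """Toy Bloom filter backed by a Python int used as a bit set."""

    def __init__(self, num_bits: int = 1024, num_hashes: int = 3) -> None:
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = 0

    def _positions(self, value: int) -> Iterator[int]:
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{value}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, value: int) -> None:
        for pos in self._positions(value):
            self.bits |= 1 << pos

    def might_contain(self, value: int) -> bool:
        # False means definitely absent; True means possibly present.
        return all((self.bits >> pos) & 1 for pos in self._positions(value))


def row_group_may_match(minimum: int, maximum: int,
                        bloom: BloomFilter, value: int) -> bool:
    """Apply the fast min/max check first, then consult the Bloom filter."""
    if not (minimum <= value <= maximum):
        return False
    return bloom.might_contain(value)


# One row group holding sparse keys: min/max alone cannot exclude 4001.
keys = list(range(0, 10_000, 1_000))
bloom = BloomFilter()
for key in keys:
    bloom.add(key)

print(row_group_may_match(min(keys), max(keys), bloom, 4_000))   # stored key
print(row_group_may_match(min(keys), max(keys), bloom, 20_000))  # outside min/max
```

For keys that fall inside the min/max range but were never written, the filter rejects most lookups, which is where the extra pruning over min/max comes from.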
Bloom Filter Indexes Improvements
[Chart: select * from tpch_1000.lineitem where l_orderkey = 1212000001; (log scale – smaller is better). Rows Read.]
§ No Indexes: 5,999,989,709
§ Min-Max Indexes: 540,000
§ Bloomfilter Indexes: 10,000
Bloom Filter Indexes Improvements
[Chart: select * from tpch_1000.lineitem where l_orderkey=1212000001; (smaller is better). Time Taken (seconds).]
§ No Indexes: 74
§ Min-Max Indexes: 4.5
§ Bloomfilter Indexes: 1.34
~16x improvement (No Indexes to Min-Max Indexes)
~3.3x improvement (Min-Max Indexes to Bloomfilter Indexes)
Timeline
April 2015
Split Strategies
§ BI – Skip reading file footer
§ ETL – Read and cache file footer
§ HYBRID – Default. Chooses BI/ETL based on number of files and average file size
§ Group splits based on columnar projection size instead of file size
Timeline
April 2015
ORC became Apache Top Level Project
§ C++ reader with contributions from Hortonworks, HP and Microsoft
§ Column encryption to encrypt sensitive columns
http://orc.apache.org/
ORC: In Production
ORC at Facebook
§ Compression: Saved more than 1,400 servers worth of storage. (2)
§ Compression: Compression ratio increased from 5x to 8x globally. (2)
ORC at Spotify
§ IO: 16x less HDFS read when using ORC versus Avro. (3)
§ CPU: 32x less CPU when using ORC versus Avro. (3)
ORC at Yahoo!
§ Speedup: 6-50x speedup when using ORC versus Text File. (4)
§ Speedup: 1.6-30x speedup when using ORC versus RCFile. (4)
ORC: LLAP and Sub-second
ORC – Pushing for Sub-second
ORC: LLAP
§ JIT Performance for short queries
§ Row-group level caching
§ Asynchronous IO Elevator
§ Multi-threaded Column Vector processing
ORC: Vectorization + SIMD
§ Allocation-free tight inner loops enable the JDK’s auto-vectorization
§ Vectors can be filtered early in ORC
§ String dictionary can be used to binary-search
§ Vectorized SIMD Join
§ Improves performance for single key joins
Example:
Query: select ss_ext_tax + 1.0 from store_sales_orc;
JVM Options: HADOOP_OPTS="-XX:+PrintCompilation -XX:+UnlockDiagnosticVMOptions -XX:+PrintAssembly"
Note: Make sure to have hotspot disassembler in $JAVA_HOME/jre/lib
Generated Assembly:
0x00007f13d2e6afb0: vmovdqu 0x10(%rsi,%rax,8),%ymm2
0x00007f13d2e6afb6: vaddpd %ymm1,%ymm2,%ymm2
0x00007f13d2e6afba: movslq %eax,%r10
0x00007f13d2e6afbd: vmovdqu 0x30(%rsi,%r10,8),%ymm3 ;*daload vector.expressions.gen.DoubleColAddDoubleColumn::evaluate (line 94)
AVX - Vector Addition Packed Double: 4 doubles loaded to 256 bit registers
ORC: LLAP (+ SIMD + Split Strategies + Row Indexes)
select * from tpch_1000.lineitem where l_orderkey=1212000001;
Questions
?
Interested? Stop by the Hortonworks booth to learn more
Endnotes
(1) https://cwiki.apache.org/confluence/display/Hive/LanguageManual+ORC#LanguageManualORC-orc-specORCFormatSpecification
(2) https://code.facebook.com/posts/229861827208629/scaling-the-facebook-data-warehouse-to-300-pb/
(3) http://www.slideshare.net/AdamKawa/a-perfect-hive-query-for-a-perfect-meeting-hadoop-summit-2014
(4) http://www.slideshare.net/Hadoop_Summit/w-1205p230-aradhakrishnan-v3