Column Cache: Buffer Cache for Columnar Storage on HDFS
© 2018 IBM Corporation
Takeshi Yoshimura, Tatsuhiro Chiba, Hiroshi Horii
IBM Research – Tokyo
IEEE BigData 2018
Hadoop distributed file system (HDFS)
§ A major data store in scale-out big data processing architectures
– Offers high scalability and availability to parallel distributed computing
§ Typical use cases: relational query processing and machine learning
– e.g., Spark SQL/ML, Apache Tez, Apache Mahout
T. Yoshimura et al., Column Cache: Buffer Cache for Columnar Storage on HDFS, 2018/12/5
Columnar Storage on HDFS
§ Designed as a special data representation for an HDFS file
– Provides highly compressed data via a column-wise data layout
– Preserves the high scalability and availability of HDFS
§ Optimizes read-heavy queries
[Figure: Parquet file format; a file holds metadata plus a sequence of RowGroups, each storing its data column by column (Column A, Column B, …)]
…And Spark wraps it all with RDDs
Problem: Deep software layering
§ Disk I/O for data analytics is processed by multiple software layers
– Redundant buffers cause inefficient memory usage
§ The software layers do not coordinate with each other
– Growing user memory increases memory pressure and OS cache eviction
[Figure: inside a Java virtual machine process, the RDD, ParquetFileReader, and FSBufferedInputStream each hold their own buffer on top of the OS cache, creating redundant copies of the same data]
Preliminary experiments with TPC-DS
§ A large JVM heap caused a ~10 min slowdown in TPC-DS on Spark
– System memory: 128 GB, scale factor: 1 TB
Environment: 1 master, 5 slaves, x86 Ubuntu 16 in IBM Cloud, 32 vCPUs, 128 GB RAM, 1 TB SAN disk, TPC-DS scale factor=1000
[Chart: elapsed time (min) for TPC-DS queries q23a, q23b, q14a, q14b, q16, q64, q94, q95*, q80, q69 with heap=24GB, 48GB, and 80GB; lower is better]
Slowdowns due to OS-level cache (TPC-DS q23a)
§ A large heap evicted the OS cache that held Spark's temporary files
– A smaller heap caused fewer cache misses but left insufficient memory for Spark
§ It is difficult to coordinate memory management between the OS and Spark
Environment: 1 master, 5 slaves, x86 Ubuntu 16 in IBM Cloud, 32 vCPUs, 128 GB RAM, 1 TB SAN disk, TPC-DS scale factor=1000, Spark off-heap: 48 GB
[Figure: memory usage (GB) over elapsed time (min) for heap=80GB and heap=48GB; stacked areas show user memory, OS cache, and free memory]
Column cache: Buffer cache for columnar storage on HDFS
§ Unifies buffering and caching logic from the OS up to Parquet
– Leverages query plans to eagerly evict user memory and increase OS cache
– Exploits HDFS's optimization for direct I/O on HDFS files
§ Provided as a Java library with Parquet-compatible APIs
– Required only 88 LoC of changes in Spark
§ Modifies only disk reads for Parquet files in HDFS client processes
– Does not require restarting HDFS servers
[Figure: inside a Java virtual machine process, the column cache sits between Spark/Parquet and the local filesystem, adding a new data flow that bypasses the existing HDFS client and OS cache path]
Approach
§ Cache management with query plans
– The master node extracts and broadcasts scan info from query plans
– Uses anonymous mmap()/munmap() system calls for explicit memory management
§ User-level buffer cache with direct I/O on HDFS files
– Uses JNA to call open(), pread(), fcntl(), and close() system calls from Java processes
[Figure: the Spark driver on the master node extracts scan info from Spark SQL query plans and broadcasts it to the slave nodes; each Spark executor manages its cache with the scan info and performs direct I/O against the local data node's filesystem]
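The anonymous-mapping idea behind the explicit memory management can be sketched as follows. This is not the paper's Java/JNA implementation; it is a minimal Python illustration, and the 4 MB entry size is a made-up example value.

```python
import mmap

ENTRY_SIZE = 4 * 1024 * 1024  # hypothetical cache-entry size

def allocate_entry(size=ENTRY_SIZE):
    # mmap(-1, size) creates an anonymous mapping, i.e. memory not backed
    # by any file (MAP_ANONYMOUS under the hood), just as the column
    # cache allocates its entries outside the JVM heap.
    return mmap.mmap(-1, size)

def release_entry(buf):
    # close() unmaps the region (munmap), returning it to the OS
    # immediately instead of waiting for garbage collection; that
    # immediacy is the point of explicit memory management here.
    buf.close()

entry = allocate_entry()
entry[:4] = b"PAR1"          # stash some column bytes
assert entry[:4] == b"PAR1"
release_entry(entry)
```

Because the mapping lives outside the heap, releasing it reduces user-memory pressure without a GC cycle.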
Cache management with query plans
§ Builds cache entries by associating direct I/O results with the corresponding offset ranges of HDFS files
§ Estimates the frequency of scanned file ranges from query plans
– Parquet metadata describes how each column is physically placed in an HDFS file
– Marks a cache entry releasable once its scan count reaches the estimated frequency
§ Monitors system memory and frees releasable cache entries under memory pressure
– Applies an LRU policy when two entries have the same frequency
Scan info in a query plan: scan columns A and B in file /a/b; scan columns A and C in file /a/b

Parquet metadata (HDFS file /a/b):
Column A: 0 MB – 24 MB; Column B: 24 MB – 48 MB; Column C: 48 MB – 64 MB

Cache entries with calculated scan frequency:
Entry #1: file /a/b, 0 MB – 24 MB, frequency 2
Entry #2: file /a/b, 24 MB – 48 MB, frequency 1
Entry #3: file /a/b, 48 MB – 64 MB, frequency 1
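The frequency estimation and eviction policy above can be sketched roughly as follows. The scan info and column layout come from the slide's own example; the CacheEntry class, its field names, and the timestamps are hypothetical.

```python
from collections import Counter

scans = [("/a/b", ["A", "B"]), ("/a/b", ["A", "C"])]   # scan info from query plans
layout = {"A": (0, 24), "B": (24, 48), "C": (48, 64)}  # MB ranges, from Parquet metadata

# Estimated scan frequency per (file, start, end) offset range.
expected = Counter()
for path, cols in scans:
    for col in cols:
        expected[(path,) + layout[col]] += 1

class CacheEntry:
    def __init__(self, key):
        self.key, self.scans, self.last_used = key, 0, 0.0

    def on_scan(self, now):
        self.scans += 1
        self.last_used = now

    @property
    def releasable(self):
        # Releasable once every planned scan of this range has happened.
        return self.scans >= expected[self.key]

entries = {k: CacheEntry(k) for k in expected}
entries[("/a/b", 24, 48)].on_scan(now=1.0)  # column B scanned once
entries[("/a/b", 48, 64)].on_scan(now=2.0)  # column C scanned once
entries[("/a/b", 0, 24)].on_scan(now=3.0)   # column A: first of two scans

# Under memory pressure: free releasable entries first, least recently
# used first among equals (the slide's LRU tie-break).
victims = sorted((e for e in entries.values() if e.releasable),
                 key=lambda e: e.last_used)
assert [v.key for v in victims] == [("/a/b", 24, 48), ("/a/b", 48, 64)]
```

Column A stays cached because the plan says it will be scanned again, while B and C become eviction candidates immediately after their single planned scan.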
Challenge: direct I/O limitation
§ Direct I/O requires a file descriptor opened with O_DIRECT
– HDFS maintains local file information for each HDFS block
§ Direct I/O does not merge multiple read() calls, unlike normal filesystem reads
– The Linux page cache merges and splits read() requests at the block I/O layer
– Splitting one read into multiple read() calls hurts performance
Enabling direct I/O on HDFS files
§ Enables direct I/O on HDFS files by exploiting an existing HDFS optimization
– Short-circuit local reads optimize reading HDFS files on the local node
– The data node passes the requested file as an FD via a UNIX domain socket
§ Uses fcntl() to set O_DIRECT on the received file descriptor (FD)
– Uses Java reflection to directly access the FD inside an input stream
[Figure: on a slave node, the column cache asks the data node to open() a block file, receives the FD over a UNIX domain socket, enables O_DIRECT with fcntl(), and then reads with pread()]
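The FD handshake behind short-circuit local reads can be sketched in Python on a Unix system (the actual system does this from Java via JNA and reflection). The socket pair stands in for the data node's UNIX domain socket, and the file name and payload are invented for illustration.

```python
import fcntl
import os
import socket
import tempfile

payload = b"parquet-column-bytes"            # made-up block contents
tmp = tempfile.NamedTemporaryFile(delete=False)
tmp.write(payload)
tmp.close()

# A socketpair stands in for the data node's UNIX domain socket.
server, client = socket.socketpair(socket.AF_UNIX, socket.SOCK_STREAM)

# "Data node" side: open the local block file and pass the descriptor.
fd = os.open(tmp.name, os.O_RDONLY)
socket.send_fds(server, [b"blk"], [fd])      # Python 3.9+ FD passing

# "Column cache" side: receive the FD, then toggle O_DIRECT with fcntl()
# where the platform supports it (Linux only; real O_DIRECT reads also
# need aligned buffers, so we restore the flags before the plain pread).
msg, fds, _, _ = socket.recv_fds(client, 16, 1)
rfd = fds[0]
if hasattr(os, "O_DIRECT"):
    flags = fcntl.fcntl(rfd, fcntl.F_GETFL)
    fcntl.fcntl(rfd, fcntl.F_SETFL, flags | os.O_DIRECT)
    fcntl.fcntl(rfd, fcntl.F_SETFL, flags)

data = os.pread(rfd, len(payload), 0)        # single positioned read
assert data == payload

for f in (fd, rfd):
    os.close(f)
server.close()
client.close()
os.remove(tmp.name)
```

The point of the design is that no HDFS server code changes: the client merely flips a flag on a descriptor the data node already hands out.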
Optimizing direct I/O on HDFS files
§ Explicitly aggregates pread() requests into a single call
– Existing buffering logic issues multiple pread() calls for a single column
§ Tracks Parquet and HDFS metadata in advance of a column scan
– Parquet metadata describes how each column is physically placed in an HDFS file
§ Calculates the exact offset and size of the local file read for each column
§ Issues a single pread() on the calculated consecutive file area
Example: a Parquet read of column B in /a/b becomes
– HDFS read: /a/b, offset=24 MB, length=24 MB
– Local file read: /c/blk2, offset=0, length=24 MB

Parquet metadata (HDFS file /a/b):
Column A: 0 MB – 24 MB; Column B: 24 MB – 48 MB; Column C: 48 MB – 64 MB

HDFS metadata (file /a/b):
0 MB – 24 MB on node1 as /c/blk1; 24 MB – 48 MB on node2 as /c/blk2
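The offset translation above can be sketched as follows, using the slide's example metadata. The table structures, the plan_column_read helper, and the 24 MB block size are assumptions for illustration; the real system reads these values from Parquet footers and the HDFS name node.

```python
MB = 1024 * 1024
BLOCK_MB = 24  # hypothetical HDFS block size matching the slide's example

# Parquet metadata: (HDFS file, column) -> offset range in MB.
parquet_meta = {("/a/b", "A"): (0, 24),
                ("/a/b", "B"): (24, 48),
                ("/a/b", "C"): (48, 64)}
# HDFS metadata: (HDFS file, block start in MB) -> (node, local block file).
hdfs_meta = {("/a/b", 0): ("node1", "/c/blk1"),
             ("/a/b", 24): ("node2", "/c/blk2")}

def plan_column_read(path, col):
    """Translate one Parquet column scan into a single local pread() call."""
    start, end = parquet_meta[(path, col)]
    block_start = (start // BLOCK_MB) * BLOCK_MB   # which HDFS block holds it
    node, local_file = hdfs_meta[(path, block_start)]
    offset = (start - block_start) * MB            # offset inside the block file
    length = (end - start) * MB                    # one consecutive read
    return node, local_file, offset, length

# Column B lives at 24-48 MB of /a/b, which maps to /c/blk2 on node2.
assert plan_column_read("/a/b", "B") == ("node2", "/c/blk2", 0, 24 * MB)
assert plan_column_read("/a/b", "A") == ("node1", "/c/blk1", 0, 24 * MB)
```

Precomputing (node, file, offset, length) this way is what lets the cache replace many small page-sized reads with one pread() over the whole column chunk.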
Evaluation
§ Microbenchmark
§ Single TPC-DS query
§ Seven concurrent TPC-DS queries
Microbenchmark (parallel file read with 256 Parquet files)
§ The column cache sped up parallel reads of 256 Parquet files
– Achieved 1.2x throughput with 16 threads
Environment: 1 node, x86 Ubuntu 16 in IBM Cloud, 32 vCPUs, 128 GB RAM, 1 TB SAN disk, 48 GB JVM heap
[Chart: throughput (M ops/s) vs. number of threads (1 to 32) for baseline and column cache; higher is better]
Microbenchmark: breakdown of memory usage
§ The column cache reduced memory usage by 14 GB
– Direct I/O removed page-cache interleaving during Parquet file reads
– No cache eviction occurred, yet the speedup still appeared
[Figure: memory usage (GB) over elapsed time (sec) for baseline vs. column cache; stacked areas show user memory, OS cache, column cache, and free memory; lower is better]
Environment: 1 node, x86 Ubuntu 16 in IBM Cloud, 32 vCPUs, 128 GB RAM, 1 TB SAN disk, 48 GB JVM heap
Microbenchmark: breakdown of disk usage
§ The column cache increased average disk I/O bandwidth
– Aggregated direct I/O decreased the number of I/O requests per second
[Figure: disk read bandwidth (GB/s, higher is better) and block I/O requests (K ops/s, lower is better) over elapsed time, baseline vs. column cache]
Environment: 1 node, x86 Ubuntu 16 in IBM Cloud, 32 vCPUs, 128 GB RAM, 1 TB SAN disk, 48 GB JVM heap
TPC-DS queries
§ The column cache reduced query processing time by up to 4 min compared to heap=48GB
[Chart: elapsed time (min) for TPC-DS queries q23a, q23b, q14a, q14b, q16, q64, q94, q95, q80, q69, baseline vs. column cache; lower is better]
Environment: 1 master, 5 slaves, x86 Ubuntu 16 in IBM Cloud, 32 vCPUs, 128 GB RAM, 1 TB SAN disk, TPC-DS scale factor=1000, 48 GB JVM heap
Q23a: memory usage
§ Less frequently used columns are evicted from user memory
§ Increased page-cache usage by 18% and reduced disk reads by 43%
[Figure: memory usage (GB) over elapsed time (min) for baseline vs. column cache; stacked areas show user memory, OS cache, and free memory]
Environment: 1 master, 5 slaves, x86 Ubuntu 16 in IBM Cloud, 32 vCPUs, 128 GB RAM, 1 TB SAN disk, TPC-DS scale factor=1000, 48 GB JVM heap
Seven concurrent TPC-DS queries
§ The column cache reduced query processing time by up to 9 min
[Chart: elapsed time (min) for TPC-DS queries q23a, q23b, q14a, q14b, q16, q64, q94, q95, q80, q69, baseline vs. column cache; lower is better]
Environment: 1 master, 5 slaves, x86 Ubuntu 16 in IBM Cloud, 32 vCPUs, 128 GB RAM, 1 TB SAN disk, TPC-DS scale factor=1000, 48 GB JVM heap
Summary
§ Introduced the column cache, a buffer cache for columnar storage on HDFS
– Unifies buffering and caching logic from the OS up to Parquet
– Leverages query plans to eagerly evict user memory and increase OS cache
– Exploits HDFS's optimization for direct I/O on HDFS files
§ Sped up Spark SQL TPC-DS queries by up to 9 min
– Increased page-cache usage by 18% and reduced disk reads by 43%