Column Cache: Buffer Cache for Columnar Storage on HDFS · §Disk I/O for data analytics is...

27
© 2018 IBM Corporation Takeshi Yoshimura , Tatsuhiro Chiba, Hiroshi Horii IBM Research – Tokyo IEEE BigData 2018 Column Cache: Buffer Cache for Columnar Storage on HDFS

Transcript of Column Cache: Buffer Cache for Columnar Storage on HDFS · §Disk I/O for data analytics is...

Page 1: Column Cache: Buffer Cache for Columnar Storage on HDFS · §Disk I/O for data analytics is processed by multiple software layers –Redundant buffers cause inefficient memory usage

© 2018 IBM Corporation

Takeshi Yoshimura, Tatsuhiro Chiba, Hiroshi Horii

IBM Research – Tokyo

IEEE BigData 2018

Column Cache: Buffer Cache for Columnar Storage on HDFS

Page 2: Column Cache: Buffer Cache for Columnar Storage on HDFS · §Disk I/O for data analytics is processed by multiple software layers –Redundant buffers cause inefficient memory usage

© 2018 IBM Corporation

Hadoop distributed file system (HDFS)

§A major data store of scale-out architecture in big data processing–Offers high scalability and availability to parallel distributed computing

§Typical use-case: relational query processing and machine learning–e.g., Spark SQL/ML, Apache Tez, Apache Mahout

2 T Yoshimura et al, Column Cache: Buffer Cache for Columnar Storage on HDFS2018/12/5

Page 3: Column Cache: Buffer Cache for Columnar Storage on HDFS · §Disk I/O for data analytics is processed by multiple software layers –Redundant buffers cause inefficient memory usage

© 2018 IBM Corporation

Columnar Storage on HDFS

§Designed as a special data representation for a HDFS file–Provides highly compressed data by columnar-wise data layout–Preserves high scalability and availability of HDFS

§Optimizes read heavy queries

3 T Yoshimura et al, Column Cache: Buffer Cache for Columnar Storage on HDFS2018/12/5

metadata

RowGroup

RowGroup

Column AColumn B

Column AColumn B

Parquet file format

Page 4: Column Cache: Buffer Cache for Columnar Storage on HDFS · §Disk I/O for data analytics is processed by multiple software layers –Redundant buffers cause inefficient memory usage

© 2018 IBM Corporation

…And Spark wraps all with RDD

4 T Yoshimura et al, Column Cache: Buffer Cache for Columnar Storage on HDFS2018/12/5

Page 5: Column Cache: Buffer Cache for Columnar Storage on HDFS · §Disk I/O for data analytics is processed by multiple software layers –Redundant buffers cause inefficient memory usage

© 2018 IBM Corporation

Problem: Deep software layering

§Disk I/O for data analytics is processed by multiple software layers–Redundant buffers cause inefficient memory usage

§Every software layer does not coordinate each other–User memory increases memory pressure and OS cache eviction

5 T Yoshimura et al, Column Cache: Buffer Cache for Columnar Storage on HDFS2018/12/5

Java virtual machine process

ParquetFileReader

OS cache

FSBufferedInputStream

RDD

redundant

Page 6: Column Cache: Buffer Cache for Columnar Storage on HDFS · §Disk I/O for data analytics is processed by multiple software layers –Redundant buffers cause inefficient memory usage

© 2018 IBM Corporation

Preliminary experiments with TPC-DS

§Large JVM heap caused ~10 mins slowdown in TPC-DS on Spark–System memory: 128 GB, Scale factor: 1TB

6 T Yoshimura et al, Column Cache: Buffer Cache for Columnar Storage on HDFS2018/12/5

Environment: 1master, 5 slaves x86 Ubuntu16 in IBM Cloud, 32 vCPUs 128 GB RAM, 1 TB SAN disk, TPC-DS

scale factor=1000

0

10

20

30

40

50

60

q23a q23b q14a q14b q16 q64 q94 q95* q80 q69

Ela

pse

d tim

e (

min

)

TPC-DS queries

heap=24GBheap=48GBheap=80GB

Low

er

is B

etter

Page 7: Column Cache: Buffer Cache for Columnar Storage on HDFS · §Disk I/O for data analytics is processed by multiple software layers –Redundant buffers cause inefficient memory usage

© 2018 IBM Corporation

Slowdowns due to OS-level cache (TPC-DS q23a)

§Large heap evicted OS cache that is for Spark’s temporary files–Smaller heap caused fewer cache misses but insufficient memory for Spark

§Difficult to coordinate memory management in OS and Spark

7 T Yoshimura et al, Column Cache: Buffer Cache for Columnar Storage on HDFS2018/12/5

Environment: 1master, 5 slaves x86 Ubuntu16 in IBM Cloud, 32 vCPUs 128 GB RAM, 1 TB SAN disk, TPC-DS scale factor=1000Spark offheap: 48 GB

0

32

64

96

128

0 4 8 12 16 20 24 28 32 36 40 44U

ser M

emor

y (G

B)

Elapsed Time (min)heap=80GB

User memory OS cache Free

0

32

64

96

128

0 4 8 12 16 20 24 28 32

Use

r Mem

ory

(GB

)

Elapsed Time (min)heap=48GB

User memory OS cache Free

Page 8: Column Cache: Buffer Cache for Columnar Storage on HDFS · §Disk I/O for data analytics is processed by multiple software layers –Redundant buffers cause inefficient memory usage

© 2018 IBM Corporation

Column cache: Buffer cache for columnar storage on HDFS

§Unifies buffering and caching logic from OS to Parquet–Leverages query plans to eagerly evict user memory and increase OS cache–Exploits HDFS’s optimization for direct I/O on HDFS files

§Provided as a Java library with Parquet compatible APIs–Required 88 LoC changes in Spark

§Modifies only disk read for Parquet files in HDFS client processes–Does not require restarting HDFS servers

8 T Yoshimura et al, Column Cache: Buffer Cache for Columnar Storage on HDFS2018/12/5

Java virtual machine process

Parquet

HDFS client

Spark

OS cache

Local filesystem

Columncache

New data flowExisting data flow

Page 9: Column Cache: Buffer Cache for Columnar Storage on HDFS · §Disk I/O for data analytics is processed by multiple software layers –Redundant buffers cause inefficient memory usage

© 2018 IBM Corporation

Approach

§Cache management with query plans–Master node extracts and broadcasts scan info from query plans–Use anonymous mmap()/munmap() system calls for explicit memory management

§User-level buffer cache with direct I/O on HDFS files–Use JNA to call open(), pread(), fcntl(), close() system calls from Java processes

9 T Yoshimura et al, Column Cache: Buffer Cache for Columnar Storage on HDFS2018/12/5

Master node

Spark driver

Name node

Query plan extraction

Query planner

Spark SQL

SQL queries

Slave nodeSpark executor

Cache management with query plans

Data node

Local FS

Broadcastscan info

Direct I/O on HDFS files

SlavenodeSpark executor

cache Direct I/O

Slavenode

Data node

Local FSSpark executorcache Direct I/O

Data node

Local FS

Page 10: Column Cache: Buffer Cache for Columnar Storage on HDFS · §Disk I/O for data analytics is processed by multiple software layers –Redundant buffers cause inefficient memory usage

© 2018 IBM Corporation

Approach

§Cache management with query plans–Master node extracts and broadcasts scan info from query plans–Use anonymous mmap()/munmap() system calls for explicit memory management

§User-level buffer cache with direct I/O on HDFS files–Use JNA to call open(), pread(), fcntl(), close() system calls from Java processes

10 T Yoshimura et al, Column Cache: Buffer Cache for Columnar Storage on HDFS2018/12/5

Master node

Spark driver

Name node

Query plan extraction

Query planner

Spark SQL

SQL queries

Slave nodeSpark executor

Cache management with query plans

Data node

Local FS

Broadcastscan info

Direct I/O on HDFS files

SlavenodeSpark executor

cache Direct I/O

Slavenode

Data node

Local FSSpark executorcache Direct I/O

Data node

Local FS

Page 11: Column Cache: Buffer Cache for Columnar Storage on HDFS · §Disk I/O for data analytics is processed by multiple software layers –Redundant buffers cause inefficient memory usage

© 2018 IBM Corporation

Cache management with query plans

§Build cache entries by associating direct I/O result with corresponding

offset ranges of HDFS files

§Estimates the frequency of scanned file ranges within query plans

–Parquet metadata contains how each column is physically placed in an HDFS file

–Mark a cache entry releasable if its scanned count reaches the frequency

§Monitor system memory and free releasable cache under its pressure

–Apply LRU policy if two entries have the same frequency

11 T Yoshimura et al, Column Cache: Buffer Cache for Columnar Storage on HDFS2018/12/5

Scan col A, col B in file /a/bScan col A, col C in file /a/b

Scan info in

a query plan

Parquet metadata

Cache Entry#1: File /a/b, 0 MB – 24 MB * 2

Cache Entry#2: File /a/b, 24 MB – 48 MB * 1

Cache Entry#3: File /a/b, 48 MB – 64 MB * 1

Cache entries with calculated scan frequency

HDFSFile Col Offset range

/a/b A 0 MB – 24 MB

B 24 MB – 48 MB

C 48 MB – 64 MB

Page 12: Column Cache: Buffer Cache for Columnar Storage on HDFS · §Disk I/O for data analytics is processed by multiple software layers –Redundant buffers cause inefficient memory usage

© 2018 IBM Corporation

Approach

§Cache management with query plans–Master node extracts and broadcasts scan info from query plans–Use anonymous mmap()/munmap() system calls for explicit memory management

§User-level buffer cache with direct I/O on HDFS files–Use JNA to call open(), pread(), fcntl(), close() system calls from Java processes

12 T Yoshimura et al, Column Cache: Buffer Cache for Columnar Storage on HDFS2018/12/5

Master node

Spark driver

Name node

Query plan extraction

Query planner

Spark SQL

SQL queries

Slave nodeSpark executor

Cache management with query plans

Data node

Local FS

Broadcastscan info

Direct I/O on HDFS files

SlavenodeSpark executor

cache Direct I/O

Slavenode

Data node

Local FSSpark executorcache Direct I/O

Data node

Local FS

Page 13: Column Cache: Buffer Cache for Columnar Storage on HDFS · §Disk I/O for data analytics is processed by multiple software layers –Redundant buffers cause inefficient memory usage

© 2018 IBM Corporation

Challenge: direct I/O limitation

§Direct I/O requires a file descriptor that is opened with O_DIRECT–HDFS maintains local file information for each HDFS block

§Direct I/O does not merge multiple read() calls unlike normal FS reads–Linux page cache merges/splits multiple read() requests at block I/O layer–Splitting multiple read() calls hurts performance

13 T Yoshimura et al, Column Cache: Buffer Cache for Columnar Storage on HDFS2018/12/5

Page 14: Column Cache: Buffer Cache for Columnar Storage on HDFS · §Disk I/O for data analytics is processed by multiple software layers –Redundant buffers cause inefficient memory usage

© 2018 IBM Corporation

A slave node

Enabling direct I/O on HDFS files

§Enable direct I/O on HDFS files by exploiting HDFS’s optimization–Short-circuit local reads that optimize reading HDFS files within a local node–Data node passes a requested file as an FD via UNIX domain socket

§Use fcntl() to enable O_DIRECT on a file descriptor (FD)–Use Java reflection to directly access an FD in an input stream

14 T Yoshimura et al, Column Cache: Buffer Cache for Columnar Storage on HDFS2018/12/5

Column cache

Data Node

FS

UNIX domain socket

Linux kernel

Page 15: Column Cache: Buffer Cache for Columnar Storage on HDFS · §Disk I/O for data analytics is processed by multiple software layers –Redundant buffers cause inefficient memory usage

© 2018 IBM Corporation

A slave node

Enabling direct I/O on HDFS files

§Enable direct I/O on HDFS files by exploiting HDFS’s optimization–Short-circuit local reads that optimize reading HDFS files within a local node–Data node passes a requested file as an FD via UNIX domain socket

§Use fcntl() to enable O_DIRECT on a file descriptor (FD)–Use Java reflection to directly access an FD in an input stream

15 T Yoshimura et al, Column Cache: Buffer Cache for Columnar Storage on HDFS2018/12/5

Data Node

FS

UNIX domain socket

Linux kernel

open()

FDColumn cache

Page 16: Column Cache: Buffer Cache for Columnar Storage on HDFS · §Disk I/O for data analytics is processed by multiple software layers –Redundant buffers cause inefficient memory usage

© 2018 IBM Corporation

A slave node

Column cache

Enabling direct I/O on HDFS files

§Enable direct I/O on HDFS files by exploiting HDFS’s optimization–Short-circuit local reads that optimize reading HDFS files within a local node–Data node passes a requested file as an FD via UNIX domain socket

§Use fcntl() to enable O_DIRECT on a file descriptor (FD)–Use Java reflection to directly access an FD in an input stream

16 T Yoshimura et al, Column Cache: Buffer Cache for Columnar Storage on HDFS2018/12/5

Data Node

FS

UNIX domain socket

Linux kernel

open()

fcntl()

Page 17: Column Cache: Buffer Cache for Columnar Storage on HDFS · §Disk I/O for data analytics is processed by multiple software layers –Redundant buffers cause inefficient memory usage

© 2018 IBM Corporation

A slave node

Column cache

Enabling direct I/O on HDFS files

§Enable direct I/O on HDFS files by exploiting HDFS’s optimization–Short-circuit local reads that optimize reading HDFS files within a local node–Data node passes a requested file as an FD via UNIX domain socket

§Use fcntl() to enable O_DIRECT on a file descriptor (FD)–Use Java reflection to directly access an FD in an input stream

17 T Yoshimura et al, Column Cache: Buffer Cache for Columnar Storage on HDFS2018/12/5

Data Node

FS

UNIX domain socket

Linux kernel

open()

fcntl()

pread()

Page 18: Column Cache: Buffer Cache for Columnar Storage on HDFS · §Disk I/O for data analytics is processed by multiple software layers –Redundant buffers cause inefficient memory usage

© 2018 IBM Corporation

Optimizing direct I/O on HDFS files

§Explicitly aggregate pread() requests into a single call

–Existing buffering logic calls multiple pread() for a single column

§Track metadata for Parquet and HDFS in advance of a column scan

–Parquet metadata contains how each column is physically placed in an HDFS file

§Calculate the exact offset and size of local file reads for each column

§Call a single pread() on the calculated consecutive file area

18 T Yoshimura et al, Column Cache: Buffer Cache for Columnar Storage on HDFS2018/12/5

HDFS read:/a/b, offset=24 MB,length=24 MB

Parquet read: /a/b, column=B

Local file read:/c/blk2, offset=0, length=24 MB

Parquet metadata

HDFSFile Offset range Node Local

File

/a/b 0 MB – 24 MB node1 /c/blk1

24 MB – 48 MB node2 /c/blk2

HDFS metadata

HDFSFile Col Offset range

/a/b A 0 MB – 24 MB

B 24 MB – 48 MB

C 48 MB – 64 MB

Page 19: Column Cache: Buffer Cache for Columnar Storage on HDFS · §Disk I/O for data analytics is processed by multiple software layers –Redundant buffers cause inefficient memory usage

© 2018 IBM Corporation

Evaluation

§Microbenchmark

§Single TPC-DS query

§Seven concurrent TPC-DS queries

19 T Yoshimura et al, Column Cache: Buffer Cache for Columnar Storage on HDFS2018/12/5

Page 20: Column Cache: Buffer Cache for Columnar Storage on HDFS · §Disk I/O for data analytics is processed by multiple software layers –Redundant buffers cause inefficient memory usage

© 2018 IBM Corporation

Microbenchmark (parallel file read with 256 Parquet files)

§Column cache speeded up parallel file read with 256 Parquet files–Showed 1.2x throughput in 16 threads

20 T Yoshimura et al, Column Cache: Buffer Cache for Columnar Storage on HDFS2018/12/5

Environment: 1 node, x86 Ubuntu16 in IBM Cloud, 32 vCPUs 128 GB RAM, 1 TB SAN disk, 48 GB JVM heap

Hig

her i

s B

ette

r

0

20

40

60

80

100

120

1 2 4 8 16 32

Thro

ughp

ut (M

ops

/s)

Number of threads

Baseline Column cache

Page 21: Column Cache: Buffer Cache for Columnar Storage on HDFS · §Disk I/O for data analytics is processed by multiple software layers –Redundant buffers cause inefficient memory usage

© 2018 IBM Corporation

0

32

64

96

128

0 10 20 30 40 50 60

Mem

ory

usag

e (G

B)Elapsed time (sec)

Column cache

Column cacheOS cacheUser memory

0

32

64

96

128

0 10 20 30 40 50 60 70 80

Mem

ory

usag

e (G

B)

Elapsed time (sec)

Baseline

OS cacheUser memory

Microbenchmark: Breakdown of Memory usage

§Column cache reduced memory usage by 14GB–Direct I/O removed page cache interleaving during Parquet file reads–No cache eviction but the speedup occurred

21 T Yoshimura et al, Column Cache: Buffer Cache for Columnar Storage on HDFS2018/12/5

Low

er is

Bet

ter

Low

er is

Bet

ter

Environment: 1 node, x86 Ubuntu16 in IBM Cloud, 32 vCPUs 128 GB RAM, 1 TB SAN disk, 48 GB JVM heap

Page 22: Column Cache: Buffer Cache for Columnar Storage on HDFS · §Disk I/O for data analytics is processed by multiple software layers –Redundant buffers cause inefficient memory usage

© 2018 IBM Corporation

0

0.2

0.4

0.6

0.8

1

1.2

0 10 20 30 40 50 60 70 80

Dis

k re

ad

ba

nd

wid

th (

GB

/s)

Elapsed time (sec)

Disk usage

Baseline Column cache

0

5

10

15

20

10 20 30 40 50 60 70

Blo

ck I/O

re

qu

est

(K

op

s/s)

Elapsed time (sec)

Number of I/O requests

Baseline Column cache

Microbenchmark: Breakdown of disk usage

§Column cache increased average disk I/O bandwidth–Aggregated direct I/O decreased number of requests per second

22 T Yoshimura et al, Column Cache: Buffer Cache for Columnar Storage on HDFS2018/12/5

Low

er

is B

etter

Hig

her

is B

etter

average Environment: 1 node, x86 Ubuntu16 in IBM Cloud, 32 vCPUs 128 GB RAM, 1 TB SAN disk, 48 GB JVM heap

Page 23: Column Cache: Buffer Cache for Columnar Storage on HDFS · §Disk I/O for data analytics is processed by multiple software layers –Redundant buffers cause inefficient memory usage

© 2018 IBM Corporation

TPC-DS queries

§Column cache reduced query processing time by up to 4 min compared to heap=48GB

23 T Yoshimura et al, Column Cache: Buffer Cache for Columnar Storage on HDFS2018/12/5

Environment: 1master, 5 slaves x86 Ubuntu16 in IBM Cloud, 32 vCPUs 128 GB RAM, 1 TB SAN disk, TPC-DS scale factor=1000, 48 GB JVM heap

Low

er is

Bet

ter

0

5

10

15

20

25

30

35

40

q23a q23b q14a q14b q16 q64 q94 q95 q80 q69

Ela

psed

tim

e (m

in)

TPC-DS queries

Baseline Column cache

Page 24: Column Cache: Buffer Cache for Columnar Storage on HDFS · §Disk I/O for data analytics is processed by multiple software layers –Redundant buffers cause inefficient memory usage

© 2018 IBM Corporation

Q23a: Memory usage

§Less-frequently-used columns are evicted from user memory

§Increased page cache by 18%, and reduced disk reads by 43%

24 T Yoshimura et al, Column Cache: Buffer Cache for Columnar Storage on HDFS2018/12/5

Environment: 1master, 5 slaves x86 Ubuntu16 in IBM Cloud, 32 vCPUs 128 GB RAM, 1 TB SAN disk, TPC-DS scale factor=1000, 48 GB JVM heap

Column cacheBaseline

0

32

64

96

128

0 4 8 12 16 20 24 28M

emor

y U

sage

(GB)

Elapsed time (min)

User memory OS cache Free

0

32

64

96

128

0 4 8 12 16 20 24 28 32

Mem

ory

Usa

ge (G

B)

Elapsed time (min)

User memory OS cache Free

Page 25: Column Cache: Buffer Cache for Columnar Storage on HDFS · §Disk I/O for data analytics is processed by multiple software layers –Redundant buffers cause inefficient memory usage

© 2018 IBM Corporation

Seven concurrent TPC-DS queries

§Column cache reduced query processing time by up to 9 min

25 T Yoshimura et al, Column Cache: Buffer Cache for Columnar Storage on HDFS2018/12/5

Environment: 1master, 5 slaves x86 Ubuntu16 in IBM Cloud, 32 vCPUs 128 GB RAM, 1 TB SAN disk, TPC-DS scale factor=1000, 48 GB JVM heap

Low

er is

Bet

ter

0

10

20

30

40

50

60

70

q23a q23b q14a q14b q16 q64 q94 q95 q80 q69

Elap

sed

time

(min

)

TPC-DS queries

Baseline Column cache

Page 26: Column Cache: Buffer Cache for Columnar Storage on HDFS · §Disk I/O for data analytics is processed by multiple software layers –Redundant buffers cause inefficient memory usage

© 2018 IBM Corporation

Summary

§Introduced column cache, buffer cache for columnar storage on HDFS–Unifies buffering and caching logic from OS to Parquet–Leverages query plans to eagerly evict user memory and increase OS cache–Exploits HDFS’s optimization for direct I/O on HDFS files

§Speeded up TPC-DS queries of Spark SQL up to 9 min–Increased page cache usage by 18%, and reduced disk reads by 43%

26 T Yoshimura et al, Column Cache: Buffer Cache for Columnar Storage on HDFS2018/12/5

Page 27: Column Cache: Buffer Cache for Columnar Storage on HDFS · §Disk I/O for data analytics is processed by multiple software layers –Redundant buffers cause inefficient memory usage

© 2018 IBM Corporation27 T Yoshimura et al, Column Cache: Buffer Cache for Columnar Storage on HDFS2018/12/5