Column Cache: Buffer Cache for Columnar Storage on HDFS
© 2018 IBM Corporation
Takeshi Yoshimura, Tatsuhiro Chiba, Hiroshi Horii
IBM Research – Tokyo
IEEE BigData 2018
Hadoop distributed file system (HDFS)
§ A major data store in scale-out big data processing architectures
– Offers high scalability and availability to parallel distributed computing
§ Typical use cases: relational query processing and machine learning
– e.g., Spark SQL/ML, Apache Tez, Apache Mahout
T. Yoshimura et al., Column Cache: Buffer Cache for Columnar Storage on HDFS, 2018/12/5
Columnar Storage on HDFS
§ Designed as a special data representation for an HDFS file
– Provides highly compressed data via a column-wise data layout
– Preserves the high scalability and availability of HDFS
§ Optimizes read-heavy queries
[Figure: Parquet file format; a file holds metadata plus a sequence of RowGroups, each storing its data column by column (Column A, Column B, …)]
…And Spark wraps it all with RDDs
Problem: Deep software layering
§ Disk I/O for data analytics is processed by multiple software layers
– Redundant buffers cause inefficient memory usage
§ The software layers do not coordinate with each other
– Growing user memory increases memory pressure and OS cache eviction
[Figure: inside a Java virtual machine process, the RDD, ParquetFileReader, and FSBufferedInputStream each hold their own buffer on top of the OS cache, creating redundant copies of the same data]
Preliminary experiments with TPC-DS
§ A large JVM heap caused a ~10 min slowdown in TPC-DS on Spark
– System memory: 128 GB, scale factor: 1 TB
Environment: 1 master, 5 slaves, x86 Ubuntu 16 in IBM Cloud, 32 vCPUs, 128 GB RAM, 1 TB SAN disk, TPC-DS scale factor=1000
[Chart: elapsed time (min) for TPC-DS queries q23a, q23b, q14a, q14b, q16, q64, q94, q95*, q80, q69 with heap=24GB, 48GB, and 80GB; lower is better]
Slowdowns due to OS-level cache (TPC-DS q23a)
§ A large heap evicted the OS cache that held Spark's temporary files
– A smaller heap caused fewer cache misses but left insufficient memory for Spark
§ It is difficult to coordinate memory management between the OS and Spark
Environment: 1 master, 5 slaves, x86 Ubuntu 16 in IBM Cloud, 32 vCPUs, 128 GB RAM, 1 TB SAN disk, TPC-DS scale factor=1000, Spark off-heap: 48 GB
[Figure: memory usage (GB) over elapsed time (min) for heap=80GB and heap=48GB; stacked areas show user memory, OS cache, and free memory]
Column cache: Buffer cache for columnar storage on HDFS
§ Unifies buffering and caching logic from the OS up to Parquet
– Leverages query plans to eagerly evict user memory and increase OS cache
– Exploits HDFS's optimization for direct I/O on HDFS files
§ Provided as a Java library with Parquet-compatible APIs
– Required only 88 LoC of changes in Spark
§ Modifies only disk reads for Parquet files in HDFS client processes
– Does not require restarting HDFS servers
[Figure: inside a Java virtual machine process, the column cache sits between Spark/Parquet and the local filesystem, adding a new data flow that bypasses the existing HDFS client and OS cache path]
Approach
§ Cache management with query plans
– The master node extracts and broadcasts scan info from query plans
– Uses anonymous mmap()/munmap() system calls for explicit memory management
§ User-level buffer cache with direct I/O on HDFS files
– Uses JNA to call open(), pread(), fcntl(), and close() system calls from Java processes
[Figure: the Spark driver on the master node extracts scan info from Spark SQL query plans and broadcasts it to the slave nodes; each Spark executor manages its cache with the scan info and performs direct I/O against the local data node's filesystem]
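The anonymous-mapping idea behind the explicit memory management can be sketched as follows. This is not the paper's Java/JNA implementation; it is a minimal Python illustration, and the 4 MB entry size is a made-up example value.

```python
import mmap

ENTRY_SIZE = 4 * 1024 * 1024  # hypothetical cache-entry size

def allocate_entry(size=ENTRY_SIZE):
    # mmap(-1, size) creates an anonymous mapping, i.e. memory not backed
    # by any file (MAP_ANONYMOUS under the hood), just as the column
    # cache allocates its entries outside the JVM heap.
    return mmap.mmap(-1, size)

def release_entry(buf):
    # close() unmaps the region (munmap), returning it to the OS
    # immediately instead of waiting for garbage collection; that
    # immediacy is the point of explicit memory management here.
    buf.close()

entry = allocate_entry()
entry[:4] = b"PAR1"          # stash some column bytes
assert entry[:4] == b"PAR1"
release_entry(entry)
```

Because the mapping lives outside the heap, releasing it reduces user-memory pressure without a GC cycle.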
Cache management with query plans
§ Builds cache entries by associating direct I/O results with the corresponding offset ranges of HDFS files
§ Estimates the frequency of scanned file ranges from query plans
– Parquet metadata describes how each column is physically placed in an HDFS file
– Marks a cache entry releasable once its scan count reaches the estimated frequency
§ Monitors system memory and frees releasable cache entries under memory pressure
– Applies an LRU policy when two entries have the same frequency
Scan info in a query plan: scan columns A and B in file /a/b; scan columns A and C in file /a/b

Parquet metadata (HDFS file /a/b):
Column A: 0 MB – 24 MB; Column B: 24 MB – 48 MB; Column C: 48 MB – 64 MB

Cache entries with calculated scan frequency:
Entry #1: file /a/b, 0 MB – 24 MB, frequency 2
Entry #2: file /a/b, 24 MB – 48 MB, frequency 1
Entry #3: file /a/b, 48 MB – 64 MB, frequency 1
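The frequency estimation and eviction policy above can be sketched roughly as follows. The scan info and column layout come from the slide's own example; the CacheEntry class, its field names, and the timestamps are hypothetical.

```python
from collections import Counter

scans = [("/a/b", ["A", "B"]), ("/a/b", ["A", "C"])]   # scan info from query plans
layout = {"A": (0, 24), "B": (24, 48), "C": (48, 64)}  # MB ranges, from Parquet metadata

# Estimated scan frequency per (file, start, end) offset range.
expected = Counter()
for path, cols in scans:
    for col in cols:
        expected[(path,) + layout[col]] += 1

class CacheEntry:
    def __init__(self, key):
        self.key, self.scans, self.last_used = key, 0, 0.0

    def on_scan(self, now):
        self.scans += 1
        self.last_used = now

    @property
    def releasable(self):
        # Releasable once every planned scan of this range has happened.
        return self.scans >= expected[self.key]

entries = {k: CacheEntry(k) for k in expected}
entries[("/a/b", 24, 48)].on_scan(now=1.0)  # column B scanned once
entries[("/a/b", 48, 64)].on_scan(now=2.0)  # column C scanned once
entries[("/a/b", 0, 24)].on_scan(now=3.0)   # column A: first of two scans

# Under memory pressure: free releasable entries first, least recently
# used first among equals (the slide's LRU tie-break).
victims = sorted((e for e in entries.values() if e.releasable),
                 key=lambda e: e.last_used)
assert [v.key for v in victims] == [("/a/b", 24, 48), ("/a/b", 48, 64)]
```

Column A stays cached because the plan says it will be scanned again, while B and C become eviction candidates immediately after their single planned scan.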
Challenge: direct I/O limitation
§ Direct I/O requires a file descriptor opened with O_DIRECT
– HDFS maintains local file information for each HDFS block
§ Direct I/O does not merge multiple read() calls, unlike normal filesystem reads
– The Linux page cache merges and splits read() requests at the block I/O layer
– Splitting one read into multiple read() calls hurts performance
Enabling direct I/O on HDFS files
§ Enables direct I/O on HDFS files by exploiting an existing HDFS optimization
– Short-circuit local reads optimize reading HDFS files on the local node
– The data node passes the requested file as an FD via a UNIX domain socket
§ Uses fcntl() to set O_DIRECT on the received file descriptor (FD)
– Uses Java reflection to directly access the FD inside an input stream
[Figure: on a slave node, the column cache asks the data node to open() a block file, receives the FD over a UNIX domain socket, enables O_DIRECT with fcntl(), and then reads with pread()]
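The FD handshake behind short-circuit local reads can be sketched in Python on a Unix system (the actual system does this from Java via JNA and reflection). The socket pair stands in for the data node's UNIX domain socket, and the file name and payload are invented for illustration.

```python
import fcntl
import os
import socket
import tempfile

payload = b"parquet-column-bytes"            # made-up block contents
tmp = tempfile.NamedTemporaryFile(delete=False)
tmp.write(payload)
tmp.close()

# A socketpair stands in for the data node's UNIX domain socket.
server, client = socket.socketpair(socket.AF_UNIX, socket.SOCK_STREAM)

# "Data node" side: open the local block file and pass the descriptor.
fd = os.open(tmp.name, os.O_RDONLY)
socket.send_fds(server, [b"blk"], [fd])      # Python 3.9+ FD passing

# "Column cache" side: receive the FD, then toggle O_DIRECT with fcntl()
# where the platform supports it (Linux only; real O_DIRECT reads also
# need aligned buffers, so we restore the flags before the plain pread).
msg, fds, _, _ = socket.recv_fds(client, 16, 1)
rfd = fds[0]
if hasattr(os, "O_DIRECT"):
    flags = fcntl.fcntl(rfd, fcntl.F_GETFL)
    fcntl.fcntl(rfd, fcntl.F_SETFL, flags | os.O_DIRECT)
    fcntl.fcntl(rfd, fcntl.F_SETFL, flags)

data = os.pread(rfd, len(payload), 0)        # single positioned read
assert data == payload

for f in (fd, rfd):
    os.close(f)
server.close()
client.close()
os.remove(tmp.name)
```

The point of the design is that no HDFS server code changes: the client merely flips a flag on a descriptor the data node already hands out.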
Optimizing direct I/O on HDFS files
§ Explicitly aggregates pread() requests into a single call
– Existing buffering logic issues multiple pread() calls for a single column
§ Tracks Parquet and HDFS metadata in advance of a column scan
– Parquet metadata describes how each column is physically placed in an HDFS file
§ Calculates the exact offset and size of the local file read for each column
§ Issues a single pread() on the calculated consecutive file area
Example: a Parquet read of column B in /a/b becomes
– HDFS read: /a/b, offset=24 MB, length=24 MB
– Local file read: /c/blk2, offset=0, length=24 MB

Parquet metadata (HDFS file /a/b):
Column A: 0 MB – 24 MB; Column B: 24 MB – 48 MB; Column C: 48 MB – 64 MB

HDFS metadata (file /a/b):
0 MB – 24 MB on node1 as /c/blk1; 24 MB – 48 MB on node2 as /c/blk2
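The offset translation above can be sketched as follows, using the slide's example metadata. The table structures, the plan_column_read helper, and the 24 MB block size are assumptions for illustration; the real system reads these values from Parquet footers and the HDFS name node.

```python
MB = 1024 * 1024
BLOCK_MB = 24  # hypothetical HDFS block size matching the slide's example

# Parquet metadata: (HDFS file, column) -> offset range in MB.
parquet_meta = {("/a/b", "A"): (0, 24),
                ("/a/b", "B"): (24, 48),
                ("/a/b", "C"): (48, 64)}
# HDFS metadata: (HDFS file, block start in MB) -> (node, local block file).
hdfs_meta = {("/a/b", 0): ("node1", "/c/blk1"),
             ("/a/b", 24): ("node2", "/c/blk2")}

def plan_column_read(path, col):
    """Translate one Parquet column scan into a single local pread() call."""
    start, end = parquet_meta[(path, col)]
    block_start = (start // BLOCK_MB) * BLOCK_MB   # which HDFS block holds it
    node, local_file = hdfs_meta[(path, block_start)]
    offset = (start - block_start) * MB            # offset inside the block file
    length = (end - start) * MB                    # one consecutive read
    return node, local_file, offset, length

# Column B lives at 24-48 MB of /a/b, which maps to /c/blk2 on node2.
assert plan_column_read("/a/b", "B") == ("node2", "/c/blk2", 0, 24 * MB)
assert plan_column_read("/a/b", "A") == ("node1", "/c/blk1", 0, 24 * MB)
```

Precomputing (node, file, offset, length) this way is what lets the cache replace many small page-sized reads with one pread() over the whole column chunk.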
Evaluation
§ Microbenchmark
§ Single TPC-DS query
§ Seven concurrent TPC-DS queries
Microbenchmark (parallel file read with 256 Parquet files)
§ The column cache sped up parallel reads of 256 Parquet files
– Achieved 1.2x throughput with 16 threads
Environment: 1 node, x86 Ubuntu 16 in IBM Cloud, 32 vCPUs, 128 GB RAM, 1 TB SAN disk, 48 GB JVM heap
[Chart: throughput (M ops/s) vs. number of threads (1 to 32) for baseline and column cache; higher is better]
Microbenchmark: breakdown of memory usage
§ The column cache reduced memory usage by 14 GB
– Direct I/O removed page-cache interleaving during Parquet file reads
– No cache eviction occurred, yet the speedup still appeared
[Figure: memory usage (GB) over elapsed time (sec) for baseline vs. column cache; stacked areas show user memory, OS cache, column cache, and free memory; lower is better]
Environment: 1 node, x86 Ubuntu 16 in IBM Cloud, 32 vCPUs, 128 GB RAM, 1 TB SAN disk, 48 GB JVM heap
Microbenchmark: breakdown of disk usage
§ The column cache increased average disk I/O bandwidth
– Aggregated direct I/O decreased the number of I/O requests per second
[Figure: disk read bandwidth (GB/s, higher is better) and block I/O requests (K ops/s, lower is better) over elapsed time, baseline vs. column cache]
Environment: 1 node, x86 Ubuntu 16 in IBM Cloud, 32 vCPUs, 128 GB RAM, 1 TB SAN disk, 48 GB JVM heap
TPC-DS queries
§ The column cache reduced query processing time by up to 4 min compared to heap=48GB
[Chart: elapsed time (min) for TPC-DS queries q23a, q23b, q14a, q14b, q16, q64, q94, q95, q80, q69, baseline vs. column cache; lower is better]
Environment: 1 master, 5 slaves, x86 Ubuntu 16 in IBM Cloud, 32 vCPUs, 128 GB RAM, 1 TB SAN disk, TPC-DS scale factor=1000, 48 GB JVM heap
Q23a: memory usage
§ Less frequently used columns are evicted from user memory
§ Increased page-cache usage by 18% and reduced disk reads by 43%
[Figure: memory usage (GB) over elapsed time (min) for baseline vs. column cache; stacked areas show user memory, OS cache, and free memory]
Environment: 1 master, 5 slaves, x86 Ubuntu 16 in IBM Cloud, 32 vCPUs, 128 GB RAM, 1 TB SAN disk, TPC-DS scale factor=1000, 48 GB JVM heap
Seven concurrent TPC-DS queries
§ The column cache reduced query processing time by up to 9 min
[Chart: elapsed time (min) for TPC-DS queries q23a, q23b, q14a, q14b, q16, q64, q94, q95, q80, q69, baseline vs. column cache; lower is better]
Environment: 1 master, 5 slaves, x86 Ubuntu 16 in IBM Cloud, 32 vCPUs, 128 GB RAM, 1 TB SAN disk, TPC-DS scale factor=1000, 48 GB JVM heap
Summary
§ Introduced the column cache, a buffer cache for columnar storage on HDFS
– Unifies buffering and caching logic from the OS up to Parquet
– Leverages query plans to eagerly evict user memory and increase OS cache
– Exploits HDFS's optimization for direct I/O on HDFS files
§ Sped up Spark SQL TPC-DS queries by up to 9 min
– Increased page-cache usage by 18% and reduced disk reads by 43%