In-memory Caching in HDFS
Lower latency, same great taste

Andrew Wang | awang@cloudera.com
Colin McCabe | cmccabe@cloudera.com

[Slides 2-5, diagrams: Alice queries the Hadoop cluster and gets back a result set; fresh data keeps arriving; a rollup job scans all the data.]

6

Problems

• Data hotspots
  • Everyone wants to query some fresh data
  • Shared disks are unable to handle high load

• Mixed workloads
  • Data analyst making small point queries
  • Rollup job scanning all the data
  • Point query latency suffers because of I/O contention

• Same theme: disk I/O contention!

7

How do we solve I/O issues?

• Cache important datasets in memory!
  • Much higher throughput than disk
  • Fast random/concurrent access

• Interesting working sets often fit in cluster memory
  • Traces from Facebook's Hive cluster

• Increasingly affordable to buy a lot of memory
  • Moore's law
  • A 1TB server is ~$40k on HP's website

[Slides 8-12, diagrams: the OS page cache speeds up Alice's repeated query, but the rollup job evicts it; normal reads also pay for extra copies and checksum verification.]

13

Design considerations

1. Explicitly pin hot datasets in memory
2. Place tasks for memory locality
3. Zero-overhead reads of cached data

14

Outline

• Implementation
  • NameNode and DataNode modifications
  • Zero-copy read API

• Evaluation
  • Microbenchmarks
  • MapReduce
  • Impala

• Future work

15

Outline

• Implementation
  • NameNode and DataNode modifications
  • Zero-copy read API

• Evaluation
  • Microbenchmarks
  • MapReduce
  • Impala

• Future work

16

Architecture

The NameNode schedules which DataNodes cache each block of a file.

[Diagram: NameNode assigning cache work to three DataNodes]

17

Architecture

DataNodes periodically send cache reports describing which replicas they have cached.

[Diagram: three DataNodes sending cache reports to the NameNode]

18

Cache Locations API

• Clients can ask the NameNode where a file is cached via getFileBlockLocations
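A minimal sketch of that lookup from Java (the path /datasets/hot is hypothetical; getFileBlockLocations and BlockLocation.getCachedHosts() are the client-side calls involved):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class CacheLocations {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        Path path = new Path("/datasets/hot");          // hypothetical file
        FileStatus stat = fs.getFileStatus(path);

        // One BlockLocation per block; each lists the DataNodes holding the
        // block and, separately, the DataNodes that have it cached in memory.
        for (BlockLocation loc : fs.getFileBlockLocations(stat, 0, stat.getLen())) {
            System.out.println("offset " + loc.getOffset()
                + " hosts=" + String.join(",", loc.getHosts())
                + " cached=" + String.join(",", loc.getCachedHosts()));
        }
    }
}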

[Diagram: DFSClient asking the NameNode, which tracks the cache state of three DataNodes]

19

Cache Directives

• A cache directive describes a file or directory that should be cached
  • Path
  • Cache replication factor

• Stored permanently on the NameNode

• Also have cache pools for access control and quotas, but we won’t be covering that here
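As a sketch, a directive can also be added programmatically through DistributedFileSystem.addCacheDirective; the path /datasets/hot and the pool name "hot-pool" below are hypothetical, and the pool is assumed to already exist:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hdfs.DistributedFileSystem;
import org.apache.hadoop.hdfs.protocol.CacheDirectiveInfo;

public class AddDirective {
    public static void main(String[] args) throws Exception {
        // Assumes fs.defaultFS points at an HDFS cluster.
        DistributedFileSystem dfs =
            (DistributedFileSystem) FileSystem.get(new Configuration());

        // Ask the NameNode to keep 2 replicas of /datasets/hot in memory.
        CacheDirectiveInfo directive = new CacheDirectiveInfo.Builder()
            .setPath(new Path("/datasets/hot"))
            .setReplication((short) 2)
            .setPool("hot-pool")     // hypothetical cache pool
            .build();

        long id = dfs.addCacheDirective(directive);
        System.out.println("added cache directive " + id);
    }
}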


20

mlock

• The DataNode pins each cached block into the page cache using mlock.

• Because we’re using the page cache, the blocks don’t take up any space on the Java heap.
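Plain Java cannot call mlock(2) directly (the DataNode does it through JNI), but a rough sketch of the idea, using a standard MappedByteBuffer as a stand-in for the real pinning code, looks like this:

import java.io.IOException;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardOpenOption;

public class PinBlockSketch {
    // Map a block file and fault its pages into the page cache.
    // The real DataNode then mlocks the mapping via JNI so the kernel
    // cannot evict it; MappedByteBuffer.load() only approximates that.
    public static MappedByteBuffer pin(String blockFile) throws IOException {
        Path path = Paths.get(blockFile);
        try (FileChannel ch = FileChannel.open(path, StandardOpenOption.READ)) {
            MappedByteBuffer buf = ch.map(FileChannel.MapMode.READ_ONLY, 0, ch.size());
            buf.load();   // touch every page so it is resident
            return buf;   // the mapping stays valid after the channel is closed
        }
    }
}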

[Diagram: the DataNode mlocks block files into the page cache; the DFSClient reads them directly]

21

Zero-copy read API

• Clients can use the zero-copy read API to map the cached replica into their own address space

• The zero-copy API avoids the overhead of the read() and pread() system calls

• However, we don't verify checksums when using the zero-copy API
  • The zero-copy API can only be used on cached data, or when the application computes its own checksums
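A minimal client-side sketch of that read path, assuming a hypothetical cached file /cached/data.bin; read(ByteBufferPool, maxLength, ReadOption) and releaseBuffer are the enhanced byte-buffer calls on FSDataInputStream:

import java.io.IOException;
import java.nio.ByteBuffer;
import java.util.EnumSet;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.ReadOption;
import org.apache.hadoop.io.ElasticByteBufferPool;

public class ZeroCopyReadSketch {
    public static void main(String[] args) throws IOException {
        FileSystem fs = FileSystem.get(new Configuration());
        ElasticByteBufferPool pool = new ElasticByteBufferPool();

        try (FSDataInputStream in = fs.open(new Path("/cached/data.bin"))) {
            ByteBuffer buf;
            // Request up to 4MB at a time; SKIP_CHECKSUMS is what makes a
            // cached replica eligible for a true zero-copy mmap read.
            while ((buf = in.read(pool, 4 * 1024 * 1024,
                                  EnumSet.of(ReadOption.SKIP_CHECKSUMS))) != null) {
                process(buf);            // read-only view of the cached block
                in.releaseBuffer(buf);   // always hand the buffer back
            }
        }
    }

    private static void process(ByteBuffer buf) { /* consume buf.remaining() bytes */ }
}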


22

Skipping Checksums

• We would like to skip checksum verification when reading cached data
  • DataNode already checksums when caching the block

• Requirements
  • Client needs to know that the replica is cached
  • DataNode needs to notify the client if the replica is uncached

23

Skipping Checksums

• The DataNode and DFSClient use shared memory segments to communicate which blocks are cached.

[Diagram: the DataNode and DFSClient share a memory segment that tracks which blocks are mlocked in the page cache]

24

Outline

• Implementation
  • NameNode and DataNode modifications
  • Zero-copy read API

• Evaluation
  • Single-Node Microbenchmarks
  • MapReduce
  • Impala

• Future work

25

Test Cluster

• 5 nodes
  • 1 NameNode
  • 4 DataNodes

• 48GB of RAM per node
  • 38GB configured as HDFS cache per DN

• 11x SATA hard disks
• 2x4 core 2.13GHz Westmere Xeon processors
• 10Gbit/s full-bisection bandwidth network

26

Single-Node Microbenchmarks

• How much faster are cached and zero-copy reads?
• Introducing vecsum (vector sum)
  • Computes the sum of a file of doubles
  • Highly optimized: uses SSE intrinsics
  • libhdfs program
  • Can toggle between the various read methods
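vecsum itself is a C program built on libhdfs, but a rough Java analogue of its cached, zero-copy path (a sketch, not the benchmark code) looks like this:

import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.DoubleBuffer;
import java.util.EnumSet;

import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.ReadOption;
import org.apache.hadoop.io.ElasticByteBufferPool;

public class VecSumSketch {
    public static double sum(FileSystem fs, Path file) throws IOException {
        ElasticByteBufferPool pool = new ElasticByteBufferPool();
        double total = 0;
        try (FSDataInputStream in = fs.open(file)) {
            ByteBuffer buf;
            while ((buf = in.read(pool, 8 * 1024 * 1024,
                                  EnumSet.of(ReadOption.SKIP_CHECKSUMS))) != null) {
                DoubleBuffer doubles = buf.asDoubleBuffer();  // no copy, just a view
                while (doubles.hasRemaining()) {
                    total += doubles.get();
                }
                in.releaseBuffer(buf);
            }
        }
        return total;
    }
}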


27

Throughput

[Bar chart: single-node read throughput in GB/s]
  TCP:          0.8
  TCP no csums: 0.9
  SCR:          1.9
  SCR no csums: 2.4
  ZCR:          5.9

28

ZCR 1GB vs 20GB

[Bar chart: ZCR throughput in GB/s]
  1GB working set:  5.9
  20GB working set: 2.7

29

Throughput

• Skipping checksums matters more when going faster
• ZCR gets close to bus bandwidth
  • ~6GB/s
• Need to reuse client-side mmaps for maximum perf
  • page_fault function is 1.16% of cycles in the 1GB case, 17.55% in the 20GB case

30

Client CPU cycles

[Bar chart: client CPU cycles (billions)]
  TCP:          57.6
  TCP no csums: 51.8
  SCR:          27.1
  SCR no csums: 23.4
  ZCR:          12.7

31

Why is ZCR more CPU-efficient?

[Diagrams: the copies and system calls in the standard read path vs. the zero-copy read path]

33

Remote Cached vs. Local Uncached

• Zero-copy is only possible for local cached data
• Is it better to read from remote cache, or local disk?

34

Remote Cached vs. Local Uncached

[Bar chart: remote cached vs. local uncached throughput in MB/s]
  TCP (remote cached read): 841
  iperf (network baseline): 1092
  SCR (local uncached read): 125
  dd (disk baseline): 137

35

Microbenchmark Conclusions

• Short-circuit reads need less CPU than TCP reads
• ZCR is even more efficient, because it avoids a copy
• ZCR goes much faster when re-reading the same data, because it can avoid mmap page faults
• Network and disk may be the bottleneck for remote or uncached reads

36

Outline

• Implementation
  • NameNode and DataNode modifications
  • Zero-copy read API

• Evaluation
  • Microbenchmarks
  • MapReduce
  • Impala

• Future work

37

MapReduce

• Started with example MR jobs
  • Wordcount
  • Grep

• Same 4 DN cluster
  • 38GB HDFS cache per DN
  • 11 disks per DN

• 17GB of Wikipedia text
  • Small enough to fit into cache at 3x replication

• Ran each job 10 times, took the average

38

wordcount and grep

[Bar chart: job completion time in seconds]
  wordcount:        280
  wordcount cached: 275
  grep:              55
  grep cached:       52

Almost no speedup! Throughput is roughly 60MB/s for wordcount and 330MB/s for grep, so these jobs are not I/O bound.

41

wordcount and grep

• End-to-end latency barely changes
  • These MR jobs are simply not I/O bound!

• Best map phase throughput was about 330MB/s
  • 44 disks can theoretically do 4400MB/s

• Further reasoning
  • Long JVM startup and initialization time
  • Many copies in TextInputFormat, doesn't use zero-copy
  • Caching input data doesn't help the reduce step

42

Introducing bytecount

• Trivial version of wordcount
  • Counts # of occurrences of each byte value
  • Heavily CPU optimized

• Each mapper processes an entire block via ZCR (see the sketch below)
  • No additional copies
  • No record slop across block boundaries
  • Fast inner loop

• Very unrealistic job, but serves as a best case
• Also tried a 2GB block size to amortize startup costs
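The talk doesn't show bytecount's source; conceptually its inner loop is just the following sketch over a block handed to the mapper as a ByteBuffer via ZCR:

import java.nio.ByteBuffer;

public class ByteCountSketch {
    // Hypothetical inner loop: count how often each byte value occurs in a
    // block that has been mapped in with the zero-copy read API. The real
    // job wraps this in a MapReduce mapper; only the counting loop is shown.
    public static long[] countBytes(ByteBuffer block) {
        long[] counts = new long[256];
        while (block.hasRemaining()) {
            counts[block.get() & 0xff]++;   // no parsing, no copies
        }
        return counts;
    }
}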


43

bytecount

[Bar chart: job completion times in seconds for grep, grep cached, bytecount, bytecount cached, bytecount-2G, and bytecount-2G cached; times range from roughly 35 to 58 seconds]

Caching makes bytecount about 1.3x faster, but throughput is still only ~500MB/s.

46

MapReduce Conclusions

• Many MR jobs will see marginal improvement
  • Startup costs
  • CPU inefficiencies
  • Shuffle and reduce steps

• Even bytecount sees only modest gains
  • 1.3x faster than disk
  • 500MB/s with caching and ZCR
  • Nowhere close to the GB/s possible from memory

• Needs more work to take full advantage of caching!

47

Outline

• Implementation
  • NameNode and DataNode modifications
  • Zero-copy read API

• Evaluation
  • Microbenchmarks
  • MapReduce
  • Impala

• Future work

48

Impala Benchmarks

• Open-source OLAP database developed by Cloudera
• Tested with Impala 1.3 (CDH 5.0)
• Same 4 DN cluster as the MR section
  • 38GB of 48GB per DN configured as HDFS cache
  • 152GB aggregate HDFS cache
  • 11 disks per DN

49

Impala Benchmarks

• 1TB TPC-DS store_sales table, text format
• count(*) on different numbers of partitions
  • Has to scan all the data, no skipping

• Queries
  • 51GB small query (34% of cache capacity)
  • 148GB big query (98% of cache capacity)
  • Small query with a concurrent workload

• Tested “cold” and “hot”
  • echo 3 > /proc/sys/vm/drop_caches
  • Lets us compare HDFS caching against the OS page cache

50

Small Query

[Bar chart: average response time in seconds]
  Uncached cold: 19.8   (~2550MB/s, I/O bound)
  Cached cold:    5.8
  Uncached hot:   4.0
  Cached hot:     3.0   (~17GB/s)

Cached is 3.4x faster than disk when cold, and still 1.3x faster after warmup, where it also wins on CPU efficiency.

54

Big Query

[Bar chart: average response time in seconds]
  Uncached cold: 48.2
  Cached cold:   11.5
  Uncached hot:  40.9
  Cached hot:     9.4

Cached is 4.2x faster when cold (disk vs. memory) and 4.3x faster when hot: the 148GB query doesn't fit in the page cache, and tasks cannot be scheduled for page cache locality.

57

Small Query with Concurrent Workload

[Bar chart: average response time in seconds for Uncached, Cached, and Cached (not concurrent)]

7x faster when the small query's working set is cached, but still 2x slower than the isolated run due to CPU contention.

60

Impala Conclusions

• HDFS cache is faster than disk or the page cache
  • ZCR is more efficient than SCR from the page cache
  • Better when the working set is approximately cluster memory
  • Can schedule tasks for cache locality

• Significantly better for concurrent workloads
  • 7x faster when contending with a single background query

• Impala performance will only improve
  • Many CPU improvements on the roadmap

61

Outline

• Implementation
  • NameNode and DataNode modifications
  • Zero-copy read API

• Evaluation
  • Microbenchmarks
  • MapReduce
  • Impala

• Future work

62

Future Work

• Automatic cache replacement
  • LRU, LFU, ?

• Sub-block caching
  • Potentially important for automatic cache replacement

• Compression, encryption, serialization
  • Lose many benefits of the zero-copy API

• Write-side caching
  • Enables Spark-like RDDs for all HDFS applications

63

Conclusion

• I/O contention is a problem for concurrent workloads
• HDFS can now explicitly pin working sets into RAM
• Applications can place their tasks for cache locality
• Use the zero-copy API to efficiently read cached data
• Substantial performance improvements
  • ~6GB/s for a single-thread microbenchmark
  • 7x faster for a concurrent Impala workload

67

bytecount (backup slide)

[Bar chart: the same bytecount timings as slide 43]

Less disk parallelism: with a 2GB block size, the 17GB input spans far fewer blocks (and therefore disks), so uncached reads lose parallelism.