In-memory Caching in HDFS: Lower latency, same great taste


Transcript of In-memory Caching in HDFS: Lower latency, same great taste

Slide 1

In-memory Caching in HDFS: Lower latency, same great taste
Andrew Wang | [email protected]
Colin McCabe | [email protected]

Slides 2-5

[Figure sequence: Alice sends a query to the Hadoop cluster and receives a result set; everyone queries the same fresh data; a rollup job scans all the data at the same time.]

Slide 6

Problems
- Data hotspots
  - Everyone wants to query some fresh data
  - Shared disks are unable to handle the high load
- Mixed workloads
  - A data analyst making small point queries
  - A rollup job scanning all the data
  - Point query latency suffers because of I/O contention
- Same theme: disk I/O contention!

Slide 7

How do we solve I/O issues?
- Cache important datasets in memory!
  - Much higher throughput than disk
  - Fast random/concurrent access
- Interesting working sets often fit in cluster memory
  - Traces from Facebook's Hive cluster
- Increasingly affordable to buy a lot of memory
  - Moore's law
  - A 1TB server is ~$40k on HP's website

Slides 8-12

[Figure sequence: Page cache / Repeated query? / Rollup / Extra copies / Checksum verification. The OS page cache can serve Alice's repeated query, but the rollup job evicts it, and even cached reads pay for extra copies and checksum verification.]

Slide 13

Design considerations
- Explicitly pin hot datasets in memory
- Place tasks for memory locality
- Zero-overhead reads of cached data

Slide 14

Outline
- Implementation
  - NameNode and DataNode modifications
  - Zero-copy read API
- Evaluation
  - Microbenchmarks
  - MapReduce
  - Impala
- Future work

Slide 15

Outline (repeated; next up: Implementation)

Slides 16-17

Architecture
- The NameNode schedules which DataNodes cache each block of a file.
- DataNodes periodically send cache reports describing which replicas they have cached.

[Figures: a NameNode coordinating three DataNodes.]

Slide 18

Cache Locations API
- Clients can ask the NameNode where a file is cached via getFileBlockLocations.

[Figure: a DFSClient querying the NameNode.]

Slide 19

Cache Directives
- A cache directive describes a file or directory that should be cached (see the sketch below)
  - Path
  - Cache replication factor
- Stored permanently on the NameNode
- Also have cache pools for access control and quotas, but we won't be covering that here
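To make the directive and locations APIs concrete, here is a minimal Java sketch, assuming a hypothetical file /warehouse/hot_table and a pre-created cache pool named "analytics" (pools can be created with hdfs cacheadmin -addPool):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.BlockLocation;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.hdfs.DistributedFileSystem;
    import org.apache.hadoop.hdfs.protocol.CacheDirectiveInfo;

    public class CacheDirectiveExample {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Path hot = new Path("/warehouse/hot_table");  // hypothetical path
        DistributedFileSystem dfs =
            (DistributedFileSystem) hot.getFileSystem(conf);

        // Pin the file: the directive (path + cache replication factor)
        // is stored permanently on the NameNode.
        long directiveId = dfs.addCacheDirective(
            new CacheDirectiveInfo.Builder()
                .setPath(hot)
                .setReplication((short) 1)    // cache replication factor
                .setPool("analytics")         // hypothetical cache pool
                .build());
        System.out.println("Added cache directive " + directiveId);

        // Cache Locations API: block locations also report which hosts
        // have each block cached, enabling memory-local task placement.
        FileStatus stat = dfs.getFileStatus(hot);
        for (BlockLocation loc :
            dfs.getFileBlockLocations(stat, 0, stat.getLen())) {
          System.out.println("replicas: " + String.join(",", loc.getHosts())
              + "  cached: " + String.join(",", loc.getCachedHosts()));
        }
      }
    }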

Slide 20

mlock
- The DataNode pins each cached block into the page cache using mlock.
- Because we're using the page cache, the blocks don't take up any space on the Java heap.

[Figure: a DFSClient reading a block that the DataNode has mlocked into the page cache.]

Slide 21

Zero-copy read API
- Clients can use the zero-copy read API to map the cached replica into their own address space (a sketch follows the CPU charts below)
- The zero-copy API avoids the overhead of the read() and pread() system calls
- However, we don't verify checksums when using the zero-copy API
- The zero-copy API can only be used on cached data, or when the application computes its own checksums

Slide 22

Skipping Checksums
- We would like to skip checksum verification when reading cached data
  - The DataNode already checksums when caching the block
- Requirements
  - The client needs to know that the replica is cached
  - The DataNode needs to notify the client if the replica is uncached

Slide 23

Skipping Checksums
- The DataNode and DFSClient use shared memory segments to communicate which blocks are cached.

[Figure: a shared memory segment between the DataNode's page cache and the DFSClient.]

Slide 24

Outline (repeated; next up: Single-Node Microbenchmarks)

Slide 25

Test Cluster
- 5 nodes: 1 NameNode, 4 DataNodes
- 48GB of RAM per node
  - Configured 38GB of HDFS cache per DN
- 11x SATA hard disks
- 2x4 core 2.13 GHz Westmere Xeon processors
- 10 Gbit/s full-bisection-bandwidth network

Slide 26

Single-Node Microbenchmarks
- How much faster are cached and zero-copy reads?
- Introducing vecsum (vector sum)
  - Computes sums of a file of doubles
  - Highly optimized: uses SSE intrinsics
  - libhdfs program
  - Can toggle between various read methods

Slides 27-28

[Charts: Throughput; ZCR, 1GB vs. 20GB working set.]

Slide 29

Throughput
- Skipping checksums matters more when going faster
- ZCR gets close to bus bandwidth: ~6GB/s
- Need to reuse client-side mmaps for maximum performance
  - The page_fault function is 1.16% of cycles with the 1GB working set, 17.55% with 20GB

Slides 30-31

[Charts: Client CPU cycles; Why is ZCR more CPU-efficient?]

Slide 32

[Chart: Why is ZCR more CPU-efficient?, continued.]
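To show where the CPU savings come from, here is a minimal Java sketch of the zero-copy read API from slide 21 (vecsum itself is a C libhdfs program; the path here is hypothetical). The client borrows a ByteBuffer that maps the cached replica directly, so there are no read()/pread() syscalls and no copy into a user buffer:

    import java.nio.ByteBuffer;
    import java.util.EnumSet;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.fs.ReadOption;
    import org.apache.hadoop.io.ByteBufferPool;
    import org.apache.hadoop.io.ElasticByteBufferPool;

    public class ZeroCopyReadExample {
      public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        ByteBufferPool pool = new ElasticByteBufferPool();
        long bytesRead = 0;
        try (FSDataInputStream in = fs.open(new Path("/bench/doubles.dat"))) {
          while (true) {
            // For a locally cached replica this returns a slice of the
            // DataNode's mmap. SKIP_CHECKSUMS is safe because the
            // DataNode verified the block when it cached it; for
            // uncached data the call falls back to a copying read.
            ByteBuffer buf = in.read(pool, 8 * 1024 * 1024,
                EnumSet.of(ReadOption.SKIP_CHECKSUMS));
            if (buf == null) break;       // end of file
            bytesRead += buf.remaining(); // process the mapped bytes here
            in.releaseBuffer(buf);        // unmap / return to the pool
          }
        }
        System.out.println("read " + bytesRead + " bytes");
      }
    }

Reusing the client-side mmaps (and the pooled buffers) across reads is what avoids the page faults called out on slide 29.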

Slide 33

Remote Cached vs. Local Uncached
- Zero-copy is only possible for local cached data
- Is it better to read from the remote cache, or from local disk?

Slide 34

[Chart: Remote Cached vs. Local Uncached throughput.]

Slide 35

Microbenchmark Conclusions
- Short-circuit reads need less CPU than TCP reads
- ZCR is even more efficient, because it avoids a copy
- ZCR goes much faster when re-reading the same data, because it can avoid mmap page faults
- Network and disk may be the bottleneck for remote or uncached reads

Slide 36

Outline (repeated; next up: MapReduce)

Slide 37

MapReduce
- Started with example MR jobs
  - Wordcount
  - Grep
- Same 4-DN cluster
  - 38GB HDFS cache per DN
  - 11 disks per DN
- 17GB of Wikipedia text
  - Small enough to fit into cache at 3x replication
- Ran each job 10 times and took the average

Slides 38-40

[Charts: wordcount and grep runtimes. Almost no speedup! ~60MB/s vs. ~330MB/s; not I/O bound.]

Slide 41

wordcount and grep
- End-to-end latency barely changes
- These MR jobs are simply not I/O bound!
  - Best map-phase throughput was about 330MB/s
  - 44 disks can theoretically do 4400MB/s
- Further reasoning
  - Long JVM startup and initialization time
  - Many copies in TextInputFormat; it doesn't use zero-copy
  - Caching input data doesn't help the reduce step

Slide 42

Introducing bytecount
- Trivial version of wordcount: counts the # of occurrences of each byte value
- Heavily CPU-optimized
  - Each mapper processes an entire block via ZCR
  - No additional copies
  - No record slop across block boundaries
  - Fast inner loop (sketched below)
- Very unrealistic job, but serves as a best case
- Also tried a 2GB block size to amortize startup costs
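bytecount's source isn't shown in the talk, but the heart of it is easy to sketch. A hypothetical reconstruction of the inner loop in Java: the mapper receives an entire block as a ByteBuffer from the zero-copy API and histograms byte values with no per-record parsing:

    import java.nio.ByteBuffer;

    public class ByteCountLoop {
      // Count occurrences of each byte value in one HDFS block obtained
      // via the zero-copy read API (see the earlier sketch). No copies,
      // no record boundaries, no slop: just a tight loop.
      static long[] countBytes(ByteBuffer block) {
        long[] counts = new long[256];
        while (block.hasRemaining()) {
          counts[block.get() & 0xff]++;
        }
        return counts;
      }
    }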

Slides 43-45

[Charts: bytecount runtimes. 1.3x faster, but still only ~500MB/s.]

Slide 46

MapReduce Conclusions
- Many MR jobs will see marginal improvement
  - Startup costs
  - CPU inefficiencies
  - Shuffle and reduce steps
- Even bytecount sees only modest gains
  - 1.3x faster than disk
  - 500MB/s with caching and ZCR
  - Nowhere close to the GB/s possible with memory
- Needs more work to take full advantage of caching!

Slide 47

Outline (repeated; next up: Impala)

Slide 48

Impala Benchmarks
- Open-source OLAP database developed by Cloudera
- Tested with Impala 1.3 (CDH 5.0)
- Same 4-DN cluster as the MR section
  - 38GB of 48GB per DN configured as HDFS cache
  - 152GB aggregate HDFS cache
  - 11 disks per DN

Slide 49

Impala Benchmarks
- 1TB TPC-DS store_sales table, text format
- count(*) on different numbers of partitions
  - Has to scan all the data; no skipping
- Queries
  - 51GB small query (34% of cache capacity)
  - 148GB big query (98% of cache capacity)
  - Small query with a concurrent workload
- Tested cold and hot
  - echo 3 > /proc/sys/vm/drop_caches
  - Lets us compare HDFS caching against the page cache

Slides 50-53

[Charts: Small Query, cold vs. hot. 2550 MB/s from disk vs. 17 GB/s from memory: I/O bound! 3.4x faster, disk vs. memory; 1.3x after page-cache warmup, but HDFS caching still wins on CPU efficiency.]

Slides 54-56

[Charts: Big Query. 4.2x faster, disk vs. memory; 4.3x faster when the working set doesn't fit in the page cache, which also cannot be scheduled for locality.]

Slides 57-59

[Charts: Small Query with Concurrent Workload. 7x faster when the small query's working set is cached; 2x slower than in isolation due to CPU contention.]

Slide 60

Impala Conclusions
- HDFS cache is faster than disk or the page cache
  - ZCR is more efficient than SCR from the page cache
  - Better when the working set is approximately cluster memory
  - Can schedule tasks for cache locality
- Significantly better for concurrent workloads
  - 7x faster when contending with a single background query
- Impala performance will only improve
  - Many CPU improvements on the roadmap

Slide 61

Outline (repeated; next up: Future work)

Slide 62

Future Work
- Automatic cache replacement
  - LRU, LFU, ?
- Sub-block caching
  - Potentially important for automatic cache replacement
- Compression, encryption, serialization
  - Lose many of the benefits of the zero-copy API
- Write-side caching
  - Enables Spark-like RDDs for all HDFS applications

Slide 63

Conclusion
- I/O contention is a problem for concurrent workloads
- HDFS can now explicitly pin working sets into RAM
- Applications can place their tasks for cache locality
- Use the zero-copy API to efficiently read cached data
- Substantial performance improvements
  - 6GB/s for a single-thread microbenchmark
  - 7x faster for a concurrent Impala workload

Slide 67 (backup)

[Chart: bytecount with less disk parallelism.]