
EVCache: Lowering Costs for a Low Latency Cache with RocksDB
Scott Mansfield
Vu Nguyen
EVCache

90 seconds

What do caches touch?

Signing up*
Logging in
Choosing a profile
Picking liked videos
Personalization*
Loading home page*
Scrolling home page*
A/B tests
Video image selection

Searching*
Viewing title details
Playing a title*
Subtitle / language prefs
Rating a title
My List
Video history*
UI strings
Video production*

* multiple caches involved

Home Page Request

Key-Value store optimized for AWS and tuned for Netflix use cases

Ephemeral Volatile Cache

What is EVCache?

Distributed, sharded, replicated key-value store
Tunable in-region and global replication
Based on Memcached
Resilient to failure
Topology aware
Linearly scalable
Seamless deployments

Why Optimize for AWS

Instances disappear
Zones fail
Regions become unstable
Network is lossy
Customer requests bounce between regions

Failures happen and we test all the time

EVCache Use @ Netflix

Hundreds of terabytes of data
Trillions of ops / day
Tens of billions of items stored
Tens of millions of ops / sec
Millions of replications / sec
Thousands of servers
Hundreds of instances per cluster
Hundreds of microservice clients
Tens of distinct clusters
3 regions
4 engineers

Architecture

Server

Memcached

EVCar

Application

Client Library

Client

Eureka (Service Discovery)

Architecture

us-west-2a us-west-2b us-west-2c

Client Client Client

Reading (get)

us-west-2a us-west-2b us-west-2c

Client

Primary Secondary

Writing (set, delete, add, etc.)

us-west-2a us-west-2b us-west-2c

Client Client Client
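The read and write paths above can be sketched as follows. This is a minimal illustration of the replication topology, not the actual EVCache client API: the `Replica` and `Client` types and the zone names are assumptions for the sketch. A get is served by the replica in the client's own zone and falls back to another zone on a miss; a set fans out to every zone.

```go
package main

import (
	"errors"
	"fmt"
)

// ErrMiss signals that a replica does not have the key.
var ErrMiss = errors.New("cache miss")

// Replica is a stand-in for one EVCache memcached replica in a zone.
type Replica struct {
	Zone string
	data map[string]string
}

func (r *Replica) Get(key string) (string, error) {
	v, ok := r.data[key]
	if !ok {
		return "", ErrMiss
	}
	return v, nil
}

func (r *Replica) Set(key, value string) error {
	r.data[key] = value
	return nil
}

// Client reads zone-locally first, then falls back to the other zones;
// writes are sent to every zone's replica.
type Client struct {
	LocalZone string
	Replicas  []*Replica
}

func (c *Client) Get(key string) (string, error) {
	// First pass: the replica in our own availability zone. Second pass: the rest.
	for _, localPass := range []bool{true, false} {
		for _, r := range c.Replicas {
			if (r.Zone == c.LocalZone) == localPass {
				if v, err := r.Get(key); err == nil {
					return v, nil
				}
			}
		}
	}
	return "", ErrMiss
}

func (c *Client) Set(key, value string) error {
	// Fan the write out to all replicas (one per zone).
	for _, r := range c.Replicas {
		if err := r.Set(key, value); err != nil {
			return err
		}
	}
	return nil
}

func main() {
	zones := []string{"us-west-2a", "us-west-2b", "us-west-2c"}
	var replicas []*Replica
	for _, z := range zones {
		replicas = append(replicas, &Replica{Zone: z, data: map[string]string{}})
	}
	c := &Client{LocalZone: "us-west-2b", Replicas: replicas}
	c.Set("title:123", "some precomputed value")
	v, _ := c.Get("title:123") // served from the us-west-2b replica
	fmt.Println(v)
}
```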

Use Case: Lookaside Cache

Application

Client Library

Client Ribbon Client

S S S S

C C C C

Data Flow
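The lookaside pattern itself is small enough to sketch. The `Cache` and `Store` interfaces below are illustrative stand-ins, not the Netflix client or Ribbon APIs: read through the cache, and on a miss load from the service of record and backfill.

```go
package main

import (
	"errors"
	"fmt"
)

var ErrMiss = errors.New("cache miss")

// Cache and Store are stand-ins for the EVCache client and the service of record.
type Cache interface {
	Get(key string) (string, error)
	Set(key, value string) error
}

type Store interface {
	Load(key string) (string, error)
}

// lookasideGet implements the lookaside pattern: try the cache first,
// fall back to the backing store on a miss, then backfill the cache.
func lookasideGet(c Cache, s Store, key string) (string, error) {
	if v, err := c.Get(key); err == nil {
		return v, nil // cache hit
	}
	v, err := s.Load(key) // cache miss: go to the service of record
	if err != nil {
		return "", err
	}
	_ = c.Set(key, v) // best-effort backfill; a failed set only costs a future miss
	return v, nil
}

// Minimal in-memory implementations so the sketch runs.
type mapCache map[string]string

func (m mapCache) Get(k string) (string, error) {
	if v, ok := m[k]; ok {
		return v, nil
	}
	return "", ErrMiss
}
func (m mapCache) Set(k, v string) error { m[k] = v; return nil }

type mapStore map[string]string

func (m mapStore) Load(k string) (string, error) { return m[k], nil }

func main() {
	cache := mapCache{}
	store := mapStore{"profile:42": "comedy,sci-fi"}
	v, _ := lookasideGet(cache, store, "profile:42") // miss -> load -> backfill
	fmt.Println(v)
	v, _ = lookasideGet(cache, store, "profile:42") // now a hit
	fmt.Println(v)
}
```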

Use Case: Transient Data Store

Application

Client Library

Client

Application

Client Library

Client

Application

Client Library

Client

Time

Use Case: Primary Store

Offline / Nearline Precomputes for Recommendations

Online Services

Offline Services

Online Application

Client Library

Client

Data Flow

Online Services

Offline Services

Use Case: Versioned Primary Store

Offline Compute

Online Application

Client Library

Client

Data Flow

Archaius (Dynamic Properties)

Control System (Valhalla)

Use Case: High Volume && High Availability

Compute & Publish on schedule

Data Flow

Application

Client Library

In-memory Remote Ribbon Client

Optional

S S S S

C C C C

Pipeline of Personalization

Compute A

Compute B Compute C

Compute D

Online Services

Offline Services

Compute E

Data Flow

Online 1 Online 2

Additional Features

Kafka
● Global data replication
● Consistency metrics

Key Iteration
● Cache warming
● Lost instance recovery
● Backup (and restore)

Additional Features (Kafka)

Global data replication
Consistency metrics

Cross-Region Replication

[Diagram: in Region A, the APP mutates its local cache (1) and sends metadata to Kafka (2); the Repl Relay polls the message (3), gets the data for a set from the local cache (4), and sends the message over HTTPS (5) to the Repl Proxy in Region B, which mutates the Region B cache (6) so applications there read the data locally (7). The mirror-image path (Kafka, Repl Relay, Repl Proxy) exists in Region B.]
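A minimal sketch of the Repl Relay loop described above. The `MetadataSource`, `LocalCache`, and `RemoteProxy` interfaces and the `Mutation` fields are assumptions standing in for the Kafka consumer, the region-local EVCache, and the HTTPS call to the Repl Proxy; only metadata travels through Kafka, so the value for a set is fetched from the local cache before it is shipped.

```go
package main

import "log"

// Mutation is the metadata published for every cache write or delete.
// Only metadata goes through Kafka; the value for a set is read locally.
type Mutation struct {
	Op  string // "set" or "delete"
	Key string
	TTL int
}

// MetadataSource stands in for the Kafka consumer (step 3: poll msg).
type MetadataSource interface {
	Poll() (Mutation, bool)
}

// LocalCache stands in for the region-local EVCache (step 4: get data for set).
type LocalCache interface {
	Get(key string) ([]byte, error)
}

// RemoteProxy stands in for the HTTPS call to the Repl Proxy in the other
// region (step 5: send msg), which then mutates its local cache (step 6).
type RemoteProxy interface {
	Send(m Mutation, value []byte) error
}

// relay runs the Repl Relay loop: poll metadata, attach the value for sets,
// and forward the mutation to the remote region.
func relay(src MetadataSource, local LocalCache, remote RemoteProxy) {
	for {
		m, ok := src.Poll()
		if !ok {
			return // no more messages in this sketch
		}
		var value []byte
		if m.Op == "set" {
			v, err := local.Get(m.Key) // deletes carry no payload
			if err != nil {
				log.Printf("skip %s: %v", m.Key, err)
				continue
			}
			value = v
		}
		if err := remote.Send(m, value); err != nil {
			log.Printf("replication of %s failed: %v", m.Key, err)
		}
	}
}

func main() {} // wiring of Kafka, the caches, and HTTPS is omitted in this sketch
```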

Additional Features (Key Iteration)

Cache warming
Lost instance recovery
Backup (and restore)

Cache Warming

Cache Warmer (Spark)

Application

Client Library

Client

Control

S3

Data Flow

Metadata Flow

Control Flow

Moneta

Next-generation EVCache server

Moneta

Moneta: The Goddess of Memory
Juno Moneta: The Protectress of Funds for Juno

● Evolution of the EVCache server
● Cost optimization
● EVCache on SSD
● Ongoing reduction in EVCache cost per stream
● Takes advantage of global request patterns

Old Server

● Stock Memcached and EVCar (sidecar)
● All data stored in RAM in Memcached
● Expensive with global expansion / N+1 architecture

Memcached

EVCar

external

Optimization

● Global data means many copies
● Access patterns are heavily region-oriented
● In one region:
○ Hot data is used often
○ Cold data is almost never touched
● Keep hot data in RAM, cold data on SSD
● Size RAM for working set, SSD for overall dataset

New Server

● Adds Rend and Mnemonic
● Still looks like Memcached
● Unlocks cost-efficient storage & server-side intelligence

go get github.com/netflix/rend

Rend

Rend

● High-performance Memcached proxy & server
● Written in Go
○ Powerful concurrency primitives
○ Productive and fast
● Manages the L1/L2 relationship (see sketch below)
● Tens of thousands of connections

Rend

● Modular set of libraries and an example main()
● Manages connections, request orchestration, and communication
● Low-overhead metrics library
● Multiple orchestrators
● Parallel locking for data integrity
● Efficient connection pool
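A rough sketch of the L1/L2 orchestration Rend performs, using plain interfaces instead of Rend's actual handler types (the names below are assumptions): gets are served from L1 (Memcached in RAM) when possible and fall through to L2 (Mnemonic on SSD); sets are written to both levels.

```go
package main

import (
	"errors"
	"fmt"
)

var ErrMiss = errors.New("miss")

// Handler is a stand-in for a Rend backend handler (L1 = Memcached in RAM,
// L2 = Mnemonic on SSD); the real interface in github.com/netflix/rend differs.
type Handler interface {
	Get(key string) ([]byte, error)
	Set(key string, value []byte) error
}

// Orchestrator ties an L1 and an L2 together the way the proxy does.
type Orchestrator struct {
	L1, L2 Handler
}

// Get tries RAM first and only touches SSD on an L1 miss.
func (o *Orchestrator) Get(key string) ([]byte, error) {
	if v, err := o.L1.Get(key); err == nil {
		return v, nil
	}
	return o.L2.Get(key)
}

// Set writes to both levels so the working set stays hot in RAM
// while the full dataset lives on SSD.
func (o *Orchestrator) Set(key string, value []byte) error {
	if err := o.L2.Set(key, value); err != nil {
		return err
	}
	return o.L1.Set(key, value)
}

// memHandler is a trivial in-memory Handler so the sketch runs.
type memHandler map[string][]byte

func (m memHandler) Get(k string) ([]byte, error) {
	if v, ok := m[k]; ok {
		return v, nil
	}
	return nil, ErrMiss
}
func (m memHandler) Set(k string, v []byte) error { m[k] = v; return nil }

func main() {
	o := &Orchestrator{L1: memHandler{}, L2: memHandler{}}
	o.Set("k", []byte("v"))
	v, _ := o.Get("k")
	fmt.Println(string(v))
}
```

Writing L2 before L1 keeps SSD as the authoritative copy in this sketch; the real orchestrators, parallel locking, and connection pooling are more involved.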

Server Loop

Request Orchestration

Backend Handlers

METRICS

Connection Management

Protocol

Mnemonic

Mnemonic

● Manages data storage on SSD
● Uses Rend server libraries
○ Handles Memcached protocol
● Maps Memcached ops to RocksDB ops (see sketch below)

Rend Server Core Lib (Go)

Mnemonic Op Handler (Go)

Mnemonic Core (C++)

RocksDB (C++)
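Conceptually, the op mapping is a thin translation layer: each Memcached verb becomes a RocksDB call, with the expiry carried alongside the value because RocksDB has no native TTL in this configuration. The `KV` interface and the value layout below are assumptions for the sketch, not Mnemonic's actual storage format.

```go
package main

import (
	"encoding/binary"
	"errors"
	"time"
)

// KV is a stand-in for the RocksDB instance Mnemonic Core wraps
// (Get/Put/Delete mirror the RocksDB operations it calls).
type KV interface {
	Get(key []byte) ([]byte, error)
	Put(key, value []byte) error
	Delete(key []byte) error
}

var ErrExpired = errors.New("expired")

// set maps a Memcached "set" to a RocksDB Put, prefixing the value with
// an absolute expiry timestamp (assumed layout for this sketch).
func set(db KV, key, value []byte, ttl time.Duration) error {
	buf := make([]byte, 8+len(value))
	binary.BigEndian.PutUint64(buf, uint64(time.Now().Add(ttl).Unix()))
	copy(buf[8:], value)
	return db.Put(key, buf)
}

// get maps a Memcached "get" to a RocksDB Get and applies the TTL check
// that Memcached would otherwise do for us.
func get(db KV, key []byte) ([]byte, error) {
	buf, err := db.Get(key)
	if err != nil {
		return nil, err
	}
	if len(buf) < 8 {
		return nil, errors.New("corrupt record")
	}
	exp := int64(binary.BigEndian.Uint64(buf))
	if time.Now().Unix() > exp {
		return nil, ErrExpired // expired records still exist on SSD under FIFO
	}
	return buf[8:], nil
}

// del maps a Memcached "delete" straight to a RocksDB Delete.
func del(db KV, key []byte) error { return db.Delete(key) }

func main() {} // wiring to a real RocksDB handle is omitted in this sketch
```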

Why RocksDB?

● Fast at medium to high write load
○ Disk write load is higher than read load (because Memcached absorbs most reads)

● Predictable RAM Usage

[Diagram: writes land in an in-memory memtable (Record A, Record B); full memtables are flushed to SST files on disk, which accumulate over time. SST: Static Sorted Table]

How we use RocksDB

● No Level Compaction
○ Generated too much traffic to SSD
○ High and unpredictable read latencies
● No Block Cache
○ Rely on local Memcached
● No Compression

How we use RocksDB

● FIFO Compaction
○ SSTs ordered by time
○ Oldest SST deleted when full
○ Reads access every SST until record found

How we use RocksDB

● Full File Bloom Filters
○ Full filter reduces unnecessary SSD reads
● Bloom Filters and indices pinned in memory
○ Minimize SSD access per request
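Taken together, the settings on the last three slides translate into a RocksDB configuration roughly like the sketch below. It assumes the third-party gorocksdb binding (github.com/tecbot/gorocksdb); exact option names vary by binding and RocksDB version, and the real Mnemonic Core sets these options through its C++ layer. The sizes are illustrative.

```go
package main

import "github.com/tecbot/gorocksdb"

// openShard opens one RocksDB instance tuned the way the slides describe:
// FIFO compaction, no block cache, no compression, and Bloom filters so
// most SSTs can be skipped without touching the SSD.
func openShard(path string) (*gorocksdb.DB, error) {
	opts := gorocksdb.NewDefaultOptions()
	opts.SetCreateIfMissing(true)

	// FIFO compaction: SSTs are kept in time order and the oldest files
	// are dropped once the total size limit is reached.
	opts.SetCompactionStyle(gorocksdb.FIFOCompactionStyle)
	fifo := gorocksdb.NewDefaultFIFOCompactionOptions()
	fifo.SetMaxTableFilesSize(200 << 30) // ~200 GB per shard (illustrative)
	opts.SetFIFOCompactionOptions(fifo)

	// No compression: spend the CPU elsewhere.
	opts.SetCompression(gorocksdb.NoCompression)

	bbto := gorocksdb.NewDefaultBlockBasedTableOptions()
	// No block cache: the local Memcached (L1) already keeps hot data in RAM.
	// With no block cache, index and filter blocks are held in table-reader
	// memory, i.e. effectively pinned in RAM.
	bbto.SetNoBlockCache(true)
	// Bloom filters cut unnecessary SSD reads; with FIFO compaction a get may
	// otherwise have to probe every SST until the record is found.
	bbto.SetFilterPolicy(gorocksdb.NewBloomFilter(10))
	opts.SetBlockBasedTableFactory(bbto)

	return gorocksdb.OpenDb(opts, path)
}

func main() {
	db, err := openShard("/mnt/data/evcache-shard-0") // hypothetical path
	if err != nil {
		panic(err)
	}
	defer db.Close()
}
```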

How we use RocksDB

● Records sharded across multiple RocksDB instances per node (see sketch below)
○ Reduces the number of files checked, to decrease latency

[Diagram: Mnemonic Core hashes each key (e.g. ABC, XYZ) to one of several RocksDB instances (R) on the node.]
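Shard selection is a stable hash of the key; the shard count and hash function here are illustrative, not what Mnemonic actually uses.

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// shardFor maps a key to one of n RocksDB instances on the node. Any stable
// hash works; FNV-1a is used here purely for illustration.
func shardFor(key string, n int) int {
	h := fnv.New32a()
	h.Write([]byte(key))
	return int(h.Sum32() % uint32(n))
}

func main() {
	const shards = 8 // several RocksDB instances per node
	for _, k := range []string{"ABC", "XYZ"} {
		fmt.Printf("key %q -> rocksdb shard %d\n", k, shardFor(k, shards))
	}
}
```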

Region-Locality Optimizations

● Replication and batch updates go only to RocksDB*
○ Keeps region-local and “hot” data in memory
○ Separate network port for “off-line” requests
○ Memcached data “replaced”
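A sketch of how the off-line port's write path can differ from the standard one (the `Level` interface and function names are assumptions): batch and replication sets go only to RocksDB, and the stale copy in Memcached is replaced rather than promoted as a fresh, hot write, so RAM stays reserved for the region-local working set.

```go
package main

// Level is a stand-in for one storage level behind Rend (L1 Memcached
// or L2 Mnemonic/RocksDB); names are assumptions for this sketch.
type Level interface {
	Set(key string, value []byte) error
	Delete(key string) error
}

// onlineSet is the standard-port path: write both levels so the record is
// immediately hot in RAM.
func onlineSet(l1, l2 Level, key string, value []byte) error {
	if err := l2.Set(key, value); err != nil {
		return err
	}
	return l1.Set(key, value)
}

// batchSet is the off-line-port path used for replication and precompute
// publishes: write only to SSD and drop the stale copy from L1 (the real
// server may overwrite it in place instead), so batch traffic never evicts
// the region-local working set from RAM.
func batchSet(l1, l2 Level, key string, value []byte) error {
	if err := l2.Set(key, value); err != nil {
		return err
	}
	return l1.Delete(key) // the next online read repopulates L1 if needed
}

func main() {}
```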

FIFO Limitations

● FIFO compaction is not suitable for all use cases
○ Very frequently updated records may push out valid records
● Expired records still exist
● Requires larger Bloom filters

[Diagram: SSTs over time holding multiple versions of records A (A1–A3) and B (B1–B3) alongside records C through H, illustrating how frequent updates fill SSTs and push older, still-valid records toward deletion.]

AWS Instance Type

● i2.xlarge
○ 4 vCPU
○ 30 GB RAM
○ 800 GB SSD
■ 32K IOPS (4 KB pages)
■ ~130 MB/sec

Moneta Perf Benchmark (High-Vol Online Requests)

Moneta Perf Benchmark (cont)

Moneta Perf Benchmark (cont)

Moneta Perf Benchmark (cont)

Moneta Performance in Production (Batch Systems)

● Request ‘get’ 95%, 99% = 729 μs, 938 μs
● L1 ‘get’ 95%, 99% = 153 μs, 191 μs
● L2 ‘get’ 95%, 99% = 1005 μs, 1713 μs

● ~20 KB records
● ~99% overall hit rate
● ~90% L1 hit rate

Moneta Performance in Prod (High Vol-Online Req)

● Request ‘get’ 95%, 99% = 174 μs, 588 μs
● L1 ‘get’ 95%, 99% = 145 μs, 190 μs
● L2 ‘get’ 95%, 99% = 770 μs, 1330 μs

● ~1 KB records
● ~98% overall hit rate
● ~97% L1 hit rate

Get Percentiles:
● 50th: 102 μs (101 μs)
● 75th: 120 μs (115 μs)
● 90th: 146 μs (137 μs)
● 95th: 174 μs (166 μs)
● 99th: 588 μs (427 μs)
● 99.5th: 733 μs (568 μs)
● 99.9th: 1.39 ms (979 μs)

Moneta Performance in Prod (High Vol-Online Req)

Set Percentiles:
● 50th: 97.2 μs (87.2 μs)
● 75th: 107 μs (101 μs)
● 90th: 125 μs (115 μs)
● 95th: 138 μs (126 μs)
● 99th: 177 μs (152 μs)
● 99.5th: 208 μs (169 μs)
● 99.9th: 1.19 ms (318 μs)

Latencies: peak (trough)

70% reduction in cost*

Challenges/Concerns

● Less visibility
○ Overall data size is unclear because of duplicates and expired records
○ Restrict unique data set to ½ of max for precompute batch data
● Lower max throughput than the Memcached-based server
○ Higher CPU usage
○ Planning must be better so we can handle unusually high request spikes

Current/Future Work

● Investigate blob storage feature
○ Less data read/written from SSD during level compaction
○ Lower latency, higher throughput
○ Better view of total data size
● Purging expired SSTs earlier
○ Useful in “short” TTL use cases
○ May purge 60%+ of SSTs earlier than FIFO compaction
○ Reduce worst-case latency
○ Better visibility of overall data size

● Inexpensive Deduping for Batch Data

Thank You

smansfield@netflix.com (@sgmansfield)
nguyenvu@netflix.com
techblog.netflix.com

Failure Resilience in Client

● Operation Fast Failure
● Tunable Retries
● Operation Queues
● Tunable Latch for Mutations
● Async Replication through Kafka
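The tunable latch for mutations can be pictured as a countdown the caller waits on: the write fans out to every replica asynchronously and the latch releases once the configured number of acknowledgements arrive, or the operation fails fast on timeout. A conceptual sketch, not the EVCache client's actual API:

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// Replica is a stand-in for one zone's cache replica write.
type Replica func(key, value string) error

// setWithLatch fans a write out to all replicas and returns once `required`
// of them have succeeded, or fails fast when the timeout fires; the remaining
// writes continue in the background.
func setWithLatch(replicas []Replica, key, value string, required int, timeout time.Duration) bool {
	acks := make(chan struct{}, len(replicas))
	var wg sync.WaitGroup
	for _, r := range replicas {
		wg.Add(1)
		go func(r Replica) {
			defer wg.Done()
			if err := r(key, value); err == nil {
				acks <- struct{}{}
			}
		}(r)
	}
	go func() { wg.Wait(); close(acks) }()

	deadline := time.After(timeout)
	got := 0
	for {
		select {
		case _, ok := <-acks:
			if !ok {
				return got >= required // all replicas finished
			}
			got++
			if got >= required {
				return true // latch released
			}
		case <-deadline:
			return false // fast failure: the caller decides whether to retry
		}
	}
}

func main() {
	ok := func(k, v string) error { return nil }
	replicas := []Replica{ok, ok, ok}
	fmt.Println(setWithLatch(replicas, "k", "v", 2, 100*time.Millisecond))
}
```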

Consistency Metrics

[Diagram: in Region A, the APP mutates the cache (1) and sends metadata (2) to Kafka; the Consistency Checker polls the message (3), pulls the data (4) from the cache, and reports (5) to Atlas (Metrics Backend), which feeds client dashboards.]

Lost Instance Recovery

Cache Warmer (Spark)

Application

Client Library

Client

S3

Partial Data Flow

Metadata Flow

Control Flow

Control

Zone A Zone B

Data Flow

Backup (and Restore)

Cache Warmer (Spark)

Application

Client Library

Client

Control

S3

Data Flow

Control Flow

Moneta in Production

● Serving all of our personalization data
● Rend runs with two ports:
○ One for standard users (read-heavy or active data management)
○ Another for async and batch users: replication and precompute
● Maintains working set in RAM
● Optimized for precomputes
○ Smartly replaces data in L1

external internal

EVCar

Memcached (RAM)

Mnemonic (SSD)

Std

Batch

Rend batching backend