Bare Metal Library - NVIDIA · Bare Metal Myself. Modern Hardware? Device-to-device networks....

26
Bare Metal Library Abstractions for modern hardware Cyprien Noel

Transcript of Bare Metal Library - NVIDIA · Bare Metal Myself. Modern Hardware? Device-to-device networks....

Page 1: Bare Metal Library - NVIDIA · Bare Metal Myself. Modern Hardware? Device-to-device networks. Number crunching GPU ... Video, crypto etc. Moving from ms software to µs hardware.

Bare Metal LibraryAbstractions for modern hardwareCyprien Noel

Page 2: Bare Metal Library - NVIDIA · Bare Metal Myself. Modern Hardware? Device-to-device networks. Number crunching GPU ... Video, crypto etc. Moving from ms software to µs hardware.

1. Modern Hardware?2. New challenges & opportunities3. Three use cases

○ Current solutions○ Leveraging hardware○ Simple abstraction

Plan

Page 3: Bare Metal Library - NVIDIA · Bare Metal Myself. Modern Hardware? Device-to-device networks. Number crunching GPU ... Video, crypto etc. Moving from ms software to µs hardware.

● High performance trading systems○ Lock-free algos, distributed systems

● H2O○ Distributed CPU machine learning, async SGD

● Flickr○ Scaling deep learning on GPU ⎼ Multi GPU Caffe○ RDMA, multicast, distributed Hogwild ⎼ CaffeOnSpark

● UC Berkeley○ NCCL Caffe, GPU cluster tooling○ Bare Metal

Myself

Page 4: Bare Metal Library - NVIDIA · Bare Metal Myself. Modern Hardware? Device-to-device networks. Number crunching GPU ... Video, crypto etc. Moving from ms software to µs hardware.

Modern Hardware?

Page 5: Bare Metal Library - NVIDIA · Bare Metal Myself. Modern Hardware? Device-to-device networks. Number crunching GPU ... Video, crypto etc. Moving from ms software to µs hardware.

Device-to-device networks

Page 6: Bare Metal Library - NVIDIA · Bare Metal Myself. Modern Hardware? Device-to-device networks. Number crunching GPU ... Video, crypto etc. Moving from ms software to µs hardware.
Page 7: Bare Metal Library - NVIDIA · Bare Metal Myself. Modern Hardware? Device-to-device networks. Number crunching GPU ... Video, crypto etc. Moving from ms software to µs hardware.

Number crunching ➔ GPU

FS, block io, virt mem ➔ Pmem

Network stack ➔ RDMA

RAID, replication ➔ Erasure codes

Device mem ➔ Coherent fabrics

And more:

Video, crypto etc.

Moving from ms software to µs hardware

Page 8: Bare Metal Library - NVIDIA · Bare Metal Myself. Modern Hardware? Device-to-device networks. Number crunching GPU ... Video, crypto etc. Moving from ms software to µs hardware.

OS abstractions replaced by

More powerful, but also more complex and non-interoperable

● CUDA● OFED● Libpmem● DPDK● SPDK● Libfabric● UCX● VMA● More every week...

Page 9: Bare Metal Library - NVIDIA · Bare Metal Myself. Modern Hardware? Device-to-device networks. Number crunching GPU ... Video, crypto etc. Moving from ms software to µs hardware.

Summary So Far

● Big changes coming!○ At least for high-performance applications

● CPU should orchestrate○ Not in critical path○ Device-to-device networks

● Retrofitting existing architectures difficult○ CPU-centric abstractions○ ms software on µs hardware (e.g. 100s instructions per packet)○ OK in some cases, e.g. VMA (kernel bypass sockets), but

much lower acceleration, most features inexpressible

Page 10: Bare Metal Library - NVIDIA · Bare Metal Myself. Modern Hardware? Device-to-device networks. Number crunching GPU ... Video, crypto etc. Moving from ms software to µs hardware.

What do we do?

● Start from scratch?○ E.g. Google Fushia - no fs, block io, network etc.○ Very interesting but future work

● Use already accelerated frameworks?○ E.g. PyTorch, BeeGFS○ Not general purpose, no interop, not device-to-device

● Work incrementally from use cases○ Look for simplest hardware solution○ Hopefully useful abstractions will emerge

Page 11: Bare Metal Library - NVIDIA · Bare Metal Myself. Modern Hardware? Device-to-device networks. Number crunching GPU ... Video, crypto etc. Moving from ms software to µs hardware.

Use cases

● Build datasets○ Add, update elements○ Apply functions to sets, map-reduce○ Data versioning

● Training & inference○ Compute graphs, pipelines○ Deployment○ Model versioning

Page 12: Bare Metal Library - NVIDIA · Bare Metal Myself. Modern Hardware? Device-to-device networks. Number crunching GPU ... Video, crypto etc. Moving from ms software to µs hardware.

Datasets

● Typical solution○ Protobuf messages○ KV store○ Dist. file system

● Limitations○ Serialization granularity○ Copies: kv log, kernel1, replication, kernel2, fs○ Remote CPU involved, stragglers○ Cannot place data in device

(x12)

Page 13: Bare Metal Library - NVIDIA · Bare Metal Myself. Modern Hardware? Device-to-device networks. Number crunching GPU ... Video, crypto etc. Moving from ms software to µs hardware.

Datasets

● Simplest hardware implementation○ Write protobuf in arena, like Flatbuffers○ Pick an offset on disks, e.g. a namespace○ Call ibv_exp_ec_encode_async

● Comments○ Management, coordination, crash resiliency○ Thin wrapper over HW: line rate perf.

● User abstraction?○ Simple, familiar○ Efficient, device friendly

(x12)

EC shard

Page 14: Bare Metal Library - NVIDIA · Bare Metal Myself. Modern Hardware? Device-to-device networks. Number crunching GPU ... Video, crypto etc. Moving from ms software to µs hardware.

● Extension to classic mmap○ Distributed○ Typed - Protobuf, other formats planned

● Protobuf is amazing○ Forward and backward compatible○ Lattice

mmap

Page 15: Bare Metal Library - NVIDIA · Bare Metal Myself. Modern Hardware? Device-to-device networks. Number crunching GPU ... Video, crypto etc. Moving from ms software to µs hardware.

mmap

● C++

const Test& test = mmap<Test>("/test");int i = test.field();

● Python

test = Test()bm.mmap("/test", test)i = test.field()

Page 16: Bare Metal Library - NVIDIA · Bare Metal Myself. Modern Hardware? Device-to-device networks. Number crunching GPU ... Video, crypto etc. Moving from ms software to µs hardware.

● Simple abstraction for data storage

● Fully accelerated, “mechanically friendly”○ Thin wrapper over HW, device-to-device, zero copy○ ~1.5x replication factor○ Network automatically balanced○ Solves straggler problem○ No memory pinning or TLB thrashing, NUMA aware

mmap, recap

Page 17: Bare Metal Library - NVIDIA · Bare Metal Myself. Modern Hardware? Device-to-device networks. Number crunching GPU ... Video, crypto etc. Moving from ms software to µs hardware.

Use cases

● Compute○ Map-reduce, compute graphs, pipelines

● Typical setup○ Spark, DL frameworks○ Distribution using Akka, gRPC, MPI○ Kubernetes or SLURM scheduling

● Limitations○ No interop○ Placement difficult○ Inefficient resources allocation

Page 18: Bare Metal Library - NVIDIA · Bare Metal Myself. Modern Hardware? Device-to-device networks. Number crunching GPU ... Video, crypto etc. Moving from ms software to µs hardware.

Compute

● Simplest hardware implementation○ Define a task, e.g. img. resize, CUDA kernel, PyTorch graph○ Place tasks in queue○ Work stealing - RDMA atomics○ Device-to-device chaining - GPU Direct Async

● User abstraction?

Page 19: Bare Metal Library - NVIDIA · Bare Metal Myself. Modern Hardware? Device-to-device networks. Number crunching GPU ... Video, crypto etc. Moving from ms software to µs hardware.

task

● Python

@bm.taskdef compute(x, y): return x * y

# Runs locallycompute(1, 2)

# Might be rebalanced on clusterdata = bm.list()bm.mmap("/data", data)compute(data, 2)

Page 20: Bare Metal Library - NVIDIA · Bare Metal Myself. Modern Hardware? Device-to-device networks. Number crunching GPU ... Video, crypto etc. Moving from ms software to µs hardware.

task, recap

● Simple abstraction for CPU and device kernels

● Work stealing instead of explicit schedule○ No GPU hoarding○ Better work balancing○ Dynamic placement, HA

● Device-to-device chaining○ Data placed directly in device memory○ Efficient pipelines, even very short tasks○ E.g. model parallelism, low latency inference

Page 21: Bare Metal Library - NVIDIA · Bare Metal Myself. Modern Hardware? Device-to-device networks. Number crunching GPU ... Video, crypto etc. Moving from ms software to µs hardware.

Use cases

● Versioning○ Track datasets and models○ Deploy / rollback models

● Typical setup○ Copy before update○ Symlinks as versions to data○ Staging / production environments split

Page 22: Bare Metal Library - NVIDIA · Bare Metal Myself. Modern Hardware? Device-to-device networks. Number crunching GPU ... Video, crypto etc. Moving from ms software to µs hardware.

Versioning

● Simplest hardware implementation○ Keep multiple write ahead logs○ mmap updates○ tasks queues

● User abstraction?

Page 23: Bare Metal Library - NVIDIA · Bare Metal Myself. Modern Hardware? Device-to-device networks. Number crunching GPU ... Video, crypto etc. Moving from ms software to µs hardware.

branch

● Like a git branch○ But any size data○ Simplifies collaboration, experimentation○ Generalized staging / production split

● Simplifies HA○ File system fsync, msync (Very hard! Rajimwale et al. DSN ‘11)○ Replaces transactions, e.g. queues, persistent memory○ Allows duplicate work merge

Page 24: Bare Metal Library - NVIDIA · Bare Metal Myself. Modern Hardware? Device-to-device networks. Number crunching GPU ... Video, crypto etc. Moving from ms software to µs hardware.

branch

● C++

Test* test = mutable_mmap<Test>("/test");branch b;# Only visible in current branchtest->set_field(12);

● Similar in Python

Page 25: Bare Metal Library - NVIDIA · Bare Metal Myself. Modern Hardware? Device-to-device networks. Number crunching GPU ... Video, crypto etc. Moving from ms software to µs hardware.

● mmap, task, and branch simplify hardware-acceleration

● Helps build pipelines, manage cluster resources etc.

● Early micro benchmarks suggest very high performance

Summary

Page 26: Bare Metal Library - NVIDIA · Bare Metal Myself. Modern Hardware? Device-to-device networks. Number crunching GPU ... Video, crypto etc. Moving from ms software to µs hardware.

Thank You!Will be open sourced BSD

Contact me if interested - [email protected]

Thanks to our sponsor