AI, app-defined OS, and the ossified kernel


Transcript of AI, app-defined OS, and the ossified kernel

Page 1: AI, app-defined OS, and the ossified kernel

AI, app-defined OS, & the ossified Linux kernel

Felix Xiaozhu Lin 林小竹

1

Page 2: AI, app-defined OS, and the ossified kernel

Summary

Describe recent trends of OS research

Present three cases of our investigation

Share personal reflection on OS research

2

Page 3: AI, app-defined OS, and the ossified kernel

Self intro

• 2014 – now. Asst. prof, Purdue ECE
• 2014 PhD in CS. Rice U (advisor: Lin Zhong)
– Thesis: OS for mobile computing
• 2008 MS + BS. Tsinghua U

3

XSEL & Friends, 2017

Crossroads Systems Exploration Lab

Page 4: AI, app-defined OS, and the ossified kernel

AI + big data: the workloads in our era

How does the OS respond?

Page 5: AI, app-defined OS, and the ossified kernel

OS as in textbooks

• Resource management

• Abstraction

• Protection

• Multiplexing

5

Page 6: AI, app-defined OS, and the ossified kernel

[Figure: two stacks, with a label "SOSP 1995". OS in 2000: Hardware; Mgmt, Abstraction, Protect, Multiplex. OS in 2018: Hardware; NN, Stream, M/R; Mgmt, Abstraction, Protect, Multiplex.]

Page 7: AI, app-defined OS, and the ossified kernel

Two premises of our research

1. Important apps define OSes

2. The Linux kernel is the new firmware

7

Page 8: AI, app-defined OS, and the ossified kernel

Like it or not …

• Following the model: widely adopted
– Various ML frameworks
– Android
– DPDK
– RDMA

• Against the model: less adopted
– Library OS
– Multikernel
– Unikernels
– (… and all kinds of research proposals)

8

Page 9: AI, app-defined OS, and the ossified kernel

This talk: our exploration in three cases

• App-defined OSes
– Case 1: Memory mgmt
– Case 2: Storage

• The ossified Linux Kernel
– Case 3: Transkernel

9

Page 10: AI, app-defined OS, and the ossified kernel

Case 1: App-defined memory management

Exploiting hybrid memory for stream analytics

10

Page 11: AI, app-defined OS, and the ossified kernel

Background: Stream analytics in 1 minute

[Figure labels: IoT, Data centers, Humans]

High input throughput: millions of events per sec

Low delay processing: sub-seconds

Page 12: AI, app-defined OS, and the ossified kernel

Background: streaming pipeline

Operators: computations consuming & producing streams

Pipeline: a graph of operators

[Figure: the WordCount pipeline, with operators Ingress, Window, Group by word, Count word occurrences, and egress]
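To make the operator and pipeline terms concrete, here is a minimal batch-style sketch of the WordCount pipeline in Python. It is an illustration only, not the engine from the talk: the function names, the `(timestamp, text)` event shape, and the use of fixed-size windows are all assumptions.

```python
from collections import Counter
from typing import Iterable, Iterator


def ingress(lines: Iterable[tuple[float, str]]) -> Iterator[tuple[float, str]]:
    """Ingress operator: turn timestamped lines into timestamped words."""
    for ts, line in lines:
        for word in line.split():
            yield ts, word.lower()


def window(events: Iterable[tuple[float, str]], size: float) -> dict[int, list[str]]:
    """Window operator: bucket events into fixed-size time windows."""
    buckets: dict[int, list[str]] = {}
    for ts, word in events:
        buckets.setdefault(int(ts // size), []).append(word)
    return buckets


def group_and_count(buckets: dict[int, list[str]]) -> dict[int, Counter]:
    """Group by word and count occurrences inside each window."""
    return {win: Counter(words) for win, words in buckets.items()}


def word_count(lines: Iterable[tuple[float, str]], window_size: float = 1.0) -> dict[int, Counter]:
    """The pipeline: a (here linear) graph of operators."""
    return group_and_count(window(ingress(lines), window_size))
```

Each function plays the role of one operator, and `word_count` wires them into a pipeline. A real streaming engine would process events incrementally and in parallel rather than materializing whole windows.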

Page 13: AI, app-defined OS, and the ossified kernel

Background: streaming pipeline

Operators: computations consuming & producing streams

Pipeline: a graph of operators

[Figure: a larger pipeline labeled Sentiment Detection, with operators Ingress, Window, Group by word, Count word occurrences, TopK words, Sentiment Scoring, and Join]

Page 14: AI, app-defined OS, and the ossified kernel

14

Hybrid high-bandwidth memory

Page 15: AI, app-defined OS, and the ossified kernel

Hybrid high-bandwidth memory

• Tradeoffs: capacity vs bandwidth

• Untraditional memory hierarchy
– No latency benefit (unlike SRAM+DRAM)
– Configurable: sw-managed or hw-managed

[Figure: CPU attached to two memories]
High-bw mem (HBM): 16 GB, 375 GB/s
DRAM: 96 GB, 80 GB/s

Page 16: AI, app-defined OS, and the ossified kernel

Accelerate stream analytics with HBM?

• The most expensive stream operators: grouping

[Figure: pipeline with operators Ingress, Window, Group by word, Count word occurrences, TopK words]

Grouping in Hash vs. high bandwidth memory:
• Hash: high capacity demand. HBM: capacity limited
• Hash: extensive random access. HBM: ❤ seq access + parallelism

Grouping traditionally built as Hash: mismatches HBM!

Page 17: AI, app-defined OS, and the ossified kernel

HBM can accelerate grouping!

• … with an unconventional algorithm choice

• Grouping: hash → sort

• Sort is worse than Hash on algorithmic complexity
– O(N log N) vs. O(N)

• Yet, Sort beats Hash with …
– Abundant mem bw
– High task parallelism
– Wide SIMD (AVX-512)
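The two grouping strategies can be contrasted in a few lines of Python. This is only a sketch of the access patterns, with hypothetical function names. Plain Python cannot show the memory-bandwidth, parallelism, or SIMD effects that the slide credits for sort's advantage.

```python
from itertools import groupby
from operator import itemgetter


def group_by_hash(pairs):
    """Hash grouping: O(N) expected, but each insert is a random memory access."""
    groups = {}
    for key, value in pairs:
        groups.setdefault(key, []).append(value)
    return groups


def group_by_sort(pairs):
    """Sort grouping: O(N log N), but equal keys end up adjacent, so the
    grouping pass itself is a single sequential scan."""
    ordered = sorted(pairs, key=itemgetter(0))  # stable sort on the key
    return {key: [value for _, value in run]
            for key, run in groupby(ordered, key=itemgetter(0))}
```

Both functions return the same groups. The difference that matters for HBM is how they touch memory: scattered probes for the hash table versus sequential passes for the sort.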

17

Page 18: AI, app-defined OS, and the ossified kernel

A bold design: only use HBM for in-mem index of parallel sorting

[Figure: memory management across HBM and DRAM. The in-mem index goes to HBM; streaming data (data bundles) stay in DRAM]

Page 19: AI, app-defined OS, and the ossified kernel

Memory allocator: balancing two limited resources

• Prevent either from becoming the bottleneck
• A single knob: where to allocate new index: HBM or DRAM?

[Figure: plot with axes DRAM Bandwidth and HBM Capacity]

Page 20: AI, app-defined OS, and the ossified kernel

Memory allocator: balancing two limited resources

• Prevent either from becoming the bottleneck
• A single knob: where to allocate new index: HBM or DRAM?

[Figure: plot with axes DRAM Bandwidth and HBM Capacity, with regions marked "Low-demand for both" and "High-demand for both"]
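One way to read the single knob is as a comparison of two pressures. The sketch below is an assumed policy, not the allocator from the talk: the `MemoryState` fields and the "place the index where pressure is lower" rule are hypothetical. The slides only say that neither HBM capacity nor DRAM bandwidth should become the bottleneck.

```python
from dataclasses import dataclass


@dataclass
class MemoryState:
    hbm_used_gb: float        # HBM capacity currently in use
    hbm_capacity_gb: float    # total HBM capacity (16 GB on the slide's machine)
    dram_bw_used_gbps: float  # DRAM bandwidth currently consumed
    dram_bw_peak_gbps: float  # peak DRAM bandwidth (80 GB/s on the slide's machine)


def place_new_index(state: MemoryState) -> str:
    """Assumed policy: put the new index in HBM unless HBM capacity is under
    more pressure than DRAM bandwidth, in which case fall back to DRAM."""
    hbm_pressure = state.hbm_used_gb / state.hbm_capacity_gb
    dram_pressure = state.dram_bw_used_gbps / state.dram_bw_peak_gbps
    return "HBM" if hbm_pressure < dram_pressure else "DRAM"
```

For example, `place_new_index(MemoryState(4, 16, 60, 80))` picks HBM because HBM is a quarter full while DRAM bandwidth is three quarters used. With HBM nearly full and DRAM bandwidth mostly idle, the same call picks DRAM.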

Page 23: AI, app-defined OS, and the ossified kernel

Case 1: First streaming engine optimized for hybrid memory

• Works on real hardware

• Best stream analytics performance on a single machine

• Generic memory allocators are inadequate!

25

Page 24: AI, app-defined OS, and the ossified kernel

This talk: our exploration in three cases

• App-defined OSes
– Case 1: Memory mgmt
– Case 2: Storage

• The ossified Linux Kernel
– Case 3: Transkernel

26

Page 25: AI, app-defined OS, and the ossified kernel

Case 2: App-defined storage

A data store for video analytics

27

Page 26: AI, app-defined OS, and the ossified kernel

Pervasive cameras. Large videos.

• 130M surveillance cameras shipped per year

• Many campuses run > 200 cameras 24x7

• A single camera produces 24 GB video per day

28

Page 27: AI, app-defined OS, and the ossified kernel

Pervasive cameras. Large videos.

• 130M surveillance cameras shipped per year

• Many institutions run > 200 cameras 24x7

• A single camera produces 24 GB video per day

29

Page 28: AI, app-defined OS, and the ossified kernel

Retrospective video analytics in 30 seconds

• A query: “find all white buses that appeared yesterday”
• A cascade of operators
• Query picks operator accuracies
– Lower accuracy → lower cost → faster execution

[Figure: operator cascade: Frame diff detector (~10,000x), Specialized neural net (~1,000x), Full neural net (~10x)]

Image credits: NoScope: Optimizing Neural Network Queries over Video at Scale, Daniel Kang, John Emmons, Firas Abuzaid, Peter Bailis, Matei Zaharia, VLDB 2017
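The cascade idea can be sketched as a loop that tries cheap operators first and escalates only when they are unsure. Everything in this Python sketch is illustrative: the frame fields (`changed`, `score`, `label`), the thresholds, and the three toy operators are assumptions standing in for real detectors.

```python
from typing import Callable, Optional

# An operator returns True (match), False (reject), or None (unsure: escalate).
Operator = Callable[[dict], Optional[bool]]


def frame_diff(frame: dict) -> Optional[bool]:
    """Cheapest stage: reject frames that did not change."""
    return None if frame["changed"] else False


def specialized_nn(frame: dict) -> Optional[bool]:
    """Mid-cost stage: decide only when the (toy) score is confident."""
    if frame["score"] > 0.9:
        return True
    if frame["score"] < 0.1:
        return False
    return None


def full_nn(frame: dict) -> Optional[bool]:
    """Most expensive stage: always decides (here it reads a stored label)."""
    return frame["label"]


def run_cascade(frames: list[dict], operators: list[Operator]) -> list[dict]:
    """Return frames accepted by the first operator that reaches a verdict."""
    matches = []
    for frame in frames:
        for op in operators:
            verdict = op(frame)
            if verdict is None:
                continue  # unsure: hand the frame to the next, costlier operator
            if verdict:
                matches.append(frame)
            break
    return matches
```

Calling `run_cascade(frames, [frame_diff, specialized_nn, full_nn])` sends most frames through only the cheap stages. That is why a cascade can run far faster than applying the full neural net to every frame.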

Page 29: AI, app-defined OS, and the ossified kernel

Need for a video store for retrospective analytics

31

[Figure: video data moves through Ingestion, Storage, Retrieval, and Consume; a query runs operators at chosen accuracies on the retrieved video]

“Aren’t there many video databases already?”
Yes. But they are for human consumers, not for algorithmic consumers.

Page 30: AI, app-defined OS, and the ossified kernel

Central design issue: choose storage formats catering to operators

• Video format: resolution, frame rate, crop, compression level, color channel… Many knobs!

32

Page 31: AI, app-defined OS, and the ossified kernel

Central design issue: choose storage formats catering to operators

• Video format: resolution, frame rate, crop, compression level, color channel… Many knobs!

• Higher accuracy → richer format & higher cost

[Figure: example formats, from 200p 1 FPS (~100 KB/s) to 720p 30 FPS (~400 KB/s)]

Actual tuning of formats is complex!

Page 32: AI, app-defined OS, and the ossified kernel

Key problem: What video formats to store?

1. Retrieving formats must be fast enough to feed operators

2. Formats must be rich enough to satisfy all accuracy needs

34

Page 33: AI, app-defined OS, and the ossified kernel

Storing one unified format?

[Figure: video data path from storage formats to video consumers <op, accuracy>: <motion, 0.95>, <motion, 0.7>, <OCR, 0.95>, <OCR, 0.90>, <NN, 0.95>, <NN, 0.80>]

Retrieving & decoding slows down operators

Page 34: AI, app-defined OS, and the ossified kernel

Storing all needed formats?

[Figure: video data path from storage formats to video consumers <op, accuracy>: <motion, 0.95>, <motion, 0.7>, <OCR, 0.95>, <OCR, 0.90>, <NN, 0.95>, <NN, 0.80>]

Ingestion & storage are expensive

Page 35: AI, app-defined OS, and the ossified kernel

Proposal: Store minimum necessary formats

[Figure: video data path linking storage formats, consumption formats, and video consumers <op, accuracy>: <motion, 0.95>, <motion, 0.7>, <OCR, 0.95>, <OCR, 0.90>, <NN, 0.95>, <NN, 0.80>]

Page 36: AI, app-defined OS, and the ossified kernel

Proposal: Store minimum necessary formats

[Figure: video data path linking storage formats, consumption formats, and video consumers <op, accuracy>: <motion, 0.95>, <motion, 0.7>, <OCR, 0.95>, <OCR, 0.90>, <NN, 0.95>, <NN, 0.80>]

Key Challenge: Too many knobs to tune!

Page 37: AI, app-defined OS, and the ossified kernel

Proposal: Store minimum necessary formats & derive them automatically!

Backward derivation of configuration

[Figure: video data path linking storage formats, consumption formats, and video consumers <op, accuracy>: <motion, 0.95>, <motion, 0.7>, <OCR, 0.95>, <OCR, 0.90>, <NN, 0.95>, <NN, 0.80>; the configuration is derived backward, starting from the consumers]
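The slides do not spell out the derivation algorithm, so the following Python sketch is a stand-in technique: a greedy coalescing of consumption formats into a bounded number of storage formats. The two-knob `(resolution, fps)` format, the cost proxy, and the function names are all assumptions for illustration. A storage format is taken to serve a consumer when it is at least as rich on every knob.

```python
from itertools import combinations

# Hypothetical two-knob format: (vertical resolution in pixels, frames per second).
Format = tuple[int, int]


def cost(fmt: Format) -> int:
    """Assumed cost proxy for storing or retrieving a format."""
    return fmt[0] * fmt[1]


def merge(a: Format, b: Format) -> Format:
    """Cheapest format that is at least as rich as both inputs."""
    return (max(a[0], b[0]), max(a[1], b[1]))


def extra_cost(a: Format, b: Format) -> int:
    """Extra retrieval cost imposed on consumers of a and b after merging."""
    merged = cost(merge(a, b))
    return (merged - cost(a)) + (merged - cost(b))


def derive_storage_formats(consumption: set[Format], max_formats: int) -> set[Format]:
    """Start from what consumers need and coalesce the least harmful pair
    until the number of stored formats fits the budget."""
    formats = set(consumption)
    while len(formats) > max_formats:
        a, b = min(combinations(sorted(formats), 2), key=lambda p: extra_cost(*p))
        formats -= {a, b}
        formats.add(merge(a, b))
    return formats
```

With consumption formats `{(200, 1), (200, 5), (540, 30), (720, 30)}` and a budget of two, this sketch keeps one cheap format for the low-accuracy consumers and one rich format for the demanding ones. The design choice it illustrates is the backward direction: storage formats are derived from what the consumers need.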

Page 38: AI, app-defined OS, and the ossified kernel

Benefit: faster analytics

[Figure: query speed (x realtime, 1x to 1000x) vs. target accuracy (1, 0.95, 0.9, 0.8)]

Page 39: AI, app-defined OS, and the ossified kernel

Benefit: faster analytics

[Figure: query speed (x realtime, 1x to 1000x) vs. target accuracy (1, 0.95, 0.9, 0.8), comparing Ours with One unified format]

Why?
• Lower accuracy → analytics tolerates cheaper video formats
• Video retrieval should be proportionally cheaper!

Page 40: AI, app-defined OS, and the ossified kernel

Benefit: faster analytics

[Figure: six panels (jackson, miami, tucson, dashcam, park, airport) plotting query speed (x realtime, 1x to 1000x) against target accuracy (1, 0.95, 0.9, 0.8) for Ours vs. Store one format; annotated "16x"]

Why?
• Lower accuracy → analytics tolerates cheaper video formats
• Video retrieval should be proportionally cheaper!

Page 41: AI, app-defined OS, and the ossified kernel

Case 2: First data store for retro. video analytics

• Baking analytics demands into storage decisions
• Showing clear advantage over generic file systems or storage layers

Page 42: AI, app-defined OS, and the ossified kernel

This talk: our exploration in three cases

• App-defined OSes
– Case 1: Memory mgmt
– Case 2: Storage

• The ossified Linux Kernel
– Case 3: Transkernel

46

Page 43: AI, app-defined OS, and the ossified kernel

Case 3: enhancing the ossified Linux kernel

A new OS structure for device driver offloading

48

Page 44: AI, app-defined OS, and the ossified kernel

A commodity kernel is still irreplaceable

• The major reason: device drivers
– Linux: a common environment for all drivers
– >70% of Linux kernel source is device drivers
• All other OS functions have alternatives!

1. Understanding Modern Device Drivers, Asim Kadav and Michael Swift. ASPLOS’12.

Page 45: AI, app-defined OS, and the ossified kernel

Problem: suspend/resume in ephemeral tasks

• Mobile/IoT run frequent, ephemeral tasks
– Android smartwatch: each minute
• Each task is short-lived
– Smartwatch: 10 secs
– Background task: < 1 sec

Page 46: AI, app-defined OS, and the ossified kernel

Under the hood…

[Figure: timeline of one ephemeral task: Wakeup, Device Resume, Thaw user, User Task, Freeze user & fs, Device Suspend, Sleep]

Page 47: AI, app-defined OS, and the ossified kernel

Suspend/Resume Is Slow

          Nexus 5   Gear     Note 4   Panda
Suspend   119 ms    191 ms   231 ms   262 ms
Resume     88 ms    159 ms   316 ms   492 ms

Why? IO devices are slow
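To put these latencies next to the task lengths mentioned earlier, the short Python calculation below adds suspend and resume time per wake cycle and compares it with a task length. The 1000 ms task is an assumption taken from the "< 1 sec" upper bound for background tasks, so the printed fractions are lower bounds on the overhead share.

```python
# Suspend and resume latencies in milliseconds, copied from the table above.
LATENCY_MS = {
    "Nexus 5": (119, 88),
    "Gear": (191, 159),
    "Note 4": (231, 316),
    "Panda": (262, 492),
}


def overhead_fraction(suspend_ms: float, resume_ms: float, task_ms: float) -> float:
    """Share of one wake cycle spent suspending and resuming instead of
    running the user task."""
    overhead = suspend_ms + resume_ms
    return overhead / (overhead + task_ms)


TASK_MS = 1000  # assumed task length: the "< 1 sec" bound for background tasks

for device, (suspend_ms, resume_ms) in LATENCY_MS.items():
    share = overhead_fraction(suspend_ms, resume_ms, TASK_MS)
    print(f"{device}: {suspend_ms + resume_ms} ms per cycle, {share:.0%} of the cycle")
```

Shorter tasks make the share larger, which is the motivation for making the suspend/resume path itself cheaper.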

Page 48: AI, app-defined OS, and the ossified kernel

Proposal: run these driver paths on a low-power tiny core

[Figure: CPU at 2.5 GHz next to a tiny core at 50 MHz]

Page 49: AI, app-defined OS, and the ossified kernel

What are these tiny cores?

55

Page 52: AI, app-defined OS, and the ossified kernel

How can a tiny core help?

• Idling more efficiently
– 3 mW vs 30 mW
• Executing kernel more efficiently
– Small code working set
– Less predictable control flow

1. J. Mogul, J. Mudigonda, N. Binkert, P. Ranganathan, and V. Talwar. Using asymmetric single-ISA CMPs to save energy on operating systems. IEEE Micro, 2008.

Page 53: AI, app-defined OS, and the ossified kernel

The workflows

[Figure: timeline of the current workflow, all on the CPU: Wakeup, Device Resume, Thaw user, User Task, Freeze user & fs, Device Suspend, Sleep]

Page 54: AI, app-defined OS, and the ossified kernel

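The workflows

[Figure: timelines of the current and proposed workflows, with phases Wakeup, Device Resume, Thaw user, User Task, Freeze user & fs, Device Suspend, Sleep. Current: every phase runs on the CPU. Proposed: the phases are split between the CPU and the tiny core]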

Page 55: AI, app-defined OS, and the ossified kernel

How?

• The tiny core is very wimpy
– Essentially a microcontroller
• Has a different ISA
– More aggressive than big.LITTLE
• Re-engineering the kernel?
• “Linux kernel is the new firmware”

[Figure: CPU running the Linux kernel and device drivers, with DRAM and IO]

Page 56: AI, app-defined OS, and the ossified kernel

Proposal: use dynamic binary translation

• Empower the tiny core to execute unmodified kernel binaries

[Figure: dynamic binary translation applied to the Linux kernel and device drivers; CPU, DRAM, IO]

Page 57: AI, app-defined OS, and the ossified kernel

Dynamic binary translation in 30 sec

• The technology behind QEMU and a PS emulator
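For readers new to the technique, here is a toy dynamic binary translator in Python. It is a teaching sketch only, unrelated to QEMU's or the transkernel's actual code. The three-instruction guest ISA (`addi`, `jnz`, `halt`) is invented for the example, and "host code" is simply a generated Python function. It shows the two core steps: translate one guest basic block at a time, and cache translations so hot blocks are translated once.

```python
def translate_block(program, pc):
    """Translate the guest basic block starting at pc into a host function.
    The function mutates the register file and returns the next guest pc,
    or None when the guest halts."""
    lines = ["def block(regs):"]
    while True:
        op, *args = program[pc]
        if op == "addi":    # addi reg, imm: straight-line code stays in the block
            lines.append(f"    regs[{args[0]!r}] += {args[1]}")
            pc += 1
        elif op == "jnz":   # jnz reg, target: a branch ends the basic block
            lines.append(f"    return {args[1]} if regs[{args[0]!r}] != 0 else {pc + 1}")
            break
        elif op == "halt":
            lines.append("    return None")
            break
        else:
            raise ValueError(f"unknown guest op {op!r}")
    namespace = {}
    exec("\n".join(lines), namespace)  # "emit" host code for the block
    return namespace["block"]


def run(program, regs):
    """Dispatch loop: translate on first use, then reuse from the cache."""
    cache = {}  # translation cache: guest pc -> translated host function
    pc = 0
    while pc is not None:
        if pc not in cache:
            cache[pc] = translate_block(program, pc)
        pc = cache[pc](regs)
    return regs, len(cache)
```

A loop body that runs many times is translated only once and then served from the cache. This amortization is what makes DBT viable at all, and it is also why translation cost matters so much on a weak core.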

63

Page 58: AI, app-defined OS, and the ossified kernel

“This sounds impractical”

• Dynamic binary translation (DBT) is known to be expensive
– > 10x overhead
• No one runs DBT on microcontrollers
– Always weak over strong

65

Page 59: AI, app-defined OS, and the ossified kernel

Solution: the Transkernel model

Key ideas:

1. Translate most of driver code

2. Emulate the remaining kernel infrastructure

Essentially a tiny VM for drivers!

[Figure: the CPU runs the Linux kernel and device drivers; the tiny core runs a transkernel, made of dynamic binary translation, translated code, and an emulation layer (Emu); DRAM and IO sit alongside]

Page 60: AI, app-defined OS, and the ossified kernel

Is this practical?

67

Page 61: AI, app-defined OS, and the ossified kernel

Only through careful optimization!

68

Page 62: AI, app-defined OS, and the ossified kernel

Case 3: Transkernel: tiny VMs for device driver paths

First DBT engine running on a microcontroller-like core!

Demonstrated:

• One can use DBT for efficiency gain

• The ossified kernel can be tamed!

69

Page 63: AI, app-defined OS, and the ossified kernel

This talk: our exploration in three cases

• App-defined OSes
– Case 1: Memory mgmt for stream analytics
– Case 2: Storage for retro. video analytics

• The ossified Linux Kernel
– Case 3: Transkernel

70

Page 64: AI, app-defined OS, and the ossified kernel

Reflection: a tale of two inquiries

• App-defined OS
– Brave new world
– Usefulness depends on the target apps
– OS as a tool: serving AI and big data

• Kernel as the new firmware
– Still lots of opportunities
– Unique constraints. Require unconventional techniques
– Like fixing an engine of an airplane in the air
– OS as a craft. Fun!

72