
Transcript of "Redrawing the Boundary Between Software and Storage for Fast Non-Volatile Memories" (…nvsl.ucsd.edu/assets/talks/2012-05-21-DaMoN-release.pdf)

1  

Redrawing the Boundary Between Software and Storage for Fast Non-Volatile Memories

Steven Swanson, Director, Non-Volatile System Laboratory

Computer Science and Engineering, University of California, San Diego

2  


Humanity processed 9 zettabytes in 2008* (9,000,000,000,000,000,000,000 bytes). Welcome to the Data Age!

*http://hmi.ucsd.edu

3  

Solid State Memories

•  NAND flash
  –  Ubiquitous, cheap
  –  Somewhat slow, idiosyncratic
•  Phase change, spin-torque MRAMs, etc.
  –  Not (yet) ready for prime time
  –  DRAM-like speed
  –  DRAM- or flash-like density
•  DRAM + battery + flash
  –  Fast, available today
  –  Reliability?

4  

Integrating Them Into Systems

[Diagram: a CPU with DRAM and NVMs attached over DDR, plus NVMs attached over PCIe.]

DDR-attached NVMs
  –  Lowest latency
  –  Highest bandwidth
  –  Rigid interface
  –  Restricted topology
  –  Limited scalability

PCIe-attached NVMs
  –  Low latency
  –  High bandwidth
  –  Flexible interface
  –  Flexible topology
  –  Scalable

5  

[Chart: bandwidth relative to disk vs. 1/latency relative to disk, both on log scales, for Hard Drives (2006), PCIe-Flash (2007), PCIe-PCM (2010), PCIe-Flash (2012), PCIe-PCM (2013?), and DDR Fast NVM (2016?). Over that span bandwidth improves 5917x and latency 7200x, roughly 2.4x per year each.]

6  

Software Latency Will Dominate

[Bar chart: software's contribution to latency across successive storage devices: 0.3%, 19.3%, 21.9%, 10.4%, 21.9%, 70.0%, 88.4%, 94.09%.]

7  

Software Energy Will Dominate

[Bar chart: software's contribution to energy per IOP: 0.4% for Hard Drive (2006), 97.0% for SSD (2012), 75.8% for PCIe-Flash (2012), and 78.6% for PCIe-PCM (2013).]

8  

Software’s New Role in Storage

For Disks: Adding Software Could Make the System Faster

For NVMs: Adding Software Makes It Slower

•  Reduce software interactions
•  Remove disk-centric features
•  Redesign IO software (OS, FS, languages, and apps)

9  

Two Case Studies

[Diagram: a CPU with DRAM and NVMs on the DDR bus and NVMs attached over PCIe, with the software stack above each.]

•  Moneta: SSDs for fast NVMs (PCIe)
  –  Database, file system, and block driver IO stack, or a streamlined Moneta driver IO stack
•  NV-Heaps: NVMs over DDR
  –  Programming model and file system

10  

Moneta Internals: Performing a Write

[Block diagram: four 16GB NVM banks connected by a 4 GB/s ring; ring control, transfer buffers, DMA control, a scoreboard, tag status registers, and a request queue. The host issues commands via PIO and moves data via DMA. A rough sketch of this flow follows below.]
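The flow the diagram implies can be sketched roughly as follows: the host writes a command to the device with PIO, the payload moves by DMA into the transfer buffers and out over the ring, and completion is reported through a per-tag status register. Everything below (the register layout, the FakeMoneta stand-in, the function names) is illustrative only, not Moneta's actual hardware interface.

// Hypothetical host-side write path (illustrative only; not Moneta's real interface).
#include <cstdint>
#include <cstring>
#include <vector>

struct CommandRegs {                  // filled in by the host via PIO
    uint64_t storage_offset;          // where the data lands in the NVM array
    uint64_t dma_addr;                // where the payload sits in host memory
    uint32_t length;
    uint8_t  tag;                     // identifies the request in the scoreboard
    bool     is_write;
};

struct FakeMoneta {                   // software stand-in for the hardware
    std::vector<uint8_t> media = std::vector<uint8_t>(1 << 20);
    uint64_t tag_status = 0;          // one completion bit per tag

    void pio_issue(const CommandRegs& c, const uint8_t* host_mem) {
        if (c.is_write)               // "DMA": the device pulls the payload from host memory
            std::memcpy(media.data() + c.storage_offset, host_mem + c.dma_addr, c.length);
        tag_status |= (1ull << c.tag);   // completion shows up in the tag status register
    }
};

// Driver-side write: one PIO command write, then wait for the tag to complete.
bool write_block(FakeMoneta& dev, const uint8_t* host_mem,
                 uint64_t dma_addr, uint64_t offset, uint32_t len, uint8_t tag) {
    CommandRegs cmd{offset, dma_addr, len, tag, true};
    dev.pio_issue(cmd, host_mem);
    return dev.tag_status & (1ull << tag);   // a real driver would poll or sleep on an interrupt
}

int main() {
    FakeMoneta dev;
    std::vector<uint8_t> host_mem(4096, 0xAB);
    return write_block(dev, host_mem.data(), 0, 8192, 4096, 3) ? 0 : 1;
}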

11  

The Moneta Prototype

•  FPGA-based implementation
•  DDR2 DRAM emulates PCM
  –  Configurable memory latency
  –  48 ns reads, 150 ns writes
  –  64GB across 8 controllers
•  PCIe: 2 GB/s, full duplex

12  

Optimizing Software for Moneta

•  Optimizations
  –  Remove IO scheduler
  –  Redesigned HW interface
  –  Remove locks
•  SW latency drops from 13.4 us to 5 us
  –  62% reduction in latency
  –  Increased concurrency
•  Bandwidth increases by up to 10x

[Stacked bar chart: latency (us) for Moneta 0.1 vs. Moneta 1.0, broken down into PCM, ring, DMA, wait, interrupt, issue, copy, schedule, and OS/user components.]

13  

The App Gap

[Bar chart: speedup vs. DISK-RAID (log scale) for Moneta, FusionIO, SSD-RAID, and DISK-RAID on XDD 4KB RW, Btree, HashTable, Bio, and PTF workloads.]

•  We must close the App Gap for NVM success

14  

Where is the App Gap (Part I)?

[Bar chart: sustained MB/s for 4KB accesses with Moneta 1.0 alone vs. Moneta 1.0 plus the file system.]

[Diagram: databases, web apps, scientific apps, and file servers reach Moneta 1.0 through the file system, IO stack, and driver.]

15  

Eliminating FS and OS overheads

•  Separate protection mechanisms from policy
  –  Move protection checks to hardware
  –  Allow applications to access Moneta directly (sketched at the end of this slide)
•  41% of the remaining latency goes to supporting protection and sharing

[Stacked bar chart: latency (us) breakdown (PCM, ring, DMA, wait, interrupt, issue, copy, schedule, OS/user) for Moneta 0.1, Moneta 1.0, the file system on Moneta, and Moneta-D.]

[Diagram: with Moneta 1.0, apps go through the file system, IO stack, and kernel driver; with Moneta-D, apps issue requests through a user-space driver while a hardware permission check enforces protection.]

•  Protection overheads are almost eliminated
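To make "move protection checks to hardware and let applications access Moneta directly" concrete, here is a minimal sketch of the split between policy and mechanism, under stated assumptions: the kernel installs permission entries once (policy), and every subsequent request is checked outside the OS (mechanism). The FakeDevice model and all names are hypothetical; this is not the Moneta-D interface.

// Hypothetical sketch: kernel-installed permissions, per-request checks outside the OS.
// All names and the permission-table layout are illustrative, not Moneta-D's real API.
#include <cstdint>
#include <cstdio>
#include <cstring>
#include <vector>

struct PermEntry { uint64_t start, len; bool writable; };

struct FakeDevice {                         // stand-in for the hardware
    std::vector<PermEntry> perms;           // installed by the kernel, checked by "hardware"
    std::vector<uint8_t> media = std::vector<uint8_t>(1 << 20);

    bool allowed(uint64_t off, size_t len, bool write) const {
        for (const auto& p : perms)
            if (off >= p.start && off + len <= p.start + p.len && (!write || p.writable))
                return true;
        return false;
    }
};

// Slow path, done once per extent: the OS decides policy and installs an entry.
void install_permission(FakeDevice& dev, uint64_t start, uint64_t len, bool writable) {
    dev.perms.push_back({start, len, writable});
}

// Fast path: no system call, no file system; the check happens at the device.
bool user_write(FakeDevice& dev, uint64_t off, const void* buf, size_t len) {
    if (!dev.allowed(off, len, true)) return false;   // "hardware" rejects the request
    std::memcpy(dev.media.data() + off, buf, len);
    return true;
}

int main() {
    FakeDevice dev;
    install_permission(dev, 4096, 8192, true);        // kernel grants one writable extent
    const char msg[] = "hello";
    std::printf("in-range write:  %d\n", user_write(dev, 4096, msg, sizeof msg));  // 1
    std::printf("out-of-range:    %d\n", user_write(dev, 0, msg, sizeof msg));     // 0
}

The design point is that the OS stays in charge of who may touch what, but drops out of the per-request path entirely.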

16  

Some Progress on the App Gap

[Bar chart: speedup of user-space access over the kernel-space path (kernel space = 1x).]

17  

Where is the App Gap (Part II)?

[Diagram: apps accessing Moneta through the user-space driver and hardware permission check, now with logging/atomicity added to the picture.]

•  Write-ahead logging
  –  e.g., ARIES
•  Extra IO operations
  –  Wasted software latency/power
  –  Wasted IO performance
•  Multi-part atomic writes (see the sketch below)
  –  Apps can create and commit write groups
  –  Write data resides in a redo log
  –  The application can read and change the log contents
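As referenced above, a minimal sketch of the create/commit write-group model. The WriteGroup type and its methods are hypothetical shorthand for the interface this slide describes, not the actual hardware API; the key behaviors are that staged writes live in a redo log the application can still edit, and that commit applies the whole group at once.

// Hypothetical multi-part atomic write interface (illustrative only).
#include <cstdint>
#include <cstring>
#include <vector>

struct LogEntry {                      // one redo-log record
    uint64_t offset;
    std::vector<uint8_t> data;
};

struct WriteGroup {
    std::vector<LogEntry> redo_log;    // staged writes live here, not at their final locations

    // Stage a write; nothing reaches its final location yet.
    void add_write(uint64_t offset, const void* buf, size_t len) {
        const uint8_t* p = static_cast<const uint8_t*>(buf);
        LogEntry e;
        e.offset = offset;
        e.data.assign(p, p + len);
        redo_log.push_back(std::move(e));
    }

    // The application may read and edit staged data before commit.
    LogEntry& entry(size_t i) { return redo_log[i]; }

    // Commit applies every staged write; on real hardware this step would be
    // offloaded to the memory controllers and made atomic with respect to failures.
    void commit(std::vector<uint8_t>& storage) {
        for (const auto& e : redo_log)
            std::memcpy(storage.data() + e.offset, e.data.data(), e.data.size());
        redo_log.clear();
    }
};

int main() {
    std::vector<uint8_t> storage(4096);
    WriteGroup wg;
    uint32_t balance = 100;
    wg.add_write(0, &balance, sizeof balance);                        // stage an update
    balance = 80;
    std::memcpy(wg.entry(0).data.data(), &balance, sizeof balance);   // edit the log in place
    wg.commit(storage);                                               // all-or-nothing apply
}

The division of labor is the point: the application builds and edits the group, and a single commit makes the whole group durable at once, which is what lets MARS drop the separate undo log.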

18  

Bandwidth Comparison

[Two charts: sustained bandwidth (MB/s) vs. access size (0.5 KB to 512 KB). The first compares normal writes with write-ahead-logging (WAL) writes; the second adds AtomicWrite, which gives a 2 to 3.8x improvement over WAL writes.]

19  

ARIES – Write Ahead Logging for Databases on Disks

•  ARIES provides useful database features (we want to keep these)
  –  Atomicity
  –  Durability
  –  High concurrency
  –  Efficient recovery
•  But it makes disk-centric design decisions (rethink these; a generic WAL sketch follows below)
  –  Undo and redo logs to avoid random synchronous writes and provide IO scheduling flexibility
  –  Page-based storage management to match disk semantics
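For readers who have not seen write-ahead logging before, a generic, simplified sketch of the discipline ARIES builds on: log the change before touching the page, force only the sequential log at commit (no-force), and allow dirty pages to be written back whenever convenient, even before commit (steal). This is an illustration under those assumptions, not ARIES itself.

// Generic write-ahead-logging sketch (a simplification for illustration, not ARIES).
#include <cstdint>
#include <string>
#include <vector>

struct LogRecord { uint64_t page; std::string undo, redo; };

struct Database {
    std::vector<std::string> pages = std::vector<std::string>(8);   // cached data pages
    std::vector<LogRecord> log;                                     // sequential log

    // Append the undo/redo record BEFORE updating the page in place.
    void update(uint64_t page, const std::string& value) {
        log.push_back({page, pages[page], value});
        pages[page] = value;
    }

    // No-force: commit forces only the sequential log, not the data pages;
    // steal means dirty pages may be flushed lazily, even before commit.
    void commit() { force_log(); }

    void force_log() { /* synchronous, sequential write of `log` to storage */ }
};

int main() {
    Database db;
    db.update(3, "new row");   // undo = old page contents, redo = "new row"
    db.commit();               // the log is durable; recovery can redo or undo the change
}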

20  

ARIES’ Disk-Centric Design Decisions

Design decision | Advantages                               | Downsides
No-force        | Eliminates synchronous random writes.    | Requires synchronous writes to the redo log.
Steal           | Longer sequential writes.                | Requires an undo log.
Pages           | Matches disk and OS page-based IO model. | Page granularity is not always ideal.

21  

MARS: Rethinking ARIES for Moneta

No-force  →  Force updates on commit, but offload it to Moneta's memory controllers.
Steal  →  Get rid of undo logging; update redo log contents instead.
Pages  →  Software can choose object/page granularity.

22  

ARIES vs MARS

[Chart: 4KB-swap transactions per second vs. thread count (1 to 16) for MARS and ARIES.]

23  

Software’s Role in Moneta

•  The basic Moneta hardware is (relatively) simple
•  Software requirements drove the HW enhancements
•  Reduce, remove, redesign
  –  Simplify the kernel IO stack
  –  Eliminate the OS and FS overheads for most accesses
  –  Rethink write-ahead logging for new memories

24  

Two Case Studies (revisited)

[The roadmap diagram from earlier is repeated; the talk now turns to NV-Heaps: NVMs over DDR.]

25  

Disks and Files Are Passé

[Diagram: applications reaching DDR NVM through system calls, a file system, and a block driver.]

DDR NVM is a new kind of disk
  –  Use conventional file operations to access data
  –  Files are bags of bytes

DDR NVM is a new kind of memory
  –  Use VM to access data
  –  Create classes
  –  Use pointers/references
  –  Leverage strong types
  –  But make it persistent

26  

The Dangers of Non-Volatile Programming

Dlist Insert(Foo x, Dlist head) {
    Dlist r;
    r = new Dlist;
    r.data = x;
    r.next = head;
    r.prev = null;
    if (head) { head.prev = r; }
    return r;
}

What if the system crashes?

What if x and head are in different files?

27  

NV-Heaps: Safe, Fast, Persistent Objects

•  Container for NV data structures
  –  Generic
  –  Self-contained
•  Access via loads and stores (a usage sketch follows below)
  –  Memory-like performance
  –  Familiar semantics
•  Strong safety guarantees
  –  ACID transactions
  –  Pointer safety

[Diagram: an NV-Heap object store with transaction handling.]

[ASPLOS 2011]
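A minimal sketch of what the model above might look like in use; the NVHeap type, atomically(), and the use of shared_ptr as a stand-in for persistent pointers are all illustrative assumptions, not the published NV-Heaps API. The point is the shape of the programming model: allocate and link objects with ordinary pointers-and-classes code, inside a transaction.

// Hypothetical NV-Heap usage (illustrative names, not the real NV-Heaps API).
#include <functional>
#include <memory>
#include <string>

template <typename T>
struct NVHeap {
    std::shared_ptr<T> root;                       // stands in for the heap's persistent root

    // Run `body` with all-or-nothing semantics. A real NV-heap would log the updates
    // and make them durable atomically; this stand-in simply runs the body.
    void atomically(const std::function<void()>& body) { body(); }
};

struct Entry {
    std::string key, value;
    std::shared_ptr<Entry> next;                   // would be a persistent, type-safe pointer
};

int main() {
    NVHeap<Entry> heap;                            // in spirit: open_nvheap("/nvm/kvstore")
    heap.atomically([&] {                          // a crash mid-block would leave no partial update
        auto e = std::make_shared<Entry>();        // allocation would go to the NV-heap
        e->key = "user:42";
        e->value = "alice";
        e->next = heap.root;
        heap.root = e;                             // in a real NV-heap, visible only at commit
    });
}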

28  

A Type System for NV-Heaps

1.  Refine each type with a heap index
2.  Type rules ensure that the heaps match

Dlist<h> Insert(Foo<h> x, Dlist<h> head) {
    atomic<h> {
        Dlist<h> r;
        r = new Dlist in heapof(r);
        r.data = x;
        r.next = head;
        r.prev = null in heapof(r);
        if (head) { head.prev = r; }
    }
    return r;
}

29  

Application: A Persistent Key-Value Store

[Bar chart: operations/sec for Berkeley DB on a RAM disk, NV-heaps in DRAM, and (volatile) Memcached. NV-heaps are 39x faster than Berkeley DB and only an 8% slowdown relative to Memcached.]

NV-heaps are much faster than the block-based systems.
NV-heaps make the cost of persistence very low.

30  

NV-Heaps Wrap Up

•  The hardware is here already.
•  NV-heaps can change how programmers think about persistent state.
  –  Careful system design can avoid the pitfalls
  –  And the performance gains are huge

31  

Other Efforts

•  PCIe SSDs
  –  FusionIO et al.
  –  Intel NVMExpress prototypes [FAST '12]
  –  Texas Memory Systems SSDs
•  Atomic writes
  –  Transactional Flash [OSDI '08]
  –  FusionIO AtomicWrite [HPCA '11]
•  NV-Heaps
  –  Slow but safe
      •  Stasis [OSDI '06]
      •  Orthogonally persistent Java [SIGMOD '96]
  –  Fast but unsafe
      •  Recoverable Virtual Memory [SOSP '93]
      •  Rio Vista [SOSP '97]
  –  Fast and safe
      •  Mnemosyne [ASPLOS '11]
•  Many others…

32  

Open Challenges

33  

How Can We Elegantly Enrich Storage Semantics?

•  Atomic writes help, but…
•  Each application has its own needs
•  Legacy system architectures
•  Novel data structures

•  Can we build flexible semantics for storage?
•  Like shaders in graphics cards?

34  

How do NVMs affect the data center?

•  How should we organize a petabyte of NVMs?
  –  Centralized? Distributed? Hybrid?
•  NVM performance with large-scale system overheads
  –  Replication
  –  Network latency
  –  Thick software layers (e.g., OpenStack, HadoopFS)

35  

Software Costs Will Remain High

[Bar chart: software's contribution to latency across successive storage devices: 0.3%, 19.3%, 21.9%, 10.4%, 21.9%, 42.2%, 11.1%, 11.8%.]

PCIe and DDR performance will continue to improve.
PCIe and DDR power efficiency will continue to improve.
Thick software stacks will push software costs higher.
Applications will demand richer, more sophisticated interfaces.

36  

The App Gap Remains

[Bar chart, repeated from earlier: speedup vs. DISK-RAID (log scale) for Moneta, FusionIO, SSD-RAID, and DISK-RAID on XDD 4KB RW, Btree, HashTable, Bio, and PTF workloads.]

37  

Solid State Storage In the Data Age

•  Harnessing the data we collect requires vastly faster storage systems
•  Fast, non-volatile memories can provide it, but…
•  NVMs will force software's role to change
  –  Software will become a drag on IO performance
  –  Reduce, remove, redesign
•  Leveraging NVMs quickly requires coordinated research that spans system layers
•  No aspect of the system is safe!

38  

Thanks!  

39  

40  

Questions?