[G2] FaCE (DEVIEW 2012)

Flash-Based Extended Cache for Higher Throughput and Faster Recovery. Woon-hak Kang, Sang-won Lee, and Bongki Moon. 12. 9. 19.


 

Transcript of [G2] FaCE (DEVIEW 2012)

Page 1:

Flash-Based Extended Cache for Higher Throughput and Faster Recovery

Woon-hak Kang, Sang-won Lee, and Bongki Moon

12. 9. 19.

Page 2:

Outline
• Introduction
• Related work
• Flash as Cache Extension (FaCE)
  – Design choices
  – Two optimizations
• Recovery in FaCE
• Performance Evaluation
• Conclusion

Page 3: (outline slide repeated from Page 2)

Page 4:

Introduction
• Flash memory Solid State Drive (SSD)
  – NAND flash memory based non-volatile storage
• Characteristics
  – No mechanical parts
    • Low access latency and high random IOPS
  – Multi-channel and multi-plane
    • Intrinsic parallelism, high concurrency
  – No in-place overwriting
    • Erase-before-overwrite
    • Read cost << Write cost
  – Limited life span
    • Bounded by the # of erasures per flash block

Image from: http://www.legitreviews.com/article/1197/2/

Page 5:

Introduction (2)
• IOPS (I/Os per second) matters in OLTP
• IOPS/$: SSDs >> HDDs
  – e.g. SSD 63 (= 28,495 IOPS / $450) vs. HDD 1.7 (= 409 IOPS / $240)
• GB/$: HDDs >> SSDs
  – e.g. SSD 0.073 (= 32 GB / $440) vs. HDD 0.612 (= 146.8 GB / $240)
• Therefore, it is more sensible to use SSDs to supplement HDDs rather than to replace them
  – SSDs as a cache between RAM and HDDs
  – To provide both the performance of SSDs and the capacity of HDDs at as little cost as possible
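The two cost-efficiency ratios can be checked directly; the sketch below simply redoes the slide's arithmetic (the prices are the talk's 2012 examples, not current figures).

```python
# Recompute the slide's IOPS/$ and GB/$ figures (2012 example prices).
ssd_iops, ssd_price = 28_495, 450      # mid-range SSD, random-read IOPS
hdd_iops, hdd_price = 409, 240         # 15k rpm SAS HDD
ssd_gb, ssd_gb_price = 32, 440         # 32 GB SSD at $440
hdd_gb = 146.8                         # 146.8 GB HDD at $240

iops_per_dollar_ssd = ssd_iops / ssd_price   # ~63
iops_per_dollar_hdd = hdd_iops / hdd_price   # ~1.7
gb_per_dollar_ssd = ssd_gb / ssd_gb_price    # ~0.073
gb_per_dollar_hdd = hdd_gb / hdd_price       # ~0.61

print(iops_per_dollar_ssd / iops_per_dollar_hdd)  # SSDs: ~37x better IOPS/$
print(gb_per_dollar_hdd / gb_per_dollar_ssd)      # HDDs: ~8.4x better GB/$
```

The two prints quantify the trade-off the slide draws: SSDs win by more in IOPS/$ than HDDs win in GB/$, which motivates using the SSD as a cache rather than as primary storage.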

Page 6:

Introduction (3)
• A few existing flash-based cache schemes
  – e.g. Oracle Exadata, IBM, MS
  – Pages cached in the SSD are overwritten in place; the write pattern the SSD sees is random
• Write bandwidth disparity in SSDs
  – e.g. random write (25 MB/s = 6,314 x 4 KB/s) vs. sequential write (243 MB/s)

                   4KB Random Throughput (IOPS)   Sequential Bandwidth (MB/s)   Ratio Sequential/Random write
                   Read      Write                Read      Write
  SSD mid A        28,495    6,314                251       243                 9.85
  SSD mid B        35,601    2,547                259       80                  8.04
  HDD Single       409       343                  156       154                 114.94
  HDD Single (x8)  2,598     2,502                848       843                 86.25

Page 7:

Introduction (4)
• FaCE (Flash as Cache Extension) – main contributions
  – Write-optimized flash cache scheme: e.g. 3x higher throughput than the existing ones
  – Faster database recovery by exploiting the non-volatile cache pages in the SSD: e.g. 4x faster recovery time

[Diagram: DRAM / SSD / HDD hierarchy. Random reads are served from the SSD cache at low cost; random writes become sequential writes to the SSD (high throughput); the non-volatility of the flash cache is exploited for faster recovery.]

Page 8: Contents (section-divider slide; outline as on Page 2)

Page 9:

Related Work
• How to adopt SSDs in the DBMS area?
  1. SSD as a faster disk
     – VLDB '08, Koltsidas et al., "Flashing Up the Storage Layer"
     – VLDB '09, Canim et al., "An Object Placement Advisor for DB2 Using Solid State Storage"
     – SIGMOD '08, Lee et al., "A Case for Flash Memory SSD in Enterprise Database Applications"
  2. SSD as a DRAM buffer extension
     – VLDB '10, Canim et al., "SSD Bufferpool Extensions for Database Systems"
     – SIGMOD '11, Do et al., "Turbocharging DBMS Buffer Pool Using SSDs"

Page 10: [G2]fa ce deview_2012

Lazy  Cleaning  (LC)  [SIGMOD’11]  

•  Cache  on  exit  

•  Write-­‐back  policy  

•  LRU-­‐based  SSD  cache  replacement  policy  –  To  incur  almost  random  writes  against  SSD  

•  No  efficient  recovery  mechanism  provided

10 HDD

RAM Buffer (LRU) Flash memory SSD

Flash  hit

Fetch  on  miss

 Evict

Stage  out  dirty  pages

Random  writes
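As a rough illustration of why LC's replacement policy randomizes the SSD write pattern, here is a minimal LRU flash-cache sketch (a stand-in for the idea, not the SIGMOD'11 implementation; the class and method names are invented): each cached page owns exactly one SSD slot, so every hit update or replacement overwrites a slot in place.

```python
from collections import OrderedDict

class LRUFlashCache:
    """Toy LRU-managed flash cache. Each page has a single slot on the
    SSD; a re-cache of a hot page or a replacement overwrites that slot
    in place, which is why the SSD sees random writes."""

    def __init__(self, capacity, hdd):
        self.capacity = capacity
        self.slots = OrderedDict()   # page_id -> (version, dirty)
        self.hdd = hdd               # dict standing in for the disk

    def cache(self, page_id, version, dirty):
        if page_id in self.slots:
            # Hit: overwrite the existing slot in place (random write).
            self.slots.move_to_end(page_id)
        elif len(self.slots) >= self.capacity:
            # Miss with full cache: LRU victim, write back only if dirty.
            victim, (v, d) = self.slots.popitem(last=False)
            if d:
                self.hdd[victim] = v
        self.slots[page_id] = (version, dirty)
```

For example, re-caching page A overwrites A's slot wherever it sits on the SSD, and evicting a clean victim costs nothing while a dirty victim is staged out to the HDD.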

Page 11: Contents (section-divider slide; outline as on Page 2)

Page 12:

FaCE: Design Choices
1. When to cache pages in the SSD?
2. What pages to cache in the SSD?
3. Sync policy between the SSD and the HDD
4. SSD cache replacement policy

Page 13:

Design Choices: When / What / Sync Policy
• When: on entry vs. on exit
• What: clean vs. dirty vs. both
• Sync policy: write-through vs. write-back

FaCE's choices:
– On exit: dirty pages as well as clean pages
– Sync policy: write-back, for performance

[Diagram: the RAM buffer (LRU) fetches pages on miss from the HDD and evicts them to the flash cache extension; dirty pages are staged out to the HDD.]

Page 14: [G2]fa ce deview_2012

•  What  to  do  when  a  page  is  evicted  from  DRAM  buffer  and  SSD  cache  is  full  

•  LRU  vs.  FIFO  (First-­‐In-­‐First-­‐Out)  – Write  miss:  LRU-­‐based  vicIm  selecIon,  write-­‐back  if  dirty  vicIm,  and  overwrite  the  old  vicIm  page  with  the  new  page  being  evicted  

– Write  hit:  overwrite  the  old  copy  in  flash  cache  with  the  updated  page  being  evicted  

Design  Choices:  SSD  Cache  Replacement  Policy

14

HDD

RAM Buffer (LRU) Flash as Cache Extension

Evict    

Random  writes  

against  SSD

Page 15:

Design Choices: SSD Cache Replacement Policy
• LRU vs. FIFO (First-In-First-Out)
  – Victims are chosen from the rear end of the flash cache: "sequential writes" against the SSD
  – Write hit: no additional action is taken, in order not to incur random writes
    • Multiple versions of a page may reside in the SSD cache

[Diagram: Multi-Version FIFO (mvFIFO) – pages evicted from the RAM buffer are appended to the flash cache extension.]

Page 16: [G2]fa ce deview_2012

Write  ReducIon  in  mvFIFO •  Example  

– Reduce  three  writes  to  HDD  to  one  

RAM Buffer (LRU) Flash as Cache Extension

Evict    

Page  P-­‐v2 Invalidated    version

HDD

Page  P-­‐v3

16

Page  P-­‐v1 Invalidated  version Choose  Vic=m Discard

Mul=ple  Versions  of  Page  P

Write-­‐back  to  HDD
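A minimal sketch of the mvFIFO idea (illustrative only; the names are invented, and for brevity every valid victim is treated as dirty): new versions are appended to the rear, older copies are merely invalidated rather than overwritten, and only a valid victim leaving the queue is written back to the HDD.

```python
from collections import deque

class MvFIFOCache:
    """Toy multi-version FIFO (mvFIFO) flash cache. Appends to the rear
    keep the SSD write pattern sequential; stale versions of a page are
    invalidated in place and later discarded for free, so several cached
    versions cost at most one HDD write-back."""

    def __init__(self, capacity, hdd):
        self.capacity = capacity
        self.queue = deque()     # entries: [page_id, version, valid]
        self.latest = {}         # page_id -> its valid (rear-most) entry
        self.hdd = hdd           # dict standing in for the disk

    def put(self, page_id, version):
        # Invalidate any older cached copy instead of overwriting it.
        if page_id in self.latest:
            self.latest[page_id][2] = False
        if len(self.queue) >= self.capacity:
            self._evict()
        entry = [page_id, version, True]
        self.queue.append(entry)         # sequential append to the rear
        self.latest[page_id] = entry

    def _evict(self):
        page_id, version, valid = self.queue.popleft()
        if valid:                        # invalidated versions: discard
            self.hdd[page_id] = version  # single write-back to the HDD
            del self.latest[page_id]
```

Running the slide's example through the sketch, three cached versions of page P lead to exactly one HDD write, for the latest version.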

Page 17: [G2]fa ce deview_2012

Design  Choices:  SSD  Cache  Replacement  Policy

•  LRU  vs.  FIFO  

17

LRU FIFO Write  paXern Random Sequen=al Write  performance Low High  #  of  copy  pages Single MulIple Space  uIlizaIon High Low Hit  raIo  &  write  reducIon High Low

•  Trade-­‐off  :  hit-­‐raIo  <>  write  performance  – Write  performance  benefit  by  FIFO  >>  Performance  gain  from  higher  hit  raIo  by  LRU  

Page 18: [G2]fa ce deview_2012

mvFIFO:  Two  OpImizaIons •  Group  Replacement  (GR)      

– MulIple  pages  are  replaced  in  a  group  in  order  to  exploit  the  internal  parallelism  in  modern  SSDs  

–  Replacement  depth  is  limited  by  parallelism  size  (channel  *  plane)  

–  GR  can  improve  SSD  I/O  throughput  

•  Group  Second  Chance  (GSC)      –  GR  +  Second  chance  –  if  a  vicIm  candidate  page  is  valid  and  referenced,  will  re-­‐enque  the  vicIm  to  SSD  cache  

•  A  variant  of  “clock”  replacement  algorithm  for  the  FaCE  –  GSC  can  achieve  higher  hit  raIo  and  more  write  reducIons  

18
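The GR/GSC victim loop can be sketched as follows (an illustration under assumed entry fields, not the paper's code): a whole group of candidates is taken from the front at once, valid-and-referenced pages get a second chance and are re-enqueued with their reference bit cleared, valid dirty pages are written back in a batch, and everything else is discarded for free.

```python
from collections import deque

def group_second_chance(queue, group_size, hdd):
    """Toy Group Second Chance pass over a mvFIFO queue whose entries
    are dicts with 'page', 'version', 'valid', 'dirty', 'ref' fields
    (field names are assumptions of this sketch). Returns the number
    of slots actually freed."""
    evicted = 0
    for _ in range(min(group_size, len(queue))):
        e = queue.popleft()
        if e['valid'] and e['ref']:
            e['ref'] = False      # second chance: re-enqueue at the rear
            queue.append(e)
        else:
            if e['valid'] and e['dirty']:
                hdd[e['page']] = e['version']   # batched write-back
            evicted += 1          # invalidated/clean pages just vanish
    return evicted
```

Taking the group from the front in one batch is what lets the SSD reads, the HDD write-backs, and the subsequent group write each be issued together, matching GR's "single group read / batch writes / single group write" steps; the reference-bit check is the clock-style second chance.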

Page 19: [G2]fa ce deview_2012

RAM

Group  Replacement  (GR)  •  Single  group  read  from  SSD  (64/128  pages)  

•  Batch  random  writes  to  HDD  

•  Single  group  write  to  SSD  

19 HDD

RAM Buffer (LRU)

Flash as Cache Extension

1.  Fetch  on  miss

2.  Evict    

Check  valid  and  dirty  flag  

Flash  Cache  becomes  FULL

Page 20: [G2]fa ce deview_2012

RAM

Group  Second  Chance  (GSC)  •  GR  +  Second  Chance  

20 HDD

RAM Buffer (LRU)

Flash as Cache Extension

1.  Fetch  on  miss

2.  Evict    

Check  valid  and  dirty  flag

Check  reference  bit,  if  true  gave  them  

2nd  chance

Flash  Caches  become  FULL

reference  bit  is  ON

Page 21: Contents (section-divider slide; outline as on Page 2)

Page 22: [G2]fa ce deview_2012

Recovery  Issues  in  SSD  Cache •  With  write-­‐back  sync  policy,  many  recent  copies  of  data  pages  ar

e  kept  in  SSD,  not  in  HDD.  

•  Therefore,  database  in  HDD  is  in  an  inconsistent  state  ayer  system  failure    

22

SSD Mapping Information (Metadata) RAM

HDD

Flash as Cache Extension

New  version  of  page  P

Old  version  of  page  P

Crash Inconsistent  

state

Page 23:

Recovery Issues in SSD Cache
(first two bullets repeated from Page 22)
• In this situation, one recovery approach with a flash cache is to view the database on hard disk as the only persistent DB [SIGMOD '11]
  – Periodically checkpoint updated pages from the SSD cache as well as the DRAM buffer to the HDD

[Diagram: checkpointing new page versions from both RAM and the SSD cache back to the persistent DB on the HDD incurs excessive checkpoint cost.]

Page 24: [G2]fa ce deview_2012

Flush  every  update

Recovery  Issues  in  SSD  Cache(2) •  Fortunately,  because  SSDs  are  non-­‐vola=le,  pages  cached  in  SSD  are  al

ive  even  ayer  system  failure.  

•  SSD  mapping  informaIon  has  gone  

•  Two  approaches  for  recovering  metadata.  1.  Rebuild  lost  metadata  by  scanning  the  whole  pages  cached  in  SSD  (Naïve

 approach)  –  Time-­‐consuming  scanning  2.  Write  metadata  persistently  whenever  metadata  is  changed  [DaMon  11]

 –  Run-­‐Ime  overhead  for  managing  metadata  persistently  

24

SSD Mapping Information RAM

HDD

Flash as Cache Extension

Persistent    DB

Full  Scanning

New  version  of  page  P

Old  version  of  page  P

Page 25: [G2]fa ce deview_2012

Recovery  in  FaCE •  Metadata  checkpoinIng  

–  Because  a  data  page  entering  SSD  cache  is  wriXen  to  the  rear  in  chronological  order,  metadata  can  be  wriXen  regularly  in  a  single  large  segment  

 

25

SSD Metadata RAM

HDD

Flash as Cache Extension

64K  page  info.

Mapping Segment

Periodically  checkpoint  metadata

Recovery  :  Scanning  segment

Crash
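A toy sketch of segment-based metadata checkpointing and recovery (the segment size and layout here are illustrative; the talk's segments cover 64K pages): because mvFIFO appends in chronological order, full segments of mapping entries can be flushed sequentially, and recovery replays the checkpointed segments plus a short scan of whatever was cached after the last checkpoint, instead of scanning the whole SSD.

```python
SEGMENT_SIZE = 4  # mapping entries per segment (64K pages in the talk)

def checkpoint(mapping_log, segments):
    """Flush full segments of (page_id, ssd_slot) mapping entries to
    persistent storage; entries accumulate in FIFO order, so each flush
    is one large sequential write."""
    while len(mapping_log) >= SEGMENT_SIZE:
        segments.append(mapping_log[:SEGMENT_SIZE])
        del mapping_log[:SEGMENT_SIZE]

def recover(segments, tail_pages):
    """Rebuild the SSD mapping after a crash: replay the checkpointed
    segments, then scan only the few pages cached since the last
    checkpoint (rather than the whole flash cache)."""
    mapping = {}
    for seg in segments:
        for page_id, slot in seg:
            mapping[page_id] = slot
    for page_id, slot in tail_pages:   # short un-checkpointed tail
        mapping[page_id] = slot
    return mapping
```

The key property exploited is that the tail is bounded by the checkpoint interval, which is why metadata recovery took only seconds in the evaluation while the naive full scan is proportional to the whole SSD.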

Page 26: Contents (section-divider slide; outline as on Page 2)

Page 27:

Experimental Setup
• FaCE implementation in PostgreSQL
  – 3 functions in the buffer manager: bufferAlloc(), getFreeBuffer(), bufferSync()
  – 2 functions in bootstrap for recovery: startupXLOG(), initBufferPool()
• Experiment setup
  – CentOS Linux
  – Intel Core i7-860 2.8 GHz (quad core), 4 GB DRAM
  – Disks: 8 RAIDed 15k rpm Seagate SAS HDDs (146.8 GB)
  – SSD: Samsung MLC (256 GB)
• Workloads
  – TPC-C with 500 warehouses (50 GB) and 50 concurrent clients
  – BenchmarkSQL

Page 28:

Transaction Throughput

[Chart: transactions per minute vs. flash cache size as a fraction of the database (4% to 28%), for HDD only, SSD only, LC, FaCE-basic, FaCE+GR, and FaCE+GSC. Speedup annotations on the chart: 1.5x, 2.6x, 3.1x, 3.9x and 2.1x, 2.4x, 2.6x.]

Page 29: [G2]fa ce deview_2012

Hit  RaIo,  Write  ReducIon,  and  I/O  Throughput

29

60  

65  

70  

75  

80  

85  

90  

95  

100  

2GB   4GB   6GB   8GB   10GB  

Hit  ra=

o  (%

)

Flash  cache  size

Flash  Cache  Hit  Ra=o

LC   FaCE   FaCE+GR   FaCE+GSC  

LC

FaCE-­‐basic FaCE+GR

FaCE+GSC

Page 30: [G2]fa ce deview_2012

Hit  RaIo,  Write  ReducIon,  and  I/O  Throughput

30

60  

65  

70  

75  

80  

85  

90  

95  

100  

2GB   4GB   6GB   8GB   10GB  

Hit  ra=

o  (%

)

Flash  cache  size

Flash  Cache  Hit  Ra=o

LC   FaCE   FaCE+GR   FaCE+GSC  

40  

50  

60  

70  

80  

90  

100  

2GB   4GB   6GB   8GB   10GB  

Ra=o

(%)  

Flash  cache  size

Write  Reduc=on  Ra=o  By  Flash  Cache

LC   FaCE   FaCE+GR   FaCE+GSC  

40  

50  

60  

70  

80  

90  

100  

2GB   4GB   6GB   8GB   10GB  

Ra=o

(%)  

Flash  cache  size

Write  Reduc=on  Ra=o  By  Flash  Cache

LC   FaCE   FaCE+GR   FaCE+GSC  

LC

FaCE-­‐basic FaCE+GR

FaCE+GSC

Page 31: [G2]fa ce deview_2012

Hit  RaIo,  Write  ReducIon,  and  I/O  Throughput

31

60  

65  

70  

75  

80  

85  

90  

95  

100  

2GB   4GB   6GB   8GB   10GB  

Hit  ra=

o  (%

)

Flash  cache  size

Flash  Cache  Hit  Ra=o

LC   FaCE   FaCE+GR   FaCE+GSC  

40  

50  

60  

70  

80  

90  

100  

2GB   4GB   6GB   8GB   10GB  

Ra=o

(%)  

Flash  cache  size

Write  Reduc=on  Ra=o  By  Flash  Cache

LC   FaCE   FaCE+GR   FaCE+GSC  

0  

2000  

4000  

6000  

8000  

10000  

12000  

14000  

16000  

2GB   4GB   6GB   8GB   10GB  

Throughp

ut  (4

KB)

Flash  cache  size

Throughput  of  4KB-­‐page  I/O

LC   FaCE   FaCE+GR   FaCE+GSC  

0  

2000  

4000  

6000  

8000  

10000  

12000  

14000  

16000  

2GB   4GB   6GB   8GB   10GB  

Throughp

ut  (4

KB)

Flash  cache  size

Throughput  of  4KB-­‐page  I/O

LC   FaCE   FaCE+GR   FaCE+GSC  

0  

2000  

4000  

6000  

8000  

10000  

12000  

14000  

16000  

2GB   4GB   6GB   8GB   10GB  

Throughp

ut  (4

KB)

Flash  cache  size

Throughput  of  4KB-­‐page  I/O

LC   FaCE   FaCE+GR   FaCE+GSC  

LC

FaCE-­‐basic

FaCE+GR

FaCE+GSC

Page 32: [G2]fa ce deview_2012

Recovery  Performance •  4.4x  faster  recovery  than  HDD  only  approach  

32

redo  Ime  :  823 redo  Ime  :  186

Metadata  recovery  :  2

Page 33: Contents (section-divider slide; outline as on Page 2)

Page 34: [G2]fa ce deview_2012

Conclusion •  We  presented  a  low-­‐overhead  caching  method  called  FaCE  that  uIlizes  flash  memory  as  an  extension  to  a  DRAM  buffer  for  a  recoverable  database.  

•  FaCE  can  maximized  the  I/O  throughput  of  a  flash  caching  device  by  turning  small  random  writes  to  large  sequenIal  ones  

•  Also,  FaCE  takes  advantage  of  the  non-­‐volaIlity  of  flash  memory  to  accelerate  the  system  restart  from  a  failure.  

34

Page 35:

Q&A