Fast, Energy Efficient Scan inside Flash Memory · PDF fileSSD L1/L2 Cache 13 . ADMS 2011 Key...

44
Fast, Energy Efficient Scan inside Flash Memory SSDs Sungchan Kim, Chonbuk National University, Korea Hyunok Oh, Hanyang University, Korea Chanik Park, Samsung Electronics, Korea Sangyeun Cho, University of Pittsburgh, U.S.A Sang-won Lee, Sungkyunkwan University, Korea ADMS 2011 International Workshop on Accelerating Data Management Systems Using Modern Processor and Storage Architectures

Transcript of Fast, Energy Efficient Scan inside Flash Memory · PDF fileSSD L1/L2 Cache 13 . ADMS 2011 Key...

Fast, Energy Efficient Scan inside Flash Memory SSDs

Sungchan Kim, Chonbuk National University, Korea

Hyunok Oh, Hanyang University, Korea

Chanik Park, Samsung Electronics, Korea

Sangyeun Cho, University of Pittsburgh, U.S.A

Sang-won Lee, Sungkyunkwan University, Korea

ADMS 2011 International Workshop on Accelerating Data

Management Systems Using Modern Processor and Storage Architectures

ADMS 2011

Trends in Large-scale Data Processing

• Data-centric computing

– Cloud computing, map-reduce, scientific data, analytics, search engine

– Data stored in company, public internet, and home is doubling every month • Several TBs/sec data to be processed

– Key operations • sequential scan / filtering / sorting / grouping / hashing …

• What implications on computing paradigm?

2

ADMS 2011

Trends in Large-scale Data Processing (cont’d)

• NO MORE conventional computing paradigm

– No data locality • Conventional memory hierarchy may not work

– No complex processing logics • Complex host CPU-based processing may be inefficient

• Then, new paradigm?

– Flash SSDs are computers

– In-Storage Processing inside flash Solid-State Disk(SSD)s

3

What is In-Storage Processing (ISP)?

ADMS 2011

Conventional Host Processing

• Current limitations – Bandwidth wall in conventional multiprocessor system in

handling data intensive applications

– Datacenter PUE (Power Utilization Efficiency)

Data capacity

Hot data

Cold data

<Database SW pipeline for query planning>

Aggregation

Join

Scan/ Project

<Data set to be handled> <Computing hierarchy>

CPU

DRAM North Bridge

South Bridge

SSD

L1/L2 Cache

5

ADMS 2011

Conventional Host Processing

• Current limitations – Bandwidth wall in conventional multiprocessor system in

handling data intensive applications

– Datacenter PUE (Power Utilization Efficiency)

Data capacity

Hot data

Cold data

<Database SW pipeline for query planning>

Aggregation

Join

Scan/ Project

<Data set to be handled> <Computing hierarchy>

CPU

DRAM North Bridge

South Bridge

SSD

L1/L2 Cache

6

ADMS 2011

Conventional Host Processing

• Current limitations – Bandwidth wall in conventional multiprocessor system in

handling data intensive applications

– Datacenter PUE (Power Utilization Efficiency)

Data capacity

Hot data

Cold data

<Database SW pipeline for query planning>

Aggregation

Join

Scan/ Project

<Data set to be handled> <Computing hierarchy>

CPU

DRAM North Bridge

South Bridge

SSD

L1/L2 Cache

7

ADMS 2011

Conventional Host Processing

• Current limitations – Bandwidth wall in conventional multiprocessor system in

handling data intensive applications

– Datacenter PUE (Power Utilization Efficiency)

Data capacity

Hot data

Cold data

<Database SW pipeline for query planning>

Aggregation

Join

Scan/ Project

<Data set to be handled> <Computing hierarchy>

CPU

DRAM North Bridge

South Bridge

SSD

L1/L2 Cache

8

ADMS 2011

Conventional Host Processing

• Current limitations – Bandwidth wall in conventional multiprocessor system in

handling data intensive applications

– Datacenter PUE (Power Utilization Efficiency)

Data capacity

Hot data

Cold data

<Database SW pipeline for query planning>

Aggregation

Join

Scan/ Project

<Data set to be handled> <Computing hierarchy>

CPU

DRAM North Bridge

South Bridge

SSD

L1/L2 Cache

New system balance is necessary !!

9

ADMS 2011

In-Storage Processing (ISP)

• Offloading host processor by moving (a part of) computations to storage medium

• Significant reduction of amount and latency of data transfer

• Unlimited I/O bandwidth

Data capacity

Hot data

Cold data

<Database SW pipeline for query planning>

Aggregation

Join

Scan/ Project

<Data set to be handled> <Computing hierarchy>

CPU

DRAM North Bridge

South Bridge

SSD

L1/L2 Cache

10

ADMS 2011

In-Storage Processing (ISP)

• Offloading host processor by moving (a part of) computations to storage medium

• Significant reduction of amount and latency of data transfer

• Unlimited I/O bandwidth

Data capacity

Hot data

Cold data

<Database SW pipeline for query planning>

Aggregation

Join

Scan/ Project

<Data set to be handled> <Computing hierarchy>

CPU

DRAM North Bridge

South Bridge

SSD

L1/L2 Cache

11

ADMS 2011

In-Storage Processing (ISP)

• Offloading host processor by moving (a part of) computations to storage medium

• Significant reduction of amount and latency of data transfer

• Unlimited I/O bandwidth

Data capacity

Hot data

Cold data

<Database SW pipeline for query planning>

Aggregation

Join

Scan/ Project

<Data set to be handled> <Computing hierarchy>

CPU

DRAM North Bridge

South Bridge

SSD

L1/L2 Cache

12

ADMS 2011

In-Storage Processing (ISP)

• Offloading host processor by moving (a part of) computations to storage medium

• Significant reduction of amount and latency of data transfer

• Unlimited I/O bandwidth

Data capacity

Hot data

Cold data

<Database SW pipeline for query planning>

Aggregation

Join

Scan/ Project

<Data set to be handled> <Computing hierarchy>

CPU

DRAM North Bridge

South Bridge

SSD

L1/L2 Cache

13

ADMS 2011

Key Enablers for ISP

• High speed host and NAND interface • NAND parallelism

– Increased raw bandwidth of storage medium

• Advanced SoC technology – Integration of massive computing elements close to storage

medium

14

Architecture for In-Storage Processing (ISP)

ADMS 2011

Why SCAN as Target Operation?

• Low data locality / reusability

– Only a small portion of record is used

• Parallelizable

– Multiple records can be scanned simultaneously

• Simple operation

– Hardware realization is feasible

• Reduction

– Aggregation / Low scan selectivity

– Below 1% in our experiments

16

ADMS 2011

Baseline ISP Architecture

• Embedded CPU

• Main performance bottleneck

Flash memory controller

Flash memory controller

NAND NAND NAND NAND

NAND NAND NAND NAND

CPU Local

memory

Flash memory controller

Flash memory controller

NAND NAND NAND NAND

NAND NAND NAND NAND

Host Interface

DRAM

17

ADMS 2011

Baseline ISP Architecture

• Embedded CPU

• Main performance bottleneck

Flash memory controller

Flash memory controller

NAND NAND NAND NAND

NAND NAND NAND NAND

CPU Local

memory

Flash memory controller

Flash memory controller

NAND NAND NAND NAND

NAND NAND NAND NAND

record record

record record

record record

record record

Host Interface

DRAM DRAM

Flash memory controller

Flash memory controller

Flash memory controller

Flash memory controller

NAND

NAND

NAND

NAND

18

ADMS 2011

Baseline ISP Architecture

• Embedded CPU

• Main performance bottleneck

Flash memory controller

Flash memory controller

NAND NAND NAND NAND

NAND NAND NAND NAND

CPU Local

memory

Flash memory controller

Flash memory controller

NAND NAND NAND NAND

NAND NAND NAND NAND

record record record record record record record record

Host Interface

DRAM

19

ADMS 2011

Baseline ISP Architecture

• Embedded CPU

• Main performance bottleneck

Flash memory controller

Flash memory controller

NAND NAND NAND NAND

NAND NAND NAND NAND

CPU Local

memory

Flash memory controller

Flash memory controller

NAND NAND NAND NAND

NAND NAND NAND NAND

Host Interface

DRAM

record record record

Host Interface

20

ADMS 2011

HW-accelerated ISP Architecture

• Parallel scans at each of FMCs

• “On-the-fly” processing

Flash memory controller

Flash memory controller

NAND NAND NAND NAND

NAND NAND NAND NAND

CPU Local

memory

Flash memory controller

Flash memory controller

NAND NAND NAND NAND

NAND NAND NAND NAND

Host Interface

DRAM

21

ADMS 2011

HW-accelerated ISP Architecture

• Parallel scans at each of FMCs

• “On-the-fly” processing

op1, val1

op2, val2

op3, val3

record

Aggregation logic

scan controller

compare logic

Accumulated compare result

NAND NAND NAND NAND

field1 field2

field3

… Aggr.

22

ADMS 2011

HW-accelerated ISP Architecture

• Parallel scans at each of FMCs

• “On-the-fly” processing

op1, val1

op2, val2

op3, val3

record

Aggregation logic

scan controller

compare logic

Accumulated compare result

NAND NAND NAND NAND

field1 field2

field3

… Aggr.

field1 val1

op1

field1 field2 field3 … Aggr. …

23

ADMS 2011

HW-accelerated ISP Architecture

• Parallel scans at each of FMCs

• “On-the-fly” processing

op1, val1

op2, val2

op3, val3

record

Aggregation logic

scan controller

compare logic

Accumulated compare result

NAND NAND NAND NAND

field1 field2

field3

… Aggr.

field1 val1

op1

field2 field3 … Aggr. …

field2 val2

op2

24

ADMS 2011

HW-accelerated ISP Architecture

• Parallel scans at each of FMCs

• “On-the-fly” processing

op1, val1

op2, val2

op3, val3

record

Aggregation logic

scan controller

compare logic

Accumulated compare result

NAND NAND NAND NAND

field1 field2

field3

… Aggr.

Aggr. …

25

ADMS 2011

HW-accelerated ISP Architecture

• Parallel scans at each of FMCs

• “On-the-fly” processing

op1, val1

op2, val2

op3, val3

record

Aggregation logic

scan controller

compare logic

Accumulated compare result

NAND NAND NAND NAND

field1 field2

field3

… Aggr.

End of record Write result

Aggregation result Final scan result

26

Evaluation

ADMS 2011

Analytic Model for Evaluation

• Estimation of scan execution time

– In-Host Processing / Baseline / HW-accelerated ISP

• Modeling accuracy

– Comparison with cycle-accurate simulation model

– Used query: Q6 in TPC-H benchmark

Baseline HW-ISP Model (cycles)

297282 16827

Simulation (cycles)

317446 16984

Error (%) 6.4 % 0.9 %

SELECT

sum (l_extendedprice * l_discount)

FROM

lineitem

WHERE

l_shipdate >= ‘1994-01-01’

and l_shipdate < 1995-01-01

and l_discount < 0.07

and l_discount > 0.05

and l_quantity < 24; 28

ADMS 2011

Throughput Evaluation

• Comparison of In-Host Processing and two ISP methods varying – Number of NAND channels: 8 or 16 channels

– NAND interface speed: 100, 200, 400 Mbps

• FIXED SATA host interface, low scan selectivity

Throughput (# of scanned records/s)

29

ADMS 2011

Throughput Evaluation

• Comparison of In-Host Processing and two ISP methods varying – Number of NAND channels: 8 or 16 channels

– NAND interface speed: 100, 200, 400 Mbps

• FIXED SATA host interface, low scan selectivity

30

ADMS 2011

Throughput Evaluation

• Comparison of In-Host Processing and two ISP methods varying – Number of NAND channels: 8 or 16 channels

– NAND interface speed: 100, 200, 400 Mbps

• FIXED SATA host interface, low scan selectivity

31

ADMS 2011

Throughput Evaluation

• Comparison of In-Host Processing and two ISP methods varying – Number of NAND channels: 8 or 16 channels

– NAND interface speed: 100, 200, 400 Mbps

• FIXED SATA host interface, low scan selectivity

0.000E+00

5.000E+06

1.000E+07

1.500E+07

2.000E+07

2.500E+07

3.000E+07

3.500E+07

100 MHz NAND 200 MHz NAND 400 MHz NAND

IHP-ch8

IHP-ch16

cpu-ch8

cpu-ch16

hw-ch8

hw-ch16

Scales linearly

13.9x speed up Throughput (# of scanned records/s)

32

ADMS 2011

Where is Performance Bottleneck?

33

ADMS 2011

Where is Performance Bottleneck?

• In-Host Processing: data transfer

34

ADMS 2011

Where is Performance Bottleneck?

• In-Host Processing: data transfer

• Baseline ISP: embedded processor

35

ADMS 2011

Where is Performance Bottleneck?

• In-Host Processing: data transfer

• Baseline ISP: embedded processor

• HW-ISP: NAND-bounded performance

0.0

500.0

1000.0

1500.0

2000.0

2500.0

3000.0

3500.0

4000.0

4500.0

5000.0

IHP cpu hw IHP cpu hw IHP cpu hw IHP cpu hw IHP cpu hw IHP cpu hw

100 MHz

NAND

200 MHz

NAND

400 MHz NAND 100 MHz

NAND

200 MHz

NAND

400 MHz

NAND

DRAM <-> host

DRAM <-> FMC

embedded

processor

host processor

Execution time (us)

36

ADMS 2011

Impact of Selectivity and Host Interface

• Host interface

– SATA2 (3 Gbps), SATA3 (6 Gbps), PCI-e (64 Gbps)

37

ADMS 2011

Impact of Selectivity and Host Interface

• Host interface

– SATA2 (3 Gbps), SATA3 (6 Gbps), PCI-e (64 Gbps)

38

ADMS 2011

Impact of Selectivity and Host Interface

• Host interface

– SATA2 (3 Gbps), SATA3 (6 Gbps), PCI-e (64 Gbps)

39

ADMS 2011

Impact of Selectivity and Host Interface

• Host interface

– SATA2 (3 Gbps), SATA3 (6 Gbps), PCI-e (64 Gbps)

0.000E+00

5.000E+06

1.000E+07

1.500E+07

2.000E+07

2.500E+07

3.000E+07

3.500E+07

4.000E+07

0.01 0.05 0.1 0.2 0.4 0.6 0.8 1

IHP-SATA2

IHP-SATA3

IHP-PCIe

cpu-isp-SATA2

cpu-isp-SATA3

cpu-isp-PCIe

hw-isp-SATA2

hw-isp-SATA3

hw-isp-PCIe

Throughput (# of scanned records/s)

DRAM inside SSD is performance bottleneck

Host interface is performance bottleneck

40

ADMS 2011

Energy Consumption Evaluation

• Another key benefit of ISP

• Emulation of HW-ISP on a real SSD platform

– Comparison based on actual measurement

Processing method Normalized

Energy consumption

ISP (modified firmware) 0.142

IHP (conventional) 1.000

41

ADMS 2011

Previous Efforts for ISP

• Database machine (1970s~1980s) – Accelerated operation with special purpose

hardware per head, track, or disk

• Active disks (1990s) – Disk array with low cost embedded processors

– Tries to offload host CPU’s workload with the computing power of the processors on disks

• Limitations – Limited bandwidth of disk media itself

– Faster and faster commodity CPUs

– No driving force in market • New storage interface, changes in software stacks

42

ADMS 2011

Summary

• In-Storage Processing as next generation data-centric computing paradigm – DO NOT bring data to computation

– BRING computation as close as to data

• We showed that – Significant performance/energy benefit potential

– Difference performance bottleneck points compared to the previous ISP approaches

• Future work – More DB operations ( join, sorting…)

– Extension to other data-processing domains (mining, web-searching…)

43

Thank You !!

Q&A