INSTITUTE OF COMPUTING TECHNOLOGY Exploiting the Produce-Consume Relationship in DMA to Improve I/O...

INSTITUTE OF COMPUTING TECHNOLOGY. Exploiting the Produce-Consume Relationship in DMA to Improve I/O Performance. Dan Tang, Yungang Bao, Yunji Chen, Weiwu Hu, Mingyu Chen. Institute of Computing Technology, Chinese Academy of Sciences. 2009.2.15. Workshop on The Influence of I/O


Page 1:

Exploiting the Produce-Consume Relationship in DMA to Improve I/O Performance

Dan Tang, Yungang Bao, Yunji Chen, Weiwu Hu, Mingyu Chen

Institute of Computing Technology, Chinese Academy of Sciences

2009.2.15

Workshop on The Influence of I/O on Microprocessor Architecture (IOM-2009)

Page 2:

A Brief Intro of ICT, CAS

ICT has built the fastest HPC system in China, the Dawning 5000, which delivers 233.5 TFlops and ranks 10th in the Top500.

ICT has developed the Loongson CPU

Page 3:

Overview

• Background
• Nature of DMA Mechanism
• DMA Cache Scheme
• Research Methodology
• Evaluations
• Conclusions and Ongoing Work

Page 4:

Importance of I/O operations

I/O is ubiquitous
• Loading binary files: Disk → Memory
• Browsing the web, streaming media: Network → Memory
• …

I/O is important
• Many commercial applications are I/O intensive: databases, Internet applications, etc.

Page 5:

State-of-the-Art I/O Technologies

I/O Buses: 20GB/s
• PCI-Express 2.0
• HyperTransport 3.0
• QuickPath Interconnect

I/O Devices
• RAID: 400MB/s
• 10GE: 1.25GB/s

Page 6:

A Typical Computer Architecture

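[Figure: block diagram of a typical computer architecture; the only label that survives extraction is "NIC"]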

Page 7:

Direct Memory Access (DMA)

DMA is an essential feature of I/O operation in all modern computers

DMA allows I/O subsystems to access system memory for reading and/or writing independently of the CPU.

Many I/O devices use DMA, including disk drive controllers, graphics cards, network cards, sound cards and GPUs.

Page 8:

Overview

• Background
• Nature of DMA Mechanism
• DMA Cache Scheme
• Research Methodology
• Evaluations
• Conclusions and Ongoing Work

Page 9:

DMA in Computer Architecture

[Figure: DMA in the computer architecture; the only label that survives extraction is "NIC"]

Page 10:

An Example of Disk Read: DMA Receiving Operation

[Figure: DMA Engine, CPU and Memory, with Descriptor, Driver Buffer, Kernel Buffer and User Buffer; steps labeled ①, ②, ③]

• Cache Access Latency: ~20 Cycles
• Memory Access Latency: ~200 Cycles
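The two latency figures above give a feel for what is at stake when the CPU consumes a DMA buffer from memory rather than from a cache. The following is a minimal sketch of that arithmetic, not a measurement: the 64-byte line size, the 4 KB buffer and the "one serialized access per cache line" model are illustrative assumptions that do not come from the slides.

```python
# Rough cost model for the CPU consuming a DMA buffer.
# The two latencies are the approximate figures from the slide; the line size
# and the serialized one-access-per-line model are illustrative assumptions.

CACHE_LATENCY_CYCLES = 20    # ~20 cycles per cache access (from the slide)
MEMORY_LATENCY_CYCLES = 200  # ~200 cycles per memory access (from the slide)
LINE_BYTES = 64              # assumed cache-line size


def read_cycles(buffer_bytes: int, latency_cycles: int) -> int:
    """Cycles to touch every cache line of the buffer once, with no overlap."""
    lines = -(-buffer_bytes // LINE_BYTES)  # ceiling division
    return lines * latency_cycles


buffer_bytes = 4096  # hypothetical 4 KB DMA buffer
from_memory = read_cycles(buffer_bytes, MEMORY_LATENCY_CYCLES)
from_cache = read_cycles(buffer_bytes, CACHE_LATENCY_CYCLES)
print("cycles from memory:", from_memory)
print("cycles from cache :", from_cache)
print("ratio             :", from_memory / from_cache)
```

Real hardware overlaps misses and prefetches, so the gap in practice is smaller than this serialized model suggests; the sketch only shows why moving I/O data closer to the CPU is attractive.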

Page 11:

Potential Improvement of DMA

[Figure: DMA Engine, CPU and Memory, with Descriptor, Driver Buffer, Kernel Buffer and User Buffer; steps labeled ①, ②, ③]

• This is a typical Shared-Cache Scheme

Page 12:

Problems of Shared-Cache Scheme

• Cache pollution
• Cache thrashing
• Performance degrades when DMA requests are large (>100KB), as in the “Oracle + TPC-H” application

Page 13:

Rethink DMA Mechanism

The Nature of DMA
• There is a producer-consumer relationship between the CPU and the DMA engine
• Memory acts as a transient place for I/O data transferred between the processor and the I/O device

Corollaries
• Once I/O data is produced, it will be consumed
• I/O data within a DMA buffer is used only once in most cases (i.e. almost no reuse)
• The characteristics of I/O data differ from those of CPU data
• It may not be appropriate to store I/O data and CPU data together
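The argument that use-once I/O data should not share a cache with reused CPU data can be illustrated with a toy simulation. This is a sketch under stated assumptions, not the authors' experiment: the fully associative LRU model, the cache sizes (64 and 16 blocks), the 48-block CPU working set and the 32-block I/O burst per round are all made-up illustrative values, and `LRUCache` and `cpu_hit_rate` are hypothetical names.

```python
from collections import OrderedDict


class LRUCache:
    """Tiny fully associative LRU cache that tracks block addresses only."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.blocks = OrderedDict()

    def access(self, addr):
        """Touch a block; return True on a hit, install the block on a miss."""
        if addr in self.blocks:
            self.blocks.move_to_end(addr)
            return True
        if len(self.blocks) >= self.capacity:
            self.blocks.popitem(last=False)  # evict the least recently used block
        self.blocks[addr] = None
        return False


def cpu_hit_rate(shared, rounds=50):
    """Hit rate of CPU working-set accesses while an I/O stream passes through."""
    cpu_cache = LRUCache(capacity=64)
    # Shared scheme: I/O data lands in the CPU cache.
    # Dedicated scheme: I/O data lands in a small separate cache (assumed size).
    io_cache = cpu_cache if shared else LRUCache(capacity=16)
    working_set = range(48)       # CPU data, reused every round
    next_io_block = 1_000_000     # I/O data, each block used exactly once
    hits = total = 0
    for _ in range(rounds):
        for addr in working_set:
            hits += cpu_cache.access(addr)
            total += 1
        for _ in range(32):
            io_cache.access(next_io_block)  # producer: DMA write installs the block
            io_cache.access(next_io_block)  # consumer: CPU reads it once
            next_io_block += 1
    return hits / total


print("CPU hit rate, shared cache   :", cpu_hit_rate(shared=True))
print("CPU hit rate, dedicated cache:", cpu_hit_rate(shared=False))
```

In this deliberately adversarial setup the I/O stream keeps evicting the CPU working set in the shared configuration, while the dedicated configuration keeps the working set resident after the first round. The dedicated configuration also has slightly more total capacity, so the comparison shows the mechanism rather than a like-for-like result.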

Page 14:

Overview

• Background
• Nature of DMA Mechanism
• DMA Cache Scheme
• Research Methodology
• Evaluations
• Conclusions and Ongoing Work

Page 15:

DMA Cache Proposal

A Dedicated Cache
• Stores I/O data
• Capable of exchanging data with the processor’s last level cache (LLC)
• Reduces the overhead of I/O data movement

[Figure: diagram of the proposal; the only label that survives extraction is "DMA"]

Page 16:

DMA Cache Design Issues

• Cache Coherence
• Data Path
• Replacement Policy
• Write Policy
• Prefetching

[Figures: CPU Cache State Diagram; DMA Cache State Diagram]

The DMA cache state diagram is similar to that of a CPU cache in a uniprocessor system.

We are researching multiprocessor platforms…

Page 17:

DMA Cache Design Issues

• Cache Coherence
• Data Path
• Replacement Policy
• Write Policy
• Prefetching

Additional data paths and data access ports for the LLC are not required, because data migration between the DMA cache and the LLC can share the existing data paths and ports of the snooping mechanism.

Page 18:

Data Path: CPU Read

[Figure: data path of a CPU read. Labels: Cache Ctrl, Snoop Ctrl, Last Level Cache, Mem Ctrl, Memory, System Bus (cmd, data), DMA Ctrl, I/O Device, Cache Ctrl, Snoop Ctrl, DMA Cache. Check shown: "Hit in DMA cache?". Case shown: miss in LLC and hit in DMA Cache.]

Page 19:

Data Path: DMA Read

[Figure: data path of a DMA read. Labels: Cache Ctrl, Snoop Ctrl, Last Level Cache, Mem Ctrl, Memory, System Bus (cmd, data), DMA Ctrl, I/O Device, Cache Ctrl, Snoop Ctrl, DMA Cache. Check shown: "Hit in LLC?". Case shown: miss in DMA Cache and hit in LLC.]
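The two data-path slides (CPU read and DMA read) describe symmetric cases: a miss in the requester's own cache that hits in the other cache is served by that other cache through the snooping mechanism. A minimal sketch of that lookup order follows. It models caches as plain sets of block addresses, and `Source` and `serve_read` are hypothetical names introduced for illustration; the real design is hardware, and the memory fallback is an assumption about the case the slides do not draw.

```python
from enum import Enum


class Source(Enum):
    LLC = "last level cache"
    DMA_CACHE = "DMA cache"
    MEMORY = "memory"


def serve_read(requester, addr, llc, dma_cache):
    """Decide which component supplies a block for a CPU or DMA read.

    The requester's own cache is checked first; on a miss, the other cache
    is snooped; memory is used only when both caches miss (assumed fallback).
    """
    if requester == "cpu":
        own, own_src, peer, peer_src = llc, Source.LLC, dma_cache, Source.DMA_CACHE
    elif requester == "dma":
        own, own_src, peer, peer_src = dma_cache, Source.DMA_CACHE, llc, Source.LLC
    else:
        raise ValueError(f"unknown requester: {requester!r}")
    if addr in own:
        return own_src
    if addr in peer:  # e.g. CPU read: miss in LLC and hit in DMA cache
        return peer_src
    return Source.MEMORY


# CPU read that misses in the LLC and hits in the DMA cache.
print(serve_read("cpu", 0x40, llc=set(), dma_cache={0x40}))
# DMA read that misses in the DMA cache and hits in the LLC.
print(serve_read("dma", 0x80, llc={0x80}, dma_cache=set()))
```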

Page 20:

DMA Cache Design Issues

• Cache Coherence
• Data Path
• Replacement Policy
• Write Policy
• Prefetching

An LRU-like Replacement Policy

1. Invalid Block

2. Clean Block

3. Dirty Block
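One way to read the numbered list above is as a victim priority: replace an invalid block if one exists, otherwise a clean block, otherwise a dirty block, using LRU order within each class. The sketch below implements that reading; the interpretation itself, the `Block` fields and the `choose_victim` name are assumptions, since the slide gives only the three labels.

```python
from dataclasses import dataclass


@dataclass
class Block:
    tag: int
    valid: bool
    dirty: bool
    last_used: int  # time of the most recent access (smaller means older)


def choose_victim(cache_set):
    """Pick a victim: invalid first, then clean, then dirty; LRU within a class."""

    def priority(block):
        if not block.valid:
            rank = 0  # 1. invalid block
        elif not block.dirty:
            rank = 1  # 2. clean block
        else:
            rank = 2  # 3. dirty block
        return (rank, block.last_used)

    return min(cache_set, key=priority)


ways = [
    Block(tag=1, valid=True, dirty=True, last_used=0),
    Block(tag=2, valid=True, dirty=False, last_used=5),
    Block(tag=3, valid=True, dirty=False, last_used=3),
]
# No invalid block here, so the older of the two clean blocks is chosen.
print(choose_victim(ways).tag)
```

Preferring clean victims avoids a write-back on replacement, which is a plausible motivation for this ordering, though the slide does not state it.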

Page 21:

DMA Cache Design Issues

• Cache Coherence
• Data Path
• Replacement Policy
• Write Policy
• Prefetching

Write policy:
• Adopt a write-allocate policy
• Both write-back and write-through policies are available

Page 22:

DMA Cache Design Issues

• Cache Coherence
• Data Path
• Replacement Policy
• Write Policy
• Prefetching

Prefetching:
• Adopt straightforward sequential prefetching
• Prefetching is triggered by a cache miss
• Fetch 4 blocks at a time
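The prefetching rule above can be sketched in a few lines. This is an illustrative model only: it assumes that "4 blocks" means the missed block plus the next three sequential blocks (the slide does not say whether the demand block is counted), it ignores cache capacity, and `FETCH_BLOCKS` and `on_access` are hypothetical names.

```python
FETCH_BLOCKS = 4  # blocks brought in per miss (assumed to include the missed block)


def on_access(block, cache):
    """Return the blocks fetched for this access (empty list on a hit)."""
    if block in cache:
        return []  # hit: no prefetch is triggered
    # Miss: fetch the missed block and the following sequential blocks.
    fetched = [b for b in range(block, block + FETCH_BLOCKS) if b not in cache]
    cache.update(fetched)
    return fetched


cache = set()
# A sequential stream, as produced by a DMA transfer walking through a buffer.
misses = [b for b in range(16) if on_access(b, cache)]
print("blocks that missed:", misses)
```

For a purely sequential stream only one access in every four misses under this model, which is the kind of regular pattern that makes simple sequential prefetching effective for I/O data.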

Page 23:

Overview

• Background
• Nature of DMA Mechanism
• DMA Cache Scheme
• Research Methodology
• Evaluations
• Conclusions and Ongoing Work

Page 24:

Memory Trace Collection

Hyper Memory Trace Tool (HMTT)
• Capable of collecting all memory requests
• Provides APIs for injecting tags into the memory trace to identify high-level system operations

Page 25:

FPGA Emulation
• L2 cache from Godson-2F
• DDR2 memory controller from Godson-2F
• DDR2 DIMM model from Micron Technology
• Xtreme system from Cadence

[Figure: emulation setup with Memory trace, L2 Cache, DMA Cache, MemCtrl and DDR2 DRAM]

Page 26:

Overview

• Background
• Nature of DMA Mechanism
• DMA Cache Scheme
• Research Methodology
• Evaluations
• Conclusions and Ongoing Work

Page 27:

Experimental Setup

Machine
• AMD Opteron
• 2GB Memory
• 1 GE NIC
• IDE disk

Benchmark
• File Copy
• TPC-H
• SPECWeb2005

Configurations
• Snoop Cache (2MB)
• Shared Cache (2MB)
• DMA Cache
  - 256KB + prefetch
  - 256KB w/o prefetch
  - 128KB + prefetch
  - 128KB w/o prefetch
  - 64KB + prefetch
  - 64KB w/o prefetch
  - 32KB + prefetch
  - 32KB w/o prefetch

Page 28:

Characterization of DMA

The portion of memory references that are DMA references varies depending on the application

The sizes of DMA requests vary depending on the application

Page 29:

Normalized Speedup

• The baseline is the snoop cache scheme
• The DMA cache schemes exhibit better performance than the others
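The slide does not give the formula behind the normalized speedup. A common definition, shown here only as an assumption, divides the baseline's total cycles by those of the scheme under test, so the snoop cache baseline is 1 by construction. The symbol $T_s$ (total cycles of scheme $s$ on a given benchmark) is introduced for illustration.

```latex
\[
  \text{Speedup}_{s} = \frac{T_{\text{snoop}}}{T_{s}},
  \qquad
  \text{Speedup}_{\text{snoop}} = 1
\]
```

Under this definition a value above 1 means the scheme needs fewer cycles than the snoop cache baseline.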

Page 30:

DMA Write & CPU Read Hit Rate

• Both the shared cache and the DMA cache exhibit high hit rates
• Then where do the cycles go in the shared cache scheme?

Page 31:

Breakdown of Normalized Total Cycles

Page 32:

% of DMA Writes causing Dirty Block Replacement

• Those DMA writes cause the cache pollution and thrashing problems
• The 256KB DMA cache is able to largely eliminate these phenomena

Page 33:

% of Valid Prefetched Blocks

• DMA caches can exhibit an impressively high prefetching accuracy
• This is because I/O data has a very regular access pattern

Page 34:

Overview

• Background
• Nature of DMA Mechanism
• DMA Cache Scheme
• Research Methodology
• Evaluations
• Conclusions and Ongoing Work

Page 35:

Conclusions and Ongoing Work

The Nature of DMA
• There is a producer-consumer relationship between the CPU and the DMA engine
• Memory acts as a transient place for I/O data transferred between the processor and the I/O device

We propose a DMA cache scheme and discuss its design issues.

Experimental results show that the DMA cache can significantly improve I/O performance.

Ongoing Work
• The impact of multiprocessors and multiple DMA channels on the DMA cache
• In theory, a shared cache with an intelligent replacement policy can achieve the effect of the DMA cache scheme
• Godson-3 has integrated a dedicated cache management policy for I/O data

Page 36:

Thanks! Q&A?