Manil Dev Gomony
An introduction to SDRAM and memory controllers
5kk73
Slide contributions: Sven Goossens, Benny Akesson
2
Outline
► Part 1: DRAM and controller basics
– DRAM architecture and operation
– Timing constraints
– DRAM controller
► Part 2: DRAMs in embedded systems
– Challenges in sharing DRAMs
– Real-time guarantees with DRAMs
– Future DRAM architectures and controllers
3
Memory device
► “A device that preserves information for retrieval” - Web definition
4
Semiconductor memories
► “Semiconductor memory is an electronic data storage device, often used as computer memory, implemented on a semiconductor-based integrated circuit” - Wikipedia definition
► The main characteristics of semiconductor memory are low cost, high density (bits per chip), and ease of use
5
Semiconductor memory types
► RAM (Random Access Memory)
– DRAM (Dynamic RAM)
• Synchronous DRAM (SDRAM)
– SRAM (Static RAM)
► ROM (Read Only Memory)
– Mask ROM, Programmable ROM (PROM), EPROM (Erasable PROM), UV-EPROM (Ultra Violet EPROM)
► NVRAM (Non-Volatile RAM) or Flash memory
6
Memory hierarchy
[Figure: memory hierarchy pyramid – registers, L1 cache, L2 cache, off-chip memory, secondary memory (hard disk). Going down the hierarchy, access speed decreases while size (capacity) and distance from the processor increase.]
7
Memory hierarchy
Module            Memory type used  Access time    Capacity  Managed by
Registers         SRAM              1 cycle        ~500 B    Software/compiler
L1 Cache          SRAM              1-3 cycles     ~64 KB    Hardware
L2 Cache          SRAM              5-10 cycles    1-10 MB   Hardware
Off-chip memory   DRAM              ~100 cycles    ~10 GB    Software/OS
Secondary memory  Disk drive        ~1000 cycles   ~1 TB     Software/OS
Credits: J.Leverich, Stanford
8
SRAM vs DRAM
Static Random Access Memory (SRAM)
► Bitlines driven by transistors
– Fast (~10x)
► 6 transistors per bit cell
– Large (~6-10x)

Dynamic Random Access Memory (DRAM)
► 1 transistor and 1 capacitor per bit cell
► A bit is stored as charge on the capacitor
► Bit cell loses charge over time (read operation and circuit leakage)
– Must periodically refresh
– Hence the name Dynamic RAM

Credits: J.Leverich, Stanford
9
SRAM vs DRAM: Summary
► SRAM is preferable for register files and L1/L2 caches
– Fast access
– No refreshes
– Simpler manufacturing (compatible with logic process)
– Lower density (6 transistors per cell)
– Higher cost
► DRAM is preferable for stand-alone memory chips
– Much higher capacity
– Higher density
– Lower cost
– DRAM is the main focus in this lecture!
Credits: J.Leverich, Stanford
10
DRAM: Internal architecture
► Bit cells are arranged to form a memory array
► Multiple arrays are organized as different banks
– The typical number of banks is 4, 8, or 16
► Sense amplifiers raise the voltage level on the bitlines to read the data out
Credits: J.Leverich, Stanford
[Figure: DRAM chip with four banks, each with its own row buffer. Each bank contains a memory array, a row decoder driven by the MS bits of the address register, a column decoder driven by the LS bits, and sense amplifiers (row buffer) on the data path.]
11
DRAM: Read access sequence
► Decode row address & drive word-lines
► Selected bits drive bit-lines
– Entire row read
► Amplify row data
► Decode column address & select subset of row
► Send to output
► Precharge bit-lines for next access
[Figure: the bank architecture from the previous slide, with the read sequence annotated on the row decoder, sense amplifiers, and column decoder.]
Credits: J.Leverich, Stanford
12
DRAM: Memory access protocol
► To reduce pin count, row and column share the same address pins
– RAS = Row Address Strobe
– CAS = Column Address Strobe
► Data is accessed by issuing memory commands
► 5 basic commands
– ACTIVATE
– READ
– WRITE
– PRECHARGE
– REFRESH
[Figure: a memory array of 2^n rows x 2^m columns; the shared address pins deliver n row bits latched by RAS to the row decoder and m column bits latched by CAS to the column decoder; data is read out through the sense amplifiers.]
Credits: J.Leverich, Stanford
13
DRAM: Basic operation
[Figure: a single bank with row decoder, column decoder, and row buffer; Row 0 is first loaded into the row buffer, later replaced by Row 1.]

Addresses             Commands
(Row 0, Column 0)     ACTIVATE Row 0, READ Column 0
(Row 0, Column 1)     READ Column 1 (row buffer HIT!)
(Row 0, Column 10)    READ Column 10 (row buffer HIT!)
(Row 1, Column 0)     PRECHARGE Row 0, ACTIVATE Row 1, READ Column 0 (row buffer MISS!)

Credits: J.Leverich, Stanford
14
DRAM: Basic operation (Summary)
► Access to an “open row”
– No need to issue ACTIVATE command
– READ/WRITE will access the row buffer
► Access to a “closed row”
– If another row is already active, issue PRECHARGE first
– Issue ACTIVATE to open a new row
– READ/WRITE will access the row buffer
– Optional: PRECHARGE after READ/WRITEs finished
• If PRECHARGE is issued → closed-page policy
• If not → open-page policy (illustrated in the sketch below)
Credits: J.Leverich, Stanford
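To make the open- and closed-row cases concrete, here is a minimal sketch (illustrative names and structure, not from the slides) of a controller applying an open-page policy to a single bank; it reproduces the command sequence shown above:

```python
# Hypothetical sketch: generating DRAM commands for a stream of
# (row, column) read requests to one bank under an open-page policy.

def generate_commands(requests):
    """Translate (row, column) requests into DRAM commands for one bank."""
    commands = []
    open_row = None  # no row is active initially
    for row, col in requests:
        if row != open_row:                      # row buffer miss
            if open_row is not None:
                commands.append("PRECHARGE")     # close the old row first
            commands.append(f"ACTIVATE row {row}")  # open the new row
            open_row = row
        commands.append(f"READ column {col}")    # row buffer hit path
    return commands

# Reproduces the sequence from the previous slide:
print(generate_commands([(0, 0), (0, 1), (0, 10), (1, 0)]))
```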
15
DRAM: Burst access
► Each READ/WRITE command can transfer multiple words (8 in DDR3)
► Observe the number of words transferred in a single clock cycle
– Double Data Rate (DDR)
Credits: J.Leverich, Stanford
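As a quick worked calculation (assuming a x16 interface): DDR transfers two words per clock cycle, so a burst of 8 words occupies 8 / 2 = 4 clock cycles on the data bus and moves 8 × 16 bits = 16 bytes.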
16
DRAM: Banks
► DRAM chips can consist of multiple banks
– Address = (Bank x, Row y, Column z)
► Banks operate independently, but share command, address and data pins
– Each bank can have a different row active
– Can overlap ACTIVATE and PRECHARGE latencies! (i.e., READ to bank 0 while ACTIVATING bank 1) → bank-level parallelism
[Figure: Bank 0 with Row 0 in its row buffer and Bank 1 with Row 1 in its row buffer; each bank holds a different active row.]
Credits: J.Leverich, Stanford
17
DRAM: Bank-level parallelism
► Enable DRAM access from different banks in parallel
– Reduces memory access latency and improves efficiency!
Credits: J.Leverich, Stanford
18
2Gb x8 DDR3 Chip [Micron]
► Observe the bank organization
Credits: J.Leverich, Stanford
19
2Gb x8 DDR3 Chip [Micron]
► Observe the row width, the bi-directional data bus, and the 64-to-8 data-path
Credits: J.Leverich, Stanford
20
DDR3 SDRAM: Current standard
► Introduced in 2007
► SDRAM = Synchronous DRAM (clocked)
– DDR = Double Data Rate
• Data transferred on both clock edges
– 400 MHz clock = 800 MT/s
– x4, x8, x16 datapath widths
– Minimum burst length of 8
– 8 banks
– 1Gb, 2Gb, 4Gb capacity
21
DRAM: Timing Constraints
– tRCD = Row-to-Column command delay
• Time taken by the charge stored in the capacitor cells to reach the sense amps
– tRAS = Time between RAS and data restoration in the DRAM array (minimum time a row must be open)
– tRP = Time to precharge the DRAM array
► Memory controller must respect the physical device characteristics!
[Timing diagram: the CMD line issues ACT, RD, and PRE separated by NOPs; tRCD elapses between ACT and RD, tRL between RD and the data words D1..Dn, tRAS between ACT and PRE, and tRP after PRE.]
22
DRAM: Timing Constraints
► There are a bunch of other timing constraints…
– tCCD = Time between column commands
– tWTR = Write-to-read delay (bus turnaround time)
– tCAS = Time between column command and data out
– tWR = Time from end of last write to PRECHARGE
– tFAW = Four-ACTIVATE window (limits current surge)
• Maximum number of ACTIVATEs in this window is limited to four
– tRC = tRAS + tRP = Row “cycle” time
• Minimum time between accesses to different rows
► Timing constraints make performance analysis and memory controller design difficult! (see the sketch below)
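A minimal sketch (constraint values and structure are illustrative only, not a real device model) of how a command generator can track earlier commands per bank and delay each new command until the relevant constraints are satisfied:

```python
# Hypothetical sketch: delaying commands until timing constraints are met.
# Constraint values below are illustrative, in clock cycles.

TIMING = {
    ("ACTIVATE", "READ"): 5,        # tRCD: row-to-column command delay
    ("ACTIVATE", "PRECHARGE"): 15,  # tRAS: minimum time a row stays open
    ("PRECHARGE", "ACTIVATE"): 5,   # tRP: precharge time
}

class Bank:
    def __init__(self):
        self.last_issue = {}  # command -> cycle it was last issued

    def earliest_issue(self, cmd, now):
        """Earliest cycle at which `cmd` may be issued to this bank."""
        earliest = now
        for (prev, nxt), delay in TIMING.items():
            if nxt == cmd and prev in self.last_issue:
                earliest = max(earliest, self.last_issue[prev] + delay)
        return earliest

    def issue(self, cmd, now):
        t = self.earliest_issue(cmd, now)
        self.last_issue[cmd] = t
        return t  # cycle at which the command actually goes out

bank = Bank()
print(bank.issue("ACTIVATE", 0))   # 0
print(bank.issue("READ", 1))       # 5  = tRCD after the ACTIVATE
print(bank.issue("PRECHARGE", 6))  # 15 = tRAS after the ACTIVATE
```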
23
DRAM controller
► The request scheduler decides which memory request is selected next
► The memory map translates the logical address into a physical address
• Logical address = incoming address
• Physical address = (Bank, Row, Column)
► The command generator issues memory commands respecting the physical device characteristics
[Figure: DRAM controller with a front-end (request scheduler, memory map) and a back-end (command generator) that drives the address and command pins of the DRAM.]
24
Request scheduler
► Many algorithms exist to determine how to schedule memory requests (a sketch follows below)
– Prefer requests targeting open rows
• Increases the number of row buffer hits
– Prefer read after read and write after write
• Minimizes bus turnaround
– Always prefer reads, since reads are blocking and writes are often posted
• Reduces processor stall cycles
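A minimal sketch (the ranking and data layout are illustrative assumptions, not a specific published scheduler) of combining these preferences into one scheduling decision:

```python
# Hypothetical sketch: ranking pending requests with the preferences above.
# `open_rows` maps bank -> currently open row in that bank.

def pick_next(pending, open_rows, last_was_read):
    """Pick the next request from `pending` (list of dicts with
    'is_read', 'bank', 'row' keys) using simple scheduling preferences."""
    def score(req):
        row_hit = open_rows.get(req["bank"]) == req["row"]
        same_dir = req["is_read"] == last_was_read  # avoids bus turnaround
        # Lexicographic preference: row hits first, then reads, then direction.
        return (row_hit, req["is_read"], same_dir)
    return max(pending, key=score)

pending = [
    {"is_read": False, "bank": 0, "row": 3},
    {"is_read": True,  "bank": 1, "row": 7},   # a read hitting an open row
    {"is_read": True,  "bank": 0, "row": 5},
]
print(pick_next(pending, open_rows={1: 7}, last_was_read=True))
```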
25
Memory map
► The memory map decodes a logical address into a physical address
– Physical address is (bank, row, column)
– Decoding is done by slicing the bits in the logical address
► Several memory mapping schemes exist
– Continuous, bank-interleaved
[Figure: the memory map translating logical address 0x10FF00 into physical address (bank 2, row 510, column 128).]
26
Continuous memory map
► Map sequential addresses to columns in a row
► Switch bank when all columns in the row are visited
► Switch row when all banks are visited
27
Bank-interleaved memory map
► Bank-interleaved memory map
– Maps bursts to different banks in an interleaving fashion
– The active row in a bank is not changed until all columns are visited
28
Memory map generalization
► Continuous and bank-interleaved memory maps are just 2 possible memory mapping schemes
– In the most general case, an arbitrary set of bits out of the logical address can be used for the row, column and bank address, respectively
Example memory: 16-bit DDR3-1600, 64 MB, 8 banks, 8K rows/bank, 1024 columns/row, 16 bits/column

Example memory map (1 burst per bank, 2-bank interleaving, 8 words per burst), bit 26 down to bit 0:

Logical address: RRR RRRR RRRR RRBB CCCC CCCB CCCW
– R = row bits, B = bank bits, C = column bits, W = byte within a 16-bit word
– The low CCCW bits span one 8-word burst (the burst size), the B bit above them interleaves consecutive bursts over 2 banks, and the BB bits are the bank offset

Can be done in different ways – choice affects memory efficiency!
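A minimal sketch (bit positions follow the example layout above; the function name and printout are illustrative) of decoding such a memory map by bit slicing:

```python
# Hypothetical sketch: decoding the RRR RRRR RRRR RRBB CCCC CCCB CCCW layout
# (bit 26 down to bit 0) into a (bank, row, column) physical address.

def decode(addr):
    """Split a 27-bit logical address into (bank, row, column)."""
    col_low   = (addr >> 1) & 0x7      # bits 3:1   low column bits (burst)
    bank_low  = (addr >> 4) & 0x1      # bit 4      interleaving bank bit
    col_high  = (addr >> 5) & 0x7F     # bits 11:5  high column bits
    bank_high = (addr >> 12) & 0x3     # bits 13:12 remaining bank bits
    row       = (addr >> 14) & 0x1FFF  # bits 26:14 row bits
    bank   = (bank_high << 1) | bank_low
    column = (col_high << 3) | col_low
    return bank, row, column

# Consecutive 16-byte bursts alternate between two banks:
for a in (0x00, 0x10, 0x20):
    print(hex(a), decode(a))
```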
29
Command generator
► Decides the selection of memory requests
► Generates SDRAM commands without violating timing constraints
30
Command generator
► Different page policies determine which command to schedule
– Close-page policy: close rows as soon as possible to activate a new one faster, i.e., no time is wasted precharging the open row of the previous request
– Open-page policy: keep rows open as long as possible to benefit from locality, i.e., assuming the next request will target the same open row
31
Open page or Close page?
[Figure: the same single-bank example as before; Row 0 is first loaded into the row buffer, later replaced by Row 1.]

Addresses             Commands
(Row 0, Column 0)     ACTIVATE Row 0, READ Column 0
(Row 0, Column 1)     READ Column 1 (row buffer HIT!)
(Row 0, Column 10)    READ Column 10 (row buffer HIT!)
(Row 1, Column 0)     PRECHARGE Row 0, ACTIVATE Row 1, READ Column 0 (row buffer MISS!)

Credits: J.Leverich, Stanford
32
A modern DRAM controller [Altera]
Image: Altera
33
Conclusions (Part 1)
► SDRAM is used as off-chip high-volume storage
– Cheaper, but slower than SRAM
► DRAM timing constraints make it hard to design a memory controller
► The selection of the memory map and the command/request scheduling algorithms impacts memory access time and/or efficiency
34
Outline
► Part 1: DRAM and controller basics
– DRAM architecture and operation
– Timing constraints
– DRAM controller
► Part 2: DRAMs in embedded systems
– Challenges in sharing DRAMs
– Real-time guarantees with DRAMs
– Future DRAM architectures and controllers
35
Trends in embedded systems
► Embedded systems get increasingly complex
– Increasingly complex applications (more functionality)
– Growing number of applications integrated in a device
– Requires increased system performance without increasing power
► The case of a generic car manufacturer
– Typical number of ECUs in a car in 2000: ~20
– Number of ECUs in an Audi A8 Sedan: over 80
36
System-on-Chip (SoC)
► The resulting complex contemporary platforms are heterogeneous multi-processor systems
– Resources in the system are shared to reduce cost
37
SoC: Video and audio processing system
► DRAM is typically used as shared main memory for cost reasons
[Figure: SoC with Host CPU, Video Engine, Audio Processor, GPU, DMA Controller, Input Processor, and LCD Controller sharing a DRAM through an interconnect and a memory controller.]
A. B. Soares et al., “Development of a SoC for Digital Television Set-Top Box: Architecture and System Integration Issues”, International Journal of Reconfigurable Computing, Volume 2013
38
Set-top box architecture [Philips]
39
DRAM controller architecture
► The arbiter grants memory access to one of the memory clients at a time
– Examples: Round-Robin, Time Division Multiplexing (TDM), and priority-based arbiters (sketched below)
[Figure: n memory clients connected through a bus with an arbiter to the DRAM controller (memory map, command generator) and the DRAM.]
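Two of the mentioned arbiters in a minimal sketch (illustrative interfaces, not a production arbiter):

```python
# Hypothetical sketch: two simple arbiters that grant access to one client
# per arbitration round. `requesting` is the set of client ids that currently
# have a pending request.

def round_robin(requesting, n_clients, last_granted):
    """Grant the next requesting client after `last_granted`, wrapping around."""
    for i in range(1, n_clients + 1):
        candidate = (last_granted + i) % n_clients
        if candidate in requesting:
            return candidate
    return None  # no client is requesting

def tdm(requesting, slot_table, cycle):
    """Grant the client that owns the current slot (fixed schedule)."""
    owner = slot_table[cycle % len(slot_table)]
    return owner if owner in requesting else None  # slot may go unused

print(round_robin({0, 2}, n_clients=4, last_granted=0))  # -> 2
print(tdm({0, 2}, slot_table=[0, 1, 2, 3], cycle=6))     # -> 2
```

TDM gives each client fixed slots and hence a fixed worst-case waiting time, while Round-Robin redistributes unused slots among the requesting clients.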
40
DRAM controller for real-time systems
► Clients in real-time systems have requirements on latency/bandwidth
– A fixed set of memory access parameters (burst size, page policy, etc.) in the back-end bounds the transaction execution time
– Predictable arbiters, such as TDM with fixed time slots or Round-Robin, bound the response time
[Figure: clients access the DRAM through an interconnect with a predictable arbiter, which bounds the response time, and a back-end, which bounds the execution time.]
B. Akesson et al., “Predator: A Predictable SDRAM Memory Controller”, CODES+ISSS, 2007
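To see why such arbiters give real-time guarantees, a standard back-of-the-envelope bound (not spelled out on the slide): under TDM with n slots of fixed duration S, a client owning one slot waits at most (n - 1) × S before its slot recurs, so its worst-case response time is bounded by (n - 1) × S plus its bounded execution time.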
41
DRAMs in the market
Family   Generation  Datapath width (bits)  Frequency range (MHz)
DDR      DDR         16                     100-200
         DDR2        16                     200-400
         DDR3        16                     400-1066
         DDR4        16                     800-1600
LPDDR    LPDDR       16 and 32              133-208
         LPDDR2      16 and 32              333-533
         LPDDR3      16 and 32              667-800
WIDE IO  SDR         128                    200-266
► Observe the increase in operating frequency with every generation
42
DRAMs: Bandwidth vs clock frequency
► WIDE IO gives much higher bandwidth at a lower frequency
– Low power consumption
[Plot: peak bandwidth (GB/s, 0-18) vs. maximum operating frequency (MHz, 0-1800) for DDR, DDR2, DDR3, DDR4, LPDDR, LPDDR2, LPDDR3, and WIDE IO SDR; WIDE IO SDR reaches the highest peak bandwidth at the lowest operating frequency.]
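As a worked example behind the plot (figures rounded): peak bandwidth = datapath width × transfer rate. A x16 DDR3 interface clocked at 800 MHz transfers 2 bytes × 800 MHz × 2 (DDR) = 3.2 GB/s, while a single 128-bit WIDE IO SDR channel at 266 MHz transfers 16 bytes × 266 MHz ≈ 4.3 GB/s; with all 4 channels, WIDE IO reaches ≈ 17 GB/s at a third of the clock frequency.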
43
Multi-channel DRAM: WIDE IO
► Bandwidth demands of future embedded systems: > 10 GB/s
– Memory power consumption scales up with the memory operating frequency → “Go parallel”
► Multi-channel memories
– Each channel is an independent memory module with a dedicated data and control path
– WIDE IO DRAM (4 channels)
[Figure: a WIDE IO DRAM with four independent channels, each with its own 128-bit IO interface.]
44
Multi-channel DRAM controller
► The Atomizer chops the incoming requests into a number of service units
► The Channel Selector (CS) routes the service units to the different memory channels according to the configuration in the Sequence Generators (a sketch follows below)
[Figure: two memory clients, each with an Atomizer and a Sequence Generator; Channel Selectors (CS) route the service units to two DRAM controllers (arbiter + back-end), one per memory channel.]
M. D. Gomony et al., “Architecture and Optimal Configuration of a Real-Time Multi-Channel Memory Controller”, DATE, 2012
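A minimal sketch (the service-unit size, interfaces, and routing sequence are assumptions for illustration, not taken from the cited paper) of atomizing a request and routing its service units over the channels:

```python
# Hypothetical sketch: an atomizer chops a request into fixed-size service
# units, and a channel selector routes them following a programmed sequence.

from itertools import cycle

SERVICE_UNIT = 64  # bytes per service unit (illustrative)

def atomize(addr, size):
    """Chop a request of `size` bytes at `addr` into service units."""
    return [(addr + off, min(SERVICE_UNIT, size - off))
            for off in range(0, size, SERVICE_UNIT)]

def route(units, sequence):
    """Assign each service unit to a channel from the sequence generator."""
    gen = cycle(sequence)  # e.g. alternate between channels 0 and 1
    return [(next(gen), unit) for unit in units]

units = atomize(addr=0x1000, size=200)
print(route(units, sequence=[0, 1]))
```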
45
Multi-channel DRAM controller
► Multi-channel memories allow memory requests to be interleaved across multiple memory channels
– Reduces access latency
[Figure: a single memory client whose atomized request is interleaved by the Channel Selector (CS) across both DRAM controllers and memory channels in parallel.]
46
Wide IO memory controller [Cadence]
Image: Cadence
47
Future DRAM: HMC
► Hybrid Memory Cube (HMC)
– 16 memory channels
► What does the memory controller for an HMC look like?
Image: Micron, HMC
48
Conclusions (part 2)
► DRAMs are shared in multi-processor SoCs to reduce cost and to enable communication between the processing elements
► Sharing DRAMs between multiple memory clients can be done using different arbitration algorithms
► Predictable arbitration and back-ends provide real-time guarantees on latency and bandwidth to real-time clients
► Multi-channel DRAMs allow a memory request to be interleaved across memory channels
50
References
► B. Jacob et al., Memory Systems: Cache, DRAM, Disk, Morgan Kaufmann, 2007
► B. Akesson et al., “Predator: A Predictable SDRAM Memory Controller”, CODES+ISSS, 2007
► M. D. Gomony et al., “Architecture and Optimal Configuration of a Real-Time Multi-Channel Memory Controller”, DATE, 2012
► http://hybridmemorycube.org/