CE431 Parallel Computer Architecture Spring 2019
![Page 1: CE431 Parallel Computer Architecture Spring 2019](https://reader033.fdocuments.in/reader033/viewer/2022061016/629a41dd29cebe063e6277e3/html5/thumbnails/1.jpg)
CE431 Parallel Computer Architecture
Spring 2019
DRAM Memory
Nikos Bellas
Electrical and Computer Engineering Department, University of Thessaly
ECE431 Parallel Computer Architecture
![Page 2: CE431 Parallel Computer Architecture Spring 2019](https://reader033.fdocuments.in/reader033/viewer/2022061016/629a41dd29cebe063e6277e3/html5/thumbnails/2.jpg)
Main Memory Basics
![Page 3: CE431 Parallel Computer Architecture Spring 2019](https://reader033.fdocuments.in/reader033/viewer/2022061016/629a41dd29cebe063e6277e3/html5/thumbnails/3.jpg)
Motivation
DRAM and the memory subsystem significantly impact the performance and cost of a system
Need to understand DRAM technologies
• to architect an appropriate memory subsystem for an application
• to utilize chosen DRAM efficiently
• to design a memory controller
![Page 4: CE431 Parallel Computer Architecture Spring 2019](https://reader033.fdocuments.in/reader033/viewer/2022061016/629a41dd29cebe063e6277e3/html5/thumbnails/4.jpg)
Importance of Main Memory
• The Performance Perspective
• The Energy Perspective
• The Reliability Perspective
![Page 5: CE431 Parallel Computer Architecture Spring 2019](https://reader033.fdocuments.in/reader033/viewer/2022061016/629a41dd29cebe063e6277e3/html5/thumbnails/5.jpg)
The Performance Perspective
• “It’s the Memory, Stupid!” (Richard Sites, MPR, 1996)
Mutlu+, “Runahead Execution: An Alternative to Very Large Instruction Windows for Out-of-Order Processors,” HPCA 2003.
![Page 6: CE431 Parallel Computer Architecture Spring 2019](https://reader033.fdocuments.in/reader033/viewer/2022061016/629a41dd29cebe063e6277e3/html5/thumbnails/6.jpg)
The Energy Perspective
Dally, HiPEAC 2015
A memory access consumes ~1000X the energy of a complex addition
![Page 7: CE431 Parallel Computer Architecture Spring 2019](https://reader033.fdocuments.in/reader033/viewer/2022061016/629a41dd29cebe063e6277e3/html5/thumbnails/7.jpg)
The Reliability Perspective
• Data from all of Facebook’s servers worldwide
• Meza+, “Revisiting Memory Errors in Large-Scale Production Data Centers,” DSN’15.
![Page 8: CE431 Parallel Computer Architecture Spring 2019](https://reader033.fdocuments.in/reader033/viewer/2022061016/629a41dd29cebe063e6277e3/html5/thumbnails/8.jpg)
The Main Memory System
• Main memory is a critical component of all computing systems: server, mobile, embedded, desktop, sensor
• Main memory system must scale (in size, technology, efficiency, cost, and management algorithms) to maintain performance growth and technology scaling benefits
Figure: processor and caches, main memory, storage (SSD/HDD).
![Page 9: CE431 Parallel Computer Architecture Spring 2019](https://reader033.fdocuments.in/reader033/viewer/2022061016/629a41dd29cebe063e6277e3/html5/thumbnails/9.jpg)
Main Memory in the System
Figure: a four-core chip (CORE 0 to CORE 3) with one L2 cache per core (L2 CACHE 0 to L2 CACHE 3), a SHARED L3 CACHE, the DRAM INTERFACE, the DRAM MEMORY CONTROLLER, and the DRAM BANKS.
![Page 10: CE431 Parallel Computer Architecture Spring 2019](https://reader033.fdocuments.in/reader033/viewer/2022061016/629a41dd29cebe063e6277e3/html5/thumbnails/10.jpg)
The Main Memory Chip/System Abstraction
Figure: a memory of 2^n x k bits, with an n-bit Address input, a k-bit Data port, CE (Chip Enable), and WE (Write Enable).
![Page 11: CE431 Parallel Computer Architecture Spring 2019](https://reader033.fdocuments.in/reader033/viewer/2022061016/629a41dd29cebe063e6277e3/html5/thumbnails/11.jpg)
Basic Functionality and Organization
• Read access sequence:
1. Decode row address & drive word-lines
2. Selected bits drive bit-lines
• Entire row read
3. Amplify row data
4. Decode column address & select subset of row
• Send to output
5. Precharge bit-lines
• For next access
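The read access sequence above can be sketched in a few lines of Python. This is only a toy model under stated assumptions: the class name `DramArray`, the array dimensions, and the use of `None` to represent precharged bit-lines are illustrative choices, not part of the slides.

```python
class DramArray:
    """Toy model of one DRAM array: rows x cols bits plus sense amplifiers."""

    def __init__(self, rows, cols):
        self.cells = [[0] * cols for _ in range(rows)]
        self.sense_amps = None  # None models precharged bit-lines (assumption)

    def read(self, row, col_start, width):
        # Steps 1-3: decode the row address, let the entire row drive the
        # bit-lines, and amplify (latch) it in the sense amplifiers.
        self.sense_amps = list(self.cells[row])
        # Step 4: decode the column address and select a subset of the row.
        data = self.sense_amps[col_start:col_start + width]
        # Step 5: precharge the bit-lines for the next access.
        self.sense_amps = None
        return data


array = DramArray(rows=4, cols=16)
array.cells[2] = [1, 0, 1, 1] + [0] * 12
print(array.read(row=2, col_start=0, width=4))  # [1, 0, 1, 1]
```

Note that the whole row is read even though only four bits are sent to the output, which is why later accesses to the same row can be served faster if the row is kept in the sense amplifiers.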
![Page 12: CE431 Parallel Computer Architecture Spring 2019](https://reader033.fdocuments.in/reader033/viewer/2022061016/629a41dd29cebe063e6277e3/html5/thumbnails/12.jpg)
The DRAM Storage Cell
• DRAM stores charge in a capacitor (charge-based memory)
– Capacitor must be large enough for reliable sensing
– Access transistor should be large enough for low leakage and high retention time
– There are 2^n x k of those capacitors
– Precharging puts Bit Lines (BL) to high voltage
![Page 13: CE431 Parallel Computer Architecture Spring 2019](https://reader033.fdocuments.in/reader033/viewer/2022061016/629a41dd29cebe063e6277e3/html5/thumbnails/13.jpg)
Main Memory Background
Performance of Main Memory:
Latency: Cache Miss Penalty
Access Time: time between a request and the arrival of the word
Cycle Time: minimum time between successive requests (CT > AT)
Bandwidth:
Main Memory is DRAM: Dynamic Random Access Memory
Dynamic since it needs to be refreshed periodically (every 64 ms, about 1% of the time)
Addresses divided into 2 halves (Memory as a 2D matrix):
RAS or Row Access Strobe
CAS or Column Access Strobe
Cache uses SRAM: Static Random Access Memory
No refresh (6 transistors/bit vs. 1 transistor)
Size: DRAM/SRAM 4-8, Cost/Cycle time: SRAM/DRAM 8-16
![Page 14: CE431 Parallel Computer Architecture Spring 2019](https://reader033.fdocuments.in/reader033/viewer/2022061016/629a41dd29cebe063e6277e3/html5/thumbnails/14.jpg)
DRAM Internal Organization
DRAM Types
![Page 15: CE431 Parallel Computer Architecture Spring 2019](https://reader033.fdocuments.in/reader033/viewer/2022061016/629a41dd29cebe063e6277e3/html5/thumbnails/15.jpg)
DRAM internal organization
![Page 16: CE431 Parallel Computer Architecture Spring 2019](https://reader033.fdocuments.in/reader033/viewer/2022061016/629a41dd29cebe063e6277e3/html5/thumbnails/16.jpg)
DRAM access
A cache miss triggers a cache line refill from the main memory.
The Memory Controller (MC) receives the request (along with potentially more requests from the same or other masters).
The memory access request consists of:
1. the physical address
2. the data in case of a memory write
![Page 17: CE431 Parallel Computer Architecture Spring 2019](https://reader033.fdocuments.in/reader033/viewer/2022061016/629a41dd29cebe063e6277e3/html5/thumbnails/17.jpg)
DRAM basics: Precharge (PRE) and Row Access (ACT)
The MC breaks the access into two parts:
Row Access:
1. The MC precharges the DRAM array (closing any open page). Any previously selected row is flushed from the sense amps.
2. It asserts the RAS signal to latch the Row Address into an internal latch.
3. The row decoder selects one row of bits that charges the sense amps (opens a row).
![Page 18: CE431 Parallel Computer Architecture Spring 2019](https://reader033.fdocuments.in/reader033/viewer/2022061016/629a41dd29cebe063e6277e3/html5/thumbnails/18.jpg)
Sense Amps and Column Decoding
Column Access:
4. It asserts the CAS signal to latch the Column Address into an internal latch. The CAS signal can also be used to latch the data in case of writes.
5. The column decoder selects the bit to be read out. The CAS signal acts as Output Enable (OE) to drive data out to the output buffers.
![Page 19: CE431 Parallel Computer Architecture Spring 2019](https://reader033.fdocuments.in/reader033/viewer/2022061016/629a41dd29cebe063e6277e3/html5/thumbnails/19.jpg)
Read Out (READ)
A new column access in the SAME row reduces access time and increases bandwidth
![Page 20: CE431 Parallel Computer Architecture Spring 2019](https://reader033.fdocuments.in/reader033/viewer/2022061016/629a41dd29cebe063e6277e3/html5/thumbnails/20.jpg)
Send back to CPU
The MC redirects the data to the bus to fill the cache
![Page 21: CE431 Parallel Computer Architecture Spring 2019](https://reader033.fdocuments.in/reader033/viewer/2022061016/629a41dd29cebe063e6277e3/html5/thumbnails/21.jpg)
DRAM Bank Operation
Figure: a DRAM bank as a 2D array of rows and columns, with a row decoder, a row buffer, and a column mux that drives the data output.
Access Address:
• (Row 0, Column 0): row buffer is empty, so Row 0 is loaded into the row buffer
• (Row 0, Column 1): HIT
• (Row 0, Column 85): HIT
• (Row 1, Column 0): CONFLICT! Row 1 must replace Row 0 in the row buffer
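The access sequence on this slide can be replayed with a minimal sketch that tracks which row is held in the row buffer. The function name `classify_accesses` and the outcome labels are illustrative assumptions; the trace is the one shown on the slide.

```python
def classify_accesses(accesses):
    """Label each (row, column) access to a single bank."""
    open_row = None  # None models an empty row buffer
    outcomes = []
    for row, _column in accesses:
        if open_row is None:
            outcomes.append("EMPTY")     # activate the row, then read the column
        elif open_row == row:
            outcomes.append("HIT")       # column access only
        else:
            outcomes.append("CONFLICT")  # precharge, then activate the new row
        open_row = row
    return outcomes


trace = [(0, 0), (0, 1), (0, 85), (1, 0)]
print(classify_accesses(trace))  # ['EMPTY', 'HIT', 'HIT', 'CONFLICT']
```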
![Page 22: CE431 Parallel Computer Architecture Spring 2019](https://reader033.fdocuments.in/reader033/viewer/2022061016/629a41dd29cebe063e6277e3/html5/thumbnails/22.jpg)
Asynchronous DRAM: Basic timing
![Page 23: CE431 Parallel Computer Architecture Spring 2019](https://reader033.fdocuments.in/reader033/viewer/2022061016/629a41dd29cebe063e6277e3/html5/thumbnails/23.jpg)
Asynchronous DRAM evolution: Fast Page Mode (FPM)
Read row (~1KB) once in the column latch, and reuse data.
Data in the same row are accessed more quickly.
Exploits spatial locality of memory accesses
![Page 24: CE431 Parallel Computer Architecture Spring 2019](https://reader033.fdocuments.in/reader033/viewer/2022061016/629a41dd29cebe063e6277e3/html5/thumbnails/24.jpg)
Asynchronous DRAM evolution: Burst Mode
Access multiple successive words after the requested word.
After initial latency penalty, get 1 word/cycle (e.g. 5-1-1-1)
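As a small worked example of the 5-1-1-1 notation, the sketch below adds up the cycles of a burst and compares them with the case where every word pays the initial latency. The second case is an assumption made only for comparison; the helper names are hypothetical.

```python
def burst_cycles(pattern):
    """Total cycles for a burst given per-word latencies such as '5-1-1-1'."""
    return sum(int(cycles) for cycles in pattern.split("-"))


def separate_access_cycles(pattern):
    """Cycles if every word paid the initial latency (assumed non-burst case)."""
    latencies = [int(cycles) for cycles in pattern.split("-")]
    return latencies[0] * len(latencies)


print(burst_cycles("5-1-1-1"))            # 8
print(separate_access_cycles("5-1-1-1"))  # 20
```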
![Page 25: CE431 Parallel Computer Architecture Spring 2019](https://reader033.fdocuments.in/reader033/viewer/2022061016/629a41dd29cebe063e6277e3/html5/thumbnails/25.jpg)
Synchronous DRAM (SDRAM)
Add a CLK to avoid synchronization overhead between asynchronous memory array and the bus.
ACT: Activate Row
![Page 26: CE431 Parallel Computer Architecture Spring 2019](https://reader033.fdocuments.in/reader033/viewer/2022061016/629a41dd29cebe063e6277e3/html5/thumbnails/26.jpg)
Double Data Rate SDRAM (DDR)
Transfer data on both positive and negative clock edges
• doubles peak pin data bandwidth
• 64 bits transferred at each edge (128 bits per cycle)
• the frequency of the memory array and bus is not affected
Commands still sent only with positive clock edge
• same pin command bandwidth
• during random accesses, command bandwidth may limit usable data bandwidth
![Page 27: CE431 Parallel Computer Architecture Spring 2019](https://reader033.fdocuments.in/reader033/viewer/2022061016/629a41dd29cebe063e6277e3/html5/thumbnails/27.jpg)
DDR2 - DDR3 - ... - DDRn
DDR2 is similar to DDR, with the key difference that the bus is clocked twice as fast as the memory array (in DDR the two clocks are equal)
• doubles PEAK pin data bandwidth
Extra buffer stages to sustain high clock frequency
• negatively impacts access latency
Mainly circuit optimizations and improved bus signaling
Similar for DDR3 (bus clocked four times as fast as the memory array)
• DDR: Memory Clock = Bus Clock = 133 MHz, BW = 266 Mtransfers/sec (DDR266)
• DDR: MC = BC = 200 MHz, BW = 400 Mtransfers/sec (DDR400)
• DDR2: MC = 266 MHz, BC = 533 MHz, BW = 1066 Mtransfers/sec (DDR2-1066)
• DDR3: MC = 200 MHz, BC = 800 MHz, BW = 1600 Mtransfers/sec (max)
• DDR4: MC = 400 MHz, BC = 1600 MHz, BW = 3200 Mtransfers/sec (max)
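The transfer rates listed above follow from two transfers per bus clock cycle. The sketch below recomputes them and, assuming the 64-bit data bus mentioned on the DDR slide, converts them to peak bandwidth in MB/s. The function names and the constant `BUS_WIDTH_BYTES` are illustrative.

```python
BUS_WIDTH_BYTES = 8  # assumes the 64-bit data bus from the DDR slide


def mega_transfers_per_second(bus_clock_mhz):
    """Double data rate: one transfer on each clock edge."""
    return 2 * bus_clock_mhz


def peak_bandwidth_mb_per_s(mega_transfers):
    """Peak pin bandwidth: every transfer moves one full bus width."""
    return mega_transfers * BUS_WIDTH_BYTES


for name, bus_clock_mhz in [("DDR400", 200), ("DDR3", 800), ("DDR4", 1600)]:
    rate = mega_transfers_per_second(bus_clock_mhz)
    print(name, rate, "MT/s,", peak_bandwidth_mb_per_s(rate), "MB/s peak")
```

These are peak figures; as noted on the DDR slide, command bandwidth may limit the usable data bandwidth during random accesses.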
![Page 28: CE431 Parallel Computer Architecture Spring 2019](https://reader033.fdocuments.in/reader033/viewer/2022061016/629a41dd29cebe063e6277e3/html5/thumbnails/28.jpg)
DRAM at the System Level
![Page 29: CE431 Parallel Computer Architecture Spring 2019](https://reader033.fdocuments.in/reader033/viewer/2022061016/629a41dd29cebe063e6277e3/html5/thumbnails/29.jpg)
SDRAM evolution: Multi-bank memories
All modern SDRAMs have multiple, independent banks.
SDRAM command scheme allows overlapped bank operations
• one bank may be activated and accessed
• while other banks are precharged
• more efficient use of pin bandwidth
Figure: this is a single DRAM chip.
![Page 30: CE431 Parallel Computer Architecture Spring 2019](https://reader033.fdocuments.in/reader033/viewer/2022061016/629a41dd29cebe063e6277e3/html5/thumbnails/30.jpg)
How do we read more than 1 bit?
One bit/array.
Read all arrays simultaneously to get a byte.
![Page 31: CE431 Parallel Computer Architecture Spring 2019](https://reader033.fdocuments.in/reader033/viewer/2022061016/629a41dd29cebe063e6277e3/html5/thumbnails/31.jpg)
Rank: Wider bus by bank interleaving
Data Bus 32 bits
Byte 0, 4, 8, ... | Byte 1, 5, 9, ... | Byte 2, 6, 10, ... | Byte 3, 7, 11, ...
Simultaneous access of ALL 4 chips fetches multiple bytes (e.g. for cache fill)
Rank: Multiple chips operated together to form a wide interface
Figure: DRAM chips that share the Command bus, each connected to its own slice of the Data bus.
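The byte interleaving across the four chips can be expressed as simple modular arithmetic. In this sketch the chip numbering (0 to 3, with chip 0 holding bytes 0, 4, 8, ...) and the function names are assumptions chosen to match the byte lists on the slide.

```python
NUM_CHIPS = 4  # four byte-wide chips form the 32-bit data bus


def locate_byte(byte_address):
    """Return (chip, offset inside that chip) for a byte address."""
    return byte_address % NUM_CHIPS, byte_address // NUM_CHIPS


def chips_for_word(word_address):
    """All chips are accessed together to fetch one 32-bit word."""
    first_byte = word_address * NUM_CHIPS
    return [locate_byte(first_byte + i) for i in range(NUM_CHIPS)]


print(locate_byte(9))     # (1, 2)
print(chips_for_word(2))  # [(0, 2), (1, 2), (2, 2), (3, 2)]
```

Every chip sees the same offset for a given word, which is why a single command can be broadcast to the whole rank.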
![Page 32: CE431 Parallel Computer Architecture Spring 2019](https://reader033.fdocuments.in/reader033/viewer/2022061016/629a41dd29cebe063e6277e3/html5/thumbnails/32.jpg)
Generalized Memory Structure
A DRAM module consists of one or more ranks.
Also known as DIMMs (dual inline memory modules).
This is what you plug into your motherboard.
Figure: a DIMM (dual inline memory module) and its chips.
![Page 33: CE431 Parallel Computer Architecture Spring 2019](https://reader033.fdocuments.in/reader033/viewer/2022061016/629a41dd29cebe063e6277e3/html5/thumbnails/33.jpg)
Generalized Memory Structure
DIMM (dual inline memory module)
![Page 34: CE431 Parallel Computer Architecture Spring 2019](https://reader033.fdocuments.in/reader033/viewer/2022061016/629a41dd29cebe063e6277e3/html5/thumbnails/34.jpg)
Example: Transferring a cache block
Figure: physical memory space (0x00 to 0xFFFF…F) with a 64B cache block between 0x00 and 0x40, mapped to Channel 0, DIMM 0, Rank 0.
![Page 35: CE431 Parallel Computer Architecture Spring 2019](https://reader033.fdocuments.in/reader033/viewer/2022061016/629a41dd29cebe063e6277e3/html5/thumbnails/35.jpg)
Example: Transferring a cache block
Figure: the 64B cache block in the physical memory space (0x00 to 0xFFFF…F) maps to Rank 0. Chip 0 drives data bits <0:7>, Chip 1 drives <8:15>, ..., Chip 7 drives <56:63> of Data bits <0:63>.
![Page 36: CE431 Parallel Computer Architecture Spring 2019](https://reader033.fdocuments.in/reader033/viewer/2022061016/629a41dd29cebe063e6277e3/html5/thumbnails/36.jpg)
Example: Transferring a cache block
Figure: Row 0, Col 0 is selected in Chip 0 to Chip 7 of Rank 0; each chip drives its 8 bits of Data <0:63>.
![Page 37: CE431 Parallel Computer Architecture Spring 2019](https://reader033.fdocuments.in/reader033/viewer/2022061016/629a41dd29cebe063e6277e3/html5/thumbnails/37.jpg)
Example: Transferring a cache block
Figure: Row 0, Col 0 is read from Chip 0 to Chip 7 of Rank 0, so the first 8B of the 64B cache block are transferred over Data <0:63>.
![Page 38: CE431 Parallel Computer Architecture Spring 2019](https://reader033.fdocuments.in/reader033/viewer/2022061016/629a41dd29cebe063e6277e3/html5/thumbnails/38.jpg)
Example: Transferring a cache block
Figure: with the first 8B transferred, Row 0, Col 1 is selected next in Chip 0 to Chip 7 of Rank 0.
![Page 39: CE431 Parallel Computer Architecture Spring 2019](https://reader033.fdocuments.in/reader033/viewer/2022061016/629a41dd29cebe063e6277e3/html5/thumbnails/39.jpg)
Example: Transferring a cache block
Figure: Row 0, Col 1 is read from Chip 0 to Chip 7 of Rank 0, transferring the next 8B of the 64B cache block over Data <0:63>.
![Page 40: CE431 Parallel Computer Architecture Spring 2019](https://reader033.fdocuments.in/reader033/viewer/2022061016/629a41dd29cebe063e6277e3/html5/thumbnails/40.jpg)
Example: Transferring a cache block
Figure: Chip 0 to Chip 7 of Rank 0 supply 8B per column access over Data <0:63>.
A 64B cache block takes 8 I/O cycles to transfer.
During the process, 8 columns are read sequentially.
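The count of 8 I/O cycles follows from the rank geometry in this example: 8 chips with 8 data bits each give 8B per transfer. The sketch below builds the transfer schedule under two assumptions that are not stated on the slides: the columns are read in order starting at column 0, and chip c supplies byte c of each 8B transfer.

```python
BLOCK_BYTES = 64
NUM_CHIPS = 8
BITS_PER_CHIP = 8


def transfer_schedule():
    """For each column access, list which byte of the block each chip supplies."""
    bytes_per_transfer = NUM_CHIPS * BITS_PER_CHIP // 8
    transfers = BLOCK_BYTES // bytes_per_transfer
    schedule = []
    for column in range(transfers):
        # Chip c drives data bits <8c:8c+7>, i.e. byte c of this transfer.
        first_byte = column * bytes_per_transfer
        schedule.append([first_byte + chip for chip in range(NUM_CHIPS)])
    return schedule


schedule = transfer_schedule()
print(len(schedule))  # 8
print(schedule[1])    # [8, 9, 10, 11, 12, 13, 14, 15]
```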
![Page 41: CE431 Parallel Computer Architecture Spring 2019](https://reader033.fdocuments.in/reader033/viewer/2022061016/629a41dd29cebe063e6277e3/html5/thumbnails/41.jpg)
Interaction with Virtual-to-Physical Mapping
• Operating System influences where an address maps to in DRAM
• Operating system can influence which bank/channel/rank a virtual page is mapped to.
VA: Virtual Page number (52 bits) | Page offset (12 bits)
PA: Physical Page Number (19 bits) | Page offset (12 bits)
PA: Row (14 bits) | Bank (3 bits) | Column (11 bits) | Byte in bus (3 bits)
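A short sketch can show why the operating system's choice of physical page affects the DRAM bank. It assumes the field order Row, Bank, Column, Byte in bus from most to least significant bit, which is a reading of the figure rather than something the slide states explicitly. The names `decode` and `bank_of_frame` are hypothetical.

```python
# Field widths from the slide, listed from least significant bit upward.
# The ordering is an assumption about the figure.
FIELDS_LSB_FIRST = [("byte", 3), ("column", 11), ("bank", 3), ("row", 14)]
PAGE_OFFSET_BITS = 12


def decode(physical_address):
    """Split a 31-bit physical address into its DRAM fields."""
    fields = {}
    for name, width in FIELDS_LSB_FIRST:
        fields[name] = physical_address & ((1 << width) - 1)
        physical_address >>= width
    return fields


def bank_of_frame(physical_page_number):
    """Bank selected by a physical page; the page offset cannot change it."""
    return decode(physical_page_number << PAGE_OFFSET_BITS)["bank"]


address = (5 << 17) | (3 << 14) | (7 << 3) | 2
print(decode(address))  # {'byte': 2, 'column': 7, 'bank': 3, 'row': 5}
print([bank_of_frame(ppn) for ppn in range(0, 32, 4)])  # [0, 1, 2, 3, 4, 5, 6, 7]
```

Under this layout the bank bits sit entirely above the 12-bit page offset, so they are determined by the physical page number that the operating system assigns.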
![Page 42: CE431 Parallel Computer Architecture Spring 2019](https://reader033.fdocuments.in/reader033/viewer/2022061016/629a41dd29cebe063e6277e3/html5/thumbnails/42.jpg)
Basics of Memory Controllers
![Page 43: CE431 Parallel Computer Architecture Spring 2019](https://reader033.fdocuments.in/reader033/viewer/2022061016/629a41dd29cebe063e6277e3/html5/thumbnails/43.jpg)
Memory Access Scheduling: Motivation
Memory bandwidth is a big problem, especially for applications that do not cache well
– Multimedia or streaming applications have limited use of the cache due to poor temporal locality
– Data are read in, processed and thrown away
– DSP or multimedia processors are often limited by poor memory bandwidth
– Real time requirements (e.g. 30 fps video compression) are an extra bottleneck
Memory Wall
– CPU speed improvement (1.2x – 1.52x per year)
– DRAM latency improvement (1.07x per year)
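The memory wall comes from compounding these yearly rates. The sketch below computes how the relative gap grows, using the upper CPU rate (1.52x) and the DRAM rate (1.07x) from the slide; the function name and the chosen time spans are illustrative.

```python
def performance_gap(cpu_rate, dram_rate, years):
    """Ratio of CPU improvement to DRAM latency improvement after some years."""
    return (cpu_rate / dram_rate) ** years


for years in (1, 5, 10):
    print(years, "years:", round(performance_gap(1.52, 1.07, years), 1))
```

With these rates the gap grows by roughly 42% per year, which is why it reaches more than an order of magnitude within a decade.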
![Page 44: CE431 Parallel Computer Architecture Spring 2019](https://reader033.fdocuments.in/reader033/viewer/2022061016/629a41dd29cebe063e6277e3/html5/thumbnails/44.jpg)
Memory Access Scheduling
Bandwidth and latency of a memory system are STRONGLY dependent on the order of memory accesses
Modern, multi-bank DRAMs are 3-D structures (Banks, Rows, Columns)
– Access to different columns within a row is an order of magnitude faster than accesses to different rows
– Simultaneous row reads in different banks
Memory scheduling uses the Memory Controller to dynamically reorder access requests to the 3-D memory structure
![Page 45: CE431 Parallel Computer Architecture Spring 2019](https://reader033.fdocuments.in/reader033/viewer/2022061016/629a41dd29cebe063e6277e3/html5/thumbnails/45.jpg)
Memory Access Scheduling
Internal DRAM chip architecture
FSM for bank operation
• Each bank has its own FSM
• IDLE state: the bank is precharged
• ACTIVE state: the bank is being read/written
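The two-state bank FSM can be sketched as a small class. The method names, the error handling, and the number of banks are assumptions for illustration; only the IDLE and ACTIVE states come from the slide.

```python
class BankFSM:
    """Per-bank state machine with the two states named on the slide."""

    IDLE, ACTIVE = "IDLE", "ACTIVE"

    def __init__(self):
        self.state = self.IDLE  # the bank starts precharged
        self.open_row = None

    def activate(self, row):
        if self.state != self.IDLE:
            raise RuntimeError("bank must be precharged before activation")
        self.state, self.open_row = self.ACTIVE, row

    def column_access(self, row):
        if self.state != self.ACTIVE or self.open_row != row:
            raise RuntimeError("row is not open in this bank")

    def precharge(self):
        self.state, self.open_row = self.IDLE, None


banks = [BankFSM() for _ in range(4)]  # each bank has its own FSM
banks[0].activate(row=3)
banks[0].column_access(row=3)
print(banks[0].state, banks[1].state)  # ACTIVE IDLE
banks[0].precharge()
print(banks[0].state)                  # IDLE
```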
![Page 46: CE431 Parallel Computer Architecture Spring 2019](https://reader033.fdocuments.in/reader033/viewer/2022061016/629a41dd29cebe063e6277e3/html5/thumbnails/46.jpg)
Memory Access Scheduling
• Given a set of pending memory accesses, a scheduler determines what actions to take next.
• One precharge arbiter per bank, one row arbiter per bank, and a single column arbiter for the common data bus.
• At each cycle, each arbiter decides which request to serve next.
• Arbitration priority determines the exact sequence of accesses
• Split transactions are allowed to break a request and implement out-of-order request services
![Page 47: CE431 Parallel Computer Architecture Spring 2019](https://reader033.fdocuments.in/reader033/viewer/2022061016/629a41dd29cebe063e6277e3/html5/thumbnails/47.jpg)
Memory Access Scheduling
DRAM operations
• P: bank precharge (3 cycle occupancy)
• A: row activation (3 cycle occupancy)
• C: column access (1 cycle occupancy)
Figure: resource utilization for a sequence of (Bank, Row, Column) accesses.
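Using the occupancies above, a rough sketch can show how access order changes the total cycle count. This model is deliberately simplified and rests on assumptions: operations never overlap, a precharged bank needs A + C, a row hit needs C, and a row conflict needs P + A + C. The request list is hypothetical, not the one in the figure.

```python
P_CYCLES, A_CYCLES, C_CYCLES = 3, 3, 1  # occupancies from the slide


def serial_cycles(accesses):
    """Cycles to serve (bank, row, column) accesses one by one, no overlap."""
    open_rows = {}  # bank -> open row; a missing bank is precharged
    total = 0
    for bank, row, _column in accesses:
        if bank not in open_rows:
            total += A_CYCLES + C_CYCLES             # activate, then access
        elif open_rows[bank] == row:
            total += C_CYCLES                        # row hit
        else:
            total += P_CYCLES + A_CYCLES + C_CYCLES  # row conflict
        open_rows[bank] = row
    return total


def group_by_row(accesses):
    """Reorder so accesses to the same bank and row are adjacent (stable sort)."""
    return sorted(accesses, key=lambda access: (access[0], access[1]))


requests = [(0, 0, 0), (0, 1, 0), (0, 0, 1), (0, 1, 1)]
print(serial_cycles(requests))                # 25
print(serial_cycles(group_by_row(requests)))  # 13
```

A real scheduler would also overlap precharge and activation in one bank with column accesses in another, which this sketch does not model.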
![Page 48: CE431 Parallel Computer Architecture Spring 2019](https://reader033.fdocuments.in/reader033/viewer/2022061016/629a41dd29cebe063e6277e3/html5/thumbnails/48.jpg)
Memory Access Scheduling
Resource utilization
Example arbitration policy: the row with the fewest pending column accesses is selected next. This minimizes the time that rows with little demand remain active, allowing other rows in the same bank to make progress sooner.
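The example policy can be written as a one-line selection over the pending requests of a bank. The data layout (a dictionary from row to its pending column accesses) and the tie-break on the lowest row number are assumptions added for this sketch.

```python
def select_row(pending):
    """Pick the row with the fewest pending column accesses.

    `pending` maps a row number to the list of columns waiting on that row.
    Ties go to the lowest row number (an assumption, not from the slide).
    """
    candidates = {row: columns for row, columns in pending.items() if columns}
    if not candidates:
        return None
    return min(candidates, key=lambda row: (len(candidates[row]), row))


pending = {7: [0, 1, 2], 3: [5], 9: [4, 6]}
print(select_row(pending))  # 3
```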