HC18.220.S2T2.The Opteron CMP NorthBridge Architecture ...
The AMD Opteron™ CMP NorthBridge Architecture: Now and in the Future
Pat Conway & Bill Hughes
August, 2006
21 August 2006 The Opteron CMP NorthBridge Architecture, Now and in the Future2
AMD Opteron™ – The Industry’s First Native Dual-Core 64-bit x86 Processor
Integration:
• Two 64-bit CPU cores
• 2MB L2 cache
• On-chip Router & Memory Controller
Bandwidth:
• Dual channel DDR (128-bit) memory bus
• 3 HyperTransport™ (HT) links (16-bit each x 2 GT/sec x 2)
Usability and Scalability:
• Socket compatible: Platform and TDP!
• Glueless SMP up to 8 sockets
• Memory capacity & BW scale w/ CPUs
Power Efficiency:
• AMD PowerNow!™ Technology with optimized power management
• Industry-leading system level power efficiency
A Clean Break with the Past
Legacy x86 Architecture
• 20-year old traditional front-side bus (FSB) architecture
• CPUs, Memory, I/O all share a bus
• Major bottleneck to performance
• Faster CPUs or more cores ≠ performance
AMD64’s Direct Connect Architecture
• Industry-standard technology
• Direct Connect Architecture reduces FSB bottlenecks
• HyperTransport™ interconnect offers scalable high bandwidth and low latency
• 4 memory controllers – increases memory capacity and bandwidth
[Figure: a legacy system (MCP processor packages, Memory Controller Hubs, XMB memory buffers, PCIe™ bridges, and an I/O hub with USB and PCI) shown beside a four-processor Direct Connect system in which each processor has its own SRQ, crossbar, HT, and memory controller, with links labeled 8 GB/s to PCIe™ bridges and I/O hubs with USB and PCI]
4P System — Board Layout
System Overview
[Figure: four processor nodes, each containing Core 0, Core 1, SRI, XBAR, MCT, HT links, and an HT host bridge (HT-HB), with the label “NorthBridge”. Each node has its own DRAM on a 128-bit interface at 12.8 GB/s. Nodes are linked by cHT and attach to I/O by ncHT; each link carries 4.0GB/s per direction @ 2GT/s data rate.]
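The bandwidth figures quoted for the links and the memory bus follow from width times transfer rate. The sketch below reproduces that arithmetic. The 16-bit, 2 GT/s link parameters and the 128-bit memory bus come from the slides; the 0.8 GT/s memory transfer rate is an assumption, chosen because it is the rate that yields the quoted 12.8 GB/s on a 128-bit bus. The function and variable names are illustrative only.

```python
def peak_bandwidth_gb_s(width_bits: int, rate_gt_s: float) -> float:
    """Peak one-direction bandwidth in GB/s: bytes per transfer times transfers per second."""
    return width_bits / 8 * rate_gt_s

# HyperTransport link: 16 bits wide at 2 GT/s (values from the slides).
ht_per_direction = peak_bandwidth_gb_s(16, 2.0)  # 4.0 GB/s per direction
ht_per_link = 2 * ht_per_direction               # 8.0 GB/s counting both directions
ht_three_links = 3 * ht_per_link                 # 24.0 GB/s across 3 links

# Memory bus: 128 bits wide (from the slides).
# ASSUMPTION: 0.8 GT/s, inferred from the quoted 12.8 GB/s.
mem_bw = peak_bandwidth_gb_s(128, 0.8)           # 12.8 GB/s

print(f"HT per direction: {ht_per_direction:.1f} GB/s")
print(f"HT per link:      {ht_per_link:.1f} GB/s")
print(f"HT, three links:  {ht_three_links:.1f} GB/s")
print(f"Memory bus:       {mem_bw:.1f} GB/s")
```

The per-link result of 8 GB/s is consistent with the "8 GB/S" link labels in the Direct Connect diagram.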
Northbridge Microarchitecture Overview
[Figure: block diagram with Core0 and Core1 (each with its own L2), System Request Interface (SRI), APIC (interrupts), Crossbar (XBAR), Memory Controller (MCT), DRAM Controller with 2 DDR2 channels, and links HT0, HT1, HT2]

Virtual Channel   Use                                            Data
Request           Read, Non-posted Write, Cache Block Commands   Y
Posted Request    Posted Writes                                  Y
Probe             Broadcast probes                               N
Response          Read Response, Probe response, Completion      Y
Northbridge Command Flow
[Figure: command flow through the XBAR. Sources: Core 0, Core 1, HT0 Input, HT1 Input, HT2 Input. Destinations: to Core, to DCT, HT0 Output, HT1 Output, HT2 Output.]
All buffers are 64-bit command/address
• Core buffers: Victim Buffer (8-entry), Write Buffer (4-entry), Instruction MAB (2-entry), Data MAB (8-entry)
• System Request Queue: 24-entry
• Address MAP & GART
• Router buffers: one 10-entry, three 16-entry, one 12-entry
• Memory Command Queue: 20-entry
Northbridge Data Flow
[Figure: data flow through the XBAR. Sources: Core 0, Core 1, from Host Bridge, from DCT, HT0 input, HT1 input, HT2 input. Destinations: to Core, to Host Bridge, to DCT, HT0 output, HT1 output, HT2 output.]
All buffers are 64-byte cache lines
• Core buffers: Victim Buffer (8-entry), Write Buffer (4-entry)
• Buffers: one 5-entry, four 8-entry
• System Request Data Queue: 12-entry
• Memory Data Queue: 8-entry
Lessons Learned #1
Allocation of XBAR Command buffer across Virtual Channels can have big impact on performance
MP traffic analysis gives the best allocation, e.g. Opteron Read Transaction:
• Request (2 visits)
• Probe (3 visits)
• Response (8 visits)
[Figure: read transaction in a four-node system (P0 to P3, each with L2 and local memory, Memory 0 to Memory 3); numbered steps 1 through 8 are labeled RD, PI0/PI1/PI2, RP0/RP1/RP2, D, and SD]
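One way to read the visit counts above (Request 2, Probe 3, Response 8) is as weights for dividing a fixed pool of command buffer entries among virtual channels. The sketch below is an illustration of that idea only, not AMD's actual allocation: it reserves a minimum per channel so none can starve, then splits the rest in proportion to visits using largest-remainder rounding. The function name, the one-entry minimum, and the choice of a 16-entry pool (the size of several router buffers in the command flow diagram) are assumptions.

```python
def allocate_buffers(visits: dict, total_entries: int, minimum: int = 1) -> dict:
    """Split total_entries across channels in proportion to visits.

    Every channel first receives `minimum` entries; the remainder is
    divided by weight, and rounding leftovers go to the channels with
    the largest fractional shares.
    """
    if total_entries < minimum * len(visits):
        raise ValueError("not enough entries to give every channel the minimum")
    spare = total_entries - minimum * len(visits)
    total_visits = sum(visits.values())
    shares = {ch: spare * v / total_visits for ch, v in visits.items()}
    alloc = {ch: minimum + int(shares[ch]) for ch in visits}
    leftover = total_entries - sum(alloc.values())
    by_remainder = sorted(visits, key=lambda ch: shares[ch] - int(shares[ch]), reverse=True)
    for ch in by_remainder[:leftover]:
        alloc[ch] += 1
    return alloc

# Visit counts for one Opteron read transaction, from the slide.
READ_VISITS = {"Request": 2, "Probe": 3, "Response": 8}

# ASSUMPTION: a 16-entry pool, used here purely for illustration.
allocation = allocate_buffers(READ_VISITS, 16)
print(allocation)  # {'Request': 3, 'Probe': 4, 'Response': 9}
```

A real allocation would weight a mix of transaction types (the slide calls for MP traffic analysis), not a single read, and would have to cover the Posted Request channel as well.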
Lessons Learned #2
Memory Latency is the Key to Application Performance!
[Figure: Performance vs Average Memory Latency (single 2.8GHz core, 400MHz DDR2 PC3200, 2GT/s HT with 1MB cache in MP system). Horizontal axis: 1N, 2N, 4N (SQ), 8N (TL), 8N (L). Left axis: System Performance, 0 to 7. Right axis: Processor Performance, 0% to 120%. Series: OLTP1, OLTP2, SW99, SSL, JBB.]

Topology   AvgD       Latency
1N         0 hops     x + 0ns
2N         0.5 hops   x + 17ns (47 cpuclk)
4N (SQ)    1 hops     x + 44ns (124 cpuclk)
8N (TL)    1.5 hops   x + 76ns (214 cpuclk)
8N (L)     1.8 hops   x + 105ns (234 cpuclk)

[Figure: topology diagrams for 1 Node, 2 Node, 4 Node Square, 8N Twisted Ladder, and 8N Ladder, showing processors P0 to P7 and their I/O attachments]
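The AvgD hop counts quoted for these topologies can be reproduced by averaging shortest-path distances over all source and destination pairs, counting local accesses as 0 hops. The sketch below does this with a breadth-first search. The wiring lists are assumptions: a plain ring for the 4 Node Square and a plain 2x4 ladder for the 8N Ladder, with node numbering chosen for convenience instead of being copied from the diagrams. The twisted ladder is left out because its exact wiring is not recoverable from the text.

```python
from collections import deque

def hop_stats(links, n):
    """Return (diameter, average hops) over all ordered node pairs, self included.

    Assumes the topology is connected.
    """
    adj = {i: set() for i in range(n)}
    for a, b in links:
        adj[a].add(b)
        adj[b].add(a)
    total = 0
    diameter = 0
    for src in range(n):
        dist = {src: 0}
        queue = deque([src])
        while queue:  # breadth-first search from src
            node = queue.popleft()
            for nxt in adj[node]:
                if nxt not in dist:
                    dist[nxt] = dist[node] + 1
                    queue.append(nxt)
        total += sum(dist.values())
        diameter = max(diameter, max(dist.values()))
    return diameter, total / (n * n)

# ASSUMED wirings (illustrative node numbering).
ONE_NODE = []
TWO_NODE = [(0, 1)]
SQUARE = [(0, 1), (1, 2), (2, 3), (3, 0)]
LADDER = [(0, 2), (2, 4), (4, 6),          # first rail
          (1, 3), (3, 5), (5, 7),          # second rail
          (0, 1), (2, 3), (4, 5), (6, 7)]  # rungs

for name, links, n in [("1N", ONE_NODE, 1), ("2N", TWO_NODE, 2),
                       ("4N (SQ)", SQUARE, 4), ("8N (L)", LADDER, 8)]:
    diam, avg = hop_stats(links, n)
    print(f"{name}: diameter {diam}, average {avg:.2f} hops")
```

Under these assumptions the averages come out to 0, 0.5, 1.0, and 1.75 hops, in line with the slide's 0, 0.5, 1, and 1.8.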
Looking Forward
HyperTransport™-based Accelerators
Imagine it, Build it
• Open platform for system builders (“Torrenza”)
– 3rd Party Accelerators
– Media
– FLOPs
– XML
– SOA
• AMD Opteron™ Socket or HTX slot
• HyperTransport interface is an open standard; see hypertransport.org
• Coherent HyperTransport interface available if the accelerator caches system memory (under license)
[Figure: two native dual-core processors (each with SRQ, crossbar, HT, and memory controller) connected by links labeled 8 GB/s to a FLOPs accelerator, an XML accelerator, an SOA accelerator, other optimized silicon, PCI-E bridges, and an I/O hub with USB and PCI]
AMD’s Next Generation Processor Technology
Native quad core die
Ideal for 65nm SOI and beyond
IPC enhanced CPU cores
• 32B instruction fetch
• Improved branch prediction
• Out-of-order load execution
• Up to 4 DP FLOPS/cycle
• Dual 128-bit SSE dataflow
• Dual 128-bit loads per cycle
• Bit Manipulation extensions (LZCNT/POPCNT)
• SSE extensions (EXTRQ/INSERTQ, MOVNTSD/MOVNTSS)
Enhanced Direct Connect Architecture and Northbridge
• Four ungangable x16 HyperTransport™ links (up to 5.2GT/sec)
• Enhanced crossbar
• Next-generation memory support
• FBDIMM when appropriate
• Enhanced power management and RAS
Expandable shared L3 cache
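For readers unfamiliar with the bit manipulation extensions named above, the sketch below models what POPCNT and LZCNT compute for a 64-bit operand. It is a software model of the results only, not of the hardware or its flag behavior; the Python function names and the default width parameter are illustrative.

```python
def popcnt(value: int, width: int = 64) -> int:
    """Count the bits set to 1 in the low `width` bits of value."""
    return bin(value & ((1 << width) - 1)).count("1")

def lzcnt(value: int, width: int = 64) -> int:
    """Count the leading zero bits of a `width`-bit operand.

    A zero input yields `width`, since every bit is a leading zero.
    """
    value &= (1 << width) - 1
    return width - value.bit_length()

print(popcnt(0xFF))     # 8
print(lzcnt(1))         # 63
print(lzcnt(0))         # 64
print(lzcnt(1 << 63))   # 0
```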
Balanced, Highly Efficient Cache Structure
Efficient memory handling reduces the need for “brute force” cache sizes
[Figure: four cores (Core 1, C2, C3, C4), each with its own L1 (64KB) and L2 (512KB), plus Cache Control and an L3 (2+MB)]
Dedicated L1
• Locality keeps most critical data in the L1 cache
• Low latency
• 2 128 bit data paths
• 2 loads per cycle
Dedicated L2
• Sized to accommodate the majority of working sets today
• Dedicated to help eliminate conflicts common in shared caches
Shared L3 – Coming Soon
• Allocation policy which optimizes movement, placement and replication of data for multi-core
• Ready for expansion
Additional HyperTransport™ Ports
• Enable Fully Connected 4 Node (four x16 HT) and 8 Node (eight x8 HT)
• Reduced network diameter
– Fewer hops to memory
• Increased Coherent Bandwidth
– more links
– cHT packets visit fewer links
– HyperTransport3
• Benefits
– Low latency because of lower diameter
– Evenly balanced utilization of HyperTransport links
– Low queuing delays
Low latency under load
[Figure: link configuration options: Four x16 HT OR Eight x8 HT]
4 Node Performance
Configuration                      Diam   Avg Diam   XFIRE BW
4N SQ (2GT/s HyperTransport)       2      1.00       14.9GB/s
4N FC (2GT/s HyperTransport)       1      0.75       29.9GB/s (2X), + 2 extra links
4N FC (4.4GT/s HyperTransport3)    1      0.75       65.8GB/s (4X), w/ HyperTransport3

XFIRE (“crossfire”) BW is the link-limited all-to-all communication bandwidth (data only)
[Figure: 4 Node Square and 4 Node fully connected topologies, P0 to P3 with I/O attachments, links labeled 16]
8 Node Performance
Configuration                      Diam   Avg Diam   XFIRE BW
8N TL (2GT/s HyperTransport)       3      1.62       15.2GB/s
8N 2x4 (4.4GT/s HyperTransport3)   2      1.12       72.2GB/s (5X)
8N FC (4.4GT/s HyperTransport3)    1      0.88       94.4GB/s (6X)

[Figure: 8N Twisted Ladder (links labeled 16), then either 8 Node 6HT 2x4 OR 8 Node Fully Connected (links labeled 8), each with P0 to P7 and I/O attachments]
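The fully connected Avg Diam values and the (2X) to (6X) labels can be checked with a little arithmetic. In a fully connected system every remote node is 1 hop away and the local node is 0 hops, so the average over all destinations is (n - 1) / n. The sketch below computes that and the XFIRE bandwidth ratios from the figures quoted on the 4 Node and 8 Node slides. The dictionary keys and function names are illustrative, and the last line checks only that the 4N FC figure scales with link rate; it does not model how XFIRE bandwidth is derived.

```python
def fully_connected_avg_diam(nodes: int) -> float:
    """Average hops when every other node is one hop away and local is zero."""
    return (nodes - 1) / nodes

# XFIRE bandwidth in GB/s, as quoted on the slides.
XFIRE_GB_S = {
    "4N SQ 2GT/s": 14.9,
    "4N FC 2GT/s": 29.9,
    "4N FC 4.4GT/s": 65.8,
    "8N TL 2GT/s": 15.2,
    "8N 2x4 4.4GT/s": 72.2,
    "8N FC 4.4GT/s": 94.4,
}

def speedup(config: str, baseline: str) -> float:
    """Ratio of XFIRE bandwidth between two configurations."""
    return XFIRE_GB_S[config] / XFIRE_GB_S[baseline]

print(fully_connected_avg_diam(4))  # 0.75
print(fully_connected_avg_diam(8))  # 0.875, shown as 0.88 on the slide

for config, baseline in [("4N FC 2GT/s", "4N SQ 2GT/s"),
                         ("4N FC 4.4GT/s", "4N SQ 2GT/s"),
                         ("8N 2x4 4.4GT/s", "8N TL 2GT/s"),
                         ("8N FC 4.4GT/s", "8N TL 2GT/s")]:
    print(f"{config} vs {baseline}: {speedup(config, baseline):.2f}x")

# Scaling the 2 GT/s fully connected figure by the link-rate ratio 4.4 / 2.
scaled_4n_fc = XFIRE_GB_S["4N FC 2GT/s"] * 4.4 / 2.0
print(f"4N FC scaled to 4.4 GT/s: {scaled_4n_fc:.1f} GB/s")
```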
Why Quad-Core?
[Figure: bar chart of Performance and Power relative to baseline, vertical axis 0 to 2. Max Frequency: 1.00x.]
Baseline is 2 Node x 2 Core blade running OLTP
Increasing Frequency
[Figure: bar chart of Performance and Power relative to baseline, vertical axis 0 to 2. Increased Freq (+16%): 1.14x performance, 1.51x power. Max Frequency: 1.00x.]
Baseline is 2 Node x 2 Core blade running OLTP
Decreasing Frequency
[Figure: bar chart of Performance and Power relative to baseline, vertical axis 0 to 2. Increased Freq (+16%): 1.14x performance, 1.51x power. Max Frequency: 1.00x. Decreased Freq (-16%): 0.87x performance, 0.49x power.]
Baseline is 2 Node x 2 Core blade running OLTP
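Quad-Core
Higher Performance within a Fixed Power Budget
[Figure: bar chart of Performance and Power relative to baseline, vertical axis 0 to 2, with bars for Increased Freq (+16%), Max Frequency, Decreased Freq (-16%), and Quad-Core. Increased Freq (+16%): 1.14x performance, 1.51x power. Max Frequency: 1.00x. Quad-Core: 1.73x performance, 0.99x power.]
Baseline is 2 Node x 2 Core blade running OLTP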
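The trade-off across these four charts can be summarized as performance per unit of power and as the best performance within a fixed power budget. The sketch below tabulates both from the figures on the slides. The 1.00x power value for the Max Frequency baseline is assumed from its role as the baseline, and the variable names are illustrative.

```python
# (performance, power) relative to the 2 Node x 2 Core baseline, from the slides.
CONFIGS = {
    "Increased Freq (+16%)": (1.14, 1.51),
    "Max Frequency": (1.00, 1.00),  # ASSUMPTION: baseline power is 1.00 by definition
    "Decreased Freq (-16%)": (0.87, 0.49),
    "Quad-Core": (1.73, 0.99),
}
POWER_BUDGET = 1.00  # the baseline power envelope

# Performance delivered per unit of power for each option.
perf_per_watt = {name: perf / power for name, (perf, power) in CONFIGS.items()}

# Options that fit the budget, and the fastest one among them.
within_budget = [name for name, (_, power) in CONFIGS.items() if power <= POWER_BUDGET]
best = max(within_budget, key=lambda name: CONFIGS[name][0])

for name, (perf, power) in CONFIGS.items():
    print(f"{name:24s} perf {perf:.2f}x  power {power:.2f}x  perf/power {perf_per_watt[name]:.2f}")
print("Best performance within budget:", best)
```

On these numbers, lowering the frequency gives a perf/power ratio close to the quad-core option but less absolute performance than the baseline, while quad-core delivers 1.73x performance at essentially the baseline power.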
Clock and Power Planes
[Figure: clock and power plane diagram. Blocks: Northbridge, Core 0 to Core 3 (each with its own PLL), HyperTransport Links, DIMMs, Misc, PLLs, VRM. Supply and control labels: VDD, VDDNB, VDDA, VHT, VDDIO, VTT, SVI.]
DICE: Dynamic Independent Core Engagement
Ability to dynamically and individually adjust core frequencies to improve power efficiency
[Figure: four cores, each at 100% Workload; 100% Power State]
DICE: Dynamic Independent Core Engagement
Ability to dynamically and individually adjust core frequencies to improve power efficiency
[Figure: four cores at 100%, 33%, 33%, and 33% Workload; 60% Power State]
DICE: Dynamic Independent Core Engagement
Ability to dynamically and individually adjust core frequencies for improved power efficiency
[Figure: four cores: 100% Workload, 50% Workload, Halted, Halted; 45% Power State]
Enjoy the rest of the conference!
www.amd.com/power
Trademark Attribution
AMD, the AMD Arrow, AMD Opteron, AMD PowerNow logo and combinations thereof are trademarks of Advanced Micro Devices, Inc. HyperTransport and HTX are trademarks of the HyperTransport Consortium. PCIe is a trademark of the PCI-SIG. Other names used in this presentation are for informational purposes only and may be trademarks of their respective owners.