Duc Day1 03 CourseCode L1T1H1-10 CellArchitecturemueller/cluster/ps3/workshop/... · 2006. 7....
Transcript of Duc Day1 03 CourseCode L1T1H1-10 CellArchitecturemueller/cluster/ps3/workshop/... · 2006. 7....
Systems and Technology Group
06/23/06Course Code: L1T1H1-10 Cell Architecture1
Cell Architecture
Course code: L1T1H1-10Cell Ecosystem Solutions Enablement
Systems and Technology Group
06/23/06Course Code: L1T1H1-10 Cell Architecture2
Class Objectives – Things you will learn
Cell design motivation How cell overcomes three important limiters of contemporary microprocessor performance—power use, memory use, and processor frequencyCell processor organization and components– Power processor element, block diagram, PXU pipeline – Synergistic processor element, block diagram, SXU pipeline– Memory flow controller and MFC commands– Element interconnect bus, command and data topology– I/O and memory interfaces
Systems and Technology Group
06/23/06Course Code: L1T1H1-10 Cell Architecture3
Class AgendaCell concept
Architecture motivators
Cell synergy
Cell features
Cell processor components– Power processor element
– Synergistic processor element
– Memory flow controller
– Element interconnect bus
– I/O and memory interfaces
– Resource allocation management
References
Jim Kahle, Cell Broadband Engine and Cell Broadband Engine Architecture
Trademarks - Cell Broadband Engine ™ is a trademark of Sony Computer Entertainment, Inc.
Systems and Technology Group
06/23/06Course Code: L1T1H1-10 Cell Architecture4
Cell ConceptCompatibility with 64b Power Architecture™– Builds on and leverages IBM investment and community
Increased efficiency and performance– Attacks on the “Power Wall”
• Non Homogenous Coherent Multiprocessor• High design frequency @ a low operating voltage with advanced power management
– Attacks on the “Memory Wall”• Streaming DMA architecture• 3-level Memory Model: Main Storage, Local Storage, Register Files
– Attacks on the “Frequency Wall”• Highly optimized implementation• Large shared register files and software controlled branching to allow deeper pipelines
Interface between user and networked world– Image rich information, virtual reality– Flexibility and security
Multi-OS support, including RTOS / non-RTOS– Combine real-time and non-real time worlds
Systems and Technology Group
06/23/06Course Code: L1T1H1-10 Cell Architecture5
Market RequirementsƒNatural interaction with the systemƒConsumer acceptable interactionƒImprove Experience
–Ease of use–High degree of interaction–Responsive–Realism–Interconnected through network to other devices
Holistic Design ApproachƒArchitectureƒHardware implementationƒSystem structureƒProgramming Model
Technical RequirementsƒDual environment: Real time and conventionalƒHigh FLOPS Computational densityƒHigh parallelismƒBandwidth & latency controlsƒRealtime responseƒResource reservationƒHigh bandwidth
Architecture Motivators
Systems and Technology Group
© 2005 IBM Corporation04/06/065 Course Code: L1T1H1-10 Cell Architecture
Systems and Technology Group
06/23/06Course Code: L1T1H1-10 Cell Architecture6
Cell SynergyCell is not a collection of different processors, but a synergistic whole– Operation paradigms, data formats and semantics consistent – Share address translation and memory protection model PPE for operating systems and program controlSPE optimized for efficient data processing– SPEs share Cell system functions provided by Power Architecture– MFC implements interface to memory
• Copy in/copy out to local storagePowerPC provides system functions– Virtualization– Address translation and protection– External exception handlingEIB integrates system as data transport hub
Systems and Technology Group
06/23/06Course Code: L1T1H1-10 Cell Architecture7
Cell Processor Cell Processor
Power Processor Element (PPE): •General Purpose, 64-bit RISC
Processor (PowerPC 2.02)•2-Way Hardware Multithreaded•L1 : 32KB I ; 32KB D•L2 : 512KB•Coherent load/store•VMX•3.2 GHz
Internal Interconnect: •Coherent ring structure•300+ GB/s total internal
interconnect bandwidth•DMA control to/from SPEs
supports >100 outstanding memory requests
Synergistic Processor Elements (SPE):•8 per chip•128-bit wide SIMD Units•Integer and Floating Point capable•256KB Local Store•Up to 25.6 GF/s per SPE ---
200GF/s total *
External Interconnects: •25.6 GB/sec BW memory interface•2 Configurable I/O Interfaces
•Coherent interface (SMP)•Normal I/O interface (I/O & Graphics)•Total BW configurable between interfaces•Up to 35 GB/s out•Up to 25 GB/s in
Memory Management & Mapping•SPE Local Store aliased into PPE system memory•MFC/MMU controls SPE DMA accesses
•Compatible with PowerPC Virtual Memory architecture•S/W controllable from PPE MMIO
•Hardware or Software TLB management•SPE DMA access protected by MFC/MMU
““Supercomputer-on-a-Chip”
* At clock speed of 3.2GHz
Element Interconnect Bus
MFC
Local Store
SPU
N
AUC
MFC
Local Store
SPU
N
AUC
Power Core(PPE)
L2 Cache
NCULocal Store
SPUMFC
N
AUC
Local Store
SPUMFC
N
AUC
N N
N N
MFC
Local Store
SPU
N
AU
C
MFC
Local Store
SPU
N
AU
C
MFC
Loca
l Sto
re
SPU
N
AU
C
MFC
Loca
l Sto
re
SPU
N
AU
C
20 GB/sec Coherent
Interconnect
25.6 GB/sec Memory Inteface
5 GB/sec I/O Bus
Systems and Technology Group
06/23/06Course Code: L1T1H1-10 Cell Architecture8
Power Processor ElementPPE handles operating system and control tasks– 64-bit Power ArchitectureTM with VMX– In-order, 2-way hardware simultaneous multi-threading (SMT)– Coherent Load/Store with 32KB I & D L1 and 512KB L2
9
Systems and Technology Group
© 2006 IBM Corporation06/23/06Course Code: L1T1H1-10 Cell Architecture
Pre-Decode
L1 Instruction Cache
MicrocodeSMT Dispatch (Queue)
DecodeDependency
Issue
Branch Scan
Fetch ControlL2Interface
VMX/FPU Issue (Queue)
VMXLoad/Store/
Permute
VMXArith./Logic Unit
FPULoad/Store
FPUArith/Logic Unit
Load/StoreUnit
BranchExecution
Unit
Fixed-PointUnit
FPU CompletionVMX Completion
Completion/Flush
8
4
2 1
211 1 1
111
4Thread A Thread B
Threads alternate fetch and dispatch cycles
Thread A
Thread B
Thread A
L1 Data Cache
2
PPE BLOCK DIAGRAM
Systems and Technology Group
06/23/06Course Code: L1T1H1-10 Cell Architecture10
Synergistic Processor ElementSPE provides computational performance – Dual issue, up to 16-way 128-bit SIMD– Dedicated resources: 128 128-bit RF, 256KB Local Store– Each can be dynamically configured to protect resources– Dedicated DMA engine: Up to 16 outstanding requests
Systems and Technology Group
06/23/06Course Code: L1T1H1-10 Cell Architecture11
SPE HighlightsRISC like organization
– 32 bit fixed instructions– Clean design – unified Register file
User-mode architecture– No translation/protection within SPU– DMA is full Power Arch protect/x-late
VMX-like SIMD dataflow– Broad set of operations (8 / 16 / 32 Byte)– Graphics SP-Float– IEEE DP-Float
Unified register file– 128 entry x 128 bit
256KB Local Store– Combined I & D– 16B/cycle L/S bandwidth– 128B/cycle DMA bandwidth
LS
LS
LS
LSGPR
FXU ODD
FXU EVN
SFPDP
CO
NTR
OL
CHANNEL
DMA SMMATO
SBIRTB
BEB
FWD
14.5mm2 (90nm SOI)
12
Systems and Technology Group
© 2006 IBM Corporation06/23/06Course Code: L1T1H1-10 Cell Architecture
What is a Synergistic Processor?(and why is it efficient?)
Local Store “is” large 2nd level register file / private instruction store instead of cache– Asynchronous transfer (DMA) to shared memory
– Frontal attack on the Memory WallMedia Unit turned into a Processor
– Unified (large) Register File
– 128 entry x 128 bitMedia & Compute optimized
– One context
– SIMD architecture
LS
LS
LS
LSGPR
FXU ODD
FXU EVN
SFPDP
CO
NTR
OL
CHANNEL
DMA SMMATO
SBIRTB
BEB
FWD
SPU
SMF
13
Systems and Technology Group
© 2006 IBM Corporation06/23/06Course Code: L1T1H1-10 Cell Architecture
SPU DetailL S
L S
L S
L SG P R
F X U O D D
F X U E V N
S F PD P
CO
NTR
OL
C H A N N E L
D M A S M MA T O
S B IR T B
BE
B
F W D
Synergistic Processor Element (SPE)User-mode architecture
– No translation/protection within SPE
– DMA is full PowerPC protect/xlate
Direct programmer control– DMA/DMA-list
– Branch hintVMX-like SIMD dataflow
– Graphics SP-Float
– No saturate arith, some byte
– IEEE DP-Float (BlueGene-like)Unified register file
– 128 entry x 128 bit256KB Local Store
– Combined I & D
– 16B/cycle L/S bandwidth
– 128B/cycle DMA bandwidthMemory Flow Control (MFC) SPU Latencies
– Simple fixed point - 2 cycles*
– Complex fixed point - 4 cycles*
– Load - 6 cycles*• Local store size = 256 KB
– Single-precision (ER) float - 6 cycles*
– Integer multiply - 7 cycles*
– Branch miss - 20 cycles• No penalty if correctly hinted
– DP (IEEE) float - 13 cycles*• Partially pipelined
– Enqueue DMA Command - 20 cycles*
SPU Units:– Simple (FXU even)
• Add/Compare• Rotate• Logical, Count Leading Zero
– Permute (FXU odd)• Permute• Table-lookup
– FPU (Single / Double Precision)
– Control (SCN)• Dual Issue, Load/Store, ECC Handling
– Channel (SSC) – Interface to MFC
– Register File (GPR/FWD)
14
Systems and Technology Group
© 2006 IBM Corporation06/23/06Course Code: L1T1H1-10 Cell Architecture
Permute UnitLoad-Store Unit
Floating-Point UnitFixed-Point Unit
Branch UnitChannel Unit
Result Forwarding and StagingRegister File
Local Store(256kB)
Single Port SRAM
128B Read 128B Write
DMA Unit
Instruction Issue Unit / Instruction Line Buffer
8 Byte/Cycle 16 Byte/Cycle 128 Byte/Cycle64 Byte/Cycle
On-Chip Coherent Bus
SPE BLOCK DIAGRAM
Systems and Technology Group
06/23/06Course Code: L1T1H1-10 Cell Architecture15
SPE Structure
Scalar processing supported on data-parallel substrate– All instructions are data parallel and operate on vectors
of elements – Scalar operation defined by instruction use, not opcode
• Vector instruction form used to perform operation
Preferred slot paradigm– Scalar arguments to instructions found in “preferred slot”– Computation can be performed in any slot
Systems and Technology Group
06/23/06Course Code: L1T1H1-10 Cell Architecture16
Register Scalar Data Layout
Preferred slot in bytes 0-3– By convention for procedure interfaces – Used by instructions expecting scalar data
• Addresses, branch conditions, generate controls for insert
17
Systems and Technology Group
Cell Processor Components
Memory Management & Mapping• SPE Local Store aliased into PPE system memory• MFC/MMU controls / protects SPE DMA accesses
– Compatible with PowerPC Virtual Memory Architecture
– SW controllable using PPE MMIO• DMA 1,2,4,8,16,128 -> 16Kbyte transfers for I/O
access• Two queues for DMA commands: Proxy & SPU
Loca
l Sto
re
SPU
MFC
N
AU
C
Loca
l Sto
re
SPU
MFC
N
AU
C
Local Store
SPUM
FCN
AU
C
Local Store
SPUM
FCN
AU
C
Local StoreSPUMFC
N
AUC
Local StoreSPUMFC
N
AUC
Local Store
SPU MFCN
AUC
Local Store
SPU MFCN
AUC
96 Byte/Cycle
Element Interconnect Bus
Power Core(PPE)
L2 Cache
NCU
Synergistic Processor Element (SPE):• Provides the computational performance• Simple RISC User Mode Architecture
– Dual issue VMX-like– Graphics SP-Float– IEEE DP-Float
• Dedicated resources: unified 128x128-bit RF, 256KB Local Store
• Dedicated DMA engine: Up to 16 outstanding requests
.1
18
Systems and Technology Group
© 2006 IBM Corporation06/23/06Course Code: L1T1H1-10 Cell Architecture
LSA - Local Store Address (32 bit)EA - Effective Address (32 or 64 bit)TS - Transfer Size (16 bytes to 16K bytes)LS - DMA List Size (8 bytes to 16 K bytes)TG - Tag Group(5 bit)CL - Cache Management / Bandwidth Class
DMA Commands
Command ParametersPut - Transfer from Local Store to EA spacePuts - Transfer and Start SPU executionPutr - Put Result - (Arch. Scarf into L2)Putl - Put using DMA List in Local StorePutrl - Put Result using DMA List in LS (Arch)Get - Transfer from EA Space to Local StoreGets - Transfer and Start SPU executionGetl - Get using DMA List in Local StoreSndsig - Send Signal to SPU Command Modifiers: <f,b>f: Embedded Tag Specific Fence
Command will not start until all previous commandsin same tag group have completed
b: Embedded Tag Specific BarrierCommand and all subsiquent commands in sametag group will not start until previous commands in sametag group have completed
SL1 Cache Management Commandssdcrt - Data cache region touch (DMA Get hint) sdcrtst - Data cache region touch for store (DMA Put hint)sdcrz - Data cache region zerosdcrs - Data cache region storesdcrf - Data cache region flush
Synchronization CommandsLockline (Atomic Update) Commands:
getllar - DMA 128 bytes from EA to LS and set Reservationputllc - Conditionally DMA 128 bytes from LS to EAputlluc - Unconditionally DMA 128 bytes from LS to EA
barrier - all previous commands complete before subsiquentcommands are started
mfcsync - Results of all previous commands in Tag groupare remotely visible
mfceieio - Results of all preceding Puts commands in samegroup visible with respect to succeeding Get commands
Memory Flow Controller Commands
19
Systems and Technology Group
Cell Processor ComponentsPower Processor Element (PPE):
• General purpose, 64-bit RISCprocessor (PowerPC AS 2.0.2)
• 2-Way hardware multithreaded• L1 : 32KB I ; 32KB D• L2 : 512KB• Coherent load / store• VMX-32• Realtime Controls
– Locking L2 Cache & TLB– Software / hardware managed TLB– Bandwidth / Resource Reservation– Mediated Interrupts
In the Beginning– the solitary Power Processor
Custom Designed– for high frequency, space,
and power efficiency
96 Byte/Cycle
Element Interconnect Bus
Power Core(PPE)
L2 Cache
NCU
Element Interconnect Bus (EIB):• Four 16 byte data rings supporting multiple
simultaneous transfers per ring• 96Bytes/cycle peak bandwidth• Over 100 outstanding requests
.1
Systems and Technology Group
06/23/06Course Code: L1T1H1-10 Cell Architecture20
Element Interconnect BusEIB data ring for internal communication– Four 16 byte data rings, supporting multiple transfers
– 96B/cycle peak bandwidth
– Over 100 outstanding requests
Systems and Technology Group
06/23/06Course Code: L1T1H1-10 Cell Architecture21
Internal Bandwidth Capability
Each EIB Bus data port supports 25.6GBytes/sec* in each direction
The EIB Command Bus streams commands fast enough to support 102.4 GB/sec for coherent commands, and 204.8 GB/sec for non-coherent commands.
The EIB data rings can sustain 204.8GB/sec for certain workloads, with transient rates as high as 307.2GB/sec between bus units
* The above numbers assume a 3.2GHz core frequency – internal bandwidth scales with core frequency
Systems and Technology Group
06/23/06Course Code: L1T1H1-10 Cell Architecture22
Element Interconnect Bus - Data TopologyFour 16B data rings connecting 12 bus elements
– Two clockwise / Two counter-clockwisePhysically overlaps all processor elementsCentral arbiter supports up to three concurrent transfers per data ring
– Two stage, dual round robin arbiterEach element port simultaneously supports 16B in and 16B out data path
– Ring topology is transparent to element data interface
16B 16B 16B 16B
Data Arb
16B 16B 16B 16B
16B 16B16B 16B16B 16B16B 16B
16B
16B16B
16B
16B
16B16B
16B
SPE0 SPE2 SPE4 SPE6
SPE7SPE5SPE3SPE1
MIC
PPE
BIF/IOIF0
IOIF1
Systems and Technology Group
06/23/06Course Code: L1T1H1-10 Cell Architecture23
Example of eight concurrent transactions
MIC SPE0 SPE2 SPE4 SPE6 BIF /IOIF1
Ramp
7
Controller
Ramp
8
Controller
Ramp
9
Controller
Ramp
10
Controller
Ramp
11
Controller
Controller
Ramp
0
Controller
Ramp
1
Controller
Ramp
2
Controller
Ramp
3
Controller
Ramp
4
Controller
Ramp
5
Controller
Ramp
6
Controller
Ramp
7
Controller
Ramp
8
Controller
Ramp
9
Controller
Ramp
10
Controller
Ramp
11
Data
Arbiter
Ramp
7
Controller
Ramp
8
Controller
Ramp
9
Controller
Ramp
10
Controller
Ramp
11
ControllerController
Ramp
5
Controller
Ramp
4
Controller
Ramp
3
Controller
Ramp
2
Controller
Ramp
1
Controller
Ramp
0
PPE SPE1 SPE3 SPE5 SPE7 IOIF1PPE SPE1 SPE3 SPE5 SPE7 IOIF1
PPE SPE1 SPE3 SPE5 SPE7 IOIF1MIC SPE0 SPE2 SPE4 SPE6 BIF /IOIF0
Ring1Ring3
Ring0Ring2
controls
24
Systems and Technology Group
Cell Processor Components
IOIF0
20 GB/secBIF or IOIF0
MIC
25 GB/secXDR DRAM
IOIF1
SouthbridgeI/O
5 GB/sec
Loca
l Sto
re
SPU
MFC
N
AU
C
Loca
l Sto
re
SPU
MFC
N
AU
C
Local Store
SPUM
FCN
AU
C
Local Store
SPUM
FCN
AU
C
Local StoreSPUMFC
N
AUC
Local StoreSPUMFC
N
AUC
Local Store
SPU MFCN
AUC
Local Store
SPU MFCN
AUC
96 Byte/Cycle
Element Interconnect Bus
Power Core(PPE)
L2 Cache
NCU
IIC IOT
TKM
Token Manager (TKM):Bandwidth / Resource Reservation for shared resourcesOptionally enabled for RT tasks or LPARMultiple Resource Allocation Groups (RAGs)Generates access tokens at configurable rate for each allocation group
– 1 per each memory bank (16 total)– 2 for each IOIF (4 total)
Requestors assigned RAG ID by OS / hypervisor– Each SPE– PPE L2 / NCU– IOIF 0 Bus Master– IOIF 1 Bus Master
Priority order for using another RAGs unused tokensResource over committed warning interrupt
Systems and Technology Group
06/23/06Course Code: L1T1H1-10 Cell Architecture25
I/O and Memory InterfacesI/O Provides wide bandwidth– Dual XDRTM controller (25.6GB/s @ 3.2Gbps)– Two configurable interfaces (76.8GB/s @6.4Gbps)
• Configurable number of Bytes• Coherent or I/O Protection
– Allows for multiple system configurations
26
Systems and Technology Group
Cell Processor Components
IOIF0
20 GB/secBIF or IOIF0
IOIF1
SouthbridgeI/O
5 GB/sec
Loca
l Sto
re
SPU
MFC
N
AU
C
Loca
l Sto
re
SPU
MFC
N
AU
C
Local Store
SPUM
FCN
AU
C
Local Store
SPUM
FCN
AU
C
Local StoreSPUMFC
N
AUC
Local StoreSPUMFC
N
AUC
Local Store
SPU MFCN
AUC
Local Store
SPU MFCN
AUC
96 Byte/Cycle
Element Interconnect Bus
Power Core(PPE)
L2 Cache
NCU
Memory Interface Controller (MIC):• Dual XDRTM controller (25.6GB/s @ 3.2Gbps)• ECC support• Suspend to DRAM support
Broadband Interface Controller (BIC):Provides a wide connection to external devices Two configurable interfaces (60GB/s @ 5Gbps)
– Configurable number of bytes– Coherent (BIF) and / or
I/O (IOIFx) protocolsSupports two virtual channels per interfaceSupports multiple system configurations
MIC
25 GB/secXDR DRAM
.1
Systems and Technology Group
06/23/06Course Code: L1T1H1-10 Cell Architecture27
Cell BE Processor Can Support Many SystemsGame console systemsBladesHDTVHome media serversSupercomputers
Cell BEProcessor
XDRtm XDRtm
IOIF0 IOIF1
Cell BEProcessor
XDRtm XDRtm
IOIF BIF
Cell BEProcessor
XDRtm XDRtm
IOIF
Cell BEProcessor
XDRtm XDRtm
IOIFBIF
Cell BEProcessor
XDRtm XDRtm
IOIF
Cell BEProcessor
XDRtm XDRtm
IOIF
BIF
Cell BEProcessor
XDRtm XDRtm
IOIFSW
Systems and Technology Group
06/23/06Course Code: L1T1H1-10 Cell Architecture28
(c) Copyright International Business Machines Corporation 2005.All Rights Reserved. Printed in the United Sates April 2005.
The following are trademarks of International Business Machines Corporation in the United States, or other countries, or both. IBM IBM Logo Power Architecture
Other company, product and service names may be trademarks or service marks of others.
All information contained in this document is subject to change without notice. The products described in this document are NOT intended for use in applications such as implantation, life support, or other hazardous uses where malfunction could result in death, bodily injury, or catastrophic property damage. The information contained in this document does not affect or change IBM product specifications or warranties. Nothing in this document shall operate as an express or implied license or indemnity under the intellectual property rights of IBM or third parties. All information contained in this document was obtained in specific environments, and is presented as an illustration. The results obtained in other operating environments may vary.
While the information contained herein is believed to be accurate, such information is preliminary, and should not be relied upon for accuracy or completeness, and no representations or warranties of accuracy or completeness are made.
THE INFORMATION CONTAINED IN THIS DOCUMENT IS PROVIDED ON AN "AS IS" BASIS. In no event will IBM be liable for damages arising directly or indirectly from any use of the information contained in this document.
IBM Microelectronics Division The IBM home page is http://www.ibm.com1580 Route 52, Bldg. 504 The IBM Microelectronics Division home page is Hopewell Junction, NY 12533-6351 http://www.chips.ibm.com