Mrinmoy Ghosh Hsien-Hsin S. Lee School of Electrical and Computer Engineering Georgia Tech
A Gentler, Kinder Guide to the Multi-core Galaxy
Prof. Hsien-Hsin S. Lee
School of Electrical and Computer Engineering
Georgia Tech
Guest lecture for ECE4100/6100 for Prof. Yalamanchili
2
Reality Check
• Conventional processor designs run out of steam
  – Complexity (verification)
  – Power (thermal)
  – Physics (CMOS scaling)
• Unanimous direction: Multi-core
  – Simple cores (massive number)
  – Keep
    • Wire communication on leash
    • Gordon Moore happy (Moore’s Law)
  – Architects’ menace: kick the ball to the other side of the court?
• What do you (or your customers) want?
  – Performance (and/or availability)
  – Throughput > latency (turnaround time)
  – Total cost of ownership (performance per dollar)
  – Energy (performance per watt)
  – Reliability and dependability, SPAM/spy free
3
Multi-core Processor Gala
4
Intel’s Multicore Roadmap
• To extend Moore’s Law
• To delay the ultimate limit of physics
• By 2010
  – all Intel processors delivered will be multicore
  – Intel’s 80-core processor
Source: Adapted from Tom’s Hardware
[Figure: roadmap chart for desktop, mobile, and enterprise processors across 2006, 2007, and 2008. Chart labels, grouped in extraction order (SC = single core, DC = dual core, QC = quad core, 8C = eight cores):
  – SC 1MB; DC 2MB; DC 2/4MB shared; DC 3MB/6MB shared (45nm)
  – DC 2/4MB; DC 2/4MB shared; DC 4MB; DC 3MB/6MB shared (45nm)
  – SC 512KB/1/2MB; DC 2MB; DC 4MB; DC 16MB; QC 4MB; QC 8/16MB shared; 8C 12MB shared (45nm)]
5
Is a Multi-core really better off?
Well, it is hard to say in the computing world
6
Is a Multi-core really better off?
DEEP BLUE
480 chess chips
Can evaluate 200,000,000 moves per second!!
7
Computing Paradigm Evolution
[Figure: issue-slot usage of four functional units (FU1 to FU4) over execution time, compared across five designs. Legend: Thread 1, Thread 2, Thread 3, Thread 4, Thread 5, Unused.]
• Conventional superscalar, single threaded
• Fine-grained multithreading (cycle-by-cycle interleaving)
• Coarse-grained multithreading (block interleaving)
• Simultaneous multithreading (or Intel’s HT)
• Chip multiprocessor (CMP) or multi-core processors
Why is SMT failing?
8
Major Challenges for Multi-Core Designs
• Communication
  – Memory hierarchy
  – Data allocation (you have a large shared L2/L3 now)
  – Interconnection network
  – Scalability
  – Bus bandwidth, how to get there?
• Power-Performance: win or lose?
  – Borkar’s multicore arguments
    • A 15% per-core performance drop buys a 50% power saving
    • Giant, single core wastes power when task is small
  – How about leakage?
• Process variation and yield
• Programming Model
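Borkar's argument on this slide can be checked with simple arithmetic. The sketch below takes the slide's two numbers (15% per-core performance drop, 50% power saving) and assumes, purely for illustration, that the scaled-down cores replace one large core and that the workload parallelizes perfectly. The function name `multicore_tradeoff` and the perfect-scaling assumption are not from the slides.

```python
def multicore_tradeoff(n_cores, perf_drop=0.15, power_saving=0.50):
    """Throughput and power relative to one full-speed core (both = 1.0).

    Assumes perfect parallel scaling, which real workloads rarely achieve.
    """
    per_core_perf = 1.0 - perf_drop        # each core runs 15% slower
    per_core_power = 1.0 - power_saving    # each core burns half the power
    throughput = n_cores * per_core_perf
    power = n_cores * per_core_power
    return throughput, power, throughput / power


throughput, power, perf_per_watt = multicore_tradeoff(2)
print(f"throughput x{throughput:.2f}, power x{power:.2f}, "
      f"perf/watt x{perf_per_watt:.2f}")
```

Under these assumptions, two slower cores fit in the power budget of one fast core while delivering more aggregate throughput. The advantage shrinks as the serial fraction of the workload grows, and the model ignores leakage.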
9
Intel Core 2 Duo
• Homogeneous cores
• Bus-based on-chip interconnect
• Shared on-die cache memory
• Traditional I/O
Classic OOO: Reservation Stations, Issue ports, Schedulers…etc
Large, shared set associative, prefetch, etc.
Source: Intel Corp.
10
Core 2 Duo Microarchitecture
11
Why Sharing on-die L2?
• What happens when L2 is too large?
12
Intel Core 2 Duo (Merom)
13
Core™ μArch: Wide Dynamic Execution
14
Core™ μArch: Wide Dynamic Execution
15
Core™ μArch: MACRO Fusion
• Common Intel 32 (formerly called x86 or IA-32) instruction pairs are combined into a single μop
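As an illustration of the pairing idea, the sketch below scans an instruction stream and merges a compare or test that is immediately followed by a conditional jump. The choice of fusible pair, the mnemonic sets, and the function name `macro_fuse` are assumptions made for this example; the slide does not list which pairs are fused.

```python
# Toy model of macro fusion: an adjacent compare/test + conditional-jump
# pair is treated as one fused operation. Mnemonic sets are illustrative.
FUSIBLE_FIRST = {"cmp", "test"}
CONDITIONAL_JUMPS = {"je", "jne", "jl", "jle", "jg", "jge", "ja", "jb"}


def macro_fuse(instructions):
    """Return a new list in which fusible adjacent pairs are merged."""
    fused = []
    i = 0
    while i < len(instructions):
        current = instructions[i]
        nxt = instructions[i + 1] if i + 1 < len(instructions) else None
        if (nxt is not None
                and current.split()[0] in FUSIBLE_FIRST
                and nxt.split()[0] in CONDITIONAL_JUMPS):
            fused.append(f"{current} + {nxt}")  # one slot for the pair
            i += 2
        else:
            fused.append(current)
            i += 1
    return fused


# Hypothetical four-instruction sequence.
program = [
    "mov eax, [mem1]",
    "cmp eax, ebx",
    "jne target",
    "inc ecx",
]
fused_program = macro_fuse(program)
print(len(program), "instructions ->", len(fused_program), "after fusion")
```

In this toy stream the compare and the jump share one slot, so four instructions occupy three slots downstream of the decoder.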
16
Micro(-ops) Fusion (from Pentium M)
• A misnomer..
• Instead of breaking up an Intel32 instruction into μops, they decide not to break it up…
• A better naming scheme would call the previous techniques “IA32 fission”
• To fuse
  – Store address and store data μops
  – Load-and-op μops (e.g. ADD (%esp), %eax)
• Extend each RS entry to take 3 operands
• To reduce
  – micro-ops (10% reduction in the OOO logic)
  – Decoder bandwidth (simple decoder can decode fusion type instruction)
  – Energy consumption
• Performance improved by 5% for INT and 9% for FP (Pentium M data)
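The effect on micro-op counts can be sketched with a toy counter. The instruction classes and per-class μop counts below are simplifying assumptions for illustration (a store as store-address plus store-data, a load-and-op as a load plus an ALU op); the function `count_uops` and the instruction mix are hypothetical.

```python
# Toy micro-op counter. Without fusion a store splits into store-address
# and store-data uops, and a load-and-op splits into a load and an ALU uop.
# With fusion each such pair is tracked as a single fused uop.
UNFUSED_UOPS = {"alu": 1, "load": 1, "store": 2, "load_op": 2}
FUSIBLE = {"store", "load_op"}


def count_uops(instruction_classes, fusion_enabled):
    total = 0
    for kind in instruction_classes:
        if fusion_enabled and kind in FUSIBLE:
            total += 1                      # pair kept together
        else:
            total += UNFUSED_UOPS[kind]     # split into separate uops
    return total


# Hypothetical mix; "load_op" stands for something like ADD (%esp), %eax.
mix = ["alu", "load_op", "store", "load", "alu", "store"]
without_fusion = count_uops(mix, fusion_enabled=False)
with_fusion = count_uops(mix, fusion_enabled=True)
print(f"{without_fusion} uops without fusion, {with_fusion} with fusion")
```

The saving in this made-up mix is far larger than the 10% quoted on the slide, because the mix is deliberately heavy on fusible instructions.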
17
Smart Memory Access
18
Sun UltraSparc T1
• Eight cores, each 4-way threaded
• Fine-grained multithreading
  – a thread-selection logic
  – Round-robin cycle-by-cycle
• 1.2 GHz (90nm)
• No OOO, 8 instructions per cycle
• Cache
  – 16K 4-way 32B L1-I
  – 8K 4-way 16B L1-D
  – Blocking cache (reason for MT)
  – 4-banked 12-way 3MB L2 + 4 memory controllers
  – Data moved between the L2 and the cores using an integrated crossbar switch to provide high throughput (200GB/s)
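The round-robin, cycle-by-cycle thread selection listed for the T1 can be sketched as follows. This is a minimal model, assuming a selector that skips threads marked as stalled (for example, waiting on a miss in the blocking cache). The stall handling, the names `select_thread` and `run`, and the trace are illustrative assumptions, not a description of the actual T1 logic.

```python
def select_thread(last_selected, ready, n_threads=4):
    """Pick the next ready thread after last_selected, round-robin.

    ready holds one boolean per hardware thread.
    Returns the thread index, or None if every thread is stalled.
    """
    for offset in range(1, n_threads + 1):
        candidate = (last_selected + offset) % n_threads
        if ready[candidate]:
            return candidate
    return None


def run(ready_per_cycle, n_threads=4):
    """Return the thread issued in each cycle (None means a bubble)."""
    issued = []
    last = n_threads - 1  # so that thread 0 is considered first
    for ready in ready_per_cycle:
        choice = select_thread(last, ready, n_threads)
        issued.append(choice)
        if choice is not None:
            last = choice
    return issued


# Hypothetical trace (cycles numbered from 0): thread 1 is stalled in
# cycles 1 and 2, and every thread is stalled in cycle 4.
trace = [
    [True, True, True, True],
    [True, False, True, True],
    [True, False, True, True],
    [True, True, True, True],
    [False, False, False, False],
    [True, True, True, True],
]
print(run(trace))  # [0, 2, 3, 0, None, 1]
```

As long as at least one of the four threads is ready, the core issues every cycle, which is how multithreading hides the latency of a blocking cache. A bubble appears only when all threads are stalled at once.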
19
Sun UltraSparc T2
• A fatter version of T1
• 1.4GHz (65nm)
• 8 threads per core, 8 cores on-die
• 1 FPU per core (1 FPU per die in T1), 16 INT EU (8 in T1)
• L2 increased to 8-banked 16-way 4MB shared
• 8-stage integer pipeline (as opposed to 6 for T1)
• 16 instructions per cycle
• One PCI Express port (x8 1.0)
• Two 10 Gigabit Ethernet ports with packet classification and filtering
• Eight encryption engines
• Four dual-channel FBDIMM memory controllers
• 711 signal I/O, 1831 total
20
STI Cell Broadband Engine
• Heterogeneous!
• 64-bit PowerPC
• Eight SPEs
  – In-order, Dual-issue
  – 128-bit SIMD
  – 128x128b RF
  – 256KB LS
  – Fast Local SRAM
  – Globally coherent DMA (128B/cycle)
  – 128+ concurrent transactions to memory per core
• High bandwidth
  – EIB (96B/cycle)
21
Cell Chip Block Diagram
[Figure: Cell chip block diagram. Legible labels: Synergistic, Memory flow controller]