Current Trends in CMP/CMT Processors
Wei Hsu, 7/26/2006
Trends in Emerging Systems
Industry and the community could previously rely on increased frequency and micro-architecture innovations to steadily improve computer performance each year: superscalar execution, out-of-order issue, on-chip caching, and deep pipelines supported by sophisticated branch predictors.
Trends in Emerging Systems (cont.)
Processor designers have found it increasingly difficult to manage:
– power dissipation
– chip temperature
– current swings
– design complexity
– decreasing transistor reliability in designs
Physics problems, not necessarily innovation problems.
Moore’s Law
Performance Increase of Workstations
Less than 1.5x every 18 months
The Power Challenge
As long as there is sufficient TLP
MultiCore becomes mainstream. Commercial examples:

Company    Chip
IBM        Power4, Power5, PPC970, Cell
Sun        SparcIV, SparcIV+, UltraSparc T1
Intel      PentiumD, Core Duo, Conroe
AMD        Opteron, Athlon X2, Turion X2
MS         Xbox360 – 3 core PPC
Raza       XLR – 8 MIPS cores
Broadcom   Sibyte – multiple MIPS cores
Why MultiCore becomes mainstream
TLP vs. ILP
Physical limitations have caused serious heat dissipation problems. Memory latency continues to limit single-thread performance. Designers now push TLP (Thread-Level Parallelism) rather than ILP or higher clock frequency.
e.g. Sun UltraSparc T1 trades single thread performance for higher throughput to keep its server market. Server workloads are broadly characterized by high TLP, low ILP, and large working set.
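The throughput argument can be sketched in software terms: when each request spends most of its time waiting on a long-latency event, overlapping many threads raises throughput even though no single request gets faster. A minimal illustration in Python, with an invented request handler and a simulated 20 ms stall standing in for memory latency:

```python
import time
from concurrent.futures import ThreadPoolExecutor

STALL = 0.02  # simulated long-latency event per request (seconds)

def handle(req):
    time.sleep(STALL)   # the "memory latency": the thread just waits
    return req * 2      # a tiny bit of actual computation

requests = list(range(8))

t0 = time.perf_counter()
serial = [handle(r) for r in requests]           # one request at a time
serial_time = time.perf_counter() - t0

t0 = time.perf_counter()
with ThreadPoolExecutor(max_workers=8) as pool:  # 8 "hardware threads"
    overlapped = list(pool.map(handle, requests))
overlapped_time = time.perf_counter() - t0

# Same answers, but the overlapped version finishes in roughly one
# stall instead of eight: latency per request is unchanged,
# throughput is much higher.
```

This is exactly the T1 trade: each thread is no faster, but the pipeline (here, wall-clock time) is kept busy.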
Why MultiCore becomes mainstream
CMP with a shared cache can reduce the expensive coherence miss penalty. As L2/L3 caches become larger, coherence misses start to dominate performance for server workloads.
SMP has been successfully used for years, so software is relatively mature for Multicore chips.
New applications tend to have high TLP, e.g. media apps, server apps, games, network processing, etc.
One alternative to Multicore is SOC
Moore’s Law will continue to provide transistors
Transistors will be used for more cores, caches, and new features.
More cores to exploit increasing TLP; caches to address memory latency.
CMT (Chip Multi-Threading)
CMT processors support many simultaneous hardware threads of execution.
– SMT (Simultaneous Multi-Threading)
– CMP (i.e. multi-core)
CMT is about on-chip resource sharing:
– SMT: threads share most resources
– CMP: threads share pins and the bus to memory, and may also share L2/L3 caches
Single Instruction Issue Processors
Reduced FU utilization due to memory latency, data dependency, or branch misprediction.
Superscalar Processors
Superscalar leads to higher performance, but lower FU utilization.
SMT (Simultaneous Multi-Threading) Processors
Maximize FU utilization by issuing operations from two or more threads. Example: Pentium 4 Hyper-Threading.
Vertical Multi-Threading
A D-cache miss occurs, causing stall cycles.
Vertical Multi-Threading
A D-cache miss occurs; on such a long-latency event (e.g. an L2 cache miss), switch to the 2nd thread. Example: Montecito uses event-driven MT.
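The diagrams above can be condensed into a toy model (all numbers invented: a single-issue core, a 10-cycle miss penalty, one miss per four instructions) showing why switching threads on a miss raises utilization:

```python
MISS_PENALTY = 10  # invented stall, standing in for an L2 miss

def cycles(streams):
    """Single-issue core. On a 'miss' the issuing thread blocks for
    MISS_PENALTY cycles and the core switches to any ready thread
    (vertical MT). With one stream there is nothing to switch to."""
    pos = [0] * len(streams)     # next instruction per thread
    ready = [0] * len(streams)   # cycle at which each thread unblocks
    total = sum(len(s) for s in streams)
    cycle = issued = 0
    while issued < total:
        runnable = [t for t in range(len(streams))
                    if pos[t] < len(streams[t]) and ready[t] <= cycle]
        if not runnable:
            cycle += 1           # every thread stalled: the FU sits idle
            continue
        t = runnable[0]
        op = streams[t][pos[t]]
        pos[t] += 1; issued += 1; cycle += 1
        if op == 'miss':
            ready[t] = cycle + MISS_PENALTY
    return cycle

stream = ['op', 'op', 'miss', 'op']          # one miss per 4 instructions
one = cycles([stream])                       # stall fully exposed
two = cycles([list(stream), list(stream)])   # 2nd thread hides the stall
# one thread: 4 instructions in 14 cycles; two threads: 8 in 17 cycles,
# so throughput rises from 4/14 to 8/17 instructions per cycle.
```

The second thread runs during what would otherwise be idle stall cycles, which is the whole point of vertical MT.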
Horizontal MT
Thread switch occurs on every cycle. Example: Sun Niagara (T1) with 4 threads per core.
MT in Niagara T1
Thread switch occurs on every cycle. The processor issues a single operation per cycle.
MT in Niagara T2
Thread switch occurs on every cycle. The processor issues two operations per cycle.
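In the idealized case, Niagara-style horizontal MT is simply round-robin interleaving of the hardware threads' instruction streams across cycles. A toy sketch with invented instruction names (it ignores stalls and models a single issue slot):

```python
def roundrobin_issue(streams):
    """Idealized horizontal MT: each cycle the single issue slot goes
    to the next thread in rotation; finished threads drop out."""
    iters = [iter(s) for s in streams]
    order = []
    while iters:
        still_live = []
        for it in iters:
            try:
                order.append(next(it))   # this thread gets the cycle
                still_live.append(it)
            except StopIteration:
                pass                     # thread finished: drop it
        iters = still_live
    return order

# four hardware threads, as in one Niagara T1 core
threads = [['a0', 'a1'], ['b0', 'b1'], ['c0', 'c1'], ['d0', 'd1']]
issue_order = roundrobin_issue(threads)
# issue_order == ['a0','b0','c0','d0','a1','b1','c1','d1']
```

T2's dual-issue version would hand out two slots per cycle instead of one.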
CMT Evolution
– Stanford Hydra CMP project starts putting 4 MIPS processors on one chip in 1996.
– DEC/Compaq Piranha project proposed to include 8 Alpha cores and an L2 cache on a single chip in 2000.
– SUN's MAJC chip was a dual-core processor with a shared L1 cache, released in 1999.
– IBM Power4 is dual-core (2001), and Power5 is dual-core with each core 2-way SMT.
– SUN's Gemini and Jaguar were dual-core processors (in 2003), Panther (in 2005) adds a shared on-chip L2 cache, and Niagara (T1 in 2006) is a 32-way CMT with 8 cores and 4 threads per core.
– Intel Montecito (Itanium 2 follow-up) will have two cores and two threads per core.
CMT Design Trends: Jaguar (2003) → Panther (2005) → Niagara T1 (2006)
Multi-Core Software Support
Multi-Core demands threaded software. Importance of threading:
– Do nothing: the OS is ready, and background jobs can also benefit.
– Parallelize: unlock the potential (apps, libraries, compiler-generated threads).
Key challenges:
– Scalability
– Correctness
– Ease of programming
Multi-Core Software Challenges
– Scalability: OpenMP (for an SMP/CMP node), MPI (for clusters), or mixed.
– Correctness: various thread checkers, thread profilers, performance analyzers, and memory checker tools to simplify the creation and debugging of scalable thread-safe code.
– Ease of programming: new programming models (e.g. a C++ template-based runtime library to simplify app writing with pre-built and tested algorithms and data structures); the transactional memory concept.
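The scalability point can be made concrete with a tiny data-parallel loop. A minimal sketch in Python, standing in for an OpenMP-style "parallel for" (the `work` function and pool size are invented for illustration):

```python
from concurrent.futures import ThreadPoolExecutor

def work(x):
    return x * x   # iterations are independent: safe to parallelize

# Divide the iteration space among worker threads, as an OpenMP
# "#pragma omp parallel for" would across cores.
with ThreadPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(work, range(8)))
# results == [0, 1, 4, 9, 16, 25, 36, 49], in iteration order
```

The correctness tools listed above matter precisely when iterations are *not* independent and the naive parallel version races.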
CMT Optimization Challenges
Traditional optimization assumes all the resources in a processor can be used:
– Prefetch may take away bus bandwidth from the other core (and the latency may be hidden anyway).
– Code duplication/specialization may take away shared cache space.
– Speculative execution may take away resources from a second thread.
– Parallelization may reduce total throughput.
Resource information is often determined at runtime. New policies and mechanisms are needed to maximize total performance.
CMT Optimization Challenges
– I-cache optimization issues: in single-thread execution, I-cache misses often come from conflicts between procedures. In multi-threaded execution, the conflicts may come from different threads.
– Thread scheduling issues: should two threads be scheduled on two separate cores, or on the same core with SMT? Schedule for performance or for power? (balanced vs. unbalanced scheduling)
Some emerging Issues
– New low-power, high-performance cores: current cores reuse the same design from the previous generation. This cannot last long, since supply-power scaling is not sufficient to meet the requirement. New designs are called for to get low-power, high-performance cores.
– Off-chip bandwidth: how to keep up with the need for off-chip bandwidth (which doubles every generation)? We cannot rely on an increase in pins (about 10% per generation); we must increase the bandwidth per pin.
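The pin arithmetic can be checked directly; a back-of-the-envelope calculation using only the growth rates stated above:

```python
# Total off-chip bandwidth must double each generation, while pin
# count grows only ~10%, so the required bandwidth *per pin* grows
# by 2 / 1.10 ≈ 1.8x per generation.
per_pin_growth = 2 / 1.10
after_4_gens = per_pin_growth ** 4
# after four generations each pin must carry roughly 10.9x its
# current data rate
```

That compounding is why per-pin signaling (faster SerDes, etc.) rather than pin count is the lever.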
Some emerging Issues (cont.)
– Homogeneous or heterogeneous cores: for workloads with sufficient TLP, multiple simple cores can deliver superior performance. However, how do we deliver robust performance for single-thread jobs? A complex core + many simple cores?
– Shared hardware accelerators:
• network offload engines
• cryptographic engines
• XML parsing or processing?
• FFT accelerator
New Research Opportunities with CMP/CMT
Speculative threads: use thread-level control speculation and runtime data-dependence checks to speed up single-program execution. Recent studies have shown ~20% speedup potential from loop-level thread speculation on sequential code.
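A toy software model of the idea (the function names and loop body are invented; a real TLS system would run iterations on separate cores with hardware support): iterations execute against a snapshot of memory, and a runtime dependence check squashes and re-executes any iteration that read stale data before committing in order.

```python
def speculative_loop(a, body):
    """body(i, view) -> (reads, value): iteration i reads the listed
    indices of `view` and produces a new value for a[i]."""
    snapshot = list(a)
    # speculation phase: every iteration runs against the snapshot
    speculated = [body(i, snapshot) for i in range(len(a))]
    for i, (reads, value) in enumerate(speculated):
        # dependence check: did an earlier, already-committed iteration
        # overwrite something this iteration read from the snapshot?
        if any(a[j] != snapshot[j] for j in reads):
            reads, value = body(i, a)   # squash and re-execute
        a[i] = value                    # in-order commit
    return a

def loop_body(i, view):
    # invented example with a true cross-iteration dependence:
    # a[i] = a[i-1] + 1
    if i == 0:
        return [], view[0]
    return [i - 1], view[i - 1] + 1

result = speculative_loop([1, 0, 0, 0], loop_body)
# result == [1, 2, 3, 4], matching sequential execution even though
# later iterations mis-speculated and were re-executed
```

The ~20% figure cited above depends on how often such squashes occur relative to successful speculation.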
Helper threads: use otherwise idle cores to run dynamic-optimization threads, performance-monitoring (or profiling) threads, or scout threads.
New Research Opportunities with CMP/CMT
Monitoring threads: monitoring threads can run on other cores to enforce the correct execution of the main thread. The main thread turns itself into a speculative thread until the monitoring thread verifies that the execution meets the requirements. If verification fails, the speculative execution aborts.
New Research Opportunities (Transient Fault Detection/Tolerance)
– Software redundant multi-threading: use software-controlled redundancy to detect and tolerate transient faults. Optimizations are critical to minimize communication and synchronization. Redundant threads run on multiple cores – this is different from SMT, where one error may corrupt both threads.
– Process-level redundancy: check only on system calls, to intercept faults that propagate to the output.
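A minimal software sketch of the redundancy idea (the `redundant` helper is invented; a real system would compare at a finer grain and pin the two copies to different cores):

```python
import threading

def redundant(f, *args):
    """Toy software redundant multi-threading: run f twice on separate
    threads and compare the outputs. A mismatch would indicate a
    transient fault; here the two copies always agree."""
    results = [None, None]
    def run(slot):
        results[slot] = f(*args)
    threads = [threading.Thread(target=run, args=(s,)) for s in (0, 1)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    if results[0] != results[1]:
        raise RuntimeError("transient fault detected")  # recover/retry
    return results[0]

# redundant(sum, [1, 2, 3]) == 6
```

Process-level redundancy moves this compare point to system-call boundaries, so only faults that would reach the output are checked.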
New Research Opportunities
For software debugging: run a different path on the other core to increase path coverage.
Future of CMP/CMT
Some companies already have 128/256-core CMPs on their roadmap. Not sure what will happen; the future is hard to predict. High-end servers may be addressed by large-scale CMP, but the desktop and embedded markets may not be (perhaps small or medium scale would be sufficient).
Today's architectures are more likely to be driven by the software market than by hardware vendors. Itanium is one example: even with Intel+HP, it has not been very successful. A successful product sells by itself.