Current Trends in CMP/CMT Processors
Wei Hsu, 7/26/2006
Trends in Emerging Systems
Industry and the community could previously rely on increased frequency and micro-architecture innovations to steadily improve computer performance each year: superscalar execution, out-of-order issue, on-chip caching, and deep pipelines supported by sophisticated branch predictors.
Trends in Emerging Systems (cont.)
Processor designers have found it increasingly difficult to manage:
– power dissipation
– chip temperature
– current swings
– design complexity
– decreasing transistor reliability in designs
Physics problems, not necessarily innovation problems.
Moore’s Law
Performance Increase of Workstations
Less than 1.5x every 18 months
The Power Challenge
As long as there is sufficient TLP
MultiCore becomes mainstream. Commercial examples:

Company    Chip
IBM        Power4, Power5, PPC970, Cell
Sun        SparcIV, SparcIV+, UltraSparc T1
Intel      PentiumD, Core Duo, Conroe
AMD        Opteron, Athlon X2, Turion X2
MS         Xbox360 – 3 core PPC
Raza       XLR – 8 MIPS cores
Broadcom   Sibyte – multiple MIPS cores
Why MultiCore becomes mainstream
TLP vs. ILP
Physical limitations have caused serious heat dissipation problems. Memory latency continues to limit single-thread performance. Designers now push TLP (Thread-Level Parallelism) rather than ILP or higher clock frequency.
e.g. Sun UltraSparc T1 trades single thread performance for higher throughput to keep its server market. Server workloads are broadly characterized by high TLP, low ILP, and large working set.
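The throughput argument can be sketched in software terms: when each request spends most of its time waiting on a long-latency event, overlapping many threads raises throughput even though no single request gets faster. A minimal illustration in Python, with an invented request handler and a simulated 20 ms stall standing in for memory latency:

```python
import time
from concurrent.futures import ThreadPoolExecutor

STALL = 0.02  # simulated long-latency event per request (seconds)

def handle(req):
    time.sleep(STALL)   # the "memory latency": the thread just waits
    return req * 2      # a tiny bit of actual computation

requests = list(range(8))

t0 = time.perf_counter()
serial = [handle(r) for r in requests]           # one request at a time
serial_time = time.perf_counter() - t0

t0 = time.perf_counter()
with ThreadPoolExecutor(max_workers=8) as pool:  # 8 "hardware threads"
    overlapped = list(pool.map(handle, requests))
overlapped_time = time.perf_counter() - t0

# Same answers, but the overlapped version finishes in roughly one
# stall instead of eight: latency per request is unchanged,
# throughput is much higher.
```

This is exactly the T1 trade: each thread is no faster, but the pipeline (here, wall-clock time) is kept busy.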
Why MultiCore becomes mainstream
CMP with a shared cache can reduce the expensive coherence miss penalty. As L2/L3 caches become larger, coherence misses start to dominate performance for server workloads.
SMP has been successfully used for years, so software is relatively mature for Multicore chips.
New applications tend to have high TLP, e.g. media apps, server apps, games, network processing, etc.
One alternative to Multicore is SOC
Moore’s Law will continue to provide transistors
Transistors will be used for more cores, caches, and new features.
More cores to exploit increasing TLP; caches to address memory latency.
CMT (Chip Multi-Threading)
CMT processors support many simultaneous hardware threads of execution.
– SMT (Simultaneous Multi-Threading)
– CMP (i.e. multi-core)
CMT is about on-chip resource sharing:
– SMT: threads share most resources
– CMP: threads share pins and the bus to memory, and may also share L2/L3 caches
Single Instruction Issue Processors
Reduced FU utilization due to memory latency, data dependency, or branch misprediction.
Superscalar Processors
Superscalar leads to higher performance, but lower FU utilization.
SMT (Simultaneous Multi-Threading) Processors
Maximize FU utilization by issuing operations from two or more threads. Example: Pentium 4 Hyper-Threading.
Vertical Multi-Threading
A D-cache miss occurs, causing stall cycles.
Vertical Multi-Threading
A D-cache miss occurs; on such a long-latency event (e.g. an L2 cache miss), switch to the 2nd thread. Example: Montecito uses event-driven MT.
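The diagrams above can be condensed into a toy model (all numbers invented: a single-issue core, a 10-cycle miss penalty, one miss per four instructions) showing why switching threads on a miss raises utilization:

```python
MISS_PENALTY = 10  # invented stall, standing in for an L2 miss

def cycles(streams):
    """Single-issue core. On a 'miss' the issuing thread blocks for
    MISS_PENALTY cycles and the core switches to any ready thread
    (vertical MT). With one stream there is nothing to switch to."""
    pos = [0] * len(streams)     # next instruction per thread
    ready = [0] * len(streams)   # cycle at which each thread unblocks
    total = sum(len(s) for s in streams)
    cycle = issued = 0
    while issued < total:
        runnable = [t for t in range(len(streams))
                    if pos[t] < len(streams[t]) and ready[t] <= cycle]
        if not runnable:
            cycle += 1           # every thread stalled: the FU sits idle
            continue
        t = runnable[0]
        op = streams[t][pos[t]]
        pos[t] += 1; issued += 1; cycle += 1
        if op == 'miss':
            ready[t] = cycle + MISS_PENALTY
    return cycle

stream = ['op', 'op', 'miss', 'op']          # one miss per 4 instructions
one = cycles([stream])                       # stall fully exposed
two = cycles([list(stream), list(stream)])   # 2nd thread hides the stall
# one thread: 4 instructions in 14 cycles; two threads: 8 in 17 cycles,
# so throughput rises from 4/14 to 8/17 instructions per cycle.
```

The second thread runs during what would otherwise be idle stall cycles, which is the whole point of vertical MT.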
Horizontal MT
Thread switch occurs on every cycle. Example: Sun Niagara (T1) with 4 threads per core.
MT in Niagara T1
Thread switch occurs on every cycle. The processor issues a single operation per cycle.
MT in Niagara T2
Thread switch occurs on every cycle. The processor issues two operations per cycle.
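In the idealized case, Niagara-style horizontal MT is simply round-robin interleaving of the hardware threads' instruction streams across cycles. A toy sketch with invented instruction names (it ignores stalls and models a single issue slot):

```python
def roundrobin_issue(streams):
    """Idealized horizontal MT: each cycle the single issue slot goes
    to the next thread in rotation; finished threads drop out."""
    iters = [iter(s) for s in streams]
    order = []
    while iters:
        still_live = []
        for it in iters:
            try:
                order.append(next(it))   # this thread gets the cycle
                still_live.append(it)
            except StopIteration:
                pass                     # thread finished: drop it
        iters = still_live
    return order

# four hardware threads, as in one Niagara T1 core
threads = [['a0', 'a1'], ['b0', 'b1'], ['c0', 'c1'], ['d0', 'd1']]
issue_order = roundrobin_issue(threads)
# issue_order == ['a0','b0','c0','d0','a1','b1','c1','d1']
```

T2's dual-issue version would hand out two slots per cycle instead of one.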
CMT Evolution
– Stanford Hydra CMP project starts putting 4 MIPS processors on one chip in 1996.
– DEC/Compaq Piranha project proposed to include 8 Alpha cores and an L2 cache on a single chip in 2000.
– SUN's MAJC chip was a dual-core processor with a shared L1 cache, released in 1999.
– IBM Power4 is dual-core (2001), and Power5 is dual-core with each core 2-way SMT.
– SUN's Gemini and Jaguar were dual-core processors (in 2003), Panther (in 2005) adds a shared on-chip L2 cache, and Niagara (T1 in 2006) is a 32-way CMT with 8 cores and 4 threads per core.
– Intel Montecito (Itanium 2 follow-up) will have two cores and two threads per core.
CMT Design Trends: Jaguar (2003) → Panther (2005) → Niagara T1 (2006)
Multi-Core Software Support
Multi-Core demands threaded software. Importance of threading:
– Do nothing: the OS is ready, and background jobs can also benefit.
– Parallelize: unlock the potential (apps, libraries, compiler-generated threads).
Key challenges:
– Scalability
– Correctness
– Ease of programming
Multi-Core Software Challenges
– Scalability: OpenMP (for an SMP/CMP node), MPI (for clusters), or mixed.
– Correctness: various thread checkers, thread profilers, performance analyzers, and memory checker tools to simplify the creation and debugging of scalable thread-safe code.
– Ease of programming: new programming models (e.g. a C++ template-based runtime library to simplify app writing with pre-built and tested algorithms and data structures); the transactional memory concept.
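The scalability point can be made concrete with a tiny data-parallel loop. A minimal sketch in Python, standing in for an OpenMP-style "parallel for" (the `work` function and pool size are invented for illustration):

```python
from concurrent.futures import ThreadPoolExecutor

def work(x):
    return x * x   # iterations are independent: safe to parallelize

# Divide the iteration space among worker threads, as an OpenMP
# "#pragma omp parallel for" would across cores.
with ThreadPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(work, range(8)))
# results == [0, 1, 4, 9, 16, 25, 36, 49], in iteration order
```

The correctness tools listed above matter precisely when iterations are *not* independent and the naive parallel version races.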
CMT Optimization Challenges
Traditional optimization assumes all the resources in a processor can be used:
– Prefetch may take away bus bandwidth from the other core (and the latency may be hidden anyway).
– Code duplication/specialization may take away shared cache space.
– Speculative execution may take away resources from a second thread.
– Parallelization may reduce total throughput.
Resource information is often determined at runtime. New policies and mechanisms are needed to maximize total performance.
CMT Optimization Challenges
– I-cache optimization issues: in single-thread execution, I-cache misses often come from conflicts between procedures. In multi-threaded execution, the conflicts may come from different threads.
– Thread scheduling issues: should two threads be scheduled on two separate cores, or on the same core with SMT? Schedule for performance or for power? (balanced vs. unbalanced scheduling)
Some emerging Issues
– New low-power, high-performance cores: current cores reuse the same design from the previous generation. This cannot last long, since supply-power scaling is not sufficient to meet the requirement. New designs are called for to get low-power, high-performance cores.
– Off-chip bandwidth: how to keep up with the need for off-chip bandwidth (which doubles every generation)? We cannot rely on an increase in pins (about 10% per generation); we must increase the bandwidth per pin.
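The pin arithmetic can be checked directly; a back-of-the-envelope calculation using only the growth rates stated above:

```python
# Total off-chip bandwidth must double each generation, while pin
# count grows only ~10%, so the required bandwidth *per pin* grows
# by 2 / 1.10 ≈ 1.8x per generation.
per_pin_growth = 2 / 1.10
after_4_gens = per_pin_growth ** 4
# after four generations each pin must carry roughly 10.9x its
# current data rate
```

That compounding is why per-pin signaling (faster SerDes, etc.) rather than pin count is the lever.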
Some emerging Issues (cont.)
– Homogeneous or heterogeneous cores: for workloads with sufficient TLP, multiple simple cores can deliver superior performance. However, how do we deliver robust performance for single-thread jobs? A complex core + many simple cores?
– Shared hardware accelerators:
• network offload engines
• cryptographic engines
• XML parsing or processing?
• FFT accelerator
New Research Opportunities with CMP/CMT
Speculative threads: use thread-level control speculation and runtime data-dependence checks to speed up single-program execution. Recent studies have shown ~20% speedup potential from loop-level thread speculation on sequential code.
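A toy software model of the idea (the function names and loop body are invented; a real TLS system would run iterations on separate cores with hardware support): iterations execute against a snapshot of memory, and a runtime dependence check squashes and re-executes any iteration that read stale data before committing in order.

```python
def speculative_loop(a, body):
    """body(i, view) -> (reads, value): iteration i reads the listed
    indices of `view` and produces a new value for a[i]."""
    snapshot = list(a)
    # speculation phase: every iteration runs against the snapshot
    speculated = [body(i, snapshot) for i in range(len(a))]
    for i, (reads, value) in enumerate(speculated):
        # dependence check: did an earlier, already-committed iteration
        # overwrite something this iteration read from the snapshot?
        if any(a[j] != snapshot[j] for j in reads):
            reads, value = body(i, a)   # squash and re-execute
        a[i] = value                    # in-order commit
    return a

def loop_body(i, view):
    # invented example with a true cross-iteration dependence:
    # a[i] = a[i-1] + 1
    if i == 0:
        return [], view[0]
    return [i - 1], view[i - 1] + 1

result = speculative_loop([1, 0, 0, 0], loop_body)
# result == [1, 2, 3, 4], matching sequential execution even though
# later iterations mis-speculated and were re-executed
```

The ~20% figure cited above depends on how often such squashes occur relative to successful speculation.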
Helper threads: use otherwise idle cores to run dynamic-optimization threads, performance-monitoring (or profiling) threads, or scout threads.
New Research Opportunities with CMP/CMT
Monitoring threads: monitoring threads can run on other cores to enforce the correct execution of the main thread. The main thread turns itself into a speculative thread until the monitoring thread verifies that the execution meets the requirements. If verification fails, the speculative execution aborts.
New Research Opportunities (Transient Fault Detection/Tolerance)
– Software redundant multi-threading: use software-controlled redundancy to detect and tolerate transient faults. Optimizations are critical to minimize communication and synchronization. Redundant threads run on multiple cores – this is different from SMT, where one error may corrupt both threads.
– Process-level redundancy: check only on system calls, to intercept faults that propagate to the output.
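A minimal software sketch of the redundancy idea (the `redundant` helper is invented; a real system would compare at a finer grain and pin the two copies to different cores):

```python
import threading

def redundant(f, *args):
    """Toy software redundant multi-threading: run f twice on separate
    threads and compare the outputs. A mismatch would indicate a
    transient fault; here the two copies always agree."""
    results = [None, None]
    def run(slot):
        results[slot] = f(*args)
    threads = [threading.Thread(target=run, args=(s,)) for s in (0, 1)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    if results[0] != results[1]:
        raise RuntimeError("transient fault detected")  # recover/retry
    return results[0]

# redundant(sum, [1, 2, 3]) == 6
```

Process-level redundancy moves this compare point to system-call boundaries, so only faults that would reach the output are checked.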
New Research Opportunities
For software debugging: run a different path on the other core to increase path coverage.
Future of CMP/CMT
Some companies already have 128/256-core CMPs on their roadmap. Not sure what will happen; the future is hard to predict. High-end servers may be addressed by large-scale CMP, but the desktop and embedded markets may not be (perhaps small or medium scale would be sufficient).
Today's architectures are more likely to be driven by the software market than by hardware vendors. Itanium is one example: even with Intel+HP, it has not been very successful. A successful product sells by itself.