SMT and CMP Architectures


Transcript of SMT and CMP Architectures

  • SMT and CMP Architectures: Introduction

  • Contemporary forms of parallelism

    Instruction-level parallelism (ILP)
    - Wide-issue superscalar processors (SS): issue 4 or more instructions per cycle.
    - Execute a single program or thread.
    - Attempt to find multiple instructions to issue each cycle.
    - Out-of-order execution: instructions are sent to execution units based on instruction dependencies rather than program order.

    Thread-level parallelism (TLP)
    - Fine-grained multithreaded superscalars (FGMS): contain hardware state for several threads and execute multiple threads, but on any given cycle the processor executes instructions from only one of the threads.
    - Multiprocessors (MP): performance is improved by adding more CPUs.
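    The difference between issuing in program order and issuing by dependencies can be sketched with a toy scheduler. This is a minimal illustration and not a model of any real processor: it assumes every instruction takes one cycle, registers are fully renamed, each register is written at most once, and producers appear before their consumers. The names `schedule` and `program` are hypothetical.

```python
from typing import List, Tuple

# Each instruction is (destination register, source registers).
Instr = Tuple[str, Tuple[str, ...]]

def schedule(program: List[Instr], width: int, in_order: bool) -> List[List[int]]:
    """Return, for each cycle, the indices of the instructions issued in it.

    Assumptions (illustrative only): one-cycle latency, only true
    dependencies matter, and producers precede consumers in the program.
    """
    produced = {dst for dst, _ in program}  # registers written by the program
    ready = set()                           # results available so far
    pending = list(range(len(program)))
    cycles = []
    while pending:
        issued = []
        for i in pending:
            if len(issued) == width:
                break
            _, srcs = program[i]
            # A source is usable if it is a live-in value (never written by
            # the program) or its producer issued in an earlier cycle.
            if all(s not in produced or s in ready for s in srcs):
                issued.append(i)
            elif in_order:
                break  # in-order issue stalls at the first blocked instruction
        for i in issued:
            pending.remove(i)
        ready.update(program[i][0] for i in issued)
        cycles.append(issued)
    return cycles

# Two independent two-instruction chains.
program = [
    ("r1", ("a",)),   # 0: r1 <- f(a)
    ("r2", ("r1",)),  # 1: r2 <- f(r1), depends on 0
    ("r3", ("b",)),   # 2: r3 <- f(b)
    ("r4", ("r3",)),  # 3: r4 <- f(r3), depends on 2
]
in_order_cycles = schedule(program, width=2, in_order=True)
out_of_order_cycles = schedule(program, width=2, in_order=False)
```

    With an issue width of 2, the in-order schedule needs three cycles because instruction 1 blocks instruction 2 in the first cycle, while the dependency-based schedule pairs the two independent chains and finishes in two.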

  • Simultaneous Multithreading

    Key idea: issue multiple instructions from multiple threads each cycle.

    Features
    - Fully exploits thread-level parallelism and instruction-level parallelism.
    - Multiple functional units: modern processors have more functional units available than a single thread can utilize.
    - Register renaming and dynamic scheduling: multiple instructions from independent threads can co-exist and co-execute.

  • Summary: Multithreaded Categories

    [Figure: issue slots over time (processor cycles) for four organizations: superscalar, fine-grained, coarse-grained, and simultaneous multithreading. Legend: Thread 1 to Thread 5, idle slot.]

  • The horizontal dimension represents the instruction issue capability in each clock cycle. The vertical dimension represents a sequence of clock cycles. An empty slot indicates that the corresponding issue slot is unused in that clock cycle.

  • Superscalar processor with no multithreading: only one thread is processed in any clock cycle.
    - Use of issue slots is limited by a lack of ILP.
    - A stall such as an instruction cache miss leaves the entire processor idle.

    Fine-grained multithreading: switches threads on every clock cycle.
    - Pro: hides the latency of both short and long stalls.
    - Con: slows down the execution of individual threads that are ready to go, since only one thread issues instructions in a given clock cycle.

    Coarse-grained multithreading: switches threads only on costly stalls (e.g., L2 stalls).
    - Pros: no switching each clock cycle, so no slowdown for ready-to-go threads; reduces the number of completely idle clock cycles.
    - Con: limited ability to hide shorter stalls.

  • Simultaneous multithreading: exploits TLP at the same time as it exploits ILP, with multiple threads using the issue slots in a single clock cycle. In practice, the use of issue slots is limited by the following factors:
    - Imbalances in the resource needs of the threads.
    - Resource availability over multiple threads.
    - The number of active threads considered.
    - Finite buffer sizes.
    - The ability to fetch enough instructions from multiple threads.
    - Practical limitations on which instruction combinations can issue from one thread and from multiple threads.
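    The four categories can be compared with a toy issue-slot model. This is a sketch under strong assumptions and not a processor simulator: `ready[t][c]` is a made-up count of instructions thread t could issue in cycle c (0 means stalled), it does not depend on what was issued earlier, and the coarse-grained policy is assumed to lose one cycle on each switch. All names and numbers are hypothetical.

```python
from typing import List

def used_slots(ready: List[List[int]], width: int, policy: str) -> List[int]:
    """Return the number of issue slots filled in each cycle under a policy."""
    threads, cycles = len(ready), len(ready[0])
    used = []
    current = 0  # thread currently running (coarse-grained only)
    for c in range(cycles):
        if policy == "superscalar":
            # Only thread 0 exists; its stalls leave the whole cycle idle.
            n = min(width, ready[0][c])
        elif policy == "fine":
            # Round-robin: a different thread owns every cycle.
            n = min(width, ready[c % threads][c])
        elif policy == "coarse":
            if ready[current][c] == 0:
                # Switch on a stall; this cycle is assumed lost to the switch.
                current = (current + 1) % threads
                n = 0
            else:
                n = min(width, ready[current][c])
        elif policy == "smt":
            # Slots in one cycle can be filled from all threads together.
            n = min(width, sum(ready[t][c] for t in range(threads)))
        else:
            raise ValueError(f"unknown policy: {policy}")
        used.append(n)
    return used

# Hypothetical workload: two threads, four cycles, issue width 4.
ready = [
    [2, 0, 0, 1],  # thread 0 stalls in cycles 1 and 2
    [1, 2, 2, 3],  # thread 1 never stalls
]
totals = {p: sum(used_slots(ready, 4, p))
          for p in ("superscalar", "fine", "coarse", "smt")}
```

    For this made-up workload the single-threaded superscalar fills 3 of 16 slots, fine-grained and coarse-grained multithreading each fill 7, and SMT fills 11, because only SMT combines instructions from both threads within one cycle.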

  • Performance Implications of SMT
    - Single-thread performance is likely to go down, because caches, branch predictors, registers, etc. are shared. This effect can be mitigated by prioritizing one thread.
    - While fetching instructions, thread priority can dramatically influence total throughput. A widely accepted heuristic (ICOUNT): fetch such that each thread has an equal share of processor resources.
    - With eight threads in a processor with many resources, SMT yields throughput improvements by a factor of roughly 2-4.
    - The Alpha 21464 and Intel Pentium 4 are examples of SMT processors.
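    One common way to realize the ICOUNT idea is to give fetch priority to the threads that currently hold the fewest instructions in the front end of the pipeline, so that no thread crowds the others out of shared queues. The sketch below assumes that reading; the function name `icount_pick`, the per-thread counts, and the tie-break by thread id are hypothetical choices for illustration.

```python
from typing import Dict, List

def icount_pick(in_flight: Dict[int, int], n: int = 1) -> List[int]:
    """Choose the n threads to fetch from in this cycle.

    in_flight maps a thread id to the number of its instructions that are
    waiting in the front end (assumed here to mean decode, rename, and the
    instruction queues). Threads with the fewest waiting instructions get
    priority; ties are broken by the lower thread id (an arbitrary choice).
    """
    return sorted(in_flight, key=lambda t: (in_flight[t], t))[:n]

# Hypothetical snapshot: thread 0 is clogging the queues, threads 1 and 3 are not.
snapshot = {0: 12, 1: 3, 2: 7, 3: 3}
chosen = icount_pick(snapshot, n=2)
```

    Here threads 1 and 3 are chosen, and thread 0 gets no fetch bandwidth until its backlog drains, which pushes the threads toward an equal share of the shared resources.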

  • Effectively Using Parallelism on an SMT Processor

    Instruction throughput executing a parallel workload

    Threads   SS    MP2   MP4   FGMT  SMT
    1         3.3   2.4   1.5   3.3   3.3
    2         --    4.3   2.6   4.1   4.7
    4         --    --    4.2   4.2   5.6
    8         --    --    --    3.5   6.1
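    The throughput figures above are easier to compare as ratios. The sketch below copies the values from the table and normalizes them to the single-thread superscalar (SS) figure; the unit is read here as instructions per cycle, which is an assumption because the slide does not state it. The names `THROUGHPUT`, `relative_to_superscalar`, and `best_architecture` are hypothetical.

```python
# Values copied from the table above; None marks a "--" entry.
# Unit assumed to be instructions per cycle (not stated on the slide).
THROUGHPUT = {
    1: {"SS": 3.3, "MP2": 2.4, "MP4": 1.5, "FGMT": 3.3, "SMT": 3.3},
    2: {"SS": None, "MP2": 4.3, "MP4": 2.6, "FGMT": 4.1, "SMT": 4.7},
    4: {"SS": None, "MP2": None, "MP4": 4.2, "FGMT": 4.2, "SMT": 5.6},
    8: {"SS": None, "MP2": None, "MP4": None, "FGMT": 3.5, "SMT": 6.1},
}

def relative_to_superscalar(arch: str) -> dict:
    """Throughput of arch at each thread count, divided by single-thread SS."""
    base = THROUGHPUT[1]["SS"]
    return {n: row[arch] / base
            for n, row in THROUGHPUT.items() if row[arch] is not None}

def best_architecture(threads: int) -> str:
    """Architecture with the highest reported throughput at a thread count."""
    row = {a: v for a, v in THROUGHPUT[threads].items() if v is not None}
    return max(row, key=row.get)

smt_ratio = relative_to_superscalar("SMT")
fgmt_ratio = relative_to_superscalar("FGMT")
```

    On these numbers SMT with eight threads reaches about 1.85 times the single-thread superscalar throughput (6.1 / 3.3), while FGMT peaks at four threads and falls back at eight.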

  • Comparison of SMT vs. Superscalar

    SMT processors are compared to base superscalar processors in several key measures:
    - Utilization of functional units.
    - Utilization of fetch units.
    - Accuracy of the branch predictor.
    - Hit rates of primary caches.
    - Hit rates of secondary caches.

    Performance improvement:
    - Issue slots.
    - Functional units.
    - Renaming registers.

  • CMP Architecture

    Chip-level multiprocessing (CMP, or multicore) integrates two or more independent cores (normally CPUs) into a single package, composed either of a single integrated circuit (IC), called a die, or of several dies packaged together, with each core executing threads independently.
    - Every functional unit of a processor is duplicated.
    - Multiple processors, each with a full set of architectural resources, reside on the same die.
    - Processors may share an on-chip cache, or each can have its own cache.
    - Examples: HP Mako, IBM Power4.
    - Challenges: power, die area (cost).

  • Single-core computer [Figure: a computer built around a single-core CPU chip.]

  • Multi-core CPU chip [Figure: one chip containing Core 1, Core 2, Core 3, and Core 4.]

  • Chip Multithreading

    Chip Multithreading (CMT) = Chip Multiprocessing (CMP) + Hardware Multithreading.

    Chip multithreading is the capability of a processor to process multiple software threads on simultaneous hardware threads of execution.

    CMT is achieved by multiple cores on a single chip, by multiple threads on a single core, or by both.

    CMT processors are especially suited to server workloads, which generally have high levels of thread-level parallelism (TLP).

  • CMP Performance

    CMPs are now the only way to build high-performance microprocessors, for a variety of reasons:
    - Large uniprocessors are no longer scaling in performance, because only a limited amount of parallelism can be extracted from a typical instruction stream.
    - The clock speed of today's processors cannot simply be ratcheted up, or power dissipation will become prohibitive.
    - CMT processors support many hardware strands through efficient sharing of on-chip resources such as pipelines, caches, and predictors.
    - CMT processors are a good match for server workloads, which have high levels of TLP and relatively low levels of ILP.

  • SMT and CMP
    - The performance race between SMT and CMP is not yet decided.
    - CMP is easier to implement, but only SMT has the ability to hide latencies.
    - Functional partitioning is not fully achieved within an SMT processor because of the centralized instruction issue.
    - Separating the thread queues is a possible solution, although it does not remove the central instruction issue.
    - A combination of simultaneous multithreading with CMP may be superior.
    - Research: combine an SMT or CMP organization with the ability to create threads out of a single thread, either with compiler support or fully dynamically.
      - Thread-level speculation
      - Close to multiscalar

  • Multiprocessor vs. SMT [Figure: side-by-side comparison of a two-processor multiprocessor (MP2) and an SMT processor.]
