Topic 4 Processor Performance
![Page 1: Topic 4 Processor Performance](https://reader036.fdocuments.in/reader036/viewer/2022062322/56814400550346895db092de/html5/thumbnails/1.jpg)
Topic 4 Processor Performance
AH Computing
![Page 2: Topic 4 Processor Performance](https://reader036.fdocuments.in/reader036/viewer/2022062322/56814400550346895db092de/html5/thumbnails/2.jpg)
Introduction
6502: 8-bit processor, 16-bit address bus
Intel 8086/88 (1979), used in the IBM PC: 16-bit data bus (8-bit external on the 8088), 20-bit address bus
Motorola 68000: 16-bit data bus and 24-bit address bus
PowerPC (1992): incorporated pipelining and superscalar processing
![Page 3: Topic 4 Processor Performance](https://reader036.fdocuments.in/reader036/viewer/2022062322/56814400550346895db092de/html5/thumbnails/3.jpg)
8086
![Page 4: Topic 4 Processor Performance](https://reader036.fdocuments.in/reader036/viewer/2022062322/56814400550346895db092de/html5/thumbnails/4.jpg)
Introduction
Technological developments
RISC processors
SIMD
Pipelining
Superscalar processing
![Page 5: Topic 4 Processor Performance](https://reader036.fdocuments.in/reader036/viewer/2022062322/56814400550346895db092de/html5/thumbnails/5.jpg)
CISC and RISC
![Page 6: Topic 4 Processor Performance](https://reader036.fdocuments.in/reader036/viewer/2022062322/56814400550346895db092de/html5/thumbnails/6.jpg)
CISC- Complex Instruction Set Computer
Memory in those days was expensive: bigger program -> more storage -> more money
Hence there was a need to reduce the number of instructions per program
The number of instructions is reduced by having multiple operations within a single instruction
Multiple operations lead to many different kinds of instructions that access memory
This in turn makes instruction length variable and fetch-decode-execute time unpredictable, making the processor more complex
Thus hardware handles the complexity
Example: x86 ISA
![Page 7: Topic 4 Processor Performance](https://reader036.fdocuments.in/reader036/viewer/2022062322/56814400550346895db092de/html5/thumbnails/7.jpg)
CISC
CISC Language Development
Increase the size of instruction sets (by providing more operations)
Design ever more complex instructions
Provide more addressing modes
Implement some HLL constructs in machine instruction sets
![Page 8: Topic 4 Processor Performance](https://reader036.fdocuments.in/reader036/viewer/2022062322/56814400550346895db092de/html5/thumbnails/8.jpg)
CISC
Intel 8086, 80286, 80386, 80486, Pentium
The logic for each instruction has to be hard-wired into the control unit
As new instructions were developed, they were added to the original instruction set
Difficult and expensive to design and build
One way of solving this problem is to use microprogramming
![Page 9: Topic 4 Processor Performance](https://reader036.fdocuments.in/reader036/viewer/2022062322/56814400550346895db092de/html5/thumbnails/9.jpg)
CISC
Microprogramming – complex instructions are split into a series of simpler instructions
When a complex instruction is executed, the CPU executes a small microprogram stored in a control memory
This simplifies design of processor and allows the addition of new complex instructions
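The idea above can be sketched in a few lines. This is only an illustrative model, not any real processor's microcode: the control memory is a table mapping each complex instruction to a list of simpler micro-operations. The instruction names (`ADD_MEM`, `LOAD`), the micro-operation names and the tiny machine state are all hypothetical.

```python
# Hypothetical control memory: each complex instruction maps to a microprogram.
CONTROL_MEMORY = {
    # ADD_MEM: add the value at a memory address to the accumulator
    "ADD_MEM": ["load_operand", "alu_add", "store_result"],
    # LOAD: copy the value at a memory address into the accumulator
    "LOAD": ["load_operand", "store_result"],
}

def run_instruction(opcode, address, state):
    """Execute one complex instruction by stepping through its microprogram."""
    temp = 0
    for micro_op in CONTROL_MEMORY[opcode]:
        if micro_op == "load_operand":
            temp = state["memory"][address]
        elif micro_op == "alu_add":
            temp = state["acc"] + temp
        elif micro_op == "store_result":
            state["acc"] = temp
    return state

state = {"acc": 5, "memory": {100: 7}}
run_instruction("ADD_MEM", 100, state)
print(state["acc"])  # 12
```

In this model, adding a new complex instruction means adding a new entry to the table rather than redesigning the control logic.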
![Page 10: Topic 4 Processor Performance](https://reader036.fdocuments.in/reader036/viewer/2022062322/56814400550346895db092de/html5/thumbnails/10.jpg)
RISC
Attempt to make architecture simpler
Reduced number of instructions
Make them all the same format if possible
Reduce the number of memory accesses required by increasing the number of registers
Reduce the number of addressing modes
Allow pipelining of instructions
![Page 11: Topic 4 Processor Performance](https://reader036.fdocuments.in/reader036/viewer/2022062322/56814400550346895db092de/html5/thumbnails/11.jpg)
RISC
The characteristics of most RISC processors are…
A large number of GP registers
A small number of simple instructions that mostly have the same format
A minimal number of addressing modes
Optimisation of instruction pipeline
![Page 12: Topic 4 Processor Performance](https://reader036.fdocuments.in/reader036/viewer/2022062322/56814400550346895db092de/html5/thumbnails/12.jpg)
RISC
| | CISC processor | RISC processor |
|---|---|---|
| | Intel 80486 | Sun SPARC |
| Year developed | 1989 | 1987 |
| No. instructions | 235 | 69 |
| Instruction size (bytes) | 1-11 | 4 |
| Addressing modes | 11 | 1 |
| GP registers | 8 | 40-520 |
![Page 13: Topic 4 Processor Performance](https://reader036.fdocuments.in/reader036/viewer/2022062322/56814400550346895db092de/html5/thumbnails/13.jpg)
RISC in the Home
Your home is likely to have many devices with RISC-based processors.
Devices using RISC-based processors include the Nintendo Wii, Microsoft Xbox 360, Sony PlayStation 3, Nintendo DS and many televisions and phones.
However, x86 processors, those found in nearly all of the world's personal computers, are CISC. This is a limitation born of necessity; adopting a new instruction set for PC processors would mean that all the software used in PCs would no longer function.
![Page 14: Topic 4 Processor Performance](https://reader036.fdocuments.in/reader036/viewer/2022062322/56814400550346895db092de/html5/thumbnails/14.jpg)
Scholar Activity
Characteristics of RISC processor
Review Questions Q6 – 7
2010 14a-c
![Page 15: Topic 4 Processor Performance](https://reader036.fdocuments.in/reader036/viewer/2022062322/56814400550346895db092de/html5/thumbnails/15.jpg)
Parallel Processing
At least two microprocessors handle parts of an overall task.
A computer scientist divides a complex problem into component parts using special software specifically designed for the task.
He or she then assigns each component part to a dedicated processor.
Each processor solves its part of the overall computational problem.
The software reassembles the data to reach the end conclusion of the original complex problem.
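The divide, assign, solve and reassemble steps above can be sketched as follows. This is a minimal illustration under stated assumptions: the function names `solve_part` and `parallel_sum` are hypothetical, the "complex problem" is just summing a list, and Python threads stand in for dedicated processors (in standard CPython, threads do not normally execute Python code truly simultaneously, so this shows the structure rather than a real speed-up).

```python
from concurrent.futures import ThreadPoolExecutor

def solve_part(chunk):
    """Each worker solves its own part of the problem (here: a partial sum)."""
    return sum(chunk)

def parallel_sum(data, workers=2):
    # Divide the problem into component parts, roughly one per worker.
    size = max(1, (len(data) + workers - 1) // workers)
    chunks = [data[i:i + size] for i in range(0, len(data), size)]
    # Assign each part to a worker and collect the partial results.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = list(pool.map(solve_part, chunks))
    # Reassemble the partial results into the final answer.
    return sum(partials)

print(parallel_sum(list(range(1, 101)), workers=4))  # 5050
```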
![Page 16: Topic 4 Processor Performance](https://reader036.fdocuments.in/reader036/viewer/2022062322/56814400550346895db092de/html5/thumbnails/16.jpg)
Single Instruction, Single Data (SISD) computers have one processor that handles one algorithm using one source of data at a time. The computer tackles and processes each task in order, and so sometimes people use the word "sequential" to describe SISD computers. They aren't capable of performing parallel processing on their own.
![Page 17: Topic 4 Processor Performance](https://reader036.fdocuments.in/reader036/viewer/2022062322/56814400550346895db092de/html5/thumbnails/17.jpg)
SIMD
Single Instruction, Multiple Data (SIMD) computers have several processors that follow the same set of instructions, but each processor inputs different data into those instructions. SIMD computers run different data through the same algorithm. This can be useful for analyzing large chunks of data based on the same criteria. Many complex computational problems don't fit this model.
![Page 18: Topic 4 Processor Performance](https://reader036.fdocuments.in/reader036/viewer/2022062322/56814400550346895db092de/html5/thumbnails/18.jpg)
SIMD
A single computer instruction performs the same action (retrieve, calculate, or store) simultaneously on two or more pieces of data.
Typically a SIMD machine consists of many simple processors, each with a local memory holding the data it will work on.
Each processor simultaneously performs the same instruction on its local data, progressing through the instructions in lock-step, with the instructions issued by the controller processor.
The processors can communicate with each other in order to perform shifts and other array operations.
![Page 19: Topic 4 Processor Performance](https://reader036.fdocuments.in/reader036/viewer/2022062322/56814400550346895db092de/html5/thumbnails/19.jpg)
SIMD
![Page 20: Topic 4 Processor Performance](https://reader036.fdocuments.in/reader036/viewer/2022062322/56814400550346895db092de/html5/thumbnails/20.jpg)
SIMD Example
A classic example of data parallelism is inverting an RGB picture to produce its negative.
You have to iterate through an array of uniform integer values (pixels) and perform the same operation (inversion) on each one.
…multiple data points, a single operation.
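A rough way to see the difference is to pack four 8-bit colour values into one 32-bit word and invert all four with a single XOR. This is only a software model of the idea (real SIMD hardware does this in dedicated registers); the function names are hypothetical and the 32-bit word is an assumed stand-in for a SIMD register.

```python
def pack4(values):
    """Pack four 8-bit values into one 32-bit word."""
    word = 0
    for v in values:
        word = (word << 8) | (v & 0xFF)
    return word

def unpack4(word):
    """Split a 32-bit word back into four 8-bit values."""
    return [(word >> shift) & 0xFF for shift in (24, 16, 8, 0)]

def invert_scalar(pixels):
    # One operation per data item.
    return [255 - p for p in pixels]

def invert_packed(pixels):
    # One XOR per group of four data items.
    out = []
    for i in range(0, len(pixels), 4):
        group = pixels[i:i + 4]
        word = pack4(group + [0] * (4 - len(group)))  # pad the last group
        out.extend(unpack4(word ^ 0xFFFFFFFF)[:len(group)])
    return out

pixels = [0, 128, 255, 10, 200]
print(invert_scalar(pixels))  # [255, 127, 0, 245, 55]
print(invert_packed(pixels))  # [255, 127, 0, 245, 55]
```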
![Page 21: Topic 4 Processor Performance](https://reader036.fdocuments.in/reader036/viewer/2022062322/56814400550346895db092de/html5/thumbnails/21.jpg)
MMX (implementation of SIMD)
Short for Multimedia Extensions, a set of 57 multimedia instructions built into Intel microprocessors and other x86-compatible microprocessors.
MMX-enabled microprocessors can handle many common multimedia operations, such as digital signal processing (DSP), that are normally handled by a separate sound or video card.
![Page 22: Topic 4 Processor Performance](https://reader036.fdocuments.in/reader036/viewer/2022062322/56814400550346895db092de/html5/thumbnails/22.jpg)
SIMD
The Pentium III chip introduced eight 128-bit registers which could be operated on by the SIMD instructions.
![Page 23: Topic 4 Processor Performance](https://reader036.fdocuments.in/reader036/viewer/2022062322/56814400550346895db092de/html5/thumbnails/23.jpg)
SIMD
The Motorola PowerPC 7400 chips used in Apple G4 computers also provided SIMD instructions, which can operate on multiple data items held in 32 128-bit registers.
![Page 24: Topic 4 Processor Performance](https://reader036.fdocuments.in/reader036/viewer/2022062322/56814400550346895db092de/html5/thumbnails/24.jpg)
SIMD
Huge impact on the processing of multimedia data
Improves performance on any type of processing which requires the same instruction to be applied to multiple data items
Other examples - voice-to-text processing, data encryption/decryption
![Page 25: Topic 4 Processor Performance](https://reader036.fdocuments.in/reader036/viewer/2022062322/56814400550346895db092de/html5/thumbnails/25.jpg)
SIMD PP Questions
2008 Q15
![Page 26: Topic 4 Processor Performance](https://reader036.fdocuments.in/reader036/viewer/2022062322/56814400550346895db092de/html5/thumbnails/26.jpg)
Pipelining
Instruction pipelining = assembly line
Because the processor works on different steps of several instructions at the same time, more instructions can be executed in a shorter period of time.
![Page 27: Topic 4 Processor Performance](https://reader036.fdocuments.in/reader036/viewer/2022062322/56814400550346895db092de/html5/thumbnails/27.jpg)
Analogy – washing, drying and folding clothes
![Page 28: Topic 4 Processor Performance](https://reader036.fdocuments.in/reader036/viewer/2022062322/56814400550346895db092de/html5/thumbnails/28.jpg)
Analogy – washing, drying and folding clothes
![Page 29: Topic 4 Processor Performance](https://reader036.fdocuments.in/reader036/viewer/2022062322/56814400550346895db092de/html5/thumbnails/29.jpg)
Execution of instructions without a pipeline
Instruction 1: fetch decode execute
Instruction 2:                      fetch decode execute
Instruction 3:                                           fetch decode execute
time
![Page 30: Topic 4 Processor Performance](https://reader036.fdocuments.in/reader036/viewer/2022062322/56814400550346895db092de/html5/thumbnails/30.jpg)
Execution of instructions with a pipeline
Instruction 1: fetch   decode  execute
Instruction 2:         fetch   decode  execute
Instruction 3:                 fetch   decode  execute
time
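The saving can be estimated with a simple count. The sketch below assumes each stage takes exactly one clock cycle and that nothing ever stalls the pipeline; the function names are hypothetical.

```python
def cycles_without_pipeline(instructions, stages=3):
    # Each instruction finishes all of its stages before the next one starts.
    return instructions * stages

def cycles_with_pipeline(instructions, stages=3):
    # The first instruction needs `stages` cycles; each later instruction
    # finishes one cycle after the one before it (no stalls assumed).
    if instructions == 0:
        return 0
    return stages + (instructions - 1)

print(cycles_without_pipeline(3))  # 9
print(cycles_with_pipeline(3))     # 5
```

For three instructions and a fetch-decode-execute pipeline, that is 9 cycles without pipelining against 5 with it.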
![Page 31: Topic 4 Processor Performance](https://reader036.fdocuments.in/reader036/viewer/2022062322/56814400550346895db092de/html5/thumbnails/31.jpg)
![Page 32: Topic 4 Processor Performance](https://reader036.fdocuments.in/reader036/viewer/2022062322/56814400550346895db092de/html5/thumbnails/32.jpg)
Example - 5 Stage Pipeline
1. Instruction fetch (IF)
2. Instruction Decode (ID)
3. Execution (EX)
4. Memory Read/Write (MEM)
5. Result Writeback (WB)
Most modern processors use pipelining, typically with 5 or more stages
![Page 33: Topic 4 Processor Performance](https://reader036.fdocuments.in/reader036/viewer/2022062322/56814400550346895db092de/html5/thumbnails/33.jpg)
Example - 5 Stage Pipeline
![Page 34: Topic 4 Processor Performance](https://reader036.fdocuments.in/reader036/viewer/2022062322/56814400550346895db092de/html5/thumbnails/34.jpg)
Problems with Pipelining
Led to an increase in performance
Works best when all instructions are the same length and follow in direct sequence
Not always the case!
![Page 35: Topic 4 Processor Performance](https://reader036.fdocuments.in/reader036/viewer/2022062322/56814400550346895db092de/html5/thumbnails/35.jpg)
Problems with Pipelining
3 problems that can arise during pipelining
Varying instruction lengths
Data dependency
Branch instructions
![Page 36: Topic 4 Processor Performance](https://reader036.fdocuments.in/reader036/viewer/2022062322/56814400550346895db092de/html5/thumbnails/36.jpg)
Problems with Pipelining 1
Instruction Length
In CISC-based designs, instructions can vary in length
A long, slow instruction can hold up the pipeline
Less of a problem in RISC-based designs as most instructions are fairly short
![Page 37: Topic 4 Processor Performance](https://reader036.fdocuments.in/reader036/viewer/2022062322/56814400550346895db092de/html5/thumbnails/37.jpg)
Problems with Pipelining 2
Data dependency
Occurs when one instruction relies on the result produced by a previous instruction
Data required for the 2nd instruction may not yet be available because the 1st instruction is still being executed
Pipeline must be stalled until data is ready for the 2nd instruction
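The cost of such stalls can be sketched with a simple model. Everything here is an assumption made for illustration: instructions are written as (destination, sources) pairs, only adjacent instructions are checked, each stage takes one cycle, and the stall penalty of 2 cycles is a made-up figure (the real penalty depends on the processor design).

```python
def total_cycles(program, stages=5, stall_penalty=2):
    """program: list of (dest, sources) tuples in execution order."""
    if not program:
        return 0
    stalls = 0
    for prev, cur in zip(program, program[1:]):
        prev_dest, _ = prev
        _, cur_sources = cur
        # The 2nd instruction reads what the 1st is still producing.
        if prev_dest in cur_sources:
            stalls += stall_penalty
    return stages + (len(program) - 1) + stalls

dependent = [("r1", ("r2", "r3")), ("r4", ("r1", "r5")), ("r6", ("r7", "r8"))]
independent = [("r1", ("r2", "r3")), ("r4", ("r9", "r5")), ("r6", ("r7", "r8"))]
print(total_cycles(dependent))    # 9
print(total_cycles(independent))  # 7
```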
![Page 38: Topic 4 Processor Performance](https://reader036.fdocuments.in/reader036/viewer/2022062322/56814400550346895db092de/html5/thumbnails/38.jpg)
Problems with Pipelining 3
Branch instructions
BCC 25 - branch 25 bytes ahead if the carry flag is clear
If the carry flag is set, the next instruction is carried out as normal
If the carry flag is clear then the instruction 25 bytes ahead is next
![Page 39: Topic 4 Processor Performance](https://reader036.fdocuments.in/reader036/viewer/2022062322/56814400550346895db092de/html5/thumbnails/39.jpg)
Instruction 3 is a Branch Instruction – requiring a jump to instruction 15 – so 4 instructions are flushed from the pipeline
![Page 40: Topic 4 Processor Performance](https://reader036.fdocuments.in/reader036/viewer/2022062322/56814400550346895db092de/html5/thumbnails/40.jpg)
Optimising the Pipeline
Techniques include
Branch prediction
Data flow analysis
Speculative loading of data
Speculative execution of instructions
Predication
![Page 41: Topic 4 Processor Performance](https://reader036.fdocuments.in/reader036/viewer/2022062322/56814400550346895db092de/html5/thumbnails/41.jpg)
Optimising the Pipeline
Branch prediction
Some processors predict branch "taken" for some op-codes and "not taken" for others.
The most effective approaches, however, use dynamic techniques.
![Page 42: Topic 4 Processor Performance](https://reader036.fdocuments.in/reader036/viewer/2022062322/56814400550346895db092de/html5/thumbnails/42.jpg)
Optimising the Pipeline
Branch Prediction - Example
Many branch instructions are repeated often in a program (e.g. the branch instruction at the end of a loop). The processor can then note whether or not the branch was "taken" previously, and assume that the same will happen this time. This requires the use of a branch history table, a small area of cache memory, to record the information. This method is used in the AMD 29000 processor.
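A minimal sketch of this "same as last time" approach is shown below. It is a generic one-bit predictor written for illustration, not a description of how the AMD 29000 implements it; the class and function names are hypothetical, and treating an unseen branch as "not taken" is an arbitrary choice.

```python
class BranchHistoryTable:
    """One-bit predictor: remember what each branch did last time."""

    def __init__(self):
        self.history = {}  # branch address -> True (taken) / False (not taken)

    def predict(self, address):
        # Branches not seen before are assumed "not taken".
        return self.history.get(address, False)

    def update(self, address, taken):
        self.history[address] = taken

def count_mispredictions(table, address, outcomes):
    misses = 0
    for taken in outcomes:
        if table.predict(address) != taken:
            misses += 1
        table.update(address, taken)
    return misses

# A loop branch that is taken 9 times, then not taken when the loop exits.
loop_outcomes = [True] * 9 + [False]
print(count_mispredictions(BranchHistoryTable(), 0x40, loop_outcomes))  # 2
```

Only the first and last iterations are mispredicted, which is why this works well for branches at the end of loops.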
![Page 43: Topic 4 Processor Performance](https://reader036.fdocuments.in/reader036/viewer/2022062322/56814400550346895db092de/html5/thumbnails/43.jpg)
Optimising the Pipeline
Data Flow Analysis
Used to overcome dependency
Processor analyses instructions for dependency
Then allocates instructions to the pipeline in an order which prevents dependency stalling the flow
![Page 44: Topic 4 Processor Performance](https://reader036.fdocuments.in/reader036/viewer/2022062322/56814400550346895db092de/html5/thumbnails/44.jpg)
Optimising the Pipeline
Speculative loading of data
Processor looks ahead and processes early any instructions which load data from memory
Data stored in registers for later use (if required)
Discarded if not required
![Page 45: Topic 4 Processor Performance](https://reader036.fdocuments.in/reader036/viewer/2022062322/56814400550346895db092de/html5/thumbnails/45.jpg)
Optimising the Pipeline
Speculative execution
Processor carries out instructions before they are required
Results stored in temporary registers
Discarded if not required
![Page 46: Topic 4 Processor Performance](https://reader036.fdocuments.in/reader036/viewer/2022062322/56814400550346895db092de/html5/thumbnails/46.jpg)
Optimising the Pipeline
Predication
Tackles conditional branches by executing instructions from both branches until it knows which branch is to be taken
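The idea can be imitated in software. In the sketch below (an illustration only, with hypothetical function names) the first version chooses one path with a conditional branch, while the second computes both paths and keeps the result whose predicate is true.

```python
def branchy_abs_diff(a, b):
    # Conventional version: a conditional branch picks one path.
    if a > b:
        return a - b
    return b - a

def predicated_abs_diff(a, b):
    # Predicated version: compute both paths, then keep the result
    # whose predicate is true. No branch on the data is needed.
    p = a > b                  # predicate
    result_if_true = a - b     # path 1, always computed
    result_if_false = b - a    # path 2, always computed
    return p * result_if_true + (not p) * result_if_false

print(branchy_abs_diff(7, 3), predicated_abs_diff(7, 3))  # 4 4
print(branchy_abs_diff(3, 7), predicated_abs_diff(3, 7))  # 4 4
```

The predicated version does more work in total, but there is no branch for the pipeline to mispredict.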
![Page 47: Topic 4 Processor Performance](https://reader036.fdocuments.in/reader036/viewer/2022062322/56814400550346895db092de/html5/thumbnails/47.jpg)
Optimising the Pipeline
All of these techniques are possible due to
The increasing speeds
The increasing complexity
The increasing numbers of processors available in modern processors
![Page 48: Topic 4 Processor Performance](https://reader036.fdocuments.in/reader036/viewer/2022062322/56814400550346895db092de/html5/thumbnails/48.jpg)
Pipelining PP Questions
2010 Q11b,c, 13a,b
2009 13f
2008 14c 16e
2007 16 a,b,c,d
2006 18a,b,d
2011 13b,c
![Page 49: Topic 4 Processor Performance](https://reader036.fdocuments.in/reader036/viewer/2022062322/56814400550346895db092de/html5/thumbnails/49.jpg)
Superscalar Processing
More than one pipeline within the processor
Pipelines can work independently
Superscalar processors try to take advantage of instruction-level parallelism
![Page 50: Topic 4 Processor Performance](https://reader036.fdocuments.in/reader036/viewer/2022062322/56814400550346895db092de/html5/thumbnails/50.jpg)
Superscalar Processing
A superscalar CPU architecture implements a form of parallelism called instruction-level parallelism within a single processor.
It thereby allows faster CPU throughput than would otherwise be possible at the same clock rate.
A superscalar processor executes more than one instruction during a clock cycle by simultaneously dispatching multiple instructions to redundant functional units on the processor.
![Page 51: Topic 4 Processor Performance](https://reader036.fdocuments.in/reader036/viewer/2022062322/56814400550346895db092de/html5/thumbnails/51.jpg)
Superscalar Processing
Try to take advantage of instruction-level parallelism
The degree to which instructions in a program can be executed in parallel
a = a + 2
b = b + c
Can be executed in parallel
a = a + 2
b = a + c
Cannot be executed in parallel – Why?
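One way to check a pair of instructions is to compare what each one writes with what the other reads or writes. The sketch below is a simplified model, not a description of real issue logic: each instruction is an assumed (destination, sources) pair and the function name is hypothetical.

```python
def can_run_in_parallel(first, second):
    """Each instruction is (dest, sources). True if neither depends on the other."""
    dest1, src1 = first
    dest2, src2 = second
    read_after_write = dest1 in src2   # second needs first's result
    write_after_read = dest2 in src1   # second overwrites first's input
    write_after_write = dest1 == dest2 # both write the same variable
    return not (read_after_write or write_after_read or write_after_write)

# a = a + 2 ; b = b + c
print(can_run_in_parallel(("a", {"a"}), ("b", {"b", "c"})))  # True
# a = a + 2 ; b = a + c
print(can_run_in_parallel(("a", {"a"}), ("b", {"a", "c"})))  # False
```

In the second pair, `b = a + c` reads `a`, which the first instruction is still updating.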
![Page 52: Topic 4 Processor Performance](https://reader036.fdocuments.in/reader036/viewer/2022062322/56814400550346895db092de/html5/thumbnails/52.jpg)
Superscalar Processing
![Page 53: Topic 4 Processor Performance](https://reader036.fdocuments.in/reader036/viewer/2022062322/56814400550346895db092de/html5/thumbnails/53.jpg)
Superscalar Processing
![Page 54: Topic 4 Processor Performance](https://reader036.fdocuments.in/reader036/viewer/2022062322/56814400550346895db092de/html5/thumbnails/54.jpg)
Superscalar Processing
![Page 55: Topic 4 Processor Performance](https://reader036.fdocuments.in/reader036/viewer/2022062322/56814400550346895db092de/html5/thumbnails/55.jpg)
Superscalar Processing
While early superscalar CPUs would have two ALUs and a single FPU, a modern design such as the PowerPC 970 includes four ALUs, two FPUs, and two SIMD units.
![Page 56: Topic 4 Processor Performance](https://reader036.fdocuments.in/reader036/viewer/2022062322/56814400550346895db092de/html5/thumbnails/56.jpg)
Scholar Activity
Review Questions Q8-14
2011 13a,b,c
2009 13e
2008 17
2006 18c