Date posted: 21-Dec-2015
Multiscalar processors
Gurindar S. Sohi, Scott E. Breach, T.N. Vijaykumar
University of Wisconsin-Madison
Outline
- Motivation
- Multiscalar paradigm
- Multiscalar architecture
- Software and hardware support
- Distribution of cycles
- Results
- Conclusion
Motivation
- Current architectural techniques are reaching their limits
- The amount of ILP that a superscalar processor can extract is limited
- Kunle Olukotun (Stanford University)
Limits of ILP
- The parallelism that can be extracted from a single program is very limited: 4 or 5 in integer programs
- "Limits of instruction-level parallelism", David W. Wall (1990)
Limitations of superscalar
- Branch prediction accuracy limits ILP
  - Every fifth instruction is a branch
  - Executing an instruction across 5 branches leads to a useful result only about 60% of the time (with 90% branch prediction accuracy)
  - Some branches are difficult to predict, so increasing the window size does not always mean executing more useful instructions
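The 60% figure is the per-branch accuracy compounded over five branches: 0.9^5 ≈ 0.59. A minimal Python sketch of that arithmetic, assuming each prediction is independent (the function name is illustrative, not from the slides):

```python
def useful_probability(accuracy: float, branches: int) -> float:
    """Probability that `branches` consecutive predictions are all correct,
    assuming the predictions are independent of each other."""
    return accuracy ** branches


# How quickly the chance of doing useful work decays with speculation depth.
for depth in (1, 5, 10, 20):
    print(depth, round(useful_probability(0.9, depth), 3))
```

With 90% accuracy, speculating past ten branches is already useful only about a third of the time, which is why a larger window alone does not help.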
Limitations of superscalar.. contd
- Large window size
  - Issuing more instructions per cycle needs a large window of instructions
  - Each cycle, the whole window is searched to find instructions to issue
  - This increases the pipeline length
- Issue complexity
  - To issue an instruction, dependence checks have to be performed against the other issuing instructions
  - To issue $n$ instructions, the complexity of issue is $n^2$
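The quadratic growth comes from comparing every instruction in the issue group against every other one. The sketch below counts those pairwise checks for a toy group. The `(dest, sources)` instruction encoding is an assumption made for illustration, and only read-after-write conflicts are checked.

```python
from itertools import combinations


def dependence_checks(group):
    """Compare every pair of instructions in an issue group.

    Each instruction is a (dest_reg, src_regs) tuple. Returns the number of
    pairwise checks performed and the (earlier, later) index pairs where the
    later instruction reads a register written by the earlier one.
    """
    checks = 0
    conflicts = []
    for (i, earlier), (j, later) in combinations(enumerate(group), 2):
        checks += 1
        if earlier[0] in later[1]:
            conflicts.append((i, j))
    return checks, conflicts


group = [("r1", ("r2", "r3")), ("r4", ("r1",)), ("r5", ("r6",)), ("r7", ("r4", "r5"))]
print(dependence_checks(group))          # 4 instructions -> 6 checks
print(dependence_checks(group * 2)[0])   # 8 instructions -> 28 checks
```

Doubling the group from 4 to 8 instructions raises the number of checks from 6 to 28, i.e. n(n-1)/2, which grows as $n^2$.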
Limitations of superscalar.. contd
- Load and store queue limitations
  - Loads and stores cannot be reordered before their addresses are known
  - One load or store waiting for its address can block the entire processor
Superscalar limitation example
Consider the following hypothetical loop:
- Iter 1: inst 1, inst 2, …, inst n
- Iter 2: inst 1, inst 2, …

If the window size is less than n, a superscalar processor considers only one iteration at a time.

Possible improvement:

| Iter 1 | Iter 2 |
| --- | --- |
| inst 1 | inst 1 |
| inst 2 | inst 2 |
| … | … |
| inst n | inst n |
Multiscalar paradigm
- Divide the program (its control flow graph, CFG) into multiple tasks (not necessarily parallel)
- Execute the tasks on different processing elements residing on the same die, so communication cost is low
- Sequential semantics are preserved by hardware and software mechanisms
- Tasks are typically re-executed if any violation occurs
Crossing the limits of superscalar
- Branch prediction
  - Each thread executes independently
  - Each thread is limited by branch prediction, but the number of useful instructions available is much larger than in a superscalar
- Window size
  - Each processing element has its own window
  - The total size of the windows on a die can be very large, while each window can be of moderate size
Crossing the limits of superscalar.. contd
- Issue complexity
  - Each processing element issues only a few instructions, which simplifies the logic
- Loads and stores
  - Loads and stores can be executed without waiting for the previous thread's loads or stores
Multiscalar architecture
A possible microarchitecture
Multiscalar execution
- The sequencer walks over the CFG
  - According to the hints inserted in the code, it assigns tasks to PEs
- PEs execute the tasks in parallel
- Maintaining sequential semantics
  - Register dependencies
  - Memory dependencies
- Tasks are assigned in ring order and are committed in ring order
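Ring-order assignment and commit can be pictured as a circular queue of processing elements with a head (oldest task) and a tail (next free PE). The following is a toy software model under that assumption. The class and method names are hypothetical, and the real sequencer is hardware.

```python
class Ring:
    """Toy model of PEs arranged in a ring with head and tail pointers."""

    def __init__(self, num_pes):
        self.slots = [None] * num_pes  # task currently held by each PE
        self.head = 0                  # PE holding the oldest task
        self.tail = 0                  # next PE to receive a task
        self.active = 0                # number of PEs holding a task

    def assign(self, task):
        """Give `task` to the PE at the tail; fail if every PE is busy."""
        if self.active == len(self.slots):
            return False
        self.slots[self.tail] = task
        self.tail = (self.tail + 1) % len(self.slots)
        self.active += 1
        return True

    def commit(self):
        """Retire the task at the head, freeing its PE; None if the ring is empty."""
        if self.active == 0:
            return None
        task = self.slots[self.head]
        self.slots[self.head] = None
        self.head = (self.head + 1) % len(self.slots)
        self.active -= 1
        return task


ring = Ring(4)
for i in range(5):
    print(f"assign T{i}:", ring.assign(f"T{i}"))  # the fifth assignment fails
print("commit:", ring.commit())                   # the oldest task retires first
print("assign T4:", ring.assign("T4"))            # the freed PE is reused
```

Because both pointers only move forward around the ring, tasks always retire in the order they were assigned, which is what preserves the sequential appearance.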
Register Dependencies
- Register dependencies can be easily identified by the compiler
- Dependencies are always synchronized
- Registers that a task may write are maintained in a create mask
- Reservations are created in the successor tasks using the accum mask
- If a reservation exists (the value has not arrived), the instruction reading the register waits
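One way to picture the masks is as bit vectors with one bit per register. In the sketch below, the accum mask of a task is taken to be the union of the create masks of all older active tasks, and a set bit marks a reservation. The helper names and the `RegisterFile` class are assumptions for illustration, and the model ignores the case where a nearer predecessor overrides a value from an older one.

```python
def mask(*regs):
    """Build a bit mask with one bit set per register number."""
    m = 0
    for r in regs:
        m |= 1 << r
    return m


def accum_masks(create_masks):
    """For tasks listed oldest first, return each task's accum mask:
    the union of the create masks of all older tasks."""
    result, running = [], 0
    for cm in create_masks:
        result.append(running)
        running |= cm
    return result


class RegisterFile:
    """Per-task register state: a set bit in `reserved` means the value is
    still expected from a predecessor task."""

    def __init__(self, accum):
        self.reserved = accum
        self.values = {}

    def receive(self, reg, value):
        """A predecessor forwarded `reg`: store it and clear the reservation."""
        self.values[reg] = value
        self.reserved &= ~(1 << reg)

    def can_read(self, reg):
        """An instruction may read `reg` only if no reservation is pending."""
        return ((self.reserved >> reg) & 1) == 0


creates = [mask(1, 2), mask(2, 3), mask(4)]
accum = accum_masks(creates)
rf = RegisterFile(accum[2])       # youngest task waits on r1, r2 and r3
print(rf.can_read(3))             # False: value has not arrived yet
rf.receive(3, 42)
print(rf.can_read(3))             # True: reservation cleared
```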
Memory dependencies
- Cannot be found statically
- Multiscalar uses an aggressive approach: always speculate
- Loads do not wait for stores in predecessor tasks
- Hardware checks for violations, and a task is re-executed if it violates any memory dependency
Task commit
- Speculative tasks are not allowed to modify memory
- Store values are buffered in hardware
- When a processing element becomes the head, it retires its values into memory
- To maintain sequential semantics, the tasks retire in order (ring arrangement of processing elements)
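The buffering and in-order retirement described above can be sketched in software as a private store buffer per task that is drained into memory only at commit. This is an illustrative model with hypothetical names, not the hardware mechanism itself.

```python
class SpeculativeTask:
    """A task whose stores stay in a private buffer until it commits."""

    def __init__(self, name):
        self.name = name
        self.store_buffer = {}  # address -> value, invisible to memory for now

    def store(self, addr, value):
        self.store_buffer[addr] = value


def commit_in_order(tasks, memory):
    """Retire tasks oldest first, draining each store buffer into memory.

    `tasks` must be listed in ring (program) order, so a younger task's
    store to the same address correctly overwrites an older one.
    """
    for task in tasks:
        memory.update(task.store_buffer)
        task.store_buffer.clear()
    return memory


memory = {}
t0, t1 = SpeculativeTask("T0"), SpeculativeTask("T1")
t1.store(0x10, 2)    # the younger task happens to execute its store first
t0.store(0x10, 1)
t1.store(0x20, 3)
print(memory)        # still empty: nothing speculative has reached memory
print(commit_in_order([t0, t1], memory))
```

Even though T1 executed its store before T0, committing in ring order leaves address 0x10 holding T1's value, as sequential execution would.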
Compiler support
- Structure of the CFG
  - The sequencer needs information about the tasks
  - The compiler or an assembly code analyzer marks the structure of the CFG (task boundaries)
  - The sequencer walks through this information
Compiler support .. contd
- Communication information
  - Gives the create mask as part of the task header
  - Sets the forward and stop bits
  - A register value is forwarded if the forward bit is set
  - A task is done when it sees a stop bit
  - Also needs to give release information
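A small interpreter can illustrate how these annotations interact. In this sketch the instruction encoding (dictionaries with `op`, `reg`, `forward` and `stop` keys) is invented for illustration. A `release` sends a register's current value onward when the task did not write it on the path taken, so that successors do not wait forever.

```python
def run_task(instructions, create_mask, regs):
    """Run one task and return {register: value} sent to successor tasks.

    `create_mask` is the set of registers the task may write. Every one of
    them must be either forwarded or released before the task stops.
    """
    sent = {}
    for inst in instructions:
        if inst["op"] == "set":
            regs[inst["reg"]] = inst["value"]
            if inst.get("forward"):          # forward bit: last write to reg
                sent[inst["reg"]] = regs[inst["reg"]]
        elif inst["op"] == "release":        # reg not written on this path
            sent[inst["reg"]] = regs[inst["reg"]]
        if inst.get("stop"):                 # stop bit: task ends here
            break
    missing = create_mask - set(sent)
    if missing:
        raise ValueError(f"never forwarded or released: {sorted(missing)}")
    return sent


task = [
    {"op": "set", "reg": "r1", "value": 5},                   # not the last write
    {"op": "set", "reg": "r1", "value": 7, "forward": True},  # last write: forward
    {"op": "release", "reg": "r2"},                           # r2 left unchanged
    {"op": "set", "reg": "r9", "value": 0, "stop": True},     # task ends
    {"op": "set", "reg": "r1", "value": 99, "forward": True}, # never reached
]
print(run_task(task, {"r1", "r2"}, {"r2": 3}))
```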
Hardware support
- Need to buffer speculative values
- Need to detect memory dependence violations
  - If a speculative thread loads a value, its address is recorded in the address resolution buffer (ARB)
  - If a thread stores into some location, the ARB is checked to see whether a later thread has already loaded from the same location
  - The speculative values are also buffered
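The load-recording and store-checking behaviour can be mimicked with two dictionaries keyed by address. This is a toy model only: task numbers stand in for ring order (lower means older), and the class layout is an assumption that ignores banking, finite capacity and entry reclamation in a real ARB.

```python
class ARB:
    """Toy address resolution buffer tracking speculative loads and stores."""

    def __init__(self):
        self.loads = {}   # address -> set of task numbers that loaded it
        self.stores = {}  # address -> {task number: buffered value}

    def load(self, task, addr, memory):
        """Record the load and return the value from the nearest task at or
        before `task` that stored to `addr`, falling back to memory."""
        self.loads.setdefault(addr, set()).add(task)
        older = {t: v for t, v in self.stores.get(addr, {}).items() if t <= task}
        return older[max(older)] if older else memory.get(addr, 0)

    def store(self, task, addr, value):
        """Buffer the store and return the later tasks that already loaded
        `addr`; those tasks used a stale value and must be squashed."""
        self.stores.setdefault(addr, {})[task] = value
        return sorted(t for t in self.loads.get(addr, ()) if t > task)


arb = ARB()
memory = {0x40: 1}
print(arb.load(2, 0x40, memory))   # task 2 speculatively reads memory
print(arb.store(1, 0x40, 9))       # older task 1 stores: task 2 is in violation
print(arb.load(2, 0x40, memory))   # re-executed task 2 sees task 1's value
```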
Cycle distribution
- Best scenario: every processing element always does useful work (never happens)
- Possible wastage
  - Non-useful computation
    - The task is squashed later due to an incorrect value or an incorrect prediction
  - No computation
    - Waits for some dependency to be resolved
    - Waits to commit its result
  - Remains idle
    - No task assigned
Non-useful computation
- Synchronization of memory values
  - Squashes usually occur on global or static data values
  - This dependency is easy to predict
  - Explicit synchronization can be inserted to eliminate squashes due to these dependencies
- Early validation of prediction
  - For example, loop exit testing can be done at the beginning of the iteration
No computation
- Intra-task dependences
  - These can be eliminated through a variety of hardware and software techniques
- Inter-task dependences
  - Possible scope for scheduling to reduce the wait time
- Load balancing
  - Tasks retire in order
  - Some tasks finish fast and wait a long time to become the head task
Differences with other paradigms
- Major improvement over superscalar
- VLIW: limited because of the limits of static optimizations
- Multiprocessor
  - Very similar
  - Communication cost is very low
  - Leads to fine-grained thread parallelism
Methodology
- Simulator which uses MIPS code
- 5-stage pipeline
- Sequencer has a 1024-entry direct-mapped cache of task descriptors
Results
- Compress: long critical path
- Eqntott and cmppt: parallel loops with good coverage
- Espresso: one loop has a load balancing issue
- Sc: also has load imbalance
- Tomcatv: good parallel loops
- Cmp and wc: intra-task dependences
Conclusion
- The multiscalar paradigm has very good potential
- Tackles the major limits of superscalar
- Lots of scope for compiler and hardware optimizations
- The paper gives a good introduction to the paradigm and also discusses the major optimization opportunities
Discussion
BREAK!