WaveScalar and the WaveCache
Spring 2003 CSE P548 1
Steven Swanson, Ken Michelson, Mark Oskin, Tom Anderson, Susan Eggers
University of Washington
Worries to Keep You up at Night
- In 2016, 200,000 RISC-1 processors will fit on a die.
- It will take 36 cycles to cross the die.
- Still a lack of ILP.
- Memory latency is still a problem.
- For reasonable yields, only 1 transistor in 24 billion may be broken (if one flaw breaks a chip).
WaveScalar’s Solution: Utilize Die Capability
- A sea of simple, RISC-like processors
  - in-order, single-issue
  - takes advantage of billions of transistors without exacerbating the other problems
    - short design & implementation time
    - operates at a short cycle
    - does not need lots of ILP
    - fewer defects
WaveScalar Processing Element
[Figure: processing element block diagram; labels: Inputs, Decode, Config. Logic, Flow Control, FU, Flow Control, Outputs, L2 Cache]
WaveScalar’s Solution: Short Wires
- Dataflow execution model
  - each processor executes when its operands have arrived
  - same principle as out-of-order execution, but applies to the processor & includes fetching
  - no single program counter
- Short wires:
  - no long control lines
  - no centralized hardware data structures
  - no need for sequential & individual instruction fetches
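The firing rule above (an instruction executes once its operands have arrived, with no program counter) can be sketched in a few lines of Python. This is only an illustrative model, not the WaveScalar hardware: the `Node` class, the `run` function and the token format are all assumptions made for the example.

```python
from collections import deque


class Node:
    """One dataflow instruction: fires once every input slot holds a value."""

    def __init__(self, name, op, n_inputs, consumers=()):
        self.name = name
        self.op = op
        self.slots = [None] * n_inputs
        self.consumers = list(consumers)  # (target node, input slot) pairs

    def receive(self, slot, value):
        self.slots[slot] = value
        return all(v is not None for v in self.slots)  # True when ready to fire


def run(initial_tokens):
    """Deliver tokens until none remain; no program counter is involved."""
    work = deque(initial_tokens)  # each token is (node, slot, value)
    results = {}
    while work:
        node, slot, value = work.popleft()
        if node.receive(slot, value):
            out = node.op(*node.slots)
            node.slots = [None] * len(node.slots)  # clear for the next firing
            results[node.name] = out
            for target, target_slot in node.consumers:
                work.append((target, target_slot, out))
    return results


# (a + b) * (a - b) with a = 7, b = 3; operands are delivered in arbitrary order.
mul = Node("mul", lambda x, y: x * y, 2)
add = Node("add", lambda x, y: x + y, 2, [(mul, 0)])
sub = Node("sub", lambda x, y: x - y, 2, [(mul, 1)])
results = run([(sub, 1, 3), (add, 0, 7), (sub, 0, 7), (add, 1, 3)])
print(results["mul"])
```

Note that `sub` fires before `add` here simply because its second operand shows up first; the order of execution is set by data arrival, not by instruction order.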
WaveScalar’s Solution: Short Wires
- Dataflow execution model, cont’d.
  - differs from original dataflow computers
    - distributed tag management (matching between renamed producer-consumer registers)
    - special WaveScalar instructions assign a number to all operands in a wave (think iteration or trace) & coordinate wave execution
    - all instructions in a “wave” execute on data with the same wave number
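A rough sketch of matching operands by wave number, assuming a two-input instruction that keeps a small per-wave table of waiting operands. The names `WaveMatchingNode` and `wave_advance` are hypothetical, chosen for the example; the slides do not name the instructions or describe the matching hardware.

```python
from collections import defaultdict


class WaveMatchingNode:
    """Two-input instruction that pairs operands by their wave number."""

    def __init__(self, op):
        self.op = op
        self.waiting = defaultdict(dict)  # wave number -> {input slot: value}

    def receive(self, wave, slot, value):
        operands = self.waiting[wave]
        operands[slot] = value
        if len(operands) == 2:  # both operands of this wave have arrived
            del self.waiting[wave]
            return wave, self.op(operands[0], operands[1])
        return None  # still waiting for the partner operand


def wave_advance(token):
    """Hypothetical stand-in for a wave-numbering instruction: bump the tag."""
    wave, value = token
    return wave + 1, value


add = WaveMatchingNode(lambda x, y: x + y)
fired = []
# Operands from waves 0 and 1 (say, two loop iterations) arrive interleaved.
for wave, slot, value in [(0, 0, 10), (1, 0, 20), (1, 1, 2), (0, 1, 1)]:
    out = add.receive(wave, slot, value)
    if out is not None:
        fired.append(out)
print(fired)
```

Wave 1 fires before wave 0 in this run, yet no operand from one wave is ever combined with an operand from the other.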
WaveScalar’s Solution: Short Wires
- Dataflow execution model
  - differs from original dataflow computers
    - explicit wave-ordered memory
      - compiler assigns a sequence number to each memory operation in a breadth-first manner
      - sequence number for an operation, its predecessor & successor all sent with produced data
      - wave & sequence numbers provide a total order on memory operations through any traversal of a wave
      + normal memory semantics
      + no need for special dataflow languages; C & C++ programs execute just fine
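One way to picture how predecessor and successor numbers recover program order is the sketch below. It assumes a buffer that holds the memory operations of a single wave and releases an operation only when a predecessor or successor link ties it to the last one issued. Treating a link as unknown (`None`) when it depends on a branch is an assumption of this example, as are the names `MemOp` and `OrderingBuffer`.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class MemOp:
    name: str
    pred: Optional[int]  # sequence number of the predecessor, None if unknown
    seq: int             # this operation's own sequence number
    succ: Optional[int]  # sequence number of the successor, None if unknown


class OrderingBuffer:
    """Holds memory operations of one wave and releases them in program order."""

    def __init__(self, first_seq):
        self.first_seq = first_seq
        self.pending = {}
        self.last = None  # most recently issued operation
        self.issued = []

    def _is_next(self, op):
        if self.last is None:
            return op.seq == self.first_seq
        # Either link is enough to show that op directly follows self.last.
        return op.pred == self.last.seq or self.last.succ == op.seq

    def arrive(self, op):
        self.pending[op.seq] = op
        progressed = True
        while progressed:  # keep draining while some pending op is next
            progressed = False
            for cand in list(self.pending.values()):
                if self._is_next(cand):
                    del self.pending[cand.seq]
                    self.issued.append(cand.name)
                    self.last = cand
                    progressed = True
                    break


# Op 1 is followed by a branch; only the path containing op 3 executes.
buf = OrderingBuffer(first_seq=1)
buf.arrive(MemOp("load z", pred=None, seq=4, succ=None))  # arrives first, must wait
buf.arrive(MemOp("store y", pred=1, seq=3, succ=4))
buf.arrive(MemOp("load x", pred=None, seq=1, succ=None))
print(buf.issued)
```

The operations arrive in the reverse of program order, but the buffer still issues them as load x, store y, load z, which is the "normal memory semantics" the slide refers to.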
WaveScalar’s Solution: Short Wires
- Nearest-neighbor communication
  - code placement to locate consumers near their producers
  - short, fast node-to-node links rather than slow broadcast networks
  - exploits dataflow locality: probability of producing a value for a particular consumer instruction & therefore register (register renaming can destroy this)
  - instructions can dynamically migrate toward their neighbors during execution
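To make the benefit of placing consumers near producers concrete, the sketch below compares two placements of the same small dataflow graph on a grid. It assumes communication cost is the Manhattan distance between producer and consumer; the graph, the coordinates and the cost model are all made up for illustration.

```python
def manhattan(a, b):
    """Grid hops between two (row, col) positions."""
    return abs(a[0] - b[0]) + abs(a[1] - b[1])


def communication_cost(edges, placement):
    """Total hops travelled by results, one value per producer-consumer edge."""
    return sum(manhattan(placement[p], placement[c]) for p, c in edges)


# Dataflow graph: ld -> add -> mul -> st, plus ld -> mul
edges = [("ld", "add"), ("add", "mul"), ("mul", "st"), ("ld", "mul")]

scattered = {"ld": (0, 0), "add": (3, 3), "mul": (0, 3), "st": (3, 0)}
clustered = {"ld": (0, 0), "add": (0, 1), "mul": (1, 1), "st": (1, 2)}

print(communication_cost(edges, scattered))  # 18 hops
print(communication_cost(edges, clustered))  # 5 hops
```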
Dynamic Optimization
The common case has higher costs, and the branch can detect this…
[Figure: a Branch feeding a Common Case path and a Rare Case path, which meet at a Join]
Dynamic Optimization
…and fix it, by moving. The join can do the same.
[Figure: the same Branch, Common Case, Rare Case and Join, after the branch has moved]
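A toy version of this idea, assuming the branch counts how often it sends a value to each consumer and then steps, one free grid cell at a time, toward whatever lowers its weighted communication cost. The weights, the grid and the greedy rule are assumptions for the example, not the mechanism proposed in the slides.

```python
def weighted_cost(pos, consumers):
    """Hops from pos to each consumer, weighted by how often it is used."""
    return sum(w * (abs(pos[0] - c[0]) + abs(pos[1] - c[1])) for c, w in consumers)


def migrate_step(pos, consumers, free):
    """Move to the neighbouring free cell that lowers the cost most, if any."""
    x, y = pos
    neighbours = [(x + 1, y), (x - 1, y), (x, y + 1), (x, y - 1)]
    options = [n for n in neighbours if n in free]
    best = min(options, key=lambda n: weighted_cost(n, consumers), default=pos)
    if weighted_cost(best, consumers) < weighted_cost(pos, consumers):
        return best
    return pos


# Common case at (4, 0) taken 9 times in 10; rare case at (0, 1) taken once.
consumers = [((4, 0), 9), ((0, 1), 1)]
free = {(x, y) for x in range(5) for y in range(2)} - {(4, 0), (0, 1)}

start = (0, 0)
pos = start
while True:
    nxt = migrate_step(pos, consumers, free)
    if nxt == pos:
        break
    pos = nxt

print(weighted_cost(start, consumers), "->", weighted_cost(pos, consumers))
```

The branch starts next to the rare case and ends next to the common case, cutting the weighted cost from 37 to 13 in this setup.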
WaveScalar’s Solution: Short Wires
[Figure: WaveCache layout; labels: PE, Domain, L2 Cache]
WaveScalar’s Solution: Short Wires
[Figure: WaveCache layout; labels: D$ + Store Buffer, Cluster, L2 Cache]
WaveScalar’s Solution: Creative Use of Untapped Parallelism
- Expand the window for exploiting ILP
  - no in-order fetch using only one PC (sucking through a straw)
  - place instructions with the processing elements
  - out-of-order execution on a grand scale
- Allow multiple threads to execute concurrently
  - OS & applications
  - multiple applications, parallel threads
WaveScalar’s Solution: The I-Cache is the Processor
- Model is processor-in-memory (PIM)
  - processing element associated with each instruction
- WaveScalar version
  - processing elements placed in the I-cache to reduce latency
WaveScalar’s Solution: Design to Compensate for Circuit Unreliability
- Fewer design & implementation errors from the grid of simple, uniform design
- Route around processors with flaws
  - decentralized control
  - dynamic instruction migration
[Figure: WaveCache layout; label: L2 Cache]
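Routing around flawed processors can be pictured as placement that simply skips the bad grid cells. The sketch below assumes the set of defective processing elements is already known (for example from a test at power-up) and fills the working ones in row-major order; the function name `place` and the fill order are illustrative choices only.

```python
def place(instructions, rows, cols, defective):
    """Assign instructions to working PEs in row-major order, skipping flawed ones."""
    working = [(r, c) for r in range(rows) for c in range(cols)
               if (r, c) not in defective]
    if len(instructions) > len(working):
        raise ValueError("not enough working processing elements")
    return dict(zip(instructions, working))


# A 2 x 3 grid in which two processing elements are known to be flawed.
placement = place(["i0", "i1", "i2", "i3"], rows=2, cols=3,
                  defective={(0, 1), (1, 0)})
print(placement)
```

Because every processing element is identical and there is no central controller, a flawed cell costs only its own slot rather than the whole chip.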
Research Agenda: Architecture
- WaveScalar ISA
- Microarchitecture design
  - node design
  - domain size
  - cache-coherence across clusters
  - cluster arrangement
- Control & memory speculation
- WaveScalar instruction management
  - hardware for instruction placement & replacement
  - hardware for dynamic, self-optimizing placement
Research Agenda: Architecture
- Multithreaded WaveScalar
- Design of the network & routing issues
- Power management
- Static & dynamic fault detection & recovery (rerouting instructions)
- System-level design
- Application to non-silicon designs
Research Agenda: Compilers
- Instruction placement
- Revisit classic optimizations
  - code savings vs. communication costs
  - cache pollution vs. loop parallelism
- New opportunities for optimization
  - a match between compiler & execution models
  - WaveScalar-specific instructions
Research Agenda: OS & Networking
- Tension between facilitating short routines & poor instruction locality
- The software side of thread management
- A bunch of stuff I don’t know about
  - optimizing the OS interface
  - new thread protection policies
  - memory management issues
  - security
  - lazy context switching
  - utilizing virtual machines
Putting It All Together
- Grid of hundreds (maybe thousands) of simple, data-flow processing nodes
  - no centralized control; scalable
  - few design errors; increase in yield
- Processing nodes embedded in the I-cache
- Instructions execute in place
- Send results directly to the consumers
  - short, point-to-point links
- Instructions can dynamically migrate
  - reduce latency to hot consumers
  - map around defects
- 3X performance without any prediction mechanisms
  - more with them