Multithreaded ASC
Kevin Schaffer and Robert A. Walker
ASC Processor Group
Computer Science Department
Kent State University
Organization of an ASC Computer
[Figure: a Control Unit (Scalar Processor) connected through a Broadcast/Reduction Network to Processing Elements 1 through N]
Broadcast/Reduction Bottleneck
Time to perform a broadcast or reduction increases as the number of PEs increases
Even for a moderate number of PEs, this time can dominate the machine cycle time
Pipelining reduces the cycle time but increases the latency
Additional latency causes pipeline hazards
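The slides do not say how the broadcast/reduction network is built. As an illustration only, the sketch below assumes a balanced tree in which each level costs one pipeline stage, so the number of stages grows with the logarithm of the PE count. The function name `tree_depth` and the `fan` parameter are hypothetical and do not come from the slides.

```python
def tree_depth(num_pes, fan=2):
    """Levels in a balanced tree with `fan` children per node and `num_pes` leaves.

    Assumption (not from the slides): the broadcast/reduction network is a
    balanced tree and each level adds one pipeline stage of latency.
    """
    if num_pes < 1 or fan < 2:
        raise ValueError("need num_pes >= 1 and fan >= 2")
    depth = 0
    reach = 1  # number of PEs reachable with `depth` levels
    while reach < num_pes:
        reach *= fan
        depth += 1
    return depth


if __name__ == "__main__":
    for n in (16, 256, 4096):
        print(n, "PEs ->", tree_depth(n), "levels with fan-out 2")
```

Under this assumption, every added level is one more stage of latency that a dependent instruction has to wait through, which is the hazard the later slides address.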
Instruction Types
Scalar instructions
  Execute entirely within the control unit
Broadcast/Parallel instructions
  Execute within the PE array
  Use the broadcast network to transfer instruction and data
Reduction instructions
  Execute within the PE array
  Use the broadcast network to transfer instruction and data
  Use the reduction network to combine data from PEs
Scalar Pipeline
Instruction Fetch (IF)
Instruction Decode (ID)
Execute (EX)
Memory Access (M)
Write Back (W)
Hazards in a Scalar Pipeline
Unified SIMD Pipeline
Broadcast (B1...Bn)
Reduction (R1...Rn)
Number of stages is variable
All instructions go through every stage
Diversified SIMD Pipeline
Separate paths for each instruction type so instructions only go through stages that they use
Stalls less often than a unified pipeline organization
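The difference between the two organizations can be sketched by listing the stages each instruction type passes through. The stage names come from the slides, but the ordering of the broadcast and reduction stages relative to EX, M, and W is an assumption made for illustration, and `stage_path` is a hypothetical helper.

```python
def stage_path(kind, b, r, unified=False):
    """Return the stages an instruction passes through.

    kind    -- "scalar", "parallel", or "reduction"
    b, r    -- number of broadcast and reduction stages
    unified -- True: every instruction visits every stage
               False: diversified, only the stages the instruction uses

    Assumption (not from the slides): broadcast stages sit between ID and EX,
    and reduction stages sit between M and W.
    """
    if kind not in ("scalar", "parallel", "reduction"):
        raise ValueError(f"unknown instruction kind: {kind}")
    broadcast = [f"B{i}" for i in range(1, b + 1)]
    reduction = [f"R{i}" for i in range(1, r + 1)]
    uses_broadcast = unified or kind in ("parallel", "reduction")
    uses_reduction = unified or kind == "reduction"

    path = ["IF", "ID"]
    if uses_broadcast:
        path += broadcast
    path += ["EX", "M"]
    if uses_reduction:
        path += reduction
    path.append("W")
    return path


if __name__ == "__main__":
    for kind in ("scalar", "parallel", "reduction"):
        print(kind, "unified:    ", stage_path(kind, 2, 2, unified=True))
        print(kind, "diversified:", stage_path(kind, 2, 2))
```

In this model a scalar instruction occupies five stages in the diversified pipeline but 5 + b + r stages in the unified one, which is why its result is available sooner and dependent instructions stall less often.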
Hazards
Multithreading
Pipelining alone cannot eliminate hazards caused by broadcast and reduction latencies
Solution: use instructions from multiple threads to keep the pipeline full
Instructions from different threads are independent so they cannot generate stalls due to data dependencies
As long as there are a sufficient number of threads, it is possible to fill any number of stall cycles
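A minimal simulation can make the argument concrete. It assumes a worst case in which every instruction depends on the previous instruction of its own thread, with the result available `latency` cycles after issue, and in which at most one instruction issues per cycle. The round-robin policy and the name `simulate_throughput` are illustrative choices, not the scheduler described in the slides.

```python
def simulate_throughput(num_threads, latency, cycles=1000):
    """Instructions issued per cycle under a simple round-robin model.

    Assumptions (for illustration only):
      * each instruction depends on the previous one in the same thread,
      * a thread may issue again `latency` cycles after its last issue,
      * at most one instruction issues per cycle.
    """
    ready_at = [0] * num_threads  # first cycle at which each thread may issue
    next_thread = 0               # round-robin starting point
    issued = 0
    for cycle in range(cycles):
        for offset in range(num_threads):
            t = (next_thread + offset) % num_threads
            if ready_at[t] <= cycle:
                ready_at[t] = cycle + latency
                next_thread = (t + 1) % num_threads
                issued += 1
                break  # one issue slot per cycle
    return issued / cycles


if __name__ == "__main__":
    for threads in (1, 2, 4, 8):
        print(threads, "threads, latency 4:", simulate_throughput(threads, 4, 400))
```

In this model the steady-state throughput is min(1, threads / latency): one thread with a latency of 4 fills a quarter of the issue slots, while four or more threads fill all of them.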
Types of Multithreading
Coarse-grain multithreading switches to a new thread when the current thread encounters a high latency operation
Fine-grain multithreading switches to a new thread every clock cycle
Simultaneous multithreading can issue instructions from multiple threads in the same clock cycle
For a SIMD processor, fine-grain or simultaneous multithreading is necessary as pipeline stalls are relatively short and occur frequently
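One way to see why short, frequent stalls favor fine-grain switching is a toy cost model. It assumes that a coarse-grain thread switch costs a fixed number of cycles (`switch_penalty`, a hypothetical parameter), that the hardware switches only when doing so is cheaper than waiting, and that fine-grain multithreading always has another ready thread available. None of these numbers come from the slides.

```python
def cycles_lost_per_stall(stall_cycles, policy, switch_penalty=0):
    """Issue slots wasted by one stall under a toy model.

    policy -- "none":   single thread, the whole stall is wasted
              "coarse": switch threads only if cheaper than waiting
              "fine":   another thread issues every cycle, nothing is wasted
    switch_penalty is a hypothetical cost of a coarse-grain thread switch.
    """
    if policy == "none":
        return stall_cycles
    if policy == "coarse":
        return min(stall_cycles, switch_penalty)
    if policy == "fine":
        return 0
    raise ValueError(f"unknown policy: {policy}")


if __name__ == "__main__":
    for stall in (3, 100):
        for policy in ("none", "coarse", "fine"):
            lost = cycles_lost_per_stall(stall, policy, switch_penalty=5)
            print(f"stall={stall:3d} policy={policy:6s} lost={lost}")
```

With a 3-cycle stall and a 5-cycle switch penalty, coarse-grain switching saves nothing, whereas for a 100-cycle stall it recovers most of the lost time. Broadcast and reduction stalls fall in the first category.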
Multithreaded Control Unit
[Figure: block diagram with a Fetch Unit, Instruction Cache, Decode Units 1 through N, Instruction Status Table, Scheduler, and Thread Status Table]
Reduction Hazard with a Single Thread
Reduction Hazard with Multiple Threads
Execution Time vs. Latency
[Figure: Execution Time (cycles, axis 0 to 450) versus Communication Latency (cycles, 1 to 8) for ASC, Multithreaded ASC, and MASC]
Throughput vs. Latency
[Figure: Normalized Throughput (instructions/cycle, axis 0 to 1) versus Communication Latency (cycles, 1 to 8) for ASC, Multithreaded ASC, and MASC]
Multithreaded ASC vs. MASC
A multithreaded ASC computer can execute at most one instruction in a cycle
A MASC computer with j instruction streams can execute up to j instructions in a cycle
In multithreaded ASC each thread can access every PE
In MASC each instruction stream can only access its partition of PEs
A multithreaded MASC computer could combine the advantages of both
[Figure: ASC, Multithreaded ASC, MASC, and Multithreaded MASC]
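The trade-off stated above can be written down as two small functions. The peak issue rates follow the slide directly. The PE counts for MASC assume, purely for illustration, that the PEs are divided equally among the j instruction streams, which the slides do not state. Multithreaded MASC is left out because the slides give no figures for it.

```python
def peak_issue_rate(organization, streams=1):
    """Maximum instructions executed per cycle, as stated on the slide."""
    if organization in ("ASC", "Multithreaded ASC"):
        return 1
    if organization == "MASC":
        return streams
    raise ValueError(f"unknown organization: {organization}")


def reachable_pes(organization, num_pes, streams=1):
    """PEs that a single thread or instruction stream can access.

    Assumption (not from the slides): MASC splits the PEs into equal partitions.
    """
    if organization in ("ASC", "Multithreaded ASC"):
        return num_pes
    if organization == "MASC":
        return num_pes // streams
    raise ValueError(f"unknown organization: {organization}")


if __name__ == "__main__":
    for org in ("ASC", "Multithreaded ASC", "MASC"):
        print(org, peak_issue_rate(org, streams=4), reachable_pes(org, 16, streams=4))
```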
Multithreaded ASC Processor
In order to validate simulation results and estimate hardware costs, a prototype processor was developed
Targeted for an Altera Cyclone II (EP2C35) FPGA
Using an FPGA makes it possible to get detailed measurements of speed and hardware cost
Additional Enhancements
Flags (logical values) are a first-class data type with their own set of registers and instructions
Extra reduction operators
  Count Responders
  Sum
Hardware semaphores for thread synchronization
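The two extra reduction operators can be sketched in software by treating the PE array as a list of per-PE values and a flag register as a list of booleans, one per PE. This is only a functional model: the names `count_responders` and `sum_reduce` are hypothetical, and the sketch ignores the 16-bit word size and the network that performs the reduction in hardware.

```python
def count_responders(flags):
    """Number of PEs whose flag is set (the 'responders')."""
    return sum(1 for f in flags if f)


def sum_reduce(values, flags):
    """Sum of the values held by responding PEs only."""
    if len(values) != len(flags):
        raise ValueError("one flag per PE is required")
    return sum(v for v, f in zip(values, flags) if f)


if __name__ == "__main__":
    # One value per PE; a parallel comparison sets each PE's flag.
    values = [3, 8, 1, 9, 4, 7, 2, 6]
    flags = [v > 5 for v in values]
    print(count_responders(flags))   # PEs holding a value above 5
    print(sum_reduce(values, flags)) # sum of those values
```

In the example, the flags play the role of the responder set produced by an associative search, and each reduction collapses the per-PE data into a single scalar for the control unit.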
Synthesis Results
Targeted for an Altera Cyclone II FPGA (EP2C35)
16 x 16-bit PEs
16 hardware threads
Clock speed: 75 MHz