Multithreaded ASC Kevin Schaffer and Robert A. Walker ASC Processor Group Computer Science Department Kent State University
Date posted: 20-Dec-2015

Page 1: Multithreaded ASC Kevin Schaffer and Robert A. Walker ASC Processor Group Computer Science Department Kent State University.

Multithreaded ASC

Kevin Schaffer and Robert A. Walker

ASC Processor Group, Computer Science Department

Kent State University

Page 2:

Organization of an ASC Computer

[Figure: block diagram showing a Control Unit (Scalar Processor), a Broadcast/Reduction Network, and Processing Elements 1, 2, ..., N]

Page 3:

Broadcast/Reduction Bottleneck

Time to perform a broadcast or reduction increases as the number of PEs increases

Even for a moderate number of PEs, this time can dominate the machine cycle time

Pipelining reduces the cycle time but increases the latency

Additional latency causes pipeline hazards
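
The trade-off in the last two points can be illustrated with a little arithmetic. The sketch below is hypothetical: the function name `pipelined_network`, the 40 ns network delay, and the assumption that the delay splits evenly across stages with no register overhead are illustrative choices, not figures from the slides.

```python
def pipelined_network(total_delay_ns, stages):
    """Split a network delay evenly across pipeline stages (idealized model)."""
    if stages < 1:
        raise ValueError("stages must be at least 1")
    cycle_time_ns = total_delay_ns / stages  # each stage does 1/stages of the work
    latency_cycles = stages                  # a result needs one cycle per stage
    return cycle_time_ns, latency_cycles


if __name__ == "__main__":
    # 40 ns is an assumed end-to-end broadcast/reduction delay.
    for stages in (1, 2, 4, 8):
        cycle_time, latency = pipelined_network(40.0, stages)
        print(f"{stages} stage(s): cycle time {cycle_time:.1f} ns, "
              f"latency {latency} cycle(s)")
```

In this idealized model the time through the network is unchanged; what grows is the number of cycles a dependent instruction has to wait, which is where the extra hazards come from.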

Page 4:

Instruction Types

Scalar instructions: execute entirely within the control unit

Broadcast/Parallel instructions: execute within the PE array; use the broadcast network to transfer instruction and data

Reduction instructions: execute within the PE array; use the broadcast network to transfer instruction and data; use the reduction network to combine data from PEs

Page 5:

Scalar Pipeline

Instruction Fetch (IF)

Instruction Decode (ID)

Execute (EX)

Memory Access (M)

Write Back (W)

Page 6:

Hazards in a Scalar Pipeline

Page 7:

Unified SIMD Pipeline

Broadcast (B1...Bn)

Reduction (R1...Rn)

Number of stages is variable

All instructions go through every stage

Page 8:

Diversified SIMD Pipeline

Separate paths for each instruction type so instructions only go through stages that they use

Stalls less often than a unified pipeline organization
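
One way to picture the difference between the two organizations is to list the stages each instruction type passes through. This is a minimal sketch, not the authors' design: the five scalar stage names come from the scalar pipeline slide, but placing the broadcast stages before EX and the reduction stages before W, as well as the function name `stages_traversed`, are assumptions made for illustration.

```python
def stages_traversed(kind, n_broadcast, n_reduction, diversified):
    """Return the list of stages one instruction passes through.

    kind is "scalar", "parallel", or "reduction". The stage ordering is assumed.
    """
    if kind not in ("scalar", "parallel", "reduction"):
        raise ValueError(f"unknown instruction kind: {kind}")
    broadcast = [f"B{i}" for i in range(1, n_broadcast + 1)]
    reduction = [f"R{i}" for i in range(1, n_reduction + 1)]
    uses_broadcast = kind in ("parallel", "reduction")
    uses_reduction = kind == "reduction"

    path = ["IF", "ID"]
    # Unified: every instruction takes every stage.
    # Diversified: only instructions that use the network take its stages.
    if not diversified or uses_broadcast:
        path += broadcast
    path += ["EX", "M"]
    if not diversified or uses_reduction:
        path += reduction
    path.append("W")
    return path


if __name__ == "__main__":
    for kind in ("scalar", "parallel", "reduction"):
        unified = stages_traversed(kind, 3, 3, diversified=False)
        split = stages_traversed(kind, 3, 3, diversified=True)
        print(f"{kind:9s} unified={len(unified)} stages, diversified={len(split)} stages")
```

Under these assumptions only reduction instructions pay for the full pipeline depth in the diversified organization, so a scalar result is available to a dependent instruction much sooner.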

Page 9:

Hazards

Page 10:

Multithreading

Pipelining alone cannot eliminate hazards caused by broadcast and reduction latencies

Solution: use instructions from multiple threads to keep the pipeline full

Instructions from different threads are independent so they cannot generate stalls due to data dependencies

As long as there are a sufficient number of threads, it is possible to fill any number of stall cycles
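
The effect of adding threads can be sketched with a toy issue model. Everything here is an illustrative assumption rather than the presented hardware: each thread is a chain of dependent instructions, a result becomes available `latency` cycles after issue, at most one instruction issues per cycle, and the hypothetical function `simulate_issue` picks ready threads round-robin.

```python
def simulate_issue(num_threads, instrs_per_thread, latency):
    """Return the number of cycles needed to issue every instruction.

    Toy model: each instruction depends on the previous one in its thread,
    and one instruction (from any ready thread) can issue per cycle.
    """
    remaining = [instrs_per_thread] * num_threads
    ready_at = [0] * num_threads      # cycle at which each thread may issue again
    total = num_threads * instrs_per_thread
    issued = 0
    cycle = 0
    next_thread = 0                   # round-robin starting point
    while issued < total:
        for offset in range(num_threads):
            t = (next_thread + offset) % num_threads
            if remaining[t] > 0 and ready_at[t] <= cycle:
                remaining[t] -= 1
                ready_at[t] = cycle + latency
                issued += 1
                next_thread = (t + 1) % num_threads
                break                 # only one issue per cycle
        cycle += 1                    # advances even when nothing issued (a stall)
    return cycle


if __name__ == "__main__":
    # Same 32 instructions, run as one thread or split across four.
    print("1 thread :", simulate_issue(1, 32, 4), "cycles")
    print("4 threads:", simulate_issue(4, 8, 4), "cycles")
```

With a latency of 4 cycles, a single thread in this model stalls three cycles out of every four, while four threads keep the issue slot busy every cycle.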

Page 11:

Types of Multithreading

Coarse-grain multithreading switches to a new thread when the current thread encounters a high latency operation

Fine-grain multithreading switches to a new thread every clock cycle

Simultaneous multithreading can issue instructions from multiple threads in the same clock cycle

For a SIMD processor, fine-grain or simultaneous multithreading is necessary as pipeline stalls are relatively short and occur frequently

Page 12:

Multithreaded Control Unit

[Figure: block diagram showing a Fetch Unit, Instruction Cache, Decode Units 1 through N, Instruction Status Table, Scheduler, and Thread Status Table]

Page 13:

Reduction Hazard with a Single Thread

Page 14:

Reduction Hazard with Multiple Threads

Page 15:

Execution Time vs. Latency

[Plot: Execution Time (cycles), 0 to 450, against Communication Latency (cycles), 1 to 8; series: ASC, Multithreaded ASC, MASC]

Page 16:

Throughput vs. Latency

[Plot: Normalized Throughput (instructions/cycle), 0 to 1, against Communication Latency (cycles), 1 to 8; series: ASC, Multithreaded ASC, MASC]

Page 17:

Multithreaded ASC vs. MASC

A multithreaded ASC computer can execute at most one instruction in a cycle

A MASC computer with j instruction streams can execute up to j instructions in a cycle

In multithreaded ASC, each thread can access every PE

In MASC each instruction stream can only access its partition of PEs

A multithreaded MASC computer could combine the advantages of both

Page 18:

ASC

Page 19:

Multithreaded ASC

Page 20:

MASC

Page 21:

Multithreaded MASC

Page 22:

Multithreaded ASC Processor

In order to validate simulation results and estimate hardware costs, a prototype processor was developed

Targeted for an Altera Cyclone II (EP2C35) FPGA

Using an FPGA makes it possible to get detailed measurements of speed and hardware cost

Page 23:

Additional Enhancements

Flags (logical values) are a first-class data type with their own set of registers and instructions

Extra reduction operators: Count Responders, Sum

Hardware semaphores for thread synchronization
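
The two extra reduction operators can be pictured in software as follows. This is a sketch under stated assumptions, not a description of the prototype: "responders" is taken to mean the PEs whose flag is set, the Sum operator is assumed to add values only across those responders, and the function names `count_responders` and `sum_responders` are hypothetical.

```python
def count_responders(flags):
    """Count the PEs whose flag is set (one flag per PE)."""
    return sum(1 for flag in flags if flag)


def sum_responders(values, flags):
    """Add the values held by responder PEs; masking by flag is an assumption."""
    if len(values) != len(flags):
        raise ValueError("need exactly one value and one flag per PE")
    return sum(value for value, flag in zip(values, flags) if flag)


if __name__ == "__main__":
    # Four hypothetical PEs: a flag register and a data register in each.
    flags = [True, False, True, True]
    values = [5, 7, 11, 2]
    print("responders:", count_responders(flags))
    print("sum over responders:", sum_responders(values, flags))
```

In hardware both results would be produced by the reduction network and delivered to the control unit as a single scalar; the loops here only model the outcome.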

Page 24:

Synthesis Results

Targeted for an Altera Cyclone II FPGA (EP2C35)

16 PEs, each 16 bits wide

16 hardware threads

Clock speed: 75 MHz