ARA: 64-BIT RISC-V VECTOR IMPLEMENTATION IN 22NM FDSOI · Matheus Cavalcante and Fabian Schuiki |...


Page 1:

Matheus Cavalcante

PhD Student

ETH Zurich

ARA: 64-BIT RISC-V VECTOR IMPLEMENTATION IN 22NM FDSOI

https://tmt.knect365.com/risc-v-summit @risc_v

Fabian Schuiki

PhD Student

ETH Zurich

12.4.2018 Matheus Cavalcante and Fabian Schuiki 1

Page 2:

[Block diagram labels: Interconnect (64b); Ariane (1GHz, 2 DP-GFLOPS, 8 GB/s) with I$ and D$, 64b Instruction and 64b Data ports]

Page 3:

[Block diagram labels: Interconnect (64b); ARA (1GHz, 8 DP-GFLOPS, 8 GB/s) with a 64b Data port; Instruction Queue, ACK/TRAP, and MMU between Ariane and ARA; Ariane (1GHz, 2 DP-GFLOPS, 8 GB/s) with I$ and D$, 64b Instruction and 64b Data ports]

Page 4:

Memory Bandwidth and Performance: ARA and Ariane Rooflines

Operational intensity: operations per byte

Algorithm dependent

One FMA = 2 operations

Memory- and compute-boundness

Memory- and compute-bound regions for ARA

Compute-bound when operational intensity ≥ 1

[Roofline plot annotation on the compute-bound region: "We want to be here!"]
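The roofline bound on this slide can be sketched as follows. This is a minimal illustration using the peak figures from the block diagrams (8 and 2 DP-GFLOPS, 8 GB/s); the function and dictionary names are hypothetical, not taken from the talk.

```python
def attainable_performance(op_intensity, peak_gflops, bandwidth_gbs):
    """Roofline bound in GFLOPS for a kernel with `op_intensity` operations per byte."""
    return min(peak_gflops, op_intensity * bandwidth_gbs)


def ridge_point(peak_gflops, bandwidth_gbs):
    """Operational intensity at which a kernel stops being memory-bound."""
    return peak_gflops / bandwidth_gbs


# Peak numbers as stated on the block-diagram slides.
ARA = {"peak_gflops": 8.0, "bandwidth_gbs": 8.0}
ARIANE = {"peak_gflops": 2.0, "bandwidth_gbs": 8.0}

for oi in (0.125, 0.25, 1.0, 4.0):
    print(oi, attainable_performance(oi, **ARA), attainable_performance(oi, **ARIANE))
```

With these figures the ridge point for ARA sits at 1 operation per byte, which matches the "operational intensity ≥ 1" condition on the slide.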

Page 5:

ARA: High-performance vector processor

GlobalFoundries' 22FDX process

Master's Thesis (+ a few months of ongoing PhD studies!)

Planning to open-source it within the PULP platform (as usual!)

Snapshot of the current development

Challenges we faced

Results we achieved

Insights we gained

Page 6:

Vector processors background

Vector processing: SIMD

Less instruction bandwidth, simpler control, less energy per operation

Packed-SIMD vs. vector ("Cray-like") SIMD

Cray-1 (1977)

Hwacha

Vector-fetch architecture

More complex: the vector unit fetches its own instructions and threads can diverge

64 DP-GFLOPS in TSMC 16nm

40 DP-GFLOPS/W in ST 28nm FD-SOI

[Figure: pipeline timeline over time t with FETCH, DECODE, and EXECUTE stages]

Page 7:

RISC-V “V” Extension

RISC-V “V” Extension: “Cray-like” vector-SIMD approach

ARA: based on version 0.4-DRAFT

No full compliance

No support for fixed-point and vector atomics – not our focus

Limited support for type promotions (e.g., 64b ← 8b + 8b) – hardware cost

Eventually dropped in later versions of the Extension

[Timeline over time t: 0.4-DRAFT (Nov. 2017), 0.5-DRAFT (May 2018), 0.6-DRAFT (Dec. 2018); Master's Thesis (Feb. 2018), PhD (Aug. 2018)]

Page 8:

ARA Microarchitecture

Page 9:

Main datapath element: FMA Units

We support masked FMA instructions (four operands):

vmadd vd, vsa, vsb, vsc, vmask

vd[i] = lsb(vmask[i]) ? vsa[i] * vsb[i] + vsc[i] : 0;

FMA is pipelined (5 cycles) to meet the fmin constraint

The four lanes operate in lockstep

Low control overhead

Each lane gets 64b operands from four 256b input FIFO buffers (A, B, C, VMASK)

Number of lanes determines buffer width
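The masked FMA semantics can be sketched in a few lines. This assumes the FMA computes a·b + c, as the reduction slide later states, and that masked-off elements are written as zero. The Python function is a hypothetical software model, not the hardware; plain Python arithmetic rounds twice, whereas a true fused multiply-add rounds once.

```python
def vmadd(vsa, vsb, vsc, vmask):
    """Element-wise masked multiply-add: vd[i] = vsa[i] * vsb[i] + vsc[i] if lsb(vmask[i]) else 0."""
    if not (len(vsa) == len(vsb) == len(vsc) == len(vmask)):
        raise ValueError("all operands must have the same length")
    # Only the least-significant bit of each mask element enables the lane.
    return [a * b + c if (m & 1) else 0.0
            for a, b, c, m in zip(vsa, vsb, vsc, vmask)]


vd = vmadd([1.0, 2.0, 3.0, 4.0],
           [10.0, 10.0, 10.0, 10.0],
           [0.5, 0.5, 0.5, 0.5],
           [1, 0, 3, 2])
print(vd)  # [10.5, 0.0, 30.5, 0.0]
```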

Page 10:

Operand FIFO queues

Queues are needed to sustain maximum throughput for the lock-step operation of the FUs, while hiding the latency caused by banking conflicts in the VRF

One input queue buffer provides one operand to all (four) lanes

256b (4x64b) wide entries

One FIFO buffer per operand per multi-lane datapath unit: 10 FIFO buffers

Output queue buffers for output operands, one per multi-lane datapath unit

Page 11:

Vector register file

256b banks: one bank stores 4 operands consumed in parallel by the 4 lanes

8 banks: banking factor (BF) of 1.6 for the worst-case read BW (FMA is 4R+1W/cycle)

[Figure: eight 256b banks (Bank 0 to Bank 7) showing where the element groups of vector registers v0 to v3 are stored]

Page 12:

Vector Register File and Operand-Delivery Interconnect

All-to-all input log-interconnect

256b wide (64b x 4)

8 sources (VRF banks) x 10 destinations (FIFO buffers)

Registered boundaries (for timing)

All-to-all output log-interconnect

256b wide (64b x 4)

4 sources (output FIFO buffers) x 8 destinations (VRF banks)

VRF is built as 1RW SRAM banks

Fixed-priority arbiter: $P_M > P_A > P_B > P_C$

Writes have lower priority than reads, unless the output queue is full

Page 13:

Execution of an FMA instruction

Consider the execution of the following instruction

vmadd vd, vsa, vsb, vsc, vsmask

We take a vector of length 256 (ideally 64 cycles to run)

Pages 14-24:

Execution of an FMA instruction

[Animation: operand requests for Mask (M), A, B, and C pass through the priority arbiter to VRF banks 0 to 7 and on to the FMA. Each step shows one bank number per operand, the cycle count, and the FMA utilization so far.]

Cycle 1: the first 4 elements of all 4 operands are in Bank 0

3 accesses stalled due to banking conflicts

Page    Cycle count    Bank per operand (M A B C)    Extra value (unlabeled)    FMA utilization
14      1              0 0 0 0                                                  0/1 = 0%
15      2              1 0 0 0                                                  0/2 = 0%
16      3              2 1 0 0                                                  0/3 = 0%
17      7              6 5 4 3                                                  1/7 = 14%
18      8              7 6 5 4                                                  2/8 = 25%
19      12             3 2 1 0                       0                          6/12 = 50%
20      13             4 3 2 1                       0                          7/13 = 54%
21      14             5 4 3 2                       1                          8/14 = 57%
22      15             6 5 4 3                       2                          9/15 = 60%
23      69                                           0                          63/69 = 91%
24      70                                           1                          64/70 = 91%
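The staggered start seen in this animation can be reproduced with a small model of the fixed-priority arbiter. This is a sketch under stated assumptions: every operand register starts in Bank 0, consecutive groups of 4 elements sit in consecutive banks, each 1RW bank serves one request per cycle, and writes are ignored. All names are hypothetical.

```python
NUM_BANKS = 8
PRIORITY = ["M", "A", "B", "C"]  # highest priority first, as on the arbiter slide


def simulate_reads(num_groups, num_banks=NUM_BANKS):
    """Return {operand: [cycle of each granted read]} under fixed-priority arbitration."""
    next_group = {op: 0 for op in PRIORITY}
    grants = {op: [] for op in PRIORITY}
    cycle = 0
    while any(next_group[op] < num_groups for op in PRIORITY):
        cycle += 1
        busy = set()  # banks already granted during this cycle
        for op in PRIORITY:
            if next_group[op] >= num_groups:
                continue  # this operand has been fully read
            bank = next_group[op] % num_banks
            if bank not in busy:
                busy.add(bank)
                grants[op].append(cycle)
                next_group[op] += 1
            # otherwise the request stalls and is retried next cycle
    return grants


# 256 elements over 4 lanes give 64 groups of 4 elements per operand.
grants = simulate_reads(num_groups=256 // 4)
first = {op: grants[op][0] for op in PRIORITY}
last = {op: grants[op][-1] for op in PRIORITY}
print(first)
print(last)
```

In this model the conflicts only cost the first three cycles: once the four operands are one bank apart they never collide again. The slides report 70 cycles in total; the cycles beyond the last operand read are not modeled here.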

Page 25:

Hardware Support for Vector Reductions

$y = \sum_i A_i \cdot B_i$

Triggered by VMADD instruction with scalar result register

Executed on FMA units by feeding results back in as operand C

FMA: $y = a \cdot b + c$, 5 cycles latency

E.g. reduction of a 64-element vector:

Phase                         Cycles    Result                                   Utilization
Accumulation                  16        20 partial sums (5 latency x 4 lanes)    100%
Merge results in each lane    17        4 partial sums (4 lanes)                 24%
Merge lanes                   12        1 final sum                              6.3%

Final result written to regfile

Avg. utilization in this case 36% ($\frac{N}{N + 4 \cdot 29}$, $N = 64$)

[Figure: FMA unit with inputs A, B, C and output Y; per-lane timeline (Lane 1 to Lane 4) over cycles]
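The per-phase utilization figures can be checked with a short calculation. This is a sketch that assumes each merge of two partial sums counts as one useful operation and that four lanes are available in every cycle. That counting is an assumption, chosen because it reproduces the slide's numbers.

```python
LANES = 4
FMA_LATENCY = 5


def phase_utilization(useful_ops, cycles, lanes=LANES):
    """Fraction of the lane-cycles in a phase that perform useful work."""
    return useful_ops / (cycles * lanes)


n = 64
partial_sums = FMA_LATENCY * LANES  # 20 partial sums after the accumulation phase
phases = [
    ("accumulation", n, 16),                           # one FMA per element
    ("merge within lanes", partial_sums - LANES, 17),  # 20 -> 4 partial sums
    ("merge lanes", LANES - 1, 12),                    # 4 -> 1 final sum
]
for name, ops, cycles in phases:
    print(f"{name}: {phase_utilization(ops, cycles):.2%}")
```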

Page 26:

Ways to Improve Reductions

Current Implementation (constant 29 cycle tail):

Utilization is $\frac{N}{N + 4 \cdot 29}$ (36% for 64-element vector)

Future Improvement A: low hardware cost (more complex FSM)

Schedule FPU operations of next instruction in gaps of the reduction

Utilization improves to $\frac{N}{N + 19}$ (77% for 64-element vector)

Future Improvement B: significant hardware cost (separate FP ADD unit)

Add separate reduction adder

Utilization improves up to 100%, since the reduction tail no longer eats away performance

[Figures: per-lane timelines (Lane 1 to Lane 4) over cycles. One shows Instruction A and Instruction B; the other adds a Reduction Adder row and shows Instructions A, B, and C]
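The two utilization formulas can be compared for different vector lengths. This is a sketch that takes the constants (29-cycle tail, 4 lanes, 19 extra operations) directly from the slide; the function names are hypothetical.

```python
def utilization_current(n, lanes=4, tail_cycles=29):
    """N / (N + 4 * 29): the tail blocks all lanes for a constant number of cycles."""
    return n / (n + lanes * tail_cycles)


def utilization_improvement_a(n, extra_ops=19):
    """N / (N + 19): only the merge operations themselves occupy the FPUs."""
    return n / (n + extra_ops)


for n in (64, 256, 1024):
    print(n, f"{utilization_current(n):.0%}", f"{utilization_improvement_a(n):.0%}")
```

Because the tail is constant, both formulas approach 100% as N grows; the gap matters most for short vectors.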

Page 27:

Benchmarks

Page 28:

ARA and Ariane – Peak performance

[Roofline plot of attainable performance: ARA is compute-bound at 8 op/cycle, Ariane is compute-bound at 2 op/cycle. Annotation: "Performances we must achieve!"]

Page 29:

Benchmarks

Can we achieve 8 GFLOPS peak performance?

Upper bound: four FMAs working at 100%

Three key kernels:

Multiply-add (DAXPY): heavily memory-bound

Convolution (DCONV): compute-bound

Matrix multiplication (DGEMM): compute-bound

Cycle-accurate simulation from the RTL

We ignore the initial set-up cycles (around 40 cycles)

Startup, instruction fetch, decoding, vector unit configuration…

Page 30:

DAXPY: $Y \leftarrow aX + Y$

Strip-mined loop over the $n$ elements of vectors $x$ and $y$

Operational intensity

$3 \times 8n = 24n$ bytes of memory transfers

$2n$ operations (multiply-adds)

$\frac{2n}{24n} = \frac{1}{12}$ operations per byte

Memory-bound

We'll be far from 8 GFLOPS!

But are we close to the performance limit?
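A strip-mined DAXPY loop and its operational intensity can be sketched as follows. This is a plain software model, with `max_vl` standing in for the hardware's maximum vector length; the function names and that parameter are hypothetical.

```python
def daxpy_stripmined(a, x, y, max_vl):
    """Compute y <- a*x + y in place, in strips of at most max_vl elements."""
    n = len(x)
    if len(y) != n:
        raise ValueError("x and y must have the same length")
    i = 0
    while i < n:
        vl = min(max_vl, n - i)   # vector length granted for this strip
        xs = x[i:i + vl]          # stands in for a vector load of x
        ys = y[i:i + vl]          # stands in for a vector load of y
        y[i:i + vl] = [a * xe + ye for xe, ye in zip(xs, ys)]  # FMA, then store
        i += vl
    return y


def daxpy_operational_intensity(n, bytes_per_element=8):
    """2n operations over 3 * 8n bytes (load x, load y, store y)."""
    return (2 * n) / (3 * bytes_per_element * n)


print(daxpy_stripmined(2.0, [1.0, 2.0, 3.0, 4.0, 5.0], [10.0] * 5, max_vl=2))
print(daxpy_operational_intensity(1024))
```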

Page 31:

DAXPY: Performance

Vector Length    FPU Utilization (%)    Performance (op/cycle)
32               5.6                    0.45
64               6.6                    0.53
128              7.3                    0.59
256              7.7                    0.62
512              8.0                    0.64
1024             8.1                    0.65

We achieve what we could in terms of performance

Can't expect 8 GFLOPS from a memory-bound kernel

Ops/cycle grows to 8 if we increase the memory port width (e.g. 128b → 2x perf)
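The measured DAXPY numbers can be set against the memory-bound limit. This is a sketch assuming a 64b memory port (8 bytes per cycle), a peak of 8 operations per cycle, and the operational intensity of 1/12 derived for DAXPY. The names are hypothetical; the measurements are copied from the table.

```python
PEAK_OPS_PER_CYCLE = 8.0    # four FMA lanes, 2 operations each
MEM_BYTES_PER_CYCLE = 8.0   # 64b memory port
DAXPY_OI = 1 / 12           # operations per byte

MEASURED = {32: 0.45, 64: 0.53, 128: 0.59, 256: 0.62, 512: 0.64, 1024: 0.65}


def roofline_bound(oi, bytes_per_cycle=MEM_BYTES_PER_CYCLE, peak=PEAK_OPS_PER_CYCLE):
    """Attainable operations per cycle for a kernel with intensity `oi`."""
    return min(peak, oi * bytes_per_cycle)


bound = roofline_bound(DAXPY_OI)
for vl, perf in MEASURED.items():
    print(vl, f"{perf / bound:.1%} of the memory-bound limit")

# Doubling the port width to 128b doubles the bound, as the slide notes.
print(roofline_bound(DAXPY_OI, bytes_per_cycle=16.0))
```

Under these assumptions the limit is about 0.67 op/cycle, so the long-vector measurements sit within a few percent of it.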

Page 32:

DCONV: $Y = K * X$

Kernel particular to CNNs

Convolution kernel size: 7 channels, each 3 × 3

Image size: 7 channels, each $n$ × 1

Operational intensity

$2 \times 8 \times 7n = 112n$ bytes of memory transfers

$882n$ operations (multiply-adds)

$\frac{882n}{112n} = 7.875$ operations per byte

Compute-bound kernel

It should be possible to achieve 8 ops/cycle

Scheduling is key
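The DCONV operation and byte counts can be reproduced with a short calculation. This sketch assumes 7 output channels in addition to the 7 input channels. That is an assumption, made because 2 × 7 × 7 × 9 gives the slide's 882 operations per column; the function name is hypothetical.

```python
def dconv_counts(n, channels=7, kernel_h=3, kernel_w=3, bytes_per_element=8):
    """Return (operations, bytes transferred, operations per byte) for an n-column image."""
    # Assumption: as many output channels as input channels.
    ops = 2 * channels * channels * kernel_h * kernel_w * n
    # Input image in, output image out: 2 x 8 bytes x 7 channels x n.
    traffic = 2 * bytes_per_element * channels * n
    return ops, traffic, ops / traffic


print(dconv_counts(1))
print(dconv_counts(256))
```

The intensity is independent of n and well above the ridge point of 1 operation per byte.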

Page 33:

DCONV: Performance

Vector Length    FPU Utilization (%)    Performance (op/cycle)
32               19.8                   1.58
64               36.1                   2.89
128              61.5                   4.92
256              82.1                   6.57
512              82.3                   6.59
1024             82.5                   6.60

Banking conflicts and extra scalar code (e.g. pointer calculation) limit performance

Performance goes up until the strip-mining loop comes into play

Unroll strip-mining: programmability?

Hard to hide all the memory transfers (initial loads and final stores)

Page 34:

DGEMM: $C \leftarrow \alpha AB + \beta C$

BLAS-3 routine

Common kernel in several applications

High data reuse

When the kernel is compute-bound, it should be possible to achieve 8 ops/cycle

Operational intensity

$8 \times 3n^2 = 24n^2$ bytes of memory transfers

$2n^3$ operations (multiply-adds)

$\frac{2n^3}{24n^2} = \frac{n}{12}$ operations per byte

If $n \le 12$, kernel is memory-bound by ARA's VLSU
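The DGEMM intensity grows with the matrix size, which a short calculation makes concrete. This is a sketch assuming 8 bytes per cycle of memory bandwidth and a peak of 8 operations per cycle; the function names are hypothetical.

```python
def dgemm_operational_intensity(n, bytes_per_element=8):
    """2n^3 operations over 3 * 8 * n^2 bytes, i.e. n / 12 operations per byte."""
    return (2 * n ** 3) / (3 * bytes_per_element * n ** 2)


def dgemm_bound(n, peak=8.0, bytes_per_cycle=8.0):
    """Roofline bound in operations per cycle for an n x n DGEMM."""
    return min(peak, dgemm_operational_intensity(n) * bytes_per_cycle)


for n in (6, 12, 24, 256):
    print(n, dgemm_operational_intensity(n), dgemm_bound(n))
```

At n = 12 the intensity reaches 1 operation per byte, which is the ridge point under these assumptions; smaller matrices stay below the 8 op/cycle roof.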

Page 35:

DGEMM: Performance

Vector Length    FPU Utilization (%)    Performance (op/cycle)
32               19.2                   1.54
64               37.8                   3.02
128              70.3                   5.62
256              84.7                   6.77
512              85.5                   6.84
1024             86.3                   6.91

We see the same phenomena seen with DCONV

Initial banking conflicts and scalar code limit performance with shorter vectors

Strip-mining and unmaskable memory transfers limit steady-state performance

Page 36:

So... can we achieve 8 GFLOPS?

With a vector length of 256, ARA is less than 10% from the roofline on key compute kernels!

Page 37:

Implementation results

Page 38:

ARA: GF 22FDX 1GHz implementation (SS, 0.72V, 125 °C)

[Floorplan with dimension labels 1.3mm and 0.7mm: Ariane (36%) with D$ and I$; ARA (64%) with VRF]

Page 39:

Ara and Ariane – Area breakdown

ARA (0.49mm²) is 1.8 × bigger than Ariane (0.27mm²)…

and has 4 × its computational power

Operation density:

Ariane: 7.27 GFLOPS/mm²

ARA: 16.23 GFLOPS/mm²

Critical path: around 45 gates

Through the arbiter (lower-priority requests)
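The operation density is peak performance divided by area, sketched below with the rounded areas printed on the slide. With those rounded inputs the quotients come out slightly higher than the slide's values; presumably the slide used unrounded areas, which is an assumption. The function name is hypothetical.

```python
def operation_density(peak_gflops, area_mm2):
    """Peak GFLOPS per square millimetre."""
    return peak_gflops / area_mm2


ariane = operation_density(2.0, 0.27)
ara = operation_density(8.0, 0.49)

print(f"Ariane: {ariane:.2f} GFLOPS/mm2")
print(f"ARA:    {ara:.2f} GFLOPS/mm2")
print(f"Area ratio:    {0.49 / 0.27:.1f}x")
print(f"Density ratio: {ara / ariane:.1f}x")
```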

Page 40:

Conclusions

Page 41:

Shuffling instructions

Higher operational intensity → minimize data transfers

By shuffling and reordering data inside vector registers

Only two* instructions available

vslide: vd{i} = vs1{i + rs2}

vrgather: vd{i} = vs1{ vs2{i} }

Register-gather is too general → hard to optimize!

Dedicated instructions for more specific shuffling: permutations, rotations?

*(three, more recently, as vslide was split into vslideup and vslidedown)
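The two shuffling instructions can be modeled on Python lists. This is a sketch of the semantics as written on the slide; returning 0 for out-of-range source indices is an assumption made here for illustration, not something the slide specifies.

```python
def vslide(vs1, rs2):
    """vd[i] = vs1[i + rs2]; out-of-range sources yield 0 (assumption)."""
    n = len(vs1)
    return [vs1[i + rs2] if 0 <= i + rs2 < n else 0 for i in range(n)]


def vrgather(vs1, vs2):
    """vd[i] = vs1[vs2[i]]; out-of-range indices yield 0 (assumption)."""
    n = len(vs1)
    return [vs1[j] if 0 <= j < n else 0 for j in vs2]


v = [10, 11, 12, 13]
print(vslide(v, 1))               # [11, 12, 13, 0]
print(vrgather(v, [3, 2, 1, 0]))  # [13, 12, 11, 10]
print(vrgather(v, [1, 2, 3, 4]))  # [11, 12, 13, 0]
```

The last call shows why register-gather is the more general instruction: a slide is just a gather with a regular index vector, whereas hardware cannot assume any regularity for an arbitrary one.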

Page 42:

Decoupling between scalar and vector units

We did benefit from decoupling the scalar and the vector unit

Different “worlds”

Scalar unit: speculative, several in-flight instructions, latch-based register file

Vector unit: non-speculative, a few in-flight vector instructions, SRAM-based register file

We see with apprehension ISA decisions that push towards their recoupling

E.g., the recent decision of mapping the vector registers over the floating-point registers

Page 43:

Mixed-precision

ARA supports mixed-precision to a certain extent

Previous versions allowed for mixed-precision instructions such as 64b ← 8b + 8b

8b, 16b and 32b operands could be promoted to 64b operands

High hardware cost!

We now allow for a more restricted set of type promotions

8b → 16b, 16b → 32b and 32b → 64b

Aligned with newer revisions of the V Extension

[Figure labels: 8b, 64b]
