
Hwacha V4: Decoupled Data Parallel Custom Extension

Colin Schmidt, Albert Ou, Krste Asanović

UC Berkeley

Introduction

• Data-parallel custom ISA extension to RISC-V with a configurable, ASIC-focused implementation

• Developed to research energy-efficient implementations of vector architectures

• Several versions over the years, with V4 the most recent

• Rocket Chip-based accelerator with binutils, LLVM backend, FireSim, and ASIC VLSI support

• Open-sourced now!

2

Key Ideas in Vector Architectures

• Runtime-variable vector length register

  • Compact, portable code

  • Hardware implementation scalability

  • Reasonable encoding space usage

• Long vectors with temporal and spatial execution

• Scalar-vector computation

• Configurable register file

  • Trade architectural registers for longer vector lengths

3

Hwacha Key Ideas

• Goal: Maximize efficiency of in-order vector microarchitecture

• Access/execute decoupling

• Precision-reconfigurable vector RF to increase vector lengths

• Predication and consensual branches to enable control flow

• Mixed scalar-vector computation

• Mixed-precision registers and datapaths

• OS support with unified virtual memory and restartable exceptions

4

Hwacha Architecture Overview

• Configurable vector data and predicate register files

• Fixed number of address and scalar registers

• Unit-stride, constant-stride, and indexed loads/stores (access patterns sketched in C below)

• Full set of half-, single-, and double-precision floating-point, integer, and predicate operations

5
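
For reference, the three memory access patterns correspond to the following C idioms; the function and array names are made up for this sketch and are not from the slides.

    /* Illustrative C equivalents of Hwacha's three vector memory access patterns. */
    void access_patterns(int n, int stride, const float *src, const int *idx,
                         float *unit, float *strided, float *gathered) {
        for (int i = 0; i < n; i++) {
            unit[i]     = src[i];            /* unit-stride load/store */
            strided[i]  = src[i * stride];   /* constant-stride load   */
            gathered[i] = src[idx[i]];       /* indexed (gather) load  */
        }
    }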

Vector-Fetch Programming Model

• Separate vector instructions from scalar control thread

• Vector-fetch instruction, vf, sends address from control thread to vector unit

• Vector unit fetches and decodes separate instruction stream

• Control processor runs ahead enqueuing more work, hiding latency

[Diagram: control processor (CP) and vector instruction cache (VI$) feeding the Vector Execution Unit within the Vector Unit]

6

Hwacha Vector-Fetch Architectural Paradigm

7

Register assignments: a0: n, a1: a, a2: *x, a3: *y

C source:

    for (i = 0; i < n; i++) {
        y[i] = a*x[i] + y[i];
    }

Control thread (scalar stripmine loop on the control processor):

        vmcs   vs0, a1        # copy scalar a into vector-scalar register vs0
    stripmine:
        vsetvl t0, a0         # set vector length: t0 = min(a0, max hardware vlen)
        vmca   va0, a2        # pass pointer x in vector address register va0
        vmca   va1, a3        # pass pointer y in vector address register va1
        vf     saxpy          # vector fetch: hand the saxpy block to the vector unit
        slli   t1, t0, 2      # bytes consumed this iteration (t0 words x 4)
        add    a2, a2, t1     # advance x
        add    a3, a3, t1     # advance y
        sub    a0, a0, t0     # decrement remaining element count
        bnez   a0, stripmine

Worker thread (vector-fetch block, fetched through the VI$ and run by the Vector Execution Unit in the Vector Unit):

    saxpy:
        vlw      vv0, va0            # vector load x
        vlw      vv1, va1            # vector load y
        vfma.svv vv1, vs0, vv0, vv1  # vv1 = vs0*vv0 + vv1  (y = a*x + y)
        vsw      vv1, va1            # vector store y
        vstop                        # end of vector-fetch block
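
For readers more comfortable in C, here is a minimal sketch of what the control-thread loop above does; MAXVL stands in for the hardware vector length returned by vsetvl, and its value here is purely illustrative.

    /* Stripmined SAXPY in C, mirroring the control thread above.
       MAXVL is illustrative; the real value comes from vsetvl/vsetcfg. */
    #include <stddef.h>

    #define MAXVL 64

    void saxpy_stripmined(size_t n, float a, const float *x, float *y) {
        while (n > 0) {
            size_t vl = n < MAXVL ? n : MAXVL;  /* vsetvl t0, a0 */
            for (size_t i = 0; i < vl; i++)     /* work done by one vf saxpy block */
                y[i] = a * x[i] + y[i];
            x += vl;                            /* add a2, a2, t1 */
            y += vl;                            /* add a3, a3, t1 */
            n -= vl;                            /* sub a0, a0, t0 */
        }
    }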


Hwacha Microarchitecture

11

[Slides 11–16 build up the Hwacha microarchitecture block diagram; the figures are not captured in this transcript.]

Decoupling

17

• Exploit regularity of constant-stride vector memory accesses for aggressive yet accurate prefetching

• Extensive decoupling in microarchitecture to tolerate latency and resolve memory references as early as possible

Decoupling

Before stripmine loop, control thread pushes scalar register write to VXU command queue

18

Decoupling

Vector length updates, address register writes, and vector fetch commands pushed to both queues for each stripmine iteration

19

Decoupling

VRU prefetches instructions and pre-executes vector loads, refilling L2 cache ahead of VMU access

20

Decoupling

VMU moves data from L2 cache into vector register file when eventually needed; overlaps with VXU computation

21

Decoupling

Control processor runs ahead of vector unit to next iteration

• Performs address computations

• Pushes next vector fetch command

22

Decoupling

VRU prefetches next iteration while VXU remains busy with previous one

23
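
As a rough mental model of this walkthrough (not Hwacha's actual interfaces; the queue and command types below are invented for illustration), the control thread acts as a producer that pushes each stripmine iteration's commands into both the VXU and VRU command queues, and the two units drain their queues independently, which is what lets the VRU prefetch iteration i+1 while the VXU is still computing iteration i.

    /* Toy producer/consumer model of the decoupled command queues.
       All names are illustrative; real Hwacha commands carry opcodes,
       vector lengths, and register indices. */
    #include <stdio.h>

    #define QDEPTH 4

    typedef struct { int iter; } cmd_t;   /* one stripmine iteration's vector-fetch command */
    typedef struct { cmd_t buf[QDEPTH]; int head, tail, count; } queue_t;

    static int push(queue_t *q, cmd_t c) {
        if (q->count == QDEPTH) return 0;          /* full: control thread stalls */
        q->buf[q->tail] = c;
        q->tail = (q->tail + 1) % QDEPTH;
        q->count++;
        return 1;
    }

    static int pop(queue_t *q, cmd_t *c) {
        if (q->count == 0) return 0;               /* empty: unit idles */
        *c = q->buf[q->head];
        q->head = (q->head + 1) % QDEPTH;
        q->count--;
        return 1;
    }

    int main(void) {
        queue_t vxu_q = {0}, vru_q = {0};
        cmd_t c;
        /* Control thread runs ahead, enqueuing several iterations of work... */
        for (int i = 0; i < 3; i++) {
            c.iter = i;
            push(&vxu_q, c);                       /* command for the VXU */
            push(&vru_q, c);                       /* same command mirrored to the VRU */
        }
        /* ...the VRU drains its queue early (prefetching), the VXU drains later (computing). */
        while (pop(&vru_q, &c)) printf("VRU prefetches iteration %d\n", c.iter);
        while (pop(&vxu_q, &c)) printf("VXU executes   iteration %d\n", c.iter);
        return 0;
    }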

Reconfigurable Vector Register File

• Software specifies the number of vector data and predicate registers

• Hardware dynamically remaps elements to physical storage

Example: vsetcfg 4 → vlen = 2; vsetcfg 2 → vlen = 4 (same physical storage)

• Maximum hardware vector length automatically extends to fill available capacity

• Exchange unused architectural registers for longer hardware vectors

24
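
A minimal sketch of this trade-off, assuming the hardware vector length scales inversely with the number of configured registers over a fixed element capacity; the CAPACITY value matches the slide's toy example, and the real calculation also accounts for predicate registers, element widths, and banking.

    /* Toy model: hardware vector length vs. configured register count.
       CAPACITY and the linear formula are illustrative, not Hwacha's exact sizing. */
    #include <stdio.h>

    #define CAPACITY 8   /* physical elements in the slide's example register file */

    int main(void) {
        for (int nregs = 1; nregs <= 4; nregs <<= 1)
            printf("vsetcfg %d -> vlen = %d\n", nregs, CAPACITY / nregs);
        return 0;
    }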

Mixed-Precision Support

• Configurable element widths for individual vector registers

• Subword register packing and alignment of mixed-precision operands are managed by hardware – software remains fully oblivious

Example: vsetcfg 1, 1 → vlen = 5

25

Vector Lane Organization

• Compact register file of four 1R/1W SRAM banks

• Per-bank integer ALU/PLU

• Two independently scheduled FMA clusters

  • Total of four double-precision FMAs per cycle

• Pipelined integer multiplier

• Variable-latency decoupled functional units

  • Integer divide

  • Floating-point divide with square root

  • Reduction

26


Sequencer

• Instruction window

• A group of 8 elements striped across all four banks forms the atomic unit of scheduling within a lane

• Issues to the expander after all hazards are resolved for an element group

• Vector ILP: element groups from different instructions can issue/complete out of order

• Expander converts sequencer ops into μops, the low-level control signals that drive the lane datapath

32

Systolic Bank Execution

• Sustains n operands/cycle after n-cycle initial latency

• Stall-free cascade of μops (“fire and forget”) after clearing hazards

• Striped element mapping avoids bank conflicts

33

[Figure: successive μops cascading across Bank 0–Bank 3 over Cycles 1–5]
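
A minimal simulation of the cadence sketched in the figure, assuming elements are striped across the four banks (element e in bank e mod 4) and each μop visits one bank per cycle, one bank behind the previous μop; this is a conceptual model, not the RTL schedule, and the μop count of five simply matches the figure's five cycles.

    /* Toy model of systolic bank execution: each uop visits banks 0..3 on
       successive cycles, one bank behind the previous uop, so all four banks
       stay busy with different uops (no conflicts) once the cascade fills. */
    #include <stdio.h>

    #define BANKS 4
    #define UOPS  5

    int main(void) {
        /* uop u reads bank b on cycle u + b + 1 (1-indexed cycles, as in the figure) */
        for (int cycle = 1; cycle <= UOPS + BANKS - 1; cycle++) {
            printf("cycle %d:", cycle);
            for (int u = 0; u < UOPS; u++) {
                int b = cycle - 1 - u;          /* bank this uop occupies now */
                if (b >= 0 && b < BANKS)
                    printf("  uop%d@bank%d", u, b);
            }
            printf("\n");
        }
        return 0;
    }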

Vector Memory Unit

• 128-bit TileLink2 interface to L2 cache

• Coalesced accesses for unit-stride memory operations

• Optimizations to handle response reordering for loads:

  • Out-of-order writeback mechanism to minimize buffering

  • Support for “tailgating” overlapped load operations

34
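
As a back-of-the-envelope illustration of why coalescing matters (the element counts below are assumptions, not measured figures): a 128-bit interface carries four 32-bit or two 64-bit elements per beat, so a coalesced unit-stride access of vlen elements needs roughly ceil(vlen × width / 128) beats rather than one request per element.

    /* Rough beat count for a coalesced unit-stride access over a 128-bit port.
       Purely illustrative arithmetic; ignores alignment and cache-line effects. */
    #include <stdio.h>

    static int beats(int vlen, int elem_bits) {
        int bits = vlen * elem_bits;
        return (bits + 127) / 128;      /* ceil(bits / 128) */
    }

    int main(void) {
        printf("64 x 32-bit elements: %d beats (vs. 64 element-sized requests)\n",
               beats(64, 32));
        printf("64 x 64-bit elements: %d beats\n", beats(64, 64));
        return 0;
    }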

Benchmarks/Tape-outs

• Silicon-proven: 10+ tape-outs

• Peak power efficiency: 40+ DP-GFLOPS/W in ST 28nm FD-SOI

• Peak floating-point performance: 64+ DP-GFLOPS, 128+ SP-GFLOPS, 256+ HP-GFLOPS in TSMC 16nm 5x5mm

• >95% of peak performance on DGEMM

• 10x speedup over Rocket on a Caffe NN workload with a single lane

35
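
For orientation (the configurations below are hypothetical; the slide does not state the lane count or clock behind this figure): with four double-precision FMAs per lane per cycle and 2 FLOPs per FMA, peak DP throughput is lanes × 8 FLOPs per cycle, so 64 DP-GFLOPS corresponds, for example, to 4 lanes at 2 GHz or 8 lanes at 1 GHz.

    /* Peak-FLOPS arithmetic: lanes x 4 DP FMAs/cycle x 2 FLOPs/FMA x clock.
       The (lanes, clock) pairs are hypothetical examples, not chip specs. */
    #include <stdio.h>

    static double peak_dp_gflops(int lanes, double clock_ghz) {
        return lanes * 4 * 2 * clock_ghz;
    }

    int main(void) {
        printf("4 lanes @ 2.0 GHz: %.0f DP-GFLOPS\n", peak_dp_gflops(4, 2.0));
        printf("8 lanes @ 1.0 GHz: %.0f DP-GFLOPS\n", peak_dp_gflops(8, 1.0));
        return 0;
    }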

Hwacha Chip Chronology

[Timeline figure spanning 2011–2018. Die photo: 2.8x2.8mm, ST 28nm FD-SOI. Chips and Hwacha versions, in order: Raven-1 (HV1), Raven-2 (HV2), Raven-3 (HV3), Raven-4 (HV3), EOS14 (HV2), EOS16 (HV2), EOS18 (HV2), EOS20 (HV2), EOS22 (HV3), EOS24 (HV3.5), Hurricane-1 (2-core HV4), Hurricane-2 (2-lane HV4), CraftP1 (4-lane HV4), EAGLE (8-core HV4).]

Raven/Hurricane: ST 28nm FD-SOI; EOS: IBM 45nm SOI; CRAFT/EAGLE: TSMC 16nm

Current/Future Work

• Move the frontend to the standard RISC-V Vector extension

  • Retain the lane structure

• Extensions to RVV for polymorphic instructions

• Explore domain-specific extensions

• Draft of RVV spec at github.com/riscv/riscv-v-spec

37

Open Source Release

• Fully parameterized Chisel3 RTL

  • Multi-lane, sequencer slots, queue depths

• ISA assembly tests, hand-coded micro-benchmark suite

• Random torture test generator for verification

• Spike extension for simple software simulation

• FireSim support for ~90MHz FPGA simulation

• GNU binutils assembler

• Preliminary OpenCL compiler based on LLVM+pocl

• Documentation: technical reports

38

Acknowledgements

• Yunsup Lee

• Alon Amid, Sagar Karandikar, Quan Nguyen, Stephen Twigg, Huy Vo, Andrew Waterman, Jerry Zhao

• All students who worked on these chips as part of the following Berkeley labs and centers: ParLab, ASPIRE, ADEPT, BWRC

Sponsors:

Research partially funded by DARPA CRAFT HR0011-16-C-0052 and PERFECT HR0011-12-2-0016; the Intel Science and Technology Center for Agile Design; and ADEPT Lab industrial sponsors and affiliates Intel, Google, Huawei, Siemens, SK Hynix, and Seagate.

39

Questions and Links

RTL: github.com/ucb-bar/hwacha

Example Project: github.com/ucb-bar/hwacha-template

Toolchain+spike+tests: github.com/ucb-bar/esp-tools

FireSim: github.com/firesim/firesim/tree/hwacha

OpenCL+LLVM: github.com/ucb-bar/esp-llvm

Docs: hwacha.org

40