
Gemmini: An Agile Systolic Array Generator Enabling Systematic Evaluations of Deep-Learning Architectures

Hasan Genc∗, Ameer Haj-Ali∗, Vighnesh Iyer∗, Alon Amid∗, Howard Mao, John Wright, Colin Schmidt, Jerry Zhao, Albert Ou, Max Banister, Yakun Sophia Shao, Borivoje Nikolic, Ion Stoica, Krste Asanovic

University of California, Berkeley

ABSTRACT

Advances in deep learning and neural networks have resulted in rapid development of hardware accelerators that support them. A large majority of ASIC accelerators, however, target a single hardware design point to accelerate the main computational kernels of deep neural networks such as convolutions or matrix multiplication. On the other hand, the spectrum of use-cases for neural network accelerators, ranging from edge devices to cloud, presents a prime opportunity for agile hardware design and generator methodologies. We present Gemmini1 - an open-source and agile systolic array generator enabling systematic evaluations of deep-learning architectures. Gemmini generates a custom ASIC accelerator for matrix multiplication based on a systolic array architecture, complete with additional functions for neural network inference. Gemmini targets the RISC-V ISA, and is integrated with the Rocket Chip System-on-Chip generator ecosystem, including Rocket in-order cores and BOOM out-of-order cores. Through an elaborate design space exploration case study, this work demonstrates the selection processes of various parameters for the use-case of inference on edge devices. Selected design points achieve two to three orders of magnitude speedup in deep neural network inference compared to the baseline execution on a host processor. Gemmini-generated accelerators were used in the fabrication of test systems-on-chip in TSMC 16nm and Intel 22FFL process technologies.

1. INTRODUCTION

Deep neural networks [1] (DNNs) have gained major interest in recent years due to their extraordinary and robust ability to make predictions on large amounts of data. These prediction abilities have been applied to computer vision [2], machine translation [3], gaming [4], robotics [4, 5], and many other fields. Hardware accelerators are a natural solution to the large computational requirements imposed by DNNs.

A large portion of DNN accelerators produced by major vendors such as Google [6], Samsung [7] and Tesla [8] have used systolic array architectures for matrix multiplication and convolution operations. Systolic arrays were originally proposed in the 1980s [9, 10], but have recently regained interest due to their effectiveness in accelerating general matrix multiplications (GEMM) and convolutions in modern machine-learning (ML) workloads.

∗Equal contribution.
1 https://github.com/ucb-bar/gemmini.git

Accelerators can be used in various stages of the machine learning process: whether in training or inference, on edge devices or on the cloud. Each of these use cases applies different constraints on the accelerator, including latency, power, throughput, energy, area, programmability, and system integration. Nevertheless, the intrinsic computational kernels used in these scenarios remain the same. Critically, the differences between edge inference and cloud training accelerators can be cast as different accelerator parameters rather than changes to the basic computational kernels.

For these reasons, hardware generators [11, 12] are an attractive approach to building DNN accelerators. Although computational kernels may stay the same across workloads, characteristics such as layer dimensions or model size impact how workloads are optimally scheduled and mapped to any particular hardware accelerator [13]. Thus, full-stack generators must target software frontends as well as hardware backends, so that workloads and accelerators can be tuned together.

Systolic array hardware generators should target pertinent architectural parameters such as dataflow, pipeline depth, banking strategy, precision, and on-chip memory capacity. Such generators also need to consider parameters for system-level integration such as bus widths, off-chip memory bandwidth, and host CPU architecture. Accurately evaluating the generated system requires a high-fidelity simulator which can faithfully model system-level interactions with operating systems, DRAM controllers, networks, etc.

Many of these architectural parameters impact the physical realizability, as well as the power, area, and maximum clock frequency of the generated hardware. Therefore, any generator needs to be evaluated not only on its architectural or RTL characteristics, but also on the full physical design flow which it enables.

In this paper, we address these needs and present Gemmini, an agile systolic array generator, which is integrated with the Rocket Chip system-on-chip generator [14] and the BOOM out-of-order processor [15].


Figure 1: A system overview of the Gemmini-based systolic array generator. The figure shows a Rocket core (with L1 I+D caches, an L2, and DRAM) alongside the Gemmini accelerator, whose controller (with dependency management, a DMA engine, and a local TLB) connects banked scratchpad SRAMs, a transposer, the systolic array, and an accumulator with scaling and ReLU units; the accelerator communicates with the host over the RoCC command and RoCC PTW interfaces.

Gemmini is composed of a hardware/software generator which produces both RTL and optimized C libraries for common neural network (NN) workloads. We utilize FireSim [16], a cycle-exact FPGA-accelerated simulation platform, to extract accurate performance figures from a design-space exploration (DSE) over architectural and system-level parameters. Additionally, we evaluate Gemmini across the full physical design flow to produce tapeout-ready designs with varying timing constraints and floorplans.

Our DSE revealed that evaluation of any one neural network layer in isolation is not representative of performance on an entire network, partly because the performance and energy-efficiency of different layers can vary widely based on their dimensions and tiling factors. Furthermore, we show that CPU performance and system-level constraints such as bus protocols can severely limit the maximum performance an accelerator can provide. We also demonstrate that Gemmini can produce tapeout-ready designs which meet timing and power constraints with a variety of different floorplans and configurations. Gemmini designs have even been taped out in TSMC 16nm and Intel 22FFL process technologies.

2. GEMMINI GENERATOR

Gemmini is an open-source, modular, and flexible generator of systolic array accelerators, supporting multiple dataflows and targeting both ASIC and FPGA implementations. Gemmini is written in the Chisel hardware description language [17], enabling parameterization and configurability through high-level meta-programming and functional programming abstractions. Gemmini produces instances of systolic architectures that can be integrated with the Rocket Chip SoC generator. Its parameterization and system-level integration enable efficient hardware and software co-design, and help perform agile design space exploration. This section describes the architecture of a systolic array generated by Gemmini (Section 2.1), the major generator parameters (Section 2.2), and the accelerator programming model (Section 2.3).

2.1 Architecture

A system-level view of the Gemmini-generated accelerator is illustrated in Figure 1. The core unit is a 2-D systolic array that performs matrix multiplications, represented by the equation:

C = A ∗ B + D

where A and B are the multiplied matrices, C is the result, and D is a bias matrix. The array is fed by a banked scratchpad memory made of SRAMs, with access to main system memory handled by a direct memory access (DMA) engine in the controller. There are dedicated components for non-linear activation functions, such as ReLU and ReLU6, as well as components necessary for retaining network accuracy after quantization [18], such as rounding and saturating bitshifts. The accelerator also includes an accumulator with a wider bitwidth than the systolic array itself to accumulate partial results.
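For reference, the operation computed by the array corresponds to the scalar C sketch below (a naive software model of C = A ∗ B + D for illustration only; the accelerator tiles and pipelines this computation across the PE grid):

#include <stdint.h>

// Naive reference model of the accelerator's GEMM-with-bias operation:
//   C[i][j] = sum_k A[i][k] * B[k][j] + D[i][j]
// Inputs are 8-bit while accumulation is carried out at 32 bits,
// mirroring the wider accumulator described above.
void gemm_bias_ref(int m, int n, int k,
                   const int8_t *A, const int8_t *B,
                   const int32_t *D, int32_t *C) {
    for (int i = 0; i < m; i++) {
        for (int j = 0; j < n; j++) {
            int32_t acc = D[i * n + j];
            for (int p = 0; p < k; p++) {
                acc += (int32_t)A[i * k + p] * (int32_t)B[p * n + j];
            }
            C[i * n + j] = acc;
        }
    }
}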

The µArch of the systolic array is illustrated in Figure 2. The basic element of the systolic array is a fully combinational processing element (PE), which performs MACs, and optionally rounding bitshifts. The PEs can support a variety of dataflows, which may either be fixed at design time or configurable at runtime. The PEs can support different bitwidths for their inputs, outputs, and internal buffer (Section 2.2), as determined at elaboration time. To enable full utilization of the MAC units, each PE is double-buffered such that weights/biases can be loaded for a future computation while the current compute cycle is running. PEs are arranged in a combinational grid to form a tile, and tiles are arranged in a pipelined grid to form the systolic array itself.

To perform a GEMM operation, the A, B, and D matrices must be explicitly moved into the scratchpad from main memory (D may also be moved directly into the accumulator). The systolic array is then configured with the desired dataflow and activation functions. Afterwards, the A, B, and D matrices are fed directly into the systolic array, which writes the result, C, either back into the scratchpad or into the accumulator. Finally, the result may be written back into main memory.

For workloads that are sensitive to precision and rounding, the result of a matrix multiplication must often be of a higher bitwidth than the input matrices. To support this pattern, the Gemmini architecture also includes a higher-bitwidth accumulator external to the systolic array, which is implemented as a dual-port SRAM with adders at its inputs.

The template architecture also includes peripheral circuitry, which performs activation functions and scales high-bitwidth values down to lower-bitwidth values if necessary. For example, Gemmini supports rounding bitshifts, which can be applied within PEs (for the output-stationary dataflow) or at the output of the accumulator (for the weight-stationary dataflow). In a quantized neural network, output activations are usually accumulated to higher precision, e.g., 32 bits. However, before being fed into other layers, these activations must be scaled back down to a lower precision, such as 8 bits. Gemmini saturates and rounds such scaling operations to the nearest bit in order to maximize accuracy [18].

Figure 2: µArch of the Gemmini systolic array. PEs are arranged in a combinational grid to form a tile, and tiles are arranged in a pipelined grid to form the systolic array itself.
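As an illustration of this scaling step, the following C sketch shows one way a 32-bit accumulator value could be reduced to an 8-bit activation with a rounding, saturating right-shift. The shift amount stands in for the configured scaling factor, and the rounding mode here is round-half-up for simplicity; this is a behavioral model, not Gemmini's RTL.

#include <stdint.h>

// Behavioral model of a rounding, saturating right-shift from a 32-bit
// accumulator value down to an 8-bit activation.
int8_t scale_down(int32_t acc, int shift) {
    if (shift > 0) {
        // Round to nearest by adding half of the discarded range
        // (half-up for simplicity; hardware may round half to even).
        int32_t round = (int32_t)1 << (shift - 1);
        acc = (acc + round) >> shift;
    }
    // Saturate to the representable int8_t range.
    if (acc > INT8_MAX) return INT8_MAX;
    if (acc < INT8_MIN) return INT8_MIN;
    return (int8_t)acc;
}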

Some of this peripheral circuitry also preprocesses the data. For example, our architecture includes a dedicated transposer, which is itself implemented as a smaller systolic array. For the output-stationary dataflow, a PE must consume the rows of A while consuming the columns of B. However, typically all matrices are stored in main memory as row-major. The transposer allows both A and B to be stored as row-major and makes matrix transformations transparent to the programmer.

The accelerator is integrated with the Rocket Chip System-on-Chip generator, which can be configured to use either the Rocket [14] in-order core or the BOOM [15] out-of-order core. The accelerator communicates with the host processor through the Rocket Custom Coprocessor (RoCC) interface, which enables the host RISC-V core to send the accelerator a stream of custom instructions. The RoCC interface enables the accelerator to be integrated into the processor's cache-coherent TileLink [19] memory system, and provides execution ordering semantics with respect to the host processor.

2.2 Parameters

Although all the accelerator instances produced by Gemmini have the same general architecture, a designer can explore different trade-offs in performance, energy, and area based on a range of tunable generator parameters. Choosing an appropriate design point for a specific application is extremely important in the case of a general kernel such as matrix multiplication. As such, an energy-conscious accelerator for a mobile device with limited parallelism would likely choose smaller array sizes and a single dataflow (at the cost of performance and flexibility), while larger cloud-based accelerators with batch-level parallelism can choose larger array sizes and multiple dataflows (for optimal performance). Some of the current parameters enabled by Gemmini are described below.

Dataflow: a dataflow describes the data movement into and out of the systolic array and the communication patterns between PEs. In the classic three-level nested for-loop for matrix multiplications, the dataflow determines which loops are unrolled spatially and which are unrolled temporally. Currently our generator supports both the output-stationary and the weight-stationary dataflows. The dataflow can either be fixed at elaboration time (improving energy efficiency and physical design) or configured at runtime (improving flexibility and possible performance). Previous work has demonstrated that runtime-configurable dataflows can improve DNN inference performance and energy efficiency [13].
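To make the spatial/temporal distinction concrete, the C sketch below annotates the classic three-level GEMM loop nest; the comments indicate which loops an output-stationary or weight-stationary array would unroll in space versus iterate over in time (an illustrative mapping, not the exact schedule Gemmini's hardware uses):

#include <stdint.h>

// Classic three-level GEMM loop nest: C[i][j] += A[i][k] * B[k][j].
// A systolic dataflow chooses which loops are unrolled spatially
// (mapped onto the PE grid) and which are iterated temporally:
//   Output-stationary (OS): i and j are spatial (each PE accumulates one
//                           C[i][j] partial sum), k is temporal.
//   Weight-stationary (WS): k and j are spatial (each PE holds one B[k][j]
//                           weight), i is temporal.
void gemm_loop_nest(int M, int N, int K,
                    const int8_t *A, const int8_t *B, int32_t *C) {
    for (int i = 0; i < M; i++) {          // OS: spatial,  WS: temporal
        for (int j = 0; j < N; j++) {      // OS: spatial,  WS: spatial
            for (int k = 0; k < K; k++) {  // OS: temporal, WS: spatial
                C[i * N + j] += (int32_t)A[i * K + k] * (int32_t)B[k * N + j];
            }
        }
    }
}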

Dimensions: systolic arrays can be generated with any number of PEs the user chooses. As arrays get larger, more operations will be executed per cycle, and data reuse will improve. However, large arrays increase the latency of small matrix multiplications, as operands must traverse the entire length and height of an array before the result can be read out. Large arrays also suffer from low utilization when operating on small matrices, wasting energy and area. Furthermore, large arrays can have a significant impact on physical design and cycle time, since the scratchpad memories need to be placed appropriately to reduce wire delay between the memory and the array edges.

Bitwidth: the generator can be configured at elaboration time to operate on arbitrary-bitwidth matrices. The final accumulated result of a matrix multiplication can also have a different bitwidth than the input matrices. Previous work has demonstrated that DNN compression and quantization enable significant energy savings [20, 21] at the potential cost of accuracy. Our bitwidth parameterization enables a designer to explore the accuracy-efficiency trade-off and choose an appropriate design point for the respective application.

Pipeline Depth: traditionally, systolic arrays place registers between each PE. However, our generator allows the density of these registers to be reduced, even to the point of the array being made of fully combinational logic. Fewer pipeline registers reduce the area requirement for our accelerator, but may reduce the maximum achievable clock frequency. The optimal pipeline depth is impacted by physical design and the choice of fabrication process technology.

Memory Capacity: both the scratchpad and accumulator memories (implemented using SRAMs) can be configured to have arbitrary capacities. Previous work has found that data movement and coherency management between main memory and accelerators' private memory can consume up to 40% of an accelerated workload's total runtime [22]. Since data transfer between main memory and the private scratchpad/accumulator memory is expensive, it is beneficial to have large scratchpads to allow for maximal data re-use. However, over-provisioned private memory can lead to energy and area inefficiency. Therefore, memory capacity should be balanced with the system bus and DMA bandwidths which drive the memory, as well as the data re-use potential of the accelerated workload.

Memory Banks: the private memory scratchpad is divided into banks in order to maximize read/write throughput. A larger number of banks allows for higher throughput, but results in additional wiring and physical design constraints.

System Parameters: since Gemmini is integrated with the Rocket Chip SoC ecosystem, it can use SoC-level parameters which have been shown to have an impact on accelerator performance [22]. One such parameter is the host processor, which can be an in-order Rocket core or an out-of-order BOOM core. Another example is the SoC system-bus width, which impacts the bandwidth with which the accelerator can communicate and move data between main memory and the private scratchpads.

Datatype Parameters: through the use of the Chisel hardware description language and Scala typeclass features, our generator is type-generic over the concrete datatype being processed by the systolic array. Gemmini can create accelerator instances which operate on signed integers, unsigned integers, floating-point values, or any user-defined datatype, such as a posit [23] or dynamic fixed-point number, through the implementation of the relevant Scala typeclass. This level of parameterization enables the generator to produce instances specialized for low-precision integer DNN inference, as well as for high-precision floating-point DNN training and scientific computing.

2.3 Programming Model

Gemmini is programmed via a stream of custom RISC-V instructions transmitted directly from a host processor to our accelerator. Gemmini connects directly to the datapath of a RISC-V core through the Rocket Custom Coprocessor Interface [14]. The accelerator has its own instruction queues, allowing it to run in parallel with the host processor.

Data and memory management between the accelerator and the host processor is explicit, i.e., data must be explicitly moved between the processor's main address space and the accelerator's private address space using a sequence of movement instructions. The ISA defines two data movement instructions, mvin and mvout, shown in Figure 3. These instructions use Gemmini's DMA unit to move multiple systolic-dimension (DIM) matrices between main memory and the accelerator's private memory space, consisting of the scratchpad and accumulator's SRAM banks.

Figure 3: Moving matrices in and out of private scratchpad or accumulator memory with the mvin and mvout instructions.

Once matrices have been brought in from main memory, the Gemmini ISA provides a compute instruction that can be configured with a dataflow, a scaling factor, and an activation function. The compute instruction takes the local addresses of the A, B, C, and D matrices, which can be stored in any scratchpad or accumulator bank.

The output-stationary (OS) variant of compute (illustrated in Figure 4) executes by loading the D matrix into the PEs' internal accumulators, pushing A and B through the systolic array, and leaving the result C resident in each PE's accumulator. Providing addresses for the D and C matrices is optional in the OS case. This is useful, for example, when a programmer wants to repeatedly accumulate submatrix multiplications on top of each other without reading the results out of the systolic array until the final result has been calculated.

The weight-stationary (WS) variant of compute (illustrated in Figure 5) takes local addresses for A, B, and C. First, B is preloaded into the PEs' weight buffers, then A is pushed through the systolic array, and the result C is written to the accumulator. A bias matrix D can be used in the WS dataflow by first executing a mvin into the accumulator. Specifying B is optional, so the programmer can reuse the already-loaded weights in the systolic array.
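As a rough illustration of this programming model, the C fragment below sketches a single weight-stationary DIM × DIM tile computation using hypothetical intrinsic wrappers. The function names, arguments, and configuration values are placeholders for exposition and do not necessarily match the actual Gemmini C header.

#include <stdint.h>

// Hypothetical intrinsic wrappers around Gemmini's custom RoCC instructions.
// Names and signatures are illustrative only; the real Gemmini software
// library defines its own macros for these operations.
static void gemmini_config_ws(int scale, int act)               { /* would emit a config instruction  */ }
static void gemmini_mvin(const void *dram, uint32_t local)      { /* would emit an mvin instruction    */ }
static void gemmini_preload(uint32_t B_local, uint32_t C_local) { /* would emit a preload instruction  */ }
static void gemmini_compute(uint32_t A_local, uint32_t D_local) { /* would emit a compute instruction  */ }
static void gemmini_mvout(void *dram, uint32_t local)           { /* would emit an mvout instruction   */ }

// One weight-stationary DIM x DIM tile: C = A * B + D, with B held in the PEs.
void ws_tile(const int8_t *A, const int8_t *B, const int32_t *D, int8_t *C,
             uint32_t A_sp, uint32_t B_sp, uint32_t D_acc, uint32_t C_acc) {
    gemmini_config_ws(0, 0);      // select WS dataflow, scaling factor, activation
    gemmini_mvin(A, A_sp);        // move a tile of A into the scratchpad
    gemmini_mvin(B, B_sp);        // move the weight tile B into the scratchpad
    gemmini_mvin(D, D_acc);       // bias tile goes directly into the accumulator
    gemmini_preload(B_sp, C_acc); // load B into the PEs' weight buffers, name the output tile
    gemmini_compute(A_sp, D_acc); // stream A through; partial sums accumulate into C_acc
    gemmini_mvout(C, C_acc);      // write the scaled/activated result back to main memory
}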

Gemmini uses a decoupled access-execute [24] architecture, where all instructions are issued to one of three independent, parallel command queues: the LOAD queue (mvin), the STORE queue (mvout), and the EXECUTE queue (compute). Any data hazards within a command queue are handled transparently by hardware. However, dependencies between queues must be encoded into the instructions themselves by the compiler or programmer. Each instruction has four reserved bits which specify whether the instruction depends upon an instruction in another queue, or whether an instruction in another queue will depend upon it. This scheme is inexpensive to implement in hardware, but it increases software complexity. Similar software-based dependency management schemes have been implemented in other NN accelerator works [25].

Figure 4: Execution of an output-stationary compute instruction. D is first preloaded into the PE buffers, then B is pushed in while A is transposed and pushed in; C stays stationary in the PE buffers, and the PEs apply the shift/activation as the buffers are flushed to the scratchpad.

To make it easier for programmers to use Gemmini accelerators, we provide a software library that implements hand-tuned, tiled GEMM functions (supporting both dataflows): matrix multiplication of any size, multi-layer perceptrons (MLPs), convolutional neural networks (CNNs), non-linear activations, and quantization. Tiling is performed along the parameterized size of the systolic array and the accelerator scratchpad. The tiling parameters are generated by the Chisel generator and included as a header file in the software libraries. This approach facilitates rapid software-hardware co-design.
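The overall structure of such a tiled GEMM can be sketched as follows in C. This is a simplified schematic of the tiling pattern only (square DIM-sized tiles, dimensions assumed to be multiples of DIM, and the per-tile work modeled in software), not the hand-tuned library code itself.

#include <stdint.h>

#define DIM 16  // systolic array dimension; in the real library this value
                // comes from the generator-emitted header file

// Schematic of the tiling pattern used by the GEMM library: the matrices are
// walked in DIM x DIM blocks, and each (i0, j0, k0) block triple corresponds
// to one accelerator tile computation. The real library additionally tiles
// across the scratchpad capacity and handles ragged edges.
void tiled_gemm(int M, int N, int K,
                const int8_t *A, const int8_t *B, int32_t *C) {
    for (int i0 = 0; i0 < M; i0 += DIM)
        for (int j0 = 0; j0 < N; j0 += DIM)
            for (int k0 = 0; k0 < K; k0 += DIM)
                // One DIM x DIM x DIM block: on hardware this would be an
                // mvin / preload / compute sequence; here it is modeled in C.
                for (int i = i0; i < i0 + DIM; i++)
                    for (int j = j0; j < j0 + DIM; j++)
                        for (int k = k0; k < k0 + DIM; k++)
                            C[i * N + j] += (int32_t)A[i * K + k] * (int32_t)B[k * N + j];
}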

3. DESIGN SPACE EXPLORATION

A major advantage of a generator-based methodology is the ability to perform elaborate design space exploration across a multi-dimensional design space. In this section we explore the performance of multiple design points on different DNN workloads.

3.1 Evaluation Method

We chose to run the DNN applications under Linux to evaluate their performance in a full-system context. Performing this type of RTL evaluation using a logic simulator would take multiple compute-years. Therefore, for full-system performance evaluation we used FireSim, an FPGA-accelerated cycle-exact simulation platform [16]. Unlike FPGA prototyping, FireSim is designed to simulate ASIC RTL designs with timing-accurate system components. FireSim facilitates full-system simulation by enabling integration of the simulated SoC with accurate peripheral and system-level interface models such as DDR3 memory and a last-level cache (LLC) [26]. By using FireSim's timing-accurate models, we faithfully simulate our target ASIC designs.

Figure 5: Execution of a weight-stationary compute instruction. D is first moved into the accumulator and B is preloaded into the PE buffers; A is then pushed through the array while B remains stationary, partial sums are written to the accumulator, and the shift/activation is applied on mvout.

Power and area are evaluated using a Cadence VLSI flow with TSMC 16 nm FinFET technology libraries. Logic synthesis was performed using Genus, physical design was performed using Innovus, and power estimation was performed using Voltus. The accelerators were synthesized to meet frequencies of 1 GHz and 500 MHz.

For performance evaluation, the memory system includes a 256 KiB L2 cache and a 4 MiB last-level cache (LLC). The simulated backing memory preserves the same latency for the 500 MHz and 1 GHz design points, while proportionally scaling the memory bandwidth for the 1 GHz design. At 500 MHz, we used the DDR3 1066 8-8-8 model, and at 1 GHz we used the DDR3 2133 14-14-14 model.

Each design point for the DSE was selected by varying a single design parameter relative to a baseline which matches design point 1 in Table 1. This method attempts to identify and isolate the impact of the different parameters on area, performance, and power consumption. The baseline design point was selected based on common parameters published in the literature.

3.2 Area and Power

The area and power consumption of our designs, normalized and illustrated in Figure 6, were heavily correlated. According to synthesis, our baseline 500 MHz design took up 0.467 mm2 and consumed 611 mW, including the area and power of the RISC-V core connected to our systolic array. At 1 GHz, synthesis reported that the baseline design took up 0.541 mm2 and consumed 1.4 W. However, we found that synthesis results were generally pessimistic. Designs which we placed-and-routed sometimes consumed less than half of what synthesis predicted, as seen in Section 4. The trends between different design points, however, remained the same.

Table 1: Design Points Under Evaluation.

No. | Dataflow | Bitwidth                    | Dimensions | Pipeline Depth      | Memory  | Banks | Bus Width | Host CPU
1   | OS       | 8-bit input, 32-bit result  | 16 × 16    | fully pipelined     | 64 KiB  | 5     | 128 bits  | Rocket
2   | WS       | 8-bit input, 32-bit result  | 16 × 16    | fully pipelined     | 64 KiB  | 5     | 128 bits  | Rocket
3   | OS + WS  | 8-bit input, 32-bit result  | 16 × 16    | fully pipelined     | 64 KiB  | 5     | 128 bits  | Rocket
4   | OS       | 32-bit input, 32-bit result | 16 × 16    | fully pipelined     | 64 KiB  | 5     | 128 bits  | Rocket
5   | OS       | 8-bit input, 32-bit result  | 32 × 32    | fully pipelined     | 64 KiB  | 5     | 128 bits  | Rocket
6   | OS       | 8-bit input, 32-bit result  | 16 × 16    | fully combinational | 64 KiB  | 5     | 128 bits  | Rocket
7   | OS       | 8-bit input, 32-bit result  | 16 × 16    | fully pipelined     | 256 KiB | 5     | 128 bits  | Rocket
8   | OS       | 8-bit input, 32-bit result  | 16 × 16    | fully pipelined     | 64 KiB  | 33    | 128 bits  | Rocket
9   | OS       | 8-bit input, 32-bit result  | 16 × 16    | fully pipelined     | 64 KiB  | 5     | 64 bits   | Rocket
10  | OS       | 8-bit input, 32-bit result  | 16 × 16    | fully pipelined     | 64 KiB  | 5     | 128 bits  | BOOM

Each design point varies a single parameter compared to the baseline (1). The "Banks" column describes the number of scratchpad banks, with an additional bank for the accumulator (which is in its own memory address space).

The weight-stationary dataflow (2) consumed less power than the output-stationary baseline, as it did not require 32-bit accumulators in the PEs of the systolic mesh. Configurations which increased the size of the systolic mesh, on the other hand, such as by scaling up its dimensions or bitwidth, increased power consumption by up to 3.4× and area by up to 2.3×. Design 10, which replaced the default in-order Rocket processor with a four-wide out-of-order BOOM processor, also significantly increased both area and power consumption, whereas in the other design points, the CPU had only a minor impact upon the overall power and area.

3.3 Performance

We evaluate the selected design points by running DNNs such as MobileNet, ResNet-50, and ResNet-152, as well as an additional collection of MLPs, which we refer to in Figure 7 as MLP 1 [27], MLP 2 [28], MLP 3 [29], and MLP 4 [30]. The evaluated DNNs represent a wide range of modern state-of-the-art neural network architectures. They include MLPs (which make up more than 61% of Google's inference workloads [6]), autoencoders, non-linear activations, convolutions, quantization, and depthwise convolutions.

We observed that many of the design points that were expected to boost performance did not have a large impact due to system-level and µArch effects, while also noting significant variability in the performance boosts achieved by different workloads.

As seen in Figure 7a, for DNN workloads, using a beefier processor in 10 boosted performance substantially, while increasing scratchpad memory had little impact, contrary to typical intuition. Since the DNN workloads used the CPU core to perform tasks that map poorly to GEMMs, the CPU often became the bottleneck that limited the maximum achievable speedup. For example, our DNNs performed im2col reshaping [31, 32] to convert 2D convolutions to GEMMs. With MobileNet in particular, accelerated computation time was dominated by depthwise convolutions on the CPU. Some layers of the evaluated DNNs include 1×1 convolutional kernels that could be mapped directly to matrix multiplication without requiring any reshaping. ResNet-152 included the highest portion of such kernels, and thus it generally performed better across all design points.
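For context, im2col lowers a 2D convolution to a GEMM by copying every input patch under the sliding window into a column of a dense matrix, after which the convolution becomes a single matrix multiplication against the flattened filters. A minimal C sketch (single channel, unit stride, no padding, illustrative only) is shown below:

#include <stdint.h>

// im2col for a single-channel H x W input and a kh x kw kernel, unit stride,
// no padding. Each output pixel becomes one column of length kh*kw, so a
// convolution with `num_filters` flattened kernels reduces to a
// (num_filters x kh*kw) by (kh*kw x out_h*out_w) matrix multiplication.
void im2col(const int8_t *in, int H, int W, int kh, int kw, int8_t *cols) {
    int out_h = H - kh + 1;
    int out_w = W - kw + 1;
    for (int oy = 0; oy < out_h; oy++)
        for (int ox = 0; ox < out_w; ox++)
            for (int ky = 0; ky < kh; ky++)
                for (int kx = 0; kx < kw; kx++)
                    cols[(ky * kw + kx) * (out_h * out_w) + (oy * out_w + ox)] =
                        in[(oy + ky) * W + (ox + kx)];
}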

While the larger scratchpad (7) added more data locality, improving performance by a marginal 1.18× on our DNNs, its benefit was limited by the CPU bottleneck. On the other hand, increasing bitwidths to 32 bits (4) reduced performance significantly in all cases, as it caused the memory requirement to increase, limiting re-use and locality within the scratchpad.

For MLP workloads, the CPU is only used for bookkeeping, so increasing the memory and compute capacity of the accelerator had a larger impact on performance, as seen in Figure 7b. Increasing the host CPU's performance did help, but not as substantially as increasing the dimensions of the systolic array or boosting its scratchpad size.

Figure 6: The area and power consumption of synthesized Gemmini designs, normalized to the area and power consumed by the baseline design.

Figure 7: The performance of different Gemmini design points when performing matrix multiplications for neural network inference. (a) The performance of Gemmini designs on various deep neural networks, normalized to the performance of a cache-blocking algorithm on a CPU. (b) The performance of Gemmini designs on various multi-layer perceptrons, normalized to the performance of a cache-blocking algorithm on a CPU.

One would expect that Gemmini is memory-bandwidth limited, and thus that cutting the memory bus width would degrade performance. However, we observe no significant performance hit in design 9, owing to a system-level limitation on the number of memory requests in flight. This limitation turns a bandwidth constraint into a memory latency constraint. Since the round-trip latency of a memory request and the maximum number of requests in flight are independent of the bus width, decreasing it does not impact the effective bandwidth. This reveals the critical importance of system-level evaluation, since using an ideal memory model at Gemmini's memory port would not reveal system-level bottlenecks.
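This behavior follows from a Little's-law-style argument: with a fixed number of outstanding requests, the achievable memory throughput is bounded by

BW_effective = (max requests in flight × bytes per request) / round-trip latency

so as long as this bound, rather than the bus width, is the binding constraint, narrowing the bus leaves throughput unchanged (a simplified model that ignores effects such as request coalescing).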

We observe up to 4× performance improvement on MLP inference when increasing the size of the systolic array to 32 × 32 (5). The Gemmini µArch requests multiple systolic-dimension matrix rows at a time when executing the mvin instruction. Increasing the array dimension results in larger blocks of memory requested per mvin over TileLink. Doubling the systolic array dimensions doubles the effective memory bandwidth and quadruples the compute throughput. Depending on how much reuse there is within a layer and according to the tiling factors, the expected performance boost can be anywhere from 2× to 4×.

For all the models, before feeding the data into the systolic array, the operands are zero-padded so that their dimensions are multiples of the size of the systolic array. In most of our benchmarks, this resulted in negligible added overhead from multiplying zeros. This overhead was highest in MobileNet, where it consumed 10% of the workload, but it dropped significantly with larger DNNs like ResNet.
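The padding overhead can be reasoned about directly from the matrix dimensions; the small C helper below (illustrative only) rounds each dimension up to the next multiple of the array size and reports the fraction of the padded GEMM spent multiplying zeros:

// Round a matrix dimension up to the next multiple of the systolic array size.
static int round_up(int x, int dim) {
    return ((x + dim - 1) / dim) * dim;
}

// Fraction of MACs in the padded (M x K) * (K x N) GEMM that involve padding.
double padding_overhead(int M, int N, int K, int dim) {
    double padded = (double)round_up(M, dim) * round_up(N, dim) * round_up(K, dim);
    double useful = (double)M * N * K;
    return 1.0 - useful / padded;
}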

We also found that due to their low arithmetic intensity and large memory footprint, depthwise convolutions would require feeding inputs sequentially into the systolic array, which would limit their performance. Therefore, we perform depthwise convolutions in MobileNet on the host processor itself. Prior work has demonstrated that the low arithmetic intensity of depthwise convolution can be an impediment to the efficient acceleration of MobileNet. This is also demonstrated in the results of our DSE: while depthwise convolution layers take up 18% of the runtime of our CPU implementation, they take up nearly 100% of the execution time in the accelerated workload.

Figure 8: The tasks completed per unit energy as a function of performance per area for all our design points. The numeric labels correspond to rows in Table 1. (a) Performance-energy-area trade-offs for MobileNet inference. (b) Performance-energy-area trade-offs for inference on ResNet50 and ResNet152. (c) Performance-energy-area trade-offs for inference on various MLPs.

We observe from Figure 7b that the performance improvement between MLP topologies varies wildly, as a consequence of the shapes of their layers. The shapes of the layers affected not only the input/weight reuse, but also the amount by which GEMMs could be tiled in the scratchpad. The maximum tiling factors were a function of the scratchpad size and array dimensions, but narrow or non-divisible layers often reduced tiling factors, which also reduced performance. As an example, MLP 4 outperformed MLP 3 because its dimensions, which were powers of two, mapped better onto our maximum tiling factors.

3.4 Design Space Analysis

We integrate our results for power, performance, and area in Figure 8, plotting the performance per joule of each workload against the performance per unit area. In workloads which were not CPU-limited, such as MLPs, increasing memory capacity improved Gemmini's area and energy efficiency by minimizing main memory accesses and improving locality within the scratchpad. In all workloads, design points which demanded extra memory bandwidth, such as 4, which increased the bitwidth to 32 bits, reduced energy and area efficiency significantly.

Although the weight-stationary dataflow (2) did not noticeably improve performance, it did increase energy and area efficiency by removing the 32-bit accumulators within each PE, which greatly reduced the power consumption of the systolic mesh. The 32×32 design (5), on the other hand, suffered from very low efficiency despite its high performance, because of its greatly increased power consumption and area overhead. The BOOM processor (10) improved energy- and area-efficiency significantly on MobileNet, which was severely CPU-limited, while it suffered on other workloads where the CPU was not the bottleneck.

Additionally, 500 MHz designs were generally more energy- and area-efficient than 1 GHz designs. The 500 MHz designs used more high-voltage-threshold (HVT) logic gates, reducing leakage power consumption more than enough to compensate for their slower compute performance.

4. PHYSICAL DESIGN

We perform an evaluation of the physical design properties of the generated accelerator design, to explore the area and power requirements of different design points, as well as to evaluate the place-and-route feasibility of Gemmini accelerators.

Since the end of Dennard scaling, power density has proven to be a significant factor in the design of digital systems-on-chip. In particular, custom accelerators with a dense collection of compute units (MACs, in the case of a matrix multiplication unit) are known to be sensitive to such thermal and energy constraints. As an example, the Google TPUv3 uses liquid cooling to assist the power dissipation from its dense array of compute units [33]. As such, it is important that the design space of such a matrix multiplication unit be evaluated through full physical design VLSI flows (placement and routing), allowing for the evaluation of the feasibility of the RTL design (timing closure), as well as power density and energy under different floorplans. Furthermore, physical design of the selected design points was necessary for the integration of the accelerator into test systems-on-chip for fabrication.

We evaluated four design points based on the results of the earlier DSE, by choosing two Gemmini configurations and placing-and-routing each of them at both 500 MHz and 1 GHz. The first configuration is a 16 × 16 systolic array with dual dataflows, a 4-banked 256 KiB scratchpad memory, and a 64 KiB accumulator. The second configuration is a 32 × 32 systolic array with dual dataflows, a 4-banked 512 KiB scratchpad memory, and a 128 KiB accumulator.

Each of our selected design points was also evaluated using two different floorplans in TSMC 16 nm FinFET process technology. The first floorplan (Figure 9) organizes the accelerator's SRAMs in a major block, leaving space on the side for the systolic mesh, as well as a routing channel across the block. The second floorplan, in Figure 10, organizes the accelerator's SRAMs in a semi-ring around the computational mesh.

The placed designs are shown in Figures 9 and 10. In both floorplans, the controller was placed next to the Rocket host processor, since the processor interacts with the controller to send instructions and data. Each floorplan has intuitive benefits: while the block floorplan provides the systolic mesh more vertical access to the SRAM address lines, the semi-ring floorplan allows for more surface-area contact between the systolic mesh and the SRAMs.

Figure 9: Block floorplan of the systolic array in TSMC 16nm for a 16 × 16 design point. Rectangular blocks represent SRAM macros. Yellow points on the right are I/O pins for clock, reset, interrupt, and external memory.

The comparison results are presented in Table 2. We can observe that the 16×16 arrays performed similarly with both floorplans, even when increasing the frequency from 500 MHz to 1 GHz. For example, all 16×16 design points achieved nearly the same Worst Negative Slack (WNS) for any given frequency, although the Total Negative Slack (TNS) was 79% higher with the semi-ring floorplan at 1 GHz. Additionally, the worst WNS for 16×16 designs was -14 ps, which can easily be adjusted to meet timing requirements. The power consumption of 16×16 designs also showed little variation across different floorplans, differing by only 1-3%. We can conclude from this that the generator is flexible enough to allow for a variety of floorplans, with a reasonable Quality-of-Result (QoR) for each.

However, we observe that for larger design points such as the 32×32 configuration, the difference between the floorplans begins to have a more noticeable impact on our timing results. For example, the semi-ring floorplan achieves a 41% better setup WNS at 1 GHz than the block floorplan, as well as an 81% better TNS, because wires between the scratchpad's SRAMs and the systolic mesh can be routed to travel a shorter average distance. However, neither of the 32×32 floorplans was able to meet timing at 1 GHz, and the setup violations, which were several hundred picoseconds long, were significant enough to require several iterations of physical design attempts to have a chance of closing timing.

Overall, our physical design evaluation demonstrates that semi-ring floorplans can reduce power consumption and achieve faster clock frequencies. They do this by reducing wire lengths and by placing SRAMs where it is easier for physical design tools to route them to the systolic mesh. Furthermore, with 16×16 designs, semi-ring floorplans also reduce area requirements. However, as the systolic array grows to 32×32, the area of the semi-ring floorplan grows faster than the block design. Thus, as an accelerator design grows, it may be necessary to change to a floorplan which clusters SRAMs closer together to meet area requirements.

Figure 10: Semi-ring floorplan of the systolic array in TSMC 16nm for a 16 × 16 design point.

5. DISCUSSION

Embedding and integrating the Gemmini generator within a composable, silicon-proven, open-source platform such as Rocket Chip allows for seamless integration with additional system components such as complex out-of-order cores and vector accelerators. We demonstrate integration with a Rocket RISC-V core, representative of low-energy processors found in embedded devices, as well as integration with BOOM, a high-performance out-of-order core, which revealed the power/performance trade-offs when using a beefier core.

Gemmini was designed as a flexible generator for systolic GEMM accelerators to identify power, performance, and area trends as various parameters are varied, rather than to achieve state-of-the-art ML inference performance. Gemmini's DSE revealed the benefit of specializing the hardware for a weight-stationary-only dataflow, as is the case in the TPU [6], and the system-level evaluation demonstrated that a larger scratchpad is only valuable if a workload is not CPU-limited.

Gemmini targets the most common kernel across many network architectures: matrix multiplication. A large portion of ML inference time is spent on fully connected layers [34], which are implemented as matrix multiplications. Furthermore, compute-intensive convolutions used in CNNs can be efficiently mapped to matrix multiplications. By targeting GEMMs, Gemmini can adapt to different network architectures and layer types, in contrast to specializing for convolutional layers. Additionally, GEMMs are a useful linear algebra primitive, which can expose the Gemmini generator to other application domains such as scientific computing.

Due to the speed limitations of RTL software simulation, some prior works choose to evaluate single-layer performance, and then extrapolate to report the performance of a full DNN. However, extrapolation of layer-by-layer performance neglects to consider shared-cache state between the host processor and the accelerator, as well as host-processor time between layers.

Table 2: Floor-planned Design Points in TSMC 16nm FinFET Process Technology.

Design | Freq (MHz) | Floorplan | Area (mm2) | Power (mW) | Setup WNS (ps) | Setup TNS (ps)
16×16  | 500        | Block     | 1.34       | 321.71     | 1              | 0
16×16  | 500        | Semi-Ring | 1.21       | 312.41     | 0              | 0
16×16  | 1000       | Block     | 1.34       | 773.70     | -12            | -235
16×16  | 1000       | Semi-Ring | 1.21       | 766.12     | -14            | -420
32×32  | 500        | Block     | 2.81       | 1058.24    | -14            | -100
32×32  | 500        | Semi-Ring | 3.01       | 1078.71    | -9             | -59
32×32  | 1000       | Block     | 2.81       | 2796.51    | -530           | -2716
32×32  | 1000       | Semi-Ring | 3.01       | 2683.01    | -315           | -508

Furthermore, FPGA prototypes used to evaluate ASIC DNN accelerators often connect directly to an on-FPGA DRAM controller and thus see a higher memory throughput than an ASIC implementation would. This can distort performance and energy numbers. Gemmini has been evaluated using FireSim, which accurately models a last-level cache and DRAM to preserve simulation fidelity while executing on an FPGA.

Full-system simulation of RTL implementations is important not only for performance evaluation, but for functional validation as well. While an original version of the Gemmini design passed many individual benchmarks and micro-benchmarks, some design decisions, such as methods for handling exceptions, memory management race conditions, and TLB flushes, were exposed only in a full-system multi-process environment such as Linux.

6. RELATED WORK

Systolic architectures first came into prominence in the early 1980s [9, 10], and since then, many systolic accelerators have been developed. There has also been much work on algorithms which can design new systolic arrays methodologically, rather than through ad-hoc intuition [35].

Early systolic arrays were used to compute convolutions [36], solutions to triangular linear systems [9], matrix multiplications, and more. Systolic architectures enable modular and extensible designs, use local neighbor-to-neighbor message passing, and contain easy-to-floorplan regular structures.

Systolic architectures have recently regained popularity, since the convolution and matrix multiplication kernels common in machine learning and deep learning applications are highly amenable to multi-dimensional acceleration using systolic arrays.

Commercially deployed ASIC implementations of NN accelerators include the Google TPU [6] for cloud workloads, as well as edge inference implementations by Samsung [7], Nvidia [37], Apple [38], and Tesla [8, 39]. In particular, a detailed description of the original TPU implementation includes a 256 × 256 matrix multiplication unit implemented using a reduced-precision systolic MAC array with a weight-stationary dataflow for NN inference in the cloud. Successor versions included floating-point representation, additional memory, and improved utilization for both training and inference [33].

Prior work has demonstrated the integration of an open-source commercial DNN accelerator (NVDLA) with the Rocket Chip ecosystem and the FireSim platform [40]. The accelerator in this work was integrated using the memory bus, as opposed to Gemmini, which is integrated using the RoCC interface. Prior work [41] has also demonstrated the integration of academic NN accelerators with the Rocket Chip ecosystem using the RoCC interface, but did not use systolic architectures for that purpose. Gemmini puts an emphasis on enabling design space exploration rather than single design-point integration.

Academic researchers have proposed numerous systolic accelerators, especially for neural-network inference. For example, NeuFlow [42] was a systolic-inspired architecture which allowed individual processing elements (PEs) to be reconfigured at runtime to perform tasks such as multiply-accumulates, divisions, and non-linear activations. ShiDianNao [43], similarly, allowed PEs to be reconfigured at runtime to perform multiply-accumulates, additions, and max poolings. Eyeriss [44] implemented a weight-stationary dataflow using a spatial array. Eyeriss v2 [45] improved on the original Eyeriss by demonstrating a new PE architecture that can operate on sparse CSC-encoded matrices, and a hierarchical mesh NoC capable of unicast, multicast, and broadcast data transfers to maximize reuse. These and other systolic-inspired architectures typically permit both global and local connections between PEs and global memory, which is not strictly systolic, but often improves performance.

Several previous proposals [13, 46, 47] have presented performance and energy benefits resulting from flexible dataflow options in NN accelerators. However, the benefits and impact of the dataflow structure of NN accelerators is still an active area of research, and some works [48] have shown that optimal memory hierarchies and loop-blocking strategies can have a more significant impact on energy efficiency than the choice of dataflows.

Various energy-efficient neural network accelerator proposals have also been presented in the integrated circuits community [49–59]. Many of these proposals focus on exploiting sparsity and quantization features of DNNs. Furthermore, while some of these proposals address runtime configurability, they still address only a single fabrication-capable design point, and most do not present design- and elaboration-time parameterization. Further, most of these accelerators are tested in isolation, often without a fully integrated software environment, hence potentially neglecting system-level effects.

A host of DNN accelerators targeted for FPGA implementation have also been proposed [60–71], taking advantage of FPGA reconfigurability to implement exotic layers, specialize the hardware for a specific network, and evaluate multiple design points. However, FPGA acceleration frameworks do not necessarily translate well to ASIC implementations, and are not ideal for scenarios where energy efficiency is critical.

Some prior works [13, 72–78] use analytical or high-level model-based simulations to evaluate different parameterizations of a proposed accelerator architecture. In contrast, Gemmini performs design space exploration on the RTL directly and uses feedback from FPGA-accelerated simulation and physical design to find optimal design points for ASIC implementation.

Since the energy consumed during DNN inference and matrix multiplication is often dominated by external memory accesses, academic researchers have proposed processing in memory [79–85]. These works include the development of new SRAM circuits and the use of novel devices such as ReRAMs. Gemmini is designed and validated for CMOS implementation, and uses design space exploration to discover the ideal memory access patterns and memory hierarchy to conserve energy.

Researchers have also proposed methodological systems and algorithms to automatically generate systolic architectures directly from the algorithms they are meant to accelerate. For example, PolySA [86] analyzes polyhedral models to attempt to find the optimal mapping between a sequential algorithm and a set of parallel PEs. Yang et al. [48] extended the Halide programming language to automatically generate C++ high-level-synthesis (HLS) implementations of systolic arrays.

Prior work has also introduced TVM [87] and VTA [25] as an integrated research platform for SW/HW evaluation of NN accelerators. While Gemmini and VTA share many architectural similarities, including the use of a GEMM core, explicit memory management, and explicit instruction dependency handling, VTA has primarily targeted FPGA accelerator implementations, as opposed to Gemmini, which currently targets primarily ASIC designs and has been used in the fabrication of multiple test chips. Furthermore, Gemmini's integration with the RISC-V ecosystem enables an additional level of customization in SW/HW co-design.

7. FUTURE WORK

The Gemmini generator has been used in the fabrication of two test systems-on-chip. The chips were taped out within approximately a month of each other in different process technologies, demonstrating the flexibility and utility of the Gemmini generator. Further evaluation of the integration of the Gemmini accelerators within these larger embedded vision processors will be performed once the chips complete fabrication.

As demonstrated previously, CPU operations can significantly slow down inference on our workloads. Operations which map convolutions to GEMMs, such as im2col, can make up a significant portion of this overhead. To address this, we intend to map convolutions to GEMMs transparently in hardware.
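To make the lowering step concrete, the sketch below shows a minimal im2col in C: each output pixel's receptive field is copied into one row of a matrix, so the convolution becomes a single matrix multiplication against the reshaped filter weights. It assumes an NHWC layout, unit stride, 8-bit activations, and no padding; the function name and layout are illustrative and not part of Gemmini's actual software interface.

#include <stdint.h>

/* Expands an H x W x C input into a matrix with one row per output pixel and
 * K*K*C columns, so that conv = im2col(input) x reshaped_weights. */
void im2col(const int8_t *in, int H, int W, int C, int K, int8_t *out) {
    int OH = H - K + 1, OW = W - K + 1;       /* output spatial dimensions */
    for (int oy = 0; oy < OH; oy++)
        for (int ox = 0; ox < OW; ox++)
            for (int ky = 0; ky < K; ky++)
                for (int kx = 0; kx < K; kx++)
                    for (int c = 0; c < C; c++)
                        /* row index: output pixel; column index: (ky, kx, c) */
                        out[((oy * OW + ox) * K * K + ky * K + kx) * C + c] =
                            in[((oy + ky) * W + (ox + kx)) * C + c];
}

The data duplication and address arithmetic in these loops are pure CPU work that contributes nothing to the GEMM itself, which is why performing the mapping transparently in hardware is attractive.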

Some additional overheads come from zero-padding matrices so that their dimensions tile onto the systolic array, which reduces utilization at the boundaries of our arrays. By breaking up a single, large systolic array into numerous smaller ones operating in parallel, we can possibly reduce zero-padding requirements while still preserving the same compute throughput [88].
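As a rough illustration of this padding overhead (a simplified sketch that ignores dataflow and memory effects, and is not Gemmini's actual scheduling model), the following C snippet estimates the fraction of PE work spent on real, non-padded elements when an output matrix is tiled onto square arrays of a given dimension:

#include <stdio.h>

/* Fraction of useful work when an M x N output is tiled onto dim x dim arrays:
 * each dimension is rounded up to a multiple of the array size. */
double utilization(int M, int N, int dim) {
    int padded_M = ((M + dim - 1) / dim) * dim;
    int padded_N = ((N + dim - 1) / dim) * dim;
    return (double)(M * N) / ((double)padded_M * padded_N);
}

int main(void) {
    /* A 100 x 100 output on one 16x16 array vs. several 4x4 arrays. */
    printf("16x16 array: %.2f\n", utilization(100, 100, 16)); /* ~0.80 */
    printf("4x4 arrays : %.2f\n", utilization(100, 100, 4));  /*  1.00 */
    return 0;
}

In this toy example a single 16x16 array wastes roughly 20% of its work on boundary padding, while 4x4 tiles fit exactly; the trade-off is the additional control and interconnect cost of coordinating many small arrays.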

Finally, a generator-based methodology can be useful for the hardware/software co-design process through the integration of hardware generator and compiler parameters. Future integration with optimizing DSLs and compilers such as Halide or TVM will allow code generation that takes the generator parameters into account, enabling better cross-layer data reuse and optimization.

8. CONCLUSION

This work presented Gemmini, an open source and agile systolic array generator that enables systematic evaluations of deep-learning architectures. This systematic evaluation is demonstrated through a DSE case study, identifying bottlenecks in common DNN inference workloads, and capturing the variation in performance improvements across different workloads running on different hardware configurations. With a baseline design equipped with a 16×16 systolic array, Gemmini demonstrated 90× and 70× inference speedups on ResNet-152 and ResNet-50, respectively, when compared to a cache-optimized CPU implementation, and two to three orders of magnitude speedup on MLPs. We demonstrate the critical importance of full-system evaluation by showing that even though an accelerator can effectively accelerate individual layers of DNNs, it often fails to achieve impressive performance improvements on the entire DNN if any part of it is not efficiently mapped onto the accelerator. For example, although a Gemmini baseline design was able to accelerate the first layer of MobileNet by 330×, it failed to accelerate the entire network beyond 6× using a Rocket host processor and 18× using a BOOM host processor, due to the presence of depthwise convolutions. We also show that even with DNNs that have similar network architectures, performance may vary based upon the shape and size of different layers. Looking forward, we believe Gemmini will enable a new range of systematic evaluations and HW/SW co-design of deep learning workloads.

9. REFERENCES

[1] Y. LeCun, Y. Bengio, and G. Hinton, “Deep learning,” Nature, vol. 521, no. 7553, p. 436, 2015.

[2] A. Krizhevsky, I. Sutskever, and G. E. Hinton, “Imagenet classification with deep convolutional neural networks,” in Advances in neural information processing systems, pp. 1097–1105, 2012.

[3] I. Sutskever, O. Vinyals, and Q. V. Le, “Sequence to sequence learning with neural networks,” in Advances in neural information processing systems, pp. 3104–3112, 2014.

[4] G. Brockman, V. Cheung, L. Pettersson, J. Schneider, J. Schulman, J. Tang, and W. Zaremba, “Openai gym,” arXiv preprint arXiv:1606.01540, 2016.

[5] J. Mahler, J. Liang, S. Niyaz, M. Laskey, R. Doan, X. Liu, J. A. Ojea, and K. Goldberg, “Dex-net 2.0: Deep learning to plan robust grasps with synthetic point clouds and analytic grasp metrics,” arXiv preprint arXiv:1703.09312, 2017.

[6] N. P. Jouppi, C. Young, N. Patil, D. Patterson, G. Agrawal, R. Bajwa, S. Bates, S. Bhatia, N. Boden, A. Borchers, R. Boyle, P.-l. Cantin, C. Chao, C. Clark, J. Coriell, M. Daley, M. Dau, J. Dean, B. Gelb, T. V. Ghaemmaghami, R. Gottipati, W. Gulland, R. Hagmann, C. R. Ho, D. Hogberg, J. Hu, R. Hundt, D. Hurt, J. Ibarz, A. Jaffey, A. Jaworski, A. Kaplan, H. Khaitan, D. Killebrew, A. Koch, N. Kumar, S. Lacy, J. Laudon, J. Law, D. Le, C. Leary, Z. Liu, K. Lucke, A. Lundin, G. MacKean, A. Maggiore, M. Mahony, K. Miller, R. Nagarajan, R. Narayanaswami, R. Ni, K. Nix, T. Norrie, M. Omernick, N. Penukonda, A. Phelps, J. Ross, M. Ross, A. Salek, E. Samadiani, C. Severn, G. Sizikov, M. Snelham, J. Souter, D. Steinberg, A. Swing, M. Tan, G. Thorson, B. Tian, H. Toma, E. Tuttle, V. Vasudevan, R. Walter, W. Wang, E. Wilcox, and D. H. Yoon, “In-datacenter performance analysis of a tensor processing unit,” in Proceedings of the 44th Annual International Symposium on Computer Architecture, ISCA ’17, (New York, NY, USA), pp. 1–12, ACM, 2017.

[7] J. Song, Y. Cho, J.-S. Park, J.-W. Jang, S. Lee, J.-H. Song, J.-G. Lee, and I. Kang, “7.1 an 11.5 tops/w 1024-mac butterfly structure dual-core sparsity-aware neural processing unit in 8nm flagship mobile soc,” in 2019 IEEE International Solid-State Circuits Conference (ISSCC), pp. 130–132, IEEE, 2019.

[8] P. J. Bannon, K. A. Hurd, and E. Talpes, “Accelerated mathematical engine,” Jan. 24 2019. US Patent App. 15/710,433.

[9] H.-T. Kung, “Why systolic architectures?,” IEEE Computer, vol. 15, no. 1, pp. 37–46, 1982.

[10] H. Kung and C. E. Leiserson, “Systolic arrays (for vlsi),” in Sparse Matrix Proceedings 1978, vol. 1, pp. 256–282, Society for Industrial and Applied Mathematics, 1979.

[11] B. Nikolic, “Simpler, more efficient design,” in ESSCIRC Conference 2015 - 41st European Solid-State Circuits Conference (ESSCIRC), pp. 20–25, Sep. 2015.

[12] B. Nikolic, E. Alon, and K. Asanovic, “Generating the next wave of custom silicon,” in ESSCIRC 2018 - IEEE 44th European Solid State Circuits Conference (ESSCIRC), pp. 6–11, Sep. 2018.

[13] K. Kwon, A. Amid, A. Gholami, B. Wu, K. Asanovic, and K. Keutzer, “Co-design of deep neural nets and neural net accelerators for embedded vision applications,” in 2018 55th ACM/ESDA/IEEE Design Automation Conference (DAC), pp. 1–6, IEEE, 2018.

[14] K. Asanovic, R. Avizienis, J. Bachrach, S. Beamer, D. Biancolin, C. Celio, H. Cook, D. Dabbelt, J. Hauser, A. Izraelevitz, S. Karandikar, B. Keller, D. Kim, J. Koenig, Y. Lee, E. Love, M. Maas, A. Magyar, H. Mao, M. Moreto, A. Ou, D. A. Patterson, B. Richards, C. Schmidt, S. Twigg, H. Vo, and A. Waterman, “The rocket chip generator,” Tech. Rep. UCB/EECS-2016-17, EECS Department, University of California, Berkeley, Apr 2016.

[15] C. Celio, D. A. Patterson, and K. Asanovic, “The berkeley out-of-order machine (boom): An industry-competitive, synthesizable, parameterized risc-v processor,” EECS Department, University of California, Berkeley, Tech. Rep. UCB/EECS-2015-167, 2015.

[16] S. Karandikar, H. Mao, D. Kim, D. Biancolin, A. Amid, D. Lee, N. Pemberton, E. Amaro, C. Schmidt, A. Chopra, Q. Huang, K. Kovacs, B. Nikolic, R. Katz, J. Bachrach, and K. Asanovic, “Firesim: Fpga-accelerated cycle-exact scale-out system simulation in the public cloud,” in 2018 ACM/IEEE 45th Annual International Symposium on Computer Architecture (ISCA), pp. 29–42, June 2018.

[17] J. Bachrach, H. Vo, B. Richards, Y. Lee, A. Waterman, R. Avizienis, J. Wawrzynek, and K. Asanovic, “Chisel: Constructing hardware in a scala embedded language,” in Proceedings of the 49th Annual Design Automation Conference, DAC ’12, (New York, NY, USA), pp. 1216–1225, ACM, 2012.

[18] B. Jacob, S. Kligys, B. Chen, M. Zhu, M. Tang, A. G. Howard, H. Adam, and D. Kalenichenko, “Quantization and training of neural networks for efficient integer-arithmetic-only inference,” CoRR, vol. abs/1712.05877, 2017.

[19] H. M. Cook, A. S. Waterman, and Y. Lee, “Sifive tilelink specification,” tech. rep., SiFive Inc., 2018. https://www.sifive.com/documentation/tilelink/tilelink-spec/.

[20] S. Han, H. Mao, and W. J. Dally, “Deep compression: Compressing deep neural networks with pruning, trained quantization and huffman coding,” arXiv preprint arXiv:1510.00149, 2015.

[21] Y. Lin, S. Han, H. Mao, Y. Wang, and W. J. Dally, “Deep gradient compression: Reducing the communication bandwidth for distributed training,” arXiv preprint arXiv:1712.01887, 2017.

[22] Y. S. Shao, S. L. Xi, V. Srinivasan, G.-Y. Wei, and D. Brooks, “Co-designing accelerators and soc interfaces using gem5-aladdin,” in 2016 49th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO), pp. 1–12, IEEE, 2016.

[23] J. L. Gustafson and I. Yonemoto, “Beating floating point at its own game: Posit arithmetic,” Supercomputing Frontiers and Innovations, vol. 4, jun 2017.

[24] J. E. Smith, “Decoupled access/execute computer architectures,” in Proceedings of the 9th Annual Symposium on Computer Architecture, ISCA ’82, (Los Alamitos, CA, USA), pp. 112–119, IEEE Computer Society Press, 1982.

[25] T. Moreau, T. Chen, Z. Jiang, L. Ceze, C. Guestrin, and A. Krishnamurthy, “VTA: an open hardware-software stack for deep learning,” CoRR, vol. abs/1807.04188, 2018.

[26] D. Biancolin, S. Karandikar, D. Kim, J. Koenig, A. Waterman, J. Bachrach, and K. Asanovic, “Fased: Fpga-accelerated simulation and evaluation of dram,” in The 2019 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays (FPGA’19), FPGA ’19, (New York, NY, USA), ACM, 2019.

[27] D. Claudiu Ciresan, U. Meier, L. M. Gambardella, and J. Schmidhuber, “Deep big simple neural nets excel on handwritten digit recognition,” arXiv preprint arXiv:1003.0358, 2010.

[28] U. Meier, D. C. Ciresan, L. M. Gambardella, and J. Schmidhuber, “Better digit recognition with a committee of simple neural nets,” in 2011 International Conference on Document Analysis and Recognition, pp. 1250–1254, IEEE, 2011.

[29] X. Lu, Y. Tsao, S. Matsuda, and C. Hori, “Speech enhancement based on deep denoising autoencoder,” in Interspeech, pp. 436–440, 2013.

[30] J. Ngiam, A. Khosla, M. Kim, J. Nam, H. Lee, and A. Y. Ng, “Multimodal deep learning,” in Proceedings of the 28th international conference on machine learning (ICML-11), pp. 689–696, 2011.

[31] S. Hadjis, F. Abuzaid, C. Zhang, and C. Re, “Caffe con troll: Shallow ideas to speed up deep learning,” in Proceedings of the Fourth Workshop on Data Analytics in the Cloud, DanaC’15, (New York, NY, USA), pp. 2:1–2:4, ACM, 2015.

[32] S. Chetlur, C. Woolley, P. Vandermersch, J. Cohen, J. Tran, B. Catanzaro, and E. Shelhamer, “cudnn: Efficient primitives for deep learning,” CoRR, vol. abs/1410.0759, 2014.

[33] Google, “System architecture | cloud tpu | google cloud.” https://cloud.google.com/tpu/docs/system-architecture, 2019. Accessed: 2019-07-15.

[34] J. Park, M. Naumov, P. Basu, S. Deng, A. Kalaiah, D. Khudia, J. Law, P. Malani, A. Malevich, S. Nadathur, J. Pino, M. Schatz, A. Sidorov, V. Sivakumar, A. Tulloch, X. Wang, Y. Wu, H. Yuen, U. Diril, D. Dzhulgakov, K. Hazelwood, B. Jia, Y. Jia, L. Qiao, V. Rao, N. Rotem, S. Yoo, and M. Smelyanskiy, “Deep learning inference in facebook data centers: Characterization, performance optimizations and hardware implications,” 2018.

[35] A. Ivanyi, ed., Algorithms of Informatics, vol. 2. Kempelen Farkas Hallgatoi Informacios Kozpont, 2011.

[36] H. Kung and S. W. Song, “A Systolic 2-D Convolution Chip,” tech. rep., CARNEGIE-MELLON UNIV PITTSBURGH PA DEPT OF COMPUTER SCIENCE, 1981.

[37] NVIDIA, “Nvidia turing gpu architecture.” https://www.nvidia.com/content/dam/en-zz/Solutions/design-visualization/technologies/turing-architecture/NVIDIA-Turing-Architecture-Whitepaper.pdf, 2018. Accessed: 2019-08-01.

[38] Apple, “The future is here: iphone x.” https://www.apple.com/newsroom/2017/09/the-future-is-here-iphone-x/, 2017. Accessed: 2019-08-01.

[39] P. J. Bannon and K. A. Hurd, “Systems and methods for hardware-based pooling,” July 4 2019. US Patent App. 15/862,369.

[40] F. Farshchi, Q. Huang, and H. Yun, “Integrating nvidia deep learning accelerator (nvdla) with risc-v soc on firesim,” CoRR, 2019.

[41] S. Eldridge, A. Waterland, M. Seltzer, J. Appavoo, and A. Joshi, “Towards general-purpose neural network computing,” in 2015 International Conference on Parallel Architecture and Compilation (PACT), pp. 99–112, Oct 2015.

[42] C. Farabet, B. Martini, B. Corda, P. Akselrod, E. Culurciello, and Y. LeCun, “Neuflow: A runtime reconfigurable dataflow processor for vision,” in CVPR Workshops, pp. 109–116, 2011.

[43] Z. Du, R. Fasthuber, T. Chen, P. Ienne, L. Li, T. Luo, X. Feng, Y. Chen, and O. Temam, “Shidiannao: Shifting vision processing closer to the sensor,” in 2015 ACM/IEEE 42nd Annual International Symposium on Computer Architecture (ISCA), pp. 92–104, June 2015.

[44] Y.-H. Chen, T. Krishna, J. S. Emer, and V. Sze, “Eyeriss: An energy-efficient reconfigurable accelerator for deep convolutional neural networks,” IEEE Journal of Solid-State Circuits, vol. 52, no. 1, pp. 127–138, 2017.

[45] Y. Chen, T. Yang, J. Emer, and V. Sze, “Eyeriss v2: A flexible accelerator for emerging deep neural networks on mobile devices,” IEEE Journal on Emerging and Selected Topics in Circuits and Systems, vol. 9, pp. 292–308, June 2019.

[46] W. Lu, G. Yan, J. Li, S. Gong, Y. Han, and X. Li, “Flexflow: A flexible dataflow accelerator architecture for convolutional neural networks,” in 2017 IEEE International Symposium on High Performance Computer Architecture (HPCA), pp. 553–564, Feb 2017.

[47] F. Tu, S. Yin, P. Ouyang, S. Tang, L. Liu, and S. Wei, “Deep convolutional neural network architecture with reconfigurable computation patterns,” IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol. 25, pp. 2220–2233, Aug 2017.

[48] X. Yang, M. Gao, J. Pu, A. Nayak, Q. Liu, S. Bell, J. Setter, K. Cao, H. Ha, C. Kozyrakis, and M. Horowitz, “DNN dataflow choice is overrated,” CoRR, vol. abs/1809.04070, 2018.

[49] K. Ueyoshi, K. Ando, K. Hirose, S. Takamaeda-Yamazaki, J. Kadomoto, T. Miyata, M. Hamada, T. Kuroda, and M. Motomura, “Quest: A 7.49tops multi-purpose log-quantized dnn inference engine stacked on 96mb 3d sram using inductive-coupling technology in 40nm cmos,” in 2018 IEEE International Solid-State Circuits Conference (ISSCC), pp. 216–218, Feb 2018.

[50] J. Lee, C. Kim, S. Kang, D. Shin, S. Kim, and H. Yoo, “Unpu: A 50.6tops/w unified deep neural network accelerator with 1b-to-16b fully-variable weight bit-precision,” in 2018 IEEE International Solid-State Circuits Conference (ISSCC), pp. 218–220, Feb 2018.

[51] D. Bankman, L. Yang, B. Moons, M. Verhelst, and B. Murmann, “An always-on 3.8µj/86% cifar-10 mixed-signal binary cnn processor with all memory on chip in 28nm cmos,” in 2018 IEEE International Solid-State Circuits Conference (ISSCC), pp. 222–224, Feb 2018.

[52] T. Karnik, D. Kurian, P. Aseron, R. Dorrance, E. Alpman, A. Nicoara, R. Popov, L. Azarenkov, M. Moiseev, L. Zhao, S. Ghosh, R. Misoczki, A. Gupta, M. Akhila, S. Muthukumar, S. Bhandari, Y. Satish, K. Jain, R. Flory, C. Kanthapanit, E. Quijano, B. Jackson, H. Luo, S. Kim, V. Vaidya, A. Elsherbini, R. Liu, F. Sheikh, O. Tickoo, I. Klotchkov, M. Sastry, S. Sun, M. Bhartiya, A. Srinivasan, Y. Hoskote, H. Wang, and V. De, “A cm-scale self-powered intelligent and secure iot edge mote featuring an ultra-low-power soc in 14nm tri-gate cmos,” in 2018 IEEE International Solid-State Circuits Conference (ISSCC), pp. 46–48, Feb 2018.

[53] D. Shin, J. Lee, J. Lee, and H. Yoo, “14.2 dnpu: An 8.1tops/w reconfigurable cnn-rnn processor for general-purpose deep neural networks,” in 2017 IEEE International Solid-State Circuits Conference (ISSCC), pp. 240–241, Feb 2017.

[54] S. Yin, P. Ouyang, S. Tang, F. Tu, X. Li, L. Liu, and S. Wei, “A 1.06-to-5.09 tops/w reconfigurable hybrid-neural-network processor for deep learning applications,” in 2017 Symposium on VLSI Circuits, pp. C26–C27, June 2017.

[55] K. Ando, K. Ueyoshi, K. Orimo, H. Yonekawa, S. Sato, H. Nakahara, M. Ikebe, T. Asai, S. Takamaeda-Yamazaki, T. Kuroda, and M. Motomura, “Brein memory: A 13-layer 4.2 k neuron/0.8 m synapse binary/ternary reconfigurable in-memory deep neural network accelerator in 65 nm cmos,” in 2017 Symposium on VLSI Circuits, pp. C24–C25, June 2017.

[56] C. Kim, S. Kang, D. Shin, S. Choi, Y. Kim, and H. Yoo, “A 2.1tflops/w mobile deep rl accelerator with transposable pe array and experience compression,” in 2019 IEEE International Solid-State Circuits Conference (ISSCC), pp. 136–138, Feb 2019.

[57] A. Sayal, S. Fathima, S. S. T. Nibhanupudi, and J. P. Kulkarni, “14.4 all-digital time-domain cnn engine using bidirectional memory delay lines for energy-efficient edge computing,” in 2019 IEEE International Solid-State Circuits Conference (ISSCC), pp. 228–230, Feb 2019.

[58] J. Lee, J. Lee, D. Han, J. Lee, G. Park, and H. Yoo, “7.7 lnpu: A 25.3tflops/w sparse deep-neural-network learning processor with fine-grained mixed precision of fp8-fp16,” in 2019 IEEE International Solid-State Circuits Conference (ISSCC), pp. 142–144, Feb 2019.

[59] J. Yue, R. Liu, W. Sun, Z. Yuan, Z. Wang, Y. Tu, Y. Chen, A. Ren, Y. Wang, M. Chang, X. Li, H. Yang, and Y. Liu, “A 65nm 0.39-to-140.3tops/w 1-to-12b unified neural network processor using block-circulant-enabled transpose-domain acceleration with 8.1× higher tops/mm2 and 6t hbst-tram-based 2d data-reuse architecture,” in 2019 IEEE International Solid-State Circuits Conference (ISSCC), pp. 138–140, Feb 2019.

[60] H. Zeng, R. Chen, C. Zhang, and V. Prasanna, “A framework for generating high throughput CNN implementations on FPGAs,” in Proceedings of the 2018 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays - FPGA '18, ACM Press, 2018.

[61] Y. Wang, J. Xu, Y. Han, H. Li, and X. Li, “DeepBurning: automatic generation of FPGA-based learning accelerators for the neural network family,” in Proceedings of the 53rd Annual Design Automation Conference on - DAC '16, ACM Press, 2016.

[62] J. Shen, Y. Huang, Z. Wang, Y. Qiao, M. Wen, and C. Zhang, “Towards a uniform template-based architecture for accelerating 2d and 3d CNNs on FPGA,” in Proceedings of the 2018 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays - FPGA '18, ACM Press, 2018.

[63] R. DiCecco, G. Lacey, J. Vasiljevic, P. Chow, G. Taylor, and S. Areibi, “Caffeinated FPGAs: FPGA framework for convolutional neural networks,” in 2016 International Conference on Field-Programmable Technology (FPT), IEEE, dec 2016.

[64] Y. Guan, H. Liang, N. Xu, W. Wang, S. Shi, X. Chen, G. Sun, W. Zhang, and J. Cong, “FP-DNN: An automated framework for mapping deep neural networks onto FPGAs with RTL-HLS hybrid templates,” in 2017 IEEE 25th Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM), IEEE, apr 2017.

[65] K. Guo, L. Sui, J. Qiu, J. Yu, J. Wang, S. Yao, S. Han, Y. Wang, and H. Yang, “Angel-eye: A complete design flow for mapping CNN onto embedded FPGA,” IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, vol. 37, pp. 35–47, jan 2018.

[66] X. Zhang, J. Wang, C. Zhu, Y. Lin, J. Xiong, W.-m. Hwu, and D. Chen, “Dnnbuilder: An automated tool for building high-performance dnn hardware accelerators for fpgas,” in Proceedings of the International Conference on Computer-Aided Design, ICCAD ’18, (New York, NY, USA), pp. 56:1–56:8, ACM, 2018.

[67] C. Zhang, P. Li, G. Sun, Y. Guan, B. Xiao, and J. Cong, “Optimizing fpga-based accelerator design for deep convolutional neural networks,” in Proceedings of the 2015 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays, FPGA ’15, (New York, NY, USA), pp. 161–170, ACM, 2015.

[68] S. I. Venieris and C. Bouganis, “fpgaconvnet: A framework for mapping convolutional neural networks on fpgas,” in 2016 IEEE 24th Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM), pp. 40–47, May 2016.

[69] H. Sharma, J. Park, D. Mahajan, E. Amaro, J. K. Kim, C. Shao, A. Mishra, and H. Esmaeilzadeh, “From high-level deep neural models to fpgas,” in Microarchitecture (MICRO), 2016 49th Annual IEEE/ACM International Symposium on, pp. 1–12, IEEE, 2016.

[70] J. Zhang and J. Li, “Improving the performance of opencl-based fpga accelerator for convolutional neural network,” in ACM/SIGDA International Symposium on Field-Programmable Gate Arrays, 2017.

[71] C. Wang, L. Gong, Q. Yu, X. Li, Y. Xie, and X. Zhou, “Dlau: A scalable deep learning accelerator unit on fpga,” IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, vol. 36, pp. 513–517, March 2017.

[72] R. Yazdani, A. Segura, J. Arnau, and A. Gonzalez, “An ultra low-power hardware accelerator for automatic speech recognition,” in 2016 49th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO), pp. 1–12, Oct 2016.

[73] L. Song, X. Qian, H. Li, and Y. Chen, “Pipelayer: A pipelined reram-based accelerator for deep learning,” in 2017 IEEE International Symposium on High Performance Computer Architecture (HPCA), pp. 541–552, Feb 2017.

[74] P. Srivastava, M. Kang, S. K. Gonugondla, S. Lim, J. Choi, V. Adve, N. S. Kim, and N. Shanbhag, “Promise: An end-to-end design of a programmable mixed-signal accelerator for machine-learning algorithms,” in 2018 ACM/IEEE 45th Annual International Symposium on Computer Architecture (ISCA), pp. 43–56, June 2018.

[75] S. Sharify, A. D. Lascorz, K. Siu, P. Judd, and A. Moshovos, “Loom: Exploiting weight and activation precisions to accelerate convolutional neural networks,” in Proceedings of the 55th Annual Design Automation Conference, DAC ’18, (New York, NY, USA), pp. 20:1–20:6, ACM, 2018.

[76] S. Angizi, Z. He, and D. Fan, “Dima: A depthwise cnn in-memory accelerator,” in 2018 IEEE/ACM International Conference on Computer-Aided Design (ICCAD), pp. 1–8, Nov 2018.

[77] C. Min, J. Mao, H. Li, and Y. Chen, “Neuralhmc: An efficient hmc-based accelerator for deep neural networks,” in Proceedings of the 24th Asia and South Pacific Design Automation Conference, ASPDAC ’19, (New York, NY, USA), pp. 394–399, ACM, 2019.

[78] T. Nowatzki, V. Gangadhar, N. Ardalani, and K. Sankaralingam, “Stream-dataflow acceleration,” in 2017 ACM/IEEE 44th Annual International Symposium on Computer Architecture (ISCA), pp. 416–429, June 2017.

[79] J. Liu, H. Zhao, M. A. Ogleari, D. Li, and J. Zhao, “Processing-in-memory for energy-efficient neural network training: A heterogeneous approach,” in 2018 51st Annual IEEE/ACM International Symposium on Microarchitecture (MICRO), pp. 655–668, Oct 2018.

[80] P. Chi, S. Li, C. Xu, T. Zhang, J. Zhao, Y. Liu, Y. Wang, and Y. Xie, “Prime: A novel processing-in-memory architecture for neural network computation in reram-based main memory,” in 2016 ACM/IEEE 43rd Annual International Symposium on Computer Architecture (ISCA), pp. 27–39, June 2016.

[81] H. Yan, H. R. Cherian, E. C. Ahn, and L. Duan, “Celia: A device and architecture co-design framework for stt-mram-based deep learning acceleration,” in Proceedings of the 2018 International Conference on Supercomputing, ICS ’18, (New York, NY, USA), pp. 149–159, ACM, 2018.

[82] H. Yan, A. H. Aboutalebi, and L. Duan, “Efficient allocation and heterogeneous composition of nvm crossbar arrays for deep learning acceleration,” in 2018 IEEE 37th International Performance Computing and Communications Conference (IPCCC), pp. 1–8, Nov 2018.

[83] Y. Zha, E. Nowak, and J. Li, “Liquid silicon: A nonvolatile fully programmable processing-in-memory processor with monolithically integrated reram for big data/machine learning applications,” in 2019 Symposium on VLSI Circuits, 2019.

[84] Y. Ji, Y. Zhang, X. Xie, S. Li, P. Wang, X. Hu, Y. Zhang, and Y. Xie, “Fpsa: A full system stack solution for reconfigurable reram-based nn accelerator architecture,” in Proceedings of the Twenty-Fourth International Conference on Architectural Support for Programming Languages and Operating Systems, ASPLOS ’19, (New York, NY, USA), pp. 733–747, ACM, 2019.

[85] L. Chang, X. Ma, Z. Wang, Y. Zhang, W. Zhao, and Y. Xie, “Corn: In-buffer computing for binary neural network,” in 2019 Design, Automation Test in Europe Conference Exhibition (DATE), pp. 384–389, March 2019.

[86] J. Cong and J. Wang, “Polysa: polyhedral-based systolic array auto-compilation,” in 2018 IEEE/ACM International Conference on Computer-Aided Design (ICCAD), pp. 1–8, IEEE, 2018.

[87] T. Chen, T. Moreau, Z. Jiang, L. Zheng, E. Yan, H. Shen, M. Cowan, L. Wang, Y. Hu, L. Ceze, C. Guestrin, and A. Krishnamurthy, “TVM: An automated end-to-end optimizing compiler for deep learning,” in 13th USENIX Symposium on Operating Systems Design and Implementation (OSDI 18), (Carlsbad, CA), pp. 578–594, USENIX Association, 2018.

[88] H. T. Kung, “Don’t use a single large systolic array, use many small ones instead.” Workshop on ML for Systems at ISCA, 2019.