
Herald: Optimizing Heterogeneous DNN Accelerators

Hyoukjun Kwon 1,2, Liangzhen Lai 2, Tushar Krishna 1, and Vikas Chandra 2

1 Georgia Institute of Technology, Atlanta, Georgia, USA  2 Facebook, Menlo Park, California, USA

[email protected], [email protected], [email protected], [email protected]

ABSTRACT

New real-time applications such as virtual reality (VR), with multiple DNN-based sub-tasks (e.g., image classification, segmentation, hand tracking, etc.), are emerging. Such applications rely on a different DNN for each sub-task and must meet a target processing rate (frame rate) for each one. Thus, this new compound DNN workload imposes a twofold challenge on accelerator designs: (1) meeting the target processing rate of each sub-task's DNN and (2) efficiently processing heterogeneous layers. As a solution, we explore heterogeneous DNN accelerators (HDAs). HDAs consist of multiple accelerator substrates (i.e., sub-accelerators) that implement different mapping styles, supporting model parallelism and providing adaptivity to heterogeneous DNN layers. However, an HDA's performance and energy depend heavily on (1) how we partition hardware resources among the sub-accelerators within a given hardware budget and (2) how we schedule layers onto the sub-accelerators. That is, HDAs can result in sub-optimal designs unless carefully co-optimized.

Therefore, we propose an HDA optimization framework, Herald, which co-optimizes hardware partitioning and layer scheduling. Herald exploits the simplicity of the dependence graph of DNN layers to reduce the complexity of the problem; on average, optimization requires 9.48 ms per layer on a laptop with an i9-9880H processor and 16 GB of memory. Herald can be used both at design time to perform co-optimization (optimizer) and at compile time to perform layer scheduling and report expected latency and energy (cost model and scheduler). In our case studies, the co-optimized HDAs with the best EDP that Herald identified provided 56.0% EDP improvement, with 46.82% latency and 6.3% energy benefits on average across the workloads and accelerators we evaluate, compared to the best monolithic accelerator for each evaluation setting, by deploying two complementary-style sub-accelerators in an HDA.

1. INTRODUCTION

Deep neural network (DNN) accelerators, hardware specialized for DNN computations, have played a key role in providing high computation capability with high energy efficiency for DNN computations, from edge/mobile device [3, 13] to cloud device [15] scale.

Figure 1: EDP estimation of Shi-diannao [8], NVDLA [26], and Eyeriss row-stationary [5] style mappings on (a) Resnet50, (b) MobileNetV2, and (c) UNet. We apply 256 PEs and 32 GB/s NoC bandwidth to the MAESTRO cost model [17] to estimate energy and latency. The breakdown is at the granularity of unit operations (early layer, late layer, fully-connected, depth-wise, and up-scale CONV).

The enhanced DNN computation capabilities of DNN accelerators enabled the deployment of diverse applications that rely heavily on DNNs, such as face recognition [38], image super-resolution [28], and so on. Based on such innovations in hardware, real-time applications with multiple sub-tasks and a DNN model for each, such as AR/VR [40] or autonomous driving [20, 32, 39], have become feasible. For example, a VR application can require object detection, hand tracking, speech recognition, pose estimation, and so on [1, 10, 30], which all rely on different DNN models to achieve state-of-the-art performance. Because of the diversity of sub-tasks, the DNN models supporting each sub-task are compound and heterogeneous.

Such compound DNN workloads impose a new challenge on DNN accelerators: the workloads are expected to complete at target frame rates (i.e., they are real-time applications), and they are heterogeneous in layer operations, layer shapes, and so on. In particular, the heterogeneity of the workload is one of the major challenges for monolithic DNN accelerators based on a single processing style (i.e., dataflow [5]) because they are often over-specialized for specific layer shapes and operations [17], which is what provides them an efficiency boost in the first place. For example, NVDLA [26] style accelerators exploit input and output channel parallelism, which provides near roof-line throughput for CONV2D layers with deep channels, as shown in Figure 1 (a) and (b). However, when running a CONV2D layer with a small number of channels, an NVDLA style accelerator suffers severe compute unit under-utilization, which leads to low throughput and energy efficiency, as the results in Figure 1 (c) imply.


That is, tuning an accelerator for the average case can lead to uniform inefficiency across all the layers in heterogeneous DNN workloads.

Such an efficiency drop due to layer heterogeneity in operation and shape can be significant depending on the target DNN models. For example, in a recent classification network, Resnet50 [11], layers with deep channels and low activation resolution1 are dominant; all layers except the first are of this kind. Therefore, NVDLA style accelerators provide high compute unit utilization on all layers except the input layer. However, in a segmentation network, UNet [33], layers with shallow channels and high activation resolution account for 41.7% of the layers. NVDLA style accelerators suffer from low compute unit utilization on those less favorable layers, resulting in a severe efficiency drop: NVDLA style is 3.6× slower than Shi-diannao style on such layers, while it is 3.0× faster than Shi-diannao style on Resnet50 [17]. Flexible accelerators, which include coarse-grained reconfigurable architecture (CGRA) style ASIC accelerators [7, 19, 23] and FPGA accelerators [16, 21, 35], are potential solutions to the workload heterogeneity problem. However, the extra hardware components for reconfigurability incur static efficiency overheads, which make them less desirable under the stringent energy constraints of edge, mobile, and cloud devices (e.g., MAERI required 96% more power than a systolic array design [19]).

To deal with the challenges of layer heterogeneity, we propose to design heterogeneous DNN accelerators (HDAs) that contain diverse sub-accelerators within one accelerator chip. This is a promising option because we can assign the most efficient sub-accelerator to each layer. Also, with multiple sub-accelerators, we can exploit layer parallelism, which can be extremely useful for new applications that run many DNNs simultaneously. However, because we split the finite hardware resources of a chip across multiple sub-accelerators, each sub-accelerator may be less powerful than a full homogeneous (or monolithic) accelerator. Therefore, as reported in recent work on heterogeneous accelerators for database queries [22], heterogeneous accelerators can be either more or less efficient than monolithic accelerators depending on the design parameters (architecture), compiler mapping (dataflow + tile sizing (loop blocking)), and workload (models). We observe the same trend for DNNs, as presented in Section 5.2, which implies two findings: (1) heterogeneous accelerators have potential latency and energy gains over monolithic accelerators, and (2) to materialize such benefits, we need a systematic optimization framework that considers all three of the aforementioned aspects: architecture, mapping, and DNN models.

Therefore, we propose an HDA optimization framework, Herald, illustrated in Figure 2, which performs two-phase co-optimization at design and compile time. As inputs, Herald receives the total amount of available hardware resources (number of PEs, NoC bandwidth, and so on), the mappings of sub-accelerators, and the target DNN models to run. As outputs, Herald generates the hardware resource partitioning for each sub-accelerator, a layer execution schedule, and the estimated total latency and energy using the MAESTRO cost model [17].

1 We compare the number of input channels (C) and the input activation height (Y) to define deep/shallow channels and high/low activation resolution: if C > Y, a layer is deep and low-resolution, and vice versa.

That is, Herald is an HDA design space explorer and layer-granularity scheduler for user-specified workloads. Herald mainly targets co-optimization at design and compile time to fully exploit the benefits of hardware-schedule co-optimization, specializing both hardware and schedule for a target workload. When the workload changes at run time, Herald can still generate an optimized schedule for the new workload on the underlying hardware. In our evaluations, we use Herald to create HDA design points combining multiple accelerators with complementary dataflow styles. The design points with the best EDP for each experiment provided 24.9% lower energy-delay product (16.1% latency and 7.6% energy benefits) across the complex edge-device workloads we evaluate, compared to the best monolithic design we evaluate.

We summarize the contributions of this paper as follows:

• To the best of our knowledge, Herald is the first work to explore heterogeneous architecture and layer scheduling in the DNN accelerator domain.

• Herald automatically searches for optimal hardware resource partitioning among sub-accelerators in HDAs (it can be used as an optimizer at design time).

• Herald explores optimal layer schedules that match each layer with its most efficient accelerator, considering both local (layer-accelerator matching) and global (load-balancing) optimization, along with expected latency and energy (it can be used as a cost model for HDAs at compile time).

2. BACKGROUND

2.1 DNN Operations and Layer Sizes

The convolutional neural network (CNN) is one of the most popular DNN classes for computer vision applications, which dominate edge-device workloads. Therefore, we focus on CNNs and the operations in recent CNN models listed in Table 1. The basis of CNN operation is the convolution (CONV2D), a six-dimensional (or seven, if we include batch) set of multiply-and-accumulate (MAC) operations over two tensors, input activations and filter weights, as illustrated in Figure 3. That is, the operations are sets of element-wise multiplications (or Hadamard products) and accumulations, as described in the loop nest in Figure 3 (b). Other operations are based on the same style of computation but accumulate along different dimensions (depth-wise CONV2D), insert zeros into the input activation (up-scale CONV), and so on.

2.2 Dataflow and Mapping

Dataflow collectively refers to loop ordering and spatial unrolling (partitioning) [5, 41] strategies. Dataflows are often represented in loop-nest form [6], as shown in Figure 4, as a loop nest whose loop bounds are left unfilled. Starting from the base loop nest (i.e., an "unmapped" loop nest such as the one in Figure 3 (b)), a series of loop interchanges and parallelizations modifies how we compute a DNN operation while preserving what we compute. By providing loop bounds to the representation (i.e., tile sizes or blocking information), we obtain a "mapping," an instance of a dataflow that contains the full information to map a DNN operation onto an accelerator [18].


Figure 2: Multi-task real-time applications motivating heterogeneous DNN accelerators (HDAs), and an overview of Herald, an HDA optimization framework. Herald receives a real-time, heterogeneous DNN workload and hardware resource parameters as inputs; its HDA design space explorer (PE and bandwidth partitioning) and layer scheduler (layer ordering/assignment and idle time elimination) use MAESTRO as the cost model; and it outputs optimized HDA design parameters, an optimized schedule, and the expected latency and energy.

Figure 3: An illustration of the fundamental convolution operation (CONV2D). (a) shows a visualization of the operation as a sliding-window stencil operation, and (b) shows the precise description of CONV2D in a loop nest. Note that we illustrate one representative way (mapping) to compute CONV2D and do not include non-linear functions in the illustration for simplicity.

Convention: K: output channel (or filter); C: input channel; Y': input activation row; X': input activation column; Y: output activation row; X: output activation column; R: filter row; S: filter column.

Figure 3 (b) loop nest:
for(k=0; k<K; k++)
 for(c=0; c<C; c++)
  for(y=0; y<Y; y++)
   for(x=0; x<X; x++)
    for(r=0; r<R; r++)
     for(s=0; s<S; s++)
      Output[k][y][x] += Input[c][y+r][x+s] * Filter[k][c][r][s];


2.3 DNN Accelerators

DNN accelerators [2, 6, 8, 9, 15, 19, 23, 26, 27, 34, 36] are hardware specialized for the DNN forward (or both forward and backward) pass, providing tremendous throughput and energy benefits over CPUs and GPUs. For example, TPU [15] reported 53.5× and 25.74× higher performance than CPUs and GPUs, respectively, on average over two CNNs. To achieve such high efficiency, DNN accelerators exploit hundreds of processing elements (PEs) to handle the billions of multiply-and-accumulate (MAC) operations in DNN models, and they exploit scratchpad memories and inter-PE interconnection networks to maximize data reuse and amortize the massive energy cost of fetching data from DRAM. We can identify three types of DNN accelerators, as follows, including the HDAs we propose to design.

Monolithic Accelerators. Most previously proposed DNN accelerators are homogeneous, or monolithic: they have a regular architecture, as shown in Figure 5 (a), and run only one dataflow style. Monolithic DNN accelerators can have more complicated structures, but all of them replicate a unit structure to scale up.

Flexible Accelerators. Some accelerators, such as Flexflow [23], MAERI [19], and EyerissV2 [7], include extra hardware components for runtime reconfigurability, which enables them to run various mapping styles efficiently for the workload.

Figure 4: Loop nest representation of dataflows from recent accelerators [5, 8]. We follow the same convention as the loop nest in Figure 3. Numbers after loop variables indicate the tile level.

(a) NVDLA style dataflow:
for(k1=0; k1<K1; k1++)
 pfor(k0=0; k0<K0; k0++)
  for(c1=0; c1<C1; c1++)
   for(y1=0; y1<Y1; y1++)
    for(x1=0; x1<X1; x1++)
     pfor(c0=0; c0<C0; c0++)
      for(r1=0; r1<R; r1++)
       for(s1=0; s1<S; s1++)
        for(y0=0; y0<Y0; y0++)
         for(x0=0; x0<X0; x0++)
          for(r0=0; r0<1; r0++)
           for(s0=0; s0<1; s0++) {
            k = k1*K0 + k0; c = c1*C0 + c0; ... x = x1*X0 + x0;
            Output[k][y][x] += Input[c][y+r][x+s] * Filter[k][c][r][s];
           }

(b) Shi-diannao style dataflow:
for(k1=0; k1<K1; k1++)
 for(k0=0; k0<K0; k0++)
  for(c1=0; c1<C1; c1++)
   for(y1=0; y1<Y1; y1++)
    for(x1=0; x1<X1; x1++)
     for(c0=0; c0<C0; c0++)
      pfor(y0=0; y0<Y0; y0++)
       pfor(x0=0; x0<X0; x0++)
        for(r=0; r<R; r++)
         for(s=0; s<S; s++) {
          k = k1*K0 + k0; c = c1*C0 + c0; ... x = x1*X0 + x0;
          Output[k][y][x] += Input[c][y+r][x+s] * Filter[k][c][r][s];
         }

Because each DNN layer prefers a different mapping style, as we discuss in Section 3.3, adapting the mapping style to each layer can provide latency and energy benefits. For example, a recent study [17] reported 37% latency and 10% energy gains assuming zero reconfigurability overhead2. Although flexible accelerators can provide such benefits, their energy cost can be higher depending on the cost of reconfigurability. For example, a 1024-MAC MAERI [17], a flexible accelerator, synthesized with a 28nm library required 30.2% more power than a 1024-MAC NVDLA [26], a monolithic accelerator, synthesized with the same library.

Heterogeneous Accelerators. A heterogeneous DNN accelerator (HDA) is an accelerator that contains multiple sub-accelerators running distinct mapping styles, as illustrated in Figure 5 (c). Like flexible accelerators, HDAs exploit the different mapping-style preferences of DNN layers to optimize latency and energy, but with a different strategy: HDAs assign each layer to the sub-accelerator with the layer's most preferred mapping style. Although an HDA must distribute its hardware resources among sub-accelerators at design time, resulting in an array of accelerators each smaller than a monolithic or flexible accelerator with the same hardware budget, the multiple accelerator instances enable layer parallelism when the workload is a complex application that includes multiple DNN models. We discuss these potential benefits of HDAs in Section 3.4.

In DNN accelerators, the mapping determines latency and energy consumption because it determines the number of buffer accesses, the degree of parallelization (mapping utilization of PEs), buffer size requirements, and so on [5, 17, 23]. In addition, the efficiency of a mapping depends on the layer operation and sizes, which is one of the key rationales for HDAs. We discuss these aspects in the following section.

2 Since the overhead is implementation-specific, the paper excluded it to show the full potential.


Figure 5: Example (a) monolithic, (b) flexible, and (c) heterogeneous DNN accelerators (HDAs).


3. MOTIVATION

In this section, we discuss the compound workloads that motivate layer parallelism, the heterogeneity of layer operations and sizes, and the impact of mapping style on the efficiency of a DNN accelerator, which together motivate the use of HDAs.

3.1 Realtime and Compound Workloads

The number of applications relying on DNNs is increasing, and they are often tightly coupled with edge devices such as self-driving cars, smartphones, VR headsets, or AR glasses. Computer vision is one of the most popular application domains, including tasks like face recognition, image segmentation, image super-resolution, depth estimation, and so on [14]. Such applications often consist of multiple computer vision sub-tasks that implement the desired functionality, resulting in compound and heterogeneous DNN workloads. For example, a VR headset simultaneously runs DNN models for hand tracking, pose estimation, action segmentation, and multiple image classifications [40], each at a designated frame rate. Another example is self-driving cars, which perform lane detection [20], driving scene understanding [32], and so on.

The compound and heterogeneous DNN workloads of such emerging applications are based on multiple DNN models, each targeting a processing rate (or frame rate) because these are real-time applications by nature, and this imposes a new design challenge on DNN accelerators. Such features motivate DNN accelerators with multiple substrates that simultaneously run layers from different DNN models, to meet the processing rate requirement of every sub-task's model. Another challenge is layer heterogeneity across multiple DNN models for diverse tasks, which we discuss next.

3.2 Layer Heterogeneity in DNN Models

Because of the diversity of sub-tasks in compound DNN workloads, the layers in the DNNs are also diverse, or heterogeneous, while DNN accelerators are often optimized for specific DNN models to exploit the benefits of specialization. Therefore, understanding the layer heterogeneity and optimizing for it is crucial to efficiently support compound DNN workloads.

Figure 6: Trends in the layer shapes of (a) classification networks such as Resnet [11] and (b) segmentation networks such as UNet [33].

Recent DNN models exhibit two classes of heterogeneity: layer shape (i.e., the size of each layer dimension) and layer operations. We summarize both classes of heterogeneity for recent DNN models related to multi-DNN applications, such as autonomous driving and AR/VR devices, in Table 1.

3.2.1 Layer Shape

Classification networks such as Resnet [11] gradually reduce the resolution of activations because their goal is to extract a classification vector in which each entry represents the probability of a class. Classification networks also tend to increase the number of channels to exploit as many features as possible for accurate classification. Therefore, layers in classification networks have high-resolution activations with shallow channels in early layers, and low-resolution activations with deep channels in late layers, as illustrated in Figure 6 (a).

In contrast, segmentation networks such as UNet [33] need to restore the original activation resolution because their goal is to generate masks over target objects in the input image. However, segmentation networks still need to extract as many features as classification networks for high accuracy. Therefore, segmentation networks first follow the same trend as classification networks until the mid-layer. Afterward, segmentation networks reduce the number of channels and gradually restore the activation resolution using up-scaling methods such as transposed convolution (a.k.a. deconvolution or up-convolution). As a result, layer shapes in segmentation networks follow the trend illustrated in Figure 6 (b).

3.2.2 Layer Operation

We list DNN models for computer vision tasks related to AR/VR in Table 1. As shown in the layer operations column of Table 1, the layer operations in such models are diverse and heterogeneous. Each layer operation prefers different mappings and hardware [17], which makes such workloads challenging for monolithic accelerators.


Table 1: DNN models selected for case studies motivated by AR/VR workloads [40]. For works without a model name, we assign names to refer to them in the rest of the paper.

Task | Model | Layer Shape | Layer Operations
Image Classification | Resnet50 [11] | Classification | CONV2D, FC, Skip-Con.
Image Classification | MobileNetV2 [25] | Classification | CONV2D, DWCONV, Skip-Con.
Image Segmentation | UNet [33] | Segmentation | CONV2D, FC, TRCONV, Concat.
Depth Estimation | Focal Length DepthNet [12] | Segmentation | CONV2D, FC, UPCONV
Hand Pose Estimation | Br-Q HandposeNet [24] | Classification | CONV2D, FC


3.3 Mapping and Efficiency of Accelerators

Since mappings dictate the data reuse strategy and PE utilization, they significantly affect the efficiency of a DNN accelerator, with distinct preferences for layer shapes and operations [5, 17, 23].

We show examples of such cases in Figure 7, comparing two example accelerators based on the Shi-diannao [8] and NVDLA [26] mapping styles. These two accelerators take distinct approaches to computing the MAC operations in DNNs. As illustrated in Figure 7 (a), a Shi-diannao style accelerator parallelizes activation width and height using an output-stationary style mapping, which exploits convolutional reuse across kernels, while an NVDLA style accelerator parallelizes input and output channels using a weight-stationary style mapping, which exploits activation reuse across output channels. That is, Shi-diannao style employs an output-stationary mapping that maximizes output reuse, while NVDLA style employs a weight-stationary mapping that maximizes filter weight reuse. Such differences in mappings result in dramatically different compute unit utilization, as shown in Figure 7 (c). We use the three example operations presented in Figure 7 (c) to show the impact of mappings. Op1 and Op2 are CONV2D operations with the aspect ratios of early and late layers of classification networks introduced in Figure 6 (a), respectively. Op3 is a depth-wise CONV2D operation with the same layer size as Op1.

Given the parallelization strategies of each example accelerator and the layer sizes, we observe dramatically different PE utilization, as shown in Figure 7 (c). We use the MAESTRO [17] cost model for DNN accelerators to estimate latency and energy, and we compute the energy-delay product (EDP) as an indicator of overall efficiency, as shown in Figure 7 (b). Combining the differences in utilization and data reuse strategies, the two example accelerators yield dramatically different EDPs, which implies distinct preferences of the two example accelerators for the operations. In addition to the mapping utilization, each mapping style has dramatically different memory/network-on-chip (NoC) bandwidth requirements, buffer size requirements, and so on, which also vary with layer shape and operation to different degrees [17].
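The mapping utilizations annotated in Figure 7 (c) can be reproduced with a short calculation. The sketch below (a hypothetical helper, assuming the 16-PE example accelerators of Figure 7 and a single spatial wave) counts how many PEs receive work when a style's spatial dimensions are unrolled:

    def mapping_utilization(layer, spatial_dims, num_pes=16):
        # Utilization = PEs with computation mapped / total PEs (one spatial wave).
        work = 1
        for d in spatial_dims:
            work *= layer[d]
        return min(work, num_pes) / num_pes

    # Op1-like early layer: K=2 output channels, C=3 input channels, 4x4 output.
    op1 = {'K': 2, 'C': 3, 'Y': 4, 'X': 4}
    print(mapping_utilization(op1, ('Y', 'X')))  # Shi-diannao style: 16/16 = 1.0
    print(mapping_utilization(op1, ('K', 'C')))  # NVDLA style: 6/16 = 0.375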

Therefore, no single mapping style is good for all layers, and we need to optimize the mapping for each layer in the target workload to maximize the efficiency of an accelerator. However, when the target workload is heterogeneous, the common practice of optimizing the mapping for the average case of the workload can result in a consistently inefficient mapping for all layers in the workload, which is one of the major challenges for DNN acceleration in emerging applications with multiple DNN models.

3.4 Benefits of HDAs

To overcome the limitation of monolithic DNN accelerators implementing only one mapping style, we propose to design heterogeneous DNN accelerators (HDAs) that contain multiple sub-accelerators with distinct mapping styles within a single accelerator chip. HDAs have two potential benefits over monolithic accelerators.

Selective scheduling. Because each layer, differing in operation and shape, prefers a different mapping style and hardware, running each layer on its most preferred sub-accelerator in an HDA is an effective way to maximize overall efficiency.

Latency hiding via layer parallelism. Unlike most monolithic accelerators, which run one layer after another, HDAs can run multiple layers of different models on their sub-accelerators in parallel. By running multiple layers in parallel, a heterogeneous accelerator can overlap the latencies of multiple models, hiding latency across DNN models and reducing overall latency.

3.5 Challenges of HDAs

Although HDAs offer potential benefits from micro-specialization for each layer, naive HDA designs and schedulers can lead to inefficiency.

Reduced parallelism within a layer. Given the same total number of PEs, the sub-accelerators in an HDA each have fewer PEs than a monolithic accelerator, since hardware resources must be distributed (or partitioned) across sub-accelerators. Therefore, the maximum degree of parallelism each sub-accelerator can exploit for a layer can decrease compared to a monolithic or flexible accelerator. This can lead to higher energy consumption, since the amount of data reuse can also decrease (depending on the mapping) as the degree of parallelism decreases.

Hardware Resource Partitioning. Each sub-accelerator contains a smaller amount of hardware resources than a monolithic DNN accelerator with the same chip-level hardware budget, because the resources must be distributed across the sub-accelerators of an HDA. That is, if the hardware resource distribution is sub-optimal, the overall efficiency of the HDA chip degrades.

Scheduling and Memory Constraints. Because multiple sub-accelerators in an HDA share one global buffer, the scheduling of layers, which determines sub-accelerator-layer matching and execution ordering, becomes critical for overall efficiency; this problem did not exist in monolithic accelerators.

To address the challenges of HDAs and exploit their benefits, we propose Herald, an HDA optimization framework that consists of a design-time (hardware resource distribution optimization) and a compile-time (layer scheduling) optimization framework, performing (1) design- and compile-time co-optimization when a user designs a specialized HDA for a given workload, or (2) compile-time optimization only, after an HDA is deployed and the target workload changes. Exploiting these sub-components, Herald automates HDA design tailored to user-specified target models and outputs the estimated latency and energy of the co-optimized design. We discuss the details of Herald next.


Figure 7: The impact of mapping styles on efficiency. (a) Two example accelerators based on the NVDLA [26] and Shi-diannao [8] mapping styles (Shi-diannao parallelizes output rows (Y') and columns (X'); NVDLA parallelizes input channels (C) and output channels (K)). (b) The EDP, as an indicator of efficiency (lower is better), of the two example accelerators on the three example operations presented in (c). (c) Three example operations based on CONV2D and depth-wise CONV2D, and the mapping utilization of compute units on each example accelerator under its mapping style. We define mapping utilization as the number of PEs with computation mapped divided by the total number of PEs, to distinguish it from under-utilization caused by stalls at execution time due to insufficient network-on-chip (NoC) and memory bandwidth.


4. HERALD FRAMEWORK

4.1 Execution Model and Workloads

We target layer-granularity execution on each sub-accelerator of an HDA, because we observe that layers have significantly different mapping preferences [17, 23] and because more fine-grained scheduling incurs high control and scheduling overhead. We assume the following execution steps for accelerators in Herald.

1. Fetch filter weight values from DRAM and store them in a global buffer.

2. Distribute filter values to sub-accelerators based on the layer execution schedule.

3. Fetch activations from DRAM and store them in the global buffer.

4. Stream activation values to their corresponding sub-accelerators based on the layer execution schedule.

5. Store streamed-out output activations from each sub-accelerator to the global buffer.

6. While the sub-accelerators compute output activations, fetch the next filter values from DRAM and send them to the next accelerator (this assumes double-buffering).

7. When a sub-accelerator finishes executing a layer, stream the output activations stored in the global buffer as the input activations of the next layer.

8. Repeat the above steps until all layers of all models have been processed.

For steps 3 and 6, if the buffer size is insufficient to store the entire activation, activations are stored in DRAM and loaded in a tiled manner specified by the mapping of the target accelerator. When output activations are committed to the global buffer, Herald by default assumes a rearrange buffer that adjusts the data layout for the next layer if it runs on another sub-accelerator with a different mapping style. In our evaluation, we select mappings that share the same inner-loop order so that the data layout stays the same, which eliminates sub-accelerator context change overheads from differing data layouts. When data layout and miscellaneous context change overheads do exist, Herald also provides an option to specify their latency and energy penalties.

Figure 8: The impact of PE partitioning for the large accelerator listed in Table 3 with two sub-accelerators (ACC1: Shi-diannao style, ACC2: NVDLA style). We use evaluation workload A presented in Section 5. The left-most and right-most points represent the ACC1 and ACC2 monolithic designs, respectively.

4.2 Latency and Energy Estimation

We extend MAESTRO [17, 18] for latency and energy estimation of HDA designs with given schedules. MAESTRO is a validated cost model for monolithic DNN accelerators with arbitrary mappings, which reported 96.1% accuracy against RTL simulation [19] and against real processing times measured on a chip [6]. Although MAESTRO supports arbitrary mappings, it does not model a multi-DNN, multi-sub-accelerator environment, which is crucial for HDA evaluation. Therefore, we extend MAESTRO to support a heterogeneous multi-DNN accelerator environment. Herald models the memory requirement of the global buffer and the data movement between the global buffer and sub-accelerator buffers. The modeling follows the methodology proposed by MAESTRO, which identifies the amount of reuse and, based on it, computes activity counts (for energy) and communication/computation delay considering reuse (for latency). In addition to the same analytic equations, Herald considers the layer execution schedule generated by the scheduler we develop, discussed in Section 4.4, by modeling the non-synchronized execution of sub-accelerators (i.e., each sub-accelerator starts processing a layer as soon as its input data are available). To estimate the latency and energy of sub-accelerator runs, we use the original MAESTRO cost model.
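A minimal sketch of the non-synchronized execution model described above (hypothetical data structures, not Herald's actual implementation): each sub-accelerator advances on its own timeline, and a layer starts as soon as both its sub-accelerator and its producer layer's output are ready.

    def simulate_schedule(schedule, latency, producer):
        """schedule: list of (acc, layer) pairs in issue order per the scheduler.
        latency[(acc, layer)]: estimated latency of the layer on its sub-accelerator.
        producer[layer]: the layer's predecessor in the same model, or None.
        Returns completion time per layer; overall latency is the max value."""
        acc_free, done = {}, {}
        for acc, layer in schedule:
            data_ready = done.get(producer.get(layer), 0.0)   # input availability
            start = max(acc_free.get(acc, 0.0), data_ready)   # no global barrier
            done[layer] = start + latency[(acc, layer)]
            acc_free[acc] = done[layer]                       # acc busy until then
        return done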

4.3 Accelerator Design Space Exploration

Herald models accelerators using various hardware parameters, such as the number of PEs, network-on-chip (NoC) bandwidth, NoC latency, global memory size, global memory bandwidth, and so on. Herald exposes a unit-cost database in which users can update the unit area/energy costs of each hardware component, so they can easily evaluate HDAs under their own environment (technology node, etc.).

Herald's design space exploration (DSE) tool receives the total number of PEs, memory size, and memory/NoC bandwidth as inputs, which describe the overall hardware budget for an accelerator chip. Unlike monolithic DNN accelerators, which devote all of these resources to a single accelerator substrate, HDAs need to distribute the resources among sub-accelerators. However, evenly distributing the resources (i.e., naive HW resource partitioning) does not yield the best HDAs, because each sub-accelerator's mapping style has a different optimal balance between the number of PEs, memory size, and bandwidth requirements, as discussed in Section 3.3. Therefore, Herald explores the resource partitioning space for each resource type, which constructs a nested combinatorial optimization problem, or a nested resource partitioning problem.

We implement a combinatorial optimization framework on top of an analytic cost model for DNN accelerators, MAESTRO [17], which estimates the latency and energy for a given layer, mapping, and hardware design parameters (number of PEs, NoC bandwidth, memory size, etc.). In Figure 8, we show an example design space from PE partitioning over two sub-accelerators in a 16K-PE HDA, fixing the other design parameters and assuming naive bandwidth partitioning (128/128 GB/s). We can observe that the design space has irregular cost variations, so even partitioning (8K/8K PEs) does not yield the most efficient HDA.

Based on user-specified framework options, Herald's design space exploration (DSE) tool performs an exhaustive search, a binary sampling-based search, or a random search. The binary sampling-based search first evaluates 2^n design points at a regular interval, with user-specified parameter n. Afterward, Herald selects the interval between two adjacent evaluated design points with the lowest average cost and performs an exhaustive search over the selected interval. The random search follows a similar approach to the binary sampling-based search but selects random pre-evaluated design points.
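A minimal sketch of the binary sampling-based search over one resource dimension (the cost callback is a hypothetical stand-in for a cost-model query; the actual DSE evaluates full HDA design points):

    def binary_sampling_search(cost, lo, hi, n=4):
        """Evaluate ~2**n design points at a regular interval over [lo, hi],
        pick the interval whose endpoint samples have the lowest average cost,
        then search that interval exhaustively."""
        num = 2 ** n
        step = max((hi - lo) // (num - 1), 1)
        samples = list(range(lo, hi + 1, step))
        costs = [cost(p) for p in samples]
        best = min(range(len(samples) - 1),
                   key=lambda i: (costs[i] + costs[i + 1]) / 2)
        lo2, hi2 = samples[best], samples[best + 1]
        return min(range(lo2, hi2 + 1), key=cost)   # exhaustive within interval

    # Example: split 16384 PEs between two sub-accelerators, where edp_of(p)
    # would query the cost model for the (p, 16384 - p) partition:
    # best_p = binary_sampling_search(edp_of, 0, 16384)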

4.4 Layer Scheduling

The goal of scheduling in Herald is to minimize the energy consumption and latency of an HDA by exploiting each layer's preferences among accelerators. However, the layer scheduling space is massive. For example, 2.54×10^21 possible layer execution schedules exist for AR/VR workload A in Table 2, even if we consider only permutations of the layers on a single accelerator. To deal with such a large search space, we develop a heuristic that exploits the characteristics of DNN workloads to reduce the scheduling overhead. The two major characteristics we exploit concern the dependences among layers: layers mostly form a linear dependence chain within a model, and they are independent across models. In Figure 9, we present an overview of the layer scheduling flow, which consists of three processes: layer assignment to each sub-accelerator, layer ordering within each sub-accelerator, and re-ordering as post-processing. Herald implements a two-phase scheduler to perform the three processes.

Step 1: Layer Assignment and Ordering. We describe

the layer assignment and ordering algorithm in Figure 10. The algorithm iterates over the frontier layers (i.e., the layers to be executed first, based on the dependences within each model in the workload) until it schedules all layers. For each frontier layer, the algorithm first queries the cost of execution on each sub-accelerator and identifies the best-fit sub-accelerator. Figure 10 uses energy-delay product (EDP) as the metric for illustration purposes, while Herald supports various metrics (e.g., latency, energy, and custom user-defined metrics).

Once the best-fit sub-accelerator is identified, the scheduler checks (1) dependence, (2) memory, and (3) load-balancing conditions. The dependence check tests whether the execution of the previous layer of the layer to be scheduled is complete. The memory check tests whether scheduling the layer requires more memory than is available at the scheduling time, considering previously scheduled layers. The load-balancing check tests whether scheduling the layer results in extreme load imbalance, as defined by a user parameter, the load-balancing factor, which bounds the ratio of the latencies of the slowest- and fastest-completing sub-accelerators.


Figure 9: An overview of the three processes to schedule layers on HDAs, and the boundary of the two-phase scheduler implementation of Herald. Circled numbers represent layers in each model.

Inputs:
- A list of hardware parameters of sub-accelerators (Accs)
- A list of DNN models to run, sorted in dependence order (MD)
- Load-balancing factor (LbF)
Outputs:
- A list of (schedule time, layer ID, model) (Schedule)
- A list of completion times for each sub-accelerator (Tot_Latency_Acc)

cycle = 0;
while MD.notEmpty do
  assigned_a_layer = false;
  for model in MD do
    layer = model.head;
    // Get EDP/latency for the layer on each sub-accelerator
    (EDP, Latency) = MAESTRO_Herald.query(layer, Schedule, Accs);
    best_fit_acc = getAccIndex(min(EDP));
    // Check dependence, memory size, and load-balancing conditions
    dependence_cond = is_prev_layer_complete(Schedule, model, cycle);
    mem_size_cond = MemorySize(cycle, Schedule) + cost.getMemSizeReq < MemorySize;
    load_balance_cond = max(Tot_Latency_Acc)
                        < LbF * (Tot_Latency_Acc[best_fit_acc] + Latency[best_fit_acc]);
    if dependence_cond and mem_size_cond then
      if load_balance_cond then
        Tot_Latency_Acc[best_fit_acc] += Latency[best_fit_acc];
        // Assign the layer
        PopLayer(MD, layer);
        assigned_a_layer = true;
      else
        // Try the second-, third-, ... best-fit accelerator for load-balancing
      end if
    end if
    if assigned_a_layer then
      rearrange(MD); // Re-order models per the layer ordering strategy
                     // (depth-first, breadth-first, etc.) selected by the user
      break;
    end if
  end
  cycle = nextLayerCompletionTime(Schedule); // Failed to schedule; defer execution
end

Figure 10: Layer assignment and ordering algorithm.

For example, if the load-balancing factor is two, the slowest sub-accelerator must complete all of its scheduled layers within 2 × Latency_fastest_acc, where Latency_fastest_acc is the latency of the sub-accelerator that completes earliest.
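Written out, the check from Figure 10 looks like the snippet below (a sketch; tot_latency holds each sub-accelerator's running completion time):

    def load_balance_ok(tot_latency, acc, layer_latency, lbf):
        # Passes if, after assigning the layer to `acc`, the slowest
        # sub-accelerator still finishes within lbf x acc's completion time.
        return max(tot_latency.values()) < lbf * (tot_latency[acc] + layer_latency)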

If only the load-balancing check fails, the scheduler tries the second-best-fit sub-accelerator, and so on, until it finds an alternative that meets all the conditions. If none is found, the scheduler advances the scheduling cycle to the completion time of a layer on one of the sub-accelerators, tracked by Tot_Latency_Acc in Figure 10, which represents the completion time of each sub-accelerator. If all conditions are met, the scheduler assigns the layer to the best-fit sub-accelerator and removes the scheduled layer from the corresponding model in the model list (MD in Figure 10). The reference cycle is then incremented in the same way as in the scheduling-failure case, and the scheduler searches for another schedulable layer.

Before moving on to the next layer, the scheduler changes the model to be scheduled next according to the layer-ordering strategy specified by the user. Herald supports depth-first and breadth-first ordering. Depth-first ordering schedules all the layers within a model and then moves on to the next model. Breadth-first ordering schedules one layer from each model and repeats this until all layers are scheduled, as sketched below.
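A sketch of the two ordering strategies as Python generators (a hypothetical representation in which each model is a list of its layers in dependence order):

    def depth_first(models):
        # Emit every layer of one model before moving to the next model.
        for model in models:
            for layer in model:
                yield layer

    def breadth_first(models):
        # Emit one layer per model per round until all layers are emitted.
        queues = [list(m) for m in models]
        while any(queues):
            for q in queues:
                if q:
                    yield q.pop(0)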

Figure 11: Example timelines from different layer ordering methods ((a) depth-first layer ordering, (b) breadth-first layer ordering, and (c) layer ordering post-processing) on two accelerators and two DNN models. Numbers in each box represent the layer number.

Such ordering is possible without violating dependences because DNN models mostly have a linear dependence chain of layers. Even when branches exist, as in Inception [37], the layers within each branch are still in a linear dependence chain. However, these two ordering methods mainly serve to reduce the complexity of scheduling and do not guarantee the optimality of the resulting schedule. Therefore, after constructing an initial schedule, we perform post-processing to fix inefficiencies caused by the simple layer ordering methods.

Step 2: Post-processing. Once an initial schedule is generated by the first step, the scheduler in Herald further optimizes the schedule by exploring layer execution re-ordering within each sub-accelerator, which fills the redundant idle time (i.e., gaps) caused by the layer ordering strategy chosen in Step 1. We describe an example in Figure 11. Figure 11 (a) and (b) show example initial schedules generated by Step 1 applying depth-first and breadth-first layer ordering, respectively. As we can observe in the examples, idle time exists in both, and some of it is redundant. For example, in Figure 11 (b), layers 2, 3, and 4 from model B can be scheduled earlier, as shown in Figure 11 (c), provided that doing so does not violate the memory condition.

We describe the post-processing algorithm in Figure 12, which eliminates redundant idle time. The post-processing algorithm looks ahead from the completion time of each scheduled layer, searching for other schedulable layers within the idle time. The algorithm performs checks similar to those of the Step 1 scheduler (dependence, memory, and load-balancing). In addition to the three checks, the algorithm also checks whether the idle time (i.e., a gap in the initial schedule) is sufficient for fetching a layer from later in the schedule to the beginning of the idle time.


Inputs:
- A list of hardware parameters of sub-accelerators (Accs)
- A list of (schedule time, layer ID, model) (Schedule)
- Look-ahead depth (LA)
Outputs:
- An updated schedule (Schedule)

for acc in Accs do
  for baseLayerIdx in NumLayers(Schedule[acc]) do
    look_ahead = 1;
    while look_ahead < LA do
      prev_completion_time = Schedule[acc][baseLayerIdx].completion_time;
      test_layer = Schedule[acc][baseLayerIdx + look_ahead];
      // Test dependence, memory, load-balancing, and schedule overlap
      if is_schedulable(test_layer, prev_completion_time, Schedule) then
        // Re-order the test layer into the gap
        UpdateSchedule(Schedule, test_layer, prev_completion_time);
      end if
      look_ahead += 1;
    end
  end
end

Figure 12: Post-processing algorithm that minimizes the idle time.

Table 2: Heterogeneous DNN workloads used for the evaluation. We model AR/VR workloads using the models listed in Table 1. We also evaluate computer-vision networks in MLPerf. The number of batches models different target frame rates for each sub-task.

Workload | Model | # of batches
AR/VR-A | Resnet50 | 2
AR/VR-A | Mask-RCNN | 4
AR/VR-A | MobileNetV2 | 4
AR/VR-B | Resnet50 | 2
AR/VR-B | UNet | 2
AR/VR-B | MobileNetV2 | 4
AR/VR-B | BR-Q Handpose | 2
AR/VR-B | Focal Length DepthNet | 2
MLPerf-CV | Resnet50 | 1
MLPerf-CV | MobileNetV1 | 1
MLPerf-CV | SSD-Resnet34 | 1
MLPerf-CV | SSD-MobileNetV1 | 1


5. CASE STUDIES

To show the potential of HDAs, we perform case studies on HDA designs and layer execution schedules generated by Herald, using the heterogeneous workloads listed in Table 2.

5.1 Case Study Settings

Workloads. Based on the AR/VR-motivated DNN models listed in Table 1, we model AR/VR workloads as listed in Table 2. For each DNN model, we assign a different number of batches to model a different target processing rate for each sub-task. In addition to the AR/VR workloads, we also evaluate a multi-stream MLPerf inference workload related to computer vision, in line with the AR/VR motivation.

Mapping. Although Herald can handle an arbitrary number of mapping styles in sub-accelerators, we combine two or three distinct mapping styles from recent DNN accelerators (Shi-diannao [8], NVDLA [26], and Eyeriss [5]). The mapping styles were selected for their distinct parallel unrolling (or partitioning) strategies, to maximize synergy among them. For example, Shi-diannao parallelizes output rows and columns over PEs, while NVDLA parallelizes input and output channels, giving the two styles significantly different preferences for layer shapes. Also, the Shi-diannao style can run depth-wise convolution more efficiently than NVDLA; NVDLA is optimized for channel-wise accumulation, but depth-wise convolution has no channel-wise accumulation.

Table 3: Three hardware parameter settings we use for the evaluation. For heterogeneous accelerators, each setting indicates the total amount of hardware resources to be partitioned among sub-accelerators.

Acc. ID | Num. of PEs | NoC BW | Glob. Memory
Small (Edge) | 1024 | 16 GB/s | 4 MiB
Medium (Mobile) | 4096 | 64 GB/s | 8 MiB
Large (Cloud) | 16384 | 256 GB/s | 16 MiB

Figure 13: The impact of workload and scheduling on the EDP of large and medium HDAs. (a) The average latency, energy, and EDP improvements of HDAs compared to the best monolithic accelerator for each workload. (b) The average EDP improvements over the best monolithic design using the baseline scheduler and Herald's scheduler. (c) The impact of workload change on designs optimized for workload A, designs optimized for workload B, and the monolithic design.

However, NVDLA provides higher efficiency for fully-connected (FC) layers, because FC layers are equivalent to CONV2D layers with only one output activation per input and output channel. That is, combining these two mapping styles in an HDA and running each layer on an appropriate sub-accelerator can provide latency and energy benefits.

Cost estimation. As discussed in Section 4.2, we extend MAESTRO for latency and energy estimation.

Accelerators. Based on previously proposed cloud and mobile accelerators, Cloud TPU [15] and Qualcomm Hexagon [29], we select three accelerator settings with 1K, 4K, and 16K PEs, as described in Table 3, and scale the network-on-chip (NoC) bandwidth and global memory correspondingly. We estimate the extra energy costs for the reconfigurability of MAERI based on its open-source RTL, by running a CAD tool chain with a 28nm library, as was done for MAESTRO's reference hardware cost model. We also model homogeneous multi-sub-accelerator chips [36] with evenly partitioned hardware resources and Herald's scheduler.

Schedulers. We apply the scheduling algorithm discussed in Section 4.4 in Herald. We compare the EDP of the heterogeneous accelerator designs with the best EDP for each experiment under a baseline scheduler and under Herald's scheduler. The baseline scheduler performs EDP-greedy layer selection and the depth-first layer ordering discussed in Section 4.4.

5.2 Results

Based on the observations from the data we collected, we highlight the results that provide useful insights.
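Throughout the results, improvements are reported in terms of the energy-delay product. Assuming the standard definition, consistent with the J x s units in Figure 14, and reading the reported percentages as relative reductions over the best monolithic baseline:

\mathrm{EDP} = E_{\text{total}} \times L_{\text{total}}\ [\mathrm{J \cdot s}], \qquad
\text{Improvement} = \Big(1 - \frac{\mathrm{EDP}_{\text{design}}}{\mathrm{EDP}_{\text{best mono}}}\Big) \times 100\%.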


[Figure 14 plots: latency (sec) vs. energy (mJ) design spaces for (a)-(c) AR/VR-A, (d)-(f) AR/VR-B, and (g)-(i) MLPerf-CV across the small, medium, and large accelerator settings; each plot marks the Shi-diannao-, NVDLA-, and RS-style monolithic design points. Legend: monolithic accelerators, DLA-Shi-RS HDA, DLA-RS HDA, DLA-Shi HDA, Shi-RS HDA, flexible accelerator, homogeneous multi-DNN designs, and the Pareto curve.]

Figure 14: Design space of two- and three-way heterogeneous DNN accelerators on the workloads listed in Table 2. Each point represents a design point (i.e., a HW partitioning choice) with an optimized schedule for that design point. We label each monolithic design point in each plot.

Table 4: Hardware resource partitioning optimization results for two-way HDAs based on NVDLA and Shi-diannao style accelerators. (NVDLA / Shi-diannao)

Setting              BW Partitioning   PE Partitioning
AR/VR-A, Small       4 / 12            128 / 896
AR/VR-A, Medium      40 / 24           1792 / 2304
AR/VR-A, Large       224 / 32          9728 / 6656
AR/VR-B, Small       4 / 12            128 / 896
AR/VR-B, Medium      48 / 16           1536 / 2560
AR/VR-B, Large       128 / 128         12032 / 4352
MLPerf-CV, Small     4 / 12            64 / 960
MLPerf-CV, Medium    32 / 32           1280 / 2816
MLPerf-CV, Large     160 / 96          8192 / 8192

5.2.1 Costs and Benefits of HDA

We estimated the latency and energy of HDAs in Figure 14. We observe that well-optimized HDA points are always on the Pareto curve, while monolithic designs are not. The flexible accelerator was on the Pareto curve for AR/VR-A (small and medium accelerators) and MLPerf-CV (medium accelerator). On average, compared to the best monolithic design with the lowest EDP, the best heterogeneous design provided 56.0% EDP improvements across all the case studies in Figure 14.

Optimal HW Resource Partitioning. As we observed in Figure 8, naive (or even) partitioning of hardware resources often leads to sub-optimal HDAs. The case study results of optimal hardware partitioning show the same: in Table 4, we list the hardware resource partitioning of the design points with the best EDP for two-way HDAs based on the NVDLA and Shi-diannao style mappings. We can observe that the optimal hardware partitioning is not trivial, which necessitates a systematic approach like Herald. This is because more active PEs require more bandwidth, and the number of active PEs is a complex, high-dimensional function of layer operation, layer size, number of PEs, mapping, and so on [17]. Also, we observe that homogeneous multi-DNN design points often provide similar latency and energy, as shown in Figure 14. In such cases, we found that the optimization framework assigned minimum hardware resources to all sub-accelerators except one and tried to maximize the gain from the single large sub-accelerator.
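A simplified view of the co-optimization this observation implies is an exhaustive sweep over partition points with scheduling nested inside. The sketch below is our own illustration: schedule_and_cost() is a hypothetical, toy stand-in for running Herald's scheduler and the MAESTRO cost model on one design point, and the step sizes are chosen to match the granularity visible in Table 4.

# Two-way hardware-partitioning sweep with scheduling nested inside.
def schedule_and_cost(partition):
    # Toy stand-in for Herald's scheduler + MAESTRO cost model:
    # latency falls with the stronger sub-accelerator, energy grows
    # with total resources (illustrative only, not a real model).
    (pes_a, bw_a), (pes_b, bw_b) = partition
    latency = 1.0 / max(pes_a * bw_a, pes_b * bw_b) ** 0.5
    energy = 1e-3 * (pes_a + pes_b + bw_a + bw_b)
    return latency, energy

def best_two_way_partition(total_pes, total_bw, pe_step=64, bw_step=4):
    """Return the (EDP, partition) with the lowest EDP over the sweep."""
    best = None
    for pes_a in range(pe_step, total_pes, pe_step):
        for bw_a in range(bw_step, total_bw, bw_step):
            part = ((pes_a, bw_a), (total_pes - pes_a, total_bw - bw_a))
            latency, energy = schedule_and_cost(part)
            edp = latency * energy
            if best is None or edp < best[0]:
                best = (edp, part)
    return best

print(best_two_way_partition(1024, 16))  # small-accelerator budget (Table 3)

Even this toy sweep over the small-accelerator budget visits 15 x 3 = 45 design points, each requiring a full scheduling pass, which is why the per-design-point scheduling time reported in Table 5 matters.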

[Figure 15 plots: latency (sec) vs. energy (mJ) design spaces, each marking the Shi-diannao-, NVDLA-, and RS-style monolithic design points. Legend: heterogeneous design points, monolithic design points.]

Figure 15: Design space of single-DNN use cases on (a) UNet and (b) Resnet50 based on the large accelerator setting in Table 3.

Impact of workloads. Each row in Figure 14 shows the latency-energy space of the monolithic designs, two- and three-way HDAs, and flexible accelerators we evaluate. As expected, the design space depends on the workload. In particular, we observe that a workload with more heterogeneity and more layers, like AR/VR-B, is friendlier to HDAs: it provides 86.8% latency and 6.61% energy improvements over the best monolithic accelerator for each case study in Figure 14, compared to 63.26% latency and 4.05% energy improvements for AR/VR-A and 3.75% latency and 8.13% energy improvements for MLPerf-CV.

Single-DNN Case. Even for a single DNN, HDAs can still exploit layer parallelism and heterogeneity within a model by batch-processing the workload. We run UNet and Resnet50 with a batch size of four on the large accelerator setting. We observe that the best monolithic accelerator design is on the Pareto curve, unlike for the compound workloads we target. However, HDA designs still provide latency and energy benefits over monolithic designs. For the UNet and Resnet50 workloads, the best HDA provided 26.4% and 48.1% EDP improvements over the best monolithic design. Compared to the flexible accelerator, HDAs provided 11.7% and 15.8% lower energy for UNet and Resnet50, respectively. For latency, the flexible accelerator provided 22.5% and 29.0% lower latency than HDAs for UNet and Resnet50, respectively.

Impact of scheduling algorithm. We evaluate the EDP of HDAs with a baseline scheduler and with Herald's scheduler.


Table 5: Average time required for scheduling for each hardware design point (i.e., HW partitioning choice).

Workload     Layers   Sub-accs   Time per HW DP (s)
AR/VR-A      448      2          2.89
AR/VR-A      448      3          4.32
AR/VR-B      618      2          3.98
AR/VR-B      618      3          10.74
MLPerf-CV    168      2          1.17
MLPerf-CV    168      3          1.67

For the baseline, we implement a scheduler with the EDP-greedy layer assignment and depth-first layer ordering discussed in Section 4.4. In Figure 13 (b), we show the average EDP improvements of the best HDA design points for each experiment configuration under the baseline and Herald's schedulers. Across all the experiment settings, Herald's scheduler identified efficient designs with 24.9% average EDP improvements, while the baseline scheduler provided 13.6% EDP improvements on average. In addition, on the small accelerator with workload A, the EDP optimized with the baseline scheduler was worse than that of the best monolithic baseline, while Herald's scheduler identified an optimized schedule that improves EDP over the baseline.

Comparison against flexible accelerators. We evaluate a MAERI [19] style flexible accelerator using the same hardware parameters as the large accelerator setting, and plot its design point in Figure 14. Compared to the best HDA design points for each evaluation, the flexible accelerator designs provided 22.9%, 21.5%, and 24.0% lower latency for the AR/VR-A, AR/VR-B, and MLPerf-CV workloads, respectively. However, the flexible accelerator designs required 18.7%, 15.5%, and 18.9% more energy for each workload, respectively. The extra energy cost of the flexible accelerator comes from the extra hardware components for reconfigurability. In contrast, an HDA can keep its sub-accelerators relatively simple compared to flexible accelerators, which leads to the energy savings we report.

Results in Figure 14 show that both heterogeneous and flexible accelerators are Pareto-optimal: HDAs are beneficial for energy, while flexible accelerators are beneficial for latency. The amount of latency and energy benefit varies with the workload. Therefore, the choice between flexible and heterogeneous accelerators depends on the performance goal, energy constraints, and the target workload.

Impact of workload change. Since DNN models evolve and applications change their inner implementations accordingly, a workload change can occur after the deployment of an HDA. Figure 13 (c) shows a case study of such a scenario; it represents the use of Herald as a compile-time optimizer (i.e., scheduler). We observe that the large accelerator is less sensitive to the workload change, still showing benefits over the best monolithic design with only a 14.1% average EDP increase compared to HDAs originally optimized for the new workloads. However, when the amount of overall hardware resources is smaller (e.g., the medium accelerator), we observe a 26.2% average EDP increase compared to HDAs originally optimized for the new workloads. The results imply that heterogeneous DNN accelerators benefit more from larger hardware budgets.

Execution time of Herald. Although the scheduling in Herald is designed to be offline, the scheduler is lightweight. We ran Herald on a laptop with an i9-9880H processor and 16GB of memory and present the time required for scheduling on each hardware design point in Table 5, since the overall runtime heavily depends on user parameters (e.g., search granularity, strategy, etc.). On average, the scheduling requires 9.48 ms per layer per HDA design point.
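As a sanity check on that per-layer figure, dividing each scheduling time in Table 5 by its layer count and averaging over the six rows reproduces it; the short computation below uses only the table's values.

# Per-layer scheduling time derived from Table 5: (layers, seconds).
rows = [(448, 2.89), (448, 4.32), (618, 3.98),
        (618, 10.74), (168, 1.17), (168, 1.67)]
per_layer_ms = [1000.0 * t / n for n, t in rows]
print([round(x, 2) for x in per_layer_ms])
# [6.45, 9.64, 6.44, 17.38, 6.96, 9.94]
print(round(sum(per_layer_ms) / len(per_layer_ms), 2))
# ~9.47 ms per layer, matching the reported 9.48 ms up to rounding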

Summary. We summarize our main observations below:

• The design space of HDAs is not trivial, which requires systematic co-optimization of hardware resource partitioning and layer execution schedules.

• HDAs outperform or match monolithic and flexible accelerators, as the Pareto curves in Figure 14 show.

• HDAs and flexible accelerators are beneficial for energy efficiency and latency, respectively, while maintaining similar overall EDP (often both are on the Pareto curve in the latency-energy space).

• A simple combination of homogeneous sub-accelerators does not provide Pareto-optimal design points.

6. RELATED WORKS

DNN Dataflows and Accelerators. Shi-diannao [8] is a CNN accelerator designed to be embedded near sensors, which exploits convolutional reuse via an output-stationary style dataflow. Eyeriss [6] is one of the state-of-the-art low-energy DNN accelerators, which introduced a dataflow taxonomy and a new dataflow style, row-stationary. The fused-layer CNN accelerator [2] exploited fine-grained pipelined layer parallelism that minimizes activation data movement across the memory hierarchy. Flexflow [23] is a DNN accelerator that supports three distinct dataflow styles, with an analytic model used to identify the best dataflow based on PE utilization. The Tensor Processing Unit (TPU) [15] is a systolic-array-based DNN accelerator designed for cloud workloads in data centers. MAERI [19] is a flexible-dataflow DNN accelerator that also efficiently supports irregular mappings resulting from sparsity, cross-layer mapping [2], and so on. Tangram [9] is a DNN accelerator that explored pipelined layer parallelism within a model, with a dataflow optimized for such back-to-back layer execution. Interstellar [41] presented the importance of loop blocking (tile sizing) in DNN accelerators, utilizing Halide [31]. Shen et al. explored the use of homogeneous multi-sub-accelerators, termed convolutional layer processors, for CNN processing on FPGAs [36].

Heterogeneous Accelerators. Chandramoorthy et al. [4] explored an accelerator-rich chip multiprocessor that includes various accelerators for different tasks in image processing. Although the work included a convolution module among its sub-accelerators, the module provides only one dataflow style, focusing on general image kernels, not DNNs. Master of None Acceleration [22] explored a heterogeneous accelerator for analytical query processing and presented that the design space of heterogeneous accelerators for the target domain has both beneficial and disadvantageous design points.

7. CONCLUSION

In this paper, we explored the latency and energy optimization opportunities of heterogeneous DNN accelerators on recent real-time applications with multiple sub-tasks based on DNNs.


Because the efficiency of a DNN accelerator depends on the mapping, workload, and hardware design parameters at the same time, identifying the best heterogeneous DNN accelerator design point with an optimized schedule is challenging. Therefore, we developed Herald, an automated design space exploration and layer scheduling framework for heterogeneous DNN accelerators. In our case studies, Herald identified optimized design points and layer schedules, providing 56.0% EDP benefits compared to the best monolithic designs we compare against. Herald showed that the most efficient design points have non-trivial hardware resource partitionings and that a naive scheduler can result in EDP degradation, motivating a systematic approach like Herald.

8. REFERENCES

[1] M. Abrash, "Inventing the future," https://www.oculus.com/blog/inventing-the-future, 2017.

[2] M. Alwani, H. Chen, M. Ferdman, and P. Milder, "Fused-layer CNN accelerators," in The 49th Annual IEEE/ACM International Symposium on Microarchitecture. IEEE Press, 2016, p. 22.

[3] Apple, "The future is here - iPhone X (neural engine)," https://www.apple.com/newsroom/2017/09/the-future-is-here-iphone-x/, 2017.

[4] N. Chandramoorthy, G. Tagliavini, K. Irick, A. Pullini, S. Advani, S. Al Habsi, M. Cotter, J. Sampson, V. Narayanan, and L. Benini, "Exploring architectural heterogeneity in intelligent vision systems," in 2015 IEEE 21st International Symposium on High Performance Computer Architecture (HPCA). IEEE, 2015, pp. 1–12.

[5] Y.-H. Chen, J. Emer, and V. Sze, "Eyeriss: A spatial architecture for energy-efficient dataflow for convolutional neural networks," in International Symposium on Computer Architecture (ISCA), 2016.

[6] Y.-H. Chen, T. Krishna, J. S. Emer, and V. Sze, "Eyeriss: An energy-efficient reconfigurable accelerator for deep convolutional neural networks," IEEE Journal of Solid-State Circuits, vol. 52, no. 1, pp. 127–138, 2016.

[7] Y.-H. Chen, T.-J. Yang, J. Emer, and V. Sze, "Eyeriss v2: A flexible accelerator for emerging deep neural networks on mobile devices," IEEE Journal on Emerging and Selected Topics in Circuits and Systems, vol. 9, no. 2, pp. 292–308, 2019.

[8] Z. Du, R. Fasthuber, T. Chen, P. Ienne, L. Li, T. Luo, X. Feng, Y. Chen, and O. Temam, "ShiDianNao: Shifting vision processing closer to the sensor," in International Symposium on Computer Architecture (ISCA), 2015.

[9] M. Gao, X. Yang, J. Pu, M. Horowitz, and C. Kozyrakis, "Tangram: Optimized coarse-grained dataflow for scalable NN accelerators," in Proceedings of the Twenty-Fourth International Conference on Architectural Support for Programming Languages and Operating Systems. ACM, 2019, pp. 807–820.

[10] K. Hazelwood, S. Bird, D. Brooks, S. Chintala, U. Diril, D. Dzhulgakov, M. Fawzy, B. Jia, Y. Jia, A. Kalro et al., "Applied machine learning at Facebook: A datacenter infrastructure perspective," in 2018 IEEE International Symposium on High Performance Computer Architecture (HPCA). IEEE, 2018, pp. 620–629.

[11] K. He, X. Zhang, S. Ren, and J. Sun, "Deep residual learning for image recognition," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 770–778.

[12] L. He, G. Wang, and Z. Hu, "Learning depth from single images with deep neural network embedding focal length," IEEE Transactions on Image Processing, vol. 27, no. 9, pp. 4676–4689, 2018.

[13] Huawei, "HiAI," https://consumer.huawei.com/en/campaign/kirin-990-series/, 2019.

[14] A. Ignatov, R. Timofte, W. Chou, K. Wang, M. Wu, T. Hartley, and L. Van Gool, "AI benchmark: Running deep neural networks on Android smartphones," in Proceedings of the European Conference on Computer Vision (ECCV), 2018.

[15] N. P. Jouppi, C. Young, N. Patil, D. Patterson, G. Agrawal, R. Bajwa, S. Bates, S. Bhatia, N. Boden, A. Borchers et al., "In-datacenter performance analysis of a tensor processing unit," in International Symposium on Computer Architecture (ISCA). IEEE, 2017, pp. 1–12.

[16] S. Kala, B. R. Jose, J. Mathew, and S. Nalesh, "High-performance CNN accelerator on FPGA using unified Winograd-GEMM architecture," IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol. 27, no. 12, pp. 2816–2828, 2019.

[17] H. Kwon, P. Chatarasi, M. Pellauer, A. Parashar, V. Sarkar, and T. Krishna, "Understanding reuse, performance, and hardware cost of DNN dataflow: A data-centric approach," in Proceedings of the 52nd Annual IEEE/ACM International Symposium on Microarchitecture, 2019, pp. 754–768.

[18] H. Kwon, P. Chatarasi, V. Sarkar, T. Krishna, M. Pellauer, and A. Parashar, "MAESTRO: A data-centric approach to understand reuse, performance, and hardware cost of DNN mappings," IEEE Micro, vol. 40, no. 3, pp. 20–29, 2020.

[19] H. Kwon, A. Samajdar, and T. Krishna, "MAERI: Enabling flexible dataflow mapping over DNN accelerators via reconfigurable interconnects," in International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), 2018, pp. 461–475.

[20] S. Lee, S. W. Oh, D. Won, and S. J. Kim, "Copy-and-paste networks for deep video inpainting," in Proceedings of the IEEE International Conference on Computer Vision, 2019, pp. 4413–4421.

[21] J. Li, Y. Chi, and J. Cong, "HeteroHalide: From image processing DSL to efficient FPGA acceleration," in The 2020 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays, 2020, pp. 51–57.

[22] A. Lottarini, J. P. Cerqueira, T. J. Repetti, S. A. Edwards, K. A. Ross, M. Seok, and M. A. Kim, "Master of none acceleration: A comparison of accelerator architectures for analytical query processing," in Proceedings of the 46th International Symposium on Computer Architecture. ACM, 2019, pp. 762–773.

[23] W. Lu, G. Yan, J. Li, S. Gong, Y. Han, and X. Li, "FlexFlow: A flexible dataflow accelerator architecture for convolutional neural networks," in 2017 IEEE International Symposium on High Performance Computer Architecture (HPCA). IEEE, 2017, pp. 553–564.

[24] M. Madadi, S. Escalera, X. Baró, and J. Gonzalez, "End-to-end global to local CNN learning for hand pose recovery in depth data," arXiv preprint arXiv:1705.09606, 2017.

[25] M. Sandler, A. Howard, M. Zhu, A. Zhmoginov, and L.-C. Chen, "MobileNetV2: Inverted residuals and linear bottlenecks," arXiv preprint arXiv:1801.04381, 2019.

[26] NVIDIA, "NVDLA deep learning accelerator," http://nvdla.org, 2017.

[27] A. Parashar, M. Rhu, A. Mukkara, A. Puglielli, R. Venkatesan, B. Khailany, J. Emer, S. W. Keckler, and W. J. Dally, "SCNN: An accelerator for compressed-sparse convolutional neural networks," ACM SIGARCH Computer Architecture News, vol. 45, no. 2, pp. 27–40, 2017.

[28] S.-J. Park, H. Son, S. Cho, K.-S. Hong, and S. Lee, "SRFeat: Single image super-resolution with feature discrimination," in Proceedings of the European Conference on Computer Vision (ECCV), 2018, pp. 439–455.

[29] Qualcomm, "Qualcomm Hexagon 680," https://www.hotchips.org/wp-content/uploads/hc_archives/hc27/HC27.24-Monday-Epub/HC27.24.20-Multimedia-Epub/HC27.24.211-Hexagon680-Codrescu-Qualcomm.pdf, 2015.

[30] S. Rabii, E. Beigne, V. Chandra, B. D. Salvo, R. Ho, and R. Pendse, "Computational directions for augmented reality systems," in 2019 IEEE Symposium on VLSI Circuits, plenary, 2019.

[31] J. Ragan-Kelley, C. Barnes, A. Adams, S. Paris, F. Durand, and S. Amarasinghe, "Halide: A language and compiler for optimizing parallelism, locality, and recomputation in image processing pipelines," ACM SIGPLAN Notices, vol. 48, no. 6, pp. 519–530, 2013.

[32] V. Ramanishka, Y.-T. Chen, T. Misu, and K. Saenko, "Toward driving scene understanding: A dataset for learning driver behavior and causal reasoning," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 7699–7707.

[33] O. Ronneberger, P. Fischer, and T. Brox, "U-Net: Convolutional networks for biomedical image segmentation," in International Conference on Medical Image Computing and Computer-Assisted Intervention. Springer, 2015, pp. 234–241.

[34] Y. S. Shao, J. Clemons, R. Venkatesan, B. Zimmer, M. Fojtik, N. Jiang, B. Keller, A. Klinefelter, N. Pinckney, P. Raina et al., "Simba: Scaling deep-learning inference with multi-chip-module-based architecture," in Proceedings of the 52nd Annual IEEE/ACM International Symposium on Microarchitecture, 2019, pp. 14–27.

[35] H. Sharma, J. Park, D. Mahajan, E. Amaro, J. K. Kim, C. Shao, A. Mishra, and H. Esmaeilzadeh, "From high-level deep neural models to FPGAs," in 2016 49th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO). IEEE, 2016, pp. 1–12.

[36] Y. Shen, M. Ferdman, and P. Milder, "Maximizing CNN accelerator efficiency through resource partitioning," in 2017 ACM/IEEE 44th Annual International Symposium on Computer Architecture (ISCA). IEEE, 2017, pp. 535–547.

[37] C. Szegedy et al., "Going deeper with convolutions," in Conference on Computer Vision and Pattern Recognition (CVPR), 2015.

[38] Y. Taigman, M. Yang, M. Ranzato, and L. Wolf, "DeepFace: Closing the gap to human-level performance in face verification," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2014, pp. 1701–1708.

[39] Y. Tian, K. Pei, S. Jana, and B. Ray, "DeepTest: Automated testing of deep-neural-network-driven autonomous cars," in Proceedings of the 40th International Conference on Software Engineering, 2018, pp. 303–314.

[40] C.-J. Wu, D. Brooks, K. Chen, D. Chen, S. Choudhury, M. Dukhan, K. Hazelwood, E. Isaac, Y. Jia, B. Jia et al., "Machine learning at Facebook: Understanding inference at the edge," in 2019 IEEE International Symposium on High Performance Computer Architecture (HPCA). IEEE, 2019, pp. 331–344.

[41] X. Yang, M. Gao, Q. Liu, J. Setter, J. Pu, A. Nayak, S. Bell, K. Cao, H. Ha, P. Raina et al., "Interstellar: Using Halide's scheduling language to analyze DNN accelerators," in Proceedings of the Twenty-Fifth International Conference on Architectural Support for Programming Languages and Operating Systems, 2020, pp. 369–383.