
The Anatomy of an Embedded Machine Learning Accelerator

WP515 (v1.0) November 5, 2019

© Copyright 2019 Xilinx, Inc. Xilinx, the Xilinx logo, Alveo, Artix, Kintex, Spartan, UltraScale, Versal, Virtex, Vivado, Zynq, and other designated brands included herein are trademarks of Xilinx in the United States and other countries. All other trademarks are the property of their respective owners.

The distributed memories, high compute density, and logic support in FPGAs and SoCs make them ideal as ML accelerator platforms. Designers can leverage the inherent compute power of these devices without performing any hardware implementation.

ABSTRACT

As machine learning (ML) has increased in popularity over the last half-decade, a number of accelerators for ML have been introduced, evaluated, and deployed on programmable logic. With their unique power/performance point, FPGAs are increasingly a popular choice for machine learning applications. In addition, FPGAs can perform sensor fusion and the other operations that typically surround machine learning applications. This white paper takes a deliberately generic example of such an accelerator, and explains the purpose of its constituent parts and why the FPGA is an ideal implementation vehicle.


Introduction

Machine learning is an incredibly important application area, allowing insights to be drawn from all sorts of different data streams. From facial recognition for security applications to anomaly detection in industrial settings, traditional deep algorithmic approaches require very specific domain knowledge.

With their unique power/performance point, FPGAs are increasingly a popular choice for machine learning applications. In addition, FPGAs can perform sensor fusion and the other operations that typically surround machine learning applications. For example, a system for monitoring driver attention in cars might require interfaces to networks, ultrasonic sensors, multiple cameras, and audio, all while working in a setting typically bounded by power and thermal constraints. The system might also require more traditional processing, such as video scaling and cropping, as well as machine learning to detect when the driver’s attention strays from the road ahead.

As machine learning (ML) has increased in popularity over the last half-decade, a number of accelerators for ML have been introduced, evaluated, and deployed on programmable logic. This white paper takes a deliberately generic example of such an accelerator, and explains the purpose of its constituent parts, and why the FPGA is an ideal implementation vehicle.

Exploiting Parallelism in a Generic Accelerator

Machine learning (ML) is a very broad topic, covering various approaches that allow general algorithms to perform domain-specific tasks by training them on data drawn from the specific domain. In the driver-attention system outlined previously, a general neural network could be trained using labeled images of drivers paying attention and drivers not paying attention to the road ahead. Though the neural network itself is “general,” it learns using domain-specific knowledge.

A number of neural network classes have been proposed over the years. One very important class, often used in computer vision applications, is the convolutional neural network (CNN). CNNs consist of layers that accept three-dimensional (3D) input feature maps (IFMs). For example, a color image coming from a VGA camera is considered to be a 640 x 480 x 3 input feature map. CNNs map such images to 3D output feature maps (OFMs) using a 3D convolution operation. Though operations such as pooling and normalization are also executed in CNNs, the bulk of the compute effort lies in performing the actual 3D convolution operation.

Many online resources explain the 3D convolution operation itself. This white paper, however, specifically describes the mapping of these 3D operations onto the FPGA fabric. The basic CNN formula is a 7-deep nested loop, shown in simplified form in Figure 1.


Huge amounts of parallelism are possible. The only data dependency is that the output pixels produced in one layer become the input pixels in the next layer. Figure 2 shows some examples of the levels of parallelism available to an accelerator.

One option is to perform layers in parallel, while remembering to factor in the one data dependency in the 7-deep loop nest: the output pixels of one layer must form the input pixels of the next layer. In programmable logic, however, it is very common to pipeline independent operations by employing structures such as line buffers. This allows the next stage of a pipeline to start when enough of its input arguments have been produced.

Another option is to produce multiple output pixels in parallel. Figure 1 shows that, within a given layer, output pixels at the same x,y location are produced using the same input pixels, but different weight sets. Therefore, a structure that allows input pixels to (a) flow from left to right and (b) be multiplied by different stored weights lets an accelerator exploit output feature map parallelism, as shown in Figure 3. By storing more than one weight in each local memory, it is possible to reuse the same row of multipliers to produce multiple output feature maps. For a row of r multipliers and t time steps, the structure can produce pixels in r · t output feature maps.

Figure 1: Levels of Parallelism Available for Computing DNN Inference (the basic CNN formula as a 7-deep nested loop):

for each layer l
    for each output feature map plane o
        for each output feature map row r
            for each output feature map col c
                for each filter plane p
                    for each filter row x
                        for each filter col y
                            A[l][o][r][c] += A[l-1][p][r-x][c-y] * w[l][o][p][x][y]

Figure 2: Levels of Parallelism Available for Computing DNN Inference (diagram illustrating layer parallelism, branch parallelism, output pixel parallelism, IFM parallelism, OFM parallelism, and bit parallelism inside the arithmetic)


By feeding in input pixels from the same x,y location but on different input feature maps, a structure can be added that weights the contribution of each of the input feature maps. This allows an accelerator to exploit input feature map parallelism. See Figure 4.

By using a distinct accumulator at the bottom of a column of multipliers, it is possible to reuse the same column of multipliers to consume multiple input feature maps. For a column of c multipliers and t time steps, the structure can consume pixels in c · t input feature maps.

Putting the two structures together forms the well-known systolic array structure, shown in Figure 5. This structure exploits both input and output feature map parallelism, and is a great fit for FPGA architectures.
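As a point of reference, the following behavioral C model sketches one pass through such a structure, assuming (purely for illustration) an array of R = 16 rows and C = 8 columns; the function name and sizes are illustrative choices, not taken from any Xilinx implementation. Each row carries one input feature map pixel across the array, each column holds its own weights, and each column accumulates one output feature map pixel.

/* Behavioral model of one pass through an R x C multiply-accumulate structure. */
#include <stdint.h>

#define R 16   /* rows    = input feature maps consumed in parallel  */
#define C 8    /* columns = output feature maps produced in parallel */

void systolic_pass(const int8_t in[R], const int8_t w[R][C], int32_t acc[C])
{
    for (int c = 0; c < C; c++)          /* each column produces one output channel */
        for (int r = 0; r < R; r++)      /* each row consumes one input channel     */
            acc[c] += (int32_t)in[r] * (int32_t)w[r][c];
}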

Figure 3: Producing Multiple Output Feature Map Pixels from a Single Input Feature Map Pixel (a row of multiply/add units, each with local weight storage)

Figure 4: Consuming Multiple Input Feature Map Pixels for a Single Output Feature Map Pixel (a column of multiply/add units, each with local weight storage)


By increasing the number of columns in an array, the array can produce as many output feature maps as there are columns in parallel. By increasing the number of rows in an array, the array can consume as many input feature maps as there are rows in parallel. However, as mentioned earlier, the structure produces output maps in multiples of the number of columns. Thus, if the required depth of the output feature map is not an exact multiple of the number of columns, then certain columns are unused for part of the calculation, reducing the compute efficiency.

Similarly, the structure consumes input maps in multiples of the number of rows. Therefore, if the input feature map depth is not an exact multiple of the number of rows, certain rows are unused for part of the calculation.

This can present a problem for early layers in the network that might have a very small number of input feature maps—1 or 3, for example. In this case, accelerators should be flexible enough to exploit one of the other forms of parallelism shown in Figure 1, such as performing calculations for all columns of a filter in parallel.
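As a rough illustration of this efficiency argument, the helper below (an illustrative sketch only, not part of any Xilinx toolchain) counts how many array passes a layer needs and what fraction of the issued multiplies are useful.

/* Passes needed to map a layer onto a rows x cols array, and resulting efficiency. */
static int ceil_div(int a, int b) { return (a + b - 1) / b; }

double array_efficiency(int ifm_depth, int ofm_depth, int rows, int cols)
{
    int  passes     = ceil_div(ifm_depth, rows) * ceil_div(ofm_depth, cols);
    long useful_mac = (long)ifm_depth * ofm_depth;      /* MACs the layer needs  */
    long issued_mac = (long)passes * rows * cols;       /* MACs the array issues */
    return (double)useful_mac / (double)issued_mac;
}

/* Example: a 3-channel RGB input layer with 64 output channels on a 16 x 8 array
   gives 3 * 64 / (8 * 16 * 8) = ~19% efficiency, which is why early layers
   benefit from another form of parallelism. */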

Generic Embedded Accelerator Architecture

A number of architectures are commercially available, including the Xilinx Deep Learning Processing Unit (DPU) architectures [Ref 1]. Though there are some domain-specific differences between these architectures, the same basic functional blocks are present. See Figure 6.

Figure 5: Systolic Array (a 2D grid of multiply/add units, each with local weight storage)


The following sections explore in detail the build-up of each functional block.

Activation Memories

The activation memories are used to store input and output tensors being processed by the systolic array. Data is read out of the activation memory in the order required to move through the 3D input feature map tensor, and is then written back into the activation memory, forming the 3D output feature map tensor. Some operations, such as Eltwise addition [Ref 2], require two 3D tensors and produce a single 3D output. While the systolic array is active, DMAs are used to spill a previously stored output feature map tensor and restore a new input feature map tensor. See Figure 7.

Figure 6: Generic Embedded Accelerator Architecture (block diagram: instruction buffer, image queue, weights DMA controller, spill/restore DMA controller, execution controller, systolic array, crossbar, and bias, ReLU, and pooling / element-wise (EWA) units)


The activation memory requires a minimum of three read ports and two write ports, to avoid queuing up requests. Such memories are easy to implement in programmable logic, using either block RAM or UltraRAM, together with some external multiplexing to steer the five requests to the correct ports. Since both UltraRAM and block RAM can each have one write port and one read port, an activation memory can be made from a minimum of three of these components if the memories are wide enough. If the systolic array has many rows or if the data width is large, then the activation memory can be made from multiples of three of these components, combining enough of these multiples to feed all the rows of the array.

To avoid memory collisions (for example, two reads from the same physical port), a smart compiler can be used that allocates base addresses and ranges within the activation memory. So, the systolic array could be reading from the first physical memory in a group of three while the spill DMA is reading from the second. Because the sizes of all the tensors to be processed are known at network compile time, it is possible to ensure that the memory accesses are safe by allocating each tensor to a part of memory serviced by a different physical port.
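A minimal sketch of this compile-time allocation idea is shown below, assuming a simple round-robin policy over the physical banks; the structures and names are illustrative only and are not the DPU compiler's.

/* Round-robin assignment of tensors to physical memory banks: consecutive
   tensors land in different banks, so the tensor the systolic array reads and
   the tensor the spill DMA reads do not share a port in this sketch. */
typedef struct { int bank; unsigned base; unsigned size_bytes; } tensor_alloc_t;

tensor_alloc_t allocate_tensor(unsigned size_bytes, int num_banks,
                               int *next_bank, unsigned bank_fill[])
{
    tensor_alloc_t a = { *next_bank, bank_fill[*next_bank], size_bytes };
    bank_fill[*next_bank] += size_bytes;            /* bump the base for this bank */
    *next_bank = (*next_bank + 1) % num_banks;      /* next tensor: next bank      */
    return a;
}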

The activation memories can be adapted during accelerator construction over a number of dimensions. The data type stored is typically one parameter (e.g., 8-bit or 16-bit), though it is possible to design an activation memory capable of carrying multiple data types. Another parameter is the type of memory building blocks used (block RAM or UltraRAM). The final parameter is the number of these memory building blocks used to implement the activation memory, which then defines the storage capacity.

One example is an activation memory that stores 8-bit data values, is implemented using UltraRAM, and produces sixteen values in parallel. Such an activation memory can be made with two sets of three UltraRAMs, each configured as a simple dual-port (SDP) memory.

Figure 7: Generic Architecture: Activation Memories


Since each UltraRAM SDP memory can produce eight 8-bit values (64 bits in total), two sets are needed to produce the sixteen 8-bit values required. As described above, the minimum number of memory building blocks required in each set is three; therefore, six UltraRAMs in total are required. This means that even in a device as small and inexpensive as a Zynq UltraScale+ MPSoC XCZU4CG, eight activation memories for eight distinct embedded accelerators can be implemented.
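The sizing arithmetic above can be captured in a few lines; the sketch below assumes, as in the text, 64 usable data bits per UltraRAM SDP port and a minimum of three banks per set (function and parameter names are illustrative).

/* How many UltraRAMs an activation memory needs for a given read width. */
#include <stdio.h>

int urams_needed(int lanes, int elem_bits, int min_banks)
{
    int bits_per_beat = lanes * elem_bits;          /* e.g., 16 x 8 = 128 bits    */
    int sets          = (bits_per_beat + 63) / 64;  /* e.g., 128/64 = 2 sets      */
    return sets * min_banks;                        /* e.g., 2 x 3  = 6 UltraRAMs */
}

int main(void)
{
    printf("%d UltraRAMs\n", urams_needed(16, 8, 3));   /* prints "6 UltraRAMs" */
    return 0;
}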

Systolic Array

Contemporary deep learning algorithms depend on a large number of multiply-accumulate operations. Architectures that can fuse multiply and add operations into a combined operator consume less energy than architectures whose operators must fetch their operands from, and return their results to, a unified register file. The arrangement of DSP48 primitives in columns within Xilinx® FPGA architectures, together with the dedicated carry-chain after the post-adder in the DSP slice, allows the construction of energy-efficient, high-speed vectorized multiply-and-add operations. See Figure 8.

The Xilinx DSP48E2 DSP slice is shown in Figure 9. As can be seen, each block contains a wide multiplier, a post adder, and a dedicated carry-chain, among other functionality, making it ideal for implementing the chained multiply-accumulate operations required for deep learning [Ref 3].

Figure 8: Generic Architecture: Systolic Array


Each output channel of an operator in a deep neural network (DNN) computation is produced by multiplying the same inputs by its own independent set of weights. This reuse of the input data for calculating different output channels can be exploited by running multiple output channels in parallel as separate columns of a 2D systolic array, with energy-efficient movement of input data horizontally across the array. This both avoids the energy cost of selecting operands in a register file and keeps interconnect short, allowing low-energy or high-frequency operation. This is further explored in the Supertile example [Ref 4].

2D systolic arrays supporting a channel-parallel architecture have a native size, e.g., a 16-input, 8-output array containing 128 multipliers. Reduction operators (such as convolution and fully connected layers) with fewer vector elements can be supported by manipulating the weights of the unoccupied channels, whereas layers that require more input channels or more output channels can be supported by adding a resettable accumulator at the bottom of every systolic array column. If the weights and inputs of a column are changed each cycle and the results produced in those cycles are summed together, the systolic array can support convolutions of a length limited only by the capacity of the on-chip weight RAM, which must hold weights for each input channel. If the accumulator is reset and a new set of weights is loaded from external memory, an independent set of output channels can be calculated. This means the array can support a range of different convolution sizes. During the evaluation of a neural network, the systolic array can be reused in many different configurations to calculate the convolution transform of its constituent operators. The efficiency of the array is determined by the average proportion of the multipliers active over the execution of the complete network. Efficiency is highest when the dimensions of each layer are a multiple of the native size.
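The following behavioral sketch illustrates this use of per-column resettable accumulators for a layer whose input depth exceeds the native array size; the array dimensions, CIN, and the function name are illustrative assumptions, not a hardware description.

/* One output pixel position for C output channels: the accumulators are cleared
   once, partial sums build over ceil(CIN / R) passes as the weights change, and
   only then are the results written out. */
#include <stdint.h>
#include <string.h>

#define R 16   /* native input channels per pass  */
#define C 8    /* native output channels per pass */

void accumulate_deep_layer(const int8_t *in, const int8_t (*w)[C],
                           int32_t out[C], int CIN)
{
    int32_t acc[C];
    memset(acc, 0, sizeof(acc));                      /* reset the accumulators  */
    for (int t = 0; t < (CIN + R - 1) / R; t++) {     /* one pass per R channels */
        for (int c = 0; c < C; c++) {
            for (int r = 0; r < R; r++) {
                int ch = t * R + r;
                if (ch < CIN)                         /* unused rows add nothing */
                    acc[c] += (int32_t)in[ch] * (int32_t)w[ch][c];
            }
        }
    }
    memcpy(out, acc, sizeof(acc));
}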

Figure 9: DSP48E2 Block Diagram (showing the dual A/D registers and pre-adder, dual B register, 27 x 18 multiplier, ALU/post-adder with carry logic, pattern detector, and the dedicated cascade paths ACIN/ACOUT, BCIN/BCOUT, PCIN/PCOUT, MULTSIGNIN/MULTSIGNOUT, and CARRYCASCIN/CARRYCASCOUT. These cascade signals are dedicated routing paths internal to the DSP48E2 columns and are not accessible via general-purpose routing resources.)


Because Xilinx® FPGAs are available in a whole range of different sizes and can support multiple systolic arrays, each with configurable sizes, there is an efficiency advantage over fixed-size arrays from other silicon vendors.

The ability to use LUTs as distributed RAM in Xilinx FPGA architectures brings large advantages when computing CNN layers. These layers form the backbone of modern computer vision pipelines. CNN layers perform convolution of input tensors with small kernels, which allows learning to exploit spatially local features in the input tensor. Because the convolution kernel is reused many times in different locations in the image, convolutional layers allow a significant reduction of weight memory bandwidth through weight reuse over the different pixels of the output channels, provided the architecture supports on-chip buffering of weights.

Exploiting this weight reuse opportunity requires many memory ports, each delivering different weights to its multiplier but able to reuse those weights multiple times. Distributed memories provide ideal qualities to support this. Each LUTM CLB in the Xilinx UltraScale+™ fabric can support a 64-entry, 7-bit memory, and two columns of LUTM CLBs are distributed uniformly through the fabric immediately adjacent to columns of DSPs at a ratio of 10:2. This LUTM:DSP ratio allows the construction of simple dual-port memories providing 2Kb of storage and up to 10 ports for each DSP, with flexible addressing of those RAMs using the adjacent LUTL logic column.

Flexible clocking available on the FPGA allows the user to derive maximum advantage from this architecture by operating the DSPs and reads from the weight memories at an integer frequency multiple of the frequency used to move data across the array and fill the distributed weight memories. This means the DSPs and LUTs (which have maximum frequencies of up to 891MHz in -3 rated parts) can be run close to their maximum operating frequencies, decoupled from slower logic, which must support data-movement over large distances in the FPGA die.

A further point to consider in FPGA-based implementations is the efficient use of the DSP slices themselves. A recent Xilinx white paper describes a technique for using the same multiplier in the DSP to produce two products [Ref 5], as long as the operands are 8 bits and the two products share a common term. As mentioned, output pixels at the same x,y location are produced using the same input pixels but different weight sets, so the need for common terms is easily met in machine learning applications. The technique exploits the fact that the multiplier in the DSP is very wide; because the two products share a common term, the two different inputs can be packed in such a way that the two individual products can be recovered after a single multiply operation. This allows the construction of very high efficiency systolic arrays, which are now commercially available from Xilinx [Ref 1].
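A hedged arithmetic sketch of the shared-term idea follows: two 8-bit weights that multiply the same activation are packed into one wide operand, a single wide multiply is issued, and the two products are recovered afterwards. This only demonstrates the arithmetic; the actual DSP48E2 mapping described in [Ref 5] additionally uses the pre-adder and the specific port widths of the slice.

/* Two 8-bit weights sharing one 8-bit activation: pack, multiply once, unpack. */
#include <stdint.h>

void dual_product(int8_t w1, int8_t w2, int8_t a, int32_t *p1, int32_t *p2)
{
    int64_t packed  = ((int64_t)w1 << 18) + (int64_t)w2;  /* fits a 27-bit operand */
    int64_t product = packed * (int64_t)a;                /* one wide multiply     */

    int32_t lo = (int32_t)(product & 0x3FFFF);            /* lower 18 bits         */
    if (lo & 0x20000) lo -= 0x40000;                      /* sign-extend w2 * a    */

    int32_t hi = (int32_t)(product >> 18);                /* upper bits            */
    if (lo < 0) hi += 1;                                  /* undo the borrow       */

    *p1 = hi;   /* equals w1 * a */
    *p2 = lo;   /* equals w2 * a */
}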

Special Operations

In many modern machine learning applications, the bulk of the compute is typically in convolution layers or in similar matrix-matrix type computations. However, a number of point operations typically have to be executed, usually after a larger matrix computation has completed.

The activation functions commonly used serve as an example. Older machine-learning applications used relatively complex mathematical functions, such as tanh or sigmoid, to squeeze output activations into different ranges. These complex activation functions are still used in some applications, such as LSTM networks [Ref 6]. Most modern convolution approaches use a simpler function called the Rectified Linear Unit (ReLU), or some minor variant of it [Ref 7].


Normal ReLU passes all positive values unmodified, but clamps all negative values to zero. Though simple, this activation function has been shown to perform sufficiently well to allow training and inference of very deep networks. ReLU is a very inexpensive operation on an FPGA.

Other operations are pooling and normalization. Pooling selects one pixel out of a group of many in order to reduce the spatial dimension. This sort of filtering, or picking the max or median, is a well-known technique on FPGAs. Normalization is a point operation that modifies the dynamic range of activations to sit between known upper and lower bounds. This operation is easily implemented as an affine transformation using a DSP slice [Ref 1]. See Figure 10.
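For reference, both point operations amount to only a few lines of straightforward code; the sketch below assumes 8-bit activations and a 2x2 max-pooling window, with function names chosen purely for illustration.

/* ReLU and 2x2 max pooling over one h x w feature map plane of 8-bit values. */
#include <stdint.h>

static inline int8_t relu(int8_t x) { return x > 0 ? x : 0; }

void maxpool_2x2(const int8_t *in, int8_t *out, int h, int w)
{
    for (int r = 0; r < h / 2; r++) {
        for (int c = 0; c < w / 2; c++) {
            int8_t m = in[(2 * r) * w + 2 * c];                      /* top-left     */
            int8_t v;
            v = in[(2 * r) * w + 2 * c + 1];     if (v > m) m = v;   /* top-right    */
            v = in[(2 * r + 1) * w + 2 * c];     if (v > m) m = v;   /* bottom-left  */
            v = in[(2 * r + 1) * w + 2 * c + 1]; if (v > m) m = v;   /* bottom-right */
            out[r * (w / 2) + c] = m;
        }
    }
}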

Execution Control

Tensor processors generally run entire networks, or portions of networks, for one or more input tensors, without interaction with a controlling processor. Like other processors, they load compiler-generated programs that describe the schedule of operations required to perform a given computation. The instructions can be at quite a high level compared to RISC CPU instructions, for example, “copy data from address A to address B” or “run a convolution tile with parameters XYZ.” Internal state machines fetch these operations from a program memory and decode them into control signals or micro-operations consumed by the DMA engines and systolic array. At the end of program execution, the tensor processor can signal an external agent or simply wait for another program.

During network execution, a tensor processor must synchronize the movement of data between activation, weight, and parameter memories with the compute performed by the systolic array or other compute engines.

Figure 10: Generic Architecture: Special Operations


The systolic array implementation is most efficient in terms of logic utilization when it does not need to support stalling to wait for input data or throttling to wait for output memory to become available. This means that when the input and output buffers are available, execution of the systolic array proceeds in a deterministic manner until the current operation is completed.

To ensure that all data buffers are available at the start of systolic array execution, the execution controller issues tasks independently to the DMA engines and systolic array, and then tracks the dependencies between the tasks in hardware. A simpler approach, from a hardware perspective, is to think of the data movement and compute engines in the tensor processor as occupying slots in a VLIW instruction, with one slot per engine. A compiler can then pipeline the sequence of load, compute, and store tasks for one tensor operation in time across multiple VLIW instructions. Synchronization in the processor occurs at the boundaries of these instruction bundles. Thus, if the systolic array needs the load result at time N and the load is scheduled in VLIW bundle N – 1, the data is guaranteed to be available for systolic array execution.

Implementing this type of synchronization requires very little logic. The key requirement is that the compiler must be able to generate good estimates of data transfer time and systolic array execution time, so that the slots of the conceptual VLIW instruction can be filled with equal-length tasks. See Figure 11.
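A toy software model of this bundle-based synchronization is sketched below; the structures, slot contents, and print statements are illustrative stand-ins for the real engines, not the DPU's actual instruction format.

/* Each bundle carries one task per engine; engines are issued together and the
   program only moves to the next bundle once all of them are done, so compute
   in bundle i can rely on data loaded in bundle i - 1. */
#include <stdio.h>

typedef struct { int tensor_id; } dma_task_t;              /* -1 = empty slot */
typedef struct { int layer_id; int tile; } compute_task_t; /* -1 = empty slot */

typedef struct {
    dma_task_t     restore_slot;   /* restore DMA    */
    dma_task_t     spill_slot;     /* spill DMA      */
    compute_task_t compute_slot;   /* systolic array */
} vliw_bundle_t;

/* Software stand-ins for issuing work to the engines. */
static void issue_restore(const dma_task_t *t)     { if (t->tensor_id >= 0) printf("restore tensor %d\n", t->tensor_id); }
static void issue_spill(const dma_task_t *t)       { if (t->tensor_id >= 0) printf("spill tensor %d\n", t->tensor_id); }
static void issue_compute(const compute_task_t *t) { if (t->layer_id  >= 0) printf("layer %d tile %d\n", t->layer_id, t->tile); }

void run_program(const vliw_bundle_t *prog, int n_bundles)
{
    for (int i = 0; i < n_bundles; i++) {
        issue_restore(&prog[i].restore_slot);   /* the three engines run concurrently ... */
        issue_spill(&prog[i].spill_slot);
        issue_compute(&prog[i].compute_slot);
        /* ... and synchronize only here, at the bundle boundary. */
    }
}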

Figure 11: Generic Architecture: Execution Control


Quantization

To make inference as efficient as possible, it is necessary to reduce both computational complexity and the quantity of data stored and moved. Quantization is the method of representing floating point numbers as integers, typically of 16 bits or 8 bits. An 8-bit representation of a 32-bit floating point number reduces the data size by a factor of 4, and is an efficient size for multiplications in a Xilinx DSP slice. Being able to move around smaller amounts of data has two distinct advantages: first, less power is needed to deal with external memory, because less data is being moved; second, there is more opportunity for storing values using smaller on-chip memories, which have latency and power advantages.

Quantization obviously reduces accuracy, not least because not every float can be represented exactly in 8 bits. However, DNNs tend to be robust to noise, and the tensor values for a given layer’s weights or activations tend to lie in a fairly narrow range. In many cases, post-training quantization gives reasonably good inference results. For deeper, more complex networks, quantization-aware retraining might be required. In such retraining, the floating point scale factors that convert floating point values to quantized integers can be learned as part of the retraining process, resulting in more fine-tuned quantization and thus improving network accuracy.

The process of quantization begins with selecting a floating point scale factor that applies to a given tensor, so that conversion between integer and floating point values can be performed. The conversions are q = round( f / sf ) and fq = q · sf, where q is the quantized integer, f is the float, sf is the scale factor, and fq is the reconstructed float, that is, what the quantized value actually represents.

The scale factor is generally selected to minimize quantization error, which is the difference between f and fq. A straightforward method is to find the maximum absolute value in the tensor and divide by 2^(n–1), which for 8 bits means dividing by 128. This method can represent the entire range of the tensor, although other methods might ignore extreme outlier values in favor of reducing the quantization error for the majority of smaller values.
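A minimal sketch of this selection and conversion, assuming signed 8-bit quantization and the max-absolute-value method described above (function names are illustrative):

/* Scale factor selection and the q = round(f / sf), fq = q * sf conversions. */
#include <math.h>
#include <stdint.h>
#include <stdlib.h>

double select_scale_factor(const float *tensor, size_t n)   /* sf = max|f| / 128 */
{
    double max_abs = 0.0;
    for (size_t i = 0; i < n; i++)
        if (fabsf(tensor[i]) > max_abs) max_abs = fabsf(tensor[i]);
    return max_abs / 128.0;                                  /* 2^(8-1) for int8  */
}

int8_t quantize(float f, double sf)
{
    long q = lround(f / sf);                                 /* q = round(f / sf) */
    if (q >  127) q =  127;                                  /* clip to int8      */
    if (q < -127) q = -127;
    return (int8_t)q;
}

float dequantize(int8_t q, double sf)
{
    return (float)(q * sf);                                  /* fq = q * sf       */
}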

Since all of the computation performed on the FPGA is in integer format, an integer computation must be used to convert from one scale factor to another. In a convolution, inputs at one scale factor are multiplied by weights at another scale factor, generating outputs at yet another scale factor. The contents of the 24-bit accumulator in a DSP column have a scale factor of input_sf · weight_sf, so that scale factor must be converted to output_sf, along with converting the 24-bit accumulator value to the 8-bit output.

The conversion can be done by multiplying the accumulated value by a precomputed integer, followed by a constant right shift that leaves the 8-bit output in the low-order bits of the result, where saturation can then occur.

At its simplest, converting a quantized integer from one scale factor to a quantized integer with a different scale factor means calculating an integer conversion factor: cf = round( from_sf / to_sf ). This only works, however, when from_sf ≥ to_sf; otherwise, cf would be zero, and multiplying the quantized value by it would always result in zero.

Instead, a means of ensuring that cf > 0 must be introduced. This is done by artificially increasing from_sf by a power of 2 (equivalent to a left shift of the quantized value), so that the conversion factor is now calculated as cf = round( from_sf · 2^shift / to_sf ).


If the shift value is appropriately selected, cf is a positive 16-bit integer that can be multiplied by (for example) the accumulator, with the result shifted right by that same shift value, reducing the scale factor to to_sf.
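The sketch below walks through this conversion with the scale factors used in Table 1 below, assuming shift = 15 and a signed 8-bit output; it is an illustration of the arithmetic, not production code.

/* Reproduces Table 1: input_sf = 2^-5, weight_sf = 2^-8, output_sf = 2^-6. */
#include <math.h>
#include <stdint.h>
#include <stdio.h>

int main(void)
{
    const double input_sf = 0.03125, weight_sf = 0.00390625, output_sf = 0.015625;
    const int    shift    = 15;

    int32_t q_in = 27, q_w = 23;                         /* quantized operands      */

    /* cf = round(input_sf * weight_sf * 2^shift / output_sf) */
    int32_t cf = (int32_t)llround(input_sf * weight_sf * (double)(1 << shift) / output_sf);

    int32_t acc    = q_in * q_w;                         /* 621                     */
    int64_t scaled = (int64_t)acc * cf;                  /* 256 * 621 = 158976      */
    scaled += (int64_t)1 << (shift - 1);                 /* + 2^14 (round) = 175360 */
    int32_t q_out  = (int32_t)(scaled >> shift);         /* 5                       */

    if (q_out >  127) q_out =  127;                      /* clip to the int8 range  */
    if (q_out < -127) q_out = -127;

    printf("cf=%d, q_out=%d, represents %f\n", cf, q_out, q_out * output_sf);
    return 0;   /* prints cf=256, q_out=5, represents 0.078125 */
}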

Example: The input is multiplied by a weight, and the product is multiplied by cf, where cf = round( input_sf · weight_sf · 2^15 / output_sf ), an unsigned 16-bit integer. The quantized output is the accumulated value shifted right by 15. To deal with overflow, clipping leaves an 8-bit value in the range –127 to +127. See Table 1.

Table 1: Quantization Example

                  Float       Scale              Quant     Quant Float        Quant Error
  input           0.843       0.03125            27        0.84375            0.00075
  weight          0.091       0.00390625         23        0.08984375         0.00115625
  input · weight  0.076713    0.0001220703125    621       0.0758056640625    0.0009073359375
  cf                                             256
  cf · prod                                      158976
  round                                          175360    (adds 2^14 to represent 0.5)
  output          0.076713    0.015625           5         0.078125           0.001412

Conclusion

Machine learning is a fast-moving field, and accelerator design must proceed at a similar pace to keep up with new layer requirements. FPGAs are an ideal implementation vehicle for such accelerators, due to their distributed memories, high compute density, and support for adaptable logic implementing special operators. By providing compilers that produce all the programming information for these accelerators [Ref 1], end users can take advantage of the inherent compute power of an FPGA without having to perform any hardware implementation.

References

1. Xilinx Product Guide PG338, DPU for Convolutional Neural Network IP Product Guide, v3.0.
2. He, K. et al., “Deep Residual Learning for Image Recognition,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778, 2016.
3. Xilinx User Guide UG579, UltraScale Architecture DSP Slice User Guide.
4. Wu, E. et al., “A High-Throughput Reconfigurable Processing Array for Neural Networks,” in Proceedings of the 27th International Conference on Field-Programmable Logic and Applications, 2017.
5. Xilinx White Paper WP486, Deep Learning with INT8 Optimization on Xilinx Devices.
6. Hochreiter, S. and Schmidhuber, J., “Long Short-Term Memory,” Neural Computation, 9(8), pp. 1735–1780, 1997.
7. Nair, V. and Hinton, G. E., “Rectified Linear Units Improve Restricted Boltzmann Machines,” in ICML, 2010.


Revision History

The following table shows the revision history for this document:

Date          Version   Description of Revisions
11/05/2019    1.0       Initial Xilinx release.

Disclaimer

The information disclosed to you hereunder (the “Materials”) is provided solely for the selection and use of Xilinx products. To the maximum extent permitted by applicable law: (1) Materials are made available “AS IS” and with all faults, Xilinx hereby DISCLAIMS ALL WARRANTIES AND CONDITIONS, EXPRESS, IMPLIED, OR STATUTORY, INCLUDING BUT NOT LIMITED TO WARRANTIES OF MERCHANTABILITY, NON-INFRINGEMENT, OR FITNESS FOR ANY PARTICULAR PURPOSE; and (2) Xilinx shall not be liable (whether in contract or tort, including negligence, or under any other theory of liability) for any loss or damage of any kind or nature related to, arising under, or in connection with, the Materials (including your use of the Materials), including for any direct, indirect, special, incidental, or consequential loss or damage (including loss of data, profits, goodwill, or any type of loss or damage suffered as a result of any action brought by a third party) even if such damage or loss was reasonably foreseeable or Xilinx had been advised of the possibility of the same. Xilinx assumes no obligation to correct any errors contained in the Materials or to notify you of updates to the Materials or to product specifications. You may not reproduce, modify, distribute, or publicly display the Materials without prior written consent. Certain products are subject to the terms and conditions of Xilinx’s limited warranty, please refer to Xilinx’s Terms of Sale which can be viewed at http://www.xilinx.com/legal.htm#tos; IP cores may be subject to warranty and support terms contained in a license issued to you by Xilinx. Xilinx products are not designed or intended to be fail-safe or for use in any application requiring fail-safe performance; you assume sole risk and liability for use of Xilinx products in such critical applications, please refer to Xilinx’s Terms of Sale which can be viewed at http://www.xilinx.com/legal.htm#tos.

Automotive Applications Disclaimer

XILINX PRODUCTS ARE NOT DESIGNED OR INTENDED TO BE FAIL-SAFE, OR FOR USE IN ANY APPLICATION REQUIRING FAIL-SAFE PERFORMANCE, SUCH AS APPLICATIONS RELATED TO: (I) THE DEPLOYMENT OF AIRBAGS, (II) CONTROL OF A VEHICLE, UNLESS THERE IS A FAIL-SAFE OR REDUNDANCY FEATURE (WHICH DOES NOT INCLUDE USE OF SOFTWARE IN THE XILINX DEVICE TO IMPLEMENT THE REDUNDANCY) AND A WARNING SIGNAL UPON FAILURE TO THE OPERATOR, OR (III) USES THAT COULD LEAD TO DEATH OR PERSONAL INJURY. CUSTOMER ASSUMES THE SOLE RISK AND LIABILITY OF ANY USE OF XILINX PRODUCTS IN SUCH APPLICATIONS.