Transcript of "Deep Learning Seminar on Xilinx SoCs: From Training to Deployment" (11/25/2019)

Page 1: Deep Learning Seminar on Xilinx SoCs From Training to deployment · Docker: running different DNNDK/TensorFlow versions not depending on the Host OS/Python/CUDA installation 11/25/2019


Deep Learning Seminar on Xilinx SoCs

From Training to deployment

Page 2:

Agenda

01 Xilinx Deep Learning Solutions
02 Keras / TensorFlow ResNet50 Training: Building a “Fruit Recognizer”
03 Integration of the Deep Learning Processing Unit in Vivado
04 Xilinx DNNDK: From a TensorFlow Net to the DPU Firmware
05 Programming Model: The DPU API
06 Questions and Answers

Page 3:

Xilinx Deep Learning Solutions

Page 4:

Xilinx Focuses on Inference


Page 5:

Xilinx AI Inference Solution

Deep Learning Applications: Cloud, On Premises, Edge

− Cloud: Virtex UltraScale+ VU9P (featuring the most powerful FPGA in the cloud)
− Edge: Zynq UltraScale+ MPSoC

Page 6:

Many Applications for Machine Learning

− Robotics
− IIoT Gateways & Edge Appliances
− Drives & Motor Control
− PLC/PAC/IPC
− I/O Modules & Smart Sensors
− Human Machine Interface
− Video Surveillance & Smart City
− Machine & Computer Vision
− Smart Grid
− 3D Printing & Additive Manufacturing

Page 7:

Xilinx AI Solution from Edge to Cloud

                   Edge                        Cloud
AI Platforms       ZCU102, ZCU104, Ultra96     Xilinx U200, U250, U280
FPGA IP            DPU                         xDNN
Software Stack     DNNDK Runtime               xfDNN Runtime
                   DNNDK Compiler              xfDNN Compiler
                   DNNDK Quantizer             xfDNN Quantizer
                   DNNDK Pruning
Models             20+ models, incl. LSTM

Page 8:

Xilinx Solution Stack for Edge AI

− Models: Public, Custom
− Framework
− Tools & IP
− Edge AI Platforms: ZCU102, ZCU104, Ultra96

Page 9:

Why Xilinx for Edge AI?

− Xilinx offers the optimal tradeoff for Edge AI
  − Latency
  − Power
  − Cost
  − Flexibility
  − Scalability
  − Time-to-market
− Xilinx pruning technology
  − Up to 50x optimization
  − Increased performance
  − Reduced power

Page 10:

Xilinx Edge AI – Value Proposition

Whole Application Acceleration


Page 11:

Keras / TensorFlow Training

Building a “Fruit Recognizer”

Page 12:

ResNet50

ResNet was the winner of the ImageNet Large Scale Visual Recognition Challenge (ILSVRC) in 2015.

Page 13:

Steps to your Xilinx ML application

1 Prepare your data
2 Train your network
3 Test your trained network on test data
4 Freeze your TensorFlow model
5 Create a Vivado project with the DPU and export the DPU configuration
6 Quantize the model
7 Compile the model with the DPU configuration
8 Link the compiled model against your C/C++/Python application
9 Deploy it on the PetaLinux system
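Step 4, freezing, converts the trained graph's variables into constants so the later DNNDK steps consume a single .pb file. A minimal sketch using the TF 1.x API on a toy graph (the node names here stand in for your model's real input/output nodes):

```python
import tensorflow.compat.v1 as tf

tf.disable_eager_execution()

# Toy stand-in for a trained model: one variable, named input/output nodes.
x = tf.placeholder(tf.float32, [None, 2], name="input_1")
w = tf.Variable([[1.0], [2.0]], name="weights")
y = tf.identity(tf.matmul(x, w), name="dense/BiasAdd")

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    # Replace every Variable node with a Const holding its trained value.
    frozen = tf.graph_util.convert_variables_to_constants(
        sess, sess.graph_def, output_node_names=["dense/BiasAdd"])
    with tf.gfile.GFile("frozen_graph.pb", "wb") as f:
        f.write(frozen.SerializeToString())
```

The resulting frozen_graph.pb contains only Const and compute nodes, which is the form decent_q expects as --input_frozen_graph.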

Page 14:

Decide what you want to classify

The ImageNet database: a free database with ~20,000 classes and at least 500 pictures per class. Current size: 166 GByte.

Page 15:

Docker: running different DNNDK/TensorFlow versions independent of the host OS/Python/CUDA installation

Page 16:

Copy over what you need for your training/validation/test folders using Python
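The copy script itself is not reproduced in this transcript; a stdlib-only sketch of the idea (the 80/10/10 split and the folder layout are illustrative assumptions, not from the slide):

```python
import random
import shutil
from pathlib import Path

def split_dataset(src, dst, ratios=(0.8, 0.1, 0.1), seed=42):
    """Copy each class folder from src into dst/{train,validation,test}."""
    rng = random.Random(seed)  # fixed seed -> reproducible split
    for class_dir in sorted(Path(src).iterdir()):
        if not class_dir.is_dir():
            continue
        images = sorted(class_dir.iterdir())
        rng.shuffle(images)
        n_train = int(len(images) * ratios[0])
        n_val = int(len(images) * ratios[1])
        splits = {
            "train": images[:n_train],
            "validation": images[n_train:n_train + n_val],
            "test": images[n_train + n_val:],
        }
        for split, files in splits.items():
            out = Path(dst) / split / class_dir.name
            out.mkdir(parents=True, exist_ok=True)
            for f in files:
                shutil.copy2(f, out / f.name)
```

Run against a source tree of one folder per class (as ImageNet extracts provide), it produces the directory structure Keras generators expect.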

Page 17:

Train your net using Keras

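The training code shown on this slide is not captured in the transcript; a hedged sketch of what such a Keras ResNet50 setup can look like (the class count, classification head, and hyperparameters are illustrative, not from the slide):

```python
from tensorflow.keras.applications import ResNet50
from tensorflow.keras import layers, models

NUM_CLASSES = 5  # e.g. five fruit classes; adjust to your dataset

# ResNet50 backbone without the 1000-class ImageNet head.
# weights="imagenet" would load pretrained weights; None keeps this offline.
base = ResNet50(weights=None, include_top=False, input_shape=(224, 224, 3))

model = models.Sequential([
    base,
    layers.GlobalAveragePooling2D(),
    layers.Dense(NUM_CLASSES, activation="softmax"),
])
model.compile(optimizer="adam",
              loss="categorical_crossentropy",
              metrics=["accuracy"])

# Typical training call, using the folders produced by the split step:
# train_ds = tf.keras.utils.image_dataset_from_directory(
#     "dataset/train", image_size=(224, 224), label_mode="categorical")
# model.fit(train_ds, epochs=10)
```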

Page 18:

Test your TF model with images NOT part of the training and validation folders

Page 19:

Integration of the Deep Learning

Processing Unit in Vivado

Page 20:

DPU IP with High Efficiency

[Block diagram] The DPU attaches to the CPU, memory controller, and external memory over the bus. Main blocks: an instruction fetcher, decoder, register map, and control signals; a data mover with read/write schedulers for images, weights, and write-back, built around a smart memory fabric; a PE array (dispatcher plus processing elements); and a miscellaneous calculation unit covering average pool, max pool, ROI pool, elementwise operations, and more.

Page 21:

DPU – Supported Operations

− Operations supported by the DPU core(s):
  • Conv
  • Dilation
  • Pooling (Max, Average)
  • ReLU / Leaky ReLU / ReLU6
  • Fully Connected (FC)
  • Batch Normalization
  • Concat
  • Elementwise
  • Deconv
  • Depthwise conv
  • Mean scale
  • Upsampling
  • Split
  • Reorg
− Operations supported by additional cores:
  • Softmax
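Operations the DPU core does not implement, such as Softmax, are handled by an additional core or run on the ARM cores instead. For reference, a numerically stable softmax:

```python
import numpy as np

def softmax(logits):
    """Stable softmax: subtracting the max avoids overflow in exp()."""
    z = logits - np.max(logits, axis=-1, keepdims=True)
    e = np.exp(z)
    return e / np.sum(e, axis=-1, keepdims=True)

probs = softmax(np.array([2.0, 1.0, 0.1]))
```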

Page 22:

DPU – Interfaces and Parallelism

DPU B4096: slave-axi and Master-axi-0 (32 bits each), Master-axi-1 and Master-axi-2 (128 bits each)
DPU B1152: slave-axi and Master-axi-0 (32 bits each), Master-axi-1 and Master-axi-2 (64 bits each)

− 3-level parallelism is exploited: pixel * input channel * output channel
− Small core – B1152
  − Parallelism: 4*12*12
  − Target: Z7020/ZU2/ZU3
− Big core – B4096
  − Parallelism: 8*16*16
  − Target: ZU5 and above
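The B-numbers encode peak operations per cycle: counting each multiply-accumulate as two operations (multiply plus add), the three parallelism factors multiply out to the core name:

```python
def peak_ops_per_cycle(pixel_par, in_ch_par, out_ch_par):
    """Pixel * input-channel * output-channel MACs, 2 ops per MAC."""
    return pixel_par * in_ch_par * out_ch_par * 2

b1152 = peak_ops_per_cycle(4, 12, 12)   # small core: 4*12*12 MACs
b4096 = peak_ops_per_cycle(8, 16, 16)   # big core:   8*16*16 MACs
```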

Page 23:

Xilinx Vivado Integration


Page 24:

Xilinx DNNDK: From a TensorFlow

net to the DPU Firmware

Page 25:

DPU Runtime Engine

− Distributed with the DPU TRD
  − DPU IP Product Guide (PG338)
  − Yocto recipes
  − Can be used in a PetaLinux project
− Runtime N2Cube (Cube of Neural Network)
  − DPU Linux kernel driver
  − DPU run-time libraries
  − DPU utilities

Page 26:

DPU – Linux Kernel Driver

− Distributed as a DPU Yocto recipe
− Files
  − dpudef.h
  − dpucore.c/.h
  − dpuext.c/.h
− Source code
  − today, distributed with the DPU TRD
  − will be pushed to a GitHub repository when mature

Page 27:

DPU – run-time libraries

− Distributed as a DNNDK Yocto recipe
− DPU run-time library (libn2cube.so)
  − DPU loader
  − DPU task scheduling
  − DPU task monitoring
  − DPU task profiling
− DPU utility library (libdputils.so)
  − Utility functions to load images into the DPU

Page 28:

Flow

[Flow diagram] 1) Vivado: a design containing the DPU produces the .hdf hardware description and the device tree. 2) PetaLinux: builds BOOT.BIN, image.ub, and a sysroot containing dpu.ko, n2cube, and dputils. 3) DNNDK: a Caffe or TensorFlow model is quantized with decent and compiled with dnnc into dpu_{model}.elf. 4) XSDK: main.cc is compiled against the sysroot and linked with dpu_{model}.elf into application.elf. 5) BOOT.BIN, image.ub, and application.elf are deployed to the SD card.

Page 29:

DNNDK – Deep Neural Network Development Kit

− Distributed with the DNNDK User Guide (UG1327)
− Composed of:
  − Model compression (DECENT)
    − Pruning
    − Quantization
  − Model compilation (DNNC)
    − Compiler
    − Assembler

Page 30:

DECENT – DEEp CompressioN Tool

− Pruning
  − available separately (Xilinx AI Optimizer)
  − requires a license
− Quantization
  − available in the free version of the tools

Page 31:

Quantization – Flow

− Preprocess
  − folds batchnorm layers
  − removes useless nodes
− Quantize
  − weights / biases
  − activations
− Calibrate
  − using a calibration dataset (without labels)
− Generate
  − deployable DPU model

[Flow diagram] DECENT_Q takes the floating-point model and the calibration dataset (without labels), preprocesses, quantizes & calibrates, and generates the fixed-point / deploy DPU model.
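The batchnorm folding in the preprocess step can be checked numerically: for y = γ·(Wx + b − μ)/√(σ² + ε) + β, the per-channel scale is absorbed into the preceding layer's weights and bias. A numpy sketch for a dense (or 1×1 conv) layer:

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.standard_normal((4, 3))   # out_features x in_features
b = rng.standard_normal(4)
gamma, beta = rng.standard_normal(4), rng.standard_normal(4)
mean, var, eps = rng.standard_normal(4), rng.random(4) + 0.5, 1e-5

# Fold BN into the preceding layer's weights and bias.
scale = gamma / np.sqrt(var + eps)          # one factor per output channel
W_fold = W * scale[:, None]
b_fold = (b - mean) * scale + beta

# Both paths produce the same output, so BN disappears at inference time.
x = rng.standard_normal(3)
y_ref = gamma * ((W @ x + b) - mean) / np.sqrt(var + eps) + beta
y_fold = W_fold @ x + b_fold
```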

Page 32:

Quantization – Results

Network               Float32 baseline    8-bit quantization
                      Top1     Top5       Top1     ΔTop1    Top5     ΔTop5
Inception_v1          66.90%   87.68%     66.62%   -0.28%   87.58%   -0.10%
Inception_v2          72.78%   91.04%     72.40%   -0.38%   90.82%   -0.23%
Inception_v3          77.01%   93.29%     76.56%   -0.45%   93.00%   -0.29%
Inception_v4          79.74%   94.80%     79.42%   -0.32%   94.64%   -0.16%
ResNet-50             74.76%   92.09%     74.59%   -0.17%   91.95%   -0.14%
VGG16                 70.97%   89.85%     70.77%   -0.20%   89.76%   -0.09%
Inception-ResNet-v2   79.95%   95.13%     79.45%   -0.51%   94.97%   -0.16%

− Uniform quantization
− 8-bit for both weights and activations
− A small set of images for calibration

Page 33:

Quantization – Usage (TensorFlow)

− Input
  − floating-point frozen graph (frozen_graph.pb)
  − calibration dataset
  − Python pre-processing function
− Output
  − quantized model for deployment (deploy_model.pb)
  − quantized model for evaluation (quantize_eval_model.pb)
− Syntax

decent_q quantize
    --input_frozen_graph frozen_graph.pb
    --input_nodes {input node}
    --input_shapes ?,28,28,1
    --output_nodes {output node}
    --input_fn {python script}
    --method {0=non-overflow, 1=min-diffs}
    --gpu 0
    --calib_iter 200
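The --input_fn argument names a Python function that decent_q calls once per calibration iteration; it returns a dict mapping graph input node names to a batch of pre-processed data. A sketch with synthetic data standing in for real calibration images (the node name, batch size, and shape follow the ?,28,28,1 example above and are illustrative):

```python
import numpy as np

CALIB_BATCH_SIZE = 32

def calib_input(iteration):
    """Called by decent_q once per --calib_iter iteration.

    A real script would load and pre-process a batch of calibration
    images here; synthetic data keeps the sketch self-contained.
    """
    batch = np.random.rand(CALIB_BATCH_SIZE, 28, 28, 1).astype(np.float32)
    return {"input_node": batch}  # key = name of the graph's input node
```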

Page 34:

DNNC – Deep Neural Network Compiler

− Compiler
  − specific to the version of the DPU core
    − dnnc-dpu1.4.0   => DPU core with low RAM usage
    − dnnc-dpu1.4.0.1 => DPU core with high RAM usage
  − converts the quantized network to micro-code for the DPU
  − optimization via fusion of layers
− Assembler
  − not invoked directly; called by the compiler

Page 35:

DNNC – Usage (TensorFlow)

− Input
  − quantized model (deploy_model.pb)
− Output
  − ELF file(s) for DPU kernel(s)
− Syntax

dnnc --parser=tensorflow
    --frozen_pb={path to quantized model}
    --dcf={DPU configuration file}
    --cpu_arch=arm32
    --mode=normal
    --net_name={network name}
    --output_dir={path to output directory}

Page 36:

Take Away – Compiling a network with DNNDK

− Pruning / AI Optimizer

− Available separately (license required)

− Significant compression with minimal loss in accuracy

− Increased FPS / Reduced power

− Quantization

− Available for free

− Requires a small calibration dataset (without labels)

− Quantization to 8 bits with minimal loss in accuracy


Page 37:

References / What Next?

The following tutorials provide additional examples based on TensorFlow:

− https://github.com/Xilinx/Edge-AI-Platform-Tutorials
− CIFAR10 Classification with TensorFlow (UG1338)
− Freezing a Keras model for use with DNNDK (UG1380)
− Deep Learning with custom GoogleNet and ResNet in Keras and Xilinx DNNDK TF 3.0 (UG1381)

Page 38:

Programming Model: The DPU API

Page 39:

Programming with DNNDK API Makefile


Page 40:

Programming with DNNDK API, DPU Setup


Page 41:

Programming with DNNDK API, DPU Task


Page 42:

Take Away – Compiling a network with DNNDK

− Pruning / AI Optimizer

− Available separately (license required)

− Significant compression with minimal loss in accuracy

− Increased FPS / Reduced power

− Quantization

− Available for free

− Requires a small calibration dataset (without labels)

− Quantization to 8 bits with minimal loss in accuracy


Page 43:

Questions?

Page 44:

Contact

Software and Services team contact:
[email protected]

Page 45:

Thank you!