Transcript of "Deep Learning Seminar on Xilinx SoCs: From Training to Deployment" (11/25/2019)

Page 1: Deep Learning Seminar on Xilinx SoCs From Training to deployment · Docker: running different DNNDK/TensorFlow versions not depending on the Host OS/Python/CUDA installation 11/25/2019


Deep Learning Seminar on Xilinx SoCs

From Training to deployment

Page 2:

Agenda

01 Xilinx Deep Learning Solutions
02 Keras / TensorFlow ResNet50 Training: Building a “Fruit Recognizer”
03 Integration of the Deep Learning Processing Unit in Vivado
04 Xilinx DNNDK: From a TensorFlow Net to the DPU Firmware
05 Programming Model: The DPU API
06 Questions and Answers

Page 3:

Xilinx Deep Learning Solutions

Page 4:

Xilinx Focuses on Inference


Page 5:

Xilinx AI Inference Solution

Deep Learning Applications: Cloud, On Premises, Edge

− Cloud: Virtex UltraScale+ VU9P (featuring the most powerful FPGA in the cloud)
− Edge: Zynq UltraScale+ MPSoC

Page 6:

Many Applications for Machine Learning

− Robotics
− IIoT Gateways & Edge Appliances
− Drives & Motor Control
− PLC/PAC/IPC
− I/O Modules & Smart Sensors
− Human Machine Interface
− Video Surveillance & Smart City
− Machine & Computer Vision
− Smart Grid
− 3D Printing & Additive Manufacturing

Page 7:

Xilinx AI Solution from Edge to Cloud

                   Edge                        Cloud
AI Platforms       ZCU102, ZCU104, Ultra96     Xilinx U200, U250, U280
FPGA IP            DPU                         xDNN
Software Stack     DNNDK Runtime               xfDNN Runtime
                   DNNDK Compiler              xfDNN Compiler
                   DNNDK Quantizer             xfDNN Quantizer
                   DNNDK Pruning
Models             20+ models, incl. LSTM

Page 8:

Xilinx Solution Stack for Edge AI

− Models: Public, Custom
− Framework
− Tools & IP
− Edge AI Platforms: ZCU102, ZCU104, Ultra96

Page 9:

Why Xilinx for Edge AI?

− Xilinx offers the optimal tradeoff for Edge AI
  − Latency
  − Power
  − Cost
  − Flexibility
  − Scalability
  − Time-to-market
− Xilinx pruning technology
  − Up to 50x optimization
  − Increased performance
  − Reduced power

Page 10:

Xilinx Edge AI – Value Proposition

Whole Application Acceleration


Page 11:

Keras / TensorFlow Training

Building a “Fruit Recognizer”

Page 12:

ResNet50

ResNet was the winner of the ImageNet Large Scale Visual Recognition Challenge (ILSVRC) in 2015.

Page 13:

Steps to your Xilinx ML application

1 Prepare your data
2 Train your network
3 Test your trained network on test data
4 Freeze your TensorFlow model
5 Create a Vivado project with the DPU and export the DPU configuration
6 Quantize the model
7 Compile the model with the DPU configuration
8 Link the compiled model against your C/C++/Python application
9 Deploy it on the PetaLinux system
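Step 4, freezing, converts the trained graph's variables into constants so the later DNNDK steps consume a single .pb file. A minimal sketch using the TF 1.x API on a toy graph (the node names here stand in for your model's real input/output nodes):

```python
import tensorflow.compat.v1 as tf

tf.disable_eager_execution()

# Toy stand-in for a trained model: one variable, named input/output nodes.
x = tf.placeholder(tf.float32, [None, 2], name="input_1")
w = tf.Variable([[1.0], [2.0]], name="weights")
y = tf.identity(tf.matmul(x, w), name="dense/BiasAdd")

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    # Replace every Variable node with a Const holding its trained value.
    frozen = tf.graph_util.convert_variables_to_constants(
        sess, sess.graph_def, output_node_names=["dense/BiasAdd"])
    with tf.gfile.GFile("frozen_graph.pb", "wb") as f:
        f.write(frozen.SerializeToString())
```

The resulting frozen_graph.pb contains only Const and compute nodes, which is the form decent_q expects as --input_frozen_graph.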

Page 14:

Decide what you want to classify

The ImageNet database: a free database with ~20,000 classes and at least 500 pictures per class. Current size: 166 GByte.

Page 15:

Docker: running different DNNDK/TensorFlow versions independent of the host OS/Python/CUDA installation

Page 16:

Copy over what you need for your training/validation/test folders using Python
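The copy script itself is not reproduced in this transcript; a stdlib-only sketch of the idea (the 80/10/10 split and the folder layout are illustrative assumptions, not from the slide):

```python
import random
import shutil
from pathlib import Path

def split_dataset(src, dst, ratios=(0.8, 0.1, 0.1), seed=42):
    """Copy each class folder from src into dst/{train,validation,test}."""
    rng = random.Random(seed)  # fixed seed -> reproducible split
    for class_dir in sorted(Path(src).iterdir()):
        if not class_dir.is_dir():
            continue
        images = sorted(class_dir.iterdir())
        rng.shuffle(images)
        n_train = int(len(images) * ratios[0])
        n_val = int(len(images) * ratios[1])
        splits = {
            "train": images[:n_train],
            "validation": images[n_train:n_train + n_val],
            "test": images[n_train + n_val:],
        }
        for split, files in splits.items():
            out = Path(dst) / split / class_dir.name
            out.mkdir(parents=True, exist_ok=True)
            for f in files:
                shutil.copy2(f, out / f.name)
```

Run against a source tree of one folder per class (as ImageNet extracts provide), it produces the directory structure Keras generators expect.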

Page 17:

Train your net using Keras

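The training code shown on this slide is not captured in the transcript; a hedged sketch of what such a Keras ResNet50 setup can look like (the class count, classification head, and hyperparameters are illustrative, not from the slide):

```python
from tensorflow.keras.applications import ResNet50
from tensorflow.keras import layers, models

NUM_CLASSES = 5  # e.g. five fruit classes; adjust to your dataset

# ResNet50 backbone without the 1000-class ImageNet head.
# weights="imagenet" would load pretrained weights; None keeps this offline.
base = ResNet50(weights=None, include_top=False, input_shape=(224, 224, 3))

model = models.Sequential([
    base,
    layers.GlobalAveragePooling2D(),
    layers.Dense(NUM_CLASSES, activation="softmax"),
])
model.compile(optimizer="adam",
              loss="categorical_crossentropy",
              metrics=["accuracy"])

# Typical training call, using the folders produced by the split step:
# train_ds = tf.keras.utils.image_dataset_from_directory(
#     "dataset/train", image_size=(224, 224), label_mode="categorical")
# model.fit(train_ds, epochs=10)
```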

Page 18:

Test your TF model with images NOT part of the training and validation folders

Page 19:

Integration of the Deep Learning

Processing Unit in Vivado

Page 20:

DPU IP with High Efficiency

[Block diagram] The DPU attaches to the CPU, memory controller, and external memory over the bus. Main blocks: an instruction fetcher, decoder, register map, and control signals; a data mover with read/write schedulers for images, weights, and write-back, built around a smart memory fabric; a PE array (dispatcher plus processing elements); and a miscellaneous calculation unit covering average pool, max pool, ROI pool, elementwise operations, and more.

Page 21:

DPU – Supported Operations

− Operations supported by the DPU core(s):
  • Conv
  • Dilation
  • Pooling (Max, Average)
  • ReLU / Leaky ReLU / ReLU6
  • Fully Connected (FC)
  • Batch Normalization
  • Concat
  • Elementwise
  • Deconv
  • Depthwise conv
  • Mean scale
  • Upsampling
  • Split
  • Reorg
− Operations supported by additional cores:
  • Softmax
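Operations the DPU core does not implement, such as Softmax, are handled by an additional core or run on the ARM cores instead. For reference, a numerically stable softmax:

```python
import numpy as np

def softmax(logits):
    """Stable softmax: subtracting the max avoids overflow in exp()."""
    z = logits - np.max(logits, axis=-1, keepdims=True)
    e = np.exp(z)
    return e / np.sum(e, axis=-1, keepdims=True)

probs = softmax(np.array([2.0, 1.0, 0.1]))
```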

Page 22:

DPU – Interfaces and Parallelism

DPU B4096: slave-axi and Master-axi-0 (32 bits each), Master-axi-1 and Master-axi-2 (128 bits each)
DPU B1152: slave-axi and Master-axi-0 (32 bits each), Master-axi-1 and Master-axi-2 (64 bits each)

− 3-level parallelism is exploited: pixel * input channel * output channel
− Small core – B1152
  − Parallelism: 4*12*12
  − Target: Z7020/ZU2/ZU3
− Big core – B4096
  − Parallelism: 8*16*16
  − Target: ZU5 and above
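The B-numbers encode peak operations per cycle: counting each multiply-accumulate as two operations (multiply plus add), the three parallelism factors multiply out to the core name:

```python
def peak_ops_per_cycle(pixel_par, in_ch_par, out_ch_par):
    """Pixel * input-channel * output-channel MACs, 2 ops per MAC."""
    return pixel_par * in_ch_par * out_ch_par * 2

b1152 = peak_ops_per_cycle(4, 12, 12)   # small core: 4*12*12 MACs
b4096 = peak_ops_per_cycle(8, 16, 16)   # big core:   8*16*16 MACs
```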

Page 23:

Xilinx Vivado Integration


Page 24:

Xilinx DNNDK: From a TensorFlow

net to the DPU Firmware

Page 25:

DPU Runtime Engine

− Distributed with the DPU TRD
  − DPU IP Product Guide (PG338)
  − Yocto recipes
  − Can be used in a PetaLinux project
− Runtime N2Cube (Cube of Neural Network)
  − DPU Linux kernel driver
  − DPU run-time libraries
  − DPU utilities

Page 26:

DPU – Linux Kernel Driver

− Distributed as a DPU Yocto recipe
− Files
  − dpudef.h
  − dpucore.c/.h
  − dpuext.c/.h
− Source code
  − today, distributed with the DPU TRD
  − will be pushed to a GitHub repository when mature

Page 27:

DPU – run-time libraries

− Distributed as a DNNDK Yocto recipe
− DPU run-time library (libn2cube.so)
  − DPU loader
  − DPU task scheduling
  − DPU task monitoring
  − DPU task profiling
− DPU utility library (libdputils.so)
  − Utility functions to load images into the DPU

Page 28:

Flow

[Flow diagram] 1) Vivado: a design containing the DPU produces the .hdf hardware description and the device tree. 2) PetaLinux: builds BOOT.BIN, image.ub, and a sysroot containing dpu.ko, n2cube, and dputils. 3) DNNDK: a Caffe or TensorFlow model is quantized with decent and compiled with dnnc into dpu_{model}.elf. 4) XSDK: main.cc is compiled against the sysroot and linked with dpu_{model}.elf into application.elf. 5) BOOT.BIN, image.ub, and application.elf are deployed to the SD card.

Page 29:

DNNDK – Deep Neural Network Development Kit

− Distributed with the DNNDK User Guide (UG1327)
− Composed of:
  − Model compression (DECENT)
    − Pruning
    − Quantization
  − Model compilation (DNNC)
    − Compiler
    − Assembler

Page 30:

DECENT – DEEp CompressioN Tool

− Pruning
  − available separately (Xilinx AI Optimizer)
  − requires a license
− Quantization
  − available in the free version of the tools

Page 31:

Quantization – Flow

− Preprocess
  − folds batchnorm layers
  − removes useless nodes
− Quantize
  − weights / biases
  − activations
− Calibrate
  − using a calibration dataset (without labels)
− Generate
  − deployable DPU model

[Flow diagram] DECENT_Q takes the floating-point model and the calibration dataset (without labels), preprocesses, quantizes & calibrates, and generates the fixed-point / deploy DPU model.
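The batchnorm folding in the preprocess step can be checked numerically: for y = γ·(Wx + b − μ)/√(σ² + ε) + β, the per-channel scale is absorbed into the preceding layer's weights and bias. A numpy sketch for a dense (or 1×1 conv) layer:

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.standard_normal((4, 3))   # out_features x in_features
b = rng.standard_normal(4)
gamma, beta = rng.standard_normal(4), rng.standard_normal(4)
mean, var, eps = rng.standard_normal(4), rng.random(4) + 0.5, 1e-5

# Fold BN into the preceding layer's weights and bias.
scale = gamma / np.sqrt(var + eps)          # one factor per output channel
W_fold = W * scale[:, None]
b_fold = (b - mean) * scale + beta

# Both paths produce the same output, so BN disappears at inference time.
x = rng.standard_normal(3)
y_ref = gamma * ((W @ x + b) - mean) / np.sqrt(var + eps) + beta
y_fold = W_fold @ x + b_fold
```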

Page 32:

Quantization – Results

Network               Float32 baseline    8-bit quantization
                      Top1     Top5       Top1     ΔTop1    Top5     ΔTop5
Inception_v1          66.90%   87.68%     66.62%   -0.28%   87.58%   -0.10%
Inception_v2          72.78%   91.04%     72.40%   -0.38%   90.82%   -0.23%
Inception_v3          77.01%   93.29%     76.56%   -0.45%   93.00%   -0.29%
Inception_v4          79.74%   94.80%     79.42%   -0.32%   94.64%   -0.16%
ResNet-50             74.76%   92.09%     74.59%   -0.17%   91.95%   -0.14%
VGG16                 70.97%   89.85%     70.77%   -0.20%   89.76%   -0.09%
Inception-ResNet-v2   79.95%   95.13%     79.45%   -0.51%   94.97%   -0.16%

− Uniform quantization
− 8-bit for both weights and activations
− A small set of images for calibration

Page 33:

Quantization – Usage (TensorFlow)

− Input
  − floating-point frozen graph (frozen_graph.pb)
  − calibration dataset
  − Python pre-processing function
− Output
  − quantized model for deployment (deploy_model.pb)
  − quantized model for evaluation (quantize_eval_model.pb)
− Syntax

decent_q quantize
    --input_frozen_graph frozen_graph.pb
    --input_nodes {input node}
    --input_shapes ?,28,28,1
    --output_nodes {output node}
    --input_fn {python script}
    --method {0=non-overflow, 1=min-diffs}
    --gpu 0
    --calib_iter 200
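The --input_fn argument names a Python function that decent_q calls once per calibration iteration; it returns a dict mapping graph input node names to a batch of pre-processed data. A sketch with synthetic data standing in for real calibration images (the node name, batch size, and shape follow the ?,28,28,1 example above and are illustrative):

```python
import numpy as np

CALIB_BATCH_SIZE = 32

def calib_input(iteration):
    """Called by decent_q once per --calib_iter iteration.

    A real script would load and pre-process a batch of calibration
    images here; synthetic data keeps the sketch self-contained.
    """
    batch = np.random.rand(CALIB_BATCH_SIZE, 28, 28, 1).astype(np.float32)
    return {"input_node": batch}  # key = name of the graph's input node
```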

Page 34:

DNNC – Deep Neural Network Compiler

− Compiler
  − specific to the version of the DPU core
    − dnnc-dpu1.4.0   => DPU core with low RAM usage
    − dnnc-dpu1.4.0.1 => DPU core with high RAM usage
  − converts the quantized network to micro-code for the DPU
  − optimization via fusion of layers
− Assembler
  − not invoked directly; called by the compiler

Page 35:

DNNC – Usage (TensorFlow)

− Input
  − quantized model (deploy_model.pb)
− Output
  − ELF file(s) for DPU kernel(s)
− Syntax

dnnc --parser=tensorflow
    --frozen_pb={path to quantized model}
    --dcf={DPU configuration file}
    --cpu_arch=arm32
    --mode=normal
    --net_name={network name}
    --output_dir={path to output directory}

Page 36:

Take Away – Compiling a network with DNNDK

− Pruning / AI Optimizer

− Available separately (license required)

− Significant compression with minimal loss in accuracy

− Increased FPS / Reduced power

− Quantization

− Available for free

− Requires a small calibration dataset (without labels)

− Quantization to 8 bits with minimal loss in accuracy


Page 37:

References / What Next?

The following tutorials provide additional examples based on TensorFlow:

− https://github.com/Xilinx/Edge-AI-Platform-Tutorials
− CIFAR10 Classification with TensorFlow (UG1338)
− Freezing a Keras model for use with DNNDK (UG1380)
− Deep Learning with custom GoogleNet and ResNet in Keras and Xilinx DNNDK TF 3.0 (UG1381)

Page 38:

Programming Model: The DPU API

Page 39:

Programming with DNNDK API Makefile


Page 40:

Programming with DNNDK API, DPU Setup


Page 41:

Programming with DNNDK API, DPU Task


Page 42:

Take Away – Compiling a network with DNNDK

− Pruning / AI Optimizer

− Available separately (license required)

− Significant compression with minimal loss in accuracy

− Increased FPS / Reduced power

− Quantization

− Available for free

− Requires a small calibration dataset (without labels)

− Quantization to 8 bits with minimal loss in accuracy


Page 43:

Questions?

Page 44:

Contact

Software and Services team contact:
[email protected]

Page 45:

Thank you!