Endpoint AI and the Advent of the microNPU

© 2021 Arm

Tomas EdsöDistinguished Engineer

Arm Machine Learning Group

Endpoint AI and the Advent of the microNPU

MLSys Conference

April 5th 2021

2 © 2021 Arm

Endpoint AI and tinyML

Vision

Images and video

Object detection, face unlock, object classification

etc.

Voice and sound

Recognition and creation

Keyword spotting, speech recognition, natural

language processing, speech synthesis, sound recognition, etc.

Vibration and motion

Any ‘signal’

Predictive maintenance, sensor fusion, accelerometer, pressure, lidar/radar, speed, shock, vibration, pollution,

density, viscosity, etc.

On-device machine learning applications in the single mW and below

3 © 2021 Arm

Pushing the Boundaries for Real-time On-device Processing

Relative ML and signal processing performance

Cortex-M today

Rel

ativ

e co

ntr

ol c

od

e p

erfo

rman

ce

Cortex-M35PCortex-M33

Cortex-M7

Cortex-M3

Cortex-M23Cortex-M0+Cortex-M0Cortex-M1

>52 billion

Cortex-M based chips shipped**

Graph not to scale

Cortex-M4

**Based on Arm data

New Cortex-M CPU enabled by Helium

Cortex-M55

Arm microNPUs

Signal conditioning and ML foundation

ML performance and efficiency

Cortex-M55 + Ethos-U55 / Ethos-U65(multiple performance points available)

Well suited for ML & DSP applications

© 2021 Arm

Ethos-U55

The first microNPU for M-class CPUs

5 © 2021 Arm

Ethos-U55 overview

• Works alongsideCortex-M55, Cortex-M7, Cortex-M33 and Cortex-M4 processors

• Works alongside on-chip SRAM and system flash

• Accelerates CNN and RNN operators.

• Efficient weight compression

• 8- or 16-bit activationsWeights are always 8-bit

• 32, 64, 128 or 256 MAC/cc configurations

Cortex-M CPU

Ethos-U55

Mac Engine

Weight decoder

DMA

Shared Buffer

Central Control

Elementwise engine

SystemFlash

SystemSRAM

6 © 2021 Arm

Typical Ethos-U55 data flow

1. Input activations are stored in system SRAM.

2. The host starts Ethos-U55 by defining all memory regions to be used, such as the location of the command stream and input activations.

3. Ethos-U55 autonomously runs all commands, using SRAM as a scratch buffer. The final outputs are written to a defined SRAM buffer.

4. Interrupt on completion of writing the final outputs.

An offline compiled command stream with corresponding compressed weights are put into system Flash.

0.

Cortex-M CPU

Ethos-U55

Mac Engine

Weight decoder

DMA

Shared Buffer

Central Control

Elementwise engine

SystemFlash

SystemSRAM

2

4

3

10

Command steam and compressed

weightsActivations

7 © 2021 Arm

Operations supported by Ethos-U

• Conv2D and DepthwiseConv2D• Up to 64x64 kernel• Up to 3x3 stride• Optimized kernel dilation, up to 2x2

• DeConv• Zero, nearest neighbor and bilinear, up to 2x2

• MaxPool and AvgPool• Up to 64x64 kernel with SAME padding• Up to 256x256 kernel with VALID padding• Up to 3x3 stride

• FullyConnected• Support for batching

• LSTM, GRU• Broken down into supported operations

• ADD, SUB, MUL, MIN, MAX• All types of broadcast allowed

• ReLU, ReLU1, ReLU6, tanh, sigmoid• All can be fused with other operations

• Configurable LUT• Can map any unary function

• PRELU, ABS

• ConCat

• Reshape

• ExpandDims

• Squeeze

• SoftMax

8 © 2021 Arm

Mapping of NNs to Hardware using TensorFlow Lite

Host (offline) Target/device

Ten

sorF

low

Lit

e M

icro

ru

nti

me

Reference kernels

CMSIS-NN optimized kernels

microNPUdriver

microNPU

Cortex-MArmv6-M Armv7-M

Armv8-M Armv8.1-M (Helium)

Cortex-M

Offline Optimization

Ten

sorF

low

Lit

e ru

nti

meTooling

TensorFlow Framework

TensorFlow.tflite File

Optimized custom operators for the microNPU

9 © 2021 Arm

The Vela Offline Compiler

• Open source

• Reads a tflite file and identifies subgraphs

• Optimizes scheduling of subgraphs

• Loss-less compression of weights

• Execution and memory planning

• Generates commands for microNPU

• Writes out a modified tflite file

conv

pool

FC

other

other

microNPUcustom op

other

microNPU compiler

other

Up to 90% SRAM size reduction

Up to 70% model size reduction

Enabling networks not before feasible in embedded systems

10 © 2021 Arm

Vela Internal Components

• Graph Optimiser• Traverses the graph performing high level checks and

optimisations (e.g. LeakyReLU to Mul/Max to LUT)

• Pass Packing• Groups network operators that can be executed as a single

hardware operation

• Pass Scheduler• Finds the optimal way for the hardware to process the Pass• Cost model scores strategies using performance estimates• Weight Encoding

• Performance Estimation• Generates Memory usage, Bandwidth, and Processing Cycles

VelaModel Reader

Graph Optimiser

Pass Packing

Pass Scheduler

Tensor Allocation

Performance Estimation

Command Stream Gen.

Model WriterStats Writer

11 © 2021 Arm

Configurable LUT

• Fusing of sequences of unary functions• Single LUT fused with the layer producing it

12 © 2021 Arm

Network support in Ethos-U55

• Ethos-U55 can completely execute networks that map to the supported operator set • For example:

– Deepspeech– RNNoise– Wav2letter– DSCNN– MobileNet_v1– MobileNet_v2– MobileNet_v3– ResNet– Inception_v3

• Any unsupported operation fallback to the Cortex-M processor• These can be accelerated through CMSIS-NN library

Application TFL micro Driver Ethos-U55

Application TFL micro

Driver Ethos-U55

CMSIS-NN Cortex-M

© 2021 Arm

Ethos-U65

A microNPU for a new class of AI devices

14 © 2021 Arm

Ethos-U65 overview

• 256 or 512 unit multiply-accumulate (MAC) engine

• DMA update for DRAM as well as flash support

• Can be an M-class subsystem inside an A-class system

System SRAM

Control unit DMA

Configurable MAC engine

Elementwise engine

Local memory

Weight decode

Cortex-M processor

System DRAM

TrustZone for Armv8-M

Ethos-U65 microNPU

15 © 2021 Arm

Different systems applicable to Ethos-U65

Ethos-U65

AXI bus

Cortex-A+ cache, MMU

Cortex-M+ cache,

MPU

DRAM

Address filter

AXI bus

SRAM

Cortex-M CPU

SystemSRAM

SystemFlash

Cortex A + Cortex M system Cortex M only system

Ethos-U65

Mac Engine

Weight decoder

DMAShared Buffer

Central Control

Elementwise

17 © 2021 Arm

Neural Network performance Across ARM IPsWav2Letter

Performance Gain

CM4 (Ref) CM4 (CMSIS)

Efficient Software (CMSIS-NN)

Performance Gain

CM4 (Ref) CM4 (CMSIS) CM55 (CMSIS)

11x

Performance Gain

CM4 (Ref) CM4 (CMSIS)

CM55 (CMSIS) CM55+U55

25x

AI Capable Cortex-M55 AI Dedicated U55 256 MAC/cycle

3.5x

Total gain of almost 1000x

18 © 2021 Arm

Full example: Typical ML Workload for a Voice Assistant

Latency and energy spent for all tasks listed combined: voice activity detection, noise cancellation, two-mic beamforming, echo cancellation, equalizing, mixing, keyword spotting, OPUS decode, and automatic speech recognition.

✓ Faster responses

✓ Smaller form-factors

✓ Improved accuracy

Speed to inference Energy efficiency

50x

Cortex-M7 Cortex-M55+ Ethos-U55

7x6x

25x

Cortex-M55Cortex-M7 Cortex-M55+ Ethos-U55

Cortex-M55

19 © 2021 Arm

Broadest Range of ML-optimized Processing Solutions

Cortex-M today

Cortex-M55

Cortex-M and Ethos-U55

Cortex-A, Mali and Ethos-N or Ethos-U65

TOP/s

Data throughput

Sensor fusion

Keyword detection

Speech recognition

Object classification

Anomaly detection

Real-time recognition

Biometric awareness

Object detection

Gesture detection

Vibration detection

The Arm trademarks featured in this presentation are registered trademarks or trademarks of Arm Limited (or its subsidiaries) in

the US and/or elsewhere. All rights reserved. All other marks featured may be trademarks of their respective owners.

www.arm.com/company/policies/trademarks

© 2021 Arm

Endpoint AI and the Advent of the microNPU

Documents

Transcript of Endpoint AI and the Advent of the microNPU