Endpoint AI and the Advent of the microNPU
© 2021 Arm
Tomas Edsö
Distinguished Engineer
Arm Machine Learning Group
MLSys Conference
April 5th 2021
Endpoint AI and tinyML
Vision
Images and video
Object detection, face unlock, object classification, etc.
Voice and sound
Recognition and creation
Keyword spotting, speech recognition, natural language processing, speech synthesis, sound recognition, etc.
Vibration and motion
Any ‘signal’
Predictive maintenance, sensor fusion, accelerometer, pressure, lidar/radar, speed, shock, vibration, pollution, density, viscosity, etc.
On-device machine learning applications in the single mW and below
Pushing the Boundaries for Real-time On-device Processing
[Figure: relative control code performance (vertical axis) versus relative ML and signal processing performance (horizontal axis). Graph not to scale.]
Cortex-M today: Cortex-M0, Cortex-M0+, Cortex-M1, Cortex-M3, Cortex-M4, Cortex-M7, Cortex-M23, Cortex-M33, Cortex-M35P
>52 billion Cortex-M based chips shipped (based on Arm data)
New Cortex-M CPU enabled by Helium
Cortex-M55
Arm microNPUs
Signal conditioning and ML foundation
ML performance and efficiency
Cortex-M55 + Ethos-U55 / Ethos-U65 (multiple performance points available)
Well suited for ML & DSP applications
Ethos-U55
The first microNPU for M-class CPUs
Ethos-U55 overview
• Works alongside Cortex-M55, Cortex-M7, Cortex-M33 and Cortex-M4 processors
• Works alongside on-chip SRAM and system flash
• Accelerates CNN and RNN operators
• Efficient weight compression
• 8- or 16-bit activations; weights are always 8-bit
• 32, 64, 128 or 256 MAC/cycle configurations
[Block diagram: Cortex-M CPU, Ethos-U55 (central control, DMA, weight decoder, MAC engine, shared buffer, elementwise engine), system flash, system SRAM]
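To give a feel for the MAC configurations listed above, here is a back-of-the-envelope sketch that estimates the ideal cycle count of one convolution layer at each configuration. The layer shape is a made-up example, and the estimate assumes perfect MAC utilisation with no memory stalls, which real hardware will not reach.

```python
import math

# MACs per clock cycle, as listed on the slide.
MAC_CONFIGS = (32, 64, 128, 256)


def conv2d_macs(out_h, out_w, out_c, k_h, k_w, in_c):
    """Multiply-accumulate count of a standard Conv2D layer."""
    return out_h * out_w * out_c * k_h * k_w * in_c


def ideal_cycles(macs, macs_per_cycle):
    """Lower bound on cycles. Assumption: every MAC unit is busy every cycle."""
    return math.ceil(macs / macs_per_cycle)


# Illustrative layer only: 32x32x16 output, 3x3 kernel, 8 input channels.
layer_macs = conv2d_macs(32, 32, 16, 3, 3, 8)
for config in MAC_CONFIGS:
    print(f"{config:>3} MAC/cycle: {ideal_cycles(layer_macs, config)} cycles")
```

The numbers are only a lower bound; they are useful for comparing configurations against each other, not for predicting measured latency.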
Typical Ethos-U55 data flow
0. An offline-compiled command stream with its corresponding compressed weights is put into system flash.
1. Input activations are stored in system SRAM.
2. The host starts Ethos-U55 by defining all memory regions to be used, such as the location of the command stream and input activations.
3. Ethos-U55 autonomously runs all commands, using SRAM as a scratch buffer. The final outputs are written to a defined SRAM buffer.
4. Ethos-U55 raises an interrupt once the final outputs have been written.
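The numbered steps can be mimicked with a toy model. This is only a sketch of the control flow: `ToyNpu`, its `start` method, and the region names are hypothetical and do not correspond to the real Ethos-U driver API, and the "commands" are plain Python functions standing in for the compiled command stream.

```python
from dataclasses import dataclass


@dataclass
class ToyNpu:
    """Hypothetical stand-in for the microNPU (not the real driver API)."""
    flash: dict
    sram: dict
    irq_raised: bool = False

    def start(self, regions):
        # Step 3: run every command autonomously, using SRAM as scratch.
        data = self.sram[regions["input"]]
        for command in self.flash[regions["command_stream"]]:
            data = command(data)
            self.sram[regions["scratch"]] = data
        self.sram[regions["output"]] = data
        # Step 4: signal completion once the final outputs are written.
        self.irq_raised = True


# Step 0: the offline-compiled command stream lives in flash (two toy commands).
flash = {"cmd": [lambda xs: [x * 2 for x in xs],
                 lambda xs: [max(x, 0) for x in xs]]}
# Step 1: input activations are placed in SRAM.
sram = {"in": [-1, 2, 3]}

npu = ToyNpu(flash, sram)
# Step 2: the host defines the memory regions and starts the NPU.
npu.start({"command_stream": "cmd", "input": "in",
           "scratch": "scratch", "output": "out"})
print(sram["out"], npu.irq_raised)
```

The point of the sketch is that the host is involved only in steps 1, 2 and 4; everything in between runs without CPU intervention.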
[Diagram: the Ethos-U55 block diagram annotated with steps 0 to 4. System flash holds the command stream and compressed weights; system SRAM holds the activations.]
Operations supported by Ethos-U
• Conv2D and DepthwiseConv2D
  – Up to 64x64 kernel
  – Up to 3x3 stride
  – Optimized kernel dilation, up to 2x2
• DeConv
  – Zero, nearest neighbor and bilinear, up to 2x2
• MaxPool and AvgPool
  – Up to 64x64 kernel with SAME padding
  – Up to 256x256 kernel with VALID padding
  – Up to 3x3 stride
• FullyConnected
  – Support for batching
• LSTM, GRU
  – Broken down into supported operations
• ADD, SUB, MUL, MIN, MAX
  – All types of broadcast allowed
• ReLU, ReLU1, ReLU6, tanh, sigmoid
  – All can be fused with other operations
• Configurable LUT
  – Can map any unary function
• PRELU, ABS
• ConCat
• Reshape
• ExpandDims
• Squeeze
• SoftMax
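The kernel, stride and dilation limits above lend themselves to a simple support check. The sketch below encodes only the numbers on this slide; the function names are hypothetical, the dilation figure is treated as a hard limit for simplicity, and a real compiler would check many more constraints (data types, tensor shapes and so on).

```python
def conv_supported(kernel, stride=(1, 1), dilation=(1, 1)):
    """Conv2D / DepthwiseConv2D limits from the slide:
    kernel up to 64x64, stride up to 3x3, dilation up to 2x2."""
    return (all(1 <= k <= 64 for k in kernel)
            and all(1 <= s <= 3 for s in stride)
            and all(1 <= d <= 2 for d in dilation))


def pool_supported(kernel, stride=(1, 1), padding="SAME"):
    """MaxPool / AvgPool limits from the slide:
    64x64 kernel with SAME padding, 256x256 with VALID, stride up to 3x3."""
    limit = {"SAME": 64, "VALID": 256}.get(padding)
    if limit is None:
        return False  # unknown padding mode
    return (all(1 <= k <= limit for k in kernel)
            and all(1 <= s <= 3 for s in stride))


print(conv_supported((3, 3), stride=(2, 2)))             # within limits
print(conv_supported((3, 3), stride=(4, 4)))             # stride too large
print(pool_supported((128, 128), (1, 1), "VALID"))       # allowed with VALID
print(pool_supported((128, 128), (1, 1), "SAME"))        # too large for SAME
```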
Mapping of NNs to Hardware using TensorFlow Lite
[Diagram: flow from the host (offline) to the target device.]
Host (offline) labels: TensorFlow Framework, TensorFlow Lite runtime, tooling, TensorFlow .tflite file, offline optimization
Target/device labels: TensorFlow Lite Micro runtime with reference kernels, CMSIS-NN optimized kernels, and the microNPU driver
Hardware labels: Cortex-M (Armv6-M, Armv7-M, Armv8-M, Armv8.1-M (Helium)) and microNPU
Optimized custom operators for the microNPU
The Vela Offline Compiler
• Open source
• Reads a tflite file and identifies subgraphs
• Optimizes scheduling of subgraphs
• Lossless compression of weights
• Execution and memory planning
• Generates commands for microNPU
• Writes out a modified tflite file
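[Diagram: the microNPU compiler takes a graph of conv, pool, FC and other operators and produces a graph in which a microNPU custom op appears alongside the remaining other operators.]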
Up to 90% SRAM size reduction
Up to 70% model size reduction
Enabling networks not previously feasible in embedded systems
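The subgraph step can be illustrated with a toy partitioner. This is not Vela's actual algorithm: it assumes the network is a simple linear list of operator names (real graphs are DAGs), and the `SUPPORTED` set is an illustrative subset chosen to match the diagram labels.

```python
from itertools import groupby

# Illustrative subset of operators the microNPU can run (assumption).
SUPPORTED = {"conv", "pool", "FC"}


def partition(ops):
    """Collapse each maximal run of supported ops into one custom op.

    Returns a list of (node_name, original_ops) pairs.
    """
    result = []
    for is_supported, run in groupby(ops, key=lambda op: op in SUPPORTED):
        run = list(run)
        if is_supported:
            result.append(("microNPU custom op", run))
        else:
            # Unsupported operators are left untouched.
            result.extend((op, [op]) for op in run)
    return result


network = ["other", "conv", "pool", "FC", "other"]
for name, original in partition(network):
    print(name, "<-", original)
```

In this toy version the three supported operators end up inside one custom op, while the two `other` operators stay where they were.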
Vela Internal Components
• Graph Optimiser
  – Traverses the graph performing high-level checks and optimisations (e.g. LeakyReLU to Mul/Max to LUT)
• Pass Packing
  – Groups network operators that can be executed as a single hardware operation
• Pass Scheduler
  – Finds the optimal way for the hardware to process the Pass
  – Cost model scores strategies using performance estimates
  – Weight Encoding
• Performance Estimation
  – Generates memory usage, bandwidth, and processing cycles
[Diagram: Vela components: Model Reader, Graph Optimiser, Pass Packing, Pass Scheduler, Tensor Allocation, Performance Estimation, Command Stream Gen., Model Writer, Stats Writer]
Configurable LUT
• Fusing of sequences of unary functions
  – Single LUT fused with the layer producing it
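Fusing a sequence of unary functions into one table can be sketched as follows. This is a simplified illustration: it treats 8-bit activations as plain integers in [-128, 127] and ignores quantization scale and zero point, which a real implementation would have to account for. The function names are hypothetical.

```python
def build_lut(functions, lo=-128, hi=127):
    """Compose unary functions into one table indexed by (value - lo)."""
    table = []
    for x in range(lo, hi + 1):
        y = x
        for fn in functions:
            y = fn(y)
        # Clamp back into the 8-bit range (assumption: saturating output).
        table.append(max(lo, min(hi, int(round(y)))))
    return table


def apply_lut(table, values, lo=-128):
    """One table lookup per element, regardless of how many functions were fused."""
    return [table[v - lo] for v in values]


# Two unary steps fused into a single 256-entry table: abs, then clamp at 6.
lut = build_lut([abs, lambda v: min(v, 6)])
print(len(lut))
print(apply_lut(lut, [-10, -3, 4, 100]))
```

However long the chain of unary functions, the runtime cost stays at one lookup per activation, which is why the table can be fused with the layer that produces its input.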
Network support in Ethos-U55
• Ethos-U55 can completely execute networks that map to the supported operator set
  • For example:
    – Deepspeech
    – RNNoise
    – Wav2letter
    – DSCNN
    – MobileNet_v1
    – MobileNet_v2
    – MobileNet_v3
    – ResNet
    – Inception_v3
• Any unsupported operation falls back to the Cortex-M processor
  • These can be accelerated through the CMSIS-NN library
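[Diagram: two software stacks. Fully supported network: Application, TFL Micro, Driver, Ethos-U55. Network with unsupported operators: Application, TFL Micro, then Driver and Ethos-U55 alongside CMSIS-NN and Cortex-M.]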
Ethos-U65
A microNPU for a new class of AI devices
Ethos-U65 overview
• 256 or 512 unit multiply-accumulate (MAC) engine
• DMA updated to support DRAM as well as flash
• Can be an M-class subsystem inside an A-class system
[Block diagram: Cortex-M processor, TrustZone for Armv8-M, system SRAM, system DRAM, and the Ethos-U65 microNPU (control unit, DMA, weight decode, configurable MAC engine, elementwise engine, local memory)]
Different systems applicable to Ethos-U65
[Diagram: two example systems.]
Cortex-A + Cortex-M system: Cortex-A (+ cache, MMU), Cortex-M (+ cache, MPU), Ethos-U65, address filter, AXI bus, SRAM, DRAM
Cortex-M only system: Cortex-M CPU, Ethos-U65 (central control, DMA, weight decoder, MAC engine, shared buffer, elementwise engine), system SRAM, system flash
Ethos-U performance
Enabling AI at the endpoint
Neural network performance across Arm IPs: Wav2Letter
[Bar chart: performance gain for CM4 (Ref), CM4 (CMSIS), CM55 (CMSIS), and CM55+U55.]
Stages shown: Efficient Software (CMSIS-NN); AI Capable Cortex-M55; AI Dedicated U55 256 MAC/cycle
Gains shown: 11x, 25x, 3.5x
Total gain of almost 1000x
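The "almost 1000x" figure is consistent with the three stage gains on the slide compounding multiplicatively. The short check below assumes exactly that; the slide itself does not spell out how the total was derived.

```python
import math

# Stage gains as printed on the slide.
stage_gains = [11, 25, 3.5]

# Assumption: each stage's speed-up multiplies the previous one.
total_gain = math.prod(stage_gains)
print(f"Total gain: {total_gain:g}x")
```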
Full example: Typical ML Workload for a Voice Assistant
Latency and energy spent for all tasks listed combined: voice activity detection, noise cancellation, two-mic beamforming, echo cancellation, equalizing, mixing, keyword spotting, OPUS decode, and automatic speech recognition.
✓ Faster responses
✓ Smaller form-factors
✓ Improved accuracy
[Two bar charts: speed to inference and energy efficiency, each comparing Cortex-M7, Cortex-M55, and Cortex-M55 + Ethos-U55. Values shown: 50x, 25x, 7x, 6x.]
Broadest Range of ML-optimized Processing Solutions
Cortex-M today
Cortex-M55
Cortex-M and Ethos-U55
Cortex-A, Mali and Ethos-N or Ethos-U65
TOP/s
Data throughput
Sensor fusion
Keyword detection
Speech recognition
Object classification
Anomaly detection
Real-time recognition
Biometric awareness
Object detection
Gesture detection
Vibration detection
Thank You, Danke, Gracias, 谢谢, ありがとう, Asante, Merci, 감사합니다, धन्यवाद, Kiitos, شكرًا, ধন্যবাদ, תודה
The Arm trademarks featured in this presentation are registered trademarks or trademarks of Arm Limited (or its subsidiaries) in
the US and/or elsewhere. All rights reserved. All other marks featured may be trademarks of their respective owners.
www.arm.com/company/policies/trademarks