Transcript of Felix Johnny and Fredrik Knutsson - Arm - tinyML
“CMSIS-NN & Optimizations for Edge AI”
Felix Johnny and Fredrik Knutsson - Arm
Sweden Area Group – February 8, 2021
tinyML Talks Sponsors
Additional Sponsorships available – contact [email protected] for info
tinyML Strategic Partner
© 2020 Arm Limited (or its affiliates)
Arm: The Software and Hardware Foundation for tinyML

Optimized models for embedded:
• Application
• Runtime (e.g. TensorFlow Lite Micro)
• Optimized low-level NN libraries (i.e. CMSIS-NN)
• Arm Cortex-M CPUs and microNPUs

1. Connect to high-level frameworks
2. Supported by end-to-end tooling: profiling and debugging tooling such as Arm Keil MDK, and RTOS such as Mbed OS
3. Connect to Runtime

AI Ecosystem Partners
Resources: developer.arm.com/solutions/machine-learning-on-arm
Stay Connected
@ArmSoftwareDevelopers
@ArmSoftwareDev
Confidential Presentation ©2020 Deeplite, All Rights Reserved
BECOME BETA USER bit.ly/testdeeplite
WE USE AI TO MAKE OTHER AI FASTER, SMALLER AND MORE POWER EFFICIENT
Automatically compress SOTA models like MobileNet to <200KB with
little to no drop in accuracy for inference on resource-limited MCUs
Reduce model optimization trial & error from weeks to days using
Deeplite's design space exploration
Deploy more models to your device without sacrificing performance or
battery life with our easy-to-use software
Copyright © EdgeImpulse Inc.
TinyML for all developers
Get your free account at http://edgeimpulse.com

Pipeline (Dataset → Impulse → Test → Edge Device):
• Acquire valuable training data securely
• Enrich data and train ML algorithms
• Test impulse with real-time device data flows
• Embedded and edge compute deployment options

Real sensors in real time. Open source SDK.
Maxim Integrated: Enabling Edge Intelligence (www.maximintegrated.com/ai)
Sensors and Signal Conditioning
Health sensors measure PPG and ECG signals critical to understanding vital signs. Signal chain products enable measuring even the most sensitive signals.
Low Power Cortex-M4 Micros
The biggest (3MB flash and 1MB SRAM) and the smallest (256KB flash and 96KB SRAM) Cortex-M4 microcontrollers enable algorithms and neural networks to run at wearable power levels.
Advanced AI Acceleration
The new MAX78000 implements AI inferences at over 100x lower energy than other embedded options. Now the edge can see and hear like never before.
QEEXO AUTOML: END-TO-END MACHINE LEARNING PLATFORM
Qeexo AutoML for Embedded AI: Automated Machine Learning Platform that builds tinyML solutions for the Edge using sensor data

Key Features
▪ Wide range of ML methods: GBM, XGBoost, Random Forest, Logistic Regression, Decision Tree, SVM, CNN, RNN, CRNN, ANN, Local Outlier Factor, and Isolation Forest
▪ Easy-to-use interface for labeling, recording, validating, and visualizing time-series sensor data
▪ On-device inference optimized for low latency, low power consumption, and a small memory footprint
▪ Supports Arm® Cortex™-M0 to M4 class MCUs
▪ Automates complex and labor-intensive processes of a typical ML workflow – no coding or ML expertise required!

Target Markets/Applications
▪ Industrial Predictive Maintenance
▪ Smart Home
▪ Wearables
▪ Automotive
▪ Mobile
▪ IoT

For a limited time, sign up to use Qeexo AutoML at automl.qeexo.com for FREE to bring intelligence to your devices!
Reality AI Tools® software is for building products
• Automated Data Assessment
• Automated Feature Exploration and Model Generation
• Bill-of-Materials Optimization
• Edge AI / TinyML code for the smallest MCUs

Reality AI solutions
• Automotive sound recognition & localization
• Indoor/outdoor sound event recognition
• RealityCheck™ voice anti-spoofing

[email protected] | @SensorAI | Reality AI | https://reality.ai
SynSense builds ultra-low-power (sub-mW) sensing and inference hardware for embedded, mobile and edge devices. We design systems for real-time always-on smart sensing, for audio, vision, IMUs, bio-signals and more.
https://SynSense.ai
Next tinyML Talks
Date Presenter Topic / Title
Tuesday, February 16
Mohammed Zubair, PhD, SMIEEE, Associate Professor / Consultant, Department of Electrical Engineering / Center for Artificial Intelligence, King Khalid University
Oral Tongue Lesion Detection using TinyML on Embedded Devices
Webcast start time is 8 am Pacific time
Please contact [email protected] if you are interested in presenting
Local Committee in Sweden
Ali Balador, Senior Researcher RISE, Assistant Professor MDH, [email protected]
Johan Malm, AI Engineer, PhD, Imagimob AB, [email protected]
Magnus Melander, Evangelist and Co-founder THINGS, [email protected]
Åke Wernelind, Business Development, Imagimob AB, [email protected]
Reminders
youtube.com/tinyml
Slides & Videos will be posted tomorrow
tinyml.org/forums
Please use the Q&A window for your questions
tinyML Foundation and Ecosystem Introduction, and Welcome to the tinyML Group in Sweden
Evgeni Gousev
Sr. Director, Qualcomm AI & Chairman, tinyML Foundation BoD
February 8, 2021
tinyML Foundation* Vision**
We see a new world with trillions of intelligent devices enabled by tinyML technologies that sense, analyze and autonomously act together to create a healthier and more sustainable environment for all
*tinyML Foundation is a non-profit, 501c3, professional and educational organization registered in Los Altos, CA, USA. tinyML and the tinyML logo are registered trademarks of tinyML Foundation.
**adopted at tinyML Strategy leadership meeting on Dec 14, 2019
tinyML Foundation Mission:
- to grow a prosperous and integrated Global Community of HW, SW and SYS scientists, engineers, designers, product and biz people, both experts and newcomers, developing leading edge tinyML technologies
- to promote and stimulate knowledge exchange between tinyML researchers to allow the field to move ahead at a high pace
- to inspire with the capabilities of tinyML and its potential to change the way machine intelligence and data analytics occur at the very edge of the physical and digital world
- to connect tinyML technologies and innovations to enormous product and business opportunities and value creation across the whole ecosystem and industry verticals
What is tinyML* ?
tinyML is broadly defined as machine learning architectures, techniques, tools and approaches capable of performing on-device analytics for a variety of sensing modalities (vision, audio, motion, chemical, etc.) at "mW" (or below) power range, targeting predominantly battery-operated devices
* tinyML is a full-stack approach and ecosystem, including HW-SYS-SW-applications
Why is the tinyML opportunity so enormous? Data is the new oil (electricity), and ML is a way to produce value from it
Cloud ML
• DNN on the cloud
• HW: TPU, FPGA, GPU, CPU

Edge ML
• Optimized algos and CNN-light
• SoC (with NPUs/NSP accelerators)

tinyML
• CNN-micro
• MCU w/ HW accelerators

Data sources (storage and sharing):
• User provided: 1. Pics, 2. Audio, 3. Clicks/likes, 4. GPS, location based
• Real-time in the physical world: CMOS cameras, IR cameras, IMUs, audio/mics, environmental/chemical, temperature, optical sensors

(Figure: relative data volumes shown as 95%, 4% and 1%)
Climbing up the tinyML mountain
• Step 0: Tech feasibility (now)
• Step 1: Building awareness (now)
• Step 2: Initial tinyML products (1-2 years)
• Step 3: tinyML killer apps (3-5 years)
• Step 4: tinyML everywhere: explosive growth, trillions of devices (10 years)
tinyML meetups global growth
https://www.meetup.com/pro/tinyml/
(4.5k+ members in 22 countries, in 18 months)
22
tinyML Group in Sweden: 62 members in 2 months
Highlights:www.tinyML.org/summit2021
- Keywords: Premier Quality, Interactive, LIVE … and FREE- 5 days, 50+ presentations- 4 Tutorials- 2 Panel discussions: (i) VC and (ii) tinyML toolchains- tinyML Research Symposium - Late Breaking News - 3 Best tinyML Awards (Paper, Product, Innovation)- 10+ Breakout sessions on various topics- tinyML Partner sessions- tinyAI for (Good) Life- LIVE coverage, starting at 8am Pacific time
What should I do about it:- Check out the program – you will be impressed- Register on-line (takes 5 min, and 1000+ already registered in 5 days)
- If interested: Submit nominations for Best Awards and/or Late News – February 28 deadline
- Block out your calendar: March 22-26- Become a sponsor ([email protected])- Provide your feedback – we listen !- Don’t worry about missing some talks – all videos
will be posted on YouTube.com/tinyML
Announcement:
tinyML Summits are growing fast:

                     2019 Summit (March 2019) | 2020 Summit (Feb 2020) | 2021 Summit (March 2021), expected
Attendees            160                      | 400+                   | 3000+
Companies            90                       | 172                    | 300+ ?
LinkedIn members     0                        | 798                    | ~2000
Meetup members       0                        | 1140                   | ~5000
YouTube subscribers  0                        | 0                      | ~3000
(member counts as of 2018 / 2019 / 2020)

Also started in Asia in 2021: tinyML WeChat and BiliBili
2021 Summit Sponsors (as of Feb 3, 2021; more in the pipeline)
FREE registration for the whole week of the Summit due to the Sponsors:
Interested in joining the tinyML ecosystem? www.tinyML.org
© 2021 Arm
Felix Johnny & Fredrik Knutsson, Machine Learning Group
CMSIS-NN : Optimization for Edge AI
Felix Johnny
Felix Johnny is the maintainer of Arm’s open source CMSIS-NN library that targets optimized Neural Network kernels for Cortex-M CPUs. He has spent most of the last 15 years in the wireless domain working with software design and optimizations in memory and cycle constrained systems. Outside of work, he is an active music photographer.
Fredrik Knutsson
Fredrik Knutsson is the technical lead for the Arm team working on Ethos-U55 and Cortex-M integration into embedded runtimes. He holds an M.Sc. in electrical engineering from Chalmers University of Technology. Fredrik has more than 15 years of experience in the embedded software domain, working mainly on software architecture and system design. For the past four years he has been working for Arm, with previous experience from the wireless, wearable and automotive businesses.
Executive Summary
Introduction to Edge AI and Arm Cortex-M Processors
Preparing for Optimization
CMSIS-NN : Library Optimizations
CMSIS-NN demo
Orienting Edge AI

Cloud side (Training):
• Train the model
• High performance, throughput oriented
• Large data sets
• High performance compute, servers, GPU

Edge device (Inference):
• Apply the model
• Low latency, efficiency
• Low/mid performance
• Security & privacy
• Embedded systems, heterogeneous processing

The trained model is converted for the edge: float to (u)int8
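The float-to-(u)int8 step follows the TensorFlow Lite convention of a scale and a zero point. A minimal sketch of that scheme (the helper names here are illustrative, not CMSIS-NN or TFLM API):

```c
#include <assert.h>
#include <stdint.h>

/* Sketch of affine int8 quantization (TensorFlow Lite convention):
 * q = clamp(round(x / scale) + zero_point, -128, 127).
 * Helper names are illustrative, not library API. */
static int8_t quantize_int8(float x, float scale, int32_t zero_point)
{
    float scaled = x / scale;
    /* round to nearest, ties away from zero (avoids a libm dependency) */
    int32_t q = (int32_t)(scaled + (scaled >= 0.0f ? 0.5f : -0.5f));
    q += zero_point;
    if (q < -128) q = -128;
    if (q > 127)  q = 127;
    return (int8_t)q;
}

static float dequantize_int8(int8_t q, float scale, int32_t zero_point)
{
    return scale * (float)((int32_t)q - zero_point);
}
```

Per-tensor (and, for weights, per-channel) scales are chosen at conversion time; values outside the representable range saturate at the int8 limits.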
Why Target Arm Cortex-M Processors?
• Accelerate deployment of machine learning on edge devices
• Power-efficient processors that fit TinyML requirements
• Availability of exciting new models with lower memory footprint and fewer multiply-accumulate (MAC) operations
• 52+ billion Cortex-M based chips shipped**

MobileNet V2:
• 3.38 MB parameters
• ~307 million MACs
• Data type: float

Person Detect (TFLM):
• 250 kBytes
• ~7 million MACs
• Data type: int8

**Source: Arm data
Processor's View of a NN Model: connected compute blocks

Input → Compute Layer 0 → Compute Layer 1 → … → Compute Layer N-1 → Output
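Since each layer simply consumes one buffer and produces the next, a runtime can execute the whole chain with two scratch buffers. A hypothetical sketch of that view (not TFLM's actual interpreter, which manages a memory arena; all names here are illustrative):

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* From the processor's point of view a model is a chain of compute
 * blocks: each layer reads one buffer and writes the next. */
#define BUF_SIZE 8 /* illustrative maximum tensor size */

typedef void (*layer_fn)(const int8_t *in, int8_t *out, int32_t n);

/* toy "layer" used for demonstration */
static void add_one(const int8_t *in, int8_t *out, int32_t n)
{
    for (int32_t i = 0; i < n; i++)
        out[i] = (int8_t)(in[i] + 1);
}

/* run layers 0..n_layers-1 in sequence, ping-ponging between buffers */
static void run_model(const layer_fn *layers, int32_t n_layers,
                      const int8_t *input, int8_t *output, int32_t n)
{
    int8_t ping[BUF_SIZE], pong[BUF_SIZE];
    memcpy(ping, input, (size_t)n);
    for (int32_t l = 0; l < n_layers; l++) {
        layers[l](ping, pong, n);
        memcpy(ping, pong, (size_t)n);
    }
    memcpy(output, ping, (size_t)n);
}
```

The point of the sketch is the data flow: per-layer optimization (the subject of the rest of the talk) happens inside each `layer_fn`.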
Optimizing Neural Networks for TinyML
Why & How
Why Optimize for TinyML?

Power constraints
• Extend battery life of edge devices by reducing awake time.

Cycle constraints
• Enable more complex models to be deployed within the inference budget.
• Meet real-time latency constraints.
It Takes a Village to Raise a Child: everyone has a role to play in optimization, some more important than others
Optimization
Model Optimizations
SW Environment Settings
Library Optimizations
HW Design
What is CMSIS-NN? Optimized Neural Network kernels[5] for Arm Cortex-M CPUs

CMSIS-NN is one component of the CMSIS-Pack:
• CMSIS-NN: machine learning
• CMSIS-DSP: signal processing
• CMSIS-RTOS: real-time execution
• CMSIS-CORE: processor core and peripheral access
• CMSIS-Driver: middleware interface
• CMSIS-SVD: peripheral description
• CMSIS-DAP: debug access
• CMSIS-Zone: system partitioning

Application code (e.g. TFL for Microcontrollers) runs on top of these, together with the device-specific peripheral HAL, on a system-on-chip with an Arm® Cortex® processor, specialized and communication peripherals, access filters (MPU, SAU), and CoreSight™ debug logic connected to a debugger.
Preparing for Library Optimizations
Setup for Data Used
• Mbed based
• TensorFlow[1] #2025bf6f68
• Mbed[2] #e642a7d8b3
• Default optimization level used: -Ofast
• Default compiler version: GNU Arm Embedded Toolchain 9.3.1 20200408
• Processor: Arm Cortex-M7 @ 216 MHz (STM32F746ZG)
• Memory layout of operators: TensorFlow Lite for Microcontrollers (TFLM)
• Model: Person detect from TFLM, int8 symmetric quantized[3]
Cycle Measurements: Non-Optimized Kernels

Methodology: grouped cycle measurement over a snippet of the person detect model. Reset the cycle counter, then log the cycle count and reset the counter after each operator.

Result: Conv accounts for 61.40% of the cycles and DW Conv for 38.56%; the remaining operators account for well under 0.1% combined.
Setting a Target for Optimization: what the library optimizer looks at
• Cycle bound analysis: processor capability, processor utilization
• Memory bound analysis: memory capability, data layout
Processor Capability & Utilization

    P_util = C_t / C_a,  P_util ∈ [0, 1]

where C_t is the processor's speed of light (SoL) or capability: the best-case cycle count for executing a given arithmetic operator, and C_a is the actual cycle count for the execution.

Processor SoL or capability for the MAC sequence Load (LDR) -> Multiply Accumulate (MAC) -> LDR -> MAC -> … -> Store (STR):

Processor  | MAC/cycle
Cortex-M4  | 1
Cortex-M7  | 2
Cortex-M55 | 8
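Putting numbers on this: C_t is simply the operator's MAC count divided by the processor's MAC/cycle capability, and utilization follows. A small illustrative helper (the parameter names and any numbers used with it are made up, not Arm's measurements):

```c
#include <assert.h>
#include <stdint.h>

/* P_util = C_t / C_a, where C_t is the best-case ("speed of light")
 * cycle count for the operator's MACs and C_a the measured cycles. */
static double processor_utilization(uint64_t macs,
                                    double sol_macs_per_cycle,
                                    uint64_t actual_cycles)
{
    double c_t = (double)macs / sol_macs_per_cycle; /* best-case cycles */
    return c_t / (double)actual_cycles;
}
```

For example, a hypothetical kernel with one million MACs on a Cortex-M7 (2 MAC/cycle) has C_t = 500,000 cycles; if it measures four million cycles, its utilization is 12.5%.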
Processor Utilization for Person Detect Model (TFLM reference kernels, Cortex-M7)

Cycles:                  Conv 61.4%, DW conv 38.6%
MAC operations:          Conv 86.5%, DW conv 13.5%
Processor utilization:   Conv 6.9%,  DW conv 1.7%

This is an unfair representation, as conv and DW conv do more than just MACs. One takeaway is that there is scope for improvement.
Memory Bound Analysis
• Theoretical bandwidth: maximum data transfer rate for a given hardware specification.
• Effective bandwidth: actual data transfer rate for an operation, e.g. read/write of weights, inputs and outputs in a convolution operation.
• Different memory blocks (on- and off-chip) are involved.
• Often it is about reducing the effective bandwidth.
• But memory analysis on its own does not paint the full picture.
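As a sketch of the arithmetic: effective bandwidth is the bytes actually moved by an operation (weight reads, input reads, output writes) divided by the cycles it took. Illustrative helper with hypothetical parameter names:

```c
#include <assert.h>
#include <stdint.h>

/* Effective bandwidth of an operation, in bytes per cycle: total data
 * actually moved (weights + inputs + outputs) over the cycles spent.
 * Illustrative sketch; inputs are whatever your profiling reports. */
static double effective_bw(uint64_t weight_bytes, uint64_t input_bytes,
                           uint64_t output_bytes, uint64_t cycles)
{
    uint64_t moved = weight_bytes + input_bytes + output_bytes;
    return (double)moved / (double)cycles;
}
```

Comparing this figure against the theoretical bandwidth of the memory block involved shows how memory-bound the operation is.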
Finding a Direction for Optimization

Cycle and memory analysis gives an indication of how effective an algorithm is and points to areas of improvement. An algorithm can be cycle bound or memory bound; optimization is about striking a balance between the two.
Use Case: Fully Connected
A simpler case of what the im2col method targets for a convolution or DW conv.

A fully connected layer computes M outputs from an N-element input vector (int8, 1 byte per element). The input (ip_0, ip_1, .., ip_n-1 at addr_x) and the filter rows (filter 0: f0_0 .. f0_n-1 at addr_y; filter 1: f1_0 .. f1_n-1 at addr_y + n; and so on) are contiguous in memory.
Step 1: Working on the Memory Bound Aspect

Baseline, one filter row per loop iteration:

loop_i,k {
    input = input[i] + input_offset
    sum += input * (filter[k][i] + filter_offset)
}

Processing two filter rows per iteration halves the number of input vector reads:

loop_i,k/2 {
    input = input[i] + input_offset
    sum_0 += input * (filter[k][i] + filter_offset)
    sum_1 += input * (filter[k + n][i] + filter_offset)
}

Reducing memory reads and reusing loaded data is a vital aspect of optimization.
Why don’t I unroll it all the way?
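In C, the two-rows-per-iteration idea might look like the sketch below. This illustrates the technique only; it is not the actual CMSIS-NN kernel (the library's real int8 fully connected implementation is arm_fully_connected_s8):

```c
#include <assert.h>
#include <stdint.h>

/* Sketch of 2-row unrolling: each input byte is loaded once and reused
 * for two filter rows, halving input vector reads. Illustrative code,
 * not the CMSIS-NN implementation. */
static void fc_two_rows(const int8_t *input, const int8_t *filter,
                        int32_t n, int32_t input_offset,
                        int32_t filter_offset,
                        int32_t *sum_0, int32_t *sum_1)
{
    const int8_t *row0 = filter;
    const int8_t *row1 = filter + n; /* filter rows are contiguous */
    int32_t s0 = 0, s1 = 0;
    for (int32_t i = 0; i < n; i++) {
        int32_t in = input[i] + input_offset; /* loaded once, used twice */
        s0 += in * (row0[i] + filter_offset);
        s1 += in * (row1[i] + filter_offset);
    }
    *sum_0 = s0;
    *sum_1 = s1;
}
```

The same pattern generalizes to four or more rows per iteration, up to the register limits discussed next.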
Well, There are Limits to Effective Unrolling

Effective unrolling depends on the number of general purpose (GP) and vector registers available. Processing two filter rows in the same loop:

loop_i {
    input = input[i] + input_offset
    sum_0 += input * (filter[k][i] + filter_offset)
    sum_1 += input * (filter[k + n][i] + filter_offset)
}

Variables in use and a potential register map:
• Address of filter_0: R0
• Address of filter_1: R1
• Address of input: R2
• Value of input: R3
• Value of filter_0: R4
• Value of filter_1*: R5 (*could be reused)
• sum_0: R6
• sum_1: R7
• Input offset: R8
• Filter offset: R9
• Loop count: R10
• Unused: R11-R12

A guideline (not a rule) is to unroll until there are no register spills/reloads.
Does the Compiler Output Match Expectations?

The code snippet shown is from arm_nn_vec_mat_mult_t_s8.c (CMSIS-NN), processing two filter rows in one loop (lhs is the input vector, rhs the filter). Points to verify in the generated code:
• Single byte (sb) loads, with the filter offset added afterwards
• One multiply-accumulate per element pair
• Pre/post increment of pointers done as part of the load: a vital benefit of a constant address increment
• Loop count compare sets condition bit N, followed by a branch on condition bit N
Step 2: SIMD* for MAC (*Single Instruction Multiple Data)

SoL for the MAC sequence LDR->MAC->LDR->MAC->…->STR:

Processor  | Extension               | SoL MAC/cycle
Cortex-M4  | DSP                     | 1
Cortex-M7  | DSP                     | 2
Cortex-M55 | MVE (Helium technology) | 8

DSP extension (handles an even number of MAC operations; tail elements are handled separately):

.LBB1_2: @ %for.body
    ldr   r12, [r0], #4
    ldr   r4, [r1], #4
    smlad r2, r12, r4, r2   // 2 MAC operations
    le    lr, .LBB1_2

MVE extension (tail predicated loops allow for handling both odd and even element lengths):

.LBB0_2: @ %for.body
    vldrb.u8    q0, [r0], #16
    vldrb.u8    q1, [r1], #16
    vmladava.s8 r12, q0, q1  // setup for 16 MAC operations
    letp        lr, .LBB0_2
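SMLAD performs two signed 16x16 multiplies on packed halfwords and adds both products to an accumulator in one instruction. A portable C model of that behavior (for int8 kernels the operands must first be sign-extended to 16 bits and packed, which is why the real kernels pair SMLAD with instructions such as SXTB16; this model is for illustration only):

```c
#include <assert.h>
#include <stdint.h>

/* Portable model of the Armv7E-M DSP instruction SMLAD:
 * result = acc + (a.lo16 * b.lo16) + (a.hi16 * b.hi16),
 * i.e. two MAC operations per instruction. */
static int32_t smlad_model(uint32_t a, uint32_t b, int32_t acc)
{
    int16_t a_lo = (int16_t)(a & 0xFFFFu), a_hi = (int16_t)(a >> 16);
    int16_t b_lo = (int16_t)(b & 0xFFFFu), b_hi = (int16_t)(b >> 16);
    return acc + (int32_t)a_lo * b_lo + (int32_t)a_hi * b_hi;
}
```

The MVE VMLADAVA instruction extends the same dot-product-and-accumulate idea to a full 128-bit vector: sixteen int8 lanes per instruction.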
Step 3: Further Simplification of Core Loop
• The input/filter offset can be handled outside the core loop[4].
• This reduces the number of load and add operations done in the core loop.
• It frees up registers that can be used for deeper unrolling instead.
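The trick from [4] rests on expanding the product: sum((x_i + x_off) * (w_i + w_off)) = sum(x_i * w_i) + w_off * sum(x_i) + x_off * sum(w_i) + n * x_off * w_off. The terms that depend only on the (constant) weights can be precomputed, e.g. folded into the bias, leaving plain MACs in the core loop. A sketch demonstrating the equivalence (function names are illustrative):

```c
#include <assert.h>
#include <stdint.h>

/* Naive form: offsets added inside the core loop. */
static int32_t dot_with_offsets_naive(const int8_t *x, const int8_t *w,
                                      int32_t n, int32_t xo, int32_t wo)
{
    int32_t sum = 0;
    for (int32_t i = 0; i < n; i++)
        sum += (x[i] + xo) * (w[i] + wo);
    return sum;
}

/* Factored form: core loop does plain MACs; offset terms are applied
 * once at the end (and the weight-only terms could be precomputed). */
static int32_t dot_with_offsets_factored(const int8_t *x, const int8_t *w,
                                         int32_t n, int32_t xo, int32_t wo)
{
    int32_t dot = 0, sum_x = 0, sum_w = 0;
    for (int32_t i = 0; i < n; i++) {
        dot   += x[i] * w[i];
        sum_x += x[i];
        sum_w += w[i];
    }
    return dot + wo * sum_x + xo * sum_w + n * xo * wo;
}
```

In a deployed kernel, sum(w_i) and n * x_off * w_off are known at initialization time, so only the dot product and sum(x_i) remain as runtime work.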
The Essentials of Optimization
• Memory access optimizations: reduce relevant memory accesses while staying within the available GP/vector register constraint.
• Keep It Simple, Stupid (KISS) optimizations: constant pointer increment/decrement in core loops; simplify the core loop further by moving input/filter offset adjustments out of it.
• Processor capability optimizations: Single Instruction Multiple Data (SIMD).
Model Hyperparameter: The Unaligned Aspect (input channel count not a multiple of 4)

With 3 input channels, a 3x3x3 filter occupies a 27-byte block in memory (byte 0, byte 1, .., byte 26). Parts of the 16 (output channel) blocks of the kernel then result in unaligned accesses.
Impact of Aligned Access: changing from 3 input channels to 4
• 33.3% increase in MACs
• 21.1% reduction in cycles

This MAC and performance impact of using 4 input channels applies to other operators as well. Memory-alignment-aware shapes get the best out of CMSIS-NN.
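The 33.3% figure follows directly from the filter geometry: padding a 3x3x3 filter (27 bytes of int8 weights) to four input channels gives 3x3x4 = 36 bytes, which is word-aligned, at 36/27 = 4/3 the MACs. A trivial check (helper name is illustrative):

```c
#include <assert.h>
#include <stdint.h>

/* Size in bytes of one int8 filter block: one byte per weight. */
static int32_t filter_bytes(int32_t kh, int32_t kw, int32_t channels)
{
    return kh * kw * channels;
}
```

27 bytes means consecutive filter blocks start at varying alignments, while 36 bytes keeps every block on a 4-byte boundary; the measured 21.1% cycle reduction shows the aligned loads more than pay for the extra MACs.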
Deploy CMSIS-NN using TensorFlow Lite for Microcontrollers
TensorFlow Lite for Microcontrollers (TFLM)
• Version of TensorFlow Lite designed to execute neural networks on microcontrollers, starting at only a few kB of memory
• Designed to be portable even to 'bare metal' systems
• The core runtime is ~20 kB
• Examples/demos:
  • Micro speech: detects simple commands such as yes, no and silence
  • Person detection: detects whether a person is in the room or not
  • Magic wand: detects gestures using an accelerometer
• Over 50 operators supported currently
• Many integrated operator optimizations
CMSIS-NN & TensorFlow Lite for Microcontrollers: access to optimized kernels through TFLM
• Optimized kernels are enabled using OPTIMIZED_KERNEL_DIR=cmsis_nn in the TFLM build system
• Optimized kernels are used when available; otherwise execution falls back to the TFLM reference kernels
• The CMSIS-NN "glue" is here: /tensorflow/lite/micro/kernels/cmsis-nn

Stack: Application -> TensorFlow Lite micro interpreter -> reference kernels / CMSIS-NN kernels -> Arm Cortex-M CPU
Optimize Where it Matters… and always have a fallback path
• Reference kernels for availability
• For more horsepower: CMSIS-NN
• For most horsepower: Ethos-U55

Kernel   | TFLM reference implementation | CMSIS-NN (fast) | NPU (faster)
Kernel 1 | ✓ | ✓ | ✓
Kernel 2 | ✓ | ✓ | ✓
Kernel 3 | ✓ | ✓ | ✓
Kernel 4 | ✓ | ✓ | ✓
Kernel 5 | ✓ | ✓ |
Kernel 6 | ✓ |   |
Kernel 7 | ✓ |   |
What kernels are optimized?
• Quantization: optimizations are available for int8, per-channel quantized data.
• Operators: the majority of compute occurs in a handful of ops.
• We're open to optimizing new ops! Open a GitHub ticket on the CMSIS repo [5].

Typical use cases: keyword spotting, face recognition, image classification, human activity [3]
Person Detection Demo (available on the TensorFlow repository [7])
• Model:
  • ~300 kB flash (weights and bias)
  • ~100 kB SRAM (activations etc.)
  • 31 layers, ~7 million MACs (depthwise conv, conv, average pool, softmax, reshape)
• Input: 96x96 pixel 8-bit grayscale image
• Output: two 8-bit values, person score and no-person score
The Hardware: Arduino Nano 33 BLE Sense + Arducam Mini 2MP Plus
• Powered by Arm's Cortex-M4 CPU
• 1 MB flash, 256 kB SRAM, 64 MHz
• Green light: a person
• Red light: not a person
References
• [1] TensorFlow GitHub
• [2] mbed-os
• [3] TensorFlow Lite int8 quantization specification
• [4] Efficient handling of offsets
• [5] CMSIS-NN source repository
• [6] Enable CMSIS-NN on Tensorflow Lite for Microcontrollers
• [7] Person detection example readme
Thank You / Danke / Gracias / 谢谢 / ありがとう / Asante / Merci / 감사합니다 / धन्यवाद / Kiitos / شكرًا / ধন্যবাদ / תודה
Contact:
Copyright Notice
This presentation was presented as a tinyML® Talks webcast. The content reflects the opinion of the author(s) and their respective companies. The inclusion of presentations in this publication does not constitute an endorsement by the tinyML Foundation or the sponsors.
There is no copyright protection claimed by this publication. However, each presentation is the work of the authors and their respective companies and may contain copyrighted material. As such, it is strongly encouraged that any use reflect proper acknowledgement to the appropriate source. Any questions regarding the use of any materials presented should be directed to the author(s) or their companies.
tinyML is a registered trademark of the tinyML Foundation.
www.tinyML.org