Felix Johnny and Fredrik Knutsson - Arm - tinyML

“CMSIS-NN & Optimizations for Edge AI”

Felix Johnny and Fredrik Knutsson - Arm

Sweden Area Group – February 8, 2021

tinyML Talks Sponsors

Additional Sponsorships available – contact Bette@tinyML.org for info

tinyML Strategic Partner

Optimized models for embedded

Application

Runtime(e.g. TensorFlow Lite Micro)

Optimized low-level NN libraries(i.e. CMSIS-NN)

Arm Cortex-M CPUs and microNPUs

Profiling and debugging

tooling such as Arm Keil MDK

Connect to high-level

frameworks

Supported byend-to-end tooling

RTOS such as Mbed OS

Connect toRuntime

Arm: The Software and Hardware Foundation for tinyML

AI Ecosystem Partners

Resources: developer.arm.com/solutions/machine-learning-on-arm

Stay Connected

@ArmSoftwareDevelopers

@ArmSoftwareDev

BECOME BETA USER bit.ly/testdeeplite

WE USE AI TO MAKE OTHER AI FASTER, SMALLER AND MORE POWER EFFICIENT

Automatically compress SOTA models like MobileNet to <200KB with

little to no drop in accuracy for inference on resource-limited MCUs

Reduce model optimization trial & error from weeks to days using

Deeplite's design space exploration

Deploy more models to your device without sacrificing performance or

battery life with our easy-to-use software

TinyML for all developers

Get your free account at http://edgeimpulse.com

Edge Device Impulse

Dataset

Embedded and edge

compute deployment

options

Acquire valuable

training data securely

Test impulse with

real-time device

data flows

Enrich data and train

ML algorithms

Real sensors in real time

Open source SDK

Maxim Integrated: Enabling Edge Intelligencewww.maximintegrated.com/ai

Sensors and Signal Conditioning

Health sensors measure PPG and ECG signals critical to understanding vital signs. Signal chain products enable measuring even the most sensitive signals.

Low Power Cortex M4 Micros

The biggest (3MB flash and 1MB SRAM) and the smallest (256KB flash and 96KB SRAM) Cortex M4 microcontrollers enable algorithms and neural networks to run at wearable power levels

Advanced AI Acceleration

The new MAX78000 implements AI inferences at over 100x lower energy than other embedded options. Now the edge can see and hear like never before.

▪ Wide range of ML methods: GBM, XGBoost, Random

Forest, Logistic Regression, Decision Tree, SVM, CNN, RNN,

CRNN, ANN, Local Outlier Factor, and Isolation Forest

▪ Easy-to-use interface for labeling, recording, validating, and

visualizing time-series sensor data

▪ On-device inference optimized for low latency, low power

consumption, and a small memory footprint

▪ Supports Arm® Cortex™- M0 to M4 class MCUs

▪ Automates complex and labor-intensive processes of a

typical ML workflow – no coding or ML expertise required!

▪ Industrial Predictive Maintenance

▪ Smart Home

▪ Wearables

Qeexo AutoML for Embedded AIAutomated Machine Learning Platform that builds tinyML solutions for the Edge using sensor data

▪ Automotive

▪ Mobile

▪ IoT

QEEXO AUTOML: END-TO-END MACHINE LEARNING PLATFORM

Key Features Target Markets/Applications

For a limited time, sign up to use Qeexo AutoML at automl.qeexo.com for FREE to bring intelligence to your devices!

is for

building products

Automated Feature Exploration and Model

Generation

Bill-of-Materials

Optimization

Automated Data

Assessment

Edge AI / TinyMLcode for the smallest

Reality AI Tools® software

Reality AI solutions

Automotive sound recognition & localization

Indoor/outdoor sound event recognition

RealityCheck™ voice anti-spoofing

info@reality.ai @SensorAI Reality AIhttps://reality.ai

SynSense builds ultra-low-power (sub-mW) sensing and inference hardware for embedded, mobile and edge devices. We design systems for real-time always-on smart sensing, for audio, vision, IMUs, bio-signals and more.

https://SynSense.ai

Next tinyML Talks

Date Presenter Topic / Title

Tuesday,February 16

Mohammed ZubairPhD, SMIEEEAssociate Professor / ConsultantDepartment of Electrical Engineering / Center for Artificial Intelligence King Khalid University

Oral Tongue Lesion Detection using TinyML

on Embedded Devices

Webcast start time is 8 am Pacific time

Please contact talks@tinyml.org if you are interested in presenting

Local Committee in Sweden

Ali Balador, Senior Researcher RISE, Assistant Professor MDH, ali.balador@ri.se

Johan Malm, AI Engineer, PhD, Imagimob AB, johan.malm@imagimob.com

Magnus Melander, Evangelist and Co-founder THINGS, magnus@wbird.se

Åke Wernelind, Business Development, Imagimob AB, ake.wernelind@imagimob.com

Reminders

youtube.com/tinyml

Slides & Videos will be posted tomorrow

tinyml.org/forums

Please use the Q&A window for your questions

tinyML Foundation and Ecosystem introductionand Welcome tinyML Group in Sweden

Evgeni Gousev

Sr. Director, Qualcomm AI & Chairman, tinyML Foundation, BoDFebruary 8, 2021

tinyML Foundation* Vision**

We see a new world with trillions of intelligent devices enabled by tinyML technologies that sense, analyze and autonomously act together to create a healthier and more sustainable environment for all

*tinyML Foundation is a non-profit, 501c3, professional and educational organization registered in Los Altos, CA, USAtinyML and the tinyML logo are registered trademarks of tinyML Foundation

**adopted at tinyML Strategy leadership meeting on Dec 14, 2019

tinyML Foundation Mission:

- to grow a prosperous and integrated Global Community of HW, SW and SYS scientists, engineers, designers, product and biz people, both experts and newcomers, developing leading edge tinyML technologies

- to promote and stimulate knowledge exchange between tinyML researchers to allow the field to move ahead at a high pace

- to inspire on the capabilities of tinyML and its potential of changing the way machine intelligence and data analytics at the very edge of the physical and digital world occur

- to connect tinyML technologies and innovations to enormous product and business opportunities and value creation across the whole ecosystem and industry verticals

What is tinyML* ?

tinyML is broadly defined as machine learning architectures, techniques, tools and approaches capable of performing on-device analytics for a variety of sensing modalities (vision, audio, motion, chemical, etc.) at “mW” (or below) power range targeting predominately battery operated devices

* tinyML is a full-stack approach and ecosystem, including HW-SYS-SW-applications

Why tinyML opportunity is so enormous?Data is a new oil(electricity) and ML is a way to produce it

Cloud ML

• DNN on the cloud

• HW: TPU, FPGA, GPU, CPU

Edge ML

• Optimized algos and CNN-light

• SoC (with NPUs/NSP accelerators)

tiny ML

• CNN-micro

• MCU w/ HW accelerators

Data Sources:

Storage and sharing

User provided:1. Pics2. Audio3. Clicks/likes4. GPS, Location based

Real-time in the physical world

CMOS cameras

IRcameras

IMUs Audiomics

Environ/chemical

Temperature Optical sensors

Climbing up tinyML mountain

Step 0

Tech Feasibility

Step 1

Building

Awareness

Step 2

Initial tinyML

Products

Step 3

tinyML

Killer apps

Step 4

tinyML

everywhere

Explosive growth

Trillion of devices

3-5 years

1-2 years

10 years

tinyML meetups global growth

https://www.meetup.com/pro/tinyml/

(4.5k+ members in 22 countries, in 18 months)

tinyML Group in Sweden:62 members in 2 months

Highlights:www.tinyML.org/summit2021

- Keywords: Premier Quality, Interactive, LIVE … and FREE- 5 days, 50+ presentations- 4 Tutorials- 2 Panel discussions: (i) VC and (ii) tinyML toolchains- tinyML Research Symposium - Late Breaking News - 3 Best tinyML Awards (Paper, Product, Innovation)- 10+ Breakout sessions on various topics- tinyML Partner sessions- tinyAI for (Good) Life- LIVE coverage, starting at 8am Pacific time

What should I do about it:- Check out the program – you will be impressed- Register on-line (takes 5 min, and 1000+ already registered in 5 days)

- If interested: Submit nominations for Best Awards and/or Late News – February 28 deadline

- Block out your calendar: March 22-26- Become a sponsor (sponsorships@tinyML.org)- Provide your feedback – we listen !- Don’t worry about missing some talks – all videos

will be posted on YouTube.com/tinyML

Announcement:

tinyML Summits are growing fast 2019 Summit (March 2019)

2020 Summit(Feb 2020)

2021 Summit(March 2021), expected

Attendees 160 400+ 3000+

Companies 90 172 300+ ?

LinkedIn members

0 798 ~ 2000

Meetupsmembers

0 1140 ~ 5000

YouTubesubscribers

0 0 ~ 3000

2018 2019 2020

also started in Asia: tinyML WeChat and BiliBili

2021 Summit Sponsors (as of Feb 3, 2021; more in the pipeline)

FREE registration for the whole weekof the Summit due to the Sponsors:

Interested in joining tinyML ecosystem?www.tinyML.org

Felix Johnny & Fredrik KnutssonMachine Learning Group

CMSIS-NN : Optimization for Edge AI

Felix Johnny

Felix Johnny is the maintainer of Arm’s open source CMSIS-NN library that targets optimized Neural Network kernels for Cortex-M CPUs. He has spent most of the last 15 years in the wireless domain working with software design and optimizations in memory and cycle constrained systems. Outside of work, he is an active music photographer.

Fredrik Knutsson

Fredrik Knutsson is the technical lead for the Arm team working on Ethos-U55 and Cortex-M integration into embedded runtimes. He holds a M.Sc. in electrical engineering from Chalmers university of technology. Fredrik has more than 15 years of experience in the embedded software domain, doing mainly software architecture and system design. Four the past four years he’s been working for Arm and has previous experience from the wireless, wearable and automotive business.

Executive Summary

Introduction to Edge AI and Arm Cortex-M Processors

Preparing for Optimization

CMSIS-NN : Library Optimizations

CMSIS-NN demo

Orienting Edge AI

Apply the model

Low latencyefficiency

Low/mid performance

Security &privacy

Train the model

High performance

Throughput oriented

Large data sets

High performance compute, servers, GPU

Embedded systems, heterogeneous processing

eTraining

Inference

float to (u)int8

Why Target Arm Cortex-M Processors?

Accelerate deployment of Machine Learning on edge devices

Power efficient processors that fit TinyML requirements

Availability of exciting new models with lower memory footprint and Multiply ACcumulate operations(MAC)

MobileNet V2

• 3.38 MB parameters

• ~ 307 million MACs

• Data Type: floatPerson Detect(TFLM)

• 250 kBytes

• ~ 7 million MACs

• Data Type: int8**Source: Arm data

+52 billion Cortex-M

based chips shipped**

Processor’s View of a NN ModelConnected compute blocks

Compute Layer 0

Compute Layer 1

Compute Layer N-1

…Input Output

Optimizing Neural Networks for TinyML

Why & How

Why Optimize for TinyML?

Power constraints

• Extend battery life of edge devices byreducing awake time.

Cycle constraints

• Enable more complex models to be deployed within the inference budget.

• Meet real time latency constraints.

It Takes a Village to Raise a ChildEveryone has a role to play in optimization, some more important than the other

Optimization

Model Optimizations

SW Environment Settings

Library Optimizations

HW Design

What is CMSIS-NN?Optimized Neural Network kernels[5] for Arm Cortex-M CPUs

CMSIS-Pack

System-on-chip

Arm® Cortex® processorSpecialized peripherals

Communication peripherals

CoreSight™debug logic

Debugger

CMSIS-RTOSReal-time execution

CMSIS-NNMachine learning

CMSIS-DSPSignal processing

CMSIS-SVDPeripheral description

CMSIS-DAPDebug access

CMSIS-DriverMiddleware interface

CMSIS-COREProcessor core and peripheral access

Peripheral HALDevice specific

Access Filter(MPU, SAU)

CMSIS-ZoneSystem Partitioning

Application codeTFL for Microcontrollers

Preparing for Library Optimizations

Optimizations

Model Optimizations

SW Environment

Settings

HW design

Setup for Data Used

Mbed based

Tensor Flow[1] #2025bf6f68

Mbed[2] #e642a7d8b3

Default Optimization level used: Ofast

Default Compiler Version: GNU Arm Embedded Toolchain 9.3.1 20200408

Processor : Arm Cortex-M7 @216 MHz

Memory layout of operators: TensorFlow Lite for Microcontroller(TFLM)

Model: Person detect from TFLM, int8 symmetric quantized[3]

STM32F746ZG

Cycle MeasurementsNon-Optimized kernels

Methodology Grouped Cycle measurement

DW Conv38.56%

Conv61.40%

0.04% 0.00% 0.01%

Log cycle count & reset counter

Reset cycle counter

Log cycle count & reset counter

Snippet of Person Detect Model

Setting a Target for Optimization The library optimizer

Cycle Bound Analysis

Processor Capability

Processor Utilization

Memory Bound Analysis

Memory Capability

Data Layout

Processor Capability & Utilization

𝑃𝑢𝑡𝑖𝑙 =𝐶𝑡

𝐶𝑎, ∈ [0,1]

Speed of Light (SoL) or Capability. Best-case cycles for execution of a given arithmetic operator

Actual cycles for execution𝐶𝑎

𝐶𝑡Processor SoL or Capability for

MAC/cycle

Cortex-M4 1

Cortex-M7 2

Cortex-M55 8

Sequence: Load(LDR)->Multiply Accumulate(MAC)->LDR->MAC->…->Store(STR)

Memory Capability

Data Layout

Processor Utilization for Person Detect ModelTFLM reference kernels

Unfair representation as conv and DW conv do more than just MAC’s.

One take away is that there is scope for improvement.

DW conv38.6%

Conv61.4%

CYCLES

DW conv13.5%

Conv86.5%

MAC OPERATIONS

DW conv Conv1.7%

TFLM reference processor utilizationCortex-M7

Theoretical bandwidth - Maximum data transfer rate for a given hardware specification.

Effective bandwidth – Actual data transfer rate for an operation. For e.g. read/write of weights, inputs and outputs in a convolution operation.

Different memory blocks (on and off chip) are involved.

Often it is about reducing the effective bandwidth.

But, memory analysis on its own does not paint the full picture.

Memory Capability

Data Layout

Finding a Direction for Optimization

Cycle and memory analysis, gives an indication of how effective an algorithm is and points to areas of improvement.

Optimization is about striking a balance between the two.

Cycle bound

Memory bound

Optimizations

Model Optimizations

SW Environment

Settings

HW design

Use Case: Fully ConnectedSimpler case of what im2col method targets for a convolution or DW conv

Ninput⋮

ip_0 ip_1 .. ip_n-1input @ addr_x

f0_0 .. f0_n-1filter 0 @ addr_y

filter 1 @ addr_y + n f1_0 .. f1_n-1

Moutput.

⋮ 1 byte

Input and weights are contiguous in memory

Step 1: Working on the Memory Bound Aspect

loop_i,k {

input = input[i] + input_offset

sum += input * (filter_[k][i] + filter_offset)

Number of input vector reads is halved

Reducing memory reads and reusing is a vital aspect of optimization

loop_i,k/2 {

sum_0 += input * (filter[k][i] + filter_offset)

sum_1 += input * (filter[k + n][i] + filter_offset)

Why don’t I unroll it all the way?

Well, There are Limits to Effective Unrolling

* Could be reused

Variables in use Potential register map

Address of filter_0 R0

Address of filter_1 R1

Address of input R2

Value of input R3

Value of filter_0 R4

Value of filter_1* R5

sum 0 R6

sum 1 R7

Input offset R8

Filter offset R9

Loop count R10

Unused R11-R12

Depends on the number of general purpose (GP) & vector registers available.

Process two filter inputs in same loop

loop_i {

sum_0 += input * (filter[k][i] + filter_offset)

sum_1 += input * (filter[k + n][i] + filter_offset)

A guideline (not a rule) is to unroll until there are no register spills/reloads

Does the Compiler Output Match Expectations ?

Processing two filter rows in one loop

// Load single byte(sb)

// Add filter offset

// Loop count compare. Sets condition bit N

// one multiply accumulate

Pre/Post increment of pointers are done with load

Vital benefit of a constant address increment

// Branch on condition bit N

Code snippet from arm_nn_vec_mat_mult_t_s8.c(CMSIS-NN)

// lhs -> Input vector

// rhs -> Filter

Step 2: SIMD* for MAC

Processor ExtensionSoL

MAC/cycle

Cortex-M4 DSP 1

Cortex-M7 DSP 2

Cortex-M55 MVE (Helium Technology)

SoL for MAC sequence : LDR->MAC->LDR->MAC->…->STR

.LBB1_2: @ %for.body

ldr r12, [r0], #4

ldr r4, [r1], #4

smlad r2, r12, r4, r2 // 2 MAC operations

le lr, .LBB1_2

.LBB0_2: @ %for.body

vldrb.u8 q0, [r0], #16

vldrb.u8 q1, [r1], #16

vmlava.s8 r12, q0, q1 // setup for 16 MAC operations

letp lr, .LBB0_2

DSP extension

MVE extension

Handles even number of MAC operations

Tail elements are handled separately

Tail predicated loops allow for handling both odd and even element lengths.

*Single Instruction Multiple Data

Step 3: Further Simplification of Core Loop

Input/filter offset can be handled outside the core loop[4]

Reduces the number of load and add operations done in core loop.

Frees up registers that can be used for deeper unrolling instead.

The Essentials of Optimization

• Memory access optimizations• Reducing relevant memory accesses while staying within the available GP/vector register constraint.

• Keep it Simple Stupid (KISS) Optimizations • Constant pointer increment/decrement in core loops• Simplify the core loop further by moving out input/filter offset adjustments

• Processor capability optimization• Single Instruction Multiple Data Optimizations

Model Hyperparameter – The Unaligned AspectInput channel not a multiple of 4

3 input channels

Parts of the 16 blocks(output channel) of kernel results in unaligned access

btye 0 byte 1 .. byte 26

3x3x3 => 27 byte filter block in memory

Impact of Aligned AccessChange from 3 input channels to 4

increase in macs

reduction in cycles

MAC and performance impact in using 4 input channels

Applicable for other operators as well

Memory alignment aware shapes get the best out of CMSIS-NN

Deploy CMSIS-NN using Tensorflow Lite for

Microcontrollers

Tensorflow Lite for Microcontrollers (TFLM)

• Version of TensorFlow Lite designed to execute neural networks on microcontrollers, starting at only a few kB of memory

• Designed to be portable even to 'bare metal' systems

• The core runtime is ~20kB.

• Examples/demos

• Micro speech: Detects simple commands such as yes, no and silence.

• Person detection: Detects whether a person is in the room or not.

• Magic wand: Detect gestures using an accelerometer.

• Over 50 operators supported currently

• Many integrated operator optimizations

• Optimized kernels are enabled using OPTIMIZED_KERNEL_DIR=cmsis_nn in the TFLM build system

• Use optimized kernels when available –otherwise fallback to TFLM reference kernels

• CMSIS-NN “glue” is here:• /tensorflow/lite/micro/kernels/cmsis-nn

CMSIS-NN & TensorFlow Lite for MicrocontrollersAccess to optimized kernels through TFLM

Arm Cortex-M CPU

Application

TensorFlow Lite micro interpreter

Ref kernels CMSIS-NN kernels

Optimize Where it Matters……and always have a fallback path

• Reference kernels for availability

• For more horsepower - CMSIS-NN

• For most horsepower - Ethos-U55

Kernel TFLM reference implementation

CMSIS-NN (fast)

NPU(faster)

Kernel 1 ✓ ✓ ✓

Kernel 5 ✓ ✓

Kernel 6 ✓

Kernel 7 ✓

• Quantization: Optimizations are available for int8, per-channel, quantized data.

• Operators: The majority of compute occurs in a handful of ops

• We’re open to optimizing new ops!• Open a Github ticket on CMSIS repo [5]

What kernels are optimized?

Keyword spotting Face recognision

Image classificationHuman activity

Person Detection DemoAvailable on the Tensorflow repository [7]

• Model:• ~300kB flash (weight and bias)• ~100kB SRAM (activations etc.)• 31 layers, ~7 million MACs

– Depthwise conv, Conv, Average pool, Softmax, Reshape

• Input:• 96x96 pixel 8-bit grayscale image

• Output:• Two 8-bit values: Person score and no person score

The HardwareArduino Nano 33 BLE Sense + Arducam Mini 2MP Plus

• Powered by Arm’s Cortex-M4 CPU

• 1 MB flash. 256kB SRAM. 64MHz.

• Green light: A person

• Red light: Not a person

References

• [1] TensorFlow GitHub

• [2] mbed-os

• [3] TensorFlow Lite int8 quantization specification

• [4] Efficient handling of offsets

• [5] CMSIS-NN source repository

• [6] Enable CMSIS-NN on Tensorflow Lite for Microcontrollers

• [7] Person detection example readme

Thank YouDanke

Gracias谢谢

ありがとうAsanteMerci

감사합니다धन्यवाद

Kiitosشكرًا

ধন্যবাদתודה

Contact:

felixjohnny.thomasmathibalan@arm.com

fredrik.knutsson@arm.com

tinyML Talks Sponsors

Additional Sponsorships available – contact Bette@tinyML.org for info

tinyML Strategic Partner

Copyright Notice

This presentation in this publication was presented as a tinyML® Talks webcast. The content reflects the opinion of the author(s) and their respective companies. The inclusion of presentations in this publication does not constitute an endorsement by tinyML Foundation or the sponsors.

There is no copyright protection claimed by this publication. However, each presentation is the work of the authors and their respective companies and may contain copyrighted material. As such, it is strongly encouraged that any use reflect proper acknowledgement to the appropriate source. Any questions regarding the use of any materials presented should be directed to the author(s) or their companies.

tinyML is a registered trademark of the tinyML Foundation.

www.tinyML.org

Felix Johnny and Fredrik Knutsson - Arm - tinyML

Documents

Transcript of Felix Johnny and Fredrik Knutsson - Arm - tinyML

Oberg Fredrik

Session 36_1 Fredrik Bärthel

“Building TinyML Applications Using Rune”

Fredrik Johansson

tinyML EMEA 2021 Proceedings Cover 210607

Gamification - Fredrik Siren - Richen

Commercial Real Estate Investing Tips from Keith Knutsson

tinyML EMEA 2021 Proceedings Cover 210607...tinyML EMEA Technical Forum 2021 Proceedings June 7 –10, 2021 Virtual Event

Mapping presentation Fredrik Sandberg

Newsletter from Fredrik Naumann

Fredrik Tumegard, NEC

Fredrik Agerhem_Jarven_water india

Fredrik Eikeland Fossan

Session 61 Fredrik Eriksson

Fredrik Degerlund

Fredrik Naumann 2011 Review

Report TVGT-5064 JULIA KNUTSSON - Geoteknik

Gundersen Fredrik Balchen - Unit

Fredrik Workstation 120x72 735407 PUB

Fredrik Holmström