Felix Johnny and Fredrik Knutsson - Arm - tinyML

CMSIS-NN & Optimizations for Edge AI

Felix Johnny and Fredrik Knutsson - Arm

Sweden Area Group, February 8, 2021

Transcript of Felix Johnny and Fredrik Knutsson - Arm - tinyML

Page 1: Felix Johnny and Fredrik Knutsson - Arm - tinyML

“CMSIS-NN & Optimizations for Edge AI”

Felix Johnny and Fredrik Knutsson - Arm

Sweden Area Group – February 8, 2021

Page 2: Felix Johnny and Fredrik Knutsson - Arm - tinyML

tinyML Talks Sponsors

Additional Sponsorships available – contact [email protected] for info

tinyML Strategic Partner

Page 3: Felix Johnny and Fredrik Knutsson - Arm - tinyML

© 2020 Arm Limited (or its affiliates)

Arm: The Software and Hardware Foundation for tinyML

Optimized models for embedded

Application

Runtime (e.g. TensorFlow Lite Micro)

Optimized low-level NN libraries (i.e. CMSIS-NN)

Arm Cortex-M CPUs and microNPUs

RTOS such as Mbed OS

Profiling and debugging tooling such as Arm Keil MDK

Diagram callouts:

1. Connect to high-level frameworks

2. Supported by end-to-end tooling

3. Connect to Runtime

AI Ecosystem Partners

Resources: developer.arm.com/solutions/machine-learning-on-arm

Stay Connected

@ArmSoftwareDevelopers

@ArmSoftwareDev

Page 4: Felix Johnny and Fredrik Knutsson - Arm - tinyML

PAGE 58| Confidential Presentation ©2020 Deeplite, All Rights Reserved

BECOME BETA USER bit.ly/testdeeplite

WE USE AI TO MAKE OTHER AI FASTER, SMALLER AND MORE POWER EFFICIENT

Automatically compress SOTA models like MobileNet to <200KB with little to no drop in accuracy for inference on resource-limited MCUs

Reduce model optimization trial & error from weeks to days using Deeplite's design space exploration

Deploy more models to your device without sacrificing performance or battery life with our easy-to-use software

Page 5: Felix Johnny and Fredrik Knutsson - Arm - tinyML

Copyright © EdgeImpulse Inc.

TinyML for all developers

Get your free account at http://edgeimpulse.com

Dataset: Acquire valuable training data securely

Impulse: Enrich data and train ML algorithms

Test: Test impulse with real-time device data flows

Edge Device: Embedded and edge compute deployment options

Real sensors in real time

Open source SDK

Page 6: Felix Johnny and Fredrik Knutsson - Arm - tinyML

Maxim Integrated: Enabling Edge Intelligence

www.maximintegrated.com/ai

Sensors and Signal Conditioning

Health sensors measure PPG and ECG signals critical to understanding vital signs. Signal chain products enable measuring even the most sensitive signals.

Low Power Cortex M4 Micros

The biggest (3MB flash and 1MB SRAM) and the smallest (256KB flash and 96KB SRAM) Cortex M4 microcontrollers enable algorithms and neural networks to run at wearable power levels

Advanced AI Acceleration

The new MAX78000 implements AI inferences at over 100x lower energy than other embedded options. Now the edge can see and hear like never before.

Page 7: Felix Johnny and Fredrik Knutsson - Arm - tinyML

Qeexo AutoML for Embedded AI

Automated Machine Learning Platform that builds tinyML solutions for the Edge using sensor data

QEEXO AUTOML: END-TO-END MACHINE LEARNING PLATFORM

Key Features

▪ Wide range of ML methods: GBM, XGBoost, Random Forest, Logistic Regression, Decision Tree, SVM, CNN, RNN, CRNN, ANN, Local Outlier Factor, and Isolation Forest

▪ Easy-to-use interface for labeling, recording, validating, and visualizing time-series sensor data

▪ On-device inference optimized for low latency, low power consumption, and a small memory footprint

▪ Supports Arm® Cortex™-M0 to M4 class MCUs

▪ Automates complex and labor-intensive processes of a typical ML workflow, no coding or ML expertise required!

Target Markets/Applications

▪ Industrial Predictive Maintenance

▪ Smart Home

▪ Wearables

▪ Automotive

▪ Mobile

▪ IoT

For a limited time, sign up to use Qeexo AutoML at automl.qeexo.com for FREE to bring intelligence to your devices!

Page 8: Felix Johnny and Fredrik Knutsson - Arm - tinyML

Reality AI Tools® software is for building products

Automated Data Assessment

Automated Feature Exploration and Model Generation

Bill-of-Materials Optimization

Edge AI / TinyML code for the smallest MCUs

Reality AI solutions

Automotive sound recognition & localization

Indoor/outdoor sound event recognition

RealityCheck™ voice anti-spoofing

[email protected] | @SensorAI | Reality AI | https://reality.ai

Page 9: Felix Johnny and Fredrik Knutsson - Arm - tinyML

SynSense builds ultra-low-power (sub-mW) sensing and inference hardware for embedded, mobile and edge devices. We design systems for real-time always-on smart sensing, for audio, vision, IMUs, bio-signals and more.

https://SynSense.ai

Page 10: Felix Johnny and Fredrik Knutsson - Arm - tinyML

Next tinyML Talks

Date: Tuesday, February 16

Presenter: Mohammed Zubair, PhD, SMIEEE, Associate Professor / Consultant, Department of Electrical Engineering / Center for Artificial Intelligence, King Khalid University

Topic / Title: Oral Tongue Lesion Detection using TinyML on Embedded Devices

Webcast start time is 8 am Pacific time

Please contact [email protected] if you are interested in presenting

Page 11: Felix Johnny and Fredrik Knutsson - Arm - tinyML

Local Committee in Sweden

Ali Balador, Senior Researcher RISE, Assistant Professor MDH, [email protected]

Johan Malm, AI Engineer, PhD, Imagimob AB, [email protected]

Magnus Melander, Evangelist and Co-founder THINGS, [email protected]

Åke Wernelind, Business Development, Imagimob AB, [email protected]

Page 12: Felix Johnny and Fredrik Knutsson - Arm - tinyML

Reminders

youtube.com/tinyml

Slides & Videos will be posted tomorrow

tinyml.org/forums

Please use the Q&A window for your questions

Page 13: Felix Johnny and Fredrik Knutsson - Arm - tinyML

tinyML Foundation and Ecosystem introduction and Welcome tinyML Group in Sweden

Evgeni Gousev

Sr. Director, Qualcomm AI & Chairman, tinyML Foundation, BoD

February 8, 2021

Page 14: Felix Johnny and Fredrik Knutsson - Arm - tinyML

tinyML Foundation* Vision**

We see a new world with trillions of intelligent devices enabled by tinyML technologies that sense, analyze and autonomously act together to create a healthier and more sustainable environment for all

*tinyML Foundation is a non-profit, 501c3, professional and educational organization registered in Los Altos, CA, USA

tinyML and the tinyML logo are registered trademarks of tinyML Foundation

**adopted at tinyML Strategy leadership meeting on Dec 14, 2019

Page 15: Felix Johnny and Fredrik Knutsson - Arm - tinyML

tinyML Foundation Mission:

- to grow a prosperous and integrated Global Community of HW, SW and SYS scientists, engineers, designers, product and biz people, both experts and newcomers, developing leading edge tinyML technologies

- to promote and stimulate knowledge exchange between tinyML researchers to allow the field to move ahead at a high pace

- to inspire on the capabilities of tinyML and its potential of changing the way machine intelligence and data analytics at the very edge of the physical and digital world occur

- to connect tinyML technologies and innovations to enormous product and business opportunities and value creation across the whole ecosystem and industry verticals

Page 16: Felix Johnny and Fredrik Knutsson - Arm - tinyML

What is tinyML* ?

tinyML is broadly defined as machine learning architectures, techniques, tools and approaches capable of performing on-device analytics for a variety of sensing modalities (vision, audio, motion, chemical, etc.) at “mW” (or below) power range targeting predominately battery operated devices

* tinyML is a full-stack approach and ecosystem, including HW-SYS-SW-applications

Page 17: Felix Johnny and Fredrik Knutsson - Arm - tinyML

Why is the tinyML opportunity so enormous?

Data is a new oil (electricity) and ML is a way to produce it

Cloud ML

• DNN on the cloud

• HW: TPU, FPGA, GPU, CPU

Edge ML

• Optimized algos and CNN-light

• SoC (with NPUs/NSP accelerators)

tiny ML

• CNN-micro

• MCU w/ HW accelerators

Data Sources:

Storage and sharing

User provided:

1. Pics

2. Audio

3. Clicks/likes

4. GPS, location based

Real-time in the physical world

CMOS cameras, IR cameras, IMUs, audio mics, environmental/chemical sensors, temperature sensors, optical sensors

1%

4%

95%

Page 18: Felix Johnny and Fredrik Knutsson - Arm - tinyML

Climbing up tinyML mountain

Step 0: Tech Feasibility

Step 1: Building Awareness

Step 2: Initial tinyML Products

Step 3: tinyML Killer apps

Step 4: tinyML everywhere

Explosive growth

Trillions of devices

Timeline markers on the figure: now, 1-2 years, 3-5 years, 10 years

Page 19: Felix Johnny and Fredrik Knutsson - Arm - tinyML

tinyML meetups global growth

https://www.meetup.com/pro/tinyml/

(4.5k+ members in 22 countries, in 18 months)

tinyML Group in Sweden: 62 members in 2 months

Page 20: Felix Johnny and Fredrik Knutsson - Arm - tinyML

Announcement:

Highlights: www.tinyML.org/summit2021

- Keywords: Premier Quality, Interactive, LIVE … and FREE

- 5 days, 50+ presentations

- 4 Tutorials

- 2 Panel discussions: (i) VC and (ii) tinyML toolchains

- tinyML Research Symposium

- Late Breaking News

- 3 Best tinyML Awards (Paper, Product, Innovation)

- 10+ Breakout sessions on various topics

- tinyML Partner sessions

- tinyAI for (Good) Life

- LIVE coverage, starting at 8am Pacific time

What should I do about it:

- Check out the program, you will be impressed

- Register on-line (takes 5 min, and 1000+ already registered in 5 days)

- If interested: Submit nominations for Best Awards and/or Late News, February 28 deadline

- Block out your calendar: March 22-26

- Become a sponsor ([email protected])

- Provide your feedback, we listen!

- Don’t worry about missing some talks, all videos will be posted on YouTube.com/tinyML

Page 21: Felix Johnny and Fredrik Knutsson - Arm - tinyML

tinyML Summits are growing fast

                      2019 Summit     2020 Summit     2021 Summit
                      (March 2019)    (Feb 2020)      (March 2021), expected

Attendees             160             400+            3000+

Companies             90              172             300+ ?

LinkedIn members      0               798             ~ 2000

Meetups members       0               1140            ~ 5000

YouTube subscribers   0               0               ~ 3000

Timeline on the figure: 2018, 2019, 2020, 2021

also started in Asia: tinyML WeChat and BiliBili

Page 22: Felix Johnny and Fredrik Knutsson - Arm - tinyML

2021 Summit Sponsors (as of Feb 3, 2021; more in the pipeline)

FREE registration for the whole week of the Summit due to the Sponsors:

Page 23: Felix Johnny and Fredrik Knutsson - Arm - tinyML

Interested in joining tinyML ecosystem?

www.tinyML.org

Page 24: Felix Johnny and Fredrik Knutsson - Arm - tinyML

© 2021 Arm

Felix Johnny & Fredrik Knutsson

Machine Learning Group

CMSIS-NN : Optimization for Edge AI

Page 25: Felix Johnny and Fredrik Knutsson - Arm - tinyML

Felix Johnny

Felix Johnny is the maintainer of Arm’s open source CMSIS-NN library that targets optimized Neural Network kernels for Cortex-M CPUs. He has spent most of the last 15 years in the wireless domain working with software design and optimizations in memory and cycle constrained systems. Outside of work, he is an active music photographer.

Page 26: Felix Johnny and Fredrik Knutsson - Arm - tinyML

Fredrik Knutsson

Fredrik Knutsson is the technical lead for the Arm team working on Ethos-U55 and Cortex-M integration into embedded runtimes. He holds an M.Sc. in electrical engineering from Chalmers University of Technology. Fredrik has more than 15 years of experience in the embedded software domain, doing mainly software architecture and system design. For the past four years he has been working for Arm, and he has previous experience from the wireless, wearable and automotive businesses.

Page 27: Felix Johnny and Fredrik Knutsson - Arm - tinyML

20 © 2021 Arm

Executive Summary

Introduction to Edge AI and Arm Cortex-M Processors

Preparing for Optimization

CMSIS-NN : Library Optimizations

CMSIS-NN demo

Page 28: Felix Johnny and Fredrik Knutsson - Arm - tinyML

21 © 2021 Arm

Orienting Edge AI

Cloud side: Training

Train the model

• High performance

• Throughput oriented

• Large data sets

High performance compute, servers, GPU

Edge device: Inference

Apply the model

• Low latency, efficiency

• Low/mid performance

• Security & privacy

Embedded systems, heterogeneous processing

From training to inference: float to (u)int8

Page 29: Felix Johnny and Fredrik Knutsson - Arm - tinyML

22 © 2021 Arm

Why Target Arm Cortex-M Processors?

Accelerate deployment of Machine Learning on edge devices

Power efficient processors that fit TinyML requirements

Availability of exciting new models with a lower memory footprint and fewer multiply-accumulate (MAC) operations

MobileNet V2

• 3.38 MB parameters

• ~ 307 million MACs

• Data Type: float

Person Detect (TFLM)

• 250 kBytes

• ~ 7 million MACs

• Data Type: int8

+52 billion Cortex-M based chips shipped**

**Source: Arm data

Page 30: Felix Johnny and Fredrik Knutsson - Arm - tinyML

23 © 2021 Arm

Processor’s View of a NN Model

Connected compute blocks

Input -> Compute Layer 0 -> Compute Layer 1 -> … -> Compute Layer N-1 -> Output

Page 31: Felix Johnny and Fredrik Knutsson - Arm - tinyML

© 2020 Arm Limited (or its affiliates)

Optimizing Neural Networks for TinyML

Why & How

Page 32: Felix Johnny and Fredrik Knutsson - Arm - tinyML

25 © 2021 Arm

Why Optimize for TinyML?

Power constraints

• Extend battery life of edge devices by reducing awake time.

Cycle constraints

• Enable more complex models to be deployed within the inference budget.

• Meet real time latency constraints.

Page 33: Felix Johnny and Fredrik Knutsson - Arm - tinyML

26 © 2021 Arm

It Takes a Village to Raise a Child

Everyone has a role to play in optimization, some more important than others

Optimization

Model Optimizations

SW Environment Settings

Library Optimizations

HW Design

Page 34: Felix Johnny and Fredrik Knutsson - Arm - tinyML

27 © 2020 Arm Limited

What is CMSIS-NN?

Optimized Neural Network kernels[5] for Arm Cortex-M CPUs

Application code, TFL for Microcontrollers

CMSIS components:

CMSIS-NN: Machine learning

CMSIS-DSP: Signal processing

CMSIS-RTOS: Real-time execution

CMSIS-Driver: Middleware interface

CMSIS-CORE: Processor core and peripheral access

CMSIS-SVD: Peripheral description

CMSIS-DAP: Debug access

CMSIS-Zone: System Partitioning

CMSIS-Pack

Peripheral HAL: Device specific

System-on-chip: Arm® Cortex® processor, specialized peripherals, communication peripherals, CoreSight™ debug logic, access filter (MPU, SAU)

Debugger

Page 35: Felix Johnny and Fredrik Knutsson - Arm - tinyML

© 2021 Arm

Preparing for Library Optimizations

Optimizations

Model Optimizations

SW Environment Settings

Library Optimizations

HW design

Page 36: Felix Johnny and Fredrik Knutsson - Arm - tinyML

29 © 2020 Arm Limited (or its affiliates)

Setup for Data Used

Mbed based

TensorFlow[1] #2025bf6f68

Mbed[2] #e642a7d8b3

Default optimization level used: Ofast

Default compiler version: GNU Arm Embedded Toolchain 9.3.1 20200408

Processor: Arm Cortex-M7 @ 216 MHz

Device: STM32F746ZG

Memory layout of operators: TensorFlow Lite for Microcontrollers (TFLM)

Model: Person detect from TFLM, int8 symmetric quantized[3]

Page 37: Felix Johnny and Fredrik Knutsson - Arm - tinyML

30 © 2021 Arm

Cycle Measurements

Non-Optimized kernels

Methodology (shown on a snippet of the Person Detect model):

1. Reset cycle counter

2. After a layer has run: log cycle count & reset counter

3. Repeat for the following layers

Grouped cycle measurement:

Conv: 61.40%

DW Conv: 38.56%

Remaining operator groups: 0.04%, 0.00%, 0.01%
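The measurement loop on this slide can be sketched in C as follows. This is a host-side illustration only: the counter is simulated, and `layer_t`, `run_layer`, `profile_model` and the operator grouping are hypothetical names, not taken from TFLM or CMSIS-NN. On a real target the two counter helpers would wrap a hardware cycle counter.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Operator groups used for reporting (assumed grouping). */
typedef enum { OP_CONV = 0, OP_DW_CONV, OP_OTHER, OP_GROUP_COUNT } op_group_t;

/* Hypothetical layer descriptor. simulated_cycles exists only so that
 * this sketch can run on a host without real hardware. */
typedef struct {
    op_group_t group;
    uint32_t simulated_cycles;
} layer_t;

/* Simulated cycle counter standing in for a hardware counter. */
static uint32_t g_cycle_counter;

static void cycle_counter_reset(void) { g_cycle_counter = 0u; }
static uint32_t cycle_counter_read(void) { return g_cycle_counter; }

/* Stand-in for executing one layer of the model. */
static void run_layer(const layer_t *layer)
{
    g_cycle_counter += layer->simulated_cycles;
}

/* Reset once, then after every layer: log the count into the layer's
 * group and reset the counter again. */
static void profile_model(const layer_t *layers, int num_layers,
                          uint64_t totals[OP_GROUP_COUNT])
{
    assert(layers != NULL && num_layers >= 0);
    for (int g = 0; g < OP_GROUP_COUNT; ++g) {
        totals[g] = 0u;
    }
    cycle_counter_reset();
    for (int i = 0; i < num_layers; ++i) {
        run_layer(&layers[i]);
        totals[layers[i].group] += cycle_counter_read();
        cycle_counter_reset();
    }
}

/* Share of all logged cycles spent in one group, in percent. */
static double group_share_percent(const uint64_t totals[OP_GROUP_COUNT],
                                  op_group_t group)
{
    uint64_t sum = 0u;
    for (int g = 0; g < OP_GROUP_COUNT; ++g) {
        sum += totals[g];
    }
    return sum == 0u ? 0.0 : 100.0 * (double)totals[group] / (double)sum;
}
```

Resetting after every log keeps each reading local to one layer, so the logging overhead of earlier layers does not accumulate into later readings.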

Page 38: Felix Johnny and Fredrik Knutsson - Arm - tinyML

31 © 2021 Arm

Setting a Target for Optimization

The library optimizer

Cycle Bound Analysis: Processor Capability, Processor Utilization

Memory Bound Analysis: Memory Capability, Data Layout

Page 39: Felix Johnny and Fredrik Knutsson - Arm - tinyML

32 © 2021 Arm

Processor Capability & Utilization

$$P_{util} = \frac{C_t}{C_a}, \qquad P_{util} \in [0, 1]$$

$C_t$: Speed of Light (SoL) or Capability. Best-case cycles for execution of a given arithmetic operator

$C_a$: Actual cycles for execution

Processor: SoL or Capability, in MAC/cycle

Cortex-M4: 1

Cortex-M7: 2

Cortex-M55: 8

Sequence: Load (LDR) -> Multiply Accumulate (MAC) -> LDR -> MAC -> … -> Store (STR)
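As a rough worked example of the utilization formula, the sketch below computes best-case cycles from a MAC count and the MAC/cycle capability, then divides by measured cycles. The function names are hypothetical, and treating a layer as MAC-only is a simplifying assumption. With the approximate figures from the slides (about 7 million MACs, 2 MAC/cycle on Cortex-M7), the best case is about 3.5 million cycles, or roughly 16 ms at 216 MHz.

```c
#include <assert.h>
#include <stdint.h>

/* Best-case cycles C_t for a MAC-only workload, rounded up.
 * macs_per_cycle is the SoL capability of the processor. */
static uint64_t sol_cycles(uint64_t mac_count, uint32_t macs_per_cycle)
{
    assert(macs_per_cycle > 0u);
    return (mac_count + macs_per_cycle - 1u) / macs_per_cycle;
}

/* P_util = C_t / C_a, a value in [0, 1] when C_t <= C_a. */
static double processor_utilization(uint64_t sol, uint64_t actual)
{
    assert(actual > 0u && sol <= actual);
    return (double)sol / (double)actual;
}
```

A result close to 1 means the kernel runs near the processor's MAC capability; a small value signals headroom, subject to the caveat that real kernels do more than MACs.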

Page 40: Felix Johnny and Fredrik Knutsson - Arm - tinyML

33 © 2021 Arm

Processor Utilization for Person Detect Model

TFLM reference kernels

CYCLES: Conv 61.4%, DW conv 38.6%

MAC OPERATIONS: Conv 86.5%, DW conv 13.5%

TFLM reference processor utilization, Cortex-M7: DW conv 1.7%, Conv 6.9%

Unfair representation, as conv and DW conv do more than just MACs.

One takeaway is that there is scope for improvement.

Page 41: Felix Johnny and Fredrik Knutsson - Arm - tinyML

34 © 2021 Arm

Memory Bound Analysis

Theoretical bandwidth - Maximum data transfer rate for a given hardware specification.

Effective bandwidth - Actual data transfer rate for an operation, for example the reads and writes of weights, inputs and outputs in a convolution operation.

Different memory blocks (on and off chip) are involved.

Often it is about reducing the effective bandwidth.

But, memory analysis on its own does not paint the full picture.

Page 42: Felix Johnny and Fredrik Knutsson - Arm - tinyML

35 © 2021 Arm

Finding a Direction for Optimization

Cycle and memory analysis give an indication of how effective an algorithm is and point to areas of improvement.

Optimization is about striking a balance between the two.

Cycle bound

Memory bound

Page 43: Felix Johnny and Fredrik Knutsson - Arm - tinyML

© 2021 Arm

Optimizations

Model Optimizations

SW Environment Settings

Library Optimizations

HW design

Page 44: Felix Johnny and Fredrik Knutsson - Arm - tinyML

37 © 2021 Arm

Use Case: Fully Connected

Simpler case of what the im2col method targets for a convolution or DW conv

Input: N elements of 1 byte each, ip_0 ip_1 .. ip_n-1, input @ addr_x

Filter 0 @ addr_y: f0_0 .. f0_n-1

Filter 1 @ addr_y + n: f1_0 .. f1_n-1

⋮

Output: M elements

Input and weights are contiguous in memory

38 © 2021 Arm

Step 1: Working on the Memory Bound Aspect

One filter row per pass:

..

loop_k, loop_i {

in = input[i] + input_offset

sum += in * (filter[k][i] + filter_offset)

}

Two filter rows per pass (row k + 1 starts n bytes after row k):

..

loop_k (step 2), loop_i {

in = input[i] + input_offset

sum_0 += in * (filter[k][i] + filter_offset)

sum_1 += in * (filter[k + 1][i] + filter_offset)

}

Number of input vector reads is halved

Reducing memory reads and reusing loaded values is a vital aspect of optimization
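The two loops above can be written out as a minimal C sketch. This is not the CMSIS-NN implementation: the function names are hypothetical, the filter is assumed to be stored row by row with n bytes per row, and bias and requantization are left out to keep the focus on input reuse.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Baseline: one filter row per pass, so the input vector is read m times. */
static void fc_s8_baseline(const int8_t *input, const int8_t *filter,
                           int32_t input_offset, int32_t filter_offset,
                           size_t n, size_t m, int32_t *output)
{
    assert(input != NULL && filter != NULL && output != NULL);
    for (size_t k = 0; k < m; ++k) {
        int32_t sum = 0;
        for (size_t i = 0; i < n; ++i) {
            const int32_t in = input[i] + input_offset;
            sum += in * (filter[k * n + i] + filter_offset);
        }
        output[k] = sum;
    }
}

/* Two filter rows per pass: each input element is loaded once and used
 * for two accumulators, halving the number of input vector reads. */
static void fc_s8_two_rows(const int8_t *input, const int8_t *filter,
                           int32_t input_offset, int32_t filter_offset,
                           size_t n, size_t m, int32_t *output)
{
    assert(input != NULL && filter != NULL && output != NULL);
    size_t k = 0;
    for (; k + 1 < m; k += 2) {
        const int8_t *row_0 = filter + k * n;
        const int8_t *row_1 = row_0 + n; /* next row is n bytes further */
        int32_t sum_0 = 0;
        int32_t sum_1 = 0;
        for (size_t i = 0; i < n; ++i) {
            const int32_t in = input[i] + input_offset;
            sum_0 += in * (row_0[i] + filter_offset);
            sum_1 += in * (row_1[i] + filter_offset);
        }
        output[k] = sum_0;
        output[k + 1] = sum_1;
    }
    if (k < m) { /* leftover row when m is odd */
        const int8_t *row = filter + k * n;
        int32_t sum = 0;
        for (size_t i = 0; i < n; ++i) {
            sum += (input[i] + input_offset) * (row[i] + filter_offset);
        }
        output[k] = sum;
    }
}
```

Both functions produce identical outputs; only the number of input loads differs.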

Why don’t I unroll it all the way?

Page 46: Felix Johnny and Fredrik Knutsson - Arm - tinyML

39 © 2021 Arm

Well, There are Limits to Effective Unrolling

Process two filter rows in the same loop

..

loop_i {

in = input[i] + input_offset

sum_0 += in * (filter[k][i] + filter_offset)

sum_1 += in * (filter[k + 1][i] + filter_offset)

}

Variables in use: Potential register map

Address of filter_0: R0

Address of filter_1: R1

Address of input: R2

Value of input: R3

Value of filter_0: R4

Value of filter_1*: R5

sum 0: R6

sum 1: R7

Input offset: R8

Filter offset: R9

Loop count: R10

Unused: R11-R12

* Could be reused

Depends on the number of general purpose (GP) & vector registers available.

A guideline (not a rule) is to unroll until there are no register spills/reloads

Page 47: Felix Johnny and Fredrik Knutsson - Arm - tinyML

40 © 2021 Arm

Does the Compiler Output Match Expectations ?

Code snippet from arm_nn_vec_mat_mult_t_s8.c (CMSIS-NN), processing two filter rows in one loop

// lhs -> Input vector

// rhs -> Filter

Annotations on the compiler output:

// Load single byte (sb)

// Add filter offset

// Loop count compare. Sets condition bit N

// one multiply accumulate

// Branch on condition bit N

Pre/post increment of pointers is done with the load

Vital benefit of a constant address increment

Page 48: Felix Johnny and Fredrik Knutsson - Arm - tinyML

41 © 2021 Arm

Step 2: SIMD* for MAC

Processor: Extension: SoL in MAC/cycle

Cortex-M4: DSP: 1

Cortex-M7: DSP: 2

Cortex-M55: MVE (Helium Technology): 8

SoL for MAC sequence: LDR->MAC->LDR->MAC->…->STR

DSP extension

.LBB1_2: @ %for.body
ldr r12, [r0], #4
ldr r4, [r1], #4
smlad r2, r12, r4, r2 // 2 MAC operations
le lr, .LBB1_2

• Handles even number of MAC operations

• Tail elements are handled separately

MVE extension

.LBB0_2: @ %for.body
vldrb.u8 q0, [r0], #16
vldrb.u8 q1, [r1], #16
vmlava.s8 r12, q0, q1 // setup for 16 MAC operations
letp lr, .LBB0_2

• Tail predicated loops allow for handling both odd and even element lengths.

*Single Instruction Multiple Data
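To make the DSP-extension case concrete, here is a portable C emulation of a dual 16-bit multiply-accumulate, together with the separate tail handling the slide mentions. It is a host-side sketch, not the hardware instruction or a CMSIS intrinsic; `pack_s16x2`, `dual_mac_s16` and `dot_s16_dual_mac` are hypothetical names, and the explicit packing stands in for what a single 32-bit load does on the target.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Pack two int16 values into one 32-bit word:
 * lo in bits 0..15, hi in bits 16..31. */
static uint32_t pack_s16x2(int16_t lo, int16_t hi)
{
    return (uint32_t)(uint16_t)lo | ((uint32_t)(uint16_t)hi << 16);
}

/* Interpret the low 16 bits of a word as a signed value. */
static int32_t sign_extend_16(uint32_t half)
{
    const int32_t v = (int32_t)(half & 0xFFFFu);
    return v >= 0x8000 ? v - 0x10000 : v;
}

/* Emulated dual MAC: acc + lo(x) * lo(y) + hi(x) * hi(y). */
static int32_t dual_mac_s16(uint32_t x, uint32_t y, int32_t acc)
{
    const int32_t lo = sign_extend_16(x) * sign_extend_16(y);
    const int32_t hi = sign_extend_16(x >> 16) * sign_extend_16(y >> 16);
    return acc + lo + hi;
}

/* Dot product: pairs go through the dual MAC, an odd last element
 * is handled separately as the tail. */
static int32_t dot_s16_dual_mac(const int16_t *a, const int16_t *b, size_t len)
{
    assert(a != NULL && b != NULL);
    int32_t acc = 0;
    size_t i = 0;
    for (; i + 1 < len; i += 2) {
        acc = dual_mac_s16(pack_s16x2(a[i], a[i + 1]),
                           pack_s16x2(b[i], b[i + 1]), acc);
    }
    if (i < len) {
        acc += (int32_t)a[i] * b[i];
    }
    return acc;
}
```

The tail branch is exactly what a tail-predicated MVE loop makes unnecessary.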

Page 49: Felix Johnny and Fredrik Knutsson - Arm - tinyML

42 © 2021 Arm

Step 3: Further Simplification of Core Loop

Input/filter offset can be handled outside the core loop[4]

Reduces the number of load and add operations done in core loop.

Frees up registers that can be used for deeper unrolling instead.
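One way to see why the offsets can leave the core loop is the identity sum((x_i + xo) * (w_i + wo)) = sum(x_i * w_i) + xo * sum(w_i) + wo * sum(x_i) + n * xo * wo. The sketch below checks it in C. It illustrates the idea under that assumption and is not necessarily the exact scheme described in reference [4]; all function names are hypothetical. The weight sum depends only on constant data, and the input sum is shared by every filter row, so neither needs to be recomputed per row.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Offsets applied inside the core loop: two extra adds per element. */
static int32_t dot_offsets_inside(const int8_t *x, const int8_t *w, size_t n,
                                  int32_t x_offset, int32_t w_offset)
{
    assert(x != NULL && w != NULL);
    int32_t sum = 0;
    for (size_t i = 0; i < n; ++i) {
        sum += (x[i] + x_offset) * (w[i] + w_offset);
    }
    return sum;
}

/* Plain sum of a vector; can be computed once and reused. */
static int32_t sum_s8(const int8_t *v, size_t n)
{
    int32_t sum = 0;
    for (size_t i = 0; i < n; ++i) {
        sum += v[i];
    }
    return sum;
}

/* Core loop reduced to pure multiply-accumulate. */
static int32_t dot_core(const int8_t *x, const int8_t *w, size_t n)
{
    int32_t sum = 0;
    for (size_t i = 0; i < n; ++i) {
        sum += (int32_t)x[i] * w[i];
    }
    return sum;
}

/* Offsets folded in after the loop using precomputed sums. */
static int32_t dot_offsets_hoisted(const int8_t *x, const int8_t *w, size_t n,
                                   int32_t x_offset, int32_t w_offset,
                                   int32_t sum_x, int32_t sum_w)
{
    return dot_core(x, w, n) + x_offset * sum_w + w_offset * sum_x +
           (int32_t)n * x_offset * w_offset;
}
```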

Page 50: Felix Johnny and Fredrik Knutsson - Arm - tinyML

43 © 2021 Arm

The Essentials of Optimization

• Memory access optimizations

  • Reducing relevant memory accesses while staying within the available GP/vector register constraint.

• Keep it Simple Stupid (KISS) optimizations

  • Constant pointer increment/decrement in core loops

  • Simplify the core loop further by moving out input/filter offset adjustments

• Processor capability optimization

  • Single Instruction Multiple Data optimizations

Page 51: Felix Johnny and Fredrik Knutsson - Arm - tinyML

44 © 2021 Arm

Model Hyperparameter: The Unaligned Aspect

Input channel not a multiple of 4

3 input channels

3x3x3 => 27 byte filter block in memory

byte 0 byte 1 .. byte 26

Parts of the 16 blocks (output channels) of the kernel result in unaligned access

Page 52: Felix Johnny and Fredrik Knutsson - Arm - tinyML

45 © 2021 Arm

Impact of Aligned Access

Change from 3 input channels to 4

MAC and performance impact in using 4 input channels:

33.3% increase in MACs

21.1% reduction in cycles

Applicable for other operators as well

Memory alignment aware shapes get the best out of CMSIS-NN
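The arithmetic behind these numbers can be sketched in C: a 3x3x3 filter block is 27 bytes, which is not a multiple of 4, while 3x3x4 gives 36 bytes, and the MAC count grows by one third. The helper names are hypothetical. Zero-padding an existing 3-channel tensor, as `pad_channels_s8` does, is only one assumed way to reach 4 channels; the slides do not say how the model was changed.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Size in bytes of one int8 filter block (one output channel). */
static size_t filter_block_bytes(size_t kernel_h, size_t kernel_w, size_t in_ch)
{
    return kernel_h * kernel_w * in_ch;
}

/* True when consecutive blocks keep 4-byte word alignment. */
static int is_word_multiple(size_t bytes)
{
    return bytes % 4u == 0u;
}

/* Relative MAC increase, in percent, when widening the input channels. */
static double mac_increase_percent(size_t old_ch, size_t new_ch)
{
    assert(old_ch > 0u && new_ch >= old_ch);
    return 100.0 * (double)(new_ch - old_ch) / (double)old_ch;
}

/* Copy a channel-last int8 tensor, appending zero channels per pixel. */
static void pad_channels_s8(const int8_t *src, size_t pixels, size_t in_ch,
                            size_t padded_ch, int8_t *dst)
{
    assert(src != NULL && dst != NULL && padded_ch >= in_ch);
    for (size_t p = 0; p < pixels; ++p) {
        for (size_t c = 0; c < padded_ch; ++c) {
            dst[p * padded_ch + c] = c < in_ch ? src[p * in_ch + c] : 0;
        }
    }
}
```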

Page 53: Felix Johnny and Fredrik Knutsson - Arm - tinyML

© 2021 Arm

Deploy CMSIS-NN using TensorFlow Lite for Microcontrollers

Page 54: Felix Johnny and Fredrik Knutsson - Arm - tinyML

47 © 2021 Arm

TensorFlow Lite for Microcontrollers (TFLM)

• Version of TensorFlow Lite designed to execute neural networks on microcontrollers, starting at only a few kB of memory

• Designed to be portable even to 'bare metal' systems

• The core runtime is ~20kB.

• Examples/demos

• Micro speech: Detects simple commands such as yes, no and silence.

• Person detection: Detects whether a person is in the room or not.

• Magic wand: Detect gestures using an accelerometer.

• Over 50 operators supported currently

• Many integrated operator optimizations

Page 55: Felix Johnny and Fredrik Knutsson - Arm - tinyML

48 © 2021 Arm

CMSIS-NN & TensorFlow Lite for Microcontrollers

Access to optimized kernels through TFLM

• Optimized kernels are enabled using OPTIMIZED_KERNEL_DIR=cmsis_nn in the TFLM build system

• Use optimized kernels when available, otherwise fall back to TFLM reference kernels

• CMSIS-NN “glue” is here:

  • /tensorflow/lite/micro/kernels/cmsis-nn

Software stack: Application -> TensorFlow Lite micro interpreter -> Ref kernels / CMSIS-NN kernels -> Arm Cortex-M CPU

Page 56: Felix Johnny and Fredrik Knutsson - Arm - tinyML

49 © 2021 Arm

Optimize Where it Matters...

...and always have a fallback path

• Reference kernels for availability

• For more horsepower - CMSIS-NN

• For most horsepower - Ethos-U55

Table columns: Kernel, TFLM reference implementation, CMSIS-NN (fast), NPU (faster)

Kernel 1 ✓ ✓ ✓

Kernel 2 ✓ ✓ ✓

Kernel 3 ✓ ✓ ✓

Kernel 4 ✓ ✓ ✓

Kernel 5 ✓ ✓

Kernel 6 ✓

Kernel 7 ✓
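The "always have a fallback path" idea can be illustrated with a small C sketch: every kernel has a reference implementation, and an optimized one is used only where it exists. The table, `kernel_entry_t` and `select_kernel` are hypothetical and purely illustrative. In TFLM the choice is driven by the build option shown earlier, not by a run-time table like this one.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical kernel signature used only for this illustration. */
typedef void (*kernel_fn)(const int8_t *in, int8_t *out, size_t n);

typedef struct {
    const char *name;
    kernel_fn reference; /* always present */
    kernel_fn optimized; /* NULL when no optimized version exists */
} kernel_entry_t;

/* Reference ReLU: simple element-wise loop. */
static void relu_reference(const int8_t *in, int8_t *out, size_t n)
{
    for (size_t i = 0; i < n; ++i) {
        out[i] = in[i] > 0 ? in[i] : 0;
    }
}

/* Stand-in for an optimized ReLU: same result, two elements per pass. */
static void relu_optimized(const int8_t *in, int8_t *out, size_t n)
{
    size_t i = 0;
    for (; i + 1 < n; i += 2) {
        out[i] = in[i] > 0 ? in[i] : 0;
        out[i + 1] = in[i + 1] > 0 ? in[i + 1] : 0;
    }
    if (i < n) {
        out[i] = in[i] > 0 ? in[i] : 0;
    }
}

/* Prefer the optimized kernel, otherwise fall back to the reference. */
static kernel_fn select_kernel(const kernel_entry_t *entry)
{
    assert(entry != NULL && entry->reference != NULL);
    return entry->optimized != NULL ? entry->optimized : entry->reference;
}
```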

Page 57: Felix Johnny and Fredrik Knutsson - Arm - tinyML

50 © 2021 Arm

What kernels are optimized?

• Quantization: Optimizations are available for int8, per-channel, quantized data [3].

• Operators: The majority of compute occurs in a handful of ops

• We’re open to optimizing new ops!

  • Open a GitHub ticket on CMSIS repo [5]

Example use cases: Keyword spotting, Face recognition, Image classification, Human activity

Page 58: Felix Johnny and Fredrik Knutsson - Arm - tinyML

51 © 2021 Arm

Person Detection Demo

Available on the TensorFlow repository [7]

• Model:

  • ~300kB flash (weight and bias)

  • ~100kB SRAM (activations etc.)

  • 31 layers, ~7 million MACs

  • Depthwise conv, Conv, Average pool, Softmax, Reshape

• Input:

  • 96x96 pixel 8-bit grayscale image

• Output:

  • Two 8-bit values: Person score and no person score

Page 59: Felix Johnny and Fredrik Knutsson - Arm - tinyML

52 © 2021 Arm

The Hardware

Arduino Nano 33 BLE Sense + Arducam Mini 2MP Plus

• Powered by Arm’s Cortex-M4 CPU

• 1 MB flash. 256kB SRAM. 64MHz.

• Green light: A person

• Red light: Not a person
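A minimal C sketch of how the demo's input size and LED decision could be expressed. The input geometry comes from the slides; the rule "green when the person score is higher" is an assumption about how the two scores are compared, and all names are hypothetical rather than taken from the example code. One 96x96 8-bit frame needs 9216 bytes, a small part of the ~100kB SRAM budget quoted for the model.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Input geometry from the slides: 96x96 pixels, 8-bit grayscale. */
enum { IMAGE_WIDTH = 96, IMAGE_HEIGHT = 96, IMAGE_CHANNELS = 1 };

/* Hypothetical LED states mirroring the green/red indication. */
typedef enum { LED_RED = 0, LED_GREEN = 1 } led_t;

/* Bytes needed to hold one input frame. */
static size_t input_frame_bytes(void)
{
    return (size_t)IMAGE_WIDTH * IMAGE_HEIGHT * IMAGE_CHANNELS;
}

/* Row-major index of pixel (x, y) in the frame buffer. */
static size_t pixel_index(size_t x, size_t y)
{
    assert(x < (size_t)IMAGE_WIDTH && y < (size_t)IMAGE_HEIGHT);
    return y * (size_t)IMAGE_WIDTH + x;
}

/* Assumed decision rule: green when the person score is strictly
 * higher than the no-person score, red otherwise. */
static led_t led_for_scores(int8_t person_score, int8_t no_person_score)
{
    return person_score > no_person_score ? LED_GREEN : LED_RED;
}
```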

Page 60: Felix Johnny and Fredrik Knutsson - Arm - tinyML

53 © 2020 Arm Limited (or its affiliates)

References

• [1] TensorFlow GitHub

• [2] mbed-os

• [3] TensorFlow Lite int8 quantization specification

• [4] Efficient handling of offsets

• [5] CMSIS-NN source repository

• [6] Enable CMSIS-NN on Tensorflow Lite for Microcontrollers

• [7] Person detection example readme

Page 61: Felix Johnny and Fredrik Knutsson - Arm - tinyML

© 2021 Arm

Thank You, Danke, Gracias, 谢谢, ありがとう, Asante, Merci, 감사합니다, धन्यवाद, Kiitos, شكرًا, ধন্যবাদ, תודה

Contact:

[email protected]

[email protected]

Page 63: Felix Johnny and Fredrik Knutsson - Arm - tinyML

Copyright Notice

This presentation in this publication was presented as a tinyML® Talks webcast. The content reflects the opinion of the author(s) and their respective companies. The inclusion of presentations in this publication does not constitute an endorsement by tinyML Foundation or the sponsors.

There is no copyright protection claimed by this publication. However, each presentation is the work of the authors and their respective companies and may contain copyrighted material. As such, it is strongly encouraged that any use reflect proper acknowledgement to the appropriate source. Any questions regarding the use of any materials presented should be directed to the author(s) or their companies.

tinyML is a registered trademark of the tinyML Foundation.

www.tinyML.org