Comprehensive Arm Solutions for Innovative Machine Learning (ML) and Computer Vision (CV) Applications

Helena Zheng, ML Group, Arm

Arm Technical Symposia 2017, Taipei

© 2017 Arm Limited

Page 1:

Comprehensive Arm Solutions for Innovative Machine Learning (ML) and Computer Vision (CV) Applications

Helena Zheng, ML Group, Arm

Arm Technical Symposia 2017, Taipei

Page 2:

Machine Learning is a Subset of Artificial Intelligence

AI means many things to many people

Artificial Intelligence

Machine Learning

Perception & Vision

Natural Language Processing

Knowledge Representation

Planning & Navigation

Generalized Intelligence

ML itself has a lot of depth

Page 3:

Why Is Artificial Intelligence (AI) Exploding Now?

Availability of increased data sourced at the edge, together with ubiquitous, powerful compute

[Figure: two panels, Compute and Data. Data panel: IP traffic of 1 zettabyte in 2016 and 2.3 zettabytes in 2020. The year labels 2010 and 2015 also appear in the figure.]

Page 4:

AI Presents Significant Opportunity for Innovation

Robotics

Home, surveillance & analytics

VR/MR

IoT

Shipping & logistics

Mobile

Drones

Automotive

Page 5:

Distributing Intelligence

Security and privacy for your data

Cloud-based training

High-performance processing

AI in your hand & cloud

Real-time inference for autonomous systems

On-device learning

Page 6:

Why is On-device ML Driving AI to the Edge?

Bandwidth

Privacy

Latency

Cost

Power

Page 7:

Arm ML Platform Enables

Flexibility

Efficiency

Freedom

Page 8:

Arm’s ML Computing Platform

[Diagram: Arm ML platform stack, labels as extracted]

Applications: AI applications incorporating ML, CV, speech recognition, etc.

Neural network frameworks (e.g. TensorFlow, Caffe, AndroidNN)

Libraries: Compute library, Spirit metadata library, optional Spirit libraries & model sets

Stable SW interfaces

Edge devices: CPU, CPU with SVE, GPU, Spirit Computer Vision

Tools: Arm DS-5 / Keil tools / compilers / drivers

Page 9:

Components of Arm ML Platform

Software

Specialized Acceleration

Hardware

Page 10:

Software Development

Page 11:

Arm Compute Library

Faster, advanced processing

What is the Arm Compute Library?

Functions for CV and deep-learning algorithms

Optimized for Arm CPU and GPU

OS and platform agnostic

No fee, MIT license

Use as a plug-in backend for your own runtime implementation

Delivers faster processing: 4.6x faster than stock OpenCV on NEON

Offers OpenCV and OpenVX compatibility

Available now: https://developer.arm.com/technologies/compute-library

Page 12:

Arm Compute Library Partners

Page 13:

Specialized Acceleration

Page 14:

Computer Vision (CV)

Page 15:

Spirit for Object Detection and Localization

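[Diagram: Spirit object detection pipeline, labels as extracted]

Blocks: Sensor, ISP, Spirit CV pre-processor (sensor interface, feature extraction, Classifier 1, Classifier 2), CPU, GPU, Acceleration

Outputs: image stream; metadata stream (regions of interest)

Example labels on detected regions: Ben, Beth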

Page 16:

Comparison with Neural Network Framework Solutions

SSD Neural Network

Yolo

Spirit

Page 17:

Spirit: Key Features

Object localization (>200k locations in full HD)

Size variation (~20x for full HD)

Scalable to 4K without performance compromise

Real time, 60 fps, no dropped frames

Invariance to optical distortions

Invariance to illumination conditions

High occlusion tolerance

Suitable for stationary and moving cameras

Page 18:

Comparison with a DSP

Spirit uses a form of HOG*/SVM* baked into an efficient hardware design

Using a DSP to achieve the same performance

E.g. pedestrian detection on a DSP:

• Processed at VGA resolution

• Achieving 5 fps

• Operating at 40 MHz/50 MHz

• Scaling this to Spirit performance levels of 1080p60 would require the DSP to run at 3.24 GHz

*HOG: Histogram of Oriented Gradients; SVM: Support Vector Machines
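The 3.24 GHz figure can be reproduced with a short back-of-the-envelope calculation. The sketch below assumes that the required clock scales linearly with pixel throughput (pixels per frame times frames per second), that VGA means 640x480 and 1080p means 1920x1080; the function name is hypothetical and none of these assumptions are stated on the slide.

```python
# Resolutions as (width, height) in pixels; these values are assumptions.
VGA = (640, 480)
FULL_HD = (1920, 1080)


def required_clock_mhz(base_clock_mhz, base_res, base_fps, target_res, target_fps):
    """Scale a clock linearly with pixel throughput (pixels per second)."""
    base_rate = base_res[0] * base_res[1] * base_fps
    target_rate = target_res[0] * target_res[1] * target_fps
    return base_clock_mhz * target_rate / base_rate


if __name__ == "__main__":
    for clock_mhz in (40, 50):
        ghz = required_clock_mhz(clock_mhz, VGA, 5, FULL_HD, 60) / 1000
        print(f"{clock_mhz} MHz at VGA, 5 fps -> {ghz:.2f} GHz at 1080p60")
```

Under these assumptions the throughput ratio is 6.75 (pixels) times 12 (frame rate), or 81x, so the slide's 3.24 GHz corresponds to the 40 MHz operating point; the 50 MHz point would scale to 4.05 GHz.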

Page 19:

ML on Mali GPUs

Page 20:

Mali GPUs: Increasing ML Throughput and Efficiency

Increasing efficiency

[Chart: relative energy efficiency of Mali-G71 vs. Mali-G72 for SGEMM FP32 and SGEMM FP16; y-axis from 0.9 to 1.2; annotated "17% efficiency gain"]

• GEMM (general matrix multiply) represents the core computation of ML algorithms

• Mali-G72 has several optimizations to improve ML inference

• Less power-hungry FMA unit

• Bigger L1 cache in the execution engine

• Mali-G72 is the most efficient Mali GPU for machine learning
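To illustrate why GEMM is used as the benchmark here, the following minimal sketch expresses a fully connected layer as a single GEMM call. It is plain Python for readability; the function names `gemm` and `dense_layer` are hypothetical, and this is not how the Compute Library or a Mali GPU implements SGEMM.

```python
def gemm(a, b, c, alpha=1.0, beta=1.0):
    """Return alpha * (a x b) + beta * c for row-major nested lists."""
    m, k, n = len(a), len(b), len(b[0])
    if any(len(row) != k for row in a):
        raise ValueError("inner dimensions of a and b do not match")
    if len(c) != m or any(len(row) != n for row in c):
        raise ValueError("c must have shape m x n")
    out = [[beta * c[i][j] for j in range(n)] for i in range(m)]
    for i in range(m):
        for p in range(k):
            scaled = alpha * a[i][p]
            for j in range(n):
                # One multiply-accumulate (MAC) per innermost iteration.
                out[i][j] += scaled * b[p][j]
    return out


def dense_layer(inputs, weights, bias):
    """Fully connected layer: inputs (batch x in) times weights (in x out) plus bias."""
    bias_rows = [list(bias) for _ in inputs]  # broadcast the bias over the batch
    return gemm(inputs, weights, bias_rows)


if __name__ == "__main__":
    batch = [[1, 2], [3, 4]]
    weights = [[1, 0, 2], [0, 1, 3]]
    print(dense_layer(batch, weights, [10, 20, 30]))
```

Convolutions can be lowered to the same primitive (for example via im2col), which is why MAC throughput and energy per MAC on GEMM are a reasonable proxy for inference workloads.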

Page 21:

Hardware

Page 22:

AI Applications at the Edge on Arm

Detect plant diseases

Sort cucumbers

Detect Caltrain delays

Page 23:

Instruction Sets for AI

Cortex-A

• Additional dot product instructions (Cortex-A55 and Cortex-A75)

• New Scalable Vector Extension (SVE) instructions

Cortex-M

• Optimized CMSIS-DSP libraries for matrix multiplication

Closely-coupled acceleration

• Improved performance and efficiency (for broader use cases)

• Flexibility in multi-core computing with Arm DynamIQ technology
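As a rough illustration of what a dot product instruction computes, the sketch below models the vector form of the Armv8.2-A signed 8-bit dot product as I understand it: each 32-bit accumulator lane adds the sum of four 8-bit by 8-bit products. The function names are hypothetical and this is a behavioral model in Python, not Arm's reference definition.

```python
def to_int32(value):
    """Wrap a Python int to the signed 32-bit range (two's complement)."""
    value &= 0xFFFFFFFF
    return value - (1 << 32) if value & 0x80000000 else value


def sdot_model(acc, a, b):
    """Model a signed 8-bit dot product with 32-bit accumulation.

    acc: 32-bit accumulator lanes; a, b: signed 8-bit elements, four per lane.
    """
    a, b = list(a), list(b)
    if len(a) != len(b) or len(a) != 4 * len(acc):
        raise ValueError("expected four 8-bit elements per accumulator lane")
    if any(not -128 <= v <= 127 for v in a + b):
        raise ValueError("inputs must fit in a signed 8-bit integer")
    out = []
    for lane, start in enumerate(range(0, len(a), 4)):
        dot = sum(x * y for x, y in zip(a[start:start + 4], b[start:start + 4]))
        out.append(to_int32(acc[lane] + dot))  # accumulate, wrapping at 32 bits
    return out


if __name__ == "__main__":
    print(sdot_model([0, 0, 0, 100], range(1, 17), [1] * 16))
```

Each lane performs four multiply-accumulates in one operation, which is the reason quantized 8-bit GEMM kernels can reach a higher MAC/cycle rate than FP32 kernels on the same core.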

Page 24:

DynamIQ: New Cluster Design for New Cores

Arm DynamIQ big.LITTLE systems:

• Greater product differentiation and scalability

• Improved energy efficiency and performance

• SW compatibility with Energy Aware Scheduling (EAS)

Private L2 and shared L3 caches

• Local cache close to processors

• L3 cache shared between all cores

DynamIQ Shared Unit (DSU)

• Contains L3, Snoop Control Unit (SCU) and all cluster interfaces

Additional instructions for ML

Example DynamIQ big.LITTLE configurations: 1b+2L, 1b+3L, 1b+4L, 1b+7L, 2b+6L, 4b+4L, ..

[Diagram: DynamIQ cluster. Cortex-A75 and Cortex-A55 32b/64b cores, each with a private L2 cache, attached to the DynamIQ Shared Unit (DSU). DSU labels: SCU, shared L3 cache, async bridges, ACP, peripheral port, AMBA4 ACE.]

Page 25:

New DynamIQ-based CPUs for New Possibilities

Cortex-A75 processor: >50% more performance compared to current devices

Cortex-A55 processor: 2.5x higher power efficiency compared to current devices

Estimated device performance using SPECINT2006; final device results may vary

Comparison using Cortex-A73 at 2.4 GHz vs Cortex-A75 at 3 GHz

Comparison using Cortex-A53 in 28 nm devices vs Cortex-A55 in 16 nm devices

Page 26:

Enhanced Architecture for Emerging Use Cases

Computer Vision

[Chart: Harris Corners, time relative to Cortex-A53 (lower is better); y-axis from 0 to 1.2. Series: Cortex-A53 (FP32), Cortex-A55 (FP32), Cortex-A55 (FP16), Cortex-A73 (FP32), Cortex-A75 (FP32), Cortex-A75 (FP16). Bar values as extracted: 1.00, 0.90, 0.70, 0.59, 0.44, 0.35. Annotation: FP16.]

Machine Learning

[Chart: General Matrix Multiply, MAC/cycle relative to Cortex-A53; y-axis from 0 to 6. Series: Cortex-A53 (FP32), Cortex-A55 (FP32), Cortex-A55 (FP16), Cortex-A55 (8-bit). Bar values as extracted: 1, 1.2, 2.5, 5.5. Annotations: FP16, 8-bit dot product.]

Page 27:

Introducing the Scalable Vector Extension (SVE)

Gather-load and scatter-store

Per-lane predication

[Figure: 1 2 3 4 + 5 5 5 5 under predicate 1 0 1 0 = 6 2 8 4]

Predicate-driven loop control and management

[Figure: for (i = 0; i < n; ++i); INDEX i yields n-2, n-1, n, n+1; CMPLT n yields predicate 1 1 0 0]

Vector partitioning and software-managed speculation

[Figure: 1 2 0 0 with predicate 1 1 0 0, giving 1 2]

Extended floating-point horizontal reductions

[Figure: 1 + 2 + 3 + 4, and the pairwise form (1 + 2) + (3 + 4) = 3 + 7]
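The predication ideas on this slide can be modeled in a few lines. The sketch below is a behavioral model in Python, not SVE code: the vector length of 4 lanes, the function names, and the choice to leave inactive lanes unchanged are assumptions made for illustration.

```python
VL = 4  # hypothetical vector length in lanes; SVE leaves this to the hardware


def loop_predicate(i, n, vl=VL):
    """Model INDEX i followed by CMPLT n: a lane is active while its index is below n."""
    return [i + lane < n for lane in range(vl)]


def predicated_add(a, b, pred):
    """Per-lane predication: add only in active lanes, keep a in inactive lanes."""
    return [x + y if p else x for x, y, p in zip(a, b, pred)]


def vector_add(xs, ys, vl=VL):
    """Add two equal-length lists with a predicate-driven loop and no scalar tail."""
    n = len(xs)
    out = list(xs)
    for i in range(0, n, vl):
        pred = loop_predicate(i, n, vl)
        # Inactive lanes are padded with zeros here; real hardware simply skips them.
        a = [xs[i + lane] if pred[lane] else 0 for lane in range(vl)]
        b = [ys[i + lane] if pred[lane] else 0 for lane in range(vl)]
        result = predicated_add(a, b, pred)
        for lane in range(vl):
            if pred[lane]:
                out[i + lane] = result[lane]
    return out


if __name__ == "__main__":
    print(predicated_add([1, 2, 3, 4], [5, 5, 5, 5], [1, 0, 1, 0]))
    print(loop_predicate(2, 4))
    print(vector_add([1, 2, 3, 4, 5, 6], [10] * 6))
```

The first call reproduces the per-lane predication figure, and the second shows the last loop iteration for n = 4 starting at i = n-2, where only the first two lanes remain active. Because the predicate handles the final partial vector, the same loop works for any vector length.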

Page 28:

Instruction Sets for AI

Cortex-A

• Additional dot product instructions (Cortex-A55 and Cortex-A75)

• New SVE instructions

Cortex-M

• Optimized CMSIS-DSP libraries for matrix multiplication

Closely-coupled acceleration

• Improved performance and efficiency (for broader use cases)

• Flexibility in multi-core computing with DynamIQ technology

Page 29:

Software Optimizations: Cortex-M Example

Convolution

• Use of partial im2col to reduce the memory footprint

• Optimized data dimension layout (Height-Width-Channel) to save im2col overhead

Pooling

• Split into x-pooling and y-pooling instead of window-based

• 5.1X improvements compared to Caffe-like implementation

Activation

• ReLU: use SIMD within a register, 2.6X improvement compared to Caffe-like implementation

• Sigmoid and Tanh: use table look-up

*Baseline uses CMSIS 1D Conv and Caffe-like Pooling/ReLU
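The pooling bullet above can be sketched as follows. This is an illustrative Python model of max pooling, assuming a square window and equal strides in both directions; the function names are hypothetical and this is not the CMSIS kernel. The split works because the maximum over a 2D window equals the maximum over rows of the per-row maxima.

```python
def max_pool_window(img, k, stride):
    """Window-based max pooling: visit all k*k elements for every output."""
    h, w = len(img), len(img[0])
    return [[max(img[y + dy][x + dx] for dy in range(k) for dx in range(k))
             for x in range(0, w - k + 1, stride)]
            for y in range(0, h - k + 1, stride)]


def max_pool_split(img, k, stride):
    """Max pooling split into an x-pooling pass followed by a y-pooling pass."""
    h, w = len(img), len(img[0])
    # x-pooling: reduce along each row.
    rows = [[max(row[x:x + k]) for x in range(0, w - k + 1, stride)]
            for row in img]
    # y-pooling: reduce the intermediate result along each column.
    return [[max(rows[y + dy][x] for dy in range(k))
             for x in range(len(rows[0]))]
            for y in range(0, h - k + 1, stride)]


if __name__ == "__main__":
    image = [[1, 2, 3, 4],
             [5, 6, 7, 8],
             [9, 10, 11, 12],
             [13, 14, 15, 16]]
    print(max_pool_split(image, 2, 2))
    print(max_pool_split(image, 2, 2) == max_pool_window(image, 2, 2))
```

Each pass touches k elements per output instead of k*k, and the row-wise pass reads memory contiguously, which is one plausible reason the split form is faster on a small core.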

[Chart: CIFAR-10 network runtime improvement; relative throughput of Baseline vs. New kernels for Conv, Pooling, ReLU and Total; y-axis from 0 to 6; annotated "4x higher perf"]

The new kernels will be integrated into future versions of CMSIS

Page 30:

Instruction Sets for AI

Cortex-A

• Additional dot product instructions (Cortex-A55 and Cortex-A75)

• New SVE instructions

Cortex-M

• Optimized CMSIS-DSP libraries for matrix multiplication

Closely-coupled acceleration

• Improved performance and efficiency (for broader use cases)

• Flexibility in multi-core computing with DynamIQ technology

Page 31:

Interface to Acceleration Logic

[Diagram 1: accelerator attached to a DynamIQ cluster through ACP and PP]

(1) Configure registers for task

(2) Fetches data from CPU memory

(3) Carries out acceleration

(4) Writes result into CPU memory

[Diagram 2: I/O agent attached to a DynamIQ cluster through ACP and PP]

(1) Writes data into CPU memory

(2) Carries out computation

(3) Processing completed

(4) Reads result from CPU memory or sends data for onward processing

Page 32:

Flexible Acceleration Platform

[Diagram: three ways to attach an accelerator: closely coupled, independent, and off-chip coherent]

Diagram labels: DynamIQ Cluster, Level-3 Cache, Accelerator, Accel Cache, CMN-600, XP, Agile System Cache, DMC-620, CCIX

Page 33:

Arm ML Platform Enables

Flexibility

Efficiency

Freedom

Page 34:

Thank You! Danke! Merci! 謝謝! ありがとう! Gracias! Kiitos!


Page 35:

The Arm trademarks featured in this presentation are registered trademarks or trademarks of Arm Limited (or its subsidiaries) in the US and/or elsewhere. All rights reserved. All other marks featured may be trademarks of their respective owners.

www.arm.com/company/policies/trademarks