End to End Deep Learning Solution on Arm Architecture
Jan. 14 2019, Jammy Zhou
HPC and AI Convergence: TOP500 Trend
According to the TOP500 report, more than 50 percent of the FLOPS added in the latest TOP500 ranking came from Nvidia Tesla GPUs
Half of the TOP10 systems use Nvidia GPUs, and 122 of the TOP500 systems use Nvidia GPUs (64 systems with P100, 46 with V100, 12 with Kepler GPUs)
More AI/ML/DL workloads are being added to HPC applications along with the wide adoption of Nvidia GPUs
Arm on the road
Astra at Sandia National Laboratories in the US is the first Arm-based supercomputer to enter the TOP500 list, ranked at 203 in the latest ranking.
Good momentum for Arm-based supercomputers around the world: Post-K from Japan, Tianhe-3 from China, Catalyst UK, GW4 Isambard, and the CEA system from Europe
Post-K enables Arm SVE together with the Tofu D interconnect and HBM2 memory, and will be used for some AI workloads
Besides Nvidia GPUs, there are other accelerator options on the market, for example AMD Radeon Instinct MI60/MI50 GPUs, Xilinx and Intel FPGAs, customized ASIC products, etc.
HPC and AI in the Cloud
Building blocks: CPUs & accelerators, network & storage, AI & ML services, HPC services
Network: 100 Gbps Ethernet, InfiniBand, Omni-Path, RDMA and RoCE
Storage: fast and scalable storage, such as NVMe-based local SSDs
Arm on the road
Science Cloud with Arm-based HPC from HPC Systems (supporting HiSilicon Hi1616 and Marvell ThunderX2)
Amazon EC2 A1 instances based on the AWS Graviton 64-bit Arm processor, for scale-out and Arm-based workloads
Arm Neoverse continuous improvement
Accelerators (GPUs, FPGAs, ASICs)
HPC & AI software stack (languages, frameworks, libraries, drivers, compilers, etc), multi-node distributed support and MPI
HW Diversity & SW Fragmentation
[Stack diagram]
DL Frameworks: TensorFlow, Caffe, MXNet, Theano, Caffe2, CNTK, PaddlePaddle, PyTorch, Keras
Model Formats (framework specific, ONNX, NNEF) and Deep Learning Compilers (TVM, Glow, XLA, ONNC, etc)
Libraries: BLAS, FFT, RNG, SPARSE, Eigen, CMSIS-NN, ACL
HAL and Drivers
Hardware: CPU, GPU, FPGA, ASIC, DSP
Framework support for multiple accelerators
1. It is difficult for application and algorithm developers to switch between frameworks
2. Framework developers have to maintain different backends for various accelerators
3. Chip and IP vendors have to support multiple frameworks with duplicated effort, often as out-of-tree forks of the upstream projects
4. OEMs/ODMs and cloud vendors have to support multiple configurations
Also shown in the diagram: additional frameworks (Chainer, ...), a Big Data Analytics layer (TensorFlowOnSpark, CaffeOnSpark, SparkFlow, ...), and further accelerator libraries (cuDNN, MIOpen, ...)
Open Neural Network eXchange (ONNX) Ecosystem
Framework interoperability & hardware optimizations
[Ecosystem diagram] ONNX Format, ONNX Models, ONNXIFI, ONNX Runtime, ONNX Tools; workflow: Create, Convert, Optimize, Deploy
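To make the "Convert" step concrete, here is a hedged sketch of exporting a PyTorch model to the ONNX format; the resnet18 model, input shape, opset and file name are illustrative assumptions, not from the slides.

```python
# Illustrative only: export a stand-in PyTorch model to ONNX via tracing.
import torch
import torchvision

model = torchvision.models.resnet18(pretrained=True).eval()  # any trained module works
dummy_input = torch.randn(1, 3, 224, 224)  # example input used for tracing

torch.onnx.export(
    model,
    dummy_input,
    "resnet18.onnx",          # placeholder output path
    input_names=["input"],
    output_names=["output"],
    opset_version=9,          # pin an opset version for reproducibility
)
```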
ONNX Specifications
Neural-network-only ONNX
Defines an extensible computation graph model, built-in operators and standard data types
Supports only tensors as input/output data types
ONNX-ML Extension
Classical machine learning extension
Also supports sequence and map data types, and extends the ONNX operator set with ML algorithms not based on neural networks
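To show what the computation graph model, built-in operators and tensor data types look like in practice, here is a minimal sketch using the onnx Python helper API; the tiny Add graph and file name are illustrative only.

```python
# Build a one-node ONNX graph: Z = Add(X, Y), with tensor-typed inputs/outputs.
import onnx
from onnx import helper, TensorProto

x = helper.make_tensor_value_info("X", TensorProto.FLOAT, [1, 4])
y = helper.make_tensor_value_info("Y", TensorProto.FLOAT, [1, 4])
z = helper.make_tensor_value_info("Z", TensorProto.FLOAT, [1, 4])

node = helper.make_node("Add", inputs=["X", "Y"], outputs=["Z"])  # built-in operator
graph = helper.make_graph([node], "tiny_add_graph", [x, y], [z])
model = helper.make_model(graph, producer_name="example")

onnx.checker.check_model(model)  # validate against the ONNX specification
onnx.save(model, "tiny_add.onnx")
```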
ONNX v1.3 (released on Sep. 1st, 2018)
Control flow support
Functions (composable operators, experimental)
Enhanced shape inference
Additional optimization passes
ONNXIFI 1.0 (C backend interface for accelerators)
More to come: quantization, test/compliance, data pipelines, edge/mobile/IoT
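As one hedged illustration of the shape inference tooling mentioned above, offline shape inference can be run with the onnx Python package; the model path is a placeholder.

```python
# Run ONNX shape inference and inspect the shapes recorded for intermediate tensors.
import onnx
from onnx import shape_inference

model = onnx.load("model.onnx")           # placeholder path
inferred = shape_inference.infer_shapes(model)

# Shapes inferred for intermediate tensors land in graph.value_info
# (empty for trivial graphs with no intermediates).
for vi in inferred.graph.value_info:
    print(vi.name, vi.type.tensor_type.shape)
```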
ONNX Interface for Framework Integration (ONNXIFI)
Standardized interface for NN inference on different accelerators
Runtime discovery and selection of execution backends, as well as of the ONNX operators supported on each backend
Supports the ONNX format & online model conversion
ONNXIFI Backend
A combination of a software layer and a hardware device used to run an ONNX graph
The same software layer can expose multiple backends
A heterogeneous backend can distribute work across multiple device types internally
[Diagram] Applications and frameworks load ONNX models through the ONNXIFI dispatch library (libonnxifi.so), which discovers and dispatches to vendor backend libraries, e.g. Glow (libonnxifi-glow.so), Library A (libonnxifi-a.so), Library B (libonnxifi-b.so), Library C (libonnxifi-c.dll), Library D (libonnxifi-d.dylib), ...
ONNX Runtime
High-performance and cross-platform inference engine for ONNX models
Fully implements the ONNX specification, including the ONNX-ML extension
Arm platforms are supported on both Linux (experimental) and Windows
Diagram from https://github.com/Microsoft/onnxruntime/blob/master/docs/HighLevelDesign.md
TensorRT and nGraph support are work in progress
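A minimal usage sketch of the ONNX Runtime Python API; the model file name and input shape are assumptions carried over from the earlier export sketch.

```python
# Load an ONNX model and run a single inference with ONNX Runtime.
import numpy as np
import onnxruntime as ort

session = ort.InferenceSession("resnet18.onnx")   # placeholder model file

input_name = session.get_inputs()[0].name
output_name = session.get_outputs()[0].name

x = np.random.rand(1, 3, 224, 224).astype(np.float32)  # example input
outputs = session.run([output_name], {input_name: x})
print(outputs[0].shape)
```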
Machine Intelligence
A Linaro Strategic Initiative
Provide the best-in-class deep learning performance by leveraging neural network acceleration in IP and SoCs from the Arm ecosystem, through collaborative, seamless integration with the ecosystem of AI/ML software frameworks and libraries
Scope from HPC to microcontroller
HPC, Data Center & Cloud *
SVE-based optimization for DL frameworks & libraries
PCIe/CCIX based heterogeneous accelerator support on Arm servers (drivers, compilers and framework integration, etc)
Scale-out support for distributed training (a minimal sketch follows below)
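The slides do not name a specific technology for scale-out training; as one hedged illustration, an MPI-launched Horovod setup with tf.keras might look like the following (model and hyperparameters are placeholders).

```python
# Illustrative Horovod data-parallel training sketch; launch with mpirun,
# one process per node/accelerator.
import tensorflow as tf
import horovod.tensorflow.keras as hvd

hvd.init()  # initialize the MPI-backed communication layer

model = tf.keras.Sequential([
    tf.keras.layers.Dense(128, activation="relu", input_shape=(784,)),
    tf.keras.layers.Dense(10, activation="softmax"),
])

# Scale the learning rate by the worker count and wrap the optimizer so
# gradients are averaged across ranks with allreduce.
opt = hvd.DistributedOptimizer(tf.keras.optimizers.SGD(0.01 * hvd.size()))
model.compile(loss="sparse_categorical_crossentropy", optimizer=opt)

callbacks = [hvd.callbacks.BroadcastGlobalVariablesCallback(0)]  # sync initial weights from rank 0
# model.fit(x_train, y_train, callbacks=callbacks)  # data loading omitted
```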
Edge node & device
Initial focus on inference support for Cortex-A SoCs
Common model description format and APIs to the runtime
Common optimized runtime inference engine for Arm-based SoC
Plug-in framework to support multiple 3rd party IPs (NPU, GPU, DSP, FPGA)
Continuous integration testing and benchmarking
Microcontroller *
CMSIS-NN optimized frameworks/libraries on RTOS
Frameworks like uTensor and TensorFlow Lite (quantization, footprint reduction, etc; see the sketch after this list)
IP based accelerator support & optimization
* under discussion
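As a hedged illustration of the quantization and footprint reduction mentioned for TensorFlow Lite above, post-training quantization with the TFLite converter might look like this; the SavedModel path and output file are placeholders.

```python
# Convert a TensorFlow SavedModel to a quantized TensorFlow Lite model.
import tensorflow as tf

converter = tf.lite.TFLiteConverter.from_saved_model("saved_model_dir")  # placeholder path
converter.optimizations = [tf.lite.Optimize.DEFAULT]  # enable post-training quantization

tflite_model = converter.convert()
with open("model_quant.tflite", "wb") as f:
    f.write(tflite_model)
```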
ArmNN based collaborations - ongoing
https://developer.arm.com/products/processors/machine-learning/arm-nn
https://community.arm.com/tools/b/blog/posts/arm-nn-the-easy-way-to-deploy-edge-ml
A good base for future collaborations:
100 man-years of effort, 340,000 lines of code
Estimated to be shipping in over 200 million Android devices
Impressive performance uplift from software-only improvements over a period of 6 months
Thanks!