Transcript of Doug Burger's HPTS slide deck
Doug Burger
Technical Fellow, Advanced Intelligent Architectures, Microsoft Azure
[World map: Azure global footprint, showing data centers (owned capacity, future capacity, leased capacity) and edge sites]
Silicon scaling will end because of economics, not physics
• Minimal cost/transistor (7, 5, 3 nm?)
• Lower TCO might justify one more node
• But the economics are tough and will happen late
Temporal vs. spatial computing
• Temporal computation (CPU): a stream of instructions is applied, one at a time, to each piece of data
• Spatial computation (FPGA): the instructions are laid out in space as hardware, and the data flows through them
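The contrast can be sketched in software. In the temporal style, one ALU applies a stream of instructions to each datum in turn; in the spatial style, each instruction becomes a fixed pipeline stage and the data streams through all stages. A minimal Python sketch (the example ops are hypothetical, not from the deck):

```python
# Temporal computation: one ALU applies each instruction in turn to each datum.
def temporal(data, instructions):
    results = []
    for x in data:
        for op in instructions:   # the instructions stream past the data
            x = op(x)
        results.append(x)
    return results

# Spatial computation: each instruction becomes a fixed pipeline stage;
# the data streams through all stages.
def spatial(data, instructions):
    stream = iter(data)
    for op in instructions:       # the data streams past the instructions
        stream = map(op, stream)
    return list(stream)

ops = [lambda x: x + 1, lambda x: x * 2]   # hypothetical example ops
assert temporal([1, 2, 3], ops) == spatial([1, 2, 3], ops) == [4, 6, 8]
```

Both styles compute the same function; the difference is whether instructions or data occupy the hardware at any instant.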
Microsoft Confidential
Catapult V0
• 1U rack-mounted
• 2 × 10GbE ports
• 3 × x16 PCIe slots
• 12 Intel Westmere cores (2 sockets)
Bing Ranking Implementation Details
[Block diagram: basic tiles containing registers, constants, a local ALU, DSPs, and scheduling logic; a complex ALU (ln, ÷, div); FFE instruction stores, compression, and thresholds; distribution latches carrying control/data tokens; a feature-transmission network; and a stream-preprocessing FSM feeding the FE → FFE → DTS stages]
• FE0: 89 non-BodyBlock features, 34 state machines, 55% utilization
• FE1: 55 BodyBlock features, 20 state machines, 45% utilization
[Chip floorplan: an array of 8 FFE tiles and a 4 × 12 grid of DTT tiles]
• FFE: 64 cores/chip, 256–512 threads
• DTT: 48 DTT tiles/chip, 240 tree processors, 2,880 trees/chip
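A DTT tile's job, evaluating many decision trees per document and summing their scores, can be sketched in software. The tree encoding below (feature index, threshold, children, leaf value) is a hypothetical illustration, not the actual Catapult format:

```python
# Walk one decision tree over a feature vector: internal nodes are
# (feature_index, threshold, left_subtree, right_subtree); leaves are (value,).
def eval_tree(node, features):
    while len(node) == 4:                      # internal node
        feat, thresh, left, right = node
        node = left if features[feat] <= thresh else right
    return node[0]                             # leaf score

# The ensemble score is the sum of all tree scores, as a tree-processor
# array would produce in parallel.
def eval_ensemble(trees, features):
    return sum(eval_tree(t, features) for t in trees)

# Two tiny trees over a 2-feature vector.
t0 = (0, 0.5, (1.0,), (2.0,))                  # split on feature 0
t1 = (1, 0.0, (0.5,), (-0.5,))                 # split on feature 1
assert eval_ensemble([t0, t1], [0.3, 1.0]) == 0.5   # 1.0 + (-0.5)
```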
Catapult V1 Accelerator Card
• Altera Stratix V D5
• 172.6K ALMs, 2,014 M20Ks
• 457 KLEs (1 KLE ≈ 12K gates)
• M20K is a 2.5 KB SRAM
• PCIe Gen3 x8, 8 GB DDR3
• 20 Gb network among FPGAs
Mapped Fabric into a Pod
• Low-latency access to a local FPGA
• Compose multiple FPGAs to accelerate large workloads
• Low-latency, high-bandwidth sharing of storage and memory across server boundaries
[Diagram: 48 servers, each with one FPGA, under a top-of-rack switch (TOR); the FPGAs are wired to each other over 10 Gb Ethernet links in a 6 × 8 torus]
1,632-server pilot deployed in a production datacenter
Catapult V2 Architecture
[Diagram: WCS 2.0 server blade (Mt. Hood) with two QPI-linked CPUs and DRAM; the Catapult V2 card (Pikes Peak) holds the FPGA, attached to the CPUs over PCIe Gen3 2x8, to the NIC over Gen3 x8, and to the network switch through 40 Gb/s QSFP ports]
WCS Gen4.1 blade with Mellanox NIC and Catapult FPGA: Pikes Peak mounts on the WCS tray backplane via the option-card mezzanine connectors
Catapult WCS mezz card (Pike's Peak) shell
[Block diagram: the shell surrounds the user logic ("role") and provides a slot DMA engine, two x8 PCIe cores (HIP 0/1), a DDR3 controller for 4 GB of 72-bit DDR3, a 256 Mb QSPI config flash with remote system update (RSU), JTAG, temperature and I2C monitoring, SEU scrubbing, clocking, ASLs, and two 40G MACs with FIFOs and a 40G bypass control sitting between the server NIC and the top-of-rack switch]
Shell abstractions: NVMe controllers could also go in the shell, but how would they be used?
Bing Production Results
[Plot: hardware vs. software latency and load; 99.9th-percentile query latency versus queries/sec, with curves for average software load, 99.9% software latency, 99.9% FPGA latency, and average FPGA query load]
Gone to Scale: Azure SmartNIC
[Diagram: host with CPU and SmartNIC (NIC ASIC + FPGA) attached to the ToR]
Scalable Hardware Microservice Fabric
[Diagram: a traditional software (CPU) server plane beneath a hardware acceleration plane of interconnected FPGAs running workloads such as web search ranking, deep neural networks, SDN offload, and SQL]
• Interconnected FPGAs form a separate plane of computation
• Can be managed and used independently from the CPU
Accelerators as a First-Class Network Citizen
[Plot: round-trip latency (µs, 0–25) versus number of reachable hosts/FPGAs (1 to 1,000,000, log scale); curves for LTL L0 (same TOR), LTL L1, and LTL L2, each with LTL average and 99.9th-percentile latency, plus example L0, L1, and L2 latency histograms for different pairs of FPGAs; for comparison, the Catapult Gen1 6×8 torus can reach at most 48 FPGAs, while LTL markers sit at roughly 10K, 100K, and 250K reachable hosts]
Example: Bing Hardware Microservices
[Diagram: requests fan out across alternating CPU and FPGA microservices]
[Charts, 2010–2020, log scales: DNN model size in millions of parameters (AlexNet, ResNet-50, GNMT, BERT-L, GPT-2, Megatron; 1 to 10,000) and billions of ops per inference (1 to 100,000); Megatron is ~325× ResNet-50 in parameters (8.3B parameters) and ~2,200× ResNet-50 in ops (17.6T ops)]
Microsoft Azure Highly Confidential. Do not share, forward or copy without MS CELA & MS PR approval.
Project Brainwave Engine
Using Project Brainwave for land use mapping
Data → Build → Train → Deploy → Intelligent Apps (Data → Intelligence → Action)
• NAIP data (20 TB, 200M images) stored on Azure Premium Storage
• Land classification model (ResNet-50) built with Azure Machine Learning, Azure Batch AI, Visual Studio Tools for AI, and the Geo AI Data Science Virtual Machine
• Ultra-fast inferencing using FPGAs over satellite images for the entire US (location: Kent Island, year: 2015)
Using FPGAs for ultra-fast inferencing of land-use types for the entire United States (ESRI NAIP data, 20+ TB): ~10 minutes, $42
DNN Inference Performance on Brainwave NPUs
[Chart: AI TFLOPS per rack, 0 to 6,000, for CPU, Stratix V FPGA, Arria 10 FPGA, and Stratix 10 FPGA]
Brainwave NPUs are on a steep growth trajectory for low-batch DNN inference
[Diagram: Bing compute servers (dual-socket CPUs, 50G NIC, FPGA on PCIe Gen3 x16) connected through TOR, T1, and T2 switch tiers to Bing FPGA appliances packed with FPGAs; links are 50G, with 9 × 50G into each appliance]
Software Developers Should Program Hardware
[Analogy: C compiles through Assembly to targets such as the PDP-11 and System/370; some future language ("???") should compile through HDL to targets such as Intel FPGAs, Xilinx FPGAs, and ASICs]
Project Catapult + Brainwave History
• 2011: Project Catapult launched
• 2013: Bing pilot runs decision trees 40× faster
• 2015: Bing ranking throughput increased 2×
• 2016: Azure Accelerated Networking delivers industry-leading cloud performance
• 2017: Over 1M servers deployed with FPGAs at hyperscale
• 2017: Hardware microservices harness FPGAs for distributed computing
• 2017: FPGAs enable real-time AI, ultra-low-latency inferencing without batching; Bing launches its first FPGA-accelerated deep neural network
• 2018: Project Brainwave launched in Azure Machine Learning
Field Programmable Gate Arrays
Backup slides
Demo | AI for Earth
http://www.microsoft.com/aiforearth
References
• Pixel-Level Land Cover Classification Using the Geo AI Data Science Virtual Machine and Batch AI: https://blogs.technet.microsoft.com/machinelearning/2018/03/12/pixel-level-land-cover-classification-using-the-geo-ai-data-science-virtual-machine-and-batch-ai/
• How to Use FPGAs for Deep Learning Inference to Perform Land Cover Mapping on Terabytes of Aerial Images: https://blogs.technet.microsoft.com/machinelearning/2018/05/29/how-to-use-fpgas-for-deep-learning-inference-to-perform-land-cover-mapping-on-terabytes-of-aerial-images/
© Microsoft Corporation, Azure
Rough: Storyboard Notes / Flow of Deck
Straw-man flow:
• Azure all-up slide
• Cloud inflection point
• Azure global deployment and scale
• Digital transformation and what it means to companies
• Cloud and edge story: supercomputer, Sphere (security), Starbucks example
• Data centers: Dublin DC, underwater DC
• Palix: re-thinking traditional storage system design, co-designing future hardware & software infrastructure for the cloud, and building a completely new storage system
• AI: OpenAI deal; general trend toward super AI
• End on an aspirational note, i.e. quantum
Links to other deck:
1. [Link to top level folder]
2. Mark Russinovich presentations, including:
   a) Mark LinkedIn deck
   b) Inside Azure Architecture
   c) Mark's Build deck
Azure Datacenters & Architecture
Azure global infrastructure: 54 total Azure regions (44 generally available + 10 coming soon)
Azure servers: General purpose (3× Beast)
Gen 2: Processor 2 × 6-core 2.1 GHz; Memory 32 GiB; Hard drive 6 × 500 GB; SSD none; NIC 1 Gb/s
Godzilla: Processor 2 × 16-core 2.0 GHz; Memory 512 GiB; Hard drive none; SSD 9 × 800 GB; NIC 40 Gb/s
Beast: Processor 4 × 18-core 2.5 GHz; Memory 4,096 GiB; Hard drive none; SSD 4 × 2 TB NVMe, 1 × 960 GB SATA; NIC 40 Gb/s
Gen 6: Processor 2 × Skylake 24-core 2.7 GHz; Memory 768 GiB DDR4; Hard drive none; SSD 4 × 960 GB M.2 and 1 × 960 GB SATA; NIC 40 Gb/s + FPGA
Beast v2: Processor 8 × 28-core 2.5 GHz; Memory 12 TiB; Hard drive none; SSD 4 × 2 TB NVMe, 1 × 960 GB SATA; NIC 50 Gb/s
Optimized Gen 7: Processor 2 × 26-core Cascade Lake; Memory 192 GB DDR4; Hard drive none; SSD 5 × 960 GB M.2 NVMe; NIC 50 Gb/s + FPGA
Optimized Gen 6: Processor 2 × 24-core Skylake; Memory 192 GB DDR4; Hard drive none; SSD 4 × 960 GB M.2 NVMe; NIC 40 Gb/s + FPGA
Azure Sphere: Processor 2 × M4 cores @ 200 MHz; Memory 64 KB RAM; WiFi 2.4/5.0 GHz 802.11 b/g/n (≈ 0.00000000533 of a Beast)
Azure architecture
[Diagram: the Azure Portal, CLI, and 3rd parties call the Azure Resource Manager (authentication, telemetry and insights, RBAC), which dispatches to resource providers (Compute, Networking, Storage) and PaaS offerings (Service Fabric, AKS, ACI, Functions, Event Grid), layered on the Azure Fabric Controller, Hardware Manager, and Azure infrastructure]
Web Application Firewall integration at edge
• IP restriction
• Geo filtering
• HTTP parameters
• Request methods
• Size restriction
[Diagram: Azure Front Door with WAF and CDN at the edge of the Microsoft Global Network blocks SQLi, XSS, and malicious bots; IP-restriction rules in each region allow access to the regional web applications from AFD only]
Azure Sphere OS Architecture
• Secure application containers: compartmentalize code for agility, robustness & security
• On-chip cloud services: provide update, authentication, and connectivity
• Custom Linux kernel: empowers agile silicon evolution and reuse of code
• Security monitor: guards integrity and access to critical resources
Layers, top to bottom:
• OS Layer 4: app containers for POSIX (on the Cortex-A) and for I/O (on the Cortex-Ms)
• OS Layer 3: on-chip cloud services
• OS Layer 2: HLOS kernel
• OS Layer 1: security monitor
• Hardware: Azure Sphere MCUs
• Simplify development: focus your device development effort on the value you want to create
• Streamline debugging: experience interactive, context-aware debugging across device and cloud
• Collaborate across your team: apply tool-assisted collaboration across your entire development organization
Azure SmartNIC – Accelerating SDN
Inside Azure Servers
Gen 2: Processor 2 × 6-core 2.1 GHz; Memory 32 GiB; Hard drive 6 × 500 GB; SSD none; NIC 1 Gb/s
Gen 3: Processor 2 × 8-core 2.1 GHz; Memory 128 GiB; Hard drive 1 × 4 TB; SSD 5 × 480 GB; NIC 10 Gb/s
HPC: Processor 2 × 12-core 2.4 GHz; Memory 128 GiB; Hard drive 5 × 1 TB; SSD none; NIC 10 Gb/s IP, 40 Gb/s IB
Gen 4: Processor 2 × 12-core 2.4 GHz; Memory 192 GiB; Hard drive 4 × 2 TB; SSD 4 × 480 GB; NIC 40 Gb/s
Godzilla: Processor 2 × 16-core 2.0 GHz; Memory 512 GiB; Hard drive none; SSD 9 × 800 GB; NIC 40 Gb/s
Gen 5.1: Processor 2 × 20-core 2.3 GHz; Memory 256 GiB; Hard drive none; SSD 6 × 960 GB PCIe flash and 1 × 960 GB SATA; NIC 40 Gb/s + FPGA
GPU Gen 5: Processor 2 × 8-core 2.6 GHz; Memory 256 GiB; Hard drive 1 × 2 TB; SSD 1 × 960 GB SATA; NIC 40 Gb/s; GPU 2 × 2 compute GPUs
Beast: Processor 4 × 18-core 2.5 GHz; Memory 4,096 GiB; Hard drive none; SSD 4 × 1,920 GB NVMe and 1 × 960 GB SATA; NIC 40 Gb/s
Gen 6: Processor 2 × Skylake 24-core 2.7 GHz; Memory 768 GiB DDR4; Hard drive none; SSD 4 × 960 GB M.2 and 1 × 960 GB SATA; NIC 40 Gb/s; FPGA yes
Mega Beast: Processor 4 × 18-core 2.5 GHz; Memory 12,288 GiB; Hard drive none; SSD 4 × 1,920 GB NVMe and 1 × 960 GB SATA; NIC 40 Gb/s
Cold Aisle Servicing
• Intel, AMD, ARM64 CPUs
• High-density GPU expansion for HPC/AI
• NVM (DRAM + battery) and 3DXP for low latency
• High-density HDD and flash expansion
• Microsoft custom-designed SSDs
• 40 Gbps networking, going up to 50 Gbps
• Accelerated VMs using FPGAs
[Photo: Microsoft SSD]
[Charts (source: IDC): bytes in the digital universe vs. installed storage capacity, 0 to 2.0 × 10^23; and the percentage of the digital universe from 2020 to 2040 split into data that can be stored vs. data that has to be discarded]
Norbert Wiener, U.S. News and World Report, 1964; Mikhail Neiman, Radiotekhnika, 1965; Church et al., Science, 2012; Goldman et al., Nature, 2013
Let's store data in DNA!
The number of transistors doubles every 18–24 months, and the industry roadmaps are based on that continued rate of improvement. If we could arrange the atoms the way we want: DNA molecules use approximately 50 atoms for one bit.
Store data in synthetic DNA strands of 150 to 300 bases, using a simple mapping from bits to bases:
• 00 → A
• 01 → C
• 10 → G
• 11 → T
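The mapping above packs 2 bits per base, so one byte becomes 4 bases. A small sketch of the round trip:

```python
# The 2-bits-per-base mapping from the slide: 00->A, 01->C, 10->G, 11->T.
B2N = {"00": "A", "01": "C", "10": "G", "11": "T"}
N2B = {v: k for k, v in B2N.items()}

def encode(data: bytes) -> str:
    """Turn bytes into a DNA strand, 4 bases per byte."""
    bits = "".join(f"{byte:08b}" for byte in data)
    return "".join(B2N[bits[i:i + 2]] for i in range(0, len(bits), 2))

def decode(strand: str) -> bytes:
    """Recover the original bytes from a strand produced by encode()."""
    bits = "".join(N2B[base] for base in strand)
    return bytes(int(bits[i:i + 8], 2) for i in range(0, len(bits), 8))

msg = b"HPTS"
strand = encode(msg)
assert len(strand) == 4 * len(msg)    # 4 bases per byte
assert decode(strand) == msg
```

Real DNA storage adds error-correcting codes and constraints (e.g. avoiding long homopolymer runs) on top of this raw mapping.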
[Comparison: at today's densities, 1 EB of cold storage is the size of two Walmart Supercenters]
It's here!
[Chart: volumetric density in Gb/mm^3 on a log scale from 10^-2 to 10^10 for rotational disk, SSD, flash (chip), tape, optical, and DNA; bars show today, projection, and limit, with overheads for redundancy, spacing, and metadata]
Potential of DNA: >10^7 improvement
Lifetime (years): 3–5, 5, 10–30, 5, 1,000+, 10–unlimited
(Credit: Philipp Stössel/ETH Zurich)
DNA "synthetic fossils" survive 2,000 to 2,000,000 years; extreme density makes the required storage conditions cheap and easy to keep.
The medium stays the same as read technology improves: 800 Kbases/day (the size of a mainframe), 80 Gbases/day (the size of a workstation), 10 Gbases/day (the size of a portable SSD). The medium itself also changes as read and write technology improves.
Problem: how to improve performance for batch-of-1 inference at scale, with low cost?
Solution: hardware-accelerated models using FPGAs
Goals: provide large-scale, performant, and affordable real-time AI inference services

Real-time AI at cloud scale with phenomenal performance and affordable cost
• Ultra-fast DNN performance with accelerated ResNet-50; extreme speed: object classification on FPGA in <1.8 ms per image
• Affordable: only 20 cents per million images during preview
• http://aka.ms/aml-real-time-ai
• Models are easy to create and deploy into the Azure cloud
• Write once, deploy anywhere: to intelligent cloud or edge
• Manage and update your models using Azure ML & IoT Edge
Image Answer: answers for multimedia images
QP MLBERT: Quantus query passage ranking
Bling SD: web intelligence model to produce semantic document body
Vortex: intelligent search (QnA, captions) & ranking
WaveNet & QRNN: text to speech for Bing Mobile & Microsoft Assistant Commute
ANN: approximate nearest neighbor search
TP1, Deep Think v2, Deep Think v2.1, Deep Scan, Deep Scan v2, Deep Scan v2.1, Deep Read, Poly Math v4.1, Poly Math v5, Poly Math 5.1, Universal PM1, AGI encoder/decoder: intelligent search (QnA, captions) & ranking
LiRIC: list reading and inference using comprehension
ResNet-50, ResNet-152, DenseNet-121, VGG-16, SSD-VGG: computer vision scenarios for Azure ML/3P
Universal Deep Think v1: intelligent search (QnA, captions) & ranking
Smart Reply, Meeting Insights, Key Points: intelligence for M365
Scenario & model analysis → HW development (if needed) → Implementation & optimization → Quantization & fine-tuning → Flighting & PROD deployment
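The quantization step in a pipeline like this can be sketched with a simple symmetric 8-bit scheme (a hypothetical illustration; Brainwave's actual narrow-precision formats differ):

```python
# Post-training quantization sketch: map float weights to signed 8-bit
# integers with a single per-tensor scale, then dequantize to check error.
def quantize(weights, bits=8):
    qmax = 2 ** (bits - 1) - 1                          # symmetric range, e.g. 127
    scale = max(abs(w) for w in weights) / qmax or 1.0  # avoid zero scale
    q = [round(w / scale) for w in weights]
    return q, scale

def dequantize(q, scale):
    return [v * scale for v in q]

w = [0.5, -1.27, 0.01, 1.0]
q, s = quantize(w)
w2 = dequantize(q, s)
# Rounding error is bounded by half the quantization step.
assert all(abs(a - b) <= s / 2 + 1e-12 for a, b in zip(w, w2))
```

Fine-tuning then retrains the model with quantization in the loop to claw back any accuracy lost to this rounding.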
[Spectrum from flexibility to efficiency: CPUs → GPUs → Soft NPUs (FPGAs) → NPUs → ASICs; inset: a CPU's control unit (CU), registers, and arithmetic logic unit (ALU)]
http://aka.ms/rtai
Models are easy to create and deploy into Azure cloud
Write once, deploy anywhere – to intelligent cloud or edge
Manage and update your models using Azure IoT Edge
Project Natick
Data stored in quartz glass, written using femtosecond lasers and read back via retardance and polarization angle. [Micrograph, 25 µm scale]
[Diagram: two storage stamps, each with a load balancer in front of front-ends (REST, File), a partition layer, and a DFS layer with intra-stamp replication; geo-replication runs between the stamps]