Supercomputing Systems AG Phone +41 43 456 16 00
Technopark 1 Fax +41 43 456 16 10
8005 Zürich www.scs.ch
Vision meets reality.
Hardware Aspects of Deep Learning: FPGAs for Computer Vision in the Car
Workshop at University of Applied Sciences Ulm 11.7.2017
Felix Eberli, Department Head Embedded & Automotive
12 Zürich 06.07.2017 © by Supercomputing Systems AG PUBLIC
But already in series production in many driver assistance systems
• ACC (Distronic)
• Blind spot detection
• Brake assist
• Pedestrian detection
• Park pilot
• Stop & Go Pilot
• Highway Pilot (steering assist)
• ...
• Let's see ☺
Application “Traffic Jam Pilot”
• Steering assistant allows autonomous driving in traffic jams up to 30 km/h, assisting above.
Sensor view for an urban drive (green: radar objects; object position via 6D-Vision)
SCS company profile
• Founded 1993 and privately owned by Prof. Dr. Anton Gunzinger
• 100+ employees:
Electrical engineers
Software engineers
Physicists
Mathematicians
• Company offices at Technopark Zurich, Switzerland
SCS Services
Departments
• Embedded & Automotive
• Life Science / Medical
• High Performance Safety
• Embedded
• High Performance Computing
• SW / Public Transport
• SW / Broadcast
• Measure & Decide
Embedded & Automotive
• Feasibility studies
• Hardware (Specification, Design, Schematics, Layout, Production)
• Firmware/IP (FPGA, DSP, GPU, CPU)
• Software (Drivers, Host SW – Windows/Linux)
• Optimizations (ARM, Neon, DSP, EVE, GPU, R-CAR, PC SSE)
SCS Embedded & Automotive Department
The Principle of Stereo Vision
2008: world's first real-time implementation of Semi-Global Matching on an automotive-compliant FPGA
S. Gehrig, F. Eberli, T. Meyer, “A Real-time Low-Power Stereo Vision Engine Using Semi-Global Matching”, ICVS 2009 (Best Paper Award)
SCS Stereo Vision Evaluation Platform
Example measurement accuracy 3:
Distance measured = 10.549 m ± 0.022 m (baseline length = 25 cm)
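The slide reports only the measured result. As a rough illustration of how baseline length and distance enter such an accuracy figure, the standard pinhole stereo relation Z = f·B/d and its first-order error can be sketched as follows. The focal length in pixels and the disparity values are hypothetical and chosen only for illustration; they are not taken from the slides.

```python
def depth_from_disparity(focal_px, baseline_m, disparity_px):
    """Depth of a point in a rectified stereo pair: Z = f * B / d."""
    return focal_px * baseline_m / disparity_px


def depth_error(depth_m, focal_px, baseline_m, disparity_err_px):
    """First-order depth uncertainty: dZ = Z^2 / (f * B) * dd."""
    return depth_m ** 2 / (focal_px * baseline_m) * disparity_err_px


BASELINE_M = 0.25   # from the slide: 25 cm baseline
FOCAL_PX = 1200.0   # assumption: focal length in pixels (not in the slides)

z = depth_from_disparity(FOCAL_PX, BASELINE_M, disparity_px=30.0)
dz = depth_error(z, FOCAL_PX, BASELINE_M, disparity_err_px=0.1)
print(f"Z = {z:.3f} m +/- {dz:.3f} m")
```

The error grows with the square of the distance and shrinks with a longer baseline, which is why the baseline length is quoted next to the accuracy.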
SCS Video Injection System: multi-camera record, replay, and HIL
Current research focuses on a deeper understanding of the scene
• https://www.cityscapes-dataset.com/examples/
My diploma thesis 20 years ago: porting a neural network to a DSP
Deep Learning -> CNN
• Convolutional Neural Network
Scene Labeling, Object Detection
Worked example: how do I read marketing slides? 😐
1x 32-bit multiplier + adder @ 1 THz
⇒ 1 TMACC/s
⇒ 2 TOPS (32-bit)
⇒ 8 DLTOPS (8-bit)
⇒ 16 TOPS peak (4-bit)
• Bandwidth / partitioning ⇒ only 30-70% usable
• NOPs count as OPs too
• Typical power consumption
• Batch mode? ⇒ latency
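The arithmetic behind these headline figures can be reproduced in a few lines. This is a minimal sketch under one assumption: that the 8-bit and 4-bit numbers come from counting a multiply-accumulate as two operations and then packing several narrow operations into the single 32-bit unit. The function name is hypothetical.

```python
def marketing_ops_per_s(macc_per_s, native_bits, op_bits):
    """Peak ops/s as a datasheet might count them (illustrative only)."""
    ops = 2 * macc_per_s               # 1 MACC = 1 multiply + 1 add
    packing = native_bits // op_bits   # assumed: narrow ops packed into one wide unit
    return ops * packing


MACC_PER_S = 1e12  # one multiplier + adder clocked at 1 THz

for bits in (32, 8, 4):
    peak = marketing_ops_per_s(MACC_PER_S, 32, bits)
    # The slide's caveat: only 30-70% of peak is usable in practice.
    usable_low, usable_high = 0.3 * peak, 0.7 * peak
    print(f"{bits:2d}-bit: peak {peak / 1e12:.0f} TOPS, "
          f"usable {usable_low / 1e12:.1f} to {usable_high / 1e12:.1f} TOPS")
```

The same single arithmetic unit yields 2, 8, or 16 "TOPS" depending only on how operations are counted, which is the point of the slide.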
Computing CNNs: workflow
Input image → Layer 1 → Layer 2 → ... → Layer N → classifier layer → Scene Labeling
Modern deep CNN: 5 to 152 layers
Source: Chen et al., “Eyeriss: A Spatial Architecture for Energy-Efficient Dataflow for CNNs”, MIT 2016
https://www.cityscapes-dataset.com/
Computing CNNs: workflow
Input image → Layer 1 (low-level features) → Layer 2 → ... → Layer N (high-level features) → classifier layer → Scene Labeling
Source: features: https://arxiv.org/pdf/1311.2901v3.pdf
Computing CNNs: workflow
Input image → Layer 1 → Layer 2 → ... → Layer N → classifier layer → Scene Labeling
Each layer: convolution, activation, normalization, pooling
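To make the per-layer steps concrete, here is a minimal single-channel sketch in plain Python. It covers convolution, a ReLU activation, and 2x2 max pooling; normalization is left out for brevity. The function names, the choice of ReLU and max pooling, and the toy input are assumptions for illustration, not the networks discussed in the talk.

```python
def conv2d_valid(image, kernel):
    """'Valid' 2D convolution as used in CNNs (no kernel flip, no padding)."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [[sum(image[y + i][x + j] * kernel[i][j]
                 for i in range(kh) for j in range(kw))
             for x in range(out_w)]
            for y in range(out_h)]


def relu(fmap):
    """Activation: clamp negative values to zero."""
    return [[max(0.0, v) for v in row] for row in fmap]


def max_pool_2x2(fmap):
    """Pooling: keep the maximum of each non-overlapping 2x2 block."""
    return [[max(fmap[y][x], fmap[y][x + 1], fmap[y + 1][x], fmap[y + 1][x + 1])
             for x in range(0, len(fmap[0]) - 1, 2)]
            for y in range(0, len(fmap) - 1, 2)]


# Toy 6x6 input and a simple vertical-edge kernel (both hypothetical).
image = [[float((x + y) % 4) for x in range(6)] for y in range(6)]
kernel = [[1.0, 0.0, -1.0] for _ in range(3)]

features = max_pool_2x2(relu(conv2d_valid(image, kernel)))
print(features)  # 6x6 input -> 4x4 after convolution -> 2x2 after pooling
```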
Computing CNNs: workflow
Input image → Layer 1 → Layer 2 → ... → Layer N → classifier layer → Scene Labeling
Each layer: convolution, activation, normalization, pooling
Convolution: 90-99% of the computational effort + runtime
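A back-of-the-envelope count shows why convolution dominates. The sketch below compares the multiply-accumulates of one convolutional layer with the element-wise work of the other three steps. The layer dimensions are hypothetical, and charging roughly one operation per output value for activation, normalization, and pooling is a simplifying assumption.

```python
def conv_maccs(out_h, out_w, c_in, c_out, k):
    """MACCs of one convolutional layer with k x k kernels."""
    return out_h * out_w * c_in * c_out * k * k


def elementwise_ops(out_h, out_w, c_out):
    """Roughly one operation per output value (simplifying assumption)."""
    return out_h * out_w * c_out


# Hypothetical mid-network layer: 56x56 output, 64 -> 128 channels, 3x3 kernels.
maccs = conv_maccs(56, 56, 64, 128, 3)
other = 3 * elementwise_ops(56, 56, 128)  # activation + normalization + pooling

share = maccs / (maccs + other)
print(f"convolution: {maccs:,} MACCs, share of layer work: {share:.1%}")
```

Each output value of the convolution costs c_in · k · k MACCs, while the other steps cost a handful of operations per value, so the share stays high for any layer with many input channels.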
Complexity of CNN architectures
Example: Scene Labeling, GoogLeNet
[Chart: MACC/s axis marked at 1 thousand, 1 million («mega»), 1 billion («giga»), and 1 trillion («tera»). Marker: 256x256]
Complexity of CNN architectures
Example: Scene Labeling, GoogLeNet
[Chart: MACC/s axis marked at 1 thousand, 1 million («mega»), 1 billion («giga»), and 1 trillion («tera»). Markers: 256x256; 1920x1080 at 1 FPS]
Complexity of CNN architectures
Example: Scene Labeling, GoogLeNet
[Chart: MACC/s axis marked at 1 thousand («kilo»), 1 million («mega»), 1 billion («giga»), and 1 trillion («tera»). Markers: 256x256; 1920x1080 at 1 FPS; 1920x1080 at 30 FPS]
• minimum number of computations + memory transfers
• with perfect parallelization + data reuse (no tiling)
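The step from 256x256 to full HD at 30 FPS can be estimated with a simple scaling rule. This sketch assumes that the work of a fully convolutional network grows linearly with the number of input pixels. The per-frame cost of 2 billion MACCs at 256x256 is a hypothetical starting value, not a figure from the slides.

```python
def scaled_macc_rate(maccs_per_frame, base_res, target_res, fps):
    """MACC/s at target_res and fps, assuming cost scales with pixel count."""
    base_px = base_res[0] * base_res[1]
    target_px = target_res[0] * target_res[1]
    return maccs_per_frame * target_px / base_px * fps


BASE_MACCS = 2e9          # assumption: per-frame cost at 256x256
BASE_RES = (256, 256)

for res, fps in ((BASE_RES, 1), ((1920, 1080), 1), ((1920, 1080), 30)):
    rate = scaled_macc_rate(BASE_MACCS, BASE_RES, res, fps)
    print(f"{res[0]}x{res[1]} @ {fps:2d} FPS: {rate:.2e} MACC/s")
```

Full HD has about 32 times as many pixels as 256x256, and 30 FPS adds another factor of 30, which together move the requirement by roughly three orders of magnitude.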
Complexity of CNN architectures: image classification, 256x256
MACCs: 100 million to 100 billion
Memory: 10 MB to 1’000 MB
Source: D. Gschwend, “ZynqNet : An FPGA-accelerated Embedded Convolutional Neural Network”, ETHZ 2016
Complexity of CNN architectures: scene labeling, 1920x1080
MACCs: 100 million to 100 billion (256x256) → 10 billion to 1’000 billion = 1 TMACC (1920x1080)
Memory: 10 MB to 1’000 MB (256x256) → 100 MB to 10’000 MB (1920x1080)
Source: D. Gschwend, “ZynqNet : An FPGA-accelerated Embedded Convolutional Neural Network”, ETHZ 2016
x10
Hardware platforms for deep learning?
Hardware platforms for deep learning: compute capacity
Class  Name               Type   Peak MACC/s     Mem BW     Peak power
GPU    NVIDIA Drive PX2   float  4’000 billion   80 GB/s    80 W
GPU    NVIDIA Titan X     float  3’000 billion   300 GB/s   250 W
FPGA   Kintex XCKU115     int16  3’000 billion   50 GB/s    50 ... 100 W
FPGA   Arria GX660        float  2’500 billion   50 GB/s    50 ... 100 W
VPU    Movidius Myriad 2  int16  1’000 billion   400 GB/s   2 W
GPU    NVIDIA Tegra X1    float  250 billion     25 GB/s    10 W
CPU    Core i7-6700K      float  250 billion     30 GB/s    90 W
ASIC   Origami            int12  150 billion     0.5 GB/s   0.7 W
DSP    TI C6678           float  80 billion      20 GB/s    15 W
ASIC   Eyeriss            int16  40 billion      0.1 GB/s   0.3 W
Scene Labeling GoogLeNet, 1920 x 1080 pixels, 30 FPS:
2’000 billion = 2 tera MACCs per second
50 GB/s memory bandwidth
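The table can be checked against the stated requirement of 2 tera MACC/s and 50 GB/s with a short script. This is a sketch over the peak figures from the slide only; it ignores implementation loss, and for the two FPGAs it uses the upper bound of the quoted 50 ... 100 W range. The helper names are hypothetical.

```python
# (class, name, peak GMACC/s, memory bandwidth GB/s, peak power W)
PLATFORMS = [
    ("GPU", "NVIDIA Drive PX2", 4000, 80, 80),
    ("GPU", "NVIDIA Titan X", 3000, 300, 250),
    ("FPGA", "Kintex XCKU115", 3000, 50, 100),   # power: upper bound of 50 ... 100 W
    ("FPGA", "Arria GX660", 2500, 50, 100),      # power: upper bound of 50 ... 100 W
    ("VPU", "Movidius Myriad 2", 1000, 400, 2),
    ("GPU", "NVIDIA Tegra X1", 250, 25, 10),
    ("CPU", "Core i7-6700K", 250, 30, 90),
    ("ASIC", "Origami", 150, 0.5, 0.7),
    ("DSP", "TI C6678", 80, 20, 15),
    ("ASIC", "Eyeriss", 40, 0.1, 0.3),
]

REQUIRED_GMACC = 2000  # scene labeling, GoogLeNet, 1920 x 1080, 30 FPS
REQUIRED_BW = 50       # GB/s


def meets_requirement(platform):
    """True if peak compute and memory bandwidth both reach the requirement."""
    _, _, gmacc, bandwidth, _ = platform
    return gmacc >= REQUIRED_GMACC and bandwidth >= REQUIRED_BW


def efficiency(platform):
    """Peak GMACC/s per watt."""
    _, _, gmacc, _, power = platform
    return gmacc / power


candidates = [p[1] for p in PLATFORMS if meets_requirement(p)]
print("Meet the peak requirement:", candidates)

for p in sorted(PLATFORMS, key=efficiency, reverse=True):
    print(f"{p[1]:18s} {efficiency(p):7.1f} GMACC/s per W")
```

On peak numbers alone, only the two large GPUs and the two FPGAs reach the requirement, while the most power-efficient devices fall short of it by a wide margin.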
Sources:
http://wccftech.com/nvidia-drive-px2-pascal-gtc-2016/
http://www.geforce.com/hardware/desktop-gpus/geforce-gtx-titan-x/specifications
http://www.xilinx.com/products/technology/dsp.html
https://www.xilinx.com/products/technology/memory-interfacing.html
https://www.altera.com/products/fpga/features/dsp/arria10-dsp-block.html
http://goo.gl/xBdTrV
http://uploads.movidius.com/1441734401-Myriad-2-product-brief.pdf
http://browser.primatelabs.com/geekbench3/7309149
http://international.download.nvidia.com/pdf/tegra/Tegra-X1-whitepaper-v1.0.pdf
http://people.csail.mit.edu/emer/slides/2016.02.isscc.eyeriss.slides.pdf
http://asic.ethz.ch/2014/Origami.html
http://www.ti.com/lit/an/sprabk5b/sprabk5b.pdf
Hardware platforms for deep learning: compute capacity
[Chart: power (1 W to 1’000 W) versus speed (10 billion to 10’000 billion MACC/s; 1’000 billion = «tera») for FPGA float, FPGA int, GPU Titan X, GPU Drive PX2, ASIC Eyeriss, ASIC Origami, VPU, DSP, CPU, and GPU Tegra X1. Annotation: implementation loss]
Hardware platforms for deep learning: compute capacity
[Chart: power (1 W to 1’000 W) versus speed (10 billion to 10’000 billion MACC/s; 1’000 billion = «tera») for FPGA float, FPGA int, GPU Titan X, GPU Drive PX2, ASIC Eyeriss, ASIC Origami, VPU, DSP, CPU, and GPU Tegra X1. Requirement markers: 256x256; 1920x1080 at 1 FPS; 1920x1080 at 30 FPS]
Future developments
“Wild West” for a few more years
- a great deal of movement
- big players, new research results every month
Optimal hardware platform for CNNs still unclear
- convergence of GPU, DSP, VPU, FPGA
- specialized accelerators (fixed-point integer, direct convolution)
- equally important: frameworks, vendor support, license terms
Future developments
Tuning the networks to the target platform
New devices with accelerators optimized for CNNs
- there is much more to tell:
⇒ under NDA; contact the usual suspects and/or SCS.
Complexity of CNN architectures: human vs. machine
• More than ½ of the brain is involved in vision
• The brain is 1’000’000x more energy-efficient
Brain: 10’000 Tera Op/s; machine: 80 Tera Op/s* (* IBM Watson, 2012)
Source: Yu Wang, Tsinghua University, Feb 2016;
https://www.quora.com/How-much-of-the-brain-is-involved-with-vision
Supercomputing Systems AG
Phone +41 43 456 16 19