

Supercomputing Systems AG Phone +41 43 456 16 00

Technopark 1 Fax +41 43 456 16 10

8005 Zürich www.scs.ch

Vision meets reality.

Hardware Aspects of Deep Learning - FPGAs for Computer Vision in the Car

Workshop at University of Applied Sciences Ulm 11.7.2017

Felix Eberli, Department Head Embedded & Automotive

12 Zürich 06.07.2017 © by Supercomputing Systems AG PUBLIC

But already in series production in many driver assistance systems

• ACC (Distronic)
• Blind spot detection
• Brake assist
• Pedestrian detection
• Park pilot
• Stop & Go Pilot
• Highway Pilot (steering assist)
• ...
• Let's see ☺



Application “Traffic Jam Pilot”

• Steering assistant allows autonomous driving in traffic jams up to 30 km/h, assisting above.

[Figure: sensor view for an urban drive. Green: radar objects; object position via 6D-Vision.]


SCS company profile

• Founded 1993 and privately owned by Prof. Dr. Anton Gunzinger

• 100+ employees:

Electrical engineers

Software engineers

Physicists

Mathematicians

• Company offices at Technopark Zurich, Switzerland



SCS Services

Departments

• Embedded & Automotive

• Life Science / Medical

• High Performance Safety

• Embedded

• High Performance Computing

• SW / Public Transport

• SW / Broadcast

• Measure & Decide

Embedded & Automotive

• Feasibility studies

• Hardware (Specification, Design, Schematics, Layout, Production)

• Firmware/IP (FPGA, DSP, GPU, CPU)

• Software (Drivers, Host SW – Windows/Linux)

• Optimizations (ARM, NEON, DSP, EVE, GPU, R-CAR, PC SSE)


SCS Embedded & Automotive Department



The Principle of Stereo Vision

2008: world's first real-time implementation of Semi-Global Matching on an automotive-compliant FPGA

S. Gehrig, F. Eberli, T. Meyer, “A Real-time Low-Power Stereo Vision Engine Using Semi-Global Matching”, ICVS 2009 (Best Paper Award)


SCS Stereo Vision Evaluation Platform



Example measurement accuracy 3:

Distance measured = 10.549 m ± 0.022 m (baseline length = 25 cm)
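A stereo distance like the one above comes from triangulation: for a rectified camera pair, distance is focal length times baseline divided by disparity. The following is a minimal sketch of that relation and of its first-order error. Only the 25 cm baseline and the 10.549 m distance come from the slide; the focal length of 1200 px and the disparity error of 0.06 px are hypothetical values chosen for illustration.

```python
def depth_from_disparity(focal_px, baseline_m, disparity_px):
    """Triangulated distance Z = f * B / d for a rectified stereo pair."""
    if disparity_px <= 0:
        raise ValueError("disparity must be positive")
    return focal_px * baseline_m / disparity_px


def depth_error(focal_px, baseline_m, depth_m, disparity_error_px):
    """First-order depth uncertainty: dZ = Z^2 / (f * B) * dd."""
    return depth_m ** 2 / (focal_px * baseline_m) * disparity_error_px


BASELINE_M = 0.25   # baseline length from the slide
FOCAL_PX = 1200.0   # ASSUMPTION: hypothetical focal length in pixels

# Disparity that corresponds to the measured distance of 10.549 m.
disparity = FOCAL_PX * BASELINE_M / 10.549
distance = depth_from_disparity(FOCAL_PX, BASELINE_M, disparity)

# ASSUMPTION: hypothetical sub-pixel disparity error of 0.06 px.
sigma = depth_error(FOCAL_PX, BASELINE_M, distance, disparity_error_px=0.06)
print(f"disparity {disparity:.2f} px -> {distance:.3f} m +/- {sigma:.3f} m")
```

The error term grows with the square of the distance, which is why a longer baseline or a longer focal length helps at range.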


SCS Video Injection System, Multi camera record, replay and HIL



Current research focuses on a deeper understanding of the scene

• https://www.cityscapes-dataset.com/examples/


My diploma thesis 20 years ago: porting a neural network to a DSP



Deep Learning -> CNN

• Convolutional Neural Network

[Figure panels: Scene Labeling, Object Detection]


Worked example: how do I read marketing slides?

1x 32-bit multiplier + adder @ 1 THz

⇒ 1 TMACC/s
⇒ 2 TOPS (32 bit)
⇒ 8 DLTOPS (8 bit)
⇒ 16 TOPS peak (4 bit)

• Bandwidth / partitioning => only 30-70% usable
• NOPs are OPS too
• Typical power consumption
• Batch mode? => latency
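The chain of headline numbers above can be reproduced with a few lines of arithmetic. This sketch rests on one reading of the slide, which is an assumption: one MACC is counted as two operations, and a 32-bit unit is counted as several narrower lanes (4 at 8 bit, 8 at 4 bit). The function names are hypothetical.

```python
def marketing_ops(macc_per_s, bits, base_bits=32):
    """Headline operations per second.

    ASSUMPTION: one MACC counts as two operations (multiply + add), and a
    base_bits-wide unit is counted as base_bits // bits narrower lanes.
    """
    lanes = base_bits // bits
    return 2 * macc_per_s * lanes


def usable_range(peak_ops, low=0.30, high=0.70):
    """Apply the 30-70% usable fraction quoted on the slide."""
    return peak_ops * low, peak_ops * high


TMACC_PER_S = 1e12  # one 32-bit multiplier + adder at 1 THz

for bits in (32, 8, 4):
    peak = marketing_ops(TMACC_PER_S, bits)
    low, high = usable_range(peak)
    print(f"{bits:2d} bit: {peak / 1e12:4.0f} TOPS peak, "
          f"{low / 1e12:.1f} to {high / 1e12:.1f} TOPS usable")
```

The same silicon therefore yields anything between 2 and 16 "TOPS" depending on how the operations are counted, before bandwidth and partitioning losses are applied.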



Computing CNNs: workflow

[Figure: input image → Layer 1 → Layer 2 → ... → Layer N → classifier layer → scene labeling]

Modern deep CNN: 5 – 152 layers

Source: Chen et al., “Eyeriss: A Spatial Architecture for Energy-Efficient Dataflow for CNNs”, MIT 2016

https://www.cityscapes-dataset.com/



[Figure: Layer 1 → Layer 2 → ... → Layer N → classifier layer; the first layers extract low-level features, Layer N extracts high-level features]

Source: features: https://arxiv.org/pdf/1311.2901v3.pdf


[Figure: Layer 1 → Layer 2 → ... → Layer N → classifier layer; each layer consists of convolution, activation, normalization and pooling]
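The four steps of a layer (convolution, activation, normalization, pooling) can be sketched in plain Python for a single-channel feature map. This is an illustration, not the implementation of any particular network: ReLU as activation, zero-mean/unit-variance normalization and 2x2 max pooling are assumptions, since the slides do not name the concrete variants.

```python
from statistics import mean, pstdev


def conv2d(image, kernel):
    """Valid 2-D convolution (cross-correlation, as CNN frameworks compute it)."""
    k = len(kernel)
    rows, cols = len(image) - k + 1, len(image[0]) - k + 1
    return [[sum(image[r + i][c + j] * kernel[i][j]
                 for i in range(k) for j in range(k))
             for c in range(cols)]
            for r in range(rows)]


def relu(fmap):
    """Activation (ASSUMPTION: ReLU)."""
    return [[max(0.0, v) for v in row] for row in fmap]


def normalize(fmap, eps=1e-6):
    """Normalization (ASSUMPTION: zero mean, unit variance over the map)."""
    values = [v for row in fmap for v in row]
    mu, sigma = mean(values), pstdev(values)
    return [[(v - mu) / (sigma + eps) for v in row] for row in fmap]


def max_pool(fmap, size=2):
    """Pooling (ASSUMPTION: non-overlapping max pooling)."""
    rows, cols = len(fmap) // size, len(fmap[0]) // size
    return [[max(fmap[r * size + i][c * size + j]
                 for i in range(size) for j in range(size))
             for c in range(cols)]
            for r in range(rows)]


def layer(image, kernel):
    """One layer: convolution -> activation -> normalization -> pooling."""
    return max_pool(normalize(relu(conv2d(image, kernel))))


# 6x6 test image with a vertical edge and a 3x3 vertical-edge kernel.
image = [[0, 0, 0, 1, 1, 1] for _ in range(6)]
kernel = [[-1, 0, 1] for _ in range(3)]
features = layer(image, kernel)
print(len(features), "x", len(features[0]), "feature map")
```

A real layer applies many such kernels across many input channels, which is where the bulk of the multiply-accumulate work comes from.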



[Figure: input image → Layer 1 → Layer 2 → ... → Layer N → classifier layer → scene labeling; per layer: convolution, activation, normalization, pooling]

Convolution: 90 – 99% of computational effort + run time
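Why convolution dominates becomes visible when the operations of a single layer are counted. The sketch below does this for one hypothetical layer (56x56 output, 64 input and 128 output channels, 3x3 kernel, 2x2 pooling); these dimensions are assumptions for illustration and are not taken from the slides, so the resulting share applies to this layer only.

```python
def conv_maccs(h_out, w_out, c_in, c_out, k):
    """MACCs of one convolutional layer: every output value needs k*k*c_in MACCs."""
    return h_out * w_out * c_out * k * k * c_in


def elementwise_ops(h, w, channels):
    """One operation per value, e.g. for the activation."""
    return h * w * channels


def pool_ops(h_out, w_out, channels, size=2):
    """Comparisons of one max-pooling layer (no multiplications)."""
    return h_out * w_out * channels * (size * size - 1)


# ASSUMPTION: hypothetical layer dimensions, chosen for illustration only.
conv = conv_maccs(56, 56, 64, 128, 3)
other = elementwise_ops(56, 56, 128) + pool_ops(28, 28, 128)
share = conv / (conv + other)
print(f"convolution: {conv:,} MACCs, share of layer operations: {share:.1%}")
```

Each output value of the convolution costs k*k*c_in MACCs, while activation and pooling cost only about one operation per value.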



Complexity of CNN architectures

Example: scene labeling, GoogLeNet

[Figure: logarithmic MACC/s scale from 1 thousand («kilo») through 1 million («mega») and 1 billion («giga») to 1 trillion («tera»), with markers for 256x256, 1 FPS at 1920x1080 and 30 FPS at 1920x1080]

• minimum number of computations + memory transfers

• assuming perfect parallelization + data reuse (no tiling)
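The step from 256x256 to 1920x1080 and then to 30 FPS can be estimated with simple scaling. The sketch assumes that the cost per frame grows linearly with the pixel count, and it uses a hypothetical base cost of 2 billion MACCs per 256x256 frame; neither value is stated on the slide. With these assumptions the result lands close to the roughly 2 tera MACCs per second that the deck quotes for GoogLeNet at 1920x1080 and 30 FPS.

```python
def scale_maccs(maccs_per_frame, src_res, dst_res, fps=1.0):
    """MACC/s at dst_res and fps.

    ASSUMPTION: per-frame cost scales linearly with the number of pixels.
    """
    src_pixels = src_res[0] * src_res[1]
    dst_pixels = dst_res[0] * dst_res[1]
    return maccs_per_frame * dst_pixels / src_pixels * fps


BASE_MACCS = 2e9  # ASSUMPTION: hypothetical cost of one 256x256 frame

full_hd_1fps = scale_maccs(BASE_MACCS, (256, 256), (1920, 1080))
full_hd_30fps = scale_maccs(BASE_MACCS, (256, 256), (1920, 1080), fps=30)

print(f"256x256, 1 FPS:     {BASE_MACCS / 1e9:8.1f} GMACC/s")
print(f"1920x1080, 1 FPS:   {full_hd_1fps / 1e9:8.1f} GMACC/s")
print(f"1920x1080, 30 FPS:  {full_hd_30fps / 1e9:8.1f} GMACC/s")
```

The resolution change alone is a factor of about 32, and the frame rate adds another factor of 30.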



Complexity of CNN architectures: image classification, 256x256

MACCs: 100 million – 100 billion

Memory: 10 MB – 1’000 MB

Source: D. Gschwend, “ZynqNet: An FPGA-accelerated Embedded Convolutional Neural Network”, ETHZ 2016



Complexity of CNN architectures: scene labeling, 1920x1080

          Image classification 256x256    Scene labeling 1920x1080
MACCs:    100 million – 100 billion       10 billion – 1’000 billion (= 1 TMACC)
Memory:   10 MB – 1’000 MB                100 MB – 10’000 MB

Source: D. Gschwend, “ZynqNet: An FPGA-accelerated Embedded Convolutional Neural Network”, ETHZ 2016

x10


Hardware platforms for Deep Learning

?



Hardware platforms for Deep Learning: compute capacity

Name                      Type    Peak MACC/s      Mem BW     Peak power
GPU NVIDIA Drive PX2      float   4’000 billion    80 GB/s    80 W
GPU NVIDIA Titan X        float   3’000 billion    300 GB/s   250 W
FPGA Kintex XCKU115       int16   3’000 billion    50 GB/s    50 ... 100 W
FPGA Arria GX660          float   2’500 billion    50 GB/s    50 ... 100 W
VPU Movidius Myriad 2     int16   1’000 billion    400 GB/s   2 W
GPU NVIDIA Tegra X1       float   250 billion      25 GB/s    10 W
CPU Core i7-6700K         float   250 billion      30 GB/s    90 W
ASIC Origami              int12   150 billion      0.5 GB/s   0.7 W
DSP TI C6678              float   80 billion       20 GB/s    15 W
ASIC Eyeriss              int16   40 billion       0.1 GB/s   0.3 W

Scene labeling with GoogLeNet, 1920 x 1080 pixels, 30 FPS:
2’000 billion = 2 tera MACCs per second
50 GB/s memory bandwidth
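The platform table can be checked against the scene-labeling requirement (2’000 billion MACC/s and 50 GB/s) with a short script. The figures below are copied from the table; where the table gives a power range, the upper bound is used, which is a choice made here. The comparison uses peak values only, so it is an optimistic sketch that ignores implementation loss.

```python
from typing import NamedTuple


class Platform(NamedTuple):
    name: str
    peak_gmacc_s: float   # peak billions of MACCs per second
    mem_bw_gb_s: float    # memory bandwidth in GB/s
    peak_power_w: float   # upper bound where the table gives a range


PLATFORMS = [
    Platform("GPU Drive PX2", 4000, 80, 80),
    Platform("GPU Titan X", 3000, 300, 250),
    Platform("FPGA Kintex XCKU115", 3000, 50, 100),
    Platform("FPGA Arria GX660", 2500, 50, 100),
    Platform("VPU Movidius Myriad 2", 1000, 400, 2),
    Platform("GPU Tegra X1", 250, 25, 10),
    Platform("CPU Core i7-6700K", 250, 30, 90),
    Platform("ASIC Origami", 150, 0.5, 0.7),
    Platform("DSP TI C6678", 80, 20, 15),
    Platform("ASIC Eyeriss", 40, 0.1, 0.3),
]

REQUIRED_GMACC_S = 2000   # scene labeling, 1920x1080, 30 FPS
REQUIRED_BW_GB_S = 50


def meets_requirement(p):
    """True if the peak figures reach both the compute and bandwidth target."""
    return p.peak_gmacc_s >= REQUIRED_GMACC_S and p.mem_bw_gb_s >= REQUIRED_BW_GB_S


def efficiency(p):
    """Peak billions of MACCs per second per watt."""
    return p.peak_gmacc_s / p.peak_power_w


candidates = [p.name for p in PLATFORMS if meets_requirement(p)]
print("meet the peak requirement:", ", ".join(candidates))
for p in sorted(PLATFORMS, key=efficiency, reverse=True):
    print(f"{p.name:24s} {efficiency(p):7.1f} GMACC/s per W")
```

On paper, only the two large GPUs and the two FPGAs reach the target, while the most power-efficient devices fall well short of it in absolute throughput.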

Sources:
http://wccftech.com/nvidia-drive-px2-pascal-gtc-2016/
http://www.geforce.com/hardware/desktop-gpus/geforce-gtx-titan-x/specifications
http://www.xilinx.com/products/technology/dsp.html
https://www.xilinx.com/products/technology/memory-interfacing.html
https://www.altera.com/products/fpga/features/dsp/arria10-dsp-block.html
http://goo.gl/xBdTrV
http://uploads.movidius.com/1441734401-Myriad-2-product-brief.pdf
http://browser.primatelabs.com/geekbench3/7309149
http://international.download.nvidia.com/pdf/tegra/Tegra-X1-whitepaper-v1.0.pdf
http://people.csail.mit.edu/emer/slides/2016.02.isscc.eyeriss.slides.pdf
http://asic.ethz.ch/2014/Origami.html
http://www.ti.com/lit/an/sprabk5b/sprabk5b.pdf


[Figure: power (1 W to 1’000 W) versus speed (10 billion to 10’000 billion MACC/s; 1’000 billion = «tera») for FPGA float, FPGA int, GPU Titan X, GPU Drive PX2, ASIC Eyeriss, ASIC Origami, VPU, DSP, CPU and GPU Tegra X1; annotated “implementation loss”]


[Figure: power (1 W to 1’000 W) versus speed (10 billion to 10’000 billion MACC/s; 1’000 billion = «tera») for FPGA float, FPGA int, GPU Titan X, GPU Drive PX2, ASIC Eyeriss, ASIC Origami, VPU, DSP, CPU and GPU Tegra X1, with markers for the workloads 256x256, 1 FPS at 1920x1080 and 30 FPS at 1920x1080]


Future developments

“Wild West” for a few more years
- a great deal of movement
- big players, new research results every month

Optimal hardware platform for CNNs still unclear
- convergence of GPU, DSP, VPU, FPGA
- specialized accelerators (fixed-point integer, direct convolution)
- equally important: frameworks, vendor support, license terms
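The fixed-point integer arithmetic mentioned for specialized accelerators can be illustrated with a small dot product, the core of a direct convolution. This is a sketch under assumptions: symmetric linear quantization to 8 bit is one common scheme, not necessarily what any given accelerator uses, and the function names are hypothetical.

```python
def quantize(values, bits=8):
    """Symmetric linear quantization to signed integers; returns (ints, scale).

    ASSUMPTION: one scale per vector, derived from the largest magnitude.
    """
    qmax = 2 ** (bits - 1) - 1
    peak = max(abs(v) for v in values) or 1.0
    scale = peak / qmax
    return [round(v / scale) for v in values], scale


def dot_fixed_point(a, b, bits=8):
    """Dot product with integer-only MACCs and a single rescale at the end."""
    qa, scale_a = quantize(a, bits)
    qb, scale_b = quantize(b, bits)
    acc = sum(x * y for x, y in zip(qa, qb))  # integer accumulator
    return acc * scale_a * scale_b


activations = [0.5, -1.0, 0.25]
weights = [1.0, 0.5, -0.75]

exact = sum(x * y for x, y in zip(activations, weights))
approx = dot_fixed_point(activations, weights)
print(f"float: {exact:.4f}, 8-bit fixed point: {approx:.4f}")
```

The inner loop contains only integer multiplications and additions, which is what makes narrow fixed-point units cheap in silicon and in FPGA fabric.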



Future developments

Tuning of the networks to the target platform

New devices with accelerators optimized for CNNs

- much more to tell ...

=> NDA - contact the usual suspects and/or SCS.


Complexity of CNN architectures: human vs. machine

• More than ½ of the brain is involved in vision

• The brain is 1’000’000x more energy-efficient

10’000 Tera Op/s vs. 80 Tera Op/s* (* IBM Watson, 2012)

Source: Yu Wang, Tsinghua University, Feb 2016;
https://www.quora.com/How-much-of-the-brain-is-involved-with-vision


Supercomputing Systems AG

[email protected] +41 43 456 16 19
