
LECTURE: MP-6171 SISTEMAS EMPOTRADOS DE ALTO DESEMPEÑO

Project 3: Optimizing Traditional and Deep Learning Algorithms for Autonomous Driving

Lecturers: MSc. José Araya Martínez

MSc. Sergio Arriola-Valverde

First term, 2020


Contents

1 Introduction
  1.1 Administrative Matters
    1.1.1 Team Formation
    1.1.2 Forum and Communication
    1.1.3 Plagiarism
  1.2 Development Environment Overview
    1.2.1 Nomenclature

2 Jetson Nano Hands-On
  2.1 Setting up the Jetson Nano Development Environment
    2.1.1 Installing the Official Ubuntu OS
    2.1.2 Installing OpenCV
    2.1.3 Installing QtCreator and running an example

3 The Semi-Global Block Matching Algorithm
  3.1 Data Acquisition
  3.2 Census Transform
  3.3 Hamming Distance
  3.4 Cost Computation
  3.5 Cost Aggregation
  3.6 Disparity Computation
  3.7 Getting the Sources and Running the Baseline Implementation

4 Optimizing the "Expensive" SGBM Depth-Estimation Algorithm
  4.1 Profiling the SGBM Algorithm to Detect "Hotspots"
  4.2 Reducing Memory Footprint of Baseline Implementation
  4.3 Multi-threading Technique to Accelerate Expensive Loops
  4.4 Acceleration through NEON Intrinsics
  4.5 NEON Intrinsics vs Compiler Auto-Vectorization
  4.6 CUDA Optimization
  4.7 Evaluating Real-Time Constraints

5 Deep Learning Algorithm Optimization
  5.1 Selection of the Convolutional Network Architecture
  5.2 Post-Training Optimization of CNN

6 Deliverables
  6.1 Folder Structure of your Deliverables
  6.2 Grading
  6.3 Deliverables Submission


1 Introduction

Autonomous driving is constantly gaining relevance as modern embedded systems and their hardware accelerators (FPGA, GPU, etc.) allow real-time computation of the sophisticated algorithms needed for autonomous decision making while driving. This technology has already been introduced in the first commercial cars and is going to change traffic as we know it today.

One class of algorithms necessary for safe autonomous driving is depth estimation. Its purpose is to determine a dense disparity map in which every pixel in the field of view of the cameras carries distance information from the cameras to every object present in the scene. With this information a car can decide whether it is time to brake or steer away if an object is getting too close.

Our task is to take a baseline (and inefficient) implementation of such an algorithm, Semi-Global Block Matching (SGBM), and optimize it so that we can evaluate whether it would be a good candidate for use in a real application.

To do so, we will need to profile our base application, determine hotspots and try to take advantage of the NVIDIA Jetson Nano hardware to speed it up. Towards the end of the project we will compare the performance of our implementation against a state-of-the-art implementation.

For this comparison, we will take a deep-learning algorithm, optimize it using a post-training method such as weight quantization or model pruning, and compare it against our optimized traditional SGBM.

1.1 Administrative Matters

Before we get down to business, some administrative matters need to be discussed:

1.1.1 Team Formation

Use the initial team organization defined previously. In this case, a maximum of two students per group.

1.1.2 Forum and Communication

This project will be evaluated remotely; for this reason, having a suitable online platform is very important to facilitate communication between students and lecturers.

In order to do so, we will adopt a "community" approach by means of a forum in the TecDigital platform. In the forum, all students can create new topics as questions arise.


All discussions are public to all the course members so that any fellow student can answer and/or add more information to the proposed question. Please avoid sending project-related queries directly to the lecturers' email, as this prevents other students from benefiting from the answer. In the forum we are all one team! The only restriction is to not share working source code in the forum; instead, we can create discussions and propose ideas and concepts that lead to the solution.

1.1.3 Plagiarism

Any evidence of plagiarism will be thoroughly investigated. If it is corroborated, the team will receive a zero grade in the project and the incident will be communicated to the corresponding institutional authorities for further processing.

1.2 Development Environment Overview

In previous projects students already learned the Yocto project workflow and the value of a TFTP server and an NFS server. The approach for this project will be native compilation, therefore a different development environment must be set up. An NVIDIA Jetson Nano will be the platform used for this project.

As Figure 1 shows, our development environment consists of one main component, the development board, and several peripherals:

• Monitor

• Keyboard

• Mouse

• Camera

The monitor can be connected to either the display port (7) or the HDMI port (6); for the keyboard and mouse you can make use of the USB 3.0 ports (5). The Raspberry Pi Camera Module v2 plugs into the MIPI CSI camera connector (9). The microSD card slot for main storage (1) is located on the back part of the board underneath the heat sink. As we are taking a native compilation approach, an Ethernet connection to a routing device is needed as well (4).

The starter kit comes with a paper stand which folds and fits inside the box so that the board can rest upon it. Make sure to use this stand, as there are components on the bottom which might get damaged by hard surfaces.


Figure 1: General view of the development environment

1.2.1 Nomenclature

During this project, we will mainly work with a single OS. Table 1 shows the prompt symbol for the Ubuntu system running on the Jetson Nano and the one for a host system used during the first steps of configuration.

Table 1: Prompt symbols for the command-line console during this project.

Prompt symbol    Description
$                Jetson Nano Linux
@                Host computer

2 Jetson Nano Hands-On

As we mentioned before, the NVIDIA Jetson Nano is our pick as development board for the project. Table 2 summarizes the main technical specifications of the board.

The primary features of the Raspberry Pi 4 and the NVIDIA Jetson Nano are very similar, with the exception of the graphics capabilities. The Jetson Nano includes a higher-performance Maxwell-based GPU, whereas the Raspberry Pi 4 uses a low-power integrated multimedia GPU.

As for the CPUs, they are very similar.


Specification    Description
GPU              128-core Maxwell
CPU              Quad-core ARM A57 @ 1.43 GHz
Memory           4 GB 64-bit LPDDR4, 25.6 GB/s
Storage          microSD
Video Encode     4K @ 30 | 4x 1080p @ 30 | 9x 720p @ 30 (H.264/H.265)
Video Decode     4K @ 60 | 2x 4K @ 30 | 8x 1080p @ 30 | 18x 720p @ 30 (H.264/H.265)
Camera           2x MIPI CSI-2 DPHY lanes
Connectivity     Gigabit Ethernet, M.2 Key E
Display          HDMI and display port
USB              4x USB 3.0, USB 2.0 Micro-B
Others           GPIO, I2C, I2S, SPI, UART
Mechanical       69 mm x 45 mm, 260-pin edge connector

Table 2: Technical specifications for the NVIDIA Jetson Nano [1]

The newer ARM A72 found in the Raspberry Pi 4 has a slightly higher clock speed of 1.5 GHz and was designed to offer 90% greater performance than the A57.

As an interesting fact for video game enthusiasts, the 128-core Maxwell GPU found in the Jetson Nano is the same GPU found in the Nintendo Switch gaming console! You can also find the Cortex-A57 CPU inside the Nintendo Switch.

2.1 Setting up the Jetson Nano Development Environment

Once we have reviewed all the hardware components needed, it is time to set up an appropriate software development environment. The following sections provide a high-level guide to installing the following software:

• Official NVIDIA Ubuntu OS

• OpenCV 4.4.0

• QTCreator

Finally, an OpenCV hello-world example using QtCreator is presented to get you started.

2.1.1 Installing the Official Ubuntu OS

We are going to use the official Jetson Nano Developer Kit SD Card Image; you can click the previous link to start the download. Make sure you have at least 20 GB of free space on your computer: the compressed download is about 6 GB and the extracted image is about 14 GB.


You can find instructions here to download and write the SD card using Etcher, or use the command line as we did in previous projects.

Hint: @ sudo dd bs=4M if=sd-blob-b01.img of=/dev/sdX conv=fsync status=progress

Plug the SD card into your Jetson Nano and turn it on. When you boot for the first time, the developer kit will take you through some initial steps, including:

• Review and accept NVIDIA Jetson software EULA

• Select system language, keyboard layout, and time zone

• Log in

Once you are done with the previous steps you should see a desktop like the one in Figure 2. So cool!

Figure 2: NVIDIA Official Ubuntu OS desktop

2.1.2 Installing OpenCV

The next thing we need for our development environment is the Open Computer Vision (OpenCV) library. There are two approaches to install the required libraries: use the Ubuntu package manager or build from source. We are going to use the latter. To do so, download the executable script from this repo.


The script is going to install build dependencies, clone a requested version of OpenCV, build it from source, and install it. Even though installing the packaged version from the Ubuntu repository is easier, building OpenCV from source gives you more flexibility, and it should be your first option when installing OpenCV. Some details to keep in mind:

• Since we are building on the Jetson directly, it is highly recommended to use a swapfile.

• You can modify the script to install OpenCV with examples or specific packagesusing the CMAKEFLAGS.

• The entire process takes about 5 hours to complete; be patient.

• For certain tasks you will be prompted for either approval or the sudo password.

• If you are using a screen it might flicker.

• Once the installation is done a reboot is required.

2.1.3 Installing QtCreator and running an example

The process to install QtCreator is much simpler and way faster. We are going to take advantage of the package manager. Go ahead and type:

$ sudo apt-get install qt5-default qtcreator -y

Now that we have QtCreator and OpenCV, let us give them a test run. We are going to code a simple example that displays an image of our choice in a window.

To create a new project:

• Open QtCreator.

• File » ‘New File or Project’.

• We are going to choose ‘Applications’ and ‘Qt Widgets Application’.

• You can then choose a name, location, etc.

• Click on Finish and QtCreator will create the source files for you.

In order to link OpenCV with our project we have to add the paths to the libraries we are going to be using in the ‘.pro’ file. So add the following lines before ‘SOURCES’:

• INCLUDEPATH += /usr/local/include/opencv4

8

Page 10: Project 3: Optimizing Traditional and Deep Learning ... 3...Camera 2x MIPI CSI/2 DPHY lanes Connectivity Gigabit Ethernet, M.2 Key E Display HDMI and display port USB 4x USB 3.0, USB

3.1

• LIBS += -L/usr/local/lib -lopencv_core -lopencv_imgcodecs -lopencv_highgui

If you installed OpenCV using a different method than the one described in Section 2.1.2, the path might be different, so be sure to double check. Now we are going to write our code in the ‘mainwindow.cpp’ source file. Right after ‘ui->setupUi(this)’ write the following lines:

• cv::Mat inputImage = cv::imread("PATH_TO_IMAGE");

• if(!inputImage.empty()) cv::imshow("Display Image", inputImage);

Make sure to replace ‘PATH_TO_IMAGE’ with a valid path to an image on your system. Include the OpenCV header at the top of this same file:

• #include <opencv2/opencv.hpp>

Finally, build your project with Ctrl + B or under Build » Build Project, and run it with Ctrl + R or under Build » Run. Congratulations! Now our development environment is ready for some serious work!
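For reference, here is a minimal sketch of how the resulting ‘mainwindow.cpp’ could look once the lines above are in place. It assumes the default class and file names generated by the Qt Widgets Application template; the image path is a placeholder, and the cv::waitKey(1) call is an optional extra that gives HighGUI a chance to draw, depending on the backend.

// mainwindow.cpp -- minimal sketch combining the steps above
#include "mainwindow.h"
#include "ui_mainwindow.h"
#include <opencv2/opencv.hpp>   // core, imgcodecs and highgui headers

MainWindow::MainWindow(QWidget *parent)
    : QMainWindow(parent), ui(new Ui::MainWindow)
{
    ui->setupUi(this);

    // Load an image from disk and show it in an OpenCV window.
    cv::Mat inputImage = cv::imread("PATH_TO_IMAGE");
    if (!inputImage.empty()) {
        cv::imshow("Display Image", inputImage);
        cv::waitKey(1);         // let HighGUI process its events once
    }
}

MainWindow::~MainWindow()
{
    delete ui;
}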

3 The Semi-Global Block Matching Algorithm

The Semi-Global Block Matching (SGBM) algorithm is an efficient depth-estimation method that takes into account not only information within a block-matching window, but also information from the whole image along specific scan paths called Lr, where r defines a specific direction (from the German Richtung). The SGBM algorithm was first introduced by Hirschmüller while working at the German Aerospace Center (DLR). You can find a copy of the original paper here.

The semi-global feature of the SGBM approach allows a more robust disparity estimation in comparison to purely local methods, such as the simpler Block Matching approach. On the other hand, it makes the SGBM algorithm much more resource hungry than its local counterparts. In this project we will try to keep the high-accuracy estimation of the SGBM while reducing its computation time as much as possible.

Figure 3 shows an overview of the structure of the algorithm as a pipe architecture. All green blocks represent information exchange between functions (buffers or variables) while blue blocks denote functions. In the following, a description of every stage of the algorithm is provided in detail.

Note that image rectification is not addressed within this document, as the datasets we are using offer rectified images. Additionally, the inclusion of a post-processing algorithm such as a texture threshold or speckle range filter is not strictly necessary; for simplicity, the provided C implementation does not contain a post-processing step.


Figure 3: Pipe representation of the multiple stages of the SGBM algorithm and its buffers

3.1 Data Acquisition

Once the rectification process has been done (in our case images are already rectified), the stereo algorithm only has to look for possible matching points along epipolar lines across the image. We do not pad image borders prior to the stereo computation in our implementation; for this reason, the usable pixels are reduced by Block_Size/2 on each side of the incoming image. This is depicted in Figure 4 as grey borders along the image frame.

Figure 4 additionally shows some fundamental variables related to the dimensions of the image and its processing. The names of the depicted variables correspond to the provided C implementation.

For better understanding, we will briefly describe some important variables present in Figure 4 and in Figure 5:

• height, width: Dimensions of the incoming image after rectification.

• Block_Size: Amount of neighboring pixels to be considered during the Census Transform step.

• width_after_census = width − (Block_Size − 1)

• MinDisp: Lowest expected disparity value (it can be negative)

• MaxDisp: Highest expected disparity value

• TotalDisp = |MaxDisp − MinDisp|

• y: Vertical position of the current epipolar line. It corresponds to the vertical position of the pixel of interest in both images.


Figure 4: Dimensions of the input images and some relevant variables present in the C implementation

• x: Horizontal position of the pixel in the reference image (in this case the left image).

• x’: Horizontal position of the pixel in the right image.

Note that for each pixel in the left image we have to evaluate all pixels along the epipolar line within the disparity range in the right image. Considering this, x' can be expressed as x' = x − d with d ∈ [MinDisp, MaxDisp].

3.2 Census Transform

The Census Transform is a simple yet efficient way to encode the relative intensity of a pixel and its neighboring pixels. It takes the intensity of the pixel at a given position (x, y) and compares it with a defined amount of surrounding pixels within the region (x'', y'') ∈ [x ± BlockSize/2, y ± BlockSize/2], as shown in Figure 5 and on the left of Figure 6. The Census Transform provides a matching cost with high radiometric robustness while being easy to parallelize in hardware.

As shown on the left of Figure 6, the Census Transform compares the central pixel in the block against its neighbouring pixels and creates a bit string with the results of the comparisons. If the central pixel is greater than the compared pixel, the operation returns 1; otherwise it returns 0.


Figure 5: Census Transform operation over a stereo image pair.

Thus, the Census operation can be summarized as:

Census(x, y) = 1 if pixel(x, y) > pixel(x'', y''), and 0 otherwise    (1)

Figure 6: Census and Hamming Transforms over a block of pixels.

Note: To represent the resulting value of the Census Transform, the following bit depth is required:

(BlockSize × BlockSize) − 1    (2)
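To make the operation above concrete, here is a minimal sketch of the Census Transform of a single pixel, assuming an 8-bit grayscale image stored row-major and an odd Block_Size; the names are illustrative and do not come from the provided sources.

#include <cstdint>

// Census transform of the pixel at (x, y): compare the center against every
// neighbour in the Block_Size x Block_Size window and pack the results into
// a bit string of (Block_Size * Block_Size - 1) bits, as in Equation (1).
uint64_t census_at(const uint8_t *img, int width, int x, int y, int block_size)
{
    const int half = block_size / 2;
    const uint8_t center = img[y * width + x];
    uint64_t bits = 0;

    for (int yy = y - half; yy <= y + half; ++yy) {
        for (int xx = x - half; xx <= x + half; ++xx) {
            if (xx == x && yy == y)
                continue;                          // the center pixel is skipped
            bits <<= 1;                            // make room for the next bit
            if (center > img[yy * width + xx])     // Equation (1)
                bits |= 1;
        }
    }
    return bits;
}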


3.3 Hamming Distance

As you have seen on the left of Figure 6, the result of the Census Transform is a large binary number whose length grows quadratically with BlockSize. The Hamming Distance operation will encode this large amount of data to represent the Pixel Matching Cost in a more efficient way. The output of this stage in the SGBM pipe is a measure of the pixel intensity relative to its neighbouring pixels (Census Transform) encoded in a convenient way (Hamming Distance) for further processing. That further processing will not only consider information of pixels from a local neighbourhood but also along "search paths" across the whole image (the semi-global part of the algorithm).

For each pixel in the reference image (left), we have to compute the Census and Hamming transforms against all pixels across the whole disparity range in the right image, as shown in blue in Figures 5 and 4. There are sections of the reference image with no corresponding information in its counterpart, and vice versa. Figure 4 illustrates these regions in yellow. These yellow sections depend on the MaxDisp and MinDisp values and will reduce the size of the disparity estimation image at the end of the algorithm.

As illustrated in Figure 6, the Hamming distance calculation can be summarized as:

1. Bitwise XOR operation between the Census-transformed values of the pixels at positions (x, y) and (x', y').

2. Count number of resulting bits with a value of "1".

Its main advantage for our purposes is the bit-depth reduction it achieves in comparison with the Census Transform. After the Hamming Distance we can calculate the bit depth as:

⌈log2(BlockSize × BlockSize − 1)⌉    (3)
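A minimal sketch of the two steps above, assuming the Census bit strings fit into 64-bit integers; the GCC/Clang built-in is only one way to count the bits (the provided sources use a loop instead).

#include <cstdint>

// Hamming distance between two Census bit strings: XOR them, then count
// the set bits of the result.
static inline uint8_t hamming_distance(uint64_t census_left, uint64_t census_right)
{
    uint64_t diff = census_left ^ census_right;                // step 1: bitwise XOR
    return static_cast<uint8_t>(__builtin_popcountll(diff));   // step 2: count the 1s
}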

The output of this operation is called the Pixel Matching Cost C(p, d). As shown on the left of Figure 7, for each pixel we have an array of values whose depth corresponds to the TotalDisp value. So at this point we have a three-dimensional structure representing how well a given pixel from the left image matches a pixel from the right image along the epipolar line within a certain disparity range (we do not look through the whole width of the right image).

As we can see on the right of Figure 7, the C(p, d) function represents a local matching cost between pixels of the left and right images along the disparity range. In theory, we could look for the minimum of C(p, d), and its index would correspond to the best disparity match for the pixel p between both images. If we stop the algorithm at this point, we have implemented the Block Matching (BM) stereo algorithm. However, as it only takes into account information immediately around the pixel p, this algorithm is less robust than the SGBM approach; on the other hand, the SGBM is much more computationally expensive than the BM variant.


As we have 4 ARM cores in the Jetson Nano plus a GPU to use, we will try to implement the more robust SGBM variant and accelerate it as much as we can.

Figure 7: Pixel Matching Cost C(p, d) after Census + Hamming operations.

3.4 Cost Computation

Figure 8 illustrates the Lr operation in the Cost Computation phase. At this step, everything starts with the C(p, d) (output of the Hamming distance) at every pixel p. The Cost Computation steps can be summarized as:

1. For each pixel p, initialize a buffer of size TotalDisp for each direction r with the same content as the original C(p, d) information. In Figure 8, only four directions are represented; however, more are possible.

2. Starting from the top-left and moving sequentially to the right, compute the Lr(p, d) operation shown at the end of Figure 8 for each pixel p and for each direction r. Note that information from the pixel's own Matching Cost C(p, d) and additionally information from the neighboring Lr is required. This information flow is represented as a backwards-pointing arrow in the third stage of the pipe in Figure 3.

3. Save the resulting Lr(p, d) values at the corresponding section of the buffer and disparity as shown in Figure 8.

The Lr(p, d) function is the core of the SGBM. Here information from scanlines across the image is combined to improve the C(p, d) function we previously calculated with a local method (Census + Hamming). As shown in Figure 8, the Lr(p, d) operation and its main components can be analyzed as follows (the full recurrence is reproduced after the list):

1. The Pixel Matching Cost C(p, d) coming from the BM approach and used by SGBM as initialization.


Figure 8: Cost Computation Step of the SGBM Algorithm

2. A minimum function, highlighted in orange in Figure 8, which has the following terms:

(a) Lr(p − r, d): Neighboring pixel in the r direction and same disparity d.

(b) Lr(p − r, d − 1) + P1 and Lr(p − r, d + 1) + P1: Neighboring pixel in the r direction but with a disparity difference of 1 compared to the currently evaluated disparity d. A small penalty P1 is added to its value to favor small disparity changes.

(c) min Lr(p − r, D) + P2: This represents any other value over the disparity range of the neighboring pixel that is smaller than the previous two cases even after adding a large penalty P2.

(d) A scaling factor min Lr(p − r, D) that prevents an overflow due to the repetitive addition of positive terms over the scanlines. Note that this scaling factor is constant for all disparities and thus does not modify the shape of the function itself.
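Putting these components together, the recurrence depicted in Figure 8 corresponds to the standard SGM formulation from Hirschmüller's paper (reproduced here for reference):

Lr(p, d) = C(p, d) + min( Lr(p − r, d),
                          Lr(p − r, d − 1) + P1,
                          Lr(p − r, d + 1) + P1,
                          min_i Lr(p − r, i) + P2 )
                   − min_k Lr(p − r, k)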

The bit-depth upper limit of the cost computation operation is given as:

L ≤ Cmax(p, d) + P2 (4)

3.5 Cost Aggregation

After the expansion phase (Cost Computation), a reduction step follows in which the cost information from all directions of pixel p is added together into a new semi-global cost function S(p, d), as shown in Figure 9.



Figure 9: Cost Aggregation Step of the SGBM Algorithm

This operation can be expressed as the sum over the directions:

S(p, d) = Σ_{r=1}^{NumDir} Lr(p, d)    (5)

and its bit-depth upper limit is defined as:

S ≤ NumDir × (Cmax(p, d) + P2) (6)

3.6 Disparity Computation

The last step of the SGBM is the disparity computation. Note that the functions S(p, d) from Figure 10 and C(p, d) have the same structure, as they contain an equal number of elements (TotalDisp). However, S(p, d) contains information from the scan paths collected at the Cost Computation stage and added together at the Cost Aggregation phase. Thus, the S(p, d) data structure can be represented in a three-dimensional fashion as shown on the left of Figure 10, or, if we take the cost function S(p, d) of a single pixel p, it can be represented in a 2-dimensional space as shown on the right of Figure 10.

At this point, it is possible to estimate the disparity value d of pixel p at (x, y) by calculating the location of the global minimum of its Matching Cost function S(p, d) in the disparity range, as shown in Figure 10.
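A minimal sketch of this selection for one pixel, assuming its aggregated costs are stored contiguously; the names are illustrative and do not come from the provided sources.

#include <cstdint>
#include <limits>

// Pick the disparity index with the minimum aggregated cost S(p, d).
int best_disparity(const uint16_t *S_p, int total_disp)
{
    int best_d = 0;
    uint16_t best_cost = std::numeric_limits<uint16_t>::max();
    for (int d = 0; d < total_disp; ++d) {
        if (S_p[d] < best_cost) {      // global minimum over the disparity range
            best_cost = S_p[d];
            best_d = d;
        }
    }
    return best_d;                     // add MinDisp to obtain the actual disparity
}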


Figure 10: Disparity Computation Step of the SGBM Algorithm

Note:

The width of the resulting disparity image has now been reduced to:

width' = width − (BlockSize − 1) − TotalDisp    (7)

Analogously, the height has been reduced to:

height' = height − (BlockSize − 1)    (8)

As you can see, the resulting image is a dense disparity map, where the value of each pixel encodes the distance of that pixel to the cameras. This information is typically used by autonomous vehicles to avoid possible obstacles on the street and prevent accidents.

We hope you find the SGBM algorithm as exciting as we do :)

3.7 Getting the Sources and Running the Baseline Implementation

As we have seen in the last section, the SGBM algorithm is a fairly complex method to estimate depth from stereo images. Luckily, we have a functional implementation to use as our baseline. Our task is to deeply understand the baseline implementation in order to optimize it.


You can download the baseline implementation from this link.

It contains 2 folders:

1. SGBM: A QtCreator project that should run out-of-the-box on your Jetson Nano. You just have to configure the project with your kit that contains Qt and OpenCV.

2. Test_data: It contains 3 sets of images:

• Cars_GT, Cars_L and Cars_R: This will be our dataset to test all optimizations (taken from the KITTI Dataset). It contains left and right images plus a ground truth to compare our result against. The default SGBM parameters set in the GUI (Block Size, Directions, etc.) should work with this dataset.

• Prof_L and Prof_R: Because the Nano has some difficulties profiling the baseline implementation with large images, we will use this particularly small dataset for the profiling. While using it, set the Max Disp to a value not greater than 32.

• Toys: A famous image pair and its ground truth taken from the Middlebury Dataset. It will not be used formally during the project and is there just in case you need one more example to debug your implementation. Consider that if you want to compare against the ground truth you have to use a Max Disp of around 64, comment out line 34 of the sgbm.cpp file and uncomment line 35 of the same file.

Note: The provided SGBM implementation is partially copyrighted and only released for educational purposes. Please do not upload it to a public repository and do not distribute it to third parties, to avoid license violations.

4 Optimizing the "Expensive" SGBM Depth-Estimation Algorithm

In this section we will optimize the hotspots detected through profiling using multiple techniques, among them multi-threading, SIMD instructions of the NEON unit, and CUDA optimization. In summary, our tasks are:

1. Determine hotspots to be optimized (profiling).

2. Take advantage of the deep knowledge we have about the algorithm to change the way it is implemented. For instance, we can change the data types of the variables and the memory allocations of the current implementation to allocate just the necessary amount of memory per variable. To do so we can use the equations presented in the last section (code optimization).


3. Parallelize computation bottlenecks (OpenMP).

4. Take advantage of the hardware units of the processor we are using to perform SIMD operations (NEON Intrinsics).

5. Take advantage of other powerful computation units of the SoC to take over some critical stages of the computation pipe (CUDA optimization).

Before going further, please make sure you understand the above-described algorithm well enough to be able to optimize it in these ways.

4.1 Profiling the SGBM Algorithm to Detect "Hotspots"

The first step towards our optimization goal is to determine what exactly we should optimize to get the best results while investing the least possible amount of effort. To do so, we are going to profile our application, find its hotspots and tackle them with our optimization methods.

You already have experience profiling algorithms with perf; however, you should reconsider whether perf is the best option for a GUI application like the one we are going to optimize.

In this case, we are going to use the Valgrind profiler, which is integrated into QtCreator. But before we can use it we need to install it on our Jetson Nano. Please consider the following high-level hints:

1. Install Valgrind from the console by typing:

$ sudo apt-get install valgrind

2. Once installed, please investigate how to use the integrated profiling capabilities of QtCreator + Valgrind and how to represent and interpret the data you get to determine costly functions.

3. Determine the 3 most expensive functions that belong to the SGBM algorithm. To do so please use the provided Prof_L.png and Prof_R.png input images and do not select any Ground Truth image, as we are only interested in the computation of the SGBM and not in its accuracy evaluation.

4. Note: As the Jetson Nano has not been designed to fully support the requirements of a GUI development environment, you could have some problems profiling your application. If your Nano cannot profile the GUI application, you have at least 2 options:


(a) Drop the GUI. To do that you need to create a C++ project without GUI and substitute the file mainwindow.cpp with your own main.cpp, where you execute all functions called in on_B_Run_Clicked() without the need for the GUI elements and window. You can pass the images as arguments to your command-line program or hard-code them for this profiling step. Then you can use Valgrind or perf to profile your C++ program.

(b) Alternatively, you can install QtCreator in your Ubuntu Virtual Machine and profile your application there, at the cost of potentially inaccurate results, as we would be profiling on a completely different architecture.

(c) In either scenario, please write a discussion in your report to support your decision.

4.2 Reducing Memory Footprint of Baseline Implementation

The current baseline implementation disregards the theoretical maximum data size at every stage of the algorithm. It uses the int data type to declare most variables and perform memory allocations (with the new operator). This leads to inefficient memory use, as we allocate more memory than we will ever need. As you might already be thinking, this is not acceptable for an optimized implementation aimed to run on an embedded device. We have to correct this issue. To do so, consider the following points:

1. Luckily, the whole implementation is already done with integers; otherwise the first step would have been to port the implementation from floats to ints.

2. Take a look at Equations 2, 3, 4 and 6 and determine the maximum bit depth for the following parameters:

(a) BlockSize: 5 pixels

(b) P2: 128

(c) NumDir: 4

(d) Cmax(p, d): Results from Equation 3

3. Ask yourselves the question: do we really need to use signed integers, or is it enough to use unsigned data types?

4. Modify the relevant variables, memory allocations (for instance in the ComputeAlgo function) and all affected function parameters so that no more memory is allocated than required by Equations 2, 3, 4 and 6 (a small allocation sketch follows this list).

5. As those variables are used in multiple functions, proceed systematically, modifying them one by one, and always check against the Ground Truth that the estimation error does not increase. Debug the problem if the error increases.



6. Document your calculations and new data types in your report.

7. Can you check the memory footprint of the application before and after the reduction?

8. Do you see any improvement in the computation time already?
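As an illustration of point 4, a hedged sketch of what such a change can look like; the function and variable names are hypothetical and do not come from the provided sources.

#include <cstdint>
#include <cstddef>

// Illustrative only: if a cost buffer is known to hold values that fit the
// bit depth given by Equation (3), a narrow unsigned type is enough.
uint8_t *allocate_cost_volume(std::size_t num_pixels, std::size_t total_disp)
{
    // was: new int[num_pixels * total_disp]   -> 4 bytes per entry
    return new uint8_t[num_pixels * total_disp]; // 1 byte per entry suffices here
}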

Great! So far we have identified costly functions and used our knowledge of the algorithm to reduce its memory usage.

Please create a new git branch in your repository called mem_opt and commit the memory-optimized implementation to it. This will be the new baseline implementation for further optimization.

4.3 Multi-threading Technique to Accelerate Expensive Loops

The first approach we will adopt for the optimization is to parallelize some very expensive loops in the SGBM algorithm. To do so we will use the OpenMP API. Theoretically we could use POSIX Threads (Pthreads) as well; however, it would require more work for the benefit of more flexibility (which we hopefully don't require now).

You might ask yourself: is it possible to manually parallelize such a complex algorithm as SGBM? Yes it is; however, we can find a better cost-benefit trade-off by using high-level tools such as OpenMP to offload much of the manual work to the parallelization library. To do so, our job consists of:

1. Understand the baseline implementation to gain an intuition of the places with the highest optimization potential and correlate this understanding with the hotspots you found through profiling. Please discuss this in your report.

2. Analyze the critical functions you just found to understand their variables and operations, so that we can pass accurate information over to OpenMP and it can carry out a reasonable optimization for us. To do so we need to understand which variables are to be considered private and which shared in those expensive loops, and let OpenMP know about that (please research how to do that). This is one of our design decisions.

3. The next optimization decision is to determine which loops have strong data dependencies that could make any parallelization hard or impossible. You can consider as well whether you can optimize those tricky loops with the help of #pragma omp critical. Justify (very well) in your report if you decide not to optimize one of the most expensive functions.


Note: You can review some useful OpenMP examples in this repository.
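To give a flavour of the kind of directive you will be writing, here is a minimal OpenMP sketch over a hypothetical cost buffer; the function and variable names are illustrative and do not come from sgbm.cpp.

#include <omp.h>
#include <cstdint>
#include <vector>

// Illustrative sketch: each row of a hypothetical cost buffer can be
// processed independently, so the outer loop is parallelized. The loop
// counters are private to each thread; the buffer itself is shared.
void process_rows(std::vector<uint8_t> &cost, int rows, int cols)
{
    omp_set_num_threads(4);                      // e.g. one thread per A57 core
    #pragma omp parallel for shared(cost)
    for (int row = 0; row < rows; ++row) {
        for (int col = 0; col < cols; ++col) {
            cost[row * cols + col] += 1;         // placeholder per-pixel work
        }
    }
}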

Considering the points above, please perform the next tasks and document your reasoning, results and analysis in your report:

1. Create a new git branch called multi_threading for this section. It should be derived from the mem_opt branch.

2. Make sure you understand all steps of the SGBM algorithm from a functional and logical perspective.

3. Make sure your understanding of the algorithm matches the implementation present in the sgbm.cpp file. Put special effort into understanding all functions with nested loops (optimization potential!!!) such as cost_computation, compute_hamming, cost_aggregation, etc.

4. Optimize with OpenMP at least the three most expensive functions. To determine which those functions are, please refer to the profiling information you gathered in Section 4.1.

5. The cost_computation function should be among the most expensive functions. If not, please optimize the three most expensive from your analysis and additionally the cost_computation function. We will use the cost_computation function as an example to review the steps you have to follow with each function:

(a) As you can see in the cost_computation function, there are four nested loops. This alone is an indication of an expensive computation.

(b) Please measure the execution time of the function before the optimization. You can do that by taking time measurements before and after the cost_computation function call in the compute_SGM function (a minimal timing sketch is given after this list). Please create a new GUI label and output the execution time in the Qt graphical user interface.

(c) Now it is time to optimize! Analyze the data dependencies across the loops and determine at which level you can start parallelizing them.

(d) Make sure the original average pixel error has not changed.

(e) Measure the execution time of the function with different numbers of threads ranging from 1 to 8. To do so, use omp_set_num_threads(n);. Select the best one and justify your selection with an analysis in your report.

(f) Report the execution time of the function before and after the optimization and analyze your results.

6. Please perform the above steps for the next 2 most expensive functions that can be parallelized (based on your profiling analysis).


7. Analyze your results and discuss your design decisions. Always check against the ground truth and make sure the error does not increase!

8. Summarize your experiments and results in a table with all optimized functions and make a reference to this table from the "conclusions" chapter of your report.
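For the time measurements mentioned in step 5(b), a minimal std::chrono sketch around a placeholder call; the function shown is not the actual signature from the sources, and the printout can of course be replaced by a Qt label.

#include <chrono>
#include <iostream>

void cost_computation_placeholder() { /* the function being timed */ }

int main()
{
    auto start = std::chrono::steady_clock::now();
    cost_computation_placeholder();              // e.g. the call inside compute_SGM
    auto stop = std::chrono::steady_clock::now();

    auto ms = std::chrono::duration_cast<std::chrono::milliseconds>(stop - start).count();
    std::cout << "cost_computation took " << ms << " ms" << std::endl;
    return 0;
}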

Note:

• To be able to use the OpenMP API make sure you have "#include <omp.h>".

• Additionally, link against the OpenMP library. To do so, add "-fopenmp" to the "QMAKE_CXXFLAGS_DEBUG" flags in your QTWidgetOpenCV.pro.

• Before running the application make sure you select the "Debug" build under Build → Open Build and Run Kit Selector.

• You can perform a simple test to make sure all CPU cores of your Jetson Nano are working in parallel by using the command-line "htop" application, as shown in Figure 11. There you can see how in the single-core implementation only one core is being used at its top capacity, whereas in the multi-core implementation all cores work together.

(a) Single-core execution. (b) Multi-core execution.

Figure 11: Htop command to check CPU usage.

4.4 Acceleration through NEON Intrinsics

Every core of the quad-core ARM A57 processor has a NEON unit as part of its pipe, do you remember? That means that the multi-threading optimization we just performed and the NEON optimization are not mutually exclusive; we can keep the parallelization done in the last step and additionally take advantage of the Single Instruction, Multiple Data capabilities of the NEON unit on each core! Isn't it amazing?


In this section we will try to keep our parallelization in place and speed up our SGBM a bit more using SIMD instructions.

To do so, please consider the following steps:

1. Create a new git branch called neon_intrinsics for this section. It should be derived from the multi_threading branch.

2. Identify 3 functions that you can potentially optimize with the SIMD intrinsics of the NEON unit. To do so consider:

(a) Research the SIMD intrinsics available and how to use them; you can check the Neon Intrinsics Reference and the Optimizing C Code with Neon Intrinsics Guide.

(b) Use information from the profiling analysis you already performed, your deep knowledge of the algorithm implementation and the information you gathered from the last point to determine 3 critical functions to be optimized in the current implementation.

(c) Justify in your report why you chose those functions.

3. Optimize your 3 selected functions using NEON Intrinsics. For instance, consider whether it is a good idea to accelerate the function find_minLri using one of the vmin intrinsics, or whether it is more efficient to count the bits with the vcnt intrinsic in the compute_hamming_distance function instead of keeping the present for loop (a small vcnt sketch follows this list). You could consider as well optimizing the cost_aggregation function with an intrinsic instruction for vector summation over all directions at once! Just let your creative optimization skills fly :)

4. Summarize your experiments and results in a table with all optimized functions and make a reference to this table from the "conclusions" chapter of your report.
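As an example of the vcnt idea mentioned in point 3, a minimal NEON sketch that counts the set bits of a 64-bit XOR result; the names are hypothetical and do not come from the provided sources.

#include <arm_neon.h>
#include <cstdint>

// Population count of the XOR of two Census bit strings using vcnt,
// instead of a bit-by-bit for loop.
static inline uint32_t hamming_neon(uint64_t census_left, uint64_t census_right)
{
    uint8x8_t bytes  = vcreate_u8(census_left ^ census_right); // reinterpret as 8 bytes
    uint8x8_t counts = vcnt_u8(bytes);                         // per-byte popcount
    uint16x4_t s16   = vpaddl_u8(counts);                      // pairwise sums ...
    uint32x2_t s32   = vpaddl_u16(s16);
    uint64x1_t s64   = vpaddl_u32(s32);                        // ... down to one value
    return static_cast<uint32_t>(vget_lane_u64(s64, 0));
}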

4.5 NEON Intrinsics vs Compiler Auto-Vectorization

In the last section we accelerated 3 critical functions with NEON SIMD intrinsics. Now we are going to compare how well we did against the compiler's auto-vectorization. Please consider the following steps to try this out:

1. Create a new git branch called auto_vectorization for this section. The base of this branch should be the multi_threading branch and not the neon_intrinsics branch. This means you should keep the parallelization but not the NEON intrinsics.

2. Modify the code of your 3 critical functions to allow compiler auto-vectorization. You can get some insights on how to do that here.


3. Investigate which compiler flags you have to set so that the compiler tries to perform auto-vectorization of our C/C++ code (a minimal flags sketch is given after this list). You can find some resources on how to do that here.

4. If necessary, disassemble your code to make sure the auto-vectorization is working.

5. Measure the average time each of your three critical functions takes to execute with and without auto-vectorization enabled. Create a table to summarize your results.

6. Analyze your results and compare them with your NEON Intrinsics acceleration.
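As a starting point, a hedged sketch of GCC options that are commonly used to enable and inspect auto-vectorization, added to the project file in the same way as the OpenMP flag; treat the exact flag set as an assumption to verify for your GCC version.

# Illustrative qmake snippet (.pro file): enable auto-vectorization at -O3 and
# ask GCC to report which loops were, and were not, vectorized.
QMAKE_CXXFLAGS_RELEASE += -O3 -ftree-vectorize -fopt-info-vec -fopt-info-vec-missed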

4.6 CUDA Optimization

As reviewed in Section 2, we have a (so far) idle 128-core NVIDIA GPU at our disposal! What if we use it now?

In the next points you will find a high-level description of what you need to do:

1. Investigate how to accelerate a C/C++ program using the Compute Unified Device Architecture (CUDA) API. You might want to run a "Hello world" example with QtCreator to make sure your environment is fully functional (a minimal kernel sketch follows this list).

2. Derive from your research which functions you can best optimize using the 128 parallel cores and the CUDA API. Find expensive functions in the SGBM that are potential candidates to be optimized with CUDA.

3. Create a new branch called cuda. Based on your investigation, you decide whether it is more efficient to combine the multi-threading and/or NEON optimizations with the CUDA approach (neon_intrinsics or multi_threading branches), or whether you expect better results using the non-optimized version as baseline (mem_opt branch).

4. Optimize 3 suitable (and expensive) functions with CUDA. You can use nvprof to profile the CUDA functions you just programmed.

5. Use the "tegrastats" tool to measure the GPU usage after you have ported some expensive function to CUDA. Try to offload as much computation as possible from the CPU to the GPU and demonstrate that your code is actually running on the GPU.

6. Measure the time taken by the functions before and after the optimization.

7. Summarize your experiments and results in a table with all optimized functions and make a reference to this table from the "conclusions" chapter of your report.
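To give an idea of the programming model, a minimal CUDA sketch of a per-element kernel and its launch; the kernel does placeholder work and none of the names come from the provided sources.

#include <cuda_runtime.h>
#include <cstdint>
#include <cstdio>

// Illustrative kernel: one thread per element of a cost buffer.
__global__ void add_penalty(uint8_t *cost, int n, uint8_t p1)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        cost[i] += p1;                        // placeholder per-element work
}

int main()
{
    const int n = 1 << 20;
    uint8_t *d_cost;
    cudaMalloc(&d_cost, n);                   // buffer in GPU memory
    cudaMemset(d_cost, 0, n);

    int threads = 256;
    int blocks  = (n + threads - 1) / threads;
    add_penalty<<<blocks, threads>>>(d_cost, n, 7);
    cudaDeviceSynchronize();                  // wait for the kernel to finish

    cudaFree(d_cost);
    printf("kernel finished\n");
    return 0;
}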


4.7 Evaluating Real-Time Constraints

Congratulations! At this point you have succeeded at optimizing the SGBM algorithm using different methods! Let's now evaluate whether we are ready to face a hard real-time scenario.

What do we need to do?

1. Create a new branch called real_time.

2. Combine the optimization methods from the previous sections in the best way possible, that is, in the way that gives you the shortest computation time for your compute_SGM function. The design decision of which of those methods to use together and how to combine them is yours to make.

3. Measure the total elapsed time of your compute_SGM function before and after the optimization.

4. Discuss in your report:

• How many frames per second (FPS) can your implementation reach? (See the note after this list.)

• By what percentage could you optimize the implementation?

• Does the accelerated version meet real-time requirements for an autonomous vehicle? What would be your next design ideas to further optimize the code?
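As a quick reference for the FPS discussion: if compute_SGM needs t seconds per stereo pair, the achievable frame rate is simply

FPS = 1 / t

so, for example, a frame time of 200 ms corresponds to 5 FPS.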

5 Deep Learning Algorithm Optimization

Now that we are experts at low-level optimization of computationally expensive C/C++ autonomous-driving algorithms, we are going to try out a very trendy approach that tackles computer vision problems in a different way: Machine Learning and, in particular, Convolutional Neural Networks (CNNs).

You have a lot of freedom in this section. You can decide which problem you want to solve within the multiple topics relevant to autonomous driving, you can select which network architecture and implementation you are going to use, and lastly you decide which optimization technique you want to implement.

Please consider the following sections as a guide to get your deep learning algorithm running and to optimize it.


5.1 Selection of the Convolutional Network Architecture

The first decision we have to make is: which problem do you want to solve within the autonomous driving scope?

Even though the decision is yours, it must consider the following:

1. It must be related to autonomous driving in some way, for example:

• Monocular depth estimation (see this as an example)

• Monocular optical flow estimation

• Object detection

• Object segmentation

• You can find lots of implementations in the Jetson Zoo.

2. You must use the Raspberry Pi Camera Module v2 (provided with the kit).

3. As training a CNN model might require lots of resources (RAM, GPU and CPU cores), and we are not sure that everyone has such a computer, you can use a pre-trained model and only implement the inference part.

4. Training your model yourself will grant extra points.

Please document your selection in the report and measure the time it takes to compute a frame from your camera.

5.2 Post-Training Optimization of CNN

As many of the low-level operations such as convolutions are already pretty well optimized for CUDA and ARM microprocessors, we are not going to manually modify the code this time. Instead, we are going to try acceleration techniques that are specifically designed for running CNNs in resource-constrained embedded systems. Please carry out research on this topic and include in your report:

1. What are the CNN post-training optimization methods? Do they have an impact on the accuracy of the implementation?

2. Which post-training optimization methods are implemented in the framework (TensorFlow, PyTorch, etc.) you are using?

3. Select at least one of those acceleration methods to be applied to the model you selected. Within this selection you usually have:

• Weight quantization


• Model pruning

• etc...

4. Apply the selected post-training acceleration method and demonstrate that the computation time has been reduced.

5. Do you see any change in the output accuracy? How much?

Demonstrate with a video that you implemented the CNN model on the Jetson Nano, and show the optimization gain and accuracy loss.

6 Deliverables

6.1 Folder Structure of your Deliverables

Regarding deliverables documentation and folder structure, you MUST follow the folder organization in your user_id.tar.gz (e.g. group1_HPEC_2020) and Git repository as depicted below.

For the Git repository, consider each main level as a branch of your repository, while for the .zip file consider each main level as a folder.

master
|--- Report
|     |--- Report.pdf
|--- Video
|     |--- Video_Link.pdf
|--- Sources
|     |--- Original sources (Profiling Section)
mem_opt
|--- Sources
|     |--- Memory-optimized sources
multi_threading
|--- Sources
|     |--- OpenMP-parallelized sources
neon_intrinsics
|--- Sources
|     |--- NEON-optimized sources
auto_vectorization
|--- Sources
|     |--- Auto-vectorized sources
cuda
|--- Sources
|     |--- CUDA-optimized sources
real_time
|--- Sources
|     |--- Sources of your great implementation
deep_learning
|--- Sources
|     |--- Sources and compilation instructions of your deep-learning implementation

6.2 Grading

The grading of this project will be based on Table 3.

Table 3: Project’s evaluation criteria

Item                Description                                                        Percentage
Report,             Profiling analysis                                                 5 %
Video and           Reduction of memory footprint and analysis                         15 %
Repository          OpenMP parallelization and analysis                                15 %
                    NEON intrinsics optimization and analysis                          15 %
                    Compiler auto-vectorization and comparison to NEON intrinsics      5 %
                    CUDA optimization and analysis                                     15 %
                    Combination of optimizations and real-time evaluation              5 %
                    Deep Learning algorithm implementation                             10 %
                    Deep Learning algorithm training                                   extra 10 %
                    Deep Learning algorithm optimization and analysis                  15 %

Total                                                                                  110 %

6.3 Deliverables Submission

To submit your deliverables you have to:

1. Follow the folder structure established in Section 6.1.

2. Compress your deliverable folder in zip format and look in TEC-Digital for the delivery link for Project 3.

3. Invite the professors to have access to your private repository.

4. The due date is August 29th, 2020 at 23:45. Afterwards, the access link in TEC-Digital will be closed.

References

[1] NVIDIA. Jetson Nano Developer Kit, 2020.
