Real-time 3D Tracking With GPUs | GTC 2013
Technology for a better society 1
Session S3227
Where's Waldo?
Real-time 3D Tracking Using GPUs
Dr. André R. Brodtkorb, Research Scientist
SINTEF ICT, Department of Applied Mathematics
• Established 1950 by the Norwegian Institute of Technology.
• The largest independent research organisation in Scandinavia.
• A non-profit organisation.
• Motto: “Technology for a better society”.
• Key Figures*
• 2123 Employees from 67 different countries.
• 2755 million NOK in turnover
(about 340 million EUR / 440 million USD).
• 7216 projects for 2200 customers.
• Offices in Norway, USA, Brazil, Macedonia,
United Arab Emirates, and Denmark.
About SINTEF
* Data from SINTEF’s 2009 annual report
[Map CC-BY-SA 3.0 based on work by Hayden120 and NuclearVacuum, Wikipedia]
• Motivation & Introduction
• Our Work
• Efficient Video Decoding for CUDA
• Single-camera image processing
• Multi-camera image processing
• Utilizing multiple GPUs
• Summary
Outline
• ADABTS (project number 218 197)
• 7th framework programme, Security
• Coordinated by FOI (Sweden)
• Total project cost 4.5 million EUR / 5.8 million USD
• 7 European project partners
• This work is part of Work Package 6: Real Time Platform and System Integration
Motivation: The ADABTS Project
ADABTS Work Package 6 in a nutshell
"To develop a new hardware and software platform for advanced
real-time video analysis and detection using heterogeneous computing.
Exploit the possibilities that commercially available low cost heterogeneous
hardware architectures (multi-core CPUs in combination with GPUs) represent."
• GPUs have a 7-10x performance advantage over CPUs
in floating-point throughput and memory bandwidth
• GPUs are naturally suited for image
processing
• NVIDIA GPUs support hardware-accelerated
video decoding for CUDA
Motivation for using GPUs
• Hardware platform: A SuperMicro server with four NVIDIA GTX 580 GPUs
• Software platform: Linux / Windows, CUDA, NVCUVID, C++, a lot of threading and other snacks.
Simplified Real Time Platform Sketch
[Diagram: IP cameras → H.264 (HTTP) → "Desktop Supercomputer" → Results (HTTP) → Operator and further processing]
Simplified Real Time Platform Sketch
[Diagram: IP cameras → H.264 (HTTP) → "Desktop Supercomputer" → Results (HTTP) → Operator and further processing]
[Pipeline inside the "Desktop Supercomputer": HTTP read → Decode → Single camera image processing → Multi camera image processing → HTTP send]
Reading, decoding, and sending data
• Decode burden grows as the IP camera
resolution grows…
• Current industry standard codecs are JPEG, H.264, and MPEG-4 Part 2
• NVIDIA GPUs support H.264 and MPEG-4 Part 2 decoding in GPU hardware! [2]
IP Cameras
"The majority of IP cameras offered are now megapixel (54.1%). This is somewhat amazing as megapixel was a distinct minority just two years ago." [1]
"Over 60% of megapixel cameras support H.264 while only about 20% support MPEG-4." [1]
[1] IP Camera Statistics 2011, John Honovich, 2010, http://ipvm.com/report/ip_camera_statistics
[2] NVIDIA PureVideo, feature set D
• NVCUVID - CUDA Video Decoder
• Released publicly in 2008
• Linux support after two years
• Enables GPU-accelerated decode
• Virtually zero CPU use
• H.264, MPEG-2, (MPEG-4 Part 2?)
• Decodes a frame into CUDA memory
• Transfers compressed video data directly to the GPU
• Far less PCI-e bandwidth used
GPU Decoding of Video
[Diagram: compressed H.264 data is passed from system memory (CPU) to the GPU, where it is decoded into a frame in CUDA memory]
• Video processor operates
independently of other GPU
engines [1]
• Decode comes for free!
• Presumably uses the
same hardware as PureVideo
• When something looks too good to be true, it usually is:
• Writing a decoder is a lot of work!
GPU Concurrency Revisited
[Diagram: independent GPU engines: VideoProcessor (VP), CUDA cores, DMA Engine 1, DMA Engine 2]
[1] E. Young and F. Jargstorff, Image processing & video algorithms with CUDA, Nvision 2008
• The SDK example decodes a single movie from file
• Our needs are a bit more complex:
• Decode multiple movies simultaneously
• Read from network
• Use multiple GPUs
• …
Our Starting Point: The SDK Example
• NVCUVID is sparsely documented
• Find random forum posts online
• Speak with the NVIDIA engineers
• Read the header files carefully
• SDK example is "pedagogically suboptimal"
• Major challenge to decipher
• Difficult to grasp data flow
• A lot of hidden threading
• Ended up creating a UML diagram
of what was going on
Deciphering the SDK Example
• Four main threads:
• ByteStream reads data
• Decoder pushes data to NVCUVID
• CUVideoparser performs black magic
• CUVideodecoder decodes video bitstream
• Extremely easy to create a decoder:
1. Create a ByteStream
2. Give the ByteStream to a Decoder
3. Call getNextFrameAsync()
or getNextFrame()
• All threading is now hidden!
Our Decoder Structure
ByteStream
• Reads data over HTTP
• Splits data into NALUs
Decoder
• Reads data from bytestream
• Writes data to NVCUVID
• Keeps track of frames buffered by NVCUVID
CUVideoparser
• Part of NVCUVID
• Black magic
CUVideodecoder
• Part of NVCUVID
• Decodes video bitstream
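The idea of hiding all threading behind a blocking call can be sketched as follows. This is a minimal Python illustration, not the actual implementation: the class names mirror the slide, but the constructor arguments, the `get_next_frame` method, and the fake "decode" step are assumptions made for the example. The real ByteStream reads over HTTP, and the real Decoder hands data to NVCUVID.

```python
import queue
import threading


class ByteStream:
    """Hypothetical stand-in: yields pre-split chunks instead of reading HTTP."""

    def __init__(self, chunks):
        self._chunks = list(chunks)

    def __iter__(self):
        return iter(self._chunks)


class Decoder:
    """Hypothetical stand-in: a hidden worker thread turns chunks into 'frames'."""

    _END = object()  # sentinel marking the end of the stream

    def __init__(self, stream, max_buffered=4):
        self._frames = queue.Queue(maxsize=max_buffered)
        self._worker = threading.Thread(
            target=self._run, args=(stream,), daemon=True)
        self._worker.start()

    def _run(self, stream):
        for chunk in stream:
            # A real decoder would push the chunk to the hardware decoder here.
            self._frames.put(("frame", chunk))
        self._frames.put(self._END)

    def get_next_frame(self):
        """Block until a frame is ready; return None at end of stream."""
        frame = self._frames.get()
        if frame is self._END:
            self._frames.put(self._END)  # later calls also see the end marker
            return None
        return frame


# The caller never touches a thread or a queue.
decoder = Decoder(ByteStream([b"nalu0", b"nalu1"]))
first = decoder.get_next_frame()
second = decoder.get_next_frame()
done = decoder.get_next_frame()
```

The bounded queue plays the role of "keeps track of frames buffered": the worker stalls when the consumer falls behind, instead of buffering without limit.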
• If you feed the decoder arbitrary chunks of H.264 data, it crashes
• Must be fed one NALU at a time or it will get angry!
• NVCUVID-memory can be "special"
• cudaMemcpy used to crash when trying to copy from a decoded frame (works today)
• Access to the CUDA context is not thread safe!
• You must use cuvidCtxLock / cuvidCtxUnlock for *each and every* CUDA call (extremely easy to forget, and it is hard to keep new developers from making this mistake)
• We created a CudaContext class that pushes / pops and locks / unlocks the CUDA context
• JPEG works well!
• But apparently uses only the CPU (no surprise there, since PureVideo does not support JPEG)
• Other formats also exist in the header file…
Lessons Learned When Working With NVCUVID
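Feeding the decoder one NALU at a time means the incoming bytes must first be cut at NALU boundaries. A minimal Python sketch, assuming the cameras deliver an H.264 Annex B byte stream in which NAL units are separated by the start codes 0x000001 or 0x00000001 (the slides do not say which framing the cameras use, and `split_nalus` is a name made up for this example):

```python
def split_nalus(data: bytes) -> list:
    """Split an Annex B byte stream into NAL units (start codes removed)."""
    nalus = []
    start = None  # index of the first payload byte of the current NALU
    i = 0
    n = len(data)
    while i + 3 <= n:
        if data[i:i + 3] == b"\x00\x00\x01":
            if start is not None:
                # Trailing zero bytes belong to the next start code or to
                # padding, never to the NAL unit itself.
                nalus.append(data[start:i].rstrip(b"\x00"))
            start = i + 3
            i += 3
        else:
            i += 1
    if start is not None:
        nalus.append(data[start:].rstrip(b"\x00"))
    return nalus


stream = (b"\x00\x00\x00\x01\x67\x42"
          b"\x00\x00\x01\x68\xce"
          b"\x00\x00\x00\x01\x65\x88\x84")
nalus = split_nalus(stream)
```

A network reader would additionally have to carry an incomplete trailing NALU over to the next received chunk; that bookkeeping is left out here.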
• Our decoder performance matches that of
the SDK example
• Decoder speed varies with GPU and encoder options
• We get roughly 200 FPS for one camera
• When we use multiple decoders, the performance scales linearly: the total decode throughput is divided evenly among the cameras!
• Two cameras give ~100 FPS per camera
• This means that one GPU should handle roughly 10 cameras at 20 FPS decode each!
Performance
[Figure: Performance per camera; x-axis: number of cameras (1 to 8), y-axis: performance (0 to 1.2)]
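The capacity estimate above is plain division of the total decode rate by the rate each camera needs. A tiny Python sketch, using the slide's approximate figures (the function name is made up for this example):

```python
def cameras_per_gpu(total_decode_fps: float, camera_fps: float) -> int:
    """Number of camera streams one GPU can decode at the given frame rate."""
    return int(total_decode_fps // camera_fps)


# Roughly 200 FPS of total decode throughput, cameras running at 20 FPS.
capacity = cameras_per_gpu(200, 20)
```

This only budgets for decoding; the image processing stages later in the pipeline lower the number of cameras a GPU can serve.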
• Decoder parameters tweaked for
throughput, at the expense of latency
• Varies with GPU, but on the order of
one second for powerful gamer cards
• Not an issue with our usage scenario
Latency
[Figure: Frame number versus frame latency; x-axis: frame number (1 to 251), y-axis: seconds (logarithmic, 0.001 to 1)]
Image processing for one camera
• Low level algorithms are at the heart of high-level logic
• Break a complex task up into less complex tasks
• For a single camera, we have a set of low-level tasks
• Image segmentation (foreground / background)
• Optical flow
• Face/pedestrian detection (boosting)
• Modular system
• We can exchange algorithms, and add or remove them
• Most algorithms can run simultaneously
Single camera processing
[Diagram: Input frame → Foreground segmentation, Optical flow, Face detection → Detection results]
• Foreground segmentation relies on a good
description of background
• Background is essentially an empty scene
• Foreground is everything that deviates from
background
Segmentation
• There are many ways of describing the background
• We use intensity and edges (HOG)
• Anything that deviates from the background model is
considered foreground
• Computing and updating the background model is
embarrassingly parallel!
Describing the background
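To illustrate why the background model is embarrassingly parallel, here is a minimal Python sketch in which every pixel is updated and tested independently. It is a deliberately simplified stand-in: it models intensity only, with a running average and a fixed threshold, whereas the model on the slide also uses edges (HOG). The `alpha` and `threshold` values are arbitrary assumptions.

```python
def update_background(background, frame, alpha=0.05):
    """Blend the new frame into the per-pixel running average."""
    return [[(1.0 - alpha) * b + alpha * f for b, f in zip(brow, frow)]
            for brow, frow in zip(background, frame)]


def segment_foreground(background, frame, threshold=25.0):
    """Mark pixels whose intensity deviates from the background model."""
    return [[abs(f - b) > threshold for b, f in zip(brow, frow)]
            for brow, frow in zip(background, frame)]


background = [[100.0, 100.0], [100.0, 100.0]]
frame = [[102.0, 180.0], [99.0, 100.0]]
mask = segment_foreground(background, frame)
background = update_background(background, frame)
```

Each output element depends only on the same pixel in the model and in the frame, so on a GPU one thread per pixel needs no communication with its neighbors.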
• There is noise in the input video (especially at
H.264 I-frames)
• We need to allow some deviations from
background
• It is notoriously difficult to handle
• Shadows
• Rapidly changing lighting conditions
(clouds / headlights / …)
• Reflections
• Our implementation addresses video noise,
varying lighting conditions and shadows.
Detections
• Foreground segmentation is difficult!
• Foreground objects that look like background
• Light changes (shadows, reflections, etc.) are difficult to get right
Example Detections
Good Bad Ugly
• Optical flow is the calculation of movement in a video stream
• Where did this pixel come from in the last frame?
• Computationally demanding algorithm
• Highly susceptible to image / compression noise
• Multiple ways of computing it
• Brute force search
• Polynomial expansion
• …
Optical flow
• Based on a local search for each pixel
• Take a neighborhood in the previous frame
• Compute sum of absolute differences for a
variety of locations in the new frame
• Choose the minimum
• Embarrassingly parallel algorithm
• Quite expensive to compute per pixel in terms of
memory bandwidth!
• Only computed for segmented foreground,
which makes it perform well
Brute Force Search
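The local search can be sketched in a few lines of Python. This is an illustration under assumptions, not the GPU kernel: images are nested lists of intensities, the patch radius and search range are arbitrary, and the pixel at `(y, x)` must lie far enough from the border for its own patch to fit.

```python
def sad(prev, curr, py, px, cy, cx, radius):
    """Sum of absolute differences between two (2*radius+1)^2 patches."""
    total = 0
    for dy in range(-radius, radius + 1):
        for dx in range(-radius, radius + 1):
            total += abs(prev[py + dy][px + dx] - curr[cy + dy][cx + dx])
    return total


def brute_force_flow(prev, curr, y, x, radius=1, search=2):
    """Return (dy, dx) from the patch around (y, x) in prev to its best match in curr."""
    height, width = len(prev), len(prev[0])
    best, best_cost = (0, 0), None
    for dy in range(-search, search + 1):
        for dx in range(-search, search + 1):
            cy, cx = y + dy, x + dx
            # Skip candidates whose patch would fall outside the image.
            if not (radius <= cy < height - radius
                    and radius <= cx < width - radius):
                continue
            cost = sad(prev, curr, y, x, cy, cx, radius)
            if best_cost is None or cost < best_cost:
                best, best_cost = (dy, dx), cost
    return best


# A single bright pixel moves one row down and two columns right.
prev = [[0] * 7 for _ in range(7)]
curr = [[0] * 7 for _ in range(7)]
prev[2][2] = 9
curr[3][4] = 9
flow = brute_force_flow(prev, curr, 2, 2)
```

Every candidate location reads a full patch from memory, which is where the bandwidth cost comes from, and restricting the search to foreground pixels removes most of that work.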
• Results are quite reasonable
• Algorithm is sensitive to choice of search
directions and size of patches to compare
Optical Flow Results
• Approximates the local neighborhood with a second-order polynomial.
• Based on the implementation in OpenCV, and computed for all pixels
• Gives high quality results, but is expensive
Farnebäck Optical Flow
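For reference, the core of the polynomial expansion idea can be written down compactly. The notation below follows Farnebäck's published method and is not taken from the slides: each neighborhood is approximated by a quadratic, and if the second frame is an ideal translation of the first by a displacement $\mathbf{d}$, equating coefficients gives the displacement directly.

$$
f_1(\mathbf{x}) \approx \mathbf{x}^T A_1 \mathbf{x} + \mathbf{b}_1^T \mathbf{x} + c_1,
\qquad
f_2(\mathbf{x}) = f_1(\mathbf{x} - \mathbf{d})
\;\Rightarrow\;
\mathbf{b}_2 = \mathbf{b}_1 - 2 A_1 \mathbf{d},
\qquad
\mathbf{d} = -\tfrac{1}{2} A_1^{-1} (\mathbf{b}_2 - \mathbf{b}_1)
$$

In practice the coefficients are noisy, so the estimate is averaged over a neighborhood rather than solved per pixel, which is part of why the method is expensive.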
• Face detection by combining many weak classifiers
into one strong classifier (variation of AdaBoost)
• A few "easy" steps to perform
1. Generate a mipmap pyramid for resolution-
independent face detection
2. Classify every window location for multiple weak
classifiers, exit early on non-faces
3. (Summarize hits (positive face classifications)
over scale and space)
WaldBoost Face detection
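The early exit in step 2 is what keeps the classifier cheap: most windows are rejected after a few weak classifiers. A minimal Python sketch of that control flow only, with made-up weak classifiers and thresholds. A real WaldBoost detector learns both from training data and evaluates image features over every window of the pyramid.

```python
def classify_window(features, weak_classifiers, reject_thresholds,
                    accept_threshold=0.0):
    """Evaluate weak classifiers in order; stop once the window is clearly a non-face.

    Returns (is_face, number_of_classifiers_evaluated).
    """
    score = 0.0
    stages = zip(weak_classifiers, reject_thresholds)
    for evaluated, (weak, reject_below) in enumerate(stages, start=1):
        score += weak(features)
        if score < reject_below:
            return False, evaluated  # early exit: most windows leave here
    return score >= accept_threshold, len(weak_classifiers)


# Hypothetical weak classifiers: each votes +1 or -1 on a single feature.
weak_classifiers = [
    lambda f: 1.0 if f["contrast"] > 0.5 else -1.0,
    lambda f: 1.0 if f["symmetry"] > 0.5 else -1.0,
    lambda f: 1.0 if f["edges"] > 0.5 else -1.0,
]
reject_thresholds = [-0.5, -0.5, -0.5]

non_face = classify_window(
    {"contrast": 0.1, "symmetry": 0.9, "edges": 0.9},
    weak_classifiers, reject_thresholds)
face = classify_window(
    {"contrast": 0.9, "symmetry": 0.9, "edges": 0.9},
    weak_classifiers, reject_thresholds)
```

On a GPU the early exit makes neighboring threads finish at different times, so the amount of work per window is uneven even though the windows are independent.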
• Based on the implementation by Michael Hruby [1]
• Ported from OpenCL to CUDA
• A lot of the porting work was done by a set of
#defines and constants.
• In-kernel syntax mostly identical!
• Worst part of the porting work was figuring out the
data flow.
• Very easy to integrate into our framework
Porting from OpenCL to CUDA
[1] Michael Hruby, WaldBoost on OpenCL
• We have benchmarked over multiple GPU generations
• Growing performance with each new generation
• Non-optimal performance for the GTX 680, for unknown reasons
Single Camera Performance
[Figure: Performance versus GPU generation; x-axis: 9800GX2, GTX 540M, GTX 285, GTX 480, GTX 580, GTX 680; y-axis: normalized performance (0 to 1)]
Multi-camera image processing
Reception
[Camera views: four cameras at 1280x960 H.264 @ 20 FPS and one camera at 720x576 JPEG @ 20 FPS]
• Multi-camera image processing places high
demands on camera synchronization
• If a camera is off by a few frames, results
deteriorate rapidly
• IP cameras are naturally out of sync
• Their internal clock is unreliable
• A 25 FPS camera means something like 25
FPS on a sunny day
• Mixing camera makes and frame-rates gives
extra challenges
Camera Synchronization
Torkel A. Haufmann, Poster P0168 / CO 09 in Computer Vision
• GPU-implementation of automatic
synchronization [1]
• Given two cameras, generate a set of planes
• Each plane must pass through both
cameras and a third point in 3D space
• Each plane looks like a line in both
camera images
• Record changes along these lines, and
synchronize based on that
Synchronizing Two Cameras
[1] Pundik and Moses, Video synchronization using temporal signals from epipolar lines, 2010.
• Changes along the epipolar lines are recorded in a
2D array for each camera
• Try matching these two arrays to find the camera
drift
• For more cameras, split up into pairs of cameras
and synchronize each pair
Synchronizing Two Cameras
[Figure: two panels, Cam 0 and Cam 1; x-axis: plane number, y-axis: frame number]
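Matching the two arrays can be sketched as an exhaustive search over candidate frame offsets. This Python example is an assumption-laden simplification, not the method of the cited paper: it scores each offset by the mean absolute difference over the overlapping frames, and the sign convention (a positive result means camera B lags camera A) is chosen for the example.

```python
def estimate_drift(signal_a, signal_b, max_offset):
    """Return the frame offset k that best aligns signal_a[t] with signal_b[t + k].

    Each signal is a list of rows: one row per frame, one value per plane.
    """
    best_offset, best_cost = 0, None
    for offset in range(-max_offset, max_offset + 1):
        total, count = 0.0, 0
        for t, row_a in enumerate(signal_a):
            u = t + offset
            if 0 <= u < len(signal_b):
                total += sum(abs(a - b) for a, b in zip(row_a, signal_b[u]))
                count += 1
        if count == 0:
            continue
        cost = total / count  # mean difference over the overlapping frames
        if best_cost is None or cost < best_cost:
            best_offset, best_cost = offset, cost
    return best_offset


# Camera B records the same changes as camera A, two frames later.
signal_a = [[(2 * t) % 5, (3 * t) % 4] for t in range(10)]
signal_b = [[0, 0], [0, 0]] + signal_a[:8]
drift = estimate_drift(signal_a, signal_b, max_offset=3)
```

Each candidate offset is scored independently of the others, so the search parallelizes over offsets as well as over frames.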
• Synchronized cameras can be used for voxel carving
• Voxel carving gives us the volumetric extent of an object in
3D by combining a view of the object from several angles
• Algorithm is computationally demanding
• Example grid can be 16 million or more voxels
• The CPU is way too slow to handle this
• The basic idea is to create a voxel grid in 3D space, and project
the foreground segmentation into this grid
• This projective texturing is a well known technique from
computer graphics, used e.g. in shadow mapping.
• Perfectly suited for the GPU!
Voxel Carving
Convex hull
Carved voxel volume
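The projection step can be sketched in Python as follows. Everything here is an assumption for illustration: cameras are given as 3x4 projection matrices, foreground masks are nested lists with values between 0 and 1, and the nearest pixel is sampled instead of using a filtered texture lookup.

```python
def project(camera, point):
    """Project a 3D point with a 3x4 camera matrix; return pixel (row, col) or None."""
    x, y, z = point
    u, v, w = (m[0] * x + m[1] * y + m[2] * z + m[3] for m in camera)
    if w <= 0:
        return None  # behind the camera
    return int(round(v / w)), int(round(u / w))


def carve(voxel_centers, cameras, masks):
    """Average, per voxel, the foreground value each camera sees at its projection."""
    result = []
    for point in voxel_centers:
        total = 0.0
        for camera, mask in zip(cameras, masks):
            pixel = project(camera, point)
            if pixel is not None:
                row, col = pixel
                if 0 <= row < len(mask) and 0 <= col < len(mask[0]):
                    total += mask[row][col]
        result.append(total / len(cameras))
    return result


# Two toy cameras: one looks along the z axis, the other along the x axis.
cam_z = [[1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 0, 1]]
cam_x = [[0, 0, 1, 0], [0, 1, 0, 0], [0, 0, 0, 1]]
mask = [[0, 0, 0], [0, 1, 0], [0, 0, 0]]  # foreground at the image center
values = carve([(1, 1, 1), (0, 1, 1), (0, 0, 0)], [cam_z, cam_x], [mask, mask])
```

A voxel seen as foreground by every camera gets the value 1, and a voxel that any camera sees as background gets a lower value, which is how the background views carve the volume away.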
• We compute each output element independently
• This makes our algorithm output sensitive:
easy to adjust performance by varying voxel
grid size
• We use texture lookups for simplicity
• Caveat: Textures, constant memory etc. might
be dangerous to use in a multi-threaded
setting…
Voxel Carving
Cam 0 Cam 1
Pseudocode
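parallel for each voxel {
    float avg = 0.0f;
    for each camera {
        // project the voxel into this camera and sample its foreground mask
        avg += getForeground(camera, voxel);
    }
    avg /= num_cameras;
    // each voxel writes only its own output element
    output[voxel] = avg;
}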
• By thresholding our voxel carving, we
get an iso-surface for foreground
objects
• Results depend on segmentation
quality
• Results depend on occlusions
Voxel Carving Results
• Voxel carving is practically free for small voxel
grid sizes
• Camera decode is the bottleneck
• For larger domain sizes, performance
drops more slowly than expected
• Indication that we are not fully
saturating the GPU, even for
1024x1024x64
Voxel Carving Results
[Figure: Performance versus voxel grid size; x-axis: voxel grid size n x n x 64 (n = 32 to 1024), y-axis: frames per second (0 to 60)]
• If we sum our voxel grid along the height dimension, we get
the density of foreground for each ground plane location.
• We can easily download this 2D map to the CPU and track
blobs
• Naïve and simple algorithm:
1. Find blobs with high density
2. Try matching with blobs from previous frame
• Easily extendable to create better tracks:
• Face detection results, optical flow, image features, etc.
• Discrete optimization (max flow / k shortest paths)
Probability Map Tracker
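The naive tracker can be sketched in Python as three small steps: sum along height, find blobs, and match them to the previous frame. The data layout (`voxels[z][y][x]`), the 4-connected flood fill, the density threshold, and the greedy nearest-neighbor matching are all assumptions made for this example.

```python
def ground_density(voxels):
    """Sum a voxel grid indexed [z][y][x] along z (height)."""
    depth, rows, cols = len(voxels), len(voxels[0]), len(voxels[0][0])
    return [[sum(voxels[z][y][x] for z in range(depth)) for x in range(cols)]
            for y in range(rows)]


def find_blobs(density, threshold):
    """Return centroids (y, x) of 4-connected regions with density >= threshold."""
    rows, cols = len(density), len(density[0])
    seen = [[False] * cols for _ in range(rows)]
    centroids = []
    for y0 in range(rows):
        for x0 in range(cols):
            if seen[y0][x0] or density[y0][x0] < threshold:
                continue
            stack, cells = [(y0, x0)], []
            seen[y0][x0] = True
            while stack:
                y, x = stack.pop()
                cells.append((y, x))
                for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                    if (0 <= ny < rows and 0 <= nx < cols
                            and not seen[ny][nx]
                            and density[ny][nx] >= threshold):
                        seen[ny][nx] = True
                        stack.append((ny, nx))
            centroids.append((sum(c[0] for c in cells) / len(cells),
                              sum(c[1] for c in cells) / len(cells)))
    return centroids


def match_blobs(previous, current, max_distance):
    """Greedily pair each current blob with the nearest unused previous blob."""
    matches, used = {}, set()
    for ci, (cy, cx) in enumerate(current):
        best, best_d2 = None, max_distance ** 2
        for pi, (py, px) in enumerate(previous):
            d2 = (cy - py) ** 2 + (cx - px) ** 2
            if pi not in used and d2 <= best_d2:
                best, best_d2 = pi, d2
        if best is not None:
            matches[ci] = best
            used.add(best)
    return matches


layer = [[1.0, 1.0, 0.0, 0.0], [0.0, 0.0, 0.0, 0.0], [0.0, 0.0, 0.0, 1.0]]
density = ground_density([layer, layer])
blobs = find_blobs(density, threshold=1.5)
matches = match_blobs([(2.0, 2.0), (0.0, 0.0)], blobs, max_distance=2.0)
```

Greedy matching is order dependent and can pair blobs badly when tracks cross, which is one reason the slide lists discrete optimization as an extension.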
• The voxel carver produces too much data.
• Output is e.g., 512x512x64 (16 M) voxels
• We want to send this over HTTP for further processing
• Most of the data is not important
• Areas with a low average detection (background)
• Areas with uniform high average detection (e.g., the inside of the carved hull)
• We can compress the data by using standard stream compaction
• Pick out the 1-voxel wide shell of each object
• Reduces data dramatically!
Compression
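Picking out the shell can be sketched in Python as below. The definition of "shell" used here (an occupied voxel with at least one empty 6-neighbor, with voxels outside the grid counted as empty) and the boolean `occupancy[z][y][x]` layout are assumptions for the example. On the GPU, stream compaction would typically compute the output positions with a prefix sum; the sequential `append` plays that role here.

```python
def extract_shell(occupancy):
    """Return coordinates (z, y, x) of occupied voxels that touch empty space."""
    depth, rows, cols = len(occupancy), len(occupancy[0]), len(occupancy[0][0])

    def filled(z, y, x):
        return (0 <= z < depth and 0 <= y < rows and 0 <= x < cols
                and occupancy[z][y][x])

    shell = []
    for z in range(depth):
        for y in range(rows):
            for x in range(cols):
                if not occupancy[z][y][x]:
                    continue
                neighbors = ((z - 1, y, x), (z + 1, y, x), (z, y - 1, x),
                             (z, y + 1, x), (z, y, x - 1), (z, y, x + 1))
                if not all(filled(*n) for n in neighbors):
                    shell.append((z, y, x))  # compaction: keep only the index
    return shell


# A solid 3x3x3 cube inside a 5x5x5 grid.
occupancy = [[[1 <= z <= 3 and 1 <= y <= 3 and 1 <= x <= 3
               for x in range(5)] for y in range(5)] for z in range(5)]
shell = extract_shell(occupancy)
```

For this toy cube the saving is small, but for a large solid object the shell grows with the surface area while the interior grows with the volume.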
Multi-GPU processing
• Different ways of utilizing multiple GPUs
• Task-parallel pipelining
• Data-parallel between GPUs
• Task-parallel pipelining is terrible
• Ruins all bandwidth savings
• Data-parallel is perfect!
• Our aim is many cameras
• Create a CPU thread for each GPU,
and we're in business
Multi-GPU strategies
[Diagram: pipeline stages Decode, Segmentation, Optical flow, Face detect, Multi Camera]
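The data-parallel strategy, one CPU thread per GPU with each thread running the whole pipeline for its own cameras, can be sketched in Python as follows. The round-robin assignment and the `process_camera(gpu, camera)` callback are assumptions for this example; in the real system each thread would bind to its GPU and run decode and image processing there.

```python
import threading


def assign_cameras(camera_ids, num_gpus):
    """Distribute cameras round-robin so that each GPU gets its own subset."""
    return [camera_ids[g::num_gpus] for g in range(num_gpus)]


def run_data_parallel(camera_ids, num_gpus, process_camera):
    """Start one CPU thread per GPU; each thread handles only its own cameras."""
    results = {}
    lock = threading.Lock()

    def worker(gpu, cameras):
        for camera in cameras:
            output = process_camera(gpu, camera)  # decode + all processing
            with lock:
                results[camera] = output

    threads = [threading.Thread(target=worker, args=(gpu, cameras))
               for gpu, cameras in enumerate(assign_cameras(camera_ids, num_gpus))]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return results


assignment = assign_cameras(list(range(8)), 4)
results = run_data_parallel(list(range(8)), 4, lambda gpu, camera: gpu)
```

Because a camera's data never leaves the GPU that decoded it, the bandwidth savings from decoding on the GPU are kept, which is what task-parallel pipelining between GPUs would give up.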
• Perfect weak scaling
• One GPU supports four to five cameras
with all processing
• Four GPUs support 16-20 cameras
with all processing
Multi-GPU results
[Figure: Performance versus number of GPUs; x-axis: number of GPUs (1 to 4), y-axis: normalized performance (0 to 4)]
• We have presented
• Efficient Video Decoding for CUDA
• Single and multi-camera image processing
• Utilizing multiple GPUs
• GPUs are superbly suited for these tasks
• Papers to be published
Summary
Contact:
André R. Brodtkorb, SINTEF ICT
Email: [email protected]
Webpage: http://babrodtk.at.ifi.uio.no/
Youtube: http://youtube.com/babrodtk/
Thank you for your attention
SINTEF ICT
Department of Applied Mathematics
http://www.sintef.no/math
SINTEF ICT
Department of Optical Measurement
Systems and Data Analysis
http://www.sintef.no/math
Project participants:
Asbjørn Berge*, André Brodtkorb,
Torkel A. Haufmann, Jens Olav Nygaard,
Anna Kim, Kristin Kaspersen,
Jon Hjelmervik
* Project leader