Transcript of REAL-TIME PERSON TRACKING IN HIGH-RESOLUTION …
PRODUCTION
1 JOANNEUM RESEARCH, DIGITAL – Institute for Information and
Communication Technologies, Graz, Austria 2 Silesian University of
Technology, PhD Faculty of Data Mining, Gliwice, Poland
Abstract
For enabling immersive user experiences for interactive TV services
and automating camera view selection and framing, knowledge of the
location of persons in a scene is essential. We describe an
architecture for detecting and tracking persons in high-resolution
panoramic video streams, obtained from the OmniCam, a panoramic
camera stitching video streams from 6 HD resolution tiles. We use a
CUDA accelerated feature point tracker, a blob detector and a CUDA
HOG person detector, which are used for region tracking in each of
the tiles before fusing the results for the entire panorama. In
this paper we focus on the application of the HOG person detector
in real-time and the speedup of the feature point tracker by
porting it to NVIDIA’s Fermi architecture. Evaluations indicate
significant speedup for our feature point tracker implementation,
enabling the entire process in a real-time system.
Keywords: person tracking, CUDA, Fermi, Virtual Director
1 Introduction
The broadcast industry is constantly challenged to improve the
efficiency of the production process on the one hand, and to
provide more engaging user experiences on the other hand. One
approach towards this goal is the format agnostic approach [16],
which proposes a paradigm shift towards capturing a format agnostic
representation of the whole scene from a given viewpoint, rather
than the view selected by a cameraman based on assumptions about
the viewer’s screen size and interests. The FascinatE project1 is
working towards implementing such a format agnostic production
workflow for live events.
The FascinatE system uses the concept of a layered scene
representation, where several cameras with different spatial
resolutions and fields-of-view can be used to represent the view of
the scene from a given viewpoint. The views from these cameras can
be considered as providing a base layer panoramic image (obtained
in the FascinatE system by the OmniCam panoramic camera, depicted
in Figure 1), with enhancement layers from one or more cameras more
tightly-framed on key areas of interest. The same concept is used
for audio by capturing an ambient sound field together with a
number of directional sound sources.
1 http://www.fascinate-project.com/
The system thus allows end-users to interactively view and navigate
around an ultra-high resolution video panorama showing a live
event. Different scenarios are targeted with genres ranging from
sports events (e.g., soccer, track & field) to music
concerts.
The system includes components called Scripting Engines which take
the decisions about what is visible and audible at each playout
device and prepare the audiovisual content streams for display.
Such components are commonly referred to as a Virtual Director. In
order to take reasonable decisions, the engine needs knowledge
about what is currently happening in the scene and which camera
streams are capturing that action. Because manual input is usually
both too expensive and slow in real-time setups, automatic
audiovisual (AV) content analysis is used.
For personalizing interactive TV services, detection and tracking
of persons is a common requirement. A special aspect of this work
is that objects have to be tracked across the multiple tiles
constituting the high-resolution panoramic image. As the system is
used for real-time broadcast, most technical components are
required to operate in real-time with low delay.
The degree of viewer interaction of course depends on the type of
the user’s terminal device. Whereas in a cinema-like setup all
viewers get to see the same content, viewers with an individual
mobile display may have several options to personalize content
selection on a fine-grained level. In a sports scenario, for
example, this could be the option to follow a certain athlete in
close-up most of the time unless an important action takes place
elsewhere in the scene. For such a feature, automatic detection and
tracking of objects is necessary.
In this paper we describe the subsystem for detecting and tracking
persons in the omnidirectional panoramic scene in real-time. The
main contribution is a distributed architecture for performing
detection and tracking on different tiles of the OmniCam image and
merging the results. The person detection and tracking algorithms
are implemented on the GPU. In particular, we describe a highly
efficient tracker implementation using NVIDIA’s Fermi architecture.
The aim is to obtain the player positions and tracks in real-time
as an input to automatic scripting and framing.
Figure 1: The OmniCam [16].
The rest of this document is structured as follows. Section 2
discusses related technical work regarding real-time tracking in
multi-camera and panoramic setups. Next, Section 3 gives a brief
overview of the FascinatE system, the specifics of the input video
streams and the scripting component. The core of our contributions
is covered by
Sections 4 and 5 which highlight aspects of the feature point
tracker and describe our implementation for person tracking over
OmniCam tiles, including evaluation results. Finally, conclusions
and future work are discussed in Section 6. Examples and
evaluations throughout the paper are based on test content from a
soccer match.
2 Related work
Tracking persons or objects across different cameras is a common
problem, and a range of approaches has been proposed in the
literature.
The authors of [9] propose an approach for soccer player detection
and tracking in a multi-camera environment using joint probability
models. The model consists of a color appearance model, a Kalman
filter based 2D motion model and a 3D motion model using the
assumption that players move on the ground plane. An approach for
tracking players in a multi-camera environment is presented in [5].
The system uses projections from the different views to the ground
plane to obtain an occupancy map. In [4], an adaptive learning
method for tracking objects across different cameras is presented.
The approach uses a probabilistic formulation and models the
transition probabilities for different zones in the scene as well
as the appearance (as a brightness transfer function). It is shown
that this approach is also robust against sudden lighting changes.
The authors of [2] use multiple spatio-temporal features to
associate trajectories in different views, and fuse the results of
association in the ground and image plane.
An approach for tracking humans across multiple uncalibrated
cameras is presented in [10]. The field of view lines representing
the area seen from other cameras are estimated and used to predict
the re-appearing of humans in other cameras. The approach has been
extended in [3], using edges of fields of view to obtain consistent
labelling of person tracks across views. In [19] the automatic
estimation of homographies between cameras based on corresponding
person detections in different cameras has been proposed.
A smaller number of papers also considers omnidirectional and
panoramic image sequences. A system for tracking in panoramic image
sequences has been proposed in [8]. However, as this system uses a
panoramic angular lens, the problem is quite different to that
encountered in the FascinatE system. Tracking of faces in a
high-resolution panoramic image sequence of a meeting room is
proposed in [14]. The system uses a part based face detector and a
-shape detector and tracks the identified regions using a mean
shift tracker. In [11], the application of an active shape model to
a panoramic image obtained from multiple sensors (both CCD and
infrared cameras) is presented. Detection and tracking of players
in a network of fixed and omnidirectional cameras is presented in
[1]. The problem is formulated as the inverse problem of estimating
the number of players and their positions on the ground plane from
a set of binary player silhouettes in each of the views.
There are a couple of fast GPU algorithms for KLT feature point
tracking and person detection with the histogram of oriented
gradients (HOG) detector. For feature point tracking, the works
[18, 22, 13] describe GPU implementations which use the Cg2 shader
language. This has the disadvantage that the implementation has to
be adapted to fit the computer-graphics oriented render pipeline,
which typically leads to significant performance penalties. In [7]
a CUDA implementation is proposed which tracks 5,000 to 10,000
points in real-time for high definition (HD) material. In this
work, we enhance this implementation in order to take full
advantage of the most recent NVIDIA GPU architecture, Fermi. In
[21, 15] fast GPU algorithms for the HOG detector are described.
Near-real-time performance (approx. 80 milliseconds) is reported in
[15] for image sequences of resolution 640 × 480.
3 The FascinatE system and workflow
The FascinatE system aims to create an innovative end-to-end system
for immersive and interactive TV services. It allows users to
navigate in an ultra-high resolution video panorama, showing live
or recorded content, with matching accompanying audio. The output
is adapted to the viewing device, covering anything from a mobile
handset to an immersive panoramic display with surround sound,
delivering a personalized multi-screen experience.
FascinatE uses a panoramic camera with high enough resolution for
cropping interesting regions, the OmniCam [16], depicted in
Figure 1. It is a collection of 6 HD cameras sharing a single
optical centre for obtaining a 180 degree panoramic video sequence
stitched together from the 6 tiles. The vertical field of view is
60 degrees. The HD cameras are placed on their side and point
upwards to a reflecting mirror to maximize the resolution such that
when the video sequences are stitched together, the resolution of
the final panorama is usually 7000 × 1920 pixels. This resolution
allows capturing even distant objects in good quality so that e.g.
persons on the other end of a sports field can be detected
automatically. The minimum distance of objects to the camera
depends on the accuracy of the camera mounting and is roughly 2
meters. Besides the OmniCam, a range of broadcast cameras such as
the ALEXA3 are used. So far, person tracking has only been applied
to the panoramic image.
Figure 2: Detail of the FascinatE system architecture.
2 http://developer.nvidia.com/cg-toolkit
The Production Scripting Engine (PSE) is responsible for decision
making on content selection. The key feature is to automatically
select a suitable area within the OmniCam panorama image, in
addition to cuts between different broadcast cameras. Selection
behavior is based on pragmatic (cover most interesting actions) and
cinematographic (ensure basic aesthetic principles) rules,
comparable to the approach described by Falelakis et al. [6]. In
some cases, this is not fully automatic but involves a
human-in-the-loop, a production team member deciding between
prepared options. The PSE is a distributed component with at least
one instance at the production site and one at the terminal end,
which is also reflected in the system diagram in Figure 2. The
output of the PSE is called a script, which consists of a
combination of content selection options and decisions, renderer
instructions, user interface options and so on. Scripts are passed
to
subsequent PSE components from the production site towards the
terminal, where final instructions are given to a device-specific
renderer.
The PSE works closely together with another type of Scripting
Engine, the Delivery Scripting Engine (DSE). The DSE takes
instructions from the PSE to prepare content streams and makes sure
needed content is available for the renderers, optimizing bandwidth
management. Further, the PSE is tightly integrated with a user
interface for the professional production team, the EditorUI tools.
They allow the team to create live annotations for concepts not
covered by AV content analysis. Further, they could serve as a tool
for validating and correcting the results of content analysis.
However, the latter feature has not yet been implemented. In
summary, the EditorUI and the AV content analysis are the main
metadata sources for decision making within the PSE.
3 http://www.arridigital.com/alexa
Figure 3: Runtime comparison of the tracking step of the KLT
algorithm of the standard and the Fermi-optimized GPU
implementation, for 10,000 tracked points.
4 Feature point tracker
This section gives a brief description of the GPU-accelerated
feature point tracker we use for person tracking. Furthermore, it
describes the porting of the GPU algorithm to the most recent
NVIDIA GPU architecture Fermi in order to take full advantage of
the capabilities of recent NVIDIA GPUs. For feature point tracking,
we use the classical KLT algorithm proposed by Kanade, Lucas and
Tomasi [12, 20, 17]. It is widely used for the detection and
tracking of salient points in image sequences, as it provides
competitive quality as a basic building block of a variety of
computer vision tasks (e.g. structure from motion, object tracking)
and has reasonable computational complexity. In [7] a very
efficient GPU implementation of the KLT algorithm was described,
which is able to track several thousand points in real-time in full
HD material. Here, we describe the modifications we employed in
order to implement it efficiently for GPUs based on the Fermi
architecture and evaluate their effect on the runtime of the
algorithm.
Figure 4: Overall runtime of the feature point tracker for full HD
material for different numbers of feature points, for the standard
and the Fermi-optimized GPU implementation.
4.1 Algorithm
The KLT algorithm operates as follows: For two consecutive frames
$I$ and $J$ of an image sequence, the first step is to detect
points in image $I$ with sufficient texture (feature point
selection). As cornerness measure for a pixel $p$, the smaller
eigenvalue of the structure matrix $G$ is used, where
$$G = \sum_{x \in W(p)} \nabla I(x) \, \nabla I(x)^T$$
and $W(p)$ is a rectangular region centered around $p$. The $n$
pixels with the highest cornerness value are then added to the
already existing feature points. Prior to adding, a minimum
distance constraint is enforced in order to avoid clumping of all
feature points in a small region of the image. The second step is
to determine for each feature point $p$ its translational movement
$v$ from image $I$ to $J$ (feature point tracking). The best
translation $v_{opt}$ can be calculated as the minimum of the SSD4
dissimilarity function
$$\epsilon(v) = \sum_{x \in W(p)} \left( J(x + v) - I(x) \right)^2 .$$
For the optimization, an iterative gradient descent method of
Gauss-Newton type is used. Furthermore, the optimization is done in
a multi-resolution way using an image pyramid in order to avoid
local minima. For a more detailed description of the KLT algorithm
the reader is referred to [12, 20, 17, 7].
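The two steps above can be illustrated with a minimal CPU sketch in Python. It is not the GPU implementation of [7]: the function names, the nested-list image layout, the central-difference gradient and the use of the gradient of I on both sides of the update are assumptions made for illustration. Only a single Gauss-Newton step starting from v = 0 is shown, without the image pyramid or sub-pixel interpolation.

```python
import math

def gradients(img, x, y):
    """Central-difference gradient of a nested-list image at pixel (x, y)."""
    gx = (img[y][x + 1] - img[y][x - 1]) / 2.0
    gy = (img[y + 1][x] - img[y - 1][x]) / 2.0
    return gx, gy

def structure_matrix(img, px, py, half=1):
    """Entries (gxx, gxy, gyy) of G, summed over the window W(p)."""
    gxx = gxy = gyy = 0.0
    for y in range(py - half, py + half + 1):
        for x in range(px - half, px + half + 1):
            gx, gy = gradients(img, x, y)
            gxx += gx * gx
            gxy += gx * gy
            gyy += gy * gy
    return gxx, gxy, gyy

def cornerness(img, px, py, half=1):
    """Smaller eigenvalue of the symmetric 2x2 structure matrix G."""
    gxx, gxy, gyy = structure_matrix(img, px, py, half)
    mean = (gxx + gyy) / 2.0
    root = math.sqrt(((gxx - gyy) / 2.0) ** 2 + gxy * gxy)
    return mean - root

def gauss_newton_step(img_i, img_j, px, py, half=1):
    """One update v = G^-1 b from v = 0, with b = sum (I - J) * grad I."""
    gxx, gxy, gyy = structure_matrix(img_i, px, py, half)
    bx = by = 0.0
    for y in range(py - half, py + half + 1):
        for x in range(px - half, px + half + 1):
            gx, gy = gradients(img_i, x, y)
            diff = img_i[y][x] - img_j[y][x]
            bx += diff * gx
            by += diff * gy
    det = gxx * gyy - gxy * gxy
    if abs(det) < 1e-12:
        raise ValueError("structure matrix is singular (no texture)")
    # Closed-form inverse of the symmetric 2x2 matrix G applied to b.
    vx = (gyy * bx - gxy * by) / det
    vy = (gxx * by - gxy * bx) / det
    return vx, vy
```

A point with a large `cornerness` value has a well-conditioned G, which is exactly what makes the linear system in `gauss_newton_step` solvable. In a full tracker the step would be repeated until the update falls below a convergence threshold, and the whole procedure would be run from the coarsest to the finest pyramid level.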
4.2 Porting steps
Although the speedup of the GPU implementation described in [7] of
approximately an order of magnitude with respect to a
multi-threaded CPU implementation was sufficient to enable
real-time tracking of thousands of points in HD material, it does
not use the features of the most recent NVIDIA GPU architecture,
Fermi, as it was not available at that time. In the following we
describe the most important novelties of this architecture and
report our work on porting the GPU implementation described in [7]
to the Fermi architecture. Note that a basic knowledge of GPU
programming and the CUDA programming environment is assumed. A good
source for information is the CUDA programming guide5.
4 sum of squared differences
4.2.1 Fermi architecture
The Fermi architecture6 was introduced by NVIDIA in 2010 with the
GeForce 400 series and brings a couple of major changes compared to
previous GPU architectures. Most importantly, the memory hierarchy
has been significantly revamped. The amount of shared memory (a
very fast user-managed on-chip cache) has tripled to 48 KB per
multiprocessor, and a GPU-managed on-chip L1 cache of 16 KB has
been added. It is possible to switch this configuration to 16 KB of
shared memory and 48 KB of L1 cache, which helps for algorithms
with non-deterministic memory accesses, e.g., traversing a binary
tree. Furthermore, there are a couple of other major changes,
including significantly faster atomic operations and operations
involving the double data type, support for ECC memory, and a
uniform memory space.
4.2.2 Modifications
In order to access the global memory in the most efficient
(coalesced) way, in previous GPU architectures groups of 16
consecutive threads (half-warps) had to access data from the same
memory segment. In the Fermi architecture, this restriction (for
optimal performance) applies to groups of 32 consecutive threads
(warps). In practice, this means that we need to adjust the thread
blocks to have a width of a multiple of 32. By doing this, we
ensure that there will be no shared memory bank conflicts.
Considering the ability of Fermi architecture GPUs to fetch smaller
memory segments, it is possible to reduce the number of elements
that are fetched without being used.
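The block-width adjustment described above amounts to simple integer arithmetic. The following Python sketch shows it under stated assumptions: the helper names, the example block size of 20 × 4 threads and the one-thread-per-pixel mapping are hypothetical and are not taken from the implementation in [7].

```python
WARP_SIZE = 32  # threads per warp on Fermi-class GPUs

def round_up_to_warp(width, warp_size=WARP_SIZE):
    """Smallest multiple of warp_size that is >= width."""
    return ((width + warp_size - 1) // warp_size) * warp_size

def launch_config(image_width, image_height, block_width=20, block_height=4):
    """Return (block_dim, grid_dim) with a warp-aligned block width.

    Assumes one thread per pixel; the grid is rounded up so that it
    covers the whole image.
    """
    bw = round_up_to_warp(block_width)
    grid_x = (image_width + bw - 1) // bw
    grid_y = (image_height + block_height - 1) // block_height
    return (bw, block_height), (grid_x, grid_y)
```

For a full HD image, `launch_config(1920, 1080)` widens the hypothetical 20-thread rows to 32 threads, so that each row of a block corresponds to exactly one warp.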
In our standard implementation the CUDA kernel for the tracking
step itself was limited by shared memory size. Thus only very small
thread blocks could be used, reducing the number of blocks that
could be processed concurrently on a multiprocessor. The only
alternative was to store data in uncached local memory, which
proved to be much less efficient on previous GPU architectures.
Fermi’s cached local memory allows us to reconsider this
alternative, and consequently the optimized implementation takes
advantage of Fermi’s L1 cache. Thus we can have more threads per
thread block and increase the number of concurrently executed
blocks. Furthermore, there is no shared memory restriction anymore
as we now use solely local memory. It is important to configure the
CUDA kernel to prefer L1 cache over shared memory, so that it has
the maximum amount of L1 cache (48 KB) available. During our
experiments we observed that for the CUDA tracking kernel, several
thread block size configurations result in similar runtime. By
choosing the one with the smallest thread block size, we gain much
better scalability for future GPU generations, which are likely to
employ even more multiprocessors.
5 http://developer.download.nvidia.com/compute/cuda/3_0/toolkit/docs/NVIDIA_CUDA_ProgrammingGuide.pdf
6 http://www.nvidia.com/content/PDF/fermi_white_papers/NVIDIA_Fermi_Compute_Architecture_Whitepaper.pdf
4.2.3 Evaluation
All runtime tests were done on a Windows XP 64 system with a 3.0
GHz Quad-Core Intel Xeon CPU and an NVIDIA GeForce GTX 480 in a
PCIe 2.0 x16 slot. The runtime is per frame, for images of full HD
(1920 × 1080) resolution. Regarding the parametrization, 5 pyramid
levels and a window size of 7 × 7 were used, if not stated
otherwise. The minimum distance between feature points was set to 4
pixels. The maximum number of iterations was set to 7, and the
criterion for convergence of tracking was set to 0.1 pixel. We
compare the standard GPU implementation as described in [7] (but of
course executed on the faster GTX 480) with our Fermi-optimized
GPU implementation.
The optimization of the feature point selection for the Fermi
architecture gives a minor speedup of ≈15%. This is partly due to
the fact that the selection algorithm uses the NVIDIA CUDPP library
for calculating the maximum occurring cornerness, compaction of the
point list etc., and the CUDPP library internally switches
automatically to the best implementation for the GPU on which it is
executed. The runtime of the minimum distance enforcement remains
unchanged as it is done entirely on the CPU.
Regarding the feature point tracking, one can see from Figure 3
that a significant speedup factor of 2 to 5 is achieved. One can
observe that the runtime for the standard (unoptimized) GPU
implementation increases quadratically with the window size,
whereas for the Fermi-optimized GPU implementation it rises only
linearly. Overall, the optimization for the Fermi architecture
results in a decrease of the runtime of the GPU KLT algorithm by a
factor of approximately 1.6 to 1.9 (see Figure 4). The porting of
the GPU implementation to the Fermi architecture gives us the
ability to run two instances of the feature point tracker on a
single GPU in real-time for full HD material.
5 Person tracking over OmniCam tiles
The FascinatE system aims to broadcast live events; therefore the
most important requirement for the AV content analysis is
that the tracking algorithm is able to detect and track persons
within real-time constraints. The algorithm cannot be expected to
track all persons at all times because of occlusions and similar
disturbances. Persons need to be tracked over six static and
rectified image sequences from the OmniCam, which has been
described in Section 3. Tiles in our setup have a resolution of
1080 × 1920 pixels (HD resolution) each.
The AV content analysis could use the stitched, rectified, ultra
high definition image (7000 × 1920 pixels). However, that is not
feasible due to limitations of network bandwidth and computational
power. Instead, the system processes the video tiles separately,
enabling the AV content analysis to work in real-time.
The two main steps of the person tracking algorithm are illustrated
in Figure 5 and described below. The region tracker detects and
tracks persons in the image sequence from one tile, and the
multi-tile tracker connects camera tracks from the different
tiles.
5.1 Region Tracking
The region tracking algorithm combines three main video analysis
methods depicted in Figure 5: a CUDA HOG person detector, a blob
detector and the CUDA point tracker described in Section 4.
To analyze defined regions only, image masks have been created for
all camera sequences. This adjustment avoids false positives at
locations outside the soccer field (e.g., in the audience) and
additionally reduces the calculation time of the AV content
analysis. As discussed in Section 2 we intend to utilize GPU
implementations due to their superior performance. The publicly
available implementation of fastHOG [15], a real-time GPU
implementation of the HOG algorithm, is used as a basis due to its
favorable performance.
However, the original fastHOG implementation does not meet our
real-time requirement, as it needs 200 ms for processing a frame.
For that reason we increased the scale ratio of the pretrained SVM
classifier, which is applied to the sliding window positions. The
effects of coarser scale sampling, using a scale ratio of 1.3
instead of 1.05 as in the original implementation, are acceptable
for our application. This modification additionally enables the
improved detector to extend the range of scale levels that are
calculated. In contrast to the required minimal size of 128 pixels
in the original implementation, we are able to detect players as
small as 32 pixels, at a runtime of 70 ms per frame.
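To make the effect of the scale ratio concrete, the following Python sketch counts the geometrically spaced person heights that a sliding-window detector has to evaluate. The function name and the height range of 32 to 512 pixels are assumptions chosen for illustration; fastHOG's actual scale handling may differ.

```python
def scale_levels(min_height, max_height, ratio):
    """Person heights in pixels, sampled geometrically with the given ratio."""
    if ratio <= 1.0:
        raise ValueError("scale ratio must be greater than 1")
    levels = []
    k = 0
    while min_height * ratio ** k <= max_height:
        levels.append(min_height * ratio ** k)
        k += 1
    return levels

# Assumed height range; only the ratios 1.05 and 1.3 come from the text.
fine = scale_levels(32, 512, 1.05)
coarse = scale_levels(32, 512, 1.3)
```

With these assumed bounds, the coarse ratio yields about a fifth as many levels as the fine one. The reported runtime does not drop by the same factor, since the cost per level depends on the image size at that level and on fixed per-frame overhead.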
Additionally, to reduce the number of false positive results a
scene scale calculation is added. Persons at a certain distance to
the camera appear at approximately the same height, and as the
OmniCam is static this is furthermore constant over time. To
overcome missed detections in situations of sudden turns and
movement changes (especially in the presence of motion blur), a
blob detector is added to the region tracking. In our system the
OpenCV blob detector7 is used. The detector is based on a
foreground/background discrimination with a subsequent grouping of
adjacent, foreground-labeled pixels.
Figure 5: AV content analysis using region tracker and multi-tile
tracker.
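The principle of such a blob detector can be sketched in a few lines of Python. This is not the OpenCV implementation used in the system: the static background image, the fixed difference threshold and the 4-connected grouping are simplifying assumptions, and all names are hypothetical.

```python
from collections import deque

def foreground_mask(frame, background, threshold=30):
    """Label a pixel as foreground if it differs enough from the background."""
    return [[abs(f - b) > threshold for f, b in zip(frow, brow)]
            for frow, brow in zip(frame, background)]

def find_blobs(mask, min_pixels=1):
    """Group 4-connected foreground pixels; return boxes (x0, y0, x1, y1)."""
    height = len(mask)
    width = len(mask[0]) if mask else 0
    seen = [[False] * width for _ in range(height)]
    boxes = []
    for y in range(height):
        for x in range(width):
            if not mask[y][x] or seen[y][x]:
                continue
            # Breadth-first flood fill starting at an unvisited foreground pixel.
            queue = deque([(x, y)])
            seen[y][x] = True
            x0 = x1 = x
            y0 = y1 = y
            count = 0
            while queue:
                cx, cy = queue.popleft()
                count += 1
                x0, x1 = min(x0, cx), max(x1, cx)
                y0, y1 = min(y0, cy), max(y1, cy)
                for nx, ny in ((cx + 1, cy), (cx - 1, cy), (cx, cy + 1), (cx, cy - 1)):
                    if (0 <= nx < width and 0 <= ny < height
                            and mask[ny][nx] and not seen[ny][nx]):
                        seen[ny][nx] = True
                        queue.append((nx, ny))
            if count >= min_pixels:
                boxes.append((x0, y0, x1, y1))
    return boxes
```

The `min_pixels` parameter discards very small blobs, which in practice are usually noise rather than persons.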
The feature point tracker (using approx. 1,000 points), the
improved fastHOG and the blob detector work in parallel, which
provides stable tracked feature points and region detections. The
results of the person and blob detector for each image of the
different image sequences provide the regions of detected persons
for further processing. The extracted feature points and regions
are combined as a region tracker as shown in Figure 6.
The algorithm of the region tracker is described in the following.
If a detected region solely contains feature points that are not
linked to a person ID, a new person ID is created and linked to
this region and the feature points located inside. For these
feature points the linked person IDs are also available at
subsequent time points for further calculations.
If a detected region contains feature points not linked to the same
person ID, they are (re-)linked to the person ID which has the most
votes inside the detected region.
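The two linking rules above can be sketched as follows in Python. The data layout (feature points as dictionaries with `x`, `y` and `id` keys, regions as axis-aligned boxes) and the function names are assumptions for illustration and do not reflect the actual data structures of the system.

```python
from collections import Counter
from itertools import count

def points_in_region(points, region):
    """Feature points lying inside the box region = (x0, y0, x1, y1)."""
    x0, y0, x1, y1 = region
    return [p for p in points if x0 <= p["x"] <= x1 and y0 <= p["y"] <= y1]

def link_region(points, region, new_ids):
    """Assign one person ID to a detected region and to all points inside it.

    new_ids is an iterator that yields fresh person IDs.
    """
    inside = points_in_region(points, region)
    votes = Counter(p["id"] for p in inside if p["id"] is not None)
    if votes:
        # Some points already carry IDs: the ID with the most votes wins.
        person_id = votes.most_common(1)[0][0]
    else:
        # No linked points (assumed to include an empty region): new person.
        person_id = next(new_ids)
    for p in inside:
        p["id"] = person_id
    return person_id

# Usage: two unlinked points inside a region receive a fresh ID.
ids = count(1)
pts = [{"x": 1, "y": 1, "id": None}, {"x": 2, "y": 2, "id": None}]
first_id = link_region(pts, (0, 0, 10, 10), ids)
```

Because the IDs are stored on the feature points, they travel with the points as these are tracked from frame to frame, which is what carries a person's identity between detections.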
All feature points not located in a detected region are clustered
using a maximum distance and a minimum number of points. For each
extracted cluster a new detection region is calculated from its
convex hull. If this region contains no feature points linked to a
person ID, a new person ID is created and linked to the region and
the feature points inside it. Otherwise the feature points and the
region are re-linked to the person ID with the most votes inside
the detection region.
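The clustering of the remaining feature points can be sketched in Python as below. The text specifies only a maximum distance and a minimum number of points, so the single-linkage grouping used here is an assumption, as are the function names. The convex hull is computed with the standard monotone chain method.

```python
import math

def cluster_points(points, max_dist, min_points):
    """Single-linkage clusters of (x, y) points, returned as index lists."""
    unvisited = set(range(len(points)))
    clusters = []
    while unvisited:
        seed = min(unvisited)
        unvisited.remove(seed)
        stack = [seed]
        members = [seed]
        while stack:
            i = stack.pop()
            near = [j for j in unvisited
                    if math.hypot(points[i][0] - points[j][0],
                                  points[i][1] - points[j][1]) <= max_dist]
            for j in near:
                unvisited.remove(j)
                stack.append(j)
                members.append(j)
        if len(members) >= min_points:  # drop clusters that are too small
            clusters.append(sorted(members))
    return clusters

def convex_hull(pts):
    """Convex hull in counter-clockwise order (Andrew's monotone chain)."""
    pts = sorted(set(pts))
    if len(pts) <= 2:
        return pts

    def cross(o, a, b):
        return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])

    lower = []
    for p in pts:
        while len(lower) >= 2 and cross(lower[-2], lower[-1], p) <= 0:
            lower.pop()
        lower.append(p)
    upper = []
    for p in reversed(pts):
        while len(upper) >= 2 and cross(upper[-2], upper[-1], p) <= 0:
            upper.pop()
        upper.append(p)
    return lower[:-1] + upper[:-1]
```

Each hull returned for a cluster then plays the role of a detected region and is linked to a person ID by the same voting rule as the detector regions.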
With the proposed algorithm, missed person detections and their
regions can be recovered by distance clustering. The effort for
fusing the results of the three base algorithms is negligible, thus
the entire person detection and person tracking algorithm is very
close to real-time. In total, the region tracking takes about 180
ms, consisting of 70 ms for the improved fastHOG, 50 ms for the
feature point tracker, 50 ms for the blob detector and 10 ms for
result fusion. In the metadata interface all tracked regions, their
locations and person IDs of all separate camera sequences are
delivered with their specific image-sequence ID so that the results
can be processed in the multi-tile tracking component.
7 http://opencv.willowgarage.com/wiki/VideoSurveillance
5.2 Multi-tile tracker
For multi-tile tracking the results from each of the sequences
analyzed by different workstations are merged and updated for the
stitched panoramic image. Because all images are rectified, the
horizontal borders of adjacent camera views correspond to each
other. On this basis the following simple algorithm has been
developed to track persons between camera sequences. If a person
disappears at the border of one image, the algorithm tries to find
this person in the adjacent image at the same y-position within a
minimum overlap. To fully detect the person in both images the
algorithm searches for a short time interval after the person
disappears in one of the images. Then the newly occurring region is
re-linked to the person ID of the region that disappeared in the
adjacent image. Finally, the algorithm recalculates the coordinates
of the detected region for the tracked person using the location
within the panoramic image. The multi-tile tracker works in
real-time.
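A minimal Python sketch of this hand-over between adjacent tiles is given below. The record layout, the function names and the tolerances `max_dy` and `max_frames` are assumptions. In particular, the minimum-overlap criterion of the text is replaced here by a simple tolerance on the y-position.

```python
def adjacent_tile(tile, side):
    """Index of the neighbouring tile across the given border."""
    return tile + 1 if side == "right" else tile - 1

def hand_over(lost, appeared, max_dy=20, max_frames=10):
    """Re-link regions that appear in a tile to tracks lost in its neighbour.

    lost and appeared are lists of dicts with the keys id, tile, side
    ("left" or "right" border), y (pixels) and frame (frame number).
    Returns a mapping {new region id: previous person id}.
    """
    relinked = {}
    used = set()
    for new in appeared:
        best = None
        for old in lost:
            if old["id"] in used:
                continue
            if adjacent_tile(old["tile"], old["side"]) != new["tile"]:
                continue
            if new["side"] == old["side"]:
                continue  # the person must enter through the facing border
            dt = new["frame"] - old["frame"]
            dy = abs(new["y"] - old["y"])
            if 0 <= dt <= max_frames and dy <= max_dy:
                if best is None or dy < best[0]:
                    best = (dy, old["id"])
        if best is not None:
            relinked[new["id"]] = best[1]
            used.add(best[1])
    return relinked
```

The `used` set ensures that a lost track is handed over at most once, so two regions appearing near the same border cannot both inherit the same person ID.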
5.3 Evaluation
To evaluate the results of person detection, ground truth data has
been created manually. For our experiments we annotated all visible
soccer players within different test sequences (about 1,500
frames). The material is from the first test shoot of the FascinatE
project carried out at a live event: the UK Premier League football
match Chelsea vs. Wolverhampton Wanderers, at Stamford Bridge,
London. Only the center tiles, where most of the action of the
soccer game takes place, have been used. The algorithm runtime
reported in this section has been measured on a system equipped
with an NVIDIA GeForce GTX 285 graphics board and a 3 GHz Quad-Core
Intel Xeon CPU.
Figure 6: The left image shows detected person regions described by
bounding boxes and tracked feature points. The resulting tracked
persons with their bounding boxes and IDs are shown on the
right.
For quantitative evaluation we compared the bounding boxes of the
ground truth data with the bounding boxes of person regions
obtained from AV content analysis. With a bounding box overlap
threshold of 25% we calculated an average precision of 95.57% and
an average recall of 58.84% for all test sequences. The high
precision indicates that the combination of both region detectors
and the feature point tracker performs sufficiently well for scenes
with fewer rapid movements. As expected, the average recall of
64.43% using the original fastHOG implementation with a scale ratio
of 1.05 is slightly higher while the precision does not change.
However, due to the speedup factor of 3 to 4, the coarser scale
sampling caused by our alterations is acceptable.
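The evaluation measure can be sketched in Python as follows. The text does not define the overlap measure or the matching strategy, so the intersection-over-union overlap and the greedy one-to-one matching used here are assumptions, and the function names are hypothetical.

```python
def area(box):
    """Area of a box given as (x0, y0, x1, y1)."""
    return max(0.0, box[2] - box[0]) * max(0.0, box[3] - box[1])

def overlap(a, b):
    """Intersection over union of two boxes (assumed overlap measure)."""
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    union = area(a) + area(b) - inter
    return inter / union if union > 0 else 0.0

def precision_recall(detections, ground_truth, threshold=0.25):
    """Match each detection to at most one unmatched ground truth box."""
    matched = set()
    true_positives = 0
    for det in detections:
        best_overlap, best_idx = 0.0, None
        for idx, gt in enumerate(ground_truth):
            if idx in matched:
                continue
            value = overlap(det, gt)
            if value > best_overlap:
                best_overlap, best_idx = value, idx
        if best_idx is not None and best_overlap >= threshold:
            matched.add(best_idx)
            true_positives += 1
    precision = true_positives / len(detections) if detections else 0.0
    recall = true_positives / len(ground_truth) if ground_truth else 0.0
    return precision, recall
```

Precision is the fraction of detections that match an annotated player, and recall is the fraction of annotated players that are found. A detector that misses players but rarely fires on the crowd therefore shows the pattern reported above: high precision with lower recall.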
The scene scale calculation contributes to the average bounding box
overlap of 76% for true positives. The lower recall is caused by
situations of sudden movement changes and turns of the players. In
such scenes very few players are detected and tracked. It is worth
mentioning that the range of the average recall is between 58% and
59% and depends on the occurrence of such situations while the
average precision is quite robust. The resulting recall is
furthermore a consequence of the difficult discrimination between
the upper bodies of the players at the far end of the soccer field
and the crowd behind.
6 Conclusions and outlook
In this paper we have presented a system for real-time person
tracking in a high-resolution panoramic image. The system is part
of a workflow for format agnostic production of live
events. The results of AV content analysis are used as an input to
automated scripting, in order to derive content for a wide range of
terminal devices from the high-resolution scene.
The optimization of the GPU feature point tracker described in [7]
for the Fermi architecture gives us a significant speedup of 60%
to 90% compared with the standard GPU implementation, thereby
allowing us to run multiple instances of the feature point tracker
in real-time for full HD material on a single GPU.
The proposed person multi-tile tracking algorithm for 6 parallel HD
image sequences performs very close to real-time. The next
improvements of this algorithm will include a speed up to
real-time. The bottleneck regarding the processing time is the
detection of the person regions. Therefore the fastHOG algorithm
could be improved in order to only extract scale levels for
possible person heights at each image location, as determined by
the scene scale calculation.
Due to missed detections of the fastHOG in situations of sudden
turns, the performance of the blob tracker should be enhanced by
several modifications (e.g. using a lower resolution). It remains
to be investigated if the blob detector can be applied in fewer
cases and at lower temporal rates. Up to now persons with a minimal
height of 32 pixels can be tracked in each camera sequence as long
as they are not occluded. Occlusion could be handled by trajectory
analysis, i.e. using Kalman-based filters [10] for short-term
occlusions and verification by color features. This will also help
to increase the length of the person tracks. Person detection will
be extended to the broadcast cameras as well, and optimizations for
other scenario types have to be targeted.
Acknowledgements
The authors would like to thank Georg Schwendt for his help with
the ground truth annotation. The research leading to this paper has
been partially supported by the European Commission under the
grant agreements no. 248138, “FascinatE – Format-Agnostic
SCript-based INterAcTive Experience”
(http://www.fascinate-project.eu/), and no. 215475, “2020 3D Media
– Spatial Sound and Vision” (http://www.20203dmedia.eu/). The work
of Jakub Rosner has been partially supported by the European Social
Fund.
References
[1] Alexandre Alahi, Yannick Boursier, Laurent Jacques, and Pierre
Vandergheynst. Sport Players Detection and Tracking With a Mixed
Network of Planar and Omnidirectional Cameras. In Third ACM/IEEE
International Conference on Distributed Smart Cameras, Como,
2009.
[2] Nadeem Anjum and Andrea Cavallaro. Trajectory association and
fusion across partially overlapping cameras. In Proceedings of the
2009 Sixth IEEE International Conference on Advanced Video and
Signal Based Surveillance, AVSS ’09, pages 201–206, Washington, DC,
USA, 2009.
[3] Simone Calderara, Andrea Prati, Roberto Vezzani, and Rita
Cucchiara. Consistent labeling for multi-camera object tracking. In
Fabio Roli and Sergio Vitulano, editors, Image Analysis and
Processing ICIAP 2005, volume 3617 of Lecture Notes in Computer
Science, pages 1206–1214. Springer Berlin / Heidelberg, 2005.
[4] Kuan-Wen Chen, Chih-Chuan Lai, Yi-Ping Hung, and Chu-Song Chen.
An adaptive learning method for target tracking across multiple
cameras. In Computer Vision and Pattern Recognition, 2008. CVPR
2008. IEEE Conference on, pages 1–8, June 2008.
[5] Damien Delannay, Nicolas Danhier, and Christophe De
Vleeschouwer. Detection and recognition of sports(wo)men from
multiple views. In Third ACM/IEEE International Conference on
Distributed Smart Cameras, 2009.
[6] Manolis Falelakis, Rene Kaiser, Wolfgang Weiss, and Marian
Ursu. Reasoning for Video-Mediated Group Communication. In
Proceedings IEEE International Conference on Multimedia & Expo,
July 2011.
[7] Hannes Fassold, Jakub Rosner, Peter Schallauer, and Werner
Bailer. Realtime KLT Feature Point Tracking for High Definition
Video. In Vaclav Skala and Dietmar Hildebrand, editors, GraVisMa
2009 - Computer Graphics, Vision and Mathematics for Scientific
Computing, 2010.
[8] P. Foldesy, I. Szatmari, and A. Zarandy. Moving object tracking
on panoramic images. In Cellular Neural Networks and Their
Applications, 2002. (CNNA 2002). Proceedings of the 2002 7th IEEE
International Workshop on, pages 63–70, Jul. 2002.
[9] J. Kang, I. Cohen, and G. Medioni. Soccer player tracking across
uncalibrated camera streams. Performance Evaluation, 10,
2003.
[10] S. Khan, O. Javed, Z. Rasheed, and M. Shah. Human tracking in
multiple cameras. In Computer Vision, 2001. ICCV 2001. Proceedings.
Eighth IEEE International Conference on, volume 1, pages 331–336,
2001.
[11] Daehee Kim, Vivek Maik, Dongeun Lee, Jeongho Shin, and Joonki
Paik. Active shape model-based object tracking in panoramic video.
In Vassil Alexandrov, Geert van Albada, Peter Sloot, and Jack
Dongarra, editors, Computational Science ICCS 2006, volume 3994 of
Lecture Notes in Computer Science, pages 922–929. Springer Berlin /
Heidelberg.
[12] B. D. Lucas and T. Kanade. An Iterative Image Registration
Technique with an Application to Stereo Vision. In IJCAI81, pages
674–679, 1981.
[13] J.F. Ohmer and N.J. Redding. GPU-Accelerated KLT Tracking with
Monte-Carlo-Based Feature Reselection. In Digital Image Computing:
Techniques and Applications, 2008. DICTA ’08, pages 234–241, Dec.
2008.
[14] R. Patil, P.E. Rybski, T. Kanade, and M.M. Veloso. People
detection and tracking in high resolution panoramic video mosaic.
In Intelligent Robots and Systems, 2004. (IROS 2004). Proceedings.
2004 IEEE/RSJ International Conference on, volume 2, pages
1323–1328, Sept.–Oct. 2004.
[15] Victor Prisacariu and Ian Reid. FastHOG - a real-time GPU
implementation of HOG. Technical report, Department of Engineering
Science, Oxford University, 2009.
[16] R. Schafer, P. Kauff, and C. Weissig. Ultra high resolution
video production and display as basis of a format agnostic
production system. In Proceedings of International Broadcast
Conference (IBC 2010), 2010.
[17] Jianbo Shi and Carlo Tomasi. Good features to track. In Computer
Vision and Pattern Recognition, 1994. Proceedings CVPR ’94., 1994
IEEE Computer Society Conference on, pages 593–600, 1994.
[18] Sudipta N. Sinha, Jan-Michael Frahm, Marc Pollefeys, and Yakup
Genc. GPU-based Video Feature Tracking and Matching. Technical
report, 2006.
[19] M. Thaler and R. Morzinger. Automatic inter-image homography
estimation from person detections. In Advanced Video and Signal
Based Surveillance (AVSS), 2010 Seventh IEEE International
Conference on, pages 456–461, Sept. 2010.
[20] Carlo Tomasi and Takeo Kanade. Detection and Tracking of Point
Features. Technical Report CMU-CS-91-132, Carnegie Mellon
University, April 1991.
[21] Christian Wojek, Gyuri Dorko, Andre Schulz, and Bernt Schiele.
Sliding-Windows for Rapid Object Class Localization: A Parallel
Technique. In DAGM-Symposium, pages 71–81, 2008.