
4K Lecture Tracking System: Movement Recognition and Lecturer Tracking

Maximilian Karl Alfred Hahn
Honours, Department of Computer Science
University of Cape Town
Rondebosch, 7701, South Africa
[email protected]

ABSTRACT
This paper is based on a lecture recording project, using static 4K cameras set up by UCT's CILT, that includes blackboard segmentation, lecturer recognition and smooth virtual panning. We present a lecturer tracking solution using the OpenCV image processing library. The work presented in this paper focuses on movement detection, object tracking, lecturer recognition and near real-time efficiency. Movement detection is performed using OpenCV's absolute difference background subtraction, thresholding and contour search. We find that this sequence outperforms traditional contour searches such as Canny [2] in terms of run-time while still providing usable results. The success of our solution is evaluated by running a set of 17 lecture-specific test cases. We measure the run-time of use cases as a factor of the video's length. Precision is measured as the implementation's ability to track the lecturer as a percentage of total time. We find that the solution works for all common use cases and most uncommon ones, achieving near real-time processing with an average processing time of 1.13 times the length of the video.

CCS Concepts
• Computing methodologies ➝ Computer vision problems
• Computing methodologies ➝ Video segmentation
• Computing methodologies ➝ Tracking

Keywords
Background Segmentation; Movement Detection; Object Tracking; Lecturer Recognition; OpenCV Library

1. INTRODUCTION
Some lecturers fear that lecture recordings will be the end of class attendance. However, lecture recording has become more popular in educational institutions as studies show that students are willing to attend lectures as well as make use of lecture recordings. Often this combination of learning environments yielded the highest marks overall [10] [7].

A recording system can come with a variety of features covering single-presenter tracking, multi-presenter tracking, blackboard and projector screen segmentation, and real-time processing. It can also be implemented on a variety of cameras, such as a static camera, a panning camera, a tilting camera, a zooming camera or a combination of the above. The advantage of these systems is that the recorded lectures need minimal to no editing. For example, using blackboard segmentation, the frames containing blackboard writing don't need to be manually cropped out and saved to file. Most importantly, a cinematographer doesn't need to be hired to film the lecturer or to edit the lecture video by manually panning a frame. Unfortunately, there aren't any open source systems available that implement a lecture tracking solution and produce visually appealing results.

We were approached by the Centre for Innovation in Learning and Teaching (CILT) at the University of Cape Town (UCT) to implement such a system for their new high definition (3840 x 2160 pixels) 4K video cameras (Figure 1). CILT had previously implemented many approaches to lecture tracking, including 1080p static cameras and Pan-Tilt-Zoom (PTZ) cameras (Figure 2), which cost R5,000-R10,000 and R60,000-R80,000 respectively. The 1080p cameras don't produce a clear enough image, and the PTZ cameras are very expensive with the addition of the Raspberry Pi required to power them. The 4K cameras that CILT have invested in cost R15,000 each and produce very clear video. Unfortunately, streaming them requires a very fast internet connection. The brief from CILT was to provide a cost-effective software system to segment boards, track the lecturer and output a smooth-panning 720p frame extracted from the high resolution 4K stream.

Figure 1 - CILT's 4K Video Camera

Figure 2 - Left: PTZ Camera, Right: Static Camera

The 4K lecture tracking program was divided into three components: the video pre-processing and blackboard segmentation system, the movement detection and lecturer tracking system, and lastly the virtual cinematographer.

1. The video pre-processing system performs light correction to smooth light-based changes between frames and uses the processed frames to segment out blackboards. Lastly, a motion detection mask is created over a constant time step to record where motion occurs over those frames. The mask is part of our future work.

2. The movement detection system performs movement detection using background subtraction between two frames. The movement is then segmented into rectangles of motion. A lecturer is chosen from these rectangles of motion using the intersection of rectangles across frames and the time on screen of a certain area. This can be thought of as similar to a heat map.

3. The virtual cinematographer focuses on panning the output frame in a smooth fashion to avoid making the video uncomfortable or jarring to watch. This attempts to emulate the way an actual cameraman would film a lecture.

Our aim is to develop a fully functioning 4K lecture tracking solution. This includes making the solution both time efficient and robust enough to track a lecturer precisely across a plethora of possible cases. This paper addresses the second module of this project.

The paper is laid out as follows. Section 2 covers work related to the image processing techniques used in this system. Section 3 details the specifics of the system as well as system requirements and constraints. Section 4 elaborates on how we evaluated the system, including the evaluation results. Section 5 focuses on analysis of the results found. Section 6 presents concluding statements as well as possible future directions and improvements to this system.

2. BACKGROUND & RELATED WORK
Below we discuss related work that implemented lecture tracking solutions.

Zhang et al. (2005) present a tracking solution based on a pixel motion histogram that implements virtual panning. In addition, they use a PTZ camera that can pan horizontally to track the lecturer when they move out of frame. They mention the performance issues of face detection with regard to run time as well as robustness. To make their system robust to lighting changes and more efficient, the motion pixel histogram is only calculated in an area that surrounds the detected lecturer. This means that changes in lighting don't affect other areas of the screen, as they simply aren't processed. While this solution appears to work well, their frame was a 640 x 480 resolution, which is far smaller than the 4K frame we need to process; it is possible that performance would become a problem.

Arseneau et al. (1999) present a tracking solution for classroom environments where the camera isn't mounted and pointed in an optimal position like ours. A background subtraction technique that divides horizontal and vertical maxima into bins is used. The global maxima of these two axes are chosen as the center point of a region of interest. These regions of interest are then processed using a 2:1 height:width ratio rectangle to output the location of the presenter. This approach is more robust to room setup but doesn't account for other movement in the scene; if two humans were to enter the view, the region of interest could jump between successive frames.

Below we present a description and explanation of the image processing techniques used in this work that are implemented as part of the OpenCV library.

Mathematical morphology is a processing technique implemented on geometrically structured data such as a 2D grid of binary pixels [5]. It can expand or shrink the data with the application of dilation or erosion procedures. In our implementation we make use of it to expand our thresholded image so contours are easier to find.

Blurring functions reduce image sharpness by blending nearby values together. This can be done in a variety of ways, such as with a box blur, which simply averages all nearby values in a sliding window [6], or with a Gaussian approximation, which similarly makes use of a sliding window but averages values using a weighted matrix [3]. Blur is used in our implementation to make contours easier to find.

Finding contours of an image can be done using border following, as in OpenCV's implementation. This implementation follows Suzuki et al. (1985), which describes a border following algorithm that accepts binary images as input. It creates outline contours as well as registering holes in contours. These contours outline and save the movement detected by a background subtraction algorithm.
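The snippet below is a minimal sketch of how these three primitives are typically invoked through OpenCV's C++ API; the kernel sizes, the input file name and the choice between box and Gaussian blur are illustrative assumptions, not the project's settings.

#include <opencv2/opencv.hpp>
#include <vector>

int main() {
    // Hypothetical binary mask (e.g. the output of a thresholding step).
    cv::Mat mask = cv::imread("binary_mask.png", cv::IMREAD_GRAYSCALE);

    // Mathematical morphology: dilation expands the white (foreground) regions.
    cv::Mat dilated;
    cv::dilate(mask, dilated,
               cv::getStructuringElement(cv::MORPH_RECT, cv::Size(3, 3)));

    // Blurring: a box blur averages a sliding window; a Gaussian blur uses a weighted kernel.
    cv::Mat boxBlurred, gaussBlurred;
    cv::blur(dilated, boxBlurred, cv::Size(15, 15));
    cv::GaussianBlur(dilated, gaussBlurred, cv::Size(15, 15), 0);

    // Border following (Suzuki et al.) extracts the outer contours of the remaining blobs.
    std::vector<std::vector<cv::Point>> contours;
    cv::findContours(boxBlurred, contours, cv::RETR_EXTERNAL, cv::CHAIN_APPROX_SIMPLE);
    return 0;
}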

3. DESIGN & IMPLEMENTATION
This section includes a description of the constraints set forth by CILT and their requirements, as well as the assumptions we made that helped in developing our program. We then discuss our system overview and architecture and go into more depth about the lecturer recognition portions of the system. Finally, the methods we used from OpenCV are discussed, along with an in-depth walkthrough of our algorithm in Sections 3.4, 3.5 and 3.6.

3.1 System Constraints & Requirements
The basic requirements set forth by CILT are to:

• Segment blackboards, saving clear and high definition images of these segments; these segmented images are to be updated as the board is written on.

• Take a 4K lecture video as input and use computer vision algorithms to track the lecturer.

• Output a 720p video that follows the tracked lecturer positions, acting as a virtual cinematographer.

Important assumptions we made when implementing our system include:

• The camera is positioned such that the lecturer will be around the middle of the screen with regard to the y-axis.

• The camera is positioned such that the lecturer's podium and middle board, or the gap between two boards, are in the middle of the screen with regard to the x-axis.

• The lecturer will not always be alone in the frame, and as such occlusion needs to be handled.

• The lecturer can't move very far between successive frame reads.

• The video captured doesn't ever lag, meaning that no frames are dropped or duplicated.

3.1.1 Functional Requirements
CILT requires that the system is able to track a lecturer correctly throughout most normal lecture situations. From this we devised the research aim that our system needs to correctly identify the lecturer for 90% of the video. The lecturer is correctly identified when (in debug mode) the program shows a rectangle that contains at least 70% of the lecturer.

CILT requires the system to run well across a plethora of lecture scenarios. We therefore devised a set of 17 use cases that are discussed in the methods section. These cover all realistic scenarios that would occur in a lecture theatre as well as some edge case scenarios. Successfully completing this requirement would provide strong support for the robustness of the system across all realistic scenarios.

3.1.2 Non-Functional Requirements
CILT currently releases lecture videos within 8 hours of the lecture finishing, and all video post-processing must fall within that period. This doesn't leave much time for our framework, as other processes and steps also form part of this 8-hour time limit. One example of these processes is needing to manually cut the video to the exact start and end of the lecture. CILT recommended this module take at most 3 times the runtime of the actual video. From this we devised the research aim of processing the lecturer tracking section in less than 2 times the length of the input video, as we believe our planned algorithm can meet and beat the requirement outlined.

Our research aims therefore are:

1. Can this section run efficiently enough to be processed in less than 2 times the length of the video? This will fulfill the processing speed requirement, meaning lecture videos will be available to students faster without creating a backlog of unedited videos.

2. Does our system correctly segment out all motion and decide on the correct lecturer 90% of the time for likely use cases? This will fulfill the main functional requirements, meaning our solution will be usable without further work for most lecture recordings.

3. Does our system correctly segment out all motion and decide on the correct lecturer 90% of the time for unlikely (edge) use cases? This will fulfill the edge case functional requirements, meaning our solution will work for a variety of odd cases and is generally quite robust.

3.2 System Overview
The movement detection and lecturer recognition module requires the lecture video as input and reads all frames of the video file during its execution. Not all frames need to be processed; in order to reduce processing time for this module, we implemented the ability to skip frames, which are only read, not processed. The processed frames are passed through multiple computer vision algorithms provided by the OpenCV library to yield rectangles that encapsulate an area of motion. A history of these rectangles is created in a class called "Ghost". Once all frames have been read, these ghosts are post-processed to select a lecturer for each frame, and the locations of the ghosts are shifted to more accurately represent the position of the lecturer. This section sends the locations of the lecturer at each processed frame to the virtual cinematographer section.
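As a rough illustration of this read-and-skip behaviour, the sketch below shows a frame loop in C++ with OpenCV; the processFrame hook and the skip factor of 4 are placeholders rather than the module's actual interface.

#include <opencv2/opencv.hpp>
#include <string>

// Placeholder for the module's own per-frame processing.
static void processFrame(const cv::Mat&) { /* motion detection would go here */ }

void run(const std::string& videoPath, int everyNth = 4) {
    cv::VideoCapture cap(videoPath);
    if (!cap.isOpened()) return;

    cv::Mat frame;
    long frameIndex = 0;
    while (cap.read(frame)) {            // every frame is read from the file...
        if (frameIndex % everyNth == 0)  // ...but only every Nth frame is processed
            processFrame(frame);
        ++frameIndex;
    }
}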

We used OpenCV, an open source computer vision library, for much of our processing. This library provided us with many complicated image processing methods that allowed us to test a plethora of approaches to a single problem with minimal effort. OpenCV makes use of the OpenCL framework to parallelize its methods as much as possible, which means that OpenCV is written to be very efficient and aims to make real-time solutions possible.

Since we needed a very efficient program, we also had to decide between the C++, Java and Python implementations of the OpenCV library. Prechelt (2000) found that C++ runtime for the same program was on average 2 times faster than Java and almost 2 times faster than Python. Similarly, C++ also performed best for memory, being roughly two to four times more memory efficient than Java and Python. Given these results it seems logical to use C++, as our processes need to be completed quickly and minimizing memory usage is important when processing raw 4K pixel data. Unfortunately, the study also revealed that development using Python is two times faster than with Java and C++. This meant that although the code would run quicker with our choice of C++, it would likely take much longer to develop, especially since we were new to the OpenCV library. A similar study also performed by Prechelt (1999) mirrored these results between Java and C++.

Figure 3 - Primary Sequence of the MovementDetection Class

while(readNextFrame)
  if(Nth frame)
    1  absoluteDifference(frame a, frame b)
    2  threshold(25, 255)
    3  morphologicalDilate(3x3 structure)
    4  blur(15x15 matrix)
    5  findContours()
    6  Process Rectangles Logic (Section 3.4)
    7  Process Ghosts Logic (Section 3.5)
    8  Store Ghosts and Rectangles
9  findLecturer() (Section 3.6)
10 adjustLecturer() (Section 3.6)

Figure 3 shows our ability to process every Nth frame. This means we can find a balance between the run time of the program and how precisely changes in movement are registered. Sections 3.4, 3.5 and 3.6 all implement many small methods that are based on assumptions about the way a lecturer will behave and that tend towards maximizing the chance that the lecturer is found. Note that OpenCV is implemented with a top-left origin co-ordinate system and all methods assume knowledge of this convention.

A project system architecture diagram is presented in Appendix A in the form of a class diagram.

3.3 OpenCV Methods
Our first approach was to use Sobel filtering [4], which was slow for many frames and also yielded all edges, which we would then have needed to segment. We ran into similar issues using Canny edge detection [2]. We decided to change the direction of the project to make use of movement detection through background subtraction rather than edge detection.

Our first background subtraction approach made use of OpenCV's MOG2 background subtractor from Zivkovic et al. (2004), which produced good foreground segmentation but resulted in lots of noise all over the scene. In addition, it also took noticeably too long to process frames. We decided to go with a simple absolute difference algorithm to segment differences between frames and then applied a threshold function. This provided mostly clean background subtraction with very little noise because we chose very tight limits for our thresholding function. The runtime of this algorithm was also noticeably better.

We then performed a morphological dilation as well as a blur using a large 15 x 15 matrix to make contours more recognizable.


Because of this order of operations, we found that very little noise got through, and thus the dilation and blur only made areas we wanted to detect larger. After this we found the contours of our edited frame, the output of which is shown in Figure 4 below.

Figure 4 - Output of Find Contours, Including Short Contours
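For clarity, a compact C++ reconstruction of this per-frame sequence is sketched below. The threshold of 25/255, the 3 x 3 structuring element and the 15 x 15 blur follow the values quoted in Figure 3; the function name, the grayscale inputs and the use of boundingRect here are our own assumptions.

#include <opencv2/opencv.hpp>
#include <vector>

std::vector<cv::Rect> detectMotion(const cv::Mat& prevGray, const cv::Mat& currGray) {
    cv::Mat diff, mask;
    cv::absdiff(prevGray, currGray, diff);                  // absolute difference background subtraction
    cv::threshold(diff, mask, 25, 255, cv::THRESH_BINARY);  // tight threshold keeps noise low
    cv::dilate(mask, mask, cv::getStructuringElement(cv::MORPH_RECT, cv::Size(3, 3)));
    cv::blur(mask, mask, cv::Size(15, 15));                 // enlarge blobs so contours are easy to find

    std::vector<std::vector<cv::Point>> contours;
    cv::findContours(mask, contours, cv::RETR_EXTERNAL, cv::CHAIN_APPROX_SIMPLE);

    std::vector<cv::Rect> motionRects;
    for (const auto& c : contours)
        motionRects.push_back(cv::boundingRect(c));         // one rectangle per area of motion
    return motionRects;
}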

3.4 Recognition Algorithm
This section follows Figure 3 Step (6) and explains how contours are converted to rectangles and the various culling and grouping processes the rectangles in each frame undergo.

The first step in the process removes any contour chains that have very few nodes. This is because any complex shape which we are interested in will have many nodes; this also tends to ignore small movements which can come about as a result of light changes or refocusing of the camera. Very large simple shapes such as a rectangle can be represented by short contour chains, so this step removes those as well.

Figure 5 - Upper and Lower Bounds of Movement

A bounding rectangle is created around each contour using OpenCV's boundingRect method. Since we assume that the lecturer is in the middle of the screen with regard to the y-axis, any rectangle whose top is greater than a maximum y threshold is deleted. For the same reason, any rectangle whose bottom is smaller than a minimum y threshold is deleted. Figure 5 shows an example of these thresholds, where green rectangles are valid and red rectangles are removed.

The next step is to check whether the rectangle's width:height ratio is below a certain threshold. This is because we found that humans being tracked don't move in such a way as to generate very wide rectangles. This check ignores boards and projector screens moving vertically, as those are often detected as very wide rectangles, as in Figure 6.

Figure 6 - Board Top Being Movement Detected
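A simple sketch of these culling rules follows, assuming the top-left origin noted in Section 3.2; the minimum contour length of 20 nodes, the y-band limits and the aspect-ratio cap are hypothetical parameters, not the project's values.

#include <opencv2/opencv.hpp>
#include <vector>

std::vector<cv::Rect> cullRectangles(const std::vector<std::vector<cv::Point>>& contours,
                                     int minY, int maxY, double maxAspect) {
    std::vector<cv::Rect> kept;
    for (const auto& c : contours) {
        if (c.size() < 20) continue;                  // drop short contour chains (noise, simple shapes)
        cv::Rect r = cv::boundingRect(c);
        if (r.y > maxY) continue;                     // top of rectangle is below the allowed band
        if (r.y + r.height < minY) continue;          // bottom of rectangle is above the allowed band
        if (static_cast<double>(r.width) / r.height > maxAspect)
            continue;                                 // very wide rectangles: boards, projector screens
        kept.push_back(r);
    }
    return kept;
}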

Overlapping and nearby rectangles need to be grouped together into a larger rectangle. This is an attempt to build a rectangle around a person who has multiple small movements detected and has therefore registered multiple unconnected contours in a frame. The implementation of this is shown in Figure 7.

Figure 7 - Rectangle Overlap and Proximity Evaluation Logic

repeat
  changeRegistered = false
  for each rectangle R1
    for each rectangle R2
      if R1 intersects R2
        replace R1 with new bounding rectangle
        remove R2
        changeRegistered = true
  for each rectangle R1
    for each rectangle R2
      if minDistance(R1, R2) is below a proximity threshold*
        replace R1 with new bounding rectangle
        remove R2
        changeRegistered = true
until changeRegistered is false

*The minDistance(rectangle, rectangle) method finds the distance between the closest two edge points of the two rectangles.

Figure 7 is a rudimentary implementation of a clustering algorithm for rectangles, and Figure 8 shows its outcome. Each run of the logic in Figure 7 is re-evaluated for changes, as a new bounding rectangle could now intersect other rectangles.

Two redundancy checks are then performed. The first checks whether the frame is too cluttered with rectangles, which can occur when there is too much motion or when the camera refocuses. In this case it is difficult to extract useful information, so a default middle bounding rectangle replaces all others. The second case is if no bounding rectangles are found, in which case a default rectangle is similarly placed. This is necessary because otherwise empty frames are effectively discarded and the virtual cinematographer will begin to lag behind the actual video, since the wrong rectangles are assigned to frames as a result of the skipped empty frames.

Figure 8 - Nearby Rectangles (left) and Their Resulting Bounding Rectangle (right)
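The two fallback checks could look roughly like the following; the clutter limit of 30 rectangles and the shape of the default centre rectangle are illustrative guesses rather than the project's values.

#include <opencv2/opencv.hpp>
#include <vector>

void applyFallbacks(std::vector<cv::Rect>& rects, const cv::Size& frameSize,
                    std::size_t maxRects = 30) {
    // A frame with far too many rectangles (camera refocus, mass movement)
    // or with none at all falls back to a default centre rectangle,
    // so that no processed frame is left without a position.
    if (rects.size() > maxRects || rects.empty()) {
        cv::Rect centre(frameSize.width / 2 - frameSize.width / 8,
                        frameSize.height / 2 - frameSize.height / 4,
                        frameSize.width / 4, frameSize.height / 2);
        rects.assign(1, centre);
    }
}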

3.5 Ghost Tracking
This section follows Figure 3 Steps (7 and 8) and explains what the "Ghost" class does as well as how ghosts are chosen and used.

A ghost is a rectangle whose size and longevity depend on how it intersects with rectangles from the recognition stage (Section 3.4). The ghost tracks the movement detection rectangles across multiple frames, recording how long an object moves as the number of frames it is tracked in. The ghost takes uncorrelated rectangles in multiple frames and establishes a relationship between them, assigning a group of rectangles across many frames to a single entity.

A new ghost is instantiated when a rectangle is created that doesn't intersect with any previous ghosts. This ghost will have the dimensions of the rectangle as well as a time on screen of 1 frame. In successive frames, if this new ghost intersects other rectangles, it will grow outwards towards that rectangle's extremities, as shown in Figure 9 below.

Figure 9 - A Ghost (white) Updating Position Towards the Movement Detected (blue)

If no intersection is found, or the intersection is below a percentage of the ghost's area, then the ghost will shrink inwards. Once the ghost has become small enough it will be deleted. The assumption behind this is that people in the view will constantly be moving at least a small amount. This should ensure that human objects are constantly tracked, whereas objects such as boards are only tracked when they are moved. The resizing of the ghost based on intersected rectangles also allows the ghost to track movement laterally, as in Figure 9. This assumes that people do not move quickly enough to exit the ghost.
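A minimal sketch of this grow/shrink behaviour is shown below, including the soft reset described later in this section. The overlap fraction, the shrink step and the deletion size are parameter choices of ours, not values taken from the project; the division by 3 corresponds to keeping a third of the accumulated screen time.

#include <opencv2/opencv.hpp>
#include <algorithm>

struct Ghost {
    cv::Rect box;            // current extent of the ghost
    int framesOnScreen = 1;  // how long this entity has been tracked

    // Grow towards an intersecting motion rectangle, otherwise shrink inwards.
    void update(const cv::Rect& motion, double minOverlap = 0.1, int step = 8) {
        cv::Rect overlap = box & motion;
        if (overlap.area() > minOverlap * box.area()) {
            box |= motion;   // expand towards the movement's extremities
            ++framesOnScreen;
        } else {
            box.x += step;   // shrink inwards from every side
            box.y += step;
            box.width  = std::max(0, box.width  - 2 * step);
            box.height = std::max(0, box.height - 2 * step);
        }
    }

    // The caller deletes the ghost once it has shrunk away.
    bool tooSmall() const { return box.width < 20 || box.height < 20; }

    // Soft reset applied every 120 frames: keep only a third of the screen time.
    void softReset() { framesOnScreen /= 3; }
};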

Figure 10 shows the logic for ghosts being merged together and split apart. The merge and split algorithm was introduced to handle occlusion of two or more tracked objects. It allows the highest screen time count of the merged ghosts to be recorded and kept. Figure 11 illustrates the split algorithm running: it shows two rectangles that are a certain distance apart, greater than the threshold, and therefore being split.

As a form of reset, in case the system picks up the wrong object as the lecturer, there is a reset function that reduces each ghost's time by two thirds. This operation is performed every 120 frames and acts as a soft reset. We found through testing that 120 frames was frequent enough to affect any object visible on screen for more than 4 seconds; this is because it took us an average of 4 seconds to walk across the lecture view. It therefore only affects objects that are attempting to stay in the view.

At the end of the frame read loop, Step (8), all rectangles that survived until this point as well as all ghosts are saved for post-read processing.

Figure 10 - Ghost Intersect and Divide Logic

//intersect
for each ghost G1
  for each ghost G2
    if G1 intersects G2
      merge G1 and G2

//divide
for each ghost G
  for each rectangle R
    if G intersects R
      store R index in I
  for each i1 in I and each other i2 in I
    if i1 and i2 don't intersect and are in line on the y-axis
      if distance(i1, i2) > minDistance
        split G into G(i1) and G(i2)

Figure 11 - Ghost (white) Being Split into Two

3.6 Lecturer Selection
This stage follows Figure 3 Steps (9 and 10) and processes all ghosts stored while the video file was being read. It focuses on deciding which ghost is the lecturer and then shifting the boundaries of the ghost to better track the lecturer.

To find the lecturer, we begin by assuming that the lecturer will tend to be in the center of the screen. The lecturer is selected as the ghost with the highest value of a distance ratio operation. The ratio is the distance from the x-sides and y-sides on a scale from 0 to 1. Figure 13 displays an example of this scaling.

val_x = onScreenTime × ratioX²
val_y = onScreenTime × ratioY²

Figure 12 - Positional Importance Formulae

Figure 13 - Positional Importance Visualization

The final decision value is calculated as the average of val_x and val_y, shown in Figure 12. This means that movement low down or high up on the screen can be ignored, as this is likely seated students, a student crossing the venue or a projector screen moving above. Since we assume the camera is pointed at the center of the lecture space, it also makes sense that the lecturer will occupy this central space.
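A worked sketch of this weighting follows. How ratioX and ratioY are derived from a ghost's position is our interpretation (1 at the frame centre, 0 at the edges); the squaring and averaging follow Figure 12.

#include <opencv2/opencv.hpp>
#include <cmath>

double lecturerScore(const cv::Rect& ghost, int onScreenTime, const cv::Size& frame) {
    double cx = ghost.x + ghost.width / 2.0;
    double cy = ghost.y + ghost.height / 2.0;

    // 0 at the left/right (top/bottom) edges, 1 at the centre of the frame.
    double ratioX = 1.0 - std::abs(cx - frame.width  / 2.0) / (frame.width  / 2.0);
    double ratioY = 1.0 - std::abs(cy - frame.height / 2.0) / (frame.height / 2.0);

    double valX = onScreenTime * ratioX * ratioX;   // val_x = onScreenTime * ratioX^2
    double valY = onScreenTime * ratioY * ratioY;   // val_y = onScreenTime * ratioY^2
    return (valX + valY) / 2.0;                     // the average is the final decision value
}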

The last step is Step (10), which focuses on adjusting the selected lecturer's bounding rectangle, as explained in Figure 14. This is to better match the original rectangles generated by this module. It is necessary because the ghosts used to determine the lecturer tend to lag slightly behind the lecturer's movements.

Figure 14 - Lecturer Rectangle Adjustment Logic

for each lecturer L
  for each intersecting rectangle at L.index
    if rectangle intersects L and isn't 2x larger than L
      store intersecting rectangle in R
  for each stored rectangle R
    shift boundaries of L as average of all R and L

The locations of the lecturer are saved as a vector of rectangles and can be accessed later by the virtual cinematography module of the program.

4. EXPERIMENTAL DATA AND SETUP
To test our system, we devised a set of 17 use cases illustrating how a lecturer might move in a lecture. We then acted them out while recording using a 4K camera, and these videos were processed through this module. The use cases chosen are fairly broad and cover many common cases as well as a couple of edge cases that test the limits of this module. We also analyzed a set of 5 videos of real lectures, and for each of our 17 use cases we estimated a likelihood score (1-5), explained in Table 1.

Table 1 - Explanation of Use Case Scores

Likelihood      No.  Definition
Very Unlikely   1    Occurred once in one of the 5 lecture videos
Unlikely        2    Occurred at least once in at least 3 of the lecture videos
Possible        3    Occurred at least once in all 5 of the lectures
Likely          4    Occurred at least 5 times in all 5 of the lectures
Very Likely     5    Occurred more than 5 times in all 5 of the lectures

Table 2 - Lecture Use Cases

No.  Description                                                                                              Likelihood (1-5)
1    Basic lecturing - little movement, no pacing or gesturing.                                               5
2    Lots of hand waving - moderate movement, little pacing, lots of gesturing.                               4
3    Lots of pacing - lots of movement, lots of pacing, little gesturing.                                     4
4    Changing light - moderate movement, lecturing while light conditions change often.                       2
5    Moving boards - lecturing while moving boards up and down often.                                         3
6    Screens and movement - setting projector screens to go up and down while lecturing, lots of lecturer movement and pacing.  2
7    Screens and stationary - setting projector screens to go up and down while lecturing with no movement.   2
8    Off and on - move off the screen and then back on.                                                       3
9    Student crosses - lecturing in center while student crosses view.                                        3
10   Both move - both student and lecturer walk from one side of view, lecturer halts in middle.              2
11   Both move opposite - both student and lecturer approach from different sides of the view.                1
12   Running - lecturer running across the view.                                                              1
13   Throwing - lecturer throwing an object to a student.                                                     1
14   Multiple students - simulate a 3-student presentation.                                                   1
15   Two students cross - students walk past lecturer in middle from either side of the view.                 1
16   Student chairs - student moving along chairs in bottom part of view.                                     3
17   No movement - no one in the view.                                                                        2

We then constructed Table 2 using this scoring system, which shows the level to which the behavior in each use case occurred. We decided that use cases with likelihood 3 and above (1, 2, 3, 5, 8, 9 and 16) are grouped as likely to occur, whereas use cases 4, 6, 7, 10, 11, 12, 13, 14, 15 and 17 are grouped as unlikely to occur. This division helps us evaluate the overall success of this module, as core use cases are more important to cover than edge use cases because they occur more often. This distinction is also used to evaluate research aims 2 and 3, which are the same objective for more and less likely use cases respectively.

To evaluate the tracking performance of our approach, we stepped through the use case videos at a rate of 4 frames per step and counted the number of times the program incorrectly identified the lecturer. An incorrect identification is counted when the rectangle tracked the wrong person or a moving object, or lagged behind the movement of the lecturer (having the lecturer completely outside of the rectangle but still tracking in their direction). In our final results we simplified the data by rounding frames to the nearest second; this handles the possibility of variable framerates occurring between videos.

To evaluate the runtime of our system, the processing time of each use case is recorded. If the average processing time of all our use cases takes longer than 2 times the average length of the use case videos, the research aim is considered failed.

5. RESULTS
All data presented below was produced using an Intel i7-6700HQ processor with a boost frequency of 3.5 GHz, 8 threads and 6 MB cache, 8 GB of DDR3 memory and a 7200 rpm HDD.

Table 3 - Use Case Evaluation Results

No.  Length (s)  Process Time (s)  % Process Time  Mistracked (s)  % Correctly Tracked
1    57          66.192            116.13%         2               96.49%
2    31          32.756            105.66%         0               100.00%
3    48          46.648            97.18%          0               100.00%
4    85          90.575            106.56%         60              29.41%
5    40          44.879            112.20%         3               92.50%
6    48          56.975            118.70%         1               97.92%
7    46          52.147            113.36%         3               93.48%
8    40          45.824            114.56%         3               92.50%
9    72          85.613            118.91%         4               94.44%
10   17          19.670            115.71%         1               94.12%
11   16          21.707            135.67%         0               100.00%
12   24          28.274            117.81%         14              41.67%
13   32          40.115            125.36%         2               93.75%
14   162         156.311           96.49%          62              61.73%
15   30          37.577            125.26%         3               90.00%
16   32          36.397            113.74%         1               96.88%
17   77          76.248            99.02%          0               100.00%

5.1 Runtime of Solution
Table 3 shows the time taken to process each use case as a percentage of the video's length. The results are within the parameters of research aim 1, as each falls well below 200%, with some even dipping below 100%. We decided to read every 4th frame because we found this a good balance between keeping this module time efficient and preventing objects from moving too far between successive processed frames. These results can be further improved with the use of a solid state drive (SSD); we ran some tests with an SSD and found that it largely reduced the processing time. As illustrated in Figure 15, 68% of the time taken to process is read operations. This is likely because of the large size of a single 4K frame, at 7.45 MB. It is likely that our fast processing times are due to the efficiency of OpenCV and its heavy parallelism. We were able to verify the effectiveness of OpenCV's parallel implementations by watching the Windows task manager while running the program; it revealed that up to 30 threads were being pooled at any one time. It is also important to note that our processes described in Sections 3.4, 3.5 and 3.6 take up a very small portion of the processing time (less than 13%). Most processing in this project is attributed to the computer vision techniques implemented by the OpenCV library and to the read operations.

Figure 15 - Operation Time Taken

5.2 Likely Use-Cases
Table 3 (blue highlighted numbers) shows that all of our likely use cases have a tracking rate of at least 90%. This means that for a typical lecture this system can correctly track the lecturer at least 90% of the time, thus completing our research aim 2.

The solution presented registers large movement well. This is because the first step of our algorithm employs absolute difference background subtraction. Without movement there is nothing for the system to register, and therefore the center of the screen will be tracked. One of our stated assumptions is that the 4K cameras are always pointed at the center of the lecture stage. Because of this assumption, use cases where there was minute or no movement tended to track the lecturer as they were stationary resting by the podium. For use cases where the lecturer was stationary in other areas of the room, slight movements were still recognized and tracked.

The moving boards use case (5) tests lots of lecturer movement and board movement. In this use case the lecturer is contained within the greater tracked rectangle of the board being moved. While this can be considered a failed detection, for our system and specifically the virtual cinematographer module it has no negative effect, as the lecturer/board will still be tracked correctly horizontally. The virtual cinematographer doesn't need to pan vertically as of yet.

One of the likely use cases is that a student crosses the view of the lecturer. What this means for our module is that the lecturer and student will be pushed into a single ghost and given the same time on screen. Because of the setup of a lecture, the student will either be going to a seat or coming from a seat. In either case they will not be in the view for long, and when they approach the edges of the view their chance of being tracked is reduced severely. This is because of our weighted values approach, illustrated in Figure 13.

Another use case is a student bending over in the view to seat themselves in the front row. This is handled because we assume a lecturer will be vertically in the middle of the frame, given the way in which the cameras are positioned. The projector screen will be the only thing moving at the top of the screen and students the only things moving at the bottom.

The system developed and the stages explained in Section 3.4 aim to capture as much useful movement data as possible and then trim this data to fit our assumptions. The remaining data can then be assumed to be an object being moved or a person moving themselves. This is why our system works for all of these likely use cases: in each we have identified a whole moving form and have made decisions on those identifications.

5.3 Unlikely Use-Cases
Table 3 (yellow highlighted numbers) shows that 7 out of 10 of our unlikely use cases are correctly tracked at least 90% of the time. Cases 4, 12 and 14 are the cases that failed the 90% requirement. Therefore, we mark our research aim 3 as failed.

When viewing the video for use case 4 we realized that the recording wasn't smooth and often had duplicate frames in sequence for a couple of frames at a time. In this particular use case we put an emphasis on switching various lecture lights on and off and opening and closing the blinds to test using natural lighting as well. When a shift in light occurred, the 4K camera would correct the light very well, going so far as to grayscale the image and add illumination in very dark settings. We hypothesized that the camera's processor couldn't keep up with this processing and thus duplicated frames instead of processing new ones. This is possibly why the video given to us was very laggy and jittery. In either case this violates one of our original assumptions, that we would be processing smooth video with continuous movement, so we can discount this use case.

Use case 12 represents a lecturer running across the lecturing area. When processed, this video captures the lecturer's movement, but when they move too quickly (completely out of the original tracking frame) the lecturer is lost until this frame shrinks to nothing. This is a result of how our system moves "Ghosts" in the scene. Each frame that generates new ghosts searches for intersections with ghosts recorded in previous frames. If a new ghost is too far away from a previous ghost, it is assumed there is no correlation. In this case, the fact that we skip 3 frames for every processed frame means that if the lecturer moves very quickly (which is a very unlikely case) the program might not track the lecturer properly. This could feasibly be solved by processing more frames; however, this would increase the processing duration of the module.

Use case 14 represents multiple students giving a presentation. This use case only tracked the lecturer correctly 61.7% of the time. Fundamentally, this is meant to be a difficult use case for the module to handle. When 3 students are presenting, there is no indication other than voice and nuanced movement of who the lecturer is at any one moment. While our system was developed to handle temporary passing and occlusion of students, it still fundamentally only handles one lecturer. With this in mind, the problems we noticed were that students who weren't lecturing but were in the view continued moving and thus retained their screen time count. Additionally, because the role of speaker is passed between the students, the screen time counts are all mixed together. Therefore, the lecturer is often decided by which of the 3 is most central in the view.

The use cases that include complicated configurations of students crossing the view (10, 11, 15) act similarly to use case 9, which is the simplest configuration of this type. The lecturer was tracked sufficiently well even though students who pass the lecturer pick up the lecturer's on-screen time. This occurs when ghosts merge and then split, as in Figure 11. These students always move towards an edge, so the actual lecturer was correctly chosen.

Use cases with moving projector screens (6, 7) also tracked the lecturer well. This is because of the projector screen's bottom weight: the entire screen is a uniform colour and material except for the weight, so only the weight at the bottom is detected as movement. The bottom weight's very large width to height ratio means its bounding rectangle is ignored early on. Furthermore, bounding rectangles very high up are also ignored; this covers the case where the projector screen is almost completely rolled up. When the lecturer isn't moving, as in use case 7, the tracking defaults to the center of the room, where the lectern is and therefore where the lecturer is.

The 2 valid use cases that failed the 90% requirement are both very unlikely cases. Although this failure means that our tracking solution can't handle all the less likely cases, our results show that the proposed solution is still viable for use in a general lecturing environment.

6. CONCLUSIONS
In this paper, we presented a software solution for the lecturer tracking module of the 4K lecture recording project. The module correctly identifies the lecturer for most use cases with a run-time well within our defined restrictions. The run-time of the solution is an average of 1.13 times the length of the video. We highly recommend that the software be run on an SSD with a very fast read speed, because the read time reduces the processing speed severely. We also discuss other efficiencies in future work. Although our solution fails on 3 out of 17, less plausible, use cases, we still believe it is a viable piece of software to be used in lecture recording. This belief is based on the result that likely use cases will occur much more often than unlikely use cases. We have made many assumptions about the layout of a lecture theatre, and it is important to note that a more robust solution is needed to generalize our software. This includes lecture theatres with no podium or a podium in another location, and a camera mounted less centrally and therefore aimed at an angle.

7. FUTURE WORK
Gesture recognition could provide a new and useful element to our system. Gestures could be used by the virtual cinematographer module to create more informative panning, such as sideways panning in the direction of a gesture. Gesture recognition could be implemented using the contour information that is calculated in an early step. The lecturer would be analyzed to segment arms and therefore track whether the lecturer is gesturing or resting their arms.

Another addition that would make this module more robust would be to analyze contours when a board moves in order to segment the lecturer out. This is necessary as often the lecturer and board are given the same line segment at the contour detection step and are assumed to be one object. Detection of a board merging with a lecturer could be done by analyzing when a rectangle's size varies wildly between frames.

A recognition algorithm could be implemented on individual ghosts to detect colour features, perhaps a form of histogram equalization. This could then be used to correctly redistribute on-screen time when two ghosts merge and then split again.

The system for deciding which ghost is the lecturer could be extended to make use of temporal states. Because we record all ghost positions and then process them, we can watch for ghosts that disappear off the screen and remove all occurrences of them.

The project could be extended to begin by segmenting out the podium or center of lecturer movement. This could then be used as the center of the lecturing area. In this case a robust scaling system would need to be developed to change the scaling of the formulae from Figure 12.

A movement mask is calculated by the pre-processing module of the program. The tracking module could make use of that mask to limit processing in areas where there is no movement by doing a simple binary check of whether that area is available. This would likely require integration with OpenCV's implementations of absolute difference, thresholding and contour detection.

8. REFERENCES
1. Arseneau, Shawn, and Jeremy R. Cooperstock. "Presenter tracking in a classroom environment." Industrial Electronics Society, 1999. IECON'99 Proceedings. The 25th Annual Conference of the IEEE. Vol. 1. IEEE, 1999.
2. Canny, John. "A computational approach to edge detection." IEEE Transactions on Pattern Analysis and Machine Intelligence 8.6 (1986): 679-698.
3. Durand, Frédo, and Julie Dorsey. "Fast bilateral filtering for the display of high-dynamic-range images." ACM Transactions on Graphics (TOG). Vol. 21. No. 3. ACM, 2002.
4. Gao, Wenshuo, et al. "An improved Sobel edge detection." Computer Science and Information Technology (ICCSIT), 2010 3rd IEEE International Conference on. Vol. 5. IEEE, 2010.
5. Haralick, Robert M., Stanley R. Sternberg, and Xinhua Zhuang. "Image analysis using mathematical morphology." IEEE Transactions on Pattern Analysis and Machine Intelligence 9.4 (1987): 532-550.
6. Jarosz, Wojciech. "Fast image convolutions." SIGGRAPH Workshop. 2001.
7. Larkin, Helen E. "'But they won't come to lectures...' The impact of audio recorded lectures on student experience and attendance." Australasian Journal of Educational Technology 26.2 (2010): 238-249.
8. Prechelt, Lutz. "An empirical comparison of C, C++, Java, Perl, Python, Rexx and Tcl." IEEE Computer 33.10 (2000): 23-29.
9. Prechelt, Lutz. "Comparing Java vs. C/C++ efficiency differences to interpersonal differences." Communications of the ACM 42.10 (1999): 109-112.
10. Soong, Swee Kit Alan, et al. "Impact of video recorded lectures among students." Who's Learning (2006): 789-793.
11. Suzuki, Satoshi, and Keiichi Abe. "Topological structural analysis of digitized binary images by border following." Computer Vision, Graphics, and Image Processing 30.1 (1985): 32-46.
12. Zhang, Cha, et al. "Hybrid speaker tracking in an automated lecture room." 2005 IEEE International Conference on Multimedia and Expo. IEEE, 2005.
13. Zivkovic, Zoran. "Improved adaptive Gaussian mixture model for background subtraction." Pattern Recognition, 2004. ICPR 2004. Proceedings of the 17th International Conference on. Vol. 2. IEEE, 2004.

Appendix A - Track 4K Class Diagram