
A Machine Learning System For Human-in-the-loop Video Surveillance

Ulas Vural and Yusuf Sinan Akgul
GIT Vision Lab, http://vision.gyte.edu.tr/,

Department of Computer Engineering, Gebze Institute of Technology
41400 Kocaeli, Turkey

{uvural, akgul}@bilmuh.gyte.edu.tr

Abstract

We propose a novel human-in-the-loop surveillance system that continuously learns the properties of objects that are interesting for a human operator. The interesting objects are learned automatically by tracking the eye gaze positions of the operator while he or she monitors the surveillance video. The system automatically detects interesting objects in the surveillance video and forms a new synthetic video that contains the interesting objects at earlier positions in the time dimension. The operator always views this synthetically formed video, which makes manual video retrieval tasks more convenient. Sensitivity to operator interests and to changes in those interests is another major advantage. We tested our system on both synthetic and real videos, which are provided as supplementary materials [1]. The results show the effectiveness of the proposed system.

1 Introduction

Digital surveillance technologies have become very popular as initial set-up costs decrease [11]. In many cases, customers demand a larger number of cameras for better coverage. Although extended coverage is desirable, the amount of visual data grows with the number of cameras. Recorded surveillance videos and live video streams must be examined for events and suspicious actions, and there is a considerable number of studies on automatic inspection of these data [15]. Fully automated video analysis approaches [4] are fast, but they work only for a limited number of models and their success is far from the desired levels [8]. Human operators generally make better decisions than advanced surveillance systems at critical times [13], but manual processing of the information is a hard and time-consuming task. Therefore, we argue that a human-in-the-loop approach should be adopted to increase the efficiency of human operators in the surveillance task.

Video synopsis is one popular approach for increasing operator efficiency [9]. Synopsis methods show all actions in the video archive in a shorter time. Key-frame based video synopsis methods drop frames with a negligible amount of action, but these methods can easily discard some important actions when they are forced to produce shorter synopsis videos. Non-linear video synopsis methods [2] preserve actions better by changing the time positions of the actions in the video instead of discarding frames. One main problem of non-linear synopsis methods is that they generally produce complex output videos that are full of actions, regardless of whether those actions are interesting or not.

Human operators have limited tracking capabilities [3] and can easily overlook important actions when the scene is crowded. Human psychology should be taken into account to increase surveillance system reliability, because the number of objects that can be tracked by an operator depends strongly on the operator's experience, fatigue, and workload [13]. Recently, eye-gaze based metrics have been used in surveillance systems to measure the performance of operators. In [12], we track the operators' eye-gaze positions to determine whether an action in the scene has been monitored. That system then gives a second chance to the operator by showing a non-linear synopsis of the overlooked parts again. Another eye-gaze based method addresses the problem of crowded synopsis output videos [13]. This method adjusts the trackability of the surveillance video according to the operator's attention level. These methods support operators and increase system reliability by preventing overlooks. Eye-gaze based metrics are also used for measuring human interest, especially for designing shop windows or internet sites [6], but none of the existing human-in-the-loop surveillance systems analyses the features that operators find interesting.

In this paper, we propose a novel method that synthesizes new surveillance videos according to the operator's interest. Surveillance operators generally have special interests in some types of objects or actions. These objects and actions vary according to the surveillance task at hand and the operator's motivation. Furthermore, a special action or object seen in the video may change the interest of the operator. Thus, an online learning algorithm is a vital part of our system. We define the surveillance video synthesis problem as an online learning problem and introduce a novel learning scheme that classifies each action as interesting or not. Image and motion based features of an action are extracted, and the action is classified as interesting if the eye-gaze positions of the operator match the action for some duration. For unseen video sections, the dynamically learned features are used to determine the interesting actions or objects, and the chronological positions of these actions or objects are changed in the synthesized video so that the operator sees them earlier. This makes any manual video retrieval task more convenient for the operators. Similar video synthesis systems [10] classify the observed actions into several different classes or clusters without a dynamic operator model. As a result, they cannot easily adapt themselves to different operators and changing operator interests. Our method, on the other hand, continuously trains the learning scheme, and if the operator shows a change of interest, the system automatically adjusts itself. Furthermore, our synthesized video includes every action or object in the video and only changes their chronology, so there is no missed-action problem.

2 Method

Our system uses methods from the fields of computer vision (CV), human-computer interaction (HCI), and machine learning (ML). An overview of the proposed method is shown in Figure 1.

Figure 1. Overview of our method: (a) extract object tubes, (b) extract features of the tubes, (c) classify the tubes, (d) select actions to show on the monitor, (e) online training, (f) eye-tracking of the operator. The input is the video stream from the surveillance camera; the stages span the CV, ML, and HCI components.

Computer vision algorithms are used to extract the set of all actions A from the input video V (Fig. 1-(a)). An action ai ∈ A is represented with an object tube ti ∈ T, which is composed of a sequence of related bounding boxes [10]. We use a mixture-of-Gaussians based foreground/background subtraction algorithm to find the moving objects and their rectangular bounding boxes.
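The paper does not give implementation details for this stage. As a rough sketch of how the per-frame foreground boxes behind the object tubes could be obtained with a mixture-of-Gaussians subtractor (here OpenCV's MOG2 variant; the helper name and all parameter values are our own assumptions, not the authors' code):

```python
import cv2

def extract_bounding_boxes(video_path, min_area=200):
    """Sketch: per-frame foreground boxes via MOG background subtraction.

    Linking boxes across frames into object tubes (data association)
    is omitted here; the paper follows the tube formulation of [10].
    """
    cap = cv2.VideoCapture(video_path)
    bg = cv2.createBackgroundSubtractorMOG2(history=500, varThreshold=16)
    boxes_per_frame = []
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        mask = bg.apply(frame)                       # foreground mask
        mask = cv2.medianBlur(mask, 5)               # suppress speckle noise
        contours, _ = cv2.findContours(mask, cv2.RETR_EXTERNAL,
                                       cv2.CHAIN_APPROX_SIMPLE)
        boxes = [cv2.boundingRect(c) for c in contours
                 if cv2.contourArea(c) >= min_area]  # (x, y, w, h)
        boxes_per_frame.append(boxes)
    cap.release()
    return boxes_per_frame
```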

After obtaining an action set A, image and motion based low-level features X of these actions are extracted (Fig. 1-(b)). Histograms of local binary patterns (LBP) are used to represent the texture of the objects [14]. LBP based features are computed for each bounding box, for which an eight-bin histogram is constructed. The RGB color histogram of a bounding box provides another image based feature vector. We also use the sizes of the bounding boxes, their aspect ratios, and their center positions as additional image based features. The direction of movement and the speed are used as motion features. The motion based features are not computed for each bounding box; they are computed only once per action. The resulting feature vector xi for an action ai contains 48 elements.
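For illustration only, a per-box image feature vector along these lines could be assembled as below. The paper does not specify the exact binning that yields its 48-element vector, so the histogram sizes (8 LBP bins, 8 bins per RGB channel) and the helper name are assumptions; the motion features, computed once per action, would be appended separately.

```python
import numpy as np
import cv2
from skimage.feature import local_binary_pattern

def box_features(frame, box):
    """Sketch of the image-based part of a feature vector for one box.

    Assumed layout (not from the paper): 8-bin LBP histogram,
    3 x 8-bin RGB histogram, size, aspect ratio, center (x, y).
    Per-action motion features (direction, speed) are not included here.
    """
    x, y, w, h = box
    patch = frame[y:y + h, x:x + w]
    gray = cv2.cvtColor(patch, cv2.COLOR_BGR2GRAY)

    lbp = local_binary_pattern(gray, P=8, R=1, method="uniform")
    lbp_hist, _ = np.histogram(lbp, bins=8, density=True)

    rgb_hist = [np.histogram(patch[:, :, c], bins=8,
                             range=(0, 256), density=True)[0]
                for c in range(3)]

    geometry = [w * h, w / float(h), x + w / 2.0, y + h / 2.0]
    return np.concatenate([lbp_hist, *rgb_hist, geometry])
```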

An online boosting method based on the gentleboost algorithm [7] is proposed to learn interesting actions in a binary classification setting. Our binary boosting algorithm takes inputs from X ⊂ RD, where D is the dimension of a feature vector x, and maps them to class labels in Y = {+1, −1}. The binary boosting algorithm is continuously trained on a data set {xi, yi}, i = 1..N, where xi ∈ X, yi ∈ Y, and N is the number of training samples. The class label yi of a training sample action ai is either interesting (yi = 1) or uninteresting (yi = −1). If an action ai is eye-tracked (Fig. 1-(f)) by the operator for longer than a given threshold, the system marks the action as interesting. The system determines the tracking time by measuring the duration of the overlap between the eye gaze position and the bounding boxes of ai. Note that our training set is continuously updated with the actions recently viewed by the operator, which can be achieved only by online training (Fig. 1-(e)). The training results are used in a classifier C to assign class labels to the features xi of the actions ai detected from the input video (Fig. 1-(c)). An action queue Q stores all unmonitored actions ai and their class labels yi = C(xi). A selection priority P(ai) of each action ai is calculated by

P(ai) = C(xi) + λR(ai),    (1)

where λ is a weighting constant and R(ai) ∈ [0, 1] assigns a random number that depends on the arrival time of ai, so that P(ai) is larger both for interesting actions and for actions that happened earlier. Actions with higher selection priorities are shown to the operator first (Fig. 1-(d)) using the video synthesis method of [13]. Psychological studies indicate that a human operator can only track at most 4 independently moving items at a time [3]. Therefore, the number of active actions visible at any given time is limited to at most 4.
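The labeling and selection steps can be summarized with the following sketch. The gaze-overlap threshold, the queue representation, and the exact form of R(ai) are our assumptions; `classifier` stands in for the online gentle-boost model C of Eq. 1.

```python
import random

def label_from_gaze(overlap_seconds, threshold=1.0):
    """Label an action +1 (interesting) if the operator's gaze overlapped
    its bounding boxes longer than the threshold, else -1 (uninteresting).
    The 1-second threshold is an assumed value, not taken from the paper."""
    return 1 if overlap_seconds > threshold else -1

def select_actions(queue, classifier, lam=0.5, max_visible=4):
    """Rank unmonitored actions by P(a) = C(x) + lambda * R(a) (Eq. 1) and
    return at most 4 of them, matching the tracking limit reported in [3].

    Each action is assumed to be a dict with a 'features' vector and an
    'arrival_fraction' in [0, 1] (0 = start of the video, 1 = end)."""
    def priority(action):
        # R(a): random number in [0, 1], biased toward earlier arrivals
        r = random.uniform(0.0, 1.0) * (1.0 - action["arrival_fraction"])
        return classifier(action["features"]) + lam * r
    ranked = sorted(queue, key=priority, reverse=True)
    return ranked[:max_visible]
```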

3 Experiments

We tested our system on both synthetic and real world videos. First, we built a synthetic test video of length 120 minutes to evaluate our system in a controlled setting. The synthetic test video includes 30 different types of objects, which have 5 different patterns and 6 different colors (Fig. 2). These objects move in random directions at random speeds. A total of 1200 actions are included in the test video. Figure 4-(a) shows the graph of cumulative object counts for each object type at a given time position, which shows that the count values increase regularly for all 30 object types.

Figure 2. Sample synthetic objects: 6 colors (Red, Yellow, Green, Cyan, Blue, Magenta) and 5 patterns (Light Square (LS), Dark Square (DS), Dark Circle (DC), Light Circle (LC), No Pattern (NP)).

This test video is provided to our system as the video stream from a surveillance camera (see Fig. 1). We used the eye-gaze tracker of Arrington Research [5] to obtain the operator eye gaze positions. We ran our system 5 times on the test video. For each of the tests, we asked the volunteer operators to track a specific object or objects for the first 10 minutes and then asked them to track another object or objects for the next 10 minutes. Table 1 shows the object property values used for each test.

Table 1. Properties of interesting objects for the 5 synthetic test cases

       The first 10 minutes      The second 10 minutes
Test   Color      Pattern        Color           Pattern
1      Red        All            Magenta         DC
2      Red        DC             Blue            DC
3      Blue       NP             All             LS
4      Magenta    LC             Magenta         NP, DS
5      Yellow     LS             Yellow, Cyan    LC

For each experiment, we kept the cumulative counts of each object type that the operator monitors. Figure 4-(b) shows the graph of cumulative object counts for experiment 1. The counts of all red objects increase sharply during the first 10 minutes because we asked the operator to track red objects of all patterns during that period. During the next 10 minutes, the counts of magenta dark circle objects increase sharply, as expected. All other (uninteresting) object counts increase slowly over the whole experiment. We include only two uninteresting object types in Figure 4 for brevity. We see similar results for the other experiments (Fig. 4-(c,d,e,f)). Note that a very large percentage of the interesting objects are viewed in the first 20 minutes of the experiment, which shows the suitability of the proposed system for a manual video retrieval/search task. Please see the supplementary materials for the complete videos of these experiments [1].

Figure 3. Sample frames from the real dataset. (a) All gray automobiles are interesting. (b) Humans are interesting, cars are not. The numbers overlaid on the frames show the original → synthesized frame positions of the actions.

We performed 2 experiments on real data using a surveillance video from a university campus setting. The original video was 120 minutes long. For the first experiment, we asked the operator to track gray automobiles. The synthesized video frames are shown in Figure 3-a. We show the original and synthetic video frame numbers of the actions in the top left corners of the frames, which shows that the gray automobile actions are moved to earlier times in the synthesized videos. For the second experiment, we asked the operator to track humans instead of automobiles. Figure 3-b shows a few sample frames from the resulting videos. Note that objects other than automobiles are also shown at earlier times, which is expected because the selection priority formula (Eq. 1) includes a term for randomness and a bias toward earlier objects. Note also that the initial frames of the produced videos may include uninteresting objects, because at that phase the online learning algorithm does not yet have enough training data to make good decisions. Please see the supplementary materials for the complete videos of the real data experiments [1]. Both the real and synthetic data experiments showed that our system can successfully adapt to operator interests. The final system makes it very convenient for the operator to go through a lengthy video in search of desired objects and actions without giving explicit instructions to the system.


Figure 4. Graphical results of the synthetic experiments: cumulative number of objects versus time (minutes) for (a) the original video and (b)-(f) the synthesized videos for Tests 1-5. (*): Interesting object types for the 1st 10 minutes, (**): Interesting object types for the 2nd 10 minutes.

4 Conclusions

A novel surveillance video synthesis method that uses online operator interest is introduced. The method uses eye-gaze measures of the operator to find interesting actions and synthesizes the interesting actions first. The results of the experiments show that the system adapts well to interest changes. In addition, the time required for monitoring the interesting actions is decreased. The proposed system can be used for real-time surveillance monitoring and offline retrieval.

The proposed system combines methods from human-computer interaction, computer vision, and machine learning. Operator-interest based binary classification of objects is novel. The system is robust against classification errors because it changes the times of actions instead of filtering out uninteresting classes. We plan to extend our system to work with multiple operators and multiple video streams, where each operator monitors the surveillance video with different interests. This method could also be used for analysing the monitoring strategies, subjective interests, and overall performance of surveillance operators.

Acknowledgements

This work is supported by TUBITAK Project 110E033.

References

[1] Project site: http://vision.gyte.edu.tr/projects.php?id=7.

[2] A. Rav-Acha, Y. Pritch, and S. Peleg. Making a long video short: Dynamic video synopsis. In CVPR, pages I: 435–441, 2006.

[3] G. Alvarez and S. Franconeri. How many objects can you track?: Evidence for a resource-limited attentive tracking mechanism. Journal of Vision, 7(13), 2007.

[4] B. Antic and B. Ommer. Video parsing for abnormality detection. In ICCV, pages 2415–2422, 2011.

[5] K. Arrington. Viewpoint eye tracker. Arrington Research, 1997.

[6] S. Castagnos and P. Pu. Consumer decision patterns through eye gaze analysis. In Workshop on Eye Gaze in Intelligent Human Machine Interaction, 2010.

[7] J. Friedman, T. Hastie, and R. Tibshirani. Additive logistic regression: A statistical view of boosting. The Annals of Statistics, 28(2):337–407, 2000.

[8] H. Keval and M. A. Sasse. Man or gorilla? Performance issues with CCTV technology in security control rooms. In 16th World Congress on Ergonomics Conference, International Ergonomics Association, 2006.

[9] Y. Li, T. Zhang, and D. Tretter. An overview of video abstraction techniques. Technical report, HP Laboratories, Palo Alto, 2001.

[10] Y. Pritch, S. Ratovitch, A. Hendel, and S. Peleg. Clustered synopsis of surveillance video. In AVSS, pages 195–200, Washington, DC, USA, 2009.

[11] M. A. Sasse. Not seeing the crime for the cameras? Commun. ACM, 53(2):22–25, 2010.

[12] U. Vural and Y. S. Akgul. Eye-gaze based real-time surveillance video synopsis. Pattern Recogn. Lett., 30(12):1151–1159, 2009.

[13] U. Vural and Y. S. Akgul. Operator attention based video surveillance. In ICCV Workshops, pages 1955–1962, 2011.

[14] X. Wang, T. Han, and S. Yan. An HOG-LBP human detector with partial occlusion handling. In ICCV, pages 32–39, 2009.

[15] C. Yu, X. Zheng, Y. Zhao, G. Liu, and N. Li. Review of intelligent video surveillance technology research. In EMEIT, volume 1, pages 230–233. IEEE, 2011.