Content-based image and video analysis

Event Recognition

15.06.2009

What is an event?

"a thing that happens or takes place" (Oxford Dictionary)

Examples:
Human gestures
Human actions (running, drinking, etc.)
Human interaction (gunfight, car-crash, etc.)
Sports events (tennis, soccer, etc.)
Nature events (fire, storm, etc.)
…

Content-based image and video retrieval 2

Why event recognition?

Huge amounts of video available

Event recognition is useful for:
Content-based browsing: "Fast forward to the next goal-scoring scene"
Video search: "Show me all videos with Bush and Putin shaking hands"


Human actions

Human actions are major events in movie content

Meaning hidden within visual representation


What are human actions?

Definition 1: Physical body motion

KTH action database (http://www.nada.kth.se/cvap/actions/)


What are human actions?

Definition 2: Interaction with the environment for a specific purpose

Same physical motion – different action depending on the context


Challenges

Variations:
Lighting, appearance, view, background
Individual motion, camera motion

Example: "Drinking" vs. "Smoking". Both actions are similar in overall shape (human posture) and motion (hand motion); they differ only in fine details of shape and motion.


Challenges

Problems:
Existing datasets contain few action classes, captured in controlled and simplified settings
Lots of (realistic) training data needed

Idea:
Realistic human actions are frequent events within movies
Perform automatic labeling of video sequences based on movie scripts


Action event dataset

"Coffee and Cigarettes" dataset
159 annotated "Drinking" samples
149 annotated "Smoking" samples

Temporal annotation: first frame, keyframe, last frame
Spatial annotation: head rectangle, torso rectangle

http://www.irisa.fr/vista/Equipe/People/Laptev/actiondetection.html


Challenges

Problems:
Existing datasets contain few action classes, captured in controlled and simplified settings
Lots of (realistic) training data needed

Idea:
Realistic human actions are frequent events within movies
Perform automatic labeling of video sequences based on movie scripts

Hollywood Human actions (HOHA) dataset


Script-based annotation

Problems:
No time information available: use subtitles to align the scripts with the video
Described actions do not always correspond to the movie scene: assign an alignment score a = (#matched words) / (#all words) to each scene description and apply a threshold (e.g. a > 0.5)
Variability of action expressions in text: use a regularized perceptron to classify action descriptions
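The alignment score a and its threshold are simple to compute. A minimal sketch (the formula and the a > 0.5 threshold come from the slide; the function names and word-list representation are my own):

```python
def alignment_score(script_words, subtitle_words):
    """Fraction of script words matched in the subtitle text:
    a = (#matched words) / (#all words)."""
    subtitle_set = {w.lower() for w in subtitle_words}
    matched = sum(1 for w in script_words if w.lower() in subtitle_set)
    return matched / len(script_words) if script_words else 0.0


def is_reliable(script_words, subtitle_words, threshold=0.5):
    """Keep a scene description only if its alignment score exceeds
    the threshold (the slides suggest a > 0.5)."""
    return alignment_score(script_words, subtitle_words) > threshold
```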


Script alignment

Scripts are publicly available for many movies
Subtitles are available for most movies
Transfer the time stamps to the scripts by text alignment

Example (subtitles and movie script side by side):

Subtitles:
…
1172
01:20:17,240 --> 01:20:20,437
Why weren't you honest with me?
Why'd you keep your marriage a secret?
1173
01:20:20,640 --> 01:20:23,598
lt wasn't my secret, Richard.
Victor wanted it that way.
1174
01:20:23,800 --> 01:20:26,189
Not even our closest friends
knew about our marriage.
…

Movie script:
…
01:20:17 Rick sits down with Ilsa.
RICK: Why weren't you honest with me? Why did you keep your marriage a secret?
01:20:23 ILSA: Oh, it wasn't my secret, Richard. Victor wanted it that way. Not even our closest friends knew about our marriage.
…


Script alignment: Problems

Perfect alignment does not guarantee perfect action annotation in the video

Of 147 actions with correct alignment (a = 1), only 70% matched the video
(a: quality of subtitle-script matching)

Errors:
Temporal misalignment (10%)
Outside the field of view (10%)
Completely missing in the video (10%)

Example of a false positive for "get out of car": the action is not visible in the video!
"A black car pulls up, two army officers get out."


Script-based Annotation

Pros:
Realistic variation of actions
Many examples per class, many classes
No extra overhead for new classes
Character names may be used to resolve "who is doing what?"

Problems:
No spatial localization
Temporal localization may be poor
The script does not always follow the movie
Automatic annotation is useful for training, but not precise enough for evaluation


Retrieving actions in movies

Ivan Laptev and Patrick Perez
ICCV 2007


Actions == space-time objects?

"stable-view" objects vs. "atomic" actions: car exit, phoning, smoking, hand shaking, drinking

Take advantage of the space-time shape (a temporal slice through the video volume)


Actions == space-time objects?

Can actions be considered as space-time objects?

Transfer object detectors to action recognition

Here: only atomic actions are considered (i.e. simple actions)

Temporal slices of the same action under different circumstances are similar

So, (atomic) actions = space-time objects


Action features

Action volume = space-time cuboid region around the head (for the duration of the action)

Encoded with block-histogram features fθ(·), θ = (x, y, t, dx, dy, dt, β, ϕ), defined by:
Location (x, y, t)
Space-time extent (dx, dy, dt)
Type of block (β): β ∈ {Plane, Temp-2, Spat-4}
Type of histogram (ϕ): histogram of optical flow (HOF) or histogram of oriented gradient (HOG)


Action features: HOG features and HOF features


Histogram features

(Simplified) histogram of oriented gradient:
Apply a gradient operator (e.g. Sobel) to each frame within the sequence
Bin the gradients, discretized into 4 orientations, into a block histogram

Histogram of optical flow:
Calculate the optical flow (OF) between frames
Bin the OF vectors, discretized into 4 direction bins (+ 1 bin for no motion), into a block histogram

The normalized action cuboid has size 14x14x8, with units corresponding to 5x5x5 pixels

More than one million possible features fθ(·)
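The simplified per-frame HOG described above can be sketched with plain NumPy. This is an illustrative sketch: central differences stand in for the Sobel operator, and the paper's exact block layout is not reproduced.

```python
import numpy as np

def frame_hog(frame, n_bins=4):
    """Simplified per-frame HOG: gradient orientations quantized into
    n_bins, gradient magnitudes accumulated per bin.
    `frame` is a 2-D grayscale array."""
    gx = np.zeros_like(frame, dtype=float)
    gy = np.zeros_like(frame, dtype=float)
    gx[:, 1:-1] = frame[:, 2:] - frame[:, :-2]   # central differences in x
    gy[1:-1, :] = frame[2:, :] - frame[:-2, :]   # central differences in y
    mag = np.hypot(gx, gy)
    ang = np.arctan2(gy, gx) % np.pi             # orientation in [0, pi)
    bins = np.minimum((ang / np.pi * n_bins).astype(int), n_bins - 1)
    hist = np.zeros(n_bins)
    for b in range(n_bins):
        hist[b] = mag[bins == b].sum()           # magnitude-weighted votes
    return hist / (hist.sum() + 1e-9)            # normalized histogram
```

A frame containing a vertical edge produces purely horizontal gradients, so all the mass lands in the first orientation bin.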


Histogram features

HOG: histograms of oriented gradient
HOF: histograms of optic flow

>10^6 possible features
4 gradient orientation bins; 4 OF direction bins + 1 bin for no motion

Action learning

Use a boosting method (e.g. AdaBoost) to classify features within an action volume

Features: block-histogram features

(Diagram: block-histogram features -> weak classifiers -> boosting -> selected features)

AdaBoost:
Efficient discriminative classifier [Freund & Schapire '97]
Good performance for face detection [Viola & Jones '01]



Action learning: Boosting

A weak classifier h is a classifier with accuracy only slightly better than chance

Boosting: combine a number of weak classifiers so that the ensemble is arbitrarily accurate

Allows the use of simple (weak) classifiers without loss of accuracy
Selects the features and trains the classifier
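The boosting idea can be sketched as discrete AdaBoost (Freund & Schapire '97) with one-dimensional decision stumps as weak classifiers. This is a toy illustration of the algorithm, not the paper's implementation.

```python
import numpy as np

def adaboost_train(X, y, n_rounds=10):
    """Discrete AdaBoost sketch with decision stumps.
    X: (n_samples, n_features); y: labels in {-1, +1}."""
    n, d = X.shape
    w = np.full(n, 1.0 / n)                    # sample weights
    ensemble = []                              # (feature, threshold, polarity, alpha)
    for _ in range(n_rounds):
        best = None
        for j in range(d):                     # exhaustive stump search
            for thr in np.unique(X[:, j]):
                for pol in (1, -1):
                    pred = pol * np.where(X[:, j] > thr, 1, -1)
                    err = w[pred != y].sum()   # weighted error
                    if best is None or err < best[0]:
                        best = (err, j, thr, pol, pred)
        err, j, thr, pol, pred = best
        err = min(max(err, 1e-10), 1 - 1e-10)
        alpha = 0.5 * np.log((1 - err) / err)  # weak-classifier weight
        w *= np.exp(-alpha * y * pred)         # re-weight: focus on mistakes
        w /= w.sum()
        ensemble.append((j, thr, pol, alpha))
    return ensemble

def adaboost_predict(ensemble, X):
    score = sum(a * p * np.where(X[:, j] > t, 1, -1)
                for j, t, p, a in ensemble)
    return np.sign(score)
```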


Action learning: Boosting

Weak classifier ht:
In the case of one-dimensional features, select an optimal decision threshold
E.g. for Haar-filter responses on pre-aligned samples (Viola & Jones face detector)

Here: m-dimensional histogram features
Project the data onto one dimension using Fisher's Linear Discriminant (FLD), then select the optimal threshold in 1-D
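The FLD-based weak classifier can be sketched as follows. Two simplifications are my own: the midpoint of the projected class means stands in for a full optimal-threshold search, and the scatter matrix is regularized so it is always invertible.

```python
import numpy as np

def fld_weak_classifier(X, y):
    """Weak classifier for m-dimensional features: project onto Fisher's
    Linear Discriminant direction, then threshold in 1-D.
    X: (n, m) features; y: labels in {-1, +1}."""
    Xp, Xn = X[y == 1], X[y == -1]
    mu_p, mu_n = Xp.mean(axis=0), Xn.mean(axis=0)
    # within-class scatter, regularized so Sw is invertible
    Sw = np.cov(Xp, rowvar=False) + np.cov(Xn, rowvar=False)
    Sw += 1e-6 * np.eye(X.shape[1])
    w = np.linalg.solve(Sw, mu_p - mu_n)          # FLD direction
    thr = 0.5 * (Xp @ w).mean() + 0.5 * (Xn @ w).mean()  # midpoint threshold
    return w, thr

def fld_predict(w, thr, X):
    return np.where(X @ w > thr, 1, -1)
```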


Action classification test

Comparison of:
Static keyframe classifier with spatial HOG features (BH-Grad4): uses boosting to classify the action from keyframe features only (the keyframe = the frame when the hand reaches the mouth)
Space-time action classifier with HOF features (STBH-OF5)
Space-time action classifier with HOF and HOG features (STBH-OFGrad9)


Action classification test (random motion patterns)

Additional shape information does not seem to improve the space-time classifier
The space-time classifier and the static keyframe classifier might have complementary properties


Classifier properties

Training output: accumulated feature maps
Space-time classifier (HOF) vs. static keyframe classifier (HOG)

The space-time classifier and the static keyframe classifier might have complementary features


Classifier properties

The regions of selected features show the most "active" parts of the classifier

High activity at the beginning and end of the sequence, but low activity at the keyframe
Less accuracy is expected when classifying keyframes only

The space-time classifier selects most features around the hand region
The keyframe classifier selects most features in the upper part of the keyframe

Idea: Use the complementary properties to combine the classifiers (-> keyframe priming)


Keyframe priming

Combination of the static keyframe classifier with the space-time classifier

Motivated by the complementary properties of both classifiers
Bootstrap the space-time classifier (a boosted space-time window classifier) and apply it to keyframes detected by the keyframe detector

Speeds up detection
Combines complementary models


Keyframe priming

Apply the keyframe detector (HOG classification on a single frame) to all positions, scales and frames, while it is set to a high false-positive rate

Generate space-time blocks aligned with the detected keyframes and with different temporal extents

Run the space-time classifier on each hypothesis


Keyframe priming

(Training: positive and negative training samples. Test: keyframe detections -> keyframe-primed event detection.)


Action detection

Test on 25 min from "Coffee and Cigarettes" with 38 drinking actions
No overlap with the training set in subjects or scenes
Keyframe priming is faster and leads to significantly better results

(Results: keyframe priming vs. no keyframe priming)


Action detection


Learning realistic human actions from movies

Ivan Laptev, Marcin Marszalek, Cordelia Schmid and Benjamin Rozenfeld

CVPR 2008


Action recognition in real-world videos

Humans can do more than just "drinking" and "smoking"
Robust detection and classification of all kinds of human actions is needed


Space-Time Features: Detector
[Laptev, IJCV 2005]


Space-Time Features: Detector

Space-Time Interest Points (STIP): a space-time extension of the Harris operator

Add the time dimension to the second-moment matrix
Look for maxima of the extended Harris corner function H

Detection depends on the spatio-temporal scale
Extract features at multiple levels of spatio-temporal scale (dense scale sampling)


Space-Time Features: Descriptor

Multi-scale space-time patches from the corner detector

Histogram of oriented spatial gradients (HOG): 3x3x2x4-bin HOG descriptor
Histogram of optical flow (HOF): 3x3x2x5-bin HOF descriptor

Public code available at www.irisa.fr/vista/actions


Space-Time Features: Descriptor

Compute histogram descriptors of the space-time volumes in the neighborhood of the detected points:
Compute a 4-bin HOG for each cube in a 3x3x2 space-time grid
Compute a 5-bin HOF for each cube in a 3x3x2 space-time grid

The size of each volume is related to the detection scale


Spatio-temporal bag of features (BoF)

Cluster the features (e.g. with k-means) into a visual vocabulary (here, k = 4000)

Assign each feature to the nearest vocabulary word

Compute a histogram of visual-word occurrences over the space-time volume

Different spatio-temporal grids were explored
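The word-assignment and histogram steps of the BoF pipeline can be sketched as follows; a toy vocabulary stands in for the k = 4000 words that k-means would produce on training descriptors.

```python
import numpy as np

def bof_histogram(descriptors, vocabulary):
    """Bag-of-features sketch: assign each local descriptor to its nearest
    visual word and count word occurrences over the space-time volume."""
    # squared Euclidean distance from every descriptor to every word
    d2 = ((descriptors[:, None, :] - vocabulary[None, :, :]) ** 2).sum(axis=2)
    words = d2.argmin(axis=1)                      # nearest-word assignment
    hist = np.bincount(words, minlength=len(vocabulary)).astype(float)
    return hist / hist.sum()                       # normalized word histogram
```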


Action classification

Use SVMs with a multi-channel chi-square kernel:

A "channel" c is a combination of a spatio-temporal grid and a descriptor (HoG or HoF)
D is the χ² distance between the BoF histograms
A is the mean value of the distances between all training samples
One-against-all approach in the case of multi-class classification
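A minimal sketch of the χ² distance and the multi-channel kernel. The combination K = exp(-Σ_c D_c / A_c) follows the usual convention for this kernel and is an assumption here, since the slide's formula did not survive extraction.

```python
import numpy as np

def chi2_distance(h1, h2, eps=1e-10):
    """Chi-square distance between two BoF histograms:
    D(h1, h2) = 1/2 * sum_i (h1_i - h2_i)^2 / (h1_i + h2_i)."""
    return 0.5 * np.sum((h1 - h2) ** 2 / (h1 + h2 + eps))

def multichannel_chi2_kernel(channels_p, channels_q, A):
    """Multi-channel kernel sketch: K(P, Q) = exp(-sum_c D_c(P, Q) / A_c),
    where each channel c pairs a spatio-temporal grid with a descriptor
    (HoG or HoF) and A_c is the mean training distance for that channel."""
    total = sum(chi2_distance(p, q) / A[c]
                for c, (p, q) in enumerate(zip(channels_p, channels_q)))
    return np.exp(-total)
```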


Results on the KTH actions dataset
(Examples of all six classes and all four scenarios)


Results on KTH actions dataset

Average class accuracy

Confusion matrix


Results on HOHA dataset

Average precision for each action class
Comparison of results for annotated (clean) and automatic training data
"Chance" denotes the results of a random classifier


Results on HOHA dataset


Visual Event Recognition in News Video using Kernel Methods with Multi-Level Temporal Alignment

Dong Xu and Shih-Fu Chang
CVPR 2007


Event recognition in news broadcasts

A challenging task:
Complex motion, cluttered backgrounds, occlusions, geometric variations of objects

HOF/HOG approaches are sensitive to high-motion regions only
News events may have relatively low motion (e.g. fire)

Broadcast news is a less constrained domain than human actions
An event usually consists of different sub-clips
E.g. "riot" may consist of scenes of fire and smoke at different locations


Temporally aligned pyramid matching


Decomposition of video into sub-clips

Temporal-constrained Hierarchical Agglomerative Clustering (T-HAC):
First, each feature vector forms its own cluster
Construct clusters iteratively by combining existing clusters based on their distances (EMD)
Only merge clusters that are neighbors in the temporal dimension

The clusters form a pyramid-like structure (dendrogram)
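T-HAC can be sketched as follows. As a simplification, the Euclidean distance between cluster means stands in for the EMD used in the paper; the merging logic (only temporally adjacent clusters may merge) is the point being illustrated.

```python
import numpy as np

def temporal_hac(features, n_clusters):
    """Temporal-constrained agglomerative clustering sketch: start with one
    cluster per frame feature and repeatedly merge the closest pair of
    *temporally adjacent* clusters until n_clusters remain."""
    clusters = [[i] for i in range(len(features))]   # indices into features
    while len(clusters) > n_clusters:
        means = [features[c].mean(axis=0) for c in clusters]
        # only neighboring clusters in the temporal dimension may merge
        dists = [np.linalg.norm(means[i] - means[i + 1])
                 for i in range(len(clusters) - 1)]
        i = int(np.argmin(dists))
        clusters[i:i + 2] = [clusters[i] + clusters[i + 1]]
    return clusters
```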


Agglomerative Clustering


Features

Low-level global features:
Grid Color Moment (GCM): the first three moments for each grid region (e.g. a 5x5 grid)
Gabor Texture feature (GT)
Edge Direction Histogram (EDH) (similar to HoG): apply a Sobel operator and histogram the edge directions, quantized at 5 degrees

Mid-level feature: Concept Score (CS)
An N-dimensional vector (e.g. N = 108)
Each component represents the confidence score of a semantic concept classifier (SVMs)
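The grid color moments can be sketched with NumPy. The cell layout and the ordering of moments within the feature vector are assumptions made for illustration.

```python
import numpy as np

def grid_color_moments(image, grid=5):
    """Grid Color Moment sketch: split the image into a grid x grid layout
    and compute the first three moments (mean, standard deviation, and a
    skewness term) per cell and color channel. `image` is (H, W, 3);
    returns a flat vector of grid*grid*3*3 values."""
    h, w, _ = image.shape
    feats = []
    for i in range(grid):
        for j in range(grid):
            cell = image[i * h // grid:(i + 1) * h // grid,
                         j * w // grid:(j + 1) * w // grid].reshape(-1, 3)
            mu = cell.mean(axis=0)
            sigma = cell.std(axis=0)
            skew = np.cbrt(((cell - mu) ** 3).mean(axis=0))  # cube root keeps units
            feats.extend(np.concatenate([mu, sigma, skew]))
    return np.array(feats)
```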


Temporal alignment

Problem: the event stages of different clips do not follow a fixed temporal order

Solution (temporal alignment):
Calculate the EMD distance matrix for each level
Solve only for integer solutions at levels > 0: this produces 1-to-1 mappings
Allow soft temporal alignment at level 0
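The integer-constrained matching at levels > 0 (1-to-1 mappings) can be illustrated as a minimal-cost assignment over the sub-clip distance matrix. This brute-force sketch assumes equally weighted sub-clips and is not the paper's EMD solver; it only shows what a 1-to-1 temporal alignment means.

```python
import itertools
import numpy as np

def align_subclips(dist_matrix):
    """One-to-one temporal alignment sketch: find the permutation of
    sub-clips that minimizes the total matched distance. Brute force over
    permutations is enough for the handful of sub-clips per level."""
    n = dist_matrix.shape[0]
    best_perm, best_cost = None, np.inf
    for perm in itertools.permutations(range(n)):
        cost = sum(dist_matrix[i, perm[i]] for i in range(n))
        if cost < best_cost:
            best_perm, best_cost = perm, cost
    return list(enumerate(best_perm)), best_cost
```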


EMD distances

(Figures: distances at level 1 and distances at level 2)

Temporal alignment


Matching in single-level EMD (a) and temporally aligned pyramid matching (TAPM) (b)

Classification

SVM classification with an Earth Mover's Distance (EMD) kernel:

D(P, Q): the EMD between features from sequences P and Q
A: a hyper-parameter, set empirically through cross-validation


Fusion

Fuse the information from the different levels directly:

L: number of sub-clip levels
g: decision values from the SVMs at the different levels l
h: level weights (e.g. all weights equal to 1)
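The direct fusion is just a weighted sum of the per-level SVM decision values, as a minimal sketch shows:

```python
import numpy as np

def fuse_levels(decisions, weights=None):
    """Direct fusion sketch: weighted sum of SVM decision values g_l across
    the L sub-clip levels, with per-level weights h_l (all equal to 1 by
    default, as the slides suggest)."""
    decisions = np.asarray(decisions, dtype=float)
    if weights is None:
        weights = np.ones(len(decisions))
    return float(np.dot(weights, decisions))
```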


Experiments

Test with LSCOM concepts on TRECVID-2005
56 events/activities annotated (449 total concepts)
10 events chosen:
Relatively high occurrence frequency
May be recognized intuitively from visual cues

Number of positive samples for each class: 84-877


Results

Comparison of single-level EMD (SLEMD) with key-frame classification on different features


Results

Average precision at the different levels of TAPM
L<x>: classification at level x
d: fusion with non-uniform weights (h0 = h1 = 1; h2 = 2)


Summary

Paper 1: "Retrieving actions in movies"
Atomic human actions: smoking, drinking
Histograms of oriented gradients (HoG) and histograms of optical flow (HoF) at different spatio-temporal positions and scales, plus grid types
Boosting to select good features and to build the classifier

Paper 2: "Learning realistic human actions from movies"
Eight atomic human actions: Kiss, AnswerPhone, GetOutCar, …
HoG/HoF at space-time interest points, bag of features
SVM for classification

Paper 3: "Visual event recognition in news videos…"
Decomposes the video into sub-clips
Temporal alignment improves classification
Low- and mid-level features, SVM for classification


References

I. Laptev and P. Perez. Retrieving actions in movies. ICCV '07

I. Laptev, M. Marszalek, C. Schmid and B. Rozenfeld. Learning realistic human actions from movies. CVPR '08

D. Xu and S.-F. Chang. Visual Event Recognition in News Video using Kernel Methods with Multi-Level Temporal Alignment. CVPR '07


References

http://www.irisa.fr/vista/Equipe/People/Ivan.Laptev.html

P. Viola and M. Jones. Rapid object detection using a boosted cascade of simple features. CVPR '01

R. Schapire, Y. Freund, P. Bartlett and W. S. Lee. Boosting the margin: A new explanation for the effectiveness of voting methods. ICML '97


Diploma thesis (Diplomarbeit)

Interested in action recognition?

We are currently looking for a motivated student to teach Armar-III how to recognize situations within rooms (e.g. cooking, cleaning)

Required skills:
Interest in computer vision
C++ programming experience under Linux

Email [email protected] for more details
