
Content-based image and video analysis

Event Recognition

21.06.2010

What is an event?

“a thing that happens or takes place”

Oxford Dictionary

Examples:

Human gestures

Human actions (running, drinking, etc.)

Human interaction (gunfight, car-crash, etc.)

Sports event (tennis, soccer, etc.)

Nature event (fire, storm, etc.)

Content-based image and video retrieval 1

Why event recognition?

Huge amounts of video available

Event recognition useful for:

Content-based browsing

“Fast forward to the next goal scoring scene”

Video search

“Show me all videos with Bush and Putin shaking hands”


Human actions

Human actions are major events in movie content

Meaning hidden within visual representation


What are human actions?

Definition 1:

Physical body motion:

KTH action Database

(http://www.nada.kth.se/cvap/actions/)


What are human actions?

Definition 2:

Interaction with environment on specific purpose

Same physical motion – different action depending on the context


Challenges

Variations:

Lighting, appearance, view, background

Individual motion, camera motion

Difference in shape

Difference in motion

Both actions are similar in

overall shape (human

posture) and motion

(hand motion)

Drinking

Smoking


Challenges

Problems:

Existing datasets contain few action classes, captured in controlled and simplified settings

Lots of (realistic) training data needed

Idea:

Realistic human actions are frequent events within movies

Perform automatic labeling of video sequences based on movie scripts


Action event dataset

“Coffee and Cigarettes” dataset

159 annotated “Drinking” samples

149 annotated “Smoking” samples

[Figure: each sample is annotated temporally (first frame, keyframe, last frame) and spatially (head and torso rectangles)]

http://www.irisa.fr/vista/Equipe/People/Laptev/actiondetection.html


Challenges

Problems:

Existing datasets contain few action classes, captured in controlled and simplified settings

Lots of (realistic) training data needed

Idea:

Realistic human actions are frequent events within movies

Perform automatic labeling of video sequences based on movie scripts

Hollywood Human actions (HOHA) dataset


Script-based annotation

Problems:

No time information available

Use subtitles to align scripts with the video

Described actions do not always correspond to the movie scene

Assign an alignment score to scene description:

a = (#matched words)/(#all words)

Apply a threshold (e.g. a > 0.5)

Variability of action expressions in text

Use a regularized perceptron to classify action descriptions
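The alignment score and threshold above can be sketched as a simple word-overlap check (a minimal sketch; the exact tokenization used in the paper is an assumption):

```python
import re

def alignment_score(script_sentence: str, matched_subtitle_text: str) -> float:
    """Fraction of script words that also occur in the aligned subtitles:
    a = (#matched words) / (#all words)."""
    script_words = re.findall(r"[a-z']+", script_sentence.lower())
    subtitle_words = set(re.findall(r"[a-z']+", matched_subtitle_text.lower()))
    if not script_words:
        return 0.0
    matched = sum(1 for w in script_words if w in subtitle_words)
    return matched / len(script_words)

# Keep only scene descriptions whose text aligns well with the subtitles.
a = alignment_score("Why weren't you honest with me?",
                    "Why weren't you honest with me? Why'd you keep your marriage a secret?")
keep = a > 0.5
```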


Script alignment

1172

01:20:17,240 --> 01:20:20,437

Why weren't you honest with me?

Why'd you keep your marriage a secret?

1173

01:20:20,640 --> 01:20:23,598

It wasn't my secret, Richard.

Victor wanted it that way.

1174

01:20:23,800 --> 01:20:26,189

Not even our closest friends

knew about our marriage.

RICK

Why weren't you honest with me? Why

did you keep your marriage a secret?

Rick sits down with Ilsa.

ILSA

Oh, it wasn't my secret, Richard.

Victor wanted it that way. Not even

our closest friends knew about our

marriage.

Scripts are publicly available for many movies

Subtitles are available for most movies

Transfer timing to the scripts by text alignment


Script alignment: Problems

Perfect alignment does not guarantee perfect action annotation in video

Of 147 actions with correct alignment (a = 1), only 70% matched the video

Errors

Temporal misalignment (10%)

Outside field of view (10%)

Completely missing in video (10%)

Example of a false positive for “get out of car”: the action is not visible in the video!

(a: quality of subtitle-script matching)

A black car pulls up, two army

officers get out.


Script-based Annotation

Pros:

Realistic variation of actions

Many examples per class, many classes

No extra overhead for new classes

Character names may be used to resolve “who is doing what?”

Problems:

No spatial localization

Temporal localization may be poor

Script does not always follow the movie

Automatic annotation useful for training, but not precise enough for evaluation


Retrieving actions in movies

Ivan Laptev and Patric Perez

ICCV 2007


Actions == space-time objects?

[Figure: “atomic” actions (car exit, phoning, smoking, hand shaking, drinking) shown as space-time volumes over time, next to “stable-view” objects and a temporal slice; take advantage of space-time shape]

Actions == space-time objects?

Can actions be considered as space-time objects?

Transfer object detectors to action recognition

Here: only atomic actions considered (i.e. simple actions)

Temporal slices of the same action under different circumstances are similar

So, (atomic) actions = space-time objects


Action features

Action volume = space-time cuboid region around the head (duration of action)

Encoded with block-histogram features f(Δ), Δ = (x, y, t, dx, dy, dt, β, γ), defined by:

Location (x, y, t)

Space-time extent (dx, dy, dt)

Type of block (β), e.g. Plane, Temp-2, Spat-4

Type of histogram (γ)

Histogram of optical flow (HOF)

Histogram of oriented gradient (HOG)


Action features

[Figure: examples of HOG features and HOF features]


Histogram features

(simplified) Histogram of oriented gradient:

Apply a gradient operator (e.g. Sobel) to each frame within the sequence

Bin the gradients, discretized into 4 orientations, into a block-histogram

Histogram of optical flow:

Calculate optical flow (OF) between frames

Bin the OF vectors, discretized into 4 direction bins (+1 bin for no motion), into a block-histogram

Normalized action cuboid has size 14x14x8 with units corresponding to 5x5x5 pixels

More than a million possible features f(Δ)
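A minimal sketch of the two histogram types, with `np.gradient` standing in for the Sobel operator and a precomputed optical-flow field assumed as input (the 4-bin quantization follows the slides; everything else is illustrative):

```python
import numpy as np

def hog4(frame: np.ndarray) -> np.ndarray:
    """Simplified 4-orientation HOG over one block (here: the whole frame).
    np.gradient stands in for the Sobel operator mentioned in the slides."""
    gy, gx = np.gradient(frame.astype(float))
    mag = np.hypot(gx, gy)
    ang = np.mod(np.arctan2(gy, gx), np.pi)                # orientation in [0, pi)
    bins = np.minimum((ang / (np.pi / 4)).astype(int), 3)  # 4 orientation bins
    hist = np.bincount(bins.ravel(), weights=mag.ravel(), minlength=4)
    s = hist.sum()
    return hist / s if s > 0 else hist

def hof5(flow: np.ndarray, min_mag: float = 1e-3) -> np.ndarray:
    """Simplified 5-bin HOF: 4 direction bins + 1 bin for no motion.
    `flow` has shape (H, W, 2) with (dx, dy) per pixel."""
    fx, fy = flow[..., 0], flow[..., 1]
    mag = np.hypot(fx, fy)
    ang = np.mod(np.arctan2(fy, fx), 2 * np.pi)
    bins = np.minimum((ang / (np.pi / 2)).astype(int), 3)  # 4 direction bins
    hist = np.zeros(5)
    moving = mag > min_mag
    hist[:4] = np.bincount(bins[moving], weights=mag[moving], minlength=4)
    hist[4] = np.count_nonzero(~moving)                    # "no motion" bin
    s = hist.sum()
    return hist / s if s > 0 else hist
```

In the full method these histograms are computed per block of the normalized 14x14x8 action cuboid, not per whole frame.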


Histogram features

HOG: histograms of oriented gradient (4 gradient orientation bins)

HOF: histograms of optic flow (4 OF direction bins + 1 bin for no motion)

>10^6 possible features

Action learning

Use a boosting method (e.g. AdaBoost) to classify features within an action volume

Features:

Block-histogram features


AdaBoost:

• Efficient discriminative classifier [Freund & Schapire ’97]

• Good performance for face detection [Viola & Jones ’01]

[Figure: boosting selects features and combines weak classifiers]

Action learning: Boosting

A weak classifier h is a classifier with accuracy only slightly better than chance

Boosting: combine a number of weak classifiers so that the ensemble is arbitrarily accurate

Allows the use of simple (weak) classifiers without loss of accuracy

Selects features and trains the classifier
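The boosting idea can be sketched as a minimal discrete AdaBoost with decision-stump weak classifiers (an illustrative version, not Laptev & Perez's exact implementation):

```python
import numpy as np

def train_stump(x, y, w):
    """Best threshold/polarity stump on a 1-D feature under sample weights w.
    y is in {-1, +1}. Returns (threshold, polarity, weighted_error)."""
    best = (0.0, 1, np.inf)
    for thr in np.unique(x):
        for pol in (1, -1):
            pred = np.where(pol * (x - thr) >= 0, 1, -1)
            err = w[pred != y].sum()
            if err < best[2]:
                best = (thr, pol, err)
    return best

def adaboost(X, y, n_rounds=10):
    """Discrete AdaBoost: each round picks the feature column and stump with
    the lowest weighted error, then re-weights the misclassified samples."""
    n, d = X.shape
    w = np.full(n, 1.0 / n)
    ensemble = []
    for _ in range(n_rounds):
        feat, thr, pol, err = min(
            ((j, *train_stump(X[:, j], y, w)) for j in range(d)),
            key=lambda t: t[3])
        err = max(err, 1e-10)
        alpha = 0.5 * np.log((1 - err) / err)
        pred = np.where(pol * (X[:, feat] - thr) >= 0, 1, -1)
        w *= np.exp(-alpha * y * pred)      # up-weight the mistakes
        w /= w.sum()
        ensemble.append((feat, thr, pol, alpha))
    return ensemble

def predict(ensemble, X):
    score = sum(a * np.where(p * (X[:, f] - t) >= 0, 1, -1)
                for f, t, p, a in ensemble)
    return np.sign(score)
```

Each round both selects one feature (column) and contributes one weak classifier, which is exactly the dual role described on the slide.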



Action learning: Boosting

Weak classifier h_t:

In case of one dimensional features

select an optimal decision threshold

E.g. for Haar-filter responses (Viola&Jones face detector)

Here: m-dimensional features

Project data on one dimension using Fisher’s Linear Discriminant (FLD), then select optimal threshold in 1-D
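The FLD projection step might be sketched like this (assuming the pooled within-class scatter with a small regularizer for numerical stability):

```python
import numpy as np

def fld_direction(X_pos, X_neg, reg=1e-6):
    """Fisher's Linear Discriminant: project m-dimensional features onto the
    direction w ~ Sw^{-1} (mu_pos - mu_neg) that best separates the classes."""
    mu_p, mu_n = X_pos.mean(axis=0), X_neg.mean(axis=0)
    Sw = np.cov(X_pos, rowvar=False) + np.cov(X_neg, rowvar=False)
    Sw += reg * np.eye(Sw.shape[0])       # regularize for stability
    w = np.linalg.solve(Sw, mu_p - mu_n)
    return w / np.linalg.norm(w)
```

After projecting with `X @ w`, a weak classifier is again just an optimal 1-D threshold, exactly as for scalar Haar-filter responses.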


[Figure: Haar features with an optimal threshold vs. histogram features, where FLD is applied before selecting the optimal threshold; pre-aligned samples]

Action classification test

Comparison of:

Static keyframe classifier with spatial HOG features (BH-Grad4)

Use Boosting to classify action using features from keyframes (= frame when hand reaches the mouth) only

Space-time action classifier with HOF features (STBH-OF5)

Space-time action classifier with HOF and HOG features (STBH-OFGrad9)


Action classification test

[Figure: random motion patterns]

Additional shape information does not seem to improve the space-time classifier

Space-time classifier and static key-frame classifier might have complementary properties


Classifier properties

Space-time classifier (HOF) vs. static keyframe classifier (HOG)

Training output: Accumulated feature maps

Space-time classifier and static keyframe classifier might have complementary features


Classifier properties

Region of selected features show most “active” parts of classifier

High activity at beginning and end of sequence, but low activity at keyframe

Less accuracy is expected when classifying only keyframes

Space-time classifier selects most features around hand region

Keyframe classifier selects most features in upper part of the key-frame

Idea: Use complementary properties to combine classifiers (-> keyframe priming)


Keyframe priming

Combination of static key-frame classifier with space-time classifier

Motivated by complementary properties of both classifiers

Bootstrap space-time classifier and apply it to keyframes detected by the keyframe detector (boosted space-time window classifier)

Speeds up detection

Combines complementary models


Keyframe priming

Apply the keyframe detector (HOG classification on a single frame) to all positions, scales and frames, with the detector set to a high false-positive rate

Generate space-time blocks aligned with detected keyframes and with different temporal extent

Run space-time classifier on each hypothesis


Keyframe priming

[Figure: training with a positive training sample and negative training samples; test detections]

Keyframe-primed event detection

[Figure: keyframe detections and action detection, with and without keyframe priming]

Test on 25min from “Coffee and Cigarettes” with 38 drinking actions

No overlap with the training set in subjects or scenes

Keyframe priming is faster and leads to significantly better results


Action detection


Learning realistic human actions from movies

Ivan Laptev, Marcin Marszalek, Cordelia Schmid and

Benjamin Rozenfeld

CVPR 2008


Action recognition in real-world videos

Humans can do more than just “drinking” and “smoking”

Robust detection and classification of all kinds of human actions needed


Space-Time Features: Detector

[Laptev, IJCV 2005]


Space-Time Features: Detector

Space-Time Interest Points (STIP):

Space-Time Extension of Harris Operator

Add the time dimension to the second-moment matrix

Look for maxima in extended Harris corner function H

Detection depends on spatio-temporal scale

Extract features at multiple levels of spatio-temporal scales (dense scale sampling)


Space-Time Features: Descriptor

Multi-scale space-time patches from corner detector

Histogram of oriented spatial gradients (HOG): 3x3x2x4-bin HOG descriptor

Histogram of optical flow (HOF): 3x3x2x5-bin HOF descriptor

Public code available at www.irisa.fr/vista/actions


Space-Time Features: Descriptor

Compute histogram descriptors of space-time volumes in neighborhood of detected points:

Compute a 4-bin HOG for each cube in 3x3x2 space-time grid

Compute a 5-bin HOF for each cube in 3x3x2 space-time grid

Size of each volume related to detection scales


Spatio-temporal bag of features (BoF)

Cluster features (e.g. k-means)

Visual vocabulary (here, k=4000)

Assign each feature to nearest vocabulary word

Compute histogram of visual word occurrences over space time volume

Different spatio-temporal grids were explored
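Given a vocabulary already learned with k-means (k = 4000 in the paper; a small toy vocabulary here), the BoF histogram computation can be sketched as:

```python
import numpy as np

def bof_histogram(features: np.ndarray, vocabulary: np.ndarray) -> np.ndarray:
    """Bag-of-features: assign each descriptor to its nearest visual word
    and count word occurrences over the space-time volume (L1-normalized)."""
    # Pairwise squared distances, shape (n_features, n_words)
    d2 = ((features[:, None, :] - vocabulary[None, :, :]) ** 2).sum(axis=2)
    words = d2.argmin(axis=1)               # nearest vocabulary word
    hist = np.bincount(words, minlength=len(vocabulary)).astype(float)
    return hist / max(hist.sum(), 1.0)
```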


Action classification

Use SVMs with a multi-channel chi-square kernel:

K(H_i, H_j) = exp(−Σ_c (1/A_c) D_c(H_i, H_j))

A “channel” c is a combination of a spatio-temporal grid and a descriptor (HoG or HoF)

D_c is the χ² distance between the BoF histograms in channel c

A_c is the mean value of the distances between all training samples

One-against-all approach in case of multi-class classification
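A sketch of the multi-channel chi-square kernel, assuming histograms are stored per channel in dictionaries (the χ² distance with the 1/2 factor is the common convention; treat the exact normalization as an assumption):

```python
import numpy as np

def chi2_distance(h1, h2, eps=1e-10):
    """Chi-square distance between two (normalized) BoF histograms."""
    return 0.5 * np.sum((h1 - h2) ** 2 / (h1 + h2 + eps))

def multichannel_kernel(H1, H2, A):
    """Multi-channel chi-square kernel:
    K = exp(-sum_c (1/A_c) * D_c(H1_c, H2_c)),
    where A_c is the mean chi-square distance over the training set."""
    return np.exp(-sum(chi2_distance(H1[c], H2[c]) / A[c] for c in H1))
```

Such a precomputed kernel can be handed directly to an SVM in one-against-all fashion for the multi-class case.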

44Content-based image and video retrieval

Results on KTH actions dataset

Examples of all six classes and all four scenarios


Results on KTH actions dataset

Average class accuracy

Confusion matrix


Results on HOHA dataset

Average Precision for each action class

Comparison of results for annotated (clean) and automatic training data

Chance denotes the results of a random classifier

Results on HOHA dataset


Visual Event Recognition in News Video using Kernel Methods with Multi-Level Temporal Alignment

Dong Xu and Shih-Fu Chang

CVPR 2007


Event recognition in news broadcast

Challenging task:

Complex motion, cluttered backgrounds, occlusions, geometric variations of objects

HOF/HOG approaches are sensitive to high-motion regions only

News events may have relatively low motion (e.g. fire)

Broadcast news less constrained domain than human actions

Event usually consists of different sub-clips

E.g. “riot” may consist of scenes of fire and smoke at different locations


Temporally aligned pyramid matching

Algorithm overview:

each frame is represented by one feature vector

decompose video into sub-clips using hierarchical clustering

on each hierarchy level, calculate Earth Mover’s Distance (EMD) between frame features and use it within SVM framework

Constrain the EMD so that frames from one sub-clip can only be matched to frames from one other sub-clip (alignment)

fuse SVM outputs from all levels


Temporally aligned pyramid matching


Decomposition of video into sub-clips

Temporal-constrained Hierarchical Agglomerative Clustering (T-HAC):

First each feature-vector forms a cluster

Construct clusters iteratively by combining existing clusters based on their distances (EMD)

Only merge neighboring clusters in temporal dimension

The clusters form a pyramid-like structure (dendrogram)
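T-HAC might be sketched as follows, with Euclidean distance between cluster means standing in for the EMD used in the paper:

```python
import numpy as np

def temporal_hac(frame_features: np.ndarray, n_clusters: int):
    """Temporal-constrained agglomerative clustering: start with one cluster
    per frame and repeatedly merge the closest pair of *temporally adjacent*
    clusters (distance here: Euclidean between cluster means; the paper
    uses EMD). Returns a list of (start, end) frame index ranges."""
    clusters = [[i] for i in range(len(frame_features))]
    while len(clusters) > n_clusters:
        means = [frame_features[c].mean(axis=0) for c in clusters]
        # Only neighboring clusters along the time axis may be merged.
        dists = [np.linalg.norm(means[i] - means[i + 1])
                 for i in range(len(clusters) - 1)]
        i = int(np.argmin(dists))
        clusters[i:i + 2] = [clusters[i] + clusters[i + 1]]
    return [(c[0], c[-1]) for c in clusters]
```

The adjacency constraint is what keeps the resulting sub-clips temporally contiguous, so the dendrogram levels form the pyramid.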


Features

Low-level global features:

Grid Color Moment (GCM)

First three moments for each grid region (e.g. a 5×5 grid)

Gabor Texture feature (GT)

Edge Direction Histogram (EDH) (same as HoG)

Apply the Sobel operator and histogram the edge directions quantized in 5-degree bins

Mid-level feature: Concept Score (CS)

N-dimensional vector (e.g. N = 108)

each component represents confidence score from a semantic concept classifier (SVMs)
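The EDH computation above can be sketched like this, with `np.gradient` standing in for the Sobel operator and an illustrative magnitude threshold deciding which pixels count as edges:

```python
import numpy as np

def edge_direction_histogram(frame: np.ndarray, bin_deg: int = 5,
                             mag_thresh: float = 10.0) -> np.ndarray:
    """Edge Direction Histogram: gradient directions of edge pixels quantized
    in 5-degree bins (np.gradient stands in for the Sobel operator)."""
    gy, gx = np.gradient(frame.astype(float))
    mag = np.hypot(gx, gy)
    ang = np.degrees(np.mod(np.arctan2(gy, gx), 2 * np.pi))  # [0, 360)
    edges = mag > mag_thresh                                  # keep strong edges
    n_bins = 360 // bin_deg
    bins = np.minimum((ang[edges] / bin_deg).astype(int), n_bins - 1)
    hist = np.bincount(bins, minlength=n_bins).astype(float)
    return hist / max(hist.sum(), 1.0)
```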


Temporal alignment

[Figure: matching in single-level EMD (a) and temporally aligned pyramid matching (TAPM) (b)]

Classification

SVM classification with an Earth Mover’s Distance (EMD) kernel:

K(P, Q) = exp(−D(P, Q) / A)

D(P, Q): EMD between the features from sequences P and Q

A: hyper-parameter, set empirically through cross-validation


Fusion

Fuse information from the different levels directly:

g = Σ_{l=1}^{L} h_l · g_l

L: number of sub-clip levels

g_l: decision values from the SVMs at the different levels l

h_l: level weights (e.g. all weights equal to 1)

Similar to logistic regression/perceptron
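The level fusion is just a weighted sum of the per-level SVM decision values; a minimal sketch:

```python
import numpy as np

def fuse_levels(decision_values, level_weights=None):
    """Fuse SVM decision values g_l from the L pyramid levels as a weighted
    sum; e.g. all weights equal to 1, or non-uniform weights such as
    (h0 = h1 = 1, h2 = 2) as used in the experiments."""
    g = np.asarray(decision_values, dtype=float)
    h = (np.ones_like(g) if level_weights is None
         else np.asarray(level_weights, dtype=float))
    return float(np.dot(h, g))

# Example: three levels with non-uniform weights
score = fuse_levels([0.2, -0.1, 0.4], [1, 1, 2])
```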


Experiments

Test with LSCOM concepts on TRECVID-2005

56 events/activities annotated (449 total concepts)

10 events chosen

Relatively high occurrence frequency

May be recognized from visual cues intuitively

Number of positive samples for each class: 54-877


Results

Comparison of single-level EMD (SLEMD) with key-frame classification on different features


Results

Average Precision on different levels of TAPM

L<x>: classification at level x

d: fusion with non-uniform weights (h0=h1=1; h2=2)


Summary

Paper1: “Retrieving actions in movies”

Atomic human actions: smoking, drinking

Histograms of oriented gradients (HoG) and histograms of optical flow (HoF) at different spatio-temporal positions and scales, plus grid types, etc.

Boosting to select good features and to build classifier

Paper2: “Learning realistic human actions from movies”

Eight atomic human actions: Kiss, AnswerPhone, GetOutCar, ..

HoG / HoF at space-time interest points, bag of features

SVM for classification

Paper 3: “Visual event recognition in news videos…”

Decomposes video into sub-clips

Temporal alignment improves classification

Low- and mid-level features, SVM for classification


References

I. Laptev and P. Perez. Retrieving actions in movies. ICCV '07

I. Laptev, M. Marszalek, C. Schmid and B. Rozenfeld. Learning realistic human actions from movies. CVPR '08

D. Xu and SF. Chang. Visual Event Recognition in News Video using Kernel Methods with Multi-Level Temporal Alignment. CVPR '07


References

http://www.irisa.fr/vista/Equipe/People/Ivan.Laptev.html

P. Viola and M. Jones. Rapid object detection using a boosted cascade of simple features. CVPR '01

R. Schapire, Y. Freund, P. Bartlett and W.S. Lee. Boosting the margin: A new explanation for the effectiveness of voting methods. ICML '97


Hiwi

Interested in action recognition?

We are currently looking for a motivated student to teach Armar-III how to recognize situations within rooms (e.g. cooking, cleaning)

Required skills:

Interest in Computer Vision

C++ programming experience under Linux

Email [email protected] for more details
