Content-based image and video analysis
Event Recognition
15.06.2009
What is an event?
“a thing that happens or takes place” (Oxford Dictionary)
Examples:
Human gestures
Human actions (running, drinking, etc.)
Human interaction (gunfight, car crash, etc.)
Sports events (tennis, soccer, etc.)
Nature events (fire, storm, etc.)
…
Content-based image and video retrieval 2
Why event recognition?
Huge amounts of video are available
Event recognition is useful for:
Content-based browsing: “Fast forward to the next goal-scoring scene”
Video search: “Show me all videos with Bush and Putin shaking hands”
Human actions
Human actions are major events in movie content
Meaning hidden within visual representation
What are human actions?
Definition 1: Physical body motion
KTH Action Database (http://www.nada.kth.se/cvap/actions/)
What are human actions?
Definition 2: Interaction with the environment for a specific purpose
Same physical motion – different action depending on the context
Challenges
Variations:
Lighting, appearance, view, background
Individual motion, camera motion
Figure: “Drinking” vs. “Smoking”
Difference in shape, difference in motion
Both actions are similar in overall shape (human posture) and motion (hand motion)
Challenges
Problems:
Existing datasets contain few action classes, captured in controlled and simplified settings
Lots of (realistic) training data is needed
Idea:
Realistic human actions are frequent events within movies
Perform automatic labeling of video sequences based on movie scripts
Action event dataset
“Coffee and Cigarettes” dataset
159 annotated “Drinking” samples
149 annotated “Smoking” samples
Temporal annotation: first frame, keyframe, last frame
Spatial annotation: head rectangle, torso rectangle
http://www.irisa.fr/vista/Equipe/People/Laptev/actiondetection.html
Challenges
Problems:
Existing datasets contain few action classes, captured in controlled and simplified settings
Lots of (realistic) training data is needed
Idea:
Realistic human actions are frequent events within movies
Perform automatic labeling of video sequences based on movie scripts
Hollywood Human actions (HOHA) dataset
Script-based annotation
Problems:
No time information available
Use subtitles to align scripts with the video
Described actions do not always correspond to the movie scene
Assign an alignment score to each scene description: a = (#matched words) / (#all words)
Apply a threshold (e.g. a > 0.5)
Variability of action expressions in text
Use a regularized perceptron to classify action descriptions
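The alignment score and threshold above can be sketched in a few lines of Python. This is a minimal sketch: the case-insensitive word matching and the exact tokenization are illustrative assumptions, not the paper's implementation.

```python
def alignment_score(scene_words, subtitle_words):
    """a = (#matched words) / (#all words): fraction of the scene
    description's words that also occur in the aligned subtitles."""
    subs = {w.lower() for w in subtitle_words}
    if not scene_words:
        return 0.0
    matched = sum(1 for w in scene_words if w.lower() in subs)
    return matched / len(scene_words)


def accept(scene_words, subtitle_words, threshold=0.5):
    """Keep only scene descriptions whose alignment exceeds the threshold
    (e.g. a > 0.5, as on the slide)."""
    return alignment_score(scene_words, subtitle_words) > threshold
```

A fully matched description gets a = 1; a half-matched one is rejected at the 0.5 threshold.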
Script alignment
Scripts are publicly available for many movies
Subtitles are available for most movies
Transfer time to scripts by text alignment
Example (subtitles on the left, movie script on the right):

Subtitles:
1172
01:20:17,240 --> 01:20:20,437
Why weren't you honest with me?
Why'd you keep your marriage a secret?

1173
01:20:20,640 --> 01:20:23,598
It wasn't my secret, Richard.
Victor wanted it that way.

1174
01:20:23,800 --> 01:20:26,189
Not even our closest friends
knew about our marriage.

Movie script (times transferred from the subtitles):
01:20:17 Rick sits down with Ilsa.
RICK: Why weren't you honest with me? Why did you keep your marriage a secret?
01:20:23 ILSA: Oh, it wasn't my secret, Richard. Victor wanted it that way. Not even our closest friends knew about our marriage.
Script alignment: Problems
Perfect alignment does not guarantee perfect action annotation in video
Of 147 actions with correct alignment (a = 1), only 70% matched the video
(a: quality of subtitle-script matching)
Errors:
Temporal misalignment (10%)
Outside the field of view (10%)
Completely missing in the video (10%)
Example of a false positive for “get out of car”: the action is not visible in the video!
Script: “A black car pulls up, two army officers get out.”
Script-based annotation
Pros:
Realistic variation of actions
Many examples per class, many classes
No extra overhead for new classes
Character names may be used to resolve “who is doing what?”
Problems:
No spatial localization
Temporal localization may be poor
The script does not always follow the movie
Automatic annotation is useful for training, but not precise enough for evaluation
Retrieving actions in movies
Ivan Laptev and Patrick Pérez
ICCV 2007
Actions == space-time objects?
“Stable-view” objects vs. “atomic” actions
Examples: car exit, phoning, smoking, hand shaking, drinking
Take advantage of the space-time shape (temporal slices along the time axis)
Actions == space-time objects?
Can actions be considered as space-time objects?
Transfer object detectors to action recognition
Here: only atomic actions are considered (i.e. simple actions)
Temporal slices of the same action under different circumstances are similar
So, (atomic) actions = space-time objects
Action features
Action volume = space-time cuboid region around the head (over the duration of the action)
Encoded with block-histogram features fθ(·), θ = (x, y, t, dx, dy, dt, β, ϕ), defined by:
Location (x, y, t)
Space-time extent (dx, dy, dt)
Type of block (β): β ∈ {Plane, Temp-2, Spat-4}
Type of histogram (ϕ): histogram of optical flow (HOF) or histogram of oriented gradients (HOG)
Histogram features
(Simplified) histogram of oriented gradients:
Apply a gradient operator to each frame within the sequence (e.g. Sobel)
Bin the gradients, discretized into 4 orientations, into a block histogram
Histogram of optical flow:
Calculate the optical flow (OF) between frames
Bin the OF vectors, discretized into 4 direction bins (+1 bin for no motion), into a block histogram
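The simplified per-block HOG above can be sketched with NumPy. Two simplifying assumptions here: `np.gradient` stands in for the Sobel operator, and the histogram covers a whole frame rather than an individual space-time block.

```python
import numpy as np


def hog4(frame):
    """Bin image gradients into 4 orientation bins, weighted by gradient
    magnitude, and L1-normalize (simplified per-block HOG)."""
    gy, gx = np.gradient(frame.astype(float))   # stand-in for Sobel
    mag = np.hypot(gx, gy)
    ang = np.arctan2(gy, gx) % np.pi            # orientation in [0, pi)
    bins = np.minimum((ang / (np.pi / 4)).astype(int), 3)
    hist = np.zeros(4)
    for b in range(4):
        hist[b] = mag[bins == b].sum()
    s = hist.sum()
    return hist / s if s > 0 else hist
```

The HOF variant is analogous: replace the gradient field with optical-flow vectors and add a fifth bin for near-zero motion.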
The normalized action cuboid has size 14x14x8, with units corresponding to 5x5x5 pixels
More than 1 million possible features fθ(·)
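The “more than 1 million” figure can be checked by enumeration, assuming every axis-aligned sub-block of the 14x14x8 cuboid is a candidate location/extent, combined with the 3 block types (β) and 2 histogram types (ϕ):

```python
def n_intervals(n):
    """Number of (start, length) pairs that fit on an axis of size n."""
    return n * (n + 1) // 2


# candidate blocks (x, y, t, dx, dy, dt) inside the 14x14x8 cuboid
n_blocks = n_intervals(14) * n_intervals(14) * n_intervals(8)
n_features = n_blocks * 3 * 2  # times block types (beta), histogram types (phi)
```

Under these assumptions this gives 2,381,400 candidate features, consistent with the slide's claim.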
Histogram features
HOG: histograms of oriented gradients (4 gradient orientation bins)
HOF: histograms of optical flow (4 OF direction bins + 1 bin for no motion)
>10^6 possible features
Action learning
Use a boosting method (e.g. AdaBoost) to classify features within an action volume
Features: block-histogram features
Boosting selects features and combines weak classifiers
AdaBoost:
• Efficient discriminative classifier [Freund & Schapire ’97]
• Good performance for face detection [Viola & Jones ’01]
Action learning: Boosting
A weak classifier h is a classifier with accuracy only slightly better than chance
Boosting: combine a number of weak classifiers so that the ensemble is arbitrarily accurate
Allows the use of simple (weak) classifiers without loss of accuracy
Selects features and trains the classifier
Action learning: Boosting
Weak classifier ht:
In the case of one-dimensional features, select an optimal decision threshold
E.g. for Haar-filter responses (Viola & Jones face detector)
Here: m-dimensional features
Project the data onto one dimension using Fisher’s Linear Discriminant (FLD), then select the optimal threshold in 1-D
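A weak learner of this form (FLD projection followed by a 1-D threshold search) might look like the sketch below. The small regularization term and the exhaustive scan over candidate thresholds are implementation assumptions, not details from the paper.

```python
import numpy as np


def fld_weak_classifier(X, y, sample_w=None):
    """Project m-D features to 1-D with Fisher's Linear Discriminant,
    then pick the threshold minimizing weighted error.
    X: (n, m) features; y: labels in {0, 1}; sample_w: boosting weights."""
    if sample_w is None:
        sample_w = np.full(len(y), 1.0 / len(y))
    X0, X1 = X[y == 0], X[y == 1]
    # within-class scatter (regularized so the solve is well-posed)
    Sw = np.cov(X0, rowvar=False) + np.cov(X1, rowvar=False)
    w = np.linalg.solve(Sw + 1e-6 * np.eye(X.shape[1]),
                        X1.mean(0) - X0.mean(0))
    proj = X @ w
    best = (np.inf, 0.0)
    for t in np.unique(proj):  # scan candidate thresholds
        err = sample_w[(proj >= t) != (y == 1)].sum()
        best = min(best, (err, t))
    return w, best[1]
```

Inside AdaBoost, `sample_w` would be the current boosting weights, re-weighted after each round.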
Action classification test
Comparison of:
Static keyframe classifier with spatial HOG features (BH-Grad4)
Uses boosting to classify actions with features from keyframes only (keyframe = the frame when the hand reaches the mouth)
Space-time action classifier with HOF features (STBH-OF5)
Space-time action classifier with HOF and HOG features (STBH-OFGrad9)
Action classification test
(Figure label: random motion patterns)
Additional shape information does not seem to improve the space-time classifier
The space-time classifier and the static keyframe classifier might have complementary properties
Classifier properties
Training output: accumulated feature maps
Space-time classifier (HOF) vs. static keyframe classifier (HOG)
The space-time classifier and the static keyframe classifier might have complementary features
Classifier properties
The regions of selected features show the most “active” parts of the classifier
High activity at the beginning and end of the sequence, but low activity at the keyframe
Less accuracy is expected when classifying only keyframes
The space-time classifier selects most features around the hand region
The keyframe classifier selects most features in the upper part of the keyframe
Idea: use the complementary properties to combine the classifiers (→ keyframe priming)
Keyframe priming
Combination of the static keyframe classifier with the space-time classifier
Motivated by the complementary properties of both classifiers
Bootstrap the space-time classifier and apply it to keyframes detected by the keyframe detector (a boosted space-time window classifier)
Speeds up detection
Combines complementary models
Keyframe priming
Apply the keyframe detector (HOG classification on a single frame) to all positions, scales and frames, while set to a high false-positive rate
Generate space-time blocks aligned with the detected keyframes, with different temporal extents
Run the space-time classifier on each hypothesis
Keyframe priming
Training: positive training sample, negative training samples
Test: keyframe detections, then keyframe-primed event detection
Action detection
Test on 25 min from “Coffee and Cigarettes” with 38 drinking actions
No overlap with the training set in subjects or scenes
Keyframe priming is faster and leads to significantly better results
(Result plots: with vs. without keyframe priming)
Learning realistic human actions from movies
Ivan Laptev, Marcin Marszalek, Cordelia Schmid and Benjamin Rozenfeld
CVPR 2008
Action recognition in real-world videos
Humans can do more than just “drinking” and “smoking”
Robust detection and classification of all kinds of human actions is needed
Space-Time Features: Detector
Space-Time Interest Points (STIP):
Space-time extension of the Harris operator
Add the time dimension to the second-moment matrix
Look for maxima of the extended Harris corner function H
Detection depends on the spatio-temporal scale
Extract features at multiple levels of spatio-temporal scale (dense scale sampling)
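As a toy illustration (not the actual detector), the space-time second-moment matrix and the extended cornerness measure H = det(μ) − k·trace³(μ) can be evaluated over a whole volume. The real STIP detector instead smooths μ with spatio-temporal Gaussians and searches for local maxima of H over positions and scales; the global averaging and the value of k here are simplifying assumptions.

```python
import numpy as np


def harris3d_response(volume, k=0.005):
    """Toy space-time Harris measure: build the 3x3 second-moment matrix
    from spatial and temporal gradients (averaged over the whole volume
    instead of Gaussian-weighted neighborhoods) and return
    H = det(mu) - k * trace(mu)**3."""
    Lt, Ly, Lx = np.gradient(volume.astype(float))  # gradients along t, y, x
    grads = [Lx, Ly, Lt]
    mu = np.array([[(a * b).mean() for b in grads] for a in grads])
    return float(np.linalg.det(mu) - k * np.trace(mu) ** 3)
```

A constant volume has zero gradients everywhere and therefore zero response; strong responses require variation in both space and time.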
Space-Time Features: Descriptor
Multi-scale space-time patches from the corner detector
Histogram of oriented spatial gradients (HOG): 3x3x2 grid, 4 bins per cell
Histogram of optical flow (HOF): 3x3x2 grid, 5 bins per cell
Public code available at www.irisa.fr/vista/actions
Space-Time Features: Descriptor
Compute histogram descriptors of the space-time volumes in the neighborhood of detected points:
Compute a 4-bin HOG for each cube in a 3x3x2 space-time grid
Compute a 5-bin HOF for each cube in a 3x3x2 space-time grid
The size of each volume is related to the detection scales
Spatio-temporal bag of features (BoF)
Cluster the features (e.g. with k-means) into a visual vocabulary (here, k = 4000)
Assign each feature to the nearest vocabulary word
Compute the histogram of visual-word occurrences over a space-time volume
Different spatio-temporal grids are explored
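The assignment and histogramming steps above can be sketched as follows, assuming the vocabulary has already been learned by k-means (a tiny vocabulary is used here for illustration instead of k = 4000):

```python
import numpy as np


def bof_histogram(features, vocab):
    """Assign each space-time feature to its nearest visual word and
    build a normalized occurrence histogram over the volume."""
    # pairwise squared distances: (n_features, n_words)
    d = ((features[:, None, :] - vocab[None, :, :]) ** 2).sum(-1)
    words = d.argmin(1)
    hist = np.bincount(words, minlength=len(vocab)).astype(float)
    return hist / hist.sum()
```

One such histogram is computed per channel, i.e. per (grid cell, descriptor) combination.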
Action classification
Use SVMs with a multi-channel chi-square kernel:
A “channel” c is a combination of a spatio-temporal grid and a descriptor (HOG or HOF)
D is the χ² distance between the BoF histograms
A is the mean value of the distances between all training samples
One-against-all approach in the case of multi-class classification
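A sketch of the kernel, assuming the common form K(Hi, Hj) = exp(−Σc Dc(Hi, Hj)/Ac) with a χ² distance per channel; the exact normalization in the paper may differ from this:

```python
import numpy as np


def chi2_dist(h1, h2, eps=1e-10):
    """Chi-square distance between two BoF histograms."""
    return 0.5 * np.sum((h1 - h2) ** 2 / (h1 + h2 + eps))


def multichannel_kernel(channels_i, channels_j, A):
    """Multi-channel chi-square kernel: each channel c is one
    (grid, descriptor) combination, A[c] the mean training distance."""
    return np.exp(-sum(chi2_dist(hi, hj) / A[c]
                       for c, (hi, hj) in enumerate(zip(channels_i, channels_j))))
```

The kernel is 1 for identical histograms and decays toward 0 as the per-channel χ² distances grow; it would be passed to an SVM as a precomputed kernel matrix.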
Results on the KTH actions dataset
Examples of all six classes and all four scenarios
Results on KTH actions dataset
Average class accuracy
Confusion matrix
Results on HOHA dataset
Average precision for each action class
Comparison of results for annotated (clean) and automatic training data
“Chance” denotes the results of a random classifier
Visual Event Recognition in News Video using Kernel Methods with Multi-Level Temporal Alignment
Dong Xu and Shih-Fu Chang
CVPR 2007
Event recognition in news broadcasts
Challenging task:
Complex motion, cluttered backgrounds, occlusions, geometric variations of objects
HOF/HOG approaches are sensitive to high-motion regions only
News events may have relatively low motion (e.g. fire)
Broadcast news is a less constrained domain than human actions
An event usually consists of different sub-clips
E.g. “riot” may consist of scenes of fire and smoke at different locations
Decomposition of video into sub-clips
Temporal-constrained Hierarchical Agglomerative Clustering (T-HAC):
First, each feature vector forms its own cluster
Construct clusters iteratively by merging existing clusters based on their distances (EMD)
Only merge clusters that are neighbors in the temporal dimension
The clusters form a pyramid-like structure (dendrogram)
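T-HAC can be sketched on scalar per-frame features: clusters are kept in temporal order so only neighbors may merge. A plain distance between cluster means stands in for the EMD used in the paper, which is a simplifying assumption.

```python
def t_hac(features, n_clusters):
    """Temporally-constrained agglomerative clustering: start with one
    cluster per frame feature and repeatedly merge the closest pair of
    temporally ADJACENT clusters until n_clusters remain."""
    clusters = [[f] for f in features]

    def mean(c):
        return sum(c) / len(c)

    while len(clusters) > n_clusters:
        # only adjacent pairs (k, k+1) are merge candidates
        i = min(range(len(clusters) - 1),
                key=lambda k: abs(mean(clusters[k]) - mean(clusters[k + 1])))
        clusters[i:i + 2] = [clusters[i] + clusters[i + 1]]
    return clusters
```

Stopping at successively coarser cluster counts yields the levels of the pyramid (dendrogram).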
Features
Low-level global features:
Grid Color Moment (GCM): first three moments for each grid region (e.g. a 5x5 grid)
Gabor Texture feature (GT)
Edge Direction Histogram (EDH) (similar to HOG): apply a Sobel operator and histogram the edge directions, quantized at 5 degrees
Mid-level feature: Concept Score (CS)
N-dimensional vector (e.g. N = 108)
Each component represents the confidence score of a semantic concept classifier (SVMs)
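A sketch of the Grid Color Moment feature: mean, standard deviation, and a cube-rooted third central moment per channel, per grid cell. The exact moment definitions and color space used in the paper may differ from these assumptions.

```python
import numpy as np


def grid_color_moments(img, grid=5):
    """First three moments of each color channel over a grid x grid
    partition of the image, concatenated into one feature vector."""
    H, W, C = img.shape
    feats = []
    for gy in range(grid):
        for gx in range(grid):
            cell = img[gy * H // grid:(gy + 1) * H // grid,
                       gx * W // grid:(gx + 1) * W // grid].reshape(-1, C)
            mu = cell.mean(0)
            sd = cell.std(0)
            sk = np.cbrt(((cell - mu) ** 3).mean(0))  # signed "skew" moment
            feats.extend(np.concatenate([mu, sd, sk]))
    return np.array(feats)
```

With a 5x5 grid and 3 channels this gives 5 x 5 x 3 x 3 = 225 dimensions per keyframe.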
Temporal alignment
Problem: the event stages of different clips do not follow a fixed temporal order
Solution (temporal alignment):
Calculate the EMD distance matrix for each level
Solve only for integer solutions at levels > 0 (produces 1-to-1 mappings)
Allow soft temporal alignment at level 0
EMD distances
(Distance matrices at level 1 and level 2)
Temporal alignment
Matching in single-level EMD (a) and temporally aligned pyramid matching (TAPM) (b)
Classification
SVM classification with an Earth Mover’s Distance (EMD) kernel:
K(P, Q) = exp(−D(P, Q) / A)
D(P, Q): EMD between the features of sequences P and Q
A: hyper-parameter, set empirically through cross-validation
Fusion
Fuse the information from the different levels directly:
L: number of sub-clip levels
g: decision values from the SVMs at the different levels l
h: level weights (e.g. all weights equal to 1)
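The direct fusion rule described above reduces to a weighted sum of the per-level SVM decision values (a minimal sketch of the formula as described on the slide):

```python
def fuse_levels(decisions, weights):
    """Direct fusion across the L sub-clip levels: sum of per-level SVM
    decision values g_l, weighted by level weights h_l
    (e.g. all h_l = 1, or h0 = h1 = 1, h2 = 2 as tested later)."""
    return sum(h * g for h, g in zip(weights, decisions))
```

The fused value is then thresholded (or ranked) like any single SVM decision value.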
Experiments
Tested with LSCOM concepts on TRECVID-2005
56 events/activities annotated (449 concepts in total)
10 events chosen:
Relatively high occurrence frequency
May intuitively be recognized from visual cues
Number of positive samples per class: 84-877
Results
Comparison of single-level EMD (SLEMD) with key-frame classification on different features
Results
Average precision at the different levels of TAPM
L<x>: classification at level x
d: fusion with non-uniform weights (h0 = h1 = 1, h2 = 2)
Summary
Paper 1: “Retrieving actions in movies”
Atomic human actions: smoking, drinking
Histograms of oriented gradients (HOG) and histograms of optical flow (HOF) at different spatio-temporal positions and scales, plus grid types, …
Boosting to select good features and to build the classifier
Paper 2: “Learning realistic human actions from movies”
Eight atomic human actions: Kiss, AnswerPhone, GetOutCar, …
HOG/HOF at space-time interest points, bag of features
SVM for classification
Paper 3: “Visual event recognition in news videos…”
Decomposes video into sub-clips
Temporal alignment improves classification
Low- and mid-level features, SVM for classification
References
I. Laptev and P. Pérez. Retrieving actions in movies. ICCV ’07.
I. Laptev, M. Marszalek, C. Schmid and B. Rozenfeld. Learning realistic human actions from movies. CVPR ’08.
D. Xu and S.-F. Chang. Visual Event Recognition in News Video using Kernel Methods with Multi-Level Temporal Alignment. CVPR ’07.
References
http://www.irisa.fr/vista/Equipe/People/Ivan.Laptev.html
P. Viola and M. Jones. Rapid object detection using a boosted cascade of simple features. CVPR ’01.
R. Schapire, Y. Freund, P. Bartlett and W. S. Lee. Boosting the margin: A new explanation for the effectiveness of voting methods. ICML ’97.
Diplomarbeit (diploma thesis)
Interested in action recognition?
We are currently looking for a motivated student to teach Armar-III how to recognize situations within rooms (e.g. cooking, cleaning)
Required skills:
Interest in computer vision
C++ programming experience under Linux
Email [email protected] for more details