Motivation
Different approaches to video analysis
Part discovery
Using the mid-level representation
Results
Representing Videos using Mid-level Discriminative Patches
Arpit Jain, Abhinav Gupta, Mikel Rodriguez, Larry Davis
Sobhan Naderi
26 June 2013
Sobhan Naderi Representing Videos using Mid-level Discriminative Patches
- Learning mid-level discriminative spatio-temporal patches
- Category-level action recognition
- Understanding actions at a finer level
  - action primitives: bend, pick, lift
  - objects: people, weight
  - scene
  - temporal localization
- Video alignment and label transfer
1. Global spatio-temporal templates
2. Bag of local features
3. Part based approaches
This paper uses exemplar-SVM to automatically discover distinctive parts
This paper’s approach
Look for patches that are:
- recurrent: fire in many images
- distinctive: fire only (mostly) on samples of one category
The big challenge is that:
- The space of all possible spatio-temporal patches is huge
- Most of the patches belong to the background or are uninteresting
Simple solution: run K-Means and prune. But it doesn't work!
This paper’s approach
1. Form initial clusters
   - Use training set
   - Sample 200 random spatio-temporal patches per image, ignoring uniform/no motion
   - Keep 500 most distinct patches per class (use 20-NN)
2. Select top-rank clusters
   - Use validation set
   - Rank by: appearance + λ purity
   - Choose the top-80 clusters (for each class)
   - Q: are the e-SVM scores comparable?
   - Q: how is λ obtained?
   - Q: redundant clusters?
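The ranking in step 2 can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's implementation: it assumes each cluster's e-SVM has already been run on the validation set, that "appearance" is the mean score of the cluster's top detections, and that "purity" is the fraction of those detections that come from the cluster's own class. The names `ClusterStats`, `cluster_score`, `rank_clusters`, and the parameters `lam` and `top_k` are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class ClusterStats:
    cluster_id: int
    # (detector score, class label of the validation video it fired on)
    detections: List[Tuple[float, str]]
    own_class: str


def cluster_score(c: ClusterStats, lam: float, top_k: int = 10) -> float:
    """Score = appearance + lam * purity over the top_k validation detections."""
    top = sorted(c.detections, key=lambda d: d[0], reverse=True)[:top_k]
    if not top:
        return float("-inf")
    # Assumed definition: mean detector score of the top detections.
    appearance = sum(score for score, _ in top) / len(top)
    # Assumed definition: share of top detections from the cluster's own class.
    purity = sum(1 for _, label in top if label == c.own_class) / len(top)
    return appearance + lam * purity


def rank_clusters(clusters: List[ClusterStats], lam: float = 1.0,
                  keep: int = 80) -> List[int]:
    """Return the ids of the `keep` best clusters, best first."""
    ranked = sorted(clusters, key=lambda c: cluster_score(c, lam), reverse=True)
    return [c.cluster_id for c in ranked[:keep]]


clusters = [
    ClusterStats(0, [(0.9, "golf"), (0.8, "golf"), (0.7, "golf")], "golf"),
    ClusterStats(1, [(1.0, "golf"), (0.9, "dive"), (0.9, "dive")], "golf"),
]
best_with_purity = rank_clusters(clusters, lam=1.0, keep=1)
best_without_purity = rank_clusters(clusters, lam=0.0, keep=1)
```

In this toy example, cluster 1 has the higher raw scores but fires mostly on another class, so it ranks first only when λ = 0. This is the trade-off behind the open question about how λ is obtained.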
Patch selection procedure
1. Action classification
   - Run each e-SVM in sliding-window fashion
   - Construct SPM representation (with max-pooling)
   - Train an SVM classifier
2. Fine-grained video analysis
   - Use context to choose a subset of patches
   - Build correspondence between videos
   - Q: how?
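The max-pooled pyramid representation in step 1 could look roughly like the sketch below. The slides do not give the pyramid layout, so the details here are assumptions: detections carry coordinates normalized to [0, 1), level l splits each of the x, y, and t axes into 2^l bins, and cells with no detection get a fixed `floor` score. The names `Detection` and `pooled_feature` are hypothetical.

```python
from typing import List, Sequence, Tuple

# (x, y, t, score), with x, y, t normalized to [0, 1) within the video
Detection = Tuple[float, float, float, float]


def pooled_feature(per_detector: Sequence[List[Detection]],
                   levels: int = 2, floor: float = -1.0) -> List[float]:
    """Max-pool each detector's scores over a spatio-temporal pyramid.

    For every detector and every pyramid level, the video volume is split
    into n x n x n cells (n = 2**level) and the best score in each cell is
    kept. The result is one flat vector that a standard SVM can consume.
    """
    feature: List[float] = []
    for detections in per_detector:
        for level in range(levels):
            n = 2 ** level
            cells = [floor] * (n * n * n)
            for x, y, t, score in detections:
                # Clamp so that a coordinate of exactly 1.0 stays in range.
                ix, iy, it = (min(int(v * n), n - 1) for v in (x, y, t))
                idx = (it * n + iy) * n + ix
                cells[idx] = max(cells[idx], score)
            feature.extend(cells)
    return feature


detector_outputs = [
    [(0.1, 0.1, 0.1, 0.5), (0.9, 0.9, 0.9, 0.7)],  # detector 0 fired twice
    [],                                            # detector 1 never fired
]
video_feature = pooled_feature(detector_outputs, levels=2)
```

With two levels, each detector contributes 1 + 8 = 9 values, so the two detectors above give an 18-dimensional vector. One such vector per video would then be the input to the final SVM classifier.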
Patch selection procedure
- Here we assume the video has been classified as class "k"
- Potential patches: highest-scoring detection of each e-SVM
- For a patch vocabulary of size $N$, let $x = (x_1, \dots, x_N)$ be an indicator vector with $x_i \in \{0, 1\}$
- Find
  $$x^* = \arg\max_x \sum_i A_i x_i + w_1 \sum_i C_{k,i} x_i - w_2 \sum_{i,j} P_{i,j} x_i x_j$$
  where:
  $A$: $N \times 1$ appearance vector
  $C_k$: $N \times 1$ class consistency vector for class $k$
  $P$: $N \times N$ penalty matrix
- Solve the optimization problem using IPFP. This requires writing the problem in the following form:
  $$X^* = \arg\max_X X^T M X, \qquad X = \begin{pmatrix} 1 \\ x \end{pmatrix}$$
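The slides do not spell out M. One construction that works, offered here as an assumption rather than the paper's definition, is to fold the linear terms into the first row and column: with b = A + w1 C_k, set M = [[0, b^T/2], [b/2, -w2 P]], so that X^T M X equals the objective whenever X = (1, x). The sketch below checks this identity numerically and, for a tiny vocabulary, finds the maximizer by exhaustive search. The exhaustive search is only a stand-in for illustration; it is not IPFP. All function names are hypothetical.

```python
from itertools import product
from typing import List, Sequence, Tuple


def objective(x: Sequence[int], A, C_k, P, w1: float, w2: float) -> float:
    """sum_i A_i x_i + w1 sum_i C_{k,i} x_i - w2 sum_{i,j} P_{i,j} x_i x_j."""
    n = len(x)
    linear = sum((A[i] + w1 * C_k[i]) * x[i] for i in range(n))
    pairwise = sum(P[i][j] * x[i] * x[j] for i in range(n) for j in range(n))
    return linear - w2 * pairwise


def build_M(A, C_k, P, w1: float, w2: float) -> List[List[float]]:
    """Assumed (N+1) x (N+1) matrix with linear terms in row/column 0."""
    n = len(A)
    b = [A[i] + w1 * C_k[i] for i in range(n)]
    M = [[0.0] * (n + 1) for _ in range(n + 1)]
    for i in range(n):
        M[0][i + 1] = M[i + 1][0] = b[i] / 2.0
        for j in range(n):
            M[i + 1][j + 1] = -w2 * P[i][j]
    return M


def quadratic_form(M, x: Sequence[int]) -> float:
    """Evaluate X^T M X for X = (1, x)."""
    X = [1] + list(x)
    return sum(X[i] * M[i][j] * X[j]
               for i in range(len(X)) for j in range(len(X)))


def brute_force_select(A, C_k, P, w1: float, w2: float) -> Tuple[int, ...]:
    """Exhaustive search over {0,1}^N; feasible only for very small N."""
    return max(product((0, 1), repeat=len(A)),
               key=lambda x: objective(x, A, C_k, P, w1, w2))


A = [1.0, 0.9, 0.2]
C_k = [0.5, 0.5, 0.1]
P = [[0, 2, 0], [2, 0, 0], [0, 0, 0]]  # patches 0 and 1 penalize each other
M = build_M(A, C_k, P, w1=1.0, w2=1.0)
selected = brute_force_select(A, C_k, P, w1=1.0, w2=1.0)
```

In this toy instance the penalty between patches 0 and 1 outweighs their combined score, so the search keeps patch 0 (the stronger of the two) together with the unpenalized patch 2.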
Classification
- Only cuboid patches
- Scale ranges from 120x120x50 to the entire video
- Each patch is represented by HOG3D (4x4x5 and 20 orientations)
- Experiments on the UCF50 and Olympics datasets
- This method outperforms action-bank by 3.32% on UCF50
Alignment
- Manually label 50 patches per class with:
  - Objects of interaction (e.g. golf club, weights)
  - Person bounding boxes
  - Person pose
- These extra annotations are transferred to test images after aligning
- Informal evaluations:
  - 50% of transferred joints are within 15 pixels of ground truth
  - 84% accuracy in localizing persons (50% overlap criterion)
- Q: How is the alignment done?
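The two informal evaluation criteria above can be made concrete with a short sketch. The slides do not define "overlap", so this assumes the common intersection-over-union measure on axis-aligned boxes given as (x1, y1, x2, y2), and it counts a joint as correct when its Euclidean distance to ground truth is at most 15 pixels. The function names are hypothetical.

```python
import math
from typing import Sequence, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2)
Point = Tuple[float, float]


def iou(box_a: Box, box_b: Box) -> float:
    """Intersection-over-union of two axis-aligned boxes (assumed overlap measure)."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0


def person_localized(pred: Box, gt: Box, thresh: float = 0.5) -> bool:
    """True when the transferred box meets the 50% overlap criterion."""
    return iou(pred, gt) >= thresh


def joint_accuracy(pred_joints: Sequence[Point], gt_joints: Sequence[Point],
                   max_dist: float = 15.0) -> float:
    """Fraction of transferred joints within max_dist pixels of ground truth."""
    if not gt_joints:
        return 0.0
    hits = sum(1 for (px, py), (gx, gy) in zip(pred_joints, gt_joints)
               if math.hypot(px - gx, py - gy) <= max_dist)
    return hits / len(gt_joints)


half_overlap = iou((0, 0, 10, 10), (0, 0, 10, 5))
accuracy = joint_accuracy([(0, 0), (100, 100)], [(9, 12), (100, 120)])
```

Here the first joint lies exactly 15 pixels from its ground truth and counts as a hit, while the second is 20 pixels away and does not.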