Representing Videos using Mid-level Discriminative Patches - Arpit ...

Motivation
Different approaches to video analysis
Part discovery
Using the mid-level representation
Results

Representing Videos using Mid-level Discriminative Patches

Arpit Jain, Abhinav Gupta, Mikel Rodriguez, Larry Davis

Sobhan Naderi

26 June 2013





- Learning mid-level discriminative spatio-temporal patches
- Category-level action recognition
- Understanding actions at a finer level
    - action primitives: bend, pick, lift
    - objects: people, weight
    - scene
    - temporal localization
- Video alignment and label transfer





1. Global spatio-temporal templates

2. Bag of local features

3. Part-based approaches

This paper uses exemplar-SVM to automatically discover distinctive parts





This paper’s approach

Look for patches that are ...

- recurrent: fire in many images
- distinctive: fire only (mostly) on samples of one category

The big challenge is that

- The space of all possible spatio-temporal patches is huge
- Most of the patches belong to the background or are uninteresting

Simple solution: run K-Means and prune. But it doesn’t work!






1. Form initial clusters
    - Use the training set
    - Sample 200 random spatio-temporal patches per image, ignoring uniform/no motion
    - Keep the 500 most distinct patches per class (use 20-NN)

2. Select top-ranked clusters
    - Use the validation set
    - Rank by: appearance + λ purity
    - Choose the top 80 clusters (for each class)
    - Q: are the e-SVM scores comparable?
    - Q: how is λ obtained?
    - Q: redundant clusters?
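The two steps above can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions, not the authors' implementation: "use 20-NN" is read here as scoring each patch by the fraction of its k nearest neighbours that share its class label, and `appearance` and `purity` are taken as precomputed per-cluster scores, since the slides do not define them. The function names and the default `lam` are hypothetical.

```python
import numpy as np

def knn_class_purity(features, labels, k=20):
    """Fraction of each patch's k nearest neighbours (self excluded)
    that carry the same class label. Assumed reading of "use 20-NN"."""
    features = np.asarray(features, dtype=float)
    labels = np.asarray(labels)
    sq = np.sum(features ** 2, axis=1)
    # Squared Euclidean distances between all pairs of patches.
    d2 = sq[:, None] + sq[None, :] - 2.0 * features @ features.T
    np.fill_diagonal(d2, np.inf)          # a patch is not its own neighbour
    nn = np.argsort(d2, axis=1)[:, :k]    # indices of the k nearest patches
    return np.mean(labels[nn] == labels[:, None], axis=1)

def rank_clusters(appearance, purity, lam=1.0, top_k=80):
    """Rank clusters by appearance + lam * purity and keep the top_k.
    How lam is chosen is left open, as in the slides."""
    score = np.asarray(appearance, dtype=float) + lam * np.asarray(purity, dtype=float)
    order = np.argsort(-score, kind="stable")  # highest score first
    return order[:top_k], score
```

With this reading, the 500 most distinct patches per class would be those with the highest `knn_class_purity` values within each class.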





Patch selection procedure

1. Action classification
    - Run each e-SVM in sliding-window fashion
    - Construct an SPM representation (with max-pooling)
    - Train an SVM classifier

2. Fine-grained video analysis
    - Use context to choose a subset of patches
    - Build correspondence between videos
    - Q: how?
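The SPM step of the classification pipeline can be sketched as follows. The layout is an assumption made for illustration: detector responses are stored as an array of shape (N detectors, T, H, W), and the pyramid has a 1x1 and a 2x2 spatial level with no temporal split. The slides do not specify the pyramid, and `spm_max_pool` is a hypothetical name.

```python
import numpy as np

def spm_max_pool(score_maps, levels=(1, 2)):
    """Max-pool sliding-window detector scores over a spatial pyramid.

    score_maps: array of shape (N, T, H, W), one response volume per e-SVM.
    Returns a vector of length N * sum(g * g for g in levels).
    Assumes H and W are at least as large as the finest grid size.
    """
    n, t, h, w = score_maps.shape
    feats = []
    for g in levels:
        # Cell boundaries of a g x g grid over the frame.
        ys = np.linspace(0, h, g + 1).astype(int)
        xs = np.linspace(0, w, g + 1).astype(int)
        for i in range(g):
            for j in range(g):
                cell = score_maps[:, :, ys[i]:ys[i + 1], xs[j]:xs[j + 1]]
                # Strongest response of each detector inside this cell.
                feats.append(cell.max(axis=(1, 2, 3)))
    return np.concatenate(feats)
```

The resulting vectors, one per video, are what a standard SVM classifier would then be trained on.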






- Here we assume the video has been classified as class "k"
- Potential patches: the highest-scoring detection of each e-SVM
- For a patch vocabulary of size N, let x = (x_1, ..., x_N) be an indicator vector with x_i ∈ {0, 1}
- Find

  \[ x^* = \arg\max_x \; \sum_i A_i x_i + w_1 \sum_i C_{k,i} x_i - w_2 \sum_{i,j} P_{i,j} x_i x_j \]

  where:
    A : N × 1 appearance vector
    C_k : N × 1 class consistency vector for class k
    P : N × N penalty matrix

- Solve the optimization problem using IPFP. This requires writing the problem in the following form:

  \[ X^* = \arg\max_X X^T M X, \qquad X = \begin{pmatrix} 1 \\ x \end{pmatrix} \]
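The slides do not show M. One choice that reproduces the objective, offered here as an assumption, puts half of the linear term A + w_1 C_k in the first row and first column and −w_2 P in the remaining block, so that X^T M X = (A + w_1 C_k)^T x − w_2 x^T P x. The sketch below builds this M and, for tiny N only, finds the maximiser by exhaustive search. The search stands in for IPFP purely for illustration, and all function names are hypothetical.

```python
import itertools
import numpy as np

def build_M(A, C_k, P, w1, w2):
    """(N+1) x (N+1) matrix M with X^T M X equal to the selection objective."""
    b = np.asarray(A, dtype=float) + w1 * np.asarray(C_k, dtype=float)
    n = len(b)
    M = np.zeros((n + 1, n + 1))
    M[0, 1:] = b / 2.0   # linear terms, split between row 0 ...
    M[1:, 0] = b / 2.0   # ... and column 0
    M[1:, 1:] = -w2 * np.asarray(P, dtype=float)  # pairwise penalty block
    return M

def objective(x, A, C_k, P, w1, w2):
    """Objective exactly as written on the slide."""
    return A @ x + w1 * (C_k @ x) - w2 * (x @ P @ x)

def brute_force_select(A, C_k, P, w1, w2):
    """Exhaustive search over {0,1}^N; feasible only for very small N."""
    best_x, best_val = None, -np.inf
    for bits in itertools.product([0, 1], repeat=len(A)):
        x = np.array(bits, dtype=float)
        val = objective(x, A, C_k, P, w1, w2)
        if val > best_val:
            best_x, best_val = x, val
    return best_x, best_val
```

The search cost grows as 2^N, which is why an approximate solver such as IPFP is needed for a realistic vocabulary size.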









Results: classification and alignment

- Only cuboid patches
- Scale ranges from 120x120x50 to the entire video
- Each patch is represented by HOG3D (4x4x5 and 20 orientations)
- Experiments on UCF50 and the Olympics dataset
- This method outperforms action-bank by 3.32% on UCF50
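As a quick sanity check on the descriptor size: if "4x4x5 and 20 orientations" means a 4x4x5 grid of cells with a 20-bin orientation histogram per cell (an assumption, since the slide does not spell this out), the descriptor length is just the product of those numbers. The helper below is hypothetical.

```python
def descriptor_length(cells=(4, 4, 5), orientations=20):
    """Length of a cell-grid descriptor with one orientation histogram per cell."""
    nx, ny, nt = cells
    return nx * ny * nt * orientations

# 4 * 4 * 5 cells, 20 bins each
print(descriptor_length())  # 1600
```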






- Manually label 50 patches per class with:
    - Objects of interaction (e.g. golf club, weights)
    - Person bounding boxes
    - Person pose
- These extra annotations are transferred to test images after aligning
- Informal evaluations:
    - 50% of transferred joints are within 15 pix of ground truth
    - 84% accuracy in localizing persons (50% overlap criterion)
- Q: How is the alignment done?
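The two informal measures above can be computed along these lines. The sketch assumes that the "50% overlap" criterion is intersection-over-union of at least 0.5 between the transferred and ground-truth person boxes, and that joint error is Euclidean distance in pixels. Neither definition is stated on the slide, and both function names are hypothetical.

```python
import numpy as np

def box_iou(a, b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    ax1, ay1, ax2, ay2 = a
    bx1, by1, bx2, by2 = b
    iw = max(0.0, min(ax2, bx2) - max(ax1, bx1))  # overlap width
    ih = max(0.0, min(ay2, by2) - max(ay1, by1))  # overlap height
    inter = iw * ih
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0

def joints_within(pred, gt, thresh=15.0):
    """Fraction of predicted joints within thresh pixels of ground truth."""
    pred = np.asarray(pred, dtype=float)
    gt = np.asarray(gt, dtype=float)
    dist = np.linalg.norm(pred - gt, axis=1)
    return float(np.mean(dist <= thresh))
```

Under these assumptions a person localization counts as correct when `box_iou(transferred, ground_truth) >= 0.5`.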



