
Workshop on Web-scale Vision and Social Media, ECCV 2012, Firenze, Italy

October 2012

Linearized Smooth Additive Classifiers
Subhransu Maji

Research Assistant Professor, Toyota Technological Institute at Chicago

Generalized Additive Models

• Why use them?

• Efficiency : the classifier can be evaluated efficiently

• Interpretability : a simple generalization of linear classifiers, which may lead to interpretable models

• Well known in the statistics community

• Generalized Additive Models (Hastie & Tibshirani ’90)

• However, traditional learning algorithms (e.g. the "backfitting" algorithm) do not scale well

$f(x_1, x_2, \ldots, x_n) = f_1(x_1) + f_2(x_2) + \cdots + f_n(x_n)$

Additive kernels in computer vision

• Images are represented as histograms of low-level features such as color and texture [Swain and Ballard '91, Odone et al. '05]

• Histogram-based similarity measures are typically additive

• Other examples of additive kernels based on approximate correspondence :

Pyramid Match Kernel, Grauman and Darrell, CVPR’05

Spatial Pyramid Match Kernel, Lazebnik, Schmid and Ponce, CVPR'06

$K_{\chi^2}(x, y) = \sum_i \frac{2\, x_i y_i}{x_i + y_i} \qquad K_{\min}(x, y) = \sum_i \min(x_i, y_i)$
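For concreteness, here is a minimal numpy sketch of these two kernels on a pair of L1-normalized histograms (the function names are illustrative, not from any particular library):

```python
import numpy as np

def chi2_kernel(x, y, eps=1e-12):
    """K_chi2(x, y) = sum_i 2*x_i*y_i / (x_i + y_i) for non-negative histograms."""
    return np.sum(2.0 * x * y / (x + y + eps))

def min_kernel(x, y):
    """K_min(x, y) = sum_i min(x_i, y_i), the histogram intersection kernel."""
    return np.sum(np.minimum(x, y))

# Example on two L1-normalized histograms
x = np.array([0.5, 0.3, 0.2])
y = np.array([0.4, 0.4, 0.2])
print(chi2_kernel(x, y), min_kernel(x, y))
```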

Directly learning additive classifiers

$f(x_1, x_2, \ldots, x_n) = f_1(x_1) + f_2(x_2) + \cdots + f_n(x_n)$

An SVM-like optimization framework:

$\min_{f \in \mathcal{F}} \sum_k l\big(y_k, f(x_k)\big) + \lambda R(f), \qquad x_k \in \mathbb{R}^d, \; y_k \in \{+1, -1\}$

• Loss function on the data, e.g. the hinge loss $l\big(y_k, f(x_k)\big) = \max\big(0,\, 1 - y_k f(x_k)\big)$

• Regularization, e.g. a derivative norm
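Read as code, the objective (in the normalized form $c(w)$ used on the later slides) is just a regularized average hinge loss over encoded features; a hedged numpy sketch, with all names illustrative:

```python
import numpy as np

def additive_objective(w, Phi, y, lam, H):
    """c(w) = (lam/2) w^T H w + (1/n) sum_k max(0, 1 - y_k w^T phi(x_k)).

    Phi : (n, m) matrix whose rows are the encoded features phi(x_k)
    H   : (m, m) regularizer, e.g. D^T D for a difference penalty (H = I gives a linear SVM)
    """
    margins = y * (Phi @ w)
    hinge = np.maximum(0.0, 1.0 - margins).mean()
    return 0.5 * lam * w @ (H @ w) + hinge
```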

Strategy for learning additive classifiers

$\min_{f \in \mathcal{F}} \sum_k l\big(y_k, f(x_k)\big) + \lambda R(f), \qquad x_k \in \mathbb{R}^d, \; y_k \in \{+1, -1\}$

• Basis expansion: $f_i = \sum_j w_{ij}\, \phi_{ij}$

• Smoothness penalty on the weights

• Search for representations of the function and regularizations for which the optimization can be solved efficiently

• In particular, we want to leverage the latest methods for learning linear classifiers (almost linear-time algorithms)

Spline embeddings : motivation

• Represent each function using a uniformly spaced spline basis

• Motivated by our earlier analysis that splines approximate additive classifiers well

• Popularized by Eilers and Marx (P-Splines ’02, ’04) for additive modeling

• Well known in graphics (DeBoor ’01)

• Question: What is the regularization on w?

(Figure from Eilers & Marx '04)

$f(x_1, x_2, \ldots, x_n) = f_1(x_1) + f_2(x_2) + \cdots + f_n(x_n), \qquad f_i = \sum_j w_{ij}\, \phi_{ij}$

Spline embeddings : representation

$f(x) = w^T \phi(x)$

[Figure: linear, quadratic, and cubic B-spline bases]

Representation: project the data onto a uniformly spaced spline basis
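A minimal sketch of this projection for linear B-splines (hat functions) on a uniform grid over [0, 1]; the function names and the choice of grid are illustrative assumptions, not the exact encoding from the released code:

```python
import numpy as np

def encode_linear_spline(x, n_bins=10):
    """Project a scalar feature x in [0, 1] onto n_bins uniformly spaced
    linear B-splines (hat functions). At most two entries are non-zero."""
    x = np.clip(x, 0.0, 1.0)
    t = x * (n_bins - 1)              # position on the knot grid
    lo = int(np.floor(t))
    hi = min(lo + 1, n_bins - 1)
    frac = t - lo
    phi = np.zeros(n_bins)
    phi[lo] += 1.0 - frac
    phi[hi] += frac
    return phi

def encode_vector(x, n_bins=10):
    """A d-dimensional input is encoded per dimension and concatenated."""
    return np.concatenate([encode_linear_spline(xi, n_bins) for xi in x])
```

Because the encodings of the d dimensions are concatenated, a linear classifier on $\phi(x)$ is exactly an additive function of the original coordinates.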

Spline embeddings : regularization

$f(x) = w^T \phi(x)$

Regularization: $R(f) = w^T H w, \quad H = D^T D \quad (D_0 = I)$

Penalize differences between adjacent weights; this ensures that the learned function is smooth.

Similar to the Penalized Splines of Eilers and Marx '02
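A small numpy sketch of the difference-penalty construction; difference_matrix is an illustrative helper, not part of any library:

```python
import numpy as np

def difference_matrix(n, order=1):
    """d-th order finite-difference matrix D_d acting on n spline weights.
    D_0 = I; D_1 computes w[i+1] - w[i]; higher orders are built by repeated differencing."""
    D = np.eye(n)
    for _ in range(order):
        D = np.diff(D, axis=0)        # each application drops one row
    return D

n = 5
D1 = difference_matrix(n, order=1)
H = D1.T @ D1                         # smoothness penalty: R(f) = w^T H w
```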

Spline embeddings : optimization

$f(x) = w^T \phi(x)$

Regularization: $R(f) = w^T H w, \quad H = D^T D$

Optimization:

$c(w) = \frac{\lambda}{2}\, w^T D_d^T D_d\, w + \frac{1}{n} \sum_k \max\big(0,\, 1 - y_k\, w^T \phi(x_k)\big)$

Spline embeddings : linear case

$f(x) = w^T \phi(x)$

Regularization: zero-order differences, $D_0 = I$ [Maji and Berg, ICCV '09]

• Reduces to a standard linear SVM

• Projected features are sparse: at most k + 1 non-zero entries per dimension for a spline basis of degree k

Optimization:

$c(w) = \frac{\lambda}{2}\, w^T w + \frac{1}{n} \sum_k \max\big(0,\, 1 - y_k\, w^T \phi(x_k)\big)$
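Since $D_0 = I$, training in the linear case is just a standard L2-regularized linear SVM on the encoded features; a hedged sketch using scikit-learn's LinearSVC on toy data (the encoder and the data here are illustrative assumptions):

```python
import numpy as np
from sklearn.svm import LinearSVC

def encode(X, n_bins=10):
    """Linear B-spline encoding of each column of X (values in [0, 1]),
    concatenated across dimensions; at most two non-zero entries per dimension."""
    n, d = X.shape
    t = np.clip(X, 0.0, 1.0) * (n_bins - 1)
    lo = np.floor(t).astype(int)
    hi = np.minimum(lo + 1, n_bins - 1)
    frac = t - lo
    Phi = np.zeros((n, d * n_bins))
    rows = np.arange(n)[:, None]
    offs = np.arange(d) * n_bins
    Phi[rows, offs + lo] += 1.0 - frac
    Phi[rows, offs + hi] += frac
    return Phi

rng = np.random.default_rng(0)
X = rng.random((200, 8))                    # toy data in [0, 1]^8
y = rng.choice([-1, 1], size=200)           # toy labels, for illustration only
clf = LinearSVC(C=1.0).fit(encode(X), y)    # D_0 = I: a plain L2-regularized linear SVM
```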

Spline embeddings : general case

$f(x) = w^T \phi(x)$

Regularization: penalize first-order differences. Non-standard SVM due to the regularization [Eilers & Marx '02]

• Can offer better smoothness when some bins have few data points

Optimization:

$c(w) = \frac{\lambda}{2}\, w^T D_1^T D_1\, w + \frac{1}{n} \sum_k \max\big(0,\, 1 - y_k\, w^T \phi(x_k)\big)$

Spline embeddings : linearization

Original problem:

$c(w) = \frac{\lambda}{2}\, w^T D_d^T D_d\, w + \frac{1}{n} \sum_k \max\big(0,\, 1 - y_k\, w^T \phi(x_k)\big)$

Re-parameterization of the weights, or equivalently a change of basis for the features (shown for d = 1):

$c(w) = \frac{\lambda}{2}\, w^T w + \frac{1}{n} \sum_k \max\big(0,\, 1 - y_k\, w^T D_1^{-T} \phi(x_k)\big)$

$D_1^{-T} = \begin{pmatrix} 1 & 1 & \cdots & 1 & 1 \\ 0 & 1 & \cdots & 1 & 1 \\ & & \ddots & & \vdots \\ 0 & 0 & \cdots & 1 & 1 \\ 0 & 0 & \cdots & 0 & 1 \end{pmatrix}$ (upper triangular)

Re-parameterized features: $\psi = D_1^{-T} \phi$, with implicit kernel $K_1(x, y) = \psi(x)^T \psi(y)$
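For a single dimension, multiplying by the upper-triangular matrix of ones is a suffix cumulative sum of the encoding; a minimal sketch under the convention that $D_1$ is the square lower-bidiagonal first-difference matrix (so $D_1^{-T}$ is the all-ones upper triangle):

```python
import numpy as np

def linearize(phi):
    """Change of basis psi = D1^{-T} phi, where D1^{-T} is the upper-triangular
    matrix of ones: psi_i = sum_{j >= i} phi_j (a suffix cumulative sum).
    Applied to each per-dimension block of the spline encoding."""
    return np.cumsum(phi[::-1])[::-1]

phi = np.array([0.0, 0.3, 0.7, 0.0, 0.0])    # a sparse linear-spline encoding
psi = linearize(phi)                          # dense, but with standard L2 regularization
print(psi)                                    # -> [1.0, 1.0, 0.7, 0.0, 0.0]
```

For degree-0 (histogram) encodings this inner product is the min kernel on the quantized values; higher-degree splines give the smooth variants shown on the next slide.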

Spline embeddings : visualizing the kernel

[Figure: implicit kernels for various bases and regularizations, fixing order(regularization) = 1 and varying degree(basis)]

These are smooth variants of the min kernel.

Spline embeddings : visualizing the basis

[Figure: bases for various regularizations]

$\phi_i(x) = (x - \tau_i)_+^r$

The bases are equivalent to a truncated polynomial basis when degree(basis) - order(regularization) = 1 [DeBoor '01]

Solving the optimization efficiently

Original problem (non-standard regularization):

$c(w) = \frac{\lambda}{2}\, w^T D_d^T D_d\, w + \frac{1}{n} \sum_k \max\big(0,\, 1 - y_k\, w^T \phi(x_k)\big)$

Modified problem (dense features, a memory bottleneck):

$c(w) = \frac{\lambda}{2}\, w^T w + \frac{1}{n} \sum_k \max\big(0,\, 1 - y_k\, w^T D_d^{-T} \phi(x_k)\big)$

Solution: an implicit representation

• maintain: the equivalent weights $w_d = D_d^{-1} w$

• classification: $f(x) = w_d^T \phi(x)$ — $O(d)$ vs $O(n)$

• update: given n linearly spaced basis functions of degree d, a naive update to $w_d$ takes $O(n^2)$ time; however, it can be computed in $O(nd)$ by exploiting the structure of D (see the paper below)

S. Maji, Linearized Smooth Additive Classifiers, ECCV Workshops 2012

Hence these classifiers can be trained with very low memory overhead, without compromising training time.
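One way to picture the low memory overhead: encode each example on the fly inside a stochastic sub-gradient loop rather than materializing the dense feature matrix. The sketch below is a generic Pegasos-style loop under that assumption; it does not reproduce the paper's O(nd) structured update for $w_d$:

```python
import numpy as np

def encode(x, n_bins=10):
    """Sparse linear B-spline encoding of one example x in [0, 1]^d (hat functions per dimension)."""
    d = len(x)
    t = np.clip(x, 0.0, 1.0) * (n_bins - 1)
    lo = np.floor(t).astype(int)
    hi = np.minimum(lo + 1, n_bins - 1)
    frac = t - lo
    phi = np.zeros(d * n_bins)
    offs = np.arange(d) * n_bins
    phi[offs + lo] += 1.0 - frac
    phi[offs + hi] += frac
    return phi

def train_online(X, y, lam=1e-2, n_bins=10, epochs=5, seed=0):
    """Pegasos-style sub-gradient descent on (lam/2)||w||^2 + mean hinge loss.
    Each example is encoded on demand, so memory stays O(d * n_bins)
    rather than O(num_examples * d * n_bins)."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w = np.zeros(d * n_bins)
    t = 0
    for _ in range(epochs):
        for k in rng.permutation(n):
            t += 1
            eta = 1.0 / (lam * t)
            phi = encode(X[k], n_bins)          # computed on the fly, never stored
            violated = y[k] * (w @ phi) < 1.0
            w *= (1.0 - eta * lam)
            if violated:
                w += eta * y[k] * phi
    return w
```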

Spline embeddings : computational tradeoffs

[Figure: grid of spline degree (linear, quadratic, cubic) vs. regularization order (D0, D1, D2)]

Experiments on the DC pedestrian dataset (n = 20); IKSVM training time: 360s

Varying the spline degree with D0 regularization (standard linear solver): training and test time increase linearly, accuracy increases.

linear: 5.96s, 90.23%   quadratic: 7.26s, 90.34%   cubic: 10.08s, 90.39%

Varying the regularization order (custom solver for order > 1): test time stays constant, training time increases, and accuracy peaks at D1, which suggests that first-order smoothness is sufficient.

D0: 5.96s, 90.23%   D1: 32.43s, 91.20%   D2: 246.87s, 89.06%

Useful regime for most datasets: D1 is the most accurate (it closely approximates IKSVM); D0 is the fastest.

Experiments on MNIST, INRIA pedestrians, Caltech 101, DC pedestrians, etc.: all of these are still an order of magnitude faster than a traditional SVM solver, and they have the same memory overhead as linear SVMs since the features are computed online.

General basis expansion

$f(x_1, x_2, \ldots, x_n) = f_1(x_1) + f_2(x_2) + \cdots + f_n(x_n), \qquad f_i = \sum_j w_{ij}\, \phi_{ij}$

• Local splines are one choice of basis

• In general, one can choose any orthonormal basis

Fourier embeddings : representation

Representation: $f(x) = \sum_i w_i\, \psi_i(x)$, where $\psi_1(x), \psi_2(x), \ldots, \psi_n(x)$ is an orthonormal basis:

$\int_a^b \psi_i(x)\, \psi_j(x)\, \mu(x)\, dx = \delta_{i,j}$

Examples: trigonometric functions, wavelets, etc.

Fourier embeddings : regularization

Representation: $f(x) = \sum_i w_i\, \psi_i(x)$, with $\int_a^b \psi_i(x)\, \psi_j(x)\, \mu(x)\, dx = \delta_{i,j}$

Regularization: penalize the d-th order derivative, $R(f) = w^T H w$ with

$H_{i,j} = \int_a^b \psi_i^{(d)}(x)\, \psi_j^{(d)}(x)\, \mu(x)\, dx, \qquad \int_a^b f^{(d)}(x)^2\, \mu(x)\, dx = \sum_{i,j} w_i w_j \int_a^b \psi_i^{(d)}(x)\, \psi_j^{(d)}(x)\, \mu(x)\, dx$

Similar to the spline case (with a different basis); requires the basis to be differentiable.

Fourier embeddings : optimization

Optimization:

$c(w) = \frac{\lambda}{2}\, w^T H w + \frac{1}{n} \sum_k \max\big(0,\, 1 - y_k\, w^T \psi(x_k)\big)$

• H is not diagonal, so fast linear solvers cannot be used directly

• In general it is not structured either (unlike the spline case)

Practical solution: pick an orthogonal basis with orthogonal derivatives. The cross terms disappear, i.e., H is diagonal again:

$\int_a^b f^{(d)}(x)^2\, \mu(x)\, dx = \sum_{i,j} w_i w_j \int_a^b \psi_i^{(d)}(x)\, \psi_j^{(d)}(x)\, \mu(x)\, dx \;\propto\; \sum_i w_i^2$

Examples: trigonometric functions, and one of the Jacobi, Laguerre, or Hermite polynomials.

M. Webster, Orthogonal polynomials with orthogonal derivatives. Mathematische Zeitschrift, 39:634–638, 1935
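A small numerical check of the orthogonal-derivatives property for one of the families above, the probabilists' Hermite polynomials, using numpy's Gauss-Hermite quadrature; if the property holds, the matrix H of derivative inner products comes out diagonal:

```python
import numpy as np
from numpy.polynomial import hermite_e as He

n = 6                                      # use He_0 ... He_{n-1} as the basis
x, wq = He.hermegauss(2 * n)               # nodes/weights for the weight exp(-x^2 / 2)

def deriv_values(i):
    c = np.zeros(i + 1); c[i] = 1.0        # coefficients of He_i
    return He.hermeval(x, He.hermeder(c))  # values of d/dx He_i at the quadrature nodes

H = np.array([[np.sum(wq * deriv_values(i) * deriv_values(j)) for j in range(n)]
              for i in range(n)])
print(np.allclose(H, np.diag(np.diag(H))))   # True: the off-diagonal terms vanish
```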

Fourier Embeddings : Two practical ones

Two families of orthogonal bases with orthogonal derivatives

Embeddings that penalize the first- and second-order derivatives

Learning : project the data onto the first few basis functions and use a linear solver such as LIBLINEAR

Fourier features are low dimensional and dense, as opposed to spline features which are high dimensional but sparse.
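As an illustration of this recipe (my own construction in the spirit of these slides, not the exact features from the paper): a cosine basis on [0, 1], with each component scaled by 1/(kπ) so that the first-derivative penalty becomes a plain $\|w\|^2$ term, after which any off-the-shelf linear solver applies (LinearSVC stands in for LIBLINEAR here):

```python
import numpy as np
from sklearn.svm import LinearSVC

def fourier_encode(X, n_freq=8):
    """Per-dimension cosine features psi_k(x) = sqrt(2) * cos(k*pi*x) / (k*pi), k = 1..n_freq.
    The 1/(k*pi) scaling folds the first-derivative penalty into a standard ||w||^2 term,
    so a plain linear SVM on these features realizes the smoothness-regularized objective."""
    n, d = X.shape
    k = np.arange(1, n_freq + 1)                         # frequencies
    feats = np.sqrt(2) * np.cos(np.pi * X[:, :, None] * k) / (np.pi * k)
    return feats.reshape(n, d * n_freq)                  # dense, low-dimensional features

rng = np.random.default_rng(0)
X = rng.random((200, 8))                                 # toy data in [0, 1]^8
y = rng.choice([-1, 1], size=200)                        # toy labels, illustration only
clf = LinearSVC(C=1.0).fit(fourier_encode(X), y)
```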

Comparison of various additive classifiers

[Figures: comparisons of the additive classifiers on the DC pedestrian and MNIST datasets]

Software

• Code to train large-scale additive classifiers. Provides the following functions:

• train : takes (y, x) as input and outputs an additive classifier

• choice of various encodings, regularizations.

• encodings are computed online

• implements efficient weight updates for spline features

• classify : takes a learned classifier and features, outputs decision values

• encode : returns encoded features which can be directly used with any linear solver

• Download at:

• http://ttic.uchicago.edu/~smaji/libspline-release1.0.tar.gz

Conclusions

• We discussed methods to directly learn additive classifiers based on regularized loss minimization

• We proposed two kinds of bases for which the learning problem can be efficiently solved: Spline and Fourier embeddings

• Spline embeddings are sparse, easy to compute, and can be used to learn classifiers with almost no memory overhead compared to learning linear classifiers

• Fourier embeddings (Trigonometric and Hermite) are low dimensional, but relatively expensive to compute; hence they are useful in settings where the features can be stored in memory

• More experimental details and code can be found on the author’s website (ttic.uchicago.edu/~smaji)