Transfer Learning for Image Classification


Transcript of Transfer Learning for Image Classification

Page 1: Transfer Learning for Image Classification

Transfer Learning for Image Classification

Page 2: Transfer Learning for Image Classification

Transfer Learning Approaches

Leverage data from related tasks to improve performance:

Improve generalization.

Reduce the run-time of evaluating a set of classifiers.

Two Main Approaches:

Learning Shared Hidden Representations.

Sharing Features.

Page 3: Transfer Learning for Image Classification

Sharing Features: efficient boosting procedures for multiclass object detection

Antonio Torralba

Kevin Murphy

William Freeman

Page 4: Transfer Learning for Image Classification

Snapshot of the idea

Goal: Reduce the computational cost of multiclass object recognition and improve generalization performance.

Approach: Make boosted classifiers share weak learners

Some Notation:

Training set: $T = \{(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)\}$

Inputs: $x_i \in \mathbb{R}^d$

Label vectors: $y_i = [y_i^1, y_i^2, \ldots, y_i^m]$, with $y_i^k \in \{+1, -1\}$

Page 5: Transfer Learning for Image Classification

Training a single boosted classifier

Consider training a single boosted classifier: fit an additive model $F(x) = \sum_t h_t(x)$ whose candidate weak learners $h_t(x)$ are weighted stumps.

Page 6: Transfer Learning for Image Classification

Training a single boosted classifier

Greedy Approach

Minimize the exponential loss

Gentle Boosting
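The Gentle Boosting step above can be made concrete with a small sketch. Everything below is an illustration rather than code from the talk: the weak learners are the weighted regression stumps mentioned on the previous slide, the weights $w_i = \exp(-y_i F(x_i))$ come from the exponential-loss view, and the brute-force stump search is chosen for clarity rather than speed.

```python
import numpy as np

def fit_stump(X, y, w):
    """Weighted regression stump h(x) = a*[x_f > theta] + b minimizing
    sum_i w_i * (y_i - h(x_i))^2 over all features f and thresholds theta."""
    best = None
    for f in range(X.shape[1]):
        for theta in np.unique(X[:, f]):
            idx = X[:, f] > theta
            sw_pos, sw_neg = w[idx].sum(), w[~idx].sum()
            if sw_pos == 0 or sw_neg == 0:
                continue
            b = np.sum(w[~idx] * y[~idx]) / sw_neg        # output on the "<=" side
            a = np.sum(w[idx] * y[idx]) / sw_pos - b      # increment on the ">" side
            err = np.sum(w * (y - (b + a * idx)) ** 2)
            if best is None or err < best[0]:
                best = (err, f, theta, a, b)
    return best[1:]

def gentle_boost(X, y, rounds=50):
    """Minimal GentleBoost sketch: an additive model of regression stumps,
    with example weights w_i = exp(-y_i * F(x_i)); y must be in {-1, +1}."""
    F = np.zeros(len(y))
    stumps = []
    for _ in range(rounds):
        w = np.exp(-y * F)
        w /= w.sum()
        f, theta, a, b = fit_stump(X, y, w)
        F += b + a * (X[:, f] > theta)
        stumps.append((f, theta, a, b))
    return stumps

def predict(stumps, X):
    F = np.zeros(len(X))
    for f, theta, a, b in stumps:
        F += b + a * (X[:, f] > theta)
    return np.sign(F)
```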

Page 7: Transfer Learning for Image Classification

Standard Multiclass Case: No Sharing

Additive model for each class

Minimize the sum of exponential losses

Each class has its own set of weak learners.

Page 8: Transfer Learning for Image Classification

Multiclass Case: Sharing Features

Each subset of classes has a corresponding additive classifier, and the classifier for the k-th class combines the additive models of the subsets that contain class k.

At iteration t, add one weak learner to one of the additive models so as to minimize the sum of exponential losses.

Naive approach: search over all possible subsets of classes at each iteration.

Greedy heuristic: grow the sharing subset one class at a time (see the sketch below).
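A minimal sketch of the greedy heuristic for a single boosting round, assuming a hypothetical helper fit_shared_learner(subset) that fits the best weak learner shared by a candidate subset of classes and returns it together with the resulting multiclass loss (the talk does not define such a helper):

```python
def greedy_subset_selection(classes, fit_shared_learner):
    """Instead of trying every possible class subset (the naive approach),
    grow the sharing subset one class at a time, keeping the extension
    that most reduces the loss, and stop when no extension improves it."""
    remaining = set(classes)
    subset, best = [], None
    while remaining:
        # try adding each remaining class and keep the best extension
        trial = min(((fit_shared_learner(subset + [c]), c) for c in remaining),
                    key=lambda t: t[0][1])
        (learner, loss), c = trial
        if best is not None and loss >= best[1]:
            break                      # no improvement: stop growing the subset
        subset.append(c)
        remaining.remove(c)
        best = (learner, loss)
    return subset, best
```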

Page 9: Transfer Learning for Image Classification

Sharing features for multiclass object detection

Torralba, Murphy, Freeman. CVPR 2004

Page 10: Transfer Learning for Image Classification

Learning efficiency

Page 11: Transfer Learning for Image Classification

Sharing features shows sub-linear scaling of features with objects (for area under ROC = 0.9).

Red: shared features Blue: independent features

Page 12: Transfer Learning for Image Classification

How the features are shared across objects

Basic features: Filter responses computed at different locations of the image

Page 13: Transfer Learning for Image Classification

Uncovering Shared Structures in Multiclass Classification

Yonatan Amit, Michael Fink, Nathan Srebro, Shimon Ullman

Page 14: Transfer Learning for Image Classification

Structure Learning Framework

Class Parameters

Structural Parameters

Find optimal parameters:

Linear Classifiers

Linear Transformations

$\arg\min_{W,\Theta}\ \frac{1}{n}\sum_{i=1}^{n}\mathrm{Loss}(W\Theta,\, x_i,\, y_i)\; +\; c_1\, r(W)\; +\; c_2\, r(\Theta)$

where $r(\cdot)$ denotes the regularizer on each set of parameters.

Page 15: Transfer Learning for Image Classification

Multiclass Loss Function

Hinge Loss: $\ell(w; x, y) = \max\big(0,\ 1 - y\, w^\top x\big)$

Maximal Hinge Loss: $\mathrm{Loss}(W, x, y) = \max_{j,p\,:\ y^j = 1,\ y^p = -1}\ \max\big(0,\ 1 - w_j^\top x + w_p^\top x\big)$

Hinge Loss:

Page 16: Transfer Learning for Image Classification

Snapshot of the idea

Main Idea: enforce sharing by finding a low-rank parameter matrix $W'$.

Consider the $m \times d$ parameter matrix $W'$ whose rows are the per-class weight vectors. It can be factored as $W' = W\,\Theta$, where the rows of $\Theta$ form a basis: $\Theta$ acts as a transformation on $x$ and $W$ as a transformation on the class weights $w$.

The multiclass loss is as before:

$\mathrm{Loss}(W', x, y) = \max_{j,p\,:\ y^j = 1,\ y^p = -1}\ \max\big(0,\ 1 - w_j'^{\top} x + w_p'^{\top} x\big)$

Page 17: Transfer Learning for Image Classification

Low Rank Regularization Penalty

The rank of a d by m matrix $W'$ is the smallest $z$ such that $W'$ can be written as the product of a $d \times z$ and a $z \times m$ matrix.

A regularization penalty designed to minimize the rank of $W'$ would tend to produce solutions where a few basis vectors are shared by all classes.

Minimizing the rank directly would lead to a hard combinatorial problem.

Instead, use a trace norm penalty: $\|W'\|_{\Sigma} = \sum_i \sigma_i(W')$, the sum of the singular values of $W'$ (equivalently, the square roots of the eigenvalues of $W'^{\top} W'$).
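For concreteness, the trace norm can be computed directly from the singular values; a small NumPy illustration (not from the talk):

```python
import numpy as np

def trace_norm(W):
    """Trace (nuclear) norm of W: the sum of its singular values,
    used here as a convex surrogate for the rank."""
    return np.linalg.svd(W, compute_uv=False).sum()

# rank(W) counts the non-zero singular values; the trace norm replaces
# the count with a sum, which is convex and therefore tractable to penalize
W = np.random.randn(20, 50)
print(np.linalg.matrix_rank(W), trace_norm(W))
```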

Page 18: Transfer Learning for Image Classification

Putting it all together

With the trace norm penalty, $\Theta$ is no longer in the objective:

$\arg\min_{W'}\ \frac{1}{n}\sum_{i=1}^{n}\mathrm{Loss}(W',\, x_i,\, y_i)\; +\; c\,\|W'\|_{\Sigma}$

For optimization they use a gradient-based method that minimizes a smooth approximation of the objective.

Page 19: Transfer Learning for Image Classification

Mammals Dataset

Page 20: Transfer Learning for Image Classification

Results

Page 21: Transfer Learning for Image Classification

Transfer Learning for Image Classification via Sparse Joint Regularization

Ariadna Quattoni

Michael Collins

Trevor Darrell

Page 22: Transfer Learning for Image Classification

Training visual classifiers when a few examples are available

Problem: image classification from a few examples can be hard; a good representation of images is crucial.

Solution: we learn a good image representation using unlabeled data plus labeled data from related problems.

Page 23: Transfer Learning for Image Classification

Snapshot of the idea:

Use the unlabeled dataset and a kernel function to compute a new representation: complex features in a high-dimensional space. Some of them will (hopefully) be very discriminative; most will be irrelevant.

If we knew the relevant features we could learn from fewer examples.

Related problems may share relevant features.

We can use data from related problems to discover them !!

Page 24: Transfer Learning for Image Classification

Semi-supervised learning

Step 1: Learn a representation $F: I \to \mathbb{R}^h$ from a large dataset of unlabeled data (unsupervised learning).

Step 2: Compute the $h$-dimensional representation of a small training set of labeled images and train the classifier on it.

Page 25: Transfer Learning for Image Classification

Semi-supervised learning:

Raina et al. [ICML 2007] proposed an approach that learns a sparse set of high-level features (i.e. linear combinations of the original features) from unlabeled data using a sparse coding technique.

Balcan et al. [ML 2004] proposed a representation based on computing kernel distances to unlabeled data points.

Page 26: Transfer Learning for Image Classification

Learning visual representations using unlabeled data only

Unsupervised learning in data space.

Good thing: a lower-dimensional representation preserves relevant statistics of the data sample.

Bad thing: the representation might still contain irrelevant features, i.e. features that are useless for classification.

Page 27: Transfer Learning for Image Classification

Learning visual representations from unlabeled data + labeled data from related categories

Step 1: Learn a representation.

Use a large dataset of unlabeled images and a kernel function $k: X \times X \to \mathbb{R}$ to create a new representation; then use labeled images from related categories to select discriminative features of the new representation, yielding a discriminative representation $F: I \to \mathbb{R}^h$.

Page 28: Transfer Learning for Image Classification

Our contribution

Main differences with previous approaches:

Our choice of joint regularization norm allows us to express the joint loss minimization as a linear program (i.e. no need for greedy approximations.)

While previous approaches build joint sparse classifiers on the feature space, our method discovers discriminative features in a space derived using the unlabeled data and uses these discriminative features to solve future problems.

Page 29: Transfer Learning for Image Classification

Overview of the method

Step I: Use the unlabeled data to compute a new representation space [Kernel SVD].

Step II: Use the labeled data from related problems to discover discriminative features in the new space [Joint Sparse Regularization].

Step III: Compute the new discriminative representation for the samples of the target problem.

Step IV: Train the target classifier using the representation of step III.

Page 30: Transfer Learning for Image Classification

Step I: Compute a representation using the unlabeled data

A) Compute the kernel matrix of the unlabeled images: $K_{ij} = k(x_i, x_j)$.

B) Compute a projection matrix $A$ by taking all the eigenvectors of $K$, i.e. perform kernel SVD on the unlabeled data.

Page 31: Transfer Learning for Image Classification

Step I: Compute a representation using the unlabeled data.

C) Project the labeled data $D$ from related problems to the new space. Notational shorthand: $x' = z(x)$ for the projected representation of $x$.
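A sketch of Step I under stated assumptions: an RBF kernel (the choice used in the experiments later in the talk), a full eigendecomposition of K, and the usual kernel-PCA scaling by 1/sqrt(eigenvalue) inside the projection z(·), which the slides do not spell out.

```python
import numpy as np

def rbf_kernel(X1, X2, gamma=1.0):
    # k(x, x') = exp(-gamma * ||x - x'||^2)
    sq = (np.sum(X1**2, 1)[:, None] + np.sum(X2**2, 1)[None, :]
          - 2.0 * X1 @ X2.T)
    return np.exp(-gamma * sq)

def learn_representation(X_unlab, gamma=1.0):
    """Kernel matrix on the unlabeled data and its eigendecomposition;
    the eigenvectors play the role of the projection matrix A."""
    K = rbf_kernel(X_unlab, X_unlab, gamma)        # A) K_ij = k(x_i, x_j)
    eigvals, eigvecs = np.linalg.eigh(K)           # B) all eigenvectors of K
    return eigvecs, eigvals

def project(X, X_unlab, eigvecs, eigvals, gamma=1.0, eps=1e-10):
    """C) map a point x to x' = z(x) by projecting its kernel values
    against the unlabeled set onto the eigenvectors; the eigenvalue
    scaling follows the usual kernel-PCA convention and is an assumption."""
    Kx = rbf_kernel(X, X_unlab, gamma)             # k(x, u_j) for each unlabeled u_j
    scale = 1.0 / np.sqrt(np.maximum(eigvals, eps))
    return Kx @ eigvecs * scale
```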

Page 32: Transfer Learning for Image Classification

Sidetrack

Another possible method for learning a representation from the unlabeled data would be to create a projection matrix $Q$ by taking the $h$ eigenvectors of $A$ corresponding to the $h$ largest eigenvalues.

We will call this approach the Low Rank Baseline.

Our method differs significantly from the low-rank approach in that we use training data from related problems to select discriminative features in the new space.

Page 33: Transfer Learning for Image Classification

Step II: Discover relevant features by joint sparse approximation

A classifier is a function $f: X \to Y$, where $X$ is the input space, i.e. the representation learnt from the unlabeled data, and $Y = \{+1, -1\}$ is a binary label; in our application the label is $+1$ if an image belongs to a particular topic and $-1$ otherwise.

A loss function $l(f(x), y)$ maps a prediction and a label to a real value.

Page 34: Transfer Learning for Image Classification

Step II: Discover relevant features by joint sparse approximation

Consider learning a single sparse linear classifier (on the space learnt from the unlabeled data) of the form $f(x') = w^\top x'$.

A sparse model will have only a few features with non-zero coefficients. A natural choice for parameter estimation would be:

$\min_{w}\ \sum_{(x,y)\in D} l\big(f(x'), y\big)\; +\; \lambda \sum_{j=1}^{p} |w_j|$

The first term is the classification error; the L1 penalty discourages non-sparse solutions.

Donoho [2004] has proven (in a regression setting) that the solution with smallest L1 norm is also the sparsest solution.

Page 35: Transfer Learning for Image Classification

Step II: Discover relevant features by joint sparse approximation

Goal: Find a subset of features $R$ such that each problem in $\{D_1, D_2, \ldots, D_m\}$ can be well approximated by a sparse classifier whose non-zero coefficients correspond to features in $R$.

Solution: regularized joint loss minimization over $W = [w^1, \ldots, w^m]$, with $f_k(x') = (w^k)^\top x'$:

$\min_{W}\ \sum_{k=1}^{m}\ \sum_{(x,y)\in D_k} l\big(f_k(x'), y\big)\; +\; \lambda\, \Omega(W)$

The first term is the classification error on training set $k$; the regularizer $\Omega(W)$ penalizes solutions that utilize too many features.

Page 36: Transfer Learning for Image Classification

Step II: Discover relevant features by joint sparse approximation

How do we penalize solutions that use too many features?

Arrange the coefficients in a $p \times m$ matrix $W$ whose $k$-th column holds the coefficients of classifier $k$ and whose $i$-th row holds the coefficients of feature $i$ across all classifiers:

$W = \begin{bmatrix} w_1^1 & w_1^2 & \cdots & w_1^m \\ w_2^1 & w_2^2 & \cdots & w_2^m \\ \vdots & & & \vdots \\ w_p^1 & w_p^2 & \cdots & w_p^m \end{bmatrix}$

A natural penalty is $\Omega(W) = \#\{\text{non-zero rows of } W\}$.

Problem: this is not a proper norm and would lead to a hard combinatorial problem.

Page 37: Transfer Learning for Image Classification

Step II: Discover relevant features by joint sparse approximation

Instead of using the number-of-non-zero-rows pseudo-norm, we use a convex relaxation [Tropp 2006], the L1-∞ norm:

$\Omega(W) = \sum_{i=1}^{p} \max_{k} |w_i^k|$

This norm combines:

An L1 norm over the per-row maximum absolute coefficients, which promotes sparsity over rows: use few features.

An L∞ norm within each row, which promotes non-sparsity across the entries of a used row: share features.

The combination results in a solution where only a few features are used, but the features that are used contribute to solving many classification problems.
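The norm itself is a one-liner; a tiny NumPy illustration (not from the talk), with rows as features and columns as classifiers:

```python
import numpy as np

def l1_inf_norm(W):
    """L1-inf norm: for each feature (row of W) take the maximum
    absolute coefficient across classifiers, then sum over features."""
    return np.abs(W).max(axis=1).sum()

W = np.array([[0.9, -1.1, 0.8],    # feature shared by all three tasks
              [0.0,  0.0, 0.0],    # feature used by none
              [0.0,  0.3, 0.0]])   # feature used by one task only
print(l1_inf_norm(W))              # 1.1 + 0.0 + 0.3 = 1.4
```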

Page 38: Transfer Learning for Image Classification

Step II: Discover relevant features by joint sparse approximation

Using the L1-∞ norm we can rewrite our objective function as:

$\min_{W = [w^1, \ldots, w^m]}\ \sum_{k=1}^{m}\ \sum_{(x,y)\in D_k} l\big(f_k(x'), y\big)\; +\; \lambda \sum_{i=1}^{p} \max_{k} |w_i^k|$

For any convex loss this is a convex function; in particular, when considering the hinge loss

$l\big(f(x), y\big) = \max\big(0,\ 1 - y\, f(x)\big)$

the optimization problem can be expressed as a linear program.

Page 39: Transfer Learning for Image Classification

Step II: Discover relevant features by joint sparse approximation

Linear program formulation (hinge loss).

Objective:

$\min_{W,\ \varepsilon,\ t}\ \ \sum_{k=1}^{m}\ \sum_{j=1}^{|D_k|} \varepsilon_j^k\; +\; \lambda \sum_{i=1}^{p} t_i$

Max-value constraints: for $k = 1{:}m$ and $i = 1{:}p$,

$-t_i \le w_i^k \le t_i$

Slack-variable constraints: for $k = 1{:}m$ and $j = 1{:}|D_k|$,

$y_j^k\, f_k(x_j') \ge 1 - \varepsilon_j^k$ and $\varepsilon_j^k \ge 0$
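A sketch of the same joint objective written with CVXPY rather than assembling the LP above by hand; Xs, ys and lam are placeholder inputs (per-task design matrices in the learnt representation, ±1 labels, and the regularization weight), and the solver's internal reductions produce an equivalent linear program.

```python
import cvxpy as cp

def joint_sparse_train(Xs, ys, lam=1.0):
    """Hinge loss per task plus the L1-inf penalty on the shared
    coefficient matrix W (rows = features, columns = tasks)."""
    p, m = Xs[0].shape[1], len(Xs)
    W = cp.Variable((p, m))                      # column k = classifier for task k
    hinge = sum(cp.sum(cp.pos(1 - cp.multiply(ys[k], Xs[k] @ W[:, k])))
                for k in range(m))
    penalty = cp.sum(cp.max(cp.abs(W), axis=1))  # L1-inf: sum of row-wise maxima
    prob = cp.Problem(cp.Minimize(hinge + lam * penalty))
    prob.solve()
    return W.value
```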

Page 40: Transfer Learning for Image Classification

Step III: Compute the discriminative features representation

Define the set of relevant features to be the rows of $W$ whose maximum absolute coefficient is non-zero: $R = \{\, r : \max_k |w_r^k| > 0 \,\}$.

Create a new representation by taking all the features in $x'$ corresponding to the indexes in $R$.
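In code this step amounts to thresholding the rows of W; a small sketch in which the threshold value is an assumption (the talk leaves it implicit):

```python
import numpy as np

def relevant_features(W, thresh=1e-6):
    """Indices of features whose largest absolute coefficient across
    tasks exceeds a small threshold."""
    return np.where(np.abs(W).max(axis=1) > thresh)[0]

# toy usage: 4 features x 3 tasks, rows 0 and 2 are jointly used
W = np.array([[0.7, -0.4, 0.2],
              [0.0,  0.0, 0.0],
              [0.0,  0.5, 0.3],
              [0.0,  0.0, 0.0]])
R = relevant_features(W)          # -> array([0, 2])
# the new representation keeps only the entries of x' indexed by R
```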

Page 41: Transfer Learning for Image Classification

Experiments: Dataset

Reuters Dataset: 10,382 images, 108 topics. We predict the 10 most frequent topics (binary prediction).

Data partitions:

3,000 unlabeled images.

5,000 images as the source of supervised training data from related problems.

2,382 images as testing data.

Labeled training sets of sizes 1, 5, 10, 15, ..., 50; a training set with $n_L$ positive examples also contains $2 n_L$ negative examples.

Page 42: Transfer Learning for Image Classification

Dataset

SuperBowl [341]

Danish Cartoons [178]

Sharon [321]

Australian Open [209]

Trapped Coal Miners [196]

Golden Globes [167]

Grammys [170]

Figure Skating [146]

Academy Awards [135]

Iraq [125]

Page 43: Transfer Learning for Image Classification

Baseline Representation

'Bag of words' representation that combines color, texture and raw local image information.

Sampling: sample image patches on a fixed grid.

For each image patch compute:

Color features based on HSV color histograms.

Texture features based on mean responses of Gabor filters at different scales and orientations.

Raw features: normalized pixel values.

Create a visual dictionary: for each feature type we do vector quantization and create a dictionary $V$ of 2000 visual words.

Page 44: Transfer Learning for Image Classification

Baseline representation

Sample image patches over a fixed grid. For every feature type, map each patch to its closest visual word in the corresponding dictionary.

The final baseline representation is given by:

$g(I) = [\,c_1, c_2, \ldots, c_{|V_c|};\ \ t_1, t_2, \ldots, t_{|V_t|};\ \ r_1, r_2, \ldots, r_{|V_r|}\,]$

where $c_i$ is the number of times that an image patch was mapped to the $i$-th color word (and similarly for the texture counts $t_i$ and raw-feature counts $r_i$).
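A sketch of the dictionary construction and histogram step for one feature type; scikit-learn's MiniBatchKMeans is an assumed choice of vector quantizer, since the talk does not name one.

```python
import numpy as np
from sklearn.cluster import MiniBatchKMeans

def build_dictionary(descriptors, n_words=2000):
    """Vector-quantize the patch descriptors of one feature type
    (colour, texture or raw) into a dictionary of visual words."""
    km = MiniBatchKMeans(n_clusters=n_words, random_state=0)
    km.fit(descriptors)                    # descriptors: (n_patches, dim)
    return km

def bow_histogram(patch_descriptors, dictionary):
    """Map every patch to its closest visual word and count occurrences;
    c_i is then the number of patches assigned to word i."""
    words = dictionary.predict(patch_descriptors)
    return np.bincount(words, minlength=dictionary.n_clusters)

# g(I) concatenates the colour, texture and raw-feature histograms:
# g(I) = [c_1..c_|Vc|, t_1..t_|Vt|, r_1..r_|Vr|]
```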

Page 45: Transfer Learning for Image Classification

Setting

Step 1: Learn a representation using the unlabeled dataset and the labeled datasets from 9 topics.

Step 2: Train a classifier for the 10th held-out topic using the learnt representation.

As evaluation we use the equal error rate averaged over the 10 topics.

Page 46: Transfer Learning for Image Classification

Experiments

Three models, all linear SVMs:

Baseline model (RFB): uses the raw representation.

Low Rank Baseline (LRB): uses as representation $v(x) = Q^\top z(x)$, where $Q$ consists of the $h$ highest eigenvectors of the matrix $A$ computed in the first step of the algorithm.

Sparse Transfer model (SPT): uses the representation computed by our algorithm.

For both LRB and SPT we used an RBF kernel when computing the representation from the unlabeled data.

Page 47: Transfer Learning for Image Classification

Results:

[Figure: equal error rate vs. number of positive training examples for the RFB, LRB and SPT models.]

Page 48: Transfer Learning for Image Classification

Results:

Mean equal error rate per topic for classifiers trained with five positive examples, for the RFB model and the SPT model. SB: SuperBowl; GG: Golden Globes; DC: Danish Cartoons; Gr: Grammys; AO: Australian Open; Sh: Sharon; TC: Trapped Coal Miners; FS: Figure Skating; AA: Academy Awards; Ir: Iraq.

[Figure: average equal error rate per topic for models trained with 5 positive examples, RFB vs. SPT.]

Page 49: Transfer Learning for Image Classification

Results

Page 50: Transfer Learning for Image Classification

Conclusion

Summary: We described a method for learning discriminative sparse image representations from unlabeled images plus images from related tasks.

The method is based on learning a representation from the unlabeled data and performing joint sparse approximation on the data from related tasks to find a subset of discriminative features.

The induced representation improves performance when learning with very few examples.

Page 51: Transfer Learning for Image Classification

Future work

Develop an efficient algorithm for solving the joint optimization that will scale to very large datasets.

Combine different representations

Page 52: Transfer Learning for Image Classification

Joint Sparse Approximation

Discovers image representations which improve learning with few examples

The LP formulation is feasible for small problems but becomes intractable for larger datasets.

Page 53: Transfer Learning for Image Classification

Outline

A Joint Sparse Approximation Model for Multi-task Learning

An Efficient Algorithm

Experiments

Page 54: Transfer Learning for Image Classification

Joint Sparse Approximation as a Constrained Convex Optimization Problem

A convex function subject to convex constraints:

$\min_{W}\ \sum_{k=1}^{m}\ \frac{1}{|D_k|} \sum_{(x,y)\in D_k} l\big(f_k(x), y\big)$

$\text{s.t.}\quad \sum_{i=1}^{d} \max_{k} |w_i^k| \le Q$

We will use a projected subgradient method. Main advantages: simple, scalable, guaranteed convergence rates.

Projected subgradient methods have recently been proposed for:

L2 regularization, i.e. SVM [Shalev-Shwartz et al. 2007]

L1 regularization [Duchi et al. 2008]

Speaker notes (Ariadna): Computing the subgradients is trivial, so the question is whether the projection can be computed efficiently.
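A sketch of the projected subgradient loop for the constrained problem above; the step size eta0/sqrt(t) is an assumed schedule, and `project` stands for the Euclidean projection onto the L1-∞ ball (a sketch of it follows the projection slides below).

```python
import numpy as np

def projected_subgradient(Xs, ys, Q, project, n_iters=200, eta0=1.0):
    """Minimize the average hinge loss per task subject to
    ||W||_{1,inf} <= Q; Xs[k] is task k's design matrix in the learnt
    representation, ys[k] its +/-1 labels, project(W, Q) the projection."""
    p, m = Xs[0].shape[1], len(Xs)
    W = np.zeros((p, m))
    for t in range(1, n_iters + 1):
        G = np.zeros_like(W)
        for k in range(m):
            margins = ys[k] * (Xs[k] @ W[:, k])
            viol = margins < 1.0                       # examples inside the margin
            # subgradient of the average hinge loss for task k
            G[:, k] = -(Xs[k][viol] * ys[k][viol, None]).sum(axis=0) / len(ys[k])
        W = W - (eta0 / np.sqrt(t)) * G                # subgradient step
        W = project(W, Q)                              # project back onto the ball
    return W
```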
Page 55: Transfer Learning for Image Classification

Euclidean Projection into the L1-∞ ball

Projection: given a matrix $A$ outside the ball, find the closest matrix $B$ (in Euclidean distance) inside the $L_{1,\infty}$ ball of radius $Q$. With auxiliary per-feature maxima $\mu_i$ (and assuming $A$ non-negative):

$\min_{B,\ \mu}\ \frac{1}{2}\sum_{i,j}\big(A_{i,j} - B_{i,j}\big)^2 \qquad \text{s.t.}\quad 0 \le B_{i,j} \le \mu_i,\quad \sum_{i=1}^{d}\mu_i \le Q$

Inspecting the Lagrangian shows that it can be reduced to:

find $\mu$ and $\theta$ such that $\sum_{i=1}^{d}\mu_i = Q$ and, for every feature $i$ with $\mu_i > 0$, $\sum_{j:\ A_{i,j} \ge \mu_i}\big(A_{i,j} - \mu_i\big) = \theta$; then set $B_{i,j} = \min(A_{i,j},\ \mu_i)$.

We reduced the projection to finding new maximums for each feature across tasks and using them to truncate $A$. The total mass removed from a feature across tasks should be the same for all features whose coefficients don't vanish.

Speaker notes (Ariadna): Here is how we can formulate the projection. Assume we have a matrix A outside the L1-∞ ball; we wish to find the matrix B inside the ball that is closest to A. I will assume A is non-negative; it is easy to show that the approach generalizes to arbitrary matrices. In the first formulation, the objective searches for a matrix B close in Euclidean distance to A, and we want B to lie in the L1-∞ ball of radius Q. To impose this constraint we introduce auxiliary variables μ, one per feature, encoding the maximum feature values of the new matrix B. The first constraint bounds the coefficients of feature i of B below its corresponding maximum μ_i; the second makes the new maximums sum to Q. Inspecting the Lagrangian reveals the structure of the problem and shows it can be reformulated as an optimization that can be solved efficiently. In the new formulation we also search for the new maximums μ (ignore θ for now). As before, the first constraint enforces that the new maximums sum to Q. Once we have found the new maximums, the last two lines tell us how to construct B: every coefficient of A below its corresponding maximum remains the same, while coefficients above the maximum are truncated. So essentially we reduced the projection to finding, for each feature, a new maximum with which to truncate A. Returning to the second constraint, the Lagrangian reveals that at the optimal solution the mass lost to truncation (the sum of the differences between the values of A larger than the new maximum and the new maximum itself) is the same across features. Intuitively this makes sense, because the L1 norm on the maximums should prefer no feature over another.
Page 56: Transfer Learning for Image Classification

Euclidean Projection into the L1-∞ ball

A worked example with 3 features, 6 problems, and Q = 14: we want new maximums $\mu_1, \mu_2, \mu_3$ with $\sum_i \mu_i = 14$.

[Figure: for each of the 3 features, the coefficients across the 6 tasks sorted in decreasing order; choosing a new maximum $\mu_i$ amounts to choosing a truncation point, and the shaded box shows the mass $R_i(\mu_i) = \sum_j \max(A_{i,j} - \mu_i,\ 0)$ that would be removed.]

Speaker notes (Ariadna): Here we have a concrete example: 3 features, six problems, and Q = 14, so we want to find maximums that sum to 14. The figure shows the sorted coefficients of each task for each of the 3 features. Selecting a new maximum is selecting a new truncation point; inside each box is the mass we would lose for the corresponding feature if we truncated it at that value, which I call θ. Again, the Lagrangian tells us that θ should be the same across features. So our algorithm for finding the new maximums is based on lowering the maximums in such a way that θ stays constant across features, continuing until the new maximums sum to Q; those are the maximums of the new matrix B.
Page 57: Transfer Learning for Image Classification

Euclidean Projection into the L1-∞ ball


Page 58: Transfer Learning for Image Classification

An Efficient Algorithm For Finding μ

Recall that we need to find a vector of maximums µ that sums to Q and a constant θ such that the mass loss for every feature is the same.

Consider the mass-loss function of feature $i$, $R_i(\mu_i) = \sum_j \max(A_{i,j} - \mu_i,\ 0)$, and its inverse $R_i^{-1}$. Both are piecewise linear.

Now consider the function that takes a mass loss θ and computes the corresponding sum of the µ (i.e. the norm of the new matrix):

$N(\theta) = \sum_{i=1}^{d} R_i^{-1}(\theta)$

This is also piecewise linear. We can construct this function and pick the θ for which $N(\theta) = Q$.

Speaker notes (Ariadna): Connecting to the previous slide, this can be done by incrementally building some functions which are piecewise linear. Consider the function $R_i$, which takes a maximum $\mu_i$ and returns the corresponding mass loss. Since it is piecewise linear, it is easy to build its inverse, which given a loss computes the corresponding maximum. Using this inverse we can build another function $N$ that takes a value of θ, computes the maximum for each feature using the inverse of $R_i$, and sums them up; note that this sum is the $L_{1,\infty}$ norm of the truncated matrix.
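A sketch of the projection that implements exactly this characterisation, but finds θ by bisection rather than by the O(dm log(dm)) merge of piecewise-linear functions described above (so it is slower, but easy to check).

```python
import numpy as np

def project_l1_inf(A, Q, tol=1e-7):
    """Euclidean projection of A (features x tasks) onto the L1-inf ball
    of radius Q: find per-feature maxima mu summing to Q so that the mass
    removed from every surviving feature equals a common theta, then clip."""
    signs = np.sign(A)
    M = np.abs(A).astype(float)
    if M.max(axis=1).sum() <= Q:            # already inside the ball
        return A.copy()

    def mu_of_theta(theta):
        """For each feature i, the smallest mu >= 0 whose removed mass
        sum_j max(M_ij - mu, 0) is <= theta (R_i is decreasing in mu)."""
        mus = np.empty(M.shape[0])
        for i, row in enumerate(M):
            if row.sum() <= theta:           # the whole feature can vanish
                mus[i] = 0.0
                continue
            lo, hi = 0.0, row.max()
            while hi - lo > tol:
                mid = 0.5 * (lo + hi)
                if np.maximum(row - mid, 0.0).sum() > theta:
                    lo = mid
                else:
                    hi = mid
            mus[i] = hi
        return mus

    # N(theta) = sum_i mu_i(theta) is decreasing; find theta with N(theta) = Q
    lo, hi = 0.0, M.sum()
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if mu_of_theta(mid).sum() > Q:
            lo = mid
        else:
            hi = mid
    mu = mu_of_theta(hi)
    return signs * np.minimum(M, mu[:, None])
```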
Page 59: Transfer Learning for Image Classification

Complexity

The total cost of the algorithm is dominated by the initial sort of each row of $A$, $O(dm\log m)$, plus the merge of these rows to build the piecewise-linear function, $O(dm\log d)$.

The total cost is therefore on the order of $O(dm\log(dm))$.

Notice that we only need to consider the non-zero rows of $A$, so $d$ is really the number of non-zero rows rather than the actual number of features.

Page 60: Transfer Learning for Image Classification

Outline

A Joint Sparse Approximation Model for Multi-task Learning

An Efficient Algorithm

Experiments

Page 61: Transfer Learning for Image Classification

Synthetic Experiments

We used the same projected subgradient method and compared three different types of projection steps: L1-∞ projection, L2 projection, and L1 projection.

All tasks consisted of predicting functions of the form $y = \mathrm{sign}(w \cdot x)$.

To generate jointly sparse vectors we randomly selected 10% of the features to be the relevant feature set V . Then for each task we randomly selected a subset v and zeroed all parameters outside v.
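A sketch of this synthetic setup; the Gaussian features and weights and the size of each task's subset v are assumptions, since the talk only states that 10% of the features form V and that each task zeroes all parameters outside a random subset of V.

```python
import numpy as np

def make_joint_sparse_tasks(n_tasks=60, n_features=200, n_examples=40,
                            relevant_frac=0.10, seed=0):
    """Pick 10% of the features as the shared relevant set V, then for
    each task draw a weight vector, zero everything outside a random
    subset v of V, and label examples with sign(w . x)."""
    rng = np.random.default_rng(seed)
    V = rng.choice(n_features, int(relevant_frac * n_features), replace=False)
    tasks = []
    for _ in range(n_tasks):
        w = np.zeros(n_features)
        v = rng.choice(V, size=max(1, len(V) // 2), replace=False)  # assumed size
        w[v] = rng.normal(size=len(v))
        X = rng.normal(size=(n_examples, n_features))
        y = np.sign(X @ w)
        tasks.append((X, y, w))
    return V, tasks
```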

Page 62: Transfer Learning for Image Classification

Synthetic Experiments

[Figure: synthetic experiment results (60 problems, 200 features, 10% relevant). Left: test error vs. number of training examples per task for the L2, L1 and L1-∞ projections. Right: feature-selection performance, precision and recall of the recovered relevant features for L1 and L1-∞.]

Page 63: Transfer Learning for Image Classification

Dataset: News story prediction

Topics: SuperBowl, Danish Cartoons, Sharon, Australian Open, Trapped Coal Miners, Golden Globes, Grammys, Figure Skating, Academy Awards, Iraq.

40 tasks. Raw image representation: bag of visual words, 3000 dimensions. Linear kernel.

Page 64: Transfer Learning for Image Classification

Image Classification Experiments

[Figure: Reuters dataset results. Left: mean equal error rate vs. number of training examples per task for L2, L1 and L1-∞. Right: heat maps of the absolute weights (feature × task) for the L1 and L1-∞ solutions.]

Page 65: Transfer Learning for Image Classification

Conclusion and Future Work

Presented a simple and effective algorithm for training joint models with L1−∞ constraints.

The algorithm scales linearly with the number of examples and as $O(dm\log(dm))$ with the number of problems and dimensions. The experiments on a real image dataset show that this algorithm can find solutions that are jointly sparse, resulting in lower test error.

We believe our approach can be extended to work on an online multi-task learning setting.

Page 66: Transfer Learning for Image Classification

Future Work: Online Multitask Classification [Cavallanti et al. 2008]

There are m binary classification tasks indexed by 1, ..., m.

At each time step t = 1, 2, ..., T the learner receives a task index k and the corresponding instance vector. Based on this information the learner outputs a binary prediction and then observes the correct label $y_k$.

We are interested in comparing the learner’s mistake count to that of the optimal predictors:

$\inf_{w_1, \ldots, w_m \in \mathbb{R}^d}\ \sum_{t=1}^{T} \ell\big(w_{i_t},\, x_{i_t},\, y_{i_t}\big)$

Page 67: Transfer Learning for Image Classification

Thanks!

Page 68: Transfer Learning for Image Classification

Lagrangian

Page 69: Transfer Learning for Image Classification

Lagrangian