Transcript of Kullback-Leibler Boosting, by Ce Liu and Heung-Yeung Shum, Microsoft Research Asia.

Page 1:

Kullback-Leibler Boosting

Ce Liu, Heung-Yeung Shum

Microsoft Research Asia


Page 2:

A General Two-layer Classifier

Input → Intermediate → Output

$I(x) = \mathrm{sign}\left( \sum_{i=1}^{n} \alpha_i\, \phi_i\big(\psi_i(x)\big) - c \right)$

Projection function: $\psi_i(\cdot): \mathbb{R}^d \to \mathbb{R}$
Discriminating function: $\phi_i(\cdot): \mathbb{R} \to \mathbb{R}$
Identification function: $I(\cdot): \mathbb{R}^d \to \{-1, +1\}$
Coefficients: $\{\alpha_i\}$
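
To make this structure concrete, here is a minimal Python/NumPy sketch (my own illustration, not code from the talk) of evaluating such a two-layer classifier; the two linear projections, the tanh discriminating functions, and the coefficient values are arbitrary choices for the example.

    # Minimal sketch (hypothetical example): evaluate a general two-layer
    # classifier I(x) = sign( sum_i alpha_i * phi_i(psi_i(x)) - c ).
    import numpy as np

    def two_layer_classify(x, projections, discriminators, alphas, c):
        # projections: list of psi_i(x) -> R; discriminators: list of phi_i(z) -> R
        z = sum(a * phi(psi(x)) for a, phi, psi in zip(alphas, discriminators, projections))
        return np.sign(z - c)

    # Example with two linear projections psi_i(x) = w_i^T x and phi_i = tanh.
    d = 5
    rng = np.random.default_rng(0)
    ws = [rng.normal(size=d) for _ in range(2)]
    projections = [lambda x, w=w: w @ x for w in ws]
    discriminators = [np.tanh, np.tanh]
    print(two_layer_classify(rng.normal(size=d), projections, discriminators, [0.7, 0.3], 0.0))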

Page 3:

Issues under the Two-layer Framework

How to choose the type of projection function?
How to choose the type of discriminating function?
How to learn the parameters from samples?

Projection function:
• Linear: $\psi(x) = w^T x$
• Distance: $\psi(x) = \|x - \mu\|$

Discriminating function:
• Tanh: $\phi(z) = \tanh(z)$
• Sigmoid: $\phi(z) = \dfrac{1}{1 + e^{-z}}$
• RBF: $\phi(z) = \exp\{-z^2\}$
• Polynomial: $\phi(z) = b_n z^n + \cdots + b_1 z + b_0$

$I(x) = \mathrm{sign}\left( \sum_{i=1}^{n} \alpha_i\, \phi_i\big(\psi_i(x)\big) - c \right)$

Page 4:

Our Proposal

How to choose the type of projection function?
• Kullback-Leibler linear feature

How to choose the type of discriminating function?
• Histogram divergence

How to learn the parameters from samples?
• Sample re-weighting (boosting)

Kullback-Leibler Boosting (KLBoosting)

Page 5:

Intuitions

Linear projection is robust and easy to compute

The histograms of the two classes under a projection are evidence for classification

• The linear feature on which the histograms of the two classes differ most should be selected

If the weight distribution of the sample set changes, the histogram changes as well

• Increase weights for misclassified samples, and decrease weights for correctly classified samples

Page 6:

Linear Projections and Histograms

$\psi(x) = w^T x$

Samples $\{x_i\} \sim f(x)$, each with weight $W(x_i)$

Weighted histogram of the projection:
$H(z) = \sum_i W(x_i)\, \delta\big(z - w^T x_i\big)$
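
A small NumPy sketch (my own, not from the talk) of building the weighted histogram of the projection $w^T x$ from weighted samples; the 300-bin default mirrors the face-detection setup later in the talk, and the range handling is an assumption.

    # Sketch: weighted histogram H(z) of z = w^T x over samples {x_i} with weights W(x_i).
    import numpy as np

    def weighted_projection_histogram(X, weights, w, n_bins=300, z_range=None):
        z = X @ w                               # project each sample onto w
        if z_range is None:
            z_range = (z.min(), z.max())
        hist, edges = np.histogram(z, bins=n_bins, range=z_range, weights=weights)
        hist = hist / max(hist.sum(), 1e-12)    # normalize to a probability histogram
        return hist, edges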

Page 7:

KLBoosting (1)

At the $k$th iteration:

• Kullback-Leibler feature:
$w_k^{*} = \arg\max_{w} KL(w)$, where
$KL(w) = \int \big[ h^{+}(w^T x) - h^{-}(w^T x) \big] \log \dfrac{h^{+}(w^T x)}{h^{-}(w^T x)}\, dx$

• Discriminating function:
$\phi_k(x) = \log \dfrac{h_k^{+}(w_k^T x)}{h_k^{-}(w_k^T x)}$

• Reweighting (with normalizing constant $Z_{k+1}$):
$W_{k+1}(x_i) = \dfrac{1}{Z_{k+1}}\, W_k(x_i)\, \exp\{-\alpha_k I_k(x_i)\}$ for positive samples
$W_{k+1}(x_i) = \dfrac{1}{Z_{k+1}}\, W_k(x_i)\, \exp\{+\alpha_k I_k(x_i)\}$ for negative samples
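
The following sketch illustrates one KLBoosting iteration under the reconstruction above. It is my own hedged interpretation, not the authors' code: the smoothing constant eps, the exhaustive scoring of a candidate bank in select_kl_feature, and the use of the per-feature log-ratio (rather than the classifier output) inside the reweighting exponent are all simplifying assumptions.

    # Sketch of one KLBoosting iteration: score candidate features by the symmetric
    # KL divergence of the class histograms, form the log-ratio discriminating
    # function, and re-weight the samples of each class.
    import numpy as np

    def class_histograms(Xp, Wp, Xn, Wn, w, n_bins=300, eps=1e-6):
        # Weighted histograms h+ and h- of the projection w^T x for the two classes.
        zp, zn = Xp @ w, Xn @ w
        lo, hi = min(zp.min(), zn.min()), max(zp.max(), zn.max())
        hp, edges = np.histogram(zp, n_bins, (lo, hi), weights=Wp)
        hn, _ = np.histogram(zn, n_bins, (lo, hi), weights=Wn)
        hp = (hp + eps) / (hp + eps).sum()      # eps smoothing avoids log(0) (my assumption)
        hn = (hn + eps) / (hn + eps).sum()
        return hp, hn, edges

    def sym_kl(hp, hn):
        # Symmetric KL divergence between the two projection histograms.
        return float(np.sum((hp - hn) * np.log(hp / hn)))

    def select_kl_feature(candidates, Xp, Wp, Xn, Wn):
        # Exhaustive scoring of a candidate bank; the slides use MCMC in low
        # dimensions and KLA in image space, so this is a simplification.
        return max(candidates, key=lambda w: sym_kl(*class_histograms(Xp, Wp, Xn, Wn, w)[:2]))

    def reweight(W, X, w, hp, hn, edges, alpha, positive):
        # Re-weight one class: decrease the weights of well-classified samples and
        # increase the rest. The exponent uses the new feature's log-ratio response
        # as a stand-in for the classifier output in the slide's update.
        idx = np.clip(np.digitize(X @ w, edges) - 1, 0, len(hp) - 1)
        phi = np.log(hp[idx] / hn[idx])
        W = W * np.exp((-alpha if positive else alpha) * phi)
        return W / W.sum()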

Page 8:

KLBoosting (2)

Two types of parameters to learn:
• KL features: $\{w_k\}$
• Combination coefficients: $\{\alpha_j\}$

Learning the KL feature in low dimensions: MCMC

Learning the combination coefficients to minimize training error:
$\{\alpha_j\}^{*} = \arg\min_{\{\alpha_j\}} \dfrac{1}{N} \sum_{i=1}^{N} \mathbf{1}\big( I(x_i) \neq y_i \big)$

• Optimization: brute-force search (sketched below)
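
A toy sketch of the brute-force search for the combination coefficients (my own; the grid range and resolution are arbitrary, and the exhaustive product over coefficients is only feasible for a small number of features).

    # Sketch: grid search for coefficients {alpha_j} minimizing the training error
    # of I(x) = sign( sum_j alpha_j * phi_j(x) ).
    import itertools
    import numpy as np

    def train_error(alphas, Phi, y):
        # Phi: (N, k) matrix of per-feature responses phi_j(x_i); y in {-1, +1}.
        pred = np.sign(Phi @ alphas)
        return float(np.mean(pred != y))

    def brute_force_coefficients(Phi, y, grid=np.linspace(0.0, 2.0, 11)):
        # Exhaustive search over all coefficient vectors on the grid; cost grows
        # exponentially with the number of features, so this is a toy stand-in.
        best = None
        for alphas in itertools.product(grid, repeat=Phi.shape[1]):
            err = train_error(np.array(alphas), Phi, y)
            if best is None or err < best[1]:
                best = (np.array(alphas), err)
        return best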

Page 9:

Flowchart

Input: $(x_1, y_1), \ldots, (x_N, y_N)$, with $x_i \in \mathbb{R}^d$, $y_i \in \{-1, +1\}$

1. Initialize the sample weights $W_0^{+}$, $W_0^{-}$
2. Learn a KL feature
3. Learn the combining coefficients $\{\alpha_i\}_{i=1}^{k}$
4. Recognition error small enough? If not, update the weights $W_k^{+}$, $W_k^{-}$ and return to step 2
5. Output the classifier
$I(x) = \mathrm{sign}\left[ \sum_{i=1}^{k} \alpha_i \log \dfrac{h_i^{+}(w_i^T x)}{h_i^{-}(w_i^T x)} \right]$
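
Putting the flowchart together, a condensed training-loop skeleton. This is my own sketch: select_kl_feature, class_histograms, reweight, and brute_force_coefficients refer to the hypothetical helpers sketched on the previous pages, and the stopping threshold and feature budget are arbitrary.

    # Sketch of the KLBoosting training loop: initialize weights, repeatedly learn
    # a KL feature and the coefficients, and stop once the error is small enough.
    import numpy as np

    def log_ratio(X, w, hp, hn, edges):
        # Evaluate the discriminating function phi(x) = log(h+ / h-) at w^T x.
        idx = np.clip(np.digitize(X @ w, edges) - 1, 0, len(hp) - 1)
        return np.log(hp[idx] / hn[idx])

    def klboost_train(Xp, Xn, candidates, max_features=32, target_error=0.01):
        Wp = np.full(len(Xp), 1.0 / len(Xp))        # initial positive weights W_0^+
        Wn = np.full(len(Xn), 1.0 / len(Xn))        # initial negative weights W_0^-
        X = np.vstack([Xp, Xn])
        y = np.concatenate([np.ones(len(Xp)), -np.ones(len(Xn))])
        features, models, alphas = [], [], np.array([])
        for k in range(max_features):
            w = select_kl_feature(candidates, Xp, Wp, Xn, Wn)     # learn a KL feature
            hp, hn, edges = class_histograms(Xp, Wp, Xn, Wn, w)
            features.append(w); models.append((hp, hn, edges))
            Phi = np.column_stack([log_ratio(X, f, *m) for f, m in zip(features, models)])
            alphas, err = brute_force_coefficients(Phi, y)        # learn coefficients
            if err <= target_error:                               # error small enough?
                break
            Wp = reweight(Wp, Xp, w, hp, hn, edges, alphas[-1], positive=True)
            Wn = reweight(Wn, Xn, w, hp, hn, edges, alphas[-1], positive=False)
        return features, models, alphas             # classify with sign(Phi(x) @ alphas)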

Page 10:

A Simple Example

[Figure: KL features, histograms, and the decision manifold for a simple example]

Page 11:

A Complicated Case

[Plot: error rate (0% to 40%) vs. feature number (1 to 32); curves: KLBoosting testing error, KLBoosting training error, AdaBoost testing error, AdaBoost training error]

Page 12:

Kullback-Leibler Analysis (KLA)

Finding the KL feature directly in image space is a challenging task

Sequential 1D optimization:
• Construct a feature bank
• Build a set of the most promising features
• Sequentially do 1D optimization along the promising features (sketched below)

Conjecture: the global optimum of an objective function can be reached by searching along as many linear features as needed.
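
A rough sketch of sequential 1D optimization as I read it from this slide and the next: refine the current feature by a 1D line search along each promising bank direction in turn, keeping only steps that increase the objective J. For KLBoosting, J(w) would be the symmetric KL divergence of the projection histograms (e.g. the sym_kl and class_histograms helpers sketched earlier); the step sizes and number of passes are my own assumptions.

    # Sketch of sequential 1D optimization: greedily improve a unit-norm feature w
    # by 1D line searches along promising bank directions b, maximizing J(w).
    import numpy as np

    def sequential_1d_optimize(w0, promising, J, steps=np.linspace(-1.0, 1.0, 21), n_passes=3):
        w = w0 / np.linalg.norm(w0)
        for _ in range(n_passes):
            for b in promising:                  # sweep the promising feature set
                cands = [w + t * b for t in steps]
                cands = [c / np.linalg.norm(c) for c in cands if np.linalg.norm(c) > 1e-12]
                best = max(cands, key=J)
                if J(best) > J(w):
                    w = best                     # accept the 1D step that improves J
        return w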

Page 13:

Intuition of Sequential 1D Optimization

[Figure panels: feature bank, promising feature set, result of sequential 1D optimization, MCMC feature]

Page 14:

Optimization in Image Space

An image is a random field, not a pure random variable

The local statistics can be captured by wavelets

• 111×400 small-scale wavelets for the whole 20×20 patch

• 80×100 large-scale wavelets for the inner 10×10 patch

• In total, 52,400 wavelets compose the feature bank (checked in the sketch below)

• The 2,800 most promising wavelets are selected

[Figure panels: Gaussian-family wavelets, Haar wavelets, feature bank]
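
The bank size quoted above follows directly from these counts; a quick check:

    # Quick check of the feature-bank size quoted on the slide:
    # 111 small-scale wavelet types at the 400 positions of the 20x20 patch, plus
    # 80 large-scale wavelet types at the 100 positions of the inner 10x10 patch.
    small = 111 * 20 * 20      # 44,400
    large = 80 * 10 * 10       # 8,000
    print(small + large)       # 52,400 wavelets in total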

Page 15:

Data-driven KLA

Compose the KL feature by sequential 1D optimization

Face patterns

Non-face patterns

Feature bank (111 wavelets)

Promising feature set (total 2,800 features)

At each position of the 20×20 lattice, compute the histograms of the 111 wavelets and the KL divergences between the face and non-face images (sketched below).

Large-scale wavelets are used to capture the global statistics on the inner 10×10 lattice.
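
A hedged sketch of the promising-feature selection step: rank every wavelet in the bank by the symmetric KL divergence between its response histograms on face and non-face patches and keep the top 2,800. The criterion and the cut-off follow the slides; sym_kl is the helper sketched earlier, and the bin count and smoothing constant are my assumptions.

    # Sketch: keep the 2,800 bank wavelets whose response histograms differ most
    # (in symmetric KL divergence) between face and non-face patches.
    import numpy as np

    def promising_features(bank, faces, nonfaces, top_k=2800, n_bins=300, eps=1e-6):
        def hist_pair(w):
            zp, zn = faces @ w, nonfaces @ w
            lo, hi = min(zp.min(), zn.min()), max(zp.max(), zn.max())
            hp, _ = np.histogram(zp, n_bins, (lo, hi))
            hn, _ = np.histogram(zn, n_bins, (lo, hi))
            return (hp + eps) / (hp + eps).sum(), (hn + eps) / (hn + eps).sum()
        scores = [sym_kl(*hist_pair(w)) for w in bank]   # sym_kl as sketched earlier
        order = np.argsort(scores)[::-1]
        return [bank[i] for i in order[:top_k]]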

Page 16:

Comparison with Other Features

[Figure: MCMC feature, best Haar wavelet, and KL feature]

KL = 3.246 (MCMC feature), KL = 2.944 (Haar wavelet), KL = 10.967 (KL feature)

Page 17:

Application: Face Detection

Experimental setup
• A 20×20 patch represents a face
• 17,520 frontal faces
• 1,339,856,947 non-face patches from 2,484 images
• 300 bins in the histogram representation

A cascade of KLBoosting classifiers
• In each classifier, keep the false negative rate below 0.01% and the false alarm rate below 35% (see the sketch below for what this implies for the whole cascade)
• In total, 22 classifiers form the cascade (450 features)
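
Under the stated per-stage bounds, the cascade's overall rates multiply across stages; a quick check of that arithmetic, plus a minimal cascade evaluator (the stage classifiers and the zero threshold are hypothetical).

    # What the per-stage bounds imply for a 22-stage cascade.
    fn_rate, fa_rate, stages = 1e-4, 0.35, 22
    print((1.0 - fn_rate) ** stages)   # overall detection rate at least ~0.998 under these bounds
    print(fa_rate ** stages)           # overall false-alarm rate at most ~1e-10 under these bounds

    def cascade_detect(x, stage_classifiers):
        # A window counts as a face only if every stage accepts it; most non-face
        # windows are rejected by the first few stages, which keeps detection fast.
        return all(stage(x) > 0 for stage in stage_classifiers)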

Page 18:

KL Features of the Face Detector

[Figure: face patterns, non-face patterns; the first 10 KL features; some other KL features (global semantics, frequency filters, local features)]

Page 19:

ROC Curve

[ROC plot: correct detection rate (0.65 to 1.00) vs. false alarm rate (1×10^-7 to 3.16×10^-5, log scale); curves: KLBoosting, AdaBoost, Neural Network]

Page 20:

Some Detection Results

Page 21:

Comparison with AdaBoost

[Plot: false alarm rate (1 down to 10^-5, log scale) vs. number of features (0 to 400); curves: KLBoosting, AdaBoost]

Page 22:

Compared with AdaBoost

Base classifier
• KLBoosting: KL feature + histogram divergence
• AdaBoost: selected from experience

Combining coefficients
• KLBoosting: globally optimized to minimize training error
• AdaBoost: empirically set to be incrementally optimal

Page 23:

Summary

KLBoosting is an optimal classifier
• Projection function: linear projection
• Discriminating function: histogram divergence
• Coefficients: optimized by minimizing training error

KLA: a data-driven approach to pursue KL features

Application to face detection

Page 24:

Thank you!

Harry Shum, Microsoft Research Asia

[email protected]


Page 25:

Compared with SVM

Support vectors
• KLBoosting: KL features learned to optimize KL divergence (a few)
• SVM: selected from training samples (many)

Kernel function
• KLBoosting: histogram divergence (flexible)
• SVM: selected from experience (fixed)