Kullback-Leibler Boosting
Ce Liu, Heung-Yeung Shum
Microsoft Research Asia
A General Two-Layer Classifier

Input → Intermediate → Output

I(x) = sign( Σ_{i=1}^n λ_i φ_i(ψ_i(x)) + c )

• ψ_i(·): R^d → R (projection function)
• φ_i(·): R → R (discriminating function)
• I(·): R^d → {−1, +1} (identification function)
• {λ_i} (coefficients)
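As a sketch of this two-layer structure (the function and variable names here are ours, not from the slides), the classifier can be written directly from the formula above:

```python
import numpy as np

def two_layer_classify(x, projections, discriminators, coeffs, c=0.0):
    """Two-layer classifier: I(x) = sign(sum_i lambda_i * phi_i(psi_i(x)) + c)."""
    s = sum(lam * phi(psi(x))
            for lam, phi, psi in zip(coeffs, discriminators, projections))
    return 1 if s + c >= 0 else -1

# One linear projection (psi) followed by tanh (phi), with lambda = 1
w = np.array([1.0, -1.0])
clf = lambda x: two_layer_classify(
    x,
    projections=[lambda v: float(w @ v)],
    discriminators=[np.tanh],
    coeffs=[1.0],
)
```

Any number of (projection, discriminator, coefficient) triples can be stacked this way; the framework questions on the next slide are about how to choose them.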
Issues under the Two-Layer Framework

• How to choose the type of projection function?
• How to choose the type of discriminating function?
• How to learn the parameters from samples?
Projection function:
• ψ(x) = w^T x (linear)
• ψ(x) = ||x|| (norm)

Discriminating function:
• φ(z) = tanh(z)
• φ(z) = 1 / (1 + e^{−z}) (sigmoid)
• φ(z) = exp{−z²} (RBF)
• φ(z) = b_n z^n + … + b_1 z + b_0 (polynomial)

I(x) = sign( Σ_{i=1}^n λ_i φ_i(ψ_i(x)) + c )
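The discriminating-function families listed above can be written as plain scalar callables; this is a small illustrative sketch, not code from the paper:

```python
import numpy as np

def sigmoid(z):        # phi(z) = 1 / (1 + e^{-z})
    return 1.0 / (1.0 + np.exp(-z))

def rbf(z):            # phi(z) = exp{-z^2}
    return np.exp(-z ** 2)

def polynomial(z, b):  # phi(z) = b_n z^n + ... + b_1 z + b_0
    return np.polyval(b, z)

# tanh(z) is available directly as np.tanh
```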
Our Proposal

• How to choose the type of projection function? → Kullback-Leibler linear feature
• How to choose the type of discriminating function? → Histogram divergence
• How to learn the parameters from samples? → Sample re-weighting (boosting)

⇒ Kullback-Leibler Boosting (KLBoosting)
Intuitions

• Linear projection is robust and easy to compute
• The histograms of the two classes along a projection are evidence for classification
  - The linear feature on which the two class histograms differ most should be selected
• If the weight distribution of the sample set changes, the histograms change as well
  - Increase the weights of misclassified samples, and decrease the weights of correctly classified samples
Linear Projections and Histograms

ψ(x) = w^T x

Samples {x_i} ~ f(x), each with weight W(x_i)

H_w(z) = Σ_i W(x_i) δ(z − w^T x_i)
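The weighted histogram H_w(z) above, discretized into bins, can be sketched as follows (`projected_histogram` is our name; the face-detection experiments later in the talk use 300 bins):

```python
import numpy as np

def projected_histogram(X, weights, w, bins=300, value_range=None):
    """H_w(z) = sum_i W(x_i) * delta(z - w^T x_i), binned and normalized."""
    z = X @ w                 # project every sample onto the feature w
    H, edges = np.histogram(z, bins=bins, range=value_range, weights=weights)
    return H / H.sum(), edges
```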
KLBoosting (1)

At the k-th iteration:

• Kullback-Leibler feature: the projection maximizing the symmetric KL divergence between the two class histograms,

  KL(w) = ∫ [h⁺(z) − h⁻(z)] log( h⁺(z) / h⁻(z) ) dz,    w_k* = argmax_w KL(w)

• Discriminating function:

  φ_k(x) = (1/2) log[ h⁺_k(w_k^T x) / h⁻_k(w_k^T x) ]

• Reweighting (Z_k⁺ and Z_k⁻ normalize the weights of each class):

  W_k⁺(x_i) = (1/Z_k⁺) W_{k−1}⁺(x_i) exp{ −α_k φ_k(x_i) }
  W_k⁻(x_i) = (1/Z_k⁻) W_{k−1}⁻(x_i) exp{ +α_k φ_k(x_i) }

so that correctly classified samples lose weight and misclassified ones gain weight.
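A sketch of selecting the KL feature: the slides use MCMC to search in low dimensions, whereas the exhaustive search over a finite candidate set below (and the names `symmetric_kl`, `best_kl_feature`) is our simplification:

```python
import numpy as np

def symmetric_kl(h_pos, h_neg, eps=1e-12):
    """Discrete symmetric KL: sum_z [h+(z) - h-(z)] * log(h+(z) / h-(z))."""
    p, q = h_pos + eps, h_neg + eps
    return float(np.sum((p - q) * np.log(p / q)))

def best_kl_feature(candidates, X_pos, X_neg, bins=30):
    """w* = argmax_w KL(w), searched over a finite candidate set."""
    best_w, best_kl = None, -np.inf
    for w in candidates:
        z = np.concatenate([X_pos @ w, X_neg @ w])
        shared = (z.min(), z.max())   # shared bin range for both classes
        hp, _ = np.histogram(X_pos @ w, bins=bins, range=shared)
        hn, _ = np.histogram(X_neg @ w, bins=bins, range=shared)
        kl = symmetric_kl(hp / hp.sum(), hn / hn.sum())
        if kl > best_kl:
            best_w, best_kl = w, kl
    return best_w, best_kl
```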
KLBoosting (2)

Two types of parameters to learn:
• KL features {w_k}
• Combination coefficients {α_j}

Learning a KL feature in low dimensions: MCMC

Learning the combining coefficients to minimize training error:

  {α_j}* = argmin_{α_j} (1/N) Σ_{i=1}^N 1[ y_i ≠ I(x_i) ]

• Optimization: brute-force search
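The brute-force search for {α_j} can be sketched as a per-coordinate grid sweep over the training error; the grid range and sweep count here are our choices, not from the slides:

```python
import numpy as np

def training_error(alphas, phi_vals, y):
    """Error = (1/N) * sum_i 1[ y_i != sign(sum_j alpha_j * phi_j(x_i)) ]."""
    scores = phi_vals @ alphas           # phi_vals[i, j] = phi_j(x_i)
    pred = np.where(scores >= 0, 1, -1)
    return np.mean(pred != y)

def brute_force_alphas(phi_vals, y, grid=np.linspace(0.0, 2.0, 21), sweeps=3):
    """Greedy per-coordinate grid search minimizing training error."""
    k = phi_vals.shape[1]
    alphas = np.ones(k)
    for _ in range(sweeps):
        for j in range(k):
            errs = []
            for a in grid:
                trial = alphas.copy()
                trial[j] = a
                errs.append(training_error(trial, phi_vals, y))
            alphas[j] = grid[int(np.argmin(errs))]
    return alphas
```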
Flowchart

Input: (x_1, y_1), …, (x_N, y_N), x ∈ R^d, y ∈ {−1, +1}, with labels {y_i} and sample weights {W(x_i)}

1. Initialize weights W_0⁺, W_0⁻
2. Learn the KL feature w_k
3. Update the weights W_k⁺, W_k⁻
4. Recognition error small enough? If not (N), return to step 2; if so (Y), continue
5. Output the classifier

  I(x) = sign[ Σ_{i=1}^k α_i log( h⁺_i(w_i^T x) / h⁻_i(w_i^T x) ) ]
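The loop body above (class histograms → log-ratio discriminating function → reweighting) can be sketched for a single fixed feature w; the uniform α = 1 here is a placeholder, since the slides optimize the coefficients separately:

```python
import numpy as np

def klboost_step(X, y, weights, w, bins=30, eps=1e-12):
    """One KLBoosting iteration for a fixed feature w (a sketch)."""
    z = X @ w
    edges = np.linspace(z.min(), z.max() + eps, bins + 1)
    hp, _ = np.histogram(z[y == 1], bins=edges, weights=weights[y == 1])
    hn, _ = np.histogram(z[y == -1], bins=edges, weights=weights[y == -1])
    hp = hp / hp.sum() + eps          # weighted class histograms h+, h-
    hn = hn / hn.sum() + eps
    idx = np.clip(np.searchsorted(edges, z, side="right") - 1, 0, bins - 1)
    phi = 0.5 * np.log(hp[idx] / hn[idx])   # discriminating function phi_k
    alpha = 1.0                              # placeholder; alpha is optimized
    new_w = weights * np.exp(-alpha * y * phi)
    return phi, new_w / new_w.sum()
```

Correctly classified samples (y and φ with the same sign) see their weights shrink; misclassified ones see them grow, matching the intuition slide.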
A Simple Example

(Figure: KL features, histograms, and the decision manifold on a toy problem.)
A Complicated Case

(Figure: error rate (0%-40%) vs. feature number (1-32), comparing the training and testing error rates of KLBoosting and AdaBoost.)
Kullback-Leibler Analysis (KLA)

• Finding the KL feature in image space is a challenging task
• Sequential 1D optimization:
  - Construct a feature bank
  - Build a set of the most promising features
  - Sequentially do 1D optimization along the promising features
• Conjecture: the global optimum of an objective function can be reached by searching along as many linear features as needed
Intuition of Sequential 1D Optimization

(Figure: the feature bank, the promising feature set, the result of sequential 1D optimization, and the MCMC feature.)
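The sequential 1D search above can be sketched in generic form; the quadratic test objective, the line-search grid, and the sweep count are our illustrative choices:

```python
import numpy as np

def sequential_1d_optimize(objective, x0, directions,
                           grid=np.linspace(-1.0, 1.0, 41), sweeps=2):
    """Repeatedly line-search along each promising feature direction in turn."""
    x = np.asarray(x0, dtype=float)
    for _ in range(sweeps):
        for d in directions:
            vals = [objective(x + t * d) for t in grid]
            x = x + grid[int(np.argmax(vals))] * d
    return x
```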
Optimization in Image Space

• An image is a random field, not a pure random variable
• The local statistics can be captured by wavelets:
  - 111 × 400 small-scale wavelets for the whole 20×20 patch
  - 80 × 100 large-scale wavelets for the inner 10×10 patch
  - 52,400 wavelets in total compose the feature bank
  - 2,800 most promising wavelets selected
• Gaussian-family wavelets and Haar wavelets form the feature bank
• Compose the KL feature by sequential 1D optimization
Data-Driven KLA

• Face patterns and non-face patterns
• Feature bank (111 wavelets); promising feature set (2,800 features in total)
• On each position of the 20×20 lattice, compute the histograms of the 111 wavelets and the KL divergences between face and non-face images
• Large-scale wavelets capture the global statistics on the 10×10 inner lattice
Comparison with Other Features

• MCMC feature: KL = 3.246
• Best Haar wavelet: KL = 2.944
• KL feature: KL = 10.967
Application: Face Detection

Experimental setup:
• 20×20 patch to represent a face
• 17,520 frontal faces
• 1,339,856,947 non-faces from 2,484 images
• 300 bins in the histogram representation

A cascade of KLBoosting classifiers:
• In each classifier, keep the false negative rate < 0.01% and the false alarm rate < 35%
• 22 classifiers in total form the cascade (450 features)
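The cascade logic itself is simple: a patch is accepted only if every stage accepts it, so most non-faces are rejected after a few cheap stages. `cascade_detect` and the toy threshold stages below are our illustration, not the detector's actual classifiers:

```python
def cascade_detect(patch, stages):
    """A cascade of classifiers: declare a face only if all stages agree."""
    for stage in stages:
        if stage(patch) != 1:
            return -1   # rejected at this stage: non-face
    return 1            # survived all stages: face

# Toy example: two threshold "stages" on a scalar score
stages = [lambda s: 1 if s > 0 else -1,
          lambda s: 1 if s > 5 else -1]
```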
KL Features of the Face Detector

(Figure: face and non-face patterns; the first 10 KL features and some later KL features, showing global semantics, frequency filters, and local features.)
ROC Curve

(Figure: correct detection rate (0.65-1.00) vs. false alarm rate (10⁻⁷ to 3.16×10⁻⁵, log scale) for KLBoosting, AdaBoost, and a neural network.)
Some Detection Results
Comparison with AdaBoost

(Figure: false alarm rate (10⁻⁵ to 1, log scale) vs. number of features (0-400) for KLBoosting and AdaBoost.)
Compared with AdaBoost

• Base classifier
  - KLBoosting: KL feature + histogram divergence
  - AdaBoost: selected from experience
• Combining coefficients
  - KLBoosting: globally optimized to minimize training error
  - AdaBoost: empirically set to be incrementally optimal
Summary

• KLBoosting is an optimal classifier
  - Projection function: linear projection
  - Discriminating function: histogram divergence
  - Coefficients: optimized by minimizing training error
• KLA: a data-driven approach to pursue KL features
• Applications in face detection
Thank you!
Harry Shum
Microsoft Research Asia
hshum@microsoft.com
Compared with SVM

• Support vectors
  - KLBoosting: KL features learnt to optimize KL divergence (a few)
  - SVM: selected from training samples (many)
• Kernel function
  - KLBoosting: histogram divergence (flexible)
  - SVM: selected from experience (fixed)