Advanced Machine Learning & Perception
Instructor: Tony Jebara, Columbia University
Boosting
• Combining Multiple Classifiers
• Voting
• Boosting
• AdaBoost
•Based on material by Y. Freund, P. Long & R. Schapire
Combining Multiple Learners
• Have many simple learners
• Also called base learners or weak learners, which have a classification error of < 0.5
• Combine or vote them to get higher accuracy
• No free lunch: there is no guaranteed best approach here
• Different approaches:
  - Voting: combine learners with fixed weights
  - Mixture of Experts: adjust learners and a variable weight/gating function
  - Boosting: actively search for the next base learners and vote
  - Cascading, Stacking, Bagging, etc.
Voting
• Have T classifiers $h_t(x)$
• Average their predictions with weights
• Like mixture of experts, but the weights are constant with respect to the input
$$f(x) = \sum_{t=1}^{T} \alpha_t h_t(x) \quad\text{where}\quad \alpha_t \ge 0 \;\text{and}\; \sum_{t=1}^{T} \alpha_t = 1$$
[Figure: three experts each map the input x to a prediction $h_t(x)$; the predictions are combined with fixed weights into the output y.]
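As a concrete illustration (not from the slides), here is a minimal sketch of fixed-weight voting in Python; the three toy base classifiers and the weights are invented for the example.

```python
import numpy as np

def vote(classifiers, alphas, x):
    """Fixed-weight voting: f(x) = sum_t alpha_t * h_t(x), with alpha_t >= 0 summing to 1."""
    alphas = np.asarray(alphas, dtype=float)
    assert np.all(alphas >= 0) and np.isclose(alphas.sum(), 1.0)
    return sum(a * h(x) for a, h in zip(alphas, classifiers))

# Three toy base classifiers h_t(x) in {-1, +1} on a 2-D input.
h1 = lambda x: 1 if x[0] > 0 else -1
h2 = lambda x: 1 if x[1] > 0 else -1
h3 = lambda x: 1 if x[0] + x[1] > 0 else -1

f = vote([h1, h2, h3], alphas=[0.5, 0.3, 0.2], x=np.array([1.0, -0.5]))
print(np.sign(f))   # ensemble prediction in {-1, +1}
```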
Mixture of Experts
• Have T classifiers or experts $h_t(x)$ and a gating function
• Average their predictions with variable, input-dependent weights
• But adapt the parameters of the gating function and of the experts (fixed total number T of experts)
$$f(x) = \sum_{t=1}^{T} \alpha_t(x)\, h_t(x) \quad\text{where}\quad \alpha_t(x) \ge 0 \;\text{and}\; \sum_{t=1}^{T} \alpha_t(x) = 1$$
[Figure: a gating network takes the input x and produces weights $\alpha_t(x)$ that combine the outputs of the three experts into a single output.]
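To make the input-dependent weights concrete, here is a minimal sketch (not from the slides) of a softmax gating function; the two toy experts and the gating parameters W, b are invented for illustration and would normally be trained jointly with the experts.

```python
import numpy as np

def softmax(z):
    z = z - z.max()          # numerical stability
    e = np.exp(z)
    return e / e.sum()

def mixture_of_experts(x, experts, W, b):
    """f(x) = sum_t alpha_t(x) h_t(x), where alpha(x) = softmax(W x + b) varies with the input."""
    alpha = softmax(W @ x + b)                    # alpha_t(x) >= 0, sums to 1
    outputs = np.array([h(x) for h in experts])   # expert predictions h_t(x)
    return alpha @ outputs

# Two toy experts and made-up gating parameters.
experts = [lambda x: 1.0 if x[0] > 0 else -1.0,
           lambda x: 1.0 if x[1] > 0 else -1.0]
W = np.eye(2)
b = np.zeros(2)
print(mixture_of_experts(np.array([2.0, -1.0]), experts, W, b))
```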
Boosting
• Actively find complementary or synergistic weak learners
• Train the next learner based on the mistakes of the previous ones
• Average predictions with fixed weights
• Find the next learner by training on weighted versions of the data

[Boosting-loop schematic:]
• Training data $\{(x_1,y_1),\ldots,(x_N,y_N)\}$ with weights $\{w_1,\ldots,w_N\}$
• The weak learner returns a weak rule $h_t(x)$ whose weighted error satisfies
$$\frac{\sum_{i=1}^{N} w_i\,\mathrm{step}\!\left(-h_t(x_i)\,y_i\right)}{\sum_{i=1}^{N} w_i} < \frac{1}{2}-\gamma$$
• Ensemble of learners $\{(\alpha_1,h_1),\ldots,(\alpha_T,h_T)\}$
• Prediction $f(x) = \sum_{t=1}^{T} \alpha_t h_t(x)$
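As a small illustration of the weak rule (not from the slides), this snippet computes the weighted error of a candidate stump on toy data; boosting accepts the learner only if this error stays below 1/2, i.e. the edge γ_t = 1/2 − ε_t is positive.

```python
import numpy as np

def weighted_error(h, X, y, w):
    """Weak-rule quantity from the slide: sum_i w_i * step(-h(x_i) y_i) / sum_i w_i,
    i.e. the fraction of the weight placed on examples the weak learner gets wrong."""
    mistakes = (np.array([h(x) for x in X]) != y).astype(float)   # step(-h(x_i) y_i)
    return (w * mistakes).sum() / w.sum()

# Toy data, current boosting weights, and a candidate stump (all invented for the example).
X = np.array([[0.5], [1.5], [2.5], [3.5]])
y = np.array([-1, -1, +1, +1])
w = np.array([0.1, 0.4, 0.4, 0.1])
stump = lambda x: 1 if x[0] > 2.0 else -1

eps = weighted_error(stump, X, y, w)
print(eps, 0.5 - eps)   # weighted error eps_t and the edge gamma_t = 1/2 - eps_t
```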
AdaBoost
• Most popular weighting scheme
• Define the margin for point i as
$$y_i \sum_{t=1}^{T} \alpha_t h_t(x_i)$$
• Find an $h_t$ and a weight $\alpha_t$ to minimize the cost function, the sum of exponentiated negative margins:
$$\sum_{i=1}^{N} \exp\!\left(-y_i \sum_{t=1}^{T} \alpha_t h_t(x_i)\right)$$
[Same boosting-loop schematic as before: weighted training data $\{(x_1,y_1),\ldots,(x_N,y_N)\}$, $\{w_1,\ldots,w_N\}$; weak learner $h_t(x)$ satisfying $\sum_i w_i\,\mathrm{step}(-h_t(x_i)y_i)\,/\,\sum_i w_i < \tfrac{1}{2}-\gamma$; ensemble $\{(\alpha_1,h_1),\ldots,(\alpha_T,h_T)\}$; prediction $f(x)=\sum_{t=1}^{T}\alpha_t h_t(x)$.]
AdaBoost
• Choose the base learner and $\alpha_t$:
$$\min_{\alpha_t,\, h_t} \sum_{i=1}^{N} \exp\!\left(-y_i \sum_{t=1}^{T} \alpha_t h_t(x_i)\right)$$
• Recall that the error of base classifier $h_t$ must satisfy
$$\epsilon_t = \frac{\sum_{i=1}^{N} w_i\,\mathrm{step}\!\left(-h_t(x_i)\,y_i\right)}{\sum_{i=1}^{N} w_i} < \frac{1}{2}-\gamma$$
• For binary $h$, AdaBoost puts this weight on the weak learners (instead of the more general rule):
$$\alpha_t = \frac{1}{2}\ln\!\left(\frac{1-\epsilon_t}{\epsilon_t}\right)$$
• AdaBoost picks the following weights on the data for the next round (here $Z_t$ is the normalizer so that the weights sum to 1):
$$w_i^{t+1} = \frac{w_i^{t}\exp\!\left(-\alpha_t\, y_i\, h_t(x_i)\right)}{Z_t}$$
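Putting the update rules together, here is a minimal AdaBoost sketch in Python with brute-force threshold stumps as the weak learners; the toy data and the stump search are illustrative choices, not part of the original slides.

```python
import numpy as np

def best_stump(X, y, w):
    """Search all (feature, threshold, polarity) stumps for the smallest weighted error eps_t."""
    best = None
    for j in range(X.shape[1]):
        for thr in np.unique(X[:, j]):
            for sign in (+1, -1):
                pred = np.where(X[:, j] > thr, sign, -sign)
                eps = w[pred != y].sum()
                if best is None or eps < best[0]:
                    best = (eps, j, thr, sign)
    return best

def adaboost(X, y, T=30):
    N = len(y)
    w = np.full(N, 1.0 / N)                      # initial weights w_i^1 = 1/N
    ensemble = []
    for _ in range(T):
        eps, j, thr, sign = best_stump(X, y, w)
        if eps >= 0.5:                           # weak rule violated: stop
            break
        alpha = 0.5 * np.log((1 - eps) / eps)    # alpha_t = 1/2 ln((1 - eps_t) / eps_t)
        pred = np.where(X[:, j] > thr, sign, -sign)
        w = w * np.exp(-alpha * y * pred)        # w_i^{t+1} ~ w_i^t exp(-alpha_t y_i h_t(x_i))
        w /= w.sum()                             # divide by the normalizer Z_t
        ensemble.append((alpha, j, thr, sign))
    return ensemble

def predict(ensemble, X):
    """f(x) = sum_t alpha_t h_t(x); the prediction is its sign."""
    scores = sum(a * np.where(X[:, j] > thr, s, -s) for a, j, thr, s in ensemble)
    return np.sign(scores)

# Toy 1-D data that no single stump separates, but a few boosted stumps can.
X = np.array([[1.0], [2.0], [3.0], [4.0], [5.0], [6.0]])
y = np.array([+1, +1, -1, -1, +1, +1])
ens = adaboost(X, y)
print(predict(ens, X))   # ensemble predictions on the training points
```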
Decision Trees
[Figure: a decision tree that tests X > 3 (no → predict -1; yes → test Y > 5, with yes → +1 and no → -1), together with the corresponding partition of the (X, Y) plane: the region X ≤ 3 is labeled -1, X > 3 and Y > 5 is +1, and X > 3 and Y ≤ 5 is -1.]
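Read as code, the tree in this figure (with the X > 3 and Y > 5 thresholds reconstructed from the slide) is simply:

```python
def tree_predict(x, y):
    """Decision tree from the figure: test X > 3 first, then Y > 5."""
    if x > 3:
        return +1 if y > 5 else -1
    return -1

print(tree_predict(4, 6), tree_predict(4, 2), tree_predict(1, 9))   # +1 -1 -1
```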
Decision tree as a sum
[Figure: the same tree rewritten as a sum of scores: a constant -0.2, an X > 3 node contributing +0.1 (yes) or -0.1 (no), and a Y > 5 node under the "yes" branch contributing +0.2 (yes) or -0.3 (no); only the nodes along the path taken contribute, and the prediction is the sign of the total, reproducing the +1 / -1 / -1 regions of the previous slide.]
An alternating decision tree
[Figure: an alternating decision tree built from the same scores: the constant -0.2, the X > 3 node (+0.1 / -0.1) with the Y > 5 node (+0.2 / -0.3) beneath its "yes" branch, plus a new Y < 1 node attached directly to the root contributing +0.7 (yes) or 0.0 (no); the prediction is the sign of the sum of contributions along all paths taken, which adds a +1 region for Y < 1 to the partition of the plane.]
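As a sketch of how such a tree is evaluated (with the thresholds and scores read off the reconstructed figure, so treat the numbers as illustrative), every decision node whose precondition is reached contributes its score, and the prediction is the sign of the total:

```python
def adtree_score(x, y):
    """Alternating decision tree from the figure: sum the contributions of every
    node on the paths taken, starting from the root constant."""
    score = -0.2                               # root constant
    score += 0.7 if y < 1 else 0.0             # Y < 1 node, attached directly to the root
    if x > 3:                                  # X > 3 node
        score += 0.1
        score += 0.2 if y > 5 else -0.3        # Y > 5 node, only under the X > 3 "yes" branch
    else:
        score += -0.1
    return score

def adtree_predict(x, y):
    return +1 if adtree_score(x, y) > 0 else -1

print(adtree_predict(4, 6), adtree_predict(4, 2), adtree_predict(1, 0.5))   # +1 -1 +1
```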
Example: Medical Diagnostics
•Cleve dataset from UC Irvine database.
•Heart disease diagnostics (+1=healthy,-1=sick)
•13 features from tests (real valued and discrete).
•303 instances.
Ad-Tree Example
Cross-validated accuracy

Learning algorithm   Number of splits   Average test error   Test error variance
ADtree               6                  17.0%                0.6%
C5.0                 27                 27.2%                0.5%
C5.0 + boosting      446                20.2%                0.5%
Boost Stumps         16                 16.5%                0.8%
AdaBoost Convergence
• Rationale? Consider a bound on the training error:
$$R_{emp} = \frac{1}{N}\sum_{i=1}^{N}\mathrm{step}\!\left(-y_i f(x_i)\right) \le \frac{1}{N}\sum_{i=1}^{N}\exp\!\left(-y_i f(x_i)\right) = \frac{1}{N}\sum_{i=1}^{N}\exp\!\left(-y_i\sum_{t=1}^{T}\alpha_t h_t(x_i)\right) = \prod_{t=1}^{T} Z_t$$
(0-1 loss, exponential bound on the step function, definition of f(x), recursive use of Z)
• AdaBoost is essentially doing gradient descent on this. Convergence?
[Figure: loss as a function of the margin, comparing the 0-1 loss with the exponential bound and with the BrownBoost and LogitBoost losses; mistakes lie on the negative-margin side, correct classifications on the positive side.]
AdaBoost Convergence
• Convergence? Consider the binary $h_t$ case, with
$$\alpha_t = \frac{1}{2}\ln\!\left(\frac{1-\epsilon_t}{\epsilon_t}\right), \qquad \epsilon_t = \frac{1}{2}-\gamma_t \le \frac{1}{2}-\gamma$$
$$\begin{aligned}
R_{emp} \le \prod_{t=1}^{T} Z_t
&= \prod_{t=1}^{T}\sum_{i=1}^{N} w_i^{t}\exp\!\left(-\alpha_t y_i h_t(x_i)\right)
 = \prod_{t=1}^{T}\sum_{i=1}^{N} w_i^{t}\left(\frac{\epsilon_t}{1-\epsilon_t}\right)^{\frac{1}{2}y_i h_t(x_i)} \\
&= \prod_{t=1}^{T}\left(\sum_{\text{correct}} w_i^{t}\sqrt{\frac{\epsilon_t}{1-\epsilon_t}} \;+\; \sum_{\text{incorrect}} w_i^{t}\sqrt{\frac{1-\epsilon_t}{\epsilon_t}}\right) \\
&= \prod_{t=1}^{T} 2\sqrt{\epsilon_t\left(1-\epsilon_t\right)}
 = \prod_{t=1}^{T}\sqrt{1-4\gamma_t^{2}}
 \le \exp\!\left(-2\sum_{t}\gamma_t^{2}\right)
\end{aligned}$$
(the correctly classified points carry total weight $1-\epsilon_t$ and the mistakes carry weight $\epsilon_t$, which gives the $2\sqrt{\epsilon_t(1-\epsilon_t)}$ factor)
$$R_{emp} \le \exp\!\left(-2\sum_{t}\gamma_t^{2}\right) \le \exp\!\left(-2T\gamma^{2}\right)$$
So the final learner's training error converges exponentially fast in T if each weak learner is better than random by at least $\gamma$!
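A quick numerical check (not from the slides) of the per-round factor used above: $Z_t = 2\sqrt{\epsilon_t(1-\epsilon_t)} = \sqrt{1-4\gamma_t^2} \le \exp(-2\gamma_t^2)$ for any edge $\gamma_t \in (0, \tfrac{1}{2}]$.

```python
import numpy as np

gammas = np.array([0.05, 0.1, 0.2, 0.3, 0.4])
eps = 0.5 - gammas
Z = 2 * np.sqrt(eps * (1 - eps))                         # per-round factor Z_t

print(np.allclose(Z, np.sqrt(1 - 4 * gammas**2)))        # identity: 2*sqrt(eps(1-eps)) = sqrt(1-4*gamma^2)
print(np.all(Z <= np.exp(-2 * gammas**2)))               # inequality used in the convergence proof
print(Z**50)                                             # bound on the training error after T = 50 rounds
```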
Curious phenomenon: boosting decision trees
Using <10,000 training examples we fit >2,000,000 parameters
Explanation using margins
[Figure: 0-1 loss plotted against the margin.]
Explanation using margins
[Figure: 0-1 loss plotted against the margin.]
No examples with small margins!!
Experimental Evidence
AdaBoost Generalization Bound
• Also, a VC analysis gives a generalization bound (where d is the VC dimension of the base classifier):
$$R \le R_{emp} + O\!\left(\sqrt{\frac{Td}{N}}\right)$$
• But more iterations → overfitting!
• A margin analysis is possible; redefine the margin as:
$$\mathrm{mar}_f(x,y) = \frac{y\sum_{t}\alpha_t h_t(x)}{\sum_{t}\alpha_t}$$
Then we have
$$R \le \frac{1}{N}\sum_{i=1}^{N}\mathrm{step}\!\left(\theta - \mathrm{mar}_f(x_i,y_i)\right) + O\!\left(\sqrt{\frac{d}{N\theta^{2}}}\right)$$
AdaBoost Generalization Bound
[Figure: margin illustration.]
•Suggests this optimization problem:
AdaBoost Generalization Bound
• Proof sketch
UCI Results
Database    Other              Boosting   Error reduction
Cleveland   27.2 (DT)          16.5       39%
Promoters   22.0 (DT)          11.8       46%
Letter      13.8 (DT)          3.5        74%
Reuters 4   5.8, 6.0, 9.8      2.95       ~60%
Reuters 8   11.3, 12.1, 13.4   7.4        ~40%
(% test error rates)
Boosted Cascade of Stumps
• Consider classifying an image as face / not face
• Use as the weak learner a stump on averages of pixel intensity; easy to calculate: the sums over white areas are subtracted from those over black ones
•A special representation of the sample called the integral image makes feature extraction faster.
Boosted Cascade of Stumps
• Summed area tables
• A representation that means the sum over any rectangle can be calculated in four accesses of the integral image (see the sketch below).
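Here is a minimal NumPy sketch (not from the slides) of the summed-area-table idea: compute cumulative sums once, then the sum over any rectangle, and hence any Haar-like rectangle feature, costs only four lookups.

```python
import numpy as np

def integral_image(img):
    """Summed area table: ii[r, c] = sum of img[0:r, 0:c] (zero-padded on the top and left)."""
    ii = np.zeros((img.shape[0] + 1, img.shape[1] + 1), dtype=np.int64)
    ii[1:, 1:] = img.cumsum(axis=0).cumsum(axis=1)
    return ii

def rect_sum(ii, top, left, height, width):
    """Sum of img[top:top+height, left:left+width] using four accesses of the integral image."""
    bottom, right = top + height, left + width
    return ii[bottom, right] - ii[top, right] - ii[bottom, left] + ii[top, left]

img = np.arange(36).reshape(6, 6)
ii = integral_image(img)
assert rect_sum(ii, 2, 1, 3, 2) == img[2:5, 1:3].sum()

# A two-rectangle Haar-like feature: sum over one region minus the adjacent region.
print(rect_sum(ii, 0, 0, 3, 6) - rect_sum(ii, 3, 0, 3, 6))
```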
Boosted Cascade of Stumps
• Summed area tables
Boosted Cascade of Stumps
• The base size for a sub-window is 24 by 24 pixels
• Each of the four feature types is scaled and shifted across all possible combinations
• In a 24 by 24 pixel sub-window there are ~160,000 possible features to be calculated
Boosted Cascade of Stumps
• In the Viola-Jones algorithm, with K features (e.g., K = 160,000) we have 160,000 different decision stumps to choose from
At each stage of boosting:
• Given the reweighted data from the previous stage
• Train all K (160,000) single-feature perceptrons
• Select the single best classifier at this stage
• Combine it with the previously selected classifiers
• Reweight the data
• Learn all K classifiers again, select the best, combine, reweight
• Repeat until you have T classifiers selected
• Very computationally intensive! (A schematic sketch of one stage follows.)
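A schematic sketch of one such boosting stage (not the actual Viola-Jones code): given precomputed feature values for every candidate rectangle feature, fit a threshold stump per feature on the reweighted data and keep the one with the lowest weighted error. The data, the single-threshold shortcut, and the feature matrix F are all invented for illustration.

```python
import numpy as np

def best_feature_stump(F, y, w):
    """F: (N, K) matrix of feature values for N sub-windows and K rectangle features.
    For brevity each feature is thresholded at its median; the real algorithm also
    searches over thresholds. Returns the lowest weighted-error single-feature stump."""
    best = None
    for k in range(F.shape[1]):
        thr = np.median(F[:, k])
        for sign in (+1, -1):
            pred = np.where(F[:, k] > thr, sign, -sign)
            eps = w[pred != y].sum()
            if best is None or eps < best[0]:
                best = (eps, k, thr, sign)
    return best

rng = np.random.default_rng(0)
F = rng.normal(size=(6, 5))                 # toy: 6 windows, 5 candidate features
y = np.array([+1, +1, -1, -1, +1, -1])
w = np.full(6, 1 / 6)                       # current boosting weights
print(best_feature_stump(F, y, w))          # (eps_t, feature index, threshold, polarity)
```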
Boosted Cascade of Stumps
• Reduction in error as boosting adds classifiers
Boosted Cascade of Stumps
• First (i.e., best) two features learned by boosting
Boosted Cascade of Stumps
• Example training data
Boosted Cascade of Stumps
• To find faces, scan all squares at different scales: slow
• Boosting finds an ordering on the weak learners (best ones first)
• Idea: cascade the stumps to avoid too much computation (sketched below)!
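A minimal sketch of the cascade idea (invented for illustration, with hypothetical stage scores): cheap early stages reject most non-face windows immediately, so the later, more expensive classifiers only run on the few windows that survive.

```python
def cascade_detect(window, stages):
    """stages: list of (score_fn, threshold) pairs, cheapest and most selective first.
    A window must pass every stage to be declared a face; most windows exit early."""
    for score_fn, threshold in stages:
        if score_fn(window) < threshold:
            return False          # rejected early: no further computation on this window
    return True                   # survived all stages: report a face

# Hypothetical stages; each score_fn stands in for a small boosted-stump ensemble score.
stages = [
    (lambda w: w["coarse_contrast_score"], 0.2),   # very cheap first stage
    (lambda w: w["eye_region_score"],      0.5),   # more features, run on far fewer windows
]
print(cascade_detect({"coarse_contrast_score": 0.4, "eye_region_score": 0.7}, stages))   # True
print(cascade_detect({"coarse_contrast_score": 0.1, "eye_region_score": 0.9}, stages))   # False
```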
Boosted Cascade of Stumps
• Training time = weeks (with 5k faces and 9.5k non-faces)
• Final detector has 38 layers in the cascade, 6060 features
• On a 700 MHz processor, it can process a 384 x 288 image in 0.067 seconds (in 2003, when the paper was written)
Boosted Cascade of Stumps
• Results