Boosting and Additive Trees (2)
Yi Zhang, Kevyn Collins-Thompson
Advanced Statistical Seminar 11-745, Oct 29, 2002
-
Recap: Boosting (1)
Background: Ensemble Learning
Boosting Definitions, Example
AdaBoost
Boosting as an Additive Model
Boosting Practical Issues
Exponential Loss
Other Loss Functions
Boosting Trees
Boosting as Entropy Projection
Data Mining Methods
-
Outline for This Class
Find the solution based on numerical optimization
Control the model complexity and avoid overfitting
  Right-sized trees for boosting
  Number of iterations
  Regularization
Understand the final model (interpretation)
  Single variable
  Correlation of variables
-
Numerical Optimization
Goal: find f that minimizes the loss function over the training data:

$$\hat{\mathbf{f}} = \arg\min_{\mathbf{f}} \sum_{i=1}^{N} L(y_i, f(x_i)), \qquad \mathbf{f} = \{f(x_1), f(x_2), \ldots, f(x_N)\}$$

Gradient descent: search in the unconstrained function space to minimize the loss on the training data. At step m, compute the gradient of the loss with respect to the fitted values,

$$g_{im} = \left[ \frac{\partial L(y_i, f(x_i))}{\partial f(x_i)} \right]_{f = f_{m-1}},$$

choose the step length

$$\rho_m = \arg\min_{\rho} L(\mathbf{f}_{m-1} - \rho\, \mathbf{g}_m),$$

and update

$$\mathbf{f}_m = \mathbf{f}_{m-1} - \rho_m \mathbf{g}_m.$$

The loss on the training data converges to zero.
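To make the function-space view concrete, here is a minimal sketch under the assumption of squared-error loss, where the gradient at each training point is simply $f(x_i) - y_i$; the fitted vector marches straight toward y, which is why the training loss goes to zero:

```python
import numpy as np

# Minimal sketch: gradient descent in *function space* for squared loss.
# The "parameters" are the fitted values f(x_i) themselves, one per point.
def functional_gradient_descent(y, n_steps=200, rho=0.1):
    f = np.zeros_like(y)          # f_0 = 0
    for _ in range(n_steps):
        g = f - y                 # g_im = dL/df for L = (y - f)^2 / 2
        f = f - rho * g           # f_m = f_{m-1} - rho * g_m
    return f

y = np.array([1.0, -2.0, 0.5, 3.0])
print(functional_gradient_descent(y))  # converges to y: training loss -> 0
```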
-
Gradient Search on a Constrained Function Space: Gradient Tree Boosting

Introduce a tree at the m-th iteration whose predictions are as close as possible to the negative gradient:

$$\tilde{\Theta}_m = \arg\min_{\Theta} \sum_{i=1}^{N} \big( -g_{im} - T(x_i; \Theta) \big)^2$$

Advantage compared with unconstrained gradient search: more robust, less likely to overfit.
-
Algorithm 3: MART (Multiple Additive Regression Trees)

1. Initialize $f_0(x) = \arg\min_{\gamma} \sum_{i=1}^{N} L(y_i, \gamma)$ (a single terminal node tree).
2. For m = 1 to M:
   a) Compute the pseudo-residuals based on the loss function:
      $r_{im} = -\left[ \frac{\partial L(y_i, f(x_i))}{\partial f(x_i)} \right]_{f = f_{m-1}}, \quad i = 1, \ldots, N$
   b) Fit a regression tree to the targets $r_{im}$, giving terminal regions $R_{jm}$, $j = 1, 2, \ldots, J_m$.
   c) Find the optimal coefficient value within each region $R_{jm}$:
      $\gamma_{jm} = \arg\min_{\gamma} \sum_{x_i \in R_{jm}} L\big(y_i, f_{m-1}(x_i) + \gamma\big)$
   d) Update $f_m(x) = f_{m-1}(x) + \sum_{j=1}^{J_m} \gamma_{jm} I(x \in R_{jm})$.
3. Output $\hat{f}(x) = f_M(x)$.
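A compact sketch of Algorithm 3 for squared-error loss, using scikit-learn's DecisionTreeRegressor as the base learner. For $L(y, f) = (y - f)^2/2$ the pseudo-residuals are simply $y - f$, and the leaf means fitted by the regression tree already equal the optimal $\gamma_{jm}$, so steps b) and c) collapse into one tree fit:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def mart_fit(X, y, M=100, J=6):
    f0 = float(np.mean(y))               # 1. best single-constant fit
    f = np.full(len(y), f0)
    trees = []
    for _ in range(M):                   # 2. for m = 1..M
        r = y - f                        # a) pseudo-residuals for squared loss
        tree = DecisionTreeRegressor(max_leaf_nodes=J).fit(X, r)  # b) + c)
        f += tree.predict(X)             # d) update the fit
        trees.append(tree)
    return f0, trees

def mart_predict(model, X):
    f0, trees = model
    return f0 + sum(t.predict(X) for t in trees)
```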
-
View Boosting as a Linear Model

Basis expansion: use basis functions $T_m$ (m = 1, ..., M; each $T_m$ is a weak learner) to transform the input vector X into T-space, then use a linear model in this new space:

$$f(x) = \sum_{m=1}^{M} \beta_m T_m(x)$$

Special for boosting: the choice of basis function $T_m$ depends on $T_1, \ldots, T_{m-1}$.
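To make this view concrete, a small sketch (a hypothetical helper, reusing the `trees` list from the MART sketch above): evaluating each tree on the inputs yields a new design matrix, and boosting's output is a linear fit in that space with all $\beta_m = 1$:

```python
import numpy as np

# Each fitted tree T_m contributes one column of the new design matrix;
# plain MART corresponds to a linear model in T-space with beta_m = 1.
def tree_basis(trees, X):
    return np.column_stack([t.predict(X) for t in trees])   # N x M
```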
-
Improving Boosting as a Linear Model

Recap: linear models in Chapter 3
  Bias-variance trade-off:
  1. Subset selection (feature selection, discrete)
  2. Coefficient shrinkage (smoothing: ridge, lasso)
  3. Using derived input directions (PCA, PLS)
  Multiple-outcome shrinkage and selection: exploit correlations in different outcomes

This chapter: improve boosting
  1. Size of the constituent trees J
  2. Number of boosting iterations M (subset selection)
  3. Regularization (shrinkage)
-
Right-Sized Trees for Boosting (?)

The best tree for one step is not the best in the long run: using a very large tree (such as C4.5) as the weak learner to fit the residuals assumes each tree is the last one in the expansion. This usually degrades performance and increases computation.

Simple approach: restrict all trees to be the same size J.

J limits the interaction level among input features in the tree-based approximation. In practice low-order interaction effects tend to dominate, and empirically 4 ≤ J ≤ 8 works well (?)
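In scikit-learn's boosting implementation this restriction corresponds to capping the number of terminal nodes per tree; a J-leaf tree can involve interactions among at most J - 1 variables, so J = 2 (stumps) yields a purely additive model. A brief illustration:

```python
from sklearn.ensemble import GradientBoostingRegressor

# J = 2 (stumps): main effects only, no interactions.
boost_additive = GradientBoostingRegressor(max_leaf_nodes=2, n_estimators=500)

# 4 <= J <= 8: allows the low-order interactions that dominate in practice.
boost_low_order = GradientBoostingRegressor(max_leaf_nodes=6, n_estimators=500)
```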
-
Number of Boosting Iterations
(subset selection)
Boosting will overfit as M → ∞
Use validation set
Other methods (later)
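A minimal sketch of choosing M with a validation set; `staged_predict` in scikit-learn scores the model after each boosting iteration without refitting:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_squared_error

def pick_m_by_validation(X_train, y_train, X_val, y_val, M_max=1000):
    gbm = GradientBoostingRegressor(n_estimators=M_max).fit(X_train, y_train)
    val_errors = [mean_squared_error(y_val, pred)
                  for pred in gbm.staged_predict(X_val)]
    return int(np.argmin(val_errors)) + 1   # best number of iterations M
```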
-
Shrinkage

Scale the contribution of each tree by a factor 0 < ν < 1 (the learning rate):

$$f_m(x) = f_{m-1}(x) + \nu \sum_{j=1}^{J} \gamma_{jm} I(x \in R_{jm})$$
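In the MART sketch above, shrinkage is a one-line change to the update in step d); `nu` is an assumed name for the learning-rate parameter:

```python
# Shrunken version of step d) from the earlier mart_fit sketch:
# scale each new tree's contribution by nu instead of adding it in full.
def boosting_update(f, tree, X, nu=0.1):
    """f_m = f_{m-1} + nu * T_m(x), with 0 < nu < 1."""
    return f + nu * tree.predict(X)
```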
-
Penalized Regression

Ridge regression or lasso regression on the basis functions $T_k$:

$$\hat{\alpha}(\lambda) = \arg\min_{\alpha} \left\{ \sum_{i=1}^{N} \Big( y_i - \sum_{k=1}^{K} \alpha_k T_k(x_i) \Big)^2 + \lambda \cdot J(\alpha) \right\}$$

$$J(\alpha) = \sum_{k=1}^{K} \alpha_k^2 \quad \text{(ridge regression, L2 norm)}$$

$$J(\alpha) = \sum_{k=1}^{K} |\alpha_k| \quad \text{(lasso)}$$
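A hypothetical illustration using scikit-learn on the tree-basis matrix from the earlier sketch (`tree_basis`, `trees`, `X`, and `y` are assumed from above; note that scikit-learn's `alpha` argument plays the role of $\lambda$):

```python
from sklearn.linear_model import Ridge, Lasso

T = tree_basis(trees, X)   # N x K design matrix of basis-tree predictions
alpha_ridge = Ridge(alpha=1.0).fit(T, y).coef_    # L2 penalty: sum a_k^2
alpha_lasso = Lasso(alpha=0.01).fit(T, y).coef_   # L1 penalty: sum |a_k|
```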
-
Algorithm 4: Forward Stagewise Linear Regression

1. Initialize $\alpha_k = 0$, $k = 1, \ldots, K$. Set $\varepsilon$ to some small constant, and M large.
2. For m = 1 to M:
   a) $(\beta^*, k^*) = \arg\min_{\beta, k} \sum_{i=1}^{N} \Big( y_i - \sum_{l=1}^{K} \alpha_l T_l(x_i) - \beta T_k(x_i) \Big)^2$
   b) $\alpha_{k^*} \leftarrow \alpha_{k^*} + \varepsilon \cdot \mathrm{sign}(\beta^*)$
3. Output $f(x) = \sum_{k=1}^{K} \alpha_k T_k(x)$.
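A minimal sketch of Algorithm 4 on a precomputed basis matrix T (N x K columns of basis-function values, e.g. from `tree_basis` above). Each pass finds the column whose least-squares fit to the current residual reduces RSS the most, then nudges that coefficient by ε:

```python
import numpy as np

def forward_stagewise(T, y, eps=0.01, M=5000):
    alpha = np.zeros(T.shape[1])            # 1. initialize alpha_k = 0
    col_ss = np.sum(T**2, axis=0)           # <T_k, T_k>, precomputed
    for _ in range(M):                      # 2. for m = 1..M
        r = y - T @ alpha                   # current residual
        beta = (T.T @ r) / col_ss           # best beta for each column k
        k = np.argmax(beta**2 * col_ss)     # a) column with largest RSS drop
        alpha[k] += eps * np.sign(beta[k])  # b) epsilon-step update
    return alpha                            # 3. f(x) = sum_k alpha_k T_k(x)
```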
-
(ε, M) vs. Lasso Regression

If each $|\hat{\alpha}_k(\lambda)|$ is monotone in $\lambda$, then with $\sum_k |\alpha_k| = M\varepsilon$ the solution of Algorithm 4 is identical to the result of lasso regression as described on page 64.
-
More about Algorithm 4

Algorithm 4 ≈ Algorithm 3 + shrinkage
L1 norm vs. L2 norm: more details later (Chapter 12, after learning SVMs)
-
Interpretation: Understanding the Final Model

Single decision trees are easy to interpret.
A linear combination of trees is difficult to understand:
Which features are important?
What is the interaction between features?
-
Relative Importance of Individual Variables

For a single tree T, define the squared importance of $x_l$ as the improvement in squared-error risk over that for a constant fit in the region, summed over all nodes that partition on $x_l$:

$$I_l^2(T) = \sum_{t=1}^{J-1} \hat{\imath}_t^2 \, I\big(v(t) = l\big)$$

For additive trees, define the importance of $x_l$ as the average over the M trees:

$$I_l^2 = \frac{1}{M} \sum_{m=1}^{M} I_l^2(T_m)$$

For K-class classification, just treat it as K two-class classification tasks.
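Scikit-learn exposes this kind of ensemble-averaged importance directly after fitting (normalized to sum to one); `feature_names`, `X`, and `y` here are assumed placeholders:

```python
from sklearn.ensemble import GradientBoostingRegressor

gbm = GradientBoostingRegressor(n_estimators=200).fit(X, y)
for name, imp in zip(feature_names, gbm.feature_importances_):
    print(f"{name}: {imp:.3f}")   # relative importance, averaged over trees
```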
-
Partial Dependence Plots

Visualize the dependence of the approximation f(X) on the joint values of a subset $X_S$ of important features. Usually the size of the subset is small (1-3).

Define the average or partial dependence:

$$\bar{f}_S(X_S) = E_{X_C}\big[f(X_S, X_C)\big] = \int f(X_S, X_C)\, P_C(X_C)\, dX_C \qquad (10.50)$$

It can be estimated empirically using the training data:

$$\bar{f}_S(X_S) = \frac{1}{N} \sum_{i=1}^{N} f(X_S, x_{iC})$$
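A minimal sketch of the empirical estimate: for each candidate value of the feature(s) in S, overwrite that column in every training row and average the model's predictions over the data:

```python
import numpy as np

def partial_dependence(model, X, feature, grid):
    pd_values = []
    for v in grid:
        X_mod = X.copy()
        X_mod[:, feature] = v                # fix X_S = v for all rows
        pd_values.append(model.predict(X_mod).mean())   # average over x_iC
    return np.array(pd_values)
```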
-
10.50 vs. 10.52

$$\bar{f}_S(X_S) = E_{X_C}\big[f(X_S, X_C)\big] = \int f(X_S, X_C)\, P(X_C)\, dX_C \qquad (10.50)$$

$$\tilde{f}_S(X_S) = E\big[f(X_S, X_C) \mid X_S\big] = \int f(X_S, X_C)\, P(X_C \mid X_S)\, dX_C \qquad (10.52)$$

They are the same if the predictor variables are independent.

Why use 10.50 instead of 10.52 to measure partial dependence?

Example 1: $f(X) = h_1(X_S) + h_2(X_C)$. Then 10.50 gives
$$\bar{f}_S(X_S) = \int \big(h_1(X_S) + h_2(X_C)\big) P(X_C)\, dX_C = h_1(X_S) + \text{constant},$$
recovering $h_1$ up to an additive constant.

Example 2: $f(X) = h_1(X_S) \cdot h_2(X_C)$. Then 10.50 gives $h_1(X_S)$ times a constant.
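A quick numeric check of Example 1, showing that the 10.50 estimate recovers $h_1$ up to a constant even when $X_S$ and $X_C$ are correlated ($h_1$ and $h_2$ here are arbitrary illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(0)
xs = rng.normal(size=10_000)
xc = 0.8 * xs + rng.normal(size=10_000)     # correlated complement
h1 = lambda a: a**2
h2 = lambda b: np.sin(b)

grid = np.linspace(-2, 2, 5)
pd_vals = np.array([(h1(v) + h2(xc)).mean() for v in grid])  # eq. 10.50
print(pd_vals - h1(grid))   # roughly constant across the whole grid
```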
-
Conclusion

Find the solution based on numerical optimization
Control the model complexity and avoid overfitting
  Right-sized trees for boosting
  Number of iterations
  Regularization
Understand the final model (interpretation)
  Single variable
  Correlation of variables