Importance Sampling:
An Alternative View of Ensemble Learning
Jerome H. Friedman
Bogdan Popescu
Stanford University
PREDICTIVE LEARNING
Given data: $\{z_i\}_1^N = \{y_i, x_i\}_1^N \sim q(z)$
$y$ = "output" or "response" attribute (variable)
$x = \{x_1, \ldots, x_n\}$ = "inputs" or "predictors"
and a loss function $L(y, F)$:
estimate $F^*(x) = \arg\min_{F(x)} E_{q(z)}\, L(y, F(x))$
WHY?
$F^*(x)$ is the best predictor of $y \mid x$ under $L$.
Examples:
Regression: $y, F \in \mathbb{R}$
$L(y, F) = |y - F|$, $(y - F)^2$
Classification: $y, F \in \{c_1, \ldots, c_K\}$
$L(y, F) = L_{y,F}$ ($K \times K$ matrix)
$F^*(x)$ = "target" function (regression)
= "concept" (classification)
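For concreteness (standard results, not stated on the slide): the pointwise minimizers under the two regression losses are the conditional mean and median:

```latex
\begin{align*}
L(y,F) = (y-F)^2 &\;\Rightarrow\; F^*(x) = E[\,y \mid x\,] \\
L(y,F) = |y-F|  &\;\Rightarrow\; F^*(x) = \operatorname{median}(\,y \mid x\,)
\end{align*}
```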
Estimate: $\hat F(x)$ = learning procedure$(\{z_i\}_1^N)$
(CART, MARS, logistic regression)
Here: procedure = "LEARNING ENSEMBLES":
TreeNet (MART)
Random Forests
BASIC LINEAR MODEL
$F(x) = \int_{\mathcal{P}} a(p)\, f(x; p)\, dp$
$f(x; p)$ = "base" learner (basis function)
parameters: $p = (p_1, p_2, \ldots)$
$p \in \mathcal{P}$ indexes a particular function of $x$ from $\{f(x; p)\}_{p \in \mathcal{P}}$
$a(p)$ = coefficient of $f(x; p)$
Examples:
$f(x; p) = [1 + \exp(-p^t x)]^{-1}$ (neural nets)
= multivariate splines (MARS)
= decision trees (MART, RF)
NUMERICAL QUADRATURE
$\int_{\mathcal{P}} I(p)\, dp \approx \sum_{m=1}^M w_m I(p_m)$
here: $I(p) = a(p)\, f(x; p)$
Quadrature rule defined by:
$\{p_m\}_1^M$ = evaluation points $\in \mathcal{P}$
$\{w_m\}_1^M$ = weights
$F(x) \approx \sum_{m=1}^M w_m a(p_m)\, f(x; p_m) = \sum_{m=1}^M c_m f(x; p_m)$, with $c_m \equiv w_m a(p_m)$
Averaging over $x$:
$\{c_m^*\}_1^M$ = linear regression of $y$ on $\{f(x; p_m)\}_1^M$ (population)
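A minimal numeric sketch of this quadrature view (illustrative only; the 1-D target, sigmoid bases, and all parameter choices are assumptions, not from the talk):

```python
import numpy as np

rng = np.random.default_rng(0)

# Target function to approximate, and a noisy training sample.
def F_star(x):
    return np.sin(3 * x)

x = rng.uniform(-2, 2, size=500)
y = F_star(x) + 0.1 * rng.normal(size=x.size)

# "Simple Monte Carlo": draw M base-learner parameters p = (slope, bias)
# from a flat r(p), giving sigmoid basis functions f(x; p).
M = 50
slopes = rng.uniform(-5, 5, size=M)
biases = rng.uniform(-5, 5, size=M)

def basis(x):
    # Columns are f(x; p_m) = [1 + exp(-(slope_m * x + bias_m))]^{-1}.
    return 1.0 / (1.0 + np.exp(-(x[:, None] * slopes + biases)))

# Quadrature coefficients c_m: linear regression of y on {f(x; p_m)}.
c, *_ = np.linalg.lstsq(basis(x), y, rcond=None)

x_test = np.linspace(-2, 2, 200)
F_hat = basis(x_test) @ c
print("RMS error:", np.sqrt(np.mean((F_hat - F_star(x_test)) ** 2)))
```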
Problem: find a good $\{p_m\}_1^M$.
MONTE CARLO METHODS
$r(p)$ = sampling pdf of $p \in \mathcal{P}$
$\{p_m \sim r(p)\}_1^M$
Simple Monte Carlo: $r(p)$ = constant
Usually not very good
IMPORTANCE SAMPLING
Customize $r(p)$ for each particular problem ($F^*(x)$):
$r(p_m)$ big $\Longrightarrow$ $p_m$ important to high accuracy
when used with $\{p_{m'}\}_{m' \neq m}$
MONTE CARLO METHODS
(1) \Random" Monte Carlo:
ignore other points: pm v r(p) iid
(2) \Quasi" Monte Carlo:
fpmgM1 = deterministic
account for other points
importance ! groups of points
RANDOM MONTE CARLO
(Lack of) importance $J(p)$ depends only on $p$
One measure: "partial importance"
$J(p) = E_{q(z)}\, L(y, f(x; p))$
$p^* = \arg\min_p J(p)$
= best single-point ($M = 1$) rule
$f(x; p^*)$ = optimal single base learner
Usually not very good, especially if
$F^*(x) \notin \{f(x; p)\}_{p \in \mathcal{P}}$
BUT, often used:
single logistic regression or tree
Note: $J(p_m)$ ignores $\{p_{m'}\}_{m' \neq m}$
Hope: better than $r(p)$ = constant.
PARTIAL IMPORTANCE SAMPLING
$r(p) = g(J(p))$
$g(\cdot)$ = monotone decreasing function
$r(p^*) = \max$ $\approx$ center (location)
$p \neq p^* \Longrightarrow r(p) < r(p^*)$
$d(p, p^*) = J(p) - J(p^*)$
Besides location, the critical parameter for importance sampling is the scale (width) of $r(p)$:
$\sigma = \int_{\mathcal{P}} d(p, p^*)\, r(p)\, dp$
Controlled by the choice of $g(\cdot)$:
$\sigma$ too large $\to$ $r(p)$ = constant
$\sigma$ too small $\to$ best single-point rule $p^*$
Questions:
(1) how to choose $g(\cdot)$ ($\sim \sigma$)
(2) how to sample from $r(p) = g(J(p))$
TRICK
Perturbation sampling $\Rightarrow$ repeatedly:
(1) randomly modify (perturb) the problem
(2) find the optimal $f(x; p_m)$ for the perturbed problem:
$p_m = R_m\{\arg\min_p E_{q(z)}\, L(y, f(x; p))\}$, $R_m$ = random perturbation
Control the width $\sigma$ of $r(p)$ by the degree of perturbation.
Perturb: $L(y, F)$, $q(z)$, the algorithm, or a hybrid (see the sketch below).
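A minimal sketch of the data-perturbation variant (assumptions: scikit-learn trees as base learners, subsampling as the perturbation; parameter defaults are illustrative):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)

def perturbation_sample(X, y, M=100, frac=0.2, max_leaf_nodes=8):
    """Draw M base learners by repeatedly perturbing the problem:
    each pass re-weights the data (here a random subsample, i.e.
    w_i in {0, 1/K}) and fits the optimal small tree to it.
    The subsample fraction `frac` controls the width sigma of r(p)."""
    N = X.shape[0]
    K = max(1, int(frac * N))
    learners = []
    for m in range(M):
        idx = rng.choice(N, size=K, replace=False)  # K from N w/o replacement
        tree = DecisionTreeRegressor(max_leaf_nodes=max_leaf_nodes)
        tree.fit(X[idx], y[idx])   # argmin for the perturbed problem
        learners.append(tree)
    return learners
```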
EXAMPLES
Perturb the loss function:
$L_m(y, f) = L(y, f) + \delta \cdot l_m(y, f)$
$l_m(y, f)$ = random function
or $L_m(y, f) = L(y, f + \delta \cdot h_m(x))$
$h_m(x)$ = random function of $x$
$p_m = \arg\min_p E_{q(z)}\, L_m(y, f(x; p))$
Width $\sigma$ of $r(p)$ $\sim$ value of $\delta$
Perturb the data distribution:
Random reweighting:
$q_m(z) = [w_m(z)]^{\delta}\, q(z)$
$w_m(z)$ = random function of $z$
$p_m = \arg\min_p E_{q_m(z)}\, L(y, f(x; p))$
Width $\sigma$ of $r(p)$ $\sim$ value of $\delta$
Perturb the algorithm:
$p_m = \mathrm{rand}[\arg\min_p]\, E_{q(z)}\, L(y, f(x; p))$
control the width $\sigma$ of $r(p)$ by the degree of randomization:
repeated partial optimizations,
perturbed partial solutions
Examples (trees):
Dietterich: randomized trees
Breiman: random forests
GOAL
Produce a good $\{p_m\}_1^M$ so that
$\sum_{m=1}^M c_m^*\, f(x; p_m) \approx F^*(x)$
where
$\{c_m^*\}_1^M$ = population linear regression ($L$)
of $y$ on $\{f(x; p_m)\}_1^M$
Note: both depend on knowing the population $q(z)$.
FINITE DATA
$\{z_i\}_1^N \sim q(z)$
$\hat q(z) = \sum_{i=1}^N \frac{1}{N}\, \delta(z - z_i)$
Apply perturbation sampling based on $\hat q(z)$:
Loss function / algorithm perturbation:
$q(z) \to \hat q(z)$
width $\sigma$ of $r(p)$ controlled as before
Empirical data distribution: random reweighting
$q_m(z) = \sum_{i=1}^N w_{im}\, \delta(z - z_i)$
$w_{im} \sim \Pr(w)$ with $E\, w_{im} = 1/N$
width $\sigma$ of $r(p)$ controlled by $\mathrm{std}(w_{im})$
Fastest computation: $w_{im} \in \{0, 1/K\}$
$\Rightarrow$ draw $K$ from $N$ without replacement
$\sigma \sim \mathrm{std}(w) = (N/K - 1)^{1/2}/N$
computation $\sim K/N$
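A quick simulation check of the width formula (not from the talk; the marginal weight of one fixed observation under K-of-N sampling is 1/K with probability K/N, else 0):

```python
import numpy as np

rng = np.random.default_rng(0)
N, K, reps = 1000, 100, 20000

# Simulate w_{im} for one fixed observation i over many draws m.
w = np.where(rng.random(reps) < K / N, 1.0 / K, 0.0)

print("E[w]   sim %.6f  theory %.6f" % (w.mean(), 1 / N))
print("std(w) sim %.6f  theory %.6f"
      % (w.std(), np.sqrt(N / K - 1) / N))
```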
Quadrature Coe�cients
Population:
linear regression of $y$ on $\{f(x; p_m)\}_1^M$:
$\{c_m^*\}_1^M = \arg\min_{\{c_m\}} E_{q(z)}\, L\big(y, \sum_{m=1}^M c_m f(x; p_m)\big)$
Finite data: regularized linear regression
$\{\hat c_m\}_1^M = \arg\min_{\{c_m\}} E_{\hat q(z)}\, L\big(y, \sum_{m=1}^M c_m f(x; p_m)\big) + \lambda \cdot \sum_{m=1}^M |c_m - c_m^{(0)}|$ (lasso)
Regularization $\Rightarrow$ reduced variance
$\{c_m^{(0)}\}_1^M$ = prior guess (usually $= 0$)
$\lambda > 0$ chosen by cross-validation
Fast algorithm: solutions for all $\lambda$
(see Friedman & Popescu 2004)
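A sketch of the finite-data coefficient step (assumptions: squared-error loss, prior guess $c^{(0)} = 0$; scikit-learn's cross-validated coordinate-descent lasso stands in for the fast path algorithm of Friedman & Popescu 2004):

```python
import numpy as np
from sklearn.linear_model import LassoCV

def lasso_coefficients(learners, X, y):
    """Quadrature coefficients from finite data: lasso regression of y
    on the base-learner predictions, lambda chosen by cross-validation."""
    P = np.column_stack([f.predict(X) for f in learners])
    model = LassoCV(cv=5).fit(P, y)
    return model.coef_, model.intercept_
```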
Importance Sampled Learning Ensembles (ISLE)
Numerical integration:
$F(x) = \int_{\mathcal{P}} a(p)\, f(x; p)\, dp \approx \sum_{m=1}^M c_m f(x; p_m)$
$\{p_m\}_1^M \sim r(p)$: importance sampling
$\sim$ perturbation sampling on $\hat q(z)$
$\{c_m\}_1^M$: regularized linear regression
of $y$ on $\{f(x; p_m)\}_1^M$
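Putting the two steps together, an end-to-end ISLE sketch reusing the hypothetical helpers defined above (`perturbation_sample`, `lasso_coefficients`); the toy data are an assumption:

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy regression data (assumed setup, not from the talk).
N, n = 2000, 10
X = rng.normal(size=(N, n))
y = X[:, 0] ** 2 + np.sin(X[:, 1]) + 0.5 * rng.normal(size=N)

learners = perturbation_sample(X, y, M=200, frac=0.1)  # {p_m} ~ r(p)
c, c0 = lasso_coefficients(learners, X, y)             # {c_m} via lasso

def F_hat(X_new):
    P = np.column_stack([f.predict(X_new) for f in learners])
    return c0 + P @ c

print("nonzero coefficients:", np.sum(c != 0), "of", len(c))
```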
BAGGING (Breiman 1996)
Perturb the data distribution $q(z)$:
$q_m(z)$ = bootstrap sample $= \sum_{i=1}^N w_{im}\, \delta(z - z_i)$
$w_{im} \in \{0, 1/N, 2/N, \ldots, 1\}$ $\sim$ multinomial$(1/N)$
$p_m = \arg\min_p E_{q_m(z)}\, L(y, f(x; p))$
$F(x) = \sum_{m=1}^M \frac{1}{M}\, f(x; p_m)$ (average)
Width $\sigma$ of $r(p)$:
$\mathrm{std}(w_{im}) = (1 - 1/N)^{1/2}/N \approx 1/N$
Fixed $\Rightarrow$ no control
No joint fitting of coefficients:
$\lambda = \infty$ & $c_m^{(0)} = 1/M$
Potential improvements (see the sketch below):
different $\sigma$ (sampling strategy)
$\lambda < \infty$ $\Rightarrow$ jointly fit coefficients to data
RANDOM FORESTS (Breiman 1998)
$f(x; p) = T(x)$ = largest possible decision tree
Hybrid sampling strategy:
(1) $q_m(z)$ = bootstrap sample (bagging)
(2) random algorithm modification:
select the variable for each split from
among a randomly chosen subset of size $n_s$
Breiman: $n_s = \lfloor \log_2 n + 1 \rfloor$
$F(x) = \sum_{m=1}^M \frac{1}{M}\, T_m(x)$ (average)
As an ISLE: $\sigma(\mathrm{RF}) > \sigma(\mathrm{Bag})$ ($\uparrow$ as $n_s \downarrow$)
Potential improvements: same as bagging (see the sketch below)
different $\sigma$ (sampling strategy)
$\lambda < \infty$ $\Rightarrow$ jointly fit coefficients to data
(more later)
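The analogous post-processing for random forests (a sketch; `max_features="sqrt"` is an assumed stand-in for Breiman's $n_s = \lfloor \log_2 n + 1 \rfloor$ subset size):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LassoCV

def lasso_random_forest(X, y, M=200):
    """Random forest as an ISLE: re-weight the forest's trees with a
    lasso instead of averaging them."""
    rf = RandomForestRegressor(n_estimators=M, max_features="sqrt").fit(X, y)
    P = np.column_stack([t.predict(X) for t in rf.estimators_])
    post = LassoCV(cv=5).fit(P, y)
    return rf, post
```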
SEQUENTIAL SAMPLING
Random Monte Carlo: $\{p_m \sim r(p)\}_1^M$ iid
Quasi-Monte Carlo: $\{p_m\}_1^M$ = deterministic
$J(\{p_m\}_1^M) = \min_{\{\alpha_m\}} E_{q(z)}\, L\big(y, \sum_{m=1}^M \alpha_m f(x; p_m)\big)$
= joint regression of $y$ on $\{f(x; p_m)\}_1^M$ (population)
Approximation: sequential sampling (forward stagewise):
$J_m(p \mid \{p_l\}_1^{m-1}) = \min_{\alpha} E_{q(z)}\, L(y, \alpha f(x; p) + h_m(x))$
$h_m(x) = \sum_{l=1}^{m-1} \alpha_l\, f(x; p_l)$, $\alpha_l$ = solution for $p_l$
$p_m = \arg\min_p J_m(p \mid \{p_l\}_1^{m-1})$
Repeatedly modifies the loss function:
similar to $L_m(y, f) = L(y, f + \delta \cdot h_m(x))$,
but here $\delta = 1$ & $h_m(x)$ = deterministic
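A forward-stagewise sketch for the squared-error case (an assumption that simplifies the inner argmin: the optimal base learner at step m is, up to scaling, the tree fit to the current residuals; `nu` previews the MART-style shrinkage discussed next):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def forward_stagewise(X, y, M=100, max_leaf_nodes=8, nu=0.1):
    """Sequential sampling under squared-error loss: fit each tree to
    the residuals y - h_m(x), then line-search its coefficient alpha.
    Set nu=1 for plain (unshrunk) forward stagewise."""
    h = np.zeros(len(y))            # h_m(x) on the training data
    learners, alphas = [], []
    for m in range(M):
        tree = DecisionTreeRegressor(max_leaf_nodes=max_leaf_nodes)
        tree.fit(X, y - h)          # argmin_p of J_m under squared loss
        f = tree.predict(X)
        alpha = nu * (f @ (y - h)) / (f @ f + 1e-12)  # line search, shrunk
        h += alpha * f
        learners.append(tree)
        alphas.append(alpha)
    return learners, np.array(alphas)
```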
Connection to Boosting
AdaBoost (Freund & Schapire 1996):
$L(y, f) = \exp(-y \cdot f)$, $y \in \{-1, 1\}$
$F(x) = \mathrm{sign}\big(\sum_{m=1}^M \alpha_m f(x; p_m)\big)$
$\{\alpha_m\}_1^M$ = sequential partial regression coefficients
Gradient Boosting (MART; Friedman 2001):
general $y$ & $L(y, f)$, $\alpha_m$ = shrunk ($\nu \ll 1$)
$F(x) = \sum_{m=1}^M \alpha_m f(x; p_m)$
Potential improvements (ISLE; see the sketch below):
(1) $F(x) = \sum_{m=1}^M c_m f(x; p_m)$
$\{p_m\}_1^M \sim$ sequential sampling on $\hat q(z)$
$\{c_m\}_1^M \sim$ regularized linear regression
(2) and/or hybrid with random $q_m(z)$ (speed)
(sample $K$ from $N$ without replacement)
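A sketch of both improvements at once (assumptions: scikit-learn gradient boosting, whose `subsample` option plays the role of the random $q_m(z)$ hybrid; all parameter values illustrative):

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.linear_model import LassoCV

def isle_boosting(X, y, M=200, frac=0.1):
    """Hybrid sequential/random ISLE: gradient boosting with an
    aggressive per-stage subsample (speed), then lasso re-weighting
    of the stage trees."""
    gbm = GradientBoostingRegressor(n_estimators=M, subsample=frac,
                                    learning_rate=0.1).fit(X, y)
    trees = [stage[0] for stage in gbm.estimators_]  # one tree per stage
    P = np.column_stack([t.predict(X) for t in trees])
    post = LassoCV(cv=5).fit(P, y)
    return gbm, post
```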
ISLE Paradigm
Wide variety of ISLE methods:
(1) base learner $f(x; p)$; (2) loss criterion $L(y, f)$
(3) perturbation method
(4) degree of perturbation: $\sigma$ of $r(p)$
(5) iid vs. sequential
(6) hybrids
Examine several options.
Monte Carlo Study
100 data sets: each $N = 10000$, $n = 40$
$\{y_{il} = F_l(x_i) + \varepsilon_{il}\}_{i=1}^{10000}$, $l = 1, \ldots, 100$
$\{F_l(x)\}_1^{100}$ = different (random) target functions
$x_i \sim N(0, I_{40})$, $\varepsilon_{il} \sim N(0, \mathrm{Var}_x(F_l(x)))$
$\Rightarrow$ signal/noise = 1/1
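A sketch of this data-generating design (the example target function is hypothetical; the talk's random-function generator is not reproduced here):

```python
import numpy as np

rng = np.random.default_rng(0)

def make_dataset(F, N=10000, n=40):
    """One simulated data set: Gaussian inputs, and noise variance
    matched to Var_x(F(x)) so that signal/noise = 1/1."""
    X = rng.normal(size=(N, n))
    signal = F(X)
    eps = rng.normal(scale=np.sqrt(signal.var()), size=N)
    return X, signal + eps

# Hypothetical target using 10 of the 40 inputs; the remaining 30
# act as pure "noise" variables.
X, y = make_dataset(lambda X: np.sin(X[:, :5]).sum(1) + X[:, 5:10].prod(1))
```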
Evaluation Criteria
Relative RMS error:
$\mathrm{rmse}(\hat F_{jl}) = [1 - R^2(F_l, \hat F_{jl})]^{1/2}$
Comparative RMS error:
$\mathrm{cmse} = \mathrm{rmse}(\hat F_{jl}) / \min_k \{\mathrm{rmse}(\hat F_{kl})\}$
(adjusts for problem difficulty)
$j, k \in$ {respective methods}
10000 independent test observations
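The two criteria in code (a sketch; the residual-based reading of $R^2$ is an assumption):

```python
import numpy as np

def rmse(F_true, F_hat):
    """Relative RMS error: sqrt(1 - R^2) between the target F_l and the
    estimate, computed on an independent evaluation sample."""
    r2 = 1.0 - (F_true - F_hat).var() / F_true.var()
    return np.sqrt(1.0 - r2)

def cmse(rmse_by_method):
    """Comparative RMS error: each method's rmse divided by the best
    method's rmse on the same problem (adjusts for difficulty)."""
    best = min(rmse_by_method.values())
    return {name: v / best for name, v in rmse_by_method.items()}
```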
Properties of $F_l(x)$
(1) 30 "noise" variables
(2) wide variety of functions (difficulty)
(3) emphasize lower-order interactions
(4) not in the span of the base learners
Decision Trees
CLASSIFICATION
[Figure: classcomp.ps]
CENSUS DATA (http://www.ips.umn.edu/usa)
$N = 46937$ (train/test = 36000/10937, 5 replications)
$y$ = individual personal income
$x_1, \ldots, x_{70}$ = demographic variables (many missing)
categorical: occupation, industry, etc.
numeric: education grade level, family size, etc.
[Figures: censusdat.ps, censusrf.ps, censusmart.ps]
SPAM DATA (http://www.data-mining-cup.com)
$N = 19177$ emails (train/test = 15177/4000, 5 replications)
$y$ = spam/not spam; $x_1, \ldots, x_{833}$ = binary features:
presence/absence of:
selected text strings
characteristics of the header
URL features
[Figures: spam.ps, spamlite.ps]
SUMMARY
Theory: unify
(1) bagging, (2) random forests,
(3) Bayesian model averaging,
(4) boosting
under a single paradigm $\sim$ Monte Carlo integration:
(1)-(3): iid Monte Carlo, $p \sim r(p)$
(1), (2): perturbation sampling; (3): MCMC
(4): quasi-Monte Carlo (approximate sequential sampling)
Practice:
$\{c_m\}_1^M$ via lasso linear regression
(1) improves the accuracy of RF & bagging (and is faster)
(2) combined with aggressive subsampling & weaker base learners,
improves speed (bagging & RF by $> 10^2$, MART by $\sim 5$),
allowing much bigger data sets.
Also, prediction is many times faster.
FUTURE DIRECTIONS
(1) More thorough understanding (theory)
$\to$ specific recommendations
(2) Multiple learning ensembles (MISLEs):
$F(x) = \sum_{k=1}^K \int_{\mathcal{P}_k} a_k(p_k)\, f_k(x; p_k)\, dp_k$
$\{f_k(x; p_k)\}_1^K$ = different (comp.) base learners
$\tilde F(x) = \sum c_{km}\, f_k(x; p_{km})$
$\{c_{km}\}$ = combined lasso regression
Example: $f_1$ = decision trees
$f_2 = \{x_j\}_1^n$ (no sampling)
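A MISLE sketch for that example (hypothetical helper; `learners` would come from any of the tree-sampling schemes above, and the raw inputs $x_j$ enter as additional un-sampled base learners):

```python
import numpy as np
from sklearn.linear_model import LassoCV

def misle(X, y, learners):
    """Combine two base-learner families in one lasso regression:
    f1 = decision trees (tree-prediction columns) and
    f2 = the raw inputs x_j (no sampling)."""
    P_trees = np.column_stack([t.predict(X) for t in learners])
    P = np.hstack([P_trees, X])   # tree columns, then x_j columns
    return LassoCV(cv=5).fit(P, y)
```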
SLIDES
http://www-stat.stanford.edu/~jhf/talks/isletalk.pdf
REFERENCES
Friedman & Popescu 2003:
http://www-stat.stanford.edu/~jhf/ftp/isle.pdf
Friedman & Popescu 2004:
http://www-stat.stanford.edu/~jhf/ftp/path.pdf