Importance Sampling: An Alternative View of Ensemble Learning

Jerome H. Friedman and Bogdan Popescu
Stanford University
PREDICTIVE LEARNING

Given data: $\{z_i\}_1^N = \{y_i, x_i\}_1^N \sim q(z)$

$y$ = "output" or "response" attribute (variable)
$x = \{x_1, \cdots, x_n\}$ = "inputs" or "predictors"

and a loss function $L(y, F)$: estimate
$$F^*(x) = \arg\min_{F(x)} E_{q(z)}\, L(y, F(x))$$
WHY?

$F^*(x)$ is the best predictor of $y \,|\, x$ under $L$.

Examples:
Regression: $y, F \in \mathbb{R}$; $L(y, F) = |y - F|$ or $(y - F)^2$
Classification: $y, F \in \{c_1, \cdots, c_K\}$; $L(y, F) = L_{y,F}$ (a $K \times K$ matrix)
$F^*(x)$ = "target" function (regression) or "concept" (classification)

Estimate: $F(x) \leftarrow$ learning procedure$(\{z_i\}_1^N)$

Here: procedure = LEARNING ENSEMBLES
BASIC LINEAR MODEL

$$F(x) = \int_P a(p)\, f(x; p)\, dp$$

$f(x; p)$ = "base" learner (basis function)
parameters: $p = (p_1, p_2, \cdots)$
$p \in P$ indexes a particular function of $x$ from $\{f(x; p)\}_{p \in P}$
$a(p)$ = coefficient of $f(x; p)$
Examples:
$f(x; p) = [1 + \exp(-p^t x)]^{-1}$ (neural nets)
$f(x; p)$ = multivariate splines (MARS)
$f(x; p)$ = decision trees (MART, RF)
NUMERICAL QUADRATURE

$$\int_P I(p)\, dp \approx \sum_{m=1}^M w_m\, I(p_m)$$

here: $I(p) = a(p)\, f(x; p)$

Quadrature rule defined by:
$\{p_m\}_1^M$ = evaluation points $\in P$
$\{w_m\}_1^M$ = weights
$$F(x) \approx \sum_{m=1}^M w_m\, a(p_m)\, f(x; p_m) \approx \sum_{m=1}^M c_m\, f(x; p_m)$$

Averaging over $x$: $\{c_m^*\}_1^M$ = linear regression of $y$ on $\{f(x; p_m)\}_1^M$ (population)

Problem: find a good $\{p_m\}_1^M$.
MONTE CARLO METHODS

$r(p)$ = sampling pdf of $p \in P$
$\{p_m \sim r(p)\}_1^M$

Simple Monte Carlo: $r(p)$ = constant. Usually not very good.
IMPORTANCE SAMPLING

Customize $r(p)$ for each particular problem ($F^*(x)$):
$r(p_m)$ large $\Longrightarrow$ $p_m$ is important for high accuracy when used with $\{p_{m'}\}_{m' \neq m}$
MONTE CARLO METHODS

(1) "Random" Monte Carlo: ignore the other points; $p_m \sim r(p)$ iid
(2) "Quasi" Monte Carlo: $\{p_m\}_1^M$ deterministic; account for the other points, so importance attaches to groups of points
RANDOM MONTE CARLO

(Lack of) importance $J(p)$ depends only on $p$.

One measure: "partial importance"
$$J(p) = E_{q(z)}\, L(y, f(x; p))$$
$$p^* = \arg\min_p J(p) = \text{best single-point } (M = 1) \text{ rule}$$
$f(x; p^*)$ = optimal single base learner
Usually not very good, especially if $F^*(x) \notin \{f(x; p)\}_{p \in P}$.

BUT often used: a single logistic regression or tree.

Note: $J(p_m)$ ignores $\{p_{m'}\}_{m' \neq m}$.
Hope: better than $r(p)$ = constant.
PARTIAL IMPORTANCE SAMPLING

$$r(p) = g(J(p))$$

$g(\cdot)$ = monotone decreasing function
$r(p^*) = \max$ $\approx$ center (location)
$p \neq p^* \Longrightarrow r(p) < r(p^*)$
$d(p, p^*) = J(p) - J(p^*)$
Besides location, the critical parameter for importance sampling is the scale (width) of $r(p)$:
$$\sigma = \int_P d(p, p^*)\, r(p)\, dp$$
Controlled by the choice of $g(\cdot)$:
$\sigma$ too large $\to$ $r(p)$ = constant
$\sigma$ too small $\to$ best single-point rule $p^*$
[Figure: sampling densities $r(p)$ of differing widths $\sigma$, plotted over $[-1, 1]$]
Questions:
(1) how to choose $g(\cdot)$, i.e. the width $\sigma$
(2) how to sample from $r(p) = g(J(p))$
TRICK

Perturbation sampling $\Rightarrow$ repeatedly:
(1) randomly modify (perturb) the problem
(2) find the optimal $f(x; p_m)$ for the perturbed problem:
$$p_m = R_m \left\{ \arg\min_p E_{q(z)}\, L(y, f(x; p)) \right\}$$
(here $R_m$ denotes the $m$-th random perturbation of the problem)

Control the width $\sigma$ of $r(p)$ by the degree of perturbation.
Perturb: $L(y, F)$, $q(z)$, the algorithm, or a hybrid.
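A minimal sketch of this loop in Python, here perturbing the data distribution $q(z)$ by random subsampling and using a shallow regression tree as the base learner $f(x; p)$; the function name and parameter choices are illustrative, not from the slides:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def perturbation_sample(X, y, M=100, frac=0.5, seed=0):
    """Draw M base learners f(x; p_m) by repeatedly perturbing the
    problem (here: random subsampling of the data) and solving each
    perturbed problem with a greedy tree fit."""
    rng = np.random.default_rng(seed)
    N = len(y)
    ensemble = []
    for m in range(M):
        # (1) perturb the problem: subsample the empirical q(z)
        idx = rng.choice(N, size=int(frac * N), replace=False)
        # (2) optimal base learner for the perturbed problem
        tree = DecisionTreeRegressor(max_depth=3).fit(X[idx], y[idx])
        ensemble.append(tree)
    # `frac` controls the degree of perturbation, i.e. the width sigma of r(p)
    return ensemble
```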
EXAMPLES

Perturb the loss function:
$$L_m(y, f) = L(y, f) + \eta \cdot l_m(y, f), \quad l_m(y, f) = \text{random function}$$
or
$$L_m(y, f) = L(y, f + \eta \cdot h_m(x)), \quad h_m(x) = \text{random function of } x$$
$$p_m = \arg\min_p E_{q(z)}\, L_m(y, f(x; p))$$
Width $\sigma$ of $r(p)$ $\sim$ value of $\eta$
Perturb the sampling distribution:
$$q_m(z) = [w_m(z)]^{\eta}\, q(z), \quad w_m(z) = \text{random function of } z$$
$$p_m = \arg\min_p E_{q_m(z)}\, L(y, f(x; p))$$
Width $\sigma$ of $r(p)$ $\sim$ value of $\eta$
Perturb the algorithm:
$$p_m = \text{rand}[\arg\min_p]\, E_{q(z)}\, L(y, f(x; p))$$
Control the width $\sigma$ of $r(p)$ by the degree of randomization:
repeated partial optimizations, perturbed partial solutions.

Examples:
Dietterich: randomized trees
Breiman: random forests
GOAL

Produce a good $\{p_m\}_1^M$ so that
$$\sum_{m=1}^M c_m^*\, f(x; p_m) \approx F^*(x)$$
where $\{c_m^*\}_1^M$ = population linear regression (under $L$) of $y$ on $\{f(x; p_m)\}_1^M$.

Note: both depend on knowing the population $q(z)$.
FINITE DATA

$\{z_i\}_1^N \sim q(z)$
$$\hat{q}(z) = \frac{1}{N} \sum_{i=1}^N \delta(z - z_i)$$

Apply perturbation sampling based on $\hat{q}(z)$.
Loss function / algorithm: $q(z) \to \hat{q}(z)$; width $\sigma$ of $r(p)$ controlled as before.
Sampling distribution: random reweighting
$$q_m(z) = \sum_{i=1}^N w_{im}\, \delta(z - z_i)$$
$w_{im} \sim \Pr(w)$ with $E\, w_{im} = 1/N$
Width $\sigma$ of $r(p)$ controlled by $\text{std}(w_{im})$.

Fastest computation: $w_{im} \in \{0, 1/K\}$
$\Rightarrow$ draw $K$ of the $N$ observations without replacement:
$$\sigma \sim \text{std}(w) = (N/K - 1)^{1/2}/N, \qquad \text{computation} \sim K/N$$
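To see where the width formula comes from: each weight equals $1/K$ with probability $K/N$ and $0$ otherwise, so
$$E\,w = \frac{1}{K}\cdot\frac{K}{N} = \frac{1}{N}, \qquad \mathrm{Var}(w) = \frac{1}{K^2}\cdot\frac{K}{N} - \frac{1}{N^2} = \frac{N/K - 1}{N^2},$$
giving $\text{std}(w) = (N/K - 1)^{1/2}/N$. Smaller $K$ therefore widens $r(p)$ (more perturbation) while also cutting the computation per base learner.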
Quadrature Coefficients

Population: linear regression of $y$ on $\{f(x; p_m)\}_1^M$:
$$\{c_m^*\}_1^M = \arg\min_{\{c_m\}} E_{q(z)}\, L\!\left(y, \sum_{m=1}^M c_m\, f(x; p_m)\right)$$
Finite data: regularized linear regression
$$\{\hat{c}_m\}_1^M = \arg\min_{\{c_m\}} E_{\hat{q}(z)}\, L\!\left(y, \sum_{m=1}^M c_m\, f(x; p_m)\right) + \lambda \cdot \sum_{m=1}^M |c_m - c_m^{(0)}| \quad \text{(lasso)}$$

Regularization $\Rightarrow$ reduced variance
$\{c_m^{(0)}\}_1^M$ = prior guess (usually $= 0$)
$\lambda > 0$ chosen by cross-validation
Fast algorithm: solutions for all $\lambda$ (see HTF 2001, EHT 2002)
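A minimal sketch of this coefficient-fitting step under squared-error loss, assuming a list `ensemble` of fitted base learners (e.g. from the `perturbation_sample` sketch above); `LassoCV` selects $\lambda$ by cross-validation:

```python
import numpy as np
from sklearn.linear_model import LassoCV

def fit_quadrature_coefficients(ensemble, X, y, cv=5):
    """Lasso regression of y on the base-learner predictions f(x; p_m);
    the penalty lambda is chosen by cross-validation (prior guess c0 = 0)."""
    # Column m holds the m-th base learner's predictions on the training data
    F = np.column_stack([f.predict(X) for f in ensemble])
    lasso = LassoCV(cv=cv).fit(F, y)
    return lasso.coef_, lasso.intercept_

def predict_isle(ensemble, coefs, intercept, X):
    """Evaluate the post-processed ensemble sum_m c_m f(x; p_m)."""
    F = np.column_stack([f.predict(X) for f in ensemble])
    return intercept + F @ coefs
```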
Importance Sampled Learning Ensembles (ISLE)

Numerical integration:
$$F(x) = \int_P a(p)\, f(x; p)\, dp \approx \sum_{m=1}^M c_m\, f(x; p_m)$$

$\{p_m\}_1^M \sim r(p)$: importance sampling $\sim$ perturbation sampling on $\hat{q}(z)$
$\{\hat{c}_m\}_1^M \leftarrow$ regularized linear regression of $y$ on $\{f(x; p_m)\}_1^M$
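Putting the two stages together, a toy end-to-end run (reusing the hypothetical `perturbation_sample`, `fit_quadrature_coefficients`, and `predict_isle` helpers sketched above; the data and settings are illustrative):

```python
import numpy as np

# Toy regression data: 1000 observations, 10 inputs, nonlinear target plus noise
rng = np.random.default_rng(1)
X = rng.normal(size=(1000, 10))
y = X[:, 0] ** 2 + X[:, 1] + 0.5 * rng.normal(size=1000)

ensemble = perturbation_sample(X, y, M=200, frac=0.2)           # {p_m} ~ r(p)
coefs, intercept = fit_quadrature_coefficients(ensemble, X, y)  # {c_m} via lasso
y_hat = predict_isle(ensemble, coefs, intercept, X)
```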
BAGGING (Breiman 1996)

Perturb the data distribution $q(z)$:
$$q_m(z) = \text{bootstrap sample} = \sum_{i=1}^N w_{im}\, \delta(z - z_i)$$
$w_{im} \in \{0, 1/N, 2/N, \cdots, 1\} \sim \text{multinomial}(1/N)$
$$p_m = \arg\min_p E_{q_m(z)}\, L(y, f(x; p))$$
$$F(x) = \sum_{m=1}^M \frac{1}{M}\, f(x; p_m) \quad \text{(average)}$$
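In code, bagging is the same perturbation loop with bootstrap weights and fixed coefficients $c_m = 1/M$; a minimal sketch with regression trees (the tree settings are illustrative):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def bagging(X, y, M=100, seed=0):
    rng = np.random.default_rng(seed)
    N = len(y)
    trees = []
    for _ in range(M):
        # Bootstrap: N draws with replacement == multinomial(1/N) weights
        idx = rng.choice(N, size=N, replace=True)
        trees.append(DecisionTreeRegressor().fit(X[idx], y[idx]))
    # Simple average: c_m = 1/M, no joint fitting of the coefficients
    return lambda X_new: np.mean([t.predict(X_new) for t in trees], axis=0)
```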
Width $\sigma$ of $r(p)$:
$$\text{std}(w_{im}) = (1 - 1/N)^{1/2}/N \approx 1/N$$
Fixed $\Rightarrow$ no control.
No joint fitting of coefficients: $\lambda = \infty$ and $c_m^{(0)} = 1/M$.

Potential improvements:
different $\sigma$ (sampling strategy)
$\lambda < \infty \Rightarrow$ jointly fit the coefficients to the data
RANDOM FORESTS (Breiman 1998)

$f(x; p) = T(x)$ = largest possible decision tree

Hybrid sampling strategy:
(1) $q_m(z)$ = bootstrap sample (bagging)
(2) random algorithm modification: select the variable for each split from among a randomly chosen subset of size $n_s$
Breiman: $n_s = \lfloor \log_2 n + 1 \rfloor$
$$F(x) = \sum_{m=1}^M \frac{1}{M}\, T_m(x) \quad \text{(average)}$$

As an ISLE: $\sigma(\text{RF}) > \sigma(\text{Bag})$, increasing as $n_s$ decreases.

Potential improvements (same as bagging):
different $\sigma$ (sampling strategy)
$\lambda < \infty \Rightarrow$ jointly fit the coefficients to the data (more later; see the sketch below)
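One way to realize the second improvement with off-the-shelf tools: treat a fitted forest's trees as the base learners and post-process them with the lasso step sketched earlier (scikit-learn usage; the hyperparameters are illustrative):

```python
from sklearn.ensemble import RandomForestRegressor

# Fit a standard random forest, then jointly re-fit the tree coefficients
rf = RandomForestRegressor(n_estimators=200, max_features="sqrt").fit(X, y)
coefs, intercept = fit_quadrature_coefficients(rf.estimators_, X, y)
y_hat = predict_isle(rf.estimators_, coefs, intercept, X)
```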
SEQUENTIAL SAMPLING

Random Monte Carlo: $\{p_m \sim r(p)\}_1^M$ iid
Quasi-Monte Carlo: $\{p_m\}_1^M$ deterministic

$$J(\{p_m\}_1^M) = \min_{\{\alpha_m\}} E_{q(z)}\, L\!\left(y, \sum_{m=1}^M \alpha_m\, f(x; p_m)\right)$$

Joint regression of $y$ on $\{f(x; p_m)\}_1^M$ (population)
Approximation: sequential sampling (forward stagewise)

$$J_m(p \,|\, \{p_l\}_1^{m-1}) = \min_{\alpha} E_{q(z)}\, L(y, \alpha\, f(x; p) + h_m(x))$$
$$h_m(x) = \sum_{l=1}^{m-1} \alpha_l\, f(x; p_l), \quad \alpha_l = \text{solution for } p_l$$
$$p_m = \arg\min_p J_m(p \,|\, \{p_l\}_1^{m-1})$$

Repeatedly modifies the loss function:
similar to $L_m(y, f) = L(y, f + \eta \cdot h_m(x))$, but here $\eta = 1$ and $h_m(x)$ is deterministic.
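A minimal sketch of this sequential loop under squared-error loss, where each $\arg\min_p$ is approximated by a greedy tree fit to the current residuals (for squared error, the optimal leaf values absorb $\alpha$); the `eta` parameter anticipates the shrunk MART variant on the next slide:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def sequential_sample(X, y, M=100, eta=1.0, depth=3):
    """Forward-stagewise sequential sampling under squared-error loss:
    each new p_m minimizes the loss given the fixed partial sum h_m(x)."""
    h = np.zeros(len(y))            # h_1(x) = 0
    ensemble = []
    for m in range(M):
        # argmin_p E L(y, f(x; p) + h_m(x)) -> fit the current residuals
        tree = DecisionTreeRegressor(max_depth=depth).fit(X, y - h)
        h += eta * tree.predict(X)  # eta << 1 gives MART-style shrinkage
        ensemble.append(tree)
    return ensemble
```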
Connection to Boosting

AdaBoost (Freund & Schapire 1996):
$L(y, f) = \exp(-y \cdot f)$, $y \in \{-1, 1\}$
$$F(x) = \text{sign}\left(\sum_{m=1}^M \alpha_m\, f(x; p_m)\right)$$
$\{\alpha_m\}_1^M$ = sequential partial regression coefficients

Gradient boosting (MART, Friedman 2001):
general $y$ and $L(y, f)$; $\alpha_m$ = shrunk ($\eta \ll 1$)
$$F(x) = \sum_{m=1}^M \alpha_m\, f(x; p_m)$$
Potential improvements (ISLE):
(1) $F(x) = \sum_{m=1}^M c_m\, f(x; p_m)$ with
$\{p_m\}_1^M \sim$ sequential sampling on $\hat{q}(z)$
$\{c_m\}_1^M \sim$ regularized linear regression
(2) and/or a hybrid with random $q_m(z)$ for speed
(sample $K$ of $N$ without replacement; see the sketch below)
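A sketch of the hybrid: the sequential loop above with each base learner fit on a fresh $K$-of-$N$ subsample, followed by the lasso coefficient step (again reusing the hypothetical helpers from earlier sketches):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def hybrid_sequential_sample(X, y, M=100, eta=0.1, frac=0.2, depth=3, seed=0):
    rng = np.random.default_rng(seed)
    N = len(y)
    h = np.zeros(N)
    ensemble = []
    for m in range(M):
        # K-of-N subsample without replacement: speed plus extra width sigma
        idx = rng.choice(N, size=int(frac * N), replace=False)
        tree = DecisionTreeRegressor(max_depth=depth).fit(X[idx], (y - h)[idx])
        h += eta * tree.predict(X)
        ensemble.append(tree)
    return ensemble

# Post-process with the lasso to obtain the coefficients {c_m}
ensemble = hybrid_sequential_sample(X, y)
coefs, intercept = fit_quadrature_coefficients(ensemble, X, y)
```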
ISLE Paradigm

A wide variety of ISLE methods, varying:
(1) base learner $f(x; p)$
(2) loss criterion $L(y, f)$
(3) perturbation method
(4) degree of perturbation: $\sigma$ of $r(p)$
(5) iid vs. sequential sampling
(6) hybrids

Examine several options.
Monte Carlo Study

100 data sets, each with $N = 10000$, $n = 40$:
$$\{y_{il} = F_l(x_i) + \varepsilon_{il}\}_{i=1}^{10000}, \quad l = 1, \cdots, 100$$
$\{F_l(x)\}_1^{100}$ = different (random) target functions
$x_i \sim N(0, I_{40})$, $\varepsilon_{il} \sim N(0, \text{Var}_x(F_l(x)))$
$\Rightarrow$ signal/noise $= 1/1$
Evaluation Criteria

Relative RMS error:
$$\text{rmse}(\hat{F}_{jl}) = [1 - R^2(F_l, \hat{F}_{jl})]^{1/2}$$

Comparative RMS error (adjusts for problem difficulty):
$$\text{cmse} = \text{rmse}(\hat{F}_{jl}) / \min_k \{\text{rmse}(\hat{F}_{kl})\}$$
$j, k \in$ {the respective methods}; evaluated on 10000 independent observations.
Properties of $F_l(x)$

(1) 30 "noise" variables
(2) wide variety of functions (difficulty)
(3) emphasize lower-order interactions
(4) not in the span of the base learners

Decision Trees
[Figure: Bagging Relative RMS Error; methods: Bag, Bag_P, Bag_.05, Bag_.05_P, Bag_6, Bag_6_P, Bag_6_.05_P]
[Figure: Bagging Comparative RMS Error; methods: Bag, Bag_P, Bag_.05, Bag_.05_P, Bag_6, Bag_6_P, Bag_6_.05_P]
[Figure: Random Forests Comparative RMS Error; methods: RF, RF_P, RF_.05, RF_.05_P, RF_6, RF_6_P, RF_6_.05_P]
[Figure: Bag/RF Comparative RMS Error; methods: Bag, RF, Bag_6_05_P, RF_6_05_P]
[Figure: Sequential Sampling Comparative RMS Error; methods: Mart, Mart_P, Mart_10_P, Mart_.01_10_P, Mart_.01_20_P]
[Figure: Seq/Bag/RF Comparative RMS Error; methods: Mart, Mart_P, Seq_.01_.2, Bag, Bag_6_.05_p, RF, RF_6_.05_p]
SUMMARY

Theory: a single paradigm, Monte Carlo integration, unifies
(1) bagging, (2) random forests, (3) Bayesian model averaging, (4) boosting.

(1)-(3): iid Monte Carlo, $p \sim r(p)$
(1), (2): perturbation sampling; (3): MCMC
(4): quasi-Monte Carlo, i.e. approximate sequential sampling
Practice:
$\{\hat{c}_m\}_1^M \leftarrow$ lasso linear regression
(1) improves the accuracy of RF and bagging (and is faster)
(2) combined with aggressive subsampling and weaker base learners, improves speed
(bagging & RF by a factor $> 10^2$, MART by roughly 5),
allowing much bigger data sets. Prediction is also many times faster.
FUTURE DIRECTIONS

(1) More thorough understanding (theory) → specific recommendations
(2) Multiple learning ensembles (MISLEs):
$$F(x) = \sum_{k=1}^K \int_{P_k} a_k(p_k)\, f_k(x; p_k)\, dp_k$$
$\{f_k(x; p_k)\}_1^K$ = different (complementary) base learners
$$\hat{F}(x) = \sum_{k,m} c_{km}\, f_k(x; p_{km})$$
$\{c_{km}\} \leftarrow$ combined lasso regression

Example: $f_1$ = decision trees, $f_2 = \{x_j\}_1^n$ (no sampling)
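For this example, a sketch under squared-error loss: stack the tree predictions and the raw inputs $\{x_j\}$ as columns of one design matrix and run a single lasso over both families (reusing the hypothetical `perturbation_sample` helper above):

```python
import numpy as np
from sklearn.linear_model import LassoCV

# f_1: tree ensemble from perturbation sampling; f_2: the raw inputs x_j
trees = perturbation_sample(X, y, M=200, frac=0.2)
F1 = np.column_stack([t.predict(X) for t in trees])
design = np.hstack([F1, X])            # combined basis for both families

misle = LassoCV(cv=5).fit(design, y)   # one joint lasso over {c_km}
```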
SLIDES
http://www-stat.stanford.edu/~jhf/isletalk.pdf

REFERENCES

HTF (2001): Hastie, Tibshirani, and Friedman, The Elements of Statistical Learning. Springer.
EHT (2002): Efron, Hastie, and Tibshirani, "Least Angle Regression". Stanford preprint.