Bootstrap Regularization by Stefan Wager, Stanford


Transcript of Bootstrap Regularization by Stefan Wager, Stanford

Page 1: Bootstrap Regularization by Stefan Wager, Stanford

Bootstrap Regularization

Stefan Wager, Stanford University

http://web.stanford.edu/~swager

H2O, 28 April 2015

Page 2

We interviewed some college students to find out who their favorite musician/band was.

                   Stanford  Berkeley  MIT  USC  ...
Kanye West               12        13   11   15
Ke$ha                     7         2    3   36
Led Zeppelin             18        10   12    3
MGMT                      4         9   12    3
Taylor Swift             24         8   16   28
The Grateful Dead         9        37    4    8
...

Page 3

Would You Trust This Clustering?

Correspondence analysis on the “September data.”

[Figure: scatter plot (x vs. y, both axes from −2 to 2) placing Kanye West, Kesha, MGMT, Led Zeppelin, Taylor Swift, and The Grateful Dead.]

Correspondence analysis on the “October data.”

[Figure: the analogous scatter plot for the October data.]

Page 4

“This movie, while containing decent acting performances for the most part, was difficult and just not very enjoyable to watch. It felt as though it was in love with itself and its whole depiction of 70s east coast disco Guido “culture” when in fact it was just an empty, pointless piece of bombast not half as clever as it thought it was. It felt fake, contrived and half-baked from the first scene to the last, and as others have noted you just didn’t care one bit about anyone on screen.” – svicious22 on IMDB

Page 5

Would You Trust This Classifier?

“This movie, while containing decent acting performances for the most part, was difficult and just not very enjoyable to watch. It felt as though it was in love with itself”

- Positive Review

“was difficult and just not very enjoyable to watch. It felt fake, contrived and half-baked from the first scene, pointless piece of bombast not half as clever as it thought it was.”

- Positive Review

“was difficult and just not very enjoyable to watch. It felt as though it was in love with itself and its whole depiction of 70s east coast disco Guido “culture” when in fact it was just an empty, pointless piece of bombast”

- Negative Review

“in fact it was just an empty, pointless piece of bombast not half as clever as it thought it was. It felt fake, contrived”

- Negative Review

Page 6

Bootstrap Regularization, or

How to not get embarrassed when someone replicates your method on slightly different data.

Page 7

The bootstrap is often used for

- Variance estimation/confidence intervals [Efron, 1979; Efron and Tibshirani, 1994; Davison and Hinkley, 1997; Politis and Romano, 1994; etc.]

- Bagging/model smoothing [Breiman, 1996; Buhlmann and Yu, 2002; Efron, 2014; etc.]

- Extrapolation/bias correction [Cook and Stefanski, 1994; Hall, 1992; Politis et al., 1999; etc.]

We use the bootstrap for regularization.

- This adds to a literature on regularizing via pseudo-examples [Abu-Mostafa, 1990; Bishop, 1995; Burges and Scholkopf, 1997; Simard et al., 2000; van der Maaten et al., 2013; etc.].

Page 8

Document Classification

Motivation: document classification with unigram features (i.e., single-word features).

- We get a dataset of (sentence, sentiment) pairs:
  - (“I loved that movie!,” positive sentiment)
  - (“Never seeing that again...,” negative sentiment)

- Our goal is to run logistic regression on a dataset of (x, y) pairs.

- For us, x_ij counts the number of times the j-th word in the dictionary appears in the i-th document.

Example: “After watching the first 25 minutes of this movie I realized that I had wasted 30 minutes of my life.”

Dictionary:  and  the  movie  actor  minutes  ...
x_i:           0    1      1      0        2  ...
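As a concrete illustration, the unigram count features can be computed as below (a minimal sketch with a hypothetical five-word dictionary; the real dictionaries here have tens of thousands of entries):

```python
from collections import Counter

def unigram_counts(doc, dictionary):
    """Count how often each dictionary word appears in the document."""
    counts = Counter(doc.lower().split())
    return [counts[w] for w in dictionary]  # missing words count as 0

# Hypothetical mini-dictionary for illustration only.
dictionary = ["and", "the", "movie", "actor", "minutes"]
doc = ("After watching the first 25 minutes of this movie "
       "I realized that I had wasted 30 minutes of my life")
x_i = unigram_counts(doc, dictionary)  # -> [0, 1, 1, 0, 2]
```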

Page 9

Bootstrap Regularization

The idea of bootstrap regularization is to perturb the data by deleting random words before training. For example, “After watching the first 25 minutes of this movie I realized that I had wasted 30 minutes of my life” is turned into

- “watching of this movie I realized my life”

- “after watching first minutes of wasted life”

- “the first movie that wasted minutes of my”

For the features x, this amounts to binomial thinning; this idea is closely related to dropout (Hinton et al., 2012).

Dictionary:          and  the  movie  actor  minutes  ...
x_i:                   0    1      1      0        2  ...
x̃_i (three draws):     0    0      1      0        1  ...
                       0    1      1      0        0  ...
                       0    1      0      0        2  ...
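Binomial thinning of a count vector can be sketched as follows (NumPy, with an assumed survival probability of 0.5 and three draws, as in the table above):

```python
import numpy as np

rng = np.random.default_rng(0)

# Counts for the dictionary words: and, the, movie, actor, minutes
x_i = np.array([0, 1, 1, 0, 2])

# Binomial thinning: each word occurrence survives independently
# with probability 0.5; draw three perturbed copies of x_i.
x_tilde = rng.binomial(x_i, 0.5, size=(3, len(x_i)))
```

Each row of `x_tilde` is a noised version of the same document; entries never exceed the original counts.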

Page 10

Logistic Regression

Given a dataset of labeled training examples (x_i, y_i), logistic regression solves the problem

β̂ = argmin_β { Σ_{i=1}^n ℓ(β; x_i, y_i) }, where

ℓ(β; x, y) = −y x·β + log(1 + e^{x·β}).

Given this learned weight vector β̂, we make predictions

ŷ = 1{β̂ · x > 0}.
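A minimal sketch of this loss and prediction rule (NumPy; the toy β and x values below are illustrative only):

```python
import numpy as np

def logistic_loss(beta, x, y):
    """Per-example logistic loss: -y * (x . beta) + log(1 + exp(x . beta))."""
    z = x @ beta
    return -y * z + np.log1p(np.exp(z))

def predict(beta, x):
    """Predict y = 1 exactly when beta . x > 0."""
    return int(x @ beta > 0)

beta = np.array([1.0, -2.0])
x = np.array([3.0, 1.0])
# Here x . beta = 1 > 0, so the predicted label is 1.
```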

Page 11

Bootstrap Regularization for Logistic Regression

With bootstrap regularization, we don’t use the maximum likelihood estimate

β̂ = argmin_β { Σ_{i=1}^n ℓ(β; x_i, y_i) }.

Instead, we add bootstrap noise to create many different versions x̃ of our features x, and solve

β̂_boot = argmin_β { Σ_{i=1}^n E[ℓ(β; x̃_i, y_i)] }.

Page 12

Bootstrap Regularization for Logistic Regression

We minimize the loss

β̂_boot = argmin_β { Σ_{i=1}^n E[ℓ(β; x̃_i, y_i)] }

with binomial thinning x̃_ij = 2 · Binom(x_ij, 0.5).

For logistic regression,

ℓ(β; x, y) = −y β·x + ψ(β·x), where ψ(z) = log(1 + e^z),

the bootstrap regularized loss can be written as

E[ℓ(β; x̃, y)] = −E[y β·x̃] + E[ψ(β·x̃)]
              = −y β·x + E[ψ(β·x̃)]          (since E[x̃] = x)
              = ℓ(β; x, y) + (E[ψ(β·x̃)] − ψ(β·x)).

Page 13

Bootstrap Regularization for Logistic Regression

The bootstrap regularized loss can be written as

E[ℓ(β; x̃, y)] = ℓ(β; x, y) + R(β; x), where

R(β; x) = E[ψ(β·x̃)] − ψ(β·x)

is the bootstrap regularizer. The function ψ is convex, so R(β; x) ≥ 0 by Jensen’s inequality.

Remarks:

- The full regularizer Σ_{i=1}^n R(β; x_i) can be efficiently approximated in closed form, so we do not need to actually create many versions x̃_i of x_i.

- The regularizer R(β; x_i) does not depend on the labels y_i.
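Although the closed-form approximation is what makes the method practical, the regularizer can also be estimated directly by Monte Carlo (a sketch with hypothetical β and x; the thinning x̃ = 2·Binom(x, 0.5) keeps E[x̃] = x):

```python
import numpy as np

rng = np.random.default_rng(0)

def psi(z):
    """Log-partition function of logistic regression."""
    return np.log1p(np.exp(z))

# Hypothetical weight vector and count features, for illustration only.
beta = np.array([0.5, -1.0, 0.2, 0.0, 0.3])
x = np.array([0, 1, 1, 0, 2])

# Mean-preserving binomial thinning: drop each occurrence w.p. 1/2,
# then double the survivors, so E[x_tilde] = x.
x_tilde = 2 * rng.binomial(x, 0.5, size=(100_000, len(x)))

# Monte Carlo estimate of R(beta; x) = E[psi(beta . x_tilde)] - psi(beta . x).
R = psi(x_tilde @ beta).mean() - psi(x @ beta)
```

By Jensen’s inequality the estimate is (up to Monte Carlo error) nonnegative, as the slide states.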

Page 14

Semi-Supervised Bootstrap Regularization

The term

R(β) = Σ_{i=1}^n (E[ψ(x̃_i · β)] − ψ(x_i · β))

penalizes complicated models, without regard to the labels y_i.

Idea: If R(β) is “morally” just a regularizer on model complexity, we should be able to improve it by using unlabeled data:

R⁺(β) = n / (n + αm) · (R(β) + α R_Unlabeled(β)),

where m is the number of unlabeled examples.
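A direct transcription of this formula (the helper and its inputs are hypothetical placeholders; the two regularizer values would come from the labeled and unlabeled data respectively):

```python
def combined_regularizer(R_labeled, R_unlabeled, n, m, alpha):
    """Semi-supervised regularizer R+ = n/(n + alpha*m) * (R + alpha*R_unlabeled)."""
    return n / (n + alpha * m) * (R_labeled + alpha * R_unlabeled)

# With alpha = 1 and m = n, this simply averages the two regularizers.
r_plus = combined_regularizer(2.0, 4.0, n=100, m=100, alpha=1.0)  # -> 3.0
```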

Page 15

Experiments

Several classic document classification examples:

- Number of training examples: 5,000 ≤ n ≤ 800,000

- Number of features: 20,000 ≤ p ≤ 700,000

Methods under consideration:

- Logistic regression (no regularization)

- Ridge regularized logistic regression (L2)

- Bootstrap regularized logistic regression

- Semi-supervised bootstrap regularized logistic regression

Page 16

Semi-supervised Bootstrap Regularization: Results

Semi-supervised setting. We use unlabeled features to improve the regularizer.

Dataset  K   L2     Boot   +Unlabeled
CoNLL    5   91.46  91.81  92.02
20news   20  76.55  79.07  80.47
RCV14    4   94.76  94.79  95.16
R21578   65  90.67  91.24  90.30
TDT2     30  97.34  97.54  97.89

Results for multiclass logistic regression; K is the number of classes, accuracy is in percent.

[Source: Wang, Wang, —–, Liang, and Manning, 2013]

Page 17

Transductive Bootstrap Regularization: Results

Transductive setting. We use the test set features to improve the regularizer.

Dataset  K   None   L2     Boot   +Test
CoNLL    5   78.03  80.12  80.90  81.66
20news   20  81.44  82.19  83.37  84.71
RCV14    4   95.76  95.90  96.03  96.11
R21578   65  92.24  92.24  92.24  92.58
TDT2     30  97.74  97.91  98.00  98.12

Results for multiclass logistic regression; K is the number of classes, accuracy is in percent.

[Source: Wang, Wang, —–, Liang, and Manning, 2013]

Page 18

IMDB Dataset

- Sentiment classification of IMDB movie reviews.

- Test and train sets of 25k reviews each. Highly polar reviews.

- 50k unlabeled reviews, some of which are polar and others neutral.

- 500k features with unigrams, 5M features with bigrams.

- Assembled by Maas et al. [2011].

Page 19

IMDB Dataset: Results

Method                          Supervised  Semi-supervised
Naive Bayes on Unigrams         83.62       84.13
Naive Bayes on Bigrams          86.63       86.98
  [Su et al., 2011]
Vectors for Sentiment Analysis  88.33       88.89
  [Maas et al., 2011]
Compressive Feature Learning    90.40       –
  [Paskov et al., 2013]
Naive Bayes SVM (Bigrams)       91.22       –
  [Wang and Manning, 2012]
Bootstrap Reg. on Unigrams      87.78       89.52
Bootstrap Reg. on Bigrams       91.31       91.98

[Source: —–, Wang, and Liang, 2013] Accuracy in percent; pairwise differences greater than 0.5% are significant.

Page 20

IMDB Dataset: Results

[Figure: accuracy (0.85–0.90) as a function of the size of the unlabeled data (0–40,000), for three methods: dropout+unlabeled, dropout, and L2.]

Page 21

Digging Deeper

Bootstrap regularization involves minimizing the loss

β̂_boot = argmin_β { Σ_{i=1}^n E[ℓ(β; x̃_i, y_i)] }

with binomial thinning x̃_ij = Binom(x_ij, 0.5).

- This method randomly deletes roughly half the words.

- We could delete any fraction 0 < δ < 1 of words:

  x̃_ij = Binom(x_ij, 1 − δ).

- If we set δ = 0 and delete no words, we recover logistic regression with no regularization.

- What does δ → 1 say?

Page 22

Taking δ → 1

If we randomly delete all but one word from each document, we recover a well-known method: Naive Bayes.

- As we take δ → 1, x̃_i is usually 0 or has only one word. Thus, we converge to naive Bayes.

- Note: the intercept is different.

Naive Bayes is well known to be much more stable than logistic regression and has excellent generalization properties; unfortunately, naive Bayes is also very biased.

- Bootstrap regularization gives most of the benefits of naive Bayes estimation with only a fraction of the bias.

Page 23

Taking δ → 1

Simulation: Data fit using logistic regression (LR), naive Bayes (NB), and bootstrap regularization with various values of δ.

[Figure: test error rate (%), from 0.5 to 20 on a log scale, against n from 50 to 2000, for LR, bootstrap regularization with δ ∈ {0.25, 0.5, 0.75, 0.9, 0.95, 0.99}, and NB.]

[Source: —–, Fithian, Wang, and Liang, 2014]

Page 24

Bootstrap Regularization

In the context of document classification, we regularize by perturbing the data in a domain-specific way.

The bootstrap regularization road map:

1. Identify a natural noise model for perturbing the data (e.g., dropout, binomial sampling, Gaussian noise along a manifold).

2. Turn each training example x_i into many pseudo-examples x̃_i1, ..., x̃_iB by applying the noising scheme.

3. Train on the x̃_ib instead of the x_i. The noise in the x̃_ib acts as regularization.
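The three steps above can be sketched end to end (a toy NumPy implementation, not the authors’ code: binomial word deletion as the noise model, B pseudo-examples per training point, and a plain gradient-descent logistic fit):

```python
import numpy as np

rng = np.random.default_rng(0)

def noise(x, delta=0.5):
    """Step 1: natural noise model -- here, binomial word deletion."""
    return rng.binomial(x, 1.0 - delta)

def pseudo_examples(X, y, B=10, delta=0.5):
    """Step 2: turn each x_i into B noised pseudo-examples (labels unchanged)."""
    Xb = np.vstack([noise(X, delta) for _ in range(B)])
    yb = np.tile(y, B)
    return Xb, yb

def fit_logistic(X, y, lr=0.1, steps=500):
    """Step 3: plain logistic regression, trained on the noised data."""
    beta = np.zeros(X.shape[1])
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-X @ beta))
        beta -= lr * X.T @ (p - y) / len(y)
    return beta

# Toy data: word 0 signals a positive label, word 1 a negative one.
X = np.array([[3, 0], [2, 1], [0, 2], [1, 3]])
y = np.array([1, 1, 0, 0])
Xb, yb = pseudo_examples(X, y)
beta = fit_logistic(Xb, yb)
```

The fitted weights should still separate the two signal words; the noise simply shrinks the fit, acting as a regularizer.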

Further work: We can use similar ideas in the form of “stable autoencoding” to regularize PCA in a domain-specific way (Josse & —–, 2015).

Page 25

References

—–, William Fithian, Sida Wang, and Percy Liang. Altitude Training: Strong Bounds for Single-Layer Dropout. Advances in Neural Information Processing Systems (NIPS), 2014.

—–, Sida Wang, and Percy Liang. Dropout Training as Adaptive Regularization. Advances in Neural Information Processing Systems (NIPS), 2013.

Sida Wang, Mengqiu Wang, —–, Percy Liang, and Chris Manning. Feature Noising for Log-linear Structured Prediction. Empirical Methods in Natural Language Processing (EMNLP), 2013.

Page 26

Thanks!