Discriminative v. generative

Transcript of ggordon/tmp/22-more-classifiers.pdf (40 pages)

Page 1

Discriminative v. generative

Page 2

Geoff Gordon—Machine Learning—Fall 2013

Naive Bayes


Page 3

Naive Bayes

MLE:

P(x_{ij}, y_i) = \prod_i P(y_i) \prod_j P(x_{ij} \mid y_i)

\max_{a_j, b_j, p} P(x_{ij}, y_i)

P(y_i = +) = p
P(x_{ij} = 1 \mid y_i = -) = a_j
P(x_{ij} = 1 \mid y_i = +) = b_j

p = \textstyle \frac{1}{N} \sum_i \delta[y_i = +]
a_j = \textstyle \sum_i \delta[(y_i = -) \wedge (x_{ij} = 1)] / \sum_i \delta[y_i = -]
b_j = \textstyle \sum_i \delta[(y_i = +) \wedge (x_{ij} = 1)] / \sum_i \delta[y_i = +]

2k+1 parameters

P(y_i = + \mid x_{ij}) = 1/(1 + \exp(-z_i))
z_i = \textstyle w_0 + \sum_j w_j x_{ij}

naive Bayes model: y_i -> [x_ij]
N training examples (x_i, y_i), k binary features; y_i ∈ {0,1}, x_i ∈ {0,1}^k
MLE led to classifier: P(y_i = + \mid x_{ij}) = 1/(1+\exp(-z_i)), z_i = w_0 + \sum_j w_j x_{ij}

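A minimal sketch of these counting formulas in NumPy (illustrative names; assumes a binary data matrix X of shape (N, k), labels y in {0,1}, and no smoothing):

import numpy as np

def nb_mle(X, y):
    # MLE for the naive Bayes parameters: p, a_j, b_j (2k+1 numbers)
    pos, neg = (y == 1), (y == 0)
    p = pos.mean()            # P(y_i = +)
    a = X[neg].mean(axis=0)   # P(x_ij = 1 | y_i = -), one entry per feature j
    b = X[pos].mean(axis=0)   # P(x_ij = 1 | y_i = +)
    return p, a, b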

Page 4

Logistic regression

\arg\max_w \prod_i P(y_i \mid x_{ij}) = \arg\min_w \sum_i \ln(1 + \exp(-y_i z_i)) = \arg\min_w \sum_i h(y_i z_i)

(taking y_i ∈ {-1, +1} in the loss)

P(y_i = + \mid x_{ij}) = 1/(1 + \exp(-z_i))
z_i = \textstyle w_0 + \sum_j w_j x_{ij}

[sketch h]

optional prior for both NB and log. reg.

SVM: can think of it as approximating this optimization with a QP

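For the discriminative side, a minimal sketch of fitting w by gradient descent on this loss (plain NumPy; logreg_mcle is an illustrative name; labels taken in {-1,+1}):

import numpy as np

def logreg_mcle(X, y, iters=2000, lr=0.1):
    # minimize sum_i ln(1 + exp(-y_i z_i)), z_i = w_0 + sum_j w_j x_ij
    n, k = X.shape
    Xb = np.hstack([np.ones((n, 1)), X])   # column of ones carries w_0
    w = np.zeros(k + 1)
    for _ in range(iters):
        z = Xb @ w
        # gradient of the logistic loss: sum_i -y_i x_i / (1 + exp(y_i z_i))
        g = -((y / (1.0 + np.exp(y * z))) @ Xb) / n
        w -= lr * g
    return w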

Page 5

Same model, different answer

Why?
‣ max P(X, Y) vs. max P(Y | X)

‣ generative vs. discriminative

‣ MLE v. MCLE (max conditional likelihood estimate)

How to pick?
‣ Typically MCLE better if lots of data, MLE better if not

MCLE better if lots of data: we’ll see why below

it’s perhaps disturbing that there are two different ways to train the same model (3 if we count SVM, but we can justify that as an approximation to MCLE); can we relate them?

[also integration v. maximization, but ignore that]

Page 6

MCLE as MLE

\max_\theta \prod_i P(x_i, y_i \mid \theta) \quad \text{vs.} \quad \max_\theta \prod_i P(y_i \mid x_i, \theta)

P(x_i, y_i \mid \theta) = P(x_i \mid \theta) P(y_i \mid x_i, \theta)
now suppose \theta = (\theta_x, \theta_y), and P(x_i \mid \theta) = P(x_i \mid \theta_x), P(y_i \mid x_i, \theta) = P(y_i \mid x_i, \theta_y)
then \max \sum_i \ln P(x_i, y_i \mid \theta) = \max \sum_i [\ln P(x_i \mid \theta_x) + \ln P(y_i \mid x_i, \theta_y)]
can solve separately for \theta_x and \theta_y; \theta_y is the MCLE solution

Page 7

MCLE as MLE

Recipe: MCLE = MLE + extra parameters to decouple P(x) from P(y|x)

Bias-variance tradeoff: MLE places additional constraints on θ by coupling to P(x)
‣ MLE increases bias, decreases variance (vs. MCLE)

To interpolate generative / discriminative models, soft-tie θ_x to θ_y w/ prior

Tom Minka. Discriminative models, not discriminative training. MSR tech report TR-2005-144, 2005

parameters of P(y|x) are ones of interest

Page 8

Comparison

As #examples → ∞
‣ if Bayes net is right: NB & LR get same answer

‣ if not:

‣ LR has minimum possible training error

‣ train error → test error

‣ so LR does at least as well as NB, usually better


Page 9

Comparison

Finite sample: n examples with k attributes
‣ how big should n be for excess risk ≤ ε?
‣ GNB needs n = Θ(log k), as long as a constant fraction of attributes are relevant
‣ Hoeffding for each weight + union bound over weights + bound z away from 0
‣ LR needs n = Θ(k)
‣ VC-dimension of linear classifier

GNB converges much faster to its (perhaps less-accurate) final estimates


see [Ng & Jordan, 2002]

informally, difference in convergence rates happens because NB’s parameter estimates are uncoupled, while LR’s are coupled

Page 10

Comparison on UCI

[Figure: plots of test error vs. number of training examples m on UCI datasets: pima (continuous), adult (continuous), boston (predict if > median price, continuous), optdigits (0's and 1's, continuous), optdigits (2's and 3's, continuous), ionosphere (continuous), liver disorders (continuous), sonar (continuous), adult (discrete), promoters (discrete), lymphography (discrete), breast cancer (discrete), lenses (predict hard vs. soft, discrete), sick (discrete), voting records (discrete)]

see [Ng & Jordan, 2002]

Page 11

Comparison on UCI

[Figure: continuation of the same error-vs-m plots on the same UCI datasets as the previous slide]

see [Ng & Jordan, 2002]

Page 12

Decision trees

Page 13

Dichotomous classifier


Winged Insect Key (flies, bees, butterflies, beetles, etc.)

Starting with question #1, determine which statement (a or b) is true for your insect. Follow the direction at the end of the true statement until you are finally given the name of the Order your insect belongs to.

1. a. Insect has 1 pair of wings ... Order Diptera (flies, mosquitoes)
   b. Insect has 2 pairs of wings ... go to #2
2. a. Insect has extremely long prothorax (neck) ... go to #3
   b. Insect has a regular-length or no prothorax ... go to #4
3. a. Forelegs come together in a 'praying' position ... Order Mantodea (mantids)
   b. Forelegs do not come together in a 'praying' position ... Order Raphidioptera (snakeflies)
4. a. Wings are armour-like with membranous hindwings underneath them ... Order Coleoptera (beetles)
   b. Wings are not armour-like ... go to #5
5. a. Wings twist when insect is in flight ... Order Strepsiptera (twisted-wing parasites)
   b. Wings flap up and down (no twisting) when in flight ... go to #6
6. a. Wings are triangular in shape ... go to #7
   b. Wings are not triangular in shape ... go to #8
7. a. Insect lacks a proboscis and has long filaments at abdominal tip ... Order Ephemeroptera (mayflies)
   b. Insect has a proboscis and lacks long filaments at abdominal tip ... Order Lepidoptera (butterflies)
8. a. Head is elongated (snout-like) ... Order Mecoptera (scorpionflies)
   b. Head is not elongated (snout-like) ... go to #9
9. a. Insect has 2 pairs of cerci (pincers) at tip of abdomen ... Order Dermaptera (earwigs)
   b. Insect does not have 2 pairs of cerci (pincers) at tip of abdomen ... go to #10
10. a. All 4 wings are similar in size and shape to each other ... go to #11
    b. All 4 wings are not similar in size and shape to each other ... go to #16
11. a. Eyes nearly cover or make up entire head ... Order Odonata (dragonflies)
    b. Eyes do not nearly cover nor make up entire head ... go to #12
12. a. All 4 wings are finely veined and almost 2x longer than abdomen ... Order Isoptera (termites)
    b. All 4 wings are not finely veined and not almost 2x longer than abdomen ... go to #13
13. a. All 4 wings are transparent with many criss-crossing veins ... Order Neuroptera (lacewings)
    b. All 4 wings are not transparent with many criss-crossing veins ... go to #14

http://www.insectidentification.org/winged-insect-key.asp

decision trees were invented long before machine learning -- first called “dichotomous classifiers”

used for field identification of species, rock types, etc.

Page 14

Decision tree

Problem: classification (or regression)
‣ n training examples (x_1, y_1), (x_2, y_2), …, (x_n, y_n)
‣ x_i ∈ R^k, y_i ∈ {0, 1}

well-known implementations: ID3, C4.5, J48, CART

input variables: real valued, categorical, binary, …
output variables: real valued, categorical, binary, …

tree shape [sketch]: a question in each node (about values of input variables), branch based on answer e.g., x_{i,3} > 5

when we reach a leaf, it tells us about the output variable

usually yes/no questions, but could be multiple choice -- answers must be mutually exclusive and exhaustive
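A sketch of that shape in code (hypothetical Node class; internal nodes ask a yes/no question about one input variable, leaves hold the output):

class Node:
    # internal node: tests x[attr] > thresh; leaf: thresh is None, label is set
    def __init__(self, attr=None, thresh=None, left=None, right=None, label=None):
        self.attr, self.thresh = attr, thresh
        self.left, self.right = left, right    # branches for "no" / "yes"
        self.label = label

def predict(node, x):
    while node.label is None:                  # descend until we reach a leaf
        node = node.right if x[node.attr] > node.thresh else node.left
    return node.label

# e.g., the question "x_{i,3} > 5" at the root:
tree = Node(attr=3, thresh=5, left=Node(label=0), right=Node(label=1))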

Page 15

The picture

typical decision tree cartoon: divide plane into rectangular regions

Page 16

The picture

Composition II in Red, Blue, and Yellow

Piet Mondrian, 1930

Page 17

Variants

Type of question at internal nodes

Type of label at leaf nodes

Labels on internal nodes or edges


type of question at internal node: most common is >, <, or = on a single attribute (e.g., height > 3? color = blue?); also used: logical fns (conjuncts, disjuncts); also used: linear threshold (3*height + width - 2)

type of label at leaf: constant ("3" or "true"), or could be any classifier or regressor (e.g., linear regression on all data points w/in box)

labeled internal nodes/edges (combine e.g. w/ sum):
  color = blue: +3, go to node 3; else +2, go to 4
  height > 5: -2, go to 7; else +1, go to 13

Page 18

Variants

Decision list

Decision diagram (DAG)


allow cycles + side effects: flowchart

Page 19

fine print: this might have cycles in it

Page 20

Example

[Figure: iris data scattered by petal length (range 1..7) vs. sepal length (range 4..8); red: setosa, green: versicolor, blue: virginica]

suppose we had another type at bottom right; could split on sepal length

Page 21

Why decision trees?

Why?
‣ work pretty well

‣ fairly interpretable

‣ very fast at test time

‣ closed under common operations

Why not DTs?
‣ learning is NP-hard
‣ often not state-of-art error rate
‣ but: see bagging, boosting

closed under common operations: e.g., sum of 2 decision trees/diagrams is another tree/diagram (and there are good algorithms to compute and optimize representation of sum)

not usually state-of-art performance but: boosted or bagged versions may be but: these lose interpretability

Page 22

Learning

red? fuzzy? Class

T T –

T F +

T F –

F T –

F F +

split on red: T: -+-, F: -+
best (MLE) labels: 1/3, 1/2
lik: log(2/3) + log(1/3) + log(2/3) + log(1/2) + log(1/2)

alternately, split on fuzzy: T: --, F: +-+
MLE labels: 0, 2/3 [could use Laplace smoothing]
lik: log(1) + log(2/3) + log(1/3) + log(1) + log(2/3), better by 2 log(2) = 2 bits

if we now split fuzzy=F by red?, get pure leaves: perfect performance on training data

Page 23

Learning

Bigger data sets with more attributes: finding training set MLE is NP-hard

Heuristic search: build tree greedily, root down
‣ start with all training examples in one bin

‣ pick an impure bin

‣ try some candidate splits (e.g., all single-attribute binary tests), pick the best (largest increase in likelihood)

‣ repeat until all bins are either pure or have no possible splits left (ran out of attributes to split on)


tradeoff: if we consider stronger splits (e.g., linear classifiers vs. single-attribute tests) we get more progress at each node, but more selection bias; overall goal is not to do well at a node, but to do well with the final tree

heuristic: learn slowly (pick a weak set of splits)

Page 24

Information gain

Initially: L = 2 log(.4) + 3 log(.6)

Split on red:
‣ bin T: 2 log(.667) + log(.333)
‣ bin F: 2 log(.5)

Split on fuzzy:
‣ bin T: 2 log 1 + 0 log 0 = 0
‣ bin F: 2 log(.667) + log(.333)

In general: H(Y) - E_X[H(Y|X)]

red? fuzzy? Class

T T –

T F +

T F –

F T –

F F +

evaluating a candidate split: increase in likelihood = information gain = measure of purity of bins
init: -4.85
red: -4.75 (gain .1 bits)
fuzzy: -2.75 (gain 2.1 bits)

there are other splitting criteria besides info gain (e.g., Gini), but we won't cover them
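A small sketch reproducing the numbers above (log base 2; loglik is an illustrative helper scoring a bin under its MLE label distribution):

import numpy as np

def loglik(labels):
    # sum_i log2 P(label_i), with P the bin's empirical label distribution
    labels = np.asarray(labels)
    return sum((labels == c).sum() * np.log2((labels == c).mean())
               for c in set(labels.tolist()))

y     = np.array([0, 1, 0, 0, 1])    # class column: -, +, -, -, +
red   = np.array([1, 1, 1, 0, 0])
fuzzy = np.array([1, 0, 0, 1, 0])

base = loglik(y)                                       # about -4.85 bits
for name, x in [("red", red), ("fuzzy", fuzzy)]:
    gain = loglik(y[x == 1]) + loglik(y[x == 0]) - base
    print(name, round(gain, 2))                        # 0.1 and 2.1 bits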

Page 25

Real-valued attributes

finding threshold for a real attribute: sort by attribute value, try n+1 thresholds, one in each gap between observed values

===

import numpy as np
import matplotlib.pyplot as plt

# two classes of 50 samples each, shifted apart; plot their empirical CDFs
xs = np.sort(np.random.randn(50) - 1)
ys = np.sort(np.random.randn(50) + 1)
plt.clf()
plt.plot(xs, np.arange(1, 51) / 50., ys, np.arange(1, 51) / 50.,
         marker='x', ls='none', mew=3, ms=5)
plt.show()
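And a sketch of that threshold search, reusing the hypothetical loglik scorer from the information-gain example (interior gaps only, since thresholds beyond the extremes put every example in one bin and cannot gain):

def best_threshold(x, y):
    # sort the observed values, score one candidate threshold per gap
    vals = np.unique(x)
    mids = (vals[:-1] + vals[1:]) / 2.0
    return max(mids, key=lambda t: loglik(y[x > t]) + loglik(y[x <= t]))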

Page 26

Multi-way discrete splits

Split on temp yields {–,–} and {+,–,+}

Split on SS# yields 5 pure leaves


SS#           Temp  Sick?
123-45-6789   36    -
010-10-1010   36.5  +
555-55-1212   41    +
314-15-9265   37    -
271-82-8183   40    +

unfair advantage of multi-way splits

fix: penalize splits of high arity
e.g., allow only binary splits (1 vs. rest)
e.g., use a statistical test of significance to select a split variable

Page 27

Pruning

Build tree on training set

Prune on holdout set:
‣ while removing the last split along some path improves holdout error, do so
‣ if a node N's children are all pruned, then N becomes eligible for pruning

note: order of testing children of N is unimportant
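A sketch of that holdout pruning loop, continuing the hypothetical Node/predict from the decision-tree slide (majority is an assumed helper giving the majority training label under a node):

import numpy as np

def prunable(node, acc):
    # internal nodes whose children are both leaves: eligible for pruning
    if node.label is not None:
        return acc
    if node.left.label is not None and node.right.label is not None:
        acc.append(node)
    prunable(node.left, acc)
    prunable(node.right, acc)
    return acc

def prune(tree, X_hold, y_hold, majority):
    err = lambda: np.mean([predict(tree, x) != yi
                           for x, yi in zip(X_hold, y_hold)])
    best, improved = err(), True
    while improved:
        improved = False
        for node in prunable(tree, []):
            saved = (node.label, node.left, node.right)
            node.label, node.left, node.right = majority(node), None, None
            if err() < best:
                best, improved = err(), True                  # keep the prune
            else:
                node.label, node.left, node.right = saved     # undo it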

Page 28

Prune as rules

Alternately, convert each leaf to a rule, then prune
‣ test1 ∧ ¬test2 ∧ test3 …

‣ while dropping a test from a rule improves performance, do so


rule-based version: typically leads to smaller, more interpretable classifiers

may get overlap among rules; if so, e.g., average their predictions

Page 29

Bagging

Bagging = bootstrap aggregating

Can be used with any classifier, but particularly effective with decision trees

Generate M bootstrap resamples

Train a decision tree on each one

Final classifier: vote all M trees
‣ e.g., tree 1 says p(+) = .7, tree 2 says p(+) = .9: predict .8

train: could use different training methods on each resample; choices include candidate splits, pruning strategies

random forests: restrict each tree to use a random subsample of k’<<k attributes; don’t prune

bagging can increase performance substantially (random forests often get state-of-art performance) but reduces interpretability
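A minimal sketch of the resample/train/vote loop (train_tree stands in for any decision-tree learner; names are illustrative):

import numpy as np

def bag(X, y, train_tree, M=100, seed=0):
    # bootstrap aggregating: M resamples of size n, one tree per resample
    rng = np.random.default_rng(seed)
    n = len(y)
    trees, bags = [], []
    for _ in range(M):
        idx = rng.integers(0, n, size=n)      # draw n examples with replacement
        trees.append(train_tree(X[idx], y[idx]))
        bags.append(set(idx.tolist()))        # remember who was in the bag
    return trees, bags

def vote(trees, x, predict_proba):
    # average the trees' p(+), e.g. (.7 + .9) / 2 = .8
    return np.mean([predict_proba(t, x) for t in trees])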

Page 30

Out-of-bag error estimates

Each bag contains (1–1/e) (~63%) of the examples

Use out-of-bag examples to estimate error of each tree

To estimate error of overall vote
‣ for each example, classify using all out-of-bag trees

‣ average across all examples

Conservative: for each example we average over only the ~37% of trees that are out-of-bag for it—but if we have lots of trees, bias is small
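A sketch of the out-of-bag estimate, continuing the hypothetical bag() above (each example is voted by only the trees whose resample missed it):

def oob_error(trees, bags, X, y, predict_proba):
    errs = []
    for i, (x, yi) in enumerate(zip(X, y)):
        out = [t for t, b in zip(trees, bags) if i not in b]
        if not out:
            continue                           # in every bag: no OOB vote
        p = np.mean([predict_proba(t, x) for t in out])
        errs.append((p > 0.5) != yi)           # y assumed in {0, 1}
    return np.mean(errs)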

Page 31

Boosting

Page 32

Voted classifiers

f: R^k → {–1, 1}

Voted classifier: ∑_j f_j(x) > 0

Weighted vote: ∑_j α_j f_j(x) > 0
‣ assume wlog α_j > 0
‣ optionally scale so α_j sum to 1

5 halfspaces (or add constant classifier for |H|=6)

typically a larger hypothesis space (vs. base set of classifiers) -- e.g., voted halfspaces

terminology: base f_j are called “weak hypotheses” to distinguish from the stronger class of voted f_j

wlog: since we assume hypothesis space is closed under negation (for each f, -f also in space)

idea: learn a voted classifier by MCLE
potential benefit: improved performance, if we can avoid overfitting due to bigger hypothesis space

Page 33

Voted classifiers—the matrix

[Figure: matrix with n training examples as rows and T distinct classifiers as columns (T ≤ 2^n)]

write f_1 ... f_T for all *distinct* classifiers in our hypothesis space (at most 2^n for n training examples)

write z_ij = y_i f_j(x_i) = does f_j get (x_i, y_i) right? (matrix dimensions: #examples × #classifiers)

Page 34

Finding the best voted classifier

write s_i = y_i [weighted vote]; voted classifier is right on (x_i, y_i) iff s_i > 0

s_i = y_i \sum_j \alpha_j f_j(x_i) = \sum_j \alpha_j z_{ij}

MCLE: \min_{\alpha, s} L = \sum_i h(s_i) s.t. s = Z\alpha, where h(s) = \log(1+\exp(-s))
this is a convex program (since h is convex), but too big to solve directly -- how to do it?

Page 35

Coordinate descent

Repeat:
‣ Find an index j s.t. dL/dα_j < 0
‣ Increase α_j

"Repeatedly increase the weight of a useful weak hypothesis"

find an index j s.t. dL/d\alpha_j < 0 [by assumption, don’t have to check separately for dL/d\alpha_j > 0]

concretely, \alpha_j += \alpha [there are other strategies, but this is actually one of the best]

to make this fast, “find an index j” has to be efficient (can’t enumerate columns)

Page 36

Finding a good weak hypothesis

Find j s.t. -dL/dα_j is big:

-dL/dα_j = \sum_i -h'(s_i) ds_i/dα_j
         = \sum_i -h'(s_i) z_{ij}
         = \sum_i -h'(s_i) y_i f_j(x_i)
         = "edge" of classifier j

want to find j to make the edge as big as possible
-h' > 0, so want to make each y_i f_j(x_i) big [sketch -h']

y_i f_j(x_i) is big <==> f_j gets x_i confidently right
weights -h'(s_i): example i is important if the current voted classifier gets it confidently wrong
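Putting pages 34–36 together, a small sketch of this coordinate-descent boosting loop on the logistic loss (weak_learner is an assumed routine that, given example weights, returns an f with a large edge; y in {-1,+1}):

import numpy as np

def boost(X, y, weak_learner, rounds=100, step=0.1):
    # greedy coordinate descent on L = sum_i log(1 + exp(-s_i))
    fs, alphas = [], []
    s = np.zeros(len(y))                      # s_i = y_i * (weighted vote)
    for _ in range(rounds):
        w = 1.0 / (1.0 + np.exp(s))           # -h'(s_i): big if confidently wrong
        f = weak_learner(X, y, w)
        z = y * np.array([f(x) for x in X])   # the z_ij column for this f
        if w @ z <= 0:                        # edge = -dL/dalpha_j; no progress
            break
        fs.append(f)
        alphas.append(step)                   # alpha_j += step
        s += step * z
    return fs, alphas    # final classifier: sign(sum_j alpha_j f_j(x))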

Page 37

Weak learner

Weak learner = weighted classification algorithm that gets edge ≥ γ
‣ i.e., finds a classifier that performs well on currently-wrong examples

Thm: if weak learner always succeeds, L → 0

Page 38

Discussion

Can take h(s) to be any convex, decreasing fn of s
‣ e.g., exp(–s) or hinge loss max(0, 1–s)
‣ we used log(1+exp(–s)) — discrete variant of LogitBoost
‣ exp(–s) leads to AdaBoost

Can use confidence-rated classifiers (range [–1,1]) or regression algorithms

Weak hypothesis class: usually want a less-complex class than we’d use on its own—mitigates overfitting
‣ same “slow learning rate” idea as for decision tree splits

original (real-valued) LogitBoost uses regression as weak learner, like IRLS

Page 39

In practice

Boosting typically takes training error quickly to 0
‣ could also stop with failure of weak learner, but this doesn’t typically happen

Tends to keep increasing margin, even after training error is 0

Tends not to overfit—usually attributed to margin


Page 40

Is weak learning reasonable?

If weak learner can always succeed, then ∃ a vote that gets every training example right

If weak learner can fail, boosting seems like a good algorithm for making it do so
‣ but as we said, weak learner usually keeps working

first line: a theorem of Freund & Schapire