Discriminative v. generative

Transcript of ggordon/tmp/22-more-classifiers.pdf (40 pages)

Page 1

Discriminative v. generative

Page 2

Geoff Gordon—Machine Learning—Fall 2013

Naive Bayes


Page 3

Naive Bayes

MLE:

P(x_{ij}, y_i) = \prod_i P(y_i) \prod_j P(x_{ij} \mid y_i)

\max_{a_j, b_j, p} P(x_{ij}, y_i)

P(y_i = +) = p
P(x_{ij} = 1 \mid y_i = -) = a_j
P(x_{ij} = 1 \mid y_i = +) = b_j

p = \textstyle \frac{1}{N} \sum_i \delta[y_i = +]
a_j = \textstyle \sum_i \delta[(y_i = -) \wedge (x_{ij} = 1)] / \sum_i \delta[y_i = -]
b_j = \textstyle \sum_i \delta[(y_i = +) \wedge (x_{ij} = 1)] / \sum_i \delta[y_i = +]

2k+1 parameters

P(y_i = + \mid x_{ij}) = 1/(1 + \exp(-z_i))
z_i = \textstyle w_0 + \sum_j w_j x_{ij}

naive Bayes model: y_i -> [x_ij]
N training examples (x_i, y_i), k binary features; y_i ∈ {0,1}, x_i ∈ {0,1}^k
MLE led to classifier: P(y_i = + \mid x_{ij}) = 1/(1+\exp(-z_i)), z_i = w_0 + \sum_j w_j x_{ij}

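A minimal sketch of these counting formulas in NumPy (illustrative names; assumes a binary data matrix X of shape (N, k), labels y in {0,1}, and no smoothing):

import numpy as np

def nb_mle(X, y):
    # MLE for the naive Bayes parameters: p, a_j, b_j (2k+1 numbers)
    pos, neg = (y == 1), (y == 0)
    p = pos.mean()            # P(y_i = +)
    a = X[neg].mean(axis=0)   # P(x_ij = 1 | y_i = -), one entry per feature j
    b = X[pos].mean(axis=0)   # P(x_ij = 1 | y_i = +)
    return p, a, b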

Page 4

Logistic regression

\arg\max_w \prod_i P(y_i \mid x_{ij}) = \arg\min_w \sum_i \ln(1 + \exp(-y_i z_i)) = \arg\min_w \sum_i h(y_i z_i)

(taking y_i ∈ {-1, +1} in the loss)

P(y_i = + \mid x_{ij}) = 1/(1 + \exp(-z_i))
z_i = \textstyle w_0 + \sum_j w_j x_{ij}

[sketch h]

optional prior for both NB and log. reg.

SVM: can think of it as approximating this optimization with a QP

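For the discriminative side, a minimal sketch of fitting w by gradient descent on this loss (plain NumPy; logreg_mcle is an illustrative name; labels taken in {-1,+1}):

import numpy as np

def logreg_mcle(X, y, iters=2000, lr=0.1):
    # minimize sum_i ln(1 + exp(-y_i z_i)), z_i = w_0 + sum_j w_j x_ij
    n, k = X.shape
    Xb = np.hstack([np.ones((n, 1)), X])   # column of ones carries w_0
    w = np.zeros(k + 1)
    for _ in range(iters):
        z = Xb @ w
        # gradient of the logistic loss: sum_i -y_i x_i / (1 + exp(y_i z_i))
        g = -((y / (1.0 + np.exp(y * z))) @ Xb) / n
        w -= lr * g
    return w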

Page 5

Same model, different answer

Why?
‣ max P(X, Y) vs. max P(Y | X)

‣ generative vs. discriminative

‣ MLE v. MCLE (max conditional likelihood estimate)

How to pick?
‣ Typically MCLE better if lots of data, MLE better if not

MCLE better if lots of data: we’ll see why below

it’s perhaps disturbing that there are two different ways to train the same model (3 if we count SVM, but we can justify that as an approximation to MCLE); can we relate them?

[also integration v. maximization, but ignore that]

Page 6

MCLE as MLE

\max_\theta \prod_i P(x_i, y_i \mid \theta) \quad \text{vs.} \quad \max_\theta \prod_i P(y_i \mid x_i, \theta)

P(x_i, y_i \mid \theta) = P(x_i \mid \theta) P(y_i \mid x_i, \theta)
now suppose \theta = (\theta_x, \theta_y), and P(x_i \mid \theta) = P(x_i \mid \theta_x), P(y_i \mid x_i, \theta) = P(y_i \mid x_i, \theta_y)
then \max \sum_i \ln P(x_i, y_i \mid \theta) = \max \sum_i [\ln P(x_i \mid \theta_x) + \ln P(y_i \mid x_i, \theta_y)]
can solve separately for \theta_x and \theta_y; \theta_y is the MCLE solution

Page 7

MCLE as MLE

Recipe: MCLE = MLE + extra parameters to decouple P(x) from P(y|x)

Bias-variance tradeoff: MLE places additional constraints on θ by coupling to P(x)
‣ MLE increases bias, decreases variance (vs. MCLE)

To interpolate generative / discriminative models, soft-tie θ_x to θ_y w/ prior

Tom Minka. Discriminative models, not discriminative training. MSR tech report TR-2005-144, 2005

parameters of P(y|x) are ones of interest

Page 8

Comparison

As #examples → ∞
‣ if Bayes net is right: NB & LR get same answer

‣ if not:

‣ LR has minimum possible training error

‣ train error → test error

‣ so LR does at least as well as NB, usually better


Page 9

Comparison

Finite sample: n examples with k attributes
‣ how big should n be for excess risk ≤ ε?
‣ GNB needs n = Θ(log k), as long as a constant fraction of attributes are relevant
‣ Hoeffding for each weight + union bound over weights + bound z away from 0
‣ LR needs n = Θ(k)
‣ VC-dimension of linear classifier

GNB converges much faster to its (perhaps less-accurate) final estimates


see [Ng & Jordan, 2002]

informally, difference in convergence rates happens because NB’s parameter estimates are uncoupled, while LR’s are coupled

Page 10

Comparison on UCI

[Figure: plots of test error vs. number of training examples m on UCI datasets: pima (continuous), adult (continuous), boston (predict if > median price, continuous), optdigits (0's and 1's, continuous), optdigits (2's and 3's, continuous), ionosphere (continuous), liver disorders (continuous), sonar (continuous), adult (discrete), promoters (discrete), lymphography (discrete), breast cancer (discrete), lenses (predict hard vs. soft, discrete), sick (discrete), voting records (discrete)]

see [Ng & Jordan, 2002]

Page 11

Comparison on UCI

[Figure: continuation of the same error-vs-m plots on the same UCI datasets as the previous slide]

see [Ng & Jordan, 2002]

Page 12

Decision trees

Page 13

Dichotomous classifier


Winged Insect Key (flies, bees, butterflies, beetles, etc.)

Starting with question #1, determine which statement (a or b) is true for your insect. Follow the direction at the end of the true statement until you are finally given the name of the Order your insect belongs to.

1. a. Insect has 1 pair of wings ... Order Diptera (flies, mosquitoes)
   b. Insect has 2 pairs of wings ... go to #2
2. a. Insect has extremely long prothorax (neck) ... go to #3
   b. Insect has a regular-length or no prothorax ... go to #4
3. a. Forelegs come together in a 'praying' position ... Order Mantodea (mantids)
   b. Forelegs do not come together in a 'praying' position ... Order Raphidioptera (snakeflies)
4. a. Wings are armour-like with membranous hindwings underneath them ... Order Coleoptera (beetles)
   b. Wings are not armour-like ... go to #5
5. a. Wings twist when insect is in flight ... Order Strepsiptera (twisted-wing parasites)
   b. Wings flap up and down (no twisting) when in flight ... go to #6
6. a. Wings are triangular in shape ... go to #7
   b. Wings are not triangular in shape ... go to #8
7. a. Insect lacks a proboscis and has long filaments at abdominal tip ... Order Ephemeroptera (mayflies)
   b. Insect has a proboscis and lacks long filaments at abdominal tip ... Order Lepidoptera (butterflies)
8. a. Head is elongated (snout-like) ... Order Mecoptera (scorpionflies)
   b. Head is not elongated (snout-like) ... go to #9
9. a. Insect has 2 pairs of cerci (pincers) at tip of abdomen ... Order Dermaptera (earwigs)
   b. Insect does not have 2 pairs of cerci (pincers) at tip of abdomen ... go to #10
10. a. All 4 wings are similar in size and shape to each other ... go to #11
    b. All 4 wings are not similar in size and shape to each other ... go to #16
11. a. Eyes nearly cover or make up entire head ... Order Odonata (dragonflies)
    b. Eyes do not nearly cover nor make up entire head ... go to #12
12. a. All 4 wings are finely veined and almost 2x longer than abdomen ... Order Isoptera (termites)
    b. All 4 wings are not finely veined and not almost 2x longer than abdomen ... go to #13
13. a. All 4 wings are transparent with many criss-crossing veins ... Order Neuroptera (lacewings)
    b. All 4 wings are not transparent with many criss-crossing veins ... go to #14

http://www.insectidentification.org/winged-insect-key.asp

decision trees were invented long before machine learning -- first called “dichotomous classifiers”

used for field identification of species, rock types, etc.

Page 14

Decision tree

Problem: classification (or regression)
‣ n training examples (x_1, y_1), (x_2, y_2), …, (x_n, y_n)
‣ x_i ∈ R^k, y_i ∈ {0, 1}

well-known implementations: ID3, C4.5, J48, CART

input variables: real valued, categorical, binary, …
output variables: real valued, categorical, binary, …

tree shape [sketch]: a question in each node (about values of input variables), branch based on answer e.g., x_{i,3} > 5

when we reach a leaf, it tells us about the output variable

usually yes/no questions, but could be multiple choice -- answers must be mutually exclusive and exhaustive
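A sketch of that shape in code (hypothetical Node class; internal nodes ask a yes/no question about one input variable, leaves hold the output):

class Node:
    # internal node: tests x[attr] > thresh; leaf: thresh is None, label is set
    def __init__(self, attr=None, thresh=None, left=None, right=None, label=None):
        self.attr, self.thresh = attr, thresh
        self.left, self.right = left, right    # branches for "no" / "yes"
        self.label = label

def predict(node, x):
    while node.label is None:                  # descend until we reach a leaf
        node = node.right if x[node.attr] > node.thresh else node.left
    return node.label

# e.g., the question "x_{i,3} > 5" at the root:
tree = Node(attr=3, thresh=5, left=Node(label=0), right=Node(label=1))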

Page 15

The picture

typical decision tree cartoon: divide plane into rectangular regions

Page 16

The picture

Composition II in Red, Blue, and Yellow

Piet Mondrian, 1930

Page 17

Variants

Type of question at internal nodes

Type of label at leaf nodes

Labels on internal nodes or edges


type of question at internal node: most common is >, <, or = on a single attribute (e.g., height > 3? color = blue?); also used: logical fns (conjuncts, disjuncts); also used: linear threshold (3*height + width - 2)

type of label at leaf: constant ("3" or "true"), or could be any classifier or regressor (e.g., linear regression on all data points w/in box)

labeled internal nodes/edges (combine e.g. w/ sum):
  color = blue: +3, go to node 3; else +2, go to 4
  height > 5: -2, go to 7; else +1, go to 13

Page 18

Variants

Decision list

Decision diagram (DAG)


allow cycles + side effects: flowchart

Page 19

fine print: this might have cycles in it

Page 20

Example

[Figure: iris data scattered by petal length (range 1..7) vs. sepal length (range 4..8); red: setosa, green: versicolor, blue: virginica]

suppose we had another type at bottom right; could split on sepal length

Page 21

Why decision trees?

Why?
‣ work pretty well

‣ fairly interpretable

‣ very fast at test time

‣ closed under common operations

Why not DTs?
‣ learning is NP-hard
‣ often not state-of-art error rate
‣ but: see bagging, boosting

closed under common operations: e.g., sum of 2 decision trees/diagrams is another tree/diagram (and there are good algorithms to compute and optimize representation of sum)

not usually state-of-art performance but: boosted or bagged versions may be but: these lose interpretability

Page 22

Learning

red? fuzzy? Class

T T –

T F +

T F –

F T –

F F +

split on red: T: -+-, F: -+
best (MLE) labels: 1/3, 1/2
lik: log(2/3) + log(1/3) + log(2/3) + log(1/2) + log(1/2)

alternately, split on fuzzy: T: --, F: +-+
MLE labels: 0, 2/3 [could use Laplace smoothing]
lik: log(1) + log(2/3) + log(1/3) + log(1) + log(2/3), better by 2 log(2) = 2 bits

if we now split fuzzy=F by red?, get pure leaves: perfect performance on training data

Page 23

Learning

Bigger data sets with more attributes: finding training set MLE is NP-hard

Heuristic search: build tree greedily, root down
‣ start with all training examples in one bin

‣ pick an impure bin

‣ try some candidate splits (e.g., all single-attribute binary tests), pick the best (largest increase in likelihood)

‣ repeat until all bins are either pure or have no possible splits left (ran out of attributes to split on)


tradeoff: if we consider stronger splits (e.g., linear classifiers vs. single-attribute tests) we get more progress at each node, but more selection bias; overall goal is not to do well at a node, but to do well with the final tree

heuristic: learn slowly (pick a weak set of splits)

Page 24

Information gain

Initially: L = 2 log(.4) + 3 log(.6)

Split on red:
‣ bin T: 2 log(.667) + log(.333)
‣ bin F: 2 log(.5)

Split on fuzzy:
‣ bin T: 2 log 1 + 0 log 0 = 0
‣ bin F: 2 log(.667) + log(.333)

In general: H(Y) - E_X[H(Y|X)]

red? fuzzy? Class

T T –

T F +

T F –

F T –

F F +

evaluating a candidate split: increase in likelihood = information gain = measure of purity of bins
init: -4.85
red: -4.75 (gain .1 bits)
fuzzy: -2.75 (gain 2.1 bits)

there are other splitting criteria besides info gain (e.g., Gini), but we won't cover them
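A small sketch reproducing the numbers above (log base 2; loglik is an illustrative helper scoring a bin under its MLE label distribution):

import numpy as np

def loglik(labels):
    # sum_i log2 P(label_i), with P the bin's empirical label distribution
    labels = np.asarray(labels)
    return sum((labels == c).sum() * np.log2((labels == c).mean())
               for c in set(labels.tolist()))

y     = np.array([0, 1, 0, 0, 1])    # class column: -, +, -, -, +
red   = np.array([1, 1, 1, 0, 0])
fuzzy = np.array([1, 0, 0, 1, 0])

base = loglik(y)                                       # about -4.85 bits
for name, x in [("red", red), ("fuzzy", fuzzy)]:
    gain = loglik(y[x == 1]) + loglik(y[x == 0]) - base
    print(name, round(gain, 2))                        # 0.1 and 2.1 bits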

Page 25

Real-valued attributes

finding threshold for a real attribute: sort by attribute value, try n+1 thresholds, one in each gap between observed values

===

import numpy as np
import matplotlib.pyplot as plt

# two classes of 50 samples each, shifted apart; plot their empirical CDFs
xs = np.sort(np.random.randn(50) - 1)
ys = np.sort(np.random.randn(50) + 1)
plt.clf()
plt.plot(xs, np.arange(1, 51) / 50., ys, np.arange(1, 51) / 50.,
         marker='x', ls='none', mew=3, ms=5)
plt.show()
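And a sketch of that threshold search, reusing the hypothetical loglik scorer from the information-gain example (interior gaps only, since thresholds beyond the extremes put every example in one bin and cannot gain):

def best_threshold(x, y):
    # sort the observed values, score one candidate threshold per gap
    vals = np.unique(x)
    mids = (vals[:-1] + vals[1:]) / 2.0
    return max(mids, key=lambda t: loglik(y[x > t]) + loglik(y[x <= t]))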

Page 26

Multi-way discrete splits

Split on temp yields {–,–} and {+,–,+}

Split on SS# yields 5 pure leaves


SS#           Temp  Sick?
123-45-6789   36    -
010-10-1010   36.5  +
555-55-1212   41    +
314-15-9265   37    -
271-82-8183   40    +

unfair advantage of multi-way splits

fix: penalize splits of high arity
e.g., allow only binary splits (1 vs. rest)
e.g., use a statistical test of significance to select a split variable

Page 27

Pruning

Build tree on training set

Prune on holdout set:
‣ while removing the last split along some path improves holdout error, do so
‣ if a node N's children are all pruned, then N becomes eligible for pruning

note: order of testing children of N is unimportant
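A sketch of that holdout pruning loop, continuing the hypothetical Node/predict from the decision-tree slide (majority is an assumed helper giving the majority training label under a node):

import numpy as np

def prunable(node, acc):
    # internal nodes whose children are both leaves: eligible for pruning
    if node.label is not None:
        return acc
    if node.left.label is not None and node.right.label is not None:
        acc.append(node)
    prunable(node.left, acc)
    prunable(node.right, acc)
    return acc

def prune(tree, X_hold, y_hold, majority):
    err = lambda: np.mean([predict(tree, x) != yi
                           for x, yi in zip(X_hold, y_hold)])
    best, improved = err(), True
    while improved:
        improved = False
        for node in prunable(tree, []):
            saved = (node.label, node.left, node.right)
            node.label, node.left, node.right = majority(node), None, None
            if err() < best:
                best, improved = err(), True                  # keep the prune
            else:
                node.label, node.left, node.right = saved     # undo it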

Page 28

Prune as rules

Alternately, convert each leaf to a rule, then prune
‣ test1 ∧ ¬test2 ∧ test3 …

‣ while dropping a test from a rule improves performance, do so


rule-based version: typically leads to smaller, more interpretable classifiers

may get overlap among rules; if so, e.g., average their predictions

Page 29

Bagging

Bagging = bootstrap aggregating

Can be used with any classifier, but particularly effective with decision trees

Generate M bootstrap resamples

Train a decision tree on each one

Final classifier: vote all M trees
‣ e.g., tree 1 says p(+) = .7, tree 2 says p(+) = .9: predict .8

train: could use different training methods on each resample; choices include candidate splits, pruning strategies

random forests: restrict each tree to use a random subsample of k’<<k attributes; don’t prune

bagging can increase performance substantially (random forests often get state-of-art performance) but reduces interpretability
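A minimal sketch of the resample/train/vote loop (train_tree stands in for any decision-tree learner; names are illustrative):

import numpy as np

def bag(X, y, train_tree, M=100, seed=0):
    # bootstrap aggregating: M resamples of size n, one tree per resample
    rng = np.random.default_rng(seed)
    n = len(y)
    trees, bags = [], []
    for _ in range(M):
        idx = rng.integers(0, n, size=n)      # draw n examples with replacement
        trees.append(train_tree(X[idx], y[idx]))
        bags.append(set(idx.tolist()))        # remember who was in the bag
    return trees, bags

def vote(trees, x, predict_proba):
    # average the trees' p(+), e.g. (.7 + .9) / 2 = .8
    return np.mean([predict_proba(t, x) for t in trees])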

Page 30

Out-of-bag error estimates

Each bag contains (1–1/e) (~63%) of the examples

Use out-of-bag examples to estimate error of each tree

To estimate error of overall vote
‣ for each example, classify using all out-of-bag trees

‣ average across all examples

Conservative: for each example we average over only the ~37% of trees that are out-of-bag for it—but if we have lots of trees, bias is small
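A sketch of the out-of-bag estimate, continuing the hypothetical bag() above (each example is voted by only the trees whose resample missed it):

def oob_error(trees, bags, X, y, predict_proba):
    errs = []
    for i, (x, yi) in enumerate(zip(X, y)):
        out = [t for t, b in zip(trees, bags) if i not in b]
        if not out:
            continue                           # in every bag: no OOB vote
        p = np.mean([predict_proba(t, x) for t in out])
        errs.append((p > 0.5) != yi)           # y assumed in {0, 1}
    return np.mean(errs)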

Page 31

Boosting

Page 32

Voted classifiers

f: R^k → {–1, 1}

Voted classifier: ∑_j f_j(x) > 0

Weighted vote: ∑_j α_j f_j(x) > 0
‣ assume wlog α_j > 0
‣ optionally scale so α_j sum to 1

5 halfspaces (or add constant classifier for |H|=6)

typically a larger hypothesis space (vs. base set of classifiers) -- e.g., voted halfspaces

terminology: base f_j are called “weak hypotheses” to distinguish from the stronger class of voted f_j

wlog: since we assume hypothesis space is closed under negation (for each f, -f also in space)

idea: learn a voted classifier by MCLE
potential benefit: improved performance, if we can avoid overfitting due to bigger hypothesis space

Page 33

Voted classifiers—the matrix

[Figure: matrix with n training examples as rows and T distinct classifiers as columns (T ≤ 2^n)]

write f_1 ... f_T for all *distinct* classifiers in our hypothesis space (at most 2^n for n training examples)

write z_ij = y_i f_j(x_i) = does f_j get (x_i, y_i) right? (matrix dimensions: #examples × #classifiers)

Page 34

Finding the best voted classifier

write s_i = y_i [weighted vote]; voted classifier is right on (x_i, y_i) iff s_i > 0

s_i = y_i \sum_j \alpha_j f_j(x_i) = \sum_j \alpha_j z_{ij}

MCLE: \min_{\alpha, s} L = \sum_i h(s_i) s.t. s = Z\alpha, where h(s) = \log(1+\exp(-s))
this is a convex program (since h is convex), but too big to solve directly -- how to do it?

Page 35

Coordinate descent

Repeat:
‣ Find an index j s.t. dL/dα_j < 0
‣ Increase α_j

"Repeatedly increase the weight of a useful weak hypothesis"

find an index j s.t. dL/d\alpha_j < 0 [by assumption, don’t have to check separately for dL/d\alpha_j > 0]

concretely, \alpha_j += \alpha [there are other strategies, but this is actually one of the best]

to make this fast, “find an index j” has to be efficient (can’t enumerate columns)

Page 36

Finding a good weak hypothesis

Find j s.t. -dL/dα_j is big:

-dL/dα_j = \sum_i -h'(s_i) ds_i/dα_j
         = \sum_i -h'(s_i) z_{ij}
         = \sum_i -h'(s_i) y_i f_j(x_i)
         = "edge" of classifier j

want to find j to make the edge as big as possible
-h' > 0, so want to make each y_i f_j(x_i) big [sketch -h']

y_i f_j(x_i) is big <==> f_j gets x_i confidently right
weights -h'(s_i): example i is important if the current voted classifier gets it confidently wrong
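Putting pages 34–36 together, a small sketch of this coordinate-descent boosting loop on the logistic loss (weak_learner is an assumed routine that, given example weights, returns an f with a large edge; y in {-1,+1}):

import numpy as np

def boost(X, y, weak_learner, rounds=100, step=0.1):
    # greedy coordinate descent on L = sum_i log(1 + exp(-s_i))
    fs, alphas = [], []
    s = np.zeros(len(y))                      # s_i = y_i * (weighted vote)
    for _ in range(rounds):
        w = 1.0 / (1.0 + np.exp(s))           # -h'(s_i): big if confidently wrong
        f = weak_learner(X, y, w)
        z = y * np.array([f(x) for x in X])   # the z_ij column for this f
        if w @ z <= 0:                        # edge = -dL/dalpha_j; no progress
            break
        fs.append(f)
        alphas.append(step)                   # alpha_j += step
        s += step * z
    return fs, alphas    # final classifier: sign(sum_j alpha_j f_j(x))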

Page 37

Weak learner

Weak learner = weighted classification algorithm that gets edge ≥ γ
‣ i.e., finds a classifier that performs well on currently-wrong examples

Thm: if weak learner always succeeds, L → 0

Page 38

Discussion

Can take h(s) to be any convex, decreasing fn of s
‣ e.g., exp(–s) or hinge loss max(0, 1–s)
‣ we used log(1+exp(–s)) — discrete variant of LogitBoost
‣ exp(–s) leads to AdaBoost

Can use confidence-rated classifiers (range [–1,1]) or regression algorithms

Weak hypothesis class: usually want a less-complex class than we’d use on its own—mitigates overfitting
‣ same “slow learning rate” idea as for decision tree splits

original (real-valued) LogitBoost uses regression as weak learner, like IRLS

Page 39

In practice

Boosting typically takes training error quickly to 0
‣ could also stop with failure of weak learner, but this doesn’t typically happen

Tends to keep increasing margin, even after training error is 0

Tends not to overfit—usually attributed to margin


Page 40

Is weak learning reasonable?

If weak learner can always succeed, then ∃ a vote that gets every training example right

If weak learner can fail, boosting seems like a good algorithm for making it do so
‣ but as we said, weak learner usually keeps working

first line: a theorem of Freund & Schapire