Lecture 8: Decision Trees & k-Nearest Neighbors


Description

Decision trees, best split, entropy, information gain, gain ratio, k-Nearest Neighbors, distance metric.

Transcript of Lecture 8: Decision Trees & k-Nearest Neighbors

Page 1: Lecture 8: Decision Trees & k-Nearest Neighbors

Machine Learning for Language Technology
Lecture 8: Decision Trees and k-Nearest Neighbors
Marina Santini

Department of Linguistics and Philology, Uppsala University, Uppsala, Sweden

Autumn 2014

Acknowledgement: Thanks to Prof. Joakim Nivre for course design and materials

Page 2: Lecture 8: Decision Trees & k-Nearest Neighbors

Supervised Classification

• Divide instances into (two or more) classes
  – Instance (feature vector): $x = (x_1, \ldots, x_m)$
    • Features may be categorical or numerical
  – Class (label): $y$
  – Training data: $X = \{(x^t, y^t)\}_{t=1}^{N}$
• Classification in Language Technology
  – Spam filtering (spam vs. non-spam)
  – Spelling error detection (error vs. no error)
  – Text categorization (news, economy, culture, sport, ...)
  – Named entity classification (person, location, organization, ...)

Page 3: Lecture 8: Decision Trees & k-Nearest Neighbors

Models for Classification

• Generative probabilistic models:
  – Model of P(x, y)
  – Naive Bayes
• Conditional probabilistic models:
  – Model of P(y | x)
  – Logistic regression
• Discriminative models:
  – No explicit probability model
  – Decision trees, nearest neighbor classification
  – Perceptron, support vector machines, MIRA

Page 4: Lecture 8: Decision Trees & k-Nearest Neighbors

Repeating…

• Noise
  – Data cleaning is expensive and time consuming
• Margin
• Inductive bias

Types of inductive biases
• Minimum cross-validation error
• Maximum margin
• Minimum description length
• [...]

Page 5: Lecture 8: Decision Trees & k-Nearest Neighbors

DECISION TREES

Page 6: Lecture 8: Decision Trees & k-Nearest Neighbors

Decision Trees

• Hierarchical tree structure for classification
  – Each internal node specifies a test of some feature
  – Each branch corresponds to a value for the tested feature
  – Each leaf node provides a classification for the instance
• Represents a disjunction of conjunctions of constraints
  – Each path from root to leaf specifies a conjunction of tests
  – The tree itself represents the disjunction of all paths

Page 7: Lecture 8: Decision Trees & k-Nearest Neighbors

Decision Tree

Page 8: Lecture 8: Decision Trees & k-Nearest Neighbors

Divide and Conquer

• Internal decision nodes
  – Univariate: Uses a single attribute, $x_i$
    • Numeric $x_i$: binary split: $x_i > w_m$
    • Discrete $x_i$: n-way split for n possible values
  – Multivariate: Uses all attributes, $x$
• Leaves
  – Classification: class labels (or proportions)
  – Regression: r average (or local fit)
• Learning:
  – Greedy recursive algorithm (see the sketch below)
  – Find best split $X = (X_1, \ldots, X_p)$, then induce tree for each $X_i$
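The greedy recursive learning step can be made concrete with a short sketch. This is a minimal illustration of my own (function names such as grow_tree and the dict-based tree representation are assumptions, not the lecture's code), using entropy as the impurity measure and n-way splits on categorical features:

    import math
    from collections import Counter

    def entropy(labels):
        n = len(labels)
        return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

    def grow_tree(data, features):
        # data: list of (feature_tuple, label); features: list of feature indices.
        labels = [y for _, y in data]
        if len(set(labels)) == 1 or not features:         # pure node or no tests left
            return Counter(labels).most_common(1)[0][0]   # leaf = majority class
        def split_impurity(f):                            # impurity after an n-way split on f
            values = Counter(x[f] for x, _ in data)
            return sum(cnt / len(data) * entropy([y for x, y in data if x[f] == v])
                       for v, cnt in values.items())
        best = min(features, key=split_impurity)          # best split = lowest impurity
        rest = [f for f in features if f != best]
        return {"feature": best,
                "branches": {v: grow_tree([(x, y) for x, y in data if x[best] == v], rest)
                             for v in set(x[best] for x, _ in data)}}

Here a leaf is a class label and an internal node is a dict; a call such as grow_tree(data, features=list(range(num_features))) builds the whole tree by following each branch value recursively.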


Page 9: Lecture 8: Decision Trees & k-Nearest Neighbors

Classification Trees (ID3, CART, C4.5)

• For node m, $N_m$ instances reach m, $N_m^i$ belong to class $C_i$:
  $P(C_i \mid x, m) \equiv p_m^i = \frac{N_m^i}{N_m}$
• Node m is pure if $p_m^i$ is 0 or 1
• Measure of impurity is entropy:
  $I_m = -\sum_{i=1}^{K} p_m^i \log_2 p_m^i$

Page 10: Lecture 8: Decision Trees & k-Nearest Neighbors

Example: Entropy

• Assume two classes (C1, C2) and four instances (x1, x2, x3, x4)
• Case 1:
  – C1 = {x1, x2, x3, x4}, C2 = { }
  – $I_m = -(1 \log_2 1 + 0 \log_2 0) = 0$
• Case 2:
  – C1 = {x1, x2, x3}, C2 = {x4}
  – $I_m = -(0.75 \log_2 0.75 + 0.25 \log_2 0.25) \approx 0.81$
• Case 3:
  – C1 = {x1, x2}, C2 = {x3, x4}
  – $I_m = -(0.5 \log_2 0.5 + 0.5 \log_2 0.5) = 1$ (checked in the sketch below)
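The three cases above can be checked with a few lines of Python; this is a minimal sketch of my own (the entropy helper is not code from the lecture):

    import math

    def entropy(proportions):
        # I_m = -sum_i p_i * log2(p_i), with 0 * log2(0) taken as 0
        return -sum(p * math.log2(p) for p in proportions if p > 0)

    print(entropy([1.0, 0.0]))    # Case 1: 0.0  (pure node)
    print(entropy([0.75, 0.25]))  # Case 2: ~0.81
    print(entropy([0.5, 0.5]))    # Case 3: 1.0  (maximally impure for two classes)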


Page 11: Lecture 8: Decision Trees & k-Nearest Neighbors

Best Split

• If node m is pure, generate a leaf and stop; otherwise split with test t and continue recursively
• Find the test that minimizes impurity
• Impurity after split with test t:
  – $N_{mj}$ of the $N_m$ instances take branch j, and $N_{mj}^i$ of these belong to $C_i$:
    $P(C_i \mid x, m, j) \equiv p_{mj}^i = \frac{N_{mj}^i}{N_{mj}}$
  – $I_m^t = -\sum_{j=1}^{n} \frac{N_{mj}}{N_m} \sum_{i=1}^{K} p_{mj}^i \log_2 p_{mj}^i$
  – (Before the split, as on the previous slide: $P(C_i \mid x, m) \equiv p_m^i = \frac{N_m^i}{N_m}$ and $I_m = -\sum_{i=1}^{K} p_m^i \log_2 p_m^i$.)

Page 12: Lecture 8: Decision Trees & k-Nearest Neighbors


Page 13: Lecture 8: Decision Trees & k-Nearest Neighbors

Information Gain

• We want to determine which attribute in a given set of training feature vectors is most useful for discriminating between the classes to be learned.
• Information gain tells us how important a given attribute of the feature vectors is.
• We will use it to decide the ordering of attributes in the nodes of a decision tree.

Page 14: Lecture 8: Decision Trees & k-Nearest Neighbors

Information Gain and Gain Ratio

• Choosing the test that minimizes impurity maximizes the information gain (IG):
  $IG_m^t = I_m - I_m^t$
  with $I_m = -\sum_{i=1}^{K} p_m^i \log_2 p_m^i$ and $I_m^t = -\sum_{j=1}^{n} \frac{N_{mj}}{N_m} \sum_{i=1}^{K} p_{mj}^i \log_2 p_{mj}^i$
• Information gain prefers features with many values
• The normalized version is called gain ratio (GR), see the sketch below:
  $GR_m^t = \frac{IG_m^t}{V_m^t}$, where $V_m^t = -\sum_{j=1}^{n} \frac{N_{mj}}{N_m} \log_2 \frac{N_{mj}}{N_m}$
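As a minimal sketch of these two quantities (helper and variable names are my own, not the lecture's), the following computes $IG_m^t$ and $GR_m^t$ for a candidate split, given the parent node's labels and the label lists of the branches:

    import math
    from collections import Counter

    def entropy(labels):
        n = len(labels)
        return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

    def information_gain(parent_labels, branches):
        # branches: list of label lists, one per branch j of the candidate test t
        n_m = len(parent_labels)
        impurity_after = sum(len(b) / n_m * entropy(b) for b in branches)   # I_m^t
        return entropy(parent_labels) - impurity_after                      # IG_m^t

    def gain_ratio(parent_labels, branches):
        n_m = len(parent_labels)
        v = -sum(len(b) / n_m * math.log2(len(b) / n_m) for b in branches if b)  # V_m^t
        return information_gain(parent_labels, branches) / v                     # GR_m^t

    parent = ["A", "A", "A", "A", "B", "B", "B", "B"]
    branches = [["A", "A", "A", "B"], ["B", "B", "B", "A"]]
    print(information_gain(parent, branches))  # ~0.19
    print(gain_ratio(parent, branches))        # ~0.19 (V_m^t = 1 for an even two-way split)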

Page 15: Lecture 8: Decision Trees & k-Nearest Neighbors

Pruning Trees

• Decision trees are susceptible to overfitting
• Remove subtrees for better generalization:
  – Prepruning: Early stopping (e.g., with entropy threshold)
  – Postpruning: Grow whole tree, then prune subtrees
• Prepruning is faster, postpruning is more accurate (requires a separate validation set)

Page 16: Lecture 8: Decision Trees & k-Nearest Neighbors

Rule Extraction from Trees

C4.5Rules (Quinlan, 1993)

Page 17: Lecture 8: Decision Trees & k-Nearest Neighbors

Learning Rules

• Rule induction is similar to tree induction but
  – tree induction is breadth-first
  – rule induction is depth-first (one rule at a time)
• Rule learning:
  – A rule is a conjunction of terms (cf. tree path)
  – A rule covers an example if all terms of the rule evaluate to true for the example (cf. sequence of tests)
  – Sequential covering: Generate rules one at a time until all positive examples are covered (see the sketch below)
  – IREP (Fürnkranz and Widmer, 1994), Ripper (Cohen, 1995)
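To make the depth-first, one-rule-at-a-time idea concrete, here is a naive sketch of sequential covering. The rule representation as (feature index, value) pairs, the greedy term selection, and all function names are my own simplifications; IREP and Ripper add pruning and more careful stopping criteria:

    def covers(rule, x):
        # A rule is a conjunction of (feature index, required value) terms.
        return all(x[i] == v for i, v in rule)

    def learn_one_rule(pos, neg, n_features):
        # Greedily add the term that excludes the most negatives while still
        # covering at least one positive; stop when no negatives remain or no
        # new term helps (e.g., identical positive and negative vectors).
        rule = []
        while neg:
            candidates = {(i, x[i]) for x in pos for i in range(n_features)} - set(rule)
            if not candidates:
                break
            best = max(candidates,
                       key=lambda t: sum(not covers(rule + [t], z) for z in neg))
            rule.append(best)
            pos = [x for x in pos if covers(rule, x)]
            neg = [z for z in neg if covers(rule, z)]
        return rule

    def sequential_covering(pos, neg, n_features):
        # Generate rules one at a time until all positive examples are covered.
        rules = []
        while pos:
            rule = learn_one_rule(pos, neg, n_features)
            rules.append(rule)
            pos = [x for x in pos if not covers(rule, x)]
        return rules

    # Example: all positives share the value 'a' in feature 0, negatives do not.
    print(sequential_covering([("a", "x"), ("a", "y")], [("b", "x"), ("c", "y")], 2))
    # -> [[(0, 'a')]], i.e., one rule: feature 0 == 'a'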

Page 18: Lecture 8: Decision Trees & k-Nearest Neighbors

Properties of Decision Trees

• Decision trees are appropriate for classification when:
  – Features can be both categorical and numeric
  – Disjunctive descriptions may be required
  – Training data may be noisy (missing values, incorrect labels)
  – Interpretation of the learned model is important (rules)
• Inductive bias of (most) decision tree learners:
  1. Prefers trees with informative attributes close to the root
  2. Prefers smaller trees over bigger ones (with pruning)
  3. Preference bias (incomplete search of complete space)

Page 19: Lecture 8: Decision Trees & k-Nearest Neighbors

K-NEAREST NEIGHBORS

Page 20: Lecture 8: Decision Trees & k-Nearest Neighbors

Nearest Neighbor Classification

• An old idea
• Key components:
  – Storage of old instances
  – Similarity-based reasoning to new instances

This “rule of nearest neighbor” has considerable elementary intuitive appeal and probably corresponds to practice in many situations. For example, it is possible that much medical diagnosis is influenced by the doctor's recollection of the subsequent history of an earlier patient whose symptoms resemble in some way those of the current patient. (Fix and Hodges, 1952)

Page 21: Lecture 8: Decision Trees & k-Nearest Neighbors

k-Nearest Neighbour

• Learning:
  – Store training instances in memory
• Classification:
  – Given a new test instance x:
    • Compare it to all stored instances
    • Compute a distance between x and each stored instance $x^t$
    • Keep track of the k closest instances (nearest neighbors)
  – Assign to x the majority class of the k nearest neighbours (see the sketch below)
• A geometric view of learning
  – Proximity in (feature) space → same class
  – The smoothness assumption
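A minimal sketch of this procedure (function names are my own, not the lecture's): store the training set and classify by majority vote among the k instances closest to x under a given distance function:

    from collections import Counter

    def knn_classify(x, training_data, k, distance):
        # training_data: list of (feature_vector, label) pairs;
        # distance: e.g. the overlap metric defined on a later slide.
        neighbors = sorted(training_data, key=lambda pair: distance(x, pair[0]))[:k]
        votes = Counter(label for _, label in neighbors)
        return votes.most_common(1)[0][0]   # majority class (ties broken arbitrarily)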


Page 22: Lecture 8: Decision Trees & k-Nearest Neighbors

Eager and Lazy Learning

• Eager learning (e.g., decision trees)
  – Learning – induce an abstract model from data
  – Classification – apply model to new data
• Lazy learning (a.k.a. memory-based learning)
  – Learning – store data in memory
  – Classification – compare new data to data in memory
  – Properties:
    • Retains all the information in the training set – no abstraction
    • Complex hypothesis space – suitable for natural language?
    • Main drawback – classification can be very inefficient

Page 23: Lecture 8: Decision Trees & k-Nearest Neighbors

Dimensions of a k-NN Classifier

• Distance metric
  – How do we measure distance between instances?
  – Determines the layout of the instance space
• The k parameter
  – How large a neighborhood should we consider?
  – Determines the complexity of the hypothesis space

Page 24: Lecture 8: Decision Trees & k-Nearest Neighbors

Distance Metric 1

• Overlap = count of mismatching features (see the sketch below)

  $\Delta(x, z) = \sum_{i=1}^{m} \delta(x_i, z_i)$

  $\delta(x_i, z_i) = \begin{cases} \frac{|x_i - z_i|}{\max_i - \min_i} & \text{if numeric} \\ 0 & \text{else, if } x_i = z_i \\ 1 & \text{else, if } x_i \neq z_i \end{cases}$
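A minimal sketch of this metric (function names and the way numeric ranges are passed in are my own assumptions, not the lecture's code):

    def delta(xi, zi, value_range=None):
        # Per-feature distance: scaled absolute difference for numeric features,
        # 0/1 mismatch for categorical ones.
        if value_range is not None:             # numeric feature with known (min, max)
            lo, hi = value_range
            return abs(xi - zi) / (hi - lo)
        return 0 if xi == zi else 1              # categorical feature

    def overlap_distance(x, z, ranges=None):
        # ranges[i] = (min_i, max_i) for numeric feature i, None for categorical ones.
        ranges = ranges or [None] * len(x)
        return sum(delta(xi, zi, r) for xi, zi, r in zip(x, z, ranges))

    print(overlap_distance(("a", "b", "a", "c"), ("a", "b", "b", "a")))  # 2 mismatches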

Page 25: Lecture 8: Decision Trees & k-Nearest Neighbors

Distance Metric 2

• MVDM = Modified Value Difference Metric (see the sketch below)

  $\Delta(x, z) = \sum_{i=1}^{m} \delta(x_i, z_i)$

  $\delta(x_i, z_i) = \sum_{j=1}^{K} \left| P(C_j \mid x_i) - P(C_j \mid z_i) \right|$
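A minimal sketch of the MVDM delta for a single feature (names are my own); P(C_j | value) is estimated from the class counts of that value in the training data:

    from collections import Counter, defaultdict

    def mvdm_delta(xi, zi, value_class_pairs):
        # value_class_pairs: (feature value, class label) pairs observed in training
        counts = defaultdict(Counter)
        for v, c in value_class_pairs:
            counts[v][c] += 1
        classes = {c for _, c in value_class_pairs}
        def p(v, c):
            return counts[v][c] / sum(counts[v].values())
        return sum(abs(p(xi, c) - p(zi, c)) for c in classes)

    # Values 'a' and 'b' have identical class distributions, so their distance is 0.
    pairs = [("a", "A"), ("a", "B"), ("b", "A"), ("b", "B"), ("c", "A")]
    print(mvdm_delta("a", "b", pairs))  # 0.0
    print(mvdm_delta("a", "c", pairs))  # |0.5 - 1.0| + |0.5 - 0.0| = 1.0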

Page 26: Lecture 8: Decision Trees & k-Nearest Neighbors

The k parameter

• Tunes the complexity of the hypothesis space
  – If k = 1, every instance has its own neighborhood
  – If k = N, all the feature space is one neighborhood

[Figure: k-NN decision boundaries for k = 1 and k = 15]

• Error of a hypothesis h on a validation set V (e.g., for choosing k):
  $E = E(h \mid V) = \sum_{t=1}^{M} 1\big(h(x^t) \neq r^t\big)$
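The last formula simply counts misclassifications on a validation set; a natural use, sketched here under the assumption that a classifier such as the knn_classify function sketched earlier is available, is to pick the k with the lowest count:

    # A minimal sketch (my own names): E(h|V) = sum_t 1(h(x^t) != r^t), the number
    # of misclassifications that hypothesis h makes on a validation set V.
    def validation_error(predict, validation_set):
        return sum(predict(x) != r for x, r in validation_set)

    # Hypothetical usage for choosing k (assumes knn_classify and overlap_distance
    # as in the earlier sketches, plus training_data and validation_set):
    # best_k = min([1, 3, 5, 7], key=lambda k: validation_error(
    #     lambda x: knn_classify(x, training_data, k, overlap_distance),
    #     validation_set))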

Page 27: Lecture 8: Decision Trees & k-Nearest Neighbors

A Simple Example

Overlap distance: $\Delta(x, z) = \sum_{i=1}^{m} \delta(x_i, z_i)$, with $\delta$ defined as for Distance Metric 1

Training set:
1. (a, b, a, c) → A
2. (a, b, c, a) → B
3. (b, a, c, c) → C
4. (c, a, b, c) → A

New instance:
5. (a, b, b, a)

Distances (overlap):
Δ(1, 5) = 2
Δ(2, 5) = 1
Δ(3, 5) = 4
Δ(4, 5) = 3

k-NN classification (reproduced in the sketch below):
1-NN(5) = B
2-NN(5) = A/B
3-NN(5) = A
4-NN(5) = A
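The example can be reproduced directly; this sketch (my own code, not the lecture's) prints the four overlap distances and the majority vote for each k:

    from collections import Counter

    training = [(("a", "b", "a", "c"), "A"), (("a", "b", "c", "a"), "B"),
                (("b", "a", "c", "c"), "C"), (("c", "a", "b", "c"), "A")]
    new = ("a", "b", "b", "a")

    overlap = lambda x, z: sum(xi != zi for xi, zi in zip(x, z))
    print([overlap(new, x) for x, _ in training])        # [2, 1, 4, 3]

    ranked = sorted(training, key=lambda pair: overlap(new, pair[0]))
    for k in (1, 2, 3, 4):
        votes = Counter(label for _, label in ranked[:k])
        print(k, votes.most_common())   # 1-NN: B; 2-NN: tie A/B; 3-NN and 4-NN: A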

Page 28: Lecture 8: Decision Trees & k-Nearest Neighbors

Further Variations on k-NN

• Feature weights:
  – The overlap metric gives all features equal weight
  – Features can be weighted by IG or GR
• Weighted voting:
  – The normal decision rule gives all neighbors equal weight
  – Instances can be weighted by (inverse) distance (see the sketch below)
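A minimal sketch of distance-weighted voting (names are my own): each of the k nearest neighbors votes with weight 1/(distance + ε) instead of 1:

    from collections import defaultdict

    def weighted_vote(neighbors, eps=1e-9):
        # neighbors: (distance, label) pairs for the k nearest instances
        scores = defaultdict(float)
        for d, label in neighbors:
            scores[label] += 1.0 / (d + eps)
        return max(scores, key=scores.get)

    print(weighted_vote([(1.0, "B"), (1.5, "A"), (2.0, "A")]))  # "A": 1/1.5 + 1/2 > 1/1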

Page 29: Lecture 8: Decision Trees & k-Nearest Neighbors

Properties of k-NN

• Nearest neighbor classification is appropriate when:
  – Features can be both categorical and numeric
  – Disjunctive descriptions may be required
  – Training data may be noisy (missing values, incorrect labels)
  – Fast classification is not crucial
• Inductive bias of k-NN:
  1. Nearby instances should have the same label (smoothness assumption)
  2. All features are equally important (without feature weights)
  3. Complexity tuned by the k parameter

Page 30: Lecture 8: Decision Trees & k-Nearest Neighbors

End of Lecture 8