Lecture 8: Decision Trees & k-Nearest Neighbors
Machine Learning for Language Technology
Lecture 8: Decision Trees and k-Nearest Neighbors
Marina Santini
Department of Linguistics and Philology, Uppsala University, Uppsala, Sweden
Autumn 2014
Acknowledgement: Thanks to Prof. Joakim Nivre for course design and materials
1
Supervised Classification
• Divide instances into (two or more) classes
  – Instance (feature vector): x = (x_1, ..., x_m)
    • Features may be categorical or numerical
  – Class (label): y
  – Training data: X = {(x^t, y^t)}_{t=1}^{N}
• Classification in Language Technology
  – Spam filtering (spam vs. non-spam)
  – Spelling error detection (error vs. no error)
  – Text categorization (news, economy, culture, sport, ...)
  – Named entity classification (person, location, organization, ...)
2
Models for Classification
• Generative probabilistic models:
  – Model of P(x, y)
  – Naive Bayes
• Conditional probabilistic models:
  – Model of P(y | x)
  – Logistic regression
• Discriminative models:
  – No explicit probability model
  – Decision trees, nearest neighbor classification
  – Perceptron, support vector machines, MIRA
3
Repeating…
• Noise
  – Data cleaning is expensive and time-consuming
• Margin
• Inductive bias
4
Types of inductive biases
• Minimum cross-validation error
• Maximum margin
• Minimum description length
• [...]
DECISION TREES
5
Decision Trees
• Hierarchical tree structure for classification
  – Each internal node specifies a test of some feature
  – Each branch corresponds to a value for the tested feature
  – Each leaf node provides a classification for the instance
• Represents a disjunction of conjunctions of constraints
  – Each path from root to leaf specifies a conjunction of tests
  – The tree itself represents the disjunction of all paths
6
Decision Tree
7
Divide and Conquer
• Internal decision nodes
  – Univariate: uses a single attribute, x_i
    • Numeric x_i: binary split: x_i > w_m
    • Discrete x_i: n-way split for n possible values
  – Multivariate: uses all attributes, x
• Leaves
  – Classification: class labels (or proportions)
  – Regression: r average (or local fit)
• Learning:
  – Greedy recursive algorithm
  – Find best split X = (X_1, ..., X_p), then induce tree for each X_i
8
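The greedy recursive procedure can be sketched for categorical features. This is a minimal ID3-style sketch using entropy as the impurity measure (defined on the following slides); all function names are mine, not from the lecture.

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Impurity of a node: -sum_i p_i * log2(p_i) over class proportions."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def induce_tree(rows, labels, features):
    """Greedy recursion: stop at a pure node, otherwise split on the
    feature whose n-way split minimizes the weighted child entropy."""
    if len(set(labels)) == 1:            # pure node -> leaf
        return labels[0]
    if not features:                     # no tests left -> majority leaf
        return Counter(labels).most_common(1)[0][0]

    def split_impurity(f):
        sizes = Counter(r[f] for r in rows)
        return sum((n / len(rows)) *
                   entropy([l for r, l in zip(rows, labels) if r[f] == v])
                   for v, n in sizes.items())

    best = min(features, key=split_impurity)
    rest = [f for f in features if f != best]
    branches = {}
    for v in set(r[best] for r in rows):
        sub = [(r, l) for r, l in zip(rows, labels) if r[best] == v]
        branches[v] = induce_tree([r for r, _ in sub], [l for _, l in sub], rest)
    return {"feature": best, "branches": branches}

def classify(tree, row):
    """Follow tests from the root until a leaf (a plain label) is reached."""
    while isinstance(tree, dict):
        tree = tree["branches"][row[tree["feature"]]]
    return tree
```

On a toy dataset where a single feature determines the label, the sketch places that feature at the root and produces pure leaves.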
Classification Trees (ID3, CART, C4.5)
9
• For node m, N_m instances reach m, N_m^i belong to C_i:

  P(C_i | x, m) ≡ p_m^i = N_m^i / N_m

• Node m is pure if p_m^i is 0 or 1
• Measure of impurity is entropy:

  I_m = − Σ_{i=1}^{K} p_m^i log2 p_m^i
Example: Entropy
• Assume two classes (C1, C2) and four instances (x1, x2, x3, x4)
• Case 1:
  – C1 = {x1, x2, x3, x4}, C2 = { }
  – I_m = −(1 log2 1 + 0 log2 0) = 0
• Case 2:
  – C1 = {x1, x2, x3}, C2 = {x4}
  – I_m = −(0.75 log2 0.75 + 0.25 log2 0.25) = 0.81
• Case 3:
  – C1 = {x1, x2}, C2 = {x3, x4}
  – I_m = −(0.5 log2 0.5 + 0.5 log2 0.5) = 1
10
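The three cases above can be checked numerically. A minimal sketch over class counts (base-2 logs, with the usual convention 0 · log 0 = 0):

```python
from math import log2

def entropy(counts):
    """I_m = -sum_i p_i * log2(p_i) over class counts, with 0 log 0 = 0."""
    n = sum(counts)
    s = sum((c / n) * log2(c / n) for c in counts if c > 0)
    return -s if s else 0.0

print(entropy([4, 0]))            # Case 1: 0.0 (pure node)
print(round(entropy([3, 1]), 2))  # Case 2: 0.81
print(entropy([2, 2]))            # Case 3: 1.0 (maximally impure)
```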
Best Split

  P(C_i | x, m, j) ≡ p_{mj}^i = N_{mj}^i / N_{mj}

11

• If node m is pure, generate a leaf and stop; otherwise split with test t and continue recursively
• Find the test that minimizes impurity
• Impurity after split with test t:
  – N_{mj} of N_m take branch j
  – N_{mj}^i belong to C_i

  I_m^t = − Σ_{j=1}^{n} (N_{mj} / N_m) Σ_{i=1}^{K} p_{mj}^i log2 p_{mj}^i
12
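The weighted sum above can be made concrete with a small sketch over per-branch class counts (function names are mine):

```python
from math import log2

def entropy(counts):
    """I_m over class counts, with the convention 0 * log 0 = 0."""
    n = sum(counts)
    s = sum((c / n) * log2(c / n) for c in counts if c > 0)
    return -s if s else 0.0

def impurity_after_split(branches):
    """I_m^t: each branch j carries class counts [N_mj^1, ..., N_mj^K];
    its entropy is weighted by N_mj / N_m."""
    n_m = sum(sum(b) for b in branches)
    return sum((sum(b) / n_m) * entropy(b) for b in branches)

# Splitting the 4-instance node from the entropy example into two
# pure branches drives the impurity to 0:
print(impurity_after_split([[2, 0], [0, 2]]))   # 0.0
```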
Information Gain
• We want to determine which attribute in a given set of training feature vectors is most useful for discriminating between the classes to be learned.
• Information gain tells us how important a given attribute of the feature vectors is.
• We will use it to decide the ordering of attributes in the nodes of a decision tree.
13
Information Gain and Gain Ratio
• Choosing the test that minimizes impurity maximizes the information gain (IG):

  IG_m^t = I_m − I_m^t

  with I_m = − Σ_{i=1}^{K} p_m^i log2 p_m^i and
  I_m^t = − Σ_{j=1}^{n} (N_{mj} / N_m) Σ_{i=1}^{K} p_{mj}^i log2 p_{mj}^i

• Information gain prefers features with many values
• The normalized version is called gain ratio (GR):

  GR_m^t = IG_m^t / V_m^t, where V_m^t = − Σ_{j=1}^{n} (N_{mj} / N_m) log2 (N_{mj} / N_m)

14
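Both quantities can be checked on a toy split, representing each branch by its per-class counts (a sketch; function names are mine):

```python
from math import log2

def entropy(counts):
    """I = -sum_i p_i * log2(p_i) over class counts, with 0 log 0 = 0."""
    n = sum(counts)
    s = sum((c / n) * log2(c / n) for c in counts if c > 0)
    return -s if s else 0.0

def info_gain(parent, branches):
    """IG_m^t = I_m - I_m^t for a node with class counts `parent`
    split into branches of per-class counts."""
    n_m = sum(parent)
    i_after = sum((sum(b) / n_m) * entropy(b) for b in branches)
    return entropy(parent) - i_after

def gain_ratio(parent, branches):
    """GR_m^t = IG_m^t / V_m^t, where V_m^t is the entropy of the
    branch sizes themselves (penalizes many-valued features)."""
    v = entropy([sum(b) for b in branches])
    return info_gain(parent, branches) / v

# A 2-way pure split of a balanced 4-instance node gains a full bit:
print(info_gain([2, 2], [[2, 0], [0, 2]]))   # 1.0
print(gain_ratio([2, 2], [[2, 0], [0, 2]]))  # 1.0
```

A 4-way split into singleton branches has the same IG (1.0) but a gain ratio of only 0.5, which is exactly the normalization against many-valued features at work.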
Pruning Trees
• Decision trees are susceptible to overfitting
• Remove subtrees for better generalization:
  – Prepruning: early stopping (e.g., with entropy threshold)
  – Postpruning: grow whole tree, then prune subtrees
• Prepruning is faster, postpruning is more accurate (requires a separate validation set)
15
Rule Extraction from Trees
16
C4.5Rules (Quinlan, 1993)
Learning Rules
• Rule induction is similar to tree induction, but
  – tree induction is breadth-first
  – rule induction is depth-first (one rule at a time)
• Rule learning:
  – A rule is a conjunction of terms (cf. tree path)
  – A rule covers an example if all terms of the rule evaluate to true for the example (cf. sequence of tests)
  – Sequential covering: generate rules one at a time until all positive examples are covered
  – IREP (Fürnkranz and Widmer, 1994), Ripper (Cohen, 1995)
17
Properties of Decision Trees
• Decision trees are appropriate for classification when:
  – Features can be both categorical and numeric
  – Disjunctive descriptions may be required
  – Training data may be noisy (missing values, incorrect labels)
  – Interpretation of the learned model is important (rules)
• Inductive bias of (most) decision tree learners:
  1. Prefers trees with informative attributes close to the root
  2. Prefers smaller trees over bigger ones (with pruning)
  3. Preference bias (incomplete search of a complete space)
18
K-NEAREST NEIGHBORS
19
Nearest Neighbor Classification
• An old idea
• Key components:
  – Storage of old instances
  – Similarity-based reasoning to new instances
20
This “rule of nearest neighbor” has considerable elementary intuitive appeal and probably corresponds to practice in many situations. For example, it is possible that much medical diagnosis is influenced by the doctor's recollection of the subsequent history of an earlier patient whose symptoms resemble in some way those of the current patient. (Fix and Hodges, 1952)
k-Nearest Neighbour
• Learning:
  – Store training instances in memory
• Classification:
  – Given a new test instance x,
    • Compare it to all stored instances
    • Compute a distance between x and each stored instance x^t
    • Keep track of the k closest instances (the nearest neighbors)
  – Assign to x the majority class of the k nearest neighbours
• A geometric view of learning
  – Proximity in (feature) space → same class
  – The smoothness assumption
21
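The classification step above can be sketched in a few lines (function names are mine; ties in the majority vote are broken by `Counter`'s insertion order rather than any principled rule):

```python
from collections import Counter

def overlap(x, z):
    """Overlap distance: the number of mismatching features."""
    return sum(xi != zi for xi, zi in zip(x, z))

def knn_classify(x, train, k, dist=overlap):
    """Compare x to every stored (instance, label) pair, keep the k
    closest, and return the majority class among them."""
    nearest = sorted(train, key=lambda pair: dist(x, pair[0]))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]
```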
Eager and Lazy Learning
• Eager learning (e.g., decision trees)
  – Learning: induce an abstract model from data
  – Classification: apply the model to new data
• Lazy learning (a.k.a. memory-based learning)
  – Learning: store data in memory
  – Classification: compare new data to data in memory
  – Properties:
    • Retains all the information in the training set – no abstraction
    • Complex hypothesis space – suitable for natural language?
    • Main drawback – classification can be very inefficient
22
Dimensions of a k-NN Classifier
• Distance metric
  – How do we measure the distance between instances?
  – Determines the layout of the instance space
• The k parameter
  – How large a neighborhood should we consider?
  – Determines the complexity of the hypothesis space
23
Distance Metric 1
• Overlap = count of mismatching features

  Δ(x, z) = Σ_{i=1}^{m} δ(x_i, z_i)

  δ(x_i, z_i) = |x_i − z_i| / (max_i − min_i)   if feature i is numeric
              = 0                               if x_i = z_i
              = 1                               if x_i ≠ z_i

24
Distance Metric 2
• MVDM = Modified Value Difference Metric

  Δ(x, z) = Σ_{i=1}^{m} δ(x_i, z_i)

  δ(x_i, z_i) = Σ_{j=1}^{K} |P(C_j | x_i) − P(C_j | z_i)|

25
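A sketch of the metric, assuming the conditional class probabilities P(C_j | x_i) have already been estimated from training counts; the `cond_prob` callback and its signature are my invention for illustration:

```python
def mvdm(x, z, cond_prob, num_classes):
    """MVDM: Delta(x, z) = sum_i sum_j |P(C_j | x_i) - P(C_j | z_i)|,
    where cond_prob(j, i, v) returns the estimate of P(C_j | feature i = v)."""
    return sum(
        sum(abs(cond_prob(j, i, xi) - cond_prob(j, i, zi))
            for j in range(num_classes))
        for i, (xi, zi) in enumerate(zip(x, z)))
```

Two values are maximally distant when they never co-occur with the same class, and at distance 0 when their class distributions agree exactly.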
The k parameter
• Tunes the complexity of the hypothesis space
  – If k = 1, every instance has its own neighborhood
  – If k = N, all the feature space is one neighborhood
26

[Figure: decision boundaries for k = 1 and k = 15]

• k can be tuned by minimizing the error on a validation set V:

  E(h | V) = Σ_{t=1}^{M} 1(h(x^t) ≠ r^t)
A Simple Example
• Overlap distance: Δ(x, z) = Σ_{i=1}^{m} δ(x_i, z_i)

27

Training set:
1. (a, b, a, c) → A
2. (a, b, c, a) → B
3. (b, a, c, c) → C
4. (c, a, b, c) → A

New instance:
5. (a, b, b, a)

Distances (overlap):
Δ(1, 5) = 2
Δ(2, 5) = 1
Δ(3, 5) = 4
Δ(4, 5) = 3

k-NN classification:
1-NN(5) = B
2-NN(5) = A/B
3-NN(5) = A
4-NN(5) = A
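The distances and classifications in this example can be reproduced with a short sketch (`overlap` and `knn` are my names for the slide's metric and decision rule):

```python
from collections import Counter

def overlap(x, z):
    """Count of mismatching features."""
    return sum(xi != zi for xi, zi in zip(x, z))

train = [(("a", "b", "a", "c"), "A"),
         (("a", "b", "c", "a"), "B"),
         (("b", "a", "c", "c"), "C"),
         (("c", "a", "b", "c"), "A")]
new = ("a", "b", "b", "a")

print([overlap(new, x) for x, _ in train])    # [2, 1, 4, 3]

def knn(x, k):
    nearest = sorted(train, key=lambda pair: overlap(x, pair[0]))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]

print(knn(new, 1))   # B  (instance 2 is the single closest)
print(knn(new, 3))   # A  (neighbors 2, 1, 4 vote B, A, A)
```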
Further Variations on k-NN
• Feature weights:
  – The overlap metric gives all features equal weight
  – Features can be weighted by IG or GR
• Weighted voting:
  – The normal decision rule gives all neighbors equal weight
  – Instances can be weighted by (inverse) distance
28
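A sketch of distance-weighted voting, using the common 1/d weighting (my choice; the slide leaves the weighting function open), with a large finite weight for exact matches to avoid division by zero:

```python
from collections import defaultdict

def overlap(x, z):
    """Overlap distance: number of mismatching features."""
    return sum(xi != zi for xi, zi in zip(x, z))

def weighted_knn(x, train, k, dist=overlap):
    """Each of the k nearest neighbors votes with weight 1 / distance,
    so closer instances count for more."""
    nearest = sorted(train, key=lambda pair: dist(x, pair[0]))[:k]
    votes = defaultdict(float)
    for inst, label in nearest:
        d = dist(x, inst)
        votes[label] += 1e9 if d == 0 else 1.0 / d
    return max(votes, key=votes.get)
```

On the toy training set from the earlier simple example, weighted voting with k = 2 resolves the A/B tie in favor of B, the closer neighbor.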
Properties of k-NN
• Nearest neighbor classification is appropriate when:
  – Features can be both categorical and numeric
  – Disjunctive descriptions may be required
  – Training data may be noisy (missing values, incorrect labels)
  – Fast classification is not crucial
• Inductive bias of k-NN:
  1. Nearby instances should have the same label (the smoothness assumption)
  2. All features are equally important (without feature weights)
  3. Complexity tuned by the k parameter
29
End of Lecture 8
30