-
CSE 151 Machine Learning
Instructor: Kamalika Chaudhuri
-
Linear Classification
Given labeled data (xi, yi), i = 1, .., n, where xi is a feature vector and yi is a label (+1 or -1), find a hyperplane to separate + from -
-
Linear Classification Process
Training Data → Classification Algorithm → Prediction Rule: Vector w
[Figure: training points labeled + and -, separated by a hyperplane through the origin O with normal vector w]
How do you classify a test example x?
Output y = sign(⟨w, x⟩)
-
[Figure: labeled points separated by a hyperplane through the origin O, with normal vector w]
Linear Classification
Given labeled data (xi, yi), i=1,..,n, where y is +1 or -1,
Find a hyperplane through the origin to separate + from -
w: normal vector to the hyperplane
For a point x on one side of the hyperplane, ⟨w, x⟩ > 0
For a point x on the other side, ⟨w, x⟩ < 0
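A minimal sketch of this prediction rule in Python (NumPy assumed); the weight vector and test points below are made up for illustration.

import numpy as np

def predict(w, x):
    # Classify x by which side of the hyperplane {z : <w, z> = 0} it lies on.
    return 1 if np.dot(w, x) > 0 else -1

# Hypothetical separator: w = (1, -1) labels points below the line x2 = x1 as +1.
w = np.array([1.0, -1.0])
print(predict(w, np.array([2.0, 0.5])))   # 1
print(predict(w, np.array([0.0, 3.0])))   # -1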
-
The Perceptron Algorithm
Goal: Given labeled data (xi, yi), i = 1, .., n, where y is +1 or -1, find a vector w such that the corresponding hyperplane separates + from -
Perceptron Algorithm:
1. Initially, w1 = y1 x1
2. For t = 2, 3, ...:
   If yt⟨wt, xt⟩ < 0, then: wt+1 = wt + yt xt
   Else: wt+1 = wt
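A short sketch of one pass of this update rule (NumPy assumed), following the pseudocode above with the data presented in the given order.

import numpy as np

def perceptron(X, y):
    # X: (n, d) array of points, y: labels in {+1, -1}.
    w = y[0] * X[0]                      # w1 = y1 x1
    for t in range(1, len(X)):
        if y[t] * np.dot(w, X[t]) < 0:   # mistake on (xt, yt)
            w = w + y[t] * X[t]          # wt+1 = wt + yt xt
        # else: wt+1 = wt (no change)
    return w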
-
Linear Separability
[Figure: left, a linearly separable dataset; right, a dataset that is not linearly separable]
-
Measure of Separability: Margin
For a unit vector w, and training set S, margin of w is defined as:
γ = min_{x ∈ S} |⟨w, x⟩| / ‖x‖
min cosine of angle between w and x, for any x in S
[Figure: a low-margin separator, where the closest point makes a large angle θ with w, and a high-margin separator, where θ is small]
-
Real Examples: Margin
For a unit vector w, and training set S, margin of w is defined as:
γ = min_{x ∈ S} |⟨w, x⟩| / ‖x‖
min cosine of angle between w and x, for any x in data
[Figure: real datasets with a low-margin separator w and a high-margin separator w]
-
When does Perceptron work?
Observations:
1. The lower the margin, the more mistakes
2. May need more than one pass over the training data to get a classifier which makes no mistakes on the training set
Theorem: If the training points are linearly separable with margin γ, and if ‖xi‖ = 1 for all xi in the training data, then the number of mistakes made by the perceptron is at most 1/γ²
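A rough empirical check of the theorem (a sketch, assuming NumPy): draw unit-norm points separable with margin at least γ by a known direction, run the perceptron in repeated passes until a pass makes no mistakes, and compare the number of updates with 1/γ².

import numpy as np

rng = np.random.default_rng(0)
w_star = np.array([1.0, 0.0])    # true unit-norm separator, chosen for the simulation
gamma = 0.2

# Sample unit-norm points whose margin with respect to w_star is at least gamma.
X = []
while len(X) < 200:
    x = rng.normal(size=2)
    x /= np.linalg.norm(x)
    if abs(np.dot(w_star, x)) >= gamma:
        X.append(x)
X = np.array(X)
y = np.sign(X @ w_star)

w = y[0] * X[0]
mistakes = 0
while True:                      # keep passing over the data until no mistakes remain
    clean_pass = True
    for xt, yt in zip(X, y):
        if yt * np.dot(w, xt) <= 0:
            w = w + yt * xt
            mistakes += 1
            clean_pass = False
    if clean_pass:
        break

print(mistakes, "<=", 1 / gamma ** 2)   # theorem: at most 1/gamma^2 mistakes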
-
What if data is not linearly separable?
Ideally, we want to find a linear separator that makes the minimum number of mistakes on training data
But this is NP-hard
[Figure: a dataset that is not linearly separable]
-
Voted Perceptron
Perceptron:
1. Initially: m = 1, w1 = y1 x1
2. For t = 2, 3, ...:
   If yt⟨wm, xt⟩ ≤ 0, then: wm+1 = wm + yt xt, m = m + 1
3. Output wm

Voted Perceptron:
1. Initially: m = 1, w1 = y1 x1, c1 = 1
2. For t = 2, 3, ...:
   If yt⟨wm, xt⟩ ≤ 0, then: wm+1 = wm + yt xt, m = m + 1, cm = 1
   Else: cm = cm + 1
3. Output (w1, c1), (w2, c2), ..., (wm, cm)
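A sketch of the voted perceptron in Python (NumPy assumed), returning the (wi, ci) pairs as in the pseudocode above.

import numpy as np

def voted_perceptron(X, y):
    # X: (n, d) array, y: labels in {+1, -1}.
    ws = [y[0] * X[0]]   # w1 = y1 x1
    cs = [1]             # c1 = 1
    for t in range(1, len(X)):
        if y[t] * np.dot(ws[-1], X[t]) <= 0:   # mistake: start a new weight vector
            ws.append(ws[-1] + y[t] * X[t])    # wm+1 = wm + yt xt
            cs.append(1)                       # new count cm = 1
        else:
            cs[-1] += 1                        # current wm survives one more example
    return list(zip(ws, cs))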
-
Voted Perceptron
Voted Perceptron:
How to classify example x?
Output: sign(Σ_{i=1}^{m} ci sign(⟨wi, x⟩))
Problem: Have to store all the classifiers
1. Initially: m = 1, w1 = y1 x1, c1 = 1
2. For t = 2, 3, ...:
   If yt⟨wm, xt⟩ ≤ 0, then: wm+1 = wm + yt xt, m = m + 1, cm = 1
   Else: cm = cm + 1
3. Output (w1, c1), (w2, c2), ..., (wm, cm)
-
Averaged Perceptron
Averaged Perceptron:
How to classify example x?
Output: sign(⟨Σ_{i=1}^{m} ci wi, x⟩)
1. Initially: m = 1, w1 = y1 x1, c1 = 1
2. For t = 2, 3, ...:
   If yt⟨wm, xt⟩ ≤ 0, then: wm+1 = wm + yt xt, m = m + 1, cm = 1
   Else: cm = cm + 1
3. Output (w1, c1), (w2, c2), ..., (wm, cm)
-
Voted vs. Averaged
How to classify example x?
Averaged: sign(⟨Σ_{i=1}^{m} ci wi, x⟩)
Voted: sign(Σ_{i=1}^{m} ci sign(⟨wi, x⟩))
Example where voted and averaged are different:
w1 = x1, w2 = x2, w3 = x3, c1 = c2 = c3 = 1
Averaged: predicts +1 when ⟨x1 + x2 + x3, x⟩ > 0
Voted: predicts +1 when at least two of ⟨x1, x⟩, ⟨x2, x⟩, ⟨x3, x⟩ are > 0
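The two rules side by side, as a sketch (NumPy assumed); `pairs` is the list of (wi, ci) produced by the voted perceptron.

import numpy as np

def predict_voted(pairs, x):
    # Each classifier wi casts ci votes of sign(<wi, x>).
    return np.sign(sum(c * np.sign(np.dot(w, x)) for w, c in pairs))

def predict_averaged(pairs, x):
    # A single dot product with the count-weighted sum of the wi's.
    w_avg = sum(c * w for w, c in pairs)
    return np.sign(np.dot(w_avg, x))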
-
Perceptron: An Example
Data: ( (4, 0), +1), ( (1, 1), -1), ( (0, 1), -1), ( (-2,-2), +1)
-
Perceptron: An Example
Data: ( (4, 0), +1), ( (1, 1), -1), ( (0, 1), -1), ( (-2,-2), +1)
Step 1: w1 = (4, 0), c1 = 1
[Figure: w1 and its decision boundary through the origin O]
-
Perceptron: An Example
Data: ( (4, 0), +1), ( (1, 1), -1), ( (0, 1), -1), ( (-2,-2), +1)
Step 1: w1 = (4, 0), c1 = 1
Step 2: (x2, y2) = ( (1, 1), -1)
-
Perceptron: An Example
Data: ( (4, 0), +1), ( (1, 1), -1), ( (0, 1), -1), ( (-2,-2), +1)
Step 1: w1 = (4, 0), c1 = 1
Step 2: (x2, y2) = ( (1, 1), -1)
y2⟨w1, x2⟩ < 0 Mistake!
-
Perceptron: An Example
Data: ( (4, 0), +1), ( (1, 1), -1), ( (0, 1), -1), ( (-2,-2), +1)
Step 1: w1 = (4, 0), c1 = 1
Step 2: (x2, y2) = ( (1, 1), -1)
y2⟨w1, x2⟩ < 0 Mistake!
Update: w2 = (3, -1), c2 = 1
-
Perceptron: An Example
Data: ( (4, 0), +1), ( (1, 1), -1), ( (0, 1), -1), ( (-2,-2), +1)
Step 1: w1 = (4, 0), c1 = 1
Step 2: (x2, y2) = ( (1, 1), -1)
y2⟨w1, x2⟩ < 0 Mistake!
Update: w2 = (3, -1), c2 = 1
Step 3: (x3, y3) = ( (0, 1), -1)
-
Perceptron: An Example
Data: ( (4, 0), +1), ( (1, 1), -1), ( (0, 1), -1), ( (-2,-2), +1)
Step 1: w1 = (4, 0), c1 = 1
Step 2: (x2, y2) = ( (1, 1), -1)
y2⟨w1, x2⟩ < 0 Mistake!
Update: w2 = (3, -1), c2 = 1
Step 3: (x3, y3) = ( (0, 1), -1)
y3⟨w2, x3⟩ > 0 Correct!
-
Perceptron: An Example
Data: ( (4, 0), +1), ( (1, 1), -1), ( (0, 1), -1), ( (-2,-2), +1)
Step 1: w1 = (4, 0), c1 = 1
Step 2: (x2, y2) = ( (1, 1), -1)
y2⟨w1, x2⟩ < 0 Mistake!
Update: w2 = (3, -1), c2 = 1
Step 3: (x3, y3) = ( (0, 1), -1)
y3⟨w2, x3⟩ > 0 Correct!
Update: c2 = 2
-
Perceptron: An Example
Data: ( (4, 0), +1), ( (1, 1), -1), ( (0, 1), -1), ( (-2,-2), +1)
Step 1: w1 = (4, 0), c1 = 1
Step 2: (x2, y2) = ( (1, 1), -1)
y2⟨w1, x2⟩ < 0 Mistake!
Update: w2 = (3, -1), c2 = 1
Step 3: (x3, y3) = ( (0, 1), -1)
y3⟨w2, x3⟩ > 0 Correct!
Update: c2 = 2
Step 4: (x4, y4) = ( (-2, -2), +1)
-
Perceptron: An Example
Data: ( (4, 0), +1), ( (1, 1), -1), ( (0, 1), -1), ( (-2,-2), +1)
Step 1: w1 = (4, 0), c1 = 1
Step 2: (x2, y2) = ( (1, 1), -1)
y2⟨w1, x2⟩ < 0 Mistake!
Update: w2 = (3, -1), c2 = 1
Step 3: (x3, y3) = ( (0, 1), -1)
y3⟨w2, x3⟩ > 0 Correct!
Update: c2 = 2
Step 4: (x4, y4) = ( (-2, -2), +1)
y4⟨w2, x4⟩ < 0 Mistake!
-
Perceptron: An Example
Data: ( (4, 0), +1), ( (1, 1), -1), ( (0, 1), -1), ( (-2,-2), +1)
Step 1: w1 = (4, 0), c1 = 1
Step 2: (x2, y2) = ( (1, 1), -1)
y2⟨w1, x2⟩ < 0 Mistake!
Update: w2 = (3, -1), c2 = 1
Step 3: (x3, y3) = ( (0, 1), -1)
y3⟨w2, x3⟩ > 0 Correct!
Update: c2 = 2
Step 4: (x4, y4) = ( (-2, -2), +1)
y4⟨w2, x4⟩ < 0 Mistake!
Update: w3 = (1, -3), c3 = 1
-
Perceptron: An Example
Data: ( (4, 0), +1), ( (1, 1), -1), ( (0, 1), -1), ( (-2,-2), +1)
Step 1: w1 = (4, 0), c1 = 1
Step 2: (x2, y2) = ( (1, 1), -1)
y2⟨w1, x2⟩ < 0 Mistake!
Update: w2 = (3, -1), c2 = 1
Step 3: (x3, y3) = ( (0, 1), -1)
y3⟨w2, x3⟩ > 0 Correct!
Update: c2 = 2
Step 4: (x4, y4) = ( (-2, -2), +1)
y4⟨w2, x4⟩ < 0 Mistake!
Update: w3 = (1, -3), c3 = 1
Final Voted Perceptron: sign(sign(⟨w1, x⟩) + 2 sign(⟨w2, x⟩) + sign(⟨w3, x⟩))
Final Averaged Perceptron: sign(⟨w1 + 2w2 + w3, x⟩)
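A quick arithmetic check of the trace above (NumPy assumed):

import numpy as np

w1, w2, w3 = np.array([4, 0]), np.array([3, -1]), np.array([1, -3])
x2, x3, x4 = np.array([1, 1]), np.array([0, 1]), np.array([-2, -2])

print(-1 * np.dot(w1, x2))   # -4 < 0: mistake at step 2
print(-1 * np.dot(w2, x3))   #  1 > 0: correct at step 3
print(+1 * np.dot(w2, x4))   # -4 < 0: mistake at step 4
print(w1 + 2 * w2 + w3)      # [11, -5], the averaged perceptron's weight vector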
-
Summary
• Linear classification
• Perceptron
• Voted and Averaged Perceptron
• Some issues
-
Margins
Suppose the data is linearly separable by a margin. The perceptron will stop once it finds a linear separator. But sometimes we would like to compute the max-margin separator. This is done by Support Vector Machines (SVMs).
[Figure: a linearly separable dataset showing a separator found by the perceptron and the max-margin separator]
-
More General Linear Classification
Given labeled data (xi, yi), i=1,..,n, where y is +1 or -1,
Find a hyperplane to separate + from -
We can transform general linear classification to linear classification by a hyperplane through the origin as follows:
For each (xi, yi) (xi is d-dimensional), construct:
(zi, yi), where zi is the d+1-dimensional vector [1, xi]
Any hyperplane through the origin in the z-space corresponds to a hyperplane (not necessarily through the origin) in x-space, and vice versa
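A small sketch of this reduction (NumPy assumed): prepend a constant 1 to every x, so that a hyperplane through the origin in z-space plays the role of an arbitrary hyperplane in x-space.

import numpy as np

def lift(X):
    # Map each d-dimensional xi to the (d+1)-dimensional zi = [1, xi].
    return np.hstack([np.ones((len(X), 1)), X])

# A hyperplane <v, z> = 0 through the origin with v = [b, w] corresponds to
# <w, x> + b = 0 in the original space.
X = np.array([[2.0, 3.0], [-1.0, 0.5]])
print(lift(X))   # [[1, 2, 3], [1, -1, 0.5]]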
-
Categorical Features
Features with no inherent ordering
Example: Gender = “Male”, “Female”; State = “Alaska”, “Hawaii”, “California”
Convert a categorical feature with k categories to a k-dimensional vector: e.g., “Alaska” maps to [1, 0, 0], “Hawaii” to [0, 1, 0], “California” to [0, 0, 1]
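A minimal sketch of this conversion in Python; the category list is the one from the example above.

def one_hot(value, categories):
    # Map a categorical value to a k-dimensional indicator vector.
    return [1 if value == c else 0 for c in categories]

states = ["Alaska", "Hawaii", "California"]
print(one_hot("Hawaii", states))       # [0, 1, 0]
print(one_hot("California", states))   # [0, 0, 1]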
-
Multiclass Linear Classification
Main Idea: Reduce to multiple binary classification problems
One-vs-all: For k classes, solve k binary classification problems: Class 1 vs. rest, Class 2 vs. rest, ..., Class k vs. rest
All-vs-all: For k classes, solve k(k-1)/2 binary classification problems: Class i vs. Class j, for i = 1, .., k and j = i+1, .., k
To classify, combine by voting
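A sketch of the one-vs-all reduction (NumPy assumed), with the single-pass perceptron from earlier as the binary learner; classes are labeled 0, .., k-1 here, and the k classifiers are combined by taking the largest score ⟨wj, x⟩, one common choice.

import numpy as np

def perceptron(X, y):
    w = y[0] * X[0]
    for t in range(1, len(X)):
        if y[t] * np.dot(w, X[t]) < 0:
            w = w + y[t] * X[t]
    return w

def one_vs_all_train(X, y, k):
    # k binary problems: class j (+1) vs. the rest (-1).
    y = np.asarray(y)
    return [perceptron(X, np.where(y == j, 1, -1)) for j in range(k)]

def one_vs_all_predict(ws, x):
    return int(np.argmax([np.dot(w, x) for w in ws]))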
-
The Confusion Matrix
Indicates which classes are easy and hard to classify
[Figure: a 6 × 6 confusion matrix grid with rows and columns indexed by classes 1-6]
High diagonal entry: easy to classify
High off-diagonal entry: high scope for confusion
Mij = #examples that have label j but are classified as i
Nj = #examples with label j
Cij = Mij/Nj
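A sketch of how to compute Cij from predictions (NumPy assumed); labels are taken to be 0, .., k-1 here.

import numpy as np

def confusion_matrix(true_labels, predicted_labels, k):
    # M[i, j] = number of examples with label j classified as i; C[i, j] = M[i, j] / N[j].
    M = np.zeros((k, k))
    for j, i in zip(true_labels, predicted_labels):
        M[i, j] += 1
    N = M.sum(axis=0)                 # N[j] = number of examples with label j
    return M / np.maximum(N, 1)       # guard against empty classes

C = confusion_matrix([0, 0, 1, 2, 2, 2], [0, 1, 1, 2, 2, 0], k=3)
print(np.round(C, 2))                 # diagonal entries = per-class accuracy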
-
Kernels
-
Feature Maps
Data which is not linearly separable may be linearly separable in another feature space.
[Figure: the map φ from (X1, X2)-space to (Z1, Z2, Z3)-space]
In the figure, the data is linearly separable in z-space: (Z1, Z2, Z3) = (X1², 2X1X2, X2²)
-
Feature Maps
Let φ(x) be a feature map. For example: φ(x) = (x1², x1x2, x2²)
We want to run the perceptron in the feature space φ(x)
[Figure: the map φ from (X1, X2)-space to (Z1, Z2, Z3)-space]
-
Feature Maps
[Figure: a dataset plotted in (X1, X2)-space]
What feature map will make this dataset linearly separable?
-
Feature Maps
Let φ(x) be a feature map. For example: φ(x) = (x1², x1, x2)
We want to run the perceptron in the feature space φ(x)
Many linear classification algorithms, including the perceptron, SVMs, etc., can be written in terms of dot products ⟨x, z⟩
We can simply change these dot products to ⟨φ(x), φ(z)⟩
For a feature map φ, we define the corresponding kernel K: K(x, y) = ⟨φ(x), φ(y)⟩
Thus we can write the perceptron algorithm in terms of K
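A sketch of the perceptron written only in terms of K (NumPy assumed). Since every update adds yt φ(xt), the vector w is a sum of such terms, so ⟨w, φ(x)⟩ can be evaluated as a sum of yi K(xi, x) over the examples on which mistakes were made; the quadratic kernel below is just one choice.

import numpy as np

def K(x, z):
    # Quadratic kernel K(x, z) = <x, z>^2.
    return np.dot(x, z) ** 2

def kernel_perceptron(X, y):
    # w is kept implicitly as the set of indices where mistakes occurred.
    mistakes = [0]                                             # w1 = y1 phi(x1)
    for t in range(1, len(X)):
        score = sum(y[i] * K(X[i], X[t]) for i in mistakes)    # <w, phi(xt)>
        if y[t] * score <= 0:
            mistakes.append(t)                                 # w <- w + yt phi(xt)
    return mistakes

def kernel_predict(X, y, mistakes, x):
    return np.sign(sum(y[i] * K(X[i], x) for i in mistakes))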
-
Kernels
For a feature map φ, we define the corresponding kernel K: K(x, y) = ⟨φ(x), φ(y)⟩
Examples:
1. K(x, z) = (⟨x, z⟩)²
-
Kernels
For a feature map φ, we define the corresponding kernel K: K(x, y) = ⟨φ(x), φ(y)⟩
Examples:
1. K(x, z) = (⟨x, z⟩)²
For d = 3, φ(x) = [x1², x1x2, x1x3, x2x1, x2², x2x3, x3x1, x3x2, x3²]ᵀ
K(x, z) = (Σ_{i=1}^{d} xi zi)(Σ_{i=1}^{d} xi zi) = Σ_{i=1}^{d} Σ_{j=1}^{d} xi xj zi zj = Σ_{i,j=1}^{d} (xi xj)(zi zj)
Time to compute K directly = O(d)
Time to compute K through the feature map = O(d²)
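A quick numerical check (NumPy assumed) that computing K directly agrees with the dot product in the d²-dimensional feature space:

import numpy as np

def phi(x):
    # Explicit feature map for the quadratic kernel: all d^2 products xi xj.
    return np.outer(x, x).ravel()

x = np.array([1.0, 2.0, 3.0])
z = np.array([0.5, -1.0, 2.0])

print(np.dot(x, z) ** 2)         # K computed directly: O(d) time
print(np.dot(phi(x), phi(z)))    # same value via the feature map: O(d^2) time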
-
Kernels
For a feature map φ, we define the corresponding kernel K: K(x, y) = ⟨φ(x), φ(y)⟩
Examples:
1. K(x, z) = (⟨x, z⟩)²
2. K(x, z) = (⟨x, z⟩ + c)²
In-Class Problem: What is the feature map for this kernel?