Machine Learning: Introduction
Introduction to Machine Learning
4. Perceptron and Kernels

Alex Smola
Carnegie Mellon University
http://alex.smola.org/teaching/cmu2013-10-701 (10-701)
Outline
• Perceptron
  • Hebbian learning & biology
  • Algorithm
  • Convergence analysis
• Features and preprocessing
  • Nonlinear separation
  • Perceptron in feature space
• Kernels
  • Kernel trick
  • Properties
  • Examples
Perceptron
Frank Rosenblatt
early theories of the brain
Biology and Learning
• Basic idea
  • Good behavior should be rewarded, bad behavior punished (or not rewarded). This improves system fitness.
  • Killing a sabertooth tiger should be rewarded ...
  • Correlated events should be combined.
  • Pavlov's salivating dog.
• Training mechanisms
  • Behavioral modification of individuals (learning)
    Successful behavior is rewarded (e.g. food).
  • Hard-coded behavior in the genes (instinct)
    The wrongly coded animal does not reproduce.
Neurons
• Soma (CPU)
  Cell body; combines signals
• Dendrite (input bus)
  Combines the inputs from several other nerve cells
• Synapse (interface)
  Interface and parameter store between neurons
• Axon (cable)
  May be up to 1 m long and will transport the activation signal to neurons at different locations
Neurons

f(x) = Σ_i w_i x_i = ⟨w, x⟩

[Figure: inputs x_1, x_2, x_3, ..., x_n pass through synaptic weights w_1, ..., w_n into a single output]
Perceptron
• Weighted linear combination
• Nonlinear decision function
• Linear offset (bias)
• Linear separating hyperplanes (spam/ham, novel/typical, click/no click)
• Learning: estimating the parameters w and b

f(x) = σ(⟨w, x⟩ + b)

[Figure: inputs x_1, x_2, x_3, ..., x_n with synaptic weights w_1, ..., w_n feeding one output]
Perceptron
[Figure: linear separator between Spam and Ham points]
The Perceptron

initialize w = 0 and b = 0
repeat
  if y_i [⟨w, x_i⟩ + b] ≤ 0 then
    w ← w + y_i x_i and b ← b + y_i
  end if
until all classified correctly

• Nothing happens if classified correctly
• Weight vector is a linear combination: w = Σ_{i∈I} y_i x_i
• Classifier is a linear combination of inner products: f(x) = Σ_{i∈I} y_i ⟨x_i, x⟩ + b
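The update loop above can be sketched directly in NumPy; the toy data and the `max_epochs` safety cap are illustrative additions, not from the slides.

```python
import numpy as np

def perceptron(X, y, max_epochs=100):
    """Rosenblatt perceptron: update only on mistakes, so w stays
    a linear combination of the misclassified x_i."""
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(max_epochs):
        errors = 0
        for xi, yi in zip(X, y):
            if yi * (np.dot(w, xi) + b) <= 0:  # mistake (or on the boundary)
                w += yi * xi                   # w <- w + y_i x_i
                b += yi                        # b <- b + y_i
                errors += 1
        if errors == 0:                        # all classified correctly
            break
    return w, b

# Linearly separable toy data
X = np.array([[2.0, 1.0], [1.0, 2.0], [-1.0, -2.0], [-2.0, -1.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
w, b = perceptron(X, y)
assert all(yi * (np.dot(w, xi) + b) > 0 for xi, yi in zip(X, y))
```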
Convergence Theorem
• If there exists some (w*, b*) with unit length and
  y_i [⟨x_i, w*⟩ + b*] ≥ ρ for all i,
  then the perceptron converges to a linear separator after a number of steps bounded by
  ((b*)² + 1)(r² + 1) / ρ²  where ‖x_i‖ ≤ r.
• Dimensionality independent
• Order independent (i.e. also worst case)
• Scales with 'difficulty' of problem
Proof, Part I
Alexander J. Smola: An Introduction to Machine Learning with Kernels, Page 14

Starting Point
We start from w_1 = 0 and b_1 = 0.

Step 1: Bound on the increase of alignment
Denote by w_j the value of w at step j (analogously b_j).
Alignment: ⟨(w_j, b_j), (w*, b*)⟩
For an error on observation (x_i, y_i) we get

⟨(w_{j+1}, b_{j+1}), (w*, b*)⟩
  = ⟨(w_j, b_j) + y_i (x_i, 1), (w*, b*)⟩
  = ⟨(w_j, b_j), (w*, b*)⟩ + y_i ⟨(x_i, 1), (w*, b*)⟩
  ≥ ⟨(w_j, b_j), (w*, b*)⟩ + ρ
  ≥ j ρ.

Alignment increases with the number of errors.
Proof, Part II

Step 2: Cauchy-Schwarz for the Dot Product
⟨(w_{j+1}, b_{j+1}), (w*, b*)⟩ ≤ ‖(w_{j+1}, b_{j+1})‖ ‖(w*, b*)‖
  = √(1 + (b*)²) ‖(w_{j+1}, b_{j+1})‖

Step 3: Upper Bound on ‖(w_j, b_j)‖
If we make a mistake we have
‖(w_{j+1}, b_{j+1})‖² = ‖(w_j, b_j) + y_i (x_i, 1)‖²
  = ‖(w_j, b_j)‖² + 2 y_i ⟨(x_i, 1), (w_j, b_j)⟩ + ‖(x_i, 1)‖²
  ≤ ‖(w_j, b_j)‖² + ‖(x_i, 1)‖²
  ≤ j (r² + 1).

Step 4: Combination of the first three steps
j ρ ≤ √(1 + (b*)²) ‖(w_{j+1}, b_{j+1})‖ ≤ √( j (r² + 1) ((b*)² + 1) )

Solving for j proves the theorem.
Consequences
• Only need to store errors. This gives a compression bound for the perceptron.
• Stochastic gradient descent on the hinge loss
  l(x_i, y_i, w, b) = max(0, 1 − y_i [⟨w, x_i⟩ + b])
• Fails with noisy data

do NOT train your avatar with perceptrons
Black & White
Hardness: margin vs. size
[Figure: small-margin problems are hard, large-margin problems are easy]
Concepts & version space
• Realizable concepts
  • Some function exists that can separate the data and is included in the concept space
  • For the perceptron: data is linearly separable
• Unrealizable concepts
  • Data not separable
  • We don't have a suitable function class (often hard to distinguish)

Minimum error separation
• XOR is not linearly separable
• Nonlinear separation is trivial
• Caveat (Minsky & Papert): finding the minimum-error linear separator is NP-hard (this killed neural networks in the 70s).
Nonlinearity & Preprocessing
• Regression
  We got nonlinear functions by preprocessing
• Perceptron
  • Map data into feature space: x → φ(x)
  • Solve the problem in this space
  • In the code, replace ⟨x, x'⟩ by ⟨φ(x), φ(x')⟩
• Feature Perceptron
  • Solution in the span of the φ(x_i)
Quadratic Features
• Separating surfaces are circles, hyperbolae, parabolae
Constructing Features
(very naive OCR system)

Idea
Construct features manually. E.g. for OCR we could use hand-built features as shown on the slide.
Feature Engineering for Spam Filtering
• bag of words
• pairs of words
• date & time
• recipient path
• IP number
• sender
• encoding
• links
• ... secret sauce ...
Delivered-To: [email protected]: by 10.216.47.73 with SMTP id s51cs361171web; Tue, 3 Jan 2012 14:17:53 -0800 (PST)Received: by 10.213.17.145 with SMTP id s17mr2519891eba.147.1325629071725; Tue, 03 Jan 2012 14:17:51 -0800 (PST)Return-Path: <[email protected]>Received: from mail-ey0-f175.google.com (mail-ey0-f175.google.com [209.85.215.175]) by mx.google.com with ESMTPS id n4si29264232eef.57.2012.01.03.14.17.51 (version=TLSv1/SSLv3 cipher=OTHER); Tue, 03 Jan 2012 14:17:51 -0800 (PST)Received-SPF: neutral (google.com: 209.85.215.175 is neither permitted nor denied by best guess record for domain of [email protected]) client-ip=209.85.215.175;Authentication-Results: mx.google.com; spf=neutral (google.com: 209.85.215.175 is neither permitted nor denied by best guess record for domain of [email protected]) [email protected]; dkim=pass (test mode) [email protected]: by eaal1 with SMTP id l1so15092746eaa.6 for <[email protected]>; Tue, 03 Jan 2012 14:17:51 -0800 (PST)Received: by 10.205.135.18 with SMTP id ie18mr5325064bkc.72.1325629071362; Tue, 03 Jan 2012 14:17:51 -0800 (PST)X-Forwarded-To: [email protected]: [email protected] [email protected]: [email protected]: by 10.204.65.198 with SMTP id k6cs206093bki; Tue, 3 Jan 2012 14:17:50 -0800 (PST)Received: by 10.52.88.179 with SMTP id bh19mr10729402vdb.38.1325629068795; Tue, 03 Jan 2012 14:17:48 -0800 (PST)Return-Path: <[email protected]>Received: from mail-vx0-f179.google.com (mail-vx0-f179.google.com [209.85.220.179]) by mx.google.com with ESMTPS id dt4si11767074vdb.93.2012.01.03.14.17.48 (version=TLSv1/SSLv3 cipher=OTHER); Tue, 03 Jan 2012 14:17:48 -0800 (PST)Received-SPF: pass (google.com: domain of [email protected] designates 209.85.220.179 as permitted sender) client-ip=209.85.220.179;Received: by vcbf13 with SMTP id f13so11295098vcb.10 for <[email protected]>; Tue, 03 Jan 2012 14:17:48 -0800 (PST)DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=googlemail.com; s=gamma; 
h=mime-version:sender:date:x-google-sender-auth:message-id:subject :from:to:content-type; bh=WCbdZ5sXac25dpH02XcRyDOdts993hKwsAVXpGrFh0w=; b=WK2B2+ExWnf/gvTkw6uUvKuP4XeoKnlJq3USYTm0RARK8dSFjyOQsIHeAP9Yssxp6O 7ngGoTzYqd+ZsyJfvQcLAWp1PCJhG8AMcnqWkx0NMeoFvIp2HQooZwxSOCx5ZRgY+7qX uIbbdna4lUDXj6UFe16SpLDCkptd8OZ3gr7+o=MIME-Version: 1.0Received: by 10.220.108.81 with SMTP id e17mr24104004vcp.67.1325629067787; Tue, 03 Jan 2012 14:17:47 -0800 (PST)Sender: [email protected]: by 10.220.17.129 with HTTP; Tue, 3 Jan 2012 14:17:47 -0800 (PST)Date: Tue, 3 Jan 2012 14:17:47 -0800X-Google-Sender-Auth: 6bwi6D17HjZIkxOEol38NZzyeHsMessage-ID: <CAFJJHDGPBW+SdZg0MdAABiAKydDk9tpeMoDijYGjoGO-WC7osg@mail.gmail.com>Subject: CS 281B. Advanced Topics in Learning and Decision MakingFrom: Tim Althoff <[email protected]>To: [email protected]: multipart/alternative; boundary=f46d043c7af4b07e8d04b5a7113a
--f46d043c7af4b07e8d04b5a7113aContent-Type: text/plain; charset=ISO-8859-1
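The first two features in the list (bag of words, pairs of words) can be sketched in a few lines; the helper names and the sample message are made up for illustration.

```python
from collections import Counter

def bag_of_words(text):
    """Map a message to sparse token counts (the 'bag of words' feature)."""
    tokens = text.lower().split()
    return Counter(tokens)

def pairs_of_words(text):
    """Adjacent word pairs (bigrams) as additional features."""
    tokens = text.lower().split()
    return Counter(zip(tokens, tokens[1:]))

msg = "buy cheap pills buy now"
bow = bag_of_words(msg)
assert bow["buy"] == 2 and bow["pills"] == 1
assert pairs_of_words(msg)[("buy", "cheap")] == 1
```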
More feature engineering
• Two Interlocking Spirals
  Transform the data into a radial and angular part: (x_1, x_2) = (r sin φ, r cos φ)
• Handwritten Japanese Character Recognition
  • Break down the images into strokes and recognize them
  • Lookup based on stroke order
• Medical Diagnosis
  • Physician's comments
  • Blood status / ECG / height / weight / temperature ...
  • Medical knowledge
• Preprocessing
  • Zero mean, unit variance to fix scale issues (e.g. weight vs. income)
  • Probability integral transform (inverse CDF) as an alternative
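The spiral transform above can be sketched as a small inversion of (x_1, x_2) = (r sin φ, r cos φ); the function name `to_radial` is mine, not from the slides.

```python
import math

def to_radial(x1, x2):
    """Invert (x1, x2) = (r sin(phi), r cos(phi)) to a radial/angular pair."""
    r = math.hypot(x1, x2)
    phi = math.atan2(x1, x2)  # note the slide's convention: x1 = r sin(phi)
    return r, phi

r, phi = to_radial(1.0, 0.0)  # x1 = r sin(phi) = 1, x2 = r cos(phi) = 0
assert abs(r - 1.0) < 1e-12 and abs(phi - math.pi / 2) < 1e-12
```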
The Perceptron on Features

argument: X := {x_1, ..., x_m} ⊂ 𝒳 (data)
          Y := {y_1, ..., y_m} ⊂ {±1} (labels)
function (w, b) = Perceptron(X, Y, η)
  initialize w, b = 0
  repeat
    Pick (x_i, y_i) from data
    if y_i (⟨w, φ(x_i)⟩ + b) ≤ 0 then
      w ← w + y_i φ(x_i)
      b ← b + y_i
  until y_i (⟨w, φ(x_i)⟩ + b) > 0 for all i
end

• Nothing happens if classified correctly
• Weight vector is a linear combination: w = Σ_{i∈I} y_i φ(x_i)
• Classifier is a linear combination of inner products: f(x) = Σ_{i∈I} y_i ⟨φ(x_i), φ(x)⟩ + b

Important detail
w = Σ_j y_j φ(x_j) and hence f(x) = Σ_j y_j ⟨φ(x_j), φ(x)⟩ + b.
Problems
• Need a domain expert (e.g. Chinese OCR)
• Often expensive to compute
• Difficult to transfer engineering knowledge

Shotgun Solution
• Compute many features
• Hope that this contains good ones
• Do this efficiently
Kernels
Grace Wahba

Solving XOR
• XOR is not linearly separable
• Mapping into 3 dimensions makes it easily solvable:
  (x_1, x_2) → (x_1, x_2, x_1 x_2)
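A quick check that the 3D map really solves XOR; the ±1 encoding and the separating hyperplane w = (0, 0, −1), b = 0 are assumptions for illustration, not from the slides.

```python
def phi(x1, x2):
    """The slide's 3D map: the x1*x2 coordinate makes XOR linearly separable."""
    return (x1, x2, x1 * x2)

# XOR in +/-1 encoding: the label equals -x1*x2
data = [((1, 1), -1), ((1, -1), 1), ((-1, 1), 1), ((-1, -1), -1)]

# In feature space the hyperplane w = (0, 0, -1), b = 0 separates the data
margins = [label * -phi(x1, x2)[2] for (x1, x2), label in data]
assert all(m > 0 for m in margins)
```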
Polynomial Features

Quadratic Features in ℝ²
φ(x) := (x_1², √2 x_1 x_2, x_2²)

Dot Product
⟨φ(x), φ(x')⟩ = ⟨(x_1², √2 x_1 x_2, x_2²), (x'_1², √2 x'_1 x'_2, x'_2²)⟩ = ⟨x, x'⟩².

Insight
The trick works for any polynomial of order d via ⟨x, x'⟩^d.
Kernels

Problem
Extracting features can sometimes be very costly.
Example: second order features in 1000 dimensions. This leads to 5 · 10⁵ numbers. For higher order polynomial features it is much worse.

Solution
Don't compute the features, try to compute dot products implicitly. For some features this works ...

Definition
A kernel function k : 𝒳 × 𝒳 → ℝ is a symmetric function in its arguments for which the following property holds:
k(x, x') = ⟨φ(x), φ(x')⟩ for some feature map φ.
If k(x, x') is much cheaper to compute than φ(x) ...
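A minimal sketch of the definition: for the quadratic map on ℝ², the kernel ⟨x, x'⟩² agrees with the explicit feature-space dot product. The test vectors are arbitrary.

```python
import numpy as np

def phi(x):
    """Explicit quadratic feature map on R^2 (as on the previous slide)."""
    return np.array([x[0] ** 2, np.sqrt(2) * x[0] * x[1], x[1] ** 2])

def k(x, xp):
    """Quadratic kernel: the same dot product, without explicit features."""
    return np.dot(x, xp) ** 2

x, xp = np.array([1.0, 2.0]), np.array([3.0, -1.0])
assert abs(np.dot(phi(x), phi(xp)) - k(x, xp)) < 1e-9
```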
The Kernel Perceptron

argument: X := {x_1, ..., x_m} ⊂ 𝒳 (data)
          Y := {y_1, ..., y_m} ⊂ {±1} (labels)
function f = Perceptron(X, Y, η)
  initialize f = 0
  repeat
    Pick (x_i, y_i) from data
    if y_i f(x_i) ≤ 0 then
      f(·) ← f(·) + y_i k(x_i, ·) + y_i
  until y_i f(x_i) > 0 for all i
end

• Nothing happens if classified correctly
• Weight vector is a linear combination: w = Σ_{i∈I} y_i φ(x_i)
• Classifier is a linear combination of inner products:
  f(x) = Σ_{i∈I} y_i ⟨φ(x_i), φ(x)⟩ + b = Σ_{i∈I} y_i k(x_i, x) + b

Important detail
w = Σ_j y_j φ(x_j) and hence f(x) = Σ_j y_j k(x_j, x) + b.
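The algorithm above can be sketched by storing per-example mistake counts α instead of w; precomputing the Gram matrix and the XOR demo with a quadratic kernel are illustrative choices, not part of the slides.

```python
import numpy as np

def kernel_perceptron(X, y, k, max_epochs=100):
    """Kernel perceptron: f is represented by mistake counts alpha, never by w."""
    m = len(X)
    alpha = np.zeros(m)                                  # mistake counts per example
    b = 0.0
    K = np.array([[k(xi, xj) for xj in X] for xi in X])  # Gram matrix
    for _ in range(max_epochs):
        errors = 0
        for i in range(m):
            f_xi = np.dot(alpha * y, K[:, i]) + b        # f(x_i) = sum_j alpha_j y_j k(x_j, x_i) + b
            if y[i] * f_xi <= 0:                         # mistake
                alpha[i] += 1                            # f <- f + y_i k(x_i, .)
                b += y[i]                                # ... + y_i
                errors += 1
        if errors == 0:
            break
    return alpha, b

# XOR, which the linear perceptron cannot solve, with the quadratic kernel
X = np.array([[1, 1], [1, -1], [-1, 1], [-1, -1]], dtype=float)
y = np.array([-1, 1, 1, -1], dtype=float)
quad = lambda a, c: np.dot(a, c) ** 2
alpha, b = kernel_perceptron(X, y, quad)
f = lambda x: sum(alpha[j] * y[j] * quad(X[j], x) for j in range(len(X))) + b
assert all(y[i] * f(X[i]) > 0 for i in range(len(X)))
```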
Polynomial Kernels in ℝⁿ

Idea
We want to extend k(x, x') = ⟨x, x'⟩² to
k(x, x') = (⟨x, x'⟩ + c)^d  where c > 0 and d ∈ ℕ.
Prove that such a kernel corresponds to a dot product.

Proof strategy
Simple and straightforward: compute the explicit sum given by the kernel, i.e.
k(x, x') = (⟨x, x'⟩ + c)^d = Σ_{i=0}^{d} (d choose i) ⟨x, x'⟩^i c^{d−i}
Individual terms ⟨x, x'⟩^i are dot products for some φ_i(x).
Are all k(x, x') good Kernels?

Computability
We have to be able to compute k(x, x') efficiently (much cheaper than the dot products themselves).

"Nice and Useful" Functions
The features themselves have to be useful for the learning problem at hand. Quite often this means smooth functions.

Symmetry
Obviously k(x, x') = k(x', x) due to the symmetry of the dot product ⟨φ(x), φ(x')⟩ = ⟨φ(x'), φ(x)⟩.

Dot Product in Feature Space
Is there always a φ such that k really is a dot product?
Mercer's Theorem

The Theorem
For any symmetric function k : 𝒳 × 𝒳 → ℝ which is square integrable in 𝒳 × 𝒳 and which satisfies
∫_{𝒳×𝒳} k(x, x') f(x) f(x') dx dx' ≥ 0 for all f ∈ L₂(𝒳)
there exist φ_i : 𝒳 → ℝ and numbers λ_i ≥ 0 where
k(x, x') = Σ_i λ_i φ_i(x) φ_i(x') for all x, x' ∈ 𝒳.

Interpretation
The double integral is the continuous version of a vector-matrix-vector multiplication. For positive semidefinite matrices we have
Σ_i Σ_j k(x_i, x_j) α_i α_j ≥ 0
Properties of the Kernel

Distance in Feature Space
Distance between points in feature space via
d(x, x')² := ‖φ(x) − φ(x')‖²
  = ⟨φ(x), φ(x)⟩ − 2⟨φ(x), φ(x')⟩ + ⟨φ(x'), φ(x')⟩
  = k(x, x) + k(x', x') − 2k(x, x')

Kernel Matrix
To compare observations we compute dot products, so we study the matrix K given by
K_ij = ⟨φ(x_i), φ(x_j)⟩ = k(x_i, x_j)
where the x_i are the training patterns.

Similarity Measure
The entries K_ij tell us the overlap between φ(x_i) and φ(x_j), so k(x_i, x_j) is a similarity measure.
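The feature-space distance formula can be evaluated from kernel calls alone. With the linear kernel, φ is the identity, so it reduces to the ordinary Euclidean distance, which makes an easy sanity check.

```python
import numpy as np

def kernel_distance(k, x, xp):
    """Distance in feature space from kernel evaluations only:
    d(x, x')^2 = k(x, x) + k(x', x') - 2 k(x, x')."""
    return np.sqrt(k(x, x) + k(xp, xp) - 2 * k(x, xp))

lin = lambda a, c: np.dot(a, c)  # phi = identity, so d is Euclidean
x, xp = np.array([0.0, 3.0]), np.array([4.0, 0.0])
assert abs(kernel_distance(lin, x, xp) - 5.0) < 1e-12
```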
Properties of the Kernel Matrix

K is Positive Semidefinite
Claim: α⊤Kα ≥ 0 for all α ∈ ℝ^m and all kernel matrices K ∈ ℝ^{m×m}. Proof:
Σ_{i,j}^m α_i α_j K_ij = Σ_{i,j}^m α_i α_j ⟨φ(x_i), φ(x_j)⟩
  = ⟨Σ_i^m α_i φ(x_i), Σ_j^m α_j φ(x_j)⟩ = ‖Σ_{i=1}^m α_i φ(x_i)‖² ≥ 0

Kernel Expansion
If w is given by a linear combination of the φ(x_i) we get
⟨w, φ(x)⟩ = ⟨Σ_{i=1}^m α_i φ(x_i), φ(x)⟩ = Σ_{i=1}^m α_i k(x_i, x).
A Counterexample

A Candidate for a Kernel
k(x, x') = 1 if ‖x − x'‖ ≤ 1, and 0 otherwise.
This is symmetric and gives us some information about the proximity of points, yet it is not a proper kernel ...

Kernel Matrix
We use three points, x_1 = 1, x_2 = 2, x_3 = 3, and compute the resulting "kernel matrix" K. This yields

      [ 1 1 0 ]
  K = [ 1 1 1 ]
      [ 0 1 1 ]

with eigenvalues (√2 − 1)⁻¹, 1, and (1 − √2). Since 1 − √2 < 0, K has a negative eigenvalue. Hence k is not a kernel.
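The eigenvalue claim can be checked numerically; `eigvalsh` returns the eigenvalues of the symmetric matrix K in ascending order.

```python
import numpy as np

# The threshold "kernel" evaluated on x1 = 1, x2 = 2, x3 = 3
def k(x, xp):
    return 1.0 if abs(x - xp) <= 1 else 0.0

pts = [1.0, 2.0, 3.0]
K = np.array([[k(a, c) for c in pts] for a in pts])
eig = np.linalg.eigvalsh(K)              # ascending order
assert eig.min() < 0                     # negative eigenvalue: K is not PSD
assert abs(eig.min() - (1 - np.sqrt(2))) < 1e-9
```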
Some Good Kernels

Examples of kernels k(x, x'):
  Linear             ⟨x, x'⟩
  Laplacian RBF      exp(−λ‖x − x'‖)
  Gaussian RBF       exp(−λ‖x − x'‖²)
  Polynomial         (⟨x, x'⟩ + c)^d,  c ≥ 0, d ∈ ℕ
  B-Spline           B_{2n+1}(x − x')
  Cond. Expectation  E_c[p(x|c) p(x'|c)]

Simple trick for checking Mercer's condition
Compute the Fourier transform of the kernel and check that it is nonnegative.
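A few kernels from the table as one-liners; the parameter defaults (λ = 1, c = 1, d = 3) are arbitrary choices for illustration.

```python
import numpy as np

# Sketches of kernels from the table; lam, c, d are free parameters
linear    = lambda x, xp: np.dot(x, xp)
laplacian = lambda x, xp, lam=1.0: np.exp(-lam * np.linalg.norm(x - xp))
gaussian  = lambda x, xp, lam=1.0: np.exp(-lam * np.linalg.norm(x - xp) ** 2)
poly      = lambda x, xp, c=1.0, d=3: (np.dot(x, xp) + c) ** d

x = np.array([1.0, 0.0])
# Both RBF kernels evaluate to 1 at zero distance
assert gaussian(x, x) == 1.0 and laplacian(x, x) == 1.0
assert poly(x, x) == (1.0 + 1.0) ** 3
```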
Linear Kernel
Laplacian Kernel
Gaussian Kernel
Polynomial (Order 3)
B₃-Spline Kernel
[Figures: example decision functions for each of the kernels above]
Summary
• Perceptron
  • Hebbian learning & biology
  • Algorithm
  • Convergence analysis
• Features and preprocessing
  • Nonlinear separation
  • Perceptron in feature space
• Kernels
  • Kernel trick
  • Properties
  • Examples