Page 1

Introduction to Machine Learning
4. Perceptron and Kernels

Alex Smola
Carnegie Mellon University

http://alex.smola.org/teaching/cmu2013-10-701 (10-701)

Page 2

Outline

• Perceptron
  • Hebbian learning & biology
  • Algorithm
  • Convergence analysis
• Features and preprocessing
  • Nonlinear separation
  • Perceptron in feature space
• Kernels
  • Kernel trick
  • Properties
  • Examples

Page 3

Perceptron

Frank Rosenblatt

Page 4

Early theories of the brain

Page 5

Biology and Learning

• Basic idea
  • Good behavior should be rewarded, bad behavior punished (or not rewarded). This improves system fitness.
  • Killing a sabertooth tiger should be rewarded ...
  • Correlated events should be combined.
  • Pavlov's salivating dog.
• Training mechanisms
  • Behavioral modification of individuals (learning): successful behavior is rewarded (e.g. food).
  • Hard-coded behavior in the genes (instinct): the wrongly coded animal does not reproduce.

Page 6

Neurons

• Soma (CPU): cell body, combines the incoming signals
• Dendrite (input bus): combines the inputs from several other nerve cells
• Synapse (interface): interface and parameter store between neurons
• Axon (cable): may be up to 1 m long and transports the activation signal to neurons at different locations

Page 7

Neurons

$$f(x) = \sum_i w_i x_i = \langle w, x \rangle$$

(Figure: inputs x_1, x_2, x_3, ..., x_n, weighted by synaptic weights w_1, ..., w_n, summed into the output.)

Page 8

Perceptron

• Weighted linear combination
• Nonlinear decision function
• Linear offset (bias)
• Linear separating hyperplanes (spam/ham, novel/typical, click/no click)
• Learning: estimating the parameters w and b

$$f(x) = \sigma(\langle w, x \rangle + b)$$

(Figure: inputs x_1, x_2, x_3, ..., x_n, synaptic weights w_1, ..., w_n, output.)

Page 9

Perceptron

(Figure: two classes labeled Spam and Ham.)

Page 10

The Perceptron

  initialize w = 0 and b = 0
  repeat
    if y_i [⟨w, x_i⟩ + b] ≤ 0 then
      w ← w + y_i x_i and b ← b + y_i
    end if
  until all classified correctly

• Nothing happens if classified correctly
• The weight vector is a linear combination:
  $$w = \sum_{i \in I} y_i x_i$$
• The classifier is a linear combination of inner products:
  $$f(x) = \sum_{i \in I} y_i \langle x_i, x \rangle + b$$
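A minimal NumPy sketch of this update rule (variable names and the epoch cap are illustrative, not from the slides); it assumes the rows of X are the data points and y holds labels in {±1}:

```python
import numpy as np

def perceptron_train(X, y, max_epochs=100):
    """Perceptron: update (w, b) only on misclassified points."""
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(max_epochs):
        errors = 0
        for xi, yi in zip(X, y):
            if yi * (np.dot(w, xi) + b) <= 0:  # mistake (or on the boundary)
                w += yi * xi                   # w <- w + y_i x_i
                b += yi                        # b <- b + y_i
                errors += 1
        if errors == 0:                        # all classified correctly
            break
    return w, b
```

For linearly separable data the loop terminates; the next slide bounds the number of updates.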

Page 11

Convergence Theorem

• If there exists some $(w^*, b^*)$ with unit length $\|w^*\| = 1$ and
  $$y_i \left[\langle x_i, w^* \rangle + b^*\right] \ge \rho \quad \text{for all } i,$$
  then the perceptron converges to a linear separator after a number of steps bounded by
  $$\frac{\left((b^*)^2 + 1\right)\left(r^2 + 1\right)}{\rho^2}, \quad \text{where } \|x_i\| \le r.$$
• Dimensionality independent
• Order independent (i.e. also worst case)
• Scales with the 'difficulty' of the problem
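For example (illustrative numbers, not from the slides): with $r = 1$, $b^* = 0$ and margin $\rho = 0.1$ the bound gives
$$\frac{(0^2 + 1)(1^2 + 1)}{0.1^2} = 200,$$
so at most 200 updates are made, regardless of the dimension or the number of training points.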

Page 12

Proof, Part I

Starting point: we start from $w_1 = 0$ and $b_1 = 0$.

Step 1: bound on the increase of alignment. Denote by $w_j$ the value of $w$ at step $j$ (analogously $b_j$), and define the alignment as $\langle (w_j, b_j), (w^*, b^*) \rangle$. For an error on observation $(x_i, y_i)$ we get
$$
\begin{aligned}
\langle (w_{j+1}, b_{j+1}), (w^*, b^*) \rangle
&= \langle (w_j, b_j) + y_i (x_i, 1), (w^*, b^*) \rangle \\
&= \langle (w_j, b_j), (w^*, b^*) \rangle + y_i \langle (x_i, 1), (w^*, b^*) \rangle \\
&\ge \langle (w_j, b_j), (w^*, b^*) \rangle + \rho \\
&\ge j\rho.
\end{aligned}
$$
The alignment increases with the number of errors.

Page 13

Proof, Part II

Step 2: Cauchy-Schwarz for the dot product.
$$\langle (w_{j+1}, b_{j+1}), (w^*, b^*) \rangle \le \|(w_{j+1}, b_{j+1})\| \, \|(w^*, b^*)\| = \sqrt{1 + (b^*)^2}\; \|(w_{j+1}, b_{j+1})\|$$

Step 3: upper bound on $\|(w_j, b_j)\|$. If we make a mistake we have
$$
\begin{aligned}
\|(w_{j+1}, b_{j+1})\|^2
&= \|(w_j, b_j) + y_i (x_i, 1)\|^2 \\
&= \|(w_j, b_j)\|^2 + 2 y_i \langle (x_i, 1), (w_j, b_j) \rangle + \|(x_i, 1)\|^2 \\
&\le \|(w_j, b_j)\|^2 + \|(x_i, 1)\|^2 \\
&\le j \left(r^2 + 1\right).
\end{aligned}
$$

Step 4: combination of the first three steps.
$$j\rho \le \sqrt{1 + (b^*)^2}\; \|(w_{j+1}, b_{j+1})\| \le \sqrt{j \left(r^2 + 1\right)\left((b^*)^2 + 1\right)}$$
Solving for $j$ proves the theorem.

Page 14

Consequences

• Only need to store the errors. This gives a compression bound for the perceptron.
• Stochastic gradient descent on the hinge loss
  $$l(x_i, y_i, w, b) = \max\left(0, 1 - y_i \left[\langle w, x_i \rangle + b\right]\right)$$
  (a short sketch of this view follows below)
• Fails with noisy data

Do NOT train your avatar with perceptrons (Black & White).
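A rough sketch of the stochastic-gradient view, using the loss written on this slide (learning rate and variable names are illustrative):

```python
import numpy as np

def hinge_sgd_step(w, b, xi, yi, lr=1.0):
    """One SGD step on l = max(0, 1 - y(<w, x> + b))."""
    if yi * (np.dot(w, xi) + b) < 1:   # inside the margin: nonzero (sub)gradient
        w = w + lr * yi * xi
        b = b + lr * yi
    return w, b
```

Dropping the margin of 1, i.e. updating only when $y_i(\langle w, x_i \rangle + b) \le 0$, recovers exactly the perceptron update from Page 10.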

Page 15

Hardness: margin vs. size

(Figure: two datasets, one labeled 'hard' and one labeled 'easy'.)

Pages 16–27
Page 28

Concepts & Version Space

• Realizable concepts
  • Some function exists that can separate the data and is included in the concept space
  • For the perceptron: the data is linearly separable
• Unrealizable concepts
  • The data is not separable
  • We don't have a suitable function class (often hard to distinguish)

Page 29

Minimum Error Separation

• XOR: not linearly separable
• Nonlinear separation is trivial
• Caveat (Minsky & Papert): finding the minimum-error linear separator is NP-hard (this killed neural networks in the 70s).

Page 30

Nonlinearity & Preprocessing

Page 31

Nonlinear Features

• Regression: we got nonlinear functions by preprocessing
• Perceptron
  • Map data into a feature space: $x \to \phi(x)$
  • Solve the problem in this space
  • Query-replace $\langle x, x' \rangle$ by $\langle \phi(x), \phi(x') \rangle$ in the code
• Feature Perceptron
  • Solution lies in the span of the $\phi(x_i)$

Page 32

Quadratic Features

• Separating surfaces are circles, hyperbolae, parabolae

Page 33

Constructing Features (very naive OCR system)

Idea: construct features manually. E.g. for OCR we could use ...

Page 34

Feature Engineering for Spam Filtering

• bag of words
• pairs of words
• date & time
• recipient path
• IP number
• sender
• encoding
• links
• ... secret sauce ...

Delivered-To: [email protected]: by 10.216.47.73 with SMTP id s51cs361171web; Tue, 3 Jan 2012 14:17:53 -0800 (PST)Received: by 10.213.17.145 with SMTP id s17mr2519891eba.147.1325629071725; Tue, 03 Jan 2012 14:17:51 -0800 (PST)Return-Path: <[email protected]>Received: from mail-ey0-f175.google.com (mail-ey0-f175.google.com [209.85.215.175]) by mx.google.com with ESMTPS id n4si29264232eef.57.2012.01.03.14.17.51 (version=TLSv1/SSLv3 cipher=OTHER); Tue, 03 Jan 2012 14:17:51 -0800 (PST)Received-SPF: neutral (google.com: 209.85.215.175 is neither permitted nor denied by best guess record for domain of [email protected]) client-ip=209.85.215.175;Authentication-Results: mx.google.com; spf=neutral (google.com: 209.85.215.175 is neither permitted nor denied by best guess record for domain of [email protected]) [email protected]; dkim=pass (test mode) [email protected]: by eaal1 with SMTP id l1so15092746eaa.6 for <[email protected]>; Tue, 03 Jan 2012 14:17:51 -0800 (PST)Received: by 10.205.135.18 with SMTP id ie18mr5325064bkc.72.1325629071362; Tue, 03 Jan 2012 14:17:51 -0800 (PST)X-Forwarded-To: [email protected]: [email protected] [email protected]: [email protected]: by 10.204.65.198 with SMTP id k6cs206093bki; Tue, 3 Jan 2012 14:17:50 -0800 (PST)Received: by 10.52.88.179 with SMTP id bh19mr10729402vdb.38.1325629068795; Tue, 03 Jan 2012 14:17:48 -0800 (PST)Return-Path: <[email protected]>Received: from mail-vx0-f179.google.com (mail-vx0-f179.google.com [209.85.220.179]) by mx.google.com with ESMTPS id dt4si11767074vdb.93.2012.01.03.14.17.48 (version=TLSv1/SSLv3 cipher=OTHER); Tue, 03 Jan 2012 14:17:48 -0800 (PST)Received-SPF: pass (google.com: domain of [email protected] designates 209.85.220.179 as permitted sender) client-ip=209.85.220.179;Received: by vcbf13 with SMTP id f13so11295098vcb.10 for <[email protected]>; Tue, 03 Jan 2012 14:17:48 -0800 (PST)DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=googlemail.com; s=gamma; h=mime-version:sender:date:x-google-sender-auth:message-id:subject :from:to:content-type; bh=WCbdZ5sXac25dpH02XcRyDOdts993hKwsAVXpGrFh0w=; b=WK2B2+ExWnf/gvTkw6uUvKuP4XeoKnlJq3USYTm0RARK8dSFjyOQsIHeAP9Yssxp6O 7ngGoTzYqd+ZsyJfvQcLAWp1PCJhG8AMcnqWkx0NMeoFvIp2HQooZwxSOCx5ZRgY+7qX uIbbdna4lUDXj6UFe16SpLDCkptd8OZ3gr7+o=MIME-Version: 1.0Received: by 10.220.108.81 with SMTP id e17mr24104004vcp.67.1325629067787; Tue, 03 Jan 2012 14:17:47 -0800 (PST)Sender: [email protected]: by 10.220.17.129 with HTTP; Tue, 3 Jan 2012 14:17:47 -0800 (PST)Date: Tue, 3 Jan 2012 14:17:47 -0800X-Google-Sender-Auth: 6bwi6D17HjZIkxOEol38NZzyeHsMessage-ID: <CAFJJHDGPBW+SdZg0MdAABiAKydDk9tpeMoDijYGjoGO-WC7osg@mail.gmail.com>Subject: CS 281B. Advanced Topics in Learning and Decision MakingFrom: Tim Althoff <[email protected]>To: [email protected]: multipart/alternative; boundary=f46d043c7af4b07e8d04b5a7113a

--f46d043c7af4b07e8d04b5a7113aContent-Type: text/plain; charset=ISO-8859-1

Page 35

More Feature Engineering

• Two interlocking spirals
  • Transform the data into a radial and an angular part: $(x_1, x_2) = (r \sin\phi, r \cos\phi)$ (a small coordinate sketch follows after this list)
• Handwritten Japanese character recognition
  • Break the images down into strokes and recognize those
  • Lookup based on stroke order
• Medical diagnosis
  • Physician's comments
  • Blood status / ECG / height / weight / temperature ...
  • Medical knowledge
• Preprocessing
  • Zero mean, unit variance to fix scale issues (e.g. weight vs. income)
  • Probability integral transform (inverse CDF) as an alternative
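A small sketch of recovering the radial and angular part, consistent with the parametrization $(x_1, x_2) = (r \sin\phi, r \cos\phi)$ above (the helper name is illustrative):

```python
import numpy as np

def to_radial_angular(x1, x2):
    """Map (x1, x2) = (r sin(phi), r cos(phi)) back to (r, phi)."""
    r = np.hypot(x1, x2)        # radius
    phi = np.arctan2(x1, x2)    # angle; argument order matches the (sin, cos) convention above
    return r, phi
```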

Page 36

The Perceptron on Features

  argument: X := {x_1, ..., x_m} ⊂ 𝒳 (data), Y := {y_1, ..., y_m} ⊂ {±1} (labels)
  function (w, b) = Perceptron(X, Y, η)
    initialize w = 0 and b = 0
    repeat
      pick (x_i, y_i) from the data
      if y_i (⟨w, φ(x_i)⟩ + b) ≤ 0 then
        w ← w + y_i φ(x_i)
        b ← b + y_i
      end if
    until y_i (⟨w, φ(x_i)⟩ + b) > 0 for all i
  end

• Nothing happens if classified correctly
• The weight vector is a linear combination
• The classifier is a linear combination of inner products

Important detail:
$$w = \sum_{i \in I} y_i \phi(x_i) \quad \text{and hence} \quad f(x) = \sum_{i \in I} y_i \langle \phi(x_i), \phi(x) \rangle + b.$$
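The same training loop as on Page 10, only with x replaced by φ(x); a sketch using an explicit quadratic feature map (the map is assumed here for illustration, matching the quadratic features slide):

```python
import numpy as np

def phi(x):
    """Explicit quadratic feature map for x in R^2."""
    return np.array([x[0] ** 2, np.sqrt(2) * x[0] * x[1], x[1] ** 2])

def feature_perceptron_train(X, y, max_epochs=100):
    Phi = np.array([phi(x) for x in X])        # precompute the features
    w = np.zeros(Phi.shape[1])
    b = 0.0
    for _ in range(max_epochs):
        errors = 0
        for zi, yi in zip(Phi, y):
            if yi * (np.dot(w, zi) + b) <= 0:
                w += yi * zi                   # w <- w + y_i phi(x_i)
                b += yi
                errors += 1
        if errors == 0:
            break
    return w, b
```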

Page 37

Problems

• Problems
  • Need a domain expert (e.g. Chinese OCR)
  • Often expensive to compute
  • Difficult to transfer engineering knowledge
• Shotgun solution
  • Compute many features
  • Hope that this contains good ones
  • Do this efficiently

Page 38

Kernels

Grace Wahba

Page 39

Solving XOR

• XOR is not linearly separable
• Mapping into 3 dimensions makes it easily solvable:
  $$(x_1, x_2) \mapsto (x_1, x_2, x_1 x_2)$$

Page 40

Polynomial Features

Quadratic features in $\mathbb{R}^2$:
$$\phi(x) := \left(x_1^2, \sqrt{2}\, x_1 x_2, x_2^2\right)$$

Dot product:
$$\langle \phi(x), \phi(x') \rangle = \left\langle \left(x_1^2, \sqrt{2}\, x_1 x_2, x_2^2\right), \left(x_1'^2, \sqrt{2}\, x_1' x_2', x_2'^2\right) \right\rangle = \langle x, x' \rangle^2.$$

Insight: the trick works for any polynomial of order $d$ via $\langle x, x' \rangle^d$.
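A quick numerical check of the identity above (a sketch, not part of the slides; the test points are arbitrary):

```python
import numpy as np

def phi(x):
    # explicit quadratic features in R^2
    return np.array([x[0] ** 2, np.sqrt(2) * x[0] * x[1], x[1] ** 2])

x  = np.array([1.0, 2.0])
xp = np.array([3.0, -1.0])

lhs = np.dot(phi(x), phi(xp))  # <phi(x), phi(x')>
rhs = np.dot(x, xp) ** 2       # <x, x'>^2
print(lhs, rhs)                # both equal 1.0
```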

Pages 41–42

Computational Efficiency

Problem: extracting features can sometimes be very costly. Example: second-order features in 1000 dimensions. This leads to $5 \cdot 10^5$ numbers. For higher-order polynomial features it is much worse.

Solution: don't compute the features; try to compute the dot products implicitly. For some features this works ...

Definition: a kernel function $k : \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ is a symmetric function in its arguments for which the following property holds:
$$k(x, x') = \langle \phi(x), \phi(x') \rangle \quad \text{for some feature map } \phi.$$

If $k(x, x')$ is much cheaper to compute than $\phi(x)$ ...
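The $5 \cdot 10^5$ figure is just the number of degree-2 monomials $x_i x_j$ with $i \le j$ in 1000 dimensions; a quick check (a sketch, assuming this counting convention):

```python
from math import comb

d = 1000
second_order = comb(d, 2) + d  # cross terms x_i x_j (i < j) plus squares x_i^2
print(second_order)            # 500500, i.e. roughly 5e5
```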

Page 43

The Kernel Perceptron

  argument: X := {x_1, ..., x_m} ⊂ 𝒳 (data), Y := {y_1, ..., y_m} ⊂ {±1} (labels)
  function f = Perceptron(X, Y, η)
    initialize f = 0
    repeat
      pick (x_i, y_i) from the data
      if y_i f(x_i) ≤ 0 then
        f(·) ← f(·) + y_i k(x_i, ·) + y_i
      end if
    until y_i f(x_i) > 0 for all i
  end

• Nothing happens if classified correctly
• The weight vector is a linear combination:
  $$w = \sum_{i \in I} y_i \phi(x_i)$$
• The classifier is a linear combination of inner products:
  $$f(x) = \sum_{i \in I} y_i \langle \phi(x_i), \phi(x) \rangle + b = \sum_{i \in I} y_i k(x_i, x) + b$$

Important detail: $w = \sum_j y_j \phi(x_j)$ and hence $f(x) = \sum_j y_j k(x_j, x) + b$.
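A minimal NumPy sketch of this kernelized loop, storing $f$ implicitly through one coefficient per training point (the Gaussian kernel and all names are illustrative choices, not prescribed by the slide):

```python
import numpy as np

def gaussian_kernel(x, xp, lam=1.0):
    return np.exp(-lam * np.sum((x - xp) ** 2))

def kernel_perceptron_train(X, y, k=gaussian_kernel, max_epochs=100):
    """Keep alpha_i (sum of y_i over mistakes on point i); f(x) = sum_i alpha_i k(x_i, x) + b."""
    m = len(X)
    alpha = np.zeros(m)
    b = 0.0

    def f(x):
        return sum(alpha[i] * k(X[i], x) for i in range(m)) + b

    for _ in range(max_epochs):
        errors = 0
        for i in range(m):
            if y[i] * f(X[i]) <= 0:   # mistake
                alpha[i] += y[i]      # f(.) <- f(.) + y_i k(x_i, .)
                b += y[i]             # ... + y_i (offset part of the update)
                errors += 1
        if errors == 0:
            break
    return alpha, b
```

Predictions are then made with the sign of $\sum_i \alpha_i k(x_i, x) + b$.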

Page 44

Polynomial Kernels in $\mathbb{R}^n$

Idea: we want to extend $k(x, x') = \langle x, x' \rangle^2$ to
$$k(x, x') = \left(\langle x, x' \rangle + c\right)^d \quad \text{where } c > 0 \text{ and } d \in \mathbb{N},$$
and prove that such a kernel corresponds to a dot product.

Proof strategy: simple and straightforward, compute the explicit sum given by the kernel, i.e.
$$k(x, x') = \left(\langle x, x' \rangle + c\right)^d = \sum_{i=0}^{d} \binom{d}{i} \langle x, x' \rangle^i\, c^{d-i}.$$
The individual terms $\langle x, x' \rangle^i$ are dot products for some $\phi_i(x)$.

Page 45

Kernel Conditions: are all $k(x, x')$ good kernels?

• Computability: we have to be able to compute $k(x, x')$ efficiently (much cheaper than the dot products themselves).
• "Nice and useful" functions: the features themselves have to be useful for the learning problem at hand. Quite often this means smooth functions.
• Symmetry: obviously $k(x, x') = k(x', x)$, due to the symmetry of the dot product $\langle \phi(x), \phi(x') \rangle = \langle \phi(x'), \phi(x) \rangle$.
• Dot product in feature space: is there always a $\phi$ such that $k$ really is a dot product?

Page 46

Mercer's Theorem

The theorem: for any symmetric function $k : \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ which is square integrable in $\mathcal{X} \times \mathcal{X}$ and which satisfies
$$\int_{\mathcal{X} \times \mathcal{X}} k(x, x')\, f(x)\, f(x')\, dx\, dx' \ge 0 \quad \text{for all } f \in L_2(\mathcal{X}),$$
there exist functions $\phi_i : \mathcal{X} \to \mathbb{R}$ and numbers $\lambda_i \ge 0$ such that
$$k(x, x') = \sum_i \lambda_i\, \phi_i(x)\, \phi_i(x') \quad \text{for all } x, x' \in \mathcal{X}.$$

Interpretation: the double integral is the continuous version of a vector-matrix-vector multiplication. For positive semidefinite matrices we have
$$\sum_i \sum_j k(x_i, x_j)\, \alpha_i\, \alpha_j \ge 0.$$

Page 47

Properties of the Kernel

Distance in feature space: the distance between points in feature space is given by
$$d(x, x')^2 := \|\phi(x) - \phi(x')\|^2 = \langle \phi(x), \phi(x) \rangle - 2 \langle \phi(x), \phi(x') \rangle + \langle \phi(x'), \phi(x') \rangle = k(x, x) + k(x', x') - 2 k(x, x').$$

Kernel matrix: to compare observations we compute dot products, so we study the matrix $K$ given by
$$K_{ij} = \langle \phi(x_i), \phi(x_j) \rangle = k(x_i, x_j),$$
where the $x_i$ are the training patterns.

Similarity measure: the entries $K_{ij}$ tell us the overlap between $\phi(x_i)$ and $\phi(x_j)$, so $k(x_i, x_j)$ is a similarity measure.
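This distance can be evaluated without ever forming $\phi$; a small sketch, with a Gaussian kernel chosen purely for illustration:

```python
import numpy as np

def gaussian_kernel(x, xp, lam=1.0):
    return np.exp(-lam * np.sum((x - xp) ** 2))

def feature_distance(x, xp, k=gaussian_kernel):
    """d(x, x')^2 = k(x, x) + k(x', x') - 2 k(x, x')."""
    d2 = k(x, x) + k(xp, xp) - 2 * k(x, xp)
    return np.sqrt(max(d2, 0.0))   # guard against tiny negative values from rounding

x, xp = np.array([0.0, 0.0]), np.array([1.0, 1.0])
print(feature_distance(x, xp))
```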

Page 48

Properties of the Kernel Matrix

$K$ is positive semidefinite. Claim: $\alpha^\top K \alpha \ge 0$ for all $\alpha \in \mathbb{R}^m$ and all kernel matrices $K \in \mathbb{R}^{m \times m}$. Proof:
$$\sum_{i,j=1}^{m} \alpha_i \alpha_j K_{ij} = \sum_{i,j=1}^{m} \alpha_i \alpha_j \langle \phi(x_i), \phi(x_j) \rangle = \left\langle \sum_{i=1}^{m} \alpha_i \phi(x_i), \sum_{j=1}^{m} \alpha_j \phi(x_j) \right\rangle = \left\| \sum_{i=1}^{m} \alpha_i \phi(x_i) \right\|^2 \ge 0.$$

Kernel expansion: if $w$ is given by a linear combination of the $\phi(x_i)$ we get
$$\langle w, \phi(x) \rangle = \left\langle \sum_{i=1}^{m} \alpha_i \phi(x_i), \phi(x) \right\rangle = \sum_{i=1}^{m} \alpha_i k(x_i, x).$$
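A quick empirical illustration of this property: build a kernel matrix from random points and check that $\alpha^\top K \alpha \ge 0$ (a sketch; the Gaussian kernel and the sizes are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))    # 20 random points in R^3

# Gaussian kernel matrix K_ij = exp(-||x_i - x_j||^2)
sq = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
K = np.exp(-sq)

alpha = rng.normal(size=20)
print(alpha @ K @ alpha)              # nonnegative
print(np.linalg.eigvalsh(K).min())    # smallest eigenvalue is >= 0 (up to rounding)
```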

Page 49

A Counterexample

A candidate for a kernel:
$$k(x, x') = \begin{cases} 1 & \text{if } \|x - x'\| \le 1 \\ 0 & \text{otherwise} \end{cases}$$
This is symmetric and gives us some information about the proximity of points, yet it is not a proper kernel ...

Kernel matrix: we use three points, $x_1 = 1$, $x_2 = 2$, $x_3 = 3$, and compute the resulting "kernel matrix" $K$. This yields
$$K = \begin{bmatrix} 1 & 1 & 0 \\ 1 & 1 & 1 \\ 0 & 1 & 1 \end{bmatrix} \quad \text{with eigenvalues } 1 + \sqrt{2},\ 1,\ \text{and } 1 - \sqrt{2}.$$
Since one eigenvalue is negative, $K$ is not positive semidefinite. Hence $k$ is not a kernel.
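The negative eigenvalue can be verified directly (a small sketch):

```python
import numpy as np

K = np.array([[1, 1, 0],
              [1, 1, 1],
              [0, 1, 1]], dtype=float)

print(np.linalg.eigvalsh(K))  # approx [-0.414, 1.0, 2.414], i.e. [1 - sqrt(2), 1, 1 + sqrt(2)]
```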

Page 50

Some Good Kernels

Examples of kernels $k(x, x')$:

• Linear: $\langle x, x' \rangle$
• Laplacian RBF: $\exp\left(-\lambda \|x - x'\|\right)$
• Gaussian RBF: $\exp\left(-\lambda \|x - x'\|^2\right)$
• Polynomial: $\left(\langle x, x' \rangle + c\right)^d$, $c \ge 0$, $d \in \mathbb{N}$
• B-Spline: $B_{2n+1}(x - x')$
• Cond. Expectation: $\mathbf{E}_c[p(x \mid c)\, p(x' \mid c)]$

Simple trick for checking Mercer's condition: compute the Fourier transform of the kernel and check that it is nonnegative.
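Minimal implementations of a few of these kernels (a sketch; parameter names follow the slide's notation, the default values are arbitrary):

```python
import numpy as np

def linear_kernel(x, xp):
    return np.dot(x, xp)

def laplacian_rbf(x, xp, lam=1.0):
    return np.exp(-lam * np.linalg.norm(x - xp))

def gaussian_rbf(x, xp, lam=1.0):
    return np.exp(-lam * np.linalg.norm(x - xp) ** 2)

def polynomial_kernel(x, xp, c=1.0, d=3):
    return (np.dot(x, xp) + c) ** d
```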

Page 51

Linear Kernel

Page 52

Laplacian Kernel

Page 53

Gaussian Kernel

Page 54

Polynomial Kernel (Order 3)

Page 55

B3-Spline Kernel

Page 56

Summary

• Perceptron
  • Hebbian learning & biology
  • Algorithm
  • Convergence analysis
• Features and preprocessing
  • Nonlinear separation
  • Perceptron in feature space
• Kernels
  • Kernel trick
  • Properties
  • Examples