Machine Learning: Introduction
Introduction to Machine Learning
4. Perceptron and Kernels

Alex Smola
Carnegie Mellon University
http://alex.smola.org/teaching/cmu2013-10-701 (10-701)
Outline
• Perceptron
  • Hebbian learning & biology
  • Algorithm
  • Convergence analysis
• Features and preprocessing
  • Nonlinear separation
  • Perceptron in feature space
• Kernels
  • Kernel trick
  • Properties
  • Examples
Perceptron
Frank Rosenblatt
early theories of the brain
Biology and Learning
• Basic idea
  • Good behavior should be rewarded, bad behavior punished (or not rewarded). This improves system fitness.
  • Killing a sabertooth tiger should be rewarded ...
  • Correlated events should be combined.
  • Pavlov's salivating dog.
• Training mechanisms
  • Behavioral modification of individuals (learning)
    Successful behavior is rewarded (e.g. food).
  • Hard-coded behavior in the genes (instinct)
    The wrongly coded animal does not reproduce.
Neurons
• Soma (CPU)
  Cell body; combines signals
• Dendrite (input bus)
  Combines the inputs from several other nerve cells
• Synapse (interface)
  Interface and parameter store between neurons
• Axon (cable)
  May be up to 1 m long and will transport the activation signal to neurons at different locations
Neurons

f(x) = Σ_i w_i x_i = ⟨w, x⟩

[Figure: inputs x_1, x_2, x_3, ..., x_n pass through synaptic weights w_1, ..., w_n into a single output]
Perceptron
• Weighted linear combination
• Nonlinear decision function
• Linear offset (bias)
• Linear separating hyperplanes (spam/ham, novel/typical, click/no click)
• Learning: estimating the parameters w and b

f(x) = σ(⟨w, x⟩ + b)

[Figure: inputs x_1, x_2, x_3, ..., x_n with synaptic weights w_1, ..., w_n feeding one output]
Perceptron
[Figure: linear separator between Spam and Ham points]
The Perceptron

initialize w = 0 and b = 0
repeat
  if y_i [⟨w, x_i⟩ + b] ≤ 0 then
    w ← w + y_i x_i and b ← b + y_i
  end if
until all classified correctly

• Nothing happens if classified correctly
• Weight vector is a linear combination: w = Σ_{i∈I} y_i x_i
• Classifier is a linear combination of inner products: f(x) = Σ_{i∈I} y_i ⟨x_i, x⟩ + b
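The update loop above can be sketched directly in NumPy; the toy data and the `max_epochs` safety cap are illustrative additions, not from the slides.

```python
import numpy as np

def perceptron(X, y, max_epochs=100):
    """Rosenblatt perceptron: update only on mistakes, so w stays
    a linear combination of the misclassified x_i."""
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(max_epochs):
        errors = 0
        for xi, yi in zip(X, y):
            if yi * (np.dot(w, xi) + b) <= 0:  # mistake (or on the boundary)
                w += yi * xi                   # w <- w + y_i x_i
                b += yi                        # b <- b + y_i
                errors += 1
        if errors == 0:                        # all classified correctly
            break
    return w, b

# Linearly separable toy data
X = np.array([[2.0, 1.0], [1.0, 2.0], [-1.0, -2.0], [-2.0, -1.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
w, b = perceptron(X, y)
assert all(yi * (np.dot(w, xi) + b) > 0 for xi, yi in zip(X, y))
```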
Convergence Theorem
• If there exists some (w*, b*) with unit length and
  y_i [⟨x_i, w*⟩ + b*] ≥ ρ for all i,
  then the perceptron converges to a linear separator after a number of steps bounded by
  ((b*)² + 1)(r² + 1) / ρ²  where ‖x_i‖ ≤ r.
• Dimensionality independent
• Order independent (i.e. also worst case)
• Scales with 'difficulty' of problem
Proof, Part I
Alexander J. Smola: An Introduction to Machine Learning with Kernels, Page 14

Starting Point
We start from w_1 = 0 and b_1 = 0.

Step 1: Bound on the increase of alignment
Denote by w_j the value of w at step j (analogously b_j).
Alignment: ⟨(w_j, b_j), (w*, b*)⟩
For an error on observation (x_i, y_i) we get

⟨(w_{j+1}, b_{j+1}), (w*, b*)⟩
  = ⟨(w_j, b_j) + y_i (x_i, 1), (w*, b*)⟩
  = ⟨(w_j, b_j), (w*, b*)⟩ + y_i ⟨(x_i, 1), (w*, b*)⟩
  ≥ ⟨(w_j, b_j), (w*, b*)⟩ + ρ
  ≥ j ρ.

Alignment increases with the number of errors.
Proof, Part II

Step 2: Cauchy-Schwarz for the Dot Product
⟨(w_{j+1}, b_{j+1}), (w*, b*)⟩ ≤ ‖(w_{j+1}, b_{j+1})‖ ‖(w*, b*)‖
  = √(1 + (b*)²) ‖(w_{j+1}, b_{j+1})‖

Step 3: Upper Bound on ‖(w_j, b_j)‖
If we make a mistake we have
‖(w_{j+1}, b_{j+1})‖² = ‖(w_j, b_j) + y_i (x_i, 1)‖²
  = ‖(w_j, b_j)‖² + 2 y_i ⟨(x_i, 1), (w_j, b_j)⟩ + ‖(x_i, 1)‖²
  ≤ ‖(w_j, b_j)‖² + ‖(x_i, 1)‖²
  ≤ j (r² + 1).

Step 4: Combination of the first three steps
j ρ ≤ √(1 + (b*)²) ‖(w_{j+1}, b_{j+1})‖ ≤ √( j (r² + 1) ((b*)² + 1) )

Solving for j proves the theorem.
Consequences
• Only need to store errors. This gives a compression bound for the perceptron.
• Stochastic gradient descent on the hinge loss
  l(x_i, y_i, w, b) = max(0, 1 − y_i [⟨w, x_i⟩ + b])
• Fails with noisy data

do NOT train your avatar with perceptrons
Black & White
Hardness: margin vs. size
[Figure: small-margin problems are hard, large-margin problems are easy]
Concepts & version space
• Realizable concepts
  • Some function exists that can separate the data and is included in the concept space
  • For the perceptron: data is linearly separable
• Unrealizable concepts
  • Data not separable
  • We don't have a suitable function class (often hard to distinguish)

Minimum error separation
• XOR is not linearly separable
• Nonlinear separation is trivial
• Caveat (Minsky & Papert): finding the minimum-error linear separator is NP-hard (this killed neural networks in the 70s).
Nonlinearity & Preprocessing
• Regression
  We got nonlinear functions by preprocessing
• Perceptron
  • Map data into feature space: x → φ(x)
  • Solve the problem in this space
  • In the code, replace ⟨x, x'⟩ by ⟨φ(x), φ(x')⟩
• Feature Perceptron
  • Solution in the span of the φ(x_i)
Quadratic Features
• Separating surfaces are circles, hyperbolae, parabolae
Constructing Features
(very naive OCR system)

Idea
Construct features manually. E.g. for OCR we could use hand-built features as shown on the slide.
Feature Engineering for Spam Filtering
• bag of words
• pairs of words
• date & time
• recipient path
• IP number
• sender
• encoding
• links
• ... secret sauce ...
Delivered-To: [email protected]: by 10.216.47.73 with SMTP id s51cs361171web; Tue, 3 Jan 2012 14:17:53 -0800 (PST)Received: by 10.213.17.145 with SMTP id s17mr2519891eba.147.1325629071725; Tue, 03 Jan 2012 14:17:51 -0800 (PST)Return-Path: <[email protected]>Received: from mail-ey0-f175.google.com (mail-ey0-f175.google.com [209.85.215.175]) by mx.google.com with ESMTPS id n4si29264232eef.57.2012.01.03.14.17.51 (version=TLSv1/SSLv3 cipher=OTHER); Tue, 03 Jan 2012 14:17:51 -0800 (PST)Received-SPF: neutral (google.com: 209.85.215.175 is neither permitted nor denied by best guess record for domain of [email protected]) client-ip=209.85.215.175;Authentication-Results: mx.google.com; spf=neutral (google.com: 209.85.215.175 is neither permitted nor denied by best guess record for domain of [email protected]) [email protected]; dkim=pass (test mode) [email protected]: by eaal1 with SMTP id l1so15092746eaa.6 for <[email protected]>; Tue, 03 Jan 2012 14:17:51 -0800 (PST)Received: by 10.205.135.18 with SMTP id ie18mr5325064bkc.72.1325629071362; Tue, 03 Jan 2012 14:17:51 -0800 (PST)X-Forwarded-To: [email protected]: [email protected] [email protected]: [email protected]: by 10.204.65.198 with SMTP id k6cs206093bki; Tue, 3 Jan 2012 14:17:50 -0800 (PST)Received: by 10.52.88.179 with SMTP id bh19mr10729402vdb.38.1325629068795; Tue, 03 Jan 2012 14:17:48 -0800 (PST)Return-Path: <[email protected]>Received: from mail-vx0-f179.google.com (mail-vx0-f179.google.com [209.85.220.179]) by mx.google.com with ESMTPS id dt4si11767074vdb.93.2012.01.03.14.17.48 (version=TLSv1/SSLv3 cipher=OTHER); Tue, 03 Jan 2012 14:17:48 -0800 (PST)Received-SPF: pass (google.com: domain of [email protected] designates 209.85.220.179 as permitted sender) client-ip=209.85.220.179;Received: by vcbf13 with SMTP id f13so11295098vcb.10 for <[email protected]>; Tue, 03 Jan 2012 14:17:48 -0800 (PST)DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=googlemail.com; s=gamma; 
h=mime-version:sender:date:x-google-sender-auth:message-id:subject :from:to:content-type; bh=WCbdZ5sXac25dpH02XcRyDOdts993hKwsAVXpGrFh0w=; b=WK2B2+ExWnf/gvTkw6uUvKuP4XeoKnlJq3USYTm0RARK8dSFjyOQsIHeAP9Yssxp6O 7ngGoTzYqd+ZsyJfvQcLAWp1PCJhG8AMcnqWkx0NMeoFvIp2HQooZwxSOCx5ZRgY+7qX uIbbdna4lUDXj6UFe16SpLDCkptd8OZ3gr7+o=MIME-Version: 1.0Received: by 10.220.108.81 with SMTP id e17mr24104004vcp.67.1325629067787; Tue, 03 Jan 2012 14:17:47 -0800 (PST)Sender: [email protected]: by 10.220.17.129 with HTTP; Tue, 3 Jan 2012 14:17:47 -0800 (PST)Date: Tue, 3 Jan 2012 14:17:47 -0800X-Google-Sender-Auth: 6bwi6D17HjZIkxOEol38NZzyeHsMessage-ID: <CAFJJHDGPBW+SdZg0MdAABiAKydDk9tpeMoDijYGjoGO-WC7osg@mail.gmail.com>Subject: CS 281B. Advanced Topics in Learning and Decision MakingFrom: Tim Althoff <[email protected]>To: [email protected]: multipart/alternative; boundary=f46d043c7af4b07e8d04b5a7113a
--f46d043c7af4b07e8d04b5a7113aContent-Type: text/plain; charset=ISO-8859-1
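The first two features in the list (bag of words, pairs of words) can be sketched in a few lines; the helper names and the sample message are made up for illustration.

```python
from collections import Counter

def bag_of_words(text):
    """Map a message to sparse token counts (the 'bag of words' feature)."""
    tokens = text.lower().split()
    return Counter(tokens)

def pairs_of_words(text):
    """Adjacent word pairs (bigrams) as additional features."""
    tokens = text.lower().split()
    return Counter(zip(tokens, tokens[1:]))

msg = "buy cheap pills buy now"
bow = bag_of_words(msg)
assert bow["buy"] == 2 and bow["pills"] == 1
assert pairs_of_words(msg)[("buy", "cheap")] == 1
```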
More feature engineering
• Two Interlocking Spirals
  Transform the data into a radial and angular part: (x_1, x_2) = (r sin φ, r cos φ)
• Handwritten Japanese Character Recognition
  • Break down the images into strokes and recognize them
  • Lookup based on stroke order
• Medical Diagnosis
  • Physician's comments
  • Blood status / ECG / height / weight / temperature ...
  • Medical knowledge
• Preprocessing
  • Zero mean, unit variance to fix scale issues (e.g. weight vs. income)
  • Probability integral transform (inverse CDF) as an alternative
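The spiral transform above can be sketched as a small inversion of (x_1, x_2) = (r sin φ, r cos φ); the function name `to_radial` is mine, not from the slides.

```python
import math

def to_radial(x1, x2):
    """Invert (x1, x2) = (r sin(phi), r cos(phi)) to a radial/angular pair."""
    r = math.hypot(x1, x2)
    phi = math.atan2(x1, x2)  # note the slide's convention: x1 = r sin(phi)
    return r, phi

r, phi = to_radial(1.0, 0.0)  # x1 = r sin(phi) = 1, x2 = r cos(phi) = 0
assert abs(r - 1.0) < 1e-12 and abs(phi - math.pi / 2) < 1e-12
```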
The Perceptron on Features

argument: X := {x_1, ..., x_m} ⊂ 𝒳 (data)
          Y := {y_1, ..., y_m} ⊂ {±1} (labels)
function (w, b) = Perceptron(X, Y, η)
  initialize w, b = 0
  repeat
    Pick (x_i, y_i) from data
    if y_i (⟨w, φ(x_i)⟩ + b) ≤ 0 then
      w ← w + y_i φ(x_i)
      b ← b + y_i
  until y_i (⟨w, φ(x_i)⟩ + b) > 0 for all i
end

• Nothing happens if classified correctly
• Weight vector is a linear combination: w = Σ_{i∈I} y_i φ(x_i)
• Classifier is a linear combination of inner products: f(x) = Σ_{i∈I} y_i ⟨φ(x_i), φ(x)⟩ + b

Important detail
w = Σ_j y_j φ(x_j) and hence f(x) = Σ_j y_j ⟨φ(x_j), φ(x)⟩ + b.
Problems
• Need a domain expert (e.g. Chinese OCR)
• Often expensive to compute
• Difficult to transfer engineering knowledge

Shotgun Solution
• Compute many features
• Hope that this contains good ones
• Do this efficiently
Kernels
Grace Wahba

Solving XOR
• XOR is not linearly separable
• Mapping into 3 dimensions makes it easily solvable:
  (x_1, x_2) → (x_1, x_2, x_1 x_2)
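A quick check that the 3D map really solves XOR; the ±1 encoding and the separating hyperplane w = (0, 0, −1), b = 0 are assumptions for illustration, not from the slides.

```python
def phi(x1, x2):
    """The slide's 3D map: the x1*x2 coordinate makes XOR linearly separable."""
    return (x1, x2, x1 * x2)

# XOR in +/-1 encoding: the label equals -x1*x2
data = [((1, 1), -1), ((1, -1), 1), ((-1, 1), 1), ((-1, -1), -1)]

# In feature space the hyperplane w = (0, 0, -1), b = 0 separates the data
margins = [label * -phi(x1, x2)[2] for (x1, x2), label in data]
assert all(m > 0 for m in margins)
```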
Polynomial Features

Quadratic Features in ℝ²
φ(x) := (x_1², √2 x_1 x_2, x_2²)

Dot Product
⟨φ(x), φ(x')⟩ = ⟨(x_1², √2 x_1 x_2, x_2²), (x'_1², √2 x'_1 x'_2, x'_2²)⟩ = ⟨x, x'⟩².

Insight
The trick works for any polynomial of order d via ⟨x, x'⟩^d.
Kernels

Problem
Extracting features can sometimes be very costly.
Example: second order features in 1000 dimensions. This leads to 5 · 10⁵ numbers. For higher order polynomial features it is much worse.

Solution
Don't compute the features, try to compute dot products implicitly. For some features this works ...

Definition
A kernel function k : 𝒳 × 𝒳 → ℝ is a symmetric function in its arguments for which the following property holds:
k(x, x') = ⟨φ(x), φ(x')⟩ for some feature map φ.
If k(x, x') is much cheaper to compute than φ(x) ...
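A minimal sketch of the definition: for the quadratic map on ℝ², the kernel ⟨x, x'⟩² agrees with the explicit feature-space dot product. The test vectors are arbitrary.

```python
import numpy as np

def phi(x):
    """Explicit quadratic feature map on R^2 (as on the previous slide)."""
    return np.array([x[0] ** 2, np.sqrt(2) * x[0] * x[1], x[1] ** 2])

def k(x, xp):
    """Quadratic kernel: the same dot product, without explicit features."""
    return np.dot(x, xp) ** 2

x, xp = np.array([1.0, 2.0]), np.array([3.0, -1.0])
assert abs(np.dot(phi(x), phi(xp)) - k(x, xp)) < 1e-9
```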
The Kernel Perceptron

argument: X := {x_1, ..., x_m} ⊂ 𝒳 (data)
          Y := {y_1, ..., y_m} ⊂ {±1} (labels)
function f = Perceptron(X, Y, η)
  initialize f = 0
  repeat
    Pick (x_i, y_i) from data
    if y_i f(x_i) ≤ 0 then
      f(·) ← f(·) + y_i k(x_i, ·) + y_i
  until y_i f(x_i) > 0 for all i
end

• Nothing happens if classified correctly
• Weight vector is a linear combination: w = Σ_{i∈I} y_i φ(x_i)
• Classifier is a linear combination of inner products:
  f(x) = Σ_{i∈I} y_i ⟨φ(x_i), φ(x)⟩ + b = Σ_{i∈I} y_i k(x_i, x) + b

Important detail
w = Σ_j y_j φ(x_j) and hence f(x) = Σ_j y_j k(x_j, x) + b.
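The algorithm above can be sketched by storing per-example mistake counts α instead of w; precomputing the Gram matrix and the XOR demo with a quadratic kernel are illustrative choices, not part of the slides.

```python
import numpy as np

def kernel_perceptron(X, y, k, max_epochs=100):
    """Kernel perceptron: f is represented by mistake counts alpha, never by w."""
    m = len(X)
    alpha = np.zeros(m)                                  # mistake counts per example
    b = 0.0
    K = np.array([[k(xi, xj) for xj in X] for xi in X])  # Gram matrix
    for _ in range(max_epochs):
        errors = 0
        for i in range(m):
            f_xi = np.dot(alpha * y, K[:, i]) + b        # f(x_i) = sum_j alpha_j y_j k(x_j, x_i) + b
            if y[i] * f_xi <= 0:                         # mistake
                alpha[i] += 1                            # f <- f + y_i k(x_i, .)
                b += y[i]                                # ... + y_i
                errors += 1
        if errors == 0:
            break
    return alpha, b

# XOR, which the linear perceptron cannot solve, with the quadratic kernel
X = np.array([[1, 1], [1, -1], [-1, 1], [-1, -1]], dtype=float)
y = np.array([-1, 1, 1, -1], dtype=float)
quad = lambda a, c: np.dot(a, c) ** 2
alpha, b = kernel_perceptron(X, y, quad)
f = lambda x: sum(alpha[j] * y[j] * quad(X[j], x) for j in range(len(X))) + b
assert all(y[i] * f(X[i]) > 0 for i in range(len(X)))
```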
Polynomial Kernels in ℝⁿ

Idea
We want to extend k(x, x') = ⟨x, x'⟩² to
k(x, x') = (⟨x, x'⟩ + c)^d  where c > 0 and d ∈ ℕ.
Prove that such a kernel corresponds to a dot product.

Proof strategy
Simple and straightforward: compute the explicit sum given by the kernel, i.e.
k(x, x') = (⟨x, x'⟩ + c)^d = Σ_{i=0}^{d} (d choose i) ⟨x, x'⟩^i c^{d−i}
Individual terms ⟨x, x'⟩^i are dot products for some φ_i(x).
Are all k(x, x') good Kernels?

Computability
We have to be able to compute k(x, x') efficiently (much cheaper than the dot products themselves).

"Nice and Useful" Functions
The features themselves have to be useful for the learning problem at hand. Quite often this means smooth functions.

Symmetry
Obviously k(x, x') = k(x', x) due to the symmetry of the dot product ⟨φ(x), φ(x')⟩ = ⟨φ(x'), φ(x)⟩.

Dot Product in Feature Space
Is there always a φ such that k really is a dot product?
Mercer's Theorem

The Theorem
For any symmetric function k : 𝒳 × 𝒳 → ℝ which is square integrable in 𝒳 × 𝒳 and which satisfies
∫_{𝒳×𝒳} k(x, x') f(x) f(x') dx dx' ≥ 0 for all f ∈ L₂(𝒳)
there exist φ_i : 𝒳 → ℝ and numbers λ_i ≥ 0 where
k(x, x') = Σ_i λ_i φ_i(x) φ_i(x') for all x, x' ∈ 𝒳.

Interpretation
The double integral is the continuous version of a vector-matrix-vector multiplication. For positive semidefinite matrices we have
Σ_i Σ_j k(x_i, x_j) α_i α_j ≥ 0
Properties of the Kernel

Distance in Feature Space
Distance between points in feature space via
d(x, x')² := ‖φ(x) − φ(x')‖²
  = ⟨φ(x), φ(x)⟩ − 2⟨φ(x), φ(x')⟩ + ⟨φ(x'), φ(x')⟩
  = k(x, x) + k(x', x') − 2k(x, x')

Kernel Matrix
To compare observations we compute dot products, so we study the matrix K given by
K_ij = ⟨φ(x_i), φ(x_j)⟩ = k(x_i, x_j)
where the x_i are the training patterns.

Similarity Measure
The entries K_ij tell us the overlap between φ(x_i) and φ(x_j), so k(x_i, x_j) is a similarity measure.
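The feature-space distance formula can be evaluated from kernel calls alone. With the linear kernel, φ is the identity, so it reduces to the ordinary Euclidean distance, which makes an easy sanity check.

```python
import numpy as np

def kernel_distance(k, x, xp):
    """Distance in feature space from kernel evaluations only:
    d(x, x')^2 = k(x, x) + k(x', x') - 2 k(x, x')."""
    return np.sqrt(k(x, x) + k(xp, xp) - 2 * k(x, xp))

lin = lambda a, c: np.dot(a, c)  # phi = identity, so d is Euclidean
x, xp = np.array([0.0, 3.0]), np.array([4.0, 0.0])
assert abs(kernel_distance(lin, x, xp) - 5.0) < 1e-12
```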
Properties of the Kernel Matrix

K is Positive Semidefinite
Claim: α⊤Kα ≥ 0 for all α ∈ ℝ^m and all kernel matrices K ∈ ℝ^{m×m}. Proof:
Σ_{i,j}^m α_i α_j K_ij = Σ_{i,j}^m α_i α_j ⟨φ(x_i), φ(x_j)⟩
  = ⟨Σ_i^m α_i φ(x_i), Σ_j^m α_j φ(x_j)⟩ = ‖Σ_{i=1}^m α_i φ(x_i)‖² ≥ 0

Kernel Expansion
If w is given by a linear combination of the φ(x_i) we get
⟨w, φ(x)⟩ = ⟨Σ_{i=1}^m α_i φ(x_i), φ(x)⟩ = Σ_{i=1}^m α_i k(x_i, x).
A Counterexample

A Candidate for a Kernel
k(x, x') = 1 if ‖x − x'‖ ≤ 1, and 0 otherwise.
This is symmetric and gives us some information about the proximity of points, yet it is not a proper kernel ...

Kernel Matrix
We use three points, x_1 = 1, x_2 = 2, x_3 = 3, and compute the resulting "kernel matrix" K. This yields

      [ 1 1 0 ]
  K = [ 1 1 1 ]
      [ 0 1 1 ]

with eigenvalues (√2 − 1)⁻¹, 1, and (1 − √2). Since 1 − √2 < 0, K has a negative eigenvalue. Hence k is not a kernel.
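The eigenvalue claim can be checked numerically; `eigvalsh` returns the eigenvalues of the symmetric matrix K in ascending order.

```python
import numpy as np

# The threshold "kernel" evaluated on x1 = 1, x2 = 2, x3 = 3
def k(x, xp):
    return 1.0 if abs(x - xp) <= 1 else 0.0

pts = [1.0, 2.0, 3.0]
K = np.array([[k(a, c) for c in pts] for a in pts])
eig = np.linalg.eigvalsh(K)              # ascending order
assert eig.min() < 0                     # negative eigenvalue: K is not PSD
assert abs(eig.min() - (1 - np.sqrt(2))) < 1e-9
```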
Some Good Kernels

Examples of kernels k(x, x'):
  Linear             ⟨x, x'⟩
  Laplacian RBF      exp(−λ‖x − x'‖)
  Gaussian RBF       exp(−λ‖x − x'‖²)
  Polynomial         (⟨x, x'⟩ + c)^d,  c ≥ 0, d ∈ ℕ
  B-Spline           B_{2n+1}(x − x')
  Cond. Expectation  E_c[p(x|c) p(x'|c)]

Simple trick for checking Mercer's condition
Compute the Fourier transform of the kernel and check that it is nonnegative.
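A few kernels from the table as one-liners; the parameter defaults (λ = 1, c = 1, d = 3) are arbitrary choices for illustration.

```python
import numpy as np

# Sketches of kernels from the table; lam, c, d are free parameters
linear    = lambda x, xp: np.dot(x, xp)
laplacian = lambda x, xp, lam=1.0: np.exp(-lam * np.linalg.norm(x - xp))
gaussian  = lambda x, xp, lam=1.0: np.exp(-lam * np.linalg.norm(x - xp) ** 2)
poly      = lambda x, xp, c=1.0, d=3: (np.dot(x, xp) + c) ** d

x = np.array([1.0, 0.0])
# Both RBF kernels evaluate to 1 at zero distance
assert gaussian(x, x) == 1.0 and laplacian(x, x) == 1.0
assert poly(x, x) == (1.0 + 1.0) ** 3
```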
Linear Kernel
Laplacian Kernel
Gaussian Kernel
Polynomial (Order 3)
B₃-Spline Kernel
[Figures: example decision functions for each of the kernels above]
Summary
• Perceptron
  • Hebbian learning & biology
  • Algorithm
  • Convergence analysis
• Features and preprocessing
  • Nonlinear separation
  • Perceptron in feature space
• Kernels
  • Kernel trick
  • Properties
  • Examples