
Introduction to Machine Learning

Felix Brockherde¹,² and Kristof Schütt¹

¹Technische Universität Berlin   ²Max Planck Institute of Microstructure Physics

IPAM Tutorial 2013


What is Machine Learning?

[Diagram: data with a pattern → ML algorithm → ML model → inferred structure]

ML is about learning structure from data


Examples

Drug discovery

Search engines

Brain-computer interfaces (BCI)

DNA splice site detection

Face recognition

Recommender systems

Speech recognition


This Talk

Part 1: Learning Theory and Supervised ML

- Basic Ideas of Learning Theory
- Support Vector Machines
- Kernels
- Kernel Ridge Regression

Part 2: Unsupervised ML and Application

- PCA
- Model Selection
- Feature Representation

Not covered

- Probabilistic Models
- Neural Networks
- Online Learning
- Reinforcement Learning
- Semi-supervised Learning
- etc.


Supervised Learning

Classification: y_i ∈ {−1, +1}

Regression: y_i ∈ ℝ

- Given: points X = (x_1, …, x_N) with x_i ∈ ℝ^d and labels Y = (y_1, …, y_N), generated by some joint probability distribution P(x, y).
- Learn the underlying unknown mapping f(x) = y.
- Important: performance on unseen data.


Basic Ideas in Learning Theory

Risk Minimization (RM)

Learn a model function f from examples

(x_1, y_1), …, (x_N, y_N) ∈ ℝ^d × ℝ (or {+1, −1}), generated from P(x, y),

such that the expected error on test data (drawn from P(x, y)),

R[f] = ∫ ½ |f(x) − y|² dP(x, y),

is minimal.

Problem: Distribution P(x, y) is unknown

Empirical Risk Minimization (ERM)

Replace the average over P(x, y) by the average over the training samples (i.e. minimize the training error):

R_emp[f] = (1/N) ∑_{i=1}^N ½ |f(x_i) − y_i|²
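To make the ERM objective concrete, here is a minimal numerical sketch (not from the slides) that evaluates R_emp for a candidate model on synthetic data; the data, the candidate f, and all names are illustrative assumptions.

```python
import numpy as np

def empirical_risk(f, X, y):
    """Training error R_emp[f] = (1/N) * sum_i 1/2 |f(x_i) - y_i|^2."""
    preds = np.array([f(x) for x in X])
    return np.mean(0.5 * np.abs(preds - y) ** 2)

# Toy data: noisy samples of an unknown linear relationship.
rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(50, 1))
y = 2.0 * X[:, 0] + 0.1 * rng.normal(size=50)

f_candidate = lambda x: 2.1 * x[0]          # some candidate model function
print(empirical_risk(f_candidate, X, y))    # proxy for the unknown expected risk R[f]
```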


Law of large numbers: R_emp[f] → R[f] as N → ∞.

Question: Does min_f R_emp[f] give us min_f R[f] for sufficiently large N?

No: uniform convergence needed

Error bound for classification

With probability of at least 1 − η:

R[f] ≤ R_emp[f] + √( ( D(log(2N/D) + 1) − log(η/4) ) / N )

where D is the VC dimension (Vapnik and Chervonenkis (1971)).

Introduce structure on the set of possible functions and use Structural Risk Minimization (SRM).


The linear function class (in two dimensions) has VC dimension D = 3.

min_f  R_emp[f] + Complexity[f]


Support Vector Machines (SVM)

[Figure: separating hyperplane {x | w · x + b = 0} with margin hyperplanes {x | w · x + b = +1} and {x | w · x + b = −1}, normal vector w, offset b, and margin width 2/||w||]

Normalize w so that min_{x_i} |w · x_i + b| = 1. For points x_1, x_2 on the margin hyperplanes:

w · x_1 + b = +1
w · x_2 + b = −1

⟺ w · (x_1 − x_2) = 2

⟺ (w/||w||) · (x_1 − x_2) = 2/||w||


VC Dimension of Hyperplane Classifiers

Theorem (Cortes and Vapnik (1995))

Hyperplanes in canonical form have VC Dimension

D ≤ min{ R²||w||² + 1, N + 1 }

where R is the radius of the smallest sphere containing the data.

SRM Bound:

R[f] ≤ R_emp[f] + √( ( D(log(2N/D) + 1) − log(η/4) ) / N )

Maximal margin = minimal ||w||² → good generalization, i.e. low risk:

min_{w,b} ||w||²  subject to  y_i (w · x_i + b) ≥ 1 for i = 1, …, N


Slack variables

[Figure: data points violating the margin by ξ_i]

Introduce slack variables ξ_i:

min_{w,b,ξ} ||w||² + C ∑_{i=1}^N ξ_i

subject to y_i (w · x_i + b) ≥ 1 − ξ_i and ξ_i ≥ 0
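As a hedged illustration of this soft-margin formulation (scikit-learn is not mentioned in the slides), the sketch below fits a linear SVM where the parameter C plays exactly the role of the slack penalty above; the synthetic data and all names are assumptions.

```python
import numpy as np
from sklearn.svm import SVC

# Two overlapping classes with labels y_i in {-1, +1}.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-1.0, 1.0, size=(50, 2)),
               rng.normal(+1.0, 1.0, size=(50, 2))])
y = np.array([-1] * 50 + [+1] * 50)

# C is the slack penalty in ||w||^2 + C * sum(xi_i): large C allows few margin
# violations, small C trades a wider margin for more slack.
clf = SVC(kernel="linear", C=1.0).fit(X, y)
print("w =", clf.coef_[0], " b =", clf.intercept_[0])
print("support vectors per class:", clf.n_support_)
```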


Non-linear hyperplanes

Map into a higher dimensional feature space:

Φ: ℝ² → ℝ³, (x_1, x_2) ↦ (x_1², √2 x_1 x_2, x_2²)
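A small numerical check of this explicit map (a sketch, not from the slides): the scalar product of mapped points equals the squared scalar product of the inputs, which is what makes the kernel trick on the next slides possible.

```python
import numpy as np

def phi(x):
    """Explicit feature map Phi(x1, x2) = (x1^2, sqrt(2)*x1*x2, x2^2)."""
    x1, x2 = x
    return np.array([x1**2, np.sqrt(2.0) * x1 * x2, x2**2])

x = np.array([1.0, 2.0])
z = np.array([3.0, -1.0])

# Phi(x) . Phi(z) equals (x . z)^2 -- the degree-2 polynomial kernel.
print(np.dot(phi(x), phi(z)))   # 1.0
print(np.dot(x, z) ** 2)        # 1.0
```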


Dual SVM

Primal

min_{w,b,ξ} ||w||² + C ∑_{i=1}^N ξ_i

subject to y_i (w · Φ(x_i) + b) ≥ 1 − ξ_i and ξ_i ≥ 0 for i = 1, …, N

Dual

max_α ∑_{i=1}^N α_i − ½ ∑_{i,j=1}^N α_i α_j y_i y_j (Φ(x_i) · Φ(x_j))

subject to ∑_{i=1}^N α_i y_i = 0 and C ≥ α_i ≥ 0 for i = 1, …, N

Data points xi only appear in scalar products (Φ(xi) · Φ(xj)).


The Kernel Trick

Replace scalar products with a kernel function (Müller et al. (2001)):

k(x, y) = Φ(x) · Φ(y)

- Compute the kernel matrix K_ij = k(x_i, x_j), i.e. never use Φ directly
- The underlying mapping Φ can be unknown
- Kernels can be adapted to a specific task, e.g. using prior knowledge (kernels for graphs, trees, strings, …)

Common kernels

Gaussian kernel: k(x, y) = exp(−||x − y||² / (2σ²))
Linear kernel: k(x, y) = x · y
Polynomial kernel: k(x, y) = (x · y + c)^d
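A minimal sketch (using scikit-learn, which the slides do not prescribe) of the kernel trick in practice: build the Gaussian kernel matrix K_ij = k(x_i, x_j) directly and hand it to an SVM as a precomputed kernel, never forming Φ; the data, σ, and C values are assumptions.

```python
import numpy as np
from sklearn.svm import SVC

def gaussian_kernel(A, B, sigma=1.0):
    """K_ij = exp(-||a_i - b_j||^2 / (2 sigma^2))."""
    sq_dists = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-sq_dists / (2.0 * sigma**2))

rng = np.random.default_rng(0)
X = rng.normal(size=(80, 2))
y = np.where(np.linalg.norm(X, axis=1) < 1.0, 1, -1)   # not linearly separable

K = gaussian_kernel(X, X, sigma=0.5)                    # Phi is never used directly
clf = SVC(kernel="precomputed", C=10.0).fit(K, y)

X_new = rng.normal(size=(5, 2))
K_new = gaussian_kernel(X_new, X, sigma=0.5)            # rows: new points, columns: training points
print(clf.predict(K_new))
```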


The Support Vectors in SVM

max_α ∑_{i=1}^N α_i − ½ ∑_{i,j=1}^N α_i α_j y_i y_j (Φ(x_i) · Φ(x_j))

subject to ∑_{i=1}^N α_i y_i = 0 and C ≥ α_i ≥ 0 for i = 1, …, N

KKT conditions

y_i [w · Φ(x_i) + b] > 1 ⟹ α_i = 0 → x_i irrelevant

y_i [w · Φ(x_i) + b] = 1 ⟹ on/in the margin → x_i is a Support Vector

The old model f(x) = w · Φ(x) + b becomes, via w = ∑_{i=1}^N α_i y_i Φ(x_i):

f(x) = ∑_{i=1}^N α_i y_i k(x_i, x) + b  →  f(x) = ∑_{x_i ∈ SV} α_i y_i k(x_i, x) + b


Kernel Ridge Regression (KRR)

Ridge Regression

min_w ∑_{i=1}^N ||y_i − w · x_i||² + λ||w||²

Setting derivative to zero gives

w = ( λI + ∑_{i=1}^N x_i x_iᵀ )⁻¹ ∑_{i=1}^N y_i x_i

Linear Model: f(x) = w · x
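The closed-form solution above is easy to verify numerically; below is a minimal sketch on synthetic data (the data and the λ value are assumptions).

```python
import numpy as np

rng = np.random.default_rng(0)
N, d = 100, 3
X = rng.normal(size=(N, d))
w_true = np.array([1.0, -2.0, 0.5])
y = X @ w_true + 0.1 * rng.normal(size=N)

lam = 0.1
# w = (lambda*I + sum_i x_i x_i^T)^{-1} sum_i y_i x_i, written with X of shape (N, d)
w = np.linalg.solve(lam * np.eye(d) + X.T @ X, X.T @ y)
print(w)                 # close to w_true
y_hat = X @ w            # linear model f(x) = w . x
```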


Kernelizing Ridge Regression

Setting X = (x_1, …, x_N) ∈ ℝ^{d×N} and Y = (y_1, …, y_N)ᵀ ∈ ℝ^N:

w = (λI + XXᵀ)⁻¹ XY

Apply the Woodbury matrix identity:

w = X (XᵀX + λI)⁻¹ Y

Introduce α:

α = (K + λI)⁻¹ Y  and  w = ∑_{i=1}^N Φ(x_i) α_i

Kernel Model: f(x) = w · Φ(x) = ∑_{i=1}^N α_i k(x_i, x)
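A hedged numpy sketch of kernel ridge regression as derived above, with a Gaussian kernel: solve (K + λI)α = Y once, then predict with f(x) = ∑_i α_i k(x_i, x). The 1-D sine data, σ, and λ are illustrative assumptions.

```python
import numpy as np

def gaussian_kernel(A, B, sigma):
    sq = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-sq / (2.0 * sigma**2))

rng = np.random.default_rng(0)
X = rng.uniform(0.0, 2.0 * np.pi, size=(40, 1))
y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=40)

lam, sigma = 1e-3, 0.5
K = gaussian_kernel(X, X, sigma)
alpha = np.linalg.solve(K + lam * np.eye(len(X)), y)    # alpha = (K + lambda*I)^{-1} Y

X_new = np.linspace(0.0, 2.0 * np.pi, 5).reshape(-1, 1)
y_new = gaussian_kernel(X_new, X, sigma) @ alpha        # f(x) = sum_i alpha_i k(x_i, x)
print(y_new)                                            # roughly sin(X_new)
```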


Unsupervised Learning

- Learn structure from unlabeled data
- Fit an assumed model / distribution to the data
- Examples: clustering, blind source separation, outlier detection, dimensionality reduction

[Figure: illustration of the K-means algorithm on the re-scaled Old Faithful data set: starting from initial choices for the two cluster centres, E steps assign each point to the nearest centre and M steps re-compute each centre as the mean of its assigned points, alternating until convergence.]


Principal Component Analysis (PCA)

Given a centered data matrix X = (x_1, …, x_N)ᵀ ∈ ℝ^{N×D}:

- best linear approximation:  w_1 = argmin_{||w||=1} ||X − Xwwᵀ||²
- direction of largest variance:  w_1 = argmax_{||w||=1} ||Xw||²
- matrix deflation for further components:  X_{k+1} = X_k − X_k w_k w_kᵀ

Pearson (1901), http://pbil.univ-lyon1.fr/R/pearson1901.pdf


Principal Component Analysis (PCA)

Given a centered data matrix X ∈ ℝ^{N×D}, decompose the correlated data matrix into uncorrelated, orthogonal PCs

- diagonalize the covariance matrix Σ = (1/N) XᵀX:

  Σ w_k = σ_k² w_k

- order the principal components w_k by variance σ_k²
- project the data onto the first n principal components (a numerical sketch follows below)
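A numerical sketch of linear PCA exactly along these lines (the correlated toy data are an assumption): diagonalize Σ = (1/N)XᵀX, sort by variance, and project.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2)) @ np.array([[3.0, 0.0],
                                          [1.5, 0.5]])   # correlated 2-D data
X = X - X.mean(axis=0)                                   # centering is assumed throughout

Sigma = X.T @ X / len(X)                                 # covariance matrix
var, W = np.linalg.eigh(Sigma)                           # Sigma w_k = sigma_k^2 w_k
order = np.argsort(var)[::-1]                            # order PCs by variance
var, W = var[order], W[:, order]

n = 1
X_proj = X @ W[:, :n]                                    # projection onto the first n PCs
print("variances:", var, " first PC:", W[:, 0])
```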

What about nonlinear correlations?


Kernel Principal Component Analysis (kPCA)

Transformation to feature space X ↦ X_f:

Σ_f = (1/N) X_fᵀ X_f,  K = X_f X_fᵀ,  K_ij = k(x_i, x_j)

Σ_f w_k = σ_k² w_k  ⟺  X_fᵀ X_f w_k = N σ_k² w_k

⇓  ansatz w_k = X_fᵀ α_k

X_fᵀ X_f X_fᵀ α_k = N σ_k² X_fᵀ α_k

⇓  multiply by X_f from the left

K² α_k = N σ_k² K α_k

⇓  multiply by K⁻¹

K α_k = N σ_k² α_k


Kernel Principal Component Analysis (kPCA)

Projection:

x_fᵀ w_k = x_fᵀ X_fᵀ α_k = ∑_{i=1}^N α_{k,i} k(x, x_i)

Schölkopf et al. (1997)
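A hedged sketch of kernel PCA following the eigenproblem K α_k = N σ_k² α_k and the projection formula above; centering the kernel matrix stands in for the centering of the (implicit) features, and the data, kernel width, and normalization details are assumptions.

```python
import numpy as np

def gaussian_kernel(A, B, sigma=1.0):
    sq = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-sq / (2.0 * sigma**2))

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))

K = gaussian_kernel(X, X, sigma=1.0)
N = len(X)
one = np.full((N, N), 1.0 / N)
Kc = K - one @ K - K @ one + one @ K @ one        # center K (features assumed centered)

eigval, alpha = np.linalg.eigh(Kc)                # K alpha_k = N sigma_k^2 alpha_k
order = np.argsort(eigval)[::-1]
eigval, alpha = eigval[order], alpha[:, order]
alpha /= np.sqrt(np.maximum(eigval, 1e-12))       # scale columns so that ||w_k|| = 1

X_kpca = Kc @ alpha[:, :2]                        # sum_i alpha_{k,i} k(x, x_i) for training points
print(X_kpca[:3])
```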


Model Selection

- Find the model that best fits the data distribution
- We can only estimate this distribution
- Consider: noise ratio / distribution, data correlation


Hyperparameters

- adjust model complexity: regularization, kernel parameters, etc.
- have to be tuned using examples not used for training
- standard solution: exhaustive search over a parameter grid

[Figure: noisy training points and test points drawn from f(x) = sin(x), together with a kernel ridge fit]

f(x) = sin(x)

f(x) = ∑_i α_i exp(−||x − x_i||² / σ²),  α = (K + τI)⁻¹ y


Grid Search

[Figure: fits of f(x) = sin(x) for different hyperparameter choices, and a heat map of the RMSE over a logarithmic grid of the regularization τ and kernel width σ², each ranging from 10⁻² to 10²]
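A toy version of this exhaustive search (a sketch, not the slides' code), reusing the kernel ridge model f(x) = ∑_i α_i exp(−||x − x_i||²/σ²) with α = (K + τI)⁻¹y from the previous slide; the grid, the sine data, and the held-out validation split are assumptions.

```python
import numpy as np

def krr_fit_predict(X_tr, y_tr, X_new, sigma2, tau):
    """Fit alpha = (K + tau*I)^{-1} y, predict f(x) = sum_i alpha_i exp(-|x - x_i|^2 / sigma^2)."""
    def k(A, B):
        return np.exp(-((A[:, None] - B[None, :]) ** 2) / sigma2)
    alpha = np.linalg.solve(k(X_tr, X_tr) + tau * np.eye(len(X_tr)), y_tr)
    return k(X_new, X_tr) @ alpha

rng = np.random.default_rng(0)
X_train = rng.uniform(0.0, 2.0 * np.pi, 20)
y_train = np.sin(X_train) + 0.1 * rng.normal(size=20)
X_val = rng.uniform(0.0, 2.0 * np.pi, 50)        # held-out points, NOT the final test set
y_val = np.sin(X_val)

grid = np.logspace(-2, 2, 9)                     # candidate values for both tau and sigma^2
best = min((np.sqrt(np.mean((krr_fit_predict(X_train, y_train, X_val, s2, tau)
                             - y_val) ** 2)), tau, s2)
           for tau in grid for s2 in grid)
print("best RMSE %.3f at tau = %.2g, sigma^2 = %.2g" % best)
```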


k-fold cross-validation

[Diagram: split the data into folds; a 5x outer loop holds out one fold as the test set for evaluation, while a 4x inner loop on the remaining folds performs model selection]

Don't even think about looking at the test set!
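A hedged sketch of this nested scheme with scikit-learn (not part of the slides): GridSearchCV plays the 4-fold inner model-selection loop, cross_val_score the 5-fold outer evaluation loop, so the outer test folds never influence the chosen hyperparameters. The KernelRidge model, parameter grid, and synthetic data are assumptions.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.kernel_ridge import KernelRidge
from sklearn.model_selection import GridSearchCV, KFold, cross_val_score

X, y = make_regression(n_samples=100, n_features=5, noise=0.1, random_state=0)

param_grid = {"alpha": np.logspace(-3, 1, 5), "gamma": np.logspace(-2, 1, 4)}

# Inner 4-fold loop: model selection (hyperparameter search).
inner = GridSearchCV(KernelRidge(kernel="rbf"), param_grid,
                     cv=KFold(n_splits=4, shuffle=True, random_state=0))

# Outer 5-fold loop: evaluation on folds never seen during selection.
scores = cross_val_score(inner, X, y,
                         cv=KFold(n_splits=5, shuffle=True, random_state=1))
print("outer-fold R^2 scores:", scores)
```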


From objects to vectors

How to represent complex objects for kernel methods?

- explicit map to vector space: φ: M → ℝⁿ
- use a standard kernel (e.g., linear, polynomial, Gaussian) k: ℝⁿ × ℝⁿ → ℝ on the mapped features
- direct use of a kernel function: k: M × M → ℝ


Feature Representation

Given a physical object (molecule, crystal, etc.) and a property of interest, what is a good ML representation?

- no loss of valuable information
- support generalization (remove invariances, decompose the problem)
- incorporation of domain knowledge
- depends on data set, target function and learning method


Feature Representation - Molecules

Coulomb matrix:

C_ij = 0.5 Z_i^2.4            if i = j
C_ij = Z_i Z_j / ||r_i − r_j||    if i ≠ j

[Figure: example molecules (a)–(e)]

(Rupp et al., 2012; Montavon et al., 2012)
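A minimal sketch of the Coulomb matrix definition above (the water-like toy geometry and its units are illustrative assumptions, not data from the slides).

```python
import numpy as np

def coulomb_matrix(Z, R):
    """C_ii = 0.5 * Z_i^2.4,  C_ij = Z_i * Z_j / ||r_i - r_j|| for i != j."""
    Z = np.asarray(Z, dtype=float)
    R = np.asarray(R, dtype=float)
    n = len(Z)
    C = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            if i == j:
                C[i, j] = 0.5 * Z[i] ** 2.4
            else:
                C[i, j] = Z[i] * Z[j] / np.linalg.norm(R[i] - R[j])
    return C

# Rough water-like toy geometry: nuclear charges and 3-D positions (arbitrary units).
Z = [8, 1, 1]
R = [[0.00, 0.00, 0.0], [0.96, 0.00, 0.0], [-0.24, 0.93, 0.0]]
print(coulomb_matrix(Z, R))
```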


Feature Representation - Molecules

[Figure: PCA of Coulomb matrices with atom permutations, Montavon et al. (2013)]


Results - Molecules


Feature Representation - Crystals

element pair | r_1        · · ·  r_n
α α          | g_αα(r_1)  · · ·  g_αα(r_n)
α β          | g_αβ(r_1)  · · ·  g_αβ(r_n)
β α          | g_βα(r_1)  · · ·  g_βα(r_n)
β β          | g_ββ(r_1)  · · ·  g_ββ(r_n)
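The table suggests a representation built from pairwise radial distribution values g_αβ(r) evaluated on a grid of radii. The sketch below is only a loose, hedged illustration of that idea for a finite cluster of atoms: the Gaussian smoothing, the absence of periodic images and normalization, and the function name pair_rdf_features are all assumptions rather than the method of the cited paper.

```python
import numpy as np

def pair_rdf_features(positions, species, r_grid, width=0.2):
    """Smoothed pair-distance profiles g_ab(r) on a radius grid, one row per
    ordered element pair (periodic images and normalization are ignored)."""
    positions = np.asarray(positions, dtype=float)
    elements = sorted(set(species))
    rows = {}
    for a in elements:
        for b in elements:
            g = np.zeros_like(r_grid)
            for i, si in enumerate(species):
                for j, sj in enumerate(species):
                    if i != j and si == a and sj == b:
                        d = np.linalg.norm(positions[i] - positions[j])
                        g += np.exp(-((r_grid - d) ** 2) / (2.0 * width**2))
            rows[(a, b)] = g
    return rows

r_grid = np.linspace(0.5, 5.0, 10)
positions = [[0, 0, 0], [1.5, 0, 0], [0, 1.5, 0], [1.5, 1.5, 0]]
species = ["alpha", "alpha", "beta", "beta"]
for pair, g in pair_rdf_features(positions, species, r_grid).items():
    print(pair, np.round(g, 2))
```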


Results - Crystals

Learning curve of DOS_Fermi predictions

K. T. Schütt, H. Glawe, F. Brockherde, A. Sanna, K.-R. Müller, E. K. U. Gross, How to represent crystal structures for machine learning: towards fast prediction of electronic properties, arXiv, 2013


Machine Learning ...

... has been successfully applied to various research fields.

... is based on statistical learning theory.

... provides fast and accurate predictions on previously unseen data.

... is able to model non-linear relationships of high-dimensional data.

Feature representation is key!


Literature I

Cortes, C. and Vapnik, V. (1995). Support-vector networks. Machine Learning, 20(3):273–297.

Montavon, G., Hansen, K., Fazli, S., Rupp, M., Biegler, F., Ziehe, A., Tkatchenko, A., von Lilienfeld, O. A., and Müller, K.-R. (2012). Learning invariant representations of molecules for atomization energy prediction. In Advances in Neural Information Processing Systems, pages 449–457.

Montavon, G., Rupp, M., Gobre, V., Vazquez-Mayagoitia, A., Hansen, K., Tkatchenko, A., Müller, K.-R., and von Lilienfeld, O. A. (2013). Machine learning of molecular electronic properties in chemical compound space. arXiv preprint arXiv:1305.7074.

Müller, K.-R., Mika, S., Rätsch, G., Tsuda, K., and Schölkopf, B. (2001). An introduction to kernel-based learning algorithms. IEEE Transactions on Neural Networks, 12(2):181–201.

Pearson, K. (1901). LIII. On lines and planes of closest fit to systems of points in space. The London, Edinburgh, and Dublin Philosophical Magazine and Journal of Science, 2(11):559–572.

Rupp, M., Tkatchenko, A., Müller, K.-R., and von Lilienfeld, O. A. (2012). Fast and accurate modeling of molecular atomization energies with machine learning. Physical Review Letters, 108(5):058301.

Schölkopf, B., Smola, A., and Müller, K.-R. (1997). Kernel principal component analysis. In Artificial Neural Networks — ICANN'97, pages 583–588. Springer.

Vapnik, V. N. and Chervonenkis, A. Y. (1971). On the uniform convergence of relative frequencies of events to their probabilities. Theory of Probability & Its Applications, 16(2):264–280.
