Machine Learning: Neural Networks, Support Vector Machines
Georg Dorffner
Section for Artificial Intelligence and Decision Support
CeMSIIS – Medical University of Vienna
Neural networks: The simple mathematical model
• Propagation rule:
– Weighted sum
– Euclidean distance
• Transfer function f:
– Threshold function (McCulloch & Pitts)
– Linear function
– Sigmoid function
[Figure: a single unit (neuron) j with inputs, weights w_1, w_2, …, w_i, net input y_j, and activation/output x_j = f(y_j)]

$$y_j = \sum_{i=1}^{n} w_{ij}\, x_i \qquad x_j = f(y_j) = f\!\left(\sum_{i=1}^{n} w_{ij}\, x_i\right)$$
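To make the model concrete, here is a minimal numpy sketch of one unit computing its net input and activation; the input and weight values are made up for illustration, and `f` is the sigmoid from the list above:

```python
import numpy as np

def sigmoid(y):
    # logistic transfer function f
    return 1.0 / (1.0 + np.exp(-y))

# inputs x_i and weights w_ij of one unit j (illustrative values)
x = np.array([0.5, -1.0, 2.0])
w = np.array([0.1, 0.4, -0.2])

y_j = w @ x          # net input: weighted sum
x_j = sigmoid(y_j)   # activation / output
print(y_j, x_j)
```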
Perceptron as neural network
• Inputs are random "feature" detectors
• Binary codes
• Perceptron learns classification
• Learning rule = weight adaptation
• Model of perception / object recognition
• But can solve only linearly separable problems
[Figure: perceptron (image: Neuron.eng.wayne.edu)]

$$\Delta w_{ij} = \begin{cases} x_i & \text{if } x_j = 0 \text{ and } t_j = 1 \\ -x_i & \text{if } x_j = 1 \text{ and } t_j = 0 \\ 0 & \text{otherwise} \end{cases} \qquad t_j \;\dots\; \text{"target"}$$
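The learning rule above can be turned into a small training loop; a hedged Python sketch (the learning rate, epoch count, and AND data are illustrative assumptions, not from the slides):

```python
import numpy as np

def perceptron_train(X, t, epochs=20, eta=0.1):
    # X: (n_samples, n_features), t: targets in {0, 1}
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(epochs):
        for x_i, t_i in zip(X, t):
            out = float(w @ x_i + b > 0)   # threshold unit
            # weight adaptation: raise weights if the unit stayed silent
            # although the target was 1, lower them in the opposite case
            w += eta * (t_i - out) * x_i
            b += eta * (t_i - out)
    return w, b

# AND is linearly separable, so the perceptron finds a solution
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
t = np.array([0, 0, 0, 1])
w, b = perceptron_train(X, t)
print([float(w @ x + b > 0) for x in X])  # [0.0, 0.0, 0.0, 1.0]
```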
Multilayer perceptron (MLP)
• 2 (or more) layers (= connections)
Input units
Hidden units (typically sigmoid)
Output units (typically linear)
Learning rule (weight adaptation): Backpropagation
• Generalised delta rule
$$\Delta w_{ij} = \eta\, \delta_j\, x_i$$

$$\delta_j^{out} = f'\!\left(y_j^{out}\right)\left(t_j - x_j^{out}\right)$$

$$\delta_j^{hid} = f'\!\left(y_j^{hid}\right)\sum_{k=1}^{n} \delta_k^{out}\, w_{jk}^{out}$$

[Figure: two-layer MLP with weight matrices W^hid and W^out, activations y^hid, x^hid and y^out, x^out; the δ^out values flow backwards]
• The error is propagated back
• "Pseudo-error" for the hidden units
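A minimal sketch of one backpropagation step for a 2-layer MLP with sigmoid hidden units and linear outputs, following the delta formulas above; the learning rate, the random initial weights, and the absence of bias terms are simplifying assumptions:

```python
import numpy as np

def sigmoid(y):
    return 1.0 / (1.0 + np.exp(-y))

def backprop_step(x_in, t, W_hid, W_out, eta=0.05):
    # forward pass (y = net input, x = activation, as in the slides)
    y_hid = W_hid @ x_in
    x_hid = sigmoid(y_hid)
    y_out = W_out @ x_hid
    x_out = y_out                    # linear output units

    # delta_out = f'(y_out) * (t - x_out); f' = 1 for linear outputs
    delta_out = t - x_out
    # pseudo-error for hidden units, propagated back through W_out;
    # sigmoid derivative: f'(y) = x_hid * (1 - x_hid)
    delta_hid = x_hid * (1 - x_hid) * (W_out.T @ delta_out)

    # generalised delta rule: dw_ij = eta * delta_j * x_i
    W_out += eta * np.outer(delta_out, x_hid)
    W_hid += eta * np.outer(delta_hid, x_in)
    return W_hid, W_out

rng = np.random.default_rng(0)
W_hid = rng.normal(size=(3, 2)) * 0.5
W_out = rng.normal(size=(1, 3)) * 0.5
W_hid, W_out = backprop_step(np.array([0.5, -1.0]), np.array([1.0]), W_hid, W_out)
```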
Backpropagation as gradient descent
• Define (quadratic) error (for pattern l):
• Minimize error
• Change weights in the direction of the negative gradient
• Chain rule leads to backpropagation
$$E_l = \sum_{k=1}^{m}\left(x_k^{out} - t_k\right)^2$$

$$\Delta w_{ij} = -\eta\,\frac{\partial E_l}{\partial w_{ij}} \quad \text{(partial derivative with respect to the weight)}$$
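The chain-rule gradient can be checked numerically; a small sketch comparing the analytic partial derivatives of the quadratic error for a single linear unit against finite differences (all values are illustrative):

```python
import numpy as np

# quadratic error for a single linear unit, E = (x_out - t)^2
def error(w, x, t):
    return float((w @ x - t) ** 2)

w = np.array([0.3, -0.7])
x = np.array([1.0, 2.0])
t = 0.5

# analytic gradient via the chain rule: dE/dw_i = 2 * (x_out - t) * x_i
grad_analytic = 2 * (w @ x - t) * x

# finite-difference approximation of the same partial derivatives
eps = 1e-6
grad_numeric = np.array([
    (error(w + eps * np.eye(2)[i], x, t) - error(w - eps * np.eye(2)[i], x, t)) / (2 * eps)
    for i in range(2)
])
print(grad_analytic, grad_numeric)  # the two should agree closely
```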
Limits of backpropagation
• Gradient descent can get stuck in a local minimum (depends on the initial values)
• It is not guaranteed that backpropagation finds an existing solution
• Further problems: slow, can oscillate
• Solution: conjugate gradient, quasi-Newton methods
The power of NN: Arbitrary classifications
• Each hidden unit separates space into 2 halves (perceptron)
• Output units work like “AND”
• Sigmoids: smooth transitions
General approach: discriminant analysis
• Linear discriminant function:

$$g(\mathbf{x}) = \sum_{i=1}^{n} w_i x_i + w_0$$

corresponds to the perceptron with 1 output unit per class
• Quadratic, still linear in the parameters:

$$g(\mathbf{x}) = \sum_{i=1}^{p} w_i x_i + \sum_{i=1}^{p}\sum_{j=1}^{p} v_{ij}\, x_i x_j + w_0$$

corresponds to a "preprocessing" of the data; the parameters (w, v) still enter linearly
The step to the neural network
• Generalized linear:

$$g(\mathbf{x}) = \sum_{i=1}^{p} w_i\, y_i(\mathbf{x}) + w_0$$

arbitrary preprocessing functions, combined linearly
• Neural network:

$$y_i(\mathbf{x}) = f\!\left(\mathbf{w}_i^T\mathbf{x}\right) \quad f \dots \text{sigmoid (MLP)}$$

$$y_i(\mathbf{x}) = f\!\left(\left\|\mathbf{x} - \mathbf{w}_i\right\|\right) \quad f \dots \text{Gaussian (RBFN)}$$

the NN implements adaptive preprocessing, nonlinear in the parameters (w)
MLP to produce probabilities
• MLP can approximate the Bayes posterior
• Activation function: Softmax
• Prior probabilities: distribution in the training set

$$x_j = \frac{\exp(y_j)}{\sum_{i=1}^{k}\exp(y_i)}$$

$$P(c_j \mid \mathbf{x}) = \frac{p(\mathbf{x} \mid c_j)\, P(c_j)}{p(\mathbf{x})}$$
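The softmax activation as a quick numpy sketch (subtracting the maximum is a standard numerical-stability step, not shown on the slide; the input values are illustrative):

```python
import numpy as np

def softmax(y):
    # map net inputs y_j to outputs x_j that sum to 1
    e = np.exp(y - np.max(y))   # max-shift avoids overflow, result unchanged
    return e / e.sum()

print(softmax(np.array([2.0, 1.0, 0.1])))  # e.g. [0.659 0.242 0.099]
```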
Regression
• To model the data generator: estimate joint distribution
• Likelihood:
$$p(\mathbf{x}, t) = p(t \mid \mathbf{x})\, p(\mathbf{x})$$

$$L = \prod_{i=1}^{n} p(t_i \mid \mathbf{x}_i)\, p(\mathbf{x}_i)$$

where $p(t_i \mid \mathbf{x}_i)$ is the distribution with expected value $f(\mathbf{x}_i)$
Gaussian noise
• Likelihood:
$$L = \prod_{i=1}^{n} p(t_i \mid \mathbf{x}_i) = \prod_{i=1}^{n} \frac{1}{\sqrt{2\pi}\,\sigma}\,\exp\!\left(-\frac{\left(f(\mathbf{x}_i;\mathbf{W}) - t_i\right)^2}{2\sigma^2}\right)$$

• Maximizing L = minimizing −log L (constant terms, incl. p(x), can be dropped)
• Corresponds to the quadratic error (see backpropagation):

$$E = \sum_{i=1}^{n}\left(f(\mathbf{x}_i;\mathbf{W}) - t_i\right)^2$$
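Taking the negative logarithm makes the correspondence explicit; a short derivation, with C collecting all terms that do not depend on W:

$$-\log L = \frac{1}{2\sigma^2}\sum_{i=1}^{n}\left(f(\mathbf{x}_i;\mathbf{W}) - t_i\right)^2 + n\log\left(\sqrt{2\pi}\,\sigma\right) = \frac{1}{2\sigma^2}\,E + C$$

so minimizing E is the same as maximizing L.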
Training as maximum likelihood
• Minimizing the quadratic error is maximum likelihood under the assumptions:
– the error at each point is normally distributed, ~N(0, σ)
– the variance of this distribution is constant
• Variance of the error (of the noise):

$$\sigma_{opt}^2 = \frac{1}{n}\sum_{i=1}^{n}\left(f(\mathbf{x}_i;\mathbf{W}) - t_i\right)^2 = \frac{1}{n}\,E_{min} \quad \text{(remaining normalized error)}$$

• But this need not hold! Extensions are possible (noise model)
Classification as regression
• The MLP is meant to approximate the posterior: $x^{out} = P(c \mid \mathbf{x}^{in})$
• The distribution of the targets is not a normal distribution
• Bernoulli distribution:

$$L = \prod_{i=1}^{n}\left(x_i^{out}\right)^{t_i}\left(1 - x_i^{out}\right)^{1 - t_i}$$

• Negative log-likelihood:

$$E = -\sum_{i=1}^{n}\left[t_i \log x_i^{out} + (1 - t_i)\log\left(1 - x_i^{out}\right)\right]$$

• "Cross-entropy error" (for 2 classes; generalizable to n classes)
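The cross-entropy error as a small numpy sketch (the eps clipping is a standard numerical guard added here, not part of the formula above; targets and outputs are illustrative):

```python
import numpy as np

def cross_entropy(t, x_out, eps=1e-12):
    # E = -sum( t*log(x) + (1-t)*log(1-x) ); eps guards against log(0)
    x = np.clip(x_out, eps, 1 - eps)
    return -np.sum(t * np.log(x) + (1 - t) * np.log(1 - x))

t = np.array([1, 0, 1, 1])               # targets
x_out = np.array([0.9, 0.2, 0.7, 0.6])   # sigmoid outputs = P(c|x_in)
print(cross_entropy(t, x_out))
```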
Optimal pairings: transfer function (at the output) + error function
• Regression:
– linear + summed quadratic error
• Classification (discriminant function):
– linear + summed quadratic error
• Classification (Bayes posterior):
– softmax + cross-entropy error
– 2 classes, 1 output: sigmoid + cross-entropy
Gradient of the error function
• Optimization is based on gradient information:

$$\frac{\partial E}{\partial w_i} = \frac{\partial E}{\partial x^{out}} \cdot \frac{\partial x^{out}}{\partial w_i}$$

(contribution of the error function × contribution of the network)
• Backpropagation (after Bishop 1995): efficient computation of the gradient (the network's contribution): O(W) instead of O(W²), see pp. 146f
• It is independent of the chosen error function
Good minimization methods
• Gradient descent ("backprop") vs. conjugate gradient:
[Figure: comparison of the two optimization paths]
MLP as universal function approximator
• E.g.: 1 input, 1 output, 5 hidden units
• The MLP can approximate arbitrary functions (Hornik et al. 1990)
• Through superposition of sigmoids
• Complexity by combining simple elements
$$x_k^{out} = g_k\!\left(\mathbf{x}^{in}\right) = \sum_{j=1}^{n} w_{jk}^{out}\, f\!\left(\sum_{i=1}^{m} w_{ij}^{hid}\, x_i^{in} + w_{0j}^{hid}\right) + w_{0k}^{out}$$

(bias w_0: move; weights: stretch, mirror)
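A hedged numpy sketch of this superposition for 1 input, 5 hidden units, and 1 linear output; all weight values are made up to illustrate the move/stretch/mirror roles:

```python
import numpy as np

def sigmoid(y):
    return 1.0 / (1.0 + np.exp(-y))

# 1 input, 5 hidden units, 1 linear output (illustrative weights)
w_hid = np.array([ 4.0, -3.0,  2.0,  5.0, -1.5])   # stretch / mirror
b_hid = np.array([-2.0,  1.0,  0.0,  3.0, -0.5])   # move (bias)
w_out = np.array([ 1.0, -0.8,  0.5,  1.2,  0.7])
b_out = 0.1

x = np.linspace(-3, 3, 7)
# g(x) = sum_j w_out_j * f(w_hid_j * x + b_hid_j) + b_out
g = sigmoid(np.outer(x, w_hid) + b_hid) @ w_out + b_out
print(g)   # a complex curve built from five shifted/stretched sigmoids
```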
Overfitting
• If there are too few training data, the NN tries to model the noise
• Overfitting: worse performance on new data (the quadratic error increases)
[Figure: 50 samples, 15 hidden units]
Avoiding overfitting
• As much data as possible (good coverage of the distribution)
• Model (network) as small as possible
• More generally: regularisation (= limit the effective number of degrees of freedom):
– Several training runs, average
– Penalty for large networks, e.g.:

$$E' = E + \lambda \sum_{i=1}^{N} w_i^2$$

– "Pruning" (remove connections)
– Early stopping
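The penalty term above is commonly known as weight decay; a minimal sketch of how it would enter the error (the coefficient `lam` and the list-of-arrays weight representation are assumptions for illustration):

```python
import numpy as np

def penalised_error(E, weights, lam=0.01):
    # E' = E + lambda * sum_i w_i^2  (penalty for large weights)
    return E + lam * sum(np.sum(w ** 2) for w in weights)

# the corresponding gradient contribution adds 2*lambda*w to each weight's
# derivative, shrinking weights toward zero during training
W_hid = np.array([[0.5, -1.2], [0.3, 0.8]])
W_out = np.array([[1.5, -0.7]])
print(penalised_error(0.42, [W_hid, W_out]))   # 0.42 is an illustrative E
```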
The important steps in practice
Owing to their power and characteristics, neural networks require a sound and careful strategy:
1. Data inspection (visualisation)
2. Data preprocessing (e.g. normalization to zero mean and unit variance)
3. Feature selection
4. Model selection (pick best network size)
5. Comparison with simpler methods
6. Testing on independent data
7. Interpretation of results
Model selection
• Strategy for the optimal choice of model complexity (sketched in code below):
– Start small (e.g. 1 or 2 hidden units)
– n-fold cross-validation
– Add hidden units one by one
– Accept as long as there is a significant improvement (test)
• No regularization necessary: overfitting is captured by cross-validation (averaging)
• Too many hidden units → too large variance → no statistical significance
• The same method can also be used for feature selection (“wrapper”)
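A hedged scikit-learn sketch of this strategy (the dataset, fold count, and network settings are illustrative; the significance test between fold scores is left out):

```python
from sklearn.datasets import make_moons
from sklearn.model_selection import cross_val_score
from sklearn.neural_network import MLPClassifier

X, t = make_moons(n_samples=200, noise=0.2, random_state=0)

# start small and add hidden units one by one
for n_hidden in range(1, 8):
    net = MLPClassifier(hidden_layer_sizes=(n_hidden,), max_iter=2000,
                        random_state=0)
    scores = cross_val_score(net, X, t, cv=5)   # 5-fold cross-validation
    print(n_hidden, scores.mean(), scores.std())
# accept a larger net only while the mean score improves significantly
```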
Support Vector Machines: Returning to the perceptron
• Advantage of the (linear) perceptron:
– Global solution guaranteed (no local minima)
– Easy to solve / optimize
• Disadvantage:
– Restricted to linear separability
• Idea:
– Transformation of the data to a high-dimensional space, such that the problem becomes linearly separable
Mathematical formulation of perceptron learning rule
• Perceptron (1 output):

$$f(\mathbf{x}) = \mathbf{w}^T\mathbf{x} + w_0$$

• With targets $t_i = +1/-1$:

$$\mathbf{w} = \sum_i \alpha_i t_i\, \mathbf{x}_i$$

• The data are described in terms of inner (dot) products ("dual form"):

$$f(\mathbf{x}) = \sum_i \alpha_i t_i\left(\mathbf{x}_i^T\mathbf{x}\right) + w_0$$
Kernels
• The goal is a transformation xi → Φ(xi) such that the problem becomes linearly separable (Φ can be high-dimensional)
• Kernel: a function that can be written as an inner product of Φs:
• Φ does not have to be explicitly known
$$K(\mathbf{x}_1, \mathbf{x}_2) = \Phi(\mathbf{x}_1)^T\,\Phi(\mathbf{x}_2)$$

$$f(\mathbf{x}) = \sum_i \alpha_i t_i\, K(\mathbf{x}_i, \mathbf{x}) + w_0$$
Example: polynomial kernel
• 2 dimensions:
• The kernel is indeed an inner product of vectors after a transformation ("preprocessing"):

$$K(\mathbf{x},\mathbf{z}) = \left(\mathbf{x}^T\mathbf{z}\right)^2 = \left(x_1 z_1 + x_2 z_2\right)^2 = x_1^2 z_1^2 + 2\,x_1 z_1 x_2 z_2 + x_2^2 z_2^2$$

$$= \left(x_1^2,\; x_2^2,\; \sqrt{2}\,x_1 x_2\right)\left(z_1^2,\; z_2^2,\; \sqrt{2}\,z_1 z_2\right)^T = \Phi(\mathbf{x})^T\,\Phi(\mathbf{z})$$
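This identity is easy to verify numerically; a short sketch with arbitrary example vectors:

```python
import numpy as np

def phi(v):
    # the explicit transformation for the 2-d quadratic kernel above
    return np.array([v[0]**2, v[1]**2, np.sqrt(2) * v[0] * v[1]])

x = np.array([1.0, 2.0])
z = np.array([3.0, -1.0])

print((x @ z) ** 2)      # kernel evaluated in the low-dimensional space
print(phi(x) @ phi(z))   # inner product after transformation: identical
```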
The effect of the "kernel trick"
• Use of the kernel, e.g.:

$$f(\mathbf{x}) = \sum_i \alpha_i t_i\, K(\mathbf{x}_i, \mathbf{x}) + w_0 = \sum_i \alpha_i t_i \left(\mathbf{x}_i^T\mathbf{x}\right)^5 + w_0$$

• 256-dimensional vectors (e.g. 16×16 pixel images), 5th-degree polynomial: dimension ≈ 10^10
– inner product of two 10,000,000,000-dim. vectors
• The calculation is done in the low-dimensional space:
– inner product of two 256-dim. vectors
– raised to the power of 5
Large margin classifier
• High-dimensional space: overfitting easily possible
• Solution: search for the decision border (hyperplane) with the largest distance to the closest points
[Figure: separating hyperplane $\mathbf{w}^T\mathbf{x} + b = 0$ with margin planes $\mathbf{w}^T\mathbf{x} + b = \pm 1$; the distance (margin) $d = \frac{2}{\|\mathbf{w}\|}$ is maximal]

• Optimization: minimize $\|\mathbf{w}\|^2$ (i.e. maximize $d = \frac{2}{\|\mathbf{w}\|}$)
• Boundary condition: $t_i\left(\mathbf{w}^T\mathbf{x}_i + b\right) - 1 \geq 0$
Optimization of large margin classifier
• Quadratic optimization problem, Lagrange multiplier approach, leads to:
• "Dual" form
• Important: Data is again denoted in terms of inner products
• Kernel trick can be used again
$$L_D = -\sum_i \alpha_i + \frac{1}{2}\sum_{i,j}\alpha_i\alpha_j\, t_i t_j\, \mathbf{x}_i^T\mathbf{x}_j \;\rightarrow\; \min$$

$$L_D = -\sum_i \alpha_i + \frac{1}{2}\sum_{i,j}\alpha_i\alpha_j\, t_i t_j\, K\!\left(\mathbf{x}_i, \mathbf{x}_j\right) \;\rightarrow\; \min$$
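A hedged scikit-learn sketch: SVC solves exactly this kind of dual quadratic program, here with a polynomial kernel on an illustrative dataset (scikit-learn's poly kernel is K(x, z) = (gamma * xᵀz + coef0)^degree):

```python
from sklearn.datasets import make_circles
from sklearn.svm import SVC

# a problem that is not linearly separable in the input space
X, t = make_circles(n_samples=200, factor=0.4, noise=0.05, random_state=0)

# kernel='poly', degree=2 applies the kernel trick from the slides
clf = SVC(kernel='poly', degree=2)
clf.fit(X, t)
print(clf.n_support_)   # number of support vectors per class
```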
Support Vectors
• Support vectors: the points at the margin (closest to the decision border)
• They determine the solution; all other points could be omitted
[Figure: decision border in input space via the kernel function; back-projected support vectors marked]
Summary
• Neural networks are powerful machine learners for numerical features, initially inspired by neurophysiology
• Nonlinearity through interplay of simpler learners (perceptrons)
• Statistical/probabilistic framework most appropriate
• Learning = maximum likelihood, minimizing an error function with an efficient gradient-based method (e.g. conjugate gradient)
• Power comes with downsides (overfitting) → careful validation is necessary
• Support vector machines are an interesting alternative that simplifies the learning problem through the "kernel trick"