
Harvard-MIT Division of Health Sciences and Technology HST.951J: Medical Decision Support, Fall 2005. Instructors: Professor Lucila Ohno-Machado and Professor Staal Vinterbo

6.873/HST.951 Medical Decision Support Spring 2005

Artificial Neural Networks

Lucila Ohno-Machado (with many slides borrowed from Stephan Dreiseitl. Courtesy of Stephan Dreiseitl. Used with permission.)

Overview

• Motivation
• Perceptrons
• Multilayer perceptrons
• Improving generalization
• Bayesian perspective

Motivation

Images removed due to copyright reasons.

benign lesion / malignant lesion

Motivation

• Human brain
  – Parallel processing
  – Distributed representation
  – Fault tolerant
  – Good generalization capability

• Mimic structure and processing in computational model

Biological Analogy

[Figure: a biological neuron with dendrites, axon, and synapses (excitatory + and inhibitory −), mapped onto network nodes connected by synapses (weights).]

Perceptrons

[Figure: a perceptron with an input layer and an output layer, mapping input patterns (00, 01, 10, 11) to sorted patterns.]

Activation functions

Perceptrons (linear machines)

[Figure: input units (Cough, Headache, Abdominal Pain) connected by weights to output units (No disease, Pneumonia, Flu, Meningitis). The Δ rule changes the weights to decrease the error; error = what we wanted − what we got.]

Perceptron

[Figure: a perceptron for abdominal pain. Inputs: Male, Age, Temp, WBC, Pain Intensity, Pain Duration (example values shown). Adjustable weights connect the inputs to output units for Appendicitis, Diverticulitis, Perforated Duodenal Ulcer, Non-specific Pain, Cholecystitis, Small Bowel Obstruction, Pancreatitis; the example target output is 0 1 0 0 0 0 0.]

AND

[Figure: a perceptron with inputs x1, x2, weights w1, w2, output y, and threshold θ = 0.5.]

input | output
  00  |   0
  01  |   0
  10  |   0
  11  |   1

f(x1·w1 + x2·w2) = y
f(0·w1 + 0·w2) = 0
f(0·w1 + 1·w2) = 0
f(1·w1 + 0·w2) = 0
f(1·w1 + 1·w2) = 1

f(a) = 1, for a > θ
f(a) = 0, for a ≤ θ

some possible values for w1 and w2:
  w1    w2
  0.20  0.35
  0.20  0.40
  0.25  0.30
  0.40  0.20
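The truth table can be checked directly. A minimal sketch in Python, using the weight pair (0.20, 0.35) from the table above (function and variable names are mine, not from the slides):

```python
# Minimal sketch of the AND perceptron: two inputs, weights w1, w2,
# and a hard threshold theta = 0.5, as on the slide above.

def step(a, theta=0.5):
    """Threshold activation: 1 if a > theta, else 0."""
    return 1 if a > theta else 0

def perceptron(x1, x2, w1=0.20, w2=0.35, theta=0.5):
    return step(x1 * w1 + x2 * w2, theta)

for x1, x2 in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print(x1, x2, "->", perceptron(x1, x2))   # prints 0, 0, 0, 1
```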

Single layer neural network

Output of unit j: oj = 1 / (1 + e^−(aj + θj))
Input to unit j: aj = Σi wij ai
Input to unit i: ai = measured value of variable i

[Figure: input units feeding output units; plot of the logistic output for increasing θ, which shifts the sigmoid along the input axis (inputs roughly −15 to 15, outputs between 0 and 1).]
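A small sketch of a single sigmoid unit as defined above; the input values and weights below are illustrative only:

```python
import numpy as np

# Sketch of one sigmoid unit:
#   a_j = sum_i w_ij * a_i   and   o_j = 1 / (1 + exp(-(a_j + theta_j))).

def unit_output(inputs, weights, theta):
    a_j = np.dot(weights, inputs)              # net input to unit j
    return 1.0 / (1.0 + np.exp(-(a_j + theta)))

inputs = np.array([37.0, 10.0, 1.0])           # measured values of the input variables
weights = np.array([0.01, 0.02, 0.5])          # arbitrary example weights
print(unit_output(inputs, weights, theta=-0.5))
```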

Training: Minimize Error

[Figure: input units (Cough, Headache) connected by weights to output units (No disease, Pneumonia, Flu, Meningitis). The Δ rule changes the weights to decrease the error; error = what we wanted − what we got.]

Error Functions

• Mean Squared Error (for regression problems), where t is the target and o the output:
  E = Σ (t − o)² / n

• Cross-Entropy Error (for binary classification):
  E = −Σ [ t log o + (1 − t) log(1 − o) ]

  with output o given by oj = 1 / (1 + e^−(aj + θj))
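A minimal sketch of both error functions, assuming targets t and outputs o stored as NumPy arrays (names are illustrative):

```python
import numpy as np

def mean_squared_error(t, o):
    return np.mean((t - o) ** 2)

def cross_entropy_error(t, o):
    # assumes binary targets t in {0, 1} and outputs o in (0, 1)
    return -np.sum(t * np.log(o) + (1 - t) * np.log(1 - o))

t = np.array([1, 0, 1])
o = np.array([0.9, 0.2, 0.7])
print(mean_squared_error(t, o), cross_entropy_error(t, o))
```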

Error function

• Convention: w := (w0, w), x := (1, x); w0 is the “bias”
• o = f(w · x)
• Class labels ti ∈ {+1, −1}
• Error measure: E = −Σ_{i miscl.} ti (w · xi)
• How to minimize E?

[Figure: perceptron with inputs x1, x2, weights w1, w2, output y, and threshold θ = 0.5.]

Minimizing the Error

[Figure: error surface over the weights — the initial error at the initial w, the derivative giving the descent direction, and the final error at the trained w; a local minimum is marked.]

Gradient descent

[Figure: error curve showing a local minimum and the global minimum.]

Perceptron learning

• Find the minimum of E by iterating wk+1 = wk − η gradw E
• E = −Σ_{i miscl.} ti (w · xi)  ⇒  gradw E = −Σ_{i miscl.} ti xi
• “online” version: pick a misclassified xi and update wk+1 = wk + η ti xi

Perceptron learning

• Update rule wk+1 = wk + η ti xi

• Theorem: perceptron learning converges for linearly separable sets
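A sketch of the online update rule above, applied to the (linearly separable) AND problem; the bias is handled by prepending a constant 1 to each input, per the convention w := (w0, w), x := (1, x). Names and data are illustrative:

```python
import numpy as np

# Online perceptron learning: w <- w + eta * t_i * x_i for misclassified
# examples, with class labels t in {+1, -1}.

def train_perceptron(X, t, eta=0.1, epochs=100):
    X = np.hstack([np.ones((X.shape[0], 1)), X])    # prepend bias input
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        errors = 0
        for xi, ti in zip(X, t):
            if ti * np.dot(w, xi) <= 0:              # misclassified example
                w = w + eta * ti * xi
                errors += 1
        if errors == 0:                              # converged (linearly separable case)
            break
    return w

# AND function with labels in {+1, -1}
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
t = np.array([-1, -1, -1, 1])
print(train_perceptron(X, t))
```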

Gradient descent

• Simple function minimization algorithm

• Gradient is vector of partial derivatives

• Negative gradient is direction of steepest descent

[Figure: surface and contour plots of a simple error function, illustrating gradient descent.]

Figures by MIT OCW.
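A minimal sketch of plain gradient descent on a toy quadratic error surface E(w) = w1² + w2² (the surface, learning rate, and starting point are illustrative, not from the slides):

```python
import numpy as np

# Gradient descent: w_{k+1} = w_k - eta * grad E(w_k).

def grad_E(w):
    return 2 * w                       # gradient of E(w) = w1^2 + w2^2

w = np.array([3.0, -2.0])
eta = 0.1
for k in range(50):
    w = w - eta * grad_E(w)            # step in the direction of steepest descent
print(w)                               # close to the minimum at (0, 0)
```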

Classification Model

[Figure: inputs Age = 34, Gender, Stage combined through weights into a single summing unit Σ whose output is 0.6, the “Probability of being Alive”. Layout: independent variables (x1, x2, x3) → coefficients (a, b, c) → dependent variable (p); Prediction.]

Terminology

• Independent variable = input variable
• Dependent variable = output variable
• Coefficients = “weights”
• Estimates = “targets”
• Iterative step = cycle, epoch

XOR

[Figure: a perceptron with inputs x1, x2, weights w1, w2, output y, and threshold θ = 0.5.]

input | output
  00  |   0
  01  |   1
  10  |   1
  11  |   0

f(x1·w1 + x2·w2) = y
f(0·w1 + 0·w2) = 0
f(0·w1 + 1·w2) = 1
f(1·w1 + 0·w2) = 1
f(1·w1 + 1·w2) = 0

f(a) = 1, for a > θ
f(a) = 0, for a ≤ θ

some possible values for w1 and w2:
  w1    w2
  (none — no single pair of weights satisfies all four equations; XOR is not linearly separable)

XOR

input | output
  00  |   0
  01  |   1
  10  |   1
  11  |   0

[Figure: inputs x1, x2 feed a hidden unit z (through w1, w2) and the output unit y (through w3, w4); z feeds y through w5; θ = 0.5 for both units. The output is f(w1, w2, w3, w4, w5).]

f(a) = 1, for a > θ
f(a) = 0, for a ≤ θ

a possible set of values for (w1, w2, w3, w4, w5): (0.3, 0.3, 1, 1, −2)

XOR

input | output
  00  |   0
  01  |   1
  10  |   1
  11  |   0

[Figure: inputs x1, x2 feed two hidden units (through w1, w2, w3, w4), and the hidden units feed the output unit (through w5, w6); θ = 0.5 for all units. The output is f(w1, w2, w3, w4, w5, w6).]

f(a) = 1, for a > θ
f(a) = 0, for a ≤ θ

a possible set of values for (w1, w2, w3, w4, w5, w6): (0.6, −0.6, −0.7, 0.8, 1, 1)
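The XOR solution above can be verified mechanically. The sketch below uses the weight set (0.6, −0.6, −0.7, 0.8, 1, 1) with θ = 0.5 everywhere; the exact assignment of w1–w6 to connections is my reading of the figure, so treat the wiring as an assumption:

```python
# Two-layer network of hard-threshold units (theta = 0.5) computing XOR.

def step(a, theta=0.5):
    return 1 if a > theta else 0

def xor_net(x1, x2):
    h1 = step(0.6 * x1 - 0.6 * x2)     # hidden unit 1
    h2 = step(-0.7 * x1 + 0.8 * x2)    # hidden unit 2
    return step(1 * h1 + 1 * h2)       # output unit

for x1, x2 in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print(x1, x2, "->", xor_net(x1, x2))   # prints 0, 1, 1, 0
```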

From perceptrons to multilayer perceptrons

Why?

Abdominal Pain

[Figure: the abdominal-pain network — inputs Male, Age (37), Temp, WBC, Pain Intensity, Pain Duration (example values shown); adjustable weights; output units for Appendicitis, Diverticulitis, Perforated Duodenal Ulcer, Non-specific Pain, Cholecystitis, Small Bowel Obstruction, Pancreatitis; example target 0 1 0 0 0 0 0.]

Heart Attack Network

[Figure: inputs Pain Duration, Pain Intensity, ECG: ST elevation, Smoker, Age, Male (example values shown) feed a network whose output, e.g. 0.8, is the “Probability” of Myocardial Infarction (MI).]

Multilayered Perceptrons

Input to unit i: ai = measured value of variable i
Input to unit j: aj = Σi wij ai
Output of unit j: oj = 1 / (1 + e^−(aj + θj))

With a hidden layer, the outputs oj feed a further layer of units k:
Input to unit k: ak = Σj wjk oj
Output of unit k: ok = 1 / (1 + e^−(ak + θk))

[Figure: input units → hidden units → output units (perceptron vs. multilayered perceptron).]
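A sketch of a forward pass through such a network using the formulas above; the layer sizes, weights, and input values are illustrative:

```python
import numpy as np

# Forward pass: each hidden and output unit applies the logistic function
# to its weighted input plus a unit-specific offset theta.

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def forward(x, W_hidden, theta_hidden, W_out, theta_out):
    o_hidden = sigmoid(W_hidden @ x + theta_hidden)      # hidden unit outputs o_j
    o_out = sigmoid(W_out @ o_hidden + theta_out)        # output unit outputs o_k
    return o_out

x = np.array([34.0, 2.0, 4.0])                           # e.g. age, gender, stage
W_hidden = np.array([[0.6, 0.5, 0.8], [0.2, 0.1, 0.3]])  # 2 hidden units x 3 inputs
theta_hidden = np.array([0.0, 0.0])
W_out = np.array([[0.7, 0.2]])                           # 1 output unit x 2 hidden units
theta_out = np.array([0.0])
print(forward(x, W_hidden, theta_hidden, W_out, theta_out))
```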

Neural Network Model

[Figure: inputs Age = 34, Gender = 2, Stage = 4 feed a hidden layer of summing units through one set of weights (.6, .5, .8, .2, .1, .3, .7, .2), and the hidden layer feeds the output through a second set (.4, .2); the output is 0.6, the “Probability of being Alive”. Layout: independent variables → weights → hidden layer → weights → dependent variable; Prediction.]

“Combined logistic models”

[Figures: the same network drawn three times, each highlighting one path from the inputs (Age = 34, Gender, Stage) through a single hidden unit to the output (0.6, the “Probability of being Alive”) — each hidden unit resembles a separate logistic model. Layout: independent variables → weights → hidden layer → weights → dependent variable; Prediction.]

Not really, no target for hidden units...

[Figure: the full network again — inputs Age = 34, Gender = 2, Stage = 4, two layers of weights, hidden summing units, and output 0.6 (“Probability of being Alive”); unlike separate logistic models, there are no observed targets for the hidden units.]

Hidden Units and Backpropagation

[Figure: input units → hidden units → output units. The error (what we wanted − what we got) is computed at the output units, the Δ rule adjusts the output-layer weights, and the error is backpropagated so the Δ rule can also adjust the hidden-layer weights.]

Multilayer perceptrons

• Sigmoidal hidden layer
• Can represent arbitrary decision regions
• Can be trained similarly to perceptrons

ECG Interpretation

[Figure: a network mapping ECG measurements (R-R interval, S-T elevation, QRS duration, QRS amplitude, AVF lead, P-R interval) to diagnoses (SV tachycardia, Ventricular tachycardia, LV hypertrophy, RV hypertrophy, Myocardial infarction).]

Linear Separation

Separate n-dimensional space using an (n − 1)-dimensional hyperplane.

[Figure: a 2-D example with axes Cough/No cough and Headache/No headache and regions labeled Meningitis, Flu, Pneumonia, No disease; a second panel separates Treatment from No treatment over binary input patterns (000, 010, 011, 100, 101, 110, 111).]

Another way of thinking about this…

• Have data set D = {(xi, ti)} drawn from a probability distribution P(x, t)
• Model P(x, t) given samples D by an ANN with adjustable parameters w
• Statistics analogy:

Maximum Likelihood Estimation

• Maximize likelihood of data D
• Likelihood L = Π p(xi, ti) = Π p(ti | xi) p(xi)
• Minimize −log L = −Σ log p(ti | xi) − Σ log p(xi)
• Drop the second term: it does not depend on w
• Two cases: “regression” and classification

Likelihood for classification (i.e. categorical target)

• For classification, targets t are class labels
• Minimize −Σ log p(ti | xi)
• p(ti | xi) = y(xi, w)^ti (1 − y(xi, w))^(1 − ti)  ⇒
  −log p(ti | xi) = −ti log y(xi, w) − (1 − ti) log(1 − y(xi, w))
• Minimizing −log L is equivalent to minimizing
  −Σ [ ti log y(xi, w) + (1 − ti) log(1 − y(xi, w)) ]   (cross-entropy error)

Likelihood for “regression” (i.e. continuous target)

• For regression, targets t are real values
• Minimize −Σ log p(ti | xi)
• p(ti | xi) = (1/Z) exp(−(y(xi, w) − ti)² / (2σ²))  ⇒
  −log p(ti | xi) = (1/(2σ²)) (y(xi, w) − ti)² + log Z
• y(xi, w) is the network output
• Minimizing −log L is equivalent to minimizing Σ (y(xi, w) − ti)²   (sum-of-squares error)

Backpropagation algorithm

• Minimize the error function by gradient descent: wk+1 = wk − η gradw E
• Iterative gradient calculation by propagating error signals
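A minimal sketch of one backpropagation step for a network with one sigmoid hidden layer, a sigmoid output, and the cross-entropy error (so the output error signal is simply y − t). All names, shapes, and the single training example are illustrative:

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def backprop_step(x, t, W1, th1, W2, th2, eta=0.1):
    # forward pass
    h = sigmoid(W1 @ x + th1)                        # hidden outputs
    y = sigmoid(W2 @ h + th2)                        # network output
    # backward pass: error signals (deltas)
    delta_out = y - t                                # output delta (cross-entropy + sigmoid)
    delta_hid = (W2.T @ delta_out) * h * (1 - h)     # propagate error to hidden layer
    # gradient-descent updates: w <- w - eta * grad E
    W2 -= eta * np.outer(delta_out, h)
    th2 -= eta * delta_out
    W1 -= eta * np.outer(delta_hid, x)
    th1 -= eta * delta_hid
    return W1, th1, W2, th2

rng = np.random.default_rng(0)
W1, th1 = rng.normal(size=(2, 3)), np.zeros(2)
W2, th2 = rng.normal(size=(1, 2)), np.zeros(1)
x, t = np.array([0.5, -0.2, 1.0]), np.array([1.0])
W1, th1, W2, th2 = backprop_step(x, t, W1, th1, W2, th2)
```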

Backpropagation algorithm

Problem: how to set learning rate η ?

Better: use more advanced minimization algorithms (second-order information)

[Figure: contour plots of an error surface with minimization trajectories.]

Figures by MIT OCW.

Backpropagation algorithm

Classification: cross-entropy error
Regression: sum-of-squares error

Overfitting

[Figure: real distribution vs. an overfitted model.]

Improving generalization

Problem: memorizing (x, t) combinations (“overtraining”)

   x            t
 (0.7,  0.5)    0
 (0.9, −0.5)    1
 (1,   −1.2)   −0.2
 (0.3,  0.6)    1
 (0.5, −0.2)    ?

Improving generalization

• Need test set to judge performance
• Goal: represent information in data set, not noise
• How to improve generalization?
  – Limit network topology
  – Early stopping
  – Weight decay

Limit network topology

• Idea: fewer weights ⇒ less flexibility

• Analogy to polynomial interpolation:

Limit network topology

Early Stopping

[Figure: error vs. training cycles for the training set and a holdout set — training error keeps decreasing while holdout error eventually rises; alongside, an overfitted model is contrasted with the “real” model on data plotted against age.]

Early stopping

• Idea: stop training when the information (but not the noise) is modeled
• Need a hold-out set to determine when to stop training
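A sketch of an early-stopping training loop. The model interface used here (get_weights, set_weights, train_one_epoch, holdout_error) is a placeholder assumption, not something defined in the slides:

```python
# Train in epochs, track the error on a hold-out set, and keep the weights
# from the epoch where the hold-out error was lowest.

def train_with_early_stopping(model, train_data, holdout_data,
                              max_epochs=1000, patience=10):
    best_error = float("inf")
    best_weights = model.get_weights()
    epochs_since_best = 0
    for epoch in range(max_epochs):
        model.train_one_epoch(train_data)
        err = model.holdout_error(holdout_data)
        if err < best_error:                      # hold-out error still improving
            best_error, best_weights = err, model.get_weights()
            epochs_since_best = 0
        else:
            epochs_since_best += 1
            if epochs_since_best >= patience:     # stop: noise is being modeled
                break
    model.set_weights(best_weights)
    return model
```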

Overfitting

[Figure: total sum of squares (tss) vs. training epochs for the training set (b) and the hold-out set (a); the hold-out curve (tss a) reaches a minimum and then rises for the overfitted model, while the training curve (tss b) keeps decreasing. Stopping criterion: min(Δtss).]

Early stopping

Weight decay

• Idea: control smoothness of the network output by controlling the size of the weights
• Add a term α‖w‖² to the error function

Weight decay
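A sketch of how the weight-decay term enters a gradient step: α‖w‖² adds 2αw to the gradient. grad_data_error stands for the gradient of the unregularized error (e.g. from backpropagation) and is a placeholder:

```python
import numpy as np

def decayed_gradient_step(w, grad_data_error, eta=0.1, alpha=0.01):
    grad_total = grad_data_error + 2 * alpha * w   # gradient of E + alpha * ||w||^2
    return w - eta * grad_total
```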

Bayesian perspective

• Error function minimization corresponds to the maximum likelihood (ML) estimate: a single best solution wML
• Can lead to overtraining
• Bayesian approach: consider the weight posterior distribution p(w|D)

Bayesian perspective

• Posterior = likelihood × prior
• p(w|D) = p(D|w) p(w) / p(D)
• Two approaches to approximating p(w|D):
  – Sampling
  – Gaussian approximation
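A small sketch of the relation p(w|D) ∝ p(D|w) p(w): with a Gaussian prior on the weights, the unnormalized log-posterior is the data log-likelihood plus a negative squared-weight term, i.e. the weight-decay penalty seen earlier. log_likelihood is a placeholder for −E_data(w) and is an assumption:

```python
import numpy as np

def log_posterior(w, log_likelihood, alpha=0.01):
    log_prior = -alpha * np.sum(w ** 2)      # Gaussian prior on w, up to a constant
    return log_likelihood(w) + log_prior     # maximizing this gives the MAP weights
```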

Sampling from p(w|D)

prior likelihood

Sampling from p(w|D)

prior * likelihood = posterior

Bayesian example for regression

Bayesian example for classification

Model Features (with strong personal biases)

                          Modeling Effort   Examples Needed   Explanation
Rule-based Expert Syst.   high              low               high?
Classification Trees      low               high+             “high”
Neural Nets, SVM          low               high              low
Regression Models         high              moderate          moderate
Learned Bayesian Nets     low               high+             high (beautiful when it works)

Regression vs. Neural Networks

[Figure, left: a regression model with output Y and explicit terms X1, X2, X3, X1X2, X1X3, X2X3, X1X2X3 — (2³ − 1) possible variable combinations. Figure, right: a neural network with inputs X1, X2, X3 whose hidden units (labeled “X1”, “X2”, “X1X3”, “X1X2X3”, …) can learn which combinations matter.]

Y = a(X1) + b(X2) + c(X3) + d(X1X2) + ...

Summary

• ANNs inspired by functionality of the brain
• Nonlinear data model
• Trained by minimizing an error function
• Goal is to generalize well
• Avoid overtraining
• Distinguish ML and MAP solutions

Some References

Introductory and Historical Textbooks
• Rumelhart, D.E., and McClelland, J.L. (eds). Parallel Distributed Processing. MIT Press, Cambridge, 1986.
• Hertz, J.A., Palmer, R.G., and Krogh, A.S. Introduction to the Theory of Neural Computation. Addison-Wesley, Redwood City, 1991.
• Pao, Y.H. Adaptive Pattern Recognition and Neural Networks. Addison-Wesley, Reading, 1989.
• Bishop, C.M. Neural Networks for Pattern Recognition. Clarendon Press, Oxford, 1995.