Transcript of Indian Institute of Technology Bombay MACHINE LEARNING.

Page 1:

Indian Institute of Technology Bombay

MACHINE LEARNING

Page 2:

Marc Chagall

Page 3:

(Vincent van Gogh)

Page 4:

Marc Chagall? Or Vincent van Gogh?

Page 5:

(Paul Gauguin)

Page 6:

(Vincent van Gogh)

Page 7:

Page 8:

Page 9:

Induction vs Deduction

• Deductive reasoning is the process of reasoning from one or more general statements (premises) to reach a logically certain conclusion.

• Inductive reasoning is reasoning in which the premises seek to supply strong evidence for (not absolute proof of) the truth of the conclusion.

Page 10:

• The human mind is the best pattern recognizer and classifier; it can recognize patterns in spite of noise and vagueness.

1. The human mind learns by induction.

2. The human mind recognizes by looking at the whole and not at individual parts.

MACHINE LEARNING

Page 11:

Learning is a fundamental and essential characteristic of biological neural networks.

The ease with which they can learn led to attempts to emulate a biological neural network in a computer.

Page 12:

The human brain incorporates nearly 10 billion neurons and 60 trillion connections, synapses, between them. By using multiple neurons simultaneously, the brain can perform its functions much faster than the fastest computers in existence today.

How does human mind learn?

[Diagram: a biological neuron, showing the soma (cell body), dendrites, axon, and synapses.]

A neuron consists of a cell body, soma, a number of fibers called dendrites, and a single long fiber called the axon.

Page 13:

Human Learning: Key features

1. Human beings learn patterns by induction (seeing examples).

2. The knowledge acquired remains in their memory.

3. The knowledge is recalled when required to recognize a pattern not seen before.

Page 14:

Machine Learning: Key features

• Show the computer several examples of a pattern repeatedly.

• Hope that it learns the “diagnostic” characteristics of the pattern.

• We make sure that the computer has learnt adequately (how?).

• The knowledge acquired by the computer will remain in its “memory” (how?).

• The computer will recall the knowledge when asked to classify an unseen pattern.

Page 15:

• The human mind is much better than a computer at recognizing vague/noisy patterns.

• A well-trained computer can process a larger amount of information!

• Non-linear model – the same feature gets different weights in different combinations.

MACHINE LEARNING

Page 16:

• Downside – the computer will not tell you why it has classified a particular pattern in a particular way.

• A black box!!

• Like the human mind!!

MACHINE LEARNING

Page 17:

Problems with Probabilistic/Fuzzy methods

• Weights of Evidence – correlation between maps

• Fuzzy Logic – subjective judgment -> difficult to reproduce

Page 18:

MACHINE LEARNING

• Neural networks
• Hybrid neuro-fuzzy systems
• Bayesian classifiers
• Genetic algorithms
• SOM (self-organizing maps)

Page 19:

• Resource potential modeling can be viewed as a pattern recognition problem.

• It involves predictive classification of each spatial unit, characterized by a unique combination of spatially coincident predictor patterns (or unique conditions), as mineralized or barren with respect to the target mineral deposit-type. In machine learning jargon, this combination is called a feature vector.

MACHINE LEARNING

[Figure: a grid of unique condition numbers over the study area, with a table listing, for each unique condition, its predictor patterns (distance from permeable structure, soil permeability, drainage density, slope) and the class to be predicted: potential (1) or not potential (0).]

Page 20:

Attribute1(i), Attribute2(i), …………., Attribute6(i) 0

Attribute1(i), Attribute2(iii), …………., Attribute6(iv) 1

Attribute1(ii), Attribute2(i), …………., Attribute6(v) 1

Attribute1(v), Attribute2(i), …………., Attribute6(i) 0

Attribute1(iii), Attribute2(ii), …………., Attribute6(vi) 1

MACHINE LEARNING

Page 21:

UNIQUE CONDITIONS GRID

Page 22:

Converting GIS layers to feature vectors

[Figure: GIS raster layers (rock type, SiO2 content, Fe content, distance to fault) and a deposits layer of 0s and 1s. Each raster cell gives an input feature vector, e.g. [3, 8, 33, 800], and the target output (1 where a deposit is present, 0 elsewhere) is taken from the deposits layer.]
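As a rough illustration of this conversion, here is a small Python sketch; the raster values below are made up, and real layers would be read with a GIS library rather than typed in:

```python
import numpy as np

# Made-up 3x3 rasters standing in for the GIS layers (real layers would be
# read from files, e.g. with rasterio or GDAL).
rock_type  = np.array([[3, 1, 2], [3, 3, 1], [2, 4, 1]])
sio2       = np.array([[8, 5, 6], [7, 8, 4], [6, 9, 5]])
fe_content = np.array([[33, 10, 12], [25, 33, 9], [14, 40, 11]])
dist_fault = np.array([[800, 1500, 1200], [900, 800, 2000], [1100, 600, 1800]])
deposits   = np.array([[1, 0, 0], [0, 1, 0], [0, 0, 0]])   # target layer

# Stack the predictor layers and flatten: one feature vector per raster cell.
X = np.stack([rock_type, sio2, fe_content, dist_fault], axis=-1).reshape(-1, 4)
y = deposits.reshape(-1)

print(X[0], y[0])   # [  3   8  33 800] 1, the example vector on this slide
```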

Page 23:

[Figure: feed-forward and backpropagation. An input vector (40, 1120, 600), drawn from the GIS layers (SiO2 content, MgO content, Fe content, distance to fault), is fed forward through the network (NN) to give the actual output y = 0.36. The targeted output d = 1 comes from the deposits layer, so the error (d – y) = 0.64 is propagated backwards.]

Page 24:

Inside the black-box………….???

[Figure: neurons (nodes, processing units) arranged in a layer of input neurons (input layer, I), a layer of hidden neurons (hidden layer, H), and a layer of output neurons (output layer, O); the connections carry weights w11, w12, w21, w22, w31, w32, w41, w42, and the nodes apply the functions fi, fh, fo.]

Neuron – neural; but what is the network? Connect all the neurons…

Page 25:

An artificial neural network consists of a number of very simple processors, also called neurons, which are analogous to the biological neurons in the brain.

The neurons are connected by weighted links passing signals from one neuron to another.

The output signal is transmitted through the neuron’s outgoing connection. The outgoing connection splits into a number of branches that transmit the same signal. The outgoing branches terminate at the incoming connections of other neurons in the network.

Page 26:

Properties of architecture

• No connections within a layer
• No direct connections between input and output layers
• Fully connected between layers
• Often more than 3 layers
• Number of output units need not equal number of input units
• Number of hidden units per layer can be more or less than input or output units

Page 27:

The neuron computes the weighted sum of the input signals and compares the result with a threshold value, θ. If the net input is less than the threshold, the neuron output is 0/–1. But if the net input is greater than or equal to the threshold, the neuron becomes activated and its output attains the value +1.

The neuron uses the following transfer or activation function:

$$X = \sum_{i=1}^{n} x_i w_i$$

$$Y = f(X) = \begin{cases} +1, & \text{if } X \ge \theta \\ 0/{-1}, & \text{if } X < \theta \end{cases}$$

Neuron functions (Also called Activation functions)
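A minimal sketch of such a threshold neuron in Python (the inputs, weights, and threshold below are made up for illustration):

```python
import numpy as np

def threshold_neuron(x, w, theta):
    """Weighted sum of inputs compared against a threshold (step activation).
    Returns 1 when activated, 0 otherwise (the sign variant would return -1)."""
    X = np.dot(x, w)            # X = sum_i x_i * w_i
    return 1 if X >= theta else 0

# Hypothetical inputs, weights, and threshold.
x = np.array([0.5, 0.2, 0.9])
w = np.array([0.4, 0.3, 0.6])
print(threshold_neuron(x, w, theta=0.5))   # -> 1, since X = 0.8 >= 0.5
```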

Page 28:

Activation functions of a neuron

[Figure: graphs of the step, sign, sigmoid, and linear activation functions, each plotting Y (between –1 and +1) against X.]

$$Y^{step} = \begin{cases} 1, & \text{if } X \ge 0 \\ 0, & \text{if } X < 0 \end{cases} \qquad Y^{sign} = \begin{cases} +1, & \text{if } X \ge 0 \\ -1, & \text{if } X < 0 \end{cases}$$

$$Y^{sigmoid} = \frac{1}{1 + e^{-X}} \qquad Y^{linear} = X$$

Radial basis function:

$$Y^{RBF} = e^{-\frac{(X - c)^2}{2\sigma^2}}$$
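The same functions, written as a short Python sketch (the centre c and width sigma of the RBF are free parameters):

```python
import numpy as np

def step(X):     return np.where(X >= 0, 1, 0)
def sign_fn(X):  return np.where(X >= 0, 1, -1)
def sigmoid(X):  return 1.0 / (1.0 + np.exp(-X))
def linear(X):   return X
def rbf(X, c=0.0, sigma=1.0):
    # Gaussian radial basis function centred at c with width sigma
    return np.exp(-((X - c) ** 2) / (2 * sigma ** 2))

X = np.linspace(-3, 3, 7)
print(sigmoid(X))   # smooth values between 0 and 1
```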

Page 29:

[Figure: a network with an INPUT LAYER taking inputs p1, p2, …, pn, a HIDDEN LAYER, and an OUTPUT LAYER; Σ denotes a node's transfer function (the weighted sum) and f its activation function.]

At a hidden node, with weights w11, …, w1n and bias b1:

$$u = \sum_{i=1}^{n} w_{1i}\, p_i + b_1, \qquad z = f(u)$$

At the output node, with weights w21, …, w2n and bias b2:

$$v = \sum_{i=1}^{n} w_{2i}\, z_i + b_2, \qquad y = f(v)$$

f – activation function; Σ – transfer function.

Comparing the network output y against the target output t:

$$error = t - y$$
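A minimal forward-pass sketch in Python, using the sigmoid for f; the weights, biases, input, and target below are made up:

```python
import numpy as np

def sigmoid(X):
    return 1.0 / (1.0 + np.exp(-X))

p = np.array([0.4, 0.7, 0.1])       # input vector p1..pn (made up)
W1 = np.array([[0.2, 0.8, 0.5],     # hidden-layer weights, one row
               [0.6, 0.1, 0.3]])    # per hidden neuron
b1 = np.array([0.1, 0.2])
W2 = np.array([0.7, 0.9])           # output-layer weights
b2 = 0.05

u = W1 @ p + b1     # transfer function at the hidden layer
z = sigmoid(u)      # hidden activations
v = W2 @ z + b2     # transfer function at the output layer
y = sigmoid(v)      # network output

t = 1.0             # target output
print("y =", y, "error =", t - y)
```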

Page 30:

NETWORK PARAMETERS

• Weights
• Number of neurons
• Function parameters

NETWORK TRAINING

Iterative modifications of network parameters to minimize error

TRAINING SAMPLES (VALIDATION SAMPLES): feature vectors whose class is known

Page 31:

TRAINING ALGORITHM

• The problem of assigning ‘credit’ or ‘blame’ to the individual elements (hidden units) involved in forming the overall response of a learning system.

• In neural networks, the problem amounts to deciding which weights should be altered, by how much, and in which direction.

This is analogous to deciding how much a weight in an early layer contributes to the output and thus to the error.

We therefore want to find out how the weight wij affects the error, i.e. we want:

$$\frac{\partial E(t)}{\partial w_{ij}(t)}$$

Page 32:

Backpropagation learning algorithm ‘BP’ (Rumelhart, Hinton and Williams, 1986)

BP has two phases:

Forward pass phase: computes ‘functional signal’, feedforward propagation of input pattern signals through network

Backward pass phase: computes ‘error signal’, propagates the error backwards through network starting at output units (where the error is the difference between actual and desired output values)

Page 33:

Backpropagation learning algorithm ‘BP’ (Rumelhart, Hinton and Williams, 1986)

Uses gradient descent (steepest descent) and the Delta Rule for minimizing error.

Any given combination of weights will be associated with a particular error measure. The Delta Rule uses gradient descent learning to iteratively change network weights to minimize error (i.e., to locate the global minimum in the error surface).

Page 34:

Backpropagation learning algorithm ‘BP’ (Rumelhart, Hinton and Williams, 1986)

To find a minimum of a function using gradient descent, one takes steps proportional to the negative of the gradient (or of the approximate gradient) of the function at the current point. If instead one takes steps proportional to the positive of the gradient, one approaches a maximum of that function; the procedure is then known as gradient ascent.
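A toy illustration in Python, minimizing the (made-up) one-variable function f(w) = (w − 3)² by stepping against its gradient:

```python
def f(w):          # toy error surface with its minimum at w = 3
    return (w - 3.0) ** 2

def grad_f(w):     # analytic gradient of f
    return 2.0 * (w - 3.0)

w = 0.0            # arbitrary starting point
eta = 0.1          # step size (learning rate)
for _ in range(50):
    w -= eta * grad_f(w)   # step proportional to the NEGATIVE gradient

print(w, f(w))     # w approaches 3, f(w) approaches 0
```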

Page 35:

Backpropagation learning algorithm ‘BP’ (Rumelhart, Hinton and Williams, 1986)

Step size: the learning rate.

• Too small a step: slow convergence of the error, but convergence to a minimum is assured.

• Too big a step: fast convergence, but the minimum may be missed.

Page 36:

Backpropagation learning algorithm ‘BP’ (Rumelhart, Hinton and Williams, 1986)

Derivative: how a function changes as its input changes; that is, how much one quantity changes in response to a change in some other quantity. For example, the derivative of the position of a moving object with respect to time is the object's instantaneous velocity. ≈ slope/gradient.

[Figure: black – the graph of a function; red – the tangent line to that function. The slope/gradient of the tangent line is equal to the derivative of the function at the marked point.]

[Figure: black – maximum value; white – minimum value. The gradient points towards higher values.]

Page 37:

Backpropagation learning algorithm ‘BP’ (Rumelhart, Hinton and Williams, 1986)

Partial derivative: suppose a function has several variables. The partial derivative of the function with respect to one of the variables is how the function changes as that variable changes (the other variables held constant).
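A one-line worked example (mine, not the slide's): for f(x, y) = x²y,

$$\frac{\partial f}{\partial x} = 2xy \qquad \text{and} \qquad \frac{\partial f}{\partial y} = x^2,$$

each computed while holding the other variable constant.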

Page 38:

Backpropagation learning algorithm ‘BP’ (Rumelhart, Hinton and Williams, 1986)

In the context of neural networks:

• Function: the error
• Variables: the weights/function parameters

Conceptual basis of weight adjustment:

1. Determine the partial derivative of the error with respect to each of the weights/parameters.

2. Adjust each weight in the direction opposite to the steepest gradient.

Page 39:

Input feature vector X = (X1, X2, X3, X4)

[Figure: the input feature vector X feeding an input layer I, a hidden layer J, and an output layer K.]

Input to I: X. Output of I: X (the input layer passes its inputs through unchanged).

Input to J: $I_J = \sum_I w_{IJ}\, X_I + b_J$. Output of J: $O_J = \dfrac{1}{1 + e^{-I_J}}$

Input to K: $I_K = \sum_J w_{JK}\, O_J + b_K$. Output of K: $O_K = \dfrac{1}{1 + e^{-I_K}}$

Target T = 1 if resource-bearing, 0 if barren.

Page 40:

Backpropagation learning algorithm ‘BP’

[Figure: input neurons I1, I2 (inputs x1, x2), a hidden neuron J, and an output neuron K, connected by the weights WI1_J, WI2_J, and WJ_K.]

1. Calculate the errors of the output neurons:

δK = OK (1 – OK) (Target – OK)

2. Change the output layer weights:

WJ_K = WJ_K + η·δK·OJ

3. Calculate (back-propagate) the hidden layer errors:

δJ = OJ (1 – OJ) (δK·WJ_K)

4. Change the hidden layer weights:

WI1_J = WI1_J + η·δJ·x1
WI2_J = WI2_J + η·δJ·x2

The constant η (called the learning rate, nominally equal to one) is put in to speed up or slow down the learning if required.

Page 41:

Data

Input1: 60
Input2: 25
Input3: 120
Input4: 5

2 hidden neurons, 1 output neuron, learning rate 0.5, sigmoid function. Start with random weights between 0 and 1 and run the algorithm. See if the error is reduced in the next iteration. (A sketch follows below.)
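A minimal Python sketch of this exercise, implementing the four update steps from Page 40; scaling the inputs and omitting the bias terms are my simplifications, not the slide's:

```python
import numpy as np

rng = np.random.default_rng(42)
sigmoid = lambda a: 1.0 / (1.0 + np.exp(-a))

x = np.array([60.0, 25.0, 120.0, 5.0]) / 120.0   # inputs (scaled; my addition)
target, eta = 1.0, 0.5                           # assumed target, learning rate

W1 = rng.uniform(0, 1, (2, 4))   # input -> 2 hidden neurons (random 0..1)
W2 = rng.uniform(0, 1, 2)        # hidden -> 1 output neuron

for it in range(2):                        # two iterations: compare the errors
    OJ = sigmoid(W1 @ x)                   # hidden outputs
    OK = sigmoid(W2 @ OJ)                  # actual output
    print(f"iteration {it}: error = {target - OK:.4f}")
    dK = OK * (1 - OK) * (target - OK)     # 1. output error
    dJ = OJ * (1 - OJ) * (dK * W2)         # 3. hidden errors (pre-update W2)
    W2 = W2 + eta * dK * OJ                # 2. output-layer weight update
    W1 = W1 + eta * np.outer(dJ, x)        # 4. hidden-layer weight updates
```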

Page 42:

Practical considerations: Neural Network training

• Collect all possible examples of the pattern.

• Encode and format the data.

• Classify them into three subsets (see the sketch below):

  • Training set (70%)

  • Validation set (20%)

  • Testing set (10%)

• Or use n-fold (k-fold) validation (also called jack-knifing).

• GOLDEN RULE: the number of training samples should be at least 3 times the number of parameters to be estimated.
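A simple sketch of the 70/20/10 split in Python (the shuffle and the sample count are my choices for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 20                           # hypothetical number of samples
idx = rng.permutation(n)         # shuffle before splitting

train = idx[: int(0.7 * n)]                # 70% -> 14 samples
valid = idx[int(0.7 * n): int(0.9 * n)]    # 20% -> 4 samples
test  = idx[int(0.9 * n):]                 # 10% -> 2 samples
print(len(train), len(valid), len(test))   # 14 4 2
```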

Page 43:

Practical considerations: Neural Network training

Input data encoding and formatting

VALUE  COUNT  AREA SQKM  Rock type  Distance to Fault (km)  Soil type  Slope (Degree)  Resource
1      62487  62487      4          1                       1          10              1
2      446    446        3          2                       1          11              1
3      383    383        3          1                       3          10              1
4      91831  91831      3          1                       2          12              0
5      2892   2892       2          2                       2          14              0
6      1227   1227       3          3                       3          14              1
7      934    934        1          4                       1          11              0
8      102    102        2          2                       1          9               1
9      601    601        1          1                       2          9               0
10     2742   2742       2          7                       3          9               1
11     2320   2320       1          7                       2          8               1
12     289    289        2          7                       1          8               0
13     1      1          3          9                       1          6               0
14     21050  21050      1          10                      2          6               1
15     2984   2984       4          2                       1          8               1
16     69     69         3          2                       1          9               1
17     174    174        2          2                       2          7               0
18     21     21         1          2                       1          6               0
19     379    379        1          3                       3          10              0
20     23     23         1          4                       2          11              0

Rock type: 1 Granite, 2 Sandstone, 3 Shale, 4 Basalt. Soil type: 1 Sandy, 2 Clayey, 3 Silty.

Page 44:

Practical considerations: Neural Network training

Input data encoding and formatting

VALUE  COUNT  AREA SQKM  Granite  SSt  Shale  Basalt  Distance to Fault (km)  Sandy  Clayey  Silty  Slope (Degree)  Resource
1      62487  62487      0        0    0      1       1                       1      0       0      10              1
2      446    446        0        0    1      0       2                       1      0       0      11              1
3      383    383        0        0    1      0       1                       0      0       1      10              1
4      91831  91831      0        0    1      0       1                       0      1       0      12              0
5      2892   2892       0        1    0      0       2                       0      1       0      14              0
6      1227   1227       0        0    1      0       3                       0      0       1      14              1
7      934    934        1        0    0      0       4                       1      0       0      11              0
8      102    102        0        1    0      0       2                       1      0       0      9               1
9      601    601        1        0    0      0       1                       0      1       0      9               0
10     2742   2742       0        1    0      0       7                       0      0       1      9               1
11     2320   2320       1        0    0      0       7                       0      1       0      8               1
12     289    289        0        1    0      0       7                       1      0       0      8               0
13     1      1          0        0    1      0       9                       1      0       0      6               0
14     21050  21050      1        0    0      0       10                      0      1       0      6               1
15     2984   2984       0        0    0      1       2                       1      0       0      8               1
16     69     69         0        0    1      0       2                       1      0       0      9               1
17     174    174        0        1    0      0       2                       0      1       0      7               0
18     21     21         1        0    0      0       2                       1      0       0      6               0
19     379    379        1        0    0      0       3                       0      0       1      10              0
20     23     23         1        0    0      0       4                       0      1       0      11              0
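A sketch of how the one-hot encoding of a categorical column might be done in Python (the variable names are mine):

```python
import numpy as np

rock = np.array([4, 3, 3, 3, 2])    # first five rock-type codes from the table
n_classes = 4                       # Granite, Sandstone, Shale, Basalt

# One-hot: each row gets a 1 in the column of its class (codes are 1-based).
onehot = np.eye(n_classes, dtype=int)[rock - 1]
print(onehot)
# [[0 0 0 1]    <- Basalt
#  [0 0 1 0]    <- Shale
#  ...
```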

Page 45:

Practical considerations: Neural Network training

Input data encoding and formatting

0 0 0 1   1  1 0 0  10  1
0 0 1 0   2  1 0 0  11  1
0 0 1 0   1  0 0 1  10  1
0 0 1 0   1  0 1 0  12  0
0 1 0 0   2  0 1 0  14  0
0 0 1 0   3  0 0 1  14  1
1 0 0 0   4  1 0 0  11  0
0 1 0 0   2  1 0 0   9  1
1 0 0 0   1  0 1 0   9  0
0 1 0 0   7  0 0 1   9  1
1 0 0 0   7  0 1 0   8  1
0 1 0 0   7  1 0 0   8  0
0 0 1 0   9  1 0 0   6  0
1 0 0 0  10  0 1 0   6  1
0 0 0 1   2  1 0 0   8  1
0 0 1 0   2  1 0 0   9  1
0 1 0 0   2  0 1 0   7  0
1 0 0 0   2  1 0 0   6  0
1 0 0 0   3  0 0 1  10  0
1 0 0 0   4  0 1 0  11  0

Page 46:

Practical considerations: Neural Network training

Input data encoding and formatting

Training data (rows 1–10, with class labels):

0 0 0 1   1  1 0 0  10  1
0 0 1 0   2  1 0 0  11  1
0 0 1 0   1  0 0 1  10  1
0 0 1 0   1  0 1 0  12  0
0 1 0 0   2  0 1 0  14  0
0 0 1 0   3  0 0 1  14  1
1 0 0 0   4  1 0 0  11  0
0 1 0 0   2  1 0 0   9  1
1 0 0 0   1  0 1 0   9  0
0 1 0 0   7  0 0 1   9  1

Validation data (rows 11–16; class labels held aside: 1 0 0 1 1 1):

1 0 0 0   7  0 1 0   8
0 1 0 0   7  1 0 0   8
0 0 1 0   9  1 0 0   6
1 0 0 0  10  0 1 0   6
0 0 0 1   2  1 0 0   8
0 0 1 0   2  1 0 0   9

Testing data (rows 17–20; class labels held aside: 0 0 0 0):

0 1 0 0   2  0 1 0   7
1 0 0 0   2  1 0 0   6
1 0 0 0   3  0 0 1  10
1 0 0 0   4  0 1 0  11

Page 47:

Practical considerations: Neural Network training

Training

1. Choose a subset of the training samples.
2. Compute the error for the subset.
3. Update the weights so as to reduce the error (e.g., using gradient descent).
4. Calculate the error for the validation samples.

The above 4 steps comprise one pass through the subset of training samples along with an update of the weights, called a “training epoch”. The number of training samples in the subset is the epoch size. You can use an epoch size of 1, an epoch size of n (= the number of training samples), or any size in between.

Save the weights/parameters after every training epoch (a sketch follows below).
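A schematic Python sketch of this loop for a single-layer logistic unit (the dataset is random and the network is simplified; the point is the epoch structure, the validation error, and saving the weights after every epoch):

```python
import numpy as np

rng = np.random.default_rng(1)
sigmoid = lambda a: 1.0 / (1.0 + np.exp(-a))

# Tiny made-up dataset: 10 training and 4 validation samples, 3 features each.
Xtr, ytr = rng.uniform(0, 1, (10, 3)), rng.integers(0, 2, 10)
Xva, yva = rng.uniform(0, 1, (4, 3)), rng.integers(0, 2, 4)

w, eta = rng.uniform(0, 1, 3), 0.5
saved = []                                   # weights saved after every epoch

for epoch in range(100):
    y = sigmoid(Xtr @ w)                     # 1. forward pass on the subset
    err = ytr - y                            # 2. error for the subset
    w += eta * Xtr.T @ (err * y * (1 - y))   # 3. delta-rule weight update
    val_err = np.mean((yva - sigmoid(Xva @ w)) ** 2)   # 4. validation error
    saved.append((val_err, w.copy()))        # save weights after the epoch

best = min(range(100), key=lambda e: saved[e][0])
print("best epoch:", best)   # keep the weights where validation error is lowest
```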

Page 48:

Practical considerations: Neural Network training

Training

Plot training and validation errors against number of training epochs

The validation error reaches a minimum at 70 epochs, beyond which it begins to rise.

=> The weights/parameters saved after the 70th epoch comprise the trained network.

Page 49:

Practical considerations: Neural Network training

Training

Before jumping to processing the samples to be classified, test your trained network with the testing samples (the third subset)

Page 50:

Optimization of the number of hidden neurons

[Plot: percent error against the number of hidden units (2–10), showing the training-set error and validation-set error curves.]