Introduction to Neural Network - American University in Cairo (rafea/CSCE465/slides/GD.pdf · 2019-12-09)

Introduction to Neural Network


Page 1

Introduction to Neural Network

Page 2

List of YouTube videos & sites

• https://www.youtube.com/watch?v=8d6jf7s6_Qs
• https://mattmazur.com/2015/03/17/a-step-by-step-backpropagation-example/

Page 3

Introduction

• An Artificial Neural Network (ANN) is a computational model that is inspired by the way biological neural networks in the human brain process information.

• It has recently been applied in many domains, e.g. computer vision, speech recognition, and text generation.

Page 4

Neural Networks Basics

The function f is non-linear and is called the Activation Function. Its purpose is to introduce non-linearity into the output of a neuron. This is important because most real-world data is non-linear, and we want neurons to learn these non-linear representations. The bias acts as the constant term b in the linear equation f(x) = ax + b.

https://ujjwalkarn.me/2016/08/09/quick-intro-neural-networks/

Page 5

Activation Functions

• Sigmoid: takes a real-valued input and squashes it to the range (0, 1): σ(x) = 1 / (1 + exp(−x))
• tanh: takes a real-valued input and squashes it to the range [−1, 1]: tanh(x) = 2 / (1 + exp(−2x)) − 1
• ReLU (Rectified Linear Unit): takes a real-valued input and thresholds it at zero, replacing negative values with zero: f(x) = max(0, x)
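These formulas can be sketched directly in plain Python (a minimal version for illustration; in practice vectorized NumPy implementations are used):

```python
import math

def sigmoid(x):
    # Squashes a real-valued input to the range (0, 1).
    return 1.0 / (1.0 + math.exp(-x))

def tanh(x):
    # Squashes a real-valued input to the range (-1, 1);
    # equivalent to 2*sigmoid(2x) - 1.
    return 2.0 / (1.0 + math.exp(-2.0 * x)) - 1.0

def relu(x):
    # Thresholds at zero: negative inputs become 0.
    return max(0.0, x)
```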

Page 6

Feed Forward Networks

• This is the simplest form of a neural network.
• It contains multiple neurons (nodes) arranged in layers.

Nodes from adjacent layers have connections or edges between them. All these connections have weights associated with them.

• Input Nodes – The Input nodes provide information from the outside world to the network and are together referred to as the “Input Layer”.

• Hidden Nodes – The Hidden nodes have no direct connection with the outside world (hence the name “hidden”). They perform computations and transfer information from the input nodes to the output nodes. A collection of hidden nodes forms a “Hidden Layer”.

• Output Nodes – The Output nodes are collectively referred to as the “Output Layer” and are responsible for computations and transferring information from the network to the outside world.

Page 7

Feed Forward Networks

• In a feedforward network, the information moves in only one direction – forward – from the input nodes, through the hidden nodes (if any) and to the output nodes.

• There are two main types of Feed Forward Networks:
• Single-layer perceptron: the simplest form of the network, with no hidden layers. It can learn only linear functions.
• Multi-layer perceptron: the most commonly used form, with one or more hidden layers. It can learn both linear and non-linear functions.

Page 8

Back Propagation [Training the Network]

• The goal of training is to assign correct weights for the connections between the nodes of successive layers. E.g. connections between input nodes and hidden layer nodes.

• Back propagation of errors is one of the training methods of neural networks.

• It is a supervised method, i.e. it needs labeled data to work with.
• It is based on comparing the actual output of the network with the labeled (target) output, and it propagates the error back through the network so it learns from its mistakes.

Page 9

Back Propagation [Training the Network]

• Steps for training a network:
1. Initialize the weights randomly, usually with small values (setting all weights to zero is also used).
2. Feed-forward pass:
   • For each hidden layer, calculate f(Σ W · I) for each neuron, where W are the weights of the connections, I are the inputs, and f() is the activation function.
   • For the output layer, calculate the output of each neuron. For classification problems, the Softmax function is usually used.
3. Error back propagation:
   • Calculate the total error at the output nodes and propagate these errors back through the network using backpropagation to calculate the gradients.
   • Use an optimization method such as Gradient Descent to 'adjust' all weights in the network with the aim of reducing the error at the output layer.
4. Repeat from Step 2 for all input values until the error is minimized.
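The feed-forward pass in step 2 can be sketched as follows. The network shape and the weight values below are hypothetical, chosen only to illustrate the computation of f(Σ W · I) per hidden neuron and a softmax output:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def softmax(z):
    exps = [math.exp(v) for v in z]
    s = sum(exps)
    return [v / s for v in exps]

def forward(x, W_hidden, W_out):
    # Hidden layer: activation of the weighted sum, f(sum(W * I)).
    h = [sigmoid(sum(w * xi for w, xi in zip(row, x))) for row in W_hidden]
    # Output layer: softmax over the weighted sums.
    z = [sum(w * hi for w, hi in zip(row, h)) for row in W_out]
    return softmax(z)

# Hypothetical weights for a 2-input, 2-hidden, 2-output network.
W_hidden = [[0.15, 0.20], [0.25, 0.30]]
W_out = [[0.40, 0.45], [0.50, 0.55]]
probs = forward([0.05, 0.10], W_hidden, W_out)
```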

Page 10

AND Logic Gate using Neural Networks

Input (x1 x2)   Output (t)
0 0             0
0 1             0
1 0             0
1 1             1

[Figure: a single neuron with inputs x1 and x2, weights w1 and w2, and output y]

• The target output (t) is the output in the truth table of the AND gate.
• Weights are initialized to small random numbers, here 0.
• Output of the network: y = Σ (x_i · w_i) for i = 1..n, where n is the number of input nodes.
• The learning rate (η) is a parameter with a value < 1 that determines how fast the network learns.
• Δw_i = η · x_i · (t − y), where (t − y) is the error between the target output and the actual output of the network.

Page 11

AND Logic Gate using Neural Networks

The weights are initialized to 0 and the learning rate is set to 0.1.

Epoch | x1 x2 | w1     w2     | t | y      | (t-y)  | Δw1     Δw2
1     | 0  0  | 0      0      | 0 | 0      | 0      | 0       0
1     | 0  1  | 0      0      | 0 | 0      | 0      | 0       0
1     | 1  0  | 0      0      | 0 | 0      | 0      | 0       0
1     | 1  1  | 0      0      | 1 | 0      | 1      | 0.1     0.1
2     | 0  0  | 0.1    0.1    | 0 | 0      | 0      | 0       0
2     | 0  1  | 0.1    0.1    | 0 | 0.1    | -0.1   | 0       -0.01
2     | 1  0  | 0.1    0.09   | 0 | 0.1    | -0.1   | -0.01   0
2     | 1  1  | 0.09   0.09   | 1 | 0.18   | 0.82   | 0.082   0.082
3     | 0  0  | 0.172  0.172  | 0 | 0      | 0      | 0       0
3     | 0  1  | 0.172  0.172  | 0 | 0.172  | -0.172 | 0       -0.0172
3     | 1  0  | 0.172  0.1548 | 0 | 0.172  | -0.172 | -0.0172 0
3     | 1  1  | 0.1548 0.1548 | 1 | 0.3096 | 0.69   | 0.069   0.069
4     | 0  0  | 0.2238 0.2238 | 0 | 0      | 0      | 0       0
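This table can be reproduced by a short loop implementing the update rule Δw_i = η · x_i · (t − y), with weights updated after each sample:

```python
def train_and_gate(epochs, eta=0.1):
    # Online perceptron training for the AND truth table:
    # y = x1*w1 + x2*w2, delta_w_i = eta * x_i * (t - y),
    # weights initialized to 0.
    samples = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]
    w = [0.0, 0.0]
    for _ in range(epochs):
        for (x1, x2), t in samples:
            y = x1 * w[0] + x2 * w[1]
            err = t - y
            w[0] += eta * x1 * err
            w[1] += eta * x2 * err
    return w
```

After one epoch this gives w = [0.1, 0.1], after two epochs [0.172, 0.172], and after three epochs approximately [0.2238, 0.2238], matching the table.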

Page 13

Softmax Activation Function

• Softmax is used in classification problems as it gives high-confidence predictions.
• Each class gets a prediction value < 1.
• The sum of all predictions equals 1.
• The class with the highest value is the winning class.
• softmax(x_j) = e^(x_j) / Σ_i e^(x_i) = p(y = j | x)
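A direct implementation of the formula, with the standard max-subtraction trick for numerical stability (an addition not shown on the slide; it does not change the result because softmax is shift-invariant):

```python
import math

def softmax(x):
    # Subtract the maximum before exponentiating so large inputs
    # do not overflow; the ratio is unchanged.
    m = max(x)
    exps = [math.exp(v - m) for v in x]
    s = sum(exps)
    return [v / s for v in exps]
```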

Page 14

Gradient Descent: Simple Example

• Assume that we have a NN with one input and one output, with weight w randomly initialized to 0.8.
• The input is 1.5 and the desired output is 0.5.

[Figure: input layer connected to output layer by the weight w = 0.8]

a = i · w = 1.5 × 0.8 = 1.2

Page 15

Cost Function

• We need to change the weight to adjust the output to be 0.5.
• This process is an optimization problem, and it needs an error function, aka a cost function.
• If the desired output is y, then the cost function could be: C = (a − y)²
• The question is how to change the weight w to reduce the error until the output gets as close as possible to the desired value.

Page 16

Idea of Gradient Descent

• The figure below shows the cost function in terms of the output a.
• The idea of the gradient descent method is to move in the direction in which the gradient leads toward the optimal point.
• The gradient is the derivative of the cost function.
• The figure shows that the error decreases when the output decreases.

[Figures: the cost C and its gradient, each plotted against the output a over the range −0.4 to 1.4]

Page 17

Differentiation Chain Rule

• Now we examine the relationship between the output and the weight w: the output changes with w at a rate of 1.5, since a = i · w and i = 1.5.
• The chain rule of differentiation will be used here: d/dx f(g(x)) = g′(x) · f′(g(x))
• If we have several layers, then: d/dx g(e(c(x))) = c′(x) · e′(c(x)) · g′(e(c(x)))

[Figure: the output a plotted against the weight w; a straight line with slope 1.5]

Page 18

Gradient of the Cost Function in terms of Weight

[Figure: dC/dw plotted against the weight w]

• Cost is a function of the output a, and a is a function of w (the curve above shows the relation between the cost and w).
• We can use the chain rule:
  dC/dw = (dC/da) · (da/dw) = d/da (a − y)² · d/dw (i·w) = 2·(a − y)·i = (2a − 2y)·i = (2·i·w − 2y)·i = (3w − 1)·1.5 = 4.5w − 1.5
• The curve below shows the relationship between the rate of change of the cost and the weight.

[Figure: the cost C plotted against the weight w]
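The closed form dC/dw = 4.5w − 1.5 can be checked against a numerical derivative of the cost (a quick sanity check, using the slide's values i = 1.5 and y = 0.5):

```python
def cost(w, i=1.5, y=0.5):
    # C = (a - y)^2 with a = i * w
    a = i * w
    return (a - y) ** 2

def grad_analytic(w):
    # Closed form obtained above via the chain rule.
    return 4.5 * w - 1.5

def grad_numeric(w, h=1e-6):
    # Central finite-difference approximation of dC/dw.
    return (cost(w + h) - cost(w - h)) / (2 * h)
```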

Page 19

Learning rate

• The slope of the cost function determines the adjustment direction.
• Adjust the weights in the network proportionally to the negative of the gradient.
• The learning rate determines the adjustment magnitude.
• An excessively large adjustment makes the updates overshoot and the error explode (exploding gradients).
• Typical learning rates are between 0.00001 and 0.1.
• The adjustment is back-propagated through all layers in the network.

Page 20

Putting it all together

• Using the simple example, initially we have: i = 1.5, w = 0.8, y = 0.5, dC/dw = 4.5w − 1.5, learning rate r = 0.1.
• Reduce w by r times the gradient:
  w_1 = w_0 − r · dC/dw = w_0 − 0.1 · (4.5 · w_0 − 1.5)

w_0     w_1
0.8     0.59
0.59    0.4745
0.4745  0.4109
0.4109  0.376
0.376   0.357
...     0.333 (limit)
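The iteration w_1 = w_0 − 0.1 · (4.5 · w_0 − 1.5) from the previous slide reproduces this table in a few lines, and running it longer shows the convergence toward 1/3 ≈ 0.333 (where a = 1.5 · w = 0.5 = y):

```python
def gradient_descent(w=0.8, r=0.1, steps=5):
    # Repeated update: w <- w - r * dC/dw, with dC/dw = 4.5*w - 1.5.
    history = [w]
    for _ in range(steps):
        w = w - r * (4.5 * w - 1.5)
        history.append(w)
    return history
```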

Page 21

Back Propagation: Generalization to Several Layers

Consider a chain of layers with weights w2, w1, w0 and activations a3 (the input), a2 = w2·a3, a1 = w1·a2, and a0 = w0·a1 (the output).

The weight updates follow the chain rule:
• Δw0 = −r · (da0/dw0) · (dC/da0)
• Δw1 = −r · (da1/dw1) · (da0/da1) · (dC/da0)
• Δw2 = −r · (da2/dw2) · (da1/da2) · (da0/da1) · (dC/da0)

With the cost C = (a0 − y)², the update for w2 becomes:
w2′ = w2 − r · a3 · w1 · w0 · 2 · (a0 − y)
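The chain-rule update for w2 can be verified numerically; the weight and input values below are hypothetical, chosen only to exercise the formula:

```python
def update_w2(a3, w0, w1, w2, y, r):
    # Forward pass through three linear layers.
    a2 = w2 * a3
    a1 = w1 * a2
    a0 = w0 * a1
    # Chain rule: dC/dw2 = (da2/dw2)*(da1/da2)*(da0/da1)*(dC/da0)
    #                    = a3 * w1 * w0 * 2*(a0 - y)
    grad = a3 * w1 * w0 * 2 * (a0 - y)
    return w2 - r * grad
```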

Page 22

Word Embedding

• Word embedding is a way of representing a word as a vector that captures the probabilities of this word occurring together with other words.

• First, build a vocabulary of words from the training documents, for example a vocabulary of 10,000 unique words.

• Then represent an input word like “ants” as a one-hot vector. This vector will have 10,000 components (one for every word in the vocabulary) and a “1” is placed in the position corresponding to the word “ants”, and 0s in all of the other positions.

• The output of the network is a single vector (also with 10,000 components) containing, for every word in the vocabulary, the probability that a randomly selected nearby word is that vocabulary word.
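The one-hot representation can be sketched as follows, using a toy four-word vocabulary in place of the 10,000-word one:

```python
def one_hot(word, vocab):
    # One component per vocabulary word; 1 at the word's index,
    # 0 everywhere else.
    vec = [0] * len(vocab)
    vec[vocab.index(word)] = 1
    return vec

# Tiny stand-in vocabulary for illustration.
vocab = ["ants", "car", "house", "tree"]
v = one_hot("ants", vocab)
```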

Page 23

Word Embedding Network Architecture

There is no activation function on the hidden layer neurons, but the output neurons use softmax.

When training this network on word pairs, the input is a one-hot vector representing the input word and the training output is also a one-hot vector representing the output word. But when you evaluate the trained network on an input word, the output vector will actually be a probability distribution (i.e., a bunch of floating point values, not a one-hot vector).

Page 24

Word Embedding Contd.

The example shows a window size of two, so every word is paired with the preceding two words and the next two words.

Page 25

Word Embedding Hidden Layer

If we use word vectors with 300 features (the embedding size), the hidden layer is represented by a weight matrix with 10,000 rows (one for every word in the vocabulary) and 300 columns (one for every hidden neuron). The end goal is to learn this hidden-layer weight matrix.

The output of the hidden layer is the "word vector" of the input word.
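Because the input is one-hot, multiplying it by the weight matrix simply selects one row of the matrix, and that row is the word vector. A toy-sized sketch (4 words × 3 features in place of 10,000 × 300):

```python
def hidden_output(one_hot_vec, W):
    # Matrix-vector product: since one_hot_vec has a single 1,
    # the result is exactly the matrix row for the "hot" word.
    n_cols = len(W[0])
    return [sum(one_hot_vec[i] * W[i][j] for i in range(len(W)))
            for j in range(n_cols)]

# Toy weight matrix: 4 vocabulary words x 3 embedding features.
W = [[0.1, 0.2, 0.3],
     [0.4, 0.5, 0.6],
     [0.7, 0.8, 0.9],
     [1.0, 1.1, 1.2]]
vec = hidden_output([0, 1, 0, 0], W)  # selects row 1
```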

Page 26

Word Embedding Output

If two different words have very similar “contexts” (that is, what words are likely to appear around them), then the model will output very similar vectors for the two words. This means that synonyms like “intelligent” and “smart” would have very similar contexts. Or that words that are related, like “engine” and “transmission”, would probably have similar contexts as well.
