Post on 07-Jan-2017
Artificial Neural Network
Dessy Amirudin
May 2016
Data Science Indonesia Bootcamp
Intro
Brain Plasticity
http://www.nytimes.com/2000/04/25/science/rewired-ferrets-overturn-theories-of-brain-growth.html
auditory cortex
“One learning algorithm” to rule them all
Any sensor input!
Hearing with vibration: http://www.eaglemanlab.net/sensory-substitution
Human echolocation: http://www.sciencemag.org/news/2014/11/how-blind-people-use-batlike-sonar
Seeing with sound: https://www.newscientist.com/article/mg20727731-500-sensory-hijack-rewiring-brains-to-see-with-sound/
3rd Eye Frog: https://www.newscientist.com/article/mg20727731-500-sensory-hijack-rewiring-brains-to-see-with-sound/
Neuron
http://learn.genetics.utah.edu/
Neural Network
input hidden layer output
Neural Network in Brief History
• Algorithm to mimic the brain
• Was widely used in the 80s and early 90s
• Popularity diminished in the late 90s. Why?
• Recent resurgence: state-of-the-art technique for many applications
• Can be used for regression and classification
Recent Applications of NN
• Speech recognition
• Image recognition and search
• Playlist recommendation
• Skype Translate
Other Applications of NN
FINANCIAL
• Stock market prediction
• Credit worthiness
• Credit rating
MEDICAL
• Medical diagnosis
• Electronic noses
SALES & MARKETING
• Churn prediction
• Targeted marketing
• Service usage forecast
& many more
One Layer Neural Network
input output
input to node = $\sum_i w_i a_i = w^T a$
output of node = $\phi(w^T a)$
$\phi$ (also written ø) is the activation function
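Some Common Activation Functions
linear: $\phi(w^T a) = w^T a$
step: $\phi(w^T a) = 1$ if $w^T a \ge 0$, otherwise $0$
sigmoid: $\phi(w^T a) = \frac{1}{1 + e^{-w^T a}}$
tanh: $\phi(w^T a) = \tanh(w^T a) = \frac{e^{w^T a} - e^{-w^T a}}{e^{w^T a} + e^{-w^T a}}$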
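The four activation functions can be sketched in R as below. This is only an illustration: the function names, the weights, and the inputs are hypothetical, and placing the step threshold at 0 is an assumption.

```r
# Common activation functions applied to z = w^T a.
linear  <- function(z) z
step_fn <- function(z) ifelse(z >= 0, 1, 0)   # threshold at 0 is an assumption
sigmoid <- function(z) 1 / (1 + exp(-z))
# tanh() is already built into base R.

# One-layer network: weighted sum of the inputs followed by an activation.
one_layer <- function(w, a, activation) activation(sum(w * a))

w <- c(-1, 2, 1)      # hypothetical weights (first entry multiplies the bias)
a <- c(1, 0.5, 1)     # hypothetical inputs, with a[1] = 1 as the bias input

out_linear  <- one_layer(w, a, linear)
out_step    <- one_layer(w, a, step_fn)
out_sigmoid <- one_layer(w, a, sigmoid)
out_tanh    <- one_layer(w, a, tanh)
```

Only the last argument changes between the four calls, which makes it easy to compare how each activation transforms the same weighted sum.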
Revisit One Layer Neural Network
a. If the activation function is linear, what will happen?
b. If the activation function is sigmoid, what will happen?
Do we really need many layers?
Look at a classification problem
• A linear classification model is not enough
• Add quadratic or cubic terms as necessary
Suppose we have a classification problem with n = 100 features.
Adding all quadratic terms, the number of variables becomes ~5,000.
Adding all cubic and quadratic terms, the number of variables becomes ~170K.
http://sebastianraschka.com
Dog vs Cat
vs
100 x 100 pixels (example): ~10,000 variables
Adding all quadratic terms, the number of variables becomes ~50 million
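The variable counts quoted above can be checked with a short calculation. This sketch assumes "all quadratic terms" means every distinct product of two features (squares included), and likewise for cubic terms; the helper name n_terms is hypothetical.

```r
# Number of distinct monomials of exactly degree d in n features
# (combinations with repetition): choose(n + d - 1, d).
n_terms <- function(n, d) choose(n + d - 1, d)

quad_100   <- n_terms(100, 2)      # quadratic terms for n = 100
cubic_100  <- n_terms(100, 3)      # cubic terms for n = 100
quad_image <- n_terms(10000, 2)    # quadratic terms for a 100 x 100 image
```

Under this assumption the counts are 5,050, 171,700, and 50,005,000, which match the rounded figures of ~5,000, ~170K, and ~50 million.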
Multilayer Network
Figure: every node applies the sigmoid activation $\phi(w^T a)$ to its inputs; each connection carries a weight w.
Recall a sigmoid function
AND Function
$a_1$  $a_2$  output
0    0    0
0    1    0
1    0    0
1    1    1
$a_0 = 1$; this is the bias value.
The activation function is sigmoid.
Suppose we assign the weights $w_0 = -20$, $w_1 = 15$, $w_2 = 15$.
The AND logic will be correct.
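A quick way to confirm the AND weights (-20 for the bias, 15 and 15 for the two inputs) is to evaluate the node on all four input combinations. This is a minimal sketch; the variable names are illustrative only.

```r
sigmoid <- function(z) 1 / (1 + exp(-z))

# Weights from the text: w0 = -20 (bias), w1 = 15, w2 = 15.
w <- c(-20, 15, 15)

# All four input combinations, with a leading column of 1s for the bias a0.
inputs <- cbind(a0 = 1, a1 = c(0, 0, 1, 1), a2 = c(0, 1, 0, 1))

and_raw <- sigmoid(as.vector(inputs %*% w))   # values very close to 0 or 1
and_out <- round(and_raw)                     # 0 0 0 1
```

The sigmoid never returns exactly 0 or 1, so the outputs are rounded before being compared with the truth table.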
OR Function
$a_1$  $a_2$  output
0    0    0
0    1    1
1    0    1
1    1    1
$a_0 = 1$; this is the bias value.
Task 1: Find values of $w_0$, $w_1$, and $w_2$ that make the OR logic correct.
What are the values of the weights if the logic is NOT ($a_1$ OR $a_2$)?
XOR Function
$a_1$  $a_2$  output
0    0    0
0    1    1
1    0    1
1    1    0
$a_0 = 1$; this is the bias value.
Can you find the weight for XOR function?
XOR = NOT [ ($a_1$ AND $a_2$) OR NOT ($a_1$ OR $a_2$) ]
$a_1$  $a_2$  AND  NOT (OR)  output
0    0    0    1         0
0    1    0    0         1
1    0    0    0         1
1    1    1    0         0
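Multilayered Network for XOR Representation
Layer 1 (input): $a_0^1$ (bias), $a_1^1$, $a_2^1$
Layer 2 (hidden): $a_0^2$ (bias), $a_1^2$ (AND), $a_2^2$ (NOT OR)
Weights from layer 1 to layer 2: $w_{01}^{12}$, $w_{02}^{12}$, $w_{11}^{12}$, $w_{12}^{12}$, $w_{21}^{12}$, $w_{22}^{12}$
Weights from layer 2 to the output: $w_{01}^{23}$, $w_{11}^{23}$, $w_{21}^{23}$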
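The two-layer XOR construction can be tried out with the sketch below. The AND weights (-20, 15, 15) come from the AND example; the weights for the NOT (OR) node and for the output node are assumptions chosen for illustration, and they are only one of many choices that work.

```r
sigmoid <- function(z) 1 / (1 + exp(-z))

# Layer 1 -> layer 2: one row of weights (bias, a1, a2) per hidden node.
W12 <- rbind(and    = c(-20,  15,  15),   # AND weights from the text
             not_or = c( 10, -20, -20))   # assumed weights for NOT (a1 OR a2)

# Layer 2 -> output: fires only when both hidden nodes are off (assumed weights).
W23 <- c(10, -20, -20)

xor_net <- function(a1, a2) {
  hidden <- sigmoid(as.vector(W12 %*% c(1, a1, a2)))  # hidden activations
  sigmoid(sum(W23 * c(1, hidden)))                    # output activation
}

# expand.grid varies a1 fastest: (0,0), (1,0), (0,1), (1,1).
inputs <- expand.grid(a1 = c(0, 1), a2 = c(0, 1))
inputs$output <- round(mapply(xor_net, inputs$a1, inputs$a2))
```

Printing `inputs` shows the XOR truth table, which a single node with one set of weights cannot reproduce.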
Now, a multilayered network is necessary
How do we assign the weights?
Intro to Optimization
How do we find the minimum value of this function?
Gradient Descent Method
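• Suppose the function is $f(x)$; the descent direction is the negative of the first derivative, $-f'(x)$
• Parameters to start the algorithm:
α = learning parameter, usually set to a small value such as 0.001
ϵ = convergence parameter, usually set to a very small value such as 1e-6
Gradient Descent Method
- Initialize with k = 0 and some values of α and ϵ
- Start from a random $x$ as $x_0$
- Calculate the cost function $f(x_k)$
- Update the value of $x$ as $x_{k+1} = x_k - \alpha f'(x_k)$
- Calculate the cost difference $\delta = |f(x_{k+1}) - f(x_k)|$
- If δ < ϵ, STOP; otherwise set k = k + 1 and repeat from the cost calculation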
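The gradient descent steps can be sketched in R as follows. The example function f(x) = (x - 3)^2, the starting point, and the iteration cap max_iter are all assumptions made for illustration; the learning parameter is set to 0.1 rather than 0.001 so that this small example converges in few iterations.

```r
# Hypothetical example function and its first derivative.
f      <- function(x) (x - 3)^2
f_grad <- function(x) 2 * (x - 3)

gradient_descent <- function(f, f_grad, x0, alpha = 0.1, eps = 1e-6,
                             max_iter = 10000) {
  x <- x0
  cost <- f(x)
  for (k in seq_len(max_iter)) {
    x_new <- x - alpha * f_grad(x)      # step against the derivative
    cost_new <- f(x_new)
    delta <- abs(cost_new - cost)       # cost difference between iterations
    x <- x_new
    cost <- cost_new
    if (delta < eps) break              # converged
  }
  list(x = x, cost = cost, iterations = k)
}

result <- gradient_descent(f, f_grad, x0 = 10)
result$x   # close to the true minimiser x = 3
```

Because the stopping rule looks at the change in cost, the returned x is only approximately optimal; a smaller ϵ gives a tighter answer at the price of more iterations.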
We can write linear regression learning as an optimization problem
$\min_\beta \sum_{i=1}^{n} (y_i - \beta^T x_i)^2$
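As an illustration of this optimization view, the sketch below fits a linear regression by gradient descent on synthetic data and compares the result with lm. The data, the learning parameter, and the iteration cap are assumptions; the cost is the mean (rather than the sum) of squared errors, which has the same minimiser and only rescales the step size.

```r
set.seed(1)
n <- 100
x <- runif(n)
y <- 2 + 3 * x + rnorm(n, sd = 0.1)   # synthetic data, true beta = (2, 3)

X <- cbind(1, x)                       # column of 1s for the intercept

cost <- function(beta) mean((y - X %*% beta)^2)
grad <- function(beta) as.vector(-2 / n * t(X) %*% (y - X %*% beta))

beta  <- c(0, 0)     # starting point
alpha <- 0.3         # learning parameter (assumed)
eps   <- 1e-12       # convergence parameter (assumed)
for (k in 1:10000) {
  beta_new <- beta - alpha * grad(beta)
  delta <- abs(cost(beta_new) - cost(beta))
  beta <- beta_new
  if (delta < eps) break
}

beta_lm <- coef(lm(y ~ x))   # closed-form least squares for comparison
```

The two estimates, `beta` and `beta_lm`, should agree to several decimal places.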
Exercise 1
• Load “auto_data.csv”
• Create a linear regression model with dependent variable (y) = “weight” and independent variable (x) = “mpg”
• What is the value of the intercept?
• What is the value of mpg’s coefficient?
• What is the MSE’s value?
• Plot mpg ~ weight and its model
• Can you write R code to find the optimum values of the intercept and mpg’s coefficient using the gradient descent method?
Forward and Backward Propagation
Forward Propagation
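• $a^1$ = input (add bias)
• $s_j^2 = \sum_i w_{ij}^{12} a_i^1$, $a^2 = \phi(s^2)$ (add bias)
• $s_j^3 = \sum_i w_{ij}^{23} a_i^2$, $a^3 = \phi(s^3)$ (add bias)
• $s_j^4 = \sum_i w_{ij}^{34} a_i^3$, $a^4 = \phi(s^4)$ = output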
Get the output using Forward Propagation
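Forward propagation through a network with two hidden layers can be sketched as below. The layer sizes, the random weights, and the function name forward_propagation are assumptions for illustration; each weight matrix stores one row per node in the next layer, with the first column multiplying the bias.

```r
sigmoid <- function(z) 1 / (1 + exp(-z))

# weights: list of matrices; each has one row per node in the next layer and
# one column per node in the current layer plus one for the bias.
forward_propagation <- function(input, weights) {
  a <- input
  for (W in weights) {
    a <- c(1, a)                    # add the bias node
    s <- as.vector(W %*% a)         # weighted sums s
    a <- sigmoid(s)                 # activations a
  }
  a
}

set.seed(42)
weights <- list(
  matrix(rnorm(3 * 3), nrow = 3),   # 2 inputs (+ bias) -> 3 nodes
  matrix(rnorm(2 * 4), nrow = 2),   # 3 nodes (+ bias)  -> 2 nodes
  matrix(rnorm(1 * 3), nrow = 1)    # 2 nodes (+ bias)  -> 1 output
)
output <- forward_propagation(c(0.5, -1), weights)
```

With a sigmoid on the last layer the output always lies strictly between 0 and 1.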
How to update weight using gradient descent?
Backward Propagation
Define error
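input → $a^1$ → $s^2$ → $a^2$ → $s^3$ → $a^3$ → $s^4$ → $a^4$ → output, with weights $w^{12}$, $w^{23}$, $w^{34}$ between the layers and a sigmoid (sigm) activation at each layer
Given a training example, update the weights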
Backward Propagation
In the case of one output with many hidden layers, the formulation for one particular node in a hidden layer becomes
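A minimal sketch of backward propagation for a network with one hidden layer follows. It assumes a squared error E = 1/2 (output - y)^2 and sigmoid activations throughout; the error definition, the layer sizes, the learning parameter, and all names are assumptions, not necessarily the formulation used on the slides.

```r
sigmoid <- function(z) 1 / (1 + exp(-z))

# Forward pass that keeps the activations needed by the backward pass.
forward <- function(x, W1, W2) {
  a1 <- c(1, x)                                  # input plus bias
  a2 <- c(1, sigmoid(as.vector(W1 %*% a1)))      # hidden layer plus bias
  a3 <- sigmoid(as.vector(W2 %*% a2))            # output
  list(a1 = a1, a2 = a2, a3 = a3)
}

# Assumed error definition: half the squared difference.
error <- function(x, y, W1, W2) 0.5 * sum((forward(x, W1, W2)$a3 - y)^2)

# Backward pass: push the error from the output back to the hidden layer.
backward <- function(x, y, W1, W2) {
  f <- forward(x, W1, W2)
  delta3 <- (f$a3 - y) * f$a3 * (1 - f$a3)       # output-layer delta
  hidden <- f$a2[-1]                             # drop the bias node
  delta2 <- as.vector(t(W2[, -1, drop = FALSE]) %*% delta3) *
    hidden * (1 - hidden)                        # hidden-layer delta
  list(dW1 = outer(delta2, f$a1), dW2 = outer(delta3, f$a2))
}

set.seed(7)
W1 <- matrix(rnorm(6), nrow = 2)   # 2 inputs (+ bias) -> 2 hidden nodes
W2 <- matrix(rnorm(3), nrow = 1)   # 2 hidden (+ bias) -> 1 output
x <- c(0.2, 0.9)
y <- 1

g <- backward(x, y, W1, W2)
alpha <- 0.5                       # learning parameter (assumed)
W1_new <- W1 - alpha * g$dW1       # one gradient descent update
W2_new <- W2 - alpha * g$dW2
```

A useful sanity check for any backward propagation code is to compare one entry of the gradient with a finite-difference estimate of the same derivative.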
Neural Network Tips
• Most of the time, one hidden layer is enough
• The number of neurons in the hidden layer is usually between the input layer size and the output layer size
• The number of neurons in the hidden layer is usually 2/3 of the input size
HOWEVER, this is not always true. The best way is to keep experimenting.
Exercise
Neural Network for Regression
- Load the MASS library
- Use the “Boston” data
- Predict the median value of the house (medv)
Do the following:
Data preparation
- Split the data into train and test sets. The train set comprises 75% of the data
Model 1:
- Create the linear regression model using the train data set (using lm or glm)
- Predict “medv” from the test data set
- Calculate the RSS of the test set
- Calculate the TSS of the test set
- Calculate the R2 of the test set
- Calculate the MSE of the test set
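The four test-set quantities can be computed from two vectors, the actual values and the predictions. The helper below is a sketch with a hypothetical name, shown on a tiny made-up example rather than on the Boston data.

```r
# RSS, TSS, R2 and MSE from actual and predicted values.
regression_metrics <- function(actual, predicted) {
  rss <- sum((actual - predicted)^2)         # residual sum of squares
  tss <- sum((actual - mean(actual))^2)      # total sum of squares
  list(rss = rss,
       tss = tss,
       r2  = 1 - rss / tss,
       mse = rss / length(actual))
}

# Made-up values for illustration only.
m <- regression_metrics(actual = c(1, 2, 3), predicted = c(1, 2, 4))
```

In the exercise, `actual` would be the medv column of the test set and `predicted` the output of predict() on that same test set.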
Regression Continues
Model 2:
- Load the “neuralnet” library: library(neuralnet)
- Create the regression model using the neural network algorithm with one hidden layer with 8 nodes. Follow the code:
    n <- names(train)
    f <- as.formula(paste("medv ~", paste(n[!n %in% "medv"], collapse = " + ")))
    nn <- neuralnet(f, data = train, hidden = c(8), linear.output = TRUE)
What happened? Do you see a message like this?
“algorithm did not converge in 1 of 1 repetition(s) within the stepmax”
• Predict “medv” from the test data set
• Calculate the RSS of the test set
• Calculate the TSS of the test set
• Calculate the R2 of the test set
• Calculate the MSE of the test set
• Plot the model graph
Compare with the result from the linear model.
Neural Network Plot
Use plot(nn), where nn is the fitted neural network model, to plot the graph
Neural Network Additional Tips
• Preprocess the data using normalization
• Scaling to the interval [0,1] or [-1,1] usually tends to give better results
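Scaling to [0,1] can be sketched with a small min-max helper. The function name and the example values are hypothetical; reusing the training set's minimum and maximum for the test set is one common choice, stated here as an assumption rather than a requirement of the exercise.

```r
# Min-max scaling of one numeric vector to [0, 1].
# min_x and max_x default to the vector's own range; pass the training-set
# values when scaling a test set so that both sets share one scale.
normalize <- function(x, min_x = min(x), max_x = max(x)) {
  (x - min_x) / (max_x - min_x)
}

train_x <- c(10, 20, 35, 50)    # hypothetical feature values
test_x  <- c(15, 40)

train_scaled <- normalize(train_x)
test_scaled  <- normalize(test_x, min(train_x), max(train_x))
```

For a data frame, the same helper can be applied column by column, for example with lapply over the numeric columns.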
Exercise 2
Same as Model 2 of the exercise, but normalize the data.
• Predict “medv” from the test data set
• Calculate the RSS of the test set
• Calculate the TSS of the test set
• Calculate the R2 of the test set
• Calculate the MSE of the test set
Compare with the result from the linear model. Can you improve it?
What is the lesson learned?
Neural Network for Binomial Classification
Data Exploration and Modeling
Use the “credit_dlqn.csv” data
Explore the data
• How many variables are there?
• What are the types of the variables?
• Are there any other variables that you think are needed to create a credit model?
Do the following:
Data preparation
- Split the data into train and test sets. The train set comprises 75% of the data
Model 1:
• Use logistic regression to predict the default in 2 years
Model 2:
• Use a neural network to predict the default in 2 years
• Use one hidden layer with the number of nodes equal to 2/3 of the inputs (equal to 7)
Continue Experimenting
How long does the model take to finish?
NOTE: Neural networks are very slow to converge. Depending on the business objective, as a Data Scientist you have to be very considerate when choosing an algorithm.
• Try another number of nodes in the hidden layer, such as 2 nodes
• How is the result?
• How is the accuracy compared to logistic regression?
Recall on Confusion Table
• Source: Wikipedia
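A confusion table and the measures derived from it can be sketched in R as below. The label vectors are made up for illustration, with 1 standing for default and 0 for no default.

```r
# Hypothetical actual and predicted labels (1 = default, 0 = no default).
actual    <- c(1, 1, 0, 0, 1, 0)
predicted <- c(1, 0, 0, 1, 1, 0)

# Fixing the factor levels keeps the table 2 x 2 even if a class is missing.
confusion <- table(actual    = factor(actual,    levels = c(0, 1)),
                   predicted = factor(predicted, levels = c(0, 1)))

tp <- confusion["1", "1"]   # true positives
fp <- confusion["0", "1"]   # false positives
fn <- confusion["1", "0"]   # false negatives
tn <- confusion["0", "0"]   # true negatives

accuracy  <- (tp + tn) / sum(confusion)
precision <- tp / (tp + fp)
recall    <- tp / (tp + fn)
```

For a neural network or logistic regression model, `predicted` would come from thresholding the predicted probabilities, for example at 0.5.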
Assignment – Due Next Week
• Increase the precision of the neural network model. Use a neural network with different parameters
• In a Word document, describe the improvement you obtained, your method, and why it works or why it doesn’t work
• Submit your code and Word document to trainer.datascience@gmail.com before 23 May 2016 23:59:59
References
• Machine Learning. Courses in Coursera by Andrew Ng, 2013.
• James G., Witten D., Hastie T. and Tibshirani R. An Introduction to Statistical Learning. Springer. 2014.