Intro to Neural Networks
Matthew S. Shotwell, Ph.D.
Department of BiostatisticsVanderbilt University School of Medicine
Nashville, TN, USA
April 8, 2020
Neural Networks

An NN is a nonlinear model, often represented by a network diagram:
Neural Networks

- the NN in the previous diagram is for a K-class problem
- the output layer has K nodes, one for each class
- an NN for regression would have just one output node
Neural Networks

Model formula for the NN in the previous figure; K-class problem:

- input layer: P features, x1, . . . , xP
- hidden layer: for each hidden node m = 1, . . . , M

      zm = σ(α0m + α1m x1 + · · · + αPm xP)

- output layer: for each output node k = 1, . . . , K

      tk = s(β0k + β1k z1 + · · · + βMk zM)

- σ is an “activation” function
- s is a “link” function, e.g., the logit or softmax function
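The two layer formulas above can be sketched in a few lines of plain Python (the function names and toy dimensions are illustrative, not from the slides; a sigmoid activation is assumed, and the link s is applied separately):

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def forward(x, alpha, beta):
    # alpha[m] = (bias a_0m, a_1m, ..., a_Pm) for hidden node m;
    # beta[k]  = (bias b_0k, b_1k, ..., b_Mk) for output node k
    z = [sigmoid(a[0] + sum(w * xi for w, xi in zip(a[1:], x))) for a in alpha]
    t = [b[0] + sum(w * zm for w, zm in zip(b[1:], z)) for b in beta]
    return z, t

# P = 2 features, M = 2 hidden nodes, K = 1 output node
z, t = forward([0.0, 0.0],
               [[0.0, 1.0, 1.0], [0.0, -1.0, -1.0]],
               [[0.0, 1.0, 1.0]])
```

With all-zero inputs each hidden node returns σ(bias) = 0.5 here, so t = [1.0]; the link function s would then map t onto the prediction scale.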
- layers are “fully connected” to the layer below
- all nodes in the lower layer contribute to each node of the layer above
- α0m and β0k are called “bias” parameters
- number of parameters: hidden layer M × (P + 1); output layer K × (M + 1)
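The parameter count above is easy to check numerically (a hypothetical helper, not course code):

```python
def n_params(P, M, K):
    # hidden layer: M nodes, each with P weights + 1 bias
    # output layer: K nodes, each with M weights + 1 bias
    return M * (P + 1) + K * (M + 1)

# the 16x16 zipcode example later in the slides: P = 256, M = 12, K = 10
print(n_params(256, 12, 10))  # 3214
```

The result agrees with the 3214 parameters quoted for "Net 2" in the zipcode example below.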
σ: Activation function

σ is an “activation function” designed to mimic the behavior of neurons in propagating signals in the (human) brain. The activation function makes the model nonlinear (in the parameters).

- sigmoid: σ(x) = 1 / (1 + e^(−x))
- ReLU (“rectified linear unit”): σ(x) = max(0, x)
- ReLU trains faster than the sigmoid and may perform better
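The two activation functions, side by side (a minimal Python sketch):

```python
import math

def sigmoid(x):
    # squashes any real input into (0, 1)
    return 1.0 / (1.0 + math.exp(-x))

def relu(x):
    # zero for negative inputs, identity for positive inputs
    return max(0.0, x)
```

The ReLU's piecewise-linear form is part of why it is cheap to evaluate and differentiate during training.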
s: Link function

The link function transforms the output nodes so that they make sense for the type of problem: classification or regression.

- regression: identity link, s(t) = t
- classification: softmax link,

      s(tk) = e^(tk) / ∑_{l=1}^{K} e^(tl)

- s(tk) is a value between 0 and 1
- s(tk) is P(yk = 1 | x), where yk is the target code for class k
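A sketch of the softmax link in Python; subtracting max(t) before exponentiating is a standard numerical-stability trick (it cancels in the ratio, so the result is unchanged):

```python
import math

def softmax(t):
    m = max(t)                       # stabilizer; cancels in the ratio
    e = [math.exp(tk - m) for tk in t]
    s = sum(e)
    return [ek / s for ek in e]      # each entry in (0, 1); entries sum to 1
```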
Training (fitting) neural networks

- NNs often have a large number of parameters: θ
- fit NNs by minimizing the average loss in the training data
- regression: err = ∑_{i=1}^{N} (yi − f(xi, θ))²
- classification: err = −∑_{i=1}^{N} ∑_{k=1}^{K} yik log fk(xi, θ), where yik is the indicator for class k and fk gives the probability for class k; this is called the “cross-entropy” or “deviance”
- other loss functions can be used
- err is minimized w.r.t. θ using a gradient descent algorithm called “back-propagation” or “backprop”
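The two loss functions above, written out in Python (illustrative names; y and f hold targets and fitted values). These return sums rather than averages, which differ only by the constant factor 1/N and lead to the same minimizer:

```python
import math

def sse(y, f):
    # regression: sum of squared errors over the N training cases
    return sum((yi - fi) ** 2 for yi, fi in zip(y, f))

def cross_entropy(Y, F):
    # classification: Y[i][k] is the 0/1 indicator for class k,
    # F[i][k] the fitted probability of class k, for training case i
    return -sum(yik * math.log(fik)
                for yi, fi in zip(Y, F)
                for yik, fik in zip(yi, fi))
```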
Training (fitting) neural networks

- backprop iterates two steps:
  - forward step: fix θ and compute the values of the hidden and output nodes zm and tk, then apply the link function to get the predictions f(xi) (probabilities for classification, or a numerical prediction for regression)
  - backward step: fix zm, tk, and f(xi), and update θ using a gradient descent step
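A toy version of the backward step, on a one-parameter model f(x, θ) = θx with a single training pair (all names are hypothetical; a finite-difference gradient stands in for the analytic gradient that backprop actually computes):

```python
def grad_step(theta, err, lr=0.1, h=1e-6):
    # one gradient-descent update; a central difference approximates d err / d theta
    g = (err(theta + h) - err(theta - h)) / (2.0 * h)
    return theta - lr * g

# squared-error loss for the pair (x, y) = (1, 2): err(theta) = (2 - theta)^2
err = lambda theta: (2.0 - theta * 1.0) ** 2
theta = 0.0
for _ in range(50):
    theta = grad_step(theta, err)   # converges toward the minimizer theta = 2
```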
Training (fitting) neural networks

- we usually don’t want the global minimum of err, due to overfitting
- the number of backprop iterations, the learning rate (of the gradient descent algorithm), and shrinkage penalties (weight decay, dropout) are tuning parameters
Training (fitting) neural networks

Shrinkage:

- weight decay: minimize the modified objective

      err(θ) + λ J(θ),  where  J(θ) = ∑k θk²

  like a ridge penalty; λ is a tuning parameter
- dropout: at each round of training, some of the hidden or input nodes (zm or xp) are ignored (assigned the value zero); the ignored nodes are selected at random at each round of backprop; the number of ignored nodes is a tuning parameter
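Dropout as described above is just a random mask applied at each round (a minimal sketch; practical implementations typically also rescale the surviving nodes so expected totals match, a detail omitted here):

```python
import random

def dropout(values, p, rng=random):
    # each node is ignored (set to zero) with probability p,
    # chosen afresh at every round of backprop
    return [0.0 if rng.random() < p else v for v in values]
```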
Simple NN in R: nnet.R
Extending NNs

The real power of NNs comes through various extensions:

- additional hidden layers
- modifying connectivity between layers
- processing between layers
Additional layers
More than one hidden layer:
Modified connectivity

- local connectivity: hidden units receive input only from a subset of “local” units in the layer below; not “fully” connected
- say there are 3 hidden nodes and 9 features:

      z1 = σ(α01 + α11 x1 + α21 x2 + α31 x3)
      z2 = σ(α02 + α12 x4 + α22 x5 + α32 x6)
      z3 = σ(α03 + α13 x7 + α23 x8 + α33 x9)

- each hidden node is linked to just 3 of the 9 features
- the hidden layer has just 3 × 4 = 12 parameters instead of 30
- the output layer is typically fully connected
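The three locally connected hidden nodes above can be computed like this (a hypothetical Python sketch; each node sees only its own block of 3 features):

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def local_hidden(x, alpha, width=3):
    # alpha[m] = (bias, w1, ..., w_width); node m sees features
    # x[m*width], ..., x[m*width + width - 1] only
    z = []
    for m, a in enumerate(alpha):
        block = x[m * width:(m + 1) * width]
        z.append(sigmoid(a[0] + sum(w * xi for w, xi in zip(a[1:], block))))
    return z
```

With 3 nodes and 4 parameters each, this uses the 12 parameters counted above.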
Modified connectivity
Local connectivity
Modified connectivity

- local connectivity: there can be some “overlap”: some hidden nodes can take input from some of the same features
- say there are 3 hidden nodes and 9 features:

      z1 = σ(α01 + α11 x1 + α21 x2 + α31 x3)
      z2 = σ(α02 + α12 x3 + α22 x4 + α32 x5 + α42 x6)
      z3 = σ(α03 + α13 x6 + α23 x7 + α33 x8 + α43 x9)
Modified connectivity
Local connectivity
Modified connectivity

- weight sharing: some hidden units share weights
- only makes sense in combination with local connectivity
- say there are 3 hidden nodes and 9 features:

      z1 = σ(α01 + α1 x1 + α2 x2 + α3 x3)
      z2 = σ(α02 + α1 x4 + α2 x5 + α3 x6)
      z3 = σ(α03 + α1 x7 + α2 x8 + α3 x9)

- typically each hidden node retains a distinct bias (α0m)
- the hidden layer has just 3 + 3 = 6 parameters
- the number of links ≠ the number of parameters (more links than parameters)
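With weight sharing, the nodes reuse one weight vector; this strided, shared-weight pattern is a small 1-D convolution (names are illustrative):

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def shared_hidden(x, bias, w):
    # every hidden node applies the SAME weights w to its own block of
    # len(w) features; only the bias differs between nodes
    k = len(w)
    return [sigmoid(b + sum(wi * xi for wi, xi in zip(w, x[m * k:(m + 1) * k])))
            for m, b in enumerate(bias)]
```

Three biases plus three shared weights gives the 3 + 3 = 6 parameters counted above, even though the hidden layer still has 9 weighted links to the features.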
Modified connectivity
Local connectivity + weight sharing
Example: zipcode data

- hand-written digits
- output: 10-class classification
- input: 16×16 B&W image
Example: zipcode data

- input: 16×16 B&W image
- when the input is an array (2D in this case), the hidden units in a hidden layer are typically represented as an array too
- the figure below shows local connectivity with a 5×5 “kernel”
Local connectivity

- 3×3 kernel
- stride 1
- edge handling
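A 3×3 kernel with stride 1 can be sketched as follows; the "valid" strategy below handles edges by letting the output shrink (other edge strategies pad the image instead):

```python
def conv2d(img, kernel):
    # slide the kernel over every position where it fits fully inside
    # the image ("valid" edge handling), moving one pixel at a time (stride 1)
    kh, kw = len(kernel), len(kernel[0])
    out = []
    for i in range(len(img) - kh + 1):
        row = []
        for j in range(len(img[0]) - kw + 1):
            row.append(sum(kernel[a][b] * img[i + a][j + b]
                           for a in range(kh) for b in range(kw)))
        out.append(row)
    return out

# a 4x4 image of ones with a 3x3 kernel of ones -> a 2x2 output of 9s
out = conv2d([[1] * 4 for _ in range(4)], [[1] * 3 for _ in range(3)])
```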
Local connectivity with weight sharing (convolution)

Convolution is a type of shape detector or “feature map”:
Example: zipcode data

- Net 1: no hidden layer; same as multinomial logistic regression; (256 + 1) × 10 = 2570 parameters
- Net 2: one hidden layer, 12 units, fully connected; (256 + 1) × 12 + (12 + 1) × 10 = 3214 parameters
- Net 3: two hidden layers, locally connected; 1226 parameters
- Net 4: two hidden layers, locally connected, weight sharing; 1132 parameters, 2266 links
Local connectivity
Local connectivity and shared weights
AKA: convolutional neural networks
- groups of hidden units form “shape detectors” or “feature maps”
- more complex shape detectors near the output layer
Net 3
Net 4
Performance on zipcode data
Processing between layers

- pooling/subsampling: down-sample data from a layer by summarizing a group of units
- max-pooling: summarize using the maximum:
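Max-pooling over non-overlapping 2×2 blocks can be sketched as below (a minimal version that assumes the image dimensions are multiples of the pool size):

```python
def max_pool(img, size=2):
    # keep only the largest value in each non-overlapping size x size block
    return [[max(img[i + a][j + b] for a in range(size) for b in range(size))
             for j in range(0, len(img[0]), size)]
            for i in range(0, len(img), size)]
```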
Deep learning and deep neural networks

- deep learning uses deep NNs
- deep NNs are simply NNs with many layers, complex connectivity, and processing steps between layers:
Complex NNs in R

- no (good) native R libraries for complex NNs
- R can interface to good libraries, e.g., Keras, TensorFlow
- see https://keras.rstudio.com/