CS 189
Brian Chu
brian.c@berkeley.edu
Slides at: brianchu.com/ml/
Office Hours: Cory 246, 6-7p Mon. (hackerspace lounge)
twitter: @brrrianchu
Agenda
• NEURAL NETS WOOOHOOO
Terminology
• Unit – each “neuron”
• 2-layer neural network: a neural network with one hidden layer (what you’re building)
• Epoch – one pass through the entire training data
  – For SGD, this is N iterations
  – For mini-batch gradient descent (batch size B), this is N/B iterations
First off…
• Many of you will struggle to even finish.
• In which case you can ignore my bells and whistles.
• My 2.6GHz quad-core, 16GB RAM MacBook takes ~1.5 hours to train to ~96-97% accuracy.
First off…
• Add a signal handler + snapshotting.
• E.g. implement functionality where, if you press Ctrl-C (on Unix systems, this sends the interrupt signal), your code saves a snapshot of the state of the training (current iteration, decayed learning rate, momentum, current weights, anything else), then exits.
  – Look into the Python “signal” and “pickle” libraries (sketch below).
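A minimal sketch of what that could look like; the state fields and the snapshot filename here are placeholders, not part of the assignment:

```python
import signal
import pickle

# Hypothetical training state -- field names are illustrative.
state = {
    "iteration": 0,
    "learning_rate": 0.01,
    "momentum": 0.9,
    "weights": None,  # e.g. a list of numpy arrays
}

def save_snapshot(signum, frame):
    """On Ctrl-C (SIGINT), pickle the training state and exit."""
    with open("snapshot.pkl", "wb") as f:
        pickle.dump(state, f)
    print("Snapshot saved to snapshot.pkl; exiting.")
    raise SystemExit

signal.signal(signal.SIGINT, save_snapshot)

# ... training loop runs here; press Ctrl-C to snapshot and exit ...
```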
Art of tuning
• Training neural nets is an art, not a science.
• Cross-validation? Pfffft.
• “I used to tune that parameter but I’m too lazy and I don’t bother any more” – a grad student, on the weight-decay hyperparameter.
• There are way too many hyperparameters for you to tune them all.
• Training is too slow for you to bother using cross-validation.
• For many hyperparameters: just use what is standard and spend your time elsewhere.
Knobs
• Learning: SGD / mini-batch / full batch; momentum, RMSprop, Adagrad, NAG, etc. (see the momentum sketch after this list)
  – How to decay the learning rate?
• Activations: ReLU, tanh, sigmoid
• Loss: MSE or cross-entropy (with softmax)
• Regularization: L1, L2, max-norm, Dropout, DropConnect
• Convolutional layers
• Initialization: Xavier, Gaussian, etc.
• When to stop? Early stopping? A stopping rule? Or just run forever?
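For concreteness, here is a sketch of the classical momentum update from the first knob above; the learning rate and momentum coefficient are illustrative defaults, not prescriptions:

```python
import numpy as np

def momentum_step(w, grad, velocity, lr=0.01, mu=0.9):
    """One classical-momentum SGD step: the velocity is a decaying
    running sum of past gradients, and the weights move along it."""
    velocity = mu * velocity - lr * grad
    w = w + velocity
    return w, velocity
```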
I recommend
• Cross-entropy loss, with softmax *
• Only decay the learning rate per epoch (or even less often) *
  – (e.g. don’t just divide by the iteration count)
  – Epoch = one training pass through the entire data
  – Only decay after a round of seeing every data point.
  – Note: if your mini-batch size is 10 and N = 20, then one epoch is 2 iterations.
• A momentum coefficient of 0.7-0.9(?) *
  – Maybe RMSprop?
• Mini-batches (somewhere between 20 and 100) *
• No regularization.
• Gaussian initialization (mean 0, std. dev. 0.01) *
• Run forever; take a snapshot when you feel like stopping (seriously!)
* = What everyone in the literature, in practice, uses
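A sketch of how these recommendations might fit together; the layer sizes, starting learning rate, and decay factor below are placeholder values, not part of the recommendation:

```python
import numpy as np

# Layer sizes are placeholders for a 784-input, 200-hidden, 10-output net.
W1 = np.random.randn(784, 200) * 0.01   # Gaussian init: mean 0, std 0.01
W2 = np.random.randn(200, 10) * 0.01

lr = 0.1                                 # illustrative starting learning rate
decay = 0.95                             # illustrative per-epoch decay factor
batch_size = 50                          # within the 20-100 range
N = 60000                                # e.g. the MNIST training set size
iters_per_epoch = N // batch_size

for epoch in range(10):
    for it in range(iters_per_epoch):
        pass                             # forward pass, backprop, update go here
    lr *= decay                          # decay once per epoch, not per iteration
```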
Activation functions
• tanh >>> sigmoid
  – (tanh is just a scaled and shifted sigmoid anyway; see below)
• ReLU ≈ stacked sigmoids
• ReLU is basically standard in computer vision.
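“Shifted” can be made precise: tanh(x) = 2·sigmoid(2x) − 1, i.e. tanh is the sigmoid rescaled to output in (−1, 1). A quick numpy check:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

x = np.linspace(-5, 5, 11)
# tanh is a scaled and shifted sigmoid: tanh(x) = 2*sigmoid(2x) - 1
assert np.allclose(np.tanh(x), 2.0 * sigmoid(2.0 * x) - 1.0)
```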
Almost certainly will improve accuracy but total overkill
• Considered “standard” today:
  – Convolutional layers (with max-pooling)
  – Dropout (DropConnect?) – a minimal sketch of dropout follows
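For reference, dropout itself is only a few lines. This is a sketch of the common “inverted dropout” formulation (p_keep = 0.5 is just the usual default, not a requirement):

```python
import numpy as np

def dropout_forward(h, p_keep=0.5, train=True):
    """Inverted dropout: at train time, zero each unit with
    probability (1 - p_keep) and rescale by 1/p_keep so the
    expected activation is unchanged; at test time, do nothing."""
    if not train:
        return h
    mask = (np.random.rand(*h.shape) < p_keep) / p_keep
    return h * mask
```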
If using numpy
• Not a single for-loop should be in your code.
• Avoid unnecessary memory allocation:
• Use the “out=” keyword argument to re-use numpy arrays (example below).
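For example (the array shapes here are arbitrary):

```python
import numpy as np

a = np.random.randn(1000, 1000)
b = np.random.randn(1000, 1000)
out = np.empty_like(a)

# Allocates a fresh array on every call:
c = a + b

# Reuses the pre-allocated buffer instead:
np.add(a, b, out=out)
np.maximum(out, 0.0, out=out)  # e.g. an in-place ReLU on the result
```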
May want to consider
• A faster implementation than Python with numpy:
• Cython, Java, Go, Julia, etc.
Honestly, if you want to win…
• (If you have a compatible graphics card) write a CUDA or OpenCL implementation, and train for many days.
  – (You might consider adding regularization in this case.)
• I didn’t do this: I used other generic tricks that you can read about in the literature.
Debugging
• Check your dimensions
• Check your numpy dtypes
• Check your derivatives – comment all your backprop steps
• Numerical gradient calculator (or see the sketch below):
  – https://github.com/pbrod/numdifftools
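If you’d rather not pull in a dependency, a central-difference gradient check is a few lines of numpy. This is a generic sketch, not the assignment’s required interface:

```python
import numpy as np

def numerical_grad(f, w, eps=1e-5):
    """Central-difference approximation of df/dw, one coordinate
    at a time. Slow, but good for spot-checking backprop."""
    grad = np.zeros_like(w)
    it = np.nditer(w, flags=["multi_index"])
    while not it.finished:
        i = it.multi_index
        old = w[i]
        w[i] = old + eps
        plus = f(w)           # loss with w[i] nudged up
        w[i] = old - eps
        minus = f(w)          # loss with w[i] nudged down
        w[i] = old            # restore
        grad[i] = (plus - minus) / (2 * eps)
        it.iternext()
    return grad

# Usage: compare against your analytic gradient, e.g.
# assert np.allclose(analytic_grad, numerical_grad(loss_fn, W), atol=1e-6)
```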
Connection with SVMs / linear classifiers with kernels
• A kernel SVM can be thought of as a 2-layer network:
• 1st layer: |units| = |support vectors|
  – Value of each unit i = K(query, train(i))
• 2nd layer: a linear combination of the first layer
• Simplest “training” for the 1st layer: store all training points as templates. (A tiny numpy sketch of this view follows below.)
http://www.kdnuggets.com/2014/02/exclusive-yann-lecun-deep-learning-facebook-ai-lab.html
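A sketch of that view with an RBF kernel; the kernel choice, gamma, and the alphas vector here are illustrative, not specific to any particular SVM:

```python
import numpy as np

def rbf_kernel_net(query, X_train, alphas, gamma=0.1):
    """View a kernel machine as a 2-layer net: the hidden layer has
    one unit per stored training point ("support vector"), with
    activation K(query, x_i); the output is a linear combination."""
    # Hidden layer: one RBF unit per training point.
    dists = np.sum((X_train - query) ** 2, axis=1)
    hidden = np.exp(-gamma * dists)      # K(query, x_i) for each i
    # Output layer: linear combination of the hidden units.
    return hidden @ alphas
```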