
Algorithms other than SGD
CS6787 Lecture 10 — Fall 2017

Machine learning is not just SGD

• Once a model is trained, we need to use it to classify new examples
• This inference task is not computed with SGD

• There are other algorithms for optimizing objectives besides SGD
  • Stochastic coordinate descent
  • Derivative-free optimization

• There are other common tasks, such as sampling from a distribution
  • Gibbs sampling and other Markov chain Monte Carlo methods
  • And we sometimes use this together with SGD → called contrastive divergence

Why understand these algorithms?

• They represent a significant fraction of machine learning computations
  • Inference in particular is huge

• You may want to use them instead of SGD
  • But you don’t want to suddenly pay a computational penalty for doing so because you don’t know how to make them fast

• Intuition from SGD can be used to make these algorithms faster too
  • And vice-versa

Inference

Inference

• Suppose that our training loss function looks like

  f(w) = (1/N) Σ_{i=1}^{N} l(y(w; x_i), y_i)

• Inference is the problem of computing the prediction y(w; x_i)

How important is inference?

• Train once, infer many times
• Many production machine learning systems just do inference
  • Image recognition, voice recognition, translation
  • All are just applications of inference once they’re trained

• Need to get responses to users quickly
  • On the web, users won’t wait more than a second

Inference on linear models

• Computational cost: relatively low
  • Just a matrix-vector multiply

• But still can be more costly in some settings
  • For example, if we need to compute a random kernel feature map
  • What is the cost of this? (see the sketch below)

• Which methods can we use to speed up inference in this setting?
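A minimal sketch of that cost, assuming a random Fourier feature map and made-up dimensions: inference is still just a linear model, but computing the feature map, not the final dot product, dominates the per-example work. All names and sizes here are illustrative, not the lecture’s example.

```python
import numpy as np

# Minimal sketch of linear-model inference on top of a random Fourier
# feature map; the dimensions, the Gaussian projection, and the trained
# weights w are illustrative assumptions.
d, D = 100, 10_000                          # input dimension, number of random features
rng = np.random.default_rng(0)
Omega = rng.normal(size=(D, d))             # random projection matrix
b = rng.uniform(0.0, 2.0 * np.pi, size=D)   # random phases
w = rng.normal(size=D)                      # "trained" linear weights

def predict(x):
    # The feature map costs O(D * d) per example and usually dominates
    # the O(D) dot product with the weights.
    z = np.sqrt(2.0 / D) * np.cos(Omega @ x + b)
    return float(w @ z)

print(predict(rng.normal(size=d)))
```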

Inference on neural networks

• Computational cost: relatively high
  • Several matrix-vector multiplies and non-linear elements

• Which methods can we use to speed up inference in this setting?

• Compression
  • Find an easier-to-compute network with similar accuracy by fine-tuning
  • The subject of this week’s paper

Other techniques for speeding up inference

• Train a fast model, and run it most of the time
  • If it’s uncertain, then run a more accurate, slower model (sketched below)

• For video and time-series data, re-use some of the computation from previous frames
  • For example, only update some of the activations in the network at each frame
  • Or have a more-heavyweight network run less frequently
  • Rests on the notion that the objects in the scene do not change frequently in most video streams
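A minimal sketch of the fast/slow cascade from the first bullet above. `fast_model` and `slow_model` are assumed stand-ins with a scikit-learn-style `predict_proba` interface, and the 0.9 confidence threshold is an arbitrary illustrative choice.

```python
import numpy as np

# Run a cheap model on every input; fall back to the slow, accurate
# model only when the cheap model is uncertain.
def cascade_predict(x, fast_model, slow_model, threshold=0.9):
    probs = fast_model.predict_proba(x)        # cheap model, run every time
    if np.max(probs) >= threshold:             # confident enough -> stop here
        return int(np.argmax(probs))
    slow_probs = slow_model.predict_proba(x)   # expensive model, rare fallback
    return int(np.argmax(slow_probs))
```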

Other Techniques for Training, Besides SGD

Coordinate Descent

• Start with objective

  minimize: f(x_1, x_2, ..., x_n)

• Choose a random index i, and update

  x_i = argmin_{x_i} f(x_1, x_2, ..., x_i, ..., x_n)

• And repeat in a loop
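A minimal sketch of this loop on a quadratic objective, where the one-dimensional argmin has a closed form; the matrix A, vector b, and iteration count are illustrative assumptions, not part of the lecture.

```python
import numpy as np

# Randomized coordinate descent on f(x) = 0.5 x^T A x - b^T x,
# where the exact one-dimensional argmin is available in closed form.
rng = np.random.default_rng(0)
n = 5
M = rng.normal(size=(n, n))
A = M @ M.T + n * np.eye(n)        # symmetric positive definite
b = rng.normal(size=n)

x = np.zeros(n)
for _ in range(1000):
    i = rng.integers(n)            # choose a random index i
    # exact minimization over x_i with the other coordinates held fixed
    x[i] = (b[i] - A[i] @ x + A[i, i] * x[i]) / A[i, i]

print(np.linalg.norm(x - np.linalg.solve(A, b)))   # distance to the true minimizer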

Variants

• Coordinate descent with derivative and step size

• Stochastic coordinate descent

• How do these compare to SGD?
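One way to see the comparison: SGD updates every coordinate of the model using one example, while stochastic coordinate descent touches a single coordinate. The sketch below is illustrative only; `grad_l` (a per-example gradient), `alpha`, and the sampling of `i` and `j` are assumptions, and a real coordinate method would compute just the needed partial derivative rather than the full gradient.

```python
import numpy as np

# Illustrative contrast on a finite-sum objective f(w) = (1/N) * sum_j l_j(w).
def sgd_step(w, grad_l, j, alpha):
    # SGD: one example, every coordinate of the model
    return w - alpha * grad_l(w, j)

def stochastic_coord_step(w, grad_l, j, i, alpha):
    # Stochastic coordinate descent: one example, one coordinate
    w = w.copy()
    w[i] -= alpha * grad_l(w, j)[i]
    return w
```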

Derivative Free Optimization (DFO)

• Optimization methods that don’t require differentiation

• Basic coordinate descent is actually an example of this

• Another example: for normally distributed ε,

  x_{t+1} = x_t - α (f(x_t + σε) - f(x_t - σε)) / (2σ) · ε

• Applications?
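A minimal sketch of this update, assuming the reconstruction above (a symmetric finite difference along a random Gaussian direction); the test function, step size α, smoothing σ, and step count are illustrative choices.

```python
import numpy as np

# Two-point derivative-free update: no gradients are ever computed.
def dfo_minimize(f, x0, alpha=0.1, sigma=0.1, steps=2000, seed=0):
    rng = np.random.default_rng(seed)
    x = np.array(x0, dtype=float)
    for _ in range(steps):
        eps = rng.normal(size=x.shape)                        # eps ~ N(0, I)
        g = (f(x + sigma * eps) - f(x - sigma * eps)) / (2 * sigma)
        x -= alpha * g * eps                                  # step along eps
    return x

# Example: minimize a simple quadratic without calling its gradient.
print(dfo_minimize(lambda x: np.sum((x - 3.0) ** 2), np.zeros(2)))
```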

Another Task: Sampling

Focus problem for this setting: Statistical Inference
• Major class of machine learning applications
• Goal: draw conclusions from data using a statistical model
• Formally: find the marginal distribution of unobserved variables given observations

• Example: decide whether a coin is biased from a series of flips

• Applications: LDA, recommender systems, text extraction, etc.

• De facto algorithm used for inference at scale: Gibbs sampling

Graphical models

• A graphical way to describe a probability distribution

• Common in machine learning applications
  • Especially for applications that deal with uncertainty

What types of inference exist here?

• Maximum-a-posteriori (MAP) inference
  • Find the state with the highest probability
  • Often reduces to an optimization problem
  • What is the most likely state of the world?

• Marginal inference
  • Compute the marginal distributions of some variables
  • What does our model of the world tell us about this object or event?

[Figure: example graphical model over variables x1 through x7]

Algorithm 1 Gibbs sampling

Require: variables x_i for 1 ≤ i ≤ n, and distribution π
loop
  Choose s by sampling uniformly from {1, ..., n}
  Re-sample x_s from P_π(x_s | x_{{1,...,n}\{s}})
  Output x
end loop

What is Gibbs Sampling?
• Choose a variable to update at random.
• Compute its conditional distribution given the other variables.
• Update the variable by sampling from its conditional distribution.
• Output the current state as a sample. (A code sketch follows the figure below.)

[Figure: re-sampling x5 given its neighbors x4, x6, x7; example conditional distribution with P = 0.7 and P = 0.3 for the two values of x5]
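A minimal Gibbs sampling sketch on a toy pairwise model over ±1 variables, chosen here purely for illustration (the couplings, variable count, and iteration budget are assumptions, not the lecture’s example).

```python
import numpy as np

# Gibbs sampling on an Ising-style model pi(x) ∝ exp(sum_{i<j} J_ij x_i x_j).
rng = np.random.default_rng(0)
n = 7
J = np.triu(rng.normal(scale=0.5, size=(n, n)), k=1)
J = J + J.T                                       # symmetric couplings, zero diagonal
x = rng.choice([-1, 1], size=n)                   # initial state

samples = []
for _ in range(10_000):
    s = rng.integers(n)                           # choose a variable at random
    field = J[s] @ x                              # depends only on the other variables
    p_plus = 1.0 / (1.0 + np.exp(-2.0 * field))   # P(x_s = +1 | rest)
    x[s] = 1 if rng.random() < p_plus else -1     # re-sample from the conditional
    samples.append(x.copy())                      # output the current state

print(np.mean(samples, axis=0))                   # Monte Carlo estimate of the marginals
```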

Learning graphical models

• Contrastive divergence
  • SGD on top of Gibbs sampling

• The de facto way of training
  • Restricted Boltzmann machines (RBM)
  • Deep belief networks (DBN)
  • Knowledge-base construction (KBC) applications
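A minimal sketch of contrastive divergence (CD-1) for a tiny binary RBM, where one Gibbs step supplies the “negative” statistics for an SGD-style weight update. The layer sizes, learning rate, toy data, and omission of bias terms are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
n_vis, n_hid, lr = 6, 4, 0.05
W = 0.01 * rng.normal(size=(n_vis, n_hid))

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cd1_update(v0):
    ph0 = sigmoid(v0 @ W)                              # positive phase
    h0 = (rng.random(n_hid) < ph0).astype(float)
    pv1 = sigmoid(h0 @ W.T)                            # one Gibbs step back to visibles
    v1 = (rng.random(n_vis) < pv1).astype(float)
    ph1 = sigmoid(v1 @ W)                              # negative phase
    return lr * (np.outer(v0, ph0) - np.outer(v1, ph1))

data = (rng.random((100, n_vis)) < 0.5).astype(float)  # toy binary dataset
for epoch in range(10):
    for v in data:
        W += cd1_update(v)                             # SGD on top of Gibbs sampling
```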

What do all these algorithms look like?

Stochastic Iterative Algorithms
Given an immutable input dataset and a model we want to output.

Repeat:

1. Pick a data point at random

2. Update the model

3. Iterate
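A hypothetical sketch of that shared loop; `sample_index` and `update` are assumed stand-ins that SGD, stochastic coordinate descent, and Gibbs sampling would each instantiate differently.

```python
# Generic stochastic iterative template.
def stochastic_iterative(model, data, sample_index, update, n_iters):
    for _ in range(n_iters):              # 3. iterate
        i = sample_index(data)            # 1. pick a data point (or variable) at random
        model = update(model, data, i)    # 2. update the model
    return model
```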

same structure → same systems properties → same techniques

Questions?

• Upcoming things
  • Paper Review #9 — due today
  • Paper Presentation #10 on Wednesday
  • Project proposals due the following Monday