Statistical Methods for Machine Learning II Murat A ... · Modeling idea: graphical models on...

Murat A. Erdogdu & David Duvenaud Department of Computer Science Department of Statistical Sciences STA414/2104 Statistical Methods for Machine Learning II Lecture 7


Announcements

• Midterm is this Friday, March 1st, 7-9pm. It tests material up to and including week 5.

• Tuesday section is one lecture behind due to uni closure, will catch up next week.

• HW2 Deadline is March 12th, in class. Can hand in on March 11th too.

• My office hours are Wednesdays, 10am-11am, in Pratt 384.

Last week

• Unsupervised learning

• Mixture models

• K-means

• EM Algorithm

• PCA & PPCA

Today

• Overview of rest of course

• Graphical model notation

• Specify complex distributions

• Reason about independencies


• Neural Network Basics

Course so far

• Basics of supervised and unsupervised learning

• Linear regression and friends

• Linear + discrete latent variable models

• Exponential families + maximum likelihood

• Optimization

Remainder of course

• Combine these building blocks

• Add nonlinearities

• Scale up number of parameters

• Use gradient-based optimization to fit or approximately integrate out all continuous parameters

• That’s the state of the art in modern probabilistic ML!

Remainder of course

• Combine building blocks using graphical models

• Adding nonlinearities to regression + classification gives us neural networks

• We can fit millions of parameters using gradient descent, and efficiently compute gradients with reverse-mode autodiff (backprop)

• Can approximately integrate out millions of parameters using gradient-based stochastic variational inference, or Hamiltonian Monte Carlo

• Will also talk about gradient estimation of unknown functions or with discrete variables (aka policy-gradient model-free reinforcement learning)

ML as a bag of tricks

Fast special cases:

• K-means

• Kernel Density Estimation

• SVMs

• Boosting

• Random Forests

• K-Nearest Neighbors

Extensible family:

• Mixture of Gaussians

• Latent variable models

• Gaussian processes

• Deep neural nets

• Bayesian neural nets

• ??

Regularization as a bag of tricks

Fast special cases:

• Early stopping

• Ensembling

• L2 Regularization

• Gradient noise

• Dropout

• Expectation-Maximization

Extensible family:

• Stochastic variational inference

Learning outcomes

• Know standard algorithms (bag of tricks), when to use them, and their limitations. For basic applications and baselines.

• Know main elements of language of deep probabilistic models (distributions, expectations, latent variables, neural networks) and how to combine them. For custom applications + research

• Know standard computational tools for fitting models: Monte Carlo, stochastic optimization, regularization, automatic differentiation

AI as a bag of tricks

Russell and Norvig’s parts of AI:

• Machine learning

• Natural language processing

• Knowledge representation

• Automated reasoning

• Computer vision

• Robotics

Extensible family:

• Deep probabilistic latent-variable models + decision theory


Modeling idea: graphical models on latent variables, neural network models for observations

Composing graphical models with neural networks for structured representations and fast inference. Johnson, Duvenaud, Wiltschko, Datta, Adams, NIPS 2016

Generative Model Families

• Autoregressive models (LSTMs, NICE, PixelRNN): $x_i \sim p_\theta(x_i \mid x_{<i})$, $p(x) = \prod_i p_\theta(x_i \mid x_{<i})$

• Variational autoencoders: $x \sim p_\theta(x \mid z)$, $p(x) = \int p(x \mid z)\, p(z)\, dz$

• Invertible models (normalizing flows, Real NVP, FFJORD): $x = f_\theta(z)$, $p(x) = p(z)\, |\det(\nabla_z f_\theta)|^{-1}$

• Implicit models (GANs): $x = f_\theta(z)$, $p(x) \approxeq D_\phi(x)\, p_{\text{data}}(x)$

Recurrent Neural Nets

[Figure: recurrent net with inputs x1, x2, …, hidden states h0, h1, h2, …, and outputs p(x2|x1), p(x3|x2, x1), …]

$p(x) = \prod_i p_\theta(x_i \mid x_{<i})$

• p(age, income, purchase) = p(age) p(income | age) p( purchase | age, income)
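The chain-rule factorization in the bullet above can be checked numerically. The following is a minimal sketch with made-up probability tables (every number, and the names `p_age`, `p_income_given_age`, `p_buy_given_age_income`, and `joint`, are illustrative assumptions, not values from the lecture). It builds the joint as a product of conditionals and confirms that it sums to one.

```python
# Hypothetical conditional probability tables (illustrative numbers only).
p_age = {"young": 0.4, "old": 0.6}
p_income_given_age = {
    "young": {"low": 0.7, "high": 0.3},
    "old": {"low": 0.4, "high": 0.6},
}
# P(purchase = True | age, income)
p_buy_given_age_income = {
    ("young", "low"): 0.1,
    ("young", "high"): 0.5,
    ("old", "low"): 0.2,
    ("old", "high"): 0.4,
}


def joint(age, income, purchase):
    """p(age) * p(income | age) * p(purchase | age, income)."""
    p_buy = p_buy_given_age_income[(age, income)]
    p_purchase = p_buy if purchase else 1.0 - p_buy
    return p_age[age] * p_income_given_age[age][income] * p_purchase


# A product of normalized conditionals is itself a normalized joint.
total = sum(
    joint(a, i, b)
    for a in p_age
    for i in ("low", "high")
    for b in (True, False)
)
```

An autoregressive neural model has the same structure; the lookup tables are replaced by a network that outputs each conditional.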

Pixel Recurrent Neural Networks, van den Oord et al., 2016

$p(x) = \prod_i p_\theta(x_i \mid x_{<i})$

[Figure: graphical-model diagrams, courtesy of Matthew Johnson]

• Gaussian mixture model [1]

• Linear dynamical system [2]

• Hidden Markov model [3]

• Switching LDS [4]

• Canonical correlations analysis [8,9]

• admixture / LDA / NMF [10]

• Mixture of Experts [5]

• Driven LDS [2]

• IO-HMM [6]

• Factorial HMM [7]

[1] Palmer, Wipf, Kreutz-Delgado, and Rao. Variational EM algorithms for non-Gaussian latent variable models. NIPS 2005.
[2] Ghahramani and Beal. Propagation algorithms for variational Bayesian learning. NIPS 2001.
[3] Beal. Variational algorithms for approximate Bayesian inference, Ch. 3. U of London Ph.D. Thesis 2003.
[4] Ghahramani and Hinton. Variational learning for switching state-space models. Neural Computation 2000.
[5] Jordan and Jacobs. Hierarchical Mixtures of Experts and the EM algorithm. Neural Computation 1994.
[6] Bengio and Frasconi. An Input Output HMM Architecture. NIPS 1995.
[7] Ghahramani and Jordan. Factorial Hidden Markov Models. Machine Learning 1997.
[8] Bach and Jordan. A probabilistic interpretation of Canonical Correlation Analysis. Tech. Report 2005.
[9] Archambeau and Bach. Sparse probabilistic projections. NIPS 2008.
[10] Hoffman, Bach, Blei. Online learning for Latent Dirichlet Allocation. NIPS 2010.

Directed Graphical Models

Based on slides by Richard Zemel and Mark Ebden

Learning outcomes

• What aspects of a model can we express using graphical notation?

• Which aspects are not captured in this way?

• How do independencies change as a result of conditioning?

• Reasons for using latent variables

• Common motifs such as mixtures and chains

• How to integrate out unobserved variables

Conditional Independence

• Notation: x_A ⊥ x_B | x_C

• Definition: two (sets of) variables x_A and x_B are conditionally independent given a third x_C if:

$p(x_A, x_B \mid x_C) = p(x_A \mid x_C)\, p(x_B \mid x_C) \quad \forall x_C$

which is equivalent to saying

$p(x_A \mid x_B, x_C) = p(x_A \mid x_C) \quad \forall x_C$

Directed Graphical Models

• Consider directed acyclic graphs over n variables.

• Each node has a (possibly empty) set of parents π_i

• We can then write

$p(x_1, \ldots, x_n) = \prod_i p(x_i \mid x_{\pi_i})$

• Hence we factorize the joint in terms of local conditional probabilities

• Directed graphical models show how a distribution factorizes, and what conditional independencies exist.

Conditional Independence in DAGs

• If we order the nodes so that parents always come before their children, then the graphical model implies:

$x_i \perp x_{\tilde{\pi}_i} \mid x_{\pi_i}$

where $\tilde{\pi}_i$ are the nodes coming before x_i that are not its parents

• In other words, each variable is conditionally independent of its non-descendants given its parents.

• Such an ordering is called a “topological” ordering

Bayesian Network Example

P(B) = 0.001 P(E) = 0.002

Calculating on a Bayesian Network

We can say

$P(B, E, A, J, M) = P(B)\, P(E)\, P(A \mid B, E)\, P(J \mid A)\, P(M \mid A)$

So, P(B, E, A, J, M) = ? Answer: ≈ 1.2 × 10^-6

P(B | J, M) = ? Answer: Tricky!

Exact Inference in BNs

$P(B \mid J, M) = \frac{P(B, J, M)}{P(J, M)}$

where

$P(B, J, M) = \sum_{E} \sum_{A} P(B, E, A, J, M)$

$P(B \mid J, M) \approx 0.284$
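Both numbers on these slides can be reproduced by brute-force enumeration. The sketch below assumes the network is the standard burglary/earthquake/alarm example, with B and E as parents of A, and A as the parent of J and M. Only P(B) and P(E) appear in the slide text; the remaining table entries (`P_A`, `P_J`, `P_M`) are the usual textbook values and are an assumption here, as are all function names.

```python
from itertools import product

P_B = 0.001  # from the slide
P_E = 0.002  # from the slide
# Assumed textbook values (not recoverable from the slide text):
# P(A = True | B, E), P(J = True | A), P(M = True | A).
P_A = {(True, True): 0.95, (True, False): 0.94,
       (False, True): 0.29, (False, False): 0.001}
P_J = {True: 0.90, False: 0.05}
P_M = {True: 0.70, False: 0.01}


def bern(p_true, value):
    """Probability that a binary variable with P(True) = p_true equals value."""
    return p_true if value else 1.0 - p_true


def joint(b, e, a, j, m):
    """P(B, E, A, J, M) as a product of local conditionals."""
    return (bern(P_B, b) * bern(P_E, e) * bern(P_A[(b, e)], a)
            * bern(P_J[a], j) * bern(P_M[a], m))


def posterior_burglary(j=True, m=True):
    """P(B = True | J = j, M = m), summing out E and A by enumeration."""
    unnorm = {
        b: sum(joint(b, e, a, j, m)
               for e, a in product((True, False), repeat=2))
        for b in (True, False)
    }
    return unnorm[True] / (unnorm[True] + unnorm[False])


p_all_true = joint(True, True, True, True, True)  # about 1.2e-6
p_b_given_jm = posterior_burglary()               # about 0.284
```

Enumeration is fine for five binary variables, but the number of terms grows exponentially with the number of variables summed out.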

Ancestral Sampling:

• Put the nodes in topological order (parents coming before children)

• Sample each variable given its parents

• Efficient if sampling conditionals is cheap
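The three steps above can be sketched in a few lines for binary variables. The function name `ancestral_sample`, the dictionary layout, and the rain/wet/slip network with its probabilities are hypothetical choices made for illustration.

```python
import random


def ancestral_sample(order, parents, cond_prob, rng):
    """Draw one joint sample of binary variables.

    order:     node names in topological order (parents before children)
    parents:   dict mapping node -> tuple of parent names
    cond_prob: dict mapping node -> function(parent_values) -> P(node = True)
    rng:       a random.Random instance
    """
    sample = {}
    for node in order:
        # All parents were sampled earlier because of the topological order.
        parent_values = tuple(sample[p] for p in parents[node])
        sample[node] = rng.random() < cond_prob[node](parent_values)
    return sample


# Hypothetical three-node chain: rain -> wet -> slip.
order = ["rain", "wet", "slip"]
parents = {"rain": (), "wet": ("rain",), "slip": ("wet",)}
cond_prob = {
    "rain": lambda pv: 0.2,
    "wet": lambda pv: 0.9 if pv[0] else 0.1,
    "slip": lambda pv: 0.3 if pv[0] else 0.0,
}

rng = random.Random(0)
samples = [ancestral_sample(order, parents, cond_prob, rng) for _ in range(1000)]
```

Each sample costs one draw per node, which is why the method is efficient whenever the individual conditionals are cheap to sample.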

Example DAG

Consider this six-node network:

The joint probability is now:

Missing Edges

• Key point about directed graphical models: missing edges imply conditional independence

• Remember that by the chain rule we can always write the full joint as a product of conditionals, given an ordering:

$p(x_1, \ldots, x_n) = \prod_i p(x_i \mid x_1, \ldots, x_{i-1})$

• If the joint is represented by a DAGM, then some of the conditioned variables on the right-hand sides are missing.

• Removing an edge into node i eliminates an argument from the conditional probability factor

Chain

• Q: When we condition on y, are x and z independent?

$p(x, y, z) = p(x)\, p(y \mid x)\, p(z \mid y)$

which implies

$p(z \mid x, y) = \frac{p(x, y, z)}{p(x, y)} = \frac{p(x)\, p(y \mid x)\, p(z \mid y)}{p(x)\, p(y \mid x)} = p(z \mid y)$

and therefore x ⊥ z | y

• Think of x as the past, y as the present and z as the future.

Common Cause

• Q: When we condition on y, are x and z independent?

$p(x, y, z) = p(y)\, p(x \mid y)\, p(z \mid y)$

which implies

$p(x, z \mid y) = \frac{p(x, y, z)}{p(y)} = p(x \mid y)\, p(z \mid y)$

and therefore x ⊥ z | y

Explaining Away

• Q: When we condition on y, are x and z independent?

$p(x, y, z) = p(x)\, p(z)\, p(y \mid x, z)$

• x and z are marginally independent, but given y they are conditionally dependent.

• This important effect is called explaining away (Berkson’s paradox).

• For example, flip two coins independently; let x = coin 1, z = coin 2.

• Let y = 1 if the coins come up the same and y = 0 if different.

• x and z are independent, but if I tell you y, they become coupled!
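The two-coin example is small enough to verify by exact enumeration. The sketch below assumes fair coins and uses exact fractions; the helper `prob` and the variable names are illustrative only.

```python
from fractions import Fraction
from itertools import product

# Joint over (x, z, y): two fair coins, y = 1 exactly when they match.
joint = {}
for x, z in product((0, 1), repeat=2):
    y = 1 if x == z else 0
    joint[(x, z, y)] = Fraction(1, 4)


def prob(pred):
    """Total probability of the outcomes (x, z, y) that satisfy pred."""
    return sum(p for outcome, p in joint.items() if pred(*outcome))


# Marginally, p(x=1, z=1) factorizes into p(x=1) p(z=1).
p_x1 = prob(lambda x, z, y: x == 1)
p_z1 = prob(lambda x, z, y: z == 1)
p_x1_z1 = prob(lambda x, z, y: x == 1 and z == 1)
marginally_independent = p_x1_z1 == p_x1 * p_z1

# Conditioned on y = 1, the same factorization fails.
p_y1 = prob(lambda x, z, y: y == 1)
p_x1_given_y1 = prob(lambda x, z, y: x == 1 and y == 1) / p_y1
p_z1_given_y1 = prob(lambda x, z, y: z == 1 and y == 1) / p_y1
p_x1_z1_given_y1 = prob(lambda x, z, y: x == 1 and z == 1 and y == 1) / p_y1
conditionally_independent = p_x1_z1_given_y1 == p_x1_given_y1 * p_z1_given_y1
```

Given y = 1, knowing x determines z completely, so the conditional joint is 1/2 while the product of the conditional marginals is 1/4.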

Bayes-Ball Algorithm

• To check if x_A ⊥ x_B | x_C we need to check if every variable in A is d-separated from every variable in B conditioned on all variables in C.

• In other words, given that all the nodes in x_C are clamped, when we wiggle nodes x_A can we change any of the nodes in x_B?

• The Bayes-Ball Algorithm is such a d-separation test.

• We shade all nodes x_C, place balls at each node in x_A (or x_B), let them bounce around according to some rules, and then ask if any of the balls reach any of the nodes in x_B (or x_A).

Bayes-Ball Rules

• The three cases we considered tell us rules:

Bayes-Ball Boundary Rules

• We also need the boundary conditions:

• Here’s a trick for the explaining away case: if y or any of its descendants is shaded, the ball passes through.

• Notice balls can travel opposite to edge directions.
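A d-separation test in the spirit of these rules can be written as a reachability search over (node, direction) pairs. This is a sketch of a standard reachability formulation rather than the exact ball bookkeeping drawn on the slides. The function name `d_separated` and the representation of the graph as a `parents` dictionary are assumptions.

```python
from collections import deque


def d_separated(parents, A, B, C):
    """True if every node in A is d-separated from every node in B given C.

    parents: dict mapping every node to an iterable of its parents (a DAG).
    A, B, C: sets of node names; C is the shaded (observed) set.
    """
    children = {v: set() for v in parents}
    for v, ps in parents.items():
        for p in ps:
            children[p].add(v)

    # Nodes that are shaded or have a shaded descendant (the collider trick).
    shaded_or_above = set()
    stack = list(C)
    while stack:
        v = stack.pop()
        if v not in shaded_or_above:
            shaded_or_above.add(v)
            stack.extend(parents[v])

    # "up": the ball arrived from a child; "down": it arrived from a parent.
    queue = deque((a, "up") for a in A)
    visited = set()
    while queue:
        v, direction = queue.popleft()
        if (v, direction) in visited:
            continue
        visited.add((v, direction))
        if v not in C and v in B:
            return False  # a ball reached B, so the sets are d-connected
        if direction == "up" and v not in C:
            # Unshaded node reached from below: pass to parents and children.
            queue.extend((p, "up") for p in parents[v])
            queue.extend((c, "down") for c in children[v])
        elif direction == "down":
            if v not in C:
                # Chain through an unshaded node: keep going down.
                queue.extend((c, "down") for c in children[v])
            if v in shaded_or_above:
                # Collider that is shaded or has a shaded descendant: bounce up.
                queue.extend((p, "up") for p in parents[v])
    return True


# The three canonical three-node graphs, written as node -> parents.
chain = {"x": [], "y": ["x"], "z": ["y"]}
common_cause = {"y": [], "x": ["y"], "z": ["y"]}
collider = {"x": [], "z": [], "y": ["x", "z"], "w": ["y"]}
```

For example, `d_separated(collider, {"x"}, {"z"}, set())` holds, while shading either `y` or its descendant `w` connects `x` and `z`.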

Canonical Micrographs

Examples of Bayes-Ball Algorithm

• Notice: balls can travel opposite to edge direction

Plates

Example: Nested Plates

Example DAGM: Markov Chain

• Markov Property: Conditioned on the present, the past and future are independent

Unobserved Variables

• Certain variables Q in our models may be unobserved, either some of the time or always, either at training time or at test time

• Graphically, shading indicates observation

Latent Variables

• What to do when a variable z is always unobserved?

• If we never condition on it, we can integrate it out exactly: e.g., given y, x fit the model p(z, y | x) = p(z | y) p(y | x, w) p(w). (In other words, if it is a leaf node.)

• This lets us ignore missing values in naive Bayes

• But if z is conditioned on, we need to marginalize it: e.g., given y, x fit the model p(y | x) = Σ_z p(y | x, z) p(z)

[Figures: two small graphs, one over the nodes w, y, z and one over the nodes z, y, x]
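The marginalization p(y | x) = Σ_z p(y | x, z) p(z) is a short sum when z is discrete. The sketch below assumes a binary y, a two-valued z, and a logistic form for p(y | x, z). The prior `p_z`, the `slopes`, and the function names are all hypothetical.

```python
import math

p_z = [0.3, 0.7]      # hypothetical prior over the latent z
slopes = [-1.0, 2.0]  # hypothetical slope of the logistic for each z


def p_y_given_x_z(y, x, z):
    """p(y | x, z): a logistic model whose slope depends on z."""
    p_one = 1.0 / (1.0 + math.exp(-slopes[z] * x))
    return p_one if y == 1 else 1.0 - p_one


def p_y_given_x(y, x):
    """p(y | x) = sum_z p(y | x, z) p(z), summing the latent z out."""
    return sum(p_y_given_x_z(y, x, z) * p_z[z] for z in range(len(p_z)))
```

Fitting such a model by maximum likelihood means optimizing the log of this sum, which is where the log-sum-exp computation comes in.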

Latent variable models

• Often it’s useful to assume unseen causes

• Knowing p(disease) and p(symptoms | disease) implies p(symptoms)

• Sometimes latent variables are interpretable

[Figure labels: diseases, symptoms]

Where Do Latent Variables Come From?

• Latent variables may appear naturally, from the structure of the problem, because something wasn’t measured, because of faulty sensors, occlusion, privacy, etc.

• May want to intentionally introduce latent variables to model complex dependencies between variables without looking at the dependencies between them directly. This can actually simplify the model (e.g., mixtures).

Hidden Markov Models (HMMs)

• A very popular form of latent variable model

• Z_t → Hidden states taking one of K discrete values

• X_t → Observations taking values in any space

Example: discrete, M observation symbols
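For the discrete example, the hidden states can be summed out one time step at a time to get the likelihood of an observation sequence. The sketch below uses the standard forward recursion, which is not shown on the slide. The function name and the parameter values (K = 2 hidden states, M = 3 symbols) are hypothetical.

```python
def forward_likelihood(init, trans, emit, observations):
    """p(x_1, ..., x_T) for a discrete HMM, summing over all hidden paths.

    init[k]     = p(z_1 = k)
    trans[j][k] = p(z_t = k | z_{t-1} = j)
    emit[k][m]  = p(x_t = m | z_t = k)
    """
    num_states = len(init)
    # alpha[k] = p(x_1, ..., x_t, z_t = k)
    alpha = [init[k] * emit[k][observations[0]] for k in range(num_states)]
    for x in observations[1:]:
        alpha = [
            emit[k][x] * sum(alpha[j] * trans[j][k] for j in range(num_states))
            for k in range(num_states)
        ]
    return sum(alpha)


# Hypothetical parameters: K = 2 hidden states, M = 3 observation symbols.
init = [0.6, 0.4]
trans = [[0.7, 0.3], [0.2, 0.8]]
emit = [[0.5, 0.4, 0.1], [0.1, 0.3, 0.6]]
likelihood = forward_likelihood(init, trans, emit, [0, 2, 1])
```

The recursion costs O(T K^2) instead of the O(K^T) of summing over every hidden path. For long sequences the products underflow, so a practical version would work with log probabilities.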

Inference in Graphical Models

x_E → Observed evidence variables (subset of nodes)

x_F → unobserved query nodes we’d like to infer

x_R → remaining variables, extraneous to this query but part of the given graphical representation

Inference with Two Variables

Table look-up

Bayes’ Rule

Naïve Inference

• Suppose each variable takes one of k discrete values

• Costs O(k) operations to update each of O(k^5) table entries

• Use factorization and the distributive law to reduce complexity
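The saving from the distributive law is easiest to see on a chain. The sketch below assumes a hypothetical chain x_1 → x_2 → … → x_n that reuses one transition table for every edge; this is not necessarily the graph on the slide. It computes the marginal of the last variable both ways.

```python
from itertools import product


def marginal_naive(prior, trans, n):
    """p(x_n) by summing the full joint over all k**n configurations."""
    k = len(prior)
    out = [0.0] * k
    for xs in product(range(k), repeat=n):
        p = prior[xs[0]]
        for a, b in zip(xs, xs[1:]):
            p *= trans[a][b]
        out[xs[-1]] += p
    return out


def marginal_eliminate(prior, trans, n):
    """Same marginal with each sum pushed inside the product: O(n k^2)."""
    k = len(prior)
    message = list(prior)
    for _ in range(n - 1):
        # Sum out the current variable before touching the next factor.
        message = [sum(message[a] * trans[a][b] for a in range(k))
                   for b in range(k)]
    return message


# Hypothetical tables with k = 3 values per variable.
prior = [0.5, 0.3, 0.2]
trans = [[0.8, 0.1, 0.1], [0.2, 0.6, 0.2], [0.3, 0.3, 0.4]]
naive = marginal_naive(prior, trans, 5)
fast = marginal_eliminate(prior, trans, 5)
```

Both functions return the same distribution; only the order in which sums and products are carried out differs.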

Inference in Directed Graphs



Probabilistic graphical models

+ structured representations

+ priors and uncertainty

+ data and computational efficiency

– rigid assumptions may not fit

– feature engineering

– top-down inference

Deep learning

– neural net “goo”

– difficult parameterization

– can require lots of data

+ flexible

+ feature learning

+ recognition networks



data space latent space


Denton & Fergus, 2018


Questions?

Parameter Constraints

• If we want to use unconstrained optimization, we have to enforce constraints (e.g., Σ_k α_k = 1 for the mixing proportions, each covariance Σ_k positive definite) automatically.

• Re-parameterize in terms of unconstrained values. For mixing proportions, use softmax:

• For covariance matrices, use the Cholesky decomposition

where A is upper triangular with positive diagonal
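Both re-parameterizations can be sketched with plain lists. The slide's formulas did not survive extraction, so this example assumes the common forms α_k = exp(q_k) / Σ_j exp(q_j) and Σ = AᵀA with A_ii = exp(r_i). The same construction could equally parameterize the precision matrix. All names are illustrative.

```python
import math


def softmax(q):
    """Map unconstrained reals q to positive weights that sum to one."""
    m = max(q)  # shift by the max for numerical stability
    exps = [math.exp(v - m) for v in q]
    total = sum(exps)
    return [e / total for e in exps]


def covariance_from_unconstrained(diag_raw, off_diag):
    """Build Sigma = A^T A from unconstrained numbers (assumed form).

    A is upper triangular with A[i][i] = exp(diag_raw[i]) > 0; the entries
    above the diagonal are read from off_diag in row-major order.
    """
    d = len(diag_raw)
    A = [[0.0] * d for _ in range(d)]
    values = iter(off_diag)
    for i in range(d):
        A[i][i] = math.exp(diag_raw[i])
        for j in range(i + 1, d):
            A[i][j] = next(values)
    # Sigma[i][j] = sum_k A[k][i] * A[k][j]
    return [[sum(A[k][i] * A[k][j] for k in range(d)) for j in range(d)]
            for i in range(d)]


alpha = softmax([0.5, -1.0, 2.0])
sigma = covariance_from_unconstrained([0.0, 0.0], [2.0])  # [[1, 2], [2, 5]]
```

Any real-valued inputs give a valid result, so a gradient-based optimizer can move freely in the unconstrained space.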

Logsumexp

• Represent positive quantities (probabilities) by their logarithm.

• However, computing log-marginals will look like

$\log \sum_k \exp(b_k)$

• Careful! Do not do log(sum(exp(b))). You will get underflow.

• Instead subtract B = max(b) from all the values b_k

• Compute log(sum(exp(b - B))) + B.

• Rule of thumb: never use log or exp by itself
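The max-shift trick takes only a few lines. This is a minimal standard-library sketch; the function name `logsumexp` is borrowed from common usage and the inputs are made up.

```python
import math


def logsumexp(b):
    """Compute log(sum(exp(b_k))) stably by factoring out B = max(b)."""
    B = max(b)
    if B == -math.inf:
        return -math.inf  # every term is exp(-inf) = 0
    return B + math.log(sum(math.exp(v - B) for v in b))


# Log-probabilities this small underflow to 0.0 under a naive exp.
stable = logsumexp([-1000.0, -1000.0])  # -1000 + log(2)
```

With these inputs the naive `math.log(sum(math.exp(v) for v in b))` fails because every `exp` underflows to zero. For large positive inputs the naive version overflows instead.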

Vectorized Code

• Evaluating your model on a batch of data can usually be done without any loops

• Use broadcasting!


Basic Neural Networks


unsupervised learning

supervised learning

Courtesy of Matthew Johnson


Takeaways

• Exact architecture of network isn’t usually important

• Just need lots of parameters, and gradients

• Gradient-based optimization is more effective than we thought it would be

Gradient descent

• Cauchy (1847)

Gradient descent: 2d example

[Figure: 2d example with the minimum marked]

The x_i in this figure are the gradient descent iterates. In the previous slide’s setting, x_i is θ_i.
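A 2d run like the one in the figure can be sketched with the usual update θ ← θ − η ∇f(θ). The quadratic objective, the step size, the starting point, and the function names below are illustrative assumptions, not the example from the slide.

```python
def grad_f(theta):
    """Gradient of the assumed objective f = (t0 - 1)^2 + 10 (t1 + 2)^2."""
    return [2.0 * (theta[0] - 1.0), 20.0 * (theta[1] + 2.0)]


def gradient_descent(grad, theta0, step_size, num_steps):
    """Return the list of iterates theta_0, theta_1, ..., theta_T."""
    iterates = [list(theta0)]
    theta = list(theta0)
    for _ in range(num_steps):
        g = grad(theta)
        theta = [t - step_size * gi for t, gi in zip(theta, g)]
        iterates.append(theta)
    return iterates


iterates = gradient_descent(grad_f, [-3.0, 3.0], step_size=0.05, num_steps=200)
theta_final = iterates[-1]  # close to the minimizer (1, -2)
```

The step size matters here: the curvature is ten times larger along the second coordinate, so a step size that is too large for that direction would make the iterates diverge.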

Non-convex optimization

[Figure: loss, error, or negative log-likelihood curve with a local minimum, a global minimum, and a local maximum marked; gradient descent]

In this example, gradient descent would converge to the closest local minimum.

In machine learning, we often need to rely on non-convex optimization.

Convergence guarantees are very limited; practice relies mostly on heuristics.
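The dependence on the starting point can be reproduced on a one-dimensional non-convex function. The sketch below uses a made-up quartic, f(x) = x^4 − 3x^2 + x, which has a global minimum near x ≈ −1.30 and a shallower local minimum near x ≈ 1.13. The function, step size, and starting points are all assumptions.

```python
def f(x):
    """Assumed non-convex objective with two minima."""
    return x ** 4 - 3.0 * x ** 2 + x


def f_prime(x):
    return 4.0 * x ** 3 - 6.0 * x + 1.0


def descend(x0, step_size=0.01, num_steps=2000):
    """Plain gradient descent in one dimension."""
    x = x0
    for _ in range(num_steps):
        x -= step_size * f_prime(x)
    return x


x_left = descend(-2.0)   # reaches the global minimum
x_right = descend(2.0)   # gets stuck in the local minimum
```

Both runs stop at a point with zero gradient, but only the run started on the left finds the lower of the two minima.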

Non-convex optimization

[Figure: loss curve with a local minimum and a global minimum marked; stochastic gradient descent]

Stochastic methods have a higher chance of escaping “bad” minima and converging to favorable regions.

Draw picture of how higher dimensions help


Extra slides

Machine-learning-centric History of Probabilistic Models

• 1940s - 1960s Motivating probability and Bayesian inference

• 1980s - 2000s Bayesian machine learning with MCMC

• 1990s - 2000s Graphical models with exact inference

• 1990s - present Bayesian Nonparametrics with MCMC (Indian Buffet process, Chinese restaurant process)

• 1990s - 2000s Bayesian ML with mean-field variational inference

• 2000s - present Probabilistic Programming

• 2000s - 2013 Deep undirected graphical models (RBMs, pretraining)

• 2010s - present Stan - Bayesian Data Analysis with HMC

• 2000s - 2013 Autoencoders, denoising autoencoders

• 2000s - present Invertible density estimation

• 2013 - present Stochastic variational inference, variational autoencoders

• 2014 - present Generative adversarial nets, Real NVP, Pixelnet

• 2016 - present Lego-style deep generative models (attend, infer, repeat)

Stats vs Machine Learning

• Statistician: Look at the data, consider the problem, and design a model we can understand

• Analyze methods to give guarantees

• Want to make few assumptions

• ML: We only care about making good predictions!

• Let’s make a general procedure that works for lots of datasets

• No way around making assumptions, let’s just make the model large enough to hopefully include something close to the truth

• Can’t use bounds in practice, so evaluate empirically to choose model details

• Sometimes end up with interpretable models anyways


Advantages of probabilistic latent-variable models

• Data-efficient learning - automatic regularization, can take advantage of more information

• Compose-able models - e.g. incorporate data corruption model. Different from composing feedforward computations

• Handle missing + corrupted data (without the standard hack of just guessing the missing values using averages).

• Predictive uncertainty - necessary for decision-making

• Conditional predictions (e.g. if Brexit happens, the value of the pound will fall)

• Active learning - what data would be expected to increase our confidence about a prediction

• Cons:

• intractable integral over latent variables

• Examples: medical diagnosis, image modeling

D-Separation

• D-separation, or directed-separation, is a notion of connectedness in DAGMs in which two (sets of) variables may or may not be connected conditioned on a third (set of) variable.

• D-connection implies conditional dependence and d-separation implies conditional independence.

• In particular, we say that x_A ⊥ x_B | x_C if every variable in A is d-separated from every variable in B conditioned on all the variables in C.

• To check if an independence is true, we can cycle through each node in A, do a depth-first search to reach every node in B, and examine the path between them. If all of the paths are d-separated, then we can assert x_A ⊥ x_B | x_C

• Thus, it will be sufficient to consider triples of nodes. (Why?)

• Pictorially, when we condition on a node, we shade it in.