Yoshua Bengio, Statistical Machine Learning Chair, U. Montreal
ICML 2011 Workshop on Unsupervised and Transfer Learning, July 2nd 2011, Bellevue, WA

Transcript of the slides:

  • Slide 1
  • Yoshua Bengio, Statistical Machine Learning Chair, U. Montreal. ICML 2011 Workshop on Unsupervised and Transfer Learning, July 2nd 2011, Bellevue, WA
  • Slide 2
  • How to Beat the Curse of Many Factors of Variation? Compositionality gives an exponential gain in representational power: distributed representations and deep architectures.
  • Slide 3
  • Distributed Representations: many neurons are active simultaneously; the input is represented by the activation of a set of features that are not mutually exclusive; this can be exponentially more efficient than local representations.
  • Slide 4
  • Local vs Distributed
  • Slide 5
  • RBM Hidden Units Carve Input Space (figure: hidden units h1, h2, h3 partitioning the input space over x1, x2)
  • Slide 6
  • Unsupervised Deep Feature Learning. Classical: pre-process data with PCA (keep the leading factors). New: learn multiple levels of features/factors, often over-complete. Greedy layer-wise strategy: train each layer unsupervised on the raw input or on the previous layer's output, then apply supervised fine-tuning of P(y|x) (Hinton et al. 2006, Bengio et al. 2007, Ranzato et al. 2007). A sketch follows.
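A minimal sketch of the greedy layer-wise strategy, using tiny tied-weight sigmoid autoencoders in plain numpy as stand-ins for the per-layer unsupervised learners; the function names, sizes, and squared-error loss are illustrative assumptions, not the talk's exact setup, and the supervised fine-tuning step is not shown:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_autoencoder_layer(X, n_hidden, lr=0.1, epochs=10, rng=np.random.default_rng(0)):
    """Train a tied-weight sigmoid autoencoder; return encoder parameters (W, b)."""
    n_visible = X.shape[1]
    W = 0.01 * rng.standard_normal((n_visible, n_hidden))
    b = np.zeros(n_hidden)
    c = np.zeros(n_visible)                      # decoder bias (weights are tied: W.T)
    for _ in range(epochs):
        for x in X:
            h = sigmoid(x @ W + b)               # encode
            r = sigmoid(h @ W.T + c)             # decode (reconstruction)
            grad_r = (r - x) * r * (1 - r)       # squared error through the output sigmoid
            grad_h = (grad_r @ W) * h * (1 - h)
            W -= lr * (np.outer(x, grad_h) + np.outer(grad_r, h))
            b -= lr * grad_h
            c -= lr * grad_r
    return W, b

def greedy_pretrain(X, layer_sizes):
    """Stack layers greedily; return the list of (W, b) and the top-level codes."""
    layers, H = [], X
    for n_hidden in layer_sizes:
        W, b = train_autoencoder_layer(H, n_hidden)
        layers.append((W, b))
        H = sigmoid(H @ W + b)                   # codes become the input to the next layer
    return layers, H
```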
  • Slide 7
  • Why Deep Learning? Hypothesis 1: a deep hierarchy of features is needed to efficiently represent and learn the complex abstractions needed for AI and mammal intelligence (computational & statistical efficiency). Hypothesis 2: unsupervised learning of representations is a crucial component of the solution (optimization & regularization). There is theoretical and ML-experimental support for both. Cortex: a deep architecture, with the same learning rule everywhere.
  • Slide 8
  • Deep Motivations. Brains have a deep architecture. Human ideas & artifacts are composed from simpler ones. Insufficient depth can be exponentially inefficient. Distributed (possibly sparse) representations are necessary for non-local generalization and are exponentially more efficient than a 1-of-N enumeration of latent variable values. Multiple levels of latent variables allow combinatorial sharing of statistical strength (figure: raw input x feeding shared intermediate representations used by tasks 1-3).
  • Slide 9
  • Deep Architectures are More Expressive. Theoretical arguments: 2 layers of logic gates, formal neurons, or RBF units form a universal approximator, but may need exponentially many (up to 2^n) units. Theorems for all 3 (Håstad et al. 1986 & 1991, Bengio et al. 2007): functions compactly represented with k layers may require exponential size with k-1 layers.
  • Slide 10
  • Sharing Components in a Deep Architecture. A polynomial expressed with shared components: the advantage of depth may grow exponentially (Bengio & Delalleau, Learning Workshop 2011). Sum-product networks.
  • Slide 11
  • Parts Are Composed to Form Objects. Layer 1: edges; layer 2: parts; layer 3: objects (Lee et al., ICML 2009).
  • Slide 12
  • Deep Architectures and Sharing Statistical Strength, Multi-Task Learning. Generalizing better to new tasks is crucial to approach AI. Deep architectures learn good intermediate representations that can be shared across tasks; good representations make sense for many tasks (figure: raw input x, shared intermediate representation h, task outputs y1, y2, y3).
  • Slide 13
  • Feature and Sub-Feature Sharing. Different tasks can share the same high-level features; different high-level features can be built from the same set of lower-level features. More levels = up to an exponential gain in representational efficiency (figure: sharing vs. not sharing intermediate features between task outputs y1..yN).
  • Slide 14
  • Representations as Coordinate Systems. PCA removes low-variance directions easily, but what if the signal has low variance? We would like to disentangle factors of variation while keeping them all. Overcomplete representations are richer, even if the underlying distribution concentrates near a low-dimensional manifold. Sparse/saturated features allow for variable-dimension manifolds. A different small set of sensitive features at each x = a local chart / coordinate system.
  • Slide 15
  • Effect of Unsupervised Pre-training (AISTATS 2009 + JMLR 2010, with Erhan, Courville, Manzagol, Vincent, S. Bengio)
  • Slide 16
  • Effect of Depth (figure: results without pre-training vs. with pre-training)
  • Slide 17
  • Unsupervised Feature Learning: a Regularizer to Find Better Local Minima of Generalization Error. Unsupervised pre-training acts like a regularizer: it helps to initialize in the basin of attraction of local minima with better generalization error.
  • Slide 18
  • Non-Linear Unsupervised Feature Extraction Algorithms: CD for RBMs; SML (PCD) for RBMs; sampling beyond Gibbs (e.g. tempered MCMC); mean-field + SML for DBMs; sparse auto-encoders; Predictive Sparse Decomposition; denoising auto-encoders; score matching / ratio matching; noise-contrastive estimation; pseudo-likelihood; contractive auto-encoders. See my book / review paper (F&TML 2009): Learning Deep Architectures for AI.
  • Slide 19
  • Auto-Encoders. Reconstruction = decoder(encoder(input)); probable inputs have small reconstruction error. A linear decoder/encoder = PCA up to a rotation (small numerical check below). Auto-encoders can be stacked to form highly non-linear representations, increasing disentangling (Goodfellow et al., NIPS 2009). (Figure: input, encoder, code = latent features, decoder, reconstruction.)
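A small numerical check of the "linear encoder/decoder = PCA up to rotation" point: any invertible mixing of the code space leaves the rank-k reconstruction (projection onto the top-k principal subspace) unchanged. The data, sizes, and variable names are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((500, 10)) @ rng.standard_normal((10, 10))  # correlated data
X -= X.mean(axis=0)
k = 3

_, _, Vt = np.linalg.svd(X, full_matrices=False)
Vk = Vt[:k].T                                   # top-k principal directions

R = rng.standard_normal((k, k))                 # arbitrary invertible code-space mixing
W_enc = Vk @ R                                  # encoder weights
W_dec = np.linalg.inv(R) @ Vk.T                 # matching decoder weights

recon_pca = X @ Vk @ Vk.T                       # PCA rank-k reconstruction
recon_ae = X @ W_enc @ W_dec                    # linear autoencoder reconstruction
print(np.allclose(recon_pca, recon_ae))         # True: same subspace, rotated code
```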
  • Slide 20
  • Restricted Boltzmann Machine (RBM). The most popular building block for deep architectures: a bipartite undirected graphical model between observed units x and hidden units h. Inference is trivial: P(h|x) and P(x|h) factorize.
  • Slide 21
  • Gibbs Sampling in RBMs. P(h|x) and P(x|h) factorize, giving easy inference and convenient block Gibbs sampling: starting from x1, sample h1 ~ P(h|x1), x2 ~ P(x|h1), h2 ~ P(h|x2), x3 ~ P(x|h2), h3 ~ P(h|x3), and so on. (Sketch of the chain below.)
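A minimal sketch of that block Gibbs chain for a binary-binary RBM in numpy; the parameter shapes and names are illustrative assumptions (the sigmoid conditionals follow from the bipartite energy function):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gibbs_chain(x, W, b, c, steps, rng):
    """Block Gibbs sampling x -> h -> x -> ...; returns the final (x, h) sample."""
    for _ in range(steps):
        p_h = sigmoid(x @ W + c)                 # P(h_j = 1 | x) factorizes
        h = (rng.random(p_h.shape) < p_h).astype(float)
        p_x = sigmoid(h @ W.T + b)               # P(x_i = 1 | h) factorizes
        x = (rng.random(p_x.shape) < p_x).astype(float)
    return x, h

# Toy example: 6 visible and 4 hidden units with random parameters
rng = np.random.default_rng(0)
W = 0.1 * rng.standard_normal((6, 4))            # visible-to-hidden weights
b = np.zeros(6)                                  # visible biases
c = np.zeros(4)                                  # hidden biases
x0 = (rng.random(6) < 0.5).astype(float)
x_sample, h_sample = gibbs_chain(x0, W, b, c, steps=100, rng=rng)
```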
  • Slide 22
  • RBMs are Universal Approximators. Adding one hidden unit (with a proper choice of parameters) guarantees an increase in likelihood; with enough hidden units an RBM can perfectly model any discrete distribution. RBMs with a variable number of hidden units = non-parametric (Le Roux & Bengio 2008, Neural Computation).
  • Slide 23
  • Denoising Auto-Encoder. Learns a vector field pointing towards higher-probability regions; minimizes a variational lower bound on a generative model; similar to pseudo-likelihood; a form of regularized score matching. (Figure: corrupted input and its reconstruction; training-step sketch below.)
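A minimal sketch of one denoising auto-encoder update, assuming masking corruption, tied weights, sigmoid units, and squared error; the corruption level, learning rate, and names are illustrative. The key point is that the corrupted input is encoded while the clean input is the reconstruction target:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def dae_step(x, W, b, c, lr=0.1, corruption=0.3, rng=np.random.default_rng(0)):
    """One SGD update on a single example; returns the updated (W, b, c)."""
    x_tilde = x * (rng.random(x.shape) > corruption)   # masking corruption
    h = sigmoid(x_tilde @ W + b)                       # encode the corrupted input
    r = sigmoid(h @ W.T + c)                           # reconstruct
    grad_r = (r - x) * r * (1 - r)                     # error measured against the clean x
    grad_h = (grad_r @ W) * h * (1 - h)
    W -= lr * (np.outer(x_tilde, grad_h) + np.outer(grad_r, h))
    b -= lr * grad_h
    c -= lr * grad_r
    return W, b, c
```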
  • Slide 24
  • Stacked Denoising Auto-Encoders. No partition function, so the training criterion can be measured directly; encoder & decoder can use any parametrization; performs as well as or better than stacking RBMs for unsupervised pre-training (results on Infinite MNIST).
  • Slide 25
  • Contractive Auto-Encoders. "Contractive Auto-Encoders: Explicit Invariance During Feature Extraction", Rifai, Vincent, Muller, Glorot & Bengio, ICML 2011; "Higher Order Contractive Auto-Encoders", Rifai, Mesnil, Vincent, Muller, Bengio, Dauphin, Glorot, ECML 2011. Part of the winning toolbox in the final phase of the Unsupervised & Transfer Learning Challenge 2011.
  • Slide 26
  • Training criterion: reconstruction error plus a penalty on the Frobenius norm of the encoder's Jacobian. The penalty wants contraction in all directions; the reconstruction error cannot afford contraction in the manifold directions. Few active units represent the active subspace (local chart); the Jacobian's spectrum is peaked = a local low-dimensional representation / the relevant factors. (Penalty sketch below.)
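A minimal sketch of that contraction penalty for a sigmoid encoder h(x) = sigmoid(xW + b), for which the squared Frobenius norm of the Jacobian has a simple closed form; the tied-weight decoder, the weight lambda, and all names are illustrative assumptions:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def contractive_penalty(x, W, b):
    """Squared Frobenius norm of dh/dx for h(x) = sigmoid(x @ W + b)."""
    h = sigmoid(x @ W + b)
    dh = h * (1 - h)                               # elementwise sigmoid derivative
    return np.sum(dh ** 2 * np.sum(W ** 2, axis=0))

def cae_objective(X, W, b, c, lam=0.1):
    """Mean reconstruction error plus lambda times the mean contraction penalty."""
    H = sigmoid(X @ W + b)
    R = sigmoid(H @ W.T + c)                       # tied-weight decoder
    recon = np.mean(np.sum((R - X) ** 2, axis=1))
    contract = np.mean([contractive_penalty(x, W, b) for x in X])
    return recon + lam * contract
```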
  • Slide 27
  • Unsupervised and Transfer Learning Challenge: 1st Place in the Final Phase (figure: results with raw data and with 1, 2, 3, 4 layers).
  • Slide 28
  • Transductive Representation Learning. The validation and test sets have different classes than the training set, hence very different input distributions; directions that matter to distinguish them might have small variance under the training set. Solution: perform the last level of unsupervised feature learning (PCA) on the validation / test set input data (sketch below).
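A minimal sketch of that transductive step, assuming a feature extractor `deep_features` has already been trained on the training inputs (the function and names are illustrative): the final PCA is fit on the unlabeled evaluation inputs themselves, so directions with small variance under the training set but relevant for the new classes are kept.

```python
import numpy as np

def transductive_pca_codes(X_eval_raw, deep_features, k):
    """Return k-dimensional codes for the evaluation inputs."""
    H = deep_features(X_eval_raw)                  # top-layer features of the eval inputs
    H = H - H.mean(axis=0)                         # center on the evaluation set itself
    _, _, Vt = np.linalg.svd(H, full_matrices=False)
    return H @ Vt[:k].T                            # project onto eval-set principal directions
```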
  • Slide 29
  • Domain Adaptation (ICML 2011). On the small (4-domain) Amazon benchmark we beat the state-of-the-art handsomely. The sparse-rectifier SDA finds more features that tend to be useful for predicting either the domain or the sentiment, but not both.
  • Slide 30
  • Sentiment Analysis: Transfer Learning. 25 Amazon.com domains (toys, software, video, books, music, beauty, ...). Unsupervised pre-training of the input space on all domains, then a supervised SVM on 1 domain, generalizing out-of-domain. Baseline: bag-of-words + SVM.
  • Slide 31
  • Sentiment Analysis: Large-Scale Out-of-Domain Generalization. Relative loss when going from in-domain to out-of-domain testing; 340k examples from Amazon, from 56 (tools) to 124k (music).
  • Slide 32
  • Representing Sparse High-Dimensional Stuff. "Deep Sparse Rectifier Neural Networks", Glorot, Bordes & Bengio, AISTATS 2011; "Sampled Reconstruction for Large-Scale Learning of Embeddings", Dauphin, Glorot & Bengio, ICML 2011. (Figure: encoding the sparse input into latent features is cheap; producing the dense output probabilities is expensive; sketch below.)
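A rough sketch of the idea behind sampled reconstruction (an illustration, not the paper's exact algorithm): evaluate the reconstruction loss only on the non-zero coordinates plus a random sample of the zero coordinates, reweighting the sampled zeros so the estimate of the full loss stays unbiased. In practice the reconstruction r would only be computed at the selected indices; all names here are illustrative.

```python
import numpy as np

def sampled_reconstruction_loss(x, r, n_neg, rng=np.random.default_rng(0)):
    """Unbiased estimate of sum_i (r_i - x_i)^2 using all non-zeros of x
    plus n_neg randomly sampled zero coordinates."""
    nz = np.flatnonzero(x)                         # cheap when x is sparse
    zeros = np.flatnonzero(x == 0)
    neg = rng.choice(zeros, size=min(n_neg, zeros.size), replace=False)
    w = zeros.size / max(neg.size, 1)              # importance weight for the sampled zeros
    loss_nonzero = np.sum((r[nz] - x[nz]) ** 2)
    loss_sampled = w * np.sum((r[neg] - x[neg]) ** 2)
    return loss_nonzero + loss_sampled
```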
  • Slide 33
  • Speedup from Sampled Reconstruction
  • Slide 34
  • Deep Self-Taught Learning for Handwritten Character Recognition, Y. Bengio & 16 others (IFT6266 class project & AISTATS 2011 paper). Discriminate 62 character classes (upper, lower, digits), 800k to 80M examples. Deep learners beat the state-of-the-art on NIST and reach human-level performance; deep learners benefit more from perturbed (out-of-distribution) data; deep learners benefit more from the multi-task setting.
  • Slide 35
  • (figure only)
  • Slide 36
  • Improvement due to training on perturbed data (deep vs. shallow). {SDA,MLP}1: trained only on data with distortions (thickness, slant, affine, elastic, pinch); {SDA,MLP}2: all perturbations.
  • Slide 37
  • Improvement due to the multi-task setting (deep vs. shallow). Multi-task: train on all categories, test on the target categories (shared representations). Not multi-task: train and test on the target categories only (no sharing, separate models).
  • Slide 38
  • Comparing against Humans and the State of the Art. All 62 character classes; deep learners reach human performance. [1] Granger et al., 2007; [2] Pérez-Cortes et al., 2000 (nearest neighbor); [3] Oliveira et al., 2002b (MLP); [4] Milgram et al., 2005 (SVM).
  • Slide 39
  • Tips & Tricks Dont be scared by the many hyper- parameters: use random sampling (not grid search) & clusters / GPUs Learning rate is the most important, along with top-level dimension Make sure selected hyper-param value is not on the border of interval Early stopping Using NLDR for visualization Simulation of Final Evaluation Scenario
  • Slide 40
  • Conclusions. Deep learning: powerful arguments & generalization principles. Unsupervised feature learning is crucial: many new algorithms and applications in recent years. DL is particularly suited for multi-task learning, transfer learning, domain adaptation, self-taught learning, and semi-supervised learning with few labels.
  • Slide 41
  • http://deeplearning.net/software/theano (Theano: numpy + GPU); http://deeplearning.net
  • Slide 42
  • Thank you! Questions? The LISA ML Lab team.