ICS Summer School 2016
Data Science - Week 4
Gabriella Contardo
LIP6, University Pierre et Marie Curie, Paris, France
August 10, 2016
Outline of the week
Course 1: Reminders of the learning paradigm, neural networks / multi-layer perceptron.
Course 2: Deep learning: Convolutional Neural Networks.
Course 3: Tips on deep learning - PCA, Matrix Factorization and Recommender systems.
Course 4: Unsupervised learning: Clustering (K-Means), EM.
Course 5: Unsupervised learning with (deep) neural networks: Auto-encoders, RNN. Word embeddings.
References
Online course material for today. Thanks to:
Patrick Gallinari - Professor at UPMC - Course ”Apprentissage Statistique”
Nicolas Baskiotis - Assistant Professor at UPMC - Course ”ARF” (Master DAC)
Fei-Fei Li's course at Stanford: CS231n: Convolutional Neural Networks for Visual Recognition (lectures: introduction to neural nets, backpropagation)
Course by Y. Bengio (MLSS 2014 slides and video available online)

Also interesting (generally speaking):
Machine Learning course by Andrew Ng on Coursera
Lectures by Nando De Freitas (Oxford) - videos available online
Outline of the day
Deep Learning: Tips on Deep Learning
Unsupervised Learning - Data Compression / Representation:
PCA
MF/NMF and recommender systems
GoogLeNet
Erratum: GoogLeNet is Google's paper, not by LeCun (he is at Facebook), but it is named as a reference/joke to his (old, first) model ”LeNet”.
”Yellow layers”: additional supervised layers added ”on the side” of the network during training. ”Later control experiments have shown that the effect of the auxiliary networks is relatively minor (around 0.5%) and that it required only one of them to achieve the same effect.” (cf. paper)
Learning time: ”a rough estimate suggests that the GoogLeNet network could be trained to convergence using few high-end GPUs within a week, the main limitation being the memory usage”.
More info in the paper: http://www.cs.unc.edu/~wliu/papers/GoogLeNet.pdf
Tips for learning deep (convolutional) networks
Data Augmentation
Modify the input (pixels) without changing the label
Train on the transformed data
(credit Fei-Fei Li's course CS231n, Stanford)
Tips for learning deep (convolutional) networks
Data Augmentation
Flip
Random crops/scales
(+sample on crop)
Color jitter (e.g. contrast)
Translation, rotation, ...
(credit Fei-Fei Li's course CS231n, Stanford)
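A minimal numpy sketch of these transformations (horizontal flip, random crop, brightness/contrast jitter). The function name and the crop/jitter parameters are illustrative choices, not from the course.

import numpy as np

def augment(img, crop_size=24, rng=np.random):
    # img: H x W x C array of pixel values in [0, 255]
    # Random horizontal flip
    if rng.rand() < 0.5:
        img = img[:, ::-1, :]
    # Random crop of size crop_size x crop_size
    h, w, _ = img.shape
    top = rng.randint(0, h - crop_size + 1)
    left = rng.randint(0, w - crop_size + 1)
    img = img[top:top + crop_size, left:left + crop_size, :]
    # Color jitter: random contrast (alpha) and brightness (beta) change
    alpha = 1.0 + 0.2 * (rng.rand() - 0.5)
    beta = 10.0 * (rng.rand() - 0.5)
    return np.clip(alpha * img + beta, 0, 255)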
Straightforward for images, but what about other types of data?
Tips for learning deep (convolutional) networks
Generally:
Training: add random noise
Testing: marginalize over the noise
(credit Fei-Fei Li's course CS231n, Stanford)
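One way to read this (a sketch, not the course's code): inject Gaussian noise on the inputs at training time, and at test time average the model's predictions over several noisy copies of the input, a Monte-Carlo approximation of marginalizing over the noise. `model.predict` here is a hypothetical prediction function.

import numpy as np

def noisy(x, sigma=0.1, rng=np.random):
    # Add zero-mean Gaussian noise to the input
    return x + sigma * rng.randn(*x.shape)

# Training: feed noisy(x) instead of x to the learning algorithm.
# Testing: marginalize over the noise by averaging predictions
# over several noisy copies of the same input.
def predict_marginalized(model, x, n_samples=20, sigma=0.1):
    preds = [model.predict(noisy(x, sigma)) for _ in range(n_samples)]
    return np.mean(preds, axis=0)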
Tips for learning deep (convolutional) networks
« You need a lot of data if you want to train/use CNN »
→ Nope ! (well, not necessarily…) « Transfer » learning
(credit Fei-Fei Li's course CS231n, Stanford)
Tips for learning deep (convolutional) networks
Using pre-trained CNNs is the norm.
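A hedged sketch of the small-dataset case: treat a pre-trained CNN as a fixed feature extractor and train a simple classifier on top. The use of Keras and its ImageNet-pretrained VGG16 is an assumption (any pre-trained network would do); X_train, y_train, X_test, y_test stand for a small labelled image dataset.

# Assumes the Keras library with its bundled ImageNet-pretrained VGG16 weights.
import numpy as np
from keras.applications.vgg16 import VGG16, preprocess_input
from sklearn.linear_model import LogisticRegression

base = VGG16(weights='imagenet', include_top=False)   # drop the classifier layers

def cnn_features(images):
    # images: (n, 224, 224, 3) float array
    acts = base.predict(preprocess_input(images))      # frozen conv activations
    return acts.reshape(len(images), -1)               # flatten to (n, d)

# Small labelled dataset: train a simple classifier on the frozen CNN features
# (alternatively, fine-tune only the last layers of the network on the new task).
clf = LogisticRegression().fit(cnn_features(X_train), y_train)
print(clf.score(cnn_features(X_test), y_test))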
Dimensionality Reduction
Problem
More dimensions ⇒ more expressivity.
Too many dimensions: low variance on a dimension, noise, takes memory space and time...
For tasks on images, text and others: too many dimensions → dimensions that are not very informative, highly correlated.
⇒ Reduce the number of dimensions, but without losing information.
Goal: find a projection $\Phi : \mathbb{R}^d \rightarrow \mathbb{R}^{d'}$ with $d' \ll d$.
Applications:
Learning
Visualization
Noise reduction
credit N.Baskiotis UPMC
Data Compression
(credit Andrew Ng coursera)
Reduce data from 2D to 1D
Data Compression
(credit Andrew Ng coursera)
Reduce data from 3D to 2D
[Figure: original dataset in 3D and projected dataset]
Visualization of the projected dataset in 2D: $z = [z_1, z_2]$, with $z^i = [z^i_1, z^i_2]$.
Data Compression
(credit http://setosa.io/ev/principal-component-analysis/)
Motivation : Visualization
[Figures: the 17D dataset projected to 1D and to 2D]
Dimensionality Reduction
Quiz! Suppose we apply dimensionality reduction to a dataset of m examples $\{x^1, \dots, x^m\}$ where $x^i \in \mathbb{R}^n$. As a result, we will get out:
A lower-dimensional dataset $\{z^1, \dots, z^k\}$ of k examples where $k \le n$
A lower-dimensional dataset $\{z^1, \dots, z^k\}$ of k examples where $k > n$
A lower-dimensional dataset $\{z^1, \dots, z^m\}$ of m examples where $z^i \in \mathbb{R}^k$ for some value of k and $k \le n$
A lower-dimensional dataset $\{z^1, \dots, z^m\}$ of m examples where $z^i \in \mathbb{R}^k$ for some value of k and $k > n$
credit Andrew Ng - Stanford Univ
Data Compression - PCA
Principal Component Analysis (PCA) : Problem formulation
⇒ Goal: find a line onto which to project the data
(credit Andrew Ng coursera)
Dimensionality Reduction - PCA
Formulation
Reduce from 2 dimensions to 1 dimension: find a direction (a vector $u_1 \in \mathbb{R}^n$) onto which to project the data so as to minimize the projection error.
Generic case: reduce from n dimensions to k dimensions: find k vectors $u_1, \dots, u_k$ (directions) onto which to project the data so as to minimize the projection error.
credit Andrew Ng - Stanford Univ
Data Compression - PCA
PCA is not linear regression
(credit Andrew Ng coursera)
Data Compression - PCA
Quiz: Suppose you run PCA on the following dataset. Which of the following would be a reasonable vector $u_1$ onto which to project the data? (Usually $||u_1|| = 1$.)
(credit Andrew Ng coursera)
u1 = [ 1 0 ]
u1 = [ 0 1 ]
u1 = [ 1/√2 1/√2 ]
u1 = [ -1/√2 1/√2 ]
Dimensionality Reduction - PCA algorithm
Data preprocessing
Training set x1, . . . , xm
Preprocessing: feature scaling / mean normalization:
Compute $\mu_j = \frac{1}{m}\sum_{i=1}^{m} x^i_j$.
Replace each $x^i_j$ with $x^i_j - \mu_j$: each feature now has zero mean.
If different features are on different scales, rescale them to have a comparable range of values, e.g. $x^i_j = \frac{x^i_j - \min(x_j)}{\max(x_j) - \min(x_j)}$.
To do also in supervised learning!
credit Andrew Ng - Stanford Univ
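A minimal numpy sketch of this preprocessing (the division by max - min is the rescaling given above; dividing by the standard deviation is an equally common choice).

import numpy as np

def preprocess(X):
    # X: (m, n) array, one example per row
    mu = X.mean(axis=0)                     # mu_j = (1/m) sum_i x^i_j
    Xc = X - mu                             # each feature now has zero mean
    # If features live on very different scales, rescale them to a
    # comparable range (here: divide by max - min of each feature).
    span = X.max(axis=0) - X.min(axis=0)
    span = np.where(span == 0, 1.0, span)   # avoid dividing by zero for constant features
    return Xc / span, mu, span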
Dimensionality Reduction - PCA algorithm
Reduce data from n dimensions to k dimensions.
Compute the covariance matrix:
$\Sigma = \frac{1}{m}\sum_{i=1}^{m} (x^i)(x^i)^T$
(Python: numpy.cov(x) if x is your matrix of examples with examples in columns, i.e. $x \in \mathbb{R}^{n \times m}$.)
Compute the eigenvectors of the matrix $\Sigma$:
U, S, V = numpy.linalg.svd(Σ)
W, V = numpy.linalg.eig(Σ)
The matrix U (or V if using eig) is in $\mathbb{R}^{n \times n}$.
Take the first k columns of U → $U_{reduce}$.
Finding the new representation $z^i$ for an example $x^i$:
$z^i = U_{reduce}^T \, x^i$
credit Andrew Ng - Stanford Univ
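A compact numpy sketch of the algorithm above. It assumes X is already mean-normalized and stored with one example per row (so the orientation is transposed with respect to the column convention used on the slide).

import numpy as np

def pca(X, k):
    # X: (m, n) array of mean-normalized examples (one per row)
    m = X.shape[0]
    Sigma = X.T.dot(X) / m                 # covariance matrix, n x n
    U, S, Vt = np.linalg.svd(Sigma)        # columns of U are the principal directions
    U_reduce = U[:, :k]                    # keep the first k directions
    Z = X.dot(U_reduce)                    # z^i = U_reduce^T x^i, stacked as rows
    return Z, U_reduce, S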
Dimensionality Reduction - PCA algorithm
Quiz
In PCA, we obtain $z \in \mathbb{R}^k$ from $x \in \mathbb{R}^n$ as follows:
$z^i = U_{reduce}^T \, x^i$
Which of the following is a correct expression for $z^i_j$?
$z^i_j = (u_k)^T x^i$
$z^i_j = (u_j)^T x^i_j$
$z^i_j = (u_j)^T x^i_k$
$z^i_j = (u_j)^T x^i$
credit Andrew Ng - Stanford Univ
Dimensionality Reduction - PCA algorithm
Choosing the number of components k
Average squared projection error: $\mathrm{error}_{projection} = \frac{1}{m}\sum_{i=1}^{m} ||x^i - x^i_{approx}||^2$, where $x^i_{approx} = U_{reduce}\, z^i$ is the projection of $x^i$.
Total variation in the data: $\frac{1}{m}\sum_{i=1}^{m} ||x^i||^2$ (on average, how far the m examples are from the origin).
Choose the smallest value of k such that the ratio between the average squared projection error and the variation is small:
$\frac{\mathrm{error}_{projection}}{\mathrm{variation}} \le 0.01$
"99% of variance retained"
credit Andrew Ng - Stanford Univ
Dimensionality Reduction - PCA algorithm
Algorithm to find k
Compute Σ,UTry PCA with k=1
Compute Ureduce, z1, . . . , zm, x1approx , . . . , xm
approxCheck if ratio is inferior to 0.01Increment k if criterion is not met, stop otherwise.
Speeding up by using U,S, V = svd(Σ). S ∈ Rn×n, diagonal matrix Sii .For a given value of k , the ratio can be computed as :
1−∑k
i=1 Sii∑ni=1 Sii
⇒ Don’t need to recompute all z and xapprox .
credit Andrew Ng - Stanford Univ
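A sketch of this speed-up: compute the singular values once, then scan for the smallest k whose retained-variance ratio reaches the target (here 99%).

import numpy as np

def choose_k(S, threshold=0.99):
    # S: singular values of Sigma (as returned by numpy.linalg.svd), in decreasing order
    # Return the smallest k retaining at least `threshold` of the variance,
    # i.e. 1 - sum_{i<=k} S_ii / sum_i S_ii <= 1 - threshold.
    retained = np.cumsum(S) / np.sum(S)
    return int(np.searchsorted(retained, threshold) + 1)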
Dimensionality Reduction - PCA algorithm
Quiz
We said that PCA chooses k directions $u_1, \dots, u_k$ onto which to project the data so as to minimize the squared projection error. Another way to say the same thing is that PCA tries to minimize:
$\frac{1}{m}\sum_{i=1}^{m} ||x^i||^2$
$\frac{1}{m}\sum_{i=1}^{m} ||x^i_{approx}||^2$
$\frac{1}{m}\sum_{i=1}^{m} ||x^i - x^i_{approx}||^2$
$\frac{1}{m}\sum_{i=1}^{m} ||x^i + x^i_{approx}||^2$
credit Andrew Ng - Stanford Univ
Dimensionality Reduction - PCA algorithm
Tips on PCA
Dataset $(x^1, y^1), \dots, (x^m, y^m)$, with $x^i \in \mathbb{R}^{10000}$
↪ $(z^1, y^1), \dots, (z^m, y^m)$ with $z^i \in \mathbb{R}^{1000}$
Run PCA on the training dataset only (compute $U_{reduce}$, find k, etc.), then apply the same projection to train, validation and test.
Fewer dimensions → less memory, and subsequent models with fewer parameters (e.g. neural networks).
For visualization: k = 2 or 3.
Don't
Don't use PCA to prevent overfitting; use regularization instead.
Don't run PCA before even testing on your raw data!
credit Andrew Ng - Stanford Univ
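A sketch of the "fit on train only" rule with scikit-learn (an assumption; the course works with raw numpy). X_train, X_val, X_test stand for the three splits and k for the number of components chosen on the training set.

from sklearn.decomposition import PCA

pca = PCA(n_components=k)          # k chosen on the training set (e.g. 99% variance retained)
Z_train = pca.fit_transform(X_train)
Z_val = pca.transform(X_val)       # reuse the training-set mean and U_reduce
Z_test = pca.transform(X_test)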
Data Compression
PCA: orthogonal basis → no redundancy
Images: made of little « basic » parts.
Can we learn a dictionary of such patches to represent the data?
(credit N.Baskiotis)
Matrix Factorization
(credit P.Gallinari)
Idea
Project data vectors into a latent space of dimension k < m, the size of the original space.
Axes in this latent space represent a new basis for data representation.
Each original data vector will be approximated as a linear combination of k basis vectors in this new space.
Matrix Factorization
(credit P.Gallinari)
[Figure: the data matrix X factorized as the product of two matrices, X ≈ UV]
Matrix Factorization
(credit P.Gallinari)
Applications:
Recommendation (User × Item matrix)
Matrix completion
Link prediction (adjacency matrix)
…
Matrix Factorization
(credit P.Gallinari)
[Figure: a column $x_{.j}$ of the original data X is expressed as a linear combination of the basis vectors (dictionary) $u_{.1}, u_{.2}, u_{.3}$, with coefficients given by the corresponding column $v_{.j}$ of the representation V]
Matrix Factorization
(credit P.Gallinari)
Interpretation
If X is a User × Item matrix:
Users and items are represented in a common representation space of size k.
Their interaction is measured by a dot product in this space.
[Figure: row $u_{i.}$ is the user representation, column $v_{.j}$ is the item representation, and the entry $x_{ij}$ of the original Users × Items matrix is approximated by their dot product]
Matrix Factorization
(credit P.Gallinari)
Interpretation
If X is a directed graph adjacency or weight matrix:
[Figure: row $u_{i.}$ is the sender representation of node i, column $v_{.j}$ is the receiver representation of node j, and the entry $x_{ij}$ of the Nodes × Nodes matrix is approximated by their dot product]
Recommender Systems
Predicting movie ratings
(credit Andrew Ng coursera)
Recommender Systems
Collaborative Filtering
(credit Andrew Ng coursera)
Data compression - Matrix Factorization
Collaborative filtering optimization objective
Minimize J with respect to $u^i$, $v^j$, with $i \in 1, \dots, n_m$ (items, e.g. movies) and $j \in 1, \dots, n_u$ (users):
$J(u^1, \dots, u^{n_m}, v^1, \dots, v^{n_u}) = \frac{1}{2}\sum_{(i,j): r(i,j)=1} (u^i \cdot v^j - y^{ij})^2 + \frac{\lambda}{2}\sum_{i=1}^{n_m}\sum_{k=1}^{n} (u^i_k)^2 + \frac{\lambda}{2}\sum_{j=1}^{n_u}\sum_{k=1}^{n} (v^j_k)^2$
Data compression - Matrix Factorization
Algorithm
Initialize $u^1, \dots, u^{n_m}, v^1, \dots, v^{n_u}$ randomly.
Minimize $J(u^1, \dots, u^{n_m}, v^1, \dots, v^{n_u})$ using gradient descent, e.g.:
$u^i_k = u^i_k - \varepsilon \left(\sum_{(i,j): r(i,j)=1} (u^i \cdot v^j - y^{ij})\, v^j_k + \lambda u^i_k\right)$
$v^j_k = v^j_k - \varepsilon \left(\sum_{(i,j): r(i,j)=1} (u^i \cdot v^j - y^{ij})\, u^i_k + \lambda v^j_k\right)$
(Here: batch → all the ratings for i or j are used.)
Predicting a rating for a user j with learned representation $v^j$ and an item i with learned representation $u^i$: $u^i \cdot v^j$.
N.B.: it is also possible to use alternating gradient descent or other optimization schemes.
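A numpy sketch of this batch gradient descent. Storing the ratings as a matrix Y with a binary mask R is an assumption about the data layout; in the notation above, $y^{ij}$ = Y[i, j] and r(i, j) = R[i, j].

import numpy as np

def cf_gradient_descent(Y, R, k=10, lam=0.1, eps=0.01, n_iters=200, rng=np.random):
    # Y: (n_m, n_u) rating matrix; R: binary mask, R[i, j] = 1 where a rating exists
    n_m, n_u = Y.shape
    U = 0.1 * rng.randn(n_m, k)          # item representations u^i (rows)
    V = 0.1 * rng.randn(n_u, k)          # user representations v^j (rows)
    for _ in range(n_iters):
        E = (U.dot(V.T) - Y) * R         # (u^i . v^j - y^ij) on observed entries only
        U_grad = E.dot(V) + lam * U
        V_grad = E.T.dot(U) + lam * V
        U -= eps * U_grad                # batch update: all observed ratings used
        V -= eps * V_grad
    return U, V

# Predicted rating of item i by user j: U[i].dot(V[j])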
Data compression - Matrix Factorization
Loss function -example
Minimize $C = ||X - UV||^2 + c(U, V)$
Constraints on U, V via c(U, V), e.g.:
Positivity (NMF - next slides)
Sparsity of the representations, e.g. $||V||_1$
Over-complete dictionary U: k > n
Symmetry
Bias on U and V (e.g. a popularity bias for item recommendation)
Any a priori knowledge on U and V
credit P.Gallinari UPMC
Data compression - Matrix Factorization
Non Negative Matrix Factorization
Minimize $C = ||X - UV||^2$ under the constraints $U, V \ge 0$.
The loss is convex in U and in V separately, but not jointly in (U, V).
The problem can be solved via a Lagrangian formulation, which gives an iterative multiplicative algorithm:
U, V initialized at random (non-negative) values
Iterate until convergence:
$u_{ij} \leftarrow u_{ij} \frac{(XV^T)_{ij}}{(UVV^T)_{ij}}$
$v_{ij} \leftarrow v_{ij} \frac{(X^T U)_{ij}}{(V^T U^T U)_{ij}}$
Or by projected gradient formulations.
The solution (U, V) is not unique: if (U, V) is a solution, then (UD, $D^{-1}$V) for any positive diagonal matrix D is also a solution.
credit P.Gallinari UPMC
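A numpy sketch of these multiplicative updates, written with V of shape k × m (the slide writes the V update in the transposed orientation; the two are equivalent). The small eps is an added numerical safeguard, not part of the slide.

import numpy as np

def nmf(X, k, n_iters=200, eps=1e-9, rng=np.random):
    # X: (n, m) non-negative data matrix, factorized as X ~ U V
    n, m = X.shape
    U = rng.rand(n, k)                     # random non-negative initialization
    V = rng.rand(k, m)
    for _ in range(n_iters):
        # Multiplicative updates (eps avoids division by zero)
        U *= X.dot(V.T) / (U.dot(V).dot(V.T) + eps)
        V *= U.T.dot(X) / (U.T.dot(U).dot(V) + eps)
    return U, V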
Data compression - Matrix Factorization
Using NMF for clustering
Normalize U as a column-stochastic matrix: each column vector has norm 1:
$u_{ij} \leftarrow \frac{u_{ij}}{\sqrt{\sum_i u_{ij}^2}}$
$v_{ij} \leftarrow v_{ij} \sqrt{\sum_i u_{ij}^2}$
Under the constraint "U normalized", the solution (U, V) is unique.
Associate $x^i$ to cluster j if $j = \arg\max_j (v_{ij})$.
credit P.Gallinari UPMC
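A sketch of this clustering rule, using the (k × m) orientation of V from the factorization sketch above, so each example is a column of V and the argmax is taken over the k components.

import numpy as np

def nmf_clusters(U, V):
    # Normalize each column of U to unit norm and rescale V accordingly,
    # then assign each example to the component with the largest coefficient.
    norms = np.sqrt((U ** 2).sum(axis=0))     # one norm per column of U
    U_norm = U / norms
    V_norm = V * norms[:, None]               # scale row j of V by the norm of column j of U
    clusters = V_norm.argmax(axis=0)          # cluster of example i = largest coefficient in its column
    return U_norm, V_norm, clusters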
Data compression - Matrix Factorization
Many different versions and extensions of NMF:
Different loss functions (e.g. different constraints)
Different algorithms
Applications:
Clustering
Link prediction
Recommendation
etc.
credit P.Gallinari UPMC