ICS Summer School 2016
Data Science - Week 4
Gabriella Contardo
LIP6, University Pierre et Marie Curie, Paris, France
August 10, 2016
Outline of the week
Course 1: Reminders of the learning paradigm, neural networks / multi-layer perceptron.
Course 2: Deep learning: Convolutional Neural Networks.
Course 3: Tips on deep learning - PCA, Matrix Factorization and Recommender systems.
Course 4: Unsupervised learning: Clustering (K-Means), EM.
Course 5: Unsupervised learning with (deep) neural networks: Auto-encoders, RNN. Word embeddings.
References
Online course material for today. Thanks to:
Patrick Gallinari - Professor at UPMC - Course ”Apprentissage Statistique”
Nicolas Baskiotis - Assistant Professor at UPMC - Course ”ARF” (Master DAC)
Fei-Fei Li's course at Stanford: CS231n: Convolutional Neural Networks for Visual Recognition (lectures: introduction to neural nets, backpropagation)
Course by Y. Bengio (MLSS 2014 slides and video available online)

Also interesting (generally speaking):
Machine Learning course by Andrew Ng on Coursera
Lectures by Nando De Freitas (Oxford) - videos available online
Outline of the day
Deep Learning: Tips on Deep Learning
Unsupervised Learning - Data Compression / Representation:
PCA
MF/NMF and recommender systems
GoogLeNet
Erratum: GoogLeNet is Google's paper, not by LeCun (he is at Facebook), but it is named as a reference/joke to his (old, first) model ”LeNet”.
”Yellow layers”: additional supervised layers added ”on the side” of the network during training. ”Later control experiments have shown that the effect of the auxiliary networks is relatively minor (around 0.5%) and that it required only one of them to achieve the same effect.” (cf. paper)
Learning time: ”a rough estimate suggests that the GoogLeNet network could be trained to convergence using few high-end GPUs within a week, the main limitation being the memory usage”.
More info in the paper: http://www.cs.unc.edu/~wliu/papers/GoogLeNet.pdf
Tips for learning deep (convolutional) networks
Data Augmentation
Modify the input (pixels) without changing the label
Train on the transformed data
(credit Fei-Fei Li's course CS231n, Stanford)
Tips for learning deep (convolutional) networks
Data Augmentation
Flip
Random crops/scales
(+sample on crop)
Color jitter (e.g. contrast)
Translation, rotation, ...
(credit Fei-Fei Li's course CS231n, Stanford)
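A minimal numpy sketch of these transformations (horizontal flip, random crop, brightness/contrast jitter). The function name and the crop/jitter parameters are illustrative choices, not from the course.

import numpy as np

def augment(img, crop_size=24, rng=np.random):
    # img: H x W x C array of pixel values in [0, 255]
    # Random horizontal flip
    if rng.rand() < 0.5:
        img = img[:, ::-1, :]
    # Random crop of size crop_size x crop_size
    h, w, _ = img.shape
    top = rng.randint(0, h - crop_size + 1)
    left = rng.randint(0, w - crop_size + 1)
    img = img[top:top + crop_size, left:left + crop_size, :]
    # Color jitter: random contrast (alpha) and brightness (beta) change
    alpha = 1.0 + 0.2 * (rng.rand() - 0.5)
    beta = 10.0 * (rng.rand() - 0.5)
    return np.clip(alpha * img + beta, 0, 255)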
Straightforward for images, but what about other types of data?
Tips for learning deep (convolutional) networks
Generally:
Training: add random noise
Testing: marginalize over the noise
(credit Fei-Fei Li's course CS231n, Stanford)
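One way to read this (a sketch, not the course's code): inject Gaussian noise on the inputs at training time, and at test time average the model's predictions over several noisy copies of the input, a Monte-Carlo approximation of marginalizing over the noise. `model.predict` here is a hypothetical prediction function.

import numpy as np

def noisy(x, sigma=0.1, rng=np.random):
    # Add zero-mean Gaussian noise to the input
    return x + sigma * rng.randn(*x.shape)

# Training: feed noisy(x) instead of x to the learning algorithm.
# Testing: marginalize over the noise by averaging predictions
# over several noisy copies of the same input.
def predict_marginalized(model, x, n_samples=20, sigma=0.1):
    preds = [model.predict(noisy(x, sigma)) for _ in range(n_samples)]
    return np.mean(preds, axis=0)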
Tips for learning deep (convolutional) networks
« You need a lot of data if you want to train/use CNN »
→ Nope ! (well, not necessarily…) « Transfer » learning
(credit Fei-Fei Li's course CS231n, Stanford)
Tips for learning deep (convolutional) networks
Using pre-trained CNNs is the norm.
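A hedged sketch of the small-dataset case: treat a pre-trained CNN as a fixed feature extractor and train a simple classifier on top. The use of Keras and its ImageNet-pretrained VGG16 is an assumption (any pre-trained network would do); X_train, y_train, X_test, y_test stand for a small labelled image dataset.

# Assumes the Keras library with its bundled ImageNet-pretrained VGG16 weights.
import numpy as np
from keras.applications.vgg16 import VGG16, preprocess_input
from sklearn.linear_model import LogisticRegression

base = VGG16(weights='imagenet', include_top=False)   # drop the classifier layers

def cnn_features(images):
    # images: (n, 224, 224, 3) float array
    acts = base.predict(preprocess_input(images))      # frozen conv activations
    return acts.reshape(len(images), -1)               # flatten to (n, d)

# Small labelled dataset: train a simple classifier on the frozen CNN features
# (alternatively, fine-tune only the last layers of the network on the new task).
clf = LogisticRegression().fit(cnn_features(X_train), y_train)
print(clf.score(cnn_features(X_test), y_test))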
Dimensionality Reduction
Problem
More dimensions ⇒ more expressivity.
Too many dimensions: low variance on a dimension, noise, takes memory space and time...
For tasks on images, text and others: too many dimensions → dimensions that are not very informative, highly correlated.
⇒ Reduce the number of dimensions, but without losing information.
Goal: find a projection $\Phi : \mathbb{R}^d \rightarrow \mathbb{R}^{d'}$ with $d' \ll d$.
Applications:
Learning
Visualization
Noise reduction
credit N.Baskiotis UPMC
Data Compression
(credit Andrew Ng coursera)
Reduce data from 2D to 1D
Data Compression
(credit Andrew Ng coursera)
Reduce data from 3D to 2D
[Figure: original dataset in 3D and projected dataset]
Visualization of the projected dataset in 2D: $z = [z_1, z_2]$, with $z^i = [z^i_1, z^i_2]$.
Data Compression
(credit http://setosa.io/ev/principal-component-analysis/)
Motivation : Visualization
[Figures: the 17D dataset projected to 1D and to 2D]
Dimensionality Reduction
Quiz! Suppose we apply dimensionality reduction to a dataset of m examples $\{x^1, \dots, x^m\}$ where $x^i \in \mathbb{R}^n$. As a result, we will get out:
A lower-dimensional dataset $\{z^1, \dots, z^k\}$ of k examples where $k \le n$
A lower-dimensional dataset $\{z^1, \dots, z^k\}$ of k examples where $k > n$
A lower-dimensional dataset $\{z^1, \dots, z^m\}$ of m examples where $z^i \in \mathbb{R}^k$ for some value of k and $k \le n$
A lower-dimensional dataset $\{z^1, \dots, z^m\}$ of m examples where $z^i \in \mathbb{R}^k$ for some value of k and $k > n$
credit Andrew Ng - Stanford Univ
Data Compression - PCA
Principal Component Analysis (PCA) : Problem formulation
⇒ Goal: find a line onto which to project the data
(credit Andrew Ng coursera)
Dimensionality Reduction - PCA
Formulation
Reduce from 2 dimensions to 1 dimension: find a direction (a vector $u_1 \in \mathbb{R}^n$) onto which to project the data so as to minimize the projection error.
Generic case: reduce from n dimensions to k dimensions: find k vectors $u_1, \dots, u_k$ (directions) onto which to project the data so as to minimize the projection error.
credit Andrew Ng - Stanford Univ
Data Compression - PCA
PCA is not linear regression
(credit Andrew Ng coursera)
Data Compression - PCA
Quiz: Suppose you run PCA on the following dataset. Which of the following would be a reasonable vector $u_1$ onto which to project the data? (Usually $||u_1|| = 1$.)
(credit Andrew Ng coursera)
u1 = [ 1 0 ]
u1 = [ 0 1 ]
u1 = [ 1/√2 1/√2 ]
u1 = [ -1/√2 1/√2 ]
Dimensionality Reduction - PCA algorithm
Data preprocessing
Training set x1, . . . , xm
Preprocessing: feature scaling / mean normalization:
Compute $\mu_j = \frac{1}{m}\sum_{i=1}^{m} x^i_j$.
Replace each $x^i_j$ with $x^i_j - \mu_j$: each feature now has zero mean.
If different features are on different scales, rescale them to have a comparable range of values, e.g. $x^i_j = \frac{x^i_j - \min(x_j)}{\max(x_j) - \min(x_j)}$.
To do also in supervised learning!
credit Andrew Ng - Stanford Univ
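A minimal numpy sketch of this preprocessing (the division by max - min is the rescaling given above; dividing by the standard deviation is an equally common choice).

import numpy as np

def preprocess(X):
    # X: (m, n) array, one example per row
    mu = X.mean(axis=0)                     # mu_j = (1/m) sum_i x^i_j
    Xc = X - mu                             # each feature now has zero mean
    # If features live on very different scales, rescale them to a
    # comparable range (here: divide by max - min of each feature).
    span = X.max(axis=0) - X.min(axis=0)
    span = np.where(span == 0, 1.0, span)   # avoid dividing by zero for constant features
    return Xc / span, mu, span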
Dimensionality Reduction - PCA algorithm
Reduce data from n dimensions to k dimensions.
Compute the covariance matrix:
$\Sigma = \frac{1}{m}\sum_{i=1}^{m} (x^i)(x^i)^T$
(Python: numpy.cov(x) if x is your matrix of examples with examples in columns, i.e. $x \in \mathbb{R}^{n \times m}$.)
Compute the eigenvectors of the matrix $\Sigma$:
U, S, V = numpy.linalg.svd(Σ)
W, V = numpy.linalg.eig(Σ)
The matrix U (or V if using eig) is in $\mathbb{R}^{n \times n}$.
Take the first k columns of U → $U_{reduce}$.
Finding the new representation $z^i$ for an example $x^i$:
$z^i = U_{reduce}^T \, x^i$
credit Andrew Ng - Stanford Univ
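A compact numpy sketch of the algorithm above. It assumes X is already mean-normalized and stored with one example per row (so the orientation is transposed with respect to the column convention used on the slide).

import numpy as np

def pca(X, k):
    # X: (m, n) array of mean-normalized examples (one per row)
    m = X.shape[0]
    Sigma = X.T.dot(X) / m                 # covariance matrix, n x n
    U, S, Vt = np.linalg.svd(Sigma)        # columns of U are the principal directions
    U_reduce = U[:, :k]                    # keep the first k directions
    Z = X.dot(U_reduce)                    # z^i = U_reduce^T x^i, stacked as rows
    return Z, U_reduce, S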
Dimensionality Reduction - PCA algorithm
Quiz
In PCA, we obtain $z \in \mathbb{R}^k$ from $x \in \mathbb{R}^n$ as follows:
$z^i = U_{reduce}^T \, x^i$
Which of the following is a correct expression for $z^i_j$?
$z^i_j = (u_k)^T x^i$
$z^i_j = (u_j)^T x^i_j$
$z^i_j = (u_j)^T x^i_k$
$z^i_j = (u_j)^T x^i$
credit Andrew Ng - Stanford Univ
Dimensionality Reduction - PCA algorithm
Choosing the number of components k
Average squared projection error: $\mathrm{error}_{projection} = \frac{1}{m}\sum_{i=1}^{m} ||x^i - x^i_{approx}||^2$, where $x^i_{approx} = U_{reduce}\, z^i$ is the projection of $x^i$.
Total variation in the data: $\frac{1}{m}\sum_{i=1}^{m} ||x^i||^2$ (on average, how far the m examples are from the origin).
Choose the smallest value of k such that the ratio between the average squared projection error and the variation is small:
$\frac{\mathrm{error}_{projection}}{\mathrm{variation}} \le 0.01$
"99% of variance retained"
credit Andrew Ng - Stanford Univ
Dimensionality Reduction - PCA algorithm
Algorithm to find k
Compute Σ,UTry PCA with k=1
Compute Ureduce, z1, . . . , zm, x1approx , . . . , xm
approxCheck if ratio is inferior to 0.01Increment k if criterion is not met, stop otherwise.
Speeding up by using U,S, V = svd(Σ). S ∈ Rn×n, diagonal matrix Sii .For a given value of k , the ratio can be computed as :
1−∑k
i=1 Sii∑ni=1 Sii
⇒ Don’t need to recompute all z and xapprox .
credit Andrew Ng - Stanford Univ
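A sketch of this speed-up: compute the singular values once, then scan for the smallest k whose retained-variance ratio reaches the target (here 99%).

import numpy as np

def choose_k(S, threshold=0.99):
    # S: singular values of Sigma (as returned by numpy.linalg.svd), in decreasing order
    # Return the smallest k retaining at least `threshold` of the variance,
    # i.e. 1 - sum_{i<=k} S_ii / sum_i S_ii <= 1 - threshold.
    retained = np.cumsum(S) / np.sum(S)
    return int(np.searchsorted(retained, threshold) + 1)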
Dimensionality Reduction - PCA algorithm
Quiz
We said that PCA chooses k directions $u_1, \dots, u_k$ onto which to project the data so as to minimize the squared projection error. Another way to say the same thing is that PCA tries to minimize:
$\frac{1}{m}\sum_{i=1}^{m} ||x^i||^2$
$\frac{1}{m}\sum_{i=1}^{m} ||x^i_{approx}||^2$
$\frac{1}{m}\sum_{i=1}^{m} ||x^i - x^i_{approx}||^2$
$\frac{1}{m}\sum_{i=1}^{m} ||x^i + x^i_{approx}||^2$
credit Andrew Ng - Stanford Univ
Dimensionality Reduction - PCA algorithm
Tips on PCA
Dataset $(x^1, y^1), \dots, (x^m, y^m)$, with $x^i \in \mathbb{R}^{10000}$
↪ $(z^1, y^1), \dots, (z^m, y^m)$ with $z^i \in \mathbb{R}^{1000}$
Run PCA on the training dataset only (compute $U_{reduce}$, find k, etc.), then apply the same projection to train, validation and test.
Fewer dimensions → less memory, and subsequent models with fewer parameters (e.g. neural networks).
For visualization: k = 2 or 3.
Don't
Don't use PCA to prevent overfitting; use regularization instead.
Don't run PCA before even testing on your raw data!
credit Andrew Ng - Stanford Univ
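A sketch of the "fit on train only" rule with scikit-learn (an assumption; the course works with raw numpy). X_train, X_val, X_test stand for the three splits and k for the number of components chosen on the training set.

from sklearn.decomposition import PCA

pca = PCA(n_components=k)          # k chosen on the training set (e.g. 99% variance retained)
Z_train = pca.fit_transform(X_train)
Z_val = pca.transform(X_val)       # reuse the training-set mean and U_reduce
Z_test = pca.transform(X_test)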
Data Compression
PCA: orthogonal basis → no redundancy
Images: made of little « basic » parts.
Can we learn a dictionary of such patches to represent the data?
(credit N.Baskiotis)
Matrix Factorization
(credit P.Gallinari)
Idea
Project data vectors into a latent space of dimension k < m, the size of the original space.
Axes in this latent space represent a new basis for data representation.
Each original data vector will be approximated as a linear combination of k basis vectors in this new space.
Matrix Factorization
(credit P.Gallinari)
[Figure: the data matrix X factorized as the product of two matrices, X ≈ UV]
Matrix Factorization
(credit P.Gallinari)
Applications:
Recommendation (User × Item matrix)
Matrix completion
Link prediction (adjacency matrix)
…
Matrix Factorization
(credit P.Gallinari)
[Figure: a column $x_{.j}$ of the original data X is expressed as a linear combination of the basis vectors (dictionary) $u_{.1}, u_{.2}, u_{.3}$, with coefficients given by the corresponding column $v_{.j}$ of the representation V]
Matrix Factorization
(credit P.Gallinari)
Interpretation
If X is a User × Item matrix:
Users and items are represented in a common representation space of size k.
Their interaction is measured by a dot product in this space.
[Figure: row $u_{i.}$ is the user representation, column $v_{.j}$ is the item representation, and the entry $x_{ij}$ of the original Users × Items matrix is approximated by their dot product]
Matrix Factorization
(credit P.Gallinari)
Interpretation
If X is a directed graph adjacency or weight matrix:
[Figure: row $u_{i.}$ is the sender representation of node i, column $v_{.j}$ is the receiver representation of node j, and the entry $x_{ij}$ of the Nodes × Nodes matrix is approximated by their dot product]
Recommender Systems
Predicting movie ratings
(credit Andrew Ng coursera)
Recommender Systems
Collaborative Filtering
(credit Andrew Ng coursera)
Data compression - Matrix Factorization
Collaborative filtering optimization objective
Minimize J with respect to $u^i$, $v^j$, with $i \in 1, \dots, n_m$ (items, e.g. movies) and $j \in 1, \dots, n_u$ (users):
$J(u^1, \dots, u^{n_m}, v^1, \dots, v^{n_u}) = \frac{1}{2}\sum_{(i,j): r(i,j)=1} (u^i \cdot v^j - y^{ij})^2 + \frac{\lambda}{2}\sum_{i=1}^{n_m}\sum_{k=1}^{n} (u^i_k)^2 + \frac{\lambda}{2}\sum_{j=1}^{n_u}\sum_{k=1}^{n} (v^j_k)^2$
Data compression - Matrix Factorization
Algorithm
Initialize $u^1, \dots, u^{n_m}, v^1, \dots, v^{n_u}$ randomly.
Minimize $J(u^1, \dots, u^{n_m}, v^1, \dots, v^{n_u})$ using gradient descent, e.g.:
$u^i_k = u^i_k - \varepsilon \left(\sum_{(i,j): r(i,j)=1} (u^i \cdot v^j - y^{ij})\, v^j_k + \lambda u^i_k\right)$
$v^j_k = v^j_k - \varepsilon \left(\sum_{(i,j): r(i,j)=1} (u^i \cdot v^j - y^{ij})\, u^i_k + \lambda v^j_k\right)$
(Here: batch → all the ratings for i or j are used.)
Predicting a rating for a user j with learned representation $v^j$ and an item i with learned representation $u^i$: $u^i \cdot v^j$.
N.B.: it is also possible to use alternating gradient descent or other optimization schemes.
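A numpy sketch of this batch gradient descent. Storing the ratings as a matrix Y with a binary mask R is an assumption about the data layout; in the notation above, $y^{ij}$ = Y[i, j] and r(i, j) = R[i, j].

import numpy as np

def cf_gradient_descent(Y, R, k=10, lam=0.1, eps=0.01, n_iters=200, rng=np.random):
    # Y: (n_m, n_u) rating matrix; R: binary mask, R[i, j] = 1 where a rating exists
    n_m, n_u = Y.shape
    U = 0.1 * rng.randn(n_m, k)          # item representations u^i (rows)
    V = 0.1 * rng.randn(n_u, k)          # user representations v^j (rows)
    for _ in range(n_iters):
        E = (U.dot(V.T) - Y) * R         # (u^i . v^j - y^ij) on observed entries only
        U_grad = E.dot(V) + lam * U
        V_grad = E.T.dot(U) + lam * V
        U -= eps * U_grad                # batch update: all observed ratings used
        V -= eps * V_grad
    return U, V

# Predicted rating of item i by user j: U[i].dot(V[j])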
Data compression - Matrix Factorization
Loss function -example
Minimize $C = ||X - UV||^2 + c(U, V)$
Constraints on U, V via c(U, V), e.g.:
Positivity (NMF - next slides)
Sparsity of the representations, e.g. $||V||_1$
Over-complete dictionary U: k > n
Symmetry
Bias on U and V (e.g. a popularity bias for item recommendation)
Any a priori knowledge on U and V
credit P.Gallinari UPMC
Data compression - Matrix Factorization
Non Negative Matrix Factorization
Minimize $C = ||X - UV||^2$ under the constraints $U, V \ge 0$.
The loss is convex in U and in V separately, but not jointly in (U, V).
The problem can be solved via a Lagrangian formulation, which gives an iterative multiplicative algorithm:
U, V initialized at random (non-negative) values
Iterate until convergence:
$u_{ij} \leftarrow u_{ij} \frac{(XV^T)_{ij}}{(UVV^T)_{ij}}$
$v_{ij} \leftarrow v_{ij} \frac{(X^T U)_{ij}}{(V^T U^T U)_{ij}}$
Or by projected gradient formulations.
The solution (U, V) is not unique: if (U, V) is a solution, then (UD, $D^{-1}$V) for any positive diagonal matrix D is also a solution.
credit P.Gallinari UPMC
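A numpy sketch of these multiplicative updates, written with V of shape k × m (the slide writes the V update in the transposed orientation; the two are equivalent). The small eps is an added numerical safeguard, not part of the slide.

import numpy as np

def nmf(X, k, n_iters=200, eps=1e-9, rng=np.random):
    # X: (n, m) non-negative data matrix, factorized as X ~ U V
    n, m = X.shape
    U = rng.rand(n, k)                     # random non-negative initialization
    V = rng.rand(k, m)
    for _ in range(n_iters):
        # Multiplicative updates (eps avoids division by zero)
        U *= X.dot(V.T) / (U.dot(V).dot(V.T) + eps)
        V *= U.T.dot(X) / (U.T.dot(U).dot(V) + eps)
    return U, V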
Data compression - Matrix Factorization
Using NMF for clustering
Normalize U as a column-stochastic matrix: each column vector has norm 1:
$u_{ij} \leftarrow \frac{u_{ij}}{\sqrt{\sum_i u_{ij}^2}}$
$v_{ij} \leftarrow v_{ij} \sqrt{\sum_i u_{ij}^2}$
Under the constraint "U normalized", the solution (U, V) is unique.
Associate $x^i$ to cluster j if $j = \arg\max_j (v_{ij})$.
credit P.Gallinari UPMC
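A sketch of this clustering rule, using the (k × m) orientation of V from the factorization sketch above, so each example is a column of V and the argmax is taken over the k components.

import numpy as np

def nmf_clusters(U, V):
    # Normalize each column of U to unit norm and rescale V accordingly,
    # then assign each example to the component with the largest coefficient.
    norms = np.sqrt((U ** 2).sum(axis=0))     # one norm per column of U
    U_norm = U / norms
    V_norm = V * norms[:, None]               # scale row j of V by the norm of column j of U
    clusters = V_norm.argmax(axis=0)          # cluster of example i = largest coefficient in its column
    return U_norm, V_norm, clusters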
Data compression - Matrix Factorization
Many different versions and extensions of NMF:
Different loss functions (e.g. different constraints)
Different algorithms
Applications:
Clustering
Link prediction
Recommendation
etc.
credit P.Gallinari UPMC