Deep Learning for Data Mining

University of Rome "La Sapienza"Dep. of Computer, Control and Management Engineering A. Ruberti

Valsamis Ntouskos, ALCOR Lab

Outline

• Introduction - Motivation

• Theoretical aspects

• Convolutional Neural Networks

• Recurrent Neural Networks

• Generative models

• Further directions

13/12/2018

Deep Learning with CNNs

Compositional Models

Learned End-to-End

Hierarchy of Representations- vision: pixel, motif, part, object

- text: character, word, clause, sentence

- speech: audio, band, phone, word

concrete abstractlearning

Slides from Caffe framework tutorial @ CVPR2015

13/12/2018

Deep Learning with CNNs

Compositional Models

Learned End-to-End

Back-propagation jointly learns

all of the model parameters to

optimize the output for the task.

13/12/2018

Motivation of CNNs

Inputs usually treated as general feature vectors

In some cases inputs have special structure:• Audio• Images• Videos

Signals: Numerical representations of physical quantities

Deep learning can be directly applied on signals by using suitable operators

13/12/2018

Motivation

. . . 0.0468 0.0468 0.0468 0.0390 0.0390 0.0390 0.0546 0.0625 0.0625 0.0390 0.0312 0.0468 0.0625 . . .

1D data - (variable length) vectors

13/12/2018

Motivation

Images

A sequence of images sampled through time - 3D data

2D data - matrices

13/12/2018

What is a CNN?

Deep Learning for Data Mining 13/12/2018

Some theory

13/12/2018Deep Learning for Data Mining

Convolution

• Image filtering is

based on convolution

with special kernels

Some theory

Pooling

• Introduces subsampling

13/12/2018

Some theory

Activation

Standard way to model a neuron

f(x) = tanh(x) or f(x) = (1 + e-x)-1

Very slow to train (saturation)

Non-saturating nonlinearity (RELU)f(x) = max(0, x)Quick to train

13/12/2018

Some theory

Regularization

Dropout

• Applied on the fully-connected layers

• During training prune nodes with probability α

• During testing nodes are weighed by α

Image from Srivastava et al.. "Dropout: A Simple Way to Prevent Neural Networks from Overfitting"

13/12/2018

Some theory

Every convolutional layer of a CNN transforms the 3D input

volume to a 3D output volume of neuron activations.

A regular 3-layer Neural Network

Material from Fei-Fei’s group

13/12/2018

Some theory

Each neuron is connected to a

local region in the input volume

spatially, but to all channels

The neurons still compute a dot

product of their weights with the

input followed by a non-linearity

13/12/2018

Algorithms

• Each* neuron/layer is differentiable!

• Just apply backpropagation (chain-rule)

• Use standard gradient-based optimization algorithms

(SGD, AdaGrad, …)

• The devil lies in the details though …

▪Choosing hyperparameters / loss-function

▪Exploding/Vanishing gradients – batch normalization

▪Overfitting – Regularization

▪Cost of performing experiments

▪Convergence

▪…

*what about max-pooling?

13/12/2018

Kernels and

Feature maps

13/12/2018

Case study: image classification

13/12/2018

Case study: image classification

13/12/2018

Brief history of CNNs

Foundational work done in the middle of the 1900s

• 1940s-1960s: Cybernetics [McCulloch and Pitts 1943,

Hebb 1949, Rosenblatt 1958]

• 1980s-mid 1990s: Connectionism [Rumelhart 1986,

Hinton 1989]

• 1990s: modern convolutional networks [LeCun et al.

1998], LSTM [Hochreiter & Schmidhuber 1997,

MNIST and other large datasets]

Hubel & Wiesel [60s] Simple & Complex cells architecture:

Hubel & Wiesel [60s] Simple & Complex cells architecture Fukushima’s Neocognitron [70s]

Yann LeCun’s Early CNNs [80s]:

13/12/2018

Convolutional Networks: 1989

LeNet: a layered model composed of convolution and subsampling operations followed

by a holistic representation and ultimately a classifier for handwritten digits. [ LeNet ]

13/12/2018

Recent success

• Parallel Computation (GPU)

• Larger training sets

• International Competitions

• Theoretical advancements

– Dropout

– ReLUs

– Batch Normalization

Recent success

CUDA Jetson TX1, TK1

Android lib, demo

OpenCL branch

Better Hardware – GPUs

13/12/2018

Recent success

ImageNet

• Over 15M labeled high resolution images

• Roughly 22K categories

• Collected from web and labeled by Amazon Mechanical Turk

Larger training sets

13/12/2018

Recent success

ILSVRC

• Annual competition of image classification at large scale

• 1.2M images in 1K categories

• Classification: make 5 guesses about the image label

Competitions

EntleBucher Appenzeller

13/12/2018

CNNs in Computer Vision

• Image classification

13/12/2018

Evolution of CNNs for image classification

AlexNet: a layered model composed of convolution, subsampling, and further operations

followed by a holistic representation and all-in-all a landmark classifier on

ILSVRC12. [ AlexNet ]

Convolutional Nets: 2012

AlexNet

13/12/2018

ILSVRC14 Winners: ~6.6% Top-5 error

- GoogLeNet: composition of multi-scale

dimension-reduced modules

+ depth

+ data

+ dimensionality reduction

13/12/2018

ILSVRC14 Winners: ~6.6% Top-5 error

- VGG: 16 layers of 3x3 convolution

interleaved with max pooling +

3 fully-connected layers

+ depth

+ data

+ dimensionality reduction

13/12/2018

ResNet

ILSVRC15 Winner: ~3.6% Top-5 error

Intuition: Easier to learn zero than identity function

13/12/2018

Reasonable questions

• Is this just for a particular dataset? – No!

Slides from ICCV 2015 Math of Deep Learning tutorial

13/12/2018

Object Localization[R-CNN, HyperColumns, Overfeat, etc.]

Pose estimation [Thomson et al, CVPR’15]

• Is this just for a particular task? – No!

13/12/2018

Semantic Segmentation[Pinhero, Collobert, Dollar, ICCV’15]

• Is this just for a particular task? – No!

13/12/2018

Fine Tuning

Dogs vs. Cats

top 10 in 10 minutes

Take a pre-trained model and fine-tune to new tasks [DeCAF] [Zeiler-Fergus] [OverFeat]

Your Task

RecognitionLots of Data

ImageNet

13/12/2018

Dealing with sequences

• data points are related

• order is important!

(C)NNs disregard this information

Recurrent Neural Networks (RNNs) address this problem by introducing cells with loops

• allow information to persist

• the value of output 𝐡𝑡 depends both on 𝐱𝑡 and the previous output 𝐡𝑡−1

Material from Colah’s blog

Another way to see RNNs is to unfold the loop

• each cell dedicated to a different data sample

• output 𝐡𝑡 computed based also on 𝐡𝑡−1

RNN structure

Material from Colah’s blog

RNNs (and variants) are very effective in various domains:

• Speech recognition

• Translation

• Image/Video captioning

• Video summarization

• Activity recognition

• ...

a) "The clouds are in the _."

ProblemRNNs cannot capture long-term dependencies• so-called vanishing/exploding gradients

problem

b) "I grew up in France... I speak fluently _."

Long Short Term Memories (LSTMs)

• RNNs with special structure

• designed to capture long-term dependencies

Hochreiter & Schmidhuber (1997) Long short-term memory

Main idea: Separate cell output 𝐡𝑡 and cell state 𝐶𝑡• state is modified through a series of

structures called gates

LSTMs – step by step

1st gate: forget mechanism

• drop elements of 𝐶𝑡−1 based on values of 𝐡𝑡−1and input 𝐱𝑡

2nd and 3rd gate: update mechanism

i. decide which elements of 𝐶𝑡−1 to update

ii. compute the update vector ሚ𝐶𝑡

2nd and 3rd gate: update mechanism

iii. update elements selected via 𝑖𝑡 with values computed in ሚ𝐶𝑡

Last step: compute output

• Based on 𝐶𝑡 , 𝐡𝑡−1, 𝐱𝑡

LSTM variants 1/3

Add peephole connections

All layers “see” the state 𝐶𝑡−1

Gers & Schmidhuber (2000) Recurrent nets that time and count

LSTM variants 2/3

Coupled input and forget gates:

• only forget something if you are going to input something else

LSTM variants 3/3

Gated Recurrent Units (GRUs)

• no cell state! - directly operate on output 𝐡𝑡• just two gates instead of three

Yao et al. (2015) Depth-gated recurrent neural networks

Further improvements:

• Further modifications of gates/structure

• Bi-directional LSTMs

• Attention mechanisms

• ...

LSTMs – typical use cases

One-to-one – classical NNs (no RNN)

One-to-many – sequence output (e.g. image captioning)

Many-to-one – sequence input (e.g. sentiment analysis)

Many-to-many – sequence to sequence (e.g. machine translation)

Many-to-many synced – e.g. frame per frame video analysis

LSTMs – use case examples

Visual Sequence Tasks

Jeff Donahue et al. CVPR’15

Based on Long short-term memory (LSTM) layers

13/12/2018

Using LSTMs

Some videos:

- Facebook AI

- Music classification

- Action recognition SecondHands project

Beyond Regression/Classification

Generative models

• Variational Auto-Encoders (VAEs)

– focus on learning latent space structure

• Generative Adversarial Networks (GANs)

– focus on learning a distribution

– no latent space (in general)

Sample from the input data distribution 𝒳

Invert a (Convolutional) Neural Network

Encoder (conventional CNN)

• processes an image and produces a vector/code

Decoder

• receives a code and produces an image

• uses “deconvolutional” layers

Material from Frans’ blog

Problem

How to train the decoder to produce meaningful data?

Adversarial Training

GAN is a combination of two networks

1. a generator network (decoder)

2. a discriminator network (critic)

• Generator: produces realistic samples of 𝒳

• Discriminator: identifies if a sample actually comes from the (unknown) 𝒳 or not

Training

make the networks compete with each other

• generator tries to fool the discriminator in believing that the sample is ‘real’

• discriminator tries to discriminate as good as possible ‘real’ from ‘fake’ samples

Example on CIFAR dataset (32x32 images)

Are these real images or generated?

Results at 300, 900 and 5700 iterations

Example: Celebrity faces (1024x1024 images)

• modify data in specific directions

• identify meaningful directions in latent space

Examples

- ‘distort’ faces (change expression, add glasses)

- produce digits from different hand-writing styles

- distort 3D meshes

- ... 13/12/2018Deep Learning for Data Mining

What is an auto-encoder?

- a combination of an encoder and a decoder

- trained based on reconstruction loss

- provides low-dimensional representation

Material from Kurita’s blog

Feed vectors and get realistic samples of 𝒳

AE problem: we don’t know if the vectors are valid or not

Main idea:

Encoder produces a distribution instead of a vector

Decoder operates on samples from this distribution

Problems

1. How to produce a distribution?

2. How to prevent degeneration?

Solutions

1. parametric distributions – typically Gaussian

– produce mean 𝜇 and variance Σ

2. add loss term based on Kullback-Leiblerdivergence

One more problem:

- Sampling operation is not differentiable

Solution:

- re-parametrization

Images from Keita Kurita’s blog

Example: faces (Hou et al., 2016)

Example: 3D Mesh deformation (Tan et al., 2018)

Further directions

Few-shot and zero-shot learning

• meta-learner architectures

• Prototypical networks

Further directions

One –show learning example: Mullet Zero–show learning example: Zebra (from horse)

Further directions

Learning on manifolds

• learning functions defined on surfaces

Further directions

• learning functions defined on graphs

Further directions

• learning functions defined on graphs

Learning with constraints

• Frank-Wolf networks

Resources

Frameworks:

• TensorFlow (Google) | C/C++, Python, Java, Go

• Torch/PyTorch (Facebook) | Lua/Python

• Caffe/Caffe 2 (UC Berkeley) | C/C++, Python, Matlab

• Theano (U Montreal) | Python

• CNTK (Microsoft) | Python, C++ , C#/.Net, Java

• MxNet (DMLC) | Python, C++, R, Perl, …

• Darknet (Redmon J.) | C

• …

Resources

High-level libraries:

• Keras | Backends: TensorFlow (TF), Theano

Models:

• Depends on the framework, e.g.

– https://github.com/BVLC/caffe/wiki/Model-Zoo (Caffe)

– https://github.com/tensorflow/models/tree/master/research (TF)

Interactive Interfaces:

• DIGITS (NVIDIA) | Caffe, TF, Torch

• TensorBoard (TF)

Tools:

• http://ethereon.github.io/netscope (for networks defined in protobuf )

Thank you!

Deep Learning for Data Mining - Aris Anagnostopoulos -...

Transcript of Deep Learning for Data Mining - Aris Anagnostopoulos -...

Deep Learning for Data Mining - Aris Anagnostopoulos -...

Documents

Transcript of Deep Learning for Data Mining - Aris Anagnostopoulos -...

Statistical Data Mining€¦ · 3 Data Mining Data (re-design and maintain existing database) Mining (Analysis) -- our focus Statistical Data Mining What is Data Mining? Data mining

Data Mining By: Thai Hoa Nguyen Pham. Data Mining Define Data Mining Classification Association Clustering.

DATA MINING AND ANALYSIS - doc.lagout.org Mining/Data Mining and Analysis... · DATA MINING AND ANALYSIS The fundamental algorithms in data mining and analysis form the basis ...

Data Mining vs. Statistics Pavel Brusilovsky. 2 Objectives 2 Intro to Data Mining Data Mining vs. Statistics Data Mining vs. Text Mining Applications.

Data Mining-Graph Mining

Data Mining - The opportunity lies with Data!Data mining

CS590D: Data Mining Chris Clifton - Purdue University · Data Mining: Classification Schemes • General functionality – Descriptive data mining – Predictive data mining ... –

Applied Data Mining - Lagout Mining/Applied Data Mining [Xu... · The book reviews applied data mining from theoretical basis to ... data mining application arenas, ... research topics

What is Data Mining? Data Mining Motivation Data Mining Applications Applications of Data Mining in CRM Data Mining Taxonomy Data Mining Techniques.

Introduction to Data Mining and Knowledge Discovery · PDF fileIntroduction to Data Mining and ... Introduction to Data Mining and Knowledge Discovery, ... Data mining and data warehousing

EE3J2 Data Mining EE3J2 Data Mining

UNIT - I Data Mining. UNIT - I Introduction : Fundamentals of data mining, Data Mining Functionalities, Classification of Data Mining systems, Major issues.

Data Mining Lecture 1: Introduction to Data Mining

Data Mining: Data

Web Mining – Data Mining im Internet Mining – Data Mining im Internet Vorlesung SS 2014 ... Web Mining is Data Mining for Data on the World-Wide Web Text Mining: Application of

MINING EDUCATIONAL DATA USING DATA MINING … · Educational data mining includes machine learning and data mining techniques. Data related to ... , density based, grid based, model

Massive Data Analytics Data Mining Introductiondisi.unitn.it/.../MassiveDataAnalytics/slides/DataMiningIntro-2in1.pdfMassive Data Analytics Data Mining Introduction ... Data mining,

FROM DATA MINING TO KNOWLEDGE MINING: SYMBOLIC DATA ...

Web Mining – Data Mining im Internet Mining – Data Mining im Internet Vorlesung SS 2010 ... Web Mining is Data Mining for Data on the World-Wide Web Text Mining: Application of

1 Chapter 1. Introduction Motivation: Why data mining? What is data mining? Data Mining: On what kind of data? Data mining functionality Major issues in.