Deep learning in Computer Vision


Transcript of Deep learning in Computer Vision

Page 1: Deep learning in Computer Vision

Deep Learning in Computer Vision

Deep Learning in Action, 24 June '15

Caner Hazırbaş

Page 2: Deep learning in Computer Vision


Computer Vision Group

2

6 Postdocs, 16 PhD students

Page 3: Deep learning in Computer Vision


Research in Computer Vision

3

Image-based 3D Reconstruction

Shape Analysis

Robot Vision

RGB-D Vision

Image Segmentation

Convex Relaxation Methods

Visual SLAM

Optical Flow

Page 4: Deep learning in Computer Vision

Deep Learning in Computer Vision

Page 5: Deep learning in Computer Vision

How to teach a machine?

5

edges (or any other hand-crafted features) → classifier → 'Person'

Page 6: Deep learning in Computer Vision

How to teach a machine?

6

edges (or any other hand-crafted features) → classifier → 'Person'

Not a good representation
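To make the contrast concrete, here is a rough, hypothetical sketch of such a hand-crafted pipeline (NumPy only; the feature function, the weights w and the bias b are illustrative, not from the slides). Whatever the classifier does, it can only see the statistics the designer chose to extract:

```python
import numpy as np

def edge_features(image):
    """Hand-crafted features: a coarse histogram of gradient magnitudes."""
    gy, gx = np.gradient(image.astype(float))
    magnitude = np.hypot(gx, gy)
    hist, _ = np.histogram(magnitude, bins=16, range=(0.0, magnitude.max() + 1e-8))
    return hist / (hist.sum() + 1e-8)

def classify(features, w, b):
    """A linear classifier on top of the fixed, hand-designed representation."""
    return "Person" if features @ w + b > 0 else "Not a person"

image = np.random.rand(64, 64)        # stand-in for an input photo
w, b = np.random.randn(16), 0.0       # in practice these would be trained
print(classify(edge_features(image), w, b))
```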

Page 7: Deep learning in Computer Vision

What is deep learning?

• Representation learning method: learning good features automatically from raw data

• Learning representations of data with multiple levels of abstraction

7

Google’s cat detection neural network

Page 8: Deep learning in Computer Vision


Construction of higher levels of abstraction

8

[Figure: a single unit computes the weighted sum of its inputs with weights w1, w2, w3 plus a bias b applied to a constant input 1, followed by a "non-linear" transformation; a minimal sketch follows below.]
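As a minimal sketch of what the diagram shows (hypothetical NumPy code, not part of the slides): the unit forms the weighted sum w1*x1 + w2*x2 + w3*x3 plus the bias b and passes it through a non-linear transformation, here a sigmoid; ReLU or tanh are equally common choices.

```python
import numpy as np

def unit(x, w, b):
    """One unit: weighted sum of the inputs plus a bias, then a non-linearity."""
    z = np.dot(w, x) + b                 # w1*x1 + w2*x2 + w3*x3 + b
    return 1.0 / (1.0 + np.exp(-z))      # sigmoid non-linear transformation

x = np.array([0.5, -1.2, 3.0])           # example inputs
w = np.array([0.1, 0.4, -0.2])           # weights w1, w2, w3
b = 0.05                                 # bias (the constant "1" input scaled by b)
print(unit(x, w, b))
```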

Page 9: Deep learning in Computer Vision


Going deeper in the network

9

Input: 'Pixels'

1st and 2nd Layers: 'Edges'

[Embedded pages from Lee et al., "Convolutional Deep Belief Networks for Scalable Unsupervised Learning of Hierarchical Representations" (ICML'09), shown as images on the slide. Recoverable captions: Figure 2, the first- and second-layer bases learned from natural images (oriented, localized edge filters; units responding to contours, corners, angles and surface boundaries); Table 2, MNIST test error; Figure 3, second- and third-layer bases learned from specific object categories (faces, cars, elephants, chairs) and from a mixture of four categories; Figures 4 and 5, AUC-PR and conditional-entropy statistics per layer; Figure 6, hierarchical probabilistic inference filling in the occluded half of a face via block Gibbs sampling.]

3rd Layer: 'Object Parts'

4th Layer: 'Objects'
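The idea of stacking layers to build higher levels of abstraction can be sketched as a plain forward pass (hypothetical NumPy code with random, untrained weights; the layer sizes are illustrative): each layer applies a learned linear map followed by a non-linearity, so later layers compute features of features, roughly mirroring the pixels → edges → object parts → objects hierarchy above.

```python
import numpy as np

def layer(x, W, b):
    """One layer: affine transformation followed by a ReLU non-linearity."""
    return np.maximum(W @ x + b, 0.0)

rng = np.random.default_rng(0)
x = rng.random(784)                                    # 'pixels' (flattened 28x28 input)

# Randomly initialized weights; in a trained network these are learned from data.
W1, b1 = rng.normal(scale=0.01, size=(256, 784)), np.zeros(256)
W2, b2 = rng.normal(scale=0.01, size=(64, 256)), np.zeros(64)
W3, b3 = rng.normal(scale=0.01, size=(10, 64)), np.zeros(10)

h1 = layer(x, W1, b1)      # low-level features ('edges')
h2 = layer(h1, W2, b2)     # mid-level features ('object parts')
scores = W3 @ h2 + b3      # high-level outputs ('objects')
```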


Page 10: Deep learning in Computer Vision

Deep Learning Methods: Unsupervised Methods

• Restricted Boltzmann Machines

• Deep Belief Networks

• Autoencoders: unsupervised feature extraction/learning (a minimal sketch follows below)

10

[Figure: autoencoder with an encode stage and a decode stage.]
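A minimal forward-pass sketch of the encode/decode idea (hypothetical NumPy code with untrained weights; dimensions are illustrative). Training would adjust the encoder and decoder weights to minimize the reconstruction error, forcing the narrow code layer to capture useful features of the input:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
x = rng.random(784)                                    # e.g. a flattened 28x28 image

W_enc = rng.normal(scale=0.01, size=(64, 784))         # encoder: 784 -> 64
W_dec = rng.normal(scale=0.01, size=(784, 64))         # decoder: 64 -> 784

code = sigmoid(W_enc @ x)                              # encode: compressed representation
x_hat = sigmoid(W_dec @ code)                          # decode: reconstruction of the input
reconstruction_error = np.mean((x - x_hat) ** 2)       # quantity minimized during training
```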

Page 11: Deep learning in Computer Vision

Deep Learning Methods: Supervised Methods

• Deep Neural Networks

• Recurrent Neural Networks

• Convolutional Neural Networks

11

[Embedded page from LeCun, Bengio & Hinton, "Deep Learning", Nature 521, 436-444 (2015), showing Figure 3 ("From image to text"): captions generated by a recurrent neural network from the representation extracted by a deep CNN; the figure is reproduced in full on the "From Image to Caption" slide later in the talk.]

Page 12: Deep learning in Computer Vision

How to train a deep network?

Stochastic Gradient Descent (supervised learning):

• show the input vectors for a few examples

• compute the outputs and the errors

• compute the average gradient over those examples

• update the weights accordingly (a minimal sketch of this loop follows below)

12
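The loop above can be written down directly. Below is a minimal, hypothetical NumPy sketch (not the network from the talk) that applies stochastic gradient descent to a logistic-regression "network" on synthetic data; the same pattern applies to deep networks once the gradients are obtained by backpropagation.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 20))                        # inputs
y = (X @ rng.normal(size=20) > 0).astype(float)        # synthetic binary labels

w, b = np.zeros(20), 0.0
learning_rate, batch_size = 0.1, 32

for epoch in range(5):
    order = rng.permutation(len(X))
    for start in range(0, len(X), batch_size):
        sel = order[start:start + batch_size]
        xb, yb = X[sel], y[sel]                        # show a few examples
        p = 1.0 / (1.0 + np.exp(-(xb @ w + b)))        # compute the output
        err = p - yb                                   # ... and the errors
        grad_w = xb.T @ err / len(sel)                 # average gradient
        grad_b = err.mean()
        w -= learning_rate * grad_w                    # update the weights
        b -= learning_rate * grad_b
```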

Page 13: Deep learning in Computer Vision

Convolutional Neural Networks

• CNNs are designed to process data that comes in the form of multiple arrays (e.g. 2D images, 3D video/volumetric images)

• The typical architecture is composed of a series of stages: convolutional layers and pooling layers (a toy sketch follows below)

• Each unit is connected to local patches in the feature maps of the previous layer

13

[Architecture diagram, garbled in extraction: an input image passes through alternating convolution and pooling stages (conv1, pool1, conv2, pool2) with progressively smaller feature maps.]
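To illustrate the convolution and pooling stages, here is a toy, hypothetical NumPy sketch (single channel, one hand-written kernel; a real CNN learns many such kernels per layer). Each output unit looks only at a local patch, all units share the same weights, and pooling then shrinks the feature map:

```python
import numpy as np

def conv2d(image, kernel):
    """Valid 2-D cross-correlation of a single-channel image with one kernel."""
    H, W = image.shape
    kh, kw = kernel.shape
    out = np.zeros((H - kh + 1, W - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            # Local connection: each output unit sees one patch; the weights are shared.
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

def max_pool(fmap, size=2):
    """Non-overlapping max pooling over size x size blocks."""
    H, W = fmap.shape
    H2, W2 = H // size, W // size
    return fmap[:H2 * size, :W2 * size].reshape(H2, size, W2, size).max(axis=(1, 3))

image = np.random.rand(28, 28)
kernel = np.array([[1., 0., -1.],
                   [2., 0., -2.],
                   [1., 0., -1.]])       # a Sobel-like edge filter; a CNN learns its kernels
feature_map = np.maximum(conv2d(image, kernel), 0.0)   # convolution + ReLU: 26 x 26
pooled = max_pool(feature_map)                          # pooling: 13 x 13
```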

Page 14: Deep learning in Computer Vision

Key Idea behind Convolutional Networks

14

Convolutional networks take advantage of the properties of natural signals:

• local connections

Page 15: Deep learning in Computer Vision

Key Idea behind Convolutional Networks

15

Convolutional networks take advantage of the properties of natural signals:

• local connections

• shared weights

Page 16: Deep learning in Computer Vision

Key Idea behind Convolutional Networks

16

Convolutional networks take advantage of the properties of natural signals:

• local connections

• shared weights

• pooling

Page 17: Deep learning in Computer Vision

Key Idea behind Convolutional Networks

17

Convolutional networks take advantage of the properties of natural signals:

• local connections

• shared weights

• pooling

• the use of many layers

'Person'

Page 18: Deep learning in Computer Vision

Pros & Cons

18

Pros:

• Best performing method in many computer vision tasks

• No need for hand-crafted features

• Most applicable method for large-scale problems, e.g. classification of 1,000 classes

• Easy parallelization on GPUs

Cons:

• Needs a huge amount of training data

• Hard to train (local minima, tuning of hyper-parameters)

• Difficult to analyse (to be solved)

Page 19: Deep learning in Computer Vision

Deep Learning Applications in Computer Vision

Page 20: Deep learning in Computer Vision


Handwritten Digit Recognition

20

Page 21: Deep learning in Computer Vision


ImageNet Classification with Deep Convolutional Neural Networks (AlexNet)

21

Figure 4: (Left) Eight ILSVRC-2010 test images and the five labels considered most probable by our model. The correct label is written under each image, and the probability assigned to the correct label is also shown with a red bar (if it happens to be in the top 5). (Right) Five ILSVRC-2010 test images in the first column. The remaining columns show the six training images that produce feature vectors in the last hidden layer with the smallest Euclidean distance from the feature vector for the test image.

In the left panel of Figure 4 we qualitatively assess what the network has learned by computing its top-5 predictions on eight test images. Notice that even off-center objects, such as the mite in the top-left, can be recognized by the net. Most of the top-5 labels appear reasonable. For example, only other types of cat are considered plausible labels for the leopard. In some cases (grille, cherry) there is genuine ambiguity about the intended focus of the photograph.

Another way to probe the network's visual knowledge is to consider the feature activations induced by an image at the last, 4096-dimensional hidden layer. If two images produce feature activation vectors with a small Euclidean separation, we can say that the higher levels of the neural network consider them to be similar. Figure 4 shows five images from the test set and the six images from the training set that are most similar to each of them according to this measure. Notice that at the pixel level, the retrieved training images are generally not close in L2 to the query images in the first column. For example, the retrieved dogs and elephants appear in a variety of poses. We present the results for many more test images in the supplementary material.

Computing similarity by using Euclidean distance between two 4096-dimensional, real-valued vectors is inefficient, but it could be made efficient by training an auto-encoder to compress these vectors to short binary codes. This should produce a much better image retrieval method than applying auto-encoders to the raw pixels [14], which does not make use of image labels and hence has a tendency to retrieve images with similar patterns of edges, whether or not they are semantically similar.

7 Discussion

Our results show that a large, deep convolutional neural network is capable of achieving record-breaking results on a highly challenging dataset using purely supervised learning. It is notable that our network's performance degrades if a single convolutional layer is removed. For example, removing any of the middle layers results in a loss of about 2% for the top-1 performance of the network. So the depth really is important for achieving our results.

To simplify our experiments, we did not use any unsupervised pre-training even though we expect that it will help, especially if we obtain enough computational power to significantly increase the size of the network without obtaining a corresponding increase in the amount of labeled data. Thus far, our results have improved as we have made our network larger and trained it longer but we still have many orders of magnitude to go in order to match the infero-temporal pathway of the human visual system. Ultimately we would like to use very large and deep convolutional nets on video sequences where the temporal structure provides very helpful information that is missing or far less obvious in static images.

Page 22: Deep learning in Computer Vision


FlowNet: Learning Optical Flow with Convolutional Networks

22

in collaboration with the University of Freiburg (lmb.informatik.uni-freiburg.de)

Page 23: Deep learning in Computer Vision


FlowNet: Learning Optical Flow with Convolutional Networks

23

Page 24: Deep learning in Computer Vision


FlowNet: Learning Optical Flow with Convolutional Networks

24

[Architecture diagrams, garbled in extraction: FlowNetSimple and FlowNetCorr. Both consist of a contracting part (conv1 through conv6, with kernel sizes 7x7, 5x5, 3x3 and 1x1 and channel counts growing from 64 to 1024, input resolution 384x512) followed by a refinement part, with the flow prediction produced at 136x320; FlowNetCorr additionally combines two separate conv1-conv3 streams through a correlation layer (corr) and a conv_redir shortcut before conv3_1.]

Page 25: Deep learning in Computer Vision


FlowNet: Learning Optical Flow with Convolutional Networks

25

[Same FlowNetSimple / FlowNetCorr architecture diagram as on the previous slide.]

Page 26: Deep learning in Computer Vision


FlowNet: Learning Optical Flow with Convolutional Networks

26

Page 27: Deep learning in Computer Vision


From Image to Caption

27

[Embedded page from LeCun, Bengio & Hinton, "Deep Learning", Nature 521, 436-444 (2015); the surrounding columns discuss ConvNet applications (self-driving cars, dedicated hardware, the 2012 ImageNet result) and distributed representations.]

Figure 3 | From image to text. Captions generated by a recurrent neural network (RNN) taking, as extra input, the representation extracted by a deep convolutional neural network (CNN) from a test image, with the RNN trained to 'translate' high-level representations of images into captions (top). Reproduced with permission from ref. 102. When the RNN is given the ability to focus its attention on a different location in the input image (middle and bottom; the lighter patches were given more attention) as it generates each word (bold), we found (ref. 86) that it exploits this to achieve better 'translation' of images into captions.

Vision: Deep CNN → Language: Generating RNN

Example captions: "A group of people shopping at an outdoor market. There are many vegetables at the fruit stand." / "A woman is throwing a frisbee in a park." / "A little girl sitting on a bed with a teddy bear." / "A group of people sitting on a boat in the water." / "A giraffe standing in a forest with trees in the background." / "A dog is standing on a hardwood floor." / "A stop sign is on a road with a mountain in the background."

Page 28: Deep learning in Computer Vision


Deep Learning in Computer Vision

Questions? End of Presentation

Caner Hazırbaş | [email protected]

Page 29: Deep learning in Computer Vision

References

• Building High-level Features Using Large Scale Unsupervised Learning. Quoc V. Le, Rajat Monga, Matthieu Devin, Kai Chen, Greg S. Corrado, Jeff Dean, Andrew Y. Ng. ICML'12

• Convolutional Deep Belief Networks for Scalable Unsupervised Learning of Hierarchical Representations. Honglak Lee, Roger Grosse, Rajesh Ranganath, Andrew Y. Ng. ICML'09

• ImageNet Classification with Deep Convolutional Neural Networks. Alex Krizhevsky, Ilya Sutskever, Geoffrey E. Hinton. NIPS'12

• Gradient-based learning applied to document recognition. Y. LeCun, L. Bottou, Y. Bengio, P. Haffner. Proceedings of the IEEE'98

• FlowNet: Learning Optical Flow with Convolutional Networks. Philipp Fischer, Alexey Dosovitskiy, Eddy Ilg, Philip Häusser, Caner Hazırbaş, Vladimir Golkov, Patrick van der Smagt, Daniel Cremers, Thomas Brox

29

Page 30: Deep learning in Computer Vision

References

• Google's cat detection neural network: http://www.resnap.com/image-selection-technology/deep-learning-image-classification/

• Example auto-encoder: http://nghiaho.com/?p=1765

• SGD: http://blog.datumbox.com/tuning-the-learning-rate-in-gradient-descent/

30