Convolutional Neural Network Applications and Insights Christof Angermueller and Alex Kendall

Page 1: Applications and Insights - University of Cambridgecbl.eng.cam.ac.uk/pub/Intranet/MLG/ReadingGroup/cnn_applications.pdf · Applications and Insights Christof Angermueller and Alex

Convolutional Neural Network

Applications and Insights

Christof Angermueller and Alex Kendall

Application 1: Classification

(Karpathy, Andrej, et al. "Large-scale video classification with convolutional neural networks.")

Visual Classification

Attention-grabbing image classification performance

Clarifai classification demo

Large Scale Classification

Classification advances driven by:

● Large datasets such as ImageNet, Places with millions of images

● Annual ImageNet Challenge (ILSVRC)

Depth over width

A function which is invariant to the many nuisance variables (pose, occlusion, lighting, clutter) is very complex and nonlinear

These functions are more efficiently represented with depth rather than width

● sequential mapping to connected spaces

● deeper layers reuse computation

(On the Number of Linear Regions of Deep Neural Networks)

Deep architectures consistently outperform shallow representations with comparable networks (Return of the Devil in the Details: Delving Deep into Convolutional Networks)
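
The depth-versus-width argument can be illustrated with a toy sketch in the spirit of the linear-regions paper cited above. It is not taken from the slides: the `tent` layer and the grid-based `count_linear_pieces` helper are hypothetical names chosen for illustration. Each `tent` layer uses two ReLU units, and every additional layer folds the interval [0, 1] once more, doubling the number of linear pieces.

```python
def relu(z):
    return max(z, 0.0)

def tent(x):
    # Two ReLU units that fold [0, 1] onto itself: rises to 1 at x = 0.5, then falls to 0.
    return 2.0 * relu(x) - 4.0 * relu(x - 0.5)

def deep_tent(x, depth):
    # Compose the same two-unit layer `depth` times.
    for _ in range(depth):
        x = tent(x)
    return x

def count_linear_pieces(f, steps=1024):
    # Count slope changes of f on a uniform grid over [0, 1].
    h = 1.0 / steps
    slopes = [(f((i + 1) * h) - f(i * h)) / h for i in range(steps)]
    return 1 + sum(1 for a, b in zip(slopes, slopes[1:]) if abs(a - b) > 1e-9)

for depth in range(1, 6):
    pieces = count_linear_pieces(lambda x: deep_tent(x, depth))
    print(depth, pieces)
```

With five layers (ten ReLU units in total) the function has 32 linear pieces, whereas a single hidden layer of ten ReLU units on a one-dimensional input can produce at most 11.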

Very deep architectures

1989: LeNet, 5 layers

2006: Autoencoders, 7 layers

2012: AlexNet, 9 layers

2014: GoogLeNet, 22 layers and current ILSVRC winner (‘Going Deeper with Convolutions’)

What constrains depth?

● GPU Memory - more efficient architectures

○ dimensionality reduction kernels

● Over-fitting

○ data-augmentation

○ dropout

● Back-propagated gradient magnitude decay

○ multi-loss training with auxiliary classifiers
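
The decay of back-propagated gradient magnitude can be seen in a small numerical sketch. The setup is an assumption made for illustration, not something from the slides: a chain of scalar layers with sigmoid activations and unit weights, where `input_gradient` is a hypothetical helper. The sigmoid is chosen because its derivative never exceeds 0.25, so each layer shrinks the gradient.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def input_gradient(x, weights):
    # Forward pass through scalar layers a_{k+1} = sigmoid(w_k * a_k),
    # accumulating d(output)/d(input) with the chain rule.
    a, grad = x, 1.0
    for w in weights:
        s = sigmoid(w * a)
        grad *= w * s * (1.0 - s)  # local derivative of sigmoid(w * a) w.r.t. a
        a = s
    return grad

for depth in (2, 5, 10, 20):
    print(depth, input_gradient(0.5, [1.0] * depth))
```

An auxiliary classifier attached to an intermediate layer shortens this chain for the layers below it, which is one way to read the multi-loss bullet above.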

Leverage Data Hierarchy

Strong hierarchy in data:

● Image recognition: Pixel → edge → texton → motif → part → object

● Text: Character → word → word group → clause → sentence → story

● Speech: Sample → spectral band → sound → phone → word

Strong hierarchy in biological architectures:

Thorpe, Simon, Denis Fize, and Catherine Marlot. "Speed of processing in the human visual system." Nature 381.6582 (1996): 520-522.

Understanding deep representations

First-layer filters respond to edges, blobs and other low-level features. Interestingly, when the network is trained on dual GPUs, a distinction forms between sharp monochrome features (rods) and colour blobs (cones) (ImageNet Classification with Deep Convolutional Neural Networks)

Hierarchy and Multi-Scales

A neuron’s receptive field increases in size with depth

● Initial layer features are more discriminative

● Deeper layers are more invariant and capture semantics

Different and complementary features exist at different spatial scales

Depth multiscale: Hypercolumns represent features across the entire depth of abstraction

Spatial multiscale: GoogLeNet uses multi-scale filters in inception modules
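
How the receptive field grows with depth can be worked out with the standard recurrence for stacked convolution and pooling layers. The sketch below is illustrative only: `receptive_field` is a hypothetical helper, the layer stack is made up, and dilation and padding are ignored.

```python
def receptive_field(layers):
    # layers: list of (kernel_size, stride) pairs, ordered from input to output.
    size, jump = 1, 1  # receptive field size, and input-pixel spacing between adjacent units
    for kernel, stride in layers:
        size += (kernel - 1) * jump
        jump *= stride
    return size

stack = [(3, 1), (3, 1), (2, 2), (3, 1)]  # conv, conv, pool, conv (hypothetical stack)
for depth in range(1, len(stack) + 1):
    print(depth, receptive_field(stack[:depth]))
```

Three stacked 3x3 stride-1 convolutions see a 7x7 input window, and each strided layer makes every later layer widen the window faster.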

Deconvolution

Visualising the convolutional filters can reveal deficiencies in the architecture. Deeper filters represent more semantic concepts, similar to V1-V4 of the visual pathway in humans (Zeiler, Matthew D., and Rob Fergus. "Visualizing and understanding convolutional networks.")

Summary of Classification Insights

1. Large, augmented datasets

2. Maximise depth (while avoiding overfitting and vanishing gradients)

3. Use multiscale and multi-depth information

Application 2: Instantiation Variable Regression

Multi-Dimensional Regression

Instead of training a softmax classifier, a Euclidean loss function can be used to train a regression output

For example, to regress camera location, x, and orientation, q, we can use a Euclidean loss over both quantities
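
A minimal sketch of such a loss follows. The exact form used in the slides is not assumed here; this is only one plausible version. Treating q as a quaternion, normalising the target to unit length, and the weighting factor `beta` are all assumptions, and `pose_loss` is a hypothetical name.

```python
import math

def euclidean(a, b):
    # Euclidean (L2) distance between two equal-length vectors.
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))

def normalise(q):
    norm = math.sqrt(sum(qi * qi for qi in q))
    return [qi / norm for qi in q]

def pose_loss(x_pred, q_pred, x_true, q_true, beta=1.0):
    # Position error plus beta-weighted orientation error.
    # beta is a hypothetical weighting; the target orientation is normalised to unit length.
    return euclidean(x_pred, x_true) + beta * euclidean(q_pred, normalise(q_true))

# Hypothetical prediction and ground truth, for illustration only.
loss = pose_loss([1.0, 2.0, 3.0], [1.0, 0.0, 0.0, 0.0],
                 [1.0, 2.0, 5.0], [2.0, 0.0, 0.0, 0.0], beta=0.5)
print(loss)
```

The weighting matters because position and orientation are measured in different units, so one term can otherwise dominate the other.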

Multi-Dimensional Regression

Although convnets are large piecewise-linear functions, they can still continuously regress pose and instantiation variables by mapping the input to a space that is linear in those variables

● Human pose (DeepPose: Human Pose Estimation via Deep Neural Networks)

● Alex’s unpublished work on camera pose localisation

Saliency maps

We can view the gradient of the output w.r.t. the pixels

A visualisation of these (back-propagated) gradients is called a saliency map

It shows the areas of the image, and the features, which are most important

Back-propagated gradients are a generalisation of deconvolution (Simonyan, Karen, Andrea Vedaldi, and Andrew Zisserman. "Deep inside convolutional networks: Visualising image classification models and saliency maps.")
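
The idea can be sketched on a toy model. Everything below is an assumption made for illustration: a one-hidden-layer ReLU network stands in for a convnet, the 4-pixel "image" and the weights are made up, and `score` and `saliency` are hypothetical names. The saliency value of each pixel is the magnitude of the back-propagated gradient of the class score with respect to that pixel.

```python
def score(x, W, v):
    # Class score of a tiny one-hidden-layer ReLU network: v . relu(W x).
    hidden = [max(sum(w * xi for w, xi in zip(row, x)), 0.0) for row in W]
    return sum(vj * hj for vj, hj in zip(v, hidden))

def saliency(x, W, v):
    # Back-propagate d(score)/d(x): only hidden units with a positive
    # pre-activation pass gradient back to the pixels.
    grad = [0.0] * len(x)
    for row, vj in zip(W, v):
        if sum(w * xi for w, xi in zip(row, x)) > 0.0:
            for i, w in enumerate(row):
                grad[i] += vj * w
    return [abs(g) for g in grad]  # gradient magnitude per pixel

# Hypothetical 4-pixel "image" and weights, for illustration only.
x = [0.2, 0.9, 0.1, 0.5]
W = [[1.0, -2.0, 0.0, 0.5],
     [0.0, 3.0, -1.0, 0.0]]
v = [1.0, 2.0]
print(saliency(x, W, v))
```

Pixels with zero saliency are those that no active hidden unit depends on, so small changes to them leave the score unchanged.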

Summary of Regression Insights

1. Convnet transforms data to a space linear in a number of instantiation parameters

2. Context is extremely important to understand the data

Other Applications

1. Image caption generation

2. Text recognition

3. Reinforcement learning

Image Caption Generation

Object detection is combined with a multimodal Recurrent Neural Network architecture that uses the detected regions to learn to generate descriptions of image regions

● Karpathy et al., ‘Deep visual-semantic alignments for generating image descriptions’

● Vinyals et al., ‘Show and Tell’

Text Recognition (OCR)

Superpixel-based region proposals for convnets have been used in many applications, e.g. OCR (Reading Text in the Wild with Convolutional Neural Networks; Multi-digit Number Recognition from Street View Imagery using Deep Convolutional Neural Networks)

Reinforcement Learning

Spatial and temporal input is passed through a convolutional neural network, which outputs joystick commands for a video game (Mnih et al., ‘Human-Level Control through Deep Reinforcement Learning’)

The same architecture was trained on 49 Atari games, with separate weights trained for each game to maximise score
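
The cited paper trains the network with Q-learning, where the network outputs one value per joystick action. The sketch below shows only the two core pieces, epsilon-greedy action selection and the one-step target, and leaves out the network itself. The function names, the example values and the default discount `gamma` are assumptions made for illustration.

```python
import random

def epsilon_greedy(q_values, epsilon, rng=random):
    # With probability epsilon pick a random joystick action, otherwise the highest-valued one.
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda a: q_values[a])

def q_target(reward, next_q_values, gamma=0.99, terminal=False):
    # One-step Q-learning target: reward now plus the discounted best value of the next state.
    if terminal:
        return reward
    return reward + gamma * max(next_q_values)

q_values = [0.1, 0.7, 0.3]  # hypothetical network outputs, one per action
action = epsilon_greedy(q_values, epsilon=0.0)
target = q_target(reward=1.0, next_q_values=[0.2, 0.5, 0.4])
print(action, target)
```

Training then regresses the network's value for the chosen action towards `target`, which is how maximising the game score becomes a supervised-style objective.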

Final Insights

● Feature vectors from convolutional neural networks contain rich representations of images

● Invariant to nuisance variables and linear in a number of instantiation parameters

● Improvement of convnets over SIFT features is approx. equal to the improvement of SIFT over simple RGB patches

Conclusion

● Convnets are pushing state-of-the-art in understanding data with spatial structure

● Produce powerful and transferable representations

However,

● Can be hard to train and regularise

● Very hard to get labelled data to train

● Deep representations tend to lose spatial accuracy