Tera-scale deep learning
Quoc V. Le
Stanford University and Google

Transcript of the slides.

Page 1

Page 2

Joint work with:
Samy Bengio, Zhenghao Chen, Tom Dean, Pang Wei Koh, Mark Mao, Jiquan Ngiam, Patrick Nguyen, Andrew Saxe, Mark Segal, Jon Shlens, Vincent Vanhoucke, Xiaoyun Wu, Peng Xe, Serena Yeung, Will Zou

Additional thanks:
Greg Corrado, Jeff Dean, Matthieu Devin, Kai Chen, Rajat Monga, Andrew Ng, Marc'Aurelio Ranzato, Paul Tucker, Ke Yang

Page 3

Machine Learning successes

Face recognition, OCR, autonomous cars, recommendation systems, web page ranking, email classification

Page 4

[Pipeline: input → Feature extraction (mostly hand-crafted features) → Classifier]

Page 5

Hand-crafted features

Computer vision: SIFT/HOG, SURF, …
Speech recognition: MFCC, spectrogram, ZCR, …

Page 6

New feature-designing paradigm

Unsupervised Feature Learning / Deep Learning
Reconstruction ICA
Expensive and typically applied to small problems

Page 7

The Trend of Big Data

Page 8

No matter the algorithm, more features are always more successful.

Outline

- Reconstruction ICA
- Applications to videos, cancer images
- Ideas for scaling up
- Scaling-up results

Page 9

Topographic Independent Component Analysis (TICA)

Page 10

Input data: x

1. Feature computation: first-layer units compute squared filter responses (W_1^T x)^2, …, (W_9^T x)^2; second-layer units pool them as sqrt((W_1^T x)^2 + … + (W_9^T x)^2).

2. Learning: the filter matrix W = [W_1; W_2; …; W_10000] is learned from data.
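The two-stage feature computation above can be sketched in NumPy. This is an illustrative stand-in: x, W, and the pooling matrix V are random placeholders, and the pool here is a single group rather than the overlapping topographic neighborhoods of real TICA.

```python
import numpy as np

rng = np.random.default_rng(0)

n_inputs = 256                      # e.g. a flattened 16x16 patch
n_filters = 9                       # W_1 ... W_9 as on the slide
x = rng.standard_normal(n_inputs)   # (whitened) input patch
W = rng.standard_normal((n_filters, n_inputs))  # rows are W_i^T

# Stage 1: simple units compute squared responses (W_i^T x)^2
simple = (W @ x) ** 2

# Stage 2: pooling units take the square root of a fixed local sum.
# V[i, j] = 1 if simple unit j belongs to pool i; one pool here.
V = np.ones((1, n_filters))
pooled = np.sqrt(V @ simple)
print(pooled)   # one invariant feature value per pool
```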

Page 11

Topographic Independent Component Analysis (TICA)

Page 12

Invariance explained

[Diagram: two images each contain the same edge at a different location (Loc1, Loc2); feature F1 detects the edge at Loc1, feature F2 at Loc2.]

Image1: (F1, F2) = (1, 0) → pooled feature sqrt(1^2 + 0^2) = 1
Image2: (F1, F2) = (0, 1) → pooled feature sqrt(0^2 + 1^2) = 1

The pooled feature of F1 and F2 has the same value regardless of the location of the edge.
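The arithmetic on this slide is small enough to run directly (the function name `pool` is illustrative):

```python
import math

# Feature responses (F1, F2) to the same edge at two locations
image1 = (1.0, 0.0)   # edge at Loc1: F1 fires
image2 = (0.0, 1.0)   # edge at Loc2: F2 fires

def pool(f1, f2):
    """Pooled feature of F1 and F2: sqrt(F1^2 + F2^2)."""
    return math.sqrt(f1 ** 2 + f2 ** 2)

print(pool(*image1), pool(*image2))  # 1.0 1.0 -- same value either way
```

The pooled unit fires equally for either image, which is exactly the translation invariance the slide describes.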

Page 13

TICA vs. Reconstruction ICA

Equivalence between Sparse Coding, Autoencoders, RBMs and ICA

Build a deep architecture by treating the output of one layer as input to another layer

Le, et al., ICA with Reconstruction Cost for Efficient Overcomplete Feature Learning. NIPS 2011

Page 14

Reconstruction ICA

Page 15

Reconstruction ICA

Data whitening
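The slides do not spell out the whitening procedure; ZCA whitening is one standard choice for image patches, sketched here (`zca_whiten` is an illustrative name, not from the talk):

```python
import numpy as np

def zca_whiten(X, eps=1e-5):
    """ZCA-whiten rows of X so features have ~identity covariance.
    eps regularizes tiny eigenvalues of the covariance matrix."""
    Xc = X - X.mean(axis=0)
    cov = Xc.T @ Xc / Xc.shape[0]
    d, E = np.linalg.eigh(cov)                 # eigendecomposition
    W = E @ np.diag(1.0 / np.sqrt(d + eps)) @ E.T
    return Xc @ W

rng = np.random.default_rng(0)
X = rng.standard_normal((1000, 8)) @ rng.standard_normal((8, 8))
Xw = zca_whiten(X)
print(np.round(np.cov(Xw, rowvar=False), 2))   # approximately identity
```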

Page 16

TICA vs. Reconstruction ICA

Data whitening

Page 17

Why RICA?

[Table: Sparse Coding, RBMs/Autoencoders, TICA, and Reconstruction ICA compared on speed, invariant features, and ease of training.]

Page 18

Summary of RICA

- Two-layered network
- Reconstruction cost instead of orthogonality constraints
- Learns invariant features
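A sketch of the RICA objective as summarized above: a reconstruction cost through tied weights in place of ICA's orthogonality constraint, plus a smoothed L1 sparsity term. Names and constants are illustrative; see the cited NIPS 2011 paper for the exact formulation.

```python
import numpy as np

def rica_cost(W, X, lam=0.1, eps=1e-8):
    """RICA objective on a batch of whitened examples.
    W: (n_features, n_inputs), X: (n_inputs, n_examples)."""
    H = W @ X                                   # feature activations
    recon = W.T @ H                             # reconstruction, tied weights
    recon_cost = np.sum((recon - X) ** 2) / X.shape[1]
    sparsity = np.sum(np.sqrt(H ** 2 + eps)) / X.shape[1]  # smoothed L1
    return recon_cost + lam * sparsity

rng = np.random.default_rng(0)
X = rng.standard_normal((64, 100))              # 100 whitened patches
W = rng.standard_normal((128, 64)) * 0.1        # overcomplete filter bank
print(rica_cost(W, X))                          # scalar, minimized by L-BFGS/SGD
```

Because the objective is unconstrained, it can be minimized with off-the-shelf optimizers, which is part of the "ease of training" argument for RICA.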

Page 19

Applications of RICA

Page 20

Action recognition

Sit up | Drive car | Get out of car
Eat | Answer phone | Kiss
Run | Stand up | Shake hands

Le, et al., Learning hierarchical spatio-temporal features for action recognition with independent subspace analysis. CVPR 2011

Page 21

Page 22

[Bar charts: accuracy (%) on the KTH, Hollywood2, UCF, and YouTube action-recognition benchmarks; learned features vs. prior methods (pLSA, GRBMs, 3DCNN, HMAX, Hessian/SURF, HOG/HOF, HOG3D, combined engineered features).]

Page 23

Cancer classification

Tissue classes: apoptotic, viable tumor region, necrosis, …

[Bar chart: classification accuracy (84% to 92%), hand-engineered features vs. RICA.]

Le, et al., Learning Invariant Features of Tumor Signatures. ISBI 2012

Page 24

Scaling up deep RICA networks

Page 25

Scaling up Deep Learning

[Figure: real data vs. deep learning data.]

Page 26

No matter the algorithm, more features are always more successful.

It's better to have more features!

Coates, et al., An Analysis of Single-Layer Networks in Unsupervised Feature Learning. AISTATS 2011

Page 27

Most are local features

Page 28

Local receptive field networks

[Diagram: the image is divided among Machine #1 … Machine #4; each machine computes RICA features over its own local region.]

Le, et al., Tiled Convolutional Neural Networks. NIPS 2010
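A toy sketch of the local-connectivity idea (names and sizes illustrative): each "machine" only ever touches its own strip of the image and its own filters, so no single machine needs the full image or the full weight matrix.

```python
import numpy as np

rng = np.random.default_rng(0)
image = rng.standard_normal((200, 200))

# Partition the image into 4 vertical strips, one per "machine".
strips = np.split(image, 4, axis=1)          # Machine #1 ... Machine #4

def machine_features(strip, n_filters=8, rf=18):
    """Responses of n_filters local filters applied to one rf x rf
    receptive field of the strip (illustrative, not a full convolution)."""
    patch = strip[:rf, :rf].ravel()
    W = rng.standard_normal((n_filters, patch.size)) * 0.01
    return W @ patch

# In the real system these run in parallel on different machines.
features = [machine_features(s) for s in strips]
print([f.shape for f in features])
```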

Page 29

Challenges with 1000s of machines

Page 30

Asynchronous Parallel SGDs

Parameter server

Le, et al., Building high-level features using large-scale unsupervised learning. ICML 2012

Page 31

Page 32

Summary of scaling up

- Local connectivity
- Asynchronous SGDs

… and more:

- RPC vs MapReduce
- Prefetching
- Single vs double precision
- Removing slow machines
- Optimized softmax
- …
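The asynchronous-SGD-with-parameter-server idea above can be sketched with threads standing in for machines. This is a toy model of the pattern, not the distributed system from the talk: replicas fetch possibly stale parameters, compute a gradient on their own data shard, and push updates back without global synchronization.

```python
import threading
import numpy as np

params = np.zeros(4)             # "parameter server" state
lock = threading.Lock()          # stands in for the server's update step

def replica(shard, lr=0.01, steps=200):
    rng = np.random.default_rng()
    for _ in range(steps):
        w = params.copy()                    # fetch (possibly stale) params
        x, y = shard[rng.integers(len(shard))]
        grad = 2.0 * (w @ x - y) * x         # gradient of (w.x - y)^2
        with lock:
            params[:] = params - lr * grad   # asynchronous push

rng = np.random.default_rng(0)
w_true = np.array([1.0, -2.0, 3.0, 0.5])
data = [(x, float(w_true @ x)) for x in rng.standard_normal((400, 4))]
shards = [data[i::4] for i in range(4)]      # one data shard per replica

threads = [threading.Thread(target=replica, args=(s,)) for s in shards]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(np.round(params, 2))                   # close to w_true
```

Updates computed from stale parameters still converge on this easy problem; tolerating that staleness is what removes the synchronization bottleneck.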

Page 33

10 million 200x200 images

1 billion parameters

Page 34

Training

Dataset: 10 million 200x200 unlabeled images from YouTube/Web
Train on 2,000 machines (16,000 cores) for 1 week
1.15 billion parameters
- 100x larger than previously reported
- Small compared to the visual cortex

[Diagram: one layer. Input image 200x200 with 3 input channels; receptive field (RF) size 18; 8 maps; pooling size 5; local contrast normalization (LCN) size 5; 8 output channels. The output, a W x H image with 8 channels, is input to the layer above.]

Le, et al., Building high-level features using large-scale unsupervised learning. ICML 2012

Image → RICA → RICA → RICA
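The per-layer operations in the diagram (L2 pooling and local contrast normalization applied to filter responses) can be sketched as follows. This is a simplification: the pooling uses non-overlapping 5x5 blocks, and the LCN is a crude whole-map normalization rather than a true 5x5 local window.

```python
import numpy as np

def l2_pool(fmap, size=5):
    """L2 pooling: sqrt of the local sum of squares (stride = size)."""
    h, w = (s // size * size for s in fmap.shape)
    blocks = fmap[:h, :w].reshape(h // size, size, w // size, size)
    return np.sqrt((blocks ** 2).sum(axis=(1, 3)))

def lcn(fmap, eps=1e-8):
    """Contrast normalization, crudely approximated over the whole map
    (the real LCN normalizes within a 5x5 neighborhood)."""
    return (fmap - fmap.mean()) / (fmap.std() + eps)

rng = np.random.default_rng(0)
fmap = rng.standard_normal((40, 40))   # one map of filter responses
out = lcn(l2_pool(fmap, 5))            # pooled, normalized map
print(out.shape)
```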

Page 35

The face neuron

Top stimuli from the test set | Optimal stimulus by numerical optimization
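"Optimal stimulus by numerical optimization" means gradient ascent on the input to maximize a unit's activation. A sketch with a linear stand-in neuron (in the paper the unit is deep and the gradient comes from backpropagation; here it is analytic):

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in "neuron": activation = w . x for a fixed weight vector.
w = rng.standard_normal(64)

def activation(x):
    return w @ x

x = rng.standard_normal(64) * 0.01      # start from a near-blank input
for _ in range(100):
    grad = w                            # d(activation)/dx for this neuron
    x += 0.1 * grad                     # gradient ascent on the input
    x /= np.linalg.norm(x)              # keep the stimulus norm-bounded

# x now aligns with the neuron's preferred input direction
print(np.round(activation(x) / np.linalg.norm(w), 3))
```

For the face neuron, the same procedure (with backpropagated gradients) produces the face-like optimal stimulus shown on the slide.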

Page 36

[Histogram: frequency of feature values for faces vs. random distractors.]

Page 37

Invariance properties

[Plots: feature response vs. horizontal shifts (0 to 20 pixels), vertical shifts (0 to 20 pixels), 3D rotation angle (0 to 90 degrees), and scale factor (0.4x to 1.6x).]

Page 38

Top stimuli from the test set | Optimal stimulus by numerical optimization

Page 39

[Histogram: frequency of feature values for pedestrians vs. random distractors.]

Page 40

Top stimuli from the test set | Optimal stimulus by numerical optimization

Page 41

[Histogram: frequency of feature values for cat faces vs. random distractors.]

Page 42

Page 43

ImageNet classification

22,000 categories
14,000,000 images
Hand-engineered features (SIFT, HOG, LBP), spatial pyramid, sparse coding/compression

Page 44

22,000 is a lot of categories…

… smoothhound, smoothhound shark, Mustelus mustelus; American smooth dogfish, Mustelus canis; Florida smoothhound, Mustelus norrisi; whitetip shark, reef whitetip shark, Triaenodon obesus; Atlantic spiny dogfish, Squalus acanthias; Pacific spiny dogfish, Squalus suckleyi; hammerhead, hammerhead shark; smooth hammerhead, Sphyrna zygaena; smalleye hammerhead, Sphyrna tudes; shovelhead, bonnethead, bonnet shark, Sphyrna tiburo; angel shark, angelfish, Squatina squatina, monkfish; electric ray, crampfish, numbfish, torpedo; smalltooth sawfish, Pristis pectinatus; guitarfish; roughtail stingray, Dasyatis centroura; butterfly ray; eagle ray; spotted eagle ray, spotted ray, Aetobatus narinari; cownose ray, cow-nosed ray, Rhinoptera bonasus; manta, manta ray, devilfish; Atlantic manta, Manta birostris; devil ray, Mobula hypostoma; grey skate, gray skate, Raja batis; little skate, Raja erinacea …

Stingray

Manta ray

Page 45

Best stimuli

[Diagram repeated from page 34: one network layer.]

Feature 1 | Feature 2 | Feature 3 | Feature 4 | Feature 5

Page 46

Best stimuli

[Diagram repeated from page 34: one network layer.]

Feature 6 | Feature 7 | Feature 8 | Feature 9

Page 47

Best stimuli

[Diagram repeated from page 34: one network layer.]

Feature 10 | Feature 11 | Feature 12 | Feature 13

Page 48

Random guess: 0.005%
State-of-the-art (Weston & Bengio '11): 9.5%
Feature learning from raw pixels: ?

Page 49

ImageNet 2009 (10k categories):
Best published result: 17% (Sanchez & Perronnin '11)
Our method: 20%
Using only 1,000 categories, our method achieves >50%

Random guess: 0.005%
State-of-the-art (Weston & Bengio '11): 9.5%
Feature learning from raw pixels: 15.8%

Page 50

No matter the algorithm, more features are always more successful.

Other results

- We also have great features for:
  - Speech recognition
  - Word-vector embeddings for NLP

Page 51

Conclusions

- RICA learns invariant features
- A face neuron emerges from totally unlabeled data, given enough training and data
- State-of-the-art performance on:
  - Action recognition
  - Cancer image classification
  - ImageNet

[Summary panels: cancer classification, action recognition benchmarks, feature visualization, face neuron; ImageNet: random guess 0.005%, best published result 9.5%, our method 15.8%.]

Page 52

References

• Q.V. Le, M.A. Ranzato, R. Monga, M. Devin, G. Corrado, K. Chen, J. Dean, A.Y. Ng. Building high-level features using large-scale unsupervised learning. ICML, 2012.

• Q.V. Le, J. Ngiam, Z. Chen, D. Chia, P. Koh, A.Y. Ng. Tiled Convolutional Neural Networks. NIPS, 2010.

• Q.V. Le, W.Y. Zou, S.Y. Yeung, A.Y. Ng. Learning hierarchical spatio-temporal features for action recognition with independent subspace analysis. CVPR, 2011.

• Q.V. Le, J. Ngiam, A. Coates, A. Lahiri, B. Prochnow, A.Y. Ng. On optimization methods for deep learning. ICML, 2011.

• Q.V. Le, A. Karpenko, J. Ngiam, A.Y. Ng. ICA with Reconstruction Cost for Efficient Overcomplete Feature Learning. NIPS, 2011.

• Q.V. Le, J. Han, J. Gray, P. Spellman, A. Borowsky, B. Parvin. Learning Invariant Features of Tumor Signatures. ISBI, 2012.

• I.J. Goodfellow, Q.V. Le, A.M. Saxe, H. Lee, A.Y. Ng. Measuring invariances in deep networks. NIPS, 2009.

http://ai.stanford.edu/~quocle