Tera-scale deep learning
Quoc V. Le
Stanford University and Google

Transcript of the slides.

Page 1

Page 2

Joint work with:
Samy Bengio, Zhenghao Chen, Tom Dean, Pang Wei Koh, Mark Mao, Jiquan Ngiam, Patrick Nguyen, Andrew Saxe, Mark Segal, Jon Shlens, Vincent Vanhoucke, Xiaoyun Wu, Peng Xe, Serena Yeung, Will Zou

Additional thanks:
Greg Corrado, Jeff Dean, Matthieu Devin, Kai Chen, Rajat Monga, Andrew Ng, Marc'Aurelio Ranzato, Paul Tucker, Ke Yang

Page 3

Machine Learning successes

Face recognition, OCR, autonomous cars, recommendation systems, web page ranking, email classification

Page 4

[Pipeline: input → Feature extraction (mostly hand-crafted features) → Classifier]

Page 5

Hand-crafted features

Computer vision: SIFT/HOG, SURF, …
Speech recognition: MFCC, spectrogram, ZCR, …

Page 6

New feature-designing paradigm

Unsupervised Feature Learning / Deep Learning
Reconstruction ICA
Expensive and typically applied to small problems

Page 7

The Trend of Big Data

Page 8

No matter the algorithm, more features are always more successful.

Outline

- Reconstruction ICA
- Applications to videos, cancer images
- Ideas for scaling up
- Scaling-up results

Page 9

Topographic Independent Component Analysis (TICA)

Page 10

Input data: x

1. Feature computation: first-layer units compute squared filter responses (W_1^T x)^2, …, (W_9^T x)^2; second-layer units pool them as sqrt((W_1^T x)^2 + … + (W_9^T x)^2).

2. Learning: the filter matrix W = [W_1; W_2; …; W_10000] is learned from data.
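The two-stage feature computation above can be sketched in NumPy. This is an illustrative stand-in: x, W, and the pooling matrix V are random placeholders, and the pool here is a single group rather than the overlapping topographic neighborhoods of real TICA.

```python
import numpy as np

rng = np.random.default_rng(0)

n_inputs = 256                      # e.g. a flattened 16x16 patch
n_filters = 9                       # W_1 ... W_9 as on the slide
x = rng.standard_normal(n_inputs)   # (whitened) input patch
W = rng.standard_normal((n_filters, n_inputs))  # rows are W_i^T

# Stage 1: simple units compute squared responses (W_i^T x)^2
simple = (W @ x) ** 2

# Stage 2: pooling units take the square root of a fixed local sum.
# V[i, j] = 1 if simple unit j belongs to pool i; one pool here.
V = np.ones((1, n_filters))
pooled = np.sqrt(V @ simple)
print(pooled)   # one invariant feature value per pool
```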

Page 11

Topographic Independent Component Analysis (TICA)

Page 12

Invariance explained

[Diagram: two images each contain the same edge at a different location (Loc1, Loc2); feature F1 detects the edge at Loc1, feature F2 at Loc2.]

Image1: (F1, F2) = (1, 0) → pooled feature sqrt(1^2 + 0^2) = 1
Image2: (F1, F2) = (0, 1) → pooled feature sqrt(0^2 + 1^2) = 1

The pooled feature of F1 and F2 has the same value regardless of the location of the edge.
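The arithmetic on this slide is small enough to run directly (the function name `pool` is illustrative):

```python
import math

# Feature responses (F1, F2) to the same edge at two locations
image1 = (1.0, 0.0)   # edge at Loc1: F1 fires
image2 = (0.0, 1.0)   # edge at Loc2: F2 fires

def pool(f1, f2):
    """Pooled feature of F1 and F2: sqrt(F1^2 + F2^2)."""
    return math.sqrt(f1 ** 2 + f2 ** 2)

print(pool(*image1), pool(*image2))  # 1.0 1.0 -- same value either way
```

The pooled unit fires equally for either image, which is exactly the translation invariance the slide describes.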

Page 13

TICA vs. Reconstruction ICA

Equivalence between Sparse Coding, Autoencoders, RBMs and ICA

Build a deep architecture by treating the output of one layer as input to another layer

Le, et al., ICA with Reconstruction Cost for Efficient Overcomplete Feature Learning. NIPS 2011

Page 14

Reconstruction ICA

Page 15

Reconstruction ICA

Data whitening
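The slides do not spell out the whitening procedure; ZCA whitening is one standard choice for image patches, sketched here (`zca_whiten` is an illustrative name, not from the talk):

```python
import numpy as np

def zca_whiten(X, eps=1e-5):
    """ZCA-whiten rows of X so features have ~identity covariance.
    eps regularizes tiny eigenvalues of the covariance matrix."""
    Xc = X - X.mean(axis=0)
    cov = Xc.T @ Xc / Xc.shape[0]
    d, E = np.linalg.eigh(cov)                 # eigendecomposition
    W = E @ np.diag(1.0 / np.sqrt(d + eps)) @ E.T
    return Xc @ W

rng = np.random.default_rng(0)
X = rng.standard_normal((1000, 8)) @ rng.standard_normal((8, 8))
Xw = zca_whiten(X)
print(np.round(np.cov(Xw, rowvar=False), 2))   # approximately identity
```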

Page 16

TICA vs. Reconstruction ICA

Data whitening

Page 17

Why RICA?

[Table: Sparse Coding, RBMs/Autoencoders, TICA, and Reconstruction ICA compared on speed, invariant features, and ease of training.]

Page 18

Summary of RICA

- Two-layered network
- Reconstruction cost instead of orthogonality constraints
- Learns invariant features
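A sketch of the RICA objective as summarized above: a reconstruction cost through tied weights in place of ICA's orthogonality constraint, plus a smoothed L1 sparsity term. Names and constants are illustrative; see the cited NIPS 2011 paper for the exact formulation.

```python
import numpy as np

def rica_cost(W, X, lam=0.1, eps=1e-8):
    """RICA objective on a batch of whitened examples.
    W: (n_features, n_inputs), X: (n_inputs, n_examples)."""
    H = W @ X                                   # feature activations
    recon = W.T @ H                             # reconstruction, tied weights
    recon_cost = np.sum((recon - X) ** 2) / X.shape[1]
    sparsity = np.sum(np.sqrt(H ** 2 + eps)) / X.shape[1]  # smoothed L1
    return recon_cost + lam * sparsity

rng = np.random.default_rng(0)
X = rng.standard_normal((64, 100))              # 100 whitened patches
W = rng.standard_normal((128, 64)) * 0.1        # overcomplete filter bank
print(rica_cost(W, X))                          # scalar, minimized by L-BFGS/SGD
```

Because the objective is unconstrained, it can be minimized with off-the-shelf optimizers, which is part of the "ease of training" argument for RICA.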

Page 19

Applications of RICA

Page 20

Action recognition

Sit up | Drive car | Get out of car
Eat | Answer phone | Kiss
Run | Stand up | Shake hands

Le, et al., Learning hierarchical spatio-temporal features for action recognition with independent subspace analysis. CVPR 2011

Page 21

Page 22

[Bar charts: accuracy (%) on the KTH, Hollywood2, UCF, and YouTube action-recognition benchmarks; learned features vs. prior methods (pLSA, GRBMs, 3DCNN, HMAX, Hessian/SURF, HOG/HOF, HOG3D, combined engineered features).]

Page 23

Cancer classification

Tissue classes: apoptotic, viable tumor region, necrosis, …

[Bar chart: classification accuracy (84% to 92%), hand-engineered features vs. RICA.]

Le, et al., Learning Invariant Features of Tumor Signatures. ISBI 2012

Page 24

Scaling up deep RICA networks

Page 25

Scaling up Deep Learning

[Figure: real data vs. deep learning data.]

Page 26

No matter the algorithm, more features are always more successful.

It's better to have more features!

Coates, et al., An Analysis of Single-Layer Networks in Unsupervised Feature Learning. AISTATS 2011

Page 27

Most are local features

Page 28

Local receptive field networks

[Diagram: the image is divided among Machine #1 … Machine #4; each machine computes RICA features over its own local region.]

Le, et al., Tiled Convolutional Neural Networks. NIPS 2010
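A toy sketch of the local-connectivity idea (names and sizes illustrative): each "machine" only ever touches its own strip of the image and its own filters, so no single machine needs the full image or the full weight matrix.

```python
import numpy as np

rng = np.random.default_rng(0)
image = rng.standard_normal((200, 200))

# Partition the image into 4 vertical strips, one per "machine".
strips = np.split(image, 4, axis=1)          # Machine #1 ... Machine #4

def machine_features(strip, n_filters=8, rf=18):
    """Responses of n_filters local filters applied to one rf x rf
    receptive field of the strip (illustrative, not a full convolution)."""
    patch = strip[:rf, :rf].ravel()
    W = rng.standard_normal((n_filters, patch.size)) * 0.01
    return W @ patch

# In the real system these run in parallel on different machines.
features = [machine_features(s) for s in strips]
print([f.shape for f in features])
```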

Page 29

Challenges with 1000s of machines

Page 30

Asynchronous Parallel SGDs

Parameter server

Le, et al., Building high-level features using large-scale unsupervised learning. ICML 2012

Page 31

Page 32

Summary of scaling up

- Local connectivity
- Asynchronous SGDs

… and more:

- RPC vs MapReduce
- Prefetching
- Single vs double precision
- Removing slow machines
- Optimized softmax
- …
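The asynchronous-SGD-with-parameter-server idea above can be sketched with threads standing in for machines. This is a toy model of the pattern, not the distributed system from the talk: replicas fetch possibly stale parameters, compute a gradient on their own data shard, and push updates back without global synchronization.

```python
import threading
import numpy as np

params = np.zeros(4)             # "parameter server" state
lock = threading.Lock()          # stands in for the server's update step

def replica(shard, lr=0.01, steps=200):
    rng = np.random.default_rng()
    for _ in range(steps):
        w = params.copy()                    # fetch (possibly stale) params
        x, y = shard[rng.integers(len(shard))]
        grad = 2.0 * (w @ x - y) * x         # gradient of (w.x - y)^2
        with lock:
            params[:] = params - lr * grad   # asynchronous push

rng = np.random.default_rng(0)
w_true = np.array([1.0, -2.0, 3.0, 0.5])
data = [(x, float(w_true @ x)) for x in rng.standard_normal((400, 4))]
shards = [data[i::4] for i in range(4)]      # one data shard per replica

threads = [threading.Thread(target=replica, args=(s,)) for s in shards]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(np.round(params, 2))                   # close to w_true
```

Updates computed from stale parameters still converge on this easy problem; tolerating that staleness is what removes the synchronization bottleneck.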

Page 33

10 million 200x200 images

1 billion parameters

Page 34

Training

Dataset: 10 million 200x200 unlabeled images from YouTube/Web
Train on 2,000 machines (16,000 cores) for 1 week
1.15 billion parameters
- 100x larger than previously reported
- Small compared to the visual cortex

[Diagram: one layer. Input image 200x200 with 3 input channels; receptive field (RF) size 18; 8 maps; pooling size 5; local contrast normalization (LCN) size 5; 8 output channels. The output, a W x H image with 8 channels, is input to the layer above.]

Le, et al., Building high-level features using large-scale unsupervised learning. ICML 2012

Image → RICA → RICA → RICA
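The per-layer operations in the diagram (L2 pooling and local contrast normalization applied to filter responses) can be sketched as follows. This is a simplification: the pooling uses non-overlapping 5x5 blocks, and the LCN is a crude whole-map normalization rather than a true 5x5 local window.

```python
import numpy as np

def l2_pool(fmap, size=5):
    """L2 pooling: sqrt of the local sum of squares (stride = size)."""
    h, w = (s // size * size for s in fmap.shape)
    blocks = fmap[:h, :w].reshape(h // size, size, w // size, size)
    return np.sqrt((blocks ** 2).sum(axis=(1, 3)))

def lcn(fmap, eps=1e-8):
    """Contrast normalization, crudely approximated over the whole map
    (the real LCN normalizes within a 5x5 neighborhood)."""
    return (fmap - fmap.mean()) / (fmap.std() + eps)

rng = np.random.default_rng(0)
fmap = rng.standard_normal((40, 40))   # one map of filter responses
out = lcn(l2_pool(fmap, 5))            # pooled, normalized map
print(out.shape)
```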

Page 35

The face neuron

Top stimuli from the test set | Optimal stimulus by numerical optimization
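"Optimal stimulus by numerical optimization" means gradient ascent on the input to maximize a unit's activation. A sketch with a linear stand-in neuron (in the paper the unit is deep and the gradient comes from backpropagation; here it is analytic):

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in "neuron": activation = w . x for a fixed weight vector.
w = rng.standard_normal(64)

def activation(x):
    return w @ x

x = rng.standard_normal(64) * 0.01      # start from a near-blank input
for _ in range(100):
    grad = w                            # d(activation)/dx for this neuron
    x += 0.1 * grad                     # gradient ascent on the input
    x /= np.linalg.norm(x)              # keep the stimulus norm-bounded

# x now aligns with the neuron's preferred input direction
print(np.round(activation(x) / np.linalg.norm(w), 3))
```

For the face neuron, the same procedure (with backpropagated gradients) produces the face-like optimal stimulus shown on the slide.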

Page 36

[Histogram: frequency of feature values for faces vs. random distractors.]

Page 37

Invariance properties

[Plots: feature response vs. horizontal shifts (0 to 20 pixels), vertical shifts (0 to 20 pixels), 3D rotation angle (0 to 90 degrees), and scale factor (0.4x to 1.6x).]

Page 38

Top stimuli from the test set | Optimal stimulus by numerical optimization

Page 39

[Histogram: frequency of feature values for pedestrians vs. random distractors.]

Page 40

Top stimuli from the test set | Optimal stimulus by numerical optimization

Page 41

[Histogram: frequency of feature values for cat faces vs. random distractors.]

Page 42

Page 43

ImageNet classification

22,000 categories
14,000,000 images
Hand-engineered features (SIFT, HOG, LBP), spatial pyramid, sparse coding/compression

Page 44

22,000 is a lot of categories…

… smoothhound, smoothhound shark, Mustelus mustelus; American smooth dogfish, Mustelus canis; Florida smoothhound, Mustelus norrisi; whitetip shark, reef whitetip shark, Triaenodon obesus; Atlantic spiny dogfish, Squalus acanthias; Pacific spiny dogfish, Squalus suckleyi; hammerhead, hammerhead shark; smooth hammerhead, Sphyrna zygaena; smalleye hammerhead, Sphyrna tudes; shovelhead, bonnethead, bonnet shark, Sphyrna tiburo; angel shark, angelfish, Squatina squatina, monkfish; electric ray, crampfish, numbfish, torpedo; smalltooth sawfish, Pristis pectinatus; guitarfish; roughtail stingray, Dasyatis centroura; butterfly ray; eagle ray; spotted eagle ray, spotted ray, Aetobatus narinari; cownose ray, cow-nosed ray, Rhinoptera bonasus; manta, manta ray, devilfish; Atlantic manta, Manta birostris; devil ray, Mobula hypostoma; grey skate, gray skate, Raja batis; little skate, Raja erinacea …

Stingray

Manta ray

Page 45

Best stimuli

[Diagram repeated from page 34: one network layer.]

Feature 1 | Feature 2 | Feature 3 | Feature 4 | Feature 5

Page 46

Best stimuli

[Diagram repeated from page 34: one network layer.]

Feature 6 | Feature 7 | Feature 8 | Feature 9

Page 47

Best stimuli

[Diagram repeated from page 34: one network layer.]

Feature 10 | Feature 11 | Feature 12 | Feature 13

Page 48

Random guess: 0.005%
State-of-the-art (Weston & Bengio '11): 9.5%
Feature learning from raw pixels: ?

Page 49

ImageNet 2009 (10k categories):
Best published result: 17% (Sanchez & Perronnin '11)
Our method: 20%
Using only 1,000 categories, our method achieves >50%

Random guess: 0.005%
State-of-the-art (Weston & Bengio '11): 9.5%
Feature learning from raw pixels: 15.8%

Page 50

No matter the algorithm, more features are always more successful.

Other results

- We also have great features for:
  - Speech recognition
  - Word-vector embeddings for NLP

Page 51

Conclusions

- RICA learns invariant features
- A face neuron emerges from totally unlabeled data, given enough training and data
- State-of-the-art performance on:
  - Action recognition
  - Cancer image classification
  - ImageNet

[Summary panels: cancer classification, action recognition benchmarks, feature visualization, face neuron; ImageNet: random guess 0.005%, best published result 9.5%, our method 15.8%.]

Page 52

References

• Q.V. Le, M.A. Ranzato, R. Monga, M. Devin, G. Corrado, K. Chen, J. Dean, A.Y. Ng. Building high-level features using large-scale unsupervised learning. ICML, 2012.

• Q.V. Le, J. Ngiam, Z. Chen, D. Chia, P. Koh, A.Y. Ng. Tiled Convolutional Neural Networks. NIPS, 2010.

• Q.V. Le, W.Y. Zou, S.Y. Yeung, A.Y. Ng. Learning hierarchical spatio-temporal features for action recognition with independent subspace analysis. CVPR, 2011.

• Q.V. Le, J. Ngiam, A. Coates, A. Lahiri, B. Prochnow, A.Y. Ng. On optimization methods for deep learning. ICML, 2011.

• Q.V. Le, A. Karpenko, J. Ngiam, A.Y. Ng. ICA with Reconstruction Cost for Efficient Overcomplete Feature Learning. NIPS, 2011.

• Q.V. Le, J. Han, J. Gray, P. Spellman, A. Borowsky, B. Parvin. Learning Invariant Features of Tumor Signatures. ISBI, 2012.

• I.J. Goodfellow, Q.V. Le, A.M. Saxe, H. Lee, A.Y. Ng. Measuring invariances in deep networks. NIPS, 2009.

http://ai.stanford.edu/~quocle