
Transcript of Le Song, Assistant Professor, College of Computing, Georgia Institute of Technology, at MLconf ATL 2016

Page 1: Le Song, Assistant Professor, College of Computing, Georgia Institute of Technology at MLconf ATL 2016

Understanding Deep Learning for Big Data

Le Song, http://www.cc.gatech.edu/~lsong/
College of Computing, Georgia Institute of Technology

Page 2

AlexNet: deep convolutional neural networks

[Figure: AlexNet architecture. A 224×224×3 input image passes through five convolution layers (filter sizes 11×11, 5×5, 3×3, 3×3, 3×3; feature maps 55×55×96 → 27×27×256 → 13×13×384 → 13×13×384 → 13×13×256), then two fully connected layers of 4096 units each and a 1000-way output, producing Pr(label | image), e.g. cat/bike/…? Rectified linear unit activations throughout. Convolution layers: 3.7 million parameters; fully connected layers: 58.6 million parameters.]
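The 3.7 million versus 58.6 million split quoted on the slide can be checked directly from the standard single-tower AlexNet layer shapes (a quick sanity check; biases are omitted):

```python
# Parameter counts for the standard (single-tower) AlexNet layer shapes.
# Conv layers: (filter_h, filter_w, in_channels, out_channels)
conv_layers = [
    (11, 11, 3, 96),    # conv1: 224x224x3 -> 55x55x96
    (5, 5, 96, 256),    # conv2: -> 27x27x256
    (3, 3, 256, 384),   # conv3: -> 13x13x384
    (3, 3, 384, 384),   # conv4: -> 13x13x384
    (3, 3, 384, 256),   # conv5: -> 13x13x256
]
conv_params = sum(h * w * cin * cout for h, w, cin, cout in conv_layers)

# Fully connected layers: 6x6x256 pooled features -> 4096 -> 4096 -> 1000
fc_layers = [(6 * 6 * 256, 4096), (4096, 4096), (4096, 1000)]
fc_params = sum(n_in * n_out for n_in, n_out in fc_layers)

print(conv_params)  # 3745824  (~3.7 million)
print(fc_params)    # 58621952 (~58.6 million)
```

Almost all of AlexNet's weights sit in the fully connected layers, which is what motivates the experiments that follow.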

Page 3

A benchmark image classification problem: ~1.3 million examples, ~1 thousand classes.

Page 4

Training is end-to-end: minimize the negative log-likelihood of Pr(label | image) over the data points using (stochastic) gradient descent.

AlexNet achieves ~40% top-1 error.
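The training recipe on the slide (minimize the negative log-likelihood with stochastic gradient descent) can be sketched on a toy linear softmax model standing in for the full network; everything here (data, sizes, learning rate) is made up for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, k = 200, 20, 5                     # toy data: 200 points, 20 features, 5 classes
X = rng.normal(size=(n, d))
true_W = rng.normal(size=(d, k))
y = (X @ true_W).argmax(axis=1)          # labels a linear model can actually fit
W = np.zeros((d, k))

def neg_log_likelihood(W):
    logits = X @ W
    logits = logits - logits.max(axis=1, keepdims=True)   # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -log_probs[np.arange(n), y].mean()

for step in range(500):                  # stochastic gradient descent on mini-batches
    idx = rng.integers(0, n, size=32)
    logits = X[idx] @ W
    logits -= logits.max(axis=1, keepdims=True)
    p = np.exp(logits)
    p /= p.sum(axis=1, keepdims=True)
    p[np.arange(32), y[idx]] -= 1.0      # d(NLL)/d(logits) = softmax - one_hot
    W -= 0.1 * X[idx].T @ p / 32         # gradient step

print(neg_log_likelihood(W) < np.log(k)) # True: below the uniform predictor's NLL
```

The real AlexNet objective is the same cross-entropy loss, just with the deep network producing the logits.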

Page 5

Traditional image features are not learned end-to-end. The pipeline: divide the image into patches, run a handcrafted feature extractor (e.g., SIFT), combine the features, and learn a classifier on top.

Page 6

Deep learning is not fully understood.

[Figure: the AlexNet diagram again (224×224×3 input, five convolution layers with rectified linear units, two 4096-unit fully connected layers, 1000-way output of Pr(label | image); 3.7 million convolution parameters, 58.6 million fully connected parameters), annotated with the questions: Fully connected layers crucial? Convolution layers crucial? Training end-to-end important?]

Page 7

Experiments

1. Fully connected layers crucial?
2. Convolution layers crucial?
3. Learning parameters end-to-end crucial?

Page 8

Kernel methods: an alternative nonlinear model, built as a combination of random basis functions:

f(x) = Σ_{i=1}^{7} α_i exp(−‖w_i − x‖²)

with weights α_1, …, α_7 on Gaussian basis functions centered at w_1, …, w_7, evaluated at the input x.

[Dai et al. NIPS 14]
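The basis expansion above is just a weighted sum of Gaussian bumps; a minimal sketch (the centers w_i, weights α_i, and dimensions are illustrative, not from the talk):

```python
import numpy as np

rng = np.random.default_rng(0)
d = 3
w = rng.normal(size=(7, d))        # 7 random basis-function centers w_i
alpha = rng.normal(size=7)         # weights alpha_i

def f(x):
    """f(x) = sum_i alpha_i * exp(-||w_i - x||^2)."""
    return alpha @ np.exp(-np.sum((w - x) ** 2, axis=1))

x = rng.normal(size=d)
print(f(x))                        # a scalar prediction at the query point x
```

Training such a model means fitting the weights (and, as later slides argue, ideally the basis functions themselves).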

Page 9

Replace the fully connected layers by kernel methods. Three models:

I. Jointly trained neural net (AlexNet): learn the convolution layers and the fully connected layers together to predict Pr(label | image).
II. Fixed neural net: fix the convolution layers, learn only the fully connected layers.
III. Scalable kernel method [Dai et al. NIPS 14]: fix the convolution layers, learn a kernel machine in place of the fully connected layers.
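Model III scales the kernel machine by replacing the exact kernel with random features, trained with stochastic gradients over both data and features (the "doubly" stochastic idea of [Dai et al. NIPS 14]). A minimal random-Fourier-feature sketch with made-up dimensions:

```python
import numpy as np

rng = np.random.default_rng(0)
d, D = 5, 4096                      # input dim, number of random features

# Random Fourier features: phi(x)^T phi(y) approximates the Gaussian kernel
# exp(-||x - y||^2 / 2) when the frequencies are drawn from N(0, I).
Omega = rng.normal(size=(D, d))
b = rng.uniform(0.0, 2.0 * np.pi, size=D)

def phi(x):
    return np.sqrt(2.0 / D) * np.cos(Omega @ x + b)

x, y = rng.normal(size=d), rng.normal(size=d)
approx = phi(x) @ phi(y)
exact = np.exp(-np.sum((x - y) ** 2) / 2.0)
print(abs(approx - exact) < 0.1)    # True: the feature map tracks the kernel
```

With this approximation the kernel machine is just a linear model on phi(x), so it trains with the same stochastic gradient machinery as the neural nets.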

Page 10

Learn classifiers from a benchmark subset of ~1.3 million examples and ~1 thousand classes.

Page 11

The kernel machine learns faster. ImageNet: 1.3M original images, 1000 classes; images are randomly cropped and mirrored in a streaming fashion.

[Figure: test top-1 error (%) against the number of training samples (10^5 to 10^8) for the jointly-trained neural net, the fixed neural net, and doubly SGD; the three curves reach final errors of 47.8%, 44.5%, and 42.6%. Training took 1 week on a GPU. Random guessing gives 99.9% error.]

Page 12

Similar results with MNIST8M: classification of handwritten digits, 8M images, 10 classes. (The neural net here is LeNet5.)

Page 13

Similar results with CIFAR10: classification of internet images, 60K images, 10 classes.

Page 14

Experiments

1. Fully connected layers crucial? No
2. Convolution layers crucial?
3. Learning parameters end-to-end crucial?

Page 15

Kernel methods directly on the inputs?

[Figure: bar charts of test error for kernel methods on fixed convolution features versus directly on the raw inputs (without convolution): MNIST (2 convolution layers, errors on a 0-1.2% scale), CIFAR10 (2 convolution layers, 0-40% scale), ImageNet (5 convolution layers, 0-100% scale). Dropping the convolution layers increases error substantially on every dataset.]

Page 16

Kernel methods + random convolutions?

[Figure: the same bar charts with a random-convolution condition added, comparing fixed convolution, random convolution, and no convolution on MNIST and CIFAR10 (2 convolution layers each), while varying the number of random versus fixed convolution filters.]

Page 17

Structured composition is useful. What matters is not just fully connected layers and plain composition, but a structured composition of nonlinear functions, which can represent the same function far more economically.
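One concrete way to see the value of structure: the linear part of a convolution layer is sparse and weight-shared, so it needs vastly fewer parameters than a dense layer mapping between the same shapes. A quick count, borrowing AlexNet's 13×13×256 feature-map shape:

```python
# A 3x3 convolution from 256 channels to 256 channels on a 13x13 feature map,
# versus a dense layer between the same input and output sizes (biases omitted).
h = w = 13
c_in = c_out = 256

conv_params = 3 * 3 * c_in * c_out               # shared across all 13x13 positions
dense_params = (h * w * c_in) * (h * w * c_out)  # one weight per input-output pair

print(conv_params)                   # 589824
print(dense_params // conv_params)   # 3173: the dense layer needs ~3000x more weights
```

Both layers compute a linear map followed by a nonlinearity; the convolution simply constrains that map to be local and translation-shared.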

Page 18

Experiments

1. Fully connected layers crucial? No
2. Convolution layers crucial? Yes
3. Learning parameters end-to-end crucial?

Page 19

Lots of random features used.

[Figure: AlexNet, with 58M parameters in its fully connected layers, reaches 42.6% error; the scalable kernel method on the fixed 13×13×256 convolution features uses 131K random features feeding the 1000-way classifier, for 131M parameters, and reaches 44.5% error.]

Page 20

Are 131M parameters needed?

[Figure: the same comparison with only 32K random features (32M parameters): the scalable kernel method's error rises to 50.0%, against AlexNet's 42.6% with 58M parameters.]

Page 21

Basis function adaptation is crucial. Barron '93 analyzes the integrated squared approximation error of basis-function expansions: adapting the basis functions gives a much smaller error than any fixed basis (roughly, n adaptive basis functions achieve an O(1/n) rate independent of dimension, while fixed bases suffer a rate that degrades with input dimension).

Fixed basis functions:
f(x) = Σ_{i=1}^{7} α_i k(x_i, x)
with weights α_1, …, α_7 on kernels k(x_i, x) centered at fixed points x_1, …, x_7.

Adapted basis functions:
f(x) = Σ_{i=1}^{2} α_i k_{θ_i}(x_i, x)
where each basis function's own parameters θ_i are also learned, so far fewer basis functions (here two, at x_1 and x_2) represent the same function.
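A toy 1-D illustration of basis adaptation (setup entirely made up): gradient descent can move an adaptive Gaussian basis function's center onto the target bump, something no single fixed-center basis function can do:

```python
import numpy as np

xs = np.linspace(-2.0, 2.0, 401)           # evaluation grid
target_c = 0.5
target = np.exp(-(xs - target_c) ** 2)     # target: one Gaussian bump at 0.5

c = 0.0                                    # adaptive center, initialized off-target
for _ in range(500):
    phi = np.exp(-(xs - c) ** 2)
    resid = phi - target
    grad = np.sum(2 * resid * phi * 2 * (xs - c))  # d/dc of the squared error
    c -= 0.001 * grad                      # gradient step on the basis parameter

print(abs(c - 0.5) < 0.01)                 # True: the center lands on the target
```

A fixed basis would instead need many centers scattered over the domain to cover where the target might be; adaptation spends its parameters only where they are needed.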

Page 22

Learning the random features helps a lot.

[Figure: with 32K random features and basis adaptation (learned features), the scalable kernel method's 32M parameters reach 43.7% error, against AlexNet's 42.6% with 58M parameters.]

Page 23

Learning the convolutions together helps even more.

[Figure: with 32K adapted features and the convolution layers learned jointly, the scalable kernel method's 32M parameters reach 41.9% error, beating AlexNet's 42.6% with 58M parameters.]

Page 24

Lesson learned: exploit structure and train end-to-end.

Next topic: deep learning over (time-varying) graphs.

Pages 25-30

Co-evolutionary features

[Figure, built up over six slides: a bipartite graph of users (Christine, Alice, David, Jacob) and items, with interactions arriving over time (02/02, 03/02, 06/02, 07/02, 09/02, …). Each interaction updates an item embedding and a user embedding: user-item interactions evolve over time, and the two sets of embeddings co-evolve.]

Page 31

Co-evolutionary embedding

Each interaction event is a tuple (u_n, i_n, t_n, q_n): user, item, time, and interaction features. Item embeddings are initialized from item raw profile features and user embeddings from user raw profile features. At each event, both are updated:

U2I (update the item embedding):
f_{i_n}(t_n) = h( V_1 · f_{i_n}(t_n^-)      [item evolution]
                + V_2 · f_{u_n}(t_n^-)      [co-evolution: user drives item]
                + V_3 · q_n                 [interaction context]
                + V_4 · (t_n - t_{n-1}) )   [temporal drift]

I2U (update the user embedding):
f_{u_n}(t_n) = h( W_1 · f_{u_n}(t_n^-)      [user evolution]
                + W_2 · f_{i_n}(t_n^-)      [co-evolution: item drives user]
                + W_3 · q_n                 [interaction context]
                + W_4 · (t_n - t_{n-1}) )   [temporal drift]

[Dai et al. RecSys 16]
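The two update rules can be sketched directly in numpy. Everything below (random parameter matrices, tanh as the nonlinearity h, the toy event stream, and the simplified global time difference) is an illustrative assumption, not from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)
d, q_dim = 8, 4                       # embedding dim, interaction-feature dim
h = np.tanh                           # nonlinearity h (an illustrative choice)

# Parameters: V_* update item embeddings, W_* update user embeddings.
V1, V2 = rng.normal(size=(d, d)) * 0.1, rng.normal(size=(d, d)) * 0.1
V3, V4 = rng.normal(size=(d, q_dim)) * 0.1, rng.normal(size=(d, 1)) * 0.1
W1, W2 = rng.normal(size=(d, d)) * 0.1, rng.normal(size=(d, d)) * 0.1
W3, W4 = rng.normal(size=(d, q_dim)) * 0.1, rng.normal(size=(d, 1)) * 0.1

users = {u: rng.normal(size=d) for u in ["Christine", "Alice", "David", "Jacob"]}
items = {i: rng.normal(size=d) for i in ["item_a", "item_b"]}

# Event stream (u_n, i_n, t_n), processed in temporal order.
events = [("Alice", "item_a", 1.0), ("David", "item_a", 2.0), ("Alice", "item_b", 3.5)]
t_prev = 0.0
for u, i, t in events:
    q = rng.normal(size=q_dim)        # interaction features q_n (made up)
    f_u, f_i = users[u], items[i]     # embeddings just before t_n
    dt = np.array([t - t_prev])       # time gap (simplified to a global clock here)
    # U2I: the item evolves, driven by the interacting user (co-evolution).
    items[i] = h(V1 @ f_i + V2 @ f_u + V3 @ q + V4 @ dt)
    # I2U: the user evolves, driven by the interacted item.
    users[u] = h(W1 @ f_u + W2 @ f_i + W3 @ q + W4 @ dt)
    t_prev = t

print(users["Alice"].shape)           # (8,)
```

Because each event reads the other side's pre-event embedding, the user and item trajectories are genuinely coupled, which is the point of the co-evolution terms.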

Page 32

Deep learning with a time-varying computation graph.

[Figure: interaction events on a timeline at t_0, t_1, t_2, t_3, grouped into mini-batch 1.]

The computation graph of the RNN is determined by:
1. The bipartite interaction graph
2. The temporal ordering of events

Page 33

Much improved prediction on the Reddit dataset (1,000 users, 1,403 groups, ~10K interactions).

Next-item prediction is measured by MAR (mean absolute rank difference); return-time prediction by MAE (mean absolute error, in hours).

Page 34

Predicting the efficiency of organic solar panel materials.

Dataset: Harvard Clean Energy Project
  Data points:        2.3 million
  Type:               molecule
  Atom types:         6
  Avg. # nodes:       28
  Avg. # edges:       33
  Prediction target:  power conversion efficiency (PCE), 0-12%

Page 35

Structure2Vec

[Figure: a latent variable model over the input graph χ, with observed node features X_1, …, X_6 and latent variables H_1, …, H_6. Each node i carries an embedding μ_i^(t), starting from μ_i^(0) and updated over iterations t = 1, …, T by combining its own features with its neighbors' current embeddings. After iteration T, the node embeddings μ_1^(T), μ_2^(T), … are aggregated (summed) into a graph embedding μ(W, χ), which feeds a classifier/regressor on the label with its own parameters.]

[Dai et al. ICML 16]
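The embedding iteration can be sketched in numpy: each node embedding μ_i is repeatedly refreshed from the node's own features and its neighbors' current embeddings, then summed into the graph embedding. The graph, matrices, and nonlinearity below are placeholders, not values from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)
# A toy 6-node "molecule": adjacency list and one-hot node features (atom types).
neighbors = {0: [1], 1: [0, 2, 3], 2: [1], 3: [1, 4, 5], 4: [3], 5: [3]}
X = np.eye(6)[[0, 1, 0, 1, 2, 2]]      # 6 nodes, atom type as a one-hot row

d, T = 16, 4                           # embedding dim, number of iterations
W1 = rng.normal(size=(d, X.shape[1])) * 0.1
W2 = rng.normal(size=(d, d)) * 0.1

mu = np.zeros((6, d))                  # mu_i^(0) = 0
for _ in range(T):
    new_mu = np.empty_like(mu)
    for i in range(6):
        msg = mu[neighbors[i]].sum(axis=0)          # aggregate neighbor embeddings
        new_mu[i] = np.tanh(W1 @ X[i] + W2 @ msg)   # mu_i^(t+1)
    mu = new_mu

graph_embedding = mu.sum(axis=0)       # aggregate: mu(W, chi) = sum_i mu_i^(T)
print(graph_embedding.shape)           # (16,)
```

After T iterations each node's embedding has seen information from nodes up to T hops away, so the summed graph embedding reflects the molecule's structure, not just its atom counts.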

Page 36

Improved prediction with a small model: structure2vec gets ~4% relative error with a 10,000 times smaller model!

                  Test MAE   Test RMSE   # parameters
Mean predictor    1.986      2.406      1
WL level-3        0.143      0.204      1.6m
WL level-6        0.096      0.137      1378m
structure2vec     0.085      0.117      0.1m

(10% of the data held out for testing.)

Page 37

Take-home message:

Deep fully connected layers are not the key.
Exploit structure (CNN, co-evolution, structure2vec).
Train end-to-end.