
Transcript of Le Song, Assistant Professor, College of Computing, Georgia Institute of Technology, at MLconf ATL 2016

Page 1: Le Song, Assistant Professor, College of Computing, Georgia Institute of Technology at MLconf ATL 2016

Understanding Deep Learning for Big Data

Le Song, http://www.cc.gatech.edu/~lsong/
College of Computing, Georgia Institute of Technology

Page 2

AlexNet: deep convolutional neural networks

[Figure: AlexNet architecture. A 224×224×3 input image passes through five convolution layers (filter sizes 11×11, 5×5, 3×3, 3×3, 3×3; feature maps 55×55×96 → 27×27×256 → 13×13×384 → 13×13×384 → 13×13×256), then two fully connected layers of 4096 units each and a 1000-way output, producing Pr(label | image), e.g. cat/bike/…? Rectified linear unit activations throughout. Convolution layers: 3.7 million parameters; fully connected layers: 58.6 million parameters.]
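The 3.7 million versus 58.6 million split quoted on the slide can be checked directly from the standard single-tower AlexNet layer shapes (a quick sanity check; biases are omitted):

```python
# Parameter counts for the standard (single-tower) AlexNet layer shapes.
# Conv layers: (filter_h, filter_w, in_channels, out_channels)
conv_layers = [
    (11, 11, 3, 96),    # conv1: 224x224x3 -> 55x55x96
    (5, 5, 96, 256),    # conv2: -> 27x27x256
    (3, 3, 256, 384),   # conv3: -> 13x13x384
    (3, 3, 384, 384),   # conv4: -> 13x13x384
    (3, 3, 384, 256),   # conv5: -> 13x13x256
]
conv_params = sum(h * w * cin * cout for h, w, cin, cout in conv_layers)

# Fully connected layers: 6x6x256 pooled features -> 4096 -> 4096 -> 1000
fc_layers = [(6 * 6 * 256, 4096), (4096, 4096), (4096, 1000)]
fc_params = sum(n_in * n_out for n_in, n_out in fc_layers)

print(conv_params)  # 3745824  (~3.7 million)
print(fc_params)    # 58621952 (~58.6 million)
```

Almost all of AlexNet's weights sit in the fully connected layers, which is what motivates the experiments that follow.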

Page 3

A benchmark image classification problem: ~1.3 million examples, ~1 thousand classes.

Page 4

Training is end-to-end: minimize the negative log-likelihood of Pr(label | image) over the data points using (stochastic) gradient descent.

AlexNet achieves ~40% top-1 error.
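The training recipe on the slide (minimize the negative log-likelihood with stochastic gradient descent) can be sketched on a toy linear softmax model standing in for the full network; everything here (data, sizes, learning rate) is made up for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, k = 200, 20, 5                     # toy data: 200 points, 20 features, 5 classes
X = rng.normal(size=(n, d))
true_W = rng.normal(size=(d, k))
y = (X @ true_W).argmax(axis=1)          # labels a linear model can actually fit
W = np.zeros((d, k))

def neg_log_likelihood(W):
    logits = X @ W
    logits = logits - logits.max(axis=1, keepdims=True)   # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -log_probs[np.arange(n), y].mean()

for step in range(500):                  # stochastic gradient descent on mini-batches
    idx = rng.integers(0, n, size=32)
    logits = X[idx] @ W
    logits -= logits.max(axis=1, keepdims=True)
    p = np.exp(logits)
    p /= p.sum(axis=1, keepdims=True)
    p[np.arange(32), y[idx]] -= 1.0      # d(NLL)/d(logits) = softmax - one_hot
    W -= 0.1 * X[idx].T @ p / 32         # gradient step

print(neg_log_likelihood(W) < np.log(k)) # True: below the uniform predictor's NLL
```

The real AlexNet objective is the same cross-entropy loss, just with the deep network producing the logits.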

Page 5

Traditional image features are not learned end-to-end. The pipeline: divide the image into patches, run a handcrafted feature extractor (e.g., SIFT), combine the features, and learn a classifier on top.

Page 6

Deep learning is not fully understood.

[Figure: the AlexNet diagram again (224×224×3 input, five convolution layers with rectified linear units, two 4096-unit fully connected layers, 1000-way output of Pr(label | image); 3.7 million convolution parameters, 58.6 million fully connected parameters), annotated with the questions: Fully connected layers crucial? Convolution layers crucial? Training end-to-end important?]

Page 7

Experiments

1. Fully connected layers crucial?
2. Convolution layers crucial?
3. Learning parameters end-to-end crucial?

Page 8

Kernel methods: an alternative nonlinear model, built as a combination of random basis functions:

f(x) = Σ_{i=1}^{7} α_i exp(−‖w_i − x‖²)

with weights α_1, …, α_7 on Gaussian basis functions centered at w_1, …, w_7, evaluated at the input x.

[Dai et al. NIPS 14]
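The basis expansion above is just a weighted sum of Gaussian bumps; a minimal sketch (the centers w_i, weights α_i, and dimensions are illustrative, not from the talk):

```python
import numpy as np

rng = np.random.default_rng(0)
d = 3
w = rng.normal(size=(7, d))        # 7 random basis-function centers w_i
alpha = rng.normal(size=7)         # weights alpha_i

def f(x):
    """f(x) = sum_i alpha_i * exp(-||w_i - x||^2)."""
    return alpha @ np.exp(-np.sum((w - x) ** 2, axis=1))

x = rng.normal(size=d)
print(f(x))                        # a scalar prediction at the query point x
```

Training such a model means fitting the weights (and, as later slides argue, ideally the basis functions themselves).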

Page 9

Replace the fully connected layers by kernel methods. Three models:

I. Jointly trained neural net (AlexNet): learn the convolution layers and the fully connected layers together to predict Pr(label | image).
II. Fixed neural net: fix the convolution layers, learn only the fully connected layers.
III. Scalable kernel method [Dai et al. NIPS 14]: fix the convolution layers, learn a kernel machine in place of the fully connected layers.
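Model III scales the kernel machine by replacing the exact kernel with random features, trained with stochastic gradients over both data and features (the "doubly" stochastic idea of [Dai et al. NIPS 14]). A minimal random-Fourier-feature sketch with made-up dimensions:

```python
import numpy as np

rng = np.random.default_rng(0)
d, D = 5, 4096                      # input dim, number of random features

# Random Fourier features: phi(x)^T phi(y) approximates the Gaussian kernel
# exp(-||x - y||^2 / 2) when the frequencies are drawn from N(0, I).
Omega = rng.normal(size=(D, d))
b = rng.uniform(0.0, 2.0 * np.pi, size=D)

def phi(x):
    return np.sqrt(2.0 / D) * np.cos(Omega @ x + b)

x, y = rng.normal(size=d), rng.normal(size=d)
approx = phi(x) @ phi(y)
exact = np.exp(-np.sum((x - y) ** 2) / 2.0)
print(abs(approx - exact) < 0.1)    # True: the feature map tracks the kernel
```

With this approximation the kernel machine is just a linear model on phi(x), so it trains with the same stochastic gradient machinery as the neural nets.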

Page 10

Learn classifiers from a benchmark subset of ~1.3 million examples and ~1 thousand classes.

Page 11

The kernel machine learns faster. ImageNet: 1.3M original images, 1000 classes; images are randomly cropped and mirrored in a streaming fashion.

[Figure: test top-1 error (%) against the number of training samples (10^5 to 10^8) for the jointly-trained neural net, the fixed neural net, and doubly SGD; the three curves reach final errors of 47.8%, 44.5%, and 42.6%. Training took 1 week on a GPU. Random guessing gives 99.9% error.]

Page 12

Similar results with MNIST8M: classification of handwritten digits, 8M images, 10 classes. (The neural net here is LeNet5.)

Page 13

Similar results with CIFAR10: classification of internet images, 60K images, 10 classes.

Page 14

Experiments

1. Fully connected layers crucial? No
2. Convolution layers crucial?
3. Learning parameters end-to-end crucial?

Page 15

Kernel methods directly on the inputs?

[Figure: bar charts of test error for kernel methods on fixed convolution features versus directly on the raw inputs (without convolution): MNIST (2 convolution layers, errors on a 0-1.2% scale), CIFAR10 (2 convolution layers, 0-40% scale), ImageNet (5 convolution layers, 0-100% scale). Dropping the convolution layers increases error substantially on every dataset.]

Page 16

Kernel methods + random convolutions?

[Figure: the same bar charts with a random-convolution condition added, comparing fixed convolution, random convolution, and no convolution on MNIST and CIFAR10 (2 convolution layers each), while varying the number of random versus fixed convolution filters.]

Page 17

Structured composition is useful. What matters is not just fully connected layers and plain composition, but a structured composition of nonlinear functions, which can represent the same function far more economically.
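One concrete way to see the value of structure: the linear part of a convolution layer is sparse and weight-shared, so it needs vastly fewer parameters than a dense layer mapping between the same shapes. A quick count, borrowing AlexNet's 13×13×256 feature-map shape:

```python
# A 3x3 convolution from 256 channels to 256 channels on a 13x13 feature map,
# versus a dense layer between the same input and output sizes (biases omitted).
h = w = 13
c_in = c_out = 256

conv_params = 3 * 3 * c_in * c_out               # shared across all 13x13 positions
dense_params = (h * w * c_in) * (h * w * c_out)  # one weight per input-output pair

print(conv_params)                   # 589824
print(dense_params // conv_params)   # 3173: the dense layer needs ~3000x more weights
```

Both layers compute a linear map followed by a nonlinearity; the convolution simply constrains that map to be local and translation-shared.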

Page 18

Experiments

1. Fully connected layers crucial? No
2. Convolution layers crucial? Yes
3. Learning parameters end-to-end crucial?

Page 19

Lots of random features used.

[Figure: AlexNet, with 58M parameters in its fully connected layers, reaches 42.6% error; the scalable kernel method on the fixed 13×13×256 convolution features uses 131K random features feeding the 1000-way classifier, for 131M parameters, and reaches 44.5% error.]

Page 20

Are 131M parameters needed?

[Figure: the same comparison with only 32K random features (32M parameters): the scalable kernel method's error rises to 50.0%, against AlexNet's 42.6% with 58M parameters.]

Page 21

Basis function adaptation is crucial. Barron '93 analyzes the integrated squared approximation error of basis-function expansions: adapting the basis functions gives a much smaller error than any fixed basis (roughly, n adaptive basis functions achieve an O(1/n) rate independent of dimension, while fixed bases suffer a rate that degrades with input dimension).

Fixed basis functions:
f(x) = Σ_{i=1}^{7} α_i k(x_i, x)
with weights α_1, …, α_7 on kernels k(x_i, x) centered at fixed points x_1, …, x_7.

Adapted basis functions:
f(x) = Σ_{i=1}^{2} α_i k_{θ_i}(x_i, x)
where each basis function's own parameters θ_i are also learned, so far fewer basis functions (here two, at x_1 and x_2) represent the same function.
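A toy 1-D illustration of basis adaptation (setup entirely made up): gradient descent can move an adaptive Gaussian basis function's center onto the target bump, something no single fixed-center basis function can do:

```python
import numpy as np

xs = np.linspace(-2.0, 2.0, 401)           # evaluation grid
target_c = 0.5
target = np.exp(-(xs - target_c) ** 2)     # target: one Gaussian bump at 0.5

c = 0.0                                    # adaptive center, initialized off-target
for _ in range(500):
    phi = np.exp(-(xs - c) ** 2)
    resid = phi - target
    grad = np.sum(2 * resid * phi * 2 * (xs - c))  # d/dc of the squared error
    c -= 0.001 * grad                      # gradient step on the basis parameter

print(abs(c - 0.5) < 0.01)                 # True: the center lands on the target
```

A fixed basis would instead need many centers scattered over the domain to cover where the target might be; adaptation spends its parameters only where they are needed.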

Page 22

Learning the random features helps a lot.

[Figure: with 32K random features and basis adaptation (learned features), the scalable kernel method's 32M parameters reach 43.7% error, against AlexNet's 42.6% with 58M parameters.]

Page 23

Learning the convolutions together helps even more.

[Figure: with 32K adapted features and the convolution layers learned jointly, the scalable kernel method's 32M parameters reach 41.9% error, beating AlexNet's 42.6% with 58M parameters.]

Page 24

Lesson learned: exploit structure and train end-to-end.

Next topic: deep learning over (time-varying) graphs.

Pages 25-30

Co-evolutionary features

[Figure, built up over six slides: a bipartite graph of users (Christine, Alice, David, Jacob) and items, with interactions arriving over time (02/02, 03/02, 06/02, 07/02, 09/02, …). Each interaction updates an item embedding and a user embedding: user-item interactions evolve over time, and the two sets of embeddings co-evolve.]

Page 31

Co-evolutionary embedding

Each interaction event is a tuple (u_n, i_n, t_n, q_n): user, item, time, and interaction features. Item embeddings are initialized from item raw profile features and user embeddings from user raw profile features. At each event, both are updated:

U2I (update the item embedding):
f_{i_n}(t_n) = h( V_1 · f_{i_n}(t_n^-)      [item evolution]
                + V_2 · f_{u_n}(t_n^-)      [co-evolution: user drives item]
                + V_3 · q_n                 [interaction context]
                + V_4 · (t_n - t_{n-1}) )   [temporal drift]

I2U (update the user embedding):
f_{u_n}(t_n) = h( W_1 · f_{u_n}(t_n^-)      [user evolution]
                + W_2 · f_{i_n}(t_n^-)      [co-evolution: item drives user]
                + W_3 · q_n                 [interaction context]
                + W_4 · (t_n - t_{n-1}) )   [temporal drift]

[Dai et al. RecSys 16]
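The two update rules can be sketched directly in numpy. Everything below (random parameter matrices, tanh as the nonlinearity h, the toy event stream, and the simplified global time difference) is an illustrative assumption, not from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)
d, q_dim = 8, 4                       # embedding dim, interaction-feature dim
h = np.tanh                           # nonlinearity h (an illustrative choice)

# Parameters: V_* update item embeddings, W_* update user embeddings.
V1, V2 = rng.normal(size=(d, d)) * 0.1, rng.normal(size=(d, d)) * 0.1
V3, V4 = rng.normal(size=(d, q_dim)) * 0.1, rng.normal(size=(d, 1)) * 0.1
W1, W2 = rng.normal(size=(d, d)) * 0.1, rng.normal(size=(d, d)) * 0.1
W3, W4 = rng.normal(size=(d, q_dim)) * 0.1, rng.normal(size=(d, 1)) * 0.1

users = {u: rng.normal(size=d) for u in ["Christine", "Alice", "David", "Jacob"]}
items = {i: rng.normal(size=d) for i in ["item_a", "item_b"]}

# Event stream (u_n, i_n, t_n), processed in temporal order.
events = [("Alice", "item_a", 1.0), ("David", "item_a", 2.0), ("Alice", "item_b", 3.5)]
t_prev = 0.0
for u, i, t in events:
    q = rng.normal(size=q_dim)        # interaction features q_n (made up)
    f_u, f_i = users[u], items[i]     # embeddings just before t_n
    dt = np.array([t - t_prev])       # time gap (simplified to a global clock here)
    # U2I: the item evolves, driven by the interacting user (co-evolution).
    items[i] = h(V1 @ f_i + V2 @ f_u + V3 @ q + V4 @ dt)
    # I2U: the user evolves, driven by the interacted item.
    users[u] = h(W1 @ f_u + W2 @ f_i + W3 @ q + W4 @ dt)
    t_prev = t

print(users["Alice"].shape)           # (8,)
```

Because each event reads the other side's pre-event embedding, the user and item trajectories are genuinely coupled, which is the point of the co-evolution terms.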

Page 32

Deep learning with a time-varying computation graph.

[Figure: interaction events on a timeline at t_0, t_1, t_2, t_3, grouped into mini-batch 1.]

The computation graph of the RNN is determined by:
1. The bipartite interaction graph
2. The temporal ordering of events

Page 33

Much improved prediction on the Reddit dataset (1,000 users, 1,403 groups, ~10K interactions).

Next-item prediction is measured by MAR (mean absolute rank difference); return-time prediction by MAE (mean absolute error, in hours).

Page 34

Predicting the efficiency of organic solar panel materials.

Dataset: Harvard Clean Energy Project
  Data points:        2.3 million
  Type:               molecule
  Atom types:         6
  Avg. # nodes:       28
  Avg. # edges:       33
  Prediction target:  power conversion efficiency (PCE), 0-12%

Page 35

Structure2Vec

[Figure: a latent variable model over the input graph χ, with observed node features X_1, …, X_6 and latent variables H_1, …, H_6. Each node i carries an embedding μ_i^(t), starting from μ_i^(0) and updated over iterations t = 1, …, T by combining its own features with its neighbors' current embeddings. After iteration T, the node embeddings μ_1^(T), μ_2^(T), … are aggregated (summed) into a graph embedding μ(W, χ), which feeds a classifier/regressor on the label with its own parameters.]

[Dai et al. ICML 16]
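The embedding iteration can be sketched in numpy: each node embedding μ_i is repeatedly refreshed from the node's own features and its neighbors' current embeddings, then summed into the graph embedding. The graph, matrices, and nonlinearity below are placeholders, not values from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)
# A toy 6-node "molecule": adjacency list and one-hot node features (atom types).
neighbors = {0: [1], 1: [0, 2, 3], 2: [1], 3: [1, 4, 5], 4: [3], 5: [3]}
X = np.eye(6)[[0, 1, 0, 1, 2, 2]]      # 6 nodes, atom type as a one-hot row

d, T = 16, 4                           # embedding dim, number of iterations
W1 = rng.normal(size=(d, X.shape[1])) * 0.1
W2 = rng.normal(size=(d, d)) * 0.1

mu = np.zeros((6, d))                  # mu_i^(0) = 0
for _ in range(T):
    new_mu = np.empty_like(mu)
    for i in range(6):
        msg = mu[neighbors[i]].sum(axis=0)          # aggregate neighbor embeddings
        new_mu[i] = np.tanh(W1 @ X[i] + W2 @ msg)   # mu_i^(t+1)
    mu = new_mu

graph_embedding = mu.sum(axis=0)       # aggregate: mu(W, chi) = sum_i mu_i^(T)
print(graph_embedding.shape)           # (16,)
```

After T iterations each node's embedding has seen information from nodes up to T hops away, so the summed graph embedding reflects the molecule's structure, not just its atom counts.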

Page 36

Improved prediction with a small model: structure2vec gets ~4% relative error with a 10,000 times smaller model!

                  Test MAE   Test RMSE   # parameters
Mean predictor    1.986      2.406      1
WL level-3        0.143      0.204      1.6m
WL level-6        0.096      0.137      1378m
structure2vec     0.085      0.117      0.1m

(10% of the data held out for testing.)

Page 37

Take-home message:

Deep fully connected layers are not the key.
Exploit structure (CNN, co-evolution, structure2vec).
Train end-to-end.