
Large-Scale Machine Learning

Shan-Hung Wu, shwu@cs.nthu.edu.tw

Department of Computer Science, National Tsing Hua University, Taiwan

Machine Learning

Shan-Hung Wu (CS, NTHU) Large-Scale ML Machine Learning 1 / 29

Outline

1 When ML Meets Big Data

2 Representation Learning

3 Curse of Dimensionality

4 Trade-Offs in Large-Scale Learning

5 SGD-Based Optimization

Shan-Hung Wu (CS, NTHU) Large-Scale ML Machine Learning 2 / 29



The Big Data Era

Today, more and more of our activities are recorded by ubiquitous computing devices.

Networked computers make it easy to centralize these records and curate them into a big dataset.

Large-scale machine learning techniques solve problems by leveraging the a posteriori knowledge learned from this big data.

Shan-Hung Wu (CS, NTHU) Large-Scale ML Machine Learning 4 / 29

Characteristics of Big Data

Shan-Hung Wu (CS, NTHU) Large-Scale ML Machine Learning 5 / 29


Challenges of Large-Scale ML

Variety and veracity: feature engineering gets even harder; transfer learning helps.

Volume: large D (curse of dimensionality); large N (training efficiency).

Velocity: online learning.

Shan-Hung Wu (CS, NTHU) Large-Scale ML Machine Learning 6 / 29


The Rise of Deep Learning

Neural Networks (NNs) that go deep.

Feature engineering: learned automatically (a kind of representation learning).

Curse of dimensionality: countered by the exponential gains of deep, distributed representations.

Training efficiency: SGD and GPU-based parallelism.

Supports online & transfer learning.

Shan-Hung Wu (CS, NTHU) Large-Scale ML Machine Learning 7 / 29


Is Deep Learning a Panacea?

"I have big data, so I have to use deep learning."

Wrong! By the no free lunch theorem, there is no single ML algorithm that outperforms all others in every domain.

Deep learning is more useful when the function f to learn is complex (nonlinear in the input dimensions) and has repeating patterns, e.g., image recognition and natural language processing.

For simple (linear) f, there are specialized large-scale ML techniques (e.g., LIBLINEAR [4]) that are much more efficient, e.g., for text classification (sketched below).

Shan-Hung Wu (CS, NTHU) Large-Scale ML Machine Learning 8 / 29
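To make the linear case concrete, here is a minimal text-classification sketch. It assumes scikit-learn (whose LinearSVC is backed by LIBLINEAR [4]) and a made-up four-document corpus; none of these specifics come from the slides.

```python
# A toy linear text classifier; scikit-learn's LinearSVC is backed by LIBLINEAR [4].
# The corpus and labels below are made up purely for illustration.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC
from sklearn.pipeline import make_pipeline

docs = ["cheap meds buy now", "meeting at noon tomorrow",
        "limited offer buy cheap", "lunch with the team at noon"]
labels = [1, 0, 1, 0]  # 1 = spam, 0 = ham

# Sparse TF-IDF features + a linear SVM: fast even when D (the vocabulary) is huge.
clf = make_pipeline(TfidfVectorizer(), LinearSVC())
clf.fit(docs, labels)

print(clf.predict(["buy cheap offer now", "see you at the meeting"]))  # e.g., [1 0]
```

Sparse linear models like this scale to millions of documents and features without any deep architecture, which is the point of the slide.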

Outline

1 When ML Meets Big Data

2 Representation Learning

3 Curse of Dimensionality

4 Trade-Offs in Large-Scale Learning

5 SGD-Based Optimization

Shan-Hung Wu (CS, NTHU) Large-Scale ML Machine Learning 9 / 29


Representation Learning

Gray boxes (in the slide's figure) are learned automatically.

Deep learning maps the most abstract (deepest) features to the output; usually, a simple linear function suffices.

In deep learning, features/representations are distributed.

Shan-Hung Wu (CS, NTHU) Large-Scale ML Machine Learning 10 / 29


Distributed Representations of Data

In deep learning, we assume that the x's were generated by a composition of factors, potentially at multiple levels in a hierarchy.

E.g., layer 3: face = 0.3 [corner] + 0.7 [circle] + 0 [curve], where [.] is a predefined non-linear function and the weights (arrows) are learned from the training examples.

Given x, the factors at the same level output a layer of features of x; e.g., layer 2 outputs 1, 2, and 0.5 for [corner], [circle], and [curve], respectively.

These features are fed into the factors at the next (deeper) level, so face = 0.3 * 1 + 0.7 * 2 (see the sketch below).

Shan-Hung Wu (CS, NTHU) Large-Scale ML Machine Learning 11 / 29
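To see the composition numerically, a toy NumPy sketch: the layer-2 feature values and the layer-3 weights come from the slide, while the ReLU standing in for the predefined non-linear function [.] is an arbitrary choice here.

```python
import numpy as np

# Layer-2 feature values taken from the slide: [corner]=1, [circle]=2, [curve]=0.5.
layer2 = np.array([1.0, 2.0, 0.5])

# Layer-3 weights from the slide: face = 0.3*[corner] + 0.7*[circle] + 0*[curve].
w_face = np.array([0.3, 0.7, 0.0])

# [.] is "a predefined non-linear function" on the slide; ReLU is a stand-in.
def factor(weights, features):
    return np.maximum(0.0, weights @ features)

print(factor(w_face, layer2))  # 0.3*1 + 0.7*2 = 1.7
```

The same small set of factors is reused in many combinations, which is what makes the representation "distributed".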


Transfer Learning

Transfer learning: reuse the knowledge learned from one task to help another task.

In deep learning, it is common to reuse the feature-extraction network from one task in another; the reused weights may be further updated (fine-tuned) when training the model on the new task (see the sketch below).

Shan-Hung Wu (CS, NTHU) Large-Scale ML Machine Learning 12 / 29
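A minimal sketch of this reuse pattern, assuming TensorFlow/Keras and an ImageNet-pretrained MobileNetV2 backbone; the specific backbone, the 10-class head, and the optimizer are illustrative choices, not part of the slides.

```python
import tensorflow as tf

# Reuse a feature-extraction network trained on ImageNet (the source task)...
base = tf.keras.applications.MobileNetV2(
    input_shape=(160, 160, 3), include_top=False, weights="imagenet", pooling="avg")
base.trainable = False  # freeze the transferred weights at first

# ...and attach a new head for the target task (10 classes here, chosen arbitrarily).
model = tf.keras.Sequential([
    base,
    tf.keras.layers.Dense(10, activation="softmax"),
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")

# After the new head converges, the transferred weights may be further updated, as the
# slide notes: set base.trainable = True, re-compile with a small learning rate, and
# continue training (fine-tuning).
```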

Outline

1 When ML Meets Big Data

2 Representation Learning

3 Curse of Dimensionality

4 Trade-Offs in Large-Scale Learning

5 SGD-Based Optimization

Shan-Hung Wu (CS, NTHU) Large-Scale ML Machine Learning 13 / 29


Curse of Dimensionality

Most classic nonlinear ML models find $\boldsymbol{\theta}$ by assuming function smoothness: if $\mathbf{x} \approx \mathbf{x}^{(i)} \in \mathbb{X}$, then $f(\mathbf{x};\mathbf{w}) \approx f(\mathbf{x}^{(i)};\mathbf{w})$.

E.g., the nonlinear SVM predicts the label $y$ of $\mathbf{x}$ by simply interpolating the labels of the support vectors $\mathbf{x}^{(i)}$'s close to $\mathbf{x}$ (see the sketch below):
$$\hat{y} = \sum_i \alpha_i\, y^{(i)}\, k(\mathbf{x}^{(i)},\mathbf{x}) + b, \quad \text{where } k(\mathbf{x}^{(i)},\mathbf{x}) = \exp(-\gamma \|\mathbf{x}^{(i)}-\mathbf{x}\|^2)$$

Supposing $f$ is smooth within a bin, we need exponentially more examples to get a good interpolation as $D$ increases.

Shan-Hung Wu (CS, NTHU) Large-Scale ML Machine Learning 14 / 29
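The interpolation view can be made concrete in a few lines of NumPy; the support vectors, dual coefficients, bias, and γ below are made-up toy values, not anything from the slides.

```python
import numpy as np

# Toy "support vectors" x^(i), their labels y^(i), dual coefficients a_i, and bias b.
X_sv = np.array([[0.0, 0.0], [1.0, 1.0], [2.0, 0.0]])
y_sv = np.array([1, -1, 1])
alpha = np.array([0.5, 0.8, 0.3])
b, gamma = 0.1, 1.0

def rbf_svm_predict(x):
    # k(x^(i), x) = exp(-gamma * ||x^(i) - x||^2): nearby support vectors dominate.
    k = np.exp(-gamma * np.sum((X_sv - x) ** 2, axis=1))
    return np.sum(alpha * y_sv * k) + b  # y = sum_i a_i y^(i) k(x^(i), x) + b

print(rbf_svm_predict(np.array([0.1, 0.1])))  # close to the first support vector -> positive
```

Each prediction is a locally weighted vote; covering a D-dimensional input space densely enough with such neighbors is exactly what becomes exponentially expensive as D grows.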


Exponential Gains from Depth

In deep learning, a deep factor is defined by "reusing" the shallow ones, e.g., face = 0.3 [corner] + 0.7 [circle].

With a shallow structure, a deep factor has to be expanded into exponentially many factors, e.g., face = 0.3 [0.5 [vertical] + 0.5 [horizontal]] + 0.7 [...], giving exponentially more weights to learn and requiring more training data.

Exponential gains from depth counter the exponential challenges posed by the curse of dimensionality (see the counting sketch below).

Shan-Hung Wu (CS, NTHU) Large-Scale ML Machine Learning 15 / 29
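As a rough, back-of-the-envelope counting sketch (my own illustration, not a result from the slides): with F factors per layer and L layers, a deep network that reuses factors has on the order of L·F² weights, whereas flattening the same L-level composition into a single shallow layer needs roughly one unit per path through the hierarchy, i.e., on the order of F^L.

```python
# Back-of-the-envelope parameter counts for the reuse argument (illustrative only).
def deep_weights(F, L):
    # Each of the L layers connects its F factors to the F factors below it.
    return L * F * F

def shallow_units(F, L):
    # Flattening an L-level composition: roughly one unit per path through the hierarchy.
    return F ** L

F = 10
for L in range(2, 6):
    print(f"L={L}: deep ~ {deep_weights(F, L):>6} weights, "
          f"shallow ~ {shallow_units(F, L):>8} units")
```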

Outline

1 When ML Meets Big Data

2 Representation Learning

3 Curse of Dimensionality

4 Trade-Offs in Large-Scale Learning

5 SGD-Based Optimization

Shan-Hung Wu (CS, NTHU) Large-Scale ML Machine Learning 16 / 29


Learning Theory Revisited

How do we learn a function $f_N$ from $N$ examples $\mathbb{X}$ that is close to the true function $f^*$?

Empirical risk: $C_N[f_N] = \frac{1}{N}\sum_{i=1}^{N} \mathrm{loss}(f_N(\mathbf{x}^{(i)}), y^{(i)})$

Expected risk: $C[f] = \int \mathrm{loss}(f(\mathbf{x}), y)\, dP(\mathbf{x}, y)$

Let $f^* = \arg\min_f C[f]$ be the true function (our goal).

Since we are seeking a function in a model (hypothesis space) $\mathcal{F}$, the best we can have is $f^*_{\mathcal{F}} = \arg\min_{f \in \mathcal{F}} C[f]$.

But our objective only minimizes the errors on the limited examples, so we only have $f_N = \arg\min_{f \in \mathcal{F}} C_N[f]$.

The excess error $\mathcal{E} = C[f_N] - C[f^*]$ decomposes as:
$$\mathcal{E} = \underbrace{C[f^*_{\mathcal{F}}] - C[f^*]}_{\mathcal{E}_{\mathrm{app}}} + \underbrace{C[f_N] - C[f^*_{\mathcal{F}}]}_{\mathcal{E}_{\mathrm{est}}}$$

(The empirical and expected risks are contrasted in the sketch below.)

Shan-Hung Wu (CS, NTHU) Large-Scale ML Machine Learning 17 / 29
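To ground the two risks, a small Monte Carlo sketch with an entirely made-up setup (a one-parameter linear model and squared loss): the empirical risk averages the loss over the N training examples, and the expected risk is approximated on a much larger independent sample from the same distribution.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample(n):
    # Toy data-generating distribution P(x, y): y = 3x + Gaussian noise.
    x = rng.uniform(-1, 1, size=n)
    y = 3.0 * x + rng.normal(scale=0.5, size=n)
    return x, y

def loss(pred, y):
    return (pred - y) ** 2  # squared loss

# Fit f_N on N examples by least squares (the empirical-risk minimizer in F = {f(x) = w*x}).
N = 50
x_train, y_train = sample(N)
w = np.sum(x_train * y_train) / np.sum(x_train ** 2)

empirical_risk = np.mean(loss(w * x_train, y_train))   # C_N[f_N]
x_big, y_big = sample(1_000_000)
expected_risk = np.mean(loss(w * x_big, y_big))        # Monte Carlo estimate of C[f_N]

print(f"C_N[f_N] = {empirical_risk:.3f}, C[f_N] ~ {expected_risk:.3f}")
```

The gap between the two numbers is exactly what the estimation error term tries to capture.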


Excess Error

Wait, we may not have enough training time, so we stop the iterations early and get $\tilde{f}_N$, where $C_N[\tilde{f}_N] \leq C_N[f_N] + \rho$.

The excess error becomes $\mathcal{E} = C[\tilde{f}_N] - C[f^*]$:
$$\mathcal{E} = \underbrace{C[f^*_{\mathcal{F}}] - C[f^*]}_{\mathcal{E}_{\mathrm{app}}} + \underbrace{C[f_N] - C[f^*_{\mathcal{F}}]}_{\mathcal{E}_{\mathrm{est}}} + \underbrace{C[\tilde{f}_N] - C[f_N]}_{\mathcal{E}_{\mathrm{opt}}}$$

Approximation error $\mathcal{E}_{\mathrm{app}}$: reduced by choosing a larger model.

Estimation error $\mathcal{E}_{\mathrm{est}}$: reduced by (1) increasing $N$, or (2) choosing a smaller model [2, 5, 7].

Optimization error $\mathcal{E}_{\mathrm{opt}}$: reduced by (1) running the optimization algorithm longer (with a smaller $\rho$), or (2) choosing a more efficient optimization algorithm.

Shan-Hung Wu (CS, NTHU) Large-Scale ML Machine Learning 18 / 29


Minimizing Excess Error

$$\mathcal{E} = \underbrace{C[f^*_{\mathcal{F}}] - C[f^*]}_{\mathcal{E}_{\mathrm{app}}} + \underbrace{C[f_N] - C[f^*_{\mathcal{F}}]}_{\mathcal{E}_{\mathrm{est}}} + \underbrace{C[\tilde{f}_N] - C[f_N]}_{\mathcal{E}_{\mathrm{opt}}}$$

Small-scale ML tasks: mainly constrained by $N$. Computing time is not an issue, so $\mathcal{E}_{\mathrm{opt}}$ can be made insignificant by choosing a small $\rho$; the size of the hypothesis space is what matters, balancing the trade-off between $\mathcal{E}_{\mathrm{app}}$ and $\mathcal{E}_{\mathrm{est}}$.

Large-scale ML tasks: mainly constrained by time (significant $\mathcal{E}_{\mathrm{opt}}$), so SGD is preferred. $N$ is large, so $\mathcal{E}_{\mathrm{est}}$ can be reduced, and a large model is preferred to reduce $\mathcal{E}_{\mathrm{app}}$.

Shan-Hung Wu (CS, NTHU) Large-Scale ML Machine Learning 19 / 29


Big Data + Big Models

(From the slide's figure legend: COTS HPC unsupervised convolutional network [3]; GoogLeNet [6].)

With domain-specific architectures such as convolutional NNs (CNNs) and recurrent NNs (RNNs).

Shan-Hung Wu (CS, NTHU) Large-Scale ML Machine Learning 20 / 29

Outline

1 When ML Meets Big Data

2 Representation Learning

3 Curse of Dimensionality

4 Trade-Offs in Large-Scale Learning

5 SGD-Based Optimization

Shan-Hung Wu (CS, NTHU) Large-Scale ML Machine Learning 21 / 29


Stochastic Gradient Descent

Gradient Descent (GD):

    $\mathbf{w}^{(0)} \leftarrow$ a random vector;
    Repeat until convergence {
        $\mathbf{w}^{(t+1)} \leftarrow \mathbf{w}^{(t)} - \eta \nabla_{\mathbf{w}} C_N(\mathbf{w}^{(t)}; \mathbb{X})$;
    }

Needs to scan the entire dataset to take one descent step.

(Mini-Batched) Stochastic Gradient Descent (SGD):

    $\mathbf{w}^{(0)} \leftarrow$ a random vector;
    Repeat until convergence {
        Randomly partition the training set $\mathbb{X}$ into minibatches $\{\mathbb{X}^{(j)}\}_j$;
        For each minibatch $\mathbb{X}^{(j)}$:
            $\mathbf{w}^{(t+1)} \leftarrow \mathbf{w}^{(t)} - \eta \nabla_{\mathbf{w}} C(\mathbf{w}^{(t)}; \mathbb{X}^{(j)})$;
    }

Supports online training (a runnable sketch follows).

Shan-Hung Wu (CS, NTHU) Large-Scale ML Machine Learning 22 / 29
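A minimal runnable sketch of both loops, using NumPy and a toy linear-regression objective of my own choosing (the slides fix neither the model nor the loss):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy dataset: y = X w_true + noise (linear regression with squared loss).
N, D = 1000, 5
X = rng.normal(size=(N, D))
w_true = rng.normal(size=D)
y = X @ w_true + 0.1 * rng.normal(size=N)

def grad(w, Xb, yb):
    # Gradient of the mean squared loss C(w; X_b) on a (mini)batch.
    return 2.0 * Xb.T @ (Xb @ w - yb) / len(yb)

eta, epochs, batch_size = 0.1, 20, 32

# Gradient Descent: one update per full scan of the dataset.
w_gd = np.zeros(D)
for _ in range(epochs):
    w_gd -= eta * grad(w_gd, X, y)

# Mini-batched SGD: one update per minibatch of a random partition.
w_sgd = np.zeros(D)
for _ in range(epochs):
    perm = rng.permutation(N)
    for j in range(0, N, batch_size):
        idx = perm[j:j + batch_size]
        w_sgd -= eta * grad(w_sgd, X[idx], y[idx])

print("GD error :", np.linalg.norm(w_gd - w_true))
print("SGD error:", np.linalg.norm(w_sgd - w_true))
```

Both reach a small error on this toy problem, but SGD makes N/batch_size updates per scan of the data instead of one, which is why it wins when N is huge.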


GD vs. SGD

Is SGD really a better algorithm?

Shan-Hung Wu (CS, NTHU) Large-Scale ML Machine Learning 23 / 29

Yes, If You Have Big Data

Performance is limited by training time

Shan-Hung Wu (CS, NTHU) Large-Scale ML Machine Learning 24 / 29

Asymptotic Analysis [1]

                                   GD                                SGD
Time per iteration                 N                                 1
Iterations to opt. error ρ         log(1/ρ)                          1/ρ
Time to opt. error ρ               N·log(1/ρ)                        1/ρ
Time to excess error ε             (1/ε^{1/α})·log(1/ε),             1/ε
                                   where α ∈ [1/2, 1]

Shan-Hung Wu (CS, NTHU) Large-Scale ML Machine Learning 25 / 29


Parallelizing SGD

Data parallelism: every core trains the full model given a partition of the data. Model parallelism: every core trains a partition of the model given the full data.

Normally, model parallelism exchanges fewer parameters in a large NN and can support more cores.

The effectiveness depends on the setting: CPU speed, communication latency, bandwidth, etc. (A data-parallel sketch follows.)

Shan-Hung Wu (CS, NTHU) Large-Scale ML Machine Learning 26 / 29
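A minimal simulation of the data-parallel variant, reusing the toy linear-regression setup from the earlier SGD sketch; a real system would run the workers on separate cores or devices and exchange gradients over an interconnect.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy linear-regression data, as in the earlier SGD sketch.
N, D, K = 1024, 5, 4          # K = number of simulated workers
X = rng.normal(size=(N, D))
w_true = rng.normal(size=D)
y = X @ w_true + 0.1 * rng.normal(size=N)

def grad(w, Xb, yb):
    return 2.0 * Xb.T @ (Xb @ w - yb) / len(yb)

eta, epochs, batch_size = 0.1, 20, 64
w = np.zeros(D)

for _ in range(epochs):
    perm = rng.permutation(N)
    for j in range(0, N, batch_size):
        idx = perm[j:j + batch_size]
        # Data parallelism: each worker holds the full model w but only a shard of the batch.
        shards = np.array_split(idx, K)
        worker_grads = [grad(w, X[s], y[s]) for s in shards]
        # Synchronous update: average the workers' gradients, then apply one step.
        w -= eta * np.mean(worker_grads, axis=0)

print("error:", np.linalg.norm(w - w_true))
```

With equal-sized shards, the averaged gradient equals the full-minibatch gradient, so this synchronous scheme matches plain minibatch SGD step for step while spreading the computation across workers.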

Reference I

[1] Léon Bottou. Large-scale machine learning with stochastic gradient descent. In Proceedings of COMPSTAT'2010, pages 177–186. Springer, 2010.

[2] Olivier Bousquet. Concentration inequalities and empirical processes theory applied to the analysis of learning algorithms. Ph.D. thesis, Ecole Polytechnique, Palaiseau, France, 2002.

[3] Adam Coates, Brody Huval, Tao Wang, David Wu, Bryan Catanzaro, and Andrew Ng. Deep learning with COTS HPC systems. In Proceedings of the 30th International Conference on Machine Learning, pages 1337–1345, 2013.

Shan-Hung Wu (CS, NTHU) Large-Scale ML Machine Learning 27 / 29

Reference II

[4] Rong-En Fan, Kai-Wei Chang, Cho-Jui Hsieh, Xiang-Rui Wang, and Chih-Jen Lin. LIBLINEAR: A library for large linear classification. Journal of Machine Learning Research, 9(Aug):1871–1874, 2008.

[5] Pascal Massart. Some applications of concentration inequalities to statistics. In Annales de la Faculté des sciences de Toulouse: Mathématiques, volume 9, pages 245–303, 2000.

[6] Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, and Andrew Rabinovich. Going deeper with convolutions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1–9, 2015.

Shan-Hung Wu (CS, NTHU) Large-Scale ML Machine Learning 28 / 29

Reference III

[7] Vladimir N. Vapnik and A. Ya. Chervonenkis. On the uniform convergence of relative frequencies of events to their probabilities. In Measures of Complexity, pages 11–30. Springer, 2015.

Shan-Hung Wu (CS, NTHU) Large-Scale ML Machine Learning 29 / 29