Large-Scale Machine Learning
Shan-Hung Wu
shwu@cs.nthu.edu.tw
Department of Computer Science, National Tsing Hua University, Taiwan
Machine Learning
Outline
1 When ML Meets Big Data
2 Representation Learning
3 Curse of Dimensionality
4 Trade-Offs in Large-Scale Learning
5 SGD-Based Optimization
The Big Data Era

Today, more and more of our activities are recorded by ubiquitous computing devices.
Networked computers make it easy to centralize these records and curate them into a big dataset.
Large-scale machine learning techniques solve problems by leveraging the a posteriori knowledge learned from the big data.
Characteristics of Big Data
Challenges of Large-Scale ML

Variety and veracity: feature engineering gets even harder; calls for transfer learning.
Volume: large D leads to the curse of dimensionality; large N strains training efficiency.
Velocity: calls for online learning.
The Rise of Deep Learning

Neural networks (NNs) that go deep:
Feature engineering: learned automatically (a kind of representation learning).
Curse of dimensionality: countered by the exponential gains of deep, distributed representations.
Training efficiency: SGD and GPU-based parallelism.
Supports online and transfer learning.
Is Deep Learning a Panacea?

"I have big data, so I have to use deep learning." Wrong! By the no-free-lunch theorem, there is no single ML algorithm that outperforms the others in every domain.
Deep learning is more useful when the function f to learn is complex (nonlinear in the input) and has repeating patterns, e.g., in image recognition and natural language processing.
For simple (linear) f, there are specialized large-scale ML techniques (e.g., LIBLINEAR [4]) that are much more efficient, e.g., for text classification.
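As a hedged illustration of the linear route, here is a minimal text-classification sketch assuming scikit-learn, whose LinearSVC is backed by LIBLINEAR [4]; the toy documents and labels are made up:

```python
# Minimal sketch: linear text classification with a LIBLINEAR-backed solver.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC

docs = ["cheap meds online", "meeting at noon",
        "win a free prize", "project status update"]   # toy corpus
labels = [1, 0, 1, 0]                                  # 1 = spam, 0 = ham

X = TfidfVectorizer().fit_transform(docs)  # sparse, high-dimensional features
clf = LinearSVC().fit(X, labels)           # linear f: scales to huge N and D
print(clf.predict(X))
```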
Representation Learning

Gray boxes are learned automatically.
Deep learning maps the most abstract (deepest) features to the output; usually, a simple linear function suffices.
In deep learning, features/representations are distributed.
Distributed Representations of Data

In deep learning, we assume that the x's were generated by a composition of factors, potentially at multiple levels in a hierarchy.
E.g., layer 3: face = 0.3 [corner] + 0.7 [circle] + 0 [curve], where [.] is a predefined nonlinear function and the weights (arrows) are learned from training examples.
Given x, the factors at the same level output a layer of features of x; e.g., layer 2 outputs 1, 2, and 0.5 for [corner], [circle], and [curve], respectively.
These features are fed into the factors at the next (deeper) level: face = 0.3 * 1 + 0.7 * 2.
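A toy sketch of this layered composition, assuming numpy, with ReLU standing in for the predefined nonlinearity [.] and the weights taken from the example above:

```python
import numpy as np

relu = lambda z: np.maximum(z, 0.0)      # stand-in for the nonlinearity [.]

layer2 = np.array([1.0, 2.0, 0.5])       # features for [corner], [circle], [curve]
w_face = np.array([0.3, 0.7, 0.0])       # learned weights into the "face" factor

face = relu(w_face @ layer2)             # 0.3*1 + 0.7*2 + 0*0.5 = 1.7
print(face)
```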
Transfer Learning

Transfer learning: reuse the knowledge learned from one task to help another task.
In deep learning, it is common to reuse the feature extraction network from one task in another.
The weights may be further updated when training the model on the new task.
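A minimal sketch of this reuse pattern, assuming PyTorch; the layer sizes and the two tasks are illustrative, not from the slides:

```python
import torch.nn as nn

features = nn.Sequential(                 # feature-extraction network from task A
    nn.Linear(784, 256), nn.ReLU(),
    nn.Linear(256, 64), nn.ReLU(),
)
# ... after training on task A, reuse the extractor for task B:
for p in features.parameters():
    p.requires_grad = False               # freeze, or leave True to keep updating
model_b = nn.Sequential(features, nn.Linear(64, 10))  # new head for task B
```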
Curse of Dimensionality

Most classic nonlinear ML models find $\theta$ by assuming function smoothness: if $x \approx x^{(i)} \in \mathbb{X}$, then $f(x; w) \approx f(x^{(i)}; w)$.
E.g., the nonlinear SVM predicts the label $y$ of $x$ by simply interpolating the labels of the support vectors $x^{(i)}$ close to $x$:
$$y = \sum_i \alpha_i y^{(i)} k(x^{(i)}, x) + b, \quad \text{where } k(x^{(i)}, x) = \exp(-\gamma \|x^{(i)} - x\|^2)$$
Supposing $f$ is smooth within a bin, we need exponentially more examples to get a good interpolation as $D$ increases.
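A numeric sketch of this kernel interpolation, assuming numpy; the support vectors, coefficients, and bias are toy values, not from the slides:

```python
import numpy as np

def rbf(u, v, gamma=1.0):                  # k(x_i, x) = exp(-gamma * ||x_i - x||^2)
    return np.exp(-gamma * np.sum((u - v) ** 2))

sv = np.array([[0.0, 0.0], [1.0, 1.0]])    # support vectors x^(i)
alpha_y = np.array([0.8, -0.5])            # alpha_i * y^(i)
b = 0.1

x = np.array([0.2, 0.1])
y_hat = sum(a * rbf(s, x) for a, s in zip(alpha_y, sv)) + b
print(np.sign(y_hat))                      # interpolated label of x
```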
Exponential Gains from Depth

In deep learning, a deep factor is defined by "reusing" the shallow ones: face = 0.3 [corner] + 0.7 [circle].
With a shallow structure, a deep factor needs to be replaced by exponentially many factors: face = 0.3 [0.5 [vertical] + 0.5 [horizontal]] + 0.7 [ ... ], giving exponentially more weights to learn and requiring more training data.
Exponential gains from depth counter the exponential challenges posed by the curse of dimensionality.
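The reuse can be seen as plain function composition; a toy sketch in which the factors and weights echo the example above and are purely illustrative:

```python
def vertical(x):   return max(x[0], 0.0)          # shallow factors
def horizontal(x): return max(x[1], 0.0)
def circle(x):     return max(x[0] + x[1], 0.0)

def corner(x):                        # defined once from shallow factors...
    return max(0.5 * vertical(x) + 0.5 * horizontal(x), 0.0)

def face(x):                          # ...and reused by the deeper factor
    return max(0.3 * corner(x) + 0.7 * circle(x), 0.0)

print(face((1.0, 2.0)))
```

A depth-1 model cannot reuse corner this way; every factor that needs it must re-express its internals, multiplying the weights to learn.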
Learning Theory Revisited

How do we learn a function $f_N$ from $N$ examples $\mathbb{X}$ that is close to the true function $f^*$?
Empirical risk: $C_N[f_N] = \frac{1}{N} \sum_{i=1}^{N} \mathrm{loss}(f_N(x^{(i)}), y^{(i)})$
Expected risk: $C[f_N] = \int \mathrm{loss}(f(x), y)\, dP(x, y)$
Let $f^* = \arg\min_f C[f]$ be the true function (our goal).
Since we are seeking a function in a model (hypothesis space) $\mathcal{F}$, the best we can have is $f^*_{\mathcal{F}} = \arg\min_{f \in \mathcal{F}} C[f]$.
But our objective only minimizes the errors on a limited set of examples, so all we get is $f_N = \arg\min_{f \in \mathcal{F}} C_N[f]$.
The excess error $\mathcal{E} = C[f_N] - C[f^*]$ decomposes as
$$\mathcal{E} = \underbrace{C[f^*_{\mathcal{F}}] - C[f^*]}_{\mathcal{E}_{\mathrm{app}}} + \underbrace{C[f_N] - C[f^*_{\mathcal{F}}]}_{\mathcal{E}_{\mathrm{est}}}$$
Excess Error

Wait, we may not have enough training time, so we stop the iterations early and settle for $\tilde{f}_N$, where $C_N[\tilde{f}_N] \le C_N[f_N] + \rho$.
The excess error becomes $\mathcal{E} = C[\tilde{f}_N] - C[f^*]$:
$$\mathcal{E} = \underbrace{C[f^*_{\mathcal{F}}] - C[f^*]}_{\mathcal{E}_{\mathrm{app}}} + \underbrace{C[f_N] - C[f^*_{\mathcal{F}}]}_{\mathcal{E}_{\mathrm{est}}} + \underbrace{C[\tilde{f}_N] - C[f_N]}_{\mathcal{E}_{\mathrm{opt}}}$$
Approximation error $\mathcal{E}_{\mathrm{app}}$: reduced by choosing a larger model.
Estimation error $\mathcal{E}_{\mathrm{est}}$: reduced by (1) increasing $N$, or (2) choosing a smaller model [2, 5, 7].
Optimization error $\mathcal{E}_{\mathrm{opt}}$: reduced by (1) running the optimization algorithm longer (with a smaller $\rho$), or (2) choosing a more efficient optimization algorithm.
Minimizing Excess Error

$$\mathcal{E} = \underbrace{C[f^*_{\mathcal{F}}] - C[f^*]}_{\mathcal{E}_{\mathrm{app}}} + \underbrace{C[f_N] - C[f^*_{\mathcal{F}}]}_{\mathcal{E}_{\mathrm{est}}} + \underbrace{C[\tilde{f}_N] - C[f_N]}_{\mathcal{E}_{\mathrm{opt}}}$$
Small-scale ML tasks are mainly constrained by $N$:
Computing time is not an issue, so $\mathcal{E}_{\mathrm{opt}}$ can be made insignificant by choosing a small $\rho$.
The size of the hypothesis space matters most, to balance the trade-off between $\mathcal{E}_{\mathrm{app}}$ and $\mathcal{E}_{\mathrm{est}}$.
Large-scale ML tasks are mainly constrained by time (significant $\mathcal{E}_{\mathrm{opt}}$), so SGD is preferred:
$N$ is large, so $\mathcal{E}_{\mathrm{est}}$ can be reduced.
A large model is preferred to reduce $\mathcal{E}_{\mathrm{app}}$.
Big Data + Big Models

[Figure: examples of big models, including 9. the COTS HPC unsupervised convolutional network [3] and 10. GoogLeNet [6]]
With domain-specific architectures such as convolutional NNs (CNNs) and recurrent NNs (RNNs).
Stochastic Gradient Descent

Gradient Descent (GD):
$w^{(0)} \leftarrow$ a random vector;
Repeat until convergence {
  $w^{(t+1)} \leftarrow w^{(t)} - \eta \nabla_w C_N(w^{(t)}; \mathbb{X})$;
}
GD needs to scan the entire dataset to take a single descent step.

(Mini-Batched) Stochastic Gradient Descent (SGD):
$w^{(0)} \leftarrow$ a random vector;
Repeat until convergence {
  Randomly partition the training set $\mathbb{X}$ into minibatches $\{\mathbb{X}^{(j)}\}_j$;
  For each $j$: $w^{(t+1)} \leftarrow w^{(t)} - \eta \nabla_w C(w^{(t)}; \mathbb{X}^{(j)})$;
}
SGD supports online training.
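A runnable sketch of minibatch SGD on a toy least-squares problem, assuming numpy; the quadratic loss, learning rate, and data are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))                     # toy training set
y = X @ np.array([1.0, -2.0, 0.5, 0.0, 3.0]) + 0.1 * rng.normal(size=1000)

w, eta, batch = np.zeros(5), 0.01, 32
for epoch in range(20):                            # "repeat until convergence"
    idx = rng.permutation(len(X))                  # random minibatch partition
    for j in range(0, len(X), batch):
        b = idx[j:j + batch]
        grad = 2 * X[b].T @ (X[b] @ w - y[b]) / len(b)  # gradient on X^(j)
        w -= eta * grad                            # w(t+1) = w(t) - eta * grad
print(w)                                           # close to the true weights
```

Replacing the inner loop's minibatch gradient with the full-dataset gradient recovers GD, at $N$ times the cost per step.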
GD vs. SGD
Is SGD really a better algorithm?
Yes, If You Have Big Data
Performance is limited by training time
Asymptotic Analysis [1]

|                                      | GD                                      | SGD              |
|--------------------------------------|-----------------------------------------|------------------|
| Time per iteration                   | $N$                                     | $1$              |
| Iterations to opt. error $\rho$      | $\log \frac{1}{\rho}$                   | $\frac{1}{\rho}$ |
| Time to opt. error $\rho$            | $N \log \frac{1}{\rho}$                 | $\frac{1}{\rho}$ |
| Time to excess error $\varepsilon$   | $\frac{1}{\varepsilon^{1/\alpha}} \log \frac{1}{\varepsilon}$, with $\alpha \in [\frac{1}{2}, 1]$ | $\frac{1}{\varepsilon}$ |
Parallelizing SGD

Data parallelism: every core trains the full model given a partition of the data.
Model parallelism: every core trains a partition of the model given the full data.
Normally, model parallelism exchanges fewer parameters in a large NN and can support more cores.
The effectiveness depends on settings such as CPU speed, communication latency, bandwidth, etc.
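A schematic sketch of data-parallel SGD, assuming numpy; the "workers" run sequentially here, standing in for cores that would compute their shard gradients concurrently and all-reduce them:

```python
import numpy as np

def grad(w, Xb, yb):                       # each worker holds a full model replica
    return 2 * Xb.T @ (Xb @ w - yb) / len(yb)

rng = np.random.default_rng(0)
X = rng.normal(size=(512, 4))
y = X @ np.array([1.0, 0.0, -1.0, 2.0])
w, eta, n_workers = np.zeros(4), 0.05, 4

for step in range(200):
    shards = np.array_split(rng.permutation(512), n_workers)   # partition the data
    g = np.mean([grad(w, X[s], y[s]) for s in shards], axis=0) # all-reduce average
    w -= eta * g
print(w)
```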
References

[1] Léon Bottou. Large-scale machine learning with stochastic gradient descent. In Proceedings of COMPSTAT'2010, pages 177–186. Springer, 2010.
[2] Olivier Bousquet. Concentration inequalities and empirical processes theory applied to the analysis of learning algorithms. Ph.D. thesis, École Polytechnique, Palaiseau, France, 2002.
[3] Adam Coates, Brody Huval, Tao Wang, David Wu, Bryan Catanzaro, and Andrew Ng. Deep learning with COTS HPC systems. In Proceedings of the 30th International Conference on Machine Learning, pages 1337–1345, 2013.
[4] Rong-En Fan, Kai-Wei Chang, Cho-Jui Hsieh, Xiang-Rui Wang, and Chih-Jen Lin. LIBLINEAR: A library for large linear classification. Journal of Machine Learning Research, 9(Aug):1871–1874, 2008.
[5] Pascal Massart. Some applications of concentration inequalities to statistics. In Annales de la Faculté des sciences de Toulouse: Mathématiques, volume 9, pages 245–303, 2000.
[6] Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, and Andrew Rabinovich. Going deeper with convolutions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1–9, 2015.
[7] Vladimir N. Vapnik and A. Ya. Chervonenkis. On the uniform convergence of relative frequencies of events to their probabilities. In Measures of Complexity, pages 11–30. Springer, 2015.