Big Data Analytics: Optimization and Randomization
Tianbao Yang
Tutorial@ACML 2015, Hong Kong
†Department of Computer Science, The University of Iowa, IA, USA
Nov. 20, 2015
Yang Tutorial for ACML’15 Nov. 20, 2015 1 / 210
URL
http://www.cs.uiowa.edu/~tyng/acml15-tutorial.pdf
Some Claims
No
- This tutorial is not an exhaustive literature survey
- It is not a survey on different machine learning algorithms
Yes
- It is about how to efficiently solve machine learning (formulated as optimization) problems for big data
Outline
Part I: Basics
Part II: Optimization
Part III: Randomization
Big Data Analytics: Optimization and Randomization
Part I: Basics
Basics Introduction
Outline
1 Basics
- Introduction
- Notations and Definitions
Three Steps for Machine Learning
Data, Model, Optimization
[Figure: distance to optimal objective versus iterations, comparing the rates $0.5^T$, $1/T^2$, and $1/T$]
Big Data Challenge
Big Data
Big Data Challenge
Big Model
60 million parameters
Learning as Optimization
Ridge Regression Problem:
$$\min_{w \in \mathbb{R}^d} \; \underbrace{\frac{1}{n}\sum_{i=1}^{n} (y_i - w^\top x_i)^2}_{\text{Empirical Loss}} + \underbrace{\frac{\lambda}{2}\|w\|_2^2}_{\text{Regularization}}$$
- $x_i \in \mathbb{R}^d$: d-dimensional feature vector
- $y_i \in \mathbb{R}$: target variable
- $w \in \mathbb{R}^d$: model parameters
- $n$: number of data points
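The ridge objective above can be evaluated and minimized directly when $d$ is small. The following is a minimal sketch, assuming NumPy and synthetic data; the function names and the data are illustrative and not part of the tutorial. The closed form follows from setting the gradient $\frac{2}{n}X^\top(Xw - y) + \lambda w$ to zero.

```python
import numpy as np

def ridge_objective(w, X, y, lam):
    """(1/n) * sum_i (y_i - w^T x_i)^2 + (lam/2) * ||w||_2^2."""
    residual = y - X @ w
    return residual @ residual / len(y) + 0.5 * lam * (w @ w)

def ridge_closed_form(X, y, lam):
    """Solve (2/n) X^T (X w - y) + lam * w = 0 for w."""
    n, d = X.shape
    A = (2.0 / n) * X.T @ X + lam * np.eye(d)
    b = (2.0 / n) * X.T @ y
    return np.linalg.solve(A, b)

# Synthetic data (hypothetical, for illustration only).
rng = np.random.default_rng(0)
X = rng.standard_normal((200, 5))
y = X @ rng.standard_normal(5) + 0.1 * rng.standard_normal(200)

lam = 0.1
w_star = ridge_closed_form(X, y, lam)
print(ridge_objective(w_star, X, y, lam))
```

Solving the $d \times d$ linear system costs on the order of $nd^2 + d^3$ operations, which is one reason iterative methods are preferred at large scale.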
Learning as Optimization
Classification Problems:
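$$\min_{w \in \mathbb{R}^d} \; \frac{1}{n}\sum_{i=1}^{n} \ell(y_i w^\top x_i) + \frac{\lambda}{2}\|w\|_2^2$$
- $y_i \in \{+1, -1\}$: label
- Loss function $\ell(z)$, where $z = y w^\top x$
1. SVMs: (squared) hinge loss $\ell(z) = \max(0, 1 - z)^p$, where $p = 1, 2$
2. Logistic Regression: $\ell(z) = \log(1 + \exp(-z))$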
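As an illustration of the two loss families above, here is a short sketch, assuming NumPy; the function names are hypothetical. Both losses take the margin $z = y w^\top x$ as input.

```python
import numpy as np

def hinge_loss(z, p=1):
    """(Squared) hinge loss max(0, 1 - z)^p for p = 1 or p = 2."""
    return np.maximum(0.0, 1.0 - z) ** p

def logistic_loss(z):
    """log(1 + exp(-z)), evaluated stably as logaddexp(0, -z)."""
    return np.logaddexp(0.0, -z)

def classification_objective(w, X, y, lam, loss):
    """(1/n) * sum_i loss(y_i * w^T x_i) + (lam/2) * ||w||_2^2."""
    margins = y * (X @ w)
    return loss(margins).mean() + 0.5 * lam * (w @ w)

margins = np.array([2.0, 0.0, -1.0])
print(hinge_loss(margins), hinge_loss(margins, p=2), logistic_loss(margins))
```

Using `np.logaddexp` avoids overflow in `exp(-z)` for very negative margins.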
Learning as Optimization
Feature Selection:
$$\min_{w \in \mathbb{R}^d} \; \frac{1}{n}\sum_{i=1}^{n} \ell(w^\top x_i, y_i) + \lambda\|w\|_1$$
- $\ell_1$ regularization: $\|w\|_1 = \sum_{i=1}^{d} |w_i|$
- $\lambda$ controls sparsity level
Learning as Optimization
Feature Selection using Elastic Net:
$$\min_{w \in \mathbb{R}^d} \; \frac{1}{n}\sum_{i=1}^{n} \ell(w^\top x_i, y_i) + \lambda\left(\|w\|_1 + \gamma\|w\|_2^2\right)$$
- Elastic net regularizer, more robust than the $\ell_1$ regularizer
Learning as Optimization
Multi-class/Multi-task Learning:
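$$\min_{W} \; \frac{1}{n}\sum_{i=1}^{n} \ell(W x_i, y_i) + \lambda\, r(W)$$
- $W \in \mathbb{R}^{K \times d}$
- $r(W) = \|W\|_F^2 = \sum_{k=1}^{K}\sum_{j=1}^{d} W_{kj}^2$: (squared) Frobenius Norm
- $r(W) = \|W\|_* = \sum_i \sigma_i$: Nuclear Norm (sum of singular values)
- $r(W) = \|W\|_{1,\infty} = \sum_{j=1}^{d} \|W_{:j}\|_\infty$: $\ell_{1,\infty}$ mixed norm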
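The three matrix regularizers listed above can be computed in a few lines. This is a sketch assuming NumPy, with illustrative function names; $W_{:j}$ is taken to be the $j$-th column of $W$, as in the definition above.

```python
import numpy as np

def frobenius_sq(W):
    """Squared Frobenius norm: sum of squared entries."""
    return np.sum(W ** 2)

def nuclear_norm(W):
    """Nuclear norm: sum of singular values."""
    return np.linalg.svd(W, compute_uv=False).sum()

def l1_inf_norm(W):
    """l_{1,inf} mixed norm: sum over columns of the largest absolute entry."""
    return np.abs(W).max(axis=0).sum()

W = np.array([[3.0, 0.0],
              [0.0, -4.0]])
print(frobenius_sq(W), nuclear_norm(W), l1_inf_norm(W))
```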
Learning as Optimization
Regularized Empirical Loss Minimization
$$\min_{w \in \mathbb{R}^d} \; \frac{1}{n}\sum_{i=1}^{n} \ell(w^\top x_i, y_i) + R(w)$$
- Both $\ell$ and $R$ are convex functions
- Extensions to Matrix Cases are possible (sometimes straightforward)
- Extensions to Kernel methods can be combined with randomized approaches
- Extensions to Non-convex (e.g., deep learning) are in progress
Data Matrices and Machine Learning
The Instance-feature Matrix: $X \in \mathbb{R}^{n \times d}$
$$X = \begin{pmatrix} x_1^\top \\ x_2^\top \\ \vdots \\ x_n^\top \end{pmatrix}$$
Data Matrices and Machine Learning
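The output vector:
$$y = \begin{pmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{pmatrix} \in \mathbb{R}^{n \times 1}$$
- continuous $y_i \in \mathbb{R}$: regression (e.g., house price)
- discrete, e.g., $y_i \in \{1, 2, 3\}$: classification (e.g., species of iris)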
Data Matrices and Machine Learning
The Instance-Instance Matrix: $K \in \mathbb{R}^{n \times n}$
- Similarity Matrix
- Kernel Matrix
Data Matrices and Machine Learning
Some machine learning tasks are formulated on the kernel matrix
- Clustering
- Kernel Methods
Data Matrices and Machine Learning
The Feature-Feature Matrix: $C \in \mathbb{R}^{d \times d}$
- Covariance Matrix
- Distance Metric Matrix
Data Matrices and Machine Learning
Some machine learning tasks require the covariance matrix
- Principal Component Analysis
- Top-k Singular Value (Eigen-Value) Decomposition of the Covariance Matrix
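To make the link between PCA and the covariance matrix concrete, here is a small sketch assuming NumPy. The centering step and the $1/n$ normalization are assumptions of this example, and `pca_top_k` is a hypothetical helper name.

```python
import numpy as np

def pca_top_k(X, k):
    """Top-k eigenpairs of the d x d covariance matrix of X (n x d)."""
    Xc = X - X.mean(axis=0)            # center each feature (assumption)
    C = Xc.T @ Xc / X.shape[0]         # covariance matrix, 1/n normalization
    eigvals, eigvecs = np.linalg.eigh(C)   # eigenvalues in ascending order
    order = np.argsort(eigvals)[::-1][:k]  # indices of the k largest
    return eigvals[order], eigvecs[:, order], C

rng = np.random.default_rng(0)
X = rng.standard_normal((100, 4)) * np.array([5.0, 2.0, 1.0, 0.5])
top_vals, top_vecs, C = pca_top_k(X, k=2)
print(top_vals)
```

Forming $C$ explicitly costs on the order of $nd^2$ operations and $d^2$ memory, which is the kind of cost that becomes prohibitive for large $d$.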
Why Is Learning from Big Data Challenging?
High per-iteration cost
High memory cost
High communication cost
Large iteration complexity
Basics Notations and Definitions
Norms
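Vector $x \in \mathbb{R}^d$
- Euclidean vector norm: $\|x\|_2 = \sqrt{x^\top x} = \sqrt{\sum_{i=1}^{d} x_i^2}$
- $\ell_p$-norm of a vector: $\|x\|_p = \left(\sum_{i=1}^{d} |x_i|^p\right)^{1/p}$, where $p \ge 1$
1. $\ell_2$ norm: $\|x\|_2 = \sqrt{\sum_{i=1}^{d} x_i^2}$
2. $\ell_1$ norm: $\|x\|_1 = \sum_{i=1}^{d} |x_i|$
3. $\ell_\infty$ norm: $\|x\|_\infty = \max_i |x_i|$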
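The vector norms above map directly onto NumPy. The sketch below uses an illustrative helper, `lp_norm`, and compares it against `np.linalg.norm`.

```python
import numpy as np

def lp_norm(x, p):
    """l_p norm (sum_i |x_i|^p)^(1/p) for p >= 1."""
    return np.sum(np.abs(x) ** p) ** (1.0 / p)

x = np.array([3.0, -4.0])
l2 = lp_norm(x, 2)           # Euclidean norm
l1 = lp_norm(x, 1)           # sum of absolute values
linf = np.max(np.abs(x))     # limit of the l_p norm as p grows
print(l2, l1, linf)
```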
Matrix Factorization
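Matrix $X \in \mathbb{R}^{n \times d}$
Singular Value Decomposition: $X = U \Sigma V^\top$
1. $U \in \mathbb{R}^{n \times r}$: orthonormal columns ($U^\top U = I$), span the column space
2. $\Sigma \in \mathbb{R}^{r \times r}$: diagonal matrix with $\Sigma_{ii} = \sigma_i > 0$, $\sigma_1 \ge \sigma_2 \ge \ldots \ge \sigma_r$
3. $V \in \mathbb{R}^{d \times r}$: orthonormal columns ($V^\top V = I$), span the row space
4. $r \le \min(n, d)$: max value such that $\sigma_r > 0$, the rank of $X$
5. $U_k \Sigma_k V_k^\top$: top-k approximation
Pseudo inverse: $X^\dagger = V \Sigma^{-1} U^\top$
QR factorization: $X = QR$ ($n \ge d$)
- $Q \in \mathbb{R}^{n \times d}$: orthonormal columns
- $R \in \mathbb{R}^{d \times d}$: upper triangular matrix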
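The factorizations above can be reproduced with NumPy as follows. This is a sketch on a synthetic rank-3 matrix; the rank threshold `1e-10` is an assumption chosen for this example.

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic 8 x 5 matrix of rank 3 (hypothetical data).
X = rng.standard_normal((8, 3)) @ rng.standard_normal((3, 5))

# Thin SVD, then keep only the strictly positive singular values.
U, s, Vt = np.linalg.svd(X, full_matrices=False)
r = int(np.sum(s > 1e-10 * s[0]))      # numerical rank
U, s, Vt = U[:, :r], s[:r], Vt[:r, :]

# Pseudo inverse X^+ = V Sigma^{-1} U^T.
X_pinv = Vt.T @ np.diag(1.0 / s) @ U.T

# Top-k approximation U_k Sigma_k V_k^T.
k = 2
X_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

# Reduced QR factorization (n >= d): Q is 8 x 5, R is 5 x 5.
Q, R = np.linalg.qr(X)
print(r, np.linalg.norm(X - X_k, 2), s[k])
```

The spectral-norm error of the top-k approximation equals the next singular value, which is why the last two printed numbers agree.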
Norms
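Matrix $X \in \mathbb{R}^{n \times d}$
- Frobenius norm: $\|X\|_F = \sqrt{\operatorname{tr}(X^\top X)} = \sqrt{\sum_{i=1}^{n}\sum_{j=1}^{d} X_{ij}^2}$
- Spectral (induced) norm of a matrix: $\|X\|_2 = \max_{\|u\|_2 = 1} \|Xu\|_2$, and $\|X\|_2 = \sigma_1$ (maximum singular value)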
Convex Optimization
$$\min_{x \in \mathcal{X}} f(x)$$
- $\mathcal{X}$ is a convex domain: for any $x, y \in \mathcal{X}$, their convex combination $\alpha x + (1 - \alpha) y \in \mathcal{X}$
- $f(x)$ is a convex function
Convex Function
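Characterization of Convex Function
$$f(\alpha x + (1 - \alpha) y) \le \alpha f(x) + (1 - \alpha) f(y), \quad \forall x, y \in \mathcal{X},\ \alpha \in [0, 1]$$
$$f(x) \ge f(y) + \nabla f(y)^\top (x - y), \quad \forall x, y \in \mathcal{X}$$
- local optimum is global optimum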
Convex vs Strongly Convex
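Convex function:
$$f(x) \ge f(y) + \nabla f(y)^\top (x - y), \quad \forall x, y \in \mathcal{X}$$
Strongly Convex function:
$$f(x) \ge f(y) + \nabla f(y)^\top (x - y) + \frac{\lambda}{2}\|x - y\|_2^2, \quad \forall x, y \in \mathcal{X}$$
- $\lambda$ is the strong convexity constant
- Global optimum is unique
- e.g., $\frac{\lambda}{2}\|w\|_2^2$ is $\lambda$-strongly convex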
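The closing example can be checked by hand. The short derivation below is a worked verification added for illustration, using $f(w) = \frac{\lambda}{2}\|w\|_2^2$ and $\nabla f(y) = \lambda y$; it shows that the strong convexity inequality holds with equality for this function.

```latex
\begin{aligned}
f(x) - f(y) - \nabla f(y)^\top (x - y)
  &= \frac{\lambda}{2}\|x\|_2^2 - \frac{\lambda}{2}\|y\|_2^2 - \lambda\, y^\top (x - y) \\
  &= \frac{\lambda}{2}\|x\|_2^2 - \lambda\, y^\top x + \frac{\lambda}{2}\|y\|_2^2 \\
  &= \frac{\lambda}{2}\|x - y\|_2^2 .
\end{aligned}
```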
Non-smooth function vs Smooth function
Non-smooth function
- Lipschitz continuous, e.g. absolute loss $f(x) = |x|$: $|f(x) - f(y)| \le G\|x - y\|_2$, where $G$ is the Lipschitz constant
- Subgradient: $f(x) \ge f(y) + \partial f(y)^\top (x - y)$
[Figure: $|x|$ on $[-1, 1]$, labeled non-smooth, with a sub-gradient line]
Smooth function
- e.g. logistic loss $f(x) = \log(1 + \exp(-x))$
- $\|\nabla f(x) - \nabla f(y)\|_2 \le L\|x - y\|_2$, where $L$ is the smoothness constant
[Figure: $f(x) = \log(1 + \exp(-x))$ on $[-5, 5]$, with the line $f(y) + f'(y)(x - y)$ at the point $y$ and a curve labeled Quadratic Function]
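Both constants can be probed numerically. The sketch below, assuming NumPy and randomly sampled point pairs, estimates the Lipschitz constant of $|x|$ and the smoothness constant of the logistic loss. The values $G = 1$ and $L = 1/4$ used in the comments are standard facts about these two functions rather than statements from the slides.

```python
import numpy as np

def logistic_grad(x):
    """Derivative of log(1 + exp(-x)), which equals -1 / (1 + exp(x))."""
    return -1.0 / (1.0 + np.exp(x))

rng = np.random.default_rng(0)
x = rng.uniform(-5.0, 5.0, 1000)
y = rng.uniform(-5.0, 5.0, 1000)

# |f(x) - f(y)| / |x - y| for f = |.|: bounded by G = 1.
abs_ratio = np.abs(np.abs(x) - np.abs(y)) / np.abs(x - y)

# |f'(x) - f'(y)| / |x - y| for the logistic loss: bounded by L = 1/4.
grad_ratio = np.abs(logistic_grad(x) - logistic_grad(y)) / np.abs(x - y)

print(abs_ratio.max(), grad_ratio.max())
```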
Next ...
$$\min_{w \in \mathbb{R}^d} \; \frac{1}{n}\sum_{i=1}^{n} \ell(w^\top x_i, y_i) + R(w)$$
Part II: Optimization
- stochastic optimization
- distributed optimization
Reduce Iteration Complexity: utilizing properties of functions and the structure of the problem
Next ...
Part III: Randomization
- Classification, Regression
- SVD, K-means, Kernel methods
Reduce Data Size: utilizing properties of data
Please stay tuned!
Optimization
Big Data Analytics: Optimization and Randomization
Part II: Optimization
Optimization (Sub)Gradient Methods
Outline
2 Optimization
- (Sub)Gradient Methods
- Stochastic Optimization Algorithms for Big Data
  - Stochastic Optimization
  - Distributed Optimization
Learning as Optimization
Regularized Empirical Loss Minimization
$$\min_{w \in \mathbb{R}^d} \; \underbrace{\frac{1}{n}\sum_{i=1}^{n} \ell(w^\top x_i, y_i) + R(w)}_{F(w)}$$
Convergence Measure
Most optimization algorithms are iterative:
$$w_{t+1} = w_t + \Delta w_t$$
Iteration Complexity: the number of iterations $T(\epsilon)$ needed to have
$$F(w_T) - \min_w F(w) \le \epsilon \quad (\epsilon \ll 1)$$
Convergence Rate: after $T$ iterations, how good is the solution
$$F(w_T) - \min_w F(w) \le \epsilon(T)$$
[Figure: objective versus iterations, marking the accuracy $\epsilon$ reached at iteration $T$]
Total Runtime = Per-iteration Cost × Iteration Complexity
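The definitions above can be made concrete with a small experiment. The sketch below assumes plain gradient descent as the update $\Delta w_t$, a ridge regression objective as $F$, a step size of one over the largest Hessian eigenvalue, and synthetic data. All of these are choices made for this example, not prescribed by the slides.

```python
import numpy as np

def F(w, X, y, lam):
    """Ridge objective (1/n) * ||X w - y||^2 + (lam/2) * ||w||^2."""
    r = X @ w - y
    return r @ r / len(y) + 0.5 * lam * (w @ w)

def grad_F(w, X, y, lam):
    return (2.0 / len(y)) * X.T @ (X @ w - y) + lam * w

def iterations_to_eps(X, y, lam, eps, max_iter=10_000):
    """Smallest T with F(w_T) - min_w F(w) <= eps under gradient descent."""
    n, d = X.shape
    H = (2.0 / n) * X.T @ X + lam * np.eye(d)       # Hessian of F
    w_opt = np.linalg.solve(H, (2.0 / n) * X.T @ y)  # exact minimizer
    F_min = F(w_opt, X, y, lam)
    step = 1.0 / np.linalg.eigvalsh(H).max()
    w = np.zeros(d)
    for t in range(max_iter + 1):
        if F(w, X, y, lam) - F_min <= eps:
            return t
        delta_w = -step * grad_F(w, X, y, lam)
        w = w + delta_w                              # w_{t+1} = w_t + delta_w_t
    raise RuntimeError("did not reach eps within max_iter iterations")

rng = np.random.default_rng(0)
X = rng.standard_normal((200, 5))
y = X @ rng.standard_normal(5) + 0.1 * rng.standard_normal(200)

T_coarse = iterations_to_eps(X, y, lam=0.1, eps=1e-2)
T_fine = iterations_to_eps(X, y, lam=0.1, eps=1e-6)
print(T_coarse, T_fine)
```

Each iteration here touches all $n$ data points, so the total runtime is roughly the returned iteration count times the $O(nd)$ cost of one gradient evaluation.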
More on Convergence Measure
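Big O(·) notation: explicit dependence on $T$ or $\epsilon$
- linear: convergence rate $O(\mu^T)$ with $\mu < 1$; iteration complexity $O(\log(1/\epsilon))$
- sub-linear: convergence rate $O(1/T^\alpha)$ with $\alpha > 0$; iteration complexity $O(1/\epsilon^{1/\alpha})$
Why are we interested in Bounds?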
Yang Tutorial for ACML’15 Nov. 20, 2015 39 / 210
More on Convergence Measure

[Figure: distance to optimum versus iterations $T$ (0 to 100) for the rates $0.5^T$ (annotated "seconds"), $1/T$ (annotated "minutes"), and $1/T^{0.5}$ (annotated "hours")]

Theoretically, we consider
$$\log\left(\frac{1}{\epsilon}\right) \prec \frac{1}{\sqrt{\epsilon}} \prec \frac{1}{\epsilon} \prec \frac{1}{\epsilon^2}$$

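
To make the ordering above concrete, here is a minimal sketch that counts the iterations at which each bound first drops to a target $\epsilon$. The function name `iterations_needed` and the choice of unit constants inside the bounds are assumptions made for illustration; they are not part of the tutorial.

```python
import math

def iterations_needed(eps, rate, mu=0.5, alpha=1.0):
    """Smallest T whose error bound is at most eps (constants set to 1).

    rate="linear":    bound mu**T       -> T ~ log(1/eps) / log(1/mu)
    rate="sublinear": bound 1/T**alpha  -> T ~ (1/eps)**(1/alpha)
    """
    if rate == "linear":
        return math.ceil(math.log(1.0 / eps) / math.log(1.0 / mu))
    if rate == "sublinear":
        return math.ceil((1.0 / eps) ** (1.0 / alpha))
    raise ValueError("rate must be 'linear' or 'sublinear'")

eps = 1e-3
for label, kwargs in [
    ("0.5^T", dict(rate="linear", mu=0.5)),
    ("1/T", dict(rate="sublinear", alpha=1.0)),
    ("1/T^0.5", dict(rate="sublinear", alpha=0.5)),
]:
    print(label, iterations_needed(eps, **kwargs))
```

For $\epsilon = 10^{-3}$ the linear rate needs about ten iterations, while $1/T$ needs on the order of a thousand and $1/T^{0.5}$ on the order of a million.
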
Non-smooth vs. Smooth

Smooth $\ell(z)$:
squared hinge loss: $\ell(w^\top x, y) = \max(0, 1 - y w^\top x)^2$
logistic loss: $\ell(w^\top x, y) = \log(1 + \exp(-y w^\top x))$
square loss: $\ell(w^\top x, y) = (w^\top x - y)^2$

Non-smooth $\ell(z)$:
hinge loss: $\ell(w^\top x, y) = \max(0, 1 - y w^\top x)$
absolute loss: $\ell(w^\top x, y) = |w^\top x - y|$

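
The distinction can be checked numerically. The sketch below writes the five losses as functions of the prediction $z = w^\top x$ and the label $y$, and compares one-sided finite-difference slopes at a candidate kink. The helper `one_sided_slopes` and the step `h` are illustrative assumptions.

```python
import math

def hinge(z, y):
    return max(0.0, 1.0 - y * z)

def squared_hinge(z, y):
    return max(0.0, 1.0 - y * z) ** 2

def logistic(z, y):
    return math.log1p(math.exp(-y * z))

def square(z, y):
    return (z - y) ** 2

def absolute(z, y):
    return abs(z - y)

def one_sided_slopes(loss, z, y, h=1e-6):
    """Finite-difference slopes just left and just right of z."""
    left = (loss(z, y) - loss(z - h, y)) / h
    right = (loss(z + h, y) - loss(z, y)) / h
    return left, right

# At z = 1 (with y = 1) the hinge loss has two different slopes,
# while the squared hinge loss has numerically a single slope.
print("hinge:        ", one_sided_slopes(hinge, 1.0, 1.0))
print("squared hinge:", one_sided_slopes(squared_hinge, 1.0, 1.0))
```
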
Strongly convex vs. Non-strongly convex

$\lambda$-strongly convex $R(w)$:
$\ell_2$ regularizer: $\frac{\lambda}{2}\|w\|_2^2$
Elastic net regularizer: $\tau\|w\|_1 + \frac{\lambda}{2}\|w\|_2^2$

Non-strongly convex $R(w)$:
unregularized problem: $R(w) \equiv 0$
$\ell_1$ regularizer: $\tau\|w\|_1$

Gradient Method in Machine Learning

$$F(w) = \frac{1}{n}\sum_{i=1}^{n} \ell(w^\top x_i, y_i) + \frac{\lambda}{2}\|w\|_2^2$$

Suppose $\ell(z, y)$ is smooth
Full gradient: $\nabla F(w) = \frac{1}{n}\sum_{i=1}^{n} \nabla\ell(w^\top x_i, y_i) + \lambda w$
Per-iteration cost: $O(nd)$

Gradient Descent
$$w_t = w_{t-1} - \gamma_t \nabla F(w_{t-1})$$
Step size: $\gamma_t$ = constant, e.g., $\frac{1}{L}$

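
A minimal sketch of the gradient descent update with constant step size $1/L$. The square loss, the toy data, and the use of a trace-based upper bound for $L$ are assumptions chosen so the example stays self-contained; the tutorial itself leaves the loss generic.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def objective(w, X, y, lam):
    """F(w) with the square loss l(z, y) = (z - y)^2 (illustrative choice)."""
    n = len(X)
    return sum((dot(w, x) - t) ** 2 for x, t in zip(X, y)) / n + 0.5 * lam * dot(w, w)

def full_gradient(w, X, y, lam):
    """Full gradient; touches all n examples, so it costs O(nd)."""
    n, d = len(X), len(w)
    g = [lam * wj for wj in w]
    for x, t in zip(X, y):
        r = 2.0 * (dot(w, x) - t) / n
        for j in range(d):
            g[j] += r * x[j]
    return g

def gradient_descent(X, y, lam, T):
    n, d = len(X), len(X[0])
    # Upper bound on the smoothness constant L of F (trace bound, an assumption).
    L = 2.0 * sum(dot(x, x) for x in X) / n + lam
    w = [0.0] * d
    history = [objective(w, X, y, lam)]
    for _ in range(T):
        g = full_gradient(w, X, y, lam)
        w = [wj - gj / L for wj, gj in zip(w, g)]   # w_t = w_{t-1} - (1/L) grad F(w_{t-1})
        history.append(objective(w, X, y, lam))
    return w, history

# Hypothetical toy data.
X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [1.0, -1.0]]
y = [1.0, 2.0, 3.0, -1.0]
w_gd, hist_gd = gradient_descent(X, y, lam=0.1, T=200)
print(w_gd, hist_gd[-1])
```

With a step size no larger than $1/L$ the objective values in `hist_gd` do not increase from one iteration to the next.
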
Gradient Method in Machine Learning

$$F(w) = \frac{1}{n}\sum_{i=1}^{n} \ell(w^\top x_i, y_i) + \underbrace{\frac{\lambda}{2}\|w\|_2^2}_{R(w)}$$

If $\lambda = 0$: $R(w)$ is non-strongly convex
Iteration complexity $O\left(\frac{1}{\epsilon}\right)$

If $\lambda > 0$: $R(w)$ is $\lambda$-strongly convex
Iteration complexity $O\left(\frac{1}{\lambda}\log\left(\frac{1}{\epsilon}\right)\right)$

Accelerated Full Gradient (AFG)

Nesterov's Accelerated Gradient Descent
$$w_t = v_{t-1} - \gamma_t \nabla F(v_{t-1})$$
$$v_t = w_t + \eta_t (w_t - w_{t-1}) \quad \text{(momentum step)}$$
$w_t$ is the output and $v_t$ is an auxiliary sequence.

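
A minimal sketch of the two-sequence update above. The slide does not specify $\gamma_t$ or $\eta_t$; the constant choices below ($\gamma = 1/L$ and $\eta = (\sqrt{L}-\sqrt{\lambda})/(\sqrt{L}+\sqrt{\lambda})$, a common schedule for the strongly convex case) are assumptions, as are the square loss and the toy data.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def full_gradient(w, X, y, lam):
    """Gradient of (1/n) sum_i (w^T x_i - y_i)^2 + (lam/2) ||w||^2."""
    n, d = len(X), len(w)
    g = [lam * wj for wj in w]
    for x, t in zip(X, y):
        r = 2.0 * (dot(w, x) - t) / n
        for j in range(d):
            g[j] += r * x[j]
    return g

def nesterov_agd(X, y, lam, T):
    n, d = len(X), len(X[0])
    L = 2.0 * sum(dot(x, x) for x in X) / n + lam    # upper bound on the smoothness constant
    gamma = 1.0 / L
    # Assumed constant momentum for a lam-strongly convex objective.
    eta = (math.sqrt(L) - math.sqrt(lam)) / (math.sqrt(L) + math.sqrt(lam))
    w_prev = [0.0] * d
    v = [0.0] * d
    for _ in range(T):
        g = full_gradient(v, X, y, lam)
        w = [vj - gamma * gj for vj, gj in zip(v, g)]             # gradient step taken at v_{t-1}
        v = [wj + eta * (wj - pj) for wj, pj in zip(w, w_prev)]   # momentum step
        w_prev = w
    return w_prev

# Hypothetical toy data.
X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [1.0, -1.0]]
y = [1.0, 2.0, 3.0, -1.0]
w_agd = nesterov_agd(X, y, lam=0.1, T=200)
print(w_agd)
```

The gradient is evaluated at the auxiliary point $v_{t-1}$, not at $w_{t-1}$; that is the only structural difference from plain gradient descent.
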
Accelerated Full Gradient (AFG)

$$F(w) = \frac{1}{n}\sum_{i=1}^{n} \ell(w^\top x_i, y_i) + \frac{\lambda}{2}\|w\|_2^2$$

If $\lambda = 0$: $R(w)$ is non-strongly convex
Iteration complexity $O\left(\frac{1}{\sqrt{\epsilon}}\right)$, better than $O\left(\frac{1}{\epsilon}\right)$

If $\lambda > 0$: $R(w)$ is $\lambda$-strongly convex
Iteration complexity $O\left(\frac{1}{\sqrt{\lambda}}\log\left(\frac{1}{\epsilon}\right)\right)$, better than $O\left(\frac{1}{\lambda}\log\left(\frac{1}{\epsilon}\right)\right)$ for small $\lambda$

Deal with non-smooth regularizer

Consider $\ell_1$ norm regularization
$$\min_{w\in\mathbb{R}^d} F(w) = \underbrace{\frac{1}{n}\sum_{i=1}^{n} \ell(w^\top x_i, y_i)}_{f(w)} + \underbrace{\tau\|w\|_1}_{R(w)}$$

$f(w)$: smooth
$R(w)$: non-smooth and non-strongly convex

Accelerated Proximal Gradient (APG)

Accelerated Gradient Descent
$$w_t = \arg\min_{w\in\mathbb{R}^d} \nabla f(v_{t-1})^\top w + \frac{1}{2\gamma_t}\|w - v_{t-1}\|_2^2 + \tau\|w\|_1 \quad \text{(proximal mapping)}$$
$$v_t = w_t + \eta_t (w_t - w_{t-1})$$

The proximal mapping has a closed-form solution: soft-thresholding
Iteration complexity and runtime remain the same as for the smooth and non-strongly convex case, i.e., $O\left(\frac{1}{\sqrt{\epsilon}}\right)$

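
Completing the square shows that the proximal mapping above equals soft-thresholding applied to the gradient step $v_{t-1} - \gamma_t \nabla f(v_{t-1})$ with threshold $\gamma_t \tau$. The sketch below illustrates this under stated assumptions: square loss for $f$, step $\gamma = 1/L$ with a trace-based bound for $L$, and the momentum schedule $\eta_t = (t-1)/(t+2)$, none of which are fixed by the slide.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def soft_threshold(u, kappa):
    """Closed-form minimizer of 0.5 * (w - u)**2 + kappa * |w| for scalar w."""
    if u > kappa:
        return u - kappa
    if u < -kappa:
        return u + kappa
    return 0.0

def grad_f(w, X, y):
    """Gradient of f(w) = (1/n) sum_i (w^T x_i - y_i)^2 (illustrative smooth loss)."""
    n, d = len(X), len(w)
    g = [0.0] * d
    for x, t in zip(X, y):
        r = 2.0 * (dot(w, x) - t) / n
        for j in range(d):
            g[j] += r * x[j]
    return g

def apg_l1(X, y, tau, T):
    n, d = len(X), len(X[0])
    L = 2.0 * sum(dot(x, x) for x in X) / n      # upper bound on the smoothness constant of f
    gamma = 1.0 / L
    w_prev = [0.0] * d
    v = [0.0] * d
    for t in range(1, T + 1):
        g = grad_f(v, X, y)
        # Proximal mapping: gradient step on f, then soft-thresholding with threshold gamma * tau.
        w = [soft_threshold(vj - gamma * gj, gamma * tau) for vj, gj in zip(v, g)]
        eta = (t - 1.0) / (t + 2.0)              # assumed momentum schedule
        v = [wj + eta * (wj - pj) for wj, pj in zip(w, w_prev)]
        w_prev = w
    return w_prev

# Hypothetical toy data.
X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [1.0, -1.0]]
y = [1.0, 2.0, 3.0, -1.0]
w_l1 = apg_l1(X, y, tau=2.0, T=300)
print(w_l1)
```

On this toy problem the first coordinate is set exactly to zero by the thresholding, which is the sparsity effect the $\ell_1$ regularizer is used for.
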
Deal with non-smooth but strongly convex regularizer

Consider the elastic net regularization
$$\min_{w\in\mathbb{R}^d} F(w) = \frac{1}{n}\sum_{i=1}^{n} \ell(w^\top x_i, y_i) + \underbrace{\frac{\lambda}{2}\|w\|_2^2 + \tau\|w\|_1}_{R(w)}$$
$R(w)$: non-smooth but strongly convex

$$\min_{w\in\mathbb{R}^d} F(w) = \underbrace{\frac{1}{n}\sum_{i=1}^{n} \ell(w^\top x_i, y_i) + \frac{\lambda}{2}\|w\|_2^2}_{f(w)} + \underbrace{\tau\|w\|_1}_{R'(w)}$$
$f(w)$: smooth and strongly convex
$R'(w)$: non-smooth and non-strongly convex
Iteration complexity: $O\left(\frac{1}{\sqrt{\lambda}}\log\left(\frac{1}{\epsilon}\right)\right)$

Sub-Gradient Method in Machine Learning

$$F(w) = \frac{1}{n}\sum_{i=1}^{n} \ell(w^\top x_i, y_i) + \frac{\lambda}{2}\|w\|_2^2$$

Suppose $\ell(z, y)$ is non-smooth
Full sub-gradient: $\partial F(w) = \frac{1}{n}\sum_{i=1}^{n} \partial\ell(w^\top x_i, y_i) + \lambda w$

Sub-Gradient Descent
$$w_t = w_{t-1} - \gamma_t\, \partial F(w_{t-1})$$
Step size: $\gamma_t \to 0$

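
A minimal sketch of sub-gradient descent for the hinge loss. The schedule $\gamma_t = 1/(\lambda t)$ is one assumed instance of a decreasing step size, and the toy data are hypothetical. Because sub-gradient steps need not decrease the objective, the sketch tracks the best iterate seen so far.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def objective(w, X, y, lam):
    n = len(X)
    hinge = sum(max(0.0, 1.0 - t * dot(w, x)) for x, t in zip(X, y)) / n
    return hinge + 0.5 * lam * dot(w, w)

def subgradient(w, X, y, lam):
    """One element of the sub-differential of F at w (hinge loss)."""
    n, d = len(X), len(w)
    g = [lam * wj for wj in w]
    for x, t in zip(X, y):
        if 1.0 - t * dot(w, x) > 0.0:        # active example: sub-gradient of the loss is -y_i x_i
            for j in range(d):
                g[j] -= t * x[j] / n
    return g

def subgradient_descent(X, y, lam, T):
    d = len(X[0])
    w = [0.0] * d
    best_w, best_obj = list(w), objective(w, X, y, lam)
    history = [best_obj]
    for t in range(1, T + 1):
        gamma = 1.0 / (lam * t)              # assumed decreasing schedule, gamma_t -> 0
        g = subgradient(w, X, y, lam)
        w = [wj - gamma * gj for wj, gj in zip(w, g)]
        obj = objective(w, X, y, lam)
        history.append(obj)
        if obj < best_obj:                   # not a descent method: keep the best iterate
            best_w, best_obj = list(w), obj
    return best_w, history

# Hypothetical linearly separable toy data with labels in {-1, +1}.
X = [[2.0, 0.0], [1.0, 1.0], [-2.0, 0.0], [-1.0, -1.0]]
y = [1.0, 1.0, -1.0, -1.0]
w_sub, hist_sub = subgradient_descent(X, y, lam=0.1, T=500)
print(w_sub, min(hist_sub))
```
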
Sub-Gradient Method

$$F(w) = \frac{1}{n}\sum_{i=1}^{n} \ell(w^\top x_i, y_i) + \frac{\lambda}{2}\|w\|_2^2$$

If $\lambda = 0$: $R(w)$ is non-strongly convex
generalizes to the $\ell_1$ norm and other non-strongly convex regularizers
Iteration complexity $O\left(\frac{1}{\epsilon^2}\right)$

If $\lambda > 0$: $R(w)$ is $\lambda$-strongly convex
generalizes to the elastic net and other strongly convex regularizers
Iteration complexity $O\left(\frac{1}{\lambda\epsilon}\right)$

No efficient acceleration scheme in general

Problem Classes and Iteration Complexity

$$\min_{w\in\mathbb{R}^d} \frac{1}{n}\sum_{i=1}^{n} \ell(w^\top x_i, y_i) + R(w)$$

Iteration complexity, with $\ell(z) \equiv \ell(z, y)$:

| $R(w)$ | Non-smooth $\ell(z)$ | Smooth $\ell(z)$ |
|---|---|---|
| Non-strongly convex | $O\left(\frac{1}{\epsilon^2}\right)$ | $O\left(\frac{1}{\sqrt{\epsilon}}\right)$ |
| $\lambda$-strongly convex | $O\left(\frac{1}{\lambda\epsilon}\right)$ | $O\left(\frac{1}{\sqrt{\lambda}}\log\left(\frac{1}{\epsilon}\right)\right)$ |

Per-iteration cost: $O(nd)$, too high if $n$ or $d$ is large.

Outline

2 Optimization
  (Sub)Gradient Methods
  Stochastic Optimization Algorithms for Big Data
   Stochastic Optimization
   Distributed Optimization

Stochastic First-Order Method by Sampling

Randomly Sample Example
1 Stochastic Gradient Descent (SGD)
2 Stochastic Variance Reduced Gradient (SVRG)
3 Stochastic Average Gradient Algorithm (SAGA)
4 Stochastic Dual Coordinate Ascent (SDCA)

Randomly Sample Feature
1 Randomized Coordinate Descent (RCD)
2 Accelerated Proximal Coordinate Gradient (APCG)

Basic SGD (Nemirovski & Yudin (1978))

$$F(w) = \frac{1}{n}\sum_{i=1}^{n} \ell(w^\top x_i, y_i) + \frac{\lambda}{2}\|w\|_2^2$$

Full sub-gradient: $\partial F(w) = \frac{1}{n}\sum_{i=1}^{n} \partial\ell(w^\top x_i, y_i) + \lambda w$
Randomly sample $i \in \{1, \ldots, n\}$
Stochastic sub-gradient: $\partial\ell(w^\top x_i, y_i) + \lambda w$
$$\mathbb{E}_i\left[\partial\ell(w^\top x_i, y_i) + \lambda w\right] = \partial F(w)$$

Basic SGD (Nemirovski & Yudin (1978))

Applicable in all settings!
$$\min_{w\in\mathbb{R}^d} F(w) = \frac{1}{n}\sum_{i=1}^{n} \ell(w^\top x_i, y_i) + \frac{\lambda}{2}\|w\|_2^2$$

sample: $i_t \in \{1, \ldots, n\}$
update: $w_t = w_{t-1} - \gamma_t\left(\partial\ell(w_{t-1}^\top x_{i_t}, y_{i_t}) + \lambda w_{t-1}\right)$
output: $\bar{w}_T = \frac{1}{T}\sum_{t=1}^{T} w_t$
step size: $\gamma_t \to 0$

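
The sample, update, and output steps above can be sketched as follows for the hinge loss. The schedule $\gamma_t = 1/(\lambda t)$, the fixed random seed, and the toy data are assumptions for illustration. Each iteration touches a single example, so its cost is $O(d)$.

```python
import random

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def sgd_hinge(X, y, lam, T, seed=0):
    rng = random.Random(seed)
    n, d = len(X), len(X[0])
    w = [0.0] * d
    w_sum = [0.0] * d
    for t in range(1, T + 1):
        i = rng.randrange(n)                          # sample i_t uniformly (0-based index)
        gamma = 1.0 / (lam * t)                       # assumed schedule, gamma_t -> 0
        active = 1.0 - y[i] * dot(w, X[i]) > 0.0      # hinge loss is active for this example
        # Stochastic sub-gradient: (-y_i x_i if active else 0) + lam * w
        g = [lam * wj - (y[i] * xj if active else 0.0) for wj, xj in zip(w, X[i])]
        w = [wj - gamma * gj for wj, gj in zip(w, g)]
        w_sum = [sj + wj for sj, wj in zip(w_sum, w)]
    return [sj / T for sj in w_sum]                   # averaged output

# Hypothetical linearly separable toy data with labels in {-1, +1}.
X = [[2.0, 0.0], [1.0, 1.0], [-2.0, 0.0], [-1.0, -1.0]]
y = [1.0, 1.0, -1.0, -1.0]
w_avg = sgd_hinge(X, y, lam=0.1, T=2000, seed=0)
print(w_avg)
```
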
Basic SGD (Nemirovski & Yudin (1978))

$$F(w) = \frac{1}{n}\sum_{i=1}^{n} \ell(w^\top x_i, y_i) + \frac{\lambda}{2}\|w\|_2^2$$

If $\lambda = 0$: $R(w)$ is non-strongly convex
generalizes to the $\ell_1$ norm and other non-strongly convex regularizers
Iteration complexity $O\left(\frac{1}{\epsilon^2}\right)$

If $\lambda > 0$: $R(w)$ is $\lambda$-strongly convex
generalizes to the elastic net and other strongly convex regularizers
Iteration complexity $O\left(\frac{1}{\lambda\epsilon}\right)$

Exactly the same as sub-gradient descent!

Total Runtime

Per-iteration cost: $O(d)$
Much lower than the full gradient method
e.g., hinge loss (SVM)
$$\text{stochastic gradient: } \partial\ell(w^\top x_{i_t}, y_{i_t}) = \begin{cases} -y_{i_t} x_{i_t}, & 1 - y_{i_t} w^\top x_{i_t} > 0 \\ 0, & \text{otherwise} \end{cases}$$

Total Runtime

$$\min_{w\in\mathbb{R}^d} \frac{1}{n}\sum_{i=1}^{n} \ell(w^\top x_i, y_i) + R(w)$$

Iteration complexity of SGD, with $\ell(z) \equiv \ell(z, y)$:

| $R(w)$ | Non-smooth $\ell(z)$ | Smooth $\ell(z)$ |
|---|---|---|
| Non-strongly convex | $O\left(\frac{1}{\epsilon^2}\right)$ | $O\left(\frac{1}{\epsilon^2}\right)$ |
| $\lambda$-strongly convex | $O\left(\frac{1}{\lambda\epsilon}\right)$ | $O\left(\frac{1}{\lambda\epsilon}\right)$ |

For SGD, only strong convexity helps; smoothness does not make any difference!
The reason: the step size has to be decreasing because the stochastic gradient does not approach 0

Variance Reduction

Stochastic Variance Reduced Gradient (SVRG)
Stochastic Average Gradient Algorithm (SAGA)
Stochastic Dual Coordinate Ascent (SDCA)

SVRG (Johnson & Zhang, 2013; Zhang et al., 2013; Xiao & Zhang, 2014)

$$\min_{w\in\mathbb{R}^d} F(w) = \frac{1}{n}\sum_{i=1}^{n} \ell(w^\top x_i, y_i) + \frac{\lambda}{2}\|w\|_2^2$$

Applicable when $\ell(z)$ is smooth and $R(w)$ is $\lambda$-strongly convex
Stochastic gradient:
$$g_{i_t}(w) = \nabla\ell(w^\top x_{i_t}, y_{i_t}) + \lambda w$$
$\mathbb{E}_{i_t}[g_{i_t}(w)] = \nabla F(w)$ but...
$\mathrm{Var}[g_{i_t}(w)] \neq 0$ even if $w = w_\star$

SVRG (Johnson & Zhang, 2013; Zhang et al., 2013; Xiao & Zhang, 2014)

Compute the full gradient at a reference point $\tilde{w}$
$$\nabla F(\tilde{w}) = \frac{1}{n}\sum_{i=1}^{n} g_i(\tilde{w})$$
Stochastic variance reduced gradient:
$$\tilde{g}_{i_t}(w) = g_{i_t}(w) - g_{i_t}(\tilde{w}) + \nabla F(\tilde{w})$$
$$\mathbb{E}_{i_t}[\tilde{g}_{i_t}(w)] = \nabla F(w)$$
$$\mathrm{Var}[\tilde{g}_{i_t}(w)] \to 0 \text{ as } w, \tilde{w} \to w_\star$$

Variance Reduction (Johnson & Zhang, 2013; Zhang et al., 2013; Xiao & Zhang, 2014)

At the optimal solution $w_\star$: $\nabla F(w_\star) = 0$
This does not mean
$$g_{i_t}(w) \to 0 \text{ as } w \to w_\star$$
However, we have
$$\tilde{g}_{i_t}(w) = g_{i_t}(w) - g_{i_t}(\tilde{w}) + \nabla F(\tilde{w}) \to 0 \text{ as } w, \tilde{w} \to w_\star$$

SVRG (Johnson & Zhang, 2013; Zhang et al., 2013; Xiao & Zhang, 2014)

Iterate $s = 1, \ldots, T-1$:
  Let $w_0 = \tilde{w}_s$ and compute $\nabla F(\tilde{w}_s)$
  Iterate $t = 1, \ldots, m$:
   $\tilde{g}_{i_t}(w_{t-1}) = \nabla F(\tilde{w}_s) - g_{i_t}(\tilde{w}_s) + g_{i_t}(w_{t-1})$
   $w_t = w_{t-1} - \gamma_t\, \tilde{g}_{i_t}(w_{t-1})$
  $\tilde{w}_{s+1} = \frac{1}{m}\sum_{t=1}^{m} w_t$
output: $\tilde{w}_T$

$m = O\left(\frac{1}{\lambda}\right)$, $\gamma_t$ = constant

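
A minimal sketch of the outer and inner loops above. The square loss, the toy data, the inner-loop length `m=200`, and the step size `1/(20 * L_max)` are assumptions chosen for a small, stable illustration; the slide only states $m = O(1/\lambda)$ and a constant $\gamma_t$.

```python
import random

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def grad_i(w, x, t, lam):
    """g_i(w) for the square loss (illustrative): 2 (w^T x_i - y_i) x_i + lam * w."""
    r = 2.0 * (dot(w, x) - t)
    return [r * xj + lam * wj for xj, wj in zip(x, w)]

def full_grad(w, X, y, lam):
    n = len(X)
    total = [0.0] * len(w)
    for x, t in zip(X, y):
        total = [a + b for a, b in zip(total, grad_i(w, x, t, lam))]
    return [a / n for a in total]

def svrg(X, y, lam, epochs, m, gamma, seed=0):
    rng = random.Random(seed)
    n, d = len(X), len(X[0])
    w_ref = [0.0] * d                                 # reference point
    for _ in range(epochs):
        mu = full_grad(w_ref, X, y, lam)              # full gradient at the reference point, O(nd)
        w = list(w_ref)
        w_sum = [0.0] * d
        for _ in range(m):
            i = rng.randrange(n)
            g_now = grad_i(w, X[i], y[i], lam)
            g_ref = grad_i(w_ref, X[i], y[i], lam)
            # Variance-reduced gradient: g_i(w) - g_i(w_ref) + full gradient at w_ref
            w = [wj - gamma * (a - b + c) for wj, a, b, c in zip(w, g_now, g_ref, mu)]
            w_sum = [s + wj for s, wj in zip(w_sum, w)]
        w_ref = [s / m for s in w_sum]                # next reference point: average of inner iterates
    return w_ref

# Hypothetical toy data.
X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [1.0, -1.0]]
y = [1.0, 2.0, 3.0, -1.0]
lam = 0.1
L_max = max(2.0 * dot(x, x) + lam for x in X)         # largest per-example smoothness constant
w_svrg = svrg(X, y, lam, epochs=50, m=200, gamma=1.0 / (20.0 * L_max))
print(w_svrg)
```

Unlike SGD, the step size stays constant: the correction term shrinks the variance of the stochastic gradient as both the iterate and the reference point approach the optimum.
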
SVRG (Johnson & Zhang, 2013; Zhang et al., 2013; Xiao & Zhang, 2014)

Per-iteration cost: $O(d)$

Iteration complexity, with $\ell(z) \equiv \ell(z, y)$:

| $R(w)$ | Non-smooth $\ell(z)$ | Smooth $\ell(z)$ |
|---|---|---|
| Non-strongly convex | N.A. | N.A. (1) |
| $\lambda$-strongly convex | N.A. | $O\left(\left(n + \frac{1}{\lambda}\right)\log\left(\frac{1}{\epsilon}\right)\right)$ |

(1) A small trick can fix this

Total runtime: $O\left(d\left(n + \frac{1}{\lambda}\right)\log\left(\frac{1}{\epsilon}\right)\right)$
Better than AFG: $O\left(\frac{nd}{\sqrt{\lambda}}\log\left(\frac{1}{\epsilon}\right)\right)$
Use the proximal mapping for the elastic net regularizer

SAGA (Defazio et al. (2014))

$$\min_{w\in\mathbb{R}^d} F(w) = \frac{1}{n}\sum_{i=1}^{n} \ell(w^\top x_i, y_i) + \frac{\lambda}{2}\|w\|_2^2$$

A new version of SAG (Roux et al. (2012))
Applicable when $\ell(z)$ is smooth
Strong convexity is not necessary.

SAGA (Defazio et al. (2014))

SAGA also reduces the variance of the stochastic gradient, but with a different technique

SVRG uses gradients at the same point $\tilde{w}$
$$\tilde{g}_{i_t}(w) = g_{i_t}(w) - g_{i_t}(\tilde{w}) + \nabla F(\tilde{w}), \qquad \nabla F(\tilde{w}) = \frac{1}{n}\sum_{i=1}^{n} g_i(\tilde{w})$$

SAGA uses gradients at different points $\tilde{w}_1, \tilde{w}_2, \ldots, \tilde{w}_n$
$$\tilde{g}_{i_t}(w) = g_{i_t}(w) - g_{i_t}(\tilde{w}_{i_t}) + G, \qquad G = \frac{1}{n}\sum_{i=1}^{n} g_i(\tilde{w}_i)$$

SAGA (Defazio et al. (2014))

Initialize the average gradient $G_0$ from stored gradients $g_i$:
$$G_0 = \frac{1}{n}\sum_{i=1}^{n} g_i, \qquad g_i = \nabla\ell(w_0^\top x_i, y_i) + \lambda w_0$$
average gradient: $G_{t-1} = \frac{1}{n}\sum_{i=1}^{n} g_i$
stochastic variance reduced gradient:
$$\tilde{g}_{i_t}(w_{t-1}) = \nabla\ell(w_{t-1}^\top x_{i_t}, y_{i_t}) + \lambda w_{t-1} - g_{i_t} + G_{t-1}$$
$$w_t = w_{t-1} - \gamma_t\, \tilde{g}_{i_t}(w_{t-1})$$
Update the selected component of the average gradient:
$$G_t = \frac{1}{n}\sum_{i=1}^{n} g_i, \qquad g_{i_t} = \nabla\ell(w_{t-1}^\top x_{i_t}, y_{i_t}) + \lambda w_{t-1}$$

SAGA: efficient update of averaged gradient

$G_t$ and $G_{t-1}$ differ only in $g_i$ for $i = i_t$
Before we update $g_{i_t}$, we update
$$G_t = \frac{1}{n}\sum_{i=1}^{n} g_i = G_{t-1} - \frac{1}{n} g_{i_t} + \frac{1}{n}\left(\nabla\ell(w_{t-1}^\top x_{i_t}, y_{i_t}) + \lambda w_{t-1}\right)$$
computation cost: $O(d)$

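
A minimal sketch of SAGA with the $O(d)$ running update of the average gradient. The square loss, the toy data, and the step size $1/(3 L_{\max})$ are assumptions for illustration; `table` is a hypothetical name for the stored gradients $g_i$.

```python
import random

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def grad_i(w, x, t, lam):
    """g_i(w) for the square loss (illustrative): 2 (w^T x_i - y_i) x_i + lam * w."""
    r = 2.0 * (dot(w, x) - t)
    return [r * xj + lam * wj for xj, wj in zip(x, w)]

def saga(X, y, lam, T, gamma, seed=0):
    rng = random.Random(seed)
    n, d = len(X), len(X[0])
    w = [0.0] * d
    table = [grad_i(w, X[i], y[i], lam) for i in range(n)]    # stored gradients g_i at w_0
    G = [sum(g[j] for g in table) / n for j in range(d)]       # average gradient G_0
    for _ in range(T):
        i = rng.randrange(n)
        g_new = grad_i(w, X[i], y[i], lam)                     # gradient at w_{t-1}
        g_old = table[i]
        # The variance-reduced gradient uses the old stored g_i and the old average G_{t-1}.
        w = [wj - gamma * (a - b + c) for wj, a, b, c in zip(w, g_new, g_old, G)]
        # O(d) refresh of the average, then overwrite the stored gradient.
        G = [c + (a - b) / n for c, a, b in zip(G, g_new, g_old)]
        table[i] = g_new
    return w

# Hypothetical toy data.
X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [1.0, -1.0]]
y = [1.0, 2.0, 3.0, -1.0]
lam = 0.1
L_max = max(2.0 * dot(x, x) + lam for x in X)
w_saga = saga(X, y, lam, T=5000, gamma=1.0 / (3.0 * L_max))
print(w_saga)
```

The price of avoiding SVRG's periodic full-gradient pass is memory: this sketch stores one gradient per example, i.e., $O(nd)$ numbers.
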
SAGA (Defazio et al. (2014))

Per-iteration cost: $O(d)$

Iteration complexity, with $\ell(z) \equiv \ell(z, y)$:

| $R(w)$ | Non-smooth $\ell(z)$ | Smooth $\ell(z)$ |
|---|---|---|
| Non-strongly convex | N.A. | $O\left(\frac{n}{\epsilon}\right)$ |
| $\lambda$-strongly convex | N.A. | $O\left(\left(n + \frac{1}{\lambda}\right)\log\left(\frac{1}{\epsilon}\right)\right)$ |

Total runtime (strongly convex): $O\left(d\left(n + \frac{1}{\lambda}\right)\log\left(\frac{1}{\epsilon}\right)\right)$. Same as SVRG!
Use the proximal mapping for the $\ell_1$ regularizer

Compare the Runtime of SGD and SVRG/SAGA

Smooth but non-strongly convex:
SGD: $O\left(\frac{d}{\epsilon^2}\right)$
SAGA: $O\left(\frac{dn}{\epsilon}\right)$

Smooth and strongly convex:
SGD: $O\left(\frac{d}{\lambda\epsilon}\right)$
SVRG/SAGA: $O\left(d\left(n + \frac{1}{\lambda}\right)\log\left(\frac{1}{\epsilon}\right)\right)$

For small $\epsilon$, use SVRG/SAGA
If a large $\epsilon$ is acceptable, use SGD

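
The rule of thumb can be illustrated by evaluating the two strongly convex bounds with all hidden constants set to 1, which is an assumption: real constants shift the crossover point. The problem size below is hypothetical.

```python
import math

def sgd_runtime(d, lam, eps):
    """d / (lam * eps): SGD bound, smooth and strongly convex case (constants dropped)."""
    return d / (lam * eps)

def svrg_saga_runtime(d, n, lam, eps):
    """d * (n + 1/lam) * log(1/eps): SVRG/SAGA bound for the same case (constants dropped)."""
    return d * (n + 1.0 / lam) * math.log(1.0 / eps)

n, d, lam = 10**6, 100, 1e-4          # hypothetical problem size
for eps in (1e-2, 1e-4, 1e-6):
    sgd = sgd_runtime(d, lam, eps)
    vr = svrg_saga_runtime(d, n, lam, eps)
    print(eps, "SGD" if sgd < vr else "SVRG/SAGA")
```

With these numbers SGD has the smaller bound at $\epsilon = 10^{-2}$, while SVRG/SAGA has the smaller bound at $\epsilon = 10^{-6}$.
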
Conjugate Duality

Define $\ell_i(z) \equiv \ell(z, y_i)$
Conjugate function: $\ell_i^*(\alpha) \Longleftrightarrow \ell_i(z)$
$$\ell_i(z) = \max_{\alpha\in\mathbb{R}} \left[\alpha z - \ell_i^*(\alpha)\right], \qquad \ell_i^*(\alpha) = \max_{z\in\mathbb{R}} \left[\alpha z - \ell_i(z)\right]$$

E.g., hinge loss: $\ell_i(z) = \max(0, 1 - y_i z)$
$$\ell_i^*(\alpha) = \begin{cases} \alpha y_i & \text{if } -1 \le \alpha y_i \le 0 \\ +\infty & \text{otherwise} \end{cases}$$

E.g., squared hinge loss: $\ell_i(z) = \max(0, 1 - y_i z)^2$
$$\ell_i^*(\alpha) = \begin{cases} \frac{\alpha^2}{4} + \alpha y_i & \text{if } \alpha y_i \le 0 \\ +\infty & \text{otherwise} \end{cases}$$

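
The two closed forms can be checked numerically by maximizing $\alpha z - \ell_i(z)$ over a grid of $z$ values. The grid range, the resolution, and the helper name `conjugate` are assumptions; outside the stated domain of $\alpha$ the true conjugate is $+\infty$, so the grid maximum there only grows with the grid range and is not meaningful.

```python
def conjugate(loss, alpha, z_lo=-10.0, z_hi=10.0, steps=20000):
    """Approximate max over z of [alpha * z - loss(z)] on a uniform grid in [z_lo, z_hi]."""
    best = float("-inf")
    for k in range(steps + 1):
        z = z_lo + (z_hi - z_lo) * k / steps
        best = max(best, alpha * z - loss(z))
    return best

y_i = 1.0

def hinge(z):
    return max(0.0, 1.0 - y_i * z)

def squared_hinge(z):
    return max(0.0, 1.0 - y_i * z) ** 2

for alpha in (-1.0, -0.5, 0.0):       # values with -1 <= alpha * y_i <= 0
    print(alpha,
          conjugate(hinge, alpha), alpha * y_i,
          conjugate(squared_hinge, alpha), alpha ** 2 / 4 + alpha * y_i)
```
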
![Page 101: Big Data Analytics: Optimization and Randomizationhomepage.divms.uiowa.edu › ~tyng › acml15-tutorial.pdfBig Data Analytics: Optimization and Randomization Tianbao Yang Tutorial@ACML](https://reader030.fdocuments.in/reader030/viewer/2022041115/5f243499e266bf1b563c674d/html5/thumbnails/101.jpg)
Optimization Stochastic Optimization Algorithms for Big Data
The Dual Problem
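From primal problem to dual problem:
$$\min_{\mathbf{w}} \frac{1}{n}\sum_{i=1}^n \ell(\underbrace{\mathbf{w}^\top\mathbf{x}_i}_{z}, y_i) + \frac{\lambda}{2}\|\mathbf{w}\|_2^2$$
$$= \min_{\mathbf{w}} \frac{1}{n}\sum_{i=1}^n \max_{\alpha_i\in\mathbb{R}} \left[-\alpha_i(\mathbf{w}^\top\mathbf{x}_i) - \ell_i^*(-\alpha_i)\right] + \frac{\lambda}{2}\|\mathbf{w}\|_2^2$$
$$= \max_{\alpha\in\mathbb{R}^n} \frac{1}{n}\sum_{i=1}^n -\ell_i^*(-\alpha_i) - \frac{\lambda}{2}\left\|\frac{1}{\lambda n}\sum_{i=1}^n \alpha_i\mathbf{x}_i\right\|_2^2$$
Primal solution: $\mathbf{w} = \frac{1}{\lambda n}\sum_{i=1}^n \alpha_i\mathbf{x}_i$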
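The last equality exchanges the min and the max and then minimizes over w in closed form. A short sketch of that step, assuming the exchange is valid (it holds under standard convexity conditions that the slide does not spell out):

```latex
\begin{aligned}
&\min_{\mathbf{w}}\; -\frac{1}{n}\sum_{i=1}^n \alpha_i\,\mathbf{w}^\top\mathbf{x}_i + \frac{\lambda}{2}\|\mathbf{w}\|_2^2
\;\Longrightarrow\;
\lambda\mathbf{w} - \frac{1}{n}\sum_{i=1}^n \alpha_i\mathbf{x}_i = 0
\;\Longrightarrow\;
\mathbf{w} = \frac{1}{\lambda n}\sum_{i=1}^n \alpha_i\mathbf{x}_i, \\
&\text{and at this } \mathbf{w}:\quad
-\frac{1}{n}\sum_{i=1}^n \alpha_i\,\mathbf{w}^\top\mathbf{x}_i + \frac{\lambda}{2}\|\mathbf{w}\|_2^2
= -\lambda\|\mathbf{w}\|_2^2 + \frac{\lambda}{2}\|\mathbf{w}\|_2^2
= -\frac{\lambda}{2}\left\|\frac{1}{\lambda n}\sum_{i=1}^n \alpha_i\mathbf{x}_i\right\|_2^2 .
\end{aligned}
```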
SDCA (Shalev-Shwartz & Zhang (2013))
Stochastic Dual Coordinate Ascent (liblinear (Hsieh et al., 2008))
Applicable when R(w) is λ-strongly convex
Smoothness is not required
Solve the dual problem:
$$\max_{\alpha\in\mathbb{R}^n} \frac{1}{n}\sum_{i=1}^n -\ell_i^*(-\alpha_i) - \frac{\lambda}{2}\left\|\frac{1}{\lambda n}\sum_{i=1}^n \alpha_i\mathbf{x}_i\right\|_2^2$$
Sample $i_t \in \{1,\dots,n\}$. Optimize $\alpha_{i_t}$ while fixing the others
SDCA (Shalev-Shwartz & Zhang (2013))
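Maintain a primal solution: $\mathbf{w}_t = \frac{1}{\lambda n}\sum_{i=1}^n \alpha_i^t\mathbf{x}_i$
Optimize the increment $\Delta\alpha_{i_t}$
$$\max_{\Delta\alpha_{i_t}\in\mathbb{R}} -\frac{1}{n}\ell_{i_t}^*\left(-(\alpha_{i_t}^t + \Delta\alpha_{i_t})\right) - \frac{\lambda}{2}\left\|\frac{1}{\lambda n}\left(\sum_{i=1}^n \alpha_i^t\mathbf{x}_i + \Delta\alpha_{i_t}\mathbf{x}_{i_t}\right)\right\|_2^2$$
$$\Longleftrightarrow \max_{\Delta\alpha_{i_t}\in\mathbb{R}} -\frac{1}{n}\ell_{i_t}^*\left(-(\alpha_{i_t}^t + \Delta\alpha_{i_t})\right) - \frac{\lambda}{2}\left\|\mathbf{w}_t + \frac{1}{\lambda n}\Delta\alpha_{i_t}\mathbf{x}_{i_t}\right\|_2^2$$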
SDCA (Shalev-Shwartz & Zhang (2013))
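Dual Coordinate Updates
$$\Delta\alpha_{i_t} = \arg\max_{\Delta\alpha_{i_t}\in\mathbb{R}} -\frac{1}{n}\ell_{i_t}^*\left(-(\alpha_{i_t}^t + \Delta\alpha_{i_t})\right) - \frac{\lambda}{2}\left\|\mathbf{w}_t + \frac{1}{\lambda n}\Delta\alpha_{i_t}\mathbf{x}_{i_t}\right\|_2^2$$
$$\alpha_{i_t}^{t+1} = \alpha_{i_t}^t + \Delta\alpha_{i_t}$$
$$\mathbf{w}_{t+1} = \mathbf{w}_t + \frac{1}{\lambda n}\Delta\alpha_{i_t}\mathbf{x}_{i_t}$$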
SDCA updates
Closed-form solution for $\Delta\alpha_i$: hinge loss, squared hinge loss, absolute loss and square loss (Shalev-Shwartz & Zhang (2013))
e.g. square loss
$$\Delta\alpha_i = \frac{y_i - \mathbf{w}_t^\top\mathbf{x}_i - \alpha_i^t}{1 + \|\mathbf{x}_i\|_2^2/(\lambda n)}$$
Per-iteration cost: O(d)
Approximate solution: logistic loss (Shalev-Shwartz & Zhang (2013))
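The closed-form square-loss update can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: it assumes the square loss is scaled as $\ell_i(z) = \frac{1}{2}(z - y_i)^2$ (the scaling for which the formula above is the exact coordinate maximizer) and samples coordinates uniformly with replacement. The function name `sdca_ridge` and its parameters are hypothetical.

```python
import random

def sdca_ridge(X, y, lam, epochs=200, seed=0):
    """SDCA sketch for (1/n) * sum 0.5*(w.x_i - y_i)^2 + (lam/2)*||w||^2."""
    rng = random.Random(seed)
    n, d = len(X), len(X[0])
    alpha = [0.0] * n          # dual variables
    w = [0.0] * d              # primal iterate, kept equal to (1/(lam*n)) * sum alpha_i x_i
    sq_norms = [sum(v * v for v in x) for x in X]
    for _ in range(epochs * n):
        i = rng.randrange(n)   # sample one coordinate uniformly
        pred = sum(wj * xj for wj, xj in zip(w, X[i]))
        # Closed-form maximizer of the one-dimensional dual subproblem.
        delta = (y[i] - pred - alpha[i]) / (1.0 + sq_norms[i] / (lam * n))
        alpha[i] += delta
        scale = delta / (lam * n)
        for j in range(d):     # O(d) primal update
            w[j] += scale * X[i][j]
    return w, alpha
```

Each iteration touches a single example, so the cost is O(d), matching the per-iteration cost stated above. No step size has to be tuned.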
SDCA
Iteration complexity

| R(w) \ ℓ(z) ≡ ℓ(z, y) | Non-smooth | Smooth |
|---|---|---|
| Non-strongly convex | N.A.² | N.A.² |
| λ-strongly convex | $O\left(n + \frac{1}{\lambda\epsilon}\right)$ | $O\left(\left(n + \frac{1}{\lambda}\right)\log\left(\frac{1}{\epsilon}\right)\right)$ |

Total Runtime (smooth loss): $O\left(d\left(n + \frac{1}{\lambda}\right)\log\left(\frac{1}{\epsilon}\right)\right)$. The same as SVRG and SAGA!
also equivalent to some kind of variance reduction
Proximal variant for elastic net regularizer
Wang & Lin (2014) shows linear convergence is achievable for non-smooth loss
² A small trick can fix this
SDCA vs SVRG/SAGA
Advantages of SDCA
Can handle non-smooth loss functions
Can exploit data sparsity for efficient updates
Parameter free
Randomized Coordinate Updates
Randomized Coordinate Descent
Accelerated Proximal Coordinate Gradient
Randomized Coordinate Updates
$$\min_{\mathbf{w}\in\mathbb{R}^d} F(\mathbf{w}) = \frac{1}{n}\sum_{i=1}^n \ell(\mathbf{w}^\top\mathbf{x}_i, y_i) + R(\mathbf{w})$$
Suppose d >> n. Per-iteration cost O(d) is too high
Sample over features instead of data
Per-iteration cost becomes O(n)
Applicable when ℓ(z, y) is smooth and R(w) is decomposable
Strong convexity is not necessary
Randomized Coordinate Descent (Nesterov (2012))
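$$\min_{\mathbf{w}\in\mathbb{R}^d} F(\mathbf{w}) = \frac{1}{2}\|X\mathbf{w} - \mathbf{y}\|_2^2 + \frac{\lambda}{2}\|\mathbf{w}\|_2^2$$
$X = [\mathbf{x}_1, \mathbf{x}_2, \cdots, \mathbf{x}_d] \in \mathbb{R}^{n\times d}$
Partial gradient: $\nabla_i F(\mathbf{w}) = \mathbf{x}_i^\top(X\mathbf{w} - \mathbf{y}) + \lambda w_i$
Randomly sample $i_t \in \{1,\dots,d\}$
Randomized Coordinate Descent (RCD)
$$w_i^t = \begin{cases} w_i^{t-1} - \gamma_t\nabla_i F(\mathbf{w}^{t-1}) & \text{if } i = i_t \\ w_i^{t-1} & \text{otherwise} \end{cases}$$
step size $\gamma_t$: constant
$\nabla_i F(\mathbf{w}_t)$ can be updated in O(n)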
Randomized Coordinate Descent (Nesterov (2012))
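Partial gradient: $\nabla_i F(\mathbf{w}) = \mathbf{x}_i^\top(X\mathbf{w} - \mathbf{y}) + \lambda w_i$
Randomly sample $i_t \in \{1,\dots,d\}$
Randomized Coordinate Descent (RCD)
$$w_i^t = \begin{cases} w_i^{t-1} - \gamma_t\nabla_i F(\mathbf{w}^{t-1}) & \text{if } i = i_t \\ w_i^{t-1} & \text{otherwise} \end{cases}$$
maintain and update $\mathbf{u} = X\mathbf{w} - \mathbf{y} \in \mathbb{R}^n$ in O(n)
$$\mathbf{u}_t = \mathbf{u}_{t-1} + \mathbf{x}_{i_t}(w_{i_t}^t - w_{i_t}^{t-1}) = \mathbf{u}_{t-1} + \mathbf{x}_{i_t}\Delta w$$
partial gradient can be computed in O(n)
$$\nabla_i F(\mathbf{w}_t) = \mathbf{x}_i^\top\mathbf{u}_t + \lambda w_i^t$$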
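The residual-maintenance trick can be sketched as follows. This is an illustrative sketch under stated assumptions: the data are passed as a list of feature columns `cols` (so `cols[i]` plays the role of $\mathbf{x}_i$), the partial gradient includes the $\lambda w_i$ term from its definition, and the constant step size is chosen as $1/\max_i(\|\mathbf{x}_i\|_2^2 + \lambda)$, which is one safe choice rather than the tutorial's prescription. The name `rcd_ridge` is hypothetical.

```python
import random

def rcd_ridge(cols, y, lam, iters=2000, seed=0):
    """RCD sketch for 0.5*||Xw - y||^2 + (lam/2)*||w||^2, X given by columns."""
    rng = random.Random(seed)
    d, n = len(cols), len(y)
    w = [0.0] * d
    u = [-yi for yi in y]      # residual u = Xw - y at w = 0
    # Constant step size: inverse of the largest coordinate-wise curvature.
    gamma = 1.0 / max(sum(v * v for v in c) + lam for c in cols)
    for _ in range(iters):
        i = rng.randrange(d)   # sample one feature uniformly
        # Partial gradient from the maintained residual: O(n).
        grad_i = sum(c * r for c, r in zip(cols[i], u)) + lam * w[i]
        step = -gamma * grad_i
        w[i] += step
        for k in range(n):     # residual update: O(n)
            u[k] += cols[i][k] * step
    return w, u
```

Both the gradient evaluation and the residual update touch one column of X, so each iteration costs O(n) rather than O(nd).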
RCD
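Per-iteration cost: O(n)
Iteration complexity

| R(w) \ ℓ(z) ≡ ℓ(z, y) | Non-smooth | Smooth |
|---|---|---|
| Non-strongly convex | N.A. | $O\left(\frac{d}{\epsilon}\right)$ |
| λ-strongly convex | N.A. | $O\left(\frac{d}{\lambda}\log\left(\frac{1}{\epsilon}\right)\right)$ |

Total Runtime (strongly convex): $O\left(\frac{nd}{\lambda}\log\left(\frac{1}{\epsilon}\right)\right)$. The same as the Gradient Descent Method! In practice, it could be much faster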
Accelerated Proximal Coordinate Gradient (APCG)
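$$\min_{\mathbf{w}\in\mathbb{R}^d} F(\mathbf{w}) = \frac{1}{2}\|X\mathbf{w} - \mathbf{y}\|_2^2 + \frac{\lambda}{2}\|\mathbf{w}\|_2^2 + \tau\|\mathbf{w}\|_1$$
Using Acceleration
Using Proximal Mapping
APCG (Lin et al., 2014)
$$w_i^t = \begin{cases} \arg\min_{w_i\in\mathbb{R}} \nabla_i F(\mathbf{v}^{t-1})w_i + \frac{1}{2\gamma_t}(w_i - v_i^{t-1})^2 + \tau|w_i| & \text{if } i = i_t \\ w_i^{t-1} & \text{otherwise} \end{cases}$$
$$\mathbf{v}_t = \mathbf{w}_t + \eta_t(\mathbf{w}_t - \mathbf{w}_{t-1})$$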
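The one-dimensional arg min in the coordinate update has a closed form: a gradient step on the smooth part followed by soft-thresholding. A minimal sketch of that single proximal step (not the full APCG method, whose momentum weights are not given on the slide); the function names and the argument `grad_i`, standing for the partial gradient of the smooth part at $\mathbf{v}^{t-1}$, are illustrative assumptions.

```python
def soft_threshold(x, thresh):
    """Proximal mapping of thresh*|.|: shrink x toward zero by thresh."""
    if x > thresh:
        return x - thresh
    if x < -thresh:
        return x + thresh
    return 0.0

def prox_coordinate_step(grad_i, v_i, gamma, tau):
    """Minimizer over w of grad_i*w + (w - v_i)^2/(2*gamma) + tau*|w|."""
    return soft_threshold(v_i - gamma * grad_i, gamma * tau)
```

For example, `prox_coordinate_step(1.0, 2.0, 0.5, 1.0)` takes the gradient step to 1.5 and then shrinks by `gamma * tau = 0.5`.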
APCG
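Per-iteration cost: O(n)
Iteration complexity

| R(w) \ ℓ(z) ≡ ℓ(z, y) | Non-smooth | Smooth |
|---|---|---|
| Non-strongly convex | N.A. | $O\left(\frac{d}{\sqrt{\epsilon}}\right)$ |
| λ-strongly convex | N.A. | $O\left(\frac{d}{\sqrt{\lambda}}\log\left(\frac{1}{\epsilon}\right)\right)$ |

Total Runtime (strongly convex): $O\left(\frac{nd}{\sqrt{\lambda}}\log\left(\frac{1}{\epsilon}\right)\right)$. The same as APG! In practice, it could be much faster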
APCG applied to the Dual
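Recall the acceleration scheme for the full gradient method
Auxiliary sequence ($\beta^t$)
Momentum step
Maintain a primal solution: $\mathbf{w}_t = \frac{1}{\lambda n}\sum_{i=1}^n \beta_i^t\mathbf{x}_i$
Dual Coordinate Updates
Sample $i_t \in \{1,\dots,n\}$
$$\Delta\beta_{i_t} = \arg\max_{\Delta\beta_{i_t}\in\mathbb{R}} -\frac{1}{n}\ell_{i_t}^*\left(-\beta_{i_t}^t - \Delta\beta_{i_t}\right) - \frac{\lambda}{2}\left\|\mathbf{w}_t + \frac{1}{\lambda n}\Delta\beta_{i_t}\mathbf{x}_{i_t}\right\|_2^2$$
$$\alpha_{i_t}^{t+1} = \beta_{i_t}^t + \Delta\beta_{i_t}$$
Momentum step:
$$\beta^{t+1} = \alpha^{t+1} + \eta_t(\alpha^{t+1} - \alpha^t)$$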
APCG applied to the Dual
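Per-iteration cost: O(d)
Iteration complexity

| R(w) \ ℓ(z) ≡ ℓ(z, y) | Non-smooth | Smooth |
|---|---|---|
| Non-strongly convex | N.A.³ | N.A.⁴ |
| λ-strongly convex | $O\left(n + \sqrt{\frac{n}{\lambda\epsilon}}\right)$ | $O\left(\left(n + \sqrt{\frac{n}{\lambda}}\right)\log\left(\frac{1}{\epsilon}\right)\right)$ |

Total Runtime (smooth): $O\left(d\left(n + \sqrt{\frac{n}{\lambda}}\right)\log\left(\frac{1}{\epsilon}\right)\right)$. It could be faster than SDCA, $O\left(d\left(n + \frac{1}{\lambda}\right)\log\left(\frac{1}{\epsilon}\right)\right)$, when $\lambda \le \frac{1}{n}$
³ A small trick can fix this
⁴ A small trick can fix this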
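To see where the threshold $\lambda \le 1/n$ comes from, compare the two leading factors directly. The numbers below are illustrative values chosen for this sketch, not figures from the tutorial.

```latex
\begin{aligned}
\sqrt{\frac{n}{\lambda}} \le \frac{1}{\lambda}
\;&\Longleftrightarrow\; \frac{n}{\lambda} \le \frac{1}{\lambda^2}
\;\Longleftrightarrow\; \lambda \le \frac{1}{n}, \\
\text{e.g. } n = 10^6,\ \lambda = 10^{-8}:\quad
n + \frac{1}{\lambda} &= 10^6 + 10^8 \approx 1.01\times 10^8, \qquad
n + \sqrt{\frac{n}{\lambda}} = 10^6 + 10^7 = 1.1\times 10^7 .
\end{aligned}
```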
APCG vs. SDCA (Lin et al., 2014)
Summary
| R(w) \ ℓ(z) ≡ ℓ(z, y) | Non-smooth | Smooth |
|---|---|---|
| Non str-cvx | SGD | RCD, APCG, SAGA |
| str-cvx | SDCA, APCG | RCD, APCG, SDCA, SVRG, SAGA |

Red: stochastic gradient, primal
Blue: randomized coordinate, primal
Green: stochastic coordinate, dual
Summary
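| | SGD | SVRG | SAGA | SDCA | APCG |
|---|---|---|---|---|---|
| Parameters | $\gamma_t$ | $\gamma_t$, m | $\gamma_t$ | None | $\eta_t$ |
| Non-smooth loss | ✓ | ✗ | ✗ | ✓ | ✓ |
| Smooth loss | ✓ | ✓ | ✓ | ✓ | ✓ |
| Strongly cvx | ✓ | ✓ | ✓ | ✓ | ✓ |
| Non-strongly cvx | ✓ | ✗ | ✓ | ✗ | ✗ |
| Primal | ✓ | ✓ | ✓ | ✗ | ✓ |
| Dual | ✗ | ✗ | ✗ | ✓ | ✓ |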
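Trick for generalizing to non-strongly convex regularizer (Shalev-Shwartz & Zhang, 2012)
$$\min_{\mathbf{w}\in\mathbb{R}^d} \frac{1}{n}\sum_{i=1}^n \ell(\mathbf{w}^\top\mathbf{x}_i, y_i) + \tau\|\mathbf{w}\|_1$$
Issue: Not strongly convex
Solution: Add $\ell_2^2$ regularization
$$\min_{\mathbf{w}\in\mathbb{R}^d} \frac{1}{n}\sum_{i=1}^n \ell(\mathbf{w}^\top\mathbf{x}_i, y_i) + \tau\|\mathbf{w}\|_1 + \frac{\lambda}{2}\|\mathbf{w}\|_2^2$$
If $\|\mathbf{w}_*\|_2 \le B$, we can set $\lambda = \frac{\epsilon}{B^2}$.
An ε/2-suboptimal solution for the new problem is ε-suboptimal for the original problem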
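The ε-suboptimality claim follows from a short chain of inequalities. In the sketch below, $P$ denotes the original objective, $P_\lambda = P + \frac{\lambda}{2}\|\cdot\|_2^2$ the regularized one, $\mathbf{w}_*$ a minimizer of $P$, and $\hat{\mathbf{w}}$ an ε/2-suboptimal point of $P_\lambda$; these symbols are introduced here for illustration.

```latex
\begin{aligned}
P(\hat{\mathbf{w}})
&\le P_\lambda(\hat{\mathbf{w}})
\le \min_{\mathbf{w}} P_\lambda(\mathbf{w}) + \frac{\epsilon}{2}
\le P_\lambda(\mathbf{w}_*) + \frac{\epsilon}{2} \\
&= P(\mathbf{w}_*) + \frac{\lambda}{2}\|\mathbf{w}_*\|_2^2 + \frac{\epsilon}{2}
\le P(\mathbf{w}_*) + \frac{\epsilon}{2B^2}B^2 + \frac{\epsilon}{2}
= P(\mathbf{w}_*) + \epsilon .
\end{aligned}
```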
Outline
2 Optimization
  (Sub)Gradient Methods
  Stochastic Optimization Algorithms for Big Data
    Stochastic Optimization
    Distributed Optimization
Big Data and Distributed Optimization
Distributed Optimization
  data distributed over a cluster of multiple machines
  moving to a single machine suffers from
    low network bandwidth
    limited disk or memory
  communication vs. computation
    RAM: 100 nanoseconds
    standard network connection: 250,000 nanoseconds
Distributed Data
n data points are partitioned and distributed to K machines
$$[\mathbf{x}_1, \mathbf{x}_2, \dots, \mathbf{x}_n] = S_1 \cup S_2 \cup \dots \cup S_K$$
Machine j only has access to $S_j$.
W.L.O.G.: $|S_j| = n_k = \frac{n}{K}$
[Figure: data partitions S1, S2, S3, S4, S5, S6]
A simple solution: Average Solution
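Global problem
$$\mathbf{w}_\star = \arg\min_{\mathbf{w}\in\mathbb{R}^d} \left\{ F(\mathbf{w}) = \frac{1}{N}\sum_{i=1}^N \ell(\mathbf{w}^\top\mathbf{x}_i, y_i) + R(\mathbf{w}) \right\}$$
Machine j solves a local problem
$$\mathbf{w}_j = \arg\min_{\mathbf{w}\in\mathbb{R}^d} \left\{ f_j(\mathbf{w}) = \frac{1}{n_k}\sum_{i\in S_j} \ell(\mathbf{w}^\top\mathbf{x}_i, y_i) + R(\mathbf{w}) \right\}$$
[Figure: partitions S1, ..., S6 with their local solutions w1, ..., w6]
Center computes: $\widehat{\mathbf{w}} = \frac{1}{K}\sum_{j=1}^K \mathbf{w}_j$
Issue: Will not converge to $\mathbf{w}_\star$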
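A tiny one-dimensional ridge regression example makes the issue concrete: the average of the local minimizers generally differs from the global minimizer. The data, the loss scaling $\frac{1}{2}(wx - y)^2$ with $R(w) = \frac{\lambda}{2}w^2$, and the function names are assumptions made for this sketch.

```python
def ridge_1d(xs, ys, lam):
    """Minimizer of (1/n) * sum 0.5*(w*x - y)^2 + (lam/2)*w^2 for scalar w."""
    n = len(xs)
    return sum(x * y for x, y in zip(xs, ys)) / (sum(x * x for x in xs) + lam * n)

def averaged_solution(partitions, lam):
    """Each machine solves its local problem; the center averages the results."""
    local = [ridge_1d(xs, ys, lam) for xs, ys in partitions]
    return sum(local) / len(local)

# Two machines holding two points each (hypothetical data).
partitions = [([1.0, 1.0], [1.0, 1.0]), ([2.0, 2.0], [0.0, 0.0])]
w_avg = averaged_solution(partitions, lam=0.1)
w_global = ridge_1d([1.0, 1.0, 2.0, 2.0], [1.0, 1.0, 0.0, 0.0], lam=0.1)
```

Here `w_avg` and `w_global` differ noticeably, and no amount of extra local computation closes the gap, because each machine has already solved its local problem exactly.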
Total Runtime
Single machine
  Total Runtime = Per-iteration Cost × Iteration Complexity
Distributed optimization
  Total Runtime = (Communication Time Per-round + Local Runtime Per-round) × Rounds of Communication
Trading Computation for Communication: balance between increasing local computation and reducing the rounds of communication
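The two runtime formulas can be written as small helper functions to compare settings. The function names and the timings used in the example are hypothetical; only the formulas come from the slide.

```python
def single_machine_runtime(per_iteration_cost, iterations):
    """Total runtime = per-iteration cost x iteration complexity."""
    return per_iteration_cost * iterations

def distributed_runtime(comm_time_per_round, local_time_per_round, rounds):
    """Total runtime = (communication + local runtime per round) x rounds."""
    return (comm_time_per_round + local_time_per_round) * rounds

# Hypothetical timings in microseconds: ten times more local work per round,
# assumed (for illustration only) to cut the number of rounds by ten.
light_local = distributed_runtime(250, 50, 1000)
heavy_local = distributed_runtime(250, 500, 100)
```

Under these assumed numbers the heavier local computation wins, because the fixed communication time is paid far fewer times.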
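Distributed SDCA (DisDCA) (Yang, 2013), CoCoA+ (Ma et al., 2015)
Applicable when R(w) is strongly convex, e.g. $R(\mathbf{w}) = \frac{\lambda}{2}\|\mathbf{w}\|_2^2$
Global dual problem
$$\max_{\alpha\in\mathbb{R}^n} \frac{1}{n}\sum_{i=1}^n -\ell_i^*(-\alpha_i) - \frac{\lambda}{2}\left\|\frac{1}{\lambda n}\sum_{i=1}^n \alpha_i\mathbf{x}_i\right\|_2^2$$
Incremental variable $\Delta\alpha_i$
$$\max_{\Delta\alpha} \frac{1}{n}\sum_{i=1}^n -\ell_i^*\left(-(\alpha_i^t + \Delta\alpha_i)\right) - \frac{\lambda}{2}\left\|\mathbf{w}_t + \frac{1}{\lambda n}\sum_{i=1}^n \Delta\alpha_i\mathbf{x}_i\right\|_2^2$$
Primal solution: $\mathbf{w}_t = \frac{1}{\lambda n}\sum_{i=1}^n \alpha_i^t\mathbf{x}_i$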
DisDCA: Trading Computation for Communication
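$$\Delta\alpha_{i_j} = \arg\max_{\Delta\alpha_{i_j}} -\ell_{i_j}^*\left(-\alpha_{i_j}^t - \Delta\alpha_{i_j}\right) - \frac{\lambda n}{2K}\left\|\mathbf{u}_j^t + \frac{K}{\lambda n}\Delta\alpha_{i_j}\mathbf{x}_{i_j}\right\|_2^2$$
$$\mathbf{u}_j^{t+1} = \mathbf{u}_j^t + \frac{K}{\lambda n}\Delta\alpha_{i_j}\mathbf{x}_{i_j}$$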
CoCoA+ (Ma et al., 2015)
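Machine j approximately solves
$$\Delta\alpha_{S_j}^t \approx \arg\max_{\Delta\alpha_{S_j}\in\mathbb{R}^n} \sum_{i\in S_j} -\ell_i^*\left(-(\alpha_i^t + \Delta\alpha_i)\right) - \left\langle \mathbf{w}_t, \sum_{i\in S_j}\Delta\alpha_i\mathbf{x}_i \right\rangle - \frac{K}{2\lambda n}\left\|\sum_{i\in S_j}\Delta\alpha_i\mathbf{x}_i\right\|_2^2$$
$$\alpha_{S_j}^{t+1} = \alpha_{S_j}^t + \Delta\alpha_{S_j}^t, \qquad \Delta\mathbf{w}_j^t = \frac{1}{\lambda n}\sum_{i\in S_j}\Delta\alpha_i^t\mathbf{x}_i$$
Center computes: $\mathbf{w}_{t+1} = \mathbf{w}_t + \sum_{j=1}^K \Delta\mathbf{w}_j^t$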
CoCoA+ (Ma et al., 2015)
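Local objective value
$$G_j(\Delta\alpha_{S_j}, \mathbf{w}_t) = \frac{1}{n}\sum_{i\in S_j} -\ell_i^*\left(-(\alpha_i^t + \Delta\alpha_i)\right) - \frac{1}{n}\left\langle \mathbf{w}_t, \sum_{i\in S_j}\Delta\alpha_i\mathbf{x}_i \right\rangle - \frac{K}{2\lambda n^2}\left\|\sum_{i\in S_j}\Delta\alpha_i\mathbf{x}_i\right\|_2^2$$
Solve $\Delta\alpha_{S_j}^t$ by any local solver as long as
$$\left(\max_{\Delta\alpha_{S_j}} G_j(\Delta\alpha_{S_j}, \mathbf{w}_t) - G_j(\Delta\alpha_{S_j}^t, \mathbf{w}_t)\right) \le \Theta\left(\max_{\Delta\alpha_{S_j}} G_j(\Delta\alpha_{S_j}, \mathbf{w}_t) - G_j(0, \mathbf{w}_t)\right), \qquad 0 < \Theta < 1$$
CoCoA+ is equivalent to DisDCA when employing SDCA to solve local problems with m iterations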
Distributed SDCA in Practice
Choice of m (i.e., the number of inner iterations)
  the larger m, the higher the local computation cost and the lower the communication cost
Choice of K (i.e., the number of machines)
  the larger K, the lower the local computation cost and the higher the communication cost
DisDCA is implemented
http://cs.uiowa.edu/~tyng/software.html
Classification and Regression
Loss
  1 Hinge loss and squared hinge loss (SVM)
  2 Logistic loss (Logistic Regression)
  3 Square loss (Ridge Regression/LASSO)
Regularizer
  1 ℓ2 norm
  2 mixture of ℓ1 norm and ℓ2 norm
Multi-class: one-vs-all
Alternating Direction Method of Multipliers (ADMM)
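$$\min_{\mathbf{w}\in\mathbb{R}^d} F(\mathbf{w}) = \sum_{k=1}^K \underbrace{\frac{1}{n}\sum_{i\in S_k} \ell(\mathbf{w}^\top\mathbf{x}_i, y_i)}_{f_k(\mathbf{w})} + \frac{\lambda}{2}\|\mathbf{w}\|_2^2$$
each $f_k(\mathbf{w})$ is on an individual machine
but the machines are coupled through the shared w
$$\min_{\mathbf{w}_1,\dots,\mathbf{w}_K,\mathbf{w}\in\mathbb{R}^d} F(\mathbf{w}) = \sum_{k=1}^K f_k(\mathbf{w}_k) + \frac{\lambda}{2}\|\mathbf{w}\|_2^2$$
$$\text{s.t. } \mathbf{w}_k = \mathbf{w}, \quad k = 1,\dots,K$$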
The Augmented Lagrangian Function
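$$\min_{\mathbf{w}_1,\dots,\mathbf{w}_K,\mathbf{w}\in\mathbb{R}^d} \sum_{k=1}^K f_k(\mathbf{w}_k) + \frac{\lambda}{2}\|\mathbf{w}\|_2^2 \qquad \text{s.t. } \mathbf{w}_k = \mathbf{w}, \quad k = 1,\dots,K$$
The Augmented Lagrangian function
$$L(\{\mathbf{w}_k\}, \{\mathbf{z}_k\}, \mathbf{w}) = \sum_{k=1}^K f_k(\mathbf{w}_k) + \frac{\lambda}{2}\|\mathbf{w}\|_2^2 + \sum_{k=1}^K \mathbf{z}_k^\top(\mathbf{w}_k - \mathbf{w}) + \frac{\rho}{2}\sum_{k=1}^K \|\mathbf{w}_k - \mathbf{w}\|_2^2$$
where the $\mathbf{z}_k$ are the Lagrangian multipliers,
is the Lagrangian function of
$$\min_{\mathbf{w}_1,\dots,\mathbf{w}_K,\mathbf{w}\in\mathbb{R}^d} \sum_{k=1}^K f_k(\mathbf{w}_k) + \frac{\lambda}{2}\|\mathbf{w}\|_2^2 + \frac{\rho}{2}\sum_{k=1}^K \|\mathbf{w}_k - \mathbf{w}\|_2^2 \qquad \text{s.t. } \mathbf{w}_k = \mathbf{w}, \quad k = 1,\dots,K$$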
ADMM
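$$L(w_k, z_k, w) = \sum_{k=1}^K f_k(w_k) + \frac{\lambda}{2}\|w\|_2^2 + \sum_{k=1}^K z_k^\top (w_k - w) + \frac{\rho}{2}\sum_{k=1}^K \|w_k - w\|_2^2$$

Update from $(w_k^t, z_k^t, w^t)$ to $(w_k^{t+1}, z_k^{t+1}, w^{t+1})$:

Optimize on individual machines:

$$w_k^{t+1} = \arg\min_{w_k}\ f_k(w_k) + (z_k^t)^\top (w_k - w^t) + \frac{\rho}{2}\|w_k - w^t\|_2^2, \quad k = 1,\dots,K$$

Aggregate and update on one machine (the multiplier term enters with a minus sign because $L$ contains $z_k^\top(w_k - w)$):

$$w^{t+1} = \arg\min_{w}\ \frac{\lambda}{2}\|w\|_2^2 - \sum_{k=1}^K (z_k^t)^\top w + \frac{\rho}{2}\sum_{k=1}^K \|w_k^{t+1} - w\|_2^2$$

Update on individual machines:

$$z_k^{t+1} = z_k^t + \rho\,(w_k^{t+1} - w^{t+1})$$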
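The three ADMM steps can be sketched in a few lines of NumPy. This is a minimal illustration, not the tutorial's implementation: it assumes a least-squares local loss $f_k(w) = \frac{1}{2n}\|X_k w - y_k\|_2^2$ so that every step has a closed form obtained by setting the gradient of $L$ to zero, it simulates the $K$ machines in a single process, and the function name `consensus_admm_ridge` is hypothetical.

```python
import numpy as np

def consensus_admm_ridge(X_parts, y_parts, lam, rho, n_iters=300):
    """Consensus ADMM for sum_k f_k(w_k) + lam/2 ||w||^2, s.t. w_k = w.

    Assumption: f_k(w) = 1/(2n) ||X_k w - y_k||^2 (least squares),
    so each local subproblem is a small linear system.
    """
    K = len(X_parts)
    d = X_parts[0].shape[1]
    n = sum(X.shape[0] for X in X_parts)
    w = np.zeros(d)
    W = np.zeros((K, d))   # local copies w_k, one row per machine
    Z = np.zeros((K, d))   # Lagrangian multipliers z_k
    # Terms of the local solves that do not change across iterations.
    grams = [X.T @ X / n + rho * np.eye(d) for X in X_parts]
    rhs0 = [X.T @ y / n for X, y in zip(X_parts, y_parts)]
    for _ in range(n_iters):
        # Step 1 (individual machines): minimize L over w_k.
        for k in range(K):
            W[k] = np.linalg.solve(grams[k], rhs0[k] - Z[k] + rho * w)
        # Step 2 (one machine): minimize L over w using the aggregated z_k, w_k.
        w = (Z.sum(axis=0) + rho * W.sum(axis=0)) / (lam + rho * K)
        # Step 3 (individual machines): multiplier update.
        Z += rho * (W - w)
    return w

rng = np.random.default_rng(0)
n, d, K = 200, 5, 4
X = rng.standard_normal((n, d))
y = X @ rng.standard_normal(d) + 0.1 * rng.standard_normal(n)
X_parts, y_parts = np.array_split(X, K), np.array_split(y, K)
lam, rho = 0.1, 1.0

w_admm = consensus_admm_ridge(X_parts, y_parts, lam, rho)
# Reference: solve the undistributed ridge problem directly.
w_direct = np.linalg.solve(X.T @ X / n + lam * np.eye(d), X.T @ y / n)
print(np.max(np.abs(w_admm - w_direct)))
```

Only the $d$-dimensional vectors $w_k$, $z_k$, and $w$ cross machine boundaries in each round, which is why the number of iterations is the quantity that matters for communication.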
ADMM
$$w_k^{t+1} = \arg\min_{w_k}\ f_k(w_k) + (z_k^t)^\top (w_k - w^t) + \frac{\rho}{2}\|w_k - w^t\|_2^2, \quad k = 1,\dots,K$$

Each local problem can be solved by a local solver (e.g., SDCA)
Optimization can be inexact (trading computation for communication)
Complexity of ADMM
Assume local problems are solved exactly.
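Communication Complexity: $O\left(\log\left(\frac{1}{\varepsilon}\right)\right)$ due to the strong convexity of $R(w)$

Applicable to Non-strongly Convex Regularizer $R(w) = \|w\|_1$

$$\min_{w\in\mathbb{R}^d}\ F(w) = \sum_{k=1}^K \frac{1}{n}\sum_{i\in S_k} \ell(w^\top x_i, y_i) + \tau\|w\|_1$$

Communication Complexity: $O\left(\frac{1}{\varepsilon}\right)$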
Thank You! Questions?
Research Assistant Positions Available for PhD Candidates!
Start Fall’16
Optimization and Randomization
Online Learning
Deep Learning
Machine Learning
send email to [email protected]
Randomized Dimension Reduction
Big Data Analytics: Optimization and Randomization
Part III: Randomization
Outline
1 Basics
2 Optimization
3 Randomized Dimension Reduction
4 Randomized Algorithms
5 Concluding Remarks
Random Sketch
Approximate a large data matrix
by a much smaller sketch
The Framework of Randomized Algorithms
Why randomized dimension reduction?
Efficient
Robust (e.g., dropout)
Formal Guarantees
Can explore parallel algorithms
Randomized Dimension Reduction
Johnson-Lindenstrauss (JL) transforms
Subspace embeddings
Column sampling
JL Lemma
JL Lemma (Johnson & Lindenstrauss, 1984)
For any $0 < \varepsilon, \delta < 1/2$, there exists a probability distribution on $m \times d$ real matrices $A$ such that there exists a small universal constant $c > 0$ and for any fixed $x \in \mathbb{R}^d$, with a probability at least $1 - \delta$, we have

$$\left|\,\|Ax\|_2^2 - \|x\|_2^2\,\right| \le c\,\sqrt{\frac{\log(1/\delta)}{m}}\ \|x\|_2^2$$

or for $m = \Theta(\varepsilon^{-2}\log(1/\delta))$, then with a probability at least $1 - \delta$

$$\left|\,\|Ax\|_2^2 - \|x\|_2^2\,\right| \le \varepsilon\,\|x\|_2^2$$
Embedding a set of points into low dimensional space
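Given a set of points $x_1, \dots, x_n \in \mathbb{R}^d$, we can embed them into a low-dimensional space, $Ax_1, \dots, Ax_n \in \mathbb{R}^m$, such that the pairwise distance between any two points is well preserved in the low-dimensional space:

$$(1 - \varepsilon)\,\|x_i - x_j\|_2^2 \ \le\ \|Ax_i - Ax_j\|_2^2 = \|A(x_i - x_j)\|_2^2 \ \le\ (1 + \varepsilon)\,\|x_i - x_j\|_2^2$$

In other words, to preserve all pairwise Euclidean distances up to $1 \pm \varepsilon$ with probability at least $1 - \delta$, only $m = \Theta(\varepsilon^{-2}\log(n^2/\delta))$ dimensions are needed: apply the JL lemma to each of the fewer than $n^2$ difference vectors $x_i - x_j$ with failure probability $\delta/n^2$ and take a union bound.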
JL transforms: Gaussian Random Projection
Gaussian Random Projection (Dasgupta & Gupta, 2003): $A \in \mathbb{R}^{m\times d}$
$A_{ij} \sim \mathcal{N}(0, 1/m)$
$m = \Theta(\varepsilon^{-2}\log(1/\delta))$
Computational cost of $AX$, where $X \in \mathbb{R}^{d\times n}$:
$mnd$ for dense matrices
$\mathrm{nnz}(X)\,m$ for sparse matrices
Computational cost is very high (could be as high as solving many problems)
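A small numerical check of the Gaussian projection is sketched below. The sizes `d`, `n`, `m` and the synthetic Gaussian data are arbitrary choices for illustration; the script draws $A_{ij} \sim \mathcal{N}(0, 1/m)$ and reports how far the squared pairwise distances of the projected points move relative to the originals.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n, m = 2000, 30, 400
X = rng.standard_normal((d, n))                      # columns are the points x_1..x_n
A = rng.normal(0.0, np.sqrt(1.0 / m), size=(m, d))   # A_ij ~ N(0, 1/m)
AX = A @ X                                           # dense cost: m * n * d multiplications

def pairwise_sq_dists(M):
    """Squared Euclidean distances between all pairs of columns of M."""
    sq = np.sum(M * M, axis=0)
    return sq[:, None] + sq[None, :] - 2.0 * (M.T @ M)

iu = np.triu_indices(n, k=1)                         # each unordered pair once
ratios = pairwise_sq_dists(AX)[iu] / pairwise_sq_dists(X)[iu]
print(ratios.min(), ratios.max())                    # both should be close to 1
```

The spread of the ratios around 1 shrinks roughly like $1/\sqrt{m}$, matching the $\varepsilon^{-2}$ dependence of $m$.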
Accelerate JL transforms: using discrete distributions
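Using Discrete Distributions (Achlioptas, 2003):
$\Pr(A_{ij} = \pm\frac{1}{\sqrt{m}}) = 0.5$ (each sign with probability 0.5)
$\Pr(A_{ij} = \pm\sqrt{\frac{3}{m}}) = \frac{1}{6}$ (each sign with probability 1/6), $\Pr(A_{ij} = 0) = \frac{2}{3}$
Database friendly
Replace multiplications by additions and subtractions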
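Both discrete distributions are easy to sample. The sketch below is illustrative only (the helper name `achlioptas_matrix` and the test sizes are assumptions); it draws each variant and prints the ratio $\|Ax\|_2^2 / \|x\|_2^2$ for one random vector.

```python
import numpy as np

def achlioptas_matrix(m, d, rng, sparse=True):
    """Draw A in R^{m x d} from one of the two discrete distributions."""
    if sparse:
        # +sqrt(3/m) and -sqrt(3/m) with probability 1/6 each, 0 with probability 2/3
        signs = rng.choice([-1.0, 0.0, 1.0], size=(m, d), p=[1 / 6, 2 / 3, 1 / 6])
        return np.sqrt(3.0 / m) * signs
    # +1/sqrt(m) and -1/sqrt(m) with probability 1/2 each
    return rng.choice([-1.0, 1.0], size=(m, d)) / np.sqrt(m)

rng = np.random.default_rng(1)
d, m = 1000, 500
x = rng.standard_normal(d)
ratios = []
for sparse in (False, True):
    A = achlioptas_matrix(m, d, rng, sparse)
    ratios.append(np.linalg.norm(A @ x) ** 2 / np.linalg.norm(x) ** 2)
print(ratios)   # both entries should be close to 1
```

Because every nonzero entry has the same magnitude, $Ax$ can be computed with signed additions followed by a single rescaling, and in the sparse variant about two thirds of the entries are skipped altogether.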
Accelerate JL transforms: using Hadamard transform (I)
Fast JL transform based on randomized Hadamard transform:
Motivation: Can we simply use a random sampling matrix $P \in \mathbb{R}^{m\times d}$ that randomly selects $m$ coordinates out of $d$ coordinates (scaled by $\sqrt{d/m}$)?
Unfortunately: by Chernoff bound

$$\left|\,\|Px\|_2^2 - \|x\|_2^2\,\right| \le \frac{\sqrt{d}\,\|x\|_\infty}{\|x\|_2}\,\sqrt{\frac{3\log(2/\delta)}{m}}\ \|x\|_2^2$$

Unless $\frac{\sqrt{d}\,\|x\|_\infty}{\|x\|_2} \le c$, the random sampling does not work
Remedy is given by randomized Hadamard transform
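Randomized Hadamard transform
Hadamard transform: $H \in \mathbb{R}^{d\times d}$ with $d = 2^k$: $H = \sqrt{\frac{1}{d}}\,H_{2^k}$

$$H_1 = [1], \quad H_2 = \begin{bmatrix} 1 & 1 \\ 1 & -1 \end{bmatrix}, \quad H_{2^k} = \begin{bmatrix} H_{2^{k-1}} & H_{2^{k-1}} \\ H_{2^{k-1}} & -H_{2^{k-1}} \end{bmatrix}$$

$\|Hx\|_2 = \|x\|_2$ and $H$ is orthogonal
Computational cost of $Hx$: $d\log(d)$
Randomized Hadamard transform: $HD$
$D \in \mathbb{R}^{d\times d}$: a diagonal matrix with $\Pr(D_{ii} = \pm 1) = 0.5$
$HD$ is orthogonal and $\|HDx\|_2 = \|x\|_2$
Key property: $\frac{\sqrt{d}\,\|HDx\|_\infty}{\|HDx\|_2} \le \sqrt{\log(d/\delta)}$ w.h.p. $1 - \delta$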
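The $d\log(d)$ cost comes from the recursive block structure, which a butterfly loop exploits directly. The sketch below is an illustrative implementation (the names `fwht` and `spikiness` are hypothetical); it applies $HD$ to a single spike, the worst case for coordinate sampling, and compares $\sqrt{d}\,\|x\|_\infty / \|x\|_2$ before and after.

```python
import numpy as np

def fwht(x):
    """Normalized fast Walsh-Hadamard transform: returns H x with H = H_{2^k} / sqrt(d).

    Works on a vector of length d or on the columns of a (d, n) array.
    """
    x = np.array(x, dtype=float)          # work on a copy
    d = x.shape[0]
    assert (d & (d - 1)) == 0, "d must be a power of two"
    h = 1
    while h < d:                          # log2(d) butterfly stages
        for start in range(0, d, 2 * h):
            a = x[start:start + h].copy()
            b = x[start + h:start + 2 * h].copy()
            x[start:start + h] = a + b
            x[start + h:start + 2 * h] = a - b
        h *= 2
    return x / np.sqrt(d)

def spikiness(v):
    """sqrt(d) * ||v||_inf / ||v||_2, the quantity in the key property."""
    return np.sqrt(v.size) * np.max(np.abs(v)) / np.linalg.norm(v)

rng = np.random.default_rng(2)
d = 1024
x = np.zeros(d)
x[0] = 1.0                                # all mass on one coordinate
D = rng.choice([-1.0, 1.0], size=d)       # diagonal of D
y = fwht(D * x)                           # y = H D x

print(spikiness(x), spikiness(y))         # large before, small after
print(np.linalg.norm(x), np.linalg.norm(y))   # the norm is unchanged
```

After the transform the mass of the spike is spread evenly over all $d$ coordinates, so a uniformly sampled subset of coordinates sees a representative share of it.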
Accelerate JL transforms: using Hadamard transform (I)
Fast JL transform based on randomized Hadamard transform (Tropp, 2011):

$$A = \sqrt{\frac{d}{m}}\,PHD$$

yields

$$\left|\,\|Ax\|_2^2 - \|x\|_2^2\,\right| \le \sqrt{\frac{3\log(2/\delta)\log(d/\delta)}{m}}\ \|x\|_2^2$$

$m = \Theta(\varepsilon^{-2}\log(1/\delta)\log(d/\delta))$ suffices for $1 \pm \varepsilon$
The additional factor $\log(d/\delta)$ can be removed
Computational cost of $AX$: $O(nd\log(m))$
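The effect of composing sampling with $HD$ can be seen on a single spike. The sketch below is a simplified illustration under stated assumptions: it forms the Hadamard matrix densely (so it costs $O(d^2 n)$ rather than the $O(nd\log m)$ of a fast transform), samples coordinates without replacement, and uses hypothetical helper names `hadamard`, `srht`, and `sample_only`.

```python
import numpy as np

def hadamard(d):
    """Sylvester Hadamard matrix with +-1 entries; d must be a power of two."""
    H = np.ones((1, 1))
    while H.shape[0] < d:
        H = np.block([[H, H], [H, -H]])
    return H

def srht(X, m, rng):
    """Return A X for A = sqrt(d/m) P H D, with H normalized by 1/sqrt(d)."""
    d = X.shape[0]
    signs = rng.choice([-1.0, 1.0], size=d)        # diagonal of D
    rows = rng.choice(d, size=m, replace=False)    # P selects m coordinates
    HDX = hadamard(d) @ (signs[:, None] * X) / np.sqrt(d)
    return np.sqrt(d / m) * HDX[rows]

def sample_only(X, m, rng):
    """Plain coordinate sampling sqrt(d/m) P X, without H D."""
    d = X.shape[0]
    rows = rng.choice(d, size=m, replace=False)
    return np.sqrt(d / m) * X[rows]

rng = np.random.default_rng(3)
d, m = 1024, 256
spike = np.zeros((d, 1))
spike[0, 0] = 1.0
dense = rng.standard_normal((d, 1))

spike_ratio = np.linalg.norm(srht(spike, m, rng)) ** 2            # ||x||_2 = 1
plain_ratio = np.linalg.norm(sample_only(spike, m, rng)) ** 2
dense_ratio = np.linalg.norm(srht(dense, m, rng)) ** 2 / np.linalg.norm(dense) ** 2
print(spike_ratio, plain_ratio, dense_ratio)
```

For the spike, plain sampling returns a squared norm of either 0 or $d/m$ depending on whether the single nonzero coordinate is selected, while the transformed version keeps it at 1.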
Accelerate JL transforms: using a sparse matrix (I)
Random hashing (Dasgupta et al., 2010)

$$A = HD$$

where $D \in \mathbb{R}^{d\times d}$ and $H \in \mathbb{R}^{m\times d}$
random hashing: $h(j): \{1, \dots, d\} \to \{1, \dots, m\}$
$H_{ij} = 1$ if $h(j) = i$: sparse matrix (each column has only one non-zero entry)
$D \in \mathbb{R}^{d\times d}$: a diagonal matrix with $\Pr(D_{ii} = \pm 1) = 0.5$
$[Ax]_j = \sum_{i:\,h(i) = j} x_i D_{ii}$
Technically speaking, random hashing does not satisfy JL lemma
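Since each column of $HD$ has a single nonzero entry, $AX$ can be accumulated in one pass over the rows of $X$ without ever forming $A$. A minimal sketch, assuming a uniformly random bucket assignment for $h$ (the function name `random_hashing` and the sizes are illustrative):

```python
import numpy as np

def random_hashing(X, m, rng):
    """Apply A = H D to X (shape d x n) without forming A explicitly."""
    d = X.shape[0]
    h = rng.integers(0, m, size=d)               # h(j): bucket of coordinate j (0-based)
    signs = rng.choice([-1.0, 1.0], size=d)      # diagonal of D
    AX = np.zeros((m, X.shape[1]))
    # Row j of X is added, with its sign, to row h(j) of the sketch.
    np.add.at(AX, h, signs[:, None] * X)
    return AX, h, signs

rng = np.random.default_rng(4)
d, n, m = 1000, 20, 200
X = rng.standard_normal((d, n))
AX, h, signs = random_hashing(X, m, rng)

# Inner products are preserved in expectation; one draw is a noisy estimate.
print(AX[:, 0] @ AX[:, 1], X[:, 0] @ X[:, 1])
```

The work is proportional to the number of nonzeros of $X$, independent of $m$.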
Accelerate JL transforms: using a sparse matrix (I)
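Key properties:
$\mathbb{E}[\langle HDx_1, HDx_2\rangle] = \langle x_1, x_2\rangle$
and norm preserving, $\left|\,\|HDx\|_2^2 - \|x\|_2^2\,\right| \le \varepsilon\|x\|_2^2$, only when

$$\frac{\|x\|_\infty}{\|x\|_2} \le \frac{1}{\sqrt{c}}$$

Apply randomized Hadamard transform $P$ first: $\Theta(c\log(c/\delta))$ blocks of randomized Hadamard transform

$$\frac{\|Px\|_\infty}{\|Px\|_2} \le \frac{1}{\sqrt{c}}$$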
Accelerate JL transforms: using a sparse matrix (II)
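Sparse JL transform based on block random hashing (Kane & Nelson, 2014)

$$A = \begin{bmatrix} \frac{1}{\sqrt{s}}\,Q_1 \\ \vdots \\ \frac{1}{\sqrt{s}}\,Q_s \end{bmatrix}$$

Each $Q_i \in \mathbb{R}^{v\times d}$, $i = 1, \dots, s$, is an independent random hashing ($HD$) matrix
Set $v = \Theta(\varepsilon^{-1})$ and $s = \Theta(\varepsilon^{-1}\log(1/\delta))$
Computational cost of $AX$: $O\left(\frac{\mathrm{nnz}(X)}{\varepsilon}\log\left[\frac{1}{\delta}\right]\right)$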
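The block construction stacks $s$ independent hashing matrices, so every column of $A$ ends up with exactly $s$ nonzeros of magnitude $1/\sqrt{s}$. The sketch below builds $A$ densely purely for inspection; the helper name `block_hashing_matrix` and the values of `d`, `v`, `s` are arbitrary assumptions, not the settings prescribed by the theory.

```python
import numpy as np

def block_hashing_matrix(d, v, s, rng):
    """Dense copy of A = [Q_1; ...; Q_s] / sqrt(s), each Q_i a v x d hashing matrix."""
    A = np.zeros((s * v, d))
    cols = np.arange(d)
    for i in range(s):
        h = rng.integers(0, v, size=d)               # hash for block i
        signs = rng.choice([-1.0, 1.0], size=d)      # random signs for block i
        A[i * v + h, cols] = signs / np.sqrt(s)      # one nonzero per column per block
    return A

rng = np.random.default_rng(7)
d, v, s = 2000, 50, 8                                # m = s * v = 400 rows
A = block_hashing_matrix(d, v, s, rng)
x = rng.standard_normal(d)
ratio = np.linalg.norm(A @ x) ** 2 / np.linalg.norm(x) ** 2
print((A != 0).sum(axis=0).max(), ratio)             # s nonzeros per column; ratio near 1
```

Multiplying by $A$ touches each nonzero of $X$ exactly $s$ times, which is where the $\mathrm{nnz}(X)\cdot s$ cost comes from.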
Randomized Dimension Reduction
Johnson-Lindenstrauss (JL) transforms
Subspace embeddings
Column sampling
Subspace Embeddings
Definition: a subspace embedding given some parameters $0 < \varepsilon, \delta < 1$, $k \le d$ is a distribution $\mathcal{D}$ over matrices $A \in \mathbb{R}^{m\times d}$ such that for any fixed linear subspace $W \subseteq \mathbb{R}^d$ with $\dim(W) = k$ it holds that

$$\Pr_{A\sim\mathcal{D}}\left(\forall x \in W,\ \|Ax\|_2 \in (1 \pm \varepsilon)\|x\|_2\right) \ge 1 - \delta$$

It implies: if $U \in \mathbb{R}^{d\times k}$ is an orthogonal matrix (contains the orthonormal bases of $W$), then
$AU \in \mathbb{R}^{m\times k}$ is of full column rank
$\|AU\|_2 \in (1 \pm \varepsilon)$
$(1 - \varepsilon)^2 \le \|U^\top A^\top A U\|_2 \le (1 + \varepsilon)^2$
These are key properties in the theoretical analysis of many algorithms (e.g., low-rank matrix approximation, randomized least-squares regression, randomized classification)
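The implied properties can be checked numerically through the singular values of $AU$: every $x \in W$ is $Uc$ for some $c$, so $\|Ax\|_2 / \|x\|_2$ always lies between the smallest and largest singular value of $AU$. The sketch below uses a Gaussian matrix as the embedding and a random subspace, both chosen only for illustration.

```python
import numpy as np

rng = np.random.default_rng(5)
d, k, m = 2000, 10, 1000
U, _ = np.linalg.qr(rng.standard_normal((d, k)))      # orthonormal basis of a random W
A = rng.normal(0.0, np.sqrt(1.0 / m), size=(m, d))    # Gaussian JL matrix (assumed choice)

s = np.linalg.svd(A @ U, compute_uv=False)            # singular values of AU
print(s.min(), s.max())                               # both near 1, so AU has full column rank

# Any x in W is stretched by a factor between s.min() and s.max().
c = rng.standard_normal(k)
x = U @ c
ratio = np.linalg.norm(A @ x) / np.linalg.norm(x)
print(ratio)
```

Checking $k$ singular values certifies the guarantee for the whole subspace at once, which is what distinguishes a subspace embedding from the fixed-vector JL statement.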
Subspace Embeddings
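From a JL transform to a Subspace Embedding (Sarlos, 2006).
Let $A \in \mathbb{R}^{m\times d}$ be a JL transform. If

$$m = O\left(\frac{k\log\left[\frac{k}{\delta\varepsilon}\right]}{\varepsilon^2}\right)$$

then w.h.p. $1 - \delta^k$, $A \in \mathbb{R}^{m\times d}$ is a subspace embedding w.r.t. a $k$-dimensional space in $\mathbb{R}^d$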
Subspace Embeddings
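Making block random hashing a Subspace Embedding (Nelson & Nguyen, 2013).

$$A = \begin{bmatrix} \frac{1}{\sqrt{s}}\,Q_1 \\ \vdots \\ \frac{1}{\sqrt{s}}\,Q_s \end{bmatrix}$$

Each $Q_i \in \mathbb{R}^{v\times d}$, $i = 1, \dots, s$, is an independent random hashing ($HD$) matrix
Set $v = \Theta(k\varepsilon^{-1}\log^5(k/\delta))$ and $s = \Theta(\varepsilon^{-1}\log^3(k/\delta))$
w.h.p. $1 - \delta$, $A \in \mathbb{R}^{m\times d}$ with $m = \Theta\left(\frac{k\log^8(k/\delta)}{\varepsilon^2}\right)$ is a subspace embedding w.r.t. a $k$-dimensional space in $\mathbb{R}^d$
Computational cost of $AX$: $O\left(\frac{\mathrm{nnz}(X)}{\varepsilon}\log^3\left[\frac{k}{\delta}\right]\right)$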
Sparse Subspace Embedding (SSE)
Random hashing is SSE with a Constant Probability (Nelson & Nguyen, 2013)

$$A = HD$$

where $D \in \mathbb{R}^{d\times d}$ and $H \in \mathbb{R}^{m\times d}$
$m = \Omega(k^2/\varepsilon^2)$ suffices for a subspace embedding with a probability 2/3
Computational cost of $AX$: $O(\mathrm{nnz}(X))$
Randomized Dimensionality Reduction
Johnson-Lindenstrauss (JL) transforms
Subspace embeddings
Column (Row) sampling
Column sampling
Column subset selection (feature selection)
More interpretable
Uniform sampling usually does not work (not a JL transform)
Non-oblivious sampling (data-dependent sampling): leverage-score sampling
Leverage-score sampling (Drineas et al., 2006)
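Let $X \in \mathbb{R}^{d\times n}$ be a rank-$k$ matrix
$X = U\Sigma V^\top$: $U \in \mathbb{R}^{d\times k}$, $\Sigma \in \mathbb{R}^{k\times k}$
Leverage scores: $\|U_{i*}\|_2^2$, $i = 1, \dots, d$
Let $p_i = \frac{\|U_{i*}\|_2^2}{\sum_{j=1}^d \|U_{j*}\|_2^2}$, $i = 1, \dots, d$
Let $i_1, \dots, i_m \in \{1, \dots, d\}$ denote $m$ indices selected by following $p_i$
Let $A \in \mathbb{R}^{m\times d}$ be the sampling-and-rescaling matrix, whose row $t$ picks the $t$-th sampled index:

$$A_{tj} = \begin{cases} \frac{1}{\sqrt{m\,p_j}} & \text{if } j = i_t \\ 0 & \text{otherwise} \end{cases}$$

$AX \in \mathbb{R}^{m\times n}$ is a small sketch of $X$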
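The sampling procedure can be sketched as follows. This is a minimal illustration under stated assumptions: it computes exact leverage scores from a full SVD (the expensive route), samples indices independently with replacement, and uses a synthetic rank-$k$ matrix with a few deliberately heavy rows; the function name `leverage_score_sketch` is hypothetical.

```python
import numpy as np

def leverage_score_sketch(X, k, m, rng):
    """Sample and rescale m rows of X (d x n, rank k) by leverage-score probabilities."""
    U, _, _ = np.linalg.svd(X, full_matrices=False)
    U = U[:, :k]                                  # top-k left singular vectors
    scores = np.sum(U * U, axis=1)                # leverage scores ||U_{i*}||_2^2
    p = scores / scores.sum()                     # sampling probabilities p_i
    idx = rng.choice(X.shape[0], size=m, replace=True, p=p)   # indices i_1..i_m
    scale = 1.0 / np.sqrt(m * p[idx])             # rescaling 1 / sqrt(m p_j)
    return scale[:, None] * X[idx], scale[:, None] * U[idx], p

rng = np.random.default_rng(6)
d, n, k, m = 5000, 40, 5, 1000
X = rng.standard_normal((d, k)) @ rng.standard_normal((k, n))   # exactly rank k
X[:3] *= 50.0                                     # a few rows with high leverage

AX, AU, p = leverage_score_sketch(X, k, m, rng)
print(AX.shape)                                   # the m x n sketch of X
print(np.linalg.svd(AU, compute_uv=False) ** 2)   # squared singular values of AU, near 1
```

Rows with large leverage are sampled more often but scaled down by $1/\sqrt{m p_j}$, so that $\mathbb{E}[U^\top A^\top A U] = I$.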
Properties of Leverage-score sampling
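When $m = \Theta\left(\frac{k}{\epsilon^2}\log\left[\frac{2k}{\delta}\right]\right)$, with probability at least $1-\delta$:
$AU \in \mathbb{R}^{m\times k}$ has full column rank
$\sigma_i^2(AU) \ge 1-\epsilon \ge (1-\epsilon)^2$
$\sigma_i^2(AU) \le 1+\epsilon \le (1+\epsilon)^2$
Leverage-score sampling performs like a subspace embedding (only for $U$, the matrix of top singular vectors of $X$)
Computational cost: requires the top-$k$ SVD of $X$, which is expensive
Randomized algorithms can compute approximate leverage scores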
When does uniform sampling make sense?
Coherence measure: $\mu_k = \frac{d}{k} \max_{1\le i\le d} \|U_{i*}\|_2^2$
Uniform sampling is valid when the coherence measure is small (some real data mining datasets have small coherence measures)
The Nystrom method usually uses uniform sampling (Gittens, 2011)
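As a rough illustration of the coherence measure, the sketch below computes $\mu_k$ for two toy matrices: one whose energy is spread over all rows and one whose energy sits in only $k$ rows. The helper name `coherence` and both test matrices are assumptions made for the example.

```python
import numpy as np

def coherence(X, k):
    """mu_k = (d / k) * max_i ||U_{i*}||_2^2 for the top-k left singular vectors of X (d x n)."""
    U, _, _ = np.linalg.svd(X, full_matrices=False)
    lev = np.sum(U[:, :k] ** 2, axis=1)
    return X.shape[0] / k * lev.max()

rng = np.random.default_rng(0)
d, n, k = 400, 30, 4
# Energy spread over all rows: low coherence, uniform sampling is reasonable.
X_flat = rng.standard_normal((d, k)) @ rng.standard_normal((k, n))
# Energy concentrated in k rows: coherence reaches its maximum d / k.
X_spiky = np.zeros((d, n))
X_spiky[:k, :k] = np.eye(k)
mu_flat, mu_spiky = coherence(X_flat, k), coherence(X_spiky, k)
```

Since the leverage scores sum to $k$, $\mu_k$ always lies between 1 and $d/k$; uniform sampling would almost surely miss the informative rows of `X_spiky`.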
Randomized Algorithms Randomized Classification (Regression)
Outline
4 Randomized Algorithms
Randomized Classification (Regression)
Randomized Least-Squares Regression
Randomized K-means Clustering
Randomized Kernel methods
Randomized Low-rank Matrix Approximation
Classification
Classification problems:
$$\min_{w\in\mathbb{R}^d} \frac{1}{n}\sum_{i=1}^n \ell(y_i w^\top x_i) + \frac{\lambda}{2}\|w\|_2^2$$
$y_i \in \{+1, -1\}$: label
Loss function $\ell(z)$, with $z = y w^\top x$:
1. SVMs: (squared) hinge loss $\ell(z) = \max(0, 1-z)^p$, where $p = 1, 2$
2. Logistic Regression: $\ell(z) = \log(1 + \exp(-z))$
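The objective and the two loss families can be written down directly. The following is a minimal sketch with hypothetical helper names (`hinge`, `logistic`, `objective`) and random toy data; it only evaluates the objective and does not optimize it.

```python
import numpy as np

def hinge(z, p=1):
    """(Squared) hinge loss max(0, 1 - z)^p with p = 1 or 2."""
    return np.maximum(0.0, 1.0 - z) ** p

def logistic(z):
    """Logistic loss log(1 + exp(-z)), computed in a numerically stable way."""
    return np.logaddexp(0.0, -z)

def objective(w, X, y, lam, loss):
    """(1/n) * sum_i loss(y_i * w^T x_i) + (lam / 2) * ||w||_2^2."""
    z = y * (X @ w)
    return loss(z).mean() + 0.5 * lam * np.dot(w, w)

rng = np.random.default_rng(0)
X = rng.standard_normal((50, 5))
y = np.sign(rng.standard_normal(50))
w0 = np.zeros(5)
f_hinge = objective(w0, X, y, 0.1, hinge)         # every margin is 0, so the loss is 1
f_logistic = objective(w0, X, y, 0.1, logistic)   # log(2) at w = 0
```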
Randomized Classification
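For large-scale high-dimensional problems, the computational cost of optimization is $O((nd + d\kappa)\log(1/\epsilon))$.
Using a random reduction matrix $A \in \mathbb{R}^{d\times m}$ ($m \ll d$), we reduce $X \in \mathbb{R}^{n\times d}$ to $\widehat{X} = XA \in \mathbb{R}^{n\times m}$. Then solve
$$\min_{u\in\mathbb{R}^m} \frac{1}{n}\sum_{i=1}^n \ell(y_i u^\top \widehat{x}_i) + \frac{\lambda}{2}\|u\|_2^2$$
Possible reductions: JL transforms, sparse subspace embeddings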
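The reduction step $\widehat{X} = XA$ might look like this for the two families of reductions. This is a sketch under assumptions: the JL transform is taken to be a scaled Gaussian matrix, and the sparse subspace embedding is taken to be the count-sketch-style construction with a single random ±1 per row. Both function names are hypothetical.

```python
import numpy as np

def gaussian_jl_matrix(d, m, rng):
    """Dense JL transform: i.i.d. N(0, 1/m) entries."""
    return rng.standard_normal((d, m)) / np.sqrt(m)

def count_sketch_matrix(d, m, rng):
    """Sparse embedding (assumed count-sketch style): one random +/-1 in every row."""
    A = np.zeros((d, m))
    A[np.arange(d), rng.integers(0, m, size=d)] = rng.choice([-1.0, 1.0], size=d)
    return A

rng = np.random.default_rng(0)
n, d, m = 100, 2000, 200
X = rng.standard_normal((n, d))
X_hat_dense = X @ gaussian_jl_matrix(d, m, rng)     # costs O(n d m)
X_hat_sparse = X @ count_sketch_matrix(d, m, rng)   # can be applied in O(nnz(X)) if done implicitly
```

The sparse matrix is stored densely here only for clarity; the reduced data in either form can then be passed to any solver for the m-dimensional problem.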
Two questions:
Is there any performance guarantee?
Margin is preserved if the data is linearly separable (Balcan et al., 2006), as long as $m \ge \frac{12}{\epsilon^2}\log\left(\frac{6m}{\delta}\right)$
Generalization performance is preserved if the data matrix is of low rank and $m = \Omega\left(\frac{k\,\mathrm{poly}(\log(k/\delta\epsilon))}{\epsilon^2}\right)$ (Paul et al., 2013)
How to recover an accurate model in the original high-dimensional space?
Dual Recovery (Zhang et al., 2014) and Dual Sparse Recovery (Yang et al., 2015)
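The Dual Problem
Using the Fenchel conjugate
$$\ell_i^*(\alpha_i) = \max_{z}\; \alpha_i z - \ell(z, y_i)$$
Primal:
$$w_* = \arg\min_{w\in\mathbb{R}^d} \frac{1}{n}\sum_{i=1}^n \ell(w^\top x_i, y_i) + \frac{\lambda}{2}\|w\|_2^2$$
Dual:
$$\alpha_* = \arg\max_{\alpha\in\mathbb{R}^n} -\frac{1}{n}\sum_{i=1}^n \ell_i^*(\alpha_i) - \frac{1}{2\lambda n^2}\alpha^\top X X^\top \alpha$$
From dual to primal:
$$w_* = -\frac{1}{\lambda n} X^\top \alpha_*$$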
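The primal-dual relationship can be checked numerically in a case where both problems have closed forms. The sketch below assumes the squared loss $\ell(z, y) = \frac{1}{2}(z - y)^2$, which is not one of the losses on the slides but has the simple conjugate $\ell^*(\alpha) = \frac{1}{2}\alpha^2 + \alpha y$; the toy data are random.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, lam = 60, 8, 0.1
X = rng.standard_normal((n, d))
y = rng.standard_normal(n)

# Primal optimum for the squared loss: (X^T X + lam * n * I) w = X^T y
w_primal = np.linalg.solve(X.T @ X + lam * n * np.eye(d), X.T @ y)

# Dual optimum: setting the gradient of the dual objective to zero gives
# (I + X X^T / (lam * n)) alpha = -y
alpha = np.linalg.solve(np.eye(n) + X @ X.T / (lam * n), -y)

# Map the dual solution back to the primal space: w = -(1 / (lam * n)) X^T alpha
w_from_dual = -X.T @ alpha / (lam * n)
```

Both routes give the same vector up to rounding error, which is the property that dual recovery relies on.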
Dual Recovery for Randomized Reduction
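From the dual formulation: $w_*$ lies in the row space of the data matrix $X \in \mathbb{R}^{n\times d}$
Dual Recovery: $\widetilde{w}_* = -\frac{1}{\lambda n} X^\top \widetilde{\alpha}_*$, where
$$\widetilde{\alpha}_* = \arg\max_{\alpha\in\mathbb{R}^n} -\frac{1}{n}\sum_{i=1}^n \ell_i^*(\alpha_i) - \frac{1}{2\lambda n^2}\alpha^\top \widehat{X}\widehat{X}^\top \alpha$$
and $\widehat{X} = XA \in \mathbb{R}^{n\times m}$
Subspace embedding $A$ with $m = \Theta(r\log(r/\delta)\epsilon^{-2})$
Guarantee: under a low-rank assumption on the data matrix $X$ (e.g., $\mathrm{rank}(X) = r$), with high probability $1-\delta$,
$$\|\widetilde{w}_* - w_*\|_2 \le \frac{\epsilon}{1-\epsilon}\|w_*\|_2$$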
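A small numerical sketch of dual recovery, again assuming the squared loss $\ell(z, y) = \frac{1}{2}(z - y)^2$ so that the dual has a closed-form solution. The Gaussian projection, the helper names `ridge_dual` and `dual_recovery`, and the problem sizes are all assumptions for illustration.

```python
import numpy as np

def ridge_dual(G, y, lam):
    """Dual optimum for the squared loss, given a Gram matrix G (X X^T or its sketched version)."""
    n = len(y)
    return np.linalg.solve(np.eye(n) + G / (lam * n), -y)

def dual_recovery(X, y, lam, A):
    X_hat = X @ A                                       # reduced data, n x m
    alpha_tilde = ridge_dual(X_hat @ X_hat.T, y, lam)   # dual solved on the reduced data
    return -X.T @ alpha_tilde / (lam * len(y))          # recovered with the ORIGINAL X

rng = np.random.default_rng(0)
n, d, r, m, lam = 200, 2000, 5, 400, 0.01
X = rng.standard_normal((n, r)) @ rng.standard_normal((r, d)) / np.sqrt(d)   # rank-r data
y = rng.standard_normal(n)

w_star = -X.T @ ridge_dual(X @ X.T, y, lam) / (lam * n)   # exact solution
A = rng.standard_normal((d, m)) / np.sqrt(m)              # Gaussian random projection
w_tilde = dual_recovery(X, y, lam, A)
rel_err = np.linalg.norm(w_tilde - w_star) / np.linalg.norm(w_star)
```

The key design point is that only the dual variables are computed from the reduced data; the model itself is rebuilt from the original $X$, so it lives in the original $d$-dimensional space.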
Dual Sparse Recovery for Randomized Reduction
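Assume the optimal dual solution $\alpha_*$ is sparse (i.e., the number of support vectors is small)
Dual Sparse Recovery: $\widetilde{w}_* = -\frac{1}{\lambda n} X^\top \widetilde{\alpha}_*$, where
$$\widetilde{\alpha}_* = \arg\max_{\alpha\in\mathbb{R}^n} -\frac{1}{n}\sum_{i=1}^n \ell_i^*(\alpha_i) - \frac{1}{2\lambda n^2}\alpha^\top \widehat{X}\widehat{X}^\top \alpha - \frac{\tau}{n}\|\alpha\|_1$$
where $\widehat{X} = XA \in \mathbb{R}^{n\times m}$
JL transform $A$ with $m = \Theta(s\log(n/\delta)\epsilon^{-2})$
Guarantee: if $\alpha_*$ is $s$-sparse, with high probability $1-\delta$,
$$\|\widetilde{w}_* - w_*\|_2 \le \epsilon\|w_*\|_2$$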
Dual Sparse Recovery
RCV1 text data, n = 677,399 and d = 47,236
[Figure: two panels, "Dual Error" and "Primal Error", plotting the relative dual error and the relative primal error (L2 norm) against τ from 0 to 0.9, with λ = 0.001 and curves for m = 1024, 2048, 4096, 8192]
Randomized Algorithms Randomized Least-Squares Regression
Least-squares regression
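Let $X \in \mathbb{R}^{n\times d}$ with $d \ll n$ and $b \in \mathbb{R}^n$. The least-squares regression problem is to find $w_*$ such that
$$w_* = \arg\min_{w\in\mathbb{R}^d} \|Xw - b\|_2$$
Computational cost: $O(nd^2)$
Goal of randomized algorithms (RA): $o(nd^2)$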
Randomized Least-squares regression
Let $A \in \mathbb{R}^{m\times n}$ be a random reduction matrix. Solve
$$\widehat{w}_* = \arg\min_{w\in\mathbb{R}^d} \|A(Xw - b)\|_2 = \arg\min_{w\in\mathbb{R}^d} \|AXw - Ab\|_2$$
Computational cost: $O(md^2)$ + reduction time
Theoretical guarantees (Sarlos, 2006; Drineas et al., 2011; Nelson & Nguyen, 2012):
$$\|X\widehat{w}_* - b\|_2 \le (1+\epsilon)\|Xw_* - b\|_2$$
Total time: $O(\mathrm{nnz}(X) + d^3\log(d/\epsilon)\epsilon^{-2})$
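A sketch-and-solve example for least squares, comparing the residual of the sketched solution with the optimal residual. A dense Gaussian sketch is assumed here purely for simplicity; it costs $O(mnd)$ to apply, so reaching the running times quoted above would require a fast transform or a sparse embedding instead.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, m = 5000, 20, 400
X = rng.standard_normal((n, d))
b = X @ rng.standard_normal(d) + 0.5 * rng.standard_normal(n)

# Exact least-squares solution on the full problem, O(n d^2).
w_star, *_ = np.linalg.lstsq(X, b, rcond=None)

# Sketched problem: m rows instead of n, O(m d^2) after the reduction.
A = rng.standard_normal((m, n)) / np.sqrt(m)
w_hat, *_ = np.linalg.lstsq(A @ X, A @ b, rcond=None)

# Residual ratio ||X w_hat - b|| / ||X w_star - b||; it plays the role of (1 + epsilon).
ratio = np.linalg.norm(X @ w_hat - b) / np.linalg.norm(X @ w_star - b)
```

The ratio can never drop below 1, because `w_star` minimizes the full residual; how close it stays to 1 depends on `m` relative to `d`.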
Randomized Algorithms Randomized K-means Clustering
K-means Clustering
Let $x_1, \ldots, x_n \in \mathbb{R}^d$ be a set of data points.
K-means clustering aims to solve
$$\min_{C_1,\ldots,C_k} \sum_{j=1}^k \sum_{x_i\in C_j} \|x_i - \mu_j\|_2^2$$
Computational cost: $O(ndkt)$, where $t$ is the number of iterations.
Randomized Algorithms for K-means Clustering
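Let $X = (x_1, \ldots, x_n)^\top \in \mathbb{R}^{n\times d}$ be the data matrix.
High-dimensional data: random sketch $\widehat{X} = XA \in \mathbb{R}^{n\times m}$, $m \ll d$
Approximate K-means:
$$\min_{C_1,\ldots,C_k} \sum_{j=1}^k \sum_{\widehat{x}_i\in C_j} \|\widehat{x}_i - \widehat{\mu}_j\|_2^2$$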
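To make the idea concrete, the sketch below clusters a toy dataset once in the original space and once after a Gaussian random projection, then evaluates both partitions with the K-means cost on the original data. The plain Lloyd iteration with farthest-first initialization (chosen to keep the example deterministic), the helper names, and the well-separated toy clusters are all assumptions.

```python
import numpy as np

def lloyd_kmeans(Z, k, iters=20):
    """Plain Lloyd iterations with farthest-first initialization."""
    centers = [Z[0]]
    for _ in range(1, k):
        d2 = np.min([np.sum((Z - c) ** 2, axis=1) for c in centers], axis=0)
        centers.append(Z[np.argmax(d2)])
    centers = np.array(centers)
    for _ in range(iters):
        d2 = ((Z[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        for j in range(k):
            if np.any(labels == j):          # keep the old center if a cluster is empty
                centers[j] = Z[labels == j].mean(axis=0)
    return labels

def kmeans_cost(X, labels, k):
    """K-means objective of a partition, measured on the data X."""
    return sum(((X[labels == j] - X[labels == j].mean(axis=0)) ** 2).sum()
               for j in range(k) if np.any(labels == j))

rng = np.random.default_rng(0)
n, d, k, m = 300, 200, 3, 20
true = np.arange(n) % k
X = 10.0 * rng.standard_normal((k, d))[true] + rng.standard_normal((n, d))

A = rng.standard_normal((d, m)) / np.sqrt(m)     # JL-style Gaussian sketch
labels_full = lloyd_kmeans(X, k)                 # O(n d k t)
labels_sketch = lloyd_kmeans(X @ A, k)           # O(n m k t) after the reduction
ratio = kmeans_cost(X, labels_sketch, k) / kmeans_cost(X, labels_full, k)
```

A ratio close to 1 indicates that the partition found in the sketched space is about as good, in the original space, as the one found without sketching.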
For the random sketch, JL transforms and sparse subspace embeddings all work
JL transform: $m = O\left(\frac{k\log(k/(\epsilon\delta))}{\epsilon^2}\right)$
Sparse subspace embedding: $m = O\left(\frac{k^2}{\epsilon^2\delta}\right)$
$\epsilon$ relates to the approximation accuracy
Analysis of the approximation error for K-means can be formulated as Constrained Low-rank Approximation (Cohen et al., 2015)
$$\min_{Q^\top Q = I} \|X - QQ^\top X\|_F^2$$
where $Q$ is orthonormal.
Randomized Algorithms Randomized Kernel methods
Kernel methods
Kernel function: $\kappa(\cdot, \cdot)$
A set of examples $x_1, \ldots, x_n$
Kernel matrix: $K \in \mathbb{R}^{n\times n}$ with $K_{ij} = \kappa(x_i, x_j)$
$K$ is a PSD matrix
Computational and memory costs: $\Omega(n^2)$
Approximation methods
The Nystrom method
Random Fourier features
The Nystrom method
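Let $A \in \mathbb{R}^{n\times \ell}$ be a uniform sampling matrix.
$B = KA \in \mathbb{R}^{n\times \ell}$
$C = A^\top B = A^\top K A$
The Nystrom approximation (Drineas & Mahoney, 2005)
$$\widehat{K} = B C^{\dagger} B^\top$$
Computational cost: $O(\ell^3 + n\ell^2)$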
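A minimal NumPy sketch of the Nystrom approximation with uniform column sampling. The RBF kernel, the helper names, the pseudo-inverse cutoff `rcond=1e-10`, and the toy data are assumptions; the full kernel matrix is formed here only so that the approximation error can be measured.

```python
import numpy as np

def rbf_kernel(X, gamma):
    """K_ij = exp(-||x_i - x_j||^2 / (2 gamma^2))."""
    sq = np.sum(X ** 2, axis=1)
    d2 = np.maximum(sq[:, None] + sq[None, :] - 2.0 * X @ X.T, 0.0)
    return np.exp(-d2 / (2.0 * gamma ** 2))

def nystrom(K, idx):
    B = K[:, idx]                     # B = K A: the sampled columns, n x l
    C = K[np.ix_(idx, idx)]           # C = A^T K A: the l x l intersection block
    return B @ np.linalg.pinv(C, rcond=1e-10) @ B.T

rng = np.random.default_rng(0)
n, l = 300, 60
X = rng.standard_normal((n, 5))
K = rbf_kernel(X, gamma=2.0)
idx = rng.choice(n, size=l, replace=False)   # uniform sampling without replacement
K_hat = nystrom(K, idx)
rel_err = np.linalg.norm(K - K_hat) / np.linalg.norm(K)
```

In a real application only the sampled columns B would be computed, which is what avoids the quadratic cost of the full kernel matrix.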
The Nystrom based kernel machine
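The dual problem:
$$\arg\max_{\alpha\in\mathbb{R}^n} -\frac{1}{n}\sum_{i=1}^n \ell_i^*(\alpha_i) - \frac{1}{2\lambda n^2}\alpha^\top B C^{\dagger} B^\top \alpha$$
Solve it like a linear method with features $\widehat{X} = BC^{-1/2} \in \mathbb{R}^{n\times \ell}$:
$$\arg\max_{\alpha\in\mathbb{R}^n} -\frac{1}{n}\sum_{i=1}^n \ell_i^*(\alpha_i) - \frac{1}{2\lambda n^2}\alpha^\top \widehat{X}\widehat{X}^\top \alpha$$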
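The explicit feature map $BC^{-1/2}$ can be computed from an eigendecomposition of $C$. The sketch below is one way to do it, assuming a pseudo-inverse square root (small eigenvalues are dropped, so the result may have fewer than $\ell$ columns) and a toy low-rank linear kernel for which the Nystrom approximation happens to be exact.

```python
import numpy as np

def nystrom_features(B, C, tol=1e-10):
    """Return X_hat = B C^{-1/2}, using a pseudo-inverse square root of the PSD matrix C."""
    s, V = np.linalg.eigh(C)
    keep = s > tol * s.max()                 # drop numerically zero eigenvalues
    return (B @ V[:, keep]) / np.sqrt(s[keep])

rng = np.random.default_rng(0)
n, l = 200, 40
G = rng.standard_normal((n, 10))
K = G @ G.T                                  # toy PSD kernel matrix of rank 10
idx = rng.choice(n, size=l, replace=False)
B, C = K[:, idx], K[np.ix_(idx, idx)]
X_hat = nystrom_features(B, C)               # explicit features with X_hat X_hat^T = B C^+ B^T
```

Once `X_hat` is available, any solver for linear models can be run on it in place of the kernel machine.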
Random Fourier Features (RFF)
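Bochner’s theorem
A shift-invariant kernel $\kappa(x, y) = \kappa(x - y)$ is a valid kernel if and only if $\kappa(\delta)$ is the Fourier transform of a non-negative measure, i.e.,
$$\kappa(x - y) = \int p(\omega)\, e^{-j\omega^\top (x-y)}\, d\omega$$
RFF (Rahimi & Recht, 2008): generate $\omega_1, \ldots, \omega_m \in \mathbb{R}^d$ following $p(\omega)$. For an example $x \in \mathbb{R}^d$, construct (up to a $1/\sqrt{m}$ scaling)
$$\widehat{x} = (\cos(\omega_1^\top x), \sin(\omega_1^\top x), \ldots, \cos(\omega_m^\top x), \sin(\omega_m^\top x))^\top \in \mathbb{R}^{2m}$$
RBF kernel $\exp\left(-\frac{\|x-y\|_2^2}{2\gamma^2}\right)$: $p(\omega) = \mathcal{N}(0, \gamma^{-2} I)$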
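A short sketch of random Fourier features for the RBF kernel $\exp(-\|x-y\|_2^2 / (2\gamma^2))$. It samples $\omega \sim \mathcal{N}(0, \gamma^{-2} I)$, the spectral density of this kernel, and scales the features by $1/\sqrt{m}$ so that inner products approximate kernel values. The function name and toy data are assumptions, and the cosine and sine blocks are concatenated rather than interleaved, which does not change inner products.

```python
import numpy as np

def rff_features(X, m, gamma, rng):
    """Map the rows of X (n x d) to 2m random Fourier features for the RBF kernel."""
    W = rng.standard_normal((X.shape[1], m)) / gamma    # columns are omega_j ~ N(0, gamma^{-2} I)
    P = X @ W
    return np.hstack([np.cos(P), np.sin(P)]) / np.sqrt(m)

rng = np.random.default_rng(0)
n, d, m, gamma = 50, 5, 5000, 2.0
X = rng.standard_normal((n, d))
Z = rff_features(X, m, gamma, rng)

# Exact kernel matrix, for comparison only.
sq = np.sum(X ** 2, axis=1)
K = np.exp(-(sq[:, None] + sq[None, :] - 2.0 * X @ X.T) / (2.0 * gamma ** 2))
max_err = np.abs(Z @ Z.T - K).max()
```

The entrywise error shrinks roughly like $1/\sqrt{m}$, so `m` trades accuracy against the cost of the downstream linear solver.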
The Nystrom method vs RFF (Yang et al., 2012)
Functional approximation framework
The Nystrom method: data-dependent bases
RFF: data-independent bases
In certain cases (e.g., large eigen-gap, skewed eigenvalue distribution), the generalization performance of the Nystrom method is better than that of RFF
Randomized Algorithms Randomized Low-rank Matrix Approximation
Randomized low-rank matrix approximation
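Let $X \in \mathbb{R}^{n\times d}$. The goal is to obtain
$$U\Sigma V^\top \approx X$$
where $U \in \mathbb{R}^{n\times k}$ and $V \in \mathbb{R}^{d\times k}$ have orthonormal columns, and $\Sigma \in \mathbb{R}^{k\times k}$ is a diagonal matrix with nonnegative entries
$k$ is the target rank
The best rank-$k$ approximation: $X_k = U_k\Sigma_k V_k^\top$
Approximation error
$$\|U\Sigma V^\top - X\|_\xi \le (1+\epsilon)\|U_k\Sigma_k V_k^\top - X\|_\xi$$
where $\xi = F$ or $\xi = 2$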
Why low-rank approximation?
Applications in data mining and machine learning
PCA
Spectral clustering
· · ·
Why randomized algorithms?
Deterministic Algorithms
Truncated SVD: $O(nd\min(n, d))$
Rank-Revealing QR factorization: $O(ndk)$
Krylov subspace method (e.g., Lanczos algorithm): $O(ndk + (n + d)k^2)$
Randomized Algorithms
Speed can be faster (e.g., $O(nd\log(k))$)
Output is more robust (e.g., Lanczos requires sophisticated modifications)
Can be pass efficient
Can exploit parallel algorithms
Randomized algorithms for low-rank matrix approximation
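The basic randomized algorithm for approximating $X \in \mathbb{R}^{n\times d}$ (Halko et al., 2011):
1. Obtain a small sketch $Y = XA \in \mathbb{R}^{n\times m}$
2. Compute $Q \in \mathbb{R}^{n\times m}$ whose columns form an orthonormal basis of $\mathrm{col}(Y)$
3. Compute the SVD $Q^\top X = \hat{U}\Sigma V^\top$
4. Approximate $X \approx U\Sigma V^\top$, where $U = Q\hat{U}$
Explanation: if $\mathrm{col}(XA)$ captures the top-$k$ column space of $X$ well, i.e.,
$$\|X - QQ^\top X\| \le \epsilon,$$
then
$$\|X - U\Sigma V^\top\| \le \epsilon.$$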
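The four steps above can be sketched in NumPy as follows. This is a minimal illustration, not the tutorial's own code: it assumes a dense Gaussian matrix for $A$ (the slide does not fix a choice), and the function name `randomized_low_rank` and its default oversampling `p` are hypothetical.

```python
import numpy as np

def randomized_low_rank(X, k, p=10, seed=0):
    """Return U, s, Vt with X ~= U @ diag(s) @ Vt, using a sketch of m = k + p columns."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    m = min(k + p, n, d)                  # sketch size, capped by the matrix dimensions
    A = rng.standard_normal((d, m))       # assumption: Gaussian random matrix
    Y = X @ A                             # step 1: small sketch, n x m
    Q, _ = np.linalg.qr(Y)                # step 2: orthonormal basis containing col(Y)
    U_hat, s, Vt = np.linalg.svd(Q.T @ X, full_matrices=False)  # step 3: SVD of Q^T X
    U = Q @ U_hat                         # step 4: lift the left factor back to R^n
    return U, s, Vt
```

When $X$ has exact rank at most $k$, the sketch captures its column space with probability one, so the product of the returned factors reproduces $X$ up to rounding error.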
Randomized algorithms for low-rank matrix approximation
Three questions:
1. What is the value of $m$?
   $m = k + p$, where $p$ is the oversampling parameter. In practice, $p = 5$ or $10$ gives superb results.
2. What is the computational cost?
   With the Subsampled Randomized Hadamard Transform, it can be as fast as $O(nd \log(k) + k^2(n + d))$.
3. What is the quality?
   Theoretical guarantee: see "Theoretical Guarantee of RA for low-rank approximation" in the Appendix.
   Practically, very accurate.
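As an illustration of the Subsampled Randomized Hadamard Transform mentioned in question 2, the sketch below builds $Y = XA$ with $A$ = (random signs) × (normalized Hadamard transform) × (uniform column subsampling). It is a simplified version under stated assumptions: the helper names `fwht` and `srht_sketch` are hypothetical, the feature dimension is zero-padded to a power of two, and the full transform is applied before subsampling, so the cost here is $O(nd\log d)$ rather than the $O(nd\log k)$ obtainable with a pruned transform.

```python
import numpy as np

def fwht(a):
    """Unnormalized fast Walsh-Hadamard transform along the last axis (length a power of 2)."""
    a = np.array(a, dtype=float)          # work on a copy
    n = a.shape[-1]
    h = 1
    while h < n:
        for i in range(0, n, 2 * h):
            x = a[..., i:i + h].copy()
            y = a[..., i + h:i + 2 * h].copy()
            a[..., i:i + h] = x + y
            a[..., i + h:i + 2 * h] = x - y
        h *= 2
    return a

def srht_sketch(X, m, seed=0):
    """Return the n x m sketch Y = X A for an SRHT matrix A (after zero-padding the columns)."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    d_pad = 1 << (d - 1).bit_length()     # next power of two >= d
    Xp = np.zeros((n, d_pad))
    Xp[:, :d] = X
    signs = rng.choice([-1.0, 1.0], size=d_pad)          # random diagonal sign matrix
    Z = fwht(Xp * signs) / np.sqrt(d_pad)                # orthonormal transform of each row
    cols = rng.choice(d_pad, size=m, replace=False)      # uniform subsampling
    return Z[:, cols] * np.sqrt(d_pad / m)               # rescale so norms match in expectation
```

When all `d_pad` coordinates are kept, the map is orthogonal and row norms are preserved exactly; with fewer columns they are preserved only approximately.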
Randomized algorithms for low-rank matrix approximation
Other things:
Use power iteration to reduce the error: use $(XX^\top)^q X$ in place of $X$
Can use sparse JL transform/subspace embedding matrices (Frobenius norm guarantee only)
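The power-iteration idea can be sketched as below: the basis $Q$ is computed from $(XX^\top)^q XA$ instead of $XA$. This is an illustrative sketch only; the names `range_basis` and `spectral_error` are hypothetical, $A$ is assumed Gaussian, and the QR step between multiplications is a standard numerical safeguard rather than something stated on the slide.

```python
import numpy as np

def range_basis(X, m, q=0, seed=0):
    """Orthonormal basis Q (n x m) for col((X X^T)^q X A), with A a Gaussian d x m matrix."""
    rng = np.random.default_rng(seed)
    A = rng.standard_normal((X.shape[1], m))
    Q, _ = np.linalg.qr(X @ A)
    for _ in range(q):
        # Re-orthonormalize after each multiplication to avoid loss of precision.
        Z, _ = np.linalg.qr(X.T @ Q)
        Q, _ = np.linalg.qr(X @ Z)
    return Q

def spectral_error(X, Q):
    """Spectral-norm error of projecting X onto the span of Q."""
    return np.linalg.norm(X - Q @ (Q.T @ X), 2)
```

On matrices whose singular values decay slowly, a small `q` such as 1 or 2 typically brings the error close to the best achievable value $\sigma_{m+1}$.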
Concluding Remarks
Outline
1 Basics
2 Optimization
3 Randomized Dimension Reduction
4 Randomized Algorithms
5 Concluding Remarks
How to address the big data challenge?
Optimization perspective: improve convergence rates, exploring properties of functions
  stochastic optimization (e.g., SDCA, SVRG, SAGA)
  distributed optimization (e.g., DisDCA)
Randomization perspective: reduce data size, exploring properties of data
  randomized feature reduction (e.g., reduce the number of features)
  randomized instance reduction (e.g., reduce the number of instances)
How can we address the big data challenge?
Optimization perspective: improve convergence rates, exploring properties of functions
  Pro: can obtain the optimal solution
  Con: high computational/communication costs
Randomization perspective: reduce data size, exploring properties of data
  Pro: fast
  Con: a recovery error still exists
Can we combine the benefits of the two techniques?
Research Assistant Positions Available for PhD Candidates!
Start Fall'16
Areas: Optimization and Randomization, Online Learning, Deep Learning, Machine Learning
Send email to [email protected]
Thank You! Questions?
References
Achlioptas, Dimitris. Database-friendly random projections: Johnson-Lindenstrauss with binary coins. Journal of Computer and System Sciences, 66(4):671–687, 2003.
Balcan, Maria-Florina, Blum, Avrim, and Vempala, Santosh. Kernels as features: on kernels, margins, and low-dimensional mappings. Machine Learning, 65(1):79–94, 2006.
Cohen, Michael B., Elder, Sam, Musco, Cameron, Musco, Christopher, and Persu, Madalina. Dimensionality reduction for k-means clustering and low rank approximation. In Proceedings of the Forty-Seventh Annual ACM on Symposium on Theory of Computing (STOC), pp. 163–172, 2015.
Dasgupta, Anirban, Kumar, Ravi, and Sarlos, Tamas. A sparse Johnson-Lindenstrauss transform. In Proceedings of the 42nd ACM Symposium on Theory of Computing, STOC '10, pp. 341–350, 2010.
Dasgupta, Sanjoy and Gupta, Anupam. An elementary proof of a theorem of Johnson and Lindenstrauss. Random Structures & Algorithms, 22(1):60–65, 2003.
Defazio, Aaron, Bach, Francis, and Lacoste-Julien, Simon. SAGA: A fast incremental gradient method with support for non-strongly convex composite objectives. In NIPS, 2014.
Drineas, Petros and Mahoney, Michael W. On the Nystrom method for approximating a Gram matrix for improved kernel-based learning. Journal of Machine Learning Research, 6:2005, 2005.
Drineas, Petros, Mahoney, Michael W., and Muthukrishnan, S. Sampling algorithms for l2 regression and applications. In ACM-SIAM Symposium on Discrete Algorithms (SODA), pp. 1127–1136, 2006.
Drineas, Petros, Mahoney, Michael W., Muthukrishnan, S., and Sarlos, Tamas. Faster least squares approximation. Numerische Mathematik, 117(2):219–249, February 2011.
Gittens, Alex. The spectral norm error of the naive Nystrom extension. CoRR, 2011.
Halko, Nathan, Martinsson, Per Gunnar, and Tropp, Joel A. Finding structure with randomness: Probabilistic algorithms for constructing approximate matrix decompositions. SIAM Review, 53(2):217–288, May 2011.
Hsieh, Cho-Jui, Chang, Kai-Wei, Lin, Chih-Jen, Keerthi, S. Sathiya, and Sundararajan, S. A dual coordinate descent method for large-scale linear SVM. In ICML, pp. 408–415, 2008.
Johnson, Rie and Zhang, Tong. Accelerating stochastic gradient descent using predictive variance reduction. In NIPS, pp. 315–323, 2013.
Johnson, William and Lindenstrauss, Joram. Extensions of Lipschitz mappings into a Hilbert space. In Conference in modern analysis and probability (New Haven, Conn., 1982), volume 26, pp. 189–206. 1984.
Kane, Daniel M. and Nelson, Jelani. Sparser Johnson-Lindenstrauss transforms. Journal of the ACM, 61:4:1–4:23, 2014.
Lin, Qihang, Lu, Zhaosong, and Xiao, Lin. An accelerated proximal coordinate gradient method and its application to regularized empirical risk minimization. In NIPS, 2014.
Ma, Chenxin, Smith, Virginia, Jaggi, Martin, Jordan, Michael I., Richtarik, Peter, and Takac, Martin. Adding vs. averaging in distributed primal-dual optimization. In ICML, 2015.
Nelson, Jelani and Nguyen, Huy L. OSNAP: faster numerical linear algebra algorithms via sparser subspace embeddings. CoRR, abs/1211.1002, 2012.
Nelson, Jelani and Nguyen, Huy L. OSNAP: faster numerical linear algebra algorithms via sparser subspace embeddings. In 54th Annual IEEE Symposium on Foundations of Computer Science (FOCS), pp. 117–126, 2013.
Nemirovski, A. and Yudin, D. On Cezari's convergence of the steepest descent method for approximating saddle point of convex-concave functions. Soviet Math Dkl., 19:341–362, 1978.
Nesterov, Yurii. Efficiency of coordinate descent methods on huge-scale optimization problems. SIAM Journal on Optimization, 22:341–362, 2012.
Paul, Saurabh, Boutsidis, Christos, Magdon-Ismail, Malik, and Drineas, Petros. Random projections for support vector machines. In Proceedings of the International Conference on Artificial Intelligence and Statistics (AISTATS), pp. 498–506, 2013.
Rahimi, Ali and Recht, Benjamin. Random features for large-scale kernel machines. In Advances in Neural Information Processing Systems 20, pp. 1177–1184, 2008.
Recht, Benjamin. A simpler approach to matrix completion. Journal of Machine Learning Research (JMLR), pp. 3413–3430, 2011.
Roux, Nicolas Le, Schmidt, Mark, and Bach, Francis. A stochastic gradient method with an exponential convergence rate for strongly-convex optimization with finite training sets. CoRR, 2012.
Sarlos, Tamas. Improved approximation algorithms for large matrices via random projections. In 47th Annual IEEE Symposium on Foundations of Computer Science (FOCS), pp. 143–152, 2006.
Shalev-Shwartz, Shai and Zhang, Tong. Proximal stochastic dual coordinate ascent. CoRR, abs/1211.2717, 2012.
Shalev-Shwartz, Shai and Zhang, Tong. Stochastic dual coordinate ascent methods for regularized loss. Journal of Machine Learning Research, 14:567–599, 2013.
Tropp, Joel A. Improved analysis of the subsampled randomized Hadamard transform. Advances in Adaptive Data Analysis, 3(1-2):115–126, 2011.
Tropp, Joel A. User-friendly tail bounds for sums of random matrices. Found. Comput. Math., 12(4):389–434, August 2012. ISSN 1615-3375.
Wang, Po-Wei and Lin, Chih-Jen. Iteration complexity of feasible descent methods for convex optimization. Journal of Machine Learning Research, 15(1):1523–1548, 2014.
Xiao, L. and Zhang, T. A proximal stochastic gradient method with progressive variance reduction. SIAM Journal on Optimization, 24(4):2057–2075, 2014.
Yang, Tianbao. Trading computation for communication: Distributed stochastic dual coordinate ascent. NIPS'13, pp. –, 2013.
Yang, Tianbao, Li, Yu-Feng, Mahdavi, Mehrdad, Jin, Rong, and Zhou, Zhi-Hua. Nystrom method vs random Fourier features: A theoretical and empirical comparison. In Advances in Neural Information Processing Systems (NIPS), pp. 485–493, 2012.
Yang, Tianbao, Zhang, Lijun, Jin, Rong, and Zhu, Shenghuo. Theory of dual-sparse regularized randomized reduction. In Proceedings of the 32nd International Conference on Machine Learning (ICML), pp. 305–314, 2015.
Zhang, Lijun, Mahdavi, Mehrdad, and Jin, Rong. Linear convergence with condition number independent access of full gradients. In NIPS, pp. 980–988. 2013.
Zhang, Lijun, Mahdavi, Mehrdad, Jin, Rong, Yang, Tianbao, and Zhu, Shenghuo. Random projections for classification: A recovery approach. IEEE Transactions on Information Theory (IEEE TIT), 60(11):7300–7316, 2014.
Appendix
Examples of Convex functions
$ax + b$, $Ax + b$
$x^2$, $\|x\|_2^2$
$\exp(ax)$, $\exp(w^\top x)$
$\log(1 + \exp(ax))$, $\log(1 + \exp(w^\top x))$
$x\log(x)$, $\sum_i x_i \log(x_i)$
$\|x\|_p$ with $p \ge 1$, $\|x\|_p^2$
$\max_i (x_i)$
Operations that preserve convexity
Nonnegative scaling: $a \cdot f(x)$ where $a \ge 0$
Sum: $f(x) + g(x)$
Composition with an affine function: $f(Ax + b)$
Point-wise maximum: $\max_i f_i(x)$
Examples:
Least-squares regression: $\|Ax - b\|^2$
SVM: $\frac{1}{n}\sum_{i=1}^{n} \max(0, 1 - y_i w^\top x_i) + \frac{\lambda}{2}\|w\|_2^2$
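The SVM objective is built from exactly these operations: a point-wise maximum of affine functions (the hinge), a nonnegative average, and a sum with a convex quadratic. The short sketch below evaluates it so that the convexity inequality $f(\tfrac{1}{2}(w_1 + w_2)) \le \tfrac{1}{2}(f(w_1) + f(w_2))$ can be checked numerically. The function name `svm_objective` is a hypothetical choice for this illustration.

```python
import numpy as np

def svm_objective(w, X, y, lam):
    """(1/n) * sum_i max(0, 1 - y_i * w^T x_i) + (lam / 2) * ||w||_2^2."""
    margins = 1.0 - y * (X @ w)            # affine in w for each example
    hinge = np.maximum(0.0, margins)       # point-wise maximum with 0
    return hinge.mean() + 0.5 * lam * np.dot(w, w)
```

At $w = 0$ every hinge term equals 1, so the objective is exactly 1 regardless of the data.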
Smooth Convex function
smooth: e.g. logistic loss $f(x) = \log(1 + \exp(-x))$
$$\|\nabla f(x) - \nabla f(y)\|_2 \le L\|x - y\|_2$$
where $L > 0$ is the smoothness constant
The second-order derivative is upper bounded: $\|\nabla^2 f(x)\|_2 \le L$
[Figure: plot of $\log(1+\exp(-x))$ for $x \in [-5, 5]$, together with the tangent line $f(y) + f'(y)(x - y)$ at a point $y$ and a quadratic function.]
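For the logistic loss the smoothness constant can be made concrete: $f''(x) = s(x)(1 - s(x))$ with $s(x) = 1/(1 + e^{-x})$, which is at most $1/4$, so $L = 1/4$ works. The sketch below checks the gradient inequality numerically; the function names are hypothetical and the value $L = 1/4$ is a standard fact that is not stated on the slide.

```python
import numpy as np

def logistic_grad(x):
    """Derivative of f(x) = log(1 + exp(-x)), which is -1 / (1 + exp(x))."""
    return -1.0 / (1.0 + np.exp(x))

def logistic_hess(x):
    """Second derivative s(x) * (1 - s(x)) with s the sigmoid; maximal (1/4) at x = 0."""
    s = 1.0 / (1.0 + np.exp(-x))
    return s * (1.0 - s)

def is_smooth_on_pairs(x, y, L=0.25):
    """Check |f'(x) - f'(y)| <= L * |x - y| on arrays of sample points."""
    lhs = np.abs(logistic_grad(x) - logistic_grad(y))
    return bool(np.all(lhs <= L * np.abs(x - y) + 1e-12))
```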
Strongly Convex function
strongly convex: e.g. Euclidean norm $f(x) = \frac{1}{2}\|x\|_2^2$
$$\|\nabla f(x) - \nabla f(y)\|_2 \ge \lambda\|x - y\|_2$$
where $\lambda > 0$ is the strong convexity constant
The second-order derivative is lower bounded: $\|\nabla^2 f(x)\|_2 \ge \lambda$
[Figure: plot of $x^2$ for $x \in [-1, 1]$; curve labels: gradient, smooth.]
Smooth and Strongly Convex function
smooth and strongly convex: e.g. quadratic function $f(z) = \frac{1}{2}(z - 1)^2$
$$\lambda\|x - y\|_2 \le \|\nabla f(x) - \nabla f(y)\|_2 \le L\|x - y\|_2, \quad L \ge \lambda > 0$$
Chernoff bound
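Let $X_1, \ldots, X_n$ be independent random variables with $0 \le X_i \le 1$. Let $X = X_1 + \ldots + X_n$ and $\mu = \mathrm{E}[X]$. Then
$$\Pr(X \ge (1 + \epsilon)\mu) \le \exp\left(-\frac{\epsilon^2}{2 + \epsilon}\mu\right)$$
$$\Pr(X \le (1 - \epsilon)\mu) \le \exp\left(-\frac{\epsilon^2}{2}\mu\right)$$
or
$$\Pr(|X - \mu| \ge \epsilon\mu) \le 2\exp\left(-\frac{\epsilon^2}{2 + \epsilon}\mu\right) \le 2\exp\left(-\frac{\epsilon^2}{3}\mu\right)$$
where the last inequality holds when $0 < \epsilon \le 1$.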
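A quick simulation can make the upper-tail bound tangible. The sketch below compares the empirical frequency of $X \ge (1+\epsilon)\mu$ for a sum of Bernoulli variables against $\exp(-\epsilon^2\mu/(2+\epsilon))$. The function names, the Bernoulli choice, and the number of trials are assumptions made for illustration only.

```python
import numpy as np

def chernoff_upper_bound(mu, eps):
    """Right-hand side of Pr(X >= (1 + eps) * mu) <= exp(-eps^2 / (2 + eps) * mu)."""
    return np.exp(-eps**2 / (2.0 + eps) * mu)

def empirical_upper_tail(n, prob, eps, trials=20000, seed=0):
    """Monte Carlo estimate of Pr(X >= (1 + eps) * mu) for X ~ Binomial(n, prob)."""
    rng = np.random.default_rng(seed)
    X = rng.binomial(n, prob, size=trials)   # each draw is a sum of n Bernoulli(prob) variables
    mu = n * prob
    return np.mean(X >= (1.0 + eps) * mu)
```

For example, `empirical_upper_tail(100, 0.5, 0.2)` can be compared with `chernoff_upper_bound(50, 0.2)`; the bound is valid but usually far from tight.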
Theoretical Guarantee of RA for low-rank approximation
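$$X = U\begin{bmatrix}\Sigma_1 & \\ & \Sigma_2\end{bmatrix}\begin{bmatrix}V_1^\top \\ V_2^\top\end{bmatrix}$$
$X \in \mathbb{R}^{m\times n}$: the target matrix
$\Sigma_1 \in \mathbb{R}^{k\times k}$, $V_1 \in \mathbb{R}^{n\times k}$
$A \in \mathbb{R}^{n\times \ell}$: random reduction matrix
$Y = XA \in \mathbb{R}^{m\times \ell}$: the small sketch
Key inequality, with $\Omega_1 = V_1^\top A$, $\Omega_2 = V_2^\top A$, and $P_Y$ the orthogonal projection onto $\mathrm{col}(Y)$:
$$\|(I - P_Y)X\|^2 \le \|\Sigma_2\|^2 + \|\Sigma_2\Omega_2\Omega_1^\dagger\|^2$$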
Gaussian Matrices
$G$ is a standard Gaussian matrix
$U$ and $V$ are orthonormal matrices
$U^\top G V$ follows the standard Gaussian distribution
$\mathrm{E}[\|SGT\|_F^2] = \|S\|_F^2\|T\|_F^2$
$\mathrm{E}[\|SGT\|] \le \|S\|\|T\|_F + \|S\|_F\|T\|$
Concentration for a function of a Gaussian matrix. Suppose $h$ is a Lipschitz function on matrices:
$$h(X) - h(Y) \le L\|X - Y\|_F$$
Then
$$\Pr(h(G) \ge \mathrm{E}[h(G)] + Lt) \le e^{-t^2/2}$$
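The identity $\mathrm{E}[\|SGT\|_F^2] = \|S\|_F^2\|T\|_F^2$ is easy to check by Monte Carlo. The sketch below averages $\|SGT\|_F^2$ over independent standard Gaussian draws of $G$; the function name and the number of trials are hypothetical choices, and the estimate is only approximate.

```python
import numpy as np

def mean_sq_frobenius(S, T, trials=4000, seed=0):
    """Monte Carlo estimate of E[||S G T||_F^2] for G with i.i.d. N(0, 1) entries."""
    rng = np.random.default_rng(seed)
    # One Gaussian matrix per trial, shaped so that S @ G @ T is well defined.
    G = rng.standard_normal((trials, S.shape[1], T.shape[0]))
    P = S @ G @ T                          # matmul broadcasts over the leading trials axis
    return np.mean(np.sum(P**2, axis=(1, 2)))
```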
Analysis for Randomized Least-square regression
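Let $X = U\Sigma V^\top$ and
$$w_* = \arg\min_{w\in\mathbb{R}^d} \|Xw - b\|_2^2$$
Let $Z = \|Xw_* - b\|_2^2$, $\omega = b - Xw_*$, and $Xw_* = U\alpha$. The sketched solution is
$$\hat{w}_* = \arg\min_{w\in\mathbb{R}^d} \|A(Xw - b)\|_2^2$$
Since $b - Xw_* = b - X(X^\top X)^\dagger X^\top b = (I - UU^\top)b$, the residual $\omega$ is orthogonal to the column space of $U$. Write $X\hat{w}_* - Xw_* = U\beta$. Then
$$\|X\hat{w}_* - b\|_2^2 = \|Xw_* - b\|_2^2 + \|X\hat{w}_* - Xw_*\|_2^2 = Z + \|\beta\|_2^2$$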
Analysis for Randomized Least-square regression
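$$AU(\alpha + \beta) = AX\hat{w}_* = AX(AX)^\dagger Ab = P_{AX}(Ab) = P_{AU}(Ab)$$
$$P_{AU}(Ab) = P_{AU}(A(\omega + U\alpha)) = AU\alpha + P_{AU}(A\omega)$$
Hence $AU\beta = P_{AU}(A\omega)$, and
$$U^\top A^\top AU\beta = (AU)^\top(AU)(AU)^\dagger A\omega = (AU)^\top(AU)((AU)^\top AU)^{-1}(AU)^\top A\omega$$
where we use that $AU$ has full column rank. Then
$$U^\top A^\top AU\beta = U^\top A^\top A\omega$$
$$\|\beta\|_2^2/2 \le \|U^\top A^\top AU\beta\|_2^2 = \|U^\top A^\top A\omega\|_2^2 \le \epsilon'^2\|U\|_F^2\|\omega\|_2^2$$
where the last inequality uses the matrix product approximation shown in the next slide, together with $U^\top\omega = 0$. Since $\|U\|_F^2 \le d$, setting $\epsilon' = \sqrt{\epsilon/d}$ suffices.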
Approximate Matrix Products
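Given $X \in \mathbb{R}^{n\times d}$ and $Y \in \mathbb{R}^{d\times p}$, let $A \in \mathbb{R}^{m\times d}$ be one of the following matrices:
a JL transform matrix with $m = \Theta(\epsilon^{-2}\log((n + p)/\delta))$
the sparse subspace embedding with $m = \Theta(\epsilon^{-2})$
a leverage-score sampling matrix based on $p_i \ge \frac{\|X_{i*}\|_2^2}{2\|X\|_F^2}$ and $m = \Theta(\epsilon^{-2})$
Then, with probability at least $1 - \delta$,
$$\|XA^\top AY - XY\|_F \le \epsilon\|X\|_F\|Y\|_F$$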
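The bound above can be tried out with the simplest JL transform, a scaled Gaussian matrix. This is a minimal sketch under that assumption; the function name `sketched_product` is hypothetical, and a dense Gaussian $A$ is used for clarity rather than speed.

```python
import numpy as np

def sketched_product(X, Y, m, seed=0):
    """Approximate X @ Y by (X A^T)(A Y) with A an m x d Gaussian JL matrix."""
    rng = np.random.default_rng(seed)
    d = X.shape[1]
    A = rng.standard_normal((m, d)) / np.sqrt(m)   # scaled so that E[A^T A] = I_d
    return (X @ A.T) @ (A @ Y)

def relative_error(X, Y, approx):
    """||approx - X Y||_F / (||X||_F ||Y||_F), the quantity bounded by epsilon above."""
    num = np.linalg.norm(approx - X @ Y, 'fro')
    return num / (np.linalg.norm(X, 'fro') * np.linalg.norm(Y, 'fro'))
```

The relative error shrinks roughly like $1/\sqrt{m}$, consistent with $m = \Theta(\epsilon^{-2})$ up to logarithmic factors.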
Analysis for Randomized Least-square regression
$A \in \mathbb{R}^{m\times n}$
1. Subspace embedding: $AU$ has full column rank
2. Matrix product approximation: $\sqrt{\epsilon/d}$
Order of $m$:
JL transforms: 1. $O(d\log(d))$, 2. $O(d\log(d)\epsilon^{-1})$ $\Rightarrow$ $O(d\log(d)\epsilon^{-1})$
Sparse subspace embedding: 1. $O(d^2)$, 2. $O(d\epsilon^{-1})$ $\Rightarrow$ $O(d^2\epsilon^{-1})$
If we use SSE ($A_1 \in \mathbb{R}^{m_1\times n}$) and a JL transform ($A_2 \in \mathbb{R}^{m_2\times m_1}$):
$$\|A_2A_1(Xw_*^2 - b)\|_2 \le (1 + \epsilon)\|A_1(Xw_*^1 - b)\|_2 \le (1 + \epsilon)\|A_1(Xw_* - b)\|_2 \le (1 + \epsilon)^2\|Xw_* - b\|_2$$
with $m_1 = O(d^2\epsilon^{-2})$ and $m_2 = d\log(d)\epsilon^{-1}$, where $w_*^2$ is the optimal solution using $A_2A_1$, $w_*^1$ is the optimal solution using $A_1$, and $w_*$ is the original optimal solution.
Randomized Least-squares regression
Theoretical guarantees (Sarlos, 2006; Drineas et al., 2011; Nelson & Nguyen, 2012):
$$\|X\hat{w}_* - b\|_2 \le (1 + \epsilon)\|Xw_* - b\|_2$$
If $A$ is a fast JL transform with $m = \Theta(\epsilon^{-1}d\log(d))$: total time $O(nd\log(m) + d^3\log(d)\epsilon^{-1})$
If $A$ is a sparse subspace embedding with $m = \Theta(d^2\epsilon^{-1})$: total time $O(\mathrm{nnz}(X) + d^4\epsilon^{-1})$
If $A = A_1A_2$ combines a fast JL transform ($m_1 = \Theta(\epsilon^{-1}d\log(d))$) and SSE ($m_2 = \Theta(d^2\epsilon^{-2})$): total time $O(\mathrm{nnz}(X) + d^3\log(d/\epsilon)\epsilon^{-2})$
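A sketch-and-solve version of randomized least squares can be written in a few lines. The example below assumes a dense Gaussian sketch for $A$, which does not achieve the running times listed above (those rely on fast JL or sparse embeddings) but shows the same residual guarantee in practice. The function name `sketched_least_squares` is hypothetical.

```python
import numpy as np

def sketched_least_squares(X, b, m, seed=0):
    """Solve min_w ||A (X w - b)||_2 for an m x n Gaussian sketch A and return w_hat."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    A = rng.standard_normal((m, n)) / np.sqrt(m)   # assumption: dense Gaussian sketch
    w_hat, *_ = np.linalg.lstsq(A @ X, A @ b, rcond=None)
    return w_hat

def residual_ratio(X, b, w_hat):
    """||X w_hat - b||_2 / ||X w_* - b||_2, where w_* is the exact least-squares solution."""
    w_star, *_ = np.linalg.lstsq(X, b, rcond=None)
    return np.linalg.norm(X @ w_hat - b) / np.linalg.norm(X @ w_star - b)
```

The ratio is never below 1, because $w_*$ minimizes the original residual, and it approaches 1 as $m$ grows.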
Matrix Chernoff bound
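Lemma (Matrix Chernoff (Tropp, 2012))
Let $\mathcal{X}$ be a finite set of PSD matrices with dimension $k$, and suppose that $\max_{X\in\mathcal{X}}\lambda_{\max}(X) \le B$. Sample $X_1, \ldots, X_\ell$ independently from $\mathcal{X}$. Compute
$$\mu_{\max} = \ell\,\lambda_{\max}(\mathrm{E}[X_1]), \quad \mu_{\min} = \ell\,\lambda_{\min}(\mathrm{E}[X_1])$$
Then
$$\Pr\left\{\lambda_{\max}\left(\sum_{i=1}^{\ell} X_i\right) \ge (1 + \delta)\mu_{\max}\right\} \le k\left[\frac{e^{\delta}}{(1 + \delta)^{1+\delta}}\right]^{\mu_{\max}/B}$$
$$\Pr\left\{\lambda_{\min}\left(\sum_{i=1}^{\ell} X_i\right) \le (1 - \delta)\mu_{\min}\right\} \le k\left[\frac{e^{-\delta}}{(1 - \delta)^{1-\delta}}\right]^{\mu_{\min}/B}$$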
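To simplify the usage of the Matrix Chernoff bound, we note that
$$\left[\frac{e^{-\delta}}{(1 - \delta)^{1-\delta}}\right]^{\mu} \le \exp\left(-\frac{\mu\delta^2}{2}\right)$$
$$\left[\frac{e^{\delta}}{(1 + \delta)^{1+\delta}}\right]^{\mu} \le \exp\left(-\mu\delta^2/3\right), \quad \delta \le 1$$
$$\left[\frac{e^{\delta}}{(1 + \delta)^{1+\delta}}\right]^{\mu} \le \exp\left(-\mu\delta\log(\delta)/2\right), \quad \delta > 1$$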
Noncommutative Bernstein Inequality
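Lemma (Noncommutative Bernstein Inequality (Recht, 2011))
Let $Z_1, \ldots, Z_L$ be independent zero-mean random matrices of dimension $d_1 \times d_2$. Suppose $\tau_j^2 = \max\{\|\mathrm{E}[Z_jZ_j^\top]\|_2, \|\mathrm{E}[Z_j^\top Z_j]\|_2\}$ and $\|Z_j\|_2 \le M$ almost surely for all $j$. Then, for any $\epsilon > 0$,
$$\Pr\left[\left\|\sum_{j=1}^{L} Z_j\right\|_2 > \epsilon\right] \le (d_1 + d_2)\exp\left[\frac{-\epsilon^2/2}{\sum_{j=1}^{L}\tau_j^2 + M\epsilon/3}\right]$$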
Randomized Algorithms for K-means Clustering
K-means:
$$\sum_{j=1}^{k}\sum_{x_i\in C_j}\|x_i - \mu_j\|_2^2 = \|X - CC^\top X\|_F^2$$
where $C \in \mathbb{R}^{n\times k}$ is the scaled cluster indicator matrix such that $C^\top C = I$.
Constrained low-rank approximation (Cohen et al., 2015):
$$\min_{P\in S}\|X - PX\|_F^2$$
where $S$ is any set of rank-$k$ orthogonal projection matrices $P = QQ^\top$ with orthonormal $Q \in \mathbb{R}^{n\times k}$.
Low-rank approximation: $S$ is the set of all rank-$k$ orthogonal projection matrices, and $P_* = U_kU_k^\top$.
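The K-means identity can be verified directly for any fixed clustering. In the sketch below, column $j$ of $C$ holds $1/\sqrt{|C_j|}$ for the points in cluster $j$ and zero elsewhere, so $CC^\top X$ replaces each row of $X$ by its cluster mean. The function names are hypothetical, and the code assumes every cluster is non-empty.

```python
import numpy as np

def kmeans_cost(X, labels, k):
    """Sum of squared distances from each point to the mean of its cluster."""
    cost = 0.0
    for j in range(k):
        pts = X[labels == j]
        cost += np.sum((pts - pts.mean(axis=0))**2)
    return cost

def scaled_indicator(labels, k):
    """Scaled cluster indicator matrix C (n x k) with C^T C = I; assumes no empty cluster."""
    n = labels.shape[0]
    C = np.zeros((n, k))
    C[np.arange(n), labels] = 1.0
    return C / np.sqrt(C.sum(axis=0))     # divide column j by sqrt(|C_j|)

def projection_cost(X, C):
    """||X - C C^T X||_F^2, the constrained low-rank form of the K-means cost."""
    return np.linalg.norm(X - C @ (C.T @ X), 'fro')**2
```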
Randomized Algorithms for K-means Clustering
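Define
$$P_* = \arg\min_{P\in S}\|X - PX\|_F^2$$
$$\hat{P}_* = \arg\min_{P\in S}\|\hat{X} - P\hat{X}\|_F^2$$
where $\hat{X}$ denotes the randomly reduced version of $X$.
Guarantees on approximation:
$$\|X - \hat{P}_*X\|_F^2 \le \frac{1 + \epsilon}{1 - \epsilon}\|X - P_*X\|_F^2$$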
Properties of Leverage-score sampling
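We prove the properties using the Matrix Chernoff bound. Let $\Omega = AU$.
$$\Omega^\top\Omega = (AU)^\top(AU) = \sum_{j=1}^{m}\frac{1}{mp_{i_j}}u_{i_j}u_{i_j}^\top$$
Let $X_i = \frac{1}{mp_i}u_iu_i^\top$. Then $\mathrm{E}[X_i] = \frac{1}{m}I_k$, and therefore $\lambda_{\max}(\mathrm{E}[X_i]) = \lambda_{\min}(\mathrm{E}[X_i]) = \frac{1}{m}$. Moreover, $\lambda_{\max}(X_i) \le \max_i\frac{\|u_i\|_2^2}{mp_i} = \frac{k}{m}$. Applying the Matrix Chernoff bound for the minimum and maximum eigenvalues, we have
$$\Pr(\lambda_{\min}(\Omega^\top\Omega) \le (1 - \epsilon)) \le k\exp\left(-\frac{m\epsilon^2}{2k}\right) \le k\exp\left(-\frac{m\epsilon^2}{3k}\right)$$
$$\Pr(\lambda_{\max}(\Omega^\top\Omega) \ge (1 + \epsilon)) \le k\exp\left(-\frac{m\epsilon^2}{3k}\right)$$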
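The construction of $\Omega = AU$ can be sketched as follows, assuming the sampling probabilities are exactly the normalized leverage scores $p_i = \|u_i\|_2^2/k$ (the choice that gives $\lambda_{\max}(X_i) = k/m$ above). The function name `leverage_score_sketch` is hypothetical.

```python
import numpy as np

def leverage_score_sketch(U, m, seed=0):
    """Return Omega = A U, where A samples m rows of U with probability p_i and rescales them."""
    rng = np.random.default_rng(seed)
    n, k = U.shape
    p = np.sum(U**2, axis=1) / k                    # leverage scores, normalized to sum to 1
    idx = rng.choice(n, size=m, replace=True, p=p)  # i.i.d. sampling with replacement
    return U[idx] / np.sqrt(m * p[idx])[:, None]    # row i_j scaled by 1 / sqrt(m * p_{i_j})
```

With this scaling the trace of $\Omega^\top\Omega$ equals $k$ exactly, and for $m$ on the order of $k\log(k)/\epsilon^2$ all of its eigenvalues fall in $[1-\epsilon, 1+\epsilon]$ with high probability.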
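When does uniform sampling make sense?
Coherence measure:
$$\mu_k = \frac{d}{k}\max_{1\le i\le d}\|U_{i*}\|_2^2$$
When $\mu_k \le \tau$ and $m = \Theta\left(\frac{k\tau}{\epsilon^2}\log\left[\frac{2k}{\delta}\right]\right)$, with probability at least $1 - \delta$:
$A$ is formed by uniform sampling (and scaling)
$AU \in \mathbb{R}^{m\times k}$ has full column rank
$\sigma_i^2(AU) \ge (1 - \epsilon) \ge (1 - \epsilon)^2$
$\sigma_i^2(AU) \le (1 + \epsilon) \le (1 + \epsilon)^2$
Valid when the coherence measure is small (some real data mining datasets have small coherence measures)
The Nystrom method usually uses uniform sampling (Gittens, 2011)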
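The coherence measure is cheap to compute once an orthonormal basis $U$ is available. The sketch below follows the formula above with $d$ taken as the number of rows of $U$; the function name `coherence` is hypothetical. The two extreme cases are a basis made of coordinate vectors (coherence $d/k$, uniform sampling is a poor fit) and a basis whose rows all have equal norm (coherence 1, uniform sampling is ideal).

```python
import numpy as np

def coherence(U):
    """mu_k = (d / k) * max_i ||U_{i*}||_2^2 for a d x k matrix U with orthonormal columns."""
    d, k = U.shape
    return (d / k) * np.max(np.sum(U**2, axis=1))

# Spiky basis: the first two coordinate vectors in R^8.
U_spiky = np.eye(8)[:, :2]
# Flat basis: two orthonormal columns with entries +-1/sqrt(8).
U_flat = np.column_stack([np.ones(8), np.tile([1.0, -1.0], 4)]) / np.sqrt(8)
```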