Transcript of lecture slides: Convex Optimization, TTIC 31070 (ttic.uchicago.edu/~nati/Teaching/TTIC31070/2015/Lecture18.pdf)
Convex Optimization
Prof. Nati Srebro
Lecture 18: Course Summary
About the Course
• Methods for solving convex optimization problems based on oracle access, with guarantees based on their properties
• Understanding different optimization methods
  • Understanding their derivation
  • When are they appropriate
  • Guarantees (a few proofs, not a core component)
• Working with and reasoning about optimization problems
  • Standard forms: LP, QP, SDP, etc.
  • Optimality conditions
  • Duality
  • Using optimality conditions and duality to reason about x*
Oracles
• 0th, 1st, 2nd, etc. order oracle for a function f: x ↦ f(x), ∇f(x), ∇²f(x)
• Separation oracle for a constraint set 𝒳: x ↦ either "x ∈ 𝒳", or a separating g s.t. ⟨g, x′⟩ < ⟨g, x⟩ for all x′ ∈ 𝒳
• Projection oracle for 𝒳 (w.r.t. ‖·‖₂): y ↦ argmin_{x∈𝒳} ‖x − y‖₂²
• Linear optimization oracle for 𝒳: v ↦ argmin_{x∈𝒳} ⟨v, x⟩
• Prox/projection oracle for a function h (w.r.t. ‖·‖₂): (y, λ) ↦ argmin_x h(x) + λ‖x − y‖₂²
• Stochastic oracle: x ↦ g s.t. 𝔼[g] = ∇f(x)
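As a concrete illustration (my own sketch, not from the slides), the oracle types above can be written down for the specific choice f(x) = ½‖x‖² with 𝒳 the unit Euclidean ball; all function names here are invented for the example.

```python
import numpy as np

def value_oracle(x):
    """0th order oracle: x -> f(x), here f(x) = 0.5 ||x||^2."""
    return 0.5 * float(x @ x)

def gradient_oracle(x):
    """1st order oracle: x -> grad f(x) = x for this f."""
    return x.copy()

def projection_oracle(y):
    """y -> argmin_{x in X} ||x - y||^2 for X = unit Euclidean ball."""
    n = np.linalg.norm(y)
    return y if n <= 1.0 else y / n

def linear_opt_oracle(v):
    """v -> argmin_{x in X} <v, x>: the point of the ball opposite v."""
    n = np.linalg.norm(v)
    return -v / n if n > 0 else np.zeros_like(v)

def stochastic_oracle(x, rng):
    """x -> g with E[g] = grad f(x); here grad f(x) plus zero-mean noise."""
    return x + rng.standard_normal(x.shape)
```

Each of the methods in the remainder of the summary consumes one or more of these interfaces rather than a formula for f.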
Analysis Assumptions
• f is L-Lipschitz (w.r.t. ‖·‖): |f(x) − f(y)| ≤ L‖x − y‖
• f is M-smooth (w.r.t. ‖·‖): f(x + Δx) ≤ f(x) + ⟨∇f(x), Δx⟩ + (M/2)‖Δx‖²
• f is λ-strongly convex (w.r.t. ‖·‖): f(x) + ⟨∇f(x), Δx⟩ + (λ/2)‖Δx‖² ≤ f(x + Δx)
• f is self-concordant: for every x and direction v, the restriction g(t) = f(x + tv) satisfies |g‴(t)| ≤ 2 g″(t)^{3/2}
• f is quadratic (i.e. f‴ = 0)
• ‖x*‖ is small (or the domain is bounded)
• Initial sub-optimality ε₀ = f(x⁰) − f* is small
Overall runtime
overall runtime = (runtime of each iteration) × (number of required iterations)
runtime of each iteration = (oracle runtime) + (other operations)
Contrast to Non-Oracle Methods
• Symbolically solving ∇f(x) = 0
• Gaussian elimination
• Simplex method for LPs
Advantages of the Oracle Approach
• Handles generic functions (not only functions of a specific parametric form)
• Makes it clear which functions can be handled and what we need to be able to compute about them
• Complexity is measured in terms of "oracle accesses"
• Can obtain lower bounds on the number of oracle accesses required
Unconstrained Optimization with 1st Order Oracle:
Dimension Independent Guarantees

| #oracle accesses | M-smooth | M-smooth, λ-strongly convex | L-Lipschitz | L-Lipschitz, λ-strongly convex |
|---|---|---|---|---|
| GD | M‖x*‖²/ε | (M/λ) log(1/ε) | L²‖x*‖²/ε² | L²/(λε) |
| A-GD | √(M‖x*‖²/ε) | √(M/λ) log(1/ε) | | |
| Lower bound | Ω(√(M‖x*‖²/ε)) | Ω(√(M/λ) log(1/ε)) | Ω(L²‖x*‖²/ε²) | Ω(L²/(λε)) |

Theorem (Nesterov): for any method that uses only 1st order oracle access, there exists an L-Lipschitz function with ‖x*‖ ≤ 1 such that at least Ω(L²/ε²) oracle accesses are required for the method to find an ε-suboptimal solution.
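The gap between GD's (M/λ)·log(1/ε) rate and A-GD's √(M/λ)·log(1/ε) rate is easy to observe numerically. A minimal sketch (the diagonal quadratic test problem and the step-size choices are my own, not from the lecture):

```python
import numpy as np

def gd(d, x0, steps):
    """Gradient descent on f(x) = 0.5 x' D x for diagonal D (vector d),
    with the standard 1/M step size, M = max eigenvalue."""
    M = d.max()
    x = x0.copy()
    for _ in range(steps):
        x = x - (1.0 / M) * (d * x)   # grad f(x) = D x
    return x

def agd(d, x0, steps):
    """Nesterov's accelerated GD for the strongly convex case:
    gradient step plus momentum (sqrt(kappa)-1)/(sqrt(kappa)+1)."""
    M, lam = d.max(), d.min()
    beta = (np.sqrt(M / lam) - 1.0) / (np.sqrt(M / lam) + 1.0)
    x, y_prev = x0.copy(), x0.copy()
    for _ in range(steps):
        y = x - (1.0 / M) * (d * x)
        x = y + beta * (y - y_prev)
        y_prev = y
    return y_prev

def f(d, x):
    return 0.5 * float(x @ (d * x))
```

With D = diag(1, 100), so κ = 100, A-GD reaches a far smaller objective than GD after the same number of iterations.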
Unconstrained Optimization with 1st Order Oracle:
Low Dimensional Guarantees

| | #iter | Runtime per iter | Overall runtime |
|---|---|---|---|
| Center of Mass | n log(1/ε) | ∞ | ∞ |
| Ellipsoid | n² log(1/ε) | n² | n⁴ log(1/ε) |
| Lower bound | Ω(n log(1/ε)) | | |
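The ellipsoid update itself is only a few lines: cut through the center with the (sub)gradient, then re-fit the minimum-volume ellipsoid around the remaining half. A sketch for minimizing a convex function from 0th/1st order oracles, starting from the unit ball (my own illustrative code, with a simple best-center-seen rule):

```python
import numpy as np

def ellipsoid_min(f, grad, c, P, steps):
    """Minimize convex f given value and (sub)gradient oracles, starting
    from the ellipsoid {x : (x - c)' P^{-1} (x - c) <= 1}."""
    n = len(c)
    best = c.copy()
    for _ in range(steps):
        g = grad(c)
        Pg = P @ g
        denom = np.sqrt(float(g @ Pg))
        if denom < 1e-12:                 # gradient (or ellipsoid) ~ 0: stop
            break
        gt = Pg / denom                   # gt = P g~ with g~ = g / sqrt(g'Pg)
        c = c - gt / (n + 1)              # shift center into the kept half
        P = (n * n / (n * n - 1.0)) * (P - (2.0 / (n + 1)) * np.outer(gt, gt))
        if f(c) < f(best):                # track the best center seen
            best = c.copy()
    return best
```

Each step shrinks the ellipsoid's volume by a constant factor e^(-1/(2n)), which is where the n² log(1/ε) iteration count comes from.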
Overall runtime
overall runtime = (runtime of each iteration) × (number of required iterations)
Unconstrained Optimization: Practical Methods

| | Memory | Per iter | #iter (M-smooth, λ-strongly convex) | #iter (quadratic) |
|---|---|---|---|---|
| GD | O(n) | ∇f + O(n) | κ log(1/ε) | κ log(1/ε) |
| A-GD | O(n) | ∇f + O(n) | √κ log(1/ε) | √κ log(1/ε) |
| Momentum | O(n) | ∇f + O(n) | | √κ log(1/ε) |
| Conj-GD | O(n) | ∇f + O(n) | | √κ log(1/ε) |
| L-BFGS | O(mn) | ∇f + O(mn) | | |
| BFGS | O(n²) | ∇f + O(n²) | | √κ log(1/ε) |
| Newton | O(n²) | ∇²f + O(n³) | (f(x⁰) − f*)/γ + log log(1/ε) | 1 |

(κ = M/λ; the Newton count is for self-concordant f.)
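The behavior of conjugate gradients on quadratics can be demonstrated directly: on f(x) = ½xᵀQx − bᵀx (equivalently, solving Qx = b), CG terminates in at most n iterations in exact arithmetic, using O(n) memory and one matrix-vector product per iteration. A sketch:

```python
import numpy as np

def conjugate_gradient(Q, b, x0, steps):
    """Conjugate gradients for f(x) = 0.5 x'Qx - b'x with Q symmetric
    positive definite; equivalently, an iterative solver for Qx = b."""
    x = x0.copy()
    r = b - Q @ x              # residual = -grad f(x)
    p = r.copy()               # first search direction
    for _ in range(steps):
        rr = float(r @ r)
        if rr < 1e-24:         # residual ~ 0: done
            break
        Qp = Q @ p
        alpha = rr / float(p @ Qp)   # exact line search along p
        x = x + alpha * p
        r = r - alpha * Qp
        beta = float(r @ r) / rr     # make next direction Q-conjugate
        p = r + beta * p
    return x
```

The only access to Q is through products Qp, so this also fits the oracle view: CG needs a matrix-vector-product oracle, never Q itself.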
Constrained Optimization: min_{x∈𝒳} f(x)
• Projected Gradient Descent
  • Oracle: y ↦ argmin_{x∈𝒳} ‖x − y‖₂²
  • O(n) + one projection per iteration
  • Similar iteration complexity to Gradient Descent: poly dependence on M or on L
• Conditional Gradient Descent
  • Oracle: v ↦ argmin_{x∈𝒳} ⟨v, x⟩
  • O(n) + one linear optimization per iteration
  • O(M/ε) iterations, but no acceleration, no κ log(1/ε)
• Ellipsoid Method
  • Separation oracle (can handle much more generic 𝒳)
  • O(n²) + one separation oracle call per iteration
  • O(n² log(1/ε)) iterations; total runtime O(n⁴ log(1/ε))
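As a concrete instance of conditional gradient descent: over the ℓ₁ ball, the linear optimization oracle just returns a signed coordinate vertex, so each step is O(n). A sketch with the standard 2/(t+2) step size (the test problem in the usage note is my own):

```python
import numpy as np

def frank_wolfe_l1(grad, radius, n, steps):
    """Conditional gradient descent over the l1 ball of the given radius.
    The linear optimization oracle argmin_{||s||_1 <= r} <v, s> returns a
    vertex: -r * sign(v_i) * e_i for the coordinate i maximizing |v_i|."""
    x = np.zeros(n)
    for t in range(steps):
        v = grad(x)
        i = int(np.argmax(np.abs(v)))
        s = np.zeros(n)
        s[i] = -radius * np.sign(v[i])   # vertex from the linear oracle
        gamma = 2.0 / (t + 2)            # standard step-size schedule
        x = (1 - gamma) * x + gamma * s  # move toward the vertex
    return x
```

For example, minimizing ½‖x − b‖² over ‖x‖₁ ≤ 1 via `grad = lambda x: x - b` drives the iterate toward the vertex nearest b; the iterate is always a convex combination of at most t+1 vertices, hence sparse.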
Constrained Optimization
• Interior Point Methods
  • Only 1st and 2nd order oracle access to f₀, fᵢ
  • Overall #Newton iterations: O(√m (log(1/ε) + log log(1/δ)))
  • Overall runtime: ≈ √m · (mn + n³ + cost of ∇² evaluations) · log(1/ε)
  • Can also handle matrix inequalities (SDPs)
  • Other cones: need a self-concordant barrier
  • The "standard method" for LPs, QPs, SDPs, etc.

The three problem forms handled:
min_{x∈ℝⁿ} f₀(x) s.t. fᵢ(x) ≤ 0, Ax = b
min_{x∈ℝⁿ} f₀(x) s.t. Fᵢ(x) ⪯ 0, Ax = b (matrix inequalities)
min_{x∈ℝⁿ} f₀(x) s.t. −fᵢ(x) ∈ Kᵢ, Ax = b (generalized inequalities over cones Kᵢ)
Barrier Methods
• Log barrier φ_t(u) = −(1/t) log(−u): as t grows, the barrier function approaches the ideal 0/∞ penalty function for the constraint u ≤ 0
• Barrier problem: min_{x∈ℝⁿ} f₀(x) + Σᵢ₌₁ᵐ φ_t(fᵢ(x)) s.t. Ax = b, approximating min f₀(x) s.t. fᵢ(x) ≤ 0, Ax = b
• Analysis and guarantees
• "Central path"
• Relaxation of the KKT conditions
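A minimal path-following sketch for the linear case makes this concrete: repeatedly center with damped Newton steps on t·cᵀx − Σᵢ log(bᵢ − aᵢᵀx), then increase t by a factor μ, so the suboptimality gap shrinks like m/t. All parameter choices are illustrative, and this is a teaching sketch rather than a production interior point solver:

```python
import numpy as np

def barrier_lp(c, A, b, x0, t0=1.0, mu=10.0, outer=8, inner=100):
    """Minimize c'x s.t. Ax <= b by following the log-barrier central
    path; x0 must be strictly feasible (A x0 < b)."""
    c = np.asarray(c, dtype=float)
    x = np.asarray(x0, dtype=float).copy()
    t = t0

    def phi(t, x):                        # barrier objective at parameter t
        s = b - A @ x
        return np.inf if s.min() <= 0 else t * float(c @ x) - float(np.log(s).sum())

    for _ in range(outer):
        for _ in range(inner):            # Newton centering at current t
            s = b - A @ x                 # slacks, must stay > 0
            g = t * c + A.T @ (1.0 / s)   # gradient of phi(t, .)
            H = (A.T * (1.0 / s**2)) @ A  # Hessian: A' diag(1/s^2) A
            dx = np.linalg.solve(H, -g)
            if float(-g @ dx) < 1e-10:    # squared Newton decrement ~ 0
                break
            step = 1.0
            while phi(t, x + step * dx) > phi(t, x):
                step *= 0.5               # damp: stay feasible, decrease phi
            x = x + step * dx
        t *= mu                           # tighten the barrier: gap <= m/t
    return x
```

On min x₁ + x₂ s.t. x ≥ 0, the centered point at parameter t is exactly (1/t, 1/t), so the iterates trace the central path toward the vertex (0, 0).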
Optimality Conditions (assuming strong duality)
x* is optimal for (P) and (λ*, ν*) is optimal for (D) iff the KKT conditions hold:
• fᵢ(x*) ≤ 0 for i = 1…m (primal feasibility)
• hⱼ(x*) = 0 for j = 1…p (primal feasibility)
• λᵢ* ≥ 0 for i = 1…m (dual feasibility)
• ∇ₓL(x*, λ*, ν*) = 0 (stationarity)
• λᵢ* fᵢ(x*) = 0 for i = 1…m (complementary slackness)
(Named for William Karush, Harold Kuhn, and Albert Tucker.)
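The conditions are easy to check numerically on a small worked example (my own, not from the slides): min ½‖x‖² s.t. f₁(x) = 1 − x₁ − x₂ ≤ 0. Stationarity gives x* = λ*(1, 1), and complementary slackness with the constraint active pins down x* = (0.5, 0.5), λ* = 0.5:

```python
import numpy as np

def check_kkt(x, lam, tol=1e-9):
    """Check each KKT condition for: min 0.5||x||^2  s.t.  1 - x1 - x2 <= 0."""
    f1 = 1.0 - x[0] - x[1]               # inequality constraint value
    grad_f0 = x                          # gradient of 0.5 ||x||^2
    grad_f1 = np.array([-1.0, -1.0])     # gradient of the constraint
    return {
        "primal_feasible": f1 <= tol,
        "dual_feasible": lam >= -tol,
        "stationarity": np.linalg.norm(grad_f0 + lam * grad_f1) <= tol,
        "comp_slackness": abs(lam * f1) <= tol,
    }
```

At (x*, λ*) all four checks pass; at a strictly feasible point with λ > 0, complementary slackness fails, which is exactly the condition the log barrier relaxes on the next slide.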
Optimality Conditions for the Problem with Log Barrier
x* is optimal for (P_t) and (λ*, ν*) is optimal for (D_t) iff:
• fᵢ(x*) ≤ 0 for i = 1…m
• hⱼ(x*) = 0 for j = 1…p
• λᵢ* ≥ 0 for i = 1…m
• ∇ₓL(x*, λ*, ν*) = 0
• λᵢ* fᵢ(x*) = −1/t for i = 1…m (relaxed complementary slackness)
A Brief History
• 1939: Leonid Kantorovich, formulation of LP
• 1947: George Dantzig, the Simplex method
• KKT conditions: Karush (1939); Kuhn & Tucker (1951)
• 1951: Herbert Robbins (w/ Monro), SGD
• 1956: Marguerite Frank & Philip Wolfe, Conditional GD
• 1972: the Ellipsoid Method
• 1978: Arkadi Nemirovski, non-smooth SGD + analysis
• 1979: Leonid Khachiyan, Ellipsoid is poly time for LP
• 1984: Narendra Karmarkar, IP for LP
• 1994: Yuri Nesterov & Arkadi Nemirovski, general IP
From Interior Point Methods Back to First Order Methods
• Interior Point Methods (and before them, Ellipsoid):
  • #iter: poly(n) · log(1/ε)
  • Solve LPs in time polynomial in the size of the input (number of bits in the coefficients)
  • But in large-scale applications, O(n^3.5) is too high
• First order (and stochastic) methods:
  • Better dependence on n
  • #iter: poly(1/ε) (or poly(κ))
Overall runtime
overall runtime = (runtime of each iteration) × (number of required iterations)
runtime of each iteration = (oracle runtime) + (other operations)
Example: ℓ₁ Regression
F(x) = Σᵢ₌₁ᵐ |⟨aᵢ, x⟩ − bᵢ|

• Option 1: Cast as constrained optimization (an LP) and use an interior point method:
  min_{x∈ℝⁿ, t∈ℝᵐ} Σᵢ tᵢ s.t. −tᵢ ≤ ⟨aᵢ, x⟩ − bᵢ ≤ tᵢ
  • Runtime: O(√m (m + n)³ log(1/ε))
• Option 2: Gradient Descent
  • O(m²‖a‖² ‖x*‖² / ε²) iterations, where ‖a‖ = maxᵢ ‖aᵢ‖
  • O(mn) per iteration; overall O(mn) × #iter
• Option 3: Stochastic Gradient Descent
  • O(m²‖a‖² ‖x*‖² / ε²) iterations
  • O(n) per iteration; overall O(n) × #iter, an m-fold saving per iteration over full GD
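Option 3 can be sketched directly: sample one term, take a subgradient step on m·|⟨aᵢ, x⟩ − bᵢ| (an unbiased estimate of a subgradient of F), and average the iterates. The step-size constants below are illustrative choices, not from the lecture:

```python
import numpy as np

def sgd_l1_regression(A, b, steps, R, rng):
    """SGD on F(x) = sum_i |<a_i, x> - b_i|: one sampled subgradient per
    step, eta_t = R / (G sqrt(t)), returning the running average iterate.
    R is a (guessed) bound on ||x*||."""
    m, n = A.shape
    G = m * float(np.max(np.linalg.norm(A, axis=1)))  # bound on ||stoch. subgrad||
    x = np.zeros(n)
    avg = np.zeros(n)
    for t in range(1, steps + 1):
        i = int(rng.integers(m))                      # sample one term
        g = m * np.sign(float(A[i] @ x - b[i])) * A[i]  # E[g] is a subgrad of F
        x = x - (R / (G * np.sqrt(t))) * g
        avg += (x - avg) / t                          # running average
    return avg
```

Each iteration touches one row of A, which is the O(n)-per-iteration column of Option 3.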
Remember!
• The gradient is in the dual space: don't take the mapping for granted!
(The slide's diagram links the primal space, holding the iterates x, to the dual space, holding the gradients ∇f, through the mirror map ∇Ψ and its inverse ∇Ψ⁻¹.)
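A concrete instance of this mapping is the entropy mirror map Ψ(x) = Σ xᵢ log xᵢ on the probability simplex: ∇Ψ carries x to the dual space, the gradient step happens there, and mapping back (with normalization) works out to the multiplicative exponentiated-gradient update. A sketch with step size and test problem of my own choosing:

```python
import numpy as np

def mirror_descent_simplex(grad, x0, eta, steps):
    """Mirror descent on the probability simplex with the entropy mirror
    map: the dual step followed by the inverse map collapses to a
    multiplicative (exponentiated gradient) update."""
    x = x0.copy()
    for _ in range(steps):
        x = x * np.exp(-eta * grad(x))   # dual-space gradient step, mapped back
        x = x / x.sum()                  # renormalize onto the simplex
    return x
```

Unlike a Euclidean projected step, this update keeps every iterate strictly inside the simplex and adapts to the ‖·‖₁ geometry.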
Some Things We Didn't Cover
• Technology for unconstrained optimization
  • Different quasi-Newton and conjugate gradient methods
  • Different line searches
• Technology for interior point primal-dual methods
• Numerical issues
• Recent advances in first order and stochastic methods
  • Mirror descent, different geometries, adaptive geometries
  • Decomposable functions, partial linearization and acceleration
• Faster methods for problems with specific forms or properties
  • Message passing for solving LPs
  • Flows and network problems
• Distributed optimization
Current Trends
• Small scale problems:
  • Super-efficient and reliable solvers for use in real time
• Very large scale problems:
  • Stochastic first order methods
  • Linear time, one-pass or few-pass methods
• Relationship to machine learning
Other Courses
• Winter 2016: Non-Linear Optimization (Mihai Anitescu, Stats)
• Winter 2016: Networks I: Introduction to Modeling and Analysis (Ozan Candogan, Booth)
• 2016/17: Computational and Statistical Learning Theory (TTIC)
About the Course
• Methods for solving convex optimization problems based on oracle access, with guarantees based on their properties
• Understanding different optimization methods
  • Understanding their derivation
  • When are they appropriate
  • Guarantees (a few proofs, not a core component)
• Working with and reasoning about optimization problems
  • Standard forms: LP, QP, SDP, etc.
  • Optimality conditions
  • Duality
  • Using optimality conditions and duality to reason about x*
Final
• Tuesday 10:30am
• Straightforward questions
• Allowed: anything hand-written by you (not photocopied)
• In depth (up to and including Interior Point Methods):
  • Unconstrained optimization
  • Duality
  • Optimality conditions
  • Interior point methods
  • Phase I methods and feasibility problems
• Superficially (a few multiple choice or similar):
  • Other first order and stochastic methods
  • Lower bounds
  • Center of Mass / Ellipsoid
  • Other approaches