Sparse Inverse Covariance Estimation for a Million Variables

Cho-Jui Hsieh, The University of Texas at Austin
ICML Workshop on Covariance Selection, Beijing, China, June 26, 2014

Joint work with M. Sustik, I. Dhillon, P. Ravikumar, R. Poldrack, A. Banerjee, S. Becker, and P. Olsen
Cho-Jui Hsieh The University of Texas at Austin Sparse Inverse Covariance Estimation
FMRI Brain Analysis
Goal: Reveal functional connections between regions of the brain. (Sun et al., 2009; Smith et al., 2011; Varoquaux et al., 2010; Ng et al., 2011)

p = 228,483 voxels.
Figure from (Varoquaux et al., 2010)
Other Applications
Gene regulatory network discovery: (Schafer & Strimmer, 2005; Andrei & Kendziorski, 2009; Menendez et al., 2010; Yin & Li, 2011)

Financial data analysis: model dependencies in multivariate time series (Xuan & Murphy, 2007); sparse high-dimensional models in economics (Fan et al., 2011).

Social network analysis / web data: model co-authorship networks (Goldenberg & Moore, 2005); model item-item similarity for recommender systems (Agarwal et al., 2011).

Climate data analysis (Chen et al., 2010).

Signal processing (Zhang & Fung, 2013).

Anomaly detection (Ide et al., 2009).

Time series prediction and classification (Wang et al., 2014).
L1-regularized inverse covariance selection
Goal: Estimate the inverse covariance matrix in the high-dimensional setting: p (# variables) ≫ n (# samples).

Add ℓ1 regularization – a sparse inverse covariance matrix is preferred.

The ℓ1-regularized Maximum Likelihood Estimator:

    Σ̂⁻¹ = arg min_{X ≻ 0} { −log det X + tr(SX) + λ‖X‖₁ } = arg min_{X ≻ 0} f(X),

where −log det X + tr(SX) is the negative log likelihood and ‖X‖₁ = ∑_{i,j=1}^p |X_ij|.

High dimensional problems:
Number of parameters scales quadratically with number of variables.
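To make the objective concrete, here is a minimal NumPy sketch (not the authors' code) that evaluates f(X); the 3×3 identity example and λ = 0.5 are purely illustrative:

```python
import numpy as np

def neg_log_likelihood(X, S):
    # -log det X + tr(S X); X must be positive definite
    sign, logdet = np.linalg.slogdet(X)
    assert sign > 0, "X must be positive definite"
    return -logdet + np.trace(S @ X)

def objective(X, S, lam):
    # f(X) = -log det X + tr(S X) + lam * ||X||_1 (elementwise L1 norm)
    return neg_log_likelihood(X, S) + lam * np.abs(X).sum()

# Tiny sanity check: at X = I with S = I (p = 3),
# f(I) = -log det(I) + tr(I) + lam * p = 0 + 3 + 0.5 * 3
p, lam = 3, 0.5
S = np.eye(p)
print(objective(np.eye(p), S, lam))  # 4.5
```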
Proximal Newton method
Split smooth and non-smooth terms: f(X) = g(X) + h(X), where

    g(X) = −log det X + tr(SX)  and  h(X) = λ‖X‖₁.

Form a quadratic approximation of g at X_t + ∆:

    g_{X_t}(∆) = tr((S − W_t)∆) + (1/2) vec(∆)^T (W_t ⊗ W_t) vec(∆) − log det X_t + tr(SX_t),

where W_t = X_t⁻¹ = ∂/∂X log det X |_{X = X_t}.

Define the generalized Newton direction:

    D_t = arg min_∆ g_{X_t}(∆) + λ‖X_t + ∆‖₁.
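The quadratic model can be evaluated directly using the identity (W ⊗ W) vec(∆) = vec(W∆W) for symmetric W; a minimal NumPy sketch (illustrative, not the authors' implementation; the 2×2 example matrices are made up):

```python
import numpy as np

def quad_model(Delta, X, S):
    # g_X(Delta) = tr((S - W) Delta) + 1/2 vec(Delta)^T (W ⊗ W) vec(Delta)
    #              - log det X + tr(S X),   with W = X^{-1}.
    # Since (W ⊗ W) vec(Delta) = vec(W Delta W) for symmetric W, the
    # quadratic term equals 1/2 tr(Delta W Delta W).
    W = np.linalg.inv(X)
    sign, logdet = np.linalg.slogdet(X)
    const = -logdet + np.trace(S @ X)
    return (np.trace((S - W) @ Delta)
            + 0.5 * np.trace(Delta @ W @ Delta @ W)
            + const)

# At Delta = 0 the model reduces to g(X) itself.
X = np.array([[2.0, 0.3], [0.3, 1.0]])
S = np.eye(2)
g_X = -np.linalg.slogdet(X)[1] + np.trace(S @ X)
print(np.isclose(quad_model(np.zeros((2, 2)), X, S), g_X))  # True
```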
Algorithm – Proximal Newton method
Proximal Newton method for sparse inverse covariance estimation

Input: Empirical covariance matrix S, scalar λ, initial X_0.
For t = 0, 1, ...
1. Compute Newton direction: D_t = arg min_∆ f_{X_t}(X_t + ∆) (a Lasso problem).
2. Line search: use an Armijo-rule based step-size selection to get α s.t. X_{t+1} = X_t + αD_t
   is positive definite,
   satisfies a sufficient decrease condition f(X_t + αD_t) ≤ f(X_t) + ασ∆_t.
   (Cholesky factorization of X_t + αD_t.)
(Hsieh et al., 2011; Olsen et al., 2012; Lee et al., 2012; Dinh et al., 2013; Scheinberg et al., 2014)
Scalability
How can the proximal Newton method be applied to high-dimensional problems?

Number of parameters: too large (p²).
Hessian: cannot be explicitly formed (p² × p²).
Limited memory...

BigQUIC (2013):

p = 1,000,000 (1 trillion parameters) in 22.9 hours with 32 GB of memory (using a single machine with 32 cores).

Key ingredients:

Active variable selection ⇒ the complexity scales with the intrinsic dimensionality.
Block coordinate descent ⇒ handles problems where # parameters > memory size.
(Big & Quic.)
Other techniques: divide-and-conquer, efficient line search...
Extension to general dirty models.
Active Variable Selection
Motivation
We can directly implement a proximal Newton method.

It takes 28 seconds on the ER dataset.

Running Glasso takes only 25 seconds.

Why do people like to use the proximal Newton method?
Newton Iteration 1
ER dataset, p = 692, Newton iteration 1.
Newton Iteration 3
ER dataset, p = 692, Newton iteration 3.
Newton Iteration 5
ER dataset, p = 692, Newton iteration 5.
Newton Iteration 7
ER dataset, p = 692, Newton iteration 7.
Only 2.7% of the variables have been changed to nonzero.
Active Variable Selection
Our goal: Reduce the number of variables from O(p²) to O(‖X*‖₀).

The solution usually has a small number of nonzeros: ‖X*‖₀ ≪ p².

Strategy: before solving for the Newton direction, "guess" the variables that should be updated.

Fixed set: S_fixed = {(i, j) | X_ij = 0 and |∇_ij g(X)| ≤ λ}. The remaining variables constitute the free set.

Solve the smaller subproblem:

    D_t = arg min_{∆ : ∆_{S_fixed} = 0} g_{X_t}(∆) + λ‖X_t + ∆‖₁.
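A minimal NumPy sketch of the fixed/free split (illustrative only; it assumes a dense X small enough to invert directly, which BigQUIC itself avoids):

```python
import numpy as np

def split_fixed_free(X, S, lam):
    # Fixed set: X_ij == 0 and |grad_ij g(X)| <= lam (these stay at zero).
    # Everything else is free.  grad g(X) = S - X^{-1}.
    grad = S - np.linalg.inv(X)
    free = (X != 0) | (np.abs(grad) > lam)
    return ~free, free

# With a large lam, only the already-nonzero entries remain free.
X = np.eye(3)
S = np.eye(3)
fixed, free = split_fixed_free(X, S, lam=10.0)
print(int(free.sum()))  # 3 (the diagonal entries)
```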
Active Variable Selection
Is this just a heuristic? No.

The process is equivalent to a block coordinate descent algorithm with two blocks:
1. Update the fixed set – no change (due to the definition of S_fixed).
2. Update the free set – update on the subproblem.

Convergence is guaranteed.
Size of free set
In practice, the size of the free set is small.

Take the Hereditary dataset as an example: p = 1,869, so the number of variables is p² ≈ 3.49 million. The size of the free set drops to 20,000 at the end.
Real datasets
(a) Time for Estrogen, p = 692 (b) Time for hereditarybc, p = 1, 869
Figure: Comparison of algorithms on real datasets. The results show that QUIC converges faster than other methods.
Algorithm
QUIC: QUadratic approximation for sparse Inverse Covariance estimation
Input: Empirical covariance matrix S, scalar λ, initial X_0.
For t = 0, 1, ...
1. Variable selection: select a free set of m ≪ p² variables.
2. Use coordinate descent to find the descent direction: D_t = arg min_∆ f_{X_t}(X_t + ∆) over the set of free variables (a Lasso problem).
3. Line search: use an Armijo-rule based step-size selection to get α s.t. X_{t+1} = X_t + αD_t
   is positive definite,
   satisfies a sufficient decrease condition f(X_t + αD_t) ≤ f(X_t) + ασ∆_t.
   (Cholesky factorization of X_t + αD_t.)
Extensions
Also used in LIBLINEAR (Yuan et al., 2011) for ℓ1-regularized logistic regression.

Can be generalized to active subspace selection for the nuclear norm (Hsieh et al., 2014).

Can be generalized to general decomposable norms.
Efficient (block) Coordinate Descent
with Limited Memory
Coordinate Descent Updates
Use coordinate descent to solve: arg min_D { g_X(D) + λ‖X + D‖₁ }.

Current gradient: ∇g_X(D) = (W ⊗ W) vec(D) + vec(S − W), i.e., WDW + S − W in matrix form, where W = X⁻¹.

Closed-form solution for each coordinate descent update:

    D_ij ← −c + S(c − b/a, λ/a),

where S(z, r) = sign(z) max{|z| − r, 0} is the soft-thresholding function, a = W_ij² + W_ii W_jj, b = S_ij − W_ij + w_i^T D w_j, and c = X_ij + D_ij.

The main cost is in computing w_i^T D w_j.

Instead, we maintain U = DW after each coordinate update, and then compute w_i^T u_j – only O(p) flops per update.
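A sketch of one such update in NumPy, reading the closed-form rule as the increment μ added to D_ij and keeping U = DW in sync (illustrative; a real implementation also updates the symmetric entry (j, i), and the 2×2 example values are made up):

```python
import numpy as np

def soft_threshold(z, r):
    # S(z, r) = sign(z) * max(|z| - r, 0)
    return np.sign(z) * max(abs(z) - r, 0.0)

def coordinate_update(i, j, X, S, W, D, U, lam):
    # One coordinate-descent step on the Lasso subproblem.
    # U = D @ W is maintained so that w_i^T D w_j = W[:, i] @ U[:, j]
    # costs only O(p) flops instead of O(p^2).
    a = W[i, j] ** 2 + W[i, i] * W[j, j]
    b = S[i, j] - W[i, j] + W[:, i] @ U[:, j]
    c = X[i, j] + D[i, j]
    mu = -c + soft_threshold(c - b / a, lam / a)
    D[i, j] += mu
    U[i, :] += mu * W[j, :]  # rank-1 update keeps U = D @ W in sync
    return mu

# Tiny example: X = W = I, empty D, one update on coordinate (0, 1).
p, lam = 2, 0.1
X = np.eye(p); W = np.eye(p)
S = np.array([[1.0, 0.5], [0.5, 1.0]])
D = np.zeros((p, p)); U = np.zeros((p, p))
mu = coordinate_update(0, 1, X, S, W, D, U, lam)
print(mu)  # -0.4
```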
Difficulties in Scaling QUIC
Consider the case that p ≈ 1 million, m = ‖X_t‖₀ ≈ 50 million.

Coordinate descent requires X_t and W = X_t⁻¹:

needs O(p²) storage,
needs O(mp) computation per sweep, where m = ‖X_t‖₀.
Coordinate Updates with Memory Cache
Assume we can store M columns of W in memory.
Coordinate descent update (i, j): compute w_i^T D w_j.

If w_i, w_j are not in memory, recompute them by CG:

    solve X w_i = e_i : O(T_CG) time.
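Recomputing a missing column of W amounts to a conjugate gradient solve of X w_i = e_i; a self-contained NumPy sketch (illustrative only, without the caching logic, and the 2×2 test matrix is made up):

```python
import numpy as np

def column_of_inverse(X, i, tol=1e-10, max_iter=1000):
    # Solve X w_i = e_i by conjugate gradient (X symmetric positive
    # definite), recovering one column of W = X^{-1} without ever
    # forming the full inverse.
    p = X.shape[0]
    e = np.zeros(p)
    e[i] = 1.0
    w = np.zeros(p)
    r = e - X @ w          # residual
    d = r.copy()           # search direction
    rs = r @ r
    for _ in range(max_iter):
        Xd = X @ d
        alpha = rs / (d @ Xd)
        w += alpha * d
        r -= alpha * Xd
        rs_new = r @ r
        if np.sqrt(rs_new) < tol:
            break
        d = r + (rs_new / rs) * d
        rs = rs_new
    return w

X = np.array([[4.0, 1.0], [1.0, 3.0]])
w0 = column_of_inverse(X, 0)
print(np.allclose(X @ w0, [1.0, 0.0]))  # True
```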
Coordinate Updates with Memory Cache
w1,w2,w3,w4 stored in memory.
Coordinate Updates with Memory Cache
Cache hit, do not need to recompute wi ,wj .
Coordinate Updates with Memory Cache
Cache miss, recompute wi ,wj .
Coordinate Updates – ideal case
We want to find the update sequence that minimizes the number of cache misses: probably NP-hard.

Our strategy: update variables block by block.

The ideal case: there exists a partition {S_1, ..., S_k} such that all free sets are in diagonal blocks.

This requires only p column evaluations.
General case: block diagonal + sparse
If the block partition is not perfect, the extra column computations can be characterized by boundary nodes.

Given a partition {S_1, ..., S_k}, we define the boundary nodes as

    B(S_q) ≡ {j | j ∈ S_q and ∃ i ∈ S_z, z ≠ q, s.t. F_ij = 1},

where F is the adjacency matrix of the free set.
Graph Clustering Algorithm
The number of columns to be computed in one sweep is

    p + ∑_q |B(S_q)|.

This can be upper bounded by

    p + ∑_q |B(S_q)| ≤ p + ∑_{z≠q} ∑_{i∈S_z, j∈S_q} F_ij.

Use graph clustering (METIS or Graclus) to find the partition.

Example: on the fMRI dataset (p = 0.228 million) with 20 blocks,

random partition: needs 1.6 million column computations.
graph clustering: needs 0.237 million column computations.
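The column count p + ∑_q |B(S_q)| can be checked on toy partitions; a small illustrative sketch (hypothetical helper, not from BigQUIC), representing the free set as an edge list:

```python
def column_evaluations(partition, free_edges, p):
    # Count columns of W computed in one sweep of block coordinate
    # descent: p columns for the diagonal blocks, plus one extra per
    # boundary node, i.e. per node incident to a free-set edge that
    # crosses between blocks.
    block_of = {}
    for q, block in enumerate(partition):
        for node in block:
            block_of[node] = q
    boundary = set()
    for i, j in free_edges:
        if block_of[i] != block_of[j]:
            boundary.add(i)
            boundary.add(j)
    return p + len(boundary)

# Perfect partition: no cross-block edges, so exactly p columns.
print(column_evaluations([[0, 1], [2, 3]], [(0, 1), (2, 3)], p=4))  # 4
# One cross-block edge adds its two endpoints as boundary nodes.
print(column_evaluations([[0, 1], [2, 3]], [(0, 1), (1, 2)], p=4))  # 6
```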
BigQUIC
Block coordinate descent with clustering:

storage: O(p²) → O(m + p²/k),
computation per sweep: O(mp) → O(mp), where m = ‖X_t‖₀.
Algorithm
BigQUIC
Input: Samples Y, scalar λ, initial X_0.
For t = 0, 1, ...
1. Variable selection: select a free set of m ≪ p² variables.
2. Construct a partition by clustering.
3. Run block coordinate descent to find the descent direction: D_t = arg min_∆ f_{X_t}(X_t + ∆) over the set of free variables.
4. Line search: use an Armijo-rule based step-size selection to get α s.t. X_{t+1} = X_t + αD_t
   is positive definite,
   satisfies a sufficient decrease condition f(X_t + αD_t) ≤ f(X_t) + ασ∆_t.
   (Schur complement with conjugate gradient method.)
Convergence Analysis
Convergence Guarantees for QUIC
Theorem (Hsieh, Sustik, Dhillon & Ravikumar, NIPS, 2011)
QUIC converges quadratically to the global optimum; that is, for some constant 0 < κ < 1:

    lim_{t→∞} ‖X_{t+1} − X*‖_F / ‖X_t − X*‖_F² = κ.
Other convergence proofs for proximal Newton:

(Lee et al., 2012) discuss convergence for general loss functions and regularizers.

(Katyas et al., 2014) discuss the global convergence rate.
BigQUIC Convergence Analysis
Recall W = X⁻¹.

When each w_i is computed by CG (X w_i = e_i):

The gradient ∇_ij g(X) = S_ij − W_ij on the free set can be computed once and stored in memory.
The Hessian (w_i^T D w_j in the coordinate updates) needs to be repeatedly computed.

To reduce the time overhead, the Hessian should be computed approximately.

Theorem (Hsieh, Sustik, Dhillon, Ravikumar & Poldrack, 2013)

If ∇g(X) is computed exactly and ηI ⪯ H_t ⪯ η̄I for some constants η, η̄ > 0 at every Newton iteration, then BigQUIC converges to the global optimum.
BigQUIC Convergence Analysis
To obtain a super-linear convergence rate, we require the Hessian computation to become more and more accurate as X_t approaches X*.

Measure the accuracy of the Hessian computation by R = [r_1, ..., r_p], where r_i = X w_i − e_i.

The optimality of X_t can be measured by the minimum-norm subgradient:

    ∇S_ij f(X) = ∇_ij g(X) + sign(X_ij) λ                    if X_ij ≠ 0,
    ∇S_ij f(X) = sign(∇_ij g(X)) max(|∇_ij g(X)| − λ, 0)     if X_ij = 0.

Theorem (Hsieh, Sustik, Dhillon, Ravikumar & Poldrack, 2013)

In BigQUIC, if the residual matrix satisfies ‖R_t‖ = O(‖∇S f(X_t)‖^ℓ) for some 0 < ℓ ≤ 1 as t → ∞, then ‖X_{t+1} − X*‖ = O(‖X_t − X*‖^{1+ℓ}) as t → ∞.

When ‖R_t‖ = O(‖∇S f(X_t)‖), BigQUIC has an asymptotic quadratic convergence rate.
Experimental results (scalability)
(a) Scalability on random graph (b) Scalability on chain graph
Figure: BigQUIC can solve one-million-dimensional problems.
Experimental results
BigQUIC is faster even for medium-sized problems.

Figure: Comparison on fMRI data with a p = 20,000 subset (the maximum dimension that previous methods can handle).
Results on FMRI dataset
228,483 voxels, 518 time points.
λ = 0.6 ⟹ average degree 8; BigQUIC took 5 hours.

λ = 0.5 ⟹ average degree 38; BigQUIC took 21 hours.

Findings:

Voxels with large degree were generally found in the gray matter.
Meaningful brain modules can be detected by modularity clustering.
Quic & Dirty
A Quadratic Approximation Approach for Dirty Statistical Models
Dirty Statistical Models
Superposition-structured models or dirty models (Yang et al., 2013):

    min_{θ^(1),...,θ^(k)} L(∑_r θ^(r)) + ∑_r λ_r ‖θ^(r)‖_{A_r} =: F(θ).

An example: GMRF with latent features (Chandrasekaran et al., 2012):

    min_{S,L : L ⪰ 0, S−L ≻ 0} −log det(S − L) + ⟨S − L, Σ⟩ + λ_S ‖S‖₁ + λ_L ‖L‖_*.

When there are two groups of random variables X, Y, write Θ = [Θ_XX, Θ_XY; Θ_YX, Θ_YY]. If we only observe X and Y is the set of hidden variables, the marginal precision matrix is Θ_XX − Θ_XY Θ_YY⁻¹ Θ_YX, and therefore has a low rank + sparse structure.

Error bound: (Meng et al., 2014).

Other applications: Robust PCA (low rank + sparse), multitask learning (sparse + group sparse), ...
Quadratic Approximation
The Newton direction d:

    [d^(1), ..., d^(k)] = arg min_{∆^(1),...,∆^(k)} ḡ(θ + ∆) + ∑_{r=1}^k λ_r ‖θ^(r) + ∆^(r)‖_{A_r},

where

    ḡ(θ + ∆) = g(θ) + ∑_{r=1}^k ⟨∆^(r), G⟩ + (1/2) ∆^T 𝐇 ∆,

and 𝐇 := ∇²g(θ) is the k × k block matrix whose every block equals H := ∇²L(θ):

    𝐇 = [ H ⋯ H ; ⋮ ⋱ ⋮ ; H ⋯ H ].
Quadratic Approximation
A general active subspace selection for decomposable norms:

ℓ1 norm: variable selection.
Nuclear norm: subspace selection.
Group sparse norm: group selection.

Method for solving the quadratic approximation subproblem: alternating minimization.

Convergence analysis: the Hessian is not positive definite, yet we still have quadratic convergence.
Experimental Comparisons
(a) ER dataset (b) Leukemia dataset
Figure: Comparison on gene expression data (GMRF with latent variables).
Conclusions
How to scale the proximal Newton method to high-dimensional problems:

Active variable selection ⇒ the complexity scales with the intrinsic dimensionality.
Block coordinate descent ⇒ handles problems where # parameters > memory size.
(Big & Quic.)
Other techniques: divide-and-conquer, efficient line search...

BigQUIC: a memory-efficient quadratic approximation method for sparse inverse covariance estimation (scales to 1 million by 1 million problems).

With a careful design, we can apply the proximal Newton method to solve models with superposition-structured regularization (sparse + low rank; sparse + group sparse; low rank + group sparse; ...).
References
[1] C.-J. Hsieh, M. Sustik, I. S. Dhillon, P. Ravikumar, and R. Poldrack. BIG & QUIC: Sparse Inverse Covariance Estimation for a Million Variables. NIPS (oral presentation), 2013.
[2] C.-J. Hsieh, M. Sustik, I. S. Dhillon, and P. Ravikumar. Sparse Inverse Covariance Matrix Estimation Using Quadratic Approximation. NIPS, 2011.
[3] C.-J. Hsieh, I. S. Dhillon, P. Ravikumar, and A. Banerjee. A Divide-and-Conquer Procedure for Sparse Inverse Covariance Estimation. NIPS, 2012.
[4] P. A. Olsen, F. Oztoprak, J. Nocedal, and S. J. Rennie. Newton-Like Methods for Sparse Inverse Covariance Estimation. Optimization Online, 2012.
[5] O. Banerjee, L. El Ghaoui, and A. d'Aspremont. Model Selection Through Sparse Maximum Likelihood Estimation for Multivariate Gaussian or Binary Data. JMLR, 2008.
[6] J. Friedman, T. Hastie, and R. Tibshirani. Sparse Inverse Covariance Estimation with the Graphical Lasso. Biostatistics, 2008.
[7] L. Li and K.-C. Toh. An Inexact Interior Point Method for L1-Regularized Sparse Covariance Selection. Mathematical Programming Computation, 2010.
[8] K. Scheinberg, S. Ma, and D. Goldfarb. Sparse Inverse Covariance Selection via Alternating Linearization Methods. NIPS, 2010.
[9] K. Scheinberg and I. Rish. Learning Sparse Gaussian Markov Networks Using a Greedy Coordinate Ascent Approach. Machine Learning and Knowledge Discovery in Databases, 2010.