
Problems in Sparse Multivariate Statistics with a Discrete Optimization Lens

Rahul Mazumder

Massachusetts Institute of Technology

(Joint with D. Bertsimas, M. Copenhaver, A. King, P. Radchenko, H. Qin, J. Goetz, K. Khamaru)

August, 2016


Motivation

I Several basic statistical estimation tasks are inherently discrete

I Often dismissed as computationally infeasible

I We often “relax” the hard problems:
– Convex (continuous) optimization plays a key role (e.g. Lasso)
– They work very well in many cases...

I However, this often leads to a compromise in statistical performance

I Question: Can we use advances in discrete optimization to globally solve nonconvex problems?



Motivation

I We seldom know a priori which method will work for a given application

I “...A statistician’s toolkit should have a whole array of methods, to experiment with...”

— Jerome H. Friedman

I Use tools from mathematical optimization (discrete & convex optimization) to devise estimators:

I that are flexible
I that have a disciplined computational framework:

– Obtain almost optimal solutions in seconds/minutes
– Certify optimality in minutes/hours



Outline

I Best Subset Selection in Regression [Mallows ’66, Miller ’90]

— Least Squares Variable Selection
— Discrete Dantzig Selector
— Grouped Variable Selection and Sparse Additive Models

I Robust Linear Regression [Rousseeuw ’83]

— Least Median of Squares Regression

I Low rank Factor Analysis [Spearman ’04]

— Least Squares Factor Analysis
— Maximum Likelihood Factor Analysis




Best Subset Regression: Statement

[Bertsimas, King, M., ’16, Annals of Statistics]

I Usual linear regression model: n samples, p regressors

I Want a sparse β with good data-fidelity:

minβ

1

2‖y −Xβ‖22 s.t. ‖β‖0 ≤ k, (?)

[Miller ’90; Foster & George ’94; George ’00 ]

I Problem (⋆) is NP-hard [Natarajan ’95].

I The R package leaps can handle n ≥ p with p ≤ 31 (the “leaps and bounds” algorithm of Furnival & Wilson ’74).

I Not surprisingly, one is advised to stay away from Problem (⋆).


Best Subset Regression: Current Approaches & Limitations

I The Lasso (ℓ1) [Tibshirani ’96, Chen & Donoho ’98] is a very popular and effective proxy:

min_β (1/2)‖y − Xβ‖₂² + λ‖β‖₁

I Computation: convex optimization, fast & scalable

I ℓ1 ⟹ good models, under assumptions (which are difficult to verify)

I ℓ1 ⇏ reliable sparse solutions, and ℓ1 solutions ≠ ℓ0 solutions.

[Buhlmann, Van de Geer ’11; Cai, Shen ’11; Zhang, Jiang ’08....]


Shortcomings of the Lasso: a simple explanation

I In the presence of correlated variables, to obtain a model with good predictive power, the Lasso brings in a large number of nonzero coefficients.

I The Lasso leads to biased estimates—the ℓ1-norm penalizes large and small coefficients uniformly.

I Upon increasing the degree of regularization, the Lasso sets more coefficients to zero—leaving out true predictors from the active set.


Best Subset Regression: ℓ1 vs ℓ0

I If β denotes the best subset solution, then for any (fixed) X,

sup_{‖β∗‖₀≤k} (1/n) E‖Xβ − Xβ∗‖₂²  ≲  σ² k log p / n,

I If βℓ1 denotes a Lasso-based k-sparse estimator, then ∃ X:

(1/γ²) σ² k^(1−δ) log p / n  ≲  sup_{‖β∗‖₀≤k} (1/n) E‖Xβℓ1 − Xβ∗‖₂²  ≲  (1/γ²) σ² k log p / n,

I There is a significant gap between ℓ0- and ℓ1-type solutions.

[Bunea et al. ’07; Raskutti et al. ’09; Zhang et al. ’14]


Best Subset Regression: Current Approaches & Limitations

I To circumvent shortcomings, alternatives exist

I Non-convex penalties / greedy methods [Fan, Li ’01; Zou ’06; Zou, Li ’08; Zhang ’10; Mazumder et al. ’11; Zhang, Zhang ’12; Loh, Wainwright ’14]

I Problems are non-convex and hard to solve.

I Computational approaches are mostly heuristic: they cannot certify/prove global optimality for an arbitrary dataset. Exception: [Liu, Yao, Li ’16]


Best Subset Regression: Our approach

[Bertsimas, King, M., ’16, Annals of Statistics]

I Certifiably solve: min_β (1/2)‖y − Xβ‖₂² s.t. ‖β‖₀ ≤ k

I Main workhorses:

Tools from different branches of Optimization:

I Modern Technology of Mixed Integer Optimization (MIO)

I Discrete First Order methods (motivated by convex continuous optimization)


Best Subset Regression: Our approach

I Consider min_β (1/2)‖y − Xβ‖₂² s.t. ‖β‖₀ ≤ k

I Express as Mixed Integer Optimization problem (MIO)

I Discrete First Order methods for advanced warm-starts

I Enhancing MIO: Stronger Formulations



Brief Background on MIO


Mixed Integer Optimization (MIO)

I MIO: a particular class of discrete optimization problems

I The general form of a Mixed Integer Quadratic Optimization problem:

min α′Qα + α′a
s.t. Aα ≤ b
αi ∈ {0, 1}, ∀ i ∈ I
αj ∈ ℝ₊, ∀ j ∉ I,

where a ∈ ℝ^m, A ∈ ℝ^{k×m}, b ∈ ℝ^k and Q ∈ ℝ^{m×m} (PSD) are problem parameters.

I Special instances: Mixed Integer Linear Optimization, Quadratic/Linear Programming, ...


Mixed Integer Optimization (MIO)

I MIO methods employ a combination of branch and bound, branch and cut, cutting plane methods, ... (not complete enumeration)

I Foundations deeply rooted in polyhedral theory, combinatorics, discrete geometry/algebra, ...

I Worst case: NP-hard. Our focus is not worst-case analysis. (cf. the Simplex Algorithm, path algorithms like LARS, TSP, ...)

I Modern MIO is tractable (in practice).
Tractability: the ability to solve problems of realistic size in times that are appropriate for the applications we consider.
(Successful applications: production planning, transportation, inventory management, air-traffic control, warehouse location, matching assignments, ...)




Progress of MIO

I Algorithms and software have undergone huge improvements over the past 25+ years (1991–2016).

I Algorithm speed-up: ∼1.4 million times (combined speedup: CPLEX 1.2 to 11 & Gurobi 1.0 to 6.5)

I Hardware speed-up: ∼1.6 million times (peak supercomputer performance)

I Total speed-up: 2.2 trillion times!

I Commercial packages: Xpress, Gurobi, CPLEX, ...

Non-commercial packages: GLPK, lpsolve, CBC, SCIP, ...

Interfaces: Matlab, R, Python, Julia (JuMP)


Back to Formulation


Vanilla MIO formulation

For the problem min_β (1/2)‖y − Xβ‖₂² s.t. ‖β‖₀ ≤ k,

a simple (natural) MIO formulation is given by

min_{β,z} (1/2)‖y − Xβ‖₂²
s.t. |βi| ≤ M · zi, i = 1, . . . , p
∑_{i=1}^p zi ≤ k
zi ∈ {0, 1}, i = 1, . . . , p,

where M (“Big-M”) is a parameter:
— M ≥ ‖β‖∞ for an optimal solution β
— M controls the strength of the MIO formulation

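To make the Big-M formulation concrete, here is a minimal sketch using the gurobipy interface to Gurobi (one of the commercial solvers mentioned later in the talk). The function and parameter names (best_subset_mio, bigM, time_limit) are placeholders, and choosing M large enough to bound an optimal β (e.g., from a warm start) is an assumption of this sketch, not a prescription from the talk.

```python
import numpy as np
import gurobipy as gp
from gurobipy import GRB

def best_subset_mio(X, y, k, bigM, time_limit=300):
    """Big-M MIO sketch of: min 0.5*||y - X beta||^2  s.t.  ||beta||_0 <= k."""
    n, p = X.shape
    m = gp.Model("best_subset")
    m.Params.TimeLimit = time_limit

    beta = m.addVars(p, lb=-bigM, ub=bigM, name="beta")
    z = m.addVars(p, vtype=GRB.BINARY, name="z")

    for i in range(p):
        # |beta_i| <= M * z_i: z_i = 0 forces beta_i = 0
        m.addConstr(beta[i] <= bigM * z[i])
        m.addConstr(-beta[i] <= bigM * z[i])

    # at most k nonzero coefficients
    m.addConstr(gp.quicksum(z[i] for i in range(p)) <= k)

    # objective: 0.5 * ||y - X beta||_2^2
    resid = [y[j] - gp.quicksum(X[j, i] * beta[i] for i in range(p)) for j in range(n)]
    m.setObjective(0.5 * gp.quicksum(res * res for res in resid), GRB.MINIMIZE)

    m.optimize()
    return np.array([beta[i].X for i in range(p)]), m.MIPGap
```

A warm start (for example, from the discrete first-order method described later) can be supplied by setting each variable's Start attribute before calling optimize().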

Diabetes Dataset, n = 350, p = 64, k = 6

Typical behavior of Overall Algorithm

[Figure: two panels for k = 6. Left: objective value vs. time (secs), showing upper bounds, lower bounds, and the global minimum. Right: MIO gap vs. time (secs).]


Our approach

I Consider min_β (1/2)‖y − Xβ‖₂² s.t. ‖β‖₀ ≤ k

I Express best-subset as a Mixed Integer Optimization problem (MIO)

I Discrete First Order methods for advanced warm-starts

I Enhancing MIO: Stronger Formulations


Discrete First Order Method

I Stylized gradient-based method for

min_β g(β) s.t. ‖β‖₀ ≤ k,

I where g(β) is convex and ‖∇g(β) − ∇g(β0)‖ ≤ ℓ · ‖β − β0‖.

I This implies that for all L ≥ ℓ,

g(β) ≤ Q(β) = g(β0) + ⟨∇g(β0), β − β0⟩ + (L/2)‖β − β0‖₂²

I For the purpose of finding feasible solutions, we propose solving

min_β Q(β) s.t. ‖β‖₀ ≤ k

[Related work: Blumensath, Davies ’08; Donoho, Johnstone ’95]


Solution

I Equivalent to

min_β (L/2) ‖β − (β0 − (1/L)∇g(β0))‖₂² s.t. ‖β‖₀ ≤ k

I Reducing to (with u = β0 − (1/L)∇g(β0))

min_β ‖β − u‖₂² s.t. ‖β‖₀ ≤ k

I The optimal solution is β∗ ∈ Hk(u), where Hk(u) is the hard-thresholding operator (it retains the top k entries of u in absolute value). [Donoho & Johnstone ’95]


Discrete First Order Algorithm (DFA)

Algorithm to get feasible solutions for:

minβ

g(β) s.t. ‖β‖0 ≤ k.

1. Initialize with a solution β0; m = 0.

2. m := m + 1.

3. βm+1 ∈ Hk(βm − (1/L)∇g(βm)).

4. Perform a line search to get βm+1.

5. Repeat Steps 2–4 until ‖βm+1 − βm‖ ≤ ε.

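For the least-squares loss g(β) = (1/2)‖y − Xβ‖₂², the gradient is ∇g(β) = −X′(y − Xβ) and one valid Lipschitz constant is ℓ = λmax(X′X). Below is a minimal NumPy sketch of the iteration above; the line-search step is omitted for brevity, and the tolerance and iteration cap are illustrative choices, not values from the talk.

```python
import numpy as np

def hard_threshold(u, k):
    """H_k(u): keep the k largest entries of u in absolute value, zero out the rest."""
    beta = np.zeros_like(u)
    keep = np.argsort(np.abs(u))[-k:]
    beta[keep] = u[keep]
    return beta

def discrete_first_order(X, y, k, max_iter=1000, tol=1e-6):
    """Iterate beta <- H_k(beta - (1/L) grad g(beta)) for g(beta) = 0.5*||y - X beta||^2."""
    L = np.linalg.eigvalsh(X.T @ X).max()      # Lipschitz constant of the gradient
    beta = np.zeros(X.shape[1])
    for _ in range(max_iter):
        grad = -X.T @ (y - X @ beta)
        beta_new = hard_threshold(beta - grad / L, k)
        if np.linalg.norm(beta_new - beta) <= tol:
            return beta_new
        beta = beta_new
    return beta
```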

Convergence properties

Theorem. (Bertsimas, King, M. ’16)

Let βm, m ≥ 1, be generated by the DFA:

(a) For any L ≥ ℓ, the sequence g(βm) is decreasing and converges.

(b) If L > ℓ, then under some minor regularity conditions:

I ‖βm+1 − βm‖₂² ≤ ε in at most O(1/ε) iterations.

I Supp(βm) stabilizes after finitely many iterations and βm converges to a first-order stationary point.


Our approach

I Consider min_β (1/2)‖y − Xβ‖₂² s.t. ‖β‖₀ ≤ k

I Express best-subset as a Mixed Integer Optimization problem (MIO)

I Discrete First Order methods for advanced warm-starts

I Enhancing MIO: Stronger Formulations


Special Ordered Sets-formulation

min_β ‖y − Xβ‖₂² s.t. ‖β‖₀ ≤ k

is equivalent to

min_{β,z} ‖y − Xβ‖₂²
s.t. (βi, 1 − zi) : SOS type-1, i = 1, . . . , p
∑_{i=1}^p zi ≤ k
zi ∈ {0, 1}, i = 1, . . . , p.

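Solvers such as Gurobi accept SOS type-1 constraints directly, which removes the need to pick a Big-M value. A hedged sketch of the change relative to the Big-M model above, reusing the hypothetical names m, beta, z, and p from that sketch:

```python
from gurobipy import GRB

# Replace the Big-M constraints |beta_i| <= M*z_i with SOS-1 constraints on (beta_i, 1 - z_i).
# addSOS takes variables, so introduce w_i = 1 - z_i explicitly; beta_i then needs no Big-M bound.
w = m.addVars(p, lb=0.0, ub=1.0, name="w")
for i in range(p):
    m.addConstr(w[i] == 1 - z[i])
    # SOS type-1: at most one of beta_i and w_i can be nonzero,
    # so w_i > 0 (i.e. z_i = 0) forces beta_i = 0.
    m.addSOS(GRB.SOS_TYPE1, [beta[i], w[i]], [1, 2])
```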

Implied Constraints

min_β ‖y − Xβ‖₂² s.t. ‖β‖₀ ≤ k

is equivalent to

min_β ‖y − Xβ‖₂²
s.t. ‖β‖₀ ≤ k
‖β‖∞ ≤ δ11, ‖β‖₁ ≤ δ21
‖Xβ‖∞ ≤ δ12, ‖Xβ‖₁ ≤ δ22

for constants δ11, δ12, δ21, δ22 (which can be computed from the data).

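These constraints are linear, so they can be appended to either MIO formulation to tighten its continuous relaxation. A short sketch, again reusing the hypothetical gurobipy model m, variables beta, and dimension p from the earlier sketch; delta11 (bound on ‖β‖∞) and delta21 (bound on ‖β‖₁) are assumed to be computed beforehand, and the analogous ‖Xβ‖ constraints are omitted:

```python
import gurobipy as gp

# ||beta||_inf <= delta11 and ||beta||_1 <= delta21 via auxiliary variables t_i >= |beta_i|
t = m.addVars(p, lb=0.0, name="t")
for i in range(p):
    m.addConstr(beta[i] <= delta11)
    m.addConstr(-beta[i] <= delta11)
    m.addConstr(t[i] >= beta[i])
    m.addConstr(t[i] >= -beta[i])
m.addConstr(gp.quicksum(t[i] for i in range(p)) <= delta21)
```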

Behavior with user-guided intelligence

Diabetes data: n = 350, p = 64.

[Figure: log(MIO gap) vs. time (secs) for k = 5 (0–80 secs) and k = 31 (0–3500 secs), comparing warm starts with cold starts.]


Statistical Behavior


Sparsity Detection for n = 500, p = 100

[Figure: number of nonzeros selected vs. signal-to-noise ratio (1.742, 3.484, 6.967) for MIO, Lasso, Step, and Sparsenet.]


Prediction Error = ‖Xβ_alg − Xβ_true‖₂² / ‖Xβ_true‖₂²

[Figure: prediction error vs. signal-to-noise ratio (1.742, 3.484, 6.967) for MIO, Lasso, Step, and Sparsenet.]


Sparsity Detection for n = 50, p = 2000

[Figure: number of nonzeros selected vs. signal-to-noise ratio (3, 7, 10) for Lasso, First Order + MIO, First Order Only, and Sparsenet.]


Prediction Error for n = 50, p = 2000

[Figure: prediction error vs. signal-to-noise ratio (3, 7, 10) for Lasso, First Order + MIO, First Order Only, and Sparsenet.]


What did we learn?

I For the case n > p, MIO + intelligence finds provably optimal solutions for n in the 500s and p in the 100s in minutes.

I For the case n < p, MIO + intelligence finds solutions for n in the 50s and p in the 1000s in minutes, and proves (approximate) optimality in hours.

I MIO solutions have a significant edge in sparsity and improved prediction accuracy.

I Modern optimization (MIO + user-guided intelligence) is capable of tackling large instances.


Outline

I Best Subset Selection in Regression [Mallows ’66, Miller ’90]

— Least Squares Variable Selection
— Discrete Dantzig Selector
— Grouped Variable Selection and Sparse Additive Models

I Robust Linear Regression [Rousseeuw ’83]

— Least Median of Squares Regression

I Low rank Factor Analysis [Spearman ’04]

— Least Squares Factor Analysis
— Maximum Likelihood Factor Analysis


The Discrete Dantzig Selector

[M. & Radchenko ’16+]

I The Dantzig Selector [Candes, Tao ’07]:

β̂DS,ℓ1 ∈ argmin_β ‖β‖₁ s.t. ‖X′(y − Xβ)‖∞ ≤ δ

I Instead, consider its ℓ0 analogue:

β̂DS,ℓ0 ∈ argmin_β ‖β‖₀ s.t. ‖X′(y − Xβ)‖∞ ≤ δ

I Find the sparsest β such that the maximal (absolute) correlation between covariates and residuals is small.

I Why is this important?
– The formulation is a Mixed Integer Linear Optimization problem.
– Mixed Integer Linear Optimization is a more mature technology than Mixed Integer Quadratic Optimization.

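Since the constraint ‖X′(y − Xβ)‖∞ ≤ δ is linear in β and the ℓ0 objective can be modeled with binary indicators, the whole problem can be written as a mixed integer linear program. A minimal Big-M sketch in gurobipy follows; delta and bigM are user-supplied assumptions, and the paper may use a different, stronger formulation.

```python
import numpy as np
import gurobipy as gp
from gurobipy import GRB

def discrete_dantzig(X, y, delta, bigM):
    """MILP sketch of: min ||beta||_0  s.t.  ||X'(y - X beta)||_inf <= delta."""
    n, p = X.shape
    Xty, XtX = X.T @ y, X.T @ X
    m = gp.Model("discrete_dantzig")
    beta = m.addVars(p, lb=-bigM, ub=bigM, name="beta")
    z = m.addVars(p, vtype=GRB.BINARY, name="z")
    for i in range(p):
        # z_i = 0 forces beta_i = 0
        m.addConstr(beta[i] <= bigM * z[i])
        m.addConstr(-beta[i] <= bigM * z[i])
    for j in range(p):
        # -delta <= x_j'(y - X beta) <= delta
        corr = Xty[j] - gp.quicksum(XtX[j, l] * beta[l] for l in range(p))
        m.addConstr(corr <= delta)
        m.addConstr(corr >= -delta)
    m.setObjective(gp.quicksum(z[i] for i in range(p)), GRB.MINIMIZE)
    m.optimize()
    return np.array([beta[i].X for i in range(p)])
```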


The Discrete Dantzig Selector

Under a sparse linear model with Gaussian errors, y = Xβ∗ + ε:

I The errors

– ‖β̂DS,ℓ0 − β∗‖₂²
– ‖β̂DS,ℓ0 − β∗‖₁²
– ‖X(β̂DS,ℓ0 − β∗)‖₂²

are much smaller than those of the convex estimator β̂DS,ℓ1 (when the features are correlated)

I # nonzeros in β̂DS,ℓ0 ≪ # nonzeros in β̂DS,ℓ1

I Statistical properties of β̂DS,ℓ0 are comparable with Least Squares Subset Selection



Some Large Problems

(Synthetic Examples)

n       p       k∗   Upper Bound   Lower Bound   MIO Gap   Time to Prove Opt.
4,000   8,000   20   20            20            0         41.9
3,000   8,000   20   20            20            0         18.3
1,000   10,000  10   10            10            0         14.2
5,000   10,000  10   10            10            0         2.5
10,000  10,000  30   30            27            10%       42.5

(Real Data Examples)

n       p       k∗   Upper Bound   Lower Bound   MIO Gap   Time to Prove Opt.
6,000   4,500   20   20            20            0         5.0
6,000   4,500   40   40            37            10%       12.5

Table: Solutions obtained within 5–10 minutes for all problems. Certifying optimality takes longer (several hours).


Outline

I Best Subset Selection in Regression [Mallows ’66, Miller ’90]

— Least Squares Variable Selection
— Discrete Dantzig Selector
— Grouped Variable Selection and Sparse Additive Models

I Robust Linear Regression [Rousseeuw ’83]

— Least Median of Squares Regression

I Low rank Factor Analysis [Spearman ’04]

— Least Squares Factor Analysis
— Maximum Likelihood Factor Analysis


Effect of Outliers in Regression

[Bertsimas, M., ’14, Annals of Statistics]

I The Least Squares (LS) estimator

β(LS) ∈ argmin_β ∑_{i=1}^n ri², ri = yi − xi′β,

has a breakdown point of zero (Donoho & Huber ’83; Hampel ’75).

I The Least Absolute Deviation (LAD) estimator also has a breakdown point of zero:

β(LAD) ∈ argmin_β ∑_{i=1}^n |ri|

I M-Estimators (Huber ’73), which minimize ∑_{i=1}^n ρ(ri) for a symmetric function ρ(·), slightly improve the breakdown point.


Least Median Regression

I The Least Median of Squares (LMS) estimator [Rousseeuw ’84]

β(LMS) ∈ argmin_β ( median_{i=1,...,n} |ri| ).

I LMS has the highest possible breakdown point of almost 50%.

I More generally, the Least Quantile of Squares (LQS) estimator:

β(LQS) ∈ argmin_β |r(q)|,

where r(q) is the qth ordered absolute residual:

|r(1)| ≤ |r(2)| ≤ . . . ≤ |r(n)|.


Problem we address

I Solve the following problem:

min_β |r(q)|,

where ri = yi − xi′β and q is a quantile.

I Our approach extends to

min_β |r(q)| s.t. Aβ ≤ b (and/or ‖β‖₂² ≤ δ)


LQS and Subset Selection: A surprising link

I LQS and subset selection in regression seem to be completely unrelated concepts...

I However, a curious link emerges...

I Claim: LQS performs an implicit subset search

Theorem [Bertsimas & M. ’14]: The LQS problem is equivalent to the following:

min_β |r(q)| = min_{I∈Ωq} ( min_β ‖yI − XI β‖∞ ),

where Ωq := {I : I ⊂ {1, . . . , n}, |I| = q} and (yI, XI) denotes the subsample {(yi, xi), i ∈ I}.




Overview of our approach

I Write the LMS problem as a MIO.
— Main idea: the MIO formulation models sorting in order to express |r(q)|
— The formulation is very different from best subset selection in regression

I Using Discrete First Order methods we find good feasible solutions.

I Warm-starts and improved behavior with user-guided intelligence



MIO Formulation

Notation: |r(1)| ≤ |r(2)| ≤ . . . ≤ |r(n)|.

Step 1: Introduce binary variables zi, i = 1, . . . , n, such that:

zi = 1 if |ri| ≤ |r(q)|, and zi = 0 otherwise.

Step 2: Use auxiliary continuous variables μ̄i ≥ 0 and μ̲i ≥ 0 such that:

|ri| − μ̲i ≤ |r(q)| ≤ |ri| + μ̄i, i = 1, . . . , n,

with the conditions:

If |ri| ≥ |r(q)|, then μ̄i = 0, μ̲i ≥ 0,

and if |ri| ≤ |r(q)|, then μ̲i = 0, μ̄i ≥ 0.

These conditions are MIO representable.


MIO Formulation

min γ

s.t. |ri| + μ̄i ≥ γ, i = 1, . . . , n
γ ≥ |ri| − μ̲i, i = 1, . . . , n
Mu · zi ≥ μ̄i, i = 1, . . . , n
Mℓ · (1 − zi) ≥ μ̲i, i = 1, . . . , n
∑_{i=1}^n zi = q
μ̄i ≥ 0, μ̲i ≥ 0, i = 1, . . . , n
zi ∈ {0, 1}, i = 1, . . . , n,

where γ, zi, μ̄i, μ̲i, i = 1, . . . , n, are decision variables and Mu, Mℓ are Big-M constants.

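A minimal gurobipy sketch of this formulation follows. The Big-M constants Mu and Ml are assumed to be supplied (they must upper-bound the relevant slacks at an optimal solution), and Gurobi's general absolute-value constraint is used to model |ri|; the original paper may handle the absolute values differently.

```python
import numpy as np
import gurobipy as gp
from gurobipy import GRB

def lqs_mio(X, y, q, Mu, Ml):
    """MIO sketch for min_beta |r_(q)|, with r_i = y_i - x_i' beta."""
    n, p = X.shape
    m = gp.Model("lqs")
    beta = m.addVars(p, lb=-GRB.INFINITY, name="beta")
    r = m.addVars(n, lb=-GRB.INFINITY, name="r")      # residuals r_i
    a = m.addVars(n, lb=0.0, name="abs_r")            # a_i = |r_i|
    mu_bar = m.addVars(n, lb=0.0, name="mu_bar")
    mu_und = m.addVars(n, lb=0.0, name="mu_under")
    z = m.addVars(n, vtype=GRB.BINARY, name="z")
    gamma = m.addVar(lb=0.0, name="gamma")            # models |r_(q)|

    for i in range(n):
        m.addConstr(r[i] == y[i] - gp.quicksum(X[i, j] * beta[j] for j in range(p)))
        m.addGenConstrAbs(a[i], r[i])                 # a_i = |r_i|
        m.addConstr(a[i] + mu_bar[i] >= gamma)
        m.addConstr(gamma >= a[i] - mu_und[i])
        m.addConstr(mu_bar[i] <= Mu * z[i])           # z_i = 0  =>  mu_bar_i = 0
        m.addConstr(mu_und[i] <= Ml * (1 - z[i]))     # z_i = 1  =>  mu_under_i = 0
    m.addConstr(gp.quicksum(z[i] for i in range(n)) == q)

    m.setObjective(gamma, GRB.MINIMIZE)
    m.optimize()
    return np.array([beta[j].X for j in range(p)]), gamma.X
```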

What do we achieve?

I Prior exact algorithms can solve up to n = 50 and p = 5.

I We obtain:

I near-optimal solutions for problems with n in the 200s and p in the 20s in seconds, proving optimality in minutes.

I near-optimal solutions for problems with n ≈ 10,000 and p ≈ 50 in minutes.


Outline

I Best Subset Selection in Regression [Mallows ’66, Miller ’90]

— Least Squares Variable Selection
— Discrete Dantzig Selector
— Grouped Variable Selection and Sparse Additive Models

I Robust Linear Regression [Rousseeuw ’83]

— Least Median of Squares Regression

I Low rank Factor Analysis [Spearman ’04]

— Least Squares Factor Analysis
— Maximum Likelihood Factor Analysis


Background & Formulation

[Bertsimas, Copenhaver, M., ’16+]

Low Rank Factor Analysis (FA) [Spearman 1904]:

I widely used in multivariate statistics, econometrics, psychometrics

I represents the correlation structure with a few common (latent) factors.

Estimation Problem: Σ = L1L1′ + L2L2′, where Θ := L1L1′ is low rank and L2L2′ is small:

− Σ ≈ Θ + Φ

− Φ = diag(Φ1, . . . , Φp) ⪰ 0

− rank(Θ) ≤ r, Θ ⪰ 0

− Σ − Θ ⪰ 0; Σ − Φ ⪰ 0



This yields the estimation problem:

min ‖Σ − (Θ + Φ)‖
s.t. rank(Θ) ≤ r
Σ − Φ ⪰ 0

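For intuition, here is a minimal NumPy sketch of a simple alternating heuristic for the least-squares version of this decomposition: fix Φ and take the best rank-r PSD approximation of Σ − Φ, then refit the nonnegative diagonal. It is only a local heuristic for producing feasible (Θ, Φ) pairs; it does not enforce the Σ − Θ ⪰ 0 / Σ − Φ ⪰ 0 side constraints, and it is not the certifiable approach described on the next slides.

```python
import numpy as np

def factor_analysis_heuristic(Sigma, r, n_iter=100):
    """Alternate between a rank-r PSD Theta and a nonnegative diagonal Phi
    to locally reduce ||Sigma - (Theta + Phi)||_F."""
    p = Sigma.shape[0]
    Phi = np.zeros(p)
    for _ in range(n_iter):
        # best rank-r PSD approximation of Sigma - diag(Phi)
        vals, vecs = np.linalg.eigh(Sigma - np.diag(Phi))   # eigenvalues in ascending order
        V, lam = vecs[:, -r:], np.clip(vals[-r:], 0.0, None)
        Theta = (V * lam) @ V.T
        # given Theta, refit the diagonal (kept nonnegative)
        Phi = np.clip(np.diag(Sigma - Theta), 0.0, None)
    return Theta, Phi
```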



Our Approach

min ‖Σ − (Θ + Φ)‖   ← sum of singular values
s.t. rank(Θ) ≤ r   ← rank constraint
Σ − Θ ⪰ 0   ← semidefinite constraint    (†)

I An SDP with rank constraints

I Key Idea: reformulate (†) equivalently as an SDP (without the rank constraint)

I Nonlinear Optimization techniques for feasible solutions

I Specialized Branch & Bound methods to certify optimality



Reformulation and tailored B&B

min ‖Σ − (Θ + Φ)‖
s.t. rank(Θ) ≤ r
Σ − Θ ⪰ 0

⇕ (variational representation of spectral functions)

min ⟨W, Σ − Θ⟩ − ∑_{i=1}^p wii Φi   ← bilinear form (nonconvex): McCormick hulls / B&B
s.t. I ⪰ W ⪰ 0
Tr(W) = p − r
Σ − Θ ⪰ 0


What do we learn?

I Several experiments on both real and synthetic datasets reveal:

I Upper bounds are obtained within a few seconds (p = 100) to several minutes (p = 4000)

I Certifying optimality takes longer (several hours)

I Global optimality certificates are obtained on datasets where the assumptions required for the convex approach to succeed cannot be verified.

ArXiv link: http://arxiv.org/pdf/1604.06837v1.pdf


Conclusions

I MIO is an advanced, computationally tractable mathematical programming framework

I Provides a powerful modeling tool for statistical problems

I Leads to a significant edge in Sparse Learning problems that are inherently discrete.

I 15.097: PhD class taught at MIT Spring 2016 on related topics.


Thank you!

All papers available at: http://www.mit.edu/~rahulmaz/research.html
