
TGDR: An Introduction

Julian Wolfson

Student Seminar

March 28, 2007


1 Variable Selection

2 Penalization, Solution Paths and TGDR

3 Applying TGDR

4 Extensions

5 Final Thoughts


Some motivating examples

We are interested in identifying which covariates from a set X = {X1, . . . , Xp} best predict an outcome Y measured on n individuals, where p >> n. For example:

Y is blood pressure at age 50, X is a set of answers from a lengthy Food Frequency Questionnaire

Y is an indicator of volcano activity, X is a set of geological measurements in the vicinity of the volcano

Y is a survival endpoint (T, C) representing time to acquisition of HIV drug resistance, X is a portion of the viral genome


For the last example, which we will pursue, a typical dataset might have n = 300 individuals with amino acid sequences of length 500.

500 sites × 21 possible AAs per site ≈ 10000 covariates.


The Problem

When p >> n, standard regression approaches yield estimates with huge variance and poor predictive ability

Cox regression typically fails with even modestly large numbers ofcovariates (≈ 100)

Standard approaches typically force small/no bias of the parameter estimates, and so do not “trade off” bias and variance.

MSE = Var + Bias²

Idea: Accept some bias in exchange for more stable estimates with better predictive power (see the sketch below)

Select a subset of variables which “best” predicts the outcome

Use the available data to estimate their relative importance
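To see the tradeoff numerically, here is a minimal simulation sketch (illustrative, not from the talk; the setup and all names are my own): with p close to n, a deliberately biased shrinkage estimator can beat unbiased least squares in estimation MSE.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, sigma = 50, 40, 1.0
beta_true = np.zeros(p)
beta_true[:5] = 1.0                     # only a few covariates matter

def estimation_mse(lam, reps=200):
    """Monte Carlo E||beta_hat - beta||^2 for ridge (lam = 0 gives OLS)."""
    total = 0.0
    for _ in range(reps):
        X = rng.normal(size=(n, p))
        y = X @ beta_true + sigma * rng.normal(size=n)
        beta_hat = np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)
        total += np.sum((beta_hat - beta_true) ** 2)
    return total / reps

for lam in [0.0, 1.0, 10.0, 100.0]:
    print(f"lambda = {lam:6.1f}   MSE = {estimation_mse(lam):.2f}")
```

In this setup the shrunken (biased) fits typically show markedly smaller MSE than the unbiased λ = 0 fit.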


Loss functions

Estimation is based on a loss function L (a short code sketch follows this list):

Squared-error loss (linear regression):

L = ∑(Yi − Xiβ)²

Negative Log-likelihood (many contexts):

L = −ℓ(β; X)

Negative Log partial likelihood (Cox regression):

L = −ℓp(β; X)
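In code, the first of these losses and its gradient (a minimal sketch; function names are illustrative) look like:

```python
import numpy as np

def squared_error_loss(beta, X, Y):
    """L = sum_i (Y_i - X_i beta)^2 for linear regression."""
    resid = Y - X @ beta
    return resid @ resid

def squared_error_grad(beta, X, Y):
    """dL/dbeta = -2 X^T (Y - X beta); TGDR will need this later."""
    return -2.0 * X.T @ (Y - X @ beta)
```

For the likelihood-based losses one substitutes −ℓ (or −ℓp) and its gradient, the negative score.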


Penalization

Common way to trade off bias and variance: penalize the loss function L via P(β)

Yields modified loss L∗.

Two common penalties:

1 P(β) = ∑βi² (Ridge regression)

2 P(β) = ∑|βi| (LASSO)

Examples

Linear regression, ridge penalty:

L∗ = ∑(Yi − Xiβ)² + λ∑βi²

Cox regression, LASSO penalty:

L∗ = −ℓp(β; X) + λ∑|βi|
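A minimal sketch of the two penalized objectives above (again illustrative names, not from the talk):

```python
import numpy as np

def ridge_objective(beta, X, Y, lam):
    """L* = sum (Y_i - X_i beta)^2 + lam * sum beta_i^2."""
    resid = Y - X @ beta
    return resid @ resid + lam * (beta @ beta)

def lasso_penalty(beta, lam):
    """lam * sum |beta_i|; added to e.g. the negative log partial likelihood."""
    return lam * np.abs(beta).sum()
```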


We seek

β̂ = arg minβ L∗ ≡ arg minβ [L + λP(β)]

Constrained optimization problem (equivalent to “arg minβ L subject to P(β) ≤ s” for some bound s corresponding to λ)

λ controls how much the estimates are penalized

It also indexes a one-dimensional path through the parameter space

“Optimal” λ usually chosen via cross-validation (sketched below)
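A sketch of that cross-validation step for the ridge case (the fold scheme and all names are illustrative assumptions, not from the talk):

```python
import numpy as np

def ridge_fit(X, Y, lam):
    """Closed-form ridge estimate arg min L*."""
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ Y)

def cv_error(X, Y, lam, K=5, seed=0):
    """Average held-out squared error over K folds."""
    n = X.shape[0]
    folds = np.array_split(np.random.default_rng(seed).permutation(n), K)
    err = 0.0
    for test in folds:
        train = np.setdiff1d(np.arange(n), test)
        beta = ridge_fit(X[train], Y[train], lam)
        err += np.sum((Y[test] - X[test] @ beta) ** 2)
    return err / n

# "optimal" lambda on a grid:
# lam_best = min(lam_grid, key=lambda lam: cv_error(X, Y, lam))
```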


Solution Paths


Problems of Penalization?

Choice of penalty P(β) defines a set of possible paths, but what if none of these paths passes near the true parameter value?

We might prefer a technique which does not require us to choose a penalty function a priori

Constrained optimization procedures can be tricky to use


Enter TGDR

TGDR: Threshold Gradient Descent Regularization, suggested by Friedman and Popescu (2004)

Idea: Construct paths in the parameter space iteratively

Choose a point on the constructed path which is “closest” to the true parameter value (usually via cross-validation)


Iterative path construction

Basic calculus: the negative gradient g(β) = −∂f/∂β gives the direction of steepest descent

Steepest descent algorithm for finding the minimum of a function f:

β̂(λ + ∆λ) = β̂(λ) + ∆λ · g(β̂(λ))

To reduce instability of estimates, consider instead the thresholded step (the product T · g taken componentwise)

β̂(λ + ∆λ) = β̂(λ) + ∆λ · T(β̂(λ)) · g(β̂(λ))

where Ti(β) = 1[ |gi(β)| ≥ τ · maxk=1,...,p |gk(β)| ]
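Putting the two displays together, a minimal TGDR sketch (my own illustrative implementation of the thresholded update, not Friedman and Popescu's code; function names are assumptions):

```python
import numpy as np

def tgdr_path(grad, beta0, tau, delta=0.01, n_steps=500):
    """Threshold gradient descent regularization.

    grad(beta) must return the descent direction g = -dL/dbeta.
    tau in [0, 1] controls thresholding: only coordinates whose
    |g_i| is within a factor tau of max_k |g_k| are updated.
    Returns the whole path; a point on it is then chosen by CV.
    """
    beta = np.asarray(beta0, dtype=float).copy()
    path = [beta.copy()]
    for _ in range(n_steps):
        g = grad(beta)
        T = (np.abs(g) >= tau * np.abs(g).max()).astype(float)  # T_i(beta)
        beta = beta + delta * T * g
        path.append(beta.copy())
    return np.array(path)

# Squared-error example: g = -dL/dbeta = 2 X^T (Y - X beta)
# path = tgdr_path(lambda b: 2 * X.T @ (Y - X @ b), np.zeros(p), tau=0.9)
```

Note that τ = 0 updates every coordinate (plain gradient descent, ridge-like paths), while τ = 1 updates only the coordinate(s) with the largest gradient (stagewise, LASSO-like paths).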

Thresholding

Recap

We now have a general method for constructing paths in the parameter space. To apply it, we need (a combined sketch follows this list):

A (differentiable) loss function (squared error, log-likelihood, etc.)

A way to choose threshold parameter τ

A way to choose path parameter λ
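As a combined sketch (reusing the hypothetical tgdr_path from above; the tuning scheme is illustrative, not necessarily the one used in the talk): run TGDR on training folds over a grid of τ and pick the (τ, stopping step) pair with the smallest held-out loss.

```python
import numpy as np
# assumes tgdr_path from the earlier sketch is in scope

def select_tau_and_stop(X, Y, taus, K=5, delta=0.01, n_steps=300, seed=1):
    """K-fold CV over tau and the stopping step k, squared-error loss."""
    n, p = X.shape
    folds = np.array_split(np.random.default_rng(seed).permutation(n), K)
    best_tau, best_k, best_loss = None, None, np.inf
    for tau in taus:
        cv_loss = np.zeros(n_steps + 1)
        for test in folds:
            train = np.setdiff1d(np.arange(n), test)
            g = lambda b: 2 * X[train].T @ (Y[train] - X[train] @ b)
            path = tgdr_path(g, np.zeros(p), tau, delta, n_steps)
            cv_loss += ((Y[test] - path @ X[test].T) ** 2).sum(axis=1)
        k = int(np.argmin(cv_loss))
        if cv_loss[k] < best_loss:
            best_tau, best_k, best_loss = tau, k, cv_loss[k]
    return best_tau, best_k
```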


TGDR for Cox regression

Gui and Li (2005) extended TGDR for Cox regression (partial likelihood loss)

Recall: L = −ℓp(β; X) and g = −∂L/∂β (sketched in code below)

We started by adapting TGDR to handle time-varying covariates
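For the Cox loss, the descent direction is the partial-likelihood score. A minimal sketch (assuming no tied failure times and time-fixed covariates; names are illustrative):

```python
import numpy as np

def cox_score(beta, X, time, event):
    """g = -dL/dbeta = score of the log partial likelihood.

    X: (n, p) covariates; time: (n,) follow-up times;
    event: (n,) with 1 = failure observed, 0 = censored.
    """
    w = np.exp(X @ beta)                        # relative risks
    g = np.zeros_like(beta)
    for i in np.flatnonzero(event == 1):
        at_risk = time >= time[i]               # risk set R(t_i)
        w_r = w[at_risk]
        xbar = (w_r @ X[at_risk]) / w_r.sum()   # risk-set weighted mean covariate
        g += X[i] - xbar
    return g

# plug into the TGDR sketch:
# path = tgdr_path(lambda b: cox_score(b, X, time, event), np.zeros(p), tau=0.9)
```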


Application: ACTG 398

Relevant Data: HIV envelope protein sequences collected post-infection for approximately two years

Current drug regimen

Endpoint of Interest

(T, C), where

T is the time until a patient “fails” a drug regimen

C is the censoring indicator

Question

Which amino acid positions on HIV (mutations, insertions, deletions) are associated with time until drug regimen failure?


Results: ACTG 398 Data

Estimated coefficients from training set (60% of data)

Positions (wild-type amino acid in parentheses): 70R (K), 74V (L), 103N (K), 108I (V), 118I (V), 122E (K), 123E (D), 181C (Y), 184V (M), 190A (G). For τ ≥ 0.85 all ten positions are selected and the coefficients below are listed in that order; at smaller τ only a subset of positions is selected.

τ       estimated coefficients (nonzero entries)
0.50     0.134   0.258   0.134  −0.164   0.131
0.55     0.115   0.421   0.096   0.092   0.117  −0.255   0.128
0.60     0.115   0.421   0.117  −0.164   0.128
0.65     0.118   0.434   0.125  −0.143   0.128
0.70     0.092   0.535   0.086   0.088   0.207  −0.143   0.229
0.75     0.105   0.542   0.078  −0.080   0.085   0.075   0.184  −0.143   0.221
0.80     0.434  −0.143
0.85    −0.063   0.087   0.554   0.143  −0.082   0.088   0.142   0.119  −0.201   0.368
0.90    −0.069   0.083   0.554   0.147  −0.082   0.087   0.079   0.119  −0.202   0.310
0.95    −0.062   0.145   0.541   0.206  −0.207   0.147   0.141   0.105  −0.204   0.380
0.96    −0.062   0.092   0.541   0.206  −0.148   0.144   0.141   0.094  −0.203   0.387
0.97    −0.066   0.098   0.535   0.208  −0.149   0.082   0.143   0.087  −0.204   0.386
0.98    −0.066   0.092   0.535   0.146  −0.149   0.084   0.143   0.094  −0.205   0.381
0.99    −0.066   0.086   0.535   0.147  −0.150   0.087   0.143   0.094  −0.205   0.380


Results (cont’d)

Get η̂ = Xβ̂ from the test set (40% of data)

HR = hazard ratio comparing the group with η̂ ≥ 0 (“high risk”) to the group with η̂ < 0 (“low risk”)

τ       HR      95% CI
0.50    2.258   (1.438, 3.546)
0.55    2.360   (1.499, 3.716)
0.60    2.025   (1.290, 3.178)
0.65    2.025   (1.290, 3.178)
0.70    2.384   (1.492, 3.810)
0.75    2.349   (1.476, 3.739)
0.80    2.054   (1.311, 3.217)
0.85    2.441   (1.549, 3.846)
0.90    2.475   (1.571, 3.900)
0.95    2.429   (1.537, 3.837)
0.96    2.429   (1.537, 3.837)
0.97    2.463   (1.558, 3.893)
0.98    2.463   (1.558, 3.893)
0.99    2.463   (1.558, 3.893)


Extensions

For log-likelihood (or log partial likelihood) loss, the descent direction is just g = ∂ℓ/∂β ≡ ℓ̇, the score function.

Extensive literature on modified/adapted/approximate/quasi score functions which allow for:

Missing data

Measurement error

Heteroskedasticity

. . .

Straightforward to incorporate methods which propose some modification g∗ of our original step direction g (see the sketch below).

Currently working on allowing TGDR to handle missing data (based on work of Lin and Ying) and measurement error (Augustin)
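The key point is that the TGDR update only needs a step direction, so any modified score g∗ can be dropped in where g was. Schematically (a hypothetical sketch, not the actual Lin and Ying or Augustin corrections):

```python
import numpy as np

def tgdr_step(beta, step_dir, tau, delta=0.01):
    """One thresholded step along an arbitrary direction.

    step_dir(beta) may return the ordinary score g, or a modified
    score g* adjusted for missing data, measurement error, etc.
    """
    g = step_dir(beta)
    T = (np.abs(g) >= tau * np.abs(g).max()).astype(float)
    return beta + delta * T * g
```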


Crazy ideas (i.e., future work)

TGDR with more sophisticated steps (Newton-Raphson, BFGS, etc.)

Incorporating biological knowledge (restricting some coefficients to be > 0, etc.)

TGDR for GEE? (based on estimating functions...)

TGDR as a meta-method? (TGDR with LASSO loss...)


In Conclusion

TGDR is...

Variable selection based on thresholded gradient descent

Beautifully simple

Computationally tractable

Easy to extend to more complex data structures

But TGDR is not...

Popular (yet)

Particularly amenable to inference (confidence intervals?)

Well studied from a theoretical perspective:

When does it work?

How well does it work?

How does it compare to competing methods?


A word about LaTeX and presentations

This presentation is a PDF file generated from a LaTeX (text) document, with the help of a package called beamer. More info available at

http://latex-beamer.sourceforge.net/

Ask me if you have any questions... but no guarantees.


Acknowledgements

Prof. Peter Gilbert (thesis supervisor)

Prof. Victor DeGruttola (for providing ACTG data)

Thanks!

Questions?