Homework 1
• Check course website!
• Some coding required
• Some plotting required
  – I recommend Matlab
• Has supplementary datasets
• Submit via Moodle (due Jan 20th @5pm)
Recap: Complete Pipeline
[Pipeline diagram: Training Data, Model Class(es), Loss Function → Cross Validation & Model Selection → Profit!]
Different Model Classes?
[Pipeline diagram: Cross Validation & Model Selection]
• Option 1: SVMs vs ANNs vs LR vs LS
• Option 2: Regularization
Notation
• L0 Norm
  – # of non-zero entries
• L1 Norm
  – Sum of absolute values
• L2 Norm & Squared L2 Norm
  – L2 norm: sqrt(sum of squares)
  – Squared L2 norm: sum of squares
• L-infinity Norm
  – Max absolute value
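Written out, the norms above are (standard definitions; the notation may differ slightly from the slide):

```latex
\|w\|_0 = \#\{\, d : w_d \neq 0 \,\}, \qquad
\|w\|_1 = \sum_d |w_d|, \qquad
\|w\|_2 = \Big(\sum_d w_d^2\Big)^{1/2}, \qquad
\|w\|_2^2 = \sum_d w_d^2, \qquad
\|w\|_\infty = \max_d |w_d|
```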
Notation Part 2
• Minimizing Squared Loss
  – Regression
  – Least-Squares
  – (Unless Otherwise Stated)
• E.g., Logistic Regression = Log Loss
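For reference, the two losses in one standard form (the exact notation used on the slide is not shown in the extracted text):

```latex
\text{Squared loss:}\quad \ell(y, \hat{y}) = (y - \hat{y})^2
\qquad\qquad
\text{Log loss (logistic regression, } y \in \{-1,+1\}\text{):}\quad \ell\big(y, f(x)\big) = \log\!\big(1 + e^{-y\, f(x)}\big)
```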
Ridge Regression
• aka L2-Regularized Regression
• Trades off model complexity vs training loss
• Each choice of λ is a “model class”
  – Will discuss this further later
[Equation: Regularization + Training Loss; see the reconstruction below]
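The objective itself did not survive extraction; presumably it is the standard L2-regularized least-squares objective:

```latex
\operatorname*{argmin}_{w,\,b}\;
\underbrace{\lambda \|w\|_2^2}_{\text{Regularization}}
\;+\;
\underbrace{\sum_{i=1}^{N} \big(y_i - w^\top x_i - b\big)^2}_{\text{Training Loss}}
```

And a minimal numpy sketch of ridge regression via its closed-form solution (the data, dimensions, and λ values below are made up for illustration):

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Solve argmin_w  lam * ||w||_2^2 + ||y - X w||_2^2  in closed form."""
    D = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(D), X.T @ y)

np.random.seed(0)
X = np.random.randn(100, 5)
w_true = np.array([1.0, -2.0, 0.5, 0.0, 3.0])
y = X @ w_true + 0.1 * np.random.randn(100)

for lam in [0.0, 1.0, 10.0, 100.0]:
    print(lam, np.round(ridge_fit(X, y, lam), 3))  # weights shrink toward 0 as lambda grows
```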
Person   Age>10   Male?   Height > 55”
Alice      1        0        1
Bob        0        1        0
Carol      0        0        0
Dave       1        1        1
Erin       1        0        1
Frank      0        1        1
Gena       0        0        0
Harold     1        1        1
Irene      1        0        0
John       0        1        1
Kelly      1        0        1
Larry      1        1        1
(The slide splits these rows into a Train set and a Test set.)

Learned weights (w1, w2, b) as λ increases:
 0.7401   0.2441   -0.1745
 0.7122   0.2277   -0.1967
 0.6197   0.1765   -0.2686
 0.4124   0.0817   -0.4196
 0.1801   0.0161   -0.5686
 …
 0.0001   0.0000   -0.6666
Updated Pipeline
[Pipeline diagram: Training Data, Model Class, Loss Function → Cross Validation & Model Selection (Choosing λ!) → Profit!]
Person   Age>10   Male?   Height > 55”   Model score w/ increasing λ →
Alice      1        0        1           0.91  0.89  0.83  0.75  0.67
Bob        0        1        0           0.42  0.45  0.50  0.58  0.67
Carol      0        0        0           0.17  0.26  0.42  0.50  0.67
Dave       1        1        1           1.16  1.06  0.91  0.83  0.67
Erin       1        0        1           0.91  0.89  0.83  0.79  0.67
Frank      0        1        1           0.42  0.45  0.50  0.54  0.67
Gena       0        0        0           0.17  0.27  0.42  0.50  0.67
Harold     1        1        1           1.16  1.06  0.91  0.83  0.67
Irene      1        0        0           0.91  0.89  0.83  0.79  0.67
John       0        1        1           0.42  0.45  0.50  0.54  0.67
Kelly      1        0        1           0.91  0.89  0.83  0.79  0.67
Larry      1        1        1           1.16  1.06  0.91  0.83  0.67
(Rows are split into Train and Test sets on the slide; a “Best test error” label marks one of the λ columns.)
Choice of Lambda Depends on Training Size
[Plot: 25-dimensional space, randomly generated linear response function + noise]
Recap: Ridge Regularization
• Ridge Regression:
  – L2-Regularized Least-Squares
• Large λ → more stable predictions
  – Less likely to overfit to training data
  – Too large λ → underfit
• Works with other losses
  – Hinge Loss, Log Loss, etc.
Model Class Interpretation
• This is not a model class!
  – At least not what we’ve discussed...
• An optimization procedure
  – Is there a connection?
Lagrange Multipliers
http://en.wikipedia.org/wiki/Lagrange_multiplier
• Optimality Condition:
  – Gradients aligned!
[Figure: “Loss” contours and the “Constraint Boundary”, with gradients aligned at the optimum]
Omitting b & using 1 training data point for simplicity
Norm Constrained Model Class Training:
Observation about Optimality:
Lagrangian:
Optimality Implication of Lagrangian:
Claim: Solving Lagrangian Solves Norm-Constrained Training Problem
Two Conditions Must Be Satisfied At Optimality.
Satisfies First Condition!
Omitting b & using 1 training data point for simplicity
http://en.wikipedia.org/wiki/Lagrange_multiplier
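The equations on this slide are missing from the extracted text; below is a hedged reconstruction in one standard form (general training loss L(w); the slide itself works with a single squared-loss training point):

```latex
\text{Norm-constrained training:}\quad \min_{w}\; L(w) \quad \text{s.t.}\quad \|w\|_2^2 \le c
\\[6pt]
\text{Lagrangian:}\quad \Lambda(w, \lambda) = L(w) + \lambda\big(\|w\|_2^2 - c\big), \qquad \lambda \ge 0
\\[6pt]
\text{Condition 1 (stationarity, gradients aligned):}\quad \nabla_w L(w) + 2\lambda w = 0
\\[6pt]
\text{Condition 2 (complementary slackness):}\quad \lambda\big(\|w\|_2^2 - c\big) = 0
```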
Norm Constrained Model Class Training:
Observation about Optimality:
Lagrangian:
Optimality Implication of Lagrangian:
Claim: Solving Lagrangian Solves Norm-Constrained Training Problem
Two Conditions Must Be Satisfied At Optimality.
Satisfies First Condition!
Optimality Implication of Lagrangian:
Satisfies 2nd Condition!
Omitting b & using 1 training data point for simplicity
http://en.wikipedia.org/wiki/Lagrange_multiplier
Lagrangian:
Norm Constrained Model Class Training:
L2 Regularized Training:
Omitting b & using 1 training data point for simplicity
• Lagrangian = Norm Constrained Training:
• Lagrangian = L2 Regularized Training:
  – Hold λ fixed
  – Equivalent to solving Norm Constrained!
  – For some c
http://en.wikipedia.org/wiki/Lagrange_multiplier
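Again the equations themselves are missing; a hedged sketch of the claimed equivalence:

```latex
\text{For fixed } \lambda:\quad
\operatorname*{argmin}_{w}\; \Lambda(w, \lambda)
\;=\;
\operatorname*{argmin}_{w}\; L(w) + \lambda \|w\|_2^2
\qquad (\text{the } -\lambda c \text{ term does not depend on } w)
\\[6pt]
\text{and its minimizer } w^{*} \text{ solves the norm-constrained problem for } c = \|w^{*}\|_2^2 .
```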
Recap #2: Ridge Regularization
• Ridge Regression:
  – L2-Regularized Least-Squares = Norm-Constrained Model
• Large λ → more stable predictions
  – Less likely to overfit to training data
  – Too large λ → underfit
• Works with other losses
  – Hinge Loss, Log Loss, etc.
Hallucinating Data Points
• Instead, hallucinate D data points?
  – Unit vector along d-th dimension
  – Identical to Regularization!
Omitting b for simplicity
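A minimal numpy sketch of this trick (made-up data and λ): appending one hallucinated point per dimension (sqrt(λ) times the d-th unit vector, with target 0) makes plain least-squares reproduce the ridge solution.

```python
import numpy as np

np.random.seed(0)
N, D, lam = 50, 5, 2.0
X = np.random.randn(N, D)
y = X @ np.array([1.0, -2.0, 0.5, 0.0, 3.0]) + 0.1 * np.random.randn(N)

# Ridge solution in closed form: w = (X'X + lam*I)^{-1} X'y
w_ridge = np.linalg.solve(X.T @ X + lam * np.eye(D), X.T @ y)

# Hallucinated data points: sqrt(lam) * I appended to X, zeros appended to y
X_aug = np.vstack([X, np.sqrt(lam) * np.eye(D)])
y_aug = np.concatenate([y, np.zeros(D)])
w_halluc, *_ = np.linalg.lstsq(X_aug, y_aug, rcond=None)

print(np.allclose(w_ridge, w_halluc))  # True: identical to L2 regularization
```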
Extension: Multi-task Learning
• 2 prediction tasks:
  – Spam filter for Alice
  – Spam filter for Bob
• Limited training data for both…
  – … but Alice is similar to Bob
Extension: Multi-task Learning
• Two Training Sets
  – N relatively small
• Option 1: Train Separately
Omitting b for simplicity
Both models have high error.
Extension: Multi-task Learning
• Two Training Sets
  – N relatively small
• Option 2: Train Jointly
Omitting b for simplicity
Doesn’t accomplish anything! (w & v don’t depend on each other)
Multi-task Regularization
• Prefer w & v to be “close”
  – Controlled by γ
  – Tasks similar
    • Larger γ helps!
  – Tasks not identical
    • γ not too large
[Objective terms labeled on slide: Standard Regularization, Multi-task Regularization, Training Loss; see the sketch below]
[Plot: Test Loss (Task 2) vs. γ]
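The objective itself is missing from the extracted text; a hedged sketch of a standard multi-task form that matches the term labels above (squared loss and the set names S_Alice, S_Bob are assumptions for concreteness):

```latex
\min_{w,\,v}\;
\underbrace{\sum_{(x,y) \in S_{\text{Alice}}} \big(y - w^\top x\big)^2
\;+\; \sum_{(x,y) \in S_{\text{Bob}}} \big(y - v^\top x\big)^2}_{\text{Training Loss}}
\;+\;
\underbrace{\lambda\big(\|w\|_2^2 + \|v\|_2^2\big)}_{\text{Standard Regularization}}
\;+\;
\underbrace{\gamma\, \|w - v\|_2^2}_{\text{Multi-task Regularization}}
```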
Subgradient (sub-differential)
• Differentiable:
• L1:
Continuous range for w=0!
Omitting b for simplicity
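The derivative and subgradient formulas are missing from the extraction; a reconstruction of the standard statement for the absolute value (per coordinate):

```latex
\text{Differentiable case } (w \neq 0):\quad \frac{d}{dw}\, |w| = \operatorname{sign}(w)
\\[6pt]
\text{At } w = 0:\quad \partial\, |w| = [-1,\, 1]
\qquad \text{(a continuous range of valid subgradients)}
```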
Lagrange Multipliers
Omitting b & using 1 training data point for simplicity
Solutions tend to be at corners!
http://en.wikipedia.org/wiki/Lagrange_multiplier
Sparsity
• w is sparse if mostly 0’s
  – Small L0 Norm
• Why not L0 Regularization?
  – Not continuous!
• L1 induces sparsity
  – And is continuous!
Omitting b for simplicity
Why is Sparsity Important?
• Computational / Memory Efficiency
  – Store 1M numbers in array
  – Store 2 numbers per non-zero
    • (Index, Value) pairs
    • E.g., [ (50,1), (51,1) ]
  – Dot product more efficient (see the sketch below)
• Sometimes true w is sparse
  – Want to recover non-zero dimensions
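A minimal Python sketch of the (index, value) representation and the cheaper dot product; the 1M-dimensional vector and the indices 50, 51 are just the example from the bullets above:

```python
def sparse_dot(w_sparse, x):
    """w_sparse: list of (index, value) pairs for the non-zero entries of w.
    x: any indexable dense vector. Cost is O(#non-zeros), not O(dimension)."""
    return sum(value * x[index] for index, value in w_sparse)

w_sparse = [(50, 1.0), (51, 1.0)]   # only 2 non-zero weights out of 1M dimensions
x = [0.0] * 1_000_000
x[50], x[51] = 3.0, 4.0
print(sparse_dot(w_sparse, x))      # 7.0
```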
Lasso Guarantee
• Suppose data generated as:
• Then if:
• With high probability (increasing with N):
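The precise statement is not recoverable from the extracted text; the following is only a rough sketch of the flavor of the standard Lasso support-recovery guarantee (see the Wainwright reference below), not the slide's exact conditions or constants:

```latex
y = x^\top w^{*} + \varepsilon, \qquad \varepsilon \sim \mathcal{N}(0, \sigma^2), \qquad w^{*} \text{ has } k \ll D \text{ non-zero entries}
\\[6pt]
\lambda \;\propto\; \sigma \sqrt{\frac{\log D}{N}}, \qquad N \;\gtrsim\; k \log D \quad (\text{plus incoherence-type conditions on } X)
\\[6pt]
\operatorname{supp}(\hat{w}) \subseteq \operatorname{supp}(w^{*}) \;\;\text{(high precision)}, \qquad
\operatorname{supp}(\hat{w}) = \operatorname{supp}(w^{*}) \;\text{ when the non-zero } w^{*}_d \text{ are large enough (recall)}
```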
See also: https://www.cs.utexas.edu/~pradeepr/courses/395T-LT/filez/highdimII.pdf http://www.eecs.berkeley.edu/~wainwrig/Papers/Wai_SparseInfo09.pdf
High Precision Parameter Recovery
Sometimes High Recall
Person   Age>10   Male?   Height > 55”
Alice      1        0        1
Bob        0        1        0
Carol      0        0        0
Dave       1        1        1
Erin       1        0        1
Frank      0        1        1
Gena       0        0        0
Harold     1        1        1
Irene      1        0        0
John       0        1        1
Kelly      1        0        1
Larry      1        1        1
Recap: Lasso vs Ridge
• Model Assumptions
  – Lasso learns sparse weight vector
• Predictive Accuracy
  – Lasso often not as accurate
  – Re-run Least Squares on dimensions selected by Lasso
• Ease of Inspection
  – Sparse w’s easier to inspect
• Ease of Optimization
  – Lasso somewhat trickier to optimize
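A minimal sklearn sketch of the contrast above (synthetic data and made-up regularization strengths), including the re-fit-least-squares-on-the-selected-dimensions step:

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge, LinearRegression

np.random.seed(0)
N, D = 100, 20
X = np.random.randn(N, D)
w_true = np.zeros(D)
w_true[:3] = [2.0, -1.5, 1.0]            # the true weight vector is sparse
y = X @ w_true + 0.1 * np.random.randn(N)

lasso = Lasso(alpha=0.1).fit(X, y)
ridge = Ridge(alpha=1.0).fit(X, y)
print("non-zero weights, Lasso:", np.sum(lasso.coef_ != 0))   # few (sparse)
print("non-zero weights, Ridge:", np.sum(ridge.coef_ != 0))   # typically all D

# Re-run least squares on the dimensions selected by the Lasso
selected = np.flatnonzero(lasso.coef_)
refit = LinearRegression().fit(X[:, selected], y)
print("refit weights on selected dims:", np.round(refit.coef_, 3))
```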
Recap: Regularization
• L2
• L1 (Lasso)
• Multi-task
• [Insert Yours Here!]
Omitting b for simplicity