Thesis Proposal Learning with Sparsity: Structures, Optimization and Applications

Xi Chen
Committee Members: Jaime Carbonell (chair), Tom Mitchell, Larry Wasserman, Robert Tibshirani
Machine Learning Department, Carnegie Mellon University

Transcript of Thesis Proposal Learning with Sparsity: Structures, Optimization and Applications

Page 1: Thesis Proposal Learning with Sparsity: Structures, Optimization and Applications

Thesis Proposal
Learning with Sparsity: Structures, Optimization and Applications

Xi Chen
Committee Members: Jaime Carbonell (chair), Tom Mitchell, Larry Wasserman, Robert Tibshirani

Machine Learning Department, Carnegie Mellon University

Page 2

Modern Data Analysis

Gene expression data for tumor classification. Characteristics: high-dimensional; very few samples; complex structure.

Climate data. Characteristic: dynamic, complex structure.

Web-text data. Characteristics: both high-dimensional and massive in amount; structure over word features (e.g., synonyms).

Challenges: high dimensions; complex and dynamic structures.

Page 3

Solutions: Sparse Learning

Sparse regression for feature selection and prediction: smooth convex loss + L1 regularization [Tibshirani 96]

Incorporating structural prior knowledge: structured penalties (e.g., group, hierarchical tree, graph) [Tibshirani et al., 05; Jenatton et al., 09; Peng et al., 09; Friedman et al., 10; Kim et al., 10]

Nonparametric sparse regression: flexible additive models [Ravikumar et al., 09]
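The first bullet, a smooth convex loss with an L1 penalty [Tibshirani 96], can be sketched with plain coordinate descent; this is a generic NumPy illustration, not code from the thesis.

```python
import numpy as np

def soft_threshold(z, t):
    # Soft-thresholding: the scalar proximal operator of the L1 norm.
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def lasso_cd(X, y, lam, n_iter=200):
    """Coordinate descent for (1/2n)||y - X beta||^2 + lam * ||beta||_1."""
    n, p = X.shape
    beta = np.zeros(p)
    col_sq = (X ** 2).sum(axis=0) / n
    for _ in range(n_iter):
        for j in range(p):
            # Partial residual with feature j removed from the fit
            r = y - X @ beta + X[:, j] * beta[j]
            rho = X[:, j] @ r / n
            beta[j] = soft_threshold(rho, lam) / col_sq[j]
    return beta
```

Irrelevant coefficients are set exactly to zero by the soft-thresholding step, which is what makes the lasso perform feature selection.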

Page 4

Sparse Learning in Graphical Models

Undirected Graphical Models (Markov Random Fields)

Learn Sparse Structure of Graphical Models

Example: a gene graph, modeled with a pairwise Markov random field. [figure]

Graphical Lasso (gLasso) (Yuan et al., 06; Friedman et al., 07; Banerjee et al., 08)

Iterated Lasso (Meinshausen and Bühlmann, 06)

Forest Density Estimator (Liu et al., 10)

Page 5

Thesis Overview: High-dimensional Sparse Learning with Structures

1. Sparse single/multi-task regression with general structured penalties
Challenge: computation.
Completed work: a unified optimization framework, Smoothing Proximal Gradient [UAI 11, AOAS].
Future work: (1) online learning for massive data; (2) incorporating structured penalties in other models (e.g., PCA, CCA).

2. Learning sparse structures for undirected graphical models
Existing work: static or time-varying graphs. Challenge: dynamic structures.
Completed work: conditional Gaussian graphical models: (1) kernel smoothing method for spatial-temporal graphs [AAAI 10]; (2) partition-based method [NIPS 10].
Future work: relax the conditional Gaussian assumption (continuous & discrete data).

3. Nonparametric sparse regression
Existing work: additive models. Challenges: (1) generalized models, (2) structures.
Completed work: (1) generalized forward regression [NIPS 09]; (2) penalized tree regression [NIPS 10].
Future work: incorporating rich structures.

Application areas: tumor classification using gene expression data [UAI 11, AOAS], climate data analysis [AAAI 10, NIPS 10], web-text mining [ICDM 10, SDM 10]

Page 6

Roadmap

- Smoothing Proximal Gradient for Structured Sparse Regression
- Structure Learning in Graphical Models
- Nonparametric Sparse Regression
- Summary and Timeline
- Q & A

Page 7

Useful Structures and Structured Penalties

Group Structure (group-wise selection)

[Yuan 06]

[Peng et al 09, Kim et al 10]

[Bach et al., 09]

Application: pathway selection for gene-expression data in tumor classification

Example: WordNet
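As a concrete instance of the group penalty, the group-lasso norm sums (weighted) L2 norms over coefficient groups; this small sketch uses the common sqrt(group-size) weights and also handles overlapping groups. Illustrative only, not the thesis implementation.

```python
import numpy as np

def group_lasso_penalty(beta, groups):
    # Omega(beta) = sum_g w_g * ||beta_g||_2 with w_g = sqrt(|g|).
    # Groups may overlap, as in the overlapping group lasso.
    return sum(np.sqrt(len(g)) * np.linalg.norm(beta[list(g)])
               for g in groups)
```

Because the L2 norm of a group is zero only when the whole group is zero, penalizing it drives group-wise selection: entire groups (e.g., gene pathways) enter or leave the model together.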

Page 8

Useful Structures and Structured Penalties

Graph structure (to enforce smoothness)

Piecewise-constant penalty (fused lasso) [Tibshirani et al., 05]

Graph-smoothness penalty [Kim et al., 10]

Page 9

Challenge

A unified, efficient, and scalable optimization framework for solving all these structured penalties, which are nonsmooth and nonseparable, in both single-task and multi-task regression.

Page 10

Existing Optimization Methods

Interior-point method (IPM) for second-order cone programming (SOCP) / quadratic programming (QP): poor scalability, since a huge Newton linear system must be solved at each iteration.

Subgradient descent (first-order method): slow convergence.

Block coordinate descent (optimizes one block at a time while keeping the other variables fixed): cannot be applied to nonseparable penalties.

Proximal gradient descent (first-order method) [Nesterov 07; Beck and Teboulle, 09]: cannot be applied to complex structured penalties, because the proximal operator has no exact solution. Proximal operator: prox_h(v) = argmin_x (1/2)||x - v||^2 + h(x).

Augmented Lagrangian / alternating direction methods: no convergence-rate result; global convergence only for 2 blocks; a large-scale linear system must be solved at each iteration.
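Where the proximal operator does have a closed form, e.g. soft-thresholding for the L1 norm, proximal gradient descent is simple; a minimal sketch (illustrative, not the thesis code):

```python
import numpy as np

def prox_l1(v, t):
    # prox_{t||.||_1}(v) = argmin_x (1/2)||x - v||^2 + t||x||_1
    # has the closed-form soft-thresholding solution.
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def proximal_gradient(grad_f, prox, x0, step, n_iter=100):
    # Generic proximal gradient descent: a gradient step on the smooth
    # part, followed by the proximal operator of the nonsmooth penalty.
    x = x0
    for _ in range(n_iter):
        x = prox(x - step * grad_f(x), step)
    return x
```

For complex structured penalties no such closed-form prox exists, which is exactly the gap SPG fills by smoothing the structured part and keeping only the separable L1 term inside the prox.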

Page 11

Overview: Smoothing Proximal Gradient (SPG)

A first-order method (uses only gradient information): fast and scalable. The difficulty is that the proximal operator has no exact solution for these penalties.

Idea:
1) Reformulate the structured penalty via its dual norm.
2) Introduce a smooth approximation of the penalty [Nesterov 05].
3) Plug the smooth approximation back into the original problem and solve it with accelerated proximal gradient methods.

Convergence results: [stated on the slide]
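The smoothing step can be sketched in the standard notation of Nesterov's technique (here \(C\) encodes the structure, \(\mathcal{Q}\) is the dual-norm unit ball, and \(\alpha^{*}\) is the maximizer; the exact symbols may differ from the slide):

```latex
\Omega(\beta) \;=\; \max_{\alpha \in \mathcal{Q}} \alpha^{\top} C \beta
\quad\Longrightarrow\quad
f_{\mu}(\beta) \;=\; \max_{\alpha \in \mathcal{Q}}
\Big( \alpha^{\top} C \beta - \tfrac{\mu}{2}\|\alpha\|_2^2 \Big),
\qquad
\nabla f_{\mu}(\beta) \;=\; C^{\top} \alpha^{*}(\beta).
```

The approximation satisfies \(f_{\mu}(\beta) \le \Omega(\beta) \le f_{\mu}(\beta) + \mu D\) with \(D = \max_{\alpha \in \mathcal{Q}} \tfrac{1}{2}\|\alpha\|_2^2\), so choosing \(\mu\) proportional to the target accuracy trades approximation error for smoothness.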

Page 12

Why the Approximation is Smooth?

Geometric interpretation: the penalty is a maximum of linear functions, so its graph is an upper envelope of lines; the uppermost line of that envelope is nonsmooth at the crossing points. Subtracting the strongly convex term in the smooth approximation rounds off these corners, so the uppermost line becomes smooth. [figure: nonsmooth vs. smooth upper envelopes]

Page 13

Smoothing Proximal Gradient (SPG)

Original problem: a convex smooth loss plus a nonsmooth penalty with complex structure.

Approximated problem: the complex penalty is replaced by its smooth approximation, leaving a smooth function plus a nonsmooth part with good separability, which is handled by the proximal operator (soft-thresholding) [Nesterov 07; Beck and Teboulle, 09].

Gradient of the approximation: obtained via Danskin's theorem.

Page 14

Convergence Rate

Method      | Iterations for eps-accuracy | Per-iteration time / storage
SPG         | O(1/eps)                    | cheap (gradient)
Subgradient | O(1/eps^2)                  | cheap (gradient)
IPM         | O(log(1/eps))               | expensive (Newton linear system / Hessian)

Page 15

Multi-Task Extension

The single-task formulation extends directly to the multi-task setting. [single-task vs. multi-task formulas shown on the slide]

Page 16

Simulation Study

Multi-task graph-guided fused lasso on simulated genotype (SNP) and gene-expression data. [figure: SNP sequence and gene-expression traits]

Page 17

Biological Application

Breast cancer tumor classification: gene-expression data for 8,141 genes in 295 breast cancer tumors (78 metastatic and 217 non-metastatic; logistic regression loss).

Canonical pathways from MSigDB containing 637 groups of genes; training:test = 2:1.

SPG for overlapping group-lasso regularization: full regularization path (20 parameters) computed in 331 seconds.

Important pathways identified: proteasome, nicotinate (ENPP1).

Page 18

Proposed Research

More applications of SPG.

Web-scale learning with massive amounts of data: inputs arrive sequentially at a high rate, and the system must provide real-time service. Solution: stochastic optimization for online learning.

Complex structured penalty: handled by the smoothing technique. Simple penalty with good separability: closed-form solution in the proximal operator (e.g., low rank + sparse).

Page 19

Proposed Research

Stochastic optimization: deterministic vs. stochastic formulations. [formulas shown on the slide]

Existing methods: RDA [Lin 10], accelerated stochastic gradient descent [Lan et al., 10]; these can ruin the sparsity pattern.

Goal: sparsity-preserving stochastic optimization for large-scale online learning.

Structured sparsity beyond regression: canonical correlation analysis and its application in genome-wide association studies.
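As an illustration of the stated goal: a stochastic proximal-gradient update for the lasso keeps exact zeros in the iterate, unlike a plain stochastic subgradient step. This is a minimal sketch under simple assumptions (squared loss, one sample per step), not the proposed algorithm.

```python
import numpy as np

def stochastic_prox_step(beta, x_i, y_i, lam, step):
    # One stochastic proximal-gradient update for the lasso:
    # gradient of the single-sample squared loss, then soft-thresholding.
    # The proximal step sets small coordinates exactly to zero, which a
    # plain stochastic subgradient step would not.
    grad = (x_i @ beta - y_i) * x_i
    v = beta - step * grad
    return np.sign(v) * np.maximum(np.abs(v) - step * lam, 0.0)
```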

Page 20

Roadmap

- Smoothing Proximal Gradient for Structured Sparse Regression
- Structure Learning in Graphical Models
- Nonparametric Sparse Regression
- Summary and Timeline
- Q & A

Page 21

Gaussian Graphical Model

Gaussian graphical model [Lauritzen 96]: zeros in the inverse covariance (precision) matrix correspond to missing edges, i.e., conditional independencies.

Graphical lasso (gLasso) [Yuan et al., 06; Friedman et al., 07; Banerjee et al., 08]: L1-penalized maximum likelihood estimation of the precision matrix.

Challenge: dynamic graph structure.
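To make the graph connection concrete: in a Gaussian graphical model the edge set is read directly off the sparsity pattern of the precision matrix. A minimal illustrative helper:

```python
import numpy as np

def edges_from_precision(omega, tol=1e-8):
    # In a Gaussian graphical model, variables i and j are conditionally
    # independent given the rest iff entry (i, j) of the precision
    # (inverse covariance) matrix is zero; nonzeros define the edges.
    p = omega.shape[0]
    return [(i, j) for i in range(p) for j in range(i + 1, p)
            if abs(omega[i, j]) > tol]
```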

Page 22

Idea: Graph-Valued Regression

Combine multivariate regression with undirected graphical models [Zhou et al., 08; Song et al., 09]: given input data (x_i, y_i), estimate how the graph over Y varies with the covariates, i.e., a map x -> G(x).

Application: [example shown on the slide]

Page 23

Applications with higher-dimensional X

X: patient symptom characterization; Y: gene-expression levels.

Page 24

Kernel Smoothing Estimator

Conditional Gaussian assumption: Y | X = x is Gaussian with covariance Sigma(x).

Kernel smoothing estimator: estimate Sigma(x) by kernel-weighted averaging around x, then plug it into the graphical lasso to obtain the graph estimate Ĝ(x).

Cons: (1) unstable when the dimension of x is high; (2) computationally heavy and difficult to analyze; (3) hard to visualize.
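A minimal sketch of the kernel-smoothed covariance such an estimator is built on (the Gaussian kernel, a scalar covariate, and all names here are illustrative choices, not the paper's):

```python
import numpy as np

def kernel_covariance(X, Y, x0, bandwidth):
    """Kernel-smoothed covariance of Y near X = x0.
    X: (n,) scalar covariates; Y: (n, p) responses."""
    w = np.exp(-0.5 * ((X - x0) / bandwidth) ** 2)  # Gaussian kernel weights
    w = w / w.sum()
    mu = w @ Y                                      # locally weighted mean
    Yc = Y - mu
    return (Yc * w[:, None]).T @ Yc                 # locally weighted covariance
```

This local covariance would then be fed to the graphical lasso at each query point x0; the instability in high-dimensional x noted above comes from the kernel weights concentrating on very few samples.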

Page 25

Partition Based Estimator

Partition-based estimator: Graph-Optimized CART (Go-CART), building on CART (Classification and Regression Trees) [Breiman 84; Tibshirani et al., 09].

With a graphical-model objective, it is difficult to search for split points directly.

Page 26

Dyadic Partitioning Tree

Dyadic Partitioning Tree (DPT) [Scott and Nowak, 04]: partitions obtained by recursively splitting the input space at coordinate midpoints.

Assumptions and notation: [given on the slide]

Page 27

Graph-Optimized CART (Go-CART)

Go-CART as a penalized risk minimization estimator.

Go-CART as a held-out risk minimization estimator: split the data into training and held-out sets.

Practical algorithm: greedy learning using held-out data.

Page 28

Statistical Property

We do not assume that the underlying partition is dyadic. Oracle risk.

Oracle inequality: bound the excess risk relative to the oracle.

Adding the assumption that the underlying partition is dyadic yields tree-partitioning consistency (the estimator may obtain a finer partition than the truth).

Page 29

Real Climate Data Analysis

Data description: 125 locations in the U.S.; 1990-2002 (13 years); monthly observations of 18 variables/factors [Lozano et al., 09, IBM].

Variables | Type
CO2, CH4, H2, CO | Greenhouse gases
Precipitation (PRE); Vapor (VAP); Cloud Cover (CLD); Wet Days (WET); Frost Days (FRS) | Weather
Avg. Temp. (TMP); Diurnal Temp. Range (DTR); Min. Temp. (TMN); Max. Temp. (TMX) | Temperature
Global Radiation (GLO); Direct Radiation (DIR); Extraterrestrial Global Radiation (ETR); Extraterrestrial Direct Normal Radiation (ETRN) | Solar radiation
Ultraviolet irradiance (UV) | Aerosol index

Page 30

Real Climate Data Analysis

Observations: (1) In the graphical lasso estimate, no edge connects the greenhouse gases (CO2, CH4, CO, H2) with the solar-radiation factors (GLO, DIR), which contradicts the IPCC report; in the Go-CART estimate such edges are present. (2) Graphs along the coasts are sparser than those in the mainland.

Page 31

Proposed Research

Limitations of Go-CART: (1) the conditional Gaussian assumption; (2) only continuous Y (discrete Y requires an approximate likelihood).

Forest graphical model [Chow and Liu, 68; Tan et al., 09; Liu et al., 11]:
- The density involves only univariate and bivariate marginals.
- Compute the mutual information for each pair of variables.
- Greedily learn the tree structure via the Chow-Liu algorithm.
- Handles both continuous and discrete data.

Forest-valued regression.
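The Chow-Liu step above (a maximum-weight spanning tree over pairwise mutual information) can be sketched with Kruskal's algorithm; the mutual-information matrix is assumed precomputed. Illustrative code, not from the thesis.

```python
import numpy as np
from itertools import combinations

def chow_liu_tree(mi):
    """Chow-Liu structure learning: maximum-weight spanning tree over
    pairwise mutual information, via Kruskal's algorithm with union-find.
    mi: (p, p) symmetric matrix of mutual informations."""
    p = mi.shape[0]
    parent = list(range(p))

    def find(i):
        # Union-find root lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # Consider edges in decreasing order of mutual information.
    edges = sorted(combinations(range(p), 2), key=lambda e: -mi[e])
    tree = []
    for i, j in edges:
        ri, rj = find(i), find(j)
        if ri != rj:          # adding (i, j) does not create a cycle
            parent[ri] = rj
            tree.append((i, j))
            if len(tree) == p - 1:
                break
    return tree
```

A forest (rather than a tree) is obtained by also stopping when the next edge's mutual information falls below a threshold, which is what makes the forest density estimator robust for both continuous and discrete data.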

Page 32

Roadmap

- Smoothing Proximal Gradient for Structured Sparse Regression
- Structure Learning in Graphical Models
- Nonparametric Sparse Regression
- Summary and Timeline
- Q & A

Page 33

Nonparametric Regression

From parametric models, to additive models [Hastie et al., 90], to sparse additive models [Ravikumar et al., 09], and on to generalized nonparametric models that capture interactions between variables.

Bottleneck: computation.

Page 34

My Work and Proposed Research

Greedy learning methods:
- Additive Forward Regression (AFR): a generalization of Orthogonal Matching Pursuit [Tropp et al., 06] to the nonparametric setting.
- Generalized Forward Regression (GFR).

Penalized regression tree method.

Proposed research:
- Formulate functional forms for structured penalties.
- Develop efficient algorithms for the corresponding nonparametric structured sparse regression.
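The OMP skeleton that AFR generalizes can be sketched as follows; the nonparametric version would replace the least-squares refit with a smoother. Illustrative code, not from the thesis.

```python
import numpy as np

def omp(X, y, k):
    """Orthogonal Matching Pursuit: greedily add the feature most
    correlated with the current residual, then refit the active set
    by least squares."""
    active = []
    residual = y.copy()
    for _ in range(k):
        j = int(np.argmax(np.abs(X.T @ residual)))  # best new feature
        if j not in active:
            active.append(j)
        coef, *_ = np.linalg.lstsq(X[:, active], y, rcond=None)
        residual = y - X[:, active] @ coef          # refit and update
    return active, coef
```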

Page 35

Roadmap

- Smoothing Proximal Gradient for Structured Sparse Regression
- Structure Learning in Graphical Models
- Nonparametric Sparse Regression
- Summary and Timeline
- Q & A

Page 36

Summary and Timeline


Page 37

Acknowledgements

My Committee Members

Jaime Carbonell (advisor), Tom Mitchell, Larry Wasserman, Robert Tibshirani

Thanks also to: Eric P. Xing, John Lafferty, Seyoung Kim, Manuel Blum, Aarti Singh, Jeff Schneider, Javier Pena, Han Liu, Qihang Lin, Junming Yin, Xiong Liang, Tzu-Kuo Huang, Min Xu, Mladen Kolar, Yan Liu, Jingrui He, Yanjun Qi, Bing Bai

IBM Fellowship

Feedback: Xi Chen ([email protected])