Sparse Canonical Correlation Analysis

by

Elena Parkhomenko

A thesis submitted in conformity with the requirements
for the degree of Doctor of Philosophy

Graduate Department of Public Health Sciences (Biostatistics)
University of Toronto

© Copyright by Elena Parkhomenko, 2008


Abstract

Sparse Canonical Correlation Analysis

Elena Parkhomenko

Doctor of Philosophy

Graduate Department of Public Health Sciences (Biostatistics)

University of Toronto

2008

Large scale genomic studies of the association of gene expression with multiple phenotypic or genotypic measures may require the identification of complex multivariate relationships. In multivariate analysis a common way to inspect the relationship between two sets of variables based on their correlation is Canonical Correlation Analysis, which determines linear combinations of all variables of each type with maximal correlation between the two linear combinations. However, in high dimensional data analysis, when the number of variables under consideration exceeds tens of thousands, linear combinations of the entire sets of features may lack biological plausibility and interpretability. In addition, insufficient sample size may lead to computational problems, inaccurate estimates of parameters and non-generalizable results. These problems may be solved by selecting sparse subsets of variables, i.e. obtaining sparse loadings in the linear combinations of variables of each type. However, available methods providing sparse solutions, such as Sparse Principal Component Analysis, consider each type of variable separately and focus on the correlation within each set of measurements rather than between sets.

We introduce new methodology, Sparse Canonical Correlation Analysis (SCCA), which examines the relationships of many variables of different types simultaneously. It solves the problem of biological interpretability by providing sparse linear combinations that include only a small subset of variables. SCCA maximizes the correlation between the subsets of variables of different types while performing variable selection. In large scale genomic studies sparse solutions also comply with the belief that only a small proportion of genes are expressed under a certain set of conditions. In this thesis I present methodology for SCCA and evaluate its properties using simulated data. I illustrate practical use of SCCA by applying it to the study of natural variation in human gene expression, for which the data were provided as Problem 1 for the fifteenth Genetic Analysis Workshop (GAW15). I also present two extensions of SCCA, adaptive SCCA and modified adaptive SCCA. Their performance is evaluated and compared using simulated data, and adaptive SCCA is applied to the GAW15 data.


Acknowledgements

I would like to thank everybody who made this thesis possible and contributed to its development.

First of all I would like to thank my thesis advisors, Professors David Tritchler and Joseph Beyene, for their tremendous help in developing this thesis. Without their expertise, involvement, sound advice, and constant support this work would not have been possible.

I would like to thank my thesis committee member, Professor Shelley Bull, for her helpful comments, stimulating discussions, and questions that led to a substantially improved thesis.

I am also grateful to the external examiner, Professor Angelo Canty, for his insightful comments, and to Professor Michael Escobar for taking time out of his busy schedule to serve as my internal reader.

I am grateful for the financial support of the Connaught Scholarship, the Ontario Graduate Scholarship, and the University of Toronto Open Fellowship.

I would like to thank fellow students and friends at the University of Toronto. In particular, I wish to express many thanks to Lucia Mirea and Wei Xu for stimulating and entertaining discussions during Friday lunches. I would also like to thank Jane Figueiredo for her great moral support and encouragement.

Last but not least, I would like to thank my husband, Daniil Bunimovich, for his patience and support throughout my studies.

This thesis is dedicated to my daughter Victoria.


Contents

1 Introduction 1

1.1 Motivating examples . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1

1.2 Existing analysis tools . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3

1.3 New methods - Sparse Canonical Correlation Analysis . . . . . . . . . . . 6

1.4 Organization of thesis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7

2 Background 8

2.1 Definitions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8

2.2 Traditional methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11

2.3 Modifications of traditional methods . . . . . . . . . . . . . . . . . . . . 18

3 Methodology 29

3.1 Conventional canonical correlation analysis . . . . . . . . . . . . . . . . . 29

3.2 Sparse canonical correlation analysis . . . . . . . . . . . . . . . . . . . . 30

3.3 Data standardization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33

3.4 Simulation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35

3.5 Convergence . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41

3.6 Sparseness parameter selection . . . . . . . . . . . . . . . . . . . . . . . . 54

4 SCCA evaluation 61

4.1 Evaluation tool - Cross Validation . . . . . . . . . . . . . . . . . . . . . . 62


4.2 Effect of sample size on generalizability . . . . . . . . . . . . . . . . . . . 63

4.3 Effect of the true association between the linear combinations of variables 66

5 Extensions of SCCA 68

5.1 Oracle properties, prediction versus variable selection . . . . . . . . . . . 68

5.2 SCCA extension I: Adaptive SCCA . . . . . . . . . . . . . . . . . . . . . 83

5.3 SCCA extension II: Modified adaptive SCCA . . . . . . . . . . . . . . . . 90

6 Application 110

6.1 Background . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 110

6.2 Materials and Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . 111

6.3 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 116

6.4 Adaptive SCCA Results . . . . . . . . . . . . . . . . . . . . . . . . . . . 119

6.5 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 121

7 Discussion and future work 124

7.1 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 124

7.2 Limitations of simulation studies . . . . . . . . . . . . . . . . . . . . . . 127

7.3 Preliminary filtering . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 129

7.4 More than 2 sources of data . . . . . . . . . . . . . . . . . . . . . . . . . 130

7.5 Computation of variance . . . . . . . . . . . . . . . . . . . . . . . . . . . 130

7.6 Computation of covariance . . . . . . . . . . . . . . . . . . . . . . . . . . 131

7.7 Application of SCCA to the study of Chronic Fatigue Syndrome . . . . . 133

Bibliography 134


List of Tables

3.1 The effect of 4 types of starting values for the singular vectors on SCCA: number of selected X variables vs. simulated correlation . . . 48

3.2 The effect of 4 types of starting values for the singular vectors on SCCA: number of selected Y variables vs. simulated correlation . . . 49

3.3 The effect of 4 types of starting values for the singular vectors on SCCA results: discordance for set X vs. simulated correlation . . . 51

3.4 The effect of 4 types of starting values for the singular vectors on SCCA results: discordance for set Y vs. simulated correlation . . . 52

4.1 Effect of true correlation between the associated subsets of variables in X and Y . . . 67

6.1 Summary of average prediction results for SCCA and CCA for GAW15 data . . . 118

6.2 Summary of average prediction results for SCCA and CCA for reduced GAW15 data . . . 120


List of Figures

2.1 Latent variables in PLS . . . 13

3.1 Simulation: latent variable model . . . 37

3.2 The effect of 4 types of starting values for the singular vectors on SCCA: number of selected X variables vs. simulated correlation . . . 56

3.3 The effect of 4 types of starting values for the singular vectors on SCCA: number of selected Y variables vs. simulated correlation . . . 57

3.4 The effect of 4 types of starting values for the singular vectors on SCCA: number of correctly selected X variables vs. simulated correlation . . . 58

3.5 The effect of 4 types of starting values for the singular vectors on SCCA: number of correctly selected Y variables vs. simulated correlation . . . 59

3.6 Sparseness parameter selection . . . 60

4.1 Sample size effect: test sample correlation vs. sample size . . . 65

5.1 Investigation of model selection for different sample sizes for set X, true correlation 0.5 . . . 72

5.2 Investigation of model selection for different sample sizes for set Y, true correlation 0.5 . . . 73

5.3 Investigation of model selection for different sample sizes for set X, true correlation 0.95 . . . 75

5.4 Investigation of model selection for different sample sizes for set Y, true correlation 0.95 . . . 76

5.5 Model identification versus prediction: test sample correlation versus sparseness parameter combinations, true correlation 0.9 . . . 79

5.6 Model identification versus prediction: average discordance for sets X and Y versus sparseness parameter combinations, true correlation 0.9 . . . 80

5.7 Model identification versus prediction: the number of false positives for set X versus sparseness parameter combinations, true correlation 0.9 . . . 81

5.8 Model identification versus prediction: the number of false negatives for set X versus sparseness parameter combinations, true correlation 0.9 . . . 82

5.9 Compare adaptive SCCA, SCCA and SVD: test sample correlation vs sample size, power of weights in soft-thresholding is 0.5 . . . 86

5.10 Compare adaptive SCCA and SCCA: discordance vs sample size, power of weights in soft-thresholding is 0.5 . . . 87

5.11 Compare adaptive SCCA and SCCA: number of false positives for set X vs sample size, power of weights in soft-thresholding is 0.5 . . . 88

5.12 Compare adaptive SCCA and SCCA: number of false negatives for set X vs sample size, power of weights in soft-thresholding is 0.5 . . . 102

5.13 Adaptive SCCA performance for different powers of weights in soft-thresholding: test sample correlation vs sample size . . . 103

5.14 Adaptive SCCA performance for different powers of weights in soft-thresholding: discordance vs sample size . . . 104

5.15 Modified adaptive SCCA: test sample correlation for adaptive SCCA and SCCA with and without added noise . . . 105

5.16 True false selection rate for adaptive SCCA and SCCA with and without added noise . . . 106

5.17 True false non-selection rate for adaptive SCCA and SCCA with and without added noise . . . 107

5.18 True false selection rate for adaptive SCCA and SCCA with and without added noise, using modified FSR estimate . . . 108

5.19 True false non-selection rate for adaptive SCCA and SCCA with and without added noise, using modified FSR estimate . . . 109

6.1 Selection of the best combination of sparseness parameters for GAW15 . . . 117


Chapter 1

Introduction

1.1 Motivating examples

A growing number of large scale genomic studies focus on establishing the relationships between microarray gene expression profiles and other phenotypes of interest. In some cases there are two or more sets of measurements available for the same subjects. The objective is to determine the association between different sets of variables as well as to discover relevant predictive variables. Thus, in addition to the typical challenges of large scale studies such as dimensionality reduction and data visualization, the focus is also on data integration.

One example is a study of Chronic Fatigue Syndrome (CFS). CFS is a disease that affects a significant proportion of the population in the United States and has a detrimental economic effect on society [Reeves et al., 2005]. Assessment of patients and identification of illness are complicated by the lack of well established characteristic symptoms [Whistler et al., 2005]. The symptoms of CFS are also shared by other neurological illnesses such as multiple sclerosis, sleep disorders, and major depressive disorders. Moreover, the definition of CFS as a unique disease is not obvious, as it may represent a common response to a collection of other illnesses [Whistler et al., 2005]. Establishing well defined measures of CFS is crucial for the assessment of this illness. This was the main purpose of the study conducted by the Centers for Disease Control and Prevention (CDC) in Wichita, KS. To achieve this goal, different sources of data were included in the study: clinical assessment of the patients, microarray gene expression profiles, proteomics measures, and genotyping of selected single nucleotide polymorphisms (SNPs). The data for this study were presented at the Critical Assessment of Microarray Data Analysis (CAMDA) workshop in 2006.

It is important to note that the different measurements are available for the same subjects in the study. The question of interest is establishing the association between the markers and the disease classes (i.e. not affected, CFS, CFS with major depressive disorder, or major depressive disorder only) as well as between different groups of markers. Identifying important biomarkers may lead to a better definition of chronic fatigue syndrome and the development of screening tests. An important observation is that the various measurements taken on the subjects are of different types. The challenge from the statistical point of view is integrating these variables and conducting a unified analysis that focuses on the relationship between the sets of variables rather than within sets.

Another example is the study of natural variation in human gene expression. The data for this study were offered for analysis at the fifteenth Genetic Analysis Workshop (GAW15). Two types of measurements were available for 14 three-generation Centre d'Etude du Polymorphisme Humain (CEPH) families from Utah. One type is gene expression profiles and the other is single nucleotide polymorphisms (SNPs). Several studies have demonstrated that variation in baseline gene expression levels in humans has a genetic component [Cheung et al., 2005, Morley et al., 2004]. Therefore, we are interested in establishing the association between variation in gene expression and variation in SNPs. As in the CAMDA example, the two types of measures available for each subject have different natures: one is continuous gene expression values, while the other relates to SNP genotypes and pedigree structure. An immediate challenge in this context is how to combine these two types of data and find the relationship between them.

In addition to the challenge of combining different types of measures available on the same subjects, these studies also represent good examples of the problem of high-dimensional data analysis and visualization. In both studies thousands of gene expression values are available for a much smaller number of samples. In the CAMDA data set there are almost 20000 gene expression profiles while the number of samples is only 177; there is also extensive clinical and genetic data available for the same subjects. In the GAW15 data there are 3554 expression profiles preselected by Morley et al. [Morley et al., 2004] as well as 2882 SNPs available for 194 subjects. Thus, in each study at least two very large sets of variables have to be analyzed. Traditional approaches such as differential gene expression analysis treat gene expression values as an outcome measure and variables of the other type as covariates while performing the analysis one gene expression at a time [Morley et al., 2004, Whistler et al., 2003]. Thus a separate model is built for every gene. This may be very computationally intensive and does not take into account correlation between gene expressions. Also, in each case the number of variables of each type greatly exceeds the sample size, leading to computational instability. Therefore, dimensionality reduction techniques may be useful for the analysis of these data sets. In conclusion, a unified analysis approach suitable for extensive data and incorporating dimensionality reduction is needed in these studies. Suitable methods should perform data integration to determine the association between different sets of variables. Visualization of identified connections between various types of measures is also of interest, as is biological interpretability of the results.

1.2 Existing analysis tools

Both of these studies demonstrate a need for analysis tools that establish the relationships between sets of measurements available on the same group of subjects as well as reduce the dimensionality. It is also desirable that the obtained results are easy to interpret.

A common approach is to aggregate the original variables into composite latent variables and subsequently find the association between the latent variables corresponding to different sets of measurements. Principal Component Analysis (PCA) is often used in such cases for dimensionality reduction, to model the underlying structure in the data, and to create latent variables. One disadvantage of this approach is that its main objective is to construct composite measures that maximize the variance within the sets of variables rather than consider the correlation between sets; PCA works with one set of variables at a time. Another disadvantage is that the created latent variables, called principal components, are linear combinations of the entire set of variables under consideration. In large scale studies similar to the examples presented earlier, these composite measures aggregating thousands of variables may lack interpretability and may be difficult to visualize. For example, in the study of natural variation in human gene expression one may be interested in identifying genetic pathways and regulatory regions that could generate hypotheses for further testing. In that case a solution representing a possible pathway that contains all available genes is neither interpretable from a biological perspective nor useful for hypothesis generation.

A solution to the first disadvantage of PCA may be obtained by considering the method of Canonical Correlation Analysis (CCA). This approach is used when there are two sets of measures available on the sample, to identify linear combinations of variables in these sets that have maximal correlation. Thus, the focus is on the relationships between variables in different data sets rather than within. However, this analysis method also suffers from the problem of poor interpretability of linear combinations of thousands of variables in high dimensional data.

Recently developed Sparse Principal Component Analysis (SPCA) methods [Zou et al., 2004, Johnstone and Lu, 2004] address the problem of biological interpretability and dimensionality reduction by incorporating variable selection, which results in sparse solutions. Sparse principal components contain only a small subset of the original variables. In the study of Chronic Fatigue Syndrome this method could identify clinical symptoms characteristic of CFS patients. However, SPCA cannot determine the relationships between the clinical data and gene expression profiles since, similarly to PCA, it analyzes one set of variables at a time.

An important observation is that all three analysis approaches described above (PCA, CCA, SPCA) use the same tool: the singular value decomposition (SVD) of some matrix. In the case of PCA and SPCA this matrix is the $n \times p_i$ data matrix $X_i$ corresponding to the $i$th set of variables. In the case of CCA this matrix is $K = \Sigma_{XX}^{-1/2} \Sigma_{XY} \Sigma_{YY}^{-1/2}$, where $\Sigma_{XX}$ and $\Sigma_{YY}$ are the variance matrices for the sets of variables X and Y and $\Sigma_{XY}$ is the covariance matrix. Thus the difference between CCA and the PCA/SPCA methods is determined by the matrices used in the SVD, which is dictated by the different goals of the analysis. In CCA we are interested in the correlation between the sets of variables, so the matrix K incorporates the information on all sets; in this case we are also interested in both left and right singular vectors. On the other hand, in principal components methods each set of measurements is considered independently, therefore the SVD is performed on the data matrix for each set separately, and we are interested in the right singular vectors only. The difference between Sparse PCA and conventional PCA is achieved by a variation in the SVD [Zou et al., 2004]: the SPCA method uses a sparse singular value decomposition that results in sparse right singular vectors. Sparsity means that the loadings corresponding to some variables in the set are zero, which is equivalent to variable selection. To summarize, the SVD is the key statistical tool used in the existing methodology, and it is also employed in the development of the new method described in this thesis.

A different, non-SVD-based approach often used in genomic and genetic studies, such as the study of natural variation in human gene expression, is applying association or linkage analysis one gene expression at a time. This ignores the correlation between measures of gene expression. In addition, an insufficient sample size leads to computational problems, inaccurate estimates of parameters, and non-generalizable results. In large data sets similar to the GAW15 data there may be thousands of gene expression profiles under consideration as well as thousands of SNPs. In these cases this analysis approach may also be very computationally intensive, and the interpretation of the results may be unclear. For instance, several gene expressions may be linked to similar groups of SNPs, where similarity means that there is a non-empty intersection between the sets of SNPs. Then the question arises whether this intersection set should be considered in relationship to the whole group of gene expressions linked to the selected SNPs, and whether these gene expressions belong to genes in the same pathway.

1.3 New methods - Sparse Canonical Correlation Analysis

In this thesis I present a new method: Sparse Canonical Correlation Analysis (SCCA). SCCA allows the analysis of several sets of variables simultaneously in order to establish the relationships between them. It also reduces dimensionality to improve biological interpretability and to foster the generation of new hypotheses. Thus, SCCA addresses both disadvantages of PCA described earlier at the same time. It is applicable in large scale studies where the number of variables in each set may greatly exceed the number of samples, and it is computationally efficient. Similarly to the existing analysis methods, SCCA employs the singular value decomposition. In order to determine the relationships between different sets of variables I consider the SVD of the matrix K described above and seek both left and right singular vectors. I present an efficient algorithm for sparse SVD that produces sparse solutions for the singular vectors on both sides of the SVD at the same time (in contrast to the SPCA of Zou et al. [Zou et al., 2004], which focuses only on the right side and, therefore, can produce a sparse solution for one set of variables only). Thus, SCCA seeks sparsity in both sets of variables simultaneously. It incorporates variable selection and produces linear combinations of small subsets of variables from each group of measurements with maximal correlation. Established relationships between small sets of variables are easier to interpret from a biological perspective than linear combinations of the entire sets of variables, and they may be used for hypothesis generation. In genetic and genomic studies, sets of covariates and response variables with sparse loadings also comply with the belief that only a small proportion of genes are expressed under a certain set of conditions.

1.4 Organization of thesis

In chapter 2 I give brief definitions of some statistical concepts frequently used in this thesis. I also describe existing methodology and discuss its disadvantages. The details of the new Sparse Canonical Correlation Analysis method are presented in chapter 3. I first describe the traditional Canonical Correlation Analysis that serves as the basis for the developed techniques. Then I present an efficient algorithm for SCCA and discuss its details, including the effect of starting values for the algorithm and the selection of tuning parameters. Chapter 4 describes the evaluation of SCCA performance by cross-validation and gives details of the data simulation approach used to evaluate SCCA. It also demonstrates the sample size effect and a sensitivity/specificity analysis. Chapter 5 discusses the oracle properties of an analysis tool and evaluates these properties for SCCA. I follow by presenting extensions of SCCA aimed at improving the oracle properties and discuss their performance. An illustration of the developed methods using a real data set from the study of natural variation in human gene expression is given in chapter 6. Chapter 7 contains the discussion and comparisons of SCCA and its extensions. The thesis concludes with a brief discussion of possible directions for future work.


Chapter 2

Background

In this thesis I introduce a new method called Sparse Canonical Correlation Analysis (SCCA), which examines the relationships between different types of measurements taken on the same subjects. Simultaneous analysis of several data sets is performed to allow data integration. The obtained sparse solution reduces dimensionality and aids in data visualization. Prior to introducing the new techniques and describing the existing methodology, some definitions of frequently used statistical terms are presented.

2.1 Definitions

Latent variables:
Latent variables are variables that are inferred from other, observed variables; they are not measured directly. Latent variables are hypothesized to be associated with the underlying model for the observed variables. They are also known as hidden variables or model parameters. Latent variables are often used for visualization of high-dimensional data through the aggregation of a large number of measured variables into a model, which allows dimensionality reduction.

Eigenvectors and eigenvalues:
Let $X$ be an $n \times n$ real square matrix. Then the $n \times 1$ vector $v$ is a right eigenvector of $X$ and $\lambda$ is the corresponding eigenvalue if $Xv = \lambda v$. Similarly, a left eigenvector $u$ is defined as satisfying the equation $u'X = \lambda u'$.

Singular value decomposition:
Let $X$ be an $m \times n$ real matrix. Then it can be represented as $X = U \Lambda V'$, where $U$ is an $m \times m$ orthogonal matrix, $V$ is an $n \times n$ orthogonal matrix, and $\Lambda$ is an $m \times n$ diagonal matrix with non-negative diagonal elements $\lambda_i$, $i = 1, \ldots, \min(m,n)$. The first $\min(m,n)$ columns of $U$ and $V$ are the left and right singular vectors, respectively, and $\lambda_i$, $i = 1, \ldots, \min(m,n)$, are the corresponding singular values. Note that the left singular vectors of $X$ are the eigenvectors of $XX'$, while the right singular vectors are the eigenvectors of $X'X$. The eigenvalues of $XX'$ and $X'X$ are equal, and they are equal to the squared singular values of $X$.
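As a quick numerical illustration (mine, not the thesis's), the stated relationships between singular values and eigenvalues can be checked directly with numpy:

```python
# Minimal sketch: eigenvalues of XX' and X'X equal the squared
# singular values of X, as stated in the definition above.
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((5, 3))                     # m x n real matrix
U, lam, Vt = np.linalg.svd(X, full_matrices=False)  # X = U diag(lam) V'

k = min(X.shape)
# eigvalsh returns eigenvalues in ascending order; reverse to descending
assert np.allclose(np.linalg.eigvalsh(X @ X.T)[::-1][:k], lam**2)
assert np.allclose(np.linalg.eigvalsh(X.T @ X)[::-1][:k], lam**2)
```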

Linear combination of vectors:
Let $x_1, \ldots, x_k$ be a set of $k$ $n \times 1$ vectors. Then the $n \times 1$ vector $c_x$ is a linear combination of these vectors if $c_x = \alpha_1 x_1 + \ldots + \alpha_k x_k$ for some real constants $\alpha_1, \ldots, \alpha_k$, which are usually called weights or loadings.

Principal component analysis:
Principal component analysis is often used for dimensionality reduction of the data, to create composite measures that are equivalent to latent variables, and to detect the underlying structure of the data. It determines linear combinations of the original variables (principal components) that capture maximal variance. Principal components can be obtained using the singular value decomposition of the data matrix $X$.

Let $X$ be an $m \times n$ data matrix ($m$ observations on $n$ variables). Assume that the columns of $X$ have means equal to 0 and consider the singular value decomposition of $X$:
$$X = UDV'$$
Then the columns of $U$ are called the principal components and the columns of $V$ are the corresponding variable weights in the principal components. The principal components are mutually uncorrelated. The percentage of variance explained by the principal component $u_i$ is equal to $d_i^2 / \sum_{j=1}^{r} d_j^2$, where $d_i$, $i = 1, \ldots, r$, are the diagonal elements of $D$ and $r$ is the rank of $X$.
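A minimal numpy sketch of this recipe, assuming a column-centred data matrix (the variable names are illustrative only):

```python
# PCA via the SVD of the centred data matrix, as described above.
import numpy as np

rng = np.random.default_rng(1)
X = rng.standard_normal((100, 4))
X = X - X.mean(axis=0)                   # columns of X have mean 0

U, d, Vt = np.linalg.svd(X, full_matrices=False)
pcs = U                                  # columns of U: principal components
weights = Vt.T                           # columns of V: variable weights
var_explained = d**2 / np.sum(d**2)      # d_i^2 / sum_j d_j^2
```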

Canonical correlation analysis:
Canonical correlation analysis is used when there are two sets of variables available on the sample, to identify linear combinations of variables in these sets that have maximal correlation. Suppose $X$ is an $m \times p$ data matrix corresponding to $m$ observations of $p$ variables of one type and $Y$ is an $m \times q$ data matrix corresponding to $m$ observations of $q$ variables of the other type. It is important to emphasize that both sets of variables X and Y are taken on the same group of $m$ subjects. Then the linear combinations of variables from X and Y that have maximum correlation can be obtained by considering the singular value decomposition of the matrix
$$K = \Sigma_{XX}^{-1/2} \Sigma_{XY} \Sigma_{YY}^{-1/2} = UDV'$$
where $\Sigma_{XX}$ and $\Sigma_{YY}$ are the variance matrices for sets X and Y and $\Sigma_{XY}$ is the covariance matrix. The canonical vectors, i.e. the weights in the linear combinations of the two sets of variables that have the largest correlation, are
$$a = \Sigma_{XX}^{-1/2} u_1 \quad \text{and} \quad b = \Sigma_{YY}^{-1/2} v_1$$
where $u_1$ and $v_1$ are the first left and right singular vectors, respectively. The canonical variables are the derived variables $\eta = a'x$ and $\phi = b'y$.
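The following sketch traces this definition directly, assuming numpy and sample covariance estimates; `inv_sqrt` is a hypothetical helper for the inverse symmetric square root, not part of the thesis:

```python
# Conventional CCA via the SVD of K = Sxx^{-1/2} Sxy Syy^{-1/2}.
import numpy as np

def inv_sqrt(S):
    # inverse symmetric square root of a positive definite matrix
    w, V = np.linalg.eigh(S)
    return V @ np.diag(1.0 / np.sqrt(w)) @ V.T

rng = np.random.default_rng(2)
m, p, q = 200, 4, 3                      # m >> p, q so covariances invert
X = rng.standard_normal((m, p)); X -= X.mean(0)
Y = rng.standard_normal((m, q)); Y -= Y.mean(0)

Sxx, Syy, Sxy = X.T @ X / m, Y.T @ Y / m, X.T @ Y / m
K = inv_sqrt(Sxx) @ Sxy @ inv_sqrt(Syy)
U, D, Vt = np.linalg.svd(K)

a = inv_sqrt(Sxx) @ U[:, 0]              # canonical vector for X
b = inv_sqrt(Syy) @ Vt[0]                # canonical vector for Y
rho = np.corrcoef(X @ a, Y @ b)[0, 1]    # first canonical correlation = D[0]
```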

Vector length:
Let $v = (v_1, \ldots, v_p)'$ be a $p \times 1$ vector. Then its length, denoted by $|v|$, is
$$|v| = \Big( \sum_{i=1}^{p} v_i^2 \Big)^{1/2}$$

L1 norm:
Let $v = (v_1, \ldots, v_p)'$ be a $p \times 1$ vector. Then its $L_1$ norm, denoted by $|v|_1$, is
$$|v|_1 = \sum_{i=1}^{p} |v_i|$$


L2 norm:
Let $v = (v_1, \ldots, v_p)'$ be a $p \times 1$ vector. Then its $L_2$ norm, denoted by $|v|_2$, is
$$|v|_2 = \Big( \sum_{i=1}^{p} |v_i|^2 \Big)^{1/2}$$

Hard thresholding:
For a given thresholding parameter $\delta$, hard thresholding of a variable $x$ is given by $\eta_H(x, \delta) = x \, I\{|x| \ge \delta\}$.

Soft thresholding:
For a given thresholding parameter $\delta$, soft thresholding of a variable $x$ is given by $\eta_S(x, \delta) = (|x| - \delta)_+ \, \mathrm{Sign}(x)$, where $(x)_+$ is equal to $x$ if $x \ge 0$ and 0 if $x < 0$, and
$$\mathrm{Sign}(x) = \begin{cases} -1 & \text{if } x < 0, \\ 1 & \text{if } x > 0, \\ 0 & \text{if } x = 0. \end{cases}$$
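Both rules translate directly into code; a short sketch assuming numpy, with function names of my own choosing:

```python
import numpy as np

def hard_threshold(x, delta):
    """eta_H(x, delta) = x * I{|x| >= delta}, applied elementwise."""
    return x * (np.abs(x) >= delta)

def soft_threshold(x, delta):
    """eta_S(x, delta) = (|x| - delta)_+ * Sign(x), applied elementwise."""
    return np.maximum(np.abs(x) - delta, 0.0) * np.sign(x)

x = np.array([-1.5, -0.2, 0.0, 0.3, 2.0])
print(hard_threshold(x, 0.5))   # [-1.5  0.   0.   0.   2. ]
print(soft_threshold(x, 0.5))   # [-1.   0.   0.   0.   1.5]
```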

2.2 Traditional methods

Methods for simultaneous analysis of two data sets originated several decades ago. A comprehensive methodology for studying relationships between two sets of variables was developed by H. Wold in 1970. This collection of techniques is known as Partial Least Squares (PLS). Many variations of the general algorithm were later developed to suit applications in different disciplines, including econometrics, psychology, and the medical sciences. One of the variations is Canonical Correlation Analysis (CCA), which will be discussed in greater detail later. A comprehensive survey of PLS methods is provided in a technical report by J. A. Wegelin [Wegelin].


Partial Least Squares

Wegelin [Wegelin] summarizes and compares several approaches for the two-block case (dealing with two types of measurements) of the Partial Least Squares method. These are special cases of the more general Wold algorithm applicable to k blocks of data, where k may be greater than 2 [Wold, 1975, 1982, 1985].

PLS methods are used to study the association between groups of variables. In the two-block case there is a sample with two sets of measurements X and Y, with I and J variables respectively, and the same N samples. Let columns in the matrices represent variables and rows represent samples. The association between variables is recovered by considering the covariance matrix $X'Y$ and using latent variables. The author states that PLS techniques are especially useful when the columns of X or Y are collinear or when the number of variables exceeds the number of samples ($I > N$ or $J > N$).

There are two modes in which the general Wold algorithm for PLS can be applied: mode A and mode B. Mode A is presented by Wegelin as the PLS-C2A, PLS-SB and PLS1/PLS2 approaches, whereas mode B of the general algorithm is equivalent to the canonical correlation analysis of Hotelling [Hotelling, 1936]. Both modes can be used to study the association between the groups of variables and provide their linear combinations, also referred to as latent variables. However, the interpretation of the coefficients in these linear combinations differs. Another difference between the two modes of the algorithm is in the measure of association: in mode A the covariance between linear combinations of variables, $\mathrm{Cov}(Xu, Yv)$, is maximized, while in CCA (mode B) the correlation $\mathrm{Cor}(Xu, Yv)$ is maximized. Both mode A and mode B of the general Wold algorithm are presented in greater detail below. Two of the variations of mode A (PLS-C2A and PLS-SB) are more relevant to the work presented in this thesis and are therefore discussed, while the discussion of the other variants (PLS1, PLS2) is omitted.


Approach 1: Mode A.

Two-block Mode A Partial Least Squares (PLS-C2A).

This approach is used when one needs to model the covariance between two sets of variables X and Y. In this case the objective is not to study the covariance structure within the sets X and Y, but rather to study the covariance between them. It should also be emphasized that the measure of association between groups of variables considered in this approach is covariance, as opposed to correlation. The modeling is performed by utilizing pairs of latent variables $(\xi_1, \omega_1), \ldots, (\xi_R, \omega_R)$, where $\xi_i$ and $\omega_i$ are the latent variables for sets X and Y, respectively, and R is called the rank and is determined as part of model selection.

In the simplest case, when there is only one latent variable for each data set, the association can be presented schematically as shown in Figure 2.1: the variables $x_1, \ldots, x_I$ in set X are controlled by $\xi$, the variables $y_1, \ldots, y_J$ in set Y are controlled by the latent variable $\omega$, and the two latent variables are associated with each other.

[Figure 2.1: Simple latent variable model.]

We are modeling the covariance $d_1 = \mathrm{Cov}(\xi_1, \omega_1)$. The latent variables are represented by linear combinations of the original variables, i.e. $Xu$ and $Yv$. Thus, we are maximizing the covariance
$$d_1 = \mathrm{Cov}(\xi_1, \omega_1) = \max_{\|u\| = \|v\| = 1} \mathrm{Cov}(Xu, Yv)$$

The vectors of coefficients in the linear combinations of variables can then be obtained as the first singular vectors $u_1$ and $v_1$ from the singular value decomposition (SVD) of the matrix $X'Y$; $d_1 u_1 v_1'$ then provides the best rank-one approximation of $X'Y$ [Wegelin, Harville, 1997].

J. Wegelin provides an iterative algorithm for the estimation of the pairs $(u_i, v_i)$ for $i = 1, \ldots, R$, where R is the desired size of the model. The algorithm is initialized by setting $X^{(1)} = X$ and $Y^{(1)} = Y$. At each step $i$ a pair of coefficient vectors for the linear combinations of variables, $u_i$ and $v_i$, is estimated by the first left and right singular vectors of $X^{(i)\prime} Y^{(i)}$, respectively. Subsequently, the matrices $X^{(i)}$ and $Y^{(i)}$ are regressed on the latent variables $\xi_i = X^{(i)} u_i$ and $\omega_i = Y^{(i)} v_i$ to get the best rank-one approximations as follows:
$$X^{(i)} = \xi_i \beta_X' + e_X$$
$$\hat\beta_X' = (\xi_i' \xi_i)^{-1} \xi_i' X^{(i)}$$
$$\hat X^{(i)}(\xi_i) = \xi_i \hat\beta_X' = \xi_i (\xi_i' \xi_i)^{-1} \xi_i' X^{(i)}$$
Similarly for Y:
$$Y^{(i)} = \omega_i \beta_Y' + e_Y$$
$$\hat\beta_Y' = (\omega_i' \omega_i)^{-1} \omega_i' Y^{(i)}$$
$$\hat Y^{(i)}(\omega_i) = \omega_i \hat\beta_Y' = \omega_i (\omega_i' \omega_i)^{-1} \omega_i' Y^{(i)}$$
The residuals after subtracting the best rank-one approximations $\hat X^{(i)}(\xi_i)$ and $\hat Y^{(i)}(\omega_i)$ from the current data matrices $X^{(i)}$ and $Y^{(i)}$ are then used as the new data matrices in the next step of the algorithm:
$$X^{(i+1)} = X^{(i)} - \hat X^{(i)}(\xi_i)$$
$$Y^{(i+1)} = Y^{(i)} - \hat Y^{(i)}(\omega_i)$$
The procedure is repeated until all R pairs of latent variables $(\xi_i, \omega_i)$, $i = 1, \ldots, R$, are obtained.
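To make the iteration concrete, here is a compact sketch of PLS-C2A as just described, assuming numpy; `pls_c2a` is a hypothetical name, not code from Wegelin or the thesis:

```python
# PLS-C2A: leading singular pair of X'Y at each step, followed by
# rank-one deflation of each block by its own latent score.
import numpy as np

def pls_c2a(X, Y, R):
    Xi, Yi = X.copy(), Y.copy()
    us, vs = [], []
    for _ in range(R):
        U, d, Vt = np.linalg.svd(Xi.T @ Yi)
        u, v = U[:, 0], Vt[0]                 # first singular pair
        xi, om = Xi @ u, Yi @ v               # latent variables xi_i, omega_i
        # residuals after regressing each block on its latent score
        Xi = Xi - np.outer(xi, xi @ Xi) / (xi @ xi)
        Yi = Yi - np.outer(om, om @ Yi) / (om @ om)
        us.append(u); vs.append(v)
    return np.array(us).T, np.array(vs).T
```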

In this approach the coefficient vectors $u_i$ of the R linear combinations of variables for set X (i.e. of the latent variables $\xi_i$) are orthogonal, and the same is true for the $v_i$. However, they are not necessarily the singular vectors of $X'Y$. Also, the $\xi_i$ are mutually orthogonal, as are the $\omega_i$. The obtained latent variables maximize the covariance of interest:
$$|\mathrm{Cov}(\xi_i, \omega_i)| = |\mathrm{Cov}(X^{(i)} u_i, Y^{(i)} v_i)| \propto d_i = \max_{\|u\| = \|v\| = 1} |\mathrm{Cov}(X^{(i)} u, Y^{(i)} v)|$$
Here $d_i$ is not the $i$th singular value in the SVD of $X'Y$ but rather the first singular value in the SVD of $X^{(i)\prime} Y^{(i)}$.

The $k$th element of $u_i$, $k = 1, \ldots, I$, can be interpreted as being proportional to the covariance between the $k$th variable in set X at step $i$ and the $i$th latent variable for set Y, i.e.
$$u_{ki} = \frac{N}{d_i} \mathrm{Cov}(X^{(i)}_{\cdot k}, \omega_i) \qquad (2.1)$$
The coefficients of $v_i$ are interpreted in a similar fashion:
$$v_{ki} = \frac{N}{d_i} \mathrm{Cov}(Y^{(i)}_{\cdot k}, \xi_i), \qquad k = 1, \ldots, J \qquad (2.2)$$
Another property of the coefficients for linear combinations of variables obtained by PLS-C2A is that if an observed variable $x_{\cdot j}$ is added to or removed from the set X, this leads only to a small change in $u_{ki}$, $k \neq j$.

PLS-SB.
This approach, presented by Wegelin, was originally developed by Sampson et al. [Sampson et al., 1989] and Streissguth et al. [Streissguth et al., 1993]. It provides a solution identical to PLS-C2A if only one pair of latent variables $(\xi_1, \omega_1)$ is considered. However, the additional pairs of latent variables differ between the two methods. The source of the difference is in how the data matrices $X^{(i)}$ and $Y^{(i)}$ are updated before proceeding to step $i + 1$. PLS-C2A updates $X^{(i)}$ and $Y^{(i)}$ separately by taking the residuals after removing the effects of the $i$th latent variables, $X^{(i)} - \hat X^{(i)}(\xi_i)$ and $Y^{(i)} - \hat Y^{(i)}(\omega_i)$. PLS-SB updates the cross-product $X^{(i)\prime} Y^{(i)}$ by subtracting $d_i u_i v_i'$. Hence, the coefficients in the linear combinations of variables represented by the latent variables $(\xi_i = X u_i, \omega_i = Y v_i)$ are the singular vectors of $X'Y$, and we obtain the representation $X'Y = \sum_{i=1}^{R} d_i u_i v_i'$. In this case the latent scores $\xi_i$ and $\omega_i$ are not orthogonal. Also, the $u_i$ and $v_i$ vectors obtained by the PLS-C2A and PLS-SB methods are not equal, since $X^{(2)\prime} Y^{(2)} \neq X'Y - d_1 u_1 v_1'$ (for step 2, and similarly for the other steps), where $X^{(2)}$ and $Y^{(2)}$ are the updated data matrices at the end of the first step of PLS-C2A.
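For contrast, a sketch of the PLS-SB update under the same assumptions (hypothetical name again): deflating the cross-product matrix itself means the returned pairs are exactly the first R singular triples of $X'Y$, consistent with the representation above.

```python
# PLS-SB: deflate the cross-product X'Y by d_i u_i v_i' at each step.
import numpy as np

def pls_sb(X, Y, R):
    M = X.T @ Y
    us, vs, ds = [], [], []
    for _ in range(R):
        U, d, Vt = np.linalg.svd(M)
        u, v = U[:, 0], Vt[0]
        M = M - d[0] * np.outer(u, v)         # subtract d_i u_i v_i'
        us.append(u); vs.append(v); ds.append(d[0])
    return np.array(us).T, np.array(vs).T, np.array(ds)
```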

Approach 2: Mode B PLS.

Canonical correlation analysis represents mode B of Wold's general PLS algorithm. The major difference from mode A is in the way the coefficients in the linear combinations of variables, i.e. u and v, are computed. It is evident that in mode A, given the latent variables $\xi$ and $\omega$, the coefficients can be obtained by separate linear regressions. The original Wold algorithm can be used to study the relationships between multiple sources of data (blocks of variables); however, if we consider the simplest case of only two data sets X and Y, then the mode A regressions can be described as follows.

Consider only the first singular vectors, and let X and Y be data sets containing N observations each and I and J variables, respectively. Let $\xi_i$ and $\omega_i$ be the vectors of latent variables associated with data sets X and Y, respectively, at step $i$ of the algorithm; each vector is of length N. Also let $u_i$ and $v_i$ be the vectors of coefficients in the linear combinations of variables from X and Y at step $i$. If $X_{\cdot j}$ (or $Y_{\cdot j}$) denotes the $j$th column of matrix X (or Y), i.e. the $j$th variable, then $u_i$ and $v_i$ are estimated by fitting I and J simple linear regressions:
$$X_{\cdot j} \sim u_{ji} \omega_i, \quad j = 1, \ldots, I$$
$$Y_{\cdot j} \sim v_{ji} \xi_i, \quad j = 1, \ldots, J$$
where $u_{ji}$ and $v_{ji}$ are the $j$th coefficients in $u_i$ and $v_i$, respectively. If the data sets X and Y have been standardized, then the inclusion of an intercept in the regressions above is not necessary.

This is equivalent to obtaining the left and right singular vectors of $X'Y$, as shown in Wegelin [Wegelin]. In other words, the u and v obtained by fitting these independent regression models are the eigenvectors of $X'YY'X$ and $Y'XX'Y$, respectively.

In mode B, at the $i$th step of the algorithm the coefficients in the linear combinations of variables $u_i$ and $v_i$ are computed by performing multiple regression:
$$\omega_i \sim X u_i$$
$$\xi_i \sim Y v_i$$
Thus in mode B the coefficients are calculated simultaneously. The multiple regression solutions u and v are equivalent to the eigenvectors of $(X'X)^{-1} X'Y (Y'Y)^{-1} Y'X$ and $(Y'Y)^{-1} Y'X (X'X)^{-1} X'Y$, respectively.
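A small numerical sketch of the mode B direction for set X, assuming numpy, column-centred data, and invertible Gram matrices:

```python
# Leading eigenvector of (X'X)^{-1} X'Y (Y'Y)^{-1} Y'X gives the
# mode B (CCA) coefficient vector for set X.
import numpy as np

rng = np.random.default_rng(3)
X = rng.standard_normal((100, 5)); X -= X.mean(0)
Y = rng.standard_normal((100, 4)); Y -= Y.mean(0)

M = np.linalg.solve(X.T @ X, X.T @ Y) @ np.linalg.solve(Y.T @ Y, Y.T @ X)
w, V = np.linalg.eig(M)
u = np.real(V[:, np.argmax(np.real(w))])   # leading eigenvector
```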

The author points out two major differences between the mode A and mode B solutions:

• The values in the vectors of coefficients u and v are interpreted differently. In mode A, $u_i$ indicates the covariance between the $i$th variable in set X and the latent variable for set Y, as shown in equation (2.1). Also, as stated above, adding or removing variables has little effect on $u_i$. In mode B, however, these values should be interpreted similarly to the coefficients in a multiple regression and are affected by the addition of other variables to the data set X.

• The multiple regression solution requires computation of the inverses of $X'X$ and $Y'Y$. When the number of variables exceeds the number of observations these inverses do not exist, which limits applicability to high dimensional data.

To summarize, the PLS algorithm can be viewed as two nested loops, called the inner loop and the outer loop [Wegelin]. The inner loop provides the estimates for the coefficients in the linear combinations of variables from sets X and Y. The approaches used to obtain these coefficients differentiate mode A (a separate simple linear regression model for each coefficient) from mode B (a multiple regression model for all coefficients). The outer loop is responsible for obtaining multiple sets of latent variables associated with X and Y. Thus, if we are only interested in the first singular vectors (or only one latent variable per data set), then the outer loop is not necessary. The different options available for the outer loop are reflected in the PLS-C2A algorithm as compared to PLS-SB.

In traditional PLS methods no variable selection is performed: the linear combinations of variables (or latent variables) contain the entire set of available variables for data sets X and Y. In microarray studies or genome-wide linkage studies the number of variables under consideration may exceed tens of thousands, so linear combinations of all available variables may not be biologically interpretable. Also, these complete linear combinations may include many noise variables that can affect the results. Therefore, some variable selection is advisable in certain circumstances. This modification can be added to the inner loop of the PLS algorithm to obtain sparse linear combinations of variables by setting some coefficients in u and v to 0. Some possible approaches to variable selection are described in several papers reviewed below. Another solution is sparse canonical correlation analysis, the new approach presented in this thesis.

2.3 Modifications of traditional methods

Canonical correlation analysis can be considered similar to principal component analysis (PCA) from the perspective of data preprocessing and dimensionality reduction. In CCA we are interested in identifying subsets of variables from two data sets that have the highest correlation, while in PCA we are seeking a subset of variables from one data set that maximizes the variance. Solutions for both the CCA and PCA problems can be obtained by considering the singular value decomposition (SVD) of a certain matrix: for CCA this matrix involves the covariance between the two data sets, while for PCA it is the given data matrix itself. Thus, both problems reduce to obtaining a singular value decomposition.

In studies of large complex data sets the disadvantage of both canonical correlation analysis and principal component analysis is that the solutions (canonical vectors and principal component vectors, respectively) are linear combinations of all the original variables and may lack interpretability. Therefore, we are interested in identifying linear combinations of small subsets of the original variables. This can be viewed as a form of variable selection. Sparse solutions are obtained by considering a sparse singular value decomposition.

Sparse singular value decomposition is the main focus of the available modifications to traditional methods for PCA. These modifications may also be adapted to obtain sparse solutions for CCA. The simplest ad hoc approach, usually referred to as simple thresholding, is to set the loadings of principal component (or canonical) vectors that are small in absolute value to zero. However, it has been shown to have a number of disadvantages [Cadima and Jolliffe, 1995, Zou et al., 2004]. Zou et al. [Zou et al., 2004] propose obtaining sparse solutions using the lasso and elastic net techniques for variable selection. Johnstone and Lu [Johnstone and Lu, 2004] perform variable selection prior to applying principal component analysis. These two approaches are described in greater detail below.

Sparse Principal Component Analysis: Approach I

Johnstone and Lu [Johnstone and Lu, 2004] propose an algorithm that performs variable selection prior to applying principal component analysis (PCA). The authors show that the estimate of the first principal component is consistent if and only if
$$c = \lim_{n \to \infty} p(n)/n = 0 \qquad (2.3)$$
Therefore, in the case of high dimensionality and small sample size ($p \gg n$) there is a need to reduce dimensionality and perform variable selection before applying PCA. The paper presents an algorithm that performs PCA on a subset of selected coordinates.

The algorithm:
Suppose we have an $n \times p$ data matrix X, with the number of variables $p$ possibly larger than the number of observations $n$:
$$X = \begin{pmatrix} x_1' \\ \vdots \\ x_n' \end{pmatrix}$$

1. Select a basis $\{e_\nu\}$ in $R^p$ and find the representation of the data X in this basis, i.e. find the coordinates $x_{i\nu}$ for each observation $x_i'$:
$$x_i = \sum_{\nu=1}^{p} x_{i\nu} e_\nu, \quad i = 1, 2, \ldots, n$$
Here $x_i$ and $e_\nu$ are $p \times 1$ vectors. "Replace" the original data set with the set of coordinates $(x_{i\nu})$: $X \longrightarrow X_{\{e_\nu\}}$.

2. Select a subset of $k$ indices $I = \{\nu_1, \ldots, \nu_k\}$ with the highest coordinate variances $\sigma_\nu^2 = \mathrm{Var}(x_{i\nu})$. Reduce the coordinate data set to include only the $k \le p$ selected basis vectors: $X_{\{e_\nu\}} \longrightarrow X_I$.

3. Apply standard PCA to the reduced coordinate data set $X_I$ to obtain $k$ eigenvectors $\rho_j$, $j = 1, \ldots, k$, where the $\rho_j$ are $k \times 1$ vectors.

4. Make the estimated eigenvectors sparse by using hard thresholding (given by $x \, I(|x| \ge \delta)$ for $x$) with some preselected threshold parameter $\delta$.

5. Obtain the eigenvectors as linear combinations of the original variables by returning to the original domain:
$$\rho_j = \sum_{\nu \in I} \rho_{j\nu} e_\nu$$
where the $\rho_j$ are the eigenvectors in the original domain, i.e. linear combinations of the original variables.
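A compressed sketch of the whole recipe, assuming numpy and taking the identity basis in step 1 for simplicity (the paper's motivating choice is a wavelet basis, discussed below); the helper name is hypothetical:

```python
# Johnstone-Lu: keep the k highest-variance coordinates, run PCA on
# the reduced data, hard-threshold the leading eigenvector, and map
# the result back to the original coordinates.
import numpy as np

def select_then_pca(X, k, delta):
    keep = np.argsort(X.var(axis=0))[::-1][:k]   # k largest variances
    Xk = X[:, keep] - X[:, keep].mean(axis=0)
    _, _, Vt = np.linalg.svd(Xk, full_matrices=False)
    rho = Vt[0] * (np.abs(Vt[0]) >= delta)       # hard thresholding
    out = np.zeros(X.shape[1])
    out[keep] = rho                              # back to original domain
    return out
```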

The motivation for this algorithm is the assumption that there exists a basis in which the original signal has a sparse representation. Johnstone and Lu argue that a good candidate is a wavelet basis. Wavelet bases are known for their good representation of functions with few peaks, such as a mixture of several distributions with small variances, or a step function. In studies dealing with large data sets where the number of variables greatly exceeds the number of samples, it is often expected that the true signal is measured by only a few variables and the rest are noise. In these circumstances wavelet functions offer a good representation of the signal and separation of the signal from noise. The approach of Johnstone and Lu achieves consistency in the estimation of the principal components. However, once the sparse eigenvectors are transformed into the original signal domain, they may not remain sparse. Thus, all of the original variables will be included in the linear combinations represented by the eigenvectors. Therefore, the principal components may lack biological interpretability and still include additional noise from the variables that do not measure the signal of interest. If the focus is on sparse representation and we choose to keep the principal components obtained in the sparse basis domain, omitting the final transformation to the original signal domain, there still may not be any gain in biological interpretability, because the new basis vectors are functions of the original variables which may be complicated and meaningless from a biological point of view. Therefore, there is a need to develop a new method that performs variable selection in the original signal domain and provides sparse results in the sense of a sparse combination of the original variables.

Sparse Principal Component Analysis: Approach II

The work by Zou et al. [Zou et al., 2004] is motivated by the fact that principal compo-

nents are linear combinations of all variables and therefore are difficult to interpret. They

21

Page 32: by Elena Parkhomenko - tspace.library.utoronto.ca · Elena Parkhomenko Doctor of Philosophy Graduate Department of Public Health Sciences (Biostatistics) University of Toronto 2008

develop Sparse Principal Component Analysis (SPCA) introducing lasso and elastic net

to produce sparse principal components. This is based on representing PCA as a regression problem.

Lasso and elastic net:

In a regular linear regression problem, lasso [Tibshirani, 1996] and elastic net [Zou and

Hastie, 2005] techniques perform variable selection [Tibshirani, 1996, Zou et al., 2004, Zou

and Hastie, 2005]. Lasso imposes a constraint on the L1 norm of regression coefficients

while elastic net offers a more general approach and imposes constraints on both L1 and

L2 norms of coefficients.

Suppose there are n observations and p predictors. Let Y = (y1, . . . , yn)′ denote the

response vector and Xi = (x1i, . . . , xni)′ be the ith predictor, i = 1, . . . , p. Then the lasso

estimates of regression coefficients are βlasso:

β̂_lasso = argmin_β | Y − Σ_{i=1}^{p} X_i β_i |² + λ Σ_{i=1}^{p} |β_i|   (2.4)

The elastic net estimates of regression coefficients are βen:

β̂_en = (1 + λ_2) · argmin_β { | Y − Σ_{i=1}^{p} X_i β_i |² + λ_2 Σ_{i=1}^{p} |β_i|² + λ_1 Σ_{i=1}^{p} |β_i| }   (2.5)

In both lasso and elastic net sparse solutions are obtained due to the L1 norm penalty

on the regression coefficients. However, there are two major advantages of elastic net

over lasso. The first is that the lasso solution is limited to including at most n variables.

In the case of microarray studies the number of variables p (gene expressions) is often

greater than 10000 and exceeds the number of experiments n, which may be less than

10. Thus it may be impractical to restrict the model. Elastic net allows including all

available variables in the solution. The second advantage of the elastic net is that it

selects the whole group of correlated variables if one of these variables has been selected

while lasso only includes one of the variables from the group in the final model.
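This grouping behaviour is easy to see numerically. The following sketch (assuming scikit-learn is available; the data and parameter values are illustrative, not from the thesis) fits both estimators to a p > n data set containing one correlated group of predictors:

```python
import numpy as np
from sklearn.linear_model import Lasso, ElasticNet

rng = np.random.default_rng(0)
n, p = 20, 50                        # p > n, as in microarray settings
z = rng.normal(size=n)               # shared signal
X = rng.normal(size=(n, p))
X[:, :3] = z[:, None] + 0.01 * rng.normal(size=(n, 3))  # correlated group
y = z + 0.1 * rng.normal(size=n)

lasso = Lasso(alpha=0.1).fit(X, y)
enet = ElasticNet(alpha=0.1, l1_ratio=0.5).fit(X, y)
# Lasso typically keeps only one member of the correlated group,
# while the elastic net tends to keep the whole group.
print("lasso nonzero:", np.flatnonzero(lasso.coef_))
print("enet nonzero: ", np.flatnonzero(enet.coef_))
```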


Sparse principal components analysis as elastic net problem:

Zou et al. demonstrate that PCA can be represented as a ridge regression problem

(Theorem 1 in [Zou et al., 2004]):

Let X be an n × p data matrix and suppose its SVD is

X = UDV ′ (2.6)

Since principal components are linear combinations of all available variables, they can

be considered as response and obtained by linear regression of that response on variables

in X. Let Yi = UiDi = XVi denote the i-th principal component. The vector Vi

contains the coefficients of the original variables in the regression. Then for any λ > 0, V_i can be obtained by solving

β̂_ridge = argmin_β | Y_i − Xβ |² + λ|β|²   (2.7)

and standardizing: v̂ = β̂_ridge / |β̂_ridge|. Then v̂ = V_i.

Here the ridge penalty λ is not used to penalize regression coefficients β but rather

serves to ensure a solution in cases when the data matrix X is not full rank, in particular

when p > n. In these cases there is no unique ordinary regression solution, while PCA

always produces a solution, thus λ has to be strictly positive.

This formulation of PCA is then transformed into the general sparse principal component analysis (SPCA) problem by modifying (2.7) to become a “self-contained regression-type criterion” and introducing a lasso penalty. Consider the first k principal components. Let α and β be p × k matrices and X_i the i-th row-vector of the n × p data matrix X, i.e. a single observation. For any λ > 0, let

(α̂, β̂) = argmin_{α,β} Σ_{i=1}^{n} | X_i − αβ′X_i |² + λ Σ_{j=1}^{k} |β_j|² + Σ_{j=1}^{k} λ_{1,j} |β_j|_1   (2.8)

subject to α′α = I_k. Here |β_j|_1 denotes the L1 norm of β_j.

Then take v̂_j = β̂_j / |β̂_j|, j = 1, . . . , k.


A sparse principal component solution will set elements of vj to zero, so the principal

component is a simpler linear combination of the original variables.

The authors introduce a general SPCA algorithm to obtain sparse principal compo-

nents for specific values of the tuning parameters λ and λ_{1,j}. The ridge penalty λ is the same for all k principal components. If n > p and X has full rank, then λ can be set to 0. When p > n it should be set to a small positive value to overcome the collinearity problem by regularizing the matrix. Lasso penalties λ_{1,j} are

selected separately for each principal component. The optimal values can be obtained

from a set of solutions provided by LARS-EN algorithm [Efron et al., 2004] based on

variance-sparsity tradeoff. Thus, for each principal component the user has to select two

tuning parameters.

The main motivation for development of new methods for sparse canonical correlation

analysis, sparse principal component analysis, and sparse singular vector decomposition

in general is the interest in studying large data sets when variable selection is essential

to the interpretability of the results, as when number of variables p greatly exceed the

number of samples n, which is the case in microarray studies. Zou et al. introduce a

modified solution for this particular situation to reduce the computation cost. It is based

on setting the ridge penalty λ = ∞ which transforms the elastic net solution into a

soft-thresholding operation.

Gene expression arrays SPCA algorithm:

1. Consider the first k ordinary principal components V [, 1 : k] and let α = V [, 1 : k]

2. Given fixed α, for j = 1, . . . , k apply soft-thresholding:

β_j = ( |α′_j X′X| − λ_{1,j}/2 )_+ Sign(α′_j X′X)

3. For each fixed β, obtain the SVD of X ′Xβ = UDV ′, then update α = UV ′

4. Repeat steps 2 and 3 until β converges


5. Normalization: V̂_j = β_j / |β_j|, j = 1, . . . , k
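A compact Python sketch of this iteration (the function name, convergence check, and defaults are my own; the update rules follow the steps above) might look as follows:

```python
import numpy as np

def gene_spca(X, k, lambda1, n_iter=200, tol=1e-6):
    """Sketch of the soft-thresholding SPCA iteration above; lambda1 is an
    array of k sparseness parameters, one per principal component."""
    G = X.T @ X                                              # X'X
    alpha = np.linalg.svd(X, full_matrices=False)[2][:k].T   # step 1: V[, 1:k]
    beta = alpha.copy()
    for _ in range(n_iter):
        beta_old = beta.copy()
        S = G @ alpha                     # column j is X'X alpha_j
        # Step 2: beta_j = (|alpha_j' X'X| - lambda_{1,j}/2)_+ Sign(.)
        beta = np.sign(S) * np.maximum(np.abs(S) - np.asarray(lambda1) / 2.0, 0.0)
        # Step 3: SVD of X'X beta = U D V', then alpha = U V'.
        U2, _, Vt2 = np.linalg.svd(G @ beta, full_matrices=False)
        alpha = U2 @ Vt2
        if np.linalg.norm(beta - beta_old) < tol:  # step 4: until beta converges
            break
    norms = np.linalg.norm(beta, axis=0)
    norms[norms == 0] = 1.0
    return beta / norms                   # step 5: normalization
```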

Zou et al. also present a method for calculation of adjusted total variance. The

traditional approach for calculation of variance explained by principal components (PC)

is to take trace(U′U), where U are the principal components. However, this is only

valid under the assumption that PCs are uncorrelated. Therefore, adjustment is made

to account for the correlation between the modified principal components. The authors

then use the new measure of total explained variance to compare SPCA performance to

traditional PCA, the SCoTLASS method of Jolliffe and Uddin [Jolliffe and Uddin, 2003],

and simple thresholding, using two published data sets and simulated data.

The first published data set they use is the pitprops data [Jeffers, 1967] that consists

of 180 observations on 13 variables. To facilitate the comparison between the simple

thresholding method and other sparse PCA approaches, the authors make the number of

nonzero loadings selected by simple thresholding equal to the number of nonzero loadings

in sparse PC produced by SPCA and SCoTLASS. Comparison of the first six principal

components shows that adjusted total explained variance is highest for the SPCA vectors,

followed by simple thresholding, and is lowest for SCoTLASS. Also, SCoTLASS is similar to simple thresholding in terms of the variables selected, especially in the first PC, while the set of variables selected by SPCA differs substantially from both for the first three PCs.

The second published data study is based on the Ramaswamy data [Ramaswamy et al., 2001], which has 16063 genes and 144 samples. Thus this is the p >> n situation typical

of microarray studies. Therefore, the soft-thresholding modification was used instead of

the full elastic net to obtain sparse PCs. The authors performed SPCA for a number of

different values for the sparseness parameter λ1 and observed that as sparsity increases

the percentage of explained variance decreases at a slow rate. They also observed that

only 2.5% of genes were sufficient to explain most of the variance in the first principal

component. Another interesting finding was that the simple thresholding approach, keeping the same number of genes as selected by SPCA, showed a higher percentage of explained variance than SPCA, and only about 2% of the selected genes differed between

the two methods. SCoTLASS was not applicable in this setting due to computational

complexity.

In a simulation example Zou et al. used three hidden factors to create 10 variables

with a specific correlation structure such that principal components should have a sparse

representation. Four of 10 variables were associated with the first hidden factor, 4 with

the second, and 2 with the third. The correct solution should recover the first four vari-

ables in the leading principal component and the second 4 variables in the second PC

since the first 2 hidden factors are more important and the third hidden factor is the

combination of the first two. The authors used the exact covariance matrix for testing

SPCA and comparing it to ordinary PCA and simple thresholding. This is equivalent to

having an infinite sample of data (n = ∞), so the ridge parameter λ in SPCA was set to 0. Both SPCA and SCoTLASS correctly identified the first two principal components, which the authors attribute to their explicit use of the lasso constraint. An obvious disadvan-

tage of ordinary PCA is that it included all 10 variables in each considered PC. Simple

thresholding misidentified two variables in the leading PC due to misleading correlation

between the variables.

An interesting observation is that the solution obtained by SPCA included all 4

variables associated with the first hidden factor in the leading PC even though only the

lasso constraint was used and the ridge penalty was set to 0. Therefore, the solution was

obtained by using lasso, not elastic net. As stated above, the usual criticism of lasso is

that it tends to select only one variable from the group of correlated variables. However,

results of this simulation show that in this case lasso did include the whole group of

correlated variables in the solution. This contradictory result may be explained by the fact that the authors used the prior knowledge that 4 variables had to be included in the solution and chose the thresholding parameter accordingly.


The conclusions presented by the authors based on these three studies are as follows:

• Simple thresholding may still be a useful tool due to its simplicity and sufficiently

good performance in studies with small sample size such as microarray studies.

However, it can misidentify important variables.

• SCoTLASS works well with a small number of variables and produces correct sparse

principal components. It is not applicable in studies with a large number of variables

due to its computational complexity.

• The new SPCA method performs well in both gene expression studies (p >> n)

and in regular studies with sufficient sample size. It reduces to ordinary PCA in

the absence of a lasso constraint and otherwise has superior performance compared

to PCA since it produces sparse principal components and correctly identifies im-

portant variables. The algorithm for SPCA is computationally efficient.

Sparse principal component analysis offers several advantages over existing methods as

listed above. It obtains sparse principal components by introducing a lasso constraint

into the regression representation of the principal component problem. Thus, in a singu-

lar value decomposition X = UDV ′ it produces sparse vectors in V . This is an important

development for cluster analysis where we are interested in identifying small groups of

variables with some similarity (for example, genes belonging to the same functional path-

way) that are easy to interpret. In this situation we are only concerned about grouping

variables on one side of the decomposition of the matrix X. Thus, if X is an n × p

matrix in which rows represent n samples and columns represent p variables, then we are

only interested in grouping variables by obtaining sparse V . However, if the matrix of

interest X is the covariance matrix between two sets of variables from two different data

sets and we are interested in simultaneously identifying small groups of variables from

each set that would also be correlated, then we are seeking sparse SVD of X on both

sides, i.e. we are interested in obtaining sparse vectors both in U and V matrices. This


is the case in canonical correlation analysis when the study involves large data sets and

including all available variables in canonical vectors would make interpretation difficult.

In this case, SPCA does not provide a complete solution to the problem. However, it

introduces some useful techniques and ideas that can be adapted to the modification

of traditional canonical correlation analysis and singular vector decomposition to obtain

sparse solutions. This is the focus of this thesis.


Chapter 3

Methodology

Sparse canonical correlation analysis is an extension of traditional canonical correlation

analysis. Both methods focus on the identification of relationships between the two sets of

variables. In this chapter I will present new methodology developed in the thesis research.

I begin with a brief review of the traditional canonical correlation analysis in the first

section. Then the description of the new technique and the algorithm for SCCA follows in the

second section. Section 3 contains the discussion of data standardization appropriate for

SCCA. Section 4 presents the simulation design used for evaluation of SCCA performance.

Convergence issues and the dependence on the starting values for the algorithm are discussed in section 5. The last section, section 6, describes the selection of the sparseness parameters for SCCA.

3.1 Conventional canonical correlation analysis

Consider two sets of variables X and Y measured on n individuals. Suppose there are p

variables in the set X and q variables in the set Y .

X = [ x_11 . . . x_1p          Y = [ y_11 . . . y_1q
       ⋮    ⋱    ⋮                    ⋮    ⋱    ⋮
      x_n1 . . . x_np ]              y_n1 . . . y_nq ]


We are looking for linear combinations of variables from sets X and Y with maximal correlation. Let vectors a and b denote the coefficients in these linear combinations for sets X and Y, respectively. Then we are looking to maximize the following correlation:

Corr(a′x, b′y) = ρ(a, b) = a′Σ_XY b / ( √(a′Σ_XX a) √(b′Σ_YY b) )   (3.1)

where Σ_XX, Σ_XY, and Σ_YY are the variance and covariance matrices for X and Y. The

solution is obtained by considering the Singular Value Decomposition (SVD) of a matrix

K:

K = Σ_XX^{−1/2} Σ_XY Σ_YY^{−1/2} = UDV′ = (u_1, . . . , u_k) D (v_1, . . . , v_k)′
  = d_1 u_1 v′_1 + d_2 u_2 v′_2 + . . . + d_k u_k v′_k   (3.2)

where k is the rank of the matrix K. The solution is based on the rank 1 approximation to the correlation matrix [Good, 1969], which has dimension (p × q) = (p × p)(p × q)(q × q), approximating K with the first singular vectors, K ≈ d_1 u_1 v′_1, where d_1 is the positive square root of the first eigenvalue of K′K or KK′. The canonical vectors, that is, the linear combinations of each of the two sets of variables that have the largest correlation, are

a = Σ_XX^{−1/2} u_1   and   b = Σ_YY^{−1/2} v_1   (3.3)

Canonical vectors are the weights in the linear combinations of variables from sets X and

Y in (3.1) in which we are interested. The new variables η = a′x and φ = b′y obtained

in this analysis are called the canonical variables as in [Mardia et al., 1979] or latent

variables [Wegelin]. Vectors Xa and Y b give n observations of these variables.
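For reference, a minimal NumPy/SciPy sketch of this computation (assuming the covariance matrices are well conditioned so the inverse square roots exist; the function name is illustrative):

```python
import numpy as np
from scipy.linalg import fractional_matrix_power as mpow

def cca_first_pair(X, Y):
    """First canonical vectors via the SVD of K, following (3.2)-(3.3)."""
    n = X.shape[0]
    Xc, Yc = X - X.mean(axis=0), Y - Y.mean(axis=0)
    Sxx = Xc.T @ Xc / (n - 1)
    Syy = Yc.T @ Yc / (n - 1)
    Sxy = Xc.T @ Yc / (n - 1)
    K = mpow(Sxx, -0.5) @ Sxy @ mpow(Syy, -0.5)
    U, d, Vt = np.linalg.svd(K)
    a = mpow(Sxx, -0.5) @ U[:, 0]     # a = Sigma_XX^{-1/2} u_1
    b = mpow(Syy, -0.5) @ Vt[0]       # b = Sigma_YY^{-1/2} v_1
    return a, b, d[0]                 # d[0]: first canonical correlation
```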

3.2 Sparse canonical correlation analysis

In conventional CCA all variables from both sets are included in the fitted linear combi-

nations or canonical vectors. However, in many study settings such as microarray data

analysis and genome-wide linkage analysis the number of genes under consideration often


exceeds tens of thousands. In these cases linear combinations of the entire set of features

lack biological interpretability since they contain too many variables for further testing or

hypothesis generation. In addition, high dimensionality and insufficient sample size lead

to computational problems, inaccurate estimates of parameters, and non-generalizable

results. Sparse canonical correlation analysis (SCCA) solves the problem of biological

interpretability by providing sparse sets of associated variables. These results are ex-

pected to be more robust and generalize better outside the particular study. In many

applications there are good subject-matter arguments for sparsity. For example, when

gene expression is regarded as a response and genotypes as predictors, sparse loadings

comply with the belief that only a small proportion of genes are expressed under a certain

set of conditions and that expressed genes are regulated at a relatively small number of

genetic locations. As another example, gene expression may be considered to predict

a complex phenotype such as a comprehensive psychological profile. Only a subset of

the large shopping list of psychological variables may be relevant to the condition being

studied.

I propose to obtain sparse linear combinations of features by considering a sparse

singular value decomposition of the matrix K in (3.2). This means that the canonical vec-

tors u and v in (3.2) have sparse loadings. We develop an iterative algorithm that

alternately approximates the left and right singular vectors of the SVD using iterative

soft-thresholding for feature selection. This approach is related to the iterative SVD al-

gorithm of I.J. Good [Good, 1969], Partial Least Squares (PLS) methods described by J.

Wegelin [Wegelin] and Sparse Principal Component Analysis method developed by Zou

et al. [Zou et al., 2004].

SCCA algorithm

Similarly to CCA consider n observations in two sets of variables X and Y with p variables

in the set X and q variables in the set Y . Assume that each of the sets of variables has


been standardized to have columns with zero means and unit variances. Let K be the

sample correlation matrix between X and Y as in (3.2). Let λu and λv be the soft-

thresholding parameters for variable selection from the sets X and Y respectively. The

first sparse canonical vectors can be identified using the following algorithm:

Sparse Canonical Correlation Analysis algorithm:

1. Select sparseness parameters λ_u and λ_v

2. Select initial values u^0 and v^0 and set i = 0

3. Update u:

(a) u^{i+1} ← K v^i

(b) Normalize: u^{i+1} ← u^{i+1} / |u^{i+1}|

(c) Apply soft-thresholding to obtain a sparse solution: u^{i+1} ← ( |u^{i+1}| − ½ λ_u )_+ Sign(u^{i+1})

(d) Normalize: u^{i+1} ← u^{i+1} / |u^{i+1}|

4. Update v:

(a) v^{i+1} ← K′ u^{i+1}

(b) Normalize: v^{i+1} ← v^{i+1} / |v^{i+1}|

(c) Apply soft-thresholding to obtain a sparse solution: v^{i+1} ← ( |v^{i+1}| − ½ λ_v )_+ Sign(v^{i+1})

(d) Normalize: v^{i+1} ← v^{i+1} / |v^{i+1}|

5. i ← i + 1

6. Repeat steps 3 and 4 until convergence.


where (x)_+ is equal to x if x ≥ 0 and to 0 if x < 0, applied elementwise. Also

Sign(x) = −1 if x < 0,  1 if x > 0,  0 if x = 0.
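A direct Python transcription of this iteration (a sketch: the function names and stopping rule are mine, and λ_u, λ_v are assumed small enough that the thresholded vectors remain nonzero):

```python
import numpy as np

def soft_threshold(w, lam):
    """Elementwise (|w| - lam/2)_+ * Sign(w), as in steps 3c and 4c."""
    return np.sign(w) * np.maximum(np.abs(w) - lam / 2.0, 0.0)

def scca_first_pair(K, lam_u, lam_v, n_iter=100, tol=1e-8):
    """Sketch of the SCCA iteration for the first pair of sparse singular
    vectors; starting values are the normalized row and column means of K."""
    u = K.mean(axis=1); u /= np.linalg.norm(u)
    v = K.mean(axis=0); v /= np.linalg.norm(v)
    for _ in range(n_iter):
        u_old, v_old = u.copy(), v.copy()
        u = K @ v
        u /= np.linalg.norm(u)                 # step 3b
        u = soft_threshold(u, lam_u)           # step 3c
        u /= np.linalg.norm(u)                 # step 3d (assumes u != 0)
        v = K.T @ u
        v /= np.linalg.norm(v)                 # step 4b
        v = soft_threshold(v, lam_v)           # step 4c
        v /= np.linalg.norm(v)                 # step 4d (assumes v != 0)
        if np.linalg.norm(u - u_old) + np.linalg.norm(v - v_old) < tol:
            break
    return u, v
```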

Notes:

1. The first normalization step in the updating of the singular vector (steps 3b and 4b)

is related to the scale of the sparseness parameter λ. This step brings the singular

vector to unit length prior to applying soft-thresholding. Thus, soft-thresholding

is always applied to comparable vectors, so the value of the parameter λ is easier

to choose. If this step is omitted, then values for the sparseness parameter can

range from 0.1 to over 100 for different data sets making the task of selecting the

approximate range of suitable sparseness parameters difficult. Further discussion

of the sparseness parameter selection algorithm can be found in the section on

sparseness parameters below.

2. The second normalization step in the algorithm (steps 3d and 4d) is to normalize

the sparse singular vectors before proceeding to the next iteration.

3. This algorithm is first applied to obtain the first singular vectors, u_1 and v_1. Then the residual of the matrix K after removing the effects of the first singular vectors

can be considered to obtain additional singular vectors.

3.3 Data standardization

The computation of matrix K in (3.2) requires the inverses of the p × p matrix X ′X

and the q × q matrix Y ′Y , which may not exist in cases when these matrices are ill-

conditioned. Such situations are often observed in microarray studies when the number


of variables greatly exceeds the number of observations. The inverses also may not exist

when there is collinearity (linear dependence) even if the total number of variables in sets

X and Y is less than the number of observations. Several approaches may be taken to

solve this problem. The first one would be to regularize X ′X and Y ′Y by the addition

of γI for some parameter γ. Another approach is to use generalized inverses. Neither of these approaches may be feasible for high dimensional data. Regularization

requires additional parameter selection and validation while the second approach requires

computation of the full singular value decomposition for both X ′X and Y ′Y which may

be very slow in cases of high dimensionality. The third approach is to eliminate (X ′X)−1

and (Y ′Y )−1 from the computation of the matrix K. The motivation for this approach is

described below.

Prior to applying the SCCA algorithm the data is standardized so that all variables

have zero means and unit variances by subtracting column means and dividing by column

standard deviations. Then variance-covariance matrices ΣXX and ΣY Y for each data set

become correlation matrices and have ones along the diagonal. Also under the assumption

that in high dimensional problems most of the measured variables are not related to the

process of interest, i.e. they may be considered as noise, the correlation between them

(off-diagonal elements in Σ_XX and Σ_YY) is zero. We can approximate K by substituting identity matrices for Σ_XX and Σ_YY in the expression for the matrix K in (3.2) after applying data standardization. Then K is computed as the covariance between the sets X* and Y*:

K = Cov(X*, Y*) = Σ_{X*Y*}   (3.4)

where X* and Y* are the standardized data sets. The first canonical vectors in (3.3) are then just u_1 and v_1.

This approach is also related to the partial least squares method (PLS) of J.A. Wegelin [Wegelin] presented in the literature review section. Considering K = Cov(X*, Y*) = Σ_{X*Y*} is equivalent to maximizing

Cov(a′x*, b′y*) = a′Σ_{X*Y*}b   (3.5)

instead of maximizing Corr(a′x, b′y) as in (3.1). Thus, we are using the mode A approach of the general Wold PLS algorithm and not mode B as described in chapter 2.

The limitation of this approach is that we ignore the correlation among the variables

of the same type. For example, if one type of variables is gene expression profiles, then

the correlation between the genes is not considered. This may lead to inflated covariances

between the different types of measurements since they are not scaled by the variances

within the same type of measurements. Thus, uninformative (i.e. noise) variables may

be included in the solution. However, noise variables are hypothesized to have lower

variances than the important variables and to be independent of each other. This fact is

often used in noise filtering methods and in principal component analysis, which selects combinations of variables that explain most of the variation in the data. Therefore,

the covariances of the noise variables between different types of measurements should

not be greatly affected by the covariances within the same type of variables and the

approximation of the matrix K described above is acceptable.
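In code, the standardization and the approximate K of (3.4) amount to a few lines (a sketch; the function name is illustrative):

```python
import numpy as np

def approx_K(X, Y):
    """Approximate K as the covariance of the standardized data, eq. (3.4)."""
    n = X.shape[0]
    Xs = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)  # zero mean, unit variance
    Ys = (Y - Y.mean(axis=0)) / Y.std(axis=0, ddof=1)
    return Xs.T @ Ys / (n - 1)                          # p x q covariance matrix
```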

3.4 Simulation

Different aspects of SCCA performance are evaluated using simulated data. In this section I describe the methods used in the simulations.

Latent variable model

The goal of Sparse Canonical Correlation Analysis is to study two sets of variables X and

Y to identify subsets of variables in each of X and Y that are associated with variables

in the other set.


The dependency between two sets of variables can be modeled using latent, or in other

words unobserved, variables. Suppose there exists a latent variable ωX that controls a

subset of observed variables in X, thus inducing a correlation between these variables. In

this case measured variables in X serve as indicators. An example of this model could

be the following situation: we measure performance of several students in mathemat-

ics by giving them a number of tests. All students get consistently good results, which

implies that their performance is correlated. The unobserved reason for this correlation

could be that all of these students have the same mathematics teacher. Similarly, assume

there exists a latent variable ωY that controls a subset of observed variables in another

data set Y. Carrying on with our hypothetical example, variables in set Y could indi-

cate performance of the same group of students on physics tests and ωY represents their

physics school teacher. Thus, we have two distinct measures (mathematics and physics

test scores) on the same subjects (students). Suppose we observe high test scores in

physics suggesting positive correlation between the results on mathematics and physics

tests. This phenomenon may be explained by another higher level unobserved connection

between the mathematics and physics teachers (ωX and ωY ) and consequently between

the performance of their students on considered tests (X and Y ). This connection could

be represented by a high school where all of these students are studying and where

mathematics and physics teachers are working. Thus, good performance of students and

correlation between their results on mathematics and physics tests may be explained by

the school's specialization in sciences. In a formal approach, this is equivalent to the exis-

tence of a higher level latent variable µ that controls both ωX and ωY and, consequently,

observed variables in X and Y.

Schematically, the latent variable model can be shown as follows: the higher level latent variable µ controls ωX and ωY, which in turn control the observed variables x_1, x_2, . . . , x_{p.dep} and y_1, y_2, . . . , y_{q.dep} respectively.

Figure 3.1: Schematic representation of the latent variable model.

Mathematical representation

For a simulation study of SCCA performance we need to generate two separate data

sets with an equal number of observations in each and not necessarily equal numbers of

variables. A small subset of variables in X should be associated with a subset of variables

in Y and the rest of the variables in both data sets may be independent. This will allow

testing how well SCCA differentiates the associated group of variables from the rest.

Thus, to simulate the data the following scheme has been used:

1. Let data set X have p variables and set Y have q variables.

2. Both X and Y have n observations

3. Assume that X has p.dep dependent variables and Y has q.dep dependent variables.

These groups of variables are associated with each other according to some model.

The rest of the variables are independent within sets X and Y and between them.

4. Without loss of generality we can assume that the first p.dep variables in X are

associated with the first q.dep variables in Y

Thus, with the dependent variables occupying the first columns, the data matrices are as follows:

X = [ x_11 . . . x_1,p.dep   x_1,p.dep+1 . . . x_1p
       ⋮          ⋱
      x_n1 . . . x_n,p.dep   x_n,p.dep+1 . . . x_np ]

Y = [ y_11 . . . y_1,q.dep   y_1,q.dep+1 . . . y_1q
       ⋮          ⋱
      y_n1 . . . y_n,q.dep   y_n,q.dep+1 . . . y_nq ]


Data sets X and Y that satisfy the latent variable model can be simulated using a

higher level latent variable µ alone. The simplest case is when there is only one set of

correlated variables in X that is associated with the only set of correlated variables in Y

and the rest of the variables are independent. This is the model used to simulate data.

In addition, suppose there exists a random variable µ with distribution N(0, σ²_µ) such that

x_ji = α_i µ_j + ex_ji   for j = 1, . . . , n,  i = 1, . . . , p.dep
y_ji = β_i µ_j + ey_ji   for j = 1, . . . , n,  i = 1, . . . , q.dep

and for independent variables assume

x_ji = ex_ji   for j = 1, . . . , n,  i = p.dep + 1, . . . , p
y_ji = ey_ji   for j = 1, . . . , n,  i = q.dep + 1, . . . , q

where ex_ji ∼ N(0, σ²_e) and ey_ji ∼ N(0, σ²_e) for j = 1, . . . , n, i = 1, . . . , p (q).

Also assume

Σ_{i=1}^{p.dep} α_i = Σ_{i=1}^{q.dep} β_i = 1

Then define

µ_X = Σ_{i=1}^{p.dep} x_i = µ + Σ_{i=1}^{p.dep} ex_i   and   µ_Y = Σ_{i=1}^{q.dep} y_i = µ + Σ_{i=1}^{q.dep} ey_i

The highest correlation between linear combinations of dependent variables from sets X and Y is observed between the sums of these dependent variables:

Cor(µ_X, µ_Y) = σ²_µ / ( √(σ²_µ + p.dep σ²_e) √(σ²_µ + q.dep σ²_e) )   (3.6)
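A Python sketch of this simulation scheme (with the illustrative choice α_i = 1/p.dep and β_i = 1/q.dep, so that the weights sum to one as required; the function name and defaults are mine):

```python
import numpy as np

def simulate_latent(n, p, q, p_dep, q_dep, var_mu, var_e, seed=None):
    """Sketch of the single latent variable simulation scheme above."""
    rng = np.random.default_rng(seed)
    mu = rng.normal(0.0, np.sqrt(var_mu), size=n)       # latent variable
    X = rng.normal(0.0, np.sqrt(var_e), size=(n, p))    # noise ex
    Y = rng.normal(0.0, np.sqrt(var_e), size=(n, q))    # noise ey
    X[:, :p_dep] += mu[:, None] / p_dep                 # dependent block of X
    Y[:, :q_dep] += mu[:, None] / q_dep                 # dependent block of Y
    return X, Y
```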


The covariance matrix between variables in sets X and Y has the structure (rows indexed by x_1, . . . , x_p, columns by y_1, . . . , y_q):

Σ_XY = [ α_1β_1σ²_µ        α_1β_2σ²_µ        . . .  α_1β_{q.dep}σ²_µ        0 . . . 0
         α_2β_1σ²_µ        α_2β_2σ²_µ        . . .  α_2β_{q.dep}σ²_µ        0 . . . 0
          ⋮                                   ⋱                              ⋮
         α_{p.dep}β_1σ²_µ  α_{p.dep}β_2σ²_µ  . . .  α_{p.dep}β_{q.dep}σ²_µ  0 . . . 0
         0                 0                 . . .  0                       0 . . . 0
          ⋮                                                                 ⋮
         0                 0                 . . .  0                       0 . . . 0 ]

An important observation demonstrated by this matrix form is that in this simulation

design if 2 variables xi and xj in set X are associated with some variable yk in set Y,

then xi and xj are correlated. In the chosen notation, correlation between xi and yk

implies that αi and βk are non-zero, while correlation between xj and yk implies that αj

and β_k are non-zero. Then the correlation between x_i and x_j is non-zero as well, since Cov(x_i, x_j) = α_iα_jσ²_µ. This fact can be used for preliminary filtering of the data to reduce the number of noise variables, which is discussed in chapter 7.

This matrix representation is also related to another way of looking at the canonical

correlation model. In the simplest case, such as considered above, there is a single set

of associated variables for each X and Y and the rest of the variables are independent.

Then there is only one pair of canonical variates and Σ12 = ΣXY = λuv′ where λ is the

singular value and u and v are the singular vectors. Then X and Y variables can be


sorted so that the singular vectors have the form:

u = (u_1, . . . , u_{p.dep}, 0, . . . , 0)′   v = (v_1, . . . , v_{q.dep}, 0, . . . , 0)′

Then Σ_XY has the following form:

Σ_XY = [ u_1v_1        . . .  u_1v_{q.dep}        0 . . . 0
          ⋮                    ⋮                   ⋮
         u_{p.dep}v_1  . . .  u_{p.dep}v_{q.dep}  0 . . . 0
         0             . . .  0                   0 . . . 0
          ⋮                    ⋮                   ⋮
         0             . . .  0                   0 . . . 0 ]

Selection of the true correlation

The true correlation between the linear combinations of variables from the different sets can be obtained by controlling the values of the variance of the latent variable µ and the variance of the noise in the measurements, i.e. σ²_µ and σ²_e. This is based on equation (3.6) for the correlation between the linear combinations. Increasing σ²_µ relative to the variance of the noise results in higher correlation between the subsets of variables. For example, when there are 20 variables in each set of measurements that are associated between the sets, setting σ²_µ = 1.8 and σ²_e = 0.01 would result in a correlation between the linear combinations of these variables of Cor(µ_X, µ_Y) = 0.9, while setting σ²_µ = 0.2 and σ²_e = 0.01 would result in Cor(µ_X, µ_Y) = 0.5.
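These two settings can be checked directly against equation (3.6); a small helper (illustrative) reproduces both values:

```python
import numpy as np

def true_cor(var_mu, var_e, p_dep, q_dep):
    """Equation (3.6): correlation between mu_X and mu_Y."""
    return var_mu / np.sqrt((var_mu + p_dep * var_e) * (var_mu + q_dep * var_e))

print(true_cor(1.8, 0.01, 20, 20))   # 0.9
print(true_cor(0.2, 0.01, 20, 20))   # 0.5
```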


3.5 Convergence

The SCCA algorithm applied to the matrix K converges to the first singular vectors u_1 and v_1. The rate of convergence is very high: it usually requires fewer than 10 iterations

even for very large data sets each containing thousands of variables. This observation is

based on the simulation studies as well as the analysis of real data containing thousands

of variables in the sets X and Y. Computational time is also short since the only required

operations are matrix and vector multiplications.

Convergence in the absence of variable selection

In the absence of the soft-thresholding used to obtain a sparse solution, we have the algorithm for calculation of the singular value decomposition described by I. J. Good [Good, 1969]. The

same algorithm is also presented as the power method for computing eigenvectors by

J. A. Wegelin [Wegelin], who references G. W. Stewart [Stewart, 1973]. The algorithm

converges at exponential speed [Good, 1969]. The proof of convergence is presented in

Wegelin [Wegelin] on page 32. The idea behind the algorithm can be seen by tracing a

few steps. Consider the singular value decomposition of a matrix K in (3.2):

K = UDV ′ = (u1, . . . ,uk)D(v1, . . . ,vk)′

Assume that the starting value v0 for the right singular vector is not in the null space of

V, so v^0 = Vα. Then the algorithm proceeds as follows:

1. u^1 = Kv^0 = UDV′Vα = UDα

2. v^1 = K′u^1 = VDU′UDα = VD²α

3. u^2 = Kv^1 = UDV′VD²α = UD³α

4. ...


Note that the exponent of D increases at every step. If the problem is scaled such that

the largest singular value (diagonal element of D) is one, it is evident that the powers of D converge

to a matrix with the value 1 in the (1,1) position and zero elsewhere. This will produce

the first right and left singular vectors u1 and v1.
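This convergence is easy to verify numerically; a small sketch (random test matrix, illustrative only) compares the iteration with a direct SVD:

```python
import numpy as np

# After s full iterations the right vector is proportional to V D^{2s} alpha,
# so the d_1 term dominates and the iterate aligns with v_1.
rng = np.random.default_rng(0)
K = rng.normal(size=(8, 6))
v = rng.normal(size=6); v /= np.linalg.norm(v)   # v0 not in the null space of V
for _ in range(50):
    u = K @ v;  u /= np.linalg.norm(u)
    v = K.T @ u; v /= np.linalg.norm(v)
U, d, Vt = np.linalg.svd(K)
print(abs(v @ Vt[0]))   # close to 1: converged to the first right singular vector
```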

Convergence in the presence of variable selection - the issue of

starting values

In the presence of soft-thresholding for variable selection, convergence for SCCA is related

to the selection of the starting value for the singular vector. This can be demonstrated

by considering in greater detail a section from the proof of convergence of the power

method [Wegelin] described above:

As before consider the singular decomposition of matrix K with rank k in (3.2)

K = UDV ′ = (u1, . . . ,uk)D(v1, . . . ,vk)′

In the SCCA algorithm we use a starting value for the right singular vector v0 to initialize

the process. The algorithm is symmetric with respect to left and right singular vectors,

therefore, we could also start from step 4 and use u0. Without loss of generality, let’s

assume that the procedure for the algorithm is as presented above and iterations start

with updating u using v^0. If v^0 is in the null space of V in (3.2), then the process will

stop at the first iteration since u1 and v1 will become zero vectors. Thus, to achieve

convergence, assume that v^0 is not in the null space of V. Then it can be written as a linear combination of the vectors in V, which form an orthonormal basis (i.e. v′_i v_j = 0 for i ≠ j and v′_i v_i = 1, i, j = 1, . . . , k):

v^0 = α_1 v_1 + α_2 v_2 + . . . + α_k v_k   (3.7)


where α_1, . . . , α_k are some coefficients and at least one α_i is non-zero. Then in step 3a of the algorithm we have

u^1 = Kv^0 = UDV′v^0                                   (3.8)
    = UDV′(α_1 v_1 + α_2 v_2 + . . . + α_k v_k)        (3.9)
    = Σ_{i=1}^{k} α_i UDV′ v_i                         (3.10)
    = Σ_{i=1}^{k} α_i d_i u_i                          (3.11)

because v′_i v_i = 1.

If we write u^1 = (u^1_1, u^1_2, . . . , u^1_p)′ and

U = (u_1, . . . , u_k) = [ u_11 . . . u_1k
                            ⋮    ⋱    ⋮
                           u_p1 . . . u_pk ]

then

u^1 = [ d_1α_1u_11 + d_2α_2u_12 + . . . + d_kα_ku_1k
        d_1α_1u_21 + d_2α_2u_22 + . . . + d_kα_ku_2k
         ⋮
        d_1α_1u_p1 + d_2α_2u_p2 + . . . + d_kα_ku_pk ]   (3.12)

After the normalization in step 3b we will have

u^1 = ( d_1α_1 / √(Σ_{i=1}^{k} d_i²α_i²) ) u_1 + . . . + ( d_kα_k / √(Σ_{i=1}^{k} d_i²α_i²) ) u_k   (3.13)

so that the j-th entry of u^1 is

u^1_j = ( d_1α_1u_j1 + d_2α_2u_j2 + . . . + d_kα_ku_jk ) / √(Σ_{i=1}^{k} d_i²α_i²),   j = 1, . . . , p   (3.14)

Finally, the decision on variable selection for the first singular vector u (i.e. variable selection for the set X) is made in the soft-thresholding step 3c. The soft-thresholding sets the entries of the estimated vector u^1 that are small in absolute value to zero (depending on the sparseness parameter). We would like to achieve the same sparseness as in the true singular vector by setting to zero the entries in u^1 that correspond to zeros in the true u_1. However, as equation (3.13) demonstrates, after being updated and normalized, the entries in u^1 no longer depend on the first singular vector u_1 exclusively, but rather on all left singular vectors u_1, . . . , u_k. This raises the concern whether the correct entries in the estimate u^1 are set to zero, as this will affect all remaining steps of the SCCA algorithm and the final result.

I have described in detail the first step of the algorithm for the first singular vectors.

However, the derivation for updated singular vector entries is similar for all other steps

and for both left and right singular vectors. The above equations demonstrate that

updated singular vectors (for example, u^1) depend on the starting values (v^0) through

the parameters α. Also, in the later iterations the power of the singular values di will

increase. Thus, in iteration s we will have

u^s = Σ_{i=1}^{k} α_i d_i^{2s} u_i   (3.15)

As the iteration number s increases, so do the powers of d_i, while the power of the α_i parameters does not increase. Also, given that the singular values d_i decrease as i increases and d_1 is the largest, the first term α_1 d_1^{2s} u_1 will become dominant.

This fact is used in the proof of convergence of the power method in Wegelin [Wegelin].

Thus, at later iterations the influence of the starting singular vectors is weakened by

the increased influence of the singular values. However, the decisions regarding variable

selection are already made beginning with the first iteration, so the concern about the

effect on the results is valid. For instance, if true singular vectors were known then

we could take v0 = v1 = 1v1 + 0v2 + . . . + 0vk, i.e. in 3.7 we would have α1 = 1,

α2 = 0, . . . , αk = 0. In that case in 3.12 the new value of u1 would only depend on u1 as


follows:

u^1 = ( d_1α_1 / √(Σ_{i=1}^{k} d_i²α_i²) ) u_1 + . . . + ( d_kα_k / √(Σ_{i=1}^{k} d_i²α_i²) ) u_k = u_1 = (u_11, u_21, . . . , u_p1)′   (3.16)

Therefore, the correct entries in u^1 would automatically be set to zero.

This demonstrates the importance of selecting proper starting values for the singular

vectors. As stated above, they should not be in the null space of U or V respectively.

Possible choices include

• First singular vector of sample matrix K as suggested by Zou et al. [Zou et al.,

2004]

• First singular vector of true matrix K if it is available, i.e. ideal starting value.

This may only be possible in simulation studies.

• Vector consisting of column means of K for v0 or row means of K for u0 as suggested

by Wegelin [Wegelin]. Both vectors should be standardized to have unit length.

This ensures that singular vectors are obtained in the order of decreasing singular

values. Thus, first singular vectors will be obtained first followed by the second and

so on. This choice is similar to using first singular vector of sample matrix K since

it usually resembles means of columns or rows depending on whether we consider

the right or left side.

• Vector consisting of random numbers that has unit length.

To assess the performance of these starting values and their effect on the results of SCCA

I performed simulation studies to compare all four choices.


Simulation to assess convergence of SCCA

To study the effect of starting values on the SCCA results I simulated data based on the

latent variable model. The results obtained for each of the 4 choices of starting values were compared based on the number of variables selected for sets X and Y and the number of variables selected correctly, i.e. variables selected by SCCA that are truly important and were simulated to have association between sets X and Y. I also compared the results for the 4 starting value types based on the discordance measure, which is calculated as the number of variables misidentified by SCCA compared to the true model used in the simulation. Thus,

the discordance measure for set X is a combination of two different types of errors: the

number of false positives (true noise variable selected by SCCA) plus the number of false

negatives (true important variable not selected by SCCA). Discordance for Y is computed

similarly.
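In code the discordance measure is a two-line count (a sketch; boolean indicator vectors of selected and truly associated variables are assumed):

```python
import numpy as np

def discordance(selected, truth):
    """Discordance = false positives + false negatives, given boolean
    vectors indicating selected and truly associated variables."""
    selected, truth = np.asarray(selected, bool), np.asarray(truth, bool)
    return int((selected & ~truth).sum() + (~selected & truth).sum())
```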

In each simulation the optimal combination of sparseness parameters λu and λv was

identified using 10-fold CV separately for each starting value choice and for each data

set. Then subsets of variables in X and Y were obtained by applying SCCA algorithm

with the optimal parameters λu and λv for soft-thresholding.

Simulation design:

Fifty replications were carried out for each of several different choices of the true correla-

tion between the sets of associated variables in X and Y. A greater number of simulations

would be beneficial for making the trends smoother; however, it would be computationally very time consuming, since the SCCA algorithm has to be applied 4 times for each simulation (once for each of the four different starting values) and each application of the SCCA algorithm includes cross-validation for sparseness parameter selection. This whole process is repeated

for every different value of the true correlation between the sets of associated variables in

X and Y. Investigation of the results for each simulated data set showed low variability

between the simulations for the specific conditions. Therefore, it is sufficient to use 50

simulated sets for each of several different choices of the true correlation. All four meth-


ods of starting values were applied in each simulation to facilitate comparison between

the results. Simulated data sets contained p.dep = q.dep = 20 associated variables each,

set X consisted of p = 150 variables and set Y had q = 100 variables. The sample size

n was 50. True correlations between the linear combination of the subsets of important

variables in sets X and Y ranged between 0.041 and 0.916.

Simulation results and conclusions:

Tables 3.1 and 3.2 and the corresponding figures 3.2 to 3.5 compare the results of

SCCA for 4 types of starting values for singular vectors. Results in each figure are shown

for different values of the true correlation between associated subsets of variables in X

and Y used to generate the data (ρ).

Table 3.1 and corresponding figures 3.2 and 3.4 demonstrate the results for the number

of variables in set X selected by SCCA averaged over 50 simulations. Figure 3.2 shows the

number of positives while figure 3.4 shows the average number of selected variables that

were selected correctly, i.e. these variables belong to the subsets of associated variables

used in data simulation (true positives). Table 3.2 and corresponding figures 3.3 and 3.5

demonstrate similar results for the set Y.

Note that as the true correlation ρ between associated variable subsets increases,

the results obtained using 4 different types of the starting values become more similar.

Thus, for true correlation value of 0.916 the numbers of positives and the numbers of

true positives are exactly the same for all 4 starting value types. In fact, identical sets

of variables in X and Y are identified in this case for all simulations. Also note that as

the true correlation increases, the number of selected variables in sets X and Y decreases

and approaches 20 which is the number of associated variables in each set used in the

simulation model. At the same time the number of correctly selected variables in sets X

and Y increases with increasing true correlation and also approaches 20. This indicates

that sensitivity and specificity of SCCA improve with the increasing correlation between

the subsets of associated variables.

Table 3.1: Investigation of the effect of 4 types of starting values for the singular vectors on SCCA: number of variables selected for set X (total number of variables 150) averaged over 50 simulations, for different correlations between associated subsets of variables in X and Y used in the simulations. Entries are the number of positives for X, with the number of true positives for X in parentheses.

average true correlation | sample singular vectors of K | true singular vectors of K | row and column means for K | Unif(0,1) random numbers
0.041 | 63.06 (8.32) | 66.44 (8.92) | 72 (9.66) | 63.22 (8.41)
0.116 | 62.24 (8.14) | 60.22 (8.10) | 62.46 (8.50) | 61.52 (8.12)
0.245 | 60.2 (8.12) | 65.4 (8.94) | 56.02 (7.62) | 59.48 (8.02)
0.531 | 52.46 (9.82) | 55.1 (10.74) | 60.64 (10.74) | 57.66 (10.40)
0.723 | 50.48 (14.0) | 50.14 (13.88) | 51.92 (13.92) | 52.28 (13.88)
0.858 | 41.34 (15.52) | 40.78 (15.42) | 41.2 (15.52) | 40.76 (15.42)
0.916 | 28.76 (16.26) | 28.74 (16.26) | 28.74 (16.26) | 28.74 (16.26)

In particular, average sensitivity for ρ = 0.916 is 0.81

and 0.84 for X and Y sets respectively while average positive predictive values for X and

Y are 0.66 and 0.63 respectively. The full sensitivity and specificity results for the analysis of the effect of starting values are not shown here. However, since these statistics are derived from the number of positives and the number of true positives, they are reflected in the figures and tables shown. A complete sensitivity-specificity analysis

of SCCA is presented separately in section 4.3 of chapter 4.

Tables 3.1 and 3.2 also show that the number of variables selected for the set X

is higher than for set Y for all types of starting values and almost all values of true

correlation between the sets of important variables.

Table 3.2: Investigation of the effect of 4 types of starting values for the singular vectors on SCCA: number of variables selected for set Y (total number of variables 100) averaged over 50 simulations, for different correlations between associated subsets of variables in X and Y used in the simulations. Entries are the number of positives for Y, with the number of true positives for Y in parentheses.

average true correlation | sample singular vectors of K | true singular vectors of K | row and column means for K | Unif(0,1) random numbers
0.041 | 55.9 (10.84) | 47.68 (9.72) | 47.9 (9.76) | 49.28 (9.86)
0.116 | 48.14 (9.9) | 46.74 (9.93) | 49.02 (10.12) | 47.9 (9.46)
0.245 | 51.46 (10.34) | 53.86 (11.14) | 46.44 (9.28) | 50.14 (9.94)
0.531 | 46.56 (11.64) | 46.08 (12.2) | 47.94 (11.94) | 44.2 (11.44)
0.723 | 43.7 (15.12) | 43.46 (15.08) | 44.14 (15.1) | 44.72 (15.12)
0.858 | 34.18 (16.32) | 34.24 (16.34) | 34.16 (16.32) | 34.22 (16.34)
0.916 | 33.86 (16.86) | 33.86 (16.86) | 33.86 (16.86) | 33.86 (16.86)

This observation can be explained

by the fact that set X contains 150 variables and set Y contains 100 variables while 20

variables in each set are associated between X and Y. Thus, there are more noise variables

in set X than in Y. Therefore, SCCA differentiates better between the informative and

uninformative (i.e. noise) variables when there are fewer noise variables in the set. The

numbers of true positives for the sets X and Y are similar in all considered cases. Hence,

the amount of noise in the data set has greater effect on the number of noise variables

included in the solution (false positives) than on the number of variables selected correctly

(true positives). This indicates that preliminary filtering to remove noise from the data

would be beneficial for SCCA performance. This is discussed in greater detail in chapter

7.

Table 3.3 and Table 3.4 demonstrate the results for the discordance measure for

X and Y averaged over 50 simulations. These tables also show improvement in the

SCCA performance as well as increased agreement between the results obtained using

4 different starting values as the true correlation ρ increases with the identical results

for ρ = 0.916. Here improved SCCA performance is demonstrated by the decreasing

discordance measures.

Pairwise t-tests were used to compare the results for different starting value choices

averaged over 50 simulations. Since all 4 types of starting values were applied to the

same data set within every simulation, the measures of SCCA performance are not independent. Thus, for example, discordance measures for the set X for a specific true

correlation between important variables in X and Y are not independent observations for

each of 50 simulations. Therefore, pair-wise t-tests were carried out as follows. Suppose

m1 and m2 are 50×1 vectors of measures that are compared for the two types of starting

values (for example, discordance for set X in 50 simulations). Let v1 and v2 denote the

variances of these measures in 50 simulations and v12 denote the covariance.

v_1 = Var(m_1),   v_2 = Var(m_2),   v_{12} = Cov(m_1, m_2)

m̄_1 = Σ_{i=1}^{50} m_{i1},   m̄_2 = Σ_{i=1}^{50} m_{i2}

Then

( m̄_1 − m̄_2 ) / √( v_1 + v_2 − 2 v_{12} ) ∼ T_{(50−1)}

Table 3.3: Investigation of the effect of 4 types of starting values for the singular vectors on SCCA results: discordance measure for set X averaged over 50 simulations, for different correlations between associated subsets of variables in X and Y used in the simulations.

average true correlation | sample singular vectors of K | true singular vectors of K | row and column means for K | Unif(0,1) random numbers
0.041 | 66.42 | 68.6 | 72.68 | 66.34
0.116 | 65.96 | 64.02 | 65.46 | 65.28
0.245 | 63.96 | 67.52 | 60.78 | 63.44
0.531 | 52.82 | 53.62 | 59.16 | 56.86
0.723 | 42.48 | 42.38 | 44.08 | 44.52
0.858 | 30.3 | 29.04 | 30.16 | 29.92
0.916 | 16.24 | 16.22 | 16.22 | 16.22

Table 3.4: Investigation of the effect of 4 types of starting values for the singular vectors on SCCA results: discordance measure for set Y averaged over 50 simulations, for different correlations between associated subsets of variables in X and Y used in the simulations.

average true correlation | sample singular vectors of K | true singular vectors of K | row and column means for K | Unif(0,1) random numbers
0.041 | 54.22 | 48.24 | 50.18 | 49.56
0.116 | 48.34 | 46.82 | 48.78 | 48.98
0.245 | 50.78 | 51.58 | 47.88 | 50.26
0.531 | 43.28 | 41.68 | 44.06 | 41.32
0.723 | 33.46 | 33.3 | 33.94 | 34.48
0.858 | 21.54 | 21.56 | 21.52 | 21.54
0.916 | 20.14 | 20.14 | 20.14 | 20.14
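The same paired comparison can also be run with a library routine; a sketch assuming SciPy is available (the arrays here are placeholder data for illustration, not the thesis measurements):

```python
import numpy as np
from scipy import stats

# m1, m2: a performance measure for two starting value choices across
# the 50 simulations (hypothetical values for illustration only).
rng = np.random.default_rng(1)
m1 = rng.normal(60, 5, size=50)
m2 = m1 + rng.normal(0, 0.5, size=50)      # strongly correlated pairs

t_stat, p_value = stats.ttest_rel(m1, m2)  # paired t-test with df = 49
print(t_stat, p_value)
```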

These tests were applied to compare the results presented in the tables above for all types

of the starting values for the singular vectors at each true correlation between important

variables in X and Y. The obtained p-values indicate no statistically significant difference in the performance of SCCA for any type of starting values. For instance, when

discordance measures for set X at true correlation of 0.04 are compared between using

sample singular vectors or random numbers as starting values, the p-value is

0.998372. All other obtained p-values exceed 0.72. In fact, when higher true correlations

between important variables are considered (ρ = 0.916) then identical sets of variables

in X and Y are selected by SCCA using all 4 types of starting values, therefore there is

no difference between using any of these choices.

The presented simulation results demonstrate that the performance and convergence of the SCCA algorithm are not affected by the choice among the 4 types of starting values for the left and right

singular vectors. Therefore, the user could pick any vectors as long as they are not in the

left and right null spaces of the matrix K respectively. However, as stated by Wegelin

[Wegelin], in the absence of variable selection using row and column means of the matrix

K as starting vectors ensures the correct order of obtaining singular vectors, i.e. the

first singular vectors are obtained followed by the second singular vectors and so on.

Therefore, it is recommended to use this type of starting values in the SCCA algorithm as well. This does not introduce additional computational complexity, since row and column

means of the matrix are easily obtained and are faster to compute than sample first

singular vectors.
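For concreteness, a minimal sketch of this recommended choice of starting values, assuming the matrix K is available as a NumPy array (the function name is mine, not from the thesis):

```python
import numpy as np

def start_from_means(K):
    """Starting values from row and column means of the matrix K."""
    u0 = K.mean(axis=1)        # row means -> starting left singular vector
    v0 = K.mean(axis=0)        # column means -> starting right singular vector
    u0 /= np.linalg.norm(u0)   # standardize to unit length
    v0 /= np.linalg.norm(v0)
    return u0, v0
```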


3.6 Sparseness parameter selection

The optimal combination of sparseness parameters for soft-thresholding steps can be

selected using k-fold cross-validation (CV). For each step of the CV we consider a group

of combinations of sparseness parameters for left and right singular vectors (λu, λv). Then

for every specific pair of sparseness parameters, a fraction $\frac{k-1}{k}$ of the data (training sample) is used to identify linear combinations of variables. We evaluate the correlation between the obtained canonical vectors in the remaining $\frac{1}{k}$ of the data set (testing sample).

Test sample correlations obtained for each combination of the sparseness parameters are

averaged over k CV steps. The optimal combination of λu and λv then corresponds to

the highest average test sample correlation.
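A minimal sketch of this selection procedure is given below. It assumes a function scca(X, Y, lu, lv) returning the sparse singular vectors (u, v) is available; scca, cv_select and the grid are hypothetical names, not the thesis implementation.

```python
import numpy as np
from itertools import product

def cv_select(X, Y, scca, grid=np.linspace(0.0, 0.25, 26), k=10, seed=0):
    """Pick (lambda_u, lambda_v) maximizing average test sample correlation."""
    n = X.shape[0]
    folds = np.array_split(np.random.default_rng(seed).permutation(n), k)
    best_pair, best_cor = None, -np.inf
    for lu, lv in product(grid, grid):
        cors = []
        for test_idx in folds:
            train = np.setdiff1d(np.arange(n), test_idx)
            u, v = scca(X[train], Y[train], lu, lv)  # fit on training folds
            a, b = X[test_idx] @ u, Y[test_idx] @ v  # score the held-out fold
            # An all-zero singular vector gives an undefined correlation;
            # treat that case as zero test sample correlation.
            cors.append(np.corrcoef(a, b)[0, 1] if a.std() > 0 and b.std() > 0 else 0.0)
        if np.mean(cors) > best_cor:
            best_pair, best_cor = (lu, lv), float(np.mean(cors))
    return best_pair, best_cor
```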

Typical results of this process are demonstrated in figure 3.6. To create this graph I

simulated the data using the single latent variable approach with the following settings:

Simulation design:

• 1500 variables in set X

• 1000 variables in set Y

• 30 variables in each set are associated between the sets and the rest of the variables

are independent

• sample size is 150

• The range for both sparseness parameters λu and λv was between 0 and 0.25

I used 10-fold CV, thus the test sample contained 15 observations while the training

sample contained 135 observations.

Simulation results and conclusions:

Figure 3.6 shows average test sample correlations for canonical vectors obtained using


SCCA for different combinations of sparseness parameters as a 3D surface. The dotted

plane on the graph corresponds to the test sample correlation obtained using the full

SVD solution in which linear combinations of variables contain the entire sets of

variables in X and Y. In this case full SVD based test sample correlation was 0.42, while

the highest test sample correlation for SCCA solution was 0.86 and it was attained at

(λu = 0.101, λv = 0.131), hence this is the optimal sparseness parameter combination.

This graph also demonstrates the effect of the sparseness parameter selection on the

test sample correlation. When both λu and λv are set to zero, then no variable selection

is performed and SCCA provides the full SVD solution. This is demonstrated on the

graph by the equality in test sample correlations for SCCA and SVD in the bottom left

corner where both sparseness parameters are 0. However, if λu or λv are set to values

exceeding some threshold, then no variables are included in the sparse SCCA solution

which results in zero test sample correlation. This is the case in the far right corner of the graph, where both sparseness parameters are close to 0.25.

Figure 3.6 also shows that the sparse solution obtained using SCCA is more generalizable (i.e. applicable to an independent data set) than the full SVD solution that

includes all available variables. This is indicated by the higher test sample correlation

for SCCA solution compared to SVD even for suboptimal sparseness parameter combi-

nations. The full SVD solution tends to overfit the data, providing high correlation between

the linear combinations of variables for the given data set, but much lower correlation

for independent test data set generated under the same underlying model.


[Figure 3.2 (plot): "Average number of variables selected for X (positives for X)"; x-axis: average true correlation; y-axis: number of positives; four curves, one per starting-value type.]

Figure 3.2: Investigation of the effect of 4 types of starting values for the singular vectors

on SCCA: number of variables selected for set X (total number of variables 150) averaged

over 50 simulations for 4 types of starting values for the singular vectors and different

correlations between associated subsets of variables in X and Y used in the simulations.

Solid curve - sample singular vectors of K in equation 3.4, dashed curve - true singular

vectors of K, dotted curve - row and column means for K, dotted and dashed curve -

Unif(0, 1) random numbers.


[Figure 3.3 (plot): "Average number of variables selected for Y (positives for Y)"; x-axis: average true correlation; y-axis: number of positives; four curves, one per starting-value type.]

Figure 3.3: Investigation of the effect of 4 types of starting values for the singular vectors

on SCCA: number of variables selected for set Y (total number of variables 100) averaged

over 50 simulations for 4 types of starting values for the singular vectors and different

correlations between associated subsets of variables in X and Y used in the simulations.

Solid curve - sample singular vectors of K in equation 3.4, dashed curve - true singular

vectors of K, dotted curve - row and column means for K, dotted and dashed curve -

Unif(0, 1) random numbers.


[Figure 3.4 (plot): "Average number of variables correctly selected for X (true positives for X)"; x-axis: average true correlation; y-axis: number of true positives; four curves, one per starting-value type.]

Figure 3.4: Investigation of the effect of 4 types of starting values for the singular vectors

on SCCA: number of variables correctly selected for set X (total number of variables:

150, important variables: 20) averaged over 50 simulations for 4 types of starting values

for the singular vectors and different correlations between associated subsets of variables

in X and Y used in the simulations. Solid curve - sample singular vectors of K in equation

3.4, dashed curve - true singular vectors of K, dotted curve - row and column means for

K, dotted and dashed curve - Unif(0, 1) random numbers.


[Figure 3.5 (plot): "Average number of variables correctly selected for Y (true positives for Y)"; x-axis: average true correlation; y-axis: number of true positives; four curves, one per starting-value type.]

Figure 3.5: Investigation of the effect of 4 types of starting values for the singular vectors

on SCCA: number of variables correctly selected for set Y (total number of variables:

100, important variables: 20) averaged over 50 simulations for 4 types of starting values

for the singular vectors and different correlations between associated subsets of variables

in X and Y used in the simulations. Solid curve - sample singular vectors of K in equation

3.4, dashed curve - true singular vectors of K, dotted curve - row and column means for

K, dotted and dashed curve - Unif(0, 1) random numbers.


[Figure 3.6 (3D surface plot): "SCCA test sample correlation vs sparseness parameters"; axes: sparseness param. for alpha, sparseness param. for beta; vertical axis: test sample correlation (approx. 0.3 to 0.8).]

Figure 3.6: Average test sample correlation versus combinations of sparseness parameters

for left and right singular vectors. The maximum of test sample correlation determines

the optimal combination of sparseness parameters. 3D surface is for SCCA solution,

dotted plane is for full SVD solution.


Chapter 4

SCCA evaluation

In this chapter I present evaluation results for Sparse Canonical Correlation Analysis

(SCCA) performance.

I begin by describing the statistical tool for evaluation - cross-validation, in section

4.1. This approach can be used to evaluate the results of SCCA in real studies when

an independent test sample from the same distribution is not available. In the following

sections I describe various aspects of method performance and present their evaluation

using simulated data.

SCCA is a useful tool in the analysis of large data sets since it produces a sparse solution, thus allowing dimensionality reduction. However, large scale studies often suffer from insufficient sample size. For example, in genomic and genetic studies measure-

ments on thousands of gene expression profiles and SNP genotypes may be available while

the sample size may be limited to a few hundred individuals. Therefore, it is important

to assess the effect of insufficient sample size on the performance of SCCA. Results of

the evaluation are described in section 4.2.

Another common problem in large scale studies is the presence of a large number of

noise variables that are not related to the studied processes. The sparse solution provided

by SCCA incorporates filtering out the unimportant variables. This is done based on


the assumption that noise measurements are uncorrelated with each other within and

between different types of variables. However, if the correlation between the sets of

informative variables of different types is low it may be difficult to differentiate between

the noise and important variables. I evaluate the effect of the true underlying correlation

between the sets of associated variables in section 4.3.

4.1 Evaluation tool - Cross Validation

The performance of the SCCA algorithm can be evaluated based on an independent test sam-

ple correlation between the obtained sparse linear combinations of variables of different

types. Higher test sample correlation indicates greater generalizability of the results.

In the absence of an independent testing sample, the performance of the SCCA algorithm

can be evaluated using a two-step cross-validation consisting of inner and outer cross-

validations. Inner CV can be considered as part of the algorithm and is used to select

the optimal combination of the sparseness parameters for left and right singular vectors.

The outer loop is used to assess the generalizability of the results. Thus, for the $k_{outer}$-fold evaluation CV the original sample is split into $k_{outer}$ parts. Then $\frac{1}{k_{outer}}$ of the samples are reserved for testing (outer testing sample) and the remaining $\frac{k_{outer}-1}{k_{outer}}$ are used

to obtain linear combinations of variables that are associated between the given data sets

(outer training sample). This process includes the $k_{inner}$-fold CV described in the sparseness parameter selection section (3.6), which treats the outer training sample as the original data to which $k_{inner}$-fold CV is applied. Outer test sample correlations are computed for each SCCA solution obtained for each outer training sample and then averaged over the $k_{outer}$ CV steps.
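As a sketch, the two-step scheme could look as follows, reusing the hypothetical scca() and cv_select() helpers from the earlier sparseness-parameter sketch:

```python
import numpy as np

def nested_cv(X, Y, scca, cv_select, k_outer=5, seed=1):
    """Outer CV estimate of generalizability, with inner CV tuning."""
    n = X.shape[0]
    folds = np.array_split(np.random.default_rng(seed).permutation(n), k_outer)
    outer_cors = []
    for test_idx in folds:
        train = np.setdiff1d(np.arange(n), test_idx)
        # Inner CV on the outer training sample selects (lambda_u, lambda_v).
        (lu, lv), _ = cv_select(X[train], Y[train], scca)
        u, v = scca(X[train], Y[train], lu, lv)
        a, b = X[test_idx] @ u, Y[test_idx] @ v
        outer_cors.append(np.corrcoef(a, b)[0, 1])
    return float(np.mean(outer_cors))  # averaged over k_outer CV steps
```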

In some studies the structure of the data may be more complex and some samples

may not be independent of others. One example is a genetic study where variables are

measured in several pedigrees. Then observations corresponding to the members of the


same pedigree are not independent. In these cases simple k-fold cross-validation may not

be desirable as it ignores the correlation structure in the data. One solution is to use

adaptive fold size dictated by the dependency between the samples. Thus, the original

data should not be split into k equal parts, but rather into k parts of comparable but

not necessarily equal size. For example, a family can be treated as an independent single

CV unit.
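A sketch of such fold construction, assuming a vector family_id labeling the pedigree of each observation (the names here are mine):

```python
import numpy as np

def family_folds(family_id, k=5, seed=0):
    """Split observations into k folds while keeping each family intact."""
    families = np.unique(family_id)
    rng = np.random.default_rng(seed)
    rng.shuffle(families)
    groups = np.array_split(families, k)  # k groups of whole families
    return [np.flatnonzero(np.isin(family_id, g)) for g in groups]
```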

Test sample correlations obtained in different CV steps for family-based data may

foster sensitivity analysis for pedigrees and may aid in detecting heterogeneity in the

pedigrees. An example of this phenomenon is described in the Application chapter (6).

4.2 Effect of sample size on generalizability

Sparse canonical correlation is useful in cases when there is a large number of variables

under consideration since it allows variable selection. Filtering out the noise provides

more interpretable and generalizable results as compared to the traditional canonical

correlation analysis. The CCA solution provides linear combinations of the entire set of

available variables. This approach may not be appropriate in microarray studies with high

dimensionality because a linear combination of thousands of variables may be difficult to

interpret from the biological perspective. Also it is known that only a subset of measured

genes may be expressed, therefore a solution that includes the entire set of variables would

contain noise which may reduce its applicability to an independent data set. Another

feature of microarray studies is the limited sample size. In these cases estimates of the full

singular vectors obtained from CCA may not be very accurate. To investigate the effect

of sample size on SCCA performance I performed a simulation study. I compared the

generalizability of SCCA results to the full SVD solution which is provided by CCA for

different sample sizes using the simulated data sets. Greater generalizability is indicated

by the higher test sample correlation between the linear combinations of variables of


different types.

Simulation design:

The data were simulated based on the single latent variable model described in section

3.4. I generated 2 sets of variables X and Y with 500 variables in the set X and 400

variables in Y. Thirty variables in each set were simulated to be associated between X

and Y (important variables) and the rest of the variables were noise. Standard deviation

for the latent variable µ was 1 which resulted in the true generated correlation between

the sets of associated variables equal to 0.769. Linear combinations of variables were

obtained using SCCA and CCA (full SVD first singular vectors) for sample sizes ranging

between 50 and 1500. Subsequently these results were applied to independent test data

sets generated from the same distribution to compute the test data correlation between

the linear combinations of variables, $\mathrm{Cor}(X_{test}u, Y_{test}v)$. The sample size for the test

data was 100. Fifty simulations were performed for each sample size and the test sample

correlations were averaged over the simulations.

Simulation results and conclusions:

Figure 4.1 demonstrates superior performance of SCCA compared to the full SVD solu-

tion especially for the small sample sizes. Sparse solutions obtained by SCCA are more

generalizable which is reflected in the higher correlation between the linear combinations

of variables applied to the independent test data. For instance, for sample size 100 SCCA

test sample correlation is 0.75 while for the full SVD solution it is 0.55. SCCA performs

better than full SVD for all sample sizes, however, as the sample size increases the full

SVD solution becomes more precise. This means that the values in the first singular

vectors that correspond to the noise variables become increasingly close to 0. Thus, the

full SVD solution approaches the sparse solution as the sample size increases. That is

demonstrated by the decreasing distance between the SCCA and full SVD curves for

higher sample sizes.

The graph also demonstrates the inconsistency of the full SVD solution


[Figure 4.1 (plot): "Effect of sample size on test sample correlation"; x-axis: sample size (0 to 1500); y-axis: test sample correlation (0.4 to 0.8); curves: SCCA, SVD.]

Figure 4.1: Sample size effect: test sample correlation for different sample sizes averaged

over 50 simulations for each sample size. Solid curve - test sample correlation for SCCA

solution, dashed curve - test sample correlation for full SVD solution.

discussed in Johnstone and Lu [Johnstone and Lu, 2004], where the authors show that the

estimate of the first principal component is consistent if and only if

\[
c = \lim_{n \to \infty} p(n)/n = 0 \tag{4.1}
\]

In this simulation, however, the number of variables in each of the sets X and Y is

comparable to the sample size or exceeds it. Therefore, the full SVD estimates of the

first singular vectors are inconsistent which is reflected in the apparent upper boundary

for the test sample correlation: even for higher sample sizes test sample correlation for

full SVD solution does not exceed test sample correlation for SCCA solution and the

curves seem to approach their asymptotes in parallel.


Test sample correlation for the SCCA estimates seems to have zero slope for the

higher sample sizes and approaches 0.769 which is the true correlation between the sets of

associated variables used in the generating model. This suggests consistency of the SCCA

estimates. In case of sparse canonical correlation, however, it should be stressed that we

are addressing a new aspect - accuracy of the model selection rather than accuracy of the

estimated coefficients in the singular vectors. Identifying positions of zero coefficients in

the singular vectors is of greater importance than the accuracy of the non-zero coefficients,

which affects test sample correlation.

4.3 Effect of the true association between the linear

combinations of variables

In addition to the influence of sample size on the performance of SCCA, it may also

be affected by the true correlation between the associated subsets of variables in the

considered data sets X and Y. This effect was studied using the simulated data.

Simulation design:

I generated 1500 variables for set X and 1000 variables for set Y using the single

latent variable model. Thirty variables in each data set were associated. The sample size

was 150. Presented results are averaged over 50 simulations. SCCA performance in cases

of different true correlation ranging from 0.68 to 0.95 was evaluated using sensitivity and

positive predictive value (PPV) measures. The sensitivity is computed as

\[
\text{sensitivity} = \frac{\#\text{true positive}}{\#\text{true}}
\]

Positive predictive value is computed as

\[
\text{PPV} = \frac{\#\text{true positive}}{\#\text{positive}}
\]

The optimal sparseness parameter selection was based on the 10-fold cross-validation.
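A small sketch of these two measures, with hypothetical index sets for the truly important and the selected variables:

```python
def sensitivity_ppv(true_idx, selected_idx):
    """Sensitivity = TP / #true, PPV = TP / #positive."""
    true_set, sel_set = set(true_idx), set(selected_idx)
    tp = len(true_set & sel_set)                 # correctly selected variables
    sens = tp / len(true_set) if true_set else 0.0
    ppv = tp / len(sel_set) if sel_set else 0.0
    return sens, ppv

# Example: 20 important variables, 30 selected, 15 of them correct
# gives sensitivity 15/20 = 0.75 and PPV 15/30 = 0.5.
```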


True cor   Sensitivity X   PPV X   Sensitivity Y   PPV Y   Test sample cor
0.68       0.49            0.29    0.53            0.21    0.56
0.80       0.66            0.59    0.71            0.53    0.82
0.95       0.79            0.91    0.79            0.83    0.96

Table 4.1: The effect of true correlation between the associated subsets of variables in

X and Y on the sensitivity, positive predictive value and test sample correlation. The

results are averaged over 50 simulations.

Simulation results and conclusions:

Table 4.1 demonstrates sensitivity and PPV for both X and Y data sets as well as test

sample correlations computed as the maximum test sample correlation corresponding to

the optimal combination of the sparseness parameters obtained from CV.

Simulation results show that as correlation increases and approaches 1, the sensitivity

and PPV increase. For low true correlation these values are lower because it is difficult to

differentiate between unrelated noise variables and associated variables of interest. Thus,

more noise variables and fewer "important" variables may be selected by SCCA.


Chapter 5

Extensions of SCCA

In this chapter I describe the evaluation of another aspect of SCCA performance - predic-

tion and correct model identification (section 5.1). These results are presented separately

from the evaluation of the effect of sample size and the effect of the true correlation

between the associated sets of variables on generalizability described in chapter 4 for two

reasons. First, there is a new question of interest - how well does the model selected by

SCCA recover the true underlying model in the data. The second reason is that this

evaluation study inspired development of two extensions of SCCA also presented in this

chapter (Adaptive SCCA is described in section 5.2 and Modified Adaptive SCCA is

described in section 5.3).

5.1 Oracle properties, prediction versus variable se-

lection

Selecting the subset of variables for best prediction is not the same as selecting the subset of variables for best recovery of the true model, i.e. the true subset of variables. In the case of SCCA the concept of prediction is similar to the concept of generalizability and is measured by an independent test sample correlation between the obtained linear


combinations of variables of different types. As outlined in chapter 2, during variable

selection lasso tends to shrink values of large coefficients towards 0 while setting param-

eters with small values to exactly 0. Therefore, even if the right subset of variables is

selected, the solution may still be biased. Likewise, it may not give the best prediction

results. When the lasso solution is obtained based on prediction, often noise variables are

included as well to improve prediction [Zou, 2006]. There is a trade-off between optimal

prediction and consistent variable selection in the lasso solution [Zou, 2006, Meinshausen

and Buhlmann, 2004].

We use soft-thresholding to perform variable selection which shares properties with

the elastic net solution when the ridge penalty is set to infinity in the sparse principal

components analysis approach of Zou et al. [Zou et al., 2004]. The benefit of the elastic

net approach over lasso is that if one of the variables from a group of correlated variables

is selected to be included in the model (has non-zero loading in canonical/principal vector

or non-zero coefficient in regression), then all variables from that group will be included

in the elastic net solution [Zou and Hastie, 2005]. On the other hand lasso picks only

one of the variables from the correlated group to be included in the model. In the case

of SCCA we are interested in establishing relationships between two subsets of variables

and would like all associated variables from two sources of data to be present in identified

subsets. Therefore, the elastic net approach is preferred for SCCA. However, the elastic

net may be subject to the same difficulty as lasso in terms of optimal prediction versus

model selection trade-off. In the next section I present the results of simulation studies

used to evaluate the oracle properties of SCCA.
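Since SCCA performs its selection by soft-thresholding, a minimal sketch of that operator may be useful for reference; the half-lambda threshold matches the form of equations 5.3 and 5.4 later in this chapter with unit weights, and the function name is mine:

```python
import numpy as np

def soft_threshold(u, lam):
    """Shrink loadings toward zero; set small loadings exactly to zero."""
    return np.sign(u) * np.maximum(np.abs(u) - 0.5 * lam, 0.0)
```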

Definition of oracle properties [Zou, 2006]:

Consider some model fitting procedure $\delta$. Let $\hat\beta(\delta)$ be the set of model coefficients estimated by $\delta$ and let $\beta^*$ be the set of true model coefficients. Also let $\mathcal{A} = \{j : \beta^*_j \neq 0\}$. Then procedure $\delta$ has oracle properties if asymptotically $\hat\beta(\delta)$ has the following properties:

• Correct subset identification: $\{j : \hat\beta_j \neq 0\} = \mathcal{A}$


• Optimal estimation rate: $\sqrt{n}\,\big(\hat\beta(\delta)_{\mathcal{A}} - \beta^*_{\mathcal{A}}\big) \to_d N(0, \Sigma^*)$, where $\Sigma^*$ is the covariance matrix under the true subset model.

Evaluation of oracle properties of SCCA

To evaluate the oracle properties of the SCCA algorithm, we first consider correct model

selection (i.e. subset identification). In the SCCA algorithm sparseness parameters for

left and right singular vectors are chosen based on the prediction measure as evaluated by

the test sample correlation in CV steps as described in the sparseness parameter selection

section. To be more specific, sparseness parameters and, therefore, the variables in sets X

and Y are chosen so that the correlation between their linear combinations when applied

to the independent test data set is maximized as follows. Let Xtest and Ytest be the

independent standardized test data sets and u(λ) and v(λ) be left and right singular

vectors obtained by applying SCCA to standardized training data sets X and Y using

a specific combination of sparseness parameters $(\lambda_u, \lambda_v)$. Then the optimal combination $(\lambda_u^{optimal}, \lambda_v^{optimal})$ is

\[
(\lambda_u^{optimal}, \lambda_v^{optimal}) = \operatorname*{argmax}_{\lambda_u, \lambda_v} \ \mathrm{Cor}\big(X_{test}\,u(\lambda),\; Y_{test}\,v(\lambda)\big)
\]

Once the optimal combination of the sparseness parameters is identified, subsets of vari-

ables in X and Y can be selected by applying SCCA to the whole available data and

obtaining the loadings for singular vectors u and v. We would expect that as sample size

increases SCCA should identify the correct subsets of variables with greater accuracy.

That is, fewer noise variables should be included and a greater number of important vari-

ables should have non-zero loadings in the corresponding singular vectors. To test this I

performed simulations using a single latent variable model as described in section 3.4.

Simulation design:

I used 150 variables in set X, 100 variables in set Y with 20 variables in each set be-

ing associated with each other (or "important") and the rest were noise variables. The


standard deviation used for simulation of the latent variable µ was 0.5 and the standard

deviation for the noise variables was 0.1 which resulted in the true correlation between

the linear combinations of important variables averaged across simulations equal to ap-

proximately 0.5. Sample sizes ranged between 20 and 600. Both λu and λv were allowed

to take values in the set (0.04, 0.06, 0.08, 0.10, 0.12, 0.14, 0.16, 0.18, 0.20).

The accuracy of the model selection is evaluated by the discordance measure which

reflects the number of incorrectly identified variables. The discordance can be separated

into two components: the number of false positives, i.e. the number of noise variables with

non-zero loadings in a singular vector, and the number of false negatives, i.e. the number

of important variables that have zero singular vector loadings and are not selected. Thus,

the discordance for sets X and Y used in subsequent simulations can be computed as

follows:

\[
\text{discordance}(X) = \#\text{false positive } X + \#\text{false negative } X
\]
\[
\text{discordance}(Y) = \#\text{false positive } Y + \#\text{false negative } Y
\]

Using outer CV here to evaluate the performance of SCCA is not necessary because

we are using the simulated data. Therefore, it is known which variables are important

and which are noise. That information is used to compute the discordance measure.

Thus, only one level of cross-validation is used in the algorithm to select the optimal

combination of the sparseness parameters for the left and right singular vectors. This

CV is performed as part of the SCCA algorithm and not as an evaluation tool.
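A sketch of the discordance computation for one set, assuming the estimated singular vector and the indices of the simulated important variables are available (names are mine):

```python
import numpy as np

def discordance(u_hat, important_idx):
    """discordance = #false positives + #false negatives."""
    selected = np.flatnonzero(u_hat)             # variables with non-zero loadings
    important = np.asarray(important_idx)
    fp = np.setdiff1d(selected, important).size  # noise variables selected
    fn = np.setdiff1d(important, selected).size  # important variables missed
    return fp + fn
```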

Simulation results and conclusions:

Figures 5.1 and 5.2 demonstrate the results of these simulations for sets X and Y or

equivalently for the left and right singular vectors respectively. The graphs show the

discordance measure for the different sample sizes as well as its components: the number

of false positives and the number of false negatives. The range of average number of

false negatives for set X is 3.42 to 10.84 and for set Y it is 3.14 to 9.22. Both graphs

demonstrate that as the sample size increases the true model identification accuracy


[Figure 5.1 (plot): "Discordance measures versus sample sizes for set X"; x-axis: sample size (0 to 600); y-axis: number of variables; curves: discordance, false negatives, false positives.]

Figure 5.1: Model selection measures versus sample size for data set X, average true

correlation between two linear combinations of important variables is 0.5. The results

are averaged over 50 simulations for each sample size. Solid curve - average discordance,

dashed curve - average number of false negatives, dotted curve - average number of false

positives.

increases, i.e. less noise is included in the model while more important variables are

selected. Also of interest here is the fact that the discordance between the true model

used for data simulation and subsets of variables identified by SCCA is mostly due to

the inclusion of the noise variables in obtained linear combinations. This is evident from

the fact that the curves for false positives are much closer to the discordance curves than

the curves for false negatives while, as was stated above, the discordance measure is the

sum of the number of false positives and false negatives. There are fewer than 4 false

negatives for data X for sample sizes larger than 490 while for data Y this is true for the


[Figure 5.2 (plot): "Discordance measures versus sample sizes for set Y"; x-axis: sample size (0 to 600); y-axis: number of variables; curves: discordance, false negatives, false positives.]

Figure 5.2: Model selection measures versus sample size for data set Y, average true

correlation between two linear combinations of important variables is 0.5. The results

are averaged over 50 simulations for each sample size. Solid curve - average discordance,

dashed curve - average number of false negatives, dotted curve - average number of false

positives.

sample sizes as low as 160.

I also performed similar data simulation for a higher true correlation between linear

combinations of important variables.

Simulation design:

The standard deviation used for simulation of the latent variable µ was 2 and the standard

deviation for the noise variables was 0.1 which resulted in the true correlation between

the linear combinations of important variables averaged across simulations equal to ap-

proximately 0.95.


Simulation results and conclusions:

In this situation we would expect to observe lower discordance values since it should be

easier to separate important variables from the noise. Figures 5.3 and 5.4 demonstrate

that this is indeed the case. The discordance measures are much lower than previously

observed for the same sample sizes. For example, when the data has 50 samples, then the

average discordance for set X is 60.22 in the low correlation case while it is only 14.32 in the

high correlation case. Also the range of the false negatives is much narrower when the

true correlation is higher: (2.46, 3.42) for set X, and even for small sample sizes almost

all important variables are included in the linear combinations.

Again, the main source of the error measured by discordance is the false positives, i.e. noise variables included in the linear combinations. However, when the true correlation

is high, then given a sufficient sample size almost all noise can be eliminated from the

sets of important variables selected by SCCA. This is demonstrated by the number of

false positives for the left singular vector being equal to zero for sample size 350 and

higher sample sizes. These figures show increased accuracy in the model identification

with the increasing sample size since the discordance for both data sets X and Y is

decreasing. Also comparison of the figures for different values of the true correlation

between linear combination of important variables demonstrates that the accuracy of the

model identification is higher when the true correlation is higher. This is explained by

the fact that in this case it is easier to distinguish noise variables from the important

variables.

Correct model identification versus prediction for SCCA

Although decreasing discordance is observed for increasing sample sizes, it is fairly high

for small samples, especially when the true correlation between the linear combinations

of important variables is low. For large samples we still observe non-zero discordance.

These results demonstrate that maximization of the prediction power of the model does


[Figure 5.3 (plot): "Discordance measures versus sample sizes for set X"; x-axis: sample size (0 to 600); y-axis: number of variables; curves: discordance, false negatives, false positives.]

Figure 5.3: Model selection measures versus sample size for data set X, average true

correlation between two linear combinations of important variables is 0.95. The results

are averaged over 50 simulations for each sample size. Solid curve - average discordance,

dashed curve - average number of false negatives, dotted curve - average number of false

positives.

not guarantee correct model selection. Even for large sample sizes a substantial number

of noise variables is included in the chosen subsets of variables. Also there are still

important variables being missed even for large samples. That indicates inconsistency

in variable selection: $\lim_n P(\mathcal{A}^\star = \mathcal{A}) \neq 1$. I used another form of evaluation to further

investigate this issue. I performed data simulation to compare performance of SCCA for

two approaches of sparseness parameters selection:

1. maximization of the test sample correlation


[Figure 5.4 (plot): "Discordance measures versus sample sizes for set Y"; x-axis: sample size (0 to 600); y-axis: number of variables; curves: discordance, false negatives, false positives.]

Figure 5.4: Model selection measures versus sample size for data set Y, average true

correlation between two linear combinations of important variables is 0.95. The results

are averaged over 50 simulations for each sample size. Solid curve - average discordance,

dashed curve - average number of false negatives, dotted curve - average number of false

positives.

2. minimization of the average discordance measure for sets X and Y, i.e.

\[
\text{discordance} = \frac{1}{2}\big(\text{discordance}(X) + \text{discordance}(Y)\big) \tag{5.1}
\]

The implementation of the second approach is only possible in a simulation study where

we have the information about the true underlying model used to generate the data since

we need to know the number of false positives and false negatives.

As in the previous section simulations are based on a single latent variable model.


Simulation design:

Set X contains 150 variables, 100 variables are generated for set Y with 20 variables

in each set being associated with each other (or "important") and the rest were noise

variables. The standard deviation used for simulation of the latent variable µ was 2 and

the standard deviation for the noise variables was 0.1 which resulted in the true correla-

tion between the linear combinations of important variables equal to approximately 0.95.

Both λu and λv were allowed to take values between 0.0001 and 0.3. For both approaches

of sparseness parameter selection, linear combinations of variables were identified by apply-

ing SCCA to 45 observations. Subsequently test sample correlation was computed using

these results for independent test data consisting of 5 observations generated from the

same distribution. The average discordance measure for sets X and Y was computed

based on obtained linear combinations of variables and the knowledge of which variables

were simulated as ”important”. Test sample correlation based on the full SVD solution

was also computed for comparison purposes. The full SVD solution that includes all

available variables in the linear combinations was obtained using the same 45 observa-

tions as for SCCA, and then the correlation between linear combinations of variables in the test sample (5 observations) was computed. I simulated a single sample to illustrate the

relationship between the sparseness parameters and test criterion.

Simulation results and conclusions:

Figure 5.5 shows test sample correlations for different combinations of sparseness param-

eters for left and right singular vectors as a 3D surface. The dotted plane on the graph

corresponds to the test sample correlation for the full SVD solution which is 0.9289835.

It is the same for all combinations of sparseness parameters since it does not depend on

them and all available variables are included in the linear combinations. In this simula-

tion test sample correlation obtained using SCCA exceeds test sample correlation based

on the full SVD for all combinations of sparseness parameters (the lowest SCCA value being 0.92906). The

optimal combination of the sparseness parameters corresponds to the maximum of the


test sample correlation, which is 0.9783144 and is attained at (λu = 0.0701, λv = 0.2901).

Notice that when sparseness parameters are set to the lowest values of 0.0001 all vari-

ables are included in the SCCA solution. Hence test sample correlation at the bottom

left corner is equal for SCCA and full SVD solutions. As the values of sparseness pa-

rameters are increased the SCCA solutions become more sparse, so that fewer variables

are included in the linear combinations. This leads to improved prediction power for

the independent data as measured by the increasing test sample correlation. The main

benefit of the sparse solution is that noise variables are eliminated which leads to greater

generalizability. However, as was discussed above there may still be some noise variables

present in the linear combinations of variables. This can be examined by looking at a

graph of discordance measure versus sparseness parameters.

Figure 5.6 shows average discordance measure for sets X and Y for different com-

binations of sparseness parameters for left and right singular vectors. In this case

the best solution and the corresponding optimal combination of the sparseness pa-

rameters are obtained by locating the minimum of the discordance measure which is

5. It is attained at several combinations of sparseness parameters: all pairs of λu =

(0.2301, 0.2401, 0.2501, 0.2601, 0.2701) and λv = (0.2601, 0.2701, 0.2801, 0.2901) and at

λu = 0.2301, λv = 0.2501. For these sparseness parameter combinations SCCA test sam-

ple correlation ranges between 0.9722442 and 0.9772633 which is a little lower than the

highest test sample correlation, however, still significantly higher than the SVD test sam-

ple correlation. The graph also demonstrates that as sparseness parameters increase the

average discordance for X and Y decreases to a minimum, however it starts to increase if

sparseness parameters are increased even further. Thus, for λu = (0.2801, 0.2901), λv =

0.2901 average discordance is 5.5. As discussed in the previous section, discordance is

equal to the sum of false positives and false negatives. It has also been demonstrated that

the dominant component is the number of false positives, i.e. the number of noise vari-

ables included in the linear combinations. As we increase sparseness parameters fewer


[Figure 5.5 (3D surface plot): "SCCA test sample correlation vs. sparseness parameters"; axes: sparseness param. for set X, sparseness param. for set Y; vertical axis: test sample correlation (approx. 0.93 to 0.97).]

Figure 5.5: Model identification versus prediction: test sample correlation versus sparse-

ness parameters combinations, true correlation between two linear combinations of impor-

tant variables is 0.9. 3D surface shows test sample correlations for linear combinations

obtained using SCCA. Dotted plane corresponds to test sample correlation for linear

combinations of all available variables obtained using full SVD.

variables are included in the SCCA solution and more noise variables are eliminated.

However, after reaching a certain threshold linear combinations of variables may become

too sparse with the important variables excluded as well.

The trade-off between the number of false positives and the number of false negatives

is confirmed by figures 5.7 and 5.8 showing the number of false positives and false

negatives for set X for different combinations of sparseness parameters. These graphs

demonstrate two components of the discordance measure. Graphs for set Y (not shown)

are similar. Note that there is much smaller variation in these measures with the


[Figure 5.6 (3D surface plot): "Discordance vs. sparseness parameters"; axes: sparseness param. for set X, sparseness param. for set Y; vertical axis: discordance (approx. 20 to 100).]

Figure 5.6: Model identification versus prediction: average discordance for sets X and Y

versus sparseness parameters combinations, true correlation between two linear combi-

nations of important variables is 0.9.

change in λv, the sparseness parameter associated with the right singular vector, i.e.

the linear combination of variables from set Y. However, minor variations in both false

positives and false negatives for set X are still observed since left and right singular

vectors are not estimated independently from each other. The graphs demonstrate that as

sparseness parameters increase, the number of false positives decreases while the number

of false negatives increases. In fact, for any λv and λu greater than 0.2401 as well as for

λv ≥ 0.0301 and λu = 0.2301 no noise variables are included in the linear combination

of variables from set X. However, in these cases 6 or 7 important variables are excluded

as well which is indicated by the number of false negatives. On the other hand, for low

values of sparseness parameters all important variables are included in SCCA solution


[Figure 5.7 (3D surface plot): "Number of false positives for set X vs. sparseness parameters"; axes: sparseness param. for set X, sparseness param. for set Y.]

Figure 5.7: Model identification versus prediction: the number of false positives for

set X versus sparseness parameters combinations, true correlation between two linear

combinations of important variables is 0.9.

along with some noise variables. Hence, there is a trade-off between eliminating noise

variables from the linear combinations and keeping the important variables. Relative

importance of one or the other aspect of model identification may be dictated by the

external factors such as biological interpretation. For example, if the results are going

to be used to generate new hypotheses and to identify subsets of variables for further

examination, then the preference may be set by the cost of additional experiments (how

many noise variables we can afford to include) or the significance of missing biological

factors of interest (how many important variables can be excluded from the model).

Now let’s return to the initial question of interest: predictive power versus correct

model identification. At the optimal combination of sparseness parameters corresponding


[Figure 5.8 (3D surface plot): "Number of false negatives for set X vs. sparseness parameters"; axes: sparseness param. for set X, sparseness param. for set Y.]

Figure 5.8: Model identification versus prediction: the number of false negatives for

set X versus sparseness parameters combinations, true correlation between two linear

combinations of important variables is 0.9.

to the highest test sample correlation, average discordance for sets X and Y is 32. In

this case 60 noise variables are included in the linear combination of variables from set

X while 0 important variables are excluded, and for Y 0 noise variables are included

in the linear combination while 4 important variables are excluded. This finding is not

surprising since the sparseness parameter for Y is very high resulting in a sparse solution

while the sparseness parameter for X is much lower, so a greater amount of noise is

included in addition to all important variables. It should also be emphasized here that

the difference between the maximum test sample correlation and test sample correlations

corresponding to the lowest average discordance is only between 0.0010511 and 0.0060702

(there are several sparseness parameter combinations at which the lowest value of average


discordance is observed). Hence, it may be beneficial to focus on the correct model

identification with a small loss in the test sample correlation. A possible solution could

be selecting a combination of sparseness parameters such that the maximum number of

noise variables is eliminated while also maximizing test sample correlation.

5.2 SCCA extension I: Adaptive SCCA

It is typically not possible to select variables for the best subset recovery since we don’t

know the true subset of important variables a priori. In fact, even the number of impor-

tant variables is usually not known. Therefore, usually variable selection is done based

on a prediction criterion. In our case prediction is evaluated by the test sample corre-

lation in CV steps. In order to reduce the bias in the lasso solution H. Zou introduced

the adaptive lasso method [Zou, 2006] that includes additional weights in the lasso con-

straint. The weights are defined as $w = \frac{1}{|\hat\beta|^{\gamma}}$, where $\hat\beta$ is a root-$n$-consistent estimator of the true value of $\beta$ and $\gamma > 0$ is a pre-specified parameter. Then the adaptive lasso

estimates are given by

\[
\hat\beta^{*(n)} = \operatorname*{argmin}_{\beta} \Big\| y - \sum_{i=1}^{p} x_i \beta_i \Big\|^2 + \lambda_n \sum_{i=1}^{p} w_i |\beta_i| \tag{5.2}
\]

One suggested choice for the weights $w$ is the inverse of the full Ordinary Least Squares (OLS) solution, $1/|\hat\beta_{OLS}|$. An important remark in the paper is that these weights are

data dependent, i.e. as the sample size grows OLS estimates become increasingly precise

and the estimates for zero-coefficients should converge to 0. Therefore, the weights for

the zero-coefficients will increase to infinity, while the weights for nonzero-coefficients

will converge to some constant. Thus, parameters with large OLS values will be given a

smaller penalty weight. That should reduce the effect of shrinkage on the large values.
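As an illustration, the adaptive lasso (5.2) can be sketched via the standard column-rescaling trick: substituting $\beta_i = \theta_i / w_i$ turns the weighted penalty into an ordinary lasso on rescaled columns. This is my sketch under those assumptions, not code from the thesis:

```python
import numpy as np
from sklearn.linear_model import Lasso

def adaptive_lasso(X, y, gamma=1.0, lam=0.1):
    beta_ols, *_ = np.linalg.lstsq(X, y, rcond=None)  # root-n-consistent start
    w = 1.0 / (np.abs(beta_ols) ** gamma + 1e-12)     # adaptive weights
    fit = Lasso(alpha=lam, fit_intercept=False).fit(X / w, y)
    return fit.coef_ / w                              # back to the original scale
```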

The analogue in our case of the OLS solution is the complete first singular vectors from

the full SVD. The connection between SVD and regression (OLS) was demonstrated by

I.J. Good [Good, 1969].


This idea can be applied to modify the SCCA algorithm. In SCCA the soft-thresholding

used for variable selection can be adjusted to include additional weights for the coeffi-

cients in the singular vectors as follows:

Let $u^{SVD}$ and $v^{SVD}$ denote the first singular vectors obtained from a full singular value decomposition of the matrix K. Thus, all values in these vectors are non-zero. Also assume both $u^{SVD}$ and $v^{SVD}$ have been standardized to have unit length. Then the algorithm

for the adaptive SCCA is exactly the same as the algorithm for simple SCCA presented

earlier with modifications only in the soft-thresholding steps 3c and 4c. Step 3c in the

ith iteration of the adaptive SCCA algorithm becomes

\[
u^{i+1} \leftarrow \left( |u^{i+1}| - \frac{1}{2}\,\frac{\lambda_u}{|u^{SVD}|^{\gamma}} \right)_{+} \mathrm{Sign}(u^{i+1}) \tag{5.3}
\]

while step 4c in the $i$th iteration becomes

\[
v^{i+1} \leftarrow \left( |v^{i+1}| - \frac{1}{2}\,\frac{\lambda_v}{|v^{SVD}|^{\gamma}} \right)_{+} \mathrm{Sign}(v^{i+1}) \tag{5.4}
\]

where γ > 0 is a user-specified parameter. This modification does not change the property

of the algorithm when the sparseness parameters are set to zero: in this case the algorithm

provides full SVD singular vectors.
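A minimal sketch of this adaptive soft-thresholding update, assuming the full-SVD singular vector is available elementwise (the eps guard against division by zero is my addition):

```python
import numpy as np

def adaptive_soft_threshold(u, u_svd, lam, gamma=0.5, eps=1e-12):
    """Update (5.3)/(5.4): the penalty is rescaled by the full-SVD loadings,
    so variables with large full-SVD loadings are shrunk less."""
    thresh = 0.5 * lam / (np.abs(u_svd) ** gamma + eps)
    return np.sign(u) * np.maximum(np.abs(u) - thresh, 0.0)
```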

To investigate the performance of adaptive SCCA in terms of correct model identifi-

cation for different sample sizes and to compare it to the performance of regular SCCA

I carried out a simulation study based on the single latent variable model.

Simulation design:

For each simulation 150 variables were generated for the data set X and 100 variables

for the set Y. Twenty variables in each set were associated and the rest of the variables

represented independent noise. The sample size range was between 20 and 600 obser-

vations. The standard deviation used for simulation of the latent variable µ was 0.9

which resulted in approximately 0.8 correlation between the linear combinations of the

associated variables in X and Y. For each sample size, results were averaged over 50

simulations. The optimal sparseness parameter combinations for SCCA algorithm were


obtained based on 5-fold cross-validation. This CV approach as opposed to 10-fold CV

was chosen to shorten the simulation time as well as to allow a more realistic CV

choice for smaller sample sizes such as n = 20. I compared performance of the adaptive

SCCA to simple SCCA based on the test sample correlation between the obtained linear

combinations of variables and also based on the discordance measure. Test samples were

generated from the same distribution as the training sample for each simulation and

contained 50 observations.

Simulation results and conclusions:

Figure 5.9 demonstrates the simulation results for test sample correlation comparing

adaptive SCCA, SCCA and SVD for the power of weights in the parameters penalty

γ = 0.5. The graph shows that both adaptive SCCA and SCCA perform substantially

better than CCA based on the full SVD solution for all sample sizes, however there is

very little difference between the adaptive SCCA and SCCA in terms of the test sample

correlation. Thus, the modification of SCCA using the square root of the first singular

vectors as additional weights for the coefficients in linear combinations of variables does

not improve the predictive power of the obtained solution. To investigate the impact of

this modification on the correct model selection we can consider the discordance measures.

Figure 5.10 demonstrates the simulation results for the average discordance measure

(5.1) for the adaptive SCCA and regular SCCA. The graph shows average discordance

measures that correspond to the linear combinations of variables obtained from the adap-

tive SCCA and SCCA. The coefficients in linear combinations were estimated using the

optimal combination of sparseness parameters selected by maximization of the test sam-

ple correlation. Thus, the same sparseness parameters were used to generate this and the

previous graph in each simulation for each sample size. This discordance estimate can be

considered test sample correlation based. Figure 5.10 also shows the discordance curves

for both adaptive and regular SCCA that were obtained using the optimal combination

of sparseness parameters based on minimization of the average discordance measure.


[Figure 5.9 (plot): "Test sample correlation for SCCA, adaptive SCCA and full SVD"; x-axis: sample size (0 to 600); y-axis: test sample correlation; curves: Adaptive SCCA, SCCA, SVD.]

Figure 5.9: Compare adaptive SCCA, SCCA and SVD: test sample correlation vs sample

size, power of weights in soft-thresholding is 0.5. Solid curve - adaptive SCCA, dashed

curve - SCCA, dotted curve - SVD.

Thus, these curves demonstrate the minimum discordance that could be achieved for the

simulated data sets for the considered range of sparseness parameters. Adaptive SCCA

demonstrates better performance in terms of test sample correlation based discordance

measure for all sample sizes. It also demonstrates lower minimized discordance for sam-

ple sizes below 200. However, for higher sample sizes regular SCCA can achieve similar

or even lower minimum discordance. A solution that is based on minimizing discordance

can only be obtained in the simulation studies since for real data the information about

the important and noise variables is usually not available. Thus, in applied data analysis

the sparseness parameters would have to be chosen based on the test sample correlation.

In that case adaptive SCCA does show superior performance to SCCA for correct model


identification.

Figure 5.10: Compare adaptive SCCA and SCCA: discordance vs sample size, power of weights in soft-thresholding is 0.5. Solid curve - test-sample-correlation-based discordance for adaptive SCCA, dashed curve - test-sample-correlation-based discordance for SCCA, dotted curve - minimized discordance for adaptive SCCA, dashed and dotted curve - minimized discordance for SCCA.

The discordance measure consists of false positive and false negative components.

There is a trade-off between these statistics: when selection constraints are relaxed to

include more important variables in the model, thus lowering the number of false nega-

tives, the number of noise variables selected (i.e. the number of false positives) increases.

On the other hand, lowering the number of false positives results in fewer variables in

the model, which may increase the number of false negatives. To have a complete un-

derstanding of the effect of additional weights in soft-thresholding on the correct model


identification it is necessary to consider two components of discordance separately. Fig-

ures 5.11 and 5.12 show the number of false positives and negatives for the data set X

for adaptive SCCA and SCCA for different sample sizes. The results for data set Y are

similar and, therefore, not shown.

Figure 5.11: Compare adaptive SCCA and SCCA: number of false positives for set X vs sample size, power of weights in soft-thresholding is 0.5. Simulated number of noise variables in X is 130. Solid curve - test sample correlation based number of false positives for set X for adaptive SCCA, dashed curve - test sample correlation based number of false positives for set X for SCCA, dotted curve - number of false positives for set X based on minimized discordance for adaptive SCCA, dashed and dotted curve - number of false positives for set X based on minimized discordance for SCCA.

The graphs show that adaptive SCCA selects fewer false positives compared to SCCA

for all sample sizes while simple SCCA performs better in terms of the number of false


negatives. However, the difference in the number of important variables not included

in the solution is less significant than the difference in the number of noise variables

included. Also, both SCCA and adaptive SCCA perform almost as well in terms of the

number of false positives as the analysis methods based on minimizing the discordance

measure that use the knowledge about the true underlying model in the simulated data.

Thus, adaptive SCCA solution contains fewer false positives and a comparable number

of false negatives compared to SCCA; therefore, it demonstrates better model selection properties

than the original SCCA algorithm.

The adaptive SCCA method uses an additional user-controlled parameter, which is the

power of the weights γ in the soft-thresholding steps (5.3, 5.4). H. Zou [Zou, 2006] sug-

gests using an additional cross-validation to select the optimal value for this parameter.

Thus, for SCCA that means a two-dimensional cross-validation to select sparseness pa-

rameters for left and right singular vectors λu, λv (level 1 CV) and to select γ (level 2

CV). I investigated the effect of the power of the weights on the performance of adaptive

SCCA by simulating data using the same set up as described above for three values of

the power parameter: γ = 0.5, 1, 2.

Simulation results and conclusions:

Figure 5.13 demonstrates the simulation results for the test sample correlation comparing

adaptive SCCA performance for the three considered values of γ. The graph shows similar

performance of adaptive SCCA for γ = 0.5 and 1, while test sample correlation is lower

for all sample sizes for γ = 2. Using γ = 2 also results in a higher discordance measure

for adaptive SCCA, as demonstrated by Figure 5.14. The discordance for adaptive

SCCA with γ = 0.5 and 1 is similar for all sample sizes. Therefore, in further analysis

of the properties of adaptive sparse canonical correlation I use γ = 1 as the power of the

weights in the soft-thresholding step to reduce the computational complexity by avoiding

the second level of cross-validation necessary to select the optimal value of γ. This means

using the inverse of the values in the first singular vectors obtained from full SVD as the


weights in (5.3, 5.4). Thus, the soft-thresholding steps in adaptive SCCA are

$$u^{i+1} \leftarrow \left( |u^{i+1}| - \frac{1}{2}\,\frac{\lambda_u}{|u_{SVD}|} \right)_+ \mathrm{Sign}(u^{i+1})$$

and

$$v^{i+1} \leftarrow \left( |v^{i+1}| - \frac{1}{2}\,\frac{\lambda_v}{|v_{SVD}|} \right)_+ \mathrm{Sign}(v^{i+1})$$
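
As a concrete illustration, the following is a minimal Python sketch of this weighted soft-thresholding update; the function name and the small eps guard against zero loadings are illustrative assumptions, not the thesis implementation:

```python
import numpy as np

def adaptive_soft_threshold(u, u_svd, lam, gamma=1.0, eps=1e-12):
    # Weights are the inverse of the full-SVD loadings raised to the power
    # gamma; with gamma = 1 this is simply 1 / |u_svd|, as in the text.
    weights = 1.0 / (np.abs(u_svd) ** gamma + eps)
    # Shrink each coefficient by (lam / 2) * weight, keep its sign, and set
    # coefficients shrunk past zero exactly to zero (this creates sparsity).
    return np.sign(u) * np.maximum(np.abs(u) - 0.5 * lam * weights, 0.0)
```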

5.3 SCCA extension II: Modified adaptive SCCA

In addition to introducing weights into SCCA based on the adaptive lasso approach of

H. Zou [Zou, 2006] described above we can explore further modification of the SCCA

algorithm to improve its model selection properties. A method that correctly identifies

the underlying model should include all important variables in its solution (and thus

have few false negatives) while eliminating all unnecessary variables (and thus minimizing

the number of false positives). However, in applications complete information about the

informativeness of variables is often not available. Therefore, it is not possible to evaluate

and minimize the false positive and false negative rates. One possible solution was offered

by Wu et al. [Wu et al., 2007] who developed a new approach for controlling variable

selection based on the introduction of pseudovariables into the original data set. These

pseudovariables are simulated to represent noise variables and thus allow estimation of

the number of false positives selected by the method of interest.

The authors propose 4 approaches to the generation of the pseudovariables. In the

first approach $k_p$ variables are generated independently from a $N(0,1)$ distribution. In the

second approach pseudovariables are obtained by permuting the rows of the original data

matrix. In the other 2 approaches variables are generated as in the first two and then

they are regressed on the original data. Regression residuals are taken as the pseudovari-

ables. The objective of adding the simulated variables to the data is to introduce some

known uninformative variables that would allow estimation of the false positive rate. At

the same time this should not change the probabilities of selecting or not selecting the


original variables. Therefore, pseudovariables should resemble true noise variables in the

data as much as possible. In that case addition of independent identically distributed

N(0, 1) variables may not be realistic since their distribution may differ substantially

from the original variables, thus they will be easy to differentiate and eliminate from the

solution. Pseudovariables generated by row permutations have the same distribution as

the original variables, but they are independent of the outcome and, therefore, resemble

the uninformative variables. Using the regression residuals reduces the influence of the

pseudovariables on the selection probabilities of the original variables. Therefore, the

authors recommend simulating the pseudovariables by the row permutation and then

using the regression residuals.

In sparse canonical correlation analysis there are two data sets, X and Y, under

consideration and we are interested in identifying groups of variables that are associated

between the data sets. Therefore, to generate the pseudovariables we can permute the

rows in each set X and Y independently. That will produce variables that are independent

between the data sets X and Y, i.e. the uninformative variables. Regression residuals

are obtained as follows:

Let X be one of the data matrices and ZX be the matrix of generated pseudovariables

for X. Then the residuals after the regression on the original data are

(I −X(X ′X)−1X ′)ZX

. Similarly, for Y the regression-residual-based pseudovariables are

(I − Y (Y ′Y )−1Y ′)ZY

In the studies where sparse canonical correlation is likely to be applied, the number of

variables in data sets X and Y may exceed the number of observations. In that case,

the inverses $(X'X)^{-1}$ and $(Y'Y)^{-1}$ may not exist. One solution would be to use the

pseudo-inverses. However, the authors claim that there is a "slight" advantage to using

the regression-residual permutation method, which may be offset by the imperfection of


Moore-Penrose pseudo-inverses. To reduce the computational load we use the pseudo-

variables generated by row permutations without considering the regression residuals.

The row permutation approach automatically produces as many pseudovariables as

there are original variables. Wu et al. investigated the effect of the number of pseu-

dovariables used on the performance of their variable selection method. They compared

using all generated pseudovariables, randomly selecting half of the pseudovariables and

generating twice as many pseudovariables as there are original variables. The authors

found no significant difference among the three approaches. In large studies such as mi-

croarray analysis the number of variables under investigation may be tens of thousands.

Thus, adding the same number of pseudovariables to the data may make computations

infeasible. Therefore, we investigate the effect of introducing additional noise by using

half as many pseudovariables as there are original variables. Thus, if data set X contains

$p$ variables and set Y contains $q$ variables, then the sets of generated pseudovariables $Z_X$ and $Z_Y$ contain $\frac{1}{2}p$ and $\frac{1}{2}q$ variables, respectively. These variables are drawn as a random sample from the $p$ and $q$ pseudovariables obtained by row permutations.
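
A minimal sketch of this generation scheme follows, with the regression-residual variant included for completeness (all names are illustrative assumptions, not the thesis code):

```python
import numpy as np

def make_pseudovariables(X, frac=0.5, residualize=False, rng=None):
    # Permute the rows of X: permuted columns keep their marginal
    # distributions but lose any association with the other data set.
    rng = np.random.default_rng(rng)
    n, p = X.shape
    Z = X[rng.permutation(n), :]
    # Keep a random fraction of the columns (half as many pseudovariables
    # as original variables in the setup described above).
    keep = rng.choice(p, size=max(1, int(frac * p)), replace=False)
    Z = Z[:, keep]
    if residualize:
        # Regression residuals (I - X(X'X)^{-1}X')Z; pinv covers p > n.
        Z = Z - X @ (np.linalg.pinv(X) @ Z)
    return Z
```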

Controlling variable selection is based on minimizing the false selection rate. In sparse

canonical correlation analysis the false selection rate must be computed separately for

the data sets X and Y as follows:

$$\gamma_X(X,Y) = \frac{U_X(X,Y)}{1 + S_X(X,Y)} = \frac{U_X(X,Y)}{1 + I_X(X,Y) + U_X(X,Y)} \qquad \text{for set X} \qquad (5.5)$$

$$\gamma_Y(X,Y) = \frac{U_Y(X,Y)}{1 + S_Y(X,Y)} = \frac{U_Y(X,Y)}{1 + I_Y(X,Y) + U_Y(X,Y)} \qquad \text{for set Y} \qquad (5.6)$$

where $U_X(X,Y)$ and $U_Y(X,Y)$ are the numbers of uninformative variables selected as showing significant association (i.e. included in the model) from the sets X and Y, respectively. $I_X(X,Y)$ and $I_Y(X,Y)$ are the numbers of informative variables from X and Y included in the model. Finally, $S_X(X,Y)$ and $S_Y(X,Y)$ are the total numbers of variables from sets X and Y included in the model. Thus, $S_X(X,Y) = I_X(X,Y) + U_X(X,Y)$ for set X, with an analogous expression for data set Y. Thus, the false selection rate


is approximately equal to the ratio of the number of false positives over the number of

positives and is similar to the false discovery rate. However, the estimation procedure is

different. Therefore, I follow the terminology of Wu et al. and refer to the false selection

rate.

The general iterative algorithm for estimating the false selection rate is

1. Set $b = 1$.

2. Generate sets of $\frac{1}{2}p$ and $\frac{1}{2}q$ pseudovariables, $Z_X$ and $Z_Y$, by independent random row permutations as described above.

3. Estimate $\gamma_{X_b}(X,Y)$ and $\gamma_{Y_b}(X,Y)$.

4. Set $b = b + 1$; repeat steps 2 and 3 until $b = B$.

5. Compute the estimated false selection rates averaged over the iterations:

$$\gamma_X(X,Y) = \frac{1}{B}\sum_{b=1}^{B}\gamma_{X_b}(X,Y) \qquad \text{for set X}$$

$$\gamma_Y(X,Y) = \frac{1}{B}\sum_{b=1}^{B}\gamma_{Y_b}(X,Y) \qquad \text{for set Y}$$

6. Compute the common false selection rate for X and Y: $\gamma(X,Y) = \frac{1}{2}\left(\gamma_X(X,Y) + \gamma_Y(X,Y)\right)$.

Sparse canonical correlation or its modified version, adaptive SCCA, is then applied to the

data and $\gamma(X,Y)$ is estimated for different values of the sparseness parameters. The optimal combination of the sparseness parameters for left and right singular vectors corresponds to the lowest estimated false selection rate. A final model (i.e. groups of variables

associated between the sets X and Y) is chosen by applying SCCA / adaptive SCCA

using the optimal combination of the sparseness parameters.
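
The following sketch outlines the estimation loop, assuming a generic fit_scca(X, Y, alpha) routine that returns the indices of the selected columns of each extended data set, and reusing the make_pseudovariables sketch above; it counts selected pseudovariables directly as false positives, a simplification of the scaled estimator developed below, and all names are illustrative:

```python
import numpy as np

def estimate_common_fsr(X, Y, fit_scca, alpha, B=20, rng=None):
    rng = np.random.default_rng(rng)
    p, q = X.shape[1], Y.shape[1]
    gammas_x, gammas_y = [], []
    for _ in range(B):
        # Step 2: generate p/2 and q/2 pseudovariables by row permutation.
        Zx = make_pseudovariables(X, frac=0.5, rng=rng)
        Zy = make_pseudovariables(Y, frac=0.5, rng=rng)
        # Step 3: fit on the extended data; columns with index >= p (or q)
        # are pseudovariables, so selecting them is a false positive.
        sel_x, sel_y = fit_scca(np.hstack([X, Zx]), np.hstack([Y, Zy]), alpha)
        sel_x, sel_y = np.asarray(sel_x), np.asarray(sel_y)
        gammas_x.append(np.sum(sel_x >= p) / (1 + len(sel_x)))
        gammas_y.append(np.sum(sel_y >= q) / (1 + len(sel_y)))
    # Steps 5 and 6: average over iterations and combine the two data sets.
    return 0.5 * (np.mean(gammas_x) + np.mean(gammas_y))
```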

Wu et al. propose two methods for the estimation of the false selection rate. The

first one is based on estimating the expected ratio of the number of false positives to the


number of positives defined by the function

$$\gamma_{ER}(\alpha) = E\left\{\frac{U(\alpha)}{1 + S(\alpha)}\right\} \qquad (5.7)$$

where α represents parameters used by the algorithm for model fitting. These parameters

are specified by the user, determine the level of sparseness of the model and can be

optimized to obtain the best fit. In the case of SCCA α represents sparseness parameters

for left and right singular vectors, λu and λv. S(α) is the number of variables selected by

the method used for the specific values of parameters in α, which as above includes both

informative and uninformative variables chosen. U(α) is the number of the uninformative

variables included in the model, i.e. the number of false positives. If I(α) is the number

of informative variables selected, then S(α) = U(α) + I(α).

The second false selection rate estimate is based on estimating the ratio of expected

number of false positives to the expected number of positives and is defined by the

function

$$\gamma_{RE}(\alpha) = \frac{E\{U(\alpha)\}}{E\{1 + S(\alpha)\}} \qquad (5.8)$$

The first method depends only on the assumption of equal probabilities for selection of

real uninformative variables and generated pseudovariables. The second method depends

both on the same assumption and on the assumption that the original important variables

have the same probability of being selected regardless of whether pseudovariables have

been added to the data or not. To investigate the effect of introducing additional noise

on the model selection I use the first method, based on the expected ratio of the

number of false positives to the number of positives as it depends on fewer assumptions.

Sparse canonical correlation analysis deals with two data sets of variables simultaneously, and a separate false selection rate has to be computed for each data set. Thus, if there are two data sets X and Y containing $p$ and $q$ variables, respectively, then $\gamma_{ER_X}(\alpha)$ and $\gamma_{ER_Y}(\alpha)$ should be estimated. Using the approach offered by Wu et al., the estimates are

$$\gamma_{ER_X}(\alpha) = \frac{k_{U_X}(\alpha)\, U^*_{p_X}(\alpha)/k_{p_X}}{1 + S_X(\alpha)} \qquad (5.9)$$


for set X and

$$\gamma_{ER_Y}(\alpha) = \frac{k_{U_Y}(\alpha)\, U^*_{p_Y}(\alpha)/k_{p_Y}}{1 + S_Y(\alpha)} \qquad (5.10)$$

for data set Y. Here $k_{p_X}$ and $k_{p_Y}$ are the numbers of pseudovariables added to the original data sets X and Y, respectively. $S_X(\alpha)$ and $S_Y(\alpha)$ are the numbers of variables from sets X and Y included in the model when the parameter values are equal to $\alpha$, i.e. these are the numbers of variables chosen as showing significant inter-relation between sets X and Y. $k_{U_X}(\alpha)$ is the estimated number of original uninformative variables in the data set X, computed as

$$k_{U_X}(\alpha) = p - S_X(\alpha) \qquad (5.11)$$

Similarly, for set Y

$$k_{U_Y}(\alpha) = q - S_Y(\alpha) \qquad (5.12)$$

$U^*_{p_X}(\alpha)$ is the estimated number of pseudovariables generated for data set X included in the model for specific values of the selection parameters $\alpha$, obtained as

$$U^*_{p_X}(\alpha) = \frac{1}{B}\sum_{b=1}^{B} U^*_{p,b_X}(\alpha) \qquad (5.13)$$

where $B$ is the number of iterations in the general algorithm for estimating the false selection rate, i.e. the number of times the sets of pseudovariables $Z_X$ and $Z_Y$ are generated by permuting the rows of the original data matrices X and Y. $U^*_{p,b_X}(\alpha)$ is the number of pseudovariables for set X included in the model at iteration $b$.
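
Putting (5.9), (5.11) and (5.13) together, the estimate for set X reduces to a few lines; the following sketch uses illustrative argument names:

```python
def gamma_er_x(p, s_x, pseudo_selected_counts, k_px):
    # U*_{pX}(alpha): average number of selected pseudovariables over the
    # B iterations, as in (5.13).
    u_star = sum(pseudo_selected_counts) / len(pseudo_selected_counts)
    # k_{UX}(alpha): estimated number of uninformative originals, (5.11).
    k_ux = p - s_x
    # gamma_{ER_X}(alpha) as in (5.9).
    return (k_ux * u_star / k_px) / (1 + s_x)
```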

To investigate the effect of introducing additional noise variables on correct model

identification I carried out simulations for different sample sizes. I compared the perfor-

mance of 4 analysis methods:

• SCCA with added noise.

• Adaptive SCCA with added noise.

• SCCA without added noise (i.e. the original algorithm).


• Adaptive SCCA without added noise.

In the first two approaches sparseness parameters λu and λv are chosen by minimizing the

false selection rate. In the last two approaches the optimal combination of parameters is

chosen by maximizing the estimated test sample correlation using 5-fold cross-validation.

Simulation design:

Simulations were based on a single latent variable model. For each simulation 150 vari-

ables were generated for the data set X and 100 variables for the set Y. Twenty variables

in each set were associated and the rest of the variables represented independent noise.

The sample size range was between 20 and 400 observations. The standard deviation

used for simulation of the latent variable µ was 0.9, which resulted in a correlation of approximately 0.8 between the linear combinations of the associated variables in X and Y. For

each sample size, results were averaged over 50 simulations.

I compared performance of the four analysis methods listed above based on the test

sample correlation between the obtained linear combinations of variables and also based

on the true false selection rate as well as true false non-selection rate. To compute test

sample correlation, an independent test sample was generated from the same distribution

as the training sample for each simulation and contained 50 observations. True false

selection (FSR) and non-selection (FNR) rates are calculated based on the knowledge

about the underlying model used in simulations as follows:

$$FSR_{true}(X) = \frac{\text{number of false positives}}{1 + \text{number of positives}}$$

$$FNR_{true}(X) = \frac{\text{number of false negatives}}{1 + \text{number of negatives}}$$

Similar expressions are used for the Y data set. The common FSR and FNR values for X and Y data are

$$FSR_{true} = \frac{1}{2}\left(FSR_{true}(X) + FSR_{true}(Y)\right)$$

$$FNR_{true} = \frac{1}{2}\left(FNR_{true}(X) + FNR_{true}(Y)\right)$$
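
In the simulations these rates are straightforward to compute from the known model; a minimal sketch for one data set (names are illustrative):

```python
def true_fsr_fnr(selected, informative, p):
    sel, info = set(selected), set(informative)
    false_pos = len(sel - info)             # noise variables selected
    false_neg = len(info - sel)             # important variables missed
    fsr = false_pos / (1 + len(sel))        # positives = selected variables
    fnr = false_neg / (1 + (p - len(sel)))  # negatives = unselected variables
    return fsr, fnr
```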


Simulation results and conclusions:

Figure 5.15 demonstrates the simulation results for test sample correlation comparing

adaptive SCCA and SCCA with and without incorporating additional noise variables.

The power for the weights in adaptive SCCA is γ = 1. The graph demonstrates slightly

lower test sample correlation obtained by adaptive SCCA and SCCA when additional

noise variables are incorporated into the data sets (modified adaptive SCCA, modified

SCCA). This may be explained by the sparser solutions produced by the modified meth-

ods due to the lower number of uninformative variables included in the linear combinations

of variables that are associated between sets X and Y. However, test sample correlations

are very similar for all 4 methods especially for larger sample sizes. To further investigate

the solutions it is necessary to consider the false selection and non-selection rates.

Figure 5.16 demonstrates true false selection rate for adaptive SCCA and SCCA

with and without the modification while Figure 5.17 shows true non-selection rates.

Surprisingly, analysis methods that incorporate additional noise pseudovariables into the

data and are based on minimizing the false selection rate have higher true FSR and lower

true FNR for all sample sizes compared to the methods that only use the original data

and are based on maximizing test sample correlation. Both adaptive and simple SCCA with the added noise modification show a true FSR higher than 0.4 for almost all sample

sizes and it decreases slowly with the increasing sample size. This means that too many

uninformative variables are selected for the solution by the modified methods. That also

explains lower true FNR for the modified approaches compared to regular adaptive SCCA

and SCCA. Since a large number of variables are included in the linear combinations,

more important variables are included as well. FSR and FNR are based on the numbers

of false positives and false negatives selected by a method. Therefore, they are prone to

a similar trade-off phenomenon.

What is the explanation for the higher false selection rates observed for methods that

should be minimizing FSR? A possible reason is that modified approaches underestimate


the false selection rates and, therefore, result in solutions that include too many noise

variables. Let’s reconsider the estimate of FSR offered by Wu et al. For data set X

estimated FSR in (5.9) is

$$\gamma_{ER_X}(\alpha) = \frac{k_{U_X}(\alpha)\, U^*_{p_X}(\alpha)/k_{p_X}}{1 + S_X(\alpha)}$$

The numerator in this expression estimates the number of false positives as the estimated

proportion of the estimated number of uninformative variables included in the solution.

The denominator, SX(α), is the number of variables selected from the original set X

(number of positives) when no additional noise variables are incorporated in the data.

This value only depends on the original data, parameters α and the analysis method

used (adaptive SCCA or SCCA). $S_X(\alpha)$ provides a direct measure of the number of positives, or the number of variables selected for the solution, required for the estimation of FSR and, therefore, should not be a source of underestimation of FSR. $U^*_{p_X}(\alpha)/k_{p_X}$ also does not raise concern, since it estimates the proportion of false positives based on the known number of pseudovariables incorporated in the data and on the knowledge of which of the variables in the extended data set are pseudovariables. $k_{U_X}(\alpha) = p - S_X(\alpha)$ is the

estimated number of uninformative variables in the original data set X. This information

is not available in applications and, therefore, has to be estimated. The value is obtained

by subtracting the number of positives (original variables included in the solution) from

the total number of original variables. This estimate is based on the assumption that

at the optimal level of parameters $\alpha$ the analysis method includes only the important variables

in the solution and that it includes all important variables. However, in large studies

with small sample sizes, such as microarray studies, this assumption is unrealistic. Not

all important variables may be included in the solution while noise variables may be

selected as well. If many noise variables are included in the model, $k_{U_X}(\alpha)$

would underestimate the number of uninformative variables, thus underestimating FSR.

To further investigate the effect of the estimated number of uninformative variables on

the estimate of false selection rate I carried out simulations using the true numbers of


uninformative variables in sets X (130 noise variables) and Y (80 noise variables) in the

expressions for $\gamma_{ER_X}(\alpha)$ and $\gamma_{ER_Y}(\alpha)$ in (5.9, 5.10). I used the same simulation set-up

as in the previous simulations.

Simulation results and conclusions:

Figures 5.18 and 5.19 demonstrate true false selection and non-selection rates comparing

6 analysis methods:

1. SCCA with added noise using the true numbers of uninformative variables to esti-

mate and minimize FSR.

2. Adaptive SCCA with added noise using the true numbers of uninformative variables

to estimate and minimize FSR.

3. SCCA with added noise using estimated FSR as in Wu et al. [Wu et al., 2007].

4. Adaptive SCCA with added noise using estimated FSR as in Wu et al. [Wu et al.,

2007].

5. SCCA without added noise (i.e. the original algorithm).

6. Adaptive SCCA without added noise.

Figure 5.18 shows that when the estimated number of uninformative variables is used to

calculate the false selection rate as in Wu et al. (methods 3 and 4), then the modified analysis

methods perform worse compared to the modified methods that use true number of unin-

formative variables in the estimation of FSR (methods 1 and 2). The graph demonstrates

lower and more rapidly decreasing true false selection rate with increasing sample size

for the methods 1 and 2 in the list above. This supports the conclusion that the ap-

proach of Wu et al. underestimates the false selection rates. Although modified adaptive

SCCA and modified SCCA that use true numbers of uninformative variables to estimate

FSR perform better than methods 3 and 4, they do not show significant advantage over

the adaptive SCCA that does not incorporate any pseudovariables (method 6). In fact,


method 2 shows higher true FSR while method 1 has comparable performance. Given

the significantly higher computational complexity of the modified analysis methods, adaptive

SCCA is the best choice for the analysis of large data sets based on the true false selection

rate.

Figure 5.19 shows that the modified methods that use the Wu et al. FSR estimate (methods 3

and 4) have lower true false non-selection rate for all sample sizes compared to other

analysis methods. This effect can be explained by the trade-off between the number of

false positives and false negatives. Methods 3 and 4 include more variables in the linear

combinations of associated variables between sets X and Y. Thus, these solutions also

include greater number of important variables compared to the sparser linear combina-

tions produced by the other methods. However, the difference in true FNR between the six considered approaches is not as significant as the difference in true FSR: for most sample sizes true FNR values are well below 0.1 for all methods, which may be a satisfactory rate in many applications. On the other hand, true false selection rates for the methods

3 and 4 exceed 0.4 for the considered sample sizes, which indicates the presence of a large per-

centage of noise variables (at least 40%) in the solution. Thus, superior performance of

the third and fourth methods in terms of FNR is offset by their high true false selection

rates.

In summary, the modification of SCCA that introduces additional noise variables in the

original data set in order to estimate and minimize the number of uninformative vari-

ables included in the linear combinations, does not offer an advantage in minimizing

the number of false positives, and thus does not provide better model identification. However, it is more

computationally intensive and may be infeasible in large studies where the number of

variables of each type may be tens of thousands. Both modified SCCA and modified

adaptive SCCA underestimate the false selection rate which results in a high number

of noise variables included in the linear combinations of associated variables. This is due

to underestimation of the number of uninformative variables in the original sets of mea-


surements, which is used to estimate the false selection rate. Further development of this

approach is necessary to obtain better estimates. SCCA and adaptive SCCA methods are

preferred to the modified versions. The preference between SCCA and adaptive SCCA

methods may be set by an investigator based on the intended use of the results, available

resources, and prior biological knowledge.

The conclusion about the oracle properties of the developed methods is that adaptive

SCCA does not have an oracle property since the number of false negatives for adaptive

SCCA is not reduced to 0 for large sample sizes. However, the number of false posi-

tives is reduced compared to simple SCCA. The primary focus of SCCA is the correct

model identification, hence consistency of the estimates is of lesser importance and is not

considered.


Figure 5.12: Compare adaptive SCCA and SCCA: number of false negatives for set X vs sample size, power of weights in soft-thresholding is 0.5. Simulated number of important variables in X is 20. Solid curve - test sample correlation based number of false negatives for set X for adaptive SCCA, dashed curve - test sample correlation based number of false negatives for set X for SCCA, dotted curve - number of false negatives for set X based on minimized discordance for adaptive SCCA, dashed and dotted curve - number of false negatives for set X based on minimized discordance for SCCA.


Figure 5.13: Adaptive SCCA performance for different powers of weights in soft-thresholding: test sample correlation vs sample size. Solid curve - power is 0.5, dashed curve - power is 1, dotted curve - power is 2.


Figure 5.14: Adaptive SCCA performance for different powers of weights in soft-thresholding: discordance vs sample size. Solid curve - power is 0.5, dashed curve - power is 1, dotted curve - power is 2.


Figure 5.15: Test sample correlation for different sample sizes for adaptive SCCA and SCCA with and without incorporating additional noise pseudovariables. True correlation between two linear combinations of important variables is 0.8. Weights power for adaptive SCCA γ = 1.


Figure 5.16: True false selection rate for different sample sizes for adaptive SCCA and SCCA with and without incorporating additional noise pseudovariables. True correlation between two linear combinations of important variables is 0.8. Weights power for adaptive SCCA γ = 1.


Figure 5.17: True false non-selection rate for different sample sizes for adaptive SCCA and SCCA with and without incorporating additional noise pseudovariables. True correlation between two linear combinations of important variables is 0.8. Weights power for adaptive SCCA γ = 1.


Figure 5.18: True false selection rate for different sample sizes for adaptive SCCA and SCCA with and without incorporating additional noise pseudovariables. True correlation between two linear combinations of important variables is 0.8. Weights power for adaptive SCCA γ = 1. Black curves: solid - adaptive SCCA with added noise, modified FSR est., dashed - SCCA with added noise, modified FSR est., dotted - adaptive SCCA without added noise, dashed and dotted - SCCA without added noise. Grey curves: solid - adaptive SCCA with added noise, FSR est. as in Wu et al., dashed - SCCA with added noise, FSR est. as in Wu et al.


Figure 5.19: True false non-selection rate for different sample sizes for adaptive SCCA and SCCA with and without incorporating additional noise pseudovariables. True correlation between two linear combinations of important variables is 0.8. Weights power for adaptive SCCA γ = 1. Black curves: solid - adaptive SCCA with added noise, modified FSR est., dashed - SCCA with added noise, modified FSR est., dotted - adaptive SCCA without added noise, dashed and dotted - SCCA without added noise. Grey curves: solid - adaptive SCCA with added noise, FSR est. as in Wu et al., dashed - SCCA with added noise, FSR est. as in Wu et al.


Chapter 6

Application

6.1 Background

Several studies have demonstrated that there is variation in baseline gene expression lev-

els in humans that has a genetic component [Cheung et al., 2005, Morley et al., 2004].

Genome-wide analyses mapping genetic determinants of gene expression have been car-

ried out for expression of one gene at a time, which may be prone to a high false discovery

rate and computationally intensive since the number of genes under consideration often

exceeds tens of thousands. We present an exploratory multivariate method for initial in-

vestigation of such data and apply it to the data provided as problem 1 for the fifteenth

Genetic Analysis Workshop (GAW15). The linkages between the set of all SNP loci and

the set of all gene expression phenotypes can be characterized by a type of correlation

matrix based on the linkage analysis methodologies introduced by Tritchler et al. [Tritch-

ler et al., 2003] and Commenges [Commenges, 1994]. In multivariate analysis a common

way to inspect the relationship between two sets of variables based on their correlation is

canonical correlation analysis, which determines linear combinations of variables for each

data set such that the two linear combinations have maximum correlation. However, due

to the large number of genes, linear combinations involving all the genotypes or gene


expression phenotypes lack biological plausibility and interpretability and may not be

generalizable. We have developed a new method, Sparse Canonical Correlation Analysis

(SCCA), which examines the relationships between many genetic loci and gene expres-

sion phenotypes simultaneously and establishes the association between them. SCCA

provides sparse linear combinations. That is, only small subsets of the loci and the gene

expression phenotypes have non-zero loadings so the solution provides correlated sets of

variables that are sufficiently small for biological interpretability and further investiga-

tion. The method can help generate new hypotheses and guide further investigation. In

this case, the correlation of interest is between gene expression profiles and SNP-based

measures; correlations within gene expressions or within SNPs separately are not a focus

of interest.

6.2 Materials and Methods

Data

The data consist of microarray gene expression measurements which are treated as quan-

titative traits and a large number of genotypes for 14 Centre d'Etude du Polymorphisme

Humain (CEPH) families from Utah. Each pedigree includes 3 generations with approx-

imately 8 offspring per sibship. There are 194 individuals, 56 of which are founders (the

information for their parents and other ancestors is not considered). Phenotypes were

measured by microarray gene expression profiles obtained from lymphoblastoid cells us-

ing the Affymetrix Human Genome Focus Arrays. Morley et al. [Morley et al., 2004]

selected 3554 genes among the available 8793 probes based on higher variation among

unrelated individuals than between replicate arrays for the same individual. Here we use

pre-processed and normalized data provided for these genes. Additional phenotypic data

obtained for CEPH families includes age and gender.

The normalization procedure for expression profiles used in this study was Affymetrix


Microarray Analysis Suite (MAS) [Affymetrix]. This may have a great effect on the

results, altering the subset of gene expressions and SNPs selected by SCCA. Beyene et

al. [Beyene et al., 2007] demonstrated the influence of the normalization procedures on

the final results for several approaches including Affymetrix MAS. However, our analysis

is independent of data preprocessing. We assume that the appropriate tools have been

applied to the data at hand.

Genotypes are measured by genetic markers provided by The SNP Consortium and

are available for 2882 autosomal and X-linked SNPs. The physical map for SNP locations

is also available.

The statistical model

In this study we are interested in identifying linear combinations of gene expression

levels and SNPs that have the largest possible correlation. Canonical correlation analysis

establishes such relationships between the two sets of variables [Mardia et al., 1979].

In conventional CCA, all variables are included in the fitted linear combinations.

However, in microarray and genome-wide data the number of genes under consideration

often exceeds tens of thousands. In these cases linear combinations of all features may not

be easily interpretable. Sparse canonical correlation analysis (SCCA) enhances biological

interpretability and provides sets of variables with sparse loadings. This is consistent

with the belief that only a small proportion of genes are expressed under a certain set

of conditions, and that expressed genes are regulated at a subset of genetic locations.

We propose obtaining sparse linear combinations of features by considering a sparse

singular value decomposition of K where the singular vectors $u_1$ and $v_1$ have sparse loadings.

We developed an iterative algorithm that alternately approximates the left and right

singular vectors of the SVD using soft-thresholding for feature selection. This approach

is related to the Sparse Principal Component Analysis method of Zou et al. [Zou et al.,

2004] and Partial Least Squares methods described by Wegelin [Wegelin].


Analysis approach

In this study one type of variables is based on gene expression levels and the other type

of information relates to SNP genotypes and pedigree structure. An immediate challenge

in this context is how to define correlation between these two types of data. We adopted

a measure of covariance of genetic similarity with phenotypic similarity as in Tritchler et

al. [Tritchler et al., 2003] and Commenges [Commenges, 1994]. Consider the offspring

generation in all available pedigrees and take all possible sib-pairs. Let $y_{ij}$ and $y_{ik}$ be the phenotypes for siblings $j$ and $k$ in family $i$ for a particular gene expression, and let $w_{ijk}$ represent the IBD value for these siblings for some specific SNP. Then, for the considered

gene expression and SNP the test statistic in [Tritchler et al., 2003] is

$$\sigma = \sum_i \sum_j \sum_{k>j} \{y_{ij} - E(y_{ij})\}\{y_{ik} - E(y_{ik})\}\{w_{ijk} - E(w_{ijk})\} \qquad (6.1)$$

which is used for computation of a covariance matrix between the phenotypic similarity

and genotypic similarity. Note the similarity of the above expression to Haseman-Elston

regression. In fact, Tritchler et al. [Tritchler et al., 2003] show that the correlation

statistic subsumes both the original Haseman-Elston regression analysis and the later

Haseman-Elston (revisited).
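
Computing (6.1) for every pair of one gene expression and one SNP fills in the correlation-type matrix K that SCCA operates on. A minimal sketch of the per-pair computation, assuming centred phenotype and IBD values (array and function names are illustrative, not the implementation used in this thesis):

```python
import numpy as np

def sib_pair_statistic(y_centered, w_centered, pairs):
    # pairs is an (n_pairs, 2) integer array of sibling indices (j, k), k > j,
    # pooled over families; y_centered holds centred expression values per
    # sibling and w_centered the centred IBD value for each sib-pair.
    j, k = pairs[:, 0], pairs[:, 1]
    return np.sum(y_centered[j] * y_centered[k] * w_centered)
```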

Phenotypic similarity

The phenotypes in this study are the gene expression values for siblings in the last

generation of the pedigrees (i.e., the offspring generation). Previous studies have shown

that there is a variation in human gene expression according to age and gender [Morley

et al., 2004]. Therefore, we limit the analysis to the last generation in all pedigrees as

well as correct for the effects of gender and age by fitting a linear model

$$y_{ij} = \alpha + \beta_{gender}\, gender_{ij} + \beta_{age}\, age_{ij} + e_{ij} \qquad (6.2)$$
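
As a concrete illustration, a minimal sketch of this adjustment (an ordinary least-squares fit whose residuals replace the raw expression values; names are illustrative):

```python
import numpy as np

def adjust_for_age_gender(y, age, gender):
    # Design matrix with intercept, gender and age terms, as in (6.2).
    X = np.column_stack([np.ones_like(y), gender, age])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return y - X @ beta  # residuals replace the raw expression values
```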

Gender and age information was not available for all individuals in the pedigree 1454 and

for 3 individuals in pedigree 1340. Therefore, these individuals were excluded from the


analysis. In the 13 remaining pedigrees, there were 344 distinct sib-pairs with sibship

size varying between 15 and 28. Although sib-pairs are correlated within pedigrees, this

does not affect the results since no assumption of independence is made.

Genotypic similarity

For each sib-pair, the probabilities of sharing 0, 1 and 2 alleles identical by descent were

estimated using MERLIN. The provided physical map of SNP locations was used for this computation, since it is a suitable approximation to the genetic distances

required by MERLIN and the results do not show sensitivity to this substitution. Given

the incomplete genetic marker information for some individuals, exact IBD values could

not be computed. We estimated the number of alleles shared identical by descent by

two siblings as a posterior expected value based on the probabilities estimated using

MERLIN. Expected IBD values $E(w_{ijk})$ were computed as sample mean values over all

sib-pairs.

Standardization

We standardize the phenotype and genotype variables by subtracting the mean values

and dividing by the standard deviations. As described in the standardization section 3.3

of the methods chapter 3 simulations show that after data standardization the analysis

can be simplified by replacing variance matrices of gene expressions and IBD values by

the identity matrices while yielding satisfactory results. Then the matrix K in equation

3.2 is the covariance between the two data sets and the first canonical vectors in equation

3.3 are just u1 and v1.


Evaluation

We evaluated the results by performing SCCA with leave-one-out cross-validation (LOOCV)

analysis treating a pedigree as one unit.

In this study assessment of performance is based on the estimated test sample cor-

relation. It shows correlation between linear combinations of identified loci and gene

expressions in the independent sample. We used pedigree as the unit in LOOCV since

it represents a statistically independent unit. Leaving out one whole pedigree preserves

dependence structure in the family based study and ensures independence between train-

ing and testing samples. Using a random sample of 100/k% of all individuals for k-fold

CV would destroy familial correlation. We carried out an analogue of 13-fold CV where

fold-size was dictated by the complex structure of the data. Also, leaving out one pedi-

gree facilitates sensitivity analysis and shows the influence of the specific pedigrees on

the results.

The SCCA algorithm involves CV analysis for selection of the sparseness parameters

for gene expressions and SNPs. Therefore, validation of the SCCA performance is carried

out using a nested CV structure. In the outer CV loop the data is repeatedly split into

12 pedigrees for the training sample and 1 pedigree for the test sample. The complete SCCA

procedure (including a CV tuning step) is then applied to the training sample. This

means that the inner CV is performed by splitting the 12 families into 11 and 1 to select the best sparseness parameter combination, which is subsequently used to identify linear combinations of gene expressions and SNPs. These linear combinations are applied to the remaining 13th pedigree, the test sample in the outer CV loop, to compute the test

sample correlation. Results are then averaged over all 13 outer CV steps.
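
To make the scheme concrete, here is a minimal sketch of the nested loop, assuming a generic fit_scca(X, Y, params) routine that returns the canonical vectors (u, v); the data layout and all names are illustrative assumptions:

```python
import numpy as np

def nested_pedigree_cv(blocks, fit_scca, param_grid):
    # blocks: dict pedigree_id -> (X_rows, Y_rows) of sib-pair observations;
    # param_grid: list of (lambda_gene, lambda_snp) combinations to try.
    ids = list(blocks)
    outer_corrs = []
    for test_id in ids:                       # outer loop over 13 pedigrees
        train_ids = [i for i in ids if i != test_id]

        def stack(which, exclude=None):
            return np.vstack([blocks[i][which] for i in train_ids if i != exclude])

        def inner_score(params):              # inner CV: split 12 into 11 + 1
            scores = []
            for val_id in train_ids:
                u, v = fit_scca(stack(0, val_id), stack(1, val_id), params)
                Xv, Yv = blocks[val_id]
                scores.append(np.corrcoef(Xv @ u, Yv @ v)[0, 1])
            return np.mean(scores)

        best = max(param_grid, key=inner_score)
        u, v = fit_scca(stack(0), stack(1), best)
        Xt, Yt = blocks[test_id]
        outer_corrs.append(np.corrcoef(Xt @ u, Yt @ v)[0, 1])
    return np.mean(outer_corrs)               # averaged over outer CV steps
```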


6.3 Results

Optimal sparseness parameter combination selection and SCCA

results

Using cross-validation, we obtained a soft-threshold value of 0.07 for gene expressions

and 0.13 for SNPs corresponding to the maximal test sample correlation. The 3-D graph

in Figure 6.1 demonstrates the results of this CV. It shows the test sample correlation

between gene expression and SNP sets averaged over 13 CV steps. The dotted plane

corresponds to the test sample correlation for linear combinations of all variables ob-

tained using standard CCA, i.e. full SVD solution, which is 0.1384. It is constant for

all λgene expr. and λSNP since no parameters are involved in the analysis. The best aver-

age test sample correlation for SCCA is 0.1843. Figure 6.1 also demonstrates that for

evaluated sparseness parameter combinations SCCA provides a better solution in terms

of test sample correlation than the full SVD solution. We carried out SCCA using this

optimal combination of sparseness parameters and identified groups of 41 SNPs and 150

gene expressions with a between-group correlation of 43%. All obtained SNPs are uni-

formly distributed over a region on chromosome 9 between 86.80 megabases (Mb) and

120.09 Mb. Locations of expressed genes selected by SCCA are distributed over different

chromosomes. Six of the identified gene expressions are located on chromosome 9. The other three chromosomes that have more than 15 gene expressions each are 1, 2 and 6. No cis-

acting genetic regulators were found, where cis-regulators are defined as those that map

within a 5 Mb region, as previously defined in Morley et al. [Morley et al., 2004].

Cross-Validation of SCCA algorithm

Table 6.1 summarizes the results of the cross-validation study comparing the performance

of SCCA to the complete SVD solution that includes all 3554 gene expressions and 2882

SNPs. Average overlap between the group of 150 gene expressions selected using SCCA


Figure 6.1: 3-D graph: test sample correlation averaged over 13 LOOCV steps for different combinations of sparseness parameters for gene expression and SNP measures. Dotted plane: test sample correlation averaged over 13 LOOCV steps for standard CCA solution.

on the complete data and the groups of gene expressions selected in CV steps is 46

genes while the average intersection of the 41 SNPs with the results in CV steps is 34

SNPs. Inspecting CV iterations shows pedigrees to be heterogeneous, with two pedigrees,

1416 and 1418, being outliers. When these pedigrees are used as test samples (refer to

these 2 CV steps as step 1416 and step 1418) gene expression and SNP sets selected for

the training data differ substantially from the sets obtained in other CV iterations. In

particular, there are 159 and 166 SNPs selected in steps 1416 and 1418 respectively. These

two SNP sets have an overlap of 155 SNPs indicating that the results are very similar.

However, there are only 7 SNPs in common between these groups of SNPs and the groups


         Gene expr.   SNPs   Average test sample correlation
SCCA     83           66     0.1144
SVD      3554         2882   0.1384

Table 6.1: Validation: summary of prediction results for SCCA and full SVD averaged over 13 leave-one-out cross-validations.

selected in other CV iterations. All seven SNPs shared by the results in all CV steps are

located on chromosome 9 and are also a subset of the 41 SNPs selected by SCCA applied

to the whole data. The difference is more dramatic for the gene expression comparison.

Again similar sets of genes are selected in CV steps 1416 (98 gene expressions selected)

and 1418 (69 gene expressions selected). There is almost complete overlap between these

two groups - 68 gene expression profiles. However, there are at most 9 common gene

expressions between these groups and gene expressions selected in other CV steps. In

fact, there is an empty intersection between the gene sets obtained in CV steps 1416 and

1418 and genes selected when pedigree 1340 is used as a test sample. Also there are

respectively 7 and 3 common gene expressions between the group selected by SCCA for

the whole data and the groups in steps 1416 and 1418. On the other hand, results

obtained in all CV steps excluding steps 1416 and 1418 are more similar. The average

overlap in gene expressions between CV iterations and whole data results is 60 if we

also ignore steps where pedigrees 1340 and 1345 were used as test samples since in these

steps only 18 and 24 gene expressions were selected respectively. The average overlap in

SNPs is 40. The described results clearly suggest some differences between pedigrees 1416

and 1418 and the rest of the pedigrees. To improve the homogeneity of the data we apply the SCCA method as well as the validation procedure to the data with pedigrees 1416 and 1418

removed.


SCCA results for reduced data

Similarly to the analysis of the whole data we applied SCCA and the validation procedure

to the reduced data consisting of all pedigrees except 1416 and 1418. We obtained a soft-threshold value of 0.1 for gene expressions and 0.09 for SNPs corresponding to the

maximal test sample correlation. We carried out SCCA using the optimal combination

of sparseness parameters and identified groups of 134 SNPs and 63 gene expressions with

a between-group correlation of 51%. Nine of the selected SNPs are located on chromosome 9

between 116.61Mb and 136.27Mb. Six of these SNPs were identified in the analysis of the

whole data. Chromosome 7 contains 59 of selected SNPs distributed uniformly between

27.73Mb and 106.92Mb. Also, a large portion of the SNPs is located on chromosome 12:

37 SNPs between 76.50Mb and 115.75Mb. Other chromosomes that contain 10 or fewer of the selected SNPs are 4, 6, 11, 14, 16, 18, 20, and 23. The obtained gene expressions are

distributed over different chromosomes.

Table 6.2 summarizes the results of the cross-validation study comparing the performance

of SCCA to the complete SVD solution when both methods are applied to the reduced

data. Average overlap between the group of 63 gene expressions selected using SCCA

on the data consisting of 11 pedigrees and the groups of gene expressions selected in CV

steps is 30 genes while the average intersection of the 134 SNPs with the results in CV

steps is 59 SNPs. Inspecting CV iterations reveals further heterogeneity in the pedigrees.

When the pedigrees 1341, 1346, and 1408 are used as test samples gene expression and

SNP sets selected for the training data differ substantially from the sets obtained in other

CV iterations.

6.4 Adaptive SCCA Results

In addition to the presented SCCA analysis I also applied adaptive SCCA to the study

of natural variation in human gene expression. I used the algorithm described in the


            Gene expr.    SNPs    Average test sample correlation
SCCA            74          85              0.1544
SVD           3554        2882              0.1384

Table 6.2: Validation of SCCA applied to the reduced data: summary of prediction

results for SCCA and full SVD averaged over 11 leave-one-out cross-validation steps.

I used the algorithm described in section 5.2 of chapter 5, with the power of the weights in the soft-thresholding

penalty set to γ = 1. I considered soft-thresholding parameters λ_{gene expr.} and λ_{SNP} ranging

between 0 and 0.01 for both gene expressions and SNPs. Parameter values higher than

0.01 resulted in no variables being selected. Using cross-validation, I obtained a soft-

threshold value of 0.004 for gene expressions and 0.009 for SNPs corresponding to the

maximum test sample correlation, averaged over cross-validation steps, of 0.2092. This

average test sample correlation is higher than the best test sample correlation of 0.1843

produced by SCCA for the optimal combination of the sparseness parameters λ_{gene expr.} =

0.07 and λ_{SNP} = 0.13.
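
For reference, the adaptive soft-thresholding update with weight power γ used above can be sketched as follows. This is a hypothetical one-step illustration, not the chapter 5 implementation: the weights w_j = 1/|u_init,j|^γ come from an initial non-adaptive estimate, and γ = 1 matches the setting used here.

```python
import numpy as np

def adaptive_soft_threshold(u, u_init, lam, gamma=1.0, eps=1e-8):
    """Soft-threshold the working vector u with coefficient-specific
    penalties lam * w_j, where the adaptive weights w_j = 1/|u_init_j|**gamma
    are computed from an initial non-adaptive estimate, then renormalize."""
    w = 1.0 / (np.abs(u_init) ** gamma + eps)  # eps guards zero coefficients
    u_new = np.sign(u) * np.maximum(np.abs(u) - lam * w, 0.0)
    norm = np.linalg.norm(u_new)
    return u_new / norm if norm > 0 else u_new
```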

Similarly to SCCA, I carried out adaptive SCCA using the optimal combination of

sparseness parameters and identified groups of 19 SNPs and 28 gene expressions with

a between-group correlation of 32%. The selected SNPs and gene expressions are subsets

of the groups of SNPs and gene expressions identified by SCCA. Thus, all selected SNPs

are located on chromosome 9 between 102.82 megabases (Mb) and 117.15Mb. Locations

of selected expressed genes are distributed over different chromosomes. One of the iden-

tified gene expressions is located on chromosome 9. Chromosome 1 has 5 of the selected gene

expressions, chromosome 10 has 4. Other chromosomes contain fewer than 4 identified

gene expressions. The gene expression selected by adaptive SCCA that is located on chro-

mosome 9 is gene 209034_at, which is also one of the genes identified by Lantieri et al.

[Lantieri et al., 2007]. Since the adaptive SCCA results are subsets of the SCCA results,

again no cis-acting genetic regulators were found according to the definition in Morley


et al. [Morley et al., 2004].

6.5 Discussion

In this study we presented sparse canonical correlation analysis and demonstrated the

application of this new method to the simultaneous analysis of gene expression levels

and SNPs. Due to complex interactions between genes, a set of several genotypes may

be associated with several gene expressions possibly belonging to the same regulatory

pathway or genetic network. SCCA discovers such sets of genotypes and phenotypes

while keeping the size of groups sufficiently small for biological interpretability.

We identified a specific region on chromosome 9 that regulates a group of gene ex-

pression profiles. The selected set of loci should be interpreted as a whole in relation to the

whole set of selected gene expressions. We presented the results for sets of SNPs and

gene expression levels with maximum correlation between the SNP set and the gene expression

set. Maximization of within-group correlation is not the objective of SCCA, so the selected

gene expressions may not be highly correlated with each other, and the same is true for

SNPs.

This sparse solution may help to generate new hypotheses and isolate groups of loci

and gene expressions for future biological experimentation. For instance, selection of a

specific region on chromosome 9 by SCCA is particularly interesting, and a possible inter-

pretation could be that we found a regulatory region. The same region on chromosome 9

was also identified in other GAW15 contributions. For instance, considering a small set

of genes associated with the development of the enteric nervous system (ENS) Lantieri

et al. [Lantieri et al., 2007] also found evidence of linkage for two genes, 201387_s_at and

209034_at, to a unique common regulator located on chromosome 9 at 109 centiMorgan

(cM). Similarly, Wang et al. [Wang et al., 2007] found 10 gene expressions mapped to

a "hotspot" on chromosome 9; however, gene names and specific chromosomal locations


were not provided.

The smaller number of gene expressions and SNPs selected by SCCA facilitates better

biological interpretability of the results. The leave-one-out cross-validation results showed

a slightly lower average test sample correlation for SCCA compared to full SVD solution

as shown in table 6.1. For this particular data set, a possible explanation is outliers

among the results in the CV steps, due to the two incongruous pedigrees. This indicates

that using stringent constraints for subsetting variables may result in greater vulnerability

to outliers in the pedigrees. In simulations we have carried out to assess the performance

of SCCA, our method demonstrates better performance than standard CCA based

on full SVD in terms of test sample correlation. Thus, SCCA may potentially provide

a more robust solution. Additional empirical studies using more homogeneous sets of

pedigrees would be useful. Variability of the results in separate CV steps, indicated by

incomplete overlap of the selected sets of gene expressions and SNPs, points to the utility

of CV in determining heterogeneous sets of pedigrees.

I also applied adaptive sparse canonical correlation and identified a smaller group

of SNPs that is associated with a smaller set of gene expression profiles as compared

to the results obtained by SCCA. An important observation is that there is a complete

agreement between the solutions and no new results have been identified by the adaptive

SCCA. Also, the sets of SNPs and gene expressions selected by the adaptive SCCA are

sparser and they are the subsets of the sets of SNPs and gene expressions selected by

SCCA. This may be due to the reduction of the number of noise variables included in

the solution. This is supported by the simulation results presented in chapter 5

which indicate that adaptive SCCA does have a tendency to select fewer uninformative

variables compared to SCCA. However, simulations also show that sparse adaptive SCCA

solutions may also include fewer important variables than SCCA. Thus, reduced numbers

of SNPs and gene expressions obtained by the adaptive SCCA may also be missing some

regulatory SNPs or gene expressions that are associated with the identified genetic region on


chromosome 9. In conclusion, the adaptive SCCA solution is more likely to contain a higher

percentage of SNPs and gene expressions that are related to each other, while the SCCA

solution may identify a more complete list of associated SNPs and gene expressions. The

choice between the two solutions can be made based on biological knowledge and on the

cost of additional biological experiments that may be used to validate the results, i.e. the

relative cost of carrying out an additional experiment compared to the cost of missing

an important association between SNPs and gene expressions.

Both sparse canonical correlation analysis and adaptive SCCA allow the global inves-

tigation of both genomic and genetic data at the same time and provide an interpretable

answer even in studies with limited sample size and possible outliers such as this study

of natural variation in human gene expression. These methods are useful analytical tools

for genome-wide study of variation in human gene expression that can also identify new

regulatory pathways and genetic networks. Biological knowledge and intended further

analysis may guide the choice between adaptive SCCA and SCCA.


Chapter 7

Discussion and future work

7.1 Discussion

This thesis describes new methodology for the simultaneous analysis of two sets of mea-

surements to establish the relationships between them - sparse canonical correlation anal-

ysis (SCCA). SCCA identifies linear combinations of variables in each data set that have

the highest correlation between the different sets of measurements. In this case maxi-

mization of the correlation between the variables within each data set is not the focus of

the analysis. Sparse canonical correlation analysis is an extension of canonical corre-

lation analysis (CCA) that identifies such relationships between the variables. However,

CCA includes the entire sets of available variables in the linear combinations. In large

studies such as microarray and genome-wide linkage/association studies this may not be

practical due to the high dimensionality of the data. In these cases linear combinations of

all variables may lack biological interpretability. SCCA solves this problem by providing

a sparse solution. Sparse linear combinations of variables obtained from SCCA include

only small subsets of variables from each data set. Hence, they are easier to interpret

and may be used to generate new hypotheses for further testing.
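
As a concrete illustration of how such a sparse solution can be computed, the sketch below finds one sparse pair of singular vectors of the cross-correlation matrix K between the two standardized data sets by a soft-thresholded power iteration, in the spirit of the algorithm of chapter 3; the starting vector, iteration count, and tolerance are illustrative choices, not the thesis code.

```python
import numpy as np

def soft(a, lam):
    """Elementwise soft-thresholding operator."""
    return np.sign(a) * np.maximum(np.abs(a) - lam, 0.0)

def scca_first_pair(K, lam_u, lam_v, n_iter=100, tol=1e-6, seed=0):
    """Sparse power iteration for the leading singular vectors of the
    cross-correlation matrix K: alternate the updates u <- K v and
    v <- K'u, soft-threshold each vector, and renormalize."""
    rng = np.random.default_rng(seed)
    v = rng.standard_normal(K.shape[1])
    v /= np.linalg.norm(v)
    u = np.zeros(K.shape[0])
    for _ in range(n_iter):
        u_new = soft(K @ v, lam_u)
        u_new /= max(np.linalg.norm(u_new), 1e-12)
        v_new = soft(K.T @ u_new, lam_v)
        v_new /= max(np.linalg.norm(v_new), 1e-12)
        done = (np.linalg.norm(u_new - u) < tol and
                np.linalg.norm(v_new - v) < tol)
        u, v = u_new, v_new
        if done:
            break
    return u, v  # sparse loadings; nonzero entries mark selected variables
```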

I presented the sparse canonical correlation analysis algorithm and investigated its


properties. Simulation studies show better performance of SCCA compared to CCA in

terms of test sample correlation for different sample sizes. The difference is especially

large for small sample sizes, which is usually the case in large studies where we expect

SCCA to be most useful. I also investigated the oracle properties of SCCA according

to the definition in H. Zou [Zou, 2006]. The oracle property includes two components: con-

sistency of the estimated coefficients in the linear combinations of variables and correct

model identification. SCCA is an exploratory method and the focus of the analysis is on

identifying subsets of variables that have a significant association between the two data sets in

the study. Therefore, the values of the non-zero coefficients in the linear combinations

are of lesser importance than the locations of zeros. Identification of which coefficients

should be set to zero determines which variables are included in the linear combinations

and, thus, serves as the model selection. While consistency of the coefficients in the lin-

ear combinations of variables is a desirable property, the main focus of SCCA is correct

model identification.

Investigation of the model selection properties of SCCA using simulated data showed

that as sample size increases the number of unimportant variables selected as signifi-

cantly associated between the two data sets by SCCA (false positives) decreases. The

number of important variables not included in the model (false negatives) is decreasing

as well although at a slower rate and is less affected by the sample size. These misiden-

tified important variables could have loadings that are very small in absolute value in the linear

combinations of associated variables and therefore are difficult to differentiate from the

noise variables. Also the effect of the sample size on the number of false negatives is

stronger when the true simulated correlation between the linear combinations of impor-

tant variables is higher. Thus, for sufficiently large sample sizes (approximately twice

the number of variables in one data set or higher) the number of false positives selected

by SCCA is zero. Additional simulations demonstrate that maximization of the test

sample correlation to select the optimal combination of sparseness parameters for left


and right singular vectors that determine the solution does not guarantee the best model

identification. However, the true underlying model is not known in real studies. There-

fore, it is not possible to minimize the discordance measures, which leads to obtaining a

solution based on the test sample correlation maximization. This limitation inspired the

development of two extensions of SCCA: adaptive SCCA and modified adaptive SCCA.

I also presented adaptive SCCA - an extension of SCCA based on the adaptive LASSO

approach of H. Zou that may have preferred model selection properties in some appli-

cations. Similarly to SCCA, adaptive SCCA seeks a solution by maximizing the test

sample correlation. Simulations show that adaptive SCCA provides better filtration of

the noise and includes fewer uninformative variables in the linear combinations. However,

sparser sets of selected variables may also not include all important variables. Thus, there

is a trade-off between the number of false positives and false negatives. This leads to the

trade-off between using adaptive SCCA versus simple SCCA. In applications of SCCA

in which the results are used to form new hypotheses for further testing, an investigator

may prefer to exclude as much noise as possible if the cost of additional experiments is

very high. On the other hand, if the results are used for discovery of new effects and

the cost of additional testing is not high compared to the cost of missing an important

component, a solution that has a higher probability of containing most important variables,

possibly along with a higher percentage of noise, may be preferred. Thus, the choice be-

tween adaptive SCCA and SCCA may be made based on the biological knowledge and on

the relative cost of having a greater number of false positives compared to false negatives.

I investigated a further modification of SCCA that introduces additional noise vari-

ables into the original data sets in order to estimate and minimize the number of uninforma-

tive variables included in the linear combinations. This method is based on the analysis

approach described by Wu et al. [Wu et al., 2007]. The simulations suggest that it does

not offer an advantage in minimizing the number of false positives, and thus better model

identification. Furthermore, it is more computationally intensive and may be infeasible


in large studies where the number of variables in each data set may be in the tens of thousands.

Both adaptive SCCA and modified adaptive SCCA underestimate the false selection rate

which results in a high number of noise variables included in the linear combinations of

associated variables. This is due to underestimation of the number of uninformative vari-

ables in the original data sets, which is used to estimate the false selection rate. Further

development of this approach is necessary to obtain better estimates.

I demonstrated an application of both adaptive SCCA and SCCA using a real study

of natural variation in human gene expression. An important observation is that the

solution obtained by the adaptive SCCA is a subset of the solution obtained by SCCA.

Based on the simulation studies, the conclusion may be that the adaptive SCCA solution

contains fewer noise variables; however, it may also contain fewer important variables.

As discussed above this again raises a question of the trade-off between the number of

false positives and false negatives and the choice between the two solutions.

Simulations and application demonstrate the methodology and utility of SCCA and

adaptive SCCA methods developed in this thesis. Both of these methods are preferred

to the modified versions, which are based on introducing additional noise variables, due to the

computational complexity of the latter. The preference between SCCA and adaptive

SCCA methods may be set by an investigator based on the intended use of the results,

available resources, and prior biological knowledge.

7.2 Limitations of simulation studies

Simulations used to investigate the properties of SCCA and its extensions were based

on the single latent variable model and have several limitations. The first is that in

large scale genomic and genetic studies there is a possibility that the

measurements are taken on genes belonging to several different pathways or processes.

That means that several groups of associated variables of different types may be present in


the data. In this case a single latent variable model is not appropriate since it can describe

only one regulatory mechanism. Therefore, analysis of multiple processes requires a more

complex model that contains several independent latent variables each responsible for a

separate process. From a canonical correlation point of view that means considering

more than one pair of singular vectors. Similarly for SCCA, the solution would contain

several sparse linear combinations of variables. First, subsets of variables with the highest

correlation between their linear combinations can be identified as described in chapter

3. Then, the effect of these variables is removed by considering the residual correlation

matrix. This is followed by the identification of additional subsets of variables with

the next highest correlation using the residual matrix. This type of complex data may

pose additional challenges due to the increased difficulty of differentiating between the

noise and informative variables as well as between the subsets of variables associated with

different processes. Further simulation studies using multiple latent variable models are

necessary to investigate performance of SCCA in such cases.
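
Under these assumptions, the residual-matrix step described above could be sketched as follows, reusing the `scca_first_pair` iteration from the sketch in section 7.1; the rank-one deflation shown is one standard way to remove the effect of the first pair, offered as an illustration rather than a definitive construction.

```python
import numpy as np

def next_sparse_pair(K, u, v, lam_u, lam_v):
    """Deflate the cross-correlation matrix by the first sparse pair (u, v)
    and rerun the sparse iteration (scca_first_pair, sketched in section
    7.1) on the residual matrix to identify the subsets with the next
    highest correlation."""
    d = u @ K @ v                     # correlation captured by the first pair
    K_resid = K - d * np.outer(u, v)  # residual cross-correlation matrix
    return scca_first_pair(K_resid, lam_u, lam_v)
```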

Another limitation of the simulation studies is a lower number of variables of each

type compared to the number of variables in some large scale studies (up to 500 variables

in the simulations vs. 15000 to 20000 variables in genomic/genetic studies). At the same

time, simulated sample sizes and proportions of important variables are higher compared

to some applications. These simulation parameters were chosen to mimic frequent data

problems such as limited sample size and high proportion of noise variables while

keeping simulations feasible from a computational point of view. However, the alternative

approach to SCCA in the simultaneous analysis of two sets of variables would be canonical

correlation analysis which includes the entire sets of variables in the linear combinations.

In the case of large studies these results would lack biological interpretability. On the other

hand, SCCA is able to provide sparse solutions. Also, simulation results in section 4.2

show that SCCA has superior performance compared to CCA in terms of generalizability

as measured by the independent test sample correlation between the obtained linear


combinations of variables even for sample sizes much smaller than the number of variables.

Furthermore, the performance of SCCA improves rapidly with increasing sample size,

producing solutions with the true correlation between the associated subsets of variables

for sample sizes that are still much smaller than the number of variables. The usual

assumption in large studies is that many of the measurements are not related to the

processes of interest and represent noise. Therefore, it would be beneficial to filter out

some uninformative variables prior to analysis. This would increase the proportion of

the associated variables and the effective sample size. It may be unrealistic to expect

complete elimination of the noise variables at the filtering stage; therefore SCCA would

still be a useful tool for the analysis of such data. Further discussion of preliminary

filtering is presented in section 7.3.

The last limitation of the simulation studies concerns the rather high values of true correlation

between the associated subsets of variables. As shown in section 4.3, at lower values of true

correlation it is more difficult to differentiate between the noise and informative variables.

Further investigation of this effect and additional studies of possible improvement of the

performance of SCCA are necessary. Also, similarly to the discussion above, it would be

beneficial to perform preliminary noise filtering.

7.3 Preliminary filtering

Simulation results presented in section 3.5 of chapter 3 show better performance of SCCA

in the presence of fewer noise variables in the data sets. Also simulations in chapter 4

demonstrate improved performance of SCCA for sample sizes that are larger relative to the

number of variables. Thus, the results can be improved by filtering noise variables prior

to analysis. Uninformative variables can be filtered based on the variance of variables within

set X and set Y separately. If there is no variation in a variable, then it cannot be correlated

with anything. Another approach is to filter within sets X and Y separately based on


correlation of variables. This is based on the fact that if x1 and x2 are correlated with y,

where y can be a latent variable, then x1 and x2 should be correlated with each other.

Incorporating preliminary filtering into SCCA and studying its effects is the subject of

future study.
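
A minimal sketch of the two screening rules just described, applied within one set of variables; both cutoff values are illustrative assumptions rather than recommendations.

```python
import numpy as np

def prefilter(X, var_quantile=0.25, cor_threshold=0.3):
    """Variance screen: drop variables whose variance falls in the lowest
    quantile, since a variable with no variation cannot be correlated with
    anything. Correlation screen: keep variables whose largest absolute
    correlation with another variable in the set exceeds a threshold, since
    variables driven by a common (possibly latent) y should correlate."""
    variances = X.var(axis=0, ddof=1)
    X = X[:, variances > np.quantile(variances, var_quantile)]
    R = np.corrcoef(X, rowvar=False)
    np.fill_diagonal(R, 0.0)
    return X[:, np.max(np.abs(R), axis=0) > cor_threshold]
```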

7.4 More than two sources of data

Some recent genomic studies offer several phenotypes measured on the subjects. For

example, in the study of chronic fatigue syndrome (CFS) presented at the Critical As-

sessment of Microarray Data Analysis (CAMDA) workshop in 2006, different sources of

data were included in the study: clinical assessment of the patients, microarray gene ex-

pression profiles, proteomics data, and selected single nucleotide polymorphisms (SNPs).

The question of interest is data integration to establish the relationships between different

types of variables and to predict the disease class. Wold's original Partial Least Squares

(PLS) algorithm is applicable to several sets of data. However, the solution includes the

entire sets of available variables. In large scale studies similar to the study of CFS, these

results may lack biological interpretability. It would be interesting to develop an exten-

sion of Sparse Canonical Correlation Analysis based on the PLS approach applicable to

more than two sets of variables simultaneously.

7.5 Computation of variance

The data standardization section of chapter 3 describes the approximation of the vari-

ance matrices for different sets of variables by identity matrices after the data has

been standardized. This approach is based on the assumption that in high dimensional

problems most of the measured variables are not related to the process of interest, i.e.

they may be considered as noise, and the correlation between them is zero. Thus, the

correlation between gene expressions in the microarray studies is not taken into account.


The same assumption is made by the traditional analysis approaches such as differential

gene expression analysis carried out one gene expression at a time. An extension

of SCCA allowing better estimation and incorporation of the variance matrices for the

considered sets of variables would improve the solutions. In that case the often unrealistic

assumption of gene independence can be removed.
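
For concreteness, the standardization step underlying the identity approximation can be sketched as follows; under that approximation the SCCA input reduces to the scaled cross-product of the two standardized sets.

```python
import numpy as np

def standardize(X):
    """Center each variable and scale it to unit variance, after which the
    variance matrix of each set is approximated by the identity."""
    Z = X - X.mean(axis=0)
    sd = Z.std(axis=0, ddof=1)
    return Z / np.where(sd > 0, sd, 1.0)  # guard constant columns

# Under the identity approximation the SCCA input matrix reduces to
# K = standardize(X).T @ standardize(Y) / (X.shape[0] - 1)
```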

7.6 Computation of covariance

The covariance matrix between the two sets of variables in the study is the key element

in the SCCA algorithm. Therefore, it is crucial to have an accurate estimate of that

matrix. In the case of genome-wide microarray study there are often tens of thousands

of gene expressions under consideration. However, there is typically only a few hundred

observations available. Thus, the sample covariance matrix may not be a precise estimate

of the true underlying covariance structure. I propose to use bagging to improve the

estimate of the covariance matrix. This approach is similar to the method in [Schafer

and Strimmer, 2004].

General algorithm

Bagging or bootstrap aggregation is a general method that improves a sample based

estimator [Breiman, 1996]. It utilizes the bootstrapping technique as follows:

• For a given data set X generate B bootstrap sets $X^{*b}$, $b = 1, \dots, B$, by sampling

with replacement from the available observations.

• For each bootstrap sample calculate an estimate of the statistic of interest $\Theta^{*b}$.

• The bagged estimator is the bootstrap mean $\frac{1}{B}\sum_{b=1}^{B} \Theta^{*b}$.
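
A minimal sketch of this recipe for a generic array-valued statistic; the number of bootstrap samples B and the seed are illustrative choices.

```python
import numpy as np

def bagged_estimate(X, statistic, B=200, seed=0):
    """Bagging: draw B bootstrap samples by resampling rows with
    replacement, evaluate the statistic on each, and return the bootstrap
    mean of the B estimates."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    estimates = [statistic(X[rng.integers(0, n, size=n)]) for _ in range(B)]
    return np.mean(estimates, axis=0)
```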


Application to SCCA

Bootstrap sample generation

In the case of SCCA the sample statistic is the covariance matrix for two sets of variables.

Thus, there are two data sets X and Y with the same number of observations n and

possibly different numbers of variables. Obtaining a bootstrap sample in this case means

independently sampling with replacement from the set of observations, i.e. sampling from

a sequence $\{1, \dots, n\}$ to get $obs^{*b}$. Then $X^{*b}$ and $Y^{*b}$ are obtained by taking the observations

included in $obs^{*b}$ from the original sets of variables X and Y.

The bagging algorithm and SCCA can be applied using the following two approaches:

SCCA of the bagged covariance matrix estimate (B-SCCA1)

In the first approach the bagging algorithm is used to obtain an improved covariance

matrix estimate as a bootstrap mean of the covariance matrices for the bootstrap samples

from the original data sets X and Y . Thus, the bagged covariance matrix estimate

is calculated as $\frac{1}{B}\sum_{b=1}^{B} \mathrm{Cov}(X^{*b}, Y^{*b})$. Subsequently, SCCA is applied to the bagged

covariance estimate to obtain sparse combinations of variables from sets X and Y. This

can be interpreted as an application of SCCA to the Bayesian posterior mean estimate of

the sample covariance matrix.
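
A sketch of the bagged covariance computation in B-SCCA1, assuming the rows of X and Y are paired observations; centering within each bootstrap sample is an implementation choice made here for illustration.

```python
import numpy as np

def bagged_cross_covariance(X, Y, B=200, seed=0):
    """B-SCCA1, step one: resample the same observation indices from X and
    Y jointly, compute the sample cross-covariance for each bootstrap pair,
    and average over the B replicates; SCCA is then applied once to the
    returned matrix."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    K = np.zeros((X.shape[1], Y.shape[1]))
    for _ in range(B):
        idx = rng.integers(0, n, size=n)  # obs^{*b}, drawn with replacement
        Xb = X[idx] - X[idx].mean(axis=0)
        Yb = Y[idx] - Y[idx].mean(axis=0)
        K += Xb.T @ Yb / (n - 1)
    return K / B
```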

Bagged SCCA (B-SCCA2)

In the second approach SCCA is applied to each bootstrap sample from the original data,

i.e. to $X^{*b}$ and $Y^{*b}$, $b = 1, \dots, B$, to obtain sparse combinations of variables $\alpha^{*b}$ and $\beta^{*b}$

for X and Y respectively. Then the posterior probability of a specific coefficient $u_j$ in the

left singular vector or $v_j$ in the right singular vector being equal to zero can be calculated

as $\frac{1}{B}\sum_{b=1}^{B} I(u_j^{*b} = 0)$ for variables in the set X and similarly for the variables in Y.
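
A sketch of B-SCCA2; the function `scca_fit`, standing in for an SCCA run that returns the sparse singular vectors for a bootstrap sample, is a hypothetical placeholder.

```python
import numpy as np

def selection_probabilities(X, Y, scca_fit, B=200, seed=0):
    """B-SCCA2: run SCCA on each bootstrap sample and estimate the
    posterior probability that each coefficient equals zero as the fraction
    of bootstrap solutions in which it is exactly zero."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    zero_u = np.zeros(X.shape[1])
    zero_v = np.zeros(Y.shape[1])
    for _ in range(B):
        idx = rng.integers(0, n, size=n)
        u, v = scca_fit(X[idx], Y[idx])  # sparse vectors u^{*b}, v^{*b}
        zero_u += (u == 0)
        zero_v += (v == 0)
    return zero_u / B, zero_v / B
```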

Simulations would be useful for comparison of these two approaches and for evaluation


of their performance.

7.7 Application of SCCA to the study of Chronic Fatigue Syndrome

Chronic Fatigue Syndrome (CFS) is a disease that affects a significant proportion of the

population in the United States and has a detrimental economic effect on society [Reeves

et al., 2005]. Assessment of patients and identification of illness is complicated by the

lack of well established characteristic symptoms [Whistler et al., 2005]. The symptoms

of CFS are also shared by other neurological illnesses such as multiple sclerosis, sleep

disorders, and major depressive disorders. Moreover, the definition of CFS as a unique disease

is not obvious as it may represent a common response to a collection of other illnesses

[Whistler et al., 2005]. Establishing well defined measures of CFS is crucial for the

assessment of this illness. That was the main purpose of the study conducted by the

Centers for Disease Control and Prevention (CDC) in Wichita, KS. To achieve this goal, different sources

of data have been included in the study: clinical assessment of the patients, microarray

gene expression profiles, proteomics data, and selected single nucleotide polymorphisms

(SNPs). The data for this study was presented at the Critical Assessment of Microarray

Data Analysis (CAMDA) workshop in 2006.

In this study the measurements taken on the subjects are of different types. One

type of variables is based on gene expression levels, another type of information relates

to SNP genotypes and pedigree structure. In addition, there are also proteomics data and

haematologic measurements. The challenge from the statistical point of view is integrat-

ing these variables and conducting a unified analysis that focuses on the relationship

between the sets of variables rather than within. For example, one may be interested in

using gene expression values jointly to predict the disease class differentiating between pa-

tients with CFS, CFS with major depressive disorder with melancholic features (MDDm),


patients with insufficient symptoms or fatigue (ISF), and non-fatigued subjects. Other

questions of interest include establishing the relationship between the gene expression

profiles and clinical data or finding genetic regulatory pathways by analyzing microarray

and SNP data.

It is also important to note that this is a high dimensional data set and the number of

variables in some sets greatly exceeds the number of subjects. In particular, there are

almost 20000 gene expression profiles while the number of samples is only 177.

Given the challenges described above, i.e. integration of different types of data and

high dimensionality, Sparse Canonical Correlation Analysis is an appropriate approach

applicable to the CAMDA study. It would allow simultaneous analysis of different types

of measurements and produce sparse results that are interpretable from a biological

point of view. Sparse solutions that indicate the relationships between the small subsets

of variables are easier to visualize, which may be important in a study of regulatory

pathways. SCCA results may also be used to generate hypotheses for further testing.

Thus, it would be interesting to apply SCCA to the study of chronic fatigue syndrome.


Bibliography

Affymetrix. Microarray Suite User Guide. URL

http://www.affymetrix.com/support/technical/manuals.affx.

J. Beyene, P. Hu, E. Parkhomenko, and D. Tritchler. Impact of normalization and filtering

on linkage analysis of gene expression data. BMC Proceedings, 1(Suppl 1):S150, 2007.

L. Breiman. Bagging predictors. Machine Learning, 24:123–140, 1996.

J. Cadima and I. Jolliffe. Loadings and correlations in the interpretation of principal

components. Journal of Applied Statistics, 22:203–214, 1995.

V.G. Cheung, R.S. Spielman, K.G. Ewens, T.M. Weber, M. Morley, and J.T. Burdick.

Mapping determinants of human gene expression by regional and genome-wide associ-

ation. Nature, 437:1365–1369, 2005.

D. Commenges. Robust genetic linkage analysis based on a score test of homogeneity:

the weighted pair-wise correlation statistic. Genetic Epidemiology, 11:189–200, 1994.

B. Efron, T. Hastie, I. Johnstone, and R. Tibshirani. Least angle regression. The Annals

of Statistics, 32(2):407–499, 2004.

I.J. Good. Some applications of the singular decomposition of a matrix. Technometrics,

11(4):828–831, 1969.


D.A. Harville. Matrix algebra from a statistician’s perspective. Springer, 1997.

H. Hotelling. Relations between two sets of variables. Biometrika, 28:321–377, 1936.

J. Jeffers. Two case studies in the application of principal component analysis. Applied Statistics,

16:225–236, 1967.

I. M. Johnstone and A. Y. Lu. Sparse principal component analysis. January 2004.

I. T. Jolliffe and M. Uddin. A modified principal component technique based on the

lasso. Journal of Computational and Graphical Statistics, 12:531–547, 2003.

F. Lantieri, H. Rydbeck, P. Griseri, I. Ceccherini, and M. Devoto. Incorporating prior

biological information in linkage studies increases power and limits multiple testing.

BMC Proceedings, 1(Suppl 1):S89, 2007.

K.V. Mardia, J.T. Kent, and J.M. Bibby. Multivariate analysis. New York: Academic

Press, 1979.

N. Meinshausen and P. Buhlmann. Variable selection and high dimensional graphs with

the lasso. Technical report, ETH Zurich, 2004.

M. Morley, C.M. Molony, T.M. Weber, J.L. Devlin, K.G. Ewens, R.S. Spielman, and

V.G. Cheung. Genetic analysis of genome-wide variation in human gene expression.

Nature, 430:743–747, 2004.

S. Ramaswamy, P. Tamayo, R. Rifkin, S. Mukherjee, C. Yeang, M. Angelo, C. Ladd,

M. Reich, E. Latulippe, J. Mesirov, T. Poggio, W. Gerald, M. Loda, E. Lander, and

T. Golub. Multiclass cancer diagnosis using tumor gene expression signatures. Proceed-

ings of the National Academy of Sciences, 98:15149–15154, 2001.

W. Reeves, D. Wagner, R. Nisenbaum, J. Jones, B. Gurbaxani, L. Solomon, D. Papan-

icolaou, E. Unger, S. Vernon, and C. Heim. Chronic fatigue syndrome - a clinically

empirical approach to its definition and study. BMC Medicine, 3(1):19, 2005.


P.D. Sampson, A.P. Streissguth, H.M. Barr, and F.L. Bookstein. Neurobehavioral ef-

fects of prenatal alcohol: Part II. Partial least squares analysis. Neurotoxicology and

teratology, 11(5):477–491, 1989.

J. Schafer and K. Strimmer. An empirical Bayes approach to inferring large-scale gene

association networks. Bioinformatics, 1(1), 2004.

G.W. Stewart. Introduction to matrix computations. Academic Press, New York, 1973.

A.P. Streissguth, F.L. Bookstein, P.D. Sampson, and H.M. Barr. The enduring effects of

prenatal alcohol exposure on child development: birth through seven years, a partial

least squares solution. International Academy for Research in Learning Disabilities

Monograph Series. University of Michigan Press, Ann Arbor, 10, 1993.

R. Tibshirani. Regression shrinkage and selection via the lasso. Journal of the Royal Statis-

tical Society, Series B, 58(1):267–288, 1996.

D. Tritchler, Y. Liu, and S. Fallah. A test of linkage for complex discrete and continuous

traits in nuclear families. Biometrics, 59(8):382–392, 2003.

S. Wang, T. Zheng, and Y. Wang. Transcription activity hotspot, is it real or an artifact?

BMC Proceedings, 1(Suppl 1):S10, 2007.

Jacob A. Wegelin. A survey of partial least squares (PLS) methods, with emphasis on the

two-block case. URL citeseer.ist.psu.edu/wegelin00survey.html.

T. Whistler, E. Unger, R. Nisenbaum, and S. Vernon. Integration of gene expression,

clinical and epidemiologic data to characterize chronic fatigue syndrome. 2003.

T. Whistler, J. Jones, E. Unger, and S. Vernon. Exercise responsive genes measured

in peripheral blood of women with chronic fatigue syndrome and matched control

subjects. BMC Physiology, 5:5, 2005.


H. Wold. Path models with latent variables: the NIPALS approach, pages 307–357. Quan-

titative sociology: international perspectives on mathematical and statistical modeling.

Academic, 1975.

H. Wold. Soft modeling: the basic design and some extensions. In K.G. Joreskog and

H. Wold, editors, Systems under indirect observation: causality, structure, prediction, Part II,

number 139 in Proceedings of the Conference on Systems Under Indirect Observa-

tion, pages 1–54, Cartigny, Switzerland, October 1982. North Holland.

H. Wold. Partial Least Squares, pages 581–591. Encyclopedia of the statistical sciences.

Wiley, 1985.

Y. Wu, D.D. Boos, and L.A. Stefanski. Controlling variable selection by the addition

of pseudovariables. Journal of the American Statistical Association, 102(477):235–243,

2007.

H. Zou. The adaptive lasso and its oracle properties. Journal of the American Statistical

Association, 101(476):1418–1429, 2006.

H. Zou and T. Hastie. Regularization and variable selection via the elastic net. Journal

of the Royal Statistical Society, Series B, 67(2):301–320, 2005.

H. Zou, T. Hastie, and R. Tibshirani. Sparse principal component analysis. Technical

report, Statistics department, Stanford University, 2004.
