Sparse Canonical Correlation Analysis
by
Elena Parkhomenko
A thesis submitted in conformity with the requirements for the degree of Doctor of Philosophy
Graduate Department of Public Health Sciences (Biostatistics), University of Toronto
© Copyright by Elena Parkhomenko, 2008
Abstract
Sparse Canonical Correlation Analysis
Elena Parkhomenko
Doctor of Philosophy
Graduate Department of Public Health Sciences (Biostatistics)
University of Toronto
2008
Large scale genomic studies of the association of gene expression with multiple pheno-
typic or genotypic measures may require the identification of complex multivariate re-
lationships. In multivariate analysis a common way to inspect the relationship between
two sets of variables based on their correlation is Canonical Correlation Analysis, which
determines linear combinations of all variables of each type with maximal correlation
between the two linear combinations. However, in high dimensional data analysis, when
the number of variables under consideration exceeds tens of thousands, linear combina-
tions of the entire sets of features may lack biological plausibility and interpretability.
In addition, insufficient sample size may lead to computational problems, inaccurate es-
timates of parameters and non-generalizable results. These problems may be solved by
selecting sparse subsets of variables, i.e. obtaining sparse loadings in the linear combina-
tions of variables of each type. However, available methods providing sparse solutions,
such as Sparse Principal Component Analysis, consider each type of variables separately
and focus on the correlation within each set of measurements rather than between sets.
We introduce a new methodology - Sparse Canonical Correlation Analysis (SCCA), which
examines the relationships of many variables of different types simultaneously. It solves
the problem of biological interpretability by providing sparse linear combinations that
include only a small subset of variables. SCCA maximizes the correlation between the
subsets of variables of different types while performing variable selection. In large scale
genomic studies sparse solutions also comply with the belief that only a small proportion
of genes are expressed under a certain set of conditions. In this thesis I present method-
ology for SCCA and evaluate its properties using simulated data. I illustrate practical
use of SCCA by applying it to the study of natural variation in human gene expression
for which the data have been provided as problem 1 for the fifteenth Genetic Analysis
Workshop (GAW15). I also present two extensions of SCCA - adaptive SCCA and modi-
fied adaptive SCCA. Their performance is evaluated and compared using simulated data
and adaptive SCCA is applied to the GAW15 data.
Acknowledgements
I would like to thank everybody who made this thesis possible and contributed to its
development.
First of all I would like to thank my thesis advisors, Professors David Tritchler and
Joseph Beyene for their tremendous help in developing this thesis. Without their ex-
pertise, involvement, sound advice and constant support this work would not have been
possible.
I would like to thank my thesis committee member, Professor Shelley Bull, for her helpful comments, stimulating discussions and questions that led to a substantially improved thesis.
I am also grateful to the external examiner, Professor Angelo Canty, for his insightful
comments, and to Professor Michael Escobar for taking time out of his busy schedule to
serve as my internal reader.
I am grateful to the Connaught Scholarship, the Ontario Graduate Scholarship, and the University of Toronto Open Fellowship for financial support.
I would like to thank fellow students and friends at the University of Toronto. In
particular, I wish to express many thanks to Lucia Mirea and Wei Xu for stimulating
and entertaining discussions during Friday lunches. I would also like to thank Jane
Figueiredo for her great moral support and encouragement.
Last but not least, I would like to thank my husband, Daniil Bunimovich for his
patience and support throughout my studies.
This thesis is dedicated to my daughter Victoria.
Contents
1 Introduction 1
1.1 Motivating examples . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.2 Existing analysis tools . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.3 New methods - Sparse Canonical Correlation Analysis . . . . . . . . . . . 6
1.4 Organization of thesis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
2 Background 8
2.1 Definitions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
2.2 Traditional methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
2.3 Modifications of traditional methods . . . . . . . . . . . . . . . . . . . . 18
3 Methodology 29
3.1 Conventional canonical correlation analysis . . . . . . . . . . . . . . . . . 29
3.2 Sparse canonical correlation analysis . . . . . . . . . . . . . . . . . . . . 30
3.3 Data standardization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
3.4 Simulation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
3.5 Convergence . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41
3.6 Sparseness parameter selection . . . . . . . . . . . . . . . . . . . . . . . . 54
4 SCCA evaluation 61
4.1 Evaluation tool - Cross Validation . . . . . . . . . . . . . . . . . . . . . . 62
4.2 Effect of sample size on generalizability . . . . . . . . . . . . . . . . . . . 63
4.3 Effect of the true association between the linear combinations of variables 66
5 Extensions of SCCA 68
5.1 Oracle properties, prediction versus variable selection . . . . . . . . . . . 68
5.2 SCCA extension I: Adaptive SCCA . . . . . . . . . . . . . . . . . . . . . 83
5.3 SCCA extension II: Modified adaptive SCCA . . . . . . . . . . . . . . . . 90
6 Application 110
6.1 Background . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 110
6.2 Materials and Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . 111
6.3 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 116
6.4 Adaptive SCCA Results . . . . . . . . . . . . . . . . . . . . . . . . . . . 119
6.5 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 121
7 Discussion and future work 124
7.1 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 124
7.2 Limitations of simulation studies . . . . . . . . . . . . . . . . . . . . . . 127
7.3 Preliminary filtering . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 129
7.4 More than 2 sources of data . . . . . . . . . . . . . . . . . . . . . . . . . 130
7.5 Computation of variance . . . . . . . . . . . . . . . . . . . . . . . . . . . 130
7.6 Computation of covariance . . . . . . . . . . . . . . . . . . . . . . . . . . 131
7.7 Application of SCCA to the study of Chronic Fatigue Syndrome . . . . . 133
Bibliography 134
List of Tables
3.1 The effect of 4 types of starting values for the singular vectors on SCCA:
number of selected X variables vs. simulated correlation . . . . . . . . . . 48
3.2 The effect of 4 types of starting values for the singular vectors on SCCA:
number of selected Y variables vs. simulated correlation . . . . . . . . . . 49
3.3 The effect of 4 types of starting values for the singular vectors on SCCA
results: discordance for set X vs. simulated correlation . . . . . . . . . . 51
3.4 The effect of 4 types of starting values for the singular vectors on SCCA
results: discordance for set Y vs. simulated correlation . . . . . . . . . . 52
4.1 Effect of true correlation between the associated subsets of variables in X
and Y . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67
6.1 Summary of average prediction results for SCCA and CCA for GAW15 data 118
6.2 Summary of average prediction results for SCCA and CCA for reduced
GAW15 data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 120
List of Figures
2.1 Latent variables in PLS . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
3.1 Simulation: latent variable model . . . . . . . . . . . . . . . . . . . . . . 37
3.2 The effect of 4 types of starting values for the singular vectors on SCCA:
number of selected X variables vs. simulated correlation . . . . . . . . . . 56
3.3 The effect of 4 types of starting values for the singular vectors on SCCA:
number of selected Y variables vs. simulated correlation . . . . . . . . . . 57
3.4 The effect of 4 types of starting values for the singular vectors on SCCA:
number of correctly selected X variables vs. simulated correlation . . . . 58
3.5 The effect of 4 types of starting values for the singular vectors on SCCA:
number of correctly selected Y variables vs. simulated correlation . . . . 59
3.6 Sparseness parameter selection . . . . . . . . . . . . . . . . . . . . . . . . 60
4.1 Sample size effect: test sample correlation vs. sample size . . . . . . . . . 65
5.1 Investigation of model selection for different sample sizes for set X, true
correlation 0.5 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 72
5.2 Investigation of model selection for different sample sizes for set Y, true
correlation 0.5 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 73
5.3 Investigation of model selection for different sample sizes for set X, true
correlation 0.95 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 75
5.4 Investigation of model selection for different sample sizes for set Y, true
correlation 0.95 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 76
5.5 Model identification versus prediction: test sample correlation versus sparse-
ness parameter combinations, true correlation 0.9 . . . . . . . . . . . . . 79
5.6 Model identification versus prediction: average discordance for sets X and
Y versus sparseness parameter combinations, true correlation 0.9 . . . . . 80
5.7 Model identification versus prediction: the number of false positives for
set X versus sparseness parameter combinations, true correlation 0.9 . . . 81
5.8 Model identification versus prediction: the number of false negatives for
set X versus sparseness parameter combinations, true correlation 0.9 . . . 82
5.9 Compare adaptive SCCA, SCCA and SVD: test sample correlation vs
sample size, power of weights in soft-thresholding is 0.5 . . . . . . . . . . 86
5.10 Compare adaptive SCCA and SCCA: discordance vs sample size, power
of weights in soft-thresholding is 0.5 . . . . . . . . . . . . . . . . . . . . . 87
5.11 Compare adaptive SCCA and SCCA: number of false positives for set X
vs sample size, power of weights in soft-thresholding is 0.5 . . . . . . . . 88
5.12 Compare adaptive SCCA and SCCA: number of false negatives for set X
vs sample size, power of weights in soft-thresholding is 0.5 . . . . . . . . 102
5.13 Adaptive SCCA performance for different powers of weights in soft-thresholding:
test sample correlation vs sample size . . . . . . . . . . . . . . . . . . . . 103
5.14 Adaptive SCCA performance for different powers of weights in soft-thresholding:
discordance vs sample size . . . . . . . . . . . . . . . . . . . . . . . . . . 104
5.15 Modified adaptive SCCA: test sample correlation for adaptive SCCA and
SCCA with and without added noise . . . . . . . . . . . . . . . . . . . . 105
5.16 True false selection rate for adaptive SCCA and SCCA with and without
added noise . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 106
5.17 True false non-selection rate for adaptive SCCA and SCCA with and with-
out added noise . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 107
5.18 True false selection rate for adaptive SCCA and SCCA with and without
added noise, using modified FSR estimate . . . . . . . . . . . . . . . . . 108
5.19 True false non-selection rate for adaptive SCCA and SCCA with and with-
out added noise, using modified FSR estimate . . . . . . . . . . . . . . . 109
6.1 Selection of the best combination of sparseness parameters for GAW15 . 117
Chapter 1
Introduction
1.1 Motivating examples
A growing number of large scale genomic studies focus on establishing the relationships
between microarray gene expression profiles and other phenotypes of interest. In some
cases there are two or more sets of measurements available for the same subjects. The
objective is to determine the association between different sets of variables as well as to
discover relevant predictive variables. Thus, in addition to the typical challenges of large
scale studies such as dimensionality reduction and data visualization, the focus is also on
data integration.
One example is a study of Chronic Fatigue Syndrome (CFS). CFS is a disease that
affects a significant proportion of the population in the United States and has detrimental
economic effect on society [Reeves et al., 2005]. Assessment of patients and identification
of illness is complicated by the lack of well established characteristic symptoms [Whistler
et al., 2005]. The symptoms of CFS are also shared by other neurological illnesses such as
multiple sclerosis, sleep disorders, and major depressive disorders. Moreover, definition
of CFS as a unique disease is not obvious as it may represent a common response to a
collection of other illnesses [Whistler et al., 2005]. Establishing well defined measures of
CFS is crucial for the assessment of this illness. This was the main purpose of the study
conducted by the Centers for Disease Control and Prevention (CDC) in Wichita, KS. To achieve this goal
different sources of data were included in the study: clinical assessment of the patients,
microarray gene expression profiles, proteomics measures, and genotyping of selected
single nucleotide polymorphisms (SNPs). The data for this study were presented at the
Critical Assessment of Microarray Data Analysis (CAMDA) workshop in 2006.
It is important to note that different measurements are available for the same subjects
in the study. The question of interest is establishing the association between the markers
and the disease classes (i.e. not affected, CFS, CFS with major depressive disorders, just
major depressive disorders) as well as between different groups of markers. Identifying
important biomarkers may lead to better definition of the chronic fatigue syndrome and
the development of screening tests. An important observation is that various measure-
ments taken on the subjects are of different nature. The challenge from the statistical
point of view is integrating these variables and conducting a unified analysis that focuses
on the relationship between the sets of variables rather than within sets.
Another example is the study of natural variation in human gene expression. The
data for this study was offered for analysis at the fifteenth Genetic Analysis Workshop
(GAW15). Two types of measurements were available for 14 three-generation Centre
d'Etude du Polymorphisme Humain (CEPH) families from Utah. One type is gene
expression profiles and the other is single nucleotide polymorphisms (SNPs). Several
studies have demonstrated that variation in baseline gene expression levels in humans
has a genetic component [Cheung et al., 2005, Morley et al., 2004]. Therefore, we are
interested in establishing the association between variation in gene expression and that
in SNPs. As in the CAMDA example the two types of measures available for each
subject have different natures: one is continuous gene expression values, the other type
of information relates to SNP genotypes and pedigree structure. An immediate challenge
in this context is how to combine these two types of data and find the relationship
between them.
In addition to the challenge of combining different types of measures available on
the same subjects, these studies also represent good examples of the problem of high-
dimensional data analysis and visualization. In both studies thousands of gene expression
values are available for a much smaller number of samples. In the CAMDA data set there
are almost 20000 gene expression profiles while the number of samples is only 177. There
is also extensive clinical and genetic data available for the same subjects. In the GAW15
data there are 3554 expression profiles preselected by Morley et al.[Morley et al., 2004] as
well as 2882 SNPs available for 194 subjects. Thus, in each study at least two very large
sets of variables have to be analyzed. Traditional approaches such as differential gene
expression analysis treat gene expression values as an outcome measure and variables of
the other type as covariates while performing the analysis one gene expression at a time
[Morley et al., 2004, Whistler et al., 2003]. Thus a separate model is built for every gene.
This may be very computationally intensive and does not take into account correlation
between gene expressions. Also in each case the number of variables of each type greatly
exceeds the sample size leading to computational instability. Therefore, dimensionality
reduction techniques may be useful for the analysis of these data sets. In conclusion,
a unified analysis approach suitable for extensive data and incorporating dimensionality
reduction is needed in these studies. Suitable methods should perform data integration to
determine the association between different sets of variables. Visualization of identified
connections between various types of measures is also of interest as well as biological
interpretability of the results.
1.2 Existing analysis tools
Both of these studies demonstrate a need for analysis tools that establish the relationships
between sets of measurements available on the same group of subjects as well as reduce
the dimensionality. It is also desirable that obtained results are easy to interpret.
A common approach is to aggregate the original variables into composite latent vari-
ables and subsequently find the association between the latent variables corresponding
to different sets of measurements. Principal Component Analysis (PCA) is often used in
such cases for dimensionality reduction, to model the underlying structure in the data,
and to create latent variables. One disadvantage of this approach is that its main ob-
jective is to construct composite measures that maximize the variance within the sets of
variables rather than consider the correlation between sets. PCA works with one set of
variables at a time. Another disadvantage is that created latent variables called principal
components are linear combinations of the entire set of variables under consideration. In
large scale studies similar to the examples presented earlier these composite measures
aggregating thousands of variables may lack interpretability and may be difficult to visu-
alize. For example, in the study of natural variation in human gene expression one may
be interested in identifying genetic pathways and regulatory regions that could generate
hypotheses for further testing. In that case a solution representing a possible pathway
that contains all available genes is neither interpretable from a biological perspective nor
useful for hypothesis generation.
A solution to the first disadvantage of PCA may be obtained by considering the
method of Canonical Correlation Analysis (CCA). This approach is used in cases when
there are two sets of measures available on the sample to identify linear combinations of
variables in these sets that have maximal correlation. Thus, the focus is on the relation-
ships between variables in different data sets rather than within. However, this analysis
method also suffers from the problem of poor interpretability of linear combinations of
thousands of variables in high dimensional data.
Recently developed Sparse Principal Component Analysis (SPCA) methods [Zou
et al., 2004, Johnstone and Lu, 2004] address the problem of biological interpretabil-
ity and dimensionality reduction by incorporating variable selection which results in
sparse solutions. Sparse principal components contain only a small subset of the original
variables. In the study of Chronic Fatigue Syndrome this method could identify clinical
symptoms characteristic of CFS patients. However, SPCA cannot determine the rela-
tionships between the clinical data and gene expression profiles since, similarly to PCA,
it analyzes one set of variables at a time.
An important observation is that all three analysis approaches described above (PCA,
CCA, SPCA) use the same tool - singular value decomposition (SVD) of some matrix. In
the case of PCA and SPCA this matrix is the $n \times p_i$ data matrix $X_i$ corresponding to the $i$th set of variables. In the case of CCA this matrix is $K = \Sigma_{XX}^{-1/2} \Sigma_{XY} \Sigma_{YY}^{-1/2}$, where $\Sigma_{XX}$ and $\Sigma_{YY}$ are the variance matrices for the sets of variables X and Y and $\Sigma_{XY}$ is the covariance matrix. Thus the difference between CCA and PCA/SPCA methods is determined by
the matrices used in SVD which is dictated by the different goals of the analysis. In
CCA we are interested in the correlation between the sets of variables, thus the matrix
K incorporates the information on all sets. In this case we are also interested in both left
and right singular vectors. On the other hand, in principal components methods each set
of measurements is considered independently, therefore SVD is performed on the data
matrix for each set separately, and we are interested in right singular vectors only. The
difference between the Sparse PCA and the conventional PCA is achieved by a variation
in SVD [Zou et al., 2004]. SPCA method uses sparse singular value decomposition that
results in sparse right singular vectors. Sparsity means that the loadings corresponding to
some variables in the set are zero, which is equivalent to variable selection. To summarize,
SVD is the key statistical tool used in the existing methodology that is also employed in
the development of the new method described in this thesis.
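To make the sparsity idea concrete, here is a small illustration of my own (not the SPCA or SCCA algorithm itself, which is developed later in this thesis): soft-thresholding the loadings of a right singular vector sets the small ones exactly to zero, which is equivalent to variable selection.

```python
import numpy as np

rng = np.random.default_rng(3)
# Toy data: 2 informative variables sharing a signal, plus 8 pure-noise variables
signal = rng.standard_normal((50, 1))
X = np.hstack([2 * signal + 0.3 * rng.standard_normal((50, 2)),
               rng.standard_normal((50, 8))])
X -= X.mean(axis=0)

_, _, Vt = np.linalg.svd(X, full_matrices=False)
v = Vt[0]  # first right singular vector: loadings over all 10 variables

# Soft-threshold the loadings: values within delta of zero become exactly zero
delta = 0.3
v_sparse = np.sign(v) * np.maximum(np.abs(v) - delta, 0.0)

print(np.count_nonzero(v))         # 10: the dense solution loads on every variable
print(np.count_nonzero(v_sparse))  # only the strong (informative) loadings survive
```

The variables with zero loadings are dropped from the linear combination, so the sparse vector names a small candidate subset rather than all ten variables.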
A different, non-SVD based approach often used in genomic/genetic studies such as
the study of natural variation in human gene expression is applying association/linkage
analysis one gene expression at a time. This ignores the correlation between measures of
gene expression. In addition, an insufficient sample size leads to computational problems,
inaccurate estimates of parameters and non-generalizable results. In large data sets
similar to the GAW15 data there may be thousands of gene expression profiles under
consideration as well as thousands of SNPs. In these cases this analysis approach may
also be very computationally intensive. The interpretation of the results may be unclear.
For instance, several gene expressions may be linked to similar groups of SNPs. In this
case similarity means that there is a non-empty intersection between the sets of SNPs.
Then the question arises whether this intersection set should be considered in relationship
to the whole group of gene expressions linked to the selected SNPs and whether these
gene expressions belong to the genes in the same pathway.
1.3 New methods - Sparse Canonical Correlation Analysis
In this thesis I present a new method - Sparse Canonical Correlation Analysis (SCCA).
SCCA allows the analysis of several sets of variables simultaneously in order to establish
the relationships between them. It also reduces dimensionality to improve biological
interpretability and to foster generation of new hypotheses. Thus, SCCA addresses both
disadvantages of PCA described earlier at the same time. It is applicable in large scale
studies when the number of variables in each set may greatly exceed the number of
samples and is computationally efficient. Similarly to the existing analysis methods
SCCA employs singular value decomposition. In order to determine the relationships
between different sets of variables I consider SVD of a matrix K described above and
seek both left and right singular vectors. I present an efficient algorithm for sparse SVD
that produces sparse solutions for singular vectors on both sides of SVD at the same
time (as compared to SPCA of H. Zou [Zou et al., 2004] that focuses only on the right
side and, therefore, can produce sparse solution for one set of variables only). Thus,
SCCA seeks sparsity in both sets of variables simultaneously. It incorporates variable
selection and produces linear combinations of small subsets of variables from each group
of measurements with maximal correlation. Established relationships between small sets
of variables are easier to interpret from a biological perspective compared to the linear
combinations of the entire sets of variables and may be used for hypotheses generation.
In genetic and genomic studies sets of covariates and response variables with sparse
loadings also comply with the belief that only a small proportion of genes are expressed
under a certain set of conditions.
1.4 Organization of thesis
In chapter 2 I give brief definitions of some statistical concepts frequently used in this
thesis. I also describe existing methodology and discuss its disadvantages. The details
of the new Sparse Canonical Correlation Analysis method are presented in chapter 3. I
first describe the traditional Canonical Correlation Analysis that serves as basis for the
developed techniques. Then I present an efficient algorithm for SCCA and discuss its
details including the effect of starting values for the algorithm and selection of tuning
parameters. Chapter 4 describes evaluation of SCCA performance by cross-validation
and gives details of the data simulation approach that is used to evaluate SCCA. It
also demonstrates the sample size effect and sensitivity/specificity analysis. Chapter 5
discusses the oracle properties of an analysis tool and evaluates these properties of SCCA.
I follow by presenting extensions of SCCA aimed at improving the oracle properties and
discuss their performance. The illustration of developed methods using a real data set for
the study of natural variation in human gene expression is given in chapter 6. Chapter
7 contains the discussion and comparisons of SCCA and its extensions. The thesis concludes with a brief discussion of possible directions for future work.
Chapter 2
Background
In this thesis I introduce a new method called Sparse Canonical Correlation Analysis
(SCCA), which examines the relationships of different types of measurements taken on
the same subjects. Simultaneous analysis of several data sets is performed to allow data
integration. The obtained sparse solution reduces dimensionality and aids in data visu-
alization. Prior to introducing new techniques and describing the existing methodology
some definitions of frequently used statistical terms are presented.
2.1 Definitions
Latent variables:
Latent variables are variables which are inferred from other variables that are observed.
They are not measured directly. Latent variables are hypothesized to be associated with
the underlying model for the observed variables. They are also known as hidden variables
or model parameters. Latent variables are often used for visualization of high-dimensional
data through the aggregation of a large number of measured variables into a model. This allows
dimensionality reduction.
Eigenvectors and eigenvalues:
Let $X$ be an $n \times n$ real square matrix. Then the $n \times 1$ vector $v$ is a right eigenvector of $X$ and $\lambda$ is the corresponding eigenvalue if $Xv = \lambda v$. Similarly, a left eigenvector $u$ is defined as satisfying the equation $u'X = \lambda u'$.
Singular value decomposition:
Let $X$ be an $m \times n$ real matrix. Then it can be represented as $X = U\Lambda V'$, where $U$ is an $m \times m$ orthogonal matrix, $V$ is an $n \times n$ orthogonal matrix, and $\Lambda$ is an $m \times n$ diagonal matrix with non-negative diagonal elements $\lambda_i$, $i = 1, \dots, \min(m,n)$. The first $\min(m,n)$ columns of $U$ and $V$ are the left and right singular vectors, respectively, and $\lambda_i$, $i = 1, \dots, \min(m,n)$, are the corresponding singular values. Note that the left singular vectors of $X$ are the eigenvectors of $XX'$ while the right singular vectors are the eigenvectors of $X'X$. The eigenvalues of $XX'$ and $X'X$ are equal, and they are equal to the squared singular values of $X$.
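These facts are easy to verify numerically. The sketch below is my own illustration (not part of the thesis), using NumPy; it checks that the squared singular values of $X$ are the eigenvalues of $X'X$ and that the right singular vectors are its eigenvectors.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((5, 3))  # an m x n real matrix with m = 5, n = 3

# Singular value decomposition: X = U Lambda V'
U, lam, Vt = np.linalg.svd(X, full_matrices=False)

# Eigenvalues of X'X equal the squared singular values of X
eigvals = np.linalg.eigvalsh(X.T @ X)[::-1]  # eigh returns ascending; reverse to descending
assert np.allclose(eigvals, lam ** 2)

# Right singular vectors are eigenvectors of X'X: (X'X) v_i = lambda_i^2 v_i
for i in range(3):
    assert np.allclose(X.T @ X @ Vt[i], lam[i] ** 2 * Vt[i])

print("decomposition verified")
```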
Linear combination of vectors:
Let $x_1, \dots, x_k$ be a set of $k$ $n \times 1$ vectors. Then the $n \times 1$ vector $c_x$ is a linear combination of these vectors if $c_x = \alpha_1 x_1 + \dots + \alpha_k x_k$ for some real constants $\alpha_1, \dots, \alpha_k$, which are usually called weights or loadings.
Principal component analysis:
Principal component analysis is often used for dimensionality reduction of the data,
to create composite measures which are equivalent to latent variables, and to detect
the underlying structure of the data. It determines linear combinations of the original
variables (principal components) that capture maximal variance. Principal components
can be obtained using the singular value decomposition of the data matrix X.
Let X be an m × n data matrix (m observations on n variables). Assume that the
columns of X have means equal to 0, and consider the singular value decomposition of X:
$$X = UDV'$$
Then the columns of U are called the principal components and the columns of V are the corresponding variable weights in the principal components. The principal components are mutually uncorrelated. The percentage of variance explained by the principal component $u_i$ is equal to $d_i^2 / \sum_{j=1}^{r} d_j^2$, where $d_i$, $i = 1, \dots, r$, are the diagonal elements of D and r is the rank of X.
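As a numerical illustration of this definition (my own sketch, not from the thesis), the principal components and the percentage of variance explained follow directly from the SVD of the centred data matrix:

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.standard_normal((20, 4))  # m = 20 observations on n = 4 variables
X -= X.mean(axis=0)               # centre the columns so their means are 0

U, d, Vt = np.linalg.svd(X, full_matrices=False)

pcs = U                                  # principal components (the convention used above)
weights = Vt.T                           # columns of V: variable weights for each component
var_explained = d ** 2 / np.sum(d ** 2)  # share of variance per component

# The principal components are mutually uncorrelated (orthonormal columns of U)
assert np.allclose(pcs.T @ pcs, np.eye(4), atol=1e-10)
print(var_explained)  # four non-negative fractions summing to 1
```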
Canonical correlation analysis:
Canonical correlation analysis is used in cases when there are two sets of variables avail-
able on the sample to identify linear combinations of variables in these sets that have
maximal correlation. Suppose X is an m×p data matrix corresponding to m observations
of p variables of one type and Y is an m×q data matrix corresponding to m observations
of q variables of the other type. It is important to emphasize that both sets of variables
X and Y are taken on the same group of m subjects. Then linear combinations of vari-
ables from X and Y that have maximum correlation can be obtained by considering the
singular value decomposition of a matrix
$$K = \Sigma_{XX}^{-1/2} \Sigma_{XY} \Sigma_{YY}^{-1/2} = UDV'$$
where $\Sigma_{XX}$ and $\Sigma_{YY}$ are the variance matrices for sets X and Y and $\Sigma_{XY}$ is the covariance matrix. The canonical vectors, i.e. the weights in the linear combinations of the two sets of variables that have largest correlation, are
$$a = \Sigma_{XX}^{-1/2} u_1 \quad \text{and} \quad b = \Sigma_{YY}^{-1/2} v_1$$
where $u_1$ and $v_1$ are the first left and right singular vectors, respectively. The canonical variables are the derived variables $\eta = a'x$ and $\phi = b'y$.
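The construction can be sketched numerically as follows. This is my own illustration, using sample covariance matrices and assuming p and q are small relative to m so that the inverse square roots exist:

```python
import numpy as np

rng = np.random.default_rng(2)
m, p, q = 200, 3, 2
z = rng.standard_normal((m, 1))            # shared latent signal
X = z + 0.5 * rng.standard_normal((m, p))  # m x p data matrix
Y = z + 0.5 * rng.standard_normal((m, q))  # m x q data matrix
X -= X.mean(axis=0)
Y -= Y.mean(axis=0)

Sxx, Syy, Sxy = X.T @ X / m, Y.T @ Y / m, X.T @ Y / m

def inv_sqrt(S):
    """Inverse square root of a symmetric positive definite matrix."""
    w, V = np.linalg.eigh(S)
    return V @ np.diag(w ** -0.5) @ V.T

# K = Sigma_xx^{-1/2} Sigma_xy Sigma_yy^{-1/2} = U D V'
K = inv_sqrt(Sxx) @ Sxy @ inv_sqrt(Syy)
U, d, Vt = np.linalg.svd(K)

a = inv_sqrt(Sxx) @ U[:, 0]   # first canonical vector for X
b = inv_sqrt(Syy) @ Vt[0]     # first canonical vector for Y
eta, phi = X @ a, Y @ b       # canonical variables

# Their correlation equals the first singular value of K
assert np.isclose(np.corrcoef(eta, phi)[0, 1], d[0])
```

The first singular value of K is therefore the first canonical correlation, which is the property exploited by the SVD-based methods discussed in this chapter.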
Vector length:
Let $v = (v_1, \dots, v_p)'$ be a $p \times 1$ vector. Then its length, denoted by $|v|$, is
$$|v| = \Big( \sum_{i=1}^{p} v_i^2 \Big)^{1/2}$$
L1 norm:
Let $v = (v_1, \dots, v_p)'$ be a $p \times 1$ vector. Then its L1 norm, denoted by $|v|_1$, is
$$|v|_1 = \sum_{i=1}^{p} |v_i|$$
L2 norm:
Let $v = (v_1, \dots, v_p)'$ be a $p \times 1$ vector. Then its L2 norm, denoted by $|v|_2$, is
$$|v|_2 = \Big( \sum_{i=1}^{p} |v_i|^2 \Big)^{1/2}$$
Hard thresholding:
For a given thresholding parameter $\delta$, hard thresholding of variable $x$ is given by $\eta_H(x, \delta) = x \cdot I\{|x| \ge \delta\}$.
Soft thresholding:
For a given thresholding parameter $\delta$, soft thresholding of variable $x$ is given by $\eta_S(x, \delta) = (|x| - \delta)_+ \,\mathrm{Sign}(x)$, where $(x)_+$ is equal to $x$ if $x \ge 0$ and $0$ if $x < 0$, and
$$\mathrm{Sign}(x) = \begin{cases} -1 & \text{if } x < 0, \\ 0 & \text{if } x = 0, \\ 1 & \text{if } x > 0. \end{cases}$$
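Both operators act elementwise; the following short functions are my own illustration of the two definitions:

```python
import numpy as np

def hard_threshold(x, delta):
    """eta_H(x, delta) = x * I{|x| >= delta}: zero out entries smaller than delta."""
    x = np.asarray(x, dtype=float)
    return np.where(np.abs(x) >= delta, x, 0.0)

def soft_threshold(x, delta):
    """eta_S(x, delta) = (|x| - delta)_+ * Sign(x): shrink every entry toward zero by delta."""
    x = np.asarray(x, dtype=float)
    return np.sign(x) * np.maximum(np.abs(x) - delta, 0.0)

v = np.array([-1.5, -0.3, 0.0, 0.4, 2.0])
# Hard thresholding keeps surviving entries unchanged; soft thresholding also shrinks them
assert np.allclose(hard_threshold(v, 0.5), [-1.5, 0.0, 0.0, 0.0, 2.0])
assert np.allclose(soft_threshold(v, 0.5), [-1.0, 0.0, 0.0, 0.0, 1.5])
```

The difference matters for sparse methods: both operators set small entries exactly to zero, but soft thresholding additionally shrinks the surviving entries continuously, which is the behaviour used in the soft-thresholding steps discussed later in this thesis.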
2.2 Traditional methods
Methods for simultaneous analysis of two data sets originated several decades ago. A
comprehensive methodology for studying relationships between two sets of variables was
developed by H. Wold in 1970. This collection of techniques is known as Partial Least
Squares (PLS). Many variations of the general algorithm were later developed to suit
the applications in different disciplines including econometrics, psychology, and medical
sciences. One of the variations is Canonical Correlation Analysis (CCA) which will be
discussed in greater detail later. A comprehensive survey of PLS methods is provided in
a technical report by J.A. Wegelin [Wegelin].
Partial Least Squares
Wegelin [Wegelin] summarizes several approaches for the two-block case (dealing with
two types of measurements) of the Partial Least Squares method and compares them.
These are the special cases of the more general Wold’s algorithm applicable to k blocks
of data, where k may be greater than 2 [Wold, 1975, 1982, 1985].
PLS methods are used to study the association between groups of variables. In the
two-block case there is a sample with 2 sets of measurements X and Y with I and J
variables respectively, and the same N samples. Let columns in the matrices represent
variables and rows represent samples. The association between variables is recovered by
considering the covariance matrix X ′Y and using latent variables. The author states that
PLS techniques are especially useful when columns of X or Y are collinear or when the
number of variables exceeds the number of samples (I > N or J > N).
There are two modes in which the general Wold algorithm for PLS can be ap-
plied: mode A and mode B. Mode A is presented by Wegelin as PLS-C2A, PLS-SB
and PLS1/PLS2 approaches whereas mode B of the general algorithm is equivalent to
the canonical correlation analysis of Hotelling [Hotelling, 1936]. Both modes can be used
to study the association between the groups of variables and provide their linear combi-
nations also referred to as latent variables. However, the interpretation of the coefficients
in these linear combinations differs. Another difference between the two modes of the
algorithm is in the measure of association. In mode A covariance between linear com-
binations of variables Cov(Xu, Y v) is maximized while in CCA or mode B correlation
Cor(Xu, Y v) is maximized. Both mode A and mode B of the general Wold algorithm are
presented in greater detail below. Two of the variations of mode A (PLS-C2A and PLS-
SB) are more relevant to the work presented in this thesis, therefore they are discussed
while the discussion of other variants (PLS1, PLS2) is omitted.
Approach 1: Mode A.
Two-block Mode A Partial Least Squares (PLS-C2A).
This approach is used when one needs to model the covariance between two sets of
variables X and Y. In this case the objective is not to study the covariance structure
within sets X and Y, rather it is to study the covariance between the sets X and Y. Also
it should be emphasized that the measure of association between groups of variables
considered in this approach is covariance as opposed to correlation. The modeling is
performed by utilizing pairs of latent variables (ξ1, ω1), . . . , (ξR, ωR) where ξi and ωi are
latent variable for sets X and Y, respectively, and R is called the rank and is determined
as part of model selection.
In the simplest case when there is only one latent variable for each data set, the
association can be presented schematically as shown in Figure 2.1. The variables in set X
are controlled by ξ while the variables in set Y are controlled by the latent variable ω, and
these two latent variables are associated with each other. We are modeling the covariance
d_1 = Cov(ξ_1, ω_1).

Figure 2.1: Simple latent variable model.

The latent variables are represented by the linear combinations of the
original variables, i.e. Xu and Yv. Thus, we are maximizing the following covariance:

d_1 = Cov(ξ_1, ω_1) = max_{||u||=||v||=1} Cov(Xu, Yv)
Then the vectors of coefficients in the linear combinations of variables can be obtained as
the first singular vectors u_1 and v_1 from the singular value decomposition (SVD) of the
matrix X'Y. Then d_1 u_1 v_1' provides the best rank-one approximation of X'Y [Wegelin;
Harville, 1997].
J. Wegelin provides an iterative algorithm for estimation of (ui,vi) pairs for i =
1, . . . , R, where R is the desired size of the model. The algorithm is initialized by setting
X(1) = X and Y (1) = Y . Then at each step i a pair of coefficient vectors for linear
combinations of variables ui and vi is estimated by the first left and right singular vectors
of X(i)′Y (i), respectively. Subsequently, matrices X(i) and Y (i) are regressed on the latent
variables ξi = X(i)ui and ωi = Y (i)vi to get the best rank-one approximations as follows:
X^(i) = ξ_i β_X' + e_X
β_X' = (ξ_i' ξ_i)^{-1} ξ_i' X^(i)
X^(i)(ξ_i) = ξ_i β_X' = ξ_i (ξ_i' ξ_i)^{-1} ξ_i' X^(i)
Similarly for Y:
Y^(i) = ω_i β_Y' + e_Y
β_Y' = (ω_i' ω_i)^{-1} ω_i' Y^(i)
Y^(i)(ω_i) = ω_i β_Y' = ω_i (ω_i' ω_i)^{-1} ω_i' Y^(i)
The residuals after subtracting the best rank-one approximations X(i)(ξi) and Y (i)(ωi)
from the current data matrices X(i) and Y (i) are then used as new data matrices in the
next step of the algorithm:
X^(i+1) = X^(i) − X^(i)(ξ_i)
Y^(i+1) = Y^(i) − Y^(i)(ω_i)
The procedure is repeated until all R pairs of latent variables (ξi,ωi), i = 1, . . . , R are
obtained.
In this approach the coefficient vectors u_i of the R linear combinations of variables for set X
(the latent variables ξ_i) are orthogonal. The same is true for the v_i. However, they are not
necessarily the singular vectors of X ′Y . Also, ξi are mutually orthogonal as well as ωi.
The obtained latent variables maximize the covariance of interest
|Cov(ξ_i, ω_i)| = |Cov(X^(i) u_i, Y^(i) v_i)| ∝ d_i = max_{||u||=||v||=1} |Cov(X^(i) u, Y^(i) v)|

Here d_i is not the i-th singular value in the SVD of X'Y but rather the first singular value
in the SVD of X^(i)' Y^(i).
The kth element of ui, k = 1, . . . , I can be interpreted as being proportional to the
covariance between the kth variable in set X at step i and the ith latent variable for set
Y, i.e.
u_ki = (N / d_i) Cov(X^(i)_{.k}, ω_i)    (2.1)
The coefficients of vi are interpreted in a similar fashion:
v_ki = (N / d_i) Cov(Y^(i)_{.k}, ξ_i),  k = 1, . . . , J    (2.2)
Another property of the coefficients for linear combinations of variables obtained by PLS-
C2A is that if an observed variable x_{.j} is added to or removed from the set X, this leads
only to a small change in u_ki for k ≠ j.
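The PLS-C2A iteration described above can be sketched in a few lines of numpy. This is a simplified illustration of the scheme Wegelin describes, not his implementation; the function name and return conventions are my own:

```python
import numpy as np

def pls_c2a(X, Y, R):
    """Two-block mode A PLS (PLS-C2A) sketch: at each step take the leading
    singular vector pair of X'Y, form the latent scores xi = X u and om = Y v,
    and deflate each block by its rank-one least-squares fit on its own score."""
    Xi, Yi = np.array(X, dtype=float), np.array(Y, dtype=float)
    U, V, xi_scores, om_scores = [], [], [], []
    for _ in range(R):
        u_mat, _, vt = np.linalg.svd(Xi.T @ Yi)
        u, v = u_mat[:, 0], vt[0, :]
        xi, om = Xi @ u, Yi @ v                       # latent variables
        Xi = Xi - np.outer(xi, xi @ Xi) / (xi @ xi)   # residual after regressing on xi
        Yi = Yi - np.outer(om, om @ Yi) / (om @ om)   # residual after regressing on om
        U.append(u); V.append(v)
        xi_scores.append(xi); om_scores.append(om)
    return (np.array(U).T, np.array(V).T,
            np.array(xi_scores).T, np.array(om_scores).T)
```

By construction the residual matrices are orthogonal to the scores already extracted, so the ξ_i are mutually orthogonal, and likewise the ω_i, as noted above.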
PLS-SB.
This approach presented by Wegelin was originally developed by Sampson et al. [Sampson
et al., 1989] and Streissguth et al. [Streissguth et al., 1993]. It provides an identical
solution to PLS-C2A if only one pair of latent variables (ξ1,ω1) is considered. However,
additional pairs of latent variables differ for the two methods. The source of the difference
is in updating the data matrices X(i) and Y (i) before proceeding to the step i + 1. PLS-
C2A updates each X(i) and Y (i) separately by considering the residuals after removing
the effects of ith latent variables X(i) − X(i)(ξi) and Y (i) − Y (i)(ωi). PLS-SB updates
the cross-product X(i)′Y (i) by subtracting diuiv′i. Hence, the coefficients in the linear
combinations of variables represented by latent variables (ξi = Xui,ωi = Y vi) are the
singular vectors of X'Y and we obtain the representation X'Y = Σ_{i=1}^R d_i u_i v_i'. In this
case the latent scores (ξi and ωi) are not orthogonal. Also ui and vi vectors obtained by
PLS-C2A and PLS-SB methods are not equal since X^(2)' Y^(2) ≠ X'Y − d_1 u_1 v_1' (for step
2 and similarly for other steps), where X(2) and Y (2) are updated data matrices at the
end of the first step of PLS-C2A.
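A matching sketch of PLS-SB shows the contrast with PLS-C2A: here the cross-product matrix itself is deflated, so the weight pairs are exactly the leading singular vector pairs of X'Y (again an illustration with my own function name):

```python
import numpy as np

def pls_sb(X, Y, R):
    """PLS-SB sketch: deflate the cross-product matrix M = X'Y by d_i u_i v_i'
    at each step, so (u_i, v_i) are the first R singular vector pairs of X'Y."""
    M = X.T @ Y
    U, V, D = [], [], []
    for _ in range(R):
        u_mat, d, vt = np.linalg.svd(M)
        u, v, d1 = u_mat[:, 0], vt[0, :], d[0]
        M = M - d1 * np.outer(u, v)     # subtract the leading singular component
        U.append(u); V.append(v); D.append(d1)
    return np.array(U).T, np.array(V).T, np.array(D)
```

Subtracting the leading singular component leaves a matrix whose SVD consists of the remaining components, which is why this deflation reproduces the full SVD of X'Y one pair at a time.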
Approach 2: Mode B PLS.
Canonical correlation analysis represents mode B of Wold’s general PLS algorithm. The
major difference from mode A is in the way the coefficients in linear combinations of
variables, i.e. u and v are computed. It is evident that in mode A, given the latent
variables ξ and ω, the coefficients can be obtained by separate linear regressions. The
original Wold algorithm can be used to study the relationships between multiple sources
of data (blocks of variables), however if we consider the simplest case of only two data
sets X and Y, then mode A regressions can be described as follows:
Consider only the first singular vectors and let X and Y be data sets containing N
observations each and I and J variables, respectively. Let ξi and ωi be the vectors of
latent variables associated with data sets X and Y, respectively, at step i of the algorithm.
Each vector is of length N. Also let ui and vi be the vectors of coefficients in linear
combination of variables from X and Y at step i. If X·j (or Y·j) denotes the jth column
of matrix X (or Y), i.e. the jth variable, then ui and vi are estimated by fitting I and J
simple linear regressions:
X_{.j} ∼ u_{ji} ω_i, for j = 1, . . . , I
Y_{.j} ∼ v_{ji} ξ_i, for j = 1, . . . , J
where uji and vji are the jth coefficients in ui and vi, respectively. If the data sets X and
Y have been standardized then inclusion of the intercept in the regressions above is not
necessary.
This is equivalent to obtaining left and right singular vectors of X ′Y as shown in
Wegelin [Wegelin]. In other words u and v obtained by fitting these independent regres-
sion models are the eigenvectors of X ′Y Y ′X and Y ′XX ′Y , respectively.
In mode B at the ith step of the algorithm coefficients in linear combinations of
variables ui and vi are computed by performing multiple regression:
ωi ∼ Xui
ξi ∼ Y vi
Thus in mode B the coefficients are calculated simultaneously.
The multiple regression solutions u and v are the eigenvectors of (X'X)^{-1} X'Y (Y'Y)^{-1} Y'X
and (Y'Y)^{-1} Y'X (X'X)^{-1} X'Y, respectively.
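This eigenvector formulation of mode B can be checked numerically. The sketch below (simulated data; the dimensions and correlation structure are illustrative choices of mine) verifies that the leading eigenvalue of the stated matrix is the squared sample canonical correlation:

```python
import numpy as np

rng = np.random.default_rng(0)
N, I, J = 200, 4, 3
X = rng.standard_normal((N, I))
Y = 0.5 * X[:, :J] + rng.standard_normal((N, J))   # induce an association
X = X - X.mean(axis=0)
Y = Y - Y.mean(axis=0)

# Mode B eigenproblems for the weight vectors u and v
Sxx_inv = np.linalg.inv(X.T @ X)
Syy_inv = np.linalg.inv(Y.T @ Y)
Mx = Sxx_inv @ X.T @ Y @ Syy_inv @ Y.T @ X
My = Syy_inv @ Y.T @ X @ Sxx_inv @ X.T @ Y

wx, Ex = np.linalg.eig(Mx)
wy, Ey = np.linalg.eig(My)
u = np.real(Ex[:, np.argmax(np.real(wx))])
v = np.real(Ey[:, np.argmax(np.real(wy))])

# Correlation of the latent variables Xu and Yv (up to sign)
rho = np.corrcoef(X @ u, Y @ v)[0, 1]
```

The two matrices share their nonzero eigenvalues, and |rho| should equal the square root of the leading eigenvalue.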
The author points out two major differences between the mode A and B solutions:
• The values in the vectors of coefficients in the linear combinations of variables u and
v are interpreted differently. In mode A the kth element of u_i is proportional to the
covariance between the kth variable in set X and the latent variable for set Y, as shown in equation (2.1).
Also, as stated above adding or removing additional variables to the analysis has
little effect on ui. However, in mode B these values should be interpreted similarly
to the coefficients in multiple regression and are affected by the addition of other
variables to the data set X.
• The multiple regression solution requires computation of the inverses of X ′X and
Y ′Y . In cases when the number of variables exceeds the number of observations
these inverses do not exist, which limits applicability to high dimensional data.
To summarize, the PLS algorithm can be viewed as two nested loops called the
inner loop and the outer loop [Wegelin]. The inner loop provides the estimates for the
coefficients in linear combinations of variables from sets X and Y. The approaches used to
obtain these coefficients differentiate mode A (a separate simple linear regression model
for each coefficient) from mode B (multiple regression model for all coefficients). The
outer loop is responsible for obtaining multiple sets of latent variables associated with
X and Y. Thus, if we are only interested in the first singular vectors (or only one latent
variable per data set), then the outer loop is not necessary. The different options available
for the outer loop are reflected in the PLS-C2A algorithm as compared to PLS-SB.
In traditional PLS methods no variable selection is performed. Linear combinations
of variables (or latent variables) contain the entire set of available variables for data
X and Y. In cases of microarray studies or genome-wide linkage studies the number of
variables under consideration may exceed tens of thousands. Thus, linear combinations
of all available variables may not be biologically interpretable. Also, these complete
linear combinations of variables may include a lot of noise variables that may affect the
results. Therefore, some variable selection is advisable in certain circumstances. This
modification can be added to the inner loop of the PLS algorithm to obtain sparse
linear combinations of variables by setting some coefficients in u and v to 0. Some
possible approaches to variable selection have been described in several papers reviewed
below. Another solution is sparse canonical correlation analysis, which is a new approach
presented in this thesis.
2.3 Modifications of traditional methods
Canonical correlation analysis can be considered similar to principal component analysis
(PCA) from the perspective of data preprocessing and dimensionality reduction. In CCA
we are interested in identifying subsets of variables from two data sets that have highest
correlation while in PCA we are seeking a subset of variables from one data set that
maximizes the variance. Solutions for both CCA and PCA problems can be obtained
by considering the singular value decomposition (SVD) of a certain matrix. For CCA
this matrix involves covariance between two data sets. For PCA we are looking for SVD
of the given data matrix. Thus, both problems are reduced to obtaining singular value
decomposition.
In studies of large complex data sets the disadvantage of both canonical correlation
analysis and principal component analysis is that solutions (canonical vectors and prin-
cipal component vectors respectively) are linear combinations of all the original variables
and may lack interpretability. Therefore, we are interested in identifying linear com-
binations of small subsets of the original variables. This can be viewed as a form of
variable selection. Sparse solutions are obtained by considering sparse singular value
decomposition.
Sparse singular value decomposition is the main focus of available modifications to
traditional methods for PCA. These modifications also may be adapted to obtain sparse
solutions for CCA. The simplest ad hoc approach is to set loadings of principal component
(or canonical) vectors that are small in absolute value to zero; it is usually referred to as simple
thresholding. However, it has been shown to have a number of disadvantages [Cadima
and Jolliffe, 1995, Zou et al., 2004]. Zou et al. [Zou et al., 2004] propose obtaining
sparse solutions using lasso and elastic net techniques for variable selection. Johnstone
and Lu [Johnstone and Lu, 2004] perform variable selection prior to applying principal
component analysis. These two approaches are described in greater detail below.
Sparse Principal Component Analysis: Approach I
Johnstone and Lu [Johnstone and Lu, 2004] propose an algorithm that performs variable
selection prior to applying principal component analysis (PCA).
The authors show that the estimate of the first principal component is consistent if
and only if
c = lim_{n→∞} p(n)/n = 0    (2.3)
Therefore, in case of high dimensionality and small sample size (p >> n) there is a need
to reduce dimensionality and perform variable selection before applying PCA. The paper
presents an algorithm that performs PCA on a subset of selected coordinates.
The algorithm:
Suppose we have an n×p data matrix X with number of variables p possibly larger than
the number of observations n.
X = (x_1, . . . , x_n)', i.e. the rows of X are the observation vectors x_1', . . . , x_n'.
1. Select a basis {e_ν} in R^p and find the representation of the data X in this basis,
i.e. find the coordinates x_iν for each observation x_i':

x_i = Σ_{ν=1}^p x_iν e_ν,  i = 1, 2, . . . , n

Here x_i and e_ν are p × 1 vectors.
"Replace" the original data set with the set of coordinates (x_iν): X → X{e_ν}.
2. Select a subset of k indices I = {ν_1, . . . , ν_k} with the highest coordinate variances
σ_ν^2 = Var(x_iν).
Reduce the coordinate data set to include only the k ≤ p selected basis vectors:
X{e_ν} → X_I.
3. Apply standard PCA to the reduced coordinate data set XI to obtain k eigenvectors
ρj, where j = 1, . . . , k and ρj are k × 1 vectors.
4. Make the estimated eigenvectors sparse by using hard thresholding (given by xI(|x| ≥
δ) for x) with some preselected threshold parameter δ.
5. Obtain eigenvectors as linear combinations of the original variables by returning to
the original domain:
ρ_j = Σ_{ν∈I} ρ_jν e_ν
where ρj are the eigenvectors in the original domain, i.e. linear combinations of the
original variables.
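The five steps can be sketched as follows, taking the basis in step 1 to be the identity basis for simplicity (Johnstone and Lu recommend a wavelet basis; the function name is my own):

```python
import numpy as np

def screened_sparse_pca(X, k, delta):
    """Sketch of the Johnstone-Lu scheme in the identity basis: keep the k
    coordinates with largest sample variance, run PCA on the reduced matrix,
    then hard-threshold the resulting loadings at delta."""
    variances = X.var(axis=0)
    keep = np.argsort(variances)[::-1][:k]        # step 2: top-k coordinate variances
    Xk = X[:, keep] - X[:, keep].mean(axis=0)
    _, _, vt = np.linalg.svd(Xk, full_matrices=False)  # step 3: PCA on reduced data
    rho = vt * (np.abs(vt) >= delta)              # step 4: hard thresholding
    p = X.shape[1]                                # step 5: embed back into R^p
    rho_full = np.zeros((rho.shape[0], p))
    rho_full[:, keep] = rho
    return rho_full, keep
```

With the identity basis, step 5 simply places the thresholded loadings back into the original p coordinates; with a nontrivial basis it would instead form the linear combinations ρ_j = Σ_{ν∈I} ρ_jν e_ν.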
The motivation for this algorithm is the assumption that there exists a basis in which
the original signal has a sparse representation. Johnstone and Lu argue that a good
candidate is a wavelet basis. Wavelet bases are known for their good representation of
functions with few peaks, like a mixture of several distributions with small variances,
or a step function. In studies dealing with large data sets where the number of vari-
ables greatly exceeds the number of samples, it is often expected that the true signal is
measured by only a few variables and the rest are noise. In these circumstances wavelet
functions offer a good representation of the signal and separation of signal from noise.
The approach of Johnstone and Lu achieves consistency in the estimation of the principal
components. However, once sparse eigenvectors are transformed into the original signal
domain, they may not remain sparse. Thus, all original variables will be included in the
linear combinations represented by eigenvectors. Therefore, the principal components
may lack biological interpretability and still include additional noise from the variables
that do not measure the signal of interest. If the focus is on sparse representation and
we choose to keep the principal component obtained in the sparse basis domain omitting
the final transformation to the original signal domain, there still may not be any gain in
biological interpretability because new basis vectors are functions of the original variables
which may be complicated and meaningless from a biological point of view. Therefore,
there is a need to develop a new method that would perform variable selection in the
original signal domain and provide sparse results in the sense of sparse combination of
original variables.
Sparse Principal Component Analysis: Approach II
The work by Zou et al. [Zou et al., 2004] is motivated by the fact that principal compo-
nents are linear combinations of all variables and therefore are difficult to interpret. They
develop Sparse Principal Component Analysis (SPCA) introducing lasso and elastic net
to produce sparse principal components. This is based on representing PCA as regression
problem.
Lasso and elastic net:
In a regular linear regression problem, lasso [Tibshirani, 1996] and elastic net [Zou and
Hastie, 2005] techniques perform variable selection [Tibshirani, 1996, Zou et al., 2004, Zou
and Hastie, 2005]. Lasso imposes a constraint on the L1 norm of regression coefficients
while elastic net offers a more general approach and imposes constraints on both L1 and
L2 norms of coefficients.
Suppose there are n observations and p predictors. Let Y = (y1, . . . , yn)′ denote the
response vector and Xi = (x1i, . . . , xni)′ be the ith predictor, i = 1, . . . , p. Then the lasso
estimates of regression coefficients are β_lasso:

β_lasso = argmin_β |Y − Σ_{i=1}^p X_i β_i|^2 + λ Σ_{i=1}^p |β_i|    (2.4)
The elastic net estimates of regression coefficients are β_en:

β_en = (1 + λ_2) argmin_β { |Y − Σ_{i=1}^p X_i β_i|^2 + λ_2 Σ_{i=1}^p |β_i|^2 + λ_1 Σ_{i=1}^p |β_i| }    (2.5)
In both lasso and elastic net sparse solutions are obtained due to the L1 norm penalty
on the regression coefficients. However, there are two major advantages of elastic net
over lasso. The first is that the lasso solution is limited to including at most n variables.
In the case of microarray studies the number of variables p (gene expressions) is often
greater than 10000 and exceeds the number of experiments n, which may be less than
10. Thus it may be impractical to restrict the model. Elastic net allows including all
available variables in the solution. The second advantage of the elastic net is that it
selects the whole group of correlated variables if one of these variables has been selected
while lasso only includes one of the variables from the group in the final model.
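A lasso solution for criterion (2.4) can be sketched with coordinate descent, where each coordinate update is exactly a soft-thresholding step. This is a simplified illustration, not the algorithm of Tibshirani's original paper; the names are my own:

```python
import numpy as np

def soft(z, t):
    """Soft thresholding: (|z| - t)_+ * Sign(z)."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def lasso_cd(X, y, lam, n_iter=200):
    """Coordinate-descent sketch for criterion (2.4):
    minimize |y - X b|^2 + lam * sum_j |b_j|."""
    n, p = X.shape
    b = np.zeros(p)
    col_ss = (X ** 2).sum(axis=0)      # X_j' X_j for each column
    r = y.astype(float).copy()         # current residual y - X b
    for _ in range(n_iter):
        for j in range(p):
            r = r + X[:, j] * b[j]     # partial residual: remove coordinate j
            b[j] = soft(X[:, j] @ r, lam / 2.0) / col_ss[j]
            r = r - X[:, j] * b[j]     # restore residual with updated b_j
    return b
```

The λ/2 in the threshold comes from differentiating the squared-error term in (2.4); with λ = 0 the iteration reduces to ordinary least squares, and for λ large enough all coefficients are set exactly to zero.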
Sparse principal components analysis as elastic net problem:
Zou et al. demonstrate that PCA can be represented as a ridge regression problem
(Theorem 1 in [Zou et al., 2004]):
Let X be an n × p data matrix and suppose its SVD is
X = UDV ′ (2.6)
Since principal components are linear combinations of all available variables, they can
be considered as response and obtained by linear regression of that response on variables
in X. Let Y_i = U_i D_ii = X V_i denote the i-th principal component. The vector V_i
contains the coefficients of the original variables in the regression. Then for any λ > 0, V_i can
be obtained by solving
β_ridge = argmin_β |Y_i − X β|^2 + λ |β|^2    (2.7)

and standardizing: v = β_ridge / |β_ridge|. Then v = V_i.
Here the ridge penalty λ is not used to penalize regression coefficients β but rather
serves to ensure a solution in cases when the data matrix X is not full rank, in particular
when p > n. In these cases there is no unique ordinary regression solution, while PCA
always produces a solution, thus λ has to be strictly positive.
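This result is easy to verify numerically. In the sketch below (random data and an arbitrary λ > 0, both illustrative choices of mine), the normalized ridge coefficients recover the first right singular vector of X:

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.standard_normal((30, 5))
U, D, Vt = np.linalg.svd(X, full_matrices=False)

i, lam = 0, 3.7                  # any component index, any lam > 0
Yi = X @ Vt[i]                   # scores of the i-th principal component

# Ridge regression of Yi on X: beta = (X'X + lam I)^{-1} X' Yi
beta = np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ Yi)
v = beta / np.linalg.norm(beta)  # standardize to unit length
```

The ridge step shrinks V_i by the positive factor d_i^2 / (d_i^2 + λ), so normalization removes λ from the answer entirely, which is exactly why λ only needs to be strictly positive.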
This formulation of PCA is then transformed into the general sparse principal component
analysis (SPCA) problem by modifying (2.7) to become a "self-contained regression-type
criterion" and introducing a lasso penalty. Consider the first k principal components. Let
α and β be p × k matrices and X_i the i-th row-vector in the n × p data matrix X, so that
X_i is a single observation. For any λ > 0, let
(α, β) = argmin_{α,β} Σ_{i=1}^n |X_i − α β' X_i|^2 + λ Σ_{j=1}^k |β_j|^2 + Σ_{j=1}^k λ_{1,j} |β_j|_1    (2.8)

subject to α'α = I_k. Here |β_j|_1 denotes the L1 norm of β_j.
Then take v_j = β_j / |β_j|, j = 1, . . . , k.
A sparse principal component solution will set elements of vj to zero, so the principal
component is a simpler linear combination of the original variables.
The authors introduce a general SPCA algorithm to obtain sparse principal compo-
nents for specific values of tuning parameters λ and λ1j. The ridge penalty λ is the
same for all k principal components. If n > p and X has full rank, then λ can be set to 0.
When p > n it should be set to a small positive value to
overcome the collinearity problem by regularizing the matrix. Lasso penalties λ1j are
selected separately for each principal component. The optimal values can be obtained
from a set of solutions provided by LARS-EN algorithm [Efron et al., 2004] based on
variance-sparsity tradeoff. Thus, for each principal component the user has to select two
tuning parameters.
The main motivation for development of new methods for sparse canonical correlation
analysis, sparse principal component analysis, and sparse singular value decomposition
in general is the interest in studying large data sets when variable selection is essential
to the interpretability of the results, as when the number of variables p greatly exceeds the
number of samples n, which is the case in microarray studies. Zou et al. introduce a
modified solution for this particular situation to reduce the computation cost. It is based
on setting the ridge penalty λ = ∞ which transforms the elastic net solution into a
soft-thresholding operation.
Gene expression arrays SPCA algorithm:
1. Consider the first k ordinary principal components V [, 1 : k] and let α = V [, 1 : k]
2. Given fixed α, for j = 1, . . . , k apply soft-thresholding
β_j = (|α_j' X'X| − λ_{1,j}/2)_+ Sign(α_j' X'X)
3. For each fixed β, obtain the SVD of X'Xβ = UDV', then update α = UV'
4. Repeat steps 2 and 3 until β converges
5. Normalization: V_j = β_j / |β_j|, j = 1, . . . , k
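The five steps of this algorithm can be sketched directly in numpy. This is a simplified illustration with a fixed number of iterations in place of a convergence check on β; the function name is my own:

```python
import numpy as np

def spca_soft(X, k, lam1, n_iter=50):
    """Sketch of the gene-expression SPCA iteration (ridge penalty -> infinity):
    alternate a soft-thresholding update of beta with an SVD-based update of
    alpha. lam1 may be a scalar or an array of k per-component penalties."""
    G = X.T @ X                                    # p x p matrix X'X
    _, _, vt = np.linalg.svd(X, full_matrices=False)
    alpha = vt[:k].T                               # step 1: ordinary PC loadings
    lam1 = np.asarray(lam1, dtype=float)
    for _ in range(n_iter):
        Z = G @ alpha                              # columns are X'X alpha_j
        beta = np.sign(Z) * np.maximum(np.abs(Z) - lam1 / 2.0, 0.0)  # step 2
        u_mat, _, wt = np.linalg.svd(G @ beta, full_matrices=False)  # step 3
        alpha = u_mat @ wt                         # update alpha = U V'
    norms = np.linalg.norm(beta, axis=0)           # step 5: normalize columns
    return beta / np.where(norms > 0, norms, 1.0)
```

With λ_{1,j} = 0 the soft-thresholding step is the identity up to scaling, and the iteration reproduces the ordinary principal component loadings; positive penalties drive individual loadings exactly to zero.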
Zou et al. also present a method for calculation of adjusted total variance. The
traditional approach for calculation of variance explained by principal components (PC)
is to take trace(U'U), where the columns of U are the principal components. However, this is only
to account for the correlation between the modified principal components. The authors
then use the new measure of total explained variance to compare SPCA performance to
traditional PCA, the SCoTLASS method of Jolliffe and Uddin [Jolliffe and Uddin, 2003],
and simple thresholding, using two published data sets and simulated data.
The first published data set they use is the pitprops data [Jeffers, 1967] that consists
of 180 observations on 13 variables. To facilitate the comparison between the simple
thresholding method and other sparse PCA approaches, the authors make the number of
nonzero loadings selected by simple thresholding equal to the number of nonzero loadings
in sparse PC produced by SPCA and SCoTLASS. Comparison of the first six principal
components shows that adjusted total explained variance is highest for the SPCA vectors,
followed by the simple thresholding, and is lowest for SCoTLASS. Also SCoTLASS is
similar to simple thresholding in terms of variables selected especially in the first PC,
while set of variables selected by SPCA differs substantially from both for the first three
PCs.
The second published study is based on the Ramaswamy data [Ramaswamy et al.,
2001] which has 16063 genes and 144 samples. Thus this is the p >> n situation typical
of microarray studies. Therefore, the soft-thresholding modification was used instead of
the full elastic net to obtain sparse PCs. The authors performed SPCA for a number of
different values for the sparseness parameter λ1 and observed that as sparsity increases
the percentage of explained variance decreases at a slow rate. They also observed that
only 2.5% of genes were sufficient to explain most of the variance in the first principal
component. Another interesting finding was that the simple thresholding approach keeping
the same number of genes that were selected by SPCA showed a higher percentage of
explained variance than SPCA and only about 2% of selected genes were different between
the two methods. SCoTLASS was not applicable in this setting due to computational
complexity.
In a simulation example Zou et al. used three hidden factors to create 10 variables
with a specific correlation structure such that principal components should have a sparse
representation. Four of 10 variables were associated with the first hidden factor, 4 with
the second, and 2 with the third. The correct solution should recover the first four vari-
ables in the leading principal component and the second 4 variables in the second PC
since the first 2 hidden factors are more important and the third hidden factor is the
combination of the first two. The authors used the exact covariance matrix for testing
SPCA and comparing it to ordinary PCA and simple thresholding. This is equivalent to
having an infinite sample of data (n = ∞), so the ridge parameter λ in SPCA was set to
0. Both SPCA and SCoTLASS correctly identified the first two principal components,
which authors attribute to their explicit use of the lasso constraint. An obvious disadvan-
tage of ordinary PCA is that it included all 10 variables in each considered PC. Simple
thresholding misidentified two variables in the leading PC due to misleading correlation
between the variables.
An interesting observation is that the solution obtained by SPCA included all 4
variables associated with the first hidden factor in the leading PC even though only the
lasso constraint was used and the ridge penalty was set to 0. Therefore, the solution was
obtained by using lasso, not elastic net. As stated above, the usual criticism of lasso is
that it tends to select only one variable from the group of correlated variables. However,
results of this simulation show that in this case lasso did include the whole group of
correlated variables in the solution. This seemingly contradictory result may be explained
by the fact that the authors used the prior knowledge that 4 variables had to be included
in the solution and chose the thresholding parameter accordingly.
The conclusions presented by the authors based on these three studies are as follows:
• Simple thresholding may still be a useful tool due to its simplicity and sufficiently
good performance in studies with small sample size such as microarray studies.
However, it can misidentify important variables.
• SCoTLASS works well with a small number of variables and produces correct sparse
principal components. It is not applicable in studies with a large number of variables
due to its computational complexity.
• The new SPCA method performs well both in gene expression studies (p >> n)
and in regular studies with sufficient sample size. It reduces to ordinary PCA in
the absence of a lasso constraint and otherwise has superior performance compared
to PCA since it produces sparse principal components and correctly identifies im-
portant variables. The algorithm for SPCA is computationally efficient.
Sparse principal component analysis offers several advantages over existing methods as
listed above. It obtains sparse principal components by introducing a lasso constraint
into the regression representation of the principal component problem. Thus, in a singu-
lar value decomposition X = UDV ′ it produces sparse vectors in V . This is an important
development for cluster analysis where we are interested in identifying small groups of
variables with some similarity (for example, genes belonging to the same functional path-
way) that are easy to interpret. In this situation we are only concerned about grouping
variables on one side of the decomposition of the matrix X. Thus, if X is an n × p
matrix in which rows represent n samples and columns represent p variables, then we are
only interested in grouping variables by obtaining sparse V . However, if the matrix of
interest X is the covariance matrix between two sets of variables from two different data
sets and we are interested in simultaneously identifying small groups of variables from
each set that would also be correlated, then we are seeking sparse SVD of X on both
sides, i.e. we are interested in obtaining sparse vectors both in U and V matrices. This
is the case in canonical correlation analysis when the study involves large data sets and
including all available variables in canonical vectors would make interpretation difficult.
In this case, SPCA does not provide a complete solution to the problem. However, it
introduces some useful techniques and ideas that can be adapted to the modification
of traditional canonical correlation analysis and singular value decomposition to obtain
sparse solutions. This is the focus of this thesis.
Chapter 3
Methodology
Sparse canonical correlation analysis is an extension of traditional canonical correlation
analysis. Both methods focus on the identification of relationships between the two sets of
variables. In this chapter I present the new methodology developed in this thesis research.
I begin with a brief review of traditional canonical correlation analysis in the first
section. The description of the new technique and the algorithm for SCCA follow in the
second section. Section 3 contains a discussion of the data standardization appropriate for
SCCA. Section 4 presents the simulation design used for evaluation of SCCA performance.
Convergence issues and the dependence of the algorithm on starting values are discussed
in section 5. The last section, section 6, describes the selection of the sparseness parameters for SCCA.
3.1 Conventional canonical correlation analysis
Consider two sets of variables X and Y measured on n individuals. Suppose there are p
variables in the set X and q variables in the set Y .
X = | x_11 . . . x_1p |        Y = | y_11 . . . y_1q |
    |  .         .    |            |  .         .    |
    | x_n1 . . . x_np |            | y_n1 . . . y_nq |
We are looking for linear combinations of variables from sets X and Y with maximal
correlation. Let vectors a and b denote the coefficients in these linear combinations for
sets X and Y, respectively. Then we are looking to maximize the following correlation:

Corr(a'x, b'y) = ρ(a, b) = a'Σ_XY b / (√(a'Σ_XX a) √(b'Σ_YY b))    (3.1)

where Σ_XX, Σ_XY, and Σ_YY are the variance and covariance matrices for X and Y. The
solution is obtained by considering the Singular Value Decomposition (SVD) of a matrix
K:
K = Σ_XX^{-1/2} Σ_XY Σ_YY^{-1/2} = U D V' = (u_1, . . . , u_k) D (v_1, . . . , v_k)'
  = d_1 u_1 v_1' + d_2 u_2 v_2' + . . . + d_k u_k v_k'    (3.2)
where k is the rank of the matrix K. The solution is based on the rank-one approximation
[Good, 1969]: K, which is p × q since its factors have dimensions (p × p)(p × q)(q × q), is
approximated by its first singular vectors, K ≈ d_1 u_1 v_1', where d_1 is the positive
square root of the first eigenvalue of K'K or KK'. The canonical vectors, that is, the weights in the linear
combinations of each of the two sets of variables that have the largest correlation, are
a = Σ−1/2XX u1 and b = Σ
−1/2Y Y v1 (3.3)
Canonical vectors are the weights in the linear combinations of variables from sets X and
Y in (3.1) in which we are interested. The new variables η = a′x and φ = b′y obtained
in this analysis are called the canonical variables as in [Mardia et al., 1979] or latent
variables [Wegelin]. Vectors Xa and Y b give n observations of these variables.
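The CCA solution in (3.1)-(3.3) can be sketched numerically. The following is an illustrative NumPy implementation, not part of the thesis; the function name `cca_first_pair` and the symmetric inverse square root helper are my own, and it assumes the sample matrices Σ_XX and Σ_YY are full rank:

```python
import numpy as np

def cca_first_pair(X, Y):
    """Sketch of classical CCA via the SVD of K = Sxx^{-1/2} Sxy Syy^{-1/2}.

    X: (n, p) and Y: (n, q), assumed column-centered and full rank.
    Returns the first canonical vectors a, b and the first canonical
    correlation d1, following equations (3.2) and (3.3).
    """
    n = X.shape[0]
    Sxx = X.T @ X / n
    Syy = Y.T @ Y / n
    Sxy = X.T @ Y / n

    def inv_sqrt(S):
        # symmetric inverse square root via eigendecomposition
        w, V = np.linalg.eigh(S)
        return V @ np.diag(1.0 / np.sqrt(w)) @ V.T

    K = inv_sqrt(Sxx) @ Sxy @ inv_sqrt(Syy)
    U, d, Vt = np.linalg.svd(K)
    a = inv_sqrt(Sxx) @ U[:, 0]   # a = Sxx^{-1/2} u1, eq. (3.3)
    b = inv_sqrt(Syy) @ Vt[0, :]  # b = Syy^{-1/2} v1
    return a, b, d[0]
```

By construction, the sample correlation of the canonical variables Xa and Yb equals the first singular value d1 of K.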
3.2 Sparse canonical correlation analysis
In conventional CCA all variables from both sets are included in the fitted linear combi-
nations or canonical vectors. However, in many study settings such as microarray data
analysis and genome-wide linkage analysis the number of genes under consideration often
exceeds tens of thousands. In these cases linear combinations of the entire set of features
lack biological interpretability since they contain too many variables for further testing or
hypothesis generation. In addition, high dimensionality and insufficient sample size lead
to computational problems, inaccurate estimates of parameters, and non-generalizable
results. Sparse canonical correlation analysis (SCCA) solves the problem of biological
interpretability by providing sparse sets of associated variables. These results are ex-
pected to be more robust and generalize better outside the particular study. In many
applications there are good subject-matter arguments for sparsity. For example, when
gene expression is regarded as a response and genotypes as predictors, sparse loadings
comply with the belief that only a small proportion of genes are expressed under a certain
set of conditions and that expressed genes are regulated at a relatively small number of
genetic locations. As another example, gene expression may be considered to predict
a complex phenotype such as a comprehensive psychological profile. Only a subset of
the large shopping list of psychological variables may be relevant to the condition being
studied.
I propose to obtain sparse linear combinations of features by considering a sparse
singular value decomposition of the matrix K in (3.2). This means that the singular
vectors u and v in (3.2) have sparse loadings. We develop an iterative algorithm that
alternately approximates the left and right singular vectors of the SVD using iterative
soft-thresholding for feature selection. This approach is related to the iterative SVD al-
gorithm of I.J. Good [Good, 1969], Partial Least Squares (PLS) methods described by J.
Wegelin [Wegelin] and Sparse Principal Component Analysis method developed by Zou
et al. [Zou et al., 2004].
SCCA algorithm
Similarly to CCA consider n observations in two sets of variables X and Y with p variables
in the set X and q variables in the set Y . Assume that each of the sets of variables has
been standardized to have columns with zero means and unit variances. Let K be the
sample correlation matrix between X and Y as in (3.2). Let λu and λv be the soft-
thresholding parameters for variable selection from the sets X and Y respectively. The
first sparse canonical vectors can be identified using the following algorithm:
Sparse Canonical Correlation Analysis algorithm:
1. Select sparseness parameters λu and λv
2. Select initial values u0 and v0 and set i = 0
3. Update u:

   (a) u^{i+1} ← K v^i

   (b) Normalize: u^{i+1} ← u^{i+1} / |u^{i+1}|

   (c) Apply soft-thresholding to obtain a sparse solution:
       u^{i+1} ← (|u^{i+1}| − (1/2)λu)+ Sign(u^{i+1})

   (d) Normalize: u^{i+1} ← u^{i+1} / |u^{i+1}|

4. Update v:

   (a) v^{i+1} ← K′ u^{i+1}

   (b) Normalize: v^{i+1} ← v^{i+1} / |v^{i+1}|

   (c) Apply soft-thresholding to obtain a sparse solution:
       v^{i+1} ← (|v^{i+1}| − (1/2)λv)+ Sign(v^{i+1})

   (d) Normalize: v^{i+1} ← v^{i+1} / |v^{i+1}|

5. i ← i + 1

6. Repeat steps 3 and 4 until convergence.
where (x)+ equals x if x ≥ 0 and 0 if x < 0, and

  Sign(x) = −1 if x < 0,
             1 if x > 0,
             0 if x = 0.
Notes:
1. The first normalization step in the updating of the singular vector (steps 3b and 4b)
is related to the scale of the sparseness parameter λ. This step brings the singular
vector to unit length prior to applying soft-thresholding. Thus, soft-thresholding
is always applied to comparable vectors, so the value of the parameter λ is easier
to choose. If this step is omitted, then values for the sparseness parameter can
range from 0.1 to over 100 for different data sets making the task of selecting the
approximate range of suitable sparseness parameters difficult. Further discussion
of the sparseness parameter selection algorithm can be found in the section on
sparseness parameters below.
2. The second normalization step in the algorithm (steps 3d and 4d) is to normalize
the sparse singular vectors before proceeding to the next iteration.
3. This algorithm is first applied to obtain the first singular vectors, u1 and v1. The
residual of the matrix K after removing the effect of the first singular vectors can
then be considered to obtain additional singular vectors.
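The algorithm above can be sketched in NumPy as follows. This is my own illustrative implementation, not the thesis code; the function names are hypothetical, and it assumes the sparseness parameters λu and λv are small enough that soft-thresholding leaves at least one non-zero entry (otherwise the normalizations in steps 3d and 4d would divide by zero):

```python
import numpy as np

def soft_threshold(x, lam):
    # (|x| - lam/2)_+ * Sign(x), applied elementwise (steps 3c and 4c)
    return np.sign(x) * np.maximum(np.abs(x) - 0.5 * lam, 0.0)

def scca_first_vectors(K, lam_u, lam_v, v0=None, max_iter=100, tol=1e-8):
    """Sketch of the SCCA algorithm for the first sparse singular vectors.

    K: (p, q) sample correlation matrix between sets X and Y.
    Assumes lam_u, lam_v leave at least one non-zero loading.
    Returns sparse unit-length vectors (u, v).
    """
    p, q = K.shape
    v = np.ones(q) / np.sqrt(q) if v0 is None else v0 / np.linalg.norm(v0)
    u = np.zeros(p)
    for _ in range(max_iter):
        u_old = u
        u = K @ v                      # step 3a
        u /= np.linalg.norm(u)         # step 3b: normalize
        u = soft_threshold(u, lam_u)   # step 3c: sparsify
        u /= np.linalg.norm(u)         # step 3d: renormalize
        v = K.T @ u                    # step 4: same updates for v
        v /= np.linalg.norm(v)
        v = soft_threshold(v, lam_v)
        v /= np.linalg.norm(v)
        if np.linalg.norm(u - u_old) < tol:
            break
    return u, v
```

On a matrix K that is close to rank one with sparse singular vectors, the iteration typically settles within a few updates, in line with the convergence behavior discussed in section 3.5.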
3.3 Data standardization
The computation of the matrix K in (3.2) requires the inverses of the p × p matrix X′X
and the q × q matrix Y′Y, which may not exist or may be numerically unstable when these
matrices are singular or ill-conditioned. Such situations are often observed in microarray studies when the number
of variables greatly exceeds the number of observations. The inverses also may not exist
when there is collinearity (linear dependence) even if the total number of variables in sets
X and Y is less than the number of observations. Several approaches may be taken to
solve this problem. The first one would be to regularize X ′X and Y ′Y by the addition
of γI for some parameter γ. Another approach is to use generalized inverses. Neither of
these approaches may be feasible in cases of high dimensional data. Regularization
requires additional parameter selection and validation while the second approach requires
computation of the full singular value decomposition for both X ′X and Y ′Y which may
be very slow in cases of high dimensionality. The third approach is to eliminate (X ′X)−1
and (Y ′Y )−1 from the computation of the matrix K. The motivation for this approach is
described below.
Prior to applying the SCCA algorithm the data is standardized so that all variables
have zero means and unit variances by subtracting column means and dividing by column
standard deviations. Then the variance-covariance matrices Σ_XX and Σ_YY for each data set
become correlation matrices and have ones along the diagonal. Also under the assumption
that in high dimensional problems most of the measured variables are not related to the
process of interest, i.e. they may be considered as noise, the correlation between them
(off-diagonal elements in Σ_XX and Σ_YY) is zero. We can approximate K by substituting
identity matrices for Σ_XX and Σ_YY in the expression for the matrix K in (3.2) after
applying data standardization. Then K is computed as the covariance between the sets
X* and Y*:

  K = Cov(X*, Y*) = Σ_{X*Y*}    (3.4)

where X* and Y* are the standardized data sets. The first canonical vectors in (3.3) are
then just u1 and v1.
This approach is also related to the partial least squares method (PLS) of J.A. Wegelin
[Wegelin] presented in the literature review section. Considering K = Cov(X*, Y*) = Σ_{X*Y*}
is equivalent to maximizing

  Cov(a′x*, b′y*) = a′Σ_{X*Y*} b    (3.5)

instead of maximizing Corr(a′x, b′y) as in (3.1). Thus, we are using the mode A approach
of the general Wold PLS algorithm and not mode B as described in chapter 2.
The limitation of this approach is that we ignore the correlation among the variables
of the same type. For example, if one type of variables is gene expression profiles, then
the correlation between the genes is not considered. This may lead to inflated covariances
between the different types of measurements since they are not scaled by the variances
within the same type of measurements. Thus, uninformative (i.e. noise) variables may
be included in the solution. However, noise variables are hypothesized to have lower
variances than the important variables and to be independent of each other. This fact is
often used in noise filtering methods and in principal component analysis, which selects
combinations of variables that explain most of the variation in the data. Therefore,
the covariances of the noise variables between different types of measurements should
not be greatly affected by the covariances within the same type of variables and the
approximation of the matrix K described above is acceptable.
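As a small sketch of this standardization approach (my own illustration, with a hypothetical function name), the approximate K in (3.4) is just the matrix of cross-correlations of the standardized columns, so no matrix inverses are needed:

```python
import numpy as np

def approximate_k(X, Y):
    """Sketch of the approximation K = Cov(X*, Y*) in (3.4).

    Columns of X and Y are standardized to zero mean and unit variance,
    and K is the p x q matrix of cross-correlations between the two sets;
    the within-set matrices are replaced by identity, so (X'X)^{-1} and
    (Y'Y)^{-1} never need to be computed.
    """
    n = X.shape[0]
    Xs = (X - X.mean(axis=0)) / X.std(axis=0)
    Ys = (Y - Y.mean(axis=0)) / Y.std(axis=0)
    return Xs.T @ Ys / n
```

Each entry of the returned matrix is the sample Pearson correlation between one column of X and one column of Y, so all entries lie in [-1, 1].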
3.4 Simulation
Different aspects of SCCA performance are evaluated using simulated data. In this
section I describe the methods used in simulations.
Latent variable model
The goal of Sparse Canonical Correlation Analysis is to study two sets of variables X and
Y to identify subsets of variables in each of X and Y that are associated with variables
in the other set.
The dependency between two sets of variables can be modeled using latent, or in other
words unobserved, variables. Suppose there exists a latent variable ωX that controls a
subset of observed variables in X, thus inducing a correlation between these variables. In
this case measured variables in X serve as indicators. An example of this model could
be the following situation: we measure performance of several students in mathemat-
ics by giving them a number of tests. All students get consistently good results, which
implies that their performance is correlated. The unobserved reason for this correlation
could be that all of these students have the same mathematics teacher. Similarly, assume
there exists a latent variable ωY that controls a subset of observed variables in another
data set Y. Carrying on with our hypothetical example, variables in set Y could indi-
cate performance of the same group of students on physics tests and ωY represents their
physics school teacher. Thus, we have two distinct measures (mathematics and physics
test scores) on the same subjects (students). Suppose we observe high test scores in
physics suggesting positive correlation between the results on mathematics and physics
tests. This phenomenon may be explained by another higher level unobserved connection
between the mathematics and physics teachers (ωX and ωY ) and consequently between
the performance of their students on considered tests (X and Y ). This connection could
be represented by a high school where all of these students are studying and where
mathematics and physics teachers are working. Thus, good performance of students and
correlation between their results on mathematics and physics tests may be explained by
the school's specialization in sciences. More formally, this is equivalent to the existence
of a higher level latent variable µ that controls both ωX and ωY and, consequently,
observed variables in X and Y.
Schematically, the latent variable model can be shown as follows: µ controls ωX and ωY,
which in turn control the observed variables x1, x2, . . . , xp.dep and y1, y2, . . . , yq.dep.

Figure 3.1: Schematic representation of the latent variable model.
Mathematical representation
For a simulation study of SCCA performance we need to generate two separate data
sets with an equal number of observations in each and not necessarily equal numbers of
variables. A small subset of variables in X should be associated with a subset of variables
in Y and the rest of the variables in both data sets may be independent. This will allow
testing how well SCCA differentiates the associated group of variables from the rest.
Thus, to simulate the data the following scheme has been used:
1. Let data set X have p variables and set Y have q variables.
2. Both X and Y have n observations
3. Assume that X has p.dep dependent variables and Y has q.dep dependent variables.
These groups of variables are associated with each other according to some model.
The rest of the variables are independent within sets X and Y and between them.
4. Without loss of generality we can assume that the first p.dep variables in X are
associated with the first q.dep variables in Y
Thus, showing the dependent variables in bold, the data matrices are as follows:

X =
  x11 . . . x1p.dep   x1p.dep+1 . . . x1p
   :          :           :             :
  xn1 . . . xnp.dep   xnp.dep+1 . . . xnp

Y =
  y11 . . . y1q.dep   y1q.dep+1 . . . y1q
   :          :           :             :
  yn1 . . . ynq.dep   ynq.dep+1 . . . ynq
Data sets X and Y that satisfy the latent variable model can be simulated using a
higher level latent variable µ alone. The simplest case is when there is only one set of
correlated variables in X that is associated with the only set of correlated variables in Y
and the rest of the variables are independent. This is the model used to simulate data.
In addition, suppose there exists a random variable µ with distribution N(0, σ²_µ) such
that

  x_ji = α_i µ_j + e^x_ji   for j = 1, . . . , n,  i = 1, . . . , p.dep
  y_ji = β_i µ_j + e^y_ji   for j = 1, . . . , n,  i = 1, . . . , q.dep

and for the independent variables assume

  x_ji = e^x_ji   for j = 1, . . . , n,  i = p.dep + 1, . . . , p
  y_ji = e^y_ji   for j = 1, . . . , n,  i = q.dep + 1, . . . , q

where e^x_ji ∼ N(0, σ²_e) and e^y_ji ∼ N(0, σ²_e) for j = 1, . . . , n and i = 1, . . . , p (respectively q).
Also assume

  ∑_{i=1}^{p.dep} α_i = ∑_{i=1}^{q.dep} β_i = 1

Then define

  µ_X = ∑_{i=1}^{p.dep} x_i = µ + ∑_{i=1}^{p.dep} e^x_i   and   µ_Y = ∑_{i=1}^{q.dep} y_i = µ + ∑_{i=1}^{q.dep} e^y_i
The highest correlation between linear combinations of dependent variables from sets X
and Y is observed between the sums of these dependent variables:

  Cor(µ_X, µ_Y) = σ²_µ / ( √(σ²_µ + p.dep σ²_e) √(σ²_µ + q.dep σ²_e) )    (3.6)
The covariance matrix between variables in sets X and Y has the structure:

  Σ_XY =
             y1              y2              . . .  yq.dep              yq.dep+1 . . . yq
  x1         α1β1σ²_µ        α1β2σ²_µ        . . .  α1βq.depσ²_µ        0        . . . 0
  x2         α2β1σ²_µ        α2β2σ²_µ        . . .  α2βq.depσ²_µ        0        . . . 0
   :            :               :                      :                :              :
  xp.dep     αp.depβ1σ²_µ    αp.depβ2σ²_µ    . . .  αp.depβq.depσ²_µ    0        . . . 0
  xp.dep+1   0               0               . . .  0                   0        . . . 0
   :            :               :                      :                :              :
  xp         0               0               . . .  0                   0        . . . 0
An important observation demonstrated by this matrix form is that in this simulation
design, if two variables x_i and x_j in set X are associated with some variable y_k in set Y,
then x_i and x_j are correlated. In the chosen notation, correlation between x_i and y_k
implies that α_i and β_k are non-zero, while correlation between x_j and y_k implies that α_j
and β_k are non-zero. Then the correlation between x_i and x_j is non-zero as well, since
Cov(x_i, x_j) = α_i α_j σ²_µ. This fact can be used for some preliminary filtering of the data
to reduce the number of noise variables, which is discussed in chapter 7.
This matrix representation is also related to another way of looking at the canonical
correlation model. In the simplest case, such as considered above, there is a single set
of associated variables for each X and Y and the rest of the variables are independent.
Then there is only one pair of canonical variates and Σ_12 = Σ_XY = λ u v′ where λ is the
singular value and u and v are the singular vectors. Then X and Y variables can be
sorted so that the singular vectors have the form:

  u = (u1, . . . , up.dep, 0, . . . , 0)′      v = (v1, . . . , vq.dep, 0, . . . , 0)′
Then Σ_XY has the following form:

  Σ_XY =
    u1v1       . . .  u1vq.dep       0  . . .  0
     :                  :            :         :
    up.depv1   . . .  up.depvq.dep   0  . . .  0
    0          . . .  0              0  . . .  0
     :                  :            :         :
    0          . . .  0              0  . . .  0
Selection of the true correlation
The true correlation between the linear combinations of variables from the different sets
can be obtained by controlling the variance of the latent variable µ and the variance of
the noise in the measurements, i.e. σ²_µ and σ²_e. This is based on equation (3.6) for the
correlation between the linear combinations. Increasing σ²_µ relative to the variance of
the noise results in higher correlation between the subsets of variables. For example,
when there are 20 variables in each set of measurements that are associated between the
sets, setting σ²_µ = 1.8 and σ²_e = 0.01 would result in the correlation between the linear
combinations of these variables Cor(µ_X, µ_Y) = 0.9, while setting σ²_µ = 0.2 and
σ²_e = 0.01 would result in Cor(µ_X, µ_Y) = 0.5.
3.5 Convergence
The SCCA algorithm applied to the matrix K converges to the first singular vectors u1 and
v1. The rate of convergence is very high and the algorithm usually requires fewer than 10 iterations
even for very large data sets each containing thousands of variables. This observation is
based on the simulation studies as well as the analysis of real data containing thousands
of variables in the sets X and Y. Computational time is also short since the only required
operations are matrix and vector multiplications.
Convergence in the absence of variable selection
In the absence of the soft-thresholding used to obtain a sparse solution, we have the
algorithm for calculating the singular value decomposition described by I. J. Good [Good, 1969]. The
same algorithm is also presented as the power method for computing eigenvectors by
J. A. Wegelin [Wegelin], who references G. W. Stewart [Stewart, 1973]. The algorithm
converges at exponential speed [Good, 1969]. The proof of convergence is presented in
Wegelin [Wegelin] on page 32. The idea behind the algorithm can be seen by tracing a
few steps. Consider the singular value decomposition of a matrix K in (3.2):
  K = UDV′ = (u1, . . . , uk) D (v1, . . . , vk)′
Assume that the starting value v0 for the right singular vector is not in the null space of
V, so v0 = V α. Then the algorithm proceeds as:
1. u^1 = K v^0 = UDV′V α = UDα

2. v^1 = K′u^1 = V DU′UDα = V D²α

3. u^2 = K v^1 = UDV′V D²α = UD³α

4. ...
Note that the exponent of D increases at every step. If the problem is scaled such that
the largest singular value (element of D) is one, it is evident that the powers of D converge
to a matrix with the value 1 in the (1,1) position and zero elsewhere. This will produce
the first right and left singular vectors u1 and v1.
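The alternating updates traced above can be sketched directly. The code below is my own illustration, not from the thesis; with a clear gap between d1 and d2, the iteration recovers the first singular vector pair (up to a common sign) in a handful of steps:

```python
import numpy as np

def power_method_first_pair(K, n_iter=50):
    """Sketch of the alternating power method u <- Kv, v <- K'u for the
    first singular vectors of K; normalization keeps the vectors at unit
    length as in steps 3b and 4b, with no soft-thresholding."""
    v = np.ones(K.shape[1]) / np.sqrt(K.shape[1])  # generic start, not in null space of V
    for _ in range(n_iter):
        u = K @ v
        u /= np.linalg.norm(u)
        v = K.T @ u
        v /= np.linalg.norm(v)
    return u, v
```

The convergence rate is governed by the ratio d2/d1, which explains the exponential speed noted above: each full u-v sweep multiplies the contribution of the trailing singular directions by (d2/d1)².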
Convergence in the presence of variable selection - the issue of
starting values
In the presence of soft-thresholding for variable selection, convergence for SCCA is related
to the selection of the starting value for the singular vector. This can be demonstrated
by considering in greater detail a section from the proof of convergence of the power
method [Wegelin] described above:
As before, consider the singular value decomposition of the matrix K with rank k in (3.2):
  K = UDV′ = (u1, . . . , uk) D (v1, . . . , vk)′
In the SCCA algorithm we use a starting value for the right singular vector v0 to initialize
the process. The algorithm is symmetric with respect to left and right singular vectors,
therefore, we could also start from step 4 and use u0. Without loss of generality, let’s
assume that the procedure for the algorithm is as presented above and iterations start
with updating u using v0. If v0 is in the null space of V in (3.2), then the process will
stop at the first iteration since u1 and v1 will become zero vectors. Thus, to achieve
convergence, assume that v0 is not in the null space of V . Then it can be written as a
linear combination of the vectors in V, which form an orthonormal basis (i.e. v_i′v_j = 0
for i ≠ j and v_i′v_i = 1, for i, j = 1, . . . , k):

  v0 = α1 v1 + α2 v2 + . . . + αk vk    (3.7)
where α1, . . . , αk are some coefficients and at least one αi is non-zero. Then in step 3a
of the algorithm we have

  u^1 = K v0 = UDV′ v0                                (3.8)
      = UDV′ (α1 v1 + α2 v2 + . . . + αk vk)          (3.9)
      = ∑_{i=1}^{k} αi UDV′ vi                        (3.10)
      = ∑_{i=1}^{k} αi di ui                          (3.11)

because v_i′v_i = 1.
If we write u^1 = (u^1_1, u^1_2, . . . , u^1_p)′ and

  U = (u1, . . . , uk) =
    u11 . . . u1k
     :    .    :
    up1 . . . upk

then

  u^1_j = d1α1 uj1 + d2α2 uj2 + . . . + dkαk ujk,   j = 1, . . . , p    (3.12)
After the normalization in step 3b we will have

  u^1 = ( d1α1 / √(∑_{i=1}^{k} d_i² α_i²) ) u1 + . . . + ( dkαk / √(∑_{i=1}^{k} d_i² α_i²) ) uk    (3.13)

or, written componentwise,

  u^1_j = ( d1α1 uj1 + d2α2 uj2 + . . . + dkαk ujk ) / √(∑_{i=1}^{k} d_i² α_i²),   j = 1, . . . , p    (3.14)
Finally, the decision on the variable selection for the first singular vector u (i.e.
variable selection for the set X) is made in the soft-thresholding step 3c. The soft-
thresholding sets entries of the estimated vector u^1 that are small in absolute value to zero
(depending on the sparseness parameter). We would like to achieve the same sparseness
as in the true singular vector by setting to zero those entries in u^1 that correspond to
zeros in the true u1. However, as equation (3.13) demonstrates, after being updated and
normalized, the entries in u^1 no longer depend exclusively on the first singular vector u1,
but rather on all left singular vectors u1, . . . , uk. This raises a concern whether the correct
entries in the estimate u^1 are set to zero, as this will affect all remaining steps of the
SCCA algorithm and the final result.
I have described in detail the first step of the algorithm for the first singular vectors.
However, the derivation for updated singular vector entries is similar for all other steps
and for both left and right singular vectors. The above equations demonstrate that
updated singular vectors (for example, u1) depend on the starting values (u0) through
the parameters α. Also, in the later iterations the power of the singular values di will
increase. Thus, in iteration s we will have
  u^s = ∑_{i=1}^{k} αi d_i^{2s} ui    (3.15)
As the iteration number s increases, so do the powers of di, while the power of the αi
parameters does not increase. Also, given that the singular values di decrease as i increases
and d1 is the largest, the first term α1 d1^{2s} u1 will become dominant.
This fact is used in the proof of convergence of the power method in Wegelin [Wegelin].
Thus, at later iterations the influence of the starting singular vectors is weakened by
the increased influence of the singular values. However, the decisions regarding variable
selection are already made beginning with the first iteration, so the concern about the
effect on the results is valid. For instance, if the true singular vectors were known, then
we could take v0 = v1 = 1·v1 + 0·v2 + . . . + 0·vk, i.e. in (3.7) we would have α1 = 1,
α2 = 0, . . . , αk = 0. In that case, by (3.12), the new value u^1 would depend only on u1 as
follows:

  u^1 = ( d1α1 / √(∑_{i=1}^{k} d_i² α_i²) ) u1 + . . . + ( dkαk / √(∑_{i=1}^{k} d_i² α_i²) ) uk = u1 = (u11, u21, . . . , up1)′    (3.16)
Therefore, the correct entries in u^1 would automatically be set to zero.
This demonstrates the importance of selecting proper starting values for the singular
vectors. As is stated above they should not be in the null space of U or V respectively.
Possible choices include
• First singular vector of sample matrix K as suggested by Zou et al. [Zou et al.,
2004]
• First singular vector of true matrix K if it is available, i.e. ideal starting value.
This may only be possible in simulation studies.
• Vector consisting of column means of K for v0 or row means of K for u0 as suggested
by Wegelin [Wegelin]. Both vectors should be standardized to have unit length.
This ensures that singular vectors are obtained in the order of decreasing singular
values. Thus, first singular vectors will be obtained first followed by the second and
so on. This choice is similar to using first singular vector of sample matrix K since
it usually resembles means of columns or rows depending on whether we consider
the right or left side.
• Vector consisting of random numbers that has unit length.
To assess the performance of these starting values and their effect on the results of SCCA
I performed simulation studies to compare all four choices.
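Three of the four candidate starting values can be sketched as follows (the first singular vector of the true K is only available in simulation studies, so it is omitted). The function and its labels are my own illustration, not thesis code:

```python
import numpy as np

def starting_values(K, rng, kind="sample_svd"):
    """Sketch of candidate starting vectors v0 for the SCCA iteration.

    kind: "sample_svd" - first right singular vector of the sample K;
          "col_means"  - column means of K (Wegelin-style start);
          "random"     - Unif(0, 1) random numbers.
    All choices are returned with unit length.
    """
    p, q = K.shape
    if kind == "sample_svd":
        v0 = np.linalg.svd(K)[2][0, :]  # first right singular vector of K
    elif kind == "col_means":
        v0 = K.mean(axis=0)             # column means of K
    elif kind == "random":
        v0 = rng.uniform(0.0, 1.0, size=q)
    else:
        raise ValueError(kind)
    return v0 / np.linalg.norm(v0)
```

A starting vector for u^0 would be built symmetrically from the left singular vector or the row means of K.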
Simulation to assess convergence of SCCA
To study the effect of starting values on the SCCA results, I simulated data based on the
latent variable model. The results obtained for each of the 4 choices of starting values
were compared based on the number of variables selected for sets X and Y and the number
of variables selected correctly, i.e. variables selected by SCCA that are truly important
and were simulated to have an association between sets X and Y. I also compared the
results for the 4 starting value types based on the discordance measure, which is calculated
as the number of variables misidentified by SCCA compared to the true model used in
the simulation. Thus, the discordance measure for set X is a combination of two different
types of errors: the number of false positives (true noise variables selected by SCCA) plus
the number of false negatives (true important variables not selected by SCCA). Discordance
for Y is computed similarly.
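The discordance measure can be sketched as a simple count. This is my own illustration with a hypothetical function name; `selected` and `truth` are indicator vectors over the variables of one set:

```python
import numpy as np

def discordance(selected, truth):
    """Sketch of the discordance measure: false positives plus false
    negatives when comparing an SCCA selection to the true model."""
    selected = np.asarray(selected, bool)
    truth = np.asarray(truth, bool)
    false_pos = np.sum(selected & ~truth)   # noise variables selected
    false_neg = np.sum(~selected & truth)   # important variables missed
    return int(false_pos + false_neg)
```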
In each simulation the optimal combination of sparseness parameters λu and λv was
identified using 10-fold CV separately for each starting value choice and for each data
set. Then subsets of variables in X and Y were obtained by applying the SCCA algorithm
with the optimal parameters λu and λv for soft-thresholding.
Simulation design:
Fifty replications were carried out for each of several different choices of the true correlation
between the sets of associated variables in X and Y. A greater number of simulations
would make the trends smoother; however, it would be computationally very time consuming,
since the SCCA algorithm has to be applied 4 times for each simulation (once for each of
the four different starting values) and each application of the SCCA algorithm includes
cross-validation for sparseness parameter selection. This entire process is repeated for
every different value of the true correlation between the sets of associated variables in
X and Y. Investigation of the results for each simulated data set showed low variability
between the simulations for the specific conditions. Therefore, it is sufficient to use 50
simulated sets for each of several different choices of the true correlation. All four methods
of starting values were applied in each simulation to facilitate comparison between
the results. Simulated data sets contained p.dep = q.dep = 20 associated variables each,
set X consisted of p = 150 variables and set Y had q = 100 variables. The sample size
n was 50. True correlations between the linear combination of the subsets of important
variables in sets X and Y ranged between 0.041 and 0.916.
Simulation results and conclusions:
Tables 3.1 and 3.2 and the corresponding figures 3.2 to 3.5 compare the results of
SCCA for the 4 types of starting values for the singular vectors. Results in each figure are shown
for different values of the true correlation between associated subsets of variables in X
and Y used to generate the data (ρ).
Table 3.1 and corresponding figures 3.2 and 3.4 demonstrate the results for the number
of variables in set X selected by SCCA averaged over 50 simulations. Figure 3.2 shows the
number of positives while figure 3.4 shows the average number of selected variables that
were selected correctly, i.e. these variables belong to the subsets of associated variables
used in data simulation (true positives). Table 3.2 and corresponding figures 3.3 and 3.5
demonstrate similar results for the set Y.
Note that as the true correlation ρ between associated variable subsets increases,
the results obtained using 4 different types of the starting values become more similar.
Thus, for true correlation value of 0.916 the numbers of positives and the numbers of
true positives are exactly the same for all 4 starting value types. In fact, identical sets
of variables in X and Y are identified in this case for all simulations. Also note that as
the true correlation increases, the number of selected variables in sets X and Y decreases
and approaches 20 which is the number of associated variables in each set used in the
simulation model. At the same time the number of correctly selected variables in sets X
and Y increases with increasing true correlation and also approaches 20. This indicates
that sensitivity and specificity of SCCA improve with the increasing correlation between
the subsets of associated variables. In particular, average sensitivity for ρ = 0.916 is 0.81
positives for X (true positives for X), by type of starting values:

  average true   sample singular   true singular   row and column   Unif(0, 1)
  correlation    vectors of K      vectors of K    means for K      random numbers
  0.041          63.06 (8.32)      66.44 (8.92)    72 (9.66)        63.22 (8.41)
  0.116          62.24 (8.14)      60.22 (8.10)    62.46 (8.50)     61.52 (8.12)
  0.245          60.2 (8.12)       65.4 (8.94)     56.02 (7.62)     59.48 (8.02)
  0.531          52.46 (9.82)      55.1 (10.74)    60.64 (10.74)    57.66 (10.40)
  0.723          50.48 (14.0)      50.14 (13.88)   51.92 (13.92)    52.28 (13.88)
  0.858          41.34 (15.52)     40.78 (15.42)   41.2 (15.52)     40.76 (15.42)
  0.916          28.76 (16.26)     28.74 (16.26)   28.74 (16.26)    28.74 (16.26)
Table 3.1: Investigation of the effect of 4 types of starting values for the singular vectors
on SCCA: number of variables selected for set X (total number of variables 150) averaged
over 50 simulations for 4 types of starting values for the singular vectors and different
correlations between associated subsets of variables in X and Y used in the simulations.
and 0.84 for X and Y sets respectively while average positive predictive values for X and
Y are 0.66 and 0.63 respectively. Full results of the sensitivity and specificity analysis of the
effect of starting values on the SCCA results are not shown here. However, since these
statistics are derived from the number of positives and the number of true positives,
they are reflected in the figures and tables shown. A complete sensitivity-specificity analysis
of SCCA is presented separately in section 4.3 of chapter 4.
Tables 3.1 and 3.2 also show that the number of variables selected for the set X
is higher than for set Y for all types of starting values and almost all values of true
correlation between the sets of important variables. This observation can be explained
positives for Y (true positives for Y), by type of starting values:

  average true   sample singular   true singular   row and column   Unif(0, 1)
  correlation    vectors of K      vectors of K    means for K      random numbers
  0.041          55.9 (10.84)      47.68 (9.72)    47.9 (9.76)      49.28 (9.86)
  0.116          48.14 (9.9)       46.74 (9.93)    49.02 (10.12)    47.9 (9.46)
  0.245          51.46 (10.34)     53.86 (11.14)   46.44 (9.28)     50.14 (9.94)
  0.531          46.56 (11.64)     46.08 (12.2)    47.94 (11.94)    44.2 (11.44)
  0.723          43.7 (15.12)      43.46 (15.08)   44.14 (15.1)     44.72 (15.12)
  0.858          34.18 (16.32)     34.24 (16.34)   34.16 (16.32)    34.22 (16.34)
  0.916          33.86 (16.86)     33.86 (16.86)   33.86 (16.86)    33.86 (16.86)
Table 3.2: Investigation of the effect of 4 types of starting values for the singular vectors
on SCCA: number of variables selected for set Y (total number of variables 100) averaged
over 50 simulations for 4 types of starting values for the singular vectors and different
correlations between associated subsets of variables in X and Y used in the simulations.
by the fact that set X contains 150 variables and set Y contains 100 variables, while 20
variables in each set are associated between X and Y. Thus, there are more noise variables
in set X than in Y. Therefore, SCCA differentiates better between the informative and
uninformative (i.e. noise) variables when there are fewer noise variables in the set. The
numbers of true positives for the sets X and Y are closer for all considered cases. Hence,
the amount of noise in the data set has greater effect on the number of noise variables
included in the solution (false positives) than on the number of variables selected correctly
(true positives). This indicates that preliminary filtering to remove noise from the data
would be beneficial for SCCA performance. This is discussed in greater detail in chapter
7.
Tables 3.3 and 3.4 demonstrate the results for the discordance measure for X and Y
averaged over 50 simulations. These tables also show improvement in SCCA performance,
as well as increased agreement between the results obtained using the 4 different starting
values, as the true correlation ρ increases, with identical results for ρ = 0.916. Here
improved SCCA performance is demonstrated by the decreasing discordance measures.
Pairwise t-tests were used to compare the results for different starting value choices
averaged over 50 simulations. Since all 4 types of starting values were applied to the
same data set within every simulation, the measures of SCCA performance are not
independent. Thus, for example, discordance measures for the set X for a specific true
correlation between important variables in X and Y are not independent observations for
each of 50 simulations. Therefore, pair-wise t-tests were carried out as follows. Suppose
m1 and m2 are 50×1 vectors of measures that are compared for the two types of starting
values (for example, discordance for set X in 50 simulations). Let v1 and v2 denote the
variances of these measures in 50 simulations and v12 denote the covariance.
v1 = Var(m1),   v2 = Var(m2),   v12 = Cov(m1, m2)
average discordance for X

average true correlation | sample singular vectors of K | true singular vectors of K | row and column means for K | Unif(0, 1) random numbers
0.041 | 66.42 | 68.6  | 72.68 | 66.34
0.116 | 65.96 | 64.02 | 65.46 | 65.28
0.245 | 63.96 | 67.52 | 60.78 | 63.44
0.531 | 52.82 | 53.62 | 59.16 | 56.86
0.723 | 42.48 | 42.38 | 44.08 | 44.52
0.858 | 30.3  | 29.04 | 30.16 | 29.92
0.916 | 16.24 | 16.22 | 16.22 | 16.22
Table 3.3: Investigation of the effect of 4 types of starting values for the singular vectors
on SCCA results: discordance measure for set X averaged over 50 simulations for 4 types
of starting values for the singular vectors and different correlations between associated
subsets of variables in X and Y used in the simulations.
average discordance for Y

average true correlation | sample singular vectors of K | true singular vectors of K | row and column means for K | Unif(0, 1) random numbers
0.041 | 54.22 | 48.24 | 50.18 | 49.56
0.116 | 48.34 | 46.82 | 48.78 | 48.98
0.245 | 50.78 | 51.58 | 47.88 | 50.26
0.531 | 43.28 | 41.68 | 44.06 | 41.32
0.723 | 33.46 | 33.3  | 33.94 | 34.48
0.858 | 21.54 | 21.56 | 21.52 | 21.54
0.916 | 20.14 | 20.14 | 20.14 | 20.14
Table 3.4: Investigation of the effect of 4 types of starting values for the singular vectors
on SCCA results: discordance measure for set Y averaged over 50 simulations for 4 types
of starting values for the singular vectors and different correlations between associated
subsets of variables in X and Y used in the simulations.
m̄1 = (1/50) Σ_{i=1}^{50} m_i1,   m̄2 = (1/50) Σ_{i=1}^{50} m_i2

Then

(m̄1 − m̄2) / √((v1 + v2 − 2 v12) / 50) ∼ T(50−1)
These tests were applied to compare the results presented in the tables above for all types
of the starting values for the singular vectors at each true correlation between important
variables in X and Y. The obtained p-values indicate that there is no statistically
significant difference in the performance of SCCA for any choice of starting values. For instance, when
discordance measures for set X at true correlation of 0.04 are compared between using
sample singular vectors or using random numbers as starting values, then the p-value is
0.998372. All other obtained p-values exceed 0.72. In fact, when higher true correlations
between important variables are considered (ρ = 0.916) then identical sets of variables
in X and Y are selected by SCCA using all 4 types of starting values, therefore there is
no difference between using any of these choices.
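The paired comparison above can be sketched in a few lines. This is a hedged illustration, not the thesis's code: the function name `paired_t` is mine, and since Var(m1 − m2) = v1 + v2 − 2·v12, the statistic reduces to the ordinary one-sample t-test on the 50 per-simulation differences.

```python
import numpy as np

def paired_t(m1, m2):
    """Paired t statistic computed from the variances and covariance of two
    dependent vectors of performance measures, as in the comparison above."""
    m1, m2 = np.asarray(m1, float), np.asarray(m2, float)
    n = len(m1)
    v1, v2 = m1.var(ddof=1), m2.var(ddof=1)
    v12 = np.cov(m1, m2)[0, 1]  # sample covariance (ddof=1 by default)
    # variance of the per-simulation difference is v1 + v2 - 2*v12
    return (m1.mean() - m2.mean()) / np.sqrt((v1 + v2 - 2 * v12) / n)
```

With 50 simulations the statistic is compared against a t distribution with 49 degrees of freedom.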
The presented simulation results demonstrate that the performance and convergence of the
SCCA algorithm are not affected by the choice among the 4 types of starting values for the left
and right singular vectors. Therefore, the user could pick any vectors as long as they are not in the
left and right null spaces of the matrix K respectively. However, as stated by Wegelin
[Wegelin], in the absence of variable selection using row and column means of the matrix
K as starting vectors ensures the correct order of obtaining singular vectors, i.e. the
first singular vectors are obtained first, followed by the second singular vectors and so on.
Therefore, it is recommended to use that type of starting values in the SCCA algorithm as
well. This does not introduce additional computational complexity since row and column
means of the matrix are easily obtained and are faster to compute than the sample first
singular vectors.
3.6 Sparseness parameter selection
The optimal combination of sparseness parameters for the soft-thresholding steps can be
selected using k-fold cross-validation (CV). For each step of the CV we consider a group
of combinations of sparseness parameters for the left and right singular vectors (λu, λv). Then
for every specific pair of sparseness parameters, (k−1)/k of the data (the training
sample) is used to identify the linear combinations of variables. We evaluate the correlation
between the obtained canonical vectors in the remaining 1/k of the data set (the testing sample).
Test sample correlations obtained for each combination of the sparseness parameters are
averaged over the k CV steps. The optimal combination of λu and λv then corresponds to
the highest average test sample correlation.
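The grid search over (λu, λv) described above can be sketched as follows. This is a hedged illustration under my own assumptions: the names `soft_threshold`, `scca_sketch`, and `cv_select` are mine, and `scca_sketch` is a simplified stand-in that soft-thresholds the first singular vectors of the sample cross-covariance matrix K in one pass, not the thesis's full iterative SCCA algorithm.

```python
import numpy as np

def soft_threshold(w, lam):
    # lasso-style soft-thresholding: shrink loadings and zero out small ones
    return np.sign(w) * np.maximum(np.abs(w) - lam, 0.0)

def scca_sketch(X, Y, lam_u, lam_v):
    # simplified stand-in for SCCA: soft-threshold the first singular
    # vectors of the sample cross-covariance matrix K = X'Y / n
    K = X.T @ Y / X.shape[0]
    U, _, Vt = np.linalg.svd(K, full_matrices=False)
    u = soft_threshold(U[:, 0], lam_u)
    v = soft_threshold(Vt[0], lam_v)
    if u.any():
        u = u / np.linalg.norm(u)
    if v.any():
        v = v / np.linalg.norm(v)
    return u, v

def cv_select(X, Y, grid_u, grid_v, k=10, seed=0):
    # k-fold CV: average the test sample correlation for each (lam_u, lam_v)
    # pair and return the combination with the highest average
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(X))
    folds = np.array_split(idx, k)
    best_pair, best_cor = None, -np.inf
    for lu in grid_u:
        for lv in grid_v:
            cors = []
            for test in folds:
                train = np.setdiff1d(idx, test)
                u, v = scca_sketch(X[train], Y[train], lu, lv)
                a, b = X[test] @ u, Y[test] @ v
                cors.append(np.corrcoef(a, b)[0, 1]
                            if a.std() > 0 and b.std() > 0 else 0.0)
            avg = float(np.mean(cors))
            if avg > best_cor:
                best_pair, best_cor = (lu, lv), avg
    return best_pair, best_cor
```

The guard on the standard deviations handles the case discussed below where a sparseness parameter above some threshold empties the solution, which is scored as zero test sample correlation.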
Typical results of this process are demonstrated in figure 3.6. To create this graph I
simulated the data using the single latent variable approach with the following settings:
Simulation design:
• 1500 variables in set X
• 1000 variables in set Y
• 30 variables in each set are associated between the sets and the rest of the variables
are independent
• sample size is 150
• The range for both sparseness parameters λu and λv was between 0 and 0.25
I used 10-fold CV, thus the test sample contained 15 observations while the training
sample contained 135 observations.
Simulation results and conclusions:
Figure 3.6 shows the average test sample correlations for the canonical vectors obtained using
SCCA for different combinations of sparseness parameters as a 3D surface. The dotted
plane on the graph corresponds to the test sample correlation obtained using the full
SVD solution, in which the linear combinations of variables contain the entire sets of
variables in X and Y. In this case the full SVD based test sample correlation was 0.42, while
the highest test sample correlation for the SCCA solution was 0.86, attained at
(λu = 0.101, λv = 0.131); hence this is the optimal sparseness parameter combination.
This graph also demonstrates the effect of the sparseness parameter selection on the
test sample correlation. When both λu and λv are set to zero, then no variable selection
is performed and SCCA provides the full SVD solution. This is demonstrated on the
graph by the equality in test sample correlations for SCCA and SVD in the bottom left
corner where both sparseness parameters are 0. However, if λu or λv is set to a value
exceeding some threshold, then no variables are included in the sparse SCCA solution,
which results in zero test sample correlation. This is the case in the bottom far right
corner of the graph where both sparseness parameters are close to 0.25.
Figure 3.6 also shows that the sparse solution obtained using SCCA is more gen-
eralizable (i.e. applicable to the independent data set) than the full SVD solution that
includes all available variables. This is indicated by the higher test sample correlation
for the SCCA solution compared to SVD even for suboptimal sparseness parameter combinations.
The full SVD solution tends to overfit the data, providing high correlation between
the linear combinations of variables for the given data set but much lower correlation
for an independent test data set generated under the same underlying model.
[Figure 3.2: "Average number of variables selected for X (positives for X)"; x-axis: average true correlation (0.2–0.8); y-axis: number of positives (30–70); curves: sample singular vectors of K, true singular vectors of K, row and column means for K, Unif(0, 1) random numbers.]
Figure 3.2: Investigation of the effect of 4 types of starting values for the singular vectors
on SCCA: number of variables selected for set X (total number of variables 150) averaged
over 50 simulations for 4 types of starting values for the singular vectors and different
correlations between associated subsets of variables in X and Y used in the simulations.
Solid curve - sample singular vectors of K in equation 3.4, dashed curve - true singular
vectors of K, dotted curve - row and column means for K, dotted and dashed curve -
Unif(0, 1) random numbers.
[Figure 3.3: "Average number of variables selected for Y (positives for Y)"; x-axis: average true correlation (0.2–0.8); y-axis: number of positives (35–55); curves: sample singular vectors of K, true singular vectors of K, row and column means for K, Unif(0, 1) random numbers.]
Figure 3.3: Investigation of the effect of 4 types of starting values for the singular vectors
on SCCA: number of variables selected for set Y (total number of variables 100) averaged
over 50 simulations for 4 types of starting values for the singular vectors and different
correlations between associated subsets of variables in X and Y used in the simulations.
Solid curve - sample singular vectors of K in equation 3.4, dashed curve - true singular
vectors of K, dotted curve - row and column means for K, dotted and dashed curve -
Unif(0, 1) random numbers.
[Figure 3.4: "Average number of variables correctly selected for X (true positives for X)"; x-axis: average true correlation (0.2–0.8); y-axis: number of true positives (8–16); curves: sample singular vectors of K, true singular vectors of K, row and column means for K, Unif(0, 1) random numbers.]
Figure 3.4: Investigation of the effect of 4 types of starting values for the singular vectors
on SCCA: number of variables correctly selected for set X (total number of variables:
150, important variables: 20) averaged over 50 simulations for 4 types of starting values
for the singular vectors and different correlations between associated subsets of variables
in X and Y used in the simulations. Solid curve - sample singular vectors of K in equation
3.4, dashed curve - true singular vectors of K, dotted curve - row and column means for
K, dotted and dashed curve - Unif(0, 1) random numbers.
[Figure 3.5: "Average number of variables correctly selected for Y (true positives for Y)"; x-axis: average true correlation (0.2–0.8); y-axis: number of true positives (10–16); curves: sample singular vectors of K, true singular vectors of K, row and column means for K, Unif(0, 1) random numbers.]
Figure 3.5: Investigation of the effect of 4 types of starting values for the singular vectors
on SCCA: number of variables correctly selected for set Y (total number of variables:
100, important variables: 20) averaged over 50 simulations for 4 types of starting values
for the singular vectors and different correlations between associated subsets of variables
in X and Y used in the simulations. Solid curve - sample singular vectors of K in equation
3.4, dashed curve - true singular vectors of K, dotted curve - row and column means for
K, dotted and dashed curve - Unif(0, 1) random numbers.
[Figure 3.6: 3D surface, "SCCA test sample correlation vs sparseness parameters"; axes: sparseness param. for alpha (0.05–0.20), sparseness param. for beta (0.05–0.20), test sample correlation (0.3–0.8).]
Figure 3.6: Average test sample correlation versus combinations of sparseness parameters
for left and right singular vectors. The maximum of test sample correlation determines
the optimal combination of sparseness parameters. 3D surface is for SCCA solution,
dotted plane is for full SVD solution.
Chapter 4
SCCA evaluation
In this chapter I present evaluation results for Sparse Canonical Correlation Analysis
(SCCA) performance.
I begin by describing the statistical tool for evaluation - cross-validation, in section
4.1. This approach can be used to evaluate the results of SCCA in real studies when
an independent test sample from the same distribution is not available. In the following
sections I describe various aspects of method performance and present their evaluation
using simulated data.
SCCA is a useful tool in the analysis of large data sets since it produces a sparse
solution, thus allowing dimensionality reduction. However, large scale studies often suffer
from insufficient sample size. For example, in genomic and genetic studies measurements
on thousands of gene expression profiles and SNP genotypes may be available while
the sample size may be limited to a few hundred individuals. Therefore, it is important
to assess the effect of insufficient sample size on the performance of SCCA. Results of
the evaluation are described in section 4.2.
Another common problem in large scale studies is the presence of a high number of
noise variables that are not related to the studied processes. The sparse solution provided
by SCCA incorporates filtering out the unimportant variables. This is done based on
the assumption that noise measurements are uncorrelated with each other within and
between different types of variables. However, if the correlation between the sets of
informative variables of different types is low, it may be difficult to differentiate between
the noise and important variables. I evaluate the effect of the true underlying correlation
between the sets of associated variables in section 4.3.
4.1 Evaluation tool - Cross Validation
The performance of the SCCA algorithm can be evaluated based on an independent test sample
correlation between the obtained sparse linear combinations of variables of different
types. A higher test sample correlation indicates greater generalizability of the results.
In the absence of an independent testing sample, the performance of the SCCA algorithm
can be evaluated using a two-step cross-validation consisting of inner and outer cross-
validations. The inner CV can be considered part of the algorithm and is used to select
the optimal combination of the sparseness parameters for the left and right singular vectors.
The outer loop is used to assess the generalizability of the results. Thus, for the k_outer-fold
evaluation CV the original sample is split into k_outer parts. Then 1/k_outer of the samples are
reserved for testing (the outer testing sample) and the remaining (k_outer − 1)/k_outer of the samples are used
to obtain linear combinations of variables that are associated between the given data sets
(the outer training sample). This process includes the k_inner-fold CV described in the sparseness
parameter selection section (3.6), which treats the outer training sample as the original
data to which k_inner-fold CV is applied. Outer test sample correlations are computed
for each SCCA solution obtained for each outer training sample and then averaged over the
k_outer CV steps.
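The outer loop of this two-step procedure can be sketched as follows. This is a hedged skeleton under my own naming: `fit_with_inner_cv` is a hypothetical callable standing in for SCCA together with its inner sparseness-parameter CV, returning the fitted left and right singular vectors.

```python
import numpy as np

def nested_cv_correlation(X, Y, fit_with_inner_cv, k_outer=5, seed=0):
    # outer CV: hold out each fold in turn, fit (with inner CV) on the rest,
    # score the held-out fold by the correlation of the canonical variates
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(X))
    cors = []
    for fold in np.array_split(idx, k_outer):
        train = np.setdiff1d(idx, fold)
        u, v = fit_with_inner_cv(X[train], Y[train])
        a, b = X[fold] @ u, Y[fold] @ v
        cors.append(float(np.corrcoef(a, b)[0, 1])
                    if a.std() > 0 and b.std() > 0 else 0.0)
    return float(np.mean(cors))
```

Averaging the held-out correlations over the k_outer folds gives the generalizability estimate described above.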
In some studies the structure of the data may be more complex and some samples
may not be independent of others. One example is a genetic study where variables are
measured in several pedigrees. Then observations corresponding to the members of the
same pedigree are not independent. In these cases simple k-fold cross-validation may not
be desirable as it ignores the correlation structure in the data. One solution is to use
adaptive fold size dictated by the dependency between the samples. Thus, the original
data should not be split into k equal parts, but rather into k parts of comparable but
not necessarily equal size. For example, a family can be treated as an independent single
CV unit.
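Treating each family as a single CV unit can be sketched as below; `family_folds` is an illustrative helper of my own, not a function from the thesis, assuming one integer family identifier per sample.

```python
import numpy as np

def family_folds(family_ids, k, seed=0):
    # assign whole families to CV folds so that related samples never
    # straddle a train/test split; fold sizes are comparable, not equal
    rng = np.random.default_rng(seed)
    fams = np.unique(family_ids)
    rng.shuffle(fams)
    return [np.flatnonzero(np.isin(family_ids, g))
            for g in np.array_split(fams, k)]
```

Each returned index array can then play the role of one held-out fold in the outer CV.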
Test sample correlations obtained in different CV steps for family-based data may
foster sensitivity analysis for pedigrees and may aid in detecting heterogeneity in the
pedigrees. An example of this phenomenon is described in the Application chapter (6).
4.2 Effect of sample size on generalizability
Sparse canonical correlation is useful in cases when there is a large number of variables
under consideration since it allows variable selection. Filtering out the noise provides
more interpretable and generalizable results as compared to the traditional canonical
correlation analysis. The CCA solution provides linear combinations of the entire set of
available variables. This approach may not be appropriate in microarray studies with high
dimensionality because a linear combination of thousands of variables may be difficult to
interpret from the biological perspective. Also it is known that only a subset of measured
genes may be expressed, therefore a solution that includes the entire set of variables would
contain noise which may reduce its applicability to an independent data set. Another
feature of microarray studies is the limited sample size. In these cases estimates of the full
singular vectors obtained from CCA may not be very accurate. To investigate the effect
of sample size on SCCA performance I performed a simulation study. I compared the
generalizability of SCCA results to the full SVD solution which is provided by CCA for
different sample sizes using the simulated data sets. Greater generalizability is indicated
by the higher test sample correlation between the linear combinations of variables of
different types.
Simulation design:
The data were simulated based on the single latent variable model described in section
3.4. I generated 2 sets of variables X and Y with 500 variables in the set X and 400
variables in Y. Thirty variables in each set were simulated to be associated between X
and Y (important variables) and the rest of the variables were noise. Standard deviation
for the latent variable µ was 1 which resulted in the true generated correlation between
the sets of associated variables equal to 0.769. Linear combinations of variables were
obtained using SCCA and CCA (full SVD first singular vectors) for sample sizes ranging
between 50 and 1500. Subsequently these results were applied to independent test data
sets generated from the same distribution to compute the test data correlation between
the linear combinations of variables, Cor(X_test u, Y_test v). The sample size for the test
data was 100. Fifty simulations were performed for each sample size and the test sample
correlations were averaged over the simulations.
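A data-generating sketch in the spirit of this design is shown below. The exact form of the single latent variable model of section 3.4 is not reproduced here, so treat this as a hedged approximation: the choice to place the important variables in the first n_imp columns, the unit loadings, and the noise scale are my assumptions, and the function name `simulate_latent` is mine.

```python
import numpy as np

def simulate_latent(n, p_x, p_y, n_imp, sd_mu=1.0, sd_noise=1.0, seed=0):
    # important variables in X and Y share a common latent variable mu;
    # all remaining variables are independent noise
    rng = np.random.default_rng(seed)
    mu = rng.normal(0.0, sd_mu, size=(n, 1))
    X = rng.normal(0.0, sd_noise, size=(n, p_x))
    Y = rng.normal(0.0, sd_noise, size=(n, p_y))
    X[:, :n_imp] += mu  # first n_imp columns load on the latent variable
    Y[:, :n_imp] += mu
    return X, Y
```

Increasing sd_mu relative to sd_noise raises the true correlation between the associated subsets, mirroring the role of the latent variable's standard deviation in the thesis's simulations.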
Simulation results and conclusions:
Figure 4.1 demonstrates the superior performance of SCCA compared to the full SVD solution,
especially for small sample sizes. Sparse solutions obtained by SCCA are more
generalizable which is reflected in the higher correlation between the linear combinations
of variables applied to the independent test data. For instance, for sample size 100 SCCA
test sample correlation is 0.75 while for the full SVD solution it is 0.55. SCCA performs
better than full SVD for all sample sizes, however, as the sample size increases the full
SVD solution becomes more precise. This means that the values in the first singular
vectors that correspond to the noise variables become increasingly close to 0. Thus, the
full SVD solution approaches the sparse solution as the sample size increases. That is
demonstrated by the decreasing distance between the SCCA and full SVD curves for
higher sample sizes.
[Figure 4.1: "Effect of sample size on test sample correlation"; x-axis: sample size (0–1500); y-axis: test sample correlation (0.4–0.8); curves: SCCA, SVD.]
Figure 4.1: Sample size effect: test sample correlation for different sample sizes averaged
over 50 simulations for each sample size. Solid curve - test sample correlation for SCCA
solution, dashed curve - test sample correlation for full SVD solution.
The graph also demonstrates the inconsistency of the full SVD solution that is discussed
in Johnstone and Lu [Johnstone and Lu, 2004], where the authors show that the
estimate of the first principal component is consistent if and only if
c = lim_{n→∞} p(n)/n = 0        (4.1)
In this simulation, however, the number of variables in each of the sets X and Y is
comparable to the sample size or exceeds it. Therefore, the full SVD estimates of the
first singular vectors are inconsistent which is reflected in the apparent upper boundary
for the test sample correlation: even for higher sample sizes test sample correlation for
full SVD solution does not exceed test sample correlation for SCCA solution and the
curves seem to approach their asymptotes in parallel.
Test sample correlation for the SCCA estimates seems to have zero slope for the
higher sample sizes and approaches 0.769 which is the true correlation between the sets of
associated variables used in the generating model. This suggests consistency of the SCCA
estimates. In the case of sparse canonical correlation, however, it should be stressed that we
are addressing a new aspect: accuracy of model selection rather than accuracy of the
estimated coefficients in the singular vectors. Identifying the positions of zero coefficients in
the singular vectors is of greater importance than the accuracy of the non-zero coefficients,
which affects the test sample correlation.
4.3 Effect of the true association between the linear
combinations of variables
In addition to the influence of sample size, the performance of SCCA may also
be affected by the true correlation between the associated subsets of variables in the
considered data sets X and Y. This effect was studied using simulated data.
Simulation design:
I generated 1500 variables for set X and 1000 variables for set Y using the single
latent variable model. Thirty variables in each data set were associated. The sample size
was 150. Presented results are averaged over 50 simulations. SCCA performance in cases
of different true correlation ranging from 0.68 to 0.95 was evaluated using sensitivity and
positive predictive value (PPV) measures. The sensitivity is computed as
sensitivity = #true positive / #true

Positive predictive value is computed as

PPV = #true positive / #positive
The optimal sparseness parameter selection was based on the 10-fold cross-validation.
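The two formulas above translate directly into a small helper; this is an illustrative sketch of my own (the name `selection_metrics` and the set-based interface are not from the thesis).

```python
def selection_metrics(selected, important):
    """Sensitivity and PPV of variable selection.

    selected: indices of variables with non-zero loadings
    important: indices of the truly associated variables
    """
    selected, important = set(selected), set(important)
    tp = len(selected & important)
    sensitivity = tp / len(important)              # #true positive / #true
    ppv = tp / len(selected) if selected else 0.0  # #true positive / #positive
    return sensitivity, ppv
```

For example, selecting {0, 1, 2, 5} when the true set is {0, 1, 2, 3} yields sensitivity 3/4 and PPV 3/4.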
True cor Sensitivity X PPV X Sensitivity Y PPV Y test sample cor
0.68 0.49 0.29 0.53 0.21 0.56
0.80 0.66 0.59 0.71 0.53 0.82
0.95 0.79 0.91 0.79 0.83 0.96
Table 4.1: The effect of true correlation between the associated subsets of variables in
X and Y on the sensitivity, positive predictive value and test sample correlation. The
results are averaged over 50 simulations.
Simulation results and conclusions:
Table 4.1 demonstrates sensitivity and PPV for both X and Y data sets as well as test
sample correlations computed as the maximum test sample correlation corresponding to
the optimal combination of the sparseness parameters obtained from CV.
Simulation results show that as correlation increases and approaches 1, the sensitivity
and PPV increase. For low true correlation these values are lower because it is difficult to
differentiate between unrelated noise variables and associated variables of interest. Thus,
more noise variables and fewer "important" variables may be selected by SCCA.
Chapter 5
Extensions of SCCA
In this chapter I describe the evaluation of another aspect of SCCA performance: prediction
and correct model identification (section 5.1). These results are presented separately
from the evaluation of the effect of sample size and the effect of the true correlation
between the associated sets of variables on generalizability described in chapter 4 for two
reasons. First, there is a new question of interest: how well does the model selected by
SCCA recover the true underlying model in the data? The second reason is that this
evaluation study inspired the development of two extensions of SCCA, also presented in this
chapter (Adaptive SCCA is described in section 5.2 and Modified Adaptive SCCA is
described in section 5.3).
5.1 Oracle properties, prediction versus variable selection
Selecting the subset of variables for best prediction is not the same as selecting the
subset of variables for best recovery of the true model, i.e. the true subset of variables.
In the case of SCCA the concept of prediction is similar to the concept of generalizability
and is measured by an independent test sample correlation between the obtained linear
combinations of variables of different types. As outlined in chapter 2, during variable
selection lasso tends to shrink values of large coefficients towards 0 while setting param-
eters with small values to exactly 0. Therefore, even if the right subset of variables is
selected, the solution may still be biased. Likewise, it may not give the best prediction
results. When the lasso solution is obtained based on prediction, often noise variables are
included as well to improve prediction [Zou, 2006]. There is a trade-off between optimal
prediction and consistent variable selection in the lasso solution [Zou, 2006, Meinshausen
and Buhlmann, 2004].
We use soft-thresholding to perform variable selection which shares properties with
the elastic net solution when the ridge penalty is set to infinity in the sparse principal
components analysis approach of Zou et al. [Zou et al., 2004]. The benefit of the elastic
net approach over lasso is that if one of the variables from a group of correlated variables
is selected to be included in the model (has non-zero loading in canonical/principal vector
or non-zero coefficient in regression), then all variables from that group will be included
in the elastic net solution [Zou and Hastie, 2005]. On the other hand lasso picks only
one of the variables from the correlated group to be included in the model. In the case
of SCCA we are interested in establishing relationships between two subsets of variables
and would like all associated variables from two sources of data to be present in identified
subsets. Therefore, the elastic net approach is preferred for SCCA. However, the elastic
net may be subject to the same difficulty as lasso in terms of optimal prediction versus
model selection trade-off. In the next section I present the results of simulation studies
used to evaluate the oracle properties of SCCA.
Definition of oracle properties [Zou, 2006]:
Consider some model fitting procedure δ. Let β(δ) be the set of model coefficients estimated
by δ and let β* be the set of true model coefficients. Also let A = {j : β*_j ≠ 0}. Then
the procedure δ has oracle properties if asymptotically β(δ) has the following properties:
• Correct subset identification: {j : β_j(δ) ≠ 0} = A
• Optimal estimation rate: √n (β(δ)_A − β*_A) →_d N(0, Σ*), where Σ* is the covariance
matrix under the true subset model.
Evaluation of oracle properties of SCCA
To evaluate the oracle properties of SCCA algorithm, we first consider correct model
selection (i.e. subset identification). In the SCCA algorithm sparseness parameters for
left and right singular vectors are chosen based on the prediction measure as evaluated by
the test sample correlation in CV steps as described in the sparseness parameter selection
section. To be more specific, sparseness parameters and, therefore, the variables in sets X
and Y are chosen so that the correlation between their linear combinations when applied
to the independent test data set is maximized as follows. Let Xtest and Ytest be the
independent standardized test data sets and u(λ) and v(λ) be left and right singular
vectors obtained by applying SCCA to standardized training data sets X and Y using
a specific combination of sparseness parameters (λu, λv). Then the optimal combination
(λu^optimal, λv^optimal) is

(λu^optimal, λv^optimal) = argmax_{λu, λv} Cor(X_test u(λ), Y_test v(λ))
Once the optimal combination of the sparseness parameters is identified, subsets of vari-
ables in X and Y can be selected by applying SCCA to the whole available data and
obtaining the loadings for singular vectors u and v. We would expect that as sample size
increases SCCA should identify the correct subsets of variables with greater accuracy.
That is, fewer noise variables should be included and a greater number of important
variables should have non-zero loadings in the corresponding singular vectors. To test this I
performed simulations using a single latent variable model as described in section 3.4.
Simulation design:
I used 150 variables in set X, 100 variables in set Y, with 20 variables in each set being
associated with each other (or "important") and the rest were noise variables. The
standard deviation used for simulation of the latent variable µ was 0.5 and the standard
deviation for the noise variables was 0.1 which resulted in the true correlation between
the linear combinations of important variables averaged across simulations equal to ap-
proximately 0.5. Sample sizes range between 20 and 600. Both λu and λv were allowed
to take values in the set (0.04, 0.06, 0.08, 0.10, 0.12, 0.14, 0.16, 0.18, 0.20).
The accuracy of the model selection is evaluated by the discordance measure which
reflects the number of incorrectly identified variables. The discordance can be separated
into two components: the number of false positives, i.e. the number of noise variables with
non-zero loadings in a singular vector, and the number of false negatives, i.e. the number
of important variables that have zero singular vector loadings and are not selected. Thus,
the discordance for sets X and Y used in subsequent simulations can be computed as
follows:

discordance(X) = #false positives(X) + #false negatives(X)
discordance(Y) = #false positives(Y) + #false negatives(Y)
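The discordance of a fitted singular vector against the known truth can be computed as below; this is an illustrative helper of my own (the name `discordance` and the loadings-vector interface are assumptions, since in the simulations the important indices are known by construction).

```python
import numpy as np

def discordance(loadings, important_idx):
    # false positives: noise variables with non-zero loadings
    # false negatives: important variables whose loadings are zero
    selected = set(np.flatnonzero(loadings).tolist())
    important = set(important_idx)
    return len(selected - important) + len(important - selected)
```

For a loadings vector [0.5, 0, 0.1, 0, 0] with true important set {0, 1}, variable 2 is a false positive and variable 1 a false negative, giving discordance 2.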
Using outer CV here to evaluate the performance of SCCA is not necessary because
we are using simulated data; therefore, it is known which variables are important
and which are noise. That information is used to compute the discordance measure.
Thus, only one level of cross-validation is used in the algorithm to select the optimal
combination of the sparseness parameters for the left and right singular vectors. This
CV is performed as part of the SCCA algorithm and not as an evaluation tool.
Simulation results and conclusions:
Figures 5.1 and 5.2 demonstrate the results of these simulations for sets X and Y or
equivalently for the left and right singular vectors respectively. The graphs show the
discordance measure for the different sample sizes as well as its components: the number
of false positives and the number of false negatives. The range of average number of
false negatives for set X is 3.42 to 10.84 and for set Y it is 3.14 to 9.22. Both graphs
demonstrate that as the sample size increases the true model identification accuracy
[Figure 5.1: "Discordance measures versus sample sizes for set X"; x-axis: sample size (0–600); y-axis: number of variables (0–70); curves: discordance, false negatives, false positives.]
Figure 5.1: Model selection measures versus sample size for data set X, average true
correlation between two linear combinations of important variables is 0.5. The results
are averaged over 50 simulations for each sample size. Solid curve - average discordance,
dashed curve - average number of false negatives, dotted curve - average number of false
positives.
increases, i.e. less noise is included in the model while more important variables are
selected. Also of interest here is the fact that the discordance between the true model
used for data simulation and subsets of variables identified by SCCA is mostly due to
the inclusion of the noise variables in obtained linear combinations. This is evident from
the fact that the curves for false positives are much closer to the discordance curves than
the curves for false negatives while, as was stated above, the discordance measure is the
sum of the number of false positives and false negatives. There are fewer than 4 false
negatives for data X for sample sizes larger than 490, while for data Y this is true for
Figure 5.2: Model selection measures versus sample size for data set Y, average true
correlation between two linear combinations of important variables is 0.5. The results
are averaged over 50 simulations for each sample size. Solid curve - average discordance,
dashed curve - average number of false negatives, dotted curve - average number of false
positives.
sample sizes as low as 160.
I also performed a similar data simulation for a higher true correlation between linear
combinations of important variables.
Simulation design:
The standard deviation used for simulation of the latent variable µ was 2 and the standard
deviation for the noise variables was 0.1 which resulted in the true correlation between
the linear combinations of important variables averaged across simulations equal to ap-
proximately 0.95.
Simulation results and conclusions:
In this situation we would expect to observe lower discordance values since it should be
easier to separate important variables from the noise. Figures 5.3 and 5.4 demonstrate
that this is indeed the case. The discordance measures are much lower than previously
observed for the same sample sizes. For example, when the data set has 50 samples, the
average discordance for set X is 60.22 in the low correlation case, while it is only 14.32 in the
high correlation case. Also the range of the false negatives is much narrower when the
true correlation is higher: (2.46, 3.42) for set X, and even for small sample sizes almost
all important variables are included in the linear combinations.
Again, the main source of the error measured by discordance is the false positives,
or noise variables included in the linear combinations. However, when the true correlation
is high, then given a sufficient sample size almost all noise can be eliminated from the
sets of important variables selected by SCCA. This is demonstrated by the number of
false positives for the left singular vector being equal to zero for sample size 350 and
higher sample sizes. These figures show increased accuracy in the model identification
with the increasing sample size since the discordance for both data sets X and Y is
decreasing. Also comparison of the figures for different values of the true correlation
between linear combination of important variables demonstrates that the accuracy of the
model identification is higher when the true correlation is higher. This is explained by
the fact that in this case it is easier to distinguish noise variables from the important
variables.
Correct model identification versus prediction for SCCA
Although decreasing discordance is observed for increasing sample sizes, it is fairly high
for small samples, especially when the true correlation between the linear combinations
of important variables is low. For large samples we still observe non-zero discordance.
These results demonstrate that maximization of the prediction power of the model does
Figure 5.3: Model selection measures versus sample size for data set X, average true
correlation between two linear combinations of important variables is 0.95. The results
are averaged over 50 simulations for each sample size. Solid curve - average discordance,
dashed curve - average number of false negatives, dotted curve - average number of false
positives.
not guarantee correct model selection. Even for large sample sizes a substantial number
of noise variables is included in the chosen subsets of variables. Also there are still
important variables being missed even for large samples. That indicates inconsistency
in variable selection: lim_n P(A⋆ = A) ≠ 1. I used another form of evaluation to further
investigate this issue. I performed data simulation to compare performance of SCCA for
two approaches of sparseness parameters selection:
1. maximization of the test sample correlation
Figure 5.4: Model selection measures versus sample size for data set Y, average true
correlation between two linear combinations of important variables is 0.95. The results
are averaged over 50 simulations for each sample size. Solid curve - average discordance,
dashed curve - average number of false negatives, dotted curve - average number of false
positives.
2. minimization of the average discordance measure for sets X and Y, i.e.

discordance = (1/2) (discordance(X) + discordance(Y))        (5.1)
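The measure in (5.1) can be written down directly as code; this is a minimal sketch, assuming each set of selected and important variables is represented by a set of column indices (the names are illustrative).

```python
def discordance(selected, important):
    """Discordance for one set: false positives plus false negatives."""
    false_positives = len(selected - important)  # noise variables kept
    false_negatives = len(important - selected)  # important variables missed
    return false_positives + false_negatives

def average_discordance(sel_x, imp_x, sel_y, imp_y):
    """Average discordance over sets X and Y, as in (5.1)."""
    return 0.5 * (discordance(sel_x, imp_x) + discordance(sel_y, imp_y))
```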
The implementation of the second approach is only possible in a simulation study where
we have the information about the true underlying model used to generate the data since
we need to know the number of false positives and false negatives.
As in the previous section simulations are based on a single latent variable model.
Simulation design:
Set X contains 150 variables and 100 variables are generated for set Y, with 20 variables
in each set being associated with each other (i.e. "important") and the rest being noise
variables. The standard deviation used for simulation of the latent variable µ was 2 and
the standard deviation for the noise variables was 0.1, which resulted in a true correlation
between the linear combinations of important variables of approximately 0.95.
Both λu and λv were allowed to take values between 0.0001 and 0.3. For both approaches
of sparseness parameters selection linear combinations of variables were identified apply-
ing SCCA to 45 observations. Subsequently test sample correlation was computed using
these results for independent test data consisting of 5 observations generated from the
same distribution. The average discordance measure for sets X and Y was computed
based on obtained linear combinations of variables and the knowledge of which variables
were simulated as "important". Test sample correlation based on the full SVD solution
was also computed for comparison purpose. The full SVD solution that includes all
available variables in the linear combinations was obtained using the same 45 observa-
tions as for SCCA and then correlations between linear combinations of variables in the
test sample (5 observations) was computed. I simulated a single sample to illustrate the
relationship between the sparseness parameters and test criterion.
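The simulation design above can be sketched as follows. The exact loadings are given earlier in the thesis and are not restated here, so placing unit loadings of the important variables on the shared latent variable µ is an assumption of this sketch.

```python
import numpy as np

def simulate(n, p=150, q=100, n_important=20, sd_mu=2.0, sd_noise=0.1,
             seed=None):
    """Generate (X, Y) from a single latent variable model (sketch)."""
    rng = np.random.default_rng(seed)
    mu = rng.normal(0.0, sd_mu, size=(n, 1))      # shared latent variable
    X = rng.normal(0.0, sd_noise, size=(n, p))    # independent noise
    Y = rng.normal(0.0, sd_noise, size=(n, q))
    X[:, :n_important] += mu                      # important variables in X
    Y[:, :n_important] += mu                      # important variables in Y
    return X, Y
```

With sd(µ) = 2 and noise sd 0.1, the important variables are nearly perfectly correlated across the two sets, matching the high-correlation setting of roughly 0.95.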
Simulation results and conclusions:
Figure 5.5 shows test sample correlations for different combinations of sparseness param-
eters for left and right singular vectors as a 3D surface. The dotted plane on the graph
corresponds to the test sample correlation for the full SVD solution which is 0.9289835.
It is the same for all combinations of sparseness parameters since it does not depend on
them and all available variables are included in the linear combinations. In this simula-
tion test sample correlation obtained using SCCA exceeds test sample correlation based
on the full SVD for all combinations of sparseness parameters (0.92906 for SCCA). The
optimal combination of the sparseness parameters corresponds to the maximum of the
test sample correlation, which is 0.9783144 and is attained at (λu = 0.0701, λv = 0.2901).
Notice that when sparseness parameters are set to the lowest values of 0.0001 all vari-
ables are included in the SCCA solution. Hence test sample correlation at the bottom
left corner is equal for SCCA and full SVD solutions. As the values of sparseness pa-
rameters are increased the SCCA solutions become more sparse, so that fewer variables
are included in the linear combinations. This leads to improved prediction power for
the independent data as measured by the increasing test sample correlation. The main
benefit of the sparse solution is that noise variables are eliminated which leads to greater
generalizability. However, as was discussed above there may still be some noise variables
present in the linear combinations of variables. This can be examined by looking at a
graph of discordance measure versus sparseness parameters.
Figure 5.6 shows average discordance measure for sets X and Y for different com-
binations of sparseness parameters for left and right singular vectors. In this case
the best solution and the corresponding optimal combination of the sparseness pa-
rameters are obtained by locating the minimum of the discordance measure which is
5. It is attained at several combinations of sparseness parameters: all pairs of λu =
(0.2301, 0.2401, 0.2501, 0.2601, 0.2701) and λv = (0.2601, 0.2701, 0.2801, 0.2901) and at
λu = 0.2301, λv = 0.2501. For these sparseness parameter combinations SCCA test sam-
ple correlation ranges between 0.9722442 and 0.9772633 which is a little lower than the
highest test sample correlation, however, still significantly higher than the SVD test
sample correlation. The graph also demonstrates that as sparseness parameters increase the
average discordance for X and Y decreases to a minimum, however it starts to increase if
sparseness parameters are increased even further. Thus, for λu = (0.2801, 0.2901), λv =
0.2901 average discordance is 5.5. As discussed in the previous section, discordance is
equal to the sum of false positives and false negatives. It also has been demonstrated that
the dominant component is the number of false positives, i.e. the number of noise vari-
ables included in the linear combinations. As we increase sparseness parameters fewer
Figure 5.5: Model identification versus prediction: test sample correlation versus sparse-
ness parameters combinations, true correlation between two linear combinations of impor-
tant variables is 0.9. 3D surface shows test sample correlations for linear combinations
obtained using SCCA. Dotted plane corresponds to test sample correlation for linear
combinations of all available variables obtained using full SVD.
variables are included in the SCCA solution and more noise variables are eliminated.
However, after reaching a certain threshold linear combinations of variables may become
too sparse with the important variables excluded as well.
The trade-off between the number of false positives and the number of false negatives
is confirmed by figures 5.7 and 5.8 showing the number of false positives and false
negatives for set X for different combinations of sparseness parameters. These graphs
demonstrate two components of the discordance measure. Graphs for set Y (not shown)
are similar. Note that there is much smaller variation in these measures with the
Figure 5.6: Model identification versus prediction: average discordance for sets X and Y
versus sparseness parameters combinations, true correlation between two linear combi-
nations of important variables is 0.9.
change in λv, the sparseness parameter associated with the right singular vector, i.e.
the linear combination of variables from set Y. However, minor variations in both false
positives and false negatives for set X are still observed since left and right singular
vectors are not estimated independently from each other. The graphs demonstrate that as
sparseness parameters increase, the number of false positives decreases while the number
of false negatives increases. In fact, for any λv and λu greater than 0.2401 as well as for
λv ≥ 0.0301 and λu = 0.2301 no noise variables are included in the linear combination
of variables from set X. However, in these cases 6 or 7 important variables are excluded
as well which is indicated by the number of false negatives. On the other hand, for low
values of sparseness parameters all important variables are included in SCCA solution
Figure 5.7: Model identification versus prediction: the number of false positives for
set X versus sparseness parameters combinations, true correlation between two linear
combinations of important variables is 0.9.
along with some noise variables. Hence, there is a trade-off between eliminating noise
variables from the linear combinations and keeping the important variables. Relative
importance of one or the other aspect of model identification may be dictated by the
external factors such as biological interpretation. For example, if the results are going
to be used to generate new hypotheses and to identify subsets of variables for further
examination, then the preference may be set by the cost of additional experiments (how
many noise variables we can afford to include) or the significance of missing biological
factors of interest (how many important variables can be excluded from the model).
Now let’s return to the initial question of interest: predictive power versus correct
model identification. At the optimal combination of sparseness parameters corresponding
Figure 5.8: Model identification versus prediction: the number of false negatives for
set X versus sparseness parameters combinations, true correlation between two linear
combinations of important variables is 0.9.
to the highest test sample correlation, average discordance for sets X and Y is 32. In
this case 60 noise variables are included in the linear combination of variables from set
X while 0 important variables are excluded, and for Y 0 noise variables are included
in the linear combination while 4 important variables are excluded. This finding is not
surprising since the sparseness parameter for Y is very high resulting in a sparse solution
while the sparseness parameter for X is much lower, so a greater amount of noise is
included in addition to all important variables. It should also be emphasized here that
the difference between the maximum test sample correlation and test sample correlations
corresponding to the lowest average discordance is only between 0.0010511 and 0.0060702
(there are several sparseness parameter combinations at which the lowest value of average
discordance is observed). Hence, it may be beneficial to focus on the correct model
identification with a small loss in the test sample correlation. A possible solution could
be selecting a combination of sparseness parameters such that the maximum number of
noise variables is eliminated while also maximizing test sample correlation.
5.2 SCCA extension I: Adaptive SCCA
It is typically not possible to select variables for the best subset recovery since we don’t
know the true subset of important variables a priori. In fact, even the number of impor-
tant variables is usually not known. Therefore, variable selection is usually done based
on a prediction criterion. In our case prediction is evaluated by the test sample
correlation in CV steps. In order to reduce the bias in the lasso solution H. Zou introduced
the adaptive lasso method [Zou, 2006], which includes additional weights in the lasso
constraint. The weights are defined as w = 1/|β̂|^γ, where β̂ is a root-n-consistent estimator
of the true value of β and γ > 0 is a pre-specified parameter. Then the adaptive lasso
estimates are given by

β*(n) = argmin_β |y − Σ_{i=1}^p x_i β_i|² + λ_n Σ_{i=1}^p w_i |β_i|        (5.2)
One suggested choice for the weights w is the inverse of the full Ordinary Least Squares
(OLS) solution 1/|βOLS|. An important remark in the paper is that these weights are
data dependent, i.e. as the sample size grows OLS estimates become increasingly precise
and the estimates for zero-coefficients should converge to 0. Therefore, the weights for
the zero-coefficients will increase to infinity, while the weights for nonzero-coefficients
will converge to some constant. Thus, parameters with large OLS values will be given a
smaller penalty weight. That should reduce the effect of shrinkage on the large values.
The analogy in our case to the OLS solution are the complete first singular vectors from
the full SVD. The connection between SVD and regression (OLS) was demonstrated by
I.J. Good [Good, 1969].
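The adaptive weights in (5.2) can be illustrated directly. The small `eps` guard for pilot estimates that are exactly zero is an implementation convenience of this sketch, not part of [Zou, 2006].

```python
import numpy as np

def adaptive_weights(beta_hat, gamma=1.0, eps=1e-12):
    """Adaptive lasso penalty weights w = 1 / |beta_hat|^gamma.

    beta_hat is a root-n-consistent pilot estimate (e.g. OLS); eps avoids
    division by zero for coefficients estimated exactly at zero.
    """
    return 1.0 / (np.abs(beta_hat) ** gamma + eps)
```

As the sample size grows, pilot estimates of zero coefficients shrink toward zero and their weights blow up, while weights of nonzero coefficients stabilize, which is exactly the behavior described above.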
This idea can be applied to modify the SCCA algorithm. In SCCA the soft-thresholding
used for variable selection can be adjusted to include additional weights for the coeffi-
cients in the singular vectors as follows:
Let u_SVD and v_SVD denote the first singular vectors obtained from a full singular value
decomposition of the matrix K. Thus, all values in these vectors are non-zero. Also assume
both u_SVD and v_SVD have been standardized to have unit length. Then the algorithm
for the adaptive SCCA is exactly the same as the algorithm for simple SCCA presented
earlier, with modifications only in the soft-thresholding steps 3c and 4c. Step 3c in the
ith iteration of the adaptive SCCA algorithm becomes

u^{i+1} ← (|u^{i+1}| − (1/2) λu / |u_SVD|^γ)_+ Sign(u^{i+1})        (5.3)
while step 4c in the ith iteration becomes
v^{i+1} ← (|v^{i+1}| − (1/2) λv / |v_SVD|^γ)_+ Sign(v^{i+1})        (5.4)
where γ > 0 is a user-specified parameter. This modification does not change the property
of the algorithm when the sparseness parameters are set to zero: in this case the algorithm
provides full SVD singular vectors.
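Steps (5.3) and (5.4) amount to soft-thresholding with a per-coefficient threshold; a sketch, assuming `u` and `u_svd` are NumPy arrays of equal length:

```python
import numpy as np

def adaptive_soft_threshold(u, u_svd, lam, gamma=1.0):
    """Soft-threshold u with per-coefficient threshold (lam/2) / |u_svd|^gamma."""
    thresh = 0.5 * lam / np.abs(u_svd) ** gamma
    return np.sign(u) * np.maximum(np.abs(u) - thresh, 0.0)
```

Coefficients that are small in the full SVD solution face a large threshold and are pushed to zero, while large coefficients are shrunk less; setting `lam = 0` leaves the vector unchanged, matching the property noted above.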
To investigate the performance of adaptive SCCA in terms of correct model identifi-
cation for different sample sizes and to compare it to the performance of regular SCCA
I carried out a simulation study based on the single latent variable model.
Simulation design:
For each simulation 150 variables were generated for the data set X and 100 variables
for the set Y. Twenty variables in each set were associated and the rest of the variables
represented independent noise. The sample size range was between 20 and 600 obser-
vations. The standard deviation used for simulation of the latent variable µ was 0.9
which resulted in approximately 0.8 correlation between the linear combinations of the
associated variables in X and Y. For each sample size, results were averaged over 50
simulations. The optimal sparseness parameter combinations for SCCA algorithm were
obtained based on 5-fold cross-validation. This CV approach as opposed to 10-fold CV
has been chosen to shorten the simulation time as well as to facilitate more realistic CV
choice for smaller sample sizes such as n = 20. I compared the performance of the adaptive
SCCA to simple SCCA based on the test sample correlation between the obtained linear
combinations of variables and also based on the discordance measure. Test samples were
generated from the same distribution as the training sample for each simulation and
contained 50 observations.
Simulation results and conclusions:
Figure 5.9 demonstrates the simulation results for test sample correlation comparing
adaptive SCCA, SCCA and SVD for the power of weights in the parameters penalty
γ = 0.5. The graph shows that both adaptive SCCA and SCCA perform substantially
better than CCA based on the full SVD solution for all sample sizes, however there is
very little difference between the adaptive SCCA and SCCA in terms of the test sample
correlation. Thus, the modification of SCCA using the square root of the first singular
vectors as additional weights for the coefficients in linear combinations of variables does
not improve the predictive power of the obtained solution. To investigate the impact of
this modification on the correct model selection we can consider the discordance measures.
Figure 5.10 demonstrates the simulation results for the average discordance measure
(5.1) for the adaptive SCCA and regular SCCA. The graph shows average discordance
measures that correspond to the linear combinations of variables obtained from the adap-
tive SCCA and SCCA. The coefficients in linear combinations were estimated using the
optimal combination of sparseness parameters selected by maximization of the test sam-
ple correlation. Thus, the same sparseness parameters were used to generate this and the
previous graph in each simulation for each sample size. This discordance estimate can be
considered test sample correlation based. Figure 5.10 also shows the discordance curves
for both adaptive and regular SCCA that were obtained using the optimal combination
of sparseness parameters based on minimization of the average discordance measure.
Figure 5.9: Compare adaptive SCCA, SCCA and SVD: test sample correlation vs sample
size, power of weights in soft-thresholding is 0.5. Solid curve - adaptive SCCA, dashed
curve - SCCA, dotted curve - SVD.
Thus, these curves demonstrate the minimum discordance that could be achieved for the
simulated data sets for the considered range of sparseness parameters. Adaptive SCCA
demonstrates better performance in terms of test sample correlation based discordance
measure for all sample sizes. It also demonstrates lower minimized discordance for sam-
ple sizes below 200. However, for higher sample sizes regular SCCA can achieve similar
or even lower minimum discordance. A solution that is based on minimizing discordance
can only be obtained in the simulation studies since for real data the information about
the important and noise variables is usually not available. Thus, in applied data analysis
the sparseness parameters would have to be chosen based on the test sample correlation.
In that case adaptive SCCA does show superior performance to SCCA for correct model
identification.
Figure 5.10: Compare adaptive SCCA and SCCA: discordance vs sample size, power of
weights in soft-thresholding is 0.5. Solid curve - test-sample-correlation-based discordance
for adaptive SCCA, dashed curve - test-sample-correlation-based discordance for SCCA,
dotted curve - minimized discordance for adaptive SCCA, dashed and dotted curve -
minimized discordance for SCCA.
The discordance measure consists of false positive and false negative components.
There is a trade-off between these statistics: when selection constraints are relaxed to
include more important variables in the model, thus lowering the number of false nega-
tives, the number of noise variables selected (i.e. the number of false positives) increases.
On the other hand, lowering the number of false positives results in fewer variables in
the model, which may increase the number of false negatives. To have a complete un-
derstanding of the effect of additional weights in soft-thresholding on the correct model
identification it is necessary to consider two components of discordance separately. Fig-
ures 5.11 and 5.12 show the number of false positives and negatives for the data set X
for adaptive SCCA and SCCA for different sample sizes. The results for data set Y are
similar and, therefore, not shown.
Figure 5.11: Compare adaptive SCCA and SCCA: number of false positives for set X
vs sample size, power of weights in soft-thresholding is 0.5. Simulated number of noise
variables in X is 130. Solid curve - test sample correlation based number of false positives
for set X for adaptive SCCA, dashed curve - test sample correlation based number of
false positives for set X for SCCA, dotted curve - number of false positives for set X
based on minimized discordance for adaptive SCCA, dashed and dotted curve - number
of false positives for set X based on minimized discordance for SCCA.
The graphs show that adaptive SCCA selects fewer false positives compared to SCCA
for all sample sizes while simple SCCA performs better in terms of the number of false
negatives. However, the difference in the number of important variables not included
in the solution is less significant than the difference in the number of noise variables
included. Also, both SCCA and adaptive SCCA perform almost as well in terms of the
number of false positives as the analysis methods based on minimizing the discordance
measure that use the knowledge about the true underlying model in the simulated data.
Thus, the adaptive SCCA solution contains fewer false positives than SCCA and a
comparable number of false negatives. Therefore, it demonstrates better model selection
properties than the original SCCA algorithm.
The adaptive SCCA method uses an additional user-controlled parameter, which is the
power of the weights γ in the soft-thresholding steps (5.3, 5.4). H. Zou [Zou, 2006] sug-
gests using an additional cross-validation to select the optimal value for this parameter.
Thus, for SCCA that means a two-dimensional cross-validation to select sparseness pa-
rameters for left and right singular vectors λu, λv (level 1 CV) and to select γ (level 2
CV). I investigated the effect of the power of the weights on the performance of adaptive
SCCA by simulating data using the same set up as described above for three values of
the power parameter: γ = 0.5, 1, 2.
Simulation results and conclusions:
Figure 5.13 demonstrates the simulation results for the test sample correlation comparing
adaptive SCCA performance for the three considered values of γ. The graph shows similar
performance of adaptive SCCA for γ = 0.5 and 1, while test sample correlation is lower
for all sample sizes for γ = 2. Using γ = 2 also results in a higher discordance measure
for adaptive SCCA, as demonstrated by Figure 5.14. The discordance for adaptive
SCCA with γ = 0.5 and 1 is similar for all sample sizes. Therefore, in further analysis
of the properties of adaptive sparse canonical correlation I use γ = 1 as the power of the
weights in the soft-thresholding step to reduce the computational complexity by avoiding
the second level of cross-validation necessary to select the optimal value of γ. This means
using the inverse of the values in the first singular vectors obtained from full SVD as the
weights in (5.3, 5.4). Thus, the soft-thresholding steps in adaptive SCCA are
u^{i+1} ← (|u^{i+1}| − (1/2) λu / |u_SVD|)_+ Sign(u^{i+1})

and

v^{i+1} ← (|v^{i+1}| − (1/2) λv / |v_SVD|)_+ Sign(v^{i+1})
5.3 SCCA extension II: Modified adaptive SCCA
In addition to introducing weights into SCCA based on the adaptive lasso approach of
H. Zou [Zou, 2006] described above we can explore further modification of the SCCA
algorithm to improve its model selection properties. A method that correctly identifies
the underlying model should include all important variables in its solution (and thus
have few false negatives) while eliminating all unnecessary variables (and thus minimizing
the number of false positives). However, in applications complete information about the
informativeness of variables is often not available. Therefore, it is not possible to evaluate
and minimize the false positive and false negative rates. One possible solution was offered
by Wu et al. [Wu et al., 2007] who developed a new approach for controlling variable
selection based on the introduction of pseudovariables into the original data set. These
pseudovariables are simulated to represent noise variables and thus allow estimation of
the number of false positives selected by the method of interest.
The authors propose four approaches to the generation of the pseudovariables. In the
first approach kp variables are generated independently from the N(0, 1) distribution. In
the second approach pseudovariables are obtained by permuting the rows of the original
data matrix. In the other two approaches variables are generated as in the first two and
then regressed on the original data; the regression residuals are taken as the pseudovariables.
The objective of adding the simulated variables to the data is to introduce some
known uninformative variables that would allow estimation of the false positive rate. At
the same time this should not change the probabilities of selecting or not selecting the
original variables. Therefore, pseudovariables should resemble true noise variables in the
data as much as possible. In that case addition of independent identically distributed
N(0, 1) variables may not be realistic since their distribution may differ substantially
from the original variables, thus they will be easy to differentiate and eliminate from the
solution. Pseudovariables generated by row permutations have the same distribution as
the original variables, but they are independent of the outcome and, therefore, resemble
the uninformative variables. Using the regression residuals reduces the influence of the
pseudovariables on the selection probabilities of the original variables. Therefore, the
authors recommend simulating the pseudovariables by the row permutation and then
using the regression residuals.
In sparse canonical correlation analysis there are two data sets, X and Y, under
consideration and we are interested in identifying groups of variables that are associated
between the data sets. Therefore, to generate the pseudovariables we can permute the
rows in each set X and Y independently. That will produce variables that are independent
between the data sets X and Y, i.e. the uninformative variables. Regression residuals
are obtained as follows. Let X be one of the data matrices and ZX be the matrix of
generated pseudovariables for X. Then the residuals after the regression on the original
data are

(I − X(X′X)⁻¹X′) ZX.

Similarly, for Y the regression-residual-based pseudovariables are

(I − Y(Y′Y)⁻¹Y′) ZY.
In the studies where sparse canonical correlation is likely to be applied, the number of
variables in data sets X and Y may exceed the number of observations. In that case,
the inverses (X′X)⁻¹ and (Y′Y)⁻¹ may not exist. One solution would be to use the
pseudo-inverses. However, the authors claim that there is a "slight" advantage to using
the regression-residual permutation method, which may be offset by the imperfection of
Moore-Penrose pseudo-inverses. To reduce the computational load we use the pseudo-
variables generated by row permutations without considering the regression residuals.
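This pseudovariable generation can be sketched as follows, assuming observations are rows; the seeding and the name `pseudovariables` are illustrative. Permuting the rows of one set independently of the other preserves each variable's marginal distribution while breaking the association between X and Y.

```python
import numpy as np

def pseudovariables(X, seed=None):
    """Half as many pseudovariables as columns of X, by row permutation."""
    rng = np.random.default_rng(seed)
    Z = X[rng.permutation(X.shape[0]), :]                  # break X-Y association
    keep = rng.choice(X.shape[1], X.shape[1] // 2, replace=False)
    return Z[:, keep]                                      # random half
```

Applying the function to X and Y separately (with independent permutations) yields the two sets ZX and ZY described below.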
The row permutation approach automatically produces as many pseudovariables as
there are original variables. Wu et al. investigated the effect of the number of pseu-
dovariables used on the performance of their variable selection method. They compared
using all generated pseudovariables, randomly selecting half of the pseudovariables and
generating twice as many pseudovariables as there are original variables. The authors
found no significant difference among the three approaches. In large studies such as mi-
croarray analysis the number of variables under investigation may be tens of thousands.
Thus, adding the same number of pseudovariables to the data may make computations
infeasible. Therefore, we investigate the effect of introducing additional noise by using
half as many pseudovariables as there are original variables. Thus, if data set X contains
p variables and set Y contains q variables, then the sets of generated pseudovariables Z_X and
Z_Y contain p/2 and q/2 variables, respectively. These variables are obtained by
randomly sampling from the p and q pseudovariables produced by row permutations.
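The pseudovariable generation described above can be sketched as follows. This is an illustrative reimplementation with assumed names, not the thesis code: permuting whole rows of one data matrix breaks its association with the other set while preserving within-set structure, and half of the permuted columns are then sampled.

```python
import numpy as np

def generate_pseudovariables(X, n_keep=None, rng=None):
    """Row-permute the data matrix, then sample n_keep pseudovariable columns."""
    rng = np.random.default_rng() if rng is None else rng
    n, p = X.shape
    Z = X[rng.permutation(n), :]          # permuting whole rows breaks X-Y links
    if n_keep is None:
        n_keep = p // 2                   # use half as many pseudovariables
    cols = rng.choice(p, size=n_keep, replace=False)
    return Z[:, cols]
```

The same function would be applied independently to X and Y, so the two permutations are independent of each other.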
Controlling variable selection is based on minimizing the false selection rate. In sparse
canonical correlation analysis the false selection rate must be computed separately for
the data sets X and Y as follows:
γ_X(X,Y) = U_X(X,Y) / (1 + S_X(X,Y)) = U_X(X,Y) / (1 + I_X(X,Y) + U_X(X,Y))   for set X   (5.5)

γ_Y(X,Y) = U_Y(X,Y) / (1 + S_Y(X,Y)) = U_Y(X,Y) / (1 + I_Y(X,Y) + U_Y(X,Y))   for set Y   (5.6)
where UX(X,Y ) and UY (X,Y ) are the numbers of uninformative variables selected as
showing significant association (i.e. included in the model) from the sets X and Y,
respectively. IX(X,Y ) and IY (X,Y ) are the number of informative variables from X
and Y included in the model. Finally, SX(X,Y ) and SY (X,Y ) are the total number
of variables from sets X and Y included in the model. Thus, SX(X,Y ) = IX(X,Y ) +
UX(X,Y ) for set X with an analogous expression for data Y. Thus, the false selection rate
92
is approximately equal to the ratio of the number of false positives over the number of
positives and is similar to the false discovery rate. However, the estimation procedure is
different. Therefore, I follow the terminology of Wu et al. and refer to the false selection
rate.
The general iterative algorithm for estimating the false selection rate is:

1. Set b = 1.

2. Generate sets of p/2 and q/2 pseudovariables, Z_X and Z_Y, by independent random row permutations as described above.

3. Estimate γ_{X,b}(X,Y) and γ_{Y,b}(X,Y).

4. Set b = b + 1; repeat steps 2 and 3 until b = B.

5. Compute the estimated false selection rates averaged over the iterations:

γ_X(X,Y) = (1/B) Σ_{b=1}^{B} γ_{X,b}(X,Y)   for set X

γ_Y(X,Y) = (1/B) Σ_{b=1}^{B} γ_{Y,b}(X,Y)   for set Y

6. Compute the common false selection rate for X and Y: γ(X,Y) = (1/2)(γ_X(X,Y) + γ_Y(X,Y)).
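The iterative loop above can be sketched in a few lines. This is a minimal illustration, not the thesis implementation: `fit_and_select` is a placeholder that stands in for SCCA at fixed sparseness parameters and must return the index sets of columns selected from each augmented data set.

```python
import numpy as np

def estimate_fsr(X, Y, fit_and_select, B=20, rng=None):
    """Average the per-iteration FSR estimates U/(1+S) over B permutations."""
    rng = np.random.default_rng() if rng is None else rng
    n, p = X.shape
    q = Y.shape[1]
    gX = gY = 0.0
    for _ in range(B):
        # pseudovariables: row-permute, then keep half the columns
        ZX = X[rng.permutation(n), :][:, rng.choice(p, p // 2, replace=False)]
        ZY = Y[rng.permutation(n), :][:, rng.choice(q, q // 2, replace=False)]
        selX, selY = fit_and_select(np.hstack([X, ZX]), np.hstack([Y, ZY]))
        uX = sum(j >= p for j in selX)    # selected pseudovariables: false positives
        uY = sum(j >= q for j in selY)
        gX += uX / (1 + len(selX))
        gY += uY / (1 + len(selY))
    return 0.5 * (gX / B + gY / B)        # common FSR for X and Y
```

Columns with index at or beyond the original dimension are pseudovariables, so counting them gives the number of false positives per iteration.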
Sparse canonical correlation or its modified version, adaptive SCCA, is then applied to the
data and γ(X,Y ) is estimated for different values of sparseness parameters. The optimal
combination of the sparseness parameters for left and right singular vectors corresponds
to the lowest estimated false selection rate. A final model (i.e. groups of variables
associated between the sets X and Y) is chosen by applying SCCA / adaptive SCCA
using the optimal combination of the sparseness parameters.
Wu et al. propose two methods for the estimation of the false selection rate. The
first one is based on estimating the expected ratio of the number of false positives to the
number of positives, defined by the function

γ_ER(α) = E{ U(α) / (1 + S(α)) }   (5.7)
where α represents parameters used by the algorithm for model fitting. These parameters
are specified by the user, determine the level of sparseness of the model and can be
optimized to obtain the best fit. In the case of SCCA α represents sparseness parameters
for left and right singular vectors, λu and λv. S(α) is the number of variables selected by
the method used for the specific values of parameters in α, which as above includes both
informative and uninformative variables chosen. U(α) is the number of uninformative
variables included in the model, i.e. the number of false positives. If I(α) is the number
of informative variables selected, then S(α) = U(α) + I(α).
The second false selection rate estimate is based on estimating the ratio of expected
number of false positives to the expected number of positives and is defined by the
function

γ_RE(α) = E{U(α)} / E{1 + S(α)}   (5.8)
The first method depends only on the assumption of equal probabilities for selection of
real uninformative variables and generated pseudovariables. The second method depends
both on the same assumption and on the assumption that the original important variables
have the same probability of being selected regardless of whether pseudovariables have
been added to the data or not. To investigate the effect of introducing additional noise
on the model selection I use the first method, based on the expected ratio of the
number of false positives to the number of positives, as it depends on fewer assumptions.
Sparse canonical correlation analysis deals with two data sets of variables simultaneously,
and a separate false selection rate has to be computed for each data set. Thus, if there
are two data sets X and Y containing p and q variables, respectively, then γERX(α) and
γERY(α) should be estimated. Using the approach offered by Wu et al., the estimates are
γ_ER_X(α) = [k_U_X(α) · U*_p_X(α)/k_p_X] / (1 + S_X(α))   (5.9)

for set X and

γ_ER_Y(α) = [k_U_Y(α) · U*_p_Y(α)/k_p_Y] / (1 + S_Y(α))   (5.10)

for data set Y. Here k_p_X and k_p_Y are the numbers of pseudovariables added to the original
data sets X and Y, respectively. SX(α) and SY (α) are the numbers of variables from
sets X and Y included in the model when the parameter values are equal to α, i.e. these
are the numbers of variables chosen as showing significant inter-relation between sets X
and Y. k_U_X(α) is the estimated number of original uninformative variables in data set X,
computed as

k_U_X(α) = p − S_X(α)   (5.11)

Similarly, for set Y,

k_U_Y(α) = q − S_Y(α)   (5.12)
U*_p_X(α) is the estimated number of pseudovariables generated for data X included in the
model for specific values of the selection parameters α, obtained as

U*_p_X(α) = (1/B) Σ_{b=1}^{B} U*_{p_X,b}(α)   (5.13)

where B is the number of iterations in the general algorithm for estimating the false selection
rate, i.e. the number of times the sets of pseudovariables Z_X and Z_Y are generated by
permuting the rows of the original data matrices X and Y. U*_{p_X,b}(α) is the number of
pseudovariables for set X included in the model at iteration b.
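The estimate in (5.9), assembled from (5.11) and (5.13), amounts to a few arithmetic steps; the helper below is a sketch with variable names of my own choosing.

```python
def gamma_er(p, S_X, U_pseudo_per_iter, k_pX):
    """Estimate gamma_ER_X(alpha) for data set X.

    p                 -- number of original variables in X
    S_X               -- number of original X variables selected (positives)
    U_pseudo_per_iter -- selected-pseudovariable counts over the B iterations
    k_pX              -- number of pseudovariables added to X
    """
    U_bar = sum(U_pseudo_per_iter) / len(U_pseudo_per_iter)  # eq. (5.13)
    k_UX = p - S_X                                           # eq. (5.11)
    return k_UX * (U_bar / k_pX) / (1 + S_X)                 # eq. (5.9)
```

The analogous call with q, S_Y, and k_p_Y gives the estimate for data set Y.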
To investigate the effect of introducing additional noise variables on correct model
identification I carried out simulations for different sample sizes. I compared the perfor-
mance of 4 analysis methods:
• SCCA with added noise.
• Adaptive SCCA with added noise.
• SCCA without added noise (i.e. the original algorithm).
• Adaptive SCCA without added noise.
In the first two approaches sparseness parameters λu and λv are chosen by minimizing the
false selection rate. In the last two approaches the optimal combination of parameters is
chosen by maximizing the estimated test sample correlation using 5-fold cross-validation.
Simulation design:
Simulations were based on a single latent variable model. For each simulation 150 vari-
ables were generated for the data set X and 100 variables for the set Y. Twenty variables
in each set were associated and the rest of the variables represented independent noise.
The sample size range was between 20 and 400 observations. The standard deviation
used for simulation of the latent variable µ was 0.9 which resulted in approximately 0.8
correlation between the linear combinations of the associated variables in X and Y. For
each sample size, results were averaged over 50 simulations.
I compared performance of the four analysis methods listed above based on the test
sample correlation between the obtained linear combinations of variables and also based
on the true false selection rate as well as true false non-selection rate. To compute test
sample correlation, an independent test sample was generated from the same distribution
as the training sample for each simulation and contained 50 observations. True false
selection (FSR) and non-selection (FNR) rates are calculated based on the knowledge
about the underlying model used in simulations as follows:
FSR_true(X) = (number of false positives) / (1 + number of positives)

FNR_true(X) = (number of false negatives) / (1 + number of negatives)

Similar expressions are used for the Y data set. The common FSR and FNR values for the
X and Y data are

FSR_true = (1/2)(FSR_true(X) + FSR_true(Y))

FNR_true = (1/2)(FNR_true(X) + FNR_true(Y))
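In a simulation study, where the informative variables are known, the true FSR and FNR defined above can be computed directly; a minimal sketch:

```python
def true_fsr_fnr(selected, informative, p):
    """True FSR and FNR for one data set with p variables in total."""
    sel, inf = set(selected), set(informative)
    fp = len(sel - inf)                # noise variables that were selected
    fn = len(inf - sel)                # informative variables that were missed
    negatives = p - len(sel)           # variables not selected
    fsr = fp / (1 + len(sel))
    fnr = fn / (1 + negatives)
    return fsr, fnr
```

Averaging the values returned for the X and Y sets gives the common FSR_true and FNR_true.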
Simulation results and conclusions:
Figure 5.15 demonstrates the simulation results for test sample correlation comparing
adaptive SCCA and SCCA with and without incorporating additional noise variables.
The power for the weights in adaptive SCCA is γ = 1. The graph demonstrates slightly
lower test sample correlation obtained by adaptive SCCA and SCCA when additional
noise variables are incorporated into the data sets (modified adaptive SCCA, modified
SCCA). This may be explained by the sparser solutions produced by the modified methods,
owing to the lower number of uninformative variables included in the linear combinations
of variables that are associated between sets X and Y. However, the test sample correlations
are very similar for all four methods, especially for larger sample sizes. To further investigate
the solutions it is necessary to consider the false selection and non-selection rates.
Figure 5.16 demonstrates true false selection rate for adaptive SCCA and SCCA
with and without the modification while Figure 5.17 shows true non-selection rates.
Surprisingly, analysis methods that incorporate additional noise pseudovariables into the
data and are based on minimizing the false selection rate have higher true FSR and lower
true FNR for all sample sizes compared to the methods that only use the original data
and are based on maximizing test sample correlation. Both adaptive and simple SCCA
with the added-noise modification show a true FSR higher than 0.4 for almost all sample
sizes, and it decreases only slowly with increasing sample size. This means that too many
uninformative variables are selected for the solution by the modified methods. That also
explains lower true FNR for the modified approaches compared to regular adaptive SCCA
and SCCA. Since a large number of variables are included in the linear combinations,
more important variables are included as well. FSR and FNR are based on the numbers
of false positives and false negatives selected by a method. Therefore, they are prone to
a similar trade-off phenomenon.
What is the explanation for the higher false selection rates observed for methods that
should be minimizing FSR? A possible reason is that the modified approaches underestimate
the false selection rates and, therefore, result in solutions that include too many noise
variables. Let us reconsider the estimate of FSR offered by Wu et al. For data set X the
estimated FSR in (5.9) is
γ_ER_X(α) = [k_U_X(α) · U*_p_X(α)/k_p_X] / (1 + S_X(α))
The numerator in this expression estimates the number of false positives as the estimated
proportion of the estimated number of uninformative variables included in the solution.
The denominator, SX(α), is the number of variables selected from the original set X
(number of positives) when no additional noise variables are incorporated in the data.
This value only depends on the original data, parameters α and the analysis method
used (adaptive SCCA or SCCA). S_X(α) provides a direct measure of the number of
positives, or the number of variables selected for the solution, required for the estimation
of FSR and, therefore, should not be a source of underestimation of FSR. U*_p_X(α)/k_p_X also
does not raise concern, since it estimates the proportion of false positives based on the
known number of pseudovariables incorporated in the data and on the knowledge of which
variables in the extended data set are pseudovariables. k_U_X(α) = p − S_X(α) is the
estimated number of uninformative variables in the original data set X. This information
is not available in applications and, therefore, has to be estimated. The value is obtained
by subtracting the number of positives (original variables included in the solution) from
the total number of original variables. This estimate is based on the assumption that,
at the optimal level of parameters α, the analysis method includes only the important variables
in the solution, and that it includes all of them. However, in large studies
with small sample sizes, such as microarray studies, this assumption is unrealistic. Not
all important variables may be included in the solution, while noise variables may be
selected as well. If many noise variables are included in the model, k_U_X(α)
would underestimate the number of uninformative variables, thus underestimating FSR.
To further investigate the effect of the estimated number of uninformative variables on
the estimate of the false selection rate, I carried out simulations using the true numbers of
uninformative variables in sets X (130 noise variables) and Y (80 noise variables) in the
expressions for γ_ER_X(α) and γ_ER_Y(α) in (5.9, 5.10). I used the same simulation set-up
as in the previous simulations.
Simulation results and conclusions:
Figures 5.18 and 5.19 demonstrate true false selection and non-selection rates comparing
6 analysis methods:
1. SCCA with added noise using the true numbers of uninformative variables to estimate and minimize FSR.

2. Adaptive SCCA with added noise using the true numbers of uninformative variables to estimate and minimize FSR.
3. SCCA with added noise using estimated FSR as in Wu et al. [Wu et al., 2007].
4. Adaptive SCCA with added noise using estimated FSR as in Wu et al. [Wu et al.,
2007].
5. SCCA without added noise (i.e. the original algorithm).
6. Adaptive SCCA without added noise.
Figure 5.18 shows that when the estimated number of uninformative variables is used to
calculate the false selection rate as in Wu et al. (methods 3 and 4), the modified analysis
methods perform worse than the modified methods that use the true numbers of
uninformative variables in the estimation of FSR (methods 1 and 2). The graph demonstrates
lower and more rapidly decreasing true false selection rate with increasing sample size
for the methods 1 and 2 in the list above. This supports the conclusion that the ap-
proach of Wu et al. underestimates the false selection rates. Although modified adaptive
SCCA and modified SCCA that use true numbers of uninformative variables to estimate
FSR perform better than methods 3 and 4, they do not show significant advantage over
the adaptive SCCA that does not incorporate any pseudovariables (method 6). In fact,
method 2 shows a higher true FSR, while method 1 has comparable performance. Given
the significantly higher computational complexity of the modified analysis methods, adaptive
SCCA is the best choice for the analysis of large data sets based on the true false selection
rate.
Figure 5.19 shows that the modified methods that use the Wu et al. estimate of FSR (3
and 4) have a lower true false non-selection rate for all sample sizes compared to the other
analysis methods. This effect can be explained by the trade-off between the numbers of
false positives and false negatives. Methods 3 and 4 include more variables in the linear
combinations of associated variables between sets X and Y. Thus, these solutions also
include a greater number of important variables compared to the sparser linear combinations
produced by the other methods. However, the difference in true FNR between the six
considered approaches is not as substantial as the difference in true FSR: for most sample
sizes the true FNR values are well below 0.1 for all methods, which may be a satisfactory
FNR in many applications. On the other hand, the true false selection rates for methods
3 and 4 exceed 0.4 for the considered sample sizes, which indicates the presence of a large
percentage of noise variables (at least 40%) in the solution. Thus, the superior performance of
the third and fourth methods in terms of FNR is offset by their high true false selection
rates.
In summary, the modification of SCCA that introduces additional noise variables into the
original data sets in order to estimate and minimize the number of uninformative variables
included in the linear combinations does not offer an advantage in minimizing
the number of false positives, and thus in model identification. Moreover, it is more
computationally intensive and may be infeasible in large studies where the number of
variables of each type may be tens of thousands. Both modified SCCA and modified
adaptive SCCA underestimate the false selection rate, which results in a high number
of noise variables included in the linear combinations of associated variables. This is due
to underestimation of the number of uninformative variables in the original sets of measurements,
which is used to estimate the false selection rate. Further development of this
approach is necessary to obtain better estimates. The SCCA and adaptive SCCA methods are
preferred to the modified versions. The preference between SCCA and adaptive SCCA
may be set by an investigator based on the intended use of the results, available
resources, and prior biological knowledge.
The conclusion about the oracle properties of the developed methods is that adaptive
SCCA does not have the oracle property, since the number of false negatives for adaptive
SCCA is not reduced to 0 for large sample sizes. However, the number of false positives
is reduced compared to simple SCCA. The primary focus of SCCA is correct
model identification; hence consistency of the estimates is of lesser importance and is not
considered.
Figure 5.12: Compare adaptive SCCA and SCCA: number of false negatives for set X vs
sample size, power of weights in soft-thresholding is 0.5. Simulated number of important
variables in X is 20. Solid curve - test sample correlation based number of false negatives
for set X for adaptive SCCA, dashed curve - test sample correlation based number of
false negatives for set X for SCCA, dotted curve - number of false negatives for set X
based on minimized discordance for adaptive SCCA, dashed and dotted curve - number
of false negatives for set X based on minimized discordance for SCCA.
Figure 5.13: Adaptive SCCA performance for different powers of weights in soft-
thresholding: test sample correlation vs sample size. Solid curve - power is 0.5, dashed
curve - power is 1, dotted curve - power is 2.
Figure 5.14: Adaptive SCCA performance for different powers of weights in soft-
thresholding: discordance vs sample size. Solid curve - power is 0.5, dashed curve -
power is 1, dotted curve - power is 2.
Figure 5.15: Test sample correlation for different sample sizes for adaptive SCCA and
SCCA with and without incorporating additional noise pseudovariables. True correlation
between two linear combinations of important variables is 0.8. Weights power for adaptive
SCCA γ = 1.
Figure 5.16: True false selection rate for different sample sizes for adaptive SCCA and
SCCA with and without incorporating additional noise pseudovariables. True correlation
between two linear combinations of important variables is 0.8. Weights power for adaptive
SCCA γ = 1.
Figure 5.17: True false non-selection rate for different sample sizes for adaptive SCCA and
SCCA with and without incorporating additional noise pseudovariables. True correlation
between two linear combinations of important variables is 0.8. Weights power for adaptive
SCCA γ = 1.
Figure 5.18: True false selection rate for different sample sizes for adaptive SCCA and
SCCA with and without incorporating additional noise pseudovariables. True correlation
between two linear combinations of important variables is 0.8. Weights power for adaptive
SCCA γ = 1. Black curves: solid - adaptive SCCA with added noise, modified FSR est.,
dashed - SCCA with added noise, modified FSR est., dotted - adaptive SCCA without
added noise, dashed and dotted - SCCA without added noise. Grey curves: solid -
adaptive SCCA with added noise, FSR est. as in Wu et al., dashed - SCCA with added
noise, FSR est. as in Wu et al.
Figure 5.19: True false non-selection rate for different sample sizes for adaptive SCCA and
SCCA with and without incorporating additional noise pseudovariables. True correlation
between two linear combinations of important variables is 0.8. Weights power for adaptive
SCCA γ = 1. Black curves: solid - adaptive SCCA with added noise, modified FSR est.,
dashed - SCCA with added noise, modified FSR est., dotted - adaptive SCCA without
added noise, dashed and dotted - SCCA without added noise. Grey curves: solid -
adaptive SCCA with added noise, FSR est. as in Wu et al., dashed - SCCA with added
noise, FSR est. as in Wu et al.
Chapter 6
Application
6.1 Background
Several studies have demonstrated that there is variation in baseline gene expression lev-
els in humans that has a genetic component [Cheung et al., 2005, Morley et al., 2004].
Genome-wide analyses mapping genetic determinants of gene expression have been carried
out for the expression of one gene at a time, an approach that may be prone to a high false
discovery rate and is computationally intensive, since the number of genes under consideration
often exceeds tens of thousands. We present an exploratory multivariate method for initial
investigation of such data and apply it to the data provided as problem 1 for the fifteenth
Genetic Analysis Workshop (GAW15). The linkages between the set of all SNP loci and
the set of all gene expression phenotypes can be characterized by a type of correlation
matrix based on the linkage analysis methodologies introduced by Tritchler et al. [Tritch-
ler et al., 2003] and Commenges [Commenges, 1994]. In multivariate analysis a common
way to inspect the relationship between two sets of variables based on their correlation is
canonical correlation analysis, which determines linear combinations of variables for each
data set such that the two linear combinations have maximum correlation. However, due
to the large number of genes, linear combinations involving all the genotypes or gene
expression phenotypes lack biological plausibility and interpretability and may not be
generalizable. We have developed a new method, Sparse Canonical Correlation Analysis
(SCCA), which examines the relationships between many genetic loci and gene expression
phenotypes simultaneously and establishes the association between them. SCCA
provides sparse linear combinations. That is, only small subsets of the loci and the gene
expression phenotypes have non-zero loadings so the solution provides correlated sets of
variables that are sufficiently small for biological interpretability and further investiga-
tion. The method can help generate new hypotheses and guide further investigation. In
this case, the correlation of interest is between gene expression profiles and SNP-based
measures; correlations within gene expressions or within SNPs separately are not a focus
of interest.
6.2 Materials and Methods
Data
The data consist of microarray gene expression measurements which are treated as quan-
titative traits and a large number of genotypes for 14 Centre d'Etude du Polymorphisme
Humain (CEPH) families from Utah. Each pedigree includes 3 generations with approx-
imately 8 offspring per sibship. There are 194 individuals, 56 of which are founders (the
information for their parents and other ancestors is not considered). Phenotypes were
measured by microarray gene expression profiles obtained from lymphoblastoid cells us-
ing the Affymetrix Human Genome Focus Arrays. Morley et al. [Morley et al., 2004]
selected 3554 genes among the available 8793 probes based on higher variation among
unrelated individuals than between replicate arrays for the same individual. Here we use
pre-processed and normalized data provided for these genes. Additional phenotypic data
obtained for CEPH families includes age and gender.
The normalization procedure for expression profiles used in this study was Affymetrix
Microarray Analysis Suite (MAS) [Affymetrix]. This may have a great effect on the
results, altering the subset of gene expressions and SNPs selected by SCCA. Beyene et
al. [Beyene et al., 2007] demonstrated the influence of the normalization procedures on
the final results for several approaches including Affymetrix MAS. However, our analysis
is independent of data preprocessing. We assume that the appropriate tools have been
applied to the data at hand.
Genotypes are measured by genetic markers provided by The SNP Consortium and
are available for 2882 autosomal and X-linked SNPs. The physical map for SNP locations
is also available.
The statistical model
In this study we are interested in identifying linear combinations of gene expression
levels and SNPs that have the largest possible correlation. Canonical correlation analysis
establishes such relationships between the two sets of variables [Mardia et al., 1979].
In conventional CCA, all variables are included in the fitted linear combinations.
However, in microarray and genome-wide data the number of genes under consideration
often exceeds tens of thousands. In these cases linear combinations of all features may not
be easily interpretable. Sparse canonical correlation analysis (SCCA) enhances biological
interpretability and provides sets of variables with sparse loadings. This is consistent
with the belief that only a small proportion of genes are expressed under a certain set
of conditions, and that expressed genes are regulated at a subset of genetic locations.
We propose obtaining sparse linear combinations of features by considering a sparse
singular value decomposition of K in which the singular vectors u_1 and v_1 have sparse loadings.
We developed an iterative algorithm that alternately approximates the left and right
singular vectors of the SVD using soft-thresholding for feature selection. This approach
is related to the Sparse Principal Component Analysis method of Zou et al. [Zou et al.,
2004] and Partial Least Squares methods described by Wegelin [Wegelin].
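The alternating soft-thresholding iteration just described can be sketched as follows. This is an illustrative reimplementation under my own naming, not the thesis code: K is the cross-covariance matrix, and the thresholds lam_u and lam_v play the role of the sparseness parameters for the left and right singular vectors.

```python
import numpy as np

def soft(x, lam):
    """Soft-thresholding operator: shrink toward zero, zero out small entries."""
    return np.sign(x) * np.maximum(np.abs(x) - lam, 0.0)

def sparse_svd_vectors(K, lam_u=0.0, lam_v=0.0, n_iter=100, tol=1e-8):
    """Alternately approximate sparse left/right singular vectors of K."""
    p, q = K.shape
    v = np.ones(q) / np.sqrt(q)               # starting value for right vector
    u = np.zeros(p)
    for _ in range(n_iter):
        u_new = soft(K @ v, lam_u)            # update and sparsify left vector
        if np.linalg.norm(u_new) > 0:
            u_new /= np.linalg.norm(u_new)
        v_new = soft(K.T @ u_new, lam_v)      # update and sparsify right vector
        if np.linalg.norm(v_new) > 0:
            v_new /= np.linalg.norm(v_new)
        converged = (np.linalg.norm(u_new - u) < tol
                     and np.linalg.norm(v_new - v) < tol)
        u, v = u_new, v_new
        if converged:
            break
    return u, v
```

With both thresholds at zero the iteration reduces to the power method for the leading singular vector pair; positive thresholds zero out the small loadings.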
Analysis approach
In this study one type of variables is based on gene expression levels and the other type
of information relates to SNP genotypes and pedigree structure. An immediate challenge
in this context is how to define correlation between these two types of data. We adopted
a measure of covariance of genetic similarity with phenotypic similarity as in Tritchler et
al. [Tritchler et al., 2003] and Commenges [Commenges, 1994]. Consider the offspring
generation in all available pedigrees and take all possible sib-pairs. Let y_ij and y_ik be the
phenotypes for siblings j and k in family i for a particular gene expression, and let
w_ijk represent the IBD value for these siblings at some specific SNP. Then for the considered
gene expression and SNP the test statistic in [Tritchler et al., 2003] is
σ = Σ_i Σ_j Σ_{k>j} {y_ij − E(y_ij)}{y_ik − E(y_ik)}{w_ijk − E(w_ijk)}   (6.1)
which is used for computation of a covariance matrix between the phenotypic similarity
and genotypic similarity. Note the similarity of the above expression to Haseman-Elston
regression. In fact, Tritchler et al. [Tritchler et al., 2003] show that the correlation
statistic subsumes both the original Haseman-Elston regression analysis and the later
Haseman-Elston (revisited).
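For a single gene expression phenotype and a single SNP, the statistic (6.1) can be computed directly from the sib-pair data. The sketch below is illustrative: the `pairs` list and the dictionary layout are assumptions, and the expected values are passed in as the mean-based estimates used in practice.

```python
def sigma_statistic(y, w, pairs, y_mean, w_mean):
    """Covariance statistic (6.1) for one phenotype and one SNP.

    y[i][j]      -- phenotype of sib j in family i
    w[(i, j, k)] -- estimated IBD value for sibs j, k in family i
    pairs        -- list of (family, sib j, sib k) triples with k > j
    """
    total = 0.0
    for (i, j, k) in pairs:
        total += ((y[i][j] - y_mean)
                  * (y[i][k] - y_mean)
                  * (w[(i, j, k)] - w_mean))
    return total
```

Evaluating this for every (gene expression, SNP) combination fills in the covariance matrix K between phenotypic and genotypic similarity.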
Phenotypic similarity
The phenotypes in this study are the gene expression values for siblings in the last
generation of the pedigrees (i.e., the offspring generation). Previous studies have shown
that there is a variation in human gene expression according to age and gender [Morley
et al., 2004]. Therefore, we limit the analysis to the last generation in all pedigrees as
well as correct for the effects of gender and age by fitting a linear model
y_ij = α + β_gender · gender_ij + β_age · age_ij + e_ij   (6.2)
Gender and age information was not available for all individuals in the pedigree 1454 and
for 3 individuals in pedigree 1340. Therefore, these individuals were excluded from the
analysis. In the 13 remaining pedigrees, there were 344 distinct sib-pairs with sibship
size varying between 15 and 28. Although sib-pairs are correlated within pedigrees, this
does not affect the results since no assumption of independence is made.
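The gender and age adjustment in (6.2) amounts to keeping the residuals from an ordinary least-squares fit. A minimal sketch, assuming a numeric coding of gender; `numpy.linalg.lstsq` stands in for whatever model-fitting routine was actually used:

```python
import numpy as np

def adjust_for_covariates(y, gender, age):
    """Residuals e_ij from regressing a phenotype on gender and age, eq. (6.2)."""
    X = np.column_stack([np.ones_like(age), gender, age])  # intercept, gender, age
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return y - X @ beta                                    # adjusted phenotype
```

The adjusted phenotypes, rather than the raw expression values, then enter the phenotypic-similarity computation.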
Genotypic similarity
For each sib-pair, the probabilities of sharing 0, 1 and 2 alleles identical by descent were
estimated using MERLIN. The provided physical distance map of the SNP locations was
used for this computation, since it is a suitable approximation to the genetic distances
required by MERLIN and the results are not sensitive to this substitution. Given
the incomplete genetic marker information for some individuals, exact IBD values could
not be computed. We estimated the number of alleles shared identical by descent by
two siblings as a posterior expected value based on the probabilities estimated using
MERLIN. Expected ibd values E(wijk) were computed as sample mean values over all
sib-pairs.
Standardization
We standardize the phenotype and genotype variables by subtracting the mean values
and dividing by the standard deviations. As described in Section 3.3 of the methods
chapter, simulations show that after data standardization the analysis
can be simplified by replacing the variance matrices of the gene expressions and IBD values
with identity matrices while still yielding satisfactory results. Then the matrix K in equation
3.2 is the covariance between the two data sets and the first canonical vectors in equation
3.3 are just u1 and v1.
Evaluation
We evaluated the results by performing SCCA with leave-one-out cross-validation (LOOCV),
treating a pedigree as one unit.
In this study, assessment of performance is based on the estimated test sample correlation,
i.e. the correlation between the linear combinations of identified loci and gene
expressions in an independent sample. We used the pedigree as the unit in LOOCV since
it represents a statistically independent unit. Leaving out one whole pedigree preserves
dependence structure in the family based study and ensures independence between train-
ing and testing samples. Using a random sample of 100/k% of all individuals for k-fold
CV would destroy familial correlation. We carried out an analogue of 13-fold CV where
fold-size was dictated by the complex structure of the data. Also, leaving out one pedi-
gree facilitates sensitivity analysis and shows the influence of the specific pedigrees on
the results.
The SCCA algorithm involves CV analysis for selection of the sparseness parameters
for gene expressions and SNPs. Therefore, validation of the SCCA performance is carried
out using a nested CV structure. In the outer CV loop the data are repeatedly split into
12 pedigrees for the training sample and 1 pedigree for the test sample. The complete SCCA
procedure (including a CV tuning step) is then applied to the training sample. This
means that the inner CV is performed by splitting 12 families into 11 and 1 to select
best sparseness parameter combination which is subsequently used to identify linear
combinations of gene expressions and SNPs. These linear combinations are applied to
the remaining pedigree 13th test sample pedigree in the outer CV loop to compute test
sample correlation. Results are then averaged over all 13 outer CV steps.
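The nested scheme can be written down compactly. The sketch below is illustrative only: it uses a toy soft-thresholded SVD in place of the full SCCA procedure of Chapter 3, random data in place of the pedigree study, and invented function names.

```python
import numpy as np

def soft_threshold(a, lam):
    """Soft-thresholding operator, the sparseness device used by SCCA."""
    return np.sign(a) * np.maximum(np.abs(a) - lam, 0.0)

def fit_sparse_vectors(X, Y, lam_x, lam_y):
    """Toy stand-in for the full SCCA fit: soft-threshold the leading
    singular vectors of the cross-covariance matrix."""
    K = X.T @ Y / (len(X) - 1)
    U, _, Vt = np.linalg.svd(K, full_matrices=False)
    return soft_threshold(U[:, 0], lam_x), soft_threshold(Vt[0, :], lam_y)

def nested_pedigree_cv(X, Y, pedigree, lam_grid):
    """Outer loop: leave one pedigree out as the test sample.
    Inner loop: leave one of the remaining pedigrees out to pick the
    sparseness parameters.  Refit on all training pedigrees and report
    the test sample correlation averaged over the outer steps."""
    outer_cors = []
    for test_ped in np.unique(pedigree):
        train = pedigree != test_ped
        best, best_cor = lam_grid[0], -np.inf
        for lams in lam_grid:                     # inner CV tuning
            cors = []
            for val_ped in np.unique(pedigree[train]):
                fit = train & (pedigree != val_ped)
                val = pedigree == val_ped
                u, v = fit_sparse_vectors(X[fit], Y[fit], *lams)
                if u.any() and v.any():
                    cors.append(abs(np.corrcoef(X[val] @ u, Y[val] @ v)[0, 1]))
            if cors and np.mean(cors) > best_cor:
                best, best_cor = lams, np.mean(cors)
        u, v = fit_sparse_vectors(X[train], Y[train], *best)
        outer_cors.append(abs(np.corrcoef(X[~train] @ u, Y[~train] @ v)[0, 1]))
    return float(np.mean(outer_cors))

rng = np.random.default_rng(0)
pedigree = np.repeat(np.arange(4), 6)             # 4 toy "pedigrees" of size 6
X, Y = rng.normal(size=(24, 5)), rng.normal(size=(24, 4))
avg_cor = nested_pedigree_cv(X, Y, pedigree, [(0.0, 0.0), (0.1, 0.1)])
```

The key design point is that every split is made at the pedigree level, in both loops, so relatives never appear on both sides of a split.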
6.3 Results
Optimal sparseness parameter combination selection and SCCA
results
Using cross-validation, we obtained a soft-threshold value of 0.07 for gene expressions and 0.13 for SNPs corresponding to the maximal test sample correlation. The 3-D graph in Figure 6.1 demonstrates the results of this CV. It shows the test sample correlation between the gene expression and SNP sets averaged over the 13 CV steps. The dotted plane corresponds to the test sample correlation for linear combinations of all variables obtained using standard CCA, i.e. the full SVD solution, which is 0.1384. It is constant for all λgene expr. and λSNP since no sparseness parameters are involved in that analysis. The best average test sample correlation for SCCA is 0.1843. Figure 6.1 also demonstrates that for the evaluated sparseness parameter combinations SCCA provides a better solution in terms of test sample correlation than the full SVD solution. We carried out SCCA using this optimal combination of sparseness parameters and identified groups of 41 SNPs and 150 gene expressions with a between-group correlation of 43%. All of the selected SNPs are uniformly distributed over a region on chromosome 9 between 86.80 megabases (Mb) and 120.09 Mb. The locations of the expressed genes selected by SCCA are distributed over different chromosomes. Six of the identified gene expressions are located on chromosome 9. The three other chromosomes with more than 15 gene expressions each are 1, 2 and 6. No cis-acting genetic regulators were found, where cis-regulators are defined as those that map within a 5 Mb region, following the definition of Morley et al. [Morley et al., 2004].
Cross-Validation of SCCA algorithm
Table 6.1 summarizes the results of the cross-validation study comparing the performance of SCCA to the complete SVD solution that includes all 3554 gene expressions and 2882 SNPs. The average overlap between the group of 150 gene expressions selected using SCCA
Figure 6.1: 3-D graph: test sample correlation averaged over 13 LOOCV steps for dif-
ferent combinations of sparseness parameters for gene expression and SNP measures.
Dotted plane: test sample correlation averaged over 13 LOOCV steps for standard CCA
solution.
on the complete data and the groups of gene expressions selected in the CV steps is 46 genes, while the average intersection of the 41 SNPs with the results in the CV steps is 34 SNPs. Inspecting the CV iterations shows the pedigrees to be heterogeneous, with two pedigrees, 1416 and 1418, being outliers. When these pedigrees are used as test samples (we refer to these two CV steps as steps 1416 and 1418) the gene expression and SNP sets selected for the training data differ substantially from the sets obtained in the other CV iterations. In particular, there are 159 and 166 SNPs selected in steps 1416 and 1418 respectively. These two SNP sets have an overlap of 155 SNPs, indicating that the results are very similar. However, there are only 7 SNPs in common between these groups of SNPs and the groups
          Gene expr.   SNPs   Average test sample correlation
SCCA            83       66                            0.1144
SVD           3554     2882                            0.1384
Table 6.1: Validation: summary of prediction results for SCCA and the full SVD, averaged over 13 leave-one-out cross-validation steps.
selected in the other CV iterations. All seven SNPs shared by the results in all CV steps are located on chromosome 9 and are also a subset of the 41 SNPs selected by SCCA applied to the whole data. The difference is more dramatic for the gene expression comparison. Again, similar sets of genes are selected in CV steps 1416 (98 gene expressions selected) and 1418 (69 gene expressions selected). There is almost complete overlap between these two groups: 68 gene expression profiles. However, there are at most 9 gene expressions in common between these groups and the gene expressions selected in the other CV steps. In fact, the intersection is empty between the gene sets obtained in CV steps 1416 and 1418 and the genes selected when pedigree 1340 is used as a test sample. Also, there are respectively 7 and 3 gene expressions in common between the group selected by SCCA for the whole data and the groups in steps 1416 and 1418. On the other hand, the results obtained in all CV steps excluding steps 1416 and 1418 are more similar. The average overlap in gene expressions between the CV iterations and the whole-data results is 60 if we also ignore the steps where pedigrees 1340 and 1345 were used as test samples, since in these steps only 18 and 24 gene expressions were selected respectively. The average overlap in SNPs is 40. These results clearly suggest some differences between pedigrees 1416 and 1418 and the rest of the pedigrees. To improve the homogeneity of the data, we applied the SCCA method as well as the validation procedure to the data with pedigrees 1416 and 1418 removed.
SCCA results for reduced data
Similarly to the analysis of the whole data, we applied SCCA and the validation procedure to the reduced data consisting of all pedigrees except 1416 and 1418. We obtained a soft-threshold value of 0.1 for gene expressions and 0.09 for SNPs corresponding to the maximal test sample correlation. We carried out SCCA using the optimal combination of sparseness parameters and identified groups of 134 SNPs and 63 gene expressions with a between-group correlation of 51%. Nine of the selected SNPs are located on chromosome 9 between 116.61 Mb and 136.27 Mb. Six of these SNPs were identified in the analysis of the whole data. Chromosome 7 contains 59 of the selected SNPs, distributed uniformly between 27.73 Mb and 106.92 Mb. A large portion of the SNPs is also located on chromosome 12: 37 SNPs between 76.50 Mb and 115.75 Mb. The other chromosomes that contain 10 or fewer of the selected SNPs are 4, 6, 11, 14, 16, 18, 20, and 23. The selected gene expressions are distributed over different chromosomes.
Table 6.2 summarizes the results of the cross-validation study comparing the performance of SCCA to the complete SVD solution when both methods are applied to the reduced data. The average overlap between the group of 63 gene expressions selected using SCCA on the data consisting of 11 pedigrees and the groups of gene expressions selected in the CV steps is 30 genes, while the average intersection of the 134 SNPs with the results in the CV steps is 59 SNPs. Inspecting the CV iterations reveals further heterogeneity in the pedigrees. When pedigrees 1341, 1346, and 1408 are used as test samples, the gene expression and SNP sets selected for the training data differ substantially from the sets obtained in the other CV iterations.
6.4 Adaptive SCCA Results
In addition to the presented SCCA analysis I also applied adaptive SCCA to the study
of natural variation in human gene expression. I used the algorithm described in the
          Gene expr.   SNPs   Average test sample correlation
SCCA            74       85                            0.1544
SVD           3554     2882                            0.1384
Table 6.2: Validation of SCCA applied to the reduced data: summary of prediction results for SCCA and the full SVD, averaged over 11 leave-one-out cross-validation steps.
adaptive SCCA section (Section 5.2 of Chapter 5) with the power of the weights in the soft-thresholding penalty set to γ = 1. I considered soft-thresholding parameters λgene expr. and λSNP ranging between 0 and 0.01 for both gene expressions and SNPs; parameter values higher than 0.01 resulted in no variables being selected. Using cross-validation, I obtained a soft-threshold value of 0.004 for gene expressions and 0.009 for SNPs, corresponding to a maximum test sample correlation averaged over cross-validation steps of 0.2092. This average test sample correlation is higher than the best test sample correlation of 0.1843 produced by SCCA for the optimal combination of sparseness parameters λgene expr. = 0.07 and λSNP = 0.13.
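The weighting scheme with γ = 1 can be sketched as follows. This is an illustrative Python rendering of the operator described in Section 5.2; the variable names and example values are mine.

```python
import numpy as np

def adaptive_soft_threshold(a, lam, gamma=1.0):
    """Adaptive soft-thresholding: each coefficient a_j is shrunk by
    lam * w_j with weight w_j = 1 / |a_j|**gamma, so coefficients that
    are already small receive heavier penalties and are zeroed sooner."""
    w = 1.0 / np.maximum(np.abs(a), 1e-12) ** gamma   # guard against /0
    return np.sign(a) * np.maximum(np.abs(a) - lam * w, 0.0)

a = np.array([0.9, 0.05, -0.4, 0.01])                 # toy coefficient vector
shrunk = adaptive_soft_threshold(a, lam=0.004, gamma=1.0)
```

Note that even with the very small λ values selected here (0.004 and 0.009), the 1/|a_j| weights make the effective penalty large for small coefficients, which is why the parameter range 0 to 0.01 was sufficient.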
Similarly to SCCA, I carried out adaptive SCCA using the optimal combination of sparseness parameters and identified groups of 19 SNPs and 28 gene expressions with a between-group correlation of 32%. The selected SNPs and gene expressions are subsets of the groups of SNPs and gene expressions identified by SCCA. Thus, all selected SNPs are located on chromosome 9, between 102.82 megabases (Mb) and 117.15 Mb. The locations of the selected expressed genes are distributed over different chromosomes. One of the identified gene expressions is located on chromosome 9. Chromosome 1 has 5 of the selected gene expressions and chromosome 10 has 4; other chromosomes contain fewer than 4 identified gene expressions. The gene expression selected by adaptive SCCA that is located on chromosome 9 is 209034_at, which is also one of the genes identified by Lantieri et al. [Lantieri et al., 2007]. Since the adaptive SCCA results are subsets of the SCCA results, again no cis-acting genetic regulators were found according to the definition in Morley et al. [Morley et al., 2004].
6.5 Discussion
In this study we presented sparse canonical correlation analysis and demonstrated the application of this new method to the simultaneous analysis of gene expression levels and SNPs. Due to complex interactions between genes, a set of several genotypes may be associated with several gene expressions, possibly belonging to the same regulatory pathway or genetic network. SCCA discovers such sets of genotypes and phenotypes while keeping the size of the groups sufficiently small for biological interpretability.
We identified a specific region on chromosome 9 that regulates a group of gene expression profiles. The selected set of loci should be interpreted as a whole in relation to the whole set of selected gene expressions. We presented the results for the sets of SNPs and gene expression levels with maximum correlation between the SNP set and the gene expression set. Maximization of within-group correlation is not the objective of SCCA, so the selected gene expressions may not be highly correlated with each other, and the same is true for the SNPs.
This sparse solution may help to generate new hypotheses and isolate groups of loci and gene expressions for future biological experimentation. For instance, the selection of a specific region on chromosome 9 by SCCA is particularly interesting, and a possible interpretation is that we found a regulatory region. The same region on chromosome 9 was also identified in other GAW15 contributions. For instance, considering a small set of genes associated with the development of the enteric nervous system (ENS), Lantieri et al. [Lantieri et al., 2007] also found evidence of linkage for two genes, 201387_s_at and 209034_at, to a unique common regulator located on chromosome 9 at 109 centiMorgans (cM). Similarly, Wang et al. [Wang et al., 2007] found 10 gene expressions mapped to a "hotspot" on chromosome 9; however, the gene names and specific chromosomal locations were not provided.
The smaller number of gene expressions and SNPs selected by SCCA facilitates better biological interpretability of the results. The leave-one-out cross-validation results showed a slightly lower average test sample correlation for SCCA compared to the full SVD solution, as shown in Table 6.1. For this particular data set, a possible explanation is outliers among the results in the CV steps, due to the two incongruous pedigrees. This indicates that using stringent constraints for subsetting variables may result in greater vulnerability to outliers in the pedigrees. In the simulations we carried out to assess the performance of SCCA, our method demonstrated better performance than standard CCA based on the full SVD in terms of test sample correlation. Thus, SCCA may potentially provide a more robust solution. Additional empirical studies using more homogeneous sets of pedigrees would be useful. The variability of the results in separate CV steps, indicated by the incomplete overlap of the selected sets of gene expressions and SNPs, points out the utility of CV in detecting heterogeneous sets of pedigrees.
I also applied adaptive sparse canonical correlation analysis and identified a smaller group of SNPs that is associated with a smaller set of gene expression profiles as compared to the results obtained by SCCA. An important observation is that there is complete agreement between the solutions and no new results have been identified by the adaptive SCCA. Also, the sets of SNPs and gene expressions selected by the adaptive SCCA are sparser, and they are subsets of the sets of SNPs and gene expressions selected by SCCA. This may be due to the reduction of the number of noise variables included in the solution. This is supported by the simulation results presented in Chapter 5, which indicate that adaptive SCCA does have a tendency to select fewer uninformative variables than SCCA. However, the simulations also show that sparse adaptive SCCA solutions may include fewer important variables than SCCA. Thus, the reduced numbers of SNPs and gene expressions obtained by the adaptive SCCA may also be missing some regulatory SNPs or gene expressions that are associated with the identified genetic region on chromosome 9. In conclusion, the adaptive SCCA solution is more likely to contain a higher percentage of SNPs and gene expressions that are related to each other, while the SCCA solution may identify a more complete list of associated SNPs and gene expressions. The choice between the two solutions can be made based on biological knowledge and on the cost of additional biological experiments that may be used to validate the results, i.e. the relative cost of carrying out an additional experiment compared to the cost of missing an important association between SNPs and gene expressions.
Both sparse canonical correlation analysis and adaptive SCCA allow the global investigation of genomic and genetic data at the same time and provide an interpretable answer even in studies with limited sample size and possible outliers, such as this study of natural variation in human gene expression. These methods are useful analytical tools for genome-wide study of variation in human gene expression that can also identify new regulatory pathways and genetic networks. Biological knowledge and intended further analysis may suggest the choice between adaptive SCCA and SCCA.
Chapter 7
Discussion and future work
7.1 Discussion
This thesis describes a new methodology for the simultaneous analysis of two sets of measurements to establish the relationships between them: sparse canonical correlation analysis (SCCA). SCCA identifies linear combinations of variables in each data set that have the highest correlation between the different sets of measurements. In this case maximization of the correlation between the variables within each data set is not the focus of the analysis. Sparse canonical correlation analysis is an extension of canonical correlation analysis (CCA), which identifies such relationships between the variables. However, CCA includes the entire sets of available variables in the linear combinations. In large studies such as microarray and genome-wide linkage/association studies this may not be practical due to the high dimensionality of the data. In these cases linear combinations of all variables may lack biological interpretability. SCCA solves this problem by providing a sparse solution. The sparse linear combinations of variables obtained from SCCA include only small subsets of variables from each data set. Hence, they are easier to interpret and may be used to generate new hypotheses for further testing.
I presented the sparse canonical correlation analysis algorithm and investigated its
properties. Simulation studies show better performance of SCCA compared to CCA in terms of test sample correlation for different sample sizes. The difference is especially large for small sample sizes, which are typical in the large studies where we expect SCCA to be most useful. I also investigated the oracle properties of SCCA according to the definition of H. Zou [Zou, 2006]. The oracle property includes two components: consistency of the estimated coefficients in the linear combinations of variables and correct model identification. SCCA is an exploratory method, and the focus of the analysis is on identifying subsets of variables that have significant association between the two data sets in the study. Therefore, the values of the non-zero coefficients in the linear combinations are of lesser importance than the locations of the zeros. Identification of which coefficients should be set to zero determines which variables are included in the linear combinations and thus serves as the model selection. While consistency of the coefficients in the linear combinations of variables is a desirable property, the main focus of SCCA is correct model identification.
Investigation of the model selection properties of SCCA using simulated data showed that as the sample size increases, the number of unimportant variables selected by SCCA as significantly associated between the two data sets (false positives) decreases. The number of important variables not included in the model (false negatives) decreases as well, although at a slower rate, and is less affected by the sample size. These misidentified important variables can have loadings that are very small in absolute value in the linear combinations of associated variables and are therefore difficult to differentiate from the noise variables. Also, the effect of the sample size on the number of false negatives is stronger when the true simulated correlation between the linear combinations of important variables is higher. For sufficiently large sample sizes (approximately twice the number of variables in one data set or higher) the number of false positives selected by SCCA is zero. Additional simulations demonstrate that maximization of the test sample correlation to select the optimal combination of sparseness parameters for the left and right singular vectors that determine the solution does not guarantee the best model identification. However, the true underlying model is not known in real studies. Therefore, it is not possible to minimize the discordance measures directly, which leads to basing the solution on maximization of the test sample correlation. This limitation inspired the development of two extensions of SCCA: adaptive SCCA and modified adaptive SCCA.
I also presented adaptive SCCA, an extension of SCCA based on the adaptive LASSO approach of H. Zou, which may have preferable model selection properties in some applications. Similarly to SCCA, adaptive SCCA seeks a solution by maximizing the test sample correlation. Simulations show that adaptive SCCA provides better filtration of the noise and includes fewer uninformative variables in the linear combinations. However, the sparser sets of selected variables may also not include all important variables. Thus, there is a trade-off between the number of false positives and false negatives, which leads to a trade-off between using adaptive SCCA and simple SCCA. In applications of SCCA in which the results are used to form new hypotheses for further testing, an investigator may prefer to exclude as much noise as possible if the cost of additional experiments is very high. On the other hand, if the results are used for the discovery of new effects and the cost of additional testing is not high compared to the cost of missing an important component, a solution that has a higher probability of containing most important variables, possibly along with a higher percentage of noise, may be preferred. Thus, the choice between adaptive SCCA and SCCA may be made based on biological knowledge and on the relative cost of having a greater number of false positives compared to false negatives.
I investigated a further modification of SCCA that introduces additional noise variables into the original data set in order to estimate and minimize the number of uninformative variables included in the linear combinations. This method is based on the analysis approach described by Wu et al. [Wu et al., 2007]. The simulations suggest that it does not offer an advantage in minimizing the number of false positives, and thus in better model identification. Furthermore, it is more computationally intensive and may be infeasible in large studies where the number of variables in each data set may be in the tens of thousands. Both adaptive SCCA and modified adaptive SCCA underestimate the false selection rate, which results in a high number of noise variables included in the linear combinations of associated variables. This is due to underestimation of the number of uninformative variables in the original data sets, which is used to estimate the false selection rate. Further development of this approach is necessary to obtain better estimates.
I demonstrated an application of both adaptive SCCA and SCCA using a real study of natural variation in human gene expression. An important observation is that the solution obtained by the adaptive SCCA is a subset of the solution obtained by SCCA. Based on the simulation studies, the conclusion may be that the adaptive SCCA solution contains fewer noise variables; however, it may also contain fewer important variables. As discussed above, this again raises the question of the trade-off between the number of false positives and false negatives and the choice between the two solutions.
The simulations and the application demonstrate the methodology and utility of the SCCA and adaptive SCCA methods developed in this thesis. Both of these methods are preferred to the modified versions based on introducing additional noise variables, due to the computational complexity of the latter. The preference between the SCCA and adaptive SCCA methods may be set by an investigator based on the intended use of the results, available resources, and prior biological knowledge.
7.2 Limitations of simulation studies
Simulations used to investigate the properties of SCCA and its extensions were based on the single latent variable model and have several limitations. The first is that in large-scale genomic and genetic studies the measurements may be taken on genes belonging to several different pathways or processes. That means that several groups of associated variables of different types may be present in the data. In this case a single latent variable model is not appropriate since it can describe only one regulatory mechanism. Therefore, analysis of multiple processes requires a more complex model that contains several independent latent variables, each responsible for a separate process. From a canonical correlation point of view that means considering more than one pair of singular vectors. Similarly for SCCA, the solution would contain several sparse linear combinations of variables. First, the subsets of variables with the highest correlation between their linear combinations can be identified as described in Chapter 3. Then, the effect of these variables is removed by considering the residual correlation matrix. This is followed by the identification of additional subsets of variables with the next highest correlation using the residual matrix. This type of complex data may pose additional challenges due to the increased difficulty of differentiating between the noise and informative variables as well as between the subsets of variables associated with different processes. Further simulation studies using multiple latent variable models are necessary to investigate the performance of SCCA in such cases.
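The residual-matrix idea can be sketched as follows. This is illustrative Python with a simple soft-thresholding stand-in for the SCCA update; the particular deflation step shown is one reasonable choice, not a prescription from the thesis.

```python
import numpy as np

def sparse_pairs(K, lam_u, lam_v, n_pairs=2):
    """Sketch of extracting several sparse canonical pairs: after each
    pair is found, the component it explains is subtracted from the
    residual cross-covariance matrix before the next pair is extracted."""
    st = lambda a, lam: np.sign(a) * np.maximum(np.abs(a) - lam, 0.0)
    pairs, R = [], K.copy()
    for _ in range(n_pairs):
        U, _, Vt = np.linalg.svd(R, full_matrices=False)
        u, v = st(U[:, 0], lam_u), st(Vt[0, :], lam_v)
        if np.linalg.norm(u) > 0:
            u = u / np.linalg.norm(u)
        if np.linalg.norm(v) > 0:
            v = v / np.linalg.norm(v)
        pairs.append((u, v))
        R = R - (u @ R @ v) * np.outer(u, v)   # remove the explained component
    return pairs

rng = np.random.default_rng(1)
K = rng.normal(size=(6, 5))                    # toy cross-covariance matrix
pairs = sparse_pairs(K, 0.05, 0.05)
```

Each pair would correspond to one latent variable, i.e. one regulatory mechanism, in the multiple latent variable model described above.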
Another limitation of the simulation studies is the lower number of variables of each type compared to the number of variables in some large scale studies (up to 500 variables in the simulations vs. 15000 to 20000 variables in genomic/genetic studies). At the same time, the simulated sample sizes and proportions of important variables are higher than in some applications. These simulation parameters were chosen to mimic frequent data problems, such as limited sample size and a high proportion of noise variables, while keeping the simulations computationally feasible. However, the alternative to SCCA in the simultaneous analysis of two sets of variables would be canonical correlation analysis, which includes the entire sets of variables in the linear combinations. In the case of large studies these results would lack biological interpretability. On the other hand, SCCA is able to provide sparse solutions. Also, the simulation results in section 4.2 show that SCCA has superior performance compared to CCA in terms of generalizability, as measured by the independent test sample correlation between the obtained linear combinations of variables, even for sample sizes much smaller than the number of variables. Furthermore, the performance of SCCA improves rapidly with increasing sample size, producing solutions with the true correlation between the associated subsets of variables for sample sizes that are still much smaller than the number of variables. The usual assumption in large studies is that many of the measurements are not related to the processes of interest and represent noise. Therefore, it would be beneficial to filter out some uninformative variables prior to the analysis. This would increase the proportion of the associated variables and the effective sample size. It may be unrealistic to expect complete elimination of the noise variables at the filtering stage, therefore SCCA would still be a useful tool for the analysis of such data. Further discussion of the preliminary filtering is presented in section 7.3.
The last limitation of the simulation studies is the rather high values of the true correlation between the associated subsets of variables. As shown in section 4.3, at lower values of the true correlation it is more difficult to differentiate between the noise and informative variables. Further investigation of this effect and additional studies of possible improvements to the performance of SCCA are necessary. Also, similarly to the discussion above, it would be beneficial to perform preliminary noise filtering.
7.3 Preliminary filtering
Simulation results presented in section 3.5 of chapter 3 show better performance of SCCA in the presence of fewer noise variables in the data sets. Simulations in chapter 4 also demonstrate improved performance of SCCA for larger sample sizes relative to the number of variables. Thus, the results can be improved by filtering noise variables prior to the analysis. Uninformative variables can be filtered based on the variance of the variables within set X and set Y separately: if there is no variation in a variable, it cannot be correlated with anything. Another approach is to filter within sets X and Y separately based on the correlation of the variables. This is based on the fact that if x1 and x2 are correlated with y, where y can be a latent variable, then x1 and x2 should be correlated with each other. Incorporating preliminary filtering into SCCA and studying its effects is the subject of future study.
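The two within-set filters described above can be sketched as follows. The thresholds used here are illustrative only; choosing them in practice is part of the future study this section proposes.

```python
import numpy as np

def prefilter(X, var_quantile=0.25, cor_threshold=0.3):
    """Sketch of the two within-set filters: drop the lowest-variance
    columns, then keep only the columns that are correlated with at
    least one other column in the same set."""
    v = X.var(axis=0, ddof=1)
    Xv = X[:, v > np.quantile(v, var_quantile)]          # variance filter
    C = np.corrcoef(Xv, rowvar=False)
    np.fill_diagonal(C, 0.0)
    return Xv[:, np.abs(C).max(axis=0) > cor_threshold]  # correlation filter

rng = np.random.default_rng(2)
X = rng.normal(size=(50, 10))
X[:, 1] = X[:, 0] + 0.1 * rng.normal(size=50)   # a genuinely correlated pair
X[:, -1] *= 0.01                                # a near-constant column
Xf = prefilter(X)
```

The correlation filter is applied after the variance filter so that near-constant columns, whose sample correlations are unstable, never enter the correlation matrix.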
7.4 More than 2 sources of data
Some recent genomic studies offer several phenotypes measured on the subjects. For example, in the study of chronic fatigue syndrome (CFS) presented at the Critical Assessment of Microarray Data Analysis (CAMDA) workshop in 2006, different sources of data were included in the study: clinical assessment of the patients, microarray gene expression profiles, proteomics data, and selected single nucleotide polymorphisms (SNPs). The question of interest is data integration to establish the relationships between the different types of variables and to predict the disease class. Wold's original Partial Least Squares (PLS) algorithm is applicable to several sets of data. However, the solution includes the entire sets of available variables. In large scale studies similar to the study of CFS these results may lack biological interpretability. It would be interesting to develop an extension of Sparse Canonical Correlation Analysis based on the PLS approach applicable to more than two sets of variables simultaneously.
7.5 Computation of variance
The data standardization section of chapter 3 describes the approximation of the variance matrices for the different sets of variables by identity matrices after the data has been standardized. This approach is based on the assumption that in high dimensional problems most of the measured variables are not related to the process of interest, i.e. they may be considered as noise, and the correlation between them is zero. Thus, the correlation between gene expressions in microarray studies is not taken into account. The same assumption is made by traditional analysis approaches such as differential gene expression analysis carried out one gene expression at a time. An extension of SCCA allowing better estimation and incorporation of the variance matrices for the considered sets of variables would improve the solutions. In that case the often unrealistic assumption of gene independence could be removed.
7.6 Computation of covariance
The covariance matrix between the two sets of variables in the study is the key element in the SCCA algorithm. Therefore, it is crucial to have an accurate estimate of that matrix. In a genome-wide microarray study there are often tens of thousands of gene expressions under consideration, but typically only a few hundred observations available. Thus, the sample covariance matrix may not be a precise estimate of the true underlying covariance structure. I propose to use bagging to improve the estimate of the covariance matrix. This approach is similar to the method in [Schafer and Strimmer, 2004].
General algorithm
Bagging, or bootstrap aggregation, is a general method for improving a sample-based estimator [Breiman, 1996]. It utilizes the bootstrapping technique as follows:
• For a given data set X generate B bootstrap sets X*b, b = 1, . . . , B, by sampling with replacement from the available observations.
• For each bootstrap sample calculate an estimate of the statistic of interest Θ*b.
• The bagged estimator is the bootstrap mean (1/B) Σ_{b=1}^B Θ*b.
Application to SCCA
Bootstrap sample generation
In the case of SCCA the sample statistic is the covariance matrix for two sets of variables. Thus, there are two data sets X and Y with the same number of observations n and possibly different numbers of variables. Obtaining a bootstrap sample in this case means sampling with replacement from the set of observations, i.e. sampling from the sequence {1, . . . , n} to get obs*b. Then X*b and Y*b are obtained by taking the observations included in obs*b from the original sets of variables X and Y.
The bagging algorithm and SCCA can be applied using the following two approaches:
SCCA of the bagged covariance matrix estimate (B-SCCA1)
In the first approach the bagging algorithm is used to obtain an improved covariance matrix estimate as the bootstrap mean of the covariance matrices for the bootstrap samples from the original data sets X and Y. Thus, the bagged covariance matrix estimate is calculated as (1/B) Σ_{b=1}^B Cov(X*b, Y*b). Subsequently, SCCA is applied to the bagged covariance estimate to obtain sparse combinations of variables from sets X and Y. This can be interpreted as application of SCCA to the Bayesian posterior mean estimate of the sample covariance matrix.
Bagged SCCA (B-SCCA2)
In the second approach SCCA is applied to each bootstrap sample from the original data,
i.e. to $X^{*b}$ and $Y^{*b}$, $b = 1, \dots, B$, to obtain sparse combinations of variables $\alpha^{*b}$ and $\beta^{*b}$
for X and Y respectively. Then the posterior probability that a specific coefficient $u_j$ in the
left singular vector, or $v_j$ in the right singular vector, equals zero can be calculated
as $\frac{1}{B}\sum_{b=1}^{B} I(u_j^{*b} = 0)$ for variables in the set X, and similarly for the variables in Y.
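The posterior zero-probabilities from the B bootstrap coefficient vectors can be tabulated as follows (a sketch; stacking the coefficient vectors row-wise into a matrix is our convention, and the tolerance parameter is ours, to guard against floating-point near-zeros):

```python
import numpy as np

def zero_probabilities(coefficient_samples, tol=1e-12):
    """Posterior probability that each coefficient equals zero.

    coefficient_samples : (B, p) array whose b-th row holds the sparse
    coefficient vector from the b-th bootstrap SCCA fit.
    Returns (1/B) * sum_b I(u_j^{*b} = 0) for each j.
    """
    A = np.asarray(coefficient_samples)
    return np.mean(np.abs(A) <= tol, axis=0)
```

For example, a coefficient that is exactly zero in two of three bootstrap fits receives posterior zero-probability 2/3.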
Simulations would be useful for comparing these two approaches and evaluating their
performance.
7.7 Application of SCCA to the study of Chronic
Fatigue Syndrome
Chronic Fatigue Syndrome (CFS) is a disease that affects a significant proportion of the
population in the United States and has a detrimental economic effect on society [Reeves
et al., 2005]. Assessment of patients and identification of the illness is complicated by the
lack of well-established characteristic symptoms [Whistler et al., 2005]. The symptoms
of CFS are also shared by other neurological illnesses such as multiple sclerosis, sleep
disorders, and major depressive disorder. Moreover, the definition of CFS as a distinct disease
is not obvious, as it may represent a common response to a collection of other illnesses
[Whistler et al., 2005]. Establishing well-defined measures of CFS is crucial for the
assessment of this illness, and that was the main purpose of the study conducted by the
Centers for Disease Control and Prevention (CDC) in Wichita, KS. To achieve this goal, different sources
of data were included in the study: clinical assessments of the patients, microarray
gene expression profiles, proteomics data, and selected single nucleotide polymorphisms
(SNPs). The data for this study were presented at the Critical Assessment of Microarray
Data Analysis (CAMDA) workshop in 2006.
In this study the measurements taken on the subjects are of different types. One
type of variable is based on gene expression levels; another relates
to SNP genotypes and pedigree structure; in addition there are proteomics data and
haematologic measurements. The challenge from the statistical point of view is integrating
these variables and conducting a unified analysis that focuses on the relationships
between the sets of variables rather than within them. For example, one may be interested in
using gene expression values jointly to predict the disease class, differentiating between patients
with CFS, CFS with major depressive disorder with melancholic features (MDDm),
patients with insufficient symptoms or fatigue (ISF), and non-fatigued subjects. Other
questions of interest include establishing the relationship between the gene expression
profiles and clinical data, or finding genetic regulatory pathways by analyzing microarray
and SNP data.
It is also important to note that these are high dimensional data and the number of
variables in some sets greatly exceeds the number of subjects. In particular, there are
almost 20,000 gene expression profiles while the number of samples is only 177.
Given the challenges described above, i.e. the integration of different types of data and
high dimensionality, Sparse Canonical Correlation Analysis is an appropriate approach
for the CAMDA study. It would allow simultaneous analysis of different types
of measurements and produce sparse results that are interpretable from a biological
point of view. Sparse solutions that indicate the relationships between small subsets
of variables are easier to visualize, which may be important in a study of regulatory
pathways. SCCA results may also be used to generate hypotheses for further testing.
Thus, it would be interesting to apply SCCA to the study of chronic fatigue syndrome.
Bibliography
Affymetrix. Microarray Suite User Guide. URL
http://www.affymetrix.com/support/technical/manuals.affx.
J. Beyene, P. Hu, E. Parkhomenko, and D. Tritchler. Impact of normalization and filtering
on linkage analysis of gene expression data. Number 1(Suppl1) in BMC Proceedings,
page S150, 2007.
L. Breiman. Bagging predictors. Machine Learning, 24:123–140, 1996.
J. Cadima and I. Jolliffe. Loadings and correlations in the interpretation of principal
components. Journal of Applied Statistics, 22:203–214, 1995.
V.G. Cheung, R.S. Spielman, K.G. Ewens, T.M. Weber, M. Morley, and J.T. Burdick.
Mapping determinants of human gene expression by regional and genome-wide associ-
ation. Nature, 437:1365–1369, 2005.
D. Commenges. Robust genetic linkage analysis based on a score test of homogeneity:
the weighted pair-wise correlation statistic. Genetic Epidemiology, 11:189–200, 1994.
B. Efron, T. Hastie, I. Johnstone, and R. Tibshirani. Least angle regression. The Annals
of Statistics, 32(2):407–499, 2004.
I.J. Good. Some applications of the singular decomposition of a matrix. Technometrics,
11(4):828–831, 1969.
D.A. Harville. Matrix algebra from a statistician’s perspective. Springer, 1997.
H. Hotelling. Relations between two sets of variables. Biometrika, 28:321–377, 1936.
J. Jeffers. Two case studies in the application of principal component analysis. Applied Statistics,
16:225–236, 1967.
I. M. Johnstone and A. Y. Lu. Sparse principal component analysis. January 2004.
I. T. Jolliffe and M. Uddin. A modified principal component technique based on the
lasso. Journal of Computational and Graphical Statistics, 12:531–547, 2003.
F. Lantieri, H. Rydbeck, P. Griseri, I. Ceccherini, and M. Devoto. Incorporating prior
biological information in linkage studies increases power and limits multiple testing.
Number 1(Suppl1) in BMC Proceedings, page S89, 2007.
K.V. Mardia, J.T. Kent, and J.M. Bibby. Multivariate analysis. New York: Academic
Press, 1979.
N. Meinshausen and P. Buhlmann. Variable selection and high dimensional graphs with
the lasso. Technical report, ETH Zurich, 2004.
M. Morley, C.M. Molony, T.M. Weber, J.L. Devlin, K.G. Ewens, R.S. Spielman, and
V.G. Cheung. Genetic analysis of genome-wide variation in human gene expression.
Nature, 430:743–747, 2004.
S. Ramaswamy, P. Tamayo, R. Rifkin, S. Mukherjee, C. Yeang, M. Angelo, C. Ladd,
M. Reich, E. Latulippe, J. Mesirov, T. Poggio, W. Gerald, M. Loda, E. Lander, and
T. Golub. Multiclass cancer diagnosis using tumor gene expression signatures. Proceed-
ings of the National Academy of Sciences, 98:15149–15154, 2001.
W. Reeves, D. Wagner, R. Nisenbaum, J. Jones, B. Gurbaxani, L. Solomon, D. Papan-
icolaou, E. Unger, S. Vernon, and C. Heim. Chronic fatigue syndrome - a clinically
empirical approach to its definition and study. BMC Medicine, 3(1):19, 2005.
P.D. Sampson, A.P. Streissguth, H.M. Barr, and F.L. Bookstein. Neurobehavioral ef-
fects of prenatal alcohol: Part II. Partial least squares analysis. Neurotoxicology and
Teratology, 11(5):477–491, 1989.
J. Schafer and K. Strimmer. An empirical Bayes approach to inferring large-scale gene
association networks. Bioinformatics, 1(1), 2004.
G.W. Stewart. Introduction to matrix computations. Academic press, New York, 1973.
A.P. Streissguth, F.L. Bookstein, P.D. Sampson, and H.M. Barr. The enduring effects of
prenatal alcohol exposure on child development: birth through seven years, a partial
least squares solution. International Academy for Research in Learning Disabilities
Monograph Series. University of Michigan Press, Ann Arbor, 10, 1993.
R. Tibshirani. Regression shrinkage and selection via the lasso. Journal of Royal Statis-
tical Society B series, 58(1):267–288, 1996.
D. Tritchler, Y. Liu, and S. Fallah. A test of linkage for complex discrete and continuous
traits in nuclear families. Biometrics, 59(8):382–392, 2003.
S. Wang, T. Zheng, and Y. Wang. Transcription activity hotspot, is it real or an artifact?
Number 1(Suppl1) in BMC Proceedings, page S10, 2007.
Jacob A. Wegelin. A survey of partial least squares (pls) methods, with emphasis on the
two-block case. URL citeseer.ist.psu.edu/wegelin00survey.html.
T. Whistler, E. Unger, R. Nisenbaum, and S. Vernon. Integration of gene expression,
clinical and epidemiologic data to characterize chronic fatigue syndrome. 2003.
T. Whistler, J. Jones, E. Unger, and S. Vernon. Exercise responsive genes measured
in peripheral blood of women with chronic fatigue syndrome and matched control
subjects. BMC Physiology, 5:5, 2005.
H. Wold. Path models with latent variables: the NIPALS approach, pages 307–357. Quan-
titative sociology: international perspectives on mathematical and statistical modeling.
Academic, 1975.
H. Wold. Soft modeling: the basic design and some extensions. In K.G. Joreskog and H. Wold,
editors, Systems under indirect observation: causality, structure, prediction, Part II,
number 139 in Proceedings of the Conference on Systems Under Indirect Observa-
tion, pages 1–54, Cartigny, Switzerland, October 1982. North Holland.
H. Wold. Partial Least Squares, pages 581–591. Encyclopedia of the statistical sciences.
Wiley, 1985.
Y. Wu, D.D. Boos, and L.A. Stefanski. Controlling variable selection by the addition
of pseudovariables. Journal of the American Statistical Association, 102(477):235–243,
2007.
H. Zou. The adaptive lasso and its oracle properties. Journal of the American Statistical
Association, 101(476):1418–1429, 2006.
H. Zou and T. Hastie. Regularization and variable selection via the elastic net. Journal
of Royal Statistical Society B series, 67(2):301–320, 2005.
H. Zou, T. Hastie, and R. Tibshirani. Sparse principal component analysis. Technical
report, Statistics department, Stanford University, 2004.