For Peer Review
Empirical Kernel Map Approach to Nonlinear
Underdetermined Blind Separation of Sparse Nonnegative Dependent Sources: Pure Components Extraction from
Nonlinear Mixtures Mass Spectra
Journal: Journal of Chemometrics
Manuscript ID: CEM-14-0009.R1
Wiley - Manuscript type: Research Article
Date Submitted by the Author: n/a
Complete List of Authors:
Kopriva, Ivica; Ruđer Bošković Institute, Laser and Atomic Research and Development
Jerić, Ivanka; Ruđer Bošković Institute, Organic Chemistry and Biochemistry
Filipović, Marko; Ruđer Bošković Institute, Laser and Atomic Research and Development
Brkljačić, Lidija; Ruđer Bošković Institute, Organic Chemistry and Biochemistry
Keyword: Nonlinear underdetermined blind source separation, Robust principal component analysis, Thresholding, Empirical kernel maps, Nonnegative matrix factorization
http://mc.manuscriptcentral.com/cem
Journal of Chemometrics
Empirical Kernel Map Approach to Nonlinear
Underdetermined Blind Separation of Sparse
Nonnegative Dependent Sources: Pure Components
Extraction from Nonlinear Mixtures Mass Spectra
Ivica Kopriva1*, Ivanka Jerić2, Marko Filipović1 and Lidija Brkljačić2
Ruđer Bošković Institute, Bijenička cesta 54, HR-10000, Zagreb, Croatia
1Division of Laser and Atomic Research and Development
phone: +385-1-4571-286, fax:+385-1-4680-104
e-mail: [email protected], [email protected]
2Division of Organic Chemistry and Biochemistry
e-mail: [email protected], [email protected]
Abstract
Nonlinear underdetermined blind separation of nonnegative dependent sources consists
in decomposing a set of observed nonlinearly mixed signals into a greater number of
original nonnegative and dependent component (source) signals. This hard problem is
practically relevant for contemporary metabolic profiling of biological samples, where
sources (a.k.a. pure components or analytes) are to be extracted from the mass
spectra of nonlinear multicomponent mixtures. This paper presents a method for
nonlinear underdetermined blind separation of nonnegative dependent sources that
comply with a sparse probabilistic model, i.e. the sources are constrained to be sparse in
support and amplitude. The model is validated on experimental pure components mass
spectra. Under the sparse prior, the nonlinear problem is converted into an equivalent linear one
comprised of the original sources and their higher-order, mostly second-order, monomials.
The influence of these monomials, which stand for error terms, is reduced by preprocessing
the matrix of mixtures by means of robust principal component analysis and hard, soft and
trimmed thresholding. The preprocessed data matrices are mapped into a high-dimensional
reproducing kernel Hilbert space (RKHS) of functions by means of an empirical kernel
map. Sparseness-constrained nonnegative matrix factorizations (NMF) in the RKHS yield
sets of separated components. They are assigned to pure components from the library
using the maximal correlation criterion. The methodology is exemplified on demanding
numerical and experimental examples related, respectively, to the extraction of 8 dependent
components from 3 nonlinear mixtures and to the extraction of 25 dependent analytes from
9 nonlinear mixtures mass spectra recorded in a nonlinear chemical reaction of peptide
synthesis.
Key words: Nonlinear underdetermined blind source separation, Robust principal
component analysis, Thresholding, Empirical kernel maps, Nonnegative matrix
factorization.
1. INTRODUCTION
Identification of pure components present in a mixture is a traditional problem in
spectroscopy (nuclear magnetic resonance, infrared, Raman) and mass spectrometry [1-4].
Identification often proceeds by matching the spectra of separated components with a
library of reference compounds [5-7], where the degree of correlation depends on how
well the pure components are separated from each other. Of interest, therefore, are blind
source separation (BSS) methods that use only the matrix of recorded mixture
spectra as input information [8-11]. In the majority of scenarios, separation of pure
components is performed by assuming that mixture spectra are linear combinations of
pure components [1-4]. While the linear mixture model is adequate for many scenarios,
a nonlinear model offers a more accurate description of processes and interactions
occurring in biological systems. Living organisms are the best examples of complex
nonlinear systems that function far from equilibrium. Internal and external stimuli
(disease, drug treatment, environmental changes) cause perturbations in the system as a
result of highly synchronized molecular interactions [12]. In contrast to the many BSS
methods developed for linear problems, the number of methods that address the nonlinear
BSS problem is considerably smaller, see for example chapter 14 in [11]. That number
is reduced further when the nonlinear BSS problem is underdetermined, that is, when the
number of pure components is greater than the number of mixtures. That is why metabolic
profiling, which aims to identify and quantify small-molecule analytes (a.k.a. pure
components or sources) present in biological samples (typically urine, serum or
biological tissue extract), is seen as one of the most challenging tasks in systems biology
[13]. Therefore, the underdetermined problem is of practical relevance.
The aim of the paper is to present a method for blind separation of pure
components from a smaller number of multicomponent nonlinear mixtures mass spectra.
It is assumed that the components are nonnegative and sparse. To this end, we
address the underdetermined nonlinear nonnegative BSS (uNNBSS) problem with sparse
and dependent sources. As discussed at length in [4], even the linear
underdetermined BSS problem comprised of dependent sources is challenging, with only
a few algorithms addressing it. There is basically no method proposed for the uNNBSS
problem. Herein, we propose a method for the uNNBSS problem that can be considered a
generalization of the method developed in [4] for the underdetermined linear nonnegative
BSS (uLNBSS) problem comprised of dependent sources. The proposed method constrains
the sources to be nonnegative and to comply with a sparse probabilistic model [14, 15], that is,
the sources are assumed to be sparse in support and amplitude. The model is validated on
experimental mass spectrometry data and is therefore practically relevant, see section
3.2. This represents the first original contribution of the paper. Under this sparse prior,
the nonlinear problem is approximated by a linear one comprised of the original sources and
their second-order monomials. This follows from analytical derivations based on the Taylor
expansion of the nonlinear mixture model (that is, a vector function with a vector argument)
up to an arbitrary order. Analytical derivation of the Taylor expansion based on the Tucker
model of tensor derivatives represents, arguably, the second original contribution of the
paper. The key contribution of the paper is the reduction of the influence of higher-order
monomials that stand for error terms. That is achieved by preprocessing the matrix of
mixtures by means of robust principal component analysis (RPCA) [16, 17], hard (HT) and
soft (ST) thresholding [18] and trimmed thresholding (TT) [19]. The preprocessed data matrices are
mapped observation-wise into a high-dimensional RKHS by means of the empirical kernel map
(EKM). Thus, one uNNBSS problem is converted into four nonnegative BSS problems
in the RKHS with the same number of observations but an increased number of mixtures.
Sparseness-constrained NMF is performed in the RKHS to solve these nonnegative BSS
problems. The components separated by NMF are then assigned to pure components
from the library using the maximal correlation criterion.
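The assignment step can be illustrated with a minimal sketch. It assumes that the normalized correlation is the uncentered (cosine) correlation between nonnegative spectra stored as matrix rows; the function name `assign_by_max_correlation` and the toy spectra are hypothetical and serve only as illustration.

```python
import numpy as np

def assign_by_max_correlation(library, separated):
    """For each library spectrum (row), return the index of the separated
    component (row) with the highest normalized correlation, and that value."""
    lib = np.asarray(library, dtype=float)
    sep = np.asarray(separated, dtype=float)
    # Normalize rows to unit Euclidean norm; guard against all-zero rows.
    lib_n = lib / np.maximum(np.linalg.norm(lib, axis=1, keepdims=True), 1e-12)
    sep_n = sep / np.maximum(np.linalg.norm(sep, axis=1, keepdims=True), 1e-12)
    corr = lib_n @ sep_n.T                      # shape: (n_library, n_separated)
    best = corr.argmax(axis=1)
    return best, corr[np.arange(lib.shape[0]), best]

# Toy example: two library spectra, three separated components.
library = np.array([[1.0, 0.0, 0.0, 0.5],
                    [0.0, 1.0, 0.2, 0.0]])
separated = np.array([[0.0, 2.0, 0.4, 0.0],    # scaled copy of library[1]
                      [0.1, 0.1, 0.1, 0.1],    # flat component, matches neither well
                      [3.0, 0.0, 0.0, 1.5]])   # scaled copy of library[0]
idx, score = assign_by_max_correlation(library, separated)
```

Because the rows are normalized, the criterion is insensitive to the scaling indeterminacy of BSS; the permutation indeterminacy is resolved by the `argmax`.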
The rest of the paper is organized as follows. Section 2 gives an overview of
nonlinear BSS methods and presents the theory upon which the proposed uNNBSS method is built.
Section 3 describes experiments performed on computational and experimental data.
Section 4 presents and discusses the results of a comparative performance analysis between
the proposed uNNBSS method and some state-of-the-art NMF algorithms. Concluding remarks are
given in Section 5.
2. THEORY AND ALGORITHM
The intended application of the proposed uNNBSS method is the extraction of analytes from
multicomponent nonlinear mixtures of mass spectra. As emphasized in [4], mass
spectrometry is chosen due to its increasing importance in clinical chemistry, safety and
quality control as well as biomarker discovery and validation. As in [4, 5], we assume
that a library of reference mass spectra is available to evaluate the quality of the components
extracted by the proposed method.1 For example, the NIST and Wiley-Interscience
universal spectral library [7] contains more than 800 000 mass spectra (corresponding
1 Please note that any BSS algorithm, when applied to experimental data, requires some kind of expert knowledge to evaluate the separation results. Herein the library of pure components is such an "expert". The same concept is used in hyperspectral image analysis, where identification of minerals proceeds by comparison of estimated endmembers with spectral profiles stored in a library, see for example the ASTER spectral library at [20].
to more than 680 000 compounds). As opposed to [4], where a linear mixture model is
assumed, a nonlinear model is assumed herein. The linear model is thereby implicitly
included as a special case.
From the viewpoint of the uNNBSS problem with dependent sources, existing
algorithms for the nonlinear BSS problem have at least one of several deficiencies: (i)
they assume that the number of mixtures is equal to or greater than the unknown number of
sources [21-29]; (ii) they do not take into account the nonnegativity constraint that is
present when the sources are pure components mass spectra [21-32]; (iii) they assume that
the source signals are statistically independent [22-24, 27, 28-32] and, sometimes,
individually correlated [28, 30, 31]. None of these assumptions holds true for the
uNNBSS problem considered herein. The algorithm described in [33] is developed for the
uNNBSS problem composed of nonnegative sources. However, it assumes that a set of
observation indexes exists such that each source is present
alone in at least one of these observations. That assumption seems too strong for the
considered uNNBSS problem, where mass spectra of structurally similar pure
components are expected to overlap. That is especially the case if the resolution of the
mass spectrometer is low. Algorithms [34-36] execute nonlinear nonnegative BSS by
means of nonnegative matrix factorization (NMF) in a reproducing kernel Hilbert space
(RKHS). Nevertheless, unlike the uNNBSS method proposed herein, they do not: (i)
enforce the sparseness constraint that is shown herein to be an enabling condition for solving
the otherwise intractable uNNBSS problem; (ii) reduce the influence of higher-order
monomials of the original sources (error terms) induced by the nonlinear mixing process,
which is shown herein to be crucial for obtaining a reasonably accurate solution of the
uNNBSS problem. As seen in section 2.2, the uNNBSS problem is converted into an
equivalent uLNBSS problem with a large number of sources: the original ones and their
higher-order monomials induced by the nonlinear mixing process. Without activation of the
sparse probabilistic prior the equivalent uLNBSS problem is intractable.
As seen in sections 3.1 and 3.2, the proposed methodology significantly
improves accuracy relative to the case when the NMF algorithm is performed on the
EKM-mapped matrix of mixtures data without suppression of higher-order monomials. It has
already been discussed in [37, 4] that the performance of many NMF algorithms depends
on the optimal choice of parameters required to be known a priori, such as the balance
parameter that regulates the influence of the sparseness constraint [38], or the number of
overlapping components that exist in the mixtures [39]. Often, these parameters are difficult
to select optimally in practice. That is why the nonnegative matrix underapproximation
(NMU) algorithm [40] is proposed to solve the nonnegative BSS problems in the RKHS:
it does not require a priori information from the user. Thus, we propose herein to
combine the RPCA, HT, ST and TT preprocessing transforms (PTs) and EKM-based nonlinear
mapping with the NMU algorithm in the mapping-induced high-dimensional RKHS; hence
the name PTs-EKM-NMU algorithm. The PTs-EKM-NMU algorithm is exemplified on numerical and
experimental problems. Nevertheless, the proposed preprocessing transforms can also be
used in combination with other sparseness-constrained NMF algorithms. Provided that the
number of overlapping components can be inferred reasonably accurately, an NMF
algorithm with $\ell_0$-constraints (NMF_L0) [39] is a good choice.
2.1 Underdetermined nonlinear nonnegative blind source separation with
dependent sources
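The uNNBSS problem with dependent sources is described as:
$$\mathbf{x}_t = \mathbf{f}(\mathbf{s}_t), \quad t = 1,\dots,T \qquad (1)$$
where $\mathbf{x}_t \in \mathbb{R}_{0+}^{N\times 1}$ stands for the nonnegative measurement vector comprised of intensities
acquired at some of $T$ mass-to-charge (m/z) channels, and $\mathbf{s}_t \in \mathbb{R}_{0+}^{M\times 1}$ stands for the unknown
vector comprised of intensities of $M$ nonnegative sources. $\mathbf{f}: \mathbb{R}_{0+}^{M} \rightarrow \mathbb{R}_{0+}^{N}$ is an unknown
multivariate mapping such that $\mathbf{f}(\mathbf{s}_t) = [f_1(\mathbf{s}_t) \; \dots \; f_N(\mathbf{s}_t)]^T$ and $\{f_n: \mathbb{R}_{0+}^{M} \rightarrow \mathbb{R}_{0+}\}_{n=1}^{N}$.
Problem (1) can be cast in the matrix framework:
$$\mathbf{X} = \mathbf{f}(\mathbf{S}) \qquad (2)$$
such that $\mathbf{X} \in \mathbb{R}_{0+}^{N\times T}$, $\mathbf{S} \in \mathbb{R}_{0+}^{M\times T}$, where $\{\mathbf{x}_t\}_{t=1}^{T}$ and $\{\mathbf{s}_t\}_{t=1}^{T}$ are column vectors of matrices
$\mathbf{X}$ and $\mathbf{S}$ respectively, and $\mathbf{f}(\mathbf{S})$ implies that the nonlinear mapping is performed
column-wise as in (1). It is further assumed that $\{\|\mathbf{s}_t\|_0 \le L\}_{t=1}^{T}$, where
$\|\mathbf{s}_t\|_0$ stands for the $\ell_0$ quasi-norm that counts the number of non-zero coefficients of $\mathbf{s}_t$ and $L = \max_{t=1,\dots,T} \|\mathbf{s}_t\|_0$.
Evidently, $L \le M$, where $L$ denotes the maximal number of sources that can be
present at any coordinate $t$. The uNNBSS problem implies that the components mass
spectra, $\{\mathbf{s}_m \in \mathbb{R}_{0+}^{1\times T}\}_{m=1}^{M}$, ought to be inferred from the mixture data matrix $\mathbf{X}$ only. In this
paper the following assumptions are made on the nonlinear mixture model (1)/(2):
A1) $0 \le x_{nt} \le 1$ for all $n=1,\dots,N$ and $t=1,\dots,T$,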
A2) $0 \le s_{mt} \le 1$ for all $m=1,\dots,M$ and $t=1,\dots,T$,
A3) $M > N$,
A4) The amplitude $s_{mt}$ obeys an exponential distribution on the (0, 1] interval and a discrete
distribution at zero, see also eq. (3),
A5) The components of the vector-valued function $\mathbf{f}(\mathbf{s})$, $\{f_n(\mathbf{s}): \mathbb{R}_{0+}^{M} \rightarrow \mathbb{R}_{0+}\}_{n=1}^{N}$,
are differentiable up to an unknown order $K$,
A6) $M \ll T$.
To avoid confusion between column and row vectors they will be indexed by lowercase
letters that correspond with the uppercase letters related to the dimensions of the corresponding
matrix. As an example, $\mathbf{s}_t$ refers to the column vector and $\mathbf{s}_m$ to the row vector of the matrix
$\mathbf{S} \in \mathbb{R}_{0+}^{M\times T}$. Uppercase bold letters denote matrices, lowercase bold letters
denote vectors and italic lowercase letters denote scalars. In order to be useful, the solution
of the uNNBSS problem is expected to be essentially unique, that is, the estimated matrix of
pure components (sources) $\hat{\mathbf{S}}$ and the true matrix of pure components $\mathbf{S}$ have to be
related through $\hat{\mathbf{S}} = \mathbf{P}\boldsymbol{\Lambda}\mathbf{S}$, where $\mathbf{P}$ and $\boldsymbol{\Lambda}$ stand respectively for $M \times M$ permutation and
diagonal matrices. As discussed at length in [4], even the linear underdetermined BSS
problem requires constraints to be imposed on the sources in order to ensure an essentially
unique solution. The nonlinear BSS problem is more difficult. Herein, we assume that the pure
components $\{\mathbf{s}_m\}_{m=1}^{M}$ comply with the sparse probabilistic model imposed by A4. It implies
that each component will be zero on a great part of its support (number of m/z channels $T$)
and that the non-zero intensities will be distributed according to an exponential
distribution with a small expected value. These two constraints are expected to ensure
that, in probability, the maximal number of analytes L present at
a particular m/z coordinate is small enough compared to N and M. However, N stands for the number of
biological samples available and it is expected to be small. Thus, it can be virtually
impossible to satisfy the above requirement. That is why, as in [4], in order to increase the
number of measurements (samples) the original uNNBSS problem (1) has to be mapped
into an RKHS by using the EKM. Before that, we need to approximate the uNNBSS problem
(1)/(2) by an equivalent uLNBSS problem.
2.2 Sparse probabilistic model of source signals
The Taylor expansion of the nonlinear model (1) up to an arbitrary order $K$ is derived in the
Supporting Information. It is shown that the uNNBSS problem (1) can be represented by an
equivalent uLNBSS problem, eq. (7) in the Supporting Information, comprised of $M$
original sources and $\sum_{k=2}^{K} M^{(k)}$ higher-order monomials, where $M^{(k)} = \binom{M+k-1}{k}$. Thus,
without further constraints the uNNBSS problem (1) is computationally intractable. That is
why, according to A4, we assume that the sources comply with a sparse probabilistic model
comprised of a mixed state distribution [14, 15, 4]:
$$p(s_{mt}) = \rho_m\,\delta(s_{mt}) + (1-\rho_m)\,\delta^{*}(s_{mt})\,f(s_{mt}), \quad m=1,\dots,M,\; t=1,\dots,T \qquad (3)$$
where $\delta(s_{mt})$ is an indicator function and $\delta^{*}(s_{mt}) = 1-\delta(s_{mt})$ is its complementary function,
$\{P(s_{mt}=0) = \rho_m\}_{t=1}^{T}$. Hence, $\{P(s_{mt}>0) = 1-\rho_m\}_{t=1}^{T}$. The nonzero state of $s_{mt}$ is
distributed according to $f(s_{mt})$. We have chosen the exponential
distribution $f(s_{mt}) = (1/\mu_m)\exp(-s_{mt}/\mu_m)$ to model the sparse distribution of the nonzero
states, in which case the expected value of a nonzero amplitude equals $\mu_m$. It has been verified in
[4] that model (3) describes well the mass spectra of pure components. Herein, by using
mass spectra of 25 pure components we have estimated $\hat{\rho}_m \in [0.27, 0.74]$ and
$\hat{\mu}_m \in [0.0012, 0.0014]$, see section 3.2 and Figure 4 for more details.2 Under the exponential
prior, the probability that the amplitude $s_{mt} \in [\varepsilon, \mu_m]$, for $\varepsilon > 0$ and $\varepsilon \ll 1$, is 0.632. Thus, in 36.8% of
the cases a random realization of $s_{mt}$ will have an amplitude greater than the expected
value $\mu_m$. For a given $\mu_m$ and a given probability $p = P(\varepsilon \le s_{mt} \le \Delta s)$ the value of $\Delta s$ is obtained as
$\Delta s \approx -\mu_m \ln(1-p)$. Thus, for $p=0.99$ and $\mu_m = 1.5\times 10^{-3}$ it follows that $\Delta s = 7\times 10^{-3}$. Hence, we may
approximate the equivalent uLNBSS model, eq. (7) in the Supporting Information, by retaining
second-order terms only:
$$\mathbf{X} = \mathbf{G}^{1}_{(1)}\mathbf{S} + \frac{1}{2}\,\mathbf{G}^{2}_{(1)} \begin{bmatrix} \mathbf{s}_1^{2} \\ \vdots \\ \{\mathbf{s}_{m_1}\mathbf{s}_{m_2}\}_{m_1,m_2=1}^{M} \\ \vdots \\ \mathbf{s}_M^{2} \end{bmatrix} + \mathrm{HOT} \qquad (4)$$
2 Even though the exponential distribution has support on the $[0, \infty)$ interval, by setting $\mu = 0.01$ realizations will be contained in the [0, 1] interval with a probability that is close to 1, with an error of $3.72\times 10^{-44}$. This justifies the choice of the exponential distribution to model the sparse distribution of amplitudes $s_{mt}$ on the interval [0, 1].
where $\mathbf{G}^{1}_{(1)}$, respectively $\mathbf{G}^{2}_{(1)}$, stand for unfolded versions of the tensor of first-,
respectively second-, order derivatives and HOT stands for higher-order terms.
The contribution of third-order terms in (4) is of the order $(7\times 10^{-3})^3 = 3.43\times 10^{-7}$. In order to
reduce the HOT, entry-wise thresholding of $\mathbf{X}$ can be performed. By neglecting fourth- and
higher-order terms we have empirically arrived at a threshold value of $\tau \in [10^{-6}, 10^{-4}]$.3
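The numbers quoted for the sparse probabilistic model (3) can be checked with a short simulation. This is a minimal sketch under stated assumptions: the function name `sample_sources`, the value $\rho = 0.5$ (picked from inside the estimated range) and the sample size are illustrative choices and are not taken from the experiments.

```python
import numpy as np

def sample_sources(M, T, rho, mu, rng):
    """Draw an M x T source matrix from the mixed-state model (3):
    zero with probability rho, otherwise exponential with mean mu."""
    nonzero = rng.random((M, T)) >= rho
    amplitudes = rng.exponential(scale=mu, size=(M, T))
    # Clip to 1 as required by A2 (practically never active for small mu).
    return np.where(nonzero, np.minimum(amplitudes, 1.0), 0.0)

rng = np.random.default_rng(0)
rho, mu = 0.5, 1.5e-3                      # illustrative values (assumption)
S = sample_sources(M=8, T=20000, rho=rho, mu=mu, rng=rng)

zero_fraction = np.mean(S == 0)            # empirical estimate of rho
below_mean = np.mean(S[S > 0] <= mu)       # should approach 1 - exp(-1) = 0.632
delta_s = -mu * np.log(1 - 0.99)           # amplitude not exceeded with p = 0.99
```

With these settings `delta_s` evaluates to about $6.9\times 10^{-3}$, in line with the value $7\times 10^{-3}$ used above.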
2.3 Suppression of higher order (error) terms
Mass spectra of 25 pure components recorded in a nonlinear chemical reaction of peptide
bond formation, see section 3.2 and Figures 3 and S-4 in the Supporting Information,
illustrate the diversity of morphologies. Some have a few very dominant (large) peaks (see
spectra of pure components 1, 2, 8, 13, 16, 17, 18, 19, 20, 21, 22, 23, 24 and 25), while others
have intensities distributed over several m/z values, and these intensities can be small (see
spectra of pure components 3, 4, 5, 6, 7, 9, 10, 11, 12, 14 and 15). It is thus hard to
propose a single preprocessing (thresholding) transform for suppression of the higher-order
terms induced by the nonlinear mixing process. We therefore propose a combination of
methods for this purpose.
3 These threshold values can be justified by the following analysis. Due to A1 and A2 the elements of G in (7) in the Supporting Information are less than 1. In pursuing a worst-case analysis of third-order effects we assume that the third-order derivative coefficients in G are less than some value $g_3$. Thus, the contribution of third-order terms is bounded from above by $x^{(3)} = M^{(3)} g_3 (\Delta s)^3$. If the mixture value $x_{nt}$ is greater than $x^{(3)}$ then it is probably due to first- and second-order terms. The threshold value evidently depends on the values of $M^{(3)}$, $g_3$ and $\Delta s$. For example, assuming $M=100$ ($M^{(3)}=171700$), $g_3=0.1$ and $(\Delta s)^3 = 3.4\times 10^{-7}$ we get $x^{(3)} = 5.8\times 10^{-3}$. However, that is overly pessimistic given the fact that most of the third-order cross-products will, due to sparseness, vanish. Thus, the optimal threshold value is somewhere in the interval $[10^{-6}, 10^{-4}]$.
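The worst-case arithmetic of footnote 3 can be reproduced in a few lines. The helper name `num_monomials` is hypothetical; the count $\binom{M+k-1}{k}$ is the standard number of distinct order-$k$ monomials in $M$ variables, and it agrees with the value $M^{(3)}=171700$ quoted for $M=100$.

```python
from math import comb

def num_monomials(M, k):
    """Number of distinct order-k monomials in M variables: C(M + k - 1, k)."""
    return comb(M + k - 1, k)

M = 100
M3 = num_monomials(M, 3)          # third-order monomials for M = 100
g3 = 0.1                          # assumed bound on third-order derivative coefficients
ds_cubed = 3.4e-7                 # (7e-3)**3, rounded as in the footnote
x3_bound = M3 * g3 * ds_cubed     # worst-case third-order contribution

# Sources in the second-order model (4): M originals plus all order-2 monomials,
# i.e. M squares and M(M-1)/2 cross-products.
P = M + num_monomials(M, 2)
```

For $M=100$ this gives `P = 5150`, which equals $2M + M(M-1)/2$.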
2.3.1 Robust principal component analysis
RPCA has been proposed in [16, 17] to decompose a data matrix $\mathbf{X}$ into a sum of two
matrices: $\mathbf{X} = \mathbf{A} + \mathbf{E}$. Provided that $\mathbf{A}$ is a low-rank matrix and $\mathbf{E}$ is a sparse matrix, the
decomposition is unique and it is obtained as the solution of the optimization problem:
$$\text{minimize } \|\mathbf{A}\|_{*} + \lambda\|\mathbf{E}\|_{1} \quad \text{subject to: } \mathbf{A} + \mathbf{E} = \mathbf{X}. \qquad (5)$$
Thereby, $\|\mathbf{A}\|_{*} = \sum_{i=1}^{I_N} \sigma_i$ denotes the nuclear norm (sum of singular values) and $I_N$ is the rank
of matrix $\mathbf{A}$; $\|\mathbf{E}\|_{1} = \sum_{n=1}^{N}\sum_{t=1}^{T} |e_{nt}|$ denotes the $\ell_1$-norm of $\mathbf{E}$ and $\lambda = 1/\sqrt{T}$ is a regularization
constant. In terms of the equivalent uLNBSS problem (4), $\mathbf{A}$ is associated with the first- and
second-order terms and $\mathbf{E}$ is associated with the HOT. $\mathbf{A}$ is actually represented by a linear
mixture model composed of $2M + M(M-1)/2$ sources and $N$ mixtures. Since both $N$ and
$2M+M(M-1)/2$ are small compared to $T$, the rank of $\mathbf{A}$ equals $\min(N, 2M + M(M-1)/2) = N$.
Thus, it is low. $\mathbf{E}$ is comprised of monomials (products of the original source
components) of order three or higher. Since by assumption A4 the source components
are sparse in support and amplitude, their third- and higher-order products are either
zero or very small. Thus, $\mathbf{E}$ is sparse. Therefore, it is justified to use the RPCA
decomposition of $\mathbf{X}$ in (4) to suppress the higher-order terms induced by the nonlinear mixing
process. That yields an approximation of $\mathbf{X}$, that is $\mathbf{A}$, with suppressed higher-order terms.
In the experiments reported in Section 3 we have used the accelerated proximal gradient
algorithm [41], available with MATLAB code at [42], to solve (5).
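For illustration, problem (5) can also be solved with a basic ADMM (augmented Lagrangian) iteration. The sketch below is not the accelerated proximal gradient code of [41, 42]; it is a swapped-in, commonly used alternative, and the function names, the step-parameter heuristic and the synthetic test matrix are assumptions made for this example.

```python
import numpy as np

def soft(x, tau):
    """Entry-wise soft-thresholding (shrinkage) operator."""
    return np.sign(x) * np.maximum(np.abs(x) - tau, 0.0)

def svt(x, tau):
    """Singular value thresholding: soft-threshold the singular values."""
    u, s, vt = np.linalg.svd(x, full_matrices=False)
    return (u * soft(s, tau)) @ vt

def rpca_admm(X, lam=None, max_iter=1000, tol=1e-7):
    """Approximately solve min ||A||_* + lam*||E||_1 s.t. A + E = X by ADMM."""
    X = np.asarray(X, dtype=float)
    n, t = X.shape
    lam = 1.0 / np.sqrt(max(n, t)) if lam is None else lam
    mu = n * t / (4.0 * np.abs(X).sum())      # heuristic penalty parameter (assumption)
    A, E, Y = np.zeros_like(X), np.zeros_like(X), np.zeros_like(X)
    x_norm = np.linalg.norm(X)
    for _ in range(max_iter):
        A = svt(X - E + Y / mu, 1.0 / mu)     # low-rank update
        E = soft(X - A + Y / mu, lam / mu)    # sparse update
        R = X - A - E                         # constraint residual
        Y = Y + mu * R                        # dual (multiplier) update
        if np.linalg.norm(R) <= tol * x_norm:
            break
    return A, E

# Synthetic check: rank-2 matrix plus 5% sparse corruptions.
rng = np.random.default_rng(1)
low_rank = rng.standard_normal((100, 2)) @ rng.standard_normal((2, 100))
sparse = np.zeros((100, 100))
mask = rng.random((100, 100)) < 0.05
sparse[mask] = rng.uniform(-5, 5, size=mask.sum())
X = low_rank + sparse
A_hat, E_hat = rpca_admm(X)
rel_err = np.linalg.norm(A_hat - low_rank) / np.linalg.norm(low_rank)
```

In the setting of this section, `A_hat` plays the role of the matrix $\mathbf{A}$ with suppressed higher-order terms, and the default `lam` reduces to $1/\sqrt{T}$ when $T > N$.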
2.3.2 Hard thresholding
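The hard thresholding (HT) operator [18] can be applied entry-wise to $\mathbf{X}$ in (4) according
to:
$$b_{nt} = HT(x_{nt}) = \begin{cases} x_{nt} & \text{if } x_{nt} > \tau_1 \\ 0 & \text{if } x_{nt} \le \tau_1 \end{cases}$$
for $n=1,\dots,N$ and $t=1,\dots,T$, where $\tau_1 \in [10^{-6}, 10^{-4}]$ stands for
a threshold. The HT preprocessing transform of $\mathbf{X}$ yields the matrix $\mathbf{B}$ that is expected to have
the same structure as $\mathbf{A}$ in (5).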
2.3.3 Soft thresholding
The soft thresholding (ST) operator [18] can be applied entry-wise to $\mathbf{X}$ in (4) according to
$c_{nt} = ST(x_{nt}) = \max(0, x_{nt} - \tau_2)$, for $n=1,\dots,N$ and $t=1,\dots,T$, with $\tau_2 \in [10^{-6}, 10^{-4}]$. The ST
preprocessing transform of $\mathbf{X}$ yields the matrix $\mathbf{C}$ that, as $\mathbf{B}$ obtained by HT, is also
expected to have the same structure as $\mathbf{A}$ in (5).
2.3.4 Trimmed thresholding
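The trimmed thresholding (TT) operator [19] is applied entry-wise to $\mathbf{X}$ in (4) according
to:
$$d_{nt} = TT(x_{nt}) = \begin{cases} x_{nt}\,\dfrac{x_{nt}^{\alpha} - \tau_3^{\alpha}}{x_{nt}^{\alpha}} & \text{if } x_{nt} > \tau_3 \\ 0 & \text{if } x_{nt} \le \tau_3 \end{cases}$$
for $n=1,\dots,N$ and $t=1,\dots,T$, with $\tau_3 \in [10^{-6}, 10^{-4}]$.
$\alpha$ is a trade-off parameter between hard and soft thresholding. When $\alpha=1$, TT equals ST.
When $\alpha \rightarrow \infty$, TT is equivalent to HT. Herein, we set $\alpha=3.5$ because this value makes TT
operate between ST and HT [19]. TT preprocessing transform of X yields matrix D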
that, as B obtained by HT and C obtained by ST, is also expected to have the same
structure as A in (5).
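The three entry-wise operators of sections 2.3.2 to 2.3.4 can be sketched as follows for nonnegative data. The function names and the toy matrix are illustrative assumptions; the threshold $10^{-5}$ is one value from the interval $[10^{-6}, 10^{-4}]$ used in the text.

```python
import numpy as np

def hard_threshold(X, tau):
    """HT: keep entries larger than tau, zero the rest."""
    X = np.asarray(X, dtype=float)
    return np.where(X > tau, X, 0.0)

def soft_threshold(X, tau):
    """ST for nonnegative data: max(0, x - tau)."""
    X = np.asarray(X, dtype=float)
    return np.maximum(0.0, X - tau)

def trimmed_threshold(X, tau, alpha=3.5):
    """TT: x * (x**alpha - tau**alpha) / x**alpha above tau, zero otherwise."""
    X = np.asarray(X, dtype=float)
    keep = X > tau
    safe = np.where(keep, X, 1.0)   # dummy value avoids division by zero on dropped entries
    return np.where(keep, X * (1.0 - (tau / safe) ** alpha), 0.0)

X = np.array([[0.0, 5e-6, 2e-5],
              [1e-4, 3e-3, 0.5]])   # toy nonnegative mixture matrix
tau = 1e-5
B = hard_threshold(X, tau)
C = soft_threshold(X, tau)
D = trimmed_threshold(X, tau)       # alpha = 3.5 as in the text
```

For any $\alpha \ge 1$ the trimmed output lies entry-wise between the soft and the hard output, which is the sense in which TT operates between ST and HT.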
2.4 Empirical kernel map based nonlinear mapping of preprocessed mixture
matrix
So far we have substituted the uNNBSS problem (1)/(2) by four uLNBSS problems in the
form of (4). While the original uNNBSS problem is characterized by the nonlinear multivariate
mapping $\mathbf{f}$ and the triplet $(N, M, L)$, the uLNBSS problems are characterized by $(N, P, Q)$,
where $P \approx 2M + M(M-1)/2$ stands for the number of dependent sources in (4) and $Q \approx 2L + L(L-1)/2$
stands for the maximal number of sources at a particular m/z coordinate. Since by
assumption A3 $M > N$, it follows that $P \gg N$. Thus, even with activation of the sparseness
constraints imposed by A4 it will be virtually impossible to ensure an essentially unique
solution of these uLNBSS problems. To this end, as in [4], we apply the EKM-based
nonlinear mapping of the uLNBSS problems represented by the preprocessed mixture matrices
$\mathbf{A}$, $\mathbf{B}$, $\mathbf{C}$ and $\mathbf{D}$ to an RKHS in order to increase the number of samples/mixtures from $N$ to
$D \gg N$. The related theory and discussion have been presented in great detail in section
2.2 in [4]. We therefore present it in a concise form herein. The EKM of the column vectors
$\{\mathbf{a}_t\}_{t=1}^{T}$ in (4) with respect to a basis $\{\mathbf{v}_d\}_{d=1}^{D}$ is $\Psi: \mathbb{R}^{N} \rightarrow \mathbb{R}^{D}$, such that:
$$\Psi(\mathbf{a}_t): \mathbf{a}_t \mapsto [\kappa(\mathbf{v}_1, \mathbf{a}_t), \dots, \kappa(\mathbf{v}_D, \mathbf{a}_t)]^{T}, \quad t = 1,\dots,T.$$
Thereby, $\kappa(\mathbf{v}_d, \mathbf{a}_t)$ is a
positive definite symmetric function. The basis $\{\mathbf{v}_d\}_{d=1}^{D}$ has to span the empirical set of
patterns $\{\mathbf{a}_t\}_{t=1}^{T}$ such that $\mathrm{span}\{\mathbf{v}_d\}_{d=1}^{D} \approx \mathrm{span}\{\mathbf{a}_t\}_{t=1}^{T}$. In this case
$\mathrm{span}\{\phi(\mathbf{v}_d)\}_{d=1}^{D} \approx \mathrm{span}\{\phi(\mathbf{a}_t)\}_{t=1}^{T}$, where $\{\mathbf{a}_t \in \mathbb{R}_{0+}^{N} \mapsto \phi(\mathbf{a}_t)\}_{t=1}^{T}$, i.e.
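$\{\mathbf{v}_d \in \mathbb{R}_{0+}^{N} \mapsto \phi(\mathbf{v}_d)\}_{d=1}^{D}$, is in principle an infinite-dimensional nonlinear mapping. If
$\phi(\mathbf{a}_t) = \kappa(\cdot, \mathbf{a}_t)$, respectively $\phi(\mathbf{v}_d) = \kappa(\cdot, \mathbf{v}_d)$, the projection of $\{\phi(\mathbf{a}_t)\}_{t=1}^{T}$ onto
$\{\phi(\mathbf{v}_d)\}_{d=1}^{D}$ yields in matrix form:
$$\Psi(\mathbf{A}) = \begin{bmatrix} \kappa(\mathbf{a}_1, \mathbf{v}_1) & \dots & \kappa(\mathbf{a}_T, \mathbf{v}_1) \\ \vdots & \ddots & \vdots \\ \kappa(\mathbf{a}_1, \mathbf{v}_D) & \dots & \kappa(\mathbf{a}_T, \mathbf{v}_D) \end{bmatrix} \qquad (6)$$
Herein, as in [4], we choose $\kappa(\mathbf{a}_t, \mathbf{v}_d) = \exp(-\|\mathbf{a}_t - \mathbf{v}_d\|^2/\sigma^2)$. When assumption A1
holds we can set $\sigma^2 \approx 1$. We analogously obtain the EKM mappings of matrices $\mathbf{B}$, $\mathbf{C}$ and $\mathbf{D}$,
which respectively yield the $D \times T$ matrices $\Psi(\mathbf{B})$, $\Psi(\mathbf{C})$ and $\Psi(\mathbf{D})$. Likewise, as in [4], we
use the k-means data clustering algorithm to estimate the basis $\mathbf{V}$ by clustering $\{\mathbf{a}_t\}_{t=1}^{T}$ into $D$
clusters. By setting $D=T$ clustering becomes unnecessary because each empirical
pattern is a basis vector. That, however, comes at an increased computing cost. By using the
sparseness assumption A4 it is shown in [4] that:
$$\Psi(\mathbf{A}) = \mathbf{Z} + \mathbf{G}\begin{bmatrix} \mathbf{0}_{1\times T} \\ \mathbf{S} \end{bmatrix} + \mathrm{HOT} \qquad (7)$$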
where $\mathbf{Z}$ is a bias term that does not play a role in the parts-based decomposition that
follows, $\mathbf{0}_{1\times T}$ is a row vector of zeros and $\mathbf{S} \in \mathbb{R}_{0+}^{P\times T}$ is the matrix with $P \approx 2M + M(M-1)/2$ rows
that contain the original source components and their second-order monomials. $\mathbf{G}$ is a
matrix of appropriate dimensions. The EKM-mapped matrices $\Psi(\mathbf{B})$, $\Psi(\mathbf{C})$ and $\Psi(\mathbf{D})$ follow
the same approximation as $\Psi(\mathbf{A})$ in (7). It is important to emphasize that in (4) the higher-order
(error) terms are induced by the nonlinear mixing process $\mathbf{f}(\mathbf{S})$, while in (7) they are
induced by the nonlinear character of the EKM. That is, the increase of the number of mixtures
from $N$ to $D$ in $\Psi(\mathbf{A})$, $\Psi(\mathbf{B})$, $\Psi(\mathbf{C})$ and $\Psi(\mathbf{D})$ comes at the cost of errors induced by the
EKM. However, as in [4] and (4), we can again apply preprocessing transforms to
suppress the HOT. Since matrices $\mathbf{B}$, $\mathbf{C}$ and $\mathbf{D}$ were obtained by respectively applying the HT,
ST and TT operators on $\mathbf{X}$ in (4), we apply these operators in the same order on $\Psi(\mathbf{B})$,
$\Psi(\mathbf{C})$ and $\Psi(\mathbf{D})$. In order to keep the level of notational complexity as low as possible we
keep the same notation for the thresholded versions of matrices $\Psi(\mathbf{B})$, $\Psi(\mathbf{C})$ and $\Psi(\mathbf{D})$. We
do not apply the RPCA decomposition on $\Psi(\mathbf{A})$ because its rank is dictated by $\mathbf{Z}$ and is
equal to $\min(D,T)=D$, which is not low. The final effect of the EKM-based mappings is to
ensure that sparseness-constrained factorization of $\Psi(\mathbf{A})$, $\Psi(\mathbf{B})$, $\Psi(\mathbf{C})$ and $\Psi(\mathbf{D})$ yields,
with greater probability, a more accurate solution compared to decomposition by the
same method of $\mathbf{A}$, $\mathbf{B}$, $\mathbf{C}$ and $\mathbf{D}$. That will be the case when the following condition
holds:
$$(D/N) \gg (P/M) \quad \text{and} \quad (D/N) \gg (Q/L). \qquad (8)$$
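Because $P \approx 2M+M(M-1)/2$ and $Q \approx 2L+L(L-1)/2$, we have $P/M \approx 2 + (M-1)/2 = M/2 + 3/2$ and $Q/L \approx L/2 + 3/2$, so condition (8) becomes: $(D/N) \gg (M/2+3/2)$
and $(D/N) \gg (L/2+3/2)$. The numerical problem studied in section 3 is characterized by
N=3, M=8, L=3 and D=T=1000, which gives $D/N \approx 333$ against 5.5 and 3. Evidently, the above condition is fulfilled.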
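A minimal sketch of the Gaussian-kernel EKM of eq. (6) for the case $D=T$, followed by a numerical check of condition (8) for the quoted problem sizes, is given below. The function name `empirical_kernel_map`, the random toy mixture matrix and $\sigma^2 = 1$ are assumptions made for illustration only.

```python
import numpy as np

def empirical_kernel_map(A, V, sigma2=1.0):
    """Gaussian-kernel EKM of the columns of A (N x T) w.r.t. basis V (N x D).
    Returns the D x T matrix with entries exp(-||a_t - v_d||^2 / sigma2)."""
    A = np.asarray(A, dtype=float)
    V = np.asarray(V, dtype=float)
    # Squared Euclidean distances between every basis vector and every pattern.
    sq_dist = (
        np.sum(V**2, axis=0)[:, None]
        + np.sum(A**2, axis=0)[None, :]
        - 2.0 * (V.T @ A)
    )
    return np.exp(-np.maximum(sq_dist, 0.0) / sigma2)

rng = np.random.default_rng(0)
N, M, L, T = 3, 8, 3, 1000
A = rng.random((N, T)) * 1e-2           # toy nonnegative mixtures in [0, 1]
Psi = empirical_kernel_map(A, V=A)      # D = T: every pattern is a basis vector
D = Psi.shape[0]

# Condition (8): D/N must greatly exceed P/M and Q/L.
P = 2 * M + M * (M - 1) // 2            # 44 sources in the second-order model
Q = 2 * L + L * (L - 1) // 2            # 9 sources active at one coordinate
ratios = (D / N, P / M, Q / L)          # roughly (333.3, 5.5, 3.0)
```

Choosing `V=A` avoids the k-means step at the cost of a $T \times T$ kernel matrix; passing cluster centres as `V` instead would give $D < T$ rows.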
2.5 Sparseness constrained factorization
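To increase the accuracy of the pure components extraction we apply sparseness-constrained
NMF (sNMF) in the RKHS to the matrices $\Psi(\mathbf{A})$, $\Psi(\mathbf{B})$, $\Psi(\mathbf{C})$ and $\Psi(\mathbf{D})$.4 That
yields four sets of separated components:
$$\{\mathbf{s}_m^{A}\}_{m=1}^{P} = \mathrm{sNMF}(\Psi(\mathbf{A})) \qquad (9)$$
$$\{\mathbf{s}_m^{B}\}_{m=1}^{P} = \mathrm{sNMF}(\Psi(\mathbf{B})) \qquad (10)$$
$$\{\mathbf{s}_m^{C}\}_{m=1}^{P} = \mathrm{sNMF}(\Psi(\mathbf{C})) \qquad (11)$$
$$\{\mathbf{s}_m^{D}\}_{m=1}^{P} = \mathrm{sNMF}(\Psi(\mathbf{D})) \qquad (12)$$
When it comes to the implementation of the sNMF algorithms we use, as in [4], the NMU
algorithm [40] with MATLAB code available at [44] and the NMF_L0 algorithm [39]
with MATLAB code available at [45]. The NMF_L0 algorithm was run with the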
4 To ensure an essentially unique decomposition, sparseness-constrained NMF algorithms have been formulated, such as [38, 39, 40]. However, only very recently has it been proved in [43], see Theorem 4 and Corollary 2, that uniqueness of some asymmetric NMF S=WH implies that each column of W (row of H) contains at least M-1 zeros, where M is the nonnegative rank of S.
following parameter setup: the reverse sparse nonnegative least squares sparse coder and
alternating nonnegative least squares for the dictionary update stage. The main reason for
preferring the NMU algorithm over other sparseness-constrained NMF algorithms is that
there are no regularization constants that require a tuning procedure. When performing the
NMU-based factorizations in (9) to (12), the unknown number of pure components $P$
needs to be given to the algorithm as an input. As in [4] we set $P = D = T$. That is, in
order not to lose any component we prefer to extract all $T$ rank-one factors.5 These
four sets of separated components are compared with the pure components stored in the
library using the normalized correlation coefficient. Each pure component is associated
with the separated component with which it has the highest correlation. As a reference in
the benchmark numerical study we have used the solution obtained by applying the
NMF_L0 algorithm to (9) to (12). Afterwards, the maximal correlation criterion has
been used to assign separated components to pure components in the library. NMF_L0
is based on a natural sparseness measure, the $\ell_0$-pseudo-norm of the component matrix
$\mathbf{S}$, which is known from compressed sensing theory [47] to yield the best results
when the sparseness of $\mathbf{S}$ decreases. The NMF_L0 algorithm, when applied in (9) to (12), requires a
priori information on the number of components $P$ and the number of overlapping
components $Q$, which are related to $M$ and $L$ through $P \approx 2M + M(M-1)/2$ and
$Q \approx 2L+L(L-1)/2$. In the numerical scenario both $M$ and $L$ are known, while in the experimental
5 The factorization problems (9) to (12) are related to the determination of the nonnegative rank of a nonnegative matrix, which is defined as the smallest number of rank-one matrices into which the original matrix can be decomposed [46]. For a matrix $\Psi \in \mathbb{R}_{0+}^{D \times T}$ with $D \le T$, the nonnegative rank equals the smallest positive integer $P$ for which there exist nonnegative column vectors $\{g_p\}_{p=1}^{P}$ such that each column vector of $\Psi$ can be represented as a linear combination, with nonnegative coefficients, of the column vectors $\{g_p\}_{p=1}^{P}$.
scenario selection of the optimal (true) value of L is hard. We summarize the
PTs-EKM-NMU/NMF_L0 algorithm in Algorithm 1.
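The maximal correlation assignment described in this section can be sketched as follows. This is a minimal illustration, not the authors' MATLAB code: it assumes that the normalized correlation coefficient is the uncentered inner product of unit-norm spectra, and that an assignment counts as incorrect for every pure component that shares its separated component with another one. All function names are hypothetical.

```python
import numpy as np

def normalized_correlation(library, separated):
    """Correlation between rows of library (K x T) and separated (P x T).

    Assumption: uncentered correlation, i.e. inner product of unit-norm rows.
    """
    lib = library / np.maximum(np.linalg.norm(library, axis=1, keepdims=True), 1e-12)
    sep = separated / np.maximum(np.linalg.norm(separated, axis=1, keepdims=True), 1e-12)
    return lib @ sep.T

def assign_by_max_correlation(library, separated):
    """For each pure component return the index and value of its best match."""
    corr = normalized_correlation(library, separated)
    idx = corr.argmax(axis=1)
    best = corr[np.arange(corr.shape[0]), idx]
    return idx, best

def count_incorrect(idx):
    """Count pure components whose separated component is shared with another one."""
    counts = np.bincount(idx)
    return int(np.sum(counts[idx] > 1))

# Toy example: three library spectra, two separated components.
library = np.array([[1.0, 0.0, 0.0, 0.0],
                    [0.0, 1.0, 0.0, 0.0],
                    [0.0, 0.0, 1.0, 1.0]])
separated = np.array([[0.9, 0.1, 0.0, 0.0],
                      [0.0, 0.0, 1.0, 0.8]])

idx, best = assign_by_max_correlation(library, separated)
n_incorrect = count_incorrect(idx)
```

In the toy example the first two library spectra compete for the same separated component, so both are counted as incorrectly assigned under the convention assumed here.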
3.0 EXPERIMENTS
Studies on numerical and experimental data reported below were executed on a personal
computer running a 64-bit Windows operating system, with 64 GB of RAM and an
Intel Core i7-3930K processor operating at a clock speed of 3.2 GHz. The MATLAB
2012b environment was used for programming.
3.1 Numerical study
In the numerical study we simulate the uNNBSS problem (2) with N=3, M=8, L=3 and
T=1000. Source signals were generated according to the mixed state probabilistic model (3)
with exponential prior. Thereby, $\mu_m = 1.5 \times 10^{-3}$ for m=1,...,M. We have generated two
scenarios with $\rho_m = 0.5$ and $\rho_m = 0.8$ for m=1,...,M. The values for $\rho_m$ and $\mu_m$ are equivalent to
those obtained by fitting the probabilistic model (3) to experimental mass spectra of 25 pure
components; see section 3.2 and Figure 4 for details. The uNNBSS problem (2) has
been simulated using nonlinear mixtures:
3 2 1 2 3 31 1 2 3 4 5 6 7 8( ) tan ( ) tanh( ) sin( )f s s s s s s s s s
3 3 1 2 22 1 2 3 4 5 6 7 8( ) tanh( ) tan ( ) tanh( ) sin( )f s s s s s s s s s
1 2 3 3 13 1 2 3 4 5 6 7 8( ) sin( ) tan ( ) tanh( ) sin( ) tan ( )f s s s s s s s s s
The nonlinear mixtures were chosen arbitrarily to demonstrate the capability of the proposed
algorithm to solve a uNNBSS problem comprised of unknown nonlinear mixtures. The HT, ST and TT
operators used in steps 2, 3, 4 and 6 of Algorithm 1 were implemented with a threshold of $10^{-5}$, and
a parameter value of 3.5 has been used for the TT operator. Gaussian kernel based EKM has been used with
$\sigma^2 = 1$ and D=T=1000. Table 1 shows results of the comparative analysis, for the case of
$\rho_m = 0.8$, obtained by: NMU and NMF_L0 applied to uNNBSS (1)/(2); NMU and
NMF_L0 applied in (9) to (12) without suppression of higher order monomials (EKM-NMU
and EKM-NMF_L0); and NMU and NMF_L0 applied in (9) to (12) after RPCA,
HT, ST and TT preprocessing transforms (PTs-EKM-NMU and PTs-EKM-NMF_L0).
Due to the sparse prior imposed on the sources it was reasonable to expect that useful results
can be obtained by direct factorization of the uNNBSS problem (2). Results for $\rho_m = 0.5$ are
shown in Table S-1 in Supporting Information, while results for $\rho_m = 0.8$ and $\rho_m = 0.5$ as a
function of Monte Carlo index are shown in Figure 1. For the value of the normalized
correlation coefficient between a pure component and its assigned separated component we
evaluate performance in terms of the four metrics described in the caption of Table 1. They are
defined with respect to the predefined labeling of the pure components stored in the library.
The first three metrics are calculated for correctly assigned components only. That is
why NMU and NMF_L0 appear to have comparable performance in terms of the mean and
minimal correlation metrics. But they are inferior in the number of separated components
correlated with pure components with correlation greater than or equal to 0.6, as well as
in the number of (in)correctly assigned separated components (due to poor separation).
Thereby, incorrect assignment implies that two or more pure components are assigned
to the same separated component. We can also see that the preprocessing transforms
improve performance compared with factorization of the mixture data without
the preprocessing that suppresses higher order monomials.
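The HT, ST and TT operators used in this study can be sketched as element-wise functions. The hard and soft rules below are the standard ones; the trimmed rule is written in an assumed form that reduces to soft thresholding for an exponent of 1 and approaches hard thresholding as the exponent grows. The names tau and alpha are hypothetical labels for the threshold and the TT parameter.

```python
import numpy as np

def hard_threshold(x, tau):
    """Keep entries with magnitude above tau, set the rest to zero."""
    x = np.asarray(x, dtype=float)
    return np.where(np.abs(x) > tau, x, 0.0)

def soft_threshold(x, tau):
    """Shrink magnitudes by tau and set entries below tau to zero."""
    x = np.asarray(x, dtype=float)
    return np.sign(x) * np.maximum(np.abs(x) - tau, 0.0)

def trimmed_threshold(x, tau, alpha):
    """Assumed trimmed rule: x * (1 - (tau/|x|)**alpha) for |x| > tau, else 0."""
    x = np.asarray(x, dtype=float)
    mag = np.abs(x)
    keep = mag > tau
    out = np.zeros_like(x)
    out[keep] = x[keep] * (1.0 - (tau / mag[keep]) ** alpha)
    return out

# Toy values with a large threshold so the effect is visible.
x = np.array([0.5, 2.0, -3.0])
ht = hard_threshold(x, 1.0)
st = soft_threshold(x, 1.0)
tt = trimmed_threshold(x, 1.0, 3.5)
```

In this form a TT exponent of 3.5 shrinks surviving entries less than ST does, while still varying continuously at the threshold, unlike HT.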
3.2 Experimental data on chemical reaction comprising peptide synthesis
3.2.1 Chemicals
The chemical reaction was performed according to the following procedure: L-leucine
(200 mg, 1.52 mmol) was dissolved in 5 mL of dry dimethylformamide (DMF) and the
solution was cooled to 0 °C. N-methylmorpholine (NMM, 3.05 mmol, 337 μL) and
isobutylchloroformate (IBCF, 3.34 mmol, 458 μL) were added. Aliquots of the reaction
mixture (100 μL) were withdrawn every 30 minutes (t0-t8), the solvent was evaporated and
the residue was dissolved in 1 mL of 0.1 % formic acid (FA) in 50 % MeOH. Aliquots (100
μL) were diluted with 400 μL of 0.1 % FA in 50 % MeOH and 10 μL were injected
through the autosampler onto a column (Zorbax XDB C18, 3.5 μm, 4.7 mm) at a flow rate
of 0.5 mL/min. The mobile phase was 0.1 % FA in water (solvent A) and 0.1 % FA in
MeOH (solvent B). The gradient was applied as follows: 0 min 40 % B; 0-15 min 90 % B;
12-15 min 90 % B; 17.1 min 40 % B; 17.1-20 min 40 % B. Figure S-2 in Supporting
Information shows 9 chromatograms corresponding to the reaction mixture recorded at 9
time instants (t0-t8) during the reaction. Mass spectra of the 9 mixtures (x1 to x9), obtained
by full integration of the chromatograms, and mass spectra of the 25 pure components (s1 to
s25) arising during the reaction are respectively shown in Figure S-3 and Figure S-4 in
Supporting Information. Mass spectra of pure components 1, 4, 8 and 11 are also shown
in Figure 3.
3.2.2. Mass spectroscopy measurements
Electrospray ionization-mass spectrometry (ESI-MS) measurements operating in
positive ion mode were performed on an HPLC-MS triple quadrupole instrument
equipped with an autosampler (Agilent Technologies, Palo Alto, CA, USA). The
desolvation gas temperature was 300 °C with a flow rate of 8.0 L/min. The fragmentor
voltage was 135 V and the capillary voltage was 4.0 kV. Mass spectra were recorded in the m/z
segment of 10-2000. All data acquisition and processing was performed using Agilent
MassHunter software. Acquired mass spectra are composed of intensities at T=9901 m/z
coordinates.
3.2.3. Setting up an experiment
Peptides and proteins are compounds involved in numerous biological processes of key
importance, such as cell-cell communication, immune response, cell growth and
proliferation, and hormonal and enzymatic activity. They are therefore of ever-increasing
interest as tools in studies of biological systems and as modulators of biological functions.
Chemical synthesis of peptides involves condensation of two suitably protected parts
(amino acids or peptides) in order to obtain a single, desirable product. However, for the
purpose of this work, a different approach was undertaken. A non-protected amino acid,
L-leucine, was allowed to react under basic conditions (NMM) in the presence of IBCF
giving various products: di-, tri- and tetrapeptides as well as the corresponding intermediates.
Nonlinearity of the described reaction was assured based on the following: (i) the
concentration of individual components does not change linearly with time and (ii) as
the reaction proceeds, new components appear that were not present at the beginning of the
reaction. Figure 2 schematically describes possible components present in the reaction
mixture. It is important to note that the aim of this experiment was not to determine the
structure of all components, but to provide reliable experimental data on a nonlinear
reaction. The library of compounds required for the validation of the algorithm was built by
integration of each peak in the chromatogram corresponding to the mixture x9 and
subsequent extraction of the mass spectrum. During the library generation, no
discrimination based on the intensity of peaks was made. Therefore, all peaks were
treated as pure components.
4. RESULTS AND DISCUSSION
Inspection of the pure components mass spectra shown in Figure S-4 in Supporting
Information shows significant overlapping, resulting from the similarity of the chemical
structure of the components. Pure components 1 and 2, 16 and 17, as well as 19 and 21 have
normalized correlation coefficients above 0.97 and, consequently, they are impossible to
distinguish. In addition, pure components 5 and 7 have a normalized
correlation coefficient above 0.78. Thus, they are also expected to be very hard to
discriminate. However, we expect the proposed PTs-EKM-NMU method to be able to
discriminate the rest of the components. That is not trivial given the fact that the normalized
correlation coefficients for 26 combinations of pure components vary between 0.1 and
0.44. That makes the uNNBSS problem comprised of correlated pure components very
hard. The correlation matrix of the pure components mass spectra, where pairs of pure
components with normalized correlation coefficient above 0.1 are identified, is shown
in Table S-2 in the Supporting Information. As emphasized previously, it is the sparseness
of the pure components mass spectra in support and amplitude that is expected to enable
solution of the related uNNBSS problem. To this end, the mixed state probabilistic model (3) with
exponential prior on the continuous distribution of the non-zero amplitude has been fitted to
the experimental pure components mass spectra (they are shown in Figure S-4 in
Supporting Information as well as in Figure 3 for pure components 1, 4, 8 and 11). Even
though these pure components are correlated with others and some (4 and 11) have
small intensity, they are uniquely assigned to the true pure components from the library.
Figure 4 (left), also Figure S-5 in Supporting Information, shows the estimated probability
that the value of the pure component mass spectra is zero. As can be seen, 22 out of 25 pure
components have zero amplitudes at 40% to 75% of their support. Figure 4 (right), also
Figure S-6 in Supporting Information, shows the most expected values (means) of the
exponential distribution estimated by fitting the exponential distribution to amplitude
histograms. They were estimated for 25 pure components in the range (0, 1] within
intervals of width 0.01. It can be seen that $\hat{\mu}_m \in [0.0012, 0.0014]$ for m=1,...,25.
Figure S-7 shows the probability that the amplitude of the pure components mass spectra
occurs in the interval [0, A], such that 0.01≤A≤1. That is an average estimate over 25 pure
components. It is seen that 0.01≤A≤0.08 occurs with probability 0.97. The reported results
confirm that the sparse probabilistic model (3) is experimentally well grounded. That is
further confirmed by Figure S-8 in Supporting Information, which shows estimated
histograms (stars) and exponential probability density functions (squares) calculated
with the mean values from Figure 4 (right). It is seen that the approximation is very good.
Estimated histograms vs. exponential probability density functions for pure components
1, 4, 8 and 11 are also shown in Figure 5.
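A simple way to see what fitting the mixed state model involves is to simulate it and recover its two parameters. The sketch below is illustrative only: it estimates the probability of the zero state as the fraction of zero entries and the exponential mean as the sample mean of the nonzero amplitudes (the maximum likelihood estimate), which is a swapped-in technique and not the histogram fit used in the paper. The names rho and mu are hypothetical labels for the zero-state probability and the exponential mean.

```python
import numpy as np

def sample_mixed_state(rho, mu, size, rng):
    """Draw values that are zero with probability rho, else Exponential(mean=mu)."""
    zero = rng.random(size) < rho
    amplitude = rng.exponential(scale=mu, size=size)
    return np.where(zero, 0.0, amplitude)

def fit_mixed_state(s):
    """Return (rho_hat, mu_hat): zero fraction and mean of nonzero amplitudes."""
    s = np.asarray(s, dtype=float)
    nonzero = s[s > 0]
    rho_hat = 1.0 - nonzero.size / s.size
    mu_hat = float(nonzero.mean()) if nonzero.size else float("nan")
    return rho_hat, mu_hat

rng = np.random.default_rng(0)
s = sample_mixed_state(rho=0.8, mu=1.5e-3, size=100_000, rng=rng)
rho_hat, mu_hat = fit_mixed_state(s)
print(f"rho_hat = {rho_hat:.3f}, mu_hat = {mu_hat:.2e}")
```

The simulated values use the settings of the numerical study (zero-state probability 0.8 and exponential mean 1.5e-3); with real spectra the input would be one normalized pure component mass spectrum.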
Table 2 presents results of the comparative performance analysis using the same four
metrics as in section 3.1 for NMU, EKM-NMU, PTs-EKM-NMU for D=T=9901 and
PTs-EKM-NMU for D=4000. In the last case k-means clustering has been used to
find a basis $\{v_d\}_{d=1}^{4000}$ in the input space of patterns $\{x_t\}_{t=1}^{9901}$. Provided that it retains
accuracy, the subspace approximation is very important for computational reasons.
That is because, when four preprocessing transforms are combined, sparseness
constrained NMF in (9) to (12) has to be performed four times. That can be done in
parallel. Nevertheless, one factorization of the 9901×9901 matrix by the NMU algorithm
takes approximately 79 hours on the above specified machine, while factorization of the
4000×9901 matrix by the same algorithm takes approximately 13.7 hours. For NMF_L0 the
number of overlapping components, L, has to be given to the algorithm as input
information. For the PTs-EKM-NMF_L0 algorithm the optimal value of L can be inferred by
running the NMF_L0 algorithm multiple times on a problem such as (9). That, however, would
result in high computational costs. That is why NMF_L0 has not been used in RKHS on
problems (9) to (12). It is seen from Table 2 that linear sparseness constrained matrix
factorization yields poor quality of separation compared to linear factorization in the
RKHS. That is especially the case with the number of incorrectly assigned components,
which is a direct consequence of the low purity of the separated components. That, indirectly,
also confirms the nonlinear character of the mixtures mass spectra of the desired chemical
reaction. It is also seen that the combination of four preprocessing transforms for
suppression of higher order monomials and sparseness constrained factorization in
RKHS significantly improves the quality of separation. In this regard, Figure S-9 in
Supporting Information shows mass spectra of 25 separated components assigned to
pure components according to the maximal correlation criterion. Separated pure
components 1, 4, 8 and 11 are also shown in Figure 3. Thereby, the value of the normalized
correlation coefficient and the preprocessing transform (RPCA, HT, ST or TT) that yielded
the best result are also reported. Due to the diversity of morphologies of the mass spectra, each of the
four preprocessing transforms yielded the best result in some cases. It is also important to
notice that the subspace approximation of the proposed method with D=4000 yields results
very comparable to those obtained with D=9901 but with much shorter computation time.
Thus, the proposed approach to pure components extraction can, when implemented on a
state-of-the-art multiprocessor (grid) platform, be executed in even shorter time, which
makes it practically relevant.
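The role of the basis size D can be illustrated with a small sketch of a Gaussian empirical kernel map. It rests on several assumptions: the kernel is taken as exp(-||x - v||^2 / sigma2), which may differ from the scaling used in (6); the basis for D < T is a random subset of the patterns, used here only as a stand-in for the k-means basis of the paper; and the function name gaussian_ekm and the toy sizes are hypothetical.

```python
import numpy as np

def gaussian_ekm(X, V, sigma2=1.0):
    """Map the columns of X (N x T) against the basis columns of V (N x D).

    Returns a D x T matrix whose (d, t) entry is exp(-||x_t - v_d||^2 / sigma2).
    The kernel scaling is an assumption of this sketch.
    """
    X = np.asarray(X, dtype=float)
    V = np.asarray(V, dtype=float)
    # Squared Euclidean distances between every basis vector and every pattern.
    sq = (V**2).sum(axis=0)[:, None] + (X**2).sum(axis=0)[None, :] - 2.0 * (V.T @ X)
    return np.exp(-np.maximum(sq, 0.0) / sigma2)

rng = np.random.default_rng(0)
X = rng.random((9, 200))                      # toy data: 9 mixtures, 200 m/z points

Psi_full = gaussian_ekm(X, X)                 # D = T: every pattern is a basis vector
cols = rng.choice(X.shape[1], size=50, replace=False)
Psi_sub = gaussian_ekm(X, X[:, cols])         # D = 50 < T: reduced basis

print(Psi_full.shape, Psi_sub.shape)
```

The mapped matrix has D rows, so a smaller basis reduces the size of the matrix passed to the subsequent factorization, which is the source of the shorter computation time discussed above.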
5.0 CONCLUSION
The blind source separation approach to pure components extraction is most often based on
a linear mixture model. That is, mixtures spectra are assumed to be an unknown
weighted linear combination of pure components spectra. Herein, we have addressed the
problem of extracting pure components from nonlinear mixtures of mass
spectra. Thereby, the number of mixtures is assumed to be (significantly) less than the number
of pure components. We propose an approach that combines four preprocessing
methods for suppression of higher order monomials induced by the nonlinear mixing
process and sparseness constrained nonnegative matrix factorization in RKHS induced
by EKM. Two practically important properties of the proposed approach are that no
information about the character of the nonlinear mixing process is required and that the linear
mixing problem is contained implicitly as a special case. It is believed that these
properties make the proposed approach practically relevant for contemporary metabolic
profiling of biological samples, that is, pure components extraction in biomarker
identification studies. The proposed approach is demonstrated on demanding numerical and
experimental scenarios. In the latter case, related to the chemical reaction of peptide
synthesis, components separated from 9 nonlinear mixtures mass spectra are assigned
uniquely to the 25 pure components from the library. On the same problem, separation
by linear NMF algorithms yielded 15 (NMU) and 7 (NMF_L0) incorrectly assigned
components.
ACKNOWLEDGMENT
This work has been supported through grant 9.01/232 "Nonlinear component analysis
with applications in chemometrics and pathology" funded by the Croatian Science
Foundation.
REFERENCES
1. Nuzillard D, Bourg S, Nuzillard J M. Model-Free Analysis of Mixtures by NMR
Using Blind Source Separation. J. Magn. Reson. 1998; 133: 358-363.
2. Visser E, Lee T W. An information-theoretic methodology for the resolution of pure
component spectra without prior information using spectroscopic measurements.
Chemom. Int. Lab. Syst. 2004; 70: 147-155.
3. Kopriva I, Jerić I. Blind separation of analytes in nuclear magnetic resonance
spectroscopy and mass spectrometry: sparseness-based robust multicomponent analysis.
Anal. Chem. 2010; 82: 1911-1920.
4. Kopriva I, Jerić I, Brkljačić L. Nonlinear mixture-wise expansion approach to
underdetermined blind separation of nonnegative dependent sources. J. Chemometrics
2013; 27: 189-197.
5. Roux A, Xu Y, Heilier J-F, Olivier M-F, Ezan E, Tabet J-C, Junot C. Annotation of
the human adult urinary metabolome and metabolite identification using ultra high
performance liquid chromatography coupled to a linear quadrupole ion trap-orbitrap
mass spectrometer. Anal. Chem. 2012; 84: 6429−6437.
6. Abu-Farha M, Elisma F, Zhou H, Tian R, Asmer M S, Figeys D. Proteomics: from
technology developments to biological applications. Anal. Chem. 2009; 81: 4585-4599.
7. McLafferty F W, Stauffer D A, Loh S Y, Wesdemiotis C. Unknown Identification
Using Reference Mass Spectra. Quality Evaluation of Databases. J. Am. Soc. Mass.
Spectrom. 1999; 10: 1229-1240.
8. Hyvärinen A, Karhunen J, Oja E. Independent Component Analysis. John Wiley &
Sons, Inc.: New York, US, 2001.
9. Cichocki A, Amari S. Adaptive Blind Signal and Image Processing. John Wiley: New
York, 2002.
10. Cichocki A, Zdunek R, Phan A H, Amari, S I. Nonnegative Matrix and Tensor
Factorizations. John Wiley: Chichester, UK, 2009.
11. Comon P, Jutten C (eds). Handbook of Blind Source Separation. Academic Press:
Oxford, UK, 2010.
12. Walleczek J(ed). Self-organized biological dynamics and non-linear control.
Cambridge University Press: Cambridge, UK. 2000
13. Nicholson J K, Lindon J C. Systems biology: Metabonomics. Nature 2008; 455
(7216): 1054-1056.
14. Bouthemy P, Piriou C H G, Yao J. Mixed-state auto-models and motion texture
modeling. J. Math Imaging Vision 2006; 25: 387-402.
15. Caiafa C, Cichocki A. Estimation of Sparse Nonnegative Sources from Noisy
Overcomplete Mixtures Using MAP. Neural Comput. 2009; 21: 3487-3518.
16. Candès E J, Li X, Ma Y, Wright J. Robust Principal Component Analysis? J. ACM
2011; 58: Article 11 (37 pages).
17. Chandrasekaran V, Sanghavi S, Parrilo P A, Willsky A S. Rank-Sparsity
Incoherence for Matrix Decomposition. SIAM J. Opt. 2011; 21: 572-596.
18. Donoho D L. De-Noising by Soft-Thresholding. IEEE Trans. Inf. Theory 1995; 41
(3): 613-627.
19. Fang H T, Huang D S. Wavelet de-noising by means of trimmed thresholding. in:
Proc. of the 5th World Congress on Intelligent Control and Automation, June 15-19,
2004, Hangzhou, P. R. China, pp. 1621-1624.
20. The website of the ASTER spectral library. http://speclib.jpl.nasa.gov [13 January
2014]
21. Zhang K, Chan L. Minimal Nonlinear Distortion Principle for Nonlinear
Independent Component Analysis. J. Mach. Learn. Res. 2008; 9: 2455-2487.
22. Levin D N. Using state space differential geometry for nonlinear blind source
separation. J. Appl. Phys. 2008; 103: 044906:1-12.
23. Levin D N. Performing Nonlinear Blind Source Separation With Signal Invariants.
IEEE Trans. Sig. Proc. 2010; 58: 2131-2140.
24. Taleb A, Jutten C. Source Separation in Post-Nonlinear Mixtures. IEEE Trans. Sig.
Proc. 1999; 47: 2807-2820.
25. Duarte L T, Suyama R, Rivet B, Attux R, Romano J M T, Jutten C. Blind
compensation of nonlinear distortions: applications to source separation of post-
nonlinear mixtures. IEEE Trans. Sig. Proc. 2012; 60: 5832-5844.
26. Filho E F S, de Seixas J M, Calôba L P. Modified post-nonlinear ICA model for
online neural discrimination. Neurocomputing 2010; 73: 2820-2828.
27. Nguyen T V, Patra J C, Das A. A post nonlinear geometric algorithm for
independent component analysis. Digital Sig. Proc. 2005; 15: 276-294.
28. Ziehe A, Kawanabe M, Harmeling S, Müller K R. Blind Separation of Post-
Nonlinear Mixtures Using Gaussianizing Transformations And Temporal Decorrelation.
J. Mach. Learn. Res. 2003; 4: 1319-1338.
29. Zhang K, Chan L W. Extended Gaussianization Method for Blind Separation of
Post-Nonlinear Mixtures. Neural Comput. 2005; 17: 425-452.
30. Harmeling S, Ziehe A, Kawanabe M. Kernel-Based Nonlinear Blind Source
Separation. Neural Comput. 2003; 15: 1089-1124.
31. Martinez D, Bray A. Nonlinear Blind Source Separation Using Kernels. IEEE Tr.
Neural Net. 2003; 14: 228-235.
32. Almeida L. MISEP-Linear and nonlinear ICA based on mutual information. J.
Mach. Learn. Res. 2003; 4: 1297-1318.
33. Van Vaerenbergh S, Santamaría I. A Spectral Clustering Approach to
Underdetermined Postnonlinear Blind Source Separation of Sparse Sources. IEEE
Trans. Neural Net. 2006; 17: 811-814.
34. Buciu I, Nikolaidis N, Pitas I. Nonnegative Matrix Factorization in Polynomial
Feature Space. IEEE Trans. Neural Net. 2007; 19: 1090-1100.
35. Zafeiriou S, Petrou M. Non-linear Non-negative Component Analysis. IEEE Trans.
Image Proc. 2010; 19: 1050-1066.
36. Pan B, Lai J, Chen W S. Nonlinear nonnegative matrix factorization based on
Mercer kernel construction. Pattern Rec. 2011; 44: 2800-2810.
37. Yang Z, Xiang Y, Xie S, Ding S, Rong Y. Nonnegative Blind Source Separation by
Sparse Component Analysis Based on Determinant Measure. IEEE Trans. Neural Net.
and Learn. Sys. 2012; 23 (10): 1601-1610.
38. Cichocki A, Zdunek R, Amari S I. Hierarchical ALS Algorithms for Nonnegative
Matrix Factorization and 3D Tensor Factorization. LNCS 2007; 4666: 169-176.
39. Peharz R, Pernkopf F. Sparse nonnegative matrix factorization with ℓ0-constraints.
Neurocomputing 2012; 80: 38-46.
40. Gillis N, Glineur F. Using underapproximations for sparse nonnegative matrix
factorization. Pattern Rec. 2010; 43: 1676-1687.
41. Lin Z, Ganesh A, Wright J, Wu L, Chen M, Ma Y. Fast Convex Optimization
Algorithms for Exact Recovery of a Corrupted Low-Rank Matrix. UIUC Technical
Report UILU-ENG-09-2214, August 2009.
42. The website on low-rank matrix recovery and completion via convex optimization:
http://perception.csl.illinois.edu/matrix-rank/sample_code.html [13 January 2014]
43. Huang K, Sidiropoulos N D, Swami A. Non-Negative Matrix Factorization
Revisited: Uniqueness and Algorithm for Symmetric Decomposition. IEEE Trans. Sig.
Proc. 2014; 62: 211-224.
44. The Nicolas Gillis Website. https://sites.google.com/site/nicolasgillis/code [13
January 2014].
45. The Robert Peharz Website. http://www3.spsc.tugraz.at/people/robert-peharz [13
January 2014].
46. Cohen J E, Rothblum U C. Nonnegative Ranks, Decompositions, and Factorizations
of Nonnegative Matrices. Linear Algebra and Its Applications 1993; 190: 149-168.
47. Chartrand R, Staneva V. Restricted isometry properties and nonconvex compressive
sensing. Inverse Problems 2008; 24: 035020 (14 pages).
Figure Captions
Figure 1 (color online). Numerical study. Normalized correlation coefficient vs. Monte
Carlo run index between true and extracted sources by algorithms: NMF_L0 (crosses),
NMU (circles) and PTs-EKM-NMU (pluses) and PTs-EKM-NMF_L0 (stars). Mean
value (first row), minimal value (second row), number of values greater than or equal to
0.6 (third row), number of incorrect pairs (fourth row). Probability of state zero equal
to 0.5 (left column) and 0.8 (right column).
Fig. 1, Kopriva, Jerić, Filipović & Brkljačić
Figure 2. Structures of possible components present in the reaction mixture.
Fig. 2, Kopriva, Jerić, Filipović & Brkljačić
Figure 3. Two top rows: mass spectra of pure components s1, s4, s8 and s11. Two bottom
rows: estimated mass spectra of pure components s1, s4, s8 and s11 by proposed PTs-
EKM-NMU algorithm. Information on value of highest normalized correlation
coefficient and associated error reduction method (RPCA, HT, ST and TT) are also
displayed.
Fig. 3, Kopriva, Jerić, Filipović & Brkljačić
Figure 4. Experimental study. Left: estimated probability that the value of the pure component mass
spectra is zero, that is, the estimate of $\rho_m$, m=1,...,25. Right: estimates of the most expected values
(means) of the exponential distribution obtained by fitting the exponential distribution to amplitude
histograms. They were estimated for 25 pure components in the range (0, 1] within intervals of
width 0.01.
Fig. 4, Kopriva, Jerić, Filipović & Brkljačić
Figure 5. (color online). Experimental study for pure components 1, 4, 8 and 11. Estimated
histograms (stars) vs. exponential probability density functions (squares), calculated with the
estimates of mean values shown in Figure 4 -right, fitted to amplitude histograms.
Fig. 5, Kopriva, Jerić, Filipović & Brkljačić
Table Captions
Algorithm 1. The PTs-EKM-NMF (preferably NMU) algorithm.
Required: $X \in \mathbb{R}_{0+}^{N \times T}$. If A1) is not satisfied, perform the scaling
$X \leftarrow X / \max_{t=1,\dots,T} \|x_t\|_1$ or $X \leftarrow X / \max_{n,t=1}^{N,T} X_{nt}$.
1. Perform RPCA (5) on X in (2)/(4) with $\lambda = 1/\sqrt{T}$. It yields approximation A in (4).
2. Perform HT on X in (2)/(4) with a threshold in $[10^{-6}, 10^{-4}]$. It yields approximation B.
3. Perform ST on X in (2)/(4) with a threshold in $[10^{-6}, 10^{-4}]$. It yields approximation C.
4. Perform TT on X in (2)/(4) with a threshold in $[10^{-6}, 10^{-4}]$ and the TT parameter equal to 3.5. It yields approximation D.
5. Perform EKM mappings A→Ψ(A), B→Ψ(B), C→Ψ(C) and D→Ψ(D) according to (6). Use the Gaussian kernel with $\sigma^2 = 1$.
6. Perform HT, ST and TT respectively on matrices Ψ(B), Ψ(C) and Ψ(D).
7. Perform sparseness constrained factorization, preferably by the NMU algorithm, of matrices Ψ(A), Ψ(B), Ψ(C) and Ψ(D) to obtain separated components $S^{A}$, $S^{B}$, $S^{C}$ and $S^{D}$.
8. Assign to pure components from the library those separated components from $S^{A}$, $S^{B}$, $S^{C}$ and $S^{D}$ with the highest normalized correlation coefficient.
Table 1. Comparative performance analysis of the NMU, NMF_L0, EKM-NMU, EKM-NMF_L0,
PTs-EKM-NMU and PTs-EKM-NMF_L0 algorithms. The probability of the zero
state was $\rho_m = 0.8$. The four metrics used in the comparative performance analysis were: the number
of associated components with normalized correlation coefficient greater than or equal
to 0.6, the mean value of the correlation coefficient over all associated components, the minimal
value of the correlation coefficient and the number of pure components assigned incorrectly
(which occurs due to poor separation). All four metrics were calculated with respect to the
predefined labeling of the pure components stored in the library. Incorrect assignment
implies that, based on the maximal correlation criterion, two or more pure components are
assigned to the same separated component. Mean values and variance are reported,
estimated over 10 Monte Carlo runs. The best result in each metric is in bold. The first
three metrics are calculated only for correctly assigned components. That is why NMU
and NMF_L0 appear to have comparable performance.

Metric                | NMU       | NMF_L0    | EKM-NMU   | EKM-NMF_L0 | PTs-EKM-NMU | PTs-EKM-NMF_L0
correlation ≥ 0.6     | 2.8±0.92  | 2.3±1.34  | 3.7±0.48  | 3.2±0.63   | 3.8±0.42    | 3.7±0.48
mean correlation      | 0.70±0.03 | 0.61±0.11 | 0.69±0.02 | 0.64±0.03  | 0.70±0.03   | 0.69±0.04
minimal correlation   | 0.53±0.04 | 0.42±0.08 | 0.51±0.03 | 0.45±0.04  | 0.52±0.04   | 0.49±0.06
incorrect assignments | 3.4±0.70  | 3.1±0.57  | 2.4±0.97  | 2.2±0.63   | 2.0±0.88    | 1.5±1.43
Table 2. Comparative performance analysis of the NMU, NMF_L0, EKM-NMU,
PTs-EKM-NMU (D=T=9901) and PTs-EKM-NMU (D=4000) algorithms on mass spectra of 9
experimental nonlinear mixtures related to peptide synthesis. The number of pure
components is 25. Four metrics were used in the comparative performance analysis:
the number of associated components with normalized correlation coefficient greater than
or equal to 0.6, the mean value of the correlation coefficient over all associated components,
the minimal value of the correlation coefficient and the number of pure components assigned
incorrectly (which occurs due to poor separation). The best result in each metric is in bold.
The first three metrics are calculated only for correctly assigned components.
Metric                  NMU    NMF_L0  EKM-NMU  PTs-EKM-NMU (D=T=9901)  PTs-EKM-NMU (D=4000)
correlation ≥ 0.6       8      14      16       18                      18
mean correlation        0.342  0.518   0.673    0.702                   0.708
minimal correlation     0.038  0.039   0.267    0.419                   0.283
incorrect assignments   15     7       0        0                       1
CPU time                1.3 s  40 s    78.78 h  478 h                   413.7 h
Summary abstract. A method for underdetermined nonlinear blind separation of nonnegative
sparse dependent sources is proposed. It combines three steps: (i) robust principal component
analysis with hard, soft and trimmed thresholding, to suppress higher-order monomials induced
by nonlinear mixing; (ii) empirical kernel map based nonlinear mapping of the preprocessed
mixture data; and (iii) sparseness-constrained nonnegative matrix factorization (NMF) in the
high-dimensional mapped space. The method aims to extract analytes from mass spectra of
nonlinear multicomponent mixtures of biological samples.
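Two of the building blocks named above, thresholding and the empirical kernel map, can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation. It assumes a Gaussian kernel of the form exp(-||x - v||^2 / (2 sigma^2)) and that each sample of the preprocessed mixture data is mapped against a set of D basis vectors. Robust principal component analysis, trimmed thresholding and the NMF step are omitted. All function names, parameter values and data are hypothetical.

```python
import math

def hard_threshold(x, tau):
    """Keep entries whose magnitude exceeds tau, set the rest to zero."""
    return [v if abs(v) > tau else 0.0 for v in x]

def soft_threshold(x, tau):
    """Shrink every entry toward zero by tau, zeroing those with magnitude at most tau."""
    return [math.copysign(max(abs(v) - tau, 0.0), v) for v in x]

def empirical_kernel_map(x, basis, sigma=1.0):
    """Map x to [k(x, v_1), ..., k(x, v_D)] using a Gaussian kernel (assumed kernel choice)."""
    mapped = []
    for v in basis:
        squared_distance = sum((a - b) ** 2 for a, b in zip(x, v))
        mapped.append(math.exp(-squared_distance / (2.0 * sigma ** 2)))
    return mapped

# Toy usage: threshold a small signal, then map one sample against D=3 basis vectors.
hard = hard_threshold([0.05, -0.3, 0.8], 0.1)
soft = soft_threshold([0.05, -0.3, 0.8], 0.1)
basis = [[0.0, 0.0], [1.0, 0.0], [0.0, 2.0]]
mapped = empirical_kernel_map([0.0, 0.0], basis, sigma=1.0)
```

The mapped vector has one entry per basis vector, so the dimension of the mapped space equals D. Every entry is positive for a Gaussian kernel, which keeps the mapped data compatible with a nonnegative factorization.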
Figure 1 (color online). Numerical study. Normalized correlation coefficient between true and extracted sources vs. Monte Carlo run index for the algorithms NMF_L0 (crosses), NMU (circles), PTs-EKM-NMU (pluses) and PTs-EKM-NMF_L0 (stars). Mean value (first row), minimal value (second row), number of values greater than or equal to 0.6 (third row), number of incorrect pairs (fourth row). Probability of the zero state equal to 0.5 (left column) and 0.8 (right column).
Figure 2. Structures of possible components present in the reaction mixture.
Figure 3. Two top rows: mass spectra of pure components s1, s4, s8 and s11. Two bottom rows: mass spectra of pure components s1, s4, s8 and s11 estimated by the proposed PTs-EKM-NMU algorithm. The value of the highest normalized correlation coefficient and the associated error reduction method (RPCA, HT, ST or TT) are also displayed.
Figure 4. Experimental study. Left: estimated probability that a value of the pure component mass spectrum is zero, that is, the estimate of ρm, m=1,...,25. Right: estimated means of the exponential distributions, obtained by fitting an exponential distribution to the amplitude histograms. The histograms were estimated for the 25 pure components in the range (0, 1] using intervals of width 0.01.
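The two quantities shown in this figure can be estimated along the following lines. This is a sketch under stated assumptions, not the authors' procedure. It assumes the spectrum is already scaled to the range [0, 1], and it swaps the histogram fit for the maximum likelihood estimate of the exponential mean, which is the sample mean of the nonzero amplitudes. The names fit_sparse_exponential and exponential_pdf are hypothetical, and the spectrum is toy data.

```python
import math

def fit_sparse_exponential(spectrum, zero_tol=0.0):
    """Return (estimated probability of a zero value, estimated exponential mean of nonzero values)."""
    nonzero = [v for v in spectrum if v > zero_tol]
    rho = (len(spectrum) - len(nonzero)) / len(spectrum)
    # Maximum likelihood estimate of the exponential mean (swapped in for the histogram fit).
    mean = sum(nonzero) / len(nonzero) if nonzero else float("nan")
    return rho, mean

def exponential_pdf(x, mean):
    """Exponential probability density with the given mean, for x >= 0."""
    return math.exp(-x / mean) / mean

# Toy spectrum: 7 of 10 values are zero.
spectrum = [0.0, 0.0, 0.0, 0.2, 0.0, 0.4, 0.0, 0.0, 0.6, 0.0]
rho_hat, mean_hat = fit_sparse_exponential(spectrum)
density_at_zero = exponential_pdf(0.0, mean_hat)
```

The fitted density can then be evaluated at the histogram bin centers to compare it against the empirical amplitude histogram.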
Figure 5 (color online). Experimental study for pure components 1, 4, 8 and 11. Estimated histograms (stars) vs. exponential probability density functions (squares) fitted to the amplitude histograms, calculated with the estimates of the mean values shown in Figure 4 (right).