
MULTIVARIATE PROCESSING AND MODELLING OF HYPHENATED METABOLITE DATA

by

PÄR JONSSON

Academic dissertation which, with the permission of the Vice-Chancellor's office of Umeå University, is presented for public examination for the degree of Doctor of Philosophy at the Faculty of Science and Technology, at the Department of Chemistry, Umeå University, room KB3B1, KBC, on Friday 27 January 2006, at 10.00. Faculty opponent: Professor Royston Goodacre, School of Chemistry, University of Manchester, PO Box 88, Sackville Street, Manchester, M60 1QD, U.K.

COPYRIGHT © 2005 PÄR JONSSON

ISBN: 91-7305-992-7 PRINTED IN SWEDEN BY VMC–KBC

UMEÅ UNIVERSITY, UMEÅ 2005

Title: Multivariate Processing and Modelling of Hyphenated Metabolite Data

Author: Pär Jonsson, Department of Chemistry, Organic Chemistry, Research Group for Chemometrics, Umeå University, SE-901 87, Umeå, Sweden.

Abstract

One trend in the ‘omics’ sciences is the generation of increasing amounts of data, describing complex biological samples. To cope with this and facilitate progress towards reliable diagnostic tools, it is crucial to develop methods for extracting representative and predictive information. In global metabolite analysis (metabolomics and metabonomics) NMR, GC/MS and LC/MS are the main platforms for data generation. Multivariate projection methods (e.g. PCA, PLS and O-PLS) have been recognized as efficient tools for data analysis within subjects such as biology and chemistry due to their ability to provide interpretable models based on many, correlated variables. In global metabolite analysis, these methods have been successfully applied in areas such as toxicology, disease diagnosis and plant functional genomics.

This thesis describes the development of processing methods for the unbiased extraction of representative and predictive information from metabolic GC/MS and LC/MS data characterizing biofluids, e.g. plant extracts, urine and blood plasma. In order to allow the multivariate projections to detect and highlight differences between samples, one requirement of the processing methods is that they must extract a common set of descriptors from all samples and still retain the metabolically relevant information in the data. In Papers I and II this was done by applying a hierarchical multivariate compression approach to both GC/MS and LC/MS data. In the study described in Paper III a hierarchical multivariate curve resolution strategy (H-MCR) was developed for simultaneously resolving multiple GC/MS samples into pure profiles. In Paper IV the H-MCR method was applied to a drug toxicity study in rats, where the method’s potential for biomarker detection and identification was exemplified. Finally, the H-MCR method was extended, as described in Paper V, allowing independent samples to be processed and predicted using a model based on an existing set of representative samples. The fact that these processing methods proved to be valid for predicting the properties of new independent samples indicates that it is now possible for global metabolite analysis to be extended beyond isolated studies. In addition, the results facilitate high-throughput analysis, because predicting the nature of samples is rapid compared to the actual processing. In summary, this research highlights the possibilities for using global metabolite analysis in diagnosis.

Keywords: Chemometrics, Curve Resolution, GC/MS, LC/MS, Metabolomics, Metabonomics, Multivariate Analysis and Multivariate Curve Resolution.


Contents

1. List of Papers
2. List of Abbreviations
3. Notations
4. Introduction
5. Aim
6. Background
   6.1. Metabolomics and Metabonomics
   6.2. Instrumentation
      6.2.1. Gas Chromatography
      6.2.2. Liquid Chromatography
      6.2.3. Mass Spectrometry
   6.3. Multivariate Data Analysis
      6.3.1. Principal Component Analysis
      6.3.2. Partial Least Squares Projections to Latent Structures
      6.3.3. Orthogonal-PLS
      6.3.4. Batch Modelling
   6.4. Processing Hyphenated Data
      6.4.1. Manual Integration
      6.4.2. Peak Detection
      6.4.3. Curve Resolution/Deconvolution
      6.4.4. Multivariate Curve Resolution
      6.4.5. Using the Processed Data for Sample Comparisons
7. Results
   7.1. Hierarchical Multivariate Data Compression (Papers I and II)
   7.2. Hierarchical Multivariate Curve Resolution (Paper III)
   7.3. Modelling of Time Dependent Responses (Paper IV)
   7.4. Processing and Prediction of Independent Samples (Paper V)
8. Conclusions and Future Perspectives
9. Popular Science Summary (in Swedish)
10. References
11. Acknowledgements


1. List of Papers

This thesis is based on the papers listed below, which are referred to in the text by the corresponding Roman numerals (I-V).

I  Jonsson P*, Gullberg J*, Nordström A, Kusano M, Kowalczyk M, Sjöström M and Moritz T. A strategy for identifying differences in large series of metabolomic samples analyzed by GC/MS. Analytical Chemistry 2004, 76:1738-1745. * To be considered joint first authors.

II  Jonsson P, Bruce SJ, Moritz T, Trygg J, Sjöström M, Plumb R, Granger J, Maibaum E, Nicholson JK, Holmes E and Antti H. Extraction, interpretation and validation of information for comparing samples in metabolic LC/MS data sets. The Analyst 2005, 130:701-707.

III  Jonsson P, Johansson AI, Gullberg J, Trygg J, A J, Grung B, Marklund S, Sjöström M, Antti H and Moritz T. High-Throughput Data Analysis for Detecting and Identifying Differences between Samples in GC/MS-Based Metabolomic Analyses. Analytical Chemistry 2005, 77:5635-5642.

IV  Jonsson P, Stenlund H, Moritz T, Trygg J, Sjöström M, Verheij ER, Lindberg J, Schuppe-Koistinen I and Antti H. Modeling of time dependent toxicological responses in urinary GC/MS data. Submitted.

V  Jonsson P, Sjövik-Johansson E, Wuolikainen A, Lindberg J, Schuppe-Koistinen I, Sjöström M, Trygg J, Moritz T and Antti H. Predictive metabolite profiling applying hierarchical multivariate curve resolution to GC/MS data - a potential tool for multi-parametric diagnosis. Submitted.

Papers I-III are reprinted with kind permission of the publishers: American Chemical Society (Papers I and III) and The Royal Society of Chemistry (Paper II).


2. List of Abbreviations

Abbreviation   Meaning
AR             Alternating Regression
DA             Discriminant Analysis
DAD            Diode Array Detector
DmodX          Distance to the model in X
EI             Electron Impact
ESI            Electrospray Ionization
Eq.            Equation
GC             Gas Chromatography
H-MCR          Hierarchical MCR (see MCR)
LC             Liquid Chromatography
m/z            Mass to Charge Ratio
MCR            Multivariate Curve Resolution
MS             Mass Spectrometry
NMR            Nuclear Magnetic Resonance
O-PLS          Orthogonal-PLS (see PLS)
PCA            Principal Component Analysis
PLS            Partial Least Squares Projections to Latent Structures
SS             Sum of Squares
TIC            Total Ion Current
TMS            Trimethylsilyl
ToF            Time of Flight
UPLC           Ultra Performance Liquid Chromatography


3. Notations

Matrices are represented by bold upper case letters, vectors by bold lower case letters and scalars by upper case letters.

Notation   Meaning
-1         Denotes matrix inversion (superscript)
A          Number of components or profiles
C          Matrix of loadings for Y in PLS and O-PLS, or matrix of chromatographic profiles in CR and MCR
c          Loading vector for Y in PLS and O-PLS, or vector for chromatographic profile in CR and MCR
E          Residual of X
ECS        Residual of XCS
F          Residual of Y
K          Number of columns in X
KCS        Number of columns in XCS
N          Number of rows in X and Y
NCS        Number of rows in XCS
M          Number of columns in Y
P          Matrix of loadings in PCA, PLS and O-PLS
p          Loading vector in PCA, PLS and O-PLS
S          Matrix of spectroscopic profiles in CR and MCR
s          Vector containing spectroscopic profile in CR and MCR
T          Matrix of scores in PCA, PLS and O-PLS
T          Denotes matrix or vector transposition (superscript)
t          Score vector in PCA, PLS and O-PLS
W          Matrix of weights for X in PLS and O-PLS
w          Weight vector for X in PLS and O-PLS
X          Data matrix of descriptive variables
XCS        Data matrix of hyphenated measurement(s)
Y          Data matrix of measured responses
y          Data vector containing measured response


4. Introduction

When you aim for perfection, you discover it's a moving target.
George Fisher

INITIATING PERTURBATIONS AND OBSERVING their effects is one way of gaining knowledge about a system. This approach requires that the system being studied can be appropriately characterized so that the effects of the perturbations can be identified and interpreted. If this is achieved it will lead to a more detailed understanding of the system under observation. When using new methods in biology, such as transcriptomics [1, 2], proteomics [3] and global metabolite analysis (metabolomics [4] or metabonomics [5]) this approach can be used to gain knowledge about living organisms (systems) and specific processes therein. In such studies the perturbations can be of various types, e.g. a toxic insult, an environmental change, a genetic modification or a specific disease. Such disturbances will affect the system at all levels of biomolecular organisation. Due to the complex composition of biological samples, and the sensitivity of modern analytical platforms, the results of such experiments are extremely information-rich datasets, consisting of a multitude of correlated variables, which can be difficult to interpret. Multivariate projection methods are specially designed to deal with such situations and have proved to be well suited to the handling and analysis of data describing complex biological systems. However, in order to utilize these methods to their full potential, the data often need pre-treatment to format them appropriately for analysis.

GC/MS and LC/MS are common analytical platforms used in global metabolite analysis. These platforms can produce information-rich data about metabolites in a variety of biological fluids and tissues. However, to optimize the information output, and to provide opportunities for sample comparisons based on multivariate analysis (e.g. PCA, PLS and O-PLS), the GC/MS and LC/MS data need to be appropriately pre-treated or processed. Hence, rational and robust data processing protocols are required to retain the metabolically relevant information in the data for multivariate analysis. In this way reliable and interpretable data analysis can be conducted, thus facilitating a deeper understanding of the biological system or process under study.


5. Aim

If you don't know where you are going, you will probably end up somewhere else.
Lawrence J. Peter

THE AIM OF THIS STUDY was to develop data processing protocols for the comparison of complex biological samples characterized by GC/MS or LC/MS, based on changes in their metabolite profiles. These processing protocols must meet certain criteria:

- They must generate data describing the samples using a common set of representative variables and hence make the outcome of the processing suitable for multivariate data analysis.
- The extracted variables should be interpretable, to allow identification of important metabolites; so-called potential biomarkers.
- They must be able to cope with analytical drift.
- It must be possible to automate them.
- They must be valid for independent samples or sample sets using calculated processing parameters, to facilitate:
  - Predicting the features of new independent samples.
  - External validation.


6. Background

Never be afraid to try something new. Remember, amateurs built the ark; professionals built the Titanic.
Author Unknown

THIS CHAPTER CONTAINS A BRIEF INTRODUCTION to global metabolite analysis (metabolomics and metabonomics), followed by an overview of the analytical platforms used in this study for acquiring the metabolite data to characterize the samples studied. A description of how the different analytical platforms work is included, as well as the output produced by each of them. Different methods are described for extracting information from the resultant analytical output, in order to create representative data reflecting the relative metabolite concentrations between samples. In addition, an overview is presented of the multivariate statistical methods used for sample comparisons, based on the representative metabolite data sets.

6.1. Metabolomics and Metabonomics

METABOLOMICS AND METABONOMICS are two closely related approaches in the post-genomic era of biology. The two techniques can both be classified as types of ‘global metabolite analysis’. Such global approaches have been applied in many different areas of biological and clinical science, including: (1) toxicology, for identifying markers of organ-specific toxicity in rats and mice, based on urine and plasma NMR data [6]; (2) clinical diagnosis, for non-invasive diagnosis of coronary heart disease in human blood serum using NMR [7]; and (3) plant biology or functional genomics, for identifying metabolites affected by genetic modification in plants (Arabidopsis and potatoes) using GC/MS [8, 9].

Metabonomics has been defined as “The quantitative measurement of time-related multi-parametric metabolic responses of a living system to patho-physiological stimuli or genetic modification” by Nicholson et al. [5], while metabolomics has been defined as “comprehensive analysis of the whole metabolome under a given set of conditions” by Fiehn [10]. However, despite the differences in definitions, the two approaches are very closely related, sharing the idea that differences between samples can be detected and studied by means of their metabolic composition. This is based on the fact that metabolites are the end products of enzyme activity and gene expression. In order to detect these differences, the maximum number of metabolites present in the samples should be quantified, either absolutely or relatively. This is normally achieved using analytical techniques that are suitable for detecting and quantifying small molecules in biological samples. Examples of the analytical techniques used for this purpose can be found in [11-13], and include NMR, GC/MS, and LC/MS. In many cases the goal of metabolomics and metabonomics is to find metabolites or patterns of metabolites that are indicators of specific states of a system; so-called biomarkers or biomarker patterns.

Global metabolite analysis is considered to be unbiased, meaning that no prior assumptions are made about what to look for in the samples. This is different to metabolic profiling, where a set of pre-defined metabolites is quantified. One of the advantages of the global metabolite approach, compared to transcriptomics for instance, is that it is species-independent, so the same analytical setup can be used for different species.

A number of recent review articles have discussed aspects of the fields of metabolomics and metabonomics [14-18].

6.2. Instrumentation

THE SAMPLES ANALYSED IN THE STUDIES presented in Papers I-V were plant extracts and samples of blood plasma and urine from both rats and humans, reflecting the most common types of samples used in global metabolite analysis, although analyses have also been carried out on other available biofluids (e.g. cerebrospinal fluid [19], saliva [20], seminal fluid [21]) and intact liver tissue [22]. To capture the metabolic composition of such samples, the most commonly used analytical platforms are NMR or mass spectrometry following chromatographic separation, usually by gas or liquid chromatography, although the use of other alternatives, such as capillary electrophoresis (CE), has also been reported [23, 24]. In GC/MS and LC/MS the chromatographic systems separate the different compounds (metabolites) according to their physical/chemical properties and the detector (mass spectrometer) is used to quantify the compounds and to provide ‘fingerprints’ for their chemical structures. Since GC/MS and LC/MS are the only analytical platforms considered in this study, no other techniques will be discussed in detail. However, for the interested reader, review papers discussing the benefits and drawbacks of the various analytical techniques that can be used in global metabolite analysis have been published [11-13].

6.2.1. Gas Chromatography

GAS CHROMATOGRAPHY (GC) separates the different compounds in a sample according to their boiling points and their interactions with the stationary phase. In this process, compounds that boil at low temperatures elute before those that boil at high temperatures. The time it takes for a compound to pass through the chromatographic system is known as the ‘retention time’. For compounds to pass through the column they must have a reasonably low boiling point. For many metabolites this is not the case and thus the sample molecules must be derivatized. In this step, trimethylsilyl (TMS) or tert-butyldimethylsilyl (TBDMS) groups are commonly added to functional groups containing OH, NH or SH, to form TMS/TBDMS-ethers, TMS/TBDMS-amines and TMS/TBDMS-sulphides respectively. Adding TMS or TBDMS reduces the compounds’ ability to form hydrogen or polar bonds (since a hydrogen bond donor is replaced by TMS or TBDMS), thus the boiling point is also reduced. In addition, aldehydes and keto groups are commonly transformed to the corresponding oximes, using alkoxyamines or hydroxylamines, to improve GC properties and prevent enolization (the formation of enols from keto groups). An overview of derivatization reactions used prior to GC/MS analysis is presented in a review by Halket et al. [25]. Experimental design has been used to optimize the sample preparation steps (extraction and derivatization) for diverse biological samples, e.g. plant extracts [26] and human blood plasma [27].

In GC, the retention index [28] is often used to generalize the retention time and thereby overcome the differences between different instruments [29]. An alkane series (a mixture of n-alkanes e.g. C8-C40) is injected into the GC and the retention time for nonane (C9) is designated as having a retention index of 900 while decane (C10) is designated as having a retention index of 1000. A compound that elutes exactly in-between C9 and C10, therefore, has a retention index of 950.
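The retention index calculation described above can be sketched as a linear interpolation between the two n-alkanes that bracket a compound. The sketch below is a minimal illustration rather than the procedure used in Papers I-V: the function name, the alkane retention times and the assumption that the alkanes elute in order of carbon number are all hypothetical. Linear interpolation is what the C9/C10 example implies; isothermal Kovats indices use logarithms of adjusted retention times instead.

```python
from bisect import bisect_right


def retention_index(rt, alkane_rts):
    """Linear retention index of a compound eluting at time `rt`.

    `alkane_rts` maps carbon number -> retention time of that n-alkane
    (hypothetical values; assumed to increase with carbon number).
    """
    carbons = sorted(alkane_rts)
    times = [alkane_rts[c] for c in carbons]
    if not times[0] <= rt <= times[-1]:
        raise ValueError("retention time lies outside the alkane series")
    # Index of the last alkane eluting at or before rt (clamped so that
    # an upper bracketing alkane always exists).
    i = min(bisect_right(times, rt) - 1, len(times) - 2)
    n_lo, n_hi = carbons[i], carbons[i + 1]
    t_lo, t_hi = times[i], times[i + 1]
    # Each alkane is assigned 100 x its carbon number; interpolate between.
    return 100.0 * (n_lo + (n_hi - n_lo) * (rt - t_lo) / (t_hi - t_lo))


# Hypothetical alkane retention times in minutes.
alkanes = {9: 10.0, 10: 12.0, 11: 13.5}
ri_midway = retention_index(11.0, alkanes)  # halfway between C9 and C10
```

Because the index depends only on the position relative to the bracketing alkanes, the same compound should receive a similar index on instruments whose absolute retention times differ.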


6.2.2. Liquid Chromatography

IN LIQUID CHROMATOGRAPHY (LC) the separation of molecules is based on their interaction with the mobile and stationary phases. The most common LC system is the so-called reverse phase system, in which a hydrophobic stationary phase (column) is used. The mobile phase used is hydrophilic but changes during analysis, through a gradient, to become more hydrophobic. In reverse phase systems, polar compounds will elute first and more hydrophobic compounds will elute later. Another alternative in LC is to use normal phase chromatography. Here a hydrophilic stationary phase is used and the mobile phase used is hydrophobic at the start but changes, through a gradient, to become more hydrophilic.

In contrast to GC, the compounds analysed with LC systems do not have to be derivatized prior to analysis. However, it has been shown that derivatization can enhance MS detection for ESI [30, 31].

In recent years LC systems have been developed with smaller particles in the columns; under higher column pressures this leads to enhanced separation efficiency. These systems are known as Ultra Performance Liquid Chromatography (UPLC) [32]. The enhanced separation they provide makes it possible to detect more metabolites and to accelerate the analysis.

6.2.3. Mass Spectrometry

ONCE THE COMPOUNDS HAVE PASSED through the chromatographic system, they are introduced into the mass spectrometer. The mass spectrometer consists of several parts: an ion source, a mass analyser, a detector and, finally, a computer system [33]. In the ion source the molecules are ionized and fragmented. The degree of fragmentation depends on the ionization technique. The mass analyser separates the ions generated in the ion source according to their mass to charge ratio (m/z). The detector records the signal intensity from the separated ions and the computer system collects and stores the signals from the detector. The most commonly used ion source in GC/MS-based global metabolite analysis is electron impact (EI). In EI a beam of high energy electrons ionizes and fragments the vaporized compounds, then the charged fragments are introduced into the mass analyser. EI is considered to be a ‘hard ionization’ mode, producing many fragments for each compound, thus facilitating compound identification.

In LC/MS, electrospray ionization (ESI) [34] is the most frequently used ion source. Here a liquid is introduced to the ion source through a needle, to the tip of which an electrical potential is applied. This forms charged droplets consisting of solvent and analyte (e.g. a metabolite). When the solvent evaporates, the droplets shrink in size and the charge at the droplet’s surface increases, leading to a “coulombic explosion” resulting in the formation of smaller droplets. This is repeated until only the naked analyte is present; this is then introduced to the mass analyser. It should be noted that, in order to become charged, the analyte must have the ability to accept or donate protons. ESI is considered to be a ‘soft ionization’ mode and results in little or no fragmentation. Therefore, greater mass accuracy is required to identify the compounds (or at least to determine their elementary composition). An alternative for improving mass accuracy is to use MS/MS [35].

The most frequently used mass analysers in global metabolite analysis are Time-of-Flight (ToF) and quadrupole systems. ToF analysers measure the mass-dependent time taken for ions of different masses to “fly” from the ion source to the detector. Quadrupole mass analysers work like a filter which allows only ions of a specific mass to charge ratio (m/z) to pass through the quadrupole into the detector. Quadrupole instruments scan across the selected m/z range in order to obtain a full mass spectrum at each time point.
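The mass dependence of the flight time can be illustrated with an idealized linear ToF analyser, in which every ion gains the same kinetic energy per charge from an accelerating voltage and then drifts through a field-free tube. The sketch below rests on that textbook simplification; the drift length and accelerating voltage are hypothetical defaults, not parameters of the instruments used in this work, and real analysers (e.g. with reflectrons) deviate from it.

```python
import math

E_CHARGE = 1.602176634e-19   # elementary charge in coulombs
DALTON = 1.66053906660e-27   # unified atomic mass unit in kilograms


def tof_flight_time(mz, drift_length_m=1.0, accel_voltage_v=2000.0):
    """Drift time in seconds for an ion of mass-to-charge ratio `mz`.

    Idealized model: z*e*V = (1/2)*m*v^2, so v = sqrt(2*e*V / (m/z)).
    `drift_length_m` and `accel_voltage_v` are hypothetical values.
    """
    velocity = math.sqrt(2.0 * E_CHARGE * accel_voltage_v / (mz * DALTON))
    return drift_length_m / velocity


# Heavier ions (at equal charge) arrive later; time scales with sqrt(m/z).
t_100 = tof_flight_time(100.0)
t_400 = tof_flight_time(400.0)
```

Under this model a fourfold increase in m/z doubles the flight time, which is why the arrival times can be converted into a mass spectrum.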

6.3. Multivariate Data Analysis

MULTIVARIATE DATA ANALYSIS is concerned with the analysis and interpretation of complex data structures. In biology, chemistry and medicine, and various other branches of science and technology, there is a steady trend towards the use of more variables (descriptors) to characterize observations (e.g. samples, molecules, processes). This development generates increasingly complex data tables, which are difficult to interpret without appropriate tools. Multivariate projection methods (e.g. PCA, PLS and O-PLS) have been successfully applied for modelling and interpreting complex datasets in a wide variety of applications in the natural sciences and industry. The benefits of these methods are that they produce robust and interpretable models, they can handle incomplete, noisy and collinear data structures and they can be used for generating predictions of independent samples.

In global metabolite profiling, samples are described according to their metabolite composition. The description of the same sample can vary depending on the analytical platform used for its characterization (metabolite concentrations, areas under resolved profiles or intensities for specific shift resonances in an NMR spectrum etc.). Irrespective of the analytical platform or technique used, the datasets produced are both large and complex. Multivariate methods are then used to extract the relevant systematic information from these datasets, in a robust and interpretable way [36].

To obtain reliable and interpretable multivariate models for sample comparisons, all samples should be characterized on the basis of the same descriptors or variables (metabolites). The data are arranged in a matrix in which each row corresponds to a sample and each column to one variable (metabolite). This matrix, often denoted X, has N rows (number of samples) and K columns (number of variables), see Figure 1. Prior to statistical analysis, the variables in X are often mean-centred, i.e. the average of each variable is subtracted. In many cases the variables are also scaled to unit variance, to give each variable equal chance of influencing the model. The variables may also be transformed to obtain a more normal distribution. Several different multivariate methods have been developed for sample comparisons. However, only the methods used in Papers I-V are described here.

Figure 1. An illustration of a data table or matrix X, used in multivariate analysis. X consists of N rows and K columns. Each row in the matrix corresponds to one sample and each column to one variable.
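The mean-centring and unit-variance scaling of X mentioned above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions: the function name and the example values are hypothetical, the sample standard deviation (N-1 in the denominator) is one common choice, and constant columns are simply left unscaled.

```python
import numpy as np


def autoscale(X):
    """Mean-centre each column of X and scale it to unit variance."""
    X = np.asarray(X, dtype=float)
    mean = X.mean(axis=0)
    std = X.std(axis=0, ddof=1)   # sample standard deviation per variable
    std[std == 0] = 1.0           # leave constant variables unscaled
    return (X - mean) / std, mean, std


# Hypothetical data table: N = 4 samples (rows), K = 3 variables (columns).
X = np.array([[1.0, 200.0, 0.5],
              [2.0, 180.0, 0.7],
              [3.0, 260.0, 0.2],
              [4.0, 240.0, 0.6]])
Xs, col_mean, col_std = autoscale(X)
```

Returning the column means and standard deviations makes it possible to apply the same scaling to new samples, which matters when a model is later used for prediction.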

6.3.1. Principal Component Analysis

PRINCIPAL COMPONENT ANALYSIS (PCA) [37, 38] is a multivariate projection method used to compress information in a data table or matrix X into a few so-called ‘principal components’, see Figure 2. The objective of the compression is to explain as much of the variation in the original data set as possible, using a set of new latent variables (principal components). By reducing the dimensionality of the data it becomes much easier to get an overview of the variation, so that groups, trends and outliers can be identified among the samples. The reason why such a compression is possible is that variables are correlated with each other. If the variables were independent (uncorrelated), compression using PCA would not be possible. In many cases correlations between variables (e.g. metabolites) occur because they change according to some systematic underlying common factor e.g. a genetic modification. PCA has the ability to detect these underlying factors and compress the information based upon them.

[PCA score plot: t[2] versus t[1], both axes ranging from -20 to 20]

Figure 2. PCA compresses the information in the data matrix X into scores T and loadings P, describing the underlying systematic structures of X. (Here exemplified using data from Paper IV). Rats from two different groups, control (gray stars) and dose group (black stars), were studied. Urine samples were characterized using GC/MS and H-MCR (Paper III). In total 20 samples (10 from each group) were characterized with 268 variables (resolved metabolites). A clear separation between the two groups is revealed in the PCA score plot (t2/t1).

Each principal component consists of one score vector t and one loading vector p; the score can be regarded as the new variable and the loading as the link between the original variables and the score. Scores are linear combinations of the original variables and the influence of the original variables (their weights in the linear combination) is represented in their loadings.

Principal components can be calculated using the NIPALS (Non-Linear Iterative Partial Least Squares) algorithm [39] or by Singular Value Decomposition (SVD) [40]. In the NIPALS algorithm, as used for PCA calculations in this study, the first score vector $\mathbf{t}_1$ is calculated to obtain maximum Euclidean length (norm) under the constraint that the loading vector $\mathbf{p}_1$ has a length of one. After the first principal component has been calculated, the variance explained by that component is removed from X by subtraction of $\mathbf{t}_1\mathbf{p}_1^{T}$ from X, giving the residual matrix E. The second component is calculated in the same way as the first with the only difference being that E replaces X. This procedure continues until the desired number of components A has been calculated. By using the model residual E, after each calculated component, for extraction of the next component, all extracted components will become orthogonal to each other, meaning that unique variation is described by each component. The final model describing X is the sum of all extracted principal components $\mathbf{t}_i\mathbf{p}_i^{T}$, see Eq. 1.

Model of X: $\mathbf{X} = \mathbf{T}\mathbf{P}^{T} + \mathbf{E} = \mathbf{t}_1\mathbf{p}_1^{T} + \mathbf{t}_2\mathbf{p}_2^{T} + \dots + \mathbf{t}_A\mathbf{p}_A^{T} + \mathbf{E}$ (Eq. 1)

The length of a principal component is referred to as the component’s eigenvalue; this is proportional to the amount of variation described by the component. The variation explained by a PCA model is given by the parameter $R^2X$, see Eq. 2, which can be calculated for the entire model as well as for the individual components.

$R^2X = 1 - SS(\mathbf{E})/SS(\mathbf{X})$ (Eq. 2)

By plotting the score values for different components against each other, maps (score plots) of the inter-sample relationships can be obtained. The corresponding loading plots reveal the variables responsible for the patterns found in the score plots. The model residual can be used to find out whether an observation fits into the model or not. For this purpose a measure known as DmodX is used. DmodX is the residual standard deviation for an observation.
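The NIPALS steps described in this section (iterate to a score and a unit-length loading, deflate X, repeat on the residual) can be sketched as follows. This is a minimal NumPy sketch rather than the implementation used in this thesis: the function names, the starting vector, the convergence tolerance and the random example data are all assumptions, and X is assumed to be mean-centred beforehand.

```python
import numpy as np


def nipals_pca(X, n_components, tol=1e-10, max_iter=500):
    """PCA by NIPALS on an already mean-centred matrix X (N x K)."""
    E = np.array(X, dtype=float)           # working residual, starts as X
    N, K = E.shape
    T = np.zeros((N, n_components))        # scores
    P = np.zeros((K, n_components))        # loadings
    for a in range(n_components):
        # Start from the column with the largest sum of squares.
        t = E[:, np.argmax((E ** 2).sum(axis=0))].copy()
        for _ in range(max_iter):
            p = E.T @ t / (t @ t)          # regress columns of E on t
            p /= np.linalg.norm(p)         # constrain loading to length one
            t_new = E @ p                  # updated score vector
            converged = (np.linalg.norm(t_new - t)
                         <= tol * np.linalg.norm(t_new))
            t = t_new
            if converged:
                break
        T[:, a], P[:, a] = t, p
        E = E - np.outer(t, p)             # deflation: subtract t p^T
    return T, P, E


def r2x(X, E):
    """Explained variation, R2X = 1 - SS(E) / SS(X) (Eq. 2)."""
    return 1.0 - (E ** 2).sum() / (np.asarray(X, dtype=float) ** 2).sum()


# Hypothetical data: 20 samples, 6 variables, two of them correlated.
rng = np.random.default_rng(0)
X = rng.normal(size=(20, 6))
X[:, 1] += 2.0 * X[:, 0]
X -= X.mean(axis=0)                        # mean-centre
T, P, E = nipals_pca(X, n_components=2)
explained = r2x(X, E)
```

Plotting `T[:, 0]` against `T[:, 1]` would give a score plot of the kind shown in Figure 2, and the rows of `P` indicate which variables drive each component.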

6.3.2. Partial Least Squares Projections to Latent Structures

PARTIAL LEAST SQUARES PROJECTIONS TO LATENT STRUCTURES (PLS) [38, 41-43] is a multivariate regression method used to find the relationship between a predictor matrix X and a response matrix Y, where the response matrix contains additional characterization of the samples in X.

One special case of PLS is PLS-Discriminant Analysis (PLS-DA) [44], where PLS is used to find the relationship between the descriptors in X and the class identity of the samples (described by a dummy matrix containing the class information).
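The dummy matrix used in PLS-DA can be built directly from the class labels: one column per class, with a one marking membership and zeros elsewhere. The sketch below is only an illustration; the function name and the labels are hypothetical.

```python
import numpy as np


def dummy_matrix(labels):
    """Return (Y, classes): Y has one column per class, 1 = member, 0 = not."""
    classes = sorted(set(labels))
    Y = np.array([[1.0 if label == c else 0.0 for c in classes]
                  for label in labels])
    return Y, classes


# Hypothetical class labels for four samples.
labels = ["control", "dose", "control", "dose"]
Y, classes = dummy_matrix(labels)
```

The resulting Y can then be used as the response matrix in an ordinary PLS calculation.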


In global metabolite analysis PLS is used because of its ability to highlight relationships between X and Y in a robust and interpretable manner, either for classification or calibration. The predictive ability of the model must be high for the interpretation to be valid. If it is not, the interpretation becomes less reliable, although the model can in some cases still provide useful information.

PLS uses the underlying or latent variables (scores) of X in order to describe the variation in both X and Y, in contrast to PCA which only models X. Therefore, the objective is also different when calculating latent variables in PLS, namely to extract the variation in X needed to predict the variation in the response Y. The first score vector t1 in PLS is, thus, calculated to maximize the covariance between t1 and Y. The influence of the original variables in X is found in the weight vector w1. The variation explained by the score t1 is then removed from X and Y by subtracting t1p1

T from X and t1c1

T from Y to form the residuals E and F, respectively. The loadings p1 and c1 of X and Y, respectively, carry information regarding the variance described by t1. Additional components are then calculated in the same way as for the first, except that E and F replace X and Y. This procedure is repeated until the desired number of components has been calculated. Cross-validation [45] is often used to estimate the number of significant components in a PLS model. The respective models for X and Y can be seen in Eqs. 3 and 4. The explained variations in X and Y are referred to as R2X and R2Y, respectively, and are calculated using Eqs. 2 and 5.

Model of X: $X = TP^{T} + E$ (Eq. 3)

Model of Y: $Y = TC^{T} + F$ (Eq. 4)

$R^{2}Y = 1 - SS(F)/SS(Y)$ (Eq. 5)
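The component-wise calculation described above can be sketched in a few lines of NumPy. This is a minimal illustration for a single, mean-centred response (often called PLS1), not the software used in the thesis; the function name `pls1` and the simulated data are assumptions made for the example.

```python
import numpy as np

def pls1(X, y, n_components):
    """Sketch of component-wise PLS for one centred response (hypothetical helper)."""
    E, f = X.copy(), y.copy()
    T, W, P, c = [], [], [], []
    for _ in range(n_components):
        w = E.T @ f
        w = w / np.linalg.norm(w)      # weight vector: direction of maximum covariance
        t = E @ w                      # score vector
        tt = t @ t
        p = E.T @ t / tt               # loading of X
        ca = f @ t / tt                # loading of y
        E = E - np.outer(t, p)         # residual E replaces X for the next component
        f = f - ca * t                 # residual F replaces y for the next component
        T.append(t); W.append(w); P.append(p); c.append(ca)
    return np.array(T).T, np.array(W).T, np.array(P).T, np.array(c), E, f

# Simulated, mean-centred data (assumption: 30 samples, 8 descriptors)
rng = np.random.default_rng(0)
X = rng.normal(size=(30, 8))
y = X[:, 0] - 2 * X[:, 3] + 0.1 * rng.normal(size=30)
X = X - X.mean(axis=0)
y = y - y.mean()

T, W, P, c, E, f = pls1(X, y, n_components=2)
r2x = 1 - np.sum(E ** 2) / np.sum(X ** 2)   # explained variation in X
r2y = 1 - (f @ f) / (y @ y)                 # explained variation in y (Eq. 5)
```

By construction X equals T Pᵀ + E and y equals T c + f, which mirrors Eqs. 3 and 4 for the single-response case.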


[Figure 3: schematic showing the matrices X (N × K) and Y (N × M) together with T, W^T, P^T and C^T.]

Figure 3. An overview of the PLS model. The weight vectors of X, W, are used to calculate the score vectors of X, T. T contains the information in X needed to predict Y. The variation described in T with respect to X and Y is summarized in the loadings P and C, respectively.

PLS models can be validated through internal or external methods. For internal validation purposes, cross-validation [45] is often used to estimate the number of model components as well as to estimate the predictive ability of the model by calculating the $Q^{2}Y$ value. The $Q^{2}Y$ value is calculated according to Eq. 6, where PRESS (PRediction Error Sum of Squares) is calculated by cross-validation. In cross-validation the samples are divided into a number of groups. One group is then excluded, a model is calculated using the remaining samples, and Y is predicted for the excluded samples. This is repeated until all samples have been excluded and predicted once. PRESS is calculated as the sum of squares of the differences between the true and predicted Y values (the residual for each sample). $Q^{2}Y$ can vary in the range $-\infty$ to 1, where 1 is optimal. When using cross-validation to select the number of components, a rule of thumb is that a component is considered significant if adding it increases $Q^{2}Y$; otherwise it is not.

$Q^{2}Y = 1 - \mathrm{PRESS}/SS(Y)$ (Eq. 6)

For external validation, new samples (test set samples) with known responses Y are predicted using the model, and the accuracy of the predictions is assessed by the size of the RMSEP (Root Mean Square Error of Prediction). RMSEP is calculated using Eq. 7, where $y$ and $y_{pred}$ are the measured and predicted responses, respectively, and N is the number of test set samples. One RMSEP value is obtained for each response variable in Y, in the same units as the response. External validation is the best way to validate a model since samples that have not been used to build the model are predicted. It is, therefore, an independent measure of how well the model will be able to predict new samples.

$\mathrm{RMSEP} = \sqrt{\sum (y - y_{pred})^{2} / N}$ (Eq. 7)
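Eqs. 6 and 7 can be illustrated with a short sketch. To keep the example self-contained, an ordinary least-squares fit is used here as a stand-in for the PLS model (an assumption made purely for brevity); the helper names `fit`, `predict`, `q2y` and `rmsep`, the seven cross-validation groups and the simulated data are likewise illustrative choices.

```python
import numpy as np

def fit(X, y):
    """Least-squares stand-in for a PLS fit (assumption); returns coefficients."""
    A = np.column_stack([np.ones(len(X)), X])
    return np.linalg.lstsq(A, y, rcond=None)[0]

def predict(b, X):
    return np.column_stack([np.ones(len(X)), X]) @ b

def q2y(X, y, n_groups=7):
    """Q2Y = 1 - PRESS/SS(Y), with PRESS obtained by cross-validation (Eq. 6)."""
    groups = np.arange(len(y)) % n_groups
    press = 0.0
    for g in range(n_groups):
        out = groups == g                       # samples excluded in this round
        b = fit(X[~out], y[~out])               # model from the remaining samples
        press += np.sum((y[out] - predict(b, X[out])) ** 2)
    ss_y = np.sum((y - y.mean()) ** 2)
    return 1 - press / ss_y

def rmsep(y, y_pred):
    """Root Mean Square Error of Prediction for one response (Eq. 7)."""
    return np.sqrt(np.mean((y - y_pred) ** 2))

rng = np.random.default_rng(1)
X = rng.normal(size=(40, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=40)

q2 = q2y(X[:30], y[:30])                         # internal validation on 30 samples
b = fit(X[:30], y[:30])
err = rmsep(y[30:], predict(b, X[30:]))          # external validation on 10 test samples
```

The first 30 samples play the role of the training set and the last 10 that of an independent test set.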

6.3.3. Orthogonal-PLS

ORTHOGONAL-PLS (O-PLS) [46] is a modification of PLS developed to improve interpretation of the resulting models. The difference between PLS and O-PLS is that the latter splits the variation in X, needed to predict Y in PLS, into two parts: the Y-related variation (denoted by the subscript P in Eqs. 8 and 9) and the Y-orthogonal variation (denoted by the subscript O in Eq. 8). PLS models the Y-related and the Y-orthogonal variation together. This does not affect the predictive ability but, in some cases, interpretation can suffer. The reason is that it is difficult to know whether or not the modelled variation is related to Y. O-PLS facilitates interpretation since these two sources of variation are separated, providing the opportunity to interpret them independently. This also makes it possible to estimate the size of the variation in X that is related to Y. The Y-orthogonal variation is the variation in X that is not correlated to Y, but which does overlap with the Y-related variation (Y-orthogonal variation which does not overlap with the Y-related variation is modelled neither by O-PLS nor by PLS). In order to achieve good predictions, the Y-orthogonal variation must be considered in cases where it overlaps with the Y-related variation. The models of X and Y are presented in Eqs. 8 and 9, respectively.

Model of X: $X = T_{O}P_{O}^{T} + T_{P}P_{P}^{T} + E$ (Eq. 8)

Model of Y: $Y = T_{P}C_{P}^{T} + F$ (Eq. 9)

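[Figure 4: schematic showing the matrices X (N × K) and Y (N × M) together with the scores Tp and To, the weights Wp^T and Wo^T, and the loadings Pp^T, Po^T and Cp^T.]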

Figure 4. An overview of an O-PLS model. The variation in X is split into two parts: one part related to Y, TP, and the other part orthogonal to Y, TO. This facilitates interpretation since the two sources of variation can be interpreted independently.

6.3.4. Batch Modelling

MULTIVARIATE STATISTICAL BATCH processing or batch modelling [47] can be applied to processes that have defined start and end points, and where the samples are characterized at several sequential time points, using a set of common descriptors. The idea behind batch modelling is to let a set of representative batches define the normal progress of batches over time. This is usually accomplished by using PLS to find the relation between the descriptors and some maturity index (often time or yield). PLS parameters such as score vectors, “residual” (DmodX) or predicted response are then used to define control limits. The most commonly used control limits are ±3 standard deviations from the average for each time point [48], see Figure 5 for an example. The transparency of a PLS model makes it possible to detect deviations from normality and, further, to detect variables (descriptors) that contribute to these deviations. The steps described above are part of the “observation level” or level 1 in batch modelling. At level 2 or the “batch level”, the batches’ total response over time is considered. One way to accomplish this is by using the scores from level 1 as new descriptors for the batches, see Figure 6.


Batch modelling was initially developed for monitoring industrial batch processes, but has since been applied, with good results, to monitoring and interpreting time-dependent toxicological responses in metabonomic NMR studies [49, 50] and wood formation in plants using NMR [51].

[Figure 5 plot: score t[1] versus time (days).]

Figure 5. An example of batch modelling at level 1 (observation level) from Paper IV. Rats from two different groups, one control and one dose group, were studied over time. Urine samples were collected at six time points for each rat and characterized using GC/MS and H-MCR (Paper III). Each animal is here seen as one batch. The rats in the control group (gray lines) were used to calculate a PLS model against time. The PLS score vectors reflect the normal variation over time (the first score vector t1, is used here), defined by the control limits (dashed lines). Projecting the data about the treated animals (black lines) into the model revealed a difference in behaviour over time for these animals. The fact that the treated animals’ data are within the control limits for the first time points (including pre-dose) indicates that the differences are due to the treatment, and not to the inter-animal variation.



Figure 6. An overview of the different steps in batch modelling. All samples are first sorted batch-wise and within each batch they are sorted in chronological order. At level 1 (observation level) a PLS model is calculated to find the relationship between the sample description and a maturity index. The score values from the level 1 PLS model are combined batch-wise and function as new descriptors for each individual batch. This new data matrix is then subjected to multivariate analysis at level 2 (batch level). In global metabolite analysis, a batch can be defined as an animal, a human or a plant, that has been characterized at several sequential time points.


6.4. Processing Hyphenated Data

DATA OBTAINED FROM HYPHENATED MEASUREMENTS, such as GC/MS and LC/MS, are by definition two dimensional, consisting of a spectral and a chromatographic dimension. The data can be regarded as a matrix XCS consisting of NCS rows (time points/scans) and KCS columns (mass channels). Each cell in the matrix contains an intensity value, which can be regarded as the third dimension in the data. By plotting one column (m/z value) of the data, an ion chromatogram for that specific m/z is obtained and by plotting one row (time point) the mass spectrum recorded at that time point is revealed, see Figure 7.

Figure 7. Overview of the data structure generated by GC/MS or LC/MS measurements. The data is two dimensional, with one spectral and one chromatographic dimension.

The number of time points is dependent on the analysis time and the spectrum sampling frequency. The number of mass channels is dependent on the m/z range and the accuracy of the mass detector. To get an overview of one sample, the TIC (Total Ion Current) can be used to obtain a two dimensional plot (time vs. intensity). The TIC is a chromatogram in which all responses recorded at the same time point are summed.
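The data structure described above maps directly onto a two-dimensional array. The following sketch uses random numbers in place of measured intensities; the array sizes and the chosen row and column indices are arbitrary assumptions.

```python
import numpy as np

rng = np.random.default_rng(2)
n_scans, n_channels = 200, 50              # N_CS time points, K_CS mass channels
X_cs = rng.random((n_scans, n_channels))   # stand-in for one GC/MS or LC/MS sample

ion_chromatogram = X_cs[:, 10]   # one column: intensity over time for one m/z
mass_spectrum = X_cs[120, :]     # one row: mass spectrum recorded at one time point
tic = X_cs.sum(axis=1)           # TIC: sum over all mass channels at each time point
```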

Each compound detected will give rise to a signal in the time range during which it elutes from the chromatographic system. The channels in which the signals are recorded depend on which fragments are formed from the compounds in the ion source. Each compound has a chromatographic and a spectral profile. The chromatographic profile can be used for quantification of the compound and the spectral profile (mass spectrum) in combination with retention time can be used to identify it.

The goal when processing hyphenated data in global metabolite analysis is to extract information, so that each detected compound can be quantified and identified. There are several ways of achieving this, including (1) manual integration, (2) peak detection, (3) curve resolution/deconvolution; and (4) multivariate curve resolution.

6.4.1. Manual Integration

MANUAL INTEGRATION OF PEAKS is one way of extracting information from GC/MS and LC/MS data. Here the areas of the peaks are determined by integrating the area under the curve for one or more compound-specific ions. This approach, however, can never fulfil the criterion that the selection of compounds should be unbiased. Other problems are also associated with this technique: for some compounds it is very difficult to find unique ions, and the procedure is very time-consuming.

6.4.2. Peak Detection

WHAT IS A PEAK? That is the first question to consider when trying to identify one. A simple definition is that a peak is a local maximum of a signal with an intensity above a certain specific threshold. For a local maximum, the first derivative is zero, and this fact is used to locate peaks. Criteria with respect to shape may also be considered. If a potential peak is too short it may be the result of interference and if it is too broad it may be due to background noise. When a peak is detected it is defined by three parameters: (1) its position in time; (2) its position in the spectral dimension (its m/z value); and (3) the peak intensity (height or area). The first two parameters are used for identification of the peak and the third for quantification. Examples of how peak detection methods can be used in global metabolite analysis have been presented elsewhere [32, 52, 53].
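The simple peak definition given above (a local maximum above a threshold) can be sketched for a single ion chromatogram as follows. This is only an illustration of the principle, not one of the cited peak detection methods; the function name `detect_peaks`, the threshold and the simulated signal are assumptions, and shape criteria are left out.

```python
import numpy as np

def detect_peaks(signal, threshold):
    """Indices where the first difference changes from positive to non-positive
    (a discrete version of 'first derivative is zero') and the height exceeds threshold."""
    d = np.diff(signal)
    idx = np.where((d[:-1] > 0) & (d[1:] <= 0))[0] + 1
    return idx[signal[idx] > threshold]

# Simulated ion chromatogram with two Gaussian peaks (assumption)
t = np.arange(100)
signal = (10 * np.exp(-0.5 * ((t - 30) / 3) ** 2)
          + 4 * np.exp(-0.5 * ((t - 70) / 4) ** 2))

peaks = detect_peaks(signal, threshold=1.0)   # positions in time
heights = signal[peaks]                       # peak intensities (heights)
```

Applied to every mass channel in turn, each detected peak would be described by its time position, its m/z value and its intensity, as in the three parameters listed above.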


6.4.3. Curve Resolution/Deconvolution

CURVE RESOLUTION OR DECONVOLUTION can be regarded as ‘mathematical chromatography’, in which compounds that are not separated by chemical chromatography are separated into their pure profiles in silico, see Figure 8. The reason for doing this is that quantification and identification are possible if pure profiles are obtained, see Figure 9. In curve resolution the assumption is that the data matrix XCS can be decomposed into spectral profiles S and chromatographic profiles C, see Eq. 10.

$X_{CS} = CS^{T} + E_{CS} = c_1 s_1^{T} + c_2 s_2^{T} + \dots + c_A s_A^{T} + E_{CS}$ (Eq. 10)

The spectral and chromatographic profiles for the compounds are the underlying factors in the data matrix XCS. PCA can model these underlying factors. However, the scores and loadings obtained for a PCA model will not be the spectral and chromatographic profiles of XCS (if more than one compound is detected), since the profiles are not necessarily orthogonal. Therefore, curve resolution or deconvolution methods have been specifically developed to deal with this situation.

Figure 8. An illustration of curve resolution. The TIC is built up of three different profiles, which can be found by applying curve resolution methods. The three compounds are not separated completely by the chromatographic system. However, using curve resolution the compounds can be resolved into their pure profiles.

[Figure 9 plots: (A) intensity versus time (s); (B) intensity versus m/z.]

Figure 9. (A) A chromatographic profile for one compound (citrate-3TMS). (B) The corresponding spectral profile (black), together with the standard spectrum of citrate from a spectral library (gray). The area beneath the chromatographic profile is used to quantify the compound and the resolved mass spectrum (black in B) is used, together with the retention time or retention index, for identification.

There are several different approaches to curve resolution. These can be divided into three types: iterative, non-iterative and a combination of the two called hybrid methods [54, 55].

The methods work in different ways, but one common step is that if S is found C can be calculated, and vice versa, using equations 11 and 12.

From S to C: $C = X_{CS}S(S^{T}S)^{-1}$ (Eq. 11)

From C to S: $S = X_{CS}^{T}C(C^{T}C)^{-1}$ (Eq. 12)
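Eqs. 11 and 12 are ordinary least-squares projections, which a small numerical check makes concrete. The simulated profiles below are assumptions; with noise-free data and the true profiles on one side, the other side is recovered exactly.

```python
import numpy as np

rng = np.random.default_rng(3)
C_true = rng.random((60, 3))    # chromatographic profiles (time points x compounds)
S_true = rng.random((40, 3))    # spectral profiles (mass channels x compounds)
X_cs = C_true @ S_true.T        # noise-free data following Eq. 10 with E_CS = 0

# Eq. 11: chromatographic profiles from known spectral profiles
C_est = X_cs @ S_true @ np.linalg.inv(S_true.T @ S_true)
# Eq. 12: spectral profiles from known chromatographic profiles
S_est = X_cs.T @ C_true @ np.linalg.inv(C_true.T @ C_true)
```

In practice a least-squares solver such as `np.linalg.lstsq` is usually preferred over forming the inverse explicitly, since it is numerically more stable when profiles are nearly collinear.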

Examples of non-iterative or direct methods are: Heuristic Evolving Latent Projections (HELP) [56]; Evolving Factor Analysis (EFA) [57]; and Orthogonal Projection Resolution (OPR) [58]. All these methods use information windows to resolve the data. The information windows contain information from a “selective information region” (a part of the chromatogram where only one compound elutes) and “zero concentration regions” (parts of the chromatogram where a specific compound does not exist). Another important tool is the local rank map, which shows the number of compounds eluting at each time point. Although direct methods give good results for curve resolution they are not suitable for metabolic studies, since they demand a great deal of manual work and are difficult to automate, making them time consuming.

Examples of iterative methods are: Alternating Regression (AR) [59]; Iterative Target Transformation Factor Analysis (ITTFA) [60]; and Gentle [61, 62].

Although there are an infinite number of solutions with the same fit when decomposing XCS into chromatographic and spectral profiles, only one of them is correct. To avoid solutions containing obvious errors, constraints are applied to find the most likely solutions. The most commonly used constraint, for both chromatographic and spectral profiles, is non-negativity. This is logical, since a compound either elutes at a time point or it does not, so its chromatographic profile cannot be negative. Likewise, when a compound elutes it produces a signal in some of the mass channels and no signal in others, so its spectral profile cannot be negative either. To obtain non-negativity the negative values are replaced by zeros, see Figure 10.

Figure 10. Depiction of the non-negativity constraint. Both chromatographic and spectral profiles are constrained to be non-negative. Negative values are replaced by zeros.

Another frequently applied constraint for chromatographic profiles is unimodality. Unimodality implies that a profile has only one local maximum (which is therefore also the global maximum). One way to obtain unimodality is to find the maximum and then move stepwise towards the start and the end points. If a value is found which is greater than the previous one, it is replaced by the previous one, see Figure 11.
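The two constraints can be written as small functions operating on one profile. This is a sketch of the procedures as described in the text (zeroing negative values, then walking outwards from the maximum); the function names and the example profile are assumptions.

```python
import numpy as np

def non_negative(profile):
    """Replace negative values by zeros."""
    return np.maximum(profile, 0.0)

def unimodal(profile):
    """Walk from the maximum towards both ends; a value greater than the
    previous one (closer to the maximum) is replaced by the previous one."""
    out = np.asarray(profile, dtype=float).copy()
    top = int(np.argmax(out))
    for i in range(top + 1, len(out)):       # towards the end point
        out[i] = min(out[i], out[i - 1])
    for i in range(top - 1, -1, -1):         # towards the start point
        out[i] = min(out[i], out[i + 1])
    return out

raw = np.array([0.0, 1.0, -0.5, 3.0, 2.0, 2.5, 1.0, -0.2])
constrained = unimodal(non_negative(raw))
# constrained is [0, 0, 0, 3, 2, 2, 1, 0]
```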



Figure 11. Chromatographic profiles are constrained to be unimodal, meaning that the signal has a positive (or zero) first derivative before the maximum and a negative first derivative after the maximum.

Before starting to use an iterative method, the number of profiles to resolve (the chemical rank) must be decided. This can be done by first calculating a PCA model of the hyphenated data table XCS and counting the number of eigenvalues (see section 6.3.1; PCA) above a certain specified noise limit [54]. This approach is based on the assumption that each compound contributes uniquely to XCS. During the iterations, the constraints are applied to make the solution more likely. Alternating Regression (AR) iterates between the two equations (11 and 12) and applies the constraints after each operation. The procedure is repeated until convergence and it can start with either of the two operations. Different starting points have been used in the literature, e.g. random numbers [59] and pure variables found by the simplified Borgen method [61]. Pure variables are defined as variables that are unique for one of the profiles.
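A bare-bones version of the alternating scheme might look as follows. This is a sketch under several assumptions, not the published AR algorithm: it starts from random spectra, applies only non-negativity (unimodality is omitted for brevity), rescales the spectra in each round to fix the scale ambiguity, and stops when the residual sum of squares no longer changes. The function name and all parameter defaults are hypothetical.

```python
import numpy as np

def alternating_regression(X_cs, n_profiles, n_iter=500, tol=1e-10, seed=0):
    rng = np.random.default_rng(seed)
    S = rng.random((X_cs.shape[1], n_profiles))        # random starting spectra
    previous = np.inf
    for _ in range(n_iter):
        S = S / (np.linalg.norm(S, axis=0) + 1e-12)    # fix the scale of each spectrum
        C = X_cs @ S @ np.linalg.pinv(S.T @ S)         # Eq. 11
        C = np.maximum(C, 0.0)                         # non-negativity on C
        S = X_cs.T @ C @ np.linalg.pinv(C.T @ C)       # Eq. 12
        S = np.maximum(S, 0.0)                         # non-negativity on S
        residual = np.sum((X_cs - C @ S.T) ** 2)
        if abs(previous - residual) <= tol * np.sum(X_cs ** 2):
            break                                      # converged
        previous = residual
    return C, S

# Simulated segment with two overlapping compounds (assumption)
t = np.arange(60)
C_true = np.column_stack([np.exp(-0.5 * ((t - 25) / 4.0) ** 2),
                          np.exp(-0.5 * ((t - 33) / 4.0) ** 2)])
S_true = np.random.default_rng(5).random((40, 2))
X_cs = C_true @ S_true.T

C, S = alternating_regression(X_cs, n_profiles=2)
```

The pseudo-inverse is used instead of a plain inverse so that the sketch does not fail if a profile is driven to zero by the constraint.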

Due to the complex composition of samples used in global metabolite analysis it is not possible to resolve the whole XCS matrix in a single step. Therefore, the data are divided into segments and each segment is resolved separately. In contrast to direct methods, iterative methods require less or no manual input and can be automated, making them more useful for global metabolite analysis [63, 64].

6.4.4. Multivariate Curve Resolution

MULTIVARIATE CURVE RESOLUTION (MCR) was initially developed to study industrial processes over time [65], but it has also been used to resolve second order analytical data such as that from LC/DAD [66]. The main difference between MCR and “classic” curve resolution is that, in MCR, multiple samples are treated simultaneously rather than being analysed separately. The simultaneous treatment of multiple samples adds an extra dimension to the data structure, see Figure 12. The data can be seen as a cube with a spectral, a chromatographic and a sample dimension. This cube is then unfolded to form a matrix with the same number of columns as the number of mass spectral variables (m/z) and the number of rows corresponding to the number of samples multiplied by the number of chromatographic time points.

[Figure 12 diagram: a data cube with sample, chromatographic (time) and spectroscopic (m/z) dimensions, unfolded sample by sample into a matrix with time along the rows and m/z along the columns.]

Figure 12. A set of hyphenated samples has a three dimensional (cubic) structure with spectral, chromatographic and sample dimensions. In multivariate curve resolution, the cube is unfolded to a two dimensional matrix, which is resolved into spectral and chromatographic profiles.

The unfolded matrix, consisting of multiple samples described using the same spectral descriptors, is now the XCS matrix. This matrix can be resolved into spectral and chromatographic profiles (Eq. 10) using the AR algorithm (see section 6.4.3). Due to the fact that XCS consists of multiple samples, the chromatographic profiles will contain information about all the individual samples, but the spectral profiles will be common for all samples. Since the chromatographic profiles originate from multiple samples, unimodality cannot be applied in the same way as in “classic” curve resolution. Instead, the parts of the chromatographic profiles belonging to each sample are constrained to be unimodal.
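The unfolding step and the per-sample view of the chromatographic profiles can be sketched with array reshaping. The cube dimensions, the random stand-in data and the number of profiles are assumptions.

```python
import numpy as np

n_samples, n_times, n_channels = 4, 50, 30
rng = np.random.default_rng(4)
cube = rng.random((n_samples, n_times, n_channels))    # samples x time x m/z

# Unfold: stack the samples on top of each other, keeping m/z as columns
X_cs = cube.reshape(n_samples * n_times, n_channels)

# A resolved C has one row per (sample, time point). Reshaping it back gives
# the part of each profile that belongs to each sample, which is where a
# per-sample constraint such as unimodality would be applied.
n_profiles = 3
C = rng.random((n_samples * n_times, n_profiles))      # stand-in for resolved profiles
C_per_sample = C.reshape(n_samples, n_times, n_profiles)
```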



By using MCR all samples will be characterized with the same variables (metabolites), allowing multiple sample comparisons by means of multivariate analysis. The common spectral profile for each resolved compound also permits identification of the resolved profiles by searching a spectral library. Applying MCR to segments of the chromatographic dimension provides a tool for resolution of complex hyphenated data. This strategy is referred to as H-MCR, see section 7.3 and Papers III, IV and V.

6.4.5. Using the Processed Data for Sample Comparisons

IN GLOBAL METABOLITE ANALYSIS the main aim is to detect differences between samples based on their metabolite composition. Multivariate analysis is often used to model and detect these differences. In order to realize the full potential of multivariate analysis it is important to characterize all samples using the same variables. However, the outcomes of curve resolution and peak detection algorithms frequently do not meet this criterion. This may be due to peaks shifting as a result of analytical drift, the number of peaks (resolved compounds) differing between samples, or the estimates of the spectral profiles from the same compound differing between samples. Different numbers of peaks (compounds) may be obtained for several different reasons: samples may contain unique compounds, the number of compounds resolved may be incorrectly estimated or peaks may be above the detection limit in some samples and below the detection limit in others. The differences in estimates of the number of compounds to resolve can result from disturbances, from background noise or from non-linear responses. To overcome these problems, compounds from different samples have to be aligned or matched. The best way of matching resolved profiles is to identify the compounds corresponding to the resolved profiles using spectral databases (libraries). However, in many cases this is difficult since not all compounds can be found in databases, and some are not even known. If only known compounds are matched, it is difficult to be unbiased and also much information will be lost.

When matching compounds from different samples using retention times/indices and spectral profiles, one has to set limits on the similarity in retention times and correlations between spectral profiles or m/z accuracy required to consider corresponding peaks to originate from the same compound [67].
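A matching rule of this kind can be sketched as a simple test on two resolved profiles. The limits used here (2 s in retention time, a spectral correlation of 0.9) and the function name are purely illustrative assumptions, not values taken from the cited work.

```python
import numpy as np

def same_compound(rt_a, spec_a, rt_b, spec_b, max_rt_diff=2.0, min_corr=0.9):
    """True if retention times are close and spectral profiles correlate well."""
    close = abs(rt_a - rt_b) <= max_rt_diff
    corr = np.corrcoef(spec_a, spec_b)[0, 1]
    return bool(close and corr >= min_corr)

spectrum = np.array([0.0, 5.0, 1.0, 0.0, 9.0, 2.0])

match = same_compound(341.0, spectrum, 341.8, 1.1 * spectrum)       # same compound
too_far = same_compound(341.0, spectrum, 360.0, 1.1 * spectrum)     # retention time differs
different = same_compound(341.0, spectrum, 341.8, spectrum[::-1])   # spectrum differs
```

How strict the limits should be is a trade-off: tight limits split one compound into several variables, while loose limits merge different compounds.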


If compounds, or peaks, are present in several samples but missing in a few, should they be set to zero or should they be treated as missing values in samples where they have not been recorded? In some cases they should be set to zero, since they are not present, but in others a peak may not be found due to inefficient peak detection or curve resolution, and in such cases missing values might provide a better representation. If all samples in a study are resolved simultaneously, as in MCR, matching of resolved compounds is not required, since they will have a common spectral profile.


7. Results

You'll always miss 100% of the shots you don't take.
Wayne Gretzky

IN THIS CHAPTER, THE RESULTS from Papers I-V are summarized. The focus is on explaining the methods or strategies developed and presented in these papers.

7.1. Hierarchical Multivariate Data Compression (Papers I and II)

THE AIM OF THESE TWO STUDIES was to develop a strategy for simultaneous compression and representation of multiple samples analysed using GC/MS (Paper I) or LC/MS (Paper II). The strategy presented facilitates the extraction of a common set of descriptors from the GC/MS and LC/MS data used to characterize the samples. This was achieved by aligning the data in the chromatographic dimension and then dividing the samples into time windows (splitting the chromatographic dimension into segments). The information in each time window was then summed over the chromatographic dimension to form a total mass spectrum for each sample. This technique generates one matrix XCS consisting of NCS rows (the number of samples) and KCS columns (the number of mass channels) for each time window. These matrices are then compressed using a bilinear compression method, Alternating Regression (AR), to form intensity vectors, reflecting the relative metabolite concentrations between samples, and corresponding spectral profiles. The intensity vectors from all time windows are then combined to form the matrix X, which is suitable for further multivariate analysis e.g. by PCA, PLS and O-PLS. All steps are illustrated in Figure 13.
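The windowing and summation steps can be sketched as follows, assuming the aligned data are held as a samples × time × m/z array. The function name, the array sizes and the window boundaries are hypothetical.

```python
import numpy as np

def window_matrices(cube, edges):
    """For each time window, sum over the chromatographic dimension to get one
    total mass spectrum per sample, i.e. one X_CS (samples x m/z) per window."""
    return [cube[:, start:stop, :].sum(axis=1)
            for start, stop in zip(edges[:-1], edges[1:])]

rng = np.random.default_rng(6)
cube = rng.random((5, 90, 20))     # 5 aligned samples, 90 scans, 20 mass channels
edges = [0, 30, 55, 90]            # hypothetical window boundaries (scan indices)
matrices = window_matrices(cube, edges)
```

Each matrix in the list would then be compressed separately, and the resulting intensity vectors placed side by side to form X.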


Figure 13. Work flow for the strategy presented in Papers I and II. (1) The first step is alignment of the data in the chromatographic dimension. (2) The aligned data are divided into time windows (segments). (3) For each time window, each sample is summed to a total mass spectrum, forming the matrix XCS. (4) XCS is compressed using bilinear compression (AR) to obtain intensity vectors and spectral profiles. This is repeated for all time windows. (5) The intensity vectors from all time windows are combined to form the data matrix X. (6) The data matrix X can then be subjected to multivariate analysis.

The differences between the approaches presented in Papers I and II are in the alignment procedure and the time window setting step. The reason for the differences is that in Paper I GC/MS data were used, while in Paper II the data were from LC/MS. In Paper I the alignment was conducted by determining the maximum covariance between the sample TICs [68], and the time windows were set manually, by dividing the segments at regions of low global intensity (time points where the total intensity is low in all samples) to prevent splitting peaks. These steps could not be applied to the LC/MS data since it was not possible to find such global low intensity regions. Therefore a different alignment strategy was used in Paper II.
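The idea of aligning by maximum covariance can be illustrated with a rigid shift search between two TICs. This sketch assumes a single global shift per sample and uses a circular shift for simplicity, which may differ from the procedure in [68]; the function name and the search range are hypothetical.

```python
import numpy as np

def best_shift(tic_ref, tic, max_shift=10):
    """Shift (in scans) that maximizes the covariance between tic and tic_ref."""
    shifts = range(-max_shift, max_shift + 1)
    ref = tic_ref - tic_ref.mean()
    cov = [np.dot(ref, np.roll(tic, s) - tic.mean()) for s in shifts]
    return shifts[int(np.argmax(cov))]

# Simulated TICs: the second is delayed by three scans (assumption)
t = np.arange(100)
reference = np.exp(-0.5 * ((t - 50) / 3.0) ** 2)
shifted = np.exp(-0.5 * ((t - 53) / 3.0) ** 2)

shift = best_shift(reference, shifted)
aligned = np.roll(shifted, shift)      # np.roll wraps around at the ends
```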

For the LC/MS data, the alignment was conducted by first mathematically constructing an average sample, in which each mass channel was searched individually to detect peaks, see figure 14. Then all samples were searched for peaks. This search was limited to narrow regions close to the positions where the peaks were detected in the average sample. The detected peaks were represented by a “needle” of the same height as the detected peak. This method of data representation has been used previously for NMR and GC data [69]. The position of the needles was aligned to the corresponding peak detected in the average sample. In Paper II the time windows were set to a fixed narrow interval. The reason that the time windows could be narrower in this case is that the needles did not have any spread in the chromatographic dimension and hence could not be split.


Figure 14. One ion chromatogram from the average sample, constructed mathematically based on all samples. All mass channels (ion chromatograms) in the average sample are searched individually for peaks. The peaks detected are represented by a needle.

The use of Alternating Regression (AR) to compress each time window differs slightly from the use of AR for curve resolution. Since the data matrices XCS in this case do not have a chromatographic dimension (rows represent samples, in contrast to different chromatographic time points), unimodality cannot be applied as a constraint. Hence non-negativity is the only constraint applied. However, the algorithm still alternates between the same two equations (Eqs. 11 and 12) until convergence.

In this case C does not represent the chromatographic profiles. Instead, it consists of “concentration” or rather intensity vectors. As stated previously, intensity vectors from all time windows were combined to form the matrix X. Each variable in X has a corresponding spectral profile, which originates from a specific time window. Thus, it is possible to trace the findings highlighted in a multivariate model of X back to the raw data (the GC/MS or LC/MS data). The number of profiles to extract from each time window is found by calculating a PCA model of XCS for each time window, where the number of eigenvalues over a certain specified threshold determines the number of profiles to extract.
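Counting eigenvalues above a threshold can be sketched with a singular value decomposition. Whether the data are centred before the PCA model is calculated is not stated here, so the sketch works on the uncentred matrix (an assumption); the function name, the threshold and the simulated data are also illustrative.

```python
import numpy as np

def n_profiles(X_cs, noise_limit):
    """Number of eigenvalues of X_cs^T X_cs (squared singular values) above noise_limit."""
    singular_values = np.linalg.svd(X_cs, compute_uv=False)
    return int(np.sum(singular_values ** 2 > noise_limit))

# Simulated time window: 20 samples, 40 mass channels, three underlying profiles
rng = np.random.default_rng(7)
intensities = rng.random((20, 3))
spectra = rng.random((40, 3))
X_cs = intensities @ spectra.T + 1e-6 * rng.normal(size=(20, 40))

k = n_profiles(X_cs, noise_limit=1e-3)
```

The result depends directly on the chosen threshold, which in practice has to be related to the noise level of the instrument.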

The two versions of hierarchical compression used in Papers I and II fulfil the following criteria: (1) they generate data that describe the samples with a common set of representative variables; (2) they can be used to create interpretable multivariate models; (3) they can cope with analytical variation in the chromatographic dimension (drifts in the spectral dimension were not investigated); and (4) they are valid for handling and predicting new independent samples (only demonstrated in Paper II). In addition, these compression approaches can be conducted much more quickly than curve resolution.

In the study described in Paper I GC/MS measurements of extracts from hybrid aspen (Populus tremula × Populus tremuloides) leaves were used to illustrate the proposed strategy. Leaves at different developmental stages, subjected to different photoperiods, were investigated. PLS-DA models were used to discriminate between the samples in different developmental stages as well as between plants subjected to the different photoperiods. The PLS-DA models were based on descriptors derived using the strategy presented and from peak detection/deconvolution using ChromaTOF (1.00) software (Leco Corp.). The study showed that models based on the different strategies produced similar results.

In the study described in Paper II human urine samples from two population cohorts, Shanxi (People’s Republic of China) and Honolulu (United States of America), characterized using LC/MS, were used to illustrate the strategy. Models with the ability to discriminate between individuals from the two populations based on their urine metabolite composition, showing high prediction accuracy (87.4%) for independent samples, were obtained by applying the strategy presented in combination with PLS-DA.

7.2. Hierarchical Multivariate Curve Resolution (Paper III)

ALTHOUGH THE METHOD PRESENTED in Papers I and II works well for multiple sample comparisons and provides useful information about the origin of the extracted variables, identification of the metabolites behind the variables highlighted in the multivariate analysis soon became a bottleneck in the work flow. Therefore, there was a need to find a processing strategy that provided more detailed information about the extracted variables, and still fulfilled the criterion that they should be common to all samples. In Paper III it was shown that multivariate curve resolution (MCR) fulfils these criteria for GC/MS data. The complexity and size of the GC/MS dataset made it impossible to apply MCR to the entire data cube. Therefore the same strategy as used in Paper I for dividing the data into time windows was applied as a pre-step to MCR. The entire procedure is referred to as hierarchical multivariate curve resolution (H-MCR); an overview of all the steps is presented in Figure 15. In the MCR step, iteration between equations 11 and 12 is repeated until convergence. After each step in the iteration, constraints are applied. For the spectral profiles S, non-negativity was the only applied constraint. For the chromatographic profiles C, non-negativity, unimodality within each sample and a constraint regarding elution time were applied. The latter requires one profile to have approximately the same elution time for all samples. Profiles not fulfilling this constraint are set to zero. The number of components to resolve is determined iteratively for each time window and can differ between time windows. The search starts with one profile and then the number of profiles is increased, one at a time, until three sequential solutions are rejected as not valid (the last valid solution is used). A solution is rejected if it does not fulfil the criterion that the profiles must elute in the same order in all samples.
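The search for the number of profiles in one time window can be sketched as a loop with a rejection counter. The validity check itself (same elution order in all samples) is represented by a hypothetical callback `is_valid`, and the upper bound `max_profiles` is an added safeguard that is not part of the description above.

```python
def select_n_profiles(is_valid, max_profiles=50, patience=3):
    """Increase the number of profiles one at a time until `patience` sequential
    solutions are rejected; return the last valid number (0 if none was valid)."""
    last_valid, rejected, k = 0, 0, 1
    while rejected < patience and k <= max_profiles:
        if is_valid(k):          # hypothetical check of the solution with k profiles
            last_valid, rejected = k, 0
        else:
            rejected += 1
        k += 1
    return last_valid

# Example: solutions with 1, 2, 3 and 5 profiles are valid, all others are rejected
chosen = select_n_profiles(lambda k: k in {1, 2, 3, 5})
# chosen is 5: the rejection at k = 4 is followed by a valid solution, which
# resets the counter; the search stops after k = 6, 7 and 8 are rejected.
```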

The advantage of using H-MCR, compared to the method presented in Paper I, is that more detailed information about the origin and identity of the variables is obtained. The variables are represented here by the areas under the resolved chromatographic profiles, which have a specific retention time and a corresponding spectral profile. The retention time can be translated to a retention index, if an alkane series has been measured using the same setup and under the same conditions. If the resolved spectral profiles are subjected to a library search it is possible to identify the resolved profiles as unique metabolites (provided that the resolved profiles are of good quality and the spectral library contains the specific compound). In comparison to the strategy presented in Paper I, H-MCR is relatively time consuming but, nevertheless, can be conducted in the same timescale as the GC/MS analysis itself.


[Figure 15 diagram, steps 1 to 6: data cube (samples × time × m/z); division into time windows; unfolding to Xcs for each time window; MCR giving C and S^T; integrated areas forming the X-matrix (samples × variables/metabolites); multivariate analysis (MVA), highlighted variables and library search.]

Figure 15. An overview of the H-MCR strategy. The three dimensional data structure (multiple GC/MS samples) (1) is split up into sample cubes by dividing the chromatographic dimension into time windows (2). Each of these “smaller” cubes is unfolded (3) into a matrix XCS which is then resolved using MCR, to obtain spectral S and chromatographic C profiles (4). The areas under the resolved profiles are calculated for all time windows and are used as variables in the matrix X (5). These variables should represent individual metabolites, if the resolution is accurate. Each variable in the matrix X corresponds to one row in the matrix containing all resolved spectral profiles which can be used for identification (6). The matrix X can be subjected to multivariate analysis (MVA). Metabolites corresponding to the variables highlighted in the multivariate analysis can be identified by means of a library search.
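Steps (3) and (5) of the overview can be sketched in a few lines. This is an assumed layout and not taken from the thesis: the sample cube is stored as samples × time × m/z, unfolding stacks the samples along the time direction, and the area under a profile is approximated by the sum over its time points. The function names are hypothetical.

```python
import numpy as np

def unfold(cube):
    """Unfold a (samples x time x m/z) cube into a (samples*time x m/z) matrix."""
    n_samples, n_time, n_mz = cube.shape
    return cube.reshape(n_samples * n_time, n_mz)

def profile_areas(C, n_samples):
    """Integrate resolved chromatographic profiles per sample.

    C has shape (samples*time x k), in the same row order as unfold() produces.
    Returns a (samples x k) block of the matrix X, one column per profile.
    """
    n_time = C.shape[0] // n_samples
    return C.reshape(n_samples, n_time, -1).sum(axis=1)
```

Concatenating the blocks returned by `profile_areas` for all time windows, column-wise, gives the matrix X that is passed on to the multivariate analysis.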


One of the benefits of using MCR compared to “classical” curve resolution is that no matching of resolved profiles is needed. Another benefit is that co-eluting compounds can be resolved, as long as they do not occur in the same proportions (ratios) in all samples, see figure 16.

[Figure 16: Sample 1 (rank = 1) and Sample 2 (rank = 1); combined data (Σ), rank = 2.]

Figure 16. Two co-eluting profiles cannot be resolved if one sample at a time is considered. The gray profile has the same shape as the black profile in both samples 1 and 2. Hence, the gray profile can be seen as the black profile multiplied by a scaling factor in both samples 1 and 2, so the mathematical rank for each sample is 1. This factor is not the same for the two samples, so if the two samples are resolved simultaneously there is no such factor and therefore the two profiles (the black and the gray) can be mathematically separated, since the rank is now 2.
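The rank argument can be checked numerically. In the sketch below, two compounds share exactly the same chromatographic shape but occur in different proportions in two samples; all numbers (peak shape, spectra, amounts) are invented for illustration.

```python
import numpy as np

t = np.arange(50)
peak = np.exp(-0.5 * ((t - 25) / 4.0) ** 2)  # shared chromatographic shape

s_black = np.array([1.0, 0.0, 0.5])  # spectrum of the "black" compound
s_gray = np.array([0.0, 1.0, 0.5])   # spectrum of the "gray" compound

def sample(amount_black, amount_gray):
    """One GC/MS sample (time x m/z) with two perfectly co-eluting compounds."""
    return np.outer(peak, amount_black * s_black + amount_gray * s_gray)

def numerical_rank(M):
    return int(np.linalg.matrix_rank(M))

X1 = sample(1.0, 0.5)  # gray : black ratio 0.5
X2 = sample(1.0, 2.0)  # gray : black ratio 2.0

rank_sample_1 = numerical_rank(X1)                  # each sample alone has rank 1
rank_sample_2 = numerical_rank(X2)
rank_combined = numerical_rank(np.vstack([X1, X2])) # unfolded together: rank 2
```

If the ratio were the same in both samples, the stacked matrix would also have rank 1 and the two compounds could not be separated.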

To minimize the analysis time, only spectral profiles highlighted in the multivariate analysis are subjected to the library search. The entire strategy has the potential to be fully automated, but there is then a risk that it will become a “black box” approach. Therefore, it is a good idea to undertake the time window selection manually. The benefit of this is that samples that are anomalous for some reason will be detected and can be removed prior to further processing.

In Paper III, three test cases were used to demonstrate the potential of the H-MCR strategy: (1) a standard mixture test case, consisting of 101 compounds of which 17 were varied according to an experimental design; (2) plant extracts from five different genotypes of Arabidopsis; and (3) blood plasma extracts from men and women (in supporting information for paper III).

Test case 1 was used to illustrate that the resolved compounds provided spectral profiles of high quality, which could be validated through identification by library searches, and that the resolved chromatographic profiles contained quantitative information. Test cases 2 and 3 were used to illustrate that the method provided a relevant description of the samples, which could be used in multivariate analysis for detection of differences with respect to the different sample classes, e.g. genotype and gender. In addition, test cases 2 and 3 were used to compare the quantitative information in the H-MCR resolved profiles with the quantitative information obtained using manual integration. The comparison revealed that similar results were obtained using the different methods.

7.3. Modelling of Time Dependent Responses (Paper IV)

In Paper IV, multi-parametric metabolic toxic responses in rat urine samples were measured over time using GC/MS. The H-MCR strategy (Paper III) and batch modelling (see section 6.3.4) were used to study the toxin-induced metabolic changes over time.

The samples in the study originated from two groups of rats: those that had been exposed to a drug (a liver toxin) and control rats (not treated). Each group consisted of five animals. Urine samples from each rat were collected at six time points (day 0, pre-dose; and days 1, 2, 3, 5 and 7, post-dose). The urine samples were characterized using GC/MS, with two replicates. In total 120 GC/MS measurements (two groups × five animals × six time points × two replicates) were carried out. All samples were subjected to H-MCR and the areas under the 268 resolved profiles (metabolites) were used as sample descriptors in the batch modelling procedure. The control rats were used as model samples in the batch modelling to define normality. The data for the treated rats were then projected into the model. Deviations from normality defined by the control samples were observed for the treated animals from day 2 onwards, see figure 17.

[Figure 17: first score vector t[1] plotted against time (days).]

Figure 17. The first score vector over time for each animal. Data from the treated animals (black) were projected into the model based on the control animals (gray) and revealed differences over time for the treated animals. The dashed line denotes the control limits (mean ±3 standard deviations at each time point). The fact that the treated animals are inside the control limits for the first time points (including pre-dose) indicates that the differences are due to the treatment and not to inter-animal variation.
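The control limits in figure 17 can be reproduced in outline as follows. This is a sketch under stated assumptions: the scores are held in a plain dictionary keyed by day, the sample standard deviation (n − 1 denominator) is used, and the function names are hypothetical. The thesis does not specify these details.

```python
from statistics import mean, stdev

def control_limits(control_scores):
    """Mean +/- 3 standard deviations of the control scores at each time point.

    control_scores maps day -> list of first-component scores, one per
    control animal (illustrative data layout).
    """
    limits = {}
    for day, values in control_scores.items():
        centre, spread = mean(values), stdev(values)
        limits[day] = (centre - 3 * spread, centre + 3 * spread)
    return limits

def outside_limits(score, day, limits):
    """True if a projected score falls outside the control limits for that day."""
    lower, upper = limits[day]
    return score < lower or score > upper
```

A treated animal whose projected score returns `True` from `outside_limits` at a given day deviates from the normality defined by the control animals at that time point.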

Figure 18. The cause of the differences between control and treated rats over time is revealed in the PLS contribution plot. The largest source of the deviation between the control and treated rats is in the exogenous compounds (drug metabolites). However, by zooming in on the plot, other variables can be found that differ between the treated and control animals over time. The variable Win22C04 corresponds to profile number 4 (in retention order) in time window number 22, similarly Win15C03 is profile number 3 in time window 15.



The PLS contribution plot (figure 18) revealed which variables (resolved profiles/metabolites) caused this deviation. One variable dominated the effect: Win22C04 (resolved profile number 4 in time window number 22). However, by zooming in on the contribution plot, other variables were found that also contributed to the difference. An example of such a variable was Win15C03.
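Because each variable name encodes both the time window and the position of the profile within it, a highlighted variable can be traced back to its resolved spectrum with a small helper. The sketch below assumes that all names follow the pattern seen in the two examples (`Win` + window number + `C` + profile number); the function name is hypothetical.

```python
import re

_VARIABLE_NAME = re.compile(r"^Win(\d+)C(\d+)$")

def parse_variable(name):
    """Split a name such as 'Win22C04' into (time window, profile number)."""
    match = _VARIABLE_NAME.match(name)
    if match is None:
        raise ValueError(f"unexpected variable name: {name!r}")
    return int(match.group(1)), int(match.group(2))
```

For example, `parse_variable("Win22C04")` gives window 22 and profile 4, which is the spectral profile to submit to the library search.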

The observed differences between the control and treated rats originate from two main factors: (1) exogenous compounds detected in the urine of the treated rats at time points after the first dose; and (2) endogenous metabolites displaying a different pattern over time in the treated rats compared to the control rats, see figure 19.

[Figure 19: panels A and B, each plotting intensity against time (days).]

Figure 19. The variables highlighted in the contribution plot (figure 18) monitored over time. The gray solid lines denote the control rats, the gray dashed lines represent the control limits (mean ±3 standard deviations calculated based on the control rats) and the black solid lines denote treated rats. (A) Plot for the variable Win22C04 over time for each animal. It is evident that this must correspond to an exogenous compound since it is not present in the control samples, nor in the treated samples pre-dose. (B) Plot for the variable Win15C03 over time for each animal. The variable is a potential biomarker since it is present in all control samples and changes in the treated samples post-dose.

In summary, by using GC/MS, H-MCR and batch modelling it was possible to detect deviations between control and treated rats over time, based on their urine metabolite composition. It was further possible to differentiate between drug metabolites and potential biomarkers by investigating the individual variables. The findings were confirmed by tracing the results back to the raw data. Furthermore, by utilizing spectral libraries it was possible to identify the potential biomarker by performing a library search.


7.4. Processing and Prediction of Independent Samples (Paper V)

To reach beyond isolated studies and to use the models obtained for predictions, it has to be possible to treat new samples in the same manner as the model-building samples with respect to data processing. To illustrate the full potential of the H-MCR strategy, an extension of the method was developed to achieve this (Paper V). The ability to treat new samples according to obtained processing parameters not only facilitates prediction of the features of new samples, it also enables the use of a representative subset of samples to reduce processing time and to obtain well-balanced multivariate models (see figure 20 for an overview of the processing strategy).

H-MCR can be regarded as a transformation process, converting a GC/MS data matrix into a row vector described by the areas under the resolved profiles. In Paper V it was demonstrated that it was possible to apply this transformation to new independent samples. The processing parameters used for independent samples were: (1) the target TIC (sample), used for alignment; (2) the edges used for time window setting; (3) the spectral profiles S, to calculate chromatographic profiles C according to Eq. 12; and (4) the applied constraints.

Predictions are produced much faster than the model samples can be resolved because, when the model samples are resolved, the AR algorithm iterates between the two operations (Eqs. 11 and 12) until convergence. For the model samples this is repeated several times for each time window in order to find the number of profiles to resolve. For new independent samples the number of profiles to resolve and the spectral profiles S have already been estimated, since the processing parameters obtained for the model samples are used. Hence the only steps required to resolve independent samples are to use Eq. 11 once for each time window and to apply the same constraints as for the model samples.
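The single-step resolution of a new sample can be sketched as below. This is an illustration under assumptions, not the thesis code: the model is taken to be X = C Sᵀ, the chromatographic profiles are obtained by one least-squares step with the stored spectra held fixed, only non-negativity is applied, and the area is approximated by a sum over time points. The function name is hypothetical.

```python
import numpy as np

def resolve_new_sample(X_new, S):
    """Resolve one time window of a new sample using stored spectral profiles.

    X_new: (time x m/z) data for the window, already aligned to the target.
    S:     (m/z x k) spectral profiles estimated from the model samples.
    Returns the chromatographic profiles C (time x k) and their areas (k,).
    """
    C = X_new @ np.linalg.pinv(S.T)  # one least-squares step, no iteration
    C = np.clip(C, 0.0, None)        # non-negativity; other constraints omitted
    return C, C.sum(axis=0)
```

Because no iteration and no search over the number of profiles is needed, the cost per window is a single matrix product, which is consistent with the speed difference described above.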

[Figure 20: flow scheme linking original and new GC/MS samples, sample selection (selected, not selected, all samples), H-MCR, process parameters, resolved model-building samples, resolved prediction samples, MVA, MVA parameters and prediction.]

Figure 20. Scheme of the predictive metabolite profiling approach based on the H-MCR method. All or a representative selection of samples from the original set of samples are subjected to H-MCR. The H-MCR parameters obtained can then be used to resolve new samples or samples not selected in step 1. The data obtained after H-MCR processing or treatment according to the H-MCR parameters are collected in two separate data tables, where each sample corresponds to one row and each column to one “metabolite” (resolved profile). The value in each cell of the tables is the area under the resolved profile for a specific sample. Sample comparisons based on multivariate analysis can then be carried out either by modelling the processed data and using the model to make predictions of the treated samples or by merging the processed and treated data and hence calculating a model based on all samples.

It should be noted that there is a risk in applying such a model to new samples, since the new samples may contain compounds that are not present in any of the samples used to build the model. However, this risk can be minimized by using a representative set of samples. In Paper V it was proposed that all GC/MS samples should first be characterized by means of hierarchical data compression (Paper I) to form the matrix X, containing variation reflecting the compositional differences between them. PCA is then used to compress this matrix into a small number of score vectors. Finally, a space filling design [70] is used to select a representative subset of samples covering the PCA score space. Hierarchical data compression was chosen as the processing method because it is rapid and provides a good description of the data, from which a representative sample selection can be made. The selected samples are subjected to further processing using H-MCR, while the remaining samples are resolved using the process parameters obtained from the H-MCR procedure. This suggested strategy facilitates resolution of large sample sets, i.e. it facilitates high-throughput analysis.
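The selection of a representative subset can be sketched as follows. The thesis cites a space filling design [70] without giving the algorithm here, so the greedy maximin rule below is a stand-in chosen for illustration, not the cited method; the PCA step is computed by singular value decomposition of the column-centred matrix. Function names are hypothetical.

```python
import numpy as np

def pca_scores(X, n_components):
    """Score vectors of the column-centred matrix X (samples x variables)."""
    Xc = X - X.mean(axis=0)
    U, s, _ = np.linalg.svd(Xc, full_matrices=False)
    return U[:, :n_components] * s[:n_components]

def maximin_subset(T, n_select):
    """Greedy maximin selection of rows of the score matrix T.

    Starts with the two most distant samples, then repeatedly adds the
    sample whose nearest already-chosen neighbour is farthest away.
    """
    d = np.linalg.norm(T[:, None, :] - T[None, :, :], axis=2)
    i, j = np.unravel_index(np.argmax(d), d.shape)
    chosen = [int(i), int(j)]
    while len(chosen) < n_select:
        nearest = d[:, chosen].min(axis=1)
        nearest[chosen] = -1.0  # never pick the same sample twice
        chosen.append(int(np.argmax(nearest)))
    return chosen[:n_select]
```

The indices returned by `maximin_subset` would be the samples sent to full H-MCR processing; all other samples would be resolved with the stored processing parameters.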

In Paper V three examples were used to illustrate the strategy: (1) urine samples collected over time from rats exposed to a liver toxin; (2) plant leaf extracts from hybrid aspen (Populus tremula × Populus tremuloides), originating from leaves at different developmental stages; and (3) blood plasma samples from men and women.

In the rat urine example, analytical replicates were used to illustrate that processing independent samples works, demonstrating that the strategy is valid. To achieve this, the model samples and the “prediction” samples (the analytical replicates) were merged, post-processing, to form the matrix X, containing all samples. Hierarchical Cluster Analysis (HCA) [71] and PCA were used for the evaluation. It was found that the analytical replicates clustered together in the HCA and also ended up close to each other in the PCA score plot. This demonstrates that the analytical reproducibility is high and that valid processing of independent samples is achieved.

In the leaf extract example a PLS-DA model discriminating between extracts from leaves 2, 10 and 20 (counted from the top of the plant) was calculated based on the sample description obtained from H-MCR. Extracts from leaf 9 (analysed 11 months later) were resolved using the processing parameters from the model samples and predicted using the model. The leaf 9 sample predictions were closest to those for leaf 10, as expected since they are at a similar developmental stage. This shows that GC/MS and H-MCR can be used for creating models that are valid for predictions over time.

In the human blood plasma example, a well-balanced classification model was obtained by selecting samples from the two groups (men and women) using a space filling design. The samples not selected for H-MCR processing and for calculating the classification model (O-PLS-DA) were all classified correctly, according to gender, by the O-PLS-DA model. In this way an external validation of the model’s predictive ability was obtained.

The strategy was applied to three different biofluids – urine, leaf extracts and blood plasma – using two different types of mass spectrometer: Quadrupole and Time of Flight. This indicates that H-MCR is a valid approach for resolving GC/MS data. It can have applications in many different fields of science where the global metabolite analysis approach is applied, including functional genomics, drug toxicity and treatment efficacy assessments, and clinical diagnosis.

8. Conclusions and Future Perspectives

Arriving at one goal is the starting point to another.
John Dewey

It has been demonstrated in this thesis that multivariate processing of hyphenated metabolite data (in this case GC/MS and LC/MS data) can be used to create representative data reflecting the metabolite composition of complex biological samples. The generated data can be considered to be of high quality and it has been shown that, in combination with multivariate projection methods (e.g. PCA, PLS and O-PLS), robust and interpretable models allowing detection and identification of metabolic markers or marker patterns can be obtained. In addition, the generated models provided the opportunity for thorough statistical and biological validation.

The two proposed processing strategies, hierarchical multivariate data compression (Papers I and II) and H-MCR (Paper III), both fulfil the pre-defined criteria of: (1) generating a common set of representative variables; (2) allowing interpretation; (3) coping with analytical drift; (4) allowing automation; and (5) permitting the use of the obtained processing parameters for resolving independent samples or sample sets. This makes the outcome (data) of the processing suitable for multivariate analysis. Hierarchical multivariate compression has been shown to work for processing both GC/MS and LC/MS data. This approach is considered to be a rapid processing method but has limitations with respect to compound identification (e.g. for biomarker discovery). The H-MCR method has, so far, been successfully applied for resolving GC/MS data. This allowed the resolution of a variety of complex metabolic GC/MS datasets (e.g. data obtained from plant extracts, blood plasma and urine) into pure chromatographic profiles and corresponding unique mass spectra. The chromatographic profiles were then used as quantitative metabolite measurements in the multivariate analysis, while the unique mass spectra were used for identification by means of library search. The combination of H-MCR and multivariate analysis thus provides a robust and interpretable tool for detecting and identifying metabolic biomarkers or biomarker patterns, as exemplified in Paper IV. Furthermore, it was shown that H-MCR can be used for resolving independent samples based on the processing parameters from a set of representative samples (Paper V). The resolved independent samples can then be either predicted into an existing model based on the processed samples or combined with the processed samples to form a new model. This approach, named predictive metabolite profiling, moves global metabolite analysis away from isolated studies towards predictive systems. It could have significant implications for building predictive systems in, for example, clinical diagnosis. In addition, this approach also contributes to high-throughput global metabolite analysis, since only a subset of representative samples has to be processed, while the remaining samples can be treated according to the processing parameters obtained. It is, thus, a fast and efficient approach for comparative analysis of large sample sets.

One challenge in the near future will be to modify the H-MCR method to be applicable to LC/MS data, particularly UPLC/MS data. The aim here will be to enhance the information output in terms of quantification and identification of these high-sensitivity data, and to complement the GC/MS analysis to allow a more comprehensive description of the metabolome. A more general future challenge in global metabolite analysis will be to build up more complete spectral libraries [72] for compound identification. This will be a crucial step towards finding biomarkers of diagnostic or even prognostic value, as well as for increasing our understanding of complex interactions in biological systems. In conclusion, with the ever increasing analytical capacity for generating omics data, methods for extracting relevant information in terms of quantification and identification of metabolites, genes or proteins, will have to be developed in order to tackle future challenges in areas such as functional genomics, clinical diagnosis and systems biology [73]. In this context, the results presented in this thesis should be seen as a contribution to the field of global metabolite profiling with respect to the generation and analysis of representative metabolite data for detailed and reliable interpretation of complex biological systems.

9. Popular Science Summary

Metabolism can be summarized as the process that takes place in the cells of plants, animals and humans when nutrients are taken up, converted into energy or broken down into other small molecules that are then removed from the cells. These small molecules are often called metabolites and span many classes of chemical compounds. The number of different metabolites found in plants has been estimated at over 200,000 and is probably even higher in humans. Global metabolite analysis (metabolomics or metabonomics) aims to find unique fingerprints of metabolites in complex biological samples (e.g. plant extracts, urine and blood plasma). These fingerprints, made up of hundreds to thousands of metabolites whose concentrations change according to specific patterns, can be said to be characteristic of the physiological state of the system under study. Examples of external factors or conditions that can cause systematic changes in these “patterns” are various types of disease or environmental influence. The fact that metabolites vary in patterns means that it is not necessary to determine the concentration of every metabolite in a sample in order to establish that a certain change has occurred. This is exploited in, among other fields, global metabolite analysis. To create these characteristic fingerprints, and thereby make it possible to detect changes in pattern, powerful bioanalytical instruments or methods are used. Examples of such methods are nuclear magnetic resonance spectroscopy (NMR) or, as described in this thesis, gas chromatography (GC) or liquid chromatography (LC) combined with mass spectrometry (MS). The advantage of these methods is that information-rich data describing the biological samples can be generated quickly and reproducibly. Chemometrics forms an important part of global metabolite analysis when it comes to analysing these information-rich data.

Chemometric or multivariate analysis methods (e.g. PCA, PLS, O-PLS) can handle and analyse large data sets made up of an extremely large number (thousands to hundreds of thousands) of correlated variables (in this case, co-varying metabolites in a biological sample). This makes it possible to compare and analyse these information-rich fingerprints and in this way create statistical models that are reliable and easy to interpret and understand (so-called transparent models). In this way it is hoped that small fluctuations in these fingerprints (metabolite patterns), caused for example by changes in the expression of one or more genes or by a specific disease state, can be detected, interpreted and understood.

This thesis presents different strategies for using metabolite data generated by GC/MS and LC/MS to create the conditions (representative data) needed to compare samples with respect to metabolite patterns by means of multivariate analysis methods. The aim is to create interpretable, predictive models that make it possible to detect and identify metabolites or patterns of metabolites (so-called biomarkers) that are characteristic of different conditions (e.g. specific diseases). The developed strategies can be summarized under the common name curve resolution, which is a way of mathematically separating metabolites that cannot be separated analytically by chromatography, thereby making it possible to quantify and identify the metabolites reliably. In this work, methods for multivariate curve resolution have been developed, resulting in working methods for resolving hundreds of metabolites in many samples simultaneously. These methods have also been applied to different types of complex biological samples (e.g. plant extracts, urine and blood plasma) with good results. In combination with multivariate analysis methods, the developed curve resolution methods constitute “transparent” systems for interpreting metabolic processes, detecting and identifying biomarkers or biomarker patterns, and resolving and predicting independent samples. These properties mean that the proposed strategies may be of great potential importance for applications in, for example, functional genomics, drug development and disease diagnosis.

10. References

1. Storck T, von Brevern MC, Behrens CK, Scheel J, Bach A. Transcriptomics in predictive toxicology. Current Opinion in Drug Discovery & Development 2002,5:90-97.

2. Bowtell DDL. Options available - from start to finish - for obtaining expression data by microarray. Nature Genetics 1999,21:25-32.

3. Pandey A, Mann M. Proteomics to study genes and genomes. Nature 2000,405:837-846.

4. Fiehn O. Metabolomics - the link between genotypes and phenotypes. Plant Molecular Biology 2002,48:155-171.

5. Nicholson JK, Lindon JC, Holmes E. 'Metabonomics': understanding the metabolic responses of living systems to pathophysiological stimuli via multivariate statistical analysis of biological NMR spectroscopic data. Xenobiotica 1999,29:1181-1189.

6. Lindon JC, Nicholson JK, Holmes E, Antti H, Bollard ME, Keun H, et al. Contemporary issues in toxicology - The role of metabonomics in toxicology and its evaluation by the COMET project. Toxicology and Applied Pharmacology 2003,187:137-146.

7. Brindle JT, Antti H, Holmes E, Tranter G, Nicholson JK, Bethell HWL, et al. Rapid and noninvasive diagnosis of the presence and severity of coronary heart disease using H-1-NMR-based metabonomics. Nature Medicine 2002,8:1439-1444.

8. Weckwerth W, Loureiro ME, Wenzel K, Fiehn O. Differential metabolic networks unravel the effects of silent plant phenotypes. Proceedings of the National Academy of Sciences of the United States of America 2004,101:7809-7814.

9. Fiehn O, Kopka J, Dormann P, Altmann T, Trethewey RN, Willmitzer L. Metabolite profiling for plant functional genomics. Nature Biotechnology 2000,18:1157-1161.

10. Fiehn O. Combining genomics, metabolome analysis, and biochemical modelling to understand metabolic networks. Comparative and Functional Genomics 2001,2:155-168.

11. Dunn WB, Bailey NJC, Johnson HE. Measuring the metabolome: current analytical technologies. Analyst 2005,130:606-625.

12. Sumner LW, Mendes P, Dixon RA. Plant metabolomics: large-scale phytochemistry in the functional genomics era. Phytochemistry 2003,62:817-836.



13. Dunn WB, Ellis DI. Metabolomics: Current analytical platforms and methodologies. Trac-Trends in Analytical Chemistry 2005,24:285-294.

14. Goodacre R, Vaidyanathan S, Dunn WB, Harrigan GG, Kell DB. Metabolomics by numbers: acquiring and understanding global metabolite data. Trends in Biotechnology 2004,22:245-252.

15. Nicholson JK, Connelly J, Lindon JC, Holmes E. Metabonomics: a platform for studying drug toxicity and gene function. Nature Reviews Drug Discovery 2002,1:153-161.

16. van der Greef J, Stroobant P, van der Heijden R. The role of analytical sciences in medical systems biology. Current Opinion in Chemical Biology 2004,8:559-565.

17. Kell DB. Metabolomics and systems biology: making sense of the soup. Current Opinion in Microbiology 2004,7:296-307.

18. Weckwerth W. Metabolomics in systems biology. Annual Review of Plant Biology 2003,54:669-689.

19. Coen M, O'Sullivan M, Bubb WA, Kuchel PW, Sorrell T. Proton nuclear magnetic resonance-based metabonomics for rapid diagnosis of meningitis and ventriculitis. Clinical Infectious Diseases 2005,41:1582-1590.

20. Gibney MJ, Walsh M, Brennan L, Roche HM, German B, van Ommen B. Metabolomics in human nutrition: opportunities and challenges. American Journal of Clinical Nutrition 2005,82:497-503.

21. Averna TA, Kline EE, Smith AY, Sillerud LO. A decrease in H-1 nuclear magnetic resonance spectroscopically determined citrate in human seminal fluid accompanies the development of prostate adenocarcinoma. Journal of Urology 2005,173:433-438.

22. Wang YL, Bollard ME, Keun H, Antti H, Beckonert O, Ebbels TM, et al. Spectral editing and pattern recognition methods applied to high-resolution magic-angle spinning H-1 nuclear magnetic resonance spectroscopy of liver tissues. Analytical Biochemistry 2003,323:26-32.

23. Soga T, Ohashi Y, Ueno Y, Naraoka H, Tomita M, Nishioka T. Quantitative metabolome analysis using capillary electrophoresis mass spectrometry. Journal of Proteome Research 2003,2:488-494.

24. Sato S, Soga T, Nishioka T, Tomita M. Simultaneous determination of the main metabolites in rice leaves using capillary electrophoresis mass spectrometry and capillary electrophoresis diode array detection. Plant Journal 2004,40:151-163.

25. Halket JM, Waterman D, Przyborowska AM, Patel RKP, Fraser PD, Bramley PM. Chemical derivatization and mass spectral libraries in metabolic profiling by GC/MS and LC/MS/MS. Journal of Experimental Botany 2005,56:219-243.


26. Gullberg J, Jonsson P, Nordström A, Sjöström M, Moritz T. Design of experiments: an efficient strategy to identify factors influencing extraction and derivatization of Arabidopsis thaliana samples in metabolomic studies with gas chromatography/mass spectrometry. Analytical Biochemistry 2004,331:283-295.

27. A J, Trygg J, Gullberg J, Johansson AI, Jonsson P, Antti H, et al. Extraction and GC/MS analysis of the human blood plasma metabolome. Analytical Chemistry 2005,In Press.

28. Vandendool H, Kratz PD. A generalization of the retention index system including linear temperature programmed gas-liquid partition chromatography. Journal of Chromatography 1963,11:463-471.

29. Schauer N, Steinhauser D, Strelkov S, Schomburg D, Allison G, Moritz T, et al. GC-MS libraries for the rapid identification of metabolites in complex biological samples. Febs Letters 2005,579:1332-1337.

30. Okamoto M, Takahashi KI, Doi T. Sensitive detection and structural characterization of trimethyl(p-aminophenyl)ammonium-derivatized oligosaccharides by electrospray-ionization mass-spectrometry and tandem mass-spectrometry. Rapid Communications in Mass Spectrometry 1995,9:641-643.

31. Nordström A, Tarkowski P, Tarkowska D, Dolezal K, Astot C, Sandberg G, Moritz T. Derivatization for LC electrospray ionization-MS: A tool for improving reversed-phase separation and ESI responses of bases, ribosides, and intact nucleotides. Analytical Chemistry 2004,76:2869-2877.

32. Wilson ID, Nicholson JK, Castro-Perez J, Granger JH, Johnson KA, Smith BW, Plumb RS. High resolution "ultra performance" liquid chromatography coupled to oa-TOF mass spectrometry as a tool for differential metabolic pathway profiling in functional genomic studies. Journal of Proteome Research 2005,4:591-598.

33. Herbert CG, Johnstone RAW. Mass spectrometry basics: CRC Press LLC; 2003.

34. Fenn JB, Mann M, Meng CK, Wong SF, Whitehouse CM. Electrospray ionization for mass-spectrometry of large biomolecules. Science 1989,246:64-71.

35. Tolstikov VV, Fiehn O. Analysis of highly polar compounds of plant origin: Combination of hydrophilic interaction chromatography and electrospray ion trap mass spectrometry. Analytical Biochemistry 2002,301:298-307.

36. Eriksson L, Antti H, Gottfries J, Holmes E, Johansson E, Lindgren F, et al. Using chemometrics for navigating in the large data sets of genomics, proteomics, and metabonomics (gpm). Analytical and Bioanalytical Chemistry 2004,380:419-429.



37. Wold S, Esbensen K, Geladi P. Principal component analysis. Chemometrics and Intelligent Laboratory Systems 1987,2:37-52.

38. Höskuldsson A. A combined theory for PCA and PLS. Journal of Chemometrics 1995,9:91-123.

39. Wold H. Estimation of principal components and related models by iterative least squares. In: Multivariate analysis. Edited by Krishnaiah PR. New York: Academic Press; 1966:391-420.

40. Jolliffe IT. Principal component analysis. New York: Springer; 1986.

41. Geladi P, Kowalski BR. Partial least-squares regression - a Tutorial. Analytica Chimica Acta 1986,185:1-17.

42. Wold S, Sjöström M, Eriksson L. PLS-regression: a basic tool of chemometrics. Chemometrics and Intelligent Laboratory Systems 2001,58:109-130.

43. Wold S, Trygg J, Berglund A, Antti H. Some recent developments in PLS modeling. Chemometrics and Intelligent Laboratory Systems 2001,58:131-150.

44. Sjöström M, Wold S, Söderström B. PLS discriminant plots: Elsevier Science Publ. B. V. North Holland; 1986.

45. Wold S. Cross-Validatory estimation of the number of components in factor and principal components models. Technometrics 1978,20:397-405.

46. Trygg J, Wold S. Orthogonal projections to latent structures (O-PLS). Journal of Chemometrics 2002,16:119-128.

47. Nomikos P, MacGregor JF. Multivariate SPC charts for monitoring batch processes. Technometrics 1995,37:41-59.

48. Wold S, Kettaneh N, Friden H, Holmberg A. Modelling and diagnostics of batch processes and analogous kinetic experiments. Chemometrics and Intelligent Laboratory Systems 1998,44:331-340.

49. Azmi J, Griffin JL, Antti H, Shore RF, Johansson E, Nicholson JK, Holmes E. Metabolic trajectory characterisation of xenobiotic-induced hepatotoxic lesions using statistical batch processing of NMR data. Analyst 2002,127:271-276.

50. Antti H, Bollard ME, Ebbels T, Keun H, Lindon JC, Nicholson JK, Holmes E. Batch statistical processing of 1H NMR-derived urinary spectral data. Journal of Chemometrics 2002,16:461-468.

51. Wiklund S, Karlsson M, Antti H, Johnels D, Sjöström M, Wingsle G, Edlund U. A new metabonomic strategy for analysing the growth process of the poplar tree. Plant Biotechnology Journal 2005,3:353-362.

52. Katajamaa M, Oresic M. Processing methods for differential analysis of LC/MS profile data. BMC Bioinformatics 2005,6.

53. Duran AL, Yang J, Wang LJ, Sumner LW. Metabolomics spectral formatting, alignment and conversion tools (MSFACTs). Bioinformatics 2003,19:2283-2293.

54. Liang YZ, Kvalheim OM. Resolution of two-way data: theoretical background and practical problem-solving - part 1: theoretical background and methodology. Fresenius Journal of Analytical Chemistry 2001,370:694-704.

55. Shen HL, Grung B, Kvalheim OM, Eide I. Automated curve resolution applied to data from multi-detection instruments. Analytica Chimica Acta 2001,446:313-328.

56. Kvalheim OM, Liang YZ. Heuristic evolving latent projections - resolving 2-way multicomponent data. 1. Selectivity, latent-projective graph, datascope, local rank, and unique resolution. Analytical Chemistry 1992,64:936-946.

57. Maeder M, Zilian A. Evolving factor-analysis, a new multivariate technique in chromatography. Chemometrics and Intelligent Laboratory Systems 1988,3:205-213.

58. Liang YZ, Kvalheim OM. Diagnosis and resolution of multiwavelength chromatograms by rank map, orthogonal projections and sequential rank analysis. Analytica Chimica Acta 1994,292:5-15.

59. Karjalainen EJ. The spectrum reconstruction problem - Use of alternating regression for unexpected spectral components in 2-dimensional spectroscopies. Chemometrics and Intelligent Laboratory Systems 1989,7:31-38.

60. Gemperline PJ. A priori estimates of the elution profiles of the pure components in overlapped liquid chromatography peaks using target factor analysis. Journal of Chemical Information and Computer Sciences 1984,24:206-212.

61. Grande BV, Manne R. Use of convexity for finding pure variables in two-way data from mixtures. Chemometrics and Intelligent Laboratory Systems 2000,50:19-33.

62. Manne R, Grande BV. Resolution of two-way data from hyphenated chromatography by means of elementary matrix transformations. Chemometrics and Intelligent Laboratory Systems 2000,50:35-46.

63. Idborg-Björkman H, Edlund PO, Kvalheim OM, Schuppe-Koistinen I, Jacobsson SP. Screening of biomarkers in rat urine using LC/electrospray ionization-MS and two-way data analysis. Analytical Chemistry 2003,75:4784-4792.

64. Halket JM, Przyborowska A, Stein SE, Mallard WG, Down S, Chalmers RA. Deconvolution gas chromatography mass spectrometry of urinary organic acids - Potential for pattern recognition and automated identification of metabolic disorders. Rapid Communications in Mass Spectrometry 1999,13:279-284.


65. Tauler R, Kowalski B, Fleming S. Multivariate curve resolution applied to spectral data from multiple runs of an industrial-process. Analytical Chemistry 1993,65:2040-2047.

66. Tauler R. Multivariate curve resolution applied to second order data. Chemometrics and Intelligent Laboratory Systems 1995,30:133-146.

67. Eide I, Neverdal G, Thorvaldsen B, Shen HL, Grung B, Kvalheim O. Resolution of GC-MS data of complex PAC mixtures and regression modeling of mutagenicity by PLS. Environmental Science & Technology 2001,35:2314-2318.

68. Malmquist G, Danielsson R. Alignment of chromatographic profiles for principal component analysis - a prerequisite for fingerprinting methods. Journal of Chromatography A 1994,687:71-88.

69. Torgrip RJO, Åberg M, Karlberg B, Jacobsson SP. Peak alignment using reduced set mapping. Journal of Chemometrics 2003,17:573-582.

70. Marengo E, Todeschini R. A new algorithm for optimal, distance-based experimental-design. Chemometrics and Intelligent Laboratory Systems 1992,16:37-44.

71. Johnson DE. Applied multivariate methods for data analysts. Pacific Grove, California: Brooks/Cole Publishing Company; 1998.

72. Kopka J, Schauer N, Krueger S, Birkemeyer C, Usadel B, Bergmuller E, et al. GMD@CSB.DB: the Golm Metabolome Database. Bioinformatics 2005,21:1635-1638.

73. Kitano H. Systems biology: A brief overview. Science 2002,295:1662-1664.


11. Acknowledgements

I would like to thank everyone who has helped me learn new things while having a very enjoyable time as a PhD student. Thank you! I would especially like to thank:

My supervisor Michael, for giving me the opportunity to do a PhD and for always having had time (despite a tight schedule) when I needed good advice. An extra big thank you for redirecting my project when not everything went as we had planned at the beginning… The chip conveyor in Piteå now feels very distant.

My assistant supervisor Henke, for coming back from London and giving the project a vitamin injection during the final years, and for all the good advice and help during the writing of the thesis. Thanks also for the enjoyable gatherings outside the university walls.

Team Moritz™ at UPSC, for good collaboration (through all the projects) that has been not only scientifically rewarding but also great fun. Thank you, Thomas, Jonas, Miyako, Anders, Mariusz, Annika and Krister; without your help, your knowledge, your patience and your belief in chemometrics this would never have worked or even happened.

Steve and all the others at IC, for good collaboration (Paper II).

Jiye, Stefan and Bjørn, for good collaboration (Paper III).

Ina and Johan, for good collaboration (Papers IV and V).

Elin and Anna, who helped me tell men from women in Paper V.

John Blackwell, for revising the English.

Everyone at Organic Chemistry, who make it pleasant to come to work every day.

Ann-Helen, Bert and Carina, who make sure that everything from troublesome travel expense claims to misbehaving computers is resolved painlessly.

The Chemometrics group, which despite, or perhaps thanks to, our diversity always has a very good time together, not least among snow-clad mountains.


IMO, my office mate through most of the years; it has been very pleasant to share an office with you.

The degree project students I have had the honour of supervising. You should know that I found it both fun and rewarding.

Olav, for giving me the opportunity to develop as a curve resolver in Bergen. Thanks also for organising (through NICE) the fantastic workshop in the Faroe Islands.

Johan, for all the discussions about the fantastic world of chemometrics and for your comments on the thesis.

Michael, Svante and Paul, who caught my interest in chemometrics during the C-level course.

All the friendly chemometricians I have met and been inspired by at conferences.

All the friends who have joined me for everything from mountain hikes, mountain climbs (which perhaps sounds tougher than it was), ski trips, football pools Saturdays and late nights at REX, to coffee breaks and everything else that has nothing to do with research but adds colour to life. THANK YOU! You know who you are.

The Whisky Club, Uishge Beatha, for pleasant evenings that without doubt deserve a grade of 5 (if there is such a grade…).

Lillemor and Totte, "my new relatives"; it is always just as nice to visit you.

The Carlsson family: Grandpa, Lena, Torgny, Johanna and Pontus. It is always nice to visit you in Karlskoga, and also to beat you at go-karting!

Mum, Dad and little brother Johan, because it is always just as good to come home to Kristinehamn and get a break from my studies. Thank you for all the support and encouragement I have received over the years.

My beloved Sussi, for making me feel so much better in every situation. For simply being the best in the world, in every category (in my eyes).

Last but not least, I want to thank the EU Structural Funds, Objective 1 Norra Norrland, "Fibernätverket", for funding this work.
