Chemometric Analysis of First Order Chemical Data


Chemometric Analysis of First Order Chemical Data

Ralf J. O. Torgrip

A Doctoral Thesis

Department of Analytical Chemistry

Stockholm University

Sweden

2003


Doctoral Thesis, 2003

Ralf J. O. Torgrip [email protected] Department of Analytical Chemistry Stockholm University SE-106 91 Stockholm Sweden

©2003 Ralf Torgrip. 1st edition. ISBN 91-7265-586-0. pp I–VIII, 1–61, V1–V25. Printed by Akademitryck AB, Edsbruk, Sweden, 2003.


To

Monica, Kim and Alexandra


“If you look for Nature’s secrets in only one direction, you are likely to miss the most important secrets, those you did not have enough imagination to predict.”

Freeman Dyson

“An expert is a man who has made all the mistakes which can be made in a very narrow field.”

Niels Bohr (1885 - 1962)

“A new scientific truth does not, generally speaking, succeed because its opponents are convinced or declare themselves educated, but because they eventually die and the new generations learn about it as the truth from the beginning.”

Max Planck (1858 - 1947)

The symbol of the International Chemometrics Society (ICS) is an Erlenmeyer flask with axes coming through the sides of the flask. The symbol represents the interaction of mathematics and statistics with chemical problems, which is the essence of chemometrics.


Chemometric Analysis of First Order Chemical Data

Ralf J. O. Torgrip

Department of Analytical Chemistry, Stockholm University

Abstract

This thesis considers issues related to the multivariate analysis and calibration of chemical data collected in a vector format (one-way data). The work is designed to facilitate the task of the analytical chemist, focusing on methods of maximizing the amount of information that can be obtained by multivariate modeling from full sets of data generated by instruments or experiments. It is shown that more complex problems can be solved using a multivariate modeling methodology than by using the traditional univariate approach. The thesis is based on five papers that address different aspects of the multivariate analysis of one-way data, organized in a two-way matrix, with the aid of various chemometric techniques. The methodologies used cover variance analysis, predictive modeling, compression and alignment of data expressed as responses from various types of analytical instruments.

The instrumental techniques used to acquire the chemical data considered in the thesis include Near Infrared (NIR), Ultra Violet-Visible (UV-VIS) and Nuclear Magnetic Resonance (NMR) spectrometry. Other types of data (e.g. FT-IR and GC signals) are also used for testing different solution spaces for the various problems that have been encountered.

Paper I. In the work described in Paper I, NIR reflectance data and multivariate calibration are used for building calibration models predicting yield, kappa number, Klason lignin, glucose, xylose and uronic acid contents in birch chips sampled during controlled Kraft digestion. The combination of NIR reflectance and multivariate calibration models predicts the descriptors well.

Paper II. In the work described in Paper II, UV-VIS spectrometry and multivariate calibration are used to determine the nitrate (NO₃⁻) concentration in municipal wastewater. The method is based on scanned spectra of calibration samples measured as raw, unfiltered wastewater. The proposed method has a working range of 0.5–13.7 mg/l with a relative error of 3.4%. The method can also be used for the determination of total phosphorus, total nitrogen, ammonium nitrogen and iron.


Paper III. In the work described in Paper III, the problem of data redundancy is addressed. The paper deals with the compression of the abscissa axis (x-axis) of spectrometric data by using a slightly modified B-spline zero compression method. It is shown that NIR reflectance, FT-IR reflectance and UV-VIS transmittance data can be successfully compressed prior to multivariate calibration. The method can compress the data to approximately 20% of the original variable size without apparent degradation.

Paper IV. In the work described in Paper IV, NIR reflectance spectrometry, calibration and data analysis are used to monitor the specificity of bonding site interactions of water with carboxymethylated cellulose (CMC) in two ionic forms (CMC-Na and CMC-Ca) with various degrees of substitution of the carboxylic acid groups. The aim of the study was to explore two mutually contradictory hypotheses stating that water is either specifically adsorbed to adsorption sites (hydroxyl and carboxyl groups) or to the entire cellulose surface in the humidity range 0–100% RH. The results favor the specific site adsorption theory.

Paper V. In the work described in Paper V, a novel algorithm and method for peak alignment, or spectral synchronization, of data with pronounced differences between the baseline and signals of interest is presented. NMR, GC, LC, and CE signals are typically applicable for synchronization with the proposed algorithm. The method further extends ideas from gene and protein sequence alignment and explores the use of sparseness, dynamic programming, and fast search algorithms. The method can be used for applications such as automatic database searches and the automated evaluation of chromatograms or spectra for identifying peaks, as well as for either qualitative or quantitative analyses. The proposed method gives a similar level of performance to previously reported methods, but is at least 50–800 times faster. The corrected data yield more parsimonious models.

KEY WORDS: Near Infrared, Alignment, Breadth first search, Calibration, Cellulose, Chemometrics, CMC, Compression, Data pretreatment, DTW, Dynamic programming, Dynamic time-warping, Glucose, kappa number, Klason lignin, Metabonomics, Multivariate calibration, Multivariate data analysis, Needleman-Wunsch, NIR, Nitrate, NMR, Partial Least Squares, PCA, PLS, Principal Components, Projection to Latent Structures, Pulp, Regression, Smith-Waterman, Spectrometry, Spectroscopy, Ultraviolet, Uronic acid, UV-VIS, Visible, water adsorption, Wood, Xylose, Yield.


Table of contents

List of papers
Abbreviations and Symbols
1 Introduction
   1.1 Analytical Chemistry – philosophical basics
   1.2 The scope of this thesis
   1.3 Data order
2 Theory
   2.1 Introduction
   2.2 Spectrometry and fundamental aspects of calibration
   2.3 NIR/UV-VIS spectral artifacts
   2.4 NMR and chromatography – some artifacts
   2.5 The nature of noise
   2.6 Summary of artifacts
3 Mathematical modeling
   3.1 Matrix notation
   3.2 The rank of a matrix equation system
   3.3 Why chemometrics?
   3.4 Chemometrics – a prelude
   3.5 Least Squares modeling methods
   3.6 Factor based methods – Chemometrics
   3.7 Linearization of data
   3.8 Compression
   3.9 Peak Alignment
4 Summary of the papers
   4.1 Paper I
   4.2 Paper II
   4.3 Paper III
   4.4 Paper IV
   4.5 Paper V
5 Discussion
   5.1 Paper I
   5.2 Paper II
   5.3 Paper III
   5.4 Paper IV
   5.5 Paper V
6 Conclusions
7 Acknowledgements
8 References


List of papers

I Multivariate characterization of chemical and physical descriptors in pulp using NIRR. Olsson, R., Tomani, P., Karlsson, M., Josefsson, T., Sjöberg, K. and Björklund, C. Tappi Journal, 78(10) (1995), 158–166. In this paper the author contributed with the modeling of data and the writing of the paper.

II Determination of nitrate in municipal waste water by UV spectrometry. Karlsson, M., Karlberg, B. and Olsson, R.J.O. Analytica Chimica Acta, 312 (1995), 107-113. In this paper the author contributed in parts with the data acquisition, modeling of data and the writing of the paper.

III Compression of first order spectral data using the B-spline zero compression method. Olsson, R.J.O., Karlsson, M. and Moberg, L. Journal of Chemometrics, 10(5–6) (1997), 399–410. In this paper the author is responsible for the idea, the modeling of data and the writing of the paper.

IV Water sorption to hydroxyl and carboxylic acid groups in Carboxymethyl-cellulose (CMC) studied with NIR-spectroscopy. Berthold, J., Olsson, R.J.O. and Salmén, L. Cellulose 5 (1998), 281-298. In this paper the author contributed with the experimental design, the modeling of data and, in parts, the writing of the paper.

V Peak Alignment using Reduced Set mapping. Torgrip, R.J.O., Åberg, M., Karlberg B. and Jacobsson, S.P. Submitted, 2003-01-21, Journal of Chemometrics. In this paper the author contributed with the main idea, the modeling of the data and the writing of the paper.

Note! The author has changed his family name from Olsson to Torgrip.


Abbreviations and Symbols

NIR – Near Infrared
NIRS – Near Infrared Spectroscopy
NIRR – Near Infrared Reflectance
IR – Infra Red
FT-IR – Fourier Transform Infra Red
UV-VIS – Ultra Violet – Visible
CMC – Carboxy Methylated Cellulose
LS – Least Squares
MLR – Multiple Linear Regression
CLS – Classical Least Squares
ILS – Inverse Least Squares
PCA – Principal Components Analysis
FA – Factor Analysis
PFA – Principal Factor Analysis
NIPALS – Non-linear Iterative Partial Least Squares
PLS – Partial Least Squares
PCR – Principal Components Regression
SVD – Singular Value Decomposition
MEM – Maximum Entropy Method
SD – Standard Deviation
RSD – Relative Standard Deviation
PARAFAC – Parallel Factor analysis, an N-way modeling method
Tucker – N-way modeling method
N – Number of atoms in a molecule
X – Matrix of instrumental responses. One row corresponds to one spectrum
Y – Matrix of descriptors. One column corresponds to one descriptor
x – Vector of discrete values
E – Matrix of residuals
K – Matrix of “pure” spectral responses
y – Predicted concentration
β – Regressor or regression vector (ILS or PLS)
P – Matrix of loadings from PCA or PLS
T – Matrix of scores from PCA or PLS
W – Matrix of PLS weights
Q – Matrix of PLS Y-loadings
p – Number of variables in X (or Y)
n – Number of objects in X (or Y)


a – Number of PCs (PLS and PCA)
PRESS – Predicted Residual Error Sum of Squares, a measure of model error
RMSEP – Root Mean Square Error of Prediction, a measure of model error
RMSECV – Root Mean Square Error of Cross Validation, a measure of model error
r² – Correlation coefficient
q² – Correlation coefficient of samples not in the model (validation)
GC – Gas Chromatography
MS – Mass Spectrometry
NMR – Nuclear Magnetic Resonance
HPLC – High Performance Liquid Chromatography
LC – Liquid Chromatography
CE – Capillary Electrophoresis
FIA – Flow Injection Analysis
DSC – Differential Scanning Calorimetry
TGA – Thermogravimetric Analysis
MPEG – Compression method (for pictures and music)
ZIP – Compression method (for text, pictures and data files)
SA – Simulated Annealing
GA – Genetic Algorithm


1 Introduction

1.1 Analytical Chemistry – philosophical basics

To quote from the homepage of the American Chemical Society, Division of Ana-lytical Chemistry:

“Analytical chemistry seeks ever improved means of measuring the chemical composition of natural and artificial materials. The techniques of this science are used to identify the substances which may be present in a material and to determine the exact amounts of the identified substances. Analytical chemists work to improve the reliability of existing techniques to meet the demands for better chemical measurements which arise constantly in our society. They adapt proven methodologies to new kinds of materials or to answer new questions about their composition. They carry out research to discover completely new principles of measurement and are at the forefront of the utilization of major discoveries such as lasers and microchip devices for practical purposes. They make important contributions to many other fields as diverse as forensic chemistry, archaeology, and space science. [As the emblem above points out,] analytical chemistry serves the needs of many fields.”

As stated above, analytical chemistry fundamentally addresses three tasks:

• Classification (qualitative).

• Interpretation (qualitative).

• and, the most demanding, Calibration (quantitative and qualitative).

These seemingly simple tasks have kept alchemists and chemists sleepless for the last 3000 years and will continue to do so. The field has, like other areas of science, experienced a rapid increase in the complexity of the questions asked and answered. Much of this complexity is due to advances in measurement. Today, the measurement of chemical samples may involve an abundance of different techniques, based on diverse aspects of the atoms’ interaction with their surroundings, like the absorption of light, affinity to other atoms, or magnetic properties. As computers and hardware evolve, the instruments used for chemical measurements generate ever more data per sample for the chemists to interpret. The questions are also becoming more complex, so the samples subjected to analysis are getting more complex; in vivo observation, for example, is the ultimate challenge for some analytical chemists. Since the answer to the analytical question being addressed is to be found somewhere in the measured data, the field of analytical chemistry also involves interpreting chemical data.


1.2 The scope of this thesis

Information and data are not the same, although some people believe the terms to be synonymous. Data can be found in abundance – information is more difficult to acquire. However, if it has sufficient inherent structure, data can be transformed into information. This transformation can either be done by analysts processing the data with their brains or by the use of some aid, possibly mathematical or statistical tools, which clarify the underlying structure of the gathered data. These newfound structures reveal the true nature of the underlying phenomena that are reflected by the data – information. Such information allows conclusions to be drawn regarding the nature of the samples.

The subject of this thesis is the use of a multitude of measured variables to describe samples. By using many variables to describe samples, some analytical problems are solved, but new problems are created. The rationale for using many variables for the description of a sample is that more complex analytical questions can be answered.

The thesis covers the field of multivariate modeling of first order chemical data emanating from different types of instrumentation. The data analysis techniques used in the thesis include variance analysis, predictive modeling, compression and alignment of multivariate data.

Today, modern instrumentation can generate data by the millions and, with hyphenated techniques, in many dimensions. The task of extracting the information from these data in their raw form is too complex for the human brain. Multivariate data analysis and multivariate regression methods are extremely useful in facilitating this task. These methods can, if used correctly, also aid in the intellectual elucidation of intricate problems in a fast and efficient manner. All aspects of scientific exploration can benefit from using this methodology, since information can be extracted from acquired data in a comprehensible format. So, by using complex data and powerful chemometric techniques it is possible to solve demanding analytical problems.

In this thesis the author will illustrate and discuss the power of multivariate techniques and methodologies. Their advantages are illustrated by considering a fast, non-invasive method of analysis for organic solid samples with a non-uniform sample matrix, Near Infrared spectrometry (NIR), and a rapid method for the analysis of ions in liquids containing suspended particles, Ultra Violet-Visible spectrometry (UV-VIS). Modeling these types of data is straightforward, from a chemometric point of view, even if the data are complex per se. Further problems occur when trying to analyze other data types, e.g. NMR, GC, and LC. The special problems addressed in the thesis are unwanted x-axis shifts of peaks and the computer-related problem of modeling with large masses of data – redundancy.


The rationale for using full first order data in this thesis is the nature of the samples being modeled. Solids are difficult to analyze by traditional means without laborious sample pretreatment steps, such as dissolution, purification and workup. Thus, using a single technique such as high-resolution spectrometry would, if feasible, greatly reduce the time required for analysis of solids. Also, the analysis of turbid liquid systems would often be much more interesting if sample pretreatment steps such as filtration and dilution could be avoided.

The goal when using multivariate techniques is to maximize the advantages of exploring a full vector representation of samples, i.e. whole spectra or chromatograms, or large parts of such data sets, as opposed to the traditional “single” or few datum representations previously used. The methodology has proven to be valuable for addressing complex problems where the traditional “univariate” approach fails. Incorporation of the described methodology thus extends the utility of already existing instrumentation, making measurements more informative.

This thesis will not present any in-depth analysis of the chemometric techniques per se, since most of the methods used have been thoroughly described in other works, except for the peak alignment method (Paper V). Instead a short, context-setting description of the mathematical modeling techniques and data pre-processing treatments used in the studies is presented. The author has applied various combinations of chemometric techniques that, at the time of application, seemed suitable for bridging generic problems of multivariate data modeling – today the choice of techniques might have been different. The thesis should be read with this in mind.

The presented results should not be seen as limited, either to the specific chemical systems or spectral regions discussed, but as a generic approach to the modeling of first-order data in analytical chemistry.

1.3 Data order

Data are categorized by the way they are collected and organized. The resulting dimensionality of the data – their ways or order – is dependent on the way the experiments are designed, the data-generating instruments, and the way the data can be organized. This thesis will address data collected in a vector, i.e. where one sample generates one vector consisting of a number (p) of discrete intensity measurements obtained from some instrument. For example, data from a spectrometer, collected as absorbance (or transmittance) values for a number of energies (i.e. wavelengths) placed in a row (or column), which, when sequentially plotted, give a visual representation of a discretized spectrum, see Fig 1.


Figure 1. Ten NIR reflectance spectra of pulp scanned on-line (intensity vs. data bin). One spectrum comprises p = 1050 discrete data points.

For consistency with the proposed syntax1, the dimensionality of the resulting data is referred to as the “way” and the number of unique entities in the data is referred to as the “modes”. The one-way data collected for this thesis are organized in matrices of a number of samples (n) with their corresponding recorded intensities, giving the analyzed data a two-mode, two-way character, since there are two dimensions (rows and columns) and two entities – samples and wavelengths. Fig 2 depicts the arrangement of the data into a matrix, X.

Figure 2. The arrangement of data into matrix X; n samples (rows 1,2,...,n) and p intensity measurements per sample (columns 1,2,...,p), with elements xij.
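The row-per-sample arrangement described above can be sketched in a few lines of Python/NumPy. This is a hypothetical toy example; the sample values and dimensions are invented and not taken from the thesis:

```python
import numpy as np

# Hypothetical toy example: n = 3 samples, p = 5 intensity readings each.
# Each spectrum is one row; stacking rows gives the two-way matrix X
# whose element x_ij is the intensity of sample i at data bin j.
spectrum_1 = np.array([0.20, 0.35, 0.60, 0.41, 0.22])
spectrum_2 = np.array([0.21, 0.37, 0.63, 0.44, 0.23])
spectrum_3 = np.array([0.19, 0.33, 0.58, 0.40, 0.21])

X = np.vstack([spectrum_1, spectrum_2, spectrum_3])

n, p = X.shape   # n objects (rows) and p variables (columns)
print(n, p)      # 3 5
```

With real spectra, n would be the number of scanned samples and p the number of data bins per spectrum, e.g. p = 1050 for the pulp spectra in Fig 1.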


2 Theory

2.1 Introduction

Instead of presenting a full theory section covering the electromagnetic spectrum and theoretical aspects of NMR, which can be found in standard textbooks of analytical chemistry, we simply recall that atoms and atomic bonds can exchange energy with their surroundings. Absorption of energy can occur if the incident irradiated energy “matches” the energy of a possible energy transition in the atom or bond of a molecule. If this criterion is met the atom or bond has a statistical chance of absorbing the energy, and the absorbed portion of energy can be detected as “missing” in the resulting energy flux following interaction with the sample. In spectrometry the missing energy will manifest itself as peaks when plotted in a reference-corrected spectrogram.

The ability to detect this missing energy is one of the fundamental working principles of analytical chemistry. One of the analytical fields relying on this phenomenon is spectrometry. This feature is also the basis of many other analytical techniques since they rely on the same type of energy differences in the final quantifying step, e.g. GC, LC, HPLC, and FIA, where the detector may be of this type.

2.2 Spectrometry and fundamental aspects of calibration

This section briefly outlines apparent deviations from the Beer-Lambert approximation, and reasons why classical spectral modeling techniques are less appropriate when analyzing and calibrating with NIR or UV-VIS spectra.

2.2.1 The Beer-Lambert approximation

The Beer-Lambert approximation (Eq 2.1), the foundation of univariate linear models in spectrometry, essentially states that the integral of a spectral peak is proportional to the number of (identical) bonds or atoms absorbing energy to yield the peak, which (in turn) is proportional to the concentration of the respective compound(s). The peak integral can be approximated by the recorded peak height (A) at some chosen energy (λ), so peak height is proportional to concentration (c).


The univariate approach assumes that the Beer-Lambert approximation is valid:

A(λ) = −log10( I(λ) / I0(λ) ) = ε(λ) b c        (2.1)

where A is absorbance, I is the energy intensity after interaction with the sample, I0 is the energy intensity before interaction with it, ε is the extinction coefficient for a chemical species at the chosen wavelength, b is the effective light-absorbing path length, c is the concentration of the absorbing chemical species and λ is a selected wavelength.

The univariate approach, c ∝ A(λ), is theoretically valid if the following conditions are fulfilled:

• The linearity condition. The instrument’s response must be linearly correlated with the measured feature. Deviations in the absorptivity coefficients can occur at high concentrations (>0.01 M) due to electrostatic interactions between molecules in close proximity, a phenomenon referred to as analyte association.

• The interferent condition. The instrument’s response must not exhibit any wavelength shift for the measured constituent.

• The selectivity condition. The spectral peaks of interest must be fully separated.

• The noise condition. The measurement process itself will always yield noise in our measured data. The structure of the noise varies, depending on the analytical systems involved. The number of light quanta measured by the detector must be larger than the number originating from background radiation (background noise).

• The scatter condition. There must be no scattering of light due to particulates in the sample.

• The sample must not fluoresce or generate phosphorescence.

• There must be no changes in the refractive index of the samples at high analyte concentration.
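When all conditions hold, Eq 2.1 can be inverted univariately for concentration once ε and b are known. The following is a minimal numerical sketch; all numbers are hypothetical and chosen only for illustration:

```python
import math

# Hypothetical single-wavelength measurement (illustrative numbers only).
I0 = 100.0       # energy intensity before interaction with the sample
I = 25.0         # energy intensity after interaction with the sample
epsilon = 1.5e3  # assumed extinction coefficient [l mol^-1 cm^-1]
b = 1.0          # effective path length [cm]

A = math.log10(I0 / I)   # absorbance, A = log10(I0 / I)
c = A / (epsilon * b)    # univariate inversion: c = A / (epsilon * b)
print(round(A, 3))       # 0.602
```

This inversion is exactly the univariate model whose validity conditions are listed above; violating any of them (scatter, overlapping peaks, wavelength shifts) breaks the proportionality c ∝ A(λ).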

Note one: meeting the linearity condition requires customization of the instrument or manipulation of the sample to suit the measured concentration, e.g. adjustment of the effective path length to yield a suitable system response, adjustment of the dilution or concentration of the sample to ensure a suitable response in the analytical energy beam, and/or the ability to adjust the intensity of the analytical beam.

Note two: the interferent condition, which depends on the linear nature of Eq 2.1. If there is any change in the wavelength of the absorption maximum for a chemical species due to interfering substances, Eq 2.1 will no longer be valid. Some chemical systems will therefore not be suitable for traditional spectroscopy, a good example being the analysis of water in wood, where the water molecules interact with different bonding sites, creating a very subtle spectrum of overlapping water peaks exhibiting a range of wavelength shifts depending on their bond strength. This type of artifact makes the traditional approach erroneous since the Beer-Lambert approach assumes that each chemical species has an absorbance maximum at a single, fixed wavelength.

Note three: the selectivity condition. It is, for the traditionalist, a minor problem that is usually solved by separation, cleaning, standardization and dilution of the sample, if possible. The problem can also be addressed by Inverse Least Squares (ILS) analysis, see Sec 3.5.2, of a few carefully selected wavelengths. The ILS methodology works well provided that interferences, wavelength shifts and/or strong co-correlations between the measured wavelengths do not cause the numerical solution of the problem to be unstable and thus more sensitive to these kinds of problems.

Note four: noise occurs in both the dependent and the independent data. The noise in the dependent data is usually referred to as measurement error. The spectral noise will affect the model building, and it can be shown that the multivariate models described below are less sensitive to this type of artifact since signal averaging is built into the modeling step.

Note five: light scatter. The scatter phenomena occurring when using NIR on solid samples and UV-VIS on unfiltered, turbid wastewater can be regarded as severe from a modeling point of view. The traditional remedy for this problem is to find one variable that reflects the scatter phenomenon but not the chemistry and subtract the value of that variable from the data of the objects in question, or, alternatively, to use some of the linearization techniques described in Chap 3.7, Linearization. The multivariate methods can model the scatter as a part of the calibration or data analysis.


2.3 NIR/UV-VIS spectral artifacts

In this section, artifacts known to occur in the types of spectral data modeled in this thesis are presented.

2.3.1 Fermi resonance

Fermi resonance is a phenomenon that can shift or split a fundamental IR peak. Fermi resonance is observed when a spectral overtone or a combination interacts, i.e. exchanges energy, with a fundamental resonance peak. A necessary condition for Fermi resonance is that the excited vibronic states of the molecule are close in frequency. The effect of Fermi resonance is to shift the affected NIR peaks’ maximum wavelength positions: raising one and reducing another in frequency. Fermi resonance may induce strong non-linearities in the spectral data.

2.3.2 Vibrational coupling

Vibrational coupling occurs when vibrational energy from adjacent bonds in a molecule interacts. The distance between the vibrating atoms and the type of vibronic mode govern the magnitude of the coupling effect. If the atoms (bonds) are close and the vibrational modes, the mode symmetry, and the approximate frequency of the vibrations match, strong coupling occurs. If more than two bonds separate the atoms little or no coupling will occur. The effect shifts the peak position of the affected NIR peaks. Like Fermi resonance, vibrational coupling may induce strong non-linearities in the spectral data.

2.3.3 Water and hydrogen bonding

If the sample contains water, hydroxyl groups or any other possible source of hydrogen bonding, the NIR spectrum can exhibit apparent or true peak shifts. NIR spectra of pure water do not show true Gaussian or Lorentzian peak characteristics because the NIR water peak is a composite, composed of multiple peaks originating from different water sub-species, i.e. water with hydrogen bonding to various other chemical groups or with differing coordination. The gradual formation or disappearance of these sub-species and their subsequent spectral sub-peaks can induce apparent shifts in the wavelength of the resulting, composite, water peak. See Fig 3.


Figure 3. The effect of a sub-peak changing in intensity. The sub-peak area to the far right is decreased by 5%, which induces a change in the composite peak maximum location of ~2% of the composite peak width at half height and a ~0.1% decrease in composite peak height.

Hydrogen bonding can also change the force constants of the X-H bonds, thereby inducing a true wavelength shift of a peak2. Differences in hydrogen bonding, i.e. water content, can induce weak to strong non-linearity in the spectral data. This non-linearity is detected as a shift in peak wavelength.
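The apparent shift of a composite peak can be illustrated with a small simulation in the spirit of Fig 3. Two Gaussian sub-peaks with hypothetical positions, widths and amplitudes stand in for the water sub-species; none of the numbers come from the thesis:

```python
import numpy as np

def gauss(x, mu, sigma):
    # Unit-height Gaussian peak centred at mu with width sigma.
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2)

# Hypothetical NIR wavelength axis [nm].
x = np.linspace(1380, 1480, 2001)

# Two overlapping sub-peaks standing in for water sub-species.
before = gauss(x, 1420, 15) + 0.8 * gauss(x, 1445, 15)
# Reduce the right-hand sub-peak by 5%, as in Fig 3.
after = gauss(x, 1420, 15) + 0.8 * 0.95 * gauss(x, 1445, 15)

# The composite maximum moves even though neither sub-peak moved.
shift = x[np.argmax(after)] - x[np.argmax(before)]
print(shift < 0)  # the maximum drifts towards the unchanged sub-peak
```

Neither sub-peak changed position; only a relative intensity changed, yet the recorded composite maximum drifts. This is how changing hydrogen-bonding equilibria can masquerade as wavelength shifts.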

2.3.4 Light scattering

NIR can be employed in three different measuring modes: transmittance, transflectance and reflectance. UV-VIS is usually used in transmittance mode.

NIR reflectance data readily exhibit both Mie and Rayleigh types of light scattering of the incident NIR beam, resulting in a baseline shift of the recorded spectra depending on the apparent macro dimensions of the analyzed sample. This type of light scattering is referred to as specular.

The theory related to NIR reflection from solid samples is further complicated by the multiple internal scattering of the NIR beam that may occur, i.e. the beam can be reflected many times in the sample before being recorded by the detector. This type of scatter is referred to as diffuse. The effects of scattering can be seen in Figs 4 and 5, where reflectance spectra of pulp samples taken from Paper I and transmission spectra of pseudo-gasoline samples are shown, respectively. The pseudo-gasoline spectra are shown as examples of data without scatter phenomena.


Figure 4. 46 NIR reflectance spectra of pulp showing scatter effects as baseline shifts (reflectance vs. wavelength [nm]). Data taken from Paper I.

Figure 5. 30 NIR transmittance spectra of pseudo-gasoline, blends of benzene, toluene, o-xylene, p-xylene and octane (absorbance vs. wavelength [nm]). These spectra show no scatter.

The theories of Schuster3, Kubelka-Munk4 and Saunderson5 attempt to account for the scatter phenomena that may be encountered, but they are not applicable in all cases. The Kubelka-Munk and Saunderson theories result in a mathematical linearization transform applicable to NIR spectra.
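One widely used form of the Kubelka-Munk linearization is the remission function f(R) = (1 − R)² / (2R), which maps diffuse reflectance R onto a quantity proportional to the absorption-to-scatter ratio. A minimal sketch with hypothetical reflectance values (not the thesis's own implementation):

```python
import numpy as np

def kubelka_munk(R):
    """Kubelka-Munk remission function f(R) = (1 - R)^2 / (2R)."""
    R = np.asarray(R, dtype=float)
    return (1.0 - R) ** 2 / (2.0 * R)

# Hypothetical diffuse-reflectance values (0 < R <= 1).
R = np.array([0.2, 0.4, 0.6, 0.8])
f = kubelka_munk(R)
print(f)  # decreases monotonically: strong absorbers reflect less
```

Applying such a transform before calibration is one way to reduce the scatter-induced non-linearity discussed above, although, as noted, it is not applicable in all cases.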


For the usual mode of operation the terms specular and diffuse reflection are used for discriminating between different types of scatter phenomena: specular being light reflected as from a mirror (which does not contain information about sample chemistry), and diffuse being the part of the scattered light that has interacted with the sample and hence carries information about its chemical nature.

Note that the specular part may be of analytical interest since it contains information about sample macro dimensions, i.e. the sample's particle dimensions. Some partial solutions to problems related to scattered and reflected light can be found in Chap 3.7, Linearization of data. In general, the light scatter of solid samples will induce non-linearities (ranging from weak to strong) in the spectral data.

UV-VIS data from turbid samples also exhibit baseline shifts due to light scatter. In this case the scatter results in loss of energy reaching the detector, thus causing an apparent increase in absorbance. This absorbance shift does not, typically, depend on the concentration of the chemical species of interest. Instead, it is a spectral artifact that complicates calibration.

2.4 NMR and chromatography – some artifacts

Some of the most serious artifacts that can occur in data, from the modeling point of view, are changes in peak positions (or locations) that are not governed by chemistry, or are unwanted from a modeling point of view. As shown in the theory section, if such shifts in peaks occur the fundamental basis of modeling variance will break down. This is because if a peak changes position, regardless of whether its intensity changes, the modeling mathematics will interpret the change as variance. The alignment problem is one of the major reasons why chemometrics has not been applied more often to data such as NMR, GC, LC, CE and FIA signals. NMR analysis of complex samples, typically biological samples, is prone to peak shifting6, 7. This is because matrix effects such as pH, salt activity etc. change the resonant frequencies of the hydrogen atoms under study. In the case of chromatographic techniques, it is the variations in pressure, temperature, columns and injection parameters that change the relative elution times of chromatographic peaks.

2.5 The nature of noise

Noise is always present when something is measured. In spectroscopy the structure of the photometric noise is very important, and it can be divided, roughly, into two types - homoscedastic and heteroscedastic. Homoscedastic noise does not depend on the magnitude of the measured signal, whereas heteroscedastic noise does. Heteroscedastic noise is usually found to be approximately proportional to the measured signal.

The analysis of noise structure will not be covered in this thesis, although one of the techniques described later, Principal Component Analysis (PCA), is often used in the analysis of noise. The reason for addressing the noise is that it adds to the mathematical rank, especially if the measured data are co-linear. If this is the case, the noise can make a major contribution to the derived regressor, thus making the regressor unstable. The noise can also introduce chance correlations with the dependent data. The risk of this happening is proportional to the number of variables in the data being analyzed.
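The last point - that the risk of chance correlation grows with the number of variables - is easy to demonstrate numerically. The sketch below (illustrative, not from the thesis) correlates a random response with columns of pure noise and reports the largest absolute correlation found:

```python
import numpy as np

def max_chance_correlation(n_objects, n_variables, seed=0):
    """Largest |correlation| between a random response y and
    n_variables columns of pure noise: with enough variables,
    noise alone can produce seemingly strong correlations."""
    rng = np.random.default_rng(seed)
    y = rng.standard_normal(n_objects)
    X = rng.standard_normal((n_objects, n_variables))
    Xc = X - X.mean(axis=0)          # center each noise "wavelength"
    yc = y - y.mean()
    r = Xc.T @ yc / (np.linalg.norm(Xc, axis=0) * np.linalg.norm(yc))
    return np.abs(r).max()

print(max_chance_correlation(20, 10))    # modest spurious correlation
print(max_chance_correlation(20, 1000))  # typically much larger, from noise alone
```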

2.6 Summary of artifacts

The criteria mentioned above regarding the validity of the Beer-Lambert approximation are not generally fulfilled for the calibration of the types of chemical data considered in this thesis. Thus, in order to apply the traditional, univariate methods successfully, we would have to rely on tedious workup methods such as separation, purification and so on.

For subsequent data analysis, too, the artifacts encountered using NIR, UV-VIS and NMR analyses make the data sets considered here unsuitable for analysis with univariate methods based on correlation analysis with one single, chosen, datum. The artifacts related to the Beer-Lambert approximation are well known and have been thoroughly described, see for example the book by Ingle and Crouch8.


3 Mathematical modeling

3.1 Matrix notation

Since chemical data will be discussed from a matrix viewpoint in the following text, the matrix and vector notation used requires some attention.

Data will be discussed in terms of scalar(s), vector(s), matrix/matrices and possibly tensors. Matrices are depicted in bold upper case letters, vectors in bold lowercase letters and scalars in italics. Objects are row vectors and variables column vectors. The matrix or vector transpose is indicated with a prime (′).

The matrix of instrumental responses is denoted X and is referred to as the X-block. In this thesis the X-block typically comprises spectra. The matrix (or vector) of descriptors, usually concentrations, is denoted Y or y and is also referred to as the Y-block. Symbols denoted by a "hat" (ˆ) are predicted from a model. A sample is referred to as an object. The object's corresponding spectrum is organized in a vector of discrete numbers – data, which are referred to as variables.

3.2 The rank of a matrix equation system

The concept of rank is used and needs some clarification. First of all one must make a clear distinction between the mathematical rank and the pseudorank (in this case the chemical rank) of an investigated system. The mathematical rank is a measure of how many unique (linearly independent) solutions there are to a mathematical problem, in this case solving the equations for determining the wavelength coefficients in the calibration model, i.e. establishing the regressor. In mathematical equation systems with strong co-correlation (between variables and between samples) it can be shown that the Least Squares (LS) solution easily degenerates and becomes sensitive to perturbation9. This means that there are a large number of possible solutions to the problem, all of which are quite similar. These phenomena increase the mathematical rank.

The chemical rank, on the other hand, is a measure of how many true chemical phenomena are changing in our measured system. The chemical rank is, almost always, a smaller number than the mathematical rank. The chemical rank can be difficult to estimate, even with a thorough understanding of the system investigated, since different chemical phenomena happen simultaneously. In any chemical system both formation and breakage of chemical bonds (strong and weak) will occur, together with the formation of various intermediates.


Interaction between the electromagnetic waves used in spectrometry and their surroundings will also occur. This ultimately leads to the measured data containing information on the changing chemistry, on the interactions mentioned, as well as the ever-present noise. These interactions will also add to the chemical and mathematical rank.

The problem of defining the "true" chemical rank is of interest when using multivariate techniques since the choice of model dimensionality, which should ideally correspond with the chemical rank, is left to the user and is important for good model performance.

The problem of finding the true chemical rank has been addressed by a variety of approaches, e.g. cross validation10, bootstrap procedures11, and statistical criteria such as the F-test12.

3.3 Why chemometrics?

Examination of the problems emanating from applying spectrometry to the chemical systems of interest has led to the conclusion that the univariate method is unlikely to result in successful calibrations for these systems. A possible remedy is to change tactics and use multivariate methods, correlating many independent and dependent variables at the same time.

Multivariate methods, in essence, involve the use of more data in the models, i.e. whole spectra. With the aid of more data, more complex models can be built, c = f(A, λ), that can solve the problems described above by correcting for different phenomena if there is any correlation between the artifacts and the data. The multivariate models can themselves be seen as selectivity enhancement by mathematical modeling of multi-channel measurements.

The multivariate methods based on a latent variable approach have a further advantage over univariate methods since they provide an easily interpreted, graphics-based window into multi-dimensional problems, i.e. a data-scope, allowing us to examine the multidimensional problem by opening a two- or three- (low) dimensional window on the data displaying the dominant features. The intermediate modeling results can be interpreted with respect to both quantitative and qualitative aspects of the modeled data.

The distinction between dependent and independent data is not straightforward: the taxonomy stems from the motivation for constructing predictive models, i.e. we believe that the spectrometric data reflect the true composition more accurately than the traditional chemical analysis. Models constructed on this basis can then be used to understand the relation between the dependent and independent data. To invert this definition, the strategy is to model causality, i.e. the chemically analyzed values are the "true" values that are reflected in our spectrometric data.

The framework for modeling chemical data is the idea that any signal reflecting chemical phenomena (data) can be divided into two parts: systematic (originating from chemical features) and non-systematic (noise). In other words, data can be regarded as containing two entities, information and noise. These basic concepts facilitate the construction of models, and drawing conclusions from models based on measured data. Hence, the models can be simply written as follows:

Data = Chemical model + Noise

or

Data = Information + Noise

Much of this thesis considers means to separate these two entities with different kinds of models and methods borrowed from mathematics, statistics, physics and, of course, chemistry. It will be shown that applying these different approaches to chemistry can solve analytical problems that are difficult to handle by traditional methods.

The value of using multivariate methods can be assessed by examining the theoretical restrictions of the univariate approach and elucidating whether these restrictions are met in real-life calibration situations. Some of these restrictions are commonly referred to as interferences and have their origin in interactions between chemical constituents, physical phenomena and the measurement process itself.

The multivariate modeling methodology removes interference effects, extends the linear range and enables the detection of erroneous samples (referred to as outliers). The methodology also suppresses the influence of noise by truncating random and/or irrelevant phenomena in the data. By using all the wavelengths in the spectral data and a latent variable problem-solving approach, we are able to compress the main features of the data into a condensed representation which is used for calibration and for finding structures and/or patterns, thereby revealing hidden information. The multivariate models, e.g. shrinkage or biased calibration techniques, will be shown to be more rugged and accurate.


3.4 Chemometrics – a prelude

The field of chemometrics is vast, and the author will by no means try to explain it all. The few chemometric techniques used in the enclosed papers have been extensively reviewed in other papers, with the exception of compression (Paper III) and the alignment method (Paper V). Even though others have thoroughly discussed the subjects of the following sections, the author will give a brief introduction to the core theory of the popular orthogonal projection technique (PCA) and the related bilinear regression/calibration method (PLS), both of which are commonly used multivariate data analysis and modeling techniques. The guide can be regarded as a primer for the techniques of PCA and PLS.

The methods to be described have a common denominator of factoring or decomposing the data into orthogonal components. Orthogonal here means totally uncorrelated. In the case of calibration, the problem of computing (X′X)⁻¹ is solved by the transformation of X to an orthogonal representation of X. This procedure makes the transformed representation of X′X, or any composite covariance matrix, a diagonal matrix that is easily inverted to a stable solution. The orthogonal projection methods are also useful in the interpretation of the data. The projections are well suited for graphical inspection since the data are projected into a few, uncorrelated, dimensions that are easier to interpret than the original data mass. The possibility of plotting different features of the data is one of the advantages of the multivariate techniques described.

One goal of this thesis is to establish relationships between strongly inter-correlated, high-resolution, spectral data that do not conform to the Beer-Lambert approximation, and features of the objects, e.g. the content of various chemical constituents or the co-ordination of water. The traditional way this is done is by making a model that mathematically relates the features of the spectral data to the desired descriptor, i.e. calibration, where the objective is to establish the parameters of the transfer function, f(·), between the blocks of observed and measured data. The possible models are:

Y = f(X) + E

or

X = f(Y) + E

The standard modeling tactic is to minimize the sum of squares of the model residual, E, by some algorithm.


The most commonly used type of model for the calibration of spectral data is still the univariate linear Least Squares (LS) model, i.e. for all the objects one variable is selected to explain the calibrated descriptor, and for this wavelength a line is regressed between the chosen variable and the descriptor's magnitude. The univariate and multivariate LS models are not ideal, given the previously described artifacts, which often occur in many real applications based on NIR reflectance data, e.g. non-linear responses, the pitfalls of the Beer-Lambert approximation, heteroscedastic noise structure, co-linearity and rank deficiency, to mention just a few.

The mathematical structure and shortcomings of multivariate LS are described in the next section; these shortcomings provide the motivation for the use of more complex calibration techniques, such as the chemometric methods. The chemometric methods presented here should be regarded merely as a selection of possible ways of overcoming the aforementioned problems in calibration. Of course there are other techniques, but a comprehensive survey of such methods is beyond the scope of this thesis.

3.5 Least Squares modeling methods

The solutions of the LS models are interesting because they are widely used for mathematically modeling data and their basic forms will be encountered again in the following sections. Note that the LS based methods are categorized as chemometric or multivariate methods, but not as orthogonal compression methods, which are described later in Chap 4.4, Chemometrics.

There are, in spectrometry, formally two methods of directly achieving the goal of multivariate calibration: Classical Least Squares (CLS) and Inverse Least Squares (ILS). Both of these are linear models that share an identical mathematical objective function, but differ in philosophy. The objective of linear prediction models is to obtain a “pure spectrum” or regression coefficients that can be used in such a way that a feature can be predicted from a spectrum with unknown composition. The original theory of Least Squares can be traced back to Gauss and Legendre (1700s).


3.5.1 Classical Least Squares (CLS)

The CLS philosophy states that the variables, X (spectrum), are a function of the descriptor Y (concentration etc.). The working principle is to minimize the residuals in X i.e.

min(‖X − YK‖²)  (3.2)

The model for CLS is X = YK + E  (3.3)

where the LS solution is

K̂ = (Y′Y)⁻¹Y′X  (3.4)

where K comprises the "pure" spectral responses, or linear combinations of X, for the analytes in Y. This is interesting since it means that CLS belongs to the "factor" model family, because the X matrix is factored into K by Y. There is no formal restriction on K as there is in the factor based, chemometric, methods.

The prediction in CLS is performed by the following operation;

c = (KK′)⁻¹Kx  (3.5)

where x is the unknown spectrum and K is derived from Eq 3.4.

The obvious drawback of this method is that all of the “pure” analyte spectra and interferents must be known for the method to be useful. The formulation of the CLS model makes it suitable for mild extrapolation. Note that CLS is also known as the K-matrix method or Reverse-, Direct- and Total calibration.
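Assuming the CLS model of Eq 3.3 holds, the fit (Eq 3.4) and prediction (Eq 3.5) steps can be sketched on simulated data. All pure spectra and concentrations below are invented for illustration; `np.linalg.solve` is used instead of forming the inverse explicitly:

```python
import numpy as np

rng = np.random.default_rng(1)

# Simulated system: 3 analytes, 50 wavelengths, all "pure" spectra known.
K_true = np.abs(rng.standard_normal((3, 50)))          # pure analyte spectra
Y = rng.uniform(0.1, 1.0, size=(30, 3))                # concentrations, 30 objects
X = Y @ K_true + 1e-3 * rng.standard_normal((30, 50))  # X = YK + E  (Eq 3.3)

# Fit: K_hat = (Y'Y)^-1 Y'X  (Eq 3.4)
K_hat = np.linalg.solve(Y.T @ Y, Y.T @ X)

# Predict concentrations of an unknown spectrum: c = (KK')^-1 K x  (Eq 3.5)
c_true = np.array([0.3, 0.6, 0.2])
x_unknown = c_true @ K_true
c_pred = np.linalg.solve(K_hat @ K_hat.T, K_hat @ x_unknown)
print(c_pred)  # close to [0.3, 0.6, 0.2]
```

Note that the prediction succeeds only because every contributing "pure" spectrum is in K, which is exactly the drawback stated above.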

3.5.2 Inverse Least Squares (ILS)

The philosophy of ILS states that the descriptor is a function of the variables. The objective function is to minimize the residuals in Y, i.e.

min(‖Y − Xβ‖²)

The ILS model is;

Y = Xβ + E

where the β vector is formed in the following way:

β = (X′X)⁻¹X′Y  (3.6)

The regressor, or regression vector (β), multiplied by a spectrum yields the predicted result for the targeted feature, ŷ; see Eq 3.7.


ŷ = x(1)·β(1) + x(2)·β(2) + … + x(p)·β(p), or ŷ = xβ  (3.7)

where β(1) is the first coefficient in the β vector, and x(1) is the first data point in our spectrum of unknown composition. ILS can handle unknown interferents, but the predicted feature must be spanned by the data in the previously constructed calibration model, i.e. the ILS model should not be extrapolated. The ILS method is also known as P-matrix, MLR, Forward-, Indirect- and Partial calibration.
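The ILS fit (Eq 3.6) and prediction (Eq 3.7) can be sketched in the same way. The data and coefficients below are invented for illustration; note that this well-conditioned case (many objects, few independent variables) is exactly the situation that breaks down for co-linear spectra, as discussed next:

```python
import numpy as np

rng = np.random.default_rng(2)

# ILS needs more objects than variables: 40 objects, 5 wavelengths.
X = rng.uniform(0.0, 1.0, size=(40, 5))
beta_true = np.array([0.5, -0.2, 1.0, 0.3, 0.8])
y = X @ beta_true + 0.01 * rng.standard_normal(40)  # Y = Xb + E

# Regressor: beta = (X'X)^-1 X'y  (Eq 3.6)
beta = np.linalg.solve(X.T @ X, X.T @ y)

# Prediction (Eq 3.7): y_hat = x(1)b(1) + ... + x(p)b(p) = x . beta
x_unknown = rng.uniform(0.0, 1.0, size=5)
y_hat = x_unknown @ beta
print(beta)   # close to beta_true
print(y_hat)
```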

3.5.3 Co-correlation and the inverse

The major problem of constructing a regressor for an analytical modeling problem is connected with the nature of inverting the variance-covariance matrix in the solution of Eqs 3.5 and 3.6.

Deriving the matrix (X′X)⁻¹ is one of the key issues for calibration. The mathematics calls for a matrix inverse. If the data in the X matrix are inter- and intra-correlated, i.e. the data show a high degree of co-linearity and/or the objects are linear combinations of each other, the inverse is ill-conditioned9, 13, that is, the estimated regressor will be the result of numerical artifacts and not of causality. Co-correlation will effectively increase the mathematical rank of the data, resulting in an X′X matrix with relatively large numbers on the diagonal and near-diagonal elements, which when inverted will be numerically unstable. Trying to invert an ill-conditioned matrix is the matrix algebra equivalent of division by zero. Consequently, the coefficients comprising the β vector become large, and this makes the model more sensitive to noise, especially in X. This, in turn, causes degradation of model performance, such as an increase in the resulting model error (predictive error).
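The ill-conditioning can be made concrete with the condition number of X′X, which is large exactly when inversion approaches the "division by zero" situation described above. The data below are invented for illustration:

```python
import numpy as np

rng = np.random.default_rng(3)
n, p = 30, 8

# Well-conditioned X: independent noise columns.
X_good = rng.standard_normal((n, p))

# Co-linear X: every column is a near-copy of one underlying signal,
# mimicking strongly inter-correlated spectral variables.
base = rng.standard_normal(n)
X_bad = base[:, None] + 1e-4 * rng.standard_normal((n, p))

# cond(X'X) quantifies how close the inversion is to "division by zero".
print(np.linalg.cond(X_good.T @ X_good))  # moderate
print(np.linalg.cond(X_bad.T @ X_bad))    # enormous -> unstable regressor
```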

A thorough review concerning the strengths, weaknesses and statistical behavior of the CLS and ILS methods can be found in the book by Martens and Næs13.

It should also be noted that from the chemist's point of view the co-correlation, or co-linearity, is not a problem as long as there is at least one variable that is non-overlapping for the sought descriptor. Essentially, co-correlation means that more information is available and that we must use computational techniques that remedy the mathematical aspects of the inverse problem, such as the ones described in the next chapter, Chemometrics.


3.6 Factor based methods - Chemometrics

We will start this section with a brief recap of the data projection/compression method of Principal Component Analysis (PCA)14-17, because many of its concepts also appear in the more complex multivariate techniques for calibration (notably, PCA is a compression technique describing variance).

3.6.1 Principal Component Analysis – PCA - an introduction

PCA is a method of transforming complex data in such a way that the most important or relevant information is made more apparent. This is accomplished by constructing a new set of variables that are linear combinations of the original variables in the data set. These new variables, P, are denoted Principal Components (pc's), Latent Variables (LV's), eigenvectors, factors or singular vectors. An important feature of these pc's is that they are orthogonal (completely uncorrelated) to one another.

In addition, the new variable-sets are created in order of the amount of variance in the data that they account for. Thus, the first component describes more of the variance in the data set than the second component, and so forth. As each new pc is calculated, the objects' relationship to that pc can be calculated by projection. Basically, the extraction of principal components amounts to a variance-maximizing rotation of the original variable space.

The relationships between objects are not changed by this transformation, but because the new axes are ordered by their importance (the variance they describe is a measure of how much distinguishing information in the data they account for) we can graphically see the most important differences between samples in a low-dimensionality plot (2- or 3-D), which radically simplifies the interpretation of multi-dimensional data. See for instance Figs 14 and 16.

The main goals of PCA are;

• data reduction

• simplification

• finding relationships, classes or similarities among objects and variables

• variable selection

• outlier detection

The pioneering work on PCA was done by Karl Pearson16 (1901) and Spearman18 (1904). PCA was later briefly mentioned by Fisher and MacKenzie19 as more suitable than analysis of variance (ANOVA) for modeling response data; they also outlined the NIPALS algorithm, later rediscovered by Wold20. PCA was further developed by Hotelling14 (1931) into what we today call PCA.

The concept of PCA was appealing, and in the 1930s Thurstone21 and other psychologists developed factor analysis (FA), which closely resembles PCA, although its objective function is to rotate the resulting PCA factors to simplify interpretation or generate loadings with desired features. The FA method is considered, more or less, to be a supervised method, but this is not the case for PCA. For further reading about FA the author recommends, for instance, the book by Malinowski22.

Since these pioneering works, PCA has been rediscovered in many scientific fields, resulting in, amongst other things, an abundance of redundant terminology. Numerous versions of PCA exist, depending on the scientific field the user is working in. Examples include the following: singular value decomposition (SVD) or eigenvector analysis, often used in the field of numerics; the Karhunen-Loéve expansion, used in electrical engineering; perceptual mapping, applied in marketing; and the Hotelling transform in image analysis. Correspondence analysis is a special double-scaled version of PCA favored in the French-speaking countries, and Principal Factor Analysis (PFA) is used in FA, to name but a few. This multitude of different acronyms for what is, essentially, the same thing arises from the fact that there are different numerical methods of calculation and that the method is scale-dependent, i.e. different pretreatments of the X data will result in different PCA solutions, and the discoverers of these pretreatment methods have frequently renamed the method.

As stated, PCA provides an approximation of the matrix X in terms of the product of two smaller matrices, T and P. These matrices capture the essential data patterns in X. Plotting the columns of T gives a picture of the dominant "object patterns" of X and, analogously, plotting the columns of P shows the complementary "variable patterns".

The features of PCA can be summarized as follows:

• PCA maps (projects) the data into two new datasets, one reflecting the variables and one reflecting the objects.

• The variable and object spaces are projected onto variance components according to the size of the variance in the variable space, i.e. the variance components are extracted in order of descending magnitude.

• The variance components are uncorrelated, i.e. orthogonal (orthonormal, actually).

• The projected object-space is based on the variance components and is therefore also orthogonal.


3.6.2 PCA

Principal component analysis (PCA) is crucial for the deeper understanding of many multivariate techniques and will therefore be explained in some detail.

As stated, the working principle of PCA is that the data are mapped into two new sets of data: T – the scores, reflecting the object space and the numerical size of the data, and P – the loadings or pc's, reflecting the variable space. P is the model of the data, having the same length as the number of variables. P comprises the projected variance components of the X data. Furthermore, P is ordered according to decreasing magnitude of explained variance in X, and the pc's are orthonormal, which differentiates the PCA model from the CLS model (3.3), making the pc's totally uncorrelated with each other and of unit norm, i.e. T′T is diagonal and P′P = I.

The PCA model is written as: X = TP′ + E  (3.8)

There are several different ways of constructing the P matrix, the most commonly employed being “the power method”9 or SVD9. Once the P matrix is established the T matrix can be computed by:

T = XP

where T is the projection of each object’s variable-set on the variance components of the variable space.
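The "power method" referred to above can be sketched as repeated multiplication by X′X with renormalization; the dominant eigenvector of X′X, i.e. the first loading vector, emerges in the limit. A minimal illustration (the function name, fixed iteration count and simple centering are simplifications, not the thesis implementation):

```python
import numpy as np

def first_pc_power_method(X, n_iter=500):
    """First loading p (unit norm) and score t = Xp of column-centered X,
    found by power iteration on X'X -- a bare-bones 'power method'."""
    Xc = X - X.mean(axis=0)                          # column-center the data
    p = np.ones(Xc.shape[1]) / np.sqrt(Xc.shape[1])  # arbitrary start vector
    for _ in range(n_iter):
        p = Xc.T @ (Xc @ p)        # one multiplication by X'X
        p /= np.linalg.norm(p)     # keep unit norm each step
    t = Xc @ p                     # score vector: T = XP, first component
    return t, p

rng = np.random.default_rng(4)
X = rng.standard_normal((50, 6))
t, p = first_pc_power_method(X)
print(np.linalg.norm(p))  # 1.0: the loading has unit norm
```

Further components can be extracted the same way after deflating Xc by tp′, which is how NIPALS-style algorithms proceed.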

The obvious benefit of this procedure is that variance components of highly co-correlated data can be analyzed according to the magnitude of the variance, with totally uncorrelated components. Furthermore, the method is a framework for truncation, i.e. we can determine, by examining the variance components, where along the decomposition information can be separated from noise according to:

Data = True variance + noise

X = TP′ + E

This enables the interesting part of the data to be separated from the part containing noise (E) and/or variance that we are not interested in, i.e. truncation. The truncated matrix can then be used to make the matrix inverse more rugged and stable, opening up possibilities for calculating (X′X)⁻¹, which is then denoted (X′X)⁺, or, even more interestingly, the inverse (T̂′T̂)⁻¹. Note that if no truncation is done, model 3.8 becomes X = TP′, that is, equality.
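Truncation can be sketched with simulated data of chemical rank 2 plus noise (the ranks and noise level are invented for illustration). Keeping only the first k components separates the TP′ part from E:

```python
import numpy as np

rng = np.random.default_rng(5)

# Simulated data of chemical rank 2: X = TP' + E.
T_true = rng.standard_normal((40, 2))
P_true = np.linalg.qr(rng.standard_normal((60, 2)))[0]   # orthonormal loadings
X = T_true @ P_true.T + 0.005 * rng.standard_normal((40, 60))

U, s, Vt = np.linalg.svd(X, full_matrices=False)
k = 2                         # chosen model dimensionality (chemical rank)
P = Vt[:k].T                  # truncated loadings
T = X @ P                     # scores, T = XP
E = X - T @ P.T               # residual: essentially the noise

print(np.linalg.norm(E) / np.linalg.norm(X))  # small: only noise is discarded
```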


There are a number of different ways to establish the significant number of pc's that describe the latent variance patterns in the data, including cross-validation10 and bootstrapping11, as well as statistical and numerical methods such as eigenvalue criteria15.

The objective is to find directions in the data, namely the directions of maximum variability. The objective function of PCA can be seen as minimizing the orthogonal distance between the original data and its projection onto the sought direction vector of maximum variability. This is not the same as the minimizing criterion in LS; see Fig 6.

[Figure 6. Geometric difference between the LS (d_LS) and PCA (d_PCA) minimizing distances.]

PCA also has interesting statistical features, the major one being that the scores can be analyzed with univariate and multivariate population statistics, i.e. one can use the projected scores to build multivariate confidence intervals for their projection space. This feature enables various useful manipulations, for instance multivariate outlier detection, classification and cluster analysis. The subjects of multivariate statistics23, classification and cluster analysis24, and discriminant analysis are themselves sub-fields of chemometrics, and are not further considered in this thesis.

A note of caution is that PCA is scale-dependent, meaning that any pre-scaling of the objects and variables alters the solution in P, and hence in T.


One should also point out that the PCA solution has rotational ambiguity. This means that the scores and loadings can be rotated in any direction by a matrix C and the rotated solution will explain the same amount of variance, i.e.

X = TP′ = TCC⁻¹P′  (3.9)

where C is any diagonal or orthogonal matrix. The way the C matrix is constructed is the concern of the FA community.

The mathematical objective function for PCA can be written

min(‖X − TP′‖²)

and it can be shown that if a random dataset X (n × p) is to be approximated by a combination of linear orthonormal basis vectors, the optimal approximation of the random vectors x (1 × p) by a linear combination of q (q < n) independent vectors is obtained by projecting the random vectors x onto the eigenvectors v corresponding to the largest eigenvalues λᵢ of the covariance matrix X′X.

The framework for the calculation of the loadings and scores will here be drawn from the eigenvector theory:

For a square matrix X (n × n) the equation

Xv = λᵢv  (3.10)

Eq. 3.10 is called the eigenvalue equation for X. The eigenvalues of X, λᵢ, are scalars for which the equation possesses non-zero solutions in v. The corresponding vectors vᵢ are the eigenvectors of X. The eigenvectors of X form an orthonormal set of vectors. The eigenvalue equation can also be written as

XV = VΛ

or

V′XV = Λ

where Λ is the diagonal matrix containing the eigenvalues λᵢ on the diagonal, and V is the orthogonal matrix whose columns are the individual eigenvectors. The eigenvalues of X can be obtained by solving the equation

det(X − λI) = 0

where det indicates the determinant of the matrix. The mathematical rank of the matrix X is equal to the number of non-zero eigenvalues of X. The rank of a square matrix (n × n) is at most n. If the rank of X is less than n, then X is said to be singular and cannot easily be inverted unless the series of eigenvalues is truncated.

The SVD algorithm is commonly used for carrying out PCA and for inverting non-square matrices. This algorithm is often found in commercial mathematical packages for computers. The full SVD solution is not equal to the PCA solution, but can easily be made the same. There is one drawback of using the SVD: in most packages it computes all the singular values of a matrix. This is usually not required, but it is sometimes necessary, for instance in residual analysis.

For a non-square matrix X (n × p), the singular value decomposition is given by X = USV′.

If n > p, then U is (n × p), S is (p × p) and V is (p × p). The matrices have the following properties: U′U = I, U is orthonormal; S is diagonal; and V′V = VV′ = I, V is orthonormal.

If n < p, then U is (n × n), S is (n × n) and V is (p × n). The matrices have the following properties: V′V = I and U′U = UU′ = I.

The column vectors of U are referred to as the left singular vectors of X, and the column vectors of V are referred to as the right singular vectors of X. The diagonal elements of S, σᵢᵢ, are referred to as the singular values of X, and are ordered by magnitude, i.e. σ₁₁ > σ₂₂ > σ₃₃ > … > σₖₖ, where k is min(n, p). The singular value decomposition of X′ is VSU′.
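These dimensions and orthonormality properties can be checked directly with the SVD routine of a numerical package; NumPy's economy-size SVD (`full_matrices=False`) is used here as one example:

```python
import numpy as np

rng = np.random.default_rng(6)

for n, p in [(10, 4), (4, 10)]:   # the n > p and n < p cases
    X = rng.standard_normal((n, p))
    U, s, Vt = np.linalg.svd(X, full_matrices=False)   # X = USV'
    k = min(n, p)
    assert U.shape == (n, k) and Vt.shape == (k, p)    # stated dimensions
    assert np.allclose(U.T @ U, np.eye(k))             # U'U = I
    assert np.allclose(Vt @ Vt.T, np.eye(k))           # V'V = I
    assert np.all(np.diff(s) <= 0)                     # singular values descending
    assert np.allclose(X, U @ np.diag(s) @ Vt)         # reconstruction X = USV'
print("all SVD properties verified")
```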

The SVD is related to the eigenvalue analysis as follows. The eigenvalue equation for the square matrix X′X can be written

X′XV = VΛ

By substituting the SVD solution for X we obtain

X′XV = VSU′USV′V = VΛ

and since U′U = I, V′V = I and S′ = S, we obtain

X′XV = VS²    (3.11)

where S² = SS. From Eq. 3.11 we can see that V is the eigenvector matrix of X′X and that S² = Λ, i.e. the singular values of X are the square roots of the eigenvalues of X′X. Similarly, for the square matrix XX′,

XX′ = USV′VSU′ = US²U′ = UΛU′

U is the matrix of eigenvectors of the XX′ matrix. Note that if n ≥ p, X′X and XX′ have at most p non-zero eigenvalues. Similarly, if n ≤ p, X′X and XX′ have at most n non-zero eigenvalues.

In order to link the SVD solution to PCA it is noted that

P = V and T = US

where T′T = S² = Λ.
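These identities are easy to verify numerically. The following NumPy sketch (random data invented for the example; centering is omitted so that the identities hold for X exactly as written) checks the SVD/eigenvalue relation and the PCA link:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(10, 4))        # n > p: 10 objects, 4 variables

# Economy-size SVD: X = U S V'
U, s, Vt = np.linalg.svd(X, full_matrices=False)

# Eigendecomposition of X'X; eigh returns eigenvalues in ascending order
evals, _ = np.linalg.eigh(X.T @ X)
evals = evals[::-1]                 # descending, to match the singular values

# Singular values of X are the square roots of the eigenvalues of X'X
assert np.allclose(np.sqrt(evals), s)

# Link to PCA: loadings P = V, scores T = US, and T'T = S^2 = Lambda
P = Vt.T
T = U * s                           # same as U @ np.diag(s)
assert np.allclose(X, T @ P.T)      # X = TP'
assert np.allclose(T.T @ T, np.diag(s ** 2))
```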


3.6.3 Partial Least Squares - PLS

As a prelude to explaining the PLS calibration/decomposition technique, we can examine the NIR reflectance data, which exhibit strong inter-correlation between variables (for an example, see Fig 11). Here the descriptor's correlation coefficient has been plotted against each spectral variable. Two major features can be seen: the correlation between the spectral datum and Y, and the strong inter-correlation between variables, which is in full agreement with NIR theory stating that the peaks in NIR will be recurrent, broad and overlapped. This presents a problem, according to LS theory, if one wishes to calibrate the data to some interesting object feature.

To overcome the LS inverse problem, the logical extension for multivariate calibration is to use the projected scores from PCA and regress them against the Y-block, making an ILS model. This method is called Principal Components Regression (PCR) and is one solution to the problem. PCR is not explained further here because it was not used in the papers, and it is not believed to be superior to Partial Least Squares, or Projection to Latent Structures (PLS)25, 26, although PLS is not as straightforward as PCR. PLS has a different objective function, 3.12, than ILS: it maximizes correlation by maximizing the square of the covariance between the X- and Y-blocks.

max_w cov(t, y) | t = Xw ∧ ‖w‖ = 1    (3.12)

PLS is also a factor-based method that projects data according to the magnitude of covariance. PLS generates variance-mapping components in a similar fashion to PCA, but the PLS components are not identical to the PCA components, due to the difference in objective function. This difference should be noted since the PLS components are also usually referred to as pc's. In addition, the PLS model can overcome the inverse problem by using deflation of the X- and Y-blocks into (almost) orthogonal components. One further interesting feature of PLS is that it is a shrinkage method, i.e.;

‖β_PLS‖₂ < ‖β_ILS‖₂ if a < min(n, p)    (3.13)

Relationship 3.13 indicates that the length (norm) of the PLS regressor is numerically shorter than its ILS counterpart, indicating a more stable predictor. Note that relationship 3.13 is only valid for a model that has fewer pc's than objects or variables, whichever is less; otherwise the PLS and ILS solutions converge.

There are two basic frameworks for PLS: one for models with one descriptor, PLS-1, and another for multiple descriptors, PLS-2. In the following passage, the PLS method is illustrated with the algorithmic framework for PLS-1.


For clarity we begin with a lemma to be more consistent with the previous ILS and CLS sections, 3.5.2, and retain the common model notation. However, “hat” notation has been ignored because all intermediate results are predicted, rendering the “hat” superfluous;

((Y′Y)⁻¹Y′X)′ = X′Y(Y′Y)⁻¹    (3.14)

The X and y data should first be centered, normalized, linearized etc. according to some desired scheme.

In PLS decomposition, the extraction of the pc’s begins with;

w = X′y(y′y)⁻¹    (3.15)

which is the solution of the CLS model, 3.4, X = yw′ + E, via lemma 3.14. This step is designed to extract the y-corresponding "pure" components from the spectral data. After this, the PLS weight vector, w, thereby formed is normalized to unit norm and the normalized concentrations are predicted according to CLS theory, model 3.5, by;

t = Xw(w′w)⁻¹ = Xw

since w′w = 1. The t vector formed in this way is denoted the score vector. The scores are the estimated normalized concentrations predicted with the "pure components" model. The scores are based on the CLS model X = tw′ + E.

The score vector is then calibrated against the true concentrations using ILS,

q = t′y(t′t)⁻¹

according to the model y = qt + e. The q vector formed in this way is denoted the y-loading. The q vector models the relation between the normalized concentrations and the measured descriptor. Note that the error minimized is in the y's, that is, y takes the place of X in the ILS model. The q vector is later needed for the deflation of y.

After this the spectra are calibrated against the normalized scores according to:

p = X′t(t′t)⁻¹    (3.16)

The p vector thus formed, the loading, is the "pure component" estimated using the scores, t, as Y in the CLS model, 3.4, hence the "partial" in partial least squares. The p vector is estimated from the model X = tp′ + E. Note that the p vector no longer accounts for maximum variance, as in the PCA model, 3.8, but rather for the maximal covariance between X and t.


The X- and Y-blocks are then "cleaned" of the variance extracted by this PLS component; for the X-block:

X = X - tp′    (3.17)

via subtraction of the product of normalized concentrations and "pure spectral profiles" based on normalized concentrations. Correspondingly, for the Y-block;

y = y - tq (3.18)

by subtraction of the product of the normalized scores and the concentration loading.

The PLS algorithm then starts again from Eq 3.15 with the variance “cleaned” X and y as input data, extracting the next pc. This procedure proceeds sequentially until the desired number of pc’s has been extracted.

It should be noted that the P matrix is not orthonormal in this algorithm, but the resulting W is, i.e. W′W = I. Steps 3.16, 3.17 and 3.18 ensure that the T matrix formed is orthogonal.

A final ILS-type regression vector can be built using the following;

β = W(P′W)⁻¹Q

with W = [w_1 w_2 w_3 ... w_a], Q = [q_1 q_2 q_3 ... q_a], P = [p_1 p_2 p_3 ... p_a], and a being the final number of extracted pc's.
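The steps 3.15-3.18 and the final regression vector can be sketched as follows (a minimal NumPy illustration; the function name pls1 and the centering convention are choices made here, not code from the papers):

```python
import numpy as np

def pls1(X, y, a):
    """PLS-1 via the CLS/ILS steps above (Eqs. 3.15-3.18); illustration only."""
    X = X - X.mean(axis=0)              # center X column-wise
    y = y - y.mean()                    # center y
    n, p = X.shape
    W, P, Q = np.zeros((p, a)), np.zeros((p, a)), np.zeros(a)
    for i in range(a):
        w = X.T @ y                     # Eq. 3.15; the scalar (y'y)^-1 cancels on normalization
        w /= np.linalg.norm(w)          # normalize the weight vector to unit norm
        t = X @ w                       # scores: t = Xw, since w'w = 1
        q = (t @ y) / (t @ t)           # y-loading (ILS step)
        pv = X.T @ t / (t @ t)          # loading, Eq. 3.16
        X = X - np.outer(t, pv)         # deflate X, Eq. 3.17
        y = y - q * t                   # deflate y, Eq. 3.18
        W[:, i], P[:, i], Q[i] = w, pv, q
    beta = W @ np.linalg.inv(P.T @ W) @ Q   # final ILS-type regression vector
    return beta, W, P, Q
```

For a new, centered spectrum x, the prediction is x′β plus the mean of y. The resulting W satisfies W′W = I, and for a < min(n, p) the norm of β is smaller than that of the corresponding ILS regressor, in line with relationship 3.13.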

The number of pc’s to retain in the final model can be established with similar techniques as for PCA. The most commonly used figures of merit for choosing the appropriate number of pc’s are y-residual statistics, such as PRESS, RMSEP, or a figure of merit from cross-validation like RMSECV or bootstrapped confidence intervals. For further reading about different figures of merit for the models, the book by Martens and Naes13 is recommended.

Evaluation of the final model in terms of predictor statistics becomes rather complex compared with the straightforward analysis of ILS and CLS predictor properties, due to the complex nature of the PLS regressor.

To further clarify our understanding of the PLS algorithm, a closer look at the P and W (loadings and weights) matrices is usually helpful. The W matrix is modeled very much as in PCA, but since the extraction of each subsequent pc starts over with an X-block that has been deflated using P, 3.17, which is tilted by trying to explain covariance with Y, the picture becomes somewhat more complex. In fact, the difference between the P and W matrices is the portion of variance in X that is not correlated with Y. This portion can also be subtracted from X before extraction of the subsequent pc.


We can conclude that PLS factors the X- and Y-blocks in an orthogonal manner, mapping the covariance of the data blocks.

As mentioned above, the PLS decomposition algorithm described is just one of a number of algorithmic frameworks for PLS that have been developed. Alternatives exist, but the described algorithm is simple to visualize due to its use of CLS and ILS, which are theoretically straightforward. Readers interested in more intricate versions of PLS decomposition algorithms are recommended to consult, for instance, the Ph.D. thesis of Lindgren27.

3.7 Linearization of data

This section will briefly mention the possibilities of pre-processing, or pre-linearizing, the X and/or Y data before multivariate modeling. The section is not meant to fully explore this topic, but to show that the possibility exists and that use of pre-processing and/or linearization steps might yield a better (or worse) model. Better, in this context, implies a model that is more rugged and performs better with respect to prediction error, which is the ultimate goal of calibration.

Basically, there are two types of pre-linearization: data equalizers and data transformers.

The first type of transform, data equalizers, is often employed when descriptor variables arise on an unequal footing, typically in the Y-block, when descriptors are measured in different units. This situation can also arise when the X-block consists of data of a discrete nature, i.e. not comprising spectra. The remedy for this scale difference is subtraction of the column mean (centering) and division by the column standard deviation for the wanted descriptor, often referred to as auto-scaling or z-transform. Note that this is also the most commonly employed transformation of the X-block prior to PCA decomposition.

The untreated X′X matrix is referred to as the product-moment matrix. If the X matrix has been centered prior to multiplication, the X′X matrix is referred to as the variance-covariance matrix. Furthermore, if the X matrix has been Z-transformed (auto-scaled), the X′X matrix is referred to as the correlation matrix. The different names indicate that the pretreatment of the data will change the solution space of LS, PCA and PLS, i.e. the solution is sensitive to numerical size and is therefore affected by such transformations (pretreatments).
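The three named X′X matrices differ only in the pretreatment of X, which a few lines of NumPy make concrete (illustrative data; the column scales are chosen arbitrarily):

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(15, 3)) * np.array([1.0, 10.0, 100.0])   # columns on unequal footing
n = X.shape[0]

product_moment = X.T @ X                       # untreated X

Xc = X - X.mean(axis=0)                        # centered
# X'X of centered data is (n-1) times the variance-covariance matrix
assert np.allclose(Xc.T @ Xc / (n - 1), np.cov(X, rowvar=False))

Xz = Xc / X.std(axis=0, ddof=1)                # auto-scaled (z-transform)
# X'X of auto-scaled data is (n-1) times the correlation matrix
assert np.allclose(Xz.T @ Xz / (n - 1), np.corrcoef(X, rowvar=False))
```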

The second type of transform, data transformers, often deals with the X-block with the aim of linearizing the X-block to the Y-block. The most common transforms employed for NIR spectroscopy data are specular scatter removal, e.g. Multiplicative Scatter Correction (MSC28), the Standard Normal Variate transform (SNV29), the Kubelka-Munk transform4, the Saunderson correction5 or derivatives of some type30-33. There are also others, like the -log10(R) transform, which is sometimes believed to remove non-linearities from the X-block. Furthermore, there are transforms like the Fourier transform and the wavelet transform, which map the wavelength domain of the X-block to some other domain believed to describe the X-block more linearly. Because of the plenitude of possible permutations of linearization approaches, combined with the fact that some are of a parametric nature, the choice of optimal linearization sequence is not straightforward, i.e. there are various possibilities for optimization. Consequently, different optimization strategies and tactics can be exploited; see for instance Olsson34 and Stordrange et al.35.
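As an illustration of one such transform, a minimal MSC sketch (the common formulation: each spectrum is regressed on a reference spectrum, here the column mean by default, and the fitted offset and slope are removed; the function name and default reference are assumptions, not code from the papers):

```python
import numpy as np

def msc(X, reference=None):
    """Multiplicative Scatter Correction sketch: regress each spectrum on a
    reference spectrum and remove the fitted offset and slope."""
    ref = X.mean(axis=0) if reference is None else reference
    Xc = np.empty_like(X, dtype=float)
    for i, x in enumerate(X):
        b, a = np.polyfit(ref, x, 1)    # least-squares fit: x ≈ a + b * ref
        Xc[i] = (x - a) / b             # remove additive and multiplicative scatter
    return Xc
```

A spectrum that differs from the reference only by a baseline offset and a multiplicative scatter factor is mapped exactly onto the reference.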

The sequence in which the pretreatments are performed is also of great importance, since the pretreatments are affected by numerical size.

Another problem with pretreatments is the unwanted anchoring of the data, i.e. the pretreatment depends on the data itself, so when an unknown datum is to be passed through the model, the pretreatment must be based on corrections saved from the modeling step.

3.8 Compression

The goal of data compression is to reduce either the number of rows or columns in a matrix, usually columns (variables). Available compression methods can be divided into two main types: lossless and lossy. Lossless compression can only be used, today, for data that lack noise and have an ordered structure. Lossless compression means that the compressed data set can be back-transformed exactly to its uncompressed, original, shape from the compressed representation – spectral data do not usually belong to this type.

Since it is not possible to perform a perfect, lossless, back-transformation of spectral data, several lossy methods have been devised, e.g. truncated Fourier series, spline and polynomial compression and the popular wavelet transform. These methods are, in extreme cases, lossless, but the lossless transformation yields no compression, that is, the compressed representation is as large as the original data set.

The common denominator for the practical use of these, now lossy, compression algorithms is truncation. Truncation is achieved by defining a numerical threshold; the part of the data structure above (or below) this threshold is considered noise or unwanted, and is hence discarded. In the Fourier transform, the threshold is the subset of FT coefficients to keep; in PCA it is the number of significant pc's. The threshold value for the truncation can be established in several ways; methods like cross-validation and bootstrapping are frequently used.
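A truncated Fourier series makes the idea concrete: keep only the low-frequency coefficients of a smooth synthetic spectrum and measure the reconstruction error (the spectrum and the threshold k = 20 are invented for the example):

```python
import numpy as np

# A smooth synthetic "spectrum": two Gaussian peaks on a 256-point axis
x = np.linspace(0.0, 1.0, 256)
spectrum = np.exp(-((x - 0.3) / 0.05) ** 2) + 0.5 * np.exp(-((x - 0.7) / 0.1) ** 2)

coeffs = np.fft.rfft(spectrum)
k = 20                                   # truncation threshold: keep 20 low-frequency coefficients
truncated = np.zeros_like(coeffs)
truncated[:k] = coeffs[:k]               # everything above the threshold is discarded

reconstructed = np.fft.irfft(truncated, n=spectrum.size)
rel_error = np.linalg.norm(spectrum - reconstructed) / np.linalg.norm(spectrum)
# only ~20/129 of the coefficients are retained, yet the reconstruction error is small
```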


Notably, the more parameters that can be changed when using a certain method, the more cumbersome it is to elucidate the optimality of the chosen parameter setting; wavelets, for example, have very many degrees of freedom. This, in turn, makes the modeling of the compressed data more complex and time consuming, since the parameter setting for the compression algorithm itself must be tested, which can easily be as complex as the modeling of the data.

The main reason for compression would be, in the case of NIR or other spectral data from high-resolution instruments, to decrease the computation time required for modeling and the data storage space. For the power method, SVD and PLS, the flop counts are proportional to p², 4n²p + 8np² + 9p³ and 8npa + 5pa + 2na, respectively (see references 36, 9 and 37), n being the number of objects, p the number of variables and a the number of pc's, indicating that a reduction in the number of variables, p, would speed up the modeling computations quite dramatically.

Notably, the use of compression/decompression is widespread for pictures and music. In passing, one could note the lossy MPEG (Moving Picture Experts Group) compression methods. A widely used lossless method is the file compression that produces ZIP files, which works by mapping segment repetitions.

Another interesting feature of compression is that it extends the possibility of variable selection, i.e. the selection of a subset of variables for data analysis and/or calibration that is optimal in some sense. While multivariate implies that many variables are involved, it does not necessarily mean that all variables are informative. Multivariate models benefit from the removal of variables that do not describe relevant features. Optimality may, in this case, refer, for instance, to the parsimoniousness of the model or the minimization of prediction error. Variable selection is usually computationally intensive because of the extensive use of random permutations in popular methods like simulated annealing (SA) and genetic algorithms (GAs). Regardless of the method used, even a 10-50% reduction in variables makes a big difference in reducing computation time.

Theoretical aspects of the proposed B-spline type of compression are thoroughly covered in Paper III and will not be repeated here.


3.9 Peak Alignment

As stated in the introduction, unaligned peaks cause problems during data analysis and/or calibration. The main reason for this is that the variance-mapping mathematics recognizes the peak shift as a variance phenomenon, see Fig 7, which adds to the chemical rank. Consequently, more latent components are required to model the data adequately, and interpretation of the models becomes much more difficult, since one varying peak that exhibits shifts is assigned a rank >1 solution when analyzing the data. The corresponding loadings also become more complex, showing multi-modality and negative peaks: artifacts that are highly undesirable. In worst-case scenarios, the variance introduced by peak shifts can "drown" the variance of interest, making it difficult to build a discriminating model at all.

Figure 7. The lower plot depicts three synthetic samples, each with two peaks of identical height. The left peak is aligned and the right peak is unaligned. The top curve is the variance of the samples, indicating that variance is induced by peak shift.
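The situation in Fig 7 can be reproduced with a few lines of NumPy (synthetic Gaussian peaks; positions and shifts are invented for the example): the aligned peak contributes no variance across samples, the shifted peak does, and the shift alone raises the rank needed to model the data:

```python
import numpy as np

def gauss(x, mu, sigma=0.02):
    return np.exp(-((x - mu) / sigma) ** 2)

x = np.linspace(0.0, 1.0, 500)
# Three samples: left peak aligned at 0.25, right peak shifted around 0.70
samples = np.array([gauss(x, 0.25) + gauss(x, 0.70 + d) for d in (-0.03, 0.0, 0.03)])

var = samples.var(axis=0)
left = slice(np.searchsorted(x, 0.1), np.searchsorted(x, 0.4))
right = slice(np.searchsorted(x, 0.55), np.searchsorted(x, 0.9))

assert var[left].max() < 1e-20           # the aligned peak induces no variance
assert var[right].max() > 0.1            # the shifted peak induces large variance

# The shift alone makes the centered data rank 2, i.e. a >1 rank solution
s = np.linalg.svd(samples - samples.mean(axis=0), compute_uv=False)
assert (s > 1e-8).sum() == 2
```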


4 Summary of the papers

4.1 Paper I

The title of paper I is “Multivariate characterization of chemical and physical descriptors in pulp using NIRR.”

In this work, the possibility of measuring complex chemistry and sum parameters of digested wood chips using reflectance NIR and chemometrics is discussed. Kappa number, Klason lignin, glucose, xylose and uronic acid contents were measured during a lab digestion of birch chips. The questions addressed in this paper were as follows. First, is it possible to use this method to follow the digestion process over time in terms of pulp composition? Second, is it possible to build a predictive model that is not influenced by variations in the digestion parameters (effective alkali, sulfidity, digestion temperature and temperature ramp)?

The main contribution of the work, from an analytical perspective, is that these parameters had not been modeled with the combination of PLS and data from a full-spectral-resolution NIR instrument prior to the time of publication. Similar investigations have been made since then, which support the validity of the proposed method.

The method yields results comparable with the reference lab techniques in terms of prediction error. Furthermore, the method is not sensitive to changes in the digestion parameters spanned by the experiments.

4.2 Paper II

The title of paper II is "Determination of nitrate in municipal waste water by UV spectrometry."

In Paper II we suggest a novel method for measuring the nitrate concentration in untreated raw wastewater. The method is based on the fact that nitrate ions absorb in the UV portion of the electromagnetic spectrum; the nitrate ion, NO3−, has a strong absorbance at 205 nm. The presented method uses full UV-VIS spectral data for the simultaneous determination of nitrate, total phosphorus, total nitrogen, ammonium nitrogen and iron by multivariate calibration. The proposed calibration method, PLS, is also compared with ILS.


4.3 Paper III

The title of paper III is "Compression of first order spectral data using the B-spline zero compression method."

There are numerous candidate algorithms for compressing spectral data in the context of calibration, one of which is the B-spline zero algorithm. The B-spline method has several appealing features: it is virtually parameter-free, and the mapping is done in the same space as the spectra themselves, i.e. the method does not change the basis of the data, making translation back to the original data straightforward. In this work, the B-spline zero method is extended from the original guide vector of the Maximum Entropy Method (MEM) to improve its performance in the context of variance modeling with spectrometric data showing baseline shifts. The original theory, based on MEM, employs the mean of the X-block as a compression guide vector. Since (co)variance is modeled by the PLS calibration algorithm, the B-spline zero method was instead used with the standard deviation (SD) and relative standard deviation (RSD) of the X-block as guide vector candidates. The rationale for this extension is that, for NIR reflectance data, the mean of the X-block does not necessarily carry any useful information for the calibration of the parameters of interest.

Results are shown for NIR and FT-IR reflectance data exhibiting baseline shifts of differing nature. The method is also used to analyze UV-VIS transmittance data of turbid water samples showing baseline shifts. The results show that it can compress the data to approximately 20% of the original size without causing model degradation.

4.4 Paper IV

The title of paper IV is "Water sorption to hydroxyl and carboxylic acid groups in Carboxymethylcellulose (CMC) studied with NIR-spectroscopy."

Figure 8. Structure of two repeating units of sodium carboxymethyl cellulose (CMC)

This paper describes the use of a programmable climate chamber equipped with a NIR spectrometer to investigate water adsorption to the surface of sodium- and calcium-modified CMC (CMC-Na and CMC-Ca), see Fig 8, with differing degrees of substitution, when exposed to humid air in the range 1-99 %RH. The aim of the study was to determine whether water adsorbs specifically to sites with certain chemical features, i.e. hydroxyl and carboxyl groups, or whether the adsorption is non-specific, i.e. occurs over the whole cellulose fiber.

The paper presents PCA analysis and PLS calibration results of NIR reflectance data for the modeling of bound water. Water content was determined by Differential Scanning Calorimetry (DSC) and Thermogravimetric Analysis (TGA) as reference methods. The authors also attempt a qualitative interpretation of the resulting models using loading and bi-plots.

4.5 Paper V

The title of paper V is “Peak Alignment using Reduced Set mapping.”

This paper describes a novel algorithm and method for peak alignment, or spectral synchronization, of data with pronounced differences between baseline and signals of interest. Typical types of data that can be synchronized with the algorithm are data acquired by NMR, GC, LC or CE. The method is based on ideas from sequence alignment and exploits sparseness and fast search algorithms. The proposed method performs peak alignment as powerfully as previously reported methods but is at least 50-800 times faster. The method has potential for applications such as automatic database searches and the automated evaluation of chromatograms or spectra for identifying peaks, as well as for either qualitative or quantitative analyses. The method could also be used for more parsimonious interpretation of multivariate models of data exhibiting peak shifts.


5 Discussion

5.1 Paper I

The study presented in this paper was motivated by the desire to characterize a complex organic material, native cellulose, and the different chemical and/or other properties that naturally vary in it. Cellulose generally originates, except in cotton, from the plant cell wall, which is composed of three constituents: cellulose, lignin and hemicellulose. The material may originate from any plant, so the term lignocellulosic materials is used as a generic term.

Cellulose is a complex biopolymer composed of glucose monomers and is one of the main bulk raw materials used in the process industries. It is mostly harvested as wood, and diverse processed forms appear in our daily lives, for instance paper, towels, packaging, fillers, diapers, fuel, building materials, and additives in cosmetics, pharmaceuticals and paint. Despite the vast amounts of these materials processed, produced and used every day, our understanding of how their composition varies with place of growth, season of harvest and species, to mention just a few sources of variance, is virtually non-existent. How this variance affects the processability of the material and the final product is almost unknown. This is due to the difficulties involved in characterizing the material, since traditional measurements of the relevant descriptors are very time consuming (and laborious) in relation to the process/production time. Traditional analysis of lignocellulosic materials such as wood, wood dust, chips, pulp and paper is difficult and involves tedious sample work-up steps such as dissolution and derivatization prior to analysis by traditional wet chemistry, GC, MS, NMR, physical testing or other techniques. The main aim of the work described in this thesis was to explore the possibility of using fast spectrometric techniques combined with multivariate data analysis and modeling to enable very fast and accurate predictions of key chemical and physical properties.

This paper explores the potential of near infrared spectrometry, NIR or NIRS (or NIRR), in reflectance mode as a means of characterizing lignocellulosic materials rapidly and efficiently. The advantages of this spectrometric technique are the deep sample penetration of the analyzing beam, simple sample preparation, non-invasiveness and speed of analysis – typically one minute. However, there are also disadvantages to analyzing materials using this sector of the electromagnetic spectrum: the resulting data overlap, the data exhibit artifacts such as inter-molecular resonance phenomena and hydrogen-bond shifts resulting in non-linear shifts of peak position, and the reflectance data also show the effects of Rayleigh and Mie light-scattering.


A summary of the previously mentioned spectral anomalies relevant to this paper will now be given, to illustrate the complexity of calibrating NIR data against native cellulose. According to thermodynamic considerations, a cellulose monomer (D-glucose, C6H12O6), see Fig 9,

Figure 9. Structure of two cellulose monomers (glucose units) joined by a β-1,4-glycoside link.

with N = 24 atoms, has 3N - 6 = 66 different possible fundamental vibrational states, or modes, to populate. In addition, one of the possible model molecules for lignin, see Fig 10,

Figure 10. Structure of one model compound for lignin, displaying the complex nature of lignins.

with N = 53, will yield another 153 modes. This results in approximately 220 possible fundamental peaks. These fundamentals, multiplied by a factor of at least two (>440 peaks), will yield the first and second overtone peaks, which are responsible for the resulting NIR spectrum. To further complicate the resulting spectrum, one has to consider the possibilities of combination bands, rotational band broadening of the peaks, hydrogen-bond shifts, vibrational coupling, Fermi resonance and peak overlaps. Consequently, the NIR spectrum will comprise a plenitude of overlapped peaks, so the possibility of straightforward peak assignment in a NIR spectrum of a complex material is low. Nevertheless, essentially all the information from the comparatively peak-resolved IR spectrum is present in the corresponding NIR spectrum.

The complexity of the sample, and the scatter present, can be visualized by comparing reflectance spectra of solid samples with transmittance spectra of liquid samples, see Figs 4 and 5. Note the difference in appearance: the liquid spectra lack baseline shifts and have much sharper peaks, due to the less complex chemistry of liquid samples.


One of the key questions related to the data presented in this paper is: what is actually modeled by the spectral data? From a closer inspection of the Y-block, one could reason that some of the measured parameters reflect common phenomena. Yield and glucose are measures of cellulose content, while Kappa number and Klason lignin are measures of the substances the digestion process is supposed to reduce, i.e. lignins, so these pairs of measurements should theoretically be strongly co-correlated. This is also supported by the data in Table 1, where the correlation coefficients (r2) of the Y-block are tabulated.

r2       Yield  Kappa  Klason  Glucose  Xylose  Uronic
Yield    -
Kappa    0.81   -
Klason   0.86   0.94   -
Glucose  0.97   0.88   0.93    -
Xylose   0.27   0.53   0.53    0.32     -
Uronic   0.85   0.54   0.60    0.80     0.05    -

Table 1. Correlation coefficients of the Y-matrix associated with Paper I. The highest/lowest correlation coefficients are marked in gray.

From Table 1 we can conclude that yield and glucose are strongly correlated (r2 = 0.97). This is consistent with expectations, since cellulose is made from glucose monomers. Also, the kappa number and Klason lignin determinations are strongly co-linear (r2 = 0.94). Again, this is as expected, since both figures are measures of the amount of lignin. The inter-correlations between yield, glucose, kappa and Klason values probably reflect the time axis of the digestion.

From a modeling perspective, the raw spectra exhibit strong correlations with the Y-block, xylose content being a mild exception. This phenomenon can be accounted for by noting that the raw spectra reflect the time of digestion, via scatter phenomena, and that the digestion time is correlated with the descriptors, see Fig 11.


Figure 11. Plot of the squared correlation coefficient (r2) for the descriptors (yield, kappa, klason, glucose, xylose, uronic) against the corresponding raw spectra, wavelength 1200-2400 nm. Data taken from Paper I.

If the baseline shifts are scatter-corrected by applying MSC to the spectral data, the correlation patterns change but remain strong, here indicating correlations of chemical origin, see Fig 12.

Figure 12. Plot of the squared correlation coefficient (r2) for the descriptors (yield, kappa, klason, glucose, xylose, uronic) against the MSC-treated spectra, wavelength 1200-2400 nm. Data taken from Paper I.


Typical predictions for digestions are depicted in Fig 13, where we can see the influence of digestion on the different measured parameters, indicating that the digestion process can be monitored by the proposed method.

Figure 13. Plot of time (experiments 1-6) vs. "concentration" for the descriptors (yield, kappa, lignin, glucose, xylose, uronic) during digestion. Data taken from Paper I.

Separating the correlating descriptors in terms of models is very difficult. The patterns of correlation between the X- and Y-blocks are very similar, making qualitative model analysis difficult. Establishing whether the models describe a "true" chemical entity or some underlying, possibly latent, phenomenon is not straightforward. The author concludes that the method predicts the first four descriptors, but since the descriptors are strongly correlated, deciding which of them represent the "true" chemical response in the spectra, or whether they are all represented, is almost impossible. One way of looking at this problem is that xylose, at least, is adequately modeled and predicted without having strong correlations to any of the other descriptors. This implies that the spectrum carries unique information about xylose. If this is the case, it is likely that other constituents with comparable concentrations, but different chemistry, can be uniquely modeled as well.

An interesting concept used in this paper is the factorial design approach to solving the combinatorial problem of choosing algorithms for spectral linearization [34].

The method also gives an interesting opportunity to measure yield (or glucose), which, from an industrial point of view, is a very elusive parameter.


The author would also like to note that the usefulness of NIR does not end with the measurement of lignocellulosic materials, since virtually all organics and some inorganics can be analyzed by this technique. Incidentally, a search of the CAS database with the words “pulp” and “NIR” resulted in 31 relevant hits with dates of publication later than this paper.

The author was not involved in work related to the digestion or the experimental set-up, which should ideally have been done using factorial design. The author’s contributions were the multivariate data analysis and model building.

The author would also like to make a comment regarding the model figure of merit, r2, used in Paper I. The common notation for the squared correlation coefficient between predicted and true values for samples not included in the model is q2. This notation should have been used in the paper.
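The distinction between r2 and q2 can be illustrated with a minimal leave-one-out sketch (hypothetical Python; a plain least-squares model stands in for the actual calibration model of Paper I):

```python
import numpy as np

def q2_loo(X, y, fit, predict):
    """Leave-one-out cross-validated q2: every sample is predicted by a
    model built without that sample, unlike r2, which reuses the
    training samples themselves."""
    y = np.asarray(y, dtype=float)
    preds = np.empty_like(y)
    for i in range(len(y)):
        mask = np.arange(len(y)) != i        # leave sample i out
        model = fit(X[mask], y[mask])
        preds[i] = predict(model, X[i:i + 1])[0]
    press = np.sum((y - preds) ** 2)         # predictive residual sum of squares
    return 1.0 - press / np.sum((y - y.mean()) ** 2)

# hypothetical stand-in model: ordinary least squares instead of PLS
fit = lambda X, y: np.linalg.lstsq(X, y, rcond=None)[0]
predict = lambda coef, X: X @ coef

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))
y = X @ np.array([1.0, -2.0, 0.5])           # noise-free linear response
q2 = q2_loo(X, y, fit, predict)              # ~1.0 for this perfect model
```

For noisy data q2 is systematically lower than the training-set r2, which is exactly why it is the honest figure of merit for prediction.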

5.2 Paper II

The interest in measuring nitrate, NO3−, is due to the fact that nitrate is a key biological nutrient found in excess in wastewater. Thus, the removal of nitrate is very important for water treatment plants cleaning wastewaters before discharging the effluent back into natural systems. The following paragraphs summarize the potential impact of nitrate (a major form of bio-available nitrogen) on ecosystems:

Generally, phosphorus is the limiting nutrient in freshwater aquatic systems. That is, if all phosphorus is used, plant growth will cease, however much nitrogen is available. Many bodies of freshwater are currently subject to influxes of nitrogen and phosphorus from external sources. The increasing concentration of available phosphorus allows plants to assimilate more nitrogen before the phosphorus is depleted. Thus, if sufficient phosphorus is available, high concentrations of nitrates will lead to high levels of phytoplankton (algae) and macrophyte (large aquatic plant) production.

In contrast to freshwater, nitrogen is the primary limiting nutrient in the seaward portions of most estuarine systems. Thus, nitrogen levels control the rate of primary production. If a nitrogen-limited system is supplied with high levels of nitrogen, significant increases in phytoplankton and macrophyte production may occur.

Excessive aquatic plant production may negatively impact fresh water and estuarine environments in the following ways:

• Algal mats, decaying algal clumps, odors, and discoloration of the water may interfere with recreational and aesthetic water uses.


• Extensive growth of rooted aquatic macrophytes may obstruct navigation channels and restrict both aeration and channel capacity.

• Dead macrophytes and phytoplankton settle to the bottom of a water body, stimulating microbial breakdown processes that require oxygen. Eventually, dissolved oxygen will be depleted.

• Aquatic life may be hampered if the entire water body experiences daily fluctuations in dissolved oxygen levels as a result of nightly plant respiration. Extreme oxygen depletion can lead to the death of desirable fish species.

• Toxic algae have been associated with eutrophication in coastal regions and may result in paralytic shellfish poisoning.

• Algal blooms may shade submersed aquatic vegetation, reducing or eliminating photosynthesis and productivity.

The major problem associated with the direct measurement of wastewater is that of turbidity. The scattering of the incident light will manifest itself in a loss of quanta not accounted for by the ion’s absorption. A remedy that has been used to solve this problem is to measure at two different wavelengths: one for the absorption of nitrate and another for establishing the turbidity. In this paper we suggest a new method for measuring nitrate, based on the nitrate ion’s absorbance in the UV portion of the electromagnetic spectrum. The nitrate ion, NO3−, has a strong absorbance at 205 nm. The presented method is based on using full UV-VIS spectral data for the simultaneous determination of nitrate, total phosphorus, total nitrogen, ammonium nitrogen and iron by multivariate calibration.
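The multivariate calibration step can be sketched with a textbook PLS1 (NIPALS) implementation on synthetic two-component data. This is illustrative Python only; the actual spectra, analytes and software of Paper II are not reproduced here, and all array sizes are invented:

```python
import numpy as np

def pls1_fit(X, y, n_comp):
    """Textbook PLS1 (NIPALS): extract score vectors maximizing the
    covariance with y, deflate, and form regression coefficients."""
    Xm, ym = X.mean(axis=0), y.mean()
    Xc, yc = X - Xm, y - ym
    W, P, Q = [], [], []
    for _ in range(n_comp):
        w = Xc.T @ yc
        w = w / np.linalg.norm(w)            # weight vector
        t = Xc @ w                           # score vector
        tt = t @ t
        p = Xc.T @ t / tt                    # X loading
        q = (yc @ t) / tt                    # y loading
        Xc = Xc - np.outer(t, p)             # deflation
        yc = yc - t * q
        W.append(w); P.append(p); Q.append(q)
    W, P, Q = np.array(W).T, np.array(P).T, np.array(Q)
    b = W @ np.linalg.solve(P.T @ W, Q)      # regression vector
    return Xm, ym, b

def pls1_predict(model, X):
    Xm, ym, b = model
    return ym + (X - Xm) @ b

# synthetic two-component "UV-VIS" data (hypothetical sizes)
rng = np.random.default_rng(1)
T = rng.normal(size=(30, 2))                 # latent scores
L = rng.normal(size=(10, 2))                 # spectral profiles
spectra = T @ L.T                            # 30 spectra, 10 wavelengths
conc = T @ np.array([1.0, -0.5])             # stand-in "concentration"
model = pls1_fit(spectra, conc, n_comp=2)
```

With two latent components in the data, two PLS components recover the response exactly; real turbid spectra require model selection, e.g. by cross-validation.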

The merits of the proposed method are that it is cheap and fast, and since no mechanical or chemical workup of the samples is required, the method is a strong candidate for on-line implementation, enabling real-time process monitoring and, possibly, control.

An interesting fact related to this data set is that it has been further analyzed by Sundberg [38]. In that work, different aspects of collinear data modeling tactics are addressed.

5.3 Paper III

Data obtained with different spectrometers scanning different wavelength ranges were used in order to elucidate the notion that different guide vectors would perform differently with respect to scatter phenomena observed in the data, i.e. different baseline shifts and turbidity.


The B-spline zero method is useful, especially in the context of modeling and outlier detection, since it retains parts of the spectrum not used for modeling, unlike variable selection techniques. The method is also almost parameter-free, except for the number of target variables in the compressed set, which makes implementation of the method straightforward and appealing with respect to model complexity and, ultimately, Occam’s razor – Entia non sunt multiplicanda praeter necessitatem: “One should not increase, beyond what is necessary, the number of entities required to explain anything.”
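A generic B-spline compression of a spectrum, in the spirit of (but not identical to) the B-spline zero method of Paper III, might look as follows; the knot placement and the ~20% compression ratio are illustrative assumptions:

```python
import numpy as np
from scipy.interpolate import splrep, splev

# a synthetic two-peak "spectrum" on 500 abscissa points
x = np.linspace(0.0, 1.0, 500)
spectrum = np.exp(-((x - 0.3) / 0.05) ** 2) + 0.5 * np.exp(-((x - 0.7) / 0.1) ** 2)

# fit a cubic B-spline with ~100 fixed interior knots and keep only
# the spline coefficients as the compressed representation
knots = np.linspace(0.0, 1.0, 102)[1:-1]
tck = splrep(x, spectrum, k=3, t=knots)

reconstructed = splev(x, tck)
n_coef = len(tck[0]) - 3 - 1                 # number of B-spline coefficients
compression = n_coef / len(spectrum)         # ~20% of the original data size
```

The smooth spectrum is reproduced essentially exactly from roughly a fifth of the original data, consistent with the compression figures quoted in the Conclusions.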

The results suggest that the proposed RSD basis for the guidance vector outperforms the proposed MEM theory based mean vector for data showing a multiplicative baseline shift, in this case NIR reflectance data. The SD and RSD are not better in the case of scattered transmittance data, which is consistent with expectations, since the mean vector adequately maps the information content of this data.

5.4 Paper IV

The question of whether the bonding of water to cellulose is of specific or non-specific nature had been discussed for quite some time. Two different, opposing, theories had previously been presented.

The discussion section in this paper is very elaborate, and covers the findings in the paper quite thoroughly. The prelude to the study was a discussion about the merits of NIR and how they could be employed to address the question under consideration. The key points mentioned were as follows.

According to NIR theory, hydrogen bond shifts or redistributions of vibrational energy for either the hydroxylic or carboxylic groups, or the water itself, should occur when the sample is subjected to changes in relative humidity, and thus in the amount of water adsorbed to the CMC matrix. This adsorption phenomenon should manifest itself in the NIR spectra as changes in the peaks corresponding to carboxyls and hydroxyls, or the water peak. The water peak could shift in wavelength due to the complex nature of the peak (the sum of the sub-peaks of the water sub-species), that is, water bonded in different ways. If the sub-peaks change in height, the result should be a shift in wavelength of the composite peak maximum, i.e. the one we observe in the resulting spectrum.

If any of the mentioned peak shifts or specific peak height changes were present, these phenomena could be modeled by multivariate methods, and by examining the resulting model parameters reflecting the variable space, conclusions regarding the water adsorption could be drawn.


One interesting study described in this paper concerned the optimal number of scans of the instrument. The instrument can be set to scan an arbitrary number of times, and the resulting spectrum is formed from the mean of these scans. This is of crucial importance since the power of the incident light source was ~100 W and could affect the moisture by heating the analyzed sample and thus changing the water sorption. Because of this heating effect, as few instrumental scans as possible would be preferred. On the other hand, taking too few scans impairs the signal-to-noise ratio (S/N), so some kind of compromise needs to be found. The problem was solved using factorial design for the variables affecting the resulting spectra, e.g. RH, the amount of sample scanned and the number of scans. As the response, singular values of spectral segments of a static sample scanned in a time series were used. Two spectral segments of the scanned time series were modeled: one segment containing information about water, i.e. the water peak at 1780-1880 nm, and one segment of the spectra containing the baseline, 1040-1240 nm. From these sections a model of how the instrumental parameters affected both the noise (S/N) and the potentially detrimental heating of the sample could be made.
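The use of singular values of a scanned time series as a response can be sketched as follows (synthetic data only; a drifting baseline stands in for the heating effect discussed above, and all sizes and noise levels are assumptions):

```python
import numpy as np

rng = np.random.default_rng(3)
wl = np.linspace(0.0, 1.0, 80)
peak = np.exp(-((wl - 0.5) / 0.1) ** 2)      # synthetic spectral segment

# 20 repeated scans of a static sample: rank-one structure + noise
static = peak + 0.01 * rng.normal(size=(20, 80))

# 20 scans where a baseline drifts upward (standing in for sample
# heating): a second structured component appears in the series
baseline = np.outer(np.linspace(0.0, 0.3, 20), np.ones(80))
drift = peak + baseline + 0.01 * rng.normal(size=(20, 80))

s_static = np.linalg.svd(static, compute_uv=False)
s_drift = np.linalg.svd(drift, compute_uv=False)

# the ratio s2/s1 responds to the extra structure in the drifting series
ratio_static = s_static[1] / s_static[0]
ratio_drift = s_drift[1] / s_drift[0]
```

In the static series the second singular value reflects only noise, whereas drift inflates it markedly, which is what makes singular values a useful factorial-design response here.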

Since the matrix (cellulose) is the same for both Na- and Ca-modified cellulose, the only differences that can be recorded by NIR are differences in the spectra of Na and Ca and/or differences in water dynamics in the wetted system. Since the ions themselves do not have any NIR absorption, one can form the hypothesis that both systems will show strong water-related (~1940 nm) loadings for the OH-coordinated water, which should be similar, if not identical. The models should then deviate, since the Ca- and Na-coordinated water should differ according to the relative differences in hydrogen bonds between the different counter ions. This is very much the case - the second pc (the first is omitted since the data were not centered) of the Ca and Na models are in fact identical. The differences in the higher pc loadings, outlined in Figs. 10 and 11 of the Paper, are assigned to differences in the coordination of water to the counter ions.

The author would like to add that more results could have been presented regarding this study. For instance, Window Factor Analysis [39] (WFA) could have been performed on the data, possibly revealing the substructure of the first overtone water peak (~1940 nm). The data can also be arranged in a three-mode, three-way data set (RH × DS × λ) and analyzed with higher order methods such as PARAFAC [40, 41], Tucker and N-PLS [42].
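The suggested three-way arrangement amounts to a reshape of the stacked spectra; a minimal sketch, with hypothetical mode sizes:

```python
import numpy as np

n_rh, n_ds, n_wl = 5, 3, 200                 # hypothetical mode sizes
rng = np.random.default_rng(4)

# flat table: one spectrum per (RH, DS) combination, RH varying slowest
flat = rng.normal(size=(n_rh * n_ds, n_wl))

# the three-way data set (RH x DS x wavelength); e.g. cube[2, 1] is the
# spectrum at the 3rd RH level and 2nd DS level
cube = flat.reshape(n_rh, n_ds, n_wl)
```

An array of this shape is the natural input for PARAFAC, Tucker or N-PLS decompositions.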


5.5 Paper V

As an example of the merits of the proposed peak alignment method, coined PARS, described in this paper, a 1H-NMR data set of rat urine collected for a metabonomics study of the drug Citalopram is presented. The experiment monitored twelve rats, six in a control group and six given the drug. The rats’ urine was collected and analyzed by NMR over a period of time (-6, -5, 1, 3, 7, 10 and 14 days after introduction of the active substance), resulting in seven NMR spectra for each rat during the study. One aim of this study was to see if the rats in the control group and the rats given the substance could be differentiated by the NMR spectra of their urine, i.e. supervised data analysis. To compare the merits of peak alignment, an unsupervised PCA analysis of the 84 NMR spectra was performed, with and without peak alignment, Figs. 14 and 16. For a reasonable comparison a PCA analysis of bucketed data was also made. The scores plot of the bucketed data is shown in Fig. 15. Bucketing of NMR data is currently the most common pretreatment method for overcoming the alignment problem. The scores of the models can be seen in Figs. 14-16 and the loadings of the model of PARS-corrected data can be seen in Figs. 17 and 18. More details of the experimental set-up can be found in the paper.
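The PCA underlying the score plots is standard; a minimal sketch via the SVD of the column-centered data matrix (random data standing in for the 84 real NMR spectra, so only the mechanics are shown):

```python
import numpy as np

def pca_scores(X, n_comp):
    """PCA via the SVD of the column-centered data matrix; the scores
    are the left singular vectors scaled by the singular values."""
    Xc = X - X.mean(axis=0)
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    return U[:, :n_comp] * s[:n_comp]

rng = np.random.default_rng(5)
spectra = rng.normal(size=(84, 300))         # stand-in for 84 NMR spectra
T = pca_scores(spectra, 3)                   # score plots use pc 2 vs. pc 3
```

Because the centered scores inherit the descending order of the singular values, pc 1 captures the most variance, pc 2 the next, and so on.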

Figure 14. PCA scores (pc 2 vs. pc 3) - uncorrected raw NMR data. o = no drug, + = given drug; the number indicates the time (in days) after the rat had been given the drug.


Figure 15. PCA scores (pc 2 vs. pc 3) - bucketed NMR data (bins = 255).

Figure 16. PCA scores (pc 2 vs. pc 3) - PARS-corrected NMR data.


Figure 17. PCA loadings corresponding to the scores plot; discriminating peaks indicated by numbers - PARS-corrected NMR data.

Figure 18. PCA loadings, pcs 2 and 3, zoomed (data bin # 2500-5500); discriminating peaks indicated by arrows - PARS-corrected NMR data. pc 2 plotted with an offset for clarity.


Clearly, the scores plots from the unaligned and bucketed data do not show any discriminating power. From the unaligned and bucketed data, the time at which the drug was introduced cannot be detected from the metabolites in the rat urine, whereas the scores from the PARS-aligned data show a clear discrimination. Bucketing of complex NMR data is not an efficient method for reducing the number of variables. This is understandable, since 64000 data points are “integrated” into 255. If there is more than one peak in a “bucket”, the information carried by these peaks is destroyed. The bucketing also hampers the interpretation of the NMR spectra/model loadings, since the fine structure of the spectrum is lost. The attentive reader immediately sees that the NMR data is of a true three-way nature, opening the possibility of analysis with PARAFAC or Tucker models. This is, of course, true. The merits of the peak alignment are still valid for these higher-way model types, even if their importance is somewhat lessened. The proof of this is, unfortunately, beyond the scope of this thesis.
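The information loss caused by bucketing is easy to demonstrate numerically. In this hedged sketch (synthetic data, invented peak positions), two closely spaced but well-resolved peaks fall into the same bucket of a 255-bin integration of a 64000-point spectrum:

```python
import numpy as np

n_points, n_bins = 64000, 255
x = np.linspace(0.0, 10.0, n_points)

# two closely spaced, well-resolved peaks in the raw "spectrum"
spectrum = np.exp(-((x - 4.99) / 0.005) ** 2) \
         + np.exp(-((x - 5.00) / 0.005) ** 2)

# integrate into 255 equal-width buckets
edges = np.linspace(0.0, 10.0, n_bins + 1)
idx = np.digitize(x, edges[1:-1])            # bucket index for each point
bucketed = np.bincount(idx, weights=spectrum, minlength=n_bins)

# both peaks fall in the same bucket, so their separate identities
# (and the fine structure between them) are lost
peak_fraction = bucketed.max() / bucketed.sum()
```

Nearly all of the intensity ends up in one bucket, so any change affecting only one of the two peaks becomes invisible after bucketing.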


6 Conclusions

The major conclusions that can be drawn from these studies are as follows.

The proposed method of reflectance NIR in conjunction with chemometrics is capable of measuring major constituents of woodchips/pulp. The method does not seem to be influenced by the digestion process. The method yields comparable or better precision and accuracy than traditional measurement techniques.

UV-VIS data of untreated wastewater can be modeled for the nutrient of interest, nitrate, even though the spectra show severe effects of the turbidity of the samples. This method is well suited for on-line analysis and hence for environmental monitoring.

It is not necessary to use fully resolved spectral information in calibration/classification with NIR reflectance data. Good results can be achieved with compressed data using the proposed, modified, B-spline zero compression method. In this study compression down to 20% of the initial abscissa data size did not affect the model’s behavior.

NIR reflectance spectrometry can determine the amount of water adsorbed to/in a complex biomaterial. The method can also distinguish the specific chemical groups to which the water is adsorbed.

Alignment of peaks in data such as NMR, GC, and LC spectra/chromatograms can readily enhance the subsequent data analysis. Models of data subjected to peak alignment display both increased discrimination power and more parsimonious interpretation.

The first conclusion is of particular interest for the pulp and paper industry.

The second conclusion is of interest for the control of wastewater treatment plants since their effluents have an environmental impact.

The third conclusion is of interest if one wants to calibrate using large amounts of data since the method can significantly accelerate the calculations involved.

The fourth conclusion is of interest for the food and agriculture sectors, since the different bonding of water in native products is correlated with complex food-related criteria such as ripeness and freshness that are, traditionally, difficult to measure.

The fifth conclusion is of interest for members of the scientific community using instruments that yield data prone to shifts in peaks along the x-axis who want to analyze these data with multivariate methods. The proposed algorithm can reduce the effects of peak shifting, thus improving the models.


7 Acknowledgements

Many people have contributed to this thesis, some with content, some with ideas and some by making it possible to engage in such an activity as writing a thesis – I sincerely thank you all: you know who you are. In addition I would like to take the opportunity to point out a few who deserve a somewhat more special dedication;

Professor Bo Karlberg, my tutor in this quest, for believing that I could do it, for all the good company he provided and the interesting discussions - ranging from intri-cate chemical problems, through house building, to the eternal question of how to catch that really big fish.

Prof. Sven Jacobsson. One of the major co-contributors to the later parts of this work. Sven is gratefully acknowledged for a lot of positive inputs, for enabling use of some of the data presented in this thesis and for arranging some of the funding.

All my love to my own family; Monica, Kim and Alexandra, for putting up with me during my lengthy nightly computer sessions and absent-minded behavior.

Mum, Dad & Sis who have always supported me in my endeavors.

My family in-law who always keeps me focused on my goals.

Prof. Em. Kjell Sjöberg, KTH. Kjell is acknowledged for being the one who once upon a time introduced me to the “multivariate” way of doing things; furthermore, Kjell is a person with great visions. Without Kjell – there would not have been any thesis.

The Lann, Bergman, Moberg, Westin, Hildebrandt, Rehnqvist, Andersson and Cederholm families, for being such good friends.

My friends at my former workplace – Micke, Jonas, Robert, Krister, Greger, David, Eva, Christer and Ragnar for inspiring me to undertake this project.

All my friends at KTH and SU – another great source of inspiration.

Last, but not in any way least; I would like to thoroughly thank the Department for co-funding me – Thanks!

Finally a thought goes to Dave Weckl for showing me how things really are done…

Love you all!

RalpH


8 References

[1] H. A. L. Kiers, "Towards a Standardized notation and terminology in multiway analysis." J. Chemom., 105-122 (2000).

[2] D. A. Burns and E. W. Ciurczak, "Handbook of Near-Infrared Analysis", New York-Basel, 2001.

[3] A. Schuster, "Radiation through a foggy atmosphere", Astrophys. J., 1 (1905).

[4] P. Kubelka and F. Munk, "Ein Beitrag zur Optik der Farbanstriche", Z. Tech. Physik, 593-601 (1931).

[5] J. L. Saunderson, "Calculation of the Color of Pigmented Plastics", J. Opt. Soc. Am., 32, 727-736 (1942).

[6] J. K. Nicholson, J. C. Lindon, and E. Holmes, "'Metabonomics': understanding the metabolic responses of living systems to pathophysiological stimuli via multivariate statistical analysis of biological NMR data." Xenobiotica, 29, 1181-1189 (1999).

[7] J. Forshed, F. O. Andersson, and S. P. Jacobsson, "NMR and Bayesian regularized neural network regression for impurity determination of 4-aminophenol." J. Pharm. Biomed. Anal., 29, 495-505 (2002).

[8] J. D. J. Ingle and S. R. Crouch, "Introduction to Molecular Spectroscopy" in, Prentice-Hall International (UK) Limited., London, 1988.

[9] G. H. Golub and C. F. van Loan, "Matrix Computations", The Johns Hopkins University Press, London, 1989.

[10] S. Wold, "Cross validatory estimation of the number of components in factor and principal component models", Technometrics, 397-405 (1978).

[11] B. Efron and R. J. Tibshirani, "An introduction to the bootstrap", Chapman & Hall, New York, 1993.

[12] D. M. Haaland and E. V. Thomas, "Partial Least-squares methods for spectral analyses. 1. Relation to other quantitative calibration methods and the extraction of qualitative information." Anal. Chem, 60, 1193-1202 (1988).

[13] H. Martens and T. Naes, "Methods for calibration" in Multivariate Calibration, John Wiley & Sons, 1989.

[14] H. Hotelling, "Analysis of a complex of statistical variables into principal components", J.Educ.Psychol., 417-520 (1933).

[15] J. E. Jackson, "A User's Guide to Principal Components", Wiley, New York, 1991.

[16] K. Pearson, "On lines and planes of closest fit to systems of points in space", Philos. Mag., 559-572 (1901).

[17] S. Wold, K. Esbensen, and P. Geladi, "Principal Components Analysis", Chemom. Intell. Lab. Syst., 37-52 (1987).

[18] C. Spearman, "General intelligence, objectively determined and measured." American Journal of Psychology, 15, 201-293 (1904).

[19] R. A. Fisher and W. A. MacKenzie, "Studies in crop variation. II. The manurial response of different potato varieties." Journal of Agricultural Science, 13, 311-320 (1923).


[20] H. Wold, "Nonlinear estimation by iterative least square procedures." in Research Papers in Statistics, F. N. David, Ed., John Wiley and Sons, Inc., New York, 1966.

[21] L. L. Thurstone, "Multiple Factor Analysis", Psychology Review, 38, 406-427 (1931).

[22] E. R. Malinowski, "Factor Analysis in Chemistry", John Wiley and Sons, Inc., New York, 1991.

[23] R. J. A. Harris, "Primer of Multivariate Statistics", Academic Press, New York, 1975.

[24] C. H. Romesburg, "Cluster Analysis for Researchers", Lifetime Learning Publications Belmont, CA., 1984.

[25] H. Wold, "Partial Least Squares" in Encyclopedia of Statistical Sciences, Wiley, New York, 1985.

[26] P. Geladi and B. R. Kowalski, "Partial Least-Squares Regression: A Tutorial", Anal. Chim. Acta, 1-17 (1986).

[27] F. Lindgren, Ph.D. Thesis, "Third generation PLS: some elements and applications", Dept. Organic Chemistry, University of Umeå, Umeå, Sweden, 1994.

[28] K. H. Norris and P. C. Williams, "Optimization of mathematical treatments of raw near-infrared signal in the measurement of protein in hard red spring wheat. I. Influence of particle size." Cereal Chem., 61, 158-165 (1984).

[29] R. J. Barnes, M. S. Dhanoa, and S. J. Lister, "Standard normal variate transformation and de-trending of near-infrared diffuse reflectance data", Appl. Spectrosc., 43, 772-777 (1989).

[30] A. Savitzky and M. J. E. Golay, "Smoothing and differentiation of data by simplified least squares procedures", Anal. Chem, 36, 1627-1639 (1964).

[31] J. M. Schmitt, "Fractional derivative analysis of diffuse reflectance spectra", Appl. Spectrosc., 52, 840-846 (1998).

[32] T. C. O'Haver, "Derivative and Wavelength Modulation Spectrometry", Anal. Chem, 91A (1979).

[33] T. C. O'Haver and G. L. Green, "Derivative Spectroscopy", Am. Lab., 15 (1975).

[34] R. J. O. Olsson, "Optimizing data-pretreatment by a factorial design approach" in Near Infra-red Spectroscopy, Ellis Horwood, 1992.

[35] L. Stordrange, F. O. Libnau, D. Malthe-Sörensen, et al., "Feasibility study of NIR for surveillance of a pharmaceutical process, including a study of different preprocessing techniques." J. Chemom., 16, 529-541 (2002).

[36] J. H. Wilkinson, "The Algebraic Eigenvalue Problem", Clarendon Press, Oxford, 1969.

[37] F. Westad, K. Diepold, and H. Martens, "QR-PLSR: Reduced Rank regression for high speed hardware implementation", J. Chemom., 439-451 (1996).

[38] R. Sundberg, "Multivariate Calibration - Direct and Indirect Regression Methodology", Scandinavian Journal of Statistics, 26, 161-207 (1999).


[39] E. R. Malinowski, "Automatic window factor analysis. A more efficient method for determining concentration profiles from evolutionary spectra", J. Chemom., 273-279 (1996).

[40] R. A. Harshman, "Foundations of the PARAFAC procedure: Model and conditions for an "exploratory" multi mode factor analysis", UCLA Working Papers in phonetics, 1-84 (1970).

[41] J. D. Carroll and J. Chang, "Analysis of individual differences in multidimensional scaling via an N-way generalization of an Eckart-Young decomposition." Psychometrika, 283 (1970).

[42] R. Bro, "Multi-way calibration. Multilinear PLS." J. Chemom., 10, 47-62 (1996).

Note that references occurring in Papers I-V are, generally, not cross-referenced in this section.