Comparing efficiency of cross validated partial least squares



Comparing efficiency of cross validated partial least squares regression and ridge regression models in predicting spatial variability of total nitrogen concentration in Truckee River, Nevada

Anpalaki J Ragavan, Department of Mathematics and Statistics, University of Nevada, Reno.

ABSTRACT

Multiple linear regression (MLR) requires controllable predictor variables (factors) that are few in number, non-collinear, and have a well defined relationship to the response variables. When any of these three conditions is violated, MLR is inappropriate. On the other hand, the sensitivity of an unbiased MLR model to noise, sampling and spatial variability (the sensitivity that makes regression models ill-posed) can be largely reduced by deliberately introducing a small amount of bias into the model, which reduces the error variance. The spatial variability of monthly total nitrogen concentration (TN) time series, collected at six monitoring sites on the Truckee River, Nevada, from January 1997 to December 2004, was modeled as a function of seven independent variables through partial least squares (PLS) regression and ridge regression, using PROC PLS (PLS method, RLGW algorithm and split cross validation) and PROC REG (ridge coefficient k = 0.8), respectively, in SAS®. PLS regression extracted 2 factors that predicted observed TN well. The ridge regression model with k = 0.8 gave a minimum predicted TN of 0.18 mg/L and very low VIF (< 0.4) for all independent variables. The ridge regression model was significant at the 0.05 significance level (p < 0.0001, R^2 = 0.69). The ridge regression model showed a smaller sum of squared residuals (SSE = 19.88) than the PLS regression model (SSE = 37.02), and its mean predicted TN was closer to observed TN than that of the PLS regression model for the validation period. The optimum estimated ridge coefficient varied with site.

KEY WORDS: Partial least squares regression; ridge regression; variance inflation factor; tolerance; multicollinearity

INTRODUCTION

STUDY SITE

The Truckee River, a river in northern Nevada and northern California, is 140 mi (225 km) long. It originates in the mountains south of Lake Tahoe, flows into Lake Tahoe at its south end, drains part of the high Sierra Nevada, and empties into Pyramid Lake in the Great Basin (USEPA, 1991). The river flows northwest through the mountains to Truckee, California, then turns sharply east and flows into Nevada, past Reno and Sparks, along the northern end of the Carson Range. A picture of the Truckee River taken from downtown Reno, Nevada is shown in Figure 1. East of the Truckee Meadows, fourteen ditches remove water for irrigation. The most significant diversion is Derby Dam, where at least 32% of the river's water is diverted annually (Peternel and Laurel, 2005). The Truckee Meadows Water Reclamation Facility (TMWRF) currently maintains 11 continuous monitoring stations (www.tmwrf.com) within the Truckee water system. These stations are located at Mogul, Steamboat Creek (SC), McCarran Bridge (MC), North Truckee Drain (NTD), Lockwood (LW), Patrick, Waltham, Tracy, Painted Rock, Wadsworth (WB), and Marble Bluff Dam. The LW monitoring site is currently the compliance site for assessing total maximum daily loads (TMDL) for phosphorus into the Truckee River because most controllable sources are thought to be upstream of LW. Site LW is located in the lower Truckee River basin, 65.6 river miles from Lake Tahoe, downstream of the MB, NTD, and SC monitoring sites and Vista (Figure 2; source: http://truckeeriverinfo.org/gallery). The Truckee River's waters are an important source of drinking and irrigation water along its valley and adjacent valleys. Increased urbanization and the prevalence of water diversions have caused a decline in water quality, and the resulting detrimental effects on habitat have brought about the need to restore the river to a more natural condition to improve habitat and the river's overall health.

The water is quite clear near Lake Tahoe, but as the river descends the water becomes muddy and concentrated in nutrients by the time it passes Reno, Nevada. The California State Water Resources Control Board (State Board) has classified the middle reach of the Truckee River as “impaired” under Section 303(d) of the Clean Water Act. Currently, the Truckee River does not support its designated uses. As a result, the river at Lockwood is listed on Nevada's 303(d) list for TN, total phosphorus concentration (TP), and total dissolved solids (TDS).

STATEMENT OF PROBLEM

Currently there is a need for a consistent, scientifically defensible approach for assigning nutrient criteria for Truckee River water to control eutrophication. Until recently, exceedances of TP in the Truckee River were thought to be the direct cause of biomass activity and hence the major cause of eutrophication (EPA, 2007). More recently, researchers have reported that other variables, such as DO, SF, water temperature (Temp), water pH (pH) and TN, also directly affect biomass activity and hence eutrophication in the river. The relationship of these variables to TN in the river has not been studied or fully understood. All of the above factors that affect biomass activity, including TP, are directly or indirectly related to TN enhancement in the river. Inadequate information about the relationship among these factors currently limits the ability of NDEP to revise the values imposed in the nutrient criteria (use status for the 303(d) Impaired Waters List (Category 5 of the Integrated Report)), which challenges the subsequent implementation of the beneficial use criteria (NDEP, 2007). The beneficial use criteria currently focus on TP, not TN, although TN is another major limiting factor affecting biomass activity in the river. Identifying the exact relationship of the multiple independent variables that affect biomass activity (DO, pH, DOC) to TN is currently required. It is also crucial that these variables be grouped into a few related factors in order to implement nutrient criteria for nitrogen and phosphorus to control eutrophication in the Truckee River. The TMDL compliance level for TN for the Truckee River is currently 1.2 mg/L (1000 lb/day), set at the Lockwood monitoring site.

In addition, TN has been classified by the Environmental Protection Agency (NDEP, 1994) as a conservative pollutant (conservative pollutants persist in the water segment of the aquatic environment over time, remaining essentially constant in concentration). It is therefore not expected to be perturbed by seasonal variations or other short-term cyclical and non-cyclical variations in the system, but to vary directly with the volume of flow from dischargers into the receiving water body. However, TN concentration in the Truckee River may be affected by seasonal agricultural practices such as fertilizer use and by other cyclical and non-cyclical man-made activities, so this classification may also need revising in the future. Site SC is currently the major contributor of TN as well as TP to the Truckee River.

Figure 1: Truckee River in Reno, Nevada

Figure 2: TMWRF Monitoring Stations

In this statistical and data analysis paper, two approaches, i) partial least squares (PLS) regression and ii) ridge regression, were compared in identifying and modeling, as closely as possible, the relationship of multiple independent variables to TN in the Truckee River. Such a model will enable designers to target and manage TN concentration in the Truckee River as accurately as possible according to its source. PLS regression was fitted to predict the dependent variable TN as a meaningful function of two or more factors of the multiple independent variables, to help environmental policy makers and designers develop criteria for nutrient loading into the river. Monthly water quality data for the period from January 1997 through December 2004 were obtained from the Truckee Meadows Water Reclamation Facility (TMWRF, www.tmwrf.com) and used in the modeling. The developed model can provide guidance on the probable range and type of TN load generated and deposited into the Truckee River.

OVERDETERMINED SYSTEMS

In mathematics, a system of linear equations is considered over-determined if there are more equations than unknowns. Such systems are usually inconsistent. Each equation (an observation) introduced into the system acts as a constraint that restricts one degree of freedom. Over-determined systems are over-constrained (i.e., the number of equations outnumbers the number of unknowns) and in general have no solution. For example, in a homogeneous system with $M$ linear equations and $N$ variables ($X_1, X_2, \ldots, X_N$), $X_1 = X_2 = \cdots = X_N = 0$ is always a solution. When $M < N$, the system is underdetermined and there are always further solutions, with the dimension of the space of solutions equal to at least $N - M$. For $M \ge N$ (the system is over-determined), there are in general no solutions other than all values being 0. An over-determined system of equations can have other solutions only when there are enough dependencies (linearly dependent equations) that the number of effective constraints becomes less than the apparent number ($M$) of constraints. In the Truckee River water quality data there are many more observations than independent (also referred to as predictor) variables, so the system is over-determined. The relationship of TN to the selected independent variables has also been found to be non-linear in previous studies (Ragavan, 2007). Solving over-determined systems with non-linearity and multicollinearity remains challenging.
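As a small illustration of an inconsistent over-determined system (the numbers are invented for this sketch and are not taken from the Truckee River data), consider three equations in two unknowns:

```latex
\begin{aligned}
x + y  &= 2,\\
x - y  &= 0,\\
x + 2y &= 4.
\end{aligned}
```

The first two equations force $x = y = 1$, but the third then gives $1 + 2 = 3 \neq 4$, so no exact solution exists and only an approximate solution can be sought.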

METHOD OF LEAST SQUARES

The method of least squares (LSQ) is the method of fitting data in regression analysis for over-determined systems. With the LSQ method, approximate solutions can usually be found for over-determined systems. The best fit of a model is the one with the least value of the sum of squared residuals (SSR). Parameters of the model function are adjusted to best fit a data set. For example, for a simple data set consisting of $N$ data points (data pairs) $(x_i, y_i),\ i = 1, \ldots, N$, where $x_i$ is an independent variable and $y_i$ is a dependent variable (whose value is found by observation), the model function has the form $f(x_i, \beta)$ (where $X$ is a matrix of independent variables and the $M$ adjustable parameters are held in the vector $\beta$). The LSQ method gives the "best" fit when the SSR, $\sum_{i=1}^{N} r_i^2$, is the minimum. The residual ($r_i$) is the difference between the value of the dependent variable ($y_i$) and the value predicted by the estimated model: $r_i = y_i - f(x_i, \hat{\beta})$.
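The definitions above can be made concrete with a minimal Python sketch for the simplest case, a straight line $f(x_i, \beta) = \beta_0 + \beta_1 x_i$. The function name `fit_line` and the four data points are assumptions made for illustration; they are not from the paper.

```python
def fit_line(xs, ys):
    """Closed-form least-squares fit of y = b0 + b1*x.

    Returns (b0, b1, ssr), where ssr is the sum of squared residuals.
    """
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    b1 = sxy / sxx                 # slope that minimizes the SSR
    b0 = y_bar - b1 * x_bar        # intercept that minimizes the SSR
    residuals = [y - (b0 + b1 * x) for x, y in zip(xs, ys)]
    ssr = sum(r * r for r in residuals)
    return b0, b1, ssr


# Hypothetical data: four observations, two parameters (over-determined).
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 8.0]
b0, b1, ssr = fit_line(xs, ys)
```

Any other choice of intercept and slope gives a larger SSR for the same data, which is what "best fit" means here.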

NON-LINEAR LEAST SQUARES

One major problem with the application of LSQ methods is that there is no closed-form LSQ solution for non-linear systems. Instead, numerical algorithms are used to find the values of the parameters ($\beta$) that minimize the objective function. Most algorithms involve choosing initial values for the parameters. The parameters are then refined iteratively, that is, the values are obtained by successive approximation as:

$\beta_j^{b+1} = \beta_j^{b} + \Delta\beta_j$   [1]

In Eq. [1], $b$ is the iteration number and $\Delta\beta_j$ is the vector of increments, known as the shift vector. Many solution algorithms for non-linear LSQ problems require that the Jacobian be calculated. Analytical expressions for the partial derivatives can be complicated or impossible to obtain, in which case the partial derivatives must be calculated by numerical approximation. Even when the errors are uncorrelated with the predictor variables, non-linear LSQ estimates are generally biased and require further refinement.
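The iteration in Eq. [1] can be sketched with a one-parameter Gauss-Newton update in Python. This is only an illustration under stated assumptions: the model $y = e^{\beta x}$, the noise-free data, the starting value and the name `gauss_newton` are all hypothetical, and Gauss-Newton is one common choice of algorithm rather than the one used in the paper. The derivative is approximated numerically, as the text describes.

```python
import math


def gauss_newton(xs, ys, beta, steps=20, h=1e-6):
    """Refine beta for the model y = exp(beta * x) by Gauss-Newton steps."""
    for _ in range(steps):
        # Residuals at the current parameter value.
        r = [y - math.exp(beta * x) for x, y in zip(xs, ys)]
        # Forward-difference approximation of d f / d beta (the Jacobian column).
        jac = [(math.exp((beta + h) * x) - math.exp(beta * x)) / h for x in xs]
        # Shift (increment) for this iteration: (J'J)^-1 J'r in one dimension.
        delta = sum(j * ri for j, ri in zip(jac, r)) / sum(j * j for j in jac)
        beta = beta + delta        # Eq. [1]: new value = old value + shift
    return beta


# Hypothetical noise-free data generated with beta = 0.7.
xs = [0.0, 0.5, 1.0, 1.5, 2.0]
true_beta = 0.7
ys = [math.exp(true_beta * x) for x in xs]
beta_hat = gauss_newton(xs, ys, beta=0.1)
```

With real, noisy data the iteration may need damping or a better starting value; the sketch omits both.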

MULTICOLLINEARITY

Another problem with the application of LSQ methods is collinearity, which is a linear relationship between two explanatory variables in a regression model. The independent variables $X_1$ and $X_2$ are collinear if $X_1 = \delta X_2$ for some value of $\delta$. Multicollinearity refers to a situation in which two or more explanatory variables in a multiple regression model are highly correlated or linearly dependent. Whenever there is an exact linear relationship among the independent variables, the rank of $X$ is less than $M$, and the matrix $X^T X$ is not invertible. Note that ordinary LSQ estimates involve inverting the matrix $X^T X$. In most applications, perfect multicollinearity (a correlation between two independent variables equal to 1 or -1) is unlikely. Near multicollinearity is more likely (e.g., when a stochastic error term is added to the multiple regression equation). In such systems there are no exact linear relationships among the variables, but the variables are nearly perfectly correlated. In this case, the matrix $X^T X$ is invertible but ill-conditioned.

IDENTIFYING MULTICOLLINEARITY

The presence of multicollinearity in a regression model can be identified from: i) large changes in the estimated regression coefficients when a predictor variable is added or deleted, ii) insignificant regression coefficients for the affected variables in the multiple regression, but a rejection of the hypothesis that those coefficients are insignificant as a group (using an F-test), iii) large changes in the estimated regression coefficients when an observation is added or deleted, iv) a tolerance value of less than 0.20 and/or a Variance Inflation Factor (VIF) of 5 or above (Henseler and Fassott, 2005). In addition, the usual interpretation of a regression coefficient is that it provides an estimate of the effect of a one-unit change in a predictor variable $X_i$, holding the other predictor variables constant. In the presence of multicollinearity, the estimate of one predictor variable's impact on $Y_i$ while controlling for the others tends to be less precise than when the predictors are uncorrelated with one another. The affected estimates are unstable and show high standard errors.
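The tolerance and VIF thresholds in item iv) can be illustrated for the simplest case of two predictors, where the tolerance of either predictor is $1 - r^2$ and the VIF is its reciprocal ($r$ being their Pearson correlation). The Python sketch below assumes a model with an intercept; the function names and the data are hypothetical.

```python
import math


def pearson_r(a, b):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(a)
    a_bar = sum(a) / n
    b_bar = sum(b) / n
    sab = sum((x - a_bar) * (y - b_bar) for x, y in zip(a, b))
    saa = sum((x - a_bar) ** 2 for x in a)
    sbb = sum((y - b_bar) ** 2 for y in b)
    return sab / math.sqrt(saa * sbb)


def vif_two_predictors(x1, x2):
    """Return (VIF, tolerance) shared by both predictors in a two-predictor model."""
    r = pearson_r(x1, x2)
    tolerance = 1.0 - r * r        # share of variance NOT explained by the other predictor
    return 1.0 / tolerance, tolerance


# Hypothetical, nearly collinear predictors.
x1 = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
x2 = [1.1, 1.9, 3.2, 3.9, 5.1, 6.0]
vif, tol = vif_two_predictors(x1, x2)
flagged = vif >= 5 or tol < 0.20   # the rule of thumb quoted in the text
```

With more than two predictors the $r^2$ term becomes the $R^2$ from regressing each predictor on all of the others.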

RIDGE REGRESSION

As discussed above, using the LSQ method, approximate solutions to over-determined systems can be found. For example, for the system $AX = b$ with the LSQ formulation $\min_x \|AX - b\|^2$, an approximate solution can be found as $\hat{X} = (A^T A)^{-1} A^T b$, where $\|\cdot\|$ is the Euclidean norm. In this approach, the matrix $A$ may be ill-conditioned or singular, yielding a large number of solutions. However, regularization terms can be used in the minimization in order to give preference to a particular solution with desirable properties:

$\min_x \left( \|AX - b\|^2 + \|\lambda X\|^2 \right)$,

where $\lambda$ is a suitably chosen regularization factor. This approach, commonly known as ridge regression in statistics, is the most widely used method for finding solutions to ill-posed problems. Regularization has been found to improve the conditioning in many systems, enabling numerical solutions. An explicit solution, denoted by $\hat{X}$, is given as:

$\hat{X} = (A^T A + \lambda^T \lambda)^{-1} A^T b$.

In many cases, the matrix $\lambda$ is chosen as the identity matrix ($\lambda = I$), which yields solutions with smaller norms. In other cases, high-pass operators (e.g., a difference operator or a weighted Fourier operator) are used to enforce smoothness. The effect of regularization may be varied by scaling the matrix $\lambda$ ($= \alpha I$). For $\lambda = 0$, this reduces to the un-regularized LSQ solution, provided that $(A^T A)^{-1}$ exists. The optimal regularization parameter $\alpha$ is usually unknown and is often determined by ad hoc methods such as: i) the discrepancy principle, ii) cross validation (leave-one-out), iii) the L-curve method, iv) the unbiased predictive risk estimator. For example, in the method of cross-validation the optimal parameter is obtained by minimizing $RSS/\tau^2$, where $RSS$ is the residual sum of squares and $\tau$ is the effective number of degrees of freedom (Wahba, 1990).
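The explicit solution with $\lambda = \alpha I$ (so that $\lambda^T \lambda = \alpha^2 I$) can be sketched in Python for a matrix with two columns, where the $2 \times 2$ inverse can be written out by hand. The function name `ridge_solve`, the data and the value of `alpha` are assumptions for illustration only; they do not reproduce the PROC REG computation.

```python
import math


def ridge_solve(A, b, alpha):
    """Solve (A'A + alpha^2 I) x = A'b for a matrix A with two columns."""
    s11 = sum(row[0] * row[0] for row in A) + alpha ** 2
    s12 = sum(row[0] * row[1] for row in A)
    s22 = sum(row[1] * row[1] for row in A) + alpha ** 2
    t1 = sum(row[0] * bi for row, bi in zip(A, b))
    t2 = sum(row[1] * bi for row, bi in zip(A, b))
    det = s11 * s22 - s12 * s12
    # Explicit inverse of the 2x2 matrix applied to A'b.
    return ((s22 * t1 - s12 * t2) / det, (s11 * t2 - s12 * t1) / det)


def norm(v):
    return math.sqrt(sum(c * c for c in v))


# Hypothetical data with two nearly collinear columns (ill-conditioned A'A).
A = [[1.0, 1.0], [2.0, 2.1], [3.0, 2.9], [4.0, 4.0]]
b = [2.0, 4.1, 5.9, 8.0]
x_ols = ridge_solve(A, b, alpha=0.0)    # un-regularized LSQ solution
x_ridge = ridge_solve(A, b, alpha=0.5)  # regularized solution
```

Comparing `norm(x_ridge)` with `norm(x_ols)` shows the shrinkage toward smaller norms that the text attributes to regularization.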

PARTIAL LEAST SQUARES REGRESSION

A major limitation of LSQ methods for over-determined systems is that the inverse of the matrix $X^T X$ may not exist. A solution to this problem can be generated by deleting some predictor variables from the analysis without changing the rank of $X$, which is best achieved through the partial least squares (PLS) regression technique, in which uncorrelated latent variables called factors, which are linear combinations of the original regressors, are created. In PLS regression, a regression is computed between the scores for the $X$ variables and the scores for the $Y$ variable. Successive orthogonal factors that maximize the covariance between each $X$ score and the corresponding $Y$ score are chosen. An LSQ regression is then performed on the subset of extracted latent variables. This leads to a biased but lower-variance estimate of the regression coefficients compared to Ordinary Least Squares (OLS) regression. With the PLS regression model, the multidimensional direction in the $X$ space that explains the maximum multidimensional variance direction in the $Y$ space is found. This method is especially suited to data with multicollinearity.
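Extraction of the first factor for a single response can be sketched in Python as below. This follows the usual single-response (PLS1) construction, in which the weight vector is proportional to $X^T y$ for centered data; it is a simplified illustration with hypothetical data and names, not the RLGW algorithm used by PROC PLS in the paper.

```python
import math


def center(values):
    mean = sum(values) / len(values)
    return [v - mean for v in values]


def pls1_first_factor(X, y):
    """First PLS factor for centered predictors X (list of rows) and centered y.

    Returns (w, t, q): unit weight vector, X scores, and the y loading.
    """
    p = len(X[0])
    # Direction in X space with maximum covariance with y.
    w = [sum(row[j] * yi for row, yi in zip(X, y)) for j in range(p)]
    length = math.sqrt(sum(v * v for v in w))
    w = [v / length for v in w]
    # Scores: projection of each observation onto the weight vector.
    t = [sum(row[j] * w[j] for j in range(p)) for row in X]
    # Least-squares regression of y on the scores.
    q = sum(ti * yi for ti, yi in zip(t, y)) / sum(ti * ti for ti in t)
    return w, t, q


# Hypothetical data: two nearly collinear predictors and one response.
x1 = center([1.0, 2.0, 3.0, 4.0, 5.0])
x2 = center([1.1, 1.9, 3.2, 3.9, 5.1])
y = center([1.2, 2.1, 2.9, 4.2, 5.0])
X = [[a, b] for a, b in zip(x1, x2)]

w, t, q = pls1_first_factor(X, y)
sst = sum(yi * yi for yi in y)                             # total sum of squares
sse = sum((yi - q * ti) ** 2 for yi, ti in zip(y, t))      # after one factor
```

Further factors would be extracted the same way after removing the part of `X` and `y` explained by `t`.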

METHODS

DISPLAYING OBSERVED DATA

Histograms of the original and the first-differenced TN time series, with a normal curve superimposed, were constructed using the HISTOGRAM statement in PROC UNIVARIATE. Box plots of the distribution of observed TN by site and by year, after counting the missing values, were constructed using the PLOT statement in PROC BOXPLOT (SAS® Code 1). Scatter plots were constructed to display the observed relationships between TN and each independent variable using PROC GPLOT.

DATA INTEGRITY TESTING

SAS® CODE 1

/* Count missing TN values per site */
PROC MEANS DATA=Monthly NOPRINT;
   VAR TN;
   BY site;
   OUTPUT OUT=Cancel NMISS=ncancel;
RUN;

/* Attach the missing-value counts to the monthly data */
DATA Comp;
   MERGE Monthly Cancel;
   BY site;
RUN;

SYMBOL1 V=plus C=black;
SYMBOL2 V=square C=red;
SYMBOL3 V=triangle C=yellow;
TITLE 'Distribution of Original TN among Sites';

PROC BOXPLOT DATA=Comp;
   PLOT TN*site = ncancel / BOXSTYLE=schematicid CBOXES=blue CBOXFILL=red
        CFRAME=vligb NOHLABEL SYMBOLLEGEND=legend1 NOTCHES;
   LEGEND1 LABEL=('Missing Values:') CBORDER=black CFRAME=ligr;
   LABEL TN='TN (mg/L)';
RUN;

AUTOCORRELATION

When fitting regression models to time series data (data collected at discrete time intervals), the assumption of residual independence requires that the time-ordered error terms display no autocorrelation, that is, that the errors corresponding to observations in different time periods are independent. When a first-order autocorrelated error structure occurs, each error is correlated with the error immediately before it. Autocorrelation is a symptom of systematic lack of fit in the data. Such serially correlated errors indicate linear dependence between observations (Box and Jenkins, 1976). Time series data must be corrected for autocorrelated errors before any regression model is fitted (Parks, 1967). Autocorrelation function (ACF) plots can reveal the presence of autocorrelated errors in a series.
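The quantity plotted in an ACF plot can be sketched in Python as the sample autocorrelation at a given lag. The function name `autocorr` and both example series are hypothetical; the estimator shown (lagged cross-products divided by the total sum of squares) is the common textbook form and is assumed here rather than taken from the paper.

```python
def autocorr(series, lag):
    """Sample autocorrelation of a series at the given non-negative lag."""
    n = len(series)
    mean = sum(series) / n
    denom = sum((v - mean) ** 2 for v in series)
    num = sum((series[t] - mean) * (series[t - lag] - mean)
              for t in range(lag, n))
    return num / denom


# A steadily rising series: neighbouring values are strongly, positively related.
trend = [float(v) for v in range(1, 11)]
# An alternating series: neighbouring values are negatively related.
alternating = [1.0, -1.0] * 5

r1_trend = autocorr(trend, 1)
r1_alt = autocorr(alternating, 1)
```

Values near zero at all lags greater than zero are what the independence assumption leads one to expect for regression residuals.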

OUTLIERS AND INFLUENTIAL OBSERVATIONS

On the balance of probabilities, an observation more than 2.0 standard deviations (SD) from the mean needs to be highlighted for follow-up investigation to identify causes such as intervention and recording errors (Salas and Obeysekera, 1988). Only the most extreme observations (4.0 or more SD from the mean) need be excluded from the analysis. In this study all time series variables were tested for the presence of outliers and influential observations. No observations more than 4.0 SD from the mean were found in any of the variables, so all observations were included in the analysis. The observations above 2.0 SD from the mean of TN (2.0 mg/L) were recorded as influential, and the intervention in the response (TN) due to these influential observations was included in the analysis as a dummy variable (X1).
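The 2.0 SD and 4.0 SD screening rules can be sketched in Python as follows. The function name `flag_by_sd`, the labels and the TN-like values are hypothetical, and the use of the sample standard deviation is an assumption, since the text does not say which form was used.

```python
import statistics


def flag_by_sd(values, flag_sd=2.0, exclude_sd=4.0):
    """Label each value 'ok', 'influential' (beyond flag_sd) or 'extreme' (beyond exclude_sd)."""
    mean = statistics.mean(values)
    sd = statistics.stdev(values)          # sample SD (assumption)
    labels = []
    for v in values:
        z = abs(v - mean) / sd
        if z >= exclude_sd:
            labels.append("extreme")       # candidate for exclusion
        elif z > flag_sd:
            labels.append("influential")   # keep, but mark for follow-up
        else:
            labels.append("ok")
    return labels


# Hypothetical TN-like values in mg/L with one unusually high reading.
tn = [0.50, 0.60, 0.55, 0.70, 0.65, 0.60, 0.50, 0.58, 0.62, 3.00]
labels = flag_by_sd(tn)
# Dummy variable in the spirit of X1: 1 for influential observations, else 0.
x1 = [1 if lab == "influential" else 0 for lab in labels]
```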

DATA NON-STATIONARITY

All time series regression models require that the modeled time series be stationary. Stationarity reveals whether the variations in a time series are likely to be permanent or temporary. A time series {Xt, t (time) = 0, ±1, ±2, …} is said to be stationary if it has statistical properties similar to those of the time-shifted series {Xt+h, t = 0, ±1, ±2, …} for each integer h. Strict stationarity of a time series {Xt, t = 0, ±1, ±2, …} implies that the series {X1, …, Xn} and the time-shifted series {X1+h, …, Xn+h} have the same joint distributions for all integers h and n > 0. Usually second-order stationarity is adequate for modeling water quality time series (Fuller, 1978). For a second-order stationary process the mean is constant and the auto-covariance function depends only on the time lag, which is consistent with a normal process. In this study, the formal Dickey-Fuller unit root test (Fuller, 1978) was used to test the data for non-stationarity (SAS® Code 2). The original TN time series was found to be non-stationary at the 5% level of significance. First differencing of the time series was found adequate to correct the data for non-stationarity. The data became consistent with a normal process after first differencing (constant mean and auto-covariance dependent only on time lag).
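First differencing replaces each value with its change from the previous value. A minimal Python sketch (the function name and the series are hypothetical) shows how it removes a linear trend, which is the effect relied on above:

```python
def first_difference(series):
    """Return the series of changes x[t] - x[t-1]; the result is one value shorter."""
    return [curr - prev for prev, curr in zip(series, series[1:])]


# Hypothetical series with a steady upward trend (non-constant mean).
trending = [1.0, 3.0, 5.0, 7.0, 9.0]
differenced = first_difference(trending)   # constant: the trend is gone
```

The SAS DIF function used in SAS® Code 2 keeps the original length by returning a missing value for the first observation, whereas this sketch simply drops it.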

UNOBSERVED VARIATION

Unobserved variations in a time series, such as seasonal variation, trend, and other long- and/or short-term cyclical and/or non-cyclical variations due to man-made intervention, can influence regression analysis. In this study the unobserved variance components (seasonality, cycles, trend) in the overall original TN time series for the period from January 1997 through December 2004 were decomposed, and their significance was tested at the 5% level of significance using standard Chi-square tests (SAS® Code 3). Since the variations in TN due to all three unobserved components were significant (p<0.05), seasonal variations in TN (Summer, Winter) and possible cyclical and/or non-cyclical intervention due to influential observations (X1) were computed as shown below and included in the analysis as explanatory variables. Each variable was recorded as 1 if the observation satisfied the following definition and as 0 otherwise.

X1 = ‘intervention’
Summer = ‘summer months’
Winter = ‘winter months’
X1 = TN > 2.0;
Summer = ( 5 < mm < 11 ) * ( year > 1990 );
Winter = ( year > 1990 ) - Summer;

SAS® Code 2

PROC ARIMA DATA=Monthly;
   IDENTIFY VAR=TN STATIONARITY=(ADF=(1,2,4,6,12));
RUN;
DATA Monthly;
   SET Monthly;
   TN = DIF(TN);
RUN;

SAS® Code 3

PROC UCM DATA=Monthly PRINTALL;
   ID Date INTERVAL=Month;
   MODEL TN;
   IRREGULAR PLOT=smooth;
   LEVEL VARIANCE=0 NOEST PLOT=smooth;
   SLOPE VARIANCE=0 NOEST PLOT=smooth;
   CYCLE RHO=1 NOEST=rho PLOT=smooth;
   SEASON LENGTH=12 PLOT=smooth;
RUN;
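The indicator definitions for X1, Summer and Winter can be mirrored in Python to show which observations receive a 1. The function name `indicators` and the example inputs are hypothetical; `month` and `year` are assumed to be integers corresponding to `mm` and `year` in the definitions.

```python
def indicators(month, year, tn):
    """Return (summer, winter, x1) as 0/1 values following the stated definitions."""
    summer = int(5 < month < 11) * int(year > 1990)   # June through October
    winter = int(year > 1990) - summer                # all remaining months
    x1 = int(tn > 2.0)                                # influential TN reading
    return summer, winter, x1


july_high = indicators(month=7, year=1999, tn=2.5)
january_low = indicators(month=1, year=2000, tn=0.8)
```

Because every observation in the study period has `year > 1990`, Summer and Winter are complementary: exactly one of them equals 1 for each month.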

MISSING VALUES

Time series regression modeling requires data without missing values. A missing value for any observation in any of the decision variables can lead to missing values in the objective function. In this study, missing values in the dependent and independent variables were imputed through Markov chain Monte Carlo (MCMC) simulation with multiple chains (Schafer, 1997; Schafer, 1999) before fitting the regression models (SAS® Code 4). MCMC simulation samples from a probability distribution by constructing a Markov chain that has the desired distribution as its equilibrium distribution. After a sufficient number of steps, the state of the Markov chain is used as a sample from the desired distribution.
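The idea of a Markov chain whose equilibrium distribution is the target distribution can be illustrated with a random-walk Metropolis sampler in Python. This is a generic sketch with a standard normal target chosen for simplicity; it is not the imputation scheme that PROC MI applies to the water quality data, and the function name, step size and seed are all assumptions.

```python
import math
import random


def metropolis_normal(n_samples, step=1.0, seed=0):
    """Random-walk Metropolis chain whose equilibrium distribution is N(0, 1)."""
    rng = random.Random(seed)
    x = 0.0
    samples = []
    for _ in range(n_samples):
        proposal = x + rng.uniform(-step, step)
        # Log of the ratio target(proposal) / target(x) for a standard normal.
        log_ratio = 0.5 * (x * x - proposal * proposal)
        # Accept with probability min(1, ratio); otherwise stay at x.
        if rng.random() < math.exp(min(0.0, log_ratio)):
            x = proposal
        samples.append(x)
    return samples


draws = metropolis_normal(20000)
draw_mean = sum(draws) / len(draws)
draw_var = sum((d - draw_mean) ** 2 for d in draws) / len(draws)
```

After many steps the draws behave like a (correlated) sample from the target, so their mean and variance approach 0 and 1.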

SIMPLE CORRELATIONS AND MULTICOLLINEARITY

Data were tested for multicollinearity among the independent variables using the VIF and TOL options of the MODEL statement in PROC REG (SAS® Code 6). A VIF value equal to or larger than 5 was taken to indicate the presence of multicollinearity in the variable. Simple Pearson correlation coefficients among the independent variables were also obtained using PROC CORR, along with their significance probabilities, and assessed at the 5% level of significance.

VARIABLE TRANSFORMATION AND MLR MODEL FITTING

The predictor variables were transformed to seek linear and/or non-linear functions of the predictors that explain as much variation of the response variable as possible. The exact linear and/or non-linear relationship between the dependent variable (TN) and each predictor variable was identified through trial-and-error fitting of a linear regression model for TN and the respective independent variable using the REG procedure in SAS®, while testing the significance of the parameter estimate, model diagnostic statistics (R-Square) and residual characteristics (SAS® Code 5). Tentative relationships between the predictor variables and the dependent variable were identified from scatter plots. Finally, a multiple linear regression (MLR) model was fitted for TN and the identified functions of the independent variables (SAS® Code 6). Model diagnostic statistics and residuals of the best-fitting MLR model were tested for model adequacy. An alpha level of 0.05 (5%) was used for all hypothesis testing.

SAS® Code 4

PROC MI DATA=Monthly SEED=21355417 NOINT NIMPUTE=6 MU0=50 10 180 OUT=outmi;
   MCMC CHAIN=multiple DISPLAYINIT INITIAL=em(ITPRINT);
   VAR TN STP Alkalinity DO DOC SF pH Temp;
RUN;

FITTING A RIDGE REGRESSION MODEL

The ridge coefficient (k) that resulted in the minimum predicted response (TN) of the fitted MLR model was obtained in PROC RSREG using the RIDGE statement with the MIN option (SAS® Code 7). Ridge values were varied from 0 to 1.2. A ridge regression model was fitted in PROC REG using the OUTVIF and RIDGE= options, varying the ridge values from 0 to 1.0. Ridge VIF values were output through the OUTVIF option and saved to the output data set specified with the OUTEST= option in PROC REG (SAS® Code 8). Ridge VIF values were plotted against the ridge coefficient k using the RIDGEPLOT option of the PLOT statement (SAS® Code 9).

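SAS® Code 5
DATA Monthly;
  SET Monthly;
  LOGSF = LOG(abs(SF));  /* log of stream flow, named LOGSF to match the MODEL statements */
  SFSQ = SF*SF;          /* squared term (illustrative name); PROC REG needs it as a variable */
RUN;
PROC REG DATA=Monthly PLOTS(UNPACKEDPANELS);
  Linear: MODEL TN=SF / NOINT;
  Logarithmic: MODEL TN=LOGSF / NOINT;
  Polynomial: MODEL TN=SFSQ / NOINT;
RUN;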

SAS® Code 6
PROC REG DATA=Monthly;
  TITLE 'QQ Plot';
  MODEL TN=LOGSF DO STP DOC X1 Alkalinity ph Temp1 Summer / MSE NOINT VIF TOL;
  PLOT nqq.*r. / MSE CFRAME=green;
  PLOT rstudent.*obs.;
RUN;

SAS® Code 7
PROC RSREG DATA=Monthly;
  MODEL TN=LOGSF DO pH STP Alkalinity DOC Temp1 Summer X1 / LACKFIT PREDICT;
  RIDGE MIN RADIUS=0 to 1 by 0.1;
RUN;

SAS® Code 8
PROC REG DATA=Monthly OUTVIF OUTEST=b RIDGE=0 to 1.2 by .1;
  MODEL TN=LOGSF DO1 STP DOC Alkalinity ph Temp1 Summer X1 / NOINT NOPRINT VIF TOL;
RUN;

SAS® Code 9
TITLE 'VIF Plot for Nitrogen Data';
PROC REG DATA=b;
  PLOT (LOGSF DO1 ph STP Alkalinity DOC Temp1)*_RIDGE_ / NOMODEL NOSTAT;
  PLOT / RIDGEPLOT NOMODEL NOSTAT;
RUN;

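SAS® Code 10a
DATA Monthly;
  SET Monthly;
  IF _N_ <= 500;  /* keep the first 500 observations (model data set) */
  n = _N_;        /* observation number */
RUN;

SAS® Code 10b
DATA Monthly1;
  SET Monthly;
  IF (n > 500) THEN TN = .;  /* set TN to missing so validation rows are predicted, not fitted */
RUN;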
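The hold-out logic behind SAS® Codes 10a and 10b, together with the two summary measures later used to compare the models (sum of squared errors and mean predicted TN over the validation period), can be sketched as follows. This is an illustrative sketch only: the function names and the short demonstration series are hypothetical and are not the Truckee River data.

```python
def split_model_validation(rows, n_model=500):
    """Return (model rows, validation rows): the first n_model rows and the rest."""
    return rows[:n_model], rows[n_model:]


def sse(observed, predicted):
    """Sum of squared differences between observed and predicted values."""
    return sum((o - p) ** 2 for o, p in zip(observed, predicted))


def mean(values):
    """Arithmetic mean of a non-empty sequence."""
    return sum(values) / len(values)


# 576 observation numbers, as in the study design (1..500 model, 501..576 validation).
obs_numbers = list(range(1, 577))
model_rows, validation_rows = split_model_validation(obs_numbers)
print(len(model_rows), len(validation_rows))

# Hypothetical observed and predicted values for three validation rows.
observed_demo = [1.0, 1.4, 1.2]
predicted_a = [1.1, 1.3, 1.3]   # stand-in for one model's predictions
predicted_b = [1.5, 1.9, 1.6]   # stand-in for another model's predictions
for name, pred in (("model A", predicted_a), ("model B", predicted_b)):
    print(name, "SSE =", round(sse(observed_demo, pred), 3),
          "mean predicted =", round(mean(pred), 3))
```

The model with the smaller validation SSE and the mean prediction closer to the observed mean would be preferred under the comparison used in this paper.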

The ridge parameter estimates for all independent variables were also plotted against ridge k (SAS® Code 9).

The ridge k value that resulted in both i) minimum predicted TN, and ii) minimum VIF for all independent variables was chosen as the optimum k value for the ridge regression model. A ridge regression model with the optimum k was fitted. Model parameter estimates at the selected k value were output along with probabilities, standard errors, and upper and lower 95% confidence intervals.

RIDGE REGRESSION MODEL VALIDATION

Data were divided into model (first 500 observations) and validation datasets (obs. 501 through 576). The linear ridge regression model fitted with the model data set (SAS® Code 10a) was used to predict TN (SAS® Code 10b) for the validation data. Predicted TN was plotted against observed TN.

FITTING THE PLS REGRESSION MODEL

A predictive linear PLS regression model was fitted to the data using PROC PLS with the PLS linear predictive method (METHOD=PLS) in SAS® (SAS® Code 11). Note that the SAS® procedure PLS fits only predictive PLS models, with one "block" of predictors and one "block" of responses. The iterative RLGW factor extraction algorithm (ALGORITHM=RLGW) was used, which is the most efficient when there is a large number of predictors in the model.

Any missing values that were generated in the data due to differencing were handled with the MISSING=AVG option in PROC PLS. The cross validation method using the split samples option (CV=SPLIT) with n=10 was used to select the number of factors. In the cross validation factor extraction method, the data are divided into groups, the model is fitted to all groups leaving out one, and the capability of the fitted model to predict responses is checked using the left-out group, repeating for all groups. The smallest number of factors with probability >0.05 (CVTEST option with PVAL=0.05, SAS® Code 11) was extracted. Predicted values, scores of response and predictor variables, residuals, influence and model diagnostic statistics were output into an output data set specified with the OUTPUT OUT= statement. By specifying the SOLUTION option with the MODEL statement, regression parameter estimates for all predictors with standard errors and probabilities were output. The PLS regression procedure used in this study, with the final form of the model, is described briefly below.
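The split-sample grouping can be illustrated with a short sketch. It assumes that CV=SPLIT with n=10 holds out every 10th observation in turn (observations 1, 11, 21, ... form the first group, and so on), which is how the option is usually described; the function names and the stand-in mean-only "model" are hypothetical and serve only to show how the prediction error is accumulated over the left-out groups.

```python
def split_cv_groups(n_obs, n=10):
    """Group g holds the 0-based observation indices g, g + n, g + 2n, ..."""
    return [list(range(start, n_obs, n)) for start in range(n)]


def cross_validated_press(y, fit_predict, n=10):
    """Accumulate squared prediction errors over all left-out groups.

    fit_predict(y, train_idx, test_idx) must return one prediction per test index,
    using only the training observations.
    """
    press = 0.0
    for held_out in split_cv_groups(len(y), n):
        if not held_out:
            continue
        held = set(held_out)
        train = [i for i in range(len(y)) if i not in held]
        predictions = fit_predict(y, train, held_out)
        press += sum((y[i] - p) ** 2 for i, p in zip(held_out, predictions))
    return press


def mean_model(y, train_idx, test_idx):
    """Stand-in model: predict the training mean for every left-out observation."""
    m = sum(y[i] for i in train_idx) / len(train_idx)
    return [m] * len(test_idx)


# Hypothetical response series of 25 values.
y_demo = [0.5 + 0.05 * (i % 7) for i in range(25)]
print(split_cv_groups(len(y_demo), 10)[0])
print(round(cross_validated_press(y_demo, mean_model, 10), 4))
```

In the study, the role of `mean_model` is played by a PLS model with a given number of factors, and the resulting prediction errors are compared across candidate numbers of factors.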

PLS REGRESSION PROCEDURE

After an initial random estimate of the latent vector $U$, the following steps are repeated until convergence is reached.

1) $W = X^{T}U$;
2) $T = XW$, $T \leftarrow T/\|T\|$;
3) $C = Y^{T}T$;
4) $U = YC$, $U \leftarrow U/\|U\|$;
5) Repeat steps 1 to 4 until convergence.

When one factor (one component of $U$) has been extracted, the matrices $X$ and $Y$ are deflated (defined by $T$ as: $X \leftarrow X - TT^{T}X$, $Y \leftarrow Y - TT^{T}Y$), and the whole procedure is repeated with the deflated $X$ and $Y$ matrices to extract new factors until the rank of $X$ is reached. After extraction of the $p$ components, the ($n \times p$) matrices $T$ and $U$, the ($N \times p$) matrix $W$, and the ($L \times p$, where $L$ is the number of response variables in the model [$L = 1$ in this study]) matrix $C$ are extracted, consisting of columns created respectively by the vectors $\{T_i\}_{i=1}^{p}$, $\{U_i\}_{i=1}^{p}$, $\{W_i\}_{i=1}^{p}$ and $\{C_i\}_{i=1}^{p}$ extracted during the individual iterations (Manne, 1987; Rannar et al., 1994). The final fitted PLS regression model for predicted response ($\hat{Y}$) is of the following form:

$\hat{Y} = X\beta + E$, where $\beta = W(P^{T}W)^{-1}C^{T}$ and $P = X^{T}T(T^{T}T)^{-1}$ [2]

In Eq. [2], $\beta$ is the ($N \times L$) matrix of the parameter estimates of predictor variables, $P$ is the ($N \times p$) matrix consisting of factor loading vectors ($\{P_i = X^{T}T_i/(T_i^{T}T_i)\}_{i=1}^{p}$), $W$ is the ($N \times p$) matrix consisting of weights of the predictor variables, and $E$ is the ($n \times L$) matrix of residuals. Due to the fact that $P_i^{T}W_j = 0$ for $i > j$, and $P_i^{T}W_j \neq 0$ for $i < j$ (Hoskuldsson, 1988), the matrix $P^{T}W$ is upper triangular and therefore invertible.

In the above PLS regression procedure, different scaling of the individual latent vectors $\{T_i\}_{i=1}^{p}$ and $\{U_i\}_{i=1}^{p}$ does not influence the estimate of the matrix $\beta$.
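The factor extraction and deflation steps described in this section can be sketched in a few lines for the single-response case (one response variable, as in this study). With one response, U is proportional to Y, so the inner iteration settles in a single pass and no random start is needed. This is a simplified illustration under that assumption and is not the RLGW algorithm used by PROC PLS; the function name and the small demonstration matrix are hypothetical.

```python
import math


def pls1_factors(X, y, n_factors):
    """Extract normalised PLS score vectors T for one response, deflating X and y.

    Returns (scores, coefficients, residual), where each score vector has unit
    length, each coefficient is C = Y'T, and residual is y after the last deflation.
    """
    X = [[float(v) for v in row] for row in X]
    y = [float(v) for v in y]
    n, m = len(X), len(X[0])
    scores, coefficients = [], []
    for _ in range(n_factors):
        u = y  # single response: U is proportional to Y
        w = [sum(X[i][j] * u[i] for i in range(n)) for j in range(m)]   # W = X'U
        t = [sum(X[i][j] * w[j] for j in range(m)) for i in range(n)]   # T = XW
        norm = math.sqrt(sum(v * v for v in t))
        if norm < 1e-12:
            break  # nothing left to extract
        t = [v / norm for v in t]                                       # T <- T/||T||
        c = sum(y[i] * t[i] for i in range(n))                          # C = Y'T
        tX = [sum(t[i] * X[i][j] for i in range(n)) for j in range(m)]  # T'X
        X = [[X[i][j] - t[i] * tX[j] for j in range(m)] for i in range(n)]  # X <- X - TT'X
        y = [y[i] - t[i] * c for i in range(n)]                         # Y <- Y - TT'Y
        scores.append(t)
        coefficients.append(c)
    return scores, coefficients, y


# Hypothetical centred data: two predictors, response built as 2*x1 - x2.
X_demo = [[-2.0, 2.0], [-1.0, -1.0], [0.0, -2.0], [1.0, 0.0], [2.0, 1.0]]
y_demo = [2 * a - b for a, b in X_demo]
scores, coefficients, residual = pls1_factors(X_demo, y_demo, 2)
print("residual sum of squares:", sum(r * r for r in residual))
```

Because the demonstration response lies exactly in the column space of the two predictors, two factors (the rank of X) reproduce it and the residual sum of squares is numerically zero; with real data the number of factors is instead chosen by cross validation.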

PLS REGRESSION MODEL VALIDATION

Data were divided into model (first 500 observations) and validation datasets (obs. 501 through 576). The linear PLS regression model fitted with the model data set was used to predict TN for the validation dataset. X-scores against Y-scores for all extracted factors, and residuals against predicted response and observation numbers, were plotted using the GPLOT procedure (SAS® Code 12) for the model and for the validation datasets and compared. The output data set generated by the OUTPUT OUT= statement of the PLS procedure was used as data for PROC GPLOT to create the above plots. Predicted values from the validation dataset were also plotted against observed TN.

RESULTS AND DISCUSSION

DATA INTEGRITY AND SIMPLE STATISTICS

The original TN time series in Truckee River, Nevada was not normal (Figure 1), and had significant autocorrelation at several time lags at the 5% significance level (p<0.0001) (Figure 2). After first differencing, the TN time series became consistent with a normal distribution (Figure 3) (constant mean, and autocovariance function dependent only on time lags) and hence became linear. The first differenced TN series was also devoid of any autocorrelations (Figure 4).

Simple statistics of observed independent variables (mean, SD, minimum, and maximum) taken over all sites are shown in Table 1. Variable SF showed a larger standard deviation than mean. Mean SF was high (643.8 [SD=900.9] cfs). However, SF in the river was less than the mean 72% of the time during the study period. Mean STP was high (0.1 [SD=0.1] mg/L). Mean DO was 10 mg/L (SD=1.8 mg/L), which is much above the compliance level for DO (5 mg/L) in the river. DO in Truckee River exceeded the compliance level 99% of the time and exceeded mean DO 54% of the time during the study period. There were significant positive Pearson correlations between DOC and STP (0.729, p<0.0001), DOC and Alkalinity (0.593, p<0.0001), and STP and Alkalinity (0.749, p<0.0001). There was a significant negative Pearson correlation between DO and Temp (-0.788, p<0.0001).

PARAMETER ESTIMATES AND MULTICOLLINEARITY

Highly significant multicollinearity existed among the independent variables (Table 2). The best fitted MLR was highly significant at the 5% level (p<0.0001) and showed an adjusted R-square value equal to 0.885 (Figure 5). Residuals from the fitted MLR model were normal (Figure 5), and without any patterns (Figure 6). VIF of pH (116.24) and DO (78.07) were very high although these two variables showed a significant parameter estimate in the model (p<0.05) (Table 2). VIF for the variables STP, DOC, Temp, and Alkalinity were above 5 although these variables showed highly significant parameter estimates (p<0.005) in the model. The parameter estimate for Summer was not significant at the 5% level. The parameter estimate for Log(SF) was not significant but improved after transformation. VIF of Log(SF) was larger than 10. VIF of most independent variables still remained high even after transformation of SF and Temp.
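As a reading aid for the VIF values above, the conventional definitions of the variance inflation factor and the tolerance are given below, where $R_j^{2}$ is the coefficient of determination obtained when predictor $j$ is regressed on the remaining predictors. This is the textbook form and is stated here as an assumption; the exact variant computed by PROC REG for a model fitted with NOINT is not described in the text.

```latex
\begin{align*}
\mathrm{VIF}_j &= \frac{1}{1 - R_j^{2}}, &
\mathrm{TOL}_j &= \frac{1}{\mathrm{VIF}_j} = 1 - R_j^{2} \\
\text{pH:}\quad R_j^{2} &= 1 - \frac{1}{116.24} \approx 0.9914, &
\mathrm{TOL} &\approx 0.0086 \\
\text{DO:}\quad R_j^{2} &= 1 - \frac{1}{78.07} \approx 0.9872, &
\mathrm{TOL} &\approx 0.0128
\end{align*}
```

Under this definition, a VIF of 116.24 means that about 99% of the variation in pH is reproduced by the other predictors, which is why its parameter estimate is unstable in the MLR model.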

Figure 1: Histogram of observed TN

Figure 2: ACF plot of observed TN

Figure 3: Histogram of first differenced TN

TRANSFORMATION OF PREDICTOR VARIABLES

A negative logarithmic relationship was identified between SF and TN. A negative polynomial relationship existed between Temp and TN. Positive and linear relationships existed for TN with STP, Alkalinity, DOC, Summer, and X1.

SAS® Code 11
PROC PLS DATA=partial CV=split(10) CVTEST(pval=0.05) DETAILS
         METHOD=PLS(ALGORITHM=RLGW) MISSING=AVG OUTMODEL=est;
  MODEL TN=LOGSF DO2 STP DOC Alkalinity Summer pH X1 Temp1 / SOLUTION;
  OUTPUT OUT=outpls p=yhat1 yresidual=yres1 xresidual=xres1-xres15
         xscore=xscr yscore=yscr stdy=stdy stdx=stdx t2=t2 h=h press=press;
RUN;

SAS® Code 12
/* &ifac is a macro variable holding the factor number; it is not defined in this listing */
DATA pltdata;
  set outpls;
  length text $ 1;
  retain function 'label' position '5' hsys '3' xsys '2' ysys '2' color 'blue' style 'swissb';
  text=%str(n); x=yhat&ifac; y=yres&ifac;
axis1 label=(angle=270 rotate=90 "Predicted") major=(number=5) minor=none;
axis2 label=("Residual") minor=none;
symbol1 v=dot i=none c=white;
PROC GPLOT DATA=outpls;
  PLOT yhat&ifac*yres&ifac=1 / anno=pltdata vaxis=axis1 haxis=axis2 frame cframe=blue;
RUN;

DO was negatively and linearly related to TN (Table 2).

LOCATION AND VARIANCE INFORMATION OF TN

The mean, minimum, maximum and median of the observed TN series were 0.796 mg/L (SD=0.268), 0.429 mg/L, 1.9 mg/L and 0.719 mg/L, respectively. The mean TN in Truckee River during the study period was below the compliance level (1.2 mg/L). TN exceeded the compliance level in the River 25% of the time during the study period. Mean TN at sites SC (1.42 [SD=0.39] mg/L) and NTD (1.332 [SD=0.702] mg/L) was above the compliance level. At SC and NTD, TN levels exceeded the compliance level 71.1% and 52.6% of the time respectively during the study period. Mean TN was the largest at SC followed by NTD, and was the smallest at MC (0.284 [SD=0.225] mg/L) (Figure 7). Mean TN was smaller than 0.796 mg/L at all other sites. All sites except NTD showed very small variable response to TN with 50 percent of the observations falling within 0.2 mg/L (Figure 7). Site NTD showed the largest variable response to TN with 50 percent of the observations within 1.2 mg/L (Figure 7). Annual variations in mean TN showed a slightly increasing overall trend (Figure 8).

Figure 4: ACF, first differenced TN

Figure 5: Q-Q plot of best MLR model

Table 1: Simple statistics of predictors

OBSERVED RELATIONSHIPS OF TN TO INDEPENDENT VARIABLES

The relationships of Alkalinity, DOC, and STP to observed TN were linear and positive (Figures 9a, 9c, and 9g respectively). The relationships of DO and SF to observed TN were negative and appeared non-linear (Figures 9b and 9d respectively). A clear relationship could not be identified between TN and pH, and TN and Temp from the scatter plots.

Table 2: Parameter estimates and VIF of predictors in MLR

Figure 6: Residual plot of best MLR model

Figure 7: Box plot of observed TN by site

Figure 8: Box plot of observed TN by year

Figure 9a: Alkalinity versus observed TN


FITTED RIDGE REGRESSION MODEL

Estimated TN decreased with ridge coefficient (k) and reached zero at k between 0.8 and 0.9 (Figure 10). VIF for all independent variables also decreased with k (Figure 11). The ridge traces for STP and X1 decreased more strongly as k was increased, while those of the other predictors did not change much (Figure 12). A ridge k value equal to 0.8 was chosen as the optimum k, which gave the minimum predicted TN and very low VIF values for all independent variables. Ridge VIF for all predictor variables at the optimum k were smaller than 0.4. The fitted ridge regression model with k value equal to 0.8 is shown in Eq. [3]. The parameter estimate for variable pH became non-significant at the 5% level of significance after removing multicollinearity (at ridge k = 0.8). Predicted TN at the optimum ridge k was much smaller (equal to 0.18 mg/L) than the compliance level. Predicted DO at the optimum k value was larger than the compliance level (Table 4).

Figure 9b: DO vs. observed TN

Figure 9c: DOC vs. observed TN

Figure 9d: SF vs. observed TN

Figure 9e: Temperature versus observed TN

Figure 9f: pH versus observed TN

Figure 9g: STP versus observed TN

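$\mathrm{TN} = 0.3202 - 0.0226\,\mathrm{Log(SF)} - 0.0074\,\mathrm{DO} + 1.3164\,\mathrm{STP} + 0.0468\,\mathrm{DOC} + 0.0017\,\mathrm{Alkalinity} + 0.03\,\mathrm{pH} - 0.0001\,\mathrm{Temp}\times\mathrm{Temp} + 0.0226\,\mathrm{Summer} + 0.8676\,\mathrm{X1}$ [3]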

Figure 10: Estimated TN against ridge K (x-axis: Ridge K, 0.0 to 1.0; y-axis: Estimated TN (mg/L), 0.0 to 2.0)

Figure 11: VIF against ridge K

Figure 12: Ridge trace for predictors


FITTED PLS REGRESSION MODEL

Two factors (the smallest number of factors that gave p>0.05) were extracted (Table 5). Seventy-four percent of the response variation was explained by the two factors, while only 52.58% of the predictor variation was explained (Table 6). STP, DOC and Alkalinity showed almost equal model effect loadings in Factor 1. DO, Temp, seasonality (Summer), and man-made intervention (X1) showed almost equal model effect loadings in Factor 2 (Table 7).

Table 3: Parameter estimates and probabilities of the ridged model

Table 4: Predicted and critical ridge trace of response and independent variables

Variable pH showed a very small model effect loading, indicating that pH can be excluded from the model. The X-scores and the Y-scores were positively correlated for both factors (Table 8, note the regression coefficients). The correlation between X-score and Y-score was larger for the first factor (Figure 13) than for the second factor (Figure 14). Residual plots did not show any patterns (Figure 15 and Figure 16). Residuals were normal. The fitted linear PLS regression model for the model dataset is shown in Eq. [4].

Figure 13: Y-score1 against X-score1 (model data set)

Figure 14: Y-score2 against X-score2 (model data set)

Table 5: Split Sample Validation

Table 6: Percent Variation Explained

Table 7: Model Effects Loadings of independent variables for extracted Factors


VALIDATION OF PLS REGRESSION MODEL

The fitted model predicted the validation data set adequately. Plots of model residuals (Figure 18 and Figure 19) for the validation dataset did not show any pattern, and the residuals were normal. Y-score was positively correlated to X-score for both factors with almost the same regression coefficients (0.491, 0.183) as obtained for the model dataset (Figure 17).

Table 8: Model Effects Weights of Independent Variables among Factors

Figure 15: Y Residual against predicted TN (model dataset)

Figure 16: Residual against Observation # (model dataset)

Figure 17: Y-Score1 against X-Score1 (validation dataset)

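$\mathrm{TN} = 0.0965 - 0.0126\,\mathrm{Log(SF)} - 0.0018\,\mathrm{DO} + 2.2457\,\mathrm{STP} + 0.0618\,\mathrm{DOC} + 0.0035\,\mathrm{Alkalinity} + 0.0416\,\mathrm{pH} - 0.0002\,\mathrm{Temp}\times\mathrm{Temp} + 0.0777\,\mathrm{Summer} + 1.1583\,\mathrm{X1}$ [4]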

Figure 18: Residual against predicted TN (validation dataset)

Figure 19: Residual against observation # (validation dataset)

Figure 20: Predicted TN against observed TN (validation dataset) (x-axis: Observation #, 501 to 541; y-axis: Observed and Predicted TN (mg/L); series: PLS-Reg. TN, Ridge-Reg. TN, Observed TN)

COMPARING PLS AND RIDGE REGRESSION MODELS

Mean predicted TN for the forecast (validation) period from the ridge regression model (1.35 mg/L) was smaller and closer to the observed mean (1.23 mg/L), compared to the mean TN predicted by the PLS regression model (1.69 mg/L). The ridge regression model showed a much smaller sum of squared errors (19.89) than the PLS regression model (37.02) for the validation period. The ridge regression model predicted the observed TN better than the PLS regression model (Figure 20).

CONCLUSION

The suitability of correcting the water quality data from the Truckee River for multicollinearity through ridge and partial least squares regression modeling was compared in this study. The total nitrogen concentration time series was regressed against seven independent variables collected at six monitoring sites along Truckee River. The original water quality data from the Truckee River were non-stationary, non-normal and had missing values and large multicollinearity. Data were corrected for non-stationarity and missing values respectively through first differencing and missing-value imputation using Markov chain Monte Carlo simulation. Data also became linear after first differencing.

The partial least squares (PLS) regression model used two factors. The coefficient of regression for X-score against Y-score for factor 1 was 0.466 and that for factor 2 was 0.157. The PLS regression model predicted the observed TN fairly accurately. The ridge regression model fitted with a ridge coefficient equal to 0.8 gave better prediction of observed TN and a lower model error sum of squares than the PLS regression model. The ridge regression model showed an R2=0.69. Variance inflation factors were less than 0.4 for all predictors in the fitted ridge regression model.

REFERENCES

Box, G.E.P., and Jenkins, G.M. (1976) Time Series Analysis: Forecasting and Control, (2nd ed.): Holden-Day, San Francisco, CA.

Fuller, W. (1978) Introduction to Time Series, New York: John Wiley & Sons, Inc.

NDEP (1994) Truckee River final total maximum daily loads and waste load allocations. Nevada Division of Environmental Protection, Carson City, Nevada.

Parks, R.W. (1967) Efficient Estimation of a System of Regression Equations When Disturbances Are Both Serially and Contemporaneously Correlated, Journal of the American Statistical Association, 62, 500-509.

Peternel, K., and Laurel, S. (May 15-May 19, 2005) Truckee River Restoration Modeling, World Water and Environmental Resources Congress. Anchorage, Alaska, USA.

Ragavan, A. (2007) Time Series Cross Sectional Analysis of Total Nitrogen Concentration in Truckee River, Nevada using SAS®, Proceedings of the Statistics and Data Analysis Section, 15th WUSS Annual Conference, San Francisco, CA.

Schafer, J.L. (1999) Multiple Imputation: A Primer, Statistical Methods in Medical Research, 8, 3-15.

Schafer, J.L. (1997) Analysis of Incomplete Multivariate Data, New York: Chapman and Hall.

Truckee Meadows Water Reclamation Facility: www.tmwrf.com

USEPA (1991) Guidance for water quality-based decisions: The TMDL process. EPA 440/4-91-001. U.S. Environmental Protection Agency, Office of Water, Washington, DC.

NDEP (2007) Nevada’s nutrient assessment protocols for Wadeable streams: Nevada Division of Environmental Protection, Carson City, Nevada.

Salas, J.D., and Obeysekera, J.T.B. (1988) ARIMA models Identification of Hydrologic Time Series, Water Resources Research, 18, 4, 1011-1021.

USEPA (March 2007) N-Steps: http://n-steps.tetratechffx.com/NTSCHome.com

Wahba, G. (1990) Spline Models for Observational Data, SIAM.

Henseler, J., and Fassott, G. (September 7-9, 2005) Testing Moderating Effects in PLS Path Models: An Illustration of Available Procedures, 4th International Symposium on PLS and Related Methods, Barcelona.

Manne, R. (1987) Analysis of Two Partial-Least-Squares Algorithms for Multivariate Calibration, Chemometrics and Intelligent Laboratory Systems, 2, 187–197.

Rannar, F., Lindgren, P.G., and Wold, S. (1994) A PLS kernel algorithm for data sets with many variables and fewer objects. Part 1: Theory and algorithm. Chemometrics and Intelligent Laboratory Systems, 8, 111–125.

Hoskuldsson, A. (1988) PLS Regression Methods, Journal of Chemometrics, 2, 211–228.

CONTACT INFORMATION

Your comments and questions are valued and encouraged. Contact the author:
Name: Anpalaki J. Ragavan M.S.
Enterprise: Department of Mathematics and Statistics, University of Nevada, Reno
Address: 1215, Beech Road, # 26
City, State, Zip: Reno, NV 89512, USA
Work phone: (775)-322-3694
Fax: (775)-784-1040
Email: [email protected]
Web: None

SAS and all other SAS Institute Inc. product or service names are registered trademarks or trademarks of SAS Institute Inc. in the USA and other countries. ® indicates USA registration. Other brand and product names are trademarks of their respective companies.