Speed Dating with SAS Regression Procedures
David J Corliss, PhD
Wayne State University
Physics and Astronomy / Public Outreach
PROC REG
• Regression Type: Continuous, linear
• General regression procedure with a number of options but limited specialized capabilities, for which other procedures have been developed
• Supports many variable selection methods (e.g., Forward), polynomial regression, multiple MODEL statements and interactive use
• Parameter estimation by Ordinary Least Squares
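As a rough illustration of the least-squares fit behind a procedure such as PROC REG, the sketch below fits a single-predictor line and reports r². It is written in Python rather than SAS so that it is self-contained; the function name `fit_line` and the data values are made up for illustration and do not come from the homeless-students example.

```python
# Minimal sketch of ordinary least squares with one predictor.
# The data are invented; this is not SAS output.

def fit_line(x, y):
    """Return (intercept, slope, r_squared) for the least-squares line."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # r^2 = 1 - (residual sum of squares / total sum of squares)
    ss_res = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    return intercept, slope, 1.0 - ss_res / ss_tot

x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]
intercept, slope, r_squared = fit_line(x, y)
print(intercept, slope, r_squared)
```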
PROC REG Example: Homeless Students by State
Solid performance of the model across the range from low to high homelessness states indicates consistency of factors correlated with the number of homeless students
[Figure: Actual Percent vs. Model, Percent of Student Population; r² = 0.652]
Special Data Needs: Problems with Outliers
PROC ROBUSTREG
• Regression Type: Continuous, linear
• Robust regression is achieved by identifying outliers and limiting their influence by assigning weights and then performing standard regression
• Supports outlier detection using M, LTS, S and MM estimation; robust ANOVA
• Provides robust R² and deviance, robust modeling parameters and outlier diagnostics
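The idea of limiting outlier influence through weights can be sketched with iteratively reweighted least squares. The example below uses the Huber weight function with a median-absolute-residual scale, which is one common form of M estimation; it is an assumption chosen for brevity, not a reproduction of what PROC ROBUSTREG does internally. All names and data are hypothetical.

```python
from statistics import median

def weighted_line(x, y, w):
    """Weighted least-squares line; returns (intercept, slope)."""
    sw = sum(w)
    mx = sum(wi * xi for wi, xi in zip(w, x)) / sw
    my = sum(wi * yi for wi, yi in zip(w, y)) / sw
    sxx = sum(wi * (xi - mx) ** 2 for wi, xi in zip(w, x))
    sxy = sum(wi * (xi - mx) * (yi - my) for wi, xi, yi in zip(w, x, y))
    slope = sxy / sxx
    return my - slope * mx, slope

def huber_line(x, y, c=1.345, iterations=50):
    """Huber M-estimate of a line by iteratively reweighted least squares."""
    w = [1.0] * len(x)
    for _ in range(iterations):
        a, b = weighted_line(x, y, w)
        resid = [yi - (a + b * xi) for xi, yi in zip(x, y)]
        # Robust scale: median absolute residual, rescaled for normal errors.
        scale = median([abs(r) for r in resid]) / 0.6745
        if scale == 0.0:
            break
        # Full weight for small residuals, shrinking weight for large ones.
        w = [1.0 if abs(r) <= c * scale else c * scale / abs(r) for r in resid]
    return a, b, w

# Invented data: y = 1 + 2x with small alternating noise and one gross outlier.
x = [float(i) for i in range(10)]
y = [1.0 + 2.0 * xi + (0.1 if i % 2 == 0 else -0.1) for i, xi in enumerate(x)]
y[9] += 30.0

ols_intercept, ols_slope = weighted_line(x, y, [1.0] * len(x))
rob_intercept, rob_slope, weights = huber_line(x, y)
print(ols_slope, rob_slope, weights[9])
```

The outlier is kept in the fit, but with a weight well below 1, so the robust slope stays much closer to the underlying value of 2 than the unweighted slope does.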
PROC ROBUSTREG Example: Log-Log Regression With Weighted Outliers
SAS/STAT® 9.2 User’s Guide, support.sas.com
In ROBUSTREG, the outliers are not disregarded: weights are assigned and incorporated in the regression
Special Data Needs: Ill-Conditioned Data
PROC ORTHOREG
• Regression Type: Continuous, linear
• Regression using the Gentleman-Givens procedure instead of collecting crossproducts
• For ill-conditioned data, where small errors in the data may cause large errors in the results; more accurate than PROC REG or GLM
• “ORTHO” in ORTHOREG refers to using an orthogonal approach to Least Squares, not orthogonal regression
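To show what an orthogonal approach to least squares looks like without forming crossproducts, here is a small sketch that triangularizes the augmented matrix [X | y] with plain Givens rotations and then back-substitutes. It uses standard Givens rotations rather than the Gentleman variant named above, and the function name and data are assumptions for illustration only.

```python
import math

def givens_least_squares(X, y):
    """Solve min ||X b - y|| by rotating [X | y] to upper-triangular form."""
    n, p = len(X), len(X[0])
    A = [list(row) + [yi] for row, yi in zip(X, y)]  # augmented matrix
    for j in range(p):
        # Zero out column j below the diagonal, working from the bottom up.
        for i in range(n - 1, j, -1):
            a, b = A[i - 1][j], A[i][j]
            if b == 0.0:
                continue
            r = math.hypot(a, b)
            c, s = a / r, b / r
            for k in range(j, p + 1):
                upper, lower = A[i - 1][k], A[i][k]
                A[i - 1][k] = c * upper + s * lower
                A[i][k] = -s * upper + c * lower
    # Back substitution on the triangular system R b = (rotated y).
    beta = [0.0] * p
    for j in range(p - 1, -1, -1):
        acc = A[j][p] - sum(A[j][k] * beta[k] for k in range(j + 1, p))
        beta[j] = acc / A[j][j]
    return beta

# Invented data lying exactly on y = 1 + 2x (first column is the intercept).
X = [[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]]
y = [1.0, 3.0, 5.0, 7.0]
beta = givens_least_squares(X, y)
print(beta)
```

Working on X directly avoids squaring the condition number, which is what happens when the crossproducts matrix X'X is formed first.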
PROC ORTHOREG Example: Fitting a Higher-Order Polynomial
SAS/STAT® 9.2 User’s Guide, support.sas.com
An example of ORTHOREG fitting a 9th-degree polynomial, where near singularities must be distinguished from true ones
Special Data Needs: Transformation
PROC TRANSREG
• Regression Type: Continuous, linear
• Regression with a number of data transformations, including smooth, spline, Box-Cox and other non-linear forms
• Supports fitting splines with a user-specified degree and number of knots; works for piece-wise and discontinuous solutions
• Also supports variable transformation for canonical regression and response surface regression
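Of the transformations listed above, Box-Cox is the easiest to write down. The sketch below implements the standard one-parameter Box-Cox formula for a single positive value; the function name `box_cox` is hypothetical, and the sketch does not attempt the search over the power parameter that a regression procedure would perform.

```python
import math

def box_cox(y, lam):
    """One-parameter Box-Cox transform of a positive value y."""
    if y <= 0:
        raise ValueError("Box-Cox requires y > 0")
    if lam == 0:
        return math.log(y)          # limiting case as lam -> 0
    return (y ** lam - 1.0) / lam   # power transform otherwise

print(box_cox(4.0, 0.5))  # square-root-like transform
print(box_cox(1.0, 0))    # log transform
print(box_cox(3.0, 1.0))  # lam = 1 only shifts the data
```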
PROC TRANSREG Example: Spline Regression to a Complex Form
TRANSREG used to fit splines to a spectrographic line profile to determine the radial velocity of erupting gas from a star
Special Model Types: General Linear
PROC GLM
• Regression Type: Continuous, linear
• General purpose procedure for continuous least squares regression using classification predictor variables as well as continuous
• No Bayesian capability at this time; use GENMOD or MCMC for Bayesian functionality
• While capable of many types of models and analysis, another procedure is often better for a specific task
PROC GLM Example: Age Group as a Categorical Predictor Variable
GLM used with ODS Graphics to visualize statistical output
An Overview of ODS Statistical Graphics in SAS® 9.3, Robert N. Rodriguez, SAS Institute Inc., Cary, NC
[Figure: Distribution of Response by agegroup]
Special Model Types: Quantile Regression
PROC QUANTREG
• Regression Type: Continuous, linear
• Quantile regression: while other procedures model the mean, quantile regression models the median and other specified quantiles to provide a more complete picture of the response variable
• Uncertainties for individual quantiles can be estimated by bootstrapping
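• Model fitting by minimizing a weighted sum of absolute residuals (simplex, interior point or smoothing algorithms), not by Least Squares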
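The difference between modeling the mean and modeling a quantile comes down to the loss function. The sketch below uses the standard check (pinball) loss for quantile level tau and finds, by brute force over the data values, the constant that minimizes it. The helper names and data are invented, and no regression on predictors is attempted.

```python
def check_loss(residual, tau):
    """Check (pinball) loss: tau * r for r >= 0, (tau - 1) * r for r < 0."""
    return residual * (tau - (1.0 if residual < 0 else 0.0))

def total_loss(data, q, tau):
    """Total check loss when q is used as the fitted quantile."""
    return sum(check_loss(v - q, tau) for v in data)

# Invented data with one extreme value.
data = [1.0, 2.0, 3.0, 4.0, 100.0]
mean_value = sum(data) / len(data)

# Search the observed values for the minimizer at each quantile level.
best_median = min(data, key=lambda q: total_loss(data, q, 0.5))
best_q90 = min(data, key=lambda q: total_loss(data, q, 0.9))
print(mean_value, best_median, best_q90)
```

The mean is pulled toward the extreme value while the median is not, and changing tau moves the fitted value to a different part of the distribution.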
PROC QUANTREG Example: 5/10/25/50/75/90/95% Quantiles
An example of QUANTREG demonstrating greater detail than possible with ordinary least squares regression
Quantile regression with PROC QUANTREG, Peter L. Flom, Peter Flom Consulting, New York, NY
Predicted birth weight by maternal weight gain
Special Model Types: PLS, PCA Regression
PROC PLS
• Regression Type: Continuous, linear
• Special procedure for Partial Least Squares and Principal Component regression: predictor and response variables are projected into a new coordinate system, possibly with reduced complexity
• Supports reduced rank regression with cross validation of the number of components; no Bayesian capability at this time
• After projection, model fitting is by Least Squares
PROC PLS Example: Variable Importance Plot
An example of PROC PLS, including latent variables derived from the original, observed variables
Special Model Types: Survey Data
PROC SURVEYREG
• Regression Type: Continuous, linear
• Special capabilities for analysis in the presence of common survey data features, including stratification, clustering and weighting
• Supports estimation of sampling error using either Taylor series or primary sample units
• Model fitting by Generalized Least Squares
PROC SURVEYREG Example: Regression with Stratified Sampling
Example output from SURVEYREG, with summary statistics and model parameters
PROC SURVEYREG, support.sas.com, example 98.4
Stratum Information
Stratum Index  State     Region  N Obs  Population Total  Sampling Rate
1              Iowa      1       3      100               3.00%
2                        2       5      50                10.0%
3                        3       3      15                20.0%
4              Nebraska  1       6      30                20.0%
5                        2       2      40                5.00%

Tests of Model Effects
Effect     Num DF  F Value  Pr > F
Model      1       21.74    0.0004
Intercept  1       4.93     0.0433
FarmArea   1       21.74    0.0004
Note: The denominator degrees of freedom for the F tests is 14.

Estimated Regression Coefficients
Parameter  Estimate    Standard Error  t Value  Pr > |t|
Intercept  11.8162978  5.31981027      2.22     0.0433
FarmArea   0.2126576   0.04560949      4.66     0.0004

Covariance of Estimated Regression Coefficients
           Intercept     FarmArea
Intercept  28.300381277  -0.146471538
FarmArea   -0.146471538  0.0020802259
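The numbers in the SURVEYREG output are tied together by simple arithmetic, which the sketch below re-derives from the values printed above: the sampling rate is sample size over population total, each standard error is the square root of the matching covariance diagonal, t is estimate over standard error, and for a one-degree-of-freedom effect F is t squared. Only the variable names are assumptions; the figures are copied from the output.

```python
import math

# (state, region, n_obs, population_total) from the Stratum Information table.
strata = [("Iowa", 1, 3, 100), ("Iowa", 2, 5, 50), ("Iowa", 3, 3, 15),
          ("Nebraska", 1, 6, 30), ("Nebraska", 2, 2, 40)]
sampling_rates = [n_obs / total for _, _, n_obs, total in strata]

# Estimates and covariance diagonal from the coefficient tables.
est_intercept, var_intercept = 11.8162978, 28.300381277
est_farmarea, var_farmarea = 0.2126576, 0.0020802259

se_intercept = math.sqrt(var_intercept)   # standard error = sqrt(variance)
se_farmarea = math.sqrt(var_farmarea)
t_intercept = est_intercept / se_intercept
t_farmarea = est_farmarea / se_farmarea
f_farmarea = t_farmarea ** 2              # single-df effect: F = t^2

print([round(100 * r, 1) for r in sampling_rates])
print(round(t_intercept, 2), round(t_farmarea, 2), round(f_farmarea, 2))
```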
Special Model Types: Survey Data
PROC SURVEYPHREG
• Regression Type: Continuous, linear
• Performs Cox Proportional Hazards modeling on survey data with truncation, supporting stratification, clustering and weighting
• Estimates the variance of model parameters by Taylor series, BRR or Jackknife
• Model fitting by Maximum Likelihood
PROC SURVEYPHREG Example: Stratified Sampling with Truncated Data
Example output from SURVEYPHREG, with summary statistics and model parameters
PROC SURVEYPHREG, support.sas.com, example 97.2
Analysis of Maximum Likelihood Estimates
Parameter       DF   Estimate   Standard Error  t Value  Pr > |t|  Hazard Ratio
BodyWeight      586  0.011920   0.003155        3.78     0.0002    1.012
Smoke       -1  586  -1.174048  0.739450        -1.59    0.1129    0.309
Smoke       1   586  -1.006515  0.578810        -1.74    0.0826    0.365
Smoke       2   586  -0.674183  0.558412        -1.21    0.2278    0.510
Smoke       3   586  0          .               .        .         1.000

Type III Tests of Model Effects
Effect      Num DF  Den DF  F Value  Pr > F
BodyWeight  1       586     14.27    0.0002
Smoke       3       586     1.49     0.2160

Estimate
Label  Estimate  Standard Error  DF   t Value  Pr > |t|  Exponentiated
Row 1  -0.7532   0.3870          586  -1.95    0.0521    0.4709
Special Model Types: Contingency Table Regression
PROC CATMOD
• Regression Type: Categorical, linear
• A generalization of continuous methods to categorical data; performs linear regression and other analyses on data that can be expressed in a contingency table
• Supports both ordinary and logistic regression, log-linear and repeated measures
• Model fitting by WLS; ML is available for log-linear and generalized logit models
PROC CATMOD Example: Bartlett's Data, No 3-Variable Interaction
Example output from CATMOD, with summary statistics and model parameters
PROC CATMOD, support.sas.com, example 28.4
Data Summary
Response           Length*Time*Status  Response Levels  8
Weight Variable    wt                  Populations      1
Data Set           BARTLETT            Total Frequency  960
Frequency Missing  0                   Observations     8

Response Profiles
Response  Length  Time  Status
1         1       1     1
2         1       1     2
3         1       2     1
4         1       2     2
5         2       1     1
6         2       1     2
7         2       2     1
8         2       2     2

Maximum Likelihood Analysis of Variance
Source            DF  Chi-Square  Pr > ChiSq
Length            1   2.64        0.1041
Time              1   5.25        0.0220
Length*Time       1   5.25        0.0220
Status            1   48.94       <.0001
Length*Status     1   48.94       <.0001
Time*Status       1   95.01       <.0001
Likelihood Ratio  1   2.29        0.1299
Special Model Types: Response Surface
PROC RSREG
• Regression Type: Continuous, linear
• Creates a quadratic Response Surface, a general linear model where optimal solutions are identified as peaks or valleys, with ridge analysis to identify regions near optimal responses
• Relies on ODS Graphics to display the response surface, ridges and fit parameters
• Model fitting by Least Squares
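Once a quadratic response surface has been fitted, locating the peak or valley is a small linear-algebra problem: set the gradient to zero and inspect the curvature. The sketch below does this for two factors. The coefficients are hypothetical, chosen only to give a clean answer, and the function name is an assumption rather than anything RSREG exposes.

```python
def stationary_point(b1, b2, b11, b22, b12):
    """Stationary point of y = b0 + b1*x1 + b2*x2 + b11*x1^2 + b22*x2^2 + b12*x1*x2.

    Setting the gradient to zero gives the 2x2 system
        [2*b11  b12 ] [x1]   [-b1]
        [b12   2*b22] [x2] = [-b2]
    """
    det = 4.0 * b11 * b22 - b12 * b12   # determinant of the Hessian
    if det == 0.0:
        raise ValueError("degenerate surface: no unique stationary point")
    x1 = (-b1 * 2.0 * b22 + b2 * b12) / det
    x2 = (-b2 * 2.0 * b11 + b1 * b12) / det
    if det < 0.0:
        kind = "saddle"
    elif b11 > 0.0:
        kind = "minimum"
    else:
        kind = "maximum"
    return x1, x2, kind

# Hypothetical fitted surface: y = 10 - 4*x1 - 2*x2 + x1^2 + x2^2 + x1*x2
x1, x2, kind = stationary_point(b1=-4.0, b2=-2.0, b11=1.0, b22=1.0, b12=1.0)
print(x1, x2, kind)
```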
PROC RSREG Example: Response Surface with a Single Solution
An example of RSREG, with the optimal solution found at the minimum; multiple minima and maxima are possible
http://v8doc.sas.com/sashtml/stat/chap56/sect5.htm
Response Surface with a Simple Optimum
Special Model Types: Survival Analysis
PROC LIFEREG
• Regression Type: Continuous, linear
• Models time to failure data as a linear combination of predictors and a random disturbance term, which can be described by many different distributions
• Supports standard survival analysis data censored on the right, left, both or neither; has a Bayesian option
• Model fitting by Maximum Likelihood, only one model statement per procedure
PROC LIFEREG Example: Cumulative Hazard Model
This example of LIFEREG plots the log-logistic vs. the Kaplan-Meier Cumulative Hazard
D. Hosmer and S. Lemeshow, Applied Survival Analysis, Ch. 8
Special Model Types: Proportional Hazards
PROC PHREG
• Regression Type: Continuous, linear
• Cox Proportional Hazards modeling, where a unit increase in a predictor multiplies the risk by a factor determined by the model
• Supports proportional hazards models with data censored on the right, left, both or neither, variable selection by forwards, backwards, stepwise or best subset
• Maximum Likelihood with a Bayesian option
PROC PHREG Example: PH Model With Time-Dependent Predictors
Example output from PHREG, with summary statistics and model parameters
PROC PHREG, support.sas.com, example 64.6
Summary of the Number of Event and Censored Values
Total  Event  Censored  Percent Censored
99     71     28        28.28

Model Fit Statistics
Criterion  Without Covariates  With Covariates
-2 LOG L   561.680             551.874
AIC        561.680             557.874
SBC        561.680             564.662

Testing Global Null Hypothesis: BETA=0
Test              Chi-Square  DF  Pr > ChiSq
Likelihood Ratio  9.8059      3   0.0203
Score             9.0521      3   0.0286
Wald              9.0554      3   0.0286

Analysis of Maximum Likelihood Estimates
Parameter  DF  Parameter Estimate  Standard Error  Chi-Square  Pr > ChiSq  Hazard Ratio
XStatus    1   -3.19837            1.18746         7.2547      0.0071      0.041
XAge       1   0.05544             0.02263         6.0019      0.0143      1.057
XScore     1   0.44490             0.28001         2.5245      0.1121      1.560
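The Hazard Ratio column in the PHREG output is the exponential of the parameter estimate, which is what "a unit increase multiplies the risk by a factor" means in practice. The short check below recomputes the ratios from the estimates printed above; only the dictionary name is an assumption.

```python
import math

# Parameter estimates copied from the Analysis of Maximum Likelihood Estimates.
estimates = {"XStatus": -3.19837, "XAge": 0.05544, "XScore": 0.44490}

# Hazard ratio for a one-unit increase in each predictor: exp(beta).
hazard_ratios = {name: round(math.exp(beta), 3) for name, beta in estimates.items()}
print(hazard_ratios)
```

For example, each additional unit of XAge multiplies the hazard by about 1.057, an increase of roughly 5.7%.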
Special Model Types: Structural Equations
PROC CALIS
• Regression Type: Continuous, linear
• In Structural Equation Modeling, a linear combination of predictors describes a vector equal to a linear combination of outcome variables
• Supports latent variables, multiple and multivariate regression, path analysis and canonical correlation
• Maximum Likelihood with Least Squares and Bayesian options
PROC CALIS Example: Linear Relations among Factor Loadings
Example output from CALIS, with matrices of model parameters
PROC CALIS, support.sas.com, example 25.4
Estimated Parameter Matrix B[6:2], General Matrix
      Fact1         Fact2
var1  0.3422 [b11]  0.6318
var2  0.321  [b21]  0.6531
var3  0.4918 [b31]  0.4822
var4  0.5755 [b41]  0.3985
var5  0.7769 [b51]  0.1971
var6  0.6666 [b61]  0.3074

Estimated Parameter Matrix D[6:6], Diagonal Matrix
      var1         var2         var3         var4         var5         var6
var1  1.0077 [d1]  0            0            0            0            0
var2  0            0.9971 [d2]  0            0            0            0
var3  0            0            0.9908 [d3]  0            0            0
var4  0            0            0            0.9909 [d4]  0            0
var5  0            0            0            0            0.9964 [d5]  0
var6  0            0            0            0            0            1.0169 [d6]
Discrete Outcomes: Logistic Regression
PROC LOGISTIC
• Regression Type: binary & ordinal outcomes, linear
• General procedure for logistic regression with a number of options; other procedures may offer more capabilities for specific types of discrete models
• Supports many model variable selection methods and diagnostic tests
• Model fitting by Maximum Likelihood
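Maximum likelihood fitting of a logistic model has no closed form, so it is done iteratively. The sketch below applies Newton-Raphson steps to a one-predictor binary logistic model. The data are invented and not separable, the function names are hypothetical, and the fixed iteration count stands in for the convergence checks a real procedure would use.

```python
import math

def fit_logistic(x, y, iterations=25):
    """Newton-Raphson ML fit of P(y=1) = 1 / (1 + exp(-(b0 + b1*x)))."""
    b0 = b1 = 0.0
    for _ in range(iterations):
        g0 = g1 = h00 = h01 = h11 = 0.0
        for xi, yi in zip(x, y):
            p = 1.0 / (1.0 + math.exp(-(b0 + b1 * xi)))
            g0 += yi - p              # score for the intercept
            g1 += (yi - p) * xi       # score for the slope
            w = p * (1.0 - p)         # information weight
            h00 += w
            h01 += w * xi
            h11 += w * xi * xi
        det = h00 * h11 - h01 * h01
        # Newton step: beta <- beta + inverse(information) * score
        b0 += (h11 * g0 - h01 * g1) / det
        b1 += (h00 * g1 - h01 * g0) / det
    return b0, b1

def score(x, y, b0, b1):
    """Gradient of the log-likelihood; close to zero at the ML estimate."""
    p = [1.0 / (1.0 + math.exp(-(b0 + b1 * xi))) for xi in x]
    return (sum(yi - pi for yi, pi in zip(y, p)),
            sum((yi - pi) * xi for xi, yi, pi in zip(x, y, p)))

x = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
y = [0, 0, 1, 0, 1, 1]
b0, b1 = fit_logistic(x, y)
print(b0, b1, score(x, y, b0, b1))
```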
Discrete Outcomes: Generalized Linear Models
PROC GENMOD
• Regression Type: discrete outcomes, linear
• Generalized linear models with discrete outcomes, appropriate where the data are not normally distributed or the variance is not the same for all observations
• Supports Poisson Regression and Repeated Measures
• Model fitting by Maximum Likelihood with a Bayesian option
Discrete Outcomes: PROBIT Models
PROC PROBIT
• Regression Type: discrete outcomes, linear
• Generalized linear models with discrete outcomes, appropriate for use with discrete event data
• Supports probit, logit, ordinal logistic, and extreme value / gompit
• Model fitting by Maximum Likelihood; no Bayesian option at this time
Discrete Outcomes: Survey Data
PROC SURVEYLOGISTIC
• Regression Type: binary & ordinal outcomes, linear
• Performs regression on survey data with categorical responses; special capabilities for analysis for common survey data features, including stratification, clustering and weighting
• Supports estimation of sampling error using either Taylor series or resampling
• Model fitting by Maximum Likelihood
Non-Linear Models: General
PROC NLIN
• Regression Type: non-linear
• Performs non-linear regression with the dependent variable divided into a mean component and a (random) error component; process is iterative
• Supports steepest-descent, Newton, modified Gauss-Newton and Marquardt methods
• Model fitting by Least Squares or WLS
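The iterative character of non-linear least squares can be seen in a one-parameter Gauss-Newton loop. The sketch below fits y = exp(b*x) to noise-free, invented data; it is the plain Gauss-Newton update without the step-size safeguards of the modified method listed above, and the model, starting value and names are all assumptions.

```python
import math

def gauss_newton_exp(x, y, b_start, iterations=20):
    """Fit y = exp(b * x) by Gauss-Newton iterations on the single parameter b."""
    b = b_start
    for _ in range(iterations):
        # Residuals and the derivative of the model with respect to b.
        resid = [yi - math.exp(b * xi) for xi, yi in zip(x, y)]
        jac = [xi * math.exp(b * xi) for xi in x]
        # One-parameter normal equation: step = sum(J*r) / sum(J*J)
        step = sum(j * r for j, r in zip(jac, resid)) / sum(j * j for j in jac)
        b += step
    return b

x = [0.0, 1.0, 2.0, 3.0]
y = [math.exp(0.5 * xi) for xi in x]   # generated with b = 0.5
b_hat = gauss_newton_exp(x, y, b_start=0.3)
print(b_hat)
```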
PROC NLIN Example: Fitting a Model to a Complex Curve
In this example of NLIN, observations are normally distributed about a non-linear function – in this case, a Morlet wavelet
Non-Linear Models: Mixed
PROC NLMIXED
• Regression Type: non-linear
• Performs non-linear regression where both the mean and error components of the dependent variable are non-linear; process uses a Taylor series expansion about zero
• Supports normal, binomial and Poisson distributions and capability for programming a general distribution
• Likelihood-based model fitting
PROC NLMIXED Example: Plot of Profile of Trees Over Time
In this example of NLMIXED, variability in the shape of observed trees increases over time
Fitting Nonlinear Mixed Models with the New NLMIXED Procedure, Russell D. Wolfinger, SAS Institute Inc., Cary, NC
Linear Mixed: Fixed and Random Effects
PROC MIXED
• Regression Type: linear, fixed and random effects
• Performs linear regression using a linear combination of fixed effects added to a second linear combination of random effects
• Supports repeated measures in longitudinal studies; especially useful for dealing with missing data
• Likelihood-based model fitting
PROC MIXED Example: Repeated Measures
Example of MIXED, incorporating both fixed and random effects to improve the predictive power of the model
Repeated Measures Modeling With PROC MIXED, E. Barry Moser, Louisiana State University, Baton Rouge, LA
[Figure panels: Actual Growth of Sitka Spruce trees; Predicted Growth with Random Coefficient Asymptote]
Linear Mixed: General Mixed Models
PROC GLIMMIX
• Regression Type: linear mixed
• A generalization of GENMOD and MIXED to permit normally-distributed random effects and non-normal error terms, fitting models to correlated data or where the variability is not constant
• Supports a wide variety of distributions and link functions
• Likelihood-based model fitting
PROC GLIMMIX Example: Crossed Random Effects
GLIMMIX with crossed random effects analyzes in-breeding in an isolated population, allowing generalization to all populations
PROC GLIMMIX, support.sas.com, example 38.2
Non-Parametric: Local Regression
PROC LOESS
• Regression Type: linear, non-parametric
• Develops a model by fitting non-parametric regressions to local segments of the data and calculates confidence limits for the outcome; computationally intensive
• Supports multiple dependent variables, multidimensional predictors and interpolation using kd trees
• Model fitting by local least squares
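Local least squares means that each fitted value comes from its own small weighted regression on nearby points. The sketch below evaluates a local linear fit at a single point using tricube weights, a common choice for this kind of smoother. The span handling is simplified, there is no kd-tree interpolation, and the function names and data are hypothetical.

```python
import math

def tricube(u):
    """Tricube kernel: (1 - |u|^3)^3 inside the window, 0 outside."""
    u = abs(u)
    return (1.0 - u ** 3) ** 3 if u < 1.0 else 0.0

def loess_point(x, y, x0, span=0.5):
    """Local linear fit at x0 using the nearest span*n points."""
    k = max(3, math.ceil(span * len(x)))
    # Window half-width: distance to the k-th nearest neighbour
    # (assumed non-zero, i.e. the neighbours are not all located at x0).
    h = sorted(abs(xi - x0) for xi in x)[k - 1]
    w = [tricube((xi - x0) / h) for xi in x]
    # Weighted least-squares line through the points in the window.
    sw = sum(w)
    mx = sum(wi * xi for wi, xi in zip(w, x)) / sw
    my = sum(wi * yi for wi, yi in zip(w, y)) / sw
    sxx = sum(wi * (xi - mx) ** 2 for wi, xi in zip(w, x))
    sxy = sum(wi * (xi - mx) * (yi - my) for wi, xi, yi in zip(w, x, y))
    slope = sxy / sxx
    return my + slope * (x0 - mx)

# Invented data on an exact line, so the local fit should reproduce it.
x = [float(i) for i in range(10)]
y = [3.0 + 2.0 * xi for xi in x]
fit_mid = loess_point(x, y, 4.5)
fit_edge = loess_point(x, y, 0.0)
print(fit_mid, fit_edge)
```

Repeating this calculation at every evaluation point is what makes the method computationally intensive.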
PROC LOESS Example: Periodicities in Weather Data
In this example of LOESS, Local Regression is used to identify potential periodicities at 12 and 42 months
[Figure: Loess Fit with 99% Confidence Limits]
An Introduction to PROC LOESS for Local Regression, Robert D. Cohen, SAS Institute Inc., Cary, NC
Non-Parametric: Additive Models
PROC GAM
• Regression Type: linear, non-parametric
• Generalized Additive Models, with multiple independent non-parametric predictors; univariate smoothing provides finer details than is possible with the piece-wise LOESS procedure
• Supports non-parametric and semi-parametric models, multidimensional predictors
• Model fitting by iterative reweighted least squares
PROC GAM Example: Performance of a Catalyst
Example of PROC GAM used to fit a complex response surface without the loss of detail due to piece-wise fitting in PROC LOESS
PROC GAM, support.sas.com, example 36.3