SAS Notes


Shane Zhang

Table of Contents

• SAS Input/Output
• Functions in Data Step
• Simple Statistics Procedures
• Hypothesis testing – mean and proportion
• Multiple linear regression
• Generalized linear regression
• Cluster Analysis
• Association Analysis
• Logistic Regression

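SAS Input/Output

/* read from ls -l output */
data aa;
  /* DSD is dropped: ls -l pads its columns with repeated blanks, which DSD would read as missing fields */
  infile 'ls -l' pipe missover;
  /* leading fields of ls -l: permission string, link count, owner */
  input dir :$11. links owner $;
run;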

/* read multiple files with the same layout */
filename in ('200306.csv', '200309.csv');
data base;
  informat shp_dt mmddyy8.;
  format sshipym date9.;
  infile in dsd delimiter=',';

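/* read in only part of the file; useful for a large mainframe tape file */
data base(drop=i);
  infile '1.csv' end=eof dsd firstobs=3 delimiter=',' missover;
  retain i 0;
  do while (i < 20 and not eof);
    input aa $;
    output;
    i = i + 1;
  end;
  stop; /* end the step after the loop instead of starting another iteration */
run;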

/* output a csv file */
data _null_;
  set a;
  file 'xx.csv' dsd dlm=',';
  put x y z;
run;

/* define an ftp connection as a filename */
filename test ftp '.cshrc' cd='/export/home/sz325584'
  host='forecast.marketing.fedex.com' user="sz325584" pass="xxxx";

/* read one whole line into a variable */
data flist;
  infile "ls ./" pipe length=len;
  input @; /* load the record so that len is set */
  input fname $varying200. len;
run;

/* create graphs */
goptions reset=all;
symbol v=circle;
proc gplot data=cars;
  /* variables in the list are separated by blanks, not commas */
  plot price*(citympg hwympg cylinders enginesize);
run;
quit;

Functions in Data Step

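• mdy(month, day, year) /*build a SAS date value*/
• intnx('month', sshipym, 1) /*advance a date by one month*/
• put(variable, format) /*convert a value to text using a format*/
• name=scan(string, 2, "&=") /*split string on & and = and find the second item*/
• substr(yyyymm, 5, 2) /*extract 2 characters starting at position 5*/
• index(address, 'NY') /*find position of a pattern in a string*/
• call symput('numobs', numobs) /*put data step variable value into a SAS macro variable; the macro variable name comes first*/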
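The functions above can be tried together in one short data step. This is only a sketch: the dataset name demo, the literal strings and the variable names are made up for illustration.

```sas
/* Illustrative values only; nothing here comes from a real dataset. */
data demo;
  length name $20;
  sshipym    = mdy(6, 15, 2003);                /* SAS date value for 15JUN2003        */
  next_month = intnx('month', sshipym, 1);      /* first day of the following month    */
  ym_char    = put(sshipym, yymmn6.);           /* date written as text in yyyymm form */
  mm         = substr(ym_char, 5, 2);           /* characters 5-6 of yyyymm: the month */
  name       = scan('id=42&name=smith', 2, '&='); /* second item when splitting on & and = */
  pos        = index('BUFFALO NY', 'NY');       /* position where 'NY' starts          */
  format sshipym next_month date9.;
run;

/* Store the number of observations in a macro variable. */
data _null_;
  set demo nobs=n;
  call symput('numobs', strip(put(n, 8.)));
  stop;
run;

%put Number of observations: &numobs;
```

Running proc print on demo afterwards is an easy way to check what each function returned.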

Simple Statistics Procedures

/* create histogram */
proc univariate data=one noprint;
  var v1;
  histogram v1 / normal;
run;

/* create means, standard deviations */
PROC MEANS DATA=volume;
  VAR adv;
  OUTPUT OUT=volume_stat(KEEP=MEAN STD MAX MIN)
    MEAN(adv)=MEAN STD(adv)=STD MAX(adv)=MAX MIN(adv)=MIN;
RUN;

/* random selection: 15 observations from each stratum; the data must be sorted by segment */
proc surveyselect data=trees method=srs n=15 out=sample1;
  strata segment;
run;

Variable Types

• Categorical or nominal variables are ones such as favorite color, which have two or more categories and no way to order the values. Other examples of categorical variables include gender, blood type and favorite ice cream flavor.

• Ordinal variables can be ordered, but are similar to categorical variables in that there are clear categories. The relative distances or spacing between variable values is not uniform.

• Continuous/Interval variables are similar to ordinal variables, except that values are measured in a way where their differences are meaningful. The place number of runners in a race is considered an ordinal scale, but if we consider the actual times of runners rather than their place, this would be an interval scale.

Hypothesis testing

• A statistical test is a quantitative way to decide whether there is enough evidence to reasonably believe a conjecture to be true.

• Every test has a null hypothesis H0 and an alternative hypothesis Ha.
– H0 normally assumes no difference in means or, in regression analysis, no relationship between predictor and response variable, i.e. coefficient=0.
– To control type I error (rejecting a true H0), we often set the threshold at 5% and reject the null hypothesis only when p<0.05. In other words, we accept Ha (there is a difference or there is a relationship) only when the evidence is very strong.

One-tailed or two-tailed hypothesis testing

• To obtain correct results, it is important to determine whether the hypothesis test is one-tailed or two-tailed.
• When the null and alternative hypotheses are of the form H0: x1 = x2, with Ha: x1 > x2 or Ha: x1 < x2, we call that a one-tailed test.
• When the alternative hypothesis is of the form Ha: x1 ≠ x2, we call that a two-tailed test.

Hypothesis testing on means - Ttest

• We can use t-tests in the following three situations:
– We want to test whether the mean is significantly different from a hypothesized value.
– We want to test whether means for two independent groups are significantly different.
– We want to test whether means for dependent or paired groups are significantly different.

Ttest

• The two-sample Ttest is a special form of one way ANOVA where the category variable has only two values.
• Is the cereal box avg weight different from 15 ounces? (two sided) This can also be done with proc univariate.
  PROC TTEST DATA=datasetname H0=15;
    VAR weight;
  RUN;
• Is the cereal box avg weight above 15 ounces? (one sided)
  ods graphics on;
  proc ttest data=datasetname h0=15 plots(showh0) sides=u alpha=0.1;
    var weight;
  run;
• Test whether the means of two independent groups are the same (control group vs. target group, or two different brands of cereal box). The CLASS variable must have exactly two levels.
  PROC TTEST DATA=datasetname;
    CLASS brand;
    VAR weight;
  RUN;
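For reference, the one-sample tests on this slide rest on the usual t statistic. The numbers in the second line are made up for illustration and do not come from any cereal data.

```latex
\[
t = \frac{\bar{x} - \mu_0}{s / \sqrt{n}}, \qquad \text{df} = n - 1
\]
% Hypothetical values: \bar{x} = 15.1, \mu_0 = 15, s = 0.2, n = 25
\[
t = \frac{15.1 - 15}{0.2 / \sqrt{25}} = \frac{0.1}{0.04} = 2.5, \qquad \text{df} = 24
\]
```

With sides=u only the upper tail of the t distribution is used for the p-value; the two-sided test uses both tails.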

Paired Ttest

• Test two attributes that belong to the same object, e.g. pre-campaign and post-campaign sales for the same account, or reading and writing scores for the same student.

Test whether account sales differ after a marketing campaign. Note: pre and post sales are dependent groups.

  PROC TTEST DATA=datasetname;
    PAIRED pre_sale*post_sale;
  RUN;

Or test whether students' reading and writing scores are significantly different.

  PROC TTEST DATA=datasetname;
    PAIRED read_score*write_score;
  RUN;

ANOVA

• When comparing means from more than two groups, use one way ANOVA. Two way ANOVA means there are two CLASS variables (e.g. CLASS SEGMENT INDUSTRY).

• There are two common ways to run ANOVA in SAS. The seemingly obvious way is PROC ANOVA; the other is PROC GLM, which has the added advantage of allowing a few more SAS options.

ANOVA

• H0: All means are equal across brands.
• Ha: At least one brand has a different mean weight.
  PROC ANOVA DATA=cereal;
    CLASS brand;
    MODEL weight = brand;
    MEANS brand;
  RUN;
  QUIT;

Nonparametric ANOVA

• Used when we cannot assume a normal distribution, for example when the sample size is too small for the sampling distribution of the mean to be approximately normal.

  proc npar1way data=sasdata;
    class variable;
    var variables;
  run;

Hypothesis testing on proportion

• A chi-square goodness of fit test allows us to test whether the observed proportions for a categorical variable differ from hypothesized proportions. For example, let's suppose that we believe that the general population consists of 10% Hispanic, 10% Asian, 10% African American and 70% White folks.

  proc freq data=mydata;
    tables race / chisq testp=(10 10 10 70);
  run;

                              Test    Cumulative   Cumulative
  race   Frequency  Percent   Percent   Frequency      Percent
  ------------------------------------------------------------
  1             24    12.00     10.00          24        12.00
  2             11     5.50     10.00          35        17.50
  3             20    10.00     10.00          55        27.50
  4            145    72.50     70.00         200       100.00

  Chi-Square Test for Specified Proportions
  -----------------------------------------
  Chi-Square        5.0286
  DF                     3
  Pr > ChiSq        0.1697

  Sample Size = 200

These results show that the racial composition in our sample does not differ significantly from the hypothesized values that we supplied.
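The chi-square value in the output can be reproduced by hand from the frequency table. With a sample size of 200, the hypothesized proportions give expected counts of 20, 20, 20 and 140.

```latex
\[
\chi^2 = \sum_{i=1}^{4} \frac{(O_i - E_i)^2}{E_i}
       = \frac{(24-20)^2}{20} + \frac{(11-20)^2}{20} + \frac{(20-20)^2}{20} + \frac{(145-140)^2}{140}
\]
\[
\chi^2 = 0.8 + 4.05 + 0 + 0.1786 \approx 5.0286, \qquad \text{df} = 4 - 1 = 3
\]
```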

One-way MANOVA

• MANOVA (multivariate analysis of variance) is like ANOVA, except that there are two or more dependent variables. In a one-way MANOVA, there is one categorical independent variable and two or more dependent variables. E.g. examine the differences in read, write and math broken down by program type (prog).

  proc glm data="c:\mydata\hsb2";
    class prog;
    model read write math = prog;
    manova h=prog;
  run;
  quit;

Multiple Linear Regression

• A powerful tool to understand the relationship between predictor and response variables and to predict future values.
• Linear is in terms of the coefficients. The following are both multiple linear models:
  y=b0+b1*x1+b2*x2
  y=b0+b1*x1^2+b2*x1*x2
• Assumptions – linear relationship; the error term is normally distributed and independent
• For N independent variables, the number of all possible models is 2^N, which is very computing intensive. We can use stepwise selection to find a model quickly.
• Plot the chart before trying models

  goptions reset=all;
  proc gplot data=paper;
    symbol i=rq; /* impose quadratic regression model on the chart */
    title 'Quadratic model';
    plot strength*amount;
  run;
    symbol i=rc; /* impose cubic regression on the chart */
    title 'Cubic model';
    plot strength*amount;
  run;
  quit;
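Stepwise selection, mentioned above as a shortcut around fitting all 2^N models, can be sketched as follows. The dataset sim, its variables and the 0.05 entry/stay levels are assumptions chosen for illustration; the data are simulated so that only x1 and x3 affect y.

```sas
/* Simulated data: y depends on x1 and x3; x2, x4 and x5 are pure noise. */
data sim;
  call streaminit(12345);
  do i = 1 to 200;
    x1 = rand('normal');
    x2 = rand('normal');
    x3 = rand('normal');
    x4 = rand('normal');
    x5 = rand('normal');
    y  = 2 + 1.5*x1 - 0.8*x3 + rand('normal');
    output;
  end;
  drop i;
run;

/* Stepwise selection: a variable enters at p < slentry and stays while p < slstay. */
proc reg data=sim;
  model y = x1-x5 / selection=stepwise slentry=0.05 slstay=0.05;
run;
quit;
```

The selection summary at the end of the output lists the order in which variables entered or left the model.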

Evaluate Model Assumption

• Normality
– Normal probability plots of the residuals using proc univariate
• Independent observations
– Plot residuals vs. time and other ordering components
– Durbin-Watson statistic or the first order autocorrelation statistic for time series data
• Constant variance
– Plot residuals vs. predicted values
– Spearman rank correlation coefficient between the absolute value of the residuals and the predicted values

Model fitness

• Examine model-fitting statistics such as R^2, adjusted R^2, AIC, SBC

• If the overall model p value<0.05 then at least one of the predictors is significant
• Check each coefficient's p value
• Examine residual plots and validate the normality assumption.

  proc reg data=mydata;
    model reading = age gender;
    plot r.*p.; /* plot the residuals vs. the predicted values */
    output out=out r=residuals;
  run;
  quit;
  proc univariate data=out;
    var residuals;
    histogram / normal;
  run;

Remedial measures

• When a straight line is inappropriate
– Transform the independent variables to obtain linearity
– Fit a polynomial regression model
– Fit a nonlinear regression model using proc nlin
• When there is multicollinearity
– Exclude redundant independent variables
– Center the independent variables in polynomial regression models
• When there are influential observations
– Make sure there are no data errors
– Investigate the cause of the unusual observations
– Delete the observations if appropriate and document the situation
• Transforming the dependent variable
– Transforming the dependent variable is one of the common approaches to deal with nonnormal data and/or nonconstant variances.

Regression with Categorical Predictors

In proc reg, a categorical predictor needs to be coded into dummy variables as input. In proc glm, this is done automatically when the variable is listed in the class statement.

This data step shows how to code dummy variables (0/1 values) from the category variable catvar, which takes the values 1, 2, 3:

  data mydata;
    set mydata;
    array dum(3) dum1-dum3;
    do i = 1 to 3;
      dum(i) = (catvar = i);
    end;
    drop i;
  run;

If an ordinal predictor has only three or four levels, then it should be coded using dummy coding. There are times when an ordinal predictor can be treated as if it were interval (this is called quasi-interval), especially if the variable has more than five or six levels.
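When the dummies go into proc reg, one level has to be left out as the reference; otherwise the dummies sum to the intercept column and the model is not of full rank. The sketch below uses a tiny made-up dataset (toy, with response y) and compares the two approaches.

```sas
/* Made-up data: response y and a three-level category catvar. */
data toy;
  input y catvar;
  datalines;
10 1
12 1
15 2
17 2
21 3
23 3
;
run;

/* 0/1 dummy coding, as in the data step above. */
data toy_dum;
  set toy;
  array dum(3) dum1-dum3;
  do i = 1 to 3;
    dum(i) = (catvar = i);
  end;
  drop i;
run;

/* proc reg: include only two of the three dummies; level 3 is the reference. */
proc reg data=toy_dum;
  model y = dum1 dum2;
run;
quit;

/* proc glm: the class statement builds the dummies; the last level is the reference. */
proc glm data=toy;
  class catvar;
  model y = catvar / solution;
run;
quit;
```

Both fits should report the same intercept and the same two level effects, since each treats level 3 as the baseline.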

Polynomial Regression

• A polynomial regression is a special type of multiple linear regression where powers of variables and cross-product(interactions) terms are included in the model.

• Y=b0+b1*X1+b2*X2+b3*X1^2+b4*X1*X2

  data a;
    set a;
    var2 = var1**2;
    var3 = var1**3;
  run;
  proc reg data=a;
    model y = var1 var2 var3;
  run;
  quit;

Collinearity

• Detect collinearity

  proc reg data=mydata;
    model oxygen_consumption = runtime age weight / vif collin collinoint;
  run;
  quit;

• The vif and collin options provide collinearity diagnostics; collinoint repeats the collin diagnostics with the intercept adjusted out.
• VIF>10 or a condition index (collin) >100 indicates strong collinearity.
• Polynomial terms often introduce collinearity. One treatment is to center the data, so that some values become negative, which reduces the correlations involving x^2 and x^3.

  proc stdize data=paper method=mean out=paper1;
    var var1;
  run;
  data paper1;
    set paper1;
    mvar2 = var1**2;
    mvar3 = var1**3;
  run;
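The effect of centering can be checked on a small made-up example. The dataset poly and its variables are hypothetical; x simply takes the values 1 to 10.

```sas
/* Raw predictor and its square. */
data poly;
  do x = 1 to 10;
    x2 = x**2;
    output;
  end;
run;

/* Center x at its mean (method=mean subtracts the mean and does not rescale). */
proc stdize data=poly method=mean out=poly_c(rename=(x=xc) drop=x2);
  var x;
run;

/* Square the centered value. */
data poly_c;
  set poly_c;
  xc2 = xc**2;
run;

/* Correlation before centering. */
proc corr data=poly;
  var x x2;
run;

/* Correlation after centering. */
proc corr data=poly_c;
  var xc xc2;
run;
```

Before centering, x and x2 are almost perfectly correlated. After centering, the values are symmetric around zero, so the correlation between xc and xc2 is essentially zero for this evenly spaced example; real data will usually show a reduction rather than a complete removal.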

Multivariate multiple regression

• Multivariate multiple regression is used when you have two or more variables that are to be predicted from two or more predictor variables. In our example, we will predict write and read from female, math, science and social studies (socst) scores. The mtest statement in proc reg is used to test hypotheses in multivariate regression models where there are several dependent variables fit to the same regressors.

  proc reg data="c:\mydata\hsb2";
    model write read = female math science socst;
    female: mtest female;
    math: mtest math;
    science: mtest science;
    socst: mtest socst;
  run;
  quit;

Autoregression

• If errors (residuals) are not independent, autoregression model should be used. By simultaneously estimating the regression coefficient and the autoregressive error model parameters, the autoreg procedure corrects the regression estimates.

• Yt=b0+b1*x1+..+bk*xk+vt

• vt = -p1*v(t-1) - p2*v(t-2) - ... - pm*v(t-m) + et

  proc autoreg data=sales;
    model sales = price promotion / nlag=3 method=ml dwprob;
  run;

Autoregression and Arima for time series forecast

• t is the month order, t=_n_. The order of the AR(p) model is chosen by a backward elimination search.

  proc autoreg data=taxrevenue;
    model rev = t d2 d3 d4 d5 d6 d7 d8 d9 d10 d11 d12
      / nlag=4 dw=4 dwprob method=ml backstep slstay=0.05;
  run;

• Use of proc arima to fit ARMA models consists of 3 steps. The first step is model identification, in which the observed series is transformed to be stationary. The only transformation available within proc arima is differencing. The second step is model estimation, in which the orders p and q are selected and the corresponding parameters are estimated. The third step is forecasting, in which the estimated model is used to forecast future values of the observable time series.

  proc arima data=history;
    i var=adv nlag=15; /* i statement means identify */
    e p=1 q=3; /* e statement means estimate */
    f lead=12 interval=month id=sshipym out=fcst; /* f statement means forecast */
  run;
  quit;

If the series is not stationary, difference it by using adv(1) instead of adv in the identify statement, i.e. var=adv(1).

Generalized linear model

• No longer requires that the response variable follow a normal distribution with constant variance, conditional on the predictor variables.

• Examples:
  Model                  Distribution
  Linear regression      normal
  Logistic regression    binomial
  Poisson regression     Poisson
  Gamma regression       gamma

Poisson Regression

• The Poisson distribution is used to model counts (non-negative integers) or occurrence rates of rare events (when the mean gets larger, the Poisson distribution approaches the normal distribution)
– Number of ear infections in infants
– Number of equipment failures
– Rate of insurance claims
• Differs from the normal distribution
– Not symmetrical; skewed to the right for rare events
– Has only one parameter (the mean); the variance equals the mean.
• Example

  proc genmod data=skincancer;
    class city age;
    model cases = city age / offset=log_pop dist=poi link=log type3;
    title 'Poisson regression model for skin cancer rates';
  run;

The offset lets genmod model rates instead of counts; log_pop holds the log of the population size.
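The reason the offset turns a count model into a rate model follows from the log link. In the sketch below, mu is the expected number of cases and pop is the population at risk; both symbols are introduced here for illustration. The offset enters the linear predictor with its coefficient fixed at 1.

```latex
\[
\log(\mu) = \log(\text{pop}) + b_0 + b_1 x_1 + \dots + b_k x_k
\]
\[
\log\!\left(\frac{\mu}{\text{pop}}\right) = b_0 + b_1 x_1 + \dots + b_k x_k
\]
```

So the coefficients describe the log of the rate (cases per unit of population), not the log of the raw count.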

Proc reg and proc glm

• Proc reg is more convenient for regression analysis because of its options for plots and automatic model selection (but it has no class statement).

• Proc glm is more convenient for ANOVA because of its class statement (but it has no plot or model selection options).

Cluster Analysis

• The primary goal of market segmentation is to better satisfy customer needs or wants. The firm does not want to
– Use the same marketing program for all customers
– Incur the high cost of a unique program for each customer
• A deck of 52 cards can be grouped as
– 26 red and 26 black
– 13 each of spades, hearts, diamonds and clubs

Cluster Analysis

• The scale of measurement will impact the grouping. It is better to standardize input variables before clustering.
• Two types of clustering
– Hierarchical clustering
  • Used for small data sets. The number of clusters can be determined by finding the local peaks of the pseudo F and pseudo T-squared statistics.
  • No theoretical reason to expect a hierarchical structure
– Non-hierarchical clustering
  • Scales up well with large/complex data
  • Number of clusters needs to be specified in advance
  • Initial seeds required
– Combination
  • A two step method is used. First, a hierarchical method is applied to the training data to decide the number of clusters and the initial choice of seeds.
  • These are fed into a non-hierarchical method (such as k-means) that is applied to the whole data set. SAS Enterprise Miner automates this two step process.
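Outside Enterprise Miner, the two step approach can be sketched by hand. Everything below is an assumption made for illustration: the simulated dataset cust, its variables spend and visits, the sample size of 60, and the choice of three clusters. The sketch takes only the number of clusters from the hierarchical step and leaves the seeds to the proc fastclus defaults.

```sas
/* Simulated customers drawn from three loosely separated groups. */
data cust;
  call streaminit(2024);
  do id = 1 to 300;
    grp    = mod(id, 3);
    spend  = 100 + 200*grp + 20*rand('normal');
    visits = 5 + 4*grp + rand('normal');
    output;
  end;
  drop grp;
run;

/* Standardize so that both variables are on the same scale. */
proc stdize data=cust method=std out=cust_std;
  var spend visits;
run;

/* Step 1: hierarchical clustering on a small random sample. */
proc surveyselect data=cust_std method=srs n=60 seed=1 out=train;
run;

proc cluster data=train method=ward pseudo outtree=tree;
  var spend visits;
run;

/* Step 2: k-means on the full data with the number of clusters chosen in step 1. */
proc fastclus data=cust_std maxclusters=3 maxiter=50 out=clusters;
  var spend visits;
run;
```

The pseudo option makes proc cluster print the pseudo F and pseudo T-squared statistics used to choose the number of clusters; the out= dataset from proc fastclus carries a CLUSTER variable with each customer's assignment.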


Association Analysis

• Chi-Square Test

  proc freq data=mydata;
    table gender*purchase / chisq expected cellchi2 nocol nopercent;
    title1 'Association between Gender and Purchase';
  run;

When more than 20% of cells have expected counts less than five, the chi-square test may not be valid.
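The expected counts requested by the expected option, and the per-cell contributions requested by cellchi2, come from the standard formulas below. Here O_ij is the observed count in row i and column j, R_i and C_j are the row and column totals, and n is the overall sample size; the symbols are introduced here for illustration.

```latex
\[
E_{ij} = \frac{R_i \, C_j}{n}, \qquad
\chi^2 = \sum_{i}\sum_{j} \frac{(O_{ij} - E_{ij})^2}{E_{ij}}, \qquad
\text{df} = (r - 1)(c - 1)
\]
```

For a 2 x 2 table such as gender by purchase with two outcomes, this gives df = 1.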

Regression Overview

  Response (dependent) variable                        Analysis                       SAS procedure
  Continuous                                           Linear regression              proc reg, proc glm
  Categorical – binary (yes, no)                       Binary logistic regression     proc logistic
  Categorical – ordinal (small, medium, large etc.)    Ordinal logistic regression    proc logistic
  Categorical – nominal (no order)                     Nominal logistic regression    proc catmod

• PROC LOGISTIC in SAS can be used to model dichotomous (binary) and ordinal responses, and PROC CATMOD to model nominal responses. PROC GENMOD, a procedure for analyzing generalized linear models, can also be used to perform logistic regression in SAS. It allows for modeling a correlation structure among observations and is therefore very useful when observations are not independent.

Logistic regression model

• The logistic regression model ensures that the estimated probabilities are between 0 and 1 (no matter what the X values are).
• P = 1/(1 + exp(-(b0 + b1*X1 + b2*X2 + ...)))
• The logit transform is applied to turn the above nonlinear function (close to linear near the center, but very nonlinear close to 0 and 1) into a linear one.
• The linear model is Y = log(p/(1-p)) = b0 + b1*X1 + b2*X2 + ...
• Note p/(1-p) is the odds of the event happening vs. not happening.
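The step from the probability formula to the linear logit model can be written out explicitly. The symbol eta is shorthand, introduced here, for the linear predictor b0 + b1*X1 + b2*X2 + ...

```latex
\[
p = \frac{1}{1 + e^{-\eta}}
\quad\Longrightarrow\quad
1 - p = \frac{e^{-\eta}}{1 + e^{-\eta}}
\quad\Longrightarrow\quad
\frac{p}{1 - p} = e^{\eta}
\quad\Longrightarrow\quad
\log\!\left(\frac{p}{1 - p}\right) = \eta
\]
```

So each coefficient is the change in the log odds for a one unit change in its predictor.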

SAS Logistic Procedures

  proc logistic data=abc.logisticd descending;
    class gender (ref='Male') income (ref='Low') / param=ref;
    model purchase = gender income / rsquare lackfit ctable selection=stepwise;
    output out=probs predicted=phat;
  run;

• "descending" makes the procedure model P(Y=1).
• Using the estimates of b0, b1, ... from the training data, we can calculate p for the population dataset. We can use p>0.5 as the threshold to predict that the event happens (i.e. will purchase etc.).
• P = 1/(1 + exp(-(b0 + b1*X1 + b2*X2 + ...)))
• Model fit. When comparing models, the one with the lower AIC is better.

  Criterion    Intercept only    Intercept and covariates
  AIC                     560                         550
  SC                      570                         560
  -2 Log L                575                         550
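Applying the fitted model to a new dataset can be sketched in a data step. The coefficients below are invented for illustration and must be replaced by the estimates proc logistic reports; the dataset population and its three rows are also made up, and income is assumed to have only the levels Low and High.

```sas
/* Made-up population records to score. */
data population;
  input gender $ income $;
  datalines;
Female High
Male Low
Female Low
;
run;

data scored;
  set population;
  /* Hypothetical coefficients, NOT real estimates. */
  b0       = -1.2;   /* intercept                  */
  b_female =  0.8;   /* gender Female vs. Male     */
  b_high   =  1.1;   /* income High vs. Low        */
  /* Linear predictor with reference coding (Male and Low are the baselines). */
  xb = b0 + b_female*(gender = 'Female') + b_high*(income = 'High');
  phat = 1 / (1 + exp(-xb));        /* predicted probability of purchase */
  will_purchase = (phat > 0.5);     /* 1 = predicted to purchase         */
  drop b0 b_female b_high;
run;
```

proc logistic also has a score statement that applies the fitted model to another dataset directly, which avoids copying coefficients by hand.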
