Computing New Variables in SAS (1)


  • 8/12/2019 Computing New Variables in SAS (1)

    Computing New Variables in SAS

    The SAS Data Step is a powerful and flexible programming tool that is used to create a new SAS dataset. A Data Step is required to create any new variables or modify existing ones in SAS. The Data Step allows you to assign a particular value to all cases, or to a subset of cases; to transform a variable by using a mathematical function, such as the log function; or to create a sum, average, or other summary statistic based on the values of several existing variables within an observation.

    Unlike Stata and SPSS, you cannot simply create a new variable or modify an existing variable in "open" SAS code. This means that you need to create a new dataset by using a Data Step whenever you want to create or modify variables. A single Data Step can be used to create an unlimited number of new variables.

    We will illustrate creating new variables using the employee.sas7bdat SAS dataset, which has 474 observations and 10 variables. We assume this dataset has been downloaded and saved in the C:\ folder.

    To create the new dataset called "mylib.employee2", first highlight and submit the Libname statement. The Libname statement needs to be submitted only once per SAS session. It tells SAS where to read and write permanent SAS datasets.

    The Data Step starts with the Data statement and ends with Run. You create the new dataset called mylib.employee2 by highlighting and submitting the entire set of Data Step commands shown below, starting at "data" and ending with "run;". Each time you make any changes to the Data Step commands, you must highlight and re-submit them. This will re-create the mylib.employee2 dataset by over-writing the previous version.

    libname mylib "C:\";
    data mylib.employee2;
      set mylib.employee;

      /* put commands to create new variables here */
      /* be sure they go BEFORE the run statement */

    run;


    We discuss each aspect of the SAS Data Step statements below.

    Libname statement:

    libname mylib "C:\";

    The Libname statement tells SAS the folder where existing SAS datasets are located, and where new permanent SAS datasets can be saved. In this case we assign "MYLIB" to the folder C:\. The general form of the Libname statement is:

    libname libref "path-to-folder";

    The folder must already exist before it can be used in a Libname statement.

    You can specify the type of dataset that you wish to read/write to a folder by adding a SAS "Engine" to the Libname statement. To tell SAS to read or create Version 9 SAS datasets in your folder, you could use syntax like the following:

    libname libref V9 "path-to-folder\";

    If you're using SAS release 9, the default is to create version 9 datasets.

    Data Statement:

    data mylib.employee2;

    The Data statement tells SAS to create a new permanent SAS dataset called mylib.employee2. The first part of the two-level dataset name is the library (mylib) and the second part of the name follows a period (.) and is the name of the dataset itself (employee2). In Windows, this dataset will be named employee2.sas7bdat; however, the file extension is not used in the SAS code. In general, if you want to create a new permanent SAS dataset, you would use SAS code of the form:

    data library.datasetname;

    Set Statement:

    set mylib.employee;

    The Set statement tells SAS to create your new dataset from the existing permanent SAS dataset called mylib.employee, which in Windows is the file employee.sas7bdat within the C:\ folder.


    Run Statement:

    run;

    The Run statement is a "Step Boundary" in SAS. It ends the SAS Data Step. Remember to include all code that creates new variables before the Run statement.

    The example below illustrates creating a number of new variables in the new dataset, mylib.employee2, which is created from mylib.employee.

    Examples of computing new variables in SAS:

    libname mylib "C:\";
    data mylib.employee2;
      set mylib.employee;

      /*Generating new variables containing constants*/
      currentyear=2005;
      alpha ="A";
      may17 = "17MAY2000"D;

      format may17 mmddyy10.;
      format bdate mmddyy10.;

      /*Generating Variables Using Values from Other Variables*/
      saldiff = salary - salbegin;
      label saldiff = "Current Salary - Beginning Salary";

      /*Generating Variables Conditionally Based on Values of Other Variables*/
      if (salary >= 0 & salary <= 25000) then salcat = "C";
      if (salary > 25000 & salary <= 50000) then salcat = "B";
      if (salary > 50000) then salcat = "A";

      if salary not=. and jobcat not=. then do;
        if (salary < 50000 & jobcat = …

      /*Generating dummy variables*/
      minhighsal = (salary > … & minority = 1);

      if (salary not=. and minority not=.) then minhighsal = (salary > … & minority = 1);

      if jobcat not=. then do;
        jobdum1 = (jobcat=1);
        jobdum2 = (jobcat=2);
        jobdum3 = (jobcat= …


    STD(Arg1, Arg2,…,ArgN)      Standard deviation of non-missing arguments
    VAR(Arg1, Arg2,…,ArgN)      Variance of non-missing arguments
    CV(Arg1, Arg2,…,ArgN)       Coefficient of variation of non-missing arguments
    MIN(Arg1, Arg2,…,ArgN)      Minimum of non-missing arguments
    MAX(Arg1, Arg2,…,ArgN)      Maximum of non-missing arguments

    Missing Values Functions:

    MISSING(Arg)                = 1 if the value of Arg is missing, = 0 if not missing
    NMISS(Var1, Var2,…,VarN)    Number of missing values across variables within a case
    N(Var1, Var2,…,VarN)        Number of non-missing values across variables within a case

    Across-case Functions:

    LAG(Var)                    Value from previous case
    LAGn(Var)                   Value from nth previous case

    Date and Time Functions:

    Datepart(datetimevalue)     Extracts date portion from a datetime value
    Month(datevalue)            Extracts month from a date value
    Day(datevalue)              Extracts day from a date value
    Year(datevalue)             Extracts year from a date value
    Intck('interval',datestart,dateend)
                                Finds the number of completed intervals between two dates

    Other Functions:

    RANUNI(Seed)                Uniform pseudo-random no. defined on the interval (0,1)
    RANNOR(Seed)                Std. Normal pseudo-random no.
    PROBNORM(x)                 Prob. a std. normal is <= x
    PROBIT(p)                   pth quantile from std. normal dist.
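    As a rough illustration of a few of the functions listed above, the sketch below builds a small toy dataset with a Datalines statement. The dataset name work.func_demo and every variable in it are hypothetical and are not part of the employee dataset. Note that, by default, INTCK counts the interval boundaries crossed between the two dates, so the third row counts one month even though the dates are one day apart.

    ```sas
    /* Hypothetical toy data: names and values are for illustration only */
    data work.func_demo;
        input id x1 x2 x3 start :date9. finish :date9.;
        n_miss  = nmiss(x1, x2, x3);              /* missing values in this row     */
        n_ok    = n(x1, x2, x3);                  /* non-missing values in this row */
        avg     = mean(x1, x2, x3);               /* mean of the non-missing values */
        prev_x1 = lag(x1);                        /* x1 from the previous case      */
        months  = intck('month', start, finish);  /* month boundaries crossed       */
        yr      = year(start);                    /* year part of the start date    */
        format start finish mmddyy10.;
        datalines;
    1 10 20 30 01JAN2000 17MAY2000
    2 . 5 15 15MAR2001 01APR2001
    3 7 . . 31DEC2001 01JAN2002
    ;
    run;

    proc print data=work.func_demo;
    run;
    ```

    For the first case, prev_x1 is missing because there is no previous observation.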


    Numeric vs. Character Variables

    There are only two types of variable in SAS: numeric and character. Numeric variables are the default type and are used for numeric and date values.

    Character variables can have alpha-numeric values, which may be any combination of letters, numbers, or other characters. The length of a character variable can be up to 32767 characters. Values of character variables are case-sensitive. For example, the value "Ann Arbor" is different from the value "ANN ARBOR".
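    A minimal sketch of what case-sensitivity means in practice is shown below. The dataset work.case_demo and the variable names are hypothetical. The upcase function is one common way to compare character values without regard to case.

    ```sas
    /* Hypothetical example: comparing character values with and without case */
    data work.case_demo;
        length city $ 20;
        input city $ 1-20;
        exact_match = (city = "Ann Arbor");          /* 1 only when the case matches  */
        any_case    = (upcase(city) = "ANN ARBOR");  /* 1 whatever the original case  */
        datalines;
    Ann Arbor
    ANN ARBOR
    ann arbor
    ;
    run;

    proc print data=work.case_demo;
    run;
    ```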

    A Quick Note about Variable names in SAS

    Variable names in SAS can be up to 32 characters long, and must start with either a letter or underscore. Variable names may contain only letters, numbers, and underscores. No blanks are allowed. Variable names will be saved in a SAS dataset in the combination of upper and lower case used when defining the variable (i.e. when the variable name is first mentioned in a Data Step). Although SAS will remember the variable name using the case you specify when creating it, it will recognize the variable name if you type it using any case later on.
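    The short sketch below illustrates the point about case in variable names. The dataset and variable names are hypothetical. All three spellings refer to the same variable.

    ```sas
    /* Hypothetical example: one variable referenced with three spellings */
    data work.name_demo;
        SalaryNow = 50000;           /* the name is first mentioned here */
        doubled   = salarynow * 2;   /* same variable, lower case        */
        put SALARYNOW= doubled=;     /* same variable, upper case        */
    run;
    ```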

    Generating Variables Containing Constants

    We first consider the simple case in which we want to create a variable that has a constant value for all observations in the data set. In the example below we create a new numeric variable named "currentyear", which has a constant value of 2005 for all observations:

    currentyear=2005;

    The example below illustrates creating a new character variable named "alpha" which contains the letter "A" for all observations in the dataset. Note that the value must be enclosed in either single or double quotes, because this is a character variable.

    alpha ="A";

    Dates can be generated in a number of different ways. For example, we can use the mdy function to create a date value from a month, day, and year value, as shown below:


    may17 = mdy(5,17,2000);

    Or we can create a date by using a SAS date constant, as shown below:

    may17 = "17MAY2000"D;

    The D following the quoted date constant tells SAS that this is not a character variable, but a date value, which is stored as a numeric value.

    If we now look at the value of may17 by typing "viewtable mylib.employee2" or "vt mylib.employee2" in the SAS command dialog box in the upper left of the SAS desktop, we will see that the variable "may17" contains the value 14747 for each observation. This is because SAS stores dates as the number of days since January 1, 1960. We can reformat this variable so that it actually displays as a date by typing the following statement in our Data Step, and resubmitting it to recreate the dataset.

    format may17 mmddyy10.;

    NB: Remember to close the viewtable window showing the data before resubmitting the commands, because SAS locks a dataset when it is open in the viewtable window, and will not allow any changes to be made while the viewtable window is open.

    If you once again open the dataset in the viewtable window, you will now see it displayed as 05/17/2000. Remember to close the viewtable window when you are done viewing the dataset.
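    The date behavior described above can be checked without creating a dataset. The sketch below uses a data _null_ step, which is our own choice for illustration, to write the values to the SAS log. The variable names d1, d2, and same are hypothetical.

    ```sas
    /* Writes to the log only; no dataset is created */
    data _null_;
        d1   = mdy(5, 17, 2000);      /* date built from month, day, year     */
        d2   = "17MAY2000"d;          /* the same date as a date constant     */
        same = (d1 = d2);             /* 1 when the two values are equal      */
        put d1= d2= same=;            /* unformatted: days since 01JAN1960    */
        put d1= mmddyy10. d2= date9.; /* the same values with date formats    */
    run;
    ```

    The first Put statement should show 14747 for both variables, matching the value seen in the viewtable window before the format was applied.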

    Generating Variables Using Values from Other Variables

    We can also generate new variables as a function of existing variables. For example, if we wanted to create a new variable in the mylib.employee2 data set containing the difference between the current salary and beginning salary for each employee, we could simply use the following statement:

    saldiff = salary - salbegin;

    New variables can also be labeled:


    label saldiff = "Current Salary - Beginning Salary";

    We can use the mdy function to create a new date value based on three variables in a dataset. In this example the variables are called "Month", "Day", and "Year", although they could have different names:

    date = mdy(month,day,year);

    Values of the date variable would vary from observation to observation, because the mdy() function is using different values of the variables to create the date. Remember to use a Format statement to format the new variable DATE so it will look like a date.

    Generating Variables Conditionally Based on Values of Other Variables

    You can also create new variables in SAS conditional on the values of other variables. For example, if we wanted to create a new character variable, SALCAT, that contains salary categories "A", "B", and "C" we could use the following commands. Note that we have used conditional syntax to give new values to all possible original values of salary. If we had accidentally left out any categories of salary, they would be given a missing value in SALCAT.

    if (salary >= 0 & salary <= 25000) then salcat = "C";
    if (salary > 25000 & salary <= 50000) then salcat = "B";
    if (salary > 50000) then salcat = "A";
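    For readers who want to try this on a small scale, the sketch below applies the same kind of salary cutoffs to a hypothetical toy dataset, work.salcat_demo, that includes one missing salary. It uses else-if, which is a variation on the statements above and not the form used in this handout, so that SAS stops checking once a condition has been met. In SAS a missing numeric value compares as smaller than any number, so the last case fails every condition and salcat stays blank.

    ```sas
    /* Hypothetical toy data; the cutoffs follow the salary categories above */
    data work.salcat_demo;
        input id salary;
        length salcat $ 1;
        if (0 <= salary <= 25000) then salcat = "C";
        else if (25000 < salary <= 50000) then salcat = "B";
        else if (salary > 50000) then salcat = "A";
        /* a missing salary meets none of the conditions: salcat remains " " */
        datalines;
    1 18000
    2 25000
    3 42000
    4 50000
    5 73500
    6 .
    ;
    run;

    proc print data=work.salcat_demo;
    run;
    ```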

    Note the use of an If-then statement to identify the condition that a given case in the data set must meet for the new variable to be given a value of "A". In general, these types of conditional commands have the form:

    if (condition) then varname = value;

    where the condition can be specified using a logical operator or a mnemonic (e.g., = (eq), & (and), | (or), ^= (not=, ne), > (gt), >= (ge), < (lt), <= (le)). The parentheses are not necessary to specify a condition in SAS, but can be used to clarify a statement or to group parts of a statement. A semicolon is required at the end of the statement. For example, if one wants to create a variable that identifies employees who are managers but have relatively low salaries, one could use a statement like

    if (salary < 50000 & jobcat = …

    … MANLOWSAL variable. A safer version of this conditional command would look like this:

    if (salary not=. & salary < 50000 & jobcat = …

    … MANLOWSAL. Note that the use of 'else' will put all values, including missing values on either variable, into the "N" category (every other value, including missing, is captured by the 'else' condition). The final If statement will put anyone with a missing value on either of these variables into the missing value of MANLOWSAL, which is " " for a character variable.

    if (salary not=. & salary < 50000 & jobcat = …


    … statement to close the do group. In the example below, the entire block of code will only be executed if salary is not missing and jobcat is not missing.

    if salary not=. and jobcat not=. then do;
      if (salary < 50000 & jobcat = …


    SAS has automatically generated a 1 for all cases meeting the condition specified in the parentheses, and a 0 for all cases not meeting the condition (including cases with missing values on either variable). To be sure that no missing values got mistakenly included in the minhighsal variable, you could use the following syntax:

    if (salary not=. and minority not=.) then minhighsal = (salary > … & minority = 1);
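    The difference between the unguarded and guarded forms can be seen on a small hypothetical dataset. In the sketch below, the dataset work.dummy_demo, the variable names flag_unguarded and flag_guarded, and the salary cutoff of 40000 are all assumptions made for illustration only. The MISSING function is used here in place of the comparison with a period.

    ```sas
    /* Hypothetical toy data; the 40000 cutoff is an assumed value */
    data work.dummy_demo;
        input id salary minority;
        /* Unguarded: a missing salary or minority still produces 0 */
        flag_unguarded = (salary > 40000 & minority = 1);
        /* Guarded: stays missing unless both inputs are present */
        if (not missing(salary) and not missing(minority)) then
            flag_guarded = (salary > 40000 & minority = 1);
        datalines;
    1 52000 1
    2 52000 0
    3 31000 1
    4 . 1
    5 52000 .
    ;
    run;

    proc print data=work.dummy_demo;
    run;
    ```

    For cases 4 and 5 the unguarded flag is 0 while the guarded flag is missing.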

    If you have a variable with 3 or more categories, you can create a dummy variable for each category, and in a later analysis you would usually choose to include one less dummy variable in a model than there are categories. For example, in the employee dataset, there is a variable called jobcat, with levels (1, 2, and 3). Here is SAS code to create three new dummy variables, one for each level of jobcat. Notice that potential problems with missing values are avoided by using a do group prior to the creation of the dummy variables:

    if jobcat not=. then do;
      jobdum1 = (jobcat=1);
      jobdum2 = (jobcat=2);
      jobdum3 = (jobcat= …


    The converse operation is to determine the number of non-missing values there are in a list of variables:

    npresent = n(of bdate--minority);

    Another common operation is to calculate the sum or the mean of the values for several variables and store the results in a new variable. For example, to calculate a new variable, salmean, representing the average of the current and beginning salary, use the following command. Note that you can use a list of variables separated by commas, without including "of" before the list.

    salmean = mean(salary, salbegin);

    All missing values for the variables listed will be ignored when computing the mean in this way. The min( ), max( ), and std( ) functions work in a similar way.
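    The sketch below puts these summary functions side by side on a hypothetical toy dataset. The dataset work.stats_demo and the variables bonus, salmean3, nmissing, and naive are assumptions for illustration. It also contrasts the mean function with plain arithmetic, which returns a missing value whenever any of its inputs is missing.

    ```sas
    /* Hypothetical toy data: names and values are for illustration only */
    data work.stats_demo;
        input id salary salbegin bonus;
        salmean  = mean(salary, salbegin);           /* comma-separated list        */
        salmean3 = mean(of salary salbegin bonus);   /* "of" with a variable list   */
        nmissing = nmiss(of salary--bonus);          /* range by dataset position   */
        npresent = n(of salary--bonus);
        naive    = (salary + salbegin) / 2;          /* missing if either is missing */
        datalines;
    1 57000 27000 1000
    2 40200 . 500
    3 . . .
    ;
    run;

    proc print data=work.stats_demo;
    run;
    ```

    When every argument is missing, as in the third case, the mean function itself returns a missing value.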
