Advanced SPSS Workshop Handout 2013 - Yale...

Page 1: Advanced SPSS Workshop Handout 2013 - Yale Universitystatlab.stat.yale.edu/workshops/StatLab_Advanced_SPSS_Sp13.pdf · 1 Intermediate SPSS StatLab Workshop February 21, 2013 Sherlock

Intermediate SPSS StatLab Workshop February 21, 2013

Sherlock Campbell & Oriana Aragón

SPSS Syntax

This section provides an introduction and sampling of basic syntax in SPSS. All illustrations and details are intended to apply to SPSS v19 for Windows (and most Macs). Most of the information will be the same for other versions, but there may be discrepancies.

Syntax refers to the computer language SPSS uses to complete analyses. While most commonly used commands are available through the menu system of SPSS (point & click), many more options and functions are available using syntax.

Why Use Syntax?

Using syntax can save a great deal of time when running repetitive analyses.

It is also an easy way to document your work.

It allows you to instantly duplicate that work with a new (or updated) data set.

It allows you to 'tweak' your analysis in ways not available through dialog boxes.

If you'd like to follow along, please open the sample data file 'satisf.sav'.

It should be located at C:\Program Files\IBM\SPSS\Statistics\19\Samples\English\satisf.sav.

Creating and Saving a Syntax File: Syntax files in SPSS are plain text files with an extension of '.sps'.

1) Double-click the data file and it will open in the Data Editor (Variable View).

2) Next open a new syntax file:

File > New > Syntax

3) To save the file: File > Save (a dialogue box will pop up; name the file and choose its location).

Archiving Procedure and Describing Data

You can create syntax in several ways. Probably the easiest way to start using syntax is by using the 'Paste' button available in most dialog boxes. It is seen below in the Descriptives dialog box.

First select the options you wish from the dialog box, then, instead of clicking 'OK', click 'Paste'. If you do not have a syntax window open already, this will open a new syntax window containing the commands you selected in the dialog box as seen below.

• If you already have a syntax window open, the commands will be pasted at the bottom of the currently active syntax window. The Syntax Editor allows you to edit a plain text file and submit selected commands to SPSS directly. Hit the green arrow (run command) and you will see on your output viewer the descriptive statistics for these data.

• Next, let’s say that we want every variable in the data set.

• Likewise, if you would like only the mean and standard deviation, you can remove MIN and MAX from the /STATISTICS= subcommand.

• You can add notes (using *), cut, paste and edit just as with any other text file.
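
As a rough sketch of what the steps above produce, pasted DESCRIPTIVES syntax and the two edits just described might look like the following. The variable names price and numitems are borrowed from later in this handout; which variables were actually selected in the dialog box is an assumption.

```spss
* Pasted from the Descriptives dialog box with the default statistics.
DESCRIPTIVES VARIABLES=price numitems
  /STATISTICS=MEAN STDDEV MIN MAX.

* Every variable in the data set, mean and standard deviation only.
DESCRIPTIVES VARIABLES=ALL
  /STATISTICS=MEAN STDDEV.
```

To run one command, place the cursor in it (or highlight it) and hit the green arrow.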

Structure of Commands in SPSS syntax

• Commands in SPSS begin with a keyword that is the name of the command followed by any subcommands and user specifications. The end of the command is marked by a period/full stop.

• In SPSS syntax files, commands must always be placed in the first column. Refer to the Command Syntax Reference for a discussion of available commands and options. It can be found in the menus under Help > Command Syntax Reference.

o Example: the FREQUENCIES command: FREQUENCIES produces tables of frequency counts and percentages of the values of individual variables. FREQUENCIES is used to obtain frequencies and statistics for categorical variables and to obtain statistics and graphical displays for continuous variables.

By default, SPSS will paste syntax with commands and specifications in all caps, and will display variables as you have entered them. Commands and specifications do not have to be entered in all caps, but I will continue to display them that way to help differentiate them from variables. In the syntax below, just as pasted from the dialog box, 'gender' and 'regular' are the variables that I selected. NOTE: Variable names in SPSS are generally separated by spaces.

FREQUENCIES VARIABLES=gender regular
  /ORDER=ANALYSIS.

In the syntax window there is a very useful toolbar button called 'Syntax Help'. It is context sensitive, meaning that it will display a syntax chart for the command where the cursor is currently located. Clicking on the 'Syntax Help' button provides the following information about the FREQUENCIES command.

Don't let the long list intimidate you. Many people are surprised to learn that FREQUENCIES has so many subcommands and specifications!

Subcommands and specifications in square brackets ([ ]) are optional, and those in braces ({ }) indicate a choice between elements. Look closely: there are only two words that are NOT in brackets, FREQUENCIES and 'varlist'. This means that you can run frequencies on the two variables 'gender' and 'regular' by typing the following into a syntax editor window:

FREQUENCIES gender regular.

There are many abbreviations allowed in syntax (generally the first 3 or 4 letters of a command/specification will suffice.) So the following syntax will provide the same output:

FREQ gender regular.

Don't forget the period at the end of the command. The 'Syntax Help' button provides the skeleton of the command you are using, but does not provide a detailed explanation of each possible subcommand and/or specification. That information is found in the Command Syntax Reference (use the menus: Help >Command Syntax Reference).

Now let's add a little more to the syntax. The following will provide descriptive statistics and bar charts for our two variables:

FREQUENCIES gender regular
  /STATISTICS=STDDEV MINIMUM MAXIMUM MEAN MEDIAN MODE
  /BARCHART.

If you want to add documentation to your syntax file, indicate the start of a comment with an asterisk (*). Everything between that asterisk and the next period will be ignored by SPSS. Remember not to add other periods in your documentation if you use this method (in particular, never end an earlier line of the comment with a period), since SPSS will try to interpret everything after the period as commands. Another method is to use /* and */ to set off a comment on the same line as a command. That is, start with /*, insert your comment, then end the comment with */. A comment set off this way cannot continue onto the next line, so keep it short. Remember not to include any periods in the comment using this technique either.
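
A minimal sketch of both comment styles, using the same two variables as above (the comment wording is only illustrative):

```spss
* Frequency tables for two categorical variables
  from the customer satisfaction survey.
FREQUENCIES gender regular.

FREQUENCIES gender regular /* tables only, no statistics */.
```

The first comment runs over two lines and ends at its only period. The second is set off with /* and */ on the same line as the command, and the command's own period still comes last.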

Missing Values, Variable and Value Labels

Missing Values

• Creating a well-documented data file can be quite tedious. Syntax statements can make the process a little less painful. The MISSING VALUES command declares values for variables as user-missing. User-missing values are then treated the same as system-missing values. (That is, they are usually ignored.) Multiple missing values are separated by commas, and a range of missing values may be declared using the keywords LO, LOWEST, HI, HIGHEST, and THRU. If the variable(s) are strings, enclose the missing values in single quotes. For example, the following command sets values of the variable 'age' of 99 or higher, and values of 'region' equal to 999, to user-missing. Don't forget the period at the end.

MISSING VALUES age (99 THRU HIGHEST) region (999).

• To cancel previously declared missing values, simply reassign the missing values to blank (use () in the previous statement.) To remove all missing values settings at once, use the following code:

MISSING VALUES ALL ().
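
A few more forms of the command, as a sketch. The variables income, score and state are hypothetical and are not part of the sample file.

```spss
* Hypothetical variables, for illustration only.
* Up to three separate values can be declared for one variable.
MISSING VALUES income (-9, -8, -7).

* A range plus one separate value.
MISSING VALUES score (LOWEST THRU 0, 999).

* String values go in quotes.
MISSING VALUES state ('XX').

* Clear the declaration for one variable only.
MISSING VALUES income ().
```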

Variable and Value Labels

• VARIABLE LABELS assigns descriptive labels to variables in the data file. You can assign a label to one variable or to a long list at the same time. The following syntax assigns labels to two variables in the active file. NOTE: Each variable label can be up to 120 characters long, although most procedures will only print fewer than the 120 characters. All statistical procedures display at least 40 characters.

VARIABLE LABELS contact 'Contact with Employee' regular 'Shopping Frequency'.

• In general, syntax will ignore spaces and lines within commands and subcommands. It is often easier to read a syntax file if you add spaces and start new lines to create columns, as below.

VARIABLE LABELS
  contact 'Contact with Employee'
  regular 'Shopping Frequency'.

• The VALUE LABELS command assigns descriptive labels to values of variables in the data file. Many people confuse variable and value labels when they are new to them. Variable labels describe the variables, and value labels allow you to assign descriptions to particular values of a variable. In our data file, 'contact' is either 0 or 1. Value labels help you to remember whether 0 means the customer had contact with an employee or did not.

VALUE LABELS contact 0 'no' 1 'yes'.

• NOTE: The VALUE LABELS command deletes all existing value labels for the specified variable(s) and assigns new value labels. The ADD VALUE LABELS command can be used to add new labels or to alter labels for specified values without deleting existing labels.

• To create value labels for additional variables in the same command, add a slash (/) after the last value label of the previous variable, then list the next variable followed by its values and labels in single quotes. Remember to put a period only at the very end of the command.
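
As a sketch of labeling several variables at once and of ADD VALUE LABELS: the coding of gender (0 = male, 1 = female) and the extra value 9 for contact are assumptions made for illustration.

```spss
* Label two variables in one command, separated by a slash.
* The coding of gender shown here is an assumption.
VALUE LABELS
  contact 0 'no' 1 'yes'
  /gender 0 'male' 1 'female'.

* Add or change a single label without deleting the others.
* The value 9 is hypothetical.
ADD VALUE LABELS contact 9 'not recorded'.
```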

Data Management: COMPUTE, RECODE, SPLIT FILE, and FILTER

• In this section we will go over creating new variables (COMPUTE), recoding the values of existing variables (RECODE), running the same analysis on subgroups (SPLIT FILE) and using filters to select subsections of your data (FILTER).

The COMPUTE command

o The COMPUTE command creates new numeric variables or modifies the values of existing string or numeric variables. You may be familiar with the dialog box, accessed by clicking Transform >Compute

o However, it is often more efficient to write COMPUTE statements in syntax instead of 'pointing and clicking' your way to a new variable. While the dialog box is especially useful for functions you are not familiar with, I find it faster to code common formulas directly in syntax. The examples below illustrate why computing with syntax might save you some time. Note the differences between the four different computations of Satisfaction.
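
As a stand-in sketch, here are four ways one might compute a satisfaction score from five of the survey items. The new variable names (sat_add and so on) are made up for illustration, and these are not necessarily the same four computations shown in the workshop.

```spss
* Plain addition, which returns missing if any item is missing.
COMPUTE sat_add = price + numitems + service + quality + overall.

* SUM adds whatever items are not missing.
COMPUTE sat_sum = SUM(price, numitems, service, quality, overall).

* MEAN averages whatever items are not missing.
COMPUTE sat_mean = MEAN(price, numitems, service, quality, overall).

* MEAN.4 does the same but requires at least 4 valid items.
COMPUTE sat_mean4 = MEAN.4(price, numitems, service, quality, overall).
EXECUTE.
```

The four versions differ mainly in how they treat cases with missing items, which is easy to overlook when building the formula in a dialog box.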

Note: Syntax will not warn you if you are about to overwrite an existing variable. But you will have a record of what you have done.

Notice the EXECUTE statement at the end. SPSS requires this at the end of COMPUTE commands, unless a procedural command follows it (e.g. FREQUENCIES or other statistical analyses). If you forget and run the syntax without it, the Data Editor window will display "Transformations Pending" in the bottom display area. Either add the EXECUTE statement to the syntax, highlight it and run it, or go to the menus and click Transform > Run Pending Transforms to complete the command. EXECUTE can be shortened to EXE.

The RECODE command

o The RECODE command allows you to change the values of a variable. For example, it is sometimes useful to reverse code responses to a survey (change the highest response to the lowest and vice versa). You may also need to collapse categories for an analysis. In general, it is safer to recode into a new variable rather than change an existing one, so I will not address that option here.

Here is an example from our data set. We would like to change distance from store into a binary variable where people have traveled either more or less than 10 miles to shop:

Notice, at this time it makes sense to apply desired variable and value labels.
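
A sketch of what such a recode could look like. It assumes that 'distance' is coded in categories 1 to 5, with 1 to 3 meaning 10 miles or less; check the value labels in your copy of the file before relying on this. The name dist_far is made up.

```spss
* Assumed coding: distance 1 to 3 = within 10 miles, 4 to 5 = farther.
RECODE distance (1 THRU 3=0) (4 THRU 5=1) (ELSE=SYSMIS) INTO dist_far.
VARIABLE LABELS dist_far 'Traveled more than 10 miles to shop'.
VALUE LABELS dist_far 0 '10 miles or less' 1 'More than 10 miles'.
EXECUTE.
```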

o The RECODE command needs a list of variables to act upon (yes, you can recode many variables at once by listing them after the RECODE statement). The variable(s) are followed by a list of recodes each enclosed in parentheses. Here is an example, splitting the satisfaction variables into low and high groups.

o The keywords MISSING and SYSMIS both refer to missing values. MISSING includes both system-missing values (no value entered in the data set) and user-missing (values entered but set to missing by the user.) SYSMIS refers only to system-missing values. Since I included MISSING, I did not need to include SYSMIS.

RECODE price (1=0) (2=0) (3=0) (ELSE=SYSMIS) INTO price_low.
EXECUTE.
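
To illustrate the MISSING keyword described above and recoding several variables at once, here is a sketch. The cutoff (1 to 3 low, 4 and above high) and the new variable names are assumptions, not the coding used in the workshop.

```spss
* Hypothetical cutoff: 1 to 3 is low (0), 4 and above is high (1).
* MISSING comes first so user-missing codes never fall into a range.
RECODE price numitems
  (MISSING=SYSMIS) (LOWEST THRU 3=0) (4 THRU HIGHEST=1)
  INTO price_hi numitems_hi.
EXECUTE.
```

Each input variable is paired with the output variable in the same position after INTO.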

o Remember that you can get reference information by clicking on the 'Syntax Help' button in the Syntax Editor, or by looking in the Command Reference.

o Another great recode feature that is not available through the dialogue boxes is the ability to assign values based on IF statements. Let’s say that we want to create a variable that tells us if the customer who answered the survey was an older woman or not, because we are curious whether the older female customers were getting the same attention from the sales associates as the other shoppers. With syntax we can create a variable with two (or more) conditional statements. Here is what that would look like.

COMPUTE FemOlder = 0.

EXECUTE.

IF (gender = 1) AND (agecat > 4) FemOlder = 1.

EXECUTE.

Now we have a variable that we can cross with Contact to check our ideas.
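
One way to do that cross-check, sketched with CROSSTABS (the choice of cell statistics is just a suggestion):

```spss
* Row percentages show the share of each group that had contact.
CROSSTABS
  /TABLES=FemOlder BY contact
  /CELLS=COUNT ROW.
```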

The SPLIT FILE command

o If you are interested in running the same analysis on a set of subgroups in your data you can use SPLIT FILE to accomplish this. Using the survey data, let's see if the distribution of satisfaction is different for customers who did and did not have contact with an employee. We'll use SPLIT FILE and FREQUENCIES to get the output we're looking for. SPLIT FILE requires that the data be sorted by the variable(s) we want to split by. The SORT CASES command will sort by the variables listed after the command. The default is to sort ascending. If you want to sort descending, add (D) after that variable.

SORT CASES BY contact.
SPLIT FILE SEPARATE BY contact.
FREQUENCIES Satisfaction /HISTOGRAM.

SORT CASES BY contact.
SPLIT FILE LAYERED BY contact.
FREQUENCIES Satisfaction /HISTOGRAM.
SPLIT FILE OFF.

o By default, SPLIT FILE will produce output with the values of the split variable as the outermost column entries of a table (the LAYERED BY option.) If you want split file groups to display in separate tables, use SEPARATE BY instead. It is a good practice to 'turn off' splits as soon as you complete the analysis. SPLIT FILE will be overridden by a later SPLIT FILE command. Until SPLIT FILE OFF is entered, all analyses will be carried out on a split file.
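
SPLIT FILE also accepts more than one variable. A short sketch, splitting by gender and contact together and turning the split off right after the analysis:

```spss
* Sort and split by two variables at once.
SORT CASES BY gender contact.
SPLIT FILE LAYERED BY gender contact.
DESCRIPTIVES overall.
SPLIT FILE OFF.
```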

The FILTER and SELECT IF commands

o Filter variables (also known as indicator variables and dummy variables) are used to identify cases that meet the criteria you have specified. For example, let's create a variable to identify all the women in the data set who have had contact with a sales employee and live within 10 miles of the store that they visited. To accomplish this enter:

USE ALL.

COMPUTE filter_$=(gender=1 & contact=1 & distance >2).

VARIABLE LABELS filter_$ 'gender=1 & contact=1 & distance >2 (FILTER)'.

VALUE LABELS filter_$ 0 'Not Selected' 1 'Selected'.

FORMATS filter_$ (f1.0).

FILTER BY filter_$.

EXECUTE.

o The USE ALL command is added by the dialogue box to ensure that no other filters are active when creating the new variable.

o The COMPUTE command creates a new variable 'filter_$'. This is the default SPSS name for filters. If you used the dialogue boxes to create another filter, it would also be named 'filter_$' which would overwrite the previous filter. You could, of course, rename the filter after using the dialogue boxes, but it is surprisingly easy to forget. Using syntax, you can name the variable something more useful as you create it.

o The next three lines of syntax (VARIABLE LABELS, VALUE LABELS, and FORMATS) are not absolutely necessary but are very helpful for documentation's sake, not to mention ease of use of the data set. Remember, if you change the name of the filter variable, you need to change 'filter_$' to the new name in each of the next four lines as well.

o The next line of the syntax applies the filter to your data. FILTER BY filters out all cases where the filter variable is 0 or missing. In other words, with a 0/1 filter variable like this one, applying the filter selects only those cases where the filter variable equals 1.

o The SELECT IF statement is another way to select subsets of cases. Instead of using a filter variable, the logical expression (or formula) to select cases is specified in the command. The major difference is that this command permanently deletes non-selected cases. So, SELECT IF can be used if you need to create a new data set that is a subset of existing data. For example, the following syntax will provide a data set that includes only females:

SELECT IF (gender = 1).

o SELECT IF will evaluate the expression you enter between the parentheses as either True, False, or Missing; all the False and Missing cases are dropped from the active data set.

o Using the TEMPORARY command will allow you to return to the original data set once a command that reads the data is run. That is, temporary transformations only apply until the next command that reads data and are no longer in effect once that command has run. So, the following syntax will temporarily filter out males, and then run DESCRIPTIVES for the variables 'distance' and 'overall'. Since the DESCRIPTIVES command reads the data, it turns off the TEMPORARY command. Repeating the same DESCRIPTIVES command will then act upon the entire data set, not just females.

TEMPORARY.
SELECT IF (gender=1).
DESCRIPTIVES distance overall.

o Which is better; SELECT IF or creating a filter? That depends on how you prefer to work in SPSS. If you are writing a syntax file to document an analysis, it may be easier to follow if you use SELECT IF, particularly if you need to run only one command on a subset of data. If you need to run multiple commands on a subset of data, it may be easier to use filters to subset your data, depending upon how you need to split the data. If you are in the midst of an analysis and are switching back and forth from syntax to point & click, it is useful to have permanent filter variables instead of re-creating them each time you want to use them.
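
As a sketch of the 'permanent filter variable' approach, the filter can be given a descriptive name when it is created and switched off when you are done. The name women_contact and the two analyses are made up for this example.

```spss
* The filter name women_contact is made up for this sketch.
USE ALL.
COMPUTE women_contact = (gender = 1 & contact = 1).
VARIABLE LABELS women_contact 'Women who had contact with an employee (FILTER)'.
VALUE LABELS women_contact 0 'Not Selected' 1 'Selected'.
FORMATS women_contact (F1.0).
FILTER BY women_contact.

* These commands use only the selected cases.
FREQUENCIES regular.
DESCRIPTIVES overall.

* Turn the filter off so later analyses use all cases.
FILTER OFF.
```

Because the variable stays in the data set, the same subset can be switched back on later with FILTER BY women_contact.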

Preparing Data for Analysis

o Syntax can also be helpful in keeping a good record of the preparation that has gone into a data set BEFORE hypothesis testing begins. For example let’s consider our data set. Let’s say that we want to test out customer satisfaction. We have 6 different variables regarding customer satisfaction. The ideal is to test these items for normality, to confirm a single latent factor in these items (of satisfaction), and to confirm this idea with reliability analysis. Syntax can serve as a good record of due diligence in data preparation.

To run frequencies:

FREQUENCIES VARIABLES=price numitems org service quality overall

/STATISTICS=STDDEV MEAN MEDIAN

/HISTOGRAM NORMAL

/ORDER=ANALYSIS.

To identify latent factors:

We can use the dialogue box under Analyze > Dimension Reduction > Factor

Next Click on Extraction. In this case, we will allow for 3 factors to emerge. Under Extract click Fixed number of factors and then specify 3 in the dialogue box.

Press Continue. Once back to the main dialogue box, click PASTE.

Here is the resulting syntax.

FACTOR

/VARIABLES price numitems org service quality overall

/MISSING LISTWISE

/ANALYSIS price numitems org service quality overall

/PRINT INITIAL EXTRACTION ROTATION

/CRITERIA FACTORS(3) ITERATE(25)

/EXTRACTION PC

/CRITERIA ITERATE(25)

/ROTATION VARIMAX

/METHOD=CORRELATION.

The Output:

Rotated Component Matrix (a)

                               Component
                               1       2       3
Price satisfaction             .700    .480    .098
Variety satisfaction           .683    .539   -.040
Organization satisfaction      .127    .096    .982
Service satisfaction           .846    .165    .142
Item quality satisfaction      .271    .907    .134
Overall satisfaction           .803    .202    .121

Extraction Method: Principal Component Analysis. Rotation Method: Varimax with Kaiser Normalization.
a. Rotation converged in 4 iterations.

The Rotated Component Matrix shows that not all items load onto a single factor. It appears that the items that load with Overall satisfaction are Price, Variety, and Service. The items Organization satisfaction and Item quality satisfaction seem to have different origins and possibly should not be considered part of the same scale. But we do not need to guess about this: we can now run a reliability analysis, which provides inter-item correlations so that we can verify the internal consistency of this dependent variable of satisfaction.

To run reliability:

Go to Analyze > Scale > Reliability Analysis. You will see a dialogue box such as this:

Press the Statistics Button and check off item, scale, scale if item deleted, means, variances and then hit continue.

Once back to the main dialogue box, enter the variables of interest

Then hit Paste. You will see syntax like this:

RELIABILITY

/VARIABLES=price numitems org service quality overall

/SCALE('ALL VARIABLES') ALL

/MODEL=ALPHA

/STATISTICS=DESCRIPTIVE SCALE

/SUMMARY=TOTAL MEANS VARIANCE.

Let’s run it and make some decisions about our measure of satisfaction. The analysis tells us that although Item quality satisfaction is not really hurting us, Organization satisfaction actually is.

Item-Total Statistics

                             Scale Mean   Scale Variance   Corrected     Squared       Cronbach's
                             if Item      if Item          Item-Total    Multiple      Alpha if
                             Deleted      Deleted          Correlation   Correlation   Item Deleted
Price satisfaction           15.57        23.133           .731          .589          .772
Variety satisfaction         15.55        23.404           .701          .595          .779
Organization satisfaction    15.38        27.817           .267          .096          .867
Service satisfaction         15.50        23.235           .675          .489          .783
Item quality satisfaction    15.48        23.422           .612          .395          .797
Overall satisfaction         15.52        24.100           .653          .453          .789

Here is one of the beauties of syntax. Now we simply make a note about what we see in our syntax (e.g. *alpha at .828, but shows that measure could be improved by the removal of Organization), copy and paste the first reliability command, remove the Organization satisfaction variable and run it again. Now we have alpha at .867 and the following output:

Item-Total Statistics

                             Scale Mean   Scale Variance   Corrected     Squared       Cronbach's
                             if Item      if Item          Item-Total    Multiple      Alpha if
                             Deleted      Deleted          Correlation   Correlation   Item Deleted
Price satisfaction           12.35        18.132           .748          .584          .825
Variety satisfaction         12.33        18.072           .750          .590          .824
Service satisfaction         12.28        18.290           .682          .483          .841
Item quality satisfaction    12.26        18.460           .615          .388          .859
Overall satisfaction         12.30        19.016           .665          .452          .845
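
The second run described above amounts to the first RELIABILITY command with 'org' removed from the variable list and a note added. A sketch (the wording of the note is only an example):

```spss
* Alpha at .828 with all six items; org looks like the weak item.
RELIABILITY
  /VARIABLES=price numitems service quality overall
  /SCALE('ALL VARIABLES') ALL
  /MODEL=ALPHA
  /STATISTICS=DESCRIPTIVE SCALE
  /SUMMARY=TOTAL MEANS VARIANCE.
```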

Creating our new variable.

o Now we know how to do this! Let’s compute our new variable of satisfaction.

COMPUTE Satisfaction=SUM(price,numitems,service,quality,overall)/

NVALID (price,numitems,service,quality,overall).

EXECUTE.

VARIABLE LABELS Satisfaction 'Measure of Satisfaction'.

These procedures can be modified and used whenever data preparation is necessary BEFORE that all-important analysis is done. It is crucial to understand the distributions and the nature of your measures before you begin hypothesis testing. Using syntax is a very useful way to keep track of your work, for yourself and for others.