Chapter 1: Introduction Statistical Program for Social...


Page 1: Chapter 1: Introduction Statistical Program for Social …educy520/sec6342/week_08/michael_spss...Statistical Program for Social Scientists The Statistical Program for Social Scientists


Chapter 1: Introduction

Statistical Program for Social Scientists

The Statistical Program for Social Scientists (Spss) originated at the Vogelback Computing Center on the Northwestern University campus in the early to mid 1960s. The goal of the Spss developers, who were social scientists, was to develop an easy-to-use programming language that would enable researchers to manipulate and analyze their data without requiring them to become full-fledged computer programmers. Before Spss and other statistical programs existed, social scientists had little choice but to learn a high-level programming language such as Fortran¹ and to write their own programs. A few pages from a popular data analysis textbook² are reproduced in Table 1-1 to illustrate features of this language.

The original versions of Spss were written in Fortran and ran on mainframes,³ for in the 1960s no microcomputers existed and minicomputers were just in their infancy. Researchers used keypunches to enter one Spss command per IBM punch card,⁴ and one row of data per card. The capacity of each card was 80 characters. If a researcher had a data set that consisted of, say, 1,000 rows of numbers with each row no longer than 80 characters, someone, say a graduate student, used a keypunch to “type” these data. Keypunches had no “erase” capability. A single mistake at any point ruined the card, which was thrown away, and all the information was retyped on a new card.

1. Formula Translator (Fortran) was the first high-level programming language developed for general purpose computers. Previously all coding was performed in either machine code (binary) or assembly language (e.g., jmp, put, get, inc, dec, and so forth). Unlike assembly language, Fortran statements consist of English words (e.g., read, write, do, subroutine, and so forth). Current versions of Spss are written in C/C++.

2. Cooley, William W., & Lohnes, Paul R. (1971). Multivariate data analysis. NY: Wiley.

3. Mainframes of that era were physically large, requiring at least as much space as a four-car garage for the computer, supporting hardware, and personnel. Further, the room had to be climate controlled.

4. The IBM card was made of heavy, stiff, stock paper. Size was 3-1/4 inches high by 7-3/8 inches wide with 80 columns numbered left to right and 12 rows from top to bottom. Numbers were punched in a column as 0 through 9. Alphabetic characters were punched in code; a 12 punch and a 1 punch produced the letter A, 12 and 2 = B, 12 and 3 = C, etc.

Table 1-1: Example of Fortran Language

English Statement
    Fortran Statement

Reserve memory cells for a matrix A of maximum 50 x 50 size, and for a vector X of maximum 50 elements.
    dimension a(50,50), x(50)

Read from file 7 according to format 3, values for variables probno, n, and m.
    read (7,3) probno, n, m

Specify as format 3 a card containing three 5-column integer fields.
    3 format (3i5)

Write on file 6 a label and the problem number, using format 4.
    write (6,4) probno
    4 format ('problem no.' i3)

Place b in storage location a, i.e., set $a = b$.
    a = b

Set $a = b + c(1/d)$.
    a = b + c * (1.0 / d)

Set $a = b^n$.
    a = b ** n

Set $a = \sqrt{b}$.
    a = sqrtf(b)

Set a to the absolute value of a.
    a = absf(a)

Set $x_i = a\sqrt{y_i}$, for i = 1, 2, 3, ..., n.
    do 2 i = 1, n
    2 x(i) = a * sqrtf(y(i))

Set $x_i = \sqrt[3]{n i}$, for i = 1, 2, 3, ..., n. Note that en = n converts an integer number to a floating point number for floating point arithmetic.
    en = n
    do 5 i = 1, n
    ei = i
    5 x(i) = (en * ei) ** (1.0 / 3.0)

Transfer control to statement 7.
    go to 7

Test n; if it is negative go to 2, if it is zero go to 3, if it is positive go to 4.
    if (n) 2, 3, 4

Call subroutine matinv, which will operate on matrix X of order m to compute its inverse, which will replace X, and its determinant, which will be placed in location det.
    call matinv (x, m, det)


After all data cards were keypunched, the Spss command cards were entered in the same way, one command per card. The cards containing the commands were arranged sequentially, followed by the data cards. Although keypunches and punch cards seem quaint from our perspective, considered historically they represented a great advance in efficiency.¹

Two approaches to statistical computing

Batch processing

The collection of cards containing Spss commands and the data was known as a “deck.” The deck was placed in a card reader that contained light sensors to detect the holes in the cards. All cards in the deck were read, one after another, until the last card, an end of job (EOJ) card, was read. This procedure was referred to as “batch processing.”² The location of the holes on the cards represented an elaborate code that was interpreted by a combination of hardware and software and passed on to the Spss program. Hours, or days, later a listing of the results appeared if all commands were entered correctly.³ For the beginning Spss programmer, what appeared hours or days later was likely not results but a listing of error messages.

1. “Herman Hollerith, a young Census Bureau statistician, designed the first electro-mechanical punch card system. The 1880 census wasn't completed until 1887, causing concern that the 1890 census wouldn't be completed until after 1900. Hollerith borrowed J. M. Jacquard's 1804 paste-board method for automatic weaving, designing a 3 x 5 card and building a Card Punch, Sorting Box, and Counter Device. Cards passed over a mercury-filled vat and pins dropped to touch the card. Pins passing through card holes touched the mercury, made electrical contact, and incremented counters. Seemingly clumsy, primitive, and slow, at the time it was high tech. Hollerith's system completed the 1890 census in three years. Hollerith left the Census Bureau in 1903, started his own company, and in 1911 merged with International Time Recording Company and Dayton Scale Company to create the Computing-Tabulating-Recording Company. In 1924, the name was changed to International Business Machines. Thomas Watson, Sr. became president and made IBM a household name.” From http://www.geocities.com/pattonhq/ibm.html

2. Spss manuals published in the 1970s refer to the program as “the Spss Batch System.”

3. The first research computer at Indiana University, a Cyber 3200 mainframe (containing vacuum tubes), crashed, on average, every 20 minutes. Three full time engineers from the manufacturer, Control Data Corporation, had offices in the machine room and were there to provide immediate service.


Because hardware was scarce and expensive, academic researchers were not permitted to store either data or commands on hard drives. Rather than files that could be accessed quickly, both data and source code were stored as decks of paper punch cards. By the end of the 1970s high speed modems (300 baud) and dumb terminals (aka “the glass keypunch”) began to replace iron keypunches and decks of cards. Small Spss command programs that were previously stored as a deck of paper cards could now be stored in a file on a hard drive, a device about as large as a top-loading automatic clothes washing machine. Large data files were stored on open reel tapes that were about 12 inches in diameter. Character based, line oriented editors¹ were viewed as a tremendous advance over typing, and retyping, cards on a keypunch.

Conversational statistical computing

Although simplified application languages such as Spss greatly reduced the amount of computer related information a social scientist needed to learn, many still considered the language difficult, and others wanted immediate results rather than waiting hours to see output from batch processing. This dissatisfaction with the batch style of statistical computing spawned papers and presentations at conferences touting the virtues of “conversational statistics,” also referred to as “interactive data analysis.” The opportunity to compute interactively was rare in the 1970s, but nonetheless the idea was promoted as if it were nirvana, and every social scientist who had no choice but batch processing was envious.² Today, this early idea of conversational statistics is embodied in programs such as Spss for Windows, Minitab, Data Desk, Jmp, Stata, and others.

1. At Indiana University, UCedit (University of Calgary editor), written in Fortran, was the standard and was viewed as a great time-saver compared to the keypunch.

2. See Nie, Norman H. (1980). Scss, a user’s guide to the Scss conversational system. New York: McGraw-Hill.

Interactive statistical computing is the preferred approach when:

• the analysis is performed only once, or at most, a few times,

• the data set is small,

• the number of Spss commands is small,

• little or no documentation of the variable names or values of variables is needed,

• little or no data manipulation is required,

• no audit trail of transformations is needed,

• all needed procedures are available from menus, and

• a single researcher is involved.

The interactive mode is the method of choice when the utility of a rapid result outweighs thoughtful analysis and an audit trail preserved in a command file.

The Macintosh was the first widely available microcomputer¹ with a graphical user interface (GUI) and received mostly favorable reviews. Pundits assured “the rest of us” that a GUI freed computer users from the necessity of learning any programming language and made the computer accessible to the most casual of users. Today, microcomputers with GUIs dominate the market, regardless of operating system (Mac OS, Windows, or Linux). For some applications, particularly graphic applications, GUIs represent a clear advantage over writing code. With other applications the advantages of a GUI may be illusory, for while GUIs are considered by many to be easy to learn, they are, nonetheless, often hard to use, especially when your goal requires the preservation of program logic.

Although card decks and batch processing may seem quaint today, the laboriousness of keypunching cards and the long interval between submitting a batch of cards and seeing the output enforced disciplined thinking and careful planning of algorithms before any code was written. These are the habits — not quick pick and click — that are essential for accurate data manipulation and statistical analysis.

Documenting commands and data

In many research projects data analysis is performed by various members of a research group. It is often necessary for individuals to determine where a specific piece of data (or variable) originated and all the calculations that were applied to it along the way. Equally critical is the ability to trace aggregations and the specific selection and/or calculation rules applied during the aggregation.

1. The first microcomputer with a graphical user interface was the Xerox Star, from which the Macintosh designers borrowed the interface that made the Mac famous. The latest version of the Macintosh operating system is based on a version of Unix and contains a command line. See http://www.geocities.com/SiliconValley/Office/7101/


Tracing the origin of data, reconstructing transformations, inferring calculation rules, and so forth verge on the impossible when a group of individuals uses interactive computing. For this reason the examples that follow show command files that are intended to be used with the Spss production mode (that is, batch processing).

Text editors

In this style of computing a text editor is used to write and save the Spss command files before they are “sent off” to Spss for execution. The adjective “text” means that the editor has such features as column moves, rulers, the ability to handle large files (say, ½ gigabyte¹ or larger), and, most importantly, the ability to store commands and data as ascii.² Some individuals use the editors included with Windows, Notepad or Wordpad, and are quite satisfied; others do not consider these editors adequate. The secret to making this style of computing painless is to use an editor that has the characteristics listed above and with which you feel comfortable. Individuals often start with one of the Windows editors and, when its limitations begin to chafe, seek an editor with more capabilities.

Fortunately a large number of robust and reliable editors are available at low cost. One such editor is Ultraedit.³ Numerous other programmer and text editors can be found at Simtel.⁴

1. A gigabyte is 2 to the 30th power (1,073,741,824) bytes. One gigabyte is equal to 1,024 megabytes. Gigabyte is abbreviated as G or GB.

2. American Standard Code for Information Interchange, pronounced ask-ee. Ascii is a code for representing English characters as numbers, with each character assigned a number from 0 to 127. For example, the Ascii code for uppercase “A” is 65, lowercase “a” is 97, and a “space” is 32. Data can be transferred easily from one computer or program to another provided that both computers and/or programs use Ascii to represent text. Text files stored in Ascii format are frequently referred to as Ascii files.

3. http://www.ultraedit.com/

4. http://www.simtel.net/

Configuring Spss production mode

When you installed the Spss program from the IU licensed CD, two separate Spss modes were placed on the hard drive, one for interactive data analysis (Spss 10 for Windows), and the other for batch style processing (Spss 10 Production Facility). Before using the production facility, open the initial screen and select Edit → Options. On this screen click the checkbox next to “Show Spss when running,” and click the radio button beside “Leave Spss open at the completion of job.” Near the top of the screen is “Editor for syntax files:” followed by a text box where you enter the filespec¹ for the text editor of your choice. If you are using Ultraedit and accept the installation defaults, then you would enter the following in this box: C:\Program Files\UltraEdit\Uedit32.exe.

Using the batch style

To compute in batch style, open the text editor of your choice, type in the Spss commands, and save the file. Start the Spss production facility and enter the name of the Spss command file in the dialogue box labeled “Syntax files” by clicking on the “Add” button and using the “Browse” button to find the command file. Then click on the “Run” button, symbolized as a right pointing triangle, and in a few moments the production facility will begin processing the command file. When Spss is finished the output will appear on the screen. You can inspect the results and/or error messages. The output screen is divided, with the results on the right and a navigation tree on the left. By clicking on certain objects in the navigation tree you can copy results from Spss and transfer them to other programs. Also, clicking on the tab labelled “data view” reveals the values of all variables in the active file, while clicking on “variable view” displays the current definitions for all variables.

The difficult part is knowing which commands to enter. Some individuals may write perfect programs the first time, but the rest of us usually encounter many errors. When this happens, read the error message, return to the editor and correct the offending code, save the command file, and resubmit. This cycle of editing, submitting the command file, and viewing error messages is repeated until the desired results are obtained. You are strongly encouraged to use a monospaced font such as Courier for all command files: with such a font you are more likely to spot a missing single quotation mark or an offending colon, even though it may not be the first font that comes to mind when weighing aesthetic features.

1. Short for “file specification,” a filespec includes the drive designator (a letter on Windows systems), followed by the directory and any sub-directories, and lastly the file name.
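As a small illustration of the parts of a filespec, the following sketch splits the Ultraedit path given above into its drive designator, directories, and file name. It is written in Python rather than Spss syntax, purely for illustration; the function name split_filespec is hypothetical and is not part of Spss or Ultraedit.

```python
from pathlib import PureWindowsPath


def split_filespec(filespec):
    """Split a Windows filespec into drive, directories, and file name."""
    path = PureWindowsPath(filespec)
    return {
        "drive": path.drive,                      # drive designator, e.g. "C:"
        "directories": list(path.parts[1:-1]),    # directory and sub-directories
        "file_name": path.name,                   # last component of the path
    }


# The default Ultraedit location quoted in the text above.
parts = split_filespec(r"C:\Program Files\UltraEdit\Uedit32.exe")
print(parts)
```

PureWindowsPath only parses the text of the path, so the sketch behaves the same on any operating system and does not check whether the file exists.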


To state one more time: Using this style of programming you do not interact with Spss directly. Instead, using a text editor, you type the Spss commands into a file that is saved as ascii. You then submit this file to Spss for processing. After Spss processes the commands, you inspect the output for warnings, error messages, and results.

Spss Documentation

Documentation for Spss version 10 is available on the installation CD. To read and/or print the documentation, start Acrobat Reader and open the pdf files on the installation CD. The file spssbase.pdf contains the syntax for many of the commands used for data manipulation and data analysis.

Data preparation and data analysis

Data preparation refers to various manipulations that are needed before data can be analyzed, and consists of tasks such as checking for “impossible” values,¹ looking for missing values, sorting data, converting character variables to numbers or numeric variables to characters, recoding values, assigning missing values, converting percentages to actual numbers, assigning variable labels and value labels, generating an Spss “system file,” and so forth.
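Two of these tasks, checking for “impossible” values and assigning missing values, can be sketched in a few lines. The sketch below uses Python rather than Spss syntax, and everything in it is an assumption made for illustration: the records are invented, the coding of sex as 0 and 1 is only a hypothetical scheme, and the codes 98 and 99 are assumed to mean “don't know” and “no answer.”

```python
# Hypothetical coding scheme, for illustration only.
VALID_SEX = {0, 1}        # 0 = male, 1 = female
MISSING_AGE = {98, 99}    # assumed "don't know" / "no answer" codes


def find_impossible(records, field, valid_values):
    """Return (row number, value) pairs whose value is not an accepted code."""
    return [
        (row, rec[field])
        for row, rec in enumerate(records, start=1)
        if rec[field] not in valid_values
    ]


def assign_missing(records, field, missing_codes):
    """Return copies of the records with missing-data codes replaced by None."""
    return [
        {**rec, field: None if rec[field] in missing_codes else rec[field]}
        for rec in records
    ]


# Invented records; row 3 contains a likely data entry error for sex.
sample = [
    {"id": 1, "sex": 0, "age": 34},
    {"id": 2, "sex": 1, "age": 99},
    {"id": 3, "sex": 7, "age": 51},
]

bad_rows = find_impossible(sample, "sex", VALID_SEX)
cleaned = assign_missing(sample, "age", MISSING_AGE)
print(bad_rows)
print(cleaned)
```

The functions return new lists rather than changing the original records, so the raw data remain available for comparison, in keeping with the audit trail the chapter argues for.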

In order to answer a research question, data may be acquired from more than one data base and/or one data source. For example, suppose you wish to see if a relationship exists between single parent households and student academic performance in school corporations. The data needed to answer this question for corporations in the state of Indiana reside in two separate data bases on the web site maintained by the Indiana Department of Education (http://ideanet.doe.state.in.us). Before a researcher can investigate whether this relationship exists, the appropriate variables from the two separate data bases must be downloaded and merged.

1. Suppose the variable sex were coded 0 for male and 1 for female. If values other than 0 or 1 appear, the likelihood that a data entry error occurred is high.

To merge two or more files, a common variable present in each file must be identified. For this example, the common variable is the corporation code. Is this variable of the same type (numeric or character) in both files? If not, one or both of the data files must be manipulated so that the variable is the same in each. If the variable is of type character, is it the same length in both files? If not, someone must make it so. Further, before files can be merged, they must be sorted on the common variable, and all files to be merged must first be converted to Spss system files. This is an example of the data manipulation that precedes data analysis.
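The merge steps just described (make the common variable the same type and length, sort, then match) can be sketched as follows. This is a Python illustration under stated assumptions, not Spss syntax: the field names corp, pct_single_parent, and pass_rate are hypothetical, the four-character code width is assumed, and the values are invented rather than taken from the Indiana Department of Education data.

```python
def normalize_code(code, width=4):
    """Convert a corporation code to a character string of fixed length."""
    return str(code).strip().zfill(width)


def merge_on_code(households, scores, width=4):
    """Match records from two files on the normalized corporation code."""
    # Give the common variable the same type and length in the first file,
    # then sort on it. (The dictionary lookup below does not need sorted
    # input; the sort is kept to mirror the step described in the text.)
    left = sorted(
        ({**rec, "corp": normalize_code(rec["corp"], width)} for rec in households),
        key=lambda rec: rec["corp"],
    )
    # Index the second file by the normalized code.
    right = {normalize_code(rec["corp"], width): rec for rec in scores}

    merged = []
    for rec in left:
        match = right.get(rec["corp"])
        if match is not None:          # keep only codes present in both files
            merged.append({**match, **rec})
    return merged


# Invented example data: numeric codes in one file, character codes in the other.
households = [
    {"corp": 5385, "pct_single_parent": 30.1},
    {"corp": 15, "pct_single_parent": 21.5},
]
scores = [
    {"corp": "5385", "pass_rate": 61.0},
    {"corp": "0015", "pass_rate": 72.4},
    {"corp": "9999", "pass_rate": 50.0},
]

merged = merge_on_code(households, scores)
print(merged)
```

Records whose code appears in only one file are dropped here; whether to keep or drop such records is a design decision that should be made deliberately and documented in the command file.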

Data analysis occurs only after the data preparation stage is completed. In many statistics classes students are presented with “clean” data, that is, files that have been prepared for analysis by removing anomalies. When students collect their own data, or attempt an analysis of an existing dataset, they discover the time-consuming nature of data preparation.

Data Code books

If you are working with a data set collected by others and have no knowledge of its contents, look first at the code book (also referred to variously as a “data dictionary” or a “record layout”). Ideally, a code book contains the names of all variables in the file, a description of each variable, the columns in the file occupied by each variable, the range of acceptable values for each variable, a copy of the questionnaire used to collect the data, the number of records in the file, and a brief statement about how the data were collected. The following table contains a code book for a subset of variables from the 1994 General Social Survey. This survey is conducted yearly by the National Opinion Research Center, a nonprofit organization within the University of Chicago.


It is always a good idea to use a text editor to look at the data file, particularly when

Table 1-2: General Social Survey, 1994. Selected Variables

Variable Columns Description / Values

martial 1-1 1 married2 widowed3 divorced4 separated5 never married9 missing data

agewed 2-3 Age of first marriage98 Don’t know99 no answer

divorce 4-4 Ever been divorced or separated1 yea2 no9 no answer

sibs 5-6 Number of brothers and sisters98 don’t know99 missing

childs 7-7 Number of childern8 eight or more9 no answer

age 8-9 age of respondent98 don’t know99 no anwer

educ 10-11 Highest year of school completed97 don’t know99 no answer

paeduc 12-13 Father’s highest year of school

maeduc 14-15 Mother’s highest year of school

speduc 16-17 Spouse’s highest year of school

degree 18-18 Respondents highest degree 0 less than high school1 high school2 junior college3 bachelor4 graduate7 not applicable8 dont know9 no answer

Page 11: Chapter 1: Introduction Statistical Program for Social …educy520/sec6342/week_08/michael_spss...Statistical Program for Social Scientists The Statistical Program for Social Scientists

Statistical Program for Social Scientists 11 • • • •••

padeg 19-19 Fathers highest degree0 less than high school1 high school2 junior college3 bachelor4 graduate7 not applicable8 dont know9 no answer

madeg 20-20 Mothers highest degree0 less than high school1 high school2 junior college3 bachelor4 graduate7 not applicable8 dont know9 no answer

spdeg 21-21 Mothers highest degree0 less than high school1 high school2 junior college3 bachelor4 graduate7 not applicable8 dont know9 no answer

sex 22-22 Respondents sex1 male2 female

race 23-23 Respondent race1 white2 black3 other

region 24-24 Region of interview1 New England2 Middle Atlantic3 East North Central4 West North Central5 South Atlantic6 East South Central7 West South Central8 Mountain9 Pacific

Table 1-2: General Social Survey, 1994. Selected Variables

Variable Columns Description / Values

Page 12: Chapter 1: Introduction Statistical Program for Social …educy520/sec6342/week_08/michael_spss...Statistical Program for Social Scientists The Statistical Program for Social Scientists

12 Chapter 1 • • • •••

partyid 25-25 Political party affiliation0 strong democrat1 not strong democrat2 independent, near democrat3 independent4 independent, near republican5 not strong republican6 strong republican7 other party8 don’t know9 no answer

colath 26-26 Allow anti-religionist to teach?1 allowed2 not allowed3 don’t know4 no answer

gunlaw 27-27 Favor or oppose requiring gun permits1 favor2 oppose8 don’t know9 no answer

courts 28-28 View on how courts are dealing with criminals1 too harsh2 not harsh enough3 about right8 don’t know9 no answer

relig 29-29 Respondent’s religious preference1 Protestant2 Catholic3 Jewish4 None5 Other8 don’t know9 no answer

happy 30-30 General happiness1 very happy2 pretty happy3 not too happy8 don’t know9 no answer

Table 1-2: General Social Survey, 1994. Selected Variables

Variable Columns Description / Values

Page 13: Chapter 1: Introduction Statistical Program for Social …educy520/sec6342/week_08/michael_spss...Statistical Program for Social Scientists The Statistical Program for Social Scientists

Statistical Program for Social Scientists 13 • • • •••

health 31-31 Condition of health1 excellent2 good3 fair4 poor8 don’t know9 no answer

helpful 32-32 People helpful or looking out for selves1 helpful2 look out for self3 depends8 don’t know9 no answer

fair 33-33 People fair or try to take advantage1 take advantage2 fair3 depends8 don’t know9 no answer

trust 34-34 Can people be trusted?1 can trust2 cannot trust3 depends

satjob 35-35 Satisfaction with job or housework1 very satisfied2 moderately satisfied3 a little dissatisfied4 very dissatisfied8 don’t know9 no answer

satfin 36-36 Satisfaction with financial situation1 very satisfied2 more or less satisfied3 not at all satisfied8 don’t know9 no answer

fework 37-37 Should women work1 approve2 disapprove8 don’t know9 no answer

Table 1-2: General Social Survey, 1994. Selected Variables

Variable Columns Description / Values

Page 14: Chapter 1: Introduction Statistical Program for Social …educy520/sec6342/week_08/michael_spss...Statistical Program for Social Scientists The Statistical Program for Social Scientists

14 Chapter 1 • • • •••

fepres 38-38 Would you vote for a woman for President?1 yes2 no5 wouldn’t vote

abdefect 39-39 Should a woman be allowed to have an abortion if there is a strong chance of a serious defect?1 yes2 no8 don’t know9 no answer

abnomore 40-40 Should a woman be allowed to have an abortion if she is married and wants no more children?1 yes2 no8 don’t know9 no answer

abhlth 41-41 Should a woman be allowed to have an abortion if her health is seriously endangered?1 yes2 no8 don’t know9 no answer

abpoor 42-42 Should a woman be allowed to have an abortion if she has low income and cannot afford more children?1 yes2 no8 don’t know9 no answer

abrape 43-43 Should a woman be allowed to have an abortion if she is pregnant as a result of rape?1 yes2 no8 don’t know9 no answer

Table 1-2: General Social Survey, 1994. Selected Variables

Variable Columns Description / Values

Page 15: Chapter 1: Introduction Statistical Program for Social …educy520/sec6342/week_08/michael_spss...Statistical Program for Social Scientists The Statistical Program for Social Scientists

Statistical Program for Social Scientists 15 • • • •••

absingle 44-44 Should a woman be allowed to have an abortion if she is not married?1 yes2 no8 don’t know9 no answer

chldidel 45-45 Ideal number of children7 seven or more8 as many as want9 don’t know / no answer

premarsx 46-46 Sex before marriage1 always wrong2 almost always wrong3 sometimes wrong4 not wrong at all8 don’t know9 no answer

income 47-48 Total family income1 under $1,0002 $1,000 - $2,9993 $3,000 - $3,9994 $4,000 - $4,9995 $5,000 - $5,9996 $6,000 - $6,9997 $7,000 - $7,9998 $8,000 - $9,9999 $10,000 - $14,49910 $15,000 - $19,999 11 $20,000 - $24,49912 $25,000 or more13 refused to answer98 don’t know99 no answer

colhomo 49-49 Allow homosexual to teach?4 allowed5 not allowed8 don’t know9 no answer

hapmar 50-50 Happiness of marriage1 very happy2 pretty happy3 not too happy8 don’t know9 no answer

Table 1-2: General Social Survey, 1994. Selected Variables

Variable Columns Description / Values

Page 16: Chapter 1: Introduction Statistical Program for Social …educy520/sec6342/week_08/michael_spss...Statistical Program for Social Scientists The Statistical Program for Social Scientists

16 Chapter 1 • • • •••

coneduc 51-51 Confidence in education1 a great deal2 only some3 hardly any8 don’t know9 no answer

conmedic 52-52 Confidence in medicine1 a great deal2 only some3 hardly any8 don’t know9 no answer

anomia5 53-53 Lot of the average man getting worse1 agree2 disagree8 don’t know9 no answer

fear 54-54 Afraid to walk at night in neighborhood1 yes2 no8 don’t know9 no answer

burglr 55-55 Home broken into during the last year1 yes2 no8 don’t know9 no answer

robbry 56-56 Forcefully robbed during the last year1 yes2 no8 don’t know9 no answer

owngun 57-57 Have gun in home1 yes2 no3 refused to answer8 don’t know9 no answer

Table 1-2: General Social Survey, 1994. Selected Variables

Variable Columns Description / Values

Page 17: Chapter 1: Introduction Statistical Program for Social …educy520/sec6342/week_08/michael_spss...Statistical Program for Social Scientists The Statistical Program for Social Scientists

Statistical Program for Social Scientists 17 • • • •••

polviews 58-58 Think of self as liberal or conservative?: 1 = extremely liberal; 2 = liberal; 3 = slightly liberal; 4 = moderate; 5 = slightly conservative; 6 = conservative; 7 = extremely conservative; 8 = don’t know; 9 = no answer

cappun 59-59 Favor or oppose death penalty for murder: 1 = favor; 2 = oppose; 8 = don’t know; 9 = no answer

fehome 60-60 Women should take care of home, not the country: 1 = agree; 2 = disagree; 8 = not sure; 9 = no answer

fepol 61-61 Women not suited for politics: 1 = agree; 2 = disagree; 8 = not sure; 9 = no answer

sexeduc 62-62 Sex education in public schools: 1 = favor; 2 = oppose; 3 = depends; 8 = don’t know; 9 = no answer

tvhours 63-64 Hours per day watching television: 98 = don’t know; 99 = no answer

helpsick 65-65 Should govt help pay for medical care?: 1 = govt should help; 3 = agree with both; 5 = people should help selves; 8 = don’t know; 9 = no answer


zodiac 66-67 Respondent’s zodiac sign: 1 = Aries; 2 = Taurus; 3 = Gemini; 4 = Cancer; 5 = Leo; 6 = Virgo; 7 = Libra; 8 = Scorpio; 9 = Sagittarius; 10 = Capricorn; 11 = Aquarius; 12 = Pisces; 98 = don’t know; 99 = no answer

colrace 68-68 Allow racist to teach?: 4 = allow; 5 = not allow; 8 = don’t know; 9 = no answer

colmil 69-69 Allow militarist to teach?: 4 = allow; 5 = not allow; 8 = don’t know; 9 = no answer

drunk 70-70 Ever drunk too much?: 1 = yes; 2 = no; 8 = don’t know; 9 = no answer

abany 71-71 Should a woman be allowed to have an abortion if she wants it for any reason?: 1 = yes; 2 = no; 8 = don’t know; 9 = no answer

case 72-75 case number


it is a plain ascii format. The first 20 rows of the data that correspond to this code book follow:

         1         2         3         4         5         6         7
1234567890123456789012345678901234567890123456789012345678901234567890

3 42331212109911092220 212 2122211 54 9 21 32221 21 71202 43591299121219111225 221 2321229 3113 121 41112 3111 21282 22991299991209902222912123 1211929292241043 212251221 1 394 11 1 6359 899 5 909011121 223 2122311 2310 133 31112 32 55 3021139912991919222049213221223 111212 94 231112262 3 655225 30221599209919192227519522 432122121283124 1112222222 0 454 22 2 5440 912999901992320412233 231122121222114 12224222110 255 25 812512 8 09910091220 151 2211112 53 9 31 62221 41 12 2 26411299119919092220413422 331111111144124 122239222 2 349 13 514512 6 6991009222141212321213 111111 105 2212 241 2 755 15 99052 099209909492125 212 1212212 2213 31 42221 22 55 10311914169941392121 232 2221111 2412 11 52221 22 71292 425516 0 01230011126 251 1121211 2412 221 61221 21121222 1256161216183134112041211212121 111111 12411121 232 1 544 15 40361819149944291122412222 211111111124124 122241221 1 744 15 20251616189933492120412222 321111111124124 122232221 1 444 11322 02521813162041342120 232 1312111 2212 221 41221 03 9 21 2 02391616141633132121412212 1321111111241242 122229221 3 354 15 20711220129914191125512234 21211111189134 1122141211 3 145 11 2 20361616141633231120413421 2211111111241242 122232221 2 445 1

The first two rows indicate the column numbers and were typed in to aid in determining which columns are occupied by numbers. They should be removed before the data are read into Spss for processing. Compare these numbers to the code book specification and satisfy yourself that the data do seem to be in the correct columns. The values for the variable “case” are not shown. Other code books are available on the web for inspection.1

1. See the Common Core of Data code book at http://nces.ed.gov/ccd/

Anatomy of an Spss command file

Beginners should adhere to the following order of commands to avoid unanticipated problems. The basic steps that command files follow are:

• read some data

• document the data

• apply transformations to the data, if needed

• analyze the data.

To illustrate these steps, the simple file that appears below is used as an example. The first line is a comment showing where this command file is stored. The ‘c:\...\filename-sps’ convention is used throughout. (A brief overview of MS-Dos commands is included at the end of this chapter.) The three dots should be replaced with the specific names of the directory and subdirectories (i.e., the “path”).

* File: c:\...\fat-sps .
* Your name and date .
set header=on messages=on printback=on errors=on results=on.
title 'spss frequency run - fat data' .

/* Read data in external file -------------------------------- */
data list file = 'c:\...\fatdata.dat' fixed records = 1 table
   /1 sex 1-1 bldprs 3-6 wgt 8-11
      satfat 13-16 exday 18-21 smoke 23-23 .

/* Data definition ------------------------------------------ */
variable labels bldprs 'amt above normal blood pressure'
   / wgt 'kilograms overweight'
   / satfat 'avg # grams saturated fatty foods consumed'
   / exday 'avg # minutes exercise per day' .

value labels sex 1 'female'
                 2 'male'
   / smoke 0 'no' 1 'yes' .
missing values sex (9) exday (99) .
formats sex smoke (f2.0) .

/* Analysis ------------------------------------------------- */
frequencies variables = all
   /statistics = all .

means tables = bldprs wgt satfat exday by sex .

correlation variables = bldprs wgt exday .

ttest group = sex(1,2)
   / variable = satfat .

crosstabs tables = sex by smoke
   / statistics = chisq .

Spss commands start in column one. Subcommands — words that follow the forward slash (/) — must be placed in a column other than the first. The default command terminator for Windows versions of Spss is the period.


Comments

The first line of this example command file is a comment that documents the location of the command file. There are several ways to place comments in Spss files. One is to place an asterisk in column one. All text that follows to the next period is interpreted as a comment. Another way to place a comment is to use the /* comment */ convention. Everything between the slash-asterisk and asterisk-slash is a comment. Comments delimited in this manner can be placed on the same line as a command, after the command terminator (i.e., a period).

Set

The options on the set command instruct Spss to show the commands on the output. This command should appear in every Spss command file you write; simply copy from a previous file to the current one. The title command enables you to display a descriptive title for the job.

Reading data from an external ascii file

Although Spss can read data stored in Excel spreadsheets, dbase database files, and other data sources, this example focuses on reading data from an ascii file in which the data are in specified columns. You should look up the complete specification of the data list command in the file spssbase.pdf.

/* Read data in external file -------------------------------- */
data list file = 'c:\...\fatdata.dat' fixed records = 1 table
   /1 sex 1-1 bldprs 3-6 wgt 8-11
      satfat 13-16 exday 18-21 smoke 23-23 .

Data list

The command begins in column one with the words “data list.” The “file = filespec” tells Spss where to find the data file. The filespec must be enclosed in single quotation marks. The word “fixed” indicates that the data will be in the columns as specified. If values are not present, Spss interprets the blanks as system missing values. The word “records” indicates the number of data rows that make up one case, that is, all the data for a single person.1 Although lines of data can be hundreds of columns in length, for ease of viewing on monitor and paper, lines of data are usually limited to around 70 columns. Suppose, for example, that the data for one case require 350 columns, or five rows of 70 columns each. To read these data, “records = 5” would indicate that five lines of data constitute each case. The word “table” tells Spss to print back a table that shows how each variable was read. After submitting the command file, a good practice is to compare this table to the code book and be sure that the beginning and ending columns agree for each variable. The “/1” indicates the beginning of variable names on record one. If this example contained a second record per case, somewhere you would see “/2” followed by the variable names that appear on the second record.

1. This assumes that a single individual is the unit of analysis. If we are looking at classrooms rather than individuals, then the classroom is the case.
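To make the column specifications concrete, the following Python sketch slices one record the way the data list command above describes. Python is used here only for illustration and is not part of Spss; the function name `read_fixed` and the sample record are invented for this example and are not taken from fatdata.dat.

```python
# Column layout copied from the data list command (1-based, inclusive).
LAYOUT = {
    "sex": (1, 1),
    "bldprs": (3, 6),
    "wgt": (8, 11),
    "satfat": (13, 16),
    "exday": (18, 21),
    "smoke": (23, 23),
}


def read_fixed(line, layout=LAYOUT):
    """Slice one record into variables; a blank field becomes None,
    which stands in here for a system missing value."""
    case = {}
    for name, (start, end) in layout.items():
        field = line[start - 1:end].strip()
        case[name] = float(field) if field else None
    return case


# Hypothetical 23-column record: exday (columns 18-21) is left blank.
sample = "1 12.5 03.0 20.0" + " " * 6 + "1"
print(read_fixed(sample))
```

Comparing the printed dictionary with the code book serves the same purpose as checking the table that Spss prints back.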

Variable names

Variable names in Spss must begin with a letter and be no longer than eight characters. Variable names may contain letters, numbers, the period, and the underscore. The dollar sign ($) and octothorpe (#) may also be used within variable names. However, these symbols have special meaning if they are used at the beginning of a variable name.
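The naming rules just stated can be written as a single pattern. The Python sketch below checks only the rules given in this section, reading the second special symbol as the number sign (#), which is an assumption; `is_valid_name` is a hypothetical helper, and Spss itself may accept or reject names that this check does not.

```python
import re

# Start with a letter; at most eight characters in total; the remaining
# characters may be letters, digits, period, underscore, $ or #.
NAME_RULE = re.compile(r"[A-Za-z][A-Za-z0-9._$#]{0,7}")


def is_valid_name(name):
    """Return True when the name satisfies the rules stated in the text."""
    return NAME_RULE.fullmatch(name) is not None


for candidate in ["bldprs", "wgt_kg", "2cool", "toolongname"]:
    print(candidate, is_valid_name(candidate))
```

Names that begin with $ or # are rejected by this sketch because the text reserves a special meaning for them.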

Spss variables are of two types: numeric and string (or character or alphanumeric). Variables are assumed to be numeric unless an “a” in parentheses follows the column numbers. Let us assume that values for sex were recorded as “M” and “F.” The following code shows how to read them as string, or alphanumeric, variables:

/* Read data in external file -------------------------------- */
data list file = 'c:\...\fatdata.dat' fixed records = 1 table
   /1 sex 1-1 (a) bldprs 3-6 wgt 8-11
      satfat 13-16 exday 18-21 smoke 23-23 .

The preceding discussion covers the essentials of reading data in fixed columns from an external data file. A critical step in reading a new data file is to assure yourself that the data have, indeed, been read as you expect. One way to check was mentioned previously: compare the data list printback table to the code book. An additional check is to list the values of the variables as follows:

List variables

list variables = sex bldprs wgt satfat exday smoke
   / cases = from 1 to 100 .

This command will list values for the variables in the first 100 cases. Instead of typing each individual variable name, you could use the Spss reserved word “all” which would be typed after the equal sign. Or, if you wished to examine the values of the last four variables, you could type wgt to smoke. “To” is another reserved word that Spss interprets as “all the variables between” the two that were named. To improve readability, be sure that variable names on multiple lines are aligned.
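The meaning of “to” can be shown with a short Python sketch: given the order in which the variables were defined, it returns every variable from the first name through the second. The helper `expand_to` is hypothetical and only mimics the behavior described above.

```python
# Variables in the order they appear on the data list command.
VARIABLES = ["sex", "bldprs", "wgt", "satfat", "exday", "smoke"]


def expand_to(first, last, variables=VARIABLES):
    """Return all variables from first through last, in file order."""
    i, j = variables.index(first), variables.index(last)
    if i > j:
        raise ValueError("the first variable must precede the last one")
    return variables[i:j + 1]


print(expand_to("wgt", "smoke"))  # ['wgt', 'satfat', 'exday', 'smoke']
```

Because the expansion depends on file order, reordering the variables on the data list command changes what a “to” range selects.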

Data definition

This step consists of adding variable labels and value labels, declaring missing values, and recoding existing values if needed.


Variable labels

Because Spss variable names are limited to eight characters, variable labels provide a way of associating a longer (up to 120 characters), more informative description with a variable name, although some statistical procedures will truncate long labels to 40 characters on the listing of results.

/* Data definition ------------------------------------------ */
variable labels bldprs 'amt above normal blood pressure'
   / wgt 'kilograms overweight'
   / satfat 'avg # grams saturated fatty foods consumed'
   / exday 'avg # minutes exercise per day' .

value labels sex 1 'female'
                 2 'male'
   / smoke 0 'no' 1 'yes' .
missing values sex (9) exday (99) .
formats sex smoke (f2.0) .

The general procedure is to write the first variable name, followed by the label enclosed in single quotation marks. The forward slash signifies the next variable name and its label is enclosed in single quotation marks. This pattern is repeated until finished and the command is terminated with a period. A long label can be stretched over two lines by placing a plus sign at the end of the first line.

Value labels

Value labels are indispensable for documenting the meaning of each distinct value for a variable. In the preceding example, we can see that for the variable “sex” a numeric value “1” means “female” and the numeric value “2” means “male.” If you were to return to this command file six months from now you might not remember whether 1 was male or female if there were no value label.

Value labels can be up to 60 characters in length but will be truncated by many statistical procedures.

The syntax of value labels for character variables differs slightly. If sex were read as an alphanumeric variable, then the value label would be:

value labels sex 'F' 'female'
                 'M' 'male'
   / smoke 0 'no' 1 'yes' .
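Conceptually, a set of value labels is a lookup table from stored values to descriptive text. The Python sketch below mirrors the numeric labels from the example command file; the name `label_for` is invented for illustration, and falling back to the raw value when no label exists is an assumption about how one might display unlabeled values.

```python
# Labels copied from the value labels command in the example file.
VALUE_LABELS = {
    "sex": {1: "female", 2: "male"},
    "smoke": {0: "no", 1: "yes"},
}


def label_for(variable, value):
    """Return the label for a value, or the value itself as text
    when no label has been defined."""
    return VALUE_LABELS.get(variable, {}).get(value, str(value))


print(label_for("sex", 1))    # female
print(label_for("smoke", 0))  # no
print(label_for("sex", 3))    # 3
```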


Missing values

The missing values command enables the user to designate certain values as missing for specific variables. In the preceding example code, Spss will treat the number 9 as missing for sex and 99 as missing for exday. Blanks in the data file for these variables will be treated as “system missing” and appear as periods in the output of the list variables procedure.
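The effect of declaring missing values can be sketched in Python: declared codes and blanks are both set aside before a statistic is computed. The helper `valid_values` and the sample list are invented for this illustration, with `None` standing in for a system missing value.

```python
# Codes declared on the missing values command in the example file.
USER_MISSING = {"sex": {9}, "exday": {99}}


def valid_values(variable, values):
    """Drop system missing values (None) and declared missing codes."""
    declared = USER_MISSING.get(variable, set())
    return [v for v in values if v is not None and v not in declared]


# Hypothetical exday values: 99 is a declared code, None was a blank field.
exday = [30, 99, None, 45]
valid = valid_values("exday", exday)
print(valid)                    # [30, 45]
print(sum(valid) / len(valid))  # 37.5
```

Without the declaration, the code 99 would be averaged in as if it were 99 minutes of exercise, which is why this step precedes any analysis.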

Formats

The formats command controls how values appear in the output. If you omit this command, values for sex will be listed as 1.00 or 2.00. The formats command in this example displays the values as integers rather than floating point numbers. The formats command can also be used to display values as dollar amounts, dates, comma-separated numbers, and times.
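A format such as f2.0 can be read as “width 2, 0 decimal places.” The Python sketch below imitates that idea with Python's own fixed-point formatting; `format_f` is a hypothetical helper, and the correspondence to Spss output is an analogy rather than a statement of how Spss is implemented.

```python
def format_f(value, width, decimals):
    """Render a number right-aligned in a field of the given width
    with the given number of decimal places."""
    return f"{value:{width}.{decimals}f}"


print(repr(format_f(1, 4, 2)))  # '1.00'
print(repr(format_f(1, 2, 0)))  # ' 1'
```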

Display dictionary

After you complete all data definitions, one way to check the results of these definitions is to include the following command:

display sorted dictionary / variables = all .

This lists all variables, variable labels, value labels, missing values, and formats.

Analysis procedures

To this point the discussion has focused on the preparation of data for analysis. Except for textbook examples, the commands needed to prepare data for analysis often require several pages to print. These commands are often complex and distracting. To hide this large number of commands, analysts usually save the file, at the conclusion of data preparation, in a format known as an Spss “system file.” Such files are in binary format and contain not only the data in a non-redundant form, but also all the variable and value labels, the recodes, formats, and other transformations. The following example shows how to save a raw data file as an Spss system file:

/* Save data, definitions, & transformations in system file ----- */
save outfile = 'c:\...\fatdata.sav' .

System files require much less effort to read.


Spss’ reason for existence is to perform statistical analysis, and so the program contains a large number of procedures that can be used to analyze data. This introductory example lists several, each of which is discussed in the Spss manual. We will devote considerable attention to the individual commands used to perform statistical analyses.

Operating System Commands

This section is strictly optional. You can accomplish all of these tasks by using Windows Explorer.

Regardless of the operating system you are using (e.g., Mac OS, Linux, or one of the variants of Windows), the ability to perform certain operations will be useful. These include creating a directory, changing directories, copying files, printing the names of all files in a directory, using wildcards, and so forth. All of these tasks can be accomplished via the Windows Explorer program if you are sufficiently adroit; they can also be accomplished via the command line.

To access the command line on a Windows machine, look at start → program → “command prompt” or “MS-Dos prompt.” At the command line you will see a prompt that looks similar to the following:

c:\>

You can exit the command window by typing “exit” at the c:\> prompt.

Help

Information about MS-Dos commands is obtained by typing the command, forward slash, and a question mark. For example, if you wish to see all options available for the directory listing command, enter the following at a prompt:

c:\> dir/?

More than a screenful of information will be displayed and, unless you are a graduate of a speed reading institute, you will not be able to read the beginning of this indubitably helpful message. To overcome this problem, simply redirect the message to a file. The redirection symbol for both MS-Dos and Unix is the greater than sign (>). Redirecting the output of the previous command to a file looks like this:

c:\> dir/? > dirhelp.txt

Now, the helpful messages about the directory command are stored in an ascii file named “dirhelp.txt.” The contents of this file can be viewed and printed with any text editor. Or, the more command can be used to display one screen of information at a time from this file or any ascii file:

c:\> more < dirhelp.txt

On Unix systems, “man” is the abbreviation for “manual” and is used to display help about any Unix command. Suppose you wanted information about “ls,” the command used to obtain a directory listing. You would enter the following:

% _ man ls > manls.txt

Then you could use the more command as shown in the next line to view the contents of the manls.txt file. Note closely that on Unix systems the redirection symbol is not used with the more command:

% _ more manls.txt

So far we have discussed how to obtain help about individual commands on MS-Dos and Unix systems. From this point onward the discussion will be limited to the former.

Directories

Directories are used as a means of organizing presumably related files. You might consider creating a directory named “y590” to store all materials related to this class. Under, or below, this directory you might create further subdirectories such as “sps” (for all Spss command files that you write), “data” (for all raw data files you download and/or create for this course), “sav” (all Spss system files), a directory named either “temp” or “work” (contains intermediate, temporary files created by some Spss routines), “xls” (Excel files, if we progress sufficiently to learn to read Excel files into Spss), “readings,” “handouts,” “homework,” and so forth.


The preceding paragraph discusses a two-level directory: y590 and each of the subdirectories underneath it. Additional sub-subdirectories can be added beneath each of the first level subdirectories; third level subdirectories can be added, fourth level subdirectories can be added, and so forth.

Next we discuss the terminology related to directories and the commands used to create them and change from one level of subdirectory to another.

Drive designator

The letter “c:” is the drive designator. Traditionally, the letters “a:” and “b:” were used to designate floppy drives, “c:” for the hard drive, and the remaining letters for other drives and devices. For example, “e:” might be the designator for a second hard drive, “d:” might designate a zip drive, and “f:” a CD-Rom. However, almost any letter can be associated with any device. The convention of a drive designator is unique to MS-Dos systems. Mac and Unix do not use this convention.

The backward slash (on Unix systems, and thus presumably also on the newest Mac OS, a forward slash is used instead) serves to delimit the directories and subdirectories. The single backslash indicates the “root” or “top-level” directory, depending on your perspective. Keep in mind that you are at the root (or highest or first) level when you see the prompt with a single backward slash:

c:\>

Making directories

Suppose you wish to make a directory named “y590” and a subdirectory beneath it named “sps.” The MS-Dos command for creating directories is “mkdir” and a shortened version “md” also works. The following will create the first level directory:

c:\> md y590

Changing directories

To change to the directory you just created, use the change directory command, “chdir” or “cd”:

c:\> cd y590 (the prompt you will see follows)
c:\y590>


Notice that the directory name is now displayed as part of the prompt. Next we create the subdirectory for Spss command files:

c:\y590> md sps
c:\y590> cd sps
c:\y590\sps>

Notice that the prompt now contains the names of the first level and second level directories. We can continue this process of creating subdirectories and using the “cd” command to “walk” down the path. There is a limit to the number of characters permitted in a path, but you will likely not reach that limit in the work for this class.

To reverse your course on the path and return to the next higher level directory, enter the following:

c:\y590\sps> cd ..
c:\y590>

Every time you issue “cd ..” you move one level up the path. If you wish to jump directly to the root level, simply type “cd\”. From the root level you can go directly to the subdirectory by entering:

c:\> cd \y590\sps
c:\y590\sps>

These simple commands can be used to create and navigate directories. Several shortcuts and options exist but will not be discussed here.

Removing directories

Suppose you made a typing error while creating a directory as follows:

c:\y590> md spa
c:\y590> cd spa
c:\y590\spa>

You can remove the directory by first changing to the directory one level higher than the one you wish to remove and then using the “rmdir” or “rd” command:

c:\y590\spa> cd ..
c:\y590> rd spa
c:\y590>
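For readers who prefer a scripted equivalent, the Python sketch below repeats the make, change, and remove steps with the standard `pathlib` module. It works inside a temporary directory rather than on c:\, so the location is an assumption made to keep the example harmless; the directory names follow the text.

```python
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as root:
    course = Path(root) / "y590"
    course.mkdir()                  # like: md y590

    sps = course / "sps"
    sps.mkdir()                     # like: md sps
    assert sps.parent == course     # "cd .." from sps leads back to y590

    typo = course / "spa"
    typo.mkdir()                    # the typing error: md spa
    typo.rmdir()                    # like: rd spa (directory must be empty)

    remaining = sorted(p.name for p in course.iterdir())

print(remaining)  # ['sps']
```

As with “rd,” `Path.rmdir` removes only an empty directory.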

Directory

The directory command, “dir,” is used to list the names of the files and the directories one level below the current directory:

c:\y590> dir


You can also use wildcards to select certain files. The two wildcard characters are the asterisk and the question mark. The following command will match any file with the .sps extension, regardless of the number of characters in the file name:

c:\> dir *.sps

The question mark is used to match any single character. In this example, the wildcard matches any letter used as the first letter in the file name extension and “ps” as the last two letters:

c:\> dir *.?ps
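The two wildcard patterns can be tried out in Python with the standard `fnmatch` module, which uses the same asterisk and question mark conventions. The file names below are invented for illustration, and MS-Dos may differ from `fnmatch` in edge cases not covered here.

```python
from fnmatch import fnmatch

# Hypothetical directory contents.
names = ["fat.sps", "fat.dat", "fat.sav", "notes.txt", "plot.eps"]

# The asterisk matches any number of characters before the extension.
print([n for n in names if fnmatch(n, "*.sps")])  # ['fat.sps']

# The question mark matches exactly one character in the extension.
print([n for n in names if fnmatch(n, "*.?ps")])  # ['fat.sps', 'plot.eps']
```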

If you wish to print the names of all the files in a directory, do the following:

c:\> dir > filenames.txt

Rather than listing the file names on the screen, they are redirected to a file named “filenames.txt” that can be opened with your text editor of choice and printed. If you wish to print the names of all files in all subdirectories below the current level:

c:\> dir /s > filenames.txt

The “/s” option means “list the file names in all subdirectories.” The redirection symbol sends the names to “filenames.txt,” which can then be printed.
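A rough Python counterpart to listing every file in all subdirectories and saving the list is sketched below. It uses `os.walk` on a temporary directory that the example creates itself; the helper `list_files` and the placeholder files are invented for illustration, and the output is a bare list of relative paths rather than the fuller report that “dir” produces.

```python
import os
import tempfile
from pathlib import Path


def list_files(top):
    """Return the paths of all files under top, relative to top."""
    names = []
    for folder, _subdirs, files in os.walk(top):
        for name in files:
            names.append(os.path.relpath(os.path.join(folder, name), top))
    return sorted(names)


with tempfile.TemporaryDirectory() as top:
    # Build a small hypothetical tree to list.
    (Path(top) / "sps").mkdir()
    (Path(top) / "sps" / "fat.sps").write_text("* placeholder .\n")
    (Path(top) / "notes.txt").write_text("placeholder\n")

    listing = list_files(top)
    # Save the names to a file, as the redirection symbol does.
    (Path(top) / "filenames.txt").write_text("\n".join(listing) + "\n")

print(listing)
```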

More help

You can obtain more information as needed about wildcards and other MS-Dos commands (e.g., copy, del, rename) by looking at the following url:

http://www.washtenaw.cc.mi.us/dept/cis/mod/q02cd.htm