Introduction to Stata - ESQUANT

Neil T. Diamond and Ewa M. Sztendur
Introduction to Stata
ESQUANT Statistical Consulting Pty Ltd


Copyright © 2014 Neil T. Diamond and Ewa M. Sztendur

Published by ESQUANT Statistical Consulting Pty Ltd

Typeset with tufte-latex

For academic use only. You may not reproduce or distribute without permission of the authors.

First printing, October 2014


Contents

1 Introduction to Stata

2 Statistical Analysis in Stata

3 Exercises

4 Datasets

5 Bibliography


1 Introduction to Stata

1.1 Stata

• Stata is an environment for data analysis and graphics.

• Stata is command based, but there is a graphical user interface.

• The command structure is consistent, which makes it easy to use.

• A great feature of Stata is the ability to save the commands in a file, for future re-use.

• There is a very large amount of statistical functionality.

1.2 Stata User-Interface

Figure 1.1 shows the Stata User Interface.

Figure 1.1: The Stata User Interface

The GUI shows some menus, which we will discuss briefly. In addition, there are a number of important panels. These include:

The Results window: This is where the results go. Note that graphics will go to a separate window.


The Command window: Even if you use the menus, Stata will generate the associated commands and they will appear in this window. You don’t have to use the menus. It is usually quicker and more reproducible to use the commands. Note that pressing Enter will execute the command.

The Review window: As you issue commands, they are placed in the Review window. By clicking once, you can transfer the command to the Command window. Double clicking transfers the command and executes it. If you made a mistake you can delete a command from the Review window.

1.3 A simple Stata session

1. We suggest that you have a folder named Stata Workshop with a subfolder Day 1, which itself has a subfolder Data. Place the data files provided in the Data directory.

2. Choose File ⊳ Log ⊳ Begin.... We suggest you save the log file as Day1 and as a plain-text *.log file, not as a *.smcl file. Press Save.

3. The next step is to change the working directory. Choose File ⊳ Change Working Directory. Go to Stata Workshop/Day 1. Note how the working directory changes at the bottom left.

4. We next need to get the data in. Choose File ⊳ Import ⊳ Text data (delimited, *.csv, ...). The data¹ is employeedata.csv.

¹ This was originally an SPSS file, which we have long used to introduce SPSS. We converted it to a .csv file and also added some missing values.

5. Click on the Data Editor (Browse) button to have a look at the data. Note that none of the variables have variable labels and that there are no value labels either.

6. Go to Data ⊳ Variables Manager.

7. Click down to jobcat and give the variable a label “Job Category”. Press Apply.

8. Click on Manage Value Labels.

(a) Create a label. Call it jobCat. (It could be called jobcat, but we think it is less confusing if the names of the variable and the names of the label are different.)

(b) Click on value 1, type “Clerical” and then press Add. Click on value 2, type “Custodial” and then press Add. Click on value 3, type “Manager” and then press Add. Press OK and then Close.

(c) In the Variables Manager dialog, select jobCat in the Value Labels box and press Apply. Close the Variables Manager dialog box.


9. Choose Statistics ⊳ Summaries, tables and tests ⊳ Frequency tables ⊳ One way Tables. Choose jobcat as the Categories variable. Press Submit².

² By pressing Submit instead of OK, the dialog box remains open so you can adjust the options you have chosen without having to reopen the dialog box.

10. Redo the frequency table including the missing values.

11. Now we will label the variables. Choose Data ⊳ Data Utilities ⊳ Label Utilities ⊳ Label Variable.

(a) In the Attach a label to a variable dialog box, choose the variable gender, label it Male³ and press Submit. By pressing Submit the action is completed and the dialog box remains open.

³ It is always better to label the Gender variable as Male or Female; this is an example of a self-documenting variable.

(b) Do the same for the other variables.

12. Tabulate minority.

13. Now we will create a value label for the minority variable, and variables like it. Choose Data ⊳ Data Utilities ⊳ Label Utilities ⊳ Manage Value Labels. Create a label yesno with values 1 for Yes and 0 for No.

14. Next choose Data ⊳ Data Utilities ⊳ Label Utilities ⊳ Assign value labels to variables. Choose minority as the variable and yesno as the value label.

15. Now do a frequency table for minority. How do you get the original frequency table?

16. Choose Data ⊳ Create or change data ⊳ Other variable transformation commands ⊳ Convert variables from string to numeric. In the dialog box, choose salary and salbegin as the string variables to convert. Create new numeric variables named cursal and bgnsal. Remove the characters $ and ,. Press OK. Note that the labels have been transferred. Check the two sets of variables with the browser.

17. Save the data as a Stata data file.

18. Finally, close the log using File ⊳ Log ⊳ Close.
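The string-to-numeric conversion in step 16 can also be typed as a command. The following is a minimal sketch only: the three salary values and the helper variable raw are made up for illustration (they are not taken from employeedata.csv), and char(36) is used simply to build strings that contain a dollar sign.

```stata
* Sketch of step 16: strip "$" and "," and convert a string salary to a number.
* The three values below are invented for illustration.
clear
input str10 raw
"57,000"
"40,200"
"21,450"
end
generate salary = char(36) + raw     // char(36) is the dollar sign: "$57,000"
drop raw

* ignore() lists the characters to remove before the conversion
destring salary, generate(cursal) ignore("$,")

list salary cursal
```

The original string variable is kept, so the two versions can be compared in the browser as suggested in step 16.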

1.4 The log file

Open the log file and you can see all the commands and output. It’s best to use a text version, as everyone can read that. The *.smcl version looks slightly better, but you can only read it within Stata.

1.5 The Menus

The main menus are:

• File
• Edit
• Data
• Graphics
• Statistics
• User
• Window
• Help

Each of these has submenus as well, as outlined below.

1.5.1 File

Open: Open a Stata data file (which has the extension *.dta) or a do file (that is, a file with a set of Stata commands, with the extension *.do).

Save/Save As: Save the data as a Stata data file.

View: Choose a file to view. This could be a log file, a do file, or a help file.

Do: Open a do file and execute the commands in it.

Filename: Copy the name of a file into the Command window.

Change Working Directory: The working directory is given at the bottom left hand corner of the Stata screen and by default will be the directory where you installed Stata. It is a good idea to change the working directory for work on different projects.

Log: Keeps a record of commands and output.

Import: Imports data from various non-Stata sources.

Export: Exports data to various non-Stata destinations.

Print: Prints results, viewer contents (logs or do files) and graphs.

Example Datasets: Describes the datasets provided with Stata and gives a convenient way to access them.

Recent Datasets: Gives a list of recently used datasets. By clicking on one of them you clear the memory and load the dataset.

Exit: Get out of Stata.

1.5.2 Edit

Copy/Paste: Copy highlighted text and paste it elsewhere. Within Stata it is pasted in the Command window.

Copy Table/Copy as HTML/Copy as Picture: You can copy a table with plain Copy, but it will look irregular when you paste it, so it is best to use Copy Table so that the columns line up.

Clear Results: This removes the contents of the Results window. Note that they will remain in the log file.

Find/Find Next: Go through the window to find the next occurrence of a string.

Table Copy Options: Allows you to change how vertical lines are handled.

Preferences: Allows you to change how Stata looks and how the windows are arranged, as well as the defaults for graphics.

1.5.3 Data

Describe Data: Describe the data, produce a codebook or summarize the data.

Data Editor: The Edit submenu allows you to modify the data. This is not good practice. Use the Browse submenu instead.

Create or change data: Create and/or transform data.

Variables Manager: Gives a list of the variables and their properties that you can change.

Data Utilities: Various utilities to handle labels and check the data.

Sort: Sort the data by various variables.

Combine Datasets: Merge or append datasets.

Matrices, Mata Language: Mata is a programming language based on matrices. For advanced users.

Matrix ado language: You can do various matrix operations on your variables.

Other Utilities

1.5.4 Graphics

A wide variety of graphics is available.

Manage Graphs: Rename, copy or drop graphs.

Change scheme/size: Change the appearance of a graph according to a scheme provided, or one you have set up yourself.

1.5.5 Statistics

A wide variety of statistical analyses is available.


1.5.6 Window

Shows the corresponding window and moves the cursor there.

Command

Results

Review

Variables

Properties

Graph

Viewer

Data Editor

Do file editor

Variables Manager

1.5.7 Help

PDF Documentation

Advice: Helpful advice on using the help system. It is a great idea to read this.

Contents

Search

Stata Commands

News

Resources

SJ and User-Written Programs: SJ stands for the Stata Journal.

What’s New?

Check for Updates

About Stata


1.6 do files

As we have used the menus, Stata has recorded the associated commands in the Review window. We can put all the commands in a file to regenerate our analysis. This is extremely important as it means that our analysis is reproducible.

Highlight all the commands in the Review window using the Shift key. Right click and select Send to Do-File Editor. The Do-File Editor will open and the commands will look like this:

    log using "C:\Users\NeilDiamond\Documents\Stata Workshop\Day 1\day2.log"
    cd "C:\Users\NeilDiamond\Documents\Stata Workshop\Day 1"
    import delimited "C:\Users\NeilDiamond\Documents\Stata Workshop\Day 1\Data\Employeedata.csv"
    label variable jobcat "Job Category"
    label define jobCat 1 "Clerical" 2 "Custodial" 3 "Manager"
    label values jobcat jobCat
    label define jobCat 1 "Clerical" 2 "Custodial" 3 "Manager", replace
    tabulate jobcat
    tabulate jobcat, missing
    label variable gender "Male"
    label variable gender "Male"
    label variable bdate "Date of Birth"
    label variable educ "Education Level"
    label variable jobcat "Job Category"
    label variable salary "Current Salary"
    label variable salbegin "Beginning Salary"
    label variable jobtime "Time on Job"
    label variable prevexp "Previous Experience"
    label variable minority "Minority group"
    label variable minority "Minority group"
    codebook, compact
    clonevar Male2 = gender
    encode Male2, generate(Male2)
    drop Male2
    tabulate jobcat, missing
    tabulate minority, missing
    label define yesno 0 "No" 1 "Yes"
    label values minority yesno
    tabulate minority, missing
    codebook, compact
    codebook jobcat
    save "C:\Users\NeilDiamond\Documents\Stata Workshop\Day 1\Data\employee.dta"
    log close


Save the do file as sw-employeeclean.do. Add the following⁴ before the commands in the do file:

⁴ Following the suggestion in The Workflow of Data Analysis Using Stata by J. Scott Long, 2009, Stata Press.

    capture log close
    log using sw-employeeclean, replace text

    // sw-employeeclean.do
    // yourname 17/10/14

    version 13

    clear all
    macro drop _all
    set linesize 80

Add the following commands after the commands in the do file:

    log close
    exit

An explanation of these extra commands is given below.

capture log close: If we try to execute the do file and there is an error, then when we fix the error and try again, the log file will still be open and an error will be generated. We close the log file to prevent this. However, when we run the do file the first time, there will be no log file open and we will get an error. The capture part of the command tells Stata to ignore this error and to carry on.

log using sw-employeeclean, replace text: We don’t have to use the same name as the do file but it is a good idea. The log file will be called sw-employeeclean.log. We want to replace it if it already exists and to have it as a text file, not Stata’s *.smcl type file.

Blank lines: These improve the readability of the do file.

//: Everything after // is regarded as a comment. You can also start a comment line with a *. Also, everything between an opening /* and a closing */ is a comment. Comments help, particularly when you come back to work you completed a while ago. It is a good idea to name the do file and to record who wrote it and when.

version 13: It is a good idea to state the version of Stata you are using. Some things change between versions. If you are using Version 13 and put version 12, Stata will perform the commands as if you were using Version 12⁵.

⁵ If you are using Version 12, write version 12 in the do file.

clear all: This gets rid of everything in memory before the do file commands are executed.

macro drop _all: clear all does not clear any macros (covered later), so we drop them explicitly.

set linesize 80: This sets the width of the results.

log close: We need to close the log file.

exit: Strictly speaking we need a carriage return (i.e. Enter) for the last command to execute. So a blank line will work. But it is a good idea to explicitly exit the do file.

1.6.1 Using the do file

1. Choose Edit ⊳ Clear Results.

2. Choose File ⊳ Do.... Select sw-employeeclean.do and press Open. The commands in the do file will be executed and the results will go in the Results window. If there is an error, debug it and try again until you get it to work.

3. Check out the log file.

1.6.2 Comments in do files

* is used at the beginning of a line and means that the whole line is disregarded.

/* and */ can be used anywhere and mean that everything in between is disregarded.

// can be used anywhere and means that everything after it on the line is disregarded⁶.

/// is used to break up long lines of code to make them more readable⁷.

⁶ If it is used at the end of a line, then it must be preceded by a blank.

⁷ Like //, if /// is used at the end of a line, then it must be preceded by a blank.
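The four comment styles can be seen together in a short do file. This is only an illustrative sketch; it assumes the auto dataset that ships with Stata (loaded with sysuse) rather than the workshop data.

```stata
* A whole-line comment: Stata skips this line
sysuse auto, clear          // load an example dataset shipped with Stata

/* A block comment
   can span several lines */
summarize price mpg ///
    weight length           // the summarize command continues across the line break
```

Note the blank before each // and ///, as required when they follow a command on the same line.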


1.7 General Commands

1.7.1 update

You can check whether you need to update by using the command update query. If you do, then use update all.

1.7.2 cd

You can change the working directory using the cd command. Remember to put the directory name in quotation marks if it contains spaces. If you want to go to a higher directory use cd .. with a space between cd and the two dots. In some cases it might be quicker to use File ⊳ Change Working Directory.

1.7.3 help

You can get help on a command by typing help command. Example: help do. The help will be displayed in a separate window.

1.7.4 findit

If you get an error, say error number 604, you can find information about it using findit rc 604⁸. More generally you can get (a great deal of) information about a topic by using a keyword or two. Example: findit logit ordinal.

⁸ rc stands for return code.

1.7.5 set memory

Stata works with all the data in memory. In Stata 11 and earlier, if you have a big data file you need to set the size of the working memory using the set memory command. For example, set memory 100m would set the size of the working memory to 100 megabytes. You also have the option of making this setting permanent using set memory 100m, permanently. From Stata 12 onwards, memory is managed automatically and set memory is ignored.

1.7.6 display

If you want to calculate a quantity and display it in the Results window you use the display command. For example, display sqrt(3^2 + 4^2) will give 5.
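A few more uses of display as a calculator are sketched below. The expressions are arbitrary examples chosen for illustration; the expected result is given in the comment on each line.

```stata
display sqrt(3^2 + 4^2)        // 5
display 2 + 3 * 4              // 14: multiplication before addition
display (2 + 3) * 4            // 20
display %6.2f 100 / 3          // 33.33: a display format controls the rounding shown
display "Mean of 10 and 20: " (10 + 20) / 2   // text followed by the value 15
```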

1.8 Reading and Saving Data

1.8.1 Importing data

We have already seen an example using import delimited:

    import delimited "C:\Users\NeilDiamond\Documents\Stata Workshop\Day 1\Data\Employeedata.csv"

1.8.2 Saving data

Once the data is cleaned, we can save it in Stata format using the save command. For example, we might use the command

    save cleanedemployee, replace

Note that the data file will have a *.dta extension.

1.8.3 Using a Stata datafile

Datasets with a *.dta extension can be opened using the use command:

    use cleanedemployee

or

    use cleanedemployee.dta

These will only work if the data in memory have not changed since they were last saved to disk (or if there are no data in memory). You can clear the data in memory using the clear command. Alternatively, clear is an option of the use command.

    use cleanedemployee, clear
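The behaviour of use with and without the clear option can be tried out safely with a temporary file. This is a sketch under stated assumptions: it uses the auto dataset shipped with Stata, and the macro name cleaned is just an illustrative choice. Because tempfile creates a local macro, the lines need to be run together from a do file.

```stata
sysuse auto, clear
tempfile cleaned                 // temporary file name, erased when the do file ends
save `cleaned'

replace price = price / 1000     // the data in memory now differ from the saved copy
capture noisily use `cleaned'    // refused, because unsaved changes would be lost
display _rc                      // 4

use `cleaned', clear             // discard the changes and reload the saved copy
```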

1.8.4 Editing data

You can do this with the edit command. Resist the temptation! Use a do file instead. You can then reproduce the results.

1.8.5 Compressing the data

With the compress command you can save some space, if desired.
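As a small illustration of what compress does, the sketch below deliberately stores a small integer variable in the largest numeric type and then lets compress shrink it. The auto dataset and the variable name rep_copy are assumptions made for the example.

```stata
sysuse auto, clear
generate double rep_copy = rep78   // 8 bytes per value, far more than needed
describe rep_copy
compress                           // rep_copy only holds 1 to 5, so a byte is enough
describe rep_copy
```

No information is lost: compress only changes a storage type when every value can still be represented exactly.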

1.9 Stata syntax

1.9.1 Syntax structure

Type

    help language

You get the following information:

    With few exceptions, the basic language syntax is

        [prefix :] command [varlist] [=exp] [if] [in] [weight] [using filename] [, options]

1.9.2 Variable Lists

A variable list gives a list of variables you want the command to operate on. It is simply one variable or a list of variables with spaces between them. There are some shorthand conventions that are available.

    jobcat              Just one variable
    jobcat minority     Two variables
    sal*                All variables beginning with sal
    *1                  All variables ending with 1
    var1-var4           var1 var2 var3 var4
    _all                All the variables
    jobtime-minority    jobtime prevexp minority

Note that some commands use all variables by default. The following are equivalent.

    codebook, compact
    codebook _all, compact
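The shorthand conventions can be checked on any dataset. The sketch below assumes the auto dataset shipped with Stata, whose variables are stored in the order make, price, mpg, rep78, headroom and so on; the macro name expanded is an illustrative choice.

```stata
sysuse auto, clear
describe price                 // just one variable
describe price mpg             // two variables
describe m*                    // all variables beginning with m: make mpg
describe *t                    // all variables ending with t: weight displacement
describe price-rep78           // a range in dataset order: price mpg rep78

* unab shows how Stata expands an abbreviated variable list
unab expanded : price-rep78
display "`expanded'"
```

A range such as price-rep78 depends on the current order of the variables in the dataset, so it changes if the variables are reordered.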

1.9.3 The by prefix

This prefix tells Stata to do the command for each level of the variable specified. It requires the data to be sorted first. Instead you might use the bysort prefix to do the sorting as you go.

    . bysort jobcat: summarize prevexp

    ----------------------------------------------------------------------------
    -> jobcat = Clerical

        Variable |       Obs        Mean    Std. Dev.       Min        Max
    -------------+--------------------------------------------------------
         prevexp |       286     84.6958      94.9574         0        476

    ----------------------------------------------------------------------------
    -> jobcat = Custodial

        Variable |       Obs        Mean    Std. Dev.       Min        Max
    -------------+--------------------------------------------------------
         prevexp |        20      302.45     101.1432       155        460

    ----------------------------------------------------------------------------
    -> jobcat = Manager

        Variable |       Obs        Mean    Std. Dev.       Min        Max
    -------------+--------------------------------------------------------
         prevexp |        70        82.5     78.01742         3        285

    ----------------------------------------------------------------------------
    -> jobcat = .

        Variable |       Obs        Mean    Std. Dev.       Min        Max
    -------------+--------------------------------------------------------
         prevexp |        41    88.87805     94.74629         0        367
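The difference between by and bysort can be seen in a short sketch. It assumes the auto dataset shipped with Stata, which is not stored in order of rep78, so the plain by prefix is expected to fail.

```stata
sysuse auto, clear
capture noisily by rep78: summarize mpg   // fails with "not sorted", return code 5
display _rc

bysort rep78: summarize mpg               // sorts by rep78, then summarizes each group
```

As in the jobcat output, the observations with a missing grouping value form their own group at the end.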

1.9.4 The if qualifier

This tells Stata to do a command only on the data that satisfies a condition.

    . summarize prevexp if jobcat==2

        Variable |       Obs        Mean    Std. Dev.       Min        Max
    -------------+--------------------------------------------------------
         prevexp |        20      302.45     101.1432       155        460

    . summarize prevexp if jobcat=="Custodial":jobCat

        Variable |       Obs        Mean    Std. Dev.       Min        Max
    -------------+--------------------------------------------------------
         prevexp |        20      302.45     101.1432       155        460

Note that jobcat is a numeric variable with value labels “Clerical”, “Custodial”, and “Manager”. The level 2 is associated with “Custodial”. We can use level 2 directly or access the Custodial employees by using "Custodial":jobCat⁹.

⁹ Remember the value labels were set up with jobCat.
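The same two ways of writing a condition can be tried on the auto dataset shipped with Stata, used here as a stand-in for the employee data. In that dataset the variable foreign carries the value label origin (0 Domestic, 1 Foreign).

```stata
sysuse auto, clear
summarize mpg if foreign == 1                  // condition on the numeric code
summarize mpg if foreign == "Foreign":origin   // same observations, via value label origin

count if mpg > 25 & foreign == 1               // & combines conditions (and); | means or

* Numeric missing values count as larger than any number,
* so exclude them explicitly when using > or >=
summarize price if rep78 >= 4 & !missing(rep78)
```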


1.9.5 The in qualifier

This tells Stata to do a command only on a set of observations.

    . summarize prevexp in 1/50

        Variable |       Obs        Mean    Std. Dev.       Min        Max
    -------------+--------------------------------------------------------
         prevexp |        46    110.2391     87.27611         0        381

    . summarize prevexp in -20/l

        Variable |       Obs        Mean    Std. Dev.       Min        Max
    -------------+--------------------------------------------------------
         prevexp |        19    62.73684       65.653         0        241

Note that in 1/50 means from observation 1 to 50 and -20/l¹⁰ means the last 20 observations.

¹⁰ It is a lower case “l”, not a 1.
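A few more observation ranges are sketched below, again assuming the auto dataset shipped with Stata in place of the employee data.

```stata
sysuse auto, clear
list make price in 1/3       // the first three observations
list make price in -3/l      // the last three observations (lower case L)
list make price in 5         // a single observation
count in 11/20               // ten observations
```

The in qualifier refers to the current sort order, so the observations selected change if the data are re-sorted.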

1.10 Manipulating Data

1.10.1 The generate command

We can generate new variables with the generate command.

    . generate totalexp = prevexp + bgnsal
    (106 missing values generated)

1.10.2 Extended generation command, egen

This extends the features of generate. For example, we might want to get the mean of Total Experience.

    . egen mean_totalexp = mean(totalexp)

    . list totalexp mean_totalexp in 1/6

         +---------------------+
         | totalexp   mean_t~p |
         |---------------------|
      1. |    20433   17257.37 |
      2. |    10162   17257.37 |
      3. |        .   17257.37 |
      4. |    10301   17257.37 |
      5. |        .   17257.37 |
         |---------------------|
      6. |        .   17257.37 |
         +---------------------+
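The contrast between generate (one calculation per observation) and egen (a summary across observations) can be sketched as follows. The auto dataset and the new variable names are assumptions made for the example; combining egen with bysort to get group means goes slightly beyond the text above.

```stata
sysuse auto, clear
generate price_k = price / 1000               // row-by-row calculation
egen mean_price = mean(price)                 // overall mean, repeated on every row
bysort foreign: egen grp_mean = mean(price)   // mean within each level of foreign
generate dev = price - grp_mean               // deviation from the group mean
list make price mean_price grp_mean dev in 1/5
```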

1.10.3 Recode

The levels of a categorical variable can be changed with the recode command. It is better to use the generate option to create a new variable.

    . recode educ (8=0 "Did not complete HS") (12/14=1 "Completed HS")
    >     (15/16=2 "Completed Uni") (17/21=3 "Post-graduate"),
    >     generate(educ_cat)
    (425 differences between educ and educ_cat)

    . tabulate educ_cat

         RECODE of educ |
      (Education Level) |      Freq.     Percent        Cum.
    --------------------+-----------------------------------
    Did not complete HS |         47       11.06       11.06
           Completed HS |        175       41.18       52.24
          Completed Uni |        158       37.18       89.41
          Post-graduate |         45       10.59      100.00
    --------------------+-----------------------------------
                  Total |        425      100.00
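After a recode it is worth cross-tabulating the old variable against the new one to check that every value went where intended. The sketch below assumes the auto dataset shipped with Stata; the three category names are invented for illustration.

```stata
sysuse auto, clear
recode rep78 (1/2 = 1 "Poor") (3 = 2 "Average") (4/5 = 3 "Good"), generate(rep_cat)
tabulate rep78 rep_cat, missing    // cross-check the old codes against the new ones
```

Missing values of the original variable stay missing in the new variable unless a rule says otherwise.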

1.10.4 The drop and keep commands

If you want to drop variables, you just list the ones you want to drop.

    . drop mean_totalexp-educ_cat

Alternatively, you can use the keep command. All the variables not listed are dropped.
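Both commands are sketched below on the auto dataset shipped with Stata (an assumption for the example). The second half shows that drop and keep also accept the if and in qualifiers, in which case they act on observations rather than variables.

```stata
sysuse auto, clear
drop headroom-turn           // drops headroom trunk weight length turn
keep make price mpg          // every variable not listed is dropped
describe, short

sysuse auto, clear
drop if missing(rep78)       // drop observations instead of variables
keep in 1/10                 // keep only the first ten remaining observations
```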

1.10.5 The encode command

Note that the gender variable is a string variable. Sometimes these variables do not show up in dialog boxes, meaning that it is not possible to do some commands on such variables. The encode command generates a new variable which is numeric, and a good feature is that the original strings are automatically attached as value labels.

    . encode gender, generate(gender_cat)

    . tabulate gender_cat, missing

           Male |      Freq.     Percent        Cum.
    ------------+-----------------------------------
              f |        198       41.77       41.77
              m |        229       48.31       90.08
              . |         47        9.92      100.00
    ------------+-----------------------------------
          Total |        474      100.00

    . tabulate gender_cat, missing nolabel

           Male |      Freq.     Percent        Cum.
    ------------+-----------------------------------
              1 |        198       41.77       41.77
              2 |        229       48.31       90.08
              . |         47        9.92      100.00
    ------------+-----------------------------------
          Total |        474      100.00

    . codebook gender_cat

    ----------------------------------------------------------------------------
    gender_cat                                                              Male
    ----------------------------------------------------------------------------

                      type:  numeric (long)
                     label:  gender_cat

                     range:  [1,2]                        units:  1
             unique values:  2                        missing .:  47/474

                tabulation:  Freq.   Numeric  Label
                               198         1  f
                               229         2  m
                                47         .
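How encode assigns the numeric codes can be seen on a tiny made-up dataset. The four values below are invented for illustration and are not the workshop data.

```stata
clear
input str1 gender
"f"
"m"
""
"m"
end
encode gender, generate(gender_cat)    // codes follow alphabetical order: f = 1, m = 2
label list gender_cat                  // the value label created by encode
tabulate gender_cat, missing nolabel   // the empty string becomes numeric missing
```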

1.11 Describing Data


1.11.1 The describe command

The describe command gives information about the variables: the number of observations, the storage type, the display format, the value label and the variable label.

    . describe

    Contains data
      obs:           474
     vars:            19
     size:        28,914
    ----------------------------------------------------------------------------
                  storage   display    value
    variable name   type    format     label       variable label
    ----------------------------------------------------------------------------
    id              int     %8.0g
    gender          str1    %9s                    Male
    bdate           str10   %10s                   Birth Day
    educ            byte    %8.0g                  Education Level
    jobcat          byte    %12.0g     jobCat      Job Category
    salary          str11   %11s                   Current Salary
    cursal          long    %10.0g                 Current Salary
    salbegin        str10   %10s                   Beginning Salary
    bgnsal          long    %10.0g                 Beginning Salary
    jobtime         byte    %8.0g                  Experience at Current Employer
    prevexp         int     %8.0g                  Previous Experience
    minority        byte    %8.0g      yesno       Minority
    new_manager1    byte    %8.0g                  jobcat==1. Clerical
    new_manager2    byte    %8.0g                  jobcat==2. Custodial
    new_manager3    byte    %8.0g                  jobcat==3. Manager
    new_gender1     byte    %8.0g                  gender==f
    new_gender2     byte    %8.0g                  gender==m
    totalexp        float   %9.0g
    gender_cat      long    %8.0g      gender_cat
                                                   Male
    ----------------------------------------------------------------------------
    Sorted by:  jobcat
         Note:  dataset has changed since last saved

We can also run describe on a variable list.

1.11.2 codebook

Unlike the describe command, the codebook command gives some information about the actual data values. Using the option compact gives a nicely laid out summary with the number of observations, the number of unique values, the mean, minimum and maximum as well as the variable label.

    . codebook, compact

    Variable        Obs  Unique      Mean    Min      Max  Label
    ----------------------------------------------------------------------------
    id              474     474     237.5      1      474
    gender          427       2         .      .        .  Male
    bdate           437     428         .      .        .  Birth Day
    educ            425      10  13.49647      8       21  Education Level
    jobcat          429       3  1.428904      1        3  Job Category
    salary          437     215         .      .        .  Current Salary
    cursal          437     215  34653.96  15750   135000  Current Salary
    salbegin        417      87         .      .        .  Beginning Salary
    bgnsal          417      87  17167.78   9000    79980  Beginning Salary
    jobtime         436      36  80.91055     63       98  Experience at Current Employer
    prevexp         417     193  95.18225      0      476  Previous Experience
    minority        431       2  .2296984      0        1  Minority
    new_manager1    429       2  .7552448      0        1  jobcat==1. Clerical
    new_manager2    429       2  .0606061      0        1  jobcat==2. Custodial
    new_manager3    429       2  .1841492      0        1  jobcat==3. Manager
    new_gender1     427       2  .4637002      0        1  gender==f
    new_gender2     427       2  .5362998      0        1  gender==m
    totalexp        368     337  17257.37   9023    80179
    gender_cat      427       2    1.5363      1        2  Male
    ----------------------------------------------------------------------------

You can also run codebook on a variable list. This gives more detailed information on each variable in the variable list. Note the different information for jobcat, which is a categorical variable, and cursal, which is a numeric variable.

    . codebook jobcat cursal

    ----------------------------------------------------------------------------
    jobcat                                                          Job Category
    ----------------------------------------------------------------------------

                      type:  numeric (byte)
                     label:  jobCat

                     range:  [1,3]                        units:  1
             unique values:  3                        missing .:  45/474

                tabulation:  Freq.   Numeric  Label
                               324         1  1. Clerical
                                26         2  2. Custodial
                                79         3  3. Manager
                                45         .

    ----------------------------------------------------------------------------
    cursal                                                        Current Salary
    ----------------------------------------------------------------------------

                      type:  numeric (long)

                     range:  [15750,135000]               units:  1
             unique values:  215                      missing .:  37/474

                      mean:     34654
                  std. dev:     17283

               percentiles:        10%       25%       50%       75%       90%
                                 20850     24000     29100     37650     60375

1.11.3 The inspect command

Another useful view of a variable (list) can be obtained with the inspect command. You get a histogram and the number of negative, zero, positive and missing values.

    . inspect prevexp

    prevexp:  Previous Experience             Number of Observations
    -----------------------------        ------------------------------
                                           Total    Integers   Nonintegers
    |  #                     Negative          -           -             -
    |  #                     Zero             18          18             -
    |  #                     Positive        399         399             -
    |  #                                   -----       -----         -----
    |  #                     Total           417         417             -
    |  #  #  .  .  .         Missing          57
    +----------------------                -----
    0                   476                  474
      (More than 99 unique values)


1.11.4 The list command

A listing of the data can be obtained using the list command. The in and if qualifiers are useful here.

    . list prevexp if gender_cat==1 in 1/5

         +---------+
         | prevexp |
         |---------|
      2. |     412 |
      3. |       5 |
      4. |     101 |
      5. |       . |
         +---------+

1.11.5 The browse command

You can look¹¹ at variables in a variable list using the browse command.

¹¹ But not edit! This is a good thing, because if you edit the data your results are not reproducible.

    . browse gender

Once the browser window opens you can add extra variables by checking the check boxes in the Variables window.


1.11.6 The sort and gsort commands

You can sort the data using the sort command. The following commands sort the data by gender and, within each gender, by cursal. Note that for gender (a string variable) the missing values come first, but for cursal (a numeric variable) the missing values come last.

    . sort gender cursal

    . browse gender cursal

If you want to reverse sort you need to use the gsort command. The following command sorts the data by gender as before, but now the highest salaries are first rather than at the end.

    . gsort gender -cursal

    . browse gender cursal
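The two commands can be compared on the auto dataset shipped with Stata, used here as an assumed stand-in for the employee data, with list in place of browse so that the result appears in the Results window.

```stata
sysuse auto, clear
sort foreign rep78               // ascending; missing rep78 comes last within each group
list foreign rep78 in 46/54

gsort foreign -price             // ascending foreign, descending price within foreign
list make foreign price in 1/3   // the most expensive domestic cars come first
```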


1.11.7 The order command

The order command puts the variables in the variable list first.

    . order jobcat gender

    . codebook, compact

    Variable        Obs  Unique      Mean    Min      Max  Label
    ----------------------------------------------------------------------------
    jobcat          429       3  1.428904      1        3  Job Category
    gender          427       2         .      .        .  Male
    id              474     474     237.5      1      474
    bdate           437     428         .      .        .  Birth Day
    educ            425      10  13.49647      8       21  Education Level
    ...

If you want to order all the variables alphabetically, you need to specify all the variables using _all and supply the option alphabetic.

    . order _all, alphabetic

    . codebook, compact

    Variable        Obs  Unique      Mean    Min      Max  Label
    ----------------------------------------------------------------------------
    bdate           437     428         .      .        .  Birth Day
    bgnsal          417      87  17167.78   9000    79980  Beginning Salary
    cursal          437     215  34653.96  15750   135000  Current Salary
    educ            425      10  13.49647      8       21  Education Level
    ...

You can also move a variable to a specific position, by specifying which variable it goes before.

    . order gender_cat, before(cursal)

    . codebook, compact

    Variable        Obs  Unique      Mean    Min      Max  Label
    ----------------------------------------------------------------------------
    bdate           437     428         .      .        .  Birth Day
    bgnsal          417      87  17167.78   9000    79980  Beginning Salary
    gender_cat      427       2    1.5363      1        2  Male
    cursal          437     215  34653.96  15750   135000  Current Salary
    ...


2 Statistical Analysis in Stata

2.1 Introduction

2.1.1 Variable Definitions

• Response variable: measures the outcome of a study. Also called the dependent variable.

• Explanatory variable: attempts to explain the variation in the observed outcomes. Also called the independent variable.

• Many statistical problems can be thought of in terms of a response variable and one or more explanatory variables.

2.1.2 Examples

• Study of level of diabetes amongst elderly people in the Melbourne suburban area.

– Response variable: Diabetic status

– Explanatory variable: Age, sex, race

• School teachers’ perceptions of their training to deal with disabled children

– Response variable: Perception (measured on a Likert scale)

– Explanatory variables: Government school, gender, age, etc.

2.1.3 Types of data

• The ways of organising, displaying and analysing data depend on the type of data we are investigating.

• Categorical data (also called nominal or qualitative) e.g. sex, race, type of drug used, political party, postcode, offense

– Averages don’t make sense. Ordered categories are called ordinal data.

• Numerical data (also called scale) e.g. test score, age, weight, temperature, time, income, sentence duration

– Averages make sense

2.1.4 Statistical Analysis

Response Variable: Numerical

                 Explanatory Variable
                 None                  Numerical      Categorical
Graphics         Boxplot, Histogram    Scatterplot    Side-by-side Boxplots
Summary Stats    Mean, Percentiles,    Correlation    Mean, Percentiles,
                 IQR, Std. Dev                        Std. Dev by Group
Methods                                Regression     t-test (2 groups),
                                                      One-way ANOVA
                                       ANCOVA, General Linear Model

2.1.5 Statistical Analysis (ctd)

Response Variable: Categorical

                 Explanatory Variable
                 None           Numerical              Categorical
Graphics         Barchart       Side-by-side Boxplots  Side-by-side Barcharts
Summary Stats    Percentages    Mean, Percentiles,     Percentages by Group,
                                Std. Dev by Group      Contingency Tables
Methods                         Logistic Regression    Chi-squared test
                                Generalised Linear Model


2.1.6 Statistical Procedures in Stata


2.1.7 Graphical Procedures in Stata

2.1.8 Numerical response variable, no explanatory variables

• Graphical summaries

– Boxplot: From the Menus, choose Graphics ⊳ Boxplot. Select prevexp as the Variable and press OK or Submit. The command is graph box prevexp.

– Histogram: From the Menus, choose Graphics ⊳ Histogram. Select prevexp as the Variable and press OK or Submit. The command is histogram prevexp. Note that the output indicates the number of bins, the left-hand value of the first bin and the width of the bins.

. histogram prevexp
(bin=20, start=0, width=23.8)
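The bin width reported above is the range of the data divided by the number of bins. The sketch below reproduces that arithmetic with display, using N = 417, minimum 0 and maximum 476 from the output in this section. The rule for the default number of bins is an assumption recalled from the histogram documentation, not something stated in the text.

```stata
* Assumed default rule: min(sqrt(N), 10*ln(N)/ln(10)), truncated to an integer
display floor(min(sqrt(417), 10*ln(417)/ln(10)))
* Bin width = (max - min) / number of bins
display (476 - 0)/20
```

The two results should agree with bin=20 and width=23.8 in the note printed by histogram.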

• Numerical summaries

– Percentiles: From the Menus, choose Summaries, tables and tests ⊳ Summaries and Descriptive Statistics ⊳ Centiles with CIs. Choose prevexp as the variable and for centiles write 0(20)100, which means from 0 to 100 in steps of 20. The output is given below.

. centile prevexp, centile(0(20)100)

                                                       -- Binom. Interp. --
    Variable |     Obs  Percentile    Centile        [95% Conf. Interval]
-------------+-------------------------------------------------------------
     prevexp |     417          0           0               0           0*
             |                 20        13.6        10.17299          19
             |                 40        43.2         34.7828          48
             |                 60          75        67.00857          90
             |                 80       166.2             143     191.827
             |                100         476             476         476*

* Lower (upper) confidence limit held at minimum (maximum) of sample

– Mean, Standard deviation: From the Menus, choose Summaries, tables and tests ⊳ Summaries and Descriptive Statistics ⊳ Summary Statistics. Choose prevexp as the variable. The command is summarize prevexp. You get both the mean and standard deviation as well as the number of observations and the minimum and maximum.

. summarize prevexp

    Variable |       Obs        Mean    Std. Dev.       Min        Max
-------------+--------------------------------------------------------
     prevexp |       417    95.18225    103.3895          0        476

– Median: The median is the 50th percentile, so it can be obtained using the Centiles option above. Another option is to choose Display additional statistics in the summarize-Summary statistics dialog box. The corresponding command is summarize prevexp, detail.

. summarize prevexp, detail

                     Previous Experience
-------------------------------------------------------------
      Percentiles      Smallest
 1%            0              0
 5%            2              0
10%            6              0       Obs                 417
25%           20              0       Sum of Wgt.         417

50%           55                      Mean           95.18225
                        Largest       Std. Dev.      103.3895
75%          137            444
90%          258            451       Variance       10689.38
95%          324            460       Skewness       1.540924
99%          438            476       Kurtosis       4.847728

Another option is to choose, from the Menus, Summaries, tables and tests ⊳ Other tables ⊳ Compact tables of summary statistics. Choose prevexp as the variable and then choose Mean, Standard Deviation and 50th Percentile and press OK or Submit.

. tabstat prevexp, statistics(mean sd p50)

    variable |      mean        sd       p50
-------------+------------------------------
     prevexp |  95.18225  103.3895        55
--------------------------------------------
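The four numerical-summary commands of this subsection can be run side by side. This is a sketch only, assuming the auto dataset shipped with Stata, with price standing in for prevexp.

```stata
* Assumption: auto.dta stands in for the employee data used in the text
sysuse auto, clear
* Percentiles from 0 to 100 in steps of 20, with confidence intervals
centile price, centile(0(20)100)
* Number of observations, mean, standard deviation, minimum and maximum
summarize price
* Adds the median (the 50% row) and other percentiles
summarize price, detail
* Compact table with only the requested statistics
tabstat price, statistics(mean sd p50)
```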

2.1.9 Categorical response variable, no explanatory variables

• Graphical summaries:

– Bar-chart: Choose Graphics ⊳ Bar chart. In the main tab of the dialog box, choose Graph by calculating summary statistics. The Statistic is Count nonmissing, the Variable is id. Then select the Categories tab and choose jobcat as the grouping variable. Choose OK or Submit.

• Numerical Summaries

– Group frequencies/percentages: Choose Summaries, tables and tests ⊳ Frequency tables ⊳ One-way Table. The categorical variable is jobcat. You can select the option of treating missing values like other values.

. tabulate jobcat, missing

Job Category |      Freq.     Percent        Cum.
-------------+-----------------------------------
 1. Clerical |        324       68.35       68.35
2. Custodial |         26        5.49       73.84
  3. Manager |         79       16.67       90.51
           . |         45        9.49      100.00
-------------+-----------------------------------
       Total |        474      100.00
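A short sketch of the one-way table and the matching bar chart, assuming the auto dataset shipped with Stata (rep78 plays the role of jobcat and mpg the role of id as the variable whose nonmissing values are counted):

```stata
* Assumption: auto.dta stands in for the employee data used in the text
sysuse auto, clear
* rep78 has some missing values, so the missing option adds a "." row
tabulate rep78, missing
* Bar chart of the number of nonmissing mpg values in each rep78 category
graph bar (count) mpg, over(rep78)
```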


2.1.10 Numerical response variable, one numerical explanatory variable

Examples:

• Month of reunification and length of stay

• Test score and age

• Age and blood pressure

• Number of accidents and length of time worked

• Graphics: scatterplots: Choose Graphics ⊳ Twoway graph (scatter, line, etc). In the dialog box, on the Plots tab, press Create and select Scatter. Choose cursal as the Y variable and bgnsal as the X variable. Press Accept [1], then press Submit. You will get a plain scatter plot. Now press Create again, select a Fit plot and choose Linear prediction w/CI. You need to specify the same Y variable and X variable as before. Now press Submit.

[1] This means you are overlaying two graphs.

The corresponding command is

. twoway (scatter cursal bgnsal) (lfitci cursal bgnsal)

• Statistics: correlation. Select Statistics ⊳ Summaries, tables, and tests ⊳ Summaries and descriptive statistics ⊳ Pairwise Correlations. Choose cursal and bgnsal as the variables.

. pwcorr cursal bgnsal

             |   cursal   bgnsal
-------------+------------------
      cursal |   1.0000
      bgnsal |   0.8763   1.0000

• Statistical methods: regression

Select Linear model and related ⊳ Linear Regression. In the dialog box, choose cursal as the dependent variable and bgnsal as the independent variable. Go to the Reporting tab and check the Standardized beta coefficients box. Press OK.

. regress cursal bgnsal, beta

      Source |       SS       df       MS              Number of obs =     386
-------------+------------------------------           F(  1,   384) = 1270.55
       Model |  9.3374e+10     1  9.3374e+10           Prob > F      =  0.0000
    Residual |  2.8221e+10   384  73490976.1           R-squared     =  0.7679
-------------+------------------------------           Adj R-squared =  0.7673
       Total |  1.2159e+11   385   315829390           Root MSE      =  8572.7

------------------------------------------------------------------------------
      cursal |      Coef.   Std. Err.      t    P>|t|                     Beta
-------------+----------------------------------------------------------------
      bgnsal |   1.897318   .0532285    35.64   0.000                 .8763061
       _cons |   2153.936   1021.539     2.11   0.036                        .
------------------------------------------------------------------------------

• The correlation and regression coefficients are related. The formula is

\[ \hat{\beta} = r \, \frac{s_y}{s_x} \]

In other words, for every one standard deviation increase in x, y is predicted to increase by r standard deviations.

• If x and y are standardised, then the regression coefficient is the correlation coefficient.

• If you want to fit a regression through the origin (not recommended) you need to take out the intercept term.

• Regression coefficients are biased if you omit a variable that affects the dependent variable. Structural Equation Modelling does not help in this regard.

• Regression coefficients are biased downwards if the explanatory variables are measured with error. SEM can help here if you use a number of different indicators to measure a concept.
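The relationship between the slope and the correlation can be checked numerically. The sketch below assumes the auto dataset shipped with Stata (price as y, weight as x); the scalar names are chosen here for illustration.

```stata
* Assumption: auto.dta stands in for the employee data used in the text
sysuse auto, clear
* Slope from the simple regression of y on x
quietly regress price weight
scalar b = _b[weight]
* Correlation between y and x
quietly correlate price weight
scalar r = r(rho)
* Standard deviations of y and x
quietly summarize price
scalar sy = r(sd)
quietly summarize weight
scalar sx = r(sd)
* The two numbers displayed should be the same
display b
display r*sy/sx
* Regression through the origin: the noconstant option drops the intercept
regress price weight, noconstant
```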


Before using a regression equation, you need to check the assumptions. One useful plot is a plot of the residuals versus fitted values. Choose Linear models and related ⊳ Regression diagnostics ⊳ Residuals-versus-fitted plot and press OK.

The plot shows that the variability of the residuals depends on the fitted values, so the assumption of homogeneity of variance is violated. We need to do a weighted regression.
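The menu choice above corresponds to a postestimation command. A minimal sketch, assuming the auto dataset shipped with Stata in place of the employee data; the formal test on the last line is an addition for illustration and is not part of the menu steps described here.

```stata
* Assumption: auto.dta stands in for the employee data used in the text
sysuse auto, clear
regress price weight
* Residuals versus fitted values, with a reference line at zero
rvfplot, yline(0)
* Breusch-Pagan / Cook-Weisberg test for heteroskedasticity
estat hettest
```

A fan shape in the plot, or a small p-value from the test, points to non-constant variance.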

2.1.11 Categorical response variable, one categorical explanatory variable

Examples:

• Whether there was a reunification after 12 months and the interim carer

• Gender and voting preference

• Religion and education level

• Treatment group and mortality

• Graphics: bar charts and variants

Choose Graphics ⊳ Bar chart. In the main tab, choose Graph by calculating summary statistics, with the Statistic to plot being Count nonmissing on the id variable. In the Categories tab choose gender as the Group 1 grouping variable and jobcat as the Group 2 grouping variable. Press OK.

The corresponding command is

. graph bar (count) id, over(gender) over(jobcat)

• Statistics: cross-tabulations

Choose Statistics ⊳ Frequency Tables ⊳ Two-way tables with measures of association. In the main tab, select jobcat as the Row variable and gender as the Column variable. Under Cell contents, check Within-column relative frequencies and Within-row relative frequencies. Press OK.

. tabulate jobcat gender, column row

+-------------------+
| Key               |
|-------------------|
|     frequency     |
|  row percentage   |
| column percentage |
+-------------------+

             |         Male
Job Category |         f          m |     Total
-------------+----------------------+----------
 1. Clerical |       165        127 |       292
             |     56.51      43.49 |    100.00
             |     94.29      60.19 |     75.65
-------------+----------------------+----------
2. Custodial |         0         25 |        25
             |      0.00     100.00 |    100.00
             |      0.00      11.85 |      6.48
-------------+----------------------+----------
  3. Manager |        10         59 |        69
             |     14.49      85.51 |    100.00
             |      5.71      27.96 |     17.88
-------------+----------------------+----------
       Total |       175        211 |       386
             |     45.34      54.66 |    100.00
             |    100.00     100.00 |    100.00

• Statistical methods:

– Test for two proportions: This could be used to test whether the proportion of males who are managers is different from the proportion of females who are managers. Before we do the test we need to form some indicator (dummy) variables. Choose Data ⊳ Create or change data ⊳ Other variable-creation commands ⊳ Create indicator variables. For variables to tabulate, select gender. For new variables’ stub, type new_gender. Two dummy variables will be created: new_gender1 and new_gender2. The variable new_gender2 will take the value 1 if the person is Male and 0 if the person is Female. Check this by doing some one-way frequency tabulations of gender, new_gender1, and new_gender2.

. quietly tabulate gender, generate(new_gender)

. tabulate gender, missing

       Male |      Freq.     Percent        Cum.
------------+-----------------------------------
            |         47        9.92        9.92
          f |        198       41.77       51.69
          m |        229       48.31      100.00
------------+-----------------------------------
      Total |        474      100.00

. tabulate new_gender2, missing

  gender==m |      Freq.     Percent        Cum.
------------+-----------------------------------
          0 |        198       41.77       41.77
          1 |        229       48.31       90.08
          . |         47        9.92      100.00
------------+-----------------------------------
      Total |        474      100.00

Repeat this for jobcat, using the stub new_manager. This creates three dummy variables: new_manager1, new_manager2, and new_manager3. Note that new_manager3 takes the value 1 if the person is a Manager and the value 0 if the person is not a Manager.

To test whether Males are more likely to be managers than Females, choose Statistics ⊳ Summaries, tables, and tests ⊳ Classical tests of hypotheses ⊳ Proportion test. In the main tab, select Two-group using groups. The variable name is new_manager3, and the Group variable name is new_gender2. Press OK.

. prtest new_manager3, by(new_gender2)

Two-sample test of proportions                     0: Number of obs =      175
                                                   1: Number of obs =      211
------------------------------------------------------------------------------
    Variable |       Mean   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
           0 |   .0571429   .0175463                      .0227528    .0915329
           1 |   .2796209   .0308976                      .2190628    .3401789
-------------+----------------------------------------------------------------
        diff |   -.222478   .0355321                     -.2921196   -.1528363
             |  under Ho:   .0391742    -5.68   0.000
------------------------------------------------------------------------------
        diff = prop(0) - prop(1)                                  z =  -5.6792
    Ho: diff = 0

    Ha: diff < 0                 Ha: diff != 0                 Ha: diff > 0
 Pr(Z < z) = 0.0000         Pr(|Z| < |z|) = 0.0000          Pr(Z > z) = 1.0000
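The same test can be run from the cell counts alone with the immediate command prtesti, which needs no dataset. The sketch below uses the counts from the cross-tabulation earlier in this section (10 managers among 175 females, 59 among 211 males); the scalar names are chosen here for illustration.

```stata
* Immediate form with counts: n1 x1 n2 x2 (females first, then males)
prtesti 175 10 211 59, count
* The z statistic by hand, using the pooled proportion under Ho
scalar p1 = 10/175
scalar p2 = 59/211
scalar pp = (10 + 59)/(175 + 211)
display (p1 - p2)/sqrt(pp*(1 - pp)*(1/175 + 1/211))
```

Both routes should agree with the z of about -5.68 reported by prtest.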

– Chi-squared test for independence

Choose Statistics ⊳ Frequency Tables ⊳ Two-way tables with measures of association. In the main tab, select jobcat as the Row variable and gender as the Column variable. Check Pearson’s Chi-Squared as the test statistic and Expected Frequency in the Cell Contents. Press OK.

. tabulate jobcat gender, chi2 expected

+--------------------+
| Key                |
|--------------------|
|     frequency      |
| expected frequency |
+--------------------+

             |         Male
Job Category |         f          m |     Total
-------------+----------------------+----------
 1. Clerical |       165        127 |       292
             |     132.4      159.6 |     292.0
-------------+----------------------+----------
2. Custodial |         0         25 |        25
             |      11.3       13.7 |      25.0
-------------+----------------------+----------
  3. Manager |        10         59 |        69
             |      31.3       37.7 |      69.0
-------------+----------------------+----------
       Total |       175        211 |       386
             |     175.0      211.0 |     386.0

          Pearson chi2(2) =  61.9234   Pr = 0.000
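Because the chi-squared test depends only on the cell counts, the table above can be re-entered with the immediate command tabi. The second command shows where one expected frequency comes from.

```stata
* Immediate form of tabulate: rows are separated by a backslash
tabi 165 127 \ 0 25 \ 10 59, chi2 expected
* Expected count for female clerical staff = row total * column total / grand total
display 292*175/386
```

The expected count displayed should round to the 132.4 shown in the first cell of the table.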

2.1.12 Numerical response variable, one categorical explanatory variable

Examples:

• Time to reunification and carer

• Strength of agreement to survey question and location

• Drug response and treatment group

• Sentence length and offense

• Graphics: boxplots

Choose Graphics ⊳ Boxplot. In the Main tab, choose prevexp as the Variable. On the Categories tab, choose jobcat as the Group 1 Grouping variable.

The corresponding command is:

. graph box prevexp, over(jobcat)

• Statistics: group means, standard deviations

Choose Statistics ⊳ Summaries, tables, and tests ⊳ Summaries and Descriptive Statistics ⊳ Summary Statistics. In the Main tab, select prevexp as the Variable. In the by/if/in tab, check Repeat command by groups and select jobcat as the variable that defines groups. Press OK.

. by jobcat, sort: summarize prevexp

--------------------------------------------------------------------------
-> jobcat = 1. Clerical

    Variable |       Obs        Mean    Std. Dev.       Min        Max
-------------+--------------------------------------------------------
     prevexp |       286     84.6958     94.9574          0        476

--------------------------------------------------------------------------
-> jobcat = 2. Custodial

    Variable |       Obs        Mean    Std. Dev.       Min        Max
-------------+--------------------------------------------------------
     prevexp |        20      302.45    101.1432        155        460

--------------------------------------------------------------------------
-> jobcat = 3. Manager

    Variable |       Obs        Mean    Std. Dev.       Min        Max
-------------+--------------------------------------------------------
     prevexp |        70        82.5    78.01742          3        285

--------------------------------------------------------------------------
-> jobcat = .

    Variable |       Obs        Mean    Std. Dev.       Min        Max
-------------+--------------------------------------------------------
     prevexp |        41    88.87805    94.74629          0        367

• Statistical methods:

– t-tests

The t-test is for testing for the difference in the means of two groups. Choose Summaries, tables, and tests ⊳ Classical tests of hypotheses ⊳ t-test (mean-comp test). On the main tab, select Two sample using groups. Choose cursal as the variable name and minority as the Group Variable name. Press Submit.

. ttest cursal, by(minority)

Two-sample t test with equal variances
------------------------------------------------------------------------------
   Group |     Obs        Mean    Std. Err.   Std. Dev.   [95% Conf. Interval]
---------+--------------------------------------------------------------------
   0. No |     306    36631.55     1054.51    18446.39    34556.52    38706.59
  1. Yes |      90    28012.78    1008.321    9565.776    26009.26    30016.29
---------+--------------------------------------------------------------------
combined |     396    34672.74    865.2023    17217.31    32971.76    36373.72
---------+--------------------------------------------------------------------
    diff |            8618.775    2021.078                4645.329    12592.22
------------------------------------------------------------------------------
    diff = mean(0. No) - mean(1. Yes)                             t =   4.2644
Ho: diff = 0                                     degrees of freedom =      394

    Ha: diff < 0                 Ha: diff != 0                 Ha: diff > 0
 Pr(T < t) = 1.0000         Pr(|T| > |t|) = 0.0000          Pr(T > t) = 0.0000

The above test assumes the variances of the two groups are the same. To relax this assumption check the Unequal variances box.

. ttest cursal, by(minority) unequal

Two-sample t test with unequal variances
------------------------------------------------------------------------------
   Group |     Obs        Mean    Std. Err.   Std. Dev.   [95% Conf. Interval]
---------+--------------------------------------------------------------------
   0. No |     306    36631.55     1054.51    18446.39    34556.52    38706.59
  1. Yes |      90    28012.78    1008.321    9565.776    26009.26    30016.29
---------+--------------------------------------------------------------------
combined |     396    34672.74    865.2023    17217.31    32971.76    36373.72
---------+--------------------------------------------------------------------
    diff |            8618.775    1459.008                5747.154    11490.39
------------------------------------------------------------------------------
    diff = mean(0. No) - mean(1. Yes)                             t =   5.9073
Ho: diff = 0                     Satterthwaite's degrees of freedom =  289.197

    Ha: diff < 0                 Ha: diff != 0                 Ha: diff > 0
 Pr(T < t) = 1.0000         Pr(|T| > |t|) = 0.0000          Pr(T > t) = 0.0000
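Both versions of the test can be reproduced from the group summaries alone with the immediate command ttesti. The sketch below uses the group sizes, means and standard deviations printed in the tables above, so small rounding differences are to be expected.

```stata
* Immediate form: n1 mean1 sd1 n2 mean2 sd2
ttesti 306 36631.55 18446.39 90 28012.78 9565.776
ttesti 306 36631.55 18446.39 90 28012.78 9565.776, unequal
* Unequal-variance standard error of the difference, from the two group standard errors
display sqrt(1054.51^2 + 1008.321^2)
```

The last line should come out close to the 1459.008 shown in the diff row of the unequal-variances table.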

– One-way ANOVA: The one-way ANOVA is for testing the differences between the means of two or more groups. Choose Linear Models and related ⊳ ANOVA/MANOVA ⊳ One-way ANOVA. In the main tab, choose prevexp as the Response variable and jobcat as the Factor variable. Click Bonferroni for the Multiple comparison test and Produce summary statistics for Output.

. oneway prevexp jobcat, bonferroni tabulate

        Job |  Summary of Previous Experience
   Category |        Mean   Std. Dev.       Freq.
------------+------------------------------------
  1. Cleric |   84.695804   94.957396         286
  2. Custod |      302.45   101.14319          20
  3. Manage |        82.5   78.017417          70
------------+------------------------------------
      Total |   95.869681   104.38264         376

                        Analysis of Variance
    Source              SS         df      MS            F     Prob > F
------------------------------------------------------------------------
Between groups      901729.629      2   450864.815     52.82     0.0000
 Within groups      3184170.98    373   8536.65143
------------------------------------------------------------------------
    Total           4085900.61    375    10895.735

Bartlett's test for equal variances:  chi2(2) =   4.2549  Prob>chi2 = 0.119

        Comparison of Previous Experience by Job Category
                          (Bonferroni)
Row Mean-|
Col Mean |   1. Cleri   2. Custo
---------+----------------------
2. Custo |    217.754
         |      0.000
         |
3. Manag |    -2.1958    -219.95
         |      1.000      0.000

An alternative is to use Statistics ⊳ Linear model and related ⊳ ANOVA/MANOVA ⊳ Analysis of variance and covariance. Choose prevexp as the Dependent variable and jobcat as the Model. We get the ANOVA table by default.

. anova prevexp jobcat

                       Number of obs =     376     R-squared     =  0.2207
                       Root MSE      =  92.394     Adj R-squared =  0.2165

    Source |  Partial SS    df       MS           F     Prob > F
-----------+----------------------------------------------------
     Model |  901729.629     2  450864.815      52.82     0.0000
           |
    jobcat |  901729.629     2  450864.815      52.82     0.0000
           |
  Residual |  3184170.98   373  8536.65143
-----------+----------------------------------------------------
     Total |  4085900.61   375   10895.735
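The summary numbers in the ANOVA table follow directly from the sums of squares and mean squares. A short worked check with display, using only figures printed in the table above:

```stata
* F = MS(Model) / MS(Residual)
display 450864.815/8536.65143
* R-squared = SS(Model) / SS(Total)
display 901729.629/4085900.61
* Root MSE = square root of MS(Residual)
display sqrt(8536.65143)
```

These should agree with F = 52.82, R-squared = 0.2207 and Root MSE = 92.394.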

The advantage of doing the ANOVA the second way is that we have access to a number of tests of the assumptions. For example we can test for heteroskedasticity by choosing Statistics ⊳ Linear model and related ⊳ ANOVA/MANOVA ⊳ Specification tests after anova. Choose Test for heteroskedasticity (hettest) and select the Breusch-Pagan / Cook-Weisberg test.

. estat hettest

Breusch-Pagan / Cook-Weisberg test for heteroskedasticity
         Ho: Constant variance
         Variables: fitted values of prevexp

         chi2(1)      =     0.26
         Prob > chi2  =   0.6086

We saw that custodial employees seemed to have longer previous experience than the other two groups. To test this we use Statistics ⊳ ANOVA/MANOVA ⊳ Test linear hypotheses after anova. In the Main tab, choose Coefficients are 0 or linear expressions are equal and press OK. For test statistic, choose Linear expressions are equal. In the coefficient box, select _b[2.jobcat] and press Add. Then select _b[1.jobcat] and press Add. Finally select _b[3.jobcat] and press Add. The expression is _b[2.jobcat]=_b[1.jobcat]=_b[3.jobcat].

. test (_b[2.jobcat] = _b[1.jobcat] = _b[3.jobcat])

 ( 1)  - 1b.jobcat + 2.jobcat = 0
 ( 2)  2.jobcat - 3.jobcat = 0

       F(  2,   373) =   52.82
            Prob > F =    0.0000

This gives the identical F test as before and tells us there are statistically significant differences between the groups. To get the test we want, edit the expression to become _b[2.jobcat]=(_b[1.jobcat]+_b[3.jobcat])/2 and press OK.

. test (_b[2.jobcat] = (_b[1.jobcat] + _b[3.jobcat])/2)

 ( 1)  - .5*1b.jobcat + 2.jobcat - .5*3.jobcat = 0

       F(  1,   373) =  103.05
            Prob > F =    0.0000
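The F statistic for this contrast can be rebuilt by hand from the group means, group sizes and residual mean square reported earlier. This is a sketch of the standard contrast calculation, not output from the original session; the scalar names are chosen here for illustration.

```stata
* Contrast estimate: custodial mean minus the average of the other two means
scalar est = 302.45 - (84.695804 + 82.5)/2
* Its variance: MS(Residual) times the sum of (weight squared / group size)
scalar v = 8536.65143*(1/20 + 0.25/286 + 0.25/70)
* F with 1 numerator degree of freedom is the squared estimate over its variance
display est^2/v
```

The result should be close to the F of 103.05 reported by test.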

2.1.13 Categorical response variable, one numerical explanatory variable

Examples:

• Reunification after 12 months and sentence duration

• Mortality and pollutant level

• Language spoken and length of time in Australia


• Graphics: boxplots

• Statistics: group means, standard deviations

• Statistical methods: logistic regression
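The text gives no commands for this case. A minimal sketch of the three steps listed above, assuming the auto dataset shipped with Stata, with the binary variable foreign as the categorical response and mpg as the numerical explanatory variable:

```stata
* Assumption: auto.dta is used only to make the commands runnable
sysuse auto, clear
* Graphics: boxplots of the numerical variable within each response category
graph box mpg, over(foreign)
* Statistics: group means and standard deviations
bysort foreign: summarize mpg
* Statistical method: logistic regression of the binary response on mpg
logit foreign mpg
```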

2.1.14 Numerical response variable, several explanatory variables

• Statistical methods

– Multiple regression (all numerical explanatory variables)

– Multi-way ANOVA (all categorical explanatory variables)

– ANCOVA (mixed explanatory variables)

– General linear model.

2.1.15 Categorical response variable, several explanatory variables

• Statistical methods

– Multi-way contingency table and chi-square test (all categorical explanatory variables)

– Multiple logistic regression (all numerical explanatory variables)

– Generalised linear model

2.1.16 Multivariate methods

• Reliability Analysis

The reliability of a measure is defined as

\[ \text{reliability} = \frac{\text{Variance of True Values}}{\text{Variance of Measured Values}}. \]

– The reliability goes from 0 to 1.

– The maximum possible correlation of a variable with reliability $\rho_X$ is $\sqrt{\rho_X}$. For example, consider a variable with reliability 0.8. Then the maximum correlation with another variable is $\sqrt{0.8} = 0.894$.

– Methods for estimating reliability:

* Test-retest

* Alternative form


* Split halves method: Take the odd items and even items of the scale and correlate the two results, giving $\rho_{XX'}$ as an estimate of the reliability. The estimated reliability of the whole scale is

\[ \rho_{XX''} = \frac{2\rho_{XX'}}{1 + \rho_{XX'}}. \]

For example, say $\rho_{XX'} = 0.8$; then the reliability of the whole scale is given by

\[ \rho_{XX''} = \frac{2(0.8)}{1.8} = 0.889. \]

– Cronbach’s alpha: This method uses the average of all the correlations between pairs of items in the scale. The formula is:

\[ \alpha = \frac{N\bar{\rho}}{1 + \bar{\rho}(N - 1)} \]

where $N$ is the number of items in the scale, and $\bar{\rho}$ is the average correlation between pairs of items in the scale.
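The reliability formulas above are easy to evaluate with display. In the sketch below the first two lines use the value 0.8 from the text; the average correlation of 0.3 in the last line is a hypothetical value chosen only to illustrate the alpha formula.

```stata
* Maximum correlation attainable by a variable with reliability 0.8
display sqrt(0.8)
* Split-half correlation of 0.8 stepped up to the whole scale
display 2*0.8/(1 + 0.8)
* Cronbach's alpha for N = 9 items with a hypothetical average correlation of 0.3
display 9*0.3/(1 + 0.3*(9 - 1))
```

With the average correlation held fixed, increasing N in the last line raises alpha, which is the behaviour described in the discussion of the alpha output.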

As an example, we will use a subset of the classic Holzinger and Swineford (1939) dataset [2]. We will use the dataset when Principal Component Analysis and (Exploratory) Factor Analysis are illustrated below (as well as an example of Confirmatory Factor Analysis, if you are coming back for Day 2 and Day 3).

[2] From the help file in the lavaan package (Rosseel, 2013): The classic dataset consists of mental ability test scores of seventh and eighth grade children from two different schools (Pasteur and Grant-White). In the original dataset (available in the MBESS package), there are scores for 26 tests. However, a smaller subset with 9 variables is more widely used in the literature (for example in Joreskog’s 1969 paper, which uses the 145 subjects from the Grant-White school only).

Choose Statistics ⊳ Cronbach’s alpha. In the Main tab, put x1-x9 in the Variables box. In the Options tab, check the List individual interitem correlation and covariance box as well as the Display item-test and item-rest correlation box.

. alpha x1-x9, detail item

Test scale = mean(unstandardized items)

                                                            average
                             item-test     item-rest       interitem
Item         |  Obs  Sign   correlation   correlation     covariance      alpha
-------------+-----------------------------------------------------------------
x1           |  301    +       0.6574        0.5220        .3132312      0.7248
x2           |  301    +       0.4530        0.2752         .363812      0.7642
x3           |  301    +       0.5283        0.3702        .3463561      0.7490
x4           |  301    +       0.7029        0.5807        .3021435      0.7150
x5           |  301    +       0.6733        0.5254        .3024577      0.7236
x6           |  301    +       0.7075        0.5951        .3057944      0.7142
x7           |  301    +       0.4138        0.2457        .3742465      0.7664
x8           |  301    +       0.5114        0.3691         .354241      0.7484
x9           |  301    +       0.6174        0.4948        .3316157      0.7310
-------------+-----------------------------------------------------------------
Test scale   |                                             .3326553      0.7605
-------------------------------------------------------------------------------

Interitem covariances (obs=301 in all pairs)

          x1       x2       x3       x4       x5       x6       x7       x8
x1    1.3629
x2    0.4087   1.3864
x3    0.5818   0.4526   1.2791
x4    0.5065   0.2096   0.2089   1.3552
x5    0.4421   0.2118   0.1127   1.1014   1.6653
x6    0.4563   0.2484   0.2449   0.8985   1.0179   1.2003
x7    0.0850  -0.0971   0.0886   0.2205   0.1435   0.1446   1.1871
x8    0.2647   0.1100   0.2130   0.1260   0.1812   0.1660   0.5370   1.0254
x9    0.4599   0.2448   0.3751   0.2442   0.2962   0.2368   0.3745   0.4588

          x9
x9    1.0184

Generally, Cronbach’s alpha increases with the number of items, although it may decrease if you add items that decrease the average correlation. The usual guidelines are that Cronbach’s alpha should be greater than 0.7, or at least 0.8 by stricter standards. Note that Cronbach’s alpha is a lower bound to the reliability, i.e. it is conservative, because it assumes that all items contribute to the scale equally. Also note that, for this data, Cronbach’s alpha is NOT appropriate because there is more than one underlying factor in this data.
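Because the output above is for unstandardized items, the reported alpha of 0.7605 is based on covariances rather than correlations. The sketch below applies the usual covariance form of the formula, alpha = k*cbar / (vbar + (k - 1)*cbar), to the printed figures; the scalar names are chosen here for illustration.

```stata
* Average item variance: the diagonal of the interitem covariance matrix
scalar vbar = (1.3629 + 1.3864 + 1.2791 + 1.3552 + 1.6653 + 1.2003 + 1.1871 + 1.0254 + 1.0184)/9
* Average interitem covariance, taken from the Test scale row
scalar cbar = .3326553
* Number of items
scalar k = 9
display k*cbar/(vbar + (k - 1)*cbar)
```

The result should match the Test scale alpha up to rounding of the printed covariances.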

• Principal components

The purpose is to find a hopefully small number of linear combinations of the variables that explain the variability in the data. The first principal component is the linear combination that has the maximum variance. The second principal component is the linear combination, uncorrelated with the first, that has the maximum variance. If there are k variables, then there are k principal components, but we hope to use only a small set.

If the variables are on different scales, you should do the analysis based on the correlation matrix; this is the default. You will get different answers if you base your analysis on the covariance matrix.

Choose Statistics ⊳ Multivariate Analysis ⊳ Factor and principal components analysis ⊳ Principal component analysis (PCA). On the Model tab, type x1-x9 as the Variables. On the Model 2 tab, check Minimum eigenvalue to be retained and type 1. Press OK.

. pca x1-x9, mineigen(1)

Principal components/correlation                 Number of obs    =        301
                                                 Number of comp.  =          3
                                                 Trace            =          9
    Rotation: (unrotated = principal)            Rho              =     0.6911

    --------------------------------------------------------------------------
       Component |   Eigenvalue   Difference         Proportion   Cumulative
    -------------+------------------------------------------------------------
           Comp1 |      3.21634      1.57763             0.3574       0.3574
           Comp2 |      1.63871      .273554             0.1821       0.5395
           Comp3 |      1.36516      .666241             0.1517       0.6911
           Comp4 |      .698918      .114571             0.0777       0.7688
           Comp5 |      .584348     .0846603             0.0649       0.8337
           Comp6 |      .499687     .0265851             0.0555       0.8892
           Comp7 |      .473102        .1871             0.0526       0.9418
           Comp8 |      .286002     .0482767             0.0318       0.9736
           Comp9 |      .237726            .             0.0264       1.0000
    --------------------------------------------------------------------------

Principal components (eigenvectors)

    ----------------------------------------------------------
        Variable |    Comp1     Comp2     Comp3 | Unexplained
    -------------+------------------------------+-------------
              x1 |   0.3671    0.0984    0.3174 |       .4132
              x2 |   0.2173    0.0677    0.5318 |       .4546
              x3 |   0.2660    0.2574    0.4644 |       .3694
              x4 |   0.4270   -0.3488   -0.1450 |       .1855
              x5 |   0.4113   -0.3778   -0.1856 |        .175
              x6 |   0.4306   -0.3350   -0.0972 |        .207
              x7 |   0.1945    0.3912   -0.5059 |       .2782
              x8 |   0.2533    0.4796   -0.2823 |        .308
              x9 |   0.3294    0.3998   -0.0169 |       .3888
    ----------------------------------------------------------

The interpretation of the output is :

– The eigenvalues are the variances of the linear combina-tions. The sum of the variances adds up to 9, the number ofvariables, since the analysis is being done on the correlationmatrix.

Page 52: Introduction to Stata - ESQUANT

52 introduction to stata

– Note that three components have eigenvalues greater than 1

and that is why three components are reported. These threecomponents explain just over 69% of the variance in the data.

– Another way to check the number of components is to use a scree plot. Choose Statistics ⊳ Multivariate Analysis ⊳ Factor and principal components analysis ⊳ Postestimation ⊳ Scree plot of eigenvalues. On the CI tab, check Graph confidence intervals (valid after PCA only). The graph is below:

The “elbow” of the scree plot is at 4, so we choose 3 components.

– It is NOT appropriate to rotate the components, although Stata will let you do this.
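
The PCA steps above can also be run from the command line. The following is a minimal sketch, assuming the Holzinger and Swineford variables x1 to x9 are already in memory. The yline(1) reference line is an addition made here to mark the eigenvalue-greater-than-1 cutoff; it is not part of the menu steps.

```stata
* Sketch only: assumes x1-x9 are already loaded in memory.
pca x1-x9, mineigen(1)      // keep components with eigenvalue > 1
screeplot, ci yline(1)      // scree plot with confidence intervals

* Hand check of the reported proportions (trace = 9 variables)
display 3.21634/9                          // share of variance for Comp1
display (3.21634 + 1.63871 + 1.36516)/9    // share for Comp1 to Comp3
```

The two display lines should agree, to four decimal places, with the Proportion for Comp1 (0.3574) and the Rho value (0.6911) in the output.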

• Factor analysis

The purpose of factor analysis is to uncover a small number of underlying factors that explain the correlation between the observed variables. For one factor, the model is

x = µ + λ f + u

where λ is the vector of loadings, and u is a vector of unique factors for each observational unit. The factor is assumed to have mean 0 and variance 1, and the u’s have mean 0 and a diagonal covariance matrix Ψ, which needs to be estimated. For k < p common factors, where p is the number of observed variables, we have a vector f of common factors and a loadings matrix Λ, and

x = µ + Λ f + u

where the components of f have unit variance and are (initially) uncorrelated, and f and u are assumed to be uncorrelated.

With this model, the fitted covariance is given by

Σ = ΛΛᵀ + Ψ.

Remembering that Ψ is a diagonal matrix, it means that all the correlations between the variables are explained only by the common factors.
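
Written element by element, the fitted covariance shows why the off-diagonal terms depend only on the loadings. This is a sketch under the model assumptions stated above; the notation λ_ij for the (i, j) entry of Λ and ψ_i for the i-th diagonal entry of Ψ is introduced here for illustration.

```latex
\begin{align*}
\operatorname{Var}(x_i) &= \sum_{j=1}^{k} \lambda_{ij}^{2} + \psi_i, \\
\operatorname{Cov}(x_i, x_l) &= \sum_{j=1}^{k} \lambda_{ij}\,\lambda_{lj}, \qquad i \neq l.
\end{align*}
```

Because Ψ is diagonal, it contributes only to the variances; every covariance between two different variables comes from the common factors.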

Choose Statistics ⊳ Multivariate Analysis ⊳ Factor and principal components analysis ⊳ Factor Analysis, and press OK.

. factor x1 x2 x3 x4 x5 x6 x7 x8 x9
(obs=301)

Factor analysis/correlation                        Number of obs    =      301
    Method: principal factors                      Retained factors =        3
    Rotation: (unrotated)                          Number of params =       24

    --------------------------------------------------------------------------
         Factor  |   Eigenvalue   Difference        Proportion   Cumulative
    -------------+------------------------------------------------------------
        Factor1  |      2.71201      1.64003            0.7366       0.7366
        Factor2  |      1.07198      0.41150            0.2912       1.0278
        Factor3  |      0.66048      0.68551            0.1794       1.2072
        Factor4  |     -0.02503      0.03232           -0.0068       1.2004
        Factor5  |     -0.05735      0.04508           -0.0156       1.1848
        Factor6  |     -0.10243      0.05444           -0.0278       1.1570
        Factor7  |     -0.15687      0.01971           -0.0426       1.1144
        Factor8  |     -0.17658      0.06801           -0.0480       1.0664
        Factor9  |     -0.24460            .           -0.0664       1.0000
    --------------------------------------------------------------------------
    LR test: independent vs. saturated:  chi2(36) =  907.15 Prob>chi2 = 0.0000

Factor loadings (pattern matrix) and unique variances

    -----------------------------------------------------------
        Variable |  Factor1   Factor2   Factor3 |   Uniqueness
    -------------+------------------------------+--------------
              x1 |   0.5623    0.1871    0.3047 |      0.5560
              x2 |   0.3055    0.1180    0.3724 |      0.7540
              x3 |   0.3827    0.3083    0.3713 |      0.6207
              x4 |   0.7596   -0.3356   -0.0849 |      0.3032
              x5 |   0.7374   -0.3789   -0.1352 |      0.2945
              x6 |   0.7567   -0.3084   -0.0327 |      0.3312
              x7 |   0.2871    0.3701   -0.4371 |      0.5895
              x8 |   0.3728    0.4904   -0.2699 |      0.5477
              x9 |   0.4913    0.4464   -0.0248 |      0.5587
    -----------------------------------------------------------

The interpretation of the output is below:

– The loadings are the correlations between the variables and the factors.

– With the principal factor method, it is possible to get negative eigenvalues.

– The communality for x1, the shared variance, is given by:

0.5623² + 0.1871² + 0.3047² = 0.4440 = 1 − Uniqueness

With the principal factors method in Stata, the number of retained factors is the number of positive eigenvalues. Again, it is better to check this with a scree plot, which shows that 3 is probably appropriate.
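
The communality arithmetic can be checked directly in Stata. This is a minimal sketch, assuming the factor command above has just been run so that its stored loadings e(L) are available; the matrix names L and h2 are illustrative.

```stata
* Hand check for x1 using the loadings reported above
display 0.5623^2 + 0.1871^2 + 0.3047^2   // communality of x1
display 1 - 0.5560                       // 1 - uniqueness of x1

* Same quantity for every variable, from the stored results
matrix L  = e(L)              // unrotated loadings (9 variables x 3 factors)
matrix h2 = vecdiag(L*L')'    // row sums of squared loadings
matrix list h2

screeplot                     // scree plot of the factor eigenvalues
```

Each entry of h2 plus the matching Uniqueness value should be close to 1, up to rounding.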

If we rotate the factors (with an orthogonal rotation) we get an equivalent model. There are an infinite number of possible rotations. One method is varimax. The aim is to have factors with a few large loadings and as many near-zero loadings as possible. After the rotation, each factor will correlate strongly with a small set of variables and only weakly with the others. Choose Statistics ⊳ Multivariate Analysis ⊳ Factor and principal components analysis ⊳ Factor Analysis ⊳ Postestimation ⊳ Rotate loadings. In the Main tab, select Varimax as the Orthogonal rotation. In the Reporting tab, check Display loadings as blank when loadings < and type 0.4. Press OK. The output is below:

. rotate, blanks(0.4)

Factor analysis/correlation                        Number of obs    =      301
    Method: principal factors                      Retained factors =        3
    Rotation: orthogonal varimax (Kaiser off)      Number of params =       24

    --------------------------------------------------------------------------
         Factor  |     Variance   Difference        Proportion   Cumulative
    -------------+------------------------------------------------------------
        Factor1  |      2.20189      1.07190            0.5981       0.5981
        Factor2  |      1.12999      0.01740            0.3069       0.9050
        Factor3  |      1.11259            .            0.3022       1.2072
    --------------------------------------------------------------------------
    LR test: independent vs. saturated:  chi2(36) =  907.15 Prob>chi2 = 0.0000

Rotated factor loadings (pattern matrix) and unique variances

    -----------------------------------------------------------
        Variable |  Factor1   Factor2   Factor3 |   Uniqueness
    -------------+------------------------------+--------------
              x1 |                       0.5622 |      0.5560
              x2 |                       0.4756 |      0.7540
              x3 |                       0.5927 |      0.6207
              x4 |   0.8243                     |      0.3032
              x5 |   0.8362                     |      0.2945
              x6 |   0.7998                     |      0.3312
              x7 |             0.6294           |      0.5895
              x8 |             0.6458           |      0.5477
              x9 |             0.5093           |      0.5587
    -----------------------------------------------------------
    (blanks represent abs(loading)<.4)

Factor rotation matrix

    --------------------------------------------
                 |  Factor1   Factor2   Factor3
    -------------+------------------------------
         Factor1 |   0.8332    0.3544    0.4244
         Factor2 |  -0.5334    0.7174    0.4481
         Factor3 |  -0.1456   -0.5998    0.7868
    --------------------------------------------
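
The claim that an orthogonal rotation gives an equivalent model can be checked with one line of algebra. In this sketch, T stands for any orthogonal matrix (such as the factor rotation matrix printed above) and Λ* = ΛT for the rotated loadings; both symbols are introduced here for illustration.

```latex
\[
\Lambda^{*}\Lambda^{*T} + \Psi
  = \Lambda T T^{T} \Lambda^{T} + \Psi
  = \Lambda \Lambda^{T} + \Psi
  = \Sigma,
\qquad \text{since } T T^{T} = I .
\]
```

The fitted covariance and the unique variances are unchanged by the rotation, which is consistent with the Uniqueness column being identical in the unrotated and rotated outputs.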

Note how the variables are now nicely grouped in groups of three. The variables x1, x2, and x3 line up with Factor3; the variables x4, x5, and x6 line up with Factor1; and the variables x7, x8, and x9 line up with Factor2. We can get a plot of the factor scores using Statistics ⊳ Multivariate Analysis ⊳ Factor and principal components analysis ⊳ Factor Analysis ⊳ Postestimation ⊳ Variables score plot.

We can see the factor scores do appear to be uncorrelated. To save the factor scores we use the following command:

. predict f1 f2 f3

Open the browser, and you will see the new variables. Do a matrix scatter plot to confirm that the Variables score plot is the same.
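
The check described above can also be done from the command line. This is a minimal sketch, assuming the varimax rotation and the predict command above have already been run, so that f1, f2, and f3 exist.

```stata
* Sketch only: assumes f1, f2, f3 were created by predict after the rotation.
describe f1 f2 f3         // confirm the new score variables exist
correlate f1 f2 f3        // correlations between the saved scores
graph matrix f1 f2 f3     // matrix scatter plot of the scores
```

The correlations between the estimated scores should be small, although they need not be exactly zero.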

There is no need for the factors to be orthogonal. We could do an oblique rotation, which allows the factors to be correlated. Choose Statistics ⊳ Multivariate Analysis ⊳ Factor and principal components analysis ⊳ Factor Analysis ⊳ Postestimation ⊳ Rotate loadings. In the Main tab, select Promax as the Oblique rotation. In the Reporting tab, check Display loadings as blank when loadings < and type 0.4. Press OK. The output is below:

. rotate, promax(3) oblique blanks(0.4)

Factor analysis/correlation                        Number of obs    =      301
    Method: principal factors                      Retained factors =        3
    Rotation: oblique promax (Kaiser off)          Number of params =       24

    --------------------------------------------------------------------------
         Factor  |     Variance   Proportion    Rotated factors are correlated
    -------------+------------------------------------------------------------
        Factor1  |      2.39912       0.6516
        Factor2  |      1.71830       0.4667
        Factor3  |      1.40893       0.3827
    --------------------------------------------------------------------------
    LR test: independent vs. saturated:  chi2(36) =  907.15 Prob>chi2 = 0.0000

Rotated factor loadings (pattern matrix) and unique variances

    -----------------------------------------------------------
        Variable |  Factor1   Factor2   Factor3 |   Uniqueness
    -------------+------------------------------+--------------
              x1 |             0.5727           |      0.5560
              x2 |             0.5169           |      0.7540
              x3 |             0.6332           |      0.6207
              x4 |   0.8265                     |      0.3032
              x5 |   0.8600                     |      0.2945
              x6 |   0.7872                     |      0.3312
              x7 |                       0.6619 |      0.5895
              x8 |                       0.6434 |      0.5477
              x9 |                       0.4566 |      0.5587
    -----------------------------------------------------------
    (blanks represent abs(loading)<.4)

Factor rotation matrix

    --------------------------------------------
                 |  Factor1   Factor2   Factor3
    -------------+------------------------------
         Factor1 |   0.9013    0.6960    0.5235
         Factor2 |  -0.4185    0.3944    0.6725
         Factor3 |  -0.1119    0.6000   -0.5231
    --------------------------------------------

We will often get a neater picture with an oblique rotation. We can get a plot of the factor scores using Statistics ⊳ Multivariate Analysis ⊳ Factor and principal components analysis ⊳ Factor Analysis ⊳ Postestimation ⊳ Variables score plot.

We can see the factor scores do appear to be correlated. To save the factor scores we use the following command:

. predict ff1 ff2 ff3

Again, check the new variables in the browser window and do a matrix scatter plot.
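
The correlation between the oblique factors can also be inspected numerically. This is a minimal sketch, assuming the promax rotation and the predict command above have already been run, so that ff1, ff2, and ff3 exist.

```stata
* Sketch only: run after the promax rotation and predict shown above.
estat common                 // correlation matrix of the rotated common factors
correlate ff1 ff2 ff3        // sample correlations of the saved scores
graph matrix ff1 ff2 ff3     // matrix scatter plot of the scores
```

After an oblique rotation, estat common is expected to show non-zero off-diagonal entries; after an orthogonal rotation it shows an identity matrix.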

3 Exercises

3.1 Introduction to Stata

1. Close and save any log files you have open, and save the current dataset as a Stata file.

2. Open a new log file.

3. Open the voter.dta dataset supplied with Stata.

4. Using the menus, draw an appropriate bar chart that reveals the main features of the data.

5. The variables frac, pfrac and pop don’t have labels. To determine what labels they should have, tabulate the data. Work out how you can determine what percentage of Clinton voters earn $75K+. Also determine how you can show what proportion of voters who earn less than $15K voted for Bush.

6. Go back to the bar chart and use the Graph Editor to improve its appearance. In particular, how can you prevent the names of the politicians from overlapping?

7. Close the log file.

3.2 Commands in Stata

1. Close and save any log files you have open, and save the current dataset as a Stata file.

2. Open a new log file.

3. Send the commands you used for the Introduction to Stata exercise to the do-file editor.

4. Edit the commands as appropriate. Add in the heading and closing commands and comments. Submit the do-file and, if there are any errors, fix them and try again until you are satisfied that the do-file works.

5. Close the log file.

3.3 Statistics in Stata

1. Close and save any log files you have open, and save the current dataset as a Stata file.

2. Open a new log file.

3. Open the lifeexp.dta dataset supplied with Stata.

4. Plot lexp vs gnppc.

5. Take logs of the dependent variable and re-plot.

6. Take logs of the explanatory variable and re-plot.

7. Fit an appropriate linear model. Is it appropriate to add popgrowth as an explanatory variable? Do you need to take into account any heteroscedasticity?

8. Based on your analysis, if you have two countries which have similar characteristics but one has 10% higher GDP than the other, how much do you expect life expectancy to differ?

9. Close the log file.

3.4 Multivariate Methods

1. Close and save any log files you have open, and save the current dataset as a Stata file.

2. Open a new log file.

3. Smith and Stanley (1983) reported results of six tests given to 112 individuals. The tests were of general intelligence, picture completion, block design, mazes, reading comprehension, and vocabulary. The variance-covariance matrix is given below:

          general  picture   blocks     maze  reading    vocab
general    24.641    5.991   33.520    6.023   20.755   29.701
picture     5.991    6.700   18.137    1.782    4.936    7.204
blocks     33.520   18.137  149.831   19.424   31.430   50.753
maze        6.023    1.782   19.424   12.711    4.757    9.075
reading    20.755    4.936   31.430    4.757   52.604   66.762
vocab      29.701    7.204   50.753    9.075   66.762  135.292

Conduct a factor analysis on the data and interpret the derived factors.

4. Close the log file.
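
As a starting hint for item 3, where only a covariance matrix is available rather than raw data, one possible approach is Stata's factormat command. The following is a sketch under that assumption and not a full solution; the matrix name C is illustrative, and the /// line continuations mean it should be run from a do-file.

```stata
* Sketch only: enter the covariance matrix from item 3 (matrix name C is illustrative).
matrix C = (24.641,  5.991,  33.520,  6.023, 20.755,  29.701 \ ///
             5.991,  6.700,  18.137,  1.782,  4.936,   7.204 \ ///
            33.520, 18.137, 149.831, 19.424, 31.430,  50.753 \ ///
             6.023,  1.782,  19.424, 12.711,  4.757,   9.075 \ ///
            20.755,  4.936,  31.430,  4.757, 52.604,  66.762 \ ///
            29.701,  7.204,  50.753,  9.075, 66.762, 135.292)

* factormat works from a correlation or covariance matrix plus the sample size
factormat C, n(112) names(general picture blocks maze reading vocab)
```

The choice of extraction method, the number of factors, and any rotation are left as part of the exercise.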

4 Datasets

4.1 Holzinger Swineford 1939

HolzingerSwineford1939 {lavaan}                      R Documentation

Holzinger and Swineford Dataset (9 Variables)

Description

The classic Holzinger and Swineford (1939) dataset
consists of mental ability test scores of
seventh- and eighth-grade children from
two different schools (Pasteur and Grant-White).
In the original dataset (available in the MBESS
package), there are scores for 26 tests. However,
a smaller subset with 9 variables is more widely
used in the literature (for example in Joreskog's
1969 paper, which also uses the 145 subjects
from the Grant-White school only).

Usage

data(HolzingerSwineford1939)

Format

A data frame with 301 observations of 15 variables.

id      Identifier
sex     Gender
ageyr   Age, year part
agemo   Age, month part
school  School (Pasteur or Grant-White)
grade   Grade
x1      Visual perception
x2      Cubes
x3      Lozenges
x4      Paragraph comprehension
x5      Sentence completion
x6      Word meaning
x7      Speeded addition
x8      Speeded counting of dots
x9      Speeded discrimination straight and curved capitals

Source

This dataset was retrieved from
http://web.missouri.edu/~kolenikovs/stata/hs-cfa.dta
and converted to a csv file.

References

Holzinger, K., and Swineford, F. (1939). A study in factor
analysis: The stability of a bifactor solution. Supplementary
Educational Monograph, no. 48. Chicago: University of
Chicago Press.

Joreskog, K. G. (1969). A general approach to confirmatory
maximum likelihood factor analysis. Psychometrika, 34,
183-202.

5 Bibliography

[1] Alan C. Acock. A Gentle Introduction to Stata. Stata Press, Texas, revised third edition, 2012.

[2] E.G. Carmines and R.A. Zeller. Reliability and Validity Assessment. Sage, Thousand Oaks, CA, 1979.

[3] B. Everitt and T. Hothorn. An Introduction to Applied Multivariate Analysis with R. Springer, NY, 2011.

[4] John C. Loehlin. Latent Variable Models: An Introduction to Factor, Path, and Structural Equation Analysis. Lawrence Erlbaum Associates, 4th edition, 2004.

[5] J. Scott Long. The Workflow of Data Analysis Using Stata. Stata Press, 2009.