STATA Tutorial

GVPT622: Quantitative Methods I

September 4, 2002

1 Preliminaries

STATA is a command-line driven statistics package. This means that, much like DOS, you need to type commands into the software to make it execute any routine. While this is a bit more difficult than menu-driven packages like SPSS, it is much faster and more flexible. This document is meant to get you started working with STATA.

2 Getting Started

2.1 Why All those Windows?

STATA is a multiple-windowed environment. When you open STATA, you will see 4 windows.

1. Review window - The review window gives you a list of previously typed commands. You can access these in two different ways. You can scroll through the previous commands with the scroll arrows and click on the command, or each time you hit the “page up” key the previous command will show up in the command window. If you hit the “page up” key twice, the next-to-last command will pop up, and so forth.

2. Variables window - The variables window provides a list of the variables and their labels that are in the currently loaded dataset.

3. STATA Command window - The STATA command window is where the user can interact with STATA. This is where commands are typed in.

4. STATA Results window - The STATA results window shows you the results of the commands typed into STATA.

The Graphics Window displays graphs as a result of a graph command being typed in the command window. This window will not be visible when you first open STATA; rather, it will pop up directly following a graphical command.

2.2 Log Files

Log files are a way to save all of the commands and corresponding output generated during a STATA session. It is essential to open a log file during every session to keep track of data manipulation and any analysis performed. A log file can be opened in two different ways. First, you can open a log using the menu option File>>Log>>Begin, at which point the program will ask you to specify the name for the log file. Make sure that you put it in the directory you want. There will be two options for a log file. The .smcl format is a formatted log and the .log format is an unformatted log which can be opened in any text editor. I find the .log files easier to work with, but this is only a personal preference. If you want a .log file, make sure to change the option in the pull-down menu of the save box. The second way to open a log is to type the command in the command window:

log using filename [, append replace [ text | smcl ] ]

This type of log file captures everything that comes up in the results window. There is another type of log file - the command log - that captures only commands, not output. This type of file is one that would allow you to replicate your analysis with just one command. The command log can be requested using command-line syntax as follows:

cmdlog using filename [, append replace]

The command log can then be opened as a text file, or can be opened in STATA’s do-file editor. You can access this document by clicking the envelope-looking button (the fifth one from the left in STATA’s toolbar), which opens the do-file editor, or you can access it through the command line by typing:

doedit filename

The file can easily be run in STATA by typing in the command line:

do filename

or by clicking on the Tools>>Run menu option in the STATA do-file editor.

Log files can be suspended or closed. Suspending a log file can be done by typing “log off”, which temporarily closes the log file. The log can then be turned back on by typing “log on”. Closing a log file is done by typing “log close”. You may then open a new log file. You could open the same log file and add more information by typing:

log using logname.log, append

You can also replace a log file with a new log file by replacing the append option with the replace option in the syntax above.
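As a rough illustration of the log commands above, a session might look like the following. The file name session1.log and the variable x are made up for the example, and the opening capture log close is an added precaution on the assumption that a log might already be open.

```stata
* Close any log that happens to be open; capture suppresses the error if none is
capture log close

* A tiny made-up dataset so there is something to analyze
clear
input x
1
2
3
end

* Open a plain-text log, overwriting any earlier file of the same name
log using session1.log, text replace
summarize x

* Suspend logging: output from describe is not written to the log
log off
describe

* Resume logging, then close the log
log on
log close

* Reopen the same file and add to the end of it
log using session1.log, text append
list
log close
```

Opening session1.log in a text editor afterwards should show the summarize and list output but not the describe output.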

2.3 Getting Data in STATA Format

There are 4 main ways to put your data in STATA:

1. Typing data into STATA.

2. Copy and paste from another program.

3. Infile/Insheet.

4. Stat-Transfer.

2.3.1 Typing in Data

You can type data into STATA directly without having it in any other format. You can type data into a spreadsheet environment either by typing “edit” in the command window or by pushing the button with the spreadsheet (without the magnifying glass) on it. The button with the spreadsheet and a magnifying glass is a data viewing environment where the data cannot be edited. STATA’s capabilities as a spreadsheet are lacking, so typing directly into STATA is only a good idea with very small datasets.

2.3.2 Copy and Paste

Data can be cut and pasted from other programs into STATA. This works particularly well with spreadsheet data, but can also work with text delimited data. To copy and paste data into STATA, simply open the data editor as suggested above, then copy from a spreadsheet like Excel and paste into STATA.

There are a couple of text editors that will allow the user to copy blocks of text, such as columns, from the middle of a document, which is particularly useful in these types of situations. These are Textpad - www.textpad.org and WinEdt - www.winedt.org. These are essentially shareware. WinEdt is a $30 registration and the program becomes annoying after the trial period is up. Textpad could also be registered but is not particularly annoying if you don’t.

2.3.3 infile/insheet

STATA’s infile command allows the user to bring any sheet of data into the program. This is usually done from a .txt document. The data file should be delimited by tabs, spaces or commas. Insheet is a similar command that is specifically designed for data read out of a spreadsheet program; in this utility, the delimiter is an argument to the function, whereas it is not in the infile command. The syntax of the infile command is as follows (see footnote 1):

infile varlist [_skip[(#)] [varlist [_skip[(#)] ...]]] using filename [if exp][in range][, automatic byvariable(#) clear ]

The syntax to the insheet command is:

insheet [varlist] using filename [, double [no]names { comma | tab | delimiter("char") } clear ]

You can get a description of what the arguments mean to these and other functions by typing:

help infile1

help insheet

or more generally:

help <function>
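To make the insheet syntax above concrete, here is a small round-trip sketch. It assumes a comma-delimited file called scores.csv, a made-up name; so that the example is self-contained, the file is first written out with outsheet from a few typed-in observations and then read back in.

```stata
* Type in a tiny made-up dataset
clear
input id score
1 90
2 85
3 77
end

* Write it out as comma-delimited text with variable names on the first line
outsheet using scores.csv, comma replace

* Read the text file back in; clear drops the data currently in memory
insheet using scores.csv, comma clear
list
```

In practice the text file would come from a spreadsheet program rather than from outsheet; only the insheet line would be needed.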

Dictionaries Dictionaries are a way to define variable types. STATA does not like to infile string variables without a dictionary. Dictionary files include not only the data you want to input, but a dictionary command at the beginning. For an example, see “H:/GVPT622 F02/auto.dct”; you can open it in a text editor. STATA has two basic types of variables: string and numeric. To use a dictionary with the infile statement, just type:

infile using [filename.dct]

1. String variables are those that contain at least one non-numeric character such as a letter or symbol. STATA calls these “str” variables. There is always a number after the “str” which denotes how many characters wide the variable is, so a variable that is str8 is 8 characters long.

2. Numeric variables are those containing only numbers (including possibly a decimal point). There are different kinds of numeric variables: byte, int, long, float and double. They all have different minima, maxima and precision toward zero. Type “help datatypes” for a more thorough discussion.

2.3.4 Stat-Transfer

By far the easiest way to get data into STATA, or nearly any other format for that matter, is with Stat-Transfer. This program allows the user to take data in nearly any format (including SAS, SPSS, Excel (or other spreadsheet), Access (or other database), Systat, Gauss, Limdep, Matlab, Statistica, etc...) and transfer the data into any other format. One of the benefits is that variable names and labels as well as value labels tend to be preserved across formats. Stat-Transfer is a Windows program that should be on the statistical software menu in the graduate lab or in LeFrak.

The program works in 4 steps.

1. Choose the type of file you want to transfer.

2. Find the file on your computer.

3. Specify the type of file into which you want to transfer your data.

4. Hit “Transfer”.

For more advanced users, there are tabs for observations, variables and options that will help the user tweak the program to produce more polished data, but oftentimes specifying further options in these tabs is not necessary.

Footnote 1: The hard brackets [ and ] in the commands need not be entered in the syntax; they mark optional elements and are shown simply for clarity in the presentation.

3 Saving and Loading Data in STATA Format

3.1 Loading Data

Data can be loaded in one of two ways:

1. Menu - With the Menu option File>>Open, you can search and load data. Similarly, you can type ctrl+O or hit the open folder button, the first one on the left-hand side of the STATA toolbar.

2. Syntax - You can type the use command directly into the command window. The command is as follows:

use filename [, clear nolabel ]

The clear option allows data to be loaded even if data are currently loaded into the program and have changed since the last save command was executed; those unsaved changes are discarded.

3.2 Saving Data

Data can be saved in a couple of different ways as well.

1. Saving with the Menu - The menu option File>>Save or File>>Save As can be used to save data in STATA format. These files end in a “.dta” extension. One can also hit ctrl+s to save.

2. Syntax - data can be saved using the command save as follows:

save [filename] [, nolabel old replace all intercooled ]

Where old instructs the software to save the dataset in the format of the previous version of STATA. You shouldn’t need this in the lab, but will if you’re using STATA 7 elsewhere and want to use the data in STATA 6 in the lab. Replace simply replaces the dataset if there is one that has the exact same name. The other options are irrelevant to your work.
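A minimal sketch of the save and use commands together is given below. The dataset name scores and its two variables are made up for illustration.

```stata
* Type in a tiny made-up dataset
clear
input id score
1 90
2 85
3 77
end

* Save it as scores.dta, overwriting any existing file of that name
save scores, replace

* Change the data in memory without saving
replace score = 0

* Reload the saved copy; clear is needed because the data in memory
* have changed since the last save
use scores, clear
list
```

After the final use command the original scores are back in memory, since the change made by replace was never saved.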

4 Graphing

STATA’s graphing capabilities are not the best of the statistical packages, but they are sufficient for exploratory analysis. They are, however, probably not good enough for publication. There are many possibilities. These can be broken down into two basic types - univariate and bivariate.

4.1 Univariate Graphs

Univariate graphs are usually meant to describe the distributional properties of a single variable. These include histograms, density plots, boxplots, and oneway scatterplots.

4.1.1 Histogram

Histograms - Histograms place observations into categories (or bins) which are then graphed as a function of the percentage of the total observations that are in each bin. The command in STATA is:

graph [variable] [weight] [if exp] [in range], histogram [common_options bin(#) {freq | percent} normal[(#,#)] density(#)]

The “bin” argument allows you to set the number of categories into which the observations are placed. A density curve can be imposed on the histogram.

4.1.2 Density Plots

A density plot is also called a “smoothed histogram”. In this graph, there are no bins. It is a single line that is more like the population density function than the histogram. The command in STATA for this is:

kdensity varname [weight] [if exp] [in range] [, nograph generate(newvarx newvard) n(#) width(#) {biweight|cosine|epan|gauss|parzen|rectangle|triangle} normal stud(#) at(varx) symbol(...) connect(...) title(string) graph_options ]

The gauss option is probably the one that will be most useful. The biweight, cosine, epan (Epanechnikov), parzen, rectangle and triangle options are all options that control how observations are weighted (this is analogous to deciding which bin they are in).

4.1.3 Boxplots

Boxplots, sometimes called “box and whisker” plots, are particularly good at showing the spread of a distribution. The box represents the inter-quartile range (the range between the 25th and 75th percentiles). The whiskers cover most of the rest of the observations, but some extreme outliers can still lie outside the whiskers. The STATA command to make a boxplot is:

graph [varlist] [weight] [if exp] [in range], box [common_options [no]alt vwidth root]

4.1.4 Oneway Scatterplots

Oneway scatterplots (also called rug plots in other packages) are yet another way to visualize univariate distributions. These are particularly good with smaller datasets; with larger ones, the distributional qualities are not distinguishable. The STATA command to construct a oneway scatterplot is:

graph [varlist] [weight] [if exp] [in range], oneway [common_options jitter(#)]

4.2 Bivariate Graphs

Bivariate graphs display the relationship between two variables. While theoretically there are a number of possibilities for visualizing two variables together, such as a joint density plot, the one used almost exclusively is the bivariate scatterplot.

4.2.1 Bivariate Scatterplot

The bivariate scatterplot uses the values on two variables (X and Y) as coordinates graphed onto a set of coordinate axes. A number of different lines can be plotted on the graphs to further describe the relationship between the two variables. We will learn more about these later in the semester. The command to create a bivariate scatterplot in STATA is:

graph [varlist] [weight] [if exp] [in range], twoway [common_options jitter(#) rescale rbox {y|x|r} reverse]

You may consult the STATA graphing manual or help files for more specific help on any of these and many other commands for the graphical display of data.
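The graph syntax in this section is that of the STATA versions current when this tutorial was written. For readers on STATA 8 or later, where the graph command was redesigned, the lines below are a rough sketch of the closest newer equivalents; the old syntax remains available in those versions under the name graph7. The auto example dataset and its variables mpg and weight are assumptions used only for illustration.

```stata
* Assumes STATA 8 or later, where sysuse loads the shipped auto example data
sysuse auto, clear

* Histogram with 10 bins, percentages on the y-axis, normal curve overlaid
histogram mpg, bin(10) percent normal

* Kernel density estimate with a normal density overlaid for comparison
kdensity mpg, normal

* Boxplot of a single variable
graph box mpg

* Bivariate scatterplot: mpg on the y-axis, weight on the x-axis
scatter mpg weight
```

Each command draws its graph in the Graphics Window, replacing the previous one.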

5 Miscellaneous

There are a number of other commands that will become useful as you begin to use STATA on a regular basis.

1. Describe - describe provides you with a list of properties of the variables specified, or of all the variables in the dataset if no variables are specified.

describe [varlist] [, short detail fullnames numbers ]

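2. Summarize - summarize provides the number of observations, mean, standard deviation, min and max for all of the variables specified, or for all variables in the data if none are specified. The detail option adds further statistics, including the variance.

summarize [varlist] [weight] [if exp] [in range] [, [detail|meanonly] format ]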

3. Set Memory - the memory set function will be important when you are using large datasets.

set mem 100m

This will set the memory at 100 megabytes. This should be sufficient for nearly all of your projects. The upper bound is determined by the computer’s physical memory; if 100 megabytes is not enough and your computer has more memory available, you can set the limit higher.

4. Labelling - Labelling variables and variable values is important to keeping your dataset manageable. You will hear horror stories from many quantitative types about how they didn’t label variables and variable values because they were sure they would always remember, and then two years later, after having left the project sitting, they come back only to find they’ve forgotten everything about the variables and their coding. You will need three different commands to properly label your variables.

(a) Label Variable - this command simply attaches a label to the variable name. So, if for instance the name of the variable is ’var1’, and you label it ’party ID’, then ’party ID’ will show up in all printed output containing that variable. The command in STATA is as follows:

label variable varname ["label"]

Where ’varname’ is the variable name (var1 in the example above) and label is the label you want to apply to that variable name (party ID in the example above). So, to create the label party ID for var1, we would type the following:

label variable var1 "party ID"

(b) Label Define - this command defines value labels. For instance, our party ID variable may have republicans, independents and democrats. We want to make a label so that if we tabulate the variable, instead of 0, 1 and 2 as categories, it shows republicans, independents and democrats. The general STATA code is as follows:

label define lblname # "label" [# "label" ...] [, add modify nofix ]

Where lblname is the name you want to give to the label, like ’partyid’ for this case, # signifies the number to which you want the label to apply and label is the descriptor. For this example, we would type:

label define partyid 0 "republican" 1 "independent" 2 "democrat"

(c) Label Values - Finally, we can apply the new value label we defined, ’partyid’, to the variable of interest.

label values var1 partyid

More generally, the syntax is:

label values varname [lblname] [, nofix ]
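Putting the three labelling commands together with describe and summarize, a minimal sketch follows. The four observations of var1 are made up purely so the commands have something to act on.

```stata
* A tiny made-up party ID variable coded 0, 1, 2
clear
input var1
0
1
2
1
end

* (a) attach a descriptive label to the variable itself
label variable var1 "party ID"

* (b) define a set of value labels called partyid
label define partyid 0 "republican" 1 "independent" 2 "democrat"

* (c) attach that set of value labels to var1
label values var1 partyid

* The variable label and value label name appear in describe;
* tabulate shows the labels instead of 0, 1 and 2
describe
summarize var1
tabulate var1
```

The underlying values remain numeric, so summarize still reports statistics on 0, 1 and 2 even though tabulate displays the labels.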

6 Resources

1. STATA’s website, www.stata.com, has a number of useful resources, like help files and FAQs.

2. STATA also has a listserv called STATA list. To subscribe to STATA list, you can consult the STATA list FAQs located at http://www.stata.com/support/faqs/res/statalist.html.

3. Reference manuals are also a great source of information; hopefully we will have them available to you early in the semester.
