Stata Lecture 1

Page 1: Stata Lecture 1

NetCourse®

First, have you read NetCourse Basics? You should have. If not, click here to read them now.

Second, we wish to emphasize that all will go better if you participate, which you do by posting messages to the NetCourse 151 message board. When you post messages, all participants can see your message.

When asking questions about the lectures, please refer to the section heading for the portion of the lecture that you have a question about. Section headings look like this:

This is a section heading

With that out of the way, we begin.

Lecture 1

1 Welcome

2 Entering and executing a program

3 The do-file

4 The interactive program command

5 A program in a do-file

6 Combination do-files

7 Ado-files

8 Organizing do-files

9 An individual do-file

10 A do-file to perform verification

11 Importing data

12 Reproducibility

13 Indexing

14 assert as an alternative to branching

15 Consuming calculated results

16 Conclusion

17 Exercises

1 Welcome

Welcome to NetCourse 151—An Introduction to Stata Programming

Before we get started, there are a few things we need to get out of the way.


With Stata, you can automatically download the important cutouts for each lecture. You will want to do that now for this lecture. For instance, within Stata you would move to your course directory (using the cd command), create a directory for this lecture, and download the files.

Consider the following:

. cd "C:/Users/jch/Documents"

. mkdir nc151

. cd nc151

. mkdir lecture1

. cd lecture1

. run http://www.stata.com/courses/nc151-13/lec1.do

In this lecture, we discuss the following:

• How to enter a program and execute it

• Programming as automating data management and analysis

• The importance of organization, especially as it pertains to reproducibility

• The importance of data verification

• Simple data checks and debugging (assert and trace)

• Working with datasets that are too large to fit in memory

• Reading a hierarchical dataset


2 Entering and executing a program

This is a course about programming Stata, but before we can get into the details of programming, you need to master the mechanics of programming. You need to learn how to enter a program into Stata and how to get Stata to execute it.

A famous book on programming began with a program to do nothing more than display the text Hello, world. This was clever because it eliminated all programming complexity; however, that still left the considerable complexity of dealing with the compiler.

We will do the same thing by writing a program with a body of

. display "Hello, world"
Hello, world

We will deal more with the display command later, but right now, when you see display "Hello, world", you are supposed to imagine that it stands for something longer and more elegant. Then ignore that fact, and look at all the Stata junk around it: the stuff that turned the body display "Hello, world" into a program that can be executed over and over.


3 The do-file

A do-file is an ASCII file (plain-text file with no special characters) containing Stata commands that you create with the Do-file Editor or a text editor.

When you interactively type do filename at the keyboard, the contents of filename are executed just as if you typed each line at the keyboard.


DO-FILE: hello.do

display "Hello, world"

You run this program by interactively typing

. do hello <- you type this

. display "Hello, world" <- Stata types this

which displays the following in the Stata Results window:

Hello, world
.
end of do-file
. <- finished; Stata awaits your next command

This simple experiment is worth trying for yourself. Even better, do not copy hello.do from this page; open the Do-file editor and type the line for yourself.

Although this seems simple enough, it may not work when you try it. There is a lot, mechanically, that can go wrong (which is why you should try it: it will be easier to master these problems now, with this one-line do-file, than to wait and face the problems in a complicated case).

What might go wrong:

• You enter your text editor, enter hello.do, and save it. You try to do it from Stata and are told file hello.do not found.

◦ Solution 1: hello.do is not in the current directory. Either copy the file to the current directory using your operating system, or use cd to change to the directory containing the do-file.

◦ Solution 2: hello.do is in the current directory, but you did not name it hello.do. Stata do-files have a default suffix, .do; if the suffix is something other than the default, you must specify it. Say that you named the do-file hello.pgm. Execute it by typing do hello.pgm. Say that you named it simply hello, with no suffix. Then you would execute it by typing do hello. with a trailing period. (The period is part of the command; if the do-file has no suffix, put a period at the end of the do-file name when running it from Stata.)

• File hello.do is found, but what is displayed is nothing like what you entered.

◦ Solution: You did not save hello.do as an ASCII file but saved it as some sort of document. If you use a word processor to enter do-files, you must save them as ASCII files.

• File hello.do is found, it appears okay, but it does not do anything. When you run it, you see

. do hello

. display "Hello, world"

Stata executed the display command but never displayed Hello, world.

◦ Solution: You forgot to end the single line in the do-file with a hard return. The line is not terminated, so Stata ignored it. Go back, and add the return. Save your file.

Perhaps something else could happen; if so, email us at [email protected]. In the meantime, if you ran into this last problem, we have a suggestion: end all your do-files with the word exit. In a do-file, exit does not quit Stata but exits the do-file.

Thus we recommend that you enter your do-file as


DO-FILE: hello.do

display "Hello, world"

exit

There are now two possibilities: either you remembered to put the hard return after the exit, or you did not. Either way, it will not matter. If you remembered, Stata will see the command exit and exit the do-file. If you did not, Stata will not see exit, the do-file will end, and Stata will still exit the do-file. With the exit command in place, you could not have forgotten the hard return at the end of

display "Hello, world"

because, if you had, that command would display in your text editor as being run together with the exit command:

display "Hello, world"exit

The video below is a basic introduction to the Do-file Editor.

The Do-file Editor [1:24]

The Project Manager is also a nice tool if you plan to work on large-scale projects with multiple people.

The Project Manager [1:49]


4 The interactive program command

Another way to enter our program would be to interactively type

. program hello
  1. display "Hello, world"
  2. end

and when we want to execute it, type:

. hello
Hello, world

Do this. You have just created your first Stata command. You will seldom want to define your programs interactively, but it is easier to learn what a command can and cannot do for us by trying it interactively. As with do-files, things can go wrong:

• You want to change the program hello to display Hi back rather than Hello, world. You type

. program hello
hello already defined
r(110);

Stata remembers definitions throughout the session, so you cannot redefine a program until you drop the old version:

. program drop hello

. program hello
  1. display "Hi back"
  2. end

. hello
Hi back

• program does not know about built-in command names.

Let's enter our Hello, world program and call it b:

. program b
  1. display "Hello, world"
  2. end

. b
Hello, world

Fine, that works. Now let's call it q:

. program q
  1. display "Hello, world"
  2. end

. q

--------------------------------------------------------------------------------------
    Memory settings
      set maxvar       5000     2048-32767; max. vars allowed
      set matsize      400      10-11000; max. # vars in models
      set niceness     5        0-10
      set min_memory   0        0-1600gc

output omitted

What happened? q is the Stata abbreviation for the built-in Stata command query. You can define a program called q, but there is no way to execute it because Stata's built-in commands take precedence.

How can you find out if a name is already taken? Use Stata's which command:

. which q
built-in command:  query

. which b
command b not found as either built-in or ado-file
r(111);

• program does not check syntax. The syntax of program is

program program-name
  1. content of program
     more program
     etc.
  2. end

Between the program and the end commands, Stata merely stores whatever you type. Stata does not determine whether what you type makes sense until you try to execute it.

. program hello
  1. display Hello, world
  2. end

. hello
Hello not found
r(111);

In this case, we omitted the double quotes around "Hello, world". In our one-line program, finding the error is easy. Where else could it be than on line 1? If our program were 50 lines long, finding it would be more difficult. Stata can trace the execution of a program:

. set trace on

. hello
- display Hello, world
Hello not found
r(111);

. set trace off

One thing to remember: if you set trace on, remember to turn it back off afterward. Otherwise, you will run some other Stata command and be surprised by the amount of output produced.

Here are some hints for debugging large, complicated programs:

. set trace on /* turn on tracing */

. set more off /* turn off --more-- */

. log using junk, replace /* start a log */

. invoke program

Output will scroll by without --more-- ever appearing. Eventually, the error will occur. Then,

. log close

. set more on /* turn back on --more-- */

. set trace off /* turn off trace */

Now you can look at junk.log using the Viewer.

• There is a limit to how big a program can be. A single program must contain fewer than 3,500 lines and fewer than 135,600 characters in Stata. Now 3,500 lines is a lot, and programs can call other programs. Nevertheless, we sometimes write programs that are too big and then have to go back and split them into pieces.

• There is no way to edit a program after you have entered it. You can define programs interactively, you can execute programs interactively, and you can drop programs interactively, and that is it (note that by "interactively", we mean at the Command window). Not only can you not edit the program but also there is no way to store it. This makes the interactive definition of a program not useful. Think of the Stata Do-file Editor as a separate thing.
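Returning to the tracing advice above: one way to avoid leaving trace on by accident is to run the suspect program under capture noisily, so that the line turning trace off still executes after the error. The following do-file is a minimal sketch of that idea and is not part of the original lecture; the deliberately broken hello program is the one from the example above, and capture itself is introduced in section 6.

```stata
* Define the deliberately broken program from the example above
capture program drop hello
program hello
        display Hello, world
end

set trace on
* capture noisily shows the output and the error message but lets the do-file continue
capture noisily hello
set trace off

* The return code of the captured command is left in _rc (111 here: "not found")
display _rc
```

Because the error no longer stops the do-file, set trace off is always reached.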


5 A program in a do-file

If program can be used interactively and if do-files execute lines as if they were entered interactively, then do-files can contain programs. The advantage, of course, is that because you enter do-files using your text editor, you can edit programs and store them.

DO-FILE: hello.do

program hello
        display "Hello, world"
end
exit

So, let's try it:

. do hello <- we type interactively

. program hello <- Stata responds
hello already defined
r(110);

end of do-file
r(110);

Why did this not work? Because we have already defined a program called hello. (In preparing this lecture, we work in Stata interactively, and hello is left over from a previous example.) So, drop the hello program and try again:

. program drop hello <- we type interactively

. do hello <- we type

. program hello <- Stata types
  1. display "Hello, world"
  2. end

. exit

end of do-file

Look carefully at what happened: the program hello was not executed. Do-files contain lines that Stata executes as if you typed them in from the keyboard, and our do-file contains

DO-FILE: hello.do

program hello
        display "Hello, world"
end
exit

The do-file does not call on hello to be executed, so Stata does not execute it. In this case, because our program is now loaded, we can execute it interactively:

. hello
Hello, world

In do-files that merely define programs, we typically do not want to see the program lines scroll by when we load them. It is more convenient to load such programs using run, which is the same as do but suppresses the output:

. program drop hello <- we type, so we can reload

. run hello <- we type, hello.do runs silently

. hello
Hello, world


6 Combination do-files

Do-files execute whatever commands you include in them. You do not have to merely define the program in the do-file: you could define the program and run it, or define many programs, or define many programs and run them. Anything you can do interactively (that is, almost everything Stata can do) you can do in a do-file.

So, here is the do-file modified to load and execute the program:

DO-FILE: hello.do

program hello
        display "Hello, world"
end
hello
exit

Now interactively type

. do hello <- we type

. program hello <- Stata responds
hello already defined
r(110);

end of do-file
r(110);

Oops!

. program drop hello <- we type

. do hello <- we type

. program hello <- Stata responds
  1. display "Hello, world"
  2. end

. hello
Hello, world

. exit

end of do-file

. <- our turn to type again

Having to remember to type program drop hello is tedious. Because do-files execute whatever is put in them just as if you were typing from the keyboard, you can put the program drop into the do-file:

DO-FILE: hello.do

program drop hello
program hello
        display "Hello, world"
end
hello
exit

Now you can type do hello when hello is already defined, and it will redefine hello and run the redefined program. There is, however, a problem: what if hello is not already defined?

. program drop hello <- we type

. do hello <- we type

. program drop hello <- Stata types
hello not found
r(111);

end of do-file
r(111);

You fixed the do-file to redefine hello when it was already defined, but now the do-file breaks in cases when hello was undefined anyway.

Ultimately, you will come to appreciate that this stop-on-error behavior of do-files is a useful feature, but right now, it is merely an irritation. Right now, it does not matter whether program drop hello works or not. If it works, then it needs to be done. If it does not work, it is not needed.

If you put the word capture in front of a Stata command, its error status is ignored. Typing capture X does the same thing as X, but after X completes, any errors are reset, so it is as if the error did not happen. (capture X does more than that: it stores the error, if any, so that subsequent commands can find out about it; you will use that feature in a later lecture. It also catches all the output generated by X, including error messages, and discards it. See Exercise 3.)

So here is a version of the program that will run whether or not hello is already defined:

DO-FILE: hello.do

capture program drop hello
program hello
        display "Hello, world"
end
hello
exit
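To make the behavior of capture concrete, the following lines are a small sketch (not from the original lecture) showing that the error is suppressed while its return code remains available in _rc. The program name nosuchprogram is a hypothetical name, chosen on the assumption that no such program is defined.

```stata
* nosuchprogram is a hypothetical, undefined program name
capture program drop nosuchprogram

* No error message appeared and execution continued; the return code is kept in _rc.
* Dropping an undefined program gives return code 111 ("not found"), as seen earlier.
display _rc
```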


7 Ado-files

This method is an extension of section 5, A program in a do-file. If such a file's name is simply changed from X.do to X.ado, you do not have to load the program before executing it. Thus you can add the command hello to Stata by taking

DO-FILE: hello.do

program hello
        display "Hello, world"
end
exit

and simply renaming it hello.ado:

ADO-FILE: hello.ado

program hello
        display "Hello, world"
end
exit

Try this, and test it:

. program drop hello <- just to prove hello not loaded

. hello
Hello, world

"Ado" stands for automatically loaded do-file. The following occurs when you type X :

• If X is a built-in command of Stata, then Stata executes it.

• Failing that, if X is a defined program, then Stata executes it.

• Failing that, Stata looks for X.ado. If X.ado exists,

◦ Stata issues run X.ado to itself.

◦ Stata verifies that X is now a defined program.

◦ Stata reissues X to itself.

• Failing that, Stata returns unrecognized command.

If you have changed any ado-files, type discard at the Stata prompt after returning from your editor. discard makes Stata throw away any copies of automatically loaded files in memory, which forces Stata to refresh itself from the updated copies on disk.

Type discard only if you have changed an ado-file.

Do not gloss over this. At some point you will write your own ado command and will attempt to debug it. After changing the file several times and never seeing your changes reflected, it will finally dawn on you that you forgot to discard the old copy.


8 Organizing do-files

Our generic example of a do-file, from section 3, The do-file, is

DO-FILE: hello.do

display "Hello, world"

and by comparison to the other methods of defining a program, this one seems crude and unsophisticated.

What do people mean by the word programming? What most statistical users mean is sequentially performing prerecorded data management and analysis steps on a set of data. This is what the do-file accomplishes. There are other more grand types of programming, but do-files handle the day-to-day programming of data analysis.

Allowing the user to interactively explore data is one of Stata's best features. In a few minutes, an hour, or two hours, you can learn a lot. This feature, however, can easily be put to ill use because it is too easy to make a mistake under the pressure of the moment. Thus, when analyzing data,

I iterate {
        working interactively
        preparing do-files that reproduce what I have done
}

This approach creates a lot of do-files. These do-files are organized by creating another do-file, usually called master.do, that lists what is run in order:

DO-FILE: master.do

do crds1
do ver1
do ansum
do crds2
do anreg
etc...

master.do is seldom run, but every time another do-file is added to the ongoing analysis, another line is added at the end of master.do. The result is that the entire analysis can be re-created from scratch if necessary. You could, if forced to, go back and describe exactly what you did in the order that you did it.

Stata has two commands, assert and capture, that are especially useful in these kinds of do-files, but mostly this kind of programming has to do with organization. These are the rules to follow:


Rule 1:

There is only one directory for a project. You know where to look for the data and results.

Rule 2:

No interactive exploratory analysis becomes an official part of the analysis until it is placed into a do-file.

Rule 3:

All do-files create logs; the code is in them to do that. You do not have to remember to start a log before running a do-file; however, feel free to erase logs because re-creating them is easy.

Rule 4:

Individual do-files typically, but not always, start with the letters cr or an, which have a special meaning. cr*.do files create datasets. For example, crxyz.do creates the dataset xyz.dta. Historically, this meant that the datasets could have only six-character names. Sometimes a single cr*.do file can create multiple datasets, such as males and females. an*.do files perform some sort of analysis. They do not create datasets, or if they do, they erase the working datasets they create. (Sometimes an*.do files are created by editing a log of interactive work, but most of the time, you switch between Stata and your editor, writing your do-file as you work.)

Rule 5:

Once a do-file works and its name is inserted into master.do, it is never again edited. Absolute compliance with this rule guarantees that typing do master will re-create what you have done. Never try to go back and change the past. Instead, add more do-files.

The reasons behind these rules are explained as the rules are illustrated. This will help you develop unwritten rules to guide your own behavior.


9 An individual do-file

An individual do-file typically looks like this:

DO-FILE: X.do

capture log close
log using X, replace
set more off

program code in here

log close
exit

Often, when working, you will have a log going. Nevertheless, if you run one of the official do-files, say, X.do, you want its log to be saved in X.smcl. Thus, the do-files start by closing any open log. Because a log might not be open (thus causing log close to generate an error), place capture in front of log close.

When you run the individual do-files, you do not want Stata to pause on --more-- conditions, so include set more off. When the do-file completes, Stata will automatically reset it to whatever it was originally.

Finally, at the end of the do-file, close the log.

So, start with an artificial example. You have the following raw data:

mandm, price, mpg, foreign, rep78, weight, displacement
AMC Concord, 4099, 22, 0, Average, 2930, 121
AMC Pacer, 4749, 17, 0, Average, 3350, 258
AMC Spirit, 3799, 22, 0, , 2640, 121
Audi Fox, 6295, 23, 1, Average, 2070, 97
Audi, 9690, 17, 1, Exc, 2830, 131
BMW, 9735, 25, 1, Good, 2650, 121
Buick Century, 4816, 20, 0, Average, 3250, 196
Buick Electra, 7827, 15, 0, Good, 4080, 350
Buick LeSabre, 5788, 18, 0, Average, 3670, 231
Buick Opel, 4453, 26, 0, , 2230, 304
Buick Regal, 5189, 20, 0, Average, 3280, 196
Buick Riviera, 10372, 16, 0, Average, 3880, 231
Buick Skylark, 4082, 19, 0, Average, 3400, 231
Cad. Deville, 11385, 14, 0, Average, 4330, 425
Cad. Eldrado, 14500, 14, 0, Fair, 3900, 350
Cad. Seville, 15906, 21, 0, Average, 4290, 350
Chev. Chevette, 3299, 29, 0, Average, 2110, 231
Chev. Impala, 5705, 16, 0, Good, 3690, 250
Chev. Malibu, 4504, 22, 0, Average, 3180, 200
Chev. MCarlo, 5104, 22, 0, Fair, 3220, 200
Chev. Monza, 3667, 24, 0, Fair, 2750, 151
Chev. Nova, 3955, 19, 0, Average, 3430, 250
Datsun, 6229, 23, 1, Good, 2370, 119
Datsun, 4589, 35, 1, Exc, 2020, 85
Datsun, 5079, 24, 1, Good, 2280, 119
Datsun, 8129, 21, 1, Good, 2750, 146
Dodge Colt, 3984, 30, 0, Exc, 2120, 98
Dodge Diplomat, 4010, 18, 0, Fair, 3600, 318
Dodge Magnum, 5886, 16, 0, Fair, 3600, 318
Dodge StRegis, 6342, 17, 0, Fair, 3740, 225
Fiat Strada, 4296, 21, 1, Average, 2130, 105
Ford Fiesta, 4389, 28, 0, Good, 1800, 98
Ford Mustang, 4187, 21, 0, Average, 2650, 140
Honda Accord, 5799, 25, 1, Exc, 2240, 107
Honda Civic, 4499, 28, 1, Good, 1760, 91
Linc. Cntntl, 11497, 12, 0, Average, 4840, 400
Linc. Mark V, 13594, 12, 0, Average, 4720, 400
Linc. Vrsills, 13466, 14, 0, Average, 3830, 302
Mazda GLC, 3995, 30, 1, Good, 1980, 86
Merc. Bobcat, 3829, 22, 0, Good, 2580, 140
Merc. Cougar, 5379, 14, 0, Good, 4060, 302
Merc. XR-7, 6303, 14, 0, Good, 4130, 302
Merc. Marquis, 6165, 15, 0, Average, 3720, 302
Merc. Monarch, 4516, 18, 0, Average, 3370, 250
Merc. Zephyr, 3291, 20, 0, Average, 2830, 140
Olds Cutlass, 4733, 19, 0, Average, 3300, 231
Olds CutlSupr, 5172, 19, 0, Average, 3310, 231
Olds Delta 88, 4890, 18, 0, Good, 3690, 231
Olds Omega, 4181, 19, 0, Average, 3370, 231
Olds Starfire, 4195, 24, 0, Poor, 2730, 151
Olds Toronado, 10371, 16, 0, Average, 4030, 350
Olds, 8814, 21, 0, Good, 4060, 350
Peugeot, 12990, 14, 1, , 3420, 163
Plym. Arrow, 4647, 38, 0, Average, 3260, 156
Plym. Champ, 4425, 34, 0, Exc, 1800, 86
Plym. Horizon, 4482, 25, 0, Average, 2200, 105
Plym. Sapporo, 6486, 26, 0, , 2520, 119
Plym. Volare, 4060, 18, 0, Fair, 3330, 225
Pont. Catalina, 5798, 18, 0, Good, 3700, 231
Pont. Firebird, 4934, 18, 0, Poor, 3470, 231
Pont. GranPrix, 5222, 19, 0, Average, 3210, 231
Pont. Le Mans, 4723, 19, 0, Average, 3200, 231
Pont. Phoenix, 4424, 19, 0, , 3420, 231
Pont. Sunbird, 4172, 24, 0, Fair, 2690, 151
Renault Le Car, 3895, 26, 1, Average, 1830, 79
Subaru Subaru, 3798, 35, 1, Exc, 2050, 97
Toyota Celica, 5899, 18, 1, Exc, 2410, 134
Toyota Corolla, 3748, 31, 2, Exc, 2200, 97
Toyota Corona, 5719, 18, 1, Exc, 2670, 134
VW Rabbit, 4697, 25, 1, Good, 1930, 89
VW Diesel, 5397, 41, 1, Exc, 2040, 90
VW Scirocco, 6850, 25, 1, Good, 1990, 97

First, import the data interactively.

. clear

. import delimited cars, rowrange(1:10)

. list

This test verifies that the import delimited command works. Then create the first do-file:

DO-FILE: crcars1.do

capture log close
log using crcars1, replace
set more off

clear
import delimited cars
compress
save cars1, replace

log close
exit

Interactively, read in the data:

. do crcars1

With that working, begin master.do:

DO-FILE: master.do

do crcars1
exit


10 A do-file to perform verification

Now the second step would be to assemble a do-file that verifies what the documentation claims is true:

DO-FILE: ver1.do

capture log close
log using ver1, replace
set more off

* The documentation says make and model
* is a string and implies it is always defined.
assert mandm != ""

* The documentation implies price is always defined.
assert price<. & price>0

* Normally, do not include all these comments.
* Continuing...
assert mpg<. & mpg>0
assert foreign==0 | foreign==1
assert weight<. & weight>0
assert displacement<. & displacement>0

log close
exit

assert is Stata's most useful command for data checking. If the statement is true, Stata continues; if it is false, Stata issues an error message, and everything stops right there.

This is a very important step, especially with large datasets. Before assert, a do-file containing summarize and perhaps a few tabulates would seem to be enough. The complication is that problems in that output were not always spotted. Invariably, something surprising (and perhaps wrong) about the data would be found later that should have been established at the outset.

In this little dataset, all the variables are checked. In a dataset with more variables, you probably would not check them all up front; instead, stick asserts throughout the do-files, putting each one in the first time a variable is used.
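As a small illustration of how assert behaves, the following sketch uses auto.dta, the example dataset shipped with Stata (an assumption made so the sketch runs on its own; it is not the cars data used in this lecture). The first assertion holds and passes silently; the second is deliberately false, and capture is used so the sketch can show the return code without stopping.

```stata
sysuse auto, clear

* True for every observation: assert says nothing and execution continues
assert price < . & price > 0

* Deliberately false (foreign is coded 0/1 in auto.dta);
* without capture, this line would stop the do-file with r(9)
capture assert foreign == 2
display _rc
```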

Interactively, run ver1.do:

. do ver1 <- type interactively

. capture log close <- Stata responds

. log using ver1, replace

. set more off

.

. * The documentation says make and model is a string

. * and implies it is always defined.

. assert mandm != ""

.

. * The documentation implies price is always defined.

. assert price<. & price>0

.

. * Normally, do not include all these comments.

. * Continuing...

. assert mpg<. & mpg>0

. assert foreign==0 | foreign==1

1 contradiction in 74
assertion is false
r(9);

end of do-file
r(9);

Of course, this was prepared just to show you what happens when there is an error. Let's learn more about the problem: the data are in memory, and you are right at the point where the assertion proved false.

. list mandm mpg foreign if !(foreign==0 | foreign==1)

               mandm   mpg   foreign
 68.  Toyota Corolla    31         2

This is more than a typographical error, so let's assume you want to fix foreign in this case. You could

• go back and modify crcars1.do, or

• add another do-file after crcars1.do.

The second approach is more appealing; the rule is that you add something to master.do and do not go back and change it. (That is how to ensure that master.do works in the future.)

In any case, if this were a larger dataset, you should inquire about any other potential problems. Therefore, edit ver1.do (it is not in master.do yet, so you can still change it) and comment out the assertion that is not true:

DO-FILE: ver1.do

capture log close
log using ver1, replace
set more off

* The documentation says make and model
* is a string and implies it is always defined.
assert mandm != ""

* The documentation implies price is always defined.
assert price<. & price>0

* Normally, do not include all these comments.
* Continuing...
assert mpg<. & mpg>0
* assert foreign==0 | foreign==1 NOT TRUE!
assert weight<. & weight>0
assert displacement<. & displacement>0

log close
exit

Going back to Stata, try again,

. do ver1

and this time it works; the remaining assertions are true. Then add ver1.do to master.do:

DO-FILE: master.do

do crcars1
do ver1
exit

Now fix the problem observation. The new do-file crcars2.do reads:

DO-FILE: crcars2.do

capture log close
log using crcars2, replace
set more off

use cars1, clear
assert foreign==2 in 68 /*A Toyota Corolla*/
replace foreign=1 in 68
save cars2, replace

log close
exit

Note the use of assert: this time, assert that the error exists and that it exists where it was found interactively. This verifies that no mistakes are made.
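A possible refinement, sketched here on a three-observation toy dataset rather than the lecture's cars data, is to also assert the identity of the observation before changing it, so the fix cannot silently land on the wrong row if the data are ever re-sorted. The toy values and the observation number 2 are illustrative assumptions.

```stata
* Toy data standing in for cars1.dta (values are illustrative only)
clear
input str20 mandm byte foreign
"Toyota Celica"  1
"Toyota Corolla" 2
"Toyota Corona"  1
end

* Confirm both which car this is and that the error is present, then fix it
assert mandm == "Toyota Corolla" in 2
assert foreign == 2 in 2
replace foreign = 1 in 2

* Confirm the fix left the variable in a valid state
assert foreign == 0 | foreign == 1
```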

Also create a verification routine, ver2.do, that contains a copy of ver1.do with the assert foreign==0 | foreign==1 line put back in (and the log statement updated):

DO-FILE: ver2.do

capture log close
log using ver2, replace
set more off

* The documentation says make and model
* is a string and implies it is always defined.
assert mandm != ""

* The documentation implies price is always defined.
assert price<. & price>0

* Normally, do not include all these comments.
* Continuing...
assert mpg<. & mpg>0
assert foreign==0 | foreign==1 /* hopefully, fixed*/
assert weight<. & weight>0
assert displacement<. & displacement>0

log close
exit

Interactively, type

. do crcars2

. do ver2

and, if these work, interactively type

. erase cars1.dta

because 1) this frees disk space and 2) this eliminates an old version of the data. Then add the following to master.do:

DO-FILE: master.do

do crcars1
do ver1
do crcars2
do ver2
erase cars1.dta
exit

Although this is segmented much more than in real life, you get the idea: lots of do-files, repeated code (because copying files and editing is easy), and little concern with efficiency.

Let's assume that you are at an analysis step. There is only one rule here: no permanent datasets should be created. Temporary datasets are fine. Any datasets created, however, are to be deleted at the end. Why? Remember that the analysis do-files are named ansomething.do. The an*.do files can be run in any order and will either produce the same results or not work at all. They are dependent on datasets created by cr*.do files, and as you have seen, you should feel free to erase such datasets (because they can always be re-created). But if the datasets exist, the an*.do file will work.

Separating the creation of permanent datasets from analysis is an important step in obtaining reproducible results.

The analysis do-file might temporarily need to split the foreign and domestic cars into two datasets. It could read, in part,

use cars2, clear
keep if foreign
save tmpf, replace
use cars2, clear
drop if foreign
save tmpd, replace

and then go on from there. By using the prefix tmp for these temporary analysis files, you know that it is always safe to type erase tmp*.dta within a directory. You can include this command at the end of your do-file or type it later.
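As an alternative to the tmp prefix convention, Stata's tempfile command hands out temporary filenames that are erased automatically when the do-file ends. The sketch below is not from the original lecture; it uses auto.dta, the example dataset shipped with Stata, in place of cars2.dta so that it can be run on its own, and it should be run from a do-file because it relies on local macros.

```stata
sysuse auto, clear

* tempfile stores two temporary filenames in the local macros tmpf and tmpd
tempfile tmpf tmpd

keep if foreign
save `tmpf'

sysuse auto, clear
drop if foreign
save `tmpd'

* The temporary datasets can be used like any other until the do-file ends
use `tmpf', clear
count
```

With this approach there is nothing to erase by hand, at the cost of the files not being available for inspection after the do-file finishes.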

Most researchers have had the experience of having results and a program—a Stata do-file or some other package'sequivalent—and yet when they rerun the program, it produces slightly different results, and no one can explain why thishappens.

What if later you have further improved your data, say, by eliminating some observations that really did not belong in the sample, and you want the updated version? Copy the anX.do file to a new name (say, anX2.do), edit it, change the data it is using, and run it. Then add it to master.do.

11 Importing data

Stata's infile command has two ways of reading data from disk: with and without a dictionary. The difference between these two methods is not merely their style.

Without a data dictionary, Stata reads data in what is called stream mode. The data are a stream, and going to a new line has no special significance; it means the same as any other kind of white space (such as blanks).

With a data dictionary, Stata reads data in record mode. Going to new lines has a special meaning. Each "observation" in the data is a fixed number of records (such as one record per observation or two records per observation).

When you read your data, use the method that is appropriate to the kind of data you have. Nothing is more frustrating than trying to read stream data with a data dictionary or reading record data without one.

The following statement is often heard: "I like the documentation aspect of the data dictionary. Why can't I read my stream data in this way?"

If you have stream data and want to document what you have done (as you should), read it using a do-file.
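As a minimal sketch of that advice, the do-file below reads stream data without a dictionary and puts the documentation in label commands. The file names crstream.do and tmpstream.raw, the variable names, and the three records are all made up for illustration; the raw file is written by the do-file itself only so that the sketch is self-contained.

```stata
* crstream.do (hypothetical name): read stream data and document it in a do-file

* write a tiny stream file; the line breaks deliberately fall in odd places
file open fh using tmpstream.raw, write text replace
file write fh "1 25 150 2 31" _n
file write fh "180 3" _n
file write fh "47 165" _n
file close fh

clear
infile id age weight using tmpstream.raw     /* no dictionary: stream mode */
label variable id     "subject id"
label variable age    "age (years)"
label variable weight "weight (lbs.)"
label data "stream data read by crstream.do"

assert _N==3        /* three observations, although the file has three odd lines */
list
erase tmpstream.raw
```

In stream mode the line breaks carry no meaning, so the nine numbers are read as three observations of three variables each.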

Now let's consider reading a hierarchical dataset with Stata. This requires caution.

Suppose you have data on families and persons within families. The data have the format family record followed by one or more person records:

family record
person record
. . .
person record
family record
etc.

Let's assume that

Family record                     Person record
col. 1-5  family id               col. 1-5  person id
col. 7    "1"                     col. 7    "2"
col. 9    dwelling type code      col. 8-9  age
                                  col. 11   sex code

Note: Sample data are at the end of this lecture after the exercises; use the data in Appendix A: Sample data for hierarchical dataset example to test your ability to read in this kind of dataset.

The data probably contain more information than this, but this is enough for illustration. Note that if column 7 contains a 1, it is a family record, and if it contains a 2, it is a person record. This is called the record-type indicator.

Create a Stata dataset containing

• family ID
• dwelling code
• person ID
• age
• sex code

This dataset will contain one observation per person, and the family information will be repeated for people in the same family.

First, create separate dictionaries for reading in the family and the person information:

DICTIONARY: family.dct

dictionary using hier.raw {
    long famid %5f "family id"
    _column(7) byte rectype %1f "record type"
    _column(9) byte dwell %1f "dwelling code"
}

DICTIONARY: person.dct

dictionary using hier.raw {
    long perid %5f "person id"
    _column(7) byte rectype %1f "record type"
    _column(8) byte age %2f "age (years)"
    _column(11) byte sex %1f "sex code"
}

Then test each one of these dictionaries interactively to make sure they work:

. clear

. infile using family if rectype==1 in 1/100

. list in 1/5

. type hier.raw <- I'll press Break to stop this

. clear

. infile using person if rectype==2 in 1/100

. list in 1/5

. type hier.raw <- I'll press Break to stop this

This reads a few observations and types the original so that you can compare them.

Satisfied that these are good dictionaries, create a do-file to read the entire dataset. The basic plan of the do-file is to

1. read in the family records and save them in a data file;

2. read in the person records; and

3. merge the person and the family records.

This problem would be easy if the person records contained the ID for the family to which the person belongs: step 1 would be an infile ... if rectype==1 followed by a sort and a save, step 2 would be an infile ... if rectype==2 followed by a sort, and step 3 would be a merge. The whole do-file would be

STEP 1
    clear
    infile using family if rectype==1
    sort famid
    save tmph, replace

STEP 2
    clear
    infile using person if rectype==2
    sort famid

STEP 3
    merge m:1 famid using tmph

How easy that would be. In this example, however, the famid does not appear on the person records.

This adds significantly to the complication. Read in the family records as you did above, but in addition, manufacture a family ID variable, labeling the first family 1, the second family 2, and so on:

MODIFIED STEP 1
    clear
    infile using family if rectype==1
    gen long id = _n
    sort id
    save tmph, replace

Next, when you read in the person data, read in the family records as if they were person records, too. The result will be a placeholder observation for the family record:

      perid  rectype  age   sex
  1.  junk   1        junk  junk
  2.  1      2        32    0
  3.  2      2        30    1
  4.  junk   1        junk  junk
  5.  1      2        40    1
etc.

Then regenerate the temporary family ID variable and discard the misread family records.

To regenerate the ID variable, first type gen id = 1 if rectype == 1:

      perid  rectype  age   sex   id
  1.  junk   1        junk  junk  1
  2.  1      2        32    0     .
  3.  2      2        30    1     .
  4.  junk   1        junk  junk  1
  5.  1      2        40    1     .
etc.

Then type replace id = sum(id):

      perid  rectype  age   sex   id
  1.  junk   1        junk  junk  1
  2.  1      2        32    0     1
  3.  2      2        30    1     1
  4.  junk   1        junk  junk  2
  5.  1      2        40    1     2
etc.

Finally, type drop if rectype == 1:

      perid  rectype  age   sex   id
  1.  1      2        32    0     1
  2.  2      2        30    1     1
  3.  1      2        40    1     2
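The three steps just shown can be tried on a toy dataset typed in with input. This is a sketch only: the five records mimic the listing above, with missing values standing in for the "junk" read from the family records.

```stata
clear
input perid rectype age sex
. 1 . .
1 2 32 0
2 2 30 1
. 1 . .
1 2 40 1
end

gen long id = 1 if rectype==1     /* 1 on family placeholders, missing elsewhere */
replace id = sum(id)              /* running sum; sum() treats missing as 0 */
drop if rectype==1                /* discard the placeholder observations */
list

assert id==1 in 1/2               /* both members of the first family */
assert id==2 in 3                 /* the one member of the second family */
```

The running sum works because each family placeholder adds 1 and every person record adds nothing, so each person inherits the count of family records seen so far.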

You will then be able to merge the family data with the person data. So the outline of the do-file is

MODIFIED STEP 1
    clear
    infile using family if rectype==1
    gen long id = _n
    sort id
    save tmph, replace

MODIFIED STEP 2
    clear
    infile using person /* no if! */
    gen long id = 1 if rectype==1
    replace id = sum(id)
    drop if rectype==1
    drop rectype
    sort id

MODIFIED STEP 3
    merge m:1 id using tmph
    drop id

Some final details:

The outline assumes that rectype takes on the values 1 and 2 as the documentation claims and that everything merges. Some checks are needed. Thus the final do-file is

DO-FILE: crhier.do

capture log close
log using crhier, replace
clear
infile using family if rectype==1
drop rectype
gen long id = _n /* make my own temporary id var */
sort id /* to set sort markers */
save tmph, replace

clear
quietly infile using person /* no matter what the rectype */
assert rectype==1 | rectype==2 /* just to be safe; see note */
gen long id = 1 if rectype==1
replace id = sum(id)
drop if rectype==1
drop rectype
sort id perid

merge m:1 id using tmph
assert _merge==3 /* they are supposed to match */
drop _merge id

sort famid perid
save hier, replace
erase tmph.dta
log close
exit

Direct your attention to the line that reads

assert rectype==1 | rectype==2

This is an important part of the do-file. Everything hinges on the documentation being correct; that is, rectype really does take on the values 1 and 2, and only the values 1 and 2, and the rectype is correctly read in (read the right columns of the data). In the do-file, if rectype ever takes on a value other than 1 or 2, things stop right there.

Similarly, after merging, include the line

assert _merge==3

Theoretically, this must be true, but in reality, sometimes mistakes are made. Asserting things that must be true is a good way to catch mistakes.

Early on, you should probably verify more about the data. For instance, the documentation implies that there are no empty households, meaning two household records in a row. Prove this by including the line

assert rectype!=1 if rectype[_n-1]==1

Stata's infix command and Stata's infile command with a data dictionary are really the same command: both read the data in record mode. infile is used in the example above, but infix could just as well have been used. The dictionaries would have had a slightly different format, but the logic would be the same. All that would change in the do-file would be the switch from infile to infix.
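For comparison, here is a sketch of what the family dictionary might look like in infix format. The file name family2.dct is hypothetical, and the column ranges are taken from the record layout given earlier; variable labels are left out of this sketch and could be added with label variable in the do-file.

```stata
infix dictionary using hier.raw {
    long famid   1-5
    byte rectype 7
    byte dwell   9
}
```

Under that assumption, the corresponding line of the do-file would change from infile using family if rectype==1 to infix using family2 if rectype==1.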

Aside: Working with datasets that are too large

We do not recommend that you use Stata with datasets that are larger than the physical amount of memory on your computer. But, recommended or not, sometimes you need to analyze such data. Most modern operating systems provide virtual memory, and Stata, like any other application, can use it. However, if you use virtual memory you will quickly discover that it is slow.

We once had to process a 60 GB Stata .dta file on a Linux computer with only 16 GB of real memory. (32 GB of real memory on a Windows computer leaves about 30 GB for Stata's data areas. 16 GB on this Linux box left about 14 GB for Stata's data areas, so we were fitting 60 GB into 14 GB.) Here is what we did.

Day 1:

We processed the raw data to create the Stata .dta file. We wrote a do-file and tested it by modifying the infile line near the top of the file:

infile ... in 1/500

The in 1/500 made the infile command read only the first 500 observations of the data. Thus we fully tested the do-file quickly. Knowing it worked, we removed the in 1/500 and reran the command. A few hours later we had the datasets.

In addition to creating all the data in a dataset called master.dta, we created a random sample of the data (see [D] sample). Thus the do-file read

clear
infile ...                  <- no in 1/500 in the final version
label this and that
sort ...                    <- we wanted it sorted; it took hours
...
save master, replace
sample 10
save sample, replace

Days 2 and beyond:

We interactively analyzed the data using sample.dta. We kept careful track of what we did during the day that we liked and created a do-file that would reproduce that work, just as we always do.

Toward the end of the day, we would test the do-file. It read

use sample, clear
...whatever we had done...

Knowing it worked (it took little time to run and test), we modified the file to read

use master, clear
...whatever we had done...

When we went home, we would leave this do-file running so that the next morning we could see the results of what we were pretty sure would work over the entire dataset.

This method actually worked pretty well.

12 Reproducibility

The point of all this, obviously, is to organize your work so that results are reproducible. The organization suggested above is one way this could be done, but there are others. All that is important is that you adopt a way that works for you.

The rest of this course is less prescriptive and more descriptive. Along those lines, let's describe situations where, even if you follow a safe-computing plan, results may not be reproducible, and then let's see what you can do about it.

First, some commands in Stata are inherently random: if you do the same thing twice, you will not get the same result. In the aside about processing datasets that are too large, a 10% sample of the data was created. The do-file contained

clear
infile ...
label this and that
sort ...
...
save master, replace
sample 10
save sample, replace

Notice the sample 10 command. By definition, it draws a 10% random sample, and so, if you ran this command again, you would presumably get a different sample.

Whenever you are dealing with a Stata command whose results are intentionally random, set the random-number seed to make the results of the command deterministic. The actual do-file read

clear
infile ...
label this and that
sort ...
...
save master, replace
set seed 7781319            <- note well
sample 10
save sample, replace

The "random" sample is no less random merely because the seed is known and can be reproduced. This comment about setting the seed applies to all of Stata's intentionally random commands (bootstrapped samples come to mind): Set the seed and then issue the command.
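One way to convince yourself of this is to draw the sample twice from the same seed and compare the two results. The sketch below assumes the auto dataset shipped with Stata as a stand-in for master.dta; the file name tmps.dta is made up and follows the tmp prefix convention.

```stata
sysuse auto, clear
set seed 7781319
sample 10
save tmps, replace

sysuse auto, clear          /* same data, same order */
set seed 7781319            /* same seed */
sample 10
cf _all using tmps          /* stops with an error if any value differs */
erase tmps.dta
```

cf compares the data in memory with the saved file variable by variable, so reaching the erase line means the two draws were identical.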

There is only one other place that Stata exhibits random behavior, and that is with the sort command. Consider a dataset in which there is more than one observation per region of the country. Then sort region will sort the data in ascending order of region, but the order of the observations within a region will be randomized. In most cases, this randomness does not matter, but if it does and you want results to be reproducible, you must deterministically break the ties. One way would be by typing

sort region, stable

and another way would be by typing

set seed some value
gen u = runiform()
sort region u
drop u

The first way would keep the data in the order of the original observations within each region. The second way would randomize it in a controlled way.
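Both ways can be sketched on the auto dataset shipped with Stata, with foreign playing the role of region. The variable orig is introduced here only to check the claim about the stable option; it is not part of the lecture's example.

```stata
sysuse auto, clear
gen long orig = _n                 /* remember the starting order */

* first way: ties keep their original relative order
sort foreign, stable
by foreign: assert orig > orig[_n-1] if _n>1

* second way: ties are broken at random, but reproducibly
set seed 7781319
gen u = runiform()
sort foreign u
drop u
```

The assert after the stable sort would stop the do-file if any two observations within a group had changed places.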

In fact, this was relevant in the example of analyzing a too-large dataset.

clear
infile ...
label this and that
sort ...                    <- note this line
...
save master, replace
set seed 7781319
sample 10
save sample, replace

That early sort was actually sort district, which put the data into the order of sales districts. That line was enough to break the reproducibility of the 10% sample because the sample is a function of the random-number seed and the order of the data. In fact, the do-file for creating these data read

clear
infile ...
label this and that
sort district, stable
...
save master, replace
set seed 7781319
sample 10
save sample, replace

Now the result is reproducible.

Mostly, the order of the data does not matter and need not be controlled, but any time you are performing explicitly random steps, you must control the order of the data if the results are to be reproducible.

13 Indexing

This course has discussed serious programming (at least in the data analysis sense) without yet talking about what most people consider to be the definitional aspect of programming: looping and branching. The course will cover looping and branching, but one important trick in Stata programming, at least where the data are concerned, is to avoid looping and branching whenever possible. Stata's loop-and-branch constructs are slow, and you can generally avoid looping by indexing.

Most of Stata's data-handling commands, such as

generate X = ...

replace X = ...

assert X ...

implicitly loop across the data. When you type

generate age2 = age*age

Stata takes this to mean

for each observation in the data {
    generate age2 = age*age in the current observation
}

Indexing allows you to refer to past and future observations in the implied loop. For instance, you can refer to fixed indices:

generate y = x[1]

This is interpreted as

for each observation in the data {
    generate y = the 1st observation in x
}

You can easily refer to x[2], x[3], and so on. The current observation in the implied loop is referred to as _n, so

generate y = x[_n]

means the same thing as

generate y = x

Any expression can appear in the brackets, so

generate y = x[_n-1]

produces in y the lagged values of x (y[1] will be missing), whereas

generate y = x[_n+1]

produces in y the lead values of x (y[_N] will be missing).

_N (not _n) refers to the last observation. Thus

generate y = x[_N]

produces in every observation of y the last value of x.

generate y = x[_N-_n+1]

produces in y the reversed values of x.
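The indexing expressions above can be checked on a three-observation toy dataset. The variable names first, lag, lead, last, and rev are chosen here for illustration only.

```stata
clear
input x
10
20
30
end

gen first = x[1]          /* 10 in every observation */
gen lag   = x[_n-1]       /* missing, 10, 20 */
gen lead  = x[_n+1]       /* 20, 30, missing */
gen last  = x[_N]         /* 30 in every observation */
gen rev   = x[_N-_n+1]    /* 30, 20, 10 */
list

assert lag>=. in 1
assert lead>=. in 3
assert rev==30 in 1
```

Listing the data after each generate is a quick way to see what each subscript refers to.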

There is only one more thing to know about _n and _N, and that is how they interact with by. The formal definition of Stata's by prefix is

by varlist: stata_cmd

which produces the same result as forming separate datasets for each unique set of values of varlist and running stata_cmd on each dataset separately.

Thus, in by varlist: something referring to _n or _N, _n refers to the current observation number within the by-group, and _N refers to the last observation within the by-group, both counted from 1. Type

by person: generate y = x[_n-1]

to obtain the lagged values of x within person. The first value of y for each person will be missing.

_n and _N can be used directly. For instance, in the hierarchical data example, assume you created the data hier.dta and want to add a variable stating how many people are in each family:

by famid: gen persons = _N

Type summarize persons to find out the minimum and maximum number of people per family. Obtain the average, but interpret the average cautiously. This is the average from the person point of view: the average size of family for a person drawn randomly from the data. (Think about the average of a two-person family and a four-person family. The average would be [2*2+4*4]/6, not [2+4]/2.) To obtain the average for a randomly drawn family, type

by famid: gen persons = _N if _n==1
summarize persons

or, equally well,

by famid: gen persons = _N if _n==_N
summarize persons

Use either one to record the number of people once—and once only—per family.
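The two-person and four-person families from the parenthetical remark make a convenient test case. In this sketch the second variable is named famsize (a name made up here) because persons already exists by the time it is created.

```stata
clear
input famid perid
1 1
1 2
2 1
2 2
2 3
2 4
end
sort famid perid

by famid: gen persons = _N             /* recorded on every person */
summarize persons
assert reldif(r(mean), 20/6) < 1e-6    /* person view: (2*2+4*4)/6 */

by famid: gen famsize = _N if _n==1    /* recorded once per family */
summarize famsize
assert r(mean)==3                      /* family view: (2+4)/2 */
```

The first mean weights each family by its own size, which is why it is larger than the second.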

Truly amazing results come from combining _n and _N with explicit indexing. For instance, let's assume we have a dataset that contains

Variable name   Variable label
personid        six-digit id number of person
age             current age
sex             sex (1=male, 2=female)
weight          weight (lbs.)
fatherid        six-digit id number of father (if in data)
motherid        six-digit id number of mother (if in data)

A dictionary containing data named relation.dct is attached as a cutout at the end of this lecture in Appendix B: Sample data for relation example. Create relation.dta by typing

. infile using relation

. save relation

To these data, we wish to add to each person's record the age of their father. The solution to this problem is to

1. create a dataset mapping the ID number to the observation number; and

2. add the father's age by typing generate fage = age[father's obs#].

Assume the data are stored as relation.dta. The solution is

use relation, clear
sort personid /* put data in personid order */
gen obsno = _n /* obsno is 1, 2, 3, ..., _N */
keep personid obsno
rename personid id /* use generic name id */
save mapping, replace /* data contain id and obsno */

use relation, clear
gen id = fatherid /* the id we want is the father's */
sort id
merge m:1 id using mapping
keep if _merge==1 | _merge==3
sort personid /* sort personid so obsno valid */
gen fage = age[obsno]
drop _merge id obsno

Most people have to play around with this example to understand it. It will help if you try the steps above and look at some of the data as you go along. For instance, if you look at the example dataset, you will see that observation 1 (personid==101612) has a father with an ID number of 939175. Observation 10 (personid==129315) also has this same father. So, observation 1 and observation 10 are siblings. Later in the dataset, you find the father: observation 237 (personid==939175). The father's information (age) merges into his children's records.

Easily add the mother's age with mapping.dta:

gen id = motherid
sort id
merge m:1 id using mapping
keep if _merge==1 | _merge==3
sort personid
gen mage = age[obsno]
drop _merge id obsno

With a dataset like this, mapping.dta would be a good dataset to keep around. Remember, do not index by obsno (as in age[obsno]) unless the data are sorted by personid (because observation numbers are meaningful only if given a specification of order).
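If the full relation data are not at hand, the same logic can be followed on three made-up records: two siblings and their father. The IDs, ages, and the file name tmpmap.dta are invented for this sketch, and preserve and restore are used in place of the second use relation, clear so that nothing needs to be saved beforehand.

```stata
clear
input long personid age long fatherid
101 44 300
102 24 300
300 70 .
end

sort personid
gen long obsno = _n                /* 1, 2, 3 in personid order */
preserve
keep personid obsno
rename personid id
save tmpmap, replace               /* maps id to observation number */
restore
drop obsno

gen long id = fatherid             /* the id we want is the father's */
merge m:1 id using tmpmap
keep if _merge==1 | _merge==3      /* drop mapping records nobody points to */
sort personid                      /* obsno is valid only in this order */
gen fage = age[obsno]
drop _merge id obsno
erase tmpmap.dta
list

assert fage==70 if fatherid==300
assert fage>=. if fatherid>=.
```

Both children pick up obsno 3 from the mapping, so age[obsno] is the father's age; the father himself has no father in the data, so his fage is missing.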

When dealing with this kind of data, do not add the father's age or the mother's age in the data-creation step; add the father's observation number and the mother's observation number. Then, in the analysis step, obtain any mother's or father's characteristic. The do-file might read

DO-FILE: crrel2.do

capture log close
log using crrel2, replace

use relation, clear
sort personid
by personid: assert _N==1 /* see Exercise 9 */
gen obsno = _n
keep personid obsno
rename personid id
save mapping, replace

use relation, clear
gen id = fatherid
sort id
merge m:1 id using mapping
keep if _merge==1 | _merge==3
rename obsno f_n
label var f_n "Father's obs. # when sorted"
drop _merge id

gen id = motherid
sort id
merge m:1 id using mapping
keep if _merge==1 | _merge==3
rename obsno m_n
label var m_n "Mother's obs. # when sorted"
drop _merge id

sort personid
save rel2, replace
erase mapping.dta
log close
exit

For the father's age, type

sort personid /* if not already */
gen fage = age[f_n]

and for mother's weight, type

gen mweight = weight[m_n]

14 assert as an alternative to branching

In most, but not all, data analysis situations, looping can be avoided with explicit indexing. Similarly, branching can be avoided by using assert. In most data analysis situations, you are producing code to deal with one specific dataset; you are not producing general code for a generic problem. In such cases, most analysts omit explicitly stating all the assumptions, but if they are careful, they have looked at the data and verified that the tacit assumptions are true.

assert is used to explicitly narrow the focus of the program without saying what you are going to do when the assertion is false. If the assertion is false, Stata will stop, and you will have to intervene. assert is especially useful when you must repeat the analysis on updated or different data, because it is in those cases that it is most tedious to verify assumptions.

assert can be used to assert away data errors:

assert sex=="m" | sex=="f"
assert 1 <= age & age <= 99
assert packs==0 if !smoker

assert can also be used to assert that the problem is simple:

assert n_visits==1 | n_visits==0
assert age<.
assert packs>0 if smoker
assert fatherid>=. if motherid>=.

assert can combine with other elements of Stata's language to prove complicated assumptions.

For example,

by patient: assert sex==sex[_n-1] if _n>1

by patient: assert abs(bp-bp[_n-1]) < 20 if bp<. & bp[_n-1]<.

sort patient time
by patient: gen gap = time - time[_n-1]
by patient: assert gap<2 if _n>1

by patient: assert died==0 if _n!=_N
by patient: assert died==0 | died==1 if _n==_N

by patient: gen n_xplant = sum(xplant)
by patient: assert n_xplant==0 | n_xplant==1 if _n==_N

Assertions can be collected into a single do-file to certify the data (sometimes do this), and they can be sprinkled throughout all your do-files as the assumptions become obvious (always do this).
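A few of these assertions can be tried on a toy patient file. The five records below are invented for this sketch; the last two lines use capture and _rc to show that a false assertion returns error code 9, which is what stops a do-file.

```stata
clear
input patient time sex died
1 1 1 0
1 2 1 0
1 3 1 1
2 1 2 0
2 2 2 0
end
sort patient time

by patient: assert sex==sex[_n-1] if _n>1            /* sex never changes */
by patient: assert died==0 if _n!=_N                 /* death only on last record */
by patient: assert died==0 | died==1 if _n==_N

by patient: gen gap = time - time[_n-1]
by patient: assert gap<2 if _n>1                     /* no skipped visits */

* a deliberately false assertion, captured so that the sketch keeps running
capture assert time==1
assert _rc==9
```

Without the capture, the false assertion would halt the do-file at that line, which is the behavior you want in a certification script.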

15 Consuming calculated results

Sometimes you may want to calculate things that are, themselves, based on other calculated results. For example, you want to normalize the data to have mean 0, and possibly variance 1, or you want to use a subset of the estimated regression coefficients to form a prediction, etc. When you work interactively, you can see a result and just use it by retyping, cutting, or pasting the number.

That approach does not work in do-files. In do-files, you will want to refer to the result without knowing the actual result. For instance, say that you want to remove the mean from variable x. You could interactively type summarize x and then write down the mean (5.4237) and then include the following line in your do-file:

replace x = x - 5.4237 /* 5.4237 obtained interactively */

You might even be cautious and try to keep a trail of what you have done:

summarize x /* verify mean is 5.4237 */
replace x = x - 5.4237

There is a better way. Stata's statistical commands save calculated results, such as in numbers like r() or e() or in built-in vectors like _b[], so that you can access them when you need them (see [R] stored results). Use the return list or ereturn list command after running a command to see a listing of stored results. For instance, summarize saves the mean in r(mean). The correct do-file way to remove the mean of x is by typing

summarize x
replace x = x - r(mean)

This method is guaranteed to be correct and remains correct even if you have to rerun the do-file on new and different data.
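The same idea extends to the mean 0, variance 1 normalization mentioned at the start of this section. The sketch below assumes the auto dataset shipped with Stata; the variable names mpg0 and zmpg are made up, and new variables are generated instead of replacing mpg so that the original stays intact.

```stata
sysuse auto, clear

summarize mpg
return list                               /* shows r(mean), r(sd), and the rest */
gen double mpg0 = mpg - r(mean)           /* remove the mean */

summarize mpg
gen double zmpg = (mpg - r(mean))/r(sd)   /* mean 0 and standard deviation 1 */

summarize mpg0
assert abs(r(mean)) < 1e-10
summarize zmpg
assert abs(r(sd) - 1) < 1e-10
```

summarize is rerun before the second generate because the stored results belong to the most recent r-class command, and it is safest to refresh them right before they are used.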

Estimation commands save coefficients in the _b[] built-in vector:

regress ln_incom educ age age2 sex
gen asifmale = _b[_cons] + _b[educ]*educ + _b[age]*age + _b[age2]*age2

[R] stored results documents where a command stores its results, and you will want to learn to use these results when you need them in your calculations.
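As a check on how _b[] works, a prediction built by hand from every coefficient should agree with what predict produces. This sketch assumes the auto dataset shipped with Stata; the model and the variable names yhat and xb are chosen only for illustration.

```stata
sysuse auto, clear
regress price mpg weight
ereturn list                                   /* shows what regress stored */

gen double yhat = _b[_cons] + _b[mpg]*mpg + _b[weight]*weight
predict double xb, xb                          /* linear prediction from predict */
assert reldif(yhat, xb) < 1e-8
```

Leaving a term out of the hand-built expression, as asifmale does with the sex coefficient, is how you form a prediction from only a subset of the coefficients.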

16 Conclusion

So far, this course has covered everything there is to know, at least conceptually, about using Stata sequentially. What most programmers would call programming has not been covered. However, 80–90% of what data analysts call programming has been covered. At least it is 80–90% of the kind of programming used to analyze real datasets.

In the next three lectures, the course will cover local and global macros and branching and looping.

In the meantime, complete the exercises. However, do not send your solutions to us (and the other participants) unless you think your solution is novel or salient to a point you wish to make.

Send an email to [email protected] if you need to speak to us privately.

The documentation of completed exercises is not necessary; however, comments and questions are welcome, even if they are only tangentially related to the lecture.

Let's see how things go.

17 Exercises

1. In the hello.do file appearing at the top of section 4,

DO-FILE: hello.do

program hello
    display "Hello, world"
end
exit

exit is included at the end. This addresses problem 5 of section 3: you might forget the hard return. In the section on do-files, however, the lecture said that if you have that problem, the hello do-file should read

display "Hello, world"
exit

Explain why the exit does not appear inside the hello program. Would it make a difference?

2. In section 6, a do-file loaded a program and executed it:

DO-FILE: hello.do

program hello
    display "Hello, world"
end
hello
exit

Compare this with the do-file at the beginning of the lecture:

DO-FILE: hello.do

display "Hello, world"

Which is better? When would one method be preferred over the other?

3. capture X, where X is any Stata command, not only catches any errors but also catches and discards any output produced by X. quietly X, another Stata command, executes X and discards output, but does not catch errors. noisily X is the antonym of quietly X. Verify that

capture noisily X

executes X, catching errors but not output. What does

noisily capture X

do? Why?

What does

noisily capture noisily X

do?

4. Assume that the data in the crhier.do example (the hierarchical data in appendix A) are 25% too big to fit into memory. Modify crhier.do to create two datasets, one for males and one for females, reading the original data without overflowing memory. (Hint: You need to read the original data three times: once to obtain the household records, once to obtain the males, and once to obtain the females.)

5. The hierarchical data example implies that there are no empty households. Prove this implication by including the line

assert rectype!=1 if rectype[_n-1]==1

Is the line necessary? That is, would the data be misassembled (people assigned to the wrong households) if this assertion were false?

6. Suppose you need to create two random numbers in your data; call them u1 and u2. You want to do this in a reproducible way. Compare the following:

First way                    Second way
. set obs 10                 . set obs 10
. set seed 329193            . set seed 329193
. gen u1 = runiform()        . gen u1 = runiform()
. gen u2 = runiform()        . set seed 4988329
                             . gen u2 = runiform()

In terms of reproducibility, does it matter which is used? What is a good rule for when the random-number seed needs to be reset and when it does not?

7. In section 13, Indexing, it was casually noted that

generate y = x[_n-1]

sets the first observation of y to missing. How does it do this? What happens when you ask Stata to calculate x[0]? x[1000] in a 500-observation dataset? x[5.4] in a 10-observation dataset? x[5.9]? More importantly, how do you find out? (Hint: See [R] display.)

8. In obtaining the average size of families for a randomly drawn family, it was noted that

by famid: gen persons = _N if _n==1
summarize persons

and

by famid: gen persons = _N if _n==_N
summarize persons

will yield the same results. All that is important is that the number of people be recorded only once per family. Will

by famid: gen persons = _N if _n==2
summarize persons

yield the same results? Why or why not?

9. In crrel2.do, an assert was added that did not appear in the original draft:

use relation, clear
sort personid
by personid: assert _N==1

What is the purpose of this assert? Why is it important?

10. You have data containing two variables: group and x. group takes on the values 1, 2, 5, 9, 10, 11, and 12. There are repeated observations within group. Without using egen (see [D] egen), write out the Stata commands necessary to create variable devx, the deviation of x[_n] from the group mean. Set devx to missing if there is only one observation in the group. Write out the Stata commands necessary assuming that 1) there are no missing x values and 2) there are missing x values.

11. All the examples of analysis do-files include

log using filename, replace

near the top. Why do you include the replace option?

Appendix A: Sample data for hierarchical dataset example

Below are some sample data that you can use to test your ability to read in a hierarchical dataset.

DATASET: hier.raw

06470 1 1
    1 232 0
    2 230 1
07470 1 0
    1 240 1
08470 1 0
    1 227 0
09470 1 0
    1 213 1
    2 222 0
    3 224 1
10470 1 1
    1 220 0
    2 211 1
11470 1 0
    1 217 0
    2 210 1
    3 226 1
12470 1 0
    1 218 1
    2 220 0
13470 1 0
    1 217 1
14470 1 0
    1 215 1
    2 218 0
15470 1 0
    1 213 0
    2 229 0
16470 1 0
    1 219 1
19470 1 1
    1 215 0
    2 219 1
24470 1 0
    1 222 1
26470 1 0
    1 211 1
    2 221 1
37470 1 0
    1 215 0
    2 218 0
    3 221 1
    4 222 0
    5 240 1
39920 1 0
    1 235 1
    2 218 0
39925 1 1
    1 213 0
    2 229 0
46470 1 1
    1 279 1
49470 1 1
    1 255 0
    2 269 1
54470 1 0
    1 222 1
56470 1 1
    1 231 1
    2 231 1
58470 1 0
    1 217 1
59470 1 0
    1 243 1
    2 222 1
    3 224 1
60470 1 1
    1 250 1
    2 211 1
61470 1 0
    1 217 0

Appendix B: Sample data for relation example

The following dataset will let you experiment with the relation example that was presented in this lecture.

DICTIONARY: relation.dct

dictionary {
    long personid "ID of person"
    byte sex "1=male 2=female"
    int age "Age in years"
    int weight "Weight in lbs"
    long fatherid "ID of the father"
    long motherid "ID of the mother"
}
101612 1 44 181 939175 .
111902 2 29 138 . .
115909 1 40 182 . .
117555 2 62 133 . .
119859 1 57 162 . .
125175 2 53 136 . .
125542 1 39 173 . 139867
126264 2 61 135 . .
126691 1 29 179 150023 .
129315 1 24 150 939175 .
129747 2 42 135 150023 680080
130433 1 0 36 583905 518533
130457 1 33 173 180713 .
130608 1 17 184 382421 744872
130844 1 8 70 759350 805333
134220 2 53 137 . .
135213 1 28 185 472361 420402
137886 1 36 190 180713 830860
138010 1 49 162 939175 .
138055 1 13 82 476023 938499
139797 1 30 170 150023 680080
139867 2 71 137 . .
142202 2 52 130 . .
142286 2 53 135 . .
143219 1 19 178 . 830860
145617 1 57 175 . .
150023 1 78 192 . .
151191 2 34 137 . 830860
151737 2 59 131 . .
156326 2 50 139 . .
161939 2 36 139 396335 139867
164954 2 33 135 610731 680080
169139 1 53 172 . .
169199 2 49 141 150023 .
171329 2 24 139 610731 117555
173536 1 10 73 615464 894873
175781 1 20 180 472361 .
176894 1 39 186 . .
177670 2 39 138 . .
178914 1 29 179 939175 .
180713 1 62 174 . .
181106 2 38 133 . .
181525 1 29 175 . .
181795 2 17 140 974383 202557
187893 2 51 130 . .
191592 1 48 172 . .
197055 1 76 182 . .
199973 2 46 141 . .
199996 1 17 156 222839 317098
202557 2 41 133 375917 420402
203388 2 74 136 . .
204822 2 41 129 180713 .
215693 2 12 79 197055 266826
215811 1 50 175 . .
219247 2 63 142 . .
222651 2 44 128 396335 .
222839 1 46 148 . .
237341 1 31 175 . 830860
238520 2 26 136 375917 420402
240298 2 56 137 . .
247560 2 28 138 610731 .
252640 1 38 179 396335 .
253559 2 17 128 298337 .
266363 1 34 169 396335 .

© Copyright 2013 StataCorp LP.