EEA Stata Training Manual
8/14/2019 EEA Stata Training Manual
Training Module
Using Stata for Survey Data Analysis
Ethiopian Economics Association/ Ethiopian Economic Policy Research Institute/
September 2009
Section 5: Modifying variables
Section 6: Advanced descriptive statistics
Section 7: Presenting data with graphs (graphing data)
Section 8: Normality and outliers
Section 9: Statistical tests
Section 10: Linear regression
Section 11: Logistic regression
Section 12: Panel data analysis (regression)
Section 13: Data management
Section 14: Advanced programming
Section 15: Troubleshooting and updates
Each section will include some training in the use of Stata commands and a practical application of these commands to the analysis of household survey data. The ERHS1999, for example, contains over fifty files, but we will focus our attention on a few of them:
SECTION 1: INTRODUCTION TO STATA
Stata is a package that offers a good combination of ease of learning and power. It has numerous powerful yet simple commands for data management, which allow users to perform complex manipulations with ease. Under Stata/SE, a data file can hold up to 32,767 variables, and an estimation command can include up to 11,000 variables.

Stata performs most general statistical analyses (regression, logistic regression, ANOVA, factor analysis, and some multivariate analysis). The greatest strengths of Stata are probably in regression and logistic regression. Stata also has a very nice array of robust methods that are very easy to use, including robust regression and regression with robust standard errors; many other estimation commands offer robust standard errors as well.

Stata can easily download programs developed by other users, and you can create your own Stata programs that seamlessly become part of Stata. Many cutting-edge statistical procedures have already been written by other users and can be incorporated into your own Stata programs. Stata uses one-line commands, which can be entered one at a time or many at a time in a Stata program.

When you open Stata, you will see a menu bar across the top, a tool bar with buttons, and 3-5 windows (the number of windows open depends on which windows were open the last time Stata was used). Each is described briefly below.
The Stata Interface
1. Windows
The Stata windows give you all the key information about the data file you are using, recent commands, and the results of those commands. Some of them open automatically when you start Stata, while others can be opened using the Windows pull-down menu or the buttons on the tool bar.

These are the Stata windows:

Stata Results          To see recent commands and output
Stata Command          To enter a command
Stata Browser          To view the data file (needs to be opened)
Stata Editor           To edit the data file (needs to be opened)
Stata Viewer           To get help on how to use Stata
Variables              To see a list of variables
Review                 To see recent commands
Stata Do-file Editor   To write or edit a program (needs to be opened)
The Command window on the bottom right is where you'll enter commands. When you press ENTER, they are pasted into the Stata Results window above, which is where you will see your commands executed and view the results. You can also reuse recent commands with the Page Up key (to go to the previous command) and the Page Down key (to go to the next command).

The Results window (with the black background) shows all recent commands, output, error messages, and help. The text is color-coded as follows:

Green    General information and the frame and headings of output tables
Blue     Commands or error messages that can be clicked on for more information
White    Stata commands
Yellow   Numbers in output tables
Red      Error messages

The slide bar on the right side can be used to look at earlier results that are not on the screen. However, unlike SPSS, the Stata Results window does not keep all output generated. It keeps about 300-600 lines of the most recent output and deletes earlier output. If you want to store output in a file, you must use the log command (more on this later).
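Since the Results window discards older output, a log is the usual way to keep a full record. The lines below are a minimal sketch of a do-file, not part of the original manual; session1.log is a hypothetical file name, and ERHScons1999.dta is assumed to be in the current folder.

```stata
* Minimal sketch: record a short session in a plain-text log file.
* session1.log is a hypothetical name; ERHScons1999.dta is assumed
* to be in the current folder.
log using session1.log, text replace   // start recording results
use ERHScons1999, clear                // load the data
summarize cons food                    // this output goes to the log too
log close                              // stop recording and close the file
```

The replace option overwrites an earlier log with the same name, so leave it out if you want Stata to stop rather than overwrite.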
Stata Browser This window shows all the data in memory. The Stata Browser does not appear automatically when you start Stata. The only way to open the Browser is to click on the button with a table and magnifying glass. Unlike SPSS, when the Stata Browser is open, you cannot execute any commands, either from the Stata Command window or from the Do-file Editor. In addition, you cannot change any of the data. You can, however, sort the data or hide certain variables using buttons at the top of the Stata Browser window.
Stata Editor This window is exactly like the Stata Browser window except that you can change the data. We do not recommend using this window because you will have no record of the changes you make in the data. It is better to correct errors in the data using a Do-file program that can be saved.

Stata Viewer This window provides help on Stata commands and rules. To open the Stata Viewer window, you can click on Windows/Viewer or click on the eye button on the tool bar. To use the Stata Viewer window, type a command in the space at the top and the Viewer will give you the purpose and rules for using that command, along with some examples. Any blue text in the Viewer can be clicked on for more information about that command.

Variables This window (tall with a white background) lists all the variables that exist in memory. When you open a Stata data file, it lists the variables in the file. If you create new variables, they will be added to the list of variables. If you delete variables, they will be removed from the list. You can insert a variable into the Stata Command window by clicking on it in the Variables window.

Do-file Editor This window allows you to write, edit, save, and execute a Stata program. A Stata program (or Do-file) is simply a set of Stata commands written by the user. The advantage of using the Do-file Editor rather than the Stata Command window is that the Do-file allows you to save, revise, and rerun a set of commands. Exploratory analysis of the data can be done with the Stata Command window, but any serious data analysis should be carried out using the Do-file Editor, not the Stata Command window. The Do-file Editor can be opened by clicking on Windows/Do-file Editor or by clicking on the envelope button.

With so many windows, it is sometimes difficult to fit them all on the screen. You can adjust the size and position of each window the way you like it and then save the layout by clicking on Prefs/Save Windowing Preferences. Each time you open Stata, the windows will be arranged according to your preferred layout.

On the right are two convenient windows. The Variables window keeps a list of your current variables. If you click on one of them, its name will be pasted into the current command at the location of the cursor, which saves a little typing. The Review window keeps a list of all the commands you've typed in the Stata session. Click on one, and it will be pasted into the Command window, which is handy for fixing typos. Double-click, and the command will be pasted and re-executed. You can also export everything in the Review window into a .do file (more on them later) so you can run the exact same commands at any time. To do this, right-click the Review window.

When we first open Stata, all these windows are blank except for the Stata Results window. You can resize these 4 windows independently, and you can resize the outer window as well. To save your window size changes, click on the Prefs button, then Save Windowing Preferences.

Entering commands in Stata works pretty much as you would expect. BACKSPACE deletes the character to the left of the cursor, DELETE deletes the character to the right, the arrow keys move the cursor around, and any text you type is inserted at the current location of the cursor. The up arrow does not retrieve previous commands, but you can do that by pressing PAGE UP or CTRL-R, or by using the Review window.
2. Menus
Stata displays 8 drop-down menus across the top of the outer window, from left to right:

File
    Open                 open a Stata data file (use)
    Save/Save as         save the Stata data in memory to disk
    Do                   execute a do-file
    Filename             copy a filename to the command line
    Print                print log or graph
    Exit                 quit Stata
Edit
    Copy/Paste           copy text among the Command, Results, and Log windows
    Copy Table           copy table from Results window to another file
    Table copy options   what to do with table lines in Copy Table
Data, Graphics, Statistics
                         build and run Stata commands from menus
User                     menus for user-supplied Stata commands (download from Internet)
Window                   bring a Stata window to the front
Help                     Stata command syntax and keyword searches
3. Button bar
The buttons on the button bar are, from left to right (the equivalent command, where there is one, follows the colon):

Open a Stata data file: use
Save the Stata data in memory to disk: save
Print a log or graph
Open a log, or suspend/close an open log: log
Open a new viewer
Bring Graph window to front
New Do-file Editor: doedit
Edit the data in memory: edit
Browse the data in memory: browse
Clear -more- condition: Space Bar
Stop current command or do-file: Ctrl-Break
SECTION 3: EXPLORING DATA FILES
3.1. Common Stata Syntax
This section covers commands that are used for preliminary exploration of data in a file. Stata commands follow the same syntax:

[by varlist1:] command [varlist2] [if exp] [in range] [weight] [, options]

Items inside the square brackets are either optional or not available for every command. This syntax applies to all Stata commands. The by prefix repeats a Stata command on subsets of the data; to use it, the dataset must first be sorted on the by variable(s).
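To make the general syntax concrete, here is a small sketch that fills in several slots at once: the by prefix, a variable list, an if qualifier, and an option. It assumes ERHScons1999.dta is in the current folder and uses variable names that appear later in this manual (q1a for region, cons, hhsize).

```stata
* Sketch of the general syntax:
*   [by varlist1:] command [varlist2] [if exp] [, options]
* Assumes ERHScons1999.dta is in the current folder.
use ERHScons1999, clear
sort q1a                                       // by needs data sorted on q1a
by q1a: summarize cons if hhsize > 5, detail   // repeated for each region
```

Here q1a is varlist1, summarize is the command, cons is varlist2, hhsize > 5 is the if expression, and detail is an option.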
Logical operators used in Stata

~     not
==    equal
~=    not equal
!=    not equal
>     greater than
>=    greater than or equal
<     less than
<=    less than or equal
4. iweight, or importance weights, are weights that indicate the "importance" of the observation in some vague sense. iweights have no formal statistical definition; any command that supports iweights will define exactly how they are treated. In most cases, they are intended for use by programmers who need to implement their own analytical techniques by using some of the available estimation commands. Special care should be taken when using importance weights to understand how they are used in the formulas for estimates and variance. This information is available in the Methods and Formulas section in the Stata manual for each estimation command. In general, these formulas will be incorrect for computing the variance for data from a sample survey.
3.2 Examining dataset
clear
The clear command deletes all data, variables, and labels from memory to get ready to use a new data file. You can clear memory using the clear command or by using the clear option of the use command (see the use command). This command does not delete any data saved to the hard drive.

set memory
First, you can check how much memory is allocated to hold your data using the memory command. For instance, we are now running Stata/SE 11 under Windows, and this is what the memory command told us.
Figure 2: Working memory space

. memory
                                                   bytes
--------------------------------------------------------------------
Details of set memory usage
    overhead (pointers)                            5,808       0.06%
    data                                         107,448       1.02%
                                            ------------
    data + overhead                              113,256       1.08%
    free                                      10,372,496      98.92%
                                            ------------
    Total allocated                           10,485,752     100.00%
--------------------------------------------------------------------
Other memory usage
    set maxvar usage                           1,816,666
    set matsize usage                          1,315,200
    programs, saved results, etc.                  3,338
                                            ------------
    Total                                      3,135,204
-------------------------------------------------------
Grand total                                   13,620,956
We have about 10 MB free for reading in a data file. Whenever we try to read a data file bigger than the free space, we will get this error message:

no room to add more observations
r(901);

In this case we have to allocate more memory, say 25 MB (if 25 MB is sufficient for the current file), with the set memory command before trying to use the file.
set memory 25m
Figure 3: Current memory allocation after set memory 25m command
Current memory allocation
                    current                                 memory usage
    settable          value     description                 (1M = 1024k)
    --------------------------------------------------------------------
    set maxvar         5000     max. variables allowed           1.733M
    set memory          25M     max. data space                 25.000M
    set matsize         400     max. RHS vars in models          1.254M
                                                            -----------
                                                                27.987M

Now that we have allocated enough memory, we can read bigger files, provided that they fit within the allocated memory. After setting the memory to 25m, the memory command reports:
Figure 4: Adjusted working memory space

. memory
                                                   bytes
--------------------------------------------------------------------
Details of set memory usage
    overhead (pointers)                            5,808       0.02%
    data                                         107,448       0.41%
                                            ------------
    data + overhead                              113,256       0.43%
    free                                      26,101,136      99.57%
                                            ------------
    Total allocated                           26,214,392     100.00%
--------------------------------------------------------------------
Other memory usage
    set maxvar usage                           1,816,666
    set matsize usage                          1,315,200
    programs, saved results, etc.                  1,778
                                            ------------
    Total                                      3,133,644
-------------------------------------------------------
Grand total                                   29,348,036
If we want to allocate 250m (250 megabytes) every time we start Stata, we can type:
. set memory 250m, permanently
And then Stata will allocate this amount of memory every time we start Stata.
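The memory steps above can be put together in a short do-file. This is a sketch under two assumptions: that you are running Stata 11 or earlier, as in this manual (from Stata 12 onward memory is managed automatically and set memory is ignored), and that ERHScons1999.dta is in the current folder.

```stata
* Sketch for Stata 11 or earlier, where data memory is set by hand.
* Assumes ERHScons1999.dta is in the current folder.
clear                 // set memory only works when no data are in memory
set memory 25m        // allocate 25 MB for data
use ERHScons1999      // load the file into the enlarged data space
memory                // report how the allocated memory is being used
```

Clearing first matters because older versions of Stata refuse to change the allocation while a dataset is loaded.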
use
This command opens an existing Stata data file. The syntax is:
use filename [, clear]                                        opens new file
use [varlist] [if exp] [in range] using filename [, clear]    opens selected parts of file

If there is no extension, Stata assumes it is .dta. If there is no path, Stata assumes the file is in the current folder. You can use a path name such as: use C:\...\ERHScons1999. If the path name has spaces, you must use double quotes: use "d:\my data\ERHScons1999". You can open selected variables of a file using a variable list. You can open selected records of a file using if or in.

Here are some examples of the use command:

use ERHScons1999                          opens the file ERHScons1999.dta for analysis
use ERHScons1999 if q1a == 1              opens data from region 1
use ERHScons1999 in 5/25                  opens records 5 through 25 of the file
use hhid hhsize cons using ERHScons1999   opens 3 variables from the ERHScons1999 file
use C:\training\ERHScons1999              opens the file ERHScons1999.dta in the specified folder
use "C:\data files\ERHScons1999"          use quotation marks if there are spaces
use ERHScons1999, clear                   clears memory before opening the new file
When running a Do-file, we usually combine the use command with the clear option. For instance, here we load a raw data set from ERHScons1999. The clear option tells Stata to clear the previous data set from memory in order to load the new one.

. use C:\...\ERHScons1999.dta, clear

Without clear, Stata refuses to open a new file when the data in memory have unsaved changes, because it does not want you to lose them. If you really want to discard the changes in memory, the clear option specifies that it is okay to replace the data in memory, even though the current data have not been saved to disk.
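The two forms of use can be combined in a do-file as sketched below. The sketch assumes ERHScons1999.dta is in the current folder; note that in the second form the if qualifier comes before using, and q1a is included in the variable list so that the condition refers to a variable that is loaded.

```stata
* Sketch: load the whole file, then only a subset of it.
* Assumes ERHScons1999.dta is in the current folder.
use ERHScons1999, clear       // whole file, replacing any data in memory

* four variables, region 1 only; clear again discards the data just loaded
use hhid q1a hhsize cons if q1a == 1 using ERHScons1999, clear
describe                      // check what was actually loaded
```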
save
The save command saves the dataset as a .dta file under the name you choose. Editing the dataset changes the data in the computer's memory; it does not change the data stored on the computer's disk.
. save C:\...\consumption.dta, replace
The replace option allows you to save a changed file to the disk, replacing the original file. Stata is worried that you will accidentally overwrite your data file. You need to use the replace option to tell Stata that you know that the file exists and you want to replace it.
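A common pattern is to load the raw file, change the copy in memory, and save the result under a new name so the raw file stays untouched. The sketch below assumes ERHScons1999.dta is in the current folder; region1_cons is a hypothetical file name chosen for illustration.

```stata
* Sketch: modify the data in memory and save under a new name.
* region1_cons is a hypothetical file name.
use ERHScons1999, clear
keep if q1a == 1               // changes only the copy in memory
save region1_cons, replace     // writes region1_cons.dta, overwriting any old copy
```

Saving under a different name means the replace option can never overwrite the original survey file by accident.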
edit
This command opens the Data Editor window, which lets us view all observations in memory. You can change the data in the Data Editor, but we do not recommend using this window because you will have no record of the changes you make in the data. It is better to correct errors in the data using a Do-file program that can be saved (we will see Do-file programs later).

browse
This command opens the Data Browser window, which is exactly like the Data Editor window except that you can't change the data.

describe
This command provides a brief description of the data file. You can abbreviate it to des or d and Stata will understand. The output includes:

- the number of variables
- the number of observations (records)
- the size of the file
- the list of variables and their characteristics
Example 1: Using describe to show information about a data file

. des

Contains data from C:\training\ERHSCONS1999.dta
  obs:         1,452
 vars:            15                          24 Feb 2007 07:07
 size:       113,256 (98.9% of memory free)   (_dta has notes)
-----------------------------------------------------------------------------
              storage  display     value
variable name   type   format      label      variable label
-----------------------------------------------------------------------------
q1a             float  %9.0g       reg        Region
q1b             double %15.0g      w          Wereda
q1c             double %17.0g      pa         Peseant association
q1d             double %12.0g                 Household id
sexh            byte   %8.0g       sexhh      Sex of household head
ageh            float  %9.0g       p1s1q4     Age of household head
cons            float  %9.0g                  consumption per month
food            float  %9.0g                  food cons per month
hhsize          byte   %8.0g                  household size
aeu             float  %9.0g                  adult equivalent units in household
fpi             float  %9.0g                  food price index
rconspc         float  %9.0g                  real consumption per capita 1994 prices
rconsae         float  %9.0g                  real consumption per adult 1994 prices
poor            double %8.2f
hhid            double %12.0f                 selected household unique id
-----------------------------------------------------------------------------
Sorted by:  hhid
It also provides the following information on each variable in the data file:

- the variable name
- the storage type: byte is used for small integers such as binary variables, int is used for integers, and float is used for continuous variables that may have decimals. To see the limits on each storage type, type help data types.
- the display format, which indicates how the variable will appear in the output
- the value label, which is the name of a set of labels for different values
- the variable label, which is a name for the variable that is used in output

list
This command lists the values of variables in the data set. The syntax is:

list [varlist] [if exp] [in range]

With varlist, you can specify which variables' values will be presented. If varlist is not specified, all variables will be listed. With if and in, you can specify which records will be listed. Here are some examples:

. list                        lists entire dataset
. list in 1/10                lists observations 1 through 10
. list hhsize q1a food        lists selected variables
. list hhsize sexh in 1/20    lists observations 1-20 for selected variables
. list if q1a < 6             lists cases in regions 1 through 5
if
This qualifier is used to select certain records in carrying out a command. It is similar to the process if command in SPSS, except that in Stata it is not considered a separate command. The syntax is:

command if exp

Examples include:

. list hhid q1a food if food > 2000    lists data if food is above 2000
. tab q1a if cons>1000 & cons[...]
. [...]=1200                           browse data if food consumption is above 1200

Note that if expressions always use == to test equality, not a single =. Also note that | indicates "or" while & indicates "and".
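The sketch below shows ==, & and | working together in if expressions. It assumes ERHScons1999.dta is in the current folder and uses the coding shown in this manual (sexh is 0 for female heads). One point the text does not cover, added here as a caution: Stata treats a missing value as larger than any number, so a condition such as food > 2000 also selects records where food is missing unless you exclude them with the missing() function.

```stata
* Sketch: combining conditions in if expressions.
* Assumes ERHScons1999.dta is in the current folder.
use ERHScons1999, clear
list hhid q1a food if food > 2000 & q1a == 1        // both conditions must hold
count if sexh == 0 | hhsize >= 10                   // either condition is enough
list hhid food if food > 2000 & !missing(food)      // leave out missing values
```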
in
We have also used in to select records based on the case number. The syntax is:

command in range

For example:

. list in 10            lists observation number 10
. summarize in 10/20    summarizes observations 10-20
Example 2: Using list to look at data

. list hhid q1a q1b q1c q1d hhsize rconspc in 10/25

     +-------------------------------------------------------------------+
     |         hhid      q1a     q1b       q1c   q1d   hhsize    rconspc |
     |-------------------------------------------------------------------|
 10. | 101010000010   Tigray   Atsbi   Haresaw    10        4   134.5961 |
 11. | 101010000011   Tigray   Atsbi   Haresaw    11        3   168.9437 |
 12. | 101010000012   Tigray   Atsbi   Haresaw    12        3   135.1815 |
 13. | 101010000013   Tigray   Atsbi   Haresaw    13        7   102.3454 |
 14. | 101010000014   Tigray   Atsbi   Haresaw    14        9   68.04964 |
     |-------------------------------------------------------------------|
 15. | 101010000015   Tigray   Atsbi   Haresaw    15       12   49.61188 |
 16. | 101010000016   Tigray   Atsbi   Haresaw    16        4   85.05015 |
 17. | 101010000017   Tigray   Atsbi   Haresaw    17        5   84.72104 |
 18. | 101010000018   Tigray   Atsbi   Haresaw    18        2   95.42028 |
 19. | 101010000019   Tigray   Atsbi   Haresaw    19       10   140.7843 |
     |-------------------------------------------------------------------|
 20. | 101010000020   Tigray   Atsbi   Haresaw    20        3   80.58356 |
 21. | 101010000021   Tigray   Atsbi   Haresaw    21        3   95.98959 |
 22. | 101010000022   Tigray   Atsbi   Haresaw    22        5   68.05075 |
 23. | 101010000023   Tigray   Atsbi   Haresaw    23        4    52.4964 |
 24. | 101010000024   Tigray   Atsbi   Haresaw    24        3   91.86269 |
     |-------------------------------------------------------------------|
 25. | 101010000025   Tigray   Atsbi   Haresaw    25        5   149.1702 |
     +-------------------------------------------------------------------+
. list q1a cons aeu poor in 200/215

     +----------------------------------+
     |    q1a       cons     aeu   poor |
     |----------------------------------|
200. | Amhara   661.3979    1.82   0.00 |
201. | Amhara   321.7693    8.14   1.00 |
202. | Amhara    169.784     2.3   0.00 |
203. | Amhara   907.9995    3.14   0.00 |
204. | Amhara   232.6273   4.148   1.00 |
     |----------------------------------|
205. | Amhara   432.4525    6.86   1.00 |
206. | Amhara      59.53    1.46   1.00 |
207. | Amhara     228.22     3.4   0.00 |
208. | Amhara   1298.875    5.44   0.00 |
209. | Amhara    144.494    3.48   1.00 |
     |----------------------------------|
210. | Amhara    266.974    4.28   0.00 |
211. | Amhara   43.97179     .74   1.00 |
212. | Amhara   216.0467   3.408   1.00 |
213. | Amhara   492.4958    2.94   0.00 |
214. | Amhara   437.7144    2.46   0.00 |
     |----------------------------------|
215. | Amhara    166.354    1.74   0.00 |
     +----------------------------------+
If you are not careful with list, you will get a lot more output than you want. If Stata starts giving you more output than you really want, use the stop button (the button with an X).

codebook
The codebook command is a great tool for getting a quick overview of the variables in the data file. It produces a kind of electronic codebook from the data file, displaying information about variables' names, labels, and values.
Example 3: Using codebook to look at data

. codebook

sexh                                                    Sex of household head
-----------------------------------------------------------------------------
                  type:  numeric (byte)
                 label:  sexhh

                 range:  [0,1]                        units:  1
         unique values:  2                        missing .:  0/1452

            tabulation:  Freq.   Numeric  Label
                           400         0  Female
                          1052         1  Male

. codebook rconspc

rconspc                               real consumption per capita 1994 prices
-----------------------------------------------------------------------------
                  type:  numeric (float)

                 range:  [4.2201104,1018.2954]        units:  1.000e-07
         unique values:  1448                     missing .:  3/1452

                  mean:   90.3674
              std. dev:   81.9962

           percentiles:        10%       25%       50%       75%       90%
                           25.1043   39.9402   65.9926   114.253   180.891
inspect
This is another useful command for getting a quick overview of a data file. The inspect command displays information about the values of variables and is useful for checking data accuracy.

Example 4: Using inspect to look at data

. inspect sexh

sexh:  Sex of household head            Number of Observations
-----------------------------         ------------------------------
                                        Total   Integers   Nonintegers
|          #        Negative                -          -             -
|          #        Zero                  400        400             -
|          #        Positive             1052       1052             -
|          #                            -----      -----         -----
|  #       #        Total                1452       1452             -
|  #       #        Missing                 -
+--------------------                   -----
0                   1                    1452
  (2 unique values)

        sexh is labeled and all values are documented in the label.
count
The count command shows the number of observations that satisfy an if condition. If no condition is specified, count displays the number of observations in the data.

. count
 1452

. count if q1a==3
  466
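The commands in this subsection are often run together as a first look at a new file. The following do-file is a sketch of that routine, assuming ERHScons1999.dta is in the current folder; the choice of variables is only illustrative.

```stata
* Sketch: a quick first look at a data file.
* Assumes ERHScons1999.dta is in the current folder.
use ERHScons1999, clear
describe                     // variables, storage types, labels
codebook sexh rconspc        // ranges, missing values, tabulation or percentiles
inspect sexh                 // counts of negative, zero, and positive values
list hhid q1a hhsize in 1/5  // eyeball the first five records
count if missing(rconspc)    // how many records lack real consumption per capita
```

The codebook output in Example 3 reports 3 missing values out of 1,452 for rconspc, so the final count is a simple cross-check of that figure.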
3.3. Preliminary Descriptive Statistics
tabulate, tab1, tab2
These are three related commands that produce frequency tables for discrete variables. They can produce one-way frequency tables (tables with the frequency of one variable) or two-way frequency tables (tables with a row variable and a column variable). These commands are similar to the frequency and crosstab commands in SPSS. How do they differ?

tabulate or tab   produces a frequency table for one or two variables
tab1              produces a one-way frequency table for each variable in the variable list
tab2              produces all possible two-variable tables from the list of variables

You can use several options with these commands:

all      gives all the tests of association for two-way tables
cell     gives the overall percentage for two-way tables
column   gives column percentages for two-way tables
row      gives row percentages for two-way tables
nofreq   suppresses printing the frequencies
chi2     provides the chi-squared test for two-way tables

There are many other options, including other statistical tests. For more information, type help tabulate.

Some examples of the tabulate commands are:

. tabulate q1a                       produces a table of frequencies by region
. tabulate q1a sexh                  produces a cross-tab of frequencies by region and sex of head
. tabulate q1a hhsize, row           produces a cross-tab by region and hhsize with row percentages
. tabulate sexh hhsize, cell nofreq  produces a cross-tab of overall percentages by sex and hhsize
. tab1 q1a q1b hhsize                produces three tables, a frequency table for each variable
. tab2 q1a poor sexh                 produces three tables, a cross-tab of each pair of variables
Example 5: Using tabulate on categorical variables

. tab q1b

         Wereda |      Freq.     Percent        Cum.
----------------+-----------------------------------
          Atsbi |         84        5.79        5.79
   Sebhassahsie |         66        4.55       10.33
        Ankober |         86        5.92       16.25
Basso na Worana |        175       12.05       28.31
        Enemayi |         61        4.20       32.51
         Bugena |        144        9.92       42.42
           Adaa |         95        6.54       48.97
          Kersa |         95        6.54       55.51
         Dodota |        109        7.51       63.02
     Shashemene |         97        6.68       69.70
          Cheha |         65        4.48       74.17
  Kedida Gamela |         74        5.10       79.27
           Bule |        134        9.23       88.50
         Boloso |         96        6.61       95.11
       Daramalo |         71        4.89      100.00
----------------+-----------------------------------
          Total |      1,452      100.00

. tab q1b sexh

                | Sex of household head
         Wereda |    Female       Male |     Total
----------------+----------------------+----------
          Atsbi |        48         36 |        84
   Sebhassahsie |        29         37 |        66
        Ankober |        13         73 |        86
Basso na Worana |        52        123 |       175
        Enemayi |        11         50 |        61
         Bugena |        55         89 |       144
           Adaa |        23         72 |        95
          Kersa |        31         64 |        95
         Dodota |        26         83 |       109
     Shashemene |        26         71 |        97
          Cheha |        22         43 |        65
  Kedida Gamela |        15         59 |        74
           Bule |        11        123 |       134
         Boloso |        25         71 |        96
       Daramalo |        13         58 |        71
----------------+----------------------+----------
          Total |       400      1,052 |     1,452
- In one-way tables, Stata gives the count, the percentage, and the cumulative percentage (see first example in box).
- In two-way tables, Stata gives the count only, unless you ask for other statistics (see second example in box).
- col, row, and cell request Stata to include percentages in two-way tables.
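The options listed above can be combined in one command. The sketch below assumes ERHScons1999.dta is in the current folder and uses the region (q1a), sex of head (sexh), and poor variables from the describe output.

```stata
* Sketch: two-way tables with percentages and a test of association.
* Assumes ERHScons1999.dta is in the current folder.
use ERHScons1999, clear
tabulate q1a sexh, row nofreq     // row percentages only, counts suppressed
tabulate q1a poor, column chi2    // counts, column percentages, chi-squared test
tab1 q1a sexh poor                // one-way table for each of the three variables
```

Using nofreq together with row keeps the table compact when only the shares matter.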
summarize
The summarize command produces statistics on continuous variables such as ageh, food, cons, and hhsize. The syntax looks like this:
summarize [varlist] [if exp] [in range] [, detail]

By default, it produces the following statistics:

- Number of observations
- Average (or mean)
- Standard deviation
- Minimum
- Maximum

If you specify detail, Stata gives you additional statistics, such as:

- skewness
- kurtosis
- the four smallest values
- the four largest values
- various percentiles

Here are some examples:

. summarize                        gives statistics on all variables
. summarize hhsize food            gives statistics on selected variables
. summarize hhsize cons if q1a==3  gives statistics on two variables for one region
Example 6: Using summarize to study continuous variables

. sum rconspc rconsae hhsize

    Variable |       Obs        Mean    Std. Dev.       Min        Max
-------------+--------------------------------------------------------
     rconspc |      1449    90.36742    81.99623    4.22011   1018.295
     rconsae |      1449    108.7874    97.27053   4.811201   1212.256
      hhsize |      1452    5.782369    2.740968          1         17

. sum rconspc rconsae hhsize if q1a==4

    Variable |       Obs        Mean    Std. Dev.       Min        Max
-------------+--------------------------------------------------------
     rconspc |       395    111.6185    99.09839   8.393298   1018.295
     rconsae |       395    132.6018    116.6133   9.608795   1212.256
      hhsize |       396    6.209596    2.853203          1         16

The first example gives the statistics for the whole sample, while the second gives the statistics only for households in Region 4.
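The detail option is easiest to appreciate by running it. The sketch below assumes ERHScons1999.dta is in the current folder. It also uses the results that summarize leaves behind in r(), such as r(mean) and r(p50); these stored results are standard Stata but are not discussed in the text above.

```stata
* Sketch: summarize with detail, then reuse the stored results.
* Assumes ERHScons1999.dta is in the current folder.
use ERHScons1999, clear
summarize rconspc, detail      // adds percentiles, skewness, kurtosis
display "Median: " r(p50)      // r(p50) is only available after detail
display "Mean:   " r(mean)
display "N:      " r(N)
```

Comparing the mean with the median this way is a quick check on how skewed the consumption distribution is.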
by
This prefix goes before a command and asks Stata to repeat the command for each value of a variable. The general syntax is:

by varlist: command

Note: the bysort prefix sorts the data and applies by in a single step, so it is the form most commonly used.
Some examples of the by prefix are:

. bysort sexh: sum rconspc    for each sex of household head, gives statistics on real per capita consumption
Example 7: Using the by prefix

-> sexh = Female

    Variable |       Obs        Mean    Std. Dev.       Min        Max
-------------+--------------------------------------------------------
     rconspc |       398    100.2183    89.18895   7.068164   624.1437

----------------------------------------------------------------------
-> sexh = Male

    Variable |       Obs        Mean    Std. Dev.       Min        Max
-------------+--------------------------------------------------------
     rconspc |      1051    86.63701    78.82594    4.22011   1018.295
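The by prefix can also be tried without the ERHS files. The lines below are only a sketch, assuming Stata's bundled auto dataset (loaded with sysuse) instead of the survey data; the variables foreign, price and mpg belong to that dataset, not to the ERHS.

```stata
* Sketch only: uses the auto dataset shipped with Stata, not the ERHS data
sysuse auto, clear

* by needs the data sorted by the grouping variable, so sort first ...
sort foreign
by foreign: summarize price

* ... or let bysort sort and run the command in one step
bysort foreign: summarize price mpg
```

The two forms give the same grouped output; bysort simply saves the separate sort command.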
help
The help command gives you information about any Stata command or topic
help [command]
For example,
. help tabulate      gives a description of the tabulate command
. help summarize     gives a description of the summarize command
SECTION 4: STORING COMMANDS AND OUTPUT
In this section, we discuss how to store commands and output for later use. First, we describe how to store commands using a program (Stata calls it a Do-file), how to edit the program, and how to run it. Second, we present different ways of saving and using the output generated by Stata. The following topics are covered:

Using the Do-file Editor
log using
log off
log on
log close
Using the Do-file Editor
The Do-file Editor allows you to store a program (a set of commands) so that you can edit it and execute it later. Why use the Do-file Editor?

It makes it easier to check and fix errors,
It allows you to run the commands later,
It lets you show others how you got your result, and
It allows you to collaborate with others on the analysis.
In general, any time you are running more than 10 commands to get a result, it is easier and safer to use a Do-file to store the commands.
To open the Do-file Editor, you can click on Windows/Do-file Editor or click on the envelope on the Tool Bar.
Within the Do-file Editor, there is a menu bar and tool bar buttons to carry out a variety of editing functions. The menu bar is similar to the one in Microsoft Word:
File/New          to open a new, blank Do-file
File/Open         to open an existing Do-file
File/Save         to save the current Do-file
File/Save as      to save the current Do-file under a new name
File/Insert file  to insert another file into the current one
File/Print        to print the Do-file
File/Close        to close the Do-file
Edit/Undo         to undo the last command
Edit/Cut          to delete or move the marked text in the Do-file
Edit/Copy         to copy the marked text in the Do-file
Edit/Paste        to insert the copied or cut text into the Do-file
Search/Find       to find a word or phrase in the Do-file
Search/Replace    to find and replace a word or phrase in the Do-file
Tools/Do          to execute all the commands or the marked commands in the Do-file
Tools/Run         to execute all the commands or the marked commands in the Do-file without showing any output in the Stata Results window
The tool bar buttons can be used to carry out some of these tasks more quickly. For example, there are buttons for File/New, File/Open, File/Print, Search/Find, Edit/Cut, Edit/Copy, Edit/Paste, Edit/Undo, Do, and Run. Probably the button you will use most is the last one, which shows a page with text on it. This is the Do button for executing the program or the marked part of the program.
Finally, the keyboard commands may be even quicker to use than the buttons. The most useful keyboard commands are:

Control-O    Open file
Control-S    Save file
Control-C    Copy
Control-X    Cut
Control-V    Paste
Control-Z    Undo
Control-F    Find
Control-H    Find and Replace
To run the commands in a Do-file, you can click on the Do button (the last one) or click on Tools/Do. If you want to run one or just a few commands rather than the whole file, mark the commands and click on the Do button. You do not have to mark the whole command, but at least one character in the command must be marked in order for the command to be executed (unlike SPSS, it is not enough to have the cursor on a command).

Although layout is a matter of personal preference, it may be useful to have the Stata Results window and the other windows on one side of the screen and the Do-file Editor window on the other. This makes it easy to switch back and forth. When you arrange the windows the way you like, you can save the layout by clicking Prefs/Save Windowing Preferences. Each time you open Stata, it will use your chosen layout.
Note: If you would like to add a note to a do-file but do not want Stata to execute it, enclose the note between /* and */.

/* This Stata program illustrates how to create a do file */

log using C:\...\eeatraining.log, replace
log close
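A slightly fuller do-file skeleton is sketched below. It is an illustration only: the log file name mysession.log is hypothetical, and the bundled auto dataset stands in for the training data. It also shows the other two comment styles that Stata accepts in do-files.

```stata
/* Sketch of a small do-file. The log file name (mysession.log) and the
   use of the bundled auto dataset are assumptions for illustration. */

capture log close                        // close any log left open by an earlier run
log using mysession.log, replace text    // start a new text log in the working folder

sysuse auto, clear                       // load example data shipped with Stata
* A line starting with an asterisk is also treated as a comment
summarize price mpg

log close                                // stop logging and save the file
```

Starting with capture log close is a common habit: if a previous run stopped with an error while a log was open, the do-file can still be run again without a "log file already open" message.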
Saving the Output
As mentioned in an earlier section, the Stata Results window does not keep all the output you generate. It only stores about 300-600 lines, and when it is full, it begins to delete the old results as you add new results. You can increase the amount of memory allocated to the Stata Results window, but even this will probably not be enough for a long session with Stata. Thus, we need to use log to save the output.
There are four ways to control the log operations.
1. You can use the log button on the tool bar. It looks like a scroll.
2. You can click on File/Log to get four options: Begin (log using), Close, Suspend (log off), and Resume (log on).
3. You can use "log" commands in the Stata Command window.
4. You can use "log" commands in the Stata Do-file Editor.
In this section, we describe the commands, which can be used in the Stata Command window or in a do-file (program).
log using
This command creates a file with a copy of all the commands and output from Stata. The first time you open a log, you must give a name to the new file to be created. The syntax is:
log using filename [, append replace [ text | smcl ] ]
where filename is the name you give the new file. The options are:

append     adds the output to an existing file
replace    replaces an existing file with the output
text       tells Stata to create the log file in text (ASCII) format
smcl       tells Stata to create the log file in SMCL format
Here are some examples:
log using temp22                       saves output to a file called temp22
log using temp20, replace              saves output to an existing file, temp20, replacing its contents
log using regoutput, append            saves output to an existing file, regoutput, adding to its contents
log using "d:\my data\myfile.txt"      saves output in the specified file in the specified folder
Several points should be remembered in using this command:
if you use an existing file name but do not say replace or append, Stata will give an error message that the file already exists
log files in text format can be opened with Wordpad, Notepad, the DOS editor, or any word processor, but the file does not have any formatting
smcl files have formatting (bold, colors, etc.) but can only be opened with Stata
smcl format is the default
log off
This command temporarily turns off the logging of output, so that any subsequent output is not copied to the log file. This is useful if you want to save some of the output but not all. log off only works after a log using command.
log on
This command is used to restart the logging, copying any new output to the log file that was already defined. log on only works after a log using and a log off command.
log close
This command is used to turn off the logging and save the file. How are log off and log close different? log off allows you to turn logging back on easily with log on, continuing to use the same log file. After a log close, however, the only way to start logging again is with log using.
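The difference between suspending and closing a log can be seen in a short session such as the one sketched here. The file name partial.log is hypothetical, and the bundled auto dataset is used only so that the commands produce some output.

```stata
* Sketch: partial.log is a hypothetical file name
capture log close
log using partial.log, replace text

sysuse auto, clear
summarize price          // copied to the log

log off                  // suspend logging
describe                 // shown on screen but not written to the log
log on                   // resume logging in the same file

tabulate foreign         // copied to the log
log close                // stop logging and save the file
```

Opening partial.log afterwards in a text editor should show the summarize and tabulate output but not the describe output.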
set logtype text
This command tells Stata to always save the log files in text (ASCII) format. It is the same as adding the text subcommand to every log using command, but it is easier. If you prefer text format log files, this is the best way to make sure all the log files are in this format.
set logtype smcl
This command tells Stata to always save log files in SMCL format. It is the same as adding the smcl subcommand to every log using command.
Exercise 1: Exploring the ERHS
This section includes some questions that you can answer using the r5ERHS files provided on your computer and the commands described in this section. Remember two tricks to make it easier to fix your mistakes:

You can use PageUp to retrieve the most recent command.
You can click on variables in the Variable window to paste them into the Command window.
Summary file: The file ERHScons1999 contains summary variables calculated from various other data files. It is at the household level. Open the file by entering use C:\training\ERHScons1999.dta, clear in the Command window and pressing Return. Open do and log files to save commands and output. Use the log file to copy and paste some of the output tables into Excel and Word files.

1. How many variables and how many records are in ERHScons1999?
2. What percentage of households have female heads?
3. Is there a statistically significant difference between the percentage of female-headed households among poor and non-poor households?
4. What percentage of Amhara households are considered poor?
5. What percentage of households are in the SNNP region?
6. How does the percentage of female-headed households vary by region?
7. What is the average size of a household?
8. What is the average size of a household in the Oromia region?
9. How does household size vary with poverty status? (use the poor variable)
Household members: The file p1sec1_rv1 contains information about each member of the household. It is at the individual level (each record is a person). You can answer the following questions using this file:

1. What percentage of individuals are female?
2. What percentage of individuals over 45 years old are female?
3. What percentage of individuals under 5 are female?
4. What percentage of women are married?
5. What percentage of women over the age of 18 are married?
6. Does this percentage vary among regions?
7. What is the status of individuals as compared to round 4?
8. What is the reason given for household members who left since round 4?
9. What was the major occupation of household heads?
10. What was the major occupation of household members aged 7 to 15?
Food and cash crops: The file p2s1b_rv1 contains information on production of food and cash crops. The data are at the crop level, meaning that each record represents one crop for one household. Only crops that are grown by each household are included in the file. The crop codes and labels are given in the variable crop. You can answer the following questions with this file.

1. How many households in the sample grow maize and wheat?
2. Among maize growers, what was the average area with maize?
3. Among maize growers, what was the average amount of maize harvested?
4. Among wheat growers, what was the average amount of wheat harvested?
5. Does the average amount of maize harvested vary among regions?
6. Does the average amount of wheat harvested vary among regions?
7. Among farmers with more than 1 hectare of maize, what was the average amount of maize harvested?
8. What is the average amount harvested for the major cereal crops (teff, barley, wheat, maize and sorghum)?
9. Farmers were asked "Was any of the land cultivated under new extension program?" What was the average response?
10. Farmers were also asked "Was any of the land cultivated irrigated?" and the % of the land irrigated. Explore them.
SECTION 5: CREATING NEW VARIABLES
In the previous sections, we described how to explore the data using existing variables. In this section, we discuss how to create new variables. When new variables are created, they are in memory and they will appear in the Data Browser, but they will not be saved on the hard disk unless you use the save command.

In this section, we will cover the following commands and options.
generate
replace
tab , generate
operators
functions
recode
xtile
generate
This command is used to create a new variable. It is similar to compute in SPSS. The syntax is:
generate newvar = exp [if exp]
where exp is an expression like price*quant or 1000*kg. Several points about this command:
Unlike compute in SPSS, generate cannot be used to change the definition of an existing variable. If you want to change an existing variable, you need to use replace.
You can use gen or g as an abbreviation for generate.
If the expression is an equality or inequality, the variable will take the value 0 if the expression is false and 1 if it is true.
If you use if, the new variable will have missing values when the if statement is false.
For example,
generate age2 = age*age                  creates an age squared variable
gen yield = outputkg/area if area>0      creates a new yield variable if area is positive
gen price = value/quant if quant>0       creates a new price variable if quant is positive
gen highprice = (price>1000)             creates a dummy variable equal to 1 for high prices
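These patterns can be tried on any dataset. The sketch below assumes Stata's bundled auto dataset rather than the ERHS files, so the variable names (weight, price, mpg, rep78) are those of auto and the new variable names are made up for the illustration. It also shows one point worth keeping in mind: Stata treats a missing value as larger than any number, so an inequality such as (x >= 4) is true for missing x unless the missing cases are excluded.

```stata
* Sketch with the bundled auto dataset (an assumption; the ERHS
* variables used in the text are not available here)
sysuse auto, clear

generate weight2 = weight*weight        // squared term
gen pricepermpg = price/mpg if mpg>0    // missing wherever the if condition fails

* Dummy from an inequality: 1 if true, 0 if false.
* rep78 has missing values, and missing counts as larger than any
* number, so the if qualifier keeps those cases missing in the dummy.
gen highrep = (rep78 >= 4) if !missing(rep78)
tabulate highrep, missing
```

Without the if !missing(rep78) qualifier, the cars with an unknown repair record would be coded 1.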
Example 8: Using tab, gen to create dummy variables
. tab q1a, gen(region)

     Region |      Freq.     Percent        Cum.
------------+-----------------------------------
     Tigray |        150       10.33       10.33
     Amhara |        466       32.09       42.42
     Oromia |        396       27.27       69.70
          7 |        139        9.57       79.27
          8 |        134        9.23       88.50
          9 |        167       11.50      100.00
------------+-----------------------------------
      Total |      1,452      100.00

. tab region3

q1a==Oromia |      Freq.     Percent        Cum.
------------+-----------------------------------
          0 |      1,056       72.73       72.73
          1 |        396       27.27      100.00
------------+-----------------------------------
      Total |      1,452      100.00
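The same pattern can be reproduced on Stata's bundled auto dataset, which is assumed here only so that the lines run without the ERHS files. The stub rep is an arbitrary name chosen for the illustration.

```stata
* Sketch: one dummy per category of rep78, named rep1, rep2, ...
sysuse auto, clear
tabulate rep78, generate(rep)

* Each dummy is 1 for its own category and 0 for the others;
* observations with rep78 missing are missing in every dummy
tabulate rep1, missing
describe rep*
```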
egen
This is an extended version of generate (extended generate) that creates a new variable by aggregating the existing data. It is a powerful and useful command that does not exist in SPSS. It adds summary statistics to each observation. To do the same thing in SPSS, you would need to create a new file with aggregate and merge it with the original file using match files. The syntax is:
egen newvar = fcn(arguments) [if exp] [in range] , by(var)
where newvar is the new variable to be created; fcn is one of numerous functions such as:
count()     number of non-missing values
diff()      compares variables, 1 if different, 0 otherwise
fill()      fill with a pattern
group()     creates a group id from a list of variables
iqr()       interquartile range
ma()        moving average
max()       maximum value
mean()      mean
median()    median
min()       minimum value
pctile()    percentile
rank()      rank
rmean()     mean across variables
sd()        standard deviation
std()       standardize variables
sum()       sums
argument is normally just a variable
var in the by() subcommand must be a categorical variable

Here are some other examples:
egen avg = mean(yield)                   creates variable of average yield over entire sample
egen avg2 = median(income), by(sex)      creates variable of median income for each sex
egen regprod = sum(prod), by(region)     creates variable of total production for each region
Example 9: Using egen to calculate averages
. egen avecon=mean(cons), by(q1c)
. gen highavecon=(cons>avecon)
. list hhid q1c cons avecon highavecon in 650/675

     +----------------------------------------------------------------+
     |         hhid              q1c       cons     avecon   highav~n |
     |----------------------------------------------------------------|
650. | 407070000039   Sirbana Godeti    673.582   940.6532          0 |
651. | 407070000040   Sirbana Godeti     793.05   940.6532          0 |
652. | 407070000041   Sirbana Godeti    985.257   940.6532          1 |
653. | 407070000042   Sirbana Godeti    844.477   940.6532          0 |
654. | 407070000043   Sirbana Godeti    946.014   940.6532          1 |
     |----------------------------------------------------------------|
655. | 407070000044   Sirbana Godeti   2206.057   940.6532          1 |
656. | 407070000045   Sirbana Godeti   570.0535   940.6532          0 |
657. | 407070000046   Sirbana Godeti   1340.926   940.6532          1 |
658. | 407070000047   Sirbana Godeti    901.222   940.6532          0 |
659. | 407070000048   Sirbana Godeti    887.775   940.6532          0 |
     |----------------------------------------------------------------|
660. | 407070000049   Sirbana Godeti   1026.795   940.6532          1 |
661. | 407070000051   Sirbana Godeti   1392.845   940.6532          1 |
662. | 407070000052   Sirbana Godeti    574.218   940.6532          0 |
663. | 407070000053   Sirbana Godeti     363.63   940.6532          0 |
664. | 407070000054   Sirbana Godeti    926.551   940.6532          0 |
     |----------------------------------------------------------------|
665. | 407070000055   Sirbana Godeti   1256.021   940.6532          1 |
666. | 407070000057   Sirbana Godeti    753.478   940.6532          0 |
667. | 407070000058   Sirbana Godeti   1378.575   940.6532          1 |
668. | 407070000059   Sirbana Godeti   1640.834   940.6532          1 |
669. | 407070000060   Sirbana Godeti    472.841   940.6532          0 |
     |----------------------------------------------------------------|
670. | 407070000062   Sirbana Godeti    721.425   940.6532          0 |
671. | 407070000063   Sirbana Godeti   1341.702   940.6532          1 |
672. | 407070000064   Sirbana Godeti     781.82   940.6532          0 |
673. | 407070000065   Sirbana Godeti   1962.697   940.6532          1 |
674. | 407070000070   Sirbana Godeti    945.045   940.6532          1 |
     |----------------------------------------------------------------|
675. | 407070000071   Sirbana Godeti   1742.247   940.6532          1 |
     +----------------------------------------------------------------+
In Example 9, we want to know which households have expenditure (cons) above the village average. First, we calculate the average expenditure for each village with the egen command. Then we create a dummy variable based on the expression (cons > avecon). The list output shows how the village average is repeated for every household in the village and confirms that the dummy variable is correctly calculated.
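The two-step pattern of Example 9 (a group average with egen, then a dummy with generate) is sketched below on Stata's bundled auto dataset, which is an assumption made so the lines run without the ERHS files. The new variable names avgprice and aboveavg are hypothetical.

```stata
* Sketch: group mean repeated on every observation, then a dummy
sysuse auto, clear

* Average price within each category of foreign (domestic / foreign cars)
egen avgprice = mean(price), by(foreign)

* 1 if the car costs more than the average of its own group
gen aboveavg = (price > avgprice) if !missing(price)

list make foreign price avgprice aboveavg in 1/5
```

As in Example 9, the list output is a quick way to check that the group mean is repeated within each group and that the dummy follows from it.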
operators
This is not a Stata command, but a topic related to creating new variables. Most of the operators are obvious, but some are not. Unlike SPSS, you cannot use words like or, and, eq, or gt.

Arithmetic
+     addition
-     subtraction
*     multiplication
/     division
^     power

Relational
>     greater than
<     less than
>=    more than or equal
gen DDfemale = 0
replace DDfemale = 1 if q1b==9 & sexh==0
or an easier way to do this would be:
gen DDfemale = (q1b==9 & sexh==0)
Or suppose you wanted to create a dummy variable for households in the two regions (Amhara and Oromia). This variable can be created with:
gen amaoro = 0
replace amaoro = 1 if q1a==3 | q1a==4
or by one command:
gen amaoro = (q1a==3 | q1a==4)
You can also combine conditions using parentheses. Suppose you wanted a dummy variable that indicates if a household is a poor farmer in either the Tigray or the Amhara region. We will define poor as in the bottom 20 percent and use the variable poor.

gen PDF = ((q1a==1 | q1a==3) & poor==1)
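The logical operators used above (& for "and", | for "or", == for equality) can be practised on Stata's bundled auto dataset. This is only a sketch; the dataset and the new variable names are assumptions for the illustration.

```stata
* Sketch: combining conditions with &, | and parentheses
sysuse auto, clear

* "and": foreign cars that cost less than 5000
gen cheapforeign = (foreign==1 & price<5000)

* "or": repair record equal to 1 or to 5
gen extremerep = (rep78==1 | rep78==5)

* Parentheses set the order of evaluation: (A or B) and C
gen combo = ((rep78==1 | rep78==5) & foreign==0)

tabulate extremerep combo
```

Without the inner parentheses, & would be evaluated before |, which changes the meaning of the condition.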
Note: Here is a list of some of the more commonly-used additional functions used to create new variables in Stata. Other functions can be found by typing help functions in the Stata Command window.

abs(x)          computes the absolute value of x
exp(x)          calculates e to the x power
ln(x)           computes the natural logarithm of x
log(x)          is a synonym for ln(x), the natural logarithm
log10(x)        computes the log base 10 of x
sqrt(x)         computes the square root of x
invnorm(p)      provides the inverse cumulative normal; invnorm(norm(z)) = z
normden(z)      provides the standard normal density
normden(z,s)    provides the normal density. normden(z,s) = normden(z)/s if s>0 and s not missing, otherwise, the result is missing
norm(z)         provides the cumulative standard normal
group(x)        creates a categorical variable that divides the data into x as nearly equal-sized subsamples as possible, numbering the first group 1, the second group 2, etc. It uses the current order of the data.
int(x)          gives the integer obtained by truncating x
round(x,y)      gives x rounded into units of y
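A few of these functions are sketched below inside generate, again assuming the bundled auto dataset in place of the ERHS data. The new variable names are made up for the illustration.

```stata
* Sketch: common functions used with generate
sysuse auto, clear

gen lnprice    = ln(price)           // natural logarithm
gen sqrtweight = sqrt(weight)        // square root
gen price1000  = round(price, 1000)  // rounded into units of 1000
gen halfmpg    = int(mpg/2)          // truncated, not rounded
gen mpgdev     = abs(mpg - 20)       // absolute deviation from 20

list price lnprice price1000 mpg halfmpg mpgdev in 1/5
```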
recode
This command changes the values of a categorical variable according to the rules specified. It is like the recode command in SPSS except that in Stata you do not necessarily use parentheses. The syntax is:

recode varname old=new old=new ... [if exp] [in range]

Here are some examples:
recode x 1=2            changes all values of x=1 to x=2
recode x 1=2 3=4        changes 1 to 2 and 3 to 4
recode x 1=2 2=1        exchanges the values 1 and 2 in x
recode x 1=2 *=3        changes 1 in x to 2 and all other values to 3
recode x 1/5=2          changes 1 through 5 in x to 2
recode x 1 3 4 5 = 6    changes 1, 3, 4 and 5 to 6
recode x .=9            changes missing to 9
recode x 9=.            changes 9 to missing
Notice that you can use some special symbols in the rules:
*      means all other values
.      means missing values
x/y    means all values from x to y
x y    means x and y
For example, to recode region values 8 and 9 to 7:
Example 10: Using recode to define a new variable
. tab q1a

     Region |      Freq.     Percent        Cum.
------------+-----------------------------------
     Tigray |        150       10.33       10.33
     Amhara |        466       32.09       42.42
     Oromia |        396       27.27       69.70
          7 |        139        9.57       79.27
          8 |        134        9.23       88.50
          9 |        167       11.50      100.00
------------+-----------------------------------
      Total |      1,452      100.00

. recode q1a 8 9=7
(q1a: 301 changes made)

. tab q1a

     Region |      Freq.     Percent        Cum.
------------+-----------------------------------
     Tigray |        150       10.33       10.33
     Amhara |        466       32.09       42.42
     Oromia |        396       27.27       69.70
          7 |        440       30.30      100.00
------------+-----------------------------------
      Total |      1,452      100.00
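Note that recode, used as in Example 10, overwrites the original variable. When the original values should be kept, recode's generate() option can write the result to a new variable instead. The lines below are a sketch on the bundled auto dataset (an assumption, as is the new variable name rep3).

```stata
* Sketch: recode into a new variable, leaving the original untouched
sysuse auto, clear

* Collapse 1-2 into 2 and 4-5 into 4; 3 and missing are left as they are
recode rep78 (1 2 = 2) (4 5 = 4), generate(rep3)

* Cross-tabulate old against new to check the rules
tabulate rep78 rep3, missing
```

Cross-tabulating the old variable against the new one is a simple check that every rule did what was intended.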
xtile
This command creates a new variable that indicates which category a record falls into, when the sample is sorted by an existing variable and divided into n groups of equal size. It is probably easier to explain with examples. xtile can be used to create a variable that indicates which income quintile a household belongs to, which decile in terms of farm size, or which tercile in terms of coffee production. The syntax is:
xtile newvar = variable [if exp] [in range] , nq(#)
where newvar is the new categorical variable created; variable is the existing variable used to create the quantile (e.g. income, farm size); # is the number of different categories (e.g. 5 for quintiles, 3 for terciles)
For example,
xtile incquint = income, nq(5)
xtile farmdec = farmsize, nq(10)
Suppose we want to create a variable indicating the deciles of expenditure per capita.
Example 11: Using xtile to generate deciles (using the ERHS99cons data)
. xtile rconseadec = rconsae, nq(10)

. tab rconseadec

         10 |
  quantiles |
 of rconsae |      Freq.     Percent        Cum.
------------+-----------------------------------
          1 |        145       10.01       10.01
          2 |        145       10.01       20.01
          3 |        145       10.01       30.02
          4 |        145       10.01       40.03
          5 |        145       10.01       50.03
          6 |        145       10.01       60.04
          7 |        145       10.01       70.05
          8 |        145       10.01       80.06
          9 |        145       10.01       90.06
         10 |        144        9.94      100.00
------------+-----------------------------------
      Total |      1,449      100.00

. tab rconseadec sexh, col nofre

        10 |
 quantiles | Sex of household head
of rconsae |    Female       Male |     Total
-----------+----------------------+----------
         1 |      7.79      10.85 |     10.01
         2 |     10.30       9.90 |     10.01
         3 |      8.04      10.75 |     10.01
         4 |     10.30       9.90 |     10.01
         5 |      8.79      10.47 |     10.01
         6 |     10.30       9.90 |     10.01
         7 |     10.55       9.80 |     10.01
         8 |     10.05       9.99 |     10.01
         9 |     10.05       9.99 |     10.01
        10 |     13.82       8.47 |      9.94
-----------+----------------------+----------
     Total |    100.00     100.00 |    100.00
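The same steps can be tried on Stata's bundled auto dataset, assumed here in place of the ERHS file. The variable name pricequart is hypothetical.

```stata
* Sketch: quartiles of price, then their distribution by car origin
sysuse auto, clear

xtile pricequart = price, nq(4)

tabulate pricequart
tabulate pricequart foreign, col nofreq
```

With ties or a sample size that is not a multiple of the number of groups, the groups will be only approximately equal in size, as the last decile in Example 11 shows.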
Exercise 2
1. Use the file ERHScons1999. Create a variable called reg4 which indicates whether a household is in Oromia or in another region. Then do a frequency table of the new variable.
2. Using the same file, create a variable called hhquint that indicates the quintile of household size. Then do a frequency table on the new variable.
3. Using the same file, create a dummy variable called enbug that is equal to 1 if the household is in the Enemayi or Bugena weredas and 0 otherwise. Then do a frequency table on the new variable.
4. Create a new variable avgexp which is equal to the wereda average of food expenditure (food). Then calculate a new variable equal to the difference between the household food expenditure and the wereda average expenditure.
5. Using the same file, create a new variable splot which is 1 if the person is cultivating a single plot and 0 otherwise.
6. Use file p1sec1_rv1. Create a set of dummy variables called relatxx based on the relationship of the person to the household head. For example, relat01 is a dummy for being the head, relat02 is a dummy for being the spouse, relat03 for a child, and so on.
SECTION 6: MODIFYING VARIABLES
In this section, we introduce some more powerful and flexible commands for generating results from survey data. We begin with an explanation of how to label data in Stata. Then we see how to format variables. These are the topics and commands covered in this section:

rename variable
label variable
label define
label values
format variable
rename variables
This command is used to give an existing variable a new name. The command is
. rename old_variable new_variable
For instance, generate regional dummy variables and then rename them:
Example 12: Renaming variables
. tab q1a, gen(index)

     Region |      Freq.     Percent        Cum.
------------+-----------------------------------
     Tigray |        150       10.33       10.33
     Amhara |        466       32.09       42.42
     Oromia |        396       27.27       69.70
       SNNP |        440       30.30      100.00
------------+-----------------------------------
      Total |      1,452      100.00

. tab index1

q1a==Tigray |      Freq.     Percent        Cum.
------------+-----------------------------------
          0 |      1,302       89.67       89.67
          1 |        150       10.33      100.00
------------+-----------------------------------
      Total |      1,452      100.00

. tab index2

q1a==Amhara |      Freq.     Percent        Cum.
------------+-----------------------------------
          0 |        986       67.91       67.91
          1 |        466       32.09      100.00
------------+-----------------------------------
      Total |      1,452      100.00

rename index1 Tigray     rename index1 variable to Tigray
rename index2 Amhara     rename index2 variable to Amhara
rename index3 Oromia     rename index3 variable to Oromia
rename index4 SNNP       rename index4 variable to SNNP
label variable
This command is used to attach labels to variables in order to make the output easier to understand. For example, we know that Tigray is region 1 and SNNP is region 7. So we may want to label the variables as follows:

label variable Tigray "Region 1"
label variable Amhara "Region 3"
label variable Oromia "Region 4"
label variable SNNP "Region 7"
You can use the abbreviation label var
If there are spaces in the label, you must use double quotation marks.
If there are no spaces, quotation marks are optional.
This command is like variable label in SPSS except that you can only label one variable per command and Stata uses double quotation marks, not single
The limit is 80 characters for a label, but any labels over 30 characters will probably not look good in a table.
label define
This command gives a name to a set of value labels. For example, instead of numbering the regions, we can assign a label to each region. Instead of numbering the different sources of income, we can give them labels. The syntax is:
label define lblname # "label" # "label" # label [, add modify]
where
lblname    is the name given to the set of value labels
#          are the value numbers
label      are the value labels
add        means that you want to add these value labels to the existing set
modify     means that you want to change these values in the existing set
Note that:
You can use the abbreviation label def
The double quotation marks are only necessary if there are spaces in the labels
Stata will not let you define an existing label unless you say modify or add
This command is similar to value label in SPSS except that in Stata you give the labels a name and later attach it to the variable, while in SPSS you attach it to the variable in the same command.
label values
This command attaches a named set of value labels to a categorical variable. The syntax is:
label values varname [lblname] [, nofix]
where varname is the categorical variable which will get the labels, and lblname is a set of labels that have already been defined by label define
Here are some examples of labeling values in Stata.
label define reg 1 "Tigray" 3 "Amhara" 4 "Oromia" 7 "SNNP", modify
label values q1a reg
Some additional commands that may be useful in labeling
label dir           to request a list of existing label names
label list          to request a list of all the existing value labels
label drop          to delete one or more labels
label save using    to save label definitions as a Do-file
label data          to give a label to a data file
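The whole labeling sequence (label the variable, define a set of value labels, attach the set, then inspect it) is sketched below. The bundled auto dataset, the variable name expensive, the cut-off of 6000 and the label name yesno are all assumptions chosen for the illustration.

```stata
* Sketch: labeling a new dummy variable from start to finish
sysuse auto, clear
gen byte expensive = (price > 6000) if !missing(price)

label variable expensive "Price above 6000"   // variable label
label define yesno 0 "No" 1 "Yes"             // named set of value labels
label values expensive yesno                  // attach the set to the variable

label list yesno                              // show the definitions
tabulate expensive                            // output now shows No / Yes
```

Because the set yesno is defined separately from the variable, the same set can be attached to any other 0/1 variable with one more label values command.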
format
The format command allows you to specify the display format for variables. The internal precision of the variables is unaffected.

The syntax for the format command is
. format varlist %fmt
where %fmt is listed below:
%fmt        description                               example
----------------------------------------------------------------
Right-justified formats
%#.#g       general numeric format                    %9.0g
%#.#f       fixed numeric format                      %9.2f
%#.#e       exponential numeric format                %10.7e
%d          default numeric elapsed date format       %d
%d...       user-specified elapsed date format        %dM/D/Y
%#s         string format                             %15s

Right-justified, comma formats
%#.#gc      general numeric format                    %9.0gc
%#.#fc      fixed numeric format                      %9.2fc

Leading-zero formats
%0#.#f      fixed numeric format                      %09.2f
%0#s        string format                             %015s

Left-justified formats
%-#.#g      general numeric format                    %-9.0g
%-#.#f      fixed numeric format                      %-9.2f
%-#.#e      exponential numeric format                %-10.7e
%-d         default numeric elapsed date format       %-d
%-d...      user-specified elapsed date format        %-dM/D/Y
%-#s        string format                             %-15s

Left-justified, comma formats
%-#.#gc     general numeric format                    %-9.0gc
%-#.#fc     fixed numeric format                      %-9.2fc

Centered formats
%~#s        string format (special)                   %~15s
----------------------------------------------------------------
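A short sketch of the format command in use follows, assuming the bundled auto dataset rather than the ERHS data. Only the way values are displayed changes; the stored values stay the same.

```stata
* Sketch: changing display formats only
sysuse auto, clear

format price %9.2fc     // two decimals, with commas as thousands separators
format make %-18s       // left-justify the string variable

list make price in 1/5
describe make price     // the display format column shows the new formats
```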
Exercise 3
1. Using the variables created in Exercise 2, label the new variables and their values.
2. Label the data file "This data is used for training".
3. List the existing label names.
SECTION 7: ADVANCED DESCRIPTIVE STATISTICS
In Section 3, we looked at preliminary descriptive statistics, mostly applied to explore the nature of the data. In this section we explore more advanced statistics.
tabulate summarize
This command creates one- and two-way tables that summarize continuous variables. The command tabulate by itself gives frequencies and percentages in each cell (cross-tabulations). With the summarize option, we can put means and other statistics of a continuous variable in each cell. The syntax is:
tabulate varname1 varname2 [if exp] [in range], summarize(varname3) options
where
varname1 is a categorical row variable
varname2 is a categorical column variable (optional)
varname3 is the continuous variable summarized in each cell
options can be used to tell Stata which statistics you want

Some notes regarding this command:
The default statistics are the mean, the standard deviation, and the frequency.
You can specify which statistics with the options means, standard and freq
You can use the abbreviation tab, sum( )
Some examples:
tab q1a, sum(cons)          gives the mean, std deviation, and frequency of per capita expenditure for each region
tab q1b, sum(cons) mean     gives the mean consumption for each village
tab q1a sexh, sum(food)     gives the mean, std deviation, and frequency in each cell of hh head sex per region
The first table is a one-way table (just one categorical variable) showing the mean, standard deviation, and frequency of per capita expenditure for each region.

In the second table, we use the mean option so only mean per capita expenditure is shown. In the third table, we add a second categorical variable (sexh), making it a two-way table.

Although we could have requested all the default statistics in the two-way table, doing so makes the table difficult to read, so we do not advise it.
Example 13: Using tabulate, summarize() to generate tables

. tab q1a, sum(cons)

            |  Summary of consumption per month
     Region |        Mean   Std. Dev.       Freq.
------------+------------------------------------
     Tigray |   413.93552     297.701         149
     Amhara |   545.91653   467.28072         465
     Oromia |   697.09029   478.55749         395
       SNNP |    331.7384   221.15601         440
------------+------------------------------------
      Total |   508.51838    420.4014        1449
. tab q1b, sum(cons) mean

            | Summary of consumption per month
     Wereda |        Mean
------------+------------
      Atsbi |   417.16834
  Sebhassah |      409.87
    Ankober |   301.87563
   Basso na |   777.31823
    Enemayi |     234.392
     Bugena |   542.38657
       Adaa |   940.65322
      Kersa |   567.89355
     Dodota |   526.58473
  Shashemen |   775.34926
      Cheha |   342.54209
  Kedida Ga |   239.09955
       Bule |   379.28676
     Boloso |   266.93705
   Daramalo |   416.28045
------------+------------
      Total |   508.51838
. tab q1a sexh, sum(cons)

Means, Standard Deviations and Frequencies of consumption per month

           | Sex of household head
    Region |    Female       Male |     Total
-----------+----------------------+----------
    Tigray | 342.44136   488.3678 | 413.93552
           | 277.62091  301.46008 |   297.701
           |        76         73 |       149
-----------+----------------------+----------
    Amhara | 450.61424  582.89951 | 545.91653
           | 368.60452  495.93838 | 467.28072
           |       130        335 |       465
-----------+----------------------+----------
    Oromia | 610.49528  728.85178 | 697.09029
           | 518.32024  459.98768 | 478.55749
           |       106        289 |       395
-----------+----------------------+----------
      SNNP | 271.02927  346.48695 |  331.7384
           | 171.91652  229.33158 | 221.15601
           |        86        354 |       440
-----------+----------------------+----------
     Total |  433.7347  536.83799 | 508.51838
           | 389.69001  428.24021 |  420.4014
           |       398       1051 |      1449
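
The three tables above can be reproduced in miniature on a dataset that ships with Stata. This is only a sketch: the auto dataset and its variables (price, foreign, rep78) are stand-ins chosen for illustration and are not part of the ERHS files used in this manual.

```stata
* Illustrative only: uses Stata's bundled auto dataset, not ERHScons1999.dta
sysuse auto, clear

* One-way table: mean, standard deviation, and frequency of price by origin
tabulate foreign, summarize(price)

* Same table, showing means only
tabulate foreign, summarize(price) means

* Two-way table: one cell per combination of origin and repair record,
* restricted to means so that the table stays readable
tabulate foreign rep78, summarize(price) means
```

Restricting the two-way table to means follows the advice given above about keeping two-way tables readable.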
tabstat
This command gives summary statistics for a set of continuous variables for each value of a categorical variable. The syntax is:

tabstat varlist [if exp] [in range], stat(statname [...]) by(varname)

where
varlist is a list of continuous variables
statname is a type of statistic
varname is a categorical variable
Some facts about this command:
- The default statistic is the mean.
- Optional statistics include mean, sum, max, min, range, sd (standard deviation), var (variance), skewness, kurtosis, median, and pn (nth percentile).
- Without the by() option, tabstat is like summarize except that it allows you to specify the list of statistics to be displayed.
- With the by() option, tabstat is like "tabulate, summarize()" except that tabstat is more flexible in the statistics and format.
- It is very similar to the SPSS command means.
Examples
tabstat food hhsize, stats(mean max min)    gives mean, max, and min of food & hhsize
tabstat food hhsize, by(q1a)                gives mean of two variables for each region
tabstat food, stats(median) by(q1a)         gives the median food consumption for each region

The tabstat command displays summary statistics for a series of numeric variables in a single table.
Example 14: Using tabstat to create a table

. tabstat rconsae, s(mean p50 sd cv min max) by(rconseadec) missing

Summary for variables: rconsae
     by categories of: rconseadec (10 quantiles of rconsae)

rconseadec |      mean       p50        sd        cv       min       max
-----------+------------------------------------------------------------
         1 |  21.80935   21.9194  5.773654   .264733  4.811201  30.40175
         2 |  36.24088  36.03099  3.400392  .0938275   30.6191  42.70621
         3 |  48.52454  48.31921   3.09388  .0637591  42.74319  53.91997
         4 |  60.38483   60.0903  3.811244  .0631159  54.00354  66.85229
         5 |  73.09496  72.92955   3.61339  .0494342  66.90016  79.38206
         6 |   89.3758  89.33151  5.708862  .0638748  79.39233  99.11871
         7 |   110.407  110.2909  6.692319   .060615  99.12563  122.8186
         8 |  137.7846  137.5525  9.298181  .0674835  123.5698  154.9666
         9 |  179.5007  176.1209  17.33479  .0965723  155.0732  214.4674
        10 |  332.2927  285.4411  135.2309  .4069633  214.4888  1212.256
         . |         .         .         .         .         .         .
-----------+------------------------------------------------------------
     Total |  108.7874  79.38206  97.27053  .8941343  4.811201  1212.256
------------------------------------------------------------------------
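
The manual does not show how the decile variable rconseadec was built. Its label, "10 quantiles of rconsae", suggests the xtile command, so the sketch below shows one plausible way to create such a grouping variable and then summarize by it. The auto dataset and the name pricedec are assumptions made for illustration only.

```stata
* Illustrative only: bundled auto dataset; pricedec is a hypothetical name
sysuse auto, clear

* Assign each observation to one of 10 quantile groups of price
xtile pricedec = price, nquantiles(10)
label variable pricedec "10 quantiles of price"

* Summary statistics of price within each group, plus an overall Total row
tabstat price, statistics(mean p50 sd cv min max) by(pricedec)
```

With the ERHS data, the same two steps would use rconsae in place of price.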
table
This command creates a wide variety of tables. It is probably the most flexible and useful of all the table commands in Stata. The syntax is:

table rowvar colvar [if exp] [in range], c(clist) [row col]

where
rowvar is the categorical row variable
colvar is the categorical column variable
clist is a list of statistics and variables
row is an option to include a summary row
col is an option to include a summary column
Some useful facts about this command:
- The default statistic is the frequency.
- Optional statistics are mean, sd, sum, rawsum (unweighted), count, max, min, median, and pn (nth percentile).
- The c( ) option is short for contents( ), the contents of each cell.
- Like tab, it can be used to create one- and two-way frequency tables, but table cannot do percentages.
- Like tabsum, it can be used to calculate basic stats for each value of a categorical variable. Its advantage over tabsum is that it can do more statistics and it can take more than one continuous variable.
- Like tabstat, it can be used to calculate advanced stats for each value of a categorical variable. Its advantage over tabstat is that it can do two-way (and higher) tables, but its disadvantage is that it has fewer statistics.
- It is similar to table in SPSS, but easier to learn and less flexible in formatting.
Here are some examples:
table q1a, row                                   table of frequencies by region with total row
table q1a, c(mean income)                        table of average income by region
table q1a, c(mean yield sd yield median yield)   table of yield statistics by region
table q1a, c(mean yield) format(%9.2f)           table of average yields by region with format
table q1a sexh, c(mean yield)                    table of average yield by region and sex
table q1a sexh, c(mean income mean yield)        table of avg yield & income by region & sex
Some output from table commands is shown in Example 15.
The table command calculates and displays tables of statistics, including frequency, mean, standard deviation, sum, and 1st to 99th percentile. The row and col options specify an additional row and column to be added to the table, reflecting the total across rows and columns.
Example 15: Tabulate median real per capita consumption by region vs sex of household head

. table q1a sexh, contents(p50 rconsae) row col missing

          |    Sex of household head
   Region |   Female      Male     Total
----------+-----------------------------
   Tigray | 73.05909  74.20448  73.56232
   Amhara | 124.9734  95.00103  104.7363
   Oromia | 98.59296  99.43469  98.75433
     SNNP | 53.73735  50.34177  51.14911
          |
    Total | 90.04483  77.18623  79.38206
. table rconseadec, c(mean rconsae)

10        |
quantiles |
of        |
rconsae   | mean(rconsae)
----------+--------------
        1 |      21.80935
        2 |      36.24088
        3 |      48.52454
        4 |      60.38483
        5 |      73.09496
        6 |       89.3758
        7 |       110.407
        8 |      137.7846
        9 |      179.5007
       10 |      332.2927
Exercise 4
1. Use ERHScons1999 and tabulate basic summary statistics showing the mean, standard deviation and frequency of per capita food consumption for each village. Interpret the result.
2. Repeat the same procedure as in question 1 but report only the median of food consumption.
3. Tabulate basic summary statistics for food consumption by sex of household head and region (use a single table).
4. Tabulate the mean, 25th percentile, median, 75th percentile, sd, cv, min and max of real food consumption per capita by deciles of real consumption per capita.
5. Tabulate median real food consumption per capita by sex of household head and deciles of real consumption per capita (use a single table).
SECTION 8: PRESENTING DATA WITH GRAPH (GRAPHING DATA)
This section provides a brief introduction to creating graphs. In Stata, all graphs are made with the graph command, but there are 8 types of charts and numerous subcommands for controlling the type and format of graph. In this section, we focus on four types of graph and a few options.

The commands that draw graphs are
graph twoway    scatterplots, line plots, etc.
graph matrix    scatterplot matrices
graph bar       bar charts
graph dot       dot charts
graph box       box-and-whisker plots
graph pie       pie charts

Graph commands can also be used to produce histograms, box plots, kernel density plots (kdensity), P-P plots, and Q-Q plots, but we postpone these until normality is introduced later. Let us first acquaint ourselves with some twoway graph commands.
A two-way scatterplot can be drawn with the (graph) twoway scatter command to show the relationship between two variables, here cons (total consumption) and food (food consumption). As we would expect, there is a positive relationship between the two variables.
. graph twoway scatter cons food
[Figure: scatterplot of consumption per month (y axis) against food cons per month (x axis)]
We can show the regression line predicting cons from food using the twoway lfit plot type.
. twoway lfit cons food
[Figure: fitted values (y axis) against food cons per month (x axis)]
A scatterplot and a fitted line can be overlaid in a single graph, as in this example using household size:
. twoway (scatter cons hhsize) (lfit cons hhsize)
[Figure: consumption per month and fitted values (y axis) against household size (x axis)]
Exercise 5:
Draw a two-way scatterplot with a fitted line for consumption per capita vs household size and explain its pattern.
SECTION 9: NORMALITY AND OUTLIER
Check for Normality
An outlier is an observation that lies at an abnormal distance from other values in a random sample from a population. We must be extremely mindful of possible outliers and their adverse effects during any attempt to measure the relationship between two continuous variables.
There are no official rules to identify outliers. In a sense, this definition leaves it up to the analyst (or a consensus process) to decide what will be considered abnormal. Sometimes it is obvious that an outlier is simply miscoded (for example, age reported as 230) and hence should be set to missing. But most of the time the case is less clear-cut.

Before abnormal observations can be singled out, it is necessary to characterize normal observations.
Skewness is a measure of symmetry, or more precisely, the lack of symmetry. A distribution, or data set, is symmetric if it looks the same to the left and right of the center point. The skewness for a normal distribution is zero and any symmetric data should have a skewness near zero. Negative values for the skewness indicate data that are skewed left and positive values for the skewness indicate data that are skewed right. By skewed left, we mean that the left tail is long relative to the right tail. Similarly, skewed right means that the right tail is long relative to the left tail.
Kurtosis is a measure of whether the data are peaked or flat relative to a normal distribution. That is, data sets with high kurtosis tend to have a distinct peak near the mean, decline rather rapidly, and have heavy tails. Data sets with low kurtosis tend to have a flat top near the mean rather than a sharp peak. A uniform distribution would be the extreme case. The standard normal distribution has a kurtosis of three, which is the scale Stata reports. Kurtosis above three (positive excess kurtosis) indicates a "peaked" distribution and kurtosis below three (negative excess kurtosis) indicates a "flat" distribution. A value of 6 or larger on the true kurtosis indicates a large departure from normality.
We can obtain skewness and kurtosis values by using the detail option of the summarize command. Clearly, the variable rconspc (real consumption per capita) is skewed to the right (skewness 3.21) and has a peaked distribution (kurtosis 21.70). Both statistics indicate that the distribution of rconspc is far from normal.
. sum rconspc

    Variable |       Obs        Mean    Std. Dev.       Min        Max
-------------+--------------------------------------------------------
     rconspc |      1449    90.36742    81.99623    4.22011   1018.295
. sum rconspc, detail

           real consumption per capita 1994 prices
-------------------------------------------------------------
      Percentiles      Smallest
 1%     11.65814        4.22011
 5%     18.67906       6.865227
10%     25.10425       7.068164       Obs                1449
25%     39.94022       8.201794       Sum of Wgt.        1449

50%     65.99258                      Mean           90.36742
                        Largest       Std. Dev.      81.99623
75%     114.2533       577.1937
90%     180.8909       624.1437       Variance       6723.382
95%     236.1537       660.1689       Skewness       3.212314
99%     405.8775       1018.295       Kurtosis       21.69683
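
The skewness and kurtosis shown by summarize, detail are also left behind as stored results, which is convenient when they are needed in later commands. The sketch below is illustrative only: it uses the bundled auto dataset rather than the ERHS data, and adds sktest (Stata's skewness and kurtosis test for normality) as one possible formal complement to the descriptive statistics.

```stata
* Illustrative only: bundled auto dataset instead of ERHScons1999.dta
sysuse auto, clear

* summarize, detail leaves skewness and kurtosis in r()
quietly summarize price, detail
display "skewness = " %6.3f r(skewness) "   kurtosis = " %6.3f r(kurtosis)

* Skewness and kurtosis test for normality
sktest price
```

On the scale Stata reports, a normal variable would show skewness near 0 and kurtosis near 3.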
Besides commands for descriptive statistics, such as summarize, we can also check the normality of a variable visually by looking at some basic graphs in Stata, including histograms, boxplots, kdensity, pnorm, and qnorm. Let's keep using rconspc from the ERHScons1999.dta file to make some graphs.

The histogram command is an effective graphical technique for showing both the skewness and kurtosis of rconspc.

. histogram rconspc
[Figure: histogram of real consumption per capita 1994 prices (x axis); y axis: Density]
The normal option can be used to get a normal overlay. This shows the skew to the right in rconspc.
. histogram rconspc, normal
[Figure: histogram of real consumption per capita 1994 prices with normal overlay; y axis: Density]
We can use the bin() option, which specifies how many bins the data are aggregated into, to increase the number of bins to 100. This better illustrates the distribution of rconspc. Notice that the histogram is roughly bell-shaped around its peak, but it is truncated at 0 and has a long right tail.
. histogram rconspc, normal bin(100)
[Figure: histogram of real consumption per capita 1994 prices with 100 bins and normal overlay; y axis: Density]
graph box draws vertical box plots. In a vertical box plot, the y axis is numerical, and the x axis is categorical. The lower and upper bounds of the box are the 25th and 75th percentiles of rconspc, and the line within the box is the median. The whiskers extend to the lower and upper adjacent values, that is, the most extreme observations lying within 1.5 times the interquartile range of the box; observations beyond the whiskers are plotted as individual points. The graph box command can therefore help us examine the distribution of rconspc. If rconspc is normal, the median would be in the center of the box and the ends of the whiskers would be equidistant from the box.
The boxplot for rconspc shows positive skew. The median is pulled to the low end of the box, and the upper whisker and the points beyond it are stretched out away from the box, for both male and female household heads. In fact the skew seems worse for male household heads.
. graph box rconspc, by(sexh)
[Figure: box plots of real consumption per capita 1994 prices, graphs by sex of household head (Female, Male)]
The kdensity command with the normal option displays a kernel density estimate of a variable with a normal density superimposed on the graph. It is particularly useful for verifying that regression residuals are normally distributed, which is a very important assumption for regression; here we apply it to rconspc itself. The plot shows that rconspc is skewed to the right and more sharply peaked than the normal density.
. kdensity rconspc, normal
[Figure: kernel density estimate of real consumption per capita 1994 prices with normal density; y axis: Density]
Graphical alternatives to the kdensity command are the P-P plot and the Q-Q plot.

The pnorm command produces a P-P plot, which graphs a standardized normal probability plot. It should be approximately linear if the variable follows a normal distribution. The straighter the line formed by the P-P plot, the more closely the variable's distribution conforms to the normal distribution.
. pnorm rconspc
[Figure: P-P plot; y axis: Normal F[(rconspc-m)/s]; x axis: Empirical P[i] = i/(N+1)]
The qnorm command plots the quantiles of a variable against the quantiles of a normal distribution. The closer the points of the Q-Q plot lie to the 45 degree line, the more nearly normal the variable's distribution is.
. qnorm rconspc
[Figure: Q-Q plot; y axis: real consumption per capita 1994 prices; x axis: Inverse Normal]
Both the P-P and Q-Q plots show that rconspc is not normal, with a long tail to the right. The qnorm plot is more sensitive to deviations from normality in the tails of the distribution, whereas the pnorm plot is more sensitive to deviations near the middle of the distribution.

From the statistics and graphs we can confidently conclude that there are outliers, especially at the upper end of the distribution.
Dealing with outliers
There are generally three ways to deal with outliers. The easiest is to delete them from analyses. The second one is to use measures that are not sensitive to them, such as median instead of mean, or transform the data to be more normal. The most complicated one is to replace them by imputation.
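
As a rough illustration of the second approach, the sketch below compares the skewness of a right-skewed variable before and after a natural-log transformation, and reports the median alongside the mean. The auto dataset and the names lnprice, skew_raw, and skew_log are assumptions made for illustration; the log is only one of several possible transformations and is not prescribed by this manual.

```stata
* Illustrative only: bundled auto dataset; new names are hypothetical
sysuse auto, clear

* Skewness of the raw variable
quietly summarize price, detail
scalar skew_raw = r(skewness)

* Natural-log transform; ln() returns missing for values <= 0
generate lnprice = ln(price)
quietly summarize lnprice, detail
scalar skew_log = r(skewness)

display "skewness, raw: " %6.3f skew_raw "   logged: " %6.3f skew_log

* The median is far less affected by a few extreme values than the mean
tabstat price, statistics(mean median)
```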
Since our data is heavily right-tailed, we will focus on very large outliers. A customary criterion is to flag an observation as an outlier if it lies more than three standard deviations from the median. Note that we use the median because it is a robust statistic: if there are big outliers, the mean will shift a lot but the median will not.
Example 16: Using robust statistics to replace outliers
/* Calculate number of standard deviations from median by sex of hh head */

. use "C:\..\training\ERHScons1999.dta", clear

. egen median=median(rconspc), by(sexh)

. egen sd=sd(rconspc), by(sexh)

. gen ratio=(rconspc-median)/sd
* (3 missing values generated)
. gen outlier=1 if ratio>3 & ratio~=.
* (1414 missing values generated)
. replace outlier=0 if outlier==. & ratio~=.
* (1411 real changes made)
. tabulate outlier, missing

    outlier |      Freq.     Percent        Cum.
------------+-----------------------------------
          0 |      1,411       97.18       97.18
          1 |         38        2.62       99.79
          . |          3        0.21      100.00
------------+-----------------------------------
      Total |      1,452      100.00
Only 38 observations are identified as outliers. When we compare the mean and median values using the table command, the mean drops by around 5% and 14% among female- and male-headed households, respectively, once outliers are excluded, while the medians are less sensitive to outliers.
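
The flagging steps in Example 16 can also be written more compactly. The sketch below is one possible variant and not the manual's own code: it runs on the bundled auto dataset, and the names med_price, sd_price, ratio, and outlier are illustrative. The if !missing(ratio) qualifier matters because Stata treats a missing value as larger than any number, so ratio > 3 would otherwise be true for missing ratios.

```stata
* Illustrative only: bundled auto dataset; variable names are hypothetical
sysuse auto, clear

* Group-specific median and standard deviation
egen med_price = median(price), by(foreign)
egen sd_price  = sd(price), by(foreign)

* Distance from the group median in standard deviation units
generate ratio = (price - med_price) / sd_price

* 1 if more than 3 SDs above the median, 0 otherwise, missing if ratio is missing
generate byte outlier = (ratio > 3) if !missing(ratio)

tabulate outlier, missing

* Mean, median, and count of price by outlier status
tabstat price, statistics(mean median n) by(outlier)
```

Creating the indicator in one generate statement avoids the separate replace step, at the cost of being slightly less explicit about each case.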
Example 17: Comparing mean and median values to replace outliers

. table sexh outlier, contents(mean rconspc) row col missing

----------------------------------------
Sex of    |
household |           outlier
head      |        0         1     Total
----------+-----------------------------