Chapter 2 - Describing Data
1. Summary Statistics - Proc Means
(a) Var
(b) Title
(c) Class
(d) By
(e) Output
2. More Statistics and Plots - Univariate
3. Proc Sort
This covers sections 2.A-H. You should also read section 19I.
Creating a SAS data set: Example
/* Population, population density, births and deaths for
Western European countries, 1995 */
DATA EUROPE_W; /* this creates a SAS Data Set called EUROPE_W */
/* Source: Organisation for Economic Co-op. and Devel. Labour
Force Stat., 1976-1996, Paris, 1997 Ed.*/
/* :$12. reads country names longer than the default 8 characters */
INPUT COUNTRY :$12. POP DENSITY BRATE DRATE;
/* POP = population in 1000's, DENSITY = residents/km^2,
BRATE, DRATE = birth, death rates per 1000 */
DATALINES;
Austria 8047 95.9 . .
Belgium 10137 332.4 . .
Denmark 5228 121.3 13.4 12.0
Finland 5108 15.1 12.3 9.6
France 58143 105.9 12.5 9.1
Germany 81661 228.8 9.4 10.8
Greece 10454 79.2 9.7 9.6
Iceland 267 2.61 7.2 6.0
Ireland 3598 51.2 . .
Italy 57283 190.2 . .
Luxembourg 413 158.8 13.2 9.3
Netherlands 15459 378.91 2.3 8.8
Norway 4348 13.4 13.8 10.3
Portugal 9918 107.3 10.8 10.5
Spain 39210 77.7 9.2 8.7
Sweden 8847 19.7 11.6 10.6
Switzerland 7062 171.0 11.6 8.9
UK 58606 239.4 12.5 11.0
;
Questions of interest:
1. How many missing birth rates are in our sample?
2. What is the mean population density?
3. How variable is population density from country to country?
4. What is the distribution of population? Of population density?
Another SAS Data Set: Infile and Input
• The file snails.txt contains data from an experiment
in which groups of 20 snails were held for periods of
1, 2, 3 or 4 weeks in carefully controlled conditions of
temperature and relative humidity.
• There were two species of snail: A and B.
• At the end of the exposure time the snails were tested
to see if they had survived; the process itself is fatal for
the animals.
• Using the INFILE and INPUT statements, the data can be
read into a SAS data set called SNAILS.
Species Time Humidity Temperature Fatalities N
A 1 60.0 10 0 20
A 1 60.0 15 0 20
...........................................
B 4 75.8 20 7 20
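A minimal sketch of such a DATA step follows. It assumes that snails.txt sits in the current SAS working directory and that its first line is the header row shown above (hence FIRSTOBS=2; drop that option if the file has no header line). The variable names TEMP and FATALITY are the ones used later in these notes.

```sas
/* Sketch: read snails.txt into the SAS data set SNAILS.          */
/* Assumption: line 1 of the file is a header row, so start at 2. */
DATA SNAILS;
    INFILE 'snails.txt' FIRSTOBS=2;
    INPUT SPECIES $ TIME HUMIDITY TEMP FATALITY N;
RUN;

/* Print the first few observations to confirm the read worked. */
PROC PRINT DATA=SNAILS (OBS=5);
RUN;
```

Checking the first few observations with PROC PRINT is a cheap way to catch a misaligned INPUT statement before computing any statistics.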
Questions of interest:
1. What are the mean and standard deviation of the number
of fatalities of species B for each level of exposure
(TIME)?
2. What is the distribution of the number of fatalities?
3. What is an approximate 95% confidence interval for the
mean number of fatalities?
4. How many times did 0 fatalities occur?
Proc Means
• Syntax:
PROC MEANS DATA = SASdata options;
(optional statements)
• Explanation:
– the DATA option specifies a SAS data set. If this
option is not used, SAS looks to the most recently
created or used SAS data set.
– Examples:
PROC MEANS DATA = EUROPE_W;
PROC MEANS DATA = SNAILS;
PROC MEANS;
Optional Statements for Proc Means
• To request specific statistics, list keywords such as N,
NMISS, MEAN, STD, STDERR, CLM, MIN, MAX, SUM, VAR, CV,
SKEWNESS, KURTOSIS and T in the PROC MEANS statement. The
MAXDEC=n option sets the number of decimal places printed.
• An additional option is NOPRINT, which suppresses printing
of output in the Output window.
PROC MEANS DATA = EUROPE_W NMISS MEAN STD
VAR MAXDEC=4;
gives the number of missing observations for each numeric
variable in the SAS data set EUROPE_W, as well as the mean,
standard deviation and variance. The MAXDEC option
restricts the number of decimal places to 4.
• A number of optional statements can be used, including
the TITLE, VAR, CLASS, BY and OUTPUT statements.
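As an illustration, the first three questions of interest for the Western Europe data could be approached roughly as follows. This is only a sketch: the choice of CV as a measure of relative variability and of MAXDEC=2 is ours, not part of the original example.

```sas
/* Sketch: counts, missing counts, mean and spread for two variables. */
/* NMISS answers "how many missing birth rates"; MEAN, STD and CV     */
/* describe the level and variability of population density.         */
PROC MEANS DATA=EUROPE_W N NMISS MEAN STD CV MAXDEC=2;
    VAR DENSITY BRATE;
RUN;
```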
Subcommand statements for Proc Means
• The TITLE statement is useful for preparing reports.
• The VAR statement specifies which variables the summary
statistics should be computed for.
Example:
PROC MEANS DATA = EUROPE_W NMISS MEAN STD VAR;
TITLE 'Demographic Statistics for Western Europe';
VAR DENSITY BRATE DRATE;
Subgrouping with the Class Statement
• The CLASS statement is used when we require computation
of the various summary statistics for different subgroups
or classes. For example, to estimate the mean number of
fatalities for each of the two species of snail, we use
SPECIES as a class variable:
• Example:
DATA SNAILS;
INFILE 'snails.txt';
INPUT SPECIES $ TIME HUMIDITY TEMP FATALITY N;
PROC MEANS DATA=SNAILS MEAN;
TITLE 'Mean Fatalities For Each Species of Snail';
VAR FATALITY;
CLASS SPECIES;
RUN;
QUIT;
Subgrouping with Class
• After execution, the Output window contains the two
averages:
Mean Fatalities For Each Species of Snail
A 0.708333
B 4.020833
• We are actually interested in the mean number of fatalities
for each type of snail at each level of exposure
(TIME). Thus, TIME is a second classification variable,
nested within the first classification variable SPECIES.
• We can obtain all of the required averages, as well as
95% confidence limits for the true mean in each case,
by employing the following:
PROC MEANS DATA=SNAILS MEAN CLM;
TITLE 'Mean Fatalities For Each Species of Snail';
VAR FATALITY;
CLASS SPECIES TIME;
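If only species B is of interest, as in the first question about the snail data, one possible approach is to restrict the observations with a WHERE statement and classify by TIME alone. WHERE is a general SAS statement that these notes do not otherwise cover, and the title text and MAXDEC=3 below are illustrative choices.

```sas
/* Sketch: mean, standard deviation and 95% confidence limits of */
/* fatalities for species B only, at each exposure time.         */
PROC MEANS DATA=SNAILS N MEAN STD CLM MAXDEC=3;
    WHERE SPECIES = 'B';   /* keep only species B observations */
    CLASS TIME;
    VAR FATALITY;
    TITLE 'Fatalities for Species B at Each Exposure Time';
RUN;
```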
Subgrouping with the By Statement
• The BY statement is almost interchangeable with the
CLASS statement. However, it will only work when the
data set is sorted according to the BY variable. The
CLASS statement does not have this restriction.
• Example:
PROC MEANS DATA=SNAILS MEAN CLM;
TITLE 'Mean Fatalities For Each Species of Snail';
VAR FATALITY;
BY SPECIES TIME;
• This works because the data set is already sorted by
SPECIES and, within each value of SPECIES, by TIME.
The CLASS statement uses more memory than BY, but BY
will tend to be slower than CLASS when the data must be
sorted first, since sorting is a slow operation. These
differences are only noticeable for large data sets.
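When the data set is not already in the required order, a PROC SORT step (described at the end of this chapter) can be run first. The following is a sketch; SNAILS_SORTED is a hypothetical data set name, and the OUT= option is used so that SNAILS itself is left unchanged.

```sas
/* Sketch: sort by the BY variables, then summarise each BY group. */
PROC SORT DATA=SNAILS OUT=SNAILS_SORTED;
    BY SPECIES TIME;
RUN;

PROC MEANS DATA=SNAILS_SORTED MEAN CLM;
    VAR FATALITY;
    BY SPECIES TIME;   /* must match the sort order above */
RUN;
```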
Using Output from Proc Means
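• The OUTPUT statement is used to create a new SAS data
set consisting of the summary statistics computed by
PROC MEANS.
• Example 1: The following creates a new SAS data set
called SNAILSUM containing the 3 variables M_FATAL,
S_FATAL, and V_FATAL, along with SPECIES and the automatic
variables _TYPE_ and _FREQ_. It holds one observation for
each species plus one overall observation (_TYPE_ = 0);
adding the NWAY option to the PROC MEANS statement keeps
only the 2 per-species observations.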
PROC MEANS DATA=SNAILS MEAN STD VAR NOPRINT;
VAR FATALITY;
CLASS SPECIES;
OUTPUT OUT=SNAILSUM
MEAN=M_FATAL
STD =S_FATAL
VAR =V_FATAL;
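To see exactly what the OUTPUT statement stores, it can help to print the new data set. The sketch below is a variation on Example 1 that adds the NWAY option, so that only the per-species observations are kept; without NWAY an extra overall observation with _TYPE_ = 0 also appears.

```sas
/* Sketch: save per-species summaries and inspect them. */
PROC MEANS DATA=SNAILS NWAY NOPRINT;
    CLASS SPECIES;
    VAR FATALITY;
    OUTPUT OUT=SNAILSUM
           MEAN=M_FATAL
           STD =S_FATAL
           VAR =V_FATAL;
RUN;

/* Shows SPECIES, _TYPE_, _FREQ_ and the three new variables. */
PROC PRINT DATA=SNAILSUM;
RUN;
```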
Output: Another Example
• The following creates a SAS data set consisting of a single
observation on the two variables M_BRATE and M_DRATE
(plus the automatic variables _TYPE_ and _FREQ_).
For each statistic listed in the OUTPUT statement, the new
names are matched in order to the variables in the VAR
statement, so supply one name per VAR variable.
PROC MEANS DATA=EUROPE_W MEAN;
VAR BRATE DRATE;
OUTPUT OUT=EUROPSUM
MEAN=M_BRATE M_DRATE;
• These new SAS data sets can later be used by SAS
procedures, if desired.
Proc Means: Example
• Here we plot a histogram of the averages of the numbers
of fatalities. Note that we have used the NOPRINT
option here to suppress output to the Output window.
/* NWAY keeps only the SPECIES-by-TIME cell means in SNAILSUM */
PROC MEANS DATA=SNAILS MEAN NWAY NOPRINT;
TITLE 'Mean Fatalities For Each Species of Snail';
VAR FATALITY;
CLASS SPECIES TIME;
OUTPUT OUT = SNAILSUM
MEAN = M_FATAL;
PROC CHART DATA=SNAILSUM;
VBAR M_FATAL;
RUN;
PROC UNIVARIATE
• Syntax:
PROC UNIVARIATE DATA = SASdata options;
statements;
• Many of the options are the same as for PROC MEANS.
Some additional ones are available: see page 27 of the
textbook. The default output is quite extensive and
includes the median and quartiles, the extreme percentiles,
and the lowest and highest 5 observations. These
last are useful for ensuring that the data have been read
in sensibly.
• The NORMAL option adds formal tests of normality, and the
PLOT option gives a crude normal QQ plot:
– an informal, yet useful, check of normality.
– it is a plot of the ordered observations versus the
expected values of ordered normal observations.
– If the plot is close to a straight line, then the data
are approximately normally distributed. Otherwise,
the data are likely non-normal.
Normal QQ Plot: Example
• This checks whether the distribution of Western European
population densities is approximately normal.
PROC UNIVARIATE DATA=EUROPE_W NORMAL PLOT;
VAR DENSITY;
• To train your eye to recognize typical departures from
normality, simulation of normal and non-normal data
sets having various sample sizes is helpful:
DATA _NULL_;
FILE 'normal.dat';
N = 20;
DO I=1 TO N;
X = RANNOR(0);
PUT X;
END;
RUN; QUIT;
Normal QQ Plotting
• Now, construct the normal QQ plot:
DATA NORTEST;
INFILE 'normal.dat';
INPUT X;
PROC UNIVARIATE NORMAL PLOT;
VAR X;
RUN; QUIT;
• Repeating this for a number of different simulation runs
will give you a good notion as to what the normal QQ
plot should look like.
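Rather than rerunning the two programs by hand, several simulated samples can be generated in one DATA step and plotted by group. This is a sketch under our own assumptions: the data set name NORSIM, the index variable SAMPLE and the choice of 4 samples of size 20 are illustrative, and the data are kept in a SAS data set instead of being written to normal.dat.

```sas
/* Sketch: four independent normal samples of size 20 in one data set. */
DATA NORSIM;
    DO SAMPLE = 1 TO 4;
        DO I = 1 TO 20;
            X = RANNOR(0);   /* standard normal draw, clock-based seed */
            OUTPUT;          /* write one observation per draw        */
        END;
    END;
    DROP I;
RUN;

/* The data are created in SAMPLE order, so BY SAMPLE needs no sort. */
PROC UNIVARIATE DATA=NORSIM NORMAL PLOT;
    VAR X;
    BY SAMPLE;
RUN;
```

Because the seed 0 is taken from the system clock, each run produces different samples, which is what is wanted when building intuition for the plot.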
Normal QQ Plotting of Non-Normal Data
• To see what a normal QQ plot shouldn’t look like, try
something like the following:
DATA _NULL_;
FILE 'normal.dat';
N = 20;
DO I=1 TO N;
U = UNIFORM(0);
IF U < .8 THEN X = RANNOR(0);
ELSE X = 5*RANNOR(0);
PUT X;
END;
RUN; QUIT;
or
Normal QQ Plots of Non-Normal Data
DATA _NULL_;
FILE 'normal.dat';
N = 20;
DO I=1 TO N;
X = RANEXP(0);
PUT X;
END;
RUN; QUIT;
• In each case, create the normal QQ plot to see what
happens when the data is really not normally distributed.
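To see how sample size affects the picture, the non-normal case can also be simulated at several sizes in one step. The sketch below uses exponential data; the data set name EXPSIM and the sample sizes 10, 50 and 200 are illustrative assumptions.

```sas
/* Sketch: exponential samples of three different sizes. */
DATA EXPSIM;
    DO N = 10, 50, 200;
        DO I = 1 TO N;
            X = RANEXP(0);   /* exponential draw with mean 1 */
            OUTPUT;
        END;
    END;
    DROP I;
RUN;

/* Observations are generated in increasing N, so BY N needs no sort. */
PROC UNIVARIATE DATA=EXPSIM NORMAL PLOT;
    VAR X;
    BY N;
RUN;
```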
The Plot options and Proc Means
• Crude stem-and-leaf and boxplots can be produced us-
ing the PLOT option.
• Most of the statements that can be used with PROC MEANS
can be used with PROC UNIVARIATE. In older SAS releases
the exception is the CLASS statement: there you must make
sure the data are sorted properly and use the BY statement
instead. (More recent releases of PROC UNIVARIATE also
accept a CLASS statement.)
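A sketch of the BY approach for the snail data follows, assuming the SNAILS data set created earlier. The PROC SORT step is included so that the example works whatever the current order of the data.

```sas
/* Sketch: stem-and-leaf, box and normal probability plots per species. */
PROC SORT DATA=SNAILS;
    BY SPECIES;
RUN;

PROC UNIVARIATE DATA=SNAILS PLOT;
    VAR FATALITY;
    BY SPECIES;
RUN;
```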
PROC SORT
• Syntax
PROC SORT DATA=SASdata;
BY var1 var2 ... ;
Example 1:
PROC SORT DATA = EUROPE_W;
BY DENSITY;
The SAS data set then becomes
Country POP DENSITY BRATE DRATE
Iceland 267 2.61 7.2 6.0
Norway 4348 13.40 13.8 10.3
Finland 5108 15.10 12.3 9.6
................................
Netherlands 15459 378.91 2.3 8.8
The following sorts the data set so that DENSITY appears
in reverse order.
PROC SORT DATA = EUROPE_W;
BY DESCENDING DENSITY;
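Note that PROC SORT as written above replaces EUROPE_W with its sorted version. If the original order should be kept, the OUT= option writes the sorted copy to a new data set instead. In this sketch the name EUROPE_BY_DENSITY is a hypothetical choice.

```sas
/* Sketch: sort into a new data set and list the three densest countries. */
PROC SORT DATA=EUROPE_W OUT=EUROPE_BY_DENSITY;
    BY DESCENDING DENSITY;
RUN;

PROC PRINT DATA=EUROPE_BY_DENSITY (OBS=3);
    VAR COUNTRY DENSITY;
RUN;
```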