HRP 222 Topic 3 – Showing Data Copyright © 1999-2001 Leland Stanford Junior University. All...

HRP 222 Topic 3 – Showing Data Copyright © 1999-2001 Leland Stanford Junior University. All rights reserved. Warning: This presentation is protected by copyright law and international treaties. Unauthorized reproduction of this presentation, or any portion of it, may result in severe civil and criminal penalties and will be prosecuted to maximum extent possible under the law.

HRP 222

Topic 3 – Showing Data

Copyright © 1999-2001 Leland Stanford Junior University. All rights reserved. Warning: This presentation is protected by copyright law and international treaties. Unauthorized reproduction of this presentation, or any portion of it, may result in severe civil and criminal penalties and will be prosecuted to maximum extent possible under the law.

From Last Time: Oops - libname

Last time I had the library name and v6 statement transposed. This is correct:

libname ingridv6 v6 'c:\projects\ingrid\dis\old';

From Last Time: New Data

When you get new data do the following:
1. Scan the files for viruses
2. Make the file read only
3. Verify the number of records with the sender
4. Verify the first and last records
5. Verify the content
   Missing values
   Permitted values
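
Steps 3 to 5 could be sketched in SAS roughly like this. The data set name work.newData and the variable sex are placeholders, not names from the course data.

```sas
/* Step 3: the observation count is listed in the proc contents output */
proc contents data=work.newData;
run;

/* Step 4: print the first five records */
proc print data=work.newData (obs=5);
run;

/* Step 4: write the last record to the log using the end= flag */
data _null_;
  set work.newData end=lastRecord;
  if lastRecord then put _all_;
run;

/* Step 5: look for missing and unexpected values */
proc freq data=work.newData;
  tables sex / missing;
run;
```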

From Last Time: The PDV

The program data vector (PDV) is the storage for all the variables that SAS is working on. The contents of the PDV are used to create new data sets. Variables and their values get into the PDV if they appear:
  in a source "set" in a data step
  in an "input" statement
  on the left side of an equal sign
  in a retain statement
  as an automatic variable
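
A small sketch that lets you watch the PDV fill up. The variable names (id, weight, total, heavy) are made up for illustration; put _all_ writes every variable in the PDV to the log on each pass through the data step.

```sas
data _null_;
  input id weight;          /* id and weight enter the PDV through input */
  retain total 0;           /* total enters through retain */
  heavy = (weight > 100);   /* heavy enters on the left side of an equal sign */
  total = total + weight;
  put _all_;                /* also shows the automatic variables _N_ and _ERROR_ */
datalines;
1 90
2 120
;
run;
```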

Examples of Retain

Here is an example of the use of retain which counts the cases of gdm.

data blah;
  set grace.analysis;
  retain dx_gdm 0;
  if gdm=1 then dx_gdm=dx_gdm+1;
  /* the same thing as
     if gdm then dx_gdm+1;
  */
run;

The 0 in the retain statement is an optional initial value. You should always give one.

Complex Retains

Combining the first. and last. variables with retain statements gives you real power. This code counts the total diagnoses for a woman.

data totaldx (keep=id dx_total);
  set fakebaby.analysis;
  by fake_id;
  retain dx_total 0;
  if first.fake_id then dx_total = 0;
  /* "of" is needed for a variable list; without it gdm--thyroid is read as gdm minus negative thyroid */
  dx_total = dx_total + sum(of gdm--thyroid);
  if last.fake_id then output;
run;

Security

Assume that somebody is always looking over your shoulder on the web and people are reading your email.

Put a firewall between you and the web.

That said, the biggest threats to computer security are the legal users of the system.
  Walking away from a terminal
  Using passwords that are easy to crack by script kiddies
  Taking data off of restricted machines

Viruses and Trojan horses will kill you if you let them!

Security Issues (2)

The left red arrow points to Norton Antivirus. Right click on it to open it up.

Before you send me your homework, update your definitions and scan the files of interest.

Security Issues (2b)

The newest Norton AntiVirus has a lousy interface.

Click this to find the file you want to scan.

Update your definitions by clicking the live update button.

Security Issues (2c)

Click on the files you want the scanner to check.

Security (3)

Securing your email:

There are programs which will scramble your email while it is en route, effectively making it impossible for people to read it without your permission.

The best way to encrypt data is by using PGP encryption.

If you use a PC or Mac, visit the upper site for the latest version information.
http://cws.internet.com/encrypt.html
http://web.mit.edu/network/pgp.html

Security (4)

You can secure the connection between machines by using encrypted transmissions.
  PGP
  SSH
  SSL

Virtual Private Networks (VPNs) are all the rage.

Machines can recognize each other:
  Kerberos – make a .klogin file on your unix account
  SSH

More on Finding Problems

I showed you how to identify problems and write them to the log. This is an important task but documenting problems with reports that look good is an equally important task.

Checking Variables 2: Proc Print

Use proc print to print stuff to the output (not the log) window.

proc print data=newData;
  var id sex;
  where sex not in ('M', 'F');
run;

The if statement in a data step is replaced with a where statement in a procedure.

Dressing up output

You can add up to ten lines of titles (title1 through title10) and ten lines of footnotes (footnote1 through footnote10) to each page of output.

title1 "People who have bad sex";
proc print data=newData noobs;
  var id sex;
  where sex not in ('M', 'F');
run;

The noobs option tells it you do not want the observation number printed.

Dressing up output

title1;
proc print data=newData noobs label;
  label sex = "Gender";
  var id sex;
  where sex not in ('M', 'F');
run;

You can tell the procedure you want to use labels instead of variable names (the label option) and provide the labels with a label statement like this.
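
Putting titles, footnotes and labels together might look something like this. The title and footnote wording is invented for illustration; newData, id and sex are the names used on the slides.

```sas
title1 "Data checks";
title2 "Records with an unexpected sex code";
footnote1 "Source: newData";

proc print data=newData noobs label;
  label sex = "Gender";
  var id sex;
  where sex not in ('M', 'F');
run;

title;      /* a bare title statement clears all titles */
footnote;   /* a bare footnote statement clears all footnotes */
```

Titles and footnotes stay in effect until you change or clear them, which is why the sketch ends by clearing both.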

ODS

The Output Delivery System allows you to control what you print and how it looks. Use it to make your output web-ready and pretty.

ods html file="blah-body.htm"
         contents="blah-contents.htm"
         frame="blah-frame.htm"
         page="blah-page.htm"
         path="c:\projects\blah\LS\" (url=none)
         gpath="c:\projects\blah\LS\" (url=none);
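
A minimal sketch of a complete ODS session, assuming you only want the body file. Whatever runs between opening and closing the destination lands in the HTML file. The proc print step is just a stand-in for your own procedure.

```sas
ods html file="blah-body.htm"
         path="c:\projects\blah\LS\" (url=none);

proc print data=newData noobs;
  var id sex;
run;

ods html close;   /* closing the destination finishes writing the file */
```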

A Look at Data

If a variable is categorical (i.e., nominal or ordinal) you would take your first look at it with proc freq. You would look at it graphically with proc gchart.

If a variable is continuous (i.e., interval or ratio measure) you can take your first look at it with proc means or proc univariate. You can visualize it with proc gplot, proc gchart, proc univariate or proc boxplot.

Categorical Data

You can represent categorical data as strings of letters or numbers.

The choice is up to you but most programmers use numbers. Never use free form text for categories.

Plotting Frequencies

I prefer to see my data in chart format.

SAS/Graph is like dental surgery. Your results may be beautiful but getting them can be excruciating.
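
A bare-bones frequency bar chart might look like this, assuming SAS/GRAPH is installed. It borrows the gen6sas.at data set and race variable from the proc freq slides; treat it as a starting sketch rather than a finished chart.

```sas
proc gchart data=gen6sas.at;
  vbar race;   /* if race is stored as a number, use: vbar race / discrete; */
run;
quit;
```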

Plotting Frequencies (2)

Counting observations

If you want to get a tabular count of all the different values stored in a variable, use proc freq (pronounced “freak”) with this very simple syntax.

proc freq data=gen6sas.at;
  tables race;
run;

proc freq data=gen6sas.at;
  where center = 'stan';
  tables race;
run;

Counting observations (2)

Counting the missing

You can tell SAS to include the missing records in the body of the table like this:

proc freq data=gen6sas.at;
  tables race / missing;
run;

Counting Observations (3)

Lots of Tables

Cody and Smith mention that double dash notation can be used to get all tables between two variables.

tables gender -- cities;

You can also specify just the numeric or just the character variables in that range like this:

tables gender-numeric-cities;
tables gender-character-cities;

Counting Observations (4)

Warning!

Proc freq only examines the first 16 positions of a character variable. These two strings are identical to proc freq:
  Do not put beans or raisins in your nose
  Do not put beans

Capitalization and spacing are both meaningful to proc freq. These are different:
  Spam & Eggs
  Spam&Eggs
  spam & Eggs
  spam & eggs

Dealing With Strings

Try not to use strings for your categorical variables but if you have to….

SAS has functions that will convert your variables to all upper or lower case and sack the spaces.
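
As a sketch of what those functions do: upcase() converts to upper case (lowcase() goes the other way) and compress() with no second argument removes blanks. The data set and variable names here are invented; the four spellings come from the earlier warning slide and all collapse to the single value SPAM&EGGS.

```sas
data cleaned;
  input food $ 1-20;                    /* column input keeps embedded blanks */
  food_key = compress(upcase(food));    /* upper case, then drop the blanks */
datalines;
Spam & Eggs
Spam&Eggs
spam & Eggs
spam & eggs
;
run;

proc freq data=cleaned;
  tables food_key;   /* one category instead of four */
run;
```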

Dealing With Strings(2)

Dealing With Strings(3)

The right way to deal with strings is to not use them at all!

Code your variables numerically and translate them with a format.
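
A minimal sketch of numeric coding plus a format. The format name sexFmt, the codes 1 and 2, and the data are assumptions made up for illustration.

```sas
proc format;
  value sexFmt
    1 = "Male"
    2 = "Female"
    . = "Missing";
run;

data coded;
  input id sex @@;
  format sex sexFmt.;    /* stored as numbers, displayed as text */
datalines;
1 1  2 2  3 2  4 .
;
run;

proc freq data=coded;
  tables sex / missing;
run;
```

The stored values stay numeric, so there is no capitalization or spacing to go wrong, and the table still prints readable labels.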

Dealing With Strings (4)

Dealing With Strings (5)

Continuous Variables

You can now describe a categorical variable numerically or graphically. Continuous variables are generally easier to work with.

Proc means by default will give you the N, mean, SD, min and max for one or more variables.

Proc Means (1)

Easy Examples

proc means data = x;
  var age_st yob;
run;

proc means data = x;
  var age_st yob;
  where age_st not in (0, 9999) and yob not in (0, 8888, 9999);
run;

Proc Means (2)

Easy Examples

If your data is sorted then you can do statistics for subgroups of your data by using the keyword by.

proc sort data=x;
  by sex;
run;

proc means data = x nonobs mean maxdec=0;
  by sex;
  var age_st yob;
  where age_st not in (0,9999) and yob not in (0,8888,9999);
run;

Proc Means (3)

Easy Examples

A couple of procedures, including proc means, will allow you to use a class statement instead of sorting and using by. If you have the RAM try it because it is faster.

proc means data = x nonobs mean maxdec=0;
  by sex;
  var age_st yob;
  where age_st not in (0,9999) and yob not in (0,8888,9999);
run;

proc means data = x nonobs mean maxdec=0;
  class sex;
  var age_st yob;
  where age_st not in (0,9999) and yob not in (0,8888,9999);
run;

The nonobs option tells it not to print the number of observations (N Obs) in each class level.

Proc Means (4)

A Complex Example

You can make procedures, including proc means, create new data sets:

proc means data = x nonobs mean std maxdec=0 noprint;
  by sex;
  where age_st not in (0,9999) and yob not in (0,8888,9999);
  var age_st yob;
  output out = work.themeans
    mean = age_m yob_m
    std  = age_s yob_s;
run;

Many other procedures produce datasets which can be used for further work.

Line these up! The names after mean= and std= are matched, in order, to the variables on the var statement.

Proc Means (4)

A Complex Example - 2

The output data set includes the statistics you requested plus two automatic variables. The _freq_ value tells you how many observations are in each group. The _type_ value comes into play when you invoke means with a class statement: it identifies which combination of class variables a row summarizes, so you can use it to pick out the means for the whole group (_type_ = 0) or within the class levels.
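
To see the two automatic variables for yourself, something like the following should do. It reuses the data set x and the variables sex and age_st from the earlier slides; the output data set name work.bysex is made up.

```sas
proc means data=x noprint;
  class sex;
  var age_st;
  output out=work.bysex mean=age_m;
run;

/* _type_ = 0 is the overall row; _type_ = 1 rows are the levels of sex */
proc print data=work.bysex noobs;
  var _type_ _freq_ sex age_m;
run;
```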

Proc Univariate

Proc univariate generates a sea of information on your numeric variables. It is syntactically easy.

Like proc means, it can output into a new data set and you can use it for further analysis (high resolution plots).
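
A sketch of the output statement in proc univariate, using the junk.babyweight data set and fetal_wgt_ variable from the histogram slides. The output data set and the new variable names are assumptions.

```sas
proc univariate data=junk.babyweight noprint;
  var fetal_wgt_;
  output out=work.wgtstats
    mean=wgt_mean median=wgt_median q1=wgt_q1 q3=wgt_q3;
run;

proc print data=work.wgtstats noobs;
run;
```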

Proc Univariate (2)

I like to do this:

proc univariate data=junk.babyweight noprint;
  var fetal_wgt_;
  histogram;
run;

The noprint option suppresses all the statistical output.

Proc Univariate (3)

Actually, I do something like this….

proc univariate data=junk.babyweight noprint;
  var fetal_wgt_;
  histogram / midpoints = 1350 to 4300 by 100;
run;

                                          Proc Means                Proc Univariate
Statistic                                 Default  Option           Default  Option   Statement
Number of nonmissing observations         Y        N                Y
Number of missing observations                     NMISS            Y
Total number of observations                                        Y
Mean                                      Y        MEAN             Y
Median                                             MEDIAN           Y
Mode                                                                Y
Sum                                                SUM
Standard deviation                        Y        STD              Y
Variance                                           VAR              Y
Minimum                                   Y        MIN              Y
Maximum                                   Y        MAX              Y
Range                                              RANGE            Y
Uncorrected sum of squares                         USS              Y
Corrected sum of squares                           CSS              Y
Coefficient of variation                           CV               Y
Skewness                                           SKEW             Y
Kurtosis                                           KURT             Y
Student's t                                        T                Y
Probability of non-0 t                             PRT              Y
Quartiles                                          Q1 Q3            Y
Interquartile range                                QRANGE           Y
Percentiles                                        P1 P5 P10 P25…   Y
Signed rank test                                                    Y
Kolmogorov statistic                                                Y
Shapiro-Wilk statistic                                              Y
Test for H0: normal distribution                                             NORMAL
Box plots (Low Resolution)                                                   PLOTS
Stem-and-leaf plots (Low Resolution)                                         PLOTS
Normal probability plot (Low Resolution)                                     PLOTS
Histogram (High Resolution)                                                           HISTOGRAM
Probability plot (High Resolution)                                                    PROBPLOT
QQ plot (High Resolution)                                                             QQPLOT

Based on DiIorio page 89.

Formats

Formats are typically used to indicate that a numeric value corresponds to a text value.

You can also use formats to deal effectively with missing or invalid values.

Using Formats and Nulls

proc format;
  value badAge
    .U = "Unknown"
    .N = "Not Applicable";
run;

data blah;
  input ageAtCancer @@;
  format ageAtCancer badAge.;
datalines;
34 35 .U .N 36
;
run;

Using Formats and Nulls (2)

When you do statistics on the variables that include the null values the null values are removed.

proc means data = blah maxdec = 0;
  var ageAtCancer;
run;

Dates

You know how to import numbers and character data. I have alluded to the fact that dates in SAS are difficult to work with because dates are stored as number of days since Jan 01, 1960. Importing requires an informat and viewing a date requires a date format.
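
A quick way to see the stored number next to its formatted versions. This is only a sketch: the variable name d is arbitrary and the results go to the log.

```sas
data _null_;
  d = "24jun1967"d;     /* a date literal */
  put d=;               /* the raw value: days since 01jan1960 */
  put d= date9.;        /* the same value shown with a date format */
  put d= mmddyy10.;
run;
```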

Dates (2)

Importing a Date

To import a date you need to tell SAS how the date is structured:

data form;
  input id dob : mmddyy10.;
datalines;
1 06/24/1967
2 01/18/1967
;
run;

This is optional

Dates (3)

Importing a Date

Dates are stored as the number of days since Jan 01, 1960. If you need to specify a lot of dates you can use an informat statement:

data form;
  informat dob dom mmddyy10.;
  input id dob dom @@;
datalines;
1 06/24/1967 06/10/1990 2 01/18/1967 06/10/1990
;
run;

Dates (4)

Displaying a Date

To see the date correctly, specify a format in the importing datastep or later:

data form;
  informat dob dom mmddyy10.;
  format dob dom mmddyy10.;
  input id dob dom;
datalines;
1 06/24/1967 06/10/1990
2 01/18/1967 06/10/1990
;
run;

Formats stick around when you create new data sets but can be changed.

Dates (5)

Changing a Date Format

data form;
  informat dob dom mmddyy10.;
  input id dob dom;
datalines;
1 06/24/1967 06/10/1990
2 01/18/1967 06/10/1990
;
run;

data blah;
  set form;
  format dob dom mmddyy10.;
run;

data blah2;
  set blah;
  format dob dom date8.;
run;

Dates (6)

Two Digit Dates and Y2K

SAS has done a lousy job with this…

Don’t use two digit dates if you can help it.

You can specify a year cut-off of something like 1920. If you use yearcutoff=1920 then your two digit dates refer to this range: 1920 through 2019.
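
A small sketch of the cut-off in action. The yearcutoff system option sets the first year of a 100-year window; the variable names are made up and the results go to the log.

```sas
options yearcutoff=1920;

data _null_;
  y20 = input("06/24/20", mmddyy8.);   /* 20 is read as 1920, the start of the window */
  y19 = input("06/24/19", mmddyy8.);   /* 19 is read as 2019, the end of the window */
  put y20= date9. y19= date9.;
run;
```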

Converting From Text to Dates

You also have a pack of useful date functions to do things like:

Converting a text date to a SAS date is useful for determining study eligibility:

data eligible;
  set blah;
  if dom > "01jan1990"d then output;
run;

data eligible;
  set blah;
  if (("01jan1990"d - mdy(monthOfB, dayOfB, yearOfB)) / 365.25) > 65 then output;
run;
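
A few of the date functions in one place, as a rough sketch. Which functions to show is my choice, not a list from the slide; the variable names are arbitrary and the results go to the log.

```sas
data _null_;
  dob = mdy(6, 24, 1967);    /* build a SAS date from month, day, year */
  yr  = year(dob);           /* pull the pieces back out */
  mon = month(dob);
  dow = weekday(dob);        /* 1 = Sunday through 7 = Saturday */
  ageAt1990 = ("01jan1990"d - dob) / 365.25;   /* approximate age in years */
  put dob= date9. yr= mon= dow= ageAt1990= 5.1;
run;
```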

Before Next Time

Cody & Smith – Read the rest of Chapter 2, and all of Chapter 3

In Class Exercise

Import the data.
Get the contents.
Verify the contents.
Generate frequency tables on all the variables.
Get descriptive statistics on the numeric variables.