1 Summary HRP223 – 2009 November 1 st , 2010 Copyright © 1999-2010 Leland Stanford Junior University. All rights reserved. Warning: This presentation is protected by copyright law and international treaties. Unauthorized reproduction of this presentation, or any portion of it, may result in severe civil and criminal penalties and will be prosecuted to maximum extent possible under the law.



You know …

• How to create a table from scratch
• How to import tables
– From external sources like Excel or using export/import code from databases
• How to create tables
– from a single existing table
  • with selected variables
  • with recoded variables
  • with or without subsets of the records
– from multiple tables
  • by adding columns (joins)
  • by adding sets of records (set operators)

With code or GUI


Create a New Table

• The GUI is the easiest.
• Look in the optional textbooks for the class to learn the syntax for code.

$ means a character string. 10. means 10 characters wide.

The age variable starts in column 11.

Missing numbers are just a .

Missing characters are just spaces (not tabs)
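The code shown on the original slide is not part of this transcript, so the following is only a minimal sketch of what the notes above describe. The data set name, variable names, and data values are assumptions made up for illustration.

```sas
data people;
  /* name: a character string read from columns 1-10 ($10.) */
  /* age: a number that starts in column 11                  */
  input name $10. @11 age;
  datalines;
Andrew    7
Beth      .
          41
;
run;

proc print data=people;
run;
```

The second record has a missing age (a single period) and the third has a missing name (ten spaces).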


Importing

• The most bulletproof way to import is to use the import wizard.

• You can also write a program with proc import
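As a rough sketch of the proc import route: the file path, sheet name, and output table name below are hypothetical, and the right dbms= value depends on the file type and on which SAS/ACCESS components are installed.

```sas
/* Hypothetical path and sheet name; replace with your own. */
proc import datafile="C:\hrp223\visits.xls"
            out=work.visits
            dbms=xls
            replace;
  sheet="Sheet1";
  getnames=yes;   /* use the first row as variable names */
run;
```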


Code

• If you write any code be sure to load my keyboard macros:

Once you have a program node open in a flowchart, you add the macros to both SAS and EG by using the Program menu.

The import macro gives you the shell to import Excel files.


From a Database

• If you load data that came with an import/export program, you will probably need to add the path to the infile statement.
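For illustration only, here is what adding the path might look like. The folder, file name, and variable list are assumptions; the code generated by a real database export will have its own input statement.

```sas
data work.survey;
  /* Exported code often names only the file; add the full folder path here. */
  infile "C:\hrp223\exports\survey.csv" dsd firstobs=2 truncover;
  input id age gender $;
run;
```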


Importing Advice

• It is a good idea to import the source into a permanent library.

• After importing, use the Query Builder or a Program node and copy all the variables into a new data set. This node can be tweaked later to fix the problems that you identify.
– If you do not do this, you will have to change the links leading from the cleaned/fixed data to point to the analyses.
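With a Program node, the copy step can be as small as the sketch below. The library name, folder, and table names are hypothetical.

```sas
libname hrp223 "C:\hrp223\data";   /* hypothetical permanent library */

/* Copy every variable from the imported table into a working copy. */
data hrp223.visits_clean;
  set hrp223.visits_raw;
  /* fixes for problems found later are added here */
run;
```

Analyses that read visits_clean keep working when fixes are added to this step.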


Creating New Datasets From 1 Table

• Name the query and new table.

• Drag the entire table or individual variables to the Select Data pane.

• In the Select Data pane pick variables then click the properties button.


Changing a Variable

• Computed Columns > New… > Recoded column > pick a variable.

• Notice the other tabs for selecting what to change to a new value.

SAS allows 28 different types of missing numbers: ._ , . , and .A through .Z


Bad Ages Recoded to NULL

• If you get data from a program that uses bogus numbers to indicate problems in a numeric field, replace the values with different NULL values (.A, .B, etc.). When you do descriptive statistics, the NULL values will be automatically excluded.
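A small sketch of the idea in code, assuming (hypothetically) that 999 and -1 are the bogus codes used by the source program:

```sas
data ages;
  input id age;
  datalines;
1 34
2 999
3 -1
4 51
;
run;

data ages_clean;
  set ages;
  /* 999 and -1 are made-up problem codes for this example */
  if age = 999 then age = .A;
  else if age = -1 then age = .B;
run;

proc means data=ages_clean n nmiss mean;
  var age;
run;
```

Here proc means reports N = 2, NMISS = 2, and a mean of 42.5, because only the ages 34 and 51 are used.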


Removing/Choosing Records

• Right click on the variables you want to use for dropping records or use the Filter Data tab.


Advanced Changes Comparisons

• You can use the Advanced Expression dialog box to do complex tasks like editing and combining text variables.
– catt(), lowcase(), compress(), compbl()

• SAS has built-in regular expression processing (like Perl) as well as Soundex for phonetic spellings and (Levenshtein) edit distances for measuring dissimilarity between strings.
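The functions named above can be tried in a short data step that writes to the log. This is a sketch; the strings are invented for illustration.

```sas
data _null_;
  length messy $20;
  messy    = "Tinky    WINKEY";
  tidy     = compbl(messy);             /* collapse runs of blanks to one blank  */
  squeezed = compress(messy);           /* remove all blanks                     */
  lower    = lowcase(tidy);             /* convert to lowercase                  */
  joined   = catt("Tinky ", "Winkey");  /* trim trailing blanks, then join       */
  soundsAlike = (soundex("Winkey") = soundex("Winky"));  /* 1 if the codes match */
  edits    = complev("Tinkywinkey", "Tinky Winkey");     /* Levenshtein distance */
  found    = prxmatch("/tinky\s*winkey/i", messy);       /* match position or 0  */
  put tidy= squeezed= lower= joined= soundsAlike= edits= found=;
run;
```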


Working with Several Tables

• Joins add columns to a base table.

• Set operations add (or subtract) records.

[Diagrams: Table 1 and Table 2 combined into a New Table, once for a join and once for a set operation]


Commonly Used Joins

[Diagrams: Table 1 and Table 2 combined into a New Table by an inner join and by a left join]

Inner Join: Keep only records where you can match IDs in both tables.

Left Join: Keep all records from the left table and matching records from the right. Use NULL for the right table variables in the unmatched records.
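In PROC SQL the two joins differ by one keyword. The tables and values below are invented for illustration.

```sas
data demog;
  input id sex $;
  datalines;
1 F
2 M
3 F
;
run;

data labs;
  input id glucose;
  datalines;
1 90
3 110
4 85
;
run;

proc sql;
  /* only IDs found in both tables: 1 and 3 */
  create table inner_tbl as
    select d.id, d.sex, l.glucose
    from demog as d
    inner join labs as l
      on d.id = l.id;

  /* every ID from demog; glucose is missing for ID 2 */
  create table left_tbl as
    select d.id, d.sex, l.glucose
    from demog as d
    left join labs as l
      on d.id = l.id;
quit;
```

The inner join keeps 2 records and the left join keeps 3.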


One to Many Joins

• All of the SQL joins that I have mentioned work with either a 1 to 1 match of key variables across tables or a 1 to many match. But you need to be cognizant of how many records are in each table.

• Double check the new table size.

[Figure: example results of an inner join and a left join]


Cartesian Joins

• If a key value is duplicated in both tables and you do not also join on a second variable, SQL pairs every matching record from one table with every matching record from the other. For that key, the number of records in the new table is the product of the number of records in each table.

[Figure: inner join on Family]
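A small made-up example shows the multiplication. Family 1 has two kids and two parents, so joining on family alone yields 2 x 2 = 4 records for that family.

```sas
data kids;
  input family kid $;
  datalines;
1 Ann
1 Bob
2 Cy
;
run;

data parents;
  input family parent $;
  datalines;
1 Mom
1 Dad
2 Pat
;
run;

proc sql;
  create table pairs as
    select k.family, k.kid, p.parent
    from kids as k
    inner join parents as p
      on k.family = p.family;

  /* 4 rows for family 1 plus 1 row for family 2 */
  select count(*) as nRows from pairs;
quit;
```

The count is 5 even though each source table has only 3 records.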


PROC SQL - Set Operators (NO GUI)

• Outer Union Corresponding
– concatenates

• Union
– unique rows from both queries

• Except
– rows that are part of the first query only

• Intersect
– rows common to both queries
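A sketch of the four operators on two invented tables, where visit1 holds IDs 1, 2, 3 and visit2 holds IDs 2, 3, 4:

```sas
data visit1;
  input id;
  datalines;
1
2
3
;
run;

data visit2;
  input id;
  datalines;
2
3
4
;
run;

proc sql;
  /* concatenate: all 6 rows */
  create table stacked as
    select * from visit1
    outer union corr
    select * from visit2;

  /* unique rows from both queries: 1, 2, 3, 4 */
  create table either as
    select id from visit1
    union
    select id from visit2;

  /* rows only in the first query: 1 */
  create table only1 as
    select id from visit1
    except
    select id from visit2;

  /* rows common to both queries: 2, 3 */
  create table common as
    select id from visit1
    intersect
    select id from visit2;
quit;
```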


How does a data step typically work?

• The data statement says make this (or these) data set(s).
1. SAS then reads every line down to the run statement and gathers a list of all variables used.
  • This list is called the program data vector (PDV).
2. It then sets all the variables to missing.


How does a data step typically work?

3. It then does the instruction listed on each line of the data step program in the order that the lines are written.

4. Then it writes all the variables out to the new dataset.

5. It then repeats the process if there is more data.
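One way to watch these steps is to print the PDV to the log at the top and bottom of each pass. This sketch uses made-up variables x and y.

```sas
data demo;
  put "Top of step:    " _all_;   /* x and y are missing here on every pass */
  input x;
  y = x * 2;
  put "Bottom of step: " _all_;   /* x and y are filled in just before output */
  datalines;
1
2
;
run;
```

The log also shows the automatic variables _N_ and _ERROR_. The top put runs a third time when the step starts over, and the step then stops because input finds no more data.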


How SAS Processes a Dataset(1)

• In the example below, SAS will look in the existing dataset called Teletubbies and it will find two variables, teletubby and thing. Then it will find the variable called kid.

• Then it will do the instructions in order.

data Teletubbies2;      *name of a new data set;
  set Teletubbies;      *load 1 observation of data;
  kid = "Andrew";       *fill in the blank;
  output;               *write the variables to teletubbies2;
  return;               *return to the top of the step;
run;                    *end of these instructions;


The Set Statement

set Teletubbies;

• This line tells SAS to load one row of data from the data set Teletubbies into the PDV. The first time this line is run, the first row of data is loaded into the PDV.

• When there is no more data to load, the data step is done.


Variable Assignment

• In the example the word Andrew is assigned to the variable kid. All variables are assigned from the right side into the variable named on the left.

kid = "Andrew";

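• If a variable appears on both the left and right side of an equal sign, the original value on the right is used to calculate the new value, which is then written to the variable on the left.

aNumber = aNumber + 4;

Assignment goes from right to left: the aNumber on the right is the original value and the aNumber on the left gets the new value.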


How SAS Processes a Dataset(2)

• If you do not include the output and return statements, SAS will do them automatically. So, the previous data step would typically be written like this.

data Teletubbies2;
  set Teletubbies;
  kid = "Andrew";
run;


How SAS Processes a Dataset(3)

• If, If-else, or select statements are typically used to conditionally assign values in a data step.

If: one possibility

If else: two possibilities

Select when otherwise end: multiple possibilities
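The slide's own examples are not in this transcript, so here is a minimal sketch of the three forms. The data values and the hasBag and color variables are assumptions made up for illustration.

```sas
data Teletubbies;
  input teletubby $ 1-12 thing $ 14-23;
  datalines;
Tinky Winkey bag
Dipsy        hat
Laa-Laa      ball
Tinkywinkey  bag
;
run;

data Teletubbies3;
  set Teletubbies;
  length color $7;   /* long enough for "unknown" */

  /* if-else: two possibilities */
  if thing = "bag" then hasBag = 1;
  else hasBag = 0;

  /* select-when-otherwise-end: multiple possibilities */
  select (teletubby);
    when ("Tinky Winkey") color = "purple";
    when ("Dipsy")        color = "green";
    when ("Laa-Laa")      color = "yellow";
    otherwise do;
      color = "unknown";
      put "Bad Teletubby: " teletubby=;   /* flag unexpected spellings in the log */
    end;
  end;
run;
```

The otherwise branch catches the misspelled "Tinkywinkey" record and writes a note to the log.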


Error Trapping

• “Tinkywinkey” is not “Tinky Winkey” … Bad Teletubby.


Test Your Understanding

data test3a test3b;
  set source;
  if isMale = 1 then output test3a;
  hasCancer = 1;
  output test3b;
run;


Common Ground … where

• Both SQL and data step programming use where statements to select what records are included in the new dataset.

• With data steps the variables used in the where statement need to already exist in the source file. Use if to check variables created in the data step.
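A short sketch of the difference, using the sashelp.class sample table that ships with SAS. The bmi variable and the cutoffs are made up for illustration.

```sas
data teens;
  set sashelp.class;
  where age >= 13;                  /* age already exists in the source table */
  bmi = 703 * weight / height**2;   /* new variable created in this step      */
  if bmi > 18;                      /* bmi cannot be used in where, so use if */
run;
```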


where

• The syntax for where is identical in SQL and data steps.
• Differences vs. if statements:
– main points work in where only
  • sub points work in either
– x between y and z
  • x >= y and x <= z
  • y <= x <= z
– string1 ? string2 or string1 contains string2
  • index(string1,string2) > 0
– string1 =* string2
  • soundex(string1) = soundex(string2)
– x is null or x is missing
  • missing(x)
– String1 like "U%of%A%"
  • use regular expressions (PRX)


where Syntax

• The where statement, like all SAS statements, begins with a keyword (where) and ends in a semicolon.
– where isDead = "false";
– where isDead ne "true";
– where missing(gender);
– where salary > 100000;
– where country in ("USA", "Japan", "UK");
– where country in ("USA" "Japan" "UK");


where Syntax

• Arithmetic
– where salary/12 > 10000;
– where (salary /12) * 1.20 ge 9900;
– where salary + bonus < 120000;

• Logical
– where gender ne "M" and salary >= 50000;
– where gender ne "M" or salary >= 50000;
– where country = "UK" or country = "UTAH";
– where country not in ("USA", "AU");


Make Decisions

• SAS has many operations available to help you make decisions.

Comparisons: = eq, ~= ne, < lt, > gt, <= le, >= ge, in ( )

Logic: Not, & And, | or, in
– Not requires the expression following it to not be true.
– & requires both operands to be true.
– | requires at least one operand to be true.
– In ( ) requires at least one comparison to be true.

Math operations: + - * / **


Logical Decisions & Compound Expressions

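• Common tests and common problems:

where YODeath < YOBirth;
where Sex = "M" and numPreg > 0;

where Sex="M" and numPreg > 0 or ageLMP > 0; *** bad ***;

where Sex="M" and (numPreg > 0 or ageLMP > 0); *** good ***;

– The bad version is evaluated as (Sex="M" and numPreg > 0) or ageLMP > 0 because and is evaluated before or, so it also selects every female with an ageLMP.
– Moral: Use parentheses generously with ands and ors.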
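The effect of the parentheses can be checked on a few invented records. Record 2 is a plausible female record and record 3 is a male with a pregnancy count, which is the kind of error the check is meant to find.

```sas
data people;
  input id Sex $ numPreg ageLMP;
  datalines;
1 M 0 .
2 F 2 13
3 M 1 .
;
run;

title "Without parentheses: records 2 and 3 are selected";
proc print data=people;
  where Sex="M" and numPreg > 0 or ageLMP > 0;
run;

title "With parentheses: only record 3 is selected";
proc print data=people;
  where Sex="M" and (numPreg > 0 or ageLMP > 0);
run;
title;
```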


Where is everywhere


Numeric Data and Looping

• Say somebody tells you to simulate rolling dice. The formula to do this says:
– generate a random number between 0 and 1
– multiply it by 6
– round up to the closest integer

data die;
  *the 22 says which list of numbers between 0 & 1;
  aNumber = ranuni(22);
  die = ceil(6*aNumber);
  * Generate a random integer between 1 and 6.;
  dieDie = ceil(6*ranuni(78687632));
  output; * write to the new dataset;
  return; * go to the top and try to read in data;
run;


Doing Stuff Repeatedly

• How to roll two dice:

data dice;
  do x = 1 to 2 by 1;
    roll = ceil(6*ranuni(78687632));
    output;
  end;
  return; * go to the top and try to read in data;
run;


Craps…

• In the dice game "craps" you throw two dice and the number you roll determines if you win or lose. How do you simulate rolling 10 pairs of dice?

data craps;
  do trial = 1 to 10;
    do dieNumber = 1 to 2;
      roll = ceil(6*ranuni(78687632));
      output;
    end;
  end;
  return;
run;


Summing