Data Manipulation (with SQL)

HRP223 – 2010

October 13, 2010

Copyright © 1999-2010 Leland Stanford Junior University. All rights reserved. Warning: This presentation is protected by copyright law and international treaties. Unauthorized reproduction of this presentation, or any portion of it, may result in severe civil and criminal penalties and will be prosecuted to maximum extent possible under the law.

Topics For Today

• Organization
• Sharing a SAS dataset
– As .sas7bdat files or other formats
• Renaming
– Datasets
– Variables
• Subsetting a dataset
– Select a few variables
– Select a few records
• SQL reports for a single table of data
– Selecting/renaming variables
– Applying labels and formats
– Creating tables with SQL

Avoiding Spaghetti Code

• Programmers refer to unstructured, poorly thought through, unorganized code as spaghetti code. Your EG projects will literally look like a tangled mess of spaghetti if you do not structure them in advance.
– Use several named process flows
– Use lots of notes in the project
– Include a lot of comments if you write code

This is bad.

Organization

Process Management

• Typically you will have one process flow that tells EG where to find existing SAS data, or that imports the source file(s) from a database like REDCap or from Excel. That flow then cleans the data and splits it into subsets.

• If you do different sets of analyses on the subsets, add a process flow for each subset.

• Have one of the process flows create a dataset called analysis that has the cleaned data with all the information used in the analyses.

Organization

Working with Multiple Process Flows

• You can add other process flows with the File menu or by right-clicking on the background of a process flow.

or click here.

Click here to move between flowcharts…

Right click on the process flow and give it a meaningful name.

You may want to link the library to the dataset.

Organization

The Greater Right of the Left

• Your process flows should have the source of the data on the left. The left margin should have:
– A note saying what the flowchart does
– A code node that creates a toy dataset or a library (or libraries) that contains the data

Organization

A Good Process Flow

Organization

Organization in Programs

• All my SAS code begins with the same header information.

• The /* and */ delimiters are used to mark large comments.

Display manager deletes output text and log.

Do not show the name of the procedures in output.

Do X commands ASAP.

Don’t show the date in output and reset page # to 1.

Delete graphics in the work library.

Specify where output will be stored.

Make the folder where output will be stored if it does not exist. Delete what is there if it exists. Set file path to that directory.

Make a library to store output datasets.

Make a web page to display all output.

Make pretty graphics.

Run other programs.

Turn off graphics and output.
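The header program these notes annotate is not preserved in this transcript. A minimal sketch of a few of the listed steps, assuming a Windows SAS session; the folder C:\Projects\output and the library name outlib are hypothetical and the folder is assumed to exist already:

```sas
/* Hypothetical program header; adapt paths and names to your project. */

/* Clear the log and output windows. This is a display manager command, */
/* intended for the SAS windowing environment.                          */
dm 'log; clear; output; clear;';

/* Do not show the name of the procedures in output. */
ods noproctitle;

/* Return from X (operating system) commands without waiting for a keypress. */
options noxwait;

/* Do not show the date in output, and reset the page number to 1. */
options nodate pageno = 1;

/* Library to store output datasets; the path is an assumption. */
libname outlib "C:\Projects\output";
```

The remaining steps (creating the folder, routing output to a web page, graphics settings) depend on your setup and are left out of this sketch.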

Sharing Data

• You can share SAS data sets just like Excel files.
• Create a library.
• Copy the data into the library.
• If the data has formats associated with it, be sure to send the formats.
– More on this at a later date.

Sharing

Exporting the Point and Click Way

• Double click the data set you want to export and use the Export item on the context sensitive menu.

Sharing

Libraries

• Recall that a library is a reference to a location on a hard drive.

• If you tell EG to move a data set into a library it moves it into the folder that the library “points at”.

With Code….

• Create a library with the GUI or use the libname statement

libname blah "C:\blah";

• Write a little program to copy the data into a permanent library:

proc copy in = work out = blah;
  select humans;
run;

Sharing

This code is efficient.

Sharing

Alternatives

• Novices underuse proc copy. Instead they typically write less efficient data steps. For example,

data blah.humans;
  set work.humans;
run;

• Or they may write:

data "C:\blah\humans.sas7bdat";
  set work.humans;
run;

Sharing

Sharing

Either create a library node or write this line.

Functionally the same but less efficient than proc copy.

Either create a library node or write this line.

Export Code for a Different Format

Sharing
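The export code shown on this slide is not preserved in the transcript. A minimal sketch of exporting a SAS dataset to a different format, assuming a CSV file is wanted; the output path C:\blah\humans.csv is hypothetical and the toy dataset exists only to make the example self-contained:

```sas
/* Toy data so the example runs on its own. */
data work.humans;
  do id = 1 to 3;
    output;
  end;
run;

/* Write the dataset to a comma-separated file. */
/* replace overwrites the file if it already exists. */
proc export data = work.humans
            outfile = "C:\blah\humans.csv"
            dbms = csv
            replace;
run;
```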

Note that you have to manually connect the code node to the right place in the flow chart and the exported item does not show up on the process flow.

Sharing

Copy and Rename

• If you want to copy and rename a file, use the GUI or write code.
– Double click the data set.
– Choose Query Builder from the context sensitive menu.

Renaming datasets

Renaming datasets

With code…

data blah.test;
  set work.humans;
run;

Renaming datasets

Make Some Fake Data

• You can tell SAS to make an ID variable and have it be output to a file named dudes with the values from 1 to 10 like this:

by 1 is optional. It will step by 1 by default.

The spaces before and after = are optional.
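The code from this slide is missing from the transcript. A sketch consistent with the description (an ID variable with the values 1 to 10, written to a dataset named dudes); the exact layout of the original is an assumption:

```sas
data dudes;
  /* by 1 is optional; the loop steps by 1 by default. */
  do id = 1 to 10 by 1;
    output;   /* write one record per value of id */
  end;
run;
```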

Add in a Constant

• I want to add in a column to indicate that these are all of type Fake.
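The slide's code is not in the transcript. One way this might look, assuming the new column is named type (the name used on the following slides) and is assigned inside the same loop that creates the IDs:

```sas
data dudes;
  do id = 1 to 10;
    type = "Fake";   /* quoted, so this is a character constant */
    output;
  end;
run;
```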

Common Mistakes (1)

• What happens if you leave off the quotes around the value fake?
– SAS thinks you want to set the variable type equal to the variable fake.
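A sketch of the mistake, assuming the same hypothetical dudes example as before. The step runs without an error, which is why it is easy to miss:

```sas
data dudes;
  do id = 1 to 10;
    type = fake;   /* no quotes: fake is read as a variable name */
    output;
  end;
run;
```

The log contains a note that the variable fake is uninitialized, and the resulting dataset has two empty numeric columns, type and fake.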

Always Search Your Log for uninitialized

• If you notice an empty variable at the end of your dataset, you forgot quotes or you misspelled a variable name … and SAS made it for you.

There was no fake variable so SAS made one for you…

I wish this was an ERROR!

Common Mistakes – Semicolons (2)

Common Mistakes – Dataset Spaces (3)

• SAS lets you use white space to organize your program, but you should not use spaces in variable names and you can’t use spaces in dataset names.

Not a syntax error but not what you wanted… a semantic error. You get two datasets.
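A sketch of this mistake; the dataset names are hypothetical. A space in what was meant to be one dataset name makes SAS read it as a list of two names:

```sas
/* Intended: one dataset. Actual: WORK.FAKE and WORK.DUDES are both created. */
data fake dudes;
  do id = 1 to 10;
    output;   /* with no name given, each record goes to every listed dataset */
  end;
run;
```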

More Bulletproof

• You can specify the name of the dataset you want to output into… this is a good idea.
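A minimal sketch of naming the target on the output statement, using the hypothetical dudes dataset:

```sas
data dudes;
  do id = 1 to 10;
    output dudes;   /* write explicitly to the dataset named dudes */
  end;
run;
```

If the name on the output statement does not match a dataset listed on the data statement, SAS reports an error instead of quietly doing something you did not intend.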

Common Mistakes – Variable Spaces (4)

GUI Instead

• You can use the GUI to make a dataset by hand or include a program and then use the GUI to add:

Gooey = graphical user interface

1. Label the node
2. Label the dataset
3. Drag and drop the ID variable
4. Compute Columns

To add in a column based on existing data:

5. Click New…
6. Click Recoded Column
7. Click the column you are basing the new variable upon

8. Specify whether the new column is character or numeric
9. Click Add…

This is an example of bad GUI design. Commands appear out of logical order.

Add a constant

10. Pick from the Replace Values, Replace a Range, Replace Condition tabs
11. Specify what is replacing what.

We want to add in “Fake” to all records. The ID is not missing in any record, so use that condition for the request.

12. Click OK

13. Specify what to do with all other values.
14. Click Next>

The same bad GUI with commands appearing out of logical order.

15. Specify the column label
16. Specify the variable name
17. Click Next>
18. Click Finish
19. Click Close

Notice the poor GUI design… why is the column type shown here as radio buttons which are disabled?

If the type of variable is wrong, push Back and fix it!

A Simple 20 Step Process

20. Push Run.

The SQL

• This is the code that was written by your pointing and clicking:

Click to see the code.

Consider saving this block of code in your private code library out on Google sites.
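The generated code itself is not preserved in the transcript. A hand-written approximation of what the 20 steps produce, not the exact text EG writes; the output table name dudes_typed, the alias t1 and the label text are assumptions:

```sas
/* Toy input so the example runs on its own. */
data work.dudes;
  do id = 1 to 10;
    output;
  end;
run;

proc sql;
  create table work.dudes_typed as
  select t1.id,
         /* Recode: every record with a non-missing ID gets "Fake". */
         case
           when t1.id is not missing then "Fake"
           else ""
         end as type label = "Type of record"
    from work.dudes as t1;
quit;
```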

Select a Few Variables From Fake Data

• The next task is to select a couple of variables from a data set that has a LOT of variables.

• If you get a premade dataset with lots of extra variables, you want to drop the ones you will never use. Do this as soon as you can.

• First I will make some fake data. The data set will have a simulated test value filled into 6 “month” variables.

Fake data

How to make a fake subject

Fake data

Variables are added to the new dataset in the order in which they are created. New variables are created if they show up in an array statement (rarely) or on the left side of an equal sign (=).

Comments can start with * and end with ;
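The slide's program is not in the transcript. A sketch of making fake subjects with six month variables, assuming normally distributed test values; the dataset name fake, the seed, the mean and the standard deviation are all illustrative:

```sas
data work.fake;
  * Seed the random number stream so the fake data are reproducible;
  call streaminit(223);

  /* The array statement creates month1-month6, so they come first. */
  array month[6] month1-month6;

  do id = 1 to 10;
    do i = 1 to 6;
      month[i] = round(rand("normal", 100, 15));   /* simulated test value */
    end;
    output;   /* one record per fake subject */
  end;

  drop i;   /* the loop counter is not needed in the dataset */
run;
```

Because variables are added in the order they are created, id lands after the month variables here.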

Fake data

You can use the Filter and Sort context sensitive menu to select a few variables.

To rename a variable or change how it prints in reports you need to use the Query Builder or write code.

Selecting variables and renaming

Rename and label variables

Drag and drop the variables you want into the Select Data windowpane.

Rename and label variables

Click on a variable name. Then use the properties button to change the name and the display label.

Month1 is January but for reports I want it to say First Month.

Rename and label variables

Rename and label variables

I usually display the variable names instead of the labels.

To write code, you need the names not the labels.

What it did…

Rename and label variables

Data Step (SAS code) Version

Notice where the ; is found. This is one long statement.
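The data step from this slide is missing from the transcript. A sketch of selecting, renaming and labeling in a data step, assuming toy data; the dataset names and label text are illustrative:

```sas
/* Toy input so the example runs on its own. */
data work.fake;
  do id = 1 to 3;
    month1 = 100 + id;
    month2 = 200 + id;
    output;
  end;
run;

data work.subset;
  /* keep= selects variables; rename= changes the name as the data are read. */
  set work.fake (keep = id month1 rename = (month1 = January));

  /* One long label statement: a single ; ends it. */
  label id      = "Subject ID"
        January = "First Month";
run;
```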

Rename and label variables

Minimal SQL

• Print a report showing the contents of variables from a single data set.

Put a comma-delimited list of variables here or * for all variables.

Specify a library.table here.

Note that there is no create table ____ as
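The annotated query is not preserved in the transcript. A minimal sketch matching the notes above, assuming a toy table work.fake:

```sas
/* Toy input so the example runs on its own. */
data work.fake;
  do id = 1 to 3;
    month1 = 100 + id;
    output;
  end;
run;

proc sql;
  select id, month1    /* comma-delimited list of variables, or * for all */
    from work.fake;    /* library.table */
quit;
```

With no create table clause, the result is printed as a report instead of being saved.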

SQL reports

What variables?

• Typically you will use a comma-delimited list, but you can use an * to indicate that you want all variables selected instead of typing them all.

• There is no syntax to specify variables based on position in the source files. That is, you cannot specify that you want to select the 2nd and 7th variables (from left to right) or to select the first 3 variables.

SQL reports

Use of Minimal SQL

Note that the order of the list sets the order in the report (or the order in a new dataset).

SQL reports – selecting variables

Renaming and Labels

• You can rename a variable in the list with the as keyword.

as creates a new variable. Without as, SQL just copies the variable.

• You can also specify variable labels.

SQL reports – rename/label
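A sketch of renaming and labeling in the select list, assuming a toy table work.fake; the new name and label text follow the earlier Month1 example:

```sas
/* Toy input so the example runs on its own. */
data work.fake;
  do id = 1 to 3;
    month1 = 100 + id;
    output;
  end;
run;

proc sql;
  select id,
         month1 as January label = "First Month"
    from work.fake;
quit;
```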

Using Formats

• Labels affect column headings and similar titles, and formats affect how values appear without changing the values themselves.

Notice the lowercase i. The capitalization is set when the variable is created.
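The slide's query is not in the transcript. A sketch of applying a format in a report, assuming a toy table work.fake; the 8.2 format only changes how the values print:

```sas
/* Toy input so the example runs on its own. */
data work.fake;
  do id = 1 to 3;
    month1 = 100 + id / 3;
    output;
  end;
run;

proc sql;
  select id,
         month1 label = "First Month" format = 8.2   /* print 2 decimals */
    from work.fake;
quit;
```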

SQL reports – format

Preview of User Defined Formats

Note the $ means a character format.
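The format definition from this slide is missing. A sketch of a user-defined character format, assuming a hypothetical format named $kind and the toy dudes data used earlier:

```sas
proc format;
  /* The $ marks this as a format for character values. */
  value $kind "Fake" = "Simulated subject"
              other  = "Real subject";
run;

/* Toy input so the example runs on its own. */
data work.dudes;
  do id = 1 to 3;
    type = "Fake";
    output;
  end;
run;

proc sql;
  select id,
         type format = $kind.   /* the stored value stays "Fake" */
    from work.dudes;
quit;
```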

SQL reports – format

blah

SQL tables

New table.

Original table
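The query these labels point at is not preserved. A sketch of creating a table with SQL, assuming a toy original table work.fake and using blah as the new table name:

```sas
/* Original table (toy data so the example runs on its own). */
data work.fake;
  do id = 1 to 3;
    month1 = 100 + id;
    output;
  end;
run;

proc sql;
  create table work.blah as   /* new table */
  select id, month1
    from work.fake;           /* original table */
quit;
```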

More Tweaks

• The from line references tables which are in libraries. Complex queries require you to reference the table name over and over again. Instead of having to type the long library and dataset names repeatedly, you can refer to a table by an alias.

Print the column called dude from the table blah which is in the fakedata library.

Here the b. is optional because dude is only in one table (the query only uses one table).
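A sketch of the aliased query described above. To keep the example self-contained, the fakedata library is pointed at the work folder and filled with toy values; both choices are assumptions:

```sas
/* Point the fakedata library at the folder behind the work library. */
libname fakedata "%sysfunc(pathname(work))";

/* Toy table with a column called dude. */
data fakedata.blah;
  length dude $ 8;
  do dude = "Al", "Bob", "Carl";
    output;
  end;
run;

proc sql;
  select b.dude               /* b. is optional here: only one table is used */
    from fakedata.blah as b;  /* b is the alias for fakedata.blah */
quit;
```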

SQL reports – table aliases

Data Step Version….

Rename, label and format variables
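The data step on this last slide is not in the transcript. A sketch of renaming, labeling and formatting in one data step, assuming toy data; the dataset names, label text and format are illustrative:

```sas
/* Toy input so the example runs on its own. */
data work.fake;
  do id = 1 to 3;
    month1 = 100 + id / 3;
    output;
  end;
run;

data work.report;
  set work.fake (keep = id month1 rename = (month1 = January));
  label  January = "First Month";   /* column heading in reports */
  format January 8.2;               /* how the values print */
run;
```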