
Database Theory and Normalization

HRP223 – 2010
November 14th, 2011

Copyright © 1999-2011 Leland Stanford Junior University. All rights reserved. Warning: This presentation is protected by copyright law and international treaties. Unauthorized reproduction of this presentation, or any portion of it, may result in severe civil and criminal penalties and will be prosecuted to maximum extent possible under the law.


Flat Files

• Some people try to store all their data in a single file. This causes lots of extra work because of holes in the tables and repeated information.

• Both problems can be fixed by a relational model.
  – Split the data into many tables.

• You need to use SQL to work with data split across multiple tables.
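As a rough illustration of working across split tables, here is a minimal sketch of a join in PROC SQL. The table names patients and diagnoses and their columns are hypothetical, chosen only for this example; they are not from the original slides.

```sas
/* Hypothetical tables (assumed for illustration):            */
/*   patients(sid, dob)            one row per subject        */
/*   diagnoses(sid, dxNum, theDx)  one row per diagnosis      */
proc sql;
  create table patientDx as
  select p.sid, p.dob, d.dxNum, d.theDx
  from patients as p
       inner join diagnoses as d
       on p.sid = d.sid          /* match rows on the subject ID */
  order by p.sid, d.dxNum;
quit;
```

Each subject's date of birth is stored once in patients, and the join repeats it only in the result, not in the stored data.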


Not Normalized

• I frequently get data from people who are not professional programmers in which the diagnosis data is organized “wide” across the page: the first diagnosis is in the first column, the second is in the second, and so on. The task is then to find or fix a diagnosis.


Subsetting Based on 5 Variables


SQL vs. Datastep

• The GUI generates this code:

• Or you could write a little data step program:
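A minimal sketch of both approaches, assuming a data set named wide with a subject ID sid and numeric diagnosis columns dx1-dx5, and assuming the goal is to keep anyone with diagnosis code 9. The data set names and the code value are assumptions for illustration, and this is not necessarily the code the GUI produces.

```sas
/* Assumed input: work.wide with sid and numeric dx1-dx5. */

/* SQL version: keep rows where any of the five columns is 9. */
proc sql;
  create table hasDx9_sql as
  select *
  from wide
  where dx1 = 9 or dx2 = 9 or dx3 = 9 or dx4 = 9 or dx5 = 9;
quit;

/* Data step version: a subsetting if keeps the same rows. */
data hasDx9_ds;
  set wide;
  if dx1 = 9 or dx2 = 9 or dx3 = 9 or dx4 = 9 or dx5 = 9;
run;
```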


Change All 9s to 999s?

• It is a lot of clicking.


Code

• The SQL is a bit complicated
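To show why the SQL gets complicated, here is a sketch of recoding 9 to 999 in every diagnosis column. It assumes a data set named wide with sid and numeric dx1-dx5 (names assumed for illustration); each column needs its own case expression.

```sas
/* Assumed input: work.wide with sid and numeric dx1-dx5. */
proc sql;
  create table fixed_sql as
  select sid,
         case when dx1 = 9 then 999 else dx1 end as dx1,
         case when dx2 = 9 then 999 else dx2 end as dx2,
         case when dx3 = 9 then 999 else dx3 end as dx3,
         case when dx4 = 9 then 999 else dx4 end as dx4,
         case when dx5 = 9 then 999 else dx5 end as dx5
  from wide;
quit;
```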


As Data Step

• If it is more than 5 columns, things get unruly. Imagine doing this across 20 possible diagnoses. There is an easy solution in data step code.

• First, the SQL code can be done easily in a data step.
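A sketch of the same recode written as a data step, again assuming a data set named wide with numeric dx1-dx5 (assumed names). It is shorter than the SQL, but still one line per column.

```sas
/* Assumed input: work.wide with sid and numeric dx1-dx5. */
data fixed_ds;
  set wide;
  if dx1 = 9 then dx1 = 999;
  if dx2 = 9 then dx2 = 999;
  if dx3 = 9 then dx3 = 999;
  if dx4 = 9 then dx4 = 999;
  if dx5 = 9 then dx5 = 999;
run;
```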


A List

• As you can see, there is a list of variables and you are doing the same things over and over.

• You want to make a list called dx and have the 1st element refer to dx1, the 2nd thing refer to dx2, etc. The concept of a named list of variables or an alias to a bunch of variables is instantiated as an array.
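A sketch of that alias, assuming the same wide data set and assuming it has no existing variable named dx (an array cannot share a name with a variable). Here dx[1] is simply another name for dx1.

```sas
/* Assumed input: work.wide with sid and numeric dx1-dx5. */
data fixed_array;
  set wide;
  array dx [5] dx1-dx5;            /* dx[1] refers to dx1, dx[2] to dx2, ... */
  if dx[1] = 9 then dx[1] = 999;
  if dx[2] = 9 then dx[2] = 999;
  if dx[3] = 9 then dx[3] = 999;
  if dx[4] = 9 then dx[4] = 999;
  if dx[5] = 9 then dx[5] = 999;
run;
```

With explicit subscripts the program is no shorter than before; the array only pays off once the subscript becomes a loop counter.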


Arrays

• A major improvement….. Ummmm.

• You want to process the same one line over and over. You need to count from 1 to 5…. Sounds like a loop.
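A sketch of the loop version, under the same assumptions (data set wide, numeric dx1-dx5). The counter name i is arbitrary and is dropped from the output.

```sas
/* Assumed input: work.wide with sid and numeric dx1-dx5. */
data fixed_loop (drop = i);
  set wide;
  array dx [5] dx1-dx5;
  do i = 1 to 5;                   /* count through the five diagnoses */
    if dx[i] = 9 then dx[i] = 999;
  end;
run;
```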


Change Lots of Things

• If you have an array, you can process wide files easily.
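For example, a sketch for a file with 20 diagnosis columns, assuming a hypothetical data set wide20 with numeric dx1-dx20. Using [*] and dim() lets SAS count the elements, so only the array statement changes when the file gets wider.

```sas
/* Assumed input: work.wide20 with sid and numeric dx1-dx20. */
data fixed_wide20 (drop = i);
  set wide20;
  array dx [*] dx1-dx20;           /* [*] lets SAS work out the size */
  do i = 1 to dim(dx);             /* dim() returns the number of elements */
    if dx[i] = 9 then dx[i] = 999;
  end;
run;
```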


Restructuring with Arrays

• You can use similar code to restructure data so that you have only a couple of columns of data.

• Add a new column called dxNum and another called theDx. Those two columns plus the subject ID number can contain the same information without all the “holes”.


How does that work?

• Go through all five variables, one at a time.
• If the variable is not missing, you need to do three things:
  – Copy the diagnosis counter number into the dxNum variable.
  – Copy the diagnosis code number into the variable called theDx.
  – Write to the new data set.


Repeated Ifs

• This is a lot of typing and it obscures the fact that you are doing three things if a condition is true:
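A sketch of what the repeated-if style looks like, assuming a data set named wide with sid and numeric dx1-dx5 (assumed names). Only the first two diagnoses are written out here; dx3 through dx5 would repeat the same three lines.

```sas
/* Assumed input: work.wide with sid and numeric dx1-dx5. */
data long_ifs (keep = sid dxNum theDx);
  set wide;
  /* The same condition is tested three times per diagnosis. */
  if not missing(dx1) then dxNum = 1;
  if not missing(dx1) then theDx = dx1;
  if not missing(dx1) then output;

  if not missing(dx2) then dxNum = 2;
  if not missing(dx2) then theDx = dx2;
  if not missing(dx2) then output;
  /* dx3, dx4 and dx5 would follow the same pattern. */
run;
```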


do end

• You have seen do statements in the context where you do stuff over and over. There is also a do-end block for when you need to run a block of instructions if a condition is true.

• You need both do and end.
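A sketch of a conditional do-end block, assuming the same wide data set with numeric dx1-dx5. Only the first diagnosis is handled, to keep the focus on the block itself.

```sas
/* Assumed input: work.wide with sid and numeric dx1-dx5. */
data long_doend (keep = sid dxNum theDx);
  set wide;
  if not missing(dx1) then do;     /* do opens the block */
    dxNum = 1;
    theDx = dx1;
    output;
  end;                             /* end closes it */
run;
```

The condition is now tested once, and the three actions that depend on it sit together.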


Actual Code
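A minimal sketch of the restructuring described on the preceding slides, combining the array, the loop and the do-end block. It assumes a data set named wide with sid and numeric dx1-dx5; it is a reconstruction under those assumptions, not necessarily the code shown on the original slide.

```sas
/* Assumed input: work.wide with sid and numeric dx1-dx5. */
data long (keep = sid dxNum theDx);
  set wide;
  array dx [5] dx1-dx5;
  do i = 1 to 5;                       /* go through all five variables */
    if not missing(dx[i]) then do;
      dxNum = i;                       /* copy the diagnosis counter */
      theDx = dx[i];                   /* copy the diagnosis code */
      output;                          /* write to the new data set */
    end;
  end;
run;
```

The result has one row per non-missing diagnosis, so the holes in the wide file disappear.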


Normalization Part 2

• I got data where I needed to analyze age for people who have a particular diagnosis. The data was a not-normalized mess:

sid  dob1      dob2     dob3     dob4      dob5     dx1      dx2      dx3  dx4      dx5      code1  code2  code3  code4  code5
1    6/24/67   1/18/67  4/13/92  11/12/96  2/14/99  4/18/01  4/23/01  .    4/23/01  4/23/01  22     22     .      22     22
2    7/4/43    .        .        .         .        4/18/01  .        .    .        .        24     .      .      .      .
3    12/26/55  5/22/53  .        .         .        4/22/01  4/22/01  .    .        .        22     22     .      .      .

(A period marks a missing value.)


Normalization Part 2 The Wrong Way

• If your database is like this, you need code like this:

data bad2;
  set bad;
  if (dob1 ne . and not missing(dx1)) then do;
    if code1 = 22 then IsCase1 = 1; else IsCase1 = 0;
  end;
  if (dob2 ne . and not missing(dx2)) then do;
    if code2 = 22 then IsCase2 = 1; else IsCase2 = 0;
  end;
  if (dob3 ne . and not missing(dx3)) then do;
    if code3 = 22 then IsCase3 = 1; else IsCase3 = 0;
  end;
  if (dob4 ne . and not missing(dx4)) then do;
    if code4 = 22 then IsCase4 = 1; else IsCase4 = 0;
  end;
  if (dob5 ne . and not missing(dx5)) then do;
    if code5 = 22 then IsCase5 = 1; else IsCase5 = 0;
  end;
run;

You will end up with the same code repeated as many times as you have repetitions.


Normalization Part 2 The Right Way

• Instead, you should have a record in a table corresponding to each repetition.

sid  mid  dob       dx       code
1    1    6/24/67   4/18/01  22
1    2    1/18/67   4/23/01  22
1    4    11/12/96  4/23/01  22
1    5    2/14/99   4/23/01  22
2    1    7/4/43    4/18/01  24
3    1    12/26/55  4/22/01  22
3    2    5/22/53   4/22/01  22

• With code like this:

data good2;
  set good;
  if code = 22 then isCase1 = 1; else isCase1 = 0;
run;


• Your first attempt could go something like this:

data normal1 (keep = sid mid dob dx code);
  set bad;
  format dob dx mmddyy8.;
  if (dob1 ne . and dx1 ne . and code1 ne .) then do;
    mid = 1; dob = dob1; dx = dx1; code = code1; output;
  end;
  if (dob2 ne . and dx2 ne . and code2 ne .) then do;
    mid = 2; dob = dob2; dx = dx2; code = code2; output;
  end;
  if (dob3 ne . and dx3 ne . and code3 ne .) then do;
    mid = 3; dob = dob3; dx = dx3; code = code3; output;
  end;
  if (dob4 ne . and dx4 ne . and code4 ne .) then do;
    mid = 4; dob = dob4; dx = dx4; code = code4; output;
  end;
  if (dob5 ne . and dx5 ne . and code5 ne .) then do;
    mid = 5; dob = dob5; dx = dx5; code = code5; output;
  end;
run;

But you end up with just as many blocks of code.


Setting up Aliases (Arrays)

• What you want is a way to repeat this code over the five sets of variables:

if (dob1 ne . and dx1 ne . and code1 ne .) then do;
  mid = 1; dob = dob1; dx = dx1; code = code1; output;
end;

• You need:
  – A dob alias (dob_a) to refer to dob1, dob2, dob3, dob4 and dob5
  – A dx alias (dx_a) to refer to dx1, dx2, dx3, dx4 and dx5
  – A code alias (code_a) to refer to code1, code2, code3, code4 and code5


Setting up Aliases (Arrays)

data normal2a;
  set bad;
  array dob_a dob1-dob5;
  array dx_a dx1-dx5;
  array code_a code1-code5;

  if (dob1 ne . and dx1 ne . and code1 ne .) then do;
    mid = 1; dob = dob1; dx = dx1; code = code1; output;
  end;
run;

This sets up the arrays but they are not used in this program.


Setting up Aliases (Arrays)

data normal2b;
  set bad;
  array dob_a dob1-dob5;
  array dx_a dx1-dx5;
  array code_a code1-code5;

  /* Same logic as before, but reached through the array aliases. */
  if (dob_a[1] ne . and dx_a[1] ne . and code_a[1] ne .) then do;
    mid = 1; dob = dob_a[1]; dx = dx_a[1]; code = code_a[1]; output;
  end;
run;


Setting up Aliases (Arrays)

data normal2c (keep = sid mid dob dx code);
  set bad;
  array dob_a dob1-dob5;
  array dx_a dx1-dx5;
  array code_a code1-code5;

  do c = 1 to 5 by 1;
    if (dob_a[c] ne . and dx_a[c] ne . and code_a[c] ne .) then do;
      mid = c; dob = dob_a[c]; dx = dx_a[c]; code = code_a[c]; output;
    end;
  end;
run;


Arrays

• You can tell SAS that a set of variables are related by putting them into an array statement.

• Arrays in SAS are not like arrays in other languages like BASIC or C. SAS arrays are only aliases to an existing set of variables. They are created using the array statement:

array times_a [365] time1-time365;

  – times_a: my notation for arrays (the name ends in _a)
  – [365]: an optional size of the array
  – time1-time365: what the array refers to
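A sketch of the statement in use, assuming a hypothetical data set dailyTimes that already contains numeric variables time1-time365. The data set name and the nRecorded variable are made up for illustration. Because the size is optional, [*] is used here and dim() supplies the count.

```sas
/* Assumed input: work.dailyTimes with numeric time1-time365. */
data timesSummary (drop = d);
  set dailyTimes;
  array times_a [*] time1-time365;     /* size left for SAS to count */
  nRecorded = 0;
  do d = 1 to dim(times_a);
    /* count the days that have a recorded value */
    if not missing(times_a[d]) then nRecorded = nRecorded + 1;
  end;
run;
```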


Arrays(2)

• If your array references variables that do not exist, they will be created. Make sure to use the $ if you intend to create character variables.

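• If you want to reference all numeric variables between theValue and thingy2, do it like this:

array x theValue-numeric-thingy2;

  – "--" (as in theValue -- thingy2) means all variables between and including the starting and ending variables, in the order they are stored in the data set.
  – "-numeric-" limits that range to the numeric variables. An array cannot mix numeric and character variables.
  – "-" (as in dx1-dx5) indicates the numbered sequence starting with the first variable and ending with the second.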
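A small sketch of arrays creating variables that do not exist yet. All names here (arrayDemo, flag1-flag3, label1-label3) are made up for illustration. The $ and the length after it make the second array's variables character.

```sas
/* No input data set: every variable here is created by the array statements. */
data arrayDemo (drop = i);
  array flag_a  [3]     flag1-flag3;     /* creates numeric flag1-flag3 */
  array label_a [3] $ 8 label1-label3;   /* creates character label1-label3, length 8 */
  do i = 1 to 3;
    flag_a[i]  = i;
    label_a[i] = cats('item', i);        /* item1, item2, item3 */
  end;
run;
```

Without the $, label1-label3 would be created as numeric and the assignment of text to them would fail.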