IX. ADVANCED DATA MANAGEMENT TOPICS

This section provides further detail on the data management commands listed below and on issues related to missing data:

• Write (Export)
• Delete File/Table
• Delete Records / Undelete Records
• Merge
• Relate

Write (Export) data

Exporting to other file types

The Write (Export) command saves the current data to a different Epi Info .MDB data file or to another file format supported by the command. With the Write command you can also specify which variables to write to the file and their order in the new file. As an example, Read the viewEvansCounty file in Sample.mdb (see the previous Read section) and write it to an Excel file. Click on Write (Export) in the Analysis Commands dialog box, and the Write dialog box is presented as shown in Figure 83.

Figure 83. Dialog box for Write (Export) command, Epi Info.

As seen in the Write dialog box, the All (*) option is selected by default. This option writes all variables from the current data set into the new data set. To exclude some variables from the new Data table, use the All (*) Except option; All (*) must first be unchecked before All (*) Except can be selected. You can also select individual variables from the variable box by right-clicking on them, after unchecking both All (*) and All (*) Except. For simplicity, this example writes all variables to the new data set, with All (*) checked.

Next, decide how the data should be written by using Output Mode, which determines whether the data being written will Append to or Replace the existing data set. With Replace checked, the new data replace any data set already in the output file; with Append checked, the new data are added to it. For this example, use Replace.

In the Output Formats box, click the down-arrow button and select Excel 4.0. The down-arrow button lists the data file formats available in Epi Info:

• Epi2000
• Access 97, 2000
• dBase III, IV, 5.0
• Paradox 3.x, 4.x, 5.x
• Excel 3.0, 4.0
• Epi Info 6
• Text (Delimited)

Pressing the button labeled “...” to the right of File Name displays a dialog box where you can select a folder in which to save the new file. Here, go to the C:\Epi_Info folder and type the file name EvansCounty. You will see .xls in the Save as type section of the dialog box. Click Save to accept the file name. Click OK, and EvansCounty.xls is written (exported) to the folder C:\Epi_Info. To check the accuracy of EvansCounty.xls, use the Read/Import command or open the file in Excel.

The Data table box can be used only when Output Formats is Epi2000 or Access; only then can you type a table name in the Data table box. Using the down-arrow button, the Data table box also lets you select an existing Data table to receive the output data set, which applies when you want to replace or append to an existing Data table. However, neither Epi Info view files nor the Data tables of views appear in the Data table list, because the Write command cannot be used to add data to a view file; in that case, use the Merge command. That is why you do not see the view file viewEvansCounty or its related Data table EvansCounty in the Data table box.

Similarly, you can create a new data set in other file formats (dBase, text, etc.), with different variables and different output modes, by following the steps above.
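The choices made in the Write dialog box (which variables to write, and whether to Append or Replace) can also be pictured outside Epi Info. The following Python sketch is only an illustration and is not Epi Info code: the function name write_export and its parameters are hypothetical, and it writes a comma-delimited text file rather than an Excel file.

```python
import csv
import os


def write_export(records, path, variables=None, exclude=None, mode="replace"):
    """Write records (a list of dicts) to a comma-delimited text file.

    variables: names to write, in the order given (None means all
               variables, comparable to All (*)).
    exclude:   names to leave out (comparable to All (*) Except).
    mode:      "replace" overwrites the file; "append" adds rows to it.
    Returns the list of variable names that were written.
    """
    if mode not in ("replace", "append"):
        raise ValueError("mode must be 'replace' or 'append'")

    if variables:
        names = list(variables)
    else:
        names = list(records[0].keys()) if records else []
    if exclude:
        names = [n for n in names if n not in exclude]

    # Appending to a file that does not exist yet behaves like replace.
    appending = mode == "append" and os.path.exists(path)
    with open(path, "a" if appending else "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=names, extrasaction="ignore")
        if not appending:
            writer.writeheader()  # a new or replaced file needs a header row
        writer.writerows(records)
    return names
```

For example, calling write_export(rows, "EvansCounty.csv", exclude=["SEX"]) on a list of records with the hypothetical variables ID, AGE and SEX would write only ID and AGE; calling it again with mode="append" would add the same rows below the existing ones.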

Delete File/Table

Delete File/Table is used to delete a file, a table within an Epi2000/Access file, or a view within an Epi2000/Access file (see Figure 84 for an example).

Figure 84. Dialog box for Delete File/Table, Epi Info.

As an example, Read the viewEvansCounty file, then use the Write (Export) command to save it as Delete_Me in the Sample.MDB file. Next, use Delete File/Table: in the dialog box click on Table, for the Database select Sample.MDB, and for the Table Name select Delete_Me.

Delete Records / Undelete Records

Using Delete Records you can either mark records for deletion or permanently remove records from the file (Figure 85). Records that are marked for deletion remain in the data file but are usually ignored during analyses. (Note: in the Set command, the usual setting for Process Records is Normal, i.e., perform analyses only on undeleted records; the two other options are to analyze both undeleted records and records marked for deletion [Both] or only records marked for deletion [Deleted].) The other option is to permanently remove records from the file.

As shown in Figure 85, you can choose the criteria that determine which records to delete, such as “*” to delete all records, or any other criteria, such as Age>50 or Sex=“M”, similar to the functions and mathematical comparisons described for Select (see Appendix 2). When the Run Silent option is not checked, the command makes a sound and pops up a small dialog box; when it is checked, neither the sound nor the pop-up window occurs.

Records marked for deletion can be undeleted using the Undelete Records command (Figure 86). Specific criteria can be given as to which records to undelete.

Figure 85. Dialog box for Delete Records command, Epi Info.

Figure 86. Dialog box for Undelete Records command, Epi Info.

(Note the inconsistency between the command name, Undelete Records, and the dialog box name, UNDELETE.)
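The difference between marking records for deletion and permanently removing them, and the effect of the Process Records setting, can be sketched in a few lines of Python. This is an illustration only, not Epi Info code: the function names and the "deleted" flag are hypothetical, and the way Epi Info stores the deletion mark internally is not described in this document.

```python
def delete_records(records, criteria, permanent=False):
    """Mark matching records for deletion, or remove them permanently.

    records:  list of dicts, one per record.
    criteria: function that returns True for records to delete,
              e.g. lambda r: r["Age"] > 50.
    """
    if permanent:
        # Permanently removed records are gone and cannot be undeleted.
        return [r for r in records if not criteria(r)]
    for r in records:
        if criteria(r):
            r["deleted"] = True  # hypothetical mark; the record stays in the file
    return records


def undelete_records(records, criteria):
    """Clear the deletion mark on records that match the criteria."""
    for r in records:
        if criteria(r):
            r["deleted"] = False
    return records


def process_records(records, setting="Normal"):
    """Return the records an analysis would use for a Process Records setting."""
    if setting == "Normal":   # only undeleted records
        return [r for r in records if not r.get("deleted", False)]
    if setting == "Deleted":  # only records marked for deletion
        return [r for r in records if r.get("deleted", False)]
    if setting == "Both":     # undeleted and marked records together
        return list(records)
    raise ValueError("setting must be 'Normal', 'Deleted' or 'Both'")
```

With the criteria Age>50, for instance, delete_records(people, lambda r: r["Age"] > 50) marks the matching records, after which process_records(people) returns only the remaining ones, while process_records(people, "Both") still returns every record.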

Relate files

In some situations you may want to Relate two files. Two common examples follow. With health clinic data, one file may contain information on an individual, such as name, age, sex, and address, and another may contain information on clinic visits. With survey data, one file may contain information at the household level and another may contain information on individuals. The investigator may want to Relate the two files and perform an analysis of the combined data table. A visual example is shown in Figure 87. To Relate two files, you must have a variable common to both data tables on which to link, such as a clinic ID number or a household number.

Figure 87. Relating two data tables: Data table A (main table) + Data table B (the other table that is to be related to the main table) → Data table C (a combination of A and B).

As an example, let's relate the data table viewFamily to another data table, viewPatient; both can be found in Refugee.MDB, an example file included with Epi Info. (Details of these files can be found in Appendix 1.) A partial listing of the viewFamily table, the viewPatient table, and the related file is shown in Figure 88.

Figure 88. The viewFamily table, the viewPatient table, and the related file.

viewFamily table
Line | Family Id Number | household | Date of Arrival: | Port of Entry: | Country of Origin: | Language spoken
1 | 1 | 1 | 12-22-1998 | NEW YORK | BOSNIA | 4
2 | 2 | 2 | 01-06-1999 | NEW YORK | BOSNIA | 4
3 | 3 | 3 | 01-20-1999 | NEW YORK | BOSNIA | 4
4 | 4 | 4 | 01-12-1999 | CALIFORNIA | VIETNAM | 3
5 | 5 | 5 | 01-20-1999 | NEW YORK | BOSNIA |

viewPatient table
Line | Today date | Family ID Number | BOH ID NUMBER: | BOH Re-entry
16229 | 04-07-1999 | 1 | 688174 | 688174
16230 | 01-11-1999 | 1 | 9569112 | 9569112
16231 | 03-18-1999 | 1 | 8251382 | 8251382
16232 | 03-19-1999 | 2 | 8188724 | 8188724
16233 | 08-16-1999 | 2 | 7335445 | 7335445

Related viewFamily and viewPatient tables
Line | Family Id Number | household | Date of Arrival: | Port of Entry: | Country of Origin: | Language spoken
1 | 1 | 1 | 12-22-1998 | NEW YORK | BOSNIA | 4
2 | 1 | 1 | 12-22-1998 | NEW YORK | BOSNIA | 4
3 | 1 | 1 | 12-22-1998 | NEW YORK | BOSNIA | 4
4 | 2 | 2 | 01-06-1999 | NEW YORK | BOSNIA | 4
5 | 2 | 2 | 01-06-1999 | NEW YORK | BOSNIA | 4
6 | 4 | 4 | 01-12-1999 | CALIFORNIA | VIETNAM | 3

Read the data table viewFamily (you will need to change the Data Source to C:\Epi_Info\Refugee.MDB). Then click the Relate command in Analysis Commands on the left, and the Relate dialog box will appear (Figure 89). Again, you will need to change the Data Source to C:\Epi_Info\Refugee.MDB. In the Views portion of the dialog box, click on viewPatient, the table you want to relate.

You must supply a Key variable that exists in both tables so that records can be related; to do so, click the Build Key button. The Relate - Build Key dialog box appears (Figure 90). With Current Table(s) (viewFamily) selected, click the down arrow next to the Available Variables box and select the key variable FAMIDNUM. Then click OK. Select the Related Table (viewPatient) and once again click the down arrow next to Available Variables to select FAMIDNUM. Click OK again to close the Relate - Build Key dialog box and return to the Relate dialog box. The Key at the bottom of the Relate dialog box will now read FAMIDNUM :: FAMIDNUM. Click the OK button, and the relationship between the files will be created, with the message shown in Figure 91 presented in the Analysis Output window.

Figure 89. Dialog box for Relate command, Epi Info.

Figure 90. Dialog box for Relate - Build Key, Epi Info

Figure 91. Example output from Relate command.

Current View: C:\Epi_Info\Refugee.MDB:viewFamily
Relate: LNK_2
Record Count: 1772 (Deleted records excluded) Date: 6/29/2005 10:53:25 AM

One option when relating files (Figure 89) is Use Unmatched (All). If this option is selected by clicking on the box, the related file will contain all records from both files whether or not they can be related to one another; when this box is not checked, only records that can be related to one another will be in the related file. Note that more than two tables can be related and that the common identifier may span several fields.
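The logic of Relate, including the Use Unmatched (All) option, can be sketched in Python as follows. This is an illustration only, not Epi Info code: the function name relate is hypothetical, and letting the main table's value win when both tables have a field with the same name is an assumption, since this document does not describe how Epi Info handles such fields.

```python
def relate(main, related, main_key, related_key, use_unmatched=False):
    """Combine each main record with every related record sharing its key.

    main, related: lists of dicts (one dict per record).
    use_unmatched: if True, records that cannot be related are kept too,
                   as described for Use Unmatched (All).
    """
    # Index the related table by its key for quick lookup.
    index = {}
    for rec in related:
        index.setdefault(rec[related_key], []).append(rec)

    combined = []
    matched_keys = set()
    for m in main:
        partners = index.get(m[main_key], [])
        if partners:
            matched_keys.add(m[main_key])
            for p in partners:
                # One output record per matching pair; on a name clash the
                # main table's value is kept (an assumption of this sketch).
                combined.append({**p, **m})
        elif use_unmatched:
            combined.append(dict(m))

    if use_unmatched:
        # Related records whose key never appeared in the main table.
        for key, partners in index.items():
            if key not in matched_keys:
                combined.extend(dict(p) for p in partners)
    return combined
```

Applied to the partial listings in Figure 88 with FAMIDNUM as the key in both tables, the five family records and five patient records shown would yield five related records (three for family 1 and two for family 2); with use_unmatched=True, families 3, 4 and 5 would be kept as well.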

Merge files

Here we describe two ways to Merge files in Epi Info: Append and Update. The first approach is to Read a file and Append (or concatenate) records from another file to the master file (Figure 92). An example of this approach is when two people enter data from a study on separate computers and you would like to combine the two files into one file.

Figure 92. Conceptual approach to use of Merge using Append option.

Read Master Table (ID Ltr): 1 A, 2 B, 3 C, 4 D, 5 E
+ Merge Second Table (ID Ltr): 6 F, 7 G, 8 H, 9 I, 10 J
→ Append → Merged Table (ID Ltr): 1 A, 2 B, 3 C, 4 D, 5 E, 6 F, 7 G, 8 H, 9 I, 10 J

The second approach is to Update a file: a file is Read, and its records are then updated with information from the Merge table when the key matches. Only fields found in both datasets with a non-empty value in the Merge table will be replaced. A conceptual example is presented in Figure 93. A practical example would be a state health department reportable disease system in which a master file is kept at the state level and a local health department sends a table with updated information.

Figure 93. Conceptual approach to use of Merge using Update option.

Read Master Table (ID Ltr): 1 A, 2 B, 3 C, 4 D, 5 E
+ Merge Second Table (ID Ltr): 3 F, 5 G
→ Update → Merged Table (ID Ltr): 1 A, 2 B, 3 F, 4 D, 5 G

In general, the steps are:
• Read a master file
• Use Merge (see Figure 94 for the dialog box)
  o Select a table or file
  o Choose either Update or Append or both
  o Provide one or more Key variables by pressing the Build Key button and completing the Relate – Build Key dialog box (see Figure 90)
  o Click the OK button on both dialog boxes

Figure 94. Dialog box for Merge command, Epi Info.
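The Append and Update options of Merge can be sketched in Python using the conceptual tables of Figures 92 and 93. This is an illustration only, not Epi Info code: the function name merge and its parameters are hypothetical. The sketch also assumes that Append adds only records whose key is not already present in the master table; this document does not state how Epi Info treats an appended record whose key already exists.

```python
def merge(master, second, key, append=True, update=True):
    """Merge `second` into a copy of `master`, matching records on `key`.

    append: add records from `second` whose key is not in the master
            (assumption of this sketch).
    update: for matching keys, replace master values with non-empty
            values from `second`, for fields found in both tables.
    """
    merged = [dict(r) for r in master]      # leave the master untouched
    by_key = {r[key]: r for r in merged}

    for rec in second:
        target = by_key.get(rec[key])
        if target is None:
            if append:
                new = dict(rec)
                merged.append(new)
                by_key[rec[key]] = new
        elif update:
            for field, value in rec.items():
                # Only fields present in both tables with a non-empty
                # value in the second table are replaced.
                if field in target and value not in (None, ""):
                    target[field] = value
    return merged
```

With the master table of Figure 92 (IDs 1 to 5, letters A to E) and a second table holding IDs 6 to 10, merge(master, second, "ID", update=False) returns ten records, A through J. With the second table of Figure 93 (3 F and 5 G), merge(master, second, "ID", append=False) returns the letters A, B, F, D, G.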

Acknowledgments

We would like to thank Andrew Dean, MD, MPH, for his comments and suggestions on this document. Should you have any suggestions to improve this document, please feel free to contact Kevin Sullivan at [email protected]. This document was made possible, in part, by a grant from the Bill and Melinda Gates Foundation.

References

Kleinbaum DG. Survival Analysis: A Self-Learning Text. Springer Verlag Publishers, 1996.
Kleinbaum DG, Klein M. Logistic Regression: A Self-Learning Text, 2nd Ed. Springer Verlag Publishers, 2002.
Kleinbaum DG, Kupper LL, Morgenstern H. Epidemiologic Research: Principles and Quantitative Methods. John Wiley and Sons Publishers, New York, 1982.
Kleinbaum DG, Kupper LL, Muller KE, Nizam A. Applied Regression Analysis and Multivariable Methods, 3rd Ed. Duxbury Press, 1998.
Kleinbaum DG, Sullivan KM, Barker N. ActivEpi Companion Textbook. Springer Verlag Publishers, 2003.
Rosner B. Fundamentals of Biostatistics, 5th Ed. Duxbury, Pacific Grove, 2000.

APPENDICES

Appendix 1. Data Dictionaries

This appendix contains the data dictionaries for the examples in this document, in alphabetical order. The files in Sample.mdb are:

Addicts Anderson BFmeasles Chemo Myeloma Stanford Vets viewAddfull viewAgeWithCount viewBabyBloodPressure

viewEpi1 viewEpi10 viewEstriolandBirthweight viewEvansCounty viewhmohiv viewLasum viewLEUKEM2 viewOswego viewRely viewSmoke

The files in the Refugee.mdb for merging or relating datasets are:

viewFamily viewPatient

Addicts – Survival Analysis

These data are based on a cohort study of 238 heroin addict patients, comparing the treatment effectiveness of one clinic with that of another. The number of days from entry to a clinic until departure was the outcome variable. This is an example file called ‘Addicts’ in the text by Kleinbaum. The data were originally provided by John Caplehorn (The University of Sydney, Department of Public Health).

Reference: Kleinbaum DG. Survival Analysis: A Self-Learning Text. Springer-Verlag, New York, 1996.

File Name: Addicts
Project: Sample.mdb
Number of records: 238

Variable | Label | Values/Description | Freq
Main predictor of interest. This is the exposure variable which assigns the study subjects into clinic 1 and clinic 2. | Clinic | 1 = clinic 1; 2 = clinic 2 | 161; 77
Censored variable. This is the variable which denotes whether the patient has developed an event (exit from clinic) or not. | status | 0 = censored; 1 = uncensored (exit from clinic) | 150; 88
Survival time in days from entry to a clinic until departure. This is the outcome variable “time to an event”. | Survival_Time_Days | Range: 2-1076 days; Mean: 404.6555; Median: 367.5 |
Past history of imprisonment | Prison_Record | 0 = No; 1 = Yes | 126; 112
Daily dose of Methadone substitute (mg/day) | Methadone_dose__mg_day_ | Range: 20-110; Mean: 60.542; Median: 60 |

Anderson – Survival Analysis

These data are from a clinical trial of remission times in weeks among 42 leukemia patients, comparing the effect of 6-mercaptopurine with placebo. The duration of the relapse-free period after treatment or placebo was the outcome variable. This is an example file called ‘Anderson’ in “Survival Analysis: A Self-Learning Text” by Kleinbaum. The data are originally from Freireich et al.

Data source: Freireich et al. The effect of 6-mercaptopurine on the duration of steroid-induced remissions in acute leukemia. Blood 21: 699-716, 1963.

File Name: Anderson
Project: Sample.mdb
Number of records: 42

Variable | Label | Values/Description | Freq
Survival time in weeks until relapse. This is the outcome variable “time to an event”. | Stime | Range: 1-35 weeks; Mean: 12.881; Median: 10.5 |
Censored variable. This is the variable which denotes whether the patient has developed an event (relapse) or not. | status | 0 = censored; 1 = relapsed | 12; 30
Gender | sex | 0 = female; 1 = male | 22; 20
Log value of white blood cells | Log_wbc | Range: 1.4-5; Mean: 2.9302; Median: 2.8 |
Main predictor of interest. This is the exposure variable (treatment or placebo) randomly assigned to the leukemia patients. | Rx | 1 = placebo; 0 = treatment | 21; 21

BFMeasles - Measles Outbreak Investigation

These are test data provided courtesy of the Epi Info working group, Epidemiology Program Office, CDC. Thanks to Roger Friedman for sharing the data information for this document.

File Name: BFMeasles
Project: Sample.mdb
Number of records: 262

Variable | Label | Values/Description | Freq
Location code (expressed as text in the fields Province, District, Town, Village/Neighborhood) | EPID | From BFA-TEN-OUA-01-0005 to BFA-BOB-DAN-02-1297 | 262
Name of Province of patient | PROVINCE | 11 provinces ranging from BANFORA to TENKODOGO (alphabetically) | 262
Name of District of patient | DISTRICT | 40 districts ranging from BANFORA to ZORGHO (alphabetically) | 262
Name of Town | TOWN | |
Name of Village/Neighborhood | VILLNEIG | 160 village/neighborhoods ranging from ABSINDO to ZOUMAMISSIRI (alphabetically); (.) missing | 227; 35
A location code which matches that on the map file used to display the data. | AMAPCODE | 40 codes ranging from BFA BAN BAN to BFA TEN ZAB |
Name of nearest Hospital Facility responsible for the patient. | NEARHF | 136 facilities; (.) missing | 256; 6
Unknown code | UR | 1; 2 | 36; 226
Date of birth | DOB | 02/27/1999; (.) missing. Note: Date format is ‘month, day and 4 digit year’ | 1; 261
Age of patient (years) | AGEYR | every value is ‘3’ |
Age of patient (months) | AGEMO | Range: 1-4; Mean: 2; Median: 1.5; (.) missing | 6; 256
Gender | SEX | F female; M male |
Date of notification | DNOT | Range: 01/18/2001-07/17/2002 (date format is ‘month, day and 4 digit year’); (.) missing | 255; 7
Date of investigation | DOI | Range: 12/19/2001-07/17/2002 (same format as above); (.) missing | 68; 194
Date of onset of illness | DONSET | Range: 01/17/2001 – 07/10/2002 (same format as above) | 262
Status of patient: died or alive | DIED | 1 yes; 2 no; 9 unknown; (.) missing | 10; 198; 53; 1
Number of doses of vaccine | DOSES | 0 not vaccinated; 1 vaccinated 1 time; 9 unknown | 42; 24; 196
Date of last vaccination | DVAC | Range: 05/07/1998 – 03/11/2002; (.) missing | 20; 242
Date of sample collection | DCOLL | Range: 12/19/2001 – 07/17/2002; (.) missing | 56; 206
Date the sample was sent to lab | DSENT1 | Range: 01/06/2002 – 04/02/2002; (.) missing | 6; 256
Date the sample was received at the lab | DREC1 | Range: 01/07/2002 – 04/15/2002; (.) missing | 7; 255
Date of result received from lab | DRESULT1 | Range: 01/22/2002 – 07/24/2002; (.) missing |
Result of measles immunoassay test | INDIR | 1 positive; 2 negative; 3 indeterminate; (.) missing | 38; 10; 2; 212
Result of rubella test | RUBTEST | 1 positive; 2 negative; 3 indeterminate; (.) missing | 1; 11; 1; 249
Name of investigator | INVESTIGAT | (.) missing | 262
Result of investigation (in French) | INVRESULT | Positive result value in French; (.) missing | 249; 13
Case categories 1-5 (meanings are unknown) | CLASS2 | 1; 3; 4; 5 | 38; 208; 10; 6
Case categories 1-5 (meanings are unknown) | CLASS | 1; 2; 3; 4; 5; (.) missing | 38; 7; 56; 135; 19; 7

Chemo – Survival Analysis

These data are from a clinical trial on gastric carcinoma by Stablein et al., involving 95 patients randomized to either chemotherapy alone or a combination of chemotherapy and radiation, in order to assess treatment outcome. The number of days from treatment until death was the outcome variable. This is an example file called ‘Chemo.dat’ in the self-learning text by Kleinbaum.

Data source: Stablein DM, Carter WH Jr, Novak JW. Analysis of survival data with nonproportional hazard functions. Controlled Clinical Trials 2(2): 149-59, 1981.

File Name: Chemo
Project: Sample.mdb
Number of records: 95

Variable | Label | Values/Description | Freq
Main predictor of interest. This is the exposure variable which denotes either ‘chemotherapy alone’ or a combination of ‘chemotherapy and radiation’. | Rx | 1 = chemotherapy alone; 2 = chemotherapy and radiation | 47; 48
Censored variable. This is the variable which denotes whether the patient has developed an event (death) or not. | status | 0 = censored; 1 = died | 17; 78
Survival time in days from treatment until death. This is the outcome variable “time to an event”. | STime | Range: 1-1519 days; Mean: 529.1368; Median: 401 |

Myeloma – Survival Analysis

These data are based on a study at the Medical Centre of the University of West Virginia, USA, which examined the association between several possible explanatory variables and the survival time of patients. The response variable was the time (in months) from diagnosis until death from multiple myeloma. The data in the table were reported in Krall et al. and relate to 48 patients ranging in age from 50 to 80 years.

Reference: Krall JM, Uthoff VA, Harley JB. A step-up procedure for selecting variables associated with survival. Biometrics 31: 49-57, 1975.

File Name: Myeloma
Project: Sample.mdb
Number of records: 48

Variable | Label | Values/Description | Freq
Identification number | PATIENT | Range: 1-48 |
Survival time in months from entry to the study until death. This is the outcome variable “time to an event”. | STIME | Range: 1-91; Mean: 23.375; Median: 14.5 |
Censored variable. This is the variable which denotes whether the patient has developed an event (died) or not. | STATUS | 0 = censored; 1 = died | 12; 36
Age of patients (years) | AGE | Range: 50-77; Mean: 62.8958; Median: 62.5 |
Gender | SEX | 1 = male; 2 = female | 29; 19
Blood urea nitrogen (mg%) | BUN | Range: 6-172; Mean: 33.9167; Median: 21 |
Serum calcium (mg%) | CA | Range: 8-15; Mean: 9.9375; Median: 10 |
Hemoglobin (mg%) | HB | Range: 4.9-14.6; Mean: 10.2521; Median: 10.2 |
Percentage of plasma cells in the bone marrow (%) | PC | Range: 3-100; Mean: 42.9375; Median: 33 |
Presence of Bence-Jones protein in the urine | BJ | Yes; No | 15; 33

Stanford – Survival Analysis

These data are based on a Stanford heart transplant study by Kalbfleisch et al., involving 249 patients who either received a transplant or did not, with varying waiting times before the transplant. The study was conducted to assess the effect of different patient attributes on survival time among those who received transplants, as well as to compare survival time between patients with and without heart transplants. The survival time, a combination of pre-transplant survival time and post-transplant survival time (if any), was the outcome variable. This is an ideal example for the extended Cox model, which takes into account the different pre-transplant survival times (waiting times), because patients change treatment status during the course of the study. The data file, called ‘Stanf.dat’, can be found in “Survival Analysis: A Self-Learning Text” by Kleinbaum.

Data source: Kalbfleisch J, Prentice R. The Statistical Analysis of Failure Time Data. John Wiley and Sons, New York, 1980.

File Name: Stanford
Project: Sample.mdb
Number of records: 249

Variable | Label | Values/Description | Freq
Survival time from entry to the study until death before the transplant, or until the transplant. | PRE_TRANSPLANT_SURVIVAL_TIME | Range: 0-340 days; Mean: 40.7068; Median: 26 |
Censored variable 1. This is the variable which denotes whether the patient has died or not at the first end-point (the time of transplant). | STATUS | 0 = censored; 1 = died; (.) = missing | 193; 55; 1
Survival period from the time of transplant until death or until the patient is censored. | POSTTRANSPLANT_SURVIVAL_TIME | Range: 0-3694 days; Mean: 696.9348; Median: 351; (.) = missing | 184; 65
Censored variable 2. This is the variable which denotes whether the patient has died or not at the time of the second end-point (Feb 1980). | STATUS_AT_SECOND_ENDPOINT | 0 = censored; 1 = died; (.) = missing | 65; 119; 65
Age of patient at the time of transplant | AGE | Range: 12 – 64 years; Mean: 41.0924; Median: 44; (.) = missing | 184; 65
Tissue mismatch score | TISSUE_MISMATCH_SCORE | Range: 0-3.05; Mean: 1.1166; Median: 1.04; (.) = missing | 157; 92

Vets – Survival Analysis

These data are from the Veterans’ Administration lung cancer trial among 137 patients with pulmonary carcinoma, comparing the effectiveness of a test treatment with the standard treatment. The survival time in days until death was the outcome variable. These data were originally provided by Kalbfleisch et al. and are used as an example data file called ‘Anderson.dat’ in “Survival Analysis: A Self-Learning Text” by Kleinbaum.

Data source: Kalbfleisch J, Prentice R. The Statistical Analysis of Failure Time Data. John Wiley and Sons, New York, 1980.

File Name: Vets
Project: Sample.mdb
Number of records: 99

Variable | Label | Values/Description | Freq
Main predictor of interest. This is the exposure variable which assigns the study subjects into test and standard. | treatment | 1 = standard; 2 = test | 69; 30
Cancer cell type: large cell | cell_type_1 | 0 = other; 1 = large cell | 84; 15
Cancer cell type: adeno cell | cell_type_2 | 0 = other; 1 = Adeno cell; (.) = missing | 89; 9; 1
Cancer cell type: small cell | cell_type_3 | 1 = Small cell; 0 = other | 59; 40
Cancer cell type: squamous cell | cell_type_4 | 1 = Squamous cell; 0 = other | 64; 35
Survival time in days until death. This is the outcome variable “time to an event”. | STime | Range: 1-999 days; Mean: 136.8889; Median: 95 |
Performance status (0 = worst, ..., 100 = best) | performance_status | Range: 20-90; Mean: 9.0202; Median: 6 |
Disease duration (months from diagnosis) | disease_duration | Range: 1-58 months; Mean: 404.6555; Median: 367.5 |
Age of patients (years) | age | Range: 34-81; Mean: 58.4343; Median: 60 |
History of prior therapy | prior_therapy | 0 = none; 10 = some | 68; 31
Censored variable. This is the variable which denotes whether the patient has died or not. | status | 0 = censored; 1 = death | 8; 91

ViewADDFULL - Attention deficit disorder

Note: we were not able to find more details on this data file.

File Name: ViewADDFULL
Project: Sample.mdb
Number of records: 359

Variable | Label | Values/Description | Freq
Gender of patient | GENDER | 1 female??; 2 male?? | 198; 161
? | REPEAT | 0 no history of repetition; 1 history of repetition; (.) missing | 324; 34; 1
? | ENGL | 1; 2; 3; (.) missing | 40; 254; 46; 19
? | ENGG | 0; 1; 2; 3; 4; (.) missing | 11; 37; 122; 135; 41; 13
? | OLMAT | Range: 55-137; Mean: 102.7333; Median: 103; (.) missing | 210; 149
? | KF | Range: 75-129; Mean: 104.8444; Median: 105; (.) missing | 90; 269
? | GPA | Range: 0-4; Mean: 2.3797; Median: 2.5; (.) missing | 347; 12
? | SOCPROB | 0; 1; (.) missing | 304; 44; 11
? | SCORE2 | Range: 25-90; Mean: 53.3287; Median: 52 |
? | SCORE4 | Range: 22-90; Mean: 52.8936; Median: 53; (.) missing | 357; 2
? | SCORE5 | Range: 22-87; Mean: 53.2696; Median: 52; (.) missing | 319; 40
? | DROPOUT | 0 no history of dropout; 1 history of dropout; (.) missing | 297; 46; 16
? | ADDSC | Range: 24.6667-80; Mean: 53.1068; Median: 53 |
? | IQ | Range: 55-137; Mean: 102.3712; Median: 103; (.) missing | 233; 126

viewAgeWithCount

File name: viewAgeWithCount
Project: Sample.mdb
Number of records: 16
Number of observations: 85

Variable | Label | Values/Description | Freq
 | RecordNumber | Range: 1-10 |
 | Age | Range: 1-10 |
 | Count | Range: 1-20 |

viewBabyBloodPressure - Hypertension in Infants

In these data, birth weight and systolic blood pressure were measured in 16 infants. Systolic blood pressure is the dependent variable, and birth weight and age of the infant are independent variables.

Reference: Rosner B. Fundamentals of Biostatistics, 5th Ed. Duxbury, 2000.

File name: viewBabyBloodPressure
Project: Sample.mdb
Number of records: 16

Variable | Label | Values/Description | Freq
Birth weight of infant (in ounces); an independent variable | Birthweight | Range: 90-160; Mean: 120.31; SD: 18.75 |
Age in days; an independent variable | AgeInDays | Range: 2-5; Mean: 3.31; SD: 0.95 |
Systolic blood pressure (mm Hg); the dependent variable | SystolicBlood | Range: 77-98; Mean: 88.06; SD: 6.69 |

viewEpi1 - Complex Survey Data based on the Expanded Program on Immunization (EPI) method
These data are based on a 30-cluster survey using the Expanded Program on Immunization (EPI) methodology. Using this methodology, 30 communities (i.e., clusters) are selected from a listing of all communities in a geographic area using the proportional to population size (PPS) sampling technique. The PPS methodology is self-weighted, i.e., statistical weights are not necessary when analyzing the data. Survey teams visit each cluster and, using one of several sampling techniques, visit households to identify seven children in the appropriate age range and assess their immunization status. The EPI survey is frequently referred to as a 30x7 cluster design, i.e., 30 clusters, each with 7 children.
File name: viewEpi1
Project: Sample.mdb
Number of records: 210

Variable (label): values/description, with frequencies in brackets
CLUSTER (The cluster in which the individual lived): Range: 1-30
PRENATAL (Whether or not the mother had received prenatal care for the child being assessed): 1 = received prenatal care [87]; 2 = no prenatal care [123]
VAC (Whether the child was vaccinated): 1 = vaccinated [155]; 2 = not vaccinated [55]


viewEpi10 - Complex Survey Data based on the Expanded Program on Immunization (EPI) method with 10 strata
The viewEpi10 file is an example of a country performing an EPI survey in each of its 10 provinces, i.e., there were 10 separate EPI surveys carried out, one in each province. This is considered a stratified cluster survey. The viewEpi10 data have the same variables as viewEpi1 plus two additional variables: a numeric variable identifying the province in which the child lived (LOCATION) and a variable that takes into account the differences in population sizes of the provinces (POPW). To calculate national estimates, it is important to take into account the population size of each province. The weighting scheme is presented in Table A1; each weight is calculated as the population size of the location divided by the number of children sampled in that location. In Location 1, each child sampled represents 43.87 children; in Location 8, each child sampled represents 853.02 children. Please note that there are other methods for weighting data than the one presented here.

Table A1. Population weights for children in each location

Location   Population   Sample   POPW
1          9,870        225      43.87
2          33,600       219      153.42
3          14,130       212      66.65
4          27,900       219      127.40
5          12,750       212      60.14
6          15,810       214      73.88
7          16,050       210      76.43
8          180,840      212      853.02
9          9,030        217      41.61
10         25,650       212      120.99
Total      345,630      2,152

POPW = Population/Sample

File name: viewEpi10
Project: Sample.mdb
Number of records: 2152

Variable (label): values/description, with frequencies in brackets
LOCATION (Variable with codes for the 10 strata): Range: 1-10
POPW (Statistical weight used to obtain unbiased national estimates, taking into account strata population sizes): Range: 41.61-853.02
CLUSTER (Variable specifying cluster number): Range: 1-30
PRENATAL (Whether or not the mother had received prenatal care for the child being assessed): 1 = received prenatal care [1088]; 2 = no prenatal care [1064]
VAC (Whether or not the child was vaccinated): 1 = vaccinated [1242]; 2 = not vaccinated [910]
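The weights in Table A1 can be cross-checked outside Epi Info. The following is a minimal Python sketch, not Epi Info syntax; the names TABLE_A1, population_weight, and weights_by_location are illustrative choices, and the population and sample figures are copied from Table A1.

```python
# Population and sample size per location, copied from Table A1.
TABLE_A1 = {
    1: (9870, 225),
    2: (33600, 219),
    3: (14130, 212),
    4: (27900, 219),
    5: (12750, 212),
    6: (15810, 214),
    7: (16050, 210),
    8: (180840, 212),
    9: (9030, 217),
    10: (25650, 212),
}


def population_weight(population, sample):
    """POPW = Population / Sample: children represented by each sampled child."""
    return population / sample


def weights_by_location(table):
    """Return {location: POPW rounded to two decimals}."""
    return {
        loc: round(population_weight(pop, n), 2)
        for loc, (pop, n) in table.items()
    }


if __name__ == "__main__":
    for loc, weight in weights_by_location(TABLE_A1).items():
        print(f"Location {loc}: POPW = {weight:.2f}")
```

Because the weight is a simple ratio, a province with a large population and an ordinary sample size (Location 8) receives a weight roughly twenty times that of the smallest provinces.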

viewEstriolandBirthweight - Estriol and Birth Weight Data
These data, from Greene and Touchstone, are used as an example in the text by Rosner to study the relationship between the estriol level in pregnant women and birth weight.
Reference: Rosner B. Fundamentals of Biostatistics, 5th Ed. Duxbury, 2000.
File name: viewEstriolandBirthweight
Project: Sample.mdb
Number of records: 31

Variable (label): values/description
ESTRIOL (Estriol level of pregnant woman, mg/24 hr): Range: 7-27, Mean: 17.23, SD: 4.75
BIRTHWEIGHT (Birth weight of infant, g/100): Range: 24-43, Mean: 32.0, SD: 4.74


viewEvansCounty - Evans County Heart Disease Study Data
The data are based on the Evans County heart disease cohort study on the seven-year incidence of coronary heart disease in 609 white males. The variable CAT (endogenous catecholamine level) was fabricated for illustrative purposes and dichotomized into categories "high" (top quintile of cohort values) and "low." There are no missing values in this dataset. Thanks to Dr. David Kleinbaum for making the data available.
Reference: Kleinbaum DG, Kupper LL, Morgenstern H. Epidemiologic Research: Principles and quantitative methods. Lifetime Learning Publications, Belmont, California, 1982.
File name: viewEvansCounty
Project: Sample.mdb
Number of records: 609

Variable (label): values/description, with frequencies in brackets
ID (Identification Number): Range: 21-19161
CHD (Coronary Heart Disease): No = not a case [538]; Yes = case [71]
AGE (Age, years): Range: 40-76, Mean: 53.71, SD: 9.26
CAT (Catecholamine Level): No = low [487]; Yes = high [122]
CHL (Serum Cholesterol, mg/100 mL): Range: 94-357, Mean: 211.74, SD: 39.83
DBP (Diastolic Blood Pressure, mmHg): Range: 60-170, Mean: 91.18, SD: 14.50
ECG (Electrocardiogram): No = normal ECG [443]; Yes = abnormal ECG [166]
HEM (Hematocrit, percent): Range: 29-58, Mean: 46.26, SD: 3.47
MAR (Marital Status): No = not married [64]; Yes = married [545]
OCC (Occupation): 1 = ? [365]; 2 = ? [244]
PLS (Pulse, beats/min): Range: 45-120, Mean: 74.59, SD: 12.67
QTI (Quetelet Index*): Range: 2.121-6.041, Mean: 3.62, SD: 0.59
SBP (Systolic Blood Pressure, mmHg): Range: 92-300, Mean: 145.48, SD: 27.50
SES (Socioeconomic Status, McGuire-White index): Range: 20-84, Mean: 57.86, SD: 13.62
SMK (Cigarette Smoking): No = never smoked [222]; Yes = smoker [387]
AGEG1 (Age Group 1, years): No = LT 55 [358]; Yes = GE 55 [251]
AGEG2 (Age Group 2, years): 1 = 40-44 [109]; 2 = 45-49 [138]; 3 = 50-54 [111]; 4 = 55-59 [92]; 5 = 60-64 [63]; 6 = 65-69 [52]; 7 = 70+ [44]
CHLG (Cholesterol Group): No = LT 250 [504]; Yes = GE 250 [105]
QTIG (QTI Group): No = LT 3.57 [306]; Yes = GE 3.57 [303]
SESG (SES Group): No = GE 57 [330]; Yes = LT 57 [279]
HPT (Hypertension): No = SBP<160 & DBP<95 [354]; Yes = SBP>159 or DBP>94 [255]

GE = greater than or equal to; LT = less than
*100 x (weight in pounds)/(height in inches)^2

viewhmohiv - Survival analysis
These data are provided courtesy of the Epi Info development team, Epidemiology Program Office, CDC.
File name: viewhmohiv
Project: Sample.mdb
Number of records: 100

Variable (label): values/description, with frequencies in brackets
ID (Identification of patient): Range: 1-100
TIME1 (Survival time): Range: 1-60, Mean: 11.36, Median: 5
AGE (Age): Range: 20-54, Mean: 36.07, Median: 35
DRUG (Exposure): 0 placebo [51]; 1 treatment [49]
CENSOR: 0 censored [20]; 1 event [80]
ENTDATE (The date that the patient first entered the study): Range: 1-12-1989 to 12-27-1991; Format: mm-dd-yyyy
ENDDATE (The date that the patient was last observed): Range: 2-15-1989 to 11-13-1995; Format: mm-dd-yyyy

ViewLasum - Estrogen and Endometrial Cancer Matched Case-Control Study (weighted analysis)
These data come from a Los Angeles study of whether exogenous estrogen use is related to endometrial cancer among 315 participants. The study design is a matched case-control study in which each of the 63 cases with endometrial cancer is matched to four control women who were born within one year of the case, had the same marital status, and lived in the same retirement community for the same length of time. Please note that the data set is in summary file format: individual records with similar characteristics were summarized into 25 groups. This study can be used as an example for conditional logistic regression analysis, taking into account the count (frequency) variable.
Reference: Breslow and Day. Statistical methods in cancer research: Volume 1 – The analysis of case-control studies. Lyon: International Agency for Research on Cancer, 1980.
File name: viewLasum.dat
Project: Sample.mdb
Number of summary records: 25
Number of observations: 315

Variable (label): values/description, with frequencies in brackets
OBS (Obesity): 0 not obese [97]; 1 obese [167]; (.) missing [51]
DOS (Estrogen conjugated dose, mg/day; an exposure variable): 0 = none [8]; 1 = 0.1-0.299 [155]; 2 = 0.3-0.625 [61]; 3 = 0.626+ [56]; (.) = unknown [35]
OUTCOME (Disease outcome; a dependent variable): 0 no [252]; 1 yes [63]
COUNT (A weight variable; summary number of records): Range: 1-61

viewLeukem2 - Survival Analysis
This is a clinical trial studying remission times in weeks of 42 leukemia patients to compare the effect of the drug 6-mercaptopurine (6-MP) with placebo. The duration of the relapse-free period after treatment or placebo was the outcome variable. Please note that these data are the same as ‘Anderson’ (mentioned earlier), but the covariates ‘sex’ and ‘logwbc’ have been omitted.
File name: viewLeukem2
Project: Sample.mdb
Number of records: 42

Variable (label): values/description, with frequencies in brackets
ID (Identification of patient): Range: 1-42
Rx (Main predictor of interest; the exposure variable, 6-mercaptopurine vs. placebo, randomly assigned to the patients): placebo [21]; 6-MP [21]
status (Censoring variable; denotes whether the patient developed an event (exit from clinic)): 0 = censored [12]; 1 = relapsed [30]
Stime (Survival time in weeks until relapse; this is the outcome variable “time to an event”): Range: 1-35 weeks, Mean: 12.8810, Median: 10.5

viewOswego - Oswego Classical Study of Disease Outbreak Investigation
These data are based on a classical study of an outbreak of acute gastrointestinal illness in the village of Lycoming, Oswego County, New York, reported to the District Health Officer in Syracuse on April 19, 1940. It was learned that all persons known to be ill had attended a church supper the previous evening, April 18. Accordingly, the goal of the study was to find which food or foods caused the outbreak. The outcome variable is disease (yes/no). Possible risk factors (predictor variables) are the foods and drinks consumed. Interviews regarding the presence of symptoms, including the day and hour of onset, and the food consumed at the church supper, were completed on 75 of the 80 persons known to have been present. A total of 46 persons who had experienced gastrointestinal illness were identified.
Reference: The data and information for this outbreak are derived from an educational program developed by the CDC in Atlanta, and provided by Dr. A. M. Rubin, then an epidemiologist-in-training, who actually conducted the investigation.
File name: viewOswego
Project: Sample.mdb
Number of records: 75

Variable (label): values/description, with frequencies in brackets
AGE (Age of patient, years): Range: 3-77, Mean: 36.8133, Median: 36
SEX (Gender): Female [44]; Male [31]
ILL (Outcome variable: diarrheal illness): Yes [46]; No [29]

Food items
BAKEDHAM: Yes [46]; No [29]
SPINACH: Yes [43]; No [32]
MASHEDPOTA: Yes [37]; No [37]; (.) [1]
CABBAGESAL: Yes [28]; No [47]
JELLO: Yes [23]; No [52]
ROLLS: Yes [37]; No [38]
BROWNBREAD: Yes [27]; No [48]
FRUITSALAD: Yes [6]; No [69]

Beverages
MILK: Yes [4]; No [71]
COFFEE: Yes [31]; No [44]
WATER: Yes [24]; No [51]

Desserts
CAKE: Yes [40]; No [35]
VANILLA: Yes [54]; No [21]
CHOCOLATE: Yes [47]; No [27]; (.) [1]

DATEONSET (Date of onset of illness; mm-dd-yyyy, time): 04-18-1940; 3pm - 04-19-1940; 10:30am
TIMESUPPER (Date of supper; mm-dd-yyyy, time): 04-18-1940; 12am - 04-18-1940; 10pm
NAME (Name code of patient): Range: patient1-patient75
CODE_RW (Identification number): Range: P1-P75

(.) = missing value

viewRely - Rely Tampons and Toxic Shock Syndrome Matched Case-Control Data
This is an example of a matched case-control data set where cases (women who were diagnosed with toxic shock syndrome) were each matched to four controls. The specifics of the matching are not provided, but it was probably based on age and geographic location. As mentioned in the Match command section, the ID is repeated five times: once for the case and then for each of the four matched controls.

File name: viewRely
Project: Sample.mdb
Number of records: 56

Variable (label): values/description, with frequencies in brackets
ID (Identification Number; an ID number that links each case with their individually matched controls): Range: 1-14
CASE (Case of toxic shock syndrome? Outcome variable which divides the study group into cases and controls): No = control [42]; Yes = case [14]
RELY (Use of Rely tampons? Exposure variable which separates the group into exposed and not exposed): No = did not use [32]; Yes = did use [24]

viewSmoke - A Telephone Survey With Multistage Stratified Cluster Design
These data are based on a random digit telephone survey of adults (18 years of age and older) using a stratified three-stage design in a state. Clusters are defined as sets of telephone numbers that share the same first eight digits of a 10-digit telephone number. Separately for each county, a with-replacement sample of clusters is randomly chosen with probabilities proportional to size (PPS), where size is the number of residential telephone numbers. Next, a random sample of three participating households is selected in each cluster. Finally, an interview is completed with one adult who is chosen at random within each participating household. This would be considered a stratified three-stage sample, with clusters of telephone numbers as primary sampling units (PSUs), primary stratification by county, residential phone numbers as the second stage, and the random selection of one adult in the household as the third stage (see Table A2).

Table A2. Stages used in telephone survey

Stage One
  List used: 8-digit telephone number clusters by county
  Sampling method: Random PPS within 8-digit clusters (stratified by county)
Stage Two
  List used: Clusters from Stage One
  Sampling method: Three random households per cluster
Stage Three
  List used: Households from Stage Two
  Sampling method: One adult selected at random from participating households

File name: viewSmoke
Project: Sample.mdb
Number of records: 337

Variable (label): values/description*, with frequencies in brackets
PSUID (Primary Sampling Unit (PSU) ID number): Range: 15-1310
DATE (Date of interview): Range: 010190-032490. Note: a character field; dates are month, day, and 2-digit year
INTERID (Interviewer’s initials)
SMOKE (“Do you smoke now?”): 1 = Yes [83]; 2 = No [254]
NUMCIGAR (Number of cigarettes smoked per day): Range: 2-40, n: 82, Mean: 17.256, SE: 0.972. Note: question asked of cigarette smokers only
AGE (Age of participant in years): Range: 9-96, Mean: 43.818, SE: 1.053. Note: the value of “9” appears to be an error since the survey was to be limited to adults
RACE (Race of participant): 1 = White [289]; 2 = Black [47]; 5 = Other [1]
MARITAL (Marital status): 1 = Married [184]; 2 = Divorced [45]; 3 = Widowed [48]; 4 = Separated [6]; 5 = Never married [52]; 9 = Refused [2]
WEIGHT (Weight, without shoes, in pounds): Range: 88-285. Also 777 = don’t know; 999 = refused
HEIGHT (Height, without shoes, in feet and inches): Range: 410-607. Also 777 = don’t know; 999 = refused. Note: 3-digit numeric field; 1st digit = height in feet; next 2 digits = height in inches
SEX (Sex of participant): 1 = Male [122]; 2 = Female [215]
SAMPW (Sample weight): Range: 47100152.009-47113103.03
STRATA (Stratum): 1 = County “A” [113]; 2 = County “B” [112]; 3 = County “C” [112]
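Because HEIGHT packs feet and inches into one 3-digit number and reserves 777 and 999 as special codes, it has to be decoded before it can be analyzed as a measurement. The following is a minimal Python sketch of that decoding, not Epi Info syntax; the function name height_in_inches and the check that the inches part is below 12 are assumptions added for illustration.

```python
# Special codes documented for HEIGHT (and WEIGHT) in viewSmoke.
SPECIAL_CODES = {777: "don't know", 999: "refused"}


def height_in_inches(code):
    """Convert a 3-digit feet/inches code (e.g. 510 = 5 ft 10 in) to total inches.

    Returns None for the special codes 777 and 999 so that they are
    treated as missing rather than as real heights.
    """
    if code in SPECIAL_CODES:
        return None
    feet, inches = divmod(code, 100)  # first digit = feet, last two = inches
    if not 0 <= inches < 12:
        # Assumption: an inches part of 12 or more indicates a data-entry error.
        raise ValueError(f"invalid inches part in height code {code}")
    return feet * 12 + inches


if __name__ == "__main__":
    for code in (410, 607, 777, 999):
        print(code, height_in_inches(code))
```

Applied to the documented range, 410 decodes to 58 inches and 607 to 79 inches. Leaving 777 and 999 in place would inflate any mean computed directly on the raw field.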

*Note that mean and standard error (SE) estimates take into account the complex survey design and statistical weighting.

viewFamily - Merging/Relating files
This Data table is provided along with the Epi Info software in the dataset named Refugee.MDB. It contains information concerning refugee families that have arrived in the United States (e.g., the language they speak or their country of origin).
File name: viewFamily
Project: Refugee.MDB
Number of records: 539

Variable (label): values/description, with frequencies in brackets
APARTMENT (Apartment)
CITY (City:)
Contact Information (Contact Information)
COUNTRY (Country of Origin:)
COUNTY (County:)
DTOFARR (Date of Arrival:)
ENTRY (Port of Entry in USA:): AL [2]; CA [1]; CALIFORNIA [67]; CHICAGO [53]; FL [1]; IL [9]; LA [1]; LOS ANGELES [6]; MIAMI [1]; NEW YORK [248]; NY [123]
FAMHMPH (Family Home Phone:)
FAMIDNUM (Family Id Number): 0-539
HOUSEHOLD (household)
INTERPRETE (Interpreter code)
LANG (Language spoken)
SPONSOR (Sponsor:)
STATE (State:)
STREET (Street:)
ZIPCODE (Zip Code:)

NB: Descriptions of individual variables were not available.

viewPatient - Merging/Relating files


This Data table is provided along with the Epi Info software in the dataset named Refugee.MDB.
File name: viewPatient
Project: Refugee.MDB
Number of records: 18000

Variable (label): values/description, with frequencies in brackets
TODAYDATE (Date of record entry)
FAMIDNUM (Family ID Number): No: 1 to 546
BOHID (BOH ID NUMBER:)
BOH (BOH Re-entry)
ALIENNUM2 (Alien Number2:)
ALIENNUMBE (Alien Number:)
LASTNAME (Last Name)
FIRSTNAME (First Name:)
HEAD (Head of Household:): Yes [434]; No [1325]; Missing [16241]
RELATION (Relationship with the household head): Missing [16342]; 0 [435]; 1 [8]; 2 [14]; 3 [209]; 4 [25]; 5 [457]; 6 [338]; 7 [2]; 8 [4]; 9 [11]; 10 [30]; 11 [1]; 13 [124]
DOB (Date of Birth:)
AGE (Age in years:): Range: 0-80 yrs, Mean: 24.1, Median: 21 (n=1751)
SEX (Sex:): Missing [16232]; F [832]; M [936]
RACE (Race:): Missing [16232]; A [195]; B [727]; H [3]; O [3]; White [840]
I94STATUS (I-94 Status:): Missing [16229]; 1 [1767]; 2 [3]; 3 [1]
RESETTL (Previous Resettlement:): No [1772]; Missing [16228]
FROM (From:): missing [18000]
CLASS (Health classification): Missing [16285]; B [429]; B1 [21]; B2 [50]; O [1215]

NB: actual data were available for only 546 families (based on FAMIDNUM); the remaining records have missing values in all variables except the last and first name of a refugee.


Appendix 2. Operators/Functions - for use in arithmetic and logical expressions

Below is a partial listing of operators and functions.

Arithmetic
+   addition
-   subtraction
*   multiplication
/   division
^   exponentiation (use ^0.5 for square root)

Comparison
>   greater than
<   less than
>=  greater than or equal to
<=  less than or equal to
=   equal to
<>  not equal to

Boolean Operators
AND  logical AND
OR   logical OR
XOR  exclusive OR
NOT  logical NOT

Numeric
ABS(variable or expression)    Absolute value
EXP(variable or expression)    Raises the base of the natural logarithm (e) to the power specified
LN(variable or expression)     Natural logarithm
LOG(variable or expression)    Logarithm (base 10)
MOD(variable or expression)    Modulus or remainder
ROUND(variable or expression)  Rounds to nearest whole number
TRUNC(variable or expression)  Removes decimals (rounds towards zero)

Date-related functions
NUMTODATE(<Year>,<Month>,<Day>)  Converts three numbers to a date format, where <Year> is a numeric variable representing the year, <Month> is a numeric variable for the month, and <Day> is a numeric variable for the day.
YEARS(<date variable 1>, <date variable 2>)   Calculates the number of years between two dates.
MONTHS(<date variable 1>, <date variable 2>)  Calculates the number of months between two dates.
DAYS(<date variable 1>, <date variable 2>)    Calculates the number of days between two dates.
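To make the date-related and truncation behavior concrete, here is a minimal Python sketch of analogous calculations. It is an illustration only, not Epi Info code: the function names are hypothetical, and treating YEARS as a count of completed years is an assumption, since the listing above does not say how partial years are handled.

```python
import math
from datetime import date


def days_between(first, second):
    """Analogue of DAYS(<date 1>, <date 2>): whole days from first to second."""
    return (second - first).days


def completed_years(first, second):
    """Analogue of YEARS(<date 1>, <date 2>), ASSUMING completed years are counted."""
    years = second.year - first.year
    # Subtract one if the anniversary has not yet been reached in the final year.
    if (second.month, second.day) < (first.month, first.day):
        years -= 1
    return years


def truncate(value):
    """Analogue of TRUNC: drop the decimals, i.e. round towards zero."""
    return math.trunc(value)


if __name__ == "__main__":
    # Hypothetical dates chosen for illustration.
    print(days_between(date(1989, 1, 12), date(1989, 2, 15)))
    print(completed_years(date(1940, 4, 18), date(2007, 1, 4)))
    print(truncate(-2.7), 9 ** 0.5)  # truncation and the ^0.5 square-root idiom
```

Rounding towards zero matters for negative values: truncating -2.7 gives -2, whereas rounding to the nearest whole number gives -3.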


Appendix 3. Answers to Exercises

Answers – Exercise 1

1. Mean of HEM using Means command:

Obs   Total        Mean      Variance   Std Dev
609   28173.0000   46.2611   12.0584    3.4725

Minimum   25%       Median    75%       Maximum   Mode
29.0000   44.0000   46.0000   48.0000   58.0000   46.0000

2. Appear to be normally distributed? Use the Graph module and make either a histogram, bar, or line chart with HEM as the X-axis. The data appear to be somewhat normally distributed. While there are statistical tests of whether or not a variable is normally distributed, Epi Info does not perform them.

3. Descriptive Statistics for Each Value of Crosstab Variable

      Obs   Total        Mean      Variance   Std Dev
Yes   251   11459.0000   45.6534   13.1954    3.6325
No    358   16714.0000   46.6872   10.8542    3.2946

ANOVA, a Parametric Test for Inequality of Population Means (For normally distributed data only)

Variation   SS          df    MS         F statistic
Between     157.6822    1     157.6822   13.3420
Within      7173.8055   607   11.8185
Total       7331.4877   608

T Statistic = 3.652, P-value = 0.0003

Bartlett's Test for Inequality of Population Variances
Bartlett's chi square = 2.8276, df = 1, P value = 0.0927

A small p-value (e.g., less than 0.05) suggests that the variances are not homogeneous and that the ANOVA may not be appropriate.

Mann-Whitney/Wilcoxon Two-Sample Test (Kruskal-Wallis test for two groups)
Kruskal-Wallis H (equivalent to Chi square) = 14.7051
Degrees of freedom = 1
P value = 0.0001


Are the variances approximately equal? Yes, Bartlett’s test has a p-value of 0.09, so we can assume approximately equal variances. Therefore, we can use the t-test p-value of 0.0003 and state that mean hematocrit differs significantly between younger and older adults, with older adults (Yes, mean 45.65) having a slightly lower mean hematocrit than younger adults (No, mean 46.69).
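The ANOVA table in answer 3 can be reproduced from the per-group summary statistics alone. The following is a minimal Python sketch of a one-way ANOVA computed from (n, mean, variance) triples; it is an independent cross-check rather than Epi Info's own routine, and the function name anova_from_summary is illustrative. Small differences in the last decimals arise because the inputs are the rounded values printed above.

```python
import math


def anova_from_summary(groups):
    """One-way ANOVA from a list of (n, mean, variance) tuples, one per group."""
    k = len(groups)
    n_total = sum(n for n, _, _ in groups)
    grand_mean = sum(n * mean for n, mean, _ in groups) / n_total
    # Between-group sum of squares: group means around the grand mean.
    ss_between = sum(n * (mean - grand_mean) ** 2 for n, mean, _ in groups)
    # Within-group sum of squares: (n - 1) * sample variance for each group.
    ss_within = sum((n - 1) * var for n, _, var in groups)
    df_between, df_within = k - 1, n_total - k
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return {
        "ss_between": ss_between,
        "ss_within": ss_within,
        "df_between": df_between,
        "df_within": df_within,
        "f": f_stat,
    }


if __name__ == "__main__":
    # HEM by group, from answer 3: (Obs, Mean, Variance) for Yes and No.
    result = anova_from_summary([(251, 45.6534, 13.1954), (358, 46.6872, 10.8542)])
    print(result)
    # With two groups, the t statistic is the square root of F.
    print(math.sqrt(result["f"]))
```

With only two groups, the F statistic equals the square of the t statistic, which is why Epi Info reports both from the same analysis.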

4. Mean is 57.86.

Obs   Total        Mean      Variance   Std Dev
609   35234.0000   57.8555   185.5712   13.6225

Minimum   25%       Median    75%       Maximum   Mode
20.0000   49.0000   57.0000   71.0000   84.0000   71.0000

5. Using the Graph module, make a histogram, bar, or line chart. The variable does not seem to be normally distributed.

6. Descriptive Statistics for Each Value of Crosstab Variable

    Obs   Total       Mean      Variance   Std Dev
1   109   6186.0000   56.7523   206.8733   14.3831
2   138   7879.0000   57.0942   193.9254   13.9257
3   111   6479.0000   58.3694   165.2896   12.8565
4   92    5530.0000   60.1087   162.4935   12.7473
5   63    3661.0000   58.1111   180.1326   13.4213
6   52    2822.0000   54.2692   198.8673   14.1020
7   44    2677.0000   60.8409   182.8811   13.5234

    Minimum   25%       Median    75%       Maximum   Mode
1   20.0000   48.0000   57.0000   68.0000   84.0000   71.0000
2   20.0000   47.0000   57.0000   71.0000   81.0000   71.0000
3   26.0000   51.0000   57.0000   71.0000   84.0000   57.0000
4   32.0000   51.0000   59.0000   71.0000   84.0000   57.0000
5   34.0000   49.0000   55.0000   72.0000   84.0000   54.0000
6   20.0000   44.5000   54.0000   62.5000   84.0000   54.0000
7   38.0000   51.0000   57.0000   71.5000   84.0000   54.0000

ANOVA, a Parametric Test for Inequality of Population Means (For normally distributed data only)

Variation   SS            df    MS         F statistic
Between     1774.0885     6     295.6814   1.6028
Within      111053.1955   602   184.4737
Total       112827.2841   608

P-value = 0.1438

Bartlett's Test for Inequality of Population Variances
Bartlett's chi square = 2.4070, df = 6, P value = 0.8787

A small p-value (e.g., less than 0.05) suggests that the variances are not homogeneous and that the ANOVA may not be appropriate.

Mann-Whitney/Wilcoxon Two-Sample Test (Kruskal-Wallis test for two groups)
Kruskal-Wallis H (equivalent to Chi square) = 8.3535
Degrees of freedom = 6
P value = 0.2133

The data do not seem to be normally distributed, so it might be better to use the Kruskal-Wallis test. Conclusion: there is no significant difference in SES score among the age groups.

7. OR=1.21, RR=1.18; no statistically significant association.

Single Table Analysis

                              Point      95% Confidence Interval
                              Estimate   Lower     Upper
PARAMETERS: Odds-based
Odds Ratio (cross product)    1.2065     0.6448    2.2576 (T)
Odds Ratio (MLE)              1.2061     0.6252    2.2224 (M)
                                         0.5945    2.3087 (F)
PARAMETERS: Risk-based
Risk Ratio (RR)               1.1789     0.6833    2.0343 (T)
Risk Difference (RD%)         2.0238     -5.0418   9.0895 (T)

(T=Taylor series; C=Cornfield; M=Mid-P; F=Fisher Exact)

STATISTICAL TESTS                Chi-square   1-tailed p     2-tailed p
Chi square - uncorrected         0.3456                      0.5566322582
Chi square - Mantel-Haenszel     0.3450                      0.5569564191
Chi square - corrected (Yates)   0.1770                      0.6739617631
Mid-p exact                                   0.2751234372
Fisher exact                                  0.3287296811

                 CHD
CHLG             Yes     No      TOTAL
Yes              14      91      105
  Row %          13.3    86.7    100.0
  Col %          19.7    16.9    17.2
No               57      447     504
  Row %          11.3    88.7    100.0
  Col %          80.3    83.1    82.8
TOTAL            71      538     609
  Row %          11.7    88.3    100.0
  Col %          100.0   100.0   100.0
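The point estimates and the limits marked (T) in answer 7 can be reproduced from the four cell counts of the CHLG by CHD table. The following is a minimal Python sketch using the standard log-scale standard errors for the odds ratio and risk ratio; it is an independent cross-check, not Epi Info's implementation, and the function name two_by_two and the fixed multiplier 1.96 are illustrative assumptions.

```python
import math


def two_by_two(a, b, c, d, z=1.96):
    """Measures of association for a 2x2 table.

    a, b = exposed with / without the outcome
    c, d = unexposed with / without the outcome
    """
    odds_ratio = (a * d) / (b * c)
    risk_exposed = a / (a + b)
    risk_unexposed = c / (c + d)
    risk_ratio = risk_exposed / risk_unexposed
    risk_diff_pct = 100 * (risk_exposed - risk_unexposed)

    # Log-scale standard errors (large-sample approximation).
    se_log_or = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    se_log_rr = math.sqrt(1 / a - 1 / (a + b) + 1 / c - 1 / (c + d))

    def limits(estimate, se):
        return (math.exp(math.log(estimate) - z * se),
                math.exp(math.log(estimate) + z * se))

    return {
        "or": odds_ratio,
        "or_ci": limits(odds_ratio, se_log_or),
        "rr": risk_ratio,
        "rr_ci": limits(risk_ratio, se_log_rr),
        "rd_pct": risk_diff_pct,
    }


if __name__ == "__main__":
    # CHLG (exposure) by CHD (outcome): Yes/Yes=14, Yes/No=91, No/Yes=57, No/No=447.
    print(two_by_two(14, 91, 57, 447))
```

Both confidence intervals include 1, which is consistent with the conclusion that there is no statistically significant association.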


8.

Third Variable   Interaction p-value   Crude OR (1)   Adjusted OR (2)   Conclusion? (3)
ECG              0.42                  2.9            2.4               Confounding
MAR              0.46                  2.9            2.8               Neither
SMK              0.46                  2.9            2.9               Neither
AGEG1            0.81                  2.9            2.2               Confounding
QTIG             0.07                  2.9            2.9               Neither
HPT              <0.01                 2.9            2.0               Interaction

(1) Crude OR (cross-product)
(2) Adjusted OR (MH)
(3) Interaction, confounding, or neither

Answers – Exercise 2

1. First, use the Select command to select those with hypertension:

Next, use the Means command to get the mean cholesterol level – the mean is 215.2.

Obs   Total        Mean       Variance    Std Dev
255   54872.0000   215.1843   1702.2612   41.2585

Minimum    25%        Median     75%        Maximum    Mode
126.0000   184.0000   211.0000   240.0000   336.0000   212.0000

2. Do a Tables command with CAT as the exposure variable and CHD as the outcome variable. The risk ratio is 1.2683.

Single Table Analysis

                         Point      95% Confidence Interval
                         Estimate   Lower    Upper
PARAMETERS: Risk-based
Risk Ratio               1.2683     0.7344   2.1904 (T)

Please run the Cancel Select command to clear the selection.

3. Be sure to first Define the variable CHD_index, then use the Assign command to do the calculation:


The mean CHD_index is 6.5224.

Obs   Total       Mean     Variance   Std Dev
609   3972.1560   6.5224   6.5602     2.5613

4. Do those who developed CHD have a significantly higher or lower mean CHD_index compared to those who did not develop CHD? Assuming a normal distribution, we would conclude that there is no statistically significant difference in mean CHD_index between those with and without CHD.

Descriptive Statistics for Each Value of Crosstab Variable

      Obs   Total       Mean     Variance   Std Dev
Yes   71    453.0072    6.3804   4.7808     2.1865
No    538   3519.1488   6.5412   6.8013     2.6079

      Minimum   25%      Median   75%      Maximum   Mode
Yes   2.8617    4.6229   6.3317   7.4880   14.2804   2.8617
No    2.5707    4.8616   6.0801   7.6534   28.9549   5.0062

ANOVA, a Parametric Test for Inequality of Population Means (For normally distributed data only)

Variation   SS          df    MS       F statistic
Between     1.6215      1     1.6215   0.2469
Within      3986.9572   607   6.5683
Total       3988.5787   608

T Statistic = 0.4969, P-value = 0.6195

Bartlett's Test for Inequality of Population Variances
Bartlett's chi square = 3.4989, df = 1, P value = 0.0614

5. First, Define the variable agegroup; next, use the Recode command as follows: on the first Recode dialog box, click on Fill Ranges to get to the screen below; provide the Start, End, and By values:


Click OK to see the Recode dialog box with the ranges completed:

To determine the number in each group, use the Frequencies command:

agegroup    Frequency   Percent   Cum Percent
>39 - 59    450         73.9%     73.9%
>59 - 79    159         26.1%     100.0%
Total       609         100.0%    100.0%
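The Fill Ranges recode can be mimicked outside Epi Info to check the group counts. The following is a minimal Python sketch, not Epi Info syntax; the function name recode_ranges is illustrative, and the Start, End, and By values of 39, 79, and 20 are assumptions inferred from the range labels in the frequency table. The check uses the AGEG2 band counts from the viewEvansCounty description, represented by one age per band.

```python
def recode_ranges(value, start, end, by):
    """Return a label such as '>39 - 59' for the interval (low, high] holding value.

    Intervals run from start to end in steps of by; values outside every
    interval return None.
    """
    low = start
    while low < end:
        high = low + by
        if low < value <= high:
            return f">{low} - {high}"
        low = high
    return None


if __name__ == "__main__":
    # AGEG2 bands from viewEvansCounty: a representative age mapped to its count.
    ageg2_counts = {40: 109, 45: 138, 50: 111, 55: 92, 60: 63, 65: 52, 70: 44}
    totals = {}
    for age, count in ageg2_counts.items():
        label = recode_ranges(age, start=39, end=79, by=20)
        totals[label] = totals.get(label, 0) + count
    print(totals)
```

Summing the AGEG2 bands this way gives 450 and 159, matching the Frequencies output for agegroup.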

6. First Define the variable Anemic. There are different programming approaches to doing this. One way is as follows:

IF HEM < 39 AND SMK = (-) THEN Anemic = 1 END
IF HEM >= 39 AND SMK = (-) THEN Anemic = 2 END
IF HEM < 40 AND SMK = (+) THEN Anemic = 1 END
IF HEM >= 40 AND SMK = (+) THEN Anemic = 2 END

Another approach that would work just as well is:

ASSIGN Anemic = 1
IF HEM >= 39 AND SMK = (-) THEN Anemic = 2 END
IF HEM >= 40 AND SMK = (+) THEN Anemic = 2 END
IF HEM = (.) OR SMK = (.) THEN Anemic = (.) END

The prevalence of anemia is 1.1%.

Anemic   Frequency   Percent   Cum Percent
1        7           1.1%      1.1%
2        602         98.9%     100.0%
Total    609         100.0%    100.0%
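The classification rule behind both approaches is that the hematocrit cutoff depends on smoking status: below 39 for non-smokers and below 40 for smokers. The following is a minimal Python sketch of the same rule, offered as a reading aid rather than Epi Info code; the function name anemic_code and the use of None for missing values are illustrative assumptions.

```python
def anemic_code(hem, smoker):
    """Return 1 (anemic), 2 (not anemic), or None when either input is missing.

    hem    -- hematocrit in percent, or None if missing
    smoker -- True for SMK = (+), False for SMK = (-), or None if missing
    """
    if hem is None or smoker is None:
        return None  # cannot classify without both values
    cutoff = 40 if smoker else 39  # smokers have the higher cutoff
    return 1 if hem < cutoff else 2


def prevalence_pct(n_anemic, n_total):
    """Prevalence as a percentage, rounded to one decimal place."""
    return round(100 * n_anemic / n_total, 1)


if __name__ == "__main__":
    print(anemic_code(39.5, smoker=False))  # not anemic for a non-smoker
    print(anemic_code(39.5, smoker=True))   # anemic for a smoker
    print(prevalence_pct(7, 609))           # counts from the Frequencies table
```

A hematocrit of 39.5 shows why the rule needs both variables: the same value is classified differently depending on smoking status.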

7. In the Program Editor, click on the Save button; a Save Program dialog box will appear – save the program name as Anemic and then click on the OK button. Next, click on the Open button in the Program Editor, click on the down arrow at the right of Program and select the Anemic program and edit it to remove commands not needed, then Save the edited program. Now, reRead viewEvansCounty, Open the Anemic program, and then click the Run button. Double check to see if the program worked correctly by doing a frequency of anemia.

Answers – Exercise 3

Third Variable   Interaction p-value   Crude OR   Adjusted OR   Conclusion? (1)
ECG              0.42                  2.9        2.4           Confounding
MAR              0.46                  2.9        2.8           Neither
SMK              0.46                  2.9        2.9           Neither
AGEG1            0.81                  2.9        2.2           Confounding
QTIG             0.07                  2.9        2.9           Neither
HPT              0.003                 2.9        2.0           Interaction

(1) Interaction, confounding, or neither


Appendix 4. Analysis commands by number and types of variables

The tables below show which analytic command is appropriate, depending on the number of variables under consideration (one or more), the types of variables (categorical vs. continuous), and whether the analysis assumes simple random sampling or a complex sampling design.

Table A.4.1. Epi Info commands for the analysis of one variable of interest, assuming simple random sampling

Variable of interest                                              Analysis command
Categorical variable (e.g., Illness=Yes or No, sex)               Frequencies; Means
Continuous variable (e.g., age, blood pressure, cholesterol)      Means
Time to event* (e.g., survival time until an event occurs)        Kaplan-Meier Survival

*Requires two variables: a time variable and a variable indicating whether or not an event occurred.

Table A.4.2. Epi Info commands for the analysis of a predictor variable vs. an outcome variable, assuming simple random sampling

Outcome: Categorical variable (e.g., illness=Yes or No); paired observations (1): No
  Categorical predictor (≥ 2 categories): Tables; Logistic Regression (unconditional)
  Continuous predictor: Logistic Regression (unconditional); Means (2)
  Both categorical and continuous predictors: Logistic Regression (unconditional)

Outcome: Categorical variable (e.g., illness=Yes or No); paired observations (1): Yes
  Categorical predictor (≥ 2 categories): Match; Logistic Regression (conditional)
  Continuous predictor: Logistic Regression (conditional)
  Both categorical and continuous predictors: Logistic Regression (conditional)

Outcome: Continuous variable (e.g., age, blood pressure); paired observations (1): No
  Categorical predictor (≥ 2 categories): Means (2); Linear Regression
  Continuous predictor: Linear Regression
  Both categorical and continuous predictors: Linear Regression

Outcome: Time to event (e.g., survival time until an event occurs/is censored); paired observations (1): No
  Categorical predictor (≥ 2 categories): Kaplan-Meier Survival; Cox Proportional Hazards; Extended Cox model (3)
  Continuous predictor: Cox Proportional Hazards; Extended Cox model (3)
  Both categorical and continuous predictors: Cox Proportional Hazards; Extended Cox model (3)

(1) e.g., matched case-control study
(2) Student t-test and ANOVA for parametric tests, and Kruskal-Wallis test for non-parametric tests.
(3) Used when predictor variables are time-dependent or the Cox PH assumptions are violated.


Table A.4.3. Epi Info commands for the analysis of one variable of interest in a survey using a complex sample design

One variable of interest                                Analysis command
Categorical variable (e.g., illness=Yes or No)          Complex Sample Frequencies
Continuous variable (e.g., age, blood pressure)         Complex Sample Means

Table A.4.4. Epi Info commands for the analysis of a predictor variable vs. an outcome variable in a survey using a complex sample design

Outcome                                                 Predictor variable (categorical variable)
Categorical variable (e.g., illness=Yes or No)          Complex Sample Tables
Continuous variable (e.g., age, blood pressure)         Complex Sample Means

Intro to Epi Info 3.3.2 Analysis.doc January 4 2007