IX. ADVANCED DATA MANAGEMENT TOPICS
In this section further detail is provided on the data management commands listed below and on issues related to missing data:
• Write (Export)
• Delete File/Table
• Delete Records / Undelete Records
• Merge
• Relate
Write (Export) data
Exporting to other file types
The Write (Export) command allows users to save the data into a different Epi Info .MDB data file or into another file format available in this command. With the Write command you can also specify which variables to write to the file and their order in the new file. As an example, we will export the viewEvansCounty file in Sample.mdb to an Excel file. First Read the viewEvansCounty file (see the previous Read section). Then click on Write (Export) in the Analysis Commands dialog box, and the Write dialog box is presented as follows:
Figure 83. Dialog box for Write (Export) command, Epi Info.
As seen in the Write dialog box, the All (*) option is selected by default. This option writes all variables from the current data set into the new data set. To exclude some variables from the new Data table, use the All (*) Except option; All (*) must first be unchecked to permit the selection of All (*) Except. You can also select individual variables from the variable box by right-clicking on them, after unchecking both All (*) and All (*) Except. Here, for the sake of simplicity, we will write all variables to the new data set, with All (*) checked.
Next, decide how the data should be written by using Output Mode, which determines whether the data being written will Append to or Replace the existing data set. With the Replace option checked, the new data will replace the existing data set, whereas with the Append option checked the data are simply added to the file. For this example, use Replace.
In the Output Formats compartment, select Excel 4.0 by clicking on the down-arrow button. The down-arrow button lists the data file formats available in Epi Info:
• Epi2000
• Access 97, 2000
• dBase III, IV, 5.0
• Paradox 3.x, 4.x, 5.x
• Excel 3.0, 4.0
• Epi Info 6
• Text (Delimited)
Pressing the button with “. . .” to the right of File Name displays a dialog box where you can select a folder in which to save the new file. Here, go to the ‘C:\Epi Info’ folder and type the file name EvansCounty. You will see .xls in the Save as type section of the dialog box. Click Save, and the new Excel file is ready to be created. Click OK, and EvansCounty.xls is now written (exported) to the folder ‘C:\Epi Info’. To check the accuracy of EvansCounty.xls, use the Read/Import command or open the file in Excel.
The Data table option is available only when Output Formats is Epi2000 or Access; only then can you type a table name in the Data table box. Using the down-arrow button, the Data table box also allows you to select an existing Data table to receive the output data set, which applies when you want to replace or append to a current Data table. However, neither Epi Info view files nor the Data tables of views appear in the Data table list, because the Write command cannot be used to add data to a view file; in that case, use the Merge command. That is the reason you do not see the view file ‘viewEvansCounty’ or its related Data table ‘EvansCounty’ in the Data table box.
Similarly, you can create a new data set in other file formats (dBase, text, etc.), with different variables and different output modes, by following the steps above.
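The choices made in the Write dialog box (which variables to write, their order, and Replace versus Append) can be illustrated outside Epi Info. The following is a minimal Python sketch under stated assumptions: it is not Epi Info code, the function names write_export and read_back are hypothetical, it writes a delimited text file rather than an Excel file, and the three sample records are invented for illustration.

```python
import csv
import os
import tempfile


def write_export(records, path, variables=None, mode="replace"):
    """Write records (a non-empty list of dicts) to a delimited text file.

    variables: names to write, in output order; None means all variables.
    mode: "replace" overwrites the file, "append" adds rows to it.
    """
    if mode not in ("replace", "append"):
        raise ValueError("mode must be 'replace' or 'append'")
    if variables is None:
        variables = list(records[0].keys())
    appending = mode == "append" and os.path.exists(path)
    with open(path, "a" if appending else "w", newline="") as handle:
        writer = csv.DictWriter(handle, fieldnames=variables, extrasaction="ignore")
        if not appending:
            writer.writeheader()  # header only when the file is new or replaced
        writer.writerows(records)


def read_back(path):
    """Read the exported file back, as a check on what was written."""
    with open(path, newline="") as handle:
        return list(csv.DictReader(handle))


if __name__ == "__main__":
    # Invented records; the real viewEvansCounty table has 609 records.
    rows = [
        {"ID": 21, "AGE": 56, "CHD": "No"},
        {"ID": 31, "AGE": 43, "CHD": "No"},
        {"ID": 51, "AGE": 56, "CHD": "Yes"},
    ]
    target = os.path.join(tempfile.mkdtemp(), "EvansCounty.csv")
    write_export(rows, target, variables=["ID", "CHD"], mode="replace")
    write_export(rows, target, variables=["ID", "CHD"], mode="append")
    print(len(read_back(target)))  # rows now in the file after Replace then Append
```

Reading the file back after writing mirrors the advice above to check the exported file with Read/Import or Excel.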
Delete File/Table
Delete File/Table is used when you want to delete a file, a table from within an Epi2000/Access file, or a view from within an Epi2000/Access file (see Figure 84 for an example).
Figure 84. Dialog box for Delete File/Table, Epi Info.
As an example, Read the viewEvansCounty file, then use the Write (Export) command to save the file as Delete_Me in the Sample.MDB file. Next, use Delete File/Table: in the dialog box click on Table, for the Database select Sample.MDB, and for the Table Name select Delete_Me.
Delete Records / Undelete Records
Using Delete Records you can either mark records for deletion or permanently remove records from the file (Figure 85). Records that are marked for deletion remain in the data file but are usually ignored during analyses. (Note: in the Set command, the usual setting for Process Records is Normal, i.e., perform analyses only on undeleted records; the two other options are to analyze both undeleted records and records marked for deletion [Both] or only records marked for deletion [Deleted].) The other option is to permanently remove records from the file. As shown in Figure 85, you can choose criteria for determining which records to delete, such as “*” to delete all records, or any other criteria, such as Age>50 or Sex=“M”, similar to the types of functions and mathematical comparisons described for Select (see Appendix 2). When the Run Silent option is not checked, Epi Info makes a sound and displays a small dialog box; when it is checked, neither the sound nor the pop-up window occurs.
Records marked for deletion can be undeleted using the Undelete Records command (Figure 86). Specific criteria can be given as to which records to undelete.
Figure 85. Dialog box for Delete Records command, Epi Info.
Figure 86. Dialog box for Undelete Records command, Epi Info.
(Note inconsistency between command Undelete Records and dialog box name UNDELETE)
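The difference between marking records for deletion, permanently removing them, and the three Process Records settings can be sketched as follows. This is a minimal Python illustration under stated assumptions, not Epi Info code: the function names and the "deleted" flag are hypothetical, and the sample records are invented.

```python
def mark_deleted(records, criterion):
    """Mark matching records for deletion; they stay in the file."""
    count = 0
    for record in records:
        if criterion(record):
            record["deleted"] = True
            count += 1
    return count


def undelete(records, criterion):
    """Clear the deletion mark on matching records."""
    count = 0
    for record in records:
        if record.get("deleted", False) and criterion(record):
            record["deleted"] = False
            count += 1
    return count


def permanently_delete(records, criterion):
    """Return a new list without the matching records (cannot be undone)."""
    return [record for record in records if not criterion(record)]


def process_records(records, mode="Normal"):
    """Select records for analysis according to the Process Records setting."""
    if mode == "Normal":
        return [r for r in records if not r.get("deleted", False)]
    if mode == "Deleted":
        return [r for r in records if r.get("deleted", False)]
    if mode == "Both":
        return list(records)
    raise ValueError("mode must be 'Normal', 'Deleted', or 'Both'")


if __name__ == "__main__":
    # Invented records for illustration only.
    data = [
        {"Age": 45, "Sex": "M"},
        {"Age": 55, "Sex": "F"},
        {"Age": 60, "Sex": "M"},
    ]
    mark_deleted(data, lambda r: r["Age"] > 50)      # criterion: Age>50
    print(len(process_records(data, "Normal")))      # undeleted records only
    undelete(data, lambda r: r["Sex"] == "M")        # criterion: Sex="M"
    print(len(process_records(data, "Normal")))
```

A criterion of `lambda r: True` plays the role of “*”, i.e., it matches all records.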
Relate files
In some situations you may want to Relate two files. Two common examples follow. The first is health clinic data, where one file contains information on each individual, such as name, age, sex, and address, and another contains information on clinic visits. The second is survey data, where one file contains information at the household level and another has information on each individual. The investigator may want to Relate the two files and perform an analysis of the combined data table. A visual example is shown in Figure 87. To Relate two files, you must have a variable common to both data tables on which to link, such as a clinic ID number or a household number.
Figure 87. Relating two data tables.
Data table A (Main table) + Data table B (The other table that is to be related to the main table) → Data table C (A combination of A and B)
As an example, let's relate the data table viewFamily to another data table, viewPatient; both can be found in Refugee.MDB, an example file included with Epi Info. (The details of these files can be found in Appendix 1.) A partial listing of the viewFamily table, the viewPatient table, and the related file is shown in Figure 88.
Figure 88. The viewFamily table, the viewPatient table, and the related file.

viewFamily table
Line  Family Id Number  household  Date of Arrival:  Port of Entry:  Country of Origin:  Language spoken
1     1                 1          12-22-1998        NEW YORK        BOSNIA              4
2     2                 2          01-06-1999        NEW YORK        BOSNIA              4
3     3                 3          01-20-1999        NEW YORK        BOSNIA              4
4     4                 4          01-12-1999        CALIFORNIA      VIETNAM             3
5     5                 5          01-20-1999        NEW YORK        BOSNIA

viewPatient table
Line   Today date  Family ID Number  BOH ID NUMBER:  BOH Re-entry
16229  04-07-1999  1                 688174          688174
16230  01-11-1999  1                 9569112         9569112
16231  03-18-1999  1                 8251382         8251382
16232  03-19-1999  2                 8188724         8188724
16233  08-16-1999  2                 7335445         7335445

Related viewFamily and viewPatient tables
Line  Family Id Number  household  Date of Arrival:  Port of Entry:  Country of Origin:  Language spoken
1     1                 1          12-22-1998        NEW YORK        BOSNIA              4
2     1                 1          12-22-1998        NEW YORK        BOSNIA              4
3     1                 1          12-22-1998        NEW YORK        BOSNIA              4
4     2                 2          01-06-1999        NEW YORK        BOSNIA              4
5     2                 2          01-06-1999        NEW YORK        BOSNIA              4
6     4                 4          01-12-1999        CALIFORNIA      VIETNAM             3

Read the data table viewFamily (you will need to change the Data Source to C:\Epi_Info\Refugee.MDB). Then click the Relate command in Analysis Commands on the left, and the Relate dialog box will appear (Figure 89). Again, you will need to change the Data Source to C:\Epi_Info\Refugee.MDB. In the Views portion of the dialog box, click on viewPatient, the table you want to relate. You must supply a Key variable that exists in both tables, which allows records to be related, by clicking on the Build Key button. Doing so opens the Relate - Build Key dialog box (Figure 90). With Current Table(s) (viewFamily) selected, click the down arrow next to the Available Variables box and select the key variable FAMIDNUM. Then click OK. Select the Related Table (viewPatient) and once again click the down arrow next to Available Variables to select FAMIDNUM. Click OK again to close the Relate - Build Key dialog box and return to the Relate dialog box. The Key at the bottom of the Relate dialog box will now read FAMIDNUM :: FAMIDNUM. Click the OK button and the relationship between the files is created, with the message shown in Figure 91 presented in the Analysis Output window.
Figure 89. Dialog box for Relate command, Epi Info.
Figure 90. Dialog box for Relate - Build Key, Epi Info
Figure 91. Example output from Relate command.
Current View: C:\Epi_Info\Refugee.MDB:viewFamily
Relate: LNK_2
Record Count: 1772 (Deleted records excluded)
Date: 6/29/2005 10:53:25 AM

One option when relating files (Figure 89) is Use Unmatched (All). If this option is selected by clicking on the box, the related file will contain all records from both files, whether or not they can be related to one another; when the box is not checked, only records that can be related to one another will be in the related file. Note that more than two tables can be related and that the common identifier may span several fields.
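What Relate does with a key variable, and the effect of Use Unmatched (All), can be sketched in Python. This is a hedged illustration, not the Epi Info implementation: the function relate is hypothetical, only FAMIDNUM is a field name taken from the text (LINE, PORT, COUNTRY, and BOHID are made-up names), and the records are the partial listings from Figure 88.

```python
def relate(main, related, key, use_unmatched=False):
    """Combine two tables (lists of dicts) on a shared key variable.

    use_unmatched=False keeps only records that can be related;
    use_unmatched=True also keeps records from either table with no match.
    """
    index = {}
    for row in related:
        index.setdefault(row[key], []).append(row)

    combined = []
    matched_keys = set()
    for row in main:
        matches = index.get(row[key], [])
        if matches:
            matched_keys.add(row[key])
            for match in matches:
                combined.append({**row, **match})  # one output row per match
        elif use_unmatched:
            combined.append(dict(row))

    if use_unmatched:
        for value, rows in index.items():
            if value not in matched_keys:
                combined.extend(dict(r) for r in rows)
    return combined


# Partial listings from Figure 88; field names other than FAMIDNUM are hypothetical.
FAMILIES = [
    {"FAMIDNUM": 1, "PORT": "NEW YORK", "COUNTRY": "BOSNIA"},
    {"FAMIDNUM": 2, "PORT": "NEW YORK", "COUNTRY": "BOSNIA"},
    {"FAMIDNUM": 3, "PORT": "NEW YORK", "COUNTRY": "BOSNIA"},
    {"FAMIDNUM": 4, "PORT": "CALIFORNIA", "COUNTRY": "VIETNAM"},
    {"FAMIDNUM": 5, "PORT": "NEW YORK", "COUNTRY": "BOSNIA"},
]
PATIENTS = [
    {"LINE": 16229, "FAMIDNUM": 1, "BOHID": 688174},
    {"LINE": 16230, "FAMIDNUM": 1, "BOHID": 9569112},
    {"LINE": 16231, "FAMIDNUM": 1, "BOHID": 8251382},
    {"LINE": 16232, "FAMIDNUM": 2, "BOHID": 8188724},
    {"LINE": 16233, "FAMIDNUM": 2, "BOHID": 7335445},
]

if __name__ == "__main__":
    for row in relate(FAMILIES, PATIENTS, "FAMIDNUM"):
        print(row["FAMIDNUM"], row["COUNTRY"], row["BOHID"])
```

Each family record is repeated once per matching patient record, which is why family 1 appears three times in the related table. Because only the five patient records listed in Figure 88 are used here, families 3 to 5 have no match in this sketch.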
Merge files
Here we describe two ways to Merge files in Epi Info: Append and Update. The first approach is to Read a file and Append (or concatenate) records from another file to the master file (Figure 92). An example of this approach is when two people enter data from a study on separate computers and you would like to combine the two files into one.
Figure 92. Conceptual approach to use of Merge using Append option.

Read Master Table   +   Merge Second Table   → (Append) →   Merged Table
ID  Ltr                 ID  Ltr                             ID  Ltr
1   A                   6   F                               1   A
2   B                   7   G                               2   B
3   C                   8   H                               3   C
4   D                   9   I                               4   D
5   E                   10  J                               5   E
                                                            6   F
                                                            7   G
                                                            8   H
                                                            9   I
                                                            10  J

The second approach is to Update a file: a master file is Read, and its records are then updated with values from the Merge table wherever the key matches. Only fields found in both datasets with a non-empty value in the Merge table will be replaced. A conceptual example is presented in Figure 93. A practical example would be a state health department reportable disease system, where a master file is kept at the state and a local health department sends a table that has updated information.
Figure 93. Conceptual approach to use of Merge using Update option.

Read Master Table   +   Merge Second Table   → (Update) →   Merged Table
ID  Ltr                 ID  Ltr                             ID  Ltr
1   A                   3   F                               1   A
2   B                   5   G                               2   B
3   C                                                       3   F
4   D                                                       4   D
5   E                                                       5   G

In general, the steps are:
• Read a master file
• Use Merge (see Figure 94 for the dialog box)
   o Select a table or file
   o Choose either Update or Append or both
   o Provide one or more Key variables by pressing the Build Key button and completing the Relate – Build Key dialog box (see Figure 90)
   o Click the OK button on both dialog boxes
Figure 94. Dialog box for Merge command, Epi Info.
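The Append and Update behaviors shown in Figures 92 and 93 can be sketched in Python. This is a minimal illustration under stated assumptions, not Epi Info code: merge_append and merge_update are hypothetical names, and the tables are the ID/Ltr examples from the two figures. The Update sketch follows the rule given above that only fields found in both datasets with a non-empty value in the Merge table are replaced.

```python
def merge_append(master, second):
    """Append: concatenate the records of the second table to the master."""
    return [dict(row) for row in master] + [dict(row) for row in second]


def merge_update(master, second, key):
    """Update: where the key matches, replace master values with non-empty
    values from the second table, for fields present in both tables."""
    updates = {row[key]: row for row in second}
    merged = []
    for row in master:
        new_row = dict(row)
        update = updates.get(row[key])
        if update is not None:
            for field, value in update.items():
                if field in new_row and value not in (None, ""):
                    new_row[field] = value
        merged.append(new_row)
    return merged


# Tables from Figures 92 and 93.
MASTER = [{"ID": i, "Ltr": ltr} for i, ltr in enumerate("ABCDE", start=1)]
SECOND_APPEND = [{"ID": i, "Ltr": ltr} for i, ltr in enumerate("FGHIJ", start=6)]
SECOND_UPDATE = [{"ID": 3, "Ltr": "F"}, {"ID": 5, "Ltr": "G"}]

if __name__ == "__main__":
    print([r["Ltr"] for r in merge_append(MASTER, SECOND_APPEND)])
    print([r["Ltr"] for r in merge_update(MASTER, SECOND_UPDATE, "ID")])
```

Both functions return new lists and leave the master table unchanged, which makes it easy to compare the result with the original before replacing it.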
Acknowledgments We would like to thank Andrew Dean, MD, MPH, for his comments and suggestions on this document. Should you have any suggestions to improve this document, please feel free to contact Kevin Sullivan at [email protected]. This document was made possible, in part, by a grant from the Bill and Melinda Gates Foundation.
References
Kleinbaum DG. Survival Analysis: A Self-Learning Text. Springer Verlag Publishers, 1996.
Kleinbaum DG, Klein M. Logistic Regression: A Self-Learning Text, 2nd Ed. Springer Verlag Publishers, 2002.
Kleinbaum DG, Kupper LL, Morgenstern H. Epidemiologic Research: Principles and Quantitative Methods. John Wiley and Sons Publishers, New York, 1982.
Kleinbaum DG, Kupper LL, Muller KE, Nizam A. Applied Regression Analysis and Multivariable Methods, 3rd Ed. Duxbury Press, 1998.
Kleinbaum DG, Sullivan KM, Barker N. ActivEpi Companion Textbook. Springer Verlag Publishers, 2003.
Rosner B. Fundamentals of Biostatistics, 5th Ed. Duxbury, Pacific Grove, 2000.
APPENDICES
Appendix 1. Data Dictionaries
This appendix contains the data dictionaries for the examples in this document, in alphabetical order. The files in Sample.mdb are:
Addicts, Anderson, BFmeasles, Chemo, Myeloma, Stanford, Vets, viewAddfull, viewAgeWithCount, viewBabyBloodPressure, viewEpi1, viewEpi10, viewEstriolandBirthweight, viewEvansCounty, viewhmohiv, viewLasum, viewLEUKEM2, viewOswego, viewRely, viewSmoke
The files in Refugee.mdb for merging or relating datasets are:
viewFamily, viewPatient
Addicts – Survival Analysis
These data are based on a cohort study among 238 heroin addict patients, comparing the treatment effectiveness of one clinic with that of the other. The number of days from entry to a clinic until departure was the outcome variable. This is an example file in the text by Kleinbaum called ‘Addicts’. Please note that these data were originally provided by John Caplehorn (The University of Sydney, Department of Public Health).
Reference: Kleinbaum DG. Survival Analysis: A Self-Learning Text. Springer-Verlag, New York, 1996.
File Name: Addicts
Project: Sample.mdb
Number of records: 238

Label | Variable | Values/Description | Freq
Main predictor of interest. This is the exposure variable which assigns the study subjects into clinic 1 and clinic 2. | Clinic | 1 = clinic 1; 2 = clinic 2 | 161; 77
Censored variable. This is the variable which denotes whether the patient has developed an event (exit from clinic) or not. | status | 0 = censored; 1 = uncensored (exit from clinic) | 150; 88
Survival time in days from entry to a clinic until departure. This is the outcome variable “time to an event”. | Survival_Time_Days | Range: 2-1076 days; Mean: 404.6555; Median: 367.5 |
Past history of imprisonment | Prison_Record | 0 = No; 1 = Yes | 126; 112
Daily dose of Methadone substitute (mg/day) | Methadone_dose__mg_day_ | Range: 20-110; Mean: 60.542; Median: 60 |
Anderson – Survival Analysis
This is a clinical trial studying survival times in weeks (remission) of 42 leukemia patients to compare the effect of treatment (6-mercaptopurine) with placebo. The duration of the relapse-free period after treatment or placebo was the outcome variable. This is an example file in “Survival Analysis: A Self-Learning Text” by Kleinbaum called ‘Anderson’. Please note that these data are originally from Freireich et al.
Data source: Freireich et al. The effect of 6-mercaptopurine on the duration of steroid-induced remissions in acute leukemia. Blood 21: 699-716, 1963.
File Name: Anderson
Project: Sample.mdb
Number of records: 42

Label | Variable | Values/Description | Freq
Survival time in weeks until relapse. This is the outcome variable “time to an event”. | Stime | Range: 1-35 weeks; Mean: 12.881; Median: 10.5 |
Censored variable. This is the variable which denotes whether the patient has developed an event (relapse) or not. | status | 0 = censored; 1 = relapsed | 12; 30
Gender | sex | 0 = female; 1 = male | 22; 20
Log value of white blood cells | Log_wbc | Range: 1.4-5; Mean: 2.9302; Median: 2.8 |
Main predictor of interest. This is the exposure variable (treatment or placebo) randomly assigned to the leukemia patients. | Rx | 1 = placebo; 0 = treatment | 21; 21
BFMeasles - Measles Outbreak Investigation
These are test data provided with the compliments of the Epi Info working group, Epidemiology Program Office, CDC. Thanks to Roger Friedman for sharing the data information for this document.
File Name: BFMeasles
Project: Sample.mdb
Number of records: 262

Label | Variable | Values/Description | Freq
Location code (expressed as text in the fields Province, District, Town, Village/Neighborhood). | EPID | From BFA-TEN-OUA-01-0005 to BFA-BOB-DAN-02-1297 | 262
Name of Province of patient | PROVINCE | 11 provinces ranging from BANFORA to TENKODOGO (alphabetically) | 262
Name of District of patient | DISTRICT | 40 districts ranging from BANFORA to ZORGHO (alphabetically) | 262
Name of Town | TOWN | |
Name of Village/Neighborhood | VILLNEIG | 160 village/neighborhoods ranging from ABSINDO to ZOUMAMISSIRI (alphabetically); (.) missing | 227; 35
A location code which matches that on the map file used to display the data. | AMAPCODE | 40 codes ranging from BFA BAN BAN to BFA TEN ZAB |
Name of nearest Hospital Facility responsible for the patient. | NEARHF | 136 facilities; (.) missing | 256; 6
Unknown code | UR | 1; 2 | 36; 226
Date of birth | DOB | 02/27/1999; (.) missing. Note: Date format is ‘month, day and 4 digit year’ | 1; 261
Age of patient (years) | AGEYR | every value is ‘3’ |
Age of patient (months) | AGEMO | Range: 1-4; Mean: 2; Median: 1.5; (.) missing | 6; 256
Gender | SEX | F female; M male |
Date of notification | DNOT | Range: 01/18/2001-07/17/2002; (.) missing. Note: Date format is ‘month, day and 4 digit year’ | 255; 7
Date of investigation | DOI | Range: 12/19/2001-07/17/2002; (.) missing. Same format as above. | 68; 194
Date of onset of illness | DONSET | Range: 01/17/2001-07/10/2002. Same format as above. | 262
Status of patient: died or alive | DIED | 1 yes; 2 no; 9 unknown; (.) missing | 10; 198; 53; 1
Number of doses of vaccine | DOSES | 0 not vaccinated; 1 vaccinated 1 time; 9 unknown | 42; 24; 196
Date of last vaccination | DVAC | Range: 05/07/1998-03/11/2002; (.) missing | 20; 242
Date of sample collection | DCOLL | Range: 12/19/2001-07/17/2002; (.) missing | 56; 206
Date the sample was sent to lab | DSENT1 | Range: 01/06/2002-04/02/2002; (.) missing | 6; 256
Date the sample was received at the lab | DREC1 | Range: 01/07/2002-04/15/2002; (.) missing | 7; 255
Date of result received from lab | DRESULT1 | Range: 01/22/2002-07/24/2002; (.) missing |
Result of measles immunoassay test | INDIR | 1 positive; 2 negative; 3 indeterminate; (.) missing | 38; 10; 2; 212
Result of rubella test | RUBTEST | 1 positive; 2 negative; 3 indeterminate; (.) missing | 1; 11; 1; 249
Name of investigator | INVESTIGAT | (.) missing | 262
Result of investigation (in French) | INVRESULT | Positive result value in French; (.) missing | 249; 13
Case categories 1-5 (meanings are unknown) | CLASS2 | 1; 3; 4; 5 | 38; 208; 10; 6
Case categories 1-5 (meanings are unknown) | CLASS | 1; 2; 3; 4; 5; (.) missing | 38; 7; 56; 135; 19; 7
Chemo – Survival Analysis
These data are from a clinical trial on gastric carcinoma by Stablein et al., involving 95 patients randomized to either chemotherapy alone or a combination of chemotherapy and radiation, in order to assess treatment outcome. The number of days from treatment until death was the outcome variable. This is an example file in the self-learning text by Kleinbaum, called ‘Chemo.dat’.
Data source: Stablein DM, Carter WH Jr, Novak JW. Analysis of survival data with nonproportional hazard functions. Controlled Clinical Trials. 2(2): 149-59, 1981 Jun.
File Name: Chemo
Project: Sample.mdb
Number of records: 95

Label | Variable | Values/Description | Freq
Main predictor of interest. This is the exposure variable which denotes either ‘chemotherapy alone’ or the combination of ‘chemotherapy and radiation’. | Rx | 1 = chemotherapy alone; 2 = chemotherapy and radiation | 47; 48
Censored variable. This is the variable which denotes whether the patient has developed an event (death) or not. | status | 0 = censored; 1 = died | 17; 78
Survival time in days from treatment until death. This is the outcome variable “time to an event”. | STime | Range: 1-1519 days; Mean: 529.1368; Median: 401 |
Myeloma – Survival Analysis
These data are based on a study at the Medical Centre of the University of West Virginia, USA, in which the association between some probable explanatory variables and the survival time of patients was examined. The response variable was the time (in months) from diagnosis until death from multiple myeloma. The data in the table were reported in Krall et al. and relate to 48 patients ranging in age from 50 to 80 years.
Reference: Krall JM, Uthoff VA, Harley JB (1975). A step-up procedure for selecting variables associated with survival. Biometrics, 31, 49-57.
File Name: Myeloma
Project: Sample.mdb
Number of records: 48

Label | Variable | Values/Description | Freq
Identification number | PATIENT | Range: 1-48 |
Survival time in months from entry to the study until death. This is the outcome variable “time to an event”. | STIME | Range: 1-91; Mean: 23.375; Median: 14.5 |
Censored variable. This is the variable which denotes whether the patient has developed an event (died) or not. | STATUS | 0 = censored; 1 = died | 12; 36
Age of patients (years) | AGE | Range: 50-77; Mean: 62.8958; Median: 62.5 |
Gender | SEX | 1 = male; 2 = female | 29; 19
Blood urea nitrogen (mg%) | BUN | Range: 6-172; Mean: 33.9167; Median: 21 |
Serum calcium (mg%) | CA | Range: 8-15; Mean: 9.9375; Median: 10 |
Hemoglobin (mg%) | HB | Range: 4.9-14.6; Mean: 10.2521; Median: 10.2 |
Percentage of plasma cells in the bone marrow (%) | PC | Range: 3-100; Mean: 42.9375; Median: 33 |
Presence of Bence-Jones protein in the urine | BJ | Yes; No | 15; 33
Stanford – Survival Analysis
These data are based on a Stanford heart transplant study by Kalbfleisch et al., involving 249 patients who either received a transplant or did not, with varying waiting times before the transplant. The study was conducted to assess the effect of different patient attributes on survival time among patients who received transplants, as well as to compare survival time between patients with heart transplants and those without. The survival time, a combination of pre-transplant survival time and post-transplant survival time (if any), was the outcome variable. This is an ideal example for the extended Cox model, which can take into account the different pre-transplant survival times (waiting times), because patients change treatment status during the course of the study. The data file can be found in “Survival Analysis: A Self-Learning Text” by Kleinbaum, called ‘Stanf.dat’.
Data source: Kalbfleisch J, Prentice R. The statistical analysis of failure time data. John Wiley and Sons, New York, 1980.
File Name: Stanford
Project: Sample.mdb
Number of records: 249

Label | Variable | Values/Description | Freq
Survival time from entry to the study until death before the transplant (or) until the transplant. | PRE_TRANSPLANT_SURVIVAL_TIME | Range: 0-340 days; Mean: 40.7068; Median: 26 |
Censored variable 1. This is the variable which denotes whether the patient has died or not at the first end-point (the time of transplant). | STATUS | 0 = censored; 1 = died; (.) = missing | 193; 55; 1
Survival period from the time of transplant until death (or) until the patient is censored. | POSTTRANSPLANT_SURVIVAL_TIME | Range: 0-3694 days; Mean: 696.9348; Median: 351; (.) = missing | 184; 65
Censored variable 2. This is the variable which denotes whether the patient has died or not at the time of the second end-point (Feb 1980). | STATUS_AT_SECOND_ENDPOINT | 0 = censored; 1 = died; (.) = missing | 65; 119; 65
Age of patient at the time of transplant | AGE | Range: 12-64 years; Mean: 41.0924; Median: 44; (.) = missing | 184; 65
Tissue mismatch score | TISSUE_MISMATCH_SCORE | Range: 0-3.05; Mean: 1.1166; Median: 1.04; (.) = missing | 157; 92
Vets – Survival Analysis
These data are from the Veterans’ Administration lung cancer trial among 137 patients with pulmonary carcinoma, comparing the effectiveness of a test treatment with that of the standard treatment. The survival time in days until death was the outcome variable. These data were originally provided by Kalbfleisch et al. and are used as an example data file in “Survival Analysis: A Self-Learning Text” by Kleinbaum, called ‘Anderson.dat’.
Data source: Kalbfleisch J, Prentice R. The statistical analysis of failure time data. John Wiley and Sons, New York, 1980.
File Name: Vets
Project: Sample.mdb
Number of records: 99

Label | Variable | Values/Description | Freq
Main predictor of interest. This is the exposure variable which assigns the study subjects into test and standard. | treatment | 1 = standard; 2 = test | 69; 30
Cancer cell type: large cell | cell_type_1 | 0 = other; 1 = large cell | 84; 15
Cancer cell type: adeno cell | cell_type_2 | 0 = other; 1 = Adeno cell; (.) = missing | 89; 9; 1
Cancer cell type: small cell | cell_type_3 | 1 = Small cell; 0 = other | 59; 40
Cancer cell type: squamous cell | cell_type_4 | 1 = Squamous cell; 0 = other | 64; 35
Survival time in days until death. This is the outcome variable “time to an event”. | STime | Range: 1-999 days; Mean: 136.8889; Median: 95 |
Performance status (0 = worst, ..., 100 = best) | performance_status | Range: 20-90; Mean: 9.0202; Median: 6 |
Disease duration (months from diagnosis) | disease_duration | Range: 1-58 months; Mean: 404.6555; Median: 367.5 |
Age of patients (years) | age | Range: 34-81; Mean: 58.4343; Median: 60 |
History of prior therapy | prior_therapy | 0 = none; 10 = some | 68; 31
Censored variable. This is the variable which denotes whether the patient has died or not. | status | 0 = censored; 1 = death | 8; 91
ViewADDFULL - Attention deficit disorder
Note: we were not able to find more details on this data file.
File Name: ViewADDFULL
Project: Sample.mdb
Number of records: 359

Label | Variable | Values/Description | Freq
Gender of patient | GENDER | 1 female??; 2 male?? | 198; 161
? | REPEAT | 0 no history of repetition; 1 history of repetition; (.) missing | 324; 34; 1
? | ENGL | 1; 2; 3; (.) missing | 40; 254; 46; 19
? | ENGG | 0; 1; 2; 3; 4; (.) missing | 11; 37; 122; 135; 41; 13
? | OLMAT | Range: 55-137; Mean: 102.7333; Median: 103; (.) missing | 210; 149
? | KF | Range: 75-129; Mean: 104.8444; Median: 105; (.) missing | 90; 269
? | GPA | Range: 0-4; Mean: 2.3797; Median: 2.5; (.) missing | 347; 12
? | SOCPROB | 0; 1; (.) missing | 304; 44; 11
? | SCORE2 | Range: 25-90; Mean: 53.3287; Median: 52 |
? | SCORE4 | Range: 22-90; Mean: 52.8936; Median: 53; (.) missing | 357; 2
? | SCORE5 | Range: 22-87; Mean: 53.2696; Median: 52; (.) missing | 319; 40
? | DROPOUT | 0 no history of dropout; 1 history of dropout; (.) missing | 297; 46; 16
? | ADDSC | Range: 24.6667-80; Mean: 53.1068; Median: 53 |
? | IQ | Range: 55-137; Mean: 102.3712; Median: 103; (.) missing | 233; 126
viewAgeWithCount
File name: viewAgeWithCount
Project: Sample.mdb
Number of records: 16
Number of observations: 85

Label | Variable | Values/Description | Freq
 | RecordNumber | Range: 1-10 |
 | Age | Range: 1-10 |
 | Count | Range: 1-20 |

viewBabyBloodPressure - Hypertension in Infants
In these data, birth weight and systolic blood pressure were measured in 16 infants. Systolic blood pressure is the dependent variable, and birth weight and age of the infant are independent variables.
Reference: Rosner B. Fundamentals of Biostatistics, 5th Ed. Duxbury, 2000.
File name: viewBabyBloodPressure
Project: Sample.mdb
Number of records: 16

Label | Variable | Values/Description | Freq
Birth weight of infant (in ounces); an independent variable | Birthweight | Range: 90-160; Mean: 120.31; SD: 18.75 |
Age in days; an independent variable | AgeInDays | Range: 2-5; Mean: 3.31; SD: 0.95 |
Systolic blood pressure (mm Hg); the dependent variable | SystolicBlood | Range: 77-98; Mean: 88.06; SD: 6.69 |
viewEpi1 - Complex Survey Data based on the Expanded Program on Immunization (EPI) method
These data are based on a 30-cluster survey using the Expanded Program on Immunization (EPI) methodology. Using this methodology, 30 communities (i.e., clusters) are selected from a listing of all communities in a geographic area using the probability proportional to population size (PPS) sampling technique. The PPS methodology is self-weighted, i.e., statistical weights are not necessary when analyzing the data. Survey teams visit each cluster and, using one of several sampling techniques, visit households to identify seven children in the appropriate age range and assess their immunization status. The EPI survey is frequently referred to as a 30x7 cluster design, i.e., 30 clusters, each with 7 children.
File name: viewEpi1
Project: Sample.mdb
Number of records: 210

Label | Variable | Values/Description | Freq
A variable to specify in which cluster the individual lived. | CLUSTER | Range: 1-30 |
A question concerning whether or not the mother had received prenatal care for the child being assessed. | PRENATAL | 1 = received prenatal care; 2 = no prenatal care | 87; 123
Whether the child was vaccinated. | VAC | 1 = vaccinated; 2 = not vaccinated | 155; 55
viewEpi10 - Complex Survey Data based on the Expanded Program on Immunization (EPI) method with 10 strata
The viewEpi10 file is an example of a country performing an EPI survey in each of its 10 provinces, i.e., there were 10 separate EPI surveys carried out, one in each province. This is considered a stratified cluster survey. The viewEpi10 data have the same variables as viewEpi1 plus two additional variables: a numeric variable identifying the province in which the child lived (LOCATION) and a variable that takes into account the differences in population size among the provinces (POPW). To calculate national estimates, it is important to take into account the population size of each province. The weighting scheme is presented in Table A1; each weight is calculated as the population size of the province divided by the number in the sample. In Location 1, each child sampled represents 43.87 children; in Location 8, each child sampled represents 853.02 children. Please note that there are other methods for weighting data than the one presented here.
Table A1. Population weights for children in each location
Location  Population  Sample  POPW
1         9,870       225     43.87
2         33,600      219     153.42
3         14,130      212     66.65
4         27,900      219     127.40
5         12,750      212     60.14
6         15,810      214     73.88
7         16,050      210     76.43
8         180,840     212     853.02
9         9,030       217     41.61
10        25,650      212     120.99
Total     345,630     2,152
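The weights in Table A1 can be reproduced with a few lines of Python. This is a sketch for checking the arithmetic only; the names POPULATION, SAMPLE, and population_weights are hypothetical, and the figures are copied from Table A1.

```python
# Population and sample sizes by location, copied from Table A1.
POPULATION = {1: 9870, 2: 33600, 3: 14130, 4: 27900, 5: 12750,
              6: 15810, 7: 16050, 8: 180840, 9: 9030, 10: 25650}
SAMPLE = {1: 225, 2: 219, 3: 212, 4: 219, 5: 212,
          6: 214, 7: 210, 8: 212, 9: 217, 10: 212}


def population_weights(population, sample):
    """Weight for each location: population size divided by sample size."""
    return {loc: population[loc] / sample[loc] for loc in population}


WEIGHTS = population_weights(POPULATION, SAMPLE)

if __name__ == "__main__":
    for loc in sorted(WEIGHTS):
        print(loc, round(WEIGHTS[loc], 2))
    # Each sampled child stands for WEIGHTS[loc] children, so the weighted
    # sample size adds back up to the total population.
    print(round(sum(WEIGHTS[loc] * SAMPLE[loc] for loc in SAMPLE)))
```

The last line is a useful sanity check on any weighting scheme of this kind: summing weight times sample size over the strata should return the total population.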
POPW = Population/Sample
File name: viewEpi10
Project: Sample.mdb
Number of records: 2152

Label | Variable | Values/Description | Freq
Variable with codes for the 10 strata | LOCATION | Range: 1-10 |
Statistical weight to estimate unbiased national estimates taking into account strata population sizes. | POPW | Range: 41.61-853.02 |
Variable specifying cluster number. | CLUSTER | Range: 1-30 |
A question concerning whether or not the mother had received prenatal care for the child being assessed. | PRENATAL | 1 = received prenatal care; 2 = no prenatal care | 1088; 1064
Whether or not the child was vaccinated. | VAC | 1 = vaccinated; 2 = not vaccinated | 1242; 910
viewEstriolandBirthweight - Estriol and Birth Weight Data
These data are from Greene and Touchstone and are used as an example in the text by Rosner to study the relationship between the estriol level in pregnant women and birth weight.
Reference: Rosner B. Fundamentals of Biostatistics, 5th Ed. Duxbury, 2000.
File name: viewEstriolandBirthweight
Project: Sample.mdb
Number of records: 31

Label | Variable | Values/Description | Freq
Estriol level of pregnant woman (mg/24 hr) | ESTRIOL | Range: 7-27; Mean: 17.23; SD: 4.75 |
Birth weight of infant (g/100) | BIRTHWEIGHT | Range: 24-43; Mean: 32.0; SD: 4.74 |
viewEvansCounty - Evans County Heart Disease Study Data
The data are based on the Evans County heart disease cohort study on the seven-year incidence of coronary heart disease in 609 white males. The variable CAT (endogenous catecholamine level) was fabricated for illustrative purposes and dichotomized into categories "high" (top quintile of cohort values) and "low." There are no missing values in this dataset. Thanks to Dr. David Kleinbaum for making the data available.
Reference: Kleinbaum DG, Kupper LL, Morgenstern H. Epidemiologic Research: Principles and Quantitative Methods. Lifetime Learning Publications, Belmont, California, 1982.
File name: viewEvansCounty
Project: Sample.mdb
Number of records: 609

Label | Variable | Values/Description | Freq
Identification Number | ID | Range: 21-19161 |
Coronary Heart Disease | CHD | No = not a case; Yes = case | 538; 71
Age (years) | AGE | Range: 40-76; Mean: 53.71; SD: 9.26 |
Catecholamine Level | CAT | No = low; Yes = high | 487; 122
Serum Cholesterol (mg/100 mL) | CHL | Range: 94-357; Mean: 211.74; SD: 39.83 |
Diastolic Blood Pressure (mmHg) | DBP | Range: 60-170; Mean: 91.18; SD: 14.50 |
Electrocardiogram | ECG | No = normal ECG; Yes = abnormal ECG | 443; 166
Hematocrit (percent) | HEM | Range: 29-58; Mean: 46.26; SD: 3.47 |
Marital Status | MAR | No = not married; Yes = married | 64; 545
Occupation | OCC | 1 = ?; 2 = ? | 365; 244
Pulse (beats/min) | PLS | Range: 45-120; Mean: 74.59; SD: 12.67 |
Quetelet Index* | QTI | Range: 2.121-6.041; Mean: 3.62; SD: 0.59 |
Systolic Blood Pressure (mmHg) | SBP | Range: 92-300; Mean: 145.48; SD: 27.50 |
Socioeconomic Status (McGuire-White index) | SES | Range: 20-84; Mean: 57.86; SD: 13.62 |
Cigarette Smoking | SMK | No = never smoked; Yes = smoker | 222; 387
Age Group 1 (Years) | AGEG1 | No = LT 55; Yes = GE 55 | 358; 251
Age Group 2 (Years) | AGEG2 | 1 = 40-44; 2 = 45-49; 3 = 50-54; 4 = 55-59; 5 = 60-64; 6 = 65-69; 7 = 70+ | 109; 138; 111; 92; 63; 52; 44
Cholesterol Group | CHLG | No = LT 250; Yes = GE 250 | 504; 105
QTI Group | QTIG | No = LT 3.57; Yes = GE 3.57 | 306; 303
SES Group | SESG | No = GE 57; Yes = LT 57 | 330; 279
Hypertension | HPT | No = SBP<160 & DBP<95; Yes = SBP>159 or DBP>94 | 354; 255
GE = greater than or equal to; LT = less than
*100 x (weight in pounds)/(height in inches)^2
viewhmohiv - Survival Analysis
These data are provided with the compliments of the Epi Info development team, Epidemiology Program Office, CDC.
File name: viewhmohiv
Project: Sample.mdb
Number of records: 100
Variable label | Variable | Values/Description | Freq
Identification of patient | ID | Range: 1-100
Survival time | TIME1 | Range: 1-60; Mean: 11.36; Median: 5
Age | AGE | Range: 20-54; Mean: 36.07; Median: 35
Exposure | DRUG | 0 = placebo; 1 = treatment | 51; 49
(no label given) | CENSOR | 0 = censored; 1 = event | 20; 80
The date that the patient first entered the study | ENTDATE | Range: 1-12-1989 to 12-27-1991; Format: mm-dd-yyyy
The date that the patient was last observed | ENDDATE | Range: 2-15-1989 to 11-13-1995; Format: mm-dd-yyyy
viewLasum - Estrogen and Endometrial Cancer Matched Case-Control Study (weighted analysis)
These data come from a Los Angeles study of whether exogenous estrogen use is related to endometrial cancer among 315 participants. The design is a matched case-control study in which each of the 63 cases with endometrial cancer is matched to four control women who were born within one year of the case, had the same marital status, and had lived in the same retirement community for the same length of time. Note that the data set is in summary format: individual records with similar characteristics were summarized into 25 groups. The study can serve as an example for conditional logistic regression analysis that takes the count (frequency) variable into account.
Reference: Breslow and Day. Statistical methods in cancer research: Volume 1 – The analysis of case-control studies. Lyon: International Agency for Research on Cancer, 1980.
File name: viewLasum.dat
Project: Sample.mdb
Number of summary records: 25
Number of observations: 315
Variable label | Variable | Values/Description | Freq
Obesity | OBS | 0 = not obese; 1 = obese; (.) = missing | 97; 167; 51
Estrogen conjugated dose (mg/day): an exposure variable | DOS | 0 = none; 1 = 0.1-0.299; 2 = 0.3-0.625; 3 = 0.626+; (.) = unknown | 8; 155; 61; 56; 35
Disease outcome: a dependent variable | OUTCOME | 0 = no; 1 = yes | 252; 63
A weight variable: summary number of records | COUNT | Range: 1-61
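The summary-record format described for viewLasum can be pictured with a short Python sketch. It is only an illustration of how a count variable acts as a frequency weight: the three records are invented (they are not rows from viewLasum), and the function name `weighted_frequencies` is an assumption of this sketch, not an Epi Info command.

```python
from collections import Counter

# Hypothetical summary records (NOT actual viewLasum rows): each record
# stands for COUNT individuals who share the same characteristics.
summary_records = [
    {"OBS": 0, "DOS": 1, "OUTCOME": 0, "COUNT": 61},
    {"OBS": 1, "DOS": 2, "OUTCOME": 1, "COUNT": 4},
    {"OBS": 1, "DOS": 0, "OUTCOME": 0, "COUNT": 1},
]


def weighted_frequencies(records, variable, weight="COUNT"):
    """Tabulate a variable, counting each summary record `weight` times."""
    freq = Counter()
    for rec in records:
        freq[rec[variable]] += rec[weight]
    return dict(freq)


# Number of underlying observations = sum of the weights, not the row count.
n_observations = sum(rec["COUNT"] for rec in summary_records)
outcome_freq = weighted_frequencies(summary_records, "OUTCOME")
```

Applied to the real table, the same sum over COUNT should give the 315 observations behind the 25 summary records.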
viewLeukem2 – Survival Analysis
These data come from a clinical trial of remission times, in weeks, among 42 leukemia patients, comparing the drug 6-mercaptopurine (6-MP) with placebo. The outcome variable was the duration of the relapse-free period after treatment or placebo. Note that these data are the same as ‘Anderson’ (mentioned earlier), but the covariates ‘sex’ and ‘logwbc’ have been omitted.
File name: viewLeukem2
Project: Sample.mdb
Number of records: 42
Variable label | Variable | Values/Description | Freq
Identification of patient | ID | Range: 1-42
Main predictor of interest: the exposure variable (6-mercaptopurine vs. placebo), randomly assigned to the patients | Rx | placebo; 6-MP | 21; 21
Censoring variable: denotes whether the patient developed an event (exit from clinic) | status | 0 = censored; 1 = relapsed | 12; 30
Survival time in weeks until relapse; this is the outcome variable “time to an event” | Stime | Range: 1-35 weeks; Mean: 12.8810; Median: 10.5
viewOswego - Oswego Classical Study of Disease Outbreak Investigation
These data are based on a classical study of an outbreak of acute gastrointestinal illness in the village of Lycoming, Oswego County, New York, reported to the District Health Officer in Syracuse on April 19, 1940. It was learned that all persons known to be ill had attended a church supper the previous evening, April 18. Accordingly, the goal of the study was to find which food or foods caused the outbreak. The outcome variable is disease (yes/no). Possible risk factors (predictor variables) are the foods and drinks consumed. Interviews regarding the presence of symptoms, including the day and hour of onset, and the food consumed at the church supper were completed for 75 of the 80 persons known to have been present. A total of 46 persons who had experienced gastrointestinal illness were identified.
Reference: The data and information for this outbreak are derived from an educational program developed by the CDC in Atlanta, and were provided by Dr. A.M. Rubin, then the Epidemiologist-in-training who actually conducted the investigation.
File name: viewOswego
Project: Sample.mdb
Number of records: 75
Variable label | Variable | Values/Description | Freq
Age of patient (years) | AGE | Range: 3-77; Mean: 36.8133; Median: 36
Gender | SEX | Female; Male | 44; 31
Outcome variable: diarrheal illness | ILL | Yes; No | 46; 29
Food items | BAKEDHAM | Yes; No | 46; 29
Food items | SPINACH | Yes; No | 43; 32
Food items | MASHEDPOTA | Yes; No; (.) | 37; 37; 1
Food items | CABBAGESAL | Yes; No | 28; 47
Food items | JELLO | Yes; No | 23; 52
Food items | ROLLS | Yes; No | 37; 38
Food items | BROWNBREAD | Yes; No | 27; 48
Food items | FRUITSALAD | Yes; No | 6; 69
Beverages | MILK | Yes; No | 4; 71
Beverages | COFFEE | Yes; No | 31; 44
Beverages | WATER | Yes; No | 24; 51
Desserts | CAKE | Yes; No | 40; 35
Desserts | VANILLA | Yes; No | 54; 21
Desserts | CHOCOLATE | Yes; No; (.) | 47; 27; 1
Date of onset of illness (mm-dd-yyyy, time) | DATEONSET | 04-18-1940, 3pm to 04-19-1940, 10:30am
Date of supper (mm-dd-yyyy, time) | TIMESUPPER | 04-18-1940, 12am to 04-18-1940, 10pm
Name code of patient | NAME | Range: patient1-patient75
Identification number | CODE_RW | Range: P1-P75
(.) = missing value
viewRely - Rely Tampons and Toxic Shock Syndrome Matched Case-Control Data
This is an example of a matched case-control data set where cases (women who were diagnosed with toxic shock syndrome) were each matched to four controls. The specifics of the matching are not provided, but matching was probably based on age and geographic location. As mentioned in the Match command section, the ID is repeated five times: once for the case and then once for each of the four matched controls.
File name: viewRely
Project: Sample.mdb
Number of records: 56
Variable label | Variable | Values/Description | Freq
Identification Number; an ID number that links each case with their individually matched controls | ID | Range: 1-14
Case of toxic shock syndrome? Outcome variable which divides the study group into cases and controls | CASE | No = control; Yes = case | 42; 14
Use of Rely tampons? Exposure variable which separates the group into exposed and not exposed | RELY | No = did not use; Yes = did use | 32; 24
viewSmoke - A Telephone Survey With Multistage Stratified Cluster Design
These data are based on a random digit telephone survey of adults (18 years of age and older) in a state, using a stratified three-stage design. A cluster consists of telephone numbers that share the same first eight digits of a 10-digit telephone number. Separately for each county, a with-replacement sample of clusters is randomly chosen with probabilities proportional to size (PPS), where size is the number of residential telephone numbers. Next, a random sample of three participating households is selected in each cluster. Finally, an interview is completed with one adult who is chosen at random within each participating household. This would be considered a stratified three-stage sample, with clusters of telephone numbers as primary sampling units (PSUs), primary stratification by county, residential phone numbers as the second stage, and the random selection of one adult in the household as the third stage (see Table A2).
Table A2. Stages used in telephone survey
Stage | List Used | Sampling Method
One | 8-digit telephone number clusters by county | Random PPS within 8-digit clusters (stratified by county)
Two | Clusters from Stage One | Three random households per cluster
Three | Households from Stage Two | One adult selected at random from participating households
File name: viewSmoke
Project: Sample.mdb
Number of records: 337
Variable label | Variable | Values/Description* | Freq
Primary Sampling Unit (PSU) ID number | PSUID | Range: 15-1310
Date of interview | DATE | Range: 010190-032490. Note: a character field; dates are month, day, and 2-digit year
Interviewer’s initials | INTERID
“Do you smoke now?” | SMOKE | 1 = Yes; 2 = No | 83; 254
Number of cigarettes smoked per day | NUMCIGAR | Range: 2-40; n: 82; Mean: 17.256; SE: 0.972. Note: question asked of cigarette smokers only
Age of participant in years | AGE | Range: 9-96; Mean: 43.818; SE: 1.053. Note: the value of “9” appears to be an error since the survey was to be limited to adults only
Race of participant | RACE | 1 = White; 2 = Black; 5 = Other | 289; 47; 1
Marital status | MARITAL | 1 = Married; 2 = Divorced; 3 = Widowed; 4 = Separated; 5 = Never married; 9 = Refused | 184; 45; 48; 6; 52; 2
Weight (without shoes) in pounds | WEIGHT | Range: 88-285. Also 777 = don’t know; 999 = refused
Height (without shoes) in feet and inches | HEIGHT | Range: 410-607. Also 777 = don’t know; 999 = refused. Note: 3-digit numeric field; 1st digit = height in feet; next 2 digits = height in inches
Sex of participant | SEX | 1 = Male; 2 = Female | 122; 215
Sample weight | SAMPW | Range: 47100152.009-47113103.03
Stratum | STRATA | 1 = County “A”; 2 = County “B”; 3 = County “C” | 113; 112; 112
*Note that the mean and standard error (SE) estimates take into account the complex survey design and statistical weighting.
viewFamily - Merging/Relating files
This Data table is provided along with the Epi Info software in the dataset named Refugee.MDB. It contains information concerning refugee families that have arrived in the United States (e.g., the language they speak or their country of origin).
File name: viewFamily
Project: Refugee.MDB
Number of records: 539
Variable label | Variable | Values/Description (Freq)
Apartment | APARTMENT
City: | CITY
Contact Information | Contact Information
Country of Origin: | COUNTRY
County: | COUNTY
Date of Arrival: | DTOFARR
Port of Entry in USA: | ENTRY | AL (2); CA (1); CALIFORNIA (67); CHICAGO (53); FL (1); IL (9); LA (1); LOS ANGELES (6); MIAMI (1); NEW YORK (248); NY (123)
Family Home Phone: | FAMHMPH
Family Id Number | FAMIDNUM | 0-539
household | HOUSEHOLD
Interpreter code | INTERPRETE
Language spoken | LANG
Sponsor: | SPONSOR
State: | STATE
Street: | STREET
Zip Code: | ZIPCODE
NB: Descriptions of the individual variables were not available.
viewPatient - Merging/Relating files
This Data table is provided along with the Epi Info software in the dataset named Refugee.MDB.
File name: viewPatient
Project: Refugee.MDB
Number of records: 18000
Variable label | Variable | Values/Description | Freq
Date of record entry | TODAYDATE
Family ID Number | FAMIDNUM | No: 1 to 546
BOH ID NUMBER: | BOHID
BOH Re-entry | BOH
Alien Number2: | ALIENNUM2
Alien Number: | ALIENNUMBE
Last Name | LASTNAME
First Name: | FIRSTNAME
Head of Household: | HEAD | Yes; No; Missing | 434; 1325; 16241
Relationship with the household head | RELATION | Missing; 0; 1; 2; 3; 4; 5; 6; 7; 8; 9; 10; 11; 13 | 16342; 435; 8; 14; 209; 25; 457; 338; 2; 4; 11; 30; 1; 124
Date of Birth: | DOB
Age in years: | AGE | Range: 0-80 yrs; Mean: 24.1; Median: 21 (n=1751)
Sex: | SEX | Missing; F; M | 16232; 832; 936
Race: | RACE | Missing; A; B; H; O; White | 16232; 195; 727; 3; 3; 840
I-94 Status: | I94STATUS | Missing; 1; 2; 3 | 16229; 1767; 3; 1
Previous Resettlement: | RESETTL | No; Missing | 1772; 16228
From: | FROM | Missing | 18000
Health classification | CLASS | Missing; B; B1; B2; O | 16285; 429; 21; 50; 1215
NB: Actual data were available for only 546 families (based on FAMIDNUM); the remaining records have missing values in all variables except the last and first name of a refugee.
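Because viewFamily and viewPatient share the key FAMIDNUM, relating them is a one-to-many lookup from each patient to the family record. The Python sketch below illustrates only that idea; it is not how Epi Info implements Relate or Merge. All row values are invented, and the function name `relate` is a label chosen for this sketch.

```python
# Hypothetical toy rows; the field values are invented for illustration.
families = [
    {"FAMIDNUM": 1, "LANG": "Spanish"},
    {"FAMIDNUM": 2, "LANG": "Russian"},
]
patients = [
    {"FAMIDNUM": 1, "FIRSTNAME": "Ana", "HEAD": "Yes"},
    {"FAMIDNUM": 1, "FIRSTNAME": "Luis", "HEAD": "No"},
    {"FAMIDNUM": 2, "FIRSTNAME": "Olga", "HEAD": "Yes"},
    {"FAMIDNUM": None, "FIRSTNAME": "Unknown", "HEAD": None},  # no family key
]


def relate(patients, families, key="FAMIDNUM"):
    """Attach the family fields to every patient with a matching key."""
    family_by_id = {fam[key]: fam for fam in families}
    related = []
    for pat in patients:
        fam = family_by_id.get(pat[key])
        if fam is not None:  # patients without a matching family are skipped
            related.append({**fam, **pat})
    return related


related = relate(patients, families)
```

In this sketch, patient records with a missing FAMIDNUM drop out of the related table, which mirrors why only the records with actual data in viewPatient can be linked to a family.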
Appendix 2. Operators/Functions - for use in arithmetic and logical expressions
Below is a partial listing of operators and functions.
Arithmetic
+ addition
- subtraction
* multiplication
/ division
^ exponentiation (use ^0.5 for square root)
Comparison
> greater than
< less than
>= greater than or equal to
<= less than or equal to
= equal to
<> not equal to
Boolean Operators
AND logical AND
OR logical OR
XOR exclusive OR
NOT logical NOT
Numeric
ABS(variable or expression) absolute value
EXP(variable or expression) raises the base of the natural logarithm (e) to the power specified
LN(variable or expression) natural logarithm
LOG(variable or expression) logarithm (base 10)
MOD(variable or expression) modulus or remainder
ROUND(variable or expression) rounds to the nearest whole number
TRUNC(variable or expression) removes decimals (rounds toward zero)
Date-related functions
NUMTODATE(<Year>,<Month>,<Day>) converts three numbers to a date format, where <Year> is a numeric variable representing the year, <Month> is a numeric variable for the month, and <Day> is a numeric variable for the day.
YEARS(<date variable 1>, <date variable 2>) calculates the number of years between two dates.
MONTHS(<date variable 1>, <date variable 2>) calculates the number of months between two dates.
DAYS(<date variable 1>, <date variable 2>) calculates the number of days between two dates.
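For readers who want to check a calculation outside Epi Info, the Python sketch below shows rough counterparts of several of the functions listed above. The mapping is an approximation: the helper names are this sketch's own, and counting only completed years in `whole_years_between` is an assumption, since the listing does not say how YEARS treats partial years. Python's built-in `round` also rounds halves to the nearest even number, which may differ from ROUND.

```python
import math
from datetime import date

x = -2.7
results = {
    "ABS": abs(x),             # ABS(x): absolute value
    "EXP": math.exp(1),        # EXP(1): e raised to the power 1
    "LN": math.log(math.e),    # LN: natural logarithm
    "LOG": math.log10(1000),   # LOG: base-10 logarithm
    "MOD": 17 % 5,             # remainder of 17 divided by 5
    "TRUNC": math.trunc(x),    # drops the decimals, rounding toward zero
    "SQRT": 16 ** 0.5,         # same idea as 16^0.5
}


def days_between(d1, d2):
    """Number of days from d1 to d2, in the spirit of DAYS."""
    return (d2 - d1).days


def whole_years_between(d1, d2):
    """Completed years from d1 to d2 (assumed meaning of YEARS)."""
    years = d2.year - d1.year
    if (d2.month, d2.day) < (d1.month, d1.day):
        years -= 1  # the anniversary has not been reached yet
    return years
```

For example, `days_between(date(1989, 1, 12), date(1989, 2, 15))` counts the days between the earliest ENTDATE and the earliest ENDDATE listed for viewhmohiv.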
Appendix 3. Answers to Exercises
Answers – Exercise 1
1. Mean of HEM using Means command:
Obs   Total        Mean      Variance   Std Dev
609   28173.0000   46.2611   12.0584    3.4725
Minimum   25%       Median    75%       Maximum   Mode
29.0000   44.0000   46.0000   48.0000   58.0000   46.0000
2. Appear to be normally distributed? Use the Graph module and make either a histogram, bar, or line chart with HEM as the X-axis. The data appear to be roughly normally distributed. Although statistical tests of normality exist, Epi Info does not perform them.
3. Descriptive Statistics for Each Value of Crosstab Variable
      Obs   Total        Mean      Variance   Std Dev
Yes   251   11459.0000   45.6534   13.1954    3.6325
No    358   16714.0000   46.6872   10.8542    3.2946
ANOVA, a Parametric Test for Inequality of Population Means (For normally distributed data only)
Variation   SS          df    MS         F statistic
Between     157.6822    1     157.6822   13.3420
Within      7173.8055   607   11.8185
Total       7331.4877   608
T Statistic = 3.652   P-value = 0.0003
Bartlett's Test for Inequality of Population Variances
Bartlett's chi square = 2.8276   df = 1   P value = 0.0927
A small p-value (e.g., less than 0.05) suggests that the variances are not homogeneous and that the ANOVA may not be appropriate.
Mann-Whitney/Wilcoxon Two-Sample Test (Kruskal-Wallis test for two groups)
Kruskal-Wallis H (equivalent to Chi square) = 14.7051 Degrees of freedom = 1
P value = 0.0001
Are the variances approximately equal? Yes. Bartlett’s test has a p-value of 0.09, so we can assume approximately equal variances. We can therefore use the t-test p-value of 0.0003 and state that mean hematocrit differs significantly between younger and older adults, with older adults (AGEG1 = Yes) having a slightly lower mean hematocrit (45.65 vs. 46.69).
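The F statistic and Bartlett's chi square reported for this question can be approximately re-derived from the per-group Obs, Mean, and Variance shown in the output. The Python sketch below uses the standard textbook formulas for one-way ANOVA and Bartlett's test; it is a cross-check under that assumption, not a description of Epi Info's internal code, and small differences arise because the inputs are rounded to four decimals.

```python
import math

# (n, mean, variance) for each AGEG1 group, copied from the output above
groups = {"Yes": (251, 45.6534, 13.1954), "No": (358, 46.6872, 10.8542)}


def anova_and_bartlett(groups):
    """Return (F statistic, Bartlett chi square) from group summaries."""
    stats = list(groups.values())
    k = len(stats)
    n_total = sum(n for n, _, _ in stats)
    grand_mean = sum(n * m for n, m, _ in stats) / n_total
    ss_between = sum(n * (m - grand_mean) ** 2 for n, m, _ in stats)
    ss_within = sum((n - 1) * v for n, _, v in stats)
    df_between, df_within = k - 1, n_total - k
    ms_within = ss_within / df_within  # pooled variance
    f_stat = (ss_between / df_between) / ms_within
    # Bartlett: compare the log of the pooled variance with the group logs
    numerator = df_within * math.log(ms_within) - sum(
        (n - 1) * math.log(v) for n, _, v in stats
    )
    correction = 1 + (
        sum(1 / (n - 1) for n, _, _ in stats) - 1 / df_within
    ) / (3 * (k - 1))
    return f_stat, numerator / correction


f_stat, bartlett_chi2 = anova_and_bartlett(groups)
t_stat = math.sqrt(f_stat)  # with two groups, t squared equals F
```

With these inputs the results should land close to the reported F of 13.3420, T statistic of 3.652, and Bartlett chi square of 2.8276.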
4. The mean is 57.86.
Obs   Total        Mean      Variance   Std Dev
609   35234.0000   57.8555   185.5712   13.6225
Minimum   25%       Median    75%       Maximum   Mode
20.0000   49.0000   57.0000   71.0000   84.0000   71.0000
5. Using the Graph module, make a histogram, bar, or line chart. The variable does not seem to be normally distributed.
6. Descriptive Statistics for Each Value of Crosstab Variable
    Obs   Total       Mean      Variance   Std Dev
1   109   6186.0000   56.7523   206.8733   14.3831
2   138   7879.0000   57.0942   193.9254   13.9257
3   111   6479.0000   58.3694   165.2896   12.8565
4   92    5530.0000   60.1087   162.4935   12.7473
5   63    3661.0000   58.1111   180.1326   13.4213
6   52    2822.0000   54.2692   198.8673   14.1020
7   44    2677.0000   60.8409   182.8811   13.5234
    Minimum   25%       Median    75%       Maximum   Mode
1   20.0000   48.0000   57.0000   68.0000   84.0000   71.0000
2   20.0000   47.0000   57.0000   71.0000   81.0000   71.0000
3   26.0000   51.0000   57.0000   71.0000   84.0000   57.0000
4   32.0000   51.0000   59.0000   71.0000   84.0000   57.0000
5   34.0000   49.0000   55.0000   72.0000   84.0000   54.0000
6   20.0000   44.5000   54.0000   62.5000   84.0000   54.0000
7   38.0000   51.0000   57.0000   71.5000   84.0000   54.0000
ANOVA, a Parametric Test for Inequality of Population Means
(For normally distributed data only)
Variation   SS            df    MS         F statistic
Between     1774.0885     6     295.6814   1.6028
Within      111053.1955   602   184.4737
Total       112827.2841   608
P-value = 0.1438
Bartlett's Test for Inequality of Population Variances
Bartlett's chi square = 2.4070   df = 6   P value = 0.8787
A small p-value (e.g., less than 0.05) suggests that the variances are not homogeneous and that the ANOVA may not be appropriate.
Mann-Whitney/Wilcoxon Two-Sample Test (Kruskal-Wallis test for two groups)
Kruskal-Wallis H (equivalent to Chi square) = 8.3535
Degrees of freedom = 6
P value = 0.2133
The data do not seem to be normally distributed, so the Kruskal-Wallis test may be the better choice. Conclusion: there is no statistically significant difference in SES score among the age groups (P = 0.2133).
7. OR=1.21, RR=1.18; no statistically significant association.
Single Table Analysis
                                 Point      95% Confidence Interval
                                 Estimate   Lower      Upper
PARAMETERS: Odds-based
Odds Ratio (cross product)       1.2065     0.6448     2.2576 (T)
Odds Ratio (MLE)                 1.2061     0.6252     2.2224 (M)
                                            0.5945     2.3087 (F)
PARAMETERS: Risk-based
Risk Ratio (RR)                  1.1789     0.6833     2.0343 (T)
Risk Difference (RD%)            2.0238     -5.0418    9.0895 (T)
(T=Taylor series; C=Cornfield; M=Mid-P; F=Fisher Exact)
STATISTICAL TESTS                Chi-square   1-tailed p     2-tailed p
Chi square - uncorrected         0.3456                      0.5566322582
Chi square - Mantel-Haenszel     0.3450                      0.5569564191
Chi square - corrected (Yates)   0.1770                      0.6739617631
Mid-p exact                                   0.2751234372
Fisher exact                                  0.3287296811
                CHD
CHLG     Yes      No       TOTAL
Yes      14       91       105
Row %    13.3     86.7     100.0
Col %    19.7     16.9     17.2
No       57       447      504
Row %    11.3     88.7     100.0
Col %    80.3     83.1     82.8
TOTAL    71       538      609
Row %    11.7     88.3     100.0
Col %    100.0    100.0    100.0
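The point estimates in the Single Table Analysis follow directly from the four cell counts of the CHLG by CHD table. The Python sketch below recomputes the cross-product odds ratio, the risk ratio, and the risk difference, and adds the usual log-scale (Taylor series) 95% confidence limits, which appear to match the rows marked (T). The function name `two_by_two` and the dictionary keys are labels chosen for this sketch.

```python
import math

# Cell counts from the CHLG x CHD table above
a, b = 14, 91    # CHLG = Yes: CHD yes, CHD no
c, d = 57, 447   # CHLG = No:  CHD yes, CHD no


def two_by_two(a, b, c, d, z=1.96):
    """Odds ratio, risk ratio, risk difference, and log-scale 95% limits."""
    odds_ratio = (a * d) / (b * c)
    risk_exposed = a / (a + b)
    risk_unexposed = c / (c + d)
    risk_ratio = risk_exposed / risk_unexposed
    se_log_or = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    se_log_rr = math.sqrt(1 / a - 1 / (a + b) + 1 / c - 1 / (c + d))
    return {
        "odds_ratio": odds_ratio,
        "or_ci": (odds_ratio * math.exp(-z * se_log_or),
                  odds_ratio * math.exp(z * se_log_or)),
        "risk_ratio": risk_ratio,
        "rr_ci": (risk_ratio * math.exp(-z * se_log_rr),
                  risk_ratio * math.exp(z * se_log_rr)),
        "risk_diff_pct": 100 * (risk_exposed - risk_unexposed),
    }


estimates = two_by_two(a, b, c, d)
```

Both confidence intervals include 1, which agrees with the conclusion that there is no statistically significant association between CHLG and CHD.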
8.
Third Variable | Interaction p-value | Crude OR (1) | Adjusted OR (2) | Conclusion? (3)
ECG | 0.42 | 2.9 | 2.4 | Confounding
MAR | 0.46 | 2.9 | 2.8 | Neither
SMK | 0.46 | 2.9 | 2.9 | Neither
AGEG1 | 0.81 | 2.9 | 2.2 | Confounding
QTIG | 0.07 | 2.9 | 2.9 | Neither
HPT | <0.01 | 2.9 | 2.0 | Interaction
(1) Crude OR (cross-product)
(2) Adjusted OR (MH)
(3) Interaction, confounding, or neither
Answers – Exercise 2
1. First, use the Select command to select those with hypertension:
Next, use the Means command to get the mean cholesterol level – the mean is 215.2.
Obs   Total        Mean       Variance    Std Dev
255   54872.0000   215.1843   1702.2612   41.2585
Minimum    25%        Median     75%        Maximum    Mode
126.0000   184.0000   211.0000   240.0000   336.0000   212.0000
2. Do a Tables command with CAT as the exposure variable and CHD as the outcome variable. The risk ratio is 1.2683.
Single Table Analysis
                         Point      95% Confidence Interval
                         Estimate   Lower      Upper
PARAMETERS: Risk-based
Risk Ratio               1.2683     0.7344     2.1904 (T)
Run the Cancel Select command to clear the selection.
3. Be sure to first Define the variable CHD_index, then use the Assign command to do the calculation:
The mean CHD_index is 6.5224.
Obs Total Mean Variance Std Dev
609 3972.1560 6.5224 6.5602 2.5613
4. Do those who developed CHD have a significantly higher or lower mean CHD_index compared to those who did not develop CHD? Assuming a normal distribution, we would conclude that there is no statistically significant difference in mean CHD_index between those with or without CHD.
Descriptive Statistics for Each Value of Crosstab Variable
      Obs   Total       Mean     Variance   Std Dev
Yes   71    453.0072    6.3804   4.7808     2.1865
No    538   3519.1488   6.5412   6.8013     2.6079
      Minimum   25%      Median   75%      Maximum   Mode
Yes   2.8617    4.6229   6.3317   7.4880   14.2804   2.8617
No    2.5707    4.8616   6.0801   7.6534   28.9549   5.0062
ANOVA, a Parametric Test for Inequality of Population Means
(For normally distributed data only)
Variation   SS          df    MS       F statistic
Between     1.6215      1     1.6215   0.2469
Within      3986.9572   607   6.5683
Total       3988.5787   608
T Statistic = 0.4969   P-value = 0.6195
Bartlett's Test for Inequality of Population Variances
Bartlett's chi square = 3.4989   df = 1   P value = 0.0614
5. First, Define the variable agegroup; next, use the Recode command as follows: on the first Recode dialog box, click on Fill Ranges to get to the screen below; provide the Start, End, and By values:
Click OK to see the Recode dialog box with the ranges completed:
To determine the number in each group, use the Frequencies command:
agegroup   Frequency   Percent   Cum Percent
>39 - 59   450         73.9%     73.9%
>59 - 79   159         26.1%     100.0%
Total      609         100.0%    100.0%
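The effect of this recode can be written out as a small Python function. This is a sketch of the grouping logic only, assuming that a label such as ">39 - 59" means greater than 39 and up to and including 59; that reading is consistent with the AGEG2 frequencies, since ages 40-59 total 450 and ages 60 and over total 159. The function name `recode_age` is hypothetical.

```python
from collections import Counter


def recode_age(age):
    """Return the agegroup label for an age in years (None if out of range)."""
    if age is None:
        return None
    if 39 < age <= 59:
        return ">39 - 59"
    if 59 < age <= 79:
        return ">59 - 79"
    return None


# A few made-up ages to show the tabulation step (not the study data)
sample_ages = [40, 47, 59, 60, 76]
frequencies = Counter(recode_age(age) for age in sample_ages)
```

Tabulating the recoded values with `Counter` plays the role of the Frequencies command in this sketch.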
6. First Define the variable Anemic. There are different programming approaches to doing this. One way is as follows:
IF HEM < 39 and SMK = (-) THEN Anemic = 1 END
IF HEM >= 39 and SMK = (-) THEN Anemic = 2 END
IF HEM < 40 and SMK = (+) THEN Anemic = 1 END
IF HEM >= 40 and SMK = (+) THEN Anemic = 2 END
Another approach that would work just as well is:
ASSIGN Anemic = 1
IF HEM >= 39 and SMK = (-) THEN Anemic = 2 END
IF HEM >= 40 and SMK = (+) THEN Anemic = 2 END
IF HEM= (.) AND SMK= (.) THEN Anemic = (.) END
The prevalence of anemia is 1.1%.
Anemic   Frequency   Percent   Cum Percent
1        7           1.1%      1.1%
2        602         98.9%     100.0%
Total    609         100.0%    100.0%
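The classification rule behind both programs can be restated as a single Python function, which may help when checking the logic by hand. This is a sketch, not Epi Info code: the function name `classify_anemic` is hypothetical, smoking status is passed as True/False in place of (+)/(-), and the sketch returns a missing result when either input is missing, which is a slightly stricter choice than the last IF statement above.

```python
def classify_anemic(hem, smoker):
    """Return 1 (anemic), 2 (not anemic), or None if an input is missing.

    The hematocrit cutoff is 39 for non-smokers and 40 for smokers.
    """
    if hem is None or smoker is None:
        return None
    cutoff = 40 if smoker else 39
    return 1 if hem < cutoff else 2


def prevalence_percent(codes):
    """Percent coded 1 among the non-missing codes."""
    valid = [code for code in codes if code is not None]
    return 100 * sum(1 for code in valid if code == 1) / len(valid)
```

As a check against the frequency table, 7 anemic records out of 609 gives `100 * 7 / 609`, or about 1.1%.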
7. In the Program Editor, click on the Save button; a Save Program dialog box will appear. Save the program under the name Anemic and then click on the OK button. Next, click on the Open button in the Program Editor, click on the down arrow at the right of Program, select the Anemic program, edit it to remove commands that are not needed, and then Save the edited program. Now Read viewEvansCounty again, Open the Anemic program, and click the Run button. Double-check that the program worked correctly by running a frequency of Anemic.
Answers – Exercise 3
Third Variable | Interaction p-value | Crude OR | Adjusted OR | Conclusion? (1)
ECG | 0.42 | 2.9 | 2.4 | Confounding
MAR | 0.46 | 2.9 | 2.8 | Neither
SMK | 0.46 | 2.9 | 2.9 | Neither
AGEG1 | 0.81 | 2.9 | 2.2 | Confounding
QTIG | 0.07 | 2.9 | 2.9 | Neither
HPT | 0.003 | 2.9 | 2.0 | Interaction
(1) Interaction, confounding, or neither
Appendix 4. Analysis commands by number and types of variables
The tables below show which analytic command is appropriate, depending on the number of variables under consideration (one or more), the types of variables (categorical vs. continuous), and whether the data are to be analyzed assuming simple random sampling or a complex sampling design. Table A.4.1. Epi Info commands for the analysis of one variable of interest, assuming simple random sampling
A variable of interest | Analysis command
Categorical variable (e.g., Illness = Yes or No, sex) | Frequencies; Means
Continuous variable (e.g., age, blood pressure, cholesterol level) | Means
Time to event* (e.g., survival time until an event occurs) | Kaplan-Meier Survival
*Requires two variables, a time variable and a variable as to whether or not an event occurred.
Table A.4.2. Epi Info commands for the analysis of a predictor variable vs. an outcome variable, assuming simple random sampling
Outcome | Paired observations (1) | Predictor: categorical variable (≥ 2 categories) | Predictor: continuous variable | Predictors: both categorical and continuous variables
Categorical variable (e.g., illness = Yes or No) | No | Tables; Logistic Regression (unconditional) | Logistic Regression (unconditional); Means (2) | Logistic Regression (unconditional)
Categorical variable (e.g., illness = Yes or No) | Yes | Match; Logistic Regression (conditional) | Logistic Regression (conditional) | Logistic Regression (conditional)
Continuous variable (e.g., age, blood pressure) | No | Means (2); Linear Regression | Linear Regression | Linear Regression
Time to event (e.g., survival time until an event occurs/is censored) | No | Kaplan-Meier Survival; Cox Proportional Hazards; Extended Cox model (3) | Cox Proportional Hazards; Extended Cox model (3) | Cox Proportional Hazards; Extended Cox model (3)
(1) e.g., matched case-control study
(2) Student t-test and ANOVA for parametric tests, and Kruskal-Wallis test for non-parametric tests.
(3) Used when predictor variable(s) are time-dependent or Cox PH assumptions are violated.
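Table A.4.2 is in effect a lookup keyed by the outcome type, whether observations are paired, and the predictor type. The Python sketch below encodes the table that way so the choice of command can be checked programmatically. The key strings and the function name `suggest_commands` are inventions of this sketch; only the command names come from the table.

```python
# Lookup built from Table A.4.2; the key names are this sketch's own.
UNCOND = "Logistic Regression (unconditional)"
COND = "Logistic Regression (conditional)"
COX = ["Cox Proportional Hazards", "Extended Cox model"]

COMMANDS = {
    ("categorical", False): {
        "categorical": ["Tables", UNCOND],
        "continuous": [UNCOND, "Means"],
        "both": [UNCOND],
    },
    ("categorical", True): {
        "categorical": ["Match", COND],
        "continuous": [COND],
        "both": [COND],
    },
    ("continuous", False): {
        "categorical": ["Means", "Linear Regression"],
        "continuous": ["Linear Regression"],
        "both": ["Linear Regression"],
    },
    ("time_to_event", False): {
        "categorical": ["Kaplan-Meier Survival"] + COX,
        "continuous": COX,
        "both": COX,
    },
}


def suggest_commands(outcome, predictor, paired=False):
    """Return the commands the table lists for this combination."""
    try:
        return COMMANDS[(outcome, paired)][predictor]
    except KeyError:
        raise ValueError("combination not covered by Table A.4.2") from None
```

For a matched case-control study with a categorical exposure, `suggest_commands("categorical", "categorical", paired=True)` returns the Match and conditional logistic regression commands.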
Table A.4.3. Epi Info commands for the analysis of one variable of interest in a survey using a complex sample design
One variable of interest | Analysis command
Categorical variable (e.g., illness = Yes or No) | Complex Sample Frequencies
Continuous variable (e.g., age, blood pressure) | Complex Sample Means
Table A.4.4. Epi Info commands for the analysis of a predictor variable vs. an outcome variable in a survey using a complex sample design
Outcome | Predictor variable (categorical variable)
Categorical variable (e.g., illness = Yes or No) | Complex Sample Tables
Continuous variable (e.g., age, blood pressure) | Complex Sample Means
Intro to Epi Info 3.3.2 Analysis.doc January 4 2007