![Page 1: Data cleaning GAP Toolkit 5 Training in basic drug abuse data management and analysis Training session 12.](https://reader030.fdocuments.in/reader030/viewer/2022032709/56649ecf5503460f94bdc0fe/html5/thumbnails/1.jpg)
Data cleaning
GAP Toolkit 5 Training in basic drug abuse data management and analysis
Training session 12
![Page 2: Data cleaning GAP Toolkit 5 Training in basic drug abuse data management and analysis Training session 12.](https://reader030.fdocuments.in/reader030/viewer/2022032709/56649ecf5503460f94bdc0fe/html5/thumbnails/2.jpg)
Objectives
• To establish methods of uncovering coding errors
• To discuss techniques for implementing logical tests
• To present methods of selecting cases
• To reinforce the SPSS skills presented to date
![Page 3: Data cleaning GAP Toolkit 5 Training in basic drug abuse data management and analysis Training session 12.](https://reader030.fdocuments.in/reader030/viewer/2022032709/56649ecf5503460f94bdc0fe/html5/thumbnails/3.jpg)
Boolean operators: AND
• The AND operator is a logical operator in Boolean algebra
• Imagine two statements: X and Y
• For the operation (X AND Y) to be true, X has to be true and Y has to be true
• The rules for Boolean operators are commonly displayed in truth tables
![Page 4: Data cleaning GAP Toolkit 5 Training in basic drug abuse data management and analysis Training session 12.](https://reader030.fdocuments.in/reader030/viewer/2022032709/56649ecf5503460f94bdc0fe/html5/thumbnails/4.jpg)
Truth table: AND
Let: 0 = False; 1 = True

| X | Y | X AND Y |
|---|---|---------|
| 0 | 0 | 0 |
| 0 | 1 | 0 |
| 1 | 0 | 0 |
| 1 | 1 | 1 |
![Page 5: Data cleaning GAP Toolkit 5 Training in basic drug abuse data management and analysis Training session 12.](https://reader030.fdocuments.in/reader030/viewer/2022032709/56649ecf5503460f94bdc0fe/html5/thumbnails/5.jpg)
Boolean operators: OR
• The OR operator is a logical operator in Boolean algebra
• Imagine two statements: X and Y
• For the operation (X OR Y) to be true, either X is true, or Y is true, or both X and Y are true
![Page 6: Data cleaning GAP Toolkit 5 Training in basic drug abuse data management and analysis Training session 12.](https://reader030.fdocuments.in/reader030/viewer/2022032709/56649ecf5503460f94bdc0fe/html5/thumbnails/6.jpg)
Truth table: OR
Let: 0 = False; 1 = True

| X | Y | X OR Y |
|---|---|--------|
| 0 | 0 | 0 |
| 0 | 1 | 1 |
| 1 | 0 | 1 |
| 1 | 1 | 1 |
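The two truth tables can be reproduced outside SPSS with a few lines of Python. This is only an illustrative sketch; the function name `truth_table` is an assumption made for this example and is not part of the toolkit.

```python
from itertools import product

def truth_table(op):
    """Return rows (x, y, result) for a two-input Boolean operator (0 = False, 1 = True)."""
    return [(x, y, int(op(bool(x), bool(y)))) for x, y in product((0, 1), repeat=2)]

# Build both tables from Python's own `and` / `or` operators.
and_table = truth_table(lambda x, y: x and y)
or_table = truth_table(lambda x, y: x or y)

for name, table in (("AND", and_table), ("OR", or_table)):
    print(f"X Y X {name} Y")
    for x, y, result in table:
        print(x, y, result)
```

The same operators are what a selection rule such as "AGE <= 10 OR AGE >= 70" relies on later in the session.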
![Page 7: Data cleaning GAP Toolkit 5 Training in basic drug abuse data management and analysis Training session 12.](https://reader030.fdocuments.in/reader030/viewer/2022032709/56649ecf5503460f94bdc0fe/html5/thumbnails/7.jpg)
Data cleaning
• Check the data for errors
• Clean the data before any data analysis
![Page 8: Data cleaning GAP Toolkit 5 Training in basic drug abuse data management and analysis Training session 12.](https://reader030.fdocuments.in/reader030/viewer/2022032709/56649ecf5503460f94bdc0fe/html5/thumbnails/8.jpg)
Types of error
• There are two broad areas of error:
– Coding errors
– Logical errors
![Page 9: Data cleaning GAP Toolkit 5 Training in basic drug abuse data management and analysis Training session 12.](https://reader030.fdocuments.in/reader030/viewer/2022032709/56649ecf5503460f94bdc0fe/html5/thumbnails/9.jpg)
Coding error
• Data entry errors
• Out-of-range values
![Page 10: Data cleaning GAP Toolkit 5 Training in basic drug abuse data management and analysis Training session 12.](https://reader030.fdocuments.in/reader030/viewer/2022032709/56649ecf5503460f94bdc0fe/html5/thumbnails/10.jpg)
Detecting out-of-range values
• For categorical variables, having declared valid values, frequency counts will highlight any peculiar entries
• For continuous variables, descriptive statistics, in particular the range and a histogram, will highlight any peculiar values
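As a rough illustration of both checks outside SPSS, the sketch below counts the codes of a categorical variable against its declared valid values and reports the minimum, maximum and range of a continuous one. The records, the field names and the codebook `VALID_TREATMENT` are invented for this example and are not taken from main.sav.

```python
from collections import Counter

# Hypothetical records and codebook, invented for illustration only.
records = [
    {"id": 1, "age": 34, "treatment": 1},
    {"id": 2, "age": 1, "treatment": 2},
    {"id": 3, "age": 77, "treatment": 4},
    {"id": 4, "age": 29, "treatment": 1},
]
VALID_TREATMENT = {1: "Inpatient", 2: "Outpatient"}

# Categorical variable: a frequency count exposes codes that were never declared.
treatment_counts = Counter(r["treatment"] for r in records)
undeclared = sorted(code for code in treatment_counts if code not in VALID_TREATMENT)

# Continuous variable: minimum, maximum and range expose implausible extremes.
ages = [r["age"] for r in records]
age_min, age_max = min(ages), max(ages)
age_range = age_max - age_min

print("Treatment counts:", dict(treatment_counts))
print("Undeclared codes:", undeclared)
print("Age min/max/range:", age_min, age_max, age_range)
```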
![Page 11: Data cleaning GAP Toolkit 5 Training in basic drug abuse data management and analysis Training session 12.](https://reader030.fdocuments.in/reader030/viewer/2022032709/56649ecf5503460f94bdc0fe/html5/thumbnails/11.jpg)
Examples
• Age: generate descriptive statistics
• Treatment type: generate a frequency distribution
![Page 12: Data cleaning GAP Toolkit 5 Training in basic drug abuse data management and analysis Training session 12.](https://reader030.fdocuments.in/reader030/viewer/2022032709/56649ecf5503460f94bdc0fe/html5/thumbnails/12.jpg)
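Descriptives: Age

| | Statistic | Std. Error |
|---|---|---|
| Mean | 31.78 | .315 |
| 95% Confidence Interval for Mean, Lower Bound | 31.16 | |
| 95% Confidence Interval for Mean, Upper Bound | 32.40 | |
| 5% Trimmed Mean | 31.31 | |
| Median | 31.00 | |
| Variance | 154.614 | |
| Std. Deviation | 12.434 | |
| Minimum | 1 | |
| Maximum | 77 | |
| Range | 76 | |
| Interquartile Range | 20.00 | |
| Skewness | -.427 | .062 |
| Kurtosis | -.503 | .124 |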
![Page 13: Data cleaning GAP Toolkit 5 Training in basic drug abuse data management and analysis Training session 12.](https://reader030.fdocuments.in/reader030/viewer/2022032709/56649ecf5503460f94bdc0fe/html5/thumbnails/13.jpg)
[Histogram of Age. Horizontal axis: Age (0.0 to 75.0); vertical axis: Frequency (0 to 300). Std. Dev = 12.43, Mean = 31.8, N = 1563.00]
![Page 14: Data cleaning GAP Toolkit 5 Training in basic drug abuse data management and analysis Training session 12.](https://reader030.fdocuments.in/reader030/viewer/2022032709/56649ecf5503460f94bdc0fe/html5/thumbnails/14.jpg)
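Treatment type

| | | Frequency | Percent | Valid Percent | Cumulative Percent |
|---|---|---|---|---|---|
| Valid | Inpatient | 1027 | 65.4 | 65.7 | 65.7 |
| | Outpatient | 535 | 34.1 | 34.2 | 99.9 |
| | 4 | 1 | .1 | .1 | 100.0 |
| | Total | 1563 | 99.5 | 100.0 | |
| Missing | System | 8 | .5 | | |
| Total | | 1571 | 100.0 | | |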
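The difference between the Percent and Valid Percent columns is only the denominator: Percent divides by all 1571 cases, Valid Percent by the 1563 non-missing cases. The short sketch below recomputes both columns from the frequencies in the table; the variable names are chosen for this example.

```python
# Frequencies copied from the Treatment type table; "4" is the undeclared code.
counts = {"Inpatient": 1027, "Outpatient": 535, "4": 1}
missing = 8

valid_n = sum(counts.values())   # cases with a non-missing value
total_n = valid_n + missing      # all cases in the file

# Percent uses every case; Valid Percent uses non-missing cases only.
percent = {k: round(100 * v / total_n, 1) for k, v in counts.items()}
valid_percent = {k: round(100 * v / valid_n, 1) for k, v in counts.items()}
missing_percent = round(100 * missing / total_n, 1)

print(valid_n, total_n)
print(percent)
print(valid_percent)
print(missing_percent)
```

The single case coded 4 accounts for only about 0.1 per cent of the data, which is why a frequency table is a better detector of stray codes than any summary statistic.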
![Page 15: Data cleaning GAP Toolkit 5 Training in basic drug abuse data management and analysis Training session 12.](https://reader030.fdocuments.in/reader030/viewer/2022032709/56649ecf5503460f94bdc0fe/html5/thumbnails/15.jpg)
Resolving errors
• The questionnaires should be checked
• If possible, return to the interviewer or interviewee
• If still unresolved, consider setting the value as missing
• Note the importance of ID numbers for linking the computer to the questionnaire
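A minimal sketch of the last two steps, assuming the data are held as a dictionary keyed by questionnaire ID: the ID locates the computer record that corresponds to the paper questionnaire, and an unresolved value is set to missing (here `None`). The records and the helper `set_missing` are hypothetical.

```python
# Hypothetical records keyed by questionnaire ID, invented for illustration.
records = {
    902: {"age": 27, "treatment": 1},
    903: {"age": 1, "treatment": 1},
}

def set_missing(data, case_id, variable):
    """Set one variable of one case to missing (None) and return the old value."""
    old_value = data[case_id][variable]
    data[case_id][variable] = None
    return old_value

# Suppose the questionnaire for ID 903 could not resolve the recorded age.
previous = set_missing(records, 903, "age")
print("ID 903 age was", previous, "and is now", records[903]["age"])
```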
![Page 16: Data cleaning GAP Toolkit 5 Training in basic drug abuse data management and analysis Training session 12.](https://reader030.fdocuments.in/reader030/viewer/2022032709/56649ecf5503460f94bdc0fe/html5/thumbnails/16.jpg)
Selecting cases
• The ability to select a set of cases according to a criterion is essential in data cleaning
• Generating statistics for subsets of the data is also a useful analytical tool
![Page 17: Data cleaning GAP Toolkit 5 Training in basic drug abuse data management and analysis Training session 12.](https://reader030.fdocuments.in/reader030/viewer/2022032709/56649ecf5503460f94bdc0fe/html5/thumbnails/17.jpg)
Example: Age
• Descriptive statistics of Age indicate that there is a case with a value of 1 and a case with the value 77
• It is advisable to check the extreme values
Descriptive Statistics

| | N | Minimum | Maximum | Mean | Std. Deviation |
|---|---|---|---|---|---|
| Age | 1563 | 1 | 77 | 31.78 | 12.434 |
| Valid N (listwise) | 1563 | | | | |
![Page 18: Data cleaning GAP Toolkit 5 Training in basic drug abuse data management and analysis Training session 12.](https://reader030.fdocuments.in/reader030/viewer/2022032709/56649ecf5503460f94bdc0fe/html5/thumbnails/18.jpg)
Example: Age
• It would be reasonable to check for values 10 and under and 70 and over
• The task is to select those cases and display the results
• Data/Select Cases generates the following dialogue box
![Page 19: Data cleaning GAP Toolkit 5 Training in basic drug abuse data management and analysis Training session 12.](https://reader030.fdocuments.in/reader030/viewer/2022032709/56649ecf5503460f94bdc0fe/html5/thumbnails/19.jpg)
Choose these options to define selection criteria.
![Page 20: Data cleaning GAP Toolkit 5 Training in basic drug abuse data management and analysis Training session 12.](https://reader030.fdocuments.in/reader030/viewer/2022032709/56649ecf5503460f94bdc0fe/html5/thumbnails/20.jpg)
![Page 21: Data cleaning GAP Toolkit 5 Training in basic drug abuse data management and analysis Training session 12.](https://reader030.fdocuments.in/reader030/viewer/2022032709/56649ecf5503460f94bdc0fe/html5/thumbnails/21.jpg)
Data/Select Cases
• SPSS creates a new variable in the data set called filter_$, which equals 1 when AGE <= 10 OR AGE >= 70 and 0 otherwise
• All subsequent analysis will be on the reduced data set until Data/Select Cases/All Cases is chosen
• Cases that are filtered out (not selected) are identified by a slash through the case number
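The selection rule can be sketched in Python as a flag that mimics filter_$. The first four cases below reuse IDs and ages that appear in the case summary of this session; the last two are invented in-range cases added so that the filter has something to exclude. The function name `select_flag` is an assumption for this example.

```python
def select_flag(age):
    """Mimic filter_$: 1 when AGE <= 10 OR AGE >= 70, otherwise 0."""
    return 1 if (age <= 10 or age >= 70) else 0

cases = [
    {"id": 16, "age": 8},
    {"id": 85, "age": 77},
    {"id": 183, "age": 70},
    {"id": 903, "age": 1},
    {"id": 2001, "age": 34},  # invented in-range case
    {"id": 2002, "age": 45},  # invented in-range case
]

# Add the flag to every case, then keep only the flagged ones.
for case in cases:
    case["filter"] = select_flag(case["age"])

selected_ids = [case["id"] for case in cases if case["filter"] == 1]
print(selected_ids)
```

Because the flag is stored alongside the data rather than deleting anything, the full file is still available once the selection is switched off.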
![Page 22: Data cleaning GAP Toolkit 5 Training in basic drug abuse data management and analysis Training session 12.](https://reader030.fdocuments.in/reader030/viewer/2022032709/56649ecf5503460f94bdc0fe/html5/thumbnails/22.jpg)
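Age

| | | Frequency | Percent | Valid Percent | Cumulative Percent |
|---|---|---|---|---|---|
| Valid | 1 | 1 | 7.1 | 7.1 | 7.1 |
| | 7 | 5 | 35.7 | 35.7 | 42.9 |
| | 8 | 1 | 7.1 | 7.1 | 50.0 |
| | 9 | 1 | 7.1 | 7.1 | 57.1 |
| | 10 | 3 | 21.4 | 21.4 | 78.6 |
| | 70 | 1 | 7.1 | 7.1 | 85.7 |
| | 72 | 1 | 7.1 | 7.1 | 92.9 |
| | 77 | 1 | 7.1 | 7.1 | 100.0 |
| | Total | 14 | 100.0 | 100.0 | |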
![Page 23: Data cleaning GAP Toolkit 5 Training in basic drug abuse data management and analysis Training session 12.](https://reader030.fdocuments.in/reader030/viewer/2022032709/56649ecf5503460f94bdc0fe/html5/thumbnails/23.jpg)
Generating a report
• Analyse/Reports/Case Summaries
• Select the variables to be included in the summary
![Page 24: Data cleaning GAP Toolkit 5 Training in basic drug abuse data management and analysis Training session 12.](https://reader030.fdocuments.in/reader030/viewer/2022032709/56649ecf5503460f94bdc0fe/html5/thumbnails/24.jpg)
![Page 25: Data cleaning GAP Toolkit 5 Training in basic drug abuse data management and analysis Training session 12.](https://reader030.fdocuments.in/reader030/viewer/2022032709/56649ecf5503460f94bdc0fe/html5/thumbnails/25.jpg)
Case summaries (a)

| | Case number | ID | Age | Race | Education | Employment | Marital status | Treatment type | 1st most frequently used drug |
|---|---|---|---|---|---|---|---|---|---|
| 1 | 16 | 16 | 8 | White | Secondary | Working full-time | Married liv w. spouse | Inpatient | ALCOHOL |
| 2 | 85 | 85 | 77 | White | Tertiary | Pensioner | Widowed | Inpatient | ALCOHOL |
| 3 | 183 | 183 | 70 | White | Secondary | Pensioner | Married liv w. spouse | Inpatient | ALCOHOL |
| 4 | 184 | 184 | 72 | White | Tertiary | Pensioner | Married liv w. spouse | Inpatient | ALCOHOL |
| 5 | 903 | 903 | 1 | White | . | Student/pupil | Never married | Inpatient | DAGGA |
| 6 | 1041 | 1041 | 7 | African | Primary | Student/pupil | Never married | Outpatient | DAGGA |
| 7 | 1042 | 1042 | 7 | African | Primary | Student/pupil | Never married | Outpatient | DAGGA |
| 8 | 1043 | 1043 | 7 | African | Primary | Student/pupil | Never married | Outpatient | DAGGA |
| 9 | 1044 | 1044 | 7 | African | Primary | Student/pupil | Never married | Outpatient | DAGGA |
| 10 | 1045 | 1045 | 7 | African | Primary | Student/pupil | Never married | Outpatient | DAGGA |
| 11 | 1518 | 1518 | 9 | African | Primary | Student/pupil | Never married | Outpatient | WHITE PIPE |
| 12 | 1519 | 1519 | 10 | African | Primary | Student/pupil | Never married | Outpatient | WHITE PIPE |
| 13 | 1520 | 1520 | 10 | African | Primary | Student/pupil | Never married | Outpatient | WHITE PIPE |
| 14 | 1521 | 1521 | 10 | African | Primary | Student/pupil | Never married | Outpatient | WHITE PIPE |
| Total N | | 14 | 14 | 14 | 13 | 14 | 14 | 14 | 14 |

a. Limited to first 100 cases.
![Page 26: Data cleaning GAP Toolkit 5 Training in basic drug abuse data management and analysis Training session 12.](https://reader030.fdocuments.in/reader030/viewer/2022032709/56649ecf5503460f94bdc0fe/html5/thumbnails/26.jpg)
Note: All Cases
• Don’t forget that, once certain cases have been selected, all subsequent analysis is on the selected cases only
• Once you have finished working with the subset, restore the file to All Cases before doing any further analysis
– Data/Select Cases…
– Select the All Cases radio button
– OK
![Page 27: Data cleaning GAP Toolkit 5 Training in basic drug abuse data management and analysis Training session 12.](https://reader030.fdocuments.in/reader030/viewer/2022032709/56649ecf5503460f94bdc0fe/html5/thumbnails/27.jpg)
Locating a case
• From the Data Editor:
– Data/Go To Case
OR
– Select a variable, then Edit/Find
![Page 28: Data cleaning GAP Toolkit 5 Training in basic drug abuse data management and analysis Training session 12.](https://reader030.fdocuments.in/reader030/viewer/2022032709/56649ecf5503460f94bdc0fe/html5/thumbnails/28.jpg)
Logical errors
• Detecting logical errors involves comparing answers to ensure that they are consistent
• The type of logical checks appropriate to identify particular errors will depend on the questions in the questionnaire
![Page 29: Data cleaning GAP Toolkit 5 Training in basic drug abuse data management and analysis Training session 12.](https://reader030.fdocuments.in/reader030/viewer/2022032709/56649ecf5503460f94bdc0fe/html5/thumbnails/29.jpg)
Detecting logical errors
• Cross-tabulations between categorical variables can be used to highlight errors
• Check criteria using conditional statements and the Compute facility
• Some software, such as SPSS Databuilder, allows tests for logical and coding errors to be built into a data entry form
![Page 30: Data cleaning GAP Toolkit 5 Training in basic drug abuse data management and analysis Training session 12.](https://reader030.fdocuments.in/reader030/viewer/2022032709/56649ecf5503460f94bdc0fe/html5/thumbnails/30.jpg)
Example: Cross-tabulation
• Cross-tabulations provide a simple method of investigating the joint distribution of two variables
• The following slide is a cross-tabulation of Drug1 against Mode1 to check that appropriate modes of ingestion have been reported
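The idea behind the check can be sketched in Python by counting (drug, mode) pairs and flagging cells that fall outside a list of expected combinations. The pairs and the `PLAUSIBLE` rules below are invented for illustration and are not a statement about which modes of ingestion are actually valid for each drug; in practice the analyst decides which cells deserve a second look.

```python
from collections import Counter

# Hypothetical (Drug1, Mode1) pairs, invented for illustration only.
pairs = [
    ("ALCOHOL", "Swallow"),
    ("ALCOHOL", "Swallow"),
    ("DAGGA", "Smoke"),
    ("DAGGA", "Swallow"),
    ("HEROIN", "Inject"),
    ("ALCOHOL", "Inject"),
]

# The cross-tabulation is the count of each (drug, mode) cell.
crosstab = Counter(pairs)

# Assumed plausibility rules for this sketch only.
PLAUSIBLE = {
    "ALCOHOL": {"Swallow"},
    "DAGGA": {"Smoke", "Swallow"},
    "HEROIN": {"Smoke", "Snort", "Inject"},
}

# Cells whose mode is not expected for the drug are candidates for checking.
suspect = {cell: n for cell, n in crosstab.items() if cell[1] not in PLAUSIBLE[cell[0]]}

for (drug, mode), n in sorted(crosstab.items()):
    print(f"{drug:8s} {mode:8s} {n}")
print("Cells to check:", suspect)
```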
![Page 31: Data cleaning GAP Toolkit 5 Training in basic drug abuse data management and analysis Training session 12.](https://reader030.fdocuments.in/reader030/viewer/2022032709/56649ecf5503460f94bdc0fe/html5/thumbnails/31.jpg)
Cross-tabulation: most frequently used drug (Drug1) by mode of ingestion (Mode1)
Swallow Smoke Snort Inject Total
DAGGA 1 180 181
HEROIN 31 11 29 71
CODEINE 5 5
COCAINE 2 44 46
CRACK 97 1 98
AMPHETAMINE 4 1 2 7
ECSTASY 24 1 25
SEDATIVES & TRANQUILLIZERS 3 3
BENZODIAZEPINES 16 16
MANDRAX 12 12
VALIUM 2 2
LSD 5 5
SOLVENTS & INHALANTS 2 1 3 6
WHITE PIPE 309 309
ALCOHOL 717 717
ROHYPNOL 3 3
MISC. PRESCRIPTION DRUGS 9 1 10
MISC. DRUGS 1 1
Total 791 634 62 30 1517
Most frequently used drug
![Page 32: Data cleaning GAP Toolkit 5 Training in basic drug abuse data management and analysis Training session 12.](https://reader030.fdocuments.in/reader030/viewer/2022032709/56649ecf5503460f94bdc0fe/html5/thumbnails/32.jpg)
Example: conditional statements
• Main.sav contains information on the three most frequently used drugs: Drug1, Drug2 and Drug3
• In a single case, no drug should appear in more than one of the three variables
• To check this, generate a test variable on the basis of a conditional statement; the test variable should take the value 0 if all three drug variables are different and the value 1 if there is any duplication
![Page 33: Data cleaning GAP Toolkit 5 Training in basic drug abuse data management and analysis Training session 12.](https://reader030.fdocuments.in/reader030/viewer/2022032709/56649ecf5503460f94bdc0fe/html5/thumbnails/33.jpg)
Compute: Test = 0
• Transform/Compute
• Enter the name of the new variable: TEST
• Click the Type and Label button and declare the variable as numeric with the label: TEST VARIABLE FOR DRUG DUPLICATION
• Set TEST = 0
![Page 34: Data cleaning GAP Toolkit 5 Training in basic drug abuse data management and analysis Training session 12.](https://reader030.fdocuments.in/reader030/viewer/2022032709/56649ecf5503460f94bdc0fe/html5/thumbnails/34.jpg)
Compute: TEST = 1
• If any of the drug options are the same, TEST should equal 1, EXCEPT when Drug2 = Drug3 = 77 (not applicable)
• The condition is: if
– Drug1 = Drug2 OR
– Drug1 = Drug3 OR
– (Drug2 = Drug3 AND Drug2 ≠ 77)
– THEN TEST = 1
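The condition can be written as a small Python function to make the logic concrete. Only the code 77 (not applicable) comes from the slides; the other drug codes used in the example calls are hypothetical, and the function name `duplication_test` is an assumption for this sketch.

```python
NOT_APPLICABLE = 77  # code for "not applicable", as stated in the slides

def duplication_test(drug1, drug2, drug3):
    """Return 1 if any drug is duplicated across the three variables, else 0.

    Drug2 = Drug3 = 77 is not counted as a duplication.
    """
    if drug1 == drug2 or drug1 == drug3 or (drug2 == drug3 and drug2 != NOT_APPLICABLE):
        return 1
    return 0

# Hypothetical drug codes (1, 2, 3, 5) used purely to exercise each branch.
print(duplication_test(1, 2, 3))    # all different
print(duplication_test(1, 1, 3))    # Drug1 = Drug2
print(duplication_test(1, 2, 1))    # Drug1 = Drug3
print(duplication_test(1, 2, 2))    # Drug2 = Drug3, not 77
print(duplication_test(1, 77, 77))  # Drug2 = Drug3 = 77, allowed
print(duplication_test(5, 5, 77))   # Drug1 = Drug2, third not applicable
```

Starting from TEST = 0 and overwriting with 1 only when the condition holds is the same two-step pattern as the Compute followed by the If… dialogue.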
![Page 35: Data cleaning GAP Toolkit 5 Training in basic drug abuse data management and analysis Training session 12.](https://reader030.fdocuments.in/reader030/viewer/2022032709/56649ecf5503460f94bdc0fe/html5/thumbnails/35.jpg)
Click If… button to define the conditional statement.
![Page 36: Data cleaning GAP Toolkit 5 Training in basic drug abuse data management and analysis Training session 12.](https://reader030.fdocuments.in/reader030/viewer/2022032709/56649ecf5503460f94bdc0fe/html5/thumbnails/36.jpg)
![Page 37: Data cleaning GAP Toolkit 5 Training in basic drug abuse data management and analysis Training session 12.](https://reader030.fdocuments.in/reader030/viewer/2022032709/56649ecf5503460f94bdc0fe/html5/thumbnails/37.jpg)
Case summaries (a)

| | 1st most frequently used drug | 2nd most frequently used drug | 3rd most frequently used drug | ID |
|---|---|---|---|---|
| 1 | BENZODIAZEPINES | MISC. PRESCRIPTION DRUGS | MISC. PRESCRIPTION DRUGS | 734 |
| 2 | CRACK | CRACK | ECSTASY | 807 |
| 3 | CRACK | WHITE PIPE | CRACK | 835 |
| 4 | HEROIN | SEDATIVES & TRANQUILLIZERS | SEDATIVES & TRANQUILLIZERS | 1182 |
| 5 | SEDATIVES & TRANQUILLIZERS | MISC. PRESCRIPTION DRUGS | MISC. PRESCRIPTION DRUGS | 1230 |
| 6 | SEDATIVES & TRANQUILLIZERS | SEDATIVES & TRANQUILLIZERS | MISC. PRESCRIPTION DRUGS | 1231 |
| 7 | MISC. PRESCRIPTION DRUGS | MISC. PRESCRIPTION DRUGS | Not Applicable | 1245 |
| 8 | MISC. PRESCRIPTION DRUGS | MISC. PRESCRIPTION DRUGS | ALCOHOL | 1250 |
| Total N | 8 | 8 | 8 | 8 |

a. Limited to first 100 cases.
![Page 38: Data cleaning GAP Toolkit 5 Training in basic drug abuse data management and analysis Training session 12.](https://reader030.fdocuments.in/reader030/viewer/2022032709/56649ecf5503460f94bdc0fe/html5/thumbnails/38.jpg)
Exercise
• Check for consistency between the drug reported and the method of ingestion for the second and third drugs of use
• What additional logical tests could be completed on the data in main.sav?
![Page 39: Data cleaning GAP Toolkit 5 Training in basic drug abuse data management and analysis Training session 12.](https://reader030.fdocuments.in/reader030/viewer/2022032709/56649ecf5503460f94bdc0fe/html5/thumbnails/39.jpg)
Summary
• Data entry errors
• Out-of-range errors
• Logical errors
• Conditional statements
• Selecting cases
• Reports