Educational Research 101: How to Manage Your Data and Prepare
for the Statistical Consultation
Francis S. Nuthalapaty, MD; H. Lee Higdon III, PhD
2009 APGO Faculty Development Seminar
Case Study: The wrong way
• Statistician was consulted after the data had
been collected.
• Study question was not clearly defined.
• Variables were not defined.
• Data Dictionary was not developed.
• Data were not cleaned/validated.
• Result: a statistician who is asked to perform a
miracle!
Case Study: Lesson
Arrangements to consult with a statistician should be made before you start enrolling and collecting
data on patients! In fact, they should be made before protocol development to prevent issues
downstream.
Learning Objectives
1. Describe the continuum of data management
2. List data collection instruments / approaches
3. Understand how to create a data dictionary
4. Describe methods to validate data
5. Describe various data analytic tools
6. Describe how to decide on statistical
approaches
Question
Where does data management fit into the
research process?
The Research Process
1. Question
2. Literature search
3. Objective / Hypothesis
4. Study design
5. IRB
6. Study conduct
7. Data analysis
8. Dissemination of results
Data Management Pearl
“No study is better than the quality of its data”
- Friedman
“…get it right the first time”
- Crerand
Steps in Data Management
• Definition
• Acquisition
• Data Entry
• Validation
• Analysis
Data Definitions
• Identifying your data
• Identifying your data types
• Naming your data variables
• Creating a data dictionary
Data Types
Types of Variables
Qualitative Quantitative
Nominal
Ordinal
Interval
Ratio
Data Definition Exercise
Data Variable Names
• Make the name descriptive (easier to remember)
• Keep it short (less than 10 characters)
• Use lower case
• Avoid spaces – use “underscore”
• Use numbers to indicate sequences
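The mechanical naming rules above (length, case, spaces) can be checked automatically. The following is a minimal sketch; the function name `check_variable_name` and the example variable names are illustrative assumptions, and whether a name is "descriptive" still requires human judgment.

```python
def check_variable_name(name):
    """Return a list of naming-rule violations for a proposed variable name."""
    problems = []
    if len(name) >= 10:  # rule: keep it short (less than 10 characters)
        problems.append("10 or more characters")
    if name != name.lower():  # rule: use lower case
        problems.append("contains upper case")
    if " " in name:  # rule: avoid spaces, use an underscore instead
        problems.append("contains spaces")
    return problems


# Hypothetical names: two sequential blood pressure readings and one poor name.
for candidate in ["sbp_1", "sbp_2", "Systolic BP visit 1"]:
    print(candidate, check_variable_name(candidate))
```

Names such as `sbp_1` and `sbp_2` also show the last rule: a trailing number marks the position of a measurement in a sequence.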
Data Variable Formats
• Variable formats:
– Numeric
– String
Data Variable Values
• Possible responses for a variable
– Numeric format:
• 0 = no / 1 = yes
– String format:
• a = no / b = yes
Note on Missing Values
• What about variables with no response?
– Leave it blank
– Assign a period “.”
– Assign a value (usually out of the expected
response range)
– Avoid text
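As a rough sketch of how these conventions play out at analysis time, the snippet below treats a blank, a period, and an out-of-range code as missing. The code `-9`, the helper `parse_numeric`, and the sample values are assumptions made for illustration only.

```python
# Values treated as "missing": blank, period, or a hypothetical out-of-range code.
MISSING_CODES = {"", ".", "-9"}


def parse_numeric(raw):
    """Convert a raw cell to a float, or None when it is coded as missing."""
    cell = raw.strip()
    if cell in MISSING_CODES:
        return None
    return float(cell)


raw_ages = ["34", "", ".", "-9", "41"]  # hypothetical column as typed in
ages = [parse_numeric(cell) for cell in raw_ages]
present = [age for age in ages if age is not None]

print(ages)
print(len(present), "of", len(ages), "values present")
```

A text entry such as "unknown" would raise a `ValueError` in `float()`, which is one practical reason to avoid text in a numeric variable.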
Data Naming Exercise
Data Dictionaries / Code Books
• Brings together all data elements:
– Data types / formats
– Variable names
– Expected response values (range)
– Comments
• Self-generated vs. computer generated
• “Rosetta Stone” for the database
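A self-generated data dictionary can be as simple as a small table kept next to the database. The sketch below stores one as a Python dictionary and uses it to check a record; every variable name, range, and comment here is hypothetical.

```python
# Hypothetical data dictionary: one entry per variable.
DATA_DICTIONARY = {
    "age": {
        "type": "ratio", "format": "numeric", "min": 18, "max": 65,
        "comment": "age in years at enrollment",
    },
    "smoker": {
        "type": "nominal", "format": "numeric", "values": {0: "no", 1: "yes"},
        "comment": "current smoker",
    },
}


def check_record(record, dictionary):
    """Return (variable, message) pairs for every problem found in one record."""
    problems = []
    for name, spec in dictionary.items():
        value = record.get(name)
        if value is None:
            problems.append((name, "missing"))
        elif "values" in spec and value not in spec["values"]:
            problems.append((name, "unexpected value"))
        elif "min" in spec and not spec["min"] <= value <= spec["max"]:
            problems.append((name, "out of range"))
    return problems


print(check_record({"age": 230, "smoker": 1}, DATA_DICTIONARY))
```

Keeping the expected values in one place means the same dictionary can document the database for the statistician and drive the validation checks.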
Data Dictionary Exercise
Data Acquisition
Pick the best method for the environment
Data Acquisition Methods
• Interviews
• Questionnaires
• Assessments
– MCQ examinations
– OSCE / OSAT
• Laboratory studies
Data Acquisition Environments
• Observational encounters
• Structured research encounters
• Self-report
Data Acquisition Problems
• Major types of data issues:
– Missing data
– Incorrect data
– Excess variability
Data Acquisition Problems
• Reasons for poor data quality:
– Researcher-dependent data:
• Insufficient time
• Inadequate training
• Lack of focus on study tasks
• Poor communication
• Protocol deviation
Data Acquisition Problems
• Reasons for poor data quality:
– Subject-dependent data:
• Inadequate instruction
• Poor comprehension
• Sensitive or stigmatized behaviors
Data Acquisition Options
• Paper forms
• Direct entry
• Computer assisted data acquisition
Data Acquisition: Paper Forms
Advantages
• Controlled distribution and return
• Comments
• Double data entry
Disadvantages
• Anonymity
• Manual quality checks
• Data entry time / errors
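Double data entry means the same paper forms are keyed in twice and the two versions are compared, so that disagreements can be resolved against the original form. A minimal sketch of that comparison follows; the record layout and the function name `compare_entries` are assumptions for illustration.

```python
def compare_entries(first_pass, second_pass):
    """Return (record_id, variable, first_value, second_value) for each mismatch."""
    mismatches = []
    for record_id in sorted(first_pass):
        first = first_pass[record_id]
        second = second_pass.get(record_id, {})
        for variable in sorted(first):
            if first[variable] != second.get(variable):
                mismatches.append(
                    (record_id, variable, first[variable], second.get(variable))
                )
    return mismatches


# Hypothetical data: the second keying transposed the digits of one age.
entry_a = {1: {"age": 34, "smoker": 0}, 2: {"age": 41, "smoker": 1}}
entry_b = {1: {"age": 34, "smoker": 0}, 2: {"age": 14, "smoker": 1}}
print(compare_entries(entry_a, entry_b))
```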
Data Acquisition: Direct Entry
• Options:
– MS Excel, MS Access
– Epi Info – free on the web
– Direct entry into statistical software
• Pros / Cons:
– No data transcription
– Errors
Data Acquisition
• Computer assisted data acquisition:
– Automated data collection
– OCR forms
– Computer-based case report forms /
questionnaires
– Computer-assisted self-interviews
– Mobile computing device diaries
Data Acquisition: CASI
• Special Focus: Health Behaviors
– Factors which may affect reporting:
• Sensitive or stigmatized behaviors
• Age discrepancy between participant and
interviewer
• Lack of privacy
• Lack of comprehension of self-administered
questionnaires
Data Acquisition: CASI
• Computer-assisted self-interview (CASI):
– Computer-based interview
– Can incorporate audio, video, and text
– Respondent listens to or reads questions on
screen
– Submits answers through keypad or touch
screen
Data Acquisition: CASI
• Benefits of CASI:
– Interview conducted in privacy
– Standardized interview
– Computer controlled branching
– Automated consistency and range checking
– Multilingual administration
Data Validation
1. Is all of the data present?
2. Are the responses within the expected
range?
3. Does the data make sense?
Data Validation
• Is all of the data present?
– Visually examine the data cells
– Frequencies
Data Validation
• Are the responses within the expected
range?
– Frequencies
• Maximum / minimum values
– Descriptive statistics
• Means
• Standard deviations
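The checks above can be run in any statistical package. As a sketch using only the Python standard library, the snippet below produces frequencies for a categorical variable and minimum, maximum, mean, and standard deviation for a numeric one. The data, the expected range, and the variable names are all hypothetical.

```python
from collections import Counter
from statistics import mean, stdev

# Hypothetical data; None marks a missing response.
smoker = [0, 1, 1, None, 2, 0, 1]      # expected values: 0 = no, 1 = yes
ages = [34, 29, 41, 230, 37, None, 33]  # expected range: 18 to 65 years
EXPECTED_RANGE = (18, 65)

# Frequencies reveal both missing responses (None) and unexpected codes (2).
smoker_frequencies = Counter(smoker)

present = [age for age in ages if age is not None]
summary = {
    "n": len(present),
    "missing": len(ages) - len(present),
    "min": min(present),
    "max": max(present),
    "mean": round(mean(present), 1),
    "sd": round(stdev(present), 1),
}
low, high = EXPECTED_RANGE
out_of_range = [age for age in present if not low <= age <= high]

print(smoker_frequencies)
print(summary)
print(out_of_range)
```

Here the maximum and the inflated mean both point to the implausible age of 230, which can then be checked against the source record.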
Data Validation
Once the outlier is found, one can reference the chart for clarification
Descriptive Statistics
Data Distribution
Definitions by SPSS 16.0
Scatterplots
Who is Represented in the Data?
• Sample test of proportions
– Percent of gender
– Percent of ethnicity
• Sample test of means
– Age
– BMI
• Does our data reflect the population at large or a subset?
Who is not?
• Compare data of the included and excluded individuals
– Are they similar for:
• Age (continuous – Student t test)
• BMI (continuous – Student t test)
• Ethnicity (discrete/categorical – Chi-square test)
• Gender (discrete/categorical – Chi-square test)
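To make the two comparisons concrete, here is a minimal sketch of the underlying arithmetic using only the Python standard library. It computes the pooled-variance Student t statistic with its degrees of freedom, and the Pearson chi-square statistic for a 2x2 table without continuity correction. The function names and all data are hypothetical. The standard library has no t distribution, so the p-value for the t statistic is left to statistical software. The chi-square p-value shown is valid only for 1 degree of freedom.

```python
from math import erfc, sqrt
from statistics import mean, variance


def student_t(sample_a, sample_b):
    """Pooled-variance Student t statistic and degrees of freedom for two groups."""
    n_a, n_b = len(sample_a), len(sample_b)
    df = n_a + n_b - 2
    pooled = ((n_a - 1) * variance(sample_a) + (n_b - 1) * variance(sample_b)) / df
    t = (mean(sample_a) - mean(sample_b)) / sqrt(pooled * (1 / n_a + 1 / n_b))
    return t, df


def chi_square_2x2(table):
    """Pearson chi-square statistic and p-value (1 df) for a 2x2 table of counts."""
    (a, b), (c, d) = table
    n = a + b + c + d
    statistic = n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
    p_value = erfc(sqrt(statistic / 2))  # chi-square upper tail for 1 degree of freedom
    return statistic, p_value


# Hypothetical ages of included vs. excluded individuals.
included_ages = [30, 32, 34, 36, 38]
excluded_ages = [31, 33, 35, 37, 39]
t_stat, t_df = student_t(included_ages, excluded_ages)

# Hypothetical counts: rows = included / excluded, columns = female / male.
chi2_stat, chi2_p = chi_square_2x2([[20, 30], [30, 20]])

print(t_stat, t_df)
print(chi2_stat, chi2_p)
```

In practice these tests would be run in the statistical package used for the study; the sketch only shows what is being compared.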
Data Analysis
• Choose the right tool for the job
• Commonly used statistical tests:
– If the data are normally distributed (i.e., follow a bell-shaped curve), then we use parametric statistical tests.
– If the data (1) are not “bell-shaped”, (2) come from small samples (generally fewer than 30 per group), or (3) contain outliers, then we use nonparametric statistical tests.
• Choice of statistical test is based on:
– Distribution of the sample data
– Sample size
– Number of groups
– Independence of the groups
Comparison Measurement | Normal Distribution | # of Groups | Statistical Test
Mean (Average)         | Yes                 | 2           | Student’s t-test
Mean (Average)         | Yes                 | ≥3          | Analysis of Variance
Median                 | No                  | 2           | Wilcoxon Rank-Sum or Mann-Whitney U-test
Median                 | No                  | ≥3          | Kruskal-Wallis test
Proportions            | Yes                 | ≥2          | Chi-square test
Proportions            | No                  | ≥2          | Fisher’s exact test
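The table can be read as a simple lookup. The sketch below mirrors its rows literally, including the Yes/No column for proportions exactly as given. The function name `suggest_test` and its argument names are assumptions, and the result is a starting point for the statistical consultation rather than a substitute for it.

```python
def suggest_test(comparison, normal, n_groups):
    """Look up a test from the table; returns None when no row matches.

    comparison: "mean", "median", or "proportions"
    normal:     True if the table's Normal Distribution column reads "Yes"
    n_groups:   number of groups being compared
    """
    if n_groups < 2:
        return None
    three_or_more = n_groups >= 3
    if comparison == "mean" and normal:
        return "Analysis of Variance" if three_or_more else "Student's t-test"
    if comparison == "median" and not normal:
        if three_or_more:
            return "Kruskal-Wallis test"
        return "Wilcoxon Rank-Sum or Mann-Whitney U-test"
    if comparison == "proportions":
        return "Chi-square test" if normal else "Fisher's exact test"
    return None


print(suggest_test("mean", True, 2))
print(suggest_test("median", False, 4))
print(suggest_test("proportions", False, 2))
```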
Data Analysis
• Univariate vs. Multivariate
– Multivariate methods are being required more frequently in medical research because we are looking at relationships that involve more than a one-to-one association.
• Multivariate methods allow us to:
– Examine many variables simultaneously
– Adjust for baseline differences between groups
– Adjust for potential “confounding” variables
– Obtain “adjusted” measures of effect
• Examples of multivariate methods (explain or predict the dependent variable):
– Linear regression – to predict the value of a numerical measurement (viral load)
– Logistic regression – to predict a dichotomous outcome (pregnant/not pregnant)
– Cox proportional hazards regression – to predict time to an event (survival time)
Data Analysis
Session content, including narrated MS PowerPoint slides, is available at:
http://www.obgynknowledgebank.net