Post on 04-Mar-2021
Population Research Seminar Series
Session 3: Evaluating Survey Questions
Jack Fowler
Center for Survey Research
Survey and Statistical Methods Core
There are 7 kinds of standards
for questions
1. Ask the right question
2. Cognitive standards
3. Usability standards
4. Interpersonal standards
5. Psychometric standards
6. Multi-mode standards
7. Multi-lingual standards
Ways to evaluate survey
instruments and wording
1. Appraisal forms
2. Focus groups
3. Cognitive testing
4. Pretests with observation and/or debriefing (for self-administration) + paradata
5. Pretests with behavior coding (for interviews)
6. Split-ballot experiments
7. Analytic assessments of reliability and validity
Appraisal forms
• Can flag issues that are known to affect
understanding or usability
• However, most of them rely on some
judgments (not clear, difficult to recall) that
require additional information
Focus groups
• Recruit a few groups of 6-8 people to
come talk about the issues the survey will
cover
Focus groups
• Content standards
– What people’s relevant experiences and
situations are
– What people know and think about a topic
– What kinds of answers they give
• Cognitive standards
– Vocabulary: what different candidate words
mean to people and what words they use
– What questions they can answer
Strengths of focus groups
1. An efficient way to gather info about
several people at a time
2. Sometimes being in a group stimulates
ideas or raises issues that would not be
found one-on-one
Weaknesses of focus groups
• Do not get to probe into the way
individuals understand and answer
questions
• May not hear all the issues because each
person only gets so much air time
• Probing individuals in depth is what
one-on-one cognitive interviews do
Cognitive testing—a little history
• In 1984, an NSF conference brought together
survey researchers and cognitive scientists
• NCHS established a laboratory with NSF
funding in the latter 1980s
• Bureau of the Census started a lab a little
later
• Cognitive testing did not start to become
common until the early 1990s
What that means
• Questions that were designed before the
mid-1990s were not usually evaluated for
whether or not people understood the
questions or their answers meant what the
researchers hoped they meant
Answering questions
1. Comprehending what is being asked for
2. Having information relevant to the answer
3. Working with the information to put it in a
form needed to answer
4. Providing the answer
Error happens
• When one of these four steps is handled
problematically, validity is at risk
• The goal of question design is to minimize
the risk of problems at each of the steps of
the question answering process
• The goal of cognitive testing is to evaluate
how well each of these steps is performed
when someone answers a question
Cognitive testing
• Recruit volunteers
• Usually pay them
• Interviews can take 1-1.5 hours
• Interviews are almost always tape
recorded and reviewed after interview
Protocols
• There is a lot of diversity in the way
interviews are done:
• Think aloud
• Follow-up probes after first asking
respondent to answer the test question
• Probes can be highly structured or
interviewers can be given a lot of flexibility
Potentially probes can focus on
each aspect of the process
1. What the question is asking
2. What the respondent knows or thinks
3. Refinement needed to get material needed for answer
4. Turning relevant material into an answer
What question means
• IN THE PAST YEAR, HOW MANY TIMES
HAVE YOU SEEN OR TALKED WITH A
DOCTOR ABOUT YOUR OWN HEALTH?
Could you say in your own words what the
question is asking?
Do you think the question includes times when
you talked with a doctor on the telephone?
Do you think the question includes times when
you exchanged e-mails with a doctor?
What the respondent knows or thinks
• How many times do you think you have been to
a doctor’s office about your own health in the
last year?
• How do you remember that?
• Have you had your eyes checked in the last year
by a doctor? (Did you include that in your
answer?)
• Was there a time when you saw a nurse
practitioner, but not the doctor? (Did you count
that?)
Refining info to get close to answer
• So, could you take me through the process
you went through to decide how many times
you saw or talked with a doctor in the last
year?
• Were there any doubtful cases that you
decided to leave out that you thought about
including?
• Any that you included that you considered
leaving out?
Continued
• How confident are you that you have the
right answer?
• If you are in error, what would you guess
is the most likely kind of error you could
have made?
Providing an answer
• Did you consider more than one possible
answer before you gave your answer?
• How did you decide which one to give?
• Is there any way in which the answer you
gave might not be a good reflection of
what the true answer is?
Example
• Do you think we are doing too much, too
little or about the right amount to fight
terrorism?
Example
• How would you rate the importance of the
following behaviors that might promote good
health—very important, somewhat important,
or not important at all?
• Doing moderate exercise for 20 minutes at
least 3 days a week
• Having at least 2 glasses of wine, cans of
beer or 1.5 oz drinks of alcohol every day
What cognitive testing can evaluate
• Asking right question?
• Cognitive standards
• Maybe psychometric issues
• Multi-lingual standards
Conclusion
• The results depend on the investigator
juxtaposing the information provided in the
cognitive interview and the idealized way
that a respondent should answer if the
answer is going to be a “valid” measure of
the target construct
Usability testing
• Observation
• Debriefing
• Analysis of errors of navigation
• Paradata (when computer based)
Pretests with behavior coding
Pretests of interviews with some people like
the planned respondents have been
standard for years
Main input from interviewer debriefing
Not very reliable
Not very informative
What should question-answer
process look like?
1. Interviewer (I) reads question exactly as worded
2. Respondent (R) understands question as intended
3. R retrieves information needed to answer
4. R puts answer in form required and tells
interviewer
Premises of behavior coding
1. Deviations from this ideal may reflect
problems that are threats to validity of data
2. The wording of a question often is the
direct cause of these problems
3. The presence of problems can often be
inferred from the behavior of interviewers
or respondents
How to do behavior coding
• Can be done live using an observer
• Mainly done by recording interview and
then having trained coders listen to
recordings
• NOTE: Experience shows that almost
everyone who is willing to be interviewed
agrees to be tape recorded
What to code
• Unit of observation is usually the question
• The core focus of most coding is to count
deviations from an ideal question and
answer process
Question reading
• Read exactly as worded
• Minor changes
• Major changes
• Interrupted
Question answering
• Adequate (codable) answer given
• Inadequate (uncodable) answer given
• Qualified answer (“I think,” “It might be...”)
• Refusals and “don’t knows”
Other aspects of R behavior
• Asks for clarification
• Asks for all or part of question to be
repeated
Sample output for each question
• Often results are reported like this:
– % read exactly as worded
– % interrupted reading
– % asked for clarification
– % gave inadequate answer
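A minimal sketch of how per-question rates like these could be tallied from coder output. The record layout, the code labels ("exact", "minor", "major", "adequate", "inadequate"), and the sample records are all hypothetical, chosen only for illustration; they are not a standard coding scheme.

```python
from collections import defaultdict

# Hypothetical coder output: one record per administration of a question.
coded = [
    {"question": "Q1", "reading": "exact", "interrupted": False, "clarification": False, "answer": "adequate"},
    {"question": "Q1", "reading": "exact", "interrupted": False, "clarification": True, "answer": "adequate"},
    {"question": "Q1", "reading": "minor", "interrupted": False, "clarification": False, "answer": "inadequate"},
    {"question": "Q1", "reading": "exact", "interrupted": True, "clarification": False, "answer": "adequate"},
    {"question": "Q2", "reading": "exact", "interrupted": False, "clarification": False, "answer": "adequate"},
    {"question": "Q2", "reading": "major", "interrupted": False, "clarification": True, "answer": "inadequate"},
]


def summarize(records):
    """Return, for each question, the percentage of administrations showing each behavior."""
    by_question = defaultdict(list)
    for record in records:
        by_question[record["question"]].append(record)

    summary = {}
    for question, rows in by_question.items():
        n = len(rows)
        summary[question] = {
            "n": n,
            "pct_read_exactly": 100 * sum(r["reading"] == "exact" for r in rows) / n,
            "pct_interrupted": 100 * sum(r["interrupted"] for r in rows) / n,
            "pct_clarification": 100 * sum(r["clarification"] for r in rows) / n,
            "pct_inadequate": 100 * sum(r["answer"] == "inadequate" for r in rows) / n,
        }
    return summary


summary = summarize(coded)
for question, rates in summary.items():
    print(question, rates)
```

Questions whose rates stand out from the rest of the instrument would be the candidates for a closer look.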
Let’s try some behavior coding
What do results tell us?
(Some examples)
1. Interruptions often occur when there is
dangling material in the question, such as
a definition after the question.
• How many children do you have? Do not
include step children.
What do results mean?
2. Inadequate answers often occur when it is
unclear how to answer the question.
• When did you move to New York?
• Unclear whether interviewer wants date,
years ago, or stage in life cycle—and how
precise it has to be.
What do results mean?
3. Requests for clarification often mean
there is an unclear term or concept.
• When did you move to New York?
• Does that mean the city, the metropolitan
area or the state?
Effects on Data
• One of the clearest findings is that the
higher the rate at which interviewers have
to probe to get an adequate answer, the
higher the interviewer-related error
More effects on data
• There is evidence that qualified answers
and response latency are related to the
“accuracy” of responses
Strengths of behavior coding as question
evaluation method
• Low cost—easily integrated into pretest
• Evaluation is of how questions work under
realistic data collection conditions
• Results are reliable
• Results are objective—i.e. not dependent
on an individual’s subjective assessment
• Results are quantitative
Weaknesses of Behavior Coding
1. Sometimes hard to diagnose reason for
results and how to fix it
2. Can’t always tell if an observed problem
actually affects data
3. Some question problems (such as
comprehension) do not show up in
behavior coding
Split-ballot experiments
• About how many months has it been since
you last saw or talked to a dentist? Include
all types of dentists, such as orthodontists,
oral surgeons, or all other dental specialists,
as well as dental hygienists.
• About how many months has it been since
you last went to a dentist office for any type
of dental care?
Results
                                             ORIGINAL     ALTERNATIVE
6 months or less                             60%          57%
More than 6 months but not more than 1 year  14%          18%
More than 1 year                             26%          25%
TOTAL                                        100% (n=77)  100% (n=79)
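One way to ask whether the two dental wordings produced different distributions is a Pearson chi-square test on the table above. The sketch below is illustrative only: the cell counts are an assumption, reconstructed by multiplying the rounded percentages by each n, and the test treats the two half-samples as independent simple random samples.

```python
import math


def chi_square_2xk(row_a, row_b):
    """Pearson chi-square statistic for a 2 x k table of counts."""
    n_a, n_b = sum(row_a), sum(row_b)
    total = n_a + n_b
    stat = 0.0
    for a, b in zip(row_a, row_b):
        column_total = a + b
        expected_a = column_total * n_a / total
        expected_b = column_total * n_b / total
        stat += (a - expected_a) ** 2 / expected_a
        stat += (b - expected_b) ** 2 / expected_b
    return stat


# Approximate counts (assumed): rounded percentages times n, in the order
# "6 months or less", "6 months to 1 year", "more than 1 year".
original = [46, 11, 20]      # n = 77
alternative = [45, 14, 20]   # n = 79

chi2 = chi_square_2xk(original, alternative)
# With 3 categories and 2 versions there are 2 degrees of freedom, and for
# df = 2 the upper-tail probability of the chi-square distribution is exp(-x / 2).
p_value = math.exp(-chi2 / 2)
print(f"chi-square = {chi2:.2f}, p = {p_value:.2f}")
```

With these approximate counts the statistic is small and the p-value is large, so the two wordings cannot be distinguished at these sample sizes.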
Split-ballot experiment
• In the last 12 months, how often did you
get an appointment for regular or routine
health care as soon as you wanted -
always, usually, sometimes, never?
(“Always” recoded to “Yes”)
• In the last 12 months, were you always
able to get an appointment as soon as you
wanted?
Results
         ORIGINAL      ALTERNATIVE
Yes      47%           66%
No       53%           34%
Total    100% (n=261)  100% (n=299)
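For a yes/no outcome like this one, a two-proportion z-test and a confidence interval for the difference give a rough sense of whether the wording effect exceeds sampling noise. This is a sketch under stated assumptions: the "Yes" counts are reconstructed from the rounded percentages, and the two half-samples are treated as independent simple random samples.

```python
import math


def two_proportion_test(yes_a, n_a, yes_b, n_b):
    """Return (difference, z, two-sided p, 95% CI) for proportion b minus proportion a."""
    p_a, p_b = yes_a / n_a, yes_b / n_b
    diff = p_b - p_a

    # Pooled standard error for the test of "no difference".
    pooled = (yes_a + yes_b) / (n_a + n_b)
    se_pooled = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = diff / se_pooled
    p_two_sided = math.erfc(abs(z) / math.sqrt(2))

    # Unpooled standard error for the confidence interval.
    se_unpooled = math.sqrt(p_a * (1 - p_a) / n_a + p_b * (1 - p_b) / n_b)
    ci = (diff - 1.96 * se_unpooled, diff + 1.96 * se_unpooled)
    return diff, z, p_two_sided, ci


# Approximate counts (assumed): 47% of 261 and 66% of 299, rounded.
diff, z, p_two_sided, ci = two_proportion_test(123, 261, 197, 299)
print(f"difference = {diff:.3f}, z = {z:.2f}, p = {p_two_sided:.2g}")
print(f"95% CI for difference: ({ci[0]:.3f}, {ci[1]:.3f})")
```

Here the interval excludes zero, consistent with the wording change shifting the estimate. The test says nothing about which version is closer to the truth.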
Split-ballot experiments are needed to
find out if wording affects estimates
They also are key to evaluating whether or
not questions meet:
• Multi-mode standards
• Multi-lingual standards
Validity must be established via theory or
consistency with other evidence
Psychometric analysis
• Reliability
• Validity
Reliability
• Extent to which the same people, who should
not have changed much, give the same
answer at two points in time
• Extent to which people for whom true
value is the same give the same answers
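A minimal sketch of the first definition for a categorical question asked twice of the same people. It reports simple percent agreement and Cohen's kappa, which discounts the agreement expected by chance. Kappa is one common choice here, not something the slides prescribe, and the answers below are hypothetical.

```python
from collections import Counter


def agreement_and_kappa(time1, time2):
    """Return (proportion agreeing, Cohen's kappa) for paired categorical answers."""
    n = len(time1)
    observed = sum(a == b for a, b in zip(time1, time2)) / n

    # Chance agreement: product of the marginal proportions, summed over categories.
    counts1, counts2 = Counter(time1), Counter(time2)
    categories = set(counts1) | set(counts2)
    expected = sum(counts1[c] * counts2[c] for c in categories) / n**2

    # Undefined when expected == 1 (everyone gives one answer both times).
    kappa = (observed - expected) / (1 - expected)
    return observed, kappa


# Hypothetical answers from ten respondents interviewed twice.
time1 = ["yes", "yes", "no", "no", "yes", "no", "yes", "no", "yes", "yes"]
time2 = ["yes", "yes", "no", "yes", "yes", "no", "yes", "no", "no", "yes"]

observed, kappa = agreement_and_kappa(time1, time2)
print(f"agreement = {observed:.2f}, kappa = {kappa:.2f}")
```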
Validity
• Extent to which answers measure what
they are supposed to measure
• Note: Answers can be reliable and yet not
be valid
How to measure validity
• Correspondence between answers and
some other “true” measure of what is to be
measured. Record-check study is good
example.
• Correspondence between answers and
other answers or measures that are
supposed to be measuring or related to
the same thing
And what is bias?
• A systematic difference between survey
answers and the “true” scores
• Example: On average, people report their
weight as lower than a scale reading
• Answers can be biased and valid: that is,
they can be systematically different but
correlated highly
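The weight example can be made concrete with a small sketch. The numbers are hypothetical: each person reports a few pounds less than the scale reading. Bias is computed as the mean difference, and validity as the Pearson correlation between reports and scale readings.

```python
import math


def mean(values):
    return sum(values) / len(values)


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length lists."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


# Hypothetical data in pounds: scale readings and self-reports for five people.
scale = [150.0, 180.0, 135.0, 210.0, 165.0]
reported = [146.0, 174.0, 132.0, 201.0, 160.0]

bias = mean(reported) - mean(scale)   # negative means under-reporting on average
r = pearson_r(reported, scale)        # high r means reports track the scale closely
print(f"bias = {bias:.1f} lb, r = {r:.3f}")
```

In this made-up sample every report is too low, so the answers are biased, yet they rank people almost exactly as the scale does, so they are still highly valid in the correlational sense.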
More on Bias
• And bias is only meaningful with respect to
factual questions
• Since it is not possible to obtain a
calibrated “true score” for a subjective
state (happiness?), bias is meaningless
What does it mean to have a
“validated measure”?
• Validity = a correlation coefficient
• If it is a positive number, that means there
is evidence that the answers measure the
“true scores” to some extent...
Qualifications...
• In the population in which the data were collected
• Under the conditions in which the data were collected
• To the extent that you can be convincing that whatever it is correlated with has something to do with the true value you are trying to measure
Psychometric Testing
• In the end, evidence that you are
measuring what you want to measure is
the ultimate question standard.
• Often, it is hard to gather that evidence
• And it usually requires forethought and
special effort to have the data to assess
validity
Thank you.