Transcript of PREP #11: Reliability, Validity, and Questionnaire Design

PREP #11: Reliability, Validity, and Questionnaire Design

Public Research Education Program

Suzanne Sunday, Ph.D. Biostatistics Unit

February 21, 2013 The Feinstein Institute for Medical Research

North Shore-LIJ Health System

CME Disclosure Statement

• The North Shore-LIJ Health System adheres to the ACCME’s new Standards for Commercial Support. Any individuals in a position to control the content of a CME activity, including faculty, planners, and managers, are required to disclose all financial relationships with commercial interests. All identified potential conflicts of interest are thoroughly vetted by the North Shore-LIJ for fair balance and scientific objectivity and to ensure appropriateness of patient care recommendations.

• Course Director Kevin Tracey has disclosed a commercial interest in Setpoint, Inc. as cofounder, in the form of stock and consulting support. He has resolved his conflicts by identifying a faculty member with no conflicts to conduct content review of this program.

• Suzanne Sunday has nothing to disclose.

• Describe validity and reliability of measures

• Describe the definitions, calculation, and use of various statistical measures of validity and reliability

• Discuss issues related to the design of questionnaires and why validity is important

• In small groups, attempt to develop valid questions for a questionnaire

Objectives

• While it sounds easy to create a new self-report scale or new evaluation tool, it never is!

• If possible, use an established, valid and reliable test or evaluation

• Even if you are using a valid test or evaluation, you’ll still need to establish reliability with your sample or with your evaluators/raters

• If you have to make changes to a validated, reliable instrument, make as few as possible and DO NOT make substantive changes

• If you are going to create your own measure, build in the time necessary to establish validity and reliability BEFORE using the measure in your study!

• Repeatable, consistent, precise

• Reliable scores minimize error through:

– Standardization (e.g. instructions, environment, training of evaluators)

– Aggregation (e.g. multiple questions within scales/subscales, raters)

• Types

– Test-retest and parallel-forms/alternate-forms reliability

– Interrater reliability

Reliability

• Self-report tests

– Compare test-retest over a short time frame in a standardized environment with the same instructions

– Internal consistency of items within a scale—Cronbach’s coefficient alpha (see the sketch below)

• Evaluations

– Establish clear, repeatable trainings (scripted, videos)—these can be put online

– Create videos of test cases (10 is a usual number) which represent the range of possible responses and have all raters watch and rate the videos

– Interrater reliability—intraclass correlations, Kappas

How to Establish Reliability-1
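A minimal Python sketch of both checks above: test-retest correlation and Cronbach’s coefficient alpha. The function names and simulated scores are illustrative only, not part of any established instrument.

```python
import numpy as np

def cronbach_alpha(items):
    """Cronbach's coefficient alpha for an (n_subjects x n_items) score matrix."""
    items = np.asarray(items, dtype=float)
    k = items.shape[1]                           # number of items in the scale
    item_vars = items.var(axis=0, ddof=1)        # variance of each item
    total_var = items.sum(axis=1).var(ddof=1)    # variance of the total score
    return (k / (k - 1)) * (1 - item_vars.sum() / total_var)

def test_retest_r(time1, time2):
    """Test-retest reliability: correlation between two administrations."""
    return np.corrcoef(time1, time2)[0, 1]

rng = np.random.default_rng(0)
scores = rng.integers(1, 6, size=(50, 10))       # 50 subjects, 10 Likert items
print(cronbach_alpha(scores))                    # random items, so alpha near 0
```

By common rules of thumb, an alpha of roughly 0.70 or higher is taken as acceptable internal consistency for research use.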

• Isn’t simple agreement sufficient?

– No; you need to account for the agreement expected by chance

– Kappa (of which intraclass correlations are a subtype) adjusts observed agreement for chance agreement (see the sketch below)

• Can have weighted Kappas if certain disagreements are more serious than others

– Example: For semi-structured psychiatric diagnostic interviews, the lack of distinction between affective disorders and psychotic disorders is more serious than the lack of distinction between affective disorders and anxiety disorders

How to Establish Reliability-2
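A minimal sketch of chance-corrected agreement for two raters, covering both the unweighted and weighted cases described above; the diagnosis labels and data are illustrative.

```python
import numpy as np

def cohens_kappa(rater1, rater2, categories, weights=None):
    """Cohen's kappa: two-rater agreement corrected for chance.
    weights=None treats all disagreements equally; "linear" or "quadratic"
    penalizes disagreements between distant categories more heavily."""
    k = len(categories)
    idx = {c: i for i, c in enumerate(categories)}
    obs = np.zeros((k, k))
    for a, b in zip(rater1, rater2):             # observed rating pairs
        obs[idx[a], idx[b]] += 1
    obs /= obs.sum()
    expect = np.outer(obs.sum(axis=1), obs.sum(axis=0))   # chance agreement
    if weights is None:
        w = 1.0 - np.eye(k)                      # any disagreement counts as 1
    else:
        d = np.abs(np.subtract.outer(np.arange(k), np.arange(k)))
        w = d if weights == "linear" else d ** 2
    return 1.0 - (w * obs).sum() / (w * expect).sum()

# Category order places the most serious confusion (affective vs. psychotic,
# per the example above) furthest apart, so the weighting penalizes it most.
cats = ["affective", "anxiety", "psychotic"]
dx1 = ["affective", "anxiety", "psychotic", "affective", "anxiety"]
dx2 = ["affective", "anxiety", "affective", "affective", "psychotic"]
print(cohens_kappa(dx1, dx2, cats))
print(cohens_kappa(dx1, dx2, cats, weights="linear"))
```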

• Intraclass correlation is generally used for quantitative ratings (see the sketch below)

• What is good reliability for a Kappa?

– Complete agreement would equal 1, but you will not get that in practice

– >0.75 is excellent agreement

– <0.4 is poor agreement

– 0.4-0.75 is fair to good agreement

How to Establish Reliability-3
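A minimal sketch of one common intraclass correlation, ICC(2,1) from Shrout and Fleiss (two-way random effects, absolute agreement, single rater), computed from ANOVA mean squares. Other ICC forms exist, and the ratings shown are made up.

```python
import numpy as np

def icc2_1(ratings):
    """ICC(2,1): two-way random effects, absolute agreement, single rater
    (Shrout & Fleiss), for an (n_targets x k_raters) matrix of ratings."""
    x = np.asarray(ratings, dtype=float)
    n, k = x.shape
    grand = x.mean()
    msr = k * ((x.mean(axis=1) - grand) ** 2).sum() / (n - 1)   # between targets
    msc = n * ((x.mean(axis=0) - grand) ** 2).sum() / (k - 1)   # between raters
    sse = ((x - grand) ** 2).sum() - (n - 1) * msr - (k - 1) * msc
    mse = sse / ((n - 1) * (k - 1))                             # residual error
    return (msr - mse) / (msr + (k - 1) * mse + k * (msc - mse) / n)

# e.g., 10 test-case videos (rows) scored by 3 raters (columns)
ratings = np.array([[4, 4, 5], [2, 3, 2], [5, 5, 5], [1, 1, 2], [3, 3, 3],
                    [4, 5, 4], [2, 2, 2], [5, 4, 5], [1, 2, 1], [3, 4, 3]])
print(icc2_1(ratings))   # >0.75 would be excellent by the thresholds above
```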

• Process by which evidence is collected to support the types of inferences that should be drawn from the test scores

• A developing process that changes over time

• Concerns what the test measures and how well it measures it—the meaning of the scores, accuracy

– May or may not reflect the name of the test and what the test supposedly measures

– Must be established based on performance on the test and other independently observable data or test scores concerning the characteristic(s) of interest

• Reliability constrains validity; it is necessary but not sufficient to establish validity

Validity

• Content—degree to which questions are representative of the universe of behavior the test was designed to sample

• Criterion-related—to draw inferences from the test to performance on some real behavioral variable; the score is effective in estimating a person’s performance on an outcome measure

• Construct—to draw inferences from the test to performances that can be grouped under a particular construct. Constructs are generally theoretical, complex, and multifaceted

Is there a type of validity you expected that isn’t listed?

Categories of Validity

• Does the test appear superficially to measure what it is purported to measure?

• Not really a form of validity, although used frequently

• Having face validity is necessary for a test to be successful, especially to get buy-in from a non-research community

What was missing? Face Validity

• Does the test cover a representative sample of the domain to be measured?

– Must have a full description of the domain/trait

– Must be established based on the test and other independently observable data about the characteristics

– Must have sufficient numbers of questions

– Must not over-represent questions that are more easily translated into objective questions

• Preparation of items (which must be clearly defined)

– Use the literature to establish areas

– Use of experts—either to create or review the items

Content Validity

• Criterion contamination—be sure that a rater isn’t influenced by the test score

• Concurrent validity—the criterion measures must be obtained at about the same time as the test score

– Correlations

– Examples: Math achievement test and math grades, score on psychiatric test and formal psychiatric diagnosis

• Predictive validity—the criterion measures are obtained in the future

– Regressions

– Example: SAT scores predicting college GPA (see the sketch below)

Criterion-Related Validity
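Both criterion checks reduce to standard statistics. A minimal sketch with made-up numbers: a correlation for concurrent validity and a regression for predictive validity.

```python
import numpy as np

# Concurrent validity: correlate test scores with a criterion measured
# at about the same time (e.g., a math achievement test vs. math grades)
test = np.array([72.0, 85, 90, 60, 78, 95, 55, 88])
grades = np.array([70.0, 80, 92, 65, 75, 97, 58, 85])
print(np.corrcoef(test, grades)[0, 1])

# Predictive validity: regress a future criterion on the test
# (e.g., college GPA on SAT scores); the fitted line yields predictions
sat = np.array([1100.0, 1250, 1400, 980, 1320, 1500, 1050, 1280])
gpa = np.array([2.9, 3.2, 3.6, 2.5, 3.4, 3.8, 2.8, 3.3])
slope, intercept = np.polyfit(sat, gpa, 1)
print(slope * 1200 + intercept)   # predicted GPA for an SAT score of 1200
```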

• Items homogeneous, measuring a single construct

• Internal consistency—subtest/total correlations

• Developmental/age differentiation for some tests

• Theory-consistent group differences

• Theory-consistent intervention effects

• Convergent validity—test correlates (although not too highly) with other tests of similar constructs

• Discriminant validity—test does not correlate with tests that address different constructs

• Factor analysis—a specialized statistical procedure to identify the minimum number of factors required to account for the intercorrelations between tests (see the sketch below)

Construct Validity
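A minimal sketch of the convergent/discriminant checks and an exploratory factor analysis, using simulated data and scikit-learn’s FactorAnalysis as an assumed dependency; with real data you would substitute actual scale scores.

```python
import numpy as np
from sklearn.decomposition import FactorAnalysis

rng = np.random.default_rng(0)
items = rng.normal(size=(200, 12))       # 200 subjects x 12 items (simulated)
total = items.sum(axis=1)                # total score on the new measure

# Stand-ins for scores on other instruments
similar = total + rng.normal(scale=total.std(), size=200)   # similar construct
unrelated = rng.normal(size=200)                            # different construct

print(np.corrcoef(total, similar)[0, 1])     # convergent: expect moderate-high
print(np.corrcoef(total, unrelated)[0, 1])   # discriminant: expect near zero

fa = FactorAnalysis(n_components=2).fit(items)
loadings = fa.components_.T              # item-by-factor loadings
print(loadings.shape)                    # (12 items, 2 factors)
```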

• Homogeneous, all about impulsivity (but what is it?)

• Internal consistency: subscales?—Cronbach’s alpha

• Child and adolescent versions

• Theory-consistent group differences

• Theory-consistent intervention effects

• Convergent validity—test correlates with the BIS (Barratt Impulsiveness Scale)

• Discriminant validity—test does not correlate with depression or anxiety measures

• Factor analysis if sufficient n

Example of Construct Validity: Developing a new measure of impulsivity for children

So, after all of this, you still want to design your own test or evaluation. What do you need to do?

1. Defining the test

2. Selecting a scaling method

3. Constructing the items

4. Testing the items (pilot study)

5. Revising the test

6. Cross-validation with a new sample

7. Publishing the test

Test Construction Process

Refer to the literature.

• What will the test measure?

• How will it differ from existing tests? Test needs to add something to our body of research

• Must have a strong theoretical and research basis

• Must be easy to administer and objective to score

• Be sensitive to diverse needs: developmentally, culturally

Defining the Test

• Nominal—numbers serve as arbitrary category names; e.g. Glasgow Coma Scale: Eyes open 1=none, 2=to pain, 3=to speech, 4=spontaneously

• Ordinal—ranking; e.g. strongly disagree to strongly agree, Likert scales (generally an odd number of response options)

• Interval—provides information about ranking using equal-sized units or intervals; e.g. visual analog scale (representations of all three types are sketched below)

Selecting a Scaling Method
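As a rough illustration, one way the three scale types might be represented in code; the values are illustrative only.

```python
# Nominal: numbers are arbitrary category labels; only counts and modes
# are meaningful (Glasgow Coma Scale eye-opening, per the slide)
gcs_eyes = {1: "none", 2: "to pain", 3: "to speech", 4: "spontaneously"}

# Ordinal: ranked categories with no guarantee of equal spacing
likert = ["strongly disagree", "disagree", "neutral", "agree", "strongly agree"]
response_rank = likert.index("agree")    # rank 3 (0-based)

# Interval: equal-sized units, so means and differences are meaningful
vas_mm = 63.5                            # mark on a 0-100 mm visual analog scale
```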

• Language

– Precision of language and specificity of questions

– Readability, including reading level—usually about 4th grade (an automated check is sketched below)

– Issues of loaded, sensitive, threatening questions

– Demand characteristics and social desirability

• Ask only one question per item

• Time frames: last week vs. last year vs. ever

• What subjects know about the test (e.g. its name)

• Homogeneous vs. varied items

• Number of items and complications

Constructing the Items-1
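The readability target above can be checked automatically. A minimal sketch, assuming the third-party textstat package and a made-up item:

```python
# pip install textstat
import textstat

item = "In the past week, how many days did you skip breakfast?"
print(textstat.flesch_kincaid_grade(item))   # aim for about a 4th-grade level
```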

• Questionnaire appearance

– Paper & pencil vs. electronic

– Fonts and spacing

– Double vs. single-sided

• Item formats

– Open-ended vs. closed-ended: multiple choice, forced choice, matching, true/false

– Examples given within an item might bias answers to focus on those examples

Constructing the Items-2

• Framing of questions—word some items positively and others negatively to protect against people responding with all “5s” because they strongly agree with everything or are too busy to read all of the questions carefully

• Lie indices (MMPI)

• Faking good and bad (MMPI)

Constructing the Items-3

• Pick a topic

• Each group develops 5 brief questions that it thinks distinguish respondents on the topic

• Have scribe record the questions

Small groups

• Norms

• Focus groups/panels

• Diaries

Other Issues

• Frame of reference or norms

– Compares scores with a standardized or normative group—transformation of scores into z scores or T scores (see the sketch below)

• Age controls

• Psychiatric controls

• Nationalities

– The normative group is selected through random or stratified sampling

– Requires describing the distribution of the normative group and comparing your sample’s distribution with it

Norms
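A minimal sketch of the norm-referenced transformation above; the normative mean and SD are assumed known, and the numbers are made up.

```python
import numpy as np

def to_standard_scores(raw, norm_mean, norm_sd):
    """Express raw scores relative to a normative group."""
    z = (np.asarray(raw, dtype=float) - norm_mean) / norm_sd   # z: mean 0, SD 1
    t = 50 + 10 * z                                            # T: mean 50, SD 10
    return z, t

z, t = to_standard_scores([19, 26, 31], norm_mean=25.0, norm_sd=5.0)
print(z)   # [-1.2  0.2  1.2]
print(t)   # [38. 52. 62.]
```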

• Often used as a first step in formulating a questionnaire

• Ask people who are in the group you are interested in about your questions

• Need to clearly specify your format and the questions that you want to ask

• Results from a focus group can often be published as a separate paper, but you need to record the proceedings and summarize them completely

Focus Groups or Panels

• Generally considered an inferior data collection technique but there may be no other way of collecting the data, e.g. food diaries

• Techniques to improve results

– Extensive training of subjects, including a practice period (with review by researcher) with possible retraining BEFORE actual data is collected

– Using electronic methods (e.g. PDAs) with reminders (e.g. alarms) and frequent downloads

– Entries should be completed in the moment, not later; questions should be asked with specific time frames

– Include checklists (e.g. listing condiments)

Diaries