Development of health measurement scales - part 1


Dr. Rizwan S A, M.D.

OUTLINE OF PRESENTATION

• Introduction: basic concepts
• Devising the items
• Scaling responses
• Selecting the items
• Biases in responding
• From items to scale
• Article

Some terms

• Scale
• Subscales
• Items
• Responses

History of scales

• Initially, mortality and morbidity indicators were used.
• After these rates came down, they were no longer representative or sensitive, creating a need for new health indices (happiness, quality of life (QOL), sadness).
• World War II provided the impetus.
• Scaling techniques were developed by psychologists.
• Sampling techniques were developed by political scientists.
• Developments in data analysis.

Psychophysics and psychometrics

• Power law: humans can make consistent numerical estimates of sensory stimuli.
• This evidence is extrapolated to the idea that people can also make subjective judgments about health in a consistent manner.

32 degrees Celsius

Depression score - 32

Basic Steps in Scale development

1. Searching the literature
• Awareness of existing scales for the same purpose

2. Critical review
• Reliability
• Validity

Basic Steps

Reliability: the degree to which the results obtained by a measurement procedure can be replicated.

Assessing Reliability
• Internal consistency
• The average correlation among all the items in the measure.
• It is calculated by Cronbach's alpha, Kuder-Richardson, or split halves.
• Stability
• Reproducibility of a measure on different occasions.
• Inter-observer reliability
• Intra-observer reliability
• Test-retest reliability
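
As an illustration of the internal-consistency bullet, the following is a minimal sketch of Cronbach's alpha using the usual variance form, alpha = k/(k-1) × (1 − sum of item variances / variance of total scores). The function name and the response data are hypothetical and are not taken from the presentation.

```python
from statistics import variance

def cronbach_alpha(scores):
    """Cronbach's alpha for a list of respondents, each a list of item scores."""
    k = len(scores[0])  # number of items
    # Sample variance of each item across respondents
    item_vars = [variance([row[i] for row in scores]) for i in range(k)]
    # Sample variance of the total (summed) score
    total_var = variance([sum(row) for row in scores])
    return (k / (k - 1)) * (1 - sum(item_vars) / total_var)

# Hypothetical data: 5 respondents answering 3 items scored 1-5
responses = [
    [2, 3, 3],
    [4, 4, 5],
    [1, 2, 2],
    [3, 3, 4],
    [5, 4, 5],
]
alpha = cronbach_alpha(responses)
print(round(alpha, 3))
```

With dichotomous (0/1) items the same calculation corresponds to KR-20, provided the same variance convention is used throughout.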

Basic Steps

Validity
• An expression of the degree to which a measurement measures what it purports to measure

Types:
1. Face validity: the relevance of the measurement may appear obvious to the investigator
2. Content validity: the extent to which the measurement incorporates the domain of the phenomenon under study
3. Construct validity: the extent to which the measurement corresponds to theoretical concepts
4. Criterion validity: the extent to which the measurement correlates with an external criterion of the phenomenon under study
• Concurrent validity
• Predictive validity

Reliability Vs. Validity

Basic Steps

Traditions of assessments

Categorical model (e.g., DSM-III-R)
• Diagnosis requires multiple criteria, each with threshold values
• Differences between cases and non-cases are implicit in the definition
• Severity is lowest in instances that minimally satisfy the diagnostic criteria
• One diagnosis often precludes others

Dimensional model (e.g., CES-D)
• Occurrence of some features at high intensity can compensate for the non-occurrence of others
• Differences between cases and non-cases are less clearly delineated
• Severity is lowest among non-disturbed individuals
• A person can have varying amounts of different disorders

Multidimensional scaling is a bridge between them

Basic Steps

• Reduction of measurement error

In clinical observation
• Through training
• Interviewing skills
• Clinical experience

In the psychometric tradition
• Items are screened to meet certain criteria
• Consistency of answers across many items
• The scale as a whole is checked to see whether it meets other criteria

• The two solitudes can be merged using the Diagnostic Interview Schedule (DIS) in psychiatry

It is derived from the clinical examination used to diagnose psychiatric patients, but it can be administered by trained lay people




Devising the items

• Item
• Refers to an individual question or response phrase in any health measurement.

• The first step in writing a scale is devising the items
• By exploring various sources
• Identifying the strengths and weaknesses of each of them

• Items may be reused from previous scales
• Advantages
• Saves work and the necessity of constructing new items
• Proof of being useful and psychometrically sound
• May be the only way of asking about a specific problem
• Disadvantages
• Outdated terminology
• Inadequate or incomplete for the domain under study

Devising the items

• Sources of items
1. Focus group
2. Key informant interviews
• Clinical observation
• Patients
3. Theory
4. Research findings
5. Expert opinion

• A scale may consist of items derived from some or all of these sources.

Devising the items

• After the items are generated, content validity should be addressed
• Content relevance: each item on the test should relate to one of the course objectives
• Content coverage: each part of the syllabus should be represented by one or more questions.

Devising the items

• Generic versus specific scales (fidelity versus bandwidth issue)

Generic scale (bandwidth)
• Allows comparison across different disorders, severities of disease, interventions, and demographic and cultural groups
• Psychometric properties are well established

Specific scale (fidelity)
• Questions will be relevant and appropriate for the specific problem
• Short

Devising the items

• Translation
• Translating each item into the other language.
• Done by a person who is fluent in both English and the target tongue, knowledgeable about the content area, and aware of the intent of each item and of the scale as a whole

• Back translation
• Done by another knowledgeable bilingual person, who translates it back into English

• Re-establishing reliability and validity within the new context



Scaling Responses

• A method by which responses will be obtained
• Divided into categorical or continuous variables
• The level of measurement is decided
• Nominal, ordinal, interval, ratio

Scaling Response

1. Dichotomous scale: one that arranges items into either of two mutually exclusive categories, e.g., yes/no, alive/dead.

2. Nominal scale: classification into unordered qualitative categories, e.g., race, religion, and country of birth.

3. Ordinal scale: classification into ordered qualitative categories, e.g., social class (I, II, III, etc.).

4. Interval scale: an (equal) interval scale involves assignment of values with a natural distance between them, so that a particular distance (interval) between two values in one region represents the same distance between two values in another region of the scale. Examples include Celsius and Fahrenheit temperature.

5. Ratio scale: an interval scale with a true zero point, so that ratios between values are meaningfully defined and one value can be described as so many times greater or less than another. Examples are absolute temperature, weight, and height.

Scaling Response

• Categorical judgment
• Required when the response to a question is either yes or no (a simple check)

• Problems
• Uncertainty and confusion on the part of respondents
• Potential loss of information and reduced reliability
• Loss of efficiency of the instrument

Scaling Response

• Continuous judgment
• Required when the response to a question is a continuous variable

• Methods to quantify it are
• Direct estimation technique
• Comparative methods
• Econometric method

Scaling Response

Direct estimation methods
1. Visual analogue scale

Scaling Response

Direct estimation methods
2. Adjectival scale

Scaling Response

Direct estimation methods
3. Specific scaling methods
A. Likert scale

• The rater expresses an opinion by rating his or her agreement with a series of statements, with responses framed on an agree-disagree continuum.

Scaling Response

B. Semantic differential scale
• Defines a number of related dimensions of a characteristic on a series of continuous bipolar scales

Scaling Response

• General issues in the construction of continuous scales
• How many steps should there be?
• Is there a maximum number of categories?

Scaling Response

• Should there be an even or odd number of categories?
• Should all the points on the scale be labelled, or only the ends?
• Do adjectives always convey the same meaning?
• Do numbers placed under the boxes influence the responses?
• Should the order of successive question responses change?
• Can it be assumed that the data are interval?

Scaling Response

• Critique of direct estimation methods
• Subjective judgment
• Easy to design
• Little pre-testing
• Easily understood
• Halo effect
• End-aversion bias
• Positive skew

Scaling Response

• Comparative methods
• These methods scale the value of each description before obtaining responses, to ensure that the response values are on an interval scale
• Types
• Thurstone's method
• Paired comparison technique
• Guttman method

Thurstone's method of equal-appearing intervals

1. Selection of 100-200 statements
2. A number of judges are asked to sort them into a single pile from lowest to highest
3. The median rank of each statement is computed; this is the scale value of that statement
4. Select a limited number of statements (about 25) that have equal intervals between successive items and span the entire range of values
5. Applying the scale: respondents are asked to indicate the statements that apply to them
6. The respondent's score is the average scale value of the items selected
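
Steps 3 and 6 above can be sketched in a few lines. This is only an illustration under stated assumptions: the statement labels, the judges' ranks, and the helper name `respondent_score` are all hypothetical.

```python
from statistics import mean, median

# Hypothetical ranks assigned by 5 judges to 4 statements (1 = lowest)
judge_ranks = {
    "S1": [1, 1, 2, 1, 1],
    "S2": [2, 3, 1, 2, 2],
    "S3": [3, 2, 3, 4, 3],
    "S4": [4, 4, 4, 3, 4],
}

# Step 3: the scale value of a statement is the median of the judges' ranks
scale_values = {name: median(ranks) for name, ranks in judge_ranks.items()}

def respondent_score(endorsed):
    """Step 6: average scale value of the statements a respondent endorses."""
    return mean(scale_values[name] for name in endorsed)

score = respondent_score(["S2", "S3"])
print(scale_values, score)
```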

Scaling Response

• Paired comparison technique
• Similar to Thurstone's method
• Except that here judges are asked to compare each item, one at a time, with each of the remaining items
• The proportion of times each alternative is chosen over each other option is computed
• The proportions are converted to z-scores using the properties of the normal curve
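
A minimal sketch of the conversion from proportions to z-scores follows. The proportions are hypothetical, and averaging each item's z-scores to get a single scale value is an assumption (a common convention for this technique) rather than something stated on the slide.

```python
from statistics import NormalDist, mean

items = ["A", "B", "C"]

# Hypothetical data: p[i][j] is the proportion of judges who chose item i over item j
p = {
    "A": {"B": 0.30, "C": 0.20},
    "B": {"A": 0.70, "C": 0.40},
    "C": {"A": 0.80, "B": 0.60},
}

std_normal = NormalDist()  # standard normal curve: mean 0, SD 1

# Convert each proportion to a z-score, then average per item (assumed convention)
scale = {
    i: mean(std_normal.inv_cdf(p[i][j]) for j in items if j != i)
    for i in items
}
print({name: round(value, 3) for name, value in scale.items()})
```

Proportions of exactly 0 or 1 have no finite z-score, so in practice they would need special handling before conversion.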

Scaling Response Paired comparison technique

Scaling Response

• Guttman method

• Differs from Thurstone's method in using a small sample of 10-20 items
• No calibration is done
• Items are tentatively ranked according to the increasing amount of the attribute assessed, and responses are displayed in a subject-by-item matrix, where 1 marks an endorsed item and 0 the remaining items
• It is an ordinal scale, not an interval scale
• The coefficient of reproducibility and the coefficient of scalability are used to reflect deviation from perfect cumulativeness
• Best suited to behaviours which are developmentally determined, where mastery of one behaviour virtually guarantees mastery of lower-order behaviours

Scaling Response -Guttman method

• e.g., assessment of lower limb function in people with osteoarthritis

• A=4, B=3, C=2, D=2, E=1

Scaling Response -Guttman method

• The indices which reflect how much an actual scale deviates from perfect cumulativeness are

• Coefficient of reproducibility
The degree to which a person's scale score is a predictor of his or her response pattern.
Varies between 0 and 1; should be > 0.9

• Coefficient of scalability
Reflects whether the scale is unidimensional and cumulative. Varies between 0 and 1 and should be at least 0.6
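
The coefficient of reproducibility can be sketched as 1 minus the proportion of responses that deviate from the perfectly cumulative pattern. The error-counting rule used here (compare each row with the ideal pattern implied by its total score) is one common convention and is an assumption, as are the function name and the data.

```python
def coefficient_of_reproducibility(matrix):
    """matrix: subject-by-item 0/1 responses, items ordered from easiest to hardest."""
    n_subjects = len(matrix)
    n_items = len(matrix[0])
    errors = 0
    for row in matrix:
        score = sum(row)
        # Perfectly cumulative pattern for this score: the easiest `score` items endorsed
        ideal = [1] * score + [0] * (n_items - score)
        errors += sum(1 for got, want in zip(row, ideal) if got != want)
    return 1 - errors / (n_subjects * n_items)

# Hypothetical responses from 5 subjects to 4 items
responses = [
    [1, 1, 1, 1],
    [1, 1, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 1, 0],  # deviates from the cumulative pattern
    [1, 0, 0, 0],
]
cr = coefficient_of_reproducibility(responses)
print(cr)
```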

Scaling Response

• Critique of comparative methods
• Require more time for their development
• Thurstone's method and paired comparison guarantee interval-level measurement

Scaling Response

• Goal attainment scaling (GAS)
• An attempt to construct scales that are tailored to specific individuals yet can yield results on a common ratio scale across all people
• If the intervention worked as intended, the subject should score 0
• A higher mean score across all subjects indicates that the goals were set too low
• Not all subjects need to have the same goals

Critique
• Ability to tailor the scale to the specific goals of the individual
• Each subject has his or her own scale, with a different number of goals and varying criteria for each one
• Extremely labour intensive

GAS is useful when
• The objective is to evaluate the intervention as a whole, the goals for each person are different, and adequate resources for training goal setters and raters are available.

Scaling Response

• Econometric method
• Required to scale benefits along a numerical scale so that cost/benefit ratios can be determined
• A health state is rated by averaging judgements from a large number of individuals to create a utility score for the state.
• Here the focus of measurement is the described health state, not the characteristics of the individual respondent.
• e.g., the choice between medical management and CABG in managing angina, approached by the following methods:
• Von Neumann-Morgenstern standard gamble
• Time trade-off technique

Scaling Response - Econometric method: Von Neumann-Morgenstern standard gamble

You have been suffering from angina for several years. As a result of your illness you have chest pain after even minor physical exertion. You have been forced to quit your job and spend most days at home watching TV. Imagine you are offered the possibility of an operation that will result in complete recovery. The operation carries some risk, however: there is a probability p that you will die during the operation. How large must p be before you will decline the operation and choose to remain in your present state?

The closer the present state is to perfect health, the smaller the risk of death one would be willing to entertain. Once an estimate of p has been obtained from subjects, the value of the present state can be converted directly to a 0-1 scale as 1 - p.

Time trade-off technique

Imagine living the remainder of your lifespan, 40 years, in your present state. Contrast this with an operation that would return you to perfect health, but for fewer years. How many years would you sacrifice to have perfect health?

So the respondent is presented with the alternative of 40 years in her present state versus 0 years of complete health.
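
Both techniques reduce to simple arithmetic once the point of indifference is known. The sketch below uses the 1 - p conversion stated for the standard gamble. The time trade-off formula (years in perfect health at indifference divided by years in the present state) is the conventional one and is an assumption here, since the slides do not state it. The function names and input values are hypothetical.

```python
def standard_gamble_utility(p_indifference):
    """Utility of the present state on a 0-1 scale: 1 - p, where p is the
    risk of death at which the respondent is indifferent."""
    if not 0 <= p_indifference <= 1:
        raise ValueError("p must lie between 0 and 1")
    return 1 - p_indifference

def time_tradeoff_utility(years_perfect_health, years_present_state):
    """Assumed conventional formula: x / t, where x years of perfect health are
    judged equivalent to t years in the present state."""
    if not 0 <= years_perfect_health <= years_present_state:
        raise ValueError("years in perfect health must lie between 0 and t")
    return years_perfect_health / years_present_state

# Hypothetical respondent: accepts up to a 25% risk of death,
# or would give up 10 of the 40 years for perfect health
u_sg = standard_gamble_utility(0.25)
u_tto = time_tradeoff_utility(30, 40)
print(u_sg, u_tto)
```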

Scaling Response

• Critique of econometric methods
• Difficult to administer
• Require a trained interviewer

• Multidimensional scaling
• A technique to examine the similarities of different objects which may vary along a number of separate dimensions
• Begins with some index of how close each object is to every other object, and then tries to determine how many dimensions underlie these evaluations of closeness

Scaling Response-Multidimensional scaling

Selecting the items

A. Pre-test the items to ensure that they are
1. Comprehensible to the target population
2. Unambiguous
3. Asking only a single question

B. Eliminate or rewrite any items which do not meet the criteria above, and pre-test again

C. Discard items endorsed by very few (or very many) subjects

Selecting the items

D. Check the internal consistency of the scale using
1. Item-total correlation
a) Correlate each item with the scale total, omitting that item
b) Eliminate or rewrite any item with Pearson r < 0.20
c) Rank order the remaining items and select items starting with the highest correlation
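
Steps a) to c) can be sketched as follows. The response data and function names are hypothetical; the 0.20 cut-off is the one given above.

```python
from math import sqrt

def pearson_r(x, y):
    """Pearson correlation between two equal-length lists of numbers."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

def corrected_item_total(scores):
    """Correlate each item with the scale total, omitting that item."""
    n_items = len(scores[0])
    result = []
    for i in range(n_items):
        item = [row[i] for row in scores]
        rest = [sum(row) - row[i] for row in scores]  # total without item i
        result.append(pearson_r(item, rest))
    return result

# Hypothetical data: 5 respondents, 4 items (the last item runs against the others)
responses = [
    [2, 3, 3, 4],
    [4, 4, 5, 1],
    [1, 2, 2, 3],
    [3, 3, 4, 5],
    [5, 4, 5, 2],
]
r_values = corrected_item_total(responses)

# Keep items with r >= 0.20, ranked from highest correlation to lowest
keep = [i for i, r in enumerate(r_values) if r >= 0.20]
ranked = sorted(keep, key=lambda i: r_values[i], reverse=True)
print([round(r, 2) for r in r_values], ranked)
```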

Selecting the items

2. Coefficient α or KR-20
a) Calculate α, eliminating one item at a time.
b) Discard any item whose removal significantly increases α.
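
The "α with one item removed" check might look like the sketch below. The data and function names are hypothetical, and what counts as a significant increase is left to the developer's judgement; the code only reports the values.

```python
from statistics import variance

def cronbach_alpha(scores):
    """Coefficient alpha for a list of respondents, each a list of item scores."""
    k = len(scores[0])
    item_vars = [variance([row[i] for row in scores]) for i in range(k)]
    total_var = variance([sum(row) for row in scores])
    return (k / (k - 1)) * (1 - sum(item_vars) / total_var)

def alpha_if_deleted(scores):
    """Alpha recomputed with each item left out in turn."""
    n_items = len(scores[0])
    results = []
    for drop in range(n_items):
        reduced = [[v for i, v in enumerate(row) if i != drop] for row in scores]
        results.append(cronbach_alpha(reduced))
    return results

# Hypothetical data: 5 respondents, 4 items (the last item runs against the others)
responses = [
    [2, 3, 3, 4],
    [4, 4, 5, 1],
    [1, 2, 2, 3],
    [3, 3, 4, 5],
    [5, 4, 5, 2],
]
full_alpha = cronbach_alpha(responses)
deleted = alpha_if_deleted(responses)

# Items whose removal raises alpha are candidates for discarding
candidates = [i for i, a in enumerate(deleted) if a > full_alpha]
print(round(full_alpha, 3), [round(a, 3) for a in deleted], candidates)
```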

E. For a multi-scale questionnaire, check that each item is in the 'right' scale by

a) Correlating it with the totals of all the scales, and eliminating items which correlate more highly with scales other than the one they belong to

b) Factor-analysing the questionnaire, and eliminating items which load more highly on a factor other than the one they should belong to.

Biases in Responding

• The people who develop a scale, those who use it in their work, and those who are asked to fill it out all approach scales from different perspectives and for different reasons.
• Optimizing
• Describes performing a task in a careful and comprehensive manner. Respondents:
a) try to interpret the meaning of the question itself
b) try to retrieve all the relevant information from their memories
c) use this information to form a single integrated summary judgement
d) try to convey that judgement on the answer sheet.


• Satisficing
Giving an answer that is satisfactory but not optimal, which may include selecting the first response option (written form) or the last option (verbal form), agreeing with every statement, answering either true or false to each option, choosing "keep things as they are" as a response, or answering "I don't know".

It can be minimized by keeping the task simple, using words that are short and easy, offering responses that cover all the possibilities, and motivating respondents.


• Social desirability (SD) and faking good
• When the subject does not deliberately try to deceive or lie but gives a socially desirable answer, it is called social desirability (SD)
• When the subject is aware of it and intentionally attempts to create a false positive impression, it is called faking good

• SD depends on the individual, sex, culture, the question, and its context.
• Assessed by the Differential Reliability Index (DRI), Social Desirability Scale, Desirability Scale, and Social Relation Scale.
• Faking good, being volitional, is easier to modify through instructions and careful wording of the items than social desirability.


• Deviation and faking bad
• The tendency to respond to test items with deviant responses is the opposite of social desirability and is known as deviation
• Faking bad: when the subject is aware and intentionally attempts to create a false negative impression; it is the opposite of faking good.

• Minimizing biases by
• Disguising the intent of the test
• Use of subtle items, ones where the respondent is unaware of what the item is measuring
• Random response technique


• Yea-saying (acquiescence) and nay-saying
• The tendency to give positive responses (such as yes, like, true) or negative responses (such as no, dislike, false) irrespective of the content of the item is called yea-saying and nay-saying, respectively.
• It can be reduced by having an equal number of items keyed in the positive and negative directions.
• End-aversion bias or central tendency bias
• It is the reluctance of some people to use the extreme categories of a scale.
• It is reduced by avoiding absolute statements at the end points and including throw-away categories at the ends.

Biases in Responding

• Positive skew
• When responses are distributed more toward the favourable end. It produces a ceiling effect.
• It can be minimized by not placing the 'average' category at the midpoint, or by expanding the middle of the scale.


• Halo
• When judgements made on individual aspects of a person's performance are influenced by the rater's overall impression of the person.
• It can be minimized by training raters, basing the evaluation on large samples of behaviour, and using more than one evaluator


• Framing
• When a person's choice between two alternative states depends on how these states are framed.
• People are RISK AVERSE when a gain is involved and RISK TAKERS in loss situations.
• A. 200 people will be saved.
• B. There is a one-third probability that 600 people will be saved, and a two-thirds probability that nobody will be saved.
OR
• A. 400 will die.
• B. There is a one-third probability that nobody will die, and a two-thirds probability that 600 will die.
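
The two framings describe the same outcomes, which a quick check of the arithmetic makes explicit. The sketch assumes, as the numbers imply, that 600 people are at risk in total; the variable names are illustrative only.

```python
from fractions import Fraction

at_risk = 600  # assumed total, implied by the figures in both framings

# Gain framing
saved_a = 200
expected_saved_b = Fraction(1, 3) * 600 + Fraction(2, 3) * 0

# Loss framing
deaths_a = 400
expected_deaths_b = Fraction(1, 3) * 0 + Fraction(2, 3) * 600

# Options A and B have the same expected outcome within each framing,
# and the two framings describe the same situation
print(saved_a == expected_saved_b)        # True
print(deaths_a == expected_deaths_b)      # True
print(at_risk - deaths_a == saved_a)      # True
```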

• The safest strategy for the test developer is to assume that all of these biases are operative and take the necessary steps to minimize them whenever possible.

From items to scales

• Differential weighting of items rarely is worth the trouble

• For tests being developed for local use, the total score can be obtained by adding up all the items

• For general use and comparability, transform the scores into percentiles, z scores, or T scores

• For attributes which differ between males and females or which show developmental changes, separate age or age-sex norms can be developed.


• Combining items into a scale and expressing the final score
1. Add the scores of the individual items when items contribute equally to the total score
2. Weight the items when some items may be more important
Each item is given either the same weight or different weights by different subjects
3. Transform the final scores when comparing scores on different scales, into:
• Percentiles
• Standard and standardized scores
• Normalized scores


• A percentile is the percentage of people who score below a certain value, the lowest being the 0th percentile and the highest the 99th percentile. Percentiles are easy to understand, but they require many scores, are not normally distributed, and, being ordinal data, cannot be analyzed by parametric statistics.

• To address the problems with percentiles, z scores and T scores can be calculated: z scores by transforming the scale to have a mean of 0 and a standard deviation of 1, and T scores by transforming z scores using a new mean and standard deviation chosen arbitrarily.

• To ensure that z and T scores are normally distributed, we use normalized standard scores.
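
A minimal sketch of the z and T transformations follows. The raw scores are hypothetical, and the T-score mean of 50 and standard deviation of 10 are the conventional choice, used here as an assumption since the text only says these values are arbitrary.

```python
from statistics import mean, pstdev

# Hypothetical raw total scores for 5 respondents
raw = [10, 12, 14, 16, 18]

m = mean(raw)
sd = pstdev(raw)  # population SD; use statistics.stdev for a sample estimate

# z scores: mean 0, standard deviation 1
z_scores = [(x - m) / sd for x in raw]

# T scores: z rescaled to an arbitrarily chosen mean and SD (assumed 50 and 10)
T_MEAN, T_SD = 50, 10
t_scores = [T_MEAN + T_SD * z for z in z_scores]

print([round(z, 2) for z in z_scores])
print([round(t, 1) for t in t_scores])
```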


• Establishing the cut points
• Receiver operating characteristic (ROC) curves
• Requires the true positive rate (sensitivity) and the true negative rate (specificity)

• A graph is plotted where the X axis is 1 - specificity (false positive rate) and the Y axis is sensitivity (true positive rate). The diagonal running from (0,0) in the lower left-hand corner to (1,1) in the upper right reflects the characteristics of a test with no discriminating ability. The better the test is at dividing cases from non-cases, the closer the curve approaches the upper left corner. An index of the goodness of the test is the area under the curve, denoted D′.
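
The construction of the curve can be sketched by trying every observed score as a cut point and recording the resulting false positive and true positive rates. The scores, the case labels, and the trapezoidal estimate of the area are illustrative assumptions, not taken from the slides.

```python
def roc_points(scores, labels):
    """(false positive rate, true positive rate) for every candidate cut point.
    A respondent is classed as a case when score >= cut; labels: 1 = case, 0 = non-case."""
    positives = sum(labels)
    negatives = len(labels) - positives
    points = [(0.0, 0.0)]
    for cut in sorted(set(scores), reverse=True):
        tp = sum(1 for s, y in zip(scores, labels) if s >= cut and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= cut and y == 0)
        points.append((fp / negatives, tp / positives))
    return points

def area_under_curve(points):
    """Trapezoidal area under the ROC curve."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2
    return area

# Hypothetical scale scores and true status for 8 respondents
scores = [3, 5, 6, 8, 9, 11, 12, 14]
labels = [0, 0, 1, 0, 1, 0, 1, 1]

points = roc_points(scores, labels)
auc = area_under_curve(points)
print(points)
print(auc)
```

An area of 0.5 corresponds to the diagonal (no discriminating ability) and 1.0 to a curve that passes through the upper left corner.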

From items to scales ROC curve

THANK YOU
