Development of health measurement scales – part 2


Dr. Rizwan S A, M.D.

If you cannot express in numbers something that you are describing, you probably have little knowledge about it.

2

Scaling responses
• Categorical
• Continuous
  – Direct Estimation Method
    • Visual Analogue Scale
    • Adjectival Scale
      – Discrete
      – Continuous
    • Specific Scaling
      – Likert Scale
      – Semantic Scale
  – Comparative Method
    • Thurstone’s Method
    • Paired Comparison Method
    • Guttman Method
  – Econometric Method

3

Outline

• Reliability
• Validity
• Measuring change
• Conclusions
• Article discussion

4

$$\text{Variance} = \frac{\sum (\text{individual value} - \text{mean value})^2}{\text{no. of values}}$$

5

Reliability
• Whether our tool measures the attribute in a reproducible fashion
• A way to show the amount of error (random and systematic) in any measurement
• Sources of error – observers, instruments, instability of the attribute
• Day-to-day encounters
  – Weighing machine, watch, thermometer

6

Assessing Reliability

• Internal consistency – the average correlation among all the items in the tool
  • Item-total correlation
  • Split half reliability
  • Kuder-Richardson 20 & Cronbach’s alpha
  • Multifactor inventories
• Stability – reproducibility of a measure on different occasions
  • Inter-observer reliability
  • Test-retest reliability (intra-observer reliability)

7

Internal consistency
• All items in a scale should tap different aspects of the same attribute, not different traits
• Items should be moderately correlated with each other, and each item with the total
• Two schools of thought
  – If the aim is to describe a trait/behaviour/disorder
  – If the aim is to discriminate people with the trait from those without
• The trend is towards scales that are more internally consistent
• Internal consistency (IC) doesn’t apply to multidimensional scales

8

Item-total correlation
• Oldest method, still used
• Correlation of each item with the total score computed without that item
• For k items we have to calculate k correlations, which is laborious
• An item should be discarded if r < 0.20
• Pearson’s r is best; for dichotomous items use the point-biserial correlation
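The corrected item-total procedure can be sketched in a few lines. This is only an illustration: the 0/1 responses and the function names `pearson_r` and `corrected_item_total` are assumptions made for the example, and with dichotomous items the Pearson coefficient computed here is the point-biserial correlation.

```python
from math import sqrt

def pearson_r(x, y):
    """Pearson product-moment correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

def corrected_item_total(responses):
    """Correlate each item with the total score computed without that item."""
    k = len(responses[0])
    result = []
    for j in range(k):
        item = [row[j] for row in responses]
        rest = [sum(row) - row[j] for row in responses]  # total minus item j
        result.append(pearson_r(item, rest))
    return result

# Hypothetical data: 5 respondents (rows) x 3 dichotomous items (columns).
responses = [
    [0, 1, 1],
    [1, 1, 1],
    [0, 0, 0],
    [1, 1, 1],
    [1, 1, 0],
]

correlations = corrected_item_total(responses)
# Items whose corrected item-total correlation falls below the 0.20 rule of thumb.
flagged = [j + 1 for j, r in enumerate(correlations) if r < 0.20]
print([round(r, 2) for r in correlations], flagged)
```

With only five respondents the coefficients are very unstable; the point is the mechanics of leaving the item out of its own total.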

10

Split half reliability
• Divide the items into two halves and calculate the correlation between them
• Underestimates the true reliability because we are reducing the length of the scale to half (r is directly related to the number of items)
  – Corrected by the Spearman-Brown formula
• Should not be used in
  – Highly timed achievement tests
  – Chained items

11

KR-20/Cronbach’s alpha
• KR-20 for dichotomous responses
• Cronbach’s alpha for more than two response options
• They give the average of all possible split-half reliabilities of a scale
• If removing an item increases the coefficient, the item should be discarded
• Problems
  – Depends on the number of items
  – A scale with two different sub-scales will probably still yield a high alpha
  – Very high alpha denotes redundancy (asking the same question in slightly different ways)
  – Thus alpha should be more than 0.70 but not more than 0.90

13

• Cronbach’s basic equation for alpha

$$\alpha = \frac{n}{n-1}\left(1 - \frac{\sum V_i}{V_{test}}\right)$$

  – n = number of questions
  – Vi = variance of scores on each question
  – Vtest = total variance of overall scores on the entire test

14

Calculation of Cronbach’s coefficient alpha

Example: Assessment of emotional health

During the past month:                                               Yes   No
Have you been a very nervous person?                                  1     0
Have you felt downhearted and blue?                                   1     0
Have you felt so down in the dumps that nothing could cheer you up?   1     0

15

Results

Patient               Item 1    Item 2    Item 3    Summed scale score
1                     0         1         1         2
2                     1         1         1         3
3                     0         0         0         0
4                     1         1         1         3
5                     1         1         0         2
Percentage positive   3/5=.6    4/5=.8    3/5=.6

Mean summed score = 2

16

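Calculations

Mean score = 2

$$\text{Sample variance} = \frac{(2-2)^2 + (3-2)^2 + (0-2)^2 + (3-2)^2 + (2-2)^2}{5-1} = 1.5$$

$$\alpha = \frac{k}{k-1}\left[1 - \frac{\sum_i (\%pos_i)(\%neg_i)}{Var}\right] = \frac{3}{2}\left[1 - \frac{(.6)(.4) + (.8)(.2) + (.6)(.4)}{1.5}\right] = 0.86$$

Conclude that this scale has good reliability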
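The worked example can be reproduced with a short script. This is a sketch, not the author's code: the function name `cronbach_alpha` and its arguments are assumptions. One point worth noticing is that the slide combines item variances of the form p × q (a divide-by-n variance) with a total-score variance that divides by n − 1; the script reproduces that calculation and also shows the value obtained when the same n − 1 convention is used for both.

```python
from statistics import pvariance, variance

def cronbach_alpha(responses, item_var, total_var):
    """alpha = k/(k-1) * (1 - sum of item variances / variance of total score).

    item_var and total_var are the variance functions applied to each item
    column and to the summed scores, so the convention can be chosen explicitly.
    """
    k = len(responses[0])
    item_columns = [list(col) for col in zip(*responses)]
    totals = [sum(row) for row in responses]
    sum_item_var = sum(item_var(col) for col in item_columns)
    return k / (k - 1) * (1 - sum_item_var / total_var(totals))

# The five patients and three yes/no items from the results table.
responses = [
    [0, 1, 1],
    [1, 1, 1],
    [0, 0, 0],
    [1, 1, 1],
    [1, 1, 0],
]

# As on the slide: p*q item variances (divide by n), total variance with n-1.
alpha_slide = cronbach_alpha(responses, pvariance, variance)
# Same n-1 convention for item and total variances.
alpha_consistent = cronbach_alpha(responses, variance, variance)
print(round(alpha_slide, 2), round(alpha_consistent, 2))
```

The first value matches the 0.86 on the slide; the consistent convention gives about 0.70, so the conclusion about "good reliability" is sensitive to how the variances are computed in such a small sample.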

20

Multifactor inventories
• More sophisticated techniques
• Item-total procedure – each item should correlate with the total of its scale and the total of all the scales
• Factor analysis
  – Determining the underlying factors
  – E.g. if there are five tests
    • Vocabulary, fluency, phonetics, reasoning and arithmetic
    • We can theorize that the first three would be correlated under a factor called ‘verbal factor’ and the last two under ‘logic factor’

21

Stability/ Measuring error

• A weighing machine shows weight in the range of say 40-80 kg and thus an error of ±1kg is meaningful

• A ratio is more useful: measurement error / total variability between subjects
• In practice we calculate the ratio: variability between subjects / total variability (total variability includes both subject variability and measurement error)
• So that a ratio of
  – 1 indicates no measurement error (perfect reliability)
  – 0 indicates otherwise

22

• Reliability =

subj. variability / (subj. variability + measurement error)

• Statistically, ‘variance’ is the measure of variability, so
• Reliability =

SD² of subjects / (SD² of subjects + SD² of error)

• Thus reliability is the proportion of the total variance that is due to the ‘true’ differences between the subjects

• Reliability has meaning only when applied to specific populations

23

1. Measurement error / total variability between subjects
2. Variability between subjects / total variability
3. Subj. variability / (subj. variability + measurement error)
4. SD² of subjects / (SD² of subjects + SD² of error)

24

Calculation of reliability

• The statistical technique used is ANOVA and since we have repeated measurements in reliability, the method is – repeated measures ANOVA

26

Example

28

• Classical definition of reliability
• Interpretation is that 88% of the variance is due to the true variance among patients (aka Intraclass Correlation Coefficient, ICC)

29

Fixed/random factor

• What happened to the variance due to observers?
• Are the same observers going to be used in future (fixed factor), or are they a random sample of possible observers (random factor)?
• Another situation where observations may be treated as fixed is subjects answering the ‘same items on a scale’
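The fixed versus random distinction can be made concrete with a small two-way ANOVA decomposition. This is a minimal sketch under stated assumptions: the ratings are hypothetical (they are not the data of the example slide), the function name `variance_components` is invented for illustration, and negative variance estimates are simply truncated at zero.

```python
def variance_components(ratings):
    """Estimate subject, observer and error variance from a subjects x observers table.

    ratings[i][j] is the score given to subject i by observer j.
    """
    n, k = len(ratings), len(ratings[0])
    grand = sum(map(sum, ratings)) / (n * k)
    row_means = [sum(row) / k for row in ratings]
    col_means = [sum(row[j] for row in ratings) / n for j in range(k)]

    ss_subj = k * sum((m - grand) ** 2 for m in row_means)
    ss_obs = n * sum((m - grand) ** 2 for m in col_means)
    ss_total = sum((x - grand) ** 2 for row in ratings for x in row)
    ss_err = ss_total - ss_subj - ss_obs

    ms_subj = ss_subj / (n - 1)
    ms_obs = ss_obs / (k - 1)
    ms_err = ss_err / ((n - 1) * (k - 1))

    var_subj = max((ms_subj - ms_err) / k, 0.0)
    var_obs = max((ms_obs - ms_err) / n, 0.0)
    return var_subj, var_obs, ms_err

# Hypothetical ratings: observer 2 always scores one point higher than observer 1.
ratings = [[1, 2], [3, 4], [5, 6]]
var_subj, var_obs, var_err = variance_components(ratings)

# Observers treated as fixed: observer variance is left out of the denominator.
icc_fixed = var_subj / (var_subj + var_err)
# Observers treated as a random sample: observer variance counts as error.
icc_random = var_subj / (var_subj + var_obs + var_err)
print(round(icc_fixed, 3), round(icc_random, 3))
```

In this toy table the observers rank the subjects identically, so the fixed-observer coefficient is 1, while the systematic one-point offset lowers the random-observer coefficient.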

31

Other types of reliability

• We have only examined the effect of different observers on the same behaviour

• But there can be error due to ‘day to day’ differences, if we measure the same behaviour a week or two apart we can calculate ‘intra-observer reliability coefficient’

• If there are no observers (self-rated tests) we can still calculate ‘test-retest reliability’

32

• Usually a high inter-observer reliability is sufficient, but if it is low we may have to calculate intra-observer reliability to determine the source of unreliability
• Mostly measures of internal consistency are reported as ‘reliability’, because they are easily computed in a single sitting
  – Hence caution is required, as they may not measure variability due to day-to-day differences

33

Diff. forms of reliability coefficient

• So far we have seen forms of ICC
• Others
  – Pearson product-moment correlation
  – Cohen’s kappa
  – Bland-Altman analysis

34

Pearson’s correlation

• Based on regression – the extent to which the relation between two variables can be described by straight line

35

Limitations of Pearson’s R

• A perfect correlation of 1.0 may be obtained even if the intercept is non-zero and the slope is not equal to one; the ICC, in contrast, is lowered by such systematic differences
• So Pearson’s r will tend to be higher than the true reliability, but in practice it is usually close to the ICC because the predominant source of error is random variation
• With more than two observers, multiple pairwise correlations are required, unlike the single ICC
• E.g. with 10 observers there will be 45 pairwise Pearson’s r values but only one ICC
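A short sketch illustrates both limitations. The two observers' scores below are hypothetical and deliberately related by a straight line with slope 2 and intercept 5, so they never agree, yet the correlation is perfect; the helper name `pearson_r` is an assumption for the example.

```python
from math import comb, sqrt

def pearson_r(x, y):
    """Pearson product-moment correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

# Hypothetical scores: observer B = 2 * observer A + 5 for every subject.
observer_a = [10, 20, 30, 40]
observer_b = [25, 45, 65, 85]

r = pearson_r(observer_a, observer_b)
mean_abs_gap = sum(abs(a - b) for a, b in zip(observer_a, observer_b)) / len(observer_a)

# Number of pairwise correlations needed for 10 observers: 10 choose 2.
pairs_for_10_observers = comb(10, 2)
print(round(r, 3), mean_abs_gap, pairs_for_10_observers)
```

The correlation is 1 although the observers differ by 30 points on average, which is why a coefficient that penalises systematic differences is preferred for agreement.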

36

Kappa coeff.

37

$$\kappa = \frac{0.70 - 0.41}{1 - 0.41} \approx 0.49$$
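The calculation above is observed agreement corrected for chance agreement. A minimal sketch follows, assuming the 0.70 and 0.41 are the observed and chance-expected proportions of agreement; the 2 × 2 table of counts used for the second function is hypothetical, since the original table is not in the transcript, and both function names are invented for illustration.

```python
def kappa(p_observed, p_expected):
    """Cohen's kappa from observed and chance-expected agreement."""
    return (p_observed - p_expected) / (1 - p_expected)

def kappa_from_table(table):
    """Cohen's kappa from a square table of counts (rows: rater 1, columns: rater 2)."""
    n = sum(map(sum, table))
    size = len(table)
    p_observed = sum(table[i][i] for i in range(size)) / n
    row_totals = [sum(row) for row in table]
    col_totals = [sum(table[i][j] for i in range(size)) for j in range(size)]
    # Chance agreement: product of the marginal proportions, summed over categories.
    p_expected = sum(row_totals[i] * col_totals[i] for i in range(size)) / n ** 2
    return kappa(p_observed, p_expected)

kappa_slide = kappa(0.70, 0.41)

# Hypothetical counts for two raters classifying 50 subjects as positive/negative.
hypothetical_table = [[20, 5], [10, 15]]
kappa_table = kappa_from_table(hypothetical_table)
print(round(kappa_slide, 2), round(kappa_table, 2))
```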

39

• Used when responses are dichotomous/categorical

• When the frequency of positive results is very low or very high, kappa can be low even when the observed agreement is high

• Weighted kappa focuses on disagreement, cells are weighted according to the distance from the diagonal of agreement

• Weighting can be arbitrary or using quadratic weights (based on square of the amount of discrepancy)

• Quadratic scheme of weighted kappa is equivalent to ICC

• Also, the unweighted kappa is equal to ICC based on ANOVA

40

Bland and Altman method

• A plot of difference between two observations against the mean of the two observations

42

• The mean difference is related to the observer variance in the ICC, and the SD of the differences is related to the error variance in the ICC

• Limits of agreement are calculated as mean difference ± 2 SD of the differences (1.96 SD for 95% limits)

• Agreement is expressed as the ‘limits of agreement’. The presentation of the 95% limits of agreement is for visual judgement of how well two methods of measurement agree. The smaller the range between these two limits the better the agreement is.

• The question of how small is small depends on the clinical context: would a difference between measurement methods as extreme as that described by the 95% limits of agreement meaningfully affect the interpretation of the results

• Limitation - the onus is placed on the reader to juxtapose the calculated error against some implicit notion of true variability
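The limits of agreement can be computed directly from paired measurements. This is a sketch using the conventional 95% limits (mean difference ± 1.96 × SD of the differences); the paired readings and the function name `limits_of_agreement` are hypothetical.

```python
from statistics import mean, stdev

def limits_of_agreement(method_a, method_b, z=1.96):
    """Return (bias, lower limit, upper limit) for paired measurements."""
    diffs = [a - b for a, b in zip(method_a, method_b)]
    bias = mean(diffs)          # systematic difference between the methods
    sd = stdev(diffs)           # sample SD of the differences
    return bias, bias - z * sd, bias + z * sd

# Hypothetical paired readings from two measurement methods.
method_a = [10, 12, 14, 16, 18]
method_b = [11, 11, 15, 15, 18]

bias, lower, upper = limits_of_agreement(method_a, method_b)
# Points for the plot: mean of each pair (x axis) against their difference (y axis).
plot_points = [((a + b) / 2, a - b) for a, b in zip(method_a, method_b)]
print(bias, round(lower, 2), round(upper, 2))
```

Whether limits of about ± 2 units are acceptable is a clinical judgement, as the slide notes; the calculation itself says nothing about the range of true values.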

43

Issues in Interpretation

SE of measurement and reliability
• R is a dimensionless ratio of variances, so it is difficult to interpret R in terms of an individual score
• SEM = σ × sqrt(1 − R)
• If we knew someone’s true score, we could estimate the limits within which the observed value would lie 68% or 95% of the time
• E.g. a scale with SD 10 and R 0.8 has SEM ≈ 4.5. If the true score is 15, the observed value will fall between 10.5 and 19.5 about 68% of the time
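The example can be checked numerically. A minimal sketch, assuming normally distributed measurement error so that ± 1 SEM covers about 68% and ± 1.96 SEM about 95% of observed values; the function names are invented for illustration.

```python
from math import sqrt

def standard_error_of_measurement(sd, reliability):
    """SEM = SD * sqrt(1 - R)."""
    return sd * sqrt(1 - reliability)

def observed_score_band(true_score, sd, reliability, z=1.0):
    """Interval in which the observed score is expected to fall (z SEMs either side)."""
    sem = standard_error_of_measurement(sd, reliability)
    return true_score - z * sem, true_score + z * sem

sem = standard_error_of_measurement(10, 0.8)
band_68 = observed_score_band(15, 10, 0.8, z=1.0)
band_95 = observed_score_band(15, 10, 0.8, z=1.96)
print(round(sem, 2), [round(v, 1) for v in band_68], [round(v, 1) for v in band_95])
```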

45

Standards for magnitude of reliability coeff.
• How much reliability is good?
  – Kelly (0.94), Stewart (0.85)
• A test used for individual judgment should have higher reliability than one used for research in groups
• For research purposes
  – Using mean scores and a larger sample size will reduce the error
  – Conclusions are usually made after a series of studies
  – Acceptable reliability depends on the sample size in research

46

Reliability and probability of misclassification
• Depends on the properties of the instrument and the choice of cut point
• Relation between reliability and likelihood of misclassification
  – E.g. in a sample of 100, one person ranked 25th and another 50th
  – If R is 0, there is a 50% chance that the two will reverse order on retesting
  – If R is 0.5, a 37% chance; with R = 0.8, a 2.2% chance
• Hence an R of 0.75 is the minimum requirement for a useful instrument

47

Improving reliability
• Increase the subject variance relative to the error variance (by legitimate means and otherwise)
• Reducing error variance
  – Observer/rater training
  – Removing consistently extreme observers
  – Designing better scales
• Increasing true variance
  – In case of a ‘floor’ or ‘ceiling’ effect, introduce items that will bring the performance to the middle of the scale (thus increasing true variance)
    • E.g. fair-good-very good-excellent (instead of bad-good)

48

• Ways that are not legitimate
  – Test the scale in a heterogeneous population (normal and bedridden arthritics)
  – A scale developed in a homogeneous population will have a larger reliability when used in a heterogeneous population
    • Correct for attenuation

49

• The simplest way to increase R is to increase the number of items

• True variance increases as the square of the number of items, whereas error variance increases only in proportion to the number of items

• If the length of the test is tripled, the Spearman-Brown formula gives R(Spearman-Brown) = 3R / (1 + 2R)

50

• In reality the equation overestimates the new reliability

• We can also use this equation to determine the length of a test for achieving a pre-decided reliability

• To improve test-retest reliability – shorten the interval between the tests

• An ideal approach is to examine all the sources of variation and try to reduce the larger ones (generalizability theory)
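The tripled-length formula is the k = 3 case of the general Spearman-Brown prophecy formula, R_k = kR / (1 + (k − 1)R), and solving it for k gives the lengthening factor needed for a target reliability. A minimal sketch, with function names chosen for illustration:

```python
def spearman_brown(reliability, k):
    """Predicted reliability when the test length is multiplied by k."""
    return k * reliability / (1 + (k - 1) * reliability)

def length_factor_for_target(reliability, target):
    """Factor by which the test must be lengthened to reach the target reliability."""
    return target * (1 - reliability) / (reliability * (1 - target))

tripled = spearman_brown(0.5, 3)               # 3R / (1 + 2R) with R = 0.5
split_half_corrected = spearman_brown(0.6, 2)  # correcting a split-half correlation of 0.6
needed = length_factor_for_target(0.5, 0.75)   # how much longer for R = 0.75?
print(tripled, round(split_half_corrected, 2), needed)
```

As the slides caution, the predicted value tends to be optimistic, because added items are rarely as good as the original ones.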

51

Sample size for reliability studies

52

Summary for Reliability

• Pearson R is theoretically incorrect but in practice fairly close

• Bland and Altman method is analogous to error variance of ICC but doesn’t relate this to the range of observations

• kappa and ICC are identical and most appropriate

53

Generalizability theory

• Backdrop of classical test theory
  – All variance in scores can be divided into true and error variance (overly simplistic assumption)
  – Doesn’t exhaust all possible sources of variance
  – Doesn’t account for interaction between sources of error variance
• G theory
  – Cronbach et al 1972
  – Essence is the recognition that in any measurement situation there are multiple sources of error variance (may be infinite)

60

Reliability Vs. Validity

61

Validity
• Two steps to determine usefulness of a scale
  – Reliability – necessary but not sufficient
  – Validity – next step
• Validity – is the test measuring what it is meant to measure?
• Two important issues
  – The nature of what is being measured
    • Temperature vs. quality of life/social support (physical vs. abstract)
  – Relation to the purported cause
    • Sr. creatinine is a measure of kidney function because we know it is regulated by the kidneys
    • But will students who do volunteer work become better doctors?
• Since our understanding of human behaviour is far from perfect, such predictions have to be validated against actual performance

62

32 degree Celsius

Depression score - 32

63

Types of validity

• Three Cs (conventionally)
  – Content
  – Criterion
    • Concurrent
    • Predictive
  – Construct
  – Others (face validity)
• New types
  – Convergent, discriminant, trait, etc.

64

Differing perspectives

• Previously validity was seen as demonstrating the properties of the scale

• Current thinking – what inferences can be made about the people that have given rise to the scores on these scales?
  – Thus validation is a process of hypothesis testing (someone who scores on test A, will do worse in test B, and will differ from people who do better in test C and D)

– Researchers are only limited by their imagination to devise experiments to test such hypotheses

• All types of validity are addressing the same issue of the degree of confidence we can place in the inferences we can draw from the scales

65

• Face validity
  – On the face of it, the tool appears to be measuring what it is supposed to measure
  – Subjective judgment by one or more experts, rarely by empirical means
• Content validity
  – Measures whether the tool includes all relevant domains or not
  – Closely related to face validity
  – aka ‘validity by assumption’ because an expert says so
• Certain situations where these may not be desired - ?

66

Content validity

• Example – cardiology exam
  – Assume it contains all aspects of the circulatory system (physiology, anatomy, pathology, pharmacology, etc.)
  – If a person scores high on this test, we can ‘infer’ that he knows much about the subject (i.e., our inferences about the person will be right across various situations)
  – In contrast, if the exam did not contain anything about circulation, the inferences we make about a high scorer may be wrong most of the time and vice versa

67

• Generally, a measure that includes a more representative sample of the target behaviour will have more content validity and hence lead to more accurate inferences

• Reliability places an upper limit on validity (the maximum validity is the square root of the reliability coeff.); the higher the reliability, the higher the maximum possible validity
  – One exception is the relation between internal consistency and validity (better to sacrifice IC for content validity)
  – The ultimate aim of a scale is inferential, which depends more on content validity than on internal consistency

68

Criterion validity
• Correlation of a scale with an accepted ‘gold standard’ (GS)
• Two types
  – Concurrent – both the new scale and the standard scale are given at the same time
  – Predictive – the GS results will be available some time in the future (e.g. entrance test for college admission to assess if a person will graduate or not)
• Why develop a new scale when we already have a criterion scale?
  – Diagnostic utility/substitutability
  – Predictive utility (no decision can be made on the basis of new scale)
• Criterion contamination
  – If the result of the GS is in part determined in some way by the results of the new test, it may lead to an artificially high correlation

69

Construct validity
• Height, weight – readily observable
• Psychological variables – anxiety, pain, intelligence – are abstract and can’t be directly observed
• E.g. anxiety – we say that a person has anxiety if he has sweaty palms, tachycardia, pacing back and forth, difficulty in concentrating, etc. (i.e., we hypothesize that these symptoms are the result of anxiety)
• Such proposed underlying factors are called hypothetical constructs, or simply constructs (e.g. anxiety, illness behaviour)
• Such constructs arise from larger theories/clinical observations
• Most psychological instruments tap some aspect of a construct

70

[Figure: rheumatoid arthritis as a construct inferred from early morning stiffness, X-ray changes, elevated ESR and RA factor, and involvement of 3 or more joints, especially small joints]

71

Establishing construct validity

• IBS is a construct rather than a disease – it is a diagnosis of exclusion

• A large vocabulary, wide knowledge and problem solving skills – what is the underlying construct?

• Many clinical syndromes are constructs rather than actual entities (schizophrenia, SLE)

72

• Initial scales for IBS – ruling out other organic diseases, plus some physical signs and symptoms
  – These scales were inadequate because they led to many missed and wrong diagnoses
  – New scales were developed incorporating demographic features and personality features
• Now how do we assess the validity of this new scale?
  – Based on my theory, high scorers on this scale should have
    • Symptoms which will not clear with conventional therapy
    • Lower prevalence of organic bowel disease on autopsy

73

Differences from other types

1. Content and criterion validity can be established in one or two studies, but there is no single experiment that can prove a construct
  • Construct validation is an ongoing process: learning more about the construct, making new predictions and then testing them
  • Each supportive study strengthens the construct, but one well-designed negative study can question the entire construct

2. We are assessing the theory as well as the measure at the same time

74

IBS example

• We had predicted that IBS patients will not respond to conventional therapy

• Assume that we gave the test to a sample of patients with GI symptoms and treated them with conventional therapy

• If high-scoring patients responded in the same proportion as low scorers, then there are 3 possibilities
  – Our scale is good but the theory is wrong
  – Our theory is good but the scale is bad
  – Both scale and theory are bad

• We can identify the reason only from further studies

75

• If an experimental design is used to test the construct, then in addition to the above possibilities our experiment may be flawed

• Ultimately, construct validity doesn’t differ conceptually from other types of validity
  – “All validity is at its base some form of construct validity… it is the basic meaning of validity” (Guion)

76

Establishing construct validity

• Extreme groups
• Convergent and discriminant validity
• Multitrait-multimethod matrix

77

Extreme groups

• Two groups – as decided by clinicians
  – One with IBS and the other with some other GI disease
  – Equivocal diagnoses eliminated
• Two problems
  – That we are able to separate two extreme groups implies that we already have a tool which meets our needs (however, we can do bootstrapping)
  – This is not sufficient; the real use of a scale is in making much finer discriminations. But such studies can be a first step: if the scale fails this, it will probably be useless in practical situations

78

• Convergent validity - If there are two measures for the same construct, then they should correlate with each other but should not correlate too much. E.g. Index of anxiety and ANS awareness index

• Divergent validity – the measure should not correlate with a measure of a different construct, eg. Anxiety index and intelligence index

79

Multitrait-multimethod matrix
• Two unrelated traits/constructs, each measured by two different methods
• E.g. two traits – anxiety, intelligence; two methods – a rater, an exam
  – Purple – reliabilities of the four instruments (should be highest)
  – Blue – homotrait heteromethod correlations (convergent validity)
  – Yellow – heterotrait homomethod correlations (divergent validity)
  – Red – heterotrait heteromethod correlations (should be lowest)
• Very powerful method, but very difficult to get such a combination

                        Anxiety           Intelligence
                        Rater    Exam     Rater    Exam
Anxiety       Rater     0.53
              Exam      0.42     0.79
Intelligence  Rater     0.18     0.17     0.58
              Exam      0.15     0.23     0.49     0.88
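The colour coding of the matrix can be reproduced by classifying each cell by whether the two measures share a trait and a method. The sketch below uses the values from the table above; the variable names and the three simple ordering checks are assumptions made for illustration, not a formal test of the matrix.

```python
# Each measure is a (trait, method) pair, in the order used in the table.
measures = [
    ("anxiety", "rater"),
    ("anxiety", "exam"),
    ("intelligence", "rater"),
    ("intelligence", "exam"),
]

# Lower triangle of the matrix, diagonal (reliabilities) included.
lower_triangle = [
    [0.53],
    [0.42, 0.79],
    [0.18, 0.17, 0.58],
    [0.15, 0.23, 0.49, 0.88],
]

cells = {"reliability": [], "convergent": [], "homomethod": [], "heteromethod": []}
for i, row in enumerate(lower_triangle):
    for j, r in enumerate(row):
        (trait_i, method_i), (trait_j, method_j) = measures[i], measures[j]
        if i == j:
            cells["reliability"].append(r)      # same trait, same method
        elif trait_i == trait_j:
            cells["convergent"].append(r)       # same trait, different method
        elif method_i == method_j:
            cells["homomethod"].append(r)       # different trait, same method
        else:
            cells["heteromethod"].append(r)     # different trait, different method

reliabilities_highest = min(cells["reliability"]) > max(cells["convergent"])
convergent_above_divergent = min(cells["convergent"]) > max(cells["homomethod"])
heteromethod_lowest = max(cells["heteromethod"]) < min(cells["homomethod"])
print(cells, reliabilities_highest, convergent_above_divergent, heteromethod_lowest)
```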

80

Biases in validity assessment

• Restriction in range
  – May be in the new scale (MAO level)
  – May be in the criterion (depression score)
  – A third variable correlated with both (severity)
• E.g. a high correlation was found between MAO levels and depression score in a community-based study, but on replicating the study in a hospital the correlation was low

82

• Unreliability of criterion
  – If the criterion is not as reliable as we expected, the validity coeff. may be attenuated even if the new scale is excellent
  – We estimate the validity coeff. as if the criterion were perfectly reliable by the formula
    • rxy’ = rxy / √rxx
    • rxx is the reliability coeff. of the criterion
  – We can also estimate how much the correlation would improve if the new test were perfectly reliable, or if both were perfectly reliable
• Changes in sample
  – Every time a scale is used in a new context or a new group of people, its psychometric properties should be re-established
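The correction for an unreliable criterion can be sketched as follows. The standard correction for attenuation divides the observed correlation by the square root of the relevant reliability coefficient (or of the product of both reliabilities); the numbers and function names below are hypothetical.

```python
from math import sqrt

def correct_for_attenuation(r_xy, rel_criterion=1.0, rel_test=1.0):
    """Estimated validity if the named measure(s) were perfectly reliable.

    Leave a reliability at 1.0 to apply no correction for that measure.
    """
    return r_xy / sqrt(rel_criterion * rel_test)

def max_validity(reliability):
    """Upper limit that reliability places on a validity coefficient."""
    return sqrt(reliability)

# Hypothetical observed validity of 0.40, criterion reliability 0.64.
criterion_corrected = correct_for_attenuation(0.40, rel_criterion=0.64)
# Hypothetical correction for both measures, each with reliability 0.64.
both_corrected = correct_for_attenuation(0.40, rel_criterion=0.64, rel_test=0.64)
ceiling = max_validity(0.81)
print(round(criterion_corrected, 3), round(both_corrected, 3), round(ceiling, 2))
```

The corrected value is an estimate of what could be achieved with a perfect criterion, not the validity of the scale as it will actually be used.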

83

Measuring change

• Ultimate goal of most treatment studies is to induce a change in the patient’s status

• Controversial views against and for scales which are more sensitive to change in health status

• Goals of measuring change
  – To distinguish between those individuals who change a lot and those who change little
  – To identify correlates of change
  – To infer treatment effects from group differences

84

• It is easier to demonstrate a consistent change in all the subjects, rather than different amounts of change in different subjects

• Why don’t we measure change directly?
  – Asking patients how they have changed since they were put on the treatment is unreliable, because people simply do not remember how they were at the beginning (the validity of such responses is debatable)
  – The most defensible way to assess change is to measure status at the beginning of the study and subsequently on one or more occasions

85

Measures of association

• Reliability• Sensitivity to change

86

Reliability of change score

• Ability of a tool to detect individual differences in change scores

• By analogy to reliability coefficient
  – Reliability (D) =
  – This is related to the variance of the pre-test and post-test scores, their reliability and the correlation between pre- and post-test

• A perfectly uniform response to treatment would be ideal; but this would yield a reliability coeff. for change score of zero, hence inappropriate for this purpose

87

Sensitivity to change from treatment effects

• Sensitivity to change/responsiveness
• Builds on assumptions of generalizability theory
  – Treatment effect – facet of differentiation
  – All other interactions – facets of generalization over which we want to estimate the effect of treatment
• Sensitivity =

• Varies between 0 and 1 (proportion of the variance in the change score due to true experimentally induced change)

• Related to reliability coeff. but not identical

91

Item response theory

• Limitations of G theory
  – Subject/population specific
  – Difficult to compare a person’s score on two or more different tests (converting to z scores relies on a normality assumption, which is not always correct)
  – Homoscedasticity assumption that errors are the same at the ends as in the middle range of scores
  – Assumption that all items have equal valences
• Classical test theory – difficult to separate the properties of the test from the attributes of the people taking it – the tool’s properties change as the people tested change, the people’s properties change as the test changes

92

• IRT – claims to rectify these limitations
  – Based on two ‘hard assumptions’
    1. Data are unidimensional (tap only one trait)
    2. The probability of answering any item in the positive direction is unrelated to the probability of answering any other item positively for people with the same amount of the trait (local independence)
  – Two postulates
    1. Performance of a subject can be predicted by a set of factors called ‘traits’ or abilities, latent traits (theta)
    2. The relationship between a person’s performance on any item and the underlying trait can be described by an ‘item characteristic curve’

93

• Some important properties of item characteristic curves (ICCs here, not the intraclass correlation coefficient)
  – They are usually ‘ogives’
  – Monotonic; the probability of answering in a positive direction consistently increases as the score on the trait increases
  – Differ from each other in three dimensions
    • Slope
    • Location along the trait
    • The flattening out at the bottom
  – Can be thought of as ‘imperfect’ Guttman scales

94

Item characteristic curve

• Q. A is a better discriminator than Q. B
• Q. B is harder than Q. A

95

Different models of ICC

• One parameter model (Rasch model)
  – Assumes that all items have equal discriminating ability but different difficulty

96

• Two parameter model
  – Assumes that both discriminating ability and difficulty differ

97

• Three parameter model
  – In addition to the two parameters, the lower tail asymptotes at some probability greater than 0
  – Takes care of the fact that people may answer questions by guessing, so some items are answered correctly by chance
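The three models can be written as one function with defaults. The sketch assumes the logistic form of the item characteristic curve, which is a common stand-in for the normal ogive; the parameter names a (slope/discrimination), b (location/difficulty) and c (lower asymptote/guessing) and the item values are illustrative.

```python
from math import exp

def item_characteristic_curve(theta, b, a=1.0, c=0.0):
    """Probability of a positive response for a person with trait level theta.

    One-parameter model: only b varies (a fixed, c = 0).
    Two-parameter model: a and b vary (c = 0).
    Three-parameter model: a, b and c vary.
    """
    return c + (1 - c) / (1 + exp(-a * (theta - b)))

# Hypothetical items: A is easier and more discriminating, B is harder and flatter.
trait_levels = [-2, -1, 0, 1, 2]
item_a = [item_characteristic_curve(t, b=0.0, a=2.0) for t in trait_levels]
item_b = [item_characteristic_curve(t, b=1.0, a=0.8) for t in trait_levels]
# A multiple-choice style item where guessing gives a floor of 0.2.
item_guess = [item_characteristic_curve(t, b=0.0, a=2.0, c=0.2) for t in trait_levels]
print([round(p, 2) for p in item_a])
print([round(p, 2) for p in item_b])
print([round(p, 2) for p in item_guess])
```

At theta equal to b the probability is halfway between the lower asymptote and 1, which is why b is read as the location (difficulty) of the item.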

98

Deriving the curves

• Taking a large number of subjects (200 for one-parameter model, 1000 to estimate the 3 parameter model)

• Random sampling vs. latent trait model
  – Random sampling – it is not necessary to know much about the items, but a large pool of items is required
  – Latent trait model – fewer items are required, but every item should be known in detail

99

Advantages and disadvantages

• Allows test-free measurement; people can be compared to one another even if they took different items

• E.g. Wide Range Achievement Test
• People at different levels can be given different items and yet be placed on the same scale at the end (adaptive/tailored testing)
• Not widely used because
  – Large sample size needed to estimate the parameters
  – Assumptions are difficult to meet

100

Future guidelines for developing health measurements

1. Articles/manuals should give full description of purpose, population, intended use

2. Rationale for design of the instrument – conceptual definition of the object of measurement

3. Describe the ways in which questions were selected

4. Revisions if any should be stated along with reliability and validity

5. Clear instructions for standard administration and scoring

6. Reliability and validity testing should examine both internal structure and its relation to alternative measurements of the concept

7. The tool should be tested by users other than the original authors

101

Critical appraisal – Rcq - 36

• What is the population in this study?
• What is the type of scale?
• What is the scaling method used?
• Have they missed any method for item generation?
• Is Cronbach’s alpha calculated appropriately and is the scale reliable?
• Is it appropriate to calculate mean (SD) for each domain?
• Have they established construct validity in this study? Comment on the MTMM matrix used.
• Can this scale be used to measure treatment effects for RC?

102

Thank you

“Belief is no substitute for arithmetic”

— Henry Spencer

