Scales motor in Parkinson's of convergent · impairment (Part1). 0.9 0-8-07 c 0.6.-0 " 0.4 c in 0.3...

7
Journal of Neurology, Neurosurgery, and Psychiatry 1991;54:18-24 Scales for rating motor impairment in Parkinson's disease: studies of reliability and convergent validity L Henderson, C Kennard, T J Crawford, S Day, B S Everitt, S Goodrich, F Jones, D M Park Abstract Study 1 examined the reliability of the ratings assigned to the performance of five sign-and-symptom items drawn from tests of motor impairment in Parkinson's disease. Patients with Park- inson's disease of varying severity per- formed gait, rising from chair, and hand function items. Video recordings of these performances were rated by a large sample of experienced and inexperi- enced neurologists and by psychology undergraduates, using a four point scale. Inter-rater reliability was moderately high, being higher for gait than hand function items. Clinical experience proved to have no systematic effect on ratings or their reliability. The idiosyn- cracy of particular performances was a major source of unreliable ratings. Study 2 examined the intercorrelation of several standard rating scales, com- prised of sign-and-symptom items as well as activities of daily living. The correlation between scales was high, ranging from 070 to 083, despite con- siderable differences in item composi- tion. Inter-item correlations showed that the internal cohesion of the tests was high, especially for the self-care scale. Regression analysis showed that the relationship between the scales could be efficiently captured by a small selection of test items, allowing the construction of a much briefer test. The advent of levodopa replacement therapy gave impetus to the development of clinical rating scales for assessing impairment in Parkinson's disease (see Marsden and Schach- ter,' Potvin and Tourtellotte' for reviews). Despite continuing proliferation of scales, few attempts have been made to evaluate their reliability or validity, or to provide a rationale for the selection of constituent items. Test items tend to fall into two broad categories, sign-and-symptom items which are essentially formalisations of the tests used in the consult- ing room to reveal Parkinsonian impairment, and Activities of Daily Living (ADL) items which assess the functional status of the patient in a more global fashion. In Study 1, we report an investigation of the inter-rater reliability of sign-and-symp- tom items and of the factors that influence reliability. In Study 2, we examine the inter- correlation between different scales and between test items of different types, to determine how much redundancy the tests possess and to investigate the extent to which different types of test item converge on the same underlying properties. STUDY 1 Aim Our general intention in this study was to investigate some of the factors that should guide the selection of sign-and-symptom based test items. In particular, we were con- cerned with inter-rater reliability and how it might be improved. Factors that might be expected to influence the reliability of subjective ratings include the following: Item selection: the nature of the particular movement that is assessed, how familiar it is, how revealing of abnormality, how many biomechanical degrees of freedom it admits, etc. We investigated gait, rising from a chair and three hand function items. Item standard- isation: we specified to the patient the action required, both verbally and by demonstration. Rater expertise: we compared ratings made by neurologists experienced in the use of Parkin- sonian rating scales, inexperienced neuro- logists and psychology undergraduates. We also investigated the effects of a brief training video. Rating criteria: we provided either brief written criteria or video demonstrations of prototypic examples of each scale value. Con- textual factors: these include simultaneous context (strictly irrelevant factors such as expressions of distress or tremor in limbs other than the one being assessed) and prior context (expectations derived from previous assessments of the patient or criterion bias induced by having just rated much more/less impaired patients). In this study some patients were rated several times, in different clinical states. An attempt was also made to vary the amount of simultaneous context available. Methods We began by making a video recording of 11 patients with idiopathic Parkinson's disease performing five test items. From this video data-base we selected and edited appropriate examples to produce our test video, which was used to gather ratings. A four point rating scale was employed for all test items. Ratings were obtained from 50 physicians, 44 of whom were neurologists, the remainder being geriatricians (henceforth, simply "neuro- logists"). We also obtained ratings from 80 psychology undergraduates. The Department of Neurology, The London Hospital, Whitechapel, London C Kennard T J Crawford S Goodrich D M Park Psychology Division, Hatfield Polytechnic, Hatfield,Hertfordshire L Henderson F Jones S Day Biometrics Unit, Institute of Psychiatry, Denmark Hill, London, United Kingdom B S Everitt Correspondence to: Professor Henderson, Psychology Division, Hatfield Polytechnic, Hatfield, Hertfordshire AL1O 9AB, United Kingdom. Received II January 1990 and in revised form 3 May 1990. Accepted 7 June 1990 18 on October 20, 2020 by guest. Protected by copyright. http://jnnp.bmj.com/ J Neurol Neurosurg Psychiatry: first published as 10.1136/jnnp.54.1.18 on 1 January 1991. Downloaded from

Transcript of Scales motor in Parkinson's of convergent · impairment (Part1). 0.9 0-8-07 c 0.6.-0 " 0.4 c in 0.3...

Page 1: Scales motor in Parkinson's of convergent · impairment (Part1). 0.9 0-8-07 c 0.6.-0 " 0.4 c in 0.3 0.2-0.1 0 0 0 0 0 0 0 0 0D°a 0 0 O o8 0 0 0*6 c 05'.2 0 > 0 0.4 0 n n(U 0.3 o

Journal of Neurology, Neurosurgery, and Psychiatry 1991;54:18-24

Scales for rating motor impairment in Parkinson'sdisease: studies of reliability and convergentvalidity

L Henderson, C Kennard, T J Crawford, S Day, B S Everitt, S Goodrich, F Jones,D M Park

AbstractStudy 1 examined the reliability of theratings assigned to the performance offive sign-and-symptom items drawnfrom tests of motor impairment inParkinson's disease. Patients with Park-inson's disease of varying severity per-formed gait, rising from chair, and handfunction items. Video recordings of theseperformances were rated by a largesample of experienced and inexperi-enced neurologists and by psychologyundergraduates, using a four point scale.Inter-rater reliability was moderatelyhigh, being higher for gait than handfunction items. Clinical experienceproved to have no systematic effect onratings or their reliability. The idiosyn-cracy of particular performances was amajor source of unreliable ratings.Study 2 examined the intercorrelation ofseveral standard rating scales, com-prised of sign-and-symptom items aswell as activities of daily living. Thecorrelation between scales was high,ranging from 070 to 083, despite con-siderable differences in item composi-tion. Inter-item correlations showed thatthe internal cohesion of the tests washigh, especially for the self-care scale.Regression analysis showed that therelationship between the scales could beefficiently captured by a small selectionof test items, allowing the constructionof a much briefer test.

The advent of levodopa replacement therapygave impetus to the development of clinicalrating scales for assessing impairment inParkinson's disease (see Marsden and Schach-ter,' Potvin and Tourtellotte' for reviews).Despite continuing proliferation of scales, fewattempts have been made to evaluate theirreliability or validity, or to provide a rationalefor the selection of constituent items. Testitems tend to fall into two broad categories,sign-and-symptom items which are essentiallyformalisations of the tests used in the consult-ing room to reveal Parkinsonian impairment,and Activities of Daily Living (ADL) itemswhich assess the functional status of thepatient in a more global fashion.

In Study 1, we report an investigation ofthe inter-rater reliability of sign-and-symp-tom items and of the factors that influencereliability. In Study 2, we examine the inter-correlation between different scales andbetween test items of different types, to

determine how much redundancy the testspossess and to investigate the extent to whichdifferent types of test item converge on thesame underlying properties.

STUDY 1AimOur general intention in this study was toinvestigate some of the factors that shouldguide the selection of sign-and-symptombased test items. In particular, we were con-cerned with inter-rater reliability and how itmight be improved. Factors that might beexpected to influence the reliability ofsubjective ratings include the following: Itemselection: the nature of the particularmovement that is assessed, how familiar it is,how revealing of abnormality, how manybiomechanical degrees of freedom it admits,etc. We investigated gait, rising from a chairand three hand function items. Item standard-isation: we specified to the patient the actionrequired, both verbally and by demonstration.Rater expertise: we compared ratings made byneurologists experienced in the use of Parkin-sonian rating scales, inexperienced neuro-logists and psychology undergraduates. Wealso investigated the effects of a brief trainingvideo. Rating criteria: we provided either briefwritten criteria or video demonstrations ofprototypic examples of each scale value. Con-textual factors: these include simultaneouscontext (strictly irrelevant factors such asexpressions of distress or tremor in limbsother than the one being assessed) and priorcontext (expectations derived from previousassessments of the patient or criterion biasinduced by having just rated much more/lessimpaired patients). In this study somepatients were rated several times, in differentclinical states. An attempt was also made tovary the amount of simultaneous contextavailable.

MethodsWe began by making a video recording of 11patients with idiopathic Parkinson's diseaseperforming five test items. From this videodata-base we selected and edited appropriateexamples to produce our test video, which wasused to gather ratings. A four point rating scalewas employed for all test items. Ratings wereobtained from 50 physicians, 44 of whomwere neurologists, the remainder beinggeriatricians (henceforth, simply "neuro-logists"). We also obtained ratings from 80psychology undergraduates.

The Department ofNeurology, TheLondon Hospital,Whitechapel, LondonC KennardT J CrawfordS GoodrichD M ParkPsychology Division,Hatfield Polytechnic,Hatfield,HertfordshireL HendersonF JonesS DayBiometrics Unit,Institute ofPsychiatry, DenmarkHill, London, UnitedKingdomB S EverittCorrespondence to:Professor Henderson,Psychology Division, HatfieldPolytechnic, Hatfield,Hertfordshire AL1O 9AB,United Kingdom.Received II January 1990and in revised form3 May 1990.Accepted 7 June 1990

18

on October 20, 2020 by guest. P

rotected by copyright.http://jnnp.bm

j.com/

J Neurol N

eurosurg Psychiatry: first published as 10.1136/jnnp.54.1.18 on 1 January 1991. D

ownloaded from

Page 2: Scales motor in Parkinson's of convergent · impairment (Part1). 0.9 0-8-07 c 0.6.-0 " 0.4 c in 0.3 0.2-0.1 0 0 0 0 0 0 0 0 0D°a 0 0 O o8 0 0 0*6 c 05'.2 0 > 0 0.4 0 n n(U 0.3 o

Scalesfor rating motor impairment in Parkinson's disease: studies of reliability and convergent validity

The patients All the patients filmed were

outpatients at the London Hospital who hadconsented to be videod and for the recordingsto be used for scientific purposes. A fewpatients were filmed twice, once when medica-tion had been delayed and again after medica-tion, when their usual level of functioning hadbeen restored.

The test recording This comprised twoparts, the second of which was prefaced by a

brief training demonstration. In both parts,five patients performed each of five items:rising from a chair, walking and turning, fingertaps, finger flexion and wrist pronation/supina-tion. (We reserve the term "performance" forthe recording of an individual patient'sperformance of a test item.) The two parts ofthe film differed in the following respects. Inpart 1, performances of all five items were

recorded in sequence, patient by patient. Firstcame a demonstration run, with one patientperforming each of the five test actions, in turn.These were not rated. Then came the series offive patients for rating, patients 2 and 5 beingthe same individual tested in two states ofmedication.

In part 2, there were again five differentperformances ofeach ofthe five test items, to berated. However, the following changes were

made from the format adopted in part 1. Ratherthan the various tests being presented patientby patient, the performances were presenteditem by item, with five different patients per-forming each test item before the next test itemwas encountered. Moreover, different (butoverlapping) sets of patients were chosen foreach test item. Before the five performances ofeach item a training demonstration was presen-ted. This consisted of a prototypical example ofa performance meriting each of the four ratings(0 = normal; 1 = mild; 2 = moderate;3 = severe). These four examples were eachviewed twice. The first of these presentationswas accompanied by a commentary pointingout any abnormal features of the movement.Two other changes distinguished part 2 frompart 1. The walking and turning item was notdivided into separately rated armswing and gaitratings; instead, one single composite ratingwas given. Also, wherever possible, the face ofthe patient and irrelevant parts of the bodywere blanked out.

The test items The items were drawn withminor amendments from the Webster3 andUnified Parkinson's Disease Rating Scale(UPDRS)4 tests. Rising from a chair was

performed from a hard, upright chair witharms. The patient was instructed to attemptinitially to rise without using leverage on thearms of the chair. Gait involved walking at a

natural pace for six metres, turning within theconfines of a box marked out on the floor andreturning to the starting point.The three hand function items were per-

formed seated. Finger tapping was performedwith the index finger, while the hand was restedflat upon the table, palm down. Finger flexionand wrist pronation/supination were per-formed with the relevant arm extended straightin front, level with the shoulder. Finger flexion

required the repetitive opposition of thumband index finger. Pronation and supinationwere performed in alternation, with fingerspartially extended. For these items, the 10seconds of recorded activity were culled fromthe end of a 24 second performance.

Raters and rating procedure The fiftyUnited Kingdom neurologists were testedtogether during a symposium on Parkinson'sdisease. For some of the subsequent analysesthey were divided into experienced/inexperienced subgroups, according towhether they had declared a particular interestin motor disorders and had experience of usingthe Webster test (N = 23) or lacked at least oneof these attributes, usually both (N = 27). Theundergraduate raters were first year (PSY1)and second year (PSY2) psychology students.PSY2 (N = 38) were run as a single group.PSYI were divided into two subgroups, whoeither rated the video in the standard order(Pt 1- > Pt 2: N = 19) or in the reverse order(Pt2->Ptl:N = 23).The rating criteria to be used for part 1 were

provided for the raters on a printed sheet. Thecriteria used in part 2 were provided by thetraining examples.

Statistical analysis The four-point ratings(0-3) for each item were used to calculate twomeasures of the agreement among groups, thestandard deviation and the coefficient of con-cordance. The standard deviation was ourprimary index and provided a measure of thevariability across a group of the ratingsassigned to a particular performance. Kendall'sConcordances5 have also been calculated toprovide a measure of the agreement betweenraters as to the ranking of patients. We citeconcordances to allow comparison withprevious reliability studies.467 However, as willappear later, a problem with this coefficient isthat it may be biased by the degree of similarityfound in a particular set of patients.

ResultsReliability and expertise Figure 1 shows themean ratings given by the neurologists(undivided) and the undergraduates to each ofthe 30 rateable items (five patients x six items)in part 1. Both groups seem to use the full rangeof the scale and their mean ratings agreeclosely, showing similar profiles for eachpatient.

Figure 2 displays the inter-rater variability(SD) of the ratings assigned to each itemperformance in part 1, as a function ofthe meanlevel of impairment indicated by the ratings.Only the ratings by neurologists are shown buta very similar inverted-U function was foundfor undergraduate ratings. What this pattern ofresults indicates is that the only systematicrelationship between a patient's mean ratedlevel ofimpairment on a test item and the inter-rater variability of these ratings, occurs at theextreme ends of the scale (<0-25 and > 2 75),where variability declines sharply.

Figure 3 allows inspection of the averagerating variability of each test item, as a functionof expertise. These data, trimmed of items with

19

on October 20, 2020 by guest. P

rotected by copyright.http://jnnp.bm

j.com/

J Neurol N

eurosurg Psychiatry: first published as 10.1136/jnnp.54.1.18 on 1 January 1991. D

ownloaded from

Page 3: Scales motor in Parkinson's of convergent · impairment (Part1). 0.9 0-8-07 c 0.6.-0 " 0.4 c in 0.3 0.2-0.1 0 0 0 0 0 0 0 0 0D°a 0 0 O o8 0 0 0*6 c 05'.2 0 > 0 0.4 0 n n(U 0.3 o

Henderson, Kennard, Crawford, Day, Everitt, Goodrich, Jones, Park

Patient JA

Neurologists

0- - -0 Undergraduates

Without drugs

Patient DA

II s /

0b

Patient ST

A $.- ..."I, ..d

_Q .

-eX Te STReegg e Te-

ea T '

Test items

Figure I Mean ratings assigned to performances in Part 1 by neurologists and undergraduates. (Test items were risefrom chair, walking andturning, finger tapping, finger flexion and wrist pronation/supination).

Figure 2 Variability(SD) of the ratingassigned by the neurologiststo a performance, as afunction of its mean ratedimpairment (Part 1).

0.9

0-8-

07

c 0.6.-0

" 0.4c

in0.3

0.2-

0.1

0

0

0

0

0

0

0

0

0D°a

00

O o8

0

0

0*6

c 05'.2

0 >0 0.4

0 n(U

n 0.3o X2

n0 C

0-2

0.1'

0

Ordered mean ratings

3 Test items in part one

0-6.

a mean above 2-75 or below 0-25, were submit-ted to a three-way ANOVA (groups x parts xitems). While, overall, variability was greatestfor the undergraduates and least for the in-experienced neurologists, this expertise factordid not approach significance. Variability wasreduced in part 2 and this effect was justsignificant (p < 005). The effect of item typewas highly significant (p < 0O001) with risingfrom chair and gait both showing higherreliability (mean SDs 0-34) than the handfunction items (SDs 0-48-0-52).

Figure 4 displays the concordance co-

efficients for each test item in parts 1 and 2.(Note that high scores now denote agreement.)Here, also, there is no indication that experts'ratings yield more agreement but there is a

tendency in all the groups for hand functionitems to yield less agreement. Finger taps inpart 2 produced notably low concordance butsubsequent analyses showed this to be due to a

reduced range of impairment on this item.Training To evaluate the effect ofthe train-

ing demonstration that introduced each testitem in part 2, we compared the reliabilitymeasures obtained in part 1 for the two PSY1

o0-5

~03

C

'm 0-2

0.1

0

RFC Gait Tap FFx P/STest items in part two

Figure 3 Variability (SD) as a function of raterexperience and item type, shown separately for Parts Iand 2.

subgroups. The reverse order subgroup, whoalone had seen the training examples by thispoint, showed no resulting benefit, indicatingthat our brief training with exemplars had beenineffective as a means of improving agreementamongst inexperienced raters.

Qualitative observations Item standardisa-tion presents a major problem, especially for

Patient NA

3.

2j

Patient ST*

6

* 'Expert' neurologistsla 'Novice' neurologistsO Undergraduates

20

on October 20, 2020 by guest. P

rotected by copyright.http://jnnp.bm

j.com/

J Neurol N

eurosurg Psychiatry: first published as 10.1136/jnnp.54.1.18 on 1 January 1991. D

ownloaded from

Page 4: Scales motor in Parkinson's of convergent · impairment (Part1). 0.9 0-8-07 c 0.6.-0 " 0.4 c in 0.3 0.2-0.1 0 0 0 0 0 0 0 0 0D°a 0 0 O o8 0 0 0*6 c 05'.2 0 > 0 0.4 0 n n(U 0.3 o

Scalesfor rating motor impairment in Parkinson's disease: studies of reliability and convergent validity

a 'Expert' neurologists

0 'Novice' neurologists

O Undergraduates

c06

3:5w

L 064.00

0-2.

Test items in part one

a)

0

c

0

0

RFC Gait Tap FF)

Test items in part two

the hand function items used here. The actionsrequired are not everyday movements anddespite clear instructions/demonstrationspatients found a variety of ways to execute themovements. Postural idiosyncracies may affectthe ease of execution. Such variation alsocomplicates the rater's task. (A solution that weare currently exploring is to employ a devicethat constrains the movement, such as a rotat-ing handle for pronation/supination.) Anothersource of difficulty in rating altemating handmovements is that with severely impairedpatients, unless instructions emphasisemovement amplitude rather than rate, tremormay be taken for rapidly alternatingmovements of low amplitude. Indeed, we sus-

pect that some patients generate tremor as a

movgment surrogate.Two types of contextual cue may obstruct

the attempt to focus on a specific feature andrate it independently. The first of these we

believe to be responsible for the high SDoutlier evident in fig 2. (JA performing P/S.Note that this outlier is not a "capricious"datum. This performance also yielded thelargest SD for the undergraduates.) While thepronation/supination is itself only very mildlyimpaired, the patient's face shows intenseeffort and there is some postural tremor. Dif-ferent weightings attached to these con-

currently available cues produce unreliableratings. As it happens, JA is also the subject ofsequential context cues, since she appears

several times throughout the video, in verydifferent states of medication.

Finally, another reason for supposing thatthe idiosyncracies of particular performancesrepresent a major source of variability in theratings is that the pattern ofSD obtained acrossperformances was remarkably similar for theneurologists and undergraduates (r = 0-659;p < 0001).We conclude that careful selection of test

items, standardisation of their manner ofexecution, the clarification of rating criteriaand removal of contextual cues seems morelikely to improve reliability than does theselection of raters on the basis of experience orthe provision of very brief training examples.

STUDY 2In this study, we drew upon data gathered inthe course of a double-blind drug inves-tigation.9 This data-base comprised the scoresof 49 patients on five different tests of impair-ment. The tests represent the full range of itemtypes, some being ADL-based and othersbeing sign-based. These data allowed us topursue two main questions: 1) Does therelationship between patients' scores on dif-ferent scales offer persuasive evidence of con-vergent validity? (Where an external, objectivevalidating criterion or "gold-standard" isunavailable, weaker evidence ofthe validity ofameasuring instrument can be found in itstendency to agree with other tests that purportto measure the same features. Convergentevidence of validity is more impressive themore the content of the test items differs); 2) Isthere sufficient redundancy amongst the testitems to allow construction of a much briefertest that nevertheless correlates well withexisting instruments.Patients The sample was screened to excludepatients with dementia or with an additionalcondition that might contribute to the assessedimpairment.Test The scales used were: 1) NorthwesternUniversity Disability Scale (NUDS),6 com-prising five ADL items (walking, dressing,eating/feeding, hygiene, speech), each rated ona 10 point scale, save eating/feeding which bothhave a five point scale. Total possiblescore = 50. (50 = normality); 2) Hoehn andYahr staging,'0 categorises patients intofive stages, using multiple criteria. (0 =normality); 3) Self Care Scale9-self-ratings of12 items (dressing, eating, food preparation,house cleaning, getting out of bed, turning inbed, rising from chair, climbing stairs, use oftoilet, use of tools, bathing, shopping/mobility), each rated on a four point scale (0-3).Total possible score = 36. (0 = normality); 4)Webster Scale,2 largely a sign-based test con-taining 10 items (manual bradykinesia, rigidity,posture, arm swing while walking, gait, tremor,facies, seborrhoea, speech, self-care), eachrated on a four point scale (0-3). Total possiblescore = 36. (0 = normality); 5) KarnofskyPerformance Score,"1 originally devised toprovide a classification of cancer patients' func-tional self sufficiency into 10 gradations (from100 = "normal" to 0 = "dead").

Figure 4 Concordancewithin groupsfor each testitem, shown separately forParts 1 and 2.

1-0a

0-8

21

on October 20, 2020 by guest. P

rotected by copyright.http://jnnp.bm

j.com/

J Neurol N

eurosurg Psychiatry: first published as 10.1136/jnnp.54.1.18 on 1 January 1991. D

ownloaded from

Page 5: Scales motor in Parkinson's of convergent · impairment (Part1). 0.9 0-8-07 c 0.6.-0 " 0.4 c in 0.3 0.2-0.1 0 0 0 0 0 0 0 0 0D°a 0 0 O o8 0 0 0*6 c 05'.2 0 > 0 0.4 0 n n(U 0.3 o

Henderson, Kennard, Crawford, Day, Everitt, Goodrich, Jones, Park

Results and discussionIntercorrelation of the composite scores Table 1shows some of the intercorrelations of thecomposite test scores. All the correlations are

quite high, accounting for at least 50% of thevariance. This is not entirely surprising, giventhe overlap of content between the tests. Mostinteresting therefore is the high correlationbetween the Webster and NUDS (0-82), sincethese scales overlap negligibly in content. Infact, the correlation of the NUDS with theWebster (different item types) is as great as thatbetween the NUDS and Self Care (0 81) bothof which are ADL-based. Brown et al'2 alsofound a similarly high correlation betweenscores on an ADL scale and a symptom-basedscale (the King's College Hospital Parkinson'sdisease rating scale). In their study, agreementwas greater when the ADL ratings were madeby the clinician who had rated the symptomitems, rather than when ADL scores derivedfrom patients' self ratings.ADL versus sign-based items To examine therelation between ADL and sign-based items,the intercorrelations between all the con-

stituent items in the Webster and NUDS were

calculated (table 1). Of the sixty correlationsbetween individual items, only five failed toattain significance (p < 0-05). Three of theseinvolved speech items.Hoehn and Yahr No patient was assigned tostage 5. Those patients classified as stage 4revealed, as a group, reliably greater impair-ment scores than those at stage 3 on all the othermeasures except the Webster. However,classification into stages 1-3 showed a veryinconsistent relationship with the NUDS, SelfCare and Webster scores. We therefore pur-sued the relationship between Hoehn andYahr, and Webster scores in a larger data-base,acquired by the United Kingdom Parkinson'sDisease Research Group (PDRG) and com-

prising 175 untreated patients, at stages 1-3.The groups of patients rated as stage 1 and 2could not be reliably distinguished on theirWebster scores. Taken together with other datasuggesting that Hoehn and Yahr scores maybehave anomalously,'3 and that the reliability ofthe measure may be low,8 this result suggeststhat the ease of arriving at Hoehn and Yahrstagings is won at excessive cost, at least for theearlier stages.

Redundancy and test length Larsen et al'4 haveproposed that a test of Parkinsonism should bebrief, to avoid fatigue and to permit multipletesting during a patient's day. Given therelatively high intercorrelations we obtainedbetween test items, it seemed worth investi-gating whether a much briefer instrumentcould be constructed as an index of the func-tional competences on which the test items tendto converge. To this end, the NUDS totalscores were regressed on the 22 items whichcomprised the Webster and Self Care scales.The seven test items showing the largestregression coefficients relative to their standarderrors are shown in table 2. The new scalecomprising these items correlates with theNUDS at 0-89 and with the Karnofsky at 0 82,suggesting considerable convergence, at a costof relatively few items in the new scale.

Since the most remarkable recent develop-ment in measures for rating Parkinsonism hasbeen the construction of the chimericalUPDRS,4 a vast instrument assembled fromcomponents of several existing scales, it seemsworth posing the question as to what might begained by grossly extending the number ofitems in a scale. Given the coarse, four-pointscale employed by both the Webster and theSelf Care Inventory, it seems unlikely thatincreasing the number of items has the poten-tial greatly to enhance sensitivity of the scale.However, a modest degree of item redundancymay be justifiable, in terms of a consequentimprovement in reliability ofthe total score. Toachieve finer discriminations, at least one of thefollowing steps must be taken: 1) A coarse scalemay be retained but item difficulty varied, withthe result that items differ in the portion of thecontinuum of disability that they resolve. Thisis essentially the technique employed by mostintellectual tests, where each item producesonly a binary (right/wrong) score; 2) Anattempt can be made to transcend the normallimitation of a raters' capacity for absolutejudgements by anchoring scale values in

detailed descriptions, such as the NUDSprovides for its ten-point scales; 3) Change ofstate may be made the explicit focus of therating;"5 4) Ratings may be abandoned in favourof objective measures.

Internal cohesion of the tests Sign-and symp-tom based items are directed at more element-

Table I Correlations ( x 100) of items in the Webster Scale with items in the-NUDS, Values are also providedforthe Total Webster Score, Total NUDS, Total Self Care (SC) and Karnofsky Performance Score

Northwestern University Disability Scale NUDS

Walking Dressing Hygiene Eating Feeding Speech Total Karnofsky

WebsterBradykinesia 49 55 50 38 45 42 58 46Rigidity 25 29 26 28 39 34 35 31Posture 44 47 48 43 51 39 55 59Arm swing 40 43 51 38 45 20 48 45Gait 59 62 59 39 50 28 62 62Tremor 34 27 47 19 42 6 36 19Facies 25 36 41 41 43 66 50 25Seborrhoea 28 43 34 7 42 33 39 32Speech 17 27 24 35 37 80 43 26Self care 65 54 63 55 52 42 68 71Webster Total 65 71 74 57 74 62 82 70SC Total 81 83

22

on October 20, 2020 by guest. P

rotected by copyright.http://jnnp.bm

j.com/

J Neurol N

eurosurg Psychiatry: first published as 10.1136/jnnp.54.1.18 on 1 January 1991. D

ownloaded from

Page 6: Scales motor in Parkinson's of convergent · impairment (Part1). 0.9 0-8-07 c 0.6.-0 " 0.4 c in 0.3 0.2-0.1 0 0 0 0 0 0 0 0 0D°a 0 0 O o8 0 0 0*6 c 05'.2 0 > 0 0.4 0 n n(U 0.3 o

Scales for rating motor impairment in Parkinson's disease: studies of reliability and convergent validity

Table 2 Itemsfrom the Webster and Self Care TestsScales with predictive utility for the NUDS

Regression Standardcoefficient error

Webster itemsBradykinesia 0-285 0 11Gait 0131 011Tremor 0-153 0 09Speech 0-157 0-12

Self care itemsFood preparation 0 248 0 18Climbing stairs 0 216 0 11Mobility 0 285 018

ary features of movement and its impairmentthan are ADL items. Performance of ADLitems is influenced by a large number offeatures, several ofwhich are likely to be sharedby different items. These considerations sug-gest that the internal cohesiveness of the itemscomprising a largely sign-and-symptom basedscale like the Webster might be lower than thatofan ADL-based instrument. In pursuit ofthisquestion we calculated separately for the Web-ster and the Self Care (both comprising similarnumbers of items and using the same four-point scale) the intercorrelations amongst theirconstituent items. These are shown in table 3.To make the matrices easier to inspect onlythose values that were highly significant(p < 0-001) are displayed. It is clearly evidentthat the Self Care scale is the more cohesivetest. In the Self Care scale, all the correlationsthat fail to meet our strict significance levelinvolve either eating or tool use, items thatinvolve fine motor control and lack a majorcomponent of mobility. In contrast, only asmall minority of the Webster inter-correla-tions met our significance level.To explore further the relationships between

the constituents of the Webster and Self Carescales a principal components analysis'6 wasconducted on each. This statistical technique isdirected towards the identification of coherentsubsets within a group of variables. Subsets ofvariables (in this case, test items) that arecorrelated with one another but dissociatedfrom other subsets are combined into "com-ponents". The essential feature of the tech-

2 3 4052 060 062

0*50086

0 001 for the Self Care Scale and

9048

0680-680-640-58061073

100*500-42050

0*500430*51049

11066

0780790-600520650680730-53

120-600400770-850-580-610470580700*50077

5 6 7 8059 063 054 050

050 049 0-58 058053 054 050 056

0 68 0-63 0-710-60 0 60

057

2 3 4 5 6054 051 040

0-52053 051

0-55

7 8 9 10

0-480*50054

074

Table 4 Principal component (PC) resultsfor theintercorrelation of the items in the Self Care Scale

Before rotation After rotation

PCI PC2 PCI PC2

Dressing 0-77 0-24 0 50 0 54Eating 0 51 0 73 0 00 0 89Cooking 0-85 0-22 0 57 0-67House work 0 84 0 06 0 65 0-53Rise from bed 0 78 -0 36 0 84 0.15Turn in bed 0-75 -0 20 0 73 0-27Rise from chair 0 75 -0 10 0 67 0-35Stair climb 0-79 -0-33 0 84 0-18Toilet 0-82 -0-26 0-82 0-26Tool use 0 65 0 20 0 41 0 54Bathing 0-88 0 00 0-72 0.51Mobility 0-85 0-07 0-66 0.55

nique is that the original variables are transfor-med into a new set, the components, which arethemselves uncorrelated. These componentsare ordered so that they account for decreasingproportions of the variance in the original data.Since a small number of these new variables(each of which is a linear combination of theoriginal) may account for a large part of thevariance, they may provide a parsimonioussummary of the data. In some cases the derivedvariates, are subjected to a mathematicalprocess known as "rotation" which is intendedto aid the identification ofthe components. Thetechnique may be used descriptively to sum-marise the data, with redundancy removed. Itmay also be used interpretatively in an attemptto uncover, for instance, the ability factors thatunderlie performance.

Tables 4 and 5 show the principal com-ponent results obtained before and after a"varimax" rotation, for the Self Care items andWebster items, respectively. For the Self Careitems, the two components extracted from thecorrelations accounted for 70% of the variance.After rotation, the first had loadings above 0-60on (in descending order) rising from bed,climbing stairs, toilet, turning in bed, bathing,rising from chair, housework and mobility outof doors. Only eating and cooking had suchloadings on the second component. Theseresults suggest that the solution for the SelfCare scale is rather simple, with one majorcomponent that seems to reflect mobility andwhole body movements. The minor com-ponent seems to reflect manual fine motorcoordination.The three components extracted from the

Webster inter-item correlations together onlyaccounted for 65% of the variance. Loadingabove 060 on the first rotated component werearm swing, gait, self care and posture; on thesecond component, speech and facies; on thethird component, seborrhoea. These com-ponents are not easy to interpret, although thefirst appears to relate to mobility. A factoranalytic study of a similar set of symptom-based items has been reported by Reynolds andMontgomery."7 They also required to positthree factors to account for 70% of thevariance, with the first factor seeming to be oneof mobility.

In summary, these analyses suggest that theSelf Care scale is very much more internally

Table 3 Inter item correlations significant at p <the Webster Scale

Self care itemsDressingEatingCookingHouse workRise from bedTurn in bedRise from chairStair climbToiletTool useBathingMobility

Webster itemsBradykinesiaRigidityPostureArm swingGaitTremorFaciesSeborrhoeaSpeechSelf care

23456789101112

2345678910

23

on October 20, 2020 by guest. P

rotected by copyright.http://jnnp.bm

j.com/

J Neurol N

eurosurg Psychiatry: first published as 10.1136/jnnp.54.1.18 on 1 January 1991. D

ownloaded from

Page 7: Scales motor in Parkinson's of convergent · impairment (Part1). 0.9 0-8-07 c 0.6.-0 " 0.4 c in 0.3 0.2-0.1 0 0 0 0 0 0 0 0 0D°a 0 0 O o8 0 0 0*6 c 05'.2 0 > 0 0.4 0 n n(U 0.3 o

Henderson, Kennard, Crawford, Day, Everitt, Goodrich, Jones, Park

Table 5 Principal component (PC) results for the intercorrelation of items in theWebster Scale

Before rotation After rotation

PCI PC2 PC3 PCI PC2 PC3

Bradykinesia 0 68 0 00 0 35 0-36 0-31 0 60Rigidity 0 67 0.21 0 21 0-29 0 50 0A44Posture 0 82 -0 11 -0 01 0-67 0 34 0-34Arm swing 0-66 -0-49 -0 20 0 84 -0 04 0 14Gait 0 65 -0 34 -0-33 0 80 0 10 0 01Tremor 0 37 -0 47 0 36 0 37 -0-25 0 53Facies 0 60 0 61 -0 10 0-16 0-84 0 09Seborrhoea 0 27 0 05 0-77 -0-14 0-08 0-80Speech 0 53 0 73 -015 0 07 0.91 0 00Self care 0 70 -014 -0 31 0-72 0-29 0 03

cohesive than the Webster scale and has asimpler underlying structure. Two factors maycontribute to these differences between the twoscales. First, sign-based test items come closerto reflecting independent, elementary featuresof the disorder, whereas the ADL items drawupon overlapping sets of elementary features.This in turn implies that whilst a brief ADL-based scale may efficiently summarise thepatient's functional status, more detailedinvestigations into the various features of thedisease and their responsiveness to therapy willbe better served by a sign-based instrument.Second, it is worth bearing in mind that theapparent cohesiveness of the Self Care scalemay partly be due to reliance on patients' selfratings, resulting in some lack of independencein the assessment of items.

General conclusions1) It should be possible to construct a short,highly cohesive scale, consisting ofADL items,and relying largely on self ratings, which wouldprovide a useful assessment of a patient'sgeneral functional status.2) While ADL-based scales are more internallycohesive than sign-based ones, the relationshipbetween these contrasting types of tests issufficiently close to provide reassuringevidence of convergent validity.3) Sign-based scales offer a better prospectthan ADL based scales as instruments foranalysing patterns of impairment or selectiveeffects of therapy. However, as Study 1showed, in their present form their reliability isunsatisfactory even when a scale with only fourlevels is employed.4) The major source ofunreliability seems to beinherent peculiarities and ambiguities in par-

ticular performances, rather than expertise of

the rater. This might be considerably reducedby careful selection and standardisation ofitems and their scoring criteria. However, evenwith reliability optimised, the discriminativepower of a test based on subjective clinicalratings is likely to be severely limited.Moreover, greatly extending the number ofitems will not, in itself, obviate this difficulty.To discriminate between increasingly refinedtherapeutic interventions, either some meansmust be found to overcome the limitations ofthe raters' capacity for absolute categoricaljudgements or valid objective measures have tobe developed.

This research was supported by grants from the MedicalResearch Council, the Parkinson's Disease Society and aNational Advisory Board award. We are grateful to JohnGosling and Christian Lueck for helpful comments, to SandozPharmaceuticals for the opportunity to gather the neurologists'ratings in Study 1 and for the test data in Study 2, and to theUnited Kingdom Parkinson's Disease Research Group foraccess to their data-base.

1 Marsden CD, Schachter M. Assessment of extrapyramidaldisorders. Brit J Clin Pharmacol 1981;1 1:129-51.

2 Potvin PE, Tourtellotte MD. Quantitative examination ofneurological functions. Boca Raton, Florida: CRC Press,1984.

3 Webster DD. Clinical analysis of the disability in Parkin-son's diseaase. Mod Treat 1968;5:257-82.

4 Fahn S, Elton RL. Members of the UPDRS Committee.Unified Parkinson's disease rating scale.

5 Siegel S. Nonparametric statistics for the behavioral sciences.London: McGraw Hill, 1956.

6 Canter CD, de la Torre A, Mier M. A method for evaluatingdisability in patients with Parkinson's disease. J NervMent Dis 1961;133:143-7.

7 Kennard C, Munro AJ, Park DM. The reliability of clinicalassessment in Parkinson's disease. J Neurol NeurosurgPsychiatry 1984;47:322-3.

8 Ginanneschi A, Degl'Innocenti F, Magnolfi S, et al. Evalu-ation of Parkinson's disease: reliability of three ratingscales. Neuroepidemiology 1988;7:38-41.

9 Everitt BS, Hand D, Barrie MA, et al. (United KingdomBromocriptine Research Group) Bromocriptine in Park-inson's disease: a double-blind study comparing "low-slow" and "high-fast" introductory dosade regimens in denovo patients. J Neurol Neurosurg Psychiatry 1989;52:77-82.

10 Hoehn MM, Yahr MD. Parkinsonism: Onset, progressionand mortality. Neurology (Cleveland) 1967;5:427-42.

11 Karnofsky DA, Abelmann WH, Craver LF, Burchetval JH.The use of the nitrogen mustards in the palliative treat-ment of carcinoma, with particular reference tobronchogenic carcinoma. Cancer 1948;1:634-9.

12 Brown RG, MacCarthy B, Jahanshahi M, Marsden CD.Accuracy of self-reported disability in patients withParkinson's disease. Arch Neurol 1989;46:955-9.

13 Diamond SG, Markham CH. Evaluating the evaluations: orhow to weigh the scales of Parkinsonian disability.Neurology (Cleveland) 1983;33:1098-9.

14 Larsen TA, LeWitt PA, Caine DB. Theoretical and practicalissues in assessment of deficits and therapy in Parkin-sonism. In: Calne DB, ed. Lisuride and other dopamineagonists. New York: Raven Press, 1983.

15 Feinstein AR, Josephy BR, Wells CK. Scientific and clinicalproblems in indexes of functional disability. Annals of IntMed 1986;105:413-20.

16 Everitt B, Dunn G. Advanced methods ofdata exploration andmodeling. London: Heinemann, 1983.

17 Reynolds NC, Montgomery GR. Factor analysis of Parkin-son's impairment. Arch Neurol 1987;44:1013-16.

24

on October 20, 2020 by guest. P

rotected by copyright.http://jnnp.bm

j.com/

J Neurol N

eurosurg Psychiatry: first published as 10.1136/jnnp.54.1.18 on 1 January 1991. D

ownloaded from