The challenge of creating new OSCE measures to capture the characteristics of expertise

Brian Hodges, Nancy McNaughton, Glenn Regehr, Richard Tiberius & Mark Hanson

Purpose Although expert clinicians approach interviewing in a different manner than novices, OSCE measures have not traditionally been designed to take levels of expertise into account. Creating better OSCE measures requires an understanding of how the interviewing style of experts differs objectively from that of novices.

Methods Fourteen clinical clerks, 14 family practice residents and 14 family physicians were videotaped during two 15-minute standardized patient interviews. Videotapes were reviewed and every utterance coded by type, including questions, empathic comments, giving information, summary statements and articulated transitions. Utterances were plotted over time and examined for characteristic patterns related to level of expertise.

Results The mean number of utterances exceeded one every 10 s for all groups. The largest proportion was questions, ranging from 76% of utterances for clerks to 67% for experts. One third of total utterances consisted of a group of 'low frequency' types, including empathic comments, information giving and summary statements. The topic was changed often by all groups. While utterance type over time appeared to show characteristic patterns reflective of expertise, the differences were not robust. Only the pattern of use of summary statements was statistically different between groups (P < 0.05).

Conclusions Measures that are sensitive to the nature of expertise, including the sequence and organisation of questions, should be used to supplement OSCE checklists that simply count questions. Specifically, the information giving, empathic comments and summary statements that occupy a third of expert interviews should be credited. However, while there appear to be patterns of utterances that characterise levels of expertise, in this study these patterns were subtle and not amenable to counting and classification.

Keywords *Clinical competence; education, medical/*standards; educational measurement/*standards; interviews/methods.

Medical Education 2002;36:742–748

Department of Psychiatry and the Centre for Research in Education, University of Toronto, Toronto, Canada.

Correspondence: Brian Hodges, The Toronto General Hospital, 8EN212–200 Elizabeth Street, Toronto, Ontario M5G 2C4, Canada. Tel.: 00 1 416 340 4451; Fax: 00 1 416 340 4198; E-mail: [email protected]

Introduction

In the past 2 decades, medical education assessment has been revolutionized by the widespread adoption of performance-based objective structured clinical examinations (OSCEs). OSCEs are now a major part of certification for all Canadian medical graduates, all international medical graduates in the United States and, in the next few years, all American medical graduates. Not surprisingly, tremendous research and development effort has focused on maximizing the reliability and validity of this innovative assessment technology. As a result, methods of administering and scoring OSCEs have evolved considerably since they were first introduced in the 1970s.

An area of great importance in OSCE research is the way in which candidates' performances are scored. Because scores are ultimately used to distinguish individuals who are competent from those who are not, educators would like to have every confidence that their assessment measures are sufficiently reliable and valid to make this distinction. Indeed, OSCE research is increasingly concerned with comparing the properties of different types of rating scales.1,2 OSCEs have traditionally employed binary content checklists to score behaviours demonstrated by students. Behaviours are recorded on these checklists by a physician examiner or a standardized patient, in no particular order, and summed at the end of the interaction without regard for the sequence or timing of questions or actions. Growing recognition of the inability of checklists to capture complex human behaviours such as empathy and organization has resulted in the adoption of multipoint global ratings designed to capture non-binary characteristics of the performance. While such global ratings have been shown to have good psychometric properties,3,4 they have generally been developed as an 'add-on' to compensate for the deficiencies of checklists, rather than on the basis of clearly demonstrated characteristics of medical expertise.

At the same time, there is a body of literature demonstrating that clinicians with higher levels of experience approach clinical interviewing and arrive at diagnoses in a very different manner than do novices. For example, Dreyfus and Dreyfus5 observed the problem solving of many different professionals at various levels of expertise. They suggested that professionals pass through five stages in the development of expertise – novice, advanced beginner, competence, proficiency and expertise. They found each level of development to be characterized by its own form of problem solving. The individual stages will not be reviewed here, except to say that the early novice stage is characterized by the collection of large amounts of data, in no particular order, with little regard for situational factors, which are then used to synthesize a problem solution (a diagnosis in medicine). At the other end of the spectrum, an expert gathers much more focused information, relies on many different types of data including situational variables, compares all observations to previous experiences, and quickly and automatically responds to such observations, often without resorting to any formal process of problem solving. In medicine, this means making rapid diagnoses on the basis of many types of information. In gathering information, experts typically ask a few specific questions, in a hierarchical order, designed to narrow the diagnostic possibilities quickly and focus on the problem at hand.6,7 Further, experts cannot easily break down their thinking into component steps and therefore have great difficulty returning to a novice form of solving problems.8

An implication for assessment is 'the puzzling fact that experienced clinicians score little better and sometimes worse than less experienced clinicians. A possible reason for this is that most methods measure clinical factual knowledge rather than the organization of knowledge that allows clinicians to recognize and handle situations effectively.'9 Indeed, OSCE studies have shown that sometimes novices score higher than experts on detailed checklists, despite arriving at the correct diagnosis less often than the expert.10,11 In these studies binary checklists were not valid measures of clinical competence at higher levels of expertise because they reflected the approach of novices to clinical problems and could not effectively detect the complex and hierarchical problem solving characteristic of the experienced clinicians. Indeed, checklists penalized clinicians who arrived at diagnoses quickly as a result of pattern recognition and rapid hypothesis testing.

While adoption of holistic measures such as global ratings may be a partial solution, we wondered if we might be able to dissect interviews and categorize the individual utterances made by clinicians in an attempt to boost the validity of the checklists. We hoped that patterns would emerge in the nature and sequence of utterances that might be used to distinguish novices from experts.

The existing literature suggested that variables would include the types and sequence of questions and statements. Thus, we developed a coding procedure that would capture these variables. As part of a previous research study, we had an extensive set of videotapes of 84 interviews between clinical clerks, residents and family physicians and standardized patients.11 These tapes contained a valuable source of information about the way clinicians at different levels approached problems. The analysis of examiner-scored checklists and global ratings from these interviews had shown the pattern already discussed: that is, experts scored highest on global ratings and clerks scored highest on checklists.11 The current study provided the opportunity to carefully review all of these videos in search of variables that might help us understand more about the specific differences in interviewing technique between clerks, residents and experienced family physicians. We hoped that our findings could be used to create a new generation of expertise-sensitive OSCE checklists. Here we present our analysis of the patterns of utterances made by clinicians at different levels of expertise and our reflections on the potential of such a method for improving OSCE measures.

Key learning points

• Expert clinicians approach interviewing in a different manner than novices.

• OSCE measures have not traditionally been designed to take into account levels of expertise.

• Measures that are sensitive to the nature of expertise should include the sequence and organisation rather than just the number of questions asked.

• While there appear to be patterns of interviewing that characterise different levels of expertise, they are subtle and may not be amenable to counting and classification for examination purposes.

Method

Subjects

After giving consent, 42 subjects (14 clinical clerks, 14 family practice residents and 14 family physicians in practice) were videotaped during 2 clinical interviews. In each they interviewed a standardized patient for 15 min. Subjects were all volunteers and were paid a $50 honorarium for their participation. Clinical clerks and family practice residents were recruited from the University of Toronto, while family physicians were recruited from several urban family practice units in Toronto.

Procedure

A total of 84 15-minute clinical encounters were videotaped. One was an initial assessment of a patient with panic disorder and the other of a patient with obsessive-compulsive disorder. Both scenarios were carefully constructed using a standard protocol and extensively field-tested in multiple examinations at the University of Toronto.10,11 All interviews were set in a family practice context. Due to technical difficulties (poor sound, part of the interview cut off, etc.) 15 videos had to be discarded from this analysis, leaving a total of 69 complete interviews. Based on a review of the literature, preliminary viewing of a subset of tapes and extensive consultation by the researchers with teachers of medical interviewing and communication skills, the investigators developed a set of preliminary codes that would be used to categorize interviewer utterances recorded during the interviews. The codes were: questions (e.g. 'When did the symptoms start?'), summary statements ('I see these symptoms are linked to your anxiety.'), empathic comments ('This must be hard on you.'), articulated transitions ('Now I am going to switch gears and ask you about your family.') and information giving ('Panic attacks are often mistaken for heart trouble.'). As well, every time the interviewer introduced a new subject (changed the topic), the change was noted. Using a small sample of the interviews, four of the investigators independently coded one tape of clinicians at each level of expertise. Codes were compared and discussed where there was disagreement, and the coding system was refined slightly. From the outset there was a high level of agreement (>95%) regarding the coding of utterances. Two investigators then examined two further tapes independently. There was complete agreement of codes. Having established a high degree of agreement among raters, for consistency one member of the team (NM) viewed all of the tapes and coded every utterance made by the interviewers. Utterances of each type were counted, converted to frequency per minute and summed at 5-minute intervals (5, 10 and 15 min of elapsed time). An additional summation was done at 2 min to look for evidence of the early pattern recognition associated with experts.
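As a rough illustration of this tallying step, the sketch below counts coded utterances up to each cumulative checkpoint and converts them to per-minute rates. The code labels follow the study's categories, but the data structures, function name and example values are assumptions for illustration only, not the instrument used in the study.

```python
from collections import Counter

# Utterance categories used in the study, plus changes of topic
CODES = {"question", "summary", "empathy", "transition", "information", "topic_change"}

# Cumulative checkpoints (minutes of elapsed time) at which counts were summed
CHECKPOINTS_MIN = [2, 5, 10, 15]

def tally_interview(utterances):
    """Count coded utterances up to each checkpoint and convert to per-minute rates.

    `utterances` is a list of (elapsed_seconds, code) pairs for one interview.
    Returns {checkpoint_minute: {code: frequency_per_minute}}.
    """
    rates = {}
    for minute in CHECKPOINTS_MIN:
        counts = Counter(code for t, code in utterances if t <= minute * 60)
        rates[minute] = {code: counts.get(code, 0) / minute for code in CODES}
    return rates

# Toy example: two questions and one empathic comment in the first two minutes
example = [(15, "question"), (40, "empathy"), (95, "question")]
print(tally_interview(example)[2])
```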

Results

The first striking finding was that every one of the interviewers talked a great deal. The total number of utterances in each interview exceeded 90 for all groups, a rate of one utterance every 8 s (7.73 utterances per minute). Most of these were questions, the proportion of which dropped as the level of expertise increased. While the pattern was in the expected direction (76% of the utterances made by clerks were questions, 74% for residents and only 67% for experts), a one-way between-subjects ANOVA comparing the three groups was not significant (P = 0.07). Similarly, while changes of topic were also in the expected direction (novices changed the subject most, at 48% of utterances, compared with 44% for residents and 40% for experts), again the one-way between-subjects ANOVA was not significant (P = 0.27). Figure 1 shows the total number of utterances, questions and changes of topic for each group.

Figure 2 shows the low frequency utterances – empathic comments, summary statements, information giving and articulated transitions. We were surprised to find that these utterances accounted for almost a third of each interview on average.
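For readers wishing to reproduce this kind of between-group comparison, a minimal sketch using SciPy's one-way ANOVA is shown below. The per-subject proportions are placeholder values, not the study data, and the group sizes are shortened for brevity.

```python
from scipy.stats import f_oneway

# Hypothetical per-subject proportions of utterances that were questions
# (placeholders only; the study reported group means of 76%, 74% and 67%).
clerks    = [0.80, 0.74, 0.77, 0.73]
residents = [0.76, 0.71, 0.75, 0.72]
experts   = [0.70, 0.65, 0.68, 0.64]

# One-way between-subjects ANOVA across the three expertise groups
f_stat, p_value = f_oneway(clerks, residents, experts)
print(f"F = {f_stat:.2f}, P = {p_value:.3f}")
```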

Figures 3–7 illustrate the pattern of different types of utterances over time. The distribution of type of utterance across time appeared to show a characteristic pattern for each level of expertise that was reinforced by reviewing the actual videotapes. For example, clerks appeared to ask disconnected questions throughout and to give information prematurely. Residents asked more connected questions but ended by lecturing patients with general information. Experts used a mixture of questions and summary statements and then ended by giving information relevant to the specific patient.

Figure 1 Total utterances, changes of topic, questions and low frequency utterances among clerks, residents and experts.

Figure 2 Low frequency utterances among clerks, residents and experts.

Figure 3 Number of questions per minute asked by group over 4 time intervals.

Figure 4 Number of empathic comments per minute by group over 4 time intervals.

For each of the five forms of articulation (questions, summary statements, information giving, empathic comments and articulated transitions) and for changes of topic, we looked statistically for differences between the three groups in their patterns over time. Using a two-way mixed ANOVA with level of expertise as a between-subjects factor and time as a within-subjects factor, with each of the five forms of utterance and changes of topic as consecutive dependent variables, none of the analyses was significant save one (questions: P = 0.09, articulated transitions: P = 0.11, information giving: P = 0.33, empathic comments: P = 0.56 and changes in topic: P = 0.51). For summary statements there was a significant difference between groups over time (P = 0.04).
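A minimal sketch of this two-way mixed design (expertise as the between-subjects factor, time interval as the within-subjects factor) is given below using the pingouin package. The long-format table, column names and values are fabricated purely to show the shape of the analysis, not the study data.

```python
import pandas as pd
import pingouin as pg

# Long-format table: one row per subject per time interval, holding the
# per-minute rate of one utterance type (e.g. summary statements).
# All values are placeholders.
df = pd.DataFrame({
    "subject":  [1] * 4 + [2] * 4 + [3] * 4 + [4] * 4,
    "group":    ["clerk"] * 8 + ["expert"] * 8,
    "interval": ["0-2", "2-5", "5-10", "10-15"] * 4,
    "rate":     [0.6, 0.4, 0.2, 0.1,  0.5, 0.3, 0.2, 0.1,
                 0.1, 0.2, 0.3, 0.6,  0.0, 0.2, 0.4, 0.5],
})

# Mixed ANOVA: between-subjects factor = group, within-subjects factor = interval
aov = pg.mixed_anova(data=df, dv="rate", within="interval",
                     subject="subject", between="group")
print(aov[["Source", "F", "p-unc"]])
```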

Figure 5 Number of summary statements per minute by group over 4 time intervals.

Figure 6 Number of information giving statements per minute by group over 4 time intervals.

Figure 7 Number of articulated transitions per minute by group over 4 time intervals.

Discussion

OSCEs have a central role in ensuring clinical competence across the spectrum from first year medical school to licensure; however, as with all new technologies, increasing experience with OSCEs has demonstrated the limits of their application. The original checklists, which have served well for novices, are not suitable at higher levels of expertise. For examinations that involve more experienced clinicians, checklists must evolve or, in some cases, make way for newer measures that are sensitive to the behaviours that characterize experienced clinicians. While global ratings have shown promising psychometric properties and seem able to capture complex, non-binary human behaviours such as communication or technical skills, they are not sufficiently tied to a structured understanding of the characteristics of medical expertise. Thus, there is a need for new OSCE measures that focus on these characteristics.

This study was a very preliminary exploration of the characteristics of interviews conducted by clinicians at different levels. It was limited in several important ways. First, it was impossible to blind the coders of the videotapes to the level of experience of the interviewers, and it is possible that this affected the coding process. However, by keeping the codes at a very basic level (question vs. information vs. empathic comment) and stabilising the coding system in advance, there was limited room for subjective interpretation. Second, although subjects were not provided with information about the domain of interviewing prior to the encounters, both situations were psychiatric. It is possible that the way in which clinicians performed in psychiatry scenarios might not generalise to other clinical domains. Third, the experiment was clearly conducted in an artificial environment. Patients were simulated, the setting was unfamiliar and there were observers and a video recorder. Thus we can speculate that the clinicians might not have behaved exactly as they would in their own setting. Nevertheless, given all these limitations, we feel there may be some important findings to glean from this study, albeit cautiously and with the understanding that they should be replicated widely.

First, we found clinicians at all levels asking at least 60 questions per 15-minute encounter. Even the longest OSCE checklists contain only about 20–25 items, leaving over half of the questions unscored. And although the prediction that the total number of questions would decline with more experience was consistent with the pattern seen in the data, the difference was not significant.

Second, in addition to questions, the average interviewer made about 30–40 other types of utterances. Utterances such as information giving, empathic comments and summary statements, which occupied as much as a third of the total interview, would not receive credit on a typical OSCE checklist.

Third, it seemed that it was among the low frequency utterances that some of the most important differences between novice and expert resided. Look, for example, at the use of summary statements, the utterance type with a pattern that varied significantly over time (Fig. 5). Experts used summary statements at the end of an encounter to draw together all the strands of the encounter, to articulate their understanding and to ensure that the patient understood their thinking. Novices, on the other hand, appeared to summarise for a very different reason. Their use of summary statements early in the interview may have functioned as a stalling device to help them think up more questions. However, while it appeared that the nature and sequence of utterances might be important variables in identifying expertise, the patterns we saw on the videotapes for the most part did not reach statistical significance. The relatively low number of interviews we examined at each level of expertise may explain why intergroup differences appear only as statistical trends. Perhaps future studies employing a larger number of subjects will reproduce these findings more robustly.

How can we use these results to develop new measures? A simple first step might be to incorporate the findings into the behavioural anchors of global ratings. On a communication scale, for example, the descriptor 'uses summary statements early in interview as a means of pausing to think' might be put at one end and 'uses summary statements at the end of the interview to articulate an understanding of the problem' at the other. In fact, global ratings might be changed conceptually to range from novice to expert, rather than from poor to excellent or 1–5 as many do currently.
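A small sketch of how such novice-to-expert behavioural anchors might be represented on an electronic rating form follows. The endpoint descriptors are taken from the text above, while the intermediate anchor, field names and scale range are invented for illustration.

```python
from dataclasses import dataclass
from typing import Dict

@dataclass
class GlobalRatingScale:
    dimension: str
    anchors: Dict[int, str]  # score -> behavioural descriptor

summary_statement_use = GlobalRatingScale(
    dimension="Use of summary statements",
    anchors={
        1: "Uses summary statements early in interview as a means of pausing to think",
        3: "Summarises intermittently, with partial links between findings",  # invented midpoint
        5: "Uses summary statements at the end of the interview to articulate "
           "an understanding of the problem",
    },
)

# An examiner would select the anchor that best matches the observed behaviour
print(summary_statement_use.anchors[5])
```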

Another approach might be to capture the actual utterances in a more elaborate fashion. Perhaps checklists could be modified to allow the recording of the sequence in which the questions were asked. Or perhaps a method could be developed for counting the number of times the subject was changed, or the degree to which one question related to the last. On the basis of this work we might conclude that the labour intensive process of coding individual utterances is unlikely to be warranted in terms of discriminatory power. However, we remain firmly convinced of the need to elucidate the relationship between utterances made by interviewers and the ratings of competence assigned by observers. Such research can contribute meaningfully to understanding the validity of various rating schemes, be they atomistic (checklists) or holistic (global ratings).
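To make the sequence-sensitive checklist idea concrete, the sketch below records the order in which credited items are asked and counts changes of topic alongside the traditional binary total. The function, data structures and example are hypothetical, not an instrument used in the study.

```python
def score_with_sequence(observed_items, checklist):
    """Return a binary checklist total plus simple sequence-sensitive measures.

    `observed_items` is the ordered list of (topic, item) pairs actually asked;
    `checklist` is the set of credited items. Illustrative only.
    """
    covered = [item for _, item in observed_items if item in checklist]
    total = len(set(covered))                 # traditional binary checklist score
    topic_changes = sum(                      # count every switch of topic
        1 for prev, cur in zip(observed_items, observed_items[1:])
        if cur[0] != prev[0]
    )
    return {"checklist_total": total, "item_order": covered, "topic_changes": topic_changes}

# Toy usage: four questions spanning two topics, with one return to the first topic
asked = [("symptoms", "onset"), ("symptoms", "duration"),
         ("family", "history"), ("symptoms", "triggers")]
print(score_with_sequence(asked, {"onset", "duration", "triggers", "history"}))
```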

We are left with the fact that, at the present time, global ratings appear to be the most reliable measures of competence that are also sensitive to increasing levels of expertise. Nevertheless, we anticipate widespread experimentation with different types of measures as OSCE researchers worldwide rise to the challenge of continuing to improve and refine this highly successful educational tool.

Acknowledgements

The authors wish to acknowledge the support of the Centre for Research in Education of the University of Toronto at the University Health Network.

Contributors

All authors contributed to the research design and conceptualization of the coding scheme. NM coordinated the collection of data and the coding of utterances. GR performed the statistical analyses. BH wrote the manuscript with significant input from all authors.

Funding

This research was supported by an educational research grant from the Medical Council of Canada.

References

1 Reznick R, Regehr G, Yee G, Rothman A, Blackmore D, Dauphinee D. Process rating forms versus task specific checklists in an OSCE for medical licensure. Academic Med 1998;73:S97–9.

2 Regehr G, MacRae H, Reznick R, Szalay D. Comparing the psychometric properties of checklists and global rating scales for assessing performance on an OSCE-format examination. Academic Med 1998;73:993–7.

3 Regehr G, Freeman R, Hodges B, Russell L. Assessing the generalizability of OSCE measures across content domains. Academic Med 1999;74:47–9.

4 Hodges B, Herold McIlroy J. Primary trait OSCE global ratings are sensitive to level of training. Abstract presented at the International Ottawa Conference, Cape Town, South Africa, 2000.

5 Dreyfus HL, Dreyfus SE. Mind Over Machine. New York: Free Press; 1986.

6 Elstein A, Shulman L, Sprafka S. Medical Problem Solving: an Analysis of Clinical Reasoning. Cambridge, MA: Harvard University Press; 1978.

7 Leaper DJ, Gill PW, Staniland JR, Horrocks JC, de Dombal FT. Clinical diagnostic process: an analysis. BMJ 1973;3:569–74.

8 Schmidt HG, Norman GR, Boshuizen E. A cognitive perspective on medical expertise: theory and implications. Academic Med 1990;65:611–21.

9 Charlin B, Tardif J, Boshuizen PA. Scripts and medical diagnostic knowledge: theory and applications for clinical reasoning instruction and research. Academic Med 2000;72:182–90.

10 Hodges B, Regehr G, Hanson M, McNaughton N. Validation of an objective structured clinical examination in psychiatry. Academic Med 1998;73:910–2.

11 Hodges B, Regehr G, McNaughton N, Tiberius R, Hanson M. Checklists do not capture increasing levels of expertise. Academic Med 1999;74:1129–34.

Received 27 April 2001; editorial comments to authors 28 August 2001; accepted for publication 1 November 2001.