
Development and Validation of Behaviorally-Anchored Rating Scales for Student Evaluation of Pharmacy Instruction1

Paul G. Grussing College of Pharmacy, M/C 871, The University of Illinois at Chicago, 833 South Wood Street, Chicago IL 60612

Robert J. Valuck Department of Pharmacy Administration, The University of Illinois at Chicago, Chicago IL

Reed G. Williams Department of Medical Education, The University of Illinois at Chicago, Chicago IL

The study purpose was to improve pharmacy instruction by identifying dimensions of teaching unique to pharmacy education and developing reliable and valid rating scales for student evaluation of instruction. Error-producing problems in the use of student ratings of instruction, existing rating methods and dimensions of effective teaching are reported. Rationale is provided for development of Behaviorally-Anchored Rating Scales, BARS, and the methods used are described. In a national study, 4,300 descriptions of pharmacy teaching were collected in nine critical incident writing workshops at four types of schools. Ten dimensions of pharmacy teaching were identified and validated for classroom, laboratory and experiential teaching. Scales were developed for each dimension. Measures of scale quality are described, including retranslation data, standard deviations of effectiveness ratings, reliability and validity data, and data supporting reduction of leniency and central tendency effects. Four outcomes of the project are discussed, emphasizing two: use of the newly-validated dimensions to modify traditional numerically-anchored scales in local use, and use of BARS to provide clear and convincing performance feedback to pharmacy instructors.

INTRODUCTION AND PURPOSE

From among the traditional faculty roles of teaching, research and service, this study investigated only the evaluation of teaching. Teaching performance may be evaluated using multiple data sources: (i) documented self-evaluation and course improvement; (ii) peer review of instructional methods, instructor-written texts or manuals, and other developed media, syllabi and tests; (iii) gains in student learning; (iv) student ratings of instructor performance; (v) observation or videotaping; and (vi) teaching awards (1,2). This study focused on only one data source: student evaluation of faculty performance. Its purpose was to improve the quality of instruction in U.S. colleges of pharmacy by identifying dimensions of pharmacy instruction and developing new, reliable and valid student measures of effective pharmacy teaching2. Such measures of instructional performance, whether utilized in instructor self-assessment, for periodic performance reviews or in the critical promotion and tenure process, are essential for the continued development of effective teachers. If pharmacy students and instructors are to have confidence in instructional rating systems and to eventually benefit from the rating process, clear dimensions of effective teaching should be identified and rating errors minimized. Problems with the content validity of student ratings of instructor performance introduce rating error when instruments are not sensitive to the unique differences in lecture, laboratory and experiential instruction. Moreover, when instructor rating instruments are developed for use across university colleges and departments or disciplines, without having been validated for use in rating pharmacy instruction in particular, additional questions of validity and rating error arise.

Error in Instructor Ratings

Reduction of measurement error is imperative in evaluation of faculty teaching performance. Eight kinds of error in the administration and use of instructional performance rating scales prompted this study. The research and development methods chosen were intended to minimize most of these common sources of rating error, especially the first five listed: (i) error in instrument content; (ii) error in the interpretation of the meaning of ratings (3-5); (iii) showmanship(6-8); (iv) common rating error effects such as “halo effect”(9), “reverse halo effect”(10), “leniency effect” and “harshness (or strictness) effect”(11), and “central tendency effect”(12,13); (v) error in instrument reliability; (vi) mixed purposes of evaluation3,4(14,15); (vii) inconsistent methods of instrument administration(16-19); and (viii) errors in data implementation(20,21).

1 The research was supported, in part, by a GAPS grant from the SmithKline Beecham Foundation through the American Association of Colleges of Pharmacy.

2 The term “dimension”, as used in this article, refers to an axis, or continuum, along which performance descriptors, varying in quality or intensity, may be ordered. The dimension is identified and shown to be independent and non-overlapping in meaning with other clusters of similar behaviors.

3 Formative evaluation refers to evaluation of a process or product to provide feedback for the purpose of making possible mid-process refinements or improvements.

4 Summative evaluation is conducted to examine the quality or impact of a final, completed process or product.


Table I. Dimensions of teaching selected from the education literature

Literature sources (first author only): (1) Dickinson, see reference 8; (2) Wotruba, see reference 23; (3) ICES, see reference 25; (4) Centra, see reference 1; (5) Das, see reference 28; (6) Hildebrand, see reference 7.

Tentative study dimensions (listed by final dimension letters and order), with the closely related labels reported by these sources:
D. Course organization (subject and course management, outlining, course structure, objectives, organization and clarity)
A. Teaching ability (teaching methods; speaking, interpreting and clarification; lecturing ability; teaching style; application)
F. Grading and feedback (testing as learning experience, fairness in testing, grading and exams, examinations, objective evaluation and feedback)
G. Student-instructor interaction (student-faculty interaction; flexible attitudes toward students; class climate, warmth, rapport; sensitivity and concern; individual availability, responsiveness, accessibility)
H. Workload, course difficulty (workload, work difficulty, course requirements, course difficulty)
I. Enthusiasm/motivation (enthusiasm, stimulates thought, encourages thinking, dynamism)
J. Knowledge of subject area (competence, knowledge of subject)


Procedures for minimizing the first five types of rating errors were sought. Emphasis was placed on selecting or developing procedures and instruments to rate the most appropriate pharmacy teaching behaviors and to rate them accurately and consistently.

Study Goals

Four goals were set for the study. First, the project would identify dimensions of instructional behavior unique to pharmacy education and to three teaching environments: classroom, laboratory and experiential. Faculty colleagues have reported the belief that effective pharmacy teaching is different from good teaching in other departments and disciplines, and that it varies from one pharmacy teaching environment to another. The researchers sought to apply a method, other than factor analysis, to identify and describe dimensions of pharmacy teaching. The second goal was to develop Behaviorally-Anchored Rating Scales, BARS, for each dimension and teaching environment. Third, the researchers intended to demonstrate concurrent validation of the scales developed, by showing correlations with a known reliable and valid, traditional numerically-anchored scale of parallel content. Finally, the project was designed to demonstrate generalizability of the scales for use in all U.S. colleges of pharmacy.

METHODS

Nine study steps were elaborated to achieve the project goals. First, the study began with identification of tentative dimensions of pharmacy teaching. This initial validation step would be based on the literature. The second step was to select the most appropriate scaling method. The literature supporting this selection decision is described. The third step was to conduct critical incident workshops for the collection of descriptors of effective and ineffective teaching in U.S. colleges of pharmacy. Editing and selection of collected incidents was the fourth step. The fifth was to establish and validate dimensions of pharmacy teaching using the retranslation process to demonstrate independence of the dimensions5 (22). Simultaneously, the sixth step of obtaining effectiveness ratings for incidents from study panelists would provide data for establishing scale anchors. The seventh step was to develop scales by selection of meaningful behavioral anchors based on the retranslation process and high rater agreement on the scale anchors. A concurrent validation study would constitute the eighth step, for which traditional, numerically-anchored scales, parallel in content, would be developed. The final step was accomplished through the concurrent validation study, yielding a useful parallel set of traditional, content-parallel numerically-anchored scales.

Identification of Tentative Dimensions

Tentative dimensions of pharmacy instruction were identified and validated based on a review of the pertinent literature. Tables I and II display dimensions mentioned in studies and review articles outside and within pharmacy education. The tentative dimensions so identified were later used for preliminary classification of student- and faculty-generated critical incidents of pharmacy teaching.

5 The Smith and Kendall retranslation process uses an independent group of expert raters who reallocate descriptors of performance to dimensions describing performance qualities. It is analogous to the procedures used by language translators to ensure that all of the meanings of an original text are preserved. Text material is translated into a foreign language, then retranslated to the original by an independent expert.


Table II. Teaching dimensions in the pharmacy education literature, 1975-90

Citations, American Journal of Pharmaceutical Education (first author only): (1) Carlson, Vol. 39, pp. 446-448; (2) Zanowiak, Vol. 39, pp. 450-552; (3) Jacoby, Vol. 40, pp. 8-13 (original research); (4) Sauter, Vol. 40, pp. 165-166; (5) Purohit, Vol. 41, pp. 317-325; (6) Kotzan, Vol. 42, pp. 114-118 (original research; also Vol. 40, pp. 3-7); (7) Peterson, Vol. 44, pp. 428-430; (8) Martin, Vol. 47, pp. 102-107; (9) Downs, Vol. 50, pp. 193-195.

Dimensions (listed by frequency of mention, final dimension letters and order), with the number of the nine citations mentioning each:
A. Teaching ability (8)
G. Student-instructor interaction (8)
I. Enthusiasm/motivation (7)
J. Knowledge of subject area (5)
D. Course organization (6)
F. Grading and feedback (4)
H. Workload, course difficulty (3)

Education, General. Seven dimensions of effective instruction were reported often in the education and psychology literature. Table I summarizes the most frequently mentioned dimensions of teaching in original studies or reviews. In their article describing the development of a teacher rating instrument, Wotruba and Wright reviewed 21 published studies of student evaluation of teaching(23). Of the 40 criteria they listed, the nine most frequently mentioned were also cited in a text chapter on uses and limitations of student ratings(24). Seven are shown in Table I. The text author also summarized dimensions of teaching behavior as identified in factor-analytic studies, four of which are reported in Table I. Brandenberg et al. described development and validation of scales for student evaluation of teaching(25). Their work yielded a comprehensive evaluation system available at the researchers’ school(26). Instructors may select traditional, numerically-anchored items from a “catalog” of over 400 items classified by teaching dimensions. Items designed for use in summative evaluation are normed by instructor rank and by required/elective status. Items designed for instructor’s formative self-evaluation are not so normed. Hildebrand et al. asked faculty and students to provide descriptions, in observable and behavioral terms, of the “best” and “worst” teaching they had experienced(27). Responses were factor-analyzed into five clusters (dimensions) of teaching performance. In a Canadian study of teaching in the behavioral sciences, Das et al. identified seven dimensions of teaching, and developed BARS for student evaluation of instruction(28). In equivalent forms comparisons using traditional rating instruments, they reported the BARS to be at least as psychometrically sound in terms of reliability, inter-rater variability and content validity. Dickinson and Zellinger identified six teaching dimensions for veterinary medicine instructors(29).

Education, Pharmacy. After review of the education and psychology literature, evidence of criteria for effective pharmacy teaching was sought. Ten articles from the pharmacy education literature, which described or mentioned dimensions of teaching, are summarized in Table II. Three of the articles reported research on pharmacy instruction. Based on deficiencies in the use of rating instruments designed for use in faculty performance evaluation generally, Kotzan and Mikael and Kotzan and Entreken developed and implemented a factor-analyzed instrument for student evaluation of undergraduate pharmacy instruction(30,31). Jacoby described how modification of an existing instrument for use in student evaluation of pharmacy teaching contributed to improved classroom instruction(32). Based on this research, an instructional consulting service was initiated to provide feedback to faculty. Purohit et al. explored issues of student evaluation of instruction(33). Citing Hildebrand, the authors discussed “components” of effective teaching as perceived by their colleagues and by students(34). Sauter and Walker, in reporting a theoretical model for pharmacy faculty peer evaluation, mentioned basic components of teaching and learning as requisite elements in such evaluation(35). Two authors reported special needs for evaluation of clinical teaching performance. Martin et al. described clinical faculty evaluation and development programs at one college of pharmacy(36). Downs and Troutman identified criteria for the evaluation of clinical pharmacy teaching(37). Three articles, written as reports or invitational articles, mentioned qualities of good pharmacy teaching. As part of a panel devoted to the evaluation of pharmacy teaching, Carlson suggested a comprehensive evaluation program(38). Citing articles by Kulick, and by Brown, the author emphasized three major dimensions students use to judge their teachers, and discussed major functions in the supervision of students by clinical pharmacists(39,40). Peterson, in an ad hoc committee report, mentioned key features of pharmacy teaching performance(41). Zanowiak, in an invitational article, citing Kiker, mentioned characteristics of an effective pharmacy teacher(42,43). Some of the authors cited in Tables I and II described additional kinds of dimensions and behaviors not shown in the tables. First, problems in attaching meaning to labels assigned to factors in earlier studies could introduce bias in the generation of unobservable behaviors in this study. Specific behaviors might not, in the retranslation process chosen for this study, be assigned to the same dimension as suggested by the factor name coined by authors of previous scales and instruments. A second type of dimension not listed in the tables was based on the notion of self-rated student accomplishment. Behaviors associated with this named factor seemed unlikely to be collected and used as scale anchors in this research which would focus on teacher, not student, behaviors.

American Journal of Pharmaceutical Education Vol. 58, Winter Supplement 1994 27

Page 4: Development and Validation of Behaviorally-Anchored Rating Scales

Finally, some instruments contained items describing environmental conditions and curricular features which were beyond the control of a single instructor-ratee. Such behaviors were not expected to result from this study which would use critical incidents describing observable instructor behaviors only.

Choice of Scaling Method

The Behaviorally-Anchored Rating Scale, BARS, was chosen for development in this study because of its unique measurement properties. First, it relies on critical incidents which may be classified into dimensions of behavior shown to be unique and independent of each other in their meaning. Second, it consists, for each performance dimension identified, of an array of behavioral statements which range from most effective to least effective. Raters are instructed to read the entire continuum of behaviors and then select the one which most closely describes the actual, or expected, behavior of the ratee. Each statement is accompanied by a number on the scale, one of which is recorded to indicate the ratee’s performance on that particular dimension.
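As a minimal illustration of this structure (not taken from the study instruments; the dimension name, anchor wording and scale values below are hypothetical), a single BARS dimension can be represented as an ordered set of behaviorally anchored scale points:

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Anchor:
    scale_point: float   # position on the effectiveness continuum (here 1-15, as in this study)
    behavior: str        # critical-incident text serving as the behavioral anchor

@dataclass
class BarsDimension:
    name: str
    anchors: List[Anchor]   # ordered from least to most effective

    def rate(self, chosen_index: int) -> float:
        # The rater reads the whole continuum, picks the anchor closest to the
        # observed (or expected) behavior, and records its scale point.
        return self.anchors[chosen_index].scale_point

# Hypothetical example; the anchor sentences are illustrative only.
course_org = BarsDimension(
    name="Course Organization",
    anchors=[
        Anchor(2.2, "Instructor frequently departs from the syllabus without explanation."),
        Anchor(8.1, "Instructor generally follows the course outline but rarely states objectives."),
        Anchor(13.7, "Instructor distributes clear objectives and follows the outline throughout the course."),
    ],
)

print(course_org.rate(2))   # 13.7 is recorded for this dimension
```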

A review of the literature supports the choice. First, the development of BARS scales for rating performance of pharmacy practitioners and student externs, and as a criterion measure in prediction studies, was previously reported(44,45). Second, BARS have been used for evaluation of performance in a wide variety of other professions and occupations(46,47). They are claimed by some researchers to demonstrate more reliability and validity than numerically-anchored scales because the behaviors serving as scale anchors are clear, unambiguous statements of ratee performance(48-50). This clarity is supported by following the Flanagan critical incident technique in scale item generation, instead of having experts write general descriptors of performance along a continuum, or of using a traditional numerically-anchored scale only(51). Several studies examined the psychometric properties of BARS vs. numerically-anchored scales. Comparable reliability and validity were observed and reported(52-56). However, one early article compared BARS with traditional scales for leniency error and inter-rater agreement and found more favorable scale properties for traditional scales(57). A third reason for selecting the BARS scale type is that the vivid behavioral descriptions used are easy for raters to associate with ratee performance, and are very compelling in pointing out where the ratee may benefit from introspection and performance improvement(58-60). Finally, BARS scales for student evaluation of college instruction have been reported.

Six cited studies have demonstrated the presence of independent dimensions of teaching performance and the feasibility of generating “scaleable” behaviors in construction of reliable and valid rating instruments. Harari and Zedick identified nine dimensions of teaching behavior and developed corresponding BARS scales for evaluating teaching ability of college psychology professors(61). They found that when faculty and student ratings of the quality of instructor behavior were correlated, “1.0 or near-1.0 relationships” were found. Das et al. identified performance dimensions associated with teaching behavioral science courses, developed BARS scales based on the dimensions, and then compared scale properties with parallel versions in a numerically-anchored scale format(62). Dickinson and Zellinger compared veterinary medicine students’ ratings of their instructors using a BARS scale and a “mixed standard scale” in which items were scrambled so that both the dimensions and the ordinal relationships among scale anchors were disguised(8). Green and Sauser, followed by Champion and Green, developed BARS scales for use by psychology students in rating their instructors, and reported comparisons of the scale properties(63,64). Horn et al. compared properties of a BARS scale with a well-established, content-parallel numerically-anchored scale(58). Business undergraduate students rated their instructors twice and the study compared the effect of mid-course feedback for both types of scales.

Distinctions are made between BARS and other behaviorally-based rating scales. Seeking improvements over traditional graphic, numerically-anchored scales, panels of experts may write broad behavioral descriptions for use as scale anchors. Examples may be found in adaptations of the goal-attainment scaling process(65). Two examples occur in the pharmacy literature describing methods of rating pharmacy residents, pharmacy students, and pharmacy employees generally(66,67). Such instruments enjoy advantages in ease and cost of development. Their disadvantages include showing less evidence of reliability and validity. Economy in development has also been described in connection with so-called “short-cut” BARS(64).

Critical Incident Collection

Sampling. This study was based upon critical incidents of teaching behavior in a variety of pharmacy teaching environments. In order to ensure generalizability of scales for use across all school types, study schools selected as sources for generating critical incidents were classified and selected using four strata: (i) BS- or PharmD-conferring; (ii) public or private ownership; (iii) geographic location (East or West); and (iv) high vs. low graduate-education emphasis6-8(68). Each of these variables was believed to contribute to the instructional culture of the colleges, possibly impacting upon the methods, styles and quality of teaching.

A three-stage sampling procedure was used, combining systematic and random sampling. In stage one, based on the four strata, names of all U.S. schools were randomly assigned to one of the appropriate 16 cells. Two sets of four cells were systematically eliminated because they did not contain schools in all cells. Then one set of the two remaining complete sets was randomly chosen. In stage two, schools were randomly selected from within each of four cells in the selected stage one set. Each of the four schools then represented each stratum and each stratum was represented by two schools.

The third stage of sampling occurred when research collaborators at the four selected schools, following researcher guidelines, arranged for representative types of volunteer students and faculty to attend local critical incident writing workshops. The local collaborators were requested to secure broad representation of differing educational levels in a group of 30 undergraduate professional students and from all disciplines in a group of 15 faculty representing classroom, laboratory and experiential teaching.

6 Private vs. public ownership of U.S. colleges of pharmacy was confirmed in personal correspondence with Mary Bassler, American Association of Colleges of Pharmacy, May 30, 1990.

7 High vs. low graduate-education emphasis was defined as schools above or below the median number (22) of Ph.D. students enrolled in U.S. college of pharmacy graduate programs in 1989.

8 Eastern schools were defined as those located in AACP-NABP Districts I-IV; Western in Districts V-VIII.


Critical Incident Workshops. The researchers’ school was used as a pilot site for training in the conducting of item writing workshops. Use of forms and presentation of the writing tasks in a clear, standardized and reliable manner among researchers was checked during the pilot. Incidents collected were reviewed for their quality and content relative to the tentative dimensions of instruction. No modifications in forms or procedures were made after the pilot administration.

Separate workshops for students and for faculty panelists were then conducted at each study school. Panelists were asked to think about effective and ineffective teaching incidents they had actually experienced or observed in the classroom, laboratory or experiential teaching site. Using forms provided, they were asked to write brief “stories” about each incident they could recall, describing the situation and specifically what the instructor said or did. Panelists were reminded that the incidents were to have been personally experienced and described as observable behaviors. Positive feedback was provided to the groups, based on selected good examples of clear, vivid and unidimensional incidents written. Near the end of the workshops, panelists were given a list of seven tentative dimensions of pharmacy teaching to prompt them to recall and write additional incidents. See Tables II and V. In addition to critical incident writing, students were invited to complete a learning styles inventory(69-71). As an incentive, student-participants were promised written feedback on their learning styles and suggestions for adaptation to differing teaching styles and formats. As a second incentive, student participants were invited to a post-workshop luncheon.

After the workshops, students received a letter including a report providing feedback on their learning style and suggestions for adaptation to differing teaching styles and formats. They were encouraged to visit with the local research collaborators for additional information about how to apply their own learning styles. Productivity in generation of incidents by 138 critical incident writers was broadly-based among schools, student educational levels and all faculty disciplines. Students at the pilot plus study schools wrote 3,098 incidents (72 percent, school mean = 620, SD = 239). Faculty members from all disciplines wrote 1,202 incidents (28 percent, school mean = 242, SD = 78). The mean numbers of incidents written per student and faculty were 22 (SD = 4.8) and 23 (SD = 4.8), respectively.

Editing and Selection of Incidents

New Dimensions. During review of the critical incidents, and their classification into seven tentative dimensions, three additional, distinct clusters of sufficient numbers of scaleable incidents were observed. The new tentative dimensions of pharmacy teaching were: (i) “Selection and use of media;” (ii) “Teaching ability—laboratory;” and (iii) “Teaching ability—experiential.” While the selection and use of effective media was frequently subsumed under the effective teaching dimension in previous studies, the media selection and development incidents collected in this study suggested independence from behaviors describing instructor lecture performance. Moreover, a wide variety of incidents reflecting effective, mediocre and ineffective media selection and use were observed. It also became apparent that incidents relating to choice of media might confound ratings based on incidents describing instructor behavior in the classroom.

Similar incident generation outcomes occurred for the second and third new tentative dimensions. Clusters of incidents describing both laboratory and experiential instruction were observed. The numbers and kinds of incidents were sufficiently rich and varied to enable a useful number of potentially-scaleable items to be used in the retranslation process9.

Incident Selection Criteria

Ten criteria were applied in the selection and editing of incidents for the retranslation process.

1. The incidents must have described instructional behaviors, not environmental ones beyond the control of the instructor (e.g., “This instructor’s lecture room is not air-conditioned.”)

2. Behaviors must have been observable in the classroom, laboratory or experiential teaching site. Opinions or vague general descriptors of teaching “attributes” were not used, nor were student attitudes or moralistic statements based on students’ belief systems.

3. Each incident must have been clear, unambiguous and unidimensional in its meaning.

4. Frequency of mention was a primary selection factor, demonstrating importance of behaviors cited by multiple panelists and occurring across several types of colleges of pharmacy.

5. Only behaviors which related to one of the teaching dimensions were included.

6. Only descriptors of instruction in the professional curriculum were included, not exclusively pre-pharmacy teaching behaviors.

7. Educational jargon or school-specific terminology was avoided.

8. Behaviors describing unusual instructor leniency or lack of rigor were avoided because some students might perceive them as evidence of poor teaching while others might rate them highly because of perceptions that “easy” behaviors are associated with effective teaching.

9. Incidents describing unprofessional conduct, obviously unethical or criminal activity were also eliminated. Space for low scale anchors was reserved for ineffective, yet frequently-occurring incidents, not for obviously uncommon and aberrant behaviors.

10. Finally, incidents were reviewed to ensure brevity, a uniform format and unidimensional behavioral style.

Importance, Retranslation and Effectiveness Ratings

The incidents were retranslated and rated following the process developed by Smith and Kendall(22) and previously reported in the pharmacy literature(72). Because of the large (N = 402) number of incidents retained after the editing process, it was necessary to divide them proportionally into two booklets to be sent to two separate retranslation groups. Each workbook contained approximately an equal number of incidents representing each dimension. Incidents were selected by the researchers to represent high, medium and low quality instructional behaviors.
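A rough sketch of this kind of proportional split follows; the incident pool below is fabricated, and only the stratified-by-dimension division mirrors the procedure described:

```python
import random
from collections import defaultdict

random.seed(1)

# Fabricated pool standing in for the 402 edited incidents: (incident_id, dimension) pairs.
incidents = [(i, f"Dimension {chr(65 + i % 10)}") for i in range(402)]

by_dimension = defaultdict(list)
for incident_id, dimension in incidents:
    by_dimension[dimension].append((incident_id, dimension))

booklet_1, booklet_2 = [], []
for dimension, items in by_dimension.items():
    random.shuffle(items)
    half = len(items) // 2
    booklet_1.extend(items[:half])    # each booklet receives roughly half of every dimension's incidents
    booklet_2.extend(items[half:])

print(len(booklet_1), len(booklet_2))   # approximately equal booklet sizes, balanced by dimension
```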

Care was taken to ensure that student raters were “upperclassmen” with exposure to instruction at all curricular levels. To enhance this, 40 students from the final professional year at the researchers’ school were added to increase the total student retranslator/rater pool to 106.

9 The final newly-identified tentative dimension, “Teaching Ability—Experiential,” includes behaviors common to several kinds of community and institutional experiential instruction, not clinical instruction alone.


This also enabled the critical incidents to be retranslated/rated by students with and without incident-writing experience. Retranslation booklets were mailed with a letter of explanation. After 10 working days a postcard reminder was sent to non-respondents. Ten working days later another letter and retranslation booklet were sent to the remaining non-respondents. Fifty-seven students responded, for a 54 percent response rate.

Student raters were given four tasks—two to validate the importance of dimensions to be identified, one to retranslate the incidents into dimensions, and one to assign an effectiveness rating to each incident. First, the importance of each dimension was determined by asking students to study the dimension descriptions in Table V, and then to rate their importance on a seven-point scale. The second task asked students to divide and assign a total of 100 points to the 10 dimensions. These first two tasks were designed to validate the dimensions by showing their relative importance to students. If dimensions would not be valued as being important, they might not be selected for inclusion in the final rating scales. Importance ratings could also be used to assign weights to each dimension in the calculation of an overall teaching performance score.
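One way such importance weights could enter an overall score is sketched below; the point allocations and the ratee's dimension ratings are invented for illustration and are not the study's values:

```python
# Hypothetical mean 100-point allocations per dimension (dimension names follow Table V).
importance_points = {
    "Teaching Ability-Lecture": 15, "Knowledge of Subject Area": 14, "Course Organization": 11,
    "Student Performance Evaluation": 11, "Student-Instructor Interaction": 11,
    "Enthusiasm/Motivation": 10, "Teaching Ability-Laboratory": 8,
    "Teaching Ability-Experiential": 8, "Selection and Use of Media": 6,
    "Workload/Course Difficulty": 6,
}

# Hypothetical BARS ratings (1-15) for one instructor on each dimension.
bars_ratings = {dimension: 9.0 for dimension in importance_points}
bars_ratings["Enthusiasm/Motivation"] = 12.5

total_points = sum(importance_points.values())
overall = sum(importance_points[d] / total_points * bars_ratings[d] for d in importance_points)
print(round(overall, 2))   # importance-weighted overall teaching performance score
```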

The third task, the retranslation step, involved assigning each incident listed to one of the 10 tentative dimensions. This procedure was intended to show the independence of the dimensions. If respondents would not agree that incidents were descriptive of their respective dimensions, the incidents would not be useable as scale anchors. A standard of 80 percent agreement on assignment of incidents to dimensions was used to retain the item for scaling.

The final task was to mark, for each incident, an effectiveness rating on a 15-point scale, with 15 being the highest (most effective) teaching performance. The purpose was to obtain mean ratings with sufficiently low standard deviations to enable their use as scale anchors. Guidelines furnished to raters for using the 15-point effectiveness scale have been previously reported(73).

Scale Construction

Incidents were retained as scale anchors in the respective scales if at least 80 percent of the participants agreed on assignment to the dimension and if the standard deviation about each mean scale point was 2.0 or less.10 After incidents were sorted based on these criteria, a group of 11 critical incidents with standard deviations of less than 2.0, but with respondent assignment of less than 80 percent agreement and assignment divided equally between two dimensions, was reviewed. A panel of five faculty members was asked to review these incidents to determine their suitability as behavioral descriptors for both dimensions. The group assigned the incidents to the dimension for which they felt the best description was provided. After this review, these incidents were added to the scales only if the behavior was different than a behavior at the same, or near to the same, scale point. This process yielded an additional eight incidents as useable scale anchors.
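A minimal sketch of these two retention criteria, applied to one hypothetical incident, follows; the rater data are fabricated, and only the 80 percent agreement and 2.0 standard deviation cutoffs come from the procedure above:

```python
import statistics

def retain_as_anchor(dimension_assignments, effectiveness_ratings,
                     target_dimension, min_agreement=0.80, max_sd=2.0):
    """Return True if the incident meets both retention criteria for the target dimension."""
    agreement = dimension_assignments.count(target_dimension) / len(dimension_assignments)
    sd = statistics.stdev(effectiveness_ratings)
    return agreement >= min_agreement and sd <= max_sd

# Fabricated example: 57 retranslation raters, most assigning the incident to one dimension,
# and a subset of their 1-15 effectiveness ratings for that incident.
assignments = ["Course Organization"] * 49 + ["Teaching Ability-Lecture"] * 8
ratings = [13, 14, 12, 13, 15, 14, 13, 12, 14, 13]

print(retain_as_anchor(assignments, ratings, "Course Organization"))   # True: about 86% agreement, SD below 2.0
```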

Importance ratings were considered in scale construction. Students responded by indicating that all scales were “important” or “very important,” mean = 5.77. Based on the 100-point forced distribution, no dimension received lower than a 5-point rating or more than approximately 15 points. Responses to the two ratings correlated highly, Rho = 0.93.

The two dimensions rated most important were “Teaching Ability—Lecture” and “Knowledge of Subject Area.” “Selection and Use of Media” and “Workload/Course Difficulty” were considered by students to be the least important. Although statistically significant differences were shown between dimension importance ratings, practical review of both of these ratings suggested that all 10 dimensions are generally considered to be important by students and that none should be eliminated from the final set of rating scales.
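Agreement between the two importance tasks can be checked with a rank-order correlation, as sketched here with fabricated per-dimension values (the reported Rho of 0.93 is the study's result, not this example's):

```python
from scipy import stats

# Fabricated mean values per dimension for the two importance tasks (same dimension order in both lists).
seven_point_means = [6.4, 6.3, 5.9, 5.8, 5.8, 5.6, 5.4, 5.3, 5.1, 5.0]   # 1-7 importance ratings
hundred_point_means = [15, 14, 11, 11, 11, 10, 8, 8, 6, 6]               # 100-point allocations

rho, p_value = stats.spearmanr(seven_point_means, hundred_point_means)
print(round(rho, 2), round(p_value, 4))
```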

After the scales were constructed using incident ratings with the greatest rater agreement, a possible source of variance in ratings was examined. Using data from the learning style inventories administered to all student incident writers who also participated in the retranslation process, four basic learning styles were identified(69). Kolb has labeled these styles as “Converger”, “Diverger”, “Assimilator,” and “Accommodator”. Application of the styles to pharmacy students’ and pharmacists’ learning has been described by Garvey et al. and Riley(70,71). To determine if learning style differences had an impact on overall respondents’ ratings, a grand mean of all scale anchor points for all 10 scales was computed. Using one-way analysis of variance, no significant rating differences among the four learning style groups were found, F(3,50) = 0.34, P = 0.80. To determine if learning style differences related to mean scale ratings for individual dimensions, ten one-way analyses of variance were conducted and none yielded significant differences between learning style groups at P = 0.05. The mean P value for these tests was 0.80, with values ranging from P = 0.22 to P = 0.99. Differences in learning style among respondent-raters did not relate to their ratings which established scale anchor points.
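A sketch of this check follows; the group data are fabricated, and only the one-way ANOVA across the four Kolb style groups mirrors the analysis described:

```python
from scipy import stats

# Each value is one rater's grand mean over all scale anchor points (fabricated data).
converger    = [8.9, 9.2, 8.7, 9.4, 9.0]
diverger     = [9.1, 8.8, 9.3, 9.0, 8.9]
assimilator  = [9.0, 9.2, 8.6, 9.1, 9.3]
accommodator = [8.8, 9.0, 9.2, 8.9, 9.1]

f_stat, p_value = stats.f_oneway(converger, diverger, assimilator, accommodator)
print(round(f_stat, 2), round(p_value, 2))   # a non-significant result mirrors the reported F(3,50) = 0.34, P = 0.80
```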

The final scales are typical of BARS scales generally in terms of distribution of anchors at the high and low ranges of the scales. Critical incident generation is “easier” for descriptions of extremely ideal or unsatisfactory behaviors, less productive for generation of “average” incidents of professional behavior. The researchers elected not to select items with standard deviations greater than 2.0 in order to provide more mid-scale range anchors.

10 Sixty percent agreement is frequently cited as a selection criterion. In addition, some studies report that incidents with greater variances are selected for mid-scale anchors in scale construction.

RESULTS

The project results are reported first in terms of measures of scale quality, before describing the products developed: validated dimensions and scales.

Measures of BARS Quality

Reliability. Measures of inter-rater agreement and of stability were conducted using a limited number of volunteer faculty. Inter-rater reliability is reported in Table IV, based on one lecturer’s performance using a sample of pairs of ratings. Test-retest reliability was based on two administrations of three selected BARS scales in one class, with a five-week time interval. See Table IV. The notion of stability of BARS scales is a useful, but not necessary, condition for demonstration of BARS scale reliability. Historical effects in the students’ experiencing of instruction are expected during the length of a course. Results show significant test-retest correlations for two of the three scales, and a correlated means “t” test showed significant changes (lower ratings over time) for two of the three scales.
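The two reliability checks can be sketched as follows; the rating pairs are fabricated, and the statistics (a Pearson correlation for each check plus a correlated-means t test for the retest shift) follow the description above:

```python
from scipy import stats

# Inter-rater: fabricated pairs of students rating the same lecturer on one BARS dimension.
rater_a = [12, 10, 13, 9, 11, 12, 10, 13]
rater_b = [11, 10, 12, 10, 12, 11, 9, 13]
r_interrater, _ = stats.pearsonr(rater_a, rater_b)

# Test-retest: fabricated ratings by the same students on the same dimension five weeks apart.
time1 = [12, 11, 13, 10, 12, 11, 12, 13]
time2 = [11, 10, 12, 10, 11, 10, 11, 12]
r_stability, _ = stats.pearsonr(time1, time2)
t_change, p_change = stats.ttest_rel(time1, time2)   # correlated-means "t" test for a shift over time

print(round(r_interrater, 2), round(r_stability, 2), round(t_change, 2), round(p_change, 3))
```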


Table III. Numerically-anchored scale(a)

Name of the instructor being rated:
PLEASE COMPLETE THE FOLLOWING RATINGS SHOWING YOUR EVALUATION ON THE FIVE-POINT SCALES BELOW.
EXAMPLE: The Minnesota Twins will win the World Series in 1993. Agree _ _ _ √ _ Disagree (1 2 3 4 5)

1. The course objectives were: very clear _ _ _ _ _ very unclear
2. The instructor stated clearly what was expected of students: almost always _ _ _ _ _ almost never
3. Did the instructor make good use of examples and illustrations? Yes, very often _ _ _ _ _ No, seldom
4. It was easy to hear and understand the instructor. Almost always _ _ _ _ _ Almost never
5. The instructor summarized material presented in each class. Almost always _ _ _ _ _ Almost never
6. The instructor’s clinical demonstrations were clear and concise. Strongly agree _ _ _ _ _ Strongly disagree
7. The grading procedures for the course were: very fair _ _ _ _ _ very unfair
8. Was the grading system for the course explained? Yes, very well _ _ _ _ _ No, not at all
9. The amount of graded feedback given to me during the course was: quite adequate _ _ _ _ _ not enough
10. Were exam questions worded clearly? Yes, very clear _ _ _ _ _ No, very unclear
11. How well did examination questions reflect content and emphasis of the course? Well related _ _ _ _ _ Poorly related
12. The instructor was sensitive to student needs. Almost always _ _ _ _ _ Almost never
13. The instructor listened attentively to what class members had to say. Always _ _ _ _ _ Seldom
14. How accessible was the instructor for student conferences about the course? Available regularly _ _ _ _ _ Never available
15. The instructor promoted an atmosphere conducive to work and learning. Strongly agree _ _ _ _ _ Strongly disagree
16. The instructor attempted to cover too much material. Strongly agree _ _ _ _ _ Strongly disagree
17. The instructor was a dynamic teacher. Yes, very dynamic _ _ _ _ _ No, very dull
18. The instructor motivated me to do my best work. Almost always _ _ _ _ _ Almost never
19. The instructor stimulated my intellectual curiosity. Almost always _ _ _ _ _ Almost never
20. The instructor’s knowledge of the subject was: excellent _ _ _ _ _ poor
21. How would you characterize the instructor’s command of the subject? Broad and accurate _ _ _ _ _ Plainly deficient
22. The instructor seemed well prepared for classes. Yes, always _ _ _ _ _ No, seldom

a Item source: Instructor Course Evaluation System, ICES, University of Illinois, Champaign-Urbana.

Concurrent Validity. BARS ratings were correlated with corresponding numerically-anchored scales constructed with selected items from the catalog of items available from the university’s Office of Instructional Resources. The researchers first identified all catalog items which related to the content described in the ten dimensions. Then, 31 items were selected which most closely matched the behaviors described in the dimensions tested. The final 22-item numerically-anchored scale appears in Table III and its construction in relationship to the ten dimensions is described in Table IV. A numerically-anchored media scale was not constructed because the two lecturers did not use media other than assigned readings and the chalkboard.
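A sketch of this kind of check for one dimension follows; the student ratings are fabricated, and the example simply illustrates averaging a block of numerically-anchored items into a subscale score before correlating it with the BARS rating:

```python
import numpy as np
from scipy import stats

# Fabricated data for eight students rating the same instructor.
# Columns: four numerically-anchored interaction items (1-5 scale, patterned on Table III items 12-15).
ices_items = np.array([
    [4, 5, 4, 4],
    [3, 3, 4, 3],
    [5, 5, 5, 4],
    [2, 3, 2, 3],
    [4, 4, 5, 4],
    [3, 4, 3, 3],
    [5, 4, 5, 5],
    [4, 4, 4, 3],
])
ices_subscale = ices_items.mean(axis=1)              # one subscale score per student

bars_interaction = [12, 9, 14, 7, 12, 10, 14, 11]    # the same students' BARS ratings (1-15)

r, p = stats.pearsonr(ices_subscale, bars_interaction)
print(round(r, 2), round(p, 3))                      # concurrent validity coefficient for this dimension
```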

Selected scales were administered to two groups of students at the researchers’ school, one which rated two lecturers and one which rated clerkship instructors. Two senior faculty members, with courses in lecture format, volunteered for the study and signed releases which were included with the written rating instructions provided to students. Raters included all students in attendance at one third-professional-year lecture. After team-taught courses and instructors unwilling to volunteer were eliminated, the two available participants received very high ratings with low variance and ranges of ratings. Ratings for the experiential rotations were obtained by asking volunteer students from the final professional year and recent alumni who were new, first-year members of the clinical faculty to rate their “second preceptor.” This procedure provided sufficient numbers of raters and eliminated ratings of “first-preceptor” student-faculty relationships. Because rotations were systematically scheduled, it also ensured representativeness from the wide variety of required clerkships offered. Faculty members responsible for laboratory instruction declined participation. They noted that items and scale anchors involving quality of laboratory instruments could cause low ratings which, if not kept confidential, might adversely affect their departmental and college-wide performance reviews. All but two of the correlations are positive and significant.

Scale Properties and Error Reduction

BARS and numerically-anchored scales were compared for scale properties contributing to leniency, central tendency and halo effects. Evidence for less leniency effect in the use of BARS was provided by comparing the means for both sets of four selected scales: Evaluation, Interaction, Workload and Teaching. All four BARS means were lower. The mean BARS rating for four scales was 1.13 scale points lower, a statistically significant difference. Although these data may suggest that the BARS produce less leniency in ratings, possibly attributable to their unambiguous scale anchors, it is not clear which scale best represents a “true” rating of instructor performance.
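One way such a comparison can be made is sketched below; the per-rater scale means are fabricated, and the linear rescaling of the 5-point numerical ratings onto the 15-point BARS range is an assumption introduced here only so the two formats share a metric:

```python
from scipy import stats

def rescale_5_to_15(x):
    """Map a 1-5 numerical rating onto the 1-15 BARS range (linear, endpoints preserved)."""
    return 3.5 * x - 2.5

# Fabricated per-rater means on one dimension, same raters under both formats.
bars_means = [10.2, 11.0, 9.5, 10.8, 11.4, 9.9, 10.6, 11.1]
numerical_means = [4.1, 4.3, 3.9, 4.2, 4.4, 4.0, 4.2, 4.3]          # 1-5 scale
numerical_rescaled = [rescale_5_to_15(x) for x in numerical_means]  # now on the 1-15 range

t_stat, p_value = stats.ttest_rel(numerical_rescaled, bars_means)
mean_difference = sum(numerical_rescaled) / len(numerical_rescaled) - sum(bars_means) / len(bars_means)
print(round(mean_difference, 2), round(t_stat, 2), round(p_value, 3))   # positive difference suggests less lenient BARS
```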

Comparison of two scale properties suggests that BARS have produced less central tendency rating effect. The variance in ratings was greater for all BARS scales. Moreover,


Table IV. Reliabilities and concurrent validation of BARS(a) using parallel ICES(b) scales and items

Columns for each dimension(c): numerically-anchored (ICES) scale statistics (N of items; scale item numbers; reliability(d)); ratee(e); BARS reliabilities (inter-rater(f); test-retest(g)); correlation statistics (r(h); N).

A. Teaching ability—lecture: 3 ICES items (3-5). Ratee 1: reliability 0.63, r 0.45, N 87. Ratee 2: reliability 0.78, inter-rater 0.31, test-retest 0.63, r 0.43, N 89.
C. Teaching ability—experiential: 1 ICES item (6). Ratee 3: r 0.64, N 30.
D. Course organization: 2 ICES items (1-2). Ratee 2: reliability 0.83, r 0.45, N 90.
F. Student performance evaluation: 5 ICES items (7-11). Ratee 1: reliability 0.66, r 0.32, N 86. Ratee 2: reliability 0.72, inter-rater 0.26ns, test-retest 0.35, r 0.28, N 89. Ratee 3: reliability 0.82, r 0.72, N 30.
G. Student-instructor interaction: 4 ICES items (12-15). Ratee 1: reliability 0.78, r 0.50, N 82. Ratee 2: reliability 0.77, inter-rater 0.43, test-retest 0.18ns, r 0.51, N 87. Ratee 3: reliability 0.96, r 0.90, N 30.
H. Workload/course difficulty(j): 1 ICES item (16). Ratee 2: inter-rater 0.19ns, r 0.24(i), N 80. Ratee 3: r 0.14ns, N 30.
I. Enthusiasm/motivation: 3 ICES items (17-19). Ratee 1: reliability 0.85, r 0.33, N 86. Ratee 2: reliability 0.83, inter-rater 0.39, r 0.47, N 89.
J. Knowledge of Subject Area: 3 ICES items (20-22). Ratee 1: reliability 0.75, r 0.19ns, N 86. Ratee 2: reliability 0.80, r 0.49, N 90.
Total ICES items: 22(k).

a Behaviorally-anchored rating scales. b Instructor and Course Evaluation System, University of Illinois. c Two scales not tested: B, Teaching Ability—Laboratory, and E, Selection and Use of Media. d Cronbach’s “alpha”. e Ratees 1 and 2 are lecturers (N > 80 raters) and ratees no. 3 are experiential preceptors (N = 30 raters). f Based on 60 randomly-selected pairs of ratings. g Correlations between 2 measures at 5-week intervals, N = 50 pairs. h Showing concurrent validity, BARS and ICES. f-h All correlations positive and significant at P < 0.01, except “i” where P = 0.05; preceptor ratees for the Workload/Course Difficulty item, lecturer ratee no. 1 for the Knowledge of Subject Area scale, and 3 BARS scales are non-significant (ns). j Low scale reliability, single item used. k Nine items deleted from original scale to develop reliable subscales.

comparison of the modal ratings for both sets of all four scales shows that the BARS yielded modal ratings which were farther from their respective scale mid-points than their adjusted numerical scale counterparts—total differences of 14.9 vs. 8.3 scale points, respectively.

Halo effect was compared by examining correlations of measures with each other within BARS and within numerical scale types. If scales show a low inter-correlation their independence is demonstrated, suggesting that raters are less apt to allow performance in one area to affect their ratings in another. Evidence for lower halo effect for BARS was not found. The mean intercorrelation for all four BARS was 0.71, SD = 0.10, and for the numerical scales, 0.58, SD = 0.19.
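A sketch of this computation follows; the rating matrix is fabricated, and the statistic is simply the mean of all pairwise correlations among four dimension scores:

```python
from itertools import combinations
import numpy as np

# Fabricated ratings: rows are raters, columns are the four dimension scores for one instructor.
scores = np.array([
    [12, 10, 11, 13],
    [ 9,  8, 10, 11],
    [13, 12, 12, 14],
    [10, 11,  9, 12],
    [11, 10, 12, 12],
    [ 8,  9,  9, 10],
], dtype=float)

pairwise = [np.corrcoef(scores[:, i], scores[:, j])[0, 1]
            for i, j in combinations(range(scores.shape[1]), 2)]
print(round(float(np.mean(pairwise)), 2))   # lower mean intercorrelation suggests more independent, less halo-prone scales
```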

Project Products

Dimensions. Ten independent dimensions of pharmacy teaching were identified. They are described in Table V and include three previously-unreported new dimensions: “Selection and Use of Media,” “Teaching Ability—Laboratory,” and “Teaching Ability—Experiential.” Three scales are environment-specific: “Teaching Ability—Lecture,” “Teaching Ability—Laboratory,” and “Teaching Ability—Experiential.” The other seven scales apply to all three teaching environments. By combining the scales as the table suggests, either seven or eight dimensions of teaching may be measured in three pharmacy teaching environments. Laboratory instruction might also include evaluation of selection and use of media. Three sample scales appear in the Appendix.

BARS Scales. A total of 134 critical incidents “survived” the retranslation and effectiveness rating process, with a range of 10 to 19 incidents used as anchors per scale. The process and results are summarized in Table VI. The mean percentage agreement on assignment of incidents to dimensions, 79.6 percent, nearly met the 80 percent retranslation goal. The mean standard deviation of 1.76 scale points illustrates strong student rater agreement on the level of effectiveness of each scale point.


Table V. Pharmacy instruction dimensions(a)

A. Teaching Ability—Lecture(b): Audible and clear speaking; interpretation and explanation of concepts; use of examples and illustrations; emphasis and summary of main points; effective use of chalkboard.
B. Teaching Ability—Laboratory: Availability of equipment, reagents and ingredients; demonstration before performance; supervision; safety; sufficient time and access; concise, useful reporting.
C. Teaching Ability—Experiential: Demonstration and supervision of learning experiences; professional and patient communications; practice.
D. Course Organization(b): Clarity of scheduling; detail of content outline; clarity of learning objectives, assignments and student expectations; following the course outline and objectives.
E. Selection and Use of Media: Effective use of slides, overheads, videos, texts, handouts, models.
F. Student Performance Evaluation(b) (lecture, laboratory and experiential): Relationship to course content/objectives; clear, unambiguous questions and assignments; explanation of method, content, administration; feedback to students; fair, objective grading; application, not rote memory.
G. Student-Instructor Interaction(b): Availability for consultation; responses to student difficulties; conveying a helpful and supportive attitude; concern about student learning; sensitivity to students’ needs; interest in student outcomes; availability for help after class; listening to student questions and concerns; initiatives to help students; atmosphere conducive to learning.
H. Workload/Course Difficulty(b): Scope of content; length and difficulty of assignments; coverage of content; reasonable due dates and project deadlines.
I. Enthusiasm/Motivation(b): Dynamic in presentation of subject; stimulation of student thought and interest; motivation of students to do their best work.
J. Knowledge of Subject Area(b): Well-prepared; competent in field; knows limits of expertise.

a Classroom teaching evaluated on dimensions A, D-J; laboratory teaching evaluated on dimensions B, D-J; experiential teaching evaluated on dimensions C, D, F-J. b Tentative dimensions identified at onset of study; dimensions B, C and E added on basis of critical incidents surviving the retranslation/rating process.

DISCUSSION

The project yielded four major outcomes: (i) validated dimensions of teaching performance for use in development or revision of traditional scales; (ii) reliable and valid numerically-anchored I.C.E.S. scales; (iii) reliable and valid BARS for administration; and (iv) BARS for use in faculty development.

Utility of the Dimensions in Local Scale Development or Revision. The kind and quality of instruments in current use for student rating of pharmacy faculty teaching varies considerably. These BARS and parallel traditional scales are the first to be based on the ten new independent dimensions of teaching performance unique to pharmacy education. For colleges of pharmacy which participate in university-wide rating systems, the project offers guidance for the college to work with the central agency responsible for managing the faculty evaluation program. Existing traditional numerically-anchored items of high quality may be combined into scales for the 10 unique pharmacy teaching dimensions. Such scales may be used to report performance ratings with higher reliability than is possible with a series of individual items. If the central service agency does not offer items to rate performance in all of the new pharmacy teaching dimensions identified, item-writing and validating activities are called for to complete the locally-developed scales. With such revised scales in place, development of local, within-pharmacy norms is possible. For schools not required to participate in university-wide teaching evaluation systems, similar possibilities exist for within-school scale modification and improvement.

Use of Equivalent Form ICES Scales. Concurrent validation of the BARS using specially-constructed numerical scales of parallel content has an additional useful outcome. For schools using the I.C.E.S. system, use of the traditional scales developed for this study, augmented by additional I.C.E.S. items or other items descriptive of the ten dimensions, could provide reliable scale scores based on the dimensions. Reliability studies on such expanded scales are recommended.

Administration of BARS Scales. The expected project outcome of reliable and valid scale development was accomplished and the product is available for use in schools of pharmacy. BARS scales are expensive to develop and maintain. Use and continued research and development of these scales in multiple pharmacy schools would provide additional positive returns on the research and development investment. Care should be taken, however, to systematically select, introduce, administer, and monitor the scales. Use of BARS has been most successful in organizations where persons being rated have had input into the scale development process and where the scales are professionally-administered(74). Each administration should be managed by a human resources expert familiar with development and administration of this type of performance rating scale. Unsupervised scale use by students is not recommended, nor is administration by persons untrained in performance assessment. Potential user schools should utilize a designated testing specialist for BARS scale administration.

Use of BARS in Faculty Development. One of the many characteristics of BARS is that, because of the vivid behaviors they portray, faculty ratees are prone to adopt effective teaching behaviors and to abandon those associated with low scale ratings. This tends to cause a favorable shift in teaching behaviors and an inflation of ratings based on improved faculty performance.


Table VI. Summary statistics, retranslation and effectiveness ratings

             N of useable incidents                     Percent agreement on      Standard deviation,
                                                        relevant dimension        effectiveness ratings
Dimensions   Total items   Mean n/scale   Range/scale   Mean      Range           Mean      Range
10           134           13.4           10-19         79.6      60.6-100        1.76      1.1-2.0
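For readers unfamiliar with the two statistics summarized in Table VI, the sketch below shows, with invented data, how they might be computed for a single candidate critical incident: the percent of judges who assigned the incident to its intended dimension during retranslation, and the standard deviation of its 1-15 effectiveness ratings. It illustrates the meaning of the table columns only and is not the study's analysis code.

```python
# Hypothetical example of the two per-incident statistics behind Table VI:
# retranslation agreement and the spread of effectiveness ratings.
from statistics import stdev

intended_dimension = "D"  # dimension the incident was written to describe
judge_assignments = ["D", "D", "D", "C", "D", "D", "D", "D", "D", "D"]  # retranslation sorting
effectiveness_ratings = [11, 12, 10, 11, 13, 12, 11, 12, 10, 12]        # 1-15 scale

agreement_pct = 100 * judge_assignments.count(intended_dimension) / len(judge_assignments)
rating_sd = stdev(effectiveness_ratings)

print(f"Percent agreement on relevant dimension: {agreement_pct:.1f}%")
print(f"Standard deviation of effectiveness ratings: {rating_sd:.2f}")
```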

This desirable side effect of BARS use suggests that their greatest contribution may be in the provision of highly-effective faculty performance feedback, and not in their reliable and valid performance assessment capabilities alone. The utility of BARS in providing performance feedback is well-established(75). Heartfelt introspection about these unforgiving "snapshots" of what students think of their instructors' teaching could result in re-dedicated commitment to improved teaching.

Study Constraints

Three constraints, two methodological and one philosophical, may have limited the study outcomes. First, the known disadvantages of using study volunteers are evident. Although sufficiently represented to permit rich incident-writing contributions from upperclassmen, a larger number of volunteers from the final professional year could have enhanced the study. Perspectives of additional mature students' writings would have enhanced the pool of incidents. More importantly, participation by a larger proportion of "seniors" from study schools would have enabled their utilization in larger numbers for the retranslation/rating steps, allowing less reliance on senior students from the pilot school.

Second, faculty member commitment from study schools for the purpose of concurrent validation of the scales was not sought at the onset. Instead, volunteers were obtained only from the researchers' school. Only two lecturers, both highly experienced, volunteered, and with limitations on their available class time; this required administration of only part of the scales. Both lecturers received very high ratings on both types of scales, thus narrowing the range of responses. The higher correlations for experiential courses were due, in part, to a much wider variance in ratings than for the two volunteer lecturers.

Third, the factor-analytic basis for classifying teaching behaviors was not challenged in this study. The foundation for scale construction was the commonality of seven factors established and named in previous studies. Because this study stressed observed behaviors, it did not create global descriptors of instructors' "personality." Moreover, the dimensions were not created or edited by students. Perhaps students, not educational researchers, should be asked to fashion a tentative set of dimensions based on the critical incidents, without prompting of previously-named factors or the dimensions identified in this study. It is possible that students have a discerning and reliable way of "knowing" qualities of instruction and may be able to organize and describe instructional qualities more efficiently than researchers who begin with factor-analyzed groupings of teaching behaviors and who insist on working only with descriptions of observed behavior.

Topics for Future Research

Reliability. Ongoing reliability studies are planned. Cooperation of additional volunteer instructors, including those with little teaching experience, would broaden the range of talent being rated. Such studies should be expanded to include all of the dimensions of teaching in all environments, particularly laboratory teaching. The low concurrent validity correlations for two scales require additional study. Low correlations for the Workload item and the Knowledge scale are attributable, in part, to student differences in perceptions. Review of scale development ratings of critical incidents depicting "Workload and course difficulty" showed that some students approach ratings for this dimension in terms of relative "ease" of workload, others in a more normative sense in terms of perceived "appropriateness" of the amount of work assigned. Thus, both types of scales are subject to students' perceptions of appropriate input and effort vs. their own learning styles and willingness to expend effort. Similarly, for student ratings of "Knowledge," students deal with perceptions rather than facts about the instructor's knowledge. Only vivid examples of lack of preparedness in the classroom, as measured by the BARS scale, served to measure this reliably. Reliability studies will also be conducted on expanded versions of the numerically-anchored I.C.E.S. scales.

Research on Learning Styles. When students are made aware of their personal learning styles, accommodations to instructional formats and styles may be made. This study demonstrated that the mean BARS ratings for items selected in scale development were not affected by students' learning styles. Research is continuing on the effect of learning styles on all 402 critical incidents which were subjected to retranslation and effectiveness ratings, especially those which were rejected for scale use because of wide rating variance. Significance of item variance differences between learning style groups, if discovered, may offer insights to instructors for possible instructional style and performance accommodations based on specific observed teaching behaviors.
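One form such a variance comparison could take is sketched below. The learning-style labels follow Kolb's inventory(69) and the ratings are invented; Levene's test is offered only as one conventional check for unequal variances, not as the procedure being used in the continuing analysis.

```python
# Hypothetical sketch: do effectiveness ratings of one critical incident vary
# more within some learning-style groups than others?  Groups follow Kolb's
# labels; the ratings are invented for illustration.
from scipy.stats import levene

ratings_by_style = {
    "Converger":    [12, 13, 12, 11, 13],
    "Diverger":     [8, 14, 10, 15, 9],
    "Assimilator":  [11, 12, 12, 11, 12],
    "Accommodator": [10, 13, 9, 14, 11],
}

statistic, p_value = levene(*ratings_by_style.values())
print(f"Levene statistic = {statistic:.2f}, p = {p_value:.3f}")
```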

Taxonomical Classification of Incidents with Ethical Implications. Numerous items describing substandard professional behavior were eliminated from the scales. A review of the bank of critical incidents for purposes of classification into available taxonomies of ethical behaviors is planned(76,77).

Personal Dimensions. The emphasis in scale construction and use has been on the advantages of unidimensional observed behaviors as scale anchors. This emphasis enabled identification of ten discrete dimensions of teaching performance. It may also be possible to classify additional teaching behaviors based on personal attributes of the instructor. More "trait-like" than the observable performance-based dimensions identified, such clusters may "cut across" many of the ten validated performance dimensions. Such personal dimensions, e.g., "Independence/Assertiveness" and "Handling/Coping with Detail," have been previously reported for BARS describing pharmacy practice behaviors(78).


Identification of such teaching dimensions could supplement these BARS, enhancing the ability to more completely and accurately describe the characteristics of effective teaching.

CONCLUSION

This study has addressed the problem of rater error in several ways. First, in terms of scale content, separate dimensions have been identified and scales have been developed for three pharmacy teaching environments. Scale anchors refer to instructional behaviors only, not to extraneous conditions beyond the instructor's control. Enhanced by broadly-based input, the scales are generalizable and available for use in all types of colleges of pharmacy. Second, global student descriptors of instructors' "personality" have been replaced by measures of two kinds of important, and observable, teaching behaviors: "Student Interaction" and "Enthusiasm/Motivation." Third, problems with rater errors have been reduced. Fourth, approaches to more reliable use of traditional, numerically-anchored scales have been suggested. Finally, however, the greatest impact may be the "mirror" which these BARS have provided into pharmacy teaching styles and behaviors. For better or worse, students, with input from faculty incident writers, have painted their multi-colored picture of the teaching landscape. It is hoped that the need for this vivid painting as a rating instrument may blur and fade. Prompted by faculty review of BARS, the desired outcome of improved teaching could then demand even more sensitive measures and compelling reminders of how the teaching/learning enterprise might continually be enhanced.

Acknowledgments. The assistance of Mikyoung Choi in literature review and of Debra Agard and Trena Magers in data entry, and helpful consultation with and comments by Bruce A. Sevy, Personnel Decisions, Inc., are gratefully acknowledged. Several colleagues collaborated in research at four anonymous colleges of pharmacy, providing essential support in selection of representative faculty and student groups and making logistical arrangements for conducting the critical incident writing workshops.

Am. J. Pharm. Educ., 58, 25-37(1994); received 9/29/93, accepted 1/23/94.

References

(1) Centra, J.A., Determining Faculty Effectiveness, Jossey-Bass, San Francisco CA (1979) pp. 7-11.
(2) Braskamp, L.A., Brandenburg, D.C. and Ory, J.C., Evaluating Teaching Effectiveness, Sage Publications, Newbury Park CA (1984) pp. 29-76.
(3) Centra, J.A. and Creech, F.R., The Relationship between Student, Teachers, and Course Characteristics and Student Ratings of Teacher Effectiveness, Project Report 76-1, Educational Testing Service, Princeton NJ (1976).
(4) Measurement and Research Division, Office of Instructional Resources, "ICES norms," Unpublished Report, University of Illinois, Urbana IL (1977-83).
(5) Op. cit. (2), pp. 48-49.
(6) Ware, J.E. and Williams, R.G., "The Dr. Fox effect: A study of lecture effectiveness and ratings of instruction," J. Med. Educ., 50, 149-156(1975).
(7) Hildebrand, M., Wilson, R.C. and Dienst, E.R., Evaluating University Teaching, Center for Research and Development in Higher Education, Berkeley CA (1971) pp. 18-20.
(8) Dickinson, T.L. and Zellinger, P.M., "A comparison of the behaviorally anchored rating and mixed standard scale formats," J. Appl. Psychol., 65, 147-154(1980).
(9) MacMillan Dictionary of Psychology, (edit. Sutherland, S.) MacMillan, London (1989) p. 183.
(10) Ibid., p. 183.
(11) Encyclopedia of Psychology, Vol. 3, (edit. Corsini, R.) John Wiley and Sons, New York NY (1984) p. 205.
(12) Smith, P.C., "Behaviors, results and organizational effectiveness: The problem of criteria," in Handbook of Industrial Psychology, (edit. Dunnette, M.) Wiley and Sons, New York NY (1983) p. 757.
(13) Op. cit. (11), p. 205.
(14) Op. cit. (2), pp. 25-26, 49-50.
(15) Manning, R.C., The Teacher Evaluation Handbook, Prentice-Hall, Englewood Cliffs NJ (1988) pp. 4-5.
(16) Ibid., pp. 6-9.
(17) Op. cit. (1), p. 43.
(18) Op. cit. (2), p. 51.
(19) Op. cit. (2), pp. 51-52.
(20) Op. cit. (2), p. 80.
(21) Ivancevich, J.M., Human Resource Management, 5th ed., Irwin, Homewood IL (1992) pp. 327-327.
(22) Smith, P.C. and Kendall, L.M., "Retranslation of expectations: An approach to the construction of unambiguous anchors for rating scales," J. Appl. Psychol., 47, 149-155(1963).
(23) Wotruba, T.R. and Wright, P.L., "How to develop a teacher rating instrument: A research approach," J. Higher Educ., 46, 653-663(1975).
(24) Op. cit. (3), pp. 18-19.
(25) Brandenburg, D.C., Braskamp, L.A. and Ory, J.C., "Considerations for an evaluation program of instructional quality," CEDR Quart., 12, 8-12(1979).
(26) ICES Item Catalog, Newsletter No. 1, Office of Instructional Resources, Univ. of Illinois, Champaign-Urbana IL (1977).
(27) Op. cit. (7), pp. 4-13.
(28) Das, H., Frost, P.J. and Barnowe, J.T., "Behaviorally anchored scales for assessing behavioral science teaching," Can. J. Behav. Sci., 11, 79-88(1979).
(29) Op. cit. (8), p. 149.
(30) Kotzan, J.A. and Mikeal, R.L., "A factor-analyzed pharmacy-student evaluation of pharmacy faculty," Am. J. Pharm. Educ., 40, 3-7(1976).
(31) Kotzan, J.A. and Entrekin, D.N., "Development and implementation of a factor-analyzed faculty evaluation instrument for undergraduate pharmacy instruction," ibid., 42, 114-118(1978).
(32) Jacoby, K.E., "Behavioral prescriptions for faculty based on student evaluations of teaching," ibid., 40, 8-13(1976).
(33) Purohit, A.A., Manasse, H.R., Jr. and Nelson, A.A., "Critical issues in teacher and student evaluation," ibid., 41, 317-325(1977).
(34) Op. cit. (7), pp. 16-22.
(35) Sauter, R.C. and Walker, J.D., "A theoretical model for faculty 'peer' evaluation," Am. J. Pharm. Educ., 40, 165-166(1976).
(36) Martin, R.E., Perrier, D. and Trinca, C.E., "A planned program for evaluation and development of clinical pharmacy faculty," ibid., 47, 102-107(1983).
(37) Downs, G.E. and Troutman, W.G., "Faculty evaluation and development issues: Clinical faculty evaluation," ibid., 50, 193-195(1986).
(38) Carlson, P.G., "A panel: The evaluation of teaching in schools and colleges of pharmacy," ibid., 39, 446-448(1975).
(39) Kulik, J.A., "Evaluation of teaching," Memo to the Faculty, 53, 2(1974).
(40) Brown, B.F., Education by Appointment, Parker Publishing, West Nyack NJ (1968).
(41) Peterson, R.V., "Chair report of the AACP Council of Faculties Ad Hoc Committee on Promotion and Tenure," Am. J. Pharm. Educ., 44, 428-430(1980).
(42) Zanowiak, P., "Evaluation of teaching: One faculty member's viewpoint," ibid., 39, 450-452(1975).
(43) Kiker, M., "Characteristics of the effective teacher," Nursing Outlook, 21, 721-723(1973).
(44) Grussing, P.G., Silzer, R.F. and Cyrs, T.E., Jr., "Development of behaviorally-anchored rating scales for pharmacy practice," Am. J. Pharm. Educ., 43, 115-120(1979).
(45) Lipman, A.G. and McMahon, J.D., "Development of program guidelines for community and institutional externships," ibid., 43, 217-222(1979).
(46) Schwab, D.P., Heneman III, H.G. and DeCotiis, T.A., "Behaviorally anchored rating scales: A review of the literature," Personnel Psychol., 28, 549-562(1975).
(47) Kingstrom, P.O. and Bass, A.R., "A critical analysis of studies comparing behaviorally anchored rating scales (BARS) and other rating formats," ibid., 31, 263-289(1981).
(48) Campbell, J.P., Dunnette, M.D., Arvey, R.D. and Hellervik, L.W., "The development and evaluation of behaviorally based rating scales," J. Appl. Psychol., 57, 15-22(1973).

(49) Borman, W.C. and Dunnette, M.D., "Behavior-based versus trait-oriented performance ratings: An empirical study," ibid., 60, 561-565(1975).
(50) Harari, O. and Zedeck, S., "Development of behaviorally anchored scales for the evaluation of faculty teaching," ibid., 58, 261-265(1973).
(51) Flanagan, J.C., "The critical incident technique," Psychol. Bull., 51, 327-357(1954).
(52) Op. cit. (46), pp. 554-555.
(53) Op. cit. (47), pp. 266-273.
(54) Op. cit. (8), pp. 150-153.
(55) Landy, F.J. and Guion, R.M., "Development of scales for the measurement of work motivation," Org. Behav. Human Perform., 5, 93-103(1970).
(56) Jacobs, R., Kafry, D. and Zedeck, S., "Expectations of behaviorally anchored rating scales," Personnel Psychol., 33, 595-610(1980).
(57) Bernardin, H.J., Alvares, K.M. and Cranny, C.J., "A recomparison of behavioral expectation scales to summated scales," J. Appl. Psychol., 61, 564-570(1976).
(58) Hom, P.W., DeNisi, A.S., Kinicki, A.J. and Bannister, B.D., "Effectiveness of performance feedback from behaviorally anchored rating scales," ibid., 67, 568-576(1982).
(59) Blood, M.R., "Spin-offs from behavioral expectation scale procedures," ibid., 59, 513-515(1974).
(60) Zedeck, S., Imparato, N., Krausz, M. and Oleno, T., "Development of behaviorally anchored rating scales as a function of organizational level," ibid., 59, 249-252(1974).
(61) Op. cit. (50), p. 263.
(62) Op. cit. (28).
(63) Green, S.B., Sauser, W.I., Fagg, J.N. and Champion, C.H., "Shortcut methods for deriving behaviorally anchored rating scales," Educ. Psychol. Meas., 41, 761-775(1981).
(64) Champion, C.H., Green, S.B. and Sauser, W.I., "Development and evaluation of shortcut-derived behaviorally anchored rating scales," ibid., 48, 29-41(1988).
(65) Kiresuk, T.J., Smith, A. and Cardillo, J.E., Goal Attainment Scaling: Applications, Theory, and Measurement, Erlbaum, Hillsdale NJ (1993).
(66) Elenbaas, R.M., "Evaluation of students in the clinical setting," Am. J. Pharm. Educ., 40, 410-417(1976).
(67) Nelson, A.A. and Maddox, R.R., "An assessment of the mastery of entry-level practice competencies using a primary care clerkship training model," ibid., 56, 354-363(1992).
(68) Penna, R.P. and Sherman, M.S., "Enrollments in schools and colleges of pharmacy, 1988-1989," ibid., 53, 270-302(1989).
(69) Kolb, D., Learning Style Inventory Interpretation Booklet, McBer and Co., Boston MA (1985).
(70) Garvey, M.G., Bootman, J.L. and McGhan, W.F., "An assessment of learning styles among pharmacy students," Am. J. Pharm. Educ., 48, 134-140(1984).
(71) Riley, D.A., "Learning styles: A comparison of pharmacy students, graduate students, faculty and practitioners," ibid., 51, 33-36(1987).
(72) Op. cit. (44), pp. 116-117.
(73) Op. cit. (44), p. 117.
(74) Op. cit. (58), p. 570.
(75) Op. cit. (58), p. 574.
(76) Counelis, J.S., "Toward empirical studies on university ethics," J. Higher Educ., 64, 84-86(1993).
(77) Fassett, W.E., Doing Right by Students: Professional Ethics for Professors, PhD Dissertation, University of Washington, Seattle WA (1992).
(78) Op. cit. (44), p. 116.

APPENDIX: THREE SAMPLE BARS SCALES FOR STUDENT EVALUATION OF PHARMACY INSTRUCTION

INSTRUCTIONS TO RATER:
1. Carefully read the dimension and supporting examples (in parentheses).
2. Read each performance level on this dimension for your ratee.
3. Consider the typical performance level on this dimension for your ratee. Compare his/her typical performance with each of the performance examples. Circle the scale number (1-15) nearest to the performance example which best shows his/her typical performance in this dimension.
4. Follow the same rating procedure for all 10 dimensions.
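As a hedged illustration of the procedure above, the sketch below represents one dimension as a small set of behaviorally anchored scale values and maps the anchor judged closest to the ratee's typical performance onto the whole scale number (1-15) that would be circled. The abbreviated anchors are drawn from Scale A; the data structure itself is illustrative and not part of the published instrument.

```python
# Illustrative sketch of one BARS dimension as a data structure: each anchor
# pairs a scale value with an observed teaching behavior (abbreviated from
# Scale A).  The rating step simply follows the instructions above.
bars_lecture = {
    "dimension": "A. Teaching Ability - Lecture",
    "anchors": {
        12.3: "Summarized the previous lecture and outlined the present one",
        11.8: "Used newly marketed drug products in therapeutic examples",
        5.5:  "Frequently said 'Aahhh' or 'Ummm' between phrases",
        2.5:  "Erased chalkboard notes before students finished copying them",
    },
}

def circled_rating(scale: dict, chosen_anchor: float) -> int:
    """Return the whole scale number (1-15) nearest the chosen anchor value."""
    if chosen_anchor not in scale["anchors"]:
        raise ValueError("anchor value is not on this scale")
    return round(chosen_anchor)

# A rater judging the instructor's typical behavior closest to the 11.8 anchor
# would circle 12 on the printed scale.
print(circled_rating(bars_lecture, 11.8))  # -> 12
```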

* * *

A. TEACHING ABILITY - LECTURE
(Audible and clear speaking; Interpretation and explanation of concepts; Use of examples and illustrations; Emphasis and summary of main points; Effective use of chalkboard.)

Rating      Performance Example

EXCELLENT
15-
14-
13-
      12.3  At the beginning of each class period, this instructor briefly summarized the previous lecture and outlined the present lecture.
      12.1  This instructor not only described concepts and process, but also rationale supporting them.
12-
      11.9  This instructor began each class period by asking students if they had any questions from the last class period, or This instructor taught several approaches to solving problems, pointing out rationale for each method.
      11.8  When new drug products entered the market, this instructor frequently used them in examples illustrating therapeutic aspects of the active ingredient(s).
11-
      10.5  When lecturing from overhead projections, this instructor looked to the class, paused, asking if there were any questions.
10-
9-
8-
7-
6-
      5.5   This instructor frequently said "Aahhh" or "Ummm" between phrases and sentences.
5-
      4.6   When overhead transparencies were removed before students could complete their notes, this instructor would say "You only need to listen to what I am saying."
      4.3   This instructor used new scientific and professional terms freely, assuming that students already knew them.
4-
      3.7   This instructor did not speak clearly, saying "sorption" and not conveying whether adsorption or absorption was meant.
3-
      2.8   This instructor lectured "over the heads" of the level of intellect of the students.
      2.6   This instructor did not enunciate clearly, mumbling through lectures, or When students would ask this instructor to please repeat a point made in lecture, the instructor would say "Get it from your neighbor," and continue lecturing.
      2.5   This instructor wrote notes on the chalkboard faster than students could comprehend and record them, then erased the notes before students completed taking them down.
2-
1-
POOR


D. COURSE ORGANIZATION
(Clarity of scheduling; Detail of content outline; Clarity of learning objectives, assignments and student expectations; Following the course outline and objectives.)

Rating      Performance Example

EXCELLENT
15-
14-
13-
12-
      11.3  This instructor's course syllabus contained helpful suggestions on how to take notes, study for exams, and general expectations for student performance.
11-
      11.0  This instructor reviewed learning objectives before each examination.
      10.4  This clinical preceptor told students "up front" what was expected and followed through with learning situations.
10-
      9.5   This instructor provided students with written exam, term project and grading policies.
9-
8-
7-
6-
      5.7   This instructor's course included content which was duplicative of previously taught prerequisite "material".
5-
      4.1   This instructor wrote a special text for the course, but did not make it available until the third week of the term.
4-
      3.8   When this instructor divided a class into recitation sections, the content was not standardized between sections.
      3.7   After arriving late for conferences, this clinical preceptor would spend additional time to collect materials and get organized.
      3.6   This instructor coordinated a team-taught course in which lecturers had no idea of what other lecturers were teaching.
      3.3   This instructor frequently delayed lecture ten minutes while returning to his/her office for forgotten lecture notes.
      3.1   This instructor never had sufficient copies of handouts on the first day of class.
3-
      2.9   This instructor frequently arrived late to lecture and then would run overtime with lecture.
      2.8   This instructor began the course without a syllabus, saying that he would work it up as the term progressed.
      2.7   After arriving late to class, this instructor would ask "What are we supposed to lecture about today?"
      2.2   Unknown to college administration and students, this instructor arranged for a T.A. to teach the entire course.
      2.0   This instructor distributed his/her syllabus two weeks before the end of instruction.
2-
1-
POOR

F. STUDENT PERFORMANCE EVALUATION
(Lecture, Laboratory, and Experiential: Relationship to course content/objectives; Clear, unambiguous questions and assignments; Explanation of method, content, administration; Feedback to students; Fair, objective grading; Application, not rote memory.)

Rating      Performance Example

EXCELLENT
15-
14-
13-
      13.0  After exams, this instructor made examinations available via computer where students could see the correct answers, answers missed, plus helpful comments on each question.
      12.2  This instructor provided practice quizzes on computer terminals.
12-
      11.5  During the next lecture after an exam, this instructor reviewed the questions most frequently missed by students.
      11.4  This preceptor conducted weekly performance feedback sessions with all externs.
11-
      10.9  This preceptor's constructive feedback included reasons for needed improvement as well as positive outcomes of things the students did well.
      10.6  This instructor encouraged students to submit term papers early so that feedback could be provided enabling revision before the due date.
10-
      9.9   This clinical preceptor's exams were patient-oriented in case format.
9-
8-
7-
6-
5-
      5.0   This lab instructor based grades on results and not on explanations of process used to obtain results.
      4.5   This instructor did not proofread exams and made corrections on the chalkboard only after students detected errors during the exam.
      3.9   This instructor provided only one description of how grades would be computed: "totally bell curve."
      3.5   This clinical preceptor was unable to document, with specific student performance behaviors, reasons for the grade assigned.
      2.9   This instructor, named "trivial pursuit" by the class, tested on facts which were least emphasized in class.
      2.8   This instructor's exams were so long that it was impossible to complete them in the time allowed.
      2.6   This instructor administered multiple-choice exams containing not less than twelve responses per question.
      2.5   This preceptor did not give student performance feedback, even if asked.
      2.2   This clinical preceptor refused to give students their final rotation evaluation until they turned in their evaluation of the preceptor first.
      2.1   This instructor did not return midterm exams until one day before the final.
2-
      2.0   This instructor had a policy of not assigning "A" grades, saying "No one is perfect."
1-
POOR
