
Jonathan Arnett
ENGL 5365 – Quantitative Methods
Final Paper

Investigations toward a Method of Analyzing Instructor Grading Trends in First Year Composition

Introduction

In every first-year composition program that employs graduate students as teachers, Writing Program Administrators (WPAs) face the annual challenge of orienting their new instructors to the composition program’s goals and procedures, not the least part of which involves grading. Aside from providing pedagogical theory and course development information — and especially the classroom management tips and techniques that the new instructors crave — a WPA has to train the new instructors how to grade student papers, norm their grading standards to the program’s standards, and provide refresher training on grading for experienced instructors (Qualley, 2002). This cycle of training and retraining helps both the new and experienced graduate instructors provide consistent, reliable grades, but it obscures rather than addresses the interrelated issues of how graduate instructors develop over time, and at what rate.

These general issues are fundamental to an understanding of how educators develop professionally, and they are partially addressed in educational literature. For example, according to Coulter (2000), novice graders in first-year composition tend to follow scoring rubrics very closely, while experienced graders tend to use more internal guides to grading. Saunders & Davis (1998) state that instructors’ understanding of grading criteria is dependent on their discussions about the criteria, and this understanding — and hence their use of the criteria — changes with staff turnover. Qualley (2002) discusses the “perpetual cycle of teacher preparation” (p. 279) faced by a first year composition program that depends entirely on Master’s degree students as instructors. Hipple & Bartholomew (1982) discuss instructor-orientation strategies that problematize the process of grading student papers. Wyatt-Smith & Castleton (2005) discuss the relative impacts of official educational materials and practices and contextually-based, personal criteria on the grading practices of Australian fifth-grade teachers. Flores (2003) examines elementary school teachers and reports that the second year of teaching is when new teachers turn their attention from teacher/student relationships and classroom management to the assessment of student essays. Campbell & Evans (2000) review student teachers’ lesson plans and determine “that during student teaching, preservice teachers do not follow many of the assessment practices recommended in their coursework” (p. 350).

Although these kinds of articles address the state and development of instructors’ grading practices, they deal in generalities, do not address the specific evolution of any particular teacher’s grading patterns, and often address elementary education, which is not closely related to first year composition instruction. A search of the literature revealed just one department-wide study of English instructors’ grading practices in first year composition courses — a comparison of the final grades assigned by Rhetoric 100, 101, and 102 instructors during the 1951-52 academic year at the University of Illinois’ Chicago campus (Thompson, 1955). In it, the author demonstrates that a) several instructors in each course provide abnormally high or low final grades and b) several instructors in each course disperse their classes’ final grades over unusually broad or narrow ranges. These conclusions are unlikely to surprise an experienced first year composition instructor, but they make clear the need for a more in-depth examination of grading practices across the different sections of first year composition programs.

Challenges

An examination of grading practices in first year composition, however, is complicated by the difficulties inherent in evaluating any writing course’s student products. As Branthwaite, Trueman, & Berrisford (1981) point out, “As it is normally carried out, marking essays is a very private and intuitive procedure” (p. 42). Some first-year composition programs, including those at SUNY-Stony Brook and the University of Cincinnati, have attempted to counter this problem with portfolios. Many authors, such as Belanoff & Elbow (1991) and Durst, Roemer, & Schultz (1994), have described the positive effects of the portfolio system; however, portfolio grading contains its own problems, such as creating conflict between members of grading groups, fostering a sense of disempowerment and lowering morale among less experienced graders, and generating deep, program-wide resistance on the part of the graders (Coulter, 2000). In addition, many first year writing programs do not use portfolios at all; they employ a traditional model in which instructors grade their own students’ work, a practice that has not been widely addressed in the literature.

The majority of the literature regarding individual graders’ marking of individual student essays is on high-stakes testing of the sort conducted by the National Center for Education Statistics (NCES) (White, Smith, & Vanneman, 2000), colleges and universities that use testing to place students in courses (Barritt, Stock, & Clark, 1986; Weigle, 1998), professional organizations that use tests for accreditation purposes (O’Neill & Lunz, 1997), and, more recently, Educational Testing Service (ETS), the entity that develops and administers the SAT. These organizations have developed strict protocols for accurately and efficiently grading large numbers of essay tests, but their high-stress, time-limited procedures (Engelhard, 1992; Gyagenda & Engelhard, 1998; MacMillan, 2000; McQueen & Congdon, 1997) are far removed from the (typically) steady, semester-long effort of grading student papers. For example, NCES has conducted the National Assessment of Educational Progress (NAEP) tests every year since 1969. The 2000 iteration was expected to generate “close to 10 million constructed responses” (White, Smith, & Vanneman, 2000, p. 3) involving some degree of writing, which were to be graded by approximately 150 scorers for the mathematics portion, 175 scorers for the science portion, and 50 for the writing portion. Clearly, although the methods developed by NCES and ETS are valuable for their intended purpose of evaluating large numbers of documents with accuracy, analyses of their graders’ marking patterns do not shed light on the typical situation faced by a first year composition instructor.

Another problem, but one that is not addressed specifically in the literature, is the dearth of grades to analyze. College registrars maintain records of student final grades, from which an investigator could conceivably assemble sets of final grades given by individual instructors, but final grades, as is clear from the Thompson article, are not deeply illuminating for two reasons. First, final grades only come out once or twice a year, and in the case of graduate students who instruct first year composition courses for only a handful of semesters before leaving, the data set left behind is not sufficient for in-depth analysis. Second, final grades are cumulative figures that often contain extraneous factors, such as extra credit and class participation, and thus may mask significant variations in instructors’ everyday grading habits. Alternatively, a researcher may find it possible to obtain detailed grade records from individual instructors, but it is questionable whether the effort of gathering the data would be cost-effective: data sets would likely be incomplete — teachers throw away old grade books; computer disks are erased, damaged, or discarded; and spreadsheet data is corrupted or simply unreadable by modern programs — and it would be extremely difficult to find a relatively homogeneous sample of instructors who graded the same assignments in the same sequence.

As it stands, no longitudinal studies exist that track individual teachers’ reliability and severity under normal grading conditions; i.e., where teachers grade students’ writing as an integral part of their everyday work, over the course of multiple semesters. Hence, in order to accurately examine the ways in which first-year composition instructors change their grading practices over time, it would be worthwhile to examine a complete set of grades, comprising every grade on every assignment, given by actual first year composition instructors in a college setting. Ideally, the object of study should be a complete set of individual instructors’ assigned grades during a semester-long time frame that requires grading more than one type of student essay, as is the case with first year composition courses but not with one-time tests like the SAT.

Severity and Reliability

The extant literature on individual teachers’ grading patterns is somewhat limited, but two basic issues discussed at length in the literature on grading patterns are rater severity — the degree to which graders tend to provide high or low grades — and intra-rater reliability — whether the same grader provides consistent marks over time.

Severity

A great deal of the literature on rater severity concerns studies on high-stakes testing, where teams of judges rate large numbers of essay tests, often in order to make pass/fail judgments or place students into classes. For example, Barritt, Stock, & Clark (1986) describe the tensions associated with evaluating student essays on a 1-4 scale for the purposes of placing students in composition courses at the University of Michigan. In particular, the authors focus on the confusion they encountered when essays were rated 1 (superior) by one judge and 4 (poor) by another. McQueen & Congdon (2000) examine rater severity over the course of a nine-day test-grading session (seven days of active grading, not including a weekend) in which 16 raters examined 8285 one- to two-page student-written papers. Ten of the raters display significant differences between the severity of their first and ninth days’ ratings, with nine becoming more severe and one becoming less severe.

Other studies of rater severity examine long-term judging sessions. Lumley & McNamara (1995) examine the results of three judging sessions of an ESL speaking test, administered over a 20-month time frame, with each individual judging session lasting about a week. The results indicate that significant intra-rater differences in severity exist over time, even with training. Myford (1991) describes judges with three different levels of expertise — “buffs,” experts, and novices — rating student dramatic performances over a one-month period. The results show significant differences in the judges’ severity over time. O’Neill & Lunz (1997) pool the results of 17 examinations, held at intervals over the course of 10 years, from a practical examination of scientific laboratory competence and statistically analyze the severity of the nine raters who participated in at least 10 examinations. They conclude that raters’ severity is usually consistent and predictable, but some raters are more consistent than others, and even the most stable raters occasionally vary without notice.

Reliability


Variety in intra-rater stability is no surprise to an experienced first year composition instructor; educators have been aware of intra-rater reliability as a problem for many years. For example, Coffman’s 1971 article, “On the Reliability of Ratings of Essay Examinations in English,” cites five sources dated 1936 (Hartog & Rhodes), 1951 (Findlayson), 1953 (Pearson), 1954 (Vernon & Millican), and 1963 (Noyes) in a footnote specifically regarding intra-rater reliability.

Intra-rater reliability is not a problem in purely objective testing, where only one answer can be correct; it is, however, of paramount importance in situations where writing samples are to be graded, a task that demands a degree of subjective judgment. One popular way to measure intra-rater reliability is to apply repeated tests — i.e., to have graders evaluate the same written materials at least twice. Eells (1930, as cited in Branthwaite, et al., 1981) describes a study in which 61 teachers rated two history and two geography essays twice, the second time after an 11-week interval. The mean correlation between the grades was only 0.37, which is surprisingly low. (A correlation of 1.0 indicates perfect matches, a correlation of 0.0 indicates no matches, and a correlation of -1.0 indicates perfect inverse matches; a correlation of 0.37 indicates a “definite but small relationship” [Guilford, 1956, p. 145].)
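As an illustration of how such a test-retest coefficient can be computed, the short sketch below correlates one hypothetical rater's two scoring sessions. The scores, the variable names, and the choice of scipy are illustrative assumptions only and do not come from any of the studies cited here.

```python
# Minimal sketch (not from the original studies): computing a test-retest
# intra-rater reliability coefficient of the kind reported by Eells (1930)
# and Blok (1985). The scores below are invented for illustration.
from scipy.stats import pearsonr

# One rater's scores for the same ten essays, graded twice several weeks apart
first_session = [78, 85, 62, 90, 71, 88, 65, 74, 93, 80]
second_session = [74, 88, 70, 85, 75, 84, 60, 79, 90, 77]

r, p_value = pearsonr(first_session, second_session)
print(f"Intra-rater (test-retest) correlation: r = {r:.2f}, p = {p_value:.3f}")
# Following Guilford's (1956) rough guide, r near 0.37 would indicate a
# "definite but small relationship"; values above roughly 0.80 are usually
# read as acceptable consistency for essay ratings.
```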

Blok (1985) compared 16 schoolteachers’ independent, holistic ratings of 105 elementary school student essays, rated on a 1-10 scale, repeated once after a 3-month interval. Intra-rater reliability correlations were inconsistent, ranging between a rather low 0.415 and a very respectable 0.910 (p. 51).

In contrast, other researchers report results suggesting that intra-rater reliability is likely on repeated tests. Marsh & Ireland (1987) compare grades on essays composed by 139 seventh-grade students. Three experienced teachers graded each essay twice, with a 10-month break between rating sessions; in addition, three student teachers graded the essays during the second grading session. During the first grading session, the experienced teachers gave each paper a holistic score on a 1-100 scale. During the second grading session, all teachers graded the papers on six components using a nine-point scale and provided a holistic score on a 1-100 scale. The grades assigned by the experienced teachers to each paper display correlations of 0.80 between the first and second holistic evaluations and 0.82 between the first holistic evaluation and the second evaluation’s totaled component scores (p. 362).

Similarly, Shohamy, Gordon, & Kraemer (1992) use a 2x2 design with a repeated test to examine the effects of training versus no training on both professional English teachers and ordinary English speakers. Four groups of raters, each comprising five members, rated 50 essays. In a measure of intra-rater reliability, the trained professional English teacher group re-graded ten randomly selected essays three weeks after the first rating session. The raters’ correlation coefficients ranged from 0.76 to 0.97, and the researchers described the ratings as “relatively stable, although they did vary from rater to rater” (p. 30) and recommended repeated training sessions as a method of maintaining high rater reliability.

Anecdotal evidence also exists that graders tend to be internally inconsistent. Branthwaite, Trueman, & Berrisford (1981) describe the grades assigned to two cases of plagiarized academic assignments, in which the submitted papers “differ[ed] only in handwriting and minor changes of wording” (p. 42). In the first case, one student wrote and submitted 11 science lab reports that were copied and resubmitted by another student. The same instructor graded all 22 reports, but none of the paired marks were identical, and the plagiarizing student received a higher grade on seven of the 11 assignments. None of the paired marks were more than one whole letter grade apart, but there was no statistically significant correlation between the paired grades (p. 43). In the second case, five students handed in plagiarized essays during a social science course; three pairs of these essays were graded by one instructor. Of those three pairs, none received the same grade, but in each case the difference in grades was less than one full letter mark (p. 43).

Social factors outside the academy may also play a part in intra-rater severity and reliability. Composition as a field has undergone a great many changes over the past few decades, and evidence suggests that instructors’ marks have often been influenced by the social milieu. For example, Eldridge (1981) describes a study in which seven English teachers graded the same essay twice, the first time in 1972 and the second time in 1978. In 1972, the essay received a C or better from every instructor, but in 1978, only one instructor graded it higher than C-minus. Eldridge speculates that the change in grading is due to an increased emphasis on form brought about by the teachers becoming “message-weary” (p. 67) in the late ‘70s. Similarly, Longstreth & Jones (1976) demonstrate that grading became more lenient at the University of Southern California from fall 1966 to spring 1970, but conclude that the trend is an artifact of the times because “new instructors began with greater leniency after 1970 than before” (p. 80, italics in original).

ICON/TOPIC

The first year composition program at Texas Tech University, which employs a pedagogy entitled Interactive Composition ONline (ICON) and a proprietary course management program named Texas Tech Online Print-Integrated Curriculum (TOPIC), provides an opportunity to conduct a study that will address the gap in the literature regarding first year composition instructors’ grading patterns.


ICON is a writing-intensive, rhetorically-based first year composition pedagogy that employs a common syllabus, a custom-printed textbook, and distributed, anonymous grading. In both the first-semester (ENGL 1301) and the second-semester (ENGL 1302) first year composition courses, students attend one 80-minute class per week, access an online syllabus to obtain reading assignments and writing assignment instructions, and submit all writing assignments electronically via TOPIC. Among other features, TOPIC provides each student with a course syllabus, a class attendance record, a place to receive instructor announcements, a file containing all of her submitted assignments, and up-to-the-minute cumulative grades.

A benefit of the common syllabus shared by all sections of both ENGL 1301 and 1302 is the possibility of distributing grading among the Graduate Part Time Instructors (GPTIs) employed by the English Department. GPTIs fall into two categories: Classroom Instructors (CIs), who lead the weekly classes, and Document Instructors (DIs), who evaluate and grade student submissions. All CIs are required to spend at least one hour DI-ing per section; DIs may or may not be CIs. Each DI reads all assignments “blind,” without possessing any information about the student’s identity, and in turn, the DI does not provide the student with any identifying information.

Distributed, blind grading works to the advantage of Texas Tech’s first year composition program because, in order to establish consistency throughout the program, Document Instruction is criteria-based, as opposed to the traditional, context-based grading criteria used when teachers grade their own students’ work. The use of external criteria has long been a common practice; in 1929, a researcher noted, “Subjectivity of marking may be reduced by about one-half by the adoption of and adherence to a set of scoring rules when essay examinations are to be graded” (as cited in Lumley & McNamara, 1995, p. 4). CIs do have the ability to change grades and sometimes exercise that power, but the data set is largely uncompromised by traditional writing teachers’ application of context-based criteria such as classroom discussions, document production history, knowledge of the student’s personality, student gender, the assigning teacher’s expectations (Wyatt-Smith & Castleton, 2005), likely grade spreads, and class participation (Hipple & Bartholomew, 1982).

DIs use a separate TOPIC interface to access and grade all writing assignments. In order to grade, a DI logs in from any Internet-connected computer, clicks on the type of assignment she wishes to grade (first reads of major writing assignments, called “drafts”; second reads of drafts; peer critiques; or student reflective pieces, called “Writing Reviews”), and receives a document drawn from the respective pool of ungraded documents.


Grading of Peer Critiques and Writing Reviews is a simple process. In these cases, the DI reads the document, selects prewritten comments through a radio button interface, and assigns a grade on a 100-point scale, with 50 being the base grade for a submitted but extremely poorly-completed assignment. The DI has the option to add a holistic comment and/or to single out specific areas deserving praise or requiring improvement in the student’s writing.

Grading drafts is more complex; it is a two-part process, involving two separate DIs. The first DI reads a draft, provides holistic and/or specific commentary, highlights a representative selection of grammar errors, and assigns a numerical grade out of 100 possible points, with 50 being the minimum grade for a submitted assignment. If the draft contains a problem that renders it ungradeable, such as plagiarized content or no content at all, the DI can “flag” the draft to the student’s CI. After a draft has been commented and graded once, TOPIC then returns the draft to the grading pool for a “second read.” A different DI selects the draft, reads both its contents and the first DI’s comments, and assigns a new grade out of 100 points; this second reader also has the option of adding a supplementary comment or “flagging” the draft. If the two DIs’ scores are within nine points out of 100, TOPIC then calculates a mean grade before returning the draft and “reconciled” grade to its author. If the scores differ by more than nine points, TOPIC does not return a reconciled grade to the draft’s author until another DI performs a third read on the paper and TOPIC calculates a mean grade based on the two closest grades.
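The two-read reconciliation rule described above can be summarized in a short sketch. The nine-point threshold and the averaging of the two closest grades follow the description in this section, but the function and variable names are hypothetical and do not represent TOPIC's actual implementation.

```python
# Minimal sketch of the reconciliation rule described above; names and the
# return convention are hypothetical, not taken from TOPIC's actual code.
from typing import Optional

RECONCILE_THRESHOLD = 9  # reconcile automatically when reads differ by <= 9 points

def reconcile(first_read: int, second_read: int,
              third_read: Optional[int] = None) -> Optional[float]:
    """Return the reconciled grade, or None if a third read is still needed."""
    if abs(first_read - second_read) <= RECONCILE_THRESHOLD:
        return (first_read + second_read) / 2  # mean of the two close reads
    if third_read is None:
        return None  # grade withheld until a third DI reads the draft
    # With three reads, average the two closest grades.
    reads = sorted([first_read, second_read, third_read])
    low_gap, high_gap = reads[1] - reads[0], reads[2] - reads[1]
    pair = reads[0:2] if low_gap <= high_gap else reads[1:3]
    return sum(pair) / 2

# Example: reads of 82 and 76 reconcile to 79.0; reads of 82 and 60 wait
# for a third read, after which the two closest grades are averaged.
print(reconcile(82, 76))      # 79.0
print(reconcile(82, 60))      # None
print(reconcile(82, 60, 78))  # 80.0
```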

TOPIC collects and stores over 50 separate pieces of information per assignment, including every grade assigned to every assignment, the identity of the DI who assigned the grade, and the time the grade was assigned, and it has been recording data continuously since its launch in August 2002. Therefore, the data collected by TOPIC is a prime resource for examining individual instructors’ everyday grading practices over time, an area that has not been fully served by research.

Methods

Sample

Grade records from the Fall 2002, Spring 2003, Fall 2003, Spring 2004, Fall 2004, and Spring 2005 semesters were obtained from a Writing Program Administrator. The record set comprised all grades assigned by six Texas Tech English Department graduate students who worked as DIs during each of these six semesters. Data from ENGL 1301 and 1302 were separated, and ENGL 1301 data was selected for analysis.

Variables


The data was coded by DI identity (1-6); semester (coded fa02, fa03, and fa04 in some tests, and later recoded as 102, 103, and 104, respectively, due to limitations in SPSS, the statistical software used in this study); assignment number; and First Read grade. The assignment numbers were 0.1, 1.1, 1.2, 1.3, 2.1, 2.2, 2.3, 3.1, 3.2, 3.3, 4.1, 4.2, and 4.3.

Data Analysis

The number of different drafts per semester was calculated. Descriptive statistics for each DI’s First Reads were calculated for each assignment number and each semester. An analysis of variance (ANOVA) was performed on each DI’s First Reads for each assignment across semesters in order to determine if a significant difference exists between the grades given to similar assignments during different semesters.
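A rough sketch of this analysis outside SPSS might look like the following. The file name, column names, and data frame layout are hypothetical stand-ins for the TOPIC export, and scipy's one-way ANOVA is used in place of SPSS.

```python
# Minimal sketch of the per-DI, per-assignment ANOVA described above,
# using pandas and scipy rather than SPSS. The CSV file and column names
# are hypothetical stand-ins for the TOPIC export, not the study's files.
import pandas as pd
from scipy.stats import f_oneway

grades = pd.read_csv("first_reads.csv")  # columns: di, semester, assignment, grade

for (di, assignment), subset in grades.groupby(["di", "assignment"]):
    # Collect each semester's grades for this DI/assignment combination.
    groups = [g["grade"].values for _, g in subset.groupby("semester")]
    if len(groups) < 2 or any(len(g) < 2 for g in groups):
        continue  # skip combinations without enough data in every semester
    f_stat, p_value = f_oneway(*groups)
    if p_value < 0.05:
        print(f"DI {di}, assignment {assignment}: F = {f_stat:.3f}, p = {p_value:.3f}")
```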

Results

DIs #1, 2, 4, and 6 graded a combined total of 7192 drafts over the Fall 2002, Fall 2003, and Fall 2004 semesters; they graded 2275 drafts in Fall 2002, 3315 drafts in Fall 2003, and 1602 drafts in Fall 2004 (Table 1).

(Note to Angela: I missed the boat here and didn’t run the simple stats, like drafts per DI, in tables. I’d be sure to do that next time.)

ANOVAs could not be calculated for all relationships because some categories lacked sufficient data; in these cases, DIs did not grade the same type of assignment in every semester, which prevented comparisons. Full results of the ANOVAs can be seen in Table 3.

For the relationships that were calculated, DI #1 displayed only one significant difference, on Draft 4.2 between the Fall 2002 and Fall 2003 semesters (F=10.374, df=2/130, p<.05). The mean first read grade given in Fall 2002 was significantly lower (M=76.50, SD=11.77) than the mean first read grade given in Fall 2003 (M=84.55, SD=7.913).

DI #4 displayed four significant differences. The first appears on Draft 1.1 between the Fall 2003 and Fall 2004 semesters (F=163.691, df=1/151, p<.05). However, since DI #4 only graded one Draft 1.1 in Fall 2003 (M=50, SD=0), which was significantly lower than the grades on 152 Draft 1.1s in Fall 2004 (M=87.36, SD=2.910), this result can probably be discounted. Similarly, DI #4 displayed a dubious significant difference on Draft 2.1 between Fall 2003 and Fall 2004 (F=128.199, df=1/89, p<.05). The mean first read grade given in Fall 2003 was significantly lower (M=60.00, SD=17.321) than in Fall 2004 (M=88.32, SD=3.416). However, DI #4 only graded three Draft 2.1s in Fall 2003, versus 88 in Fall 2004, which casts doubt on this particular result. Another significant difference, this time based on six drafts in Fall 2003 versus 110 drafts in Fall 2004, is evident in Draft 3.3 (F=10.256, df=1/114, p<.05). The mean first read grade in Fall 2003 (M=91.67, SD=2.066) is significantly higher than the mean first read grade in Fall 2004 (M=82.42, SD=7.031). The significant difference between Fall 2003 and Fall 2004 (F=13.329, df=1/123, p<.05) on Draft 3.1 is more solid, though. The mean first read grade in Fall 2003 is significantly lower (M=80.36, SD=10.832) than the mean first read grade in Fall 2004 (M=86.15, SD=4.626).

DI #5 displayed six significant differences, all between the Fall 2002 and Fall 2003 semesters. On Draft 1.1 (F=11.881, df=1/192, p<.05), the Fall 2002 mean first read grade (M=91.17, SD=3.609) was significantly higher than the Fall 2003 mean first read grade (M=87.98, SD=7.995). On Draft 1.3 (F=22.196, df=1/144, p<.05), the Fall 2002 mean first read grade (M=80.87, SD=9.285) was significantly lower than the Fall 2003 mean first read grade (M=88.37, SD=7.355). For the significant difference present on Draft 3.1 (F=15.940, df=1/110, p<.05), the first read grades from Fall 2002 (M=73.31, SD=14.398) were significantly lower than the first read grades from Fall 2003 (M=83.32, SD=11.032). Notably, the first read grades on Drafts 4.1, 4.2, and 4.3 all display significant differences between the two semesters. The significant difference on Draft 4.1 (F=15.719, df=1/165, p<.05) came from the lower first read grades in Fall 2002 (M=79.32, SD=14.916) than in Fall 2003 (M=86.72, SD=9.123). On Draft 4.2, the significant difference between the semesters (F=29.735, df=1/177, p<.05) arises from the lower grades in Fall 2002 (M=75.05, SD=16.177) than in Fall 2003 (M=86.40, SD=5.806). On Draft 4.3, the significant difference between semesters (F=40.372, df=1/119, p<.05) arises from the lower first read grades in Fall 2002 (M=69.43, SD=14.699) than in Fall 2003 (M=85.18, SD=6.026).

DI #6 is somewhat akin to DI #4, for both of DI #6’s significant differences between semesters, on Draft 1.1 (F=68.231, df=1/52, p<.05) and Draft 1.2 (F=35.889, df=1/31, p<.05), come from cases where the DI only assigned one first read grade in Fall 2002. In both cases, the Fall 2002 first read grade is a zero (M=0, SD=0), which by default is significantly lower than the mean Fall 2003 first read grades for Draft 1.1 (M=88.96, SD=10.670) and Draft 1.2 (M=75.31, SD=12.380).

Discussion

Although a more in-depth examination of the data available through TOPIC would be worthwhile, this analysis of three semesters’ worth of data from ENGL 1301 serves as a proof of concept for systematically examining the marks assigned by first year composition instructors on workaday class assignments. Specifically, the ANOVA results tend to show that DIs’ everyday grading patterns do change as time passes, but the direction of change is indeterminate. DI #1 tended to increase scores; DI #4 displayed mixed trends; DI #5 had one decreasing mean score but five increasing mean scores; DI #6 displayed two increasing mean scores. These preliminary, mixed results nevertheless serve as evidence that the subject warrants further investigation.

As a direction for further research, educational measurement specialists have suggested that mathematical transformations be applied to raw scores, such as the first read grades examined in this study, in order to establish objective measurement of trends in rater severity and consistency (Lumley & McNamara, 1995; Wolfe & Chiu, 1997; Wolfe, Moulder, & Myford, 1999). In particular, many researchers have used Rasch modeling to analyze rater severity and reliability in high-stakes testing situations (Engelhard, 1992; Gyagenda & Engelhard, 1998; Lumley & McNamara, 1995; McQueen & Congdon, 1997; Mulqueen & Baker, 2000; O’Neill & Lunz, 1997; Weigle, 1998). It is likely that applying the Rasch model to the data found in TOPIC would prove enlightening.
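For reference, the following is a sketch of the many-facet Rasch (rating scale) model form that the studies cited above typically employ. The notation is the standard FACETS-style formulation and is an assumption on my part; it is not drawn from this paper or from TOPIC's data.

```latex
% Sketch of the many-facet Rasch (rating scale) model commonly used in the
% rater-severity literature cited above; notation follows the usual
% FACETS-style formulation, not anything from this paper itself.
\[
  \ln\!\left(\frac{P_{nijk}}{P_{nij(k-1)}}\right) = B_n - D_i - C_j - F_k
\]
% where $B_n$ is the ability of student $n$, $D_i$ the difficulty of
% assignment $i$, $C_j$ the severity of rater (DI) $j$, and $F_k$ the
% difficulty of the step from rating category $k-1$ to $k$. Estimated
% $C_j$ values place each DI's severity on a common logit scale, so
% changes across semesters could be compared directly.
```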


References

Barritt, L., Stock, P.L., & Clark, F. (1986). Researching practice: Evaluating assessment essays. College Composition and Communication, 37(3), 315-327.

Belanoff, P., & Elbow, P. (1991). Using portfolios to increase collaboration and community in a writing program. In Belanoff, P., & Dickson, M. (Eds.), Portfolios: Process and Product. (pp. 17-36). Portsmouth, NH: Boynton/Cook.

Blok, H. (1985). Estimating the reliability, validity, and invalidity of essay ratings. Journal of Educational Measurement, 22(1), 41-52.

Branthwaite, A., et al. (1981). Unreliability of marking: Further evidence and a possible explanation. Educational Review, 33(1), 41-46.

Campbell, C., & Evans, J. A. (2000). Investigation of preservice teachers’ classroom assessment practices during student teaching. The Journal of Educational Research, 93(6), 350-355.

Coffman, W. E. (1971). On the reliability of ratings of essay examinations in English. Research in the Teaching of English, 5(1), 24-36.

Coulter, L. S. (2000). Lean, mean grading machines? A Bourdieuian reading of novice instructors in a portfolio-based writing program. WPA: Writing Program Administration, 23(3), 33-49.

Durst, R., Roemer, M., & Schultz, L. M. (1994). Portfolio negotiations: Acts in speech. In Daiker, D., et al. (Eds.), New Directions in Portfolio Assessment. (pp. 286-300). Portsmouth, NH: Boynton/Cook.

Eells, W. C. (1930). Reliability of reported grading of examinations. Journal of Educational Psychology, 21, 48-52.

Eldridge, R. (1981). Grading in the 70s: How we changed. College English, 43(1), 64-68.

Engelhard, G., Jr. (1992). The measurement of writing ability with a many-faceted Rasch model. Applied Measurement in Education, 5(3), 171-191.

Findlayson, D. S. (1951). The reliability of the marking of essays. British Journal of Educational Psychology, 21, 126-134.


Flores, M. A. (2003). Mapping teacher change: A two-year empirical study. Paper presented at the 84th Annual Meeting of the American Educational Research Association, Chicago, IL, 21-25 April 2003.

Guilford, J. P. (1956). Fundamental statistics in psychology and education (3rd ed). New York: McGraw-Hill.

Gyagenda, I. S., & Engelhard, G., Jr. (1998). Applying the Rasch model to explore rater influences on the assessed quality of students’ writing ability. Paper presented at the annual meeting of the American Educational Research Association, San Diego, CA, April 13-17, 1998. 1-30.

Hartog, P., & Rhodes, E. C. (1936). The marks of examiners. New York: Macmillan.

Hayes, J. R., Hatch, J. A., & Silk, C. M. (1995). How consistent is student writing performance? Quarterly of the National Writing Project and the Center for the Study of Writing and Literacy, 17(4), 34-36.

Hipple, T. W., & Bartholomew, B. (1982). What beginning teachers need to know about grading. English Education, 14(2), 95-98.

Longford, N. T. (1994). Reliability of essay rating and score adjustment. Journal of Educational and Behavioral Statistics, 19(3) 171-200.

Longstreth, L. E., & Jones, D. (1976). Some longitudinal data on grading practices at one university. Teaching of Psychology, 3(2), 78-81.

Lumley, T., & McNamara, T. F. (1995). Rater characteristics and rater bias: Implications for training. Language Testing, 12(1) 54-71.

Lunz, M. E., Stahl, J. A, & Wright, B. D. (1994). Interjudge reliability and decision reproducibility. Educational and Psychological Measurement, 54(4), 913-925.

Marsh, H. W., & Ireland, R. (1987). The assessment of writing effectiveness: A multidimensional perspective. Australian Journal of Psychology, 39(3), 353-367.

McQueen, J. & Congdon, P. J. (2000). Rater severity in large-scale assessment: Is it invariant? Journal of Educational Measurement, 37(2), 163-178.

Myford, C. M. (1991). Judging acting ability: The transition from novice to expert. Paper presented at the American Educational Research Association, Chicago, IL.


Noyes, E. S. (1963). Essay and objective tests in English. College Board Review, 49, 7-10.

O’Neill, T. R., & Lunz, M. E. (1997). A method to compare rater severity across several administrations. Paper presented at the Annual Meeting of the American Educational Research Association, Chicago, IL.

Pearson, R. (1953). The test fails as an entrance examination, in “Should the general composition test be continued?” College Board Review, 25, 2-9.

Qualley, D. (2002). Learning to evaluate and grade student writing: An ongoing conversation. In Pytlik, B. P. (Ed.), Preparing College Teachers of Writing: Histories, Theories, Programs, Practices. (pp. 278-291). New York: Oxford University Press.

Saunders, M. N. K., & Davis, S. M. (1998). The use of assessment criteria to ensure consistency of marking: Some implications for good practice. Quality Assurance in Education, 6(3), 162-171.

Shohamy, E., Gordon, C. M., & Kraemer, R. (1992). The effect of raters' background and training on the reliability of direct writing tests. Modern Language Journal, 76(1), 27-33.

Thompson, W. N. (1955). A study of the grading practices of thirty-one instructors in freshman English. Journal of Educational Research, 49, 65-68.

Vernon, P. E., & Millican, G. D. (1954). A further study of the reliability of English essays. British Journal of Statistical Psychology, 7, 65-74.

Weigle, S. C. (1998). Using FACETS to model rater training effects. Language Testing, 15(2), 263-287.

White, S., Smith, C., & Vanneman, A. (2000). How does NAEP ensure consistency in scoring? Focus on NAEP, 4(2), 1-4. Washington, DC: National Center for Education Statistics.

Wolfe, E. W., & Chiu, C. W. T. (1997). Detecting rater effects with a multi-faceted rating scale model. Paper presented at the Annual Meeting of the National Council on Measurement in Education, Chicago, IL, March 25-27, 1997. 1-37.

Wolfe, E. W., Moulder, B. C., & Myford, C. M. (1999). Detecting differential rater functioning over time (DRIFT) using a Rasch multi-faceted rating scale model. Based on a paper presented at the Annual Meeting of the American Educational Research Association, Montreal, Quebec, Canada, April 19-23, 1999. 1-41.

Wyatt-Smith, C., & Castleton, G. (2005). Examining how teachers judge student writing: An Australian case study. Journal of Curriculum Studies, 37(2), 131-154.


Table 1 – Assignments per Semester

Semester | Assignment # | Frequency | Percent | Valid Percent | Cumulative Percent
102 (Fall 2002) | .1 | 231 | 10.2 | 10.2 | 10.2
102 | 1.1 | 237 | 10.4 | 10.4 | 20.6
102 | 1.2 | 240 | 10.5 | 10.5 | 31.1
102 | 1.3 | 230 | 10.1 | 10.1 | 41.2
102 | 2.1 | 166 | 7.3 | 7.3 | 48.5
102 | 2.2 | 254 | 11.2 | 11.2 | 59.7
102 | 2.3 | 62 | 2.7 | 2.7 | 62.4
102 | 3.1 | 185 | 8.1 | 8.1 | 70.5
102 | 3.2 | 181 | 8.0 | 8.0 | 78.5
102 | 3.3 | 29 | 1.3 | 1.3 | 79.8
102 | 4.1 | 119 | 5.2 | 5.2 | 85.0
102 | 4.2 | 186 | 8.2 | 8.2 | 93.2
102 | 4.3 | 155 | 6.8 | 6.8 | 100.0
102 | Total | 2275 | 100.0 | 100.0 |
103 (Fall 2003) | .1 | 132 | 4.0 | 4.0 | 4.0
103 | 1.1 | 167 | 5.0 | 5.0 | 9.0
103 | 1.2 | 343 | 10.3 | 10.3 | 19.4
103 | 1.3 | 280 | 8.4 | 8.4 | 27.8
103 | 2.1 | 219 | 6.6 | 6.6 | 34.4
103 | 2.2 | 122 | 3.7 | 3.7 | 38.1
103 | 2.3 | 619 | 18.7 | 18.7 | 56.8
103 | 2.4 | 2 | .1 | .1 | 56.8
103 | 3.1 | 297 | 9.0 | 9.0 | 65.8
103 | 3.2 | 359 | 10.8 | 10.8 | 76.6
103 | 3.3 | 203 | 6.1 | 6.1 | 82.7
103 | 4.1 | 125 | 3.8 | 3.8 | 86.5
103 | 4.2 | 202 | 6.1 | 6.1 | 92.6
103 | 4.3 | 245 | 7.4 | 7.4 | 100.0
103 | Total | 3315 | 100.0 | 100.0 |
104 (Fall 2004) | 1.1 | 188 | 11.7 | 11.7 | 11.7
104 | 1.2 | 160 | 10.0 | 10.0 | 21.7
104 | 1.3 | 111 | 6.9 | 6.9 | 28.7
104 | 2.1 | 107 | 6.7 | 6.7 | 35.3
104 | 2.2 | 149 | 9.3 | 9.3 | 44.6
104 | 2.3 | 170 | 10.6 | 10.6 | 55.2
104 | 3.1 | 99 | 6.2 | 6.2 | 61.4
104 | 3.2 | 110 | 6.9 | 6.9 | 68.3
104 | 3.3 | 140 | 8.7 | 8.7 | 77.0
104 | 4.1 | 130 | 8.1 | 8.1 | 85.1
104 | 4.2 | 106 | 6.6 | 6.6 | 91.8
104 | 4.3 | 132 | 8.2 | 8.2 | 100.0
104 | Total | 1602 | 100.0 | 100.0 |


Table 2 – Descriptive Statistics

First Read Grade

Columns: DI Identity | Assign. # | Sem. | N | Mean | Std. Deviation | Std. Error | 95% CI for Mean Lower Bound | 95% CI for Mean Upper Bound | Minimum | Maximum

1 1.1 102 97 85.78 5.199 .528 84.74 86.83 60 98 104 36 85.19 5.143 .857 83.45 86.93 73 94 Total 133 85.62 5.171 .448 84.74 86.51 60 98 1.2 102 85 78.67 12.268 1.331 76.02 81.32 50 95 104 35 82.83 6.951 1.175 80.44 85.22 65 94 Total 120 79.88 11.120 1.015 77.87 81.89 50 95 1.3 102 73 80.67 9.221 1.079 78.52 82.82 50 93 104 29 79.62 9.271 1.722 76.09 83.15 55 94 Total 102 80.37 9.202 .911 78.57 82.18 50 94 2.1 102 71 84.82 13.408 1.591 81.64 87.99 0 95 104 19 84.84 6.449 1.479 81.73 87.95 68 95 Total 90 84.82 12.240 1.290 82.26 87.39 0 95 2.2 102 70 77.41 13.430 1.605 74.21 80.62 50 97 104 58 81.21 6.771 .889 79.43 82.99 60 95 Total 128 79.13 11.053 .977 77.20 81.07 50 97 2.3 102 1 87.00 . . . . 87 87 104 30 80.83 10.100 1.844 77.06 84.60 50 98 Total 31 81.03 9.992 1.795 77.37 84.70 50 98 3.1 102 79 74.68 12.451 1.401 71.89 77.47 50 97 104 46 77.33 9.996 1.474 74.36 80.29 60 98 Total 125 75.66 11.637 1.041 73.60 77.72 50 98 3.2 102 77 80.06 7.599 .866 78.34 81.79 50 95 104 8 82.63 8.585 3.035 75.45 89.80 70 95 Total 85 80.31 7.678 .833 78.65 81.96 50 95 3.3 102 7 85.29 8.480 3.205 77.44 93.13 70 98 104 30 77.70 10.528 1.922 73.77 81.63 50 97 Total 37 79.14 10.504 1.727 75.63 82.64 50 98 4.1 102 28 81.32 12.655 2.392 76.41 86.23 50 98 103 11 76.73 17.430 5.255 65.02 88.44 50 94 104 50 80.66 8.980 1.270 78.11 83.21 50 93 Total 89 80.38 11.426 1.211 77.98 82.79 50 98 4.2 102 36 76.50 11.770 1.962 72.52 80.48 50 94 103 87 84.55 7.913 .848 82.87 86.24 57 96 104 10 78.30 10.478 3.313 70.80 85.80 62 93 Total 133 81.90 9.929 .861 80.20 83.61 50 96 4.3 102 39 77.10 14.475 2.318 72.41 81.79 50 98 103 36 82.53 9.154 1.526 79.43 85.63 50 97 104 50 73.86 13.554 1.917 70.01 77.71 0 90 Total 125 77.37 13.159 1.177 75.04 79.70 0 982 .1 102 46 91.85 6.756 .996 89.84 93.85 50 96 103 1 79.00 . . . . 79 79 Total 47 91.57 6.940 1.012 89.54 93.61 50 96


4 1.1 103 1 50.00 . . . . 50 50 104 152 87.36 2.910 .236 86.89 87.82 79 93 Total 153 87.11 4.187 .339 86.44 87.78 50 93 1.2 103 155 83.66 9.094 .730 82.22 85.11 50 93 104 125 84.78 4.694 .420 83.95 85.61 70 93 Total 280 84.16 7.466 .446 83.28 85.04 50 93 1.3 103 163 84.68 11.363 .890 82.92 86.44 50 95 104 81 82.11 6.450 .717 80.68 83.54 65 93 Total 244 83.83 10.062 .644 82.56 85.10 50 95 2.1 103 3 60.00 17.321 10.000 16.97 103.03 50 80 104 88 88.32 3.416 .364 87.59 89.04 79 95 Total 91 87.38 6.618 .694 86.01 88.76 50 95 2.2 103 4 79.00 19.425 9.713 48.09 109.91 50 90 104 91 82.79 6.364 .667 81.47 84.12 62 96 Total 95 82.63 7.170 .736 81.17 84.09 50 96 2.3 103 127 84.25 9.623 .854 82.56 85.94 50 95 104 140 83.14 6.367 .538 82.07 84.20 60 95 Total 267 83.67 8.084 .495 82.69 84.64 50 95 3.1 103 72 80.36 10.832 1.277 77.82 82.91 52 93 104 53 86.15 4.626 .635 84.88 87.43 78 94 Total 125 82.82 9.187 .822 81.19 84.44 52 94 3.2 103 110 83.99 8.028 .765 82.47 85.51 50 96 104 102 83.54 4.811 .476 82.59 84.48 70 92 Total 212 83.77 6.665 .458 82.87 84.68 50 96 3.3 103 6 91.67 2.066 .843 89.50 93.83 90 95 104 110 82.42 7.031 .670 81.09 83.75 58 95 Total 116 82.90 7.161 .665 81.58 84.21 58 95 4.3 103 7 86.14 6.619 2.502 80.02 92.26 78 95 104 79 81.53 7.306 .822 79.90 83.17 60 94 Total 86 81.91 7.327 .790 80.34 83.48 60 955 .1 102 111 90.50 4.721 .448 89.62 91.39 50 97 103 90 89.27 5.582 .588 88.10 90.44 50 97 Total 201 89.95 5.149 .363 89.23 90.67 50 97 1.1 102 87 91.17 3.609 .387 90.40 91.94 75 98 103 107 87.98 7.995 .773 86.45 89.51 50 98 Total 194 89.41 6.591 .473 88.48 90.35 50 98 1.2 102 92 85.72 6.823 .711 84.30 87.13 70 95 103 95 86.94 5.878 .603 85.74 88.13 55 95 Total 187 86.34 6.373 .466 85.42 87.26 55 95 1.3 102 103 80.87 9.285 .915 79.06 82.69 50 98 103 43 88.37 7.355 1.122 86.11 90.64 50 97 Total 146 83.08 9.385 .777 81.55 84.62 50 98 2.1 102 62 88.90 4.467 .567 87.77 90.04 70 95 103 151 86.32 10.308 .839 84.67 87.98 50 95 Total 213 87.08 9.072 .622 85.85 88.30 50 95 2.2 102 104 80.60 11.671 1.144 78.33 82.87 50 95 103 69 84.43 13.287 1.600 81.24 87.63 50 95 Total 173 82.13 12.446 .946 80.26 83.99 50 95 2.3 102 4 82.25 4.787 2.394 74.63 89.87 78 89 103 15 89.47 4.324 1.116 87.07 91.86 80 95 Total 19 87.95 5.244 1.203 85.42 90.47 78 95 3.1 102 65 73.31 14.398 1.786 69.74 76.88 50 94 103 47 83.32 11.032 1.609 80.08 86.56 50 95 Total 112 77.51 13.950 1.318 74.90 80.12 50 95


3.2 102 56 84.00 4.729 .632 82.73 85.27 75 92 103 154 82.60 11.455 .923 80.78 84.43 50 95 Total 210 82.98 10.116 .698 81.60 84.35 50 95 3.3 102 17 81.65 6.154 1.492 78.48 84.81 70 90 103 127 84.21 9.836 .873 82.49 85.94 50 96 Total 144 83.91 9.496 .791 82.35 85.47 50 96 4.1 102 56 79.32 14.916 1.993 75.33 83.32 50 95 103 111 86.72 9.123 .866 85.00 88.44 56 95 Total 167 84.24 11.880 .919 82.42 86.05 50 95

4.2 102 114 75.05 16.177 1.515 72.05 78.05 50 95 103 65 86.40 5.806 .720 84.96 87.84 68 94 Total 179 79.17 14.429 1.078 77.04 81.30 50 95
4.3 102 83 69.43 14.699 1.613 66.22 72.64 50 92 103 38 85.18 6.026 .978 83.20 87.17 68 94 Total 121 74.38 14.585 1.326 71.75 77.01 50 94

6 1.1 102 1 .00 . . . . 0 0 103 53 88.96 10.670 1.466 86.02 91.90 50 100 Total 54 87.31 16.070 2.187 82.93 91.70 0 100
1.2 102 1 .00 . . . . 0 0 103 32 75.31 12.380 2.188 70.85 79.78 50 94 Total 33 73.03 17.898 3.116 66.68 79.38 0 94
4.3 103 147 82.44 8.835 .729 81.00 83.88 55 99 104 2 82.00 4.243 3.000 43.88 120.12 79 85 Total 149 82.44 8.782 .719 81.01 83.86 55 99


Table 3 – ANOVA between First Reads across Semesters

ANOVA

First Read Grade

DI Identity Assignment # Sum of

Squares df Mean Square F Sig.1 1.1 Between Groups 9.111 1 9.111 .339 .561 Within Groups 3520.092 131 26.871 Total 3529.203 132 1.2 Between Groups 428.619 1 428.619 3.540 .062 Within Groups 14285.748 118 121.066 Total 14714.367 119 1.3 Between Groups 22.906 1 22.906 .269 .605 Within Groups 8528.937 100 85.289 Total 8551.843 101 2.1 Between Groups .010 1 .010 .000 .994 Within Groups 13333.146 88 151.513 Total 13333.156 89 2.2 Between Groups 456.239 1 456.239 3.818 .053 Within Groups 15058.503 126 119.512 Total 15514.742 127 2.3 Between Groups 36.801 1 36.801 .361 .553 Within Groups 2958.167 29 102.006 Total 2994.968 30 3.1 Between Groups 203.011 1 203.011 1.505 .222 Within Groups 16589.197 123 134.872 Total 16792.208 124 3.2 Between Groups 47.497 1 47.497 .804 .373 Within Groups 4904.550 83 59.091 Total 4952.047 84 3.3 Between Groups 326.596 1 326.596 3.135 .085 Within Groups 3645.729 35 104.164 Total 3972.324 36 4.1 Between Groups 175.502 2 87.751 .667 .516 Within Groups 11313.509 86 131.552 Total 11489.011 88 4.2 Between Groups 1791.112 2 895.556 10.374 .000 Within Groups 11222.617 130 86.328 Total 13013.729 132 4.3 Between Groups 1576.490 2 788.245 4.833 .010 Within Groups 19896.582 122 163.087 Total 21473.072 1242 .1 Between Groups 161.555 1 161.555 3.540 .066 Within Groups 2053.935 45 45.643 Total 2215.489 46


DI Identity Assignment #Sum of

Squares df Mean Square F Sig.4 1.1 Between Groups 1386.295 1 1386.295 163.691 .000 Within Groups 1278.816 151 8.469 Total 2665.111 152 1.2 Between Groups 85.485 1 85.485 1.537 .216 Within Groups 15466.283 278 55.634 Total 15551.768 279 1.3 Between Groups 357.359 1 357.359 3.567 .060 Within Groups 24245.411 242 100.188 Total 24602.770 243 2.1 Between Groups 2326.448 1 2326.448 128.199 .000 Within Groups 1615.091 89 18.147 Total 3941.538 90 2.2 Between Groups 55.072 1 55.072 1.072 .303 Within Groups 4777.033 93 51.366 Total 4832.105 94 2.3 Between Groups 82.975 1 82.975 1.271 .261 Within Groups 17302.358 265 65.292 Total 17385.333 266 3.1 Between Groups 1023.364 1 1023.364 13.329 .000 Within Groups 9443.404 123 76.776 Total 10466.768 124 3.2 Between Groups 10.798 1 10.798 .242 .623 Within Groups 9362.334 210 44.583 Total 9373.132 211 3.3 Between Groups 486.662 1 486.662 10.255 .002 Within Groups 5410.097 114 47.457 Total 5896.759 115 4.3 Between Groups 136.728 1 136.728 2.595 .111 Within Groups 4426.528 84 52.697 Total 4563.256 855 .1 Between Groups 76.155 1 76.155 2.900 .090 Within Groups 5225.348 199 26.258 Total 5301.502 200 1.1 Between Groups 488.634 1 488.634 11.881 .001 Within Groups 7896.376 192 41.127 Total 8385.010 193 1.2 Between Groups 69.502 1 69.502 1.718 .192 Within Groups 7484.273 185 40.456 Total 7553.775 186 1.3 Between Groups 1705.608 1 1705.608 22.196 .000 Within Groups 11065.406 144 76.843 Total 12771.014 145 2.1 Between Groups 292.279 1 292.279 3.595 .059 Within Groups 17156.519 211 81.311 Total 17448.798 212 2.2 Between Groups 611.207 1 611.207 4.015 .047 Within Groups 26033.995 171 152.246 Total 26645.202 172 2.3 Between Groups 164.464 1 164.464 8.460 .010 Within Groups 330.483 17 19.440 Total 494.947 18 3.1 Between Groups 2733.932 1 2733.932 15.940 .000 Within Groups 18866.059 110 171.510 Total 21599.991 111 3.2 Between Groups 80.043 1 80.043 .781 .378 Within Groups 21306.838 208 102.437 Total 21386.881 209


DI Identity Assignment #Sum of

Squares df Mean Square F Sig. 3.3 Between Groups 98.684 1 98.684 1.095 .297 Within Groups 12797.142 142 90.121 Total 12895.826 143 4.1 Between Groups 2037.863 1 2037.863 15.719 .000 Within Groups 21390.557 165 129.640 Total 23428.419 166 4.2 Between Groups 5330.347 1 5330.347 29.735 .000 Within Groups 31729.284 177 179.261 Total 37059.631 178 4.3 Between Groups 6466.416 1 6466.416 40.372 .000 Within Groups 19060.096 119 160.169 Total 25526.512 1206 1.1 Between Groups 7767.724 1 7767.724 68.231 .000 Within Groups 5919.925 52 113.845 Total 13687.648 53 1.2 Between Groups 5500.095 1 5500.095 35.889 .000 Within Groups 4750.875 31 153.254 Total 10250.970 32 4.3 Between Groups .386 1 .386 .005 .944 Within Groups 11414.259 147 77.648 Total 11414.644 148


Table 4 – Homogeneity of Variances

Test of Homogeneity of Variances (a,b,c,d,e)

First Read Grade

DI Identity | Assignment # | Levene Statistic | df1 | df2 | Sig.

1 | 1.1 | .023 | 1 | 131 | .880
1 | 1.2 | 16.104 | 1 | 118 | .000
1 | 1.3 | .001 | 1 | 100 | .970
1 | 2.1 | 1.552 | 1 | 88 | .216
1 | 2.2 | 27.589 | 1 | 126 | .000
1 | 3.1 | 1.195 | 1 | 123 | .276
1 | 3.2 | .669 | 1 | 83 | .416
1 | 3.3 | 1.202 | 1 | 35 | .280
1 | 4.1 | 7.081 | 2 | 86 | .001
1 | 4.2 | 3.111 | 2 | 130 | .048
1 | 4.3 | 2.665 | 2 | 122 | .074

4 | 1.2 | 13.611 | 1 | 278 | .000
4 | 1.3 | 6.915 | 1 | 242 | .009
4 | 2.1 | 71.213 | 1 | 89 | .000
4 | 2.2 | 21.132 | 1 | 93 | .000
4 | 2.3 | 10.434 | 1 | 265 | .001
4 | 3.1 | 14.958 | 1 | 123 | .000
4 | 3.2 | 3.956 | 1 | 210 | .048
4 | 3.3 | 4.360 | 1 | 114 | .039
4 | 4.3 | .025 | 1 | 84 | .875

5 | .1 | .959 | 1 | 199 | .329
5 | 1.1 | 7.058 | 1 | 192 | .009
5 | 1.2 | 10.378 | 1 | 185 | .002
5 | 1.3 | 5.163 | 1 | 144 | .025
5 | 2.1 | 14.493 | 1 | 211 | .000
5 | 2.2 | .107 | 1 | 171 | .744
5 | 2.3 | .015 | 1 | 17 | .904
5 | 3.1 | 11.284 | 1 | 110 | .001
5 | 3.2 | 20.593 | 1 | 208 | .000
5 | 3.3 | 2.079 | 1 | 142 | .151
5 | 4.1 | 26.759 | 1 | 165 | .000
5 | 4.2 | 92.993 | 1 | 177 | .000
5 | 4.3 | 51.384 | 1 | 119 | .000

6 | 4.3 | 1.385 | 1 | 147 | .241

a. Test of homogeneity of variances cannot be performed for First Read Grade in split file DI Identity = 1, Assignment # = 2.3 because only one group has a computed variance.
b. Test of homogeneity of variances cannot be performed for First Read Grade in split file DI Identity = 2, Assignment # = .1 because only one group has a computed variance.
c. Test of homogeneity of variances cannot be performed for First Read Grade in split file DI Identity = 4, Assignment # = 1.1 because only one group has a computed variance.
d. Test of homogeneity of variances cannot be performed for First Read Grade in split file DI Identity = 6, Assignment # = 1.1 because only one group has a computed variance.
e. Test of homogeneity of variances cannot be performed for First Read Grade in split file DI Identity = 6, Assignment # = 1.2 because only one group has a computed variance.