An early rubric

For the last 15 years or so, I’ve been a big fan of rubric grading. I got the bug after reading a column by my colleague Robert Runte in our faculty association newsletter. Over the years, I’ve developed a variety of different rubrics, several of which have been adopted and adapted by my own colleagues (see here and here).

So it was very interesting, in the course of my reading for our Unessay project, to come across this early work by Paul Diederich of the Educational Testing Service (Diederich 1965; it looks like Diederich 1974 contains a more developed discussion, though I haven't seen it yet).

                        Low         Middle         High
General Merit
  Ideas                 2     4       6          8     10
  Organization          2     4       6          8     10
  Wording               1     2       3          4      5
  Flavour               1     2       3          4      5
                                                    _______
Mechanics
  Usage                 1     2       3          4      5
  Punctuation           1     2       3          4      5
  Spelling              1     2       3          4      5
  Handwriting           1     2       3          4      5
                                                    _______
                                        Total       _______

In addition to this scale—which was used to ensure consistency in translating quality into a numeric score for the purposes of the test itself—the rubric also requires the use of qualitative descriptions of performance at each level for each entry on the list. “High” for “Ideas,” for example, is described as having the following qualities:

The student has given some thought to the topic and writes what he really thinks [sic—Diederich is writing in 1965]. He discussed each main point long enough to show clearly what he means. He supports each main point with arguments, examples, or details; he gives the reader some reason for believing it. His points are clearly related to the topic and to the main idea or impression he is trying to convey. No necessary points are overlooked and there is no padding.
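One thing worth noting about the scale is the weighting: Ideas and Organization are recorded out of 10 (in steps of 2), while the remaining six categories are recorded out of 5, for a maximum total of 50. The short sketch below (my own illustration in Python, not anything from Diederich) just makes that arithmetic explicit; the category names follow the table above and the sample scores are invented.

```python
# Illustrative only: how the eight sub-scores on Diederich's scale combine
# into a total out of 50. Ideas and Organization are already recorded on a
# doubled (2-10) scale; the other six categories run 1-5. The sample scores
# below are invented for the sake of the example.

CATEGORIES = ["Ideas", "Organization", "Wording", "Flavour",
              "Usage", "Punctuation", "Spelling", "Handwriting"]

def total_score(scores):
    """Sum the eight sub-scores; the maximum possible total is 50."""
    return sum(scores[category] for category in CATEGORIES)

# A hypothetical paper rated "High" on Ideas and "Middle" everywhere else:
sample = {"Ideas": 8, "Organization": 6, "Wording": 3, "Flavour": 3,
          "Usage": 3, "Punctuation": 3, "Spelling": 3, "Handwriting": 3}

print(total_score(sample))  # 32 (out of a possible 50)
```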

Rubrics have since become an extremely mainstream way of marking, especially in formative assessment schemes. Diederich, on the other hand, was interested in them for very summative reasons: his work was prompted by a desire to improve interscorer reliability in the grading of standardised tests (see Nunes 2013; Tremmel 2011 for interesting and wide-ranging discussions).

In my experience, they are good in both areas. They seem to be popular with students, and my personal, anecdotal experience is that they really do improve consistency of grading: twice in my career (that I am aware of, at least) I have had identical essays submitted to me for grading—in both cases by accident. Because I often have large classes and shuffle the papers, in both cases I ended up marking the two copies of the essay at different points in my grading cycle without realising it (a batch of 50 or more essays can take a week to ten days to mark—and after ten days of marking first-year papers on a single topic, only the most spectacular and original papers stand out in your memory). In both cases, the grades I assigned were within a third of a letter grade (approximately 6% on my scale) of each other: a B+ (let’s say) the first time and an A- or B the second. While this is not perfectly reliable, it is within an acceptable margin of error in my books.

I also like rubrics for two other reasons: they really speed up marking and, to my mind at least, they improve my intellectual—as well as the numeric—consistency. Looking over my rubric before I begin a set of papers helps me re-establish what it is I am looking for in papers and what different types of quality look like (avoiding what Golden 2011 has described as the September/May distinction that affects most instructors); immediately after finishing a student essay, I then fill out a form-based rubric as a fast way of gathering my thoughts on the overall performance (as I have recently discovered, this is a version of the “holistic” approach advocated initially by Cooper 1977).

The hard part in writing a rubric, in my opinion, is developing an appropriate typology of qualities (see Lloyd-Jones 1977 for an early theoretical discussion of this). And this is why I so like Diederich’s: it is extremely compact and yet to my mind quite comprehensive. It covers the areas that seem most important as indicators of both quality (the summative grading question) and the type of skills we want our students to concentrate their efforts on developing (the formative grading question).

Perhaps best of all, however, is the fact that it was developed organically, by observing experts in the field and attempting to discover the areas in which they could reach consensus. Diederich’s original problem was the low rate of agreement he found among scorers of College Entrance Examination essays (in a 1963 talk reported by Tremmel, he suggested the reliability might be as low as 30%). In order to address this, he took 300 papers by university freshmen, assigned them to 60 “distinguished” readers drawn in pairs from various professions (professors, teachers, lawyers, etc.), and asked the readers to rank the papers. He then asked for explanations of the rankings and looked for patterns in what people considered important. His rubric and the scoring it uses reflect the most important features his scorers commented upon. The weighting is somewhat less closely tied to his informants’ opinions (various studies have shown that mechanics and, where applicable, handwriting have a disproportionate influence on readers’ opinions of a piece of writing; see Golden 2011) and is based instead on the emphasis ETS wanted to place on each area in its evaluation.

In his subsequent discussion of this method, Diederich claimed an interscorer reliability well into the 80s. That certainly seems plausible to me!
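For readers who find a figure like “reliability well into the 80s” hard to picture, the sketch below shows one common way such a number can be produced: the correlation between the totals that two independent readers assign to the same set of papers. This is my own Python illustration of the general idea, not a reconstruction of Diederich’s or ETS’s actual statistics, and the scores are invented.

```python
# Illustrative only: one common way an interscorer-reliability figure can be
# computed, as the Pearson correlation between two readers' totals for the
# same papers. The scores are invented; this is not Diederich's or ETS's
# actual procedure.
from statistics import correlation  # available in Python 3.10+

reader_a = [32, 41, 27, 45, 30, 38, 22, 48]  # totals out of 50
reader_b = [30, 43, 25, 44, 33, 36, 24, 47]

r = correlation(reader_a, reader_b)
print(f"Interscorer correlation: {r:.2f}")  # roughly 0.97 for this toy data
```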

Works cited

Cooper, Charles R. 1977. “Holistic Evaluation of Writing.” In Evaluating Writing: Describing, Measuring, Judging., edited by Charles R. Cooper and Lee Odell, 3–31. http://eric.ed.gov/?id=ED143020.

Diederich, Paul B. 1965. “Grading and Measuring.” 13. http://0-search.proquest.com.darius.uleth.ca/eric/docview/64400392/abstract/1403502511F2B6C8FD6/1?accountid=12063.

———. 1974. Measuring Growth in English. Urbana, IL: National Council of Teachers of English. http://0-search.proquest.com.darius.uleth.ca/eric/docview/64054139/1403502511F2B6C8FD6/3?accountid=12063.

Golden, Charles Hurst. 2011. “The Effects of a Simulated Self-Evaluative Routine on Teachers’ Grades, Intraclass Correlations, and Feedback Characteristics”. Dissertation, Kansas City: University of Kansas. http://hdl.handle.net/1808/9749.

Lloyd-Jones, Richard. 1977. “Primary Trait Scoring.” In Evaluating Writing: Describing, Measuring, Judging., edited by Charles R. Cooper and Lee Odell, 33–66. http://eric.ed.gov/?id=ED143020.

Nunes, Matthew J. 2013. “The Five-Paragraph Essay: Its Evolution and Roots in Theme-Writing.” Rhetoric Review 32 (3): 295–313. doi:10.1080/07350198.2013.797877. http://www.tandfonline.com/doi/abs/10.1080/07350198.2013.797877.

Tremmel, Michelle. 2011. “What to Make of the Five-Paragraph Theme: History of the Genre and Implications.” Teaching English in the Two Year College 39 (1) (September): 29–42. http://0-search.proquest.com.darius.uleth.ca/docview/889137120/1402B7392ED17A638CA/1?accountid=12063.

