Taking a cue from reality television shows may help educators improve assessments.
Groans reverberate through the classroom as half my students realize that they’ve been trapped by their own rubrics. Each year, this activity is my favorite moment in the Foundations of Assessment course that I teach at the University of Lethbridge in Alberta. The activity itself is a simple one: Design a rubric that you can use to evaluate the performance of contestants on “The X-Factor.”
The groans come from students who realize that, in spite of their best intentions, the rigorous application of their rubrics has forced them to pass Geo Godly on to the next stage of “The X-Factor,” even though they know this is the wrong decision. Godly was a contestant who exposed himself on stage during his live audition. What traps students into this bad decision is that the rubrics they use to describe excellence contain such keywords as memorable, confident, and strong stage presence. Godly demonstrated confidence and strong stage presence, and he created a memorable audition. He wasn’t a particularly strong singer, nor was his performance memorable for the right reasons, but my groaning students’ rubrics fail to emphasize these qualities and consequently trap them into a poor decision.
I use this activity to open a discussion about the assessment literature my students have been reading in the course. A problem with the burgeoning literature about classroom assessment is similar to the problem that has made large-scale assessment so damaging in many educational contexts. The literature presupposes that “best practices” in assessment are equally valid across a wide range of educational and disciplinary contexts. This technocentric approach to assessment focuses on narrow methods of achieving validity and reliability, and it depends primarily on traditional assessment tools such as multiple-choice items and timed impromptu response formats for its designs (Huot, 2002).
The reality, however, is quite the opposite: Good assessment practices are contextual, designed in accordance with expert knowledge, and derived from and responsive to the contexts in which they are employed (Gallagher & Turley, 2012). What constitutes good assessment in science does not necessarily translate to good assessment of writing, fine arts, or physical education. The sooner those of us who design assessments, teach assessment courses, and theorize about assessment realize this, the sooner we can make important strides in improving assessment practices.
Lessons in assessment
Consider what “American Idol” tells us about assessment. Educators might balk at the idea that a reality TV show can offer anything of value to our work, but each year as I run my “X-Factor” assessment activity, I see more of what these programs reveal about effective assessment practices.
The purpose of reality TV shows like “American Idol,” “The Voice,” and “The X-Factor” is to identify and cultivate talent. This goal is similar to, though narrower than, the purpose of schooling. Granted, there is not a perfect parallel between the two. Schools serve a much broader range of individuals and subject areas. And, unlike these reality TV shows, schools do not have the luxury of eliminating 99% of their students in the first few weeks of the school year so they can focus on the top 20 students who show the most promise. Nor can schools let the public be the final arbiter of student grades.
It is difficult to argue against the success with which “American Idol” delivers on its purpose. As of 2013, finalists on the show have collectively garnered 11 Grammys and one Oscar. The show has spawned over 345 Billboard Hot 100 songs, and, in 2007, “Idol” alumni were collectively responsible for 2.1% of album sales worldwide. There are many reasons for this success, but the show’s capacity to assess and cultivate talent is one of them.
Lesson #1: Multiple performances provide many ways to demonstrate quality and growth.
Contestants who have made it to the final rounds of “American Idol” will have performed a minimum of 25 times before scouts, producers, judges, and the public. During this process, they’re asked to perform in different settings — individually, in groups, a cappella, playing an instrument — and across a range of styles, genres, and periods. This process allows judges and viewers to observe the full range of a contestant’s abilities.
It is curious to note that most assessments of student learning are based on single snapshots of student performance. Standardized writing tests, for example, are almost always completed in a single sitting (Behizadeh, in press). Many classroom teachers still rely most heavily on end-of-unit tests and timed essay assignments (often modeled after standardized test formats) for grading purposes (Hillocks, 2002). Seldom are students given multiple opportunities across a range of genres and contexts and over a sustained period of time to demonstrate growth and competence.
Lesson #2: There is no proxy for expert knowledge.
Part of the success of “American Idol” stems from the people chosen to evaluate, judge, and provide feedback to the contestants. The judges have included some of the most successful artists in American music history. They are unabashedly subjective. They view each performance through the lens of their own experiences, values, and perspectives. They disagree with one another almost as much as they agree. They don’t defer to a producer’s rubric or some predetermined scoring criteria. Yet, on all decisions, they have to reach a measure of consensus. During these discussions, everything is put on the table, their opinions are presented, and they fight for their positions. This process of debate and argumentation leads to a final decision, which they then collectively act on.
“American Idol” teaches us to value expert judgment and to draw on that judgment to design our assessment programs. When I work with student teachers preparing for life in the classroom, they are most often terrified of assessing student writing. They latch onto the nearest set of rubrics they can find. Too often, the rubric is a crutch for their lack of expert knowledge. Many of these students are not writers themselves. Few have ever completed the assignments they ask their students to write. Most have limited experience with writing. And that is true of the profession in general. In a study of 9th-grade English teachers in Alberta, Canada, I have seen similar practices. Most of the study participants’ classroom writing assessments are modeled after the government-mandated, large-scale assessment. The rubric they use to judge student writing is almost exclusively the one used on the provincial writing assessment.
Lesson #3: The construct being measured must be clearly defined.
Validity refers to the degree to which an assessment accurately captures what it is intended to measure (Messick, 1989). This means that factors extraneous to the construct being assessed should not interfere with that process. In technical language, this concern is called construct-irrelevant variance. Related to this is the problem of construct underrepresentation, an issue that occurs when an assessment does not fully capture the skill or knowledge domain it has been designed to measure.
“American Idol” and its cousins “The Voice” and “The X-Factor” are concerned with validity issues. “The Voice,” as its name suggests, focuses on the vocal qualities of its contestants, aiming to pick the best voice from among all of the season’s competitors. The show’s set has been designed to reduce construct-irrelevant variance. In the first round of auditions, the judges’ chairs — with their high and wide headrests — are turned away from the stage to ensure that judges can’t see the contestants. Judges decide whether contestants should continue on to the next round based solely on the quality of their voices. Once judges have decided in favor of a contestant, they press a button that spins their chair to face the stage. Only then is the judge able to see what the person looks like and how dynamic he or she is on stage. For a show that focuses on the voice, using this additional information to make decisions about a contestant would introduce construct-irrelevant variance into the process.
“American Idol” and “The X-Factor,” however, are measuring a slightly different construct than “The Voice” does. These shows are looking for the next superstar. For these shows, vocal quality is only one part of the construct being measured. Other factors that contribute to superstardom include appearance, ability to perform, and a certain dynamism or X factor. This is a more difficult and slippery construct to measure. Dannii Minogue, one of the judges for the United Kingdom’s 2007 version of the show, defined the X factor this way: “It’s something you can’t describe or bottle, but you know it when you see or hear it.”
We need to be clear about what we are trying to measure. We also need to be circumspect about the constructs we define. Looking for generalized and stable constructs is not always productive. This is a long-standing issue in writing assessment (Smit, 2004). An effective email is very different from a well-crafted academic paper or a beautifully shaped poem. And, within each of these genres, the markers of quality differ vastly. A well-crafted email to one’s employer is quite different from a well-crafted email to a family member or close friend, and a successful history research paper is quite different from a research paper in English or biology. Most writing rubrics, however, are based on a generalized sense of expertise in writing (see the popular 6+1 Trait® Writing program); consequently, they fail to capture the nuanced differences that really matter for communicative success in specific contexts (Mabry, 1999).
Lesson #4: Consistency is best achieved through dialogue.
Objectivity has long been a holy grail of assessment design because it is grounded in the psychometric concern for reliability. The idea is that eliminating subjectivity leads to greater consistency in the evaluation process, which, in turn, leads to higher degrees of validity by reducing construct-irrelevant variance.
The fourth lesson “American Idol” offers is that validity is not inherently dependent on objectivity. Expert judgment is not always reliable. This is why many writing assessment programs train or norm their markers. In this norming process, markers are taught what to focus on when evaluating a text. They’re taught to put aside subjective experiences of text for a more objective, dispassionate response. But if we look at shows like “American Idol,” we see quite a different dynamic at work. We see experts responding emotionally and analytically to performances. We see judges who disagree frequently with one another, often quite passionately. To the viewer, these debates seem to suggest a lack of consistency on the part of the judges. Harry Connick Jr., a judge in the 2014 season of “American Idol,” however, explains the process this way:
We all have opinions, but what makes us judges is that our opinions are coupled with some things that are objective. Sure, it boils down to what moves you and what you like . . . but, along the way, I think there are certain specifics that can be passed along that will not only help the contestants but also help the audience understand the process.
The point that Connick is making is that judgments that may seem capricious to the nonexpert are in fact based on clear criteria that an expert has internalized.
The check on idiosyncratic judgments is not achieved through a process of norming but rather through a process of dialogue. On key decisions, the judges come together to discuss and debate their views. Through this process, they reach consensus on decisions that need to be made. This process is what Pamela Moss (1994) calls a hermeneutic approach to achieving inter-rater reliability (or agreement between judges). It is also consistent with a process that Bob Broad (2003) calls dynamic criteria mapping. Assessment designers use this process to bring together invested parties to discuss, explore, and defend their values around the constructs being assessed. Broad and Moss argue that such approaches lead to more meaningful and robust decisions. Unlike the process of norming, which tends to strip away qualities that are contentious or difficult to measure, this process ensures that a more complete examination of the construct being assessed is undertaken. Both processes seek to achieve consistency. Norming does this by narrowing and limiting factors under consideration; a hermeneutic model does this by embracing richness, debate, and consensus.
Lesson #5: Real audiences matter.
The fifth lesson drawn from “American Idol” is that real audiences matter. In my previous life, I taught 12th-grade academic English classes in Alberta. I also served on the province’s English diploma exam-marking team. The writing exam was based on two questions that asked students to respond to pieces of literature. Each year, I was struck by the intellectual inconsistency the experience demanded of markers. During our norming sessions, markers were provided with rubrics that cautioned:
Assessment of the Personal Response to Texts Assignment on the diploma examination will be in the context of Louise Rosenblatt’s suggestion: . . . the evaluation of the answers would be in terms of the amount of evidence that the youngster has actually read something and thought about it, not a question of whether, necessarily, he has thought about it the way an adult would, or given an adult’s “correct” answer.
The irony here was that, as markers, we were asked to honor Rosenblatt’s (1938) transactional theory of reading — the idea that the meaning of a text is created in the interaction between the reader and the text — when judging students’ responses to literature. Yet we were not permitted to apply this same transactional approach to our own appraisal of the student texts we were evaluating. The problem with this approach is that writing, like music, by its very nature is designed to affect an audience in some way. Part of the success of a written text or a musical performance is always related to whether the piece successfully evokes the desired response in the reader or listener. Part of the success of “American Idol” is that performers have to connect with a real audience. Taking that element out of the equation, as so many writing assessments do, necessarily limits the construct being measured. Most writing assignments are written for the teacher or the assessor rather than for real, authentic audiences.
Frighteningly, this problem is about to get significantly worse. Developers of large-scale writing assessments are seriously studying how best to implement machine scoring of student writing. No matter how sophisticated these programs might become, computers will never be able to capture the human aspect of writing because computers can never react to a text with emotion, nor can they meaningfully synthesize current experience with prior knowledge (Condon, 2013). Imagine for a moment replacing “American Idol” judges with computer programs that made decisions about contestants based on an algorithm. The idea is laughable.
Lesson #6: Attend to consequences.
Reality shows like “American Idol” depend on two things for their survival: ratings and the success of their alumni. Given this, they are highly sensitive to the issue of consequences. Poor design choices — whether in how judges are selected or in how finalists are determined — lead to a drop in ratings, and so these problems are rectified rather quickly.
We’ve known for a long time in education that assessment programs need to pay closer attention to the consequences that stem from their implementation and use. Yet very little research that directly collects this consequential validity evidence has been conducted (Cizek, Bowen, & Church, 2010). Research on the consequences of classroom assessment practices is virtually nonexistent; the research that directly examines the consequences of large-scale writing assessment on education in Canada and the U.S., however, is far from positive (Hillocks, 2002; Slomp, Corrigan, & Sugimoto, 2014). And so problematic practices continue undeterred. This needs to change, even if that means taking our cue from reality television.
References
Behizadeh, N. (in press). Mitigating the dangers of a single story: Creating large-scale writing assessments aligned with sociocultural theory. Educational Researcher.
Broad, B. (2003). What we really value: Beyond rubrics in teaching and assessing writing. Logan, UT: Utah State University Press.
Cizek, G.J., Bowen, D., & Church, K. (2010). Sources of validity evidence for educational and psychological tests: A follow-up study. Educational and Psychological Measurement, 70 (5), 732-743.
Condon, W. (2013). Large-scale assessment, locally developed measures, and automated scoring of essays: Fishing for red herrings? Assessing Writing, 18, 100-108.
Gallagher, C., & Turley, D. (2012). Our better judgment: Teacher leadership for writing assessment. Urbana, IL: NCTE.
Hillocks, G., Jr. (2002). The testing trap: How state writing assessments control learning. New York, NY: Teachers College Press.
Huot, B. (2002). (Re)articulating writing assessment for teaching and learning. Logan, UT: Utah State University Press.
Mabry, L. (1999). Writing to the rubric. Phi Delta Kappan, 80, 673-680.
Messick, S. (1989). Meaning and values in test validation: The science and ethics of assessment. Educational Researcher, 18 (2), 5-11.
Moss, P.A. (1994). Can there be validity without reliability? Educational Researcher, 23 (2), 5-12.
Rosenblatt, L.M. (1938). Literature as exploration. New York, NY: D. Appleton-Century.
Slomp, D., Corrigan, J., & Sugimoto, T. (2014). A framework for using consequential validity evidence in evaluating large-scale writing assessments. Research in the Teaching of English, 48 (3), 276-302.
Smit, D. (2004). The end of composition studies. Carbondale, IL: Southern Illinois University Press.
CITATION: Slomp, D. (2015). Writing assessment in six lessons — from “American Idol”. Phi Delta Kappan, 96 (6), 62-67.
ABOUT THE AUTHOR
DAVID SLOMP is a faculty member at the University of Lethbridge, Alberta.
