My career trajectory has often involved backroads. I quit teaching in 2012 to pursue a doctorate in educational policy, but family circumstances led me to return to the classroom in 2019, just in time for a global pandemic. This year, I am shifting back to higher education.
As a policy researcher, I have always been fascinated by the way policies interact with the work of teachers. As a teacher, I had to work hard to reconcile what policy asked me to do with what I felt was good practice. Now that I am moving back to research, my past three years in K-12 remain on my mind, even as my former students and colleagues go through this school year without me.
This past August, as I was preparing for my new role, my email lit up with an announcement that my Tennessee Value-Added Assessment System (TVAAS) scores for my U.S. History End of Course exams (EOCs) were in. I could not help myself, so I logged in to see the proclamation that psychometric formulas had determined that I was, indeed, an effective teacher. And then I did something that surprised me and would likely surprise anyone who knows my personal feelings about test-based accountability policies: I smiled.
Evaluation as motivator?
You see, William Firestone (2014) was not wrong when he described the motivational mechanisms behind teacher evaluation policy. Sure, any external threat of job loss was gone because I was already leaving teaching (for a second time), but that internal pleasure was there. It just feels good to be told you are doing a good job. Even with all my knowledge of the many, many limitations of value-added models like TVAAS (see Goldhaber, Goldschmidt, & Tseng, 2013; Grissom & Youngs, 2016; Smith & Kubacka, 2017), it still feels nice to see that yes, psychometrics agree that I am among the OK-est of the profession.
I beamed. I am so proud of my former students for many reasons. But then I remembered the end of the spring semester, when I was given students’ raw scores to use for their final grades. A complicated “cubed root” formula turned their raw scores into a palatable grade to be recorded as each student’s final. So I knew that my students who had A’s on the final had actually gotten about 20% of the questions wrong. Even my students who got only about half the questions right earned C’s on their finals. This gap demonstrated to me that there was a misalignment among the standards I was supposed to teach, the content of the test, and how the scores related to student mastery of the subject.
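The exact conversion was never shared with teachers beyond its “cubed root” name, but a simple cube-root rescaling of the raw percentage reproduces the pattern described above. The scaling below (grade = 100 × (raw/100)^(1/3)) is an assumption for illustration, not the state’s published formula:

```python
# A plausible sketch of a "cubed root" grade conversion.
# The exact formula is an assumption: grade = 100 * (raw/100) ** (1/3).

def cube_root_grade(raw_percent: float) -> float:
    """Convert a raw test percentage (0-100) to a scaled final grade."""
    return 100 * (raw_percent / 100) ** (1 / 3)

for raw in (80, 50):
    print(f"raw {raw}% -> final grade {cube_root_grade(raw):.1f}")
# Under this assumed scaling, 80% raw maps to roughly 92.8 (an A)
# and 50% raw to roughly 79.4 (a C), matching the gaps described above.
```

Because the cube root compresses the low end of the scale hard, large differences in raw mastery shrink into small differences in recorded grades.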
The EOC scores didn’t tell me anything I didn’t already know. Each of my students scored exactly where I expected. I knew this because I conducted my own classroom assessments in varied and formative ways, but also because my school required us to give three benchmark exams during the year. Teachers were told that whatever a student scored on the benchmark, they would score on the EOC. And they did! But that almost perfect correlation tells me the benchmark tests did nothing to move the needle.
If the benchmark data had offered meaningful, actionable feedback I could use as a teacher, my EOC scores should have been much higher than the benchmarks predicted. But the truth is, the benchmark data did not tell me anything my own assessments had not already revealed. If the benchmarks had utility beyond what I was already doing, I should have been able to use that data to blow the EOC scores out of the water. The fact that I couldn’t does not mean I’m a bad teacher. Even TVAAS said there was “moderate to high” evidence I was effective! It just means the benchmarks were an exercise in confirming what was already going to happen: a crystal ball displaying a future I was unable to change.
Make it make sense
We have models so sophisticated that it takes months for a teacher to learn whether a test determined they were effective at a job they had done all year. And this test, which was meant to assess an entire course but was given before the final 10% of the course was over, took less than an hour for most students to complete. One hour to encapsulate a whole year!
The results said I was effective, which made me feel good, but what else do they mean? What do they mean to my instruction? To my career choices? To my students?
We have the means to create reliable measures. We can accurately predict how a student will do on a standardized test. We can input our data into formulas and do mathematical gymnastics to be able to proclaim, “This says I’m OK.” But really, what does it mean? When are we going to get real about how we evaluate the success of our schools? When are we going to use measures that indicate whether kids are prepared to enter the world as productive citizens? Or has this very specific and narrow type of accountability become part of the tradition of schools, much like senior skip days and that one guy who asks too many questions at faculty meetings? Is this just something we are going to keep doing? Dare I ask, should we?
It’d be easy for teachers to simply dismiss scores such as this one if the stakes weren’t so high and the process weren’t so time-consuming. Teaching is complicated and hard. And yet, the ways we hold teachers and students accountable have become seemingly more complicated than the work itself. More important, these measures don’t tell us in what ways our schools are working (or not) and what to do about it. We need measures that tell us what we can change and whether the changes we make are producing the intended results. Reliable measures are great. But what we really need are measures that are real.
References
Firestone, W.A. (2014). Teacher evaluation policy and conflicting theories of motivation. Educational Researcher, 43(2).
Goldhaber, D., Goldschmidt, P., & Tseng, F. (2013). Teacher value-added at the high school level: Different models, different answers? Educational Evaluation and Policy Analysis, 35(2).
Grissom, J.A. & Youngs, P. (2016). Improving teacher evaluation systems: Making the most of multiple measures. Teachers College Press.
Smith, W.C. & Kubacka, K. (2017). The emphasis of student test scores in teacher appraisal systems. Education Policy Analysis Archives, 25(86), 1-29.
This article appears in the December 2022/January 2023 issue of Kappan, Vol. 104, No. 4, pp. 66-67.
ABOUT THE AUTHOR

Amanda S. Frasier
Amanda S. Frasier is an assistant professor in curriculum and instruction at East Tennessee State University, Johnson City, TN.

