How effective are the most popular teacher observation instruments at determining which teachers need to be voted off the island and which deserve immunity?
At a Glance
- For teachers, high-stakes evaluations can feel like trying to keep your footing in a reality game show like “Survivor.”
- Teacher observation models enable the education version of a tribal council to make decisions about who gets a reward and who gets voted out.
- Observation models should produce reliable, valid, fair, and bias-free results so teachers are not blindsided.
- A review of research into the most commonly used observation instruments shows that they can help teachers improve but should be used with caution when making high-stakes decisions.
On the popular reality game show “Survivor,” soon entering its 50th season, contestants are dropped onto remote islands and sorted into tribes. They must construct camps using natural resources, build fires, and find food. They earn tools and other rewards as they compete in physically and mentally demanding challenges. The prizes help them survive. Losers are voted off their team and off the island.
How do group members decide whom to exile? It can be based on someone’s poor survival skills or lack of athletic, cerebral, or psychological abilities. Other times, it is because of rumors, internal conflict, or one person or group putting another on the chopping block for real or perceived mistakes or missteps.
Teacher evaluations are not so different. Observers — supervisors, principals, peers, or mentor teachers — frequently assess classroom performance based on incomplete and often subjective perceptions.
Surviving the game
Over the last two decades, the teaching profession has become more competitive and high stakes (akin to surviving). In 2011, the U.S. Department of Education created Race to the Top (RttT), a grant competition that used federal funds as incentives for districts, asking or requiring educators to compete with their colleagues on teaching effectiveness scales (e.g., highly effective, effective, developing, ineffective). Teachers’ standings were determined in two ways:
- Growth or value-added measurement (VAM), which purportedly showed the extent to which teachers caused gains in their students’ test scores.
- Observations, through which teachers demonstrated how effectively they imparted content knowledge to their students.
Initially, teacher observation models were often weighted less heavily than VAMs, but after the Every Student Succeeds Act (ESSA, 2015) replaced RttT, observational systems became power players (Close, Amrein-Beardsley, & Collins, 2020).
While observation models have been used for decades to measure teacher effects objectively, their failures have been acknowledged since before RttT. For example, The Widget Effect (Weisberg et al., 2009) demonstrated that over 99% of teachers across the U.S. were rated as satisfactory, despite the intent of observation systems to distinguish across levels of teacher effectiveness and to offset the alleged biases of the observers handing out scores (see also Kraft & Gilmour, 2017). One of the intents of RttT was to enable states to more objectively identify teachers who might outwit, outplay, and outlast their colleagues so they could win the highest teacher effectiveness scores every year (Amrein-Beardsley & Geiger, 2019).
“Survivor” has shown us how strategy, perception, and limited information shape high-stakes outcomes. Teacher evaluations have similarly relied on observational (and often competitive) snapshots and shifting metrics. Yet, unlike a reality show, the stakes for teachers involve their careers and, perhaps more importantly, student learning outcomes. Given these stakes, it is important to explore what we know about these systems and what they really tell us about teacher performance.
A torchlight on teacher observations
We recently completed a systematic literature review of three teacher observation models (Amrein-Beardsley et al., 2025). Specifically, we studied the three most often used models across the U.S. (Close, Amrein-Beardsley, & Collins, 2020):
- The Danielson Framework for Teaching
- The Marzano Teacher Evaluation Model
- The TAP System for Teacher and Student Achievement
Although these models are distinctly different, they have enough in common to be treated as coming from essentially the same island. Each model has a set of criteria on which teachers are observed and assessed in practice. How well teachers implement these criteria is determined by their tribal councils (e.g., supervisors, principals, peers, mentors). Teachers and their observers attend training sessions to learn how the criteria are defined, how to measure them objectively, and how points are earned or awarded (Sawchuk, 2016). Instead of “Survivor” challenges, teachers face multiple observation visits, both announced and unannounced.
Each observation model has a different scoring system. Scores are generally reported back to teachers, and teachers have discussions with members of their tribal councils to determine how to improve. To encourage teachers to take their evaluations seriously, positive and negative consequences are attached to teachers’ observation scores. Teachers with consistently high scores may win rewards, such as pay raises or bonuses, with the most desired prize being immunity (i.e., tenure or a renewed contract). Each prize secured helps a survivor advance further.
Teachers with consistently poor scores may face pay cuts, transfers, mandatory professional development, a revocation of tenure, or even termination — voted off their professional islands (Amrein-Beardsley & Close, 2019; Education Week, 2014; Paige & Amrein-Beardsley, 2020).
Observation systems do have the potential to transform learning at systemwide levels. If used as designed, they can help with curriculum refinement, lesson planning, and professional learning. Documented positive consequences of these (and likely other) observation models include increased teacher confidence in their schools, morale, self-efficacy, and motivation. The feedback teachers receive on their instructional techniques can help them improve their teaching strategies, provide higher-quality instruction, and ultimately increase student learning and achievement.
Proceeding with caution
However, just as the reality show survivors need to know how to traverse uncertain and sometimes rugged territories, teachers and their tribal councils need to cautiously review their observation instruments to verify that the results they yield are what they intended to measure.
Unlike on a competition show like “Survivor,” teachers should be evaluated objectively against the standards or criteria in play, not in norm-referenced or relative ways whereby teachers are forced to compete with one another (Amrein-Beardsley & Close, 2019; Amrein-Beardsley & Geiger, 2019). Indeed, the success of such evaluation practices, in support of instructional practice, curriculum development, and transformational learning, depends on appropriate assessment rules and conditions.
Putting this more technically, those using teacher observation systems must ensure that the results are reliable and valid and neither biased nor unfair. The goal is for all who are observed to be consistently, accurately, judiciously, and fairly assessed.
Thriving, not just surviving
Tribal council members on “Survivor” are allowed to use whatever measures they choose when casting their vote for who should leave the island and who should win the game. A teacher’s tribal council, however, has a responsibility to use measures that live up to certain standards, such as those in the Standards for Educational and Psychological Testing, published by the American Educational Research Association (AERA et al., 2014). Our review reveals how well the top observation models live up to these criteria.
Reliability
Observation models need to yield reliable results, meaning that an individual teacher’s observation scores should be consistent across observers, occasions, and contexts. Even with training for teachers and observers, the three most popular observation models still have reliability issues. However, reliability is highest when teachers are observed multiple times throughout the year, as the sketch below illustrates. Unfortunately, not all schools have the time or resources to conduct multiple observations of teachers throughout each school year.
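To see why repeated visits matter, classical test theory offers the Spearman-Brown prophecy formula, which projects the reliability of an average of several equally reliable observations. Below is a minimal sketch in Python; the single-visit reliability of 0.45 is a hypothetical value chosen for illustration, not an estimate from our review.

```python
# Spearman-Brown prophecy formula: the reliability of the average of k
# equally reliable ("parallel") observations of the same teacher.

def spearman_brown(r_single: float, k: int) -> float:
    """Projected reliability of the mean of k parallel observations."""
    return k * r_single / (1 + (k - 1) * r_single)

r_single = 0.45  # hypothetical reliability of a single classroom visit
for k in (1, 2, 4, 6):
    print(f"{k} visit(s): projected reliability = {spearman_brown(r_single, k):.2f}")
```

Under this hypothetical starting point, averaging four visits lifts the projected reliability from 0.45 to roughly 0.77, which is why multiple observations per year are so often recommended.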
Validity
Another vital test any observation model needs to pass relates to validity: whether a measurement (in this case, an observation score) supports the interpretation it was intended to capture (e.g., teacher effectiveness). It does not make sense to require a tribe to paddle a canoe from a dock to a shore as fast as they can and then penalize them because they neglected to climb a tree. Here, the question is whether the observation score supports accurate interpretations about teachers’ actual levels of effectiveness.
Surprisingly, for the three models we examined, very little research has been done to test their validity. Developers of such systems have shown that the systems are research-based, meaning they drew on research about teaching and learning that aligns with reason and common sense to build the instruments. However, developers and researchers have not yet adequately studied the validity of the systems themselves. Those who have tried have examined the relationships between teachers’ observation scores and their VAM scores and found only weak relationships, along the lines of the sketch below.
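As a rough illustration of what “only weak relationships” looks like, the following sketch simulates observation and VAM scores as two noisy readings of the same underlying effectiveness construct. The loading of 0.45 and the noise level are hypothetical values chosen so that the two measures correlate only weakly; none of these numbers come from the studies we reviewed.

```python
# Simulated validity check: correlate teachers' observation scores with
# their VAM scores. Both are modeled as noisy readings of one underlying
# construct, with hypothetical parameters chosen to yield a weak r.
import numpy as np

rng = np.random.default_rng(0)
n_teachers = 500
true_effectiveness = rng.normal(size=n_teachers)

obs_scores = 0.45 * true_effectiveness + rng.normal(scale=1.0, size=n_teachers)
vam_scores = 0.45 * true_effectiveness + rng.normal(scale=1.0, size=n_teachers)

r = np.corrcoef(obs_scores, vam_scores)[0, 1]
print(f"Observation-VAM correlation: r = {r:.2f}")  # weak, by construction
```

Even though both simulated measures track the same construct, their correlation lands near 0.17, the kind of weak relationship that makes it hard to claim either score validly captures effectiveness.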
This lack of validity evidence is especially important to stress when high-stakes consequences are attached to such measurements. Decisions related to observations are the leading reason states and school districts across the U.S. have faced lawsuits in which they have had to defend their decisions, and they have most often lost these challenges (Education Week, 2014; see also Amrein-Beardsley & Close, 2019).
Free of bias
Bias is a subset of validity. It does not make sense to score a survivor during a challenge based on something they said to another tribe member rather than on the task at hand. That is bias. Bias is introduced into any measurement, including an observation model, when outside factors erroneously impact scores, especially when some scores are impacted more than others.
Undeniably, users of the three observation models we examined had some explicit (and perhaps implicit) biases. Researchers in some studies saw bias when two teachers were observed using the same criteria but with different sets of students. If observers compare a teacher whose class has a greater proportion of students with special learning needs to an equally effective teacher whose class has fewer such students, the latter teacher will be significantly more likely to get higher observation scores, as the simulation below illustrates.
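The sketch below simulates this scenario under stated assumptions: two teachers with identical true skill, scored on the same rubric, where class composition leaks into the score. The 0.5-point penalty per unit share of students with special learning needs is entirely hypothetical; the point is only that equal skill plus unequal classes yields unequal scores.

```python
# Simulated scoring bias: two equally effective teachers, same rubric,
# different shares of students with special learning needs. The
# composition penalty is a hypothetical bias term, not an estimate.
import random

random.seed(1)

def observed_score(true_skill: float, share_special_needs: float) -> float:
    """Rubric score contaminated by class composition, not just skill."""
    composition_penalty = 0.5 * share_special_needs  # hypothetical
    noise = random.gauss(0, 0.2)
    return true_skill - composition_penalty + noise

teacher_a = [observed_score(3.0, 0.40) for _ in range(1000)]  # 40% share
teacher_b = [observed_score(3.0, 0.10) for _ in range(1000)]  # 10% share

print(f"Teacher A mean score: {sum(teacher_a) / len(teacher_a):.2f}")
print(f"Teacher B mean score: {sum(teacher_b) / len(teacher_b):.2f}")
# Identical skill (3.0), yet Teacher B averages about 0.15 points higher.
```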
Fair to all
The consequences of using any measurement instrument need to be fair to all teachers. One of the biggest factors impeding fairness has to do with teachers in less mainstream positions, such as early elementary grades, elective subjects, and advanced courses. Teachers in these areas may be left out, penalized, or even outplayed when positive consequences are at stake (e.g., merit pay). Conversely, they may be lucky enough to be left holding the immunity idol when negative consequences are in play (e.g., teacher termination).
Simply put, these teachers do not feel they have equal opportunities to be fairly evaluated. This is especially true when the instruments used to evaluate them are generic and those doing the observing do not understand what is required to teach well in these less mainstream positions.
Blindsided
Adopting any of these observation tools to evaluate teachers could have unintended negative consequences, making them imperfect tools for determining who gets immunity and who gets voted off the island (Amrein-Beardsley et al., 2025).
It is nearly impossible to evaluate teachers on everything they do in practice, so what ends up mattering most is what is (or can be) measured. Many different practices increase student achievement, but model creators decide which ones count the most, and those tend to be the practices that can be directly observed. Consequently, many important but latent practices are disregarded.
The fact that students learn in different ways is another major concern. Two teachers may have the same abilities but different students and, therefore, be scored differently. To some extent, this means teachers’ observation scores can hinge on the luck of which students are assigned to which teachers.
The final vote
Despite the potential for unintended negative consequences, these observation models are solid and commonsensical. Even if they are currently more research-based than researched, using observation models — especially for low-stakes, formative purposes — is more empirically defensible than relying on students’ growth on large-scale standardized tests over time. Continuing to invest time and resources in observation systems that suit unique local needs is a step in the right direction.
However, there is still a great need for future research into the reliability and validity of these models, as well as their potential for bias, unfairness, and unintended negative consequences. This is especially true when these models are used to make high-stakes decisions that can negatively affect people’s careers and lives. Teachers should not be voted off the island without empirical evidence verifying that their career torch should be extinguished.
References
American Educational Research Association (AERA), American Psychological Association (APA), & National Council on Measurement in Education (NCME). (2014). Standards for educational and psychological testing. American Educational Research Association.
Amrein-Beardsley, A., & Close, K. (2019). Teacher-level value-added models (VAMs) on trial: Empirical and pragmatic issues of concern across five court cases. Educational Policy, 35(6), 1-42.
Amrein-Beardsley, A., & Geiger, T.J. (2019). Potential sources of invalidity when using teacher value-added and principal observational estimates: Artificial inflation, deflation, and conflation. Educational Assessment, Evaluation and Accountability, 31(4), 465-493.
Amrein-Beardsley, A., Stone, C., Tremblay, C.M., Beall, G.L., Mendhe, S., & Vecellio, A. (2025). A systematic literature review of the empirical research on three of the most popular, U.S.-based classroom observational systems. In S.P. Kelly (Ed.), Research handbook on classroom observation (pp. 39-62). Edward Elgar Publishing.
Close, K., Amrein-Beardsley, A., & Collins, C. (2020). Putting teacher evaluation systems on the map: An overview of states’ teacher evaluation systems post-Every Student Succeeds Act. Education Policy Analysis Archives, 28(1), 1-58.
Education Week. (2014). Teacher evaluation heads to the courts.
Every Student Succeeds Act (ESSA) of 2015, Pub. L. No. 114-95, 129 Stat. 1802 (2015).
Kraft, M.A., & Gilmour, A.F. (2017). Revisiting the Widget Effect: Teacher evaluation reforms and the distribution of teacher effectiveness. Educational Researcher, 46(5), 234-249.
Paige, M.A., & Amrein-Beardsley, A. (2020). “Houston, we have a lawsuit”: A cautionary tale for the implementation of value-added models (VAMs) for high-stakes employment decisions. Educational Researcher, 49(5).
Race to the Top Act of 2011, S. 844, 112th Congress (2011).
Sawchuk, S. (2016). Despite teacher-evaluation changes, the ‘Widget Effect’ is alive and well. Education Week.
Weisberg, D., Sexton, S., Mulhern, J., & Keeling, D. (2009). The Widget Effect: Our national failure to acknowledge and act on differences in teacher effectiveness. The New Teacher Project (TNTP).
This article appears in the Winter 2025 issue of Kappan, Vol. 107, No. 3-4.

ABOUT THE AUTHORS

Audrey Amrein-Beardsley
Audrey Amrein-Beardsley is a professor in the Educational Policy and Evaluation Program at Mary Lou Fulton Teachers College, Arizona State University. She is the author of Rethinking Value-Added Models in Education: Critical Perspectives on Tests and Assessment-Based Accountability and coeditor of Student Growth Measures in Policy and Practice: Intended and Unintended Consequences of High-Stakes Teacher Evaluations.

Courtney Stone
Courtney Stone is an assistant professor in the Department of Educational Psychology at the University of Illinois, Urbana-Champaign, where she co-directs the EvaLab and serves as a faculty affiliate for the Center for Culturally Responsive Evaluation and Assessment (CREA).

Grace L. Beall
Grace L. Beall is a doctoral student in educational policy and evaluation in the Mary Lou Fulton Teachers College at Arizona State University.
