Educators must make important decisions using complicated assessment data, but they don’t always have the expertise to interpret the data. How can they respond to this validity paradox?

Whether used for accountability, diagnostic, or achievement purposes, standardized assessments are common in K-12 education. Underlying each assessment score is a technically complex series of calculations, algorithms, and models. As the processes used to calculate scores have become increasingly sophisticated, administrators and schools have been expected to use the resulting data to make more and more evidence-based decisions. This sets up an unresolved problem that we call the validity paradox. The validity paradox arises when those asked to make valid decisions with data lack the technical knowledge to understand the nuances of data quality, assessment aims, and statistical limitations that should shape how they interpret the data.

The validity paradox is a critical problem. Every year, stories emerge regarding the unintended consequences of how education systems use standardized assessment results. Many of these consequences come down to leaders making unsupportable inferences from test results. We see this when leaders misuse results to rank systems, schools, and individual teachers; narrow the curriculum to focus only on what is being assessed; and overemphasize scores in making decisions about student progression. These mistakes can have lifelong effects on individual students and teachers. A key factor in these unintended consequences is misunderstanding or overinterpretation of data, often exacerbated by the fact that educators have little help available to make decisions that can require highly technical skills.

Understanding inferences

To use test data in more logical, and more modest, ways, we need to focus on the inferences being made from the data. Valid use of testing data relies on making correct inferences about what the data reveal. Validity is arguably the most important aspect of assessment (Popham, 2016). However, the models used to generate valid data for decision making tend to require detailed technical knowledge of educational measurement.

For example, assessment experts Michael Kane (2013, 2021) and James Popham (2016) both state that the first step in building a validity argument is determining what kinds of inferences you can make from testing data. Without the technical skills to make these judgments, it appears impossible for educators to proceed. Popham (2016) acknowledges this limitation by suggesting that schools and districts employ the services of a “reasonable-fee assessment expert” who can “explain the relevant evidence and validation argument for a particular test” (p. 50).

Unfortunately, not all schools have access to these experts, nor is the funding readily available in schools in disadvantaged contexts. Furthermore, the rise of commercial ventures selling tests to schools means that some assessment experts may represent the interests of the contracted testing companies. Finally, there simply are not enough measurement and assessment experts to meet the demand. We need a different solution: a model that enables educators to make logical arguments about what test data tell us before making decisions based on that data.

Popham (2016) tells us that “validity flows from human judgment about the persuasiveness of a particular validity argument and the evidence on which that argument has been fashioned” (p. 46). Keeping that maxim in mind, we propose a model designed to help principals, classroom teachers, and other education professionals understand what testing data reveal so their arguments about the data have validity. This model has limitations, but we shouldn’t let perfect be the enemy of good. We believe that, given the frequent misuse of data, moving closer to greater validity is a clear good.

Validity theory, in brief

Validity is a complex, theoretical field that has changed significantly over time (Newton & Shaw, 2014). While academics have debated, and continue to debate, the finer points of validity, the 2014 Standards for Educational and Psychological Testing by the American Educational Research Association (AERA), American Psychological Association (APA), and National Council on Measurement in Education (NCME) represents something of a consensus in the field. The standards make two key theoretical claims:

  • Every claim made with test results has to be argued for, and the onus is on those making the claim to provide evidence (p. 13).
  • Validity requires “constructing and evaluating arguments for and against the intended interpretation of test scores and their relevance to the proposed use” (p. 11).

This theoretical perspective informs our view that teachers and school leaders need a process to assist in developing and supporting arguments for validity.

Validity scholars such as Lee J. Cronbach (1980) argue that the key to any validity argument is not finding evidence to support an interpretation, but rather attempting to falsify it. If there is no compelling evidence (or argument) that undermines or falsifies an inference, that inference is reasonably valid. This is not a precise or perfect measure, but it has pragmatic value.

As an example, consider using an end-of-year (EOY) accountability assessment for teacher evaluation and retention decisions. Teachers whose students’ average test scores are in the bottom quartile of state results are identified as needing remediation or potentially flagged for dismissal. The implied inference is that student achievement on EOY assessments directly reflects teacher quality. This inference is based on several assumptions:

  • Teachers are fully responsible for student outputs and no other outside factors contribute to student scores.
  • The assessment results are an accurate and comprehensive measure of the implemented curriculum.
  • All children have had an opportunity to learn the curriculum.
  • All students are motivated and do their best on the assessment.

Because of the high stakes of the inference about teacher quality, decision makers should interrogate each of these assumptions. This is where we confront the validity paradox. Everyone involved in developing, administering, and using tests might advocate that the results be used in valid ways, but the standards for determining validity are written for a specialist audience of assessment experts rather than for pedagogical or domain experts.

Trusting published technical reporting

One limitation of our model is that it requires educators to assume that the tests they are using are reliable (that is, that they consistently measure what they claim to measure), based on information in the test's technical report. This comes with an important caveat: It depends on there being a freely available technical report.

As its name suggests, a technical report is the compendium of technical information about a test: how it is constructed, administered, and scored, and whether it has met acceptable technical standards. In the past, tests were modeled in a way that enabled developers to report their reliability using a single measure that was intuitive to interpret. However, as test theory has become more robust and complex, the reliability of a test or test item can no longer be easily captured in a single number, and technical reports have become more challenging to understand. Worse, some contractors do not make their technical reports readily available.
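
The article does not name a particular statistic, but one classical example of a single-number reliability index is Cronbach's alpha, which summarizes how consistently a set of test items measures the same thing. The sketch below, written in Python with entirely hypothetical item scores, shows how such a coefficient is computed.

  import numpy as np

  # Hypothetical item-score matrix: 6 students x 4 items (illustrative only).
  scores = np.array([
      [1, 0, 1, 1],
      [0, 0, 1, 0],
      [1, 1, 1, 1],
      [0, 1, 0, 1],
      [1, 1, 0, 1],
      [0, 0, 0, 0],
  ], dtype=float)

  k = scores.shape[1]                          # number of items
  item_var = scores.var(axis=0, ddof=1)        # variance of each item
  total_var = scores.sum(axis=1).var(ddof=1)   # variance of students' total scores

  # Cronbach's alpha: a classical single-number estimate of reliability.
  alpha = (k / (k - 1)) * (1 - item_var.sum() / total_var)
  print(f"Cronbach's alpha: {alpha:.2f}")

A single figure like this is easy to read at a glance; more modern approaches, such as item response models, typically report precision that varies across the score scale, which is part of why technical reports have become harder for non-specialists to interpret.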

If a technical report is available, practitioners can turn to the report to understand what the test intends to measure and verify that experts agree that the test reliably measures what it intends to measure. This leads to an obvious conclusion: State, district, and school leaders must insist that testing contractors make their technical reports available before signing a contract. Leaders should ask to see the most recent technical report before procuring a test, and they should ask how technical reports are scrutinized and how frequently they are updated. If a company cannot provide a report that has been evaluated by experts and is regularly updated, we would advise leaders not to purchase its products, because intentional use of the data will be almost impossible.

The technical report is invaluable because most practitioners do not have the expertise to determine a test’s reliability. A technical report establishes confidence in the test and allows the psychometric community to interrogate it. Without a published technical report that assures the reliability of a test, any attempt to make an argument for validity is practically impossible. If there is a recent, published technical report, our view is that denying teachers and school leaders the opportunity to argue for and against the validity of certain inferences from test data is a bigger problem than the risk inherent in trusting experts’ determinations of reliability.

If we can be satisfied by this imperfect but pragmatic argument, then it is possible for practitioners to mount a validity argument. The validity argument focuses on finding and weighing evidence to support logical (and falsifiable) claims. Because it enables educators to avoid becoming mired in sophisticated technical modeling, it is a pragmatic response to the validity paradox and has the best chance of being useful for the greatest number of educators.

Modeling logical claims to inform decisions

Our proposed model is a framework through which practitioners attempt to falsify every inference or intended use of data. It encourages the intentional use of data by requiring educators to carefully consider whether claims logically follow from the test result. Specifically, the model, shown in Figure 1, is intended to help educators ensure that they, and others, are using large-scale assessment data appropriately.

Step 1: Clearly outline the claim.

Before students take a test, it should be clear in each school and classroom what the test data will be used for. Starting with data and then deciding how to use it is ripe for poor decision making. We suggest that schools clearly specify to their community how they will use test data at the start of each school year.

Knowing the purpose of the assessment also will enable educators to clearly outline a claim, or inference, using the assessment results. Clearly outlining the claim allows it to be matched against the technical report to establish whether this is what the test is designed for or whether it is a new inference. If the stated claim is a new inference, educators will then need to validate it by mounting a logical argument in which they attempt to falsify or disprove the statement.

If we return to the example regarding the use of an EOY accountability assessment for teacher evaluation and retention decisions, the inference is that the EOY assessment is a reliable measure of teacher input (or effectiveness). A look at the technical report finds that the EOY test is validated as a measure of statewide student achievement in literacy and numeracy, so the inference for teacher effectiveness is a new one that requires a new validity argument.

Step 2: Outline the opposite/falsifying claim.

After we’ve identified an inference, the next step in the process is to think through the opposite claim and what evidence might be used to support and/or falsify the inference. This step, which relies on Stephen Toulmin’s (2003) presumptive logic model, is the heart of the validity argument.

Returning to our example, the original inference was that the EOY assessment is a reliable measure of teacher input (or effectiveness). The opposite claim is that the EOY assessment is not a reliable measure of teacher input (or effectiveness).

Step 3: Find the evidence.

For each of the above claims, we must search for evidence to back these claims. Evidence could come from a range of sources: peer-reviewed journal articles, books and book chapters, research repositories, assessment frameworks, published reports, and technical reports. There is a presumptive logic at work in this evidentiary stage: If an inference (either the intended inference or the falsification inference) is true, then we should expect to see evidence that proves or justifies that inference.

It is important to note three things:

  • Research can be of varying quality. For the non-researcher, quality can be difficult to judge, so the best strategy is to gather evidence from a range of sources and to err on the side of collecting more evidence to test the falsification inference than the original inference.
  • Much evidence can be hidden behind paywalls. Those who have institutional access, such as through a university, can easily find the research. If you do not have access, researchers are usually allowed to share a copy of their published research with anyone who requests it. Don’t feel embarrassed to ask. Most researchers are thrilled to think that their work is being used, especially by practitioners.
  • A single body of research may include different conclusions, depending on the types of questions asked and the methodologies used. Don’t just try to find the one published study that supports the desired inference and ignore the multiple studies that arrive at different conclusions.

If we return to the inference that an EOY assessment is a reliable measure of teacher input (or effectiveness), one obvious way to support or falsify the claim concerns the likelihood of error in a score. As Margaret Wu (2016) explains, a single test is just a sample of a student’s ability:

Depending on the particular set of questions selected . . . a student may perform better or worse, so there will be a variation in test scores should similar tests be administered to the same student. This variability in test scores on similar tests for a student is called measurement error. (p. 21)

The measurement error associated with an individual student is likely to be significant, and the error associated with a class is invariably large enough to suggest caution in inferring teacher quality, even before factoring in other sources of error. This finding, then, tentatively supports the opposite inference that the EOY assessment is not a reliable measure of teacher input (or effectiveness).
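
To give a sense of the scale of this error, here is a minimal sketch under classical test theory. Every number is hypothetical (the article reports no scale SD, reliability coefficient, or class size); the point is only that the uncertainty around both an individual score and a class average is far from trivial.

  import math

  # Hypothetical values, for illustration only.
  sd = 100.0          # standard deviation of the test's score scale
  reliability = 0.90  # reported reliability coefficient
  class_size = 25     # number of students in a class

  # Classical test theory: standard error of measurement for one student's score.
  sem = sd * math.sqrt(1 - reliability)   # about 32 points
  band_95 = 1.96 * sem                    # about +/- 62 points on a similar test

  # Treating a class as a small sample of the students a teacher might have
  # taught, the class mean also varies from cohort to cohort by roughly:
  se_class_mean = sd / math.sqrt(class_size)   # about 20 points

  print(f"One student's score: within about +/-{band_95:.0f} points (95% band)")
  print(f"Cohort-to-cohort swing in a class mean: about +/-{se_class_mean:.0f} points")

Under these made-up numbers, a class that sits near a quartile boundary could land on either side of it from one cohort to the next, which is exactly the kind of caution the falsification inference asks us to take seriously.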

Step 4: Consider the stakes.

Where stakes are significant, the quality and scale of the evidence for an interpretation need to be overwhelming. For this reason, it’s important that we pause and consider the stakes before proceeding to make a final decision about the validity of certain inferences. Where test scores have high stakes, the onus must be on those making the judgment to justify it. This remains important even when the intended use of the test matches the aims of the assessment as outlined in the technical report.

For example, the decision to use EOY assessment results for teacher evaluation and retention decisions is extremely high stakes for those teachers. There must be a great deal of strong evidence to support that inference. The issue of error in measures of achievement means that a claim about a teacher’s effectiveness using an EOY assessment is problematic.

However, more “modest claims” that carry low stakes “may not require much evidence for validation” (Kane, 2013, p. 456). For example, if the results of the test were used to diagnose gaps in student achievement and determine what concepts teachers should revisit with which students, the evidence needed is much more modest.

Step 5: Decide on the most valid interpretation.

Now, with the evidence for and against the inference in hand, and a sense of the stakes involved, we’re ready to make a judgment about whether the original claim or the opposite claim is best supported by the gathered evidence. If the evidence tends to support the opposite claim (the falsification inference), then the intended inference has low validity and is difficult to justify. On the other hand, if the attempt to falsify the inference has little support, we can be more confident in the intended use of the data.

If there is no evidence available for either the intended or the falsification inference, we should be extremely cautious about following through with the intended inference. This is particularly important when the stakes are high. Similarly, when the evidence, once weighed, is relatively balanced between the original inference and the falsification inference, caution is needed.
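
One way to make the decision logic of Steps 4 and 5 concrete is as a rough decision rule, sketched below in Python. The evidence weights and thresholds are placeholders of our own; in practice this weighing is a professional judgment, not a calculation.

  def judge_inference(support_for, support_against, high_stakes):
      """Illustrative verdict given rough evidence weights (0 to 1) and the stakes."""
      if support_for == 0 and support_against == 0:
          return "No evidence either way: be extremely cautious, especially if stakes are high."
      if support_against > support_for:
          return "The falsification inference is better supported: the intended use has low validity."
      if abs(support_for - support_against) < 0.1:
          return "Evidence is roughly balanced: caution is needed."
      if high_stakes and support_for < 0.8:
          return "High stakes demand overwhelming evidence: do not act on this inference yet."
      return "The intended inference is reasonably supported; revisit as new evidence emerges."

  # Example: the EOY-as-teacher-effectiveness inference from the running example.
  print(judge_inference(support_for=0.2, support_against=0.7, high_stakes=True))

Run on the running example (weak support for the claim, stronger support against it, high stakes), the rule returns the same low-validity verdict reached above.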

Educators can and should revisit any decisions made using this process as new evidence emerges.

A method for responsible data use

We have spent years talking with teachers, school leaders, unions, and policy makers about how to become more intentional with test data (Sellar, Thompson, & Rutkowski, 2017). For us, the validity paradox is challenging but not insurmountable. We agree with James Popham (2016) that it is “desirable that teachers mentally engage in some serious [validity] thinking” (p. 49). Further, we believe that our model enables educators to overcome their lack of technical assessment expertise and engage in some basic thinking about validity. Our model has obvious limitations, not least of which are the need to trust in a test’s reliability and the ability to access and evaluate research. But if we want to equip teachers and school leaders with tools to make better decisions with assessment data, our model has much to offer.

Validity evidence often can be very technical; the onus is on those creating, marketing, and selling standardized assessments to provide evidence of their rigor and utility. Further, these test creators should present their evidence in a way that practitioners can understand. This is especially true if a testing organization makes the claim that a particular use is valid. If an educator or an institution wishes to make a claim outside of the original test’s purpose, the responsibility is on the educator to either have the skills, or to work with others who have the skills, to support the claim through an attempt to falsify the inference.

This model is intended for practicing educators, not technical experts, and educators need not be assessment specialists to use it. It is essential that technical experts keep up their end of the bargain by continuing to interrogate test characteristics and to publish their findings in reports, journal articles, and conference presentations. This will enable practitioners to trust the research evidence they collect. Assessment experts will see their work being put to good use, and K-12 educators will be able to make decisions in ways more likely to lead to real improvement.


References

American Educational Research Association, American Psychological Association, & National Council on Measurement in Education. (2014). Standards for Educational and Psychological Testing. American Educational Research Association.

Cronbach, L.J. (1980). Validity on parole: How can we go straight? In W.B. Schrader (Ed.), New directions for testing and measurement: No. 5 (pp. 99-108). Jossey-Bass.

Kane, M. (2013). The argument-based approach to validation. School Psychology Review, 42(4), 448-457.

Kane, M.T. (2021). Articulating a validity argument. In G. Fulcher & L. Harding (Eds.), The Routledge handbook of language testing (pp. 32-47). Routledge.

Newton, P. & Shaw, S. (2014). Validity in educational and psychological assessment. SAGE Publishing.

Popham, W.J. (2016). The ABCs of educational testing: Demystifying the tools that shape our schools. SAGE Publishing.

Sellar, S., Thompson, G., & Rutkowski, D. (2017). The global education race: Taking the measure of PISA and international testing. Brush Education.

Toulmin, S.E. (2003). The uses of argument. Cambridge University Press.

Wu, M. (2016). What national testing data can tell us. In B. Lingard, S. Sellar, & G. Thompson (Eds.), National testing in schools (pp. 18-29). Routledge.


This article appears in the March 2023 issue of Kappan, Vol. 104, No. 6, pp. 34-39.

ABOUT THE AUTHORS


David Rutkowski

DAVID RUTKOWSKI is an assistant professor of education leadership and policy studies in the School of Education at Indiana University, Bloomington, Ind.


Greg Thompson

GREG THOMPSON is a professor of education research in the School of Teacher Education and Leadership at Queensland University of Technology, Brisbane, Australia.


Leslie Rutkowski

LESLIE RUTKOWSKI is a professor of research methods in the School of Education at Indiana University, Bloomington.