The tests we use to evaluate student achievement may well be sound measures of what students know, but, at best, they are faulty indicators of how well they have been taught. 

 

For almost a full century, the mission of U.S. educational measurement has been to elicit test-takers’ scores so those scores can be compared with one another. This is a good and useful thing to do, particularly so in situations where the number of applicants exceeds the number of openings. To make a flock of important educational decisions, we need to identify our strongest and weakest performing students.  

The legitimacy of such test-based comparisons was firmly established way back in World War I — almost 100 years ago — when the government administered the Army Alpha intelligence test to about 1.75 million U.S. Army recruits in an effort to identify the most suitable candidates for officer training programs. Using the Alpha to provide comparative score interpretations was regarded as a smashing success and, although the test was clearly a measure of a test-taker’s aptitude, the makers of educational achievement tests soon emulated the Alpha’s focus on comparative score interpretations. Indeed, a number of the test-construction and test-refinement tactics used for today’s achievement tests can be traced back to the comparative assessment procedures associated with the Army Alpha. 

But tests capable of providing score comparisons aren’t necessarily tests that should be used to evaluate schools or teachers. Such evaluative applications of educational assessment, although similar in some ways to comparative applications of educational assessment, are fundamentally different. However, increasingly America’s educators are being evaluated on the basis of their students’ performances on tests that were created to yield comparative score interpretations rather than to measure instructional quality. This is a terrible mistake. 

This mistake is being made because of an erroneous but pervasive belief among Americans that schools are responsible for the knowledge and skills students display when responding to achievement tests. In some instances, this is accurate: Instruction in schools is responsible for certain skills and bodies of knowledge that are measured by today’s achievement tests.

Yet, what if the tests we traditionally employ to measure students’ achievement, because of those tests’ preoccupation with providing comparative score interpretations, also measure many things other than what students were taught in school? What if our traditional achievement tests, in an effort to produce the variance in total test scores that is so vital for comparative score interpretations, also measure test-takers’ status with respect to such variance-inducing factors as socioeconomic status and inherited academic aptitude? Clearly, such a confounding of causality would make those tests less appropriate for evaluating how well students have been taught. To what extent is a student’s performance on a traditional achievement test attributable to what was taught in school rather than what was brought to school? For many of today’s achievement tests, we just can’t tell.

I contend that the traditional way we build and burnish our educational achievement tests may lead to using those tests inappropriately to evaluate schools and teachers. The word may is intended to emphasize my conviction that the suitability of today’s traditional achievement tests for evaluative use has not been rigorously scrutinized. But it should be.

If one wishes to evaluate the performance of a school’s instructional staff or the performance of a particular teacher, then having evidence about students’ performances on almost any sort of achievement test would be better than relying on no achievement evidence at all. Thus, I’d certainly rather use student scores from the tests we now employ for such evaluative purposes than have no data about student achievement. But the choice isn’t between doing evaluations with flawed tests and doing them with no tests. Instead, our challenge is to carry out today’s increasingly high-stakes evaluations using the most appropriate tests. We can do a better job of evaluating our schools and teachers than today’s achievement tests allow.

The cornerstone of our assessment castle 

If you were to ask today’s educators — regardless of how much they actually knew about educational testing — to name the single most important concept in educational measurement, the most frequent response surely would be “validity.” That response, happily, turns out to be correct. Educational measurement is predicated on the conviction that by getting students to make overt responses to stimuli such as test items, educators can arrive at valid inferences about students’ covert knowledge and skills. Determining the covert based on the overt lies at the heart of all educational assessment.

However, tests are not valid or invalid. Instead, it’s the inference based on student test scores that is valid or invalid. Validity thus represents the accuracy of test-based inferences. Increasingly these days, assessment validity is regarded not only as a matter of the accuracy of a test-based inference but also as a matter of whether those inferences are used appropriately (Kane, 2013). Optimally, therefore, a test-based inference would be accurate, and educators would use that accurate inference to accomplish a suitable consequence, such as making sound educational decisions about students.

The validity of a score-based inference gets our test-usage ball rolling. If we can’t establish that test-takers’ performances lead to an accurate inference about what their scores signify, then the likelihood of making a sensible inference-based decision is definitely diminished. And this is where we currently are with respect to the tests we use to evaluate U.S. schools and teachers. Although educators have been urged — in some instances, statutorily required — to evaluate schools and teachers using student performances on educational tests, we have no meaningful evidence at hand indicating that these tests can accurately distinguish between well-taught and badly-taught students. This state of affairs is truly astonishing.

Instructional sensitivity 

Yes, our nation increasingly relies on student test scores, typically from standardized achievement tests, to arrive at inferences about the quality of instruction provided to those students. Yet, the evidence to support the accuracy of such score-based inferences about instructional quality is essentially nonexistent. Today’s educators are being asked to sidestep the most important tenet of educational measurement: the obligation to supply validity evidence regarding the interpretations and significant uses of an educational test’s results. Put differently, no evidence currently exists about these evaluative tests’ instructional sensitivity.

What is this “instructional sensitivity,” and how is it determined? The concept is quite straightforward: It refers to how accurately a test can distinguish between test takers who have been taught well and test takers who have been taught badly. Although a certain amount of definitional disagreement about instructional sensitivity can be found in the measurement community, the following definition reflects what most writers on this topic understand it to be:

Instructional sensitivity is the degree to which student performances on a test accurately reflect the quality of instruction specifically provided to promote student mastery of what’s being assessed (Popham, 2006). 

As you can see, this definition revolves around the “quality of instruction” insofar as it specifically contributes to “student mastery” of whatever the test is measuring. A test, then, can vary in the degree to which it is instructionally sensitive. We need not, therefore, classify a test as either totally sensitive or totally insensitive to instruction; instructional sensitivity is a continuous rather than a dichotomous variable. Our quest should be to determine a minimum threshold of instructional sensitivity for any test being used to evaluate the caliber of instruction. The more significant the stakes associated with a test’s use, the higher our acceptability threshold should be.

The instructional sensitivity of educational tests is not a new concept. More than 30 years ago, when the high-stakes accountability movement began to capture the attention of American educators, Haladyna and Roid (1981) described the role of instructional sensitivity in judging the merits of accountability tests.

Earlier still, when initial proponents of criterion-referenced measurement were attempting to sort out how to create and improve tests leading to criterion-referenced inferences, Cox (1971) and other measurement specialists tried to devise ways to maximize a test item’s sensitivity to instruction. But those early deliberations among advocates of criterion-referencing were focused almost exclusively on measurement challenges: how to build tests capable of yielding more valid criterion-referenced inferences. As the years tumbled by, the evaluative use of student test performances became more significant. For example, in the coming years, many American teachers will lose their jobs primarily because their students perform poorly on tests. The high-stakes decisions riding on student test scores have become higher still.

Despite the increased importance attached to evaluative test-based consequences, the attention given to the instructional sensitivity of the tests being used still ranges from trifling to nonexistent. Perhaps today’s inattention to tests’ instructional sensitivity simply stems from not knowing how to determine the degree of a test’s sensitivity to instructional quality. Yet we already have on hand a successful strategy drawn from our experience in reducing assessment bias in our important educational tests. Let’s look at the chief elements of that strategy.

Serious problems; serious responses 

Test makers routinely expend considerable attention on reducing assessment bias in significant educational tests. Diminishing assessment bias is a canon of good test building. It was not always thus. 

Go back to the 1960s and 1970s, and you’ll find that only perfunctory attention was given to reducing assessment bias, if any attention was given at all. This was quite understandable because we rarely analyzed test results to reveal differences among test-taker groups associated with gender, race, or ethnicity. But the rules of educational testing changed dramatically in the late ’70s, when a substantial number of states — dismayed by what they perceived to be the poor quality of their public schools — began linking high school graduation to passing minimum competency tests, which purported to show that students had basic skills in reading, mathematics, and sometimes writing.

Because those minimum competency tests were administered to all students in a state’s public schools and those scores were typically made public, we soon began to see astonishing disparities between the performances of racial groups as well as students drawn from different socioeconomic strata. Indeed, the difference in the racial pass rates on Florida’s diploma-denial tests triggered a class-action lawsuit in the precedent-setting Debra P. v. Turlington case (Popham & Lindheim, 1981). In that case, which even now remains the operative case law in such litigation, a federal appellate court affirmed that a violation of the U.S. Constitution occurs when students are denied a property right (such as a high school diploma) if they are tested using a test whose content had not been taught. The precipitating circumstance was that far more black than white students were failing the state’s basic skills test. The Debra P. litigation and similar disparities in racial pass rates elsewhere presented a serious problem to America’s educational measurement specialists. They quickly grasped the significance of the situation — and they set out to fix it. 

A two-pronged strategy 

Having recognized the legitimacy of complaints that the nation’s tests were biased against certain groups, the measurement community soon devised a two-tactic strategy to minimize such bias. The first tactic was a judgmental review, completed during the development of each test item, intended to identify and eliminate items thought to offend or unfairly penalize test takers because of personal characteristics such as gender or ethnicity. The second was an empirical analysis of actual student test performances, usually undertaken during the field-testing of new items, to spot items potentially contributing to a test’s assessment bias. The typical analytic approach that evolved after several years was to employ “differential item functioning” (DIF) techniques, in which items answered differently by different groups of test takers were isolated. Items identified by DIF as possibly biased were then modified or jettisoned before being used in an operational test. As a consequence of employing this two-tactic strategy over many years, we have witnessed a substantial reduction in the number of items on high-stakes tests that are biased against particular groups of test takers.
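
The empirical prong is most often operationalized with a statistic such as Mantel-Haenszel, which compares two groups’ odds of answering an item correctly after matching test takers on total score. What follows is a minimal sketch of such a DIF screen, written in Python; the data layout, the function names, and the 1.5 flagging cutoff are my illustrative assumptions (the cutoff approximates the familiar ETS “category C” convention, which in practice also requires a statistical significance check), not the procedure of any particular test maker.

import math
from collections import defaultdict

def mh_d_dif(examinees, item):
    """Mantel-Haenszel D-DIF for one item, stratifying on total score.

    examinees: list of dicts with keys
      'total' -- total test score (the matching variable)
      'group' -- 'ref' (reference group) or 'focal' (focal group)
      'items' -- dict mapping item id to 1 (correct) or 0 (incorrect)
    """
    # Per total-score stratum, a 2x2 table:
    # [ref correct, ref incorrect, focal correct, focal incorrect]
    strata = defaultdict(lambda: [0, 0, 0, 0])
    for e in examinees:
        right = e['items'][item]
        cell = strata[e['total']]
        if e['group'] == 'ref':
            cell[0] += right
            cell[1] += 1 - right
        else:
            cell[2] += right
            cell[3] += 1 - right

    num = den = 0.0
    for a, b, c, d in strata.values():
        n = a + b + c + d
        num += a * d / n   # reference right, focal wrong
        den += b * c / n   # reference wrong, focal right
    if num == 0 or den == 0:
        return None        # too little data to estimate
    alpha_mh = num / den                 # common odds ratio across strata
    return -2.35 * math.log(alpha_mh)    # ETS delta metric

def flag_items(examinees, item_ids, cutoff=1.5):
    """Return items whose |MH D-DIF| reaches the cutoff."""
    flagged = []
    for item in item_ids:
        d = mh_d_dif(examinees, item)
        if d is not None and abs(d) >= cutoff:
            flagged.append(item)
    return flagged

Items returned by such a screen would go back to the review committee for modification or removal, mirroring the modify-or-jettison step described above.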

The actual procedures for these two approaches to reducing assessment bias are now well known among measurement specialists. While their use may not have eliminated assessment bias, the marked impact of these procedures on reducing assessment bias is undisputed.

Benign borrowing 

The methodological strategy we could employ in reducing the instructional insensitivity of today’s evaluatively oriented achievement tests might be nothing more than a straight-out lift from what has been used in reducing assessment bias — a blend of judgmental and empirical procedures.

Although we don’t have a definite, well-honed set of procedures for dealing with the instructional sensitivity of tests, the essential elements of an attack on this problem could be derived from work in minimizing assessment bias. For example, the charge to be issued when asking seasoned educators to scrutinize test items for instructional insensitivity could be quite similar to the language employed when asking a committee of bias reviewers to look for biased elements in test items. A review committee of teachers thoroughly familiar with the content and age-levels of students to be tested could be oriented to their item-review responsibilities by learning about the most likely ways an item might be instructionally insensitive. Reviewers could then be given the following charge and asked to render a per-item judgment regarding each item intended for inclusion in a high-stakes evaluative test:   

Attention reviewers: Please note the specific curricular aim, which, according to the test’s developers, this item is assessing. Only then, answer the following question: If a teacher has provided reasonably effective instruction to promote student mastery of the specific curricular aim being assessed, is it likely that the bulk of the teacher’s students will answer this item correctly? Choose one: YES, NO, NOT SURE (Popham, 2014, p. 397). 

Items for which a specified proportion of reviewers have supplied negative or not-sure responses would then be scrutinized to discern whether they embody elements apt to render them instructionally insensitive. Such items would be revised or removed.
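
To make that flagging rule concrete, here is a minimal sketch, again in Python, assuming a simple layout in which each item maps to the committee’s recorded YES, NO, and NOT SURE judgments; the 20% threshold is an illustrative assumption, not a recommended value.

def items_to_scrutinize(judgments, threshold=0.20):
    """judgments: dict mapping item id to a list of
    'YES' / 'NO' / 'NOT SURE' reviewer responses."""
    flagged = []
    for item, votes in judgments.items():
        # Treat every non-YES judgment as a vote of concern.
        concerned = sum(v != 'YES' for v in votes)
        if concerned / len(votes) >= threshold:
            flagged.append(item)
    return flagged

# With a 20% threshold, an item drawing 2 NOT SURE votes from
# 8 reviewers (25%) is set aside; a unanimous-YES item is not.
reviews = {7: ['YES'] * 6 + ['NOT SURE'] * 2, 12: ['YES'] * 8}
print(items_to_scrutinize(reviews))   # -> [7]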

Similarly, procedural elements for carrying out empirical, DIF-like studies of instructional sensitivity must be generated and refined. The overriding thrust of such DIF analyses is to identify two groups of teachers who, for item-analysis purposes, are decisively different in their demonstrated effectiveness in improving students’ assessed achievement levels. Having identified two extreme groups of teachers on the basis of, for instance, their students’ performances on several previous years’ annual assessments, we can then see whether those teachers’ current students’ responses to new items are consonant with what would be predicted. For example, if students taught by lower-effectiveness teachers actually perform better on particular items than students taught by higher-effectiveness teachers, then those items should certainly be subjected to serious scrutiny to discern what seems to be rendering them instructionally insensitive. Although Joseph Ryan and I (2012) have proposed one use of DIF procedures employing student-growth percentiles to carry out item-sensitivity analyses, more exploratory work on this problem should be undertaken.
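
As a concrete illustration of that contrast, the following minimal sketch compares an item’s proportion-correct between students of previously identified high-effectiveness and low-effectiveness teachers; the data layout and the zero cutoff are illustrative assumptions, deliberately far simpler than the student-growth-percentile approach Ryan and I proposed.

def sensitivity_contrasts(responses):
    """responses: dict mapping item id to
    {'high': [...], 'low': [...]}, where each list holds 1/0
    (correct/incorrect) answers from students of high- and
    low-effectiveness teachers, respectively."""
    report = {}
    for item, groups in responses.items():
        p_high = sum(groups['high']) / len(groups['high'])
        p_low = sum(groups['low']) / len(groups['low'])
        # An instructionally sensitive item should show a positive gap.
        report[item] = p_high - p_low
    return report

contrasts = sensitivity_contrasts({
    3: {'high': [1, 1, 1, 0], 'low': [1, 0, 0, 0]},   # gap = +0.50
    9: {'high': [0, 1, 0, 0], 'low': [1, 1, 0, 1]},   # gap = -0.50
})
# Items on which low-effectiveness teachers' students outperform
# high-effectiveness teachers' students warrant serious scrutiny.
print([item for item, gap in contrasts.items() if gap < 0])   # -> [9]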

As with the reduction of assessment bias in high-stakes educational tests, implementing the previously described two-tactic strategy for dealing with instructional sensitivity won’t transform instructionally insensitive tests overnight into assessments that reek of instructional sensitivity. But colleagues who coped with assessment bias have given us a set of directions for making our evaluative tests more instructionally sensitive. And progress in that direction will increase the validity of test-based inferences about instructional quality and the subsequent decisions we make about the teachers or schools being evaluated. 

A discontented winter 

“Now is the winter of our discontent . . .” are the initial seven words of Shakespeare’s “Richard III.”  Well, winter or not, I am definitely discontent about America’s current misuse of student performances on educational tests. I find it altogether intolerable to be a member of a measurement clan that allows hugely important educational decisions to be made on the basis of student scores on tests not demonstrated to be suitable for their evaluative applications. How can we let such misuses continue? How can we in good conscience permit our nation’s educational leaders and policy makers to rely on test results that may be completely unsuitable for the purposes to which they are being put? How can we allow teachers to be fired because of student scores on the wrong tests? How can we? And yet we do. 

The only way to begin changing an indefensible practice is to set out seriously to alter that practice. It is time, indeed past time, for those of us who recognize the seriousness of this situation to don our alteration armor and head into battle. 

References 

Cox, R. (1971). Evaluative aspects of criterion-referenced measures. In W.J. Popham (Ed.), Criterion-referenced measurement: An introduction (pp. 67-75). Englewood Cliffs, NJ: Educational Technology Publications. 

Haladyna, T. & Roid, G. (1981). The role of instructional sensitivity in the empirical review of criterion-referenced test items. Journal of Educational Measurement, 18 (1), 39-53. 

Kane, M.T. (2013). Validating the interpretations and uses of test scores. Journal of Educational Measurement, 50 (1), 1-73. 

Popham, W.J. (2006, June). Determining the instructional sensitivity of accountability tests. Presented at the Large-Scale Assessment Conference, Council of Chief State School Officers, San Francisco, Calif. 

Popham, W.J. (2014). Classroom assessment: What teachers need to know (7th ed.). Boston, MA: Pearson. 

Popham, W.J. & Lindheim, E. (1981). Implications of a landmark ruling on Florida’s minimum competency test. Phi Delta Kappan, 63 (1), 18-22.  

Popham, W.J. & Ryan, J. (2012, April). Determining a high-stakes test’s instructional sensitivity. Presented at the annual meeting of the National Council on Measurement in Education, Vancouver, British Columbia, Canada.

 

Citation: Popham, W.J. (2014). The right test for the wrong reason. Phi Delta Kappan, 96 (1), 46-52. 

ABOUT THE AUTHOR

W. James Popham

W. JAMES POPHAM is professor emeritus at the Graduate School of Education and Information Studies, University of California, Los Angeles.
