November 1, 2004 | Vol. 62 | No. 3

A Game Without Winners

James Popham
Striving to reduce the achievement gap without reforming testing is an impossible dream.

Suppose the new products division of a game company created a board game named Blind Alley. For some perverse reason, the new game's rules never allowed any player to win. Thus, every person who took part in the game inevitably ended up a loser. We could confidently predict that the annual sales of Blind Alley would not seriously challenge the annual sales of Monopoly.
A game without winners has limited appeal—perhaps attracting only those who have masochistic leanings. Games without winners are patently pointless. Yet, for the last few decades, many U.S. educators have enthusiastically taken part in an instructional game they simply cannot win. I call it the gap-reduction game, although as it is currently being played, it could just as easily be labeled Blind Alley. Although most participants in the gap-reduction game are well-intentioned, they don't understand that as matters currently stand, there's not a chance in Hades that they'll succeed in their efforts to reduce achievement gaps.

Test Performance Does Not Equal Learning

When U.S. educators speak of “achievement gaps,” they mean the performance differentials among various racial/ethnic groups, and between children from poor families and those from middle-class or well-off families. When statewide achievement tests are administered in, say, language arts or mathematics, the average test scores of black and Latino students are rarely as high as those of white students. Similarly, when children from lower socioeconomic status (SES) families take those sorts of tests, they frequently score lower than their higher-SES counterparts. Educators would like to diminish these gaps or eliminate them altogether.
It is wonderful that so many educators are committed to reducing test score gaps among the different student groups. Our nation's long-standing commitment to equality for all its citizens demands no less. However, most educators' understanding of gap-reduction ground rules is inadequate.
Much of the problem stems from the fact that many educators (and most laypeople) employ the term achievement interchangeably with learning. When people think about school-related achievement, they typically think about what students have learned in school. Indeed, my dictionary even describes an achievement test as “a test to measure a person's knowledge or proficiency in something that can be learned or taught.” Thus, achievement tests have historically been regarded by almost everyone as satisfactory measurements of what kids have learned in school, and when most people think about achievement gaps, they are referring to differences stemming from what students have learned. When educators and the public set out to determine whether various student groups have been taught equally well, we almost always look at the achievement test scores earned by those students.
But the assumption that students' in-school learning and their scores on standardized achievement tests are essentially the same thing is mistaken. This faulty assumption leads to a doomed approach to gap reduction.
All educators, I believe, want students from lower-SES families to master the same cognitive skills and knowledge that students from upper-SES families master. We want every child to have a chance to achieve her or his potential. But we have been relying exclusively on students' test scores to tell us whether this goal has been attained—and that's where our well-intentioned gap-reduction strategies have foundered.
As long as we unthinkingly accept the premise that high standardized test scores equal gobs of achievement, whereas low standardized test scores equal the opposite, our gap-reduction gambits are certain to be ineffectual. Actually, if we referred to the gaps we're trying to reduce as test score gaps rather than achievement gaps, people might become more aware of the inappropriateness of using test scores as the sole benchmark for student achievement. The reality is that educators are using the wrong measures to tell whether gap reductions have occurred.

The Worship of “Score Spread”

To see why standardized achievement tests are flawed measurement instruments, let's look at how most such tests are constructed. Traditional achievement tests are “norm-referenced,” meaning they have been built to yield scores capable of being interpreted comparatively. Thus, when Johnny scores at the 84th percentile in mathematics and Evan scores at the 79th percentile (those percentiles being based on the scores of previous test takers who constitute the test's norm group), the comparison of their scores gives parents and teachers one way to interpret the information. But to provide the fine-grained contrasts that are crucial to a norm-referenced measurement strategy, these tests must produce a high degree of score spread—that is, a range of student test scores, with satisfactory numbers of high, middle, and low scores.
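To make the comparative logic concrete, here is a minimal sketch of how a percentile rank is computed (the norm group is invented; real norm groups contain thousands of test takers):

    # A percentile rank locates one student's score within a previously
    # tested norm group: the percent of norm-group scores falling below it.
    def percentile_rank(score, norm_group_scores):
        below = sum(1 for s in norm_group_scores if s < score)
        return 100 * below / len(norm_group_scores)

    # Invented norm group of 20 scores.
    norm_group = [12, 15, 18, 21, 23, 25, 26, 28, 30, 31,
                  33, 34, 36, 38, 40, 42, 44, 46, 48, 50]
    print(percentile_rank(43, norm_group))  # 80.0: above 80% of the norm group

Notice that if every norm-group member had earned roughly the same score, every later test taker would land at roughly the same percentile. Comparative interpretations are possible only when the scores are spread out.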
Because traditional standardized achievement tests need this range of scores, the creators of such tests take pains to include test items that will produce ample score spread. Score spread, in fact, becomes almost a deity in test construction.
Surprisingly, most of the state-specific, custom-built tests that are intended to better assess a particular state's official curricular aims have also been influenced by this adulation of score spread. Because the major test-development companies usually build these state-customized tests, many state tests function in almost exactly the same score-spreading manner that national norm-referenced tests do.
Test developers and teachers have different priorities. Teachers complain about the amount of time external tests take away from their teaching. But from the perspective of achievement test developers, far too little time is available to assess students. That's because scads of score spread must be produced in only an hour or two of testing.
Statistically, items that very large proportions of students answer either correctly or incorrectly do not produce score spread. Thus, test developers dare not put many “too easy” or “too hard” items in their tests. If too many students actually begin to answer too many items correctly, then a test's score spread evaporates. In practical terms, this means that the majority of the items on a standardized achievement test will turn out to be answered correctly by between 40 and 60 percent of the test takers.
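There is simple arithmetic behind that 40-to-60-percent rule of thumb. For an item scored right or wrong, the variance it can contribute to total-score spread is p(1 − p), where p is the proportion of test takers answering correctly; that quantity peaks at p = 0.5 and collapses toward zero at the extremes. A quick check (my illustration, not drawn from any actual test):

    # Variance of a right/wrong item answered correctly by proportion p
    # is p * (1 - p): largest at p = 0.5, vanishing as p nears 0 or 1.
    for p in [0.1, 0.3, 0.5, 0.7, 0.9]:
        print(f"p = {p:.1f}  item variance = {p * (1 - p):.2f}")
    # p = 0.1 -> 0.09; p = 0.5 -> 0.25; p = 0.9 -> 0.09. Items that almost
    # all students answer correctly (or miss) add almost no score spread.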
The makers of standardized achievement tests have no serious interest in selecting test items that will reflect effective instruction. They are interested in using items that not all test takers can answer, even if having many such items causes a test to be instructionally insensitive—that is, incapable of detecting the presence and impact of effective instruction.
Here's where this zany assessment puzzle gets even more perplexing. Some of the best items for yielding sufficient score spread are those that are apt to be answered correctly by students from upper-SES families and incorrectly by students from lower-SES families. We call such test items SES-linked. From the test developers' perspective, SES is a delightfully spread-out variable—meaning that students represent a wide range of family income levels and that lower-SES students and upper-SES students show many differences. So, SES-linked items will almost always yield the score spread that is so crucial for comparative interpretations.
Now, let's return to the core issue with which we're dealing, namely, how to promote “achievement catch-up” for minority and low-SES students. Because of the United States' social and economic history, minority students are more likely to be low-SES than nonminority students are. It should be apparent, therefore, that if many items on standardized achievement tests are more directly linked to students' SES than to what students have been taught in school, then the use of such tests will never reduce the difference in test scores between minority and white students. We are relying on tests containing SES-linked items to demonstrate that students can overcome SES-linked education deficits. This is really stupid.
As long as educators and testing companies continue to assess what students have learned in school using assessment devices that rely heavily on SES-linked items, we will be measuring what students bring to school, not what they learn in school. SES-linked assessment will never allow educators to show that they've reduced the kinds of gaps among students that we all want diminished.
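The consequence can be sketched in a toy simulation, with entirely hypothetical numbers of my own invention: when most items hinge on out-of-school experience, even uniformly excellent teaching leaves the score gap standing.

    import random

    random.seed(1)
    ITEMS = 50
    SES_LINKED = 0.6  # assume 60% of items hinge on out-of-school experience

    def simulate_score(high_ses, well_taught):
        # Toy model: SES-linked items favor high-SES students regardless of
        # instruction; only the remaining items reward good teaching.
        score = 0
        for _ in range(ITEMS):
            if random.random() < SES_LINKED:
                p_correct = 0.7 if high_ses else 0.4  # invented probabilities
            else:
                p_correct = 0.7 if well_taught else 0.4
            score += random.random() < p_correct
        return score

    def group_mean(high_ses, well_taught, n=2000):
        return sum(simulate_score(high_ses, well_taught) for _ in range(n)) / n

    # Both groups taught equally well, yet a gap of roughly 9 items out of 50
    # remains -- produced entirely by the SES-linked items.
    print(group_mean(True, True) - group_mean(False, True))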

What Is an SES-Linked Item?

All right, you may be asking, what does an SES-linked item look like? Here's a multiple-choice item that I think is SES-linked, drawn from a standardized science achievement test used nationally to measure the knowledge of 6th graders:

If you wanted to find out if a distant planet had mountains or rivers on it, which of the following tools should you use?
(a) binoculars, (b) microscope, (c) telescope, or (d) camera
If you think about this item carefully, you'll see how students from lower-SES families might have less of a chance to come up with the correct answer. Consider two families, each of which includes a 6th grader. Family X has three children and two parents. Both parents earn decent salaries and the family's oldest child possesses an expensive telescope. This family often watches science programs on TV; sometimes they discuss science-related topics at dinner. Family Y, on the other hand, consists of three children and one parent—a mother whose skill level only qualifies her for jobs that pay minimum wage. In this second family there is no cable television, no telescope, and no dinner-time discussions of articles in news magazines that the family cannot afford.
I hope you will have little difficulty answering this question:

On average, which student will be more likely to answer the test item about seeing distant planets correctly?
(a) the 6th grader in Family X, or (b) the 6th grader in Family Y
Norm-referenced standardized achievement tests contain far too many items like the telescope question. I recently went through one complete grade level's worth of items on two nationally standardized achievement tests, item by item. I concluded that in reading, language arts, science, and social studies, between 40 and 80 percent of the items were SES-linked. One could almost say that the tests are little more than measures of students' SES. In mathematics, only 15 to 20 percent of the items were SES-linked, but this is still unacceptable. If students' scores on achievement tests are dependent on their SES levels, how will such tests ever show a reduction in student achievement gaps that follow racial and socioeconomic lines?
So does the presence of all these SES-linked items on standardized achievement tests indicate that the folks who develop such tests are determined to oppress the masses? Are test developers part of a plot to keep the nation's blue-collar proletariat in its proper place? Definitely not. The test developers, with no malice at all, are simply hot to create items that produce score spread. And because SES is such a wonderfully spread-out variable, many items that have a good track record of yielding a range of scores turn out to be SES-linked.
As educators, we should not blame the test-development companies for creating SES-biased tests. Rather, we should blame ourselves for allowing such tests to be used as measures of our effectiveness as teachers.

What About Standards-Based Tests?

So far, I have been sniping exclusively at norm-referenced standardized achievement tests. But what about achievement tests that have been custom-built for particular states in such a way that any student's score is based on individual mastery of specific criteria rather than on student-to-student comparisons? Such tests are called criterion-referenced or standards-based. About half of the state-level achievement tests used throughout the United States are supposed to measure students' mastery of a state's official curricular aims—also known as content standards, benchmarks, or expectancies.
If you've been thinking that criterion-referenced tests will save the day, get ready for a disappointment. These standards-based achievement tests are just as tied to SES as are norm-referenced tests. The crux of the problem with states' standards-based tests is that such tests are intended to measure too many curricular aims. Typically, a state's curricular aims are crafted by well-intentioned content specialists who dream up a “wish list” rather than a set of essential competencies and knowledge that can be realistically taught and measured in the time available within public schools. I was recently in a state that expected its teachers to promote more than 6,000 curricular aims!
Any given year's standards-based test can't possibly measure a whole galaxy of content standards in a meaningful way; a typical test of 60 or so items could devote even a single item to barely one percent of 6,000 aims. So state standards-based tests typically contain items that merely sample the plethora of curricular aims. Some curricular aims don't get measured at all, whereas others get measured only superficially. The state's teachers can only guess which curricular aims a given year's test is going to assess, and in many instances they guess wrong.
Moreover, because there are so many competencies to measure, there aren't enough items on most tests for test results to give teachers an accurate fix on students' mastery of any one curricular aim. Teachers receive information that's so general as to be nearly meaningless in terms of analyzing and improving their instruction.
Imagine that you're a teacher who for years has guessed wrong about what's likely to be assessed on your state's standards-based tests. For years you've aimed your instruction at the wrong curricular aims. In addition, the score reports from the state's standards-based tests don't help you figure out what parts of your instructional program are working. Wouldn't you be apt to give up on the whole instruction-aligned-with-assessment approach and just return to teaching in the best way you know how?
As currently set up, most standards-based assessments don't really align with classroom instruction, and instruction doesn't significantly influence test scores. Instead, students' scores on most states' standards-based tests turn out to be tied most directly to—you guessed it—students' socioeconomic status. If criterion-referenced tests cover curricular content that, more often than not, wasn't stressed sufficiently in class, then students must draw on knowledge and experience outside their classroom learning to have a shot at a correct answer. Once again, simply because of life experiences, youth from economically advantaged families will outperform less advantaged youngsters. With few exceptions, today's standards-based tests—even those proudly cavorting in criterion-referenced costumes—are no better at evaluating the merits of gap-reduction efforts than norm-referenced achievement tests are.
Thus, when any gap-reduction guru touts the virtues of a special instructional strategy without first requiring the installation of appropriate assessments, that person is making a serious mistake. Don't misunderstand me. I've listened to many first-rate folks argue for instructional procedures that seem to be sensible ways of reducing the achievement gap between lower-SES and higher-SES kids. But because of most existing achievement tests' SES links, those instructional procedures cannot reduce the test score gap between lower-SES and higher-SES children. Gap-reduction experts are sending their best instructional strategies marching onto a battlefield where those strategies are certain to stumble.

A Two-Step Strategy

A straightforward two-step strategy might free educators and students from this untenable situation.
As step one, anyone who is working to reduce achievement gaps must become assessment literate—at least with respect to the qualities of achievement tests that will or won't reveal genuine differences between what upper- and lower-income students learn. Educators have historically given far too much deference to assessment specialists. Most educators don't know squat about measurement, and they wrongly assume that anyone who can actually compute an internal consistency reliability coefficient must be sufficiently intelligent to avoid making measurement mistakes. Such a deferential demeanor needs to disappear—and in a hurry. The fundamentals of educational testing, at least those concepts necessary to be able to spot a suitable achievement test, really aren't too complex to grasp.
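In that spirit, the internal consistency coefficient mentioned above is no black art. Cronbach's alpha, one common version, is a few lines of arithmetic: for a k-item test, alpha = k/(k − 1) × (1 − sum of item variances / variance of total scores). A minimal sketch using made-up responses:

    # Cronbach's alpha for a k-item test:
    # alpha = k/(k-1) * (1 - sum(item variances) / variance of total scores)
    def variance(xs):
        m = sum(xs) / len(xs)
        return sum((x - m) ** 2 for x in xs) / len(xs)

    def cronbach_alpha(item_scores):
        # item_scores: one list per item, holding each student's 1/0 response
        k = len(item_scores)
        totals = [sum(student) for student in zip(*item_scores)]
        item_var = sum(variance(item) for item in item_scores)
        return k / (k - 1) * (1 - item_var / variance(totals))

    # Made-up responses: 4 items (rows) by 6 students (columns).
    items = [[1, 1, 0, 1, 0, 1],
             [1, 1, 0, 1, 1, 1],
             [1, 0, 0, 1, 0, 1],
             [1, 1, 1, 1, 0, 1]]
    print(round(cronbach_alpha(items), 2))  # 0.78 on this toy data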
As step two, educators ought to be working toward the adoption of instructionally supportive accountability tests that are designed, from the very get-go, to detect the kind of instructional impact that must be present if achievement gaps are ever to be demonstrably reduced.1 Such tests should:
  • Measure only a modest number of curricular aims of extraordinary significance so teachers are not overwhelmed by too many curricular targets;
  • Describe curricular aims in clear, teacher-palatable language so teachers can aim their instruction directly at the curricular goals rather than at a particular test's items; and
  • Supply score reports that show whether or not each curricular aim was mastered by each student, thus helping teachers determine which aspects of their instruction were or were not effective (a miniature sketch of such a report follows this list).
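To make that third feature concrete, here is an entirely hypothetical per-aim mastery report in miniature; the students, aims, and results are all invented:

    # Hypothetical per-aim mastery report: one entry per student,
    # one flag per curricular aim (True = aim mastered).
    report = {
        "Student A": {"main idea": True,  "summarizing": False},
        "Student B": {"main idea": True,  "summarizing": False},
        "Student C": {"main idea": False, "summarizing": False},
    }

    # A teacher can see at a glance which aims the class has not yet
    # mastered -- and therefore which parts of the instruction need rework.
    for aim in ["main idea", "summarizing"]:
        mastered = sum(report[s][aim] for s in report)
        print(f"{aim}: {mastered}/{len(report)} students")

A report at this grain tells the teacher that the summarizing instruction, not the students, needs attention: precisely the signal that a single overall score can never provide.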
Educators who wish to reduce achievement gaps must initiate an immediate and dramatic turnaround with respect to the assessments they use to evaluate their efforts. Otherwise they will be destined to continue playing their own special version of Blind Alley. In this game, unfortunately, low-SES and minority students turn out to be the biggest losers.
End Notes

1. Commission on Instructionally Supportive Assessment. (2001). Building tests that support instruction and accountability: A guide for policymakers. Washington, DC: Author. Available: www.ioxassessment.com/catalog/pdfdownloads/BuildingTestsToSupport.pdf

James Popham is Emeritus Professor in the UCLA Graduate School of Education and Information Studies. At UCLA he won several distinguished teaching awards, and in January 2000, he was recognized by UCLA Today as one of UCLA's top 20 professors of the 20th century.

Popham is a former president of the American Educational Research Association (AERA) and the founding editor of Educational Evaluation and Policy Analysis, an AERA quarterly journal.

He has spent most of his career as a teacher and is the author of more than 30 books, 200 journal articles, 50 research reports, and nearly 200 papers presented before research societies. His areas of focus include student assessment and educational evaluation. One of his recent books is Assessment Literacy for Educators in a Hurry.
