I also believe that most teachers are missing a major dividend that educational testing can provide. Teachers are failing to take advantage of the instructional benefits that properly constructed tests can bring to themselves and to their students. As I'm using the phrase, a “properly constructed” test is one that significantly illuminates the instructional decisions teachers must make. If all high-stakes tests were properly constructed, we'd find that a high-stakes testing program would typically have a positive effect on educational quality.
And that, in a nutshell, is what this book is about. It begins by addressing the current misuses of high-stakes tests, and then explains how we can create tests that improve, not degrade, instructional quality. To ensure that I don't make any profound pedagogical mistakes, I suppose I should state this book's two objectives outright. After completing this book, I want you to
- Understand the misuses of today's high-stakes tests and be able to explain to others what those misuses are and why they occur.
- Recognize the distinguishing features of an instructionally illuminating test and be able to differentiate between tests that are and are not instructionally illuminating.
With respect to the second objective, I have no illusions that completing this slender volume will transform you into a measurement maven who, in a few inspired moments, can whip out a brand-new test that will help teachers teach better and students learn better. No, that's a mite ambitious. (You'd probably need to read the book twice before you'd be up to that.) What I really want is for you to be able to review any high-stakes test currently being foisted on you (or a test that's under consideration for possible foisting) and confidently determine whether the test is apt to be—from an instructional perspective—a winner or a loser.
Throughout this book, I use the words test, assessment, and measurement in an essentially interchangeable manner. There are no special nuances associated with the three terms; all refer to a wide range of ways that educators can get a fix on a student's status. When I use any of the three synonyms, I am definitely thinking of more than paper-and-pencil tests of the traditional sort that most of us experienced as we went through school. Today, educational assessment includes a much broader, and sometimes quite nontraditional, array of measurement approaches. You'll be reading about many of these newer ways of testing as you proceed through this book.
So that you understand from whence I'm coming regarding the book's content, let me give you a compressed look at my personal educational meanderings in a career spanning almost the entire last half of the 20th century. Interestingly, it was during this 50-year period that educational tests were transformed from teachers' tools into teachers' terrors.
In 1953, soon after I wrapped up my teacher education requirements, I took my first teaching job as a high school instructor in a rural eastern Oregon town. It was a small town—only 1,800 inhabitants—and I was delighted to be there . . . it was the only job offer I received.
Testing was a regular practice in that Oregon high school. My colleagues and I built and administered all sorts of our own classroom tests. We even administered standardized achievement tests each spring. In retrospect, I believe our town's school superintendent insisted that standardized tests be given only because such tests were required in the larger Oregon school districts. Perhaps he saw them as a mark of cosmopolitan progress. We may have been rural, but as our superintendent once proudly proclaimed, “We use national tests!”
No one paid any real attention to our students' scores on those standardized tests. The citizens of our town thought we teachers were doing a satisfactory job. And for the most part, we were. On occasion, a few students failed to perform all that well in school, and some teachers were certainly less effective than others. But in general, the teachers in my high school (and in the town's elementary school) were thought to be successful. The children were learning, and parents were pleased.
That's pretty much the way it was throughout the United States at midcentury. Although there were surely some so-so teachers out there, it was generally believed that America's teachers were doing what they were being paid to do—and doing it pretty well.
After my high school teaching experience, which I really cherished, I picked up a doctorate focused on instruction from Indiana University. I taught in a pair of small colleges, and, in 1962, I joined the faculty of the UCLA Graduate School of Education. My chief assignment at UCLA was to teach prospective teachers about instructional methods.
Even at that time—the early 1960s—confidence in public schools was generally quite high. One rarely heard heated attacks on the caliber of our national educational system. Although there were occasional books critical of schools, such as Rudolf Flesch's Why Johnny Can't Read, most citizens thought, as had their parents before them, that the nation's schools were very successful.
Faith Followed by Doubts
Since its earliest beginnings, public education has been regarded as one of the United States' finest accomplishments. Consistent with most people's conception of what a democratic society ought to be, our public schools have offered a toll-free road that can lead even the most humble to success, happiness, and a good life. Consider the starring role American public schooling plays in some of our fondest patriotic metaphors: Public education is the latchkey that can open the door to a land of opportunity; it is the cornerstone of our nation's democratic system of government.
These sentiments capture the positive regard for public education, for teachers, and for the teaching profession itself that was widely held by most U.S. citizens well into the 1960s. Students were assumed to be learning, and teachers were assumed to be teaching. All was much as it should be.
But sometime during the late 1960s and early '70s, mutterings of public discontent began to surface. Newspapers published the first articles about students who had secured high school diplomas, yet couldn't fill out job application forms properly. Other scare stories hit the press about students who, although unable to read or write at even a rudimentary level, were being promoted merely on the basis of “seat time.” U.S. public schools, long venerated, were coming under increasingly serious attacks. And, of course, this assault included the nation's teachers who, some said, had failed to teach America's children what those youngsters needed to know.
Minimum Competency Tests
Because widespread citizen distress often engenders some sort of legislative response, a number of state legislatures, as well as several state and district school boards, soon established basic-skills testing programs (usually focused on reading, writing, and arithmetic). New regulations required students to pass these tests before receiving their high school diplomas. In some instances, students in lower grades were obliged to pass a specified test before being promoted to the next higher grade. The policymakers who established these tests typically referred to such assessments as minimum competency tests. The objective, the policymakers claimed, was to supply parents with a “limited warranty” that a child who passed a competency test had at least mastered the fairly modest set of basic skills these tests measured.
But whether the tests' warranties were limited or not, they definitely made a meaningful difference to the students who failed them. Denying a diploma to a high school student on the basis of that student's score on a single test created a whole new set of rules for educational measurement. Penny-ante assessment was definitely over; high-stakes testing had arrived.
Although at first glance it would seem that the focus of the late-1970s minimum competency tests was on students, this really wasn't the case. The policymakers who installed these competency tests were actually displaying their doubts about public school educators. As legislators and other educational policymakers saw the problem, if kids were making their way through an educational system without having learned how to read, write, or fill out a job application, then someone was falling down on the job—falling down on the job of teaching.
Not surprisingly, members of the business community lined up solidly behind the establishment of minimum competency testing programs. After all, corporate America needed high school graduates who possessed basic skills. And if competency tests could even partially guarantee that graduates possessed those skills, then corporate America quite naturally endorsed these tests with gusto.
Most of the minimum competency tests of the 1970s and early '80s focused on remarkably low-level skills and knowledge. The reason is worth considering. It's a lesson from which today's educators might profit as they wrestle with the problem of what a high-stakes test ought to measure.
You see, once a state legislature formally enacted a law establishing a statewide minimum competency testing program, the law's implementation was usually turned over to that state's education department. And when officials of the education department moved forward to create the authorized assessment program, those officials typically entered into a contract with an external test development firm to carry out the test's construction. It was true then and it's still true now: Few state departments of education possess the in-house capacity to generate high-stakes assessments—meaning most must rely on substantial external assistance.
Having chosen a test development contractor (for example, CTB/McGraw-Hill of Monterey, California), the next step for state officials was to determine the nature of the skills and knowledge to be measured. Ordinarily, these decisions were made by a committee of that state's educators. For example, if the state's competency testing legislation called for a test in reading, a test in mathematics, and a test in written composition, state authorities would typically appoint a committee of 20–30 teachers and curriculum specialists for each of the three subject areas. These committees, usually operating under the guidance of the external contractor's staff, would identify which skills and knowledge in each subject area their state's upcoming competency test would measure.
With few exceptions, these committees of educators selected remarkably low-level sets of basic skills and knowledge. Indeed, a more appropriate label for these early minimum competency tests would have been “most minimum imaginable competency tests.” Let me explain why.
To do so properly, however, I need to take another brief dip into my own experiences. I used to be a test developer. In 1968, I formed a small nonprofit organization, called the Instructional Objectives Exchange (IOX), to create behaviorally stated instructional objectives and distribute them to U.S. educators. I abhor wheel reinvention, and during this period it struck me that too many of the nation's educators were cranking out redundant behavioral objectives. I thought a national clearinghouse for such objectives would help.
Later, in the mid-1970s, we set up a successor organization, known as IOX Assessment Associates, with the purpose of developing high-stakes tests for state departments of education and for large school districts. I soon began to realize that the important tests we were creating would significantly influence what teachers actually taught. As a result of this insight, I found my own career interests turning from instruction and heading toward educational measurement. I soon became so enamored of assessment that, after doing a ton of assessment-related reading, I began teaching UCLA's graduate classes in educational measurement.
When IOX entered the test development arena, I was hoping to create high-stakes tests that would clarify assessment targets and help teachers to design on-target, more effective lessons. Testing, as I saw it, could be a potent force for instructional improvement if the tests were deliberately built with that mission in mind.
For more than a decade IOX served as the external test development contractor for a dozen states and several large school districts. During this period, I sat in on many state curriculum committee meetings convened to decide what sorts of skills and bodies of knowledge the state-legislated competency test would assess. You might as well learn what happened during those deliberations: The chief reason that most states ended up with minimum competency tests is that the committees of educators who chose the content to be tested invariably decided to assess truly low-level content.
Because I personally moderated many of these meetings and watched dozens of content-determining sessions take place, I can report that the majority of committee members didn't want to establish genuinely high-level expectations for the competency tests. They realized that denying diplomas or grade promotions to many students because of low test scores would reflect unfavorably on teachers. So, more often than not, the subject matter committees simply surrendered to the selection of low-level skills and knowledge. The result: competency tests that measured minima.
Once a minimum competency test was in place, the teachers who were chiefly responsible for teaching students to master its content (usually English teachers and mathematics teachers) devoted considerable energy to having their students pass the tests. Teachers did not wish to face the disgruntled parents of students who had failed the tests, even if what many high school graduation tests actually measured was sometimes barely more sophisticated than material 6th grade students ought to have learned.
Not surprisingly, thanks to the low level of the assessment targets, relatively few students ultimately failed to pass these early-vintage minimum competency tests. Failing students were typically given opportunities to re-take the test and achieve a passing score. But even so, in most schools, at least some students failed—and thus were denied diplomas.
It didn't take the press long to figure out that something potentially newsworthy was taking place. Newspapers could easily write stories that compared schools within a district on the basis of competency test failure rates (and subsequent diploma denial). Consequently, a public perception began to emerge that schools in which few students failed were good schools, and schools in which many students failed were bad schools. The quality of schooling was being linked to the quality of students' test scores. And, as we shall see, once this approach to judging schools took root, it flourished.
The Elementary and Secondary Education Act (ESEA) of 1965
Another factor nurtured the notion that a school's quality could be determined by its students' test scores, and that factor was an important piece of federal legislation. The Elementary and Secondary Education Act (ESEA) of 1965 was the first major federal law dispensing significant amounts of money to U.S. school districts for the support of locally designed programs intended to bolster children's learning. Prior to the 1965 enactment of ESEA, the amount of federal dollars flowing from Washington, D.C., to local schools had been relatively modest. ESEA, by contrast, promised truly big bucks for local educators.
The newness of ESEA's federal funds-for-education strategy led Congress to build in a number of corresponding safeguards. One of the most influential was championed by Robert Kennedy, then a senator from New York. Kennedy's addition to the law required educators receiving ESEA dollars to demonstrate that these funds were being well spent—namely, by evaluating and reporting on the effectiveness of their federally supported programs. According to the new law, if local officials did not formally evaluate the current year's federally subsidized program, then they would not receive next year's ESEA funds. In truth, it was not all that important whether a program's evaluation was positive or negative, at least in the early days of ESEA funding. Just conducting the evaluation was all that was necessary.
Given the potency of the “green carrot,” it should come as no shock that educators who were getting ESEA awards scurried madly about in an effort to document the success of their ESEA-funded programs. And because almost all these programs were aimed directly at improving students' basic skills, the first step for most local educators was to identify suitable tests that could determine whether students were in fact learning the three Rs. The most readily available tests were off-the-shelf, standardized achievement tests such as the Metropolitan Achievement Tests or the Comprehensive Tests of Basic Skills. These sorts of tests became almost everyone's choice to evaluate whether ESEA-supported programs were working, because they (1) had been developed by respected measurement companies, (2) were widely available, and (3) were regarded as technically first-rate.
Then, as now, few educators knew much about test development and were generally willing to leave the creation of tests to the specialists. As a result, most of the educators whose work depended on ESEA dollars readily accepted that students' scores on these celebrated off-the-shelf, standardized achievement tests accurately determined the quality of classroom instruction. The unfortunate truth was that what these standardized tests were measuring often had only a faint relationship to the skills and knowledge being promoted by a particular ESEA-funded program. Sadly, few members of the education profession cried foul, and the use of standardized achievement test scores to determine an ESEA program's success became the national norm.
Both policymakers and the general body of educators took what must have seemed the logical next step in assessment application: If standardized achievement tests could ascertain the effectiveness of ESEA-funded, basic skills-focused instructional programs, they could be employed to evaluate the success of other types of instructional programs as well. Having bought into the idea that certain kinds of instructional quality could be determined by using standardized achievement tests, policymakers were pretty well locked into the position that those tests did in fact provide defensible estimates of instructional quality.
Although ESEA of 1965 certainly stimulated a vastly increased reliance on standardized achievement tests as a way of judging educational success, U.S. educators must accept the blame for simply rolling over and allowing their teaching to be evaluated by students' scores on those off-the-shelf tests. As you'll read in later chapters, that wrong-headed acquiescence has led to a series of educational practices that have seriously eroded the quality of today's schooling.
Newspapers Take Notice
By the late 1980s, most states had established some kind of mandatory statewide testing program. Although many of these assessment programs consisted of the sorts of low-level competency tests I've already described, some state authorities, troubled by their competency tests' low expectations, had set about renovating their initial minimum competency testing programs to assess more demanding outcomes. Some states preferred to revise their own competency tests (typically with a contractor's help). Other states simply selected one of the nationally published standardized achievement tests. Thus, usually in the spring, a statewide test was administered to all students in selected grades (for instance, in grades 3, 5, 8, and 11). The most ambitious states chose to administer achievement tests to students in every grade, usually starting with grade 3.
Test scores were sent to local districts and schools so that educators and parents could see how students had performed. At that time, test results were not routinely sent to newspapers. Indeed, for the most part, local newspapers displayed little interest in these test results.
And then came the day (rumored by some to have been a “slow news day”) that the first newspaper reporter obtained a copy of statewide test results and wrote a story that used students' test scores to rank districts and schools within the state. This ranking system allowed parents to quickly see how their child's school stacked up against other schools. And because most educators had previously accepted the idea that scores on standardized achievement tests indicated the effectiveness of educational programs, the press soon billed these annual rankings as reflections of educational quality. Highly ranked schools were regarded as effective; lowly ranked schools were regarded as ineffective.
For a number of years now, I've been following these newspaper rankings in many localities, where they often attract as much attention as the publication of the winning numbers in the state lottery. These rankings invariably lead to judgments about which educators are doing good jobs and which are doing bad jobs. And because citizens believe that high scores signify successful instruction, the annual rankings place enormous pressure on teachers to improve their students' scores on statewide tests.
Some of these statewide tests are off-the-shelf, national standardized achievement tests, some are customized tests built exclusively for a particular state, and some are a combination of national and customized items. All these tests, however, are standardized in the sense that they are administered and scored in a uniform, predetermined manner. Incidentally, most citizens tend to ascribe more credibility to national achievement tests, five of which are currently used in our public schools: California Achievement Tests, Comprehensive Tests of Basic Skills, Iowa Tests of Basic Skills, Metropolitan Achievement Tests, and Stanford Achievement Tests. In general, folks place less trust in customized, state-specific standardized tests, regarding these “home grown” tests as more likely to have been softened to make the state's educators look good.
But customized or national, when newspapers run their annual rankings of district-by-district and school-by-school scores, there is a clear message to all concerned: These rankings reflect instructional quality. Given that message, I am surprised that newspaper editors do not publish these score-based school rankings in their sports sections, along with team standings in basketball, baseball, and football. Americans love contests—and while we derive modest gratification from applauding a winner, we appear to get more fundamental joy from identifying losers. Yes, the sports pages are the natural home for score-based school rankings.
As we all know, the 1990s brought a tremendous increase in the reliance on students' standardized achievement test scores as indicators of instructional quality. Think about the name of the tests for a moment: achievement tests. Because an achievement test would seem to measure what students achieve—that is, what they learn in school—it's natural to perceive it as a suitable measure of what kids have been taught in school. As you'll read in Chapters 3 and 4, that perception is plumb wrong.
The attention given to achievement test scores and the tacit implication that students' test scores provide an accurate index of educational success helped fuel an enormous preoccupation with those scores during the last decade. School boards demanded that their district's educators improve students' test performances. School administrators at all levels were evaluated almost exclusively on the basis of students' scores on standardized achievement tests. And more than a few governors pinned their political aspirations directly to the elevation of their state's test scores. California Governor Gray Davis, for example, made the improvement of test scores so central to his administration's success that he publicly proclaimed he would forgo any bid to seek the U.S. presidency if his state's scores failed to rise. George W. Bush made Texas's rising test scores a central element in his successful presidential campaign.
Now, in 2001, there's no question that a score-boosting sweepstakes has enveloped the nation. Who has been tasked with boosting students' test scores? Teachers and administrators, of course. And it is precisely because teachers and administrators are under such pressure these days to bring about improvements in students' test scores that I have written this book.
U.S. educators have been thrown into a score-boosting game they cannot win. More accurately, the score-boosting game cannot be won without doing educational damage to the children in our public schools. The negative consequences flowing from a national enshrinement of increased test scores are both widespread and serious. Chapter 1 highlights the most harmful of these consequences.
But let's get one thing straight: I do not believe America's educators are the guiltless victims of an evil imposed by wrong-thinking policymakers. I think the education profession itself is fundamentally at fault. We allowed students' test scores to become the indicator of our effectiveness. We failed to halt the profound misuse of standardized achievement tests to judge our educational quality. We let this happen to ourselves. And more grievously, we let it happen to the children we are supposed to be educating. Shame on us.
It's not too late to alter this sorry state of affairs. As the book proceeds, I'll describe a strategy for modifying what has become an untenable measurement context for appraising the nation's schools. I hope you, and other readers, will take seriously my challenge to change this situation.