November 1, 2003
Vol. 61
No. 3

The Lessons of High-Stakes Testing

Research shows that high-stakes tests can affect teaching and learning in predictable and often undesirable ways.

Today's widespread implementation of standards-based reform and the federal government's commitment to test-based accountability ensure that testing will remain a central issue in education for the foreseeable future. Test results can provide useful information about student progress toward meeting curricular standards. But when policymakers insist on linking test scores to high-stakes consequences for students and schools, they often overlook lessons from the long history of research.

The Current Landscape

Standards-based accountability systems typically comprise four components:
  • Content standards that communicate the desired knowledge and skills;
  • Tests designed to measure progress toward achieving the content standards;
  • Performance targets, which identify criteria used to determine whether schools and students have reached the desired level of achievement; and
  • Incentives, such as rewards and sanctions based on the attainment of the performance targets.
Every state has now instituted a statewide testing program and curricular standards or frameworks—except Iowa, where local districts develop their own standards and benchmarks. The state tests vary substantially in difficulty, content, item format, and, especially, the sanctions attached to test performance. For example, Massachusetts, New York, Texas, and Virginia use test results to award high school diplomas. Other states—Missouri, Rhode Island, and Vermont, for example—use students' performance on the state test to hold schools, rather than students, accountable. Still others, including Iowa, Montana, Nebraska, and North Dakota, currently attach no sanctions to test performance (Edwards, 2003).
The 2001 reauthorization of the Elementary and Secondary Education Act, also known as No Child Left Behind (NCLB), carries testing and accountability requirements that will substantially increase student testing and hold all schools accountable for student performance. This legislation marks a major departure from the federal government's traditional role regarding elementary and secondary education. It requires that states administer reading and math tests annually in grades 3–8 and during one year in high school starting in 2005–2006. These requirements will affect almost 25 million students each school year (National Center for Education Statistics, 2002).
NCLB requires states to meet adequate yearly progress (AYP) goals to ensure school accountability for student achievement on state tests. Schools that fail to achieve AYP goals face demanding corrective actions, such as replacement of school staff, implementation of new curriculum, extension of the school day or academic year, parental choice options, and, finally, complete reorganization.

Lessons We Should Have Learned

The current emphasis on testing as a tool of education reform continues a long tradition of using tests to change pedagogical priorities and practices. In the United States, this use of testing dates back to 1845 in Boston, when Horace Mann, then Secretary of the Massachusetts State Board of Education, replaced the traditional oral exam with a standardized written essay test. Internationally, high-stakes testing extends as far back as the 15th century in Treviso, Italy, where teacher salaries were linked to student examination performance (Madaus & O'Dwyer, 1999).
A 1988 examination of the effects of high-stakes testing programs on teaching and learning in Europe and in the United States (Madaus, 1988) identified seven principles that captured the intended and unintended consequences of such programs. Current research confirms that these principles still hold true for contemporary statewide testing efforts.
Principle 1: The power of tests to affect individuals, institutions, curriculum, or instruction is a perceptual phenomenon. Tests produce large effects if students, teachers, or administrators believe that the results are important. Policymakers and the public generally do believe that test scores provide a reliable, external, objective measure of school quality. They view tests as symbols of order, control, and attainment (Airasian, 1988).
Today's high-stakes testing movement relies on the symbolic importance of test scores. Forty-eight states currently require schools to provide the public with “report cards” (Edwards, 2003). Goldhaber and Hannaway (2001) found that the stigma associated with a school receiving a low grade on the state report card was a more powerful influence on Florida teachers than were the school-level sanctions imposed for poor test results.
Principle 2: The more any quantitative social indicator is used for social decision making, the more likely it will be to distort and corrupt the social process it is intended to monitor. In other words, placing great importance on state tests can have a major influence on what takes place in classrooms, often resulting in an emphasis on test preparation that can compromise the credibility or accuracy of test scores as a measure of student achievement.
We can assess whether this principle still applies today by examining the relationship between rising state test scores and scores on other achievement tests. Both old and new studies of this relationship (for example, Amrein & Berliner, 2002; Haladyna, Nolen, & Haas, 1991; Klein, Hamilton, McCaffrey, & Stecher, 2000; Linn, 1998) show that improvements in the state test scores do not necessarily reflect general achievement gains.
We can also find examples of this second principle in two recent surveys of teachers' opinions. In one national study, roughly 40 percent of responding teachers reported that they had found ways to raise state-mandated test scores without, in their opinion, actually improving learning (Pedulla et al., 2003). Similarly, in a Texas survey, 50 percent of the responding teachers did not agree that the rise in TAAS scores “reflected increased learning and high-quality teaching” (Hoffman, Assaf, & Paris, 2001, p. 488).
Principle 3: If important decisions are based on test results, then teachers will teach to the test. Curriculum standards and tests can focus instruction and provide administrators, teachers, and students with clear goals. A substantial body of past data and recent research, however, confirms that as the stakes increase, the curriculum narrows to reflect the content sampled by the test (for example, Jones et al., 1999; Madaus, 1991; McMillan, Myran, & Workman, 1999; Pedulla et al., 2003; Stecher, Barron, Chun, & Ross, 2000).
New York State, where the state department of education is requiring schools to spend more time on the NCLB-tested areas of reading and math, provides an example of how such pressure encourages schools to give greater attention to tested content and decrease emphasis on nontested content. According to one school principal, “the art, music, and everything else are basically out the window . . . something has to go” (Herszenhorn, 2003).
Principle 4: In every setting where a high-stakes test operates, the exam content eventually defines the curriculum. Pressure and sanctions associated with a state test often result in teachers using the content of past tests to prepare students for the new test. Several studies have documented that an overwhelming majority of teachers feel pressure to improve student performance on the state test. For example, 88 percent of teachers surveyed in Maryland and 98 percent in Kentucky believed that they were under “undue pressure” to improve student performance (Koretz, Barron, Mitchell, & Keith, 1996a, 1996b). As an outgrowth of this pressure, the amount of instructional time devoted to specific test preparation often increased.
More recent studies have found that teachers are spending a sizable amount of instructional time and using a variety of test-specific methods to prepare students for their state tests (Herman & Golan, n.d.; Hoffman, Assaf, & Paris, 2001). In North Carolina, 80 percent of elementary teachers surveyed “spent more than 20 percent of their total instructional time practicing for the end-of-grade tests” (Jones et al., 1999, p. 201). A national survey found that teachers in high-stakes states were four times more likely than those in low-stakes settings to report spending more than 30 hours a year on test preparation activities, such as teaching or reviewing topics that would be on the state test, providing students with items similar to those on the test, and using commercial test-preparation materials from previous years for practice (Pedulla et al., 2003).
Principle 5: Teachers pay attention to the form of the questions of high-stakes tests (short-answer, essay, multiple-choice, and so on) and adjust their instruction accordingly. A wide variety of research confirms that test format does influence instruction in both positive and negative ways.
Studies in states that require students to formulate and provide written responses to test questions show an increased emphasis on teaching writing and higher-level thinking skills (Taylor, Shepard, Kinner, & Rosenthal, 2003). For example, in Kentucky, 80 percent of teachers surveyed indicated that they had increased their instructional emphasis on problem solving and writing as a result of the portfolio-based state test (Koretz, Barron, Mitchell, & Keith, 1996a).
In several studies, teachers have reported decreases in the use of more time-consuming instructional strategies and lengthy enrichment activities (Pedulla et al., 2003; Taylor et al., 2003). Further, a recent study found that the format of the state test may adversely affect the use of technology for instructional purposes: One-third of teachers in high-stakes states said that they were less likely to use computers to teach writing because students were required to construct handwritten responses on the state test (Russell & Abrams, in press).
Principle 6: When test results are the sole or even partial arbiter of future education or life choices, society treats test results as the major goal of schooling rather than as a useful but fallible indicator of achievement. Almost 100 years ago, a chief inspector of schools in England described this principle in a way that resonates today:

Whenever the outward standard of reality (examination results) has established itself at the expense of the inward, the ease with which worth (or what passes for such) can be measured is ever tending to become in itself the chief, if not sole, measure of worth. And in proportion as we tend to value the results of education for their measurableness, so we tend to undervalue and at last to ignore those results which are too intrinsically valuable to be measured. (Holmes, 1911, p. 128)
In the next five years, almost half of U.S. states will require students to pass a state-mandated test as a requirement for graduation (Edwards, 2003). As a result, a passing score on the state test is the coin of the realm for students, parents, teachers, and administrators. The social importance placed on state test scores ensures that students' successful performance on the state test is the ultimate goal for schools. Local press coverage on school pass rates and anecdotal evidence that scores on the state test may influence local real estate sales show the importance of test performance as a surrogate for education quality.
Principle 7: A high-stakes test transfers control over the curriculum to the agency that sets or controls the exam. State standards-based reform efforts leave the details and development of testing programs to state departments of education and the contractors they hire to construct the tests. This system shifts the responsibility for determining curricular priorities and performance standards away from local school administrators and classroom teachers and often results in a one-size-fits-all curriculum and test.
Falmouth, Massachusetts, provides a recent noteworthy example of how a high-stakes state test can override local control. Under the threat of losing state funding and the licensure of the school principal and superintendent, the Falmouth School Committee reversed a decision to award diplomas to special-needs students who failed the Massachusetts state exam, thus shattering the hopes of a student seeking admittance to a nonacademic culinary degree program (Myers, 2003).

From High-Stakes Tests to Multiple Measures

No one denies the importance of accountability. The relationship between test scores and accountability, however, is not as simple as most people think. The seven principles formulated in 1988 have been acted out in state after state in the past 15 years and clearly reveal the serious flaws in the practice of using a single high-stakes measure to hold all students and schools accountable.
Cut-off scores that place students in such performance categories as needs improvement, basic, proficient, or advanced are arbitrary. The subjective methods used to categorize students into performance categories often lack validity (Horn, Ramos, Blumer, & Madaus, 2000). Further, most policymakers and the public do not understand the psychometric underpinnings of the tests. Issues that might seem trivial to them, such as the assumptions made when running computer programs that produce scaled scores, and even basic decisions about rounding, have significant consequences when categorizing students.
Like any measurement tool that produces a number—such as blood pressure gauges, complex laboratory tests, radar detectors, breathalyzers, and fingerprinting—test scores are fallible. Yet most state laws do not consider margin of error when interpreting a student's scores.
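The interplay of cut scores, rounding rules, and measurement error described above can be illustrated with a short sketch. The cut score, scale, and standard error of measurement used here are invented for demonstration; they do not come from any real state test.

```python
# Hypothetical illustration: how measurement error and rounding can change
# a student's performance category at a fixed cut score. All numbers are
# invented for demonstration purposes.

def classify(score, cut=240):
    """Assign a performance label from a single scaled score."""
    return "proficient" if score >= cut else "needs improvement"

def score_band(score, sem=5.0, z=1.96):
    """Approximate 95% confidence band around an observed score,
    given a standard error of measurement (SEM)."""
    return (score - z * sem, score + z * sem)

observed = 238                       # observed scaled score, just below the cut
label = classify(observed)           # "needs improvement"
low, high = score_band(observed)     # about (228.2, 247.8)

# The band straddles the cut score, so the student's "true" score could
# plausibly fall on either side of the category boundary.
straddles_cut = low < 240 <= high    # True

# Even the rounding rule alone can flip the outcome at the boundary:
# truncating 239.5 gives 239 (fail), while rounding gives 240 (pass).
truncated = classify(int(239.5))     # "needs improvement"
rounded = classify(round(239.5))     # "proficient"
```

The point of the sketch is not the particular numbers but that a seemingly trivial implementation choice, made well below the level most policymakers ever examine, determines which category a borderline student lands in.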
Misguided executive decisions, poorly conceived legislation, understaffing, unrealistic reporting deadlines, and unreasonable progress goals can cause numerous errors in test scores (Rhoades & Madaus, 2003). In addition, scoring or programming errors can result in incorrect scores. Winerip (2003) describes the impact of human error on one Florida 3rd grader. When the tests of 71 3rd grade students who scored below the passing cut-off score were hand-scored, the process unearthed a scoring machine error that marked a question wrong for this student because of an erasure. “Instantly, Raven was transformed from 3rd-grade dufus to a state-certified 4th grader” (p. B9).
In addition, any single test can only sample knowledge and cannot give a full picture of what students know and can do. As an illustration, Harlow and Jones's (2003) interviews with students showed that on the science portion of the Third International Mathematics and Science Study (TIMSS), the students had more knowledge about concepts than their written answers had demonstrated for more than half of the test questions. Conversely, the interviews suggested that for one-third of the items, students lacked a sound understanding of the information assessed even though they had given the correct response.
A fundamental principle of social science research is to use at least two methods when studying a phenomenon, because relying on a single method can produce misleading results. We need to enhance state testing programs by including multiple measures of student achievement. Measuring in a variety of ways does not mean giving students multiple opportunities to take the same test, but rather incorporating other methods of measurement or additional criteria, such as teacher judgments, when making decisions about grade promotion and graduation.
As districts, schools, and teachers respond to federal and state test-based accountability policies, we must step back from a blind reliance on test scores. We need to acknowledge that tests, although useful, are also fallible indicators of achievement. We also need to recognize that when test scores are linked to high-stakes consequences, they can weaken the learning experiences of students, transform teaching into test preparation, and taint the test itself so that it no longer measures what it was intended to measure.
In classrooms, teachers provide many opportunities for students to demonstrate their knowledge and skills. Likewise, we should insist that test scores never be used in isolation, but that schools incorporate other indicators of what students know and can do—especially teacher judgment—before making high-stakes decisions about the education of students.
References

Airasian, P. W. (1988). Symbolic validation: The case of state-mandated, high-stakes testing. Educational Evaluation and Policy Analysis, 10(4), 301–313.

Amrein, A. L., & Berliner, D. C. (2002). High-stakes testing, uncertainty, and student learning. Education Policy Analysis Archives, 10(18). [Online]. Available: http://epaa.asu.edu/epaa/v10n18

Edwards, V. (Ed.). (2003, Jan. 9). Quality Counts 2003: If I can't learn from you. . . (Education Week Special Report), 12(17). Bethesda, MD: Editorial Projects in Education.

Goldhaber, D., & Hannaway, J. (2001). Accountability with a kicker: Observations on the Florida A+ Plan. Paper presented at the annual meeting of the Association of Public Policy and Management, Washington, DC.

Haladyna, T., Nolen, S., & Haas, N. (1991). Raising standardized achievement test scores and the origins of test score pollution. Educational Researcher, 20(5), 2–7.

Hamilton, L., Stecher, B., & Klein, S. (Eds). (2002). Making sense of test-based accountability in education. Santa Monica, CA: Rand.

Harlow, A., & Jones, A. (2003, July). Why students answer TIMSS science test items the way they do. Paper presented at the annual conference of the Australian Science Education Research Association, Melbourne, Victoria, Australia.

Herman, J., & Golan, S. (n.d.). Effects of standardized testing on teachers and learning (CSE Technical Report 334). Los Angeles: National Center for Research on Evaluation, Standards, and Student Testing.

Herszenhorn, D. (2003, July 23). Basic skills forcing cuts in art classes. The New York Times, p. B1.

Hoffman, J., Assaf, L., & Paris, S. (2001). High-stakes testing in reading: Today in Texas, tomorrow? The Reading Teacher, 54(5), 482–494.

Holmes, E. (1911). What is and what might be: A study of education in general and elementary in particular. London: Constable.

Horn, C., Ramos, M., Blumer, I., & Madaus, G. (2000). Cut scores: Results may vary. Chestnut Hill, MA: National Board on Educational Testing and Public Policy, Boston College.

Jones, M., Jones, B., Hardin B., Chapman, L., Yarbough, T., & Davis, M. (1999). The impact of high-stakes testing on teachers and students in North Carolina. Phi Delta Kappan, 81(3), 199–203.

Klein, S., Hamilton, L., McCaffrey, D., & Stecher, B. (2000). What do test scores in Texas tell us? Education Policy Analysis Archives, 8(49). [Online]. Available: http://epaa.asu.edu/epaa/v8n49

Koretz, D., Barron, S., Mitchell, K., & Keith, S. (1996a). Perceived effects of the Kentucky instructional results information system (KIRIS) (MR-792-PCT/FF). Santa Monica, CA: Rand.

Koretz, D., Barron, S., Mitchell, K., & Keith, S. (1996b). Perceived effects of the Maryland school performance assessment program (CSE Technical Report 409). Los Angeles: National Center for Research on Evaluation, Standards, and Student Testing.

Linn, R. (1998). Assessments and accountability (CSE Technical Report 490). Boulder, CO: CRESST/University of Colorado at Boulder.

Madaus, G. (1988). The influence of testing on the curriculum. In L. Tanner (Ed.), Critical issues in curriculum (pp. 83–121). Chicago: University of Chicago Press.

Madaus, G. (1991, January). The effects of important tests on students: Implications for a national examination or system of examinations. Paper prepared for the American Educational Research Association Invitational Conference on Accountability as a State Reform Instrument, Washington, DC.

Madaus, G., & O'Dwyer, L. (1999). A short history of performance assessment: Lessons learned. Phi Delta Kappan, 80(9), 688–695.

McMillan, J., Myran, S., & Workman, D. (1999, April 19–23). The impact of mandated statewide testing on teachers' classroom assessment and instructional practices. Paper presented at the annual meeting of the American Educational Research Association, Montreal, Quebec, Canada.

Myers, K. (2003, July 16). A dream denied: Aspiring chef rethinks her future as Falmouth school board bows to state pressure on MCAS. Cape Cod Times. Available: www.capecodonline.com

National Center for Education Statistics. (2002). Digest of educational statistics 2001. Washington, DC: Government Printing Office.

Pedulla, J., Abrams, L., Madaus, G., Russell, M., Ramos, M., & Miao, J. (2003). Perceived effects of state-mandated testing programs on teaching and learning: Findings from a national survey of teachers. Chestnut Hill, MA: National Board on Educational Testing and Public Policy, Boston College.

Rhoades, K., & Madaus, G. (2003). Errors in standardized tests: A systemic problem. Chestnut Hill, MA: National Board on Educational Testing and Public Policy, Boston College.

Russell, M., & Abrams, L. (in press). Instructional uses of computers for writing: The impact of state testing programs. Teachers College Record.

Stecher, B., Barron, S., Chun, T., & Ross, K. (2000). The effects of the Washington state education reform on schools and classrooms (CSE Technical Report 525). Los Angeles: National Center for Research on Evaluation, Standards, and Student Testing.

Taylor, G., Shepard, L., Kinner, F., & Rosenthal, J. (2003). A survey of teachers' perspectives on high-stakes testing in Colorado: What gets taught, what gets lost (CSE Technical Report 588). Los Angeles: CRESST.

Winerip, M. (2003, July 23). Rigidity in Florida and the results. The New York Times, p. B9.

George F. Madaus has been a contributor to Educational Leadership.

From our issue: The Challenges of Accountability