February 1, 2000 | Vol. 57, No. 5

Charting the Course of Student Growth

To be truly effective, assessments must not only measure students' current achievement levels but also monitor and report on their achievement over time.

It is a rare autumn day in late October—but no one is watching the slant of light play through the leaves. Massachusetts has just released the raw scores on last spring's high-stakes test. A teacher-researcher, Marion G., is talking with colleagues about her efforts to make sense of her school's results.
The raw data say that her school has too many 4th grade science students in the failing and needs improvement categories—and that the number of students in these categories is up from last year. Marion is puzzled; she knows more than a handful of students who can discuss such concepts as gravity and scientific evidence in a sophisticated way. She knows that as large-scale tests go, this one is well designed and carefully scored. But Marion is not looking for excuses. She also knows that a strong public education is the major, and perhaps only, economic passport that her students will ever be given.
At the same time, she is frustrated. In the new, one-dimensional state accountability system, her school's test scores could well affect whether middle-class parents consider her school "good enough," and an economic mix of students is crucial to the performance of her poorest students. Marion also worries that the fallout from low scores (bad press, remediation, pressure to raise the scores by practicing items) will wear out the school's hard-working faculty. In this version of standards-based reform, the current status of students is all that matters. There is no way to inquire about, never mind acknowledge, students' growth toward that standard.

Wrestling with the Data

The raw numbers prompt Marion and her colleagues to wrestle with questions that the scores alone cannot answer:
  • Have the 1998 4th graders and the 1999 4th graders always performed differently? As a group, did this year's class have more English-language learners? Were there fewer older siblings to act as role models? Are mobility rates higher? If so, then to make a difference, don't we need to know what works for children who bring such different learning histories to school? And do we need to begin before 4th grade?
  • Is it possible that this year's 4th graders entered school performing like, or even better than, other classes, but "lost" their momentum? If so, don't we need to know when this loss occurred so that we can understand why it happened? Did students miss more school in the early years and did their growth level off? Did they perform differently once they had to read to learn?
  • Suppose that this year's 4th graders started at a lower point than last year's students. Shouldn't we ask whether our teaching is, nevertheless, producing steady or even accelerating change? No one is excused from the standards, but don't teachers who elect this challenging work deserve to know this type of information?

Reporting Current Achievement

Standards-based reform is about setting higher standards and measuring the attainment of those standards in criterion-referenced, rather than norm-referenced, ways. But the reform also has (or used to have) two larger goals: First, as students progress through the school system, the standards for their performance will rise steadily until these standards describe what young adults need to be family members, citizens, and workers; and second, throughout the process, educators will find ways to support students who do not yet perform at the standard. Thus, the reform is (or once was) a social compact to promote growth over time in all segments of the population.
But Marion's questions point out that we have replaced this implicit pledge of support for longitudinal growth with the technologies of standards-based testing, analysis, and public reporting—when we should have joined the two. The resulting systems have a number of features that severely constrain our ability to look at growth—and thus to understand whether the reform is creating increasing numbers of able students.
Cross-sectional data collection. The large-scale assessment that Marion works with is based on a cross-sectional design. That means that different students are examined at each testing point. This type of design allows policymakers only to compare performances across cohorts (4th graders in 1997–98 versus 4th graders in 1998–99); groups within a cohort (Marion's students versus students in wealthier areas of the city or in the suburbs); and actual performances with those cited in the standards. But this design makes it impossible to look at how students grow over time.
However, users of the data regularly infer developmental patterns. In science, 56 percent of Massachusetts students perform in the upper two achievement levels in 4th grade, 11 percent in 8th grade, and 24 percent in 10th grade. On the basis of these results, many people conclude that science teaching is fine until 4th grade, deteriorates sharply in middle school, and improves slightly in high school. But because different students, items, and stakes were involved at the three testing points, Massachusetts educators have three separate data points to compare, but no basis on which to connect them into a trajectory showing the course of science learning in grades 4–12.
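A small simulation may make the danger concrete. All the numbers below are invented, not Massachusetts results; the point is only that when different students are tested at different grades, a snapshot comparison can understate, or even reverse, the growth that actually occurred.

```python
# Hypothetical illustration: cross-sectional snapshots vs. real growth.
# Two different groups of students are tested, one in grade 4 and one in
# grade 8. Every grade 8 student has grown 5 points a year, but that
# cohort started from a lower point.
import numpy as np

rng = np.random.default_rng(42)

grade4_now = rng.normal(250, 10, 1000)          # this year's grade 4 cohort
cohortB_at_grade4 = rng.normal(235, 10, 1000)   # grade 8 cohort, 4 years ago
grade8_now = cohortB_at_grade4 + 4 * 5          # 20 points of genuine growth

print(f"grade 4 snapshot mean: {grade4_now.mean():.0f}")   # about 250
print(f"grade 8 snapshot mean: {grade8_now.mean():.0f}")   # about 255

# The snapshot gap (about 5 points) says little about the 20 points of
# real growth; had cohort B started lower still, the snapshot would show
# an apparent decline despite steady individual gains.
```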
Horizontal scoring. We typically score student responses by using scales that distinguish among performances at that grade level. Far too often, the effort to rank performances at a grade level privileges easily recognizable changes in amount, detail, or length over differences in quality. For example, when the state test asks 4th graders to "name three things that we can do to help prevent the bald eagle from becoming extinct," scorers use the scale in Figure 1.
Figure 1. Horizontal Scoring

Score | Description
4 | Student response clearly and correctly describes three things that we can do to ensure that bald eagles will not become extinct.
3 | Student response describes three things that we can do but has minor errors and fewer details.
2 | Student response describes two things that we can do. There may be minor errors. OR Student response lists three things that we can do but gives no detail or explanation.
1 | Student response shows minimal understanding of the factors leading to extinction. OR Student response describes one thing we can do.
0 | Response is totally incorrect or irrelevant.
This kind of scale focuses largely on the amount of information that students give, not on their problem-solving abilities. Such considerations as the relative effectiveness of individual strategies, their compatibility as parts of a larger conservation system, or their appropriateness to the species do not factor in. From the perspectives of both science knowledge and classroom assessment models, the effort to create such horizontal scales yields a problematic model of excellence.
Broad achievement levels. Although students earn numerical scores on their state tests, their performances are typically reported and discussed in broad achievement levels (such as failing, needs improvement, proficient, or advanced). At 4th grade, more than one-third (36 percent) of students fall into the needs improvement category. But we have no idea what proportion of these students are near failing or close to proficient. In addition, given the breadth of this band, individual student performance could rise or fall without registering a change in achievement level. And when "gains" are reported, this truncated scoring can make gains look large when only tiny growth has occurred among students who originally scored at the cusp of a band (Bryk, Thum, Easton, & Luppescu, 1998). These features pose significant motivational issues for both students and teachers.
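A minimal sketch shows how this cusp effect works. The score scale, the spread, and the "proficient" cut score of 240 are all invented for illustration; nothing here comes from the Massachusetts test.

```python
# Hypothetical illustration: broad achievement bands can exaggerate change.
# Scores, spread, and the cut score are invented for this sketch.
import numpy as np

rng = np.random.default_rng(0)
CUT = 240   # hypothetical "proficient" threshold

# Year 1: many students cluster just below the cut.
year1 = rng.normal(238, 5, 1000)
# Year 2: the same students gain a tiny amount, about 1.5 points each.
year2 = year1 + rng.normal(1.5, 1.0, 1000)

for label, scores in (("Year 1", year1), ("Year 2", year2)):
    pct = (scores >= CUT).mean() * 100
    print(f"{label}: mean {scores.mean():.1f}, proficient {pct:.0f}%")

# Mean scores rise by only about 1.5 points, yet the proficient percentage
# can jump by 10 points or more because so many students sat at the cusp
# of the band. The same truncation can hide real growth within a band.
```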
Wide testing intervals. The students that Marion and her colleagues teach are tested at 4th, 8th, and 10th grades. Although Massachusetts is trying to spread out the burden of state testing, the current design provides no broad, public inquiry into student learning before or between these points. Thus, the first three years (K–2)—all of primary school—go uncharted. Moreover, the next state-level public examination after 4th grade is not until 8th grade. Although no one is eager for more testing (especially not for young children), these wide intervals mean that teachers have no signaling or self-correction system available for stretches of three or more years.

Making Growth Visible

Making growth visible would require shifting our assessment systems in at least four ways:
  • Cross-sectional to longitudinal designs. To examine growth, schools, districts, and states need to follow representative populations of the same students across grades. Although this takes planning for attrition and adequate funding, it is not impossible. For instance, both the Head Start studies and the National Education Longitudinal Study have successfully provided long-term data on school performance and factors associated with persistence and success for thousands of young U.S. students.
  • Sampling the domain to concentrating on valued performances. To monitor growth, we also have to focus not on individual standards, but on selected standards-rich performances across time. This is different from our current efforts to "cover" the domain in state tests. Changing this habit would mean that schools, districts, or states would, from the outset, elect to follow over time a few performances with long-term significance and payoff (such as writing a research paper or conducting a science experiment). This will be technically challenging because we are accustomed to thinking about the reliability, validity, and equity of specific items for specific ages. There are also issues of rotating or changing the content of tests to prevent inflation of scores. However, many Commonwealth systems, such as those of Scotland, Wales, Australia, and New Zealand, have designed their standards to demonstrate how particular valued performances can become more sophisticated over time. Interestingly, the Massachusetts science standards are laid out in major strands with accompanying developmental descriptions, demonstrating not only that the approach is feasible, but also that some of the conceptual groundwork has begun in a U.S. context.
  • Achievement levels to developmental scales. Systems that have elected more developmental approaches use scales that are independent of age and grade levels. Thus, the Australian literacy bands describe eight developmentally sequenced levels of performance. Although these documents establish levels of performance that students should achieve by certain points in their schooling, an 8-year-old can, depending on the quality of his or her performance, score at any level in the sequence. In this way, such systems are more effectively standards-based than grade- or age-based. Equally powerful are the proficiency scales developed by the foreign language community and those that special educators use to chart the growth of independent-living skills.
  • League tables to growth curves. Currently, we present the results of annual testing programs in league tables—charts that typically give the percentage of children at each achievement level at each grade. Such tables rank schools or districts from highest to lowest and may compare this year's scores with those of the previous year. This is an efficient display of results for a system that is founded on cross-sectional data collection and focused on warnings and commendations based on the current status of achievement. But these tables show only a limited window of data—too little to chart long-term trends for even the available cross-sectional data. Moreover, this information tells us nothing about whether this year's 4th grader is substantially more educated than she was in 3rd grade, or whether as a 5th grader she knows and can do substantially more than she could the year before. Significantly, both growth-curve analysis (Bryk & Raudenbush, 1992; Willett, 1994) and value-added approaches provide substantial tools for supporting such analyses and reporting.
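To make the growth-curve idea concrete, here is a minimal, hedged sketch: simulated scores for the same students across four annual waves, fit with a mixed-effects model. The data are invented, and statsmodels' MixedLM stands in for the hierarchical linear modeling software that Bryk and Raudenbush describe.

```python
# Minimal growth-curve sketch: each student gets a straight-line
# trajectory; the model estimates average yearly growth and how much
# trajectories vary from student to student. All data are simulated.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(1)
rows = []
for student in range(200):
    start = rng.normal(230, 10)    # level at the first testing wave
    growth = rng.normal(5, 2)      # points gained per year
    for year in range(4):          # four annual waves
        rows.append({"student": student,
                     "year": year,
                     "score": start + growth * year + rng.normal(0, 3)})
df = pd.DataFrame(rows)

# Random intercept and random slope on year, grouped by student.
model = smf.mixedlm("score ~ year", df, groups=df["student"],
                    re_formula="~year")
result = model.fit()
print(result.summary())   # the fixed "year" effect is average annual growth
```

The fixed effect recovers the average trajectory; the random-effects variances describe how far individual students depart from it. That is exactly the information about change over time that league tables discard.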

Are Growth Studies Feasible?

How realistic is it to imagine collecting and reporting longitudinal data? Not very—if you want to displace the current paradigm of large-scale cross-sectional testing. However, if standards-based school reform is going to keep promises about continual improvement (not just regular measurement), then faculties and families need a supplementary, low-stakes assessment system that will allow them to follow the progress that children make toward meeting the standards. How can we feasibly create such studies?
Form a statewide network or a cross-district consortium. The effort will demand more design and measurement know-how than is available at school sites or in smaller districts. Thus, we will need central technical expertise, such as in a state department of education or a research collaboration like the Chicago Consortium. Such a network also provides outside colleagues who can afford to look objectively at the data.
Select a small set of valued performances. Such work involves a major shift from the abstract, and in some ways distorting, language of standards to the actual phenomena of student performances. To examine how all the performances in the standards develop over time would be impossible. But a definable set of performances is central to school success: for instance, developing a mathematical model for a complex phenomenon; designing and conducting an experiment; reading a text critically; writing a research paper; conducting an interview; or discussing a topic in a second language. Selecting several such performances would be possible. A number of U.S. districts, such as Gwinnett County, Georgia, and Rochester, New York, have piloted such approaches in literacy.
Select longitudinal populations. Realistically, it is unlikely that we will have the resources to follow all children in this intensive way. Thus, participating communities have to define and select cohorts of children to follow. Communities may also have to draw and track large, representative samples (from which inferences about the population's growth patterns can be made) rather than following every child in the selected population. But we should concentrate on those children about whom we need the most information—those who need greater support to meet the standards.
We can follow the children from schools that have track records of serving such children very poorly or very well. Many people will argue that this is impossible, especially in urban areas where mobility rates are as high as 150 percent. But because the majority of children move between and among schools within a district, technology could help us follow them. Moreover, we could follow overlapping populations, simulating the entire course of pre-K–12 growth and cutting down on attrition. During the same three-year period, we could model K–12 growth by following several populations, such as grades K–2, 2–4, and 4–8 (Light, Singer, & Willett, 1990).
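A small sketch shows how the overlapping-cohort idea works. The set of cohort starting grades below is invented for illustration (it is not the grade spans named above); the point is that several short panels can jointly cover K–12 and share the linking grades that let separate trajectories be stitched together.

```python
# Hypothetical overlapping-cohort design: six cohorts, each followed for
# three years, together span grades K-12 (K coded as 0).
from collections import defaultdict

YEARS_FOLLOWED = 3
start_grades = [0, 2, 4, 6, 8, 10]   # invented design, not from the article

coverage = defaultdict(int)
for start in start_grades:
    for offset in range(YEARS_FOLLOWED):
        coverage[start + offset] += 1

for grade in sorted(coverage):
    label = "K" if grade == 0 else str(grade)
    print(f"grade {label:>2}: observed by {coverage[grade]} cohort(s)")

# Every grade from K through 12 is observed at least once, and grades
# 2, 4, 6, 8, and 10 are observed by two cohorts; those shared grades
# are what let us link the cohorts into one simulated K-12 trajectory.
```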
Develop research designs and scales for describing growth over time. This would involve developing valid and reliable approaches to eliciting such performances across a range of ages; a series of descriptors and benchmark performances that portray the ways in which selected performances become more sophisticated over time; and a pool of scorers trained to think developmentally. This is technically challenging in a country as diverse as the United States. Nevertheless, the experience of the Commonwealth countries, the American Council on the Teaching of Foreign Languages proficiency scales, and the numerous systems for describing the development of early literacy prove that it is not impossible. The requisite research designs and statistics are already widely used. They inform such fundamental measures of growth as the height and weight charts that every pediatrician uses, as well as such large-scale studies as the National Education Longitudinal Study.
Map growth as necessary professional development. Many teachers will say, "No more testing!" But they will also tell you that they are starved for the opportunity to talk with skilled and experienced colleagues in a sustained way about their practice and their students' performances. Hence, there is considerable interest in protocols for examining student work and the assignments that generate it. Well-conducted longitudinal cohort studies could use common assessment tasks and samples of ongoing classroom assignments relevant to the valued performances. Scoring the resulting student work against the developmental scales could provide a kind of "spinal cord" for a school's efforts to understand student growth and the role that quality teaching and assignments play in producing good work. This collaboration also could help teachers fashion a continual corridor of learning opportunities in a mutual and reciprocal fashion: The 1st, 2nd, and 3rd grade teachers are responsible for developing children who can thrive in a demanding 4th grade. In turn, 4th grade teachers must acknowledge and build on their colleagues' earlier work.
Use longitudinal results to talk with families and students about growth. An advantage of studies that follow student development over time is their transparency. These studies are like the height lines that families draw on a kitchen door to measure their child's growth or the different colored belts in a karate academy. Developmental scales written in clear language and comparisons with benchmark samples show the level of current work and how far students have to go. This clarity can help ensure that the messages are clear and constant. Using these scales, students can appraise their own work, compare it with work of the previous year, and set goals for making a difference in the coming year.

Why Bother?

When Massachusetts legislators wrote the Education Reform Act of 1993, the language of the bill called for a system of assessments that would include, but would in no sense be limited to, a large-scale, curriculum-independent state test for grades 4, 8, and 10. Some argue that other assessments should fill in what a large-scale program cannot do easily or well, such as assess public speaking, world language competence, or artistic skills. But Marion and her fellow teachers argue for something quite different: the longitudinal monitoring of student growth. Why?
Because as a culture, we can't afford to believe that student capacity is innate and fixed, immune to development. Such beliefs consign large numbers of children to failure.
Because without a system designed to monitor growth, we will confuse the benefits of socioeconomic advantage with the results of good teaching. Teachers in poor schools will be doomed to underperformance, and teachers in wealthier settings can coast on the backs of special lessons, summer camps, and parental education levels.
Because we need a professional teaching culture that emphasizes shared responsibility for continual improvement, not simply high scores at certain grades. It takes a school, not a few good teachers, to educate a child.
Because if we are genuinely interested in developing the abilities of a wider range of students, nothing is more crucial than diagnosis—the skill of knowing what will support the next increment of growth, however small. Until we are willing to create professional systems that focus squarely on change over time, we are unlikely to nurture this capacity.
References

Bryk, A. S., & Raudenbush, S. W. (1992). Hierarchical linear models: Applications and data analysis methods. London: Sage.

Bryk, A. S., Thum, Y. M., Easton, J. Q., & Luppescu, S. (1998, March). Academic productivity of Chicago public elementary schools (Examining Productivity Series). Chicago: Consortium on Chicago School Research.

Light, R. J., Singer, J. D., & Willett, J. B. (1990). By design: Planning research on higher education. Cambridge, MA: Harvard University Press.

Willett, J. B. (1994). Measuring change more effectively by modeling individual growth over time. In T. Husen & T. N. Postlethwaite (Eds.), The international encyclopedia of education (2nd ed.). Oxford, England: Pergamon Press.

Dennie Palmer Wolf has been a contributor to Educational Leadership.
