March 1, 1994
Vol. 51
No. 6

Lessons from the Field About Outcome-Based Performance Assessments

Robert Marzano
From its work with teachers, the Mid-continent Regional Educational Laboratory concludes that outcome-based performance tasks have definite promise if schools proceed cautiously.

The use of performance assessments has received a great deal of attention recently in educational literature. One common argument for their increased use is that they provide information about students' abilities to analyze and apply information—their ability to think—whereas more traditional forms that employ forced-choice response formats (multiple-choice, fill-in-the-blank, true/false) assess only students' recall or recognition of information. Lauren Resnick (1987) sums up this argument for performance assessments: “Many of the tests we use are unable to measure what should be the hallmark of a ‘thinking’ curriculum: the cultivation of students' ability to apply skills and knowledge to real-world problems. Testing practices may, in fact, interfere with the kind of higher-order skills that are desired.” Other reasons for the use of performance assessments include: (1) they provide clear guidelines for students about teacher expectations (Berk 1986); (2) they reflect real-life challenges (Hart 1994); (3) they make effective use of teacher judgment (Archbald and Newmann 1988); (4) they allow for student differences in style and interests (Mitchell 1992, Wiggins 1989); and (5) they are more engaging than other forms of assessment (Wiggins 1991).
The perceived benefits of performance assessment have generated advances in measurement theory to accommodate this new form of assessment. For example, great strides have been made in advancing a new theory regarding the generalizability of performance tasks (Shavelson and Webb 1991, Shavelson et al. 1989); in identifying the underlying skills and abilities measured by performance assessments (Embretson 1993); and in developing effective procedures for scoring performance assessments (Baker et al. 1992). Virtually all of these advances are in specific content areas—most notably, science and mathematics.
Unfortunately, these technical advances do not directly address the needs of the many schools and districts that, in the name of reform, equate performance assessment with outcome-based education (OBE). Many of the outcomes identified by schools and districts involve skills and abilities not usually covered in traditional content domains. Consequently, outcome-based performance assessments incorporate skills and abilities not yet addressed in the research and theory on content-specific performance assessments.
For more than three years, the Mid-continent Regional Educational Laboratory (McREL) in Aurora, Colorado, has worked with schools, districts, and states involved in developing outcome-based performance assessments. Following is a report of what we see happening in the field and some conclusions from studies we've made.

Reform Movement Spurs the Process

Directly or indirectly, many local outcome-based education programs have been influenced by Bill Spady's conceptualizations, although his concept of transformational OBE has not generally been embraced. Specifically, reformers have not heeded Spady's recommendation that to truly “transform” education, learning objectives within specific content domains must be discarded in favor of objectives that reflect more realistic life roles (Spady 1988, Spady and Marshall 1991). Rather, most schools, districts, and states have attempted to implement what Spady refers to as “transitional” OBE. Within this model, outcomes are identified for traditional content areas such as mathematics, science, history, and the language arts—as well as for less traditional areas such as community involvement, the ability to provide quality products, and the ability to work cooperatively and collaboratively with others.
The Aurora Public Schools are perhaps the prototypic example of the transitional approach. They have identified five broad learner outcomes: self-directed learner, collaborative worker, complex thinker, quality producer, and community contributor (Redding 1992). Additionally, they have identified a number of content area outcomes. Numerous schools, districts, and even entire states, such as Minnesota and Oregon, have adopted or are considering adopting similar transitional models.

Proficiencies Spell Out Outcomes

In general, we have found that the “big” outcomes—those that provide the overall framework for school reform—are frequently subsets of the following eight outcomes, although different wording is commonly used: knowledgeable person, complex thinker, skilled information processor, effective communicator/producer, collaborative/cooperative worker, self-regulated learner, community contributor/responsible citizen, and tolerant learner/culturally diverse learner.
Each outcome is then spelled out as a set of more specific “proficiencies.” For example, the Aurora Public Schools specify that a self-directed learner:
  1. sets priorities and achievable goals,
  2. monitors and evaluates progress,
  3. creates options for self,
  4. assumes responsibility for actions, and
  5. creates a positive vision for self and the future (Redding 1992).
The important point here is that proficiencies are the components around which performance assessment, curriculum, and instruction are organized. Because they play such an important role in the reform movement and in outcome-based performance assessment, we consider proficiencies rather than outcomes to be the driving force in the schools, districts, and states we have worked with. Let's briefly consider the proficiencies within one of the eight common outcomes, the knowledgeable person. In mathematics, for example, proficiencies specify that a student:
  • understands and uses basic aspects of probability;
  • utilizes basic and advanced computational techniques within problem solving; and
  • utilizes a variety of problem-solving strategies and techniques.
In geography, a student:
  • understands the characteristics and uses of spatial organization of the earth's surface;
  • understands the physical and human characteristics of place; and
  • understands the characteristics and uses of mental maps of earth.
Schools and districts commonly identify between 30 and 80 proficiencies within the knowledgeable person outcome. In contrast, they identify far fewer proficiencies for the other outcomes—usually about three to seven for each. Figure 1 lists commonly identified proficiencies for the other outcomes.

Figure 1. Example Proficiencies in Non-Content Categories

Complex Thinker:

  • Uses a variety of reasoning strategies

  • Monitors own thinking

  • Develops conclusions based on sound evidence

  • Solves problems and makes decisions effectively

Skilled Information Processor:

  • Effectively interprets and synthesizes information

  • Effectively uses a variety of information-gathering techniques and resources

  • Accurately assesses the value of information

  • Recognizes where and how projects would benefit from additional information

Effective Communicator/Producer:

  • Expresses ideas clearly

  • Effectively communicates with diverse audiences

  • Effectively communicates through a variety of mediums

  • Creates quality products

Collaborative/Cooperative Worker:

  • Works toward the achievement of group goals

  • Contributes to group maintenance

  • Demonstrates effective interpersonal skills within a group

  • Is sensitive to the levels of knowledge and feelings of others within a group

Self-Regulated Learner:

  • Sets and carries out personal goals

  • Perseveres in difficult situations

  • Pushes the limits of his or her knowledge and ability

  • Restrains impulsivity

Community Contributor/Responsible Citizen:

  • Participates in the democratic process

  • Recognizes and takes action on community problems

  • Understands basic values of community

  • Takes and defends a position when warranted

Tolerant Learner/Culturally Diverse Learner:

  • Understands the differences in beliefs and values among various social and ethnic groups

  • Exhibits ability in a variety of community roles

  • Works effectively with those who have different beliefs and values

  • Resolves conflict in positive ways

 


In summary, most schools, districts, and states involved in outcome-based education begin by identifying specific proficiencies that represent broad outcomes. These proficiencies form the basis for constructing outcome-based performance assessments.

Performance Assessments Are Primary

Schools, districts, and states that have defined proficiencies for various outcomes usually view performance assessments as their primary assessment tool. By definition, outcome-based performance assessments are structured to provide information about students' skills and abilities on the various proficiencies.
To illustrate, consider the following performance task, designed for use by high school teachers in South Dakota to provide assessment information about four proficiencies:
There is a current debate about giving the Black Hills of South Dakota back to the Lakota—the Native Americans who lived on the land before the United States took it over. Research the history of the Black Hills, focusing on the treaties and how they influenced the use of the land. Use at least three different sources for your research (for example, books, personal interviews, articles, videotapes). Based on what you find out about the treaties and how they affected the use of the land, construct an argument for or against returning the land to the Lakota. Include in your argument specific references to the use of conflict, power, and cooperation during negotiations for the various treaties. Prepare to present your argument in a video documentary, a debate, a pamphlet written and produced for the public, a slide show, or an oral/visual presentation to be given at a meeting of public policy people studying the issue.
You will be assessed on and provided with rubrics for the following:
  • Knowledgeable Person Proficiency: Your understanding of how conflict, power, and cooperation in social, political, and economic spheres influence the ownership and use of national resources.
  • Complex Thinker Proficiency: Your ability to provide sufficient and appropriate evidence for a claim.
  • Information Processing Proficiency: Your ability to use a variety of sources.
  • Communication Proficiency: Your ability to effectively communicate through a variety of mediums.
Once a task is constructed, rubrics are designed for the various proficiencies embedded in the task. As is the case with most real-world tasks, performance tasks do not have a single correct answer. Consequently, student performance must be judged by one or more persons guided by well-defined criteria, usually stated in the form of a rubric. More specifically, a scoring rubric consists of a fixed scale and characteristics describing performance for each point on the scale. Here is a rubric for the complex thinker proficiency in the South Dakota task:
  4 = Provides a comprehensive argument that represents a detailed survey of the available information relative to the claim; additionally, all important aspects of the argument are well documented.
  3 = Provides a well-documented argument that includes all the major evidence that supports the claim.
  2 = Provides support for his or her claim, but the argument does not include some important points and/or provides poor documentation for the evidence in his or her argument.
  1 = Provides little cohesive support for his or her claim and/or presents little or no documentation with the argument.
Because the performance task is intended to measure four proficiencies, three other sets of rubrics are used to score the task. All rubrics are presented to students prior to the task.
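Mechanically, a rubric of this kind is just a fixed scale whose points map to performance descriptors, and a rater's score must fall on that scale. The sketch below illustrates that structure in Python; the descriptors are abridged from the rubric above, and the representation itself is our illustration, not something prescribed by the task designers.

```python
# A scoring rubric: a fixed scale whose points map to performance descriptors.
# Descriptors abridged from the complex thinker rubric above; the data
# structure is hypothetical, for illustration only.
COMPLEX_THINKER_RUBRIC = {
    4: "Comprehensive argument; detailed survey of available information; well documented",
    3: "Well-documented argument including all major supporting evidence",
    2: "Supported claim, but missing important points and/or poorly documented",
    1: "Little cohesive support and/or little or no documentation",
}

def score_on_rubric(rubric: dict, score: int) -> str:
    """Check that a rater's score falls on the rubric's fixed scale
    and return the descriptor that score corresponds to."""
    if score not in rubric:
        raise ValueError(f"score {score} is not on the scale {sorted(rubric)}")
    return rubric[score]

print(score_on_rubric(COMPLEX_THINKER_RUBRIC, 3))
```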

Students Don't Always Do Well on Performance Tasks

One important question about outcome-based performance assessments is, “How well do students perform on them?” Although outcome-based performance assessments are fairly new, the National Assessment of Educational Progress (NAEP) has gathered considerable information on domain-specific performance tasks. In general, students do not perform well on them. For example, on a history performance task asking students to utilize a fairly straightforward analytic proficiency in the complex thinking category, only 27 percent of American 12th graders—and only 16 percent of 8th graders—provided an adequate or better response (Mullis et al. 1990).
In contrast, the percentages appear much higher for outcome-based performance tasks. For example, in a sample of 383 students from grades K–5, 55 percent of them provided adequate or better responses on content proficiencies, and 45 percent did so on complex reasoning proficiencies within outcome-based performance tasks.
In isolation, these results might lead one to conclude that the students in our sample outperformed students across the nation. However, the tasks performed by our students were embedded within classroom instruction, whereas the NAEP tasks were presented more as “tests” taken under controlled conditions. Our students, then, had the advantage of teacher guidance, peer support, and an unhurried pace. The differences in performance may have been a result of the added resources available to them. In fact, when a sample of students was drawn from the same district and given “secured” mathematics performance tasks under controlled conditions, the results were quite different. That task assessed three proficiencies—two were content related, the other was a complex thinking proficiency. On the two content proficiencies, 15 percent and 29 percent of the students provided an adequate or better response. On the complex reasoning proficiency, 19 percent of the students provided an adequate or better response.
Hence, there appears to be a significant difference in student performance between tasks embedded in classroom instruction and those that are presented in a controlled fashion. Consequently, results on performance tasks, whether domain-specific or outcome-based, must be interpreted in the context of the instruction and guidance (or lack thereof) provided before or during their administration.

Teachers Find Performance Tasks Helpful

Another commonly asked question about outcome-based performance tasks is the extent to which teachers find them useful. To explore this issue, we polled 62 teachers who had been using outcome-based performance tasks for at least six months. The teachers came from three different districts and represented K–12 classrooms and a variety of content areas.
When asked how useful they perceived performance tasks to be in terms of enhancing student learning, 74 percent of the teachers rated them as useful or highly useful. Additionally, 67 percent of the respondents indicated that these tasks provided them with better assessment information than more traditional classroom assessment practices (for example, quizzes and end-of-chapter tests). However, when asked how many performance tasks they could administer in a month's time, the average response was 1.3. Because a single outcome-based performance task usually assesses no more than four proficiencies, the relatively infrequent use of such tasks implies that teachers will have to supplement the results with information from more traditional forms of assessment.

Specific Rubrics Are Reliable

A critical question regarding the use of outcome-based performance assessments is, “How reliable are they?” In this case, reliability refers to the extent to which independent raters agree on the scores assigned to students on the various proficiencies measured within performance assessments. This is called inter-rater reliability. Research on content-specific performance tasks has already demonstrated their reliability. For example, Richard Shavelson notes that performance assessments in mathematics and science can be scored in a highly reliable fashion (Shavelson et al. 1993). However, as mentioned previously, that research was conducted on domain-specific tasks that do not address the proficiencies commonly found in outcome-based performance tasks.
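The figures below do not specify which statistic was used to index rater agreement; one common choice is the correlation between two raters' scores over the same set of students (percent exact agreement and Cohen's kappa are alternatives). A minimal sketch of that computation in Python, using hypothetical rubric scores:

```python
import math

def pearson(x: list, y: list) -> float:
    """Pearson correlation between two raters' scores for the same students."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

# Hypothetical rubric scores (1-4) from two raters for ten students.
rater_a = [4, 3, 3, 2, 4, 1, 2, 3, 4, 2]
rater_b = [4, 3, 2, 2, 4, 1, 3, 3, 4, 1]

r = pearson(rater_a, rater_b)
print(f"inter-rater reliability (correlation) = {r:.2f}")
# The .80 criterion cited in the text would require r >= .80 before
# individual judgments are treated as dependable.
```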
In a number of independent studies, we have found some strong trends in the reliability of outcome-based performance tasks. Column A of Figure 2 illustrates that the reliability of teacher judgments of the various proficiencies addressed within outcome-based performance tasks can be quite high. The ranges report the lowest and the highest reliabilities of those that were calculated. In interpreting these ranges, measurement experts recommend that inter-rater reliabilities be at least .80 before individual teachers can be justified in making judgments about students on proficiencies embedded in performance tasks (Shavelson and Baxter 1992). The highest reliability in each category of Column A exceeds this .80 criterion. However, the lowest reliability in each category is well below it. In other words, teachers sometimes rated the various proficiencies quite reliably, but sometimes quite unreliably.

Figure 2. Inter-Rater Reliabilities for Teacher Judgments on Specific Tasks (A) and Retrospective Teacher Judgments (B)

  • Knowledgeable Person: (A) Average = .68, R = .41–.95, N = 45; (B) Average = .61, R = .35–.88, N = 21
  • Complex Thinker: (A) Average = .69, R = .39–.92, N = 48; (B) Average = .52, R = .34–.83, N = 38
  • Skilled Information Processor: (A) Average = .64, R = .34–.94, N = 27; (B) Average = .60, R = .37–.85, N = 17
  • Effective Communicator/Producer: (A) Average = .67, R = .47–.92, N = 31; (B) Average = .54, R = .46–.91, N = 22
  • Collaborative/Cooperative Worker: (A) Average = .61, R = .38–.91, N = 18; (B) Average = .58, R = .41–.93, N = 20
  • Self-Regulated Learner: (A) Average = .59, R = .32–.87, N = 21; (B) Average = .50, R = .32–.81, N = 14
  • Community Contributor/Responsible Citizen: (A) Average = .51, R = .35–.86, N = 7; (B) Average = .47, R = .39–.82, N = 14
  • Tolerant Learner/Culturally Diverse Learner: (A) Average = .52, R = .32–.82, N = 5; (B) Average = .46, R = .31–.80, N = 10

Average = average inter-rater reliability; R = range of inter-rater reliabilities (lowest through highest); N = number of inter-rater reliability coefficients used to calculate the average.

To determine which tasks produced high reliabilities, we analyzed 22 tasks designed by K–5 teachers. Although we found little difference in the tasks themselves, we did find great differences in the rubrics for these tasks. Specifically, we grouped the tasks into three categories. The tasks with the most specific rubrics had an average reliability of .84; the tasks with the least specific rubrics had an average reliability of .40; and the tasks in the middle group had an average reliability of .58. In general, then, outcome-based performance tasks whose rubrics are written specifically for the proficiencies assessed can be scored quite reliably, whereas tasks whose rubrics are very general cannot.

Retrospective Judgments Are Sometimes Reliable

Many of the teachers we have worked with believe that they can assess students' performance on various proficiencies without designing and administering performance tasks. Their assumption, if correct, would render outcome-based assessment relatively easy, because teachers could simply make holistic judgments about students at the end of a period of time. To study this issue, we ran a series of studies in which teachers made “retrospective” judgments at the end of a quarter or semester about students' performance on various proficiencies, using their general impressions. Column B of Figure 2 lists the reliabilities for these judgments.
One of the more interesting patterns is that the average inter-rater reliabilities for the retrospective judgments were always lower than those for judgments within specific tasks. For example, the average reliability for proficiencies within the complex thinker outcome was .69 for specific tasks but .52 for retrospective judgments. The greatest difference between the two averages, .17, occurred within the complex thinker outcome; the smallest, .03, within the collaborative/cooperative worker outcome. This implies that for some proficiencies, like those related to collaboration and cooperation, teachers' general impressions may be as consistent as their judgments of specific situations. However, for other proficiencies, like those related to the complex thinker outcome, teachers' judgments of specific tasks are far more consistent than their general impressions.
It's possible that some proficiencies involve behaviors that are readily observable by teachers in a variety of settings, whereas other proficiencies involve behaviors that are observable only when elicited by specific tasks or situations. If this is the case, a great deal of research must be conducted to separate those proficiencies that require the context of specific tasks from those that can be judged in a retrospective manner.

How Valid Are Outcome-Based Performance Assessments?

One of the major issues regarding performance assessments is their validity: the extent to which they measure what they are supposed to measure. Commonly, performance tasks are considered to have strong “face validity,” meaning that the assessment appears to measure what it is supposed to. While some of the rhetoric regarding performance assessments has perhaps given the impression that face validity has now attained a more elevated status, researchers Robert Linn and Eva Baker (1991) warn that it is still not sufficient. Rather, performance assessments must still be expected to demonstrate that they do in fact measure what they purport to measure. Gathering evidence that outcome-based performance assessments measure specific proficiencies is quite difficult, and the evidence must be quite diverse to demonstrate validity. One type of evidence is the extent to which judgments of performance on the various proficiencies measured within a given task are independent of one another. Again, we have noticed some patterns.
When outcome-based proficiencies are assessed within the context of performance tasks, judgments appear to be somewhat independent. For example, the correlations among proficiencies from the knowledgeable person, complex thinker, information processor, communication, and cooperation/collaboration outcomes averaged .47. This relatively low correlation suggests that when proficiencies from these categories are embedded in the same task, teachers can judge students' performance on each proficiency relatively independently of the others.
In contrast, when teachers make retrospective judgments about proficiencies over time, they may allow a student's performance in one proficiency to influence their judgments of the student's performance in another. Specifically, the average correlation among proficiencies from the knowledgeable person, complex thinker, information processor, communication, and cooperation/collaboration outcomes was .71 for retrospective judgments made over time. If teacher judgments about multiple proficiencies are interdependent within holistic judgments made over time, such judgments are not valid measures of the individual proficiencies. Additionally, those same judgments had an average correlation of .75 with students' grade point averages, suggesting that teachers are highly influenced by students' overall academic performance when making holistic judgments over time.
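To make these averages concrete: each student has one score per proficiency, every pair of proficiency score columns is correlated across students, and the pairwise coefficients are averaged. A minimal sketch under those assumptions (the scores and the use of Pearson correlation are ours; the article does not describe its computation):

```python
from itertools import combinations
import math

def pearson(x, y):
    """Pearson correlation between two score columns over the same students."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

# Hypothetical rubric scores (1-4): one column of ten students per proficiency.
scores = {
    "knowledgeable_person":  [4, 3, 2, 4, 1, 3, 2, 4, 3, 2],
    "complex_thinker":       [3, 3, 2, 4, 2, 2, 1, 4, 3, 3],
    "information_processor": [4, 2, 3, 3, 1, 3, 2, 3, 4, 2],
}

# Average the correlations over all pairs of proficiencies.
pairs = list(combinations(scores, 2))
avg = sum(pearson(scores[a], scores[b]) for a, b in pairs) / len(pairs)
print(f"average inter-proficiency correlation = {avg:.2f}")
# A low average (e.g., the .47 reported for task-embedded judgments) suggests
# proficiencies are judged relatively independently; a high average (e.g., the
# .71 for retrospective judgments) suggests one judgment contaminates another.
```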

Proceed with Caution

Our studies of outcome-based performance tasks indicate that they have definite promise. Teachers view them as valuable assessment tools, superior in many cases to more traditional forms of classroom assessment. For the types of proficiencies commonly identified by schools, districts, and states involved in outcome-based education, they might be the most viable means of collecting assessment data. For a few proficiencies, holistic, retrospective judgments may be used effectively.
These positive findings, however, must be interpreted with caution. Given their complexity, outcome-based performance tasks probably cannot be used very frequently by classroom teachers; thus, they will probably not totally replace more traditional assessments. Finally, much research is needed to determine the validity of outcome-based performance tasks and the conditions under which high inter-rater reliabilities can be guaranteed.
References

Archbald, D. A., and F. M. Newmann. (1988). Beyond Standardized Tests. Reston, Va.: National Association of Secondary School Principals.

Baker, E. L., P. R. Aschbacher, D. Niemi, and E. Sato. (1992). CRESST Performance Assessment Models: Assessing Content Area Explanations. Los Angeles, Calif.: National Center for Research on Evaluation, Standards, and Student Testing (CRESST), UCLA.

Berk, R. A., ed. (1986). Performance Assessment: Methods and Applications. Baltimore, Md.: The Johns Hopkins University Press.

Embretson, S. (1993). “Psychometric Models for Learning and Cognitive Processes.” In Test Theory for a New Generation of Tests, edited by N. Frederiksen, R. J. Mislevy, and I. I. Bejar. Hillsdale, N.J.: Erlbaum.

Hart, D. (1994). Authentic Assessment: A Handbook for Educators. Menlo Park, Calif.: Addison Wesley.

Linn, R., and E. L. Baker. (1991). Complex, Performance-Based Assessment: Expectations and Validation Criteria. Los Angeles, Calif.: National Center for Research on Evaluation, Standards, and Student Testing (CRESST), UCLA.

Mitchell, R. (1992). Testing for Learning: How New Approaches to Evaluation Can Improve American Schools. New York: The Free Press.

Mullis, I. V. S., E. H. Owen, and G. W. Phillips. (1990). America's Challenge: Accelerating Academic Achievement (A Summary of Findings from 20 Years of NAEP). Princeton, N.J.: Educational Testing Service.

Redding, N. (1992). “Assessing the Big Outcomes.” Educational Leadership 49, 8: 49–53.

Resnick, L. B. (1987). Education and Learning to Think. Washington, D.C.: National Academy Press.

Shavelson, R. J., and G. R. Baxter. (1992). “What We've Learned about Assessing Hands-On Science.” Educational Leadership 49, 8: 20–25.

Shavelson, R. J., X. Gao, and G. R. Baxter. (1993). Sampling Variability of Performance Assessments (CSE Technical Report 361). Los Angeles, Calif.: National Center for Research on Evaluation, Standards, and Student Testing (CRESST), UCLA.

Shavelson, R. J., and N. M. Webb. (1991). Generalizability Theory: A Primer. Newbury Park, Calif.: Sage Publications.

Shavelson, R. J., N. M. Webb, and G. Rowley. (1989). “Generalizability Theory.” American Psychologist 44: 922–932.

Spady, W. G. (1988). “Organizing for Results: The Basis of Authentic Restructuring and Reform.” Educational Leadership 46, 2: 4–8.

Spady, W. G., and K. J. Marshall. (1991). “Beyond Traditional Outcome-Based Education.” Educational Leadership 49, 2: 67–72.

Wiggins, G. (1989). “Teaching to the (Authentic) Task.” Educational Leadership 46, 7: 41–47.

Wiggins, G. (1991). “Standards, Not Standardization: Evoking Quality Student Work.” Educational Leadership 48, 5: 18–25.

Robert Marzano is the CEO of Marzano Research Laboratory in Centennial, CO, which provides research-based, partner-centered support for educators and education agencies—with the goal of helping teachers improve educational practice.

As strategic advisor, Robert brings over 50 years of experience in action-based education research, professional development, and curriculum design to Marzano Research. He has expertise in standards-based assessment, cognition, school leadership, and competency-based education, among a host of areas.

He is the author of 30 books, 150 articles and chapters in books, and 100 sets of curriculum materials for teachers and students in grades K–12.
