by Grant Wiggins and Jay McTighe
Purpose: To identify the most appropriate evaluative criteria for assessments, mindful of the desired results from Stage 1.
Desired Results:
Unit designers will understand that
- Valid assessment requires appropriate criteria for evaluating student work.
- Appropriate criteria are derived primarily from the Stage 1 goals being assessed, not just from the surface features of a particular assessment task.
- Evaluative criteria should correspond to the most salient features that distinguish understanding and masterful transfer performance, not merely those qualities that are easiest to see or score.
- Complex performance involves different aspects, so unit assessments typically involve varied criteria so that students can receive the most helpful feedback.
- Rubrics are evaluative tools based on criteria. There are two widely used types of rubrics—holistic and analytic.
- Students may be given product/performance choices within an assessment task, but the evaluative criteria and rubric must remain consistent—that is, aligned with the goals of Stage 1 (and the choices must provide valid indicators of what is being measured).
Unit designers will be able to
Develop the most appropriate criteria for use in their Stage 2 assessments. These criteria will be the basis for more detailed scoring rubrics.
Module Design Goals: In this module, you will learn to distinguish among four types of evaluative criteria and how to select and weight appropriate criteria based on Stage 1 goals.
You should work on Module J if you have not already considered or identified appropriate evaluative criteria for your assessments in Stage 2.
You might skim or skip Module J if you have already identified appropriate evaluative criteria or have appropriate scoring tools (rubrics, checklists) for your assessments, reflecting Stage 1 goals.
We have made the case (Wiggins & McTighe, 2011) for using open-ended performance assessments to gather evidence that students understand and can transfer their learning. So how should we evaluate their performance on assessment tasks and prompts that do not result in a single "correct" answer or solution process? In this module we explore how to identify and use appropriate evaluative criteria.
Clearly defined criteria are used to guide evaluative judgments about products and performances related to the overall goals identified in Stage 1. The clarity provided by well-defined criteria helps to make a judgment-based process as consistent and defensible as possible when evaluating student performance. Look at the example from a driver's education unit shown in Figure J.1. Regardless of the task specifics or who is doing the judging, we would look for drivers to be (1) skillful, (2) courteous, (3) defensive, (4) responsive to conditions, and (5) law-abiding. From those five criteria we can then build rubrics useful to both students and would-be judges. In fact, we can then also construct a variety of valid driving tasks to evoke those kinds of behaviors. More generally, when agreed-upon criterion-based assessment tools are used throughout a department or grade-level team, school, or district, more consistent grading and test design occur because the criteria and their weight do not vary from teacher to teacher.
[Figure J.1: Stage 2—Evidence (driver's education unit). The template pairs an Evaluative Criteria column ("Performance is judged in terms of—") with an Assessment Evidence column ("Students will show their learning by …"), which lists the transfer task(s) and other evidence.]
A second benefit of criterion-based evaluation tools relates to teaching. Clearly defined criteria provide more than just evaluation tools to use at the end of instruction; they help make performance goals transparent for all. For instance, specifying indicators and examples of what defensive driving looks like (and what other driving looks like) becomes part of the instructional plan so that the learners can use these indicators when watching others drive and when practicing on their own.
Practice in using criteria and indicators helps teachers as well as students. Educators who have scored student work as part of a large-scale performance assessment at the district, state, or national level often observe that the process of evaluating student work against established criteria teaches them a great deal about what makes the products and performances successful. As teachers internalize the qualities of solid performance, they become more attentive to those qualities in their teaching and provide more specific and helpful feedback to learners.
Similar benefits apply to students. When students know the criteria in advance of their performance, they have clear, stable goals for their work (especially when they also see complementary samples of work that make the criteria concrete). There is then no mystery as to the desired elements of quality or the basis for evaluating (and grading) products and performances. Students don't have to guess about what is most important or how their work will be judged.
In addition, by sharing the criteria and related evaluation tools with students, we offer them the opportunity to self-assess as they work—a key element in student achievement gains. Using criterion-based evaluation tools as a part of instruction helps students to get their minds around important elements of quality and to use that knowledge to improve their performance along the way. Thus the criteria serve to enhance the quality of student learning and performance, not simply to evaluate it.
Consideration of criteria brings us to an important understanding for unit designers about the connection between Stages 1 and 2. The most appropriate criteria are derived from the Stage 1 goals being assessed, not primarily from a particular assessment task. This may seem odd at first. Aren't evaluative criteria determined after you have identified the assessment task? No! Consider an essay. Regardless of the specifics of any essay prompt, we should judge all essays against the same criteria—the qualities that make an essay effective. Thus criteria tell us where to look and what to look for in specific task performance to determine the extent to which the more general goals have been achieved.
This insight is signaled in the UbD Template by the location of the Criteria column to the left of the Performance Task section. This placement is meant to suggest that the criteria need to be considered first based on the desired results of Stage 1. Note that this idea is exemplified in the driver's education example—the same four criteria are used to assess the second and third performance tasks because they are derived from the overall goals of "skillful" and "responsive" driving.
In sum, the logic of backward design is common sense. The unit goals of Stage 1 dictate the needed assessments as well as the associated criteria and their weights.
Design Tip: Because good feedback is specific, you may want to add concrete indicators related to the unique aspects of the task. For example, the general trait of "defensive driving" may look different in varied driving tasks. Therefore, we might frame our rubrics with the general trait first, followed by task-specific indicators, as follows:
DEFENSIVE DRIVING: Students indicate that they drive defensively in this particular situation by ___________________________________.
Thus the key criteria are prominent and recurring, whereas the feedback is concrete and specific for each situation.
Because most complex performances have multiple dimensions, we typically need to use multiple criteria for evaluation. For example, in a problem-solving task, a solution needs to be not only accurate but also supported by sound reasoning and evidence. Similarly, a graphic design needs to be well executed while also communicating an idea or emotion.
The need for multiple criteria does not imply that "more is better." In fact, the challenge is to identify the smallest set of valid and independent criteria. Valid criteria are those that are essential to genuinely successful work, not arbitrary or merely easy to score. Independent criteria are those that are not linked to other criteria in how performance unfolds—that is, successfully meeting one criterion has little bearing on the others. For example, one science lab report could have conclusions unsupported by evidence even though the lab procedures were meticulously followed, whereas another report could show sloppy record keeping yet provide an insightful and well-supported conclusion. In other words, following procedures is independent of reaching well-supported conclusions. Or one student's essay might be mechanically correct but dull, whereas another student's essay might be riddled with errors in spelling and grammar but fascinating. Mechanically correct and thought-provoking are independent of one another.
What we have said about independent and useful criteria is evident in our driving example (Figure J.1). Recall that on-the-road performances are evaluated against five criteria: (1) skillful, (2) courteous, (3) defensive, (4) responsive to conditions, and (5) law-abiding. Notice again that one could be skilled but not law-abiding, and vice versa; one could be attentive but not defensive, and vice versa.
Thus the best scoring system uses the most feasible set of valid and independent criteria. In practice, this means that scorers usually use three to six independent criteria in judging complex work in order to optimize the balance of high-quality feedback and efficiency of scoring.
In short, our concern about the validity of the set of criteria is a variant of the two-question validity test for assessments (see Module D, Wiggins & McTighe, 2011). In this case, however, we are "testing" the validity of the criteria used to evaluate the work, not the task itself, with these questions: Could all of the proposed criteria be met, yet the performance or product still fail to achieve the targeted Stage 1 goal? Could the goal be achieved even though some of the proposed criteria were not met?
A complex performance involves not only varied criteria but also criteria of different types. "Is the graphic display informative?" is a different kind of question than "Is the graphic display attractive?" Similarly, asking "Was the solution effective?" is not the same as asking "Was the process efficient?" The first question in each case refers to the purpose and desired impact, whereas the latter question refers to the quality of the process or content.
Indeed, it is common for judges in various fields to distinguish between content and process. Figure J.2 illustrates the difference in the two types with sample indicators provided for each category.
CONTENT: accurate, valid, insightful, appropriate, comprehensive, justified
PROCESS: mechanically sound, original/creative, precise, poised, polished, well crafted
Note, however, that the impact of the performance (was it successful?) is not addressed by either the content or the process types of criteria, an omission that we think is a common error in scoring student work. Thus we have found it helpful to expand the categories to identify four different kinds of criteria that may be relevant in complex and authentic performances. These four types are listed and explained in Figure J.3.
Impact—Refers to the success or effectiveness of performance, given the purpose and audience.
Content—Refers to the appropriateness and relative sophistication of the understanding, knowledge, and skill employed.
Quality—Refers to the overall quality, craftsmanship, and rigor of the work.
Process—Refers to the quality and appropriateness of the procedures, methods, and approaches used—before and during performance.
Why would we argue for these four types of criteria when doing so seems to complicate matters? Let's consider each type of criterion and its importance.
Impact is clearly at the heart of what we seek in authentic performance tasks; that is, did the performance work? Did it achieve the desired result—irrespective of effort, attitude, methods used? Note how considering impact returns us to a fundamental question: What is the purpose of the performance? Or more generally, what's the point of the learning? Students need to be reminded that in performance, the bottom line matters. If the work was meant to be entertaining, informative, or persuasive—and it wasn't—then the performance did not achieve the desired result, no matter what other strengths are evident.
Content refers to the degree of understanding or proficiency evident in student work. This category includes such indicators as accuracy, thoroughness, and quality of explanation. Was the final answer or work on target? Was the content correct or complete? Did the student address the question asked? Was the use of content sufficiently sophisticated? Did the performance reflect knowledge, skill, and understanding?
Quality refers to such elements as attention to detail, craftsmanship, mechanics, neatness, or creativity in the product or performance. Was the paper grammatically sound? Was the oral presentation fluent? Was the poster colorful and unique? Were the data in the table neatly recorded?
Process refers to the approach taken or the methods used in performance or in preparation for performance. Were directions followed? Was the manner fluid and poised? Was the performer dedicated and persistent? Did the learner practice and prepare fully? Were the students on task in their group? Did they collaborate well? These are process-related questions.
Note that these four types are fairly independent of one another. The content may be excellent, but the process could be inefficient and ineffective; the content and process might have been appropriate, but the work quality could still be shoddy. Most important, the content, process, and work quality could be fine, but the desired impact might still not have been achieved—in that situation, with that audience, given that purpose.
Here is an example in which all four types of criteria are used to evaluate a meal in nine different ways:
Goal: With a partner, cook a healthy and tasty meal for a specific group of people with varied dietary needs and interests.
Impact
1. Meal is nutritious.
2. Meal is pleasing to all guests.

Content
3. Meal reflects knowledge of food, cooking, and diners' needs and tastes.
4. Meal contains appropriate, fresh ingredients.
5. Meal reflects sophisticated flavors and pairings.

Quality
6. Meal is presented in an aesthetically appealing manner.
7. All dishes are cooked to taste.

Process
8. Meal is efficiently prepared, using appropriate techniques.
9. The two cooks collaborated effectively.
Note that these nine criteria are appropriate and independent: a meal could meet criteria 1, 3, 4, 6, 8, and 9 but not meet criteria 2, 5, 7.
However, in reality, nine criteria are too many to manage, so we could collapse the nine criteria to three without sacrificing too much in the way of feedback: Meal is nutritious, pleasing, and well prepared. Designers will always have to tinker with the right blend of criteria to balance the validity of the criteria, the helpfulness of the feedback, and the efficiency of the assessment.
It is important to note that although these four categories (impact, content, quality, and process) reflect common types of criteria, we do not mean to suggest that you must use all four types for every performance task. Rather, you should select the criterion types that are most appropriate for the goals being assessed through the task and for which you want to provide feedback to learners. The chart in Figure J.4 can be useful in considering which criteria to use in specific performance tasks and to ensure that each type of criterion is considered (more detailed rubrics can then be constructed).
Impact: In sum, was the work effective?
Content: In sum, was the content appropriate to the task, accurate, and supported?
Process: In sum, was the approach sound?
Quality: In sum, was the performance or product of high quality?
Design Tip: Look at your draft criteria. Do they overlook or downplay the issue of impact—of the purpose of the performance? Do they tend to address only content, process, and work quality? Use this stem to test your criteria set: Could the criteria I am proposing all be met but the performance/product fail to meet the purpose, the goal?
Design Task: How do these four types of criteria apply to your unit's assessments? Consider which indicators reflect the most salient features, based on your targeted Stage 1 goals.
It may have occurred to you that in matters of content, the assessment decision is often simple: the content was correct or incorrect, appropriate or inappropriate. In such cases we would use a checklist to assess. However, our focus is on designing, teaching, and assessing for understanding—and understandings are rarely simply right or wrong. Unlike factual knowledge, understanding is more a matter of degree than of correctness or incorrectness. In other words, understanding is more appropriately described along a continuum, with a rubric that delineates, for example, an in-depth and sophisticated understanding, a solid understanding, an incomplete or simplistic understanding, or a misunderstanding.
Thus it is sometimes helpful to think through criteria related to content understanding using a set of prompting questions: What is a sophisticated response to this issue? What is a novice response? What are some in-between responses? The example in Figure J.5 (p. 28) shows how this might look in social studies.
Design Tip: Given that understanding exists along a continuum, useful prompts for assessing understanding and building rubrics include these: To what extent … ? How thorough and in-depth … ? How sophisticated … ?
Our discussion thus far has focused on criteria. Although a set of criteria can be used to give feedback and evaluate performance, a more detailed rubric may be needed. A rubric is a scoring or evaluation tool that is based on identified criteria and includes a measurement scale (e.g., 4 points) and descriptions of levels of performance across the scale. Figure J.6 (p. 29) shows descriptive terms that could be used for a typical 4-point rubric.
Two general types of rubrics—holistic and analytic—are widely used to judge student performance. A holistic rubric provides an overall impression of a student's work, yielding a single rating or score. An analytic rubric divides a product or performance into distinct traits or dimensions and judges each separately. Because an analytic rubric rates each of the identified traits independently, a separate score or rating is provided for each.
So, which type of rubric should we use? Well, because these are tools, we should select the one best suited to our job. Here's a general rule of thumb: When our purpose is to provide an overall rating of a student's performance, then a holistic rubric will do. When we want more specific feedback for teachers and students, then an analytic rubric is more appropriate.
Consider the application of the two types of rubrics shown in Figure J.7. One could score the graphic displays holistically, and three students could easily end up with the same score for different reasons. Such an evaluation is by definition invalid, as well as misleading. So although holistic scoring may be easier for teachers, it often ends up sending unclear and confusing messages to students. If we applied the analytic rubric with its four different criteria (title, labels, accuracy, neatness), each student would know more precisely what was done well and what needed to be improved. Although the analytic rubric may be more time-consuming to use, the feedback is clearly more precise.
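To make the contrast concrete, here is a minimal sketch (in Python; the traits and scores are hypothetical, loosely modeled on the graphic-display example) showing how two students can earn the same holistic score even though their analytic profiles call for very different improvements.

```python
# Minimal sketch (hypothetical traits and scores) contrasting holistic and
# analytic scoring of a graphic display on a 4-point scale.

student_a = {"title": 4, "labels": 2, "accuracy": 4, "neatness": 2}
student_b = {"title": 2, "labels": 4, "accuracy": 2, "neatness": 4}

def holistic(scores):
    # For illustration only: approximate a holistic judgment as the rounded
    # average of the trait scores (a real holistic rubric uses a single
    # overall description rather than an average).
    return round(sum(scores.values()) / len(scores))

for name, scores in (("Student A", student_a), ("Student B", student_b)):
    profile = ", ".join(f"{trait} {score}/4" for trait, score in scores.items())
    print(f"{name}: holistic {holistic(scores)}/4 | analytic: {profile}")

# Both students receive the same holistic score (3/4), but the analytic
# profiles show that each needs to improve different aspects of the display.
```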
Finally, there is no need for a rubric at all if the issue is not one of degree. In that case, a simple list of things to look for will suffice. Performance lists offer a practical means of judging student performance based upon identified criteria.
A performance list consists of a set of criterion elements or traits and a rating scale. The rating scale is quite flexible, ranging from 3 to 100 points. Teachers can assign points to the various elements, in order to weight certain elements over others (e.g., accuracy counts more than neatness), based on the relative importance given the achievement target. The lists may be configured to convert easily to conventional grades. For example, a teacher could assign point values and weights that add up to 25, 50, or 100 points, enabling a straightforward conversion to a district or school grading scale (e.g., 90-100 = A, 80-89 = B). When the lists are shared with students in advance, they provide a clear performance target, signaling to students what elements should be present in their work.
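As a concrete illustration of the arithmetic, the following minimal sketch (in Python; the elements, point values, and grade cutoffs are hypothetical) shows how a weighted performance list might total 100 points and convert directly to a letter grade.

```python
# Minimal sketch (hypothetical elements, weights, and scores) of a
# criterion-based performance list whose points convert to a letter grade.

performance_list = {
    # element: (points earned, points possible); the possible points act as
    # weights, so accuracy counts more than neatness here.
    "accurate display of data":  (36, 40),
    "appropriate title":         (18, 20),
    "labeled axes and key":      (16, 20),
    "neat and legible":          (15, 20),
}

earned = sum(e for e, _ in performance_list.values())
possible = sum(p for _, p in performance_list.values())
percent = 100 * earned / possible

# Straightforward conversion to a conventional grading scale (e.g., 90-100 = A).
if percent >= 90:
    grade = "A"
elif percent >= 80:
    grade = "B"
elif percent >= 70:
    grade = "C"
else:
    grade = "below C"

print(f"{earned}/{possible} points ({percent:.0f}%) -> {grade}")  # 85/100 -> B
```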
Despite these benefits, performance lists do not provide detailed descriptions of performance levels. Thus, despite identified criteria, different teachers using the same performance list may rate the same student's work quite differently.
An additional aspect of understanding has to do with student independence or autonomy. After all, if transfer is the goal, then the assessment cannot heavily prompt and cue the learner; if it did, the ensuing performance would not show successful, independent transfer. Similarly, with meaning-making, if the teacher asks leading questions, provides lots of scaffolding and graphic organizers, and reminds the learner of ideas discussed earlier, then we surely do not have adequate evidence of each student's ability to make meaning on his or her own.
Sometimes it is useful to have a rubric for the degree of independence shown in handling any complex assessment task, as is sometimes used in special education and vocational programs. The general rubric in Figure J.8 can be used in all performance assessments to gauge the degree of independence in a performance and to signal that the ultimate goal is student independence from teacher prompts and scaffolds. This rubric can be used as part of differentiation as well as for any ongoing formative assessment of recurring tasks or complex skills. The teacher might say, for example, that in the fall it is perfectly OK to require some teacher assistance, but by spring, on a similar task (e.g., a genre of writing or a presentation), students should strive for as little teacher assistance as possible.
Level of Independence: Description
Independent: Learner completes the task effectively with complete autonomy.
Hints: Learner completes the task with minimal assistance (e.g., 1–2 hints or guiding prompts from the teacher).
Scaffolded: Learner needs step-by-step instructions and scaffolding (e.g., a graphic organizer) to complete the task.
Hand holding: Learner needs the task simplified; requires constant feedback, advice, review, and reteaching; needs moral support to complete the task.
Dependent: Learner cannot complete the task, even with considerable support.
Design Tip: Be careful with quantitative descriptors in rubrics. We have seen many rubrics that are set up so that differences in performance levels can be "counted." For example, consider this all-too-common example of problematic scoring of a research paper: A good paper must have "more than 5 footnotes," whereas a weaker paper has "1–2 footnotes." This criterion does not pass the two-question test (refer to questions on p. 23 about the proposed criteria being met), no matter how common such criteria are in schools. We can easily imagine an insightful research paper that uses only a few resources and footnotes, but that would be viewed as a major deficiency if we focused on the number of footnotes as a criterion. Doesn't the quality of the sources matter most? Shouldn't the criterion be "well supported by appropriate sources" rather than just the number of sources? In a similar vein, teachers who give more points for the length of a paper rather than high-quality content are sending a dubious message to students about what matters in writing. Here's a rule of thumb (pun intended): Be cautious of rubrics in which you can count on your fingers to obtain a score. In other words, emphasize qualities rather than quantities in assessment.
In Module N we will discuss ways of tailoring, or differentiating, your unit to address notable and constant differences in students' readiness levels, learning profiles, and interests/talents. Although differentiation is a natural approach for responsive teaching in Stage 3, we must be cautious when tailoring our assessments in Stage 2.
Consider a science standard that calls for a basic understanding of life cycles. Evidence of this understanding could be obtained by having students explain the concept and offer an illustrative example. Evidence could be collected in writing, but such a requirement would be inappropriate for an ESL student with limited skills in written English. Indeed, an ESL student's difficulty in expressing herself in writing could yield the incorrect inference that she does not understand life cycles. However, if she is offered flexibility with the response mode, such as explaining orally or visually, we will obtain a more valid measure of her understanding. In this regard, some state and district tests permit students to take math tests in their native language, to ensure that the student's knowledge of mathematics is tested fairly.
Similarly, if the goal is a fair test that enables students to show what they know and can do, it may well make sense to provide students with options on tests, papers, or projects so that they can play to their strengths or preferred styles; in other words, to differentiate the particulars of an assessment task.
Although we may offer product and performance options, we will almost always need to use the same evaluative criteria in judging all of the responses. This may seem counterintuitive. How could we use the same criteria if one student draws an illustrative picture and another provides a written explanation? The answer relates back to the logic of backward design and what we said earlier in this module about evaluative criteria being more general than the task: the goal in assessment is to obtain appropriate evidence of our general goals targeted in Stage 1. Assume we are looking for evidence of "understanding" and "polish" in the product. Then it doesn't matter what format or mode of communication is used. So we must be careful not to get sidetracked by the unique surface features of a product that do not directly relate to the goal.
In the previous example, we might judge every student's explanation of life cycles by the same three criteria—accurate, thorough, and inclusion of appropriate examples—regardless of whether a student responded orally, visually, or in writing. The criteria are derived primarily from the content goal, not the response mode. If we vary the criteria for different students, then we no longer have a valid and reliable assessment measure and our unit will be misaligned.
Of course, we want students to do high-quality work, regardless of what options they select, so we may wish to include secondary criteria related to quality. If a student prepares a poster to illustrate a balanced diet, we could look for neatness, composition, and effective use of color. Likewise, if a student made an oral presentation, we could judge pronunciation, delivery rate, and eye contact with the audience. However, it is critical to recognize that these features are linked to specific products or performances and are not the most salient criteria determined by the content goal. Here, too, we need to ensure that the relative weights of these secondary criteria are less than the primary ones related to content understanding. Figure J.9 provides a visual representation of these points.
This has been a rich and detailed module; a summary of key points is in order. The best criteria
- are derived from the Stage 1 goals being assessed, not merely from the surface features of a particular task;
- highlight the most salient features of successful performance, not just those that are easiest to see or score;
- form the smallest feasible set of valid and independent criteria;
- attend to the impact of the work, not only to its content, process, and quality.
Simply put, we should evaluate what really matters and what will provide students with the most helpful feedback long term, based on our Stage 1 goals.
In earlier modules (see Wiggins & McTighe, 2011), we used a unit on nutrition to illustrate various points. Let's return to the nutrition unit to consider the evaluative criteria for the two performance tasks. In both cases, we have included the primary content criteria and the secondary quality criteria.
Task 1—Because our class has been learning about nutrition, the 2nd grade teachers in our elementary school have asked our help in teaching their students about good eating. Your task is to create an illustrated brochure to teach the 2nd graders about the importance of good nutrition for healthful living. Use cut-out pictures of food and original drawings to show the difference between a balanced diet and an unhealthy diet. Show at least two health problems that can occur as a result of poor eating. Your brochure should also contain accurate information and should be easy for 2nd graders to read and understand.
Content Criteria
- Accurate nutrition information provided
- Clear and complete explanation of a balanced diet versus an unhealthy diet

Quality Criteria
- Neat and attractive
- Two nutritionally related health problems shown
Task 2—Since we have been learning about nutrition, the camp director at the Outdoor Education Center has asked us to propose a nutritionally balanced menu for our three-day trip to the center later this year. Using the USDA guidelines and the nutrition facts on food labels, design a plan for three days, including the three main meals and three snacks (a.m., p.m., and campfire). Your goal is to create a healthy and tasty menu. In addition to your menu, prepare a letter to the director explaining how your menu meets the USDA nutritional guidelines. Include a chart showing a breakdown of the fat, protein, carbohydrates, vitamins, minerals, and calories. Finally, explain how you have tried to make your menu tasty enough for your fellow students to want to eat.
Content Criteria
- Menu plan that meets USDA guidelines
- Clear and complete explanation of nutritional values and taste
- Accurate and complete nutrition chart

Quality Criteria
- Proper letter form
- Correct spelling and grammar
Design Task: What follows, then, for your own unit? Consider how you can transfer these ideas about evaluative criteria to your unit.
Use the following questions to assess the evaluative criteria in your unit:
Online you'll find the following worksheets and other helpful materials: Figure J.10, Four Types of Criteria with Descriptors/Indicators; Figure J.11, Criterion-Based Performance List for Graphic Display of Data; Figure J.12, Naive to Expert Understanding: A Continuum Worksheet; Figure J.13, An Analytic Scoring Rubric for Understanding; Figure J.14, An Analytic Rubric Frame; Figure J.15, Holistic Rubric for Understanding; Figure J.16, Tips for Designing Effective Scoring Tools.
Educative Assessment (Wiggins, 1998). Chapter 3 discusses the nature of feedback. Chapter 7 discusses rubrics, with a rationale and guidance for their construction and use. Chapter 10 discusses the challenge of grading and reporting when using rubrics.
Understanding by Design, 2nd ed. (Wiggins & McTighe, 2005). Chapter 8 discusses the importance of validity and how the choice of scoring criteria is key.
Understanding by Design: Professional Development Workbook (McTighe & Wiggins, 2004). Samples of rubrics and rubric worksheets for Stage 2 can be found on pages 181–196.
Schooling by Design: Mission, Action, and Achievement (Wiggins & McTighe, 2007). Chapter 5 discusses teachers' noncontact roles (what the job requires of teachers when they are not with students), including the key roles of teacher as developer of assessments (Role 1: Contributor to the Curriculum) and as scorer of student work (Role 2: Analyzer of Results).
Scoring Rubrics in the Classroom (Arter & McTighe, 2000). A detailed book describing the characteristics, design, and use of scoring rubrics.
Guide for Instructional Leaders, Guide 2: An ASCD Action Tool (Wiggins, 2003). "Leading Curriculum Development" (pp. 1–22) discusses the link between curriculum design and assessment design.