March 1, 2014
Vol. 71
No. 6

Criterion-Referenced Measurement: Half a Century Wasted?

Four serious areas of confusion have kept criterion-referenced measurement from fulfilling its potential.

Fifty years ago, Robert Glaser (1963) introduced the concept of criterion-referenced measurement in an article in American Psychologist. In the half century that has followed, this approach has often been regarded as the most appropriate assessment strategy for educators who focus more on teaching students than on comparing them. Its early proponents touted criterion-referenced testing as a measurement strategy destined to revolutionize education. But has this approach lived up to its promise? Let's see.

Origins of an Idea

To decide whether criterion-referenced testing has accomplished what it set out to accomplish, we need to understand its origins. Glaser, a prominent University of Pittsburgh professor, asserted in his seminal 1963 article that certain instructional advances could render traditional educational testing obsolete. More specifically, he raised the issue of whether traditional measurement methods, and especially their score-interpretation strategies, were appropriate in situations in which instruction was truly successful.
During World War II, Glaser had tested bomber-crew trainees by using the then widely accepted norm-referenced measurement methods aimed chiefly at comparing test takers with one another. Each trainee's score was interpreted by comparing it with (or "referencing it to") the scores earned by previous trainees, usually known as the norm group. Because the norm group's performances were usually nicely spread out across the full range of possible scores, it was easy to understand what it meant for an individual test taker to score at the 98th percentile or at the 30th percentile. Such comparative interpretations were particularly useful in military settings, providing a straightforward way to select the highest-scoring (and presumably most-qualified) applicants to fill a limited number of openings.
Following the war, Glaser pursued his PhD at Indiana University and studied with B. F. Skinner, often regarded as the father of modern behaviorism. In the late 1950s, Glaser became an advocate of programmed instruction, an approach growing out of Skinner's theories, in which students worked through carefully sequenced instructional materials that were designed to present information in small steps, provide immediate feedback, and require learners to correctly complete one step before moving on to the next (Lumsdaine & Glaser, 1960). Because practitioners of programmed instruction relentlessly revised their curriculum materials until these materials were effective in getting students to the desired learning objective, Glaser and his programmed instruction compatriots were often able to produce high levels of learning for essentially all students.
We might think that this accomplishment would engender jubilation. However, a number of measurement traditionalists were far from delighted. That's because the uniformly high test results typically produced by programmed instruction materials exposed a serious shortcoming in traditional test-interpretation practices. When the range of student scores was compressed at the high end of the scale, the possibility of useful student-to-student comparisons instantly evaporated.
Glaser recognized that a dramatic reduction in the variability of students' test scores would make norm-referenced score interpretation meaningless. After all, if nearly every student's score approached perfection, it made no sense to compare one student's near-perfect score with the near-perfect scores of other students. In his landmark 1963 article, therefore, Glaser proposed an alternative way of interpreting students' test performances in settings where instruction was working really well. The label he attached to this new, more instructionally attuned score interpretation strategy, criterion-referenced measurement, is still widely used today.

An Approach Preoccupied with Instruction

Unlike the more traditional method of referencing a given student's test score to the scores of other test takers, Glaser's proposed approach called for referencing a student's test score to a criterion domain—a clearly described cognitive skill or body of knowledge. For example, suppose that a set of 250 hard-to-spell words has been the focus of the school year's spelling instruction, and students take a spelling test containing a representative sample of those words. A criterion-referenced interpretation of a student's score would focus on the number or percentage of the 250 words that the student was able to spell correctly. We might report, for example, that a student's test score signified that he or she had mastered 90 percent of the criterion domain of hard-to-spell words. Or, if we had previously determined that the proficiency cutoff score would be 90 percent, we might simply report the student's performance on this criterion domain as "proficient."
The contrast between a norm-referenced and a criterion-referenced interpretation is quite striking. On the same end-of-year spelling test, a norm-referenced interpretation might report that a student who spelled 90 percent of the words correctly scored at the 78th percentile in relationship to the scores of students in the norm group—or at the 98th percentile, or the 30th percentile, depending on how well the norm group students performed. This norm-referenced interpretation, however, would be of little use in deciding whether a particular student had mastered the criterion domain to the desired level.
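To make this contrast concrete, here is a minimal sketch in Python (not drawn from the article; the scores, the norm group, and the function names are all hypothetical) showing how the same raw score can support both kinds of report: a criterion-referenced mastery percentage with a proficiency call, and a norm-referenced percentile rank.

# Minimal sketch: two interpretations of the same hypothetical spelling score.
# All numbers and names below are illustrative, not the article's data.

def criterion_referenced(correct, sample_size, cutoff=0.90):
    # Mastery of the sampled criterion domain, plus a proficiency call.
    mastery = correct / sample_size
    return mastery, "proficient" if mastery >= cutoff else "not yet proficient"

def norm_referenced(score, norm_group):
    # Percentile rank using one simple convention: the percentage of
    # norm-group scores falling strictly below the student's score.
    below = sum(1 for s in norm_group if s < score)
    return 100 * below / len(norm_group)

# A student spells 45 of a 50-word sample drawn from the 250-word domain.
mastery, call = criterion_referenced(45, 50)
print(f"Criterion-referenced: {mastery:.0%} of the sampled words correct ({call})")

# The same raw score referenced to a hypothetical norm group's scores.
norm_group = [28, 31, 35, 38, 40, 42, 43, 44, 46, 48]
print(f"Norm-referenced: {norm_referenced(45, norm_group):.0f}th percentile")

The first report says something about what the student can do with the criterion domain; the second says only where the student stands relative to other test takers.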
Of course, advocates of criterion-referenced testing don't suggest that there is no role for tests yielding norm-referenced interpretations. Indeed, in some situations it's useful to compare a student's performance with the performance of other students. (For example, educators may want to identify which students in a school would most benefit from remedial support or enrichment instruction.) However, to support actionable instructional decisions about how best to teach students, norm-referenced inferences simply don't cut it.
An inherent assumption of criterion-referenced assessment, then, is that by articulating with sufficient clarity the nature of the curricular aims being assessed, and by building tests that enable us to measure whether individual students have achieved those aims to the desired level, we can teach students better. Criterion-referenced measurement, in every significant sense, is a measurement approach born of and preoccupied with instruction.

Four Areas of Confusion

Glaser's 1963 introduction of criterion-referenced testing attracted only modest interest from educators. Actually, nothing more was published on the topic until the late 1960s, when a colleague and I published an article analyzing the real-world education implications of criterion-referenced measurement (Popham & Husek, 1969). Nonetheless, a small number of measurement specialists began to tussle with issues linked to this innovative approach.
Here are four key issues we must address to decide whether criterion-referenced measurement has lived up to the instructional promises accompanying its birth.

Tests or Test Interpretations?

During the 1970s, when interest in criterion-referenced measurement began to flower, a misconception emerged that still lingers: the idea that there are "criterion-referenced tests" and "norm-referenced tests." This is simply not so. What's criterion-referenced or norm-referenced is the inference about, or the interpretation of, a test taker's score. Although test developers may build tests they believe will provide accurate norm-referenced or criterion-referenced inferences, a test itself should never be characterized as norm-referenced or criterion-referenced.
To understand this point, imagine a district-level accountability test whose items are designed to measure students' mastery of three distinct criterion domains representing three key mathematical skills. The district uses the test results to make criterion-referenced inferences—that is, to measure the degree to which each student has mastered the three key math skills. However, after administering the test for several years, district educators also develop normative tables that enable them to compare a student's score with the scores of previous test takers. Thus, students' performances, originally intended to provide criterion-referenced inferences, could also be interpreted in a norm-referenced manner. The test itself hasn't changed—only the way the results are interpreted has.
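Here is a hypothetical sketch (all domain names, item counts, and prior-year scores are mine, not the district's) of how one unchanged set of results can feed both interpretations: per-domain mastery reports first, then a percentile lookup against a normative table accumulated from earlier cohorts.

from bisect import bisect_left

# Criterion-referenced report: mastery of three hypothetical math domains.
domain_items = {"number_ops": 20, "geometry": 15, "data_analysis": 15}
items_correct = {"number_ops": 18, "geometry": 9, "data_analysis": 14}
for domain, total in domain_items.items():
    print(f"{domain}: {items_correct[domain] / total:.0%} of items correct")

# Norm-referenced report: the same total score referenced to sorted totals
# from prior years' test takers (the district's accumulated norm table).
prior_totals = sorted([22, 25, 28, 30, 31, 33, 35, 36, 38, 40, 41, 43, 45, 47, 48])
total_score = sum(items_correct.values())  # 41 of 50 items
percentile = 100 * bisect_left(prior_totals, total_score) / len(prior_totals)
print(f"Total {total_score}: roughly the {percentile:.0f}th percentile of prior cohorts")

Nothing about the items changes between the two reports; only the frame of reference does.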
If a colleague refers to "a norm-referenced test" or "a criterion-referenced test," you should not necessarily regard this colleague as a loose-lipped lout. Your colleague might be casually referring to tests that have deliberately been developed to provide norm-referenced or criterion-referenced interpretations. But to use precise language in a measurement arena where precision is so badly needed, it's score-based inferences—not tests—that are criterion-referenced or norm-referenced.

What's a Criterion?

One of the important early disagreements among devotees of criterion-referenced measurement concerned what the word criterion meant. In his 1963 essay, Glaser used the term the way it was commonly employed in the early 1960s, to refer to a level of desired student performance. In that same essay, however, Glaser indicated that a criterion identified a behavior domain, such as a cognitive skill or a body of knowledge.
Candidly, a degree of definitional ambiguity existed in Glaser's initial essay. Nor did Husek and I improve that situation in our 1969 follow-up—regrettably, we also failed to take a clear stance on the level-versus-domain issue.
Nonetheless, by the close of the 1970s, most members of the measurement community had abandoned the view of a criterion as a level of performance (Hambleton, Swaminathan, Algina, & Coulson, 1978), recognizing that the criterion-as-domain view would make a greater contribution to teachers' instructional thinking. Although determining expected levels of student performance is important, the mission of criterion-referenced measurement criteria is to tie down the skills or knowledge being assessed so that teachers can target instruction, not to set forth the levels of mastery sought for those domains.
Regrettably, the criterion-as-level view appears to be seeping back into the thinking of some measurement specialists. During several recent assessment-related conferences, I have heard colleagues unwisely characterize criterion-referenced testing as an assessment strategy intended to measure "whether test takers have reached a specified level of performance." Such a view makes little contribution to the kind of measurement clarity Glaser thought would lead to better instruction.

What's the Optimal Grain Size?

We can properly consider tests that provide criterion-referenced interpretations as ways of operationalizing the curricular aims being measured. That's where grain size—the breadth of a criterion domain—comes in. If the grain size of what's to be measured is either too narrow or too broad, instructional dividends disappear.
If each curricular domain is too narrow, a teacher may be overwhelmed by too many domains. We saw this clearly when the behavioral objectives movement of the late 1960s and early 1970s foundered because it sought students' mastery of literally hundreds of behaviorally stated objectives (Popham, 2009). Sadly, that same mistake was reenacted in recent years when state education officials adopted far too many state curriculum standards. Moreover, the federal government (in an effort to dissuade states from aiming only at easy-to-achieve curriculum targets) insisted that each state's annual accountability tests measure students' mastery of all of that state's standards. The result was an excessive number of curricular targets—far too many for teachers to use in day-to-day instructional decision making.
On the other hand, now that so many states have adopted the Common Core State Standards, the assessment pendulum may be swinging too far in the opposite direction. At last report, the two federally funded state assessment consortia charged with creating assessments to measure students' mastery of the Common Core standards appear intent on reporting a student's performance on their tests at a remarkably general level. In the case of reading, for example, this is the "assessment claim" one assessment consortium plans to use to report a student's performance: "Students can read closely and analytically to comprehend a range of increasingly complex literary and informational texts" (Smarter Balanced Assessment Consortium, 2012). Teachers are certain to be baffled about what such a broad domain is actually intended to measure. If the Common Core–focused assessment domains remain too broad, the criterion-referenced inferences about students' performances that these tests yield may be instructionally useless.
And therein lies the dilemma that determines the promise—or the impotence—of criterion-referenced assessment. If students' mastery of the Common Core State Standards is measured with criterion referencing that yields instructionally actionable reports, then the architects of the Common Core curricular aims are likely to see their lofty education aspirations realized. If, however, the grain size of the Common Core assessments is too broad to guide teachers in making sensible instructional moves, then our optimism regarding the Common Core initiative should diminish.

How Much Descriptive Detail Should We Provide?

Criterion-referenced measurement revolves around clear descriptions of what a test is measuring. If teachers possess a clear picture of what their students are supposed to be able to do when instruction is over, those teachers will be more likely to design and deliver properly focused instruction. And if the test shows that an instructional sequence has failed to work satisfactorily, a clear criterion domain description can help isolate needed adjustments so that the teacher can make the next version of the instructional sequence more effective. It's just common sense: Clarified descriptions of curricular ends permit teachers to more accurately select and refine their instructional means.
However, we need to include the right amount of detail when describing curricular targets, or few educators will actually employ such descriptions. Descriptions of criterion domains that are too brief or too detailed can erode the instructional dividends of criterion-referenced measurement.
Over the years, particularly since the mid-1960s, U.S. educators have often made these two opposite but equally serious mistakes when describing the criterion domains to be taught and measured. Initially, educators tried to describe what tests ought to measure by using extremely abbreviated statements of instructional objectives. But such abbreviated statements frequently led to misinterpretation. To avoid this problem, certain assessment specialists tried to describe the nature of criterion domains in great detail (Hively, 1974). Sometimes the description of a single domain consumed 3–5 single-spaced pages. Unfortunately, the longer and more detailed these descriptions were, the less likely it was that busy educators possessed the patience to use them—or even read them.
Clearly, we need "Goldilocks" domain descriptions, in which the level of descriptive detail is neither too brief nor too elaborate, but just right.

Promises Fulfilled?

Looking back on 50 years of criterion-referenced measurement, what can we conclude? Has Glaser's concept lived up to his vision?
Criterion-referenced testing as Glaser conceptualized it represented an important departure from traditional thinking. Instead of interpreting test takers' performances in relative terms (that is, by referencing these performances to the performances of other test takers), criterion-referenced measurement is an absolute interpretive strategy in which students' performances are referenced to clearly explicated domains of knowledge or skills. This fundamental relative-versus-absolute distinction continues to be important.
However, in our attempts to implement criterion-referenced measurement, we have sometimes made four serious mistakes that have robbed it of its instructional potential. We have been sloppy in the way we think and talk about criterion-referenced measurement, often slapping the label criterion-referenced on tests rather than on test-based interpretations. We've also sometimes subscribed to a dysfunctional criterion-as-level definition of this approach to testing, beclouding our measurement picture even more. We have failed to focus our tests on a reasonable number of instructionally digestible assessment targets. And we often haven't described the domains of skills or knowledge being assessed in practical language. These four implementation mistakes have distorted the instructional use of criterion-referenced measurement.
Happily, all of these mistakes are rectifiable. If educators understand the basics of criterion-referenced testing and demand that any instructionally oriented assessments avoid the four implementation errors identified here, Glaser's assessment gift to education will fulfill its promise and foster the improved instruction its early advocates foresaw.
But please, let's get this done without waiting another 50 years.
Author's note: This article is based on a presentation made at the Teach Your Children Well conference honoring Professor Ronald K. Hambleton, held on November 9–12, 2012, at the University of Massachusetts, Amherst.
References

Glaser, R. (1963). Instructional technology and the measurement of learning outcomes: Some questions. American Psychologist, 18, 519–521.

Hambleton, R. K., Swaminathan, H., Algina, J., & Coulson, D. B. (1978). Criterion-referenced testing and measurement: A review of technical issues and developments. Review of Educational Research, 48(1), 1–47.

Hively, W. (1974). Introduction to domain-referenced testing. Educational Technology, 14, 5–10.

Lumsdaine, A. A., & Glaser, R. (Eds.). (1960). Teaching machines and programmed learning: A source book. Washington, DC: National Education Association.

Popham, W. J. (2009). Unlearned lessons: Six stumbling blocks to our schools' success. Cambridge, MA: Harvard Education Press.

Popham, W. J., & Husek, T. R. (1969). Implications of criterion-referenced measurement. Journal of Educational Measurement, 6(1), 1–9.

Smarter Balanced Assessment Consortium. (2012, March 1). Claims for the English language arts/literacy summative assessment. Retrieved from www.smarterbalanced.org/wordpress/wp-content/uploads/2012/09/Smarter-Balanced-ELA-Literacy-Claims.pdf

James Popham is Emeritus Professor in the UCLA Graduate School of Education and Information Studies. At UCLA he won several distinguished teaching awards, and in January 2000, he was recognized by UCLA Today as one of UCLA's top 20 professors of the 20th century.

Popham is a former president of the American Educational Research Association (AERA) and the founding editor of Educational Evaluation and Policy Analysis, an AERA quarterly journal.

He has spent most of his career as a teacher and is the author of more than 30 books, 200 journal articles, 50 research reports, and nearly 200 papers presented before research societies. His areas of focus include student assessment and educational evaluation. One of his recent books is Assessment Literacy for Educators in a Hurry.
