All About Assessment / Unraveling Reliability

W. James Popham

If you were to ask an educator to identify the two most important attributes of an education test, the response most certainly would be "validity and reliability." These two tightly wedded concepts have become icons in the field of education assessment.

As far as validity is concerned, the term doesn't refer to the accuracy of a test (see "A Misunderstood Grail" in the September 2008 Educational Leadership). Rather, it refers to the accuracy of score-based inferences about test takers. Once educators grasp the idea that these inferences are made by people who can, of course, make mistakes, they're apt to be more cautious about how to use test results.

In the case of reliability, however, it's the test itself that is or isn't reliable. That's a whopping difference. You'd think, therefore, that most educators would have a better handle on the meaning of reliability. Unfortunately, that's not the case.

Defining Reliability

The term reliability connotes positive things. Who would want anything, or anyone, to be unreliable? Moreover, with respect to education assessment, reliability equals consistency. And who among us would prefer inconsistency to consistency? Clearly, reliable tests are good, whereas unreliable tests are bad. It's that simple.

But here confusion can careen onto the scene, because measurement experts have identified three decisively different kinds of assessment consistency. Stability reliability refers to the consistency of students' scores when a test is administered to the same students on two different occasions. Alternate-form reliability describes the consistency of students' performances on two different (hopefully equivalent) versions of the same test. And internal consistency reliability describes the consistency with which all the separate items on a test measure whatever they're measuring, such as students' reading comprehension or mathematical ability.

Because these three incarnations of reliability constitute meaningfully different ways of thinking about a test's consistency, teachers need to recognize that the three approaches to reliability are not interchangeable. For example, suppose a teacher wants to know how consistent a particular test would be if it were administered to certain students at different times of the school year to track their varying rates of progress. What the teacher needs to look at in this instance would be the test's stability reliability. Similarly, if a teacher has developed two different forms of the same test, possibly for purposes of test security, then the sort of reliability evidence needed to determine the two forms' consistency with each other would be alternate-form reliability. Finally, teachers might be interested in a test's internal consistency whenever they want to know how similarly a test's items function, that is, whether the test's items are homogeneous.

Educators frequently run into reports of a test's reliability when they're using standardized tests, either national or state-developed exams. Typically, these reliability estimates are reported as correlation coefficients, and these "reliability coefficients" usually range from zero to 1.0—with higher coefficients, such as .80 or .90, being sought. If standardized tests are distributed by commercial assessment companies, such tests are invariably accompanied by technical manuals containing some sort of reliability evidence.

Usually, this evidence will be presented as an internal consistency coefficient because this kind of reliability can be computed on the basis of a single test administration, as opposed to the alternate-form reliability and stability reliability coefficients, which both require multiple test administrations. Because collecting internal consistency evidence involves the least hassle, it's the most often reported kind of reliability evidence.
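
To make the idea of a reliability coefficient concrete, here is a minimal sketch of how a stability (test-retest) coefficient could be computed as a Pearson correlation between two administrations of the same test. The scores and the function name `pearson_r` are hypothetical, invented purely for illustration; a published test's technical manual may well use a different procedure.

```python
from math import sqrt

def pearson_r(x, y):
    """Pearson correlation between two equal-length lists of scores."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two equal-length lists with at least two scores")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    # Sum of cross-products and sums of squared deviations from the means
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)

# Hypothetical scores for the same five students on two occasions
first_administration = [53, 61, 47, 66, 58]
second_administration = [55, 60, 45, 68, 57]

stability = pearson_r(first_administration, second_administration)
print(round(stability, 2))  # 0.98
```

With these made-up scores the students keep nearly the same rank order on both occasions, so the coefficient lands close to 1.0, the kind of high value test publishers seek.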

The One Thing to Know

Almost all classroom teachers are far too busy to collect any kind of reliability evidence for their own teacher-made tests. So why should teachers know anything at all about reliability?

Well, there's one situation in which teachers actually do need to know what's going on regarding reliability. This arises when teachers are trying to determine the consistency represented by a student's performance on a nationally standardized test or, perhaps, on a state-built standardized test. To get a fix on how consistent an individual student's test score is, the teacher can look at the test's standard error of measurement (SEM). Standard errors of measurement, which differ from test to test, are similar to the plus-or-minus margins of error accompanying most opinion polls. They tell a teacher how likely it is that a student's score would fall within a specific score range if a student were (theoretically) to take the same test 100 times.

A standard error of measurement of 1 or 2 means the test is quite reliable. Because all major published tests are accompanied by information regarding this measure, teachers need to check out a given test's SEM so they'll know how much confidence to place in their students' scores on that test.

This is briefly how it works. Suppose, for example, a student earned a score of 53 points on a 70-point test that had a standard error of measurement of 1. We can make two assumptions: First, about 68 percent of the time, that student would score within plus or minus one point of the original score (between 52 and 54). Second, about 95 percent of the time, the student's score would fall within plus or minus two points of the original score (between 51 and 55).

If this same test had a standard error of measurement of 2, about 68 percent of the time, that student would score within plus or minus two points of the original score (between 51 and 55); about 95 percent of the time, the student would score within plus or minus four points of the original score (between 49 and 57).

As the standard error of measurement increases, so does the range in possible scores. So if the test had a standard error of measurement of 10, 68 percent of the time the student's score would be within plus or minus 10 points of the original score (between 43 and 63); and 95 percent of the time it would be within plus or minus 20 points of the original score (between 33 and 73), which is not very reliable at all. Clearly, teachers should place more confidence in tests sporting smaller standard errors of measurement.
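
The arithmetic in these examples can be reproduced with a short sketch. It assumes, as the examples do, that one SEM on either side of the score covers about 68 percent of hypothetical retakes and two SEMs cover about 95 percent. The function name `score_band` is hypothetical, chosen only for this illustration.

```python
def score_band(observed, sem, n_sems=1):
    """Return the (low, high) range lying n_sems standard errors around a score."""
    margin = n_sems * sem
    return observed - margin, observed + margin

observed_score = 53  # the student's score on the 70-point test

for sem in (1, 2, 10):
    band_68 = score_band(observed_score, sem, n_sems=1)  # about 68 percent
    band_95 = score_band(observed_score, sem, n_sems=2)  # about 95 percent
    print(f"SEM {sem}: about 68% -> {band_68}, about 95% -> {band_95}")
```

Note that with an SEM of 10 the upper end of the 95 percent range (73) exceeds the test's 70-point maximum, so in practice the range would stop at 70. The sketch leaves the bounds uncapped so that its numbers match the text.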

Most educators think that the validity-reliability pairing is a marriage made in measurement heaven, with each partner pulling equal weight. In truth, reliability is the minor member of that merger. And I've just told you all you basically need to know about the concept. You can rely on it.

James "Jim" Popham (1930–2025) was Emeritus Professor in the UCLA Graduate School of Education and Information Studies. At UCLA he won several distinguished teaching awards, and in January 2000, he was recognized by UCLA Today as one of UCLA's top 20 professors of the 20th century.

Popham was a former president of the American Educational Research Association (AERA) and the founding editor of Educational Evaluation and Policy Analysis, an AERA quarterly journal.

He spent most of his career as a teacher and was the author of more than 90 books, 250 journal articles, 50 research reports, and nearly 200 papers presented before research societies. His contributions to education spanned decades, shaping how we think about student assessment and educational evaluation.

From our issue

How Teachers Learn
