Interpreting Education Research and Effect Sizes - ASCD

May 1, 2021

Interpreting Education Research and Effect Sizes

In a candid conversation, two preeminent education researchers discuss the state of the field and their own methodological differences.



Matthew A. Kraft, an influential education scholar and former K–12 teacher, sits down with John Hattie, renowned author of Visible Learning. They debate their different viewpoints on effect sizes, discuss how evidence can best be used to inform educational practice, and look for common ground about the importance of educators being critical consumers of research.

Matthew Kraft: John, your book Visible Learning (2009) has had a profound impact on the education sector. What have you learned since it was published about why education research does or does not influence practice?

John Hattie: Over the past decade, teachers and school leaders have begun seeing research and evidence as a core part of their thinking and doing. This is a major change. They are much more evaluative and critical of evidence; they are more often using the evidence from their own classes and schools and comparing it to the evidence from the research literature. There has been a major movement away from asking teachers to be "researchers" and toward inviting them to be evaluators of research.

The conversation is also switching from debates about how to teach to collaborative discussions about the impact of teaching. This has led to many valuable implications, especially helping educators focus on teaching the strategies of learning; recognizing the emotion of learning and discovery; and focusing on growth, progression, and being more transparent with students about success criteria for their work.

MK: Interesting. It sounds like you see administrators and teachers becoming more critical consumers of education research. I'd certainly welcome that, although I would add that it is incumbent upon researchers to do a much better job at asking relevant research questions, writing in a more broadly accessible way, and interpreting the implications of our findings directly for policy and practice.

I'm also worried that research can sometimes be misused. What do you see as the greatest misinterpretations of Visible Learning?

JH: My colleague Arran Hamilton and I collected as many misinterpretations of our research as we could find,1 and the greatest was the misuse of the ranking of the different influences on student achievement. I included the ranking as an appendix in the book at the last minute, but many saw it as the major story. In reality, none of the (now 300) influences stands alone; the big messages are about discovering the underlying story of why some collections of influences rank higher and others lower, and the quest for subgroup differences (e.g., age, curricula, subject area) may mean the average is not the best estimate. Yet many readers still zero in on the individual rankings.

But since 2009, I have tried to discourage focusing only on the top influences and ignoring the lower effects, as it may be easier to derive high effects when the outcome is narrow (e.g., vocabulary) than when it is wide (e.g., creativity). I have also tried to emphasize that while achievement is important, there are many other attributes that we need to value in schools. Last year, we released all the meta-data2 with the invitation for others to devise better stories about the underlying influences, and to move past the data to interpretations of that data.

MK: That is fascinating that your list of factors related to student achievement was a last-minute addition but has become arguably the most influential part of your book. A simple ranking makes things more accessible, but also runs the risk of abstracting away from important differences in the underlying studies, interventions, and outcomes that produced the rankings.

JH: A major problem is confusing correlation with causation; the use of the term "effect" implies causation.

MK: Yes! Effect size is such a misleading term. It sounds like we are saying an "effect size" captures how big the effect of a given intervention is. However, an effect size is simply a statistical translation used to quantify a relationship between two measures on a common scale. It tells us nothing about whether the underlying relationship represents cause and effect or a simple correlation.

The very first thing we should ask ourselves after reading about an effect size is, Does this effect size represent a causal relationship or a correlation? Here is why I think this matters: In an updated version of the Visible Learning effect size rankings,3 "classroom discussion" and "creativity programs" are both ranked among the top of the 252 influences (15th and 41st), while "whole-school improvement programs" and "summer school" are ranked toward the bottom (162nd and 177th). To the degree that these rankings are based on correlational studies, they are likely to be misleading.

Correlational relationships are often driven by differences in the groups of schools and students that participate rather than by the influences themselves. For example, we would expect to find a strong positive correlation between student outcomes and the use of "classroom discussion" and "creativity programs" because these types of pedagogy are more likely to occur in well-resourced schools with smaller class sizes and higher achievement—schools that don't face accountability pressure to focus on state tests. We would also expect to find a much weaker correlation between student outcomes and "whole-school improvement" and "summer school," given that these efforts are typically targeted at low-performing schools and students.
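Kraft's selection argument can be demonstrated with a toy simulation (all numbers here are assumptions for illustration): if better-resourced schools are more likely to adopt a practice, the practice will correlate with achievement even when its true effect is exactly zero.

```python
# Illustrative simulation of confounding by non-random selection.
# The practice has ZERO true effect on achievement, but schools with
# more resources (an unobserved confounder) both adopt it more often
# and score higher -- so adopters outperform non-adopters anyway.
import random
import statistics

random.seed(0)
schools = []
for _ in range(2000):
    resources = random.gauss(0, 1)                  # unobserved confounder
    adopts = resources + random.gauss(0, 1) > 0     # richer schools adopt more
    achievement = 0.8 * resources + random.gauss(0, 1)  # practice adds nothing
    schools.append((adopts, achievement))

adopter_mean = statistics.mean(a for d, a in schools if d)
other_mean = statistics.mean(a for d, a in schools if not d)
print(round(adopter_mean - other_mean, 2))  # a clearly positive gap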

Instead of correlational relationships, we should want to know the underlying causal effect of each influence that is not confounded by the types of non-random selection patterns that typically occur in schools. Such causal relationships can be determined through randomized controlled trials (RCTs) and other naturally occurring experiments.

JH: You seem to privilege RCTs, but in education the design of the control groups is so, so different from many other areas where RCTs are more common. I think the gold standard should be "beyond reasonable doubt" (to adopt Michael Scriven's term) and while RCT is a great design, it is not the gold standard. In education (a) we cannot disguise which students receive the treatment and which do not (the placebo); (b) the teachers in the control group "know" they are being compared and may focus on mimicking the intervention to not look bad; (c) we too often have poor controls over the dosage, fidelity, and quality of the implementation; and (d) related to b and c, we have a poor understanding of what "business as usual" is for the control group.

MK: I do privilege RCTs as the "gold standard" and think consumers of education research should as well. In general, I want education leaders to be more skeptical, critical consumers of research that does not use credible methods of causal inference. There are few research designs other than RCTs that we should trust as providing evidence beyond a reasonable doubt. For example, simply measuring pre-post interventions is not enough. Showing that a relationship remains after controlling for variables commonly collected in administrative data sets is not enough. Even matching methods are rarely, if ever, strong enough to overcome a reasonable doubt, given the limited set of observable measures in most data.

These types of analyses can still provide useful information to educators, but we should be cautious about overinterpreting their findings because they cannot rule out other possible explanations for the relationships they find. We should always ask: Why did some schools, teachers, and/or students participate in the intervention and others did not? Non-random selection makes it extremely difficult to tease apart the effect of an intervention from the pre-existing differences between those two groups.

But you do raise another key question that consumers of education research should ask, including in RCTs: What did the treatment group get and what did the control, or comparison, group get? The effect of the same intervention will likely appear quite large if it provides a strong contrast from the status quo, but quite small if it is not much different. Consider, for example, instructional coaching. Coaching effects will likely be much larger for teachers when implemented in schools where teachers rarely receive any meaningful professional development. On the other hand, the effect of coaching will likely be smaller if provided to teachers who already engage in successful peer-learning communities.

JH: I agree, Matt—much more attention is needed to ensure that both the nature of the intervention and the control group "intervention" are coded. I'm working with a team of researchers to complete a review and meta-synthesis of "flipped learning," and the variance in what is meant by this term is a major constraint to understanding how it works, and when it does not.

MK: But getting back to the effect size issue: The "hinge point" of a 0.40 effect size you propose as a benchmark for judging whether an education intervention is effective has been widely adopted by educators. As you know, some scholars, including me, have argued this yardstick is unrealistically large. Others have questioned the value of benchmarks altogether. Are benchmarks useful for judging the policy relevance of a research finding, and is this the right one?

JH: The average of all 300 influences across the entire database is 0.40. As with all averages, we then need to detect moderators and understand the variance around this average. As you have noted in your recent paper on "Interpreting Effect Sizes of Education Interventions" [see Educational Researcher], there are critical ways to detect these influences on student learning. Like most averages, 0.40 is reasonable as a starting point if you lack any other contextual information.

Working with schools, we encourage and show them how to build up their own comparison points relative to the tests they use (narrow or wide), time periods, and contexts. There can be important differences, also, if the 0.40 is from a pre-post or comparison group design (although across all studies, the average effects based on either are 0.40). The 0.40 is simply the mean of all influences and is a valuable start to begin discussions. But often, more appropriate comparison points may be needed.

MK: Well, we can at least agree that benchmarks are helpful in the absence of better information. We all need a starting point. I just think the 0.40 threshold is the wrong one. Of the almost 2,000 effect sizes I analyzed from RCTs examining the effects of educational interventions on standardized student achievement, only 13 percent were 0.40 or larger. When I cut my data further by subject (reading vs. math) or by the scope of the test (narrow vs. broad) or by the size of the study, I still couldn't find anything close to an average effect size of 0.40.

I would argue that the hinge point of 0.40 sets up education leaders to have unrealistic expectations about what is possible and what is meaningful. The scale of data you draw on using thousands of meta-analyses is incredibly impressive. But it also means that this average effect size of 0.40 pools across studies of widely variable quality. Correlational studies with misleadingly strong associations likely dominate the data, considering that education research only began applying causal-inference methods more widely over the last two decades, and such studies remain rare given the limitations of what can be randomized in school settings. Publication bias, where many studies of ineffectual programs are never written up or published, further strengthens my suspicion that the 0.40 hinge point is too large.

JH: So you have recommended new empirical benchmarks for interpreting effect sizes (less than 0.05 for small effects, 0.05 to 0.20 for medium effects, and greater than 0.20 for large effects). But these are based on RCT designs and ignore the second guideline in your paper that "the magnitude of the effect size depends on what, when, and how outcomes are measured."

MK: True, I'm trying to find an elusive balance between the value of general benchmarks and the limitations of their applicability across different study designs and outcome types. We both agree that the magnitude of an effect size depends a lot on what outcome you measure. Studies can find large effect sizes when they focus on more narrow or proximal outcomes that are directly aligned with the intervention and collected soon after. It is much easier to produce large improvements in teachers' self-efficacy than in the achievement of their students. In my view, this renders universal effect size benchmarks impractical.

For these reasons, the benchmarks I propose are for effect sizes from causal studies of preK–12 education interventions evaluating effects on student achievement. These benchmarks should not be applied to effect sizes on outcomes such as tests of very specific domains of knowledge, teacher-written assessments, or self-reported outcomes. We need different benchmarks for other classes of studies and outcomes.
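The benchmarks Kraft describes reduce to a simple classification rule. A sketch, restating only the thresholds named in the conversation and carrying his caveat that they apply to causal studies of preK–12 interventions on student achievement:

```python
# Kraft's proposed benchmarks for effect sizes from causal studies
# of preK-12 interventions on standardized student achievement.
# Not applicable to narrow-domain tests, teacher-written assessments,
# or self-reported outcomes, per the caveats above.
def kraft_benchmark(effect_size):
    """Classify an effect size using Kraft's proposed thresholds."""
    if effect_size < 0.05:
        return "small"
    if effect_size <= 0.20:
        return "medium"
    return "large"

print(kraft_benchmark(0.40))  # prints "large" -- Hattie's hinge point
```

Note that under these thresholds the 0.40 hinge point discussed above would already count as a large effect, which is the crux of the disagreement.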

JH: It is pleasing that we are building up better guidelines—in your case for RCT studies—and I look forward to more comparative studies of interventions to better understand what we mean by impact—and for whom and by how much.

MK: Agreed! But I am still a bit disillusioned that, on average, larger interventions produce systematically smaller effect sizes. Why do you think so many "evidence-based" practices fail to scale in education and what can we do about it?

JH: When you search the education literature, there are relatively few studies on how to scale education interventions. There have been so many recommendations for improved models of schooling (going back to Dewey's progressivism), but rarely have they been scaled up. Too often we look for failure and aim to fix it, whereas we need to look for success and aim to scale it. For example, when we work with schools, we should begin by seeing where there is success and then aim to celebrate and scale up that success so that more teachers can replicate it in their classrooms. We have developed a "grammar of schooling" that seems to serve many students well, but sadly we often label and find explanations relative to students who do not fit this grammar. Too often, we find ways to name and blame the student for not learning or benefitting from our teaching rather than questioning our teaching and looking at ways it could be improved based on examples of more effective practice.

COVID-19 has forced major changes in how we teach and learn, so clearly teachers can quickly scale up new ways of acting—but will we learn from this or rush back to the old grammar? The old grammar has served many students well, but too many still fail to thrive. As Viviane Robinson argues in her book Reduce Change to Increase Improvement (Corwin, 2017), the debate should be more about improvement and less about change.

The Visible Learning recommendation is to invest more in human capital and resource development, build evidence-based systems of the effectiveness of lessons (see EdReports as one example), encourage teachers to improve their evaluative thinking, and capitalize on the high levels of evaluative thinking within many schools. We need to build better models of implementation of worthwhile interventions within schools and assist leaders to have the courage to recognize teachers who have high impact on the learning lives of students, build coalitions around these teachers, and invite others to join this coalition (ditto for superintendents with school leaders).

MK: I can definitely get behind investing more in teachers and their working conditions. Certainly, the science of replicating programs at scale has a long way to go. I also wonder if we have a tendency to abandon ship too early in the education sector. Political terms push leaders toward education reforms that can show results within an election cycle. High rates of turnover among superintendents and principals often mean a revolving door of new initiatives. I think it might just take more time and a relentless commitment to continuous improvement to do things well. Sometimes we have unrealistic expectations. What about valuing real incremental improvements over the allure of silver bullets?

In closing, what advice do you have for education leaders for being critical consumers of education research?

JH: My major argument is that educators need to be effective at evaluative thinking—asking about the worth, merit, and significance of interventions, asking more about "So what?" than "What is so?" and building the five major skills of evaluative thinking (see The Turning Point for the Teaching Profession, a book I coauthored last year with Field Rickards and Catherine Reid). These skills are: (1) invoking critical thinking and reasoning in valuing evidence, and making decisions about appropriate interventions; (2) addressing the fidelity of implementation, continually checking for unintended consequences, and allowing for adaptations to maximize the value of the outcomes; (3) understanding and agreeing across staff about the nature, extensiveness, and magnitude of impact, and maximizing the impact on the learning lives of students; (4) investigating potential biases and confounding factors that may lead to false conclusions about our impact; and (5) learning to understand others' points of view.

MK: Great points. At the same time, I wonder what is realistic to ask of education leaders. Where will they find the time to do all this? Certainly, this type of training is not a central part of teacher or principal education in the United States, although it probably should be.

If I had a magic wand, I would wave it and require all academic papers to include a "practitioner's abstract" that states the main findings, costs, and specific implications of the research for policy and practice in clear and simple terms. I would also highly recommend that teachers and education leaders read two books that have informed my thinking about interpreting research: Common-Sense Evidence by Carrie Conaway and Nora Gordon (Harvard Education Press, 2020), as well as Data Wise by Kathryn Parker Boudett, Elizabeth City, and Richard Murnane (Harvard Education Press, 2013).

Great to talk with you, John.


John Hattie is an emeritus laureate professor at the University of Melbourne and chair of the Australian Institute for Teaching and School Leadership. He is the author of several books, including Visible Learning (Routledge, 2009).
