The Titanium Law of Evaluation

Nancy La Vigne

June 19, 2024

The director of the National Institute of Justice issues a wake-up call to researchers: Context and implementation matter mightily in assessing research results.

Every few decades, the criminology community encounters a publication so incendiary that it ignites a response verging on vitriol. One of the most notorious examples dates back to 1974, when sociologist Robert Martinson published a piece titled “What Works? Questions and Answers about Prison Reform.” Martinson’s conclusion, which asserted rehabilitation programs were ineffective, earned the article’s thesis the moniker of the “nothing works” doctrine. His assertion about the limits of rehabilitation was strategically used by proponents of tough-on-crime policies to advocate for harsher sentencing. In response, both practitioners and academics produced a vast array of publications — empirical and philosophical alike — to challenge this narrative. As an example, a recent Google Scholar search using the terms “nothing works rehabilitation” yielded an astonishing 954,000 results.

In subsequent decades, other publications have evoked similarly spirited responses, albeit in more tempered tones. In the late 1980s, Peter Rossi introduced the “iron law” of evaluation, which holds that the larger the scale of a social intervention, the more likely its evaluation is to yield null findings. Rossi extended this idea with additional laws named after other metals, such as the “stainless steel law,” which posits that the more rigorous an evaluation, the less likely it is to detect effects, and the “brass law,” which holds that programs aimed at changing individuals tend to have zero net impact.

Although these articles prompted defensive reactions, they also had a positive effect on the field. They paved the way for a deeper understanding of rigorous evaluation methods, enabling consumers of research to discern credible evidence from the vast body of literature. This groundwork eventually led to the establishment of several “What Works” clearinghouses, including CrimeSolutions, which is devoted to safety and justice research and was launched in 2011 by Office of Justice Programs’ Assistant Attorney General Laurie Robinson.

Reacting to “Does Evidence Matter?”

Fast forward to 2023. Megan Stevenson’s article “Cause, Effect, and the Structure of the Social World” has once again galvanized the criminology community. Stevenson, an esteemed economist, has dedicated her career to econometric research aimed at uncovering causal relationships in the criminal justice system. In her article, Stevenson argues that (1) the most rigorous studies, particularly randomized controlled trials (RCTs), often yield null findings; (2) significant effects detected in these studies are rarely sustained over time; and (3) effective interventions are seldom successfully replicated. Drawing from a subset of RCT studies, Stevenson concludes that these failures stem from an overemphasis on isolated intervention components rather than addressing the complex, structural dynamics inherent in issues of safety and justice — dynamics that defy straightforward RCT evaluations. 

To its great credit, Vital City, in partnership with the Niskanen Center’s Hypertext journal, orchestrated a comprehensive response to Stevenson’s article spanning a diverse array of perspectives, disciplines and areas of expertise. With the provocative title “Does Evidence Matter?”, it certainly grabbed my attention. As a criminologist who has long engaged in evaluation research, and now in my role as Director of the National Institute of Justice (the only federal science agency whose mission is to sponsor research that promotes safety and justice), I spend considerable time assessing which study findings are rigorous enough to elevate as credible evidence and how best to get that evidence into the hands of people who will use it to improve safety and justice programs and policies.

The articles in the issue critique various weaknesses and gaps in Stevenson’s thesis and data sources, which I won’t reiterate here. Indeed, I want to be clear that I credit Stevenson for her courage in publishing an article that challenges the field to think differently, and, I hope, in ways that will prompt the generation of more impactful evidence. Rather than relitigate those critiques, I would like to discuss a gaping hole in both Stevenson’s article and the responses in “Does Evidence Matter?”: the crucial study of implementation itself, a largely overlooked aspect of our field.

For years, criminology has debated the merits of RCTs as the gold standard of evaluation methodology — whether they are feasible, ethical and universally applicable (see Robert Sampson’s “Gold Standard Myths: Observations on the Experimental Turn in Quantitative Criminology,” 2010). However, amidst these debates, we have neglected fundamental questions that bear on the validity of an intervention: Was it implemented as intended? When it was implemented in a different setting, was care taken to preserve what worked while tailoring other important aspects of it to the new context? Carefully preserving the parts of an intervention that power results while modifying the parts that can get lost when translated to new places has yet to be taken seriously in our approach to intervention research in criminal justice.

In the spirit of Rossi’s metallic laws of evaluation, I would argue that both he and many in our field have missed the law that is most pervasive and most durable. Call it the “titanium law” of evaluation: The less deliberate the implementation of a social program, the more likely its net impact will be zero. This concept is almost axiomatic, yet only a relatively small share of published evaluations in our field address even basic aspects of implementation fidelity, much less the need to tailor programs to account for local conditions.

Failure to measure and account for fidelity and local context compromises our ability to interpret research findings accurately. It prevents mid-course corrections that could enhance program outcomes. And it often leads to erroneous conclusions, attributing program failure to flaws in design or theory, rather than implementation shortcomings.

Consider a prison work release program: Conceptually sound, it places incarcerated individuals in jobs with guaranteed post-release employment. But what if logistical issues like lockdowns or inadequate transport hinder participant attendance at a particular site? What if a new setting would require more intense efforts to gain employer participation, or changes to a labor agreement to allow corrections officers to supervise participants outside the facility? What if the features of an agricultural work release program in one setting need to be carefully modified for use in manufacturing elsewhere? Research that identifies and accounts for such factors is at least as important as the final empirical measure of whether prison work release “works.” Without the relevant implementation assessments, an RCT might erroneously conclude that work release is ineffective when in fact it just wasn’t implemented in a way that preserved its inherent causal power. 

The “action research” model

How can we measure fidelity, or determine when local context matters? Evaluators can start by partnering with practitioners to develop a logic model that tracks program inputs, outputs and outcomes in detail, carefully demonstrating how an intervention and the local setting will interact to produce an intended result. Doing this will ensure that implementation design plays as important a role in research as evaluation design. But here’s where it gets controversial: Many of us were trained as evaluators to keep program implementers at arm’s length, on the premise that doing so maintains objectivity.

The “action research” model suggests differently. It says researchers should routinely share information with programmatic staff about what’s working in program implementation and where improvements could be made to shore up fidelity or account for challenges specific to the study setting. The action research model advocates transparency and collaboration. Regular feedback between researchers and program staff enhances program quality and adaptation to local contexts, both ingredients that are critical for successful replications. Importantly, ongoing measurement and adaptation to improve fidelity should be integral to program delivery to sustain long-term impact.

As obvious as this may seem to some, we have yet to do it with any consistency. Recall Stevenson’s findings: RCTs rarely show net gains, any such gains are rarely sustained over time and few effective programs are replicated successfully. Each of these challenges could be addressed through an action research approach to evaluation that is grounded in a process evaluation informed by a detailed logic model.

Moreover, focusing on fidelity through action research not only strengthens program evaluation but also fosters trust between researchers and practitioners, builds local capacity for ongoing performance measurements, and promotes enduring research partnerships essential for advancing evidence-based practice. This potential goes entirely unmentioned in Stevenson’s article, a silence that exposes the breadth of the gap we need to fill. It should serve as a loud wake-up call for the field.