Criminal justice interventions shouldn’t be held to an impossible standard.
Most attempts to innovate are unsuccessful. Two-thirds of businesses fail within 10 years. Only 3% of patents lead to a profit. In sports, most at-bats in baseball end unceremoniously in an out, and most football drives end before a team can kick a field goal or score a touchdown. Most shots in basketball are missed. Yet we continue to play the game — because winning is a process.
On the theme of scientific advancement, Albert Einstein is often credited with the observation that “success is failure in progress.” And when reminded of the oft-cited claim that he had failed 1,000 times before inventing the light bulb, Thomas Edison reportedly replied, “I didn’t fail 1,000 times. The light bulb was an invention with 1,000 steps.”
As it is with hitting a baseball — or starting a successful business or inventing a successful product — changing the world is hard. As with other areas of human endeavor, most social policy programs or shifts in law enforcement strategy do not transform the world and many probably do not even work at all. As Megan Stevenson puts it in her essay, “Cause, Effect, and the Structure of the Social World,” most reforms have little lasting effect when evaluated with gold-standard methods.
This claim is difficult to argue with, and contesting it is not a hill I’d want to die on. Instead, I’d like to say a few words about the implication of Stevenson’s observation and, in particular, her claim that “the dominant perspective on social change — one that forms a pervasive background for academic research and policymaking — is a myth.”
****
Businesses fail when they do not make a sufficiently large profit. Profit serves as a constant source of feedback about the viability of an enterprise. If a business isn’t making money, in the words of “Shark Tank” investor Kevin O’Leary, an entrepreneur can take the idea behind a barn and shoot it.
In the world of social innovation, research and evaluation are the analog to profit. They are the means of understanding whether what we are doing is effective. A key difference, though, is that while profit may be difficult to obtain, it is easy to measure. Programmatic effectiveness, on the other hand, is both hard to obtain and hard to measure.

This is true for many reasons but, in particular, it is due to the fundamental problem of causal inference: It is not possible to observe a given individual under two states of the world — one in which the individual has been affected by a given reform and the other in which they have not. Comparing treated people to untreated people generates all sorts of problems because people who receive a treatment like job training or education often differ systematically from those who do not. This could be because more motivated people are more likely to seek out treatment or because people are selected into a program on the basis of their personal characteristics. Communities aren’t laboratories, and they never will be.
In principle, researchers can control for differences between treated and untreated people, but this becomes hard to do when many of the characteristics that matter — including motivation and willingness to change — are hard to measure and aren’t observable in data. Randomized experiments solve this “selection” problem. Because access to the treatment is determined by a coin flip, treated and untreated people are, in expectation, alike in all possible ways but for access to the treatment. We effectively create two parallel worlds where the only thing that differs is the thing we’re trying to study.
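To make the selection problem concrete, consider a minimal simulation sketch, with entirely invented numbers, in which an unobserved trait (call it motivation) drives both enrollment in a program and later outcomes:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# Hypothetical setup: "motivation" is an unobserved trait that both
# drives people to seek out a program and independently improves outcomes.
motivation = rng.normal(size=n)
true_effect = 1.0  # the effect we would like to recover

# Observational world: more motivated people are more likely to enroll.
enrolled = rng.random(n) < 1 / (1 + np.exp(-motivation))
outcome_obs = true_effect * enrolled + 2.0 * motivation + rng.normal(size=n)
naive_estimate = outcome_obs[enrolled].mean() - outcome_obs[~enrolled].mean()

# Experimental world: a coin flip decides access, breaking the link
# between motivation and treatment.
assigned = rng.random(n) < 0.5
outcome_rct = true_effect * assigned + 2.0 * motivation + rng.normal(size=n)
rct_estimate = outcome_rct[assigned].mean() - outcome_rct[~assigned].mean()

print(f"true effect:    {true_effect:.2f}")
print(f"naive estimate: {naive_estimate:.2f}")  # inflated by selection
print(f"RCT estimate:   {rct_estimate:.2f}")    # close to the truth
```

The particular numbers are beside the point. What matters is the structure: the coin flip severs the link between who gets treated and who was likely to do well anyway.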
In practice, randomized experiments come with a number of limitations. They face a variety of feasibility constraints, and they often involve small and selected samples of study participants, making it hard to know whether the intervention being evaluated will scale well in the real world. What works well in one place may not work as well in another, and what works well when delivered to one group of people or by one organization may not work as well when a different organization serves a different population. Reducing class size might yield beneficial results for one age or type of student but not for others; the same can be said for increasing the size of a police force with one type of population (or indeed one type of police force) but not another. For these reasons and others, Richard Berk has called randomized experiments, often described as the “gold standard” in research, the “bronze standard.”
While a healthy debate can be had about where randomized experiments should rank in the hierarchy of evidence — i.e., gold, silver or bronze — it is fair to say that randomized experiments will always be in the hunt for a medal because they do convincingly solve the selection problem. Returning to Stevenson’s claim that the dominant perspective on social change — and on randomized experiments — is a myth, I offer the following thoughts.
First, why should we expect randomized experiments to produce evidence of transformational social change? That seems an impossible standard given that our world is shaped by human nature and a variety of unforgiving social and political constraints. If change is hard and most interventions fail to change the world in transformational ways, then it stands to reason that randomized controlled trial (RCT) evidence should reflect this seemingly fundamental truth. The fact that most RCT evidence shows modest impacts at best is consistent with our understanding of the structure of the social world and serves as a sign that the research evidence is credible rather than a product of researcher bias. This is critical because while Stevenson regards RCTs as following from the “engineer’s view” of the world, social scientists are, in fact, not engineers. For the most part, we simply test the hypotheses that policymakers generate.
Second, while randomized experiments remain an incredibly powerful form of evidence (perhaps somewhere between the gold and the bronze), there is other high-quality evidence as well — evidence from so-called natural experiments, where similar people are treated differently in the real world for arbitrary reasons. For example, criminal defendants might be arraigned in court in front of different judges, or a program might roll out to different communities at different times for arbitrary reasons. Given all that, it is unreasonably restrictive — and a little odd — to demand that randomized experiments do all of the heavy lifting in generating knowledge. Indeed, the entire point of a research literature is to build up knowledge over time from a number of different contexts using a variety of evidence. It is through repetition and replication, by testing hypotheses in different ways and in different contexts, that we derive knowledge. An RCT can form the basis for scaling up a program that might then be tested at scale using a difference-in-differences design. Likewise, an RCT might form the basis for more investment in a particular strategy. Seen through this lens, the claim that the benefits of RCT evidence are a myth is a strawman — easy to knock down but in service of the wrong question.
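The difference-in-differences logic can also be sketched in a few lines of simulation, again with purely invented numbers: comparing the before-to-after change in communities that got a program with the change in communities that did not nets out both fixed baseline differences between the groups and any trend the groups share.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 5_000

# Hypothetical rollout: some communities are treated, others are not.
treated = rng.random(n) < 0.5

# Treated communities can differ from control communities at baseline;
# both groups share a common time trend.
baseline = 10.0 + 3.0 * treated + rng.normal(size=n)
trend, true_effect = -1.5, -2.0
after = baseline + trend + true_effect * treated + rng.normal(size=n)

# Difference-in-differences: the change in the treated group minus the
# change in the control group cancels the baseline gap and the trend.
change_treated = (after - baseline)[treated].mean()
change_control = (after - baseline)[~treated].mean()
print(f"DiD estimate: {change_treated - change_control:.2f}")  # ~ -2.0
```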
Third, Stevenson has, in my view, downplayed some of the RCT research that has shown consistent evidence of meaningful change. Data-driven policing strategies that shift police to crime hot spots reduce crime and violence. Place-based interventions, such as remediating dilapidated homes, greening vacant lots and improving street lighting, have led to meaningful public safety benefits. Summer jobs for youth and various incarnations of cognitive behavioral therapy reduce offending.
Have these changes, by themselves and in a vacuum, been transformational? Well, it depends on what you mean by that. Is a 5% or 10% improvement in a given problem a transformation — or is the only true transformation a much bigger result than that? More to the point, why is transformation, which is in the eye of the beholder, the standard to which we must adhere? While RCTs hold constant other changes, in the real world, different policies and programs interact. Kids might not only obtain a summer job but also benefit from a variety of other social supports or smaller class sizes or higher-quality policing in their communities. As Greg Berman and Aubrey Fox note in their excellent book, “Gradual: The Case for Incremental Change in a Radical Age,” small improvements can compound over time. In a 2017 paper published in the American Sociological Review, Patrick Sharkey, Gerard Torrats-Espinosa and Delaram Takyar show that, over several decades, growth in the number of community nonprofits led to a modest but meaningful decline in violence. The implication is that it is possible to make the world better, slowly, one step at a time.
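To put a hypothetical number on what compounding means: a set of reforms that cuts violence by just 5% a year, sustained, halves it in about 14 years, since 0.95^14 ≈ 0.49.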
****
Changing the world is hard. If it were easy, we’d probably know more of the answers by now. Research, including RCT research, has served to document this sobering empirical regularity and Stevenson is correct to note the large number of small and null findings from experimental research in her essay. Lots of stuff we try indeed doesn’t make much of a difference. But at the same time, RCTs have also led to genuine learning — both about what fails and what succeeds.
If the standard by which we are to judge any human endeavor is transformational change, then disappointment is all but assured. But the big-picture evidence suggests that we do not have to wait for a social revolution to make meaningful progress in improving people’s lives. If you are over the age of 35 and living in the United States, you have seen homicide rates double in your lifetime only to fall back to baseline a decade later. New York City, where I grew up, experienced more than 2,000 murders in 1990 and fewer than 300 as recently as 2018. Meanwhile, in Chicago, which had a murder rate similar to New York City’s in 1990, murders are back up and close to their peak level. Something caused murders to decline by more than 80% in New York. While it is possible that most of the variation is due to indistinct social forces, it is also possible that public policy has played an important role.
Social science research is unlikely to pinpoint one neat trick that led New York to separate itself from Chicago. But it remains a critical tool, among others, to make sense of a messy, chaotic and difficult-to-change world.