We can’t shrink from our duty to understand the effects of policy on people’s lives.
After completing my undergraduate degree at Vanderbilt, I was hired as a victim-witness coordinator in the district attorney’s office in Nashville. I worked with domestic violence victims whose cases were advancing through the courts, and I believed our work made victims safer by holding abusers accountable and by providing services to victims, abusers and families. I was honored and highly motivated to do this work.
However, not all victims were so sure the court's interventions would make them safer: Some felt more protected, but others would say, "It doesn't matter what you do, if he wants to kill me, he will." (About 80% of my clients were women.) I was at a professional crossroads: I could either keep working in the system and hope my efforts were effective, or I could train as a social scientist and build a research portfolio dedicated to exploring how policy and practice could reduce domestic violence victims' risk of harm.
I chose the latter path. Drawing on her recent article, Megan Stevenson might say I subscribe to the "engineer's" view. I do so not only because I believe in the power of the information that causal research can provide, but also because I know firsthand the vacuum in which first responders, social workers and other change agents are left to make decisions when evidence is absent. Without evidence, we have to do what seems right and effective based on our gut. In a world of scarce resources, anecdotal evidence and selected samples, we are prone to choosing wrong. Worse, we are prone to staying staunchly committed to our own incorrect priors.
That said, Stevenson makes several warranted claims about the flaws of causal social science. Her criticisms point us toward ways to improve our discipline. As she notes, "It's notoriously difficult to publish research that isn't statistically significant." This is true, and it should not be. Journals should publish null studies. They should also publish studies that find substantively small but statistically significant results — though they should ensure the reader understands these "precisely estimated zeros" are unlikely to make a meaningful change and may not be cost-effective. Researchers should also post their results as working papers in the event a journal declines to publish them.
Since Stevenson focuses on randomized controlled trials (RCTs), I will too. It is highly improbable that for every one statistically significant published RCT, there were 19 statistically insignificant unpublished RCTs. This scenario is certainly possible in research that analyzes secondary data, but, unlike RCTs, secondary data analysis can often be performed without a funder.
RCTs are expensive, and the funder will (in every case I can think of) require a report or series of reports to be distributed both internally and externally. The researcher's reports will determine future funding strategies. Therefore, while publication bias may exist, even in RCTs, it is unlikely to exist at the scale Stevenson's argument requires. The very fact that we know many RCT findings fail to replicate across sites indicates that publication bias is less common in the RCT setting.
Stevenson "zooms in" on several RCTs where replications produced mixed results. But "replicate" is a commonly misused term: it implies an RCT was carbon copied from one site to another, which is rarely the case. While the intervention itself may be copied from setting to setting, its effect is usually compared to whatever was happening before the researchers arrived, and we'd expect that to differ from setting to setting. This means that in every RCT replication, the intervention (or treatment) is being measured against a different counterfactual (a different "what if"). This alone could create differences in RCT outcomes, especially if the RCT is evaluating a very incremental change.
One of Stevenson’s examples of failed replications was a set of RCTs used to determine the effect of a suspect’s mandatory arrest at a domestic violence crime scene. The initial study took place in Minneapolis and had promising results. Following the Minneapolis experiment and the very public Thurman v. City of Torrington case, there was quick and widespread adoption of mandatory arrest laws. By 1987, 176 jurisdictions had some form of arrest policy.
In 1986 and 1987, six replications of the Minneapolis experiment took place in cities across the country. Stevenson writes, “Combining data across all seven studies, researchers later concluded that arrest did not, in fact, have a consistent or large effect on recidivism.”
Stevenson cites Schmidt and Sherman, but that study offered much more than a despairing shrug. The authors used the evidence of these seven studies to uncover important nuances regarding who is helped and who is harmed by mandatory arrest laws. As a result of the seven RCTs, they developed a set of five policy recommendations: (1) repeal mandatory arrest laws to avoid unintended consequences, (2) substitute structured police discretion so that police can effectively use their professional judgment, (3) allow warrantless arrest, (4) issue arrest warrants for absent offenders and (5) create special units for couples experiencing chronic violence.
In light of the replications, many jurisdictions created more nuanced pro-arrest policies, which now include a range of warrantless arrest options across jurisdictions. However, once a policy is in place, it can be very hard to undo it — especially when doing so can be seen as going soft on crime.
The replication studies yielded information to improve policy. That's a success, not a failure. However, the timing tells us something important about how to improve our use of RCTs in policymaking: replication is vital to getting policy right before going to scale. Had we known the results of all seven experiments prior to the policy's rollout, pro-arrest policies for domestic violence would have been more nuanced and effective from the start.
Finally, Stevenson argues that the incremental kinds of interventions that are evaluated with RCTs are simply too modest to make much difference. Even if an RCT finds a real effect, she argues, outcomes will regress to their pre-intervention levels due to the social forces at play in people’s lives. There is wisdom in avoiding micro-solutions to macro-problems. However, we must also recognize that our policy process is often incremental, and policymakers should use evidence — even if it is imperfect — to make those incremental decisions. At the very least, this helps us to avoid missteps while inching closer to improvement.
The solution is not to abandon evidence — even the incremental kind provided by RCTs. The solution is to do RCTs faster and across more settings — so that we can continuously update what we know and build evidence for change. The reversion Stevenson identifies might be avoided if we had a better sense of what to do next or which effects are context-dependent. When RCTs fail to replicate, we must work to understand why: Was it research design, differences in the population across sites or something else? Then we must use this robust evidence on circumstances and mechanisms — as Schmidt and Sherman did — to create a set of interventions that lead to lasting change.