What Is ‘Gold Standard’ Evidence, Anyway?

Anna Harvey

March 27, 2024

Advancing the credibility revolution

Many in the policy community consider evidence from randomized controlled trials (RCTs) to be the “gold standard” for evaluating the causal effects of programs and policies. RCT designs, wherein the researcher randomizes subjects’ participation in a program or policy, have some obvious advantages for isolating causal effects. But RCTs are not possible in many settings. They are particularly rare in criminal justice settings because of ethical and legal obstacles to randomizing the treatment of criminal defendants.

The sparse record of RCT evaluations in criminal justice settings has led some to conclude that evidence-based criminal justice reform is a pipe dream. But this narrow reading of the evidence raises a question: What exactly is “gold-standard” evidence?

The idea that there’s such a thing as “gold-standard” evidence stems from the “credibility revolution” in empirical social science over the past 20 years (aka taking the “con” out of econometrics). The lesson of the credibility revolution is that we have to be very careful when making inferences about the causal effects of programs and policies, because we might confuse the effects of, say, selection into a program, with the effects of the program itself. However, if we are able to randomly assign participation in a program to some subjects but not others (as in an RCT), and we have a large enough sample, we can be pretty sure that any post-program differences in outcomes between our treatment and control groups are due to the program, not to anything else.
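
To make the selection problem concrete, here is a minimal Python sketch of a simulated program. Everything in it is a hypothetical assumption chosen for illustration, not data from any study: an unobserved trait called "motivation" lowers a person's re-offense risk on its own, the program's true effect is set to -0.10, and people either opt in (more often when motivated) or are assigned by coin flip.

```python
import random
import statistics


def difference_in_means(outcomes, treated):
    """Mean outcome of the treated minus mean outcome of the untreated."""
    treated_outcomes = [y for y, t in zip(outcomes, treated) if t]
    control_outcomes = [y for y, t in zip(outcomes, treated) if not t]
    return statistics.mean(treated_outcomes) - statistics.mean(control_outcomes)


def simulate(n=20000, true_effect=-0.10, seed=0):
    """Return (self-selected estimate, randomized estimate) of the effect."""
    rng = random.Random(seed)

    # Hypothetical unobserved trait in [0, 1] that lowers risk by itself.
    motivation = [rng.random() for _ in range(n)]

    def risk(m, in_program):
        # Assumed risk score: baseline, minus motivation, plus program effect,
        # plus a little random noise.
        effect = true_effect if in_program else 0.0
        return 0.60 - 0.30 * m + effect + rng.gauss(0, 0.05)

    # Scenario 1: self-selection. More motivated people enroll more often.
    chose = [rng.random() < m for m in motivation]
    y_selected = [risk(m, t) for m, t in zip(motivation, chose)]

    # Scenario 2: random assignment. A coin flip decides who enrolls.
    assigned = [rng.random() < 0.5 for _ in range(n)]
    y_random = [risk(m, t) for m, t in zip(motivation, assigned)]

    return (difference_in_means(y_selected, chose),
            difference_in_means(y_random, assigned))


naive, randomized = simulate()
print("true effect:            -0.100")
print(f"self-selected estimate: {naive:.3f}")
print(f"randomized estimate:    {randomized:.3f}")
```

Under these assumptions the self-selected comparison should land near -0.20, about double the true effect, because the people who opted in were lower-risk to begin with. The randomized comparison should land near -0.10, since the coin flip balances motivation across the two groups.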

But are RCTs the only “gold-standard” method for inferring causality? No! Not by a long shot. “Quasi-experimental” methods can also give us high-quality evidence about the causal effects of programs and policies. These methods include regression discontinuity designs (RDD), instrumental variables (IV) and difference-in-differences (DID); a good summary of these quasi-experimental research designs can be found in the book “Mastering ’Metrics: The Path from Cause to Effect.” In different ways, quasi-experimental designs all leverage the central feature of an RCT by isolating some source of as-good-as-random variation in subjects’ participation in a program or policy. Yet quasi-experimental designs have the advantage that they can be deployed where RCTs can’t, including in criminal justice settings.

But are these quasi-experimental methods really as good as RCTs? We can look at this question empirically. A recent analysis of the reliability of different forms of evidence looked at papers published in the top 25 economics journals in 2015 and 2018 that reported evidence from RCT, RDD, IV and DID designs. For each method, the analysis looked at the distribution of the test statistics used to establish statistical significance. Those distributions should be smooth: For any given method, we should not see a spike in the number of studies reporting p-values just below conventional significance thresholds (a common threshold is a p-value at or below 0.05, meaning that results at least this strong would be unlikely to arise by chance alone if there were no true effect). If we do see such spikes, this tells us that the method is less reliable (perhaps because of publication bias or “p-hacking,” the manipulation of a study’s analysis to produce statistically significant findings).
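
One simple way to check for such a spike is a "caliper" comparison: count the reported test statistics that fall in a narrow band on either side of the significance cutoff and ask whether the split is more lopsided than chance would explain. The Python sketch below is an illustration under stated assumptions, not necessarily the procedure the analysis used. It works with z-statistics, where a p-value just below 0.05 corresponds to a z-statistic just above 1.96; the band width of 0.20, the function name caliper_test and both lists of z-statistics are hypothetical. It also assumes that, absent manipulation, a result in a narrow band is about equally likely to fall on either side of the cutoff.

```python
import math


def caliper_test(z_stats, threshold=1.96, width=0.20):
    """Compare counts of |z| just above vs. just below the threshold.

    Returns (above, below, p_value), where p_value is the one-sided exact
    binomial probability of seeing at least `above` results on the
    significant side if each side were equally likely.
    """
    above = sum(1 for z in z_stats if threshold <= abs(z) < threshold + width)
    below = sum(1 for z in z_stats if threshold - width <= abs(z) < threshold)
    n = above + below
    # P(X >= above) for X ~ Binomial(n, 0.5), computed exactly.
    p_value = sum(math.comb(n, k) for k in range(above, n + 1)) / 2 ** n
    return above, below, p_value


# Hypothetical z-statistics, invented purely for illustration.
smooth = [1.80, 1.85, 1.90, 1.94, 1.98, 2.03, 2.08, 2.12]
bunched = [1.90, 1.97, 1.98, 2.00, 2.01, 2.02, 2.04, 2.05, 2.07, 2.10]

for label, sample in [("smooth", smooth), ("bunched", bunched)]:
    above, below, p = caliper_test(sample)
    print(f"{label}: {above} just above, {below} just below, p = {p:.3f}")
```

In the first invented sample the results split evenly across the cutoff, so there is no sign of a spike. In the second, nine of ten fall on the significant side, a split that would be unlikely if nothing were pushing results over the line.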

The analysis found that articles reporting RCT evidence largely do not display a spike in test statistics just below conventional significance thresholds, confirming the widespread opinion that RCTs are a highly reliable empirical method. Importantly, however, after controlling for journal and field of study (to ensure that we are identifying the effects of different methods rather than the effects of the kinds of researchers who are represented in different journals or fields), the analysis found that both RDD and DID designs are as reliable as RCT designs. Moreover, the lower reliability of IV designs was primarily a problem for papers using relatively weak instruments, a practice that is likely to disappear under the current, higher bar for reporting IV evidence.

This is very good news for the field of criminal justice research, given that the vast majority of causal studies of criminal justice policies and programs leverage quasi-experimental designs. For example, while 21% of the papers in the analysis described above reported evidence from RCTs, only 10% of the 68 criminal justice-related articles accepted for publication in 21 economics journals in 2023 reported evidence from RCTs.

How do the prospects for evidence-based criminal justice reform look when we include all of the available “gold-standard” evidence in our survey of the literature? I don’t know! Somebody should write that survey!

But even a casual and unsystematic skim of recent quasi-experimental papers suggests that evidence-based criminal justice reform is far from a pipe dream. For example, recent quasi-experimental evidence has shown that we can reduce recidivism by helping defendants avoid acquiring a criminal record, and by substituting electronic monitoring for pretrial incarceration or short incarceration sentences (with the spillover benefits of reducing post-release welfare receipt and increasing post-release school completion). Quasi-experimental evidence also indicates that we can reduce the use of cash bail without increasing recidivism. All of these findings are solid and credible.

Quasi-experimental evidence also reveals that we can reduce crime by providing those in need of housing with emergency financial assistance or a housing placement; getting more civilian “eyes on the street” through community crime monitoring programs, with the spillover benefit of reducing student absenteeism; providing greater access to substance abuse treatment and health care; staggering benefits disbursements and disbursing benefits electronically rather than as cash; reducing foreclosures and vacancies; extending daylight saving time; and reducing air pollution. Quasi-experimental evidence further indicates that we can reduce crime over a much longer time frame by ensuring that children have what they need to thrive, including health care, nutrition and environments free from lead.

Again, this is far from a systematic review of the quasi-experimental literature on the causal impacts of criminal justice programs and policies (somebody really should write that review!). But it is clear from even a casual look at that literature that there exists “gold-standard” evidence that many criminal justice programs and policies increase public safety while reducing the harms imposed by current criminal justice practices. If anything, this evidence suggests that we should redouble our efforts to achieve evidence-based criminal justice reform rather than surrender to the status quo.