Planning an Experiment in Increasing Arrest Rates

To be evidence based and data driven, your police department needs to plan how to measure whether a particular intervention is working. If you spend $100,000 on some new software or other tech, is it actually producing the outcomes you want?

Here I will walk through an example of how to plan an experiment so you can tell whether that intervention is actually improving arrest rates.

The experimental ideal

First you need to determine how you will estimate what is called the counterfactual: what would your arrest rates be without the tech? In a perfect world, you would do what is called a randomized experiment. For example, you might only let half of your staff use the software while the trial is going on.
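As a small sketch of what that random assignment could look like in R (the roster size and group labels here are made up purely for illustration):

# randomly assign a hypothetical roster of 50 officers to software vs control
officers <- paste0("Officer_", 1:50)
groups <- sample(rep(c("software", "control"), each = 25))
assignment <- data.frame(officer = officers, group = groups)
head(assignment)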

In practice that is very hard, so police departments often use other approaches that are not randomized experiments but can serve as reasonable proxies, such as splitting cases by geography or shift within the agency.

The code below shows how to estimate how many crimes you will need in your treated and control groups to be able to tell whether the intervention is working.

In practice you cannot control this exactly, but it is still useful for planning. So say you need 1,000 crimes in the treated group and 1,000 crimes in the control group. (Your treated and control groups may be randomized cases, or they may be split by geography or shift within your agency.)

So you estimate that to accumulate that many crimes, you need to run the experiment for a year. In my experience it often takes significantly longer than people expect to be able to reasonably detect whether an intervention is working. Many times I see people analyze a single month of data, when they would need to run the experiment for well over a year for it to be well powered.

This is typically formalized via a metric called power. Power is the probability that, if the intervention works as well as you expect, you will be able to tell the treated group is significantly different from the control group. (You may also plan for precision, e.g. being able to tell the rate increased by a minimum of 10%.) It is common to plan for at least an 80% chance of showing your intervention is working better than the control group.
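As a quick illustration of how this works in practice (using the same 40% baseline arrest rate assumed in the examples later in this post), you can ask R directly how many cases per arm you need for 80% power to detect a 5 percentage point bump:

# per-arm sample size needed for 80% power to detect a move from 40% to 45%
power.prop.test(p1 = 0.40, p2 = 0.45, power = 0.80, sig.level = 0.05)
# n here is per group, and comes out to a bit over 1,500 cases in each arm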

Example R Code Analysis

Here is an example walkthrough using R code. First I load the libraries I need. Most of the work here is in base R, but I use ggplot2 for graphs and the RColorBrewer library for a color scale (the knitr library is just for this post, to make nicer tables). I also set the random number seed so the simulation results are reproducible.

library(ggplot2)
library(RColorBrewer)
library(knitr)
set.seed(10)

# ggplot theme
theme_andy <- function(){
  theme_bw() %+replace% theme(
     text = element_text(size = 16),
     panel.grid.major= element_line(linetype = "longdash"),
     panel.grid.minor= element_blank()
) }

Simple Binary Comparison

I will conduct two different analyses. The first is a simple comparison of proportions of arrests between two samples (the treated and control).

The assumptions of this simulation are:

- the baseline arrest rate in the control group is 40%
- the treated and control arms have an equal number of crimes
- the test is two-sided at the conventional 0.05 significance level

Then we vary different components in the simulation: the total number of crimes in each arm of the experiment (100 to 2000, in increments of 50), and the size of the treatment effect (increases of 5, 10, 15, and 20 percentage points).

Power is computed using the power.prop.test function in R, which is based on a large-sample Wald test for the difference between two proportions. For each combination of sample size and treatment effect, the test calculates the probability of rejecting the null hypothesis of equal arrest probabilities under the assumed alternative.

baseline <- 0.4
pp_increases <- c(0.05,0.10,0.15,0.20)
n <- seq(100,2000,50) # this is per arm, so 500 would be 500 in treated and 500 in control

df <- expand.grid(baseline=baseline,pp=pp_increases,n=n)

getp <- function(n,p1,p2){
    pf <- power.prop.test(n=n,p1=p1,p2=p2,
            alternative="two.sided",
            sig.level = 0.05)
    return(pf$power)
}

df$power <- mapply(getp,df$n,df$baseline,df$baseline+df$pp)
# This is a Wald test for testing the difference between two proportions
pp_labels <- sprintf("%.2f", pp_increases)
df$fpp <- factor(df$pp,labels=pp_labels,levels=pp_increases)

It is easiest to visualize the results of this simulation using a graph. If you think your intervention will increase the arrest rate by 20 percentage points (e.g. go from 40% to 60%), you only need 100 observations in each arm to detect that change at 80% power. For only a 5 percentage point increase (going from 40% to 45%), you need more like 1,600 crime events in each arm to detect the difference.

power_plot <- ggplot(df, aes(x = n, y = power, color = fpp)) +
  geom_line(size = 1) + geom_point() +
  geom_hline(yintercept = 0.80, linetype = "dashed", color = "gray50") +
  scale_y_continuous(limits = c(0, 1), breaks = seq(0, 1, by = 0.1)) +
  scale_x_continuous(breaks = seq(0, max(df$n), by = 200)) + 
  scale_color_brewer(palette = "Dark2") + 
  labs(
    x = "Sample size per arm (N)",
    y = "Power",
    color = "PP increase",
    title = "Power to Detect Increase in Arrest Probability",
    subtitle = paste0("Baseline arrest rate = ",baseline * 100,"%") 
  ) +
  theme_andy()

power_plot

Power curves for increase in arrest rate
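If you want exact numbers rather than reading them off the graph, a small helper like this (reusing the df object built above) pulls out the smallest per-arm sample size that reaches 80% power for each effect size:

# smallest per-arm sample size reaching 80% power, for each percentage point increase
min_n80 <- aggregate(n ~ fpp, data = df[df$power >= 0.80, ], FUN = min)
names(min_n80) <- c("PP Increase", "N Per Arm")
kable(min_n80, row.names = FALSE)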

Ultimately you would want to change the baseline arrest percentage and the increments based on your own data. Say you had a crime type with a much lower baseline (like car break-ins), and so started from 5%. You would just change the baseline value above, and then probably change the pp_increases vector to different values as well.
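As a sketch of what that swap might look like (the particular values here are just illustrative, not recommendations), you only need to redefine the inputs and rerun the same code:

# illustrative values only -- swap in numbers from your own data
baseline <- 0.05                           # e.g. a 5% baseline arrest rate for car break ins
pp_increases <- c(0.01, 0.02, 0.03, 0.05)  # smaller percentage point bumps make sense at a low baseline
df <- expand.grid(baseline = baseline, pp = pp_increases, n = n)
df$power <- mapply(getp, df$n, df$baseline, df$baseline + df$pp)
pp_labels <- sprintf("%.2f", pp_increases)
df$fpp <- factor(df$pp, labels = pp_labels, levels = pp_increases)
# the same ggplot code above can then be rerun on this new df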

Different Baseline Probabilities

The prior analysis is a good place to start, but it fails in one respect that is common with policing interventions: we often expect the intervention to increase arrest rates for all crimes, not just a single crime type. Different crimes have different baseline arrest probabilities though, e.g. robberies may be solved 30% of the time, but MV thefts only 10% of the time.

A reasonable baseline assumption for planning experiments is that the intervention will increase the arrest rate by a consistent percent (not percentage point) across all crime types. E.g. it may raise arrest rates by 10%. So robberies go from 30% to 33%, and MV thefts go from 10% to 11%.
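To make the percent versus percentage point distinction concrete, here is the same arithmetic in R (using the two example baselines from above):

# a constant 10% relative increase applied to different baseline arrest rates
irr <- 1.10
example_baselines <- c(Robbery = 0.30, MVTheft = 0.10)
example_baselines * irr  # 0.33 for robbery, 0.11 for MV theft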

So here we create a sample distribution of crimes, e.g. MV thefts are 20% of all crimes in the sample and have a baseline arrest proportion of 10%. You do that for each major crime type of interest.

crime_baselines <- c(0.1,0.05,0.3,0.5,0.7) # baseline arrest probabilities
crime_types <- c('MVTheft','Larceny','Burglary','Robbery','Assault') # crime types
crime_prop <- c(0.2,0.3,0.1,0.05,0.35) # proportions in the sample for each crime type
names(crime_baselines) <- crime_types

# pretty print this table
ctable <- data.frame(crime_prop,crime_baselines)
ctable$crime_types <- crime_types
ctable <- ctable[,c("crime_types","crime_prop","crime_baselines")]
names(ctable) <- c("Crime","Sample Proportion","Arrest Proportion")
kable(ctable,row.names=FALSE)
Crime     Sample Proportion   Arrest Proportion
MVTheft   0.20                0.10
Larceny   0.30                0.05
Burglary  0.10                0.30
Robbery   0.05                0.50
Assault   0.35                0.70

Next we create a function that simulates a sample of data from these baseline crime types, assigns each case to treated or control (again assuming a 50/50 split between arms), applies the treatment effect as a multiplier on the odds of arrest, and then estimates a logistic regression model. It returns the estimated treatment coefficient, its standard error, and whether the treatment p-value fell below the significance level.

logistic <- function(x){1/(1+exp(-x))}
logit <- function(x){log(x/(1-x))}

gen_sample <- function(n,ir,alpha=0.05){
    crimes <- sample(crime_types,size=n,replace=TRUE,prob=crime_prop)
    crime_prob <- crime_baselines[crimes]
    df <- data.frame(crime=crimes,prob=crime_prob)
    df$treat <- rbinom(n,size=1,prob=0.5)
    # get baseline probability, increases for treated
    df$bp <- logistic(logit(crime_prob) + df$treat*log(ir))
    df$arrest <- rbinom(n,size=1,prob=df$bp)
    # estimate logistic regression
    fit <- glm(
       arrest ~ treat + crime,
       data = df,
       family = binomial(link = "logit")
    )
    # extract treatment coefficient estimate, SE, and p-value from the model summary
    summ <- summary(fit)$coefficients
    beta_hat <- summ["treat", "Estimate"]
    se_hat   <- summ["treat", "Std. Error"]
    p_value  <- summ["treat", "Pr(>|z|)"]
    return(
      c(
        b = beta_hat,
        se   = se_hat,
        power  = as.numeric(p_value < alpha)
      )
    )
}
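Before running the full simulation, it can help to look at what a single replication returns (this is a random draw, so the exact numbers will differ each time you run it):

# one simulated dataset with 2000 total incidents and a 20% relative increase
gen_sample(n = 2000, ir = 1.2)
# returns the estimated treatment coefficient (b), its standard error (se),
# and a 0/1 indicator for whether the treatment p-value was below 0.05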

Now we are ready to conduct our simulation. Here we need to choose a total number of simulations to conduct (I default to 10,000 in most cases, but you can ultimately do many more).

n_sims <- 10000                         # number of simulations to conduct
n_values <- c(1000,2000,3000,4000,5000) # total incidents (combined both treated and control)
irr_increases <- c(1.10,1.20,1.40)      # relative increases of roughly 10%, 20%, and 40%
# (applied as multipliers on the odds of arrest, not percentage point increases)
sim_results <- c()

for (i in irr_increases){
    for (n in n_values){
        sim_vals <- c(n,i)
        names(sim_vals) <- c("N","IRR")
        res <- rowMeans(replicate(n_sims, gen_sample(n = n, ir = i)))
        r2 <- c(res,sim_vals)
        sim_results <- rbind(sim_results,r2)
    }
}

row.names(sim_results) <- 1:nrow(sim_results)
sim_results <- data.frame(sim_results)
sim_results$sim_irr <- exp(sim_results$b)
sim_results$bias <- sim_results$sim_irr - sim_results$IRR
fields <- c("N","IRR","bias","b","se","power")
sim_results <- sim_results[,fields]
form <- c(0,rep(2,length(fields)-1))
kable(sim_results,digits=form)
   N   IRR   bias      b     se   power
1000   1.1      0   0.10   0.17    0.09
2000   1.1      0   0.09   0.12    0.12
3000   1.1      0   0.10   0.10    0.17
4000   1.1      0   0.10   0.08    0.20
5000   1.1      0   0.10   0.08    0.25
1000   1.2      0   0.18   0.17    0.18
2000   1.2      0   0.18   0.12    0.32
3000   1.2      0   0.18   0.10    0.46
4000   1.2      0   0.18   0.08    0.58
5000   1.2      0   0.18   0.08    0.69
1000   1.4      0   0.34   0.17    0.51
2000   1.4      0   0.34   0.12    0.79
3000   1.4      0   0.34   0.10    0.93
4000   1.4      0   0.34   0.09    0.98
5000   1.4      0   0.34   0.08    0.99

Here we can see that with the smaller IRR increase of 10%, even with a total of 5000 cases (this is the total, combining both arms), the power to detect that small a difference across all the crime types is pretty low. With an IRR increase of 40%, you get reasonable power at around 2000 total cases.

The more realistic expectation, though, is probably something like a 20% increase, and in these simulations you would likely need a sample size of well over 5,000 crimes to detect that.
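If you wanted to pin down roughly where a 20% increase crosses 80% power, one option (a sketch reusing gen_sample and n_sims from above, with hypothetical larger totals; expect it to take a while to run) is:

# extend to larger totals to see where a 20% increase reaches 80% power
n_values_big <- c(6000, 8000, 10000)  # hypothetical larger totals, combining both arms
big_results <- c()
for (n in n_values_big){
    res <- rowMeans(replicate(n_sims, gen_sample(n = n, ir = 1.2)))
    big_results <- rbind(big_results, c(N = n, res))
}
data.frame(big_results)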

Note that with GLM type models, you also want to estimate bias in your simulations. Since maximum likelihood estimates are only unbiased in large samples, it is quite possible to see bias at smaller sample sizes. Here, though, the bias is fairly small even with 1,000 total cases.

Want this done for you?

It is easy to say you are being evidence based, but it is much harder in practice to meet that ideal. One of the types of consulting I do here at Crime De-Coder is helping agencies design these experiments, monitor them in practice, and conduct rigorous analyses of the intervention.

Additionally, I do this work for grant funded research, which often mandates an external researcher to conduct the outcome analysis, and which often requires a plan like this up front to get funding in the first place.

If either of these scenarios applies to you, get in touch.