Campaign Zero, the police reform campaign proposed by Black Lives Matter activists, outlined ten ways governments can change policing regulations to meet their demands. The demand to end broken windows policing topped the list. Below, I show an example of how machine learning algorithms can help curtail this kind of policing while keeping to a minimum the number of criminals who could have been caught but were allowed to walk.

What's the Problem?

Put simply, too many innocent people are harassed by police. For example, Philando Castile, the black man who was fatally shot by a police officer during a traffic stop, had been accused of traffic violations on more than 50 occasions before the fatal stop. Many of these "violations" were things like improperly displaying a license plate, and about half of them were dismissed outright.

What Does the Data Say?

One good thing that came out of New York City's widely criticized stop-and-frisk policy is a dataset that lets us examine trends like this one in the population at large. The dataset records every stop: when and where it occurred, the reason for it, the characteristics of the suspect, and, among many other data points, the outcome. In each of the twelve years for which these data are available, New Yorkers were stopped about half a million times, and nine out of ten stops were of completely innocent people. See the plot below for a breakdown by police precinct in 2015.

[Map: breakdown of stops by police precinct, 2015]

Making a Tradeoff

Although stopping innocent people is clearly a problem, police officers do have to stop many people in high-crime areas to make these areas safer. A bulge in a jacket will indicate nothing illicit nine times out of ten, but to make the neighborhood safe, it may be worth stopping ten people and upsetting nine of them in order to stop the one person who is hiding an illegal gun. A tradeoff has to be made. Perhaps if this tradeoff were more favorable (if a police officer did not have to upset nine people in order to find the one with the gun), a high level of crime enforcement could be achieved together with a small number of annoying, and likely discriminatory, stops.

What's the Solution?

Machine learning to the rescue! In this case, the model learns to predict whether the suspect was guilty of a crime, as assessed by the police officer after the stop (operationalized here as whether an arrest was made). Although this is far from a perfect measure of guilt, it is certainly a reasonable starting point. A machine learning model allows many variables to be used for prediction and can capture complex relationships between them. This let me include a great variety of variables, for example, which month it was and what the suspect was wearing. Because the model allows for complex relationships, it can take into account that although wearing a parka in January is perfectly normal, wearing a parka in June is suspicious (it may indicate that the suspect is concealing a weapon).
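To illustrate the kind of interaction a tree ensemble can pick up, here is a toy sketch with made-up data (not the stop-and-frisk dataset): a model that only sees "month" and "parka" as separate, additive effects cannot learn that the combination is what matters, but boosted trees can.

# Toy data (made up): a parka is only suspicious in June, not in January
library(xgboost)
toy = data.frame(month = rep(c('Jan', 'Jun'), each = 100),
                 parka = rep(c(0, 1), 100))
toy$guilty = as.numeric(toy$month == 'Jun' & toy$parka == 1)
toym = model.matrix(guilty ~ ., toy)
fit = xgboost(data = toym, label = toy$guilty, max.depth = 2,
              nrounds = 10, objective = "binary:logistic", verbose = 0)
round(predict(fit, toym[c(1, 2, 101, 102), ]), 2)  # only Jun + parka is high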

First, we need to read in the data.

library(xgboost)  # gradient boosted trees
library(ggplot2)  # plotting

sf = read.csv('stopfrisk.csv', header = TRUE)

Next, we add various time variables, such as the month, day of week and day of month.

sf$datestop = as.Date(as.character(sf$datestop), '%m%d%Y')  # parse the stop date
sf$month = months(sf$datestop)
sf$qtr = quarters(sf$datestop)
sf$weekday = weekdays(sf$datestop)
sf$monthday = as.numeric(format(sf$datestop, "%d"))

Next, we build a matrix of predictors.

predicts = sf[c('pct', 'timestop', 'city', 'sex', 'age',
                'agesq', 'height', 'weight', 'haircolor', 'build',
                'inout', 'offunif')]

In addition to these variables, we need to add the reason-for-stop and additional-circumstances variables. There are many of these, but their names all start with 'ac' or 'cs'.

predicts = data.frame(predicts, sf[substr(names(sf), 1, 2) %in% c('ac', 'cs')])
predictsM = model.matrix(sf$arstmade ~ ., predicts)  # drops rows with missing predictors and converts factors to dummy variables
keep = as.numeric(dimnames(predictsM)[[1]])  # row numbers retained by model.matrix
DV = sf$arstmade[keep]  # use only observations with no missing predictors

The most accurate model I found, as assessed by cross validation, was gradient boosted decision trees (fit with xgboost). The parameters below worked well after some experimentation, but better ones may well exist.

test = sample(1:length(DV), 2000)  # hold out 2,000 rows for testing
train = !(1:length(DV) %in% test)  # the rest are training rows
bst <- xgboost(data = predictsM[train, ], label = DV[train],
               max.depth = 40, eta = .3, lambda = 1, alpha = .25,
               nthread = 2, nrounds = 40,
               objective = "binary:logistic", verbose = 1)
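For reference, here is a minimal sketch of how such parameter settings can be compared by cross validation (the five folds and the error metric are my choices, not necessarily what was used originally):

cv = xgb.cv(data = predictsM[train, ], label = DV[train], nfold = 5,
            max.depth = 40, eta = .3, lambda = 1, alpha = .25, nrounds = 40,
            objective = "binary:logistic", metrics = "error", verbose = 0)
print(cv)  # per-round classification error, averaged over the held-out folds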

Results

> table(predict(bst, predictsM[test, ]) > .05, DV[test])  # .05 is an arbitrary threshold

            0    1
  FALSE  1303   66
  TRUE    424  207

This table indicates that of the 273 people who would be arrested if stopped, 207 (76%) were stopped; this is the model's recall. Of the 1,727 people who would not be arrested, only 424 (25%) were stopped, which means the model's specificity was 75%. Let's see that breakdown by race.
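These numbers can be computed straight from the confusion table; a small sketch (the variable names are mine):

tab = table(stopped = predict(bst, predictsM[test, ]) > .05, guilty = DV[test])
recall      = tab["TRUE", "1"] / sum(tab[, "1"])   # 207 / 273   ~ 0.76
specificity = tab["FALSE", "0"] / sum(tab[, "0"])  # 1303 / 1727 ~ 0.75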

race = sf$race[keep][test]  # align race with the rows kept by model.matrix
plotd = aggregate(DV[test],
                  list(DV[test], predict(bst, predictsM[test, ]) > .05, race),
                  length)
names(plotd) = c("Guilty", "Stopped", "Race", "Number")
races = c('Black', 'Black\nHispanic', 'White\nHispanic', 'White',
          'Asian', 'Native\nAmerican')
plotd$Race = races[plotd$Race]
plotd$Stopped = ifelse(plotd$Stopped, "Yes", "No")
plotd$Guilty = ifelse(plotd$Guilty > 0, "Yes", "No")
ggplot(plotd, aes(x = Guilty, y = Number, fill = Stopped)) +
  geom_bar(stat = "identity", position = position_dodge()) +
  facet_grid(. ~ Race)

[Chart: number of people stopped vs. not stopped under the model, by guilt and race]

Only about 220 innocent black people would be stopped if this policy were instituted, compared with about 1,000 without it.

Varying the decision threshold makes the tradeoff explicit. If we set the model's decision criterion to .05, the model identifies most guilty people while stopping far fewer innocents. Specifically, at this threshold, the total number of innocent people stopped falls to only 25% of the original (for every 100 innocent New Yorkers who had to be stopped before, only 25 would be stopped with this tool) while we still catch 76% of those who would ultimately be arrested. If we want to make sure we catch almost all the criminals, we can decrease the threshold to .01. In that case, 59% of the innocent people originally stopped would still be stopped, but so would 93% of those who would ultimately be arrested.
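A quick sketch of how these tradeoffs can be read off for any threshold (the thresholds below are chosen for illustration):

p = predict(bst, predictsM[test, ])
for (thr in c(.01, .05, .1)) {
  cat(sprintf("threshold %.2f: stops %2.0f%% of the guilty, %2.0f%% of the innocent\n",
              thr,
              100 * mean(p[DV[test] == 1] > thr),
              100 * mean(p[DV[test] == 0] > thr)))
}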

Why did it Work?

Although the algorithm does not directly report how much each feature contributed to its predictions, we can assess this indirectly. One measure of feature importance compares the full model with how well it would have done without access to a specific feature. By this measure, the most important features were the time of the stop, the suspect's age, whether the stop occurred indoors or outdoors, the suspect's weight and height, whether the officer was in uniform, the reason for the stop, and circumstances such as proximity to the scene of an offense.
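xgboost also ships a built-in importance measure, the gain each feature contributes across the trees' splits. It is not the leave-one-feature-out comparison described above, but it is a quick way to get a similar ranking:

imp = xgb.importance(feature_names = colnames(predictsM), model = bst)
head(imp, 10)  # top features ranked by gain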

Implementation

This algorithm could be implemented in an app for patrol car computers or officers' smartphones. With a clock and GPS tracking, many of the features could be filled in automatically. The app would ask officers to fill in the rest (e.g., reason for stop, physical characteristics of the suspect) and then, based on the model's output, indicate whether it is worthwhile to stop this suspect. Of course, in times of emergency, using such a tool may not be feasible. However, broken windows policing is not about times of emergency; officers making routine stops often have plenty of time.
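As a sketch of the scoring step such an app's backend might run (the helper name and the one-row input format are assumptions on my part, not part of the original analysis):

# Hypothetical helper: new_stop is a one-row data.frame with the same
# columns and factor levels as 'predicts' above (clock/GPS fields prefilled,
# the rest entered by the officer).
score_stop = function(new_stop) {
  # rbind against an existing row so factor levels line up with training
  x = model.matrix(~ ., rbind(predicts[1, ], new_stop))[-1, , drop = FALSE]
  predict(bst, x)  # estimated probability the stop would end in arrest
}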

Gaming the Model

If this tool were implemented, it would operate in a very high-stakes environment, and real criminals would have a lot to gain from understanding how it works. This particular model is harder to game than most because it includes interactions between variables, as described above: criminals could not avoid being stopped by following simple rules like "don't wear loose clothing." Nonetheless, sophisticated criminals might still game even a complex model like this one, perhaps through advanced knowledge of the algorithm. These kinds of criminals can still be apprehended if a stochastic element is added to the decision process. Because the model outputs the probability that somebody is guilty rather than a decision, the tool can make its recommendation based on that probability rather than deterministically from a threshold. If people are stopped according to this probability, rather than according to a cutoff as suggested above, consistently gaming the model becomes nearly impossible.
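A minimal sketch of that stochastic rule, using the test rows from above:

p = predict(bst, predictsM[test, ])      # model's probability of guilt
stop_recommended = runif(length(p)) < p  # recommend a stop with probability p

In practice, the probabilities could be rescaled so that the expected number of recommended stops matches whatever volume a precinct can handle.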