Can statistical models discriminate against a group of people without the intent of the creator? Microsoft's recent experiment with the Twitter artificial intelligence bot Tay suggests the answer is a firm yes. The bot was programmed to resemble an impulsive teenager, but ended up denying the Holocaust ever happened while simultaneously advocating genocide against Mexicans. In an article entitled "Can Computers Be Racist," Michael Brennan discussed in some depth how algorithms can and do display preferences for certain groups, so I will not go further into that here.

Instead, I am going to follow up and propose a possible solution. In short, I am going to outline three possible ways statistical models can treat differences between groups. The first is by explicitly taking advantage of them in order to improve prediction. The second is by ignoring differences, but still giving preferential results to certain groups. The third is by accounting for group differences and giving the same results to people regardless of the group they belong to.

The Example

Imagine we are trying to design a tool to help police officers identify criminal suspects more effectively, thereby reducing the need to stop innocent people. The goal would be to have the police officer enter some information about the suspect and the situation and have the algorithm return a decision: confront the person or not. In the best case scenario, if such a tool were highly accurate and neutral with respect to racial group, it could reduce the number of needless stops and ease friction between police officers and the communities they serve. In the worst case scenario, if the tool takes advantage of the fact that there are higher rates of criminal incidence among racial minorities, it would result in more stops of people who belong to those minorities and increase tensions between police and the communities they serve.

Fortunately, the New York Police Department releases troves of data about its stop-and-frisk campaign (kudos). Using these data, we can predict the likelihood that a person suspected of a crime will be arrested based on background characteristics, the officer's suspicions, the time and location of the confrontation, and so on.

Note that what I am advocating here is a technique and not necessarily its use in this context. The specific tool that I am proposing could be easily gamed, and what I am trying to predict here (whether a person is arrested) does not actually indicate guilt.

A First Approach

The first thing we could try is to build a model that includes a whole lot of variables (including demographics) and then make predictions using all of them. Although it seems outrageous to use statistical models that explicitly include demographic characteristics to make impactful decisions, it is actually a common practice (though thankfully less often at the expense of historically disadvantaged groups). For example, insurance companies include gender, age, and marital status when assessing risk and therefore setting premiums.

Specifically, I used the NYPD's stop-and-frisk data to predict which confrontations resulted in arrests. As predictors, I used features of the confrontation (e.g., the time and place), features of the suspect (e.g., gender, race, age, height, weight, hair color, build), the officer's reasons for confronting the suspect (e.g., a suspicious bulge), whether the officer was in uniform, and other circumstances (e.g., proximity to the scene of a crime).

Because there were so many variables, I used regularization to select the ones that were actually useful, and because regularization introduces a hyper-parameter, I used cross-validation to choose it.
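
To make this concrete, here is a minimal sketch of what that step could look like in R with the glmnet package. The data frame name sqf and its arrested column are placeholders rather than the actual NYPD field names, and the sketch illustrates the approach rather than reproducing my exact code.

> # a minimal sketch of this step (not the exact code used): `sqf` is a
> # hypothetical data frame of stop-and-frisk records with a 0/1 `arrested`
> # column and the predictors described above, demographics included
> require(glmnet)
> X = model.matrix(arrested ~ ., data=sqf)[,-1] #expand factors into dummy variables, drop the intercept
> cvfit = cv.glmnet(X, sqf$arrested, family='binomial', alpha=1) #lasso penalty; lambda chosen by cross-validation
> coef(cvfit, s='lambda.min') #predictors with non-zero coefficients are the ones the model keeps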

To validate the model, I held out 1000 people suspected of a crime for final evaluation. Of these 1000, 156 were arrested when confronted. Because no model can be perfect, some of the people who would have been arrested will be missed. I set this miss rate to 5%, meaning about 8 people who would have been arrested would walk free. Using this criterion allowed 30.5% of people not to be stopped at all. The problem with this model was that it explicitly used race and other demographic characteristics to make its predictions, so two otherwise identical suspects could be treated differently solely because of the groups they belong to.
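
For concreteness, the 5% criterion above amounts to choosing a score threshold on the held-out cases. Continuing the earlier sketch, with X_holdout and y as placeholder names for the held-out design matrix and the actual arrest outcomes:

> # score the 1000 held-out confrontations, then choose the lowest score that
> # still catches 95% of the people who were actually arrested; `X_holdout` and
> # `y` are hypothetical names, not the actual variables I used
> p = predict(cvfit, newx=X_holdout, s='lambda.min', type='response') #predicted arrest probabilities
> cutoff = quantile(p[y==1], 0.05) #5% of true arrests score below this cutoff and would be missed
> mean(p < cutoff) #proportion of held-out suspects the tool would recommend not stopping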

A Proposed Solution

The approach I discuss here is to use a machine learning algorithm to learn which characteristics of the labeled questions give them the topics that they have and to extrapolate that to the unlabeled questions. I will do this in R using the tm and randomForest packages. There will be a separate model for each label.

> require(plyr)
> require(tm)
> require(randomForest)

The first thing to do is to read in the file to see what we're dealing with. This file is made publicly available by the Good Judgment Project.

> ifps = read.csv('ifps.csv', header=T, stringsAsFactors = F)
> head(ifps$q_text)
[1] "Will the Six-Party talks (among the US, North Korea, South Korea, Russia, China, and Japan) formally resume in 2011?"
[2] "Who will be inaugurated as President of Russia in 2012?"
[3] "Will Serbia be officially granted EU candidacy by 31 December 2011?"
[4] "Will the United Nations General Assembly recognize a Palestinian state by 30 September 2011?"
[5] "Will Daniel Ortega win another term as President of Nicaragua during the late 2011 elections?"
[6] "Will Italy restructure or default on its debt by 31 December 2011?"
> head(ifps$tags[ifps$tags!='']) #view the first six tags
[1] "|Middle-East|Law|"
[2] "|Middle-East|Conflict-Interstate|Diplomacy|Iran|"
[3] "|Asia-East|Pacific_Rim|Korea|Diplomacy|Military|"
[4] "|Asia-East|Europe|China|Economics|Law|"
[5] "|Asia-East|Pacific_Rim|Japan|Elections|"
[6] "|Middle-East|"

It seems that a question can be given more than one topic (or tag), and the tags are separated by the | character.

We would like to be able to predict which topic(s) in ifps$tags are appropriate for a question based on the text in ifps$q_text. For that, we need one matrix that tells us which questions have which topics (that's what we want to predict) and a second matrix that tells us which words appear in which question.

To get there, let's get a list of all the possible topics. These will be the names of the columns of the first matrix. First, we split the tags by the | character, collect the output into one list, and store the distinct elements of that list under topics.

> topics = unique(unlist(strsplit(ifps$tags, split='\\|')))
> topics = topics[c(-1,-length(topics))] #the first and last topics were "" and "0"
> length(topics) #how many topics are there?
[1] 59

We want a matrix with one row for each question and one column for each of the distinct topics, and for the entries, we want an indicator letting us know whether that question was labeled with that topic. The ldply function from the plyr package does just that: it applies a function to each element of a list (in our case, the list of topic vectors) and combines the results into a data frame.

> require(plyr)
> topicMat = ldply( strsplit(ifps$tags, split='\\|'), 
+                   function(x) topics %in% x)
> names(topicMat) = topics #Give the columns the appropriate names
> topicMat = topicMat[colSums(topicMat)>10] #Get rid of any topics used 10 or fewer times.

Well, our topics are now in the right shape, but we still need a matrix with one row for each question and one column for each possible word, indicating whether that question has that word in it. R's text mining package, tm, is perfect for this.

> mycorp = Corpus(VectorSource(ifps$q_text)) #turn the questions into a corpus
> mycorp = tm_map(mycorp, PlainTextDocument) #convenience function
> mycorp = tm_map(mycorp, content_transformer(tolower)) #change everything to lowercase
> mycorp = tm_map(mycorp, removePunctuation) #remove punctuation
> dtm = DocumentTermMatrix(mycorp) #make a document-term matrix
> dtm = removeSparseTerms(dtm, .99) #remove terms that appear in fewer than 1% of questions
> dtm2 = as.matrix(dtm) #change this into a regular matrix
> class(dtm2) = "numeric" #make sure the values are numeric, as randomForest expects

Now we're ready to train our algorithm. I find that random forests work really well for text data, so that is what I will use.

> train = !ifps$tags=='' #train on the rows that are labeled
> rfs = apply( topicMat[train,], 2, #for each column of the topic matrix
               function(x) randomForest(dtm2[train,], factor(x), ntree=200)
               #train a random forest with 200 trees
               )
> chosennames = apply( dtm2[!train,], 1, function(row) { #for each row in the test set
    names(topicMat)[ #choose the topic labels so that 
        unlist(lapply( rfs, function(rf) predict(rf, row, 'prob')[2]>.4)) #its probability according to our model is greater than .4
        ] 
})

The Results

The results are stored in the list chosennames, with one element per test question. For example, let's look at question 1:

> ifps$q_text[1]
[1] "Will the Six-Party talks (among the US, North Korea, South Korea, Russia, China, and Japan) formally resume in 2011?"
> chosennames[1]
$`1`
[1] "Diplomacy"

I think this is a good label, and in general I found that the labels the algorithm assigned were appropriate, though even in the training set they did not always agree with the labels originally provided. The problem is that the algorithm assigned labels to only about a quarter of the test cases.

> mean(unlist(lapply(chosennames, function(x) length(x)==0))) #proportion of test questions that weren't labeled
[1] 0.7282051
> ifps$q_text[4]
[1] "Will the United Nations General Assembly recognize a Palestinian state by 30 September 2011?"
> chosennames[4]
$`4`
character(0)

Question 4, for example, was unlabeled, though a topic like "Israel" might be appropriate. That's because the word "Palestinian" did not appear in any of the labeled questions. Without that word, it's hard to tell what the question is about.

Conclusion

The random forests did a good job learning what they could from the data. Moreover, the models could be fit with only a few lines of code and were effective without tuning many options. But it appears that there just weren't enough labeled data available here to label all the questions appropriately. In Part 2 of this blog post, I turn to an unsupervised machine learning approach to label these questions.