You weren’t supposed to actually implement it, Google

Last month, I wrote a blog post warning about how, if you follow popular trends in NLP, you can easily accidentally make a classifier that is pretty racist. To demonstrate this, I included the very simple code, as a “cautionary tutorial”.

The post got a fair amount of reaction. Much of it positive and taking it seriously, so thanks for that. But eventually I heard from some detractors. Of course there were the fully expected “I’m not racist but what if racism is correct” retorts that I knew I’d have to face. But there were also people who couldn’t believe that anyone does NLP this way. They said I was talking about a non-problem that doesn’t show up in serious machine learning, or projecting my own bad NLP ideas, or something.

Well. Here’s Perspective API, made by an offshoot of Google. They believe they are going to use it to fight “toxicity” online. And by “toxicity” they mean “saying anything with negative sentiment”. And by “negative sentiment” they mean “whatever word2vec thinks is bad”. It works exactly like the hypothetical system that I cautioned against.

On this blog, we’ve just looked at what word2vec (or GloVe) thinks is bad. It includes black people, Mexicans, Islam, and given names that don’t usually belong to white Americans. You can actually type my examples into Perspective API and it will actually respond that the ones that are less white-sounding are more “likely to be perceived as toxic”.

  • “Hello, my name is Emily” is supposedly 4% likely to be “toxic”. Similar results for “Susan”, “Paul”, etc.
  • “Hello, my name is Shaniqua” (“Jamel”, “DeShawn”, etc.): 21% likely to be toxic.
  • “Let’s go get Italian food”: 9%.
  • “Let’s go get Mexican food”: 29%.

Here are two more examples I didn’t mention before:

  • “Christianity is a major world religion”: 37%. Okay, maybe things can get heated when religion comes up at all, but compare:
  • “Islam is a major world religion”: 66% toxic.

I’ve heard about Perspective API from many directions, but my proximate source is this Twitter thread by Dan Luu, who has his own examples.

I have previously written positive things about researchers at Google who are looking at approaches to de-biasing AI, such as their blog post on Equality of Opportunity in Machine Learning.

But Google is a big place. It contains multitudes. And it seems it contains a subdivision that will do the wrong thing, which other Googlers know is the wrong thing, because it’s easy.

Google, you made a very bad investment. (That sentence is 61% toxic, by the way.)


As I update this post in April 2018, I’ve had some communication with the Perspective API team and learned some more details about it.

Some details of this post were incorrect, based on things I assumed when looking at Perspective API from outside. For example, Perspective API does not literally build on word2vec. But the end result is the same: it learns the same biases that word2vec learns anyway.

In September 2017, Violet Blue wrote an exposé of Perspective API for Engadget. Despite the details that I had wrong, the Engadget article confirms that the system really is that bad, and provides even more examples.

Perspective API has changed their online demo to lower toxicity scores across the board, without fundamentally changing the model. Text with a score under a certain threshold is now labeled as “not toxic”. I believe this remedy could be described technically as “weak sauce”.

The Perspective API team claims that their system has no inherent bias against non-white names, and that the higher toxicity scores that appear for names such as “DeShawn” are an artifact of how they handle out-of-vocabulary words. All the names that are typical for white Americans are in-vocabulary. Make of that what you will.

The Perspective API team continues to promote their product, such as via hackathons and TED talks. Users of the API are not warned of its biases, except for a generic warning that could apply to any AI system, saying that users should manually review its results. It is still sometimes held up as a positive example of fighting toxicity with NLP, misleading lay audiences into thinking that present NLP has a solution to toxicity.

How to make a racist AI without really trying

A cautionary tutorial.

Let’s make a sentiment classifier!

Sentiment analysis is a very frequently-implemented task in NLP, and it’s no surprise. Recognizing whether people are expressing positive or negative opinions about things has obvious business applications. It’s used in social media monitoring, customer feedback, and even automatic stock trading (leading to bots that buy Berkshire Hathaway when Anne Hathaway gets a good movie review).

It’s simplistic, sometimes too simplistic, but it’s one of the easiest ways to get measurable results from NLP. In a few steps, you can put text in one end and get positive and negative scores out the other, and you never have to figure out what you should do with a parse tree or a graph of entities or any difficult representation like that.

So that’s what we’re going to do here, following the path of least resistance at every step, obtaining a classifier that should look very familiar to anyone involved in current NLP. For example, you can find this model described in the Deep Averaging Networks paper (Iyyer et al., 2015). This model is not the point of that paper, so don’t take this as an attack on their results; it was there as an example of a well-known way to use word vectors.

Here’s the outline of what we’re going to do:

  • Acquire some typical word embeddings to represent the meanings of words
  • Acquire training and test data, with gold-standard examples of positive and negative words
  • Train a classifier, using gradient descent, to recognize other positive and negative words based on their word embeddings
  • Compute sentiment scores for sentences of text using this classifier
  • Behold the monstrosity that we have created

And at that point we will have shown “how to make a racist AI without really trying”. Of course that would be a terrible place to leave it, so afterward, we’re going to:

  • Measure the problem statistically, so we can recognize if we’re solving it
  • Improve the data to obtain a semantic model that’s more accurate and less racist

Software dependencies

This tutorial is written in Python, and relies on a typical Python machine-learning stack: numpy and scipy for numerical computing, pandas for managing our data, and scikit-learn for machine learning. Later on we’ll graph some things with matplotlib and seaborn.

You could also replace scikit-learn with TensorFlow or Keras or something like that, as they can also train classifiers using gradient descent. But there’s no need for the deep-learning abstractions they provide, as it only takes a single layer of machine learning to solve this problem.

In [1]:
import numpy as np
import pandas as pd
import matplotlib
import seaborn
import re
import statsmodels.formula.api

from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
In [2]:
# Configure how graphs will show up in this notebook
%matplotlib inline
seaborn.set_context('notebook', rc={'figure.figsize': (10, 6)}, font_scale=1.5)

Step 1: Word embeddings

Word embeddings are frequently used to represent words as inputs to machine learning. The words become vectors in a multi-dimensional space, where nearby vectors represent similar meanings. With word embeddings, you can compare words by (roughly) what they mean, not just exact string matches.
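
To make “nearby” concrete, here is a minimal sketch using cosine similarity, a common way to compare embedding vectors. The three-dimensional vectors are made up for illustration; the real GloVe vectors used below have 300 dimensions.

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors; 1.0 means the same direction."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Hypothetical toy embeddings, invented for this example
toy = {
    'good':  np.array([0.9, 0.1, 0.0]),
    'great': np.array([0.8, 0.2, 0.1]),
    'table': np.array([0.0, 0.1, 0.9]),
}

sim_close = cosine_similarity(toy['good'], toy['great'])
sim_far = cosine_similarity(toy['good'], toy['table'])

# Words with similar meanings should score higher than unrelated ones
print(sim_close > sim_far)
```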

Successfully training word vectors requires starting from hundreds of gigabytes of input text. Fortunately, various machine-learning groups have already done this and provided pre-trained word embeddings that we can download.

Two very well-known datasets of pre-trained English word embeddings are word2vec, pretrained on Google News data, and GloVe, pretrained on the Common Crawl of web pages. We would get similar results for either one, but here we’ll use GloVe because its source of data is more transparent.

GloVe comes in three sizes: 6B, 42B, and 840B. The 840B size is powerful, but requires significant post-processing to use it in a way that’s an improvement over 42B. The 42B version is pretty good and is also neatly trimmed to a vocabulary of about 1.9 million words. Because we’re following the path of least resistance, we’ll just use the 42B version.

Why does it matter that the word embeddings are “well-known”?

I’m glad you asked, hypothetical questioner! We’re trying to do something extremely typical at each step, and for some reason, comparison-shopping for better word embeddings isn’t typical yet. Read on, and I hope you’ll come out of this tutorial with the desire to use modern, high-quality word embeddings, especially those that are aware of algorithmic bias and try to mitigate it. But that’s getting ahead of things.

We download from the GloVe web page, and extract it into data/glove.42B.300d.txt. Next we define a function to read the simple format of its word vectors.

In [3]:
def load_embeddings(filename):
    """
    Load a DataFrame from the generalized text format used by word2vec, GloVe,
    fastText, and ConceptNet Numberbatch. The main point where they differ is
    whether there is an initial line with the dimensions of the matrix.
    """
    labels = []
    rows = []
    with open(filename, encoding='utf-8') as infile:
        for i, line in enumerate(infile):
            items = line.rstrip().split(' ')
            if len(items) == 2:
                # This is a header row giving the shape of the matrix
                continue
            # The first item is the word; the rest are its vector components
            labels.append(items[0])
            values = np.array([float(x) for x in items[1:]], 'f')
            rows.append(values)

    arr = np.vstack(rows)
    return pd.DataFrame(arr, index=labels, dtype='f')

embeddings = load_embeddings('data/glove.42B.300d.txt')
embeddings.shape
(1917494, 300)

Step 2: A gold-standard sentiment lexicon

We need some input about which words are positive and which words are negative. There are many sentiment lexicons you could use, but we’re going to go with a very straightforward lexicon (Hu and Liu, 2004), the same one used by the Deep Averaging Networks paper.

We download the lexicon from Bing Liu’s web site and extract it into data/positive-words.txt and data/negative-words.txt.

Next we define how to read these files, and read them in as the pos_words and neg_words variables:

In [4]:
def load_lexicon(filename):
    """
    Load a file from Bing Liu's sentiment lexicon, containing
    English words in Latin-1 encoding.

    One file contains a list of positive words, and the other contains
    a list of negative words. The files contain comment lines starting
    with ';' and blank lines, which should be skipped.
    """
    lexicon = []
    with open(filename, encoding='latin-1') as infile:
        for line in infile:
            line = line.rstrip()
            if line and not line.startswith(';'):
                lexicon.append(line)
    return lexicon

pos_words = load_lexicon('data/positive-words.txt')
neg_words = load_lexicon('data/negative-words.txt')

Step 3: Train a model to predict word sentiments

Our data points here are the embeddings of these positive and negative words. We use the Pandas .loc[] operation to look up the embeddings of all the words.

Some of these words are not in the GloVe vocabulary, particularly the misspellings such as “fancinating”. Those words end up with rows full of NaN to indicate their missing embeddings, so we use .dropna() to remove them.

In [5]:
pos_vectors = embeddings.loc[pos_words].dropna()
neg_vectors = embeddings.loc[neg_words].dropna()
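
A caveat for readers on newer library versions: since pandas 1.0, .loc[] raises a KeyError when any label in the list is missing, instead of returning rows of NaN. A minimal sketch of the same lookup using .reindex(), which keeps the NaN-filling behavior, on a made-up two-word vocabulary:

```python
import pandas as pd

# Hypothetical miniature embedding table, invented for illustration
toy_embeddings = pd.DataFrame(
    [[0.1, 0.2], [0.3, 0.4]],
    index=['good', 'bad'],
)

# 'fancinating' is not in the vocabulary, so reindex() gives it a row of NaN,
# which dropna() then removes
found = toy_embeddings.reindex(['good', 'fancinating']).dropna()
print(list(found.index))
```

With the real data, the equivalent would be embeddings.reindex(pos_words).dropna(), assuming the embedding index has no duplicate labels.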

Now we make arrays of the desired inputs and outputs. The inputs are the embeddings, and the outputs are 1 for positive words and -1 for negative words. We also make sure to keep track of the words they’re labeled with, so we can interpret the results.

In [6]:
vectors = pd.concat([pos_vectors, neg_vectors])
targets = np.array([1 for entry in pos_vectors.index] + [-1 for entry in neg_vectors.index])
labels = list(pos_vectors.index) + list(neg_vectors.index)

Hold on. Some words are neither positive nor negative, they’re neutral. Shouldn’t there be a third class for neutral words?

I think that having examples of neutral words would be quite beneficial, especially because the problems we’re going to see come from assigning sentiment to words that shouldn’t have sentiment. If we could reliably identify when words should be neutral, it would be worth the slight extra complexity of a 3-class classifier. It requires finding a source of examples of neutral words, because Liu’s data only lists positive and negative words.

So I tried a version of this notebook where I put in 800 examples of neutral words, and put a strong weight on predicting words to be neutral. But the end results were not much different from what you’re about to see.

How is this list drawing the line between positive and negative anyway? Doesn’t that depend on context?

Good question. Domain-general sentiment analysis isn’t as straightforward as it sounds. The decision boundary we’re trying to find is fairly arbitrary in places. In this list, “audacious” is marked as “bad” while “ambitious” is “good”. “Comical” is bad, “humorous” is good. “Refund” is good, even though it’s typically in bad situations that you have to request one or pay one.

I think everyone knows that sentiment requires context, but when implementing an easy approach to sentiment analysis, you just have to kind of hope that you can ignore context and the sentiments will average out to the right trend.

Using the scikit-learn train_test_split function, we simultaneously separate the input vectors, output values, and labels into training and test data, with 10% of the data used for testing.

In [7]:
train_vectors, test_vectors, train_targets, test_targets, train_labels, test_labels = \
    train_test_split(vectors, targets, labels, test_size=0.1, random_state=0)

Now we make our classifier, and train it by running the training vectors through it for 100 iterations. We use a logistic function as the loss, so that the resulting classifier can output the probability that a word is positive or negative.

In [8]:
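# Note: n_iter and loss='log' are the parameter spellings of the scikit-learn
# version used here; newer releases call them max_iter and loss='log_loss'.
model = SGDClassifier(loss='log', random_state=0, n_iter=100)
model.fit(train_vectors, train_targets)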
SGDClassifier(alpha=0.0001, average=False, class_weight=None, epsilon=0.1,
       eta0=0.0, fit_intercept=True, l1_ratio=0.15,
       learning_rate='optimal', loss='log', n_iter=100, n_jobs=1,
       penalty='l2', power_t=0.5, random_state=0, shuffle=True, verbose=0,
       warm_start=False)

We evaluate the classifier on the test vectors. It predicts the correct sentiment for sentiment words outside of its training data 95% of the time. Not bad.

In [9]:
accuracy_score(model.predict(test_vectors), test_targets)

Let’s define a function that we can use to see the sentiment that this classifier predicts for particular words, then use it to see some examples of its predictions on the test data.

In [10]:
def vecs_to_sentiment(vecs):
    # predict_log_proba gives the log probability for each class
    predictions = model.predict_log_proba(vecs)

    # To see an overall positive vs. negative classification in one number,
    # we take the log probability of positive sentiment minus the log
    # probability of negative sentiment.
    return predictions[:, 1] - predictions[:, 0]

def words_to_sentiment(words):
    vecs = embeddings.loc[words].dropna()
    log_odds = vecs_to_sentiment(vecs)
    return pd.DataFrame({'sentiment': log_odds}, index=vecs.index)

# Show 20 examples from the test set
words_to_sentiment(test_labels).iloc[:20]
fidget -9.931679
interrupt -9.634706
staunchly 1.466919
imaginary -2.989215
taxing 0.468522
world-famous 6.908561
low-cost 9.237223
disapointment -8.737182
totalitarian -10.851580
bellicose -8.328674
freezes -8.456981
sin -7.839670
fragile -4.018289
fooled -4.309344
undecided -2.816172
handily 2.339609
demonizes -2.102152
easygoing 8.747150
unpopular -7.887475
commiserate 1.790899

More than the accuracy number, this convinces us that the classifier is working. We can see that the classifier has learned to generalize sentiment to words outside of its training data.
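
The scores in the table above are log odds: the log probability of “positive” minus the log probability of “negative”. Because those two probabilities sum to 1, a score s corresponds to a positive-class probability of 1 / (1 + e^(-s)). A small sketch of that conversion (the helper name is hypothetical, not part of the notebook):

```python
import math

def log_odds_to_probability(log_odds):
    """Convert log(p / (1 - p)) back into p, the probability of 'positive'."""
    return 1.0 / (1.0 + math.exp(-log_odds))

# A score of 0 means the classifier is undecided
print(log_odds_to_probability(0.0))

# Two scores taken from the table above
print(log_odds_to_probability(6.908561))   # world-famous
print(log_odds_to_probability(-2.816172))  # undecided
```

So a score near +7 or -7 means the classifier is almost certain, while scores between about -1 and 1 are closer to a coin flip.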

Step 4: Get a sentiment score for text

There are many ways to combine sentiments for word vectors into an overall sentiment score. Again, because we’re following the path of least resistance, we’re just going to average them.

In [11]:
import re
TOKEN_RE = re.compile(r"\w.*?\b")
# The regex above finds tokens that start with a word-like character (\w), and continues
# matching characters (.*?) until the next word break (\b). It's a relatively simple
# expression that manages to extract something very much like words from text.

def text_to_sentiment(text):
    tokens = [token.casefold() for token in TOKEN_RE.findall(text)]
    sentiments = words_to_sentiment(tokens)
    return sentiments['sentiment'].mean()
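
To see what this tokenizer actually produces, here is a small standalone check of the same regular expression. The tokenize helper is only for this illustration. Note how the apostrophe splits “Let's” into two tokens.

```python
import re

TOKEN_RE = re.compile(r"\w.*?\b")

def tokenize(text):
    """Case-fold every token that the regex finds."""
    return [token.casefold() for token in TOKEN_RE.findall(text)]

print(tokenize("Let's go get Italian food"))
# ['let', 's', 'go', 'get', 'italian', 'food']
```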

There are many things we could have done better:

  • Weight words by their inverse frequency, so that words like “the” and “I” don’t cause big changes in sentiment
  • Adjust the averaging so that short sentences don’t end up with the most extreme sentiment values
  • Take phrases into account
  • Use a more robust word-segmentation algorithm that isn’t confused by apostrophes
  • Account for negations such as “not happy”
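
As an illustration of the first item, here is one possible sketch of an inverse-frequency weighted average. The word scores and frequencies are invented placeholders, and the weighting formula a / (a + frequency) is just one common smoothing choice, not something this notebook uses.

```python
def weighted_sentiment(tokens, word_sentiment, word_frequency, a=1e-3):
    """
    Average per-word sentiment, down-weighting frequent words.

    Each word gets the weight a / (a + frequency), so very common words
    contribute little. Words without a sentiment score are skipped.
    """
    total = 0.0
    weight_sum = 0.0
    for token in tokens:
        if token in word_sentiment:
            weight = a / (a + word_frequency.get(token, 0.0))
            total += weight * word_sentiment[token]
            weight_sum += weight
    return total / weight_sum if weight_sum else 0.0

# Hypothetical scores and relative frequencies, made up for this example
toy_sentiment = {'the': 1.0, 'food': 0.5, 'awful': -8.0}
toy_frequency = {'the': 0.05, 'food': 0.0005, 'awful': 0.00001}

tokens = ['the', 'food', 'awful']
plain = sum(toy_sentiment[t] for t in tokens) / len(tokens)
weighted = weighted_sentiment(tokens, toy_sentiment, toy_frequency)

# The rare, strongly negative word dominates once "the" is down-weighted
print(plain, weighted)
```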

But all of those would require extra code and wouldn’t fundamentally change the results we’re about to see. At least now we can roughly compare the relative positivity of different sentences:

In [12]:
text_to_sentiment("this example is pretty cool")
In [13]:
text_to_sentiment("this example is okay")
In [14]:
text_to_sentiment("meh, this example sucks")

Step 5: Behold the monstrosity that we have created

Not every sentence is going to contain obvious sentiment words. Let’s see what it does with a few variations on a neutral sentence:

In [15]:
text_to_sentiment("Let's go get Italian food")
In [16]:
text_to_sentiment("Let's go get Chinese food")
In [17]:
text_to_sentiment("Let's go get Mexican food")

This is analogous to what I saw when I experimented with analyzing restaurant reviews using word embeddings, and found out that all the Mexican restaurants were ending up with lower sentiment for no good reason.

Word vectors are capable of representing subtle distinctions of meaning just by reading words in context. So they’re also capable of representing less-subtle things like the biases of our society.

Here are some other neutral statements:

In [18]:
text_to_sentiment("My name is Emily")
In [19]:
text_to_sentiment("My name is Heather")
In [20]:
text_to_sentiment("My name is Yvette")
In [21]:
text_to_sentiment("My name is Shaniqua")

Well, dang.

The system has associated wildly different sentiments with people’s names. You can look at these examples and many others and see that the sentiment is generally more positive for stereotypically-white names, and more negative for stereotypically-black names.

This is the test that Caliskan, Bryson, and Narayanan used to conclude that semantics derived automatically from language corpora contain human-like biases, a paper published in Science in April 2017, and we’ll be using more of it shortly.
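
For reference, here is a rough sketch of the effect-size statistic from that paper’s Word Embedding Association Test, using toy two-dimensional vectors invented for illustration. The function names are hypothetical, and details such as using the sample standard deviation are assumptions rather than a faithful reimplementation.

```python
import numpy as np

def cosine(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

def association(w, attrs_a, attrs_b):
    """How much closer w is to attribute set A than to attribute set B."""
    return (np.mean([cosine(w, a) for a in attrs_a])
            - np.mean([cosine(w, b) for b in attrs_b]))

def weat_effect_size(targets_x, targets_y, attrs_a, attrs_b):
    """
    Difference in mean association between the two target sets, divided by
    the standard deviation of the association over all target words.
    """
    assoc_x = [association(x, attrs_a, attrs_b) for x in targets_x]
    assoc_y = [association(y, attrs_a, attrs_b) for y in targets_y]
    return float((np.mean(assoc_x) - np.mean(assoc_y))
                 / np.std(assoc_x + assoc_y, ddof=1))

# Toy vectors: attribute sets ("pleasant" / "unpleasant") and two sets of names
pleasant = [np.array([1.0, 0.0]), np.array([0.9, 0.1])]
unpleasant = [np.array([0.0, 1.0]), np.array([0.1, 0.9])]
names_x = [np.array([0.8, 0.2]), np.array([0.7, 0.3])]
names_y = [np.array([0.2, 0.8]), np.array([0.3, 0.7])]

effect = weat_effect_size(names_x, names_y, pleasant, unpleasant)
print(effect)
```

A positive value means the first set of names sits closer to the “pleasant” words than the second set does; a value near zero would mean no measurable difference.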

Step 6: Measure the problem

We want to learn how to not make something like this again. So let’s put more data through it, and statistically measure how bad its bias is.

Here we have four lists of names that tend to reflect different ethnic backgrounds, mostly from a United States perspective. The first two are lists of predominantly “white” and “black” names adapted from Caliskan et al.’s article. I also added typically Hispanic names, as well as Muslim names that come from Arabic or Urdu; these are two more distinct groupings of given names that tend to represent your background.

This data is currently used as a bias-check in the ConceptNet build process, and can be found in the conceptnet5.vectors.evaluation.bias module. I’m interested in expanding this to more ethnic backgrounds, which may require looking at surnames and not just given names.

Here are the lists:

In [22]:
NAMES_BY_ETHNICITY = {
    # The first two lists are from the Caliskan et al. appendix describing the
    # Word Embedding Association Test.
    'White': [
        'Adam', 'Chip', 'Harry', 'Josh', 'Roger', 'Alan', 'Frank', 'Ian', 'Justin',
        'Ryan', 'Andrew', 'Fred', 'Jack', 'Matthew', 'Stephen', 'Brad', 'Greg', 'Jed',
        'Paul', 'Todd', 'Brandon', 'Hank', 'Jonathan', 'Peter', 'Wilbur', 'Amanda',
        'Courtney', 'Heather', 'Melanie', 'Sara', 'Amber', 'Crystal', 'Katie',
        'Meredith', 'Shannon', 'Betsy', 'Donna', 'Kristin', 'Nancy', 'Stephanie',
        'Bobbie-Sue', 'Ellen', 'Lauren', 'Peggy', 'Sue-Ellen', 'Colleen', 'Emily',
        'Megan', 'Rachel', 'Wendy'
    ],

    'Black': [
        'Alonzo', 'Jamel', 'Lerone', 'Percell', 'Theo', 'Alphonse', 'Jerome',
        'Leroy', 'Rasaan', 'Torrance', 'Darnell', 'Lamar', 'Lionel', 'Rashaun',
        'Tyree', 'Deion', 'Lamont', 'Malik', 'Terrence', 'Tyrone', 'Everol',
        'Lavon', 'Marcellus', 'Terryl', 'Wardell', 'Aiesha', 'Lashelle', 'Nichelle',
        'Shereen', 'Temeka', 'Ebony', 'Latisha', 'Shaniqua', 'Tameisha', 'Teretha',
        'Jasmine', 'Latonya', 'Shanise', 'Tanisha', 'Tia', 'Lakisha', 'Latoya',
        'Sharise', 'Tashika', 'Yolanda', 'Lashandra', 'Malika', 'Shavonn',
        'Tawanda', 'Yvette'
    ],

    # This list comes from statistics about common Hispanic-origin names in the US.
    'Hispanic': [
        'Juan', 'José', 'Miguel', 'Luís', 'Jorge', 'Santiago', 'Matías', 'Sebastián',
        'Mateo', 'Nicolás', 'Alejandro', 'Samuel', 'Diego', 'Daniel', 'Tomás',
        'Juana', 'Ana', 'Luisa', 'María', 'Elena', 'Sofía', 'Isabella', 'Valentina',
        'Camila', 'Valeria', 'Ximena', 'Luciana', 'Mariana', 'Victoria', 'Martina'
    ],

    # The following list conflates religion and ethnicity, I'm aware. So do given names.
    # This list was cobbled together from searching baby-name sites for common Muslim names,
    # as spelled in English. I did not ultimately distinguish whether the origin of the name
    # is Arabic or Urdu or another language.
    # I'd be happy to replace it with something more authoritative, given a source.
    'Arab/Muslim': [
        'Mohammed', 'Omar', 'Ahmed', 'Ali', 'Youssef', 'Abdullah', 'Yasin', 'Hamza',
        'Ayaan', 'Syed', 'Rishaan', 'Samar', 'Ahmad', 'Zikri', 'Rayyan', 'Mariam',
        'Jana', 'Malak', 'Salma', 'Nour', 'Lian', 'Fatima', 'Ayesha', 'Zahra', 'Sana',
        'Zara', 'Alya', 'Shaista', 'Zoya', 'Yasmin'
    ],
}

Now we’ll use Pandas to make a table of these names, their predominant ethnic background, and the sentiment score we get for them:

In [23]:
def name_sentiment_table():
    frames = []
    for group, name_list in sorted(NAMES_BY_ETHNICITY.items()):
        lower_names = [name.lower() for name in name_list]
        sentiments = words_to_sentiment(lower_names)
        sentiments['group'] = group
        frames.append(sentiments)

    # Put together the data we got from each ethnic group into one big table
    return pd.concat(frames)

name_sentiments = name_sentiment_table()

A sample of the data:

In [24]:
name_sentiments.iloc[::25]
            sentiment        group
mohammed     0.834974  Arab/Muslim
alya         3.916803  Arab/Muslim
terryl      -2.858010  Black
josé         0.432956  Hispanic
luciana      1.086073  Hispanic
hank         0.391858  White
megan        2.158679  White

Now we can visualize the distribution of sentiment we get for each kind of name:

In [25]:
plot = seaborn.swarmplot(x='group', y='sentiment', data=name_sentiments)
plot.set_ylim([-10, 10])
(-10, 10)

We can see that as a bar-plot, too, showing the 95% confidence intervals of the means.

In [26]:
plot = seaborn.barplot(x='group', y='sentiment', data=name_sentiments, capsize=.1)

And finally we can break out the serious statistical machinery, using the statsmodels package, to tell us how big of an effect this is (along with a bunch of other statistics).

In [27]:
ols_model = statsmodels.formula.api.ols('sentiment ~ group', data=name_sentiments).fit()
ols_model.summary().tables[0]
OLS Regression Results
Dep. Variable:      sentiment          R-squared:            0.208
Model:              OLS                Adj. R-squared:       0.192
Method:             Least Squares      F-statistic:          13.04
Date:               Thu, 13 Jul 2017   Prob (F-statistic):   1.31e-07
Time:               11:31:17           Log-Likelihood:       -356.78
No. Observations:   153                AIC:                  721.6
Df Residuals:       149                BIC:                  733.7
Df Model:           3
Covariance Type:    nonrobust

The F-statistic is the ratio of the variation between groups to the variation within groups, which we can take as a measure of overall ethnic bias.

The probability, right below that, is the probability that we would see this high of an F-statistic given the null hypothesis: that is, given data where there was no difference between ethnicities. The probability is very, very low. If this were a paper, we’d get to call the result “highly statistically significant”.
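
To show where that number comes from, here is a minimal sketch of the one-way F-statistic computed by hand with numpy, on made-up toy groups. In the table above, the 4 groups and 153 names give the degrees of freedom 3 (groups minus one) and 149 (names minus groups).

```python
import numpy as np

def f_statistic(groups):
    """One-way ANOVA F: between-group variance over within-group variance."""
    groups = [np.asarray(g, dtype=float) for g in groups]
    all_values = np.concatenate(groups)
    grand_mean = all_values.mean()
    k, n = len(groups), len(all_values)

    # Variation of the group means around the grand mean
    ss_between = sum(len(g) * (g.mean() - grand_mean) ** 2 for g in groups)
    # Variation of each value around its own group mean
    ss_within = sum(((g - g.mean()) ** 2).sum() for g in groups)

    return float((ss_between / (k - 1)) / (ss_within / (n - k)))

# Toy data, invented for illustration: three groups of three scores each
toy_groups = [[1, 2, 3], [2, 3, 4], [5, 6, 7]]
print(f_statistic(toy_groups))
```

The p-value is then the upper tail of the F-distribution with those two degrees of freedom, evaluated at this statistic.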

Out of all these numbers, the F-value is the one we really want to improve. A lower F-value is better.

In [28]:
ols_model.fvalue

Step 7: Trying different data

Now that we have the ability to measure prejudicial badness in our word vectors, let’s try to improve it. To do so, we’ll want to repeat a bunch of things that so far we just ran as individual steps in this Python notebook.

If I were writing good, maintainable code, I wouldn’t have been using global variables like model and embeddings. But writing ad-hoc spaghetti research code let us look at what we were doing at every step and learn from it, so there’s something to be said for that. Let’s re-use what we can, and at least define a function for redoing some of these steps:

In [29]:
def retrain_model(new_embs):
    """
    Repeat the steps above with a new set of word embeddings.
    """
    global model, embeddings, name_sentiments
    embeddings = new_embs
    pos_vectors = embeddings.loc[pos_words].dropna()
    neg_vectors = embeddings.loc[neg_words].dropna()
    vectors = pd.concat([pos_vectors, neg_vectors])
    targets = np.array([1 for entry in pos_vectors.index] + [-1 for entry in neg_vectors.index])
    labels = list(pos_vectors.index) + list(neg_vectors.index)

    train_vectors, test_vectors, train_targets, test_targets, train_labels, test_labels = \
        train_test_split(vectors, targets, labels, test_size=0.1, random_state=0)
    model = SGDClassifier(loss='log', random_state=0, n_iter=100)
    model.fit(train_vectors, train_targets)
    accuracy = accuracy_score(model.predict(test_vectors), test_targets)
    print("Accuracy of sentiment: {:.2%}".format(accuracy))
    name_sentiments = name_sentiment_table()
    ols_model = statsmodels.formula.api.ols('sentiment ~ group', data=name_sentiments).fit()
    print("F-value of bias: {:.3f}".format(ols_model.fvalue))
    print("Probability given null hypothesis: {:.3}".format(ols_model.f_pvalue))
    # Show the results on a swarm plot, with a consistent Y-axis
    plot = seaborn.swarmplot(x='group', y='sentiment', data=name_sentiments)
    plot.set_ylim([-10, 10])

Trying word2vec

You may think this is a problem that only GloVe has. If the system weren’t trained on all of the Common Crawl (which contains lots of unsavory sites and like 20 copies of Urban Dictionary), maybe it wouldn’t have gone bad. What about good old word2vec, trained on Google News?

The most authoritative source for the word2vec data seems to be this file on Google Drive. Download it and save it as data/word2vec-googlenews-300.bin.gz.

In [30]:
# Use a ConceptNet function to load word2vec into a Pandas frame from its binary format
from conceptnet5.vectors.formats import load_word2vec_bin
w2v = load_word2vec_bin('data/word2vec-googlenews-300.bin.gz', nrows=2000000)

# word2vec is case-sensitive, so case-fold its labels
w2v.index = [label.casefold() for label in w2v.index]

# Now we have duplicate labels, so drop the later (lower-frequency) occurrences of the same label
w2v = w2v.reset_index().drop_duplicates(subset='index', keep='first').set_index('index')

retrain_model(w2v)
Accuracy of sentiment: 94.30%
F-value of bias: 15.573
Probability given null hypothesis: 7.43e-09

So: word2vec is even worse. With an F-value over 15, it has even larger differences in sentiment between groups.

In retrospect, expecting news to be safe from algorithmic bias was rather a lot to hope for.
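
A side note on the loading code above: case-folding the labels can make two rows share one label, which is why the duplicates are dropped. Here is a small sketch of that idiom on a made-up three-row table, alongside an equivalent shorter spelling using Index.duplicated.

```python
import pandas as pd

# Hypothetical miniature case-sensitive table; earlier rows stand for
# more frequent words, as in the word2vec file
toy = pd.DataFrame([[1.0], [2.0], [3.0]], index=['Apple', 'apple', 'pear'])
toy.index = [label.casefold() for label in toy.index]

# Same idiom as above: keep the first (higher-frequency) row for each label
deduped = (
    toy.reset_index()
       .drop_duplicates(subset='index', keep='first')
       .set_index('index')
)

# An equivalent spelling that filters on the index directly
deduped_alt = toy[~toy.index.duplicated(keep='first')]

print(list(deduped.index), list(deduped_alt.index))
```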

Trying ConceptNet Numberbatch

Now I can finally get to discussing my own word-embedding project.

ConceptNet, the knowledge graph I work on with word-embedding features built in, has a training step that adjusts the embeddings to identify and remove some sources of algorithmic racism and sexism. This step is based on Bolukbasi et al.’s “Debiasing Word Embeddings”, and generalized to address multiple forms of prejudice at once. As far as I know, we’re the only semantic system that has anything of the sort built in.
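
To give a flavor of what such a step involves, here is a heavily simplified sketch of the core idea of projecting a “bias direction” out of word vectors. This is not ConceptNet’s actual implementation: the function name, the toy vectors, and the direction are all hypothetical, and the published method involves more than this single projection (for example, choosing which words to adjust).

```python
import numpy as np

def remove_direction(vectors, direction):
    """Subtract each row's projection onto `direction`, then re-normalize rows."""
    unit = direction / np.linalg.norm(direction)
    projected = vectors - np.outer(vectors @ unit, unit)
    norms = np.linalg.norm(projected, axis=1, keepdims=True)
    return projected / norms

# Toy 3-D vectors, invented for illustration
toy_vectors = np.array([[0.5, 0.5, 0.7], [0.9, -0.3, 0.3]])

# Hypothetical bias direction; in practice it would be estimated from data
bias_direction = np.array([0.0, 1.0, 0.0])

cleaned = remove_direction(toy_vectors, bias_direction)

# After the projection, no vector has any component along the bias direction
print(cleaned @ bias_direction)
```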

From time to time, we export pre-computed vectors from ConceptNet, a release we give the name ConceptNet Numberbatch. The April 2017 release was the first to include this de-biasing step, so let’s load its English vectors and retrain our sentiment model with them.

Download numberbatch-en-17.04b.txt.gz, save it in the data/ directory, and retrain the model:

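In [31]:
# Assumes the downloaded .gz file has been decompressed, because
# load_embeddings reads plain text
retrain_model(load_embeddings('data/numberbatch-en-17.04b.txt'))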
Accuracy of sentiment: 97.46%
F-value of bias: 3.805
Probability given null hypothesis: 0.0118

So have we entirely fixed the problem by switching to ConceptNet Numberbatch? Can we stop worrying about algorithmic racism? No.

Have we made the problem a lot smaller? Definitely.

The ranges of sentiments overlap a lot more than they did in the word vectors that came directly from GloVe or word2vec. The F-value is less than a third of what it was for GloVe, and a quarter of what it was for word2vec. And in general, we see much smaller differences in sentiment that come from comparing different given names, which is what we’d hope for, because names really shouldn’t matter to the task of sentiment analysis.

But there is still a small correlation. Maybe I could have picked some data or training parameters that made the problem look completely solved. That would have been a bad move, because the problem isn’t completely solved. There are more causes of algorithmic racism than the ones we have identified and compensated for in ConceptNet. But this is a good start.

There is no trade-off

Note that the accuracy of sentiment prediction went up when we switched to ConceptNet Numberbatch.

Some people expect that fighting algorithmic racism is going to come with some sort of trade-off. There’s no trade-off here. You can have data that’s better and less racist. You can have data that’s better because it’s less racist. There was never anything “accurate” about the overt racism that word2vec and GloVe learned.

Other approaches

This is of course only one way to do sentiment analysis. All the steps we used are common, but you probably object that you wouldn’t do it that way. But if you have your own process, I urge you to see if your process is encoding prejudices and biases in the model it learns.

Instead of or in addition to changing your source of word vectors, you could try to fix this problem in the output directly. It may help, for example, to build a stronger model of whether sentiment should be assigned to words at all, designed to specifically exclude names and groups of people.

You could abandon the idea of inferring sentiment for words, and only count the sentiment of words that appear exactly in the list. This is perhaps the most common form of sentiment analysis — the kind that includes no machine learning at all. Its results will be no more biased than whoever made the list. But the lack of machine learning means that this approach has low recall, and the only way to adapt it to your data set is to edit the list manually.
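
A minimal sketch of that list-only approach, assuming a simple count of exact matches. The function name and the tiny lexicon are hypothetical; in this tutorial the word sets would come from pos_words and neg_words.

```python
import re

def lexicon_sentiment(text, positive_words, negative_words):
    """Count exact lexicon matches: +1 per positive word, -1 per negative word."""
    tokens = [token.casefold() for token in re.findall(r"\w+", text)]
    score = 0
    for token in tokens:
        if token in positive_words:
            score += 1
        elif token in negative_words:
            score -= 1
    return score

# Tiny hypothetical lexicon, invented for this example
toy_positive = {'good', 'cool'}
toy_negative = {'sucks', 'bad'}

print(lexicon_sentiment("meh, this example sucks", toy_positive, toy_negative))
print(lexicon_sentiment("My name is Shaniqua", toy_positive, toy_negative))
```

Names and nationalities get a score of exactly zero here unless someone puts them on a list, which is the appeal. The cost is that any sentiment word missing from the list is also scored as zero.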

As a hybrid approach, you could produce a large number of inferred sentiments for words, and have a human annotator patiently look through them, making a list of exceptions whose sentiment should be set to 0. The downside of this is that it’s extra work; the upside is that you take the time to actually see what your data is doing. And that’s something that I think should happen more often in machine learning anyway.
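
The mechanical part of that hybrid approach is small. Here is a sketch, with a hypothetical helper name and a hand-picked exception set, using three scores that appeared earlier in this post:

```python
def apply_exceptions(inferred_sentiment, neutral_exceptions):
    """Return a copy of the word-to-score mapping with reviewed exceptions set to 0."""
    return {
        word: (0.0 if word in neutral_exceptions else score)
        for word, score in inferred_sentiment.items()
    }

# Scores as inferred by the GloVe-based classifier earlier in the post
inferred = {'easygoing': 8.747150, 'totalitarian': -10.851580, 'terryl': -2.858010}

# Hypothetical output of a human review: words that should carry no sentiment
reviewed_neutral = {'terryl'}

reviewed = apply_exceptions(inferred, reviewed_neutral)
print(reviewed)
```

The hard part is the review itself, which no amount of code replaces.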

ConceptNet 5.5.5 update

ConceptNet 5.5.5 is out, and it’s running on The version5.5 tag in Git has been updated to point to this version. Here’s what’s new.


Data changes:

  • Uses ConceptNet Numberbatch 17.06, which incorporates de-biasing to avoid harmful stereotypes being encoded in its word representations.
  • Fixed a glitch in retrofitting, where terms in ConceptNet that were two steps removed from any term that existed in one of the existing word-embedding data sources were all being assigned the same meaningless vector. They now get vectors that are propagated (after multiple steps) from terms that do have existing word embeddings, as intended.
  • Filtered some harmful assertions that came from disruptive or confused Open Mind Common Sense contributors. (Some of them had been filtered before, but changes to the term representation had defeated the filters.)
  • Added a new source of input word embeddings, created at Luminoso by running a multilingual variant of fastText over OpenSubtitles 2016. This provides a source of real-world usage of non-English words.

Build process changes:

  • We measured the amount of RAM the build process requires at its peak to be 30 GB, and tested that it completes on a machine with 32 GB of RAM. We updated the Snakefile to reflect these requirements and to use them to better plan which tasks to run in parallel.
  • The build process starts by checking for some requirements (having enough RAM, enough disk space, and a usable PostgreSQL database), and exits early if they aren’t met, instead of crashing many hours later.
  • The tests have been organized into tests that can be run before building ConceptNet, tests that can be run after a small example build, and tests that require the full ConceptNet. The first two kinds of tests are run automatically, in the right sequence, by the script.
  • and have been moved into the top-level directory, where they are more visible.

Library changes:

  • Uses the marisa-trie library to speed up inferring vectors for out-of-vocabulary words.
  • Uses the annoy library to suggest nearest neighbors that map a larger vocabulary into a smaller one.
  • Depends on a specific version of xmltodict, because a breaking change to xmltodict managed to break the build process of many previous versions of ConceptNet.
  • The cn5-vectors evaluate command can evaluate whether a word vector space contains gender biases or ethnic biases.

Understanding our version numbers

Version numbers in modern software are typically described as major.minor.micro. ConceptNet’s version numbers would be better described as mega.major.minor. Now that all the version components happen to be 5, I’ll explain what they mean to me.

The change from 5.5.4 to 5.5.5 is a “minor” change. It involves important fixes to the data, but these fixes don’t affect a large number of edges or significantly change the vocabulary. If you are building research on ConceptNet and require stable results, we suggest building a particular version (such as 5.5.4 or 5.5.5) from its Docker container, as a “minor” change could cause inconsistent results.

The change from 5.4 to 5.5 was a “major” change. We changed the API format somewhat (hopefully with a smooth transition), we made significant changes to ConceptNet’s vocabulary of terms, we added new data sources, and we even changed the domain name where it is hosted. We’re working on another “major” update, version 5.6, that incorporates new data sources again, though I believe the changes will not be as sweeping as the 5.5 update.

The change from ConceptNet 4 to ConceptNet 5 (six years ago) was a “mega” change, a thorough rethinking and redesign of the project, keeping things that worked and discarding things that didn’t, which is not well described by software versions. The appropriate way to represent it in Semantic Versioning would probably be to start a new project with a different name.

Don’t worry, I have no urge to make a ConceptNet 6 anytime soon. ConceptNet 5 is doing great.

The word vectors that ConceptNet uses in its relatedness API (which are also distributed separately as ConceptNet Numberbatch) are recalculated for every version, even minor versions. The results you get from updating to new vectors should get steadily more accurate, unless your results depended on the ability to represent harmful stereotypes.

You can’t mix old and new vectors, so any machine-learning model needs to be rebuilt to use new vectors. This is why we gave ConceptNet Numberbatch a version numbering scheme that is entirely based on the date (vectors computed in June 2017 are version 17.06).

Bugfix: our English-only word vectors contained the wrong data

If you have used the ConceptNet Numberbatch 17.04 word vectors, it turns out that you got very different results if you downloaded the English-only vectors versus if you used the multilingual, language-tagged vectors.

I decided to make this downloadable file of English-only vectors as a convenience, because it would be the format that looked most like a drop-in replacement for word2vec’s data. But the English-only format is not a format that we use anywhere. We test our vectors, but we don’t test reimporting them from all the files we exported, which let a bug in the export go unnoticed.

The English-only vectors ended up labeling the rows with the wrong English words, unfortunately, making the data they contained meaningless. If you use the multilingual version, it was and still is fine.
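A round-trip check of the kind that would catch a mislabeled export might look like the sketch below. It assumes a word2vec-style text format (a header line with the row count and dimension, then one term and its values per line) and terms that contain no whitespace. The function names are illustrative and are not part of the ConceptNet code.

```python
import io

def export_vectors(vectors, stream):
    """Write {word: [floats]} in a word2vec-style text format."""
    dim = len(next(iter(vectors.values())))
    stream.write(f"{len(vectors)} {dim}\n")
    for word, vec in vectors.items():
        # repr() of a float round-trips exactly through float().
        stream.write(word + " " + " ".join(repr(x) for x in vec) + "\n")

def import_vectors(stream):
    """Read the format written by export_vectors back into a dict."""
    count, dim = (int(n) for n in stream.readline().split())
    vectors = {}
    for line in stream:
        word, *values = line.split()
        vectors[word] = [float(v) for v in values]
    assert len(vectors) == count, "row count does not match header"
    assert all(len(v) == dim for v in vectors.values()), "wrong dimension"
    return vectors

def round_trip_ok(vectors):
    """True if exporting and reimporting gives back the same labeled rows."""
    buffer = io.StringIO()
    export_vectors(vectors, buffer)
    buffer.seek(0)
    return import_vectors(buffer) == vectors

sample = {"cat": [0.1, -0.25], "dog": [0.3, 1e-9]}
print(round_trip_ok(sample))
```

Comparing whole dictionaries checks the labels and the numbers together, so rows that end up under the wrong word fail the comparison.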

If you use the English-only vectors, we have a new Numberbatch download, version 17.04b, that should fix the problem.

I apologize for the erroneous data, and for the setback this may have caused for anyone who is just trying to use the best word vectors they can. Thank you to the users on the conceptnet-users mailing list who drew my attention to the problem.

ConceptNet Numberbatch 17.04: better, less-stereotyped word vectors

Word embeddings or word vectors are a way for computers to understand what words mean in text written by people. The goal is to represent words as lists of numbers, where small changes to the numbers represent small changes to the meaning of the word. This is a technique that helps in building AI algorithms for natural language understanding — using word vectors, the algorithm can compare words by what they mean, not just by how they’re spelled.
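To make “compare words by what they mean” concrete, here is a minimal sketch using cosine similarity, a common way to compare word vectors. The three-dimensional vectors are invented for illustration; real embeddings have hundreds of dimensions.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Made-up vectors, chosen so that related words point in similar directions.
toy_vectors = {
    "cat": [0.9, 0.8, 0.1],
    "kitten": [0.85, 0.9, 0.05],
    "car": [0.1, 0.2, 0.95],
}

cat_kitten = cosine_similarity(toy_vectors["cat"], toy_vectors["kitten"])
cat_car = cosine_similarity(toy_vectors["cat"], toy_vectors["car"])
print(cat_kitten, cat_car)
```

“cat” and “car” differ by a single letter, yet their vectors are far apart, while “cat” and “kitten” are close.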

But the news that’s breaking everywhere about word vectors is that they also represent the worst parts of what people mean. Stereotypes and prejudices are baked into what the computer believes to be the meanings of words. To put it bluntly, the computer learns to be sexist and racist, because it learns from what people say.


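There are many articles you could read for background on the problem, including the Bolukbasi et al. paper on gender bias and the Caliskan et al. article in Science, both of which come up again below.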

We want to avoid letting computers be awful to people just because people are awful to people. We want to provide word vectors that are not just the technical best, but also morally good. So we’re releasing a new version of ConceptNet Numberbatch that has been post-processed to counteract several kinds of biases and stereotypes.

If you use word vectors in your machine learning and the state-of-the-art accuracy of ConceptNet Numberbatch hasn’t convinced you to switch from word2vec or GloVe, we hope that built-in de-biasing makes a compelling case. Machine learning is better when your machine is less prone to learning to be a jerk.

How do we evaluate that we’ve made ConceptNet Numberbatch less prejudiced than competing systems? There seem to be no standardized evaluations for this yet. But we’ve created some evaluations based on the data from these recent papers. We’re making it a part of the ConceptNet build process to automatically de-bias Numberbatch and evaluate how successful that de-biasing was.

The Bolukbasi et al. paper describes how to counteract gender bias, and by adapting the techniques of that paper, we can reduce the gender bias we observe to almost nothing. Biases regarding race, ethnicity, and religion are harder to define, and therefore harder to remove, but it’s important to make progress on this anyway.

The graph you see below shows what we’ve done so far. The y-axis is a scale that we came up with involving the dot products between word vectors and their possible stereotypes: closer to zero is better. The brown bar, “ConceptNet Numberbatch 17.04”, is the de-biased system we’re releasing. (The version number represents the date, April 2017.)


We’re not trying to say we’ve solved the problem, but we can conclude that we’ve made the problem smaller. Keep in mind that this evaluation itself will likely change in the future, as we gain a better understanding of how to measure bias.

In dealing with machine-learning bias, there’s the concern that removing the bias could also cause changes that remove accuracy. But we’ve found that the change is negligible: about 1% of the overall result, much smaller than the error bars. Here’s the updated evaluation graph. The y-axis is the Spearman correlation with gold standard data (higher is better). The evaluations here are for English words only — Numberbatch covers many more languages, but the systems we’re comparing to don’t.
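For readers unfamiliar with the metric, Spearman correlation compares the ranking a system produces with the ranking in the gold standard, ignoring the actual magnitudes. A minimal pure-Python sketch follows; the gold ratings and model similarities are invented numbers, not results from any evaluation.

```python
import math

def ranks(values):
    """Rank values from 1 to n, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average
        i = j + 1
    return result

def pearson(xs, ys):
    """Pearson correlation of two equal-length lists."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

def spearman(xs, ys):
    """Spearman correlation: the Pearson correlation of the ranks."""
    return pearson(ranks(xs), ranks(ys))

# Hypothetical human ratings for four word pairs, and two systems' similarities.
gold = [9.2, 7.5, 1.3, 4.0]
same_order = [0.81, 0.77, 0.02, 0.30]
two_swapped = [0.81, 0.30, 0.02, 0.77]

print(spearman(gold, same_order))
print(spearman(gold, two_swapped))
```

A system that orders the word pairs exactly as people do gets a correlation of 1, whatever scale its similarity scores are on. Swapping two adjacent pairs lowers the score.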


ConceptNet Numberbatch is already so much more accurate than any other released system that it can lose 1% of its accuracy for a good cause. If you want 1% more accuracy out of your word vectors, I suggest focusing on improving a knowledge graph, not putting back the stereotypes.

Systems we compare to

In the graphs above, “word2vec Google News” is the popular system from 2013-2014 that many people go to when they think “oh hey I need some word vectors”. Its continued relevance is largely due to the fact that it lets you use data that’s learned from Google’s large corpus of news, a very nice corpus which you can’t actually have. We use its results as an input to ConceptNet Numberbatch.

“GloVe 1.2 840B” is a system from Stanford that is better in some cases, and learns from reading the whole Web via the Common Crawl. It seems to have some fixable problems with the scaling of its features. “GloVe renormalized” is Luminoso’s improvement on GloVe, which we also use as an input.

“fastText enWP (without OOV)” is Facebook’s word vectors trained on the English Wikipedia, with the disclaimer that their accuracy should be better than what we show here. Facebook’s fastText comes with a strategy for handling out-of-vocabulary words, like Numberbatch does. But the data they make available for applying that strategy is in an undocumented binary format that I haven’t deciphered yet.

Gender analogies and gender bias

The word2vec authors (Tomas Mikolov et al.) showed that there was structure in the meanings of the vectors that word2vec learned, allowing you to do arithmetic with word meanings. If you’ve read anything about word vectors before, you’re probably tired of this example by now, but it will make a good illustration: you can take the vector for the word “king”, then subtract “man” and add “woman” to it, and the result is close to the vector for the word “queen”. 

Let’s de-mystify how that happens a bit. Word2vec has read many billions of words (in particular, words in articles on Google News) and learned about what contexts they appear in. The operation “king - man + woman” is essentially asking to find a word that’s used similarly to the word “king”, but one whose context is more like the word “woman” than the word “man”. For example, a word that’s like “king” but appears with the words “she” and “her”. The clear answer is “queen”.

So the vectors learned by word2vec (and many later systems) can express analogy problems as mathematical equations:

woman - man ≈ queen - king

This is remarkably similar to the way that analogies are presented in, for example, an English class, or a standardized test such as the Miller Analogies Test or former SAT tests:

woman : man :: queen : king
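The arithmetic can be sketched with toy vectors. In the made-up data below, the first dimension stands for “royalty” and the second for gender; real embeddings have no such labeled dimensions. Leaving the three input words out of the candidates is a common convention in analogy evaluations, and it is an assumption here.

```python
import math

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def complete_analogy(a, b, c, vectors):
    """Solve a : b :: c : ? by finding the word nearest to c - a + b."""
    target = [vc - va + vb
              for va, vb, vc in zip(vectors[a], vectors[b], vectors[c])]
    # Exclude the input words, which would otherwise often win trivially.
    candidates = [w for w in vectors if w not in {a, b, c}]
    return max(candidates,
               key=lambda w: cosine_similarity(vectors[w], target))

# Invented vectors: [royalty, gender, everything else].
toy_vectors = {
    "king": [0.9, 0.8, 0.1],
    "queen": [0.9, -0.8, 0.1],
    "man": [0.1, 0.8, 0.2],
    "woman": [0.1, -0.8, 0.2],
    "prince": [0.8, 0.7, 0.1],
    "apple": [0.0, 0.0, 0.9],
}

answer = complete_analogy("man", "woman", "king", toy_vectors)
print(answer)
```

With these vectors, “king - man + woman” lands exactly on “queen”. In a real vector space the result is only near the answer, which is why the nearest remaining word is returned.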

Evaluating analogies like these led researchers to wonder what other analogies could be created in this system by adding “woman” to a word and subtracting “man” from it, or vice versa. And some of the results revealed a problem.

In the following examples, the word2vec system was given the first three words of an analogy, and asked for the word in its vocabulary that best completed the equation. The results revealed gender biases that had been baked into the system:

man : woman :: shopkeeper : housewife

man : woman :: carpentry : sewing

man : woman :: pharmaceuticals : cosmetics

I wonder if the excessive focus on Mikolov et al.’s analogy evaluation has exacerbated the problem. When a system is asked repeatedly to make analogies of the form male word : female word :: other male word : other female word, and it’s evaluated on this and its knowledge of geography and not much else, is it any surprise that we end up with systems that amplify stereotypes that distinguish women and men?

The problem is not just a theoretical problem that shows up when playing with analogies. Word embeddings are actively used in many different fields of natural language processing. A system that searches résumés for people with particular programming skills could end up ranking women lower: the phrase “she developed software for…” would be a worse match for the classifier’s expectations than “he developed software for…”.

This is one of the biases that we strive to remove, and the one we are most successful at removing, as described later in this post.

Word embeddings contain ethnic biases, too

The work on understanding and removing gender biases has been published for a while. But while working on this, I noticed that the data also contained significant racial and ethnic biases, which it seemed nobody was talking about. Just recently, the Caliskan et al. article came out and provided some needed illumination on the issue.

I had tried building an algorithm for sentiment analysis based on word embeddings — evaluating how much people like certain things based on what they say about them. When I applied it to restaurant reviews, I found it was ranking Mexican restaurants lower. The reason was not reflected in the star ratings or actual text of the reviews. It’s not that people don’t like Mexican food. The reason was that the system had learned the word “Mexican” from reading the Web.

If a restaurant were described as doing something “illegal”, that would be a pretty negative statement about the restaurant, right? But the Web contains lots of text where people use the word “Mexican” disproportionately along with the word “illegal”, particularly to associate “Mexican immigrants” with “illegal immigrants”. The system ends up learning that “Mexican” means something similar to “illegal”, and so it must mean something bad.

The tests I implemented for ethnic bias are to take a list of words, such as “white”, “black”, “Asian”, and “Hispanic”, and find which one has the strongest correlation with each of a list of positive and negative words, such as “cheap”, “criminal”, “elegant”, and “genius”. I did this again with a fine-grained version that lists hundreds of words for ethnicities and nationalities, and thus is more difficult to get a low score on, and again with what may be the trickiest test of all, comparing words for different religions and spiritual beliefs.

In these tests, for each positive and negative word, I find the group-of-people word that it’s most strongly associated with, and compare that to the average. The difference is the bias for that word, and in the end it’s averaged over all the positive and negative words. This appears in the graphs as “Ethnic bias (coarse)”, “Ethnic bias (fine)”, and “Religious bias”.
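One way to read that description as code is sketched below. It is an interpretation, not the evaluation that produced the graphs: it assumes that “most strongly associated” means the largest dot product, and the two-dimensional vectors and group names are placeholders.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def bias_score(group_vectors, sentiment_vectors):
    """Average, over sentiment words, of (strongest association - mean association)."""
    per_word = []
    for sentiment_vec in sentiment_vectors.values():
        associations = [dot(group_vec, sentiment_vec)
                        for group_vec in group_vectors.values()]
        mean = sum(associations) / len(associations)
        # How much the most-associated group stands out from the average.
        per_word.append(max(associations) - mean)
    return sum(per_word) / len(per_word)

# Placeholder vectors; the second dimension carries a learned stereotype.
groups = {"group_a": [1.0, 0.0], "group_b": [1.0, 0.5]}
sentiment_words = {"criminal": [0.0, 1.0], "elegant": [0.2, 0.0]}

score = bias_score(groups, sentiment_words)
print(score)
```

In this toy data “criminal” is associated with one group much more than with the other, so it contributes a large difference, while “elegant” is associated equally with both and contributes nothing.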

Note that it’s infeasible to reach 0 on this scale — words for groups of people will necessarily have some different associations. Even random differences between the words would give non-zero results. This is one reason I don’t consider the scale to be final. I’d like to make one that works like the gender-bias scale, where reaching 0 is attainable and desirable.

The Science article uncovers racial biases in a different way: it looks for different associations with predominantly-black names, such as “Deion”, “Jamel”, “Shereen”, and “Latisha”, versus predominantly-white names, such as “Amanda”, “Courtney”, “Adam”, and “Harry”. I incorporated a version of this (and also added some predominantly Hispanic and Islamic names) as another test, shown on the graph as “Bias from names”.
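A rough sketch of a names test in this spirit: for each group of names, measure how much closer the names are to positive words than to negative words, then compare the groups. This is a simplification and not the procedure from the article. The vectors are invented so that the first dimension carries a learned “pleasantness” association.

```python
import math

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def name_group_score(names, positive, negative, vectors):
    """Mean over names of (similarity to positive words - similarity to negative words)."""
    scores = []
    for name in names:
        pos = sum(cosine_similarity(vectors[name], vectors[w])
                  for w in positive) / len(positive)
        neg = sum(cosine_similarity(vectors[name], vectors[w])
                  for w in negative) / len(negative)
        scores.append(pos - neg)
    return sum(scores) / len(scores)

# Invented vectors imitating what a biased embedding might have learned.
toy_vectors = {
    "amanda": [0.3, 0.9], "adam": [0.2, 0.9],
    "deion": [-0.2, 0.9], "jamel": [-0.3, 0.9],
    "wonderful": [1.0, 0.0], "awful": [-1.0, 0.0],
}

gap = (name_group_score(["amanda", "adam"], ["wonderful"], ["awful"], toy_vectors)
       - name_group_score(["deion", "jamel"], ["wonderful"], ["awful"], toy_vectors))
print(gap)
```

A gap far from zero means the vectors treat one set of names as more pleasant than the other, which is the kind of association a de-biased system should not have.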

In ConceptNet Numberbatch, we’ve extended Bolukbasi’s de-biasing method to cover multiple types of prejudices, including ethnic and religious. This, too, is discussed below in the “What we’ve done” section.

Porn biases

While we’re talking about the biases that an algorithm gets from reading the Web, let’s talk about another large influence: a lot of the Web is porn.

A system that reads the text of pages sampled from the Web is going to read a lot of the text of porn pages. As such, it is going to end up learning associations for many kinds of words, such as “girlfriend”, “teen”, and “Asian”, that would be very inappropriate to put into production in a machine learning system.

Many of these associated terms, such as “slut”, are negative in connotation. This causes gender bias, ethnic bias, and more. When countering all of these biases, we need to make sure that people are not associated with degrading terminology because of their gender, ethnicity, age, or sexual orientation. This is another aspect of words that we strive to de-bias.

How to fix machine-learning biases

Biases and prejudices are clearly a big problem in machine learning, and at least some machine learning researchers are doing something about it. I’ve seen two major approaches to fixing biases, and as a shorthand, I’ll call them “Google-style” and “Microsoft-style” fixes, though I’m aware that these are just one project from each company and probably don’t represent a company-wide plan. The main difference is at what stage of the process you try to remove biases.

What I call “Google-style” de-biasing is described in a post on the Google Research Blog, “Equality of Opportunity in Machine Learning”. In this view, the data is what it is; if the data is unfair, it’s a reflection of the world being unfair. So the goal is to identify the point at which a machine learning tool makes a decision that affects someone (their example is a classifier that decides whether to grant a loan), and de-bias the actual decision that it makes, providing equal opportunity regardless of what the system learned from its data.

They caution against “fairness through unawareness”, the attempt to produce a system that’s unbiased just because it’s not told about attributes such as gender or race, because machine learning is great at picking up on correlated patterns that could be a proxy for gender or race.

Google’s approach is a principled and reasonable approach to take, especially in a workplace that venerates data above all else. But it sounds like it involves a large and never-ending amount of programmer effort to ensure that biased data leads to unbiased decisions. I have to wonder how many of Google’s products that use machine learning really have a working “equal opportunity filter” on their output.

In the “Microsoft-style” approach, when your data is biased against some group of people, you change the data until it’s more fair, which should help to de-bias anything you do with that data. This approach avoids “unawareness” because it adjusts all the data that’s correlated with the identified bias. I call this “Microsoft-style” because it’s based on the Microsoft Research paper (Bolukbasi et al.) that I linked to.

To remove a gender bias from word embeddings, for example, you can collect many examples of word pairs that are gender-biased and shouldn’t be (such as “doctor” vs. “nurse”), and use them to find the combination of word-embedding components that are responsible for gender bias. Then you mathematically adjust the components so that that combination comes out to 0.
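The core of that adjustment is a projection. In this minimal sketch, the bias direction is estimated by averaging the differences between biased pairs, which is a simple stand-in for whatever decomposition a real implementation uses. The vectors are toy data.

```python
import math

def bias_direction(pairs, vectors):
    """Average the differences of biased pairs and normalize to unit length."""
    dim = len(next(iter(vectors.values())))
    total = [0.0] * dim
    for first, second in pairs:
        for i in range(dim):
            total[i] += vectors[first][i] - vectors[second][i]
    norm = math.sqrt(sum(x * x for x in total))
    return [x / norm for x in total]

def remove_component(vec, direction):
    """Subtract the projection of vec onto the unit-length direction."""
    scale = sum(v * d for v, d in zip(vec, direction))
    return [v - scale * d for v, d in zip(vec, direction)]

# Toy vectors in which the first dimension happens to encode gender bias.
toy_vectors = {
    "doctor": [0.5, 0.7], "nurse": [-0.5, 0.7],
    "programmer": [0.6, 0.2], "homemaker": [-0.6, 0.2],
    "engineer": [0.4, 0.3],
}

direction = bias_direction(
    [("doctor", "nurse"), ("programmer", "homemaker")], toy_vectors)
fixed = remove_component(toy_vectors["engineer"], direction)
print(direction, fixed)
```

After the projection, the dot product of the adjusted vector with the bias direction is 0, while its other components are unchanged.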

It’s not quite that simple — doing exactly what I said could result in a system that’s unbiased because it has no idea what gender is, which would harm its ability to understand words that carry actual information about gender (such as “she” or “uncle”). You need to also have examples of words that are appropriate to distinguish by gender, such as “he” and “she”, or “aunt” and “uncle”. You then train the system to find the right balance between destroying biased assumptions and preserving useful information about gender.

To summarize the effects of these approaches, I would say that Microsoft-style de-biasing is more transferable between different tasks, but Google-style lets you positively demonstrate that certain things about your system are fair, if you use it consistently. If you control your machine learning pipeline from end to end, from source data to the point where you make a decision, I would say you should do both.

What we’ve done

In the new release of ConceptNet Numberbatch, we adapt one of the “Microsoft-style” techniques from Bolukbasi et al., but we remove many types of biases, not just one.

The process goes like this:

  • Classify words according to the appropriateness of a distinction. For example, “mother” vs. “father” contains an appropriate gender distinction that we shouldn’t change. “Homemaker” vs. “programmer” contains an inappropriate distinction. Bolukbasi provided some nice crowd-sourced data about what people consider appropriate.
  • Adjust the word vectors for words on the “inappropriate” side algebraically, so that the distinction they shouldn’t be making comes out to zero.
  • Evaluate how successful we were at removing the bias, by testing it on a different set of words than the ones we used to find the bias.
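The three steps can be sketched end to end on toy data. This is an illustration under stated assumptions and not the ConceptNet build code. The set `inappropriate` stands in for the classifier’s output, the gender direction is given instead of learned, and all vectors are invented.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def debias(vectors, direction, inappropriate):
    """Zero the bias component, but only for words flagged as inappropriate."""
    result = {}
    for word, vec in vectors.items():
        if word in inappropriate:
            scale = dot(vec, direction)
            vec = [v - scale * d for v, d in zip(vec, direction)]
        result[word] = list(vec)
    return result

def mean_abs_bias(words, vectors, direction):
    """Average size of the bias component over a set of evaluation words."""
    return sum(abs(dot(vectors[w], direction)) for w in words) / len(words)

# The first dimension plays the role of a unit-length gender direction.
gender_direction = [1.0, 0.0]
toy_vectors = {
    "mother": [-0.9, 0.3], "father": [0.9, 0.3],
    "homemaker": [-0.6, 0.5], "programmer": [0.6, 0.5],
    "nurse": [-0.5, 0.7], "engineer": [0.5, 0.4],
}

# Step 1: words whose gender distinction is judged inappropriate.
inappropriate = {"homemaker", "programmer", "nurse", "engineer"}
# Step 3 evaluates on words assumed not to have been used to find the direction.
held_out = ["nurse", "engineer"]

before = mean_abs_bias(held_out, toy_vectors, gender_direction)
debiased = debias(toy_vectors, gender_direction, inappropriate)  # Step 2
after = mean_abs_bias(held_out, debiased, gender_direction)
print(before, after, debiased["mother"])
```

“mother” and “father” keep their gender component because their distinction is appropriate. Only the flagged words are moved.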

The graph below shows the gender bias that we aim to remove. Words are plotted according to how much they’re associated with male words (on the left) or female words (on the right), and according to whether our classifier says this association is appropriate (on the top) or inappropriate (on the bottom).


And here’s what the graph looks like after de-biasing. The inappropriate gender distinctions have been set to nearly zero.


We use similar steps to remove biased associations with different races, ethnicities, and religions. Because we don’t have nice crowd-sourced data for exactly what should be removed in those cases, we instead aim to de-correlate them with words representing positive and negative sentiment.

We hope we’re pushing word vectors away from biases and prejudices, and toward systems that don’t think of you any differently whether you’re named Stephanie or Shanice or Santiago or Syed.

Using the results

ConceptNet Numberbatch 17.04 is out now, with the vectors available for research into text understanding and classification. The format is designed so they can be used as a replacement for word2vec vectors.

In Luminoso products, we use a version of ConceptNet Numberbatch that’s adapted to our text-understanding pipeline as a starting point, providing general background knowledge about what words mean. Numberbatch represents what Luminoso knows before it learns about your domain, and allows it to quickly learn to understand words from a particular domain because it doesn’t have to learn an entire language from scratch.

Our next practical step is to incorporate the newest Numberbatch into Luminoso, with both the SemEval accuracy improvements and the de-biasing.

In further research, we aim to refine how we measure these different kinds of biases. One improvement would be to measure biases in languages besides English. Numberbatch vectors are aligned across different languages, so the de-biasing we performed should affect all languages, but it will be important to test this with some multilingual data.