Interview on de-biasing NLP

In several previous posts here, I’ve been discussing the risks of biased AI, particularly in natural language processing tasks. It’s part of my work at Luminoso to address this, both in ConceptNet and in Luminoso’s services. We’re not just warning about what not to do, we’re promoting best practices about what you can do to prevent bias right now.

At Luminoso, Denise Christie and I recorded a discussion about de-biasing, touching on a lot of the current issues that create bias in NLP and what to do about them.  The recording and transcript are now available on Luminoso’s site.

You weren’t supposed to actually implement it, Google

Last month, I wrote a blog post warning about how, if you follow popular trends in NLP, you can easily accidentally make a classifier that is pretty racist. To demonstrate this, I included the very simple code, as a “cautionary tutorial”.

The post got a fair amount of reaction. Much of it positive and taking it seriously, so thanks for that. But eventually I heard from some detractors. Of course there were the fully expected “I’m not racist but what if racism is correct” retorts that I knew I’d have to face. But there were also people who couldn’t believe that anyone does NLP this way. They said I was talking about a non-problem that doesn’t show up in serious machine learning, or projecting my own bad NLP ideas, or something.

Well. Here’s Perspective API, made by an offshoot of Google. They believe they are going to use it to fight “toxicity” online. And by “toxicity” they mean “saying anything with negative sentiment”. And by “negative sentiment” they mean “whatever word2vec thinks is bad”. It works exactly like the hypothetical system that I cautioned against.

On this blog, we’ve just looked at what word2vec (or GloVe) thinks is bad. It includes black people, Mexicans, Islam, and given names that don’t usually belong to white Americans. You can actually type my examples into Perspective API and it will actually respond that the ones that are less white-sounding are more “likely to be perceived as toxic”.

  • “Hello, my name is Emily” is supposedly 4% likely to be “toxic”. Similar results for “Susan”, “Paul”, etc.
  • “Hello, my name is Shaniqua” (“Jamel”, “DeShawn”, etc.): 21% likely to be toxic.
  • “Let’s go get Italian food”: 9%.
  • “Let’s go get Mexican food”: 29%.

Here are two more examples I didn’t mention before:

  • “Christianity is a major world religion”: 37%. Okay, maybe things can get heated when religion comes up at all, but compare:
  • “Islam is a major world religion”: 66% toxic.

I’ve heard about Perspective API from many directions, but my proximate source is this Twitter thread by Dan Luu, who has his own examples:

I have previously written positive things about researchers at Google who are looking at approaches to de-biasing AI, such as their blog post on Equality of Opportunity in Machine Learning.

But Google is a big place. It contains multitudes. And it seems it contains a subdivision that will do the wrong thing, which other Googlers know is the wrong thing, because it’s easy.

Google, you made a very bad investment. (That sentence is 61% toxic, by the way.)

How to make a racist AI without really trying

Perhaps you heard about Tay, Microsoft’s experimental Twitter chat-bot, and how within a day it became so offensive that Microsoft had to shut it down and never speak of it again. And you assumed that you would never make such a thing, because you’re not doing anything weird like letting random jerks on Twitter re-train your AI on the fly.

My purpose with this tutorial is to show that you can follow an extremely typical NLP pipeline, using popular data and popular techniques, and end up with a racist classifier that should never be deployed.

There are ways to fix it. Making a non-racist classifier is only a little bit harder than making a racist classifier. The fixed version can even be more accurate at evaluations. But to get there, you have to know about the problem, and you have to be willing to not just use the first thing that works.

This tutorial is a Jupyter Python notebook hosted on GitHub Gist. I recommend clicking through to the full view of the notebook, instead of scrolling the tiny box below.


ConceptNet 5.5.5 update

ConceptNet 5.5.5 is out, and it’s running on The version5.5 tag in Git has been updated to point to this version. Here’s what’s new.


Data changes:

  • Uses ConceptNet Numberbatch 17.06, which incorporates de-biasing to avoid harmful stereotypes being encoded in its word representations.
  • Fixed a glitch in retrofitting, where terms in ConceptNet that were two steps removed from any term that existed in one of the existing word-embedding data sources were all being assigned the same meaningless vector. They now get vectors that are propagated (after multiple steps) from terms that do have existing word embeddings, as intended.
  • Filtered some harmful assertions that came from disruptive or confused Open Mind Common Sense contributors. (Some of them had been filtered before, but changes to the term representation had defeated the filters.)
  • Added a new source of input word embeddings, created at Luminoso by running a multilingual variant of fastText over OpenSubtitles 2016. This provides a source of real-world usage of non-English words.

Build process changes:

  • We measured the amount of RAM the build process requires at its peak to be 30 GB, and tested that it completes on a machine with 32 GB of RAM. We updated the Snakefile to reflect these requirements and to use them to better plan which tasks to run in parallel.
  • The build process starts by checking for some requirements (having enough RAM, enough disk space, and a usable PostgreSQL database), and exits early if they aren’t met, instead of crashing many hours later.
  • The tests have been organized into tests that can be run before building ConceptNet, tests that can be run after a small example build, and tests that require the full ConceptNet. The first two kinds of tests are run automatically, in the right sequence, by the script.
  • and have been moved into the top-level directory, where they are more visible.

Library changes:

  • Uses the marisa-trie library to speed up inferring vectors for out-of-vocabulary words.
  • Uses the annoy library to suggest nearest neighbors that map a larger vocabulary into a smaller one.
  • Depends on a specific version of xmltodict, because a breaking change to xmltodict managed to break the build process of many previous versions of ConceptNet.
  • The cn5-vectors evaluate command can evaluate whether a word vector space contains gender biases or ethnic biases.

Understanding our version numbers

Version numbers in modern software are typically described as major.minor.micro. ConceptNet’s version numbers would be better described as mega.major.minor. Now that all the version components happen to be 5, I’ll explain what they mean to me.

The change from 5.5.4 to 5.5.5 is a “minor” change. It involves important fixes to the data, but these fixes don’t affect a large number of edges or significantly change the vocabulary. If you are building research on ConceptNet and require stable results, we suggest building a particular version (such as 5.5.4 or 5.5.5) from its Docker container, as a “minor” change could cause inconsistent results.

The change from 5.4 to 5.5 was a “major” change. We changed the API format somewhat (hopefully with a smooth transition), we made significant changes to ConceptNet’s vocabulary of terms, we added new data sources, and we even changed the domain name where it is hosted. We’re working on another “major” update, version 5.6, that incorporates new data sources again, though I believe the changes will not be as sweeping as the 5.5 update.

The change from ConceptNet 4 to ConceptNet 5 (six years ago) was a “mega” change, a thorough rethinking and redesign of the project, keeping things that worked and discarding things that didn’t, which is not well described by software versions. The appropriate way to represent it in Semantic Versioning would probably be to start a new project with a different name.

Don’t worry, I have no urge to make a ConceptNet 6 anytime soon. ConceptNet 5 is doing great.

The word vectors that ConceptNet uses in its relatedness API (which are also distributed separately as ConceptNet Numberbatch) are recalculated for every version, even minor versions. The results you get from updating to new vectors should get steadily more accurate, unless your results depended on the ability to represent harmful stereotypes.

You can’t mix old and new vectors, so any machine-learning model needs to be rebuilt to use new vectors. This is why we gave ConceptNet Numberbatch a version numbering scheme that is entirely based on the date (vectors computed in June 2017 are version 17.06).

Bugfix: our English-only word vectors contained the wrong data

If you have used the ConceptNet Numberbatch 17.04 word vectors, it turns out that you got very different results if you downloaded the English-only vectors versus if you used the multilingual, language-tagged vectors.

I decided to make this downloadable file of English-only vectors as a convenience, because it would be the format that looked most like a drop-in replacement for word2vec’s data. But the English-only format is not a format that we use anywhere. We test our vectors, but we don’t test reimporting them from all the files we exported, so that caused a bug in the export to go unnoticed.

The English-only vectors ended up labeling the rows with the wrong English words, unfortunately, making the data they contained meaningless. If you use the multilingual version, it was and still is fine.

If you use the English-only vectors, we have a new Numberbatch download, version 17.04b, that should fix the problem.

I apologize for the erroneous data, and for the setback this may have caused for anyone who is just trying to use the best word vectors they can. Thank you to the users on the conceptnet-users mailing list who drew my attention to the problem.

ConceptNet Numberbatch 17.04: better, less-stereotyped word vectors

Word embeddings or word vectors are a way for computers to understand what words mean in text written by people. The goal is to represent words as lists of numbers, where small changes to the numbers represent small changes to the meaning of the word. This is a technique that helps in building AI algorithms for natural language understanding — using word vectors, the algorithm can compare words by what they mean, not just by how they’re spelled.

But the news that’s breaking everywhere about word vectors is that they also represent the worst parts of what people mean. Stereotypes and prejudices are baked into what the computer believes to be the meanings of words. To put it bluntly, the computer learns to be sexist and racist, because it learns from what people say.


There are many articles you could read for background on the problem, including:

We want to avoid letting computers be awful to people just because people are awful to people. We want to provide word vectors that are not just the technical best, but also morally good. So we’re releasing a new version of ConceptNet Numberbatch that has been post-processed to counteract several kinds of biases and stereotypes.

If you use word vectors in your machine learning and the state-of-the-art accuracy of ConceptNet Numberbatch hasn’t convinced you to switch from word2vec or GloVe, we hope that built-in de-biasing makes a compelling case. Machine learning is better when your machine is less prone to learning to be a jerk.

How do we evaluate that we’ve made ConceptNet Numberbatch less prejudiced than competing systems? There seem to be no standardized evaluations for this yet. But we’ve created some evaluations based on the data from these recent papers. We’re making it a part of the ConceptNet build process to automatically de-bias Numberbatch and evaluate how successful that de-biasing was.

The Bolukbasi et al. paper describes how to counteract gender bias, and by adapting the techniques of that paper, we can reduce the gender bias we observe to almost nothing. Biases regarding race, ethnicity, and religion are harder to define, and therefore harder to remove, but it’s important to make progress on this anyway.

The graph you see below shows what we’ve done so far. The y-axis is a scale that we came up with involving the dot products between word vectors and their possible stereotypes: closer to zero is better. The brown bar, “ConceptNet Numberbatch 17.04”, is the de-biased system we’re releasing. (The version number represents the date, April 2017.)


We’re not trying to say we’ve solved the problem, but we can conclude that we’ve made the problem smaller. Keep in mind that this evaluation itself will likely change in the future, as we gain a better understanding of how to measure bias.

In dealing with machine-learning bias, there’s the concern that removing the bias could also cause changes that remove accuracy. But we’ve found that the change is negligible: about 1% of the overall result, much smaller than the error bars. Here’s the updated evaluation graph. The y-axis is the Spearman correlation with gold standard data (higher is better). The evaluations here are for English words only — Numberbatch covers many more languages, but the systems we’re comparing to don’t.


ConceptNet Numberbatch is already so much more accurate than any other released system that it can lose 1% of its accuracy for a good cause. If you want 1% more accuracy out of your word vectors, I suggest focusing on improving a knowledge graph, not putting back the stereotypes.

Systems we compare to

In the graphs above, “word2vec Google News” is the popular system from 2013-2014 that many people go to when they think “oh hey I need some word vectors”. Its continued relevance is largely due to the fact that it lets you use data that’s learned from Google’s large corpus of news, a very nice corpus which you can’t actually have. We use its results as an input to ConceptNet Numberbatch.

“GloVe 1.2 840B” is a system from Stanford that is better in some cases, and learns from reading the whole Web via the Common Crawl. It seems to have some fixable problems with the scaling of its features. “GloVe renormalized” is Luminoso’s improvement on GloVe, which we also use as an input.

“fastText enWP (without OOV)” is Facebook’s word vectors trained on the English Wikipedia, with the disclaimer that their accuracy should be better than what we show here. Facebook’s fastText comes with a strategy for handling out-of-vocabulary words, like Numberbatch does. But the data they make available for applying that strategy is in an undocumented binary format that I haven’t deciphered yet.

Gender analogies and gender bias

The word2vec authors (Tomas Mikolov et al.) showed that there was structure in the meanings of the vectors that word2vec learned, allowing you to do arithmetic with word meanings. If you’ve read anything about word vectors before, you’re probably tired of this example by now, but it will make a good illustration: you can take the vector for the word “king”, then subtract “man” and add “woman” to it, and the result is close to the vector for the word “queen”. 

Let’s de-mystify how that happens a bit. Word2vec has read many billions of words (in particular, words in articles on Google News) and learned about what contexts they appear in. The operation “king – man + woman” is essentially asking to find a word that’s used similarly to the word “king”, but one whose context is more like the word “woman” than the word “man”. For example, a word that’s like “king” but appears with the words “she” and “her”. The clear answer is “queen”.

So the vectors learned by word2vec (and many later systems) can express analogy problems as mathematical equations:

womanman queenking

This is remarkably similar to the way that analogies are presented in, for example, an English class, or a standardized test such as the Miller Analogies Test or former SAT tests:

woman : man :: queen : king

Evaluating analogies like these led researchers to wonder what other analogies could be created in this system by adding “woman” and subtracting “man” to a word, or vice versa. And some of the results revealed a problem.

In the following examples, the word2vec system was given the first three words of an analogy, and asked for the word in its vocabulary that best completed the equation. The results revealed gender biases that had been baked into the system:

man : woman :: shopkeeper : housewife

man : woman :: carpentry : sewing

man : woman :: pharmaceuticals : cosmetics

I wonder if the excessive focus on Mikolov et al.’s analogy evaluation has exacerbated the problem. When a system is asked repeatedly to make analogies of the form male word : female word :: other male word : other female word, and it’s evaluated on this and its knowledge of geography and not much else, is it any surprise that we end up with systems that amplify stereotypes that distinguish women and men?

The problem is not just a theoretical problem that shows up when playing with analogies. Word embeddings are actively used in many different fields of natural language processing. A system that searches résumés for people with particular programming skills could end up ranking women lower: the phrase “she developed software for…”  would be a worse match for the classifier’s expectations than “he developed software for…”.

This is one of the biases that we strive to remove, and the one we are most successful at removing, as described later in this post.

Word embeddings contain ethnic biases, too

The work on understanding and removing gender biases has been published for a while. But while working on this, I noticed that the data also contained significant racial and ethnic biases, which it seemed nobody was talking about. Just recently, the Caliskan et al. article came out and provided some needed illumination on the issue.

I had tried building an algorithm for sentiment analysis based on word embeddings — evaluating how much people like certain things based on what they say about them. When I applied it to restaurant reviews, I found it was ranking Mexican restaurants lower. The reason was not reflected in the star ratings or actual text of the reviews. It’s not that people don’t like Mexican food. The reason was that the system had learned the word “Mexican” from reading the Web.

If a restaurant were described as doing something “illegal”, that would be a pretty negative statement about the restaurant, right? But the Web contains lots of text where people use the word “Mexican” disproportionately along with the word “illegal”, particularly to associate “Mexican immigrants” with “illegal immigrants”. The system ends up learning that “Mexican” means something similar to “illegal”, and so it must mean something bad.

The tests I implemented for ethnic bias are to take a list of words, such as “white”, “black”, “Asian”, and “Hispanic”, and find which one has the strongest correlation with each of a list of positive and negative words, such as “cheap”, “criminal”, “elegant”, and “genius”. I did this again with a fine-grained version that lists hundreds of words for ethnicities and nationalities, and thus is more difficult to get a low score on, and again with what may be the trickiest test of all, comparing words for different religions and spiritual beliefs.

In these tests, for each positive and negative word, I find the group-of-people word that it’s most strongly associated with, and compare that to the average. The difference is the bias for that word, and in the end it’s averaged over all the positive and negative words. This appears in the graphs as “Ethnic bias (coarse)”, “Ethnic bias (fine)”, and “Religious bias”.

Note that it’s infeasible to reach 0 on this scale — words for groups of people will necessarily have some different associations. Even random differences between the words would give non-zero results. This is one reason I don’t consider the scale to be final. I’d like to make one that works like the gender-bias scale, where reaching 0 is attainable and desirable.

The Science article uncovers racial biases in a different way: it looks for different associations with predominantly-black names, such as “Deion”, “Jamel”, “Shereen”, and “Latisha”, versus predominantly-white names, such as “Amanda”, “Courtney”, “Adam”, and “Harry”. I incorporated a version of this (and also added some predominantly Hispanic and Islamic names) as another test, shown on the graph as “Bias from names”.

In ConceptNet Numberbatch, we’ve extended Bolukbasi’s de-biasing method to cover multiple types of prejudices, including ethnic and religious. This, too, is discussed below in the “What we’ve done” section.

Porn biases

While we’re talking about the biases that an algorithm gets from reading the Web, let’s talk about another large influence: a lot of the Web is porn.

A system that reads the text of pages sampled from the Web is going to read a lot of the text of porn pages. As such, it is going to end up learning associations for many kinds of words, such as “girlfriend”, “teen”, and “Asian”, that would be very inappropriate to put into production in a machine learning system.

Many of these associated terms, such as “slut”, are negative in connotation. This causes gender bias, ethnic bias, and more. When countering all of these biases, we need to make sure that people are not associated with degrading terminology because of their gender, ethnicity, age, or sexual orientation. This is another aspect of words that we strive to de-bias.

How to fix machine-learning biases

Biases and prejudices are clearly a big problem in machine learning, and at least some machine learning researchers are doing something about it. I’ve seen two major approaches to fixing biases, and as a shorthand, I’ll call them “Google-style” and “Microsoft-style” fixes, though I’m aware that these are just one project from each company and probably don’t represent a company-wide plan. The main difference is at what stage of the process you try to remove biases.

What I call “Google-style” de-biasing is described in a post on the Google Research Blog, “Equality of Opportunity in Machine Learning”. In this view, the data is what it is; if the data is unfair, it’s a reflection of the world being unfair. So the goal is to identify the point at which a machine learning tool makes a decision that affects someone (their example is a classifier that decides whether to grant a loan), and de-bias the actual decision that it makes, providing equal opportunity regardless of what the system learned from its data.

They caution against “fairness through unawareness”, the attempt to produce a system that’s unbiased just because it’s not told about attributes such as gender or race, because machine learning is great at picking up on correlated patterns that could be a proxy for gender or race.

Google’s approach is a principled and reasonable approach to take, especially in a workplace that venerates data above all else. But it sounds like it involves a large and never-ending amount of programmer effort to ensure that biased data leads to unbiased decisions. I have to wonder how many of Google’s products that use machine learning really have a working “equal opportunity filter” on their output.

In the “Microsoft-style” approach, when your data is biased against some group of people, you change the data until it’s more fair, which should help to de-bias anything you do with that data. This approach avoids “unawareness” because it adjusts all the data that’s correlated with the identified bias. I call this “Microsoft-style” because it’s based on the Microsoft Research paper (Bolukbasi et al.) that I linked to.

To remove a gender bias from word embeddings, for example, you can collect many examples of word pairs that are gender-biased and shouldn’t be (such as “doctor” vs. “nurse”), and use them to find the combination of word-embedding components that are responsible for gender bias. Then you mathematically adjust the components so that that combination comes out to 0.

It’s not quite that simple — doing exactly what I said could result in a system that’s unbiased because it has no idea what gender is, which would harm its ability to understand words that carry actual information about gender (such as “she” or “uncle”). You need to also have examples of words that are appropriate to distinguish by gender, such as “he” and “she”, or “aunt” and “uncle”. You then train the system to find the right balance between destroying biased assumptions and preserving useful information about gender.

To summarize the effects of these approaches, I would say that Microsoft-style de-biasing is more transferable between different tasks, but Google-style lets you positively demonstrate that certain things about your system are fair, if you use it consistently. If you control your machine learning pipeline from end to end, from source data to the point where you make a decision, I would say you should do both.

What we’ve done

In the new release of ConceptNet Numberbatch, we adapt one of the “Microsoft-style” techniques from Bolukbasi et al., but we remove many types of biases, not just one.

The process goes like this:

  • Classify words according to the appropriateness of a distinction. For example, “mother” vs. “father” contains an appropriate gender distinction that we shouldn’t change. “Homemaker” vs. “programmer” contains an inappropriate distinction. Bolukbasi provided some nice crowd-sourced data about what people consider appropriate.
  • Adjust the word vectors for words on the “inappropriate” side algebraically, so that the distinction they shouldn’t be making comes out to zero.
  • Evaluate how successful we were at removing the bias, by testing it on a different set of words than the ones we used to find the bias.

The graph below shows the gender bias that we aim to remove. Words are plotted according to how much they’re associated with male words (on the left) or female words (on the right), and according to whether our classifier says this association is appropriate (on the top) or inappropriate (on the bottom).


And here’s what the graph looks like after de-biasing. The inappropriate gender distinctions have been set to nearly zero.


We use similar steps to remove biased associations with different races, ethnicities, and religions. Because we don’t have nice crowd-sourced data for exactly what should be removed in those cases, we instead aim to de-correlate them with words representing positive and negative sentiment.

We hope we’re pushing word vectors away from biases and prejudices, and toward systems that don’t think of you any differently whether you’re named Stephanie or Shanice or Santiago or Syed.

Using the results

ConceptNet Numberbatch 17.04 is out now, with the vectors available for research into text understanding and classification. The format is designed so they can be used as a replacement for word2vec vectors.

In Luminoso products, we use a version of ConceptNet Numberbatch that’s adapted to our text-understanding pipeline as a starting point, providing general background knowledge about what words mean. Numberbatch represents what Luminoso knows before it learns about your domain, and allows it to quickly learn to understand words from a particular domain because it doesn’t have to learn an entire language from scratch.

Our next practical step is to incorporate the newest Numberbatch into Luminoso, with both the SemEval accuracy improvements and the de-biasing.

In further research, we aim to refine how we measure these different kinds of biases. One improvement would be to measure biases in languages besides English. Numberbatch vectors are aligned across different languages, so the de-biasing we performed should affect all languages, but it will be important to test this with some multilingual data.

ftfy (fixes text for you) 4.4 and 5.0

ftfy is Luminoso’s open-source Unicode-fixing library for Python.

Luminoso’s biggest open-source project is ConceptNet, but we also use this blog to provide updates on our other open-source projects. And among these projects, ftfy is certainly the most widely used. It solves a problem a lot of people have with “no faffing about”, as a grateful e-mail I received put it.

When you use the ftfy.fix_text() function, it detects and fixes such problems as mojibake (text that was decoded in the wrong encoding), accidental HTML escaping, curly quotes where you expected straight ones, and so on. (You can also selectively disable these fixes, or run them as separate functions.)

Here’s an example that fixes some multiply-mangled Unicode that I actually found on the Web:

>>> print(ftfy.fix_text("¯\\_(ツ)_/¯"))

Another example, from a Twitter-bot gone wrong:

>>> print(ftfy.fix_text("#правильноепитание"))

So we’re proud to present two new releases of ftfy, versions 4.4 and 5.0. Let’s start by talking about the big change:

A control panel labeled in Polish, with a big red button with the text 'Drop Python 2 support' overlaid.
Photo credit: “The Big Red Button” by włodi, used under the CC-By-SA 2.0 license

That’s right: as of version 4.4, ftfy is better at dealing with encodings of Eastern European languages! After all, sometimes your text is in Polish, like the labels on this very serious-looking control panel. Or maybe it’s in Czech, Slovak, Hungarian, or a language with similar accented letters.

Before Unicode, people would handle these alphabets using a single-byte encoding designed for them, like Windows-1250, which would be incompatible with other languages. In that encoding, the photographer’s name is the byte string w\xb3odi. But now the standard encoding of the Web is UTF-8, where the same name is w\xc5\x82odi.

The encoding errors you might encounter due to mixing these up used to be underrepresented in the test data I collected. You might end up with the name looking like “wĹ‚odi” and ftfy would just throw up its hands like ¯\\_(ツ)_/¯. But now it understands what happened to that name and how to fix it.

Oh, but what about that text I photoshopped onto the button?

The same image, cropped to just the 'Drop Python 2 support' button.

Yeah, I was pulling your leg a bit by talking about the Windows-1250 thing first.

ftfy 5.0 is the same as ftfy 4.4, but it drops support for Python 2. It also gains some tests that we’re happier to not have to write for both versions of Python. Depending on how inertia-ful your use of Python is, this may be a big deal to you.

Three at last!

Python 3 has a string type that’s a pretty good representation of Unicode, and it uses it consistently throughout its standard library. It’s a great language for describing Unicode and how to fix it. It’s a great language for text in general. But until now, we’ve been writing ftfy in the unsatisfying language known as “Python 2+3”, where you can’t take advantage of anything that’s cleaner in Python 3 because you still have to do it the Python 2.7 way also.

So, following the plan we announced in April 2015, we released two versions at the same time. They do the same thing, but ftfy 5.0 gets to have shorter, simpler code.

It seems we even communicated this to ftfy’s users successfully. Shortly after ftfy 5.0 appeared on PyPI, the bug report we received wasn’t about where Python 2 support went, it was about a regression introduced by the new heuristics. (That’s why 4.4.1 and 5.0.1 are out already.)

There’s more I plan to do with ftfy, especially fixing more kinds of encoding errors, as summarized by issue #18. It’ll be easier to make it happen when I can write the fix in a single language.

But if you’re still on Python 2 — possibly due to forces outside your control — I hope I’ve left you with a pretty good option. Thousands of users are content with ftfy 4, and it’s not going away.

One more real-world example

>>> from ftfy.fixes import fix_encoding_and_explain
>>> fix_encoding_and_explain("NapĂ\xadšte nám !")
('Napíšte nám !',
 [('encode', 'sloppy-windows-1250', 2), ('decode', 'utf-8', 0)])

How Luminoso made ConceptNet into the best word vectors, and won at SemEval

I have been telling people for a while that ConceptNet is a valuable source of information for semantic vectors, or “word embeddings” as they’ve been called since the neural-net people showed up in 2013 and renamed everything. Let’s call them “word vectors”, even though they can represent phrases too. The idea is to compute a vector space where similar vectors represent words or phrases with similar meanings.

In particular, I’ve been pointing to results showing that our precomputed vectors, ConceptNet Numberbatch, are the state of the art in multiple languages. Now we’ve verified this by participating in SemEval 2017 Task 2, “Multilingual and Cross-lingual Semantic Word Similarity”, and winning in a landslide.

A graph of the SemEval multilingual task results, showing the Luminoso system performing above every other system in every language, except for two systems that only submitted results in Farsi.
Performance of SemEval systems on the Multilingual Word Similarity task. Our system, in blue, shows its 95% confidence interval.
A graph of the SemEval cross-lingual task results, showing the Luminoso system performing above every other system in every language pair.
Performance of SemEval systems on the Cross-lingual Word Similarity task. Our system, in blue, shows its 95% confidence interval.

SemEval is a long-running evaluation of computational semantics. It does an important job of counteracting publication bias. Most people will only publish evaluations where their system performs well, but SemEval allows many groups to compete head-to-head on an evaluation they haven’t seen yet, with results released all at the same time. When SemEval results come out, you can see a fair comparison of everyone’s approach, with positive and negative results.

This task was a typical word-relatedness task, the same kind that we’ve been talking about in previous posts. You get a list of pairs of words, and your system has to assess how related they are, which is a useful thing to know in NLP applications such as search, text classification, and topic detection. The score is how well your system’s responses correlate with the responses that people give.

The system we submitted was not much different from the one we published and presented at AAAI 2017 and that we’ve been blogging about. It’s the product of the long-running crowd-sourcing and linked-data effort that has gone into ConceptNet, and lots of research here at Luminoso about how to make use of it.

At a high level, it’s an ensemble method that glues together multiple sources of vectors, using ConceptNet as the glue, and retrofitting (Faruqui, 2015) as the glue gun, and also building large parts of the result entirely out of the glue, a technique which worked well for me in elementary school when I had to make a diorama.

The primary goal of this SemEval task was to submit one system that performed well in multiple languages, and we did the best by far in that. Some systems only attempted one or two languages, and at least get to appear in the breakdown of the results by language. I notice that the QLUT system (I think that’s the Qilu University of Technology) is in a statistical tie with us in English, but submitted no other languages, and that two Farsi-only systems did better than us in Farsi.

On the cross-lingual results (comparing words between pairs of languages), no other system came close to us, even in Farsi, showing the advantage of ConceptNet being multilingual from the ground up.

The “baseline” system submitted by the organizers was Nasari, a knowledge-graph-based system previously published in 2016. Often the baseline system is a very simplistic technique, but this baseline was fairly sophisticated and demanding, and many systems couldn’t outperform it. The organizers, at least, believe that everyone in this field should be aware of what knowledge graphs can do, and it’s your problem if you’re not.

Don’t take “OOV” for an answer

The main thing that our SemEval system added, on top of the ConceptNet Numberbatch data you can download, is a strategy for handling out-of-vocabulary words. In the end, so many NLP evaluations come down to how you unk your OOVs. Wait, I’ll explain.

Most machine learning over text considers words as atomic units. So you end up with a particular vocabulary of words your system has learned about. The test data will almost certainly contain some words that the system hasn’t learned; those words are “Out of Vocabulary”, or “OOV”.

(There are some deep learning techniques now that go down to the character level, but they’re messier. And they still end up with a vocabulary of characters. I put the Unicode snowman ☃ into one of those systems and it segfaulted.)

Some publications use the dramatic cop-out of skipping all OOV words in their evaluation. That’s awful. Please don’t do that. I could make an NLP system whose vocabulary is the single word “chicken”, and that would get it a 100% score on some OOV-skipping evaluations, but the domain of text it could understand would be quite limited (Zongker, 2002).

In general, when a system encounters an OOV word, there has to be some strategy for dealing with it. Perhaps you replace all OOV words with a single symbol for unknown words, “unk”, a strategy common enough to have become a verb.

SemEval doesn’t let you dodge OOV words: you need to submit some similarity value for every pair, even if your system has no idea. “Unking” would not have worked very well for comparing words. It seemed to us that a good OOV strategy would make a noticeable difference in the results. We made a couple of assumptions:

  • The most common OOV words are inflections or slight variations of words that are known.
  • Inflections are suffixes in most of the languages we deal with, so the beginning of the word is more important than the end.
  • In non-English languages, OOV words may just be borrowings from English, the modern lingua franca.

So, in cases where it doesn’t help to use our previously published OOV strategy of looking up terms in ConceptNet and replacing them with their neighbors in the graph, we added these two OOV tricks:

  • Look for the word in English instead of the language it’s supposed to be in.
  • Look for known words that have the longest common prefix with the unknown word.

This strategy made a difference of about 10 percent in the results. Without it, our system still would have won at the cross-lingual task, but would have narrowly lost to the HCCL system on the individual languages. But we’re handicapping ourselves here: everyone got to decide on their OOV strategy as part of the task. When the SemEval workshop happens, I’ll be interested to see what strategies other people used.

What about Google and Facebook?

When people talk about semantic vectors, they generally aren’t talking about what a bunch of small research groups came up with last month. They’re talking about the big names, particularly Google’s word2vec and Facebook’s fastText.

Everyone who makes semantic vectors loves to compare to word2vec, because everyone has heard of it, and it’s so easy to beat. This should not be surprising: NLP research did not stop in 2014, but word2vec’s development did. It’s a bit hard to use word2vec as a reference point in SemEval, because if you want non-English data in word2vec, you have to go train it yourself. I’ve done that a few times, with awful results, but I’m not sure those results are representative, because of course I’m using data I can get myself, and the most interesting thing about word2vec is that you can get the benefit of it being trained on Google’s wealth of data.

A more interesting comparison is to fastText, released by Facebook Research in 2016 as a better, faster way to learn word vectors. Tomas Mikolov, the lead author on word2vec, is now part of the fastText team.

fastText has just released pre-trained vectors in a lot of languages. It’s trained only on Wikipedia, which should be a warning sign that the data is going to have a disproportionate fascination with places where 20 people live and albums that 20 people have listened to. But this lets us compare how fastText would have done in SemEval.

The fastText software has a reasonable OOV strategy — it learns about sub-word sequences of characters, and falls back on those when it doesn’t know a word — but as far as I can tell, they didn’t release the sub-word information with their pre-trained vectors. Lacking the ability to run their OOV strategy, I turned off our own OOV strategy to make a fair comparison:

Luminoso performs comfortably above word2vec and fastText in this graph.
Comparison of released word vectors on the SemEval data, without using any OOV strategy.

Note that word2vec is doing better than fastText, due to being trained on more data, but it’s only in English. Luminoso’s ConceptNet-based system, even without its OOV strategy, is doing much better than these well-known systems. And when I experiment with bolting ConceptNet’s OOV onto fastText, it only gets above the baseline system in German.

Overcoming skepticism and rejection in academic publishing

Returning to trying to publish academically, after being in the startup world for four years, was an interesting and frustrating experience. I’d like to gripe about it a bit. Feel free to skip ahead if you don’t care about my gripes about publishing.

When we first started getting world-beating results, in late 2015, we figured that they would be easy to publish. After all, people compare themselves to the “state of the art” all the time, so it’s the publication industry’s job to keep people informed about the new state of the art, right?

We got rejected three times. Once without even being reviewed, because I messed up the LaTeX boilerplate and the paper had the wrong font size. Once because a reviewer was upset that we weren’t comparing to a particular system whose performance had already been superseded (I wonder if it was his). Once because we weren’t “novel” and were just a “bag of tricks” (meanwhile, the fastText paper has “Bag of Tricks” in its title). In the intervening time, dozens of papers have claimed to be the “state of the art” with numbers lower than the ones we blogged about.

I gradually learned that how the result was framed was much more important than the actual result. It makes me appreciate what my advisors did in grad school; I used to have less interesting results than this sail through the review process, and their advice on how to frame it probably played a large role.

So this time, I worked on a paper that could be summarized with “Here’s data you can use! (And here’s why it’s good)”, instead of with “Our system is better than yours! (Here’s the data)”. AAAI finally accepted that paper for their 2017 conference, where we’ve just presented it and maybe gotten a few people’s attention, particularly with the shocking news that ConceptNet still exists.

The fad-chasers of machine learning haven’t picked up on ConceptNet Numberbatch either, maybe because it doesn’t have “2vec” in the name. (My co-worker Joanna has claimed “2vec” as her hypothetical stage name as a rapper.) And, contrary to the example of systems that are better at recognizing cat pictures, Nvidia hasn’t yet added acceleration for the vector operations we use to their GPUs. (I jest. Mostly in that you wouldn’t want to do something so memory-heavy on a GPU.)

At least in the academic world, the idea that you need knowledge graphs to support text understanding is taking hold from more sources than just us. The organizers’ baseline system (Nasari) used BabelNet, a knowledge graph that looks a lot like ConceptNet except for its restrictive license. Nasari beat a lot of the other entries, but not ours.

But academia still has its own built-in skepticism that a small company can really be the world leader in vector-based semantics. The SemEval results make it pretty clear. I’ll believe that academia has really caught up when someone graphs against us instead of word2vec the next time they say “state of the art”. (And don’t forget to put error bars or a confidence interval on it!)

How do I use ConceptNet Numberbatch?

To make it as straightforward as possible:

  • Work through any tutorial on machine learning for NLP that uses semantic vectors.
  • Get to the part where they tell you to use word2vec. (A particularly enlightened tutorial may tell you to use GloVe 1.2.)
  • Get the ConceptNet Numberbatch data, and use it instead.
  • Get better results that also generalize to other languages.

One task where we’ve demonstrated this ourselves is in solving analogy problems.

Whether this works out for you or not, tell us about it on the ConceptNet Gitter.

How does Luminoso use ConceptNet Numberbatch?

Luminoso provides software as a service for text understanding. Our data pipeline starts out with its “background knowledge”, which is very similar to ConceptNet Numberbatch, so that it has a good idea of what words mean before it sees a single sentence of your data. It then reads through your data and refines its understanding of what words and phrases mean based on how they’re used in your data, allowing it to accurately understand jargon, common misspellings, and domain-specific meanings of words.

If you rely entirely on “deep learning” to extract meaning from words, you need billions of words before it starts being accurate. Collecting billions of words is difficult, and the text you collect is probably not the text you really want to understand.

Luminoso starts out knowing everything that ConceptNet, word2vec, and GloVe know and works from there, so it can learn quickly from the smaller number of documents that you’re actually interested in. We package this all up in a visualization interface and an API that lets you understand what’s going on in your text quickly.

Announcing ConceptNet 5.5 and

ConceptNet is a large, multilingual knowledge graph about what words mean.

This is background knowledge that’s very important in NLP and machine learning, and it remains relevant in a time when the typical thing to do is to shove a terabyte or so of text through a neural net. We’ve shown that ConceptNet provides information for word embeddings that isn’t captured by purely distributional techniques like word2vec.

At Luminoso, we make software for domain-specific text understanding. We use ConceptNet to provide a base layer of general understanding, so that our machine learning can focus on quickly learning what’s interesting about text in your domain, when other techniques have to re-learn how the entire language works.

ConceptNet 5.5 is out now, with features that are particularly designed for improving word embeddings and for linking ConceptNet to other knowledge sources.

The new

With the release of ConceptNet 5.5, we’ve relaunched its website at to provide a modern, easy-to-browse view of the data in ConceptNet.

The old site was at, and I applaud the MIT Media Lab sysadmins for the fact that it keeps running and we’ve even been able to update it with new data. But none of us are at MIT anymore — we all work at Luminoso now, and it’s time for ConceptNet to make the move with us.

ConceptNet improves word embeddings

Word embeddings represent the semantics of a word or phrase as many-dimensional vectors, which are pre-computed by a neural net or some other machine learning algorithm. This is a pretty useful idea. We’ve been doing it with ConceptNet since before the term “word embeddings” was common.

When most developers need word embeddings, the first and possibly only place they look is word2vec, a neural net algorithm from Google that computes word embeddings from distributional semantics. That is, it learns to predict words in a sentence from the other words around them, and the embeddings are the representation of words that make the best predictions. But even after terabytes of text, there are aspects of word meanings that you just won’t learn from distributional semantics alone.

To pick one example, word2vec seems to think that because the terms “Red Sox” and “Yankees” appear in similar sentences, they mean basically the same thing. Not here in Boston, they don’t. Same deal with “high school” and “elementary school”. We get a lot of information from the surrounding words, which is the key idea of distributional semantics, but we need more than that.

When we take good word embeddings and add ConceptNet to them, the results are state-of-the-art on several standard evaluations of word embeddings, even outperforming recently-released systems such as FastText.

Comparing the performance of available word-embedding systems. Scores are measured by Spearman correlation with the gold standard, or (for SAT analogies) by the proportion of correct answers. The orange bar is the embeddings used in ConceptNet 5.5.

We could achieve results like this with ConceptNet 5.4 as well, but 5.5 has a big change in its representation that makes it a better match for word embeddings. In previous versions, English words were all reduced to a root form before they were even represented as a ConceptNet node. There was a node for “write”, and no node for “wrote”; a node for “dog”, and no node for “dogs”. If you had a word in its inflected form, you had to reduce it to a root form (using the same algorithm as ConceptNet) to get results. That helped for making the data more strongly-connected, but made it hard to use ConceptNet with other things.

This stemming trick only ever applied to English, incidentally. We never had a consistent way to apply it to all languages. We didn’t even really have a consistent way to apply it to English; any stemmer is either going to have to take into account the context of a sentence (which ConceptNet nodes don’t have) or be wrong some of the time. (Is “saw” a tool or the past tense of “see”?) The ambiguity and complexity just become unmanageable when other languages are in the mix.

So in ConceptNet 5.5, we’ve changed the representation of word forms. There are separate nodes for “dog” and “dogs”, but they’re connected by the “FormOf” relation, and we make sure they end up with very similar word vectors. This will make some use cases easier and others harder, but it corrects a long-standing glitch in ConceptNet’s representation, and incidentally makes it easier to directly compare ConceptNet 5.5 with other systems such as word2vec.

Solving analogies like a college applicant

ConceptNet picks the right answer to an SAT question.

One way to demonstrate that your word-embedding system has a good representation of meaning is to use it to solve word analogies. The usual example, pretty much a cliché by now, is “man : woman :: king : queen”. You want those word vectors to form something like a parallelogram in your vector space, indicating that the relationships between these words are parallel to each other, even if the system can’t explain in words what the relationship is. (And I really wish it could.)

In an earlier post, Cramming for the Test Set, I lamented that the Google analogy data that everyone’s been using to evaluate their word embeddings recently is unrepresentative, and it’s a step down in quality from what Peter Turney has been using in his analogy research since 2005. I did not succeed in finding a way to open up some good analogy data under a Creative Commons license, but I did at least contact Turney to get his data set of SAT questions.

The ConceptNet Numberbatch word embeddings, built into ConceptNet 5.5, solve these SAT analogies better than any previous system. It gets 56.4% of the questions correct. The best comparable previous system, Turney’s SuperSim (2013), got 54.8%. And we’re getting ever closer to “human-level” performance on SAT analogies — while particularly smart humans can of course get a lot more questions right, the average college applicant gets 57.0%.

We can aspire to more than being comparable to a mediocre high school student, but that’s pretty good for an AI so far!

The Semantic Web: where is it now?

By now, the words “Semantic Web” probably either make you feel sad, angry, or bored. There were a lot of promises about how all we needed to do was get everyone to put some RDF and OWL in their XML or whatever, and computers would get smarter. But few people wanted to actually do this, and it didn’t actually accomplish much when they did.

But there is a core idea of the Semantic Web that succeeded. We just don’t call it the Semantic Web anymore: we call it Linked Data. It’s the idea of sharing data, with URLs, that can explain what it means and how to connect it to other data. It’s the reason Gmail knows that you have a plane flight coming up and can locate your boarding pass. It’s the reason that editions of Wikipedia in hundreds of languages can be maintained and updated. I hear it also makes databases of medical research more interoperable, though I don’t actually know anything about that. Given that there’s this shard of the Semantic Web that does work, how about we get more semantics in it by making sure it works well with ConceptNet?

The new makes it easier to use ConceptNet as Linked Data. You can get results from its API in JSON-LD format, a new format for sharing data that blows up some ugly old technologies like RDF+XML and SPARQL, and replaces them with things people actually want to use, like JSON and REST. You can get familiar with the API by just looking at it in your Web browser — when you’re in a browser, we do a few things to make it easy to explore, like adding hyperlinks and syntax highlighting.

When I learned about JSON-LD, I noticed that it would be easy to switch ConceptNet to it, because the API looked kind of like it already. But what really convinced me to make the switch was a strongly-worded rant by W3C member Manu Sporny, which both fans and foes of Semantic Web technologies should find interesting, called “JSON-LD and Why I Hate the Semantic Web”. A key quote:

If you want to make the Semantic Web a reality, stop making the case for it and spend your time doing something more useful, like actually making machines smarter or helping people publish data in a way that’s useful to them.

Sounds like a good plan to me. We’ve shown a couple of ways that ConceptNet is making machines smarter than they could be without it, and some applications should be able to benefit even more by linking ConceptNet to other knowledge bases such as Wikidata.

Find out more

ConceptNet 5.5 can be found on the Web and on GitHub.

The ConceptNet documentation has been updated for ConceptNet 5.5, including an FAQ.

If you have questions or want more information, you can visit our new chat room on Gitter.

wordfreq 1.5: More data, more languages, more accuracy

wordfreq is a useful dataset of word frequencies in many languages, and a simple Python library that lets you look up the frequencies of words (or word-like tokens, if you want to quibble about what’s a word). Version 1.5 is now available on GitHub and PyPI.

wordfreq can rank the frequencies of nearly 400,000 English words. These are some of them.
wordfreq can rank the frequencies of nearly 400,000 English words. These are some of them.

These word frequencies don’t just come from one source; they combine many sources to take into account many different ways to use language.

Some other frequency lists just use Wikipedia because it’s easy, but then they don’t accurately represent the frequencies of words outside of an encyclopedia. The wordfreq data combines whatever data is available from Wikipedia, Google Books, Reddit, Twitter, SUBTLEX, OpenSubtitles, and the Leeds Internet Corpus. Now we’ve added one more source: as much non-English text as we could possibly find in the Common Crawl of the entire Web.

Including this data has led to some interesting changes in the new version 1.5 of wordfreq:

  • We’ve got enough data to support 9 new languages: Bulgarian, Catalan, Danish, Finnish, Hebrew, Hindi, Hungarian, Norwegian Bokmål, and Romanian.
  • Korean has been promoted from marginal to full support. In fact, none of the languages are “marginal” now: all 27 supported languages have at least three data sources and a tokenizer that’s prepared to handle that language.
  • We changed how we rank the frequencies of words when data sources disagree. We used to use the mean of the frequencies. Now we use a weighted median.

Fixing outliers

Using a weighted median of word frequencies is an important change to the data. When the Twitter data source says “oh man you guys ‘rt’ is a really common word in every language”, and the other sources say “No it’s not”, the word ‘rt’ now ends up with a much lower value in the combined list because of the median.

wordfreq can still analyze formal or informal writing without its top frequencies being spammed by things that are specific to one data source. This turned out to be essential when adding the Common Crawl: when text on the Web is translated into a lot of languages, there is an unreasonably high chance that it says “log in”, “this website uses cookies”, “select your language”, the name of another language, or is related to tourism, such as text about hotels and restaurants. We wanted to take advantage of the fact that we have a crawl of the multilingual Web, without making all of the data biased toward words that are overrepresented in that crawl.

A typical "select your language" dropdown.

The reason the median is weighted is so we can still compare frequencies of words that don’t appear in a majority of sources. If a source has never seen a word, that could just be sampling noise, so its vote of 0 for what the word’s frequency should be counts less. As a result, there are still source-specific words, just with a lower frequency than they had in wordfreq 1.4:

>>> # Some source data has split off "n't" as a
>>> # separate token
>>> wordfreq.zipf_frequency("n't", 'en', 'large')

>>> wordfreq.zipf_frequency('retweet', 'en', 'large')

>>> wordfreq.zipf_frequency('eli5', 'en', 'large')

Why use only non-English data in the Common Crawl?

Mostly to keep the amount of data manageable. While the final wordfreq lists are compressed down to kilobytes or megabytes, building these lists already requires storing and working with a lot of input.

There are terabytes of data in the Common Crawl, and while that’s not quite “big data” because it fits on a hard disk and a desktop computer can iterate through it with no problem, counting every English word in the Common Crawl would involve intermediate results that start to push the “fits on a hard disk” limit. English is doing fine because it has its own large sources, such as Google Books.

More data in more languages

A language can be represented in wordfreq when there are 3 large enough, free enough, independent sources of data for it. If there are at least 5 sources, then we also build a “large” list, containing lower-frequency words at the cost of more memory.

There are now 27 languages that make the cut. There perhaps should have been 30: the only reason Czech, Slovak, and Vietnamese aren’t included is that I neglected to download their Wikipedias before counting up data sources. Those languages should be coming soon.

Here’s another chart showing the frequencies of miscellaneous words, this time in all the languages:

A chart of selected word frequencies in 27 languages.

Getting wordfreq in your Python environment is as easy as pip install wordfreq. We hope you find this data useful in helping computers make sense of language!