Word Embeddings

Bugfix: our English-only word vectors contained the wrong data

If you have used the ConceptNet Numberbatch 17.04 word vectors, it turns out that you got very different results if you downloaded the English-only vectors versus if you used the multilingual, language-tagged vectors.

I decided to make this downloadable file of English-only vectors as a convenience, because it would be the format that looked most like a drop-in replacement for word2vec's data. But the English-only format is not a format that we use anywhere. We test our vectors, but we don't test reimporting them from all the files we exported, so that caused a bug in the export to go unnoticed.

The English-only vectors ended up labeling the rows with the wrong English words, unfortunately, making the data they contained meaningless. If you use the multilingual version, it was and still is fine.

If you use the English-only vectors, we have a new Numberbatch download, version 17.04b, that should fix the problem.

I apologize for the erroneous data, and for the setback this may have caused for anyone who is just trying to use the best word vectors they can. Thank you to the users on the conceptnet-users mailing list who drew my attention to the problem.

How Luminoso made ConceptNet into the best word vectors, and won at SemEval

I have been telling people for a while that ConceptNet is a valuable source of information for semantic vectors, or "word embeddings" as they've been called since the neural-net people showed up in 2013 and renamed everything. Let's call them "word vectors", even though they can represent phrases too. The idea is to compute a vector space where similar vectors represent words or phrases with similar meanings.

In particular, I've been pointing to results showing that our precomputed vectors, ConceptNet Numberbatch, are the state of the art in multiple languages. Now we've verified this by participating in SemEval 2017 Task 2, "Multilingual and Cross-lingual Semantic Word Similarity", and winning in a landslide.

A graph of the SemEval multilingual task results, showing the Luminoso system performing above every other system in every language, except for two systems that only submitted results in Farsi. Performance of SemEval systems on the Multilingual Word Similarity task. Our system, in blue, shows its 95% confidence interval.

A graph of the SemEval cross-lingual task results, showing the Luminoso system performing above every other system in every language pair. Performance of SemEval systems on the Cross-lingual Word Similarity task. Our system, in blue, shows its 95% confidence interval.

SemEval is a long-running evaluation of computational semantics. It does an important job of counteracting publication bias. Most people will only publish evaluations where their system performs well, but SemEval allows many groups to compete head-to-head on an evaluation they haven't seen yet, with results released all at the same time. When SemEval results come out, you can see a fair comparison of everyone's approach, with positive and negative results.

This task was a typical word-relatedness task, the same kind that we've been talking about in previous posts. You get a list of pairs of words, and your system has to assess how related they are, which is a useful thing to know in NLP applications such as search, text classification, and topic detection. The score is how well your system's responses correlate with the responses that people give.
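
To make "how well your system's responses correlate" concrete, here is a minimal sketch of scoring a system against human ratings. The choice of Spearman rank correlation is an assumption on my part (it is a common choice for word-similarity data, but the text above does not say which coefficient the task used), and the ratings below are made-up illustration values, not SemEval data.

```python
def ranks(values):
    """Return 1-based ranks, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        # Extend the run while the next value ties with the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        average_rank = (pos + end) / 2 + 1
        for k in range(pos, end + 1):
            result[order[k]] = average_rank
        pos = end + 1
    return result


def pearson(xs, ys):
    """Pearson correlation of two equal-length lists of numbers."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def spearman(xs, ys):
    """Spearman correlation: the Pearson correlation of the ranks."""
    return pearson(ranks(xs), ranks(ys))


# Hypothetical human ratings and system scores for four word pairs.
human_ratings = [3.9, 0.4, 2.5, 1.1]
system_scores = [0.81, 0.05, 0.33, 0.40]
print(round(spearman(human_ratings, system_scores), 3))  # prints 0.8
```

Because only the ranks matter, the system's scores don't need to be on the same scale as the human ratings.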

The system we submitted was not much different from the one we published and presented at AAAI 2017 and that we've been blogging about. It's the product of the long-running crowd-sourcing and linked-data effort that has gone into ConceptNet, and lots of research here at Luminoso about how to make use of it.

At a high level, it's an ensemble method that glues together multiple sources of vectors, using ConceptNet as the glue, and retrofitting (Faruqui, 2015) as the glue gun, and also building large parts of the result entirely out of the glue, a technique which worked well for me in elementary school when I had to make a diorama.
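
For readers who haven't met retrofitting, here is a minimal sketch of the plain version of the idea from Faruqui et al.: each vector is repeatedly pulled toward the vectors of its neighbors in a graph while staying anchored to its original value. This is not the variation used in Numberbatch. In particular, it skips terms that have no original vector, so it cannot build anything "entirely out of the glue". The weights (alpha = 1, beta = 1 / degree), the iteration count, and the toy data are assumptions chosen for illustration.

```python
def retrofit(original, graph, iterations=10):
    """Pull each vector toward its graph neighbors while staying near its original value."""
    vectors = {term: list(vec) for term, vec in original.items()}
    for _ in range(iterations):
        for term, neighbors in graph.items():
            known = [n for n in neighbors if n in vectors]
            if term not in original or not known:
                continue
            dims = len(original[term])
            # With alpha = 1 and beta = 1 / degree, the update is the average of
            # the original vector and the mean of the neighbors' current vectors.
            neighbor_mean = [
                sum(vectors[n][d] for n in known) / len(known) for d in range(dims)
            ]
            vectors[term] = [
                (original[term][d] + neighbor_mean[d]) / 2 for d in range(dims)
            ]
    return vectors


# Toy input: "otter" and "mammal" are linked in the graph, "actor" is not.
toy_vectors = {"otter": [1.0, 0.0], "mammal": [0.0, 1.0], "actor": [-1.0, 0.0]}
toy_graph = {"otter": ["mammal"], "mammal": ["otter"]}
retrofitted = retrofit(toy_vectors, toy_graph)
print(retrofitted["otter"], retrofitted["mammal"], retrofitted["actor"])
```

In the toy run, the two linked terms move toward each other and the unlinked term stays where it was.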

The primary goal of this SemEval task was to submit one system that performed well in multiple languages, and we did the best by far in that. Some systems only attempted one or two languages, and at least get to appear in the breakdown of the results by language. I notice that the QLUT system (I think that's the Qilu University of Technology) is in a statistical tie with us in English, but submitted no other languages, and that two Farsi-only systems did better than us in Farsi.

On the cross-lingual results (comparing words between pairs of languages), no other system came close to us, even in Farsi, showing the advantage of ConceptNet being multilingual from the ground up.

The "baseline" system submitted by the organizers was Nasari, a knowledge-graph-based system previously published in 2016. Often the baseline system is a very simplistic technique, but this baseline was fairly sophisticated and demanding, and many systems couldn't outperform it. The organizers, at least, believe that everyone in this field should be aware of what knowledge graphs can do, and it's your problem if you're not.

Don't take "OOV" for an answer

The main thing that our SemEval system added, on top of the ConceptNet Numberbatch data you can download, is a strategy for handling out-of-vocabulary words. In the end, so many NLP evaluations come down to how you unk your OOVs. Wait, I'll explain.

Most machine learning over text considers words as atomic units. So you end up with a particular vocabulary of words your system has learned about. The test data will almost certainly contain some words that the system hasn't learned; those words are "Out of Vocabulary", or "OOV".

(There are some deep learning techniques now that go down to the character level, but they're messier. And they still end up with a vocabulary of characters. I put the Unicode snowman ☃ into one of those systems and it segfaulted.)

Some publications use the dramatic cop-out of skipping all OOV words in their evaluation. That's awful. Please don't do that. I could make an NLP system whose vocabulary is the single word "chicken", and that would get it a 100% score on some OOV-skipping evaluations, but the domain of text it could understand would be quite limited (Zongker, 2002).

In general, when a system encounters an OOV word, there has to be some strategy for dealing with it. Perhaps you replace all OOV words with a single symbol for unknown words, "unk", a strategy common enough to have become a verb.

[embed]https://twitter.com/yoavgo/status/788140563015098369[/embed]

SemEval doesn't let you dodge OOV words: you need to submit some similarity value for every pair, even if your system has no idea. "Unking" would not have worked very well for comparing words. It seemed to us that a good OOV strategy would make a noticeable difference in the results. We made a couple of assumptions:

  • The most common OOV words are inflections or slight variations of words that are known.
  • Inflections are suffixes in most of the languages we deal with, so the beginning of the word is more important than the end.
  • In non-English languages, OOV words may just be borrowings from English, the modern lingua franca.

So, in cases where it doesn't help to use our previously published OOV strategy of looking up terms in ConceptNet and replacing them with their neighbors in the graph, we added these two OOV tricks:

  • Look for the word in English instead of the language it's supposed to be in.
  • Look for known words that have the longest common prefix with the unknown word.
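
The two tricks above can be sketched roughly as follows. This is an illustration under assumptions, not the code we submitted: the `(language, word)` keys, the `min_prefix` cutoff, and the tie-breaking (first match wins) are all hypothetical, and the sketch leaves out the earlier step of looking up neighbors in the ConceptNet graph.

```python
def common_prefix_length(a, b):
    """Count how many leading characters two strings share."""
    n = 0
    for ch_a, ch_b in zip(a, b):
        if ch_a != ch_b:
            break
        n += 1
    return n


def lookup_with_fallback(vectors, language, word, min_prefix=3):
    """Return (matched_key, vector), or None if every fallback fails.

    `vectors` maps (language, word) pairs to vectors. `min_prefix` is a
    hypothetical cutoff so that a one-letter overlap doesn't count as a match.
    """
    # 1. The word is known in its own language.
    if (language, word) in vectors:
        return (language, word), vectors[(language, word)]
    # 2. First trick: try the same spelling as an English word.
    if ("en", word) in vectors:
        return ("en", word), vectors[("en", word)]
    # 3. Second trick: the known word in the same language with the longest shared prefix.
    best_key, best_len = None, 0
    for key in vectors:
        known_language, known_word = key
        if known_language != language:
            continue
        shared = common_prefix_length(word, known_word)
        if shared > best_len:
            best_key, best_len = key, shared
    if best_key is not None and best_len >= min_prefix:
        return best_key, vectors[best_key]
    return None


# Toy vocabulary with made-up two-dimensional vectors.
toy = {
    ("en", "software"): [0.1, 0.9],
    ("de", "arbeiten"): [0.7, 0.2],
    ("de", "haus"): [0.3, 0.3],
}
print(lookup_with_fallback(toy, "de", "software"))   # falls back to English
print(lookup_with_fallback(toy, "de", "arbeitete"))  # falls back to "arbeiten"
print(lookup_with_fallback(toy, "de", "zug"))        # nothing usable: None
```

A linear scan over the vocabulary is fine for a sketch; a real vocabulary of this size would want a sorted list or a trie for the prefix search.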

This strategy made a difference of about 10 percent in the results. Without it, our system still would have won at the cross-lingual task, but would have narrowly lost to the HCCL system on the individual languages. But we're handicapping ourselves here: everyone got to decide on their OOV strategy as part of the task. When the SemEval workshop happens, I'll be interested to see what strategies other people used.

What about Google and Facebook?

When people talk about semantic vectors, they generally aren't talking about what a bunch of small research groups came up with last month. They're talking about the big names, particularly Google's word2vec and Facebook's fastText.

Everyone who makes semantic vectors loves to compare to word2vec, because everyone has heard of it, and it's so easy to beat. This should not be surprising: NLP research did not stop in 2014, but word2vec's development did. It's a bit hard to use word2vec as a reference point in SemEval, because if you want non-English data in word2vec, you have to go train it yourself. I've done that a few times, with awful results, but I'm not sure those results are representative, because of course I'm using data I can get myself, and the most interesting thing about word2vec is that you can get the benefit of it being trained on Google's wealth of data.

A more interesting comparison is to fastText, released by Facebook Research in 2016 as a better, faster way to learn word vectors. Tomas Mikolov, the lead author on word2vec, is now part of the fastText team.

fastText has just released pre-trained vectors in a lot of languages. It's trained only on Wikipedia, which should be a warning sign that the data is going to have a disproportionate fascination with places where 20 people live and albums that 20 people have listened to. But this lets us compare how fastText would have done in SemEval.

The fastText software has a reasonable OOV strategy -- it learns about sub-word sequences of characters, and falls back on those when it doesn't know a word -- but as far as I can tell, they didn't release the sub-word information with their pre-trained vectors. Lacking the ability to run their OOV strategy, I turned off our own OOV strategy to make a fair comparison:

Luminoso performs comfortably above word2vec and fastText in this graph. Comparison of released word vectors on the SemEval data, without using any OOV strategy.

Note that word2vec is doing better than fastText, due to being trained on more data, but it's only in English. Luminoso's ConceptNet-based system, even without its OOV strategy, is doing much better than these well-known systems. And when I experiment with bolting ConceptNet's OOV onto fastText, it only gets above the baseline system in German.

Overcoming skepticism and rejection in academic publishing

Returning to trying to publish academically, after being in the startup world for four years, was an interesting and frustrating experience. I'd like to gripe about it a bit. Feel free to skip ahead if you don't care about my gripes about publishing.

When we first started getting world-beating results, in late 2015, we figured that they would be easy to publish. After all, people compare themselves to the "state of the art" all the time, so it's the publication industry's job to keep people informed about the new state of the art, right?

We got rejected three times. Once without even being reviewed, because I messed up the LaTeX boilerplate and the paper had the wrong font size. Once because a reviewer was upset that we weren't comparing to a particular system whose performance had already been superseded (I wonder if it was his). Once because we weren't "novel" and were just a "bag of tricks" (meanwhile, the fastText paper has "Bag of Tricks" in its title). In the intervening time, dozens of papers have claimed to be the "state of the art" with numbers lower than the ones we blogged about.

I gradually learned that how the result was framed was much more important than the actual result. It makes me appreciate what my advisors did in grad school; I used to have less interesting results than this sail through the review process, and their advice on how to frame it probably played a large role.

So this time, I worked on a paper that could be summarized with "Here's data you can use! (And here's why it's good)", instead of with "Our system is better than yours! (Here's the data)". AAAI finally accepted that paper for their 2017 conference, where we've just presented it and maybe gotten a few people's attention, particularly with the shocking news that ConceptNet still exists.

The fad-chasers of machine learning haven't picked up on ConceptNet Numberbatch either, maybe because it doesn't have "2vec" in the name. (My co-worker Joanna has claimed "2vec" as her hypothetical stage name as a rapper.) And, contrary to the example of systems that are better at recognizing cat pictures, Nvidia hasn't yet added acceleration for the vector operations we use to their GPUs. (I jest. Mostly in that you wouldn't want to do something so memory-heavy on a GPU.)

At least in the academic world, the idea that you need knowledge graphs to support text understanding is taking hold from more sources than just us. The organizers' baseline system (Nasari) used BabelNet, a knowledge graph that looks a lot like ConceptNet except for its restrictive license. Nasari beat a lot of the other entries, but not ours.

But academia still has its own built-in skepticism that a small company can really be the world leader in vector-based semantics. The SemEval results make it pretty clear. I'll believe that academia has really caught up when someone graphs against us instead of word2vec the next time they say "state of the art". (And don't forget to put error bars or a confidence interval on it!)

How do I use ConceptNet Numberbatch?

To make it as straightforward as possible:

  • Work through any tutorial on machine learning for NLP that uses semantic vectors.
  • Get to the part where they tell you to use word2vec. (A particularly enlightened tutorial may tell you to use GloVe 1.2.)
  • Get the ConceptNet Numberbatch data, and use it instead.
  • Get better results that also generalize to other languages.
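
As a rough sketch of the "get the data and use it instead" step: the code below parses vectors in word2vec's plain-text layout, with a header line giving the row count and dimensionality, followed by one term per line. That the download uses this layout is an assumption here, so check the file you actually have. The function name and the three-dimensional sample are hypothetical.

```python
import io


def load_text_vectors(lines):
    """Parse word2vec-style text: a 'count dimensions' header, then one row per term."""
    lines = iter(lines)
    count, dims = (int(field) for field in next(lines).split())
    vectors = {}
    for line in lines:
        if not line.strip():
            continue  # tolerate blank lines
        fields = line.split()
        if len(fields) != dims + 1:
            raise ValueError(f"expected {dims} values for {fields[0]!r}")
        vectors[fields[0]] = [float(x) for x in fields[1:]]
    # A row-count check is a cheap way to notice a damaged or truncated file.
    if len(vectors) != count:
        raise ValueError(f"header promised {count} rows, found {len(vectors)}")
    return vectors


# A tiny in-memory stand-in for a real vector file.
sample = io.StringIO("2 3\notter 0.1 0.2 0.3\nactor 0.4 0.5 0.6\n")
vectors = load_text_vectors(sample)
print(sorted(vectors))  # prints ['actor', 'otter']
```

For a gzipped file on disk, passing the result of `gzip.open(path, "rt", encoding="utf-8")` to the same function should work, since it only needs an iterable of text lines.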

One task where we've demonstrated this ourselves is in solving analogy problems.

Whether this works out for you or not, tell us about it on the ConceptNet Gitter.

How does Luminoso use ConceptNet Numberbatch?

Luminoso provides software as a service for text understanding. Our data pipeline starts out with its "background knowledge", which is very similar to ConceptNet Numberbatch, so that it has a good idea of what words mean before it sees a single sentence of your data. It then reads through your data and refines its understanding of what words and phrases mean based on how they're used in your data, allowing it to accurately understand jargon, common misspellings, and domain-specific meanings of words.

If you rely entirely on "deep learning" to extract meaning from words, you need billions of words before it starts being accurate. Collecting billions of words is difficult, and the text you collect is probably not the text you really want to understand.

Luminoso starts out knowing everything that ConceptNet, word2vec, and GloVe know and works from there, so it can learn quickly from the smaller number of documents that you're actually interested in. We package this all up in a visualization interface and an API that lets you understand what's going on in your text quickly.

Yes, people do want pre-computed word embeddings

The very informative tutorial by Vlad Niculae on Word Mover's Distance in Python includes this step:

We could train the embeddings ourselves, but for meaningful results we would need tons of documents, and that might take a while. So let’s just use the ones from the word2vec team.

I couldn't have asked for a better justification for ConceptNet and Luminoso in two sentences.

When I present new results from Conceptnet Numberbatch, which works way better than word2vec alone, one objection I hear is that the embeddings are pre-computed and aren't based on your data. (Luminoso is a SaaS platform that retrains them to your data, in the cases where you do need that.)

Pre-baked embeddings are useful. People are resigning themselves to using word2vec's pre-baked embeddings because they don't know they can have better ones. I dream of the day when someone writing a new tutorial like this says "So let's just use Conceptnet Numberbatch."

Cramming for the test set: We need better ways to evaluate analogies

The publication of word2vec (as "Efficient Estimation of Word Representations in Vector Space" by Mikolov et al.) got a considerable amount of attention by demonstrating that a representation designed to predict words in context could also be used to predict analogies between words. The word2vec authors demonstrated this by including their own corpus of analogies for evaluation. Since then, other representations have been evaluated against that same corpus.

But a word representation that is better at capturing general knowledge of the relationships between things won't necessarily do better on Mikolov et al.'s evaluation. That evaluation tests numerous examples of only a few types of analogies:

  • Geographical facts, such as “Athens : Greece :: Baghdad : Iraq”
  • Gender-swapping analogies, such as “man : woman :: king : queen”
  • Names of international currency, such as “Angola : kwanza :: Armenia : dram”
  • Morphological relationships, such as “free : freely :: happy : happily”
  • Factoids about multi-word named entities, such as “Baltimore : Baltimore Sun :: Cleveland : Cleveland Plain Dealer”

The multi-word named entities are usually considered separately. Even word2vec, which this evaluation was designed to evaluate, required a differently-trained vector space to be able to get entities like "Cleveland Plain Dealer" into its vocabulary.

Conceptnet Numberbatch and analogy questions

I've been posting about the state-of-the-art set of word embeddings, Conceptnet Numberbatch, and you might wonder how it does on word2vec's analogies. So even though I'm not a big fan of the word2vec analogy data, I ran a quick evaluation to find out, using Omer Levy's 3CosMul metric for choosing the best analogies. Here's how it scored, broken down by the type of question:

  • Geography: 95.6%
  • Gender: 95.8%
  • Currency: 45.5%
  • Morphology: ???
  • Multi-word: 2.2% (most terms are out-of-vocabulary)
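
For reference, here is a minimal sketch of the 3CosMul metric mentioned above, as I understand it from Levy and Goldberg's description: for a question "a : a* :: b : ?", each candidate is scored by multiplying its similarity to a* and to b, then dividing by its similarity to a, with cosines shifted into the range 0 to 1. The three-dimensional toy vectors and the epsilon value are illustration choices, and this is not the evaluation code behind the scores listed above.

```python
import math


def cosine(u, v):
    """Cosine similarity of two vectors given as lists of floats."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(x * x for x in v))
    return dot / (norm_u * norm_v)


def three_cos_mul(vectors, a, a_star, b, epsilon=0.001):
    """Answer 'a : a_star :: b : ?' by maximizing the 3CosMul score."""

    def shifted(u, v):
        # Map cosine from [-1, 1] to [0, 1] so the product and ratio behave.
        return (cosine(u, v) + 1) / 2

    best_word, best_score = None, float("-inf")
    for word, vec in vectors.items():
        if word in (a, a_star, b):
            continue  # the question words themselves are not valid answers
        score = (
            shifted(vec, vectors[a_star])
            * shifted(vec, vectors[b])
            / (shifted(vec, vectors[a]) + epsilon)
        )
        if score > best_score:
            best_word, best_score = word, score
    return best_word


# Made-up vectors whose components loosely mean [royal, male, female].
toy = {
    "man": [0.1, 1.0, 0.0],
    "woman": [0.1, 0.0, 1.0],
    "king": [1.0, 1.0, 0.0],
    "queen": [1.0, 0.0, 1.0],
    "prince": [1.0, 0.8, 0.0],
    "otter": [0.0, 0.5, 0.5],
}
print(three_cos_mul(toy, "man", "woman", "king"))  # prints queen
```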

Let's talk about the question marks next to "Morphology". It doesn't make sense to ask Numberbatch about morphology. Like most English NLP systems but unlike word2vec, Numberbatch expects morphology to be handled as a separate step. This is a better plan than forgetting everything we know about morphology and hoping the system can rediscover it.

The overwhelming majority of the morphology questions look like "write : writes :: work : works". Notice that answering this question involves nothing about the meanings of the words "write" and "work". In fact, the less a system knows about meaning, the less there will be to distract it from its morphological task of adding the letter "s".

Numberbatch has the same representation for "write" and "writes", and I think this is reasonable for a system focused on semantics. They have the same meaning, just different morphology. If you want to do morphology, ask a lemmatizer.

So Numberbatch does well on some categories, and it could probably be tuned to do better. But I think this tuning would be counterproductive, because it would reward memorized facts over general knowledge.

Teaching to the test

word2vec's evaluation was a fine demonstration of the capabilities of word2vec when it was published, but it doesn't make much sense as a gold standard.

I believe that a system that aces the whole evaluation could be made out of existing tools, and it wouldn't have very much to do with semantic vectors. Given the analogy A : B :: C : D, it would just look up A and B in Wikipedia and Wiktionary, find connections between them, and return the thing that C is connected to in the same way. Using a pre-parsed version of Wikipedia and Wiktionary would help, and those are things I've been working with. You could add in a lemmatizer, but the best lemmatizers are basically condensed versions of Wiktionary anyway.
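
To show how little semantics that would take, here is a toy sketch of the look-it-up approach. The fact list stands in for a pre-parsed Wikipedia and Wiktionary, and the relation names are hypothetical labels I made up for the illustration.

```python
def solve_by_lookup(triples, a, b, c):
    """Find a relation that links a to b, then follow the same relation from c."""
    relations = [rel for (start, rel, end) in triples if start == a and end == b]
    for wanted in relations:
        for start, rel, end in triples:
            if start == c and rel == wanted:
                return end
    return None  # no stored fact connects the words in the same way


# A hand-written stand-in for facts extracted from a reference source.
facts = [
    ("Athens", "capital_of", "Greece"),
    ("Baghdad", "capital_of", "Iraq"),
    ("Angola", "currency", "kwanza"),
    ("Armenia", "currency", "dram"),
]
print(solve_by_lookup(facts, "Athens", "Greece", "Baghdad"))  # prints Iraq
print(solve_by_lookup(facts, "Angola", "kwanza", "Armenia"))  # prints dram
```

There are no vectors anywhere in it, which is the point.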

This would be a silly thing to make. It's like telling a human student exactly what's on the test, and letting them bring as many notes as they want. Nothing is left but a test of ability to look things up.

From a machine learning point of view, you might call it "training on the test set", but I don't think it's quite the same thing. There's no training step involved here. Call it "cramming for the test set" instead. The analogy evaluation is a test of whether your system knows facts and morphology, so knowing facts and morphology is how you succeed at it.

Let's put this back in perspective, though. The reason the word2vec paper was remarkable is that word2vec wasn't designed to know facts, or even to be able to make analogies at all. It was designed to predict words in the context of other words, and it happened to be able to make analogies. That was the cool part.

Now that we expect word vectors to be able to form analogies, let's expect more from our analogies.

English tests for people and computers

Above, I compared a computer running an evaluation to a human learner taking a test. If you want to test whether a human understands analogies, you don't ask them 10,000 questions about geography. You ask them a lot of different things. So I went looking for analogy tests for people.

I think these kinds of analogy "equations" are falling out of favor in education, probably for good reason. They're artificial and they have a lot to do with test-taking skills. They're not on the SAT anymore, so if you really want to know whether a high-schooler gets analogies, now you use a separate test called the Miller Analogy Test. I think they're still pretty reasonable for computers. Computers like equations, and they have mad test-taking skills.

Here are some simple analogies that a semantic representation should be able to make, which I found on a website of resources for English teachers:

  • mouth : eat :: feet : walk
  • awful : bad :: fantastic : good
  • brick : wall :: page : book
  • poor : money :: sad : happiness
  • June : July :: Monday : Tuesday
  • umbrella : rain :: sunscreen : sun

And here are some more difficult ones, from a test-prep book for the Miller Analogy Test:

  • articulate : speech :: coordinated : movement
  • inception : conclusion :: departure : arrival
  • scintillating : dullness :: boisterous : calm
  • elucidate : clarity :: illuminate : light
  • shard : pottery :: splinter : wood
  • attenuate : signal :: dampen : enthusiasm

These examples of analogies from tests also come with multiple-choice distractors, in contrast to the word2vec evaluation, where the vocabulary of all the questions is used as the set of distractors.

Unlike geographical facts, these questions don't have answers that can simply be looked up. There's no data set that would name the relationship between "articulate" and "speech" for you in such a way that you can apply the same relationship to "coordinated". You need a system that can discover a representation of that relationship, and that's what a good set of semantic vectors can do.

It seems that we can evaluate our semantic systems by giving them tests that were originally designed for people. This approach to semantic evaluation has been used, for example, by Peter Turney, who used SAT questions in "A Uniform Approach to Analogies, Synonyms, Antonyms, and Associations" and related publications.

And now for the big problem: people who write test questions write them under extremely restrictive terms of use. I'd better hope fair use really exists so I can even quote twelve of them here. Turney's results can no longer be reproduced, through no fault of his, because he is not allowed to distribute his test data.

It would be great if someone who wrote test-prep questions would cooperate with the NLP community and make some of their questions available as an evaluation. I tried e-mailing the website that had the first set of questions on it. I never got a response, and I assume they're filtering my e-mail as "Strange AI guy" now.

Making it possible to evaluate analogies

There are some great data sets out there about word similarities. MEN-3000, Rare Words, and WordSim-353 are all good examples. They're in convenient text formats, they're usually split into development and test sets, and they're free to redistribute so that your experiments are reproducible.

There should be a way to get analogies up to the same standard. I've heard that other people who do this kind of semantics are also looking for a good analogy evaluation. We could get an evaluation corpus the traditional way, with human effort, and divide up the task of making an analogy test for computers among researchers and their students. It wouldn't be enough for one person or one research group to write all the questions, because they would only write the kinds of questions they expect to be able to handle.

If there were a grant that could fund this, we could more straightforwardly spend money on the problem: we could buy the rights to these test-prep materials from somebody, so that we can convert them into convenient evaluation data, use them, and release them under a Creative Commons license.

Whether their preference is for neural networks, semantic graphs, or logical inferences, many schools of thought on computational semantics agree that analogies are an interesting and relevant task. We should take the opportunity to make our progress on this task measurable and reproducible by obtaining an open, sufficiently general corpus of analogies.

Conceptnet Numberbatch: a new name for the best word embeddings you can download

Recently at Luminoso, we've been promoting one of the open-source, open-data products of our research: a set of semantic vectors that we made by combining ConceptNet with other data sources. As I'm launching this new ConceptNet blog, it's a good time to promote it some more, as it shows why the knowledge in ConceptNet is more important than ever.

Semantic vectors (also known as word embeddings from a deep-learning perspective) let you compare word meanings numerically. Our vectors are measurably better for this than the well-known word2vec vectors (the ones you download from the archived word2vec project page that are trained on Google News), and they're also measurably better than the GloVe vectors.

To be fair, this system takes word2vec and GloVe as inputs so that it can improve them. One great thing about vector representations is that you can put them together into an ensemble that's better than its parts.

The name that we gave it when writing a paper about the system is quite a mouthful: the "ConceptNet Vector Ensemble". I found myself stumbling over the name when giving updates on it at meetings, while trying to get people to not shorten it to "ConceptNet", which is a much broader project. It's hard to get this to catch on as an improvement over word2vec if it has such an anti-catchy name.

Last week, Google released an English parsing model named “Parsey McParseface”. Everybody has heard about it. Giving your machine-learning model a silly Internetty name seems to be a great idea.

And that's why the ConceptNet Vector Ensemble is now named Conceptnet Numberbatch.

It even remains an accurate, descriptive name! I bet Google's parser doesn't even have a face.

What does Conceptnet Numberbatch do?

Conceptnet Numberbatch is a set of semantic vectors: it associates words and phrases in a variety of languages with lists of 600 numbers, representing the gist of what they mean.

Some of the information that these vectors represent comes from ConceptNet, a semantic network of knowledge about word meanings. ConceptNet is collected from a combination of expert-created resources, crowdsourcing, and games with a purpose.

If you want to apply machine learning to the meanings of words and sentences, you probably want your system to start out knowing what a lot of words mean. By comparing semantic vectors, you can find search results that are "near misses" that don't exactly match the search term, you can tell when one sentence is a paraphrase of another sentence, and you can discover the general topics that are being talked about by finding clusters of vectors.

Here's an example that we can step through. Suppose we want to ask Conceptnet Numberbatch whether Benedict Cumberbatch is more like an actor or an otter. We start by looking up the rows labeled cumberbatch, actor, and otter in Numberbatch. This gives us a 600-dimensional unit vector for each of them. Here are all of them graphed component-by-component:

These are pretty hard for us to compare visually, but arrays of numbers are quite easy for computers to work with. The important thing here is that vectors that are similar will point in similar directions (which means they have a high dot product as unit vectors). When we look at them component-by-component here, that means that a vector is similar to another vector when they are positive in the same places and negative in the same places. We can visualize this similarity by multiplying the vectors component-wise:

The cumberbatch * actor plot shows a lot more positive components and fewer negative components than cumberbatch * otter, particularly near the left side. The term cumberbatch is like actor in many ways, and unlike it in very few ways. Adding up the component-wise products, we find that cumberbatch is 0.35 similar to actor on a scale from -1 to 1, and it's only 0.04 similar to otter.
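
The arithmetic described above can be sketched in a few lines. The four-dimensional vectors here are made-up stand-ins for the real 600-dimensional rows, so the numbers they produce are not the 0.35 and 0.04 quoted above. Only the procedure is the same.

```python
import math


def normalize(vec):
    """Scale a vector to unit length."""
    length = math.sqrt(sum(x * x for x in vec))
    return [x / length for x in vec]


def componentwise_product(u, v):
    """Positive where both vectors agree in sign, negative where they disagree."""
    return [x * y for x, y in zip(u, v)]


def similarity(u, v):
    """Dot product of unit vectors: the sum of the component-wise products."""
    return sum(componentwise_product(normalize(u), normalize(v)))


# Hypothetical toy vectors, not taken from Numberbatch.
cumberbatch = [0.5, 0.4, -0.1, 0.2]
actor = [0.6, 0.3, 0.0, 0.1]
otter = [-0.2, 0.1, 0.7, 0.1]

print(componentwise_product(normalize(cumberbatch), normalize(actor)))
print(similarity(cumberbatch, actor))
print(similarity(cumberbatch, otter))
```

With these toy values, the first similarity comes out much higher than the second, mirroring the shape of the real comparison.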

Another way to understand these vectors is to rank the semantic vectors that are most similar to them. Here are examples for the three vectors we looked at:

otter
/c/en/otter                  1.000000
/c/en/japanese_river_otter   0.993316
/c/en/european_otter         0.988882
/c/en/otterless              0.951721
/c/en/water_mammal           0.938959
/c/en/otterlike              0.872185
/c/en/otterish               0.869584
/c/en/lutrine                0.838774
/c/en/otterskin              0.833183
/c/en/waitoreke              0.694700
/c/en/musteline_mammal       0.680890
/c/en/raccoon_dog            0.608738

actor
/c/en/actor                  1.000001
/c/en/role_player            0.999875
/c/en/star_in_film           0.950550
/c/en/actorial               0.900689
/c/en/actorish               0.866238
/c/en/work_in_theater        0.853726
/c/en/star_in_movie          0.844339
/c/en/stage_actor            0.842363
/c/en/kiruna_stamell         0.813768
/c/en/actress                0.798980
/c/en/method_act             0.777413
/c/en/in_film                0.770334

cumberbatch
/c/en/cumberbatch            1.000000
/c/en/cumbermania            0.871606
/c/en/cumberbabe             0.853023
/c/en/cumberfan              0.837851
/c/en/sherlock               0.379741
/c/en/star_in_film           0.373129
/c/en/actor                  0.367241
/c/en/role_player            0.367171
/c/en/hiddlestoner           0.355940
/c/en/hiddleston             0.346617
/c/en/actorfic               0.344154
/c/en/holmes                 0.337961
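
Rankings like these can be produced by taking the dot product of one row with every other row and sorting. Here is a minimal sketch, assuming the rows are already unit vectors as described above. The two-dimensional toy rows are invented, and a real implementation would use a matrix library instead of looping in Python.

```python
def most_similar(vectors, query, limit=12):
    """Rank all terms by dot product with the query term's vector, best first."""
    target = vectors[query]
    scored = [
        (term, sum(x * y for x, y in zip(target, vec)))
        for term, vec in vectors.items()
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:limit]


# Made-up unit vectors keyed by ConceptNet-style term URIs.
toy = {
    "/c/en/otter": [1.0, 0.0],
    "/c/en/european_otter": [0.8, 0.6],
    "/c/en/actor": [0.0, 1.0],
}
for term, score in most_similar(toy, "/c/en/otter"):
    print(f"{term:<28} {score:.6f}")
```

As in the lists above, the query term itself comes out on top with a similarity of 1.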

We evaluated Numberbatch on several measures of semantic similarity. A system scores highly on these tests when it makes the same judgments about which words are similar to each other that a human would. Across the board, Numberbatch is the system with the most human-like similarity judgments. The code and data that support this are available on GitHub.

How does this fit into ConceptNet in general?

ConceptNet is a semantic network of knowledge about word meanings. Since 2007, long before anyone called these "word embeddings", we've provided vector representations of the terms in ConceptNet that can be compared for similarity. We used to make these by decomposing the link structure of ConceptNet using SVD. Now, a variation on Faruqui et al.'s retrofitting does the job better, and that's what Numberbatch does.

The current version of Numberbatch, 16.04, uses a transformed version of ConceptNet 5.4. It's not available through the ConceptNet API -- for now, you download Numberbatch separately from its own GitHub page.

ConceptNet 5.5 is going to arrive soon, and a new version of Numberbatch based on that data will be merged into its codebase.

Wait, why did the N become lowercase?

You sure ask the important questions, hypothetical reader. Keeping the N in ConceptNet capitalized would be more consistent, but it'd break the flow. You'd probably read "ConceptNet Numberbatch" in a way that sounds less like a double-dactyl name than "Conceptnet Numberbatch" does.

Capitalize the N if you want. Lowercase all the letters if you want. The orthography of these project names isn't sacred anyway. ConceptNet itself originated from a project that could be called "OpenMind Commonsense", "OpenMind CommonSense", "Open Mind Commonsense", or other variations until we let it settle on four normal words, "Open Mind Common Sense". (OMCS was named in the '90s. Give everyone involved a break.)

Please explain the name and why otters are involved

There's a fine Internet tradition of concocting names that sound very approximately like "Benedict Cumberbatch", and now we've adopted one such name for our research. For more details, you should read "A Linguist Explains the Rules of Summoning Benedict Cumberbatch" on The Toast. Then, if you manage to come back from there, you should gaze upon Red Scharlach's "Otters Who Look Like Benedict Cumberbatch".

Conceptnet Numberbatch is entirely our own choice of name, and should not indicate affiliation with or endorsement by any person or any otter.

Coincidentally, back in the day, ConceptNet 3 was partly developed on a PowerMac named "otter".

The particular otter at the top of this post was photographed by Bernard Landgraf, who has taken several excellent nature photos for Wikipedia. The photo is freely available under a Creative Commons Attribution-ShareAlike 3.0 license.

No otters were harmed in the production of this research.

An introduction to the ConceptNet Vector Ensemble

Originally published on April 6, 2016.

Here's a big idea that's taken hold in natural language processing: meanings are vectors. A text-understanding system can represent the approximate meaning of a word or phrase by representing it as a vector in a multi-dimensional space. Vectors that are close to each other represent similar meanings.
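As a minimal sketch of what "close to each other" can mean, here is cosine similarity, a common way to compare word vectors, along with a function that ranks the nearest neighbors of a term. The three-dimensional vectors and the function names are invented for illustration; real word vectors have hundreds of dimensions.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # a zero vector has no direction to compare
    return dot / (norm_u * norm_v)

def most_similar(term, vectors, n=3):
    """Rank every other term by cosine similarity to `term`, best first."""
    target = vectors[term]
    scored = [
        (other, cosine_similarity(target, vec))
        for other, vec in vectors.items()
        if other != term
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:n]

# Toy vectors, made up for this example.
toy_vectors = {
    "/c/en/otter": [0.9, 0.1, 0.0],
    "/c/en/weasel": [0.8, 0.2, 0.1],
    "/c/en/actor": [0.0, 0.1, 0.9],
}
neighbors = most_similar("/c/en/otter", toy_vectors, n=2)
```

With these toy numbers, "/c/en/weasel" ranks well ahead of "/c/en/actor" as a neighbor of "/c/en/otter", because its vector points in nearly the same direction.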

A fragment of a concept-cloud visualization of the ConceptNet Vector Ensemble (CNVE). Words that appear close to each other are similar.

Vectors are how Luminoso has always represented meaning. When we started Luminoso, this was seen as a bit of a crazy idea.

It was an exciting time when the idea of vectors as meanings was suddenly popularized by the Google research project word2vec. Now this isn't considered a crazy idea anymore; it's considered the effective thing to do.

Luminoso's starting point -- its model of word meanings when it hasn't seen any of your documents -- comes from a vector-based representation of ConceptNet 5. That gives it general knowledge about what words mean. These vectors are then automatically adjusted based on the specific way that words are used in your domain.

But you might well ask: if these newer systems such as word2vec or GloVe are so effective, should we be using them as our starting point?

As the girl in the Old El Paso commercial asks, "Why don't we have both?"

The best representation of word meanings we've seen -- and we think it's the best representation of word meanings anyone has seen -- is our new ensemble that combines ConceptNet, GloVe, PPDB, and word2vec. It's described in our paper, "An Ensemble Method to Produce High-Quality Word Embeddings", and it's reproducible using this GitHub repository.

We call this the ConceptNet Vector Ensemble. These domain-general word embeddings fill the same niche as, for example, the word2vec Google News vectors, but by several measures, they represent related meanings more like people do.

A comparison of some word-embedding systems on two measures of word relatedness. Our system, CNVE, is the red dot in the upper right.

Expanding on "retrofitting"

Manaal Faruqui's Retrofitting, from CMU's Language Technologies Institute, is a very cool idea.

Every system of word vectors is going to reflect the set of data it was trained on, which means there's probably more information from outside that data that could make it better. If you've got a good set of word vectors, but you wish there was more information it had taken into account -- particularly a knowledge graph -- you can use a fairly straightforward "retrofitting" procedure to adjust the vectors accordingly.

Starting with some vectors and adjusting them based on new information -- that sure sounds like what I just described about what Luminoso does, right? Faruqui's retrofitting is not the particular process we use inside Luminoso's products, but the general idea is related enough to Luminoso's proprietary process that working with it was quite natural for us, and we found that it does work well.

There's one idea from our process that can be added to retrofitting easily: if you have information about words that weren't in your vocabulary to start with, you should automatically expand your vector space to include them.

Faruqui describes some retrofitting combinations that work well, such as combining GloVe with WordNet. I don't think anyone had tried doing anything like this with ConceptNet before, and it turns out to be a pretty powerful source of knowledge to add. And when you add this idea of automatically expanding the vocabulary, now you can also represent all the words and phrases in ConceptNet that weren't in the vocabulary of your original vector space, such as words in other languages.
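Here is a rough sketch of retrofitting with vocabulary expansion, in the spirit of Faruqui et al.'s iterative update: each term moves toward the average of its neighbors in the knowledge graph while staying anchored to its original vector. It is a simplification under our own assumptions, not the exact variation used in the ensemble. The weights (1 for the original vector, 1/degree for each neighbor) follow the common default, the graph is assumed to list each edge in both directions, and the function name and toy data are hypothetical.

```python
def retrofit(original, graph, iterations=10):
    """Retrofit `original` vectors onto `graph`, adding vectors for new terms.

    original: dict mapping term -> list of floats
    graph:    dict mapping term -> list of neighboring terms
    """
    dim = len(next(iter(original.values())))

    # Vocabulary expansion: every term mentioned in the graph gets a vector,
    # starting at zero if it was not in the original space.
    vocab = set(original) | set(graph)
    for neighbor_list in graph.values():
        vocab.update(neighbor_list)
    vectors = {t: list(original.get(t, [0.0] * dim)) for t in vocab}

    for _ in range(iterations):
        for term in sorted(vocab):
            neighbor_list = graph.get(term, [])
            if not neighbor_list:
                continue  # nothing to learn from; keep the current vector
            # alpha anchors a term to its original vector; new terms have
            # no original vector, so they are pure averages of neighbors.
            alpha = 1.0 if term in original else 0.0
            beta = 1.0 / len(neighbor_list)
            total = [alpha * x for x in original.get(term, [0.0] * dim)]
            for neighbor in neighbor_list:
                for d in range(dim):
                    total[d] += beta * vectors[neighbor][d]
            weight = alpha + beta * len(neighbor_list)
            vectors[term] = [x / weight for x in total]
    return vectors

# Toy example: "gato" is not in the original space, but the graph says
# it is linked to "cat".
original = {"cat": [1.0, 0.0], "dog": [0.0, 1.0]}
graph = {"cat": ["gato"], "gato": ["cat"]}
expanded = retrofit(original, graph)
```

In this toy run, "gato" starts with no vector at all and ends up almost exactly where "cat" is, which is the effect that lets words in other languages join the space. "dog" has no edges, so it keeps its original vector.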

The multilingual knowledge in ConceptNet is particularly relevant here. Our ensemble can learn more about words based on the things they translate to in languages besides English, and it can represent those words in other languages with the same kind of vectors that it uses to represent English words.

There's clearly more to be done to extend the full power of this representation to non-English languages. It would be better, for example, if it started with some text in other languages that it could learn from and retrofit onto, instead of relying entirely on the multilingual links in ConceptNet. But it's promising that the Spanish vectors that our ensemble learns entirely from ConceptNet, starting from having no idea what Spanish is, perform better at word similarity than a system trained on the text of the Spanish Wikipedia.

On the other hand, you have GloVe

For some reason, everyone in this niche talks about word2vec and few people talk about the similar system GloVe, from Stanford NLP. We were more drawn to GloVe as something to experiment with, as we find the way it works clearer than word2vec.

When we compared word2vec and GloVe, we got better initial results from GloVe. Levy et al. report the opposite. I think what this shows is that a whole lot of the performance of these systems is in the fine details of how you use them. And indeed, when we tweak the way we use GloVe -- particularly when we borrow a process from ConceptNet to normalize words to their root form -- we get word similarities that are much better than those from word2vec and the original GloVe, even before we retrofit anything onto it.

You can probably guess the next step: "why don't we use both?" word2vec's most broadly useful vectors come from Google News articles, while GloVe's come from reading the Web at large. Those represent different kinds of information. Both of them should be in the system. In the ConceptNet Vector Ensemble, we build a vector space that combines word2vec and GloVe before we start retrofitting.
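This post doesn't spell out how the two spaces are merged, so the following is only one simple possibility, not the ensemble's actual procedure: scale each vector to unit length so neither source dominates, then concatenate the two vectors for every term that both spaces share. The function names and toy vectors are hypothetical.

```python
import math

def unit(vec):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm else list(vec)

def combine_spaces(space_a, space_b):
    """Concatenate unit-normalized vectors for terms found in both spaces."""
    shared = set(space_a) & set(space_b)
    return {t: unit(space_a[t]) + unit(space_b[t]) for t in sorted(shared)}

# Toy stand-ins for two embedding spaces with different dimensionality.
toy_word2vec = {"otter": [3.0, 4.0], "actor": [1.0, 0.0]}
toy_glove = {"otter": [0.0, 2.0, 0.0], "weasel": [1.0, 1.0, 1.0]}
combined = combine_spaces(toy_word2vec, toy_glove)
```

Only "otter" appears in both toy spaces, so it is the only term in the result, with a five-dimensional vector. A real pipeline would also need to decide what to do with terms that appear in only one space, and would likely reduce the dimensionality afterward.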

The data flow of building the ConceptNet Vector Ensemble.

You can see that creating state-of-the-art word embeddings involves ideas from a number of different people. A few of them are our own -- particularly ConceptNet 5, which is entirely developed at Luminoso these days, and the various ways we transformed word embeddings to make them work better together.

This is an exciting, fast-moving area of NLP. We're telling everyone about our vectors because the openness of word-embedding research made them possible, and if we kept our own improvement quiet, the field would probably find a way to move on without it at the cost of some unnecessary effort.

These vectors are available for download under a Creative Commons Attribution Share-Alike license. If you're working on an application that starts from a vector representation of words -- maybe you're working in the still-congealing field of Deep Learning methods for NLP -- you should give the ConceptNet Vector Ensemble a try.