Announcing ConceptNet 5.5 and conceptnet.io

ConceptNet is a large, multilingual knowledge graph about what words mean.

This is background knowledge that’s very important in NLP and machine learning, and it remains relevant in a time when the typical thing to do is to shove a terabyte or so of text through a neural net. We’ve shown that ConceptNet provides information for word embeddings that isn’t captured by purely distributional techniques like word2vec.

At Luminoso, we make software for domain-specific text understanding. We use ConceptNet to provide a base layer of general understanding, so that our machine learning can focus on quickly learning what’s interesting about text in your domain, when other techniques have to re-learn how the entire language works.

ConceptNet 5.5 is out now, with features that are particularly designed for improving word embeddings and for linking ConceptNet to other knowledge sources.

The new conceptnet.io

With the release of ConceptNet 5.5, we’ve relaunched its website at conceptnet.io to provide a modern, easy-to-browse view of the data in ConceptNet.

The old site was at conceptnet5.media.mit.edu, and I applaud the MIT Media Lab sysadmins for the fact that it keeps running and we’ve even been able to update it with new data. But none of us are at MIT anymore — we all work at Luminoso now, and it’s time for ConceptNet to make the move with us.

ConceptNet improves word embeddings

Word embeddings represent the semantics of a word or phrase as many-dimensional vectors, which are pre-computed by a neural net or some other machine learning algorithm. This is a pretty useful idea. We’ve been doing it with ConceptNet since before the term “word embeddings” was common.

When most developers need word embeddings, the first and possibly only place they look is word2vec, a neural net algorithm from Google that computes word embeddings from distributional semantics. That is, it learns to predict words in a sentence from the other words around them, and the embeddings are the representation of words that make the best predictions. But even after terabytes of text, there are aspects of word meanings that you just won’t learn from distributional semantics alone.

To pick one example, word2vec seems to think that because the terms “Red Sox” and “Yankees” appear in similar sentences, they mean basically the same thing. Not here in Boston, they don’t. Same deal with “high school” and “elementary school”. We get a lot of information from the surrounding words, which is the key idea of distributional semantics, but we need more than that.

When we take good word embeddings and add ConceptNet to them, the results are state-of-the-art on several standard evaluations of word embeddings, even outperforming recently-released systems such as FastText.

numberbatch-comparison-graph
Comparing the performance of available word-embedding systems. Scores are measured by Spearman correlation with the gold standard, or (for SAT analogies) by the proportion of correct answers. The orange bar is the embeddings used in ConceptNet 5.5.

We could achieve results like this with ConceptNet 5.4 as well, but 5.5 has a big change in its representation that makes it a better match for word embeddings. In previous versions, English words were all reduced to a root form before they were even represented as a ConceptNet node. There was a node for “write”, and no node for “wrote”; a node for “dog”, and no node for “dogs”. If you had a word in its inflected form, you had to reduce it to a root form (using the same algorithm as ConceptNet) to get results. That helped for making the data more strongly-connected, but made it hard to use ConceptNet with other things.

This stemming trick only ever applied to English, incidentally. We never had a consistent way to apply it to all languages. We didn’t even really have a consistent way to apply it to English; any stemmer is either going to have to take into account the context of a sentence (which ConceptNet nodes don’t have) or be wrong some of the time. (Is “saw” a tool or the past tense of “see”?) The ambiguity and complexity just become unmanageable when other languages are in the mix.

So in ConceptNet 5.5, we’ve changed the representation of word forms. There are separate nodes for “dog” and “dogs”, but they’re connected by the “FormOf” relation, and we make sure they end up with very similar word vectors. This will make some use cases easier and others harder, but it corrects a long-standing glitch in ConceptNet’s representation, and incidentally makes it easier to directly compare ConceptNet 5.5 with other systems such as word2vec.

Solving analogies like a college applicant

analogy-screenshot
ConceptNet picks the right answer to an SAT question.

One way to demonstrate that your word-embedding system has a good representation of meaning is to use it to solve word analogies. The usual example, pretty much a cliché by now, is “man : woman :: king : queen”. You want those word vectors to form something like a parallelogram in your vector space, indicating that the relationships between these words are parallel to each other, even if the system can’t explain in words what the relationship is. (And I really wish it could.)

In an earlier post, Cramming for the Test Set, I lamented that the Google analogy data that everyone’s been using to evaluate their word embeddings recently is unrepresentative, and it’s a step down in quality from what Peter Turney has been using in his analogy research since 2005. I did not succeed in finding a way to open up some good analogy data under a Creative Commons license, but I did at least contact Turney to get his data set of SAT questions.

The ConceptNet Numberbatch word embeddings, built into ConceptNet 5.5, solve these SAT analogies better than any previous system. It gets 56.4% of the questions correct. The best comparable previous system, Turney’s SuperSim (2013), got 54.8%. And we’re getting ever closer to “human-level” performance on SAT analogies — while particularly smart humans can of course get a lot more questions right, the average college applicant gets 57.0%.

We can aspire to more than being comparable to a mediocre high school student, but that’s pretty good for an AI so far!

The Semantic Web: where is it now?

By now, the words “Semantic Web” probably either make you feel sad, angry, or bored. There were a lot of promises about how all we needed to do was get everyone to put some RDF and OWL in their XML or whatever, and computers would get smarter. But few people wanted to actually do this, and it didn’t actually accomplish much when they did.

But there is a core idea of the Semantic Web that succeeded. We just don’t call it the Semantic Web anymore: we call it Linked Data. It’s the idea of sharing data, with URLs, that can explain what it means and how to connect it to other data. It’s the reason Gmail knows that you have a plane flight coming up and can locate your boarding pass. It’s the reason that editions of Wikipedia in hundreds of languages can be maintained and updated. I hear it also makes databases of medical research more interoperable, though I don’t actually know anything about that. Given that there’s this shard of the Semantic Web that does work, how about we get more semantics in it by making sure it works well with ConceptNet?

The new conceptnet.io makes it easier to use ConceptNet as Linked Data. You can get results from its API in JSON-LD format, a new format for sharing data that blows up some ugly old technologies like RDF+XML and SPARQL, and replaces them with things people actually want to use, like JSON and REST. You can get familiar with the API by just looking at it in your Web browser — when you’re in a browser, we do a few things to make it easy to explore, like adding hyperlinks and syntax highlighting.

When I learned about JSON-LD, I noticed that it would be easy to switch ConceptNet to it, because the API looked kind of like it already. But what really convinced me to make the switch was a strongly-worded rant by W3C member Manu Sporny, which both fans and foes of Semantic Web technologies should find interesting, called “JSON-LD and Why I Hate the Semantic Web”. A key quote:

If you want to make the Semantic Web a reality, stop making the case for it and spend your time doing something more useful, like actually making machines smarter or helping people publish data in a way that’s useful to them.

Sounds like a good plan to me. We’ve shown a couple of ways that ConceptNet is making machines smarter than they could be without it, and some applications should be able to benefit even more by linking ConceptNet to other knowledge bases such as Wikidata.

Find out more

ConceptNet 5.5 can be found on the Web and on GitHub.

The ConceptNet documentation has been updated for ConceptNet 5.5, including an FAQ.

If you have questions or want more information, you can visit our new chat room on Gitter.

Conceptnet Numberbatch: a new name for the best word embeddings you can download

Recently at Luminoso, we’ve been promoting one of the open-source, open-data products of our research: a set of semantic vectors that we made by combining ConceptNet with other data sources. As I’m launching this new ConceptNet blog, it’s a good time to promote it some more, as it shows why the knowledge in ConceptNet is more important than ever.

Semantic vectors (also known as word embeddings from a deep-learning perspective) let you compare word meanings numerically. Our vectors are measurably better for this than the well-known word2vec vectors (the ones you download from the archived word2vec project page that are trained on Google News), and it’s also measurably better than the GloVe vectors.

To be fair, this system takes word2vec and GloVe as inputs so that it can improve them. One great thing about vector representations is that you can put them together into an ensemble that’s better than its parts.

The name that we gave it when writing a paper about the system is quite a mouthful. The “ConceptNet Vector Ensemble”. I found myself stumbling over the name when giving updates on it at meetings, while trying to get people to not shorten it to “ConceptNet”, which is a much broader project. It’s hard to get this to catch on as an improvement over word2vec if it has such an anti-catchy name.

Last week, Google released an English parsing model named “Parsey McParseface”. Everybody has heard about it. Giving your machine-learning model a silly Internetty name seems to be a great idea.

And that’s why the ConceptNet Vector Ensemble is now named Conceptnet Numberbatch.

It even remains an accurate, descriptive name! I bet Google’s parser doesn’t even have a face.

What does Conceptnet Numberbatch do?

Conceptnet Numberbatch is a set of semantic vectors: it associates words and phrases in a variety of languages with lists of 600 numbers, representing the gist of what they mean.

Some of the information that these vectors represent comes from ConceptNet, a semantic network of knowledge about word meanings. ConceptNet is collected from a combination of expert-created resources, crowdsourcing, and games with a purpose.

If you want to apply machine learning to the meanings of words and sentences, you probably want your system to start out knowing what a lot of words mean. By comparing semantic vectors, you can find search results that are “near misses” that don’t exactly match the search term, you can tell when one sentence is a paraphrase of another sentence, and you can discover the general topics that are being talked about by finding clusters of vectors.

Here’s an example that we can step through. Suppose we want to ask Conceptnet Numberbatch whether Benedict Cumberbatch is more like an actor or an otter. We start by looking up the rows labeled cumberbatchactor, and otter in Numberbatch. This gives us a 600-dimensional unit vector for each of them. Here are all of them graphed component-by-component:

These are pretty hard for us to compare visually, but arrays of numbers are quite easy for computers to work with. The important thing here is that vectors that are similar will point in similar directions (which means they have a high dot product as unit vectors). When we look at them component-by-component here, that means that a vector is similar to another vector when they are positive in the same places and negative in the same places. We can visualize this similarity by multiplying the vectors component-wise:

The cumberbatch * actor plot shows a lot more positive components and fewer negative components than cumberbatch * otter, particularly near the left side. The term cumberbatch is like actor in many ways, and unlike it in very few ways. Adding up the component-wise products, we find that cumberbatch is 0.35 similar to actor on a scale from -1 to 1, and it’s only 0.04 similar to otter.

Another way to understand these vectors is to rank the semantic vectors that are most similar to them. Here are examples for the three vectors we looked at:

otter

/c/en/otter                  1.000000
/c/en/japanese_river_otter   0.993316
/c/en/european_otter         0.988882
/c/en/otterless              0.951721
/c/en/water_mammal           0.938959
/c/en/otterlike              0.872185
/c/en/otterish               0.869584
/c/en/lutrine                0.838774
/c/en/otterskin              0.833183
/c/en/waitoreke              0.694700
/c/en/musteline_mammal       0.680890
/c/en/raccoon_dog            0.608738

actor

/c/en/actor                  1.000001
/c/en/role_player            0.999875
/c/en/star_in_film           0.950550
/c/en/actorial               0.900689
/c/en/actorish               0.866238
/c/en/work_in_theater        0.853726
/c/en/star_in_movie          0.844339
/c/en/stage_actor            0.842363
/c/en/kiruna_stamell         0.813768
/c/en/actress                0.798980
/c/en/method_act             0.777413
/c/en/in_film                0.770334

cumberbatch

/c/en/cumberbatch            1.000000
/c/en/cumbermania            0.871606
/c/en/cumberbabe             0.853023
/c/en/cumberfan              0.837851
/c/en/sherlock               0.379741
/c/en/star_in_film           0.373129
/c/en/actor                  0.367241
/c/en/role_player            0.367171
/c/en/hiddlestoner           0.355940
/c/en/hiddleston             0.346617
/c/en/actorfic               0.344154
/c/en/holmes                 0.337961

We evaluated Numberbatch on several measures of semantic similarity. A system scores highly on these tests when it makes the same judgments about which words are similar to each other that a human would. Across the board, Numberbatch is the system with the most human-like similarity judgments. The code and data that support this are available on GitHub.

How does this fit into ConceptNet in general?

ConceptNet is a semantic network of knowledge about word meanings. Since 2007, long before anyone called these “word embeddings”, we’ve provided vector representations of the terms in ConceptNet that can be compared for similarity. We used to make these by decomposing the link structure of ConceptNet using SVD. Now, a variation on Faruqui et al.’s retrofitting does the job better, and that’s what Numberbatch does.

The current version of Numberbatch, 16.04, uses a transformed version of ConceptNet 5.4. It’s not available through the ConceptNet API — for now, you download Numberbatch separately from its own GitHub page.

ConceptNet 5.5 is going to arrive soon, and a new version of Numberbatch based on that data will be merged into its codebase.

Wait, why did the N become lowercase?

You sure ask the important questions, hypothetical reader. Keeping the N in ConceptNet capitalized would be more consistent, but it’d break the flow. You’d probably read “ConceptNet Numberbatch” in a way that sounds less like a double-dactyl name than “Conceptnet Numberbatch” does.

Capitalize the N if you want. Lowercase all the letters if you want. The orthography of these project names isn’t sacred anyway. ConceptNet itself originated from a project that could be called “OpenMind Commonsense”, “OpenMind CommonSense”, “Open Mind Commonsense”, or various other variations until we let it settle on four normal words, “Open Mind Common Sense”. (OMCS was named in the ’90s. Give everyone involved a break.)

Please explain the name and why otters are involved

There’s a fine Internet tradition of concocting names that sound very approximately like “Benedict Cumberbatch”, and now we’ve adopted one such name for our research. For more details, you should read A Linguist Explains the Rules of Summoning Benedict Cumberbatch on The Toast. Then, if you manage to come back from there, you should gaze upon Red Scharlach’s Otters Who Look Like Benedict Cumberbatch.

Conceptnet Numberbatch is entirely our own choice of name, and should not indicate affiliation with or endorsement by any person or any otter.

Coincidentally, back in the day, ConceptNet 3 was partly developed on a PowerMac named “otter”.

The particular otter at the top of this post was photographed by Bernard Landgraf, who has taken several excellent nature photos for Wikipedia. The photo is freely available under a Creative Commons Attribution-ShareAlike 3.0 license.

No otters were harmed in the production of this research.

An introduction to the ConceptNet Vector Ensemble

Originally published on April 6, 2016 at https://blog.luminoso.com/2016/04/06/an-introduction-to-the-conceptnet-vector-ensemble/.

Here’s a big idea that’s taken hold in natural language processing: meanings are vectors. A text-understanding system can represent the approximate meaning of a word or phrase by representing it as a vector in a multi-dimensional space. Vectors that are close to each other represent similar meanings.

A fragment of a concept-cloud visualization of the ConceptNet Vector Ensemble (CNVE). Words that appear close to each other are similar.
A fragment of a concept-cloud visualization of the ConceptNet Vector Ensemble (CNVE).

Vectors are how Luminoso has always represented meaning. When we started Luminoso, this was seen as a bit of a crazy idea.

It was an exciting time when the idea of vectors as meanings was suddenly popularized by the Google research project word2vec. Now this isn’t considered a crazy idea anymore, it’s considered the effective thing to do.

Luminoso’s starting point — its model of word meanings when it hasn’t seen any of your documents — comes from a vector-based representation of ConceptNet 5. That gives it general knowledge about what words mean. These vectors are then automatically adjusted based on the specific way that words are used in your domain.

But you might well ask: if these newer systems such as word2vec or GloVe are so effective, should we be using them as our starting point?

As the girl in the Old El Paso commercial asks, "Why don't we have both?"

The best representation of word meanings we’ve seen — and we think it’s the best representation of word meanings anyone has seen — is our new ensemble that combines ConceptNet, GloVe, PPDB, and word2vec. It’s described in our paper, “An Ensemble Method to Produce High-Quality Word Embeddings“, and it’s reproducible using this GitHub repository.

We call this the ConceptNet Vector Ensemble. These domain-general word embeddings fill the same niche as, for example, the word2vec Google News vectors, but by several measures, they represent related meanings more like people do.

A comparison of some word-embedding systems on two measures of word relatedness. Our system, CNVE, is the red dot in the upper right.
A comparison of some word-embedding systems on two measures of word relatedness. Our system, CNVE, is the red dot in the upper right.

Expanding on “retrofitting”

Manaal Faruqui’s Retrofitting, from CMU’s Language Technologies Institute, is a very cool idea.

Every system of word vectors is going to reflect the set of data it was trained on, which means there’s probably more information from outside that data that could make it better. If you’ve got a good set of word vectors, but you wish there was more information it had taken into account — particularly a knowledge graph — you can use a fairly straightforward “retrofitting” procedure to adjust the vectors accordingly.

Starting with some vectors and adjusting them based on new information — that sure sounds like what I just described about what Luminoso does, right? Faruqui’s retrofitting is not the particular process we use inside Luminoso’s products, but the general idea is related enough to Luminoso’s proprietary process that working with it was quite natural for us, and we found that it does work well.

There’s one idea from our process that can be added to retrofitting easily: if you have information about words that weren’t in your vocabulary to start with, you should automatically expand your vector space to include them.

Faruqui describes some retrofitting combinations that work well, such as combining GloVe with WordNet. I don’t think anyone had tried doing anything like this with ConceptNet before, and it turns out to be a pretty powerful source of knowledge to add. And when you add this idea of automatically expanding the vocabulary, now you can also represent all the words and phrases in ConceptNet that weren’t in the vocabulary of your original vector space, such as words in other languages.

The multilingual knowledge in ConceptNet is particularly relevant here. Our ensemble can learn more about words based on the things they translate to in languages besides English, and it can represent those words in other languages with the same kind of vectors that it uses to represent English words.

There’s clearly more to be done to extend the full power of this representation to non-English languages. It would be better, for example, if it started with some text in other languages that it could learn from and retrofit onto, instead of relying entirely on the multilingual links in ConceptNet. But it’s promising that the Spanish vectors that our ensemble learns entirely from ConceptNet, starting from having no idea what Spanish is, perform better at word similarity than a system trained on the text of the Spanish Wikipedia.

On the other hand, you have GloVe

For some reason, everyone in this niche talks about word2vec and few people talk about the similar system GloVe, from Stanford NLP. We were more drawn to GloVe as something to experiment with, as we find the way it works clearer than word2vec.

When we compared word2vec and GloVe, we got better initial results from GloVe. Levy et al. report the opposite. I think what this shows is that a whole lot of the performance of these systems is in the fine details of how you use them. And indeed, when we tweak the way we use GloVe — particularly when we borrow a process from ConceptNet to normalize words to their root form — we get word similarities that are much better than word2vec and the original GloVe, even before we retrofit anything onto it.

You can probably guess the next step: “why don’t we use both?” word2vec’s most broadly useful vectors come from Google News articles, while GloVe’s come from reading the Web at large. Those represent different kinds of information. Both of them should be in the system. In the ConceptNet Vector Ensemble, we build a vector space that combines word2vec and GloVe before we start retrofitting.

The data flow of building the ConceptNet Vector Ensemble.

You can see that creating state-of-the-art word embeddings involves ideas from a number of different people. A few of them are our own — particularly ConceptNet 5, which is entirely developed at Luminoso these days, and the various ways we transformed word embeddings to make them work better together.

This is an exciting, fast-moving area of NLP. We’re telling everyone about our vectors because the openness of word-embedding research made them possible, and if we kept our own improvement quiet, the field would probably find a way to move on without it at the cost of some unnecessary effort.

These vectors are available for download under a Creative Commons Attribution Share-Alike license. If you’re working on an application that starts from a vector representation of words — maybe you’re working in the still-congealing field of Deep Learning methods for NLP — you should give the ConceptNet Vector Ensemble a try.