# Tutorial: Distinguishing attributes using ConceptNet

In a previous post, we mentioned the good results that systems built using ConceptNet got at SemEval this year. One of those systems was our own entry to the “Capturing Discriminative Attributes” task, about determining differences in meanings between words.

The system we submitted took second place by combining information from ConceptNet, WordNet, Wikipedia, and Google Books. That system has some messy dependencies and fiddly details, so in this tutorial, we’re going to build a much simpler version of the system that also performs well.

### Distinguishing attributes the simple way

Our poster, a prettier version of our SemEval paper, mainly presents the full version of the system, the one that uses five different methods of distinguishing attributes and combines them all in an SVM classifier. But here, I particularly want you to take note of the “ConceptNet is all you need” section, describing a simpler version we discovered while evaluating what made the full system work.

It seems that, instead of using five kinds of features, we may have been able to do just as well using just the pre-trained embeddings we call ConceptNet Numberbatch. So we’ll build that system here, using the ConceptNet Numberbatch data and a small amount of code, with only common dependencies (pandas and sklearn).

In [1]:
from sklearn.metrics import f1_score
import numpy as np
import pandas as pd


I want you to be able to reproduce this result, so I’ve put the SemEval data files, along with the exact version of ConceptNet Numberbatch we were using, in a zip file on my favorite scientific data hosting service, Zenodo.

These shell commands should serve the purpose of downloading and extracting that data, if the wget and unzip commands are available on your system.

In [2]:
!wget https://zenodo.org/record/1289942/files/conceptnet-distinguishing-attributes-data.zip

--2018-06-15 10:47:28--  https://zenodo.org/record/1289942/files/conceptnet-distinguishing-attributes-data.zip
Resolving zenodo.org (zenodo.org)... 137.138.76.77
Connecting to zenodo.org (zenodo.org)|137.138.76.77|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 2457101853 (2.3G) [application/octet-stream]
Saving to: ‘conceptnet-distinguishing-attributes-data.zip’

conceptnet-distingu 100%[===================>]   2.29G  2.88MB/s    in 10m 9s

2018-06-15 10:57:38 (3.85 MB/s) - ‘conceptnet-distinguishing-attributes-data.zip’ saved [2457101853/2457101853]


In [8]:
!unzip conceptnet-distinguishing-attributes-data.zip

Archive:  conceptnet-distinguishing-attributes-data.zip
inflating: numberbatch-20180108-biased.h5
inflating: discriminatt-test.txt
inflating: discriminatt-train.txt
inflating: discriminatt-validation.txt


In our actual solution, we imported some utilities from the ConceptNet5 codebase. In this simplified version, we’ll re-define the utilities that we need.

In [9]:
def text_to_uri(text):
    """
    An extremely cut-down version of ConceptNet's standardized_concept_uri.
    Converts a term such as "apple" into its ConceptNet URI, "/c/en/apple".

    Only works for single English words, with no punctuation besides hyphens.
    """
    return '/c/en/' + text.lower().replace('-', '_')

def normalize_vec(vec):
    """
    Normalize a vector to a unit vector, so that dot products are cosine
    similarities.

    If it's the zero vector, leave it as is, so all its cosine similarities
    will be zero.
    """
    norm = vec.dot(vec) ** 0.5
    if norm == 0:
        return vec
    return vec / norm


We would need a lot more support from the ConceptNet code if we wanted to apply ConceptNet’s strategy for out-of-vocabulary words. Fortunately, the words in this task are quite common. Our out-of-vocabulary strategy can be to return the zero vector.
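
As a quick sanity check that this fallback is harmless, here's a self-contained snippet (re-stating `normalize_vec` so it runs on its own): the zero vector normalizes to itself, so an out-of-vocabulary word ends up equally dissimilar to everything.

```python
import numpy as np

def normalize_vec(vec):
    # Same as above: unit-normalize, leaving the zero vector alone.
    norm = vec.dot(vec) ** 0.5
    return vec if norm == 0 else vec / norm

oov = normalize_vec(np.zeros(4))                    # out-of-vocabulary fallback
known = normalize_vec(np.array([1.0, 2.0, 2.0, 4.0]))
print(oov.dot(known))  # 0.0: the OOV word has zero similarity to any term
```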

In [10]:
class AttributeHeuristic:
    def __init__(self, hdf5_filename):
        """
        Load a word embedding matrix that is the 'mat' member of an HDF5 file,
        with UTF-8 labels for its rows.

        (This is the format that ConceptNet Numberbatch word embeddings use.)
        """
        self.embeddings = pd.read_hdf(hdf5_filename, 'mat', encoding='utf-8')
        self.cache = {}

    def get_vector(self, term):
        """
        Look up the vector for a term, returning it normalized to a unit vector.
        If the term is out-of-vocabulary, return a zero vector.

        Because many terms appear repeatedly in the data, cache the result.
        """
        uri = text_to_uri(term)
        if uri in self.cache:
            return self.cache[uri]
        else:
            try:
                vec = normalize_vec(self.embeddings.loc[uri])
            except KeyError:
                vec = pd.Series(index=self.embeddings.columns).fillna(0)
            self.cache[uri] = vec
            return vec

    def get_similarity(self, term1, term2):
        """
        Get the cosine similarity between the embeddings of two terms.
        """
        return self.get_vector(term1).dot(self.get_vector(term2))

    def compare_attributes(self, term1, term2, attribute):
        """
        Our heuristic for whether an attribute applies more to term1 than
        to term2: find the cosine similarity of each term with the
        attribute, and take the difference of the square roots of those
        similarities.
        """
        match1 = max(0, self.get_similarity(term1, attribute)) ** 0.5
        match2 = max(0, self.get_similarity(term2, attribute)) ** 0.5
        return match1 - match2

    def classify(self, term1, term2, attribute, threshold):
        """
        Convert the attribute heuristic into a yes-or-no decision, by testing
        whether the difference is larger than a given threshold.
        """
        return self.compare_attributes(term1, term2, attribute) > threshold

    def evaluate(self, semeval_filename, threshold):
        """
        Evaluate the heuristic on a file containing instances of this form:

            banjo,harmonica,stations,0
            mushroom,onions,stem,1

        Return the macro-averaged F1 score. (As in the task, we use macro-
        averaged F1 instead of raw accuracy, to avoid being misled by
        imbalanced classes.)
        """
        our_answers = []
        real_answers = []
        for line in open(semeval_filename, encoding='utf-8'):
            term1, term2, attribute, strval = line.rstrip().split(',')
            discriminative = bool(int(strval))
            real_answers.append(discriminative)
            our_answers.append(self.classify(term1, term2, attribute, threshold))
        return f1_score(real_answers, our_answers, average='macro')
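
As a side note on the metric: macro-averaged F1 treats both classes equally, so a classifier that ignores the minority class scores much worse than its raw accuracy would suggest. A small self-contained illustration with sklearn:

```python
from sklearn.metrics import accuracy_score, f1_score

# An imbalanced toy labeling: one positive example, and a classifier
# that always answers "not discriminative".
gold = [0, 0, 0, 1]
predicted = [0, 0, 0, 0]

print(accuracy_score(gold, predicted))             # 0.75
print(f1_score(gold, predicted, average='macro'))  # ~0.43
```

The accuracy looks respectable at 0.75, but macro F1 averages a 0.857 score on the majority class with a 0 on the missed minority class, giving about 0.43.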



When we ran this solution, our latest set of word embeddings calculated from ConceptNet was ‘numberbatch-20180108-biased’. This name indicates that it was built on January 8, 2018, and acknowledges that we haven’t run it through the de-biasing process, which we consider important when deploying a machine learning system.

Here, we didn’t want to complicate things by adding the de-biasing step. But keep in mind that this heuristic would probably have some unfortunate trends if it were asked to distinguish attributes of people’s names, gender, or ethnicity.

In [11]:
heuristic = AttributeHeuristic('numberbatch-20180108-biased.h5')


The classifier has one parameter that can vary, which is the “threshold”: the minimum difference between cosine similarities that will count as a discriminative attribute. When we ran the training code for our full SemEval entry on this one feature, we got a classifier that’s equivalent to a threshold of 0.096. Let’s simplify that by rounding it off to 0.1.

In [13]:
heuristic.evaluate('discriminatt-train.txt', threshold=0.1)

Out[13]:
0.6620320353802582

When we were creating this code, we didn’t have access to the test set — this is pretty much the point of SemEval. We could compare results on the validation set, which is how we decided to use a combination of five features, where the feature you see here is only one of them. It’s also how we found that taking the square root of the cosine similarities was helpful.

When we’re just revisiting a simplified version of the classifier, there isn’t much that we need to do with the validation set, but let’s take a look at how it does anyway.

In [14]:
heuristic.evaluate('discriminatt-validation.txt', threshold=0.1)

Out[14]:
0.693873461779053

But what’s really interesting about this simple heuristic is how it performs on the previously held-out test set.

In [15]:
heuristic.evaluate('discriminatt-test.txt', threshold=0.1)

Out[15]:
0.7358997147499388

It’s pretty remarkable to see a test accuracy that’s so much higher than the training accuracy! It should actually make you suspicious that this classifier is somehow tuned to the test data.

But that’s why it’s nice to have a result we can compare to that followed the SemEval process. Our actual SemEval entry got the same accuracy, 73.6%, and showed that we could attain that number without having any access to the test data.

Many entries to this task performed better on the test data than on the validation data. It seems that the test set is cleaner overall than the validation set, which in turn is cleaner than the training set. Simple classifiers that generalize well had the chance to do much better on the test set. Classifiers which had the ability to focus too much on the specific details of the training set, some of which are erroneous, performed worse.

But you could still question whether the simplified system that we came up with after the fact can actually be compared to the system we submitted, which leads me to a digression about “lucky systems” at the end of this post.

### Examples

Let’s see how this heuristic does on some examples of these “discriminative attribute” questions.

When we look at heuristic.compare_attributes(a, b, c), we’re asking if a is more associated with c than b is. The heuristic returns a number. By our evaluation above, we consider the attribute to be discriminative if the number is 0.1 or greater.

In [17]:
heuristic.compare_attributes('window', 'door', 'glass')

Out[17]:
0.16762984210407628

From the examples in the code above: mushrooms have stems, while onions don’t.

In [35]:
heuristic.compare_attributes('mushroom', 'onions', 'stem')

Out[35]:
0.11308354447365421

This one comes straight from the task description: cappuccino contains milk, while americano doesn’t. Unfortunately, our heuristic is not confident about the distinction, and returns a value less than 0.1. It would fail this example in the evaluation.

In [37]:
heuristic.compare_attributes('cappuccino', 'americano', 'milk')

Out[37]:
0.06309686358452515

An example of a non-discriminative attribute: trains and subways both involve rails. Our heuristic barely gets this right, but only due to lack of confidence.

In [38]:
heuristic.compare_attributes('train', 'subway', 'rails')

Out[38]:
0.08336122961828196

This was not required for the task, but the heuristic can also tell us when an attribute is discriminative in the opposite direction. Water is much more associated with soup than it is with fingers. It is a discriminative attribute that distinguishes soup from finger, not finger from soup. The heuristic gives us back a negative number indicating this.

In [39]:
heuristic.compare_attributes('finger', 'soup', 'water')

Out[39]:
-0.2778968364707769

### Lucky systems

As a kid, I used to hold marble racing tournaments in my room, rolling marbles simultaneously down plastic towers of tracks and funnels. I went so far as to set up a bracket of 64 marbles to find the fastest marble. I kind of thought that running marble tournaments was peculiar to me and my childhood, but now I’ve found out that marble racing videos on YouTube are a big thing! Some of them even have overlays as if they’re major sporting events.

In the end, there’s nothing special about the fastest marble compared to most other marbles. It’s just lucky. If one ran the tournament again, the marble champion might lose in the first round. But the one thing you could conclude about the fastest marble is that it was no worse than the other marbles. A bad marble (say, a misshapen one, or a plastic bead) would never luck out enough to win.

In our paper, we tested 30 alternate versions of the classifier, including the one that was roughly equivalent to this very simple system. We were impressed by the fact that it performed as well as our real entry. And this could be because of the inherent power of ConceptNet Numberbatch, or it could be because it’s the lucky marble.

I tried it with other thresholds besides 0.1, and some of the nearby reasonable threshold values only score 71% or 72%. But that still tells you that this interestingly simple system is doing the right thing and is capable of getting a very good result. It’s good enough to be the lucky marble, so it’s good enough for this tutorial.

Incidentally, the same argument about “lucky systems” applies to SemEval entries themselves. There are dozens of entries from different teams, and the top-scoring entry is going to be an entry that did the right thing and also got lucky.

In the post-SemEval discussion at ACL, someone proposed that all results should be Bayesian probability distributions, estimated by evaluating systems on various subsets of the test data, and instead of declaring a single winner or a tie, we should get probabilistic beliefs as results: “There is an 80% chance that entry A is the best solution to the task, an 18% chance that entry B is the best solution…”

I find this argument entirely reasonable, and probably unlikely to catch on in a world where we haven’t even managed to replace the use of p-values.

# ConceptNet’s strong performance at SemEval 2018

At the beginning of June, we went to the NAACL conference and the SemEval workshop. SemEval is a yearly event where NLP systems are compared head-to-head on semantic tasks, judged by how they perform on unseen test data.

I like to submit to SemEval because I see it as the NLP equivalent of pre-registered studies. You know the results are real; they’re not cherry-picked positive results, and they’re not repeatedly tuned to the same test set. SemEval provides valuable evidence about which semantic techniques actually work well on new data.

Recently, SemEval has been a compelling demonstration of why ConceptNet is important in semantics. The results of multiple tasks have shown the advantage of using a knowledge graph, particularly ConceptNet, and not assuming that a distributional representation such as word2vec will learn everything there is to learn.

Last year we got the top score (by a wide margin) in the SemEval task that we entered using ConceptNet Numberbatch (pre-trained word vectors built from ConceptNet). I was wondering if we had really made an impression with this result, or if the field was going to write it off as a fluke and go on as it was.

We made an impression! This year at SemEval, there were many systems using ConceptNet, not just ours. Let’s look at the two tasks where ConceptNet made an appearance.

### Story understanding

There are many NLP evaluations that involve reading comprehension, but many of them are susceptible to shallow strategies where the machine just learns to parrot key phrases from the text. The interesting twist in this one is that about half of the answers are not present in the text, but are meant to be inferred using common sense knowledge.

Here’s an example from the task paper, by Simon Ostermann et al.:

Text: It was a long day at work and I decided to stop at the gym before going home. I ran on the treadmill and lifted some weights. I decided I would also swim a few laps in the pool. Once I was done working out, I went in the locker room and stripped down and wrapped myself in a towel. I went into the sauna and turned on the heat. I let it get nice and steamy. I sat down and relaxed. I let my mind think about nothing but peaceful, happy thoughts. I stayed in there for only about ten minutes because it was so hot and steamy. When I got out, I turned the sauna off to save energy and took a cool shower. I got out of the shower and dried off. After that, I put on my extra set of clean clothes I brought with me, and got in my car and drove home.

Q1: Where did they sit inside the sauna?

(a) on the floor
(b) on a bench

Q2: How long did they stay in the sauna?

(a) about ten minutes
(b) over thirty minutes

Q1 is not just asking for a phrase to be echoed from the text. It requires some common sense knowledge, such as that saunas contain benches, that benches are meant for people to sit on, and that people will probably sit on a bench in preference to the floor.

It’s no wonder that the top system, from Yuanfudao Research, made use of ConceptNet and got a boost from its common sense knowledge. Their architecture was an interesting one I haven’t seen before — they queried the ConceptNet API for what relations existed between words in the text, the question, and the answer, and used the results they got as inputs to their neural net.

### Telling word meanings apart

Our entry this year was for Task 10: Capturing Discriminative Attributes, a task about recognizing differences between words. Many evaluation tasks, including the multilingual similarity task that we won last year, involve recognizing similar words. For example, it’s good for a system to know that “cappuccino” and “espresso” are similar things. But it’s also important for a system to know how they differ, and that’s what this task is about.

Our entry used ConceptNet Numberbatch in combination with four other resources, and took second place at the task. Our system is best described by our poster, which you can now read from the comfort of your Web browser.

In their summary paper, the task organizers (Alicia Krebs, Alessandro Lenci, and Denis Paperno) highlight the fact that systems that used knowledge bases performed much better than those that didn’t. Here’s a table of the results, which we’ve adapted from their paper and annotated with the largest knowledge base used by each entry:

| Rank | Team | Score | Knowledge base |
|------|------|-------|----------------|
| 1 | SUNNYNLP | 0.75 | Probase |
| 2 | Luminoso | 0.74 | ConceptNet |
| 3 | BomJi | 0.73 | |
| 3 | NTU NLP | 0.73 | ConceptNet |
| 5 | UWB | 0.72 | ConceptNet |
| 6 | ELiRF-UPV | 0.69 | ConceptNet |
| 6 | Meaning Space | 0.69 | WordNet |
| 6 | Wolves | 0.69 | ConceptNet |
| 9 | Discriminator | 0.67 | |
| 9 | ECNU | 0.67 | WordNet |
| 11 | AmritaNLP | 0.66 | |
| 12 | GHH | 0.65 | |
| 13 | ALB | 0.63 | |
| 13 | CitiusNLP | 0.63 | |
| 13 | THU NGN | 0.63 | |
| 16 | UNBNLP | 0.61 | WordNet |
| 17 | UNAM | 0.60 | |
| 17 | UMD | 0.60 | |
| 19 | ABDN | 0.52 | WordNet |
| 20 | Igevorse | 0.51 | |
| 21 | bicici | 0.47 | |
| | human ceiling | 0.90 | |
| | word2vec baseline | 0.61 | |

The winning system made very effective use of Probase, a hierarchy of automatically extracted “is-a” statements about noun phrases. Unfortunately, Probase was never released for non-academic use; it became the Microsoft Concept Graph, which was recently shut down.

We can see here that five systems used ConceptNet in their solution, and their various papers describe how ConceptNet provided a boost to their accuracy.

In our own results, we encountered the surprising retrospective result that we could have simplified our system to just use the ConceptNet Numberbatch embeddings, and no other sources of information, and it would have done just as well! You can read a bit more about this in the poster, and I hope to demonstrate this simple system in a tutorial post soon.

# ftfy (fixes text for you) 5.4 released

We’ve released version 5.4 of ftfy, our Python 3 tool that fixes mojibake and other Unicode glitches.

>>> import ftfy

>>> ftfy.fix_text("ongeÃ«venaard")
'ongeëvenaard'

>>> ftfy.fix_text("HÃ”TEL")
'HÔTEL'


In this version, we tuned the heuristic to be able to fix more cases where there are only two characters of mojibake, such as the Ã« in "ongeÃ«venaard", thanks to a bug report about how ftfy was failing to un-corrupt the letter ë.
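
This kind of two-character mojibake is easy to reproduce in a couple of lines: encode the intended letter as UTF-8, then misread the bytes as Windows-1252.

```python
# 'ë' is the two bytes 0xC3 0xAB in UTF-8; reading those bytes as
# Windows-1252 produces the two-character mojibake described above.
mojibake = 'ë'.encode('utf-8').decode('windows-1252')
print(mojibake)  # Ã«

# Decoding in the other direction recovers the original letter.
fixed = mojibake.encode('windows-1252').decode('utf-8')
print(fixed)  # ë
```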

There are many cases like this that ftfy could fix already, but version 5.3 wasn’t convinced it should change anything: “« is a kind of quotation mark! What if the user really meant to put «venaard» in quotes, and there just happens to be a word ending in Ã right next to it?”

This is a bit of a silly concern when:

• quotation marks aren’t usually sandwiched directly between letters, with no spaces around them

• words don’t usually end in Ã, not even in Portuguese

• the text would really look a lot better with ë in it instead of Ã«

We tuned the heuristic so that it recognizes more of these two-character sequences as clear cases of mojibake, and doesn’t worry about quotation marks that are between letters.

Why does ftfy have to be careful in cases like this? It may seem that we could just fix every two-character sequence that looks like Windows-1252 was mixed up with UTF-8, the most common form of mojibake. But one design goal is that we really don’t want it to introduce errors. Here’s a real-world example that’s in ftfy’s tests:

>>> text = "PARCE QUE SUR LEURS PLAQUES IL Y MARQUÉ…"

>>> # It's possible to decode this text as if it's mojibake.
>>> text.encode('windows-1252').decode('utf-8')
'PARCE QUE SUR LEURS PLAQUES IL Y MARQUɅ'

>>> # But we don't, because the text is fine as it is.
>>> ftfy.fix_text(text)
'PARCE QUE SUR LEURS PLAQUES IL Y MARQUÉ…'


People are often surprised that ftfy is a hand-tuned heuristic, and not, for example, the output of a machine-learning algorithm. Machine learning is great, but it has its limits. One advantage of being hand-tuned is that we can keep aiming for a false positive rate that’s so low that an ML training loop wouldn’t even be able to measure it. Another advantage, shown with this update, is that we can make sure to do the right thing in these minimal cases.

Machine-learned tools such as the language detector cld2 will warn you that they’re “not designed to do well on short text”. Short text is often interesting and important, so ftfy is designed to do well on it.

Also with this release, we can finally have a nice-looking project page on the new Python Package Index.

# ConceptNet and JSON-LD

JSON-LD is a flower blooming in the majestic ruins of the Semantic Web.

It’s a way of describing an API of linked data, so that a computer can understand what its responses mean. But the description stays out of the way, so a human programmer can interact with the API the way they would any other.

This post is going to be nerdier than usual. In this tutorial, we’re going to look under the surface of the ConceptNet API, which is based on JSON-LD, and see how to use tools such as pyld to transform it into RDF and align it with other data.

Should you care? I think you should if the difference between “Linked Data” and plain old “data” is important to you. But this information isn’t actually essential to use ConceptNet. It’s a bonus that makes ConceptNet more interoperable with other things. This will not be on the test.

### Hanging ornaments on the JSON tree

To start with an example, here’s the JSON-LD response that you get from the API query http://api.conceptnet.io/c/en/knowledge_graph.

Conveniently for the length of this example (and disappointingly in general), ConceptNet knows only one thing about the English term “knowledge graph”, which is that in French it’s “graphe de connaissances”. So the “edges” value, which contains the meat of the response, is a list of one edge.

{
"@context": [
"http://api.conceptnet.io/ld/conceptnet5.6/context.ld.json"
],
"@id": "/c/en/knowledge_graph",
"edges": [
{
"@id": "/a/[/r/Synonym/,/c/fr/graphe_de_connaissances/n/,/c/en/knowledge_graph/]",
"@type": "Edge",
"dataset": "/d/wiktionary/fr",
"end": {
"@id": "/c/en/knowledge_graph",
"@type": "Node",
"label": "knowledge graph",
"language": "en",
"term": "/c/en/knowledge_graph"
},
"rel": {
"@id": "/r/Synonym",
"@type": "Relation",
"label": "Synonym"
},
"sources": [
{
"@id": "/and/[/s/process/wikiparsec/1/,/s/resource/wiktionary/fr/]",
"@type": "Source",
"contributor": "/s/resource/wiktionary/fr",
"process": "/s/process/wikiparsec/1"
}
],
"start": {
"@id": "/c/fr/graphe_de_connaissances/n",
"@type": "Node",
"label": "graphe de connaissances",
"language": "fr",
"sense_label": "n",
"term": "/c/fr/graphe_de_connaissances"
},
"surfaceText": null,
"weight": 1.0
}
]
}


Most of this reflects the way the ConceptNet 5 API has always looked. What tells you it’s JSON-LD is a few properties that started showing up in version 5.5, with @ signs in their names. In particular, there’s a pointer to the @context, which is where you (or your software) would go to start understanding what the JSON-LD means. With JSON-LD, you can get more information than you would from the API response alone.

### Calling things by their true name

What’s cool about JSON-LD is that it takes your API and makes it interoperable with RDF. And what’s cool about RDF — if you’ll accept that there’s anything cool about RDF — is that it can assign everything a name, and that name is meaningful and globally unique.

Naming things is one of the traditional “hard problems of computer science”, so this actually matters. And the way RDF names things should be immediately understandable to every developer: names are URLs.

Following the fantasy trope, when you know the true name of something, you have power over it.

Having the URL for a term in RDF tells you whether it’s the same as something you already know about. Computationally, you know more about what “JSON” is if you know it’s the same as https://www.wikidata.org/wiki/Q2063 or http://dbpedia.org/resource/JSON.

And if you have the URL for something that you don’t already know about, you can usually go to that URL and find more information. For example, that’s how you’d confirm that Wikidata’s “Q2063” and DBPedia’s “JSON” are the same thing as each other. That’s what makes all of this information “Linked Data”, not just data.

When you say “URL”, you must actually mean “IRI”.

It’s good to talk to you again, Imaginary Interlocutor, but do you have to be such a web-standards pedant? Nobody knows what an IRI is. I’m going to keep calling these URLs, especially because I really do intend every one of them that I produce to locate a resource.

The names in ConceptNet may look like ad-hoc identifiers, like "/c/en/knowledge_graph" and "cc:by-sa/4.0". The property names, such as "dataset", look pretty ad-hoc too. But these are just short nicknames, and via JSON-LD, we can find the true names of all of these.

The way to turn the strings in the API response into these true names is to use ConceptNet’s JSON-LD context. Don’t get too bogged down in it right now. One thing it provides is prefixes that let us use shorter names for things. Here’s the prefix that lets "cc:by-sa/4.0" point to the Creative Commons URL above:

"cc": "http://creativecommons.org/licenses/",



It also has a base URL, for interpreting relative URLs such as /c/en/knowledge_graph. The base URL happens to be the URL of the context itself, because why not:

"@base": "http://api.conceptnet.io/ld/conceptnet5.6/context.ld.json",



Some of the property names are things that we define. This line says that “weight” is a property that’s defined in ConceptNet’s context (cn: for short), and its value is a floating-point number:

"weight": {"@id": "cn:weight", "@type": "xsd:float"},



Some of the properties are already meaningfully defined elsewhere. For example, we can have “comment” fields in API responses. Their values are strings to be read by the API user. This notion of a comment already exists in RDF Schema.

"comment": {"@id": "rdfs:comment", "@type": "xsd:string"},



With this line, we can specify that when we say “comment”, we mean “rdfs:comment”, which when you expand the prefix means “http://www.w3.org/2000/01/rdf-schema#comment“.
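
The expansion itself is mechanical. As a toy sketch (with a two-entry dict standing in for ConceptNet's real context, which defines many more prefixes), prefix expansion looks like this:

```python
# A toy stand-in for the prefix definitions in ConceptNet's JSON-LD context.
prefixes = {
    'cc': 'http://creativecommons.org/licenses/',
    'rdfs': 'http://www.w3.org/2000/01/rdf-schema#',
}

def expand(name):
    """Expand a compact name like 'cc:by-sa/4.0' into a full URL."""
    if ':' in name:
        prefix, rest = name.split(':', 1)
        if prefix in prefixes:
            return prefixes[prefix] + rest
    return name

print(expand('cc:by-sa/4.0'))  # http://creativecommons.org/licenses/by-sa/4.0
print(expand('rdfs:comment'))  # http://www.w3.org/2000/01/rdf-schema#comment
```

Real JSON-LD processors such as pyld do this (and much more) for you; this sketch only shows the idea of compact names becoming full URLs.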

Let’s take a step back. What do you do with this kind of information?

I think the most likely user who cares about the linked data in ConceptNet is someone who’s building something larger out of ConceptNet and other resources. This would match my experience in building ConceptNet, where the inputs that are available in RDF are the ones I can be confident that I’m handling correctly, even if they update in the future.

Let’s talk about how things used to be with WordNet. If I want to refer to a particular item in WordNet, such as the synset {example, instance, illustration, representative}, there are a number of ways I could describe it, and most of them probably wouldn’t be consistent with anything else. I could give you synset names that you can look up, such as example.n.01 or illustration.n.03. These numbers might change with new versions of WordNet, and there’s no way to inherently know that they refer to the same thing.

I could also give you an internal ID such as 05828980-n, which at least is a single name for the synset, but all of these IDs would change with new releases of WordNet.

And this really got better because of RDF?

Yep. When using multiple data sources that are based on WordNet, you used to need a table that tells you which IDs are the same as which other IDs — basically a kind of Rosetta stone lining up names and numbers from different versions of WordNet. Hopefully some researcher somewhere has made the table you need.

But the fact that WordNet is in RDF now means that I know the global, true name that I can call this WordNet entry: http://wordnet-rdf.princeton.edu/id/05828980-n. I don’t need a Rosetta stone to know what this URL refers to. I can even go to that URL to find out more about it.

But that’s just the same internal ID shoved into a URL. How does that make a difference?

Putting it into a URL means that it’s more than just an internal ID now. Regardless of where the ID number came from originally, it’s an implicit promise that this URL consistently refers to the synset {example, instance, illustration, representative}.

And, importantly, it suggests that if you’re building something on top of WordNet, you should use the same URL to identify the same synset. These wordnet-rdf URLs are also used by the Open Multilingual WordNet project, so you can be sure of when terms in different languages are intended to refer to the same thing, and you can align the data OMW provides with WordNet data you get from other sources.

### Using PyLD

The PyLD library lets us interpret JSON-LD responses, and apply various standard transformations to them.

For example, maybe instead of our own API format, you want to see the data in ConceptNet in a format that some other project uses. One format you might like is N-Triples, a simple text format that’s like CSV if CSV were annoying to parse. Each line is an RDF statement, containing the subject, the predicate, and the object, and ending with a dot. The URLs involved are fully expanded.
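
The “annoying to parse” part is not a joke: a naive whitespace split breaks as soon as a literal contains a space, which is why dedicated N-Triples parsers exist. A toy example, using a made-up triple:

```python
# A syntactically plausible N-Triples line with a string literal.
line = '<http://example.org/s> <http://example.org/p> "knowledge graph" .'

fields = line.split(' ')
print(len(fields))  # 5, not the 4 you'd hope for: the literal contains a space
```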

This format is also called N-Quads now. We could replace the dot with a fourth thing called a “named graph”, but we don’t.

To produce this format, we’ll use jsonld.normalize. N-Quads is one of the two formats it can output.

In [1]:
import requests
from pyld import jsonld

In [2]:
def show_nquads(url):
    """
    Fetch a JSON-LD document over HTTP and print it in N-Quads format.

    (Using requests as the HTTP client here is an assumption; any client
    that returns the parsed JSON response would do.)
    """
    response = requests.get(url).json()
    print(jsonld.normalize(response, {'format': 'application/nquads'}))

In [3]:
show_nquads('http://api.conceptnet.io/c/en/knowledge_graph')

<http://api.conceptnet.io/a/[/r/Synonym/,/c/fr/graphe_de_connaissances/n/,/c/en/knowledge_graph/]> <http://api.conceptnet.io/ld/conceptnet5.6/context.ld.json#dataset> <http://api.conceptnet.io/d/wiktionary/fr> .
<http://api.conceptnet.io/a/[/r/Synonym/,/c/fr/graphe_de_connaissances/n/,/c/en/knowledge_graph/]> <http://api.conceptnet.io/ld/conceptnet5.6/context.ld.json#end> <http://api.conceptnet.io/c/en/knowledge_graph> .
<http://api.conceptnet.io/a/[/r/Synonym/,/c/fr/graphe_de_connaissances/n/,/c/en/knowledge_graph/]> <http://api.conceptnet.io/ld/conceptnet5.6/context.ld.json#rel> <http://api.conceptnet.io/r/Synonym> .
<http://api.conceptnet.io/a/[/r/Synonym/,/c/fr/graphe_de_connaissances/n/,/c/en/knowledge_graph/]> <http://api.conceptnet.io/ld/conceptnet5.6/context.ld.json#source> <http://api.conceptnet.io/and/[/s/process/wikiparsec/1/,/s/resource/wiktionary/fr/]> .
<http://api.conceptnet.io/a/[/r/Synonym/,/c/fr/graphe_de_connaissances/n/,/c/en/knowledge_graph/]> <http://api.conceptnet.io/ld/conceptnet5.6/context.ld.json#start> <http://api.conceptnet.io/c/fr/graphe_de_connaissances/n> .
<http://api.conceptnet.io/a/[/r/Synonym/,/c/fr/graphe_de_connaissances/n/,/c/en/knowledge_graph/]> <http://api.conceptnet.io/ld/conceptnet5.6/context.ld.json#weight> "1.0E0"^^<http://www.w3.org/2001/XMLSchema#float> .
<http://api.conceptnet.io/a/[/r/Synonym/,/c/fr/graphe_de_connaissances/n/,/c/en/knowledge_graph/]> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://api.conceptnet.io/ld/conceptnet5.6/context.ld.json#Edge> .
<http://api.conceptnet.io/and/[/s/process/wikiparsec/1/,/s/resource/wiktionary/fr/]> <http://api.conceptnet.io/ld/conceptnet5.6/context.ld.json#process> <http://api.conceptnet.io/s/process/wikiparsec/1> .
<http://api.conceptnet.io/and/[/s/process/wikiparsec/1/,/s/resource/wiktionary/fr/]> <http://purl.org/dc/terms/contributor> <http://api.conceptnet.io/s/resource/wiktionary/fr> .
<http://api.conceptnet.io/and/[/s/process/wikiparsec/1/,/s/resource/wiktionary/fr/]> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://api.conceptnet.io/ld/conceptnet5.6/context.ld.json#Source> .
<http://api.conceptnet.io/c/en/knowledge_graph> <http://api.conceptnet.io/ld/conceptnet5.6/context.ld.json#edges> <http://api.conceptnet.io/a/[/r/Synonym/,/c/fr/graphe_de_connaissances/n/,/c/en/knowledge_graph/]> .
<http://api.conceptnet.io/c/en/knowledge_graph> <http://api.conceptnet.io/ld/conceptnet5.6/context.ld.json#label> "knowledge graph" .
<http://api.conceptnet.io/c/en/knowledge_graph> <http://api.conceptnet.io/ld/conceptnet5.6/context.ld.json#term> <http://api.conceptnet.io/c/en/knowledge_graph> .
<http://api.conceptnet.io/c/en/knowledge_graph> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://api.conceptnet.io/ld/conceptnet5.6/context.ld.json#Node> .
<http://api.conceptnet.io/c/fr/graphe_de_connaissances/n> <http://api.conceptnet.io/ld/conceptnet5.6/context.ld.json#label> "graphe de connaissances" .
<http://api.conceptnet.io/c/fr/graphe_de_connaissances/n> <http://api.conceptnet.io/ld/conceptnet5.6/context.ld.json#sense_label> "n" .
<http://api.conceptnet.io/c/fr/graphe_de_connaissances/n> <http://api.conceptnet.io/ld/conceptnet5.6/context.ld.json#term> <http://api.conceptnet.io/c/fr/graphe_de_connaissances> .
<http://api.conceptnet.io/c/fr/graphe_de_connaissances/n> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://api.conceptnet.io/ld/conceptnet5.6/context.ld.json#Node> .
<http://api.conceptnet.io/r/Synonym> <http://api.conceptnet.io/ld/conceptnet5.6/context.ld.json#label> "Synonym" .
<http://api.conceptnet.io/r/Synonym> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://api.conceptnet.io/ld/conceptnet5.6/context.ld.json#Relation> .



There you go. It’s not pretty, but everything is pretty much spelled out. With N-Quads format, you could process ConceptNet the same way as WordNet or DBPedia.
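As a quick illustration (not part of the original post), each N-Quads line can even be pulled apart with plain string handling. `parse_nquad_line` below is a hypothetical helper that assumes the subject and predicate are IRIs in angle brackets, as in the ConceptNet output above; anything serious should use a real RDF parser instead.

```python
# Minimal sketch: split one N-Quads line into subject, predicate, object.
# Assumes the subject and predicate are IRIs in angle brackets.
def parse_nquad_line(line):
    subj, pred, rest = line.split(' ', 2)
    # The object may be an IRI or a literal; just drop the trailing " ."
    obj = rest.rsplit(' .', 1)[0]
    return subj.strip('<>'), pred.strip('<>'), obj

line = ('<http://api.conceptnet.io/r/Synonym> '
        '<http://api.conceptnet.io/ld/conceptnet5.6/context.ld.json#label> '
        '"Synonym" .')
parse_nquad_line(line)
```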

The other available format, besides N-Quads, is a list of dictionaries, which is a good format for working with this data programmatically when you’re not writing it to a file, but is ridiculously verbose to look at:

In [4]:
edges = jsonld.normalize('http://api.conceptnet.io/c/en/knowledge_graph')['@default']
edges[:5]

Out[4]:
[{'object': {'type': 'IRI',
             'value': 'http://api.conceptnet.io/d/wiktionary/fr'},
  'predicate': {'type': 'IRI',
                'value': 'http://api.conceptnet.io/ld/conceptnet5.6/context.ld.json#dataset'},
  'subject': {'type': 'IRI',
              'value': 'http://api.conceptnet.io/a/[/r/Synonym/,/c/fr/graphe_de_connaissances/n/,/c/en/knowledge_graph/]'}},
 {'object': {'type': 'IRI',
             'value': 'http://api.conceptnet.io/c/en/knowledge_graph'},
  'predicate': {'type': 'IRI',
                'value': 'http://api.conceptnet.io/ld/conceptnet5.6/context.ld.json#end'},
  'subject': {'type': 'IRI',
              'value': 'http://api.conceptnet.io/a/[/r/Synonym/,/c/fr/graphe_de_connaissances/n/,/c/en/knowledge_graph/]'}},
 {'object': {'type': 'IRI', 'value': 'http://api.conceptnet.io/r/Synonym'},
  'predicate': {'type': 'IRI',
                'value': 'http://api.conceptnet.io/ld/conceptnet5.6/context.ld.json#rel'},
  'subject': {'type': 'IRI',
              'value': 'http://api.conceptnet.io/a/[/r/Synonym/,/c/fr/graphe_de_connaissances/n/,/c/en/knowledge_graph/]'}},
 {'object': {'type': 'IRI',
             'value': 'http://api.conceptnet.io/and/[/s/process/wikiparsec/1/,/s/resource/wiktionary/fr/]'},
  'predicate': {'type': 'IRI',
                'value': 'http://api.conceptnet.io/ld/conceptnet5.6/context.ld.json#source'},
  'subject': {'type': 'IRI',
              'value': 'http://api.conceptnet.io/a/[/r/Synonym/,/c/fr/graphe_de_connaissances/n/,/c/en/knowledge_graph/]'}},
 {'object': {'type': 'IRI',
             'value': 'http://api.conceptnet.io/c/fr/graphe_de_connaissances/n'},
  'predicate': {'type': 'IRI',
                'value': 'http://api.conceptnet.io/ld/conceptnet5.6/context.ld.json#start'},
  'subject': {'type': 'IRI',
              'value': 'http://api.conceptnet.io/a/[/r/Synonym/,/c/fr/graphe_de_connaissances/n/,/c/en/knowledge_graph/]'}}]

### Example: displaying the graph

What we just got out of JSON-LD is a graph structure, and Python gives us ways to visualize graphs, such as the appropriately-named graphviz wrapper.

We can use this anonymous list-of-dictionaries format to provide input to graphviz. We just need some code that prettifies it a little bit.

In [5]:
import graphviz
from conceptnet5.uri import join_uri, split_uri

API_ROOT = 'http://api.conceptnet.io'

def short_name(value, max_length=40):
    """
    Convert an RDF value (given as a dictionary) to a reasonable label.
    """
    if value['type'] == 'blank node':
        return '_'
    elif value['type'] == 'IRI':
        url = value['value']
        if '#' in url:
            # Show just the fragment of URLs with a fragment
            # (it's probably a property name)
            return url.split('#')[-1]

        # Give URLs relative to the root of our API
        if url.startswith(API_ROOT):
            short_url = url[len(API_ROOT):]
            # If the URL is too long, hide it
            if len(short_url) > max_length:
                pieces = split_uri(short_url)
                return join_uri(pieces[0], '...')
            else:
                return short_url
        else:
            return url.split('://')[-1]
    else:
        # Put literal values in quotes
        text = value['value'].replace(':', '')
        if len(text) > max_length:
            text = text[:max_length] + '...'
        return '"{}"'.format(text)

def show_graph(url, size=10):
    """
    Show the graph structure of a ConceptNet API response.
    """
    rdf = jsonld.normalize(url)['@default']
    graph = graphviz.Digraph(
        strict=False, graph_attr={'size': str(size), 'rankdir': 'LR'}
    )
    for edge in rdf:
        subj = short_name(edge['subject'])
        obj = short_name(edge['object'])
        pred = short_name(edge['predicate'])
        if subj and obj and pred:
            # Apply different styles to the nodes based on whether they're
            # literals, ConceptNet URLs, or other URLs
            if obj.startswith('"'):
                # Literal values
                graph.node(obj, penwidth='0')
            elif obj.startswith('/'):
                # ConceptNet nodes
                graph.node(obj, style='filled', fillcolor="#ddeeff")
            else:
                # Other URLs
                graph.node(obj, color="#558855")
            graph.edge(subj, obj, label=pred)

    return graph

In [6]:
show_graph('http://api.conceptnet.io/c/en/knowledge_graph')

Out[6]:

Wait. This tentacle monster is what a single assertion in ConceptNet looks like?

Yes. I bet you were expecting something more like this:

In [7]:
graph = graphviz.Graph(
    graph_attr={'size': '10', 'rankdir': 'LR'},
    node_attr={'style': 'filled', 'fillcolor': "#ddeeff"}
)
graph.edge('/c/en/knowledge_graph', '/c/fr/graphe_de_connaissances', label='/r/Synonym')
graph

Out[7]:

And you’ll often see claims that RDF can describe knowledge graphs in this way, where each edge is a fact in the knowledge base.

But this leaves no room for any interesting information about the edge, such as the sources that it comes from or how strongly we believe it. To talk about an edge in RDF, you have to “reify” it — to turn the edge into a node, and describe it with more edges. And that’s what we’ve done.

I’m not sure if anyone really wants to work with the un-reified facts in ConceptNet as RDF edges. I know that DBPedia has those, but often I see a DBPedia edge and find myself asking “okay, but really? Is this a real fact? Where did it come from?” Without reification, there’s no answer.
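If you did want plain un-reified triples anyway, you could recover them from the reified edges yourself. Here's a sketch (a hypothetical helper, not from ConceptNet's code) that walks the list-of-dictionaries output of `jsonld.normalize` shown earlier, and collapses each reified edge back into a simple (start, rel, end) triple using the `#start`, `#rel`, and `#end` properties. The `quads` sample below is a hand-trimmed stand-in for real API output.

```python
# Sketch: collapse reified ConceptNet edges into (start, rel, end) triples.
VOCAB = 'http://api.conceptnet.io/ld/conceptnet5.6/context.ld.json#'

def simple_triples(quads):
    # Group each quad's predicate and object under its subject
    by_subject = {}
    for quad in quads:
        props = by_subject.setdefault(quad['subject']['value'], {})
        props[quad['predicate']['value']] = quad['object']['value']
    # A reified edge is any subject that has start, rel, and end
    return [
        (props[VOCAB + 'start'], props[VOCAB + 'rel'], props[VOCAB + 'end'])
        for props in by_subject.values()
        if all(VOCAB + key in props for key in ('start', 'rel', 'end'))
    ]

ASSERTION = ('http://api.conceptnet.io/a/'
             '[/r/Synonym/,/c/fr/graphe_de_connaissances/n/,/c/en/knowledge_graph/]')
quads = [
    {'subject': {'type': 'IRI', 'value': ASSERTION},
     'predicate': {'type': 'IRI', 'value': VOCAB + 'start'},
     'object': {'type': 'IRI',
                'value': 'http://api.conceptnet.io/c/fr/graphe_de_connaissances/n'}},
    {'subject': {'type': 'IRI', 'value': ASSERTION},
     'predicate': {'type': 'IRI', 'value': VOCAB + 'rel'},
     'object': {'type': 'IRI', 'value': 'http://api.conceptnet.io/r/Synonym'}},
    {'subject': {'type': 'IRI', 'value': ASSERTION},
     'predicate': {'type': 'IRI', 'value': VOCAB + 'end'},
     'object': {'type': 'IRI',
                'value': 'http://api.conceptnet.io/c/en/knowledge_graph'}},
]
simple_triples(quads)
```

Of course, a triple recovered this way has lost exactly the provenance information the reification was there to carry.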

You’ve also got information about nodes, such as their label and their type, which are normal things to have in RDF.

Tentacles aside, could I put a lot of ConceptNet assertions into GraphViz and get a visualization of the structure of the ConceptNet graph?

You would get an illegible hairy mess that brings your image-rendering software grinding to a halt.

Which is about the same as any other large graph visualization.

### Understanding the context in context

We have to go deeper.

Wait, no, we don’t have to do any of this. I want to go deeper.

One goal I had for ConceptNet’s JSON-LD context is that it should explain itself, much like RDF Schema did back in 2000. If you encounter the context file on its own, you should be able to read it and at least partially understand it. And hey, given that we’ve got all this JSON-LD stuff going on, it would be nice if a computer can also understand the stuff that you understand.

So that’s what I did. The context file doesn’t just describe an abstract vocabulary of properties; it also defines those properties. When the actual "@context" refers to identifiers such as "cn:rel", the prefix cn: refers to a fragment in this file itself, so it’s saying that #rel is defined somewhere in this file — and here it is.

The definition tells you the types of things that each property relates, such as Nodes, Edges, or Sources. It relates them to other things in RDF when possible, such as the fact that the "rel", "start", and "end" of a ConceptNet assertion play the roles of "rdf:predicate", "rdf:subject", and "rdf:object", respectively. It provides additional explanations using the "comment" property. For example:

{
    "@id": "#rel",
    "@type": "rdf:Property",
    "subPropertyOf": "rdf:predicate",
    "domain": ["#Edge", "#Feature"],
    "range": "#Relation",
    "comment": "Links to the kind of relationship that holds between two terms. In this API, the 'rel' will always be a ConceptNet URI beginning with /r/. In RDF, this would be called the 'predicate'."
}


These explanatory properties appear outside of the "@context" section, the only section that actually matters to how JSON-LD is processed. I wish I could have put comments inside the "@context", where the values really matter. But if I do that, it doesn’t validate as proper JSON-LD. You need to have already parsed the "@context" to know what a comment is, and JSON-LD doesn’t leave any wiggle room for circular definitions.

But outside of the "@context" section, I can put whatever I want. And what I choose to put there is these additional, explanatory properties that are also meaningful JSON-LD.

So you can interpret the context file, in the context of itself:

In [8]:
show_graph('http://api.conceptnet.io/ld/conceptnet5.6/context.ld.json', size=30)

Out[8]:

Feel free to squint at this tangled web if you really like graphs about graphs. It’s like API documentation, squared!

But speaking of that, remember that you can also read ConceptNet’s API documentation in English instead of in JSON-LD.

Did… did you just make an ontology? That seems out of character.

The ontology was always there, Imaginary Interlocutor. We’ve just moved it from the realm of Platonic ideals, to a JSON file you can download.

In the end, what have you accomplished here?

The next time someone asks me if ConceptNet is available in RDF form, I can say “yes”.

# ConceptNet 5.6 released

ConceptNet 5.6 is out!

We’ve made a lot of changes behind the scenes that should have fairly small effects on the way you use ConceptNet. Some of the changes are:

• We normalize text properly in more languages. Arabic words no longer insist on matching vowel points that nobody writes in real text. Serbian/Croatian words now have a unified vocabulary written in the Latin alphabet, instead of some words being in the Latin alphabet and some in Cyrillic.

• ConceptNet knows what emoji are and can define them in a number of languages, thanks to importing Unicode CLDR data. 😺

• We’ve included data from CC-CEDICT, an open Chinese dictionary.

• For fans of self-explaining APIs and what’s left of the Semantic Web: Everything returned by the ConceptNet API is now valid JSON-LD, and we now test to make sure this is true. You can use a JSON-LD processor to convert responses from the ConceptNet API into other formats such as RDF triples.

• We no longer use Docker to deploy ConceptNet. It caused no end of inscrutable problems and it didn’t make anything easier. Sorry for getting caught up in the hype. We still provide ways to configure a machine to serve ConceptNet exactly like we do.

More details are on the changelog on the ConceptNet wiki.

We also moved our blog — the one you’re reading now — from WordPress to a static site generated with Nikola. One feature this provides is that we can post Python notebooks directly on the blog, instead of having to use an external service such as Gist. This makes it much easier to post tutorials, and we hope to do this shortly.