ConceptNet’s strong performance at SemEval 2018

At the beginning of June, we went to the NAACL conference and the SemEval workshop. SemEval is a yearly event where NLP systems are compared head-to-head on semantic tasks, based on how they perform on unseen test data.

I like to submit to SemEval because I see it as the NLP equivalent of pre-registered studies. You know the results are real; they’re not cherry-picked positive results, and they’re not repeatedly tuned to the same test set. SemEval provides valuable evidence about which semantic techniques actually work well on new data.

Recently, SemEval has been a compelling demonstration of why ConceptNet is important in semantics. The results of multiple tasks have shown the advantage of using a knowledge graph, particularly ConceptNet, and not assuming that a distributional representation such as word2vec will learn everything there is to learn.

Last year we got the top score (by a wide margin) in the SemEval task that we entered using ConceptNet Numberbatch (pre-trained word vectors built from ConceptNet). I was wondering if we had really made an impression with this result, or if the field was going to write it off as a fluke and go on as it was.

We made an impression! This year at SemEval, there were many systems using ConceptNet, not just ours. Let’s look at the two tasks where ConceptNet made an appearance.

Story understanding

Task 11: Machine Comprehension Using Commonsense Knowledge is a task where your NLP system reads a story and then answers some simple questions that test its comprehension.

There are many NLP evaluations that involve reading comprehension, but many of them are susceptible to shallow strategies where the machine just learns to parrot key phrases from the text. The interesting twist in this one is that about half of the answers are not present in the text, but are meant to be inferred using common sense knowledge.

Here’s an example from the task paper, by Simon Ostermann et al.:

Text: It was a long day at work and I decided to stop at the gym before going home. I ran on the treadmill and lifted some weights. I decided I would also swim a few laps in the pool. Once I was done working out, I went in the locker room and stripped down and wrapped myself in a towel. I went into the sauna and turned on the heat. I let it get nice and steamy. I sat down and relaxed. I let my mind think about nothing but peaceful, happy thoughts. I stayed in there for only about ten minutes because it was so hot and steamy. When I got out, I turned the sauna off to save energy and took a cool shower. I got out of the shower and dried off. After that, I put on my extra set of clean clothes I brought with me, and got in my car and drove home.

Q1: Where did they sit inside the sauna?

(a) on the floor
(b) on a bench

Q2: How long did they stay in the sauna?

(a) about ten minutes
(b) over thirty minutes

Q1 is not just asking for a phrase to be echoed from the text. It requires some common sense knowledge, such as that saunas contain benches, that benches are meant for people to sit on, and that people will probably sit on a bench in preference to the floor.

It’s no wonder that the top system, from Yuanfudao Research, made use of ConceptNet and got a boost from its common sense knowledge. Their architecture was an interesting one I haven’t seen before — they queried the ConceptNet API for what relations existed between words in the text, the question, and the answer, and used the results they got as inputs to their neural net.
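To make that kind of lookup concrete, here is a minimal sketch. It is not Yuanfudao's code, and how they turned the results into network inputs is not shown here. It assumes the API's /query endpoint with its node and other parameters, and the sample response is a hand-trimmed stand-in for a real API response, so the sketch runs without a network connection.

```python
from urllib.parse import urlencode

API_ROOT = 'http://api.conceptnet.io'


def pair_query_url(term1, term2):
    """Build a /query URL asking for the edges that connect two terms."""
    params = urlencode({'node': term1, 'other': term2}, safe='/')
    return API_ROOT + '/query?' + params


def relation_labels(response):
    """List the relation labels of the edges in an API-shaped response."""
    return [edge['rel']['label'] for edge in response.get('edges', [])]


# Hand-trimmed stand-in for an API response (only the fields used above).
sample_response = {
    'edges': [
        {
            'start': {'@id': '/c/fr/graphe_de_connaissances/n'},
            'rel': {'@id': '/r/Synonym', 'label': 'Synonym'},
            'end': {'@id': '/c/en/knowledge_graph'},
        }
    ]
}

print(pair_query_url('/c/en/sauna', '/c/en/bench'))
print(relation_labels(sample_response))
```

A system could run a lookup like this for each pair of words drawn from the text, the question, and the answer, and feed the resulting relation labels to a model as features.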

I hadn’t heard about this system before the workshop. It was quite satisfying to see ConceptNet win at a difficult task without any effort from us!

Telling word meanings apart

Our entry this year was for Task 10: Capturing Discriminative Attributes, a task about recognizing differences between words. Many evaluation tasks, including the multilingual similarity task that we won last year, involve recognizing similar words. For example, it’s good for a system to know that “cappuccino” and “espresso” are similar things. But it’s also important for a system to know how they differ, and that’s what this task is about.

Our entry used ConceptNet Numberbatch in combination with four other resources, and took second place at the task. Our system is best described by our poster, which you can now read from the comfort of your Web browser.

A rendering of our poster. The link leads to a PDF version.

In their summary paper, the task organizers (Alicia Krebs, Alessandro Lenci, and Denis Paperno) highlight the fact that systems that used knowledge bases performed much better than those that didn’t. Here’s a table of the results, which we’ve adapted from their paper and annotated with the largest knowledge base used by each entry:

Rank  Team               Score  Knowledge base
1     SUNNYNLP           0.75   Probase
2     Luminoso           0.74   ConceptNet
3     BomJi              0.73   -
3     NTU NLP            0.73   ConceptNet
5     UWB                0.72   ConceptNet
6     ELiRF-UPV          0.69   ConceptNet
6     Meaning Space      0.69   WordNet
6     Wolves             0.69   ConceptNet
9     Discriminator      0.67   -
9     ECNU               0.67   WordNet
11    AmritaNLP          0.66   -
12    GHH                0.65   -
13    ALB                0.63   -
13    CitiusNLP          0.63   -
13    THU NGN            0.63   -
16    UNBNLP             0.61   WordNet
17    UNAM               0.60   -
17    UMD                0.60   -
19    ABDN               0.52   WordNet
20    Igevorse           0.51   -
21    bicici             0.47   -
-     human ceiling      0.90   -
-     word2vec baseline  0.61   -

The winning system made very effective use of Probase, a hierarchy of automatically extracted “is-a” statements about noun phrases. Unfortunately, Probase was never released for non-academic use; it became the Microsoft Concept Graph, which was recently shut down.

We can see here that five systems used ConceptNet in their solution, and their various papers describe how ConceptNet provided a boost to their accuracy.

Looking back at our own results, we found something surprising: we could have simplified our system to use just the ConceptNet Numberbatch embeddings, with no other sources of information, and it would have done just as well! You can read a bit more about this in the poster, and I hope to demonstrate this simple system in a tutorial post soon.
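To give a feel for the task, here is one plausible way an embedding-only decision could look. This is an illustration, not our actual system (the poster describes that). The three-dimensional vectors are made up, standing in for real embeddings such as Numberbatch, and the threshold of 0.1 is an arbitrary assumption.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def is_discriminative(vectors, word1, word2, attribute, threshold=0.1):
    """
    Guess whether `attribute` applies to `word1` but not to `word2`,
    by checking that it is noticeably closer to word1 than to word2.
    """
    sim1 = cosine(vectors[word1], vectors[attribute])
    sim2 = cosine(vectors[word2], vectors[attribute])
    return (sim1 - sim2) > threshold


# Made-up toy vectors, for illustration only.
toy_vectors = {
    'cappuccino': [0.9, 0.8, 0.1],
    'espresso': [0.9, 0.1, 0.1],
    'milk': [0.1, 0.9, 0.0],
    'coffee': [1.0, 0.2, 0.1],
}

print(is_discriminative(toy_vectors, 'cappuccino', 'espresso', 'milk'))
print(is_discriminative(toy_vectors, 'cappuccino', 'espresso', 'coffee'))
```

With these toy vectors, "milk" separates cappuccino from espresso, while "coffee" does not, because it is close to both.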

ftfy (fixes text for you) 5.4 released

We’ve released version 5.4 of ftfy, our Python 3 tool that fixes mojibake and other Unicode glitches.

>>> import ftfy

>>> ftfy.fix_text("ongeÃ«venaard")
'ongeëvenaard'

>>> ftfy.fix_text("HÃ”TEL")
'HÔTEL'

In this version, we tuned the heuristic to be able to fix more cases where there are only two characters of mojibake, such as the Ã« in "ongeÃ«venaard", thanks to a bug report about how ftfy was failing to un-corrupt the letter ë.

There are many cases like this that ftfy could fix already, but version 5.3 wasn’t convinced it should change anything: “« is a kind of quotation mark! What if the user really meant to put «venaard» in quotes, and there just happens to be a word ending in Ã right next to it?”

This is a bit of a silly concern when:

  • quotation marks aren’t usually sandwiched directly between letters, with no spaces around them

  • words don’t usually end in Ã, not even in Portuguese

  • the text would really look a lot better with ë in it instead of Ã«

We tuned the heuristic so that it recognizes more of these two-character sequences as clear cases of mojibake, and doesn’t worry about quotation marks that are between letters.
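If you're wondering where these two-character sequences come from, you can reproduce one with nothing but Python's built-in codecs. This sketch is not ftfy's code; the two helper names are mine. It just shows the underlying mix-up: text is encoded as UTF-8 and then wrongly decoded as Windows-1252.

```python
def make_mojibake(text):
    """Encode text as UTF-8, then wrongly decode the bytes as Windows-1252."""
    return text.encode('utf-8').decode('windows-1252')


def undo_mojibake(text):
    """Reverse the mix-up: re-encode as Windows-1252, then decode as UTF-8."""
    return text.encode('windows-1252').decode('utf-8')


word = 'ongeëvenaard'
garbled = make_mojibake(word)

# The single letter ë turns into two characters, so the word gets longer.
print(garbled, len(word), len(garbled))
print(undo_mojibake(garbled))
```

The reverse step is the easy part. The hard part, and the job of ftfy's heuristic, is deciding when it's safe to apply.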

Why does ftfy have to be careful in cases like this? It may seem that we could just fix every two-character sequence that looks like Windows-1252 was mixed up with UTF-8, the most common form of mojibake. But one design goal is that we really don’t want it to introduce errors. Here’s a real-world example that’s in ftfy’s tests:

>>> text = "PARCE QUE SUR LEURS PLAQUES IL Y MARQUÉ…"

>>> # It's possible to decode this text as if it's mojibake.
>>> text.encode('windows-1252').decode('utf-8')
'PARCE QUE SUR LEURS PLAQUES IL Y MARQUɅ'

>>> # But we don't, because the text is fine as it is.
>>> ftfy.fix_text(text)
'PARCE QUE SUR LEURS PLAQUES IL Y MARQUÉ…'

People are often surprised that ftfy is a hand-tuned heuristic, and not, for example, the output of a machine-learning algorithm. Machine learning is great, but it has its limits. One advantage of being hand-tuned is that we can keep aiming for a false positive rate that’s so low that an ML training loop wouldn’t even be able to measure it. Another advantage, shown with this update, is that we can make sure to do the right thing in these minimal cases.

Machine-learned tools such as the language detector cld2 will warn you that they’re “not designed to do well on short text”. Short text is often interesting and important, so ftfy is designed to do well on it.

Also with this release, we can finally have a nice-looking project page on the new Python Package Index.

ConceptNet and JSON-LD

JSON-LD is a flower blooming in the majestic ruins of the Semantic Web.

It’s a way of describing an API of linked data, so that a computer can understand what its responses mean. But the description stays out of the way, so a human programmer can interact with the API the way they would any other.

This post is going to be nerdier than usual. In this tutorial, we’re going to look under the surface of the ConceptNet API, which is based on JSON-LD, and see how to use tools such as pyld to transform it into RDF and align it with other data.

Should you care? I think you should if the difference between “Linked Data” and plain old “data” is important to you. But this information isn’t actually essential to use ConceptNet. It’s a bonus that makes ConceptNet more interoperable with other things. This will not be on the test.

Hanging ornaments on the JSON tree

To start with an example, here’s the JSON-LD response that you get from the API query http://api.conceptnet.io/c/en/knowledge_graph.

Conveniently for the length of this example (and disappointingly in general), ConceptNet knows only one thing about the English term “knowledge graph”, which is that in French it’s “graphe de connaissances”. So the “edges” value, which contains the meat of the response, is a list of one edge.

{
  "@context": [
    "http://api.conceptnet.io/ld/conceptnet5.6/context.ld.json"
  ],
  "@id": "/c/en/knowledge_graph",
  "edges": [
    {
      "@id": "/a/[/r/Synonym/,/c/fr/graphe_de_connaissances/n/,/c/en/knowledge_graph/]",
      "@type": "Edge",
      "dataset": "/d/wiktionary/fr",
      "end": {
        "@id": "/c/en/knowledge_graph",
        "@type": "Node",
        "label": "knowledge graph",
        "language": "en",
        "term": "/c/en/knowledge_graph"
      },
      "license": "cc:by-sa/4.0",
      "rel": {
        "@id": "/r/Synonym",
        "@type": "Relation",
        "label": "Synonym"
      },
      "sources": [
        {
          "@id": "/and/[/s/process/wikiparsec/1/,/s/resource/wiktionary/fr/]",
          "@type": "Source",
          "contributor": "/s/resource/wiktionary/fr",
          "process": "/s/process/wikiparsec/1"
        }
      ],
      "start": {
        "@id": "/c/fr/graphe_de_connaissances/n",
        "@type": "Node",
        "label": "graphe de connaissances",
        "language": "fr",
        "sense_label": "n",
        "term": "/c/fr/graphe_de_connaissances"
      },
      "surfaceText": null,
      "weight": 1.0
    }
  ]
}

Most of this reflects the way the ConceptNet 5 API has always looked. What tells you it’s JSON-LD is a few properties that started showing up in version 5.5, with @ signs in their names. In particular, there’s a pointer to the @context, which is where you (or your software) would go to start understanding what the JSON-LD means. With JSON-LD, you can get more information than you would from the API response alone.
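As a small illustration of how the JSON-LD parts stay out of the way, this sketch separates the @-prefixed keywords of an object from its ordinary properties. The helper name is mine, and the sample object is the "end" node from the response above.

```python
def split_keywords(obj):
    """Split a JSON-LD object into its @-keywords and its plain properties."""
    keywords = {k: v for k, v in obj.items() if k.startswith('@')}
    properties = {k: v for k, v in obj.items() if not k.startswith('@')}
    return keywords, properties


# The "end" node of the edge in the example response.
node = {
    "@id": "/c/en/knowledge_graph",
    "@type": "Node",
    "label": "knowledge graph",
    "language": "en",
    "term": "/c/en/knowledge_graph",
}

keywords, properties = split_keywords(node)
print(keywords)
print(properties)
```

A programmer who only cares about the plain properties can ignore the keywords entirely, and the response still reads like any other JSON API.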

Calling things by their true name

What’s cool about JSON-LD is that it takes your API and makes it interoperable with RDF. And what’s cool about RDF — if you’ll accept that there’s anything cool about RDF — is that it can assign everything a name, and that name is meaningful and globally unique.

Naming things is one of the traditional “hard problems of computer science”, so this actually matters. And the way RDF names things should be immediately understandable to every developer: names are URLs.

Following the fantasy trope, when you know the true name of something, you have power over it.

Having the URL for a term in RDF tells you whether it’s the same as something you already know about. Computationally, you know more about what “JSON” is if you know it’s the same as https://www.wikidata.org/wiki/Q2063 or http://dbpedia.org/resource/JSON.

And if you have the URL for something that you don’t already know about, you can usually go to that URL and find more information. For example, that’s how you’d confirm that Wikidata’s “Q2063” and DBPedia’s “JSON” are the same thing as each other. That’s what makes all of this information “Linked Data”, not just data.

When you say “URL”, you must actually mean “IRI”.

It’s good to talk to you again, Imaginary Interlocutor, but do you have to be such a web-standards pedant? Nobody knows what an IRI is. I’m going to keep calling these URLs, especially because I really do intend every one of them that I produce to locate a resource.

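The names in ConceptNet may look like ad-hoc identifiers, like "/c/en/knowledge_graph" and "cc:by-sa/4.0". The property names, such as "dataset", look pretty ad-hoc too. But these are just short nicknames, and via JSON-LD, we can find the true names of all of these:

  • "/c/en/knowledge_graph" is http://api.conceptnet.io/c/en/knowledge_graph

  • "cc:by-sa/4.0" is http://creativecommons.org/licenses/by-sa/4.0

  • "dataset" is http://api.conceptnet.io/ld/conceptnet5.6/context.ld.json#dataset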

The way to turn the strings in the API response into these true names is using ConceptNet’s JSON-LD context. Don’t get too bogged down in it right now. One thing it provides is prefixes that let us use shorter names for things. Here’s the prefix that lets "cc:by-sa/4.0" point to the Creative Commons URL above:

"cc": "http://creativecommons.org/licenses/",

It also has a base URL, for interpreting relative URLs such as /c/en/knowledge_graph. The base URL happens to be the URL of the context itself, because why not:

"@base": "http://api.conceptnet.io/ld/conceptnet5.6/context.ld.json",

Some of the property names are things that we define. This line says that “weight” is a property that’s defined in ConceptNet’s context (cn: for short), and its value is a floating-point number:

"weight": {"@id": "cn:weight", "@type": "xsd:float"},

Some of the properties are already meaningfully defined elsewhere. For example, we can have “comment” fields in API responses, whose values are strings to be read by the API user. This notion of a comment already exists in RDF Schema.

"comment": {"@id": "rdfs:comment", "@type": "xsd:string"},

With this line, we can specify that when we say “comment”, we mean “rdfs:comment”, which when you expand the prefix means “http://www.w3.org/2000/01/rdf-schema#comment”.
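Here's a hand-rolled sketch of what that expansion amounts to. It is a simplification, not how a real JSON-LD processor such as PyLD works internally, and it only covers prefixed names and relative URLs. The cc: prefix and the base URL come from the context lines quoted above, the rdfs: prefix follows from the expansion of rdfs:comment, and the value of the cn: prefix is my assumption.

```python
from urllib.parse import urljoin

BASE = 'http://api.conceptnet.io/ld/conceptnet5.6/context.ld.json'

PREFIXES = {
    'cc': 'http://creativecommons.org/licenses/',
    'rdfs': 'http://www.w3.org/2000/01/rdf-schema#',
    # Assumption: cn: points at fragments of the context file itself.
    'cn': BASE + '#',
}


def expand(name):
    """Expand a prefixed name or a relative URL into a full URL."""
    prefix, colon, rest = name.partition(':')
    if colon and prefix in PREFIXES:
        return PREFIXES[prefix] + rest
    # No known prefix: resolve the name relative to the base URL.
    return urljoin(BASE, name)


for short in ['cc:by-sa/4.0', '/c/en/knowledge_graph', 'rdfs:comment', 'cn:weight']:
    print(short, '->', expand(short))
```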

Let’s take a step back. What do you do with this kind of information?

I think the most likely user who cares about the linked data in ConceptNet is someone who’s building something larger out of ConceptNet and other resources. This would match my experience in building ConceptNet, where the inputs that are available in RDF are the ones I can be confident that I’m handling correctly, even if they update in the future.

Let’s talk about how things used to be with WordNet. If I want to refer to a particular item in WordNet, such as the synset {example, instance, illustration, representative}, there are a number of ways I could describe it, and most of them probably wouldn’t be consistent with anything else. I could give you synset names that you can look up, such as example.n.01 or illustration.n.03. These numbers might change with new versions of WordNet, and there’s no way to inherently know that they refer to the same thing.

I could also give you an internal ID such as 05828980-n, which at least is a single name for the synset, but all of these IDs would change with new releases of WordNet.

And this really got better because of RDF?

Yep. When using multiple data sources that are based on WordNet, you used to need a table that tells you which IDs are the same as which other IDs — basically a kind of Rosetta stone lining up names and numbers from different versions of WordNet. Hopefully some researcher somewhere has made the table you need.

But the fact that WordNet is in RDF now means that I know the global, true name that I can call this WordNet entry: http://wordnet-rdf.princeton.edu/id/05828980-n. I don’t need a Rosetta stone to know what this URL refers to. I can even go to that URL to find out more about it.
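To put the difference in code: in this toy sketch, the lookup table plays the part of the Rosetta stone, and the URL is the one name everything resolves to. The table holds only the two synset names and the internal ID mentioned above. The function names are mine, not part of any WordNet library.

```python
WORDNET_RDF_ROOT = 'http://wordnet-rdf.princeton.edu/id/'

# A two-entry "Rosetta stone": synset names mapped to the internal ID.
NAME_TO_ID = {
    'example.n.01': '05828980-n',
    'illustration.n.03': '05828980-n',
}


def synset_url(internal_id):
    """Turn an internal WordNet ID into its wordnet-rdf URL."""
    return WORDNET_RDF_ROOT + internal_id


def same_synset(name1, name2):
    """Two synset names mean the same thing if they resolve to the same URL."""
    return synset_url(NAME_TO_ID[name1]) == synset_url(NAME_TO_ID[name2])


print(synset_url(NAME_TO_ID['example.n.01']))
print(same_synset('example.n.01', 'illustration.n.03'))
```

Once every data source uses the URL directly, the table is no longer needed.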

But that’s just the same internal ID shoved into a URL. How does that make a difference?

Putting it into a URL means that it’s more than just an internal ID now. Regardless of where the ID number came from originally, it’s an implicit promise that this URL consistently refers to the synset {example, instance, illustration, representative}.

And, importantly, it suggests that if you’re building something on top of WordNet, you should use the same URL to identify the same synset. These wordnet-rdf URLs are also used by the Open Multilingual WordNet project, so you can be sure of when terms in different languages are intended to refer to the same thing, and you can align the data OMW provides with WordNet data you get from other sources.

Using PyLD

The PyLD library lets us interpret JSON-LD responses, and apply various standard transformations to them.

For example, maybe instead of our own API format, you want to see the data in ConceptNet in a format that some other project uses. One format you might like is N-Triples, a simple text format that’s like CSV if CSV were annoying to parse. Each line is an RDF statement, containing the subject, the predicate, and the object, and ending with a dot. The URLs involved are fully expanded.

This format is also called N-Quads now. We could replace the dot with a fourth thing called a “named graph”, but we don’t.

To produce this format, we’ll use jsonld.normalize. N-Quads is one of the two formats it can output.

In [1]:
from pyld import jsonld
In [2]:
def show_nquads(url):
    print(jsonld.normalize(url, {'format': 'application/nquads'}))
In [3]:
show_nquads('http://api.conceptnet.io/c/en/knowledge_graph')
<http://api.conceptnet.io/a/[/r/Synonym/,/c/fr/graphe_de_connaissances/n/,/c/en/knowledge_graph/]> <http://api.conceptnet.io/ld/conceptnet5.6/context.ld.json#dataset> <http://api.conceptnet.io/d/wiktionary/fr> .
<http://api.conceptnet.io/a/[/r/Synonym/,/c/fr/graphe_de_connaissances/n/,/c/en/knowledge_graph/]> <http://api.conceptnet.io/ld/conceptnet5.6/context.ld.json#end> <http://api.conceptnet.io/c/en/knowledge_graph> .
<http://api.conceptnet.io/a/[/r/Synonym/,/c/fr/graphe_de_connaissances/n/,/c/en/knowledge_graph/]> <http://api.conceptnet.io/ld/conceptnet5.6/context.ld.json#license> <http://creativecommons.org/licenses/by-sa/4.0> .
<http://api.conceptnet.io/a/[/r/Synonym/,/c/fr/graphe_de_connaissances/n/,/c/en/knowledge_graph/]> <http://api.conceptnet.io/ld/conceptnet5.6/context.ld.json#rel> <http://api.conceptnet.io/r/Synonym> .
<http://api.conceptnet.io/a/[/r/Synonym/,/c/fr/graphe_de_connaissances/n/,/c/en/knowledge_graph/]> <http://api.conceptnet.io/ld/conceptnet5.6/context.ld.json#source> <http://api.conceptnet.io/and/[/s/process/wikiparsec/1/,/s/resource/wiktionary/fr/]> .
<http://api.conceptnet.io/a/[/r/Synonym/,/c/fr/graphe_de_connaissances/n/,/c/en/knowledge_graph/]> <http://api.conceptnet.io/ld/conceptnet5.6/context.ld.json#start> <http://api.conceptnet.io/c/fr/graphe_de_connaissances/n> .
<http://api.conceptnet.io/a/[/r/Synonym/,/c/fr/graphe_de_connaissances/n/,/c/en/knowledge_graph/]> <http://api.conceptnet.io/ld/conceptnet5.6/context.ld.json#weight> "1.0E0"^^<http://www.w3.org/2001/XMLSchema#float> .
<http://api.conceptnet.io/a/[/r/Synonym/,/c/fr/graphe_de_connaissances/n/,/c/en/knowledge_graph/]> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://api.conceptnet.io/ld/conceptnet5.6/context.ld.json#Edge> .
<http://api.conceptnet.io/and/[/s/process/wikiparsec/1/,/s/resource/wiktionary/fr/]> <http://api.conceptnet.io/ld/conceptnet5.6/context.ld.json#process> <http://api.conceptnet.io/s/process/wikiparsec/1> .
<http://api.conceptnet.io/and/[/s/process/wikiparsec/1/,/s/resource/wiktionary/fr/]> <http://purl.org/dc/terms/contributor> <http://api.conceptnet.io/s/resource/wiktionary/fr> .
<http://api.conceptnet.io/and/[/s/process/wikiparsec/1/,/s/resource/wiktionary/fr/]> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://api.conceptnet.io/ld/conceptnet5.6/context.ld.json#Source> .
<http://api.conceptnet.io/c/en/knowledge_graph> <http://api.conceptnet.io/ld/conceptnet5.6/context.ld.json#edges> <http://api.conceptnet.io/a/[/r/Synonym/,/c/fr/graphe_de_connaissances/n/,/c/en/knowledge_graph/]> .
<http://api.conceptnet.io/c/en/knowledge_graph> <http://api.conceptnet.io/ld/conceptnet5.6/context.ld.json#label> "knowledge graph" .
<http://api.conceptnet.io/c/en/knowledge_graph> <http://api.conceptnet.io/ld/conceptnet5.6/context.ld.json#term> <http://api.conceptnet.io/c/en/knowledge_graph> .
<http://api.conceptnet.io/c/en/knowledge_graph> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://api.conceptnet.io/ld/conceptnet5.6/context.ld.json#Node> .
<http://api.conceptnet.io/c/fr/graphe_de_connaissances/n> <http://api.conceptnet.io/ld/conceptnet5.6/context.ld.json#label> "graphe de connaissances" .
<http://api.conceptnet.io/c/fr/graphe_de_connaissances/n> <http://api.conceptnet.io/ld/conceptnet5.6/context.ld.json#sense_label> "n" .
<http://api.conceptnet.io/c/fr/graphe_de_connaissances/n> <http://api.conceptnet.io/ld/conceptnet5.6/context.ld.json#term> <http://api.conceptnet.io/c/fr/graphe_de_connaissances> .
<http://api.conceptnet.io/c/fr/graphe_de_connaissances/n> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://api.conceptnet.io/ld/conceptnet5.6/context.ld.json#Node> .
<http://api.conceptnet.io/r/Synonym> <http://api.conceptnet.io/ld/conceptnet5.6/context.ld.json#label> "Synonym" .
<http://api.conceptnet.io/r/Synonym> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://api.conceptnet.io/ld/conceptnet5.6/context.ld.json#Relation> .

There you go. It’s not pretty, but everything is pretty much spelled out. With N-Quads format, you could process ConceptNet the same way as WordNet or DBPedia.
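To show what "annoying to parse" means in practice, here's a minimal sketch of a parser for lines like the ones above. It only handles the cases that appear in this output: URLs in angle brackets and quoted literals with an optional datatype. Blank nodes, language tags, and the fourth "named graph" field are left out, so use a real RDF library for anything serious.

```python
import re

IRI = r'<([^>]*)>'
LITERAL = r'"((?:[^"\\]|\\.)*)"(?:\^\^<([^>]*)>)?'
LINE = re.compile(
    r'^\s*' + IRI + r'\s+' + IRI + r'\s+(?:' + IRI + '|' + LITERAL + r')\s*\.\s*$'
)


def parse_triple(line):
    """Parse one simple N-Triples line into (subject, predicate, object)."""
    match = LINE.match(line)
    if match is None:
        raise ValueError('not a line this sketch understands: %r' % line)
    subj, pred, obj_iri, literal, datatype = match.groups()
    if obj_iri is not None:
        obj = {'type': 'IRI', 'value': obj_iri}
    else:
        obj = {'type': 'literal', 'value': literal, 'datatype': datatype}
    return subj, pred, obj


line = (
    '<http://api.conceptnet.io/r/Synonym> '
    '<http://api.conceptnet.io/ld/conceptnet5.6/context.ld.json#label> '
    '"Synonym" .'
)
print(parse_triple(line))
```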

The other available format, besides N-Quads, is a list of dictionaries, which is a good format for working with this data programmatically when you’re not writing it to a file, but is ridiculously verbose to look at:

In [4]:
edges = jsonld.normalize('http://api.conceptnet.io/c/en/knowledge_graph')['@default']
edges[:5]
Out[4]:
[{'object': {'type': 'IRI',
   'value': 'http://api.conceptnet.io/d/wiktionary/fr'},
  'predicate': {'type': 'IRI',
   'value': 'http://api.conceptnet.io/ld/conceptnet5.6/context.ld.json#dataset'},
  'subject': {'type': 'IRI',
   'value': 'http://api.conceptnet.io/a/[/r/Synonym/,/c/fr/graphe_de_connaissances/n/,/c/en/knowledge_graph/]'}},
 {'object': {'type': 'IRI',
   'value': 'http://api.conceptnet.io/c/en/knowledge_graph'},
  'predicate': {'type': 'IRI',
   'value': 'http://api.conceptnet.io/ld/conceptnet5.6/context.ld.json#end'},
  'subject': {'type': 'IRI',
   'value': 'http://api.conceptnet.io/a/[/r/Synonym/,/c/fr/graphe_de_connaissances/n/,/c/en/knowledge_graph/]'}},
 {'object': {'type': 'IRI',
   'value': 'http://creativecommons.org/licenses/by-sa/4.0'},
  'predicate': {'type': 'IRI',
   'value': 'http://api.conceptnet.io/ld/conceptnet5.6/context.ld.json#license'},
  'subject': {'type': 'IRI',
   'value': 'http://api.conceptnet.io/a/[/r/Synonym/,/c/fr/graphe_de_connaissances/n/,/c/en/knowledge_graph/]'}},
 {'object': {'type': 'IRI', 'value': 'http://api.conceptnet.io/r/Synonym'},
  'predicate': {'type': 'IRI',
   'value': 'http://api.conceptnet.io/ld/conceptnet5.6/context.ld.json#rel'},
  'subject': {'type': 'IRI',
   'value': 'http://api.conceptnet.io/a/[/r/Synonym/,/c/fr/graphe_de_connaissances/n/,/c/en/knowledge_graph/]'}},
 {'object': {'type': 'IRI',
   'value': 'http://api.conceptnet.io/and/[/s/process/wikiparsec/1/,/s/resource/wiktionary/fr/]'},
  'predicate': {'type': 'IRI',
   'value': 'http://api.conceptnet.io/ld/conceptnet5.6/context.ld.json#source'},
  'subject': {'type': 'IRI',
   'value': 'http://api.conceptnet.io/a/[/r/Synonym/,/c/fr/graphe_de_connaissances/n/,/c/en/knowledge_graph/]'}}]
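As an example of working with this format programmatically, here's a sketch that groups the statements by their subject. The function name is mine. The sample input is the first two dictionaries from the output above, so the sketch runs without calling the API.

```python
from collections import defaultdict

CONTEXT = 'http://api.conceptnet.io/ld/conceptnet5.6/context.ld.json'
EDGE = ('http://api.conceptnet.io/a/[/r/Synonym/,'
        '/c/fr/graphe_de_connaissances/n/,/c/en/knowledge_graph/]')


def group_by_subject(triples):
    """Map each subject URL to a list of its (predicate, object) pairs."""
    grouped = defaultdict(list)
    for triple in triples:
        subject = triple['subject']['value']
        pair = (triple['predicate']['value'], triple['object']['value'])
        grouped[subject].append(pair)
    return dict(grouped)


# The first two statements from the output above.
sample = [
    {'subject': {'type': 'IRI', 'value': EDGE},
     'predicate': {'type': 'IRI', 'value': CONTEXT + '#dataset'},
     'object': {'type': 'IRI', 'value': 'http://api.conceptnet.io/d/wiktionary/fr'}},
    {'subject': {'type': 'IRI', 'value': EDGE},
     'predicate': {'type': 'IRI', 'value': CONTEXT + '#end'},
     'object': {'type': 'IRI', 'value': 'http://api.conceptnet.io/c/en/knowledge_graph'}},
]

for subject, pairs in group_by_subject(sample).items():
    print(subject)
    for predicate, obj in pairs:
        print('   ', predicate.split('#')[-1], '->', obj)
```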

Example: displaying the graph

What we just got out of JSON-LD is a graph structure, and Python gives us ways to visualize graphs, such as the appropriately-named graphviz wrapper.

We can use this anonymous list-of-dictionaries format to provide input to graphviz. We just need some code that prettifies it a little bit.

In [5]:
import graphviz
from conceptnet5.uri import join_uri, split_uri
API_ROOT = 'http://api.conceptnet.io'

def short_name(value, max_length=40):
    """
    Convert an RDF value (given as a dictionary) to a reasonable label.
    """
    if value['type'] == 'blank node':
        return '_'
    elif value['type'] == 'IRI':    
        url = value['value']
        if '#' in url:
            # Show just the fragment of URLs with a fragment
            # (it's probably a property name)
            return url.split('#')[-1]

        # Give URLs relative to the root of our API
        if url.startswith(API_ROOT):
            short_url = url[len(API_ROOT):]
            # If the URL is too long, hide it
            if len(short_url) > max_length:
                pieces = split_uri(short_url)
                return join_uri(pieces[0], '...')
            else:
                return short_url
        else:
            return url.split('://')[-1]
    else:
        # Put literal values in quotes. Colons are removed first, because
        # graphviz reads a colon in a node name as a port separator.
        text = value['value'].replace(':', '')
        if len(text) > max_length:
            text = text[:max_length] + '...'
        return '"{}"'.format(text)

    
def show_graph(url, size=10):
    """
    Show the graph structure of a ConceptNet API response.
    """
    rdf = jsonld.normalize(url)['@default']
    graph = graphviz.Digraph(
        strict=False, graph_attr={'size': str(size), 'rankdir': 'LR'}
    )
    for edge in rdf:
        subj = short_name(edge['subject'])
        obj = short_name(edge['object'])
        pred = short_name(edge['predicate'])
        if subj and obj and pred:
            # Apply different styles to the nodes based on whether they're
            # literals, ConceptNet URLs, or other URLs
            if obj.startswith('"'):
                # Literal values
                graph.node(obj, penwidth='0')
            elif obj.startswith('/'):
                # ConceptNet nodes
                graph.node(obj, style='filled', fillcolor="#ddeeff")
            else:
                # Other URLs
                graph.node(obj, color="#558855")
            graph.edge(subj, obj, label=pred)
    
    return graph
In [6]:
show_graph('http://api.conceptnet.io/c/en/knowledge_graph')
Out[6]:
[Graph image: the edge node /a/… points to its dataset (/d/wiktionary/fr), start (/c/fr/graphe_de_connaissances/n), end (/c/en/knowledge_graph), rel (/r/Synonym), license (creativecommons.org/licenses/by-sa/4.0), source (/and/…), weight ("1.0E0"), and type (Edge). Each of those nodes has further edges for its own label, term, type, and so on.]

Wait. This tentacle monster is what a single assertion in ConceptNet looks like?

Yes. I bet you were expecting something more like this:

In [7]:
graph = graphviz.Graph(
    graph_attr={'size': '10', 'rankdir': 'LR'},
    node_attr={'style': 'filled', 'fillcolor': "#ddeeff"}
)
graph.edge('/c/en/knowledge_graph', '/c/fr/graphe_de_connaissances', label='/r/Synonym')
graph
Out[7]:
[Graph image: /c/en/knowledge_graph and /c/fr/graphe_de_connaissances joined by a single undirected edge labeled /r/Synonym.]

And you’ll often see claims that RDF can describe knowledge graphs in this way, where each edge is a fact in the knowledge base.

But this leaves no room for any interesting information about the edge, such as the sources that it comes from or how strongly we believe it. To talk about an edge in RDF, you have to “reify” it — to turn the edge into a node, and describe it with more edges. And that’s what we’ve done.
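Here's a sketch of what reifying a single fact looks like. It is not ConceptNet's own code. The property names are the short ones from the API response, and the edge ID is built to imitate the format of the "@id" seen in the example, which is my reading of that one example rather than a documented rule.

```python
def reify(start, rel, end, weight, sources):
    """
    Turn one fact (start, rel, end) into a node of its own, described by
    several triples, so that there is room for a weight and sources.
    """
    # Assumption: this imitates the edge "@id" format from the example.
    edge_id = '/a/[{}/,{}/,{}/]'.format(rel, start, end)
    triples = [
        (edge_id, 'type', 'Edge'),
        (edge_id, 'rel', rel),
        (edge_id, 'start', start),
        (edge_id, 'end', end),
        (edge_id, 'weight', weight),
    ]
    triples.extend((edge_id, 'source', source) for source in sources)
    return edge_id, triples


edge_id, triples = reify(
    start='/c/fr/graphe_de_connaissances/n',
    rel='/r/Synonym',
    end='/c/en/knowledge_graph',
    weight=1.0,
    sources=['/and/[/s/process/wikiparsec/1/,/s/resource/wiktionary/fr/]'],
)
for triple in triples:
    print(triple)
```

One fact became six statements, which is where the tentacles come from.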

I’m not sure if anyone really wants to work with the un-reified facts in ConceptNet as RDF edges. I know that DBPedia has those, but often I see a DBPedia edge and find myself asking “okay, but really? Is this a real fact? Where did it come from?” Without reification, there’s no answer.

You’ve also got information about nodes, such as their label and their type, which are normal things to have in RDF.

Tentacles aside, could I put a lot of ConceptNet assertions into GraphViz and get a visualization of the structure of the ConceptNet graph?

You would get an illegible hairy mess that brings your image-rendering software grinding to a halt.

Which is about the same as any other large graph visualization.

Understanding the context in context

We have to go deeper.

Wait, no, we don’t have to do any of this. I want to go deeper.

One goal I had for ConceptNet’s JSON-LD context is that it should explain itself, much like RDF Schema did back in 2000. If you encounter the context file on its own, you should be able to read it and at least partially understand it. And hey, given that we’ve got all this JSON-LD stuff going on, it would be nice if a computer could also understand the stuff that you understand.

So that’s what I did. The context file doesn’t just describe an abstract vocabulary of properties; it also defines those properties. When the actual "@context" refers to identifiers such as "cn:rel", the prefix cn: refers to a fragment in this file itself, so it’s saying that #rel is defined somewhere in this file — and here it is.

The definition tells you the types of things that each property relates, such as Nodes, Edges, or Sources. It relates them to other things in RDF when possible, such as the fact that the "rel", "start", and "end" of a ConceptNet assertion play the roles of the "rdf:predicate", "rdf:subject", and "rdf:object" respectively. It provides additional explanations using the "comment" property. For example:

{
  "@id": "#rel",
  "@type": "rdf:Property",
  "subPropertyOf": "rdf:predicate",
  "domain": ["#Edge", "#Feature"],
  "range": "#Relation",
  "comment": "Links to the kind of relationship that holds between two terms. In this API, the 'rel' will always be a ConceptNet URI beginning with /r/. In RDF, this would be called the 'predicate'."
}
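A definition like this is ordinary JSON, so a program can read it as easily as a person can. Here's a minimal sketch that turns one into a one-line summary. The helper is hypothetical, and the sample is the definition above with its comment shortened.

```python
def describe_property(definition):
    """Summarize a property definition as 'name: domain -> range'."""
    name = definition['@id'].lstrip('#')
    domain = definition.get('domain', [])
    if isinstance(domain, str):
        # A single domain may be given as a bare string instead of a list.
        domain = [domain]
    domains = ' or '.join(d.lstrip('#') for d in domain)
    range_ = definition.get('range', '').lstrip('#')
    return '{}: {} -> {}'.format(name, domains, range_)


rel_definition = {
    "@id": "#rel",
    "@type": "rdf:Property",
    "subPropertyOf": "rdf:predicate",
    "domain": ["#Edge", "#Feature"],
    "range": "#Relation",
    "comment": "Links to the kind of relationship that holds between two terms.",
}

print(describe_property(rel_definition))
```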

These explanatory properties appear outside of the "@context" section, the only section that actually matters to how JSON-LD is processed. I wish I could have put comments inside the "@context", where the values really matter. But if I do that, it doesn’t validate as proper JSON-LD. You need to have already parsed the "@context" to know what a comment is, and JSON-LD doesn’t leave any wiggle room for circular definitions.

But outside of the "@context" section, I can put whatever I want. And what I choose to put there is these additional, explanatory properties that are also meaningful JSON-LD.

So you can interpret the context file, in the context of itself:

In [8]:
show_graph('http://api.conceptnet.io/ld/conceptnet5.6/context.ld.json', size=30)
Out[8]:
[Figure: the context file rendered as a graph. The classes (Edge, Feature, Node, Query, RelatedNode, Relation, Source, and the pagination PartialCollectionView) and the properties (including rel, start, end, edges, sources, weight, surfaceText, label, term, and license) appear as nodes, linked by arrows labeled type, domain, range, subClassOf, subPropertyOf, and comment.]

Feel free to squint at this tangled web if you really like graphs about graphs. It’s like API documentation, squared!

But speaking of that, remember that you can also read ConceptNet’s API documentation in English instead of in JSON-LD.

Did… did you just make an ontology? That seems out of character.

The ontology was always there, Imaginary Interlocutor. We’ve just moved it from the realm of Platonic ideals to a JSON file you can download.

In the end, what have you accomplished here?

The next time someone asks me if ConceptNet is available in RDF form, I can say “yes”.

ConceptNet 5.6 released

ConceptNet 5.6 is out!

We’ve made a lot of changes behind the scenes that should have fairly small effects on the way you use ConceptNet. Some of the changes are:

  • We normalize text properly in more languages. Arabic words no longer insist on matching vowel points that nobody writes in real text. Serbian/Croatian words now have a unified vocabulary written in the Latin alphabet, instead of some words being in the Latin alphabet and some in Cyrillic.

  • ConceptNet knows what emoji are and can define them in a number of languages, thanks to importing Unicode CLDR data. 😺

  • We’ve included data from CC-CEDICT, an open Chinese dictionary.

  • For fans of self-explaining APIs and what’s left of the Semantic Web: Everything returned by the ConceptNet API is now valid JSON-LD, and we now test to make sure this is true. You can use a JSON-LD processor to convert responses from the ConceptNet API into other formats such as RDF triples.

  • We no longer use Docker to deploy ConceptNet. It caused no end of inscrutable problems and it didn’t make anything easier. Sorry for getting caught up in the hype. We still provide ways to configure a machine to serve ConceptNet exactly like we do.
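To give a feel for the Arabic normalization mentioned in the first item of the list above, here is a small sketch that drops the optional vowel marks from a word so that marked and unmarked spellings compare equal. This is not ConceptNet's actual normalization code; the function name and the choice to strip exactly the marks U+064B through U+0652 are assumptions made for illustration.

```python
# Assumption: treat the tashkil marks U+064B..U+0652 as the "vowel points" to drop.
# This range covers the short vowels, tanwin, shadda, and sukun.
VOWEL_POINTS = {chr(codepoint) for codepoint in range(0x064B, 0x0653)}


def strip_vowel_points(text):
    """Remove Arabic vowel marks, leaving the base letters untouched."""
    return "".join(ch for ch in text if ch not in VOWEL_POINTS)


# "kitab" (book) written with vowel marks, and as it is usually written.
with_marks = "\u0643\u0650\u062A\u064E\u0627\u0628"
without_marks = "\u0643\u062A\u0627\u0628"

normalized = strip_vowel_points(with_marks)
print(normalized == without_marks)
```

Filtering a fixed range rather than every combining character keeps the sketch from touching marks that change the identity of a letter.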

More details are on the changelog on the ConceptNet wiki.

We also moved our blog — the one you’re reading now — from WordPress to a static site generated with Nikola. One feature this provides is that we can post Python notebooks directly on the blog, instead of having to use an external service such as Gist. This makes it much easier to post tutorials, and we hope to do this shortly.

Interview on de-biasing NLP

In several previous posts here, I’ve been discussing the risks of biased AI, particularly in natural language processing tasks. It’s part of my work at Luminoso to address this, both in ConceptNet and in Luminoso’s services. We’re not just warning about what not to do; we’re promoting best practices about what you can do to prevent bias right now.

At Luminoso, Denise Christie and I recorded a discussion about de-biasing, touching on a lot of the current issues that create bias in NLP and what to do about them. The recording and transcript are now available on Luminoso’s site.