Yes, people do want pre-computed word embeddings

The very informative tutorial by Vlad Niculae on Word Mover’s Distance in Python includes this step:

We could train the embeddings ourselves, but for meaningful results we would need tons of documents, and that might take a while. So let’s just use the ones from the word2vec team.

I couldn’t have asked for a better justification for ConceptNet and Luminoso in two sentences.

When I present new results from ConceptNet Numberbatch, which works way better than word2vec alone, one objection I hear is that the embeddings are pre-computed and aren’t based on your data. (Luminoso is a SaaS platform that retrains them on your data, in the cases where you do need that.)

Pre-baked embeddings are useful. People are resigning themselves to using word2vec’s pre-baked embeddings because they don’t know they can have better ones. I dream of the day when someone writing a new tutorial like this says “So let’s just use ConceptNet Numberbatch.”

4 thoughts on “Yes, people do want pre-computed word embeddings”

  1. Assuming everything works, I’ll be using it for WMD and “conceptbatch!”
    (PS: Is any special preprocessing of the raw text needed before I apply gensim/WMD/pretrained Numberbatch to it?)


    1. In the current version of ConceptNet Numberbatch, the pre-processing is the same as for word2vec: lowercase your text, replace spaces with underscores if you’re looking up multi-word phrases, and (this is the weird part) replace sequences of 2 or more consecutive digits with the symbol #.


      1. @Rob: Are these pre-processing steps documented somewhere? Would love to understand if there has been some sort of standardization around these text pre-processing steps.
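
The three pre-processing steps described in the reply above (lowercase, underscores for spaces, `#` for runs of digits) can be sketched in a few lines of Python. This is only an illustration, not the project's own code: `normalize_term` is a hypothetical helper name, and the reply does not say whether a run of digits becomes a single `#` or one `#` per digit. The sketch assumes one `#` per digit, so check the result against the vocabulary of the vectors you actually load.

```python
import re

# Runs of two or more consecutive digits; a lone digit is left untouched.
DIGIT_RUN = re.compile(r"[0-9]{2,}")


def normalize_term(text):
    """Hypothetical helper: normalize a word or phrase before a vector lookup."""
    # Step 1: lowercase the text.
    term = text.strip().lower()
    # Step 2: join multi-word phrases with underscores.
    term = re.sub(r"\s+", "_", term)
    # Step 3: replace each run of 2+ digits.
    # Assumption: one '#' per digit, not a single '#' for the whole run.
    return DIGIT_RUN.sub(lambda m: "#" * len(m.group()), term)


print(normalize_term("New York"))   # new_york
print(normalize_term("Route 66"))   # route_##
print(normalize_term("Windows 7"))  # windows_7
```

If the vectors you load turn out to use a single `#` per run, only the replacement in step 3 needs to change.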

