An introduction to the ConceptNet Vector Ensemble

Originally published on April 6, 2016 at https://blog.luminoso.com/2016/04/06/an-introduction-to-the-conceptnet-vector-ensemble/.

Here’s a big idea that’s taken hold in natural language processing: meanings are vectors. A text-understanding system can represent the approximate meaning of a word or phrase by representing it as a vector in a multi-dimensional space. Vectors that are close to each other represent similar meanings.

A fragment of a concept-cloud visualization of the ConceptNet Vector Ensemble (CNVE). Words that appear close to each other are similar.
A fragment of a concept-cloud visualization of the ConceptNet Vector Ensemble (CNVE).

Vectors are how Luminoso has always represented meaning. When we started Luminoso, this was seen as a bit of a crazy idea.

It was an exciting time when the idea of vectors as meanings was suddenly popularized by the Google research project word2vec. Now this isn’t considered a crazy idea anymore, it’s considered the effective thing to do.

Luminoso’s starting point — its model of word meanings when it hasn’t seen any of your documents — comes from a vector-based representation of ConceptNet 5. That gives it general knowledge about what words mean. These vectors are then automatically adjusted based on the specific way that words are used in your domain.

But you might well ask: if these newer systems such as word2vec or GloVe are so effective, should we be using them as our starting point?

As the girl in the Old El Paso commercial asks, "Why don't we have both?"

The best representation of word meanings we’ve seen — and we think it’s the best representation of word meanings anyone has seen — is our new ensemble that combines ConceptNet, GloVe, PPDB, and word2vec. It’s described in our paper, “An Ensemble Method to Produce High-Quality Word Embeddings“, and it’s reproducible using this GitHub repository.

We call this the ConceptNet Vector Ensemble. These domain-general word embeddings fill the same niche as, for example, the word2vec Google News vectors, but by several measures, they represent related meanings more like people do.

A comparison of some word-embedding systems on two measures of word relatedness. Our system, CNVE, is the red dot in the upper right.
A comparison of some word-embedding systems on two measures of word relatedness. Our system, CNVE, is the red dot in the upper right.

Expanding on “retrofitting”

Manaal Faruqui’s Retrofitting, from CMU’s Language Technologies Institute, is a very cool idea.

Every system of word vectors is going to reflect the set of data it was trained on, which means there’s probably more information from outside that data that could make it better. If you’ve got a good set of word vectors, but you wish there was more information it had taken into account — particularly a knowledge graph — you can use a fairly straightforward “retrofitting” procedure to adjust the vectors accordingly.

Starting with some vectors and adjusting them based on new information — that sure sounds like what I just described about what Luminoso does, right? Faruqui’s retrofitting is not the particular process we use inside Luminoso’s products, but the general idea is related enough to Luminoso’s proprietary process that working with it was quite natural for us, and we found that it does work well.

There’s one idea from our process that can be added to retrofitting easily: if you have information about words that weren’t in your vocabulary to start with, you should automatically expand your vector space to include them.

Faruqui describes some retrofitting combinations that work well, such as combining GloVe with WordNet. I don’t think anyone had tried doing anything like this with ConceptNet before, and it turns out to be a pretty powerful source of knowledge to add. And when you add this idea of automatically expanding the vocabulary, now you can also represent all the words and phrases in ConceptNet that weren’t in the vocabulary of your original vector space, such as words in other languages.

The multilingual knowledge in ConceptNet is particularly relevant here. Our ensemble can learn more about words based on the things they translate to in languages besides English, and it can represent those words in other languages with the same kind of vectors that it uses to represent English words.

There’s clearly more to be done to extend the full power of this representation to non-English languages. It would be better, for example, if it started with some text in other languages that it could learn from and retrofit onto, instead of relying entirely on the multilingual links in ConceptNet. But it’s promising that the Spanish vectors that our ensemble learns entirely from ConceptNet, starting from having no idea what Spanish is, perform better at word similarity than a system trained on the text of the Spanish Wikipedia.

On the other hand, you have GloVe

For some reason, everyone in this niche talks about word2vec and few people talk about the similar system GloVe, from Stanford NLP. We were more drawn to GloVe as something to experiment with, as we find the way it works clearer than word2vec.

When we compared word2vec and GloVe, we got better initial results from GloVe. Levy et al. report the opposite. I think what this shows is that a whole lot of the performance of these systems is in the fine details of how you use them. And indeed, when we tweak the way we use GloVe — particularly when we borrow a process from ConceptNet to normalize words to their root form — we get word similarities that are much better than word2vec and the original GloVe, even before we retrofit anything onto it.

You can probably guess the next step: “why don’t we use both?” word2vec’s most broadly useful vectors come from Google News articles, while GloVe’s come from reading the Web at large. Those represent different kinds of information. Both of them should be in the system. In the ConceptNet Vector Ensemble, we build a vector space that combines word2vec and GloVe before we start retrofitting.

The data flow of building the ConceptNet Vector Ensemble.

You can see that creating state-of-the-art word embeddings involves ideas from a number of different people. A few of them are our own — particularly ConceptNet 5, which is entirely developed at Luminoso these days, and the various ways we transformed word embeddings to make them work better together.

This is an exciting, fast-moving area of NLP. We’re telling everyone about our vectors because the openness of word-embedding research made them possible, and if we kept our own improvement quiet, the field would probably find a way to move on without it at the cost of some unnecessary effort.

These vectors are available for download under a Creative Commons Attribution Share-Alike license. If you’re working on an application that starts from a vector representation of words — maybe you’re working in the still-congealing field of Deep Learning methods for NLP — you should give the ConceptNet Vector Ensemble a try.

wordfreq 1.2 is better at Chinese, English, Greek, Polish, Swedish, and Turkish

Originally posted on October 29, 2015 at https://blog.luminoso.com/2015/10/29/wordfreq-1-2-is-better-at-chinese-english-greek-polish-swedish-and-turkish/.
Wordfreq 1.2 example code
Examples in Chinese and British English. Click through for copyable code.

In a previous post, we introduced wordfreq, our open-source Python library that lets you ask “how common is this word?”

Wordfreq is an important low-level tool for Luminoso. It’s one of the things we use to figure out which words are important in a set of text data. When we get the word frequencies figured out in a language, that’s a big step toward being able to handle that language from end to end in the Luminoso pipeline. We recently started supporting Arabic in our product, and improved Chinese enough to take the “BETA” tag off of it, and having the right word frequencies for those languages was a big part of it.

I’ve continued to work on wordfreq, putting together more data from more languages. We now have 17 languages that meet the threshold of having three independent sources of word frequencies, which we consider important for those word frequencies to be representative.

Here’s what’s new in wordfreq 1.2:

  • The English word list has gotten a bit more robust and a bit more British by including SUBTLEX, adding word frequencies from American TV shows as well as the BBC.
  • It can fearlessly handle Chinese now. It uses a lovely pure-Python Chinese tokenizer, Jieba, to handle multiple-word phrases, and Jieba’s built-in wordlist provides a third independent source of word frequencies. Wordfreq can even smooth over the differences between Traditional and Simplified Chinese.
  • Greek has also been promoted to a fully-supported language. With new data from Twitter and OpenSubtitles, it now has four independent sources.
  • In some applications, you want to tokenize a complete piece of text, including punctuation as separate tokens. Punctuation tokens don’t get their own word frequencies, but you can ask the tokenizer to give you the punctuation tokens anyway.
  • We added support for Polish, Swedish, and Turkish. All those languages have a reasonable amount of data that we could obtain from OpenSubtitles, Twitter, and Wikipedia by doing what we were doing already.

When adding Turkish, we made sure to convert the case of dotted and dotless İ’s correctly. We know that putting the dots in the wrong places can lead to miscommunication and even fatal stabbings.

The language in wordfreq that’s still only partially supported is Korean. We still only have two sources of data for it, so you’ll see the disproportionate influence of Twitter on its frequencies. If you know where to find a lot of freely-usable Korean subtitles, for example, we would love to know.

Let’s revisit the top 10 words in the languages wordfreq supports. And now that we’ve talked about getting right-to-left right, let’s add a bit of code that makes Arabic show up with right-to-left words in left-to-right order, instead of middle-to-elsewhere order like it came out before.

Code showing the top ten words in each language wordfreq 1.2 supports.

Wordfreq 1.2 is available on GitHub and PyPI.

wordfreq: Open source and open data about word frequencies

Originally posted on September 1, 2015 at https://blog.luminoso.com/2015/09/01/wordfreq-open-source-and-open-data-about-word-frequencies/.

Often, in NLP, you need to answer the simple question: “is this a common word?” It turns out that this leaves the computer to answer a more vexing question: “What’s a word?”

Let’s talk briefly about why word frequencies are important. In many cases, you want to assign more significance to uncommon words. For example, a product review might contain the word “use” and the word “defective”, and the word “defective” carries way more information. If you’re wondering what the deal is with John Kasich, a headline that mentions “Kasich” will be much more likely to be what you’re looking for than one that merely mentions “John”.

For purposes like these, it would be nice if we could just import a Python package that could tell us whether one word was more common than another, in general, based on a wide variety of text. We looked for a while and couldn’t find it. So we built it.

wordfreq provides estimates of the frequencies of words in many languages, loading its data from efficiently-compressed data structures so it can give you word frequencies down to 1 occurrence per million without having to access an external database. It aims to avoid being limited to a particular domain or style of text, getting its data from a variety of sources: Google Books, Wikipedia, OpenSubtitles, Twitter, and the Leeds Internet Corpus.

The 10 most common words that wordfreq knows in 15 languages.
The 10 most common words that wordfreq knows in 15 languages. Yes, it can handle multi-character words in Chinese and Japanese; those just aren’t in the top 10. A puzzle for Unicode geeks: guess where the start of the Arabic list is.

Partial solutions: stopwords and inverse document frequency

Those who are familiar with the basics of information retrieval probably have a couple of simple suggestions in mind for dealing with word frequencies.

One is to come up with a list of stopwords, words such as “the” and “of” that are too common to use for anything. Discarding stopwords can be a useful optimization, but that’s far too blunt of an operation to solve the word frequency problem in general. There’s no place to draw the bright line between stopwords and non-stopwords, and in the “John Kasich” example, it’s not the case that “John” should be a stopword.

Another partial solution would be to collect all the documents you’re interested in, and re-scale all the words according to their inverse document frequency or IDF. This is a quantity that decreases as the proportion of documents a word appears in increases, reaching 0 for a word that appears in every document.

One problem with IDF is that it can’t distinguish a word that appears in a lot of documents because it’s unimportant, from a word that appears in a lot of documents because it’s very important to your domain. Another, more practical problem with IDF is that you can’t calculate it until you’ve seen all your documents, and it fluctuates a lot as you add documents. This is particularly an issue if your documents arrive in an endless stream.

We need good domain-general word frequencies, not just domain-specific word frequencies, because without the general ones, we can’t determine which domain-specific word frequencies are interesting.

Avoiding biases

The counts of one resource alone tend to tell you more about that resource than about the language. If you ask Wikipedia alone, you’ll find that “census”, “1945”, and “stub” are very common words. If you ask Google Books, you’ll find that “propranolol” is supposed to be 10 times more common than “lol” overall (and also that there’s something funny going on, so to speak, in the early 1800s).

If you collect data from Twitter, you’ll of course find out how common “lol” is. You also might find that the ram emoji “🐏” is supposed to be extremely common, because that guy from One Direction once tweeted “We are derby super 🐏🐏🐏🐏🐏🐏🐏🐏🐏🐏🐏🐏🐏🐏🐏🐏🐏🐏🐏🐏”, and apparently every fan of One Direction who knows what Derby Super Rams are retweeted it.

Yes, wordfreq considers emoji to be words. Its Twitter frequencies would hardly be complete without them.

We can’t entirely avoid the biases that come from where we get our data. But if we collect data from enough different sources (not just larger sources), we can at least smooth out the biases by averaging them between the different sources.

What’s a word?

You have to agree with your wordlist on the matter of what constitutes a “word”, or else you’ll get weird results that aren’t supported by the actual data.

Do you split words at all spaces and punctuation? Which of the thousands of symbols in Unicode are punctuation? Is an apostrophe punctuation? Is it punctuation when it puts a word in single quotes? Is it punctuation in “can’t”, or in “l’esprit”? How many words is “U.S.” or “google.com”? How many words is “お早うございます” (“good morning”), taking into account that Japanese is written without spaces? The symbol “-” probably doesn’t count as a word, but does “+”? How about “☮” or “♥”?

The process of splitting text into words is called “tokenization”, and everyone’s got their own different way to do it, which is a bit of a problem for a word frequency list.

We tried a few ways to make a sufficiently simple tokenization function that we could use everywhere, across many languages. We ended up with our own ad-hoc rule including large sets of Unicode characters and a special case for apostrophes, and this is in fact what we used when we originally released wordfreq 1.0, which came packaged with regular expressions that look like attempts to depict the Flying Spaghetti Monster in text.

A particularly noodly regex.

But shortly after that, I realized that the Unicode Consortium had already done something similar, and they’d probably thought about it for more than a few days.

Word splitting in Unicode
Word splitting in Unicode. Not pictured: how to decide which of these segments count as “words”.

This standard for tokenization looked like almost exactly what we wanted, and the last thing holding me back was that implementing it efficiently in Python looked like it was going to be a huge pain. Then I found that the regex package (not the re package built into Python) contains an efficient implementation of this standard. Defining how to split text into words became a very simple regular expression… except in Chinese and Japanese, because a regular expression has no chance in a language where the separation between words is not written in any way.

So this is how wordfreq 1.1 identifies the words to count and the words to look up. Of course, there is going to be data that has been tokenized in a different way. When wordfreq gets something that looks like it should be multiple words, it will look them up separately and estimate their combined frequency, instead of just returning 0.

Language support

wordfreq supports 15 commonly-used languages, but of course some languages are better supported than others. English is quite polished, for example, while Chinese so far is just there to be better than nothing.

The reliability of each language corresponds pretty well with the number of different data sources we put together to make the wordlist. Some sources are hard to get in certain languages. Perhaps unsurprisingly, for example, not much of Twitter is in Chinese. Perhaps more surprisingly, not much of it is in German either.

The word lists that we’ve built for wordfreq represent the languages where we have at least two sources. I would consider the ones with two sources a bit dubious, while all the languages that have three or more sources seem to have a reasonable ranking of words.

  • 5 sources: English
  • 4 sources: Arabic, French, German, Italian, Portuguese, Russian, Spanish
  • 3 sources: Dutch, Indonesian, Japanese, Malay
  • 2 sources: Chinese, Greek, Korean

Compact wordlists

When we were still figuring this all out, we made several 0.x versions of wordfreq that required an external SQLite database with all the word frequencies, because there are millions of possible words and we had to store a different floating-point frequency for each one. That’s a lot of data, and it would have been infeasible to include it all inside the Python package. (GitHub and PyPI don’t like huge files.) We ended up with a situation where installing wordfreq would either need to download a huge database file, or build that file from its source data, both of which would consume a lot of time and computing resources when you’re just trying to install a simple package.

As we tried different ways of shipping this data around to all the places that needed it, we finally tried another tactic: What if we just distributed less data?

Two assumptions let us greatly shrink our word lists:

  • We don’t care about the frequencies of words that occur less than once per million words. We can just assume all those words are equally informative.
  • We don’t care about, say, 2% differences in word frequency.

Now instead of storing a separate frequency for each word, we group the words into 600 possible tiers of frequency. You could call these tiers “centibels”, a logarithmic unit similar to decibels, because there are 100 of them for each factor of 10 in the word frequency. Each of them represents a band of word frequencies that spans about a 2.3% difference. The data we store can then be simplified to “Here are all the words in tier #330… now here are all the words in tier #331…” and converted to frequencies when you ask for them.

Some tiers of word frequencies in English.
Some tiers of word frequencies in English.

This let us cut down the word lists to an entirely reasonable size, so that we can put them in the repository, and just keep them in memory while you’re using them. The English word list, for example, is 245 KB, or 135 KB compressed.

But it’s important to note the trade-off here, that wordfreq only represents sufficiently common words. It’s not suited for comparing rare words to each other. A word rarer than “amulet”, “bunches”, “deactivate”, “groupie”, “pinball”, or “slipper”, all of which have a frequency of about 1 per million, will not be represented in wordfreq.

Getting the package

wordfreq is available on GitHub, or it can be installed from the Python Package Index with the command pip install wordfreq. Documentation can be found in its README on GitHub.

wordfreq usage example
Comparing the frequency per million words of two spellings of “café”, in English and French.

ftfy (fixes text for you) 4.0: changing less and fixing more

Originally posted on May 21, 2015 at https://blog.luminoso.com/2015/05/21/ftfy-fixes-text-for-you-4-0-changing-less-and-fixing-more/.

ftfy is a Python tool that takes in bad Unicode and outputs good Unicode. I developed it because we really needed it at Luminoso — the text we work with can be damaged in several ways by the time it gets to us. It’s become our most popular open-source project by far, as many other people have the same itch that we’re scratching.

The coolest thing that ftfy does is to fix mojibake — those mix-ups in encodings that cause the word más to turn into más or even más. (I’ll recap why this happens and how it can be reversed below.) Mojibake is often intertwined with other problems, such as un-decoded HTML entities (más), and ftfy fixes those as well. But as we worked with the ftfy 3 series, it gradually became clear that the default settings were making some changes that were unnecessary, and from time to time they would actually get in the way of the goal of cleaning up text.

ftfy 4 includes interesting new fixes to creative new ways that various software breaks Unicode. But it also aims to change less text that doesn’t need to be changed. This is the big change that made us increase the major version number from 3 to 4, and it’s fundamentally about Unicode normalization. I’ll discuss this change below under the heading “Normalization”.

Mojibake and why it happens

Mojibake is what happens when text is written in one encoding and read as if it were a different one. It comes from the Japanese word “•¶Žš‰»‚¯” — no, sorry, “文字化け” — meaning “character corruption”. Mojibake turns everything but basic ASCII characters into nonsense.

Suppose you have a word such as “más”. In UTF-8 — the encoding used by the majority of the Internet — the plain ASCII letters “m” and “s” are represented by the familiar single byte that has represented them in ASCII for 50 years. The letter “á”, which is not ASCII, is represented by two bytes.

Text:  m  á     s
Bytes: 6d c3 a1 73

The problem occurs when these bytes get sent to a program that doesn’t quite understand UTF-8. This program probably thinks that every character is one byte, so it decodes each byte as a character, in a way that depends on the operating system it’s running on and the country it was set up for. (This, of course, makes no sense in an era where computers from all over the world can talk to each other.)

If we decode this text using Windows’ most popular single-byte encoding, which is known as “Windows-1252” and often confused with “ISO-8859-1”, we’ll get this:

Bytes: 6d c3 a1 73
Text:  m  Ã  ¡  s

The real problem happens when this text needs to be sent back over the Internet. It may very well send the newly-weirdified text in a way that knows it needs to encode UTF-8:

Intended text: m  á           s
Actual text:   m  Ã     ¡     s
Bytes:         6d c3 83 c2 a1 73

So, the word “más” was supposed to be four bytes of UTF-8, but what we have now is six bytes of what I propose to call “Double UTF-8”, or “WTF-8” for short.

WTF-8 is a very common form of mojibake, and the fortunate thing is that it’s reasonably easy to detect. Most possible sequences of bytes are not UTF-8, and most mojibake forms sequences of characters that are extremely unlikely to be the intended text. So ftfy can look for sequences that would decode as UTF-8 if they were encoded as another popular encoding, and then sanity-check by making sure that the new text looks more likely than the old text. By reversing the process that creates mojibake, it turns mojibake into the correct text with a rate of false positives so low that it’s difficult to measure.

Weird new mojibake

We test ftfy on live data from Twitter, which due to its diversity of languages and clients is a veritable petri dish of Unicode bugs. One thing I’ve found in this testing is that mojibake is becoming a bit less common. People expect their Twitter clients to be able to deal with Unicode, and the bugs are gradually getting fixed. The “you fail at Unicode” character � was 33% less common on Twitter in 2014 than it was in 2013.

Some software is still very bad at Unicode — particularly Microsoft products. These days, Microsoft is in many ways making its software play nicer in a pluralistic world, but they bury their head in the sand when it comes to the dominance of UTF-8. Sadly, Microsoft’s APIs were not designed for UTF-8 and they’re not interested in changing them. They adopted Unicode during its awkward coming-of-age in the mid ’90s, when UTF-16 seemed like the only way to do it. Encoding text in UTF-16 is like dancing the Macarena — you probably could do it under duress, but you haven’t willingly done it since 1997.

Because they don’t match the way the outside world uses Unicode, Microsoft products tend to make it very hard or impossible to export and import Unicode correctly, and easy to do it incorrectly. This remains a major reason that we need ftfy.

Although text is getting a bit cleaner, people are getting bolder about their use of Unicode and the bugs that remain are getting weirder. ftfy has always been able to handle some cases of files that use different encodings on different lines, but what we’re seeing now is text that switches between UTF-8 and WTF-8 in the same sentence. There’s something out there that uses UTF-8 for its opening quotation marks and Windows-1252 for its closing quotation marks, before encoding it all in UTF-8 again, “like this”. You can’t simply encode and decode that string to get the intended text “like this”.

ftfy 4.0 includes a heuristic that fixes some common cases of mixed encodings in close proximity. It’s a bit conservative — it leaves some text unfixed, because if it changed all text that might possibly be in a mixed encoding, it would lead to too many false positives.

Another variation of this is that ftfy looks for mojibake that some other well-meaning software has tried to fix, such as by replacing byte A0 with a space, because in Windows-1252 A0 is a non-breaking space. Previously, ftfy would have to leave the mojibake unfixed if one of its characters was changed. But if the sequence is clear enough, ftfy will put back the A0 byte so that it can fix the original mojibake.

Does this seem gratuitous? These are things that show up both in ftfy’s testing stream and in real data that we’ve had to handle. We want to minimize the cases where we have to tell a customer “sorry, your text is busted” and maximize the cases where we just deal with it.

Normalization

NFC (the Normalization Form that uses Composition) is a process that should be applied to basically all Unicode input. Unicode is flexible enough that it has multiple ways to write exactly the same text, and NFC merges them into the same sensible way. Here are two ways to write más, as illustrated by the ftfy.explain_unicode function.

This is the NFC normalized way:

U+006D m [Ll] LATIN SMALL LETTER M
U+00E1 á [Ll] LATIN SMALL LETTER A WITH ACUTE
U+0073 s [Ll] LATIN SMALL LETTER S

And this is a different way that’s not NFC-normalized (it’s NFD-normalized instead):

U+006D m [Ll] LATIN SMALL LETTER M
U+0061 a [Ll] LATIN SMALL LETTER A
U+0301 ́  [Mn] COMBINING ACUTE ACCENT
U+0073 s [Ll] LATIN SMALL LETTER S

If you want the same text to be represented by the same data, running everything through NFC normalization is a good idea. ftfy does that (unless you ask it not to).

Previous versions of ftfy were, by default, not just using NFC normalization, but the more aggressive NFKC normalization (whose acronym is quite unsatisfying because the K stands for “Compatibility”). For a while, it seemed like normalizing even more was even better. NFKC does things like convert fullwidth  letters into normal letters, and convert the single ellipsis character into three periods.

But NFKC also loses meaningful information. If you were to ask me what the leading cause of mojibake is, I might answer “Excel™”. After NFKC normalization, I’d instead be blaming something called “ExcelTM”. In cases like this, NFKC is hitting the text with too blunt a hammer. Even when it seems appropriate to normalize aggressively because we’re going to be performing machine learning on text, the resulting words such as “exceltm” are not helpful.

So in ftfy 4.0, we switched the default normalization to NFC. We didn’t want to lose the nice parts of NFKC, such as normalizing fullwidth letters and breaking up the kind of ligatures that can make the word “fluffiest” appear to be five characters long. So we added those back in as separate fixes. By not applying NFKC bluntly to all the text, we change less text that doesn’t need to be changed, even as we apply more kinds of fixes. It’s a significant change in the default behavior of ftfy, but we hope you agree that this is a good thing. A side benefit is that ftfy 4.0 is faster overall than 3.x, because NFC normalization can run very quickly in common cases.

Future-proofing emoji and other changes

ftfy’s heuristics depend on knowing what kind of characters it’s looking at, so it includes a table where it can quickly look up Unicode character classes. This table normally doesn’t change very much, but we update it as Python’s unicodedata gets updated with new characters, making the same table available even in previous versions of Python.

One part of the table is changing really fast, though, in a way that Python may never catch up with. Apple is rapidly adding new emoji and modifiers to the Unicode block that’s set aside for them, such as 🖖🏽, which should be a brown-skinned Vulcan salute. Unicode will publish them in a standard eventually, but people are using them now.

Instead of waiting for Unicode and then Python to catch up, ftfy just assumes that any character in this block is an emoji, even if it doesn’t appear to be assigned yet. When emoji burritos arrive, ftfy will be ready for them.

Developers who like to use the UNIX command line will be happy to know that ftfy can be used as a pipe now, as in:

curl http://example.com/api/data.txt | ftfy | sort | uniq -c

The details of all the changes can be found, of course, in the CHANGELOG.

Has ftfy solved a problem for you? Have you stumped it with a particularly bizarre case of mojibake? Let us know in the comments or on Twitter.

ftfy (fixes text for you) version 3.0

Originally posted on August 26, 2013 at https://blog.luminoso.com/2013/08/26/ftfy-fixes-text-for-you-version-3-0/.

About a year ago, we blogged about how to ungarble garbled Unicode in a post called Fixing common Unicode mistakes with Python — after they’ve been made. Shortly after that, we released the code in a Python package called ftfy.

You have almost certainly seen the kind of problem ftfy fixes. Here’s a shoutout from a developer who found that her database was full of place names such as “BucureÅŸti, Romania” because of someone else’s bug. That’s easy enough to fix:

>>> from ftfy import fix_text

>>> print(fix_text(u'BucureÅŸti, Romania'))
Bucureşti, Romania

>>> fix_text(u'Sokal’, L’vivs’ka Oblast’, Ukraine')
"Sokal', L'vivs'ka Oblast', Ukraine"

A reddit commenter has helpfully reminded me of the technical name for this phenomenon, which is mojibake.

We’ve kept developing this code because of how directly useful it is. Today, we’re releasing version 3.0 of ftfy. We’ve made it run faster, made it start up faster, made it fix more kinds of problems, and reduced its rate of false positives to near zero, so that now we can just run it on any text anyone sends us.

(I know that “near zero” is not a useful description of an error rate. To be more precise: We test ftfy by running the live stream of Twitter through it and looking at the changes it makes. Since the last bugfix, it has handled over 7,000,000 tweets with no false positives.)

We’ve also made sure that the code runs on both Python 2 and Python 3, and gives equivalent results on all versions, even when the text contains “astral characters” such as emoji that are handled inconsistently in Python 2.

You can get ftfy from GitHub or by using your favorite Python package manager, such as:

pip install ftfy

If ftfy is useful to you, we’d love to hear how you’re using it. You can reply to the comments here or e-mail us at info@luminoso.com.