ftfy (fixes text for you) 4.0: changing less and fixing more

Originally posted on May 21, 2015 at https://blog.luminoso.com/2015/05/21/ftfy-fixes-text-for-you-4-0-changing-less-and-fixing-more/.

ftfy is a Python tool that takes in bad Unicode and outputs good Unicode. I developed it because we really needed it at Luminoso — the text we work with can be damaged in several ways by the time it gets to us. It’s become our most popular open-source project by far, as many other people have the same itch that we’re scratching.

The coolest thing that ftfy does is to fix mojibake — those mix-ups in encodings that cause the word más to turn into más or even más. (I’ll recap why this happens and how it can be reversed below.) Mojibake is often intertwined with other problems, such as un-decoded HTML entities (más), and ftfy fixes those as well. But as we worked with the ftfy 3 series, it gradually became clear that the default settings were making some changes that were unnecessary, and from time to time they would actually get in the way of the goal of cleaning up text.

ftfy 4 includes interesting new fixes to creative new ways that various software breaks Unicode. But it also aims to change less text that doesn’t need to be changed. This is the big change that made us increase the major version number from 3 to 4, and it’s fundamentally about Unicode normalization. I’ll discuss this change below under the heading “Normalization”.

Mojibake and why it happens

Mojibake is what happens when text is written in one encoding and read as if it were a different one. It comes from the Japanese word “•¶Žš‰»‚¯” — no, sorry, “文字化け” — meaning “character corruption”. Mojibake turns everything but basic ASCII characters into nonsense.

Suppose you have a word such as “más”. In UTF-8 — the encoding used by the majority of the Internet — the plain ASCII letters “m” and “s” are represented by the familiar single byte that has represented them in ASCII for 50 years. The letter “á”, which is not ASCII, is represented by two bytes.

Text:  m  á     s
Bytes: 6d c3 a1 73

The problem occurs when these bytes get sent to a program that doesn’t quite understand UTF-8. This program probably thinks that every character is one byte, so it decodes each byte as a character, in a way that depends on the operating system it’s running on and the country it was set up for. (This, of course, makes no sense in an era where computers from all over the world can talk to each other.)

If we decode this text using Windows’ most popular single-byte encoding, which is known as “Windows-1252” and often confused with “ISO-8859-1”, we’ll get this:

Bytes: 6d c3 a1 73
Text:  m  Ã  ¡  s

The real problem happens when this text needs to be sent back over the Internet. It may very well send the newly-weirdified text in a way that knows it needs to encode UTF-8:

Intended text: m  á           s
Actual text:   m  Ã     ¡     s
Bytes:         6d c3 83 c2 a1 73

So, the word “más” was supposed to be four bytes of UTF-8, but what we have now is six bytes of what I propose to call “Double UTF-8”, or “WTF-8” for short.

WTF-8 is a very common form of mojibake, and the fortunate thing is that it’s reasonably easy to detect. Most possible sequences of bytes are not UTF-8, and most mojibake forms sequences of characters that are extremely unlikely to be the intended text. So ftfy can look for sequences that would decode as UTF-8 if they were encoded as another popular encoding, and then sanity-check by making sure that the new text looks more likely than the old text. By reversing the process that creates mojibake, it turns mojibake into the correct text with a rate of false positives so low that it’s difficult to measure.

Weird new mojibake

We test ftfy on live data from Twitter, which due to its diversity of languages and clients is a veritable petri dish of Unicode bugs. One thing I’ve found in this testing is that mojibake is becoming a bit less common. People expect their Twitter clients to be able to deal with Unicode, and the bugs are gradually getting fixed. The “you fail at Unicode” character � was 33% less common on Twitter in 2014 than it was in 2013.

Some software is still very bad at Unicode — particularly Microsoft products. These days, Microsoft is in many ways making its software play nicer in a pluralistic world, but they bury their head in the sand when it comes to the dominance of UTF-8. Sadly, Microsoft’s APIs were not designed for UTF-8 and they’re not interested in changing them. They adopted Unicode during its awkward coming-of-age in the mid ’90s, when UTF-16 seemed like the only way to do it. Encoding text in UTF-16 is like dancing the Macarena — you probably could do it under duress, but you haven’t willingly done it since 1997.

Because they don’t match the way the outside world uses Unicode, Microsoft products tend to make it very hard or impossible to export and import Unicode correctly, and easy to do it incorrectly. This remains a major reason that we need ftfy.

Although text is getting a bit cleaner, people are getting bolder about their use of Unicode and the bugs that remain are getting weirder. ftfy has always been able to handle some cases of files that use different encodings on different lines, but what we’re seeing now is text that switches between UTF-8 and WTF-8 in the same sentence. There’s something out there that uses UTF-8 for its opening quotation marks and Windows-1252 for its closing quotation marks, before encoding it all in UTF-8 again, “like this”. You can’t simply encode and decode that string to get the intended text “like this”.

ftfy 4.0 includes a heuristic that fixes some common cases of mixed encodings in close proximity. It’s a bit conservative — it leaves some text unfixed, because if it changed all text that might possibly be in a mixed encoding, it would lead to too many false positives.

Another variation of this is that ftfy looks for mojibake that some other well-meaning software has tried to fix, such as by replacing byte A0 with a space, because in Windows-1252 A0 is a non-breaking space. Previously, ftfy would have to leave the mojibake unfixed if one of its characters was changed. But if the sequence is clear enough, ftfy will put back the A0 byte so that it can fix the original mojibake.

Does this seem gratuitous? These are things that show up both in ftfy’s testing stream and in real data that we’ve had to handle. We want to minimize the cases where we have to tell a customer “sorry, your text is busted” and maximize the cases where we just deal with it.

Normalization

NFC (the Normalization Form that uses Composition) is a process that should be applied to basically all Unicode input. Unicode is flexible enough that it has multiple ways to write exactly the same text, and NFC merges them into the same sensible way. Here are two ways to write más, as illustrated by the ftfy.explain_unicode function.

This is the NFC normalized way:

U+006D m [Ll] LATIN SMALL LETTER M
U+00E1 á [Ll] LATIN SMALL LETTER A WITH ACUTE
U+0073 s [Ll] LATIN SMALL LETTER S

And this is a different way that’s not NFC-normalized (it’s NFD-normalized instead):

U+006D m [Ll] LATIN SMALL LETTER M
U+0061 a [Ll] LATIN SMALL LETTER A
U+0301 ́  [Mn] COMBINING ACUTE ACCENT
U+0073 s [Ll] LATIN SMALL LETTER S

If you want the same text to be represented by the same data, running everything through NFC normalization is a good idea. ftfy does that (unless you ask it not to).

Previous versions of ftfy were, by default, not just using NFC normalization, but the more aggressive NFKC normalization (whose acronym is quite unsatisfying because the K stands for “Compatibility”). For a while, it seemed like normalizing even more was even better. NFKC does things like convert fullwidth  letters into normal letters, and convert the single ellipsis character into three periods.

But NFKC also loses meaningful information. If you were to ask me what the leading cause of mojibake is, I might answer “Excel™”. After NFKC normalization, I’d instead be blaming something called “ExcelTM”. In cases like this, NFKC is hitting the text with too blunt a hammer. Even when it seems appropriate to normalize aggressively because we’re going to be performing machine learning on text, the resulting words such as “exceltm” are not helpful.

So in ftfy 4.0, we switched the default normalization to NFC. We didn’t want to lose the nice parts of NFKC, such as normalizing fullwidth letters and breaking up the kind of ligatures that can make the word “fluffiest” appear to be five characters long. So we added those back in as separate fixes. By not applying NFKC bluntly to all the text, we change less text that doesn’t need to be changed, even as we apply more kinds of fixes. It’s a significant change in the default behavior of ftfy, but we hope you agree that this is a good thing. A side benefit is that ftfy 4.0 is faster overall than 3.x, because NFC normalization can run very quickly in common cases.

Future-proofing emoji and other changes

ftfy’s heuristics depend on knowing what kind of characters it’s looking at, so it includes a table where it can quickly look up Unicode character classes. This table normally doesn’t change very much, but we update it as Python’s unicodedata gets updated with new characters, making the same table available even in previous versions of Python.

One part of the table is changing really fast, though, in a way that Python may never catch up with. Apple is rapidly adding new emoji and modifiers to the Unicode block that’s set aside for them, such as 🖖🏽, which should be a brown-skinned Vulcan salute. Unicode will publish them in a standard eventually, but people are using them now.

Instead of waiting for Unicode and then Python to catch up, ftfy just assumes that any character in this block is an emoji, even if it doesn’t appear to be assigned yet. When emoji burritos arrive, ftfy will be ready for them.

Developers who like to use the UNIX command line will be happy to know that ftfy can be used as a pipe now, as in:

curl http://example.com/api/data.txt | ftfy | sort | uniq -c

The details of all the changes can be found, of course, in the CHANGELOG.

Has ftfy solved a problem for you? Have you stumped it with a particularly bizarre case of mojibake? Let us know in the comments or on Twitter.

ftfy (fixes text for you) version 3.0

Originally posted on August 26, 2013 at https://blog.luminoso.com/2013/08/26/ftfy-fixes-text-for-you-version-3-0/.

About a year ago, we blogged about how to ungarble garbled Unicode in a post called Fixing common Unicode mistakes with Python — after they’ve been made. Shortly after that, we released the code in a Python package called ftfy.

You have almost certainly seen the kind of problem ftfy fixes. Here’s a shoutout from a developer who found that her database was full of place names such as “BucureÅŸti, Romania” because of someone else’s bug. That’s easy enough to fix:

>>> from ftfy import fix_text

>>> print(fix_text(u'BucureÅŸti, Romania'))
Bucureşti, Romania

>>> fix_text(u'Sokal’, L’vivs’ka Oblast’, Ukraine')
"Sokal', L'vivs'ka Oblast', Ukraine"

A reddit commenter has helpfully reminded me of the technical name for this phenomenon, which is mojibake.

We’ve kept developing this code because of how directly useful it is. Today, we’re releasing version 3.0 of ftfy. We’ve made it run faster, made it start up faster, made it fix more kinds of problems, and reduced its rate of false positives to near zero, so that now we can just run it on any text anyone sends us.

(I know that “near zero” is not a useful description of an error rate. To be more precise: We test ftfy by running the live stream of Twitter through it and looking at the changes it makes. Since the last bugfix, it has handled over 7,000,000 tweets with no false positives.)

We’ve also made sure that the code runs on both Python 2 and Python 3, and gives equivalent results on all versions, even when the text contains “astral characters” such as emoji that are handled inconsistently in Python 2.

You can get ftfy from GitHub or by using your favorite Python package manager, such as:

pip install ftfy

If ftfy is useful to you, we’d love to hear how you’re using it. You can reply to the comments here or e-mail us at info@luminoso.com.

Fixing Unicode mistakes and more: the ftfy package

Originally posted on August 24, 2012 at https://blog.luminoso.com/2012/08/24/fixing-unicode-mistakes-and-more-the-ftfy-package/.

There’s been a great response to my earlier post, Fixing common Unicode mistakes with Python. This is clearly something that people besides me needed. In fact, someone already made the code into a web site, at fixencoding.com. I like the favicon.

I took the suggestion to split the code into a new standalone package. It’s now called ftfy, standing for “fixes text for you”. You can install it with pip install ftfy.

I observed that I was doing interesting things with Unicode in Python, and yet I wasn’t doing it in Python 3, which basically makes me a terrible person. ftfy is now compatible with both Python 2 and Python 3.

Something else amusing happened: At one point, someone edited the previous post and WordPress barfed HTML entities all over its text. All the quotation marks turned into ", for example. So, for a bit, that post was setting a terrible example about how to handle text correctly!

I took that as a sign that I should expand ftfy so that it also decodes HTML entities (though it will leave them alone in the presence of HTML tags). While I was at it, I also made it turn curly quotes into straight ones, convert Windows line endings to Unix, normalize Unicode characters to their canonical forms, strip out terminal color codes, and remove miscellaneous control characters. The original fix_bad_unicode is still in there, if you just want the encoding fixer without the extra stuff.

Fixing common Unicode mistakes with Python — after they’ve been made

Originally posted on August 20, 2012 at https://blog.luminoso.com/2012/08/20/fix-unicode-mistakes-with-python/.

Update: not only can you fix Unicode mistakes with Python, you can fix Unicode mistakes with our open source Python package ftfy. It’s on PyPI and everything.

You have almost certainly seen text on a computer that looks something like this:

If numbers aren’t beautiful, I don’t know what is. –Paul Erdős

Somewhere, a computer got hold of a list of numbers that were intended to constitute a quotation and did something distinctly un-beautiful with it. A person reading that can deduce that it was actually supposed to say this:

If numbers aren’t beautiful, I don’t know what is. –Paul Erdős

Here’s what’s going on. A modern computer has the ability to display text that uses over 100,000 different characters, but unfortunately that text sometimes passes through a doddering old program that believes there are only the 256 that it can fit in a single byte. The program doesn’t even bother to check what encoding the text is in; it just uses its own favorite encoding and turns a bunch of characters into strings of completely different characters.

Now, you’re not the programmer causing the encoding problems, right? Because you’ve read something like Joel Spolsky’s The Absolute Minimum Every Developer Absolutely, Positively Must Know About Unicode And Character Sets or the Python Unicode HOWTO and you’ve learned the difference between text and bytestrings and how to get them right.

But the problem is that sometimes you might have to deal with text that comes out of other code. We deal with this a lot at Luminoso, where the text our customers want us to analyze has often passed through several different pieces of software, each with their own quirks, probably with Microsoft Office somewhere in the chain.

So this post isn’t about how to do Unicode right. It’s about a tool we came up with for damage control after some other program does Unicode wrong. It detects some of the most common encoding mistakes and does what it can to undo them.

Continue reading “Fixing common Unicode mistakes with Python — after they’ve been made”