Originally posted on August 24, 2012.
There’s been a great response to my earlier post, Fixing common Unicode mistakes with Python. This is clearly something that people besides me needed. In fact, someone already made the code into a web site, at fixencoding.com. I like the favicon.
I took the suggestion to split the code into a new standalone package. It’s now called ftfy, standing for “fixes text for you”. You can install it with pip install ftfy.
I observed that I was doing interesting things with Unicode in Python, and yet I wasn’t doing it in Python 3, which basically makes me a terrible person. ftfy is now compatible with both Python 2 and Python 3.
Something else amusing happened: At one point, someone edited the previous post and WordPress barfed HTML entities all over its text. All the quotation marks turned into “, for example. So, for a bit, that post was setting a terrible example about how to handle text correctly!
I took that as a sign that I should expand ftfy so that it also decodes HTML entities (though it will leave them alone in the presence of HTML tags). While I was at it, I also made it turn curly quotes into straight ones, convert Windows line endings to Unix, normalize Unicode characters to their canonical forms, strip out terminal color codes, and remove miscellaneous control characters. The original fix_bad_unicode is still in there, if you just want the encoding fixer without the extra stuff.
Originally posted on August 20, 2012.
Update: not only can you fix Unicode mistakes with Python, you can fix Unicode mistakes with our open source Python package ftfy. It’s on PyPI and everything.
You have almost certainly seen text on a computer that looks something like this:
If numbers arenâ€™t beautiful, I donâ€™t know what is. â€“Paul ErdÅ‘s
Somewhere, a computer got hold of a list of numbers that were intended to constitute a quotation and did something distinctly un-beautiful with it. A person reading that can deduce that it was actually supposed to say this:
If numbers aren’t beautiful, I don’t know what is. –Paul Erdős
Here’s what’s going on. A modern computer has the ability to display text that uses over 100,000 different characters, but unfortunately that text sometimes passes through a doddering old program that believes there are only the 256 that it can fit in a single byte. The program doesn’t even bother to check what encoding the text is in; it just uses its own favorite encoding and turns a bunch of characters into strings of completely different characters.
Now, you’re not the programmer causing the encoding problems, right? Because you’ve read something like Joel Spolsky’s The Absolute Minimum Every Developer Absolutely, Positively Must Know About Unicode And Character Sets or the Python Unicode HOWTO and you’ve learned the difference between text and bytestrings and how to get them right.
But the problem is that sometimes you might have to deal with text that comes out of other code. We deal with this a lot at Luminoso, where the text our customers want us to analyze has often passed through several different pieces of software, each with their own quirks, probably with Microsoft Office somewhere in the chain.
So this post isn’t about how to do Unicode right. It’s about a tool we came up with for damage control after some other program does Unicode wrong. It detects some of the most common encoding mistakes and does what it can to undo them.
Continue reading “Fixing common Unicode mistakes with Python after they’ve been made”