ftfy (fixes text for you) version 3.0

Originally posted on August 26, 2013.

About a year ago, we blogged about how to ungarble garbled Unicode in a post called Fixing common Unicode mistakes with Python — after they’ve been made. Shortly after that, we released the code in a Python package called ftfy.

You have almost certainly seen the kind of problem ftfy fixes. Here’s a shoutout from a developer who found that her database was full of place names such as “BucureÅŸti, Romania” because of someone else’s bug. That’s easy enough to fix:

>>> from ftfy import fix_text

>>> print(fix_text(u'BucureÅŸti, Romania'))
Bucureşti, Romania

>>> fix_text(u'Sokal’, L’vivs’ka Oblast’, Ukraine')
"Sokal', L'vivs'ka Oblast', Ukraine"

A reddit commenter has helpfully reminded me of the technical name for this phenomenon, which is mojibake.

We’ve kept developing this code because of how directly useful it is. Today, we’re releasing version 3.0 of ftfy. We’ve made it run faster, made it start up faster, made it fix more kinds of problems, and reduced its rate of false positives to near zero, so that now we can just run it on any text anyone sends us.

(I know that “near zero” is not a useful description of an error rate. To be more precise: We test ftfy by running the live stream of Twitter through it and looking at the changes it makes. Since the last bugfix, it has handled over 7,000,000 tweets with no false positives.)

We’ve also made sure that the code runs on both Python 2 and Python 3, and gives equivalent results on all versions, even when the text contains “astral characters” such as emoji that are handled inconsistently in Python 2.
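To make "astral characters" concrete, here is a small sketch, written for Python 3.3 or later, of why they were awkward in Python 2. The variable names are illustrative and this is not ftfy code. An astral character is one whose code point lies above U+FFFF, so it needs two UTF-16 code units. "Narrow" builds of Python 2 stored strings as UTF-16 code units, so one emoji could have a length of 2 there and a length of 1 on a "wide" build.

```python
# An emoji outside the Basic Multilingual Plane (an "astral" character).
emoji = "\U0001F600"

# Its code point is above U+FFFF, which is what makes it astral.
is_astral = ord(emoji) > 0xFFFF

# On Python 3.3+ a string is a sequence of code points, so this is one character.
length_in_code_points = len(emoji)

# In UTF-16 it takes a surrogate pair: two 16-bit code units.
# A narrow Python 2 build would have reported this number as the length.
length_in_utf16_units = len(emoji.encode("utf-16-le")) // 2

print(is_astral, length_in_code_points, length_in_utf16_units)
```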

You can get ftfy from GitHub or by using your favorite Python package manager, such as:

pip install ftfy

If ftfy is useful to you, we’d love to hear how you’re using it. You can reply to the comments here or e-mail us at info@luminoso.com.

Fixing Unicode mistakes and more: the ftfy package

Originally posted on August 24, 2012.

There’s been a great response to my earlier post, Fixing common Unicode mistakes with Python. This is clearly something that people besides me needed. In fact, someone already made the code into a web site, at fixencoding.com. I like the favicon.

I took the suggestion to split the code into a new standalone package. It’s now called ftfy, standing for “fixes text for you”. You can install it with pip install ftfy.

I observed that I was doing interesting things with Unicode in Python, and yet I wasn’t doing it in Python 3, which basically makes me a terrible person. ftfy is now compatible with both Python 2 and Python 3.

Something else amusing happened: At one point, someone edited the previous post and WordPress barfed HTML entities all over its text. All the quotation marks turned into &ldquo;, for example. So, for a bit, that post was setting a terrible example about how to handle text correctly!

I took that as a sign that I should expand ftfy so that it also decodes HTML entities (though it will leave them alone in the presence of HTML tags). While I was at it, I also made it turn curly quotes into straight ones, convert Windows line endings to Unix, normalize Unicode characters to their canonical forms, strip out terminal color codes, and remove miscellaneous control characters. The original fix_bad_unicode is still in there, if you just want the encoding fixer without the extra stuff.
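To give a feel for what those extra steps involve, here is a rough sketch using only the Python 3 standard library. It is not ftfy's code: the function name rough_cleanup, the regular expressions, and the crude "does the text contain a <" check for HTML tags are all assumptions made for illustration, and ftfy's real rules are more careful.

```python
import html
import re
import unicodedata

# Terminal color codes of the form ESC [ ... m (an assumed, simplified pattern).
ANSI_COLOR_RE = re.compile(r"\x1b\[[0-9;]*m")

# Control characters other than tab, newline, and carriage return.
CONTROL_RE = re.compile(r"[\x00-\x08\x0b\x0c\x0e-\x1f\x7f]")

# Curly quotes mapped to their straight equivalents.
STRAIGHT_QUOTES = str.maketrans({
    "\u2018": "'", "\u2019": "'",
    "\u201c": '"', "\u201d": '"',
})


def rough_cleanup(text):
    """Apply simplified versions of the cleanup steps described above."""
    # Leave entities alone if the text looks like it contains HTML tags.
    if "<" not in text:
        text = html.unescape(text)
    # Strip color codes before removing control characters, because the
    # escape character that starts them is itself a control character.
    text = ANSI_COLOR_RE.sub("", text)
    text = text.translate(STRAIGHT_QUOTES)
    text = text.replace("\r\n", "\n")
    text = unicodedata.normalize("NFC", text)
    text = CONTROL_RE.sub("", text)
    return text


messy = "\x1b[31m&ldquo;cafe\u0301&rdquo;\r\n\x00"
cleaned = rough_cleanup(messy)
print(repr(cleaned))
```

If you want the real thing, including the encoding fixer, use ftfy itself rather than a sketch like this one.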

Fixing common Unicode mistakes with Python after they’ve been made

Originally posted on August 20, 2012.

Update: not only can you fix Unicode mistakes with Python, you can fix Unicode mistakes with our open source Python package ftfy. It’s on PyPI and everything.

You have almost certainly seen text on a computer that looks something like this:

If numbers arenâ€™t beautiful, I donâ€™t know what is. â€“Paul ErdÅ‘s

Somewhere, a computer got hold of a list of numbers that were intended to constitute a quotation and did something distinctly un-beautiful with it. A person reading that can deduce that it was actually supposed to say this:

If numbers aren’t beautiful, I don’t know what is. –Paul Erdős

Here’s what’s going on. A modern computer has the ability to display text that uses over 100,000 different characters, but unfortunately that text sometimes passes through a doddering old program that believes there are only the 256 that it can fit in a single byte. The program doesn’t even bother to check what encoding the text is in; it just uses its own favorite encoding and turns a bunch of characters into strings of completely different characters.
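You can reproduce the damage yourself. This is a minimal sketch, written for Python 3, that assumes the doddering old program read UTF-8 bytes as if they were Windows-1252. That is one common case, not the only one. The variable names are illustrative, and the "repair" here is just the same mistake run backwards, not the detection logic in our tool.

```python
# The quotation as it was meant to appear.
original = (
    "If numbers aren\u2019t beautiful, I don\u2019t know what is. "
    "\u2013Paul Erd\u0151s"
)

# The mistake: the text is stored as UTF-8 bytes, then decoded by a program
# that assumes every byte is one Windows-1252 character.
garbled = original.encode("utf-8").decode("windows-1252")

# Undoing it: turn the wrong characters back into the bytes they came from,
# then decode those bytes as the UTF-8 they always were.
repaired = garbled.encode("windows-1252").decode("utf-8")

print(garbled)
print(repaired)
```

The reverse step only works when every garbled character maps back to a single byte. Real-world text is messier, which is why detecting when it is safe to apply is the hard part.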

Now, you’re not the programmer causing the encoding problems, right? Because you’ve read something like Joel Spolsky’s The Absolute Minimum Every Developer Absolutely, Positively Must Know About Unicode And Character Sets or the Python Unicode HOWTO and you’ve learned the difference between text and bytestrings and how to get them right.

But the problem is that sometimes you might have to deal with text that comes out of other code. We deal with this a lot at Luminoso, where the text our customers want us to analyze has often passed through several different pieces of software, each with its own quirks, probably with Microsoft Office somewhere in the chain.

So this post isn’t about how to do Unicode right. It’s about a tool we came up with for damage control after some other program does Unicode wrong. It detects some of the most common encoding mistakes and does what it can to undo them.


How to make an orderly transition to Python Requests 1.0 instead of running around in a panic

There’s a lovely Python module for making HTTP requests, called requests. We use it at Luminoso. A bunch of code we depend on uses it. Our API customers use it. Basically everyone uses it because it’s the right thing to use.

Yesterday we did our first code update of the new year on our development systems, and found that suddenly nothing was working. Meanwhile, our customers sent us bug reports for similar reasons. We’d see errors like this one:

TypeError: session() takes no arguments (1 given)

And all kinds of code would crash with this kind of error, which occurs because Requests changed .json from a property to a method:

TypeError: 'instancemethod' object has no attribute '__getitem__'
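If you have code that must cope with both behaviors for a while, one option is a small helper that checks whether .json is callable. This is only a sketch: get_json is a hypothetical name, and the two classes below are stand-ins that mimic the old and new styles so the example runs on its own. They are not the real requests classes.

```python
class OldStyleResponse:
    """Stand-in for a 0.x-style response, where .json is a property."""

    @property
    def json(self):
        return {"status": "ok"}


class NewStyleResponse:
    """Stand-in for a 1.0-style response, where .json is a method."""

    def json(self):
        return {"status": "ok"}


def get_json(response):
    """Return the decoded JSON whether .json is a property or a method."""
    data = response.json
    # A method is callable; decoded JSON (dict, list, str, number, None) is not.
    return data() if callable(data) else data


print(get_json(OldStyleResponse())["status"])
print(get_json(NewStyleResponse())["status"])
```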

You see, on December 17, Kenneth Reitz released version 1.0 of requests and declared “This is not a backwards compatible change.” As far as we can tell, this has caused a small ripple of version-related panic in the Python world. We know it’s okay to break compatibility when changing the major version number. That’s what major version numbers are for. But the problem is that it’s really hard to deal with multiple incompatible versions of the same Python package.

If you type pip install requests now, you’ll get version 1.0, and it won’t work with most code written for version 0.14. So maybe you should ask for “requests < 1.0” or “requests == 0.14.2”, and maybe even declare that dependency in setup.py. That was certainly the stopgap measure we went around applying yesterday.

The problem is that, once you do that, you can’t ever upgrade to Requests 1.0 or install any code that uses Requests 1.0, unless you port all your code and update all your Python environments at once. Not even virtualenv will help. You just can’t have an environment that depends on “requests < 1.0” and “requests >= 1.0” at the same time and have your code keep working.

The requests-transition package

We want to make it possible to move to the shiny new Requests 1.x code. But we also want our code stack to keep working in the present. That’s the purpose of requests-transition. All it does is install both versions of requests as two different packages with different names.

The slogan of requests is “Python HTTP for Humans”. The slogan of requests-transition is “Python HTTP for busy people who don’t have time to port all their code yet”.

To install it using pip:

pip install requests-transition

Now you can stabilize your existing code that uses requests 0.x by changing the line

import requests

to

import requests0 as requests

When you port the code to use requests 1.0, change the import line to:

import requests1 as requests

In the future, when all your dependencies use requests 1.0 and 0.x is a distant memory, you should get the latest version of the real requests package and change the import lines back to:

import requests

And that is how you transition to requests 1.x, calmly and painlessly.
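If some of your code has to run both in environments that have requests-transition and in ones that only have plain requests, a fallback import is one way to cope. The helper below is a hypothetical sketch, not part of requests-transition. The usage line runs it on standard-library names so that it works anywhere.

```python
import importlib


def import_first_available(*module_names):
    """Import and return the first module in module_names that is installed."""
    for name in module_names:
        try:
            return importlib.import_module(name)
        except ImportError:
            continue
    raise ImportError("none of these modules could be imported: %r" % (module_names,))


# For 0.x-style code you would write something like:
#     requests = import_first_available("requests0", "requests")
# Here the first name is deliberately bogus so the fallback is exercised.
module = import_first_available("no_such_module_for_this_demo", "json")
print(module.__name__)
```

Falling back to plain requests only helps if the installed version is the one your code was written for, so treat this as a convenience and not a substitute for porting.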

We have already updated our API client code to use requests-transition, instead of forcing you to install “requests < 1.0”.

Watch python-requests-transition on GitHub