Originally posted on August 26, 2013.
About a year ago, we blogged about how to ungarble garbled Unicode in a post called Fixing common Unicode mistakes with Python â€” after they’ve been made. Shortly after that, we released the code in a Python package called ftfy.
You have almost certainly seen the kind of problem ftfy fixes. Here’s a shoutout from a developer who found that her database was full of place names such as “BucureÅŸti, Romania” because of someone else’s bug. That’s easy enough to fix:
>>> from ftfy import fix_text
>>> print(fix_text(u'BucureÅŸti, Romania'))
Bucureşti, Romania
>>> fix_text(u'Sokalâ€™, Lâ€™vivsâ€™ka Oblastâ€™, Ukraine')
"Sokal', L'vivs'ka Oblast', Ukraine"
A Reddit commenter has helpfully reminded me of the technical name for this phenomenon, which is mojibake.
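To see where text like this comes from, here is a minimal sketch that uses only Python 3's standard codecs, not ftfy itself. It assumes the most common cause of mojibake: UTF-8 bytes that were decoded as Windows-1252. ftfy's own detection logic is more involved than this single round trip, so treat the sketch as an illustration of the phenomenon rather than of the library's internals.

```python
# Illustration only: reproduce and undo one layer of mojibake by hand.
# Assumption: the text was UTF-8 that got misread as Windows-1252.

original = "Bucure\u015fti, Romania"  # "Bucureşti, Romania"

# Step 1: the bug. UTF-8 bytes are decoded with the wrong codec,
# so the two bytes of "ş" (C5 9F) turn into the two characters "Å" and "Ÿ".
garbled = original.encode("utf-8").decode("windows-1252")
print(garbled)   # BucureÅŸti, Romania

# Step 2: the fix. Reverse the two steps in the opposite order.
repaired = garbled.encode("windows-1252").decode("utf-8")
print(repaired)  # Bucureşti, Romania

# The same thing happens to a curly apostrophe (U+2019), whose three
# UTF-8 bytes (E2 80 99) become three separate characters.
apostrophe_garbled = "\u2019".encode("utf-8").decode("windows-1252")
print(apostrophe_garbled)  # â€™
```

The hard part, and the reason a library is useful here, is deciding when this reversal is safe to apply. Running it blindly on correct text would either raise an error or damage it.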
We’ve kept developing this code because of how directly useful it is. Today, we’re releasing version 3.0 of ftfy. We’ve made it run faster, made it start up faster, made it fix more kinds of problems, and reduced its rate of false positives to near zero, so that now we can just run it on any text anyone sends us.
(I know that “near zero” is not a useful description of an error rate. To be more precise: We test ftfy by running the live stream of Twitter through it and looking at the changes it makes. Since the last bugfix, it has handled over 7,000,000 tweets with no false positives.)
We’ve also made sure that the code runs on both Python 2 and Python 3, and gives equivalent results on all versions, even when the text contains “astral characters” such as emoji that are handled inconsistently in Python 2.
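For readers unfamiliar with the term, astral characters are code points above U+FFFF. The sketch below is written for Python 3 and shows why they were awkward on Python 2: "narrow" Python 2 builds stored Unicode strings as UTF-16 code units, so a single emoji occupied two units (a surrogate pair) and had a length of 2. The emoji used here is an arbitrary choice for illustration.

```python
# Illustration only: one astral character, seen as a code point
# and as the UTF-16 surrogate pair a narrow Python 2 build would store.

emoji = "\U0001F600"  # GRINNING FACE, an arbitrary astral character

code_point = ord(emoji)
utf16_bytes = emoji.encode("utf-16-be")
utf16_units = len(utf16_bytes) // 2  # each UTF-16 code unit is 2 bytes

# Split the encoded form into its two 16-bit halves.
high = int.from_bytes(utf16_bytes[:2], "big")
low = int.from_bytes(utf16_bytes[2:], "big")

print(hex(code_point))      # 0x1f600
print(len(emoji))           # 1 on Python 3
print(utf16_units)          # 2
print(hex(high), hex(low))  # 0xd83d 0xde00
```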
You can get ftfy from GitHub or by using your favorite Python package manager, such as:
pip install ftfy
If ftfy is useful to you, we’d love to hear how you’re using it. You can reply to the comments here or e-mail us at firstname.lastname@example.org.