ftfy

ftfy (fixes text for you) 4.4 and 5.0

ftfy is Luminoso's open-source Unicode-fixing library for Python.

Luminoso's biggest open-source project is ConceptNet, but we also use this blog to provide updates on our other open-source projects. And among these projects, ftfy is certainly the most widely used. It solves a problem a lot of people have, and it solves it with "no faffing about", as a grateful e-mail I received put it.

When you use the ftfy.fix_text() function, it detects and fixes such problems as mojibake (text that was decoded in the wrong encoding), accidental HTML escaping, curly quotes where you expected straight ones, and so on. (You can also selectively disable these fixes, or run them as separate functions.)
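For example, fix_text repairs HTML-escaped curly quotes by default, and passing uncurl_quotes=False (one of fix_text's keyword arguments) shows how an individual fix can be switched off:

>>> import ftfy
>>> ftfy.fix_text('&ldquo;mojibake&rdquo;')
'"mojibake"'
>>> ftfy.fix_text('&ldquo;mojibake&rdquo;', uncurl_quotes=False)
'“mojibake”'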

Here's an example that fixes some multiply-mangled Unicode that I actually found on the Web:

>>> print(ftfy.fix_text("&macr;\\_(ã\x83\x84)_/&macr;"))
¯\_(ツ)_/¯

Another example, from a Twitter-bot gone wrong:

>>> print(ftfy.fix_text("#правильноепитание"))
#правильноепитание

So we're proud to present two new releases of ftfy, versions 4.4 and 5.0. Let's start by talking about the big change:

A control panel labeled in Polish, with a big red button with the text 'Drop Python 2 support' overlaid. Photo credit: "The Big Red Button" by włodi, used under the CC-By-SA 2.0 license

That's right: as of version 4.4, ftfy is better at dealing with encodings of Eastern European languages! After all, sometimes your text is in Polish, like the labels on this very serious-looking control panel. Or maybe it's in Czech, Slovak, Hungarian, or a language with similar accented letters.

Before Unicode, people would handle these alphabets using a single-byte encoding designed for them, like Windows-1250, which would be incompatible with other languages. In that encoding, the photographer's name is the byte string w\xb3odi. But now the standard encoding of the Web is UTF-8, where the same name is w\xc5\x82odi.
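Here's that difference in Python (plain standard-library calls, nothing ftfy-specific):

>>> 'włodi'.encode('windows-1250')
b'w\xb3odi'
>>> 'włodi'.encode('utf-8')
b'w\xc5\x82odi'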

The encoding errors you might encounter due to mixing these up used to be underrepresented in the test data I collected. You might end up with the name looking like "wĹ‚odi" and ftfy would just throw up its hands like ¯\_(ツ)_/¯. But now it understands what happened to that name and how to fix it.
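That's the fix this release is claiming, so here it is as a quick check (behavior as of ftfy 4.4):

>>> print(ftfy.fix_text('wĹ‚odi'))
włodi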

Oh, but what about that text I photoshopped onto the button?

The same image, cropped to just the 'Drop Python 2 support' button.

Yeah, I was pulling your leg a bit by talking about the Windows-1250 thing first.

ftfy 5.0 is the same as ftfy 4.4, but it drops support for Python 2. It also gains some tests that we're happier to not have to write for both versions of Python. Depending on how inertia-ful your use of Python is, this may be a big deal to you.

Three at last!

Python 3 has a string type that's a pretty good representation of Unicode, and it uses it consistently throughout its standard library. It's a great language for describing Unicode and how to fix it. It's a great language for text in general. But until now, we've been writing ftfy in the unsatisfying language known as "Python 2+3", where you can't take advantage of anything that's cleaner in Python 3 because you still have to do it the Python 2.7 way also.

So, following the plan we announced in April 2015, we released two versions at the same time. They do the same thing, but ftfy 5.0 gets to have shorter, simpler code.

It seems we even communicated this to ftfy's users successfully. Shortly after ftfy 5.0 appeared on PyPI, the first bug report we received wasn't about where Python 2 support went; it was about a regression introduced by the new heuristics. (That's why 4.4.1 and 5.0.1 are out already.)

There's more I plan to do with ftfy, especially fixing more kinds of encoding errors, as summarized by issue #18. It'll be easier to make it happen when I can write the fix in a single language.

But if you're still on Python 2 -- possibly due to forces outside your control -- I hope I've left you with a pretty good option. Thousands of users are content with ftfy 4, and it's not going away.

One more real-world example

>>> from ftfy.fixes import fix_encoding_and_explain
>>> fix_encoding_and_explain("NapĂ\xadĹˇte nĂˇm !")
('Napíšte nám !',
 [('encode', 'sloppy-windows-1250', 2), ('decode', 'utf-8', 0)])

ftfy (fixes text for you) 4.0: changing less and fixing more

Originally posted on May 21, 2015.

ftfy is a Python tool that takes in bad Unicode and outputs good Unicode. I developed it because we really needed it at Luminoso -- the text we work with can be damaged in several ways by the time it gets to us. It's become our most popular open-source project by far, as many other people have the same itch that we're scratching.

The coolest thing that ftfy does is to fix mojibake -- those mix-ups in encodings that cause the word más to turn into mÃ¡s or even mÃƒÂ¡s. (I'll recap why this happens and how it can be reversed below.) Mojibake is often intertwined with other problems, such as un-decoded HTML entities (m&aacute;s), and ftfy fixes those as well. But as we worked with the ftfy 3 series, it gradually became clear that the default settings were making some changes that were unnecessary, and from time to time they would actually get in the way of the goal of cleaning up text.
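Both layers of that mistake unwind with a single call, since ftfy re-applies its encoding fix until the text stops changing (a small check you can run yourself):

>>> import ftfy
>>> ftfy.fix_text('mÃƒÂ¡s')
'más'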

ftfy 4 includes interesting new fixes to creative new ways that various software breaks Unicode. But it also aims to change less text that doesn't need to be changed. This is the big change that made us increase the major version number from 3 to 4, and it's fundamentally about Unicode normalization. I'll discuss this change below under the heading "Normalization".

Mojibake and why it happens

Mojibake is what happens when text is written in one encoding and read as if it were a different one. It comes from the Japanese word "•¶Žš‰»‚¯" -- no, sorry, "文字化け" -- meaning "character corruption". Mojibake turns everything but basic ASCII characters into nonsense.

Suppose you have a word such as "más". In UTF-8 -- the encoding used by the majority of the Internet -- the plain ASCII letters "m" and "s" are represented by the familiar single byte that has represented them in ASCII for 50 years. The letter "á", which is not ASCII, is represented by two bytes.
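You can see this from Python's REPL:

>>> 'más'.encode('utf-8')
b'm\xc3\xa1s'

The "m" and "s" are the single bytes 0x6D and 0x73, while "á" becomes the two-byte sequence 0xC3 0xA1.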

(For one source of new Unicode to keep up with, see Emojipedia's changelog of Apple's 2015 emoji updates: http://blog.emojipedia.org/apple-2015-emoji-changelog-ios-os-x.)

ftfy also works as a command-line tool, so it fits into shell pipelines:

```
curl http://example.com/api/data.txt | ftfy | sort | uniq -c
```

The details of all the changes can be found, of course, in the CHANGELOG.

Has ftfy solved a problem for you? Have you stumped it with a particularly bizarre case of mojibake? Let us know in the comments or on Twitter.

ftfy (fixes text for you) version 3.0

Originally posted on August 26, 2013.

About a year ago, we blogged about how to ungarble garbled Unicode in a post called Fixing common Unicode mistakes with Python — after they’ve been made. Shortly after that, we released the code in a Python package called ftfy.

You have almost certainly seen the kind of problem ftfy fixes. Here's a shoutout from a developer who found that her database was full of place names such as "BucureÅŸti, Romania" because of someone else's bug. That's easy enough to fix:

pip install ftfy
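and then one call un-garbles the name (a quick sketch; the mojibake decodes back to the intended "Bucureşti"):

>>> import ftfy
>>> ftfy.fix_text('BucureÅŸti, Romania')
'Bucureşti, Romania'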

If ftfy is useful to you, we'd love to hear how you're using it. You can reply to the comments here or e-mail us at info@luminoso.com.

Fixing Unicode mistakes and more: the ftfy package

Originally posted on August 24, 2012.

There's been a great response to my earlier post, Fixing common Unicode mistakes with Python. This is clearly something that people besides me needed. In fact, someone already made the code into a web site, at fixencoding.com. I like the favicon.

I took the suggestion to split the code into a new standalone package. It's now called ftfy, standing for "fixes text for you". You can install it with pip install ftfy.

I observed that I was doing interesting things with Unicode in Python, and yet I wasn't doing it in Python 3, which basically makes me a terrible person. ftfy is now compatible with both Python 2 and Python 3.

Something else amusing happened: At one point, someone edited the previous post and WordPress barfed HTML entities all over its text. All the quotation marks turned into &quot;, for example. So, for a bit, that post was setting a terrible example about how to handle text correctly!

I took that as a sign that I should expand ftfy so that it also decodes HTML entities (though it will leave them alone in the presence of HTML tags). While I was at it, I also made it turn curly quotes into straight ones, convert Windows line endings to Unix, normalize Unicode characters to their canonical forms, strip out terminal color codes, and remove miscellaneous control characters. The original fix_bad_unicode is still in there, if you just want the encoding fixer without the extra stuff.
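To see several of those fixes firing at once (a small illustration, with an input invented here rather than taken from the original post):

>>> import ftfy
>>> ftfy.fix_text('“Mojibake” &amp; more\r\n')
'"Mojibake" & more\n'

The curly quotes are straightened, the HTML entity is decoded, and the Windows line ending becomes a Unix one.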

Fixing common Unicode mistakes with Python after they've been made

Originally posted on August 20, 2012.

Update: not only can you fix Unicode mistakes with Python, you can fix Unicode mistakes with our open source Python package ftfy. It's on PyPI and everything.

You have almost certainly seen text on a computer that looks something like this:

If numbers arenâ€™t beautiful, I donâ€™t know what is. â€“Paul ErdÅ‘s

Somewhere, a computer got hold of a list of numbers that were intended to constitute a quotation and did something distinctly un-beautiful with it. A person reading that can deduce that it was actually supposed to say this:

If numbers aren’t beautiful, I don’t know what is. –Paul Erdős

Here's what's going on. A modern computer has the ability to display text that uses over 100,000 different characters, but unfortunately that text sometimes passes through a doddering old program that believes there are only the 256 characters that fit in a single byte. The program doesn't even bother to check what encoding the text is in; it just uses its own favorite encoding and turns a bunch of characters into strings of completely different characters.

Now, you're not the programmer causing the encoding problems, right? Because you've read something like Joel Spolsky's The Absolute Minimum Every Developer Absolutely, Positively Must Know About Unicode And Character Sets or the Python Unicode HOWTO and you've learned the difference between text and bytestrings and how to get them right.

But the problem is that sometimes you might have to deal with text that comes out of other code. We deal with this a lot at Luminoso, where the text our customers want us to analyze has often passed through several different pieces of software, each with their own quirks, probably with Microsoft Office somewhere in the chain.

So this post isn't about how to do Unicode right. It's about a tool we came up with for damage control after some other program does Unicode wrong. It detects some of the most common encoding mistakes and does what it can to undo them.

Here's the type of Unicode mistake we're fixing.

  • Some text, somewhere, was encoded into bytes using UTF-8 (which is quickly becoming the standard encoding for text on the Internet).
  • The software that received this text wasn't expecting UTF-8. It instead decodes the bytes in an encoding with only 256 characters. The simplest of these encodings is the one called "ISO-8859-1", or "Latin-1" among friends. In Latin-1, you map the 256 possible bytes to the first 256 Unicode characters. This encoding can arise naturally from software that doesn't even consider that different encodings exist.
  • The result is that every non-ASCII character turns into two or three garbage characters.

The three most commonly-confused codecs are UTF-8, Latin-1, and Windows-1252. There are lots of other codecs in use in the world, but they are so obviously different from these three that everyone can tell when they've gone wrong. We'll focus on fixing cases where text was encoded as one of these three codecs and decoded as another.
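You can reproduce the mistake, and undo it, in a couple of lines of plain Python (no ftfy involved, using a Windows-1252 mix-up as the example):

>>> 'más'.encode('utf-8').decode('windows-1252')   # make the mistake
'mÃ¡s'
>>> 'mÃ¡s'.encode('windows-1252').decode('utf-8')  # undo it
'más'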

A first attempt

When you look at the kind of junk that's produced by this process, the character sequences seem so ugly and meaningless that you could just replace anything that looks like it should have been UTF-8. Just find those sequences, replace them unconditionally with what they would be in UTF-8, and you're done. In fact, that's what my first version did, give or take a bunch of edge cases and error handling.
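A minimal sketch of that strategy (a hypothetical reconstruction, not the original code):

```python
def naive_fix(text):
    # Assume the text is UTF-8 that was mistakenly decoded as
    # Windows-1252, and undo that one mistake -- unconditionally.
    try:
        return text.encode('windows-1252').decode('utf-8')
    except UnicodeError:
        # The bytes don't round-trip cleanly; leave the text alone.
        return text
```

The code that eventually replaced it is far more cautious about which fixes are plausible. One ingredient of that caution, taken from ftfy's source, is a table scoring how commonly each Unicode script shows up in real text: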

```python
# A table telling us how to interpret the first word of a letter's Unicode
# name. The number indicates how frequently we expect this script to be used
# on computers. Many scripts not included here are assumed to have a frequency
# of "0" -- if you're going to write in Linear B using Unicode, you're
# probably aware enough of encoding issues to get it right.
#
# The lowercase name is a general category -- for example, Han characters and
# Hiragana characters are very frequently adjacent in Japanese, so they all go
# into category 'cjk'. Letters of different categories are assumed not to
# appear next to each other often.
SCRIPT_TABLE = {
    'LATIN': (3, 'latin'),
    'CJK': (2, 'cjk'),
    'ARABIC': (2, 'arabic'),
    'CYRILLIC': (2, 'cyrillic'),
    'GREEK': (2, 'greek'),
    'HEBREW': (2, 'hebrew'),
    'KATAKANA': (2, 'cjk'),
    'HIRAGANA': (2, 'cjk'),
    'HIRAGANA-KATAKANA': (2, 'cjk'),
    'HANGUL': (2, 'cjk'),
    'DEVANAGARI': (2, 'devanagari'),
    'THAI': (2, 'thai'),
    'FULLWIDTH': (2, 'cjk'),
    'MODIFIER': (2, None),
    'HALFWIDTH': (1, 'cjk'),
    'BENGALI': (1, 'bengali'),
    'LAO': (1, 'lao'),
    'KHMER': (1, 'khmer'),
    'TELUGU': (1, 'telugu'),
    'MALAYALAM': (1, 'malayalam'),
    'SINHALA': (1, 'sinhala'),
    'TAMIL': (1, 'tamil'),
    'GEORGIAN': (1, 'georgian'),
    'ARMENIAN': (1, 'armenian'),
    'KANNADA': (1, 'kannada'),  # mostly used for looks of disapproval
    'MASCULINE': (1, 'latin'),
    'FEMININE': (1, 'latin'),
}
```