wordfreq 1.2 is better at Chinese, English, Greek, Polish, Swedish, and Turkish

Rob Speer

2016-05-19 18:59

Originally posted on October 29, 2015.

Figure: Examples in Chinese and British English.

In a previous post, we introduced wordfreq, our open-source Python library that lets you ask "how common is this word?"

Wordfreq is an important low-level tool for Luminoso. It's one of the things we use to figure out which words are important in a set of text data. When we get the word frequencies figured out in a language, that's a big step toward being able to handle that language from end to end in the Luminoso pipeline. We recently started supporting Arabic in our product and improved Chinese enough to take the "BETA" tag off of it, and having the right word frequencies for those languages was a big part of both.

I've continued to work on wordfreq, putting together more data from more languages. We now have 17 languages that meet the threshold of having three independent sources of word frequencies, which we consider important for those word frequencies to be representative.

Here's what's new in wordfreq 1.2:

The English word list has gotten a bit more robust and a bit more British by including SUBTLEX, adding word frequencies from American TV shows as well as the BBC.
It can fearlessly handle Chinese now. It uses a lovely pure-Python Chinese tokenizer, Jieba, to handle multiple-word phrases, and Jieba's built-in wordlist provides a third independent source of word frequencies. Wordfreq can even smooth over the differences between Traditional and Simplified Chinese.
Greek has also been promoted to a fully-supported language. With new data from Twitter and OpenSubtitles, it now has four independent sources.
In some applications, you want to tokenize a complete piece of text, including punctuation as separate tokens. Punctuation tokens don't get their own word frequencies, but you can ask the tokenizer to give you the punctuation tokens anyway.
We added support for Polish, Swedish, and Turkish. All those languages have a reasonable amount of data that we could obtain from OpenSubtitles, Twitter, and Wikipedia by doing what we were doing already.
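To make the punctuation option concrete, here is a rough standard-library sketch of a tokenizer that can either drop punctuation or return it as separate tokens. This is not wordfreq's tokenizer: `simple_tokenize` and its `include_punctuation` flag are hypothetical names for illustration, and the regex is far cruder than real tokenization rules.

```python
import re

# One alternative matches a word (allowing internal apostrophes), the other
# matches a single non-space, non-word character such as punctuation.
TOKEN_RE = re.compile(r"(?P<word>\w+(?:'\w+)*)|(?P<punct>[^\w\s])")


def simple_tokenize(text, include_punctuation=False):
    """Split text into lowercased words, optionally keeping punctuation.

    Hypothetical helper for illustration only; not part of wordfreq.
    """
    tokens = []
    for match in TOKEN_RE.finditer(text):
        if match.lastgroup == "word":
            tokens.append(match.group().lower())
        elif include_punctuation:
            # Punctuation has no frequency of its own, but callers that
            # need the complete token stream can still ask for it.
            tokens.append(match.group())
    return tokens


print(simple_tokenize("Don't panic."))
print(simple_tokenize("Don't panic.", include_punctuation=True))
```

The default leaves punctuation out because those tokens have no word frequency to look up; the flag is for applications that need every token of the original text.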

When adding Turkish, we made sure to convert the case of dotted and dotless İ's correctly. We know that putting the dots in the wrong places can lead to miscommunication and even fatal stabbings.
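As an illustration of why Turkish needs special handling, here is a minimal sketch of Turkish-aware lowercasing. Python's built-in `str.lower()` follows the default Unicode rules, which turn "I" into "i" and "İ" into "i" plus a combining dot, and neither result is right for Turkish. The `turkish_lower` helper is a hypothetical name, and this is one simple way to do the conversion, not necessarily the code wordfreq uses.

```python
def turkish_lower(text):
    """Lowercase text using Turkish rules for dotted and dotless I.

    Hypothetical helper for illustration only.
    """
    # Handle the two Turkish-specific capitals first, then let the default
    # lowercasing take care of every other letter.
    return text.replace("İ", "i").replace("I", "ı").lower()


for word in ["ISPARTA", "İSTANBUL", "DİYARBAKIR"]:
    # Compare the default result with the Turkish-aware one.
    print(repr(word.lower()), repr(turkish_lower(word)))
```

The order matters: replacing the capitals before calling `lower()` keeps the default rules from ever seeing "I" or "İ".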

The one language in wordfreq that's still only partially supported is Korean. We still only have two sources of data for it, so you'll see the disproportionate influence of Twitter on its frequencies. If you know where to find a lot of freely-usable Korean subtitles, for example, we would love to know.
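A toy calculation shows why the number of sources matters. The sketch below assumes sources are combined by averaging each word's relative frequency across them, which is a simplification and may not match how wordfreq actually builds its lists. The counts are invented for illustration, and `relative_freqs` and `merge_sources` are hypothetical helpers.

```python
from collections import Counter


def relative_freqs(counts):
    """Convert raw counts into frequencies that sum to 1."""
    total = sum(counts.values())
    return {word: n / total for word, n in counts.items()}


def merge_sources(sources):
    """Average each word's relative frequency over all sources.

    Assumption: plain averaging; a word missing from a source counts as 0.
    """
    merged = Counter()
    for counts in sources:
        for word, freq in relative_freqs(counts).items():
            merged[word] += freq / len(sources)
    return dict(merged)


# Invented toy counts, not real data.
twitter = {"the": 50, "lol": 50}
subtitles = {"the": 90, "lol": 10}
wikipedia = {"the": 100}

two = merge_sources([twitter, subtitles])
three = merge_sources([twitter, subtitles, wikipedia])
print(two["lol"], three["lol"])
```

With only two sources, the Twitter-heavy word keeps a larger share of the merged frequency than it does once a third source is averaged in.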

Let's revisit the top 10 words in the languages wordfreq supports. And now that we've talked about getting right-to-left right, let's add a bit of code that makes Arabic show up with right-to-left words in left-to-right order, instead of middle-to-elsewhere order like it came out before.

Code showing the top ten words in each language wordfreq 1.2 supports.
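For readers who can't see the code in the image, here is a rough sketch of one way to get right-to-left words listed in left-to-right order. It appends a LEFT-TO-RIGHT MARK (U+200E) after each right-to-left word, so that adjacent Arabic words don't merge into a single right-to-left run when displayed. This is an assumption about a reasonable technique, not necessarily what the original code does, and `is_rtl` and `join_left_to_right` are hypothetical helpers.

```python
import unicodedata

LRM = "\u200e"  # LEFT-TO-RIGHT MARK


def is_rtl(word):
    """Return True if the word contains a strongly right-to-left character."""
    return any(unicodedata.bidirectional(ch) in ("R", "AL") for ch in word)


def join_left_to_right(words, sep=", "):
    """Join words so a list of RTL words still reads left to right.

    Each RTL word stays right-to-left internally; the mark after it keeps
    the separator and the next word from joining the same run.
    """
    return sep.join(word + LRM if is_rtl(word) else word for word in words)


# Three common Arabic words, used here only as sample input.
print(join_left_to_right(["في", "من", "على"]))
print(join_left_to_right(["the", "of", "and"]))
```

Left-to-right words pass through unchanged, so the same function can be used for every language in the list.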

Wordfreq 1.2 is available on GitHub and PyPI.
