Originally posted on October 29, 2015.
[Image: wordfreq 1.2 example code. Examples in Chinese and British English.]

In a previous post, we introduced wordfreq, our open-source Python library that lets you ask "how common is this word?"

Wordfreq is an important low-level tool for Luminoso. It's one of the things we use to figure out which words are important in a set of text data. When we get the word frequencies figured out in a language, that's a big step toward being able to handle that language from end to end in the Luminoso pipeline. We recently started supporting Arabic in our product, and improved Chinese enough to take the "BETA" tag off of it, and having the right word frequencies for those languages was a big part of that.

I've continued to work on wordfreq, putting together more data from more languages. We now have 17 languages that meet the threshold of having three independent sources of word frequencies, which we consider important for those word frequencies to be representative.

Here's what's new in wordfreq 1.2:

When adding Turkish, we made sure to convert the case of dotted and dotless İ's correctly. We know that putting the dots in the wrong places can lead to miscommunication and even fatal stabbings.

The language in wordfreq that's still only partially supported is Korean. We still only have two sources of data for it, so you'll see the disproportionate influence of Twitter on its frequencies. If you know where to find a lot of freely-usable Korean subtitles, for example, we would love to know.

Let's revisit the top 10 words in the languages wordfreq supports. And now that we've talked about getting right-to-left right, let's add a bit of code that makes Arabic show up with right-to-left words in left-to-right order, instead of middle-to-elsewhere order like it came out before.

[Image: code showing the top ten words in each language wordfreq 1.2 supports.]

Wordfreq 1.2 is available on GitHub and PyPI.
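
To make the Turkish casing point above concrete, here is a minimal sketch of why the default case conversion goes wrong and one way to work around it. This is not wordfreq's actual implementation; the helper names `turkish_lower` and `turkish_upper` are hypothetical and exist only for illustration. The sketch relies only on Python's built-in string methods.

```python
# Python's built-in case conversion is locale-independent, so it uses the
# non-Turkish mapping: 'I' lowercases to 'i', and 'İ' lowercases to 'i'
# followed by a combining dot above (U+0307).

def turkish_lower(text):
    """Lowercase text using Turkish rules (hypothetical helper).

    Turkish pairs dotted İ with i, and dotless I with ı, so those two
    capitals are mapped by hand before the general lowercasing step.
    """
    return text.replace('İ', 'i').replace('I', 'ı').lower()


def turkish_upper(text):
    """Uppercase text using Turkish rules (hypothetical helper)."""
    return text.replace('i', 'İ').replace('ı', 'I').upper()


if __name__ == '__main__':
    for word in ['ISPARTA', 'İSTANBUL', 'DİYARBAKIR']:
        # Print the default lowercasing next to the Turkish-aware one.
        print(word, repr(word.lower()), repr(turkish_lower(word)))
```

The replacements happen before the call to `lower()` or `upper()` so that the built-in conversion never sees the two letters it would map differently from Turkish. A sketch like this only makes sense for text already known to be Turkish; applying it to other languages would introduce dotless ı's where they don't belong.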