wordfreq 1.4: more words, plus word frequencies from Reddit

The wordfreq module is an easy Python interface for looking up the frequencies of words. It was originally designed for use cases where it was most important to find common words, so it would list all the words that occur at least once per million words: that’s about 30,000 words in English. An advantage of ending the list there is that it loads really fast and takes up a small amount of RAM.

But there’s more to know about word frequencies. There’s a difference between words that are used a bit less than once in a million words, like “almanac”, “crusty”, and “giraffes”, and words that are used just a few times per billion, such as “centerback”, “polychora”, and “scanlations”. As I’ve started using wordfreq in some aspects of the build process of ConceptNet, I’ve wanted to be able to rank words by frequency even if they’re less common than “giraffes”, and I’m sure other people do too.

So one big change in wordfreq 1.4 is that there is now a ‘large’ wordlist available in the languages that have enough data to support it: English, German, Spanish, French, and Portuguese. These lists contain all words used at least once per 100 million words. The default wordlist is still the smaller, faster one, so you have to ask for the ‘large’ wordlist explicitly — see the documentation.
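
If you want to see the difference, a quick check looks something like this (a sketch, assuming that “polychora”, being a few-per-billion word as mentioned above, falls below the default list’s cutoff):

import wordfreq

# The default list only goes down to about once per million words, so a
# rarer word such as 'polychora' should come back as 0 here...
print(wordfreq.word_frequency('polychora', 'en'))

# ...but asking for the 'large' wordlist explicitly gives it a real frequency.
print(wordfreq.word_frequency('polychora', 'en', 'large'))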

Including word frequencies from Reddit

The best way to get representative word frequencies is to include a lot of text from a lot of different sources. Now there’s another source available: the Reddit comment corpus.

Reddit is an English-centric site and 99.2% of its comments are in English. We still need to account for the exceptions, such as /r/es, /r/todayilearned_jp, /r/sweden, and of course, the thread named “HELP reddit turned spanish and i cannot undo it!”.

I used pycld2 to detect the language of Reddit comments. In this version, I decided to only use the comments that could be detected as English, because I couldn’t be sure that the data I was getting from other languages was representative enough. For example, unfortunately, most comments in Italian on Reddit are spam, and most comments in Japanese are from English speakers trying to learn Japanese. The data that looks the most promising is Spanish, and I might decide to include that in a later version.
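
The language check itself is simple. Here’s a rough sketch of it (not the actual build code; pycld2.detect returns a reliability flag along with its best guesses):

import pycld2

def is_reliably_english(text):
    # pycld2.detect returns (isReliable, textBytesFound, details), where
    # details lists guesses as (languageName, languageCode, percent, score).
    try:
        is_reliable, _, details = pycld2.detect(text)
    except pycld2.error:
        return False
    return is_reliable and details[0][1] == 'en'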

So now some Reddit-centric words have claimed a place in the English word list, alongside words from Google Books, Wikipedia, Twitter, television subtitles, and the Leeds Internet Corpus:

>>> wordfreq.word_frequency('upvote', 'en')
1.0232929922807536e-05

>>> wordfreq.word_frequency('eli5', 'en', 'large')
6.165950018614822e-07

One more thing: we only use words from comments with a score of 1 or more. This helps keep the worst examples of spam and trolling from influencing the word list too much.
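
The comment dump comes as one JSON object per line, with the text in “body” and the net votes in “score”, so the filtering step looks roughly like this (again a sketch, reusing the language check sketched above):

import json

def english_comments(lines):
    # Keep only comments with a score of at least 1 that are reliably
    # detected as English.
    for line in lines:
        comment = json.loads(line)
        if comment['score'] >= 1 and is_reliably_english(comment['body']):
            yield comment['body']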

The Zipf frequency scale

When Marc Brysbaert let me include his excellent SUBTLEX data (word frequencies from television subtitles) as part of wordfreq, he asked me to include his preferred frequency scale as an option. I agree with him: it’s nicer than looking at raw frequencies.

The Zipf scale is a logarithmic scale of word frequency that’s meant to give you intuitive, small, positive numbers for reasonable words: it’s 9 plus the log (base 10) of the word frequency. This was easy to include in wordfreq, because it stores its frequencies on a log-10 scale anyway. You can now use the wordfreq.zipf_frequency function to see frequencies on this scale.

>>> wordfreq.zipf_frequency('people', 'en', 'large')
6.23

>>> wordfreq.zipf_frequency('cats', 'en', 'large')
4.42

>>> wordfreq.zipf_frequency('giraffes', 'en', 'large')
3.0

>>> wordfreq.zipf_frequency('narwhals', 'en', 'large')
2.1

>>> wordfreq.zipf_frequency('heffalumps', 'en', 'large')
1.78

>>> wordfreq.zipf_frequency('borogoves', 'en', 'large')
1.16
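
If you want to check that the two functions agree, the conversion is a one-liner, give or take rounding:

import math
import wordfreq

def zipf_from_frequency(freq):
    # The Zipf scale is 9 plus the base-10 log of the frequency, so a
    # once-per-million word lands at 9 + log10(1e-6) = 3.0.
    return math.log10(freq) + 9

# Should match wordfreq.zipf_frequency('giraffes', 'en', 'large'),
# up to rounding.
print(zipf_from_frequency(wordfreq.word_frequency('giraffes', 'en', 'large')))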

wordfreq is part of a stack of natural language tools developed at Luminoso and used in ConceptNet. Its data is available under the Creative Commons Attribution-ShareAlike 4.0 license.

wordfreq 1.2 is better at Chinese, English, Greek, Polish, Swedish, and Turkish

Originally posted on October 29, 2015.
Wordfreq 1.2 example code: examples in Chinese and British English.

In a previous post, we introduced wordfreq, our open-source Python library that lets you ask “how common is this word?”

Wordfreq is an important low-level tool for Luminoso. It’s one of the things we use to figure out which words are important in a set of text data. When we get the word frequencies figured out in a language, that’s a big step toward being able to handle that language from end to end in the Luminoso pipeline. We recently started supporting Arabic in our product and improved Chinese enough to take the “BETA” tag off of it, and having the right word frequencies for those languages was a big part of that.

I’ve continued to work on wordfreq, putting together more data from more languages. We now have 17 languages that meet the threshold of having three independent sources of word frequencies, which we consider important for those word frequencies to be representative.

Here’s what’s new in wordfreq 1.2:

  • The English word list has gotten a bit more robust and a bit more British by including SUBTLEX, adding word frequencies from American TV shows as well as the BBC.
  • It can fearlessly handle Chinese now. It uses a lovely pure-Python Chinese tokenizer, Jieba, to handle multiple-word phrases, and Jieba’s built-in wordlist provides a third independent source of word frequencies. Wordfreq can even smooth over the differences between Traditional and Simplified Chinese.
  • Greek has also been promoted to a fully-supported language. With new data from Twitter and OpenSubtitles, it now has four independent sources.
  • In some applications, you want to tokenize a complete piece of text, including punctuation as separate tokens. Punctuation tokens don’t get their own word frequencies, but you can ask the tokenizer to give you the punctuation tokens anyway; there’s a sketch of this after the list.
  • We added support for Polish, Swedish, and Turkish. All those languages have a reasonable amount of data that we could obtain from OpenSubtitles, Twitter, and Wikipedia by doing what we were doing already.
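
Here’s a small sketch that combines the Chinese tokenization and the punctuation option. The flag is the include_punctuation argument to wordfreq.tokenize (at least in current versions), and the exact segmentation of the Chinese proverb is up to Jieba:

import wordfreq

# Chinese text has no spaces, so Jieba decides where the word boundaries go.
print(wordfreq.tokenize('活到老，学到老。', 'zh'))

# Ask for the punctuation tokens too. They come back as tokens, but they
# don't get word frequencies of their own.
print(wordfreq.tokenize('活到老，学到老。', 'zh', include_punctuation=True))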

When adding Turkish, we made sure to convert the case of dotted and dotless İ’s correctly. We know that putting the dots in the wrong places can lead to miscommunication and even fatal stabbings.
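
The underlying problem is easy to reproduce: Python’s built-in case conversion doesn’t know about Turkish, so it’s up to the tokenizer to get the dots right. For example:

import wordfreq

# str.lower() isn't locale-aware, so the dotless I in 'KADIKÖY' turns into
# a dotted i: 'kadiköy'.
print('KADIKÖY'.lower())

# Tokenizing the text as Turkish should give the correct 'kadıköy'.
print(wordfreq.tokenize('KADIKÖY', 'tr'))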

The language in wordfreq that’s still only partially supported is Korean. We still only have two sources of data for it, so you’ll see the disproportionate influence of Twitter on its frequencies. If you know where to find a lot of freely-usable Korean subtitles, for example, we would love to know.

Let’s revisit the top 10 words in the languages wordfreq supports. And now that we’ve talked about getting right-to-left right, let’s add a bit of code that makes Arabic show up with right-to-left words in left-to-right order, instead of middle-to-elsewhere order like it came out before.

Code showing the top ten words in each language wordfreq 1.2 supports.
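
Since that’s an image, here’s a rough, copy-pasteable equivalent. It uses wordfreq.top_n_list and wordfreq.available_languages; wrapping each word in Unicode directional isolates is one way (not necessarily the original post’s way) to keep right-to-left words in left-to-right list order:

import wordfreq

for lang in sorted(wordfreq.available_languages()):
    words = wordfreq.top_n_list(lang, 10)
    # Wrapping each word in a directional isolate (U+2068 ... U+2069) lets a
    # right-to-left word display right-to-left internally while the list as
    # a whole still reads left to right.
    print(lang, ' '.join('\u2068' + word + '\u2069' for word in words))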

Wordfreq 1.2 is available on GitHub and PyPI.

Fixing common Unicode mistakes with Python after they’ve been made

Originally posted on August 20, 2012.

Update: not only can you fix Unicode mistakes with Python, you can fix Unicode mistakes with our open source Python package ftfy. It’s on PyPI and everything.

You have almost certainly seen text on a computer that looks something like this:

If numbers arenâ€™t beautiful, I donâ€™t know what is. â€“Paul ErdÅ‘s

Somewhere, a computer got hold of a list of numbers that were intended to constitute a quotation and did something distinctly un-beautiful with it. A person reading that can deduce that it was actually supposed to say this:

If numbers aren’t beautiful, I don’t know what is. –Paul Erdős

Here’s what’s going on. A modern computer has the ability to display text that uses over 100,000 different characters, but unfortunately that text sometimes passes through a doddering old program that believes there are only the 256 characters that fit in a single byte. The program doesn’t even bother to check what encoding the text is in; it just uses its own favorite encoding and turns a bunch of characters into strings of completely different characters.
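
In Python terms, the mistake looks like this (a contrived reproduction, not anything a well-behaved program should do on purpose):

# The correct text, encoded as UTF-8 the way a modern program would store it...
data = 'If numbers aren’t beautiful, I don’t know what is.'.encode('utf-8')

# ...then decoded by a program that assumes everything is Windows-1252.
print(data.decode('windows-1252'))
# If numbers arenâ€™t beautiful, I donâ€™t know what is.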

Now, you’re not the programmer causing the encoding problems, right? Because you’ve read something like Joel Spolsky’s The Absolute Minimum Every Developer Absolutely, Positively Must Know About Unicode And Character Sets or the Python Unicode HOWTO and you’ve learned the difference between text and bytestrings and how to get them right.

But the problem is that sometimes you might have to deal with text that comes out of other code. We deal with this a lot at Luminoso, where the text our customers want us to analyze has often passed through several different pieces of software, each with their own quirks, probably with Microsoft Office somewhere in the chain.

So this post isn’t about how to do Unicode right. It’s about a tool we came up with for damage control after some other program does Unicode wrong. It detects some of the most common encoding mistakes and does what it can to undo them.
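
As the update at the top of this post says, that tool grew into the ftfy package, where the repair is a single call:

import ftfy

# ftfy recognizes the UTF-8-decoded-as-Windows-1252 pattern and undoes it.
print(ftfy.fix_text('If numbers arenâ€™t beautiful, I donâ€™t know what is.'))
# If numbers aren’t beautiful, I don’t know what is.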


How to make an orderly transition to Python Requests 1.0 instead of running around in a panic

There’s a lovely Python module for making HTTP requests, called requests. We use it at Luminoso. A bunch of code we depend on uses it. Our API customers use it. Basically everyone uses it because it’s the right thing to use.

Yesterday we did our first code update of the new year on our development systems, and found that suddenly nothing was working. Meanwhile, our customers sent us bug reports for similar reasons. We’d see errors like this one:

TypeError: session() takes no arguments (1 given)

And all kinds of code would crash with this kind of error, which occurs because Requests changed .json from a property to a method:

TypeError: 'instancemethod' object has no attribute '__getitem__'
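
That second error is what you get when code written for 0.x treats .json as data. A sketch, with a made-up URL:

import requests

response = requests.get('https://api.example.com/thing')   # hypothetical URL

# Under requests 0.x, .json was a property holding the already-decoded data:
#     name = response.json['name']
# Under requests 1.0 it's a method, so that old line subscripts the method
# object itself and raises the TypeError above. The 1.0 spelling is:
name = response.json()['name']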

You see, on December 17, Kenneth Reitz released version 1.0 of requests and declared “This is not a backwards compatible change.” As far as we can tell, this has caused a small ripple of version-related panic in the Python world. We know it’s okay to break compatibility when changing the major version number. That’s what major version numbers are for. But the problem is that it’s really hard to deal with multiple incompatible versions of the same Python package.

If you type pip install requests now, you’ll get version 1.0, and it won’t work with most code written for version 0.14. So maybe you should ask for “requests < 1.0” or “requests == 0.14.2”, and maybe even declare that dependency in setup.py. That was certainly the stopgap measure we went around applying yesterday.
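
The stopgap looks something like this in a setup.py (the package name here is a placeholder; the same version specifier works in a requirements file):

from setuptools import setup

setup(
    name='yourpackage',                    # hypothetical package
    version='0.1',
    install_requires=[
        'requests >= 0.14.0, < 1.0',       # stay on the 0.x series for now
    ],
)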

The problem is that, once you do that, you can’t ever upgrade to Requests 1.0 or install any code that uses Requests 1.0, unless you port all your code and update all your Python environments at once. Not even virtualenv will help. You just can’t have an environment that depends on “requests < 1.0” and “requests >= 1.0” at the same time and have your code keep working.

The requests-transition package

We want to make it possible to move to the shiny new Requests 1.x code. But we also want our code stack to keep working in the present. That’s the purpose of requests-transition. All it does is install both versions of requests as two different packages with different names.

The slogan of requests is “Python HTTP for Humans”. The slogan of requests-transition is “Python HTTP for busy people who don’t have time to port all their code yet”.

To install it using pip:

pip install requests-transition

Now you can stabilize your existing code that uses requests 0.x by changing the line

import requests

to

import requests0 as requests

When you port the code to use requests 1.0, change the import line to:

import requests1 as requests

In the future, when all your dependencies use requests 1.0 and 0.x is a distant memory, you should get the latest version of the real requests package and change the import lines back to:

import requests

And that is how you transition to requests 1.x, calmly and painlessly.

We have already updated our API client code to use requests-transition, instead of forcing you to install “requests < 1.0”.

Watch python-requests-transition on GitHub