wordfreq 1.5: More data, more languages, more accuracy

Rob Speer

2016-08-22 12:44

wordfreq is a useful dataset of word frequencies in many languages, and a simple Python library that lets you look up the frequencies of words (or word-like tokens, if you want to quibble about what's a word). Version 1.5 is now available on GitHub and PyPI.

wordfreq can rank the frequencies of nearly 400,000 English words. These are some of them.

These word frequencies don't just come from one source; they combine many sources to take into account many different ways to use language.

Some other frequency lists just use Wikipedia because it's easy, but then they don't accurately represent the frequencies of words outside of an encyclopedia. The wordfreq data combines whatever data is available from Wikipedia, Google Books, Reddit, Twitter, SUBTLEX, OpenSubtitles, and the Leeds Internet Corpus. Now we've added one more source: as much non-English text as we could possibly find in the Common Crawl of the entire Web.

Including this data has led to some interesting changes in the new version 1.5 of wordfreq:

We've got enough data to support 9 new languages: Bulgarian, Catalan, Danish, Finnish, Hebrew, Hindi, Hungarian, Norwegian Bokmål, and Romanian.
Korean has been promoted from marginal to full support. In fact, none of the languages are "marginal" now: all 27 supported languages have at least three data sources and a tokenizer that's prepared to handle that language.
We changed how we rank the frequencies of words when data sources disagree. We used to use the mean of the frequencies. Now we use a weighted median.

Fixing outliers

Using a weighted median of word frequencies is an important change to the data. When the Twitter data source says "oh man you guys 'rt' is a really common word in every language", and the other sources say "No it's not", the word 'rt' now ends up with a much lower value in the combined list because of the median.

wordfreq can still analyze formal or informal writing without its top frequencies being spammed by things that are specific to one data source. This turned out to be essential when adding the Common Crawl: when text on the Web is translated into a lot of languages, there is an unreasonably high chance that it says "log in", "this website uses cookies", "select your language", the name of another language, or is related to tourism, such as text about hotels and restaurants. We wanted to take advantage of the fact that we have a crawl of the multilingual Web, without making all of the data biased toward words that are overrepresented in that crawl.

The reason the median is weighted is so we can still compare frequencies of words that don't appear in a majority of sources. If a source has never seen a word, that could just be sampling noise, so its vote of 0 for what the word's frequency should be counts less. As a result, there are still source-specific words, just with a lower frequency than they had in wordfreq 1.4:

[code lang=python]
>>> # Some source data has split off "n't" as a
>>> # separate token
>>> wordfreq.zipf_frequency("n't", 'en', 'large')
2.28

>>> wordfreq.zipf_frequency('retweet', 'en', 'large')
1.57

>>> wordfreq.zipf_frequency('eli5', 'en', 'large')
1.45
[/code]
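To make the weighted median concrete, here's a minimal sketch. The function names and the weight given to a source that has never seen a word (`missing_weight`) are assumptions for illustration; this post doesn't state the actual weights wordfreq uses.

```python
def weighted_median(values, weights):
    """Return the smallest value at which the cumulative weight reaches half the total."""
    total = sum(weights)
    cumulative = 0.0
    for value, weight in sorted(zip(values, weights)):
        cumulative += weight
        if cumulative >= total / 2:
            return value
    raise ValueError("no values to take the median of")


def combine_frequencies(freqs, missing_weight=0.3):
    # A frequency of 0 means "this source never saw the word".
    # Such votes still count, but with a smaller (hypothetical) weight.
    weights = [1.0 if f > 0 else missing_weight for f in freqs]
    return weighted_median(freqs, weights)


# One source wildly overestimates the word; the median ignores the outlier.
outlier = combine_frequencies([1e-6, 2e-6, 3e-4, 1.5e-6])

# A word seen in only 2 of 5 sources still gets a nonzero frequency.
rare = combine_frequencies([0, 0, 0, 2e-7, 4e-7])
```

With these made-up inputs, `outlier` comes out as 1.5e-06 instead of the mean of about 7.6e-05, and `rare` comes out as 2e-07 where an unweighted median would say 0.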

Why use only non-English data in the Common Crawl?

Mostly to keep the amount of data manageable. While the final wordfreq lists are compressed down to kilobytes or megabytes, building these lists already requires storing and working with a lot of input.

There are terabytes of data in the Common Crawl. That's not quite "big data": it fits on a hard disk, and a desktop computer can iterate through it with no problem. But counting every English word in the Common Crawl would involve intermediate results that start to push the "fits on a hard disk" limit. English is doing fine anyway, because it has its own large sources, such as Google Books.

More data in more languages

A language can be represented in wordfreq when there are 3 large enough, free enough, independent sources of data for it. If there are at least 5 sources, then we also build a "large" list, containing lower-frequency words at the cost of more memory.

There are now 27 languages that make the cut. There perhaps should have been 30: the only reason Czech, Slovak, and Vietnamese aren't included is that I neglected to download their Wikipedias before counting up data sources. Those languages should be coming soon.

Here's another chart showing the frequencies of miscellaneous words, this time in all the languages:

Getting wordfreq in your Python environment is as easy as pip install wordfreq. We hope you find this data useful in helping computers make sense of language!

wordfreq 1.4: more words, plus word frequencies from Reddit

Rob Speer

2016-06-02 12:02

The wordfreq module is an easy Python interface for looking up the frequencies of words. It was originally designed for use cases where it was most important to find common words, so it would list all the words that occur at least once per million words: that's about 30,000 words in English. An advantage of ending the list there is that it loads really fast and takes up a small amount of RAM.

But there's more to know about word frequencies. There's a difference between words that are used a bit less than once in a million words, like "almanac", "crusty", and "giraffes", versus words that are used just a few times per billion, such as "centerback", "polychora", and "scanlations". As I've started using wordfreq in some aspects of the build process of ConceptNet, I've wanted to be able to rank words by frequency even if they're less common than "giraffes", and I'm sure other people do too.

So one big change in wordfreq 1.4 is that there is now a 'large' wordlist available in the languages that have enough data to support it: English, German, Spanish, French, and Portuguese. These lists contain all words used at least once per 100 million words. The default wordlist is still the smaller, faster one, so you have to ask for the 'large' wordlist explicitly -- see the documentation.

Including word frequencies from Reddit

The best way to get representative word frequencies is to include a lot of text from a lot of different sources. Now there's another source available: the Reddit comment corpus.

Reddit is an English-centric site and 99.2% of its comments are in English. We still need to account for the exceptions, such as /r/es, /r/todayilearned_jp, /r/sweden, and of course, the thread named “HELP reddit turned spanish and i cannot undo it!”.

I used pycld2 to detect the language of Reddit comments. In this version, I decided to only use the comments that could be detected as English, because I couldn't be sure that the data I was getting from other languages was representative enough. For example, unfortunately, most comments in Italian on Reddit are spam, and most comments in Japanese are English speakers trying to learn Japanese. The data that looks the most promising is Spanish, and I might decide to include that in a later version.

So now some Reddit-centric words have claimed a place in the English word list, alongside words from Google Books, Wikipedia, Twitter, television subtitles, and the Leeds Internet Corpus:

```python
>>> # See http://crr.ugent.be/archives/1352
>>> wordfreq.zipf_frequency('people', 'en', 'large')
6.23

>>> wordfreq.zipf_frequency('cats', 'en', 'large')
4.42

>>> wordfreq.zipf_frequency('giraffes', 'en', 'large')
3.0

>>> wordfreq.zipf_frequency('narwhals', 'en', 'large')
2.1

>>> wordfreq.zipf_frequency('heffalumps', 'en', 'large')
1.78

>>> wordfreq.zipf_frequency('borogoves', 'en', 'large')
1.16
```
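The values returned by `zipf_frequency` above are on the Zipf scale, which is the base-10 logarithm of a word's frequency per billion words. Here's a small sketch of the conversion; the helper names are made up for this example and aren't part of wordfreq.

```python
import math


def zipf_to_frequency(zipf):
    # Zipf value -> proportion of all words (1.0 would mean "every word")
    return 10 ** (zipf - 9)


def frequency_to_zipf(freq):
    # Proportion of all words -> Zipf value
    return math.log10(freq) + 9


# How many times per million words does a Zipf 3.0 word appear?
per_million = zipf_to_frequency(3.0) * 1e6
```

So a Zipf value of 3.0 is about one occurrence per million words, and 1.0 is one occurrence per 100 million words, which is the cutoff for the 'large' lists.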

wordfreq is part of a stack of natural language tools developed at Luminoso and used in ConceptNet. Its data is available under the Creative Commons Attribution-ShareAlike 4.0 license.

wordfreq 1.2 is better at Chinese, English, Greek, Polish, Swedish, and Turkish

Rob Speer

2016-05-19 18:59

Originally posted on October 29, 2015.

Examples in Chinese and British English.

In a previous post, we introduced wordfreq, our open-source Python library that lets you ask "how common is this word?"

Wordfreq is an important low-level tool for Luminoso. It's one of the things we use to figure out which words are important in a set of text data. When we get the word frequencies figured out in a language, that's a big step toward being able to handle that language from end to end in the Luminoso pipeline. We recently started supporting Arabic in our product, and improved Chinese enough to take the "BETA" tag off of it, and having the right word frequencies for those languages was a big part of it.

I've continued to work on wordfreq, putting together more data from more languages. We now have 17 languages that meet the threshold of having three independent sources of word frequencies, which we consider important for those word frequencies to be representative.

Here's what's new in wordfreq 1.2:

The English word list has gotten a bit more robust and a bit more British by including SUBTLEX, adding word frequencies from American TV shows as well as the BBC.
It can fearlessly handle Chinese now. It uses a lovely pure-Python Chinese tokenizer, Jieba, to handle multiple-word phrases, and Jieba's built-in wordlist provides a third independent source of word frequencies. Wordfreq can even smooth over the differences between Traditional and Simplified Chinese.
Greek has also been promoted to a fully-supported language. With new data from Twitter and OpenSubtitles, it now has four independent sources.
In some applications, you want to tokenize a complete piece of text, including punctuation as separate tokens. Punctuation tokens don't get their own word frequencies, but you can ask the tokenizer to give you the punctuation tokens anyway.
We added support for Polish, Swedish, and Turkish. All those languages have a reasonable amount of data that we could obtain from OpenSubtitles, Twitter, and Wikipedia by doing what we were doing already.

When adding Turkish, we made sure to convert the case of dotted and dotless İ's correctly. We know that putting the dots in the wrong places can lead to miscommunication and even fatal stabbings.
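Here's a minimal sketch of why this needs special handling in Python. `turkish_lower` is a hypothetical helper written for this example, and not necessarily how wordfreq does it.

```python
def turkish_lower(text):
    # Map the two Turkish capital I's explicitly before the generic lowercasing:
    # dotted capital İ becomes dotted i, dotless capital I becomes dotless ı.
    return text.replace('İ', 'i').replace('I', 'ı').lower()


# Python's default lowercasing is language-independent, so on its own it
# turns 'I' into dotted 'i', and 'İ' into 'i' plus a combining dot.
default_i = 'I'.lower()
default_dotted = 'İ'.lower()

isparta = turkish_lower('ISPARTA')
istanbul = turkish_lower('İSTANBUL')
```

Note that this sketch should only be applied to text you already know is Turkish, because it would mangle the capital I in every other language.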

The one language in wordfreq that's still only partially supported is Korean. We still only have two sources of data for it, so you'll see the disproportionate influence of Twitter on its frequencies. If you know where to find a lot of freely-usable Korean subtitles, for example, we would love to know.

Let's revisit the top 10 words in the languages wordfreq supports. And now that we've talked about getting right-to-left right, let's add a bit of code that makes Arabic show up with right-to-left words in left-to-right order, instead of middle-to-elsewhere order like it came out before.

Code showing the top ten words in each language wordfreq 1.2 supports.

Wordfreq 1.2 is available on GitHub and PyPI.

wordfreq: Open source and open data about word frequencies

Rob Speer

2016-05-19 18:54

Originally posted on September 1, 2015.

Often, in NLP, you need to answer the simple question: "is this a common word?" It turns out that this leaves the computer to answer a more vexing question: "What's a word?"

Let's talk briefly about why word frequencies are important. In many cases, you want to assign more significance to uncommon words. For example, a product review might contain the word "use" and the word "defective", and the word "defective" carries way more information. If you're wondering what the deal is with John Kasich, a headline that mentions "Kasich" will be much more likely to be what you're looking for than one that merely mentions "John".

For purposes like these, it would be nice if we could just import a Python package that could tell us whether one word was more common than another, in general, based on a wide variety of text. We looked for a while and couldn't find it. So we built it.

wordfreq provides estimates of the frequencies of words in many languages, loading its data from efficiently-compressed data structures so it can give you word frequencies down to 1 occurrence per million without having to access an external database. It aims to avoid being limited to a particular domain or style of text, getting its data from a variety of sources: Google Books, Wikipedia, OpenSubtitles, Twitter, and the Leeds Internet Corpus.

The 10 most common words that wordfreq knows in 15 languages. Yes, it can handle multi-character words in Chinese and Japanese; those just aren't in the top 10. A puzzle for Unicode geeks: guess where the start of the Arabic list is.

Partial solutions: stopwords and inverse document frequency

Those who are familiar with the basics of information retrieval probably have a couple of simple suggestions in mind for dealing with word frequencies.

One is to come up with a list of stopwords, words such as "the" and "of" that are too common to use for anything. Discarding stopwords can be a useful optimization, but that's far too blunt an operation to solve the word frequency problem in general. There's no place to draw the bright line between stopwords and non-stopwords, and in the "John Kasich" example, it's not the case that "John" should be a stopword.

Another partial solution would be to collect all the documents you're interested in, and re-scale all the words according to their inverse document frequency or IDF. This is a quantity that decreases as the proportion of documents a word appears in increases, reaching 0 for a word that appears in every document.
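As a rough sketch, one common formulation of IDF is log(N / df), where N is the number of documents and df is the number of documents containing the word. The example documents below are made up, and splitting on whitespace is a simplifying assumption.

```python
import math


def inverse_document_frequency(documents):
    n_docs = len(documents)
    doc_counts = {}
    for doc in documents:
        # Count each word once per document, however often it repeats
        for word in set(doc.lower().split()):
            doc_counts[word] = doc_counts.get(word, 0) + 1
    return {word: math.log(n_docs / count) for word, count in doc_counts.items()}


docs = [
    "the product is defective",
    "i use the product daily",
    "the box arrived late",
]
idf = inverse_document_frequency(docs)
```

Here "the" appears in every document, so its IDF is 0, while "defective" appears in only one and gets the highest score.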

One problem with IDF is that it can't distinguish a word that appears in a lot of documents because it's unimportant, from a word that appears in a lot of documents because it's very important to your domain. Another, more practical problem with IDF is that you can't calculate it until you've seen all your documents, and it fluctuates a lot as you add documents. This is particularly an issue if your documents arrive in an endless stream.

We need good domain-general word frequencies, not just domain-specific word frequencies, because without the general ones, we can't determine which domain-specific word frequencies are interesting.

Avoiding biases

The counts of one resource alone tend to tell you more about that resource than about the language. If you ask Wikipedia alone, you'll find that "census", "1945", and "stub" are very common words. If you ask Google Books, you'll find that "propranolol" is supposed to be 10 times more common than "lol" overall (and also that there's something funny going on, so to speak, in the early 1800s).

If you collect data from Twitter, you'll of course find out how common "lol" is. You also might find that the ram emoji "🐏" is supposed to be extremely common, because that guy from One Direction once tweeted "We are derby super 🐏🐏🐏🐏🐏🐏🐏🐏🐏🐏🐏🐏🐏🐏🐏🐏🐏🐏🐏🐏", and apparently every fan of One Direction who knows what Derby Super Rams are retweeted it.

Yes, wordfreq considers emoji to be words. Its Twitter frequencies would hardly be complete without them.

We can't entirely avoid the biases that come from where we get our data. But if we collect data from enough different sources (not just larger sources), we can at least smooth out the biases by averaging them between the different sources.

What's a word?

You have to agree with your wordlist on the matter of what constitutes a "word", or else you'll get weird results that aren't supported by the actual data.

Do you split words at all spaces and punctuation? Which of the thousands of symbols in Unicode are punctuation? Is an apostrophe punctuation? Is it punctuation when it puts a word in single quotes? Is it punctuation in "can't", or in "l'esprit"? How many words is "U.S." or "google.com"? How many words is "お早うございます" ("good morning"), taking into account that Japanese is written without spaces? The symbol "-" probably doesn't count as a word, but does "+"? How about "☮" or "♥"?

The process of splitting text into words is called "tokenization", and everyone's got their own different way to do it, which is a bit of a problem for a word frequency list.

We tried a few ways to make a sufficiently simple tokenization function that we could use everywhere, across many languages. We ended up with our own ad-hoc rule including large sets of Unicode characters and a special case for apostrophes, and this is in fact what we used when we originally released wordfreq 1.0, which came packaged with regular expressions that look like attempts to depict the Flying Spaghetti Monster in text.

But shortly after that, I realized that the Unicode Consortium had already done something similar, and they'd probably thought about it for more than a few days.

Word splitting in Unicode. Not pictured: how to decide which of these segments count as "words".

This standard for tokenization looked like almost exactly what we wanted, and the last thing holding me back was that implementing it efficiently in Python looked like it was going to be a huge pain. Then I found that the regex package (not the re package built into Python) contains an efficient implementation of this standard. Defining how to split text into words became a very simple regular expression... except in Chinese and Japanese, because a regular expression has no chance in a language where the separation between words is not written in any way.

So this is how wordfreq 1.1 identifies the words to count and the words to look up. Of course, there is going to be data that has been tokenized in a different way. When wordfreq gets something that looks like it should be multiple words, it will look them up separately and estimate their combined frequency, instead of just returning 0.
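To illustrate the idea of looking up the pieces separately, here's a minimal sketch. The combining rule (adding reciprocals, so the phrase is always rarer than its rarest part) is one plausible choice and an assumption here, not something this post specifies. The frequency table is hypothetical, and the standard-library tokenizer is only a stand-in for the Unicode word segmentation described above.

```python
import re

# Hypothetical frequencies, for illustration only (not real wordfreq data)
TOY_FREQS = {"new": 1e-3, "york": 1e-4}


def simple_tokenize(text):
    # Stand-in tokenizer: runs of word characters, lowercased
    return re.findall(r"\w+", text.lower())


def combined_frequency(text, freqs):
    tokens = simple_tokenize(text)
    if not tokens:
        return 0.0
    inverse_total = 0.0
    for token in tokens:
        freq = freqs.get(token, 0.0)
        if freq == 0.0:
            # If any piece is unknown, we can't estimate the whole
            return 0.0
        inverse_total += 1.0 / freq
    return 1.0 / inverse_total


phrase_freq = combined_frequency("New York", TOY_FREQS)
```

With a single token, this reduces to a plain lookup, so the same function covers both cases.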

Language support

wordfreq supports 15 commonly-used languages, but of course some languages are better supported than others. English is quite polished, for example, while Chinese so far is just there to be better than nothing.

The reliability of each language corresponds pretty well with the number of different data sources we put together to make the wordlist. Some sources are hard to get in certain languages. Perhaps unsurprisingly, for example, not much of Twitter is in Chinese. Perhaps more surprisingly, not much of it is in German either.

The word lists that we've built for wordfreq represent the languages where we have at least two sources. I would consider the ones with two sources a bit dubious, while all the languages that have three or more sources seem to have a reasonable ranking of words.

5 sources: English
4 sources: Arabic, French, German, Italian, Portuguese, Russian, Spanish
3 sources: Dutch, Indonesian, Japanese, Malay
2 sources: Chinese, Greek, Korean

Compact wordlists

When we were still figuring this all out, we made several 0.x versions of wordfreq that required an external SQLite database with all the word frequencies, because there are millions of possible words and we had to store a different floating-point frequency for each one. That's a lot of data, and it would have been infeasible to include it all inside the Python package. (GitHub and PyPI don't like huge files.) We ended up with a situation where installing wordfreq would either need to download a huge database file, or build that file from its source data, both of which would consume a lot of time and computing resources when you're just trying to install a simple package.

As we tried different ways of shipping this data around to all the places that needed it, we finally tried another tactic: What if we just distributed less data?

Two assumptions let us greatly shrink our word lists:

We don't care about the frequencies of words that occur less than once per million words. We can just assume all those words are equally informative.
We don't care about, say, 2% differences in word frequency.

Now instead of storing a separate frequency for each word, we group the words into 600 possible tiers of frequency. You could call these tiers "centibels", a logarithmic unit similar to decibels, because there are 100 of them for each factor of 10 in the word frequency. Each of them represents a band of word frequencies that spans about a 2.3% difference. The data we store can then be simplified to "Here are all the words in tier #330... now here are all the words in tier #331..." and converted to frequencies when you ask for them.
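Here's a minimal sketch of the tier idea. The tier numbering (tier 0 for a frequency of 1, tier 600 for one per million), the function names, and the example frequencies are all assumptions made for illustration, not wordfreq's actual storage format or data.

```python
import math
from collections import defaultdict


def frequency_to_tier(freq):
    # 100 tiers per factor of 10 in frequency
    return round(-100 * math.log10(freq))


def tier_to_frequency(tier):
    return 10 ** (-tier / 100)


def pack(freqs, max_tier=600):
    # Group words by tier, dropping anything rarer than one per million
    tiers = defaultdict(list)
    for word, freq in freqs.items():
        tier = frequency_to_tier(freq)
        if tier <= max_tier:
            tiers[tier].append(word)
    return dict(tiers)


def unpack(tiers):
    return {
        word: tier_to_frequency(tier)
        for tier, words in tiers.items()
        for word in words
    }


# Made-up frequencies for the example
packed = pack({"the": 0.05, "amulet": 1e-6, "polychora": 3e-9})
restored = unpack(packed)
```

In this example "the" lands in tier 130 and comes back as about 0.0501, "amulet" sits right at the cutoff in tier 600, and "polychora" is dropped entirely. Adjacent tiers differ by a factor of 10 ** 0.01, which is where the 2.3% figure comes from.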

Some tiers of word frequencies in English.

This let us cut down the word lists to an entirely reasonable size, so that we can put them in the repository, and just keep them in memory while you're using them. The English word list, for example, is 245 KB, or 135 KB compressed.

But it's important to note the trade-off here, that wordfreq only represents sufficiently common words. It's not suited for comparing rare words to each other. A word rarer than "amulet", "bunches", "deactivate", "groupie", "pinball", or "slipper", all of which have a frequency of about 1 per million, will not be represented in wordfreq.

Getting the package

wordfreq is available on GitHub, or it can be installed from the Python Package Index with the command pip install wordfreq. Documentation can be found in its README on GitHub.

Comparing the frequency per million words of two spellings of "café", in English and French.