
Using Python to detect the most frequent words in a file

Working with Python is nice. Just like Ruby, it usually doesn’t get in the way of my thought process and it comes “with batteries included”. Let’s consider the small task of printing a list of the N most frequent words within a given file:

from string import punctuation
from operator import itemgetter

N = 10
words = {}

words_gen = (word.strip(punctuation).lower() for line in open("test.txt") 
                                             for word in line.split())
                                          
for word in words_gen:
    words[word] = words.get(word, 0) + 1

top_words = sorted(words.iteritems(), key=itemgetter(1), reverse=True)[:N]

for word, frequency in top_words:
    print "%s: %d" % (word, frequency)

I won’t provide a step-by-step explanation of what I believe is already rather understandable. There are, however, a few tricky points worth considering for those who are not too familiar with the language. First and foremost, I love using generator expressions because they are lazily evaluated and have a math-like readability. They are just a very convenient way of creating generator objects. Notice how in the snippet I favor them over the option of placing the whole file into a single string by chaining the read() method onto the open() call. Doing so results in a significant performance improvement for large files, since the file is never loaded into memory all at once. Generator expressions and list comprehensions are extremely useful language features inherited from the world of functional programming, and I’m glad that Python fully embraces them.
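
For comparison, here is roughly what the read()-based alternative mentioned above would look like (a quick sketch using the same test.txt placeholder, not something I’d reach for with large files): it loads the entire file into memory as a single string before splitting it, which is exactly what the generator expression avoids.

from string import punctuation

# Reads the whole file into one string up front, so memory usage grows
# with the size of the file.
words_list = [word.strip(punctuation).lower()
              for word in open("test.txt").read().split()]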

In the for loop that follows the generator expression we count words, adding them and their respective frequencies to the words dictionary (similar to a Ruby Hash). Notice how the get() method lets us specify a default value of 0 to fall back on before incrementing the counter, in case the given key doesn’t exist yet (which means that the word we are adding hasn’t been encountered before). We pass operator.itemgetter(1) as the key keyword argument (keyword arguments being another nice Python feature) to the sorted() function. itemgetter() returns a callable object that fetches the given item(s) from its operand which, in our case, essentially means that we can tell sorted() to sort based on the values of the dictionary’s items (the frequencies of the words) rather than on its keys (the words themselves).
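
To make those two calls more concrete, here is a tiny illustration with made-up values:

from operator import itemgetter

counts = {"zen": 3}
# get() falls back to the supplied default when the key is missing.
print counts.get("zen", 0)     # 3
print counts.get("python", 0)  # 0

# itemgetter(1) returns a callable that extracts index 1 from its argument.
get_count = itemgetter(1)
print get_count(("zen", 3))    # 3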

Unfortunately there is a problem with this code. It will correctly sort the most frequent words in the file, but words that appear the same number of times won’t be alphabetically ordered. Given that we specified a reverse order for the sorted() function, we could simply pass it key=itemgetter(1, 0) to sort (in descending order) by value first and by key second. But let’s be realistic: in most cases you want words whose frequencies are equal to be alphabetically ordered (in ascending order). With a few changes to the code, this can easily be achieved:

from string import punctuation

def sort_items(x, y):
    """Sort by value first, and by key (reverted) second."""
    return cmp(x[1], y[1]) or cmp(y[0], x[0])

N = 10
words = {}

words_gen = (word.strip(punctuation).lower() for line in open("test.txt") 
                                             for word in line.split())
                                          
for word in words_gen:
    words[word] = words.get(word, 0) + 1

top_words = sorted(words.iteritems(), cmp=sort_items, reverse=True)[:N]

for word, frequency in top_words:
    print "%s: %d" % (word, frequency)

Previously we specified which key should be used for sorting, while in this case we have a much greater degree of control. By defining the function sort_items() and passing it as the cmp argument of sorted(), we get to define how the comparison between the items of the dictionary is carried out. The function we defined at the beginning of the script returns -1, 0 or 1, depending on how the two key-value pairs compare. The returned value is cmp(x[1], y[1]) or cmp(y[0], x[0]). This may seem complicated, but the trick is rather simple. The first part compares the frequencies of the two words and returns 1 or -1 if one is greater than the other. If they are equal, the expression to the left of the or evaluates to 0, so the expression on the right of the or is returned instead. On the right we compare the keys (the words), but we invert the order of the arguments y and x to cancel out the effect of the reversed ordering requested with reverse=True in sorted().
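
If the or trick isn’t immediately clear, here is a minimal example with invented pairs (this relies on Python 2’s built-in cmp(), which later versions of the language removed):

a, b = ("apple", 5), ("fig", 2)
# The frequencies differ, so the left cmp() decides and the or short-circuits.
print cmp(a[1], b[1]) or cmp(b[0], a[0])   # 1

x, y = ("fig", 2), ("pear", 2)
# Equal frequencies: the left cmp() is 0, so cmp("pear", "fig") breaks the tie.
print cmp(x[1], y[1]) or cmp(y[0], x[0])   # 1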

Finally, for those who prefer to use a lambda expression rather than define a separate function, we can write the following:

from string import punctuation

N = 10
words = {}

words_gen = (word.strip(punctuation).lower() for line in open("test.txt") 
                                             for word in line.split())
                                          
for word in words_gen:
    words[word] = words.get(word, 0) + 1

top_words = sorted(words.iteritems(), 
                   cmp=lambda x, y: cmp(x[1], y[1]) or cmp(y[0], x[0]), 
                   reverse=True)[:N]

for word, frequency in top_words:
    print "%s: %d" % (word, frequency)

Or simplified further by getting rid of reverse=True and using key rather than cmp:

from string import punctuation

N = 10
words = {}

words_gen = (word.strip(punctuation).lower() for line in open("test.txt")
                                             for word in line.split())

for word in words_gen:
    words[word] = words.get(word, 0) + 1

top_words = sorted(words.iteritems(), 
                   key=lambda(word, count): (-count, word))[:N] 

for word, frequency in top_words:
    print "%s: %d" % (word, frequency)
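
To see why the (-count, word) key gives the desired ordering without reverse=True, consider this small, made-up example:

pairs = [("pear", 2), ("apple", 5), ("fig", 2)]
# Negating the count makes higher frequencies sort first, while equal
# frequencies fall back to ascending alphabetical order on the word.
print sorted(pairs, key=lambda (word, count): (-count, word))
# [('apple', 5), ('fig', 2), ('pear', 2)]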

Please bear in mind that the code makes a few assumptions so as to keep things simple. As it stands, the script would consider “l’amore” as a single word, and an accidental lack of spaces wouldn’t be accounted for (e.g. “word.Another” would be a single word too). The replace() method can be used to address these sorts of special cases.
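
As a rough sketch of that kind of cleanup (the sample line is invented), the text could be normalized before being split:

line = "l'amore e l'odio.Another line"
# Turning apostrophes and full stops into spaces splits "l'amore" and
# "odio.Another" apart; strip(punctuation) in the main loop handles the rest.
cleaned = line.replace("'", " ").replace(".", " ")
print cleaned.split()
# ['l', 'amore', 'e', 'l', 'odio', 'Another', 'line']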

Sure, this was a rather trivial example, born from an IPython session, but I think it shows off Python’s expressiveness and flexibility when dealing with problems that, approached in some other languages, would be much more error prone and verbose. Batteries included indeed.



15 Responses to “Using Python to detect the most frequent words in a file”

  1. Tom Robinson says:

    Of course, this is exactly what command line filter programs are good at…

    cat test.txt | tr -s '[:space:]' '\n' | tr '[:upper:]' '[:lower:]' | sort | uniq -c | sort -n | tail -10

  2. […] Cangiano wrote a post about “Using Python to detect the most frequent words in a file“. It’s a nice summary of how to do it in Python, but (nearly) the same thing can be […]

  3. Paddy3118 says:

    Hi Antonio,
    Just before the last code section, you introduce it as “getting rid of reverse=True”, but you fail to mention that you also switch from using cmp to using key. cmp is called for every comparison, whereas key is called once for each item in the list, which is usually faster.

    I also wonder if this code:

    for word in words_gen:
        words[word] = words.get(word, 0) + 1
    

    Might be replaced by (untested):

    words = defaultdict(int)
    for word in words_gen:
        words[word] += 1
    

    Which would look up word in words only once?

    I enjoyed your post.

    Thanks, Paddy.

  4. Hi Paddy,

    thanks for your comment. I’ve slightly changed the wording to point out that in the “good” solution at the end, we are using key rather than cmp. Using defaultdict would work (Python 2.5 and later only) and also be more efficient. Here is the solution that incorporates your suggestion:

    from string import punctuation
    from collections import defaultdict
    
    N = 10
    
    words_gen = (word.strip(punctuation).lower() for line in open("test.txt")
                                                 for word in line.split())
    
    words = defaultdict(int)
    for word in words_gen:
        words[word] += 1
    
    top_words = sorted(words.iteritems(),
                       key=lambda(word, count): (-count, word))[:N] 
    
    for word, frequency in top_words:
        print "%s: %d" % (word, frequency)
    
  5. Mark says:

    I’m learning ruby, so I thought I’d put together a Ruby version:

    N = 10
    count = Hash.new(0)
    
    File.open(the_file).each_line do |line|
      line.downcase.scan(/\w+/).each do |word|
        count[word] += 1
      end
    end
    
    top_words = count.sort { |a, b| a[1] <=> b[1] }
    
    top_words.each do |top|
      puts "%s: %d" % top
    end
    
  6. William Chang says:

    This of course depends on what you are counting words for, but I would recommend translating all non-letters to spaces and then splitting on whitespace. For the natural language tasks that I do, this is pretty appropriate. It really depends on what you want to happen when you hit stuff like “Bob’s”, “hyper-active”, “http://www.google.com”, “bob@gmail.com”, “2342sdf”, etc… I also like putting the rule that I use to split text into words into its own function, which I call here tokenize().

    import re
    from collections import defaultdict

    N = 10

    def tokenize(line):
        line = re.sub(r"[^a-z]", " ", line.lower())
        return line.split()

    words = defaultdict(int)
    for line in open("test.txt"):
        for token in tokenize(line):
            words[token] += 1

    top_words = sorted(words.iteritems(),
                       key=lambda (word, count): (-count, word))[:N]

    for word, frequency in top_words:
        print "%s: %d" % (word, frequency)

  7. Mark says:

    Somehow, my sort line above got mangled. This one’s an improvement:

    top_words = count.sort_by { |w| w[1] }

  8. david says:

    count = {}; open("somefile").each_line { |line| line.split(/\b/).each { |word| count[word] ||= 0; count[word] += 1 } }

  9. Graham says:

    Incredibly useful information and a practical way to see how the functional aspects of Python work.
    I have used and liked Haskell, and now I think I am going to like Python because I have seen more of what it can do.

  10. […] I started to look more online, into different tutorials than the standard Google page rank 1 one, I found this one, this basically opened my mind to what Python had to offer in the functional realm. So here is the […]

  11. knarf says:

    Grmbl..

    Perl implementation : http://pastebin.com/f41b63d37
    Alternate version : http://pastebin.com/f25e1deb1

  12. SHAUN MBHIZA says:

    HELP! HELP! I NEED A PYTHON CODE/COMMAND TO REVERSE A WORD HELP GUYS PLEASE AND IF POSSIBLE A.S.A.P

  13. Shaun, why are you shouting? :)

    Anyway, reversing a string in Python is very easy using [::-1]

    word = "ciao"
    reversed = word[::-1]
    print reversed
    
  14. Shaun Mbhiza says:

    Thanks a Lot Antoniano U have made my day thank you thank you THANK YOU!!!!

  15. Shaun Mbhiza says:

    I’m doing a simple program that adds repeated words in a tuple to a list; the problem is that I don’t know how to make the check case-INsensitive. Can you help me with this code and also the bit w.r.t. the initial check.

  16. Rung András says:

    Thanks a lot, it works fine, much better than other codes which cannot handle my utf-8 linguistic stuff!

  17. […] found this while googling… Using Python to detect the most frequent words in a file | Zen and the Art of Programming Its a python script, and many variants, for collecting commonly used words from a text file. It […]

