18 responses

  1. Tom Robinson
    March 18, 2008

    Of course, this is exactly what command line filter programs are good at…

    cat test.txt | tr -s '[:space:]' '\n' | tr '[:upper:]' '[:lower:]' | sort | uniq -c | sort -n | tail -10
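
    For readers who want to see what each stage of that pipeline contributes, here is a rough Python 3 sketch of the same steps. The function name top_words and the sample sentence are made up for illustration, and the ordering of words with equal counts may not match sort -n in every locale.

```python
from collections import Counter


def top_words(text, n=10):
    """Approximate the tr | tr | sort | uniq -c | sort -n | tail pipeline."""
    # tr -s '[:space:]' '\n' and tr '[:upper:]' '[:lower:]':
    # fold case and split into whitespace-separated tokens.
    tokens = text.lower().split()
    # sort | uniq -c: count how often each token occurs.
    counts = Counter(tokens)
    # sort -n | tail: ascending by count (then by word), keep the last n.
    ordered = sorted(counts.items(), key=lambda item: (item[1], item[0]))
    return ordered[-n:]


print(top_words("the cat The dog the end", n=2))
```

    As in the shell version, the most frequent word comes last. Punctuation is left attached to the words, exactly as the pipeline leaves it.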

  2. Antonio Cangiano
    March 18, 2008

    Nice one Tom. :)

  3. Paddy3118
    March 19, 2008

    Hi Antonio,
    Just before the last code section, you introduce it as “getting rid of reverse=True”, but you fail to mention that you also change from using cmp to using key. cmp is called for every comparison, whereas key is called once for each item in the list, which is usually faster.
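
    One way to see the difference between the two is to count the calls. This is a minimal sketch for Python 3, where sorted() no longer accepts cmp, so functools.cmp_to_key stands in for the old cmp argument. The helper name count_calls and the sample list are assumptions made for illustration.

```python
from functools import cmp_to_key


def count_calls(items):
    """Sort items once with key and once with a cmp-style function.

    Returns a tuple (key_calls, cmp_calls).
    """
    calls = {"key": 0, "cmp": 0}

    def key_func(item):
        calls["key"] += 1
        return item

    def cmp_func(a, b):
        calls["cmp"] += 1
        # Classic cmp contract: negative, zero, or positive.
        return (a > b) - (a < b)

    sorted(items, key=key_func)
    # cmp_to_key wraps each item so that every comparison calls cmp_func.
    sorted(items, key=cmp_to_key(cmp_func))
    return calls["key"], calls["cmp"]


key_calls, cmp_calls = count_calls([5, 3, 8, 1, 9, 2, 7, 4, 6, 0])
print("key calls:", key_calls)
print("cmp calls:", cmp_calls)
```

    For this ten-item list the key function runs ten times, while the comparison function runs more often than that. The exact number of comparisons depends on the input order and on the interpreter's sort implementation.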

    I also wonder if this code:

    for word in words_gen:
        words[word] = words.get(word, 0) + 1
    

    Might be replaced by (untested):

    from collections import defaultdict

    words = defaultdict(int)
    for word in words_gen:
        words[word] += 1
    

    Which would look up word in words only once?
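
    A quick way to check that the two loops are interchangeable is to run both on the same input. The sketch below is Python 3, and the function names and sample sentence are hypothetical. It only shows that the results match; it does not measure lookups. Note that `counts[token] += 1` still reads the entry and then writes it back, so the saving with defaultdict comes mainly from dropping the explicit get() call.

```python
from collections import defaultdict


def count_with_get(tokens):
    """Count tokens with a plain dict and dict.get()."""
    counts = {}
    for token in tokens:
        counts[token] = counts.get(token, 0) + 1
    return counts


def count_with_defaultdict(tokens):
    """Count tokens with defaultdict(int); missing keys start at 0."""
    counts = defaultdict(int)
    for token in tokens:
        counts[token] += 1
    return counts


sample = "to be or not to be".split()
print(count_with_get(sample))
print(dict(count_with_defaultdict(sample)))
```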

    I enjoyed your post.

    Thanks, Paddy.

  4. Antonio Cangiano
    March 19, 2008

    Hi Paddy,

    thanks for your comment. I’ve slightly changed the wording to point out that in the “good” solution at the end, we are using key rather than cmp. Using defaultdict would work (it requires Python 2.5 or later) and would also be more efficient. Here is the solution that incorporates your suggestion:

    from string import punctuation
    from collections import defaultdict
    
    N = 10
    
    words_gen = (word.strip(punctuation).lower() for line in open("test.txt")
                                                 for word in line.split())
    
    words = defaultdict(int)
    for word in words_gen:
        words[word] += 1
    
    top_words = sorted(words.iteritems(),
                       key=lambda(word, count): (-count, word))[:N] 
    
    for word, frequency in top_words:
        print "%s: %d" % (word, frequency)
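
    The snippet above is Python 2. For anyone following along on Python 3, a rough port might look like the sketch below. The function wrapper, its name top_words, and the sample lines are additions for illustration, not part of the original solution. The logic is otherwise the same: strip punctuation, lowercase, count with defaultdict, and sort by descending count and then alphabetically.

```python
from collections import defaultdict
from string import punctuation


def top_words(lines, n=10):
    """Return the n most frequent words as (word, count) pairs."""
    words = defaultdict(int)
    for line in lines:
        for word in line.split():
            words[word.strip(punctuation).lower()] += 1
    # Python 3 lambdas cannot unpack a tuple argument, so index into the pair.
    return sorted(words.items(), key=lambda pair: (-pair[1], pair[0]))[:n]


sample = ["The cat sat.", "The dog sat, too!", "A cat?"]
for word, frequency in top_words(sample, n=3):
    print("%s: %d" % (word, frequency))
```

    Because the function accepts any iterable of lines, an open file can be passed in directly, for example inside a with open("test.txt") as handle: block.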
    
  5. Mark
    March 22, 2008

    I’m learning ruby, so I thought I’d put together a Ruby version:

    N = 10
    count = Hash.new(0)
    
    File.open(the_file).each_line do |line|
      line.downcase.scan(/\w+/).each do |word|
        count[word] += 1
      end
    end
    
    top_words = count.sort{|a,b| a[1]b[1]}
    
    top_words.each do |top|
      puts "%s: %d" % top
    end
    
  6. William Chang
    March 24, 2008

    This of course depends on what you are counting words for, but I would recommend translating all non-letters to spaces and then splitting on whitespace. For the natural language tasks that I do, this is pretty appropriate. It really depends on what you want to happen when you hit things like “Bob’s”, “hyper-active”, “http://www.google.com”, “bob@gmail.com”, “2342sdf”, etc. I also like putting the rule that I use to split into words into its own function, which I call tokenize() here.

    import re
    from collections import defaultdict

    N = 10

    def tokenize(line):
        # Replace every character that is not a lowercase ASCII letter with a space.
        line = re.sub(r"[^a-z]", " ", line.lower())
        return line.split()

    words = defaultdict(int)
    for line in open("test.txt"):
        for token in tokenize(line):
            words[token] += 1

    top_words = sorted(words.iteritems(),
                       key=lambda(word, count): (-count, word))[:N]

    for word, frequency in top_words:
        print "%s: %d" % (word, frequency)
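
    To make the trade-off concrete, here is what this tokenize() rule does to the tricky inputs listed above. This is a small Python 3 sketch; the samples list is just those examples typed with a plain ASCII apostrophe.

```python
import re


def tokenize(line):
    """Lowercase the line, turn every non-letter into a space, then split."""
    return re.sub(r"[^a-z]", " ", line.lower()).split()


samples = [
    "Bob's",
    "hyper-active",
    "http://www.google.com",
    "bob@gmail.com",
    "2342sdf",
]
for text in samples:
    print(text, "->", tokenize(text))
```

    The possessive is split into "bob" and "s", the URL and the e-mail address fall apart into their letter-only pieces, and the digits disappear entirely. Since the pattern only keeps a to z, accented and other non-ASCII letters are treated as separators too, which may or may not be what you want.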

  7. Mark
    March 25, 2008

    Somehow, my sort line above got mangled. This one’s an improvement:

    top_words = count.sort_by { |w| w[1] }

  8. david
    March 28, 2008

    count = {}; open("somefile").each_line { |line| line.split(/\b/).each { |word| count[word] ||= 0; count[word] += 1 } }

  9. Graham
    August 18, 2008

    Incredibly useful information and a practical way to see how the functional aspects of Python work.
    I have used and liked Haskell, and now I think I am going to like Python because I have seen more of what it can do.

  10. knarf
    March 1, 2009

    Grmbl..

    Perl implementation : http://pastebin.com/f41b63d37
    Alternate version : http://pastebin.com/f25e1deb1

  11. SHAUN MBHIZA
    March 16, 2009

    HELP! HELP! I NEED A PYTHON CODE/COMMAND TO REVERSE A WORD HELP GUYS PLEASE AND IF POSSIBLE A.S.A.P

  12. Antonio Cangiano
    March 16, 2009

    Shaun, why are you shouting? :)

    Anyway, reversing a string in Python is very easy using [::-1]

    word = "ciao"
    reversed_word = word[::-1]  # avoid shadowing the built-in reversed()
    print reversed_word
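
    For completeness, here is the same idea as a small Python 3 sketch, next to an equivalent spelling that uses the built-in reversed(). The two function names are made up for illustration.

```python
def reverse_by_slice(text):
    """Reverse a string with an extended slice (step of -1)."""
    return text[::-1]


def reverse_by_builtin(text):
    """Reverse a string by joining the iterator returned by reversed()."""
    return "".join(reversed(text))


print(reverse_by_slice("ciao"))
print(reverse_by_builtin("ciao"))
```

    Both calls print oaic. The slice is the shorter of the two and is the more common idiom.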
    
  13. Shaun Mbhiza
    March 18, 2009

    Thanks a lot, Antonio! You have made my day. Thank you, thank you, THANK YOU!!!!

  14. Shaun Mbhiza
    March 18, 2009

    I’m doing a simple program that adds repeated words in a tuple to a list. The problem is that I don’t know how to make the check case-insensitive. Can you help me with this code, and also with the bit regarding the initial check?

  15. Rung András
    June 17, 2009

    Thanks a lot, it works fine, much better than other codes which cannot handle my utf-8 linguistic stuff!
