Counting Word Instances in a String

Yesterday in the IRC channel someone asked if there was a way to count the number of times each unique word appears in a string. While it was obvious that this could be done manually (see below), no one knew of a more elegant solution. Can anyone think of one? Here is the solution I used; it definitely falls into the "manual" (and probably slow) category.

First I made my string:

<cfsavecontent variable="string">
This is a paragraph with some text in it. Certain words will be repeated, and other words
will not be repeated. The question is though, how much can I write before I begin to sound
like a complete and utter idiot. Let's call that the "Paris Point". At the Paris Point, any
further words sound like gibberish and are completely worthless.
</cfsavecontent>

I then used some regex to get an array of words:

<cfset words = reMatch("[[:word:]]+", string)>
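To see what that call hands back, here is a minimal sketch on a shorter, hypothetical sample (the names sample and sampleWords are mine, not part of the code above). reMatch returns an array with one element per match, and since the apostrophe is not a word character, "Let's" comes back as two entries:

```cfml
<!--- Hypothetical short sample, not the paragraph above --->
<cfset sample = "Let's call that the Paris Point.">

<!--- Each run of word characters becomes one array element --->
<cfset sampleWords = reMatch("[[:word:]]+", sample)>

<!--- Outputs: Let|s|call|that|the|Paris|Point --->
<cfoutput>#arrayToList(sampleWords, "|")#</cfoutput>
```

Punctuation and quotes are dropped for free, which is why I went with a regex rather than splitting on spaces.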

Next I created a structure:

<cfset wordCount = structNew()>

And then looped over the array and inserted the words into the structure:

<cfloop index="word" array="#words#">
    <cfif structKeyExists(wordCount, word)>
        <cfset wordCount[word]++>
    <cfelse>
        <cfset wordCount[word] = 1>
    </cfif>
</cfloop>
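The loop relies on how ColdFusion structures treat their keys: lookups ignore case, so "The" and "the" end up sharing one counter. A small sketch of that behavior in isolation, using a throwaway structure named demo (hypothetical, not part of the code above):

```cfml
<!--- Hypothetical demo structure, separate from wordCount --->
<cfset demo = structNew()>
<cfset demo["The"] = 1>

<!--- "the" resolves to the existing "The" key, so this increments it --->
<cfif structKeyExists(demo, "the")>
    <cfset demo["the"] = demo["the"] + 1>
</cfif>

<!--- Outputs: 1 key, count 2 --->
<cfoutput>#structCount(demo)# key, count #demo["THE"]#</cfoutput>
```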

Note that this will be inherently case-insensitive, which I think is a good thing. At this point we are done, but I added some display code as well:

<cfset sorted = structSort(wordCount, "numeric", "desc")>

<table border="1" width="400">
    <tr>
        <th width="50%">Word</th>
        <th>Count</th>
    </tr>
    <cfloop index="word" array="#sorted#">
        <cfoutput>
        <tr>
            <td>#word#</td>
            <td>#wordCount[word]#</td>
        </tr>
        </cfoutput>
    </cfloop>
</table>
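If you want to reuse this, the steps above could be wrapped in a function. This is only a sketch: the function name countWords and its text and pattern arguments are my own invention. The optional pattern argument is there because, as the comments below point out, [[:word:]] splits words like "Let's", so you may want to swap in a different regex.

```cfml
<cffunction name="countWords" returntype="struct" output="false">
    <cfargument name="text" type="string" required="true">
    <!--- Defaults to the same regex used above --->
    <cfargument name="pattern" type="string" required="false" default="[[:word:]]+">

    <cfset var counts = structNew()>
    <cfset var word = "">

    <!--- Same manual tally as before, one pass over the matches --->
    <cfloop index="word" array="#reMatch(arguments.pattern, arguments.text)#">
        <cfif structKeyExists(counts, word)>
            <cfset counts[word] = counts[word] + 1>
        <cfelse>
            <cfset counts[word] = 1>
        </cfif>
    </cfloop>

    <cfreturn counts>
</cffunction>

<!--- Usage: same result as the inline version --->
<cfset wordCount = countWords(string)>
```

It is still the manual approach, just packaged up, so it does not answer the "more elegant" question.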

Archived Comments

Comment 1 by nick posted on 8/2/2007 at 9:31 PM

If "Paris Point" becomes part of the daily lexicon, you can officially coin it. Nice code work too.

Comment 2 by Ben Nadel posted on 8/2/2007 at 9:32 PM

REMatch() makes me happy :)

Comment 3 by Quan Tran posted on 8/2/2007 at 9:32 PM

Probably not faster, but you could create a query with a single column and use qoq to get the count with a group by.

Comment 4 by Gareth posted on 8/2/2007 at 9:54 PM

Couldn't you do something like
#ListLen(string, " #Chr(13)##Chr(10)#")#

(it seems to work with the string variable you posted)

Comment 5 by Raymond Camden posted on 8/2/2007 at 9:58 PM

Gareth, that counts the total number of words. We need a count for each individual word, i.e., how many times "the" appears in the string, and so on.

Comment 6 by Gareth posted on 8/2/2007 at 10:02 PM

Whoops, unique instances...

Let me try that again :)

<cfset new_string = ListSort(REReplaceNoCase(LCase(string), "[^a-z ]", "", "ALL"), "Text", "Asc", " #Chr(13)##Chr(10)#")>

<cfscript>
// had to use this as CF does not allow lookbehind in regular expressions, but Java does
obj = createobject("java","java.util.regex.Pattern"); // create pattern searching object
x = obj.compile("(?<=[ ]|^)([^ ]*)([ ]\1)+(?=[ ]|$)"); // compile the regular expression for use
new_string = x.matcher(new_string).replaceAll("$1"); // remove all duplicates
</cfscript>

#ListLen(new_string, " ")#

Comment 7 by Gareth posted on 8/2/2007 at 10:05 PM

OK, I'm going to stop now :)
I got a total of the unique words, but not a count of the number of duplicate words (that's what I get for trying to write code for one thing while checking out the blogs in another tab :) )

Comment 8 by todd sharp posted on 8/2/2007 at 10:49 PM

Issue: the word "Let's" gets broken into "Let" and "s" because of your RE. Solution? Still working on it... ;)

Comment 9 by Ben Nadel posted on 8/2/2007 at 10:56 PM

@Todd,

I ran into that same problem during one of Ray's Friday Puzzlers... trust me - don't try to figure it out, your brain will only end up hurting. Here's why, these are all single "words":

hatin'
let's
sweet-ass
cf.objective()
O'connell

... if you can write an algorithm to use all those "non-word" characters as parts of words, well then, you are the man!

Comment 10 by Raymond Camden posted on 8/2/2007 at 11:09 PM

Although, one could argue that sweet-ass wouldn't be so bad as two words. hatin' is slang, and would become hatin, which is ok.

I think if you could just get single quotes to work, you would get most "real" words.

I wonder - maybe switch from [[:word:]] to

(any non alpha except single quote)(alpha,1 or more)(optional ' if followed by alpha)(any non alpha except single quote)

Then again - another solution? Remove '. You end up with words like "lets", which could be confused with "Ray lets Paris call him", but it would be better than let and s as words.

Comment 11 by noname posted on 8/2/2007 at 11:26 PM

But, isn't the next word signified by a space? So everything between a space, is a word?

Comment 12 by Ben Nadel posted on 8/2/2007 at 11:27 PM

Yeah, stripping out the single quotes is probably the easiest thing to do. Least amount of damage for the best results.

Comment 13 by Ron Alexander posted on 8/3/2007 at 12:23 AM

So why not just change from "[[:word:]]" to "[a-zA-Z0-9]+'[a-zA-Z0-9]+|[a-zA-Z0-9]+"

That seems to do the trick for let's lets. But it still doesn't make the counting any more elegant.

Comment 14 by Raymond Camden posted on 8/3/2007 at 12:27 AM

That's pretty cool there, Ron.

Comment 15 by Jonathon posted on 8/3/2007 at 12:47 AM

There's a CF IRC channel floating around somewhere? Anyone feel like sharing the info? :)

Comment 16 by Raymond Camden posted on 8/3/2007 at 12:50 AM

The one I use is #coldfusion on Dalnet.

Comment 17 by Ron Alexander posted on 8/3/2007 at 12:59 AM

This is better:

(?:[a-zA-Z0-9\(\)])+(?:\'|-|\.)?(?:[a-zA-Z0-9\(\)])+

It matches method chains like myarray.dedup().sort()

Comment 18 by Ron Alexander posted on 8/3/2007 at 1:02 AM

And just to match Ben's "hatin'" example:
"(?:[a-zA-Z0-9\(\)])+(?:\'|-|\.)?(?:[a-zA-Z0-9\(\)\-\'])+"

Don't forget (like I did) to throw in the \'\- into the last non-capturing group.

Ben does that meet your needs?

Comment 19 by Dustin posted on 8/3/2007 at 1:07 AM

This is probably how I would do it:

<cfset string = reReplace(string,'(\.|"(?=\w))','','all') />
<cfset wordAry = listToArray(string,'#chr(10)##chr(13)##chr(32)#') />
<cfset wordQry = queryNew('word','VarChar') />
<cfloop from="1" to="#arrayLen(wordAry)#" index="i">
<cfset queryAddRow(wordQry) />
<cfset querySetCell(wordQry,'word',reReplace(wordAry[i],'[",]$','')) />
</cfloop>
<cfquery dbtype="query" name="uniqueWords">
SELECT word, count(*) as wordCount FROM wordQry group by word order by wordCount desc
</cfquery>
<cfdump var="#uniqueWords#">

Comment 20 by db posted on 8/3/2007 at 5:59 PM

this seems to work for me:
<cfset words = arrayToList(string.split('\s'))>
<cfset wordCount = structNew()>
<cfloop index="word" list="#words#">
<cfset wordCount[word] = ListValueCountNoCase(words, word)>
</cfloop>

Comment 21 by db posted on 8/3/2007 at 7:28 PM

no, that's not right - it's including punctuation as part of the word. so i tried with the "hatin'" list and got this working:
<cfset words = arrayToList(string.split("\.?[[^()]\s&&([""()][\s])]"))>

Comment 22 by Johnny posted on 8/4/2007 at 11:31 PM

Wouldn't this work?

<cfset wordcount = structNew()/>
<cfloop list="#string#" delimiters=' ,"' index="word">
<cfset word = replaceList(word,"',.","")/>
<cfif structKeyExists(wordCount, word)>
<cfset wordCount[word] = wordCount[word] + 1/>
<cfelse>
<cfset wordCount[word] = 1/>
</cfif>
</cfloop>

Comment 23 by Mike Cohen posted on 8/25/2008 at 8:06 PM

Why not just do:

<cfset myString = "blah blah blah bldfadsff fd ">
<cfset mCounter = stringToArray(myString,"a")>
<cfset numberOfAs = arraylen(mcounter)>

?

Comment 24 by todd sharp posted on 8/25/2008 at 8:25 PM

Unless I'm missing something in an earlier comment, there is no strintToArray() function in ColdFusion...

Comment 25 by todd sharp posted on 8/25/2008 at 8:26 PM

Make that stringToArray()...

Comment 26 by D. Davis posted on 2/24/2009 at 3:54 AM

Nice article, and thanks guys for the different ways of doing this.

Wanted to note: the sort on this ("textnocase") needs to be "numeric","desc" otherwise you're not getting your top numbers right (ie, textnocase sort would look like 4,3,20,17).

Great code on this as a first step to making a word cloud, looping it on DB-pulled text fields.

Comment 27 by Raymond Camden posted on 2/26/2009 at 3:10 AM

Oops. Thanks D. Fixed in the code above.

Comment 28 by Bryan posted on 4/21/2010 at 8:49 PM

Doesn't seem to be returning carriage returns for me. Any fixes?

Comment 29 by Raymond Camden posted on 4/21/2010 at 10:25 PM

What do you mean? Why would it return carriage returns? It returns the number of words.

Comment 30 by Mark Brodsky posted on 6/23/2011 at 9:03 PM

How about ListValueCount(list, value [, delimiters ])

I was trying to find out how many HRs I had in a text string in a DB column (which would show how many entries I recorded for the view history of a certain page), and this seemed to do the trick, like this:

<cfset histcount = ListValueCount(list.history, "hr", "<,>")>

Comment 31 by Raymond Camden posted on 6/25/2011 at 1:53 AM

What value would you use though?

Comment 32 by Lina Haddad posted on 10/25/2011 at 7:37 PM

I solved it like this (using getToken). I am assuming that there is at least one space between any two words. Does anybody see any issue with that?
<cfsavecontent variable="string">
This is a paragraph with some text in it. Certain words will be repeated, and other words
will not be repeated. The question is though, how much can I write before I begin to sound
like a complete and utter idiot. Let's call that the "Paris Point". At the Paris Point, any
further words sound like gibberish and are completely worthless.
</cfsavecontent>
<cfset word="let's" />
<cfset i=1 />
<cfset countOfword=0 />
<cfloop condition="#getToken(string,i,' ')# neq ''">
<cfif #getToken(string,i,' ')# eq #word#><cfset countOfword=countOfword+1 /></cfif>
<cfset i=i+1 />
</cfloop>
<cfoutput>#countOfword#</cfoutput>

Comment 33 by Ken Gladden posted on 10/27/2011 at 12:42 PM

Back to the point about how to include "-" and O'Conner use [:print:] instead of [:word:]. Works wonders for me!

Comment 34 by chris ellem posted on 11/11/2013 at 10:31 AM

Don't you love these posts when you need an answer in a hurry? Thanks, Ray.

Comment 35 by Raymond Camden posted on 11/11/2013 at 4:15 PM

I love it when stuff works 6 years later. ;)

Comment 36 by Awais posted on 11/24/2017 at 7:36 AM

Hi guys,

I tried the [:word:] solution, but it is counting 'blue-eyed' as 2 words rather than 1.
And 'doesn’t' is taken as 2 words. Is there any way to tell CF to let - and ' go by?

Any help would be more than appreciated :)

Regards,
Awais

Comment 37 (In reply to #36) by Raymond Camden posted on 11/24/2017 at 2:19 PM

Would this help? First Google result for "regular expression word hyphen" https://stackoverflow.com/q...