Earlier today Mike Henke asked if there was a way to generate a tag cloud from an RSS feed. While he was able to find a solution quickly enough (Wordle), I thought it would be kind of fun to try this myself. I knew that Pete Freitag had already blogged on tag clouds and ColdFusion, so all I had to do was generate my word data and pass it to his code. Here's what I came up with.
I began by using cffeed to turn my RSS feed into a query. For my testing, the feed was the only thing I cached (for one hour). Obviously all of my "crunching" could have been cached as well.
<cfset rss = cacheGet("rss")>
<cfif isNull(rss)>
<cfset feedUrl = "http://feedproxy.google.com/RaymondCamdensColdfusionBlog">
<cffeed source="#feedUrl#" query="rss">
<cfset cacheput("rss", rss,createTimespan(0,1,0,0))>
</cfif>
Now for the fun part. In order to use Pete's code, I need to know each word and the number of times it appears. I began with an empty struct:
<cfset allwords = {}>
Next, I created a list of "stop" words, words I'd always ignore. (Note, this list was kind of arbitrary. Also note I added some spaces in the blog entry just so it would wrap better.)
<cfset stopwords = "and,this,the,a,it,as,was,to,don't,has,you, you're,you've,with,why,which,when,were,we've,we're, then,than,i,i'll,i'm,i've,i'd,it's,for,of,is,if,in,that,but,my,not,can,are,',done, off,their,isn't,yes,what's,them,they,'',be,being,all, only,does,here,an,by,would,like,at,do,want,or,could, out,our,while,what,had,each,into,where,That's,will,else, let's,about,got,using,before,over,actually,going,some,well">
I then pulled the words out of each entry with a regular expression and added them to the struct. Note that the pattern also accepts ' so I can match "don't". This is not perfect, but good enough.
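<cfloop query="rss">
<!--- assumption: bigstring is the text of the entry, built here from the title and content columns --->
<cfset bigstring = title & " " & content>
<cfset words = reMatch("[\w']+",bigstring)>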
<cfloop index="w" array="#words#">
<cfif len(w) gt 1 and not listFindNoCase(stopwords, w)>
<cfif not structKeyExists(allwords, w)>
<cfset allwords[w] = 0>
</cfif>
<cfset allwords[w]++>
</cfif>
</cfloop>
</cfloop>
I had quite a few words, so I decided to remove all words with fewer than 5 instances.
<cfloop item="k" collection="#allwords#">
<cfif allwords[k] lt 5>
<cfset structDelete(allwords,k)>
</cfif>
</cfloop>
Now comes Pete's code to find the lowest and highest counts and, from those, the width of each size band.
<cfset minval = 999999>
<cfset maxval = 0>
<cfloop item="k" collection="#allwords#">
<!--- two separate checks, so a single value can update both min and max --->
<cfif allwords[k] lt minval>
<cfset minval = allwords[k]>
</cfif>
<cfif allwords[k] gt maxval>
<cfset maxval = allwords[k]>
</cfif>
</cfloop>
<cfset diff = maxval - minval>
<cfset distribution = diff / 3>
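To make the math concrete, here is a quick sketch with made-up numbers. The counts are hypothetical (not from my feed), and `mediumCutoff` and `largeCutoff` are names I'm introducing just for illustration. The idea is that the range between the lowest and highest counts is cut into three equal bands.

```cfml
<!--- Hypothetical counts, for illustration only; not taken from a real feed --->
<cfset minval = 5>
<cfset maxval = 35>
<!--- the range (30) split into three equal bands --->
<cfset distribution = (maxval - minval) / 3>
<!--- illustrative names for the two cutoffs between the bands --->
<cfset mediumCutoff = minval + distribution>
<cfset largeCutoff = minval + (distribution * 2)>
<cfoutput>
Each band is #distribution# wide.<br>
Counts above #mediumCutoff# get the medium size.<br>
Counts above #largeCutoff# get the large size.<br>
</cfoutput>
```

With these numbers each band is 10 wide, so the cutoffs land at 15 and 25. A word seen 12 times stays small, one seen 22 times is medium, and one seen 30 times is large. Words that hit exactly the minimum (5) or the maximum (35) get the smallest and largest sizes.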
And finally, the output:
<h2>Word Cloud</h2>
<p>
<cfloop item="w" collection="#allWords#">
<cfif allWords[w] EQ minval>
<cfset class="smallestTag">
<cfelseif allWords[w] EQ maxval>
<cfset class="largestTag">
<cfelseif allWords[w] GT (minval + (distribution*2))>
<cfset class="largeTag">
<cfelseif allWords[w] GT (minval + distribution)>
<cfset class="mediumTag">
<cfelse>
<cfset class="smallTag">
</cfif>
<cfoutput><span class="#class#">#w#</span></cfoutput>
</cfloop>
</p>
Sexy, eh? Here is the output from my blog:
I then pointed it at the RSS feed from ColdFusionBloggers:
I probably could have shortened that a lot more with a higher minimum in my filter. Anyway, I then did one more tweak. Instead of counting words, I simply took the category list:
<cfloop query="rss">
<cfset words = listToArray(categorylabel)>
This tag cloud then represents categories from the RSS feed:
And that's it. Totally and completely stupid, but fun. Here's the current script, although it's a bit messy. As I said, normally you would want to cache all of the crunching.
p.s. Words a bit hard to read in the pictures? Right click and open in new tab. Sorry about that!
<cfset rss = cacheGet("rss")>
<cfif isNull(rss)>
<cfset feedUrl = "http://www.coldfusionbloggers.org/rss.cfm">
<cffeed source="#feedUrl#" query="rss">
<cfset cacheput("rss", rss,createTimespan(0,1,0,0))>
</cfif>
<!--- create a count of words --->
<cfset allwords = {}>
<cfset stopwords = "and,this,the,a,it,as,was,to,don't,has,you,you're,you've,with,why,which,when,were,we've,we're,then,than,i,i'll,i'm,i've,i'd,it's,for,of,is,if,in,that,but,my,not,can,are,',done,off,their,isn't,yes,what's,them,they,'',be,being,all,only,does,here,an,by,would,like,at,do,want,or,could,out,our,while,what,had,each,into,where,That's,will,else,let's,about,got,using,before,over,actually,going,some,well">
<cfloop query="rss">
<!---
<cfset words = reMatch("[\w']+",bigstring)>
--->
<cfset words = listToArray(categorylabel)>
<cfloop index="w" array="#words#">
<cfif len(w) gt 1 and not listFindNoCase(stopwords, w)>
<cfif not structKeyExists(allwords, w)>
<cfset allwords[w] = 0>
</cfif>
<cfset allwords[w]++>
</cfif>
</cfloop>
</cfloop>
<!--- remove where val < 5, 5 being a bit arbitrary --->
<!---
<cfloop item="k" collection="#allwords#">
<cfif allwords[k] lte 0>
<cfset structDelete(allwords,k)>
</cfif>
</cfloop>
--->
<!--- get min, max --->
<cfset minval = 999999>
<cfset maxval = 0>
<cfloop item="k" collection="#allwords#">
<!--- two separate checks, so a single value can update both min and max --->
<cfif allwords[k] lt minval>
<cfset minval = allwords[k]>
</cfif>
<cfif allwords[k] gt maxval>
<cfset maxval = allwords[k]>
</cfif>
</cfloop>
<cfset diff = maxval - minval>
<cfset distribution = diff / 3>
<!DOCTYPE html>
<html>
<head>
<title></title>
<meta http-equiv="Content-Type" content="text/html;charset=ISO-8859-1" />
<meta name="description" content="" />
<meta name="keywords" content="" />
<link rel="stylesheet" href="http://twitter.github.com/bootstrap/1.4.0/bootstrap.min.css">
<!--[if lt IE 9]><script src="http://html5shim.googlecode.com/svn/trunk/html5.js"></script><![endif]-->
<script type="text/javascript" src="https://ajax.googleapis.com/ajax/libs/jquery/1.6.4/jquery.min.js"></script>
<script type="text/javascript">
$(function() { });
</script>
<style>
.smallestTag { font-size: xx-small; }
.smallTag { font-size: small; }
.mediumTag { font-size: medium; }
.largeTag { font-size: large; }
.largestTag { font-size: xx-large; }
</style>
</head>
<body>
<div class="container">
<h2>Word Cloud</h2>
<p>
<cfloop item="w" collection="#allWords#">
<cfif allWords[w] EQ minval>
<cfset class="smallestTag">
<cfelseif allWords[w] EQ maxval>
<cfset class="largestTag">
<cfelseif allWords[w] GT (minval + (distribution*2))>
<cfset class="largeTag">
<cfelseif allWords[w] GT (minval + distribution)>
<cfset class="mediumTag">
<cfelse>
<cfset class="smallTag">
</cfif>
<cfoutput><span class="#class#">#w#</span></cfoutput>
</cfloop>
</p>
</div>
</body>
</html>
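As I said, the crunching itself could be cached too. Here is a minimal sketch of how that might look, using the same cacheGet/cachePut pattern as the feed. It assumes `rss` has already been loaded, and `wordcounts` is a hypothetical cache key I made up for this example. I left out the stop word check to keep it short.

```cfml
<!--- "wordcounts" is a hypothetical cache key; any unique name would do --->
<cfset allwords = cacheGet("wordcounts")>
<cfif isNull(allwords)>
	<!--- nothing cached yet, so do the counting once --->
	<cfset allwords = {}>
	<cfloop query="rss">
		<cfset words = listToArray(categorylabel)>
		<cfloop index="w" array="#words#">
			<cfif not structKeyExists(allwords, w)>
				<cfset allwords[w] = 0>
			</cfif>
			<cfset allwords[w]++>
		</cfloop>
	</cfloop>
	<!--- keep the counts for one hour, the same timespan used for the feed --->
	<cfset cachePut("wordcounts", allwords, createTimespan(0,1,0,0))>
</cfif>
```

Since both entries expire after an hour, the counts never get much staler than the feed they came from. If you switch feeds, you would want a different key per feed.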