Jeff sent me an interesting question last Friday about writing large amounts of data to a text file in ColdFusion. He had to read in thousands of files and append them to a single file, and he was curious about what he could do to speed up the process. I wasn't really sure what to suggest, outside of making sure he used cfsetting requesttimeout to give his script time to process, but he wrote back and said he had some success using Java to write out the file data.
This led me to do a bit of digging myself. I know that the new file functions (added in ColdFusion 8) make use of higher performing code behind the scenes. For example, if you use cffile to read in a multi-gigabyte file, then ColdFusion has to store all that data in RAM. But if you make use of fileOpen and fileReadLine, you can suck in parts of the file at a time. Shoot - you can even use fileSeek (in ColdFusion 9) to jump ahead. All of this works very well, but it is focused on the read side of the equation. How about writing? I whipped up a simple test to see the different ways I could write to a file and how differently the approaches would perform.
I began my test script by ensuring it would have enough time to run:
<cfsetting requesttimeout="999">
Next I output some whitespace junk. I'm going to be using cfflush and discovered that Chrome, like Internet Explorer, likes to get "enough" content before it renders anything.
<cfoutput>#repeatString(" ", 250)#</cfoutput><cfflush>
Here is my first test:
<cfset string = repeatString(createUUID(), 10)>
<cfset theFile = expandPath("./data.txt")>
<cfoutput>About to write to #theFile#</cfoutput>
<p>
<cfflush>
<cfset thisTick = getTickCount()>
<cfloop index="x" from="1" to="200000">
	<cffile action="append" file="#theFile#" output="#string#">
	<cfif x mod 1000 is 0>
		<cfoutput>##</cfoutput>
		<cfflush>
	</cfif>
</cfloop>
<cfset finalTick = getTickCount() - thisTick>
<cfoutput>
<p>Took #finalTick# ms to write.
</cfoutput>
I created a string based on a UUID repeated 10 times. I set my file name and then loop from 1 to 200,000, using the append form of cffile to write data to the file. That little cfif condition in there is just a simple way for me to monitor the progress of my test. By outputting a hash mark every one thousand iterations, I can get an idea of how quickly my test is running. I wrap the meat of this in a pair of getTickCount() calls so I can time the process.
This test took 70,222 ms to run.
Ok, so how about using the new(ish) file functions? Here's my next test.
<cfset theFile = expandPath("./data2.txt")>
<cfoutput>About to write to #theFile#</cfoutput>
<p>
<cfflush>
<cfset thisTick = getTickCount()>
<cfset fileOb = fileOpen(theFile, "append")>
<cfloop index="x" from="1" to="200000">
	<cfset fileWriteLine(fileOb, string)>
	<cfif x mod 1000 is 0>
		<cfoutput>##</cfoutput>
		<cfflush>
	</cfif>
</cfloop>
<cfset fileClose(fileOb)>
<cfset finalTick = getTickCount() - thisTick>
<cfoutput>
<p>Took #finalTick# ms to write.
</cfoutput>
I create a file object opened in append mode, use fileWriteLine to append my text, and finally close the file object. So how did this perform?
This test took 1,622 ms to run.
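As an aside, here is a rough Java sketch of the two patterns being compared. It is only my guess at the general shape of the difference (opening and closing the file for every append versus holding one buffered writer open for the whole loop), not a claim about what ColdFusion actually runs behind cffile or fileWriteLine. The class and method names are made up for illustration, and the loop count is smaller than the 200,000 used in my test.

```java
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.Collections;
import java.util.UUID;

// Hypothetical demo class; not part of ColdFusion or the original test script.
final class AppendPatterns {

    // Pattern 1: open the file, append one line, close it, for every single line.
    static void appendReopening(Path file, String line, int count) throws IOException {
        byte[] bytes = (line + System.lineSeparator()).getBytes(StandardCharsets.UTF_8);
        for (int i = 0; i < count; i++) {
            Files.write(file, bytes, StandardOpenOption.CREATE, StandardOpenOption.APPEND);
        }
    }

    // Pattern 2: open once, push every line through a buffer, close once.
    static void appendBuffered(Path file, String line, int count) throws IOException {
        try (BufferedWriter out = Files.newBufferedWriter(file, StandardCharsets.UTF_8,
                StandardOpenOption.CREATE, StandardOpenOption.APPEND)) {
            for (int i = 0; i < count; i++) {
                out.write(line);
                out.newLine();
            }
        }
    }

    public static void main(String[] args) throws IOException {
        // A UUID repeated 10 times, roughly mirroring the string used in the post.
        String line = String.join("", Collections.nCopies(10, UUID.randomUUID().toString()));
        int count = 20_000;

        Path reopened = Files.createTempFile("reopen", ".txt");
        Path buffered = Files.createTempFile("buffered", ".txt");

        long start = System.nanoTime();
        appendReopening(reopened, line, count);
        System.out.println("Reopen per line: " + (System.nanoTime() - start) / 1_000_000 + " ms");

        start = System.nanoTime();
        appendBuffered(buffered, line, count);
        System.out.println("One buffered writer: " + (System.nanoTime() - start) / 1_000_000 + " ms");

        Files.delete(reopened);
        Files.delete(buffered);
    }
}
```

Both methods produce the same file contents; only the number of open/close cycles differs. Timings will vary by disk and operating system, so treat any numbers you get as relative.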
Bit faster, eh? Then I tried something else. I thought - what would happen if I built up a large string and just wrote to the file system once? I knew that normal string concatenation wouldn't work, as string operations in general aren't very performant. I used a Java StringBuilder instead.
<cfset theFile = expandPath("./data3.txt")>
<cfoutput>About to write to #theFile#</cfoutput>
<p>
<cfflush>
<cfset thisTick = getTickCount()>
<cfset s = createObject("java","java.lang.StringBuilder")>
<cfset newString = string & chr(13)>
<cfloop index="x" from="1" to="200000">
	<cfset s.append(newString)>
	<cfif x mod 1000 is 0>
		<cfoutput>##</cfoutput>
		<cfflush>
	</cfif>
</cfloop>
<cffile action="write" file="#theFile#" output="#s.toString()#">
<cfset finalTick = getTickCount() - thisTick>
<cfoutput>
<p>Took #finalTick# ms to write.
</cfoutput>
This test took 1,658 ms to run.
Now that's pretty interesting. In every iteration of my test, the StringBuilder version was very close to the fileWriteLine version: always slower, but not by enough to really matter. The main difference though is that I've got one variable taking up a large amount of RAM. In theory, this could take all the RAM available to the JVM. (Keep in mind the JVM is not an area I'm strong in. This is where I typically send people to Mike Brunt. ;)
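To put a rough number on that RAM concern, here is some back-of-the-envelope arithmetic written as a tiny Java program. It rests on two assumptions of mine that were not measured in the test: that a ColdFusion UUID is 35 characters long, and that the JVM stores each character in 2 bytes (newer JVMs can store plain ASCII more compactly). The class and method names are hypothetical.

```java
// Hypothetical helper for a rough size estimate; not from the original script.
final class BuilderMemoryEstimate {

    // Characters held by the StringBuilder once the loop has finished.
    static long totalChars(int uuidLength, int repeats, int iterations) {
        long perLine = (long) uuidLength * repeats + 1; // +1 for the chr(13) suffix
        return perLine * iterations;
    }

    // Approximate heap bytes for that many characters.
    static long approxBytes(long chars, int bytesPerChar) {
        return chars * bytesPerChar;
    }

    public static void main(String[] args) {
        // Assumed: 35-character UUID, repeated 10 times, appended 200,000 times.
        long chars = totalChars(35, 10, 200_000);
        // Assumed: 2 bytes per character.
        long bytes = approxBytes(chars, 2);

        System.out.println("Characters: " + chars); // prints "Characters: 70200000"
        System.out.println("Bytes: " + bytes);      // prints "Bytes: 140400000"
    }
}
```

So under those assumptions the builder alone holds around 140 MB by the end of the loop, and the toString() call copies the contents into a new string, so the peak could briefly be closer to double that. The fileWriteLine version never has to hold more than one line at a time.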
I'll include the entire test script below, but the tests verify what I expected. The newer file functions work much better for both reading and writing. Any comments?
<cfsetting requesttimeout="999">
<cfoutput>#repeatString(" ", 250)#</cfoutput><cfflush>
<cfset string = repeatString(createUUID(), 10)>
<cfset theFile = expandPath("./data.txt")>
<cfoutput>About to write to #theFile#</cfoutput>
<p>
<cfflush>
<cfset thisTick = getTickCount()>
<cfloop index="x" from="1" to="200000">
	<cffile action="append" file="#theFile#" output="#string#">
	<cfif x mod 1000 is 0>
		<cfoutput>##</cfoutput>
		<cfflush>
	</cfif>
</cfloop>
<cfset finalTick = getTickCount() - thisTick>
<cfoutput>
<p>Took #finalTick# ms to write.
</cfoutput>
<hr>
<cfset theFile = expandPath("./data2.txt")>
<cfoutput>About to write to #theFile#</cfoutput>
<p>
<cfflush>
<cfset thisTick = getTickCount()>
<cfset fileOb = fileOpen(theFile, "append")>
<cfloop index="x" from="1" to="200000">
	<cfset fileWriteLine(fileOb, string)>
	<cfif x mod 1000 is 0>
		<cfoutput>##</cfoutput>
		<cfflush>
	</cfif>
</cfloop>
<cfset fileClose(fileOb)>
<cfset finalTick = getTickCount() - thisTick>
<cfoutput>
<p>Took #finalTick# ms to write.
</cfoutput>
<hr>
<cfset theFile = expandPath("./data3.txt")>
<cfoutput>About to write to #theFile#</cfoutput>
<p>
<cfflush>
<cfset thisTick = getTickCount()>
<cfset s = createObject("java","java.lang.StringBuilder")>
<cfset newString = string & chr(13)>
<cfloop index="x" from="1" to="200000">
	<cfset s.append(newString)>
	<cfif x mod 1000 is 0>
		<cfoutput>##</cfoutput>
		<cfflush>
	</cfif>
</cfloop>
<cffile action="write" file="#theFile#" output="#s.toString()#">
<cfset finalTick = getTickCount() - thisTick>
<cfoutput>
<p>Took #finalTick# ms to write.
</cfoutput>