getAllTheTexts - simple Apache Tika wrapper

August 16, 2012 coldfusion


A few days ago a reader asked me if I had code that could extract text from various document formats. There are multiple tools in ColdFusion that can do this. My first thought was to build a CFC that used a large switch block to hand each file type off to the appropriate utility. For some, this would be easy. CFPDF, for example, has a text extraction feature. Others would be a bit more work. You can convert PPTs and Word docs to PDF using CFDOCUMENT and then use CFPDF to extract text. Excel files can be parsed using CFSPREADSHEET. You get the idea.

Before going down that route, however, I took a look at Apache Tika. Tika supports extracting metadata and text from numerous document formats. (Complete list of supported formats.)

Turns out Tika has a pretty simple API. How simple? I was able to get the code that both extracts text and returns metadata down to fewer than 50 lines. Here's the complete code for the CFC:

As you can see, I make use of the excellent JavaLoader library (the development branch, to be clear). Once you have an instance of the CFC, it is a simple matter of passing a filename to the read method. The metadata is very deep. For a PPTX I parsed, I got info on the number of slides as well as the presentation template. It even returned a large amount of information on an MP3.
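
To give a feel for how little is involved, here is a minimal sketch of a wrapper along these lines. It is not the code from the repo, just an illustration of the same idea. The component layout, the jar location, and the shape of the returned struct are all assumptions made for this example. Depending on your setup, Tika may also need the thread's context class loader pointed at JavaLoader's before it can find its parsers; that step is left out here.

```cfml
component {

	// Assumptions: JavaLoader is reachable as "javaloader.JavaLoader" and the
	// Tika jar sits in a "jars" folder next to this CFC. Both paths are
	// hypothetical; adjust them to match your install.
	public function init() {
		var paths = [ getDirectoryFromPath(getCurrentTemplatePath()) & "jars/tika-app.jar" ];
		variables.loader = createObject("component", "javaloader.JavaLoader").init(paths);
		return this;
	}

	// Returns a struct with two keys: "text" (string) and "metadata" (struct).
	public struct function read(required string filename) {
		var result = { text = "", metadata = {} };

		if (!fileExists(arguments.filename)) {
			throw(message = "File not found: #arguments.filename#");
		}

		var tika = variables.loader.create("org.apache.tika.Tika").init();
		var meta = variables.loader.create("org.apache.tika.metadata.Metadata").init();
		var input = createObject("java", "java.io.FileInputStream").init(arguments.filename);

		// The Tika facade truncates long output by default; -1 lifts the limit.
		tika.setMaxStringLength(javaCast("int", -1));

		try {
			// parseToString(InputStream, Metadata) returns the text and
			// fills "meta" as a side effect.
			result.text = tika.parseToString(input, meta);
		} finally {
			input.close();
		}

		// Copy the Java metadata object into a plain CFML struct.
		var names = meta.names();
		for (var i = 1; i <= arrayLen(names); i++) {
			result.metadata[names[i]] = meta.get(names[i]);
		}

		return result;
	}

}
```

If you saved that as TextExtractor.cfc (a made-up name), then `new TextExtractor().read(expandPath("./sample.pptx"))` would hand back a struct you can dump to see both the text and whatever metadata Tika found.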


You can download the code plus a small example at the GitHub repo: https://github.com/cfjedimaster/getallthetexts

Special thanks to Mark Mandel for help with a class loader issue I ran into, and to Jeff Coughlin as well.