Christian Ready pinged me a few days ago about an interesting problem he was having at one of his web sites. His search (Verity-based on CFMX7) was returning HTML. The HTML was escaped so the user literally saw stuff like this in the results:
Hi, my name is <b>Bob</b> and I'm a rabid developer!
I pointed out that the regex used to remove HTML would also work for escaped HTML:
<code>
<cfset cleaned = rereplace(str, "&lt;.*?&gt;", "", "all")>
</code>
In English, this regex matches the escaped less-than sign (&lt;), then any characters (non-greedy, more on that in a bit), and then the escaped greater-than sign (&gt;). The "non-greedy" part means to match as few characters as possible. Without it, a single match would run from the first &lt; all the way to the last &gt;, removing the tags and everything between them! We just want to remove the tags themselves.
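To make the greedy versus non-greedy difference concrete, here is a small sketch. It assumes the stored string holds the tags as entities (&lt; and &gt;), as described above; the variable names are just for illustration.

<code>
<!--- The sample sentence as it would be stored, with the tags escaped --->
<cfset str = "Hi, my name is &lt;b&gt;Bob&lt;/b&gt; and I'm a rabid developer!">

<!--- Non-greedy (.*?): each escaped tag is matched and removed on its own --->
<cfset lazy = rereplace(str, "&lt;.*?&gt;", "", "all")>

<!--- Greedy (.*): one match runs from the first &lt; to the last &gt; --->
<cfset greedy = rereplace(str, "&lt;.*&gt;", "", "all")>

<cfoutput>
#htmlEditFormat(lazy)#<br>
#htmlEditFormat(greedy)#
</cfoutput>
</code>

The non-greedy version should keep "Bob" in the sentence, while the greedy version should drop it along with the tags.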
This worked - but then exposed another problem. Verity was returning text with incomplete HTML tags. As an example, consider this text block:
ul>This is some <b>bold</b> html with <i>markup</i> in it.
Here is <b
Notice the incomplete HTML tag at the beginning and end of the string. Luckily regex provides us with a simple way to look for patterns at either the beginning or end of a string. Consider these two lines:
<code>
<cfset cleaned = rereplace(cleaned, "&lt;.*?$", "", "all")>
<cfset cleaned = rereplace(cleaned, "^.*?&gt;", "", "all")>
</code>
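The first line matches a &lt; and whatever follows it up to the end of the string. The second line matches everything from the beginning of the string through the first &gt;. Both take the leftover bits of the HTML tag with them. This only works because the complete tags have already been removed, so any &lt; or &gt; still in the string must belong to a fragment.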
So all together this is the code I gave him:
<code>
<cfset cleaned = rereplace(str, "&lt;.*?&gt;", "", "all")>
<cfset cleaned = rereplace(cleaned, "&lt;.*?$", "", "all")>
<cfset cleaned = rereplace(cleaned, "^.*?&gt;", "", "all")>
</code>
Most likely this could be done in one regex instead.
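As a rough sketch of what a one-regex version might look like, the three patterns can be joined with alternation. This is untested and rests on two assumptions: that the regex engine supports negative lookahead, (?!...), the way Perl does, and that the sample text is collapsed onto one line. The function name stripEscapedTags is made up for this example.

<code>
<!--- Sketch only: stripEscapedTags is a hypothetical helper, not a built-in --->
<cffunction name="stripEscapedTags" returntype="string" output="false">
    <cfargument name="str" type="string" required="true">
    <!--- Alternatives, tried in this order at each position:
          1. a complete escaped tag
          2. an opening fragment that runs to the end of the string
          3. a closing fragment at the start of the string; the negative
             lookahead stops it from running past the first &lt; --->
    <cfreturn rereplace(arguments.str, "&lt;.*?&gt;|&lt;.*?$|^((?!&lt;).)*?&gt;", "", "all")>
</cffunction>

<cfset sample = "ul&gt;This is some &lt;b&gt;bold&lt;/b&gt; html with &lt;i&gt;markup&lt;/i&gt; in it. Here is &lt;b">
<cfoutput>#stripEscapedTags(sample)#</cfoutput>
</code>

The lookahead is there because the order of operations changes. In the three-line version the ^.*?&gt; pattern only runs after the complete tags are gone. In a single pass it would otherwise swallow any leading text up to the &gt; of the first complete tag.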