When is XML not XML?

Here is a mystery for folks. I've updated my parsing engine for coldfusionbloggers.org. I'm using CFHTTP now so I can check Etag type stuff. I take the result text and save it to a file to be parsed by CFFEED.

But before I do that I check to ensure it's valid XML. Here is where it gets weird. Charlie Griefer's blog works with CFFEED directly, but isXML on the result returns false. But - I can xmlParse the string no problem. Simple example:

<cfset f = "http://cfblog.griefer.com/feeds/rss2-0.cfm?blogid=30">
<cfhttp url="#f#">
<cfset text = cfhttp.filecontent>

<cfif isXml(text)>
    yes
<cfelse>
    no
    <cfset z = xmlParse(text)>
    <cfdump var="#z#">
</cfif>

If you run this, you will see "no" output, and then an XML object. If you use CFFEED on the URL directly, that works as well. So it seems like isXML is being strict about something. I can update my code to try/catch an xmlParse obviously, but I'd rather figure out why the above is happening first.
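
For reference, the try/catch fallback mentioned above could look something like the following. This is only a minimal sketch, not the code running on coldfusionbloggers.org: the function name isParsableXml is made up for illustration, and it assumes that any exception thrown by xmlParse means the text should be treated as "not XML".

```cfml
<!--- Hypothetical helper: returns true if xmlParse() accepts the string. --->
<cffunction name="isParsableXml" returntype="boolean" output="false">
    <cfargument name="text" type="string" required="true">
    <cftry>
        <!--- xmlParse throws if the string is not well-formed XML. --->
        <cfset xmlParse(arguments.text)>
        <cfreturn true>
        <cfcatch type="any">
            <!--- Assumption: any parse error means "not usable XML". --->
            <cfreturn false>
        </cfcatch>
    </cftry>
</cffunction>
```

Swapping isXml(text) for isParsableXml(text) in the example would then accept anything xmlParse can handle, at the cost of parsing the feed an extra time.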

Archived Comments

Comment 1 by charlie griefer posted on 11/12/2007 at 11:11 PM

hi ray:

i sent this entry to jon clausen, the big brain behind cfblog. i know you said you think it's a "cf thing" more than a "cfblog thing", but i figured jon might have some insights he can offer up.

Comment 2 by Ben posted on 11/12/2007 at 11:15 PM

I ran that URL through the w3c XML validator and it didn't validate.

Comment 3 by Raymond Camden posted on 11/12/2007 at 11:40 PM

Where exactly did you validate? I tried here and it validated:

http://www.validome.org/rss...

Comment 4 by Ben Koshy posted on 11/12/2007 at 11:50 PM

http://www.validome.org/xml... doesn't validate it either. I was trying to use a more generic validator to simulate how isXML would behave. Is it because it's missing a DTD/DOCTYPE declaration?

Comment 5 by todd sharp posted on 11/13/2007 at 12:30 AM

Is it any surprise that it's Charlie's feed? Does that shock anyone?

Comment 6 by Jeff Price posted on 11/13/2007 at 12:37 AM

It's passed at the two validators I tried:
http://validator.w3.org
http://feedvalidator.org

This is really neat! I can't see a reason this would fail to be valid XML.

Comment 7 by charlie griefer posted on 11/13/2007 at 12:37 AM

hey when you roll hard core like me, you uncover problems that n00bs such as yourself don't encounter :P

Comment 8 by charlie griefer posted on 11/13/2007 at 12:47 AM

um yeah, just to clarify... my previous comment was in response to todd (best not to piss off jeff, i figure) :)

Comment 9 by mj posted on 11/13/2007 at 12:48 AM

If you look within the CDATA for each item there are quite a few tags that are malformed or they get chopped. If you correct these then the xml validates.

Comment 10 by Jeff Price posted on 11/13/2007 at 1:03 AM

LOL! It's ok to piss me off, you have my permission.

Comment 11 by Jon Clausen posted on 11/13/2007 at 1:50 AM

Interesting insights, Ray. I dug into it briefly, but here's what I can tell so far:

The problem is <![CDATA[]]> in the xmlNode. I always use the W3C validator which escapes malformed HTML within the CDATA. I used CDATA on purpose because xmlFormat() doesn't always re-format correctly for valid RSS Feeds - especially when non-technical users are providing the input.

ColdFusion's isXML() doesn't appear to escape the CDATA content, however. For example, the following feed using xmlFormat() with Charlie's content returns isXML() true:

http://cfblog.griefer.com/f...

Whereas the original does not:

http://cfblog.griefer.com/f...

Interesting stuff!

Comment 12 by Raymond Camden posted on 11/13/2007 at 1:58 AM

So are you saying that if I had

<b>foo

In my CDATA, CF would consider it bad because I never closed the B?

Comment 13 by charlie griefer posted on 11/13/2007 at 1:58 AM

O_O

did jon just call me a 'non-technical user'? :)

Comment 14 by Jon Clausen posted on 11/13/2007 at 2:06 AM

@charlie:
"did jon just call me a 'non-technical user'? :)"

Errr..... :-O No actually, the original change to using CDATA was from a couple of non-technically oriented blog portals like pieceoftexas.com. Users were pasting from Word and even with the WYSIWYG, xmlFormat() wasn't cleaning it up enough. There were also intermittent problems with feed readers decoding inline JavaScript like YouTube posts, etc. from users' content.

@Ray
I'm going to play around with it, but it appears that any raw HTML in CDATA will cause isXML() to fail - which is the reason for using CDATA in the first place.

Comment 15 by Raymond Camden posted on 11/13/2007 at 2:11 AM

Jon, please let me know asap and I will file a bug report for it. Btw, I know of the xmlFormat issue and it bugs the you know what out of me.

Comment 16 by Rick O posted on 11/13/2007 at 4:48 AM

Ray said: "So are you saying that if I had <b>foo In my CDATA, CF would consider it bad because I never closed the B?"

In my testing, it didn't appear to be looking for parity/balance of tags, so much as it was looking for parity of brackets. That is, <b> without </b> is okay, as is <a>foo</b>, but <b (no closing bracket) is not. His feed at this moment has an A tag that has been chopped off between the tagName and its first attribute.

Comment 17 by Sammy Larbi posted on 11/13/2007 at 6:54 AM

I think parsing xml will work with what I'll call "sub xml," while isXML(value) would verify that the value is a valid xml document. The difference would be that parsing an xml doc for a substructure would allow you to query that substructure by parsing xml again, while it wouldn't still be a valid xml doc.

Comment 18 by Raymond Camden posted on 11/13/2007 at 5:20 PM

I've confirmed this and will write a follow-up blog entry a bit later this morning. I'm going to log the bug report right now.