Re: Identifying (X)HTML without MIME

On Tue, Nov 09, 2004 at 09:23:35AM +1100, Lachlan Hunt wrote:
> ----
>     + You can't sniff for the five characters "<?xml" because:
> 
>       - The <?xml ... ?> header is optional per Appendix C, and it is
>         recommended not to include it as it causes IE6 to trigger
>         quirks mode.

That's interesting, actually.  I serve at least one true XML document that
works in IE6, and it definitely has this at the top.  As far as I can tell,
the declaration merely triggers IE6's XML mode: if I leave it out, IE6 tries
to interpret the document as HTML (which it isn't).  I gather from this that
IE's behaviour is backwards when the XML document is actually XHTML.

As for it being optional... whoops.  I forgot that it was, since it seems so
highly recommended to use it. :-)

>       - SGML can also contain PIs (see the example below).
> 
>    ...
> 
>    e.g. what language is this text/html document in?:
> 
>       <?xml this is not?>
>       <!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0//EN"
>           [ <!-- SYSTEM "not XHTML" --> ]>
>       <!-- -- -->
>         This is a comment. This document is not XHTML.
>         <html xmlns="http://www.w3.org/1999/xhtml"/>
>         Ok, I'm done now. -->
>       <html>
>        <title> Need a title in HTML4! </title>
>        <p> This is a valid HTML4 document.
>       </html>

Yep, so it looks like it's easier to just parse the document as SGML.
Once you have parsed it, look at the first PI.  If it is a valid XML
declaration, take the document as XML.  Also check for an "xmlns" attribute
(since we used an SGML parser, it will see this as an ordinary attribute and
won't have any idea of namespaces), and if it's there, take the document as
XML.  You can look at the public identifiers at this point as well, since the
parser will have read those, hopefully correctly. ;-)
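
A minimal sketch of that procedure in Python, to make the order of checks
concrete.  Big assumption: Python's html.parser is only a stand-in here, not
a real SGML parser, so it will not handle SGML-only constructs (such as the
internal subset or the "-- --" comment in the example above) the way a true
SGML parser would.  The names Sniffer and classify, and the "xml"/"sgml"
return values, are made up for illustration.

```python
import re
from html.parser import HTMLParser

# Loose check that a PI's content looks like an XML declaration,
# e.g. xml version="1.0" ... (an approximation, not the full grammar).
XML_DECL = re.compile(r'''^xml\s+version\s*=\s*(["'])1\.\d+\1''')


class Sniffer(HTMLParser):
    """Records the first PI, the first declaration and the first start tag."""

    def __init__(self):
        super().__init__()
        self.first_pi = None    # content of the first <? ... > seen
        self.doctype = None     # text of the first <! ... > declaration
        self.root_attrs = None  # attributes of the first start tag

    def handle_pi(self, data):
        if self.first_pi is None:
            self.first_pi = data

    def handle_decl(self, decl):
        if self.doctype is None:
            self.doctype = decl

    def handle_starttag(self, tag, attrs):
        # Comments never reach this method, so markup hidden inside a
        # comment cannot be mistaken for the real first element.
        if self.root_attrs is None:
            self.root_attrs = dict(attrs)


def classify(text):
    sniffer = Sniffer()
    sniffer.feed(text)
    sniffer.close()
    # 1. A valid-looking XML declaration as the first PI means XML.
    if sniffer.first_pi is not None and XML_DECL.match(sniffer.first_pi):
        return "xml"
    # 2. An "xmlns" attribute on the first element also means XML.
    if sniffer.root_attrs is not None and "xmlns" in sniffer.root_attrs:
        return "xml"
    # 3. Otherwise treat it as SGML; sniffer.doctype is available if
    #    you want to inspect the public identifier as well.
    return "sgml"
```

With this, a PI like "<?xml this is not?>" is not accepted as an XML
declaration, and an "<html xmlns=.../>" buried in an ordinary comment is
ignored.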

The important thing is that using a proper parser will sidestep the comment
trickery above.  Unfortunately file(1) doesn't do this, AFAIK.  It just looks
for "<!DOCTYPE html", "<html" and other such strings, so it will occasionally
guess wrong.  But it was never meant to be rigorous, either, and it assumes
people aren't idiots (and thus won't generate code such as the above).
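
To illustrate how plain string matching gets fooled, here is a deliberately
naive sketch.  It is NOT file(1)'s actual rule set; naive_sniff and the
strings it looks for are hypothetical, chosen only to show the failure mode
on the example document quoted above.

```python
def naive_sniff(text):
    # Hypothetical substring-based guess; not how file(1) really works.
    head = text[:1024].lower()
    if head.lstrip().startswith("<?xml"):
        return "xhtml"
    if 'xmlns="http://www.w3.org/1999/xhtml"' in head:
        return "xhtml"
    if "<!doctype html" in head or "<html" in head:
        return "html"
    return "unknown"


# The example document from the quoted message.
TRICK = """<?xml this is not?>
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0//EN"
    [ <!-- SYSTEM "not XHTML" --> ]>
<!-- -- -->
  This is a comment. This document is not XHTML.
  <html xmlns="http://www.w3.org/1999/xhtml"/>
  Ok, I'm done now. -->
<html>
 <title> Need a title in HTML4! </title>
 <p> This is a valid HTML4 document.
</html>
"""

# The leading "<?xml" is enough to mislead the substring check.
print(naive_sniff(TRICK))
```

The check reports "xhtml" for a document that is really HTML 4, which is
exactly the kind of mistake a real parse avoids.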

TX

-- 
             Email: Trejkaz Xaoza <trejkaz@xaoza.net>
          Web site: http://xaoza.net/trejkaz/
         Jabber ID: trejkaz@jabber.xaoza.net
   GPG Fingerprint: 9EEB 97D7 8F7B 7977 F39F  A62C B8C7 BC8B 037E EA73

Received on Monday, 8 November 2004 23:45:08 UTC