- From: <trejkaz@trypticon.org>
- Date: Tue, 9 Nov 2004 11:38:16 +1100
- To: www-html@w3.org
- Message-ID: <20041109003816.GA22456@dev.xaoza.net>
At Tue, Nov 09, 2004 at 09:23:35AM +1100, Lachlan Hunt wrote:
> ----
> + You can't sniff for the five characters "<?xml" because:
>
> - The <?xml ... ?> header is optional per Appendix C, and it is
>   recommended not to include it as it causes IE6 to trigger
>   quirks mode.

That's interesting, actually. I have at least one true XML document
which I serve that works on IE6, and it definitely has this at the top.
It merely triggers IE6's XML mode, as far as I can tell, since if I
leave it out, it tries to interpret it as HTML (which it isn't.)

I gather from this information that IE's behaviour is backwards when
the XML document is actually XHTML.

As for it being optional... whoops. I forgot that it was, since it
seems so highly recommended to use it. :-)

> - SGML can also contain PIs (see the example below).
>
> ...
>
> e.g. what language is this text/html document in?:
>
> <?xml this is not?>
> <!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0//EN"
>  [ <!-- SYSTEM "not XHTML" --> ]>
> <!-- -- -->
> This is a comment. This document is not XHTML.
> <html xmlns="http://www.w3.org/1999/xhtml"/>
> Ok, I'm done now. -->
> <html>
> <title> Need a title in HTML4! </title>
> <p> This is a valid HTML4 document.
> </html>

Yep, so it looks like it's easier to just try to parse the document as
SGML. Once you have parsed it, you can look at the first PI. If the
first PI is a valid XML declaration, then take the document as XML.
Check the "xmlns" attribute (since we used an SGML parser, it won't
have any idea of namespaces) and if it's there, take it as XML. You
can look at the public identifiers at this point as well, since the
parser will have read those, hopefully correctly. ;-)

The important thing is that using a proper parser will sidestep that
comment trickery above.

Unfortunately file(1) doesn't do this, AFAIK. It just looks for
"<!DOCTYPE html", "<html" and other things which will cause it to fail
occasionally. But it was never meant to be rigid, either, and it
assumes people aren't idiots (and thus won't generate code such as the
above.)

TX

-- 
Email: Trejkaz Xaoza <trejkaz@xaoza.net>
Web site: http://xaoza.net/trejkaz/
Jabber ID: trejkaz@jabber.xaoza.net
GPG Fingerprint: 9EEB 97D7 8F7B 7977 F39F A62C B8C7 BC8B 037E EA73
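
For illustration, the order of checks described above (first PI, then
"xmlns" on the first element) could be sketched roughly as follows.
This is only a sketch under assumptions: it is not a real SGML parser,
just a hand-rolled scanner that understands enough of SGML comment and
declaration syntax to get past the trick document. The names
sniff_markup, skip_declaration and XML_DECL are made up for this
example, and a ">" inside an attribute value would still confuse it.

```python
import re

# Content of a PI (between "<?" and the closing ">") that forms an XML declaration.
XML_DECL = re.compile(
    r"""xml\s+version\s*=\s*(["'])1\.\d+\1"""
    r"""(\s+encoding\s*=\s*(["'])[A-Za-z][A-Za-z0-9._-]*\3)?"""
    r"""(\s+standalone\s*=\s*(["'])(yes|no)\5)?\s*\?"""
)


def skip_declaration(text, i):
    """Return the index just past the SGML markup declaration starting at text[i] ('<!')."""
    i += 2
    depth = 0  # nesting level of the "[ ... ]" internal subset
    while i < len(text):
        if text.startswith("--", i):
            # SGML comments run from one "--" to the next "--", inside the declaration.
            end = text.find("--", i + 2)
            if end == -1:
                return len(text)
            i = end + 2
        elif text[i] in "\"'":
            # Quoted literal, e.g. a public identifier.
            end = text.find(text[i], i + 1)
            if end == -1:
                return len(text)
            i = end + 1
        elif text[i] == "[":
            depth += 1
            i += 1
        elif text[i] == "]":
            depth -= 1
            i += 1
        elif text[i] == ">" and depth <= 0:
            return i + 1
        elif text.startswith("<!", i):
            # Declaration nested in the internal subset.
            i = skip_declaration(text, i)
        else:
            i += 1
    return len(text)


def sniff_markup(text):
    """Guess "xml" or "html" from the first PI and the first start tag."""
    seen_pi = False
    i = 0
    while True:
        lt = text.find("<", i)
        if lt == -1:
            return "html"
        if text.startswith("<?", lt):
            end = text.find(">", lt)  # an SGML PI ends at the first ">"
            if end == -1:
                return "html"
            if not seen_pi:
                seen_pi = True
                if XML_DECL.fullmatch(text[lt + 2:end]):
                    return "xml"
            i = end + 1
        elif text.startswith("<!", lt):
            i = skip_declaration(text, lt)
        elif re.match(r"<[A-Za-z]", text[lt:lt + 2]):
            end = text.find(">", lt)
            tag = text[lt:end if end != -1 else len(text)]
            # No namespace awareness: just look for an attribute named xmlns.
            return "xml" if re.search(r"\sxmlns\s*=", tag) else "html"
        else:
            i = lt + 1
```

Fed the trick document quoted above, this skips the whole
"<!-- -- --> ... -->" declaration as one comment and decides on the
real "<html>" tag, where a scan for "<html xmlns" would be fooled.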
Received on Monday, 8 November 2004 23:45:08 UTC