- From: Simon Pieters <zcorpan@gmail.com>
- Date: Sun, 04 Mar 2007 13:19:53 +0100
On Sun, 04 Mar 2007 07:33:51 +0100, Michael Day <mikeday at yeslogic.com> wrote:

> For user agents like Prince that support XML and HTML content it is
> sometimes necessary to distinguish whether a .html file is actually XML
> or HTML in order for it to be processed correctly.
>
> I've written an article for XML.com explaining exactly how Prince
> performs content sniffing to distinguish XML and HTML documents:
>
> What Does XML Smell Like?
> http://www.xml.com/pub/a/2007/02/28/what-does-xml-smell-like.html
>
> Any feedback would be greatly appreciated. No doubt at some point it
> will be necessary to revise our heuristics for HTML5 :)

If you load a file from disk, then use any meta information the OS can provide. (I think Linux can store Content-Type information for files.) If the OS relies on file extensions (like Windows does) then use that. .htm and .html are HTML.

I know of lots of HTML documents that start with an "XML declaration" but are not well-formed if parsed as XML. (For starters, some version of DreamWeaver emitted XML declarations for documents but did not ensure well-formedness, and the result is often not well-formed.) Even if such a document is well-formed, it probably wasn't tested under XML conditions, so it's likely that its style sheets and scripts only work correctly under HTML conditions.

From the article:

| It is common for XHTML files to be given an extension of .html or .htm,
| as .xhtml is rather long and .xht is rather obscure. This means that a
| file with an extension of .html may actually be an XML document and
| require an XML parser.

This is completely bogus. Those "XHTML" files are most likely intended to be treated as HTML and not as XML. If an author wanted a file to be treated as XML, he/she would use .xhtml, .xht or .xml. Even if it would work correctly with an XML parser, it would likely also work correctly with an HTML parser (since all browsers would treat it as HTML, and authors mostly test their documents in some browser).
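
As a rough sketch of the rule suggested above (trust OS-provided metadata first, then fall back to the file extension, and never sniff a .html file into XML), something like the following could work. The names parser_for and os_content_type are made up for illustration, and how the OS metadata is actually obtained is platform-specific and not shown here; this is not how Prince or any browser implements it.

```python
from pathlib import Path

# Extensions taken from the discussion above: .htm/.html mean HTML,
# while .xhtml/.xht/.xml signal that the author wants an XML parser.
HTML_EXTENSIONS = {".html", ".htm"}
XML_EXTENSIONS = {".xhtml", ".xht", ".xml"}


def parser_for(path, os_content_type=None):
    """Return "html", "xml", or None if nothing is known about the file.

    os_content_type is a hypothetical parameter: a Content-Type string
    supplied by the OS or file system, if one is available.
    """
    if os_content_type:
        # Drop parameters such as "; charset=utf-8" before comparing.
        mime = os_content_type.split(";", 1)[0].strip().lower()
        if mime == "text/html":
            return "html"
        if mime in ("application/xml", "text/xml") or mime.endswith("+xml"):
            return "xml"

    # No usable metadata: rely on the extension, without looking at the
    # content for an XML declaration or DOCTYPE.
    ext = Path(path).suffix.lower()
    if ext in HTML_EXTENSIONS:
        return "html"
    if ext in XML_EXTENSIONS:
        return "xml"
    return None
```

With this, parser_for("index.html") gives "html" even if the file starts with an XML declaration, and parser_for("index.xhtml") gives "xml".
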
If an author wrote a document and tested it with Prince, finding that XML-only features work even with a .html file extension, then it is likely that the document would break in browsers (because XML-only features don't work in HTML).

HTML5 has specified content-sniffing rules, FWIW:
http://www.whatwg.org/specs/web-apps/current-work/#content-type-sniffing

See also:
http://www.w3.org/Bugs/Public/show_bug.cgi?id=1500

-- 
Simon Pieters
Received on Sunday, 4 March 2007 04:19:53 UTC