- From: Michael Day <mikeday@yeslogic.com>
- Date: Tue, 06 Mar 2007 10:25:24 +1100
Hi Simon,

> If you load a file from disk, then use any meta information the OS can
> provide. (I think Linux can store Content-Type information for files.)
> If the OS relies on file extensions (like Windows does) then use that.

Some Linux file systems might potentially be capable of storing extra metadata in extended attributes, but in practice I haven't seen any Linux distributions actually use this functionality for storing the content type of files. This basically leaves us with file extensions, just like Windows.

> .htm and .html are HTML. I know of lots of HTML documents that start
> with an "XML declaration" but are not well-formed if parsed as XML. (For
> starters, some version of DreamWeaver emitted XML declarations for
> documents, but did not ensure well-formedness and the result is often
> not well-formed.) Even if it was well-formed, it probably wasn't tested
> under XML conditions so it's likely that style sheets and scripts only
> work correctly under HTML conditions.

Given that Prince serves a different niche than most user agents, our users tend to be more likely to use XML with embedded SVG etc., and less likely to run Prince on documents created by DreamWeaver.

When Prince is run on a document retrieved over HTTP it obeys the Content-Type header, so that documents on the web will be parsed as HTML.

However, it is true that if a document that appears to be XML but actually isn't is downloaded and saved as a file then Prince will try to load it as XML rather than HTML after sniffing the content in the absence of a Content-Type header. The user will then receive error messages if the document is not well-formed.

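To make the order of checks concrete, here is a minimal sketch of the decision described above. It is not Prince's actual code: the function name choose_parser, the list of XML media types, and the rule that a leading XML declaration is what makes a document "appear to be XML" are all assumptions made for illustration.

```python
from typing import Optional

# Assumption: media types treated as XML in this sketch (not taken from Prince).
XML_TYPES = {"text/xml", "application/xml", "application/xhtml+xml", "image/svg+xml"}


def choose_parser(content: bytes, content_type: Optional[str] = None) -> str:
    """Return "xml" or "html" for a document (hypothetical helper)."""
    if content_type:
        # Metadata is available (e.g. an HTTP Content-Type header):
        # obey it and never look at the content.
        media_type = content_type.split(";", 1)[0].strip().lower()
        if media_type in XML_TYPES or media_type.endswith("+xml"):
            return "xml"
        # Simplification: anything else is handed to the HTML parser.
        return "html"
    # No metadata (e.g. a file saved to disk): sniff the content.
    # Assumption: a leading XML declaration, after an optional UTF-8 BOM
    # and whitespace, is the signal that the document claims to be XML.
    head = content.lstrip(b"\xef\xbb\xbf \t\r\n")
    return "xml" if head.startswith(b"<?xml") else "html"
```

With this sketch, a page that starts with an XML declaration is parsed as HTML when it arrives with "text/html", but as XML once it has been saved to disk and the header is gone, which is the case discussed above.
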
In practice, this case does not seem to arise very often, but if it encourages people to either fix their XML and make it well-formed or stop pretending that their HTML is XML then that doesn't sound like such a bad thing :)

> If an author authored a document and testing it with Prince, finding
> that XML-only features work even with a .html file extension, then it is
> likely that that document would break in browsers (because XML-only
> features don't work in HTML).

This comes back to the thorny issue of how MathML is supposed to work on the web. It seems to require that content be served up as XHTML, which no one does, or that HTML documents contain "XML islands", which is not well specified at all. It would be nice if HTML5 could tackle this in a way that makes sense.

> HTML5 has specified content-sniffing rules, FWIW:
> http://www.whatwg.org/specs/web-apps/current-work/#content-type-sniffing

Yes, these rules never seem to identify a document as being XML, though.

> See also: http://www.w3.org/Bugs/Public/show_bug.cgi?id=1500

Prince always respects the Content-Type header, and only sniffs document content when no such metadata is available.

Best regards,

Michael

--
Print XML with Prince!
http://www.princexml.com
Received on Monday, 5 March 2007 15:25:24 UTC