- From: Leif Halvard Silli <xn--mlform-iua@xn--mlform-iua.no>
- Date: Fri, 23 Jul 2010 17:26:42 +0300
- To: HTMLwg <public-html@w3.org>
- Cc: Eliot Graff <eliotgra@microsoft.com>, public-i18n-core@w3.org
Proposal:

Polyglot Markup should allow the document encoding to be set via the encoding attribute of the XML declaration. The XML declaration, including the encoding attribute, thus becomes an HTML5 extension whenever polyglot markup is consumed as HTML. (See my previous letter to Sam about the XML declaration as a polyglot markup indicator.)

Justification/Usefulness:

HTML5's encoding detection algorithm is incomplete, as it does not account for the fact that user agents actually use the encoding attribute of an XML declaration when other encoding info is lacking.

Via the XML declaration we get a method for setting the encoding which works in XML as well as - for those UAs that support it - in HTML, even though the recommended way should be to rely on META@charset, for HTML-parsing reliability.

Also, the XML declaration's encoding attribute can even be used in UTF-16 encoded pages. Thus we get an in-document way to specify the encoding also when the document is in UTF-16. (The META attribute cannot be used for this, according to the HTML5 rules.)

Without an in-document indication method that works in XML, we favor text/html consumption/authoring, as we then lack an in-document method for specifying the encoding to XML parsers and XML authoring tools. (The META element operates only inside text/html.)

HTML5's encoding detection algorithm vs XML declaration:

I was under the impression that the XML encoding declaration did not have any effect in text/html parsers ... However, this is not true. Firefox 4/Safari 5/Opera 10.6 (but not IE6-8, it seems) each have an encoding guessing feature. In the absence of info from HTTP, BOM or META@charset, the encoding attribute of the XML declaration decides the encoding. In Safari/Webkit the guessing is always part of the encoding detection, it seems, while Opera/Firefox let you enable/disable it. In Opera it seems enabled by default. Not sure about Firefox.
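To illustrate the XML side of the justification above, here is a minimal Python sketch, assuming the standard library's xml.etree.ElementTree (expat) stands in for "an XML parser". The sample documents and the helper name try_parse are made up for the illustration. It shows that the encoding attribute is what lets an XML parser read an ISO-8859-5 file, and that the declaration can also be carried in a UTF-16 encoded file.

```python
import xml.etree.ElementTree as ET
from typing import Optional

ZHE = "\u0416"  # CYRILLIC CAPITAL LETTER ZHE, byte 0xB6 in ISO-8859-5


def try_parse(doc_bytes: bytes) -> Optional[str]:
    """Return the root element's text, or None if the XML parser rejects the bytes."""
    try:
        return ET.fromstring(doc_bytes).text
    except ET.ParseError:
        return None


# Same content, with and without the in-document encoding label.
declared = ('<?xml version="1.0" encoding="ISO-8859-5"?><p>' + ZHE + "</p>").encode("iso-8859-5")
undeclared = ("<p>" + ZHE + "</p>").encode("iso-8859-5")

# A UTF-16 file (Python's "utf-16" codec prepends a BOM) carrying the declaration.
utf16 = ('<?xml version="1.0" encoding="UTF-16"?><p>' + ZHE + "</p>").encode("utf-16")

print(try_parse(declared) == ZHE)   # label present: the text is decoded as intended
print(try_parse(undeclared))        # no label: parser assumes UTF-8 and rejects 0xB6
print(try_parse(utf16) == ZHE)      # declaration inside a UTF-16 document
```

This says nothing about text/html parsers; it only shows the in-document method that XML consumers already have and that META@charset does not give them.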
The new information here - at least to myself - is that the encoding guessing feature does not only apply a character pattern analysis (as I had thought), but that it also enables interpretation of the XML encoding attribute. In fact, whenever Opera and Firefox run with encoding guessing enabled, pattern analysis does not take place if there is an XML declaration with an encoding attribute.

Where does the reading of the encoding attribute fit into the encoding determination algorithm of HTML5? (Section 10.2.2.1 Determining the character encoding.)

By testing a UTF-8 encoded file with and without the BOM, and with false - as well as true - info inside the encoding attribute, I found that the reading of the encoding attribute happens:

a) if encoding from HTTP is lacking (algorithm step 1)
b) and if the UTF-8 BOM is lacking (algorithm step 3)
c) and there is no META element (algorithm step 4)
d) then encoding attribute is read! (algorithm step 5). See below.

Thus, it happens:

e) before pattern analysis is initiated (algorithm step 6). E.g. for UTF-8 encoded files without BOM, without the encoding attribute in the XML declaration and without any other explicit encoding info, the UAs would determine the encoding to be UTF-8. But if the file had an encoding attribute which e.g. said "ISO-8859-5", then that value would be used.
f) before "an implementation-defined or user-specified default character encoding" (algorithm step 7 - last step)

It could seem as if Safari/Firefox/Opera link the reading of the encoding attribute to pattern analysis. Even if the encoding attribute has higher priority than the pattern analysis, the encoding attribute is not read, it seems to me, unless the browser runs with pattern matching enabled. (In Safari it seems impossible to disable the pattern matching - and thus also the reading of the encoding attribute. One can only overrule Safari's encoding choice _after_ Safari has performed its guesswork.)
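The observed precedence can be written down as a short Python sketch. This is a model of the behaviour reported above, not the HTML5 algorithm text and not any browser's actual code. The function and parameter names, the regular expression, and the "windows-1252" default are all assumptions made for the illustration. The HTTP charset and the META charset are passed in as already-extracted values.

```python
import re
from typing import Callable, Optional, Tuple

UTF8_BOM = b"\xef\xbb\xbf"

# Hypothetical pattern: an XML declaration at the very start of the byte
# stream that carries an encoding attribute.
_XML_DECL_ENCODING = re.compile(
    rb'^<\?xml\s[^>]*?encoding\s*=\s*["\']([A-Za-z][A-Za-z0-9._-]*)["\'][^>]*\?>'
)


def xml_declared_encoding(data: bytes) -> Optional[str]:
    """Return the encoding attribute of a leading XML declaration, if any."""
    match = _XML_DECL_ENCODING.match(data)
    return match.group(1).decode("ascii") if match else None


def determine_encoding(
    data: bytes,
    http_charset: Optional[str] = None,
    meta_charset: Optional[str] = None,
    guessing_enabled: bool = True,
    pattern_guess: Optional[Callable[[bytes], Optional[str]]] = None,
    default: str = "windows-1252",  # assumed default, for illustration only
) -> Tuple[str, str]:
    """Return (encoding, source) following the order observed in the tests."""
    if http_charset:                      # a) HTTP info wins (step 1)
        return http_charset, "http"
    if data.startswith(UTF8_BOM):         # b) then the UTF-8 BOM (step 3)
        return "utf-8", "bom"
    if meta_charset:                      # c) then META@charset (step 4)
        return meta_charset, "meta"
    if guessing_enabled:
        declared = xml_declared_encoding(data)
        if declared:                      # d) encoding attribute (step 5)
            return declared, "xml-declaration"
        if pattern_guess is not None:     # e) pattern analysis only afterwards (step 6)
            guessed = pattern_guess(data)
            if guessed:
                return guessed, "pattern-analysis"
    return default, "default"             # last step (step 7)


page = b'<?xml version="1.0" encoding="ISO-8859-5"?><!DOCTYPE html><title>t</title>'
print(determine_encoding(page))  # ('ISO-8859-5', 'xml-declaration')
```

Tying the attribute lookup to the guessing_enabled flag mirrors the impression that the attribute is only read when pattern matching is switched on. That coupling is an observation, and a UA could equally read the attribute unconditionally.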
It seems to me that the reading of the encoding attribute is such a specific thing that it should have been mentioned in the HTML5 algorithm. Currently step 5 of the algorithm only says the following:

]]If the user agent has information on the likely encoding for this page, e.g. based on the encoding of the page when it was last visited, then return that encoding, with the confidence tentative, and abort these steps.[[

However, it does not seem like the reading of the encoding attribute has anything to do with information that is specific to the particular user agent ... After all, at least 3 different UAs have managed to handle the encoding attribute the same way ... Thus HTML5's algorithm should perhaps be updated here.

PS: I'm aware that for me to suggest that the XML declaration be used in Polyglot Markup is a deviation from my earlier viewpoints. My viewpoints have developed through the debate.

--
leif halvard silli
Received on Friday, 23 July 2010 14:38:34 UTC