- From: Leif Halvard Silli <xn--mlform-iua@xn--mlform-iua.no>
- Date: Fri, 23 Jul 2010 17:26:42 +0300
- To: HTMLwg <public-html@w3.org>
- Cc: Eliot Graff <eliotgra@microsoft.com>, public-i18n-core@w3.org
Proposal: Polyglot Markup should allow the document encoding to be set
via the encoding attribute of the XML declaration. The XML declaration,
including the encoding attribute, thus becomes an HTML5 extension,
whenever polyglot markup is being consumed as HTML. (See my previous
letter to Sam, about the XML declaration as polyglot markup indicator.)
Justification/Usefulness:
HTML5's encoding detection algorithm is incomplete - it does not
account for the fact that user agents actually use the encoding
attribute of an XML declaration when other encoding info is lacking.
Via the XML declaration we get a method for setting the encoding which
works both in XML and - for those UAs that support it - in HTML. Even
so, the recommended way should be to rely on META@charset, for
HTML-parsing reliability.
Also, the XML declaration's encoding attribute can even be used in
UTF-16 encoded pages. Thus we get an in-document way to specify the
encoding also when the document is in UTF-16. (META@charset
cannot be used for this, according to the HTML5 rules.)
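
To illustrate why the XML declaration still works in UTF-16, here is a
minimal Python sketch (my own illustration, not code from any browser)
that reads the encoding attribute from the first bytes of a document.
The function name, the 200-byte window and the regular expression are
assumptions made for the example. The byte patterns used for UTF-16
without a BOM are the ones listed in Appendix F of the XML 1.0 spec.

```python
import re

# Hypothetical pattern for the encoding attribute of an XML declaration.
_XML_DECL = re.compile(
    r'^<\?xml\s[^>]*?encoding\s*=\s*["\']([A-Za-z][A-Za-z0-9._-]*)["\']'
)


def xml_declared_encoding(data: bytes):
    """Return the encoding named in the XML declaration, or None."""
    head = data[:200]  # assumed window; the declaration comes first
    # First work out how the declaration itself is serialized.
    if head.startswith(b"\xff\xfe"):        # UTF-16 little-endian BOM
        text = head[2:].decode("utf-16-le", errors="ignore")
    elif head.startswith(b"\xfe\xff"):      # UTF-16 big-endian BOM
        text = head[2:].decode("utf-16-be", errors="ignore")
    elif head.startswith(b"<\x00?\x00"):    # "<?" as UTF-16LE, no BOM
        text = head.decode("utf-16-le", errors="ignore")
    elif head.startswith(b"\x00<\x00?"):    # "<?" as UTF-16BE, no BOM
        text = head.decode("utf-16-be", errors="ignore")
    else:                                   # ASCII-compatible encoding
        if head.startswith(b"\xef\xbb\xbf"):
            head = head[3:]                 # skip the UTF-8 BOM
        text = head.decode("ascii", errors="ignore")
    match = _XML_DECL.match(text)
    return match.group(1) if match else None
```

Called on a UTF-16 serialization of a page that starts with
<?xml version="1.0" encoding="UTF-16"?>, the sketch returns "UTF-16".
For a page without an XML declaration it returns None.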
Without an in-document indication method that works in XML, we
favor text/html consumption/authoring, as we then lack an
in-document method for specifying the encoding to XML parsers and XML
authoring tools. (The META element operates only inside text/html.)
HTML5's encoding detection algorithm vs XML declaration:
I was under the impression that the XML encoding declaration did not
have any effect in text/html parsers ... However, this is not true.
Firefox 4/Safari 5/Opera 10.6 (but not IE6-8, it seems) each has
an encoding guessing feature. In the absence of info from HTTP, BOM or
META@charset, the encoding attribute of the XML declaration decides the
encoding. In Safari/Webkit the guessing is always part of the encoding
detection, it seems, while Opera/Firefox let you enable/disable it. In
Opera it seems enabled by default. Not sure about Firefox.
The new information here - at least to myself - is that the encoding
guessing feature does not only apply a character pattern analysis (as I
had thought), but that it also enables interpretation of the XML
encoding attribute. In fact, whenever Opera and Firefox run with
encoding guessing enabled, pattern analysis does not take place
if there is an XML declaration with an encoding attribute.
Where does the reading of the encoding attribute fit into the encoding
determination algorithm of HTML5? (Section 10.2.2.1 Determining the
character encoding). By testing a UTF-8 encoded file with and without
the BOM, and with false - as well as true - info inside the encoding
attribute, I found that the reading of the encoding attribute happens:
a) if encoding from HTTP is lacking (algorithm step 1)
b) and if the UTF-8 BOM is lacking (algorithm step 3)
c) and there is no META element (algorithm step 4)
d) then the encoding attribute is read! (algorithm step 5). See below.
Thus, it happens:
e) before pattern analysis is initiated (algorithm step 6)
   E.g. for UTF-8 encoded files without a BOM, without the encoding
   attribute in the XML declaration and without any other explicit
   encoding info, the UAs would determine the encoding to be
   UTF-8. But if the file had an encoding attribute which e.g.
   said "ISO-8859-5", then that value would be used.
f) before "an implementation-defined or user-specified default
   character encoding" (algorithm step 7 - last step)
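
The order observed in the list above can be sketched in Python as
follows. This is only my reading of the test results, not the HTML5
algorithm text and not any browser's code. The function name, its
parameters and the "windows-1252" fallback are assumptions made for
the example.

```python
from typing import Callable, Optional, Tuple


def determine_encoding(
    http_charset: Optional[str],
    has_utf8_bom: bool,
    meta_charset: Optional[str],
    xml_decl_encoding: Optional[str],
    pattern_guess: Callable[[], Optional[str]],
    default: str = "windows-1252",  # assumed fallback, varies per UA
) -> Tuple[str, str]:
    """Return (encoding, source) following the observed priority."""
    if http_charset:                 # a) HTTP info (step 1)
        return http_charset, "HTTP"
    if has_utf8_bom:                 # b) UTF-8 BOM (step 3)
        return "UTF-8", "BOM"
    if meta_charset:                 # c) META element (step 4)
        return meta_charset, "META"
    if xml_decl_encoding:            # d) XML declaration (step 5)
        return xml_decl_encoding, "XML declaration"
    guess = pattern_guess()          # e) pattern analysis (step 6)
    if guess:
        return guess, "pattern analysis"
    return default, "default"        # last resort (step 7)
```

For a UTF-8 file without BOM and without any declared encoding,
determine_encoding(None, False, None, None, lambda: "UTF-8") ends up at
the pattern analysis step. With xml_decl_encoding="ISO-8859-5" the
declared value is returned and the pattern analysis is never consulted.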
It could seem as if Safari/Firefox/Opera link the reading of the
encoding attribute to pattern analysis. Even if the encoding attribute
has higher priority than the pattern analysis, the encoding attribute
is not read, it seems to me, unless the browser runs with pattern
matching enabled. (In Safari it seems impossible to disable the pattern
matching - and thus also the reading of the encoding attribute. One can
only overrule Safari's encoding choice _after_ Safari has performed its
guesswork.)
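
The coupling between the guessing feature and the encoding attribute
might be sketched like this in Python. Again this is only a model of
what I observed, with hypothetical names throughout: the attribute is
read only while guessing is enabled, and when it is present the
pattern analysis is skipped.

```python
from typing import Callable, Optional


def guess_encoding(
    guessing_enabled: bool,
    xml_decl_encoding: Optional[str],
    analyse_patterns: Callable[[], Optional[str]],
    fallback: str,
) -> str:
    """Model how the guessing feature treats the encoding attribute."""
    if guessing_enabled:
        if xml_decl_encoding:
            # The attribute wins; pattern analysis is never run.
            return xml_decl_encoding
        guess = analyse_patterns()
        if guess:
            return guess
    # Guessing disabled (or nothing found): the attribute is ignored.
    return fallback
```

With guessing disabled the function returns the fallback even if the
page carries an encoding attribute, which matches what Opera and
Firefox appeared to do in my tests.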
It seems to me that the reading of the encoding attribute is such a
specific thing that it should have been mentioned in the HTML5
algorithm. Currently step 5 of the algorithm only says the following:
]]If the user agent has information on the likely encoding for
this page, e.g. based on the encoding of the page when it
was last visited, then return that encoding, with the
confidence tentative, and abort these steps.[[
However, it does not seem like the reading of the encoding attribute
has anything to do with information that is specific to the particular
user agent ... After all, at least 3 different UAs have managed to
handle the encoding attribute the same way ... Thus HTML5's algorithm
should perhaps be updated here.
PS: I'm aware that for me to suggest that the XML declaration be used
in Polyglot Markup is a deviation from earlier viewpoints. My
viewpoints have developed through the debate.
--
leif halvard silli
Received on Friday, 23 July 2010 14:38:35 UTC