Polyglot Markup/XML encoding declaration from Leif Halvard Silli on 2010-07-23 (public-i18n-core@w3.org from July to September 2010)

From: Leif Halvard Silli <xn--mlform-iua@xn--mlform-iua.no>
Date: Fri, 23 Jul 2010 17:26:42 +0300
To: HTMLwg <public-html@w3.org>
Cc: Eliot Graff <eliotgra@microsoft.com>, public-i18n-core@w3.org
Message-ID: <20100723172642878890.c3eca3dc@xn--mlform-iua.no>
Proposal: Polyglot Markup should allow the document encoding to be set 
via the encoding attribute of the XML declaration. The XML declaration, 
including the encoding attribute, thus becomes an HTML5 extension, 
whenever polyglot markup is being consumed as HTML. (See my previous 
letter to Sam, about the XML declaration as polyglot markup indicator.) 

	Justification/Usefulness:

HTML5's encoding detection algorithm is incomplete: it does not 
account for the fact that user agents actually use the encoding 
attribute of an XML declaration when other encoding info is lacking.

Via the XML declaration we get a method for setting the encoding which 
works in XML as well as, for those UAs that support it, in HTML, even 
though the recommended way should still be to rely on META@charset for 
HTML-parsing reliability.

   Also, the XML declaration's encoding attribute can even be used in 
UTF-16 encoded pages. Thus we get an in-document way to specify the 
encoding also when the document is in UTF-16. (The META attribute 
cannot be used for this, according to the HTML5 rules.) 

   Without an in-document indication method that works in XML, we 
favor text/html consumption/authoring, as we then lack an 
in-document method for specifying the encoding to XML parsers and XML 
authoring tools. (The META element operates only inside text/html.)

    HTML5's encoding detection algorithm vs XML declaration:

I was under the impression that the XML encoding declaration did not 
have any effect in text/html parsers ...  However, this is not true. 

Firefox 4, Safari 5 and Opera 10.6 (but not IE6-8, it seems) each have 
an encoding guessing feature. In the absence of info from HTTP, BOM or 
META@charset, the encoding attribute of the XML declaration decides the 
encoding. In Safari/Webkit the guessing is always part of the encoding 
detection, it seems, while Opera and Firefox let you enable or disable 
it. In Opera it seems enabled by default. Not sure about Firefox.
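
To make the behaviour described above concrete, here is a minimal 
Python sketch of how the encoding attribute could be read from the 
first bytes of a document. The helper name xml_declared_encoding and 
the regular expression are assumptions made for illustration, not how 
any of the browsers mentioned actually implement it, and the sketch 
only covers ASCII-compatible encodings (it would not find the 
declaration in a UTF-16 file).

```python
import re
from typing import Optional

# Hypothetical pattern: an XML declaration at the very start of the bytes,
# with an optional encoding attribute in double or single quotes.
_XML_DECL = re.compile(
    rb'^<\?xml\s+version\s*=\s*(?:"[^"]*"|\'[^\']*\')'
    rb'(?:\s+encoding\s*=\s*'
    rb'(?:"([A-Za-z][A-Za-z0-9._-]*)"|\'([A-Za-z][A-Za-z0-9._-]*)\'))?'
)


def xml_declared_encoding(data: bytes) -> Optional[str]:
    """Return the encoding named in a leading XML declaration, if any."""
    match = _XML_DECL.match(data)
    if match is None:
        return None  # no XML declaration at the start of the document
    name = match.group(1) or match.group(2)
    return name.decode("ascii") if name else None


if __name__ == "__main__":
    sample = b'<?xml version="1.0" encoding="ISO-8859-5"?><html></html>'
    print(xml_declared_encoding(sample))
```

A document without an XML declaration, or with a declaration that has 
no encoding attribute, yields None, in which case a UA would move on to 
its other sources of encoding information.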

The new information here - at least to myself - is that the encoding 
guessing feature does not only apply a character pattern analysis (as I 
had thought), but that it also enables interpretation of the XML 
encoding attribute. In fact, whenever Opera and Firefox run with 
encoding guessing enabled, pattern analysis does not take place 
if there is an XML declaration with an encoding attribute.

Where does the reading of the encoding attribute fit into the encoding 
determination algorithm of HTML5 (Section 10.2.2.1, Determining the 
character encoding)? By testing a UTF-8 encoded file with and without 
the BOM, and with false - as well as true - info inside the encoding 
attribute, I found that the reading of the encoding attribute happens:

a) if encoding from HTTP is lacking (algorithm step 1)
b) and if the UTF-8 BOM is lacking  (algorithm step 3)
c) and there is no META element (algorithm step 4)
d) then encoding attribute is read! (algorithm step 5). See below.

Thus, it happens:

e) before pattern analysis is initiated (algorithm step 6)
   E.g. for UTF-8 encoded files without BOM, without the encoding
   attribute in the XML declaration and without any other explicit
   encoding info, the UAs would determine the encoding to be
   UTF-8 by pattern analysis. But if the file had an encoding
   attribute which e.g. said "ISO-8859-5", then that value would
   be used.
f) before "an implementation-defined or user-specified default
   character encoding" (algorithm step 7 - last step)
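
The order of precedence observed above can be sketched in Python as 
follows. This is only an illustration of the observations in this 
letter, not the HTML5 algorithm itself and not any browser's code. The 
function name determine_encoding, its parameters, the simplified META 
regular expression and the "windows-1252" fallback are all assumptions; 
pattern_guess stands in for whatever pattern analysis a UA performs.

```python
import codecs
import re
from typing import Callable, Optional, Tuple


def determine_encoding(
    data: bytes,
    http_charset: Optional[str] = None,
    guessing_enabled: bool = True,
    pattern_guess: Optional[Callable[[bytes], Optional[str]]] = None,
    default: str = "windows-1252",  # assumed fallback, for illustration only
) -> Tuple[str, str]:
    """Return (encoding, source) following the order observed in the tests."""
    # a) Encoding from HTTP wins (algorithm step 1).
    if http_charset:
        return http_charset, "HTTP"
    # b) Then the UTF-8 BOM (algorithm step 3).
    if data.startswith(codecs.BOM_UTF8):
        return "UTF-8", "BOM"
    # c) Then a META element (algorithm step 4); simplified prescan.
    meta = re.search(
        rb'<meta[^>]*charset\s*=\s*["\']?([A-Za-z0-9._-]+)', data[:1024], re.I
    )
    if meta:
        return meta.group(1).decode("ascii"), "META"
    if guessing_enabled:
        # d) The encoding attribute is read only when guessing is enabled.
        decl = re.match(
            rb'<\?xml[^>]*?encoding\s*=\s*["\']([A-Za-z0-9._-]+)["\']', data
        )
        if decl:
            return decl.group(1).decode("ascii"), "XML declaration"
        # e) Pattern analysis comes after the encoding attribute (step 6).
        if pattern_guess is not None:
            guessed = pattern_guess(data)
            if guessed:
                return guessed, "pattern analysis"
    # f) Last step: implementation-defined or user-specified default (step 7).
    return default, "default"


if __name__ == "__main__":
    doc = b'<?xml version="1.0" encoding="ISO-8859-5"?><html></html>'
    print(determine_encoding(doc))
    print(determine_encoding(doc, guessing_enabled=False))
```

With guessing enabled, the declared "ISO-8859-5" is used even though a 
pattern analysis might have concluded otherwise; with guessing disabled, 
the sketch falls through to the default, matching the observation that 
the attribute is not read unless guessing is on.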

It could seem as if Safari/Firefox/Opera link the reading of the 
encoding attribute to pattern analysis. Even if the encoding attribute 
has higher priority than the pattern analysis, the encoding attribute 
is not read, it seems to me, unless the browser runs with pattern 
matching enabled. (In Safari it seems impossible to disable the pattern 
matching - and thus also the reading of the encoding attribute. One can 
only overrule Safari's encoding choice _after_ Safari has performed its 
guesswork.)

It seems to me that the reading of the encoding attribute is such a 
specific thing that it should have been mentioned in the HTML5 
algorithm. Currently step 5 of the algorithm only says the following:

    ]]If the user agent has information on the likely encoding for
      this page, e.g. based on the encoding of the page when it 
      was last visited, then return that encoding, with the 
      confidence tentative, and abort these steps.[[

However, it does not seem like the reading of the encoding attribute 
has anything to do with information that is specific to the particular 
user agent ... After all, at least 3 different UAs have managed to 
handle the encoding attribute the same way ... Thus HTML5's algorithm 
should perhaps be updated here.

PS: I'm aware that my suggesting that the XML declaration be used in 
Polyglot Markup is a deviation from my earlier viewpoints. My viewpoints 
have developed through the debate.
-- 
leif halvard silli
Received on Friday, 23 July 2010 14:38:34 UTC