RE: i18n comments on Polyglot Markup from Leif Halvard Silli on 2010-07-15 (public-html@w3.org from July 2010)

From: Leif Halvard Silli <xn--mlform-iua@xn--mlform-iua.no>
Date: Thu, 15 Jul 2010 21:47:53 +0400
To: Richard Ishida <ishida@w3.org>
Cc: 'Anne van Kesteren' <annevk@opera.com>, public-html@w3.org
Message-ID: <20100715214753762201.24e00081@xn--mlform-iua.no>
Richard Ishida, Thu, 15 Jul 2010 17:18:02 +0100:

> 2. If an html doc starts with a UTF-16 shaped BOM, any subsequent 
> meta encoding declarations are ignored. The browser sees the BOM and 
> stops looking for encoding information.  I don't, however, see 
> anything prohibiting the use of the UTF-16 encoded *character* 
> sequence <meta charset="utf-16"> (in UTF-16 characters) in the 
> document. It will just be ignored by the browser when the detection 
> algorithm is run. (see below).
> 
> 3. If there is no BOM and the browser encounters the ASCII *byte* 
> sequence <meta charset="utf-16">, following the detection algorithm, 
> there is something wrong, because you wouldn't see that sequence of 
> bytes in UTF-16.  Therefore the browser assumes that the encoding is 
> actually UTF-8 and stops looking for other encoding information.

Thanks for the helpful questions/explanation.
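
Points 2 and 3 above can be sketched roughly as follows. This is only a 
simplified illustration of the behaviour as described in this thread, 
not the HTML5 detection algorithm itself: the function name 
sniff_encoding, the META_CHARSET pattern and the 1024-byte window are 
assumptions made for the example.

```python
import re
from typing import Optional

# Hypothetical, simplified prescan: it only recognises <meta charset=...>
# written as ASCII bytes, which is all that point 3 talks about.
META_CHARSET = re.compile(
    rb'''<meta\s+charset\s*=\s*["']?([A-Za-z0-9_-]+)''', re.IGNORECASE
)

def sniff_encoding(data: bytes) -> Optional[str]:
    # Point 2: a UTF-16 BOM wins; later meta declarations are never consulted.
    if data.startswith(b"\xff\xfe"):
        return "utf-16le"
    if data.startswith(b"\xfe\xff"):
        return "utf-16be"
    # Point 3: no BOM, so look for the ASCII *byte* sequence of a declaration.
    match = META_CHARSET.search(data[:1024])
    if match is None:
        return None  # the caller would fall back to other heuristics
    declared = match.group(1).decode("ascii").lower()
    # These ASCII bytes cannot occur in real UTF-16, so the label is overridden.
    if declared in ("utf-16", "utf-16le", "utf-16be"):
        return "utf-8"
    return declared
```

With this sketch, a UTF-16 document with a BOM is reported as UTF-16 no 
matter what its meta element says, while the ASCII bytes 
<meta charset="utf-16"> without a BOM are reported as utf-8.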

My first comment is that I think you need to file a bug against HTML5 
if you want “the UTF-16 encoded *character* sequence <meta 
charset="utf-16">” to be valid in an HTML document.

Secondly, here are some observations of what happens in Validator.nu 
when validating an HTML5 document with polyglot markup, first as XHTML 
and then as HTML. The document contains '<meta charset="UTF-16"/>' and 
is UTF-16 encoded with a BOM.
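
For anyone who wants to repeat the experiment, a test file of this kind 
could be produced along the following lines. The document text below is 
a hypothetical reconstruction (the actual test file is not included in 
this message), and the names DOC and encode_with_bom are invented for 
the example.

```python
import codecs

# Hypothetical minimal polyglot document; the real test file is not shown
# in this message. The meta element sits on line 4, as in the error output.
DOC = (
    '<!DOCTYPE html>\n'
    '<html xmlns="http://www.w3.org/1999/xhtml">\n'
    '<head>\n'
    '   <meta charset="UTF-16"/>\n'
    '   <title>test</title>\n'
    '</head>\n'
    '<body><p>test</p></body>\n'
    '</html>\n'
)

def encode_with_bom(text: str) -> bytes:
    # Python's "utf-16" codec prepends a BOM in the platform's native byte order.
    return text.encode("utf-16")

data = encode_with_bom(DOC)
# The first two bytes are the BOM (codecs.BOM_UTF16 is the native-order BOM).
print(data[:2] == codecs.BOM_UTF16)
```

Writing data to a file with an .xhtml or .html suffix would then give a 
document that can be uploaded for either kind of validation.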

When using XHTML validation (I uploaded it as a document with an .xhtml 
suffix), I got this message:

]]  Error: Bad value UTF-16 for attribute charset on XHTML element meta.
]]  From line 4, column 4; to line 4, column 27
]]  <head>↩   <meta charset="UTF-16"/>

That error message does not make sense for XHTML, does it? Why would an 
XML validator dictate the value of an attribute that XML parsers do not 
even look at?  However, the error message could possibly have made 
sense for HTML. (Seemingly, Validator.nu tries to ensure polyglot 
markup ...)

When using HTML validation, I got two messages:

]] Error: Internal encoding declaration specified utf-16 which is
]] not an ASCII superset. Continuing as if the encoding had been utf-8.
]] From line 4, column 4; to line 4, column 27
]] <head>↩   <meta charset="UTF-16"/>↩   <t
]]
]] Error: Internal encoding declaration utf-8 disagrees with the 
]] actual encoding of the document (utf-16).
]] From line 4, column 4; to line 4, column 27
]] <head>↩   <meta charset="UTF-16"/>↩   <t

[...]
> So I think it is fine to have a meta element in a utf-16 encoded 
> document - it just won't be used by the browser for detection.

Again, a bug against HTML5 is then needed; it can't be solved in 
Polyglot Markup alone.

> It is 
> also best that utf-16 encoded documents start with a bom, to avoid 
> reliance on browser heuristics. 

In Polyglot Markup it is not only best, it is, as you have explained, 
_required_ to use the BOM for UTF-16 encoded documents, in order for 
them to be valid XML. But maybe you meant that this follows from what 
you say here: 

> UTF-16 encoded XML documents, on the other hand,

If you consider Polyglot Markup to be XML markup, then it does follow 
... However, I think it is useful to say "Polyglot" or "Polyglot 
Markup" rather than to see such documents as (syntactically) XML 
documents.

> must start with a 
> BOM, see http://www.w3.org/TR/REC-xml/#charencoding  When the doc is 
> treated as XML, however, the meta element is ignored.

When there is a BOM, it seems that, for the special case of UTF-16, 
there is no difference between HTML and XHTML parsing: the meta 
declaration should be ignored in both. Thus, its only remaining purpose 
is as meta information for authors etc.

It could make sense to allow '<meta charset="UTF-16"/>' in HTML 
documents, provided that the document has a valid UTF-16 BOM.  If there 
is no BOM, I do wonder whether it would be logical to require 
validators to perform heuristics in order to decide whether '<meta 
charset="UTF-16"/>' provides correct metadata about the document. 

The difference between polyglot and HTML documents would be that in a 
polyglot document, the validity of '<meta charset="UTF-16"/>' would 
depend on the presence of a BOM, whereas in a pure HTML document, an 
HTTP header announcing the encoding as UTF-16 could probably also count 
as a reason to accept '<meta charset="UTF-16"/>' as valid.
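
The rule suggested here could be expressed roughly like this. It is 
only a sketch of the proposal in this message, not of what HTML5 or 
Validator.nu currently do; the function name meta_utf16_acceptable and 
its parameters are made up for the illustration.

```python
from typing import Optional

def meta_utf16_acceptable(
    data: bytes, polyglot: bool, http_charset: Optional[str] = None
) -> bool:
    """Would <meta charset="UTF-16"/> be accepted under the proposed rule?"""
    # A valid UTF-16 BOM (either byte order) is sufficient in both cases.
    if data.startswith((b"\xff\xfe", b"\xfe\xff")):
        return True
    # Polyglot documents must also work as XML, so only the BOM counts.
    if polyglot:
        return False
    # Pure HTML: an HTTP header announcing UTF-16 could count as well.
    return http_charset is not None and http_charset.lower().startswith("utf-16")
```

For example, a BOM-less document served with "charset=UTF-16" in the 
HTTP header would pass as pure HTML but fail as polyglot markup.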

> Do you agree?

>> From: Anne van Kesteren [mailto:annevk@opera.com]

>> FWIW, using <meta charset=utf-16> is an error in HTML. When used in XML
>> its value must be UTF-8. See
>> http://www.whatwg.org/specs/web-apps/current-work/complete.html#attr-meta-charset
>> 
>> For HTML the requirements are more complex, but UTF-16 is not allowed and
>> when specified is treated as UTF-8 (though if the document is actually
>> encoded as UTF-16 it would be decoded as UTF-16).
-- 
leif halvard silli
Received on Thursday, 15 July 2010 17:48:55 UTC