Re: i18n comments on Polyglot Markup

Anne van Kesteren, Tue, 13 Jul 2010 22:19:55 +0200:
> On Tue, 13 Jul 2010 21:40:24 +0200, Richard Ishida <ishida@w3.org> wrote:
>> Thank you for beginning work on the polyglot document.  I think it 
>> will be very useful.  FWIW, I would welcome an approach to the text 
>> that made it more like an author-friendly "how-to" guide, rather 
>> than spec text.
>> 
>> I am about to raise 8 bugs in bugzilla.  These comments have been 
>> discussed by the i18n WG.  I hope you find them helpful.
>> 
>> FWIW, the i18n group keeps track of comments on your doc at 
>> http://www.w3.org/International/reviews/1007-polyglot/
> 
> FWIW, using <meta charset=utf-16> is an error in HTML.

First, when Sam said a _similar_ (but not the same) thing, Lachlan 
replied that UTF-16 is "perfectly acceptable for HTML" [1]. Based on 
my reading of what Lachlan said, I think I may have advised Elliot to 
say that <meta charset="UTF-16"/> should be legal in polyglot markup.

After seeing your comment, Anne, I now realize that <meta 
charset=utf-16> causes an HTML5 parser (or at least Validator.nu) to 
treat the document as UTF-8 (or, at any rate, that it is illegal). 
Hence, yes, <meta charset="utf-16" /> is an error in HTML.

> When used in XML its value must be UTF-8.

Since XML parsers do not decide the encoding based on <meta 
charset="utf-16" />, I cannot see that you are correct here. But of 
course, if you are talking about _polyglot markup_, then <meta 
charset="utf-16" /> cannot be used, due to the problems in _HTML_ 
parsers.
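
To illustrate the XML side of this, here is a minimal sketch using 
Python's standard xml.etree.ElementTree module (my choice for the 
illustration, not something discussed in this thread). The sample 
document is made up: it is encoded as UTF-8 but carries a <meta 
charset="utf-16" />. The XML parser treats that element as ordinary 
markup and still decodes the bytes as UTF-8.

```python
import xml.etree.ElementTree as ET

XHTML_NS = "{http://www.w3.org/1999/xhtml}"

# Hypothetical sample: the bytes are UTF-8, yet the meta element
# claims UTF-16. There is no BOM and no XML declaration.
doc = (
    '<html xmlns="http://www.w3.org/1999/xhtml"><head>'
    '<meta charset="utf-16" />'
    '<title>caf\u00e9</title>'
    '</head><body/></html>'
).encode("utf-8")

# The XML parser falls back to UTF-8 and never consults the meta
# element when deciding how to decode the byte stream.
root = ET.fromstring(doc)
title = root.find(f"{XHTML_NS}head/{XHTML_NS}title").text
meta = root.find(f"{XHTML_NS}head/{XHTML_NS}meta")

print(title)
print(meta.get("charset"))
```

The non-ASCII title comes through intact, and the charset attribute 
is just an attribute value as far as the XML parser is concerned.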

> See 
> http://www.whatwg.org/specs/web-apps/current-work/complete.html#attr-meta-charset

If you had pointed to the location in the multi-page version instead 
of the 10MB complete spec, I could actually afford to look it up ... 
Alternatively, you could have said which section it is in.

I found this in section '4.2.5.5. Specifying the document's character 
encoding' (in a PDF copy dated 18 February 2010), which I now 
understand much better:

]] If an HTML document does not start with a BOM, and if its encoding 
is not explicitly given by Content-Type metadata, and the document is 
not an iframe srcdoc document, then the character encoding used must 
be an ASCII-compatible character encoding [[

> For HTML the requirements are more complex, but UTF-16 is not allowed 

I guess "more complex" translates to the following: in HTML5, UTF-16 
_is_ permitted, provided that the encoding is specified via BOM. 
Whereas, as noted above, specifying UTF-16 via <meta charset="utf-16" 
/> is not permitted.

Pretty simple, actually.
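
As a rough sketch of that rule as I read it (not a conformance 
checker), the following Python snippet sniffs the BOM and then looks 
for a <meta charset> declaration. The function names, the regular 
expression and the result strings are all invented for this 
illustration, and UTF-32 BOMs are deliberately ignored.

```python
import codecs
import re

# Simplified pattern for <meta charset=...>; a real HTML parser is
# far more forgiving about attribute order and quoting.
META_RE = re.compile(r'<meta\s+charset\s*=\s*["\']?([\w-]+)', re.I)


def sniff_bom(data: bytes):
    """Return 'utf-8' or 'utf-16' if data starts with that BOM, else None."""
    if data.startswith(codecs.BOM_UTF8):
        return "utf-8"
    if data.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        return "utf-16"
    return None


def check_utf16_usage(data: bytes) -> str:
    """Apply the rule: UTF-16 via BOM is fine, via <meta charset> is not."""
    bom = sniff_bom(data)
    # Decode with the BOM-indicated encoding, or as ASCII when there
    # is no BOM, so that the meta element can be searched as text.
    text = data.decode(bom or "ascii", errors="replace")
    match = META_RE.search(text)
    declared = match.group(1).lower() if match else None
    if declared is not None and declared.startswith("utf-16"):
        return "error: UTF-16 declared via <meta charset>"
    if bom == "utf-16":
        return "ok: UTF-16 signalled by BOM"
    return "ok: no UTF-16 involved"


# Python's "utf-16" codec prepends a BOM when encoding.
with_bom = "<!DOCTYPE html><title>x</title>".encode("utf-16")
with_meta = b'<!DOCTYPE html><meta charset="utf-16" /><title>x</title>'

print(check_utf16_usage(with_bom))
print(check_utf16_usage(with_meta))
```

The first document passes because the BOM alone carries the encoding; 
the second is flagged because it relies on the meta element.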

> and when specified is treated as UTF-8 (though if the document is 
> actually encoded as UTF-16 it would be decoded as UTF-16).

In conclusion, I believe the polyglot markup spec should continue to 
allow UTF-16, as long as the encoding is specified via BOM.

[1] http://www.w3.org/mid/4BD00FC4.8010403@lachy.id.au
-- 
leif halvard silli

Received on Thursday, 15 July 2010 14:55:55 UTC