Re: i18n comments on Polyglot Markup from Sam Ruby on 2010-07-15 (public-html@w3.org from July 2010)

From: Sam Ruby <rubys@intertwingly.net>
Date: Thu, 15 Jul 2010 16:43:37 -0400
To: Richard Ishida <ishida@w3.org>
CC: 'Leif Halvard Silli' <xn--mlform-iua@xn--mlform-iua.no>, 'Anne van Kesteren' <annevk@opera.com>, public-html@w3.org
Message-ID: <4C3F72F9.7070105@intertwingly.net>

On 07/15/2010 04:01 PM, Richard Ishida wrote:
>> From: Sam Ruby [mailto:rubys@intertwingly.net]
>> Sent: 15 July 2010 19:43
> ...
>>> UTF-16 encoded XML documents, on the other hand, must start with a
>>> BOM, see http://www.w3.org/TR/REC-xml/#charencoding  When the doc is
>>> treated as XML, however, the meta element is ignored.
>>>
>>> Do you agree?
>>
>> I disagree that UTF-16 encoded XML documents must start with a BOM.  See:
>>
>> http://www.w3.org/TR/REC-xml/#sec-guessing-no-ext-info
>>
>> My personal experience (which now may be dated) is that there are a
>> number of XML parsers that choke in the presence of a BOM.  But even if
>> we ignore such, there still are quite a few ways to go given this set of
>> information.
>
> Although that section includes methods of detecting utf-16 as well as other
> encodings, I'm basing my conclusion on the following text:
>
> "Entities encoded in UTF-16 MUST and entities encoded in UTF-8 MAY  begin
> with the Byte Order Mark described by Annex H of [ISO/IEC 10646:2000],
> section 16.8 of [Unicode] (the ZERO WIDTH NO-BREAK SPACE character, #xFEFF).
> This is an encoding signature, not part of either the markup or the
> character data of the XML document. XML processors MUST  be able to use this
> character to differentiate between UTF-8 and UTF-16 encoded documents."
> (http://www.w3.org/TR/REC-xml/#charencoding )
>
> Which seems pretty clear.  I will check with some XML gurus.

There is no question that BOMs are allowed by the specification.  In my
prior work with SOAP-based Web Services[1] and with feeds, however, I
found a number of cases where this part of the specification was not
respected.  In fact, not all "XML" parsers support utf-16[2].
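
To make the BOM discussion concrete, here is a minimal Python sketch of
the kind of check involved: look for a byte order mark first, then fall
back to the no-BOM byte patterns for "<?" that the XML spec's
autodetection appendix describes.  The function name sniff_xml_encoding
and the returned labels are made up for this illustration, and the UCS-4
cases are deliberately left out.

```python
import codecs
from typing import Optional


def sniff_xml_encoding(data: bytes) -> Optional[str]:
    """Guess UTF-8 vs UTF-16 from the first bytes of an XML entity.

    Hypothetical helper for illustration only.  UCS-4/UTF-32 is not
    handled, so a UTF-32-LE BOM (FF FE 00 00) would be misreported as
    UTF-16-LE here.
    """
    # Byte order marks.
    if data.startswith(codecs.BOM_UTF8):
        return "utf-8 (BOM)"
    if data.startswith(codecs.BOM_UTF16_BE):
        return "utf-16-be (BOM)"
    if data.startswith(codecs.BOM_UTF16_LE):
        return "utf-16-le (BOM)"
    # No BOM: "<?" encoded as two-byte code units.
    if data.startswith(b"\x00<\x00?"):
        return "utf-16-be (no BOM)"
    if data.startswith(b"<\x00?\x00"):
        return "utf-16-le (no BOM)"
    # Anything else is left to other mechanisms (declaration, HTTP, ...).
    return None


with_bom = codecs.BOM_UTF16_LE + "<a/>".encode("utf-16-le")
without_bom = '<?xml version="1.0" encoding="utf-16"?><a/>'.encode("utf-16-be")
print(sniff_xml_encoding(with_bom))
print(sniff_xml_encoding(without_bom))
```

A parser that only implements the first three branches would reject the
second document, and one that implements only the last two may choke on
the first, which is roughly the interoperability gap described above.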

>> It turns out that <meta charset="utf-16"/> will always be ignored but
>> the content processed correctly if the content is correctly encoded as
>> utf-16.  I gather that Richard would prefer that such elements not be
>> treated as conformance errors, whereas Ian would prefer that such
>> elements be treated as conformance errors.
>
> I have two reasons for my preference.
>
> [1] some people will probably add meta elements when using utf-16 encoded
> documents, and there's not any harm in it that I can see, so no real reason
> to penalise them for it.
>
> [2] i18n folks have long advised that you should always include a visible
> indication of the encoding in a document, HTML or XML, even if you don't
> strictly need to, because it can be very useful for developers, testers, or
> translation production managers who want to visually check the encoding of a
> document.

OK, then I'd suggest first opening a bug report against the HTML5 spec
suggesting that such meta elements be allowed in both the HTML and XHTML
serializations.

http://tinyurl.com/2vvv8vz

Depending on how that bug is resolved, a subsequent bug report against 
the polyglot spec would be in order.

Again, I wouldn't object to allowing the combination of utf-16 encoded
content with corresponding meta tags; it just isn't something that I
would personally recommend.  It is clearly in "there be dragons"
territory, and given the precipices on both sides of the narrow common
path between XML and HTML5, it seems to me to be an unnecessary
distraction.  My (mild) preference is that encodings other than utf-8 be
relegated to a footnote which mentions the possibility and perhaps
outlines a few of the dangers.

What we have now is a top-level section which permits more than is
currently considered to be conforming (and, in the case of some
non-ASCII based encodings, more than will actually work interoperably),
and which disallows combinations that will work but may not be
recommended (utf-8 without either a meta tag or an XML declaration).
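
As a rough illustration of the last case, the Python sketch below feeds
a utf-8 document with neither a meta tag nor an XML declaration to the
standard library's XML parser.  It only exercises the XML half of the
claim; how an HTML parser treats the same bytes depends on things like
the HTTP headers, which this sketch does not model.  The sample document
and variable names are invented for the example.

```python
import xml.etree.ElementTree as ET

XHTML_NS = "http://www.w3.org/1999/xhtml"

# No XML declaration, no <meta charset>: an XML processor falls back
# to UTF-8 when there is no BOM and no encoding declaration.
doc = (
    '<html xmlns="' + XHTML_NS + '">'
    "<head><title>caf\u00e9</title></head><body/></html>"
).encode("utf-8")

root = ET.fromstring(doc)
title = root.find("x:head/x:title", {"x": XHTML_NS})
print(title.text)
```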

- Sam Ruby

[1] http://stackoverflow.com/questions/56812/bom-not-expected-in-cf-but-sent-by-iis-sp
[2] http://www.intertwingly.net/blog/2004/06/03/Aggregator-utf-16-tests.html
Received on Thursday, 15 July 2010 20:44:11 UTC