Re: i18n comments on Polyglot Markup from Sam Ruby on 2010-07-15 (public-html@w3.org from July 2010)

From: Sam Ruby <rubys@intertwingly.net>
Date: Thu, 15 Jul 2010 14:42:51 -0400
To: Leif Halvard Silli <xn--mlform-iua@xn--mlform-iua.no>
CC: Anne van Kesteren <annevk@opera.com>, Richard Ishida <ishida@w3.org>, public-html@w3.org
Message-ID: <4C3F56AB.7030105@intertwingly.net>

On 07/15/2010 02:20 PM, Leif Halvard Silli wrote:
> Anne van Kesteren, Thu, 15 Jul 2010 19:53:50 +0200:
>
>>> Do you agree?
>>
>> Not entirely, but mostly.
>
> Maciej, in the past, once treated a similar comment (about an
> accessibility topic) as un-collegial. (Before he became co-chair, I
> gather.) Full explanation and openness is appreciated.

Restoring the original question:

> UTF-16 encoded XML documents, on the other hand, must start with a
> BOM, see http://www.w3.org/TR/REC-xml/#charencoding  When the doc is
> treated as XML, however, the meta element is ignored.
>
> Do you agree?

I disagree that UTF-16 encoded XML documents must start with a BOM.  See:

http://www.w3.org/TR/REC-xml/#sec-guessing-no-ext-info

My personal experience (which may now be dated) is that there are a
number of XML parsers that choke in the presence of a BOM.  But even if
we ignore such parsers, there are still several ways to go given this
information.
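
For reference, the no-BOM detection described in that appendix can be
sketched roughly as follows.  This is a minimal Python illustration, not
a conforming implementation: the function name guess_xml_encoding and
the labels it returns are hypothetical, and it covers only the UTF-8 and
UTF-16 cases (UTF-32, EBCDIC and the rest of the table are left out).

```python
def guess_xml_encoding(data: bytes) -> str:
    """Guess an encoding family from the first bytes of an XML entity.

    Covers only the UTF-8 and UTF-16 rows of the autodetection table;
    the returned labels are illustrative, not standard names.
    """
    # Byte order marks take priority when present.
    if data.startswith(b"\xef\xbb\xbf"):
        return "utf-8 (BOM)"
    if data.startswith(b"\xfe\xff"):
        return "utf-16-be (BOM)"
    if data.startswith(b"\xff\xfe"):
        return "utf-16-le (BOM)"
    # No BOM: look for the bytes of "<?" in each 16-bit byte order.
    if data.startswith(b"\x00<\x00?"):
        return "utf-16-be (no BOM)"
    if data.startswith(b"<\x00?\x00"):
        return "utf-16-le (no BOM)"
    # Otherwise assume UTF-8 unless an encoding declaration says otherwise.
    return "utf-8 (default)"
```

Under these assumptions, a document that begins with an XML declaration
and is encoded as big-endian UTF-16 without a BOM is reported as
"utf-16-be (no BOM)", which is the case at issue here.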

It turns out that <meta charset="utf-16"/> will always be ignored, but
the content will still be processed correctly as long as it really is
encoded as utf-16.  I gather that Richard would prefer that such
elements not be treated as conformance errors, whereas Ian would prefer
that they be treated as conformance errors.
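
One way to see why such an element carries no usable information: in
UTF-16 every ASCII character occupies two bytes, so a byte-level scan
for an ASCII "charset" label does not find it, and a parser that can
read the element has already settled on UTF-16 by other means.  A small
Python sketch, using a made-up sample document and assuming a scan that
compares raw bytes against ASCII:

```python
# Hypothetical sample document, used only for illustration.
doc = '<html><head><meta charset="utf-16"/></head><body/></html>'

utf8_bytes = doc.encode("utf-8")
utf16_bytes = doc.encode("utf-16-le")  # little-endian, no BOM added

# An ASCII byte search finds the label in the UTF-8 bytes...
found_in_utf8 = b"charset" in utf8_bytes
# ...but not in the UTF-16 bytes, where NUL bytes sit between the letters.
found_in_utf16 = b"charset" in utf16_bytes

# Decoding as UTF-16 first recovers the text, at which point the
# declaration only repeats what the decoder already knew.
round_trip_ok = utf16_bytes.decode("utf-16-le") == doc
```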

We could also go a different way entirely, and say that polyglot 
documents are a subset of both HTML5 and XHTML5, and the subset that we 
select is only utf-8.  I mention this as it is my personal
recommendation on the matter, but I can live with either of the other
two alternatives mentioned above.

- Sam Ruby

Received on Thursday, 15 July 2010 18:43:25 UTC