Re: i18n comments on Polyglot Markup from Maciej Stachowiak on 2010-07-15 (public-html@w3.org from July 2010)

From: Maciej Stachowiak <mjs@apple.com>
Date: Thu, 15 Jul 2010 13:33:58 -0700
To: Richard Ishida <ishida@w3.org>
Cc: 'Sam Ruby' <rubys@intertwingly.net>, 'Leif Halvard Silli' <xn--mlform-iua@xn--mlform-iua.no>, 'Anne van Kesteren' <annevk@opera.com>, public-html@w3.org
Message-id: <AC8623B9-FF5E-4AC4-AEB8-ED92F749F1F6@apple.com>

On Jul 15, 2010, at 1:01 PM, Richard Ishida wrote:

> 
>> 
>> We could also go a different way entirely, and say that polyglot
>> documents are a subset of both HTML5 and XHTML5, and the subset that we
>> select is only utf-8.  I mention this as this is my personal
>> recommendation on the matter, but I can live with either of the other two
>> alternatives mentioned above.
> 
> While it would be wonderful to live in a world where only utf-8 encodings
> are allowed, I'm not sure we can do that.  I think we need to acknowledge
> that these are XML documents, and although we certainly constrain the
> vocabulary, I'm leery about messing with the right of people to use other
> encodings if they insist. 

The very nature of the Polyglot document is that it does not allow arbitrary XML, HTML or XHTML constructs - only those that can be used to generate documents that are conforming both ways and work sufficiently the same both ways. The target audience of this spec will likely want to follow best practices, so requiring UTF-8 should be an acceptable constraint.

I am not informed enough to have an opinion on the underlying technical merits overall, but I will mention a few things from implementation experience with this stuff:

- the behavior of <meta charset=utf-16> in text/html (it will never switch you to the UTF-16 encoding unless you were already using it) is quite confusing; it certainly surprised me when I first learned about it years ago
- character encodings that are not an ASCII superset (an ASCII superset being one where ASCII characters appear as the same single byte they are in ASCII) can be problematic to handle, and can also lead to security problems when content authors do filtering in a way that assumes ASCII.
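
To illustrate the second point, here is a minimal Python sketch of how a byte-level filter that assumes ASCII can be bypassed by UTF-16 content. The function name naive_filter_allows and the sample payload are hypothetical, made up for illustration; they do not come from any particular implementation.

```python
def naive_filter_allows(raw: bytes) -> bool:
    """Hypothetical filter: rejects input containing the ASCII bytes '<script'.

    It inspects raw bytes and so silently assumes an ASCII-superset encoding.
    """
    return b"<script" not in raw.lower()


payload = "<script>alert(1)</script>"

# UTF-8 is an ASCII superset: each ASCII character is the same single byte.
utf8_bytes = payload.encode("utf-8")

# UTF-16 is not: each ASCII character becomes two bytes (here, byte + 0x00),
# so the byte sequence b"<script" never appears contiguously.
utf16_bytes = payload.encode("utf-16-le")

print(naive_filter_allows(utf8_bytes))    # filter catches the UTF-8 form
print(naive_filter_allows(utf16_bytes))   # filter misses the UTF-16 form
print(utf16_bytes.decode("utf-16-le") == payload)  # yet it decodes to the same markup
```

A consumer that later decodes those bytes as UTF-16 sees exactly the markup the filter was meant to block, which is one reason restricting the subset to UTF-8 simplifies things.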

Regards,
Maciej

Received on Thursday, 15 July 2010 20:34:38 UTC