- From: Richard Ishida <ishida@w3.org>
- Date: Mon, 25 Jul 2011 08:28:06 +0100
- To: "public-i18n-core@w3.org" <public-i18n-core@w3.org>
[resent to get it into tracker]

i18n-ISSUE-77: HTTP and defaulting to UTF-16LE

Date: Thu, 21 Jul 2011 18:41:45 +0900
From: "Martin J. Dürst" <duerst@it.aoyama.ac.jp>
Organization: Aoyama Gakuin University
To: Richard Ishida <ishida@w3.org>
CC: public-i18n-core@w3.org <public-i18n-core@w3.org>

Hello Richard,

On 2011/07/20 23:55, Richard Ishida wrote:
> 8.2.2.2 Character encodings
> http://www.w3.org/TR/html5/parsing.html#character-encodings-0
>
> "When a user agent is to use the UTF-16 encoding but no BOM has been
> found, user agents must default to UTF-16LE."
>
> If the HTTP header declares the file to be UTF-16BE, which I believe it
> can, and in which case a BOM should *not* be used, then I think that
> this would not be true.

This strictly depends on what "the UTF-16 encoding" means in the sentence you cite. If it means "the encoding labeled as 'UTF-16'", then this doesn't include encodings labeled UTF-16BE, and therefore there is no problem. If "the UTF-16 encoding" means "any encoding that works like UTF-16, independent of the label and other details", then you are right.

My impression from reading "8.2.2.2 Character encodings" is that it's talking about the encoding labeled "UTF-16", but it might be helpful to check and/or clarify.

UTF-16 is a very special case (UTF-32 has similar issues, but is much less important in practice, in particular across the network), because it's easy to mix up UTF-16 the general encoding method used for Unicode with code units of 16 bits and 'UTF-16' the character encoding (charset) label. (Also, in implementations, it's sometimes important to be able to separately set "BOM/noBOM", "LE/BE", and the actual label, which is difficult if a converter or output routine only takes a 'charset' label as a parameter.)

> If the HTTP header declares the file to be
> UTF-16, then there must be a BOM, so I assume that this is a recovery
> mechanism if someone does declare UTF-16 in HTTP but omits the BOM. I'd
> think that some kind of error message would be in order though.

You want an error message like "missing BOM on UTF-16 page"? That's good for a validator, but not for a browser.

...

Regards, Martin.
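
For illustration only, the narrower reading discussed in the message (the little-endian default applies to the label "UTF-16", not to "UTF-16BE" or "UTF-16LE") might be sketched as follows. The function name `pick_utf16_decoder` is hypothetical, and this is not the HTML5 parsing algorithm itself, just a minimal model of the label/BOM distinction under that assumption.

```python
import codecs
from typing import Tuple


def pick_utf16_decoder(label: str, data: bytes) -> Tuple[str, bytes]:
    """Return (Python codec name, payload) for a UTF-16-family charset label.

    Hypothetical helper: models the reading in which only the bare
    'UTF-16' label triggers BOM sniffing and the little-endian default.
    """
    name = label.strip().lower()

    # Label already fixes the byte order: no BOM is expected or consulted.
    if name == "utf-16be":
        return "utf-16-be", data
    if name == "utf-16le":
        return "utf-16-le", data

    if name == "utf-16":
        # A BOM, if present, decides the byte order and is stripped.
        if data.startswith(codecs.BOM_UTF16_BE):
            return "utf-16-be", data[len(codecs.BOM_UTF16_BE):]
        if data.startswith(codecs.BOM_UTF16_LE):
            return "utf-16-le", data[len(codecs.BOM_UTF16_LE):]
        # No BOM found: default to little-endian, as in the quoted sentence.
        return "utf-16-le", data

    raise ValueError("not a UTF-16-family label: %r" % label)


def decode_utf16(label: str, data: bytes) -> str:
    """Decode data using the codec chosen by pick_utf16_decoder."""
    codec, payload = pick_utf16_decoder(label, data)
    return payload.decode(codec)
```

Under this sketch, a BOM-less body served as "UTF-16BE" is decoded big-endian, so the default never applies to it; only a BOM-less body labeled plain "UTF-16" falls back to little-endian.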
Received on Monday, 25 July 2011 07:28:35 UTC