RE: i18n comments on Polyglot Markup

From: Richard Ishida <ishida@w3.org>
Date: Thu, 15 Jul 2010 17:18:02 +0100
To: "'Anne van Kesteren'" <annevk@opera.com>
Cc: <public-html@w3.org>, "'Leif Halvard Silli'" <xn--mlform-iua@xn--mlform-iua.no>
Message-ID: <007b01cb2439$4cb5a6a0$e620f3e0$@org>

Hi Anne,

My understanding from reading section 4.2.5.5 of the HTML5 spec [1] is the following:

1. An HTML5 document can be served as UTF-16.  

The following points assume that no encoding info is supplied in the HTTP header.

2. If an HTML doc starts with a UTF-16 BOM, any subsequent meta encoding declarations are ignored. The browser sees the BOM and stops looking for encoding information. I don't, however, see anything prohibiting the use of the *character* sequence <meta charset="utf-16"> (encoded in UTF-16) in the document. It will just be ignored by the browser when the detection algorithm is run (see below).

3. If there is no BOM and the browser, following the detection algorithm, encounters the ASCII *byte* sequence <meta charset="utf-16">, something is wrong, because you wouldn't see that sequence of bytes in a UTF-16 document. Therefore the browser assumes that the encoding is actually UTF-8 and stops looking for other encoding information.

4. If a UTF-16 encoded doc does not start with a BOM, the meta declaration will not be recognized as anything by the detection algorithm, because the bytes don't match the pattern being looked for in the algorithm.  A browser could, however, use heuristics after all else fails to detect that the encoding is UTF-16, though this has nothing to do with any meta element.
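To make points 2 to 4 concrete, here is a rough Python sketch of the behaviour as I read it. It is not the spec's algorithm: the function name sniff_encoding, the 1024-byte prescan limit and the regular expression are my own simplifications, and the real prescan is considerably more involved.

```python
import re

# Simplified stand-in for the spec's meta prescan (an assumption, not the real rules).
# It only matches when the declaration is present as ASCII-compatible bytes.
META_CHARSET = re.compile(
    rb'<meta[^>]*charset\s*=\s*["\']?([A-Za-z0-9_\-]+)', re.IGNORECASE
)


def sniff_encoding(data: bytes, prescan_limit: int = 1024):
    """Return (encoding, source); encoding is None when nothing was detected."""
    # Point 2: a BOM is decisive, so any meta declaration is never consulted.
    if data.startswith(b"\xff\xfe"):
        return ("utf-16-le", "bom")
    if data.startswith(b"\xfe\xff"):
        return ("utf-16-be", "bom")
    if data.startswith(b"\xef\xbb\xbf"):
        return ("utf-8", "bom")

    match = META_CHARSET.search(data[:prescan_limit])
    if match:
        label = match.group(1).decode("ascii").lower()
        # Point 3: an ASCII-readable declaration cannot be true of a UTF-16
        # document, so a utf-16 label is treated as UTF-8.
        if label.startswith("utf-16"):
            return ("utf-8", "meta")
        return (label, "meta")

    # Point 4: BOM-less UTF-16 bytes never match the ASCII pattern, so the
    # meta element is invisible here and only heuristics could help.
    return (None, "unknown")


declaration = '<meta charset="utf-16">'
bom_doc = b"\xff\xfe" + declaration.encode("utf-16-le")
ascii_doc = declaration.encode("ascii")
bomless_doc = declaration.encode("utf-16-le")
```

With these three inputs, bom_doc is detected from the BOM, ascii_doc falls back to UTF-8, and bomless_doc yields no result at all, which is where browser heuristics would have to take over.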

So I think it is fine to have a meta element in a UTF-16 encoded document - it just won't be used by the browser for detection. It is also best that UTF-16 encoded documents start with a BOM, to avoid reliance on browser heuristics.
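As a small illustration of the BOM advice, assuming an author generates the file with Python: the generic 'utf-16' codec prepends a BOM when encoding, while the byte-order-specific 'utf-16-le' codec does not, so the BOM has to be added by hand in that case. The sample markup is made up for the example.

```python
import codecs

# Hypothetical sample document, used only for illustration.
text = "<!DOCTYPE html><html><head><title>t</title></head><body></body></html>"

# The generic codec writes a BOM in the platform's native byte order.
with_bom = text.encode("utf-16")

# The byte-order-specific codec writes no BOM at all.
without_bom = text.encode("utf-16-le")

# Adding the BOM explicitly gives a document that does not rely on heuristics.
explicit = codecs.BOM_UTF16_LE + without_bom

boms = (codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)
has_bom = with_bom[:2] in boms
missing_bom = without_bom[:2] not in boms
```

Decoding either with_bom or explicit using the generic 'utf-16' codec consumes the BOM and returns the original text.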

UTF-16 encoded XML documents, on the other hand, must start with a BOM (see http://www.w3.org/TR/REC-xml/#charencoding ). When the doc is treated as XML, however, the meta element is ignored.

Do you agree?

RI



[1] http://dev.w3.org/html5/spec/semantics.html#charset

============
Richard Ishida
Internationalization Lead
W3C (World Wide Web Consortium)

http://www.w3.org/International/
http://rishida.net/




> -----Original Message-----
> From: Anne van Kesteren [mailto:annevk@opera.com]
> Sent: 13 July 2010 21:20
> To: public-html@w3.org; Richard Ishida
> Subject: Re: i18n comments on Polyglot Markup
> 
> On Tue, 13 Jul 2010 21:40:24 +0200, Richard Ishida <ishida@w3.org> wrote:
> > Thank you for beginning work on the polyglot document.  I think it will
> > be very useful.  FWIW, I would welcome an approach to the text that made
> > it more like an author-friendly "how-to" guide, rather than spec text.
> >
> > I am about to raise 8 bugs in bugzilla.  These comments have been
> > discussed by the i18n WG.  I hope you find them helpful.
> >
> > FWIW, the i18n group keeps track of comments on your doc at
> > http://www.w3.org/International/reviews/1007-polyglot/
> 
> FWIW, using <meta charset=utf-16> is an error in HTML. When used in XML
> its value must be UTF-8. See
> http://www.whatwg.org/specs/web-apps/current-work/complete.html#attr-meta-charset
> 
> For HTML the requirements are more complex, but UTF-16 is not allowed and
> when specified is treated as UTF-8 (though if the document is actually
> encoded as UTF-16 it would be decoded as UTF-16).
> 
> 
> --
> Anne van Kesteren
> http://annevankesteren.nl/
> 
Received on Thursday, 15 July 2010 16:18:41 UTC
