- From: Henri Sivonen <hsivonen@iki.fi>
- Date: Tue, 27 Jul 2010 03:33:14 -0700 (PDT)
- To: Richard Ishida <ishida@w3.org>
- Cc: public-html@w3.org, www-international@w3.org
Richard Ishida wrote:
> HTML5 says:
> "If an HTML document does not start with a BOM, and if its encoding is not
> explicitly given by Content-Type metadata, and the document is not an iframe
> srcdoc document, then the character encoding used must be an
> ASCII-compatible character encoding..."
> http://dev.w3.org/html5/spec/semantics.html#charset
>
> This rules out the use of UTF-16BE and UTF16-LE character encodings, since
> they should not start with a BOM.

To me, it seems fine to make UTF-16BE and UTF-16LE non-conforming. Authors should use UTF-8. We should make everything else as non-conforming as feasible. Unfortunately, the legacy for e.g. Windows-1252 is so large that it's not feasible to make it non-conforming. However, it is feasible to make UTF-32, UTF-16BE, UTF-16LE, CESU-8, BOCU-1 and other recent instances of gratuitous encoding proliferation non-conforming.

> A little later, the spec says
> "If an HTML document contains a meta element with a charset attribute or a
> meta element with an http-equiv attribute in the Encoding declaration
> state, then the character encoding used must be an ASCII-compatible
> character encoding."
>
> This rules out the use of a character encoding declaration with the value
> UTF-16, even in content that is encoded in that encoding.
>
> I earlier stated my preference to be able to say that a document is encoded
> in UTF-16 in the encoding declaration (in UTF-16 encoded documents, of
> course), because:
>
> [1] some people will probably add meta elements when using utf-16 encoded
> documents, and there's not any harm in it that I can see, so no real reason
> to penalise them for it.

Putting "UTF-16" in a meta means that the author's mental model of how HTML works is wrong. Having the wrong mental model of how stuff works generally leads to trouble at some point, so I think it's reasonable to make this manifestation of a wrong mental model an error.

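As a rough illustration of the mental-model point, here is a minimal Python sketch. It is not the HTML5 sniffing algorithm; the helper names sniff_bom and ascii_scan_finds_meta are made up for this example, and the byte search is only a crude stand-in for a scan that assumes an ASCII-compatible encoding. It shows that in a UTF-16 document the BOM is what identifies the encoding, while the bytes of the meta declaration are not even recognizable to a scan that reads the input as ASCII.

```python
import codecs

# Hypothetical helper names; a simplification, not the HTML5 sniffing algorithm.
def sniff_bom(data: bytes):
    """Return an encoding name if the bytes start with a known BOM, else None."""
    if data.startswith(codecs.BOM_UTF8):
        return "UTF-8"
    if data.startswith(codecs.BOM_UTF16_BE):
        return "UTF-16BE"
    if data.startswith(codecs.BOM_UTF16_LE):
        return "UTF-16LE"
    return None

def ascii_scan_finds_meta(data: bytes) -> bool:
    """Crude stand-in for a byte-level scan that assumes ASCII-compatible input."""
    return b"<meta" in data.lower()

markup = '<meta charset="UTF-16">'

# The declaration as it appears in a little-endian UTF-16 document with a BOM.
utf16_doc = codecs.BOM_UTF16_LE + markup.encode("utf-16-le")
# The same characters in an ASCII-compatible encoding, for comparison.
ascii_doc = markup.encode("utf-8")

print(sniff_bom(utf16_doc))              # UTF-16LE
print(ascii_scan_finds_meta(utf16_doc))  # False
print(ascii_scan_finds_meta(ascii_doc))  # True
```

In the UTF-16 bytes every ASCII character is followed by a zero byte, so the sequence "<meta" never appears contiguously. By the time a consumer can read the declaration at all, it has already settled on the encoding by other means.
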
> [2] i18n folks have long advised that you should always include a visible
> indication of the encoding in a document, HTML or XML, even if you don't
> strictly need to, because it can be very useful for developers, testers, or
> translation production managers who want to visually check the encoding of a
> document.

That's a bad rationale. It's a *very* bad idea to check the encoding by reading a string that doesn't participate in encoding detection at all, since the string may be wrong.

> The alternative may be to make it clearer that, although UTF-16 is ok,
> HTML5 and XHTML5 do not accept UTF-16BE and UTF16-LE encoding
> declarations - only UTF-16 with a BOM (which of course covers the same
> serialisations).

I'd be OK with making the non-conformance of UTF-16BE and UTF-16LE more explicit.

-- 
Henri Sivonen
hsivonen@iki.fi
http://hsivonen.iki.fi/
Received on Tuesday, 27 July 2010 10:33:50 UTC