Re: UTF-16, UTF-16BE and UTF-16LE in HTML5 from Henri Sivonen on 2010-07-27 (public-html@w3.org from July 2010)

From: Henri Sivonen <hsivonen@iki.fi>
Date: Tue, 27 Jul 2010 03:33:14 -0700 (PDT)
To: Richard Ishida <ishida@w3.org>
Cc: public-html@w3.org, www-international@w3.org
Message-ID: <28645108.108675.1280226794954.JavaMail.root@cm-mail03.mozilla.org>

Richard Ishida wrote:
> HTML5 says:
> "If an HTML document does not start with a BOM, and if its encoding is
> not explicitly given by Content-Type metadata, and the document is not
> an iframe srcdoc document, then the character encoding used must be an
> ASCII-compatible character encoding..."
> http://dev.w3.org/html5/spec/semantics.html#charset
> 
> This rules out the use of UTF-16BE and UTF-16LE character encodings,
> since they should not start with a BOM.

To me, it seems fine to make UTF-16BE and UTF-16LE non-conforming.

Authors should use UTF-8. We should make everything else as non-conforming as feasible. Unfortunately, the body of legacy content in encodings such as Windows-1252 is so large that making them non-conforming is not feasible. However, it is feasible to make UTF-32, UTF-16BE, UTF-16LE, CESU-8, BOCU-1 and other recent instances of gratuitous encoding proliferation non-conforming.

> A little later, the spec says
> "If an HTML document contains a meta element with a charset attribute
> or a meta element with an http-equiv attribute in the Encoding
> declaration state, then the character encoding used must be an
> ASCII-compatible character encoding."
> 
> This rules out the use of a character encoding declaration with the
> value UTF-16, even in content that is encoded in that encoding.
> 
> I earlier stated my preference to be able to say that a document is
> encoded in UTF-16 in the encoding declaration (in UTF-16 encoded
> documents, of course), because:
> 
> [1] some people will probably add meta elements when using utf-16
> encoded documents, and there's not any harm in it that I can see, so
> no real reason to penalise them for it.

Putting "UTF-16" in a meta means that the author's mental model of how HTML works is wrong. Having the wrong mental model of how stuff works generally leads to trouble at some point, so I think it's reasonable to make this manifestation of a wrong mental model an error.
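
One way to illustrate the point is a minimal Python sketch. It assumes a drastically simplified stand-in for a byte-level meta scan; the function name `ascii_prescan_finds_meta` and the sample markup are hypothetical, and the real HTML5 algorithm is more involved. In UTF-16, each ASCII character occupies two bytes, so the ASCII byte sequence for `<meta` never appears:

```python
# Illustrative only: a simplified stand-in for a byte-level scan that
# looks for an ASCII "<meta" before the encoding is known.
SAMPLE = '<meta charset="UTF-16"><title>x</title>'


def ascii_prescan_finds_meta(data: bytes) -> bool:
    """Return True if the ASCII byte sequence b'<meta' occurs in data."""
    return b"<meta" in data.lower()


utf8_bytes = SAMPLE.encode("utf-8")
utf16le_bytes = SAMPLE.encode("utf-16-le")  # no BOM is emitted
utf16be_bytes = SAMPLE.encode("utf-16-be")  # no BOM is emitted

print(ascii_prescan_finds_meta(utf8_bytes))     # True
print(ascii_prescan_finds_meta(utf16le_bytes))  # False
print(ascii_prescan_finds_meta(utf16be_bytes))  # False
```

Under these assumptions, a meta that says "UTF-16" can only be found by such a scan when the bytes are not UTF-16.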

> [2] i18n folks have long advised that you should always include a
> visible indication of the encoding in a document, HTML or XML, even if
> you don't strictly need to, because it can be very useful for
> developers, testers, or translation production managers who want to
> visually check the encoding of a document.

That's a bad rationale. It's a *very* bad idea to check the encoding by reading a string that doesn't participate in encoding detection at all, since the string may be wrong.
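
As a rough sketch of checking the bytes instead of the label, the following Python inspects the start of a document for a BOM. The helper name `sniff_bom` and the sample document are hypothetical, and only the UTF-8 and UTF-16 BOMs are handled (UTF-32 is deliberately ignored here):

```python
import codecs
from typing import Optional


def sniff_bom(data: bytes) -> Optional[str]:
    """Return the encoding implied by a leading BOM, or None if there is none."""
    if data.startswith(codecs.BOM_UTF8):
        return "UTF-8"
    if data.startswith(codecs.BOM_UTF16_BE):
        return "UTF-16BE"
    if data.startswith(codecs.BOM_UTF16_LE):
        return "UTF-16LE"
    return None


# A document whose visible label disagrees with its actual bytes.
doc = codecs.BOM_UTF16_LE + '<meta charset="UTF-8"><p>hi</p>'.encode("utf-16-le")

print(sniff_bom(doc))  # UTF-16LE
```

Here the in-document label claims UTF-8 while the bytes are little-endian UTF-16, which is the kind of mismatch a visual check of the label would miss.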

> The alternative may be to make it clearer that, although UTF-16 is ok,
> HTML5 and XHTML5 do not accept UTF-16BE and UTF-16LE encoding
> declarations - only UTF-16 with a BOM (which of course covers the same
> serialisations).

I'd be OK with making the non-conformance of UTF-16BE and UTF-16LE more explicit.

-- 
Henri Sivonen
hsivonen@iki.fi
http://hsivonen.iki.fi/

Received on Tuesday, 27 July 2010 10:33:50 UTC