Re: UTF-16 without BOM is exiled from Ian Hickson on 2009-06-06 (public-html-comments@w3.org from June 2009)

From: Ian Hickson <ian@hixie.ch>
Date: Sat, 6 Jun 2009 01:32:01 +0000 (UTC)
To: Kornél Pál <kornelpal@gmail.com>
Cc: public-html-comments@w3.org
Message-ID: <Pine.LNX.4.62.0906060110060.1648@hixie.dreamhostps.com>

On Thu, 7 May 2009, Kornél Pál wrote:
> 
> After having a look at the HTML5 editor's draft I believe that 8.2.2.3 
> incorrectly instructs changing UTF-16 to UTF-8.
> 
> UTF-16 without BOM cannot be detected using the sniffing algorithm 
> because it is incompatible with ASCII. The browser may guess (step 6) 
> that it's UTF-16, but that guess will only be tentative.
> 
> Afterwards, the parser may find an encoding specified in the UTF-16 text.
> 
> If the encoding found is UTF-16 then step 1 of 8.2.2.3 instructs 
> changing it to UTF-8, which is definitely wrong.

I've switched steps 1 and 2 in the "change the encoding" algorithm to 
handle the case where UTF-16 is sniffed tentatively and later confirmed.


> Another problem is that if you were able to find an encoding name other 
> than UTF-16 in a valid HTML code decoded as if it were UTF-16 you 
> shouldn't restart parsing because if it isn't UTF-16 then the encoding 
> found is not accurate either.

Good point. I've made the algorithm just do nothing if the old encoding is 
UTF-16.
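
A minimal sketch of how the revised "change the encoding" steps might 
behave, based only on the two changes described above. The function names, 
the "keep"/"redecode" return values, and the exact step ordering are 
assumptions made for illustration, not spec text:

```python
from typing import Tuple


def is_utf16(name: str) -> bool:
    """True for the UTF-16 family of labels (hypothetical helper)."""
    return name.strip().lower() in {"utf-16", "utf-16le", "utf-16be"}


def change_encoding(current: str, declared: str) -> Tuple[str, str]:
    """Decide what to do when a declaration is found while parsing.

    Returns (encoding, action), where action is "keep" (confirm the
    current encoding) or "redecode" (switch to the returned encoding).
    """
    cur = current.strip().lower()
    new = declared.strip().lower()

    # Old encoding is UTF-16: do nothing. A declaration read through a
    # UTF-16 decoder is either UTF-16 or cannot be trusted.
    if is_utf16(cur):
        return cur, "keep"

    # Declared encoding matches the one in use: confirm it.
    if new == cur:
        return cur, "keep"

    # A UTF-16 declaration found while decoding as something else is
    # treated as UTF-8 (assumption: the bytes were ASCII-compatible).
    if is_utf16(new):
        new = "utf-8"

    return new, "redecode"
```

With this ordering, a tentatively sniffed UTF-16 document that later 
declares UTF-16 is confirmed instead of being switched to UTF-8.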


> 4.2.5.5 also states:
> 
> If an HTML document does not start with a BOM, and if its encoding is not
> explicitly given by Content-Type metadata, then the character encoding used
> must be an ASCII-compatible character encoding
> 
> and
> 
> If an HTML document contains a meta element with a charset attribute or a meta
> element in the Encoding declaration state, then the character encoding used
> must be an ASCII-compatible character encoding.
> 
> These together are equivalent to saying that UTF-16 without BOM is not allowed
> but I believe that this was not the intent. If it really is I would prefer to
> have an explicit note about this.

UTF-16 without BOM is allowed if there is explicit character encoding 
metadata in the document's Content-Type HTTP headers.
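
As an illustration of that rule, the following sketch checks whether a 
UTF-16 document is covered by either a BOM or a Content-Type charset. The 
function names are hypothetical, and the check is a simplification that 
ignores every other conformance requirement:

```python
import codecs
from email.message import Message
from typing import Optional


def header_charset(content_type: Optional[str]) -> Optional[str]:
    """Extract the charset parameter from a Content-Type header value."""
    if not content_type:
        return None
    msg = Message()
    msg["Content-Type"] = content_type
    # get_content_charset() returns the charset lowercased, or None.
    return msg.get_content_charset()


def utf16_is_declared(body: bytes, content_type: Optional[str]) -> bool:
    """True if a UTF-16 document has a BOM or a UTF-16 header charset."""
    # A leading BOM identifies the encoding by itself.
    if body.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        return True
    # Without a BOM, the HTTP Content-Type header must name the encoding.
    charset = header_charset(content_type)
    return charset is not None and charset.startswith("utf-16")
```

For example, BOM-less UTF-16LE bytes served as `text/html; charset=UTF-16LE` 
pass this check, while the same bytes served as plain `text/html` do not.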


> Encoding found by parsing using UTF-16 should be UTF-16 and any other 
> values should be treated as a parse error.

It's not a parse error, but it is an error, yes. This is already the case 
(since you're not allowed to declare an encoding that's wrong).


> Permitting UTF-16 without BOM makes sense because encoding autodetection 
> is permitted as well and ASCII compatible encodings having encoding 
> specified using <meta> will not reach the autodetection stage.

In general, the intent is to make any document that relies on 
autodetection non-conforming, since autodetection is unreliable.

-- 
Ian Hickson               U+1047E                )\._.,--....,'``.    fL
http://ln.hixie.ch/       U+263A                /,   _.. \   _\  ;`._ ,.
Things that are impossible just take longer.   `._.-(,_..'--(,_..'`-.;.'

Received on Saturday, 6 June 2009 01:32:34 UTC