- From: Kornél Pál <kornelpal@gmail.com>
- Date: Thu, 07 May 2009 00:38:23 +0200
- To: public-html-comments@w3.org
Hi,

After having a look at the HTML5 editor's draft, I believe that 8.2.2.3 incorrectly instructs changing UTF-16 to UTF-8.

UTF-16 without a BOM cannot be detected by the sniffing algorithm because it is incompatible with ASCII. The browser may still guess (step 6) that it is UTF-16, but that encoding will only be tentative. Afterwards the parser may find an encoding specified in the UTF-16 text. If the encoding found is UTF-16, step 1 of 8.2.2.3 instructs changing it to UTF-8, which is definitely wrong.

Another problem is that if you were able to find an encoding name other than UTF-16 in valid HTML code decoded as if it were UTF-16, you shouldn't restart parsing, because if the document isn't UTF-16 then the encoding found is not accurate either.

4.2.5.5 also states:

"If an HTML document does not start with a BOM, and if its encoding is not explicitly given by Content-Type metadata, then the character encoding used must be an ASCII-compatible character encoding"

and

"If an HTML document contains a meta element with a charset attribute or a meta element in the Encoding declaration state, then the character encoding used must be an ASCII-compatible character encoding."

Together these are equivalent to saying that UTF-16 without a BOM is not allowed, but I believe that this was not the intent. If it really is, I would prefer to have an explicit note about this.

When the document is parsed as UTF-16, the encoding found should be UTF-16, and any other value should be treated as a parse error.

Permitting UTF-16 without a BOM makes sense because encoding autodetection is permitted as well, and ASCII-compatible encodings that have their encoding specified using <meta> will not reach the autodetection stage.

Best regards,
Kornél Pál
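
The ASCII-incompatibility point can be illustrated with a minimal Python sketch. The function `ascii_prescan_finds_meta` is a hypothetical, greatly simplified stand-in for a byte-level prescan (it is not the algorithm from the draft), and the sample markup is made up for illustration:

```python
def ascii_prescan_finds_meta(data: bytes) -> bool:
    """Hypothetical simplification: look for the ASCII bytes '<meta' in the raw input."""
    return b"<meta" in data.lower()


# Made-up sample document that declares its own encoding.
markup = '<meta charset="utf-16"><p>Hi</p>'

utf8_bytes = markup.encode("utf-8")
utf16_bytes = markup.encode("utf-16-le")  # little-endian UTF-16, no BOM written

# In UTF-8 the declaration is visible as plain ASCII bytes.
print(ascii_prescan_finds_meta(utf8_bytes))

# In UTF-16 every ASCII character is followed by a zero byte, so the
# ASCII byte pattern never appears and a byte-level scan cannot see it.
print(ascii_prescan_finds_meta(utf16_bytes))

# Only after the bytes are (tentatively) decoded as UTF-16 does the
# declaration become visible, and it then names UTF-16 itself.
print('charset="utf-16"' in utf16_bytes.decode("utf-16-le"))
```

Under these assumptions the declaration is only reachable once UTF-16 has already been guessed, which is why rewriting a declared UTF-16 to UTF-8 at that stage would pick the wrong decoder.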
Received on Thursday, 7 May 2009 08:00:02 UTC