- From: Ian Hickson <ian@hixie.ch>
- Date: Sat, 6 Jun 2009 01:32:01 +0000 (UTC)
- To: Kornél Pál <kornelpal@gmail.com>
- Cc: public-html-comments@w3.org
- Message-ID: <Pine.LNX.4.62.0906060110060.1648@hixie.dreamhostps.com>
On Thu, 7 May 2009, Kornél Pál wrote:
>
> After having a look at HTML5 editor's draft I believe that 8.2.2.3
> incorrectly instructs changing UTF-16 to UTF-8.
>
> UTF-16 without BOM cannot be detected using the sniffing algorithm
> because it is incompatible with ASCII. But the browser may guess (step 6.)
> that it's UTF-16 but it will only be tentative.
>
> After that, the parser may find an encoding specified in the UTF-16 text.
>
> If the encoding found is UTF-16 then that is instructed by step 1. of
> 8.2.2.3 to be changed to UTF-8, which is definitely wrong.

I've switched steps 1 and 2 in the "change the encoding" algorithm to
handle the case where UTF-16 is sniffed tentatively and later confirmed.

> Another problem is that if you were able to find an encoding name other
> than UTF-16 in a valid HTML code decoded as if it were UTF-16 you
> shouldn't restart parsing because if it isn't UTF-16 then the encoding
> found is not accurate either.

Good point. I've made the algorithm just do nothing if the old encoding is
UTF-16.

> 4.2.5.5 also states:
>
> If an HTML document does not start with a BOM, and if its encoding is not
> explicitly given by Content-Type metadata, then the character encoding used
> must be an ASCII-compatible character encoding
>
> and
>
> If an HTML document contains a meta element with a charset attribute or a meta
> element in the Encoding declaration state, then the character encoding used
> must be an ASCII-compatible character encoding.
>
> These together are equivalent to saying that UTF-16 without BOM is not allowed
> but I believe that this was not the intent. If it really is I would prefer to
> have an explicit note about this.

UTF-16 without BOM is allowed if there is explicit character encoding
metadata in the document's Content-Type HTTP headers.

> Encoding found by parsing using UTF-16 should be UTF-16 and any other
> values should be treated as a parse error.

It's not a parse error, but it is an error, yes. This is already the case
(since you're not allowed to declare an encoding that's wrong).

> Permitting UTF-16 without BOM makes sense because encoding autodetection
> is permitted as well and ASCII compatible encodings having encoding
> specified using <meta> will not reach the autodetection stage.

In general, the intent is to make any document that relies on the
autodetection non-conforming, since autodetection is unreliable.

-- 
Ian Hickson               U+1047E                )\._.,--....,'``.    fL
http://ln.hixie.ch/       U+263A                /,   _.. \   _\  ;`._ ,.
Things that are impossible just take longer.   `._.-(,_..'--(,_..'`-.;.'
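
For readers following the thread, the reordered "change the encoding" steps described in this message can be sketched roughly as below. This is only an illustration built from the two rules stated in the replies (do nothing if the old encoding is UTF-16; otherwise treat a declared UTF-16 as UTF-8). The function name, the helper, and the set of UTF-16 labels are assumptions made for the example and are not taken from the specification text.

```python
# Hypothetical sketch of the reordered steps; names are illustrative only.
UTF16_LABELS = {"utf-16", "utf-16le", "utf-16be"}  # assumed label set


def is_utf16(name: str) -> bool:
    """Return True if the encoding label names a UTF-16 variant."""
    return name.strip().lower() in UTF16_LABELS


def resolve_declared_encoding(current: str, declared: str) -> str:
    """Pick the encoding to continue with once a <meta> declaration is seen.

    current  -- the encoding the input is being decoded with right now
    declared -- the encoding name found in the document
    """
    # New first step: if the input is already being read as UTF-16, keep it.
    # A declaration that was readable through a UTF-16 decoder either
    # confirms UTF-16 or is wrong, so there is nothing to change.
    if is_utf16(current):
        return current
    # New second step: a UTF-16 declaration that was readable through an
    # ASCII-compatible decoder cannot be accurate, so it is taken as UTF-8.
    if is_utf16(declared):
        return "utf-8"
    # Otherwise the declared encoding is used.
    return declared
```

With this ordering, a tentatively sniffed UTF-16 document that later declares UTF-16 stays UTF-16, which is the case the original step order got wrong.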
Received on Saturday, 6 June 2009 01:32:34 UTC