Re: Incorrect behaviour with utf-16 meta declaration from Richard Ishida on 2011-07-04 (www-validator@w3.org from July 2011)

From: Richard Ishida <ishida@w3.org>
Date: Mon, 04 Jul 2011 10:26:19 +0100
To: "Michael[tm] Smith" <mike@w3.org>
CC: www-validator@w3.org
Message-ID: <4E11873B.2030705@w3.org>

Hi Mike,

I think that what your parser misses is the statement a little earlier 
in the spec that says:

[[
For each of the rows in the following table, starting with the first one 
and going down, if there are as many or more bytes available than the 
number of bytes in the first column, and the first bytes of the file 
match the bytes given in the first column, then return the encoding 
given in the cell in the second column of that row, with the confidence 
certain, and abort these steps:

Bytes in Hexadecimal    Encoding
FE FF                   Big-endian UTF-16
FF FE                   Little-endian UTF-16
EF BB BF                UTF-8
]]
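
To make the ordering concrete, here is a minimal Python sketch of the BOM check quoted above. The function name sniff_bom and the use of Python codec names in the second column are assumptions made for illustration; this is not the validator.nu parser's code.

```python
# BOM table from the quoted spec text, checked top to bottom.
# The encoding labels are Python codec names (an assumption of this sketch).
BOM_TABLE = [
    (b"\xFE\xFF", "utf-16-be"),
    (b"\xFF\xFE", "utf-16-le"),
    (b"\xEF\xBB\xBF", "utf-8"),
]

def sniff_bom(data: bytes):
    """Return (encoding, confidence) if data starts with a listed BOM, else None."""
    for bom, encoding in BOM_TABLE:
        # "as many or more bytes available" and "first bytes match"
        if len(data) >= len(bom) and data.startswith(bom):
            return encoding, "certain"
    return None

print(sniff_bom(b"\xff\xfe<\x00h\x00"))  # ('utf-16-le', 'certain')
print(sniff_bom(b"<html>"))              # None
```

A match here ends the detection, so a later meta declaration never gets a chance to override it.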

The file 
http://www.w3.org/International/tests/i18n-checker/utf16/utf16le-charset-html5.html 
contains a UTF-16LE BOM, and so should be handled as UTF-16.

If you get as far as step 5 in the algorithm, sure, you are not dealing 
with a UTF-16-encoded file (because you wouldn't recognise the byte 
sequences of the meta declaration if you were), so at that point, yes, 
treat as UTF-8.
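
The point about not recognising the byte sequences can be illustrated with a rough Python sketch. The function prescan_meta_charset, its regular expression, and the 1024-byte window are hypothetical simplifications, not the spec's actual prescan algorithm.

```python
import re

# Rough stand-in for a byte-level meta prescan: search the start of the
# stream for an ASCII "charset=" token. The 1024-byte window and the regex
# are assumptions of this sketch.
CHARSET_RE = re.compile(rb'charset\s*=\s*["\']?([\w-]+)', re.IGNORECASE)

def prescan_meta_charset(data: bytes):
    """Return the charset a byte-level prescan would settle on, or None."""
    match = CHARSET_RE.search(data[:1024])
    if match is None:
        return None
    charset = match.group(1).decode("ascii").lower()
    # A declared UTF-16 encoding is changed to UTF-8: the declaration was
    # readable as ASCII-compatible bytes, so the file cannot be UTF-16.
    if charset.startswith("utf-16"):
        return "utf-8"
    return charset

declaration = '<meta http-equiv="Content-Type" content="text/html; charset=utf-16" />'

print(prescan_meta_charset(declaration.encode("ascii")))      # utf-8
print(prescan_meta_charset(declaration.encode("utf-16-le")))  # None
```

In real UTF-16 bytes every ASCII character is followed or preceded by a zero byte, so the ASCII pattern never matches and the substitution is never reached.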

I think that the error message that should be output is something along 
the lines of:

"The encoding declaration is incorrect: this is not a UTF-16 encoded 
file. In HTML 4.01 the page will be parsed using the browser's default 
encoding. In XHTML 1.x and HTML5 it will be treated as UTF-8. If you 
used a character encoding other than these, you will likely see 
corruption of the non-ASCII text on your page. Change the encoding 
declaration to reflect the actual encoding of the page."

RI


PS: Try viewing the example page in Firefox, and you'll see that it's 
perfectly readable.  That wouldn't be the case if you were interpreting 
a sequence of UTF-8 bytes as UTF-16.
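
A small Python sketch of what that misinterpretation would look like, using a made-up sample string rather than the test page itself:

```python
text = "<p>Hello</p>"  # 12 ASCII characters, so 12 bytes in UTF-8

# Misreading UTF-8 bytes as UTF-16LE pairs them up into unrelated code units.
misread = text.encode("utf-8").decode("utf-16-le")

# A real UTF-16LE byte stream with a BOM decodes back to the original text.
with_bom = b"\xff\xfe" + text.encode("utf-16-le")
roundtrip = with_bom.decode("utf-16")

print(len(misread), misread.isascii())  # 6 False
print(roundtrip == text)                # True
```

The misread string contains six non-ASCII characters and no recognisable markup, which is why a readable page is good evidence that the bytes really are UTF-16.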




On 04/07/2011 08:11, Michael[tm] Smith wrote:
> Richard Ishida <ishida@w3.org>, 2011-07-03 10:32 +0100:
>
>> Checking http://www.w3.org/International/tests/i18n-checker/utf16/utf16le-charset-html5.html
>>
>> I get the following error messages:
>>
>> [[
>> Error Line 5, Column 70: Internal encoding declaration specified utf-16
>> which is not an ASCII superset. Continuing as if the encoding had been
>> utf-8.
>>
>> <meta http-equiv="Content-Type" content="text/html; charset=utf-16" />
>>
>> Error Line 5, Column 70: Internal encoding declaration utf-8 disagrees with
>> the actual encoding of the document (utf-16).
>>
>> <meta http-equiv="Content-Type" content="text/html; charset=utf-16" />
>> ]]
>>
>> It is incorrect to parse the document as utf-8, since the document actually
>> *is* a utf-16 document. You can report that use of the utf-16 meta
>> declaration is against the spec in utf-16 documents, but not assume that the
>> encoding is wrong.
>
> According to the HTML5 spec, it is correct to parse the document as UTF-8.
> In fact, the spec requires that behavior; see step 5.1.13 of the algorithm
> in the "Determining the character encoding" section of the spec:
>
>    "If charset is a UTF-16 encoding, change the value of charset to UTF-8."
>    http://dev.w3.org/html5/spec/parsing.html#determining-the-character-encoding
>
> The validator.nu backend includes a parser that conforms to the HTML5 spec
> (which incidentally is the same parser that Firefox now uses). And both of
> the error messages you cite above are being emitted by that parser, during
> the parsing phase, before the backend actually gets around to starting the
> validation stage at all.
>
> Note also that any browser which conforms to the HTML5 spec will exhibit
> this same behavior (that is, changing the charset from UTF-16 to UTF-8).
>
> So as far as the spec goes, those messages are both correct and expected --
> as well as being consistent with parsing behavior in browsers.
>
>    --Mike
>

-- 
Richard Ishida
Internationalization Activity Lead
W3C (World Wide Web Consortium)

http://www.w3.org/International/
http://rishida.net/


Register for the W3C MultilingualWeb Workshop!
Limerick, 21-22 September 2011
http://multilingualweb.eu/register
Received on Monday, 4 July 2011 09:26:50 UTC