Re: Incorrect behaviour with utf-16 meta declaration from Richard Ishida on 2011-07-04 (www-validator@w3.org from July 2011)

From: Richard Ishida <ishida@w3.org>
Date: Mon, 04 Jul 2011 10:56:39 +0100
To: "Michael[tm] Smith" <mike@w3.org>
CC: www-validator@w3.org
Message-ID: <4E118E57.6040804@w3.org>
Oops.  Missed something out at the end.

The error message suggested below was for a file that is not actually
encoded in UTF-16 but has a UTF-16 encoding declaration in the meta
element.

If a file actually *is* encoded in UTF-16, you still need an error
message for HTML5 if there's a UTF-16 encoding declaration in the meta
element. This is because the HTML5 spec requires an ASCII-compatible
encoding whenever a meta encoding declaration is used (mostly to avoid
problems with EBCDIC etc., but it also catches UTF-16). For that
scenario, I'd suggest an error message along the following lines:

"The HTML5 specification disallows the use of meta encoding declarations 
with UTF-16 encoded documents. A UTF-16 byte-order mark (BOM) is the 
only in-document encoding allowed."
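
To make the two scenarios concrete, here is a minimal Python sketch of
how a checker might choose between the two suggested messages. The
function and constant names are hypothetical, not part of any validator.
It assumes the declared charset has already been extracted from the
meta element, and that a UTF-16 file always starts with a BOM (a
BOM-less UTF-16 file would be misclassified here).

```python
from typing import Optional

# UTF-16 byte-order marks: big-endian, then little-endian.
UTF16_BOMS = (b"\xfe\xff", b"\xff\xfe")

# Wording taken from the suggestions in this thread (first one abbreviated
# to its opening sentence).
MSG_NOT_UTF16 = (
    "The encoding declaration is incorrect: this is not a UTF-16 encoded file."
)
MSG_UTF16_WITH_META = (
    "The HTML5 specification disallows the use of meta encoding declarations "
    "with UTF-16 encoded documents. A UTF-16 byte-order mark (BOM) is the "
    "only in-document encoding allowed."
)


def meta_utf16_message(data: bytes, declared: str) -> Optional[str]:
    """Return a suggested message for a UTF-16 meta declaration, else None."""
    if not declared.strip().lower().startswith("utf-16"):
        return None  # only UTF-16 declarations are covered by this sketch
    if data.startswith(UTF16_BOMS):
        # The file really is UTF-16, but the meta declaration is disallowed.
        return MSG_UTF16_WITH_META
    # No UTF-16 BOM: the declaration does not match the actual encoding.
    return MSG_NOT_UTF16
```

Calling meta_utf16_message(raw_bytes, "utf-16") on the raw bytes of a
file returns one of the two messages depending only on whether a UTF-16
BOM is present.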

HTH,
RI


On 04/07/2011 10:26, Richard Ishida wrote:
> Hi Mike,
>
> I think that what your parser misses is the statement a little earlier
> in the spec that says:
>
> [[
> For each of the rows in the following table, starting with the first one
> and going down, if there are as many or more bytes available than the
> number of bytes in the first column, and the first bytes of the file
> match the bytes given in the first column, then return the encoding
> given in the cell in the second column of that row, with the confidence
> certain, and abort these steps:
> Bytes in Hexadecimal   Encoding
> FE FF                  Big-endian UTF-16
> FF FE                  Little-endian UTF-16
> EF BB BF               UTF-8
> ]]
>
> The file
> http://www.w3.org/International/tests/i18n-checker/utf16/utf16le-charset-html5.html
> contains a UTF-16LE BOM, and so should be handled as UTF-16.
>
> If you get as far as step 5 in the algorithm, sure, you are not dealing
> with a UTF16 encoded file (because you wouldn't recognise the byte
> sequences if you were), so at that point, yes, treat as UTF-8.
>
> I think that the error message that should be output is something along
> the lines of:
>
> "The encoding declaration is incorrect: this is not a UTF-16 encoded
> file. In HTML4.01 the page will be parsed as the default encoding for
> the browser. In XHTML 1.x and HTML5 it will be treated as UTF-8. If you
> used a different character encoding than these, you will likely see
> corruption of the non-ASCII text on your page. Change the encoding
> declaration to reflect the actual encoding of the page."
>
> RI
>
>
> PS: Try viewing the example page on Firefox, and you'll see that it's
> perfectly readable. That wouldn't be the case if you were interpreting a
> sequence of utf-8 bytes as utf-16.
>
>
>
>
> On 04/07/2011 08:11, Michael[tm] Smith wrote:
>> Richard Ishida <ishida@w3.org>, 2011-07-03 10:32 +0100:
>>
>>> Checking
>>> http://www.w3.org/International/tests/i18n-checker/utf16/utf16le-charset-html5.html
>>>
>>>
>>> I get the following error messages:
>>>
>>> [[
>>> Error Line 5, Column 70: Internal encoding declaration specified utf-16
>>> which is not an ASCII superset. Continuing as if the encoding had been
>>> utf-8.
>>>
>>> <meta http-equiv="Content-Type" content="text/html; charset=utf-16" />
>>>
>>> Error Line 5, Column 70: Internal encoding declaration utf-8 disagrees
>>> with the actual encoding of the document (utf-16).
>>>
>>> <meta http-equiv="Content-Type" content="text/html; charset=utf-16" />
>>> ]]
>>>
>>> It is incorrect to parse the document as utf-8, since the document
>>> actually *is* a utf-16 document. You can report that use of the utf-16
>>> meta declaration is against the spec in utf-16 documents, but not
>>> assume that the encoding is wrong.
>>
>> According to the HTML5 spec, it is correct to parse the document as
>> UTF-8. In fact, the spec requires that behavior; see step 5.1.13 of the
>> algorithm in the "Determining the character encoding" section of the
>> spec:
>>
>> "If charset is a UTF-16 encoding, change the value of charset to UTF-8."
>> http://dev.w3.org/html5/spec/parsing.html#determining-the-character-encoding
>>
>>
>> The validator.nu backend includes a parser that conforms to the HTML5
>> spec (which incidentally is the same parser that Firefox now uses). And
>> both of the error messages you cite above are being emitted by that
>> parser, during the parsing phase, before the backend actually gets
>> around to starting the validation stage at all.
>>
>> Note also that any browser which conforms to the HTML5 spec will exhibit
>> this same behavior (that is, changing the charset from UTF-16 to UTF-8).
>>
>> So as far as the spec goes, those messages are both correct and expected,
>> as well as being consistent with parsing behavior in browsers.
>>
>> --Mike
>>
>
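
The order of checks discussed in the quoted thread (BOM table first,
meta declaration only afterwards, with UTF-16 rewritten to UTF-8 at
that later step) can be sketched roughly as follows. This is an
illustration only, not the spec's prescan algorithm: the regular
expression is a crude stand-in for the real meta prescan, the 1024-byte
window and the fallback parameter are assumptions, and the names are
hypothetical. Encoding labels are Python codec names.

```python
import re
from typing import Tuple

# BOM table from the quoted spec text, checked top to bottom.
BOM_TABLE = [
    (b"\xfe\xff", "utf-16-be"),
    (b"\xff\xfe", "utf-16-le"),
    (b"\xef\xbb\xbf", "utf-8"),
]

# Simplified stand-in for the meta prescan: finds charset=... inside a meta tag.
META_CHARSET = re.compile(
    rb"""<meta[^>]*charset\s*=\s*["']?([A-Za-z0-9_.:-]+)""", re.IGNORECASE
)


def sniff_encoding(data: bytes, fallback: str = "utf-8") -> Tuple[str, str]:
    """Return (encoding, confidence) for the raw bytes of a document."""
    # Step 1: a matching BOM decides the encoding with confidence "certain".
    for bom, encoding in BOM_TABLE:
        if len(data) >= len(bom) and data.startswith(bom):
            return encoding, "certain"
    # Step 2: only without a BOM do we look at the meta declaration.
    match = META_CHARSET.search(data[:1024])
    if match:
        charset = match.group(1).decode("ascii").lower()
        # If we can read the declaration as ASCII, the file cannot be UTF-16.
        if charset.startswith("utf-16"):
            charset = "utf-8"
        return charset, "tentative"
    # Step 3: nothing found; the fallback is an assumption of this sketch.
    return fallback, "tentative"
```

With this ordering, a file that starts with FF FE is reported as
UTF-16LE regardless of what its meta element says, which is the
behaviour argued for above.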

-- 
Richard Ishida
Internationalization Activity Lead
W3C (World Wide Web Consortium)

http://www.w3.org/International/
http://rishida.net/


Register for the W3C MultilingualWeb Workshop!
Limerick, 21-22 September 2011
http://multilingualweb.eu/register
Received on Monday, 4 July 2011 09:57:11 UTC