- From: Richard Ishida <ishida@w3.org>
- Date: Mon, 04 Jul 2011 10:26:19 +0100
- To: "Michael[tm] Smith" <mike@w3.org>
- CC: www-validator@w3.org
Hi Mike,

I think that what your parser misses is the statement a little earlier in the spec that says:

[[
For each of the rows in the following table, starting with the first one and going down, if there are as many or more bytes available than the number of bytes in the first column, and the first bytes of the file match the bytes given in the first column, then return the encoding given in the cell in the second column of that row, with the confidence certain, and abort these steps:

Bytes in Hexadecimal    Encoding
FE FF                   Big-endian UTF-16
FF FE                   Little-endian UTF-16
EF BB BF                UTF-8
]]

The file http://www.w3.org/International/tests/i18n-checker/utf16/utf16le-charset-html5.html contains a utf-16LE bom, and so should be handled as UTF-16.

If you get as far as step 5 in the algorithm, sure, you are not dealing with a UTF16 encoded file (because you wouldn't recognise the byte sequences if you were), so at that point, yes, treat as UTF-8.

I think that the error message that should be output is something along the lines of:

"The encoding declaration is incorrect: this is not a UTF-16 encoded file. In HTML4.01 the page will be parsed as the default encoding for the browser. In XHTML 1.x and HTML5 it will be treated as UTF-8. If you used a different character encoding than these, you will likely see corruption of the non-ASCII text on your page. Change the encoding declaration to reflect the actual encoding of the page."

RI

PS: Try viewing the example page on Firefox, and you'll see that it's perfectly readable. That wouldn't be the case if you were interpreting a sequence of utf-8 bytes as utf-16.

On 04/07/2011 08:11, Michael[tm] Smith wrote:
> Richard Ishida<ishida@w3.org>, 2011-07-03 10:32 +0100:
>
>> Checking http://www.w3.org/International/tests/i18n-checker/utf16/utf16le-charset-html5.html
>>
>> I get the following error messages:
>>
>> [[
>> Error Line 5, Column 70: Internal encoding declaration specified utf-16
>> which is not an ASCII superset. Continuing as if the encoding had been
>> utf-8.
>>
>> <meta http-equiv="Content-Type" content="text/html; charset=utf-16" />
>>
>> ✉
>> Error Line 5, Column 70: Internal encoding declaration utf-8 disagrees with
>> the actual encoding of the document (utf-16).
>>
>> <meta http-equiv="Content-Type" content="text/html; charset=utf-16" />
>> ]]
>>
>> It is incorrect to parse the document as utf-8, since the document actually
>> *is* a utf-16 document. You can report that use of the utf-16 meta
>> declaration is against the spec in utf-16 documents, but not assume that the
>> encoding is wrong.
>
> According to the HTML5 spec, it is correct to parse the document as UTF-8.
> In fact, the spec requires that behavior; see step 5.1.13 of the algorithm
> in the "Determining the character encoding" section of the spec:
>
> "If charset is a UTF-16 encoding, change the value of charset to UTF-8."
> http://dev.w3.org/html5/spec/parsing.html#determining-the-character-encoding
>
> The validator.nu backend includes a parser that conforms to the HTML5 spec
> (which incidentally is the same parser that Firefox now uses). And both of
> the error messages you cite above are being emitted by that parser, during
> the parsing phase, before the backend actually gets around to starting the
> validation stage at all.
>
> Note also that any browser which conforms to the HTML5 spec will exhibit
> this same behavior (that is, changing the charset from UTF-16 to UTF-8)
>
> So as far as the spec goes, those messages are both correct and expected --
> as well as being consistent with parsing behavior in browsers.
>
> --Mike
>

-- 
Richard Ishida
Internationalization Activity Lead
W3C (World Wide Web Consortium)

http://www.w3.org/International/
http://rishida.net/

Register for the W3C MultilingualWeb Workshop!
Limerick, 21-22 September 2011
http://multilingualweb.eu/register
Received on Monday, 4 July 2011 09:26:50 UTC
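
For readers following the thread, the ordering Richard describes (check the BOM table first, and only fall through to the declared charset if no BOM matches) might be sketched roughly as below. This is a simplified illustration, not the validator.nu parser and not the full algorithm from the spec: the function names `sniff_bom` and `choose_encoding`, the labels "UTF-16BE"/"UTF-16LE", and the "tentative" confidence for the fallback path are assumptions made for this example.

```python
from typing import Optional, Tuple

# BOM table as quoted from the spec in the message above, checked top to bottom.
# The encoding labels are illustrative names for big- and little-endian UTF-16.
BOM_TABLE = [
    (b"\xFE\xFF", "UTF-16BE"),
    (b"\xFF\xFE", "UTF-16LE"),
    (b"\xEF\xBB\xBF", "UTF-8"),
]


def sniff_bom(data: bytes) -> Optional[str]:
    """Return the encoding implied by a leading BOM, or None if no row matches."""
    for bom, encoding in BOM_TABLE:
        # startswith() is False when fewer bytes are available than the BOM needs.
        if data.startswith(bom):
            return encoding
    return None


def choose_encoding(data: bytes, declared: Optional[str]) -> Tuple[Optional[str], str]:
    """Pick an encoding: a BOM wins; otherwise use the declared charset.

    `declared` stands in for the charset found in a <meta> declaration.
    Returns (encoding, confidence).
    """
    from_bom = sniff_bom(data)
    if from_bom is not None:
        return from_bom, "certain"
    if declared is not None:
        name = declared.strip().upper()
        # The step Mike quotes: a declared UTF-16 charset is changed to UTF-8,
        # but only once we get this far, i.e. when no BOM was found.
        if name.startswith("UTF-16"):
            name = "UTF-8"
        return name, "tentative"
    return None, "tentative"
```

Under these assumptions, a file that starts with FF FE yields `("UTF-16LE", "certain")` whatever its meta declaration says, while a file with no BOM that declares `charset=utf-16` falls through to `("UTF-8", "tentative")`.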