- From: Richard Ishida <ishida@w3.org>
- Date: Mon, 04 Jul 2011 10:56:39 +0100
- To: "Michael[tm] Smith" <mike@w3.org>
- CC: www-validator@w3.org
Oops. Missed something out at the end.

The error message suggestion included below was for a file that is not
actually encoded in utf16 but has a utf16 encoding declaration in the meta.

If a file actually *is* encoded in utf-16, you still need an error message
for html5 if there's a utf16 encoding declaration in the meta. This is
because the HTML5 spec requires an ASCII-compatible meta encoding
declaration (mostly to avoid problems with EBCDIC etc, but it also catches
utf16).

For that scenario, I'd suggest an error message along the following lines:

"The HTML5 specification disallows the use of meta encoding declarations
with UTF-16 encoded documents. A UTF-16 byte-order mark (BOM) is the only
in-document encoding allowed."

HTH,
RI

On 04/07/2011 10:26, Richard Ishida wrote:
> Hi Mike,
>
> I think that what your parser misses is the statement a little earlier
> in the spec that says:
>
> [[
> For each of the rows in the following table, starting with the first one
> and going down, if there are as many or more bytes available than the
> number of bytes in the first column, and the first bytes of the file
> match the bytes given in the first column, then return the encoding
> given in the cell in the second column of that row, with the confidence
> certain, and abort these steps:
>
> Bytes in Hexadecimal    Encoding
> FE FF                   Big-endian UTF-16
> FF FE                   Little-endian UTF-16
> EF BB BF                UTF-8
> ]]
>
> The file
> http://www.w3.org/International/tests/i18n-checker/utf16/utf16le-charset-html5.html
> contains a utf-16LE bom, and so should be handled as UTF-16.
>
> If you get as far as step 5 in the algorithm, sure, you are not dealing
> with a UTF16 encoded file (because you wouldn't recognise the byte
> sequences if you were), so at that point, yes, treat as UTF-8.
>
> I think that the error message that should be output is something along
> the lines of:
>
> "The encoding declaration is incorrect: this is not a UTF-16 encoded
> file. In HTML4.01 the page will be parsed as the default encoding for
> the browser. In XHTML 1.x and HTML5 it will be treated as UTF-8. If you
> used a different character encoding than these, you will likely see
> corruption of the non-ASCII text on your page. Change the encoding
> declaration to reflect the actual encoding of the page."
>
> RI
>
>
> PS: Try viewing the example page on Firefox, and you'll see that it's
> perfectly readable. That wouldn't be the case if you were interpreting a
> sequence of utf-8 bytes as utf-16.
>
>
> On 04/07/2011 08:11, Michael[tm] Smith wrote:
>> Richard Ishida <ishida@w3.org>, 2011-07-03 10:32 +0100:
>>
>>> Checking
>>> http://www.w3.org/International/tests/i18n-checker/utf16/utf16le-charset-html5.html
>>>
>>> I get the following error messages:
>>>
>>> [[
>>> Error Line 5, Column 70: Internal encoding declaration specified utf-16
>>> which is not an ASCII superset. Continuing as if the encoding had been
>>> utf-8.
>>>
>>> <meta http-equiv="Content-Type" content="text/html; charset=utf-16" />
>>>
>>> ✉
>>> Error Line 5, Column 70: Internal encoding declaration utf-8 disagrees
>>> with the actual encoding of the document (utf-16).
>>>
>>> <meta http-equiv="Content-Type" content="text/html; charset=utf-16" />
>>> ]]
>>>
>>> It is incorrect to parse the document as utf-8, since the document
>>> actually *is* a utf-16 document. You can report that use of the utf-16
>>> meta declaration is against the spec in utf-16 documents, but not
>>> assume that the encoding is wrong.
>>
>> According to the HTML5 spec, it is correct to parse the document as
>> UTF-8. In fact, the spec requires that behavior; see step 5.1.13 of the
>> algorithm in the "Determining the character encoding" section of the
>> spec:
>>
>> "If charset is a UTF-16 encoding, change the value of charset to UTF-8."
>> http://dev.w3.org/html5/spec/parsing.html#determining-the-character-encoding
>>
>> The validator.nu backend includes a parser that conforms to the HTML5
>> spec (which incidentally is the same parser that Firefox now uses). And
>> both of the error messages you cite above are being emitted by that
>> parser, during the parsing phase, before the backend actually gets
>> around to starting the validation stage at all.
>>
>> Note also that any browser which conforms to the HTML5 spec will exhibit
>> this same behavior (that is, changing the charset from UTF-16 to UTF-8)
>>
>> So as far as the spec goes, those messages are both correct and
>> expected -- as well as being consistent with parsing behavior in
>> browsers.
>>
>> --Mike
>>
>

--
Richard Ishida
Internationalization Activity Lead
W3C (World Wide Web Consortium)

http://www.w3.org/International/
http://rishida.net/

Register for the W3C MultilingualWeb Workshop!
Limerick, 21-22 September 2011
http://multilingualweb.eu/register
Received on Monday, 4 July 2011 09:57:11 UTC
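
Illustrative sketch (not part of the original message): the BOM table quoted from the spec, and the two scenarios discussed in this thread, could be modelled roughly as below. The function names `sniff_bom` and `check_utf16_meta` are hypothetical, the encoding labels are simply Python codec names chosen for convenience, and the message strings are shortened forms of the wording suggested in the thread. This is neither validator.nu's code nor the full HTML5 encoding-detection algorithm. It also assumes, as the thread does, that a file with no UTF-16 BOM is not UTF-16.

```python
from typing import Optional

# Rows of the table quoted from the spec, tried in order from the top.
BOM_TABLE = [
    (b"\xfe\xff", "utf-16-be"),    # FE FF    -> Big-endian UTF-16
    (b"\xff\xfe", "utf-16-le"),    # FF FE    -> Little-endian UTF-16
    (b"\xef\xbb\xbf", "utf-8"),    # EF BB BF -> UTF-8
]


def sniff_bom(data: bytes) -> Optional[str]:
    """Return the encoding implied by a leading BOM, or None if there is none.

    startswith() is False when fewer bytes are available than the BOM needs,
    which covers the "as many or more bytes available" condition.
    """
    for bom, encoding in BOM_TABLE:
        if data.startswith(bom):
            return encoding
    return None


def check_utf16_meta(data: bytes, declared: str) -> Optional[str]:
    """Pick a message for a meta declaration that names a UTF-16 encoding.

    Hypothetical helper: returns None when the declared charset is not UTF-16.
    """
    if not declared.strip().lower().startswith("utf-16"):
        return None
    if sniff_bom(data) in ("utf-16-be", "utf-16-le"):
        # The file really is UTF-16; only the meta declaration is at fault.
        return ("The HTML5 specification disallows the use of meta encoding "
                "declarations with UTF-16 encoded documents.")
    # No UTF-16 BOM: the declaration itself is wrong for this file.
    return ("The encoding declaration is incorrect: this is not a UTF-16 "
            "encoded file.")
```

The point of ordering the checks this way is the one made in the message above: the BOM is consulted first, so a file that starts with FF FE is handled as UTF-16 and never reaches the step that rewrites a declared UTF-16 charset to UTF-8.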