- From: Thomas Rutter <tom@thomasrutter.com>
- Date: Wed, 28 May 2008 11:16:57 +1000
- To: www-validator@w3.org
- Message-ID: <cb4878e50805271816h77a5e827nb4705477c0f4b241@mail.gmail.com>
Hello,

When validating a page containing a UTF-8 overlong form of an ASCII character (in this example \x6A), I get the following error message:

> Sorry, I am unable to validate this document because on line 1 it contained one or more bytes that I cannot interpret as utf-8 (in other words, the bytes found are not valid values in the specified Character Encoding). Please check both the content of the file and the character encoding indication.
>
> The error was: utf8 "\xC1" does not map to Unicode

Please find attached a simplified test case. Please note that the test case is UTF-8, and opening it in most text editors will cause the overlong form to be silently corrected. Let's hope that doesn't happen in the transmission of this email.

A UTF-8 overlong form is a UTF-8 character represented using more bytes than necessary. For example, using two bytes to represent an ASCII letter is an overlong form. Most non-validating applications will accept them and silently convert them to their normal form. The W3C validator, rightly, flags them as invalid characters; however, the error message given is misleading.

The validator parses both bytes of the character, which in this example are \xC1\xAA. It converts this to the numerical representation \x6A (the lowercase 'j'), which, if you follow the UTF-8 spec closely, should have been represented using one byte instead of two. However, the error message given by the validator is that the document contains code point \xC1 and that this is not a valid Unicode character. This may be a side effect of the way the UTF-8 parser returns failure, but \xC1 is not the numerical representation of those bytes, and if it were, \xC1 does map to a valid Unicode character (capital A with an acute accent). The error message caused considerable frustration in finding what was wrong with the character.
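To make the arithmetic concrete, here is a minimal Python sketch of how a two-byte UTF-8 sequence maps to a code point, and how an overlong form can be detected. This is only an illustration, not the validator's code; the function names `decode_two_byte` and `is_overlong_two_byte` are made up for this example.

```python
def decode_two_byte(lead: int, cont: int) -> int:
    """Decode a 2-byte UTF-8 sequence by hand, WITHOUT rejecting overlong forms.

    Hypothetical helper for illustration only.
    """
    if lead & 0xE0 != 0xC0:          # lead byte must look like 110xxxxx
        raise ValueError("not a 2-byte lead byte")
    if cont & 0xC0 != 0x80:          # continuation byte must look like 10yyyyyy
        raise ValueError("not a continuation byte")
    # Code point is the 5 payload bits of the lead followed by the 6 of the continuation.
    return ((lead & 0x1F) << 6) | (cont & 0x3F)


def is_overlong_two_byte(lead: int, cont: int) -> bool:
    """A 2-byte sequence is overlong if its code point would fit in one byte."""
    return decode_two_byte(lead, cont) < 0x80


cp = decode_two_byte(0xC1, 0xAA)
print(hex(cp), chr(cp))                    # 0x6a j
print(is_overlong_two_byte(0xC1, 0xAA))    # True

# A strict decoder refuses the sequence outright.
try:
    b"\xC1\xAA".decode("utf-8")
except UnicodeDecodeError as exc:
    print("rejected:", exc)
```

Since the lead bytes \xC0 and \xC1 can only ever begin an overlong two-byte form, a strict decoder can reject the sequence as soon as it sees the first byte, which would explain why only "\xC1" shows up in the error.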
A more helpful error message would state:

> Sorry, I am unable to validate this document because on line 1 it contained one or more bytes that I cannot interpret as utf-8 (in other words, the bytes found are not valid values in the specified Character Encoding). Please check both the content of the file and the character encoding indication.
>
> The error was: utf8 Illegal overlong form "\xC1\xAA"

Thanks,
Thomas
Attachments
- text/html attachment: test.html
Received on Wednesday, 28 May 2008 03:36:56 UTC