Error message for invalid UTF-8 overlong forms should be improved from Thomas Rutter on 2008-05-28 (www-validator@w3.org from May 2008)

From: Thomas Rutter <tom@thomasrutter.com>
Date: Wed, 28 May 2008 11:16:57 +1000
To: www-validator@w3.org
Message-ID: <cb4878e50805271816h77a5e827nb4705477c0f4b241@mail.gmail.com>

Hello,

When validating a page containing a UTF-8 overlong form of an ASCII
character (in this example \x6A), I get the following error message:

> Sorry, I am unable to validate this document because on line 1 it contained one or more bytes that I cannot interpret as utf-8 (in other words, the bytes found are not valid values in the specified Character Encoding). Please check both the content of the file and the character encoding indication.
>
> The error was: utf8 "\xC1" does not map to Unicode

Please find attached a simplified test-case.  Please note that the
test case is UTF-8, and opening it in most text editors will cause the
overlong form to be silently corrected.  Let's hope that doesn't
happen in the transmission of this email.

A UTF-8 overlong form is a UTF-8 character represented using more
bytes than is necessary.  For example, using two bytes to represent an
ASCII letter is an overlong form.  Many non-validating applications
will accept them and silently convert them to their normal form.  The
W3C validator, rightly, flags them as invalid characters; however, the
error message given is misleading.
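
To make the definition concrete, here is a minimal Python sketch (not
the validator's code).  The helper name overlong_two_byte is made up
for illustration.  It builds the invalid two-byte form of an ASCII
code point and shows that a strict decoder, here Python's built-in
"utf-8" codec, refuses it.

```python
def overlong_two_byte(code_point: int) -> bytes:
    """Build the (invalid) two-byte overlong form of an ASCII code point.

    Hypothetical helper for illustration only; real encoders never emit this.
    """
    if not 0 <= code_point <= 0x7F:
        raise ValueError("expected an ASCII code point (0x00-0x7F)")
    lead = 0xC0 | (code_point >> 6)      # 110xxxxx carries the top bit(s)
    trail = 0x80 | (code_point & 0x3F)   # 10yyyyyy carries the low six bits
    return bytes([lead, trail])


shortest = "j".encode("utf-8")           # the normal one-byte form
overlong = overlong_two_byte(ord("j"))   # two bytes for the same code point

# A strict decoder must reject the overlong form.
try:
    overlong.decode("utf-8")
    rejected = False
except UnicodeDecodeError:
    rejected = True
```

Both byte strings carry the same code point; only the one-byte form
is legal UTF-8.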

The validator parses both bytes of the character, which in this
example are \xC1\xAA.  These decode to the code point \x6A (the
lowercase 'j'), which, if you follow the UTF-8 spec closely, should
have been represented using one byte instead of two.  However, the
error message given by the validator says that the document contains
\xC1 and that this does not map to Unicode.  This may be a side
effect of the way the UTF-8 parser returns failure, but \xC1 is not
the numerical representation of those bytes, and even if it were,
U+00C1 is a valid Unicode character (capital A with an acute
accent).  The misleading message caused considerable frustration in
finding what was wrong with the character.

A more helpful error message would state:

> Sorry, I am unable to validate this document because on line 1 it contained one or more bytes that I cannot interpret as utf-8 (in other words, the bytes found are not valid values in the specified Character Encoding). Please check both the content of the file and the character encoding indication.
>
> The error was: utf8 Illegal overlong form "\xC1\xAA"
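
As a rough sketch of how such a diagnostic could be produced (this is
an assumption about one possible approach, not how the validator or
its UTF-8 library works), the Python function below decodes a
two-byte sequence by hand and reports an overlong form explicitly.
The name describe_two_byte and the message wording are hypothetical.

```python
def describe_two_byte(seq: bytes) -> str:
    """Classify a two-byte sequence as valid UTF-8, overlong, or malformed.

    Illustrative only: handles exactly two bytes, not full UTF-8.
    """
    if len(seq) != 2:
        raise ValueError("expected exactly two bytes")
    lead, trail = seq
    # A two-byte sequence has the bit pattern 110xxxxx 10yyyyyy.
    if lead & 0xE0 != 0xC0 or trail & 0xC0 != 0x80:
        return "not a two-byte UTF-8 pattern"
    # Concatenate the five payload bits of the lead byte with the six of the trail.
    code_point = ((lead & 0x1F) << 6) | (trail & 0x3F)
    # Code points below 0x80 fit in one byte, so two bytes is overlong.
    if code_point < 0x80:
        return (f'illegal overlong form "\\x{lead:02X}\\x{trail:02X}" '
                f"for U+{code_point:04X}")
    return f"valid encoding of U+{code_point:04X}"


message = describe_two_byte(b"\xC1\xAA")
```

For the bytes in the test case, message names both bytes and the code
point they were meant to represent (U+006A) rather than the lead byte
alone.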


Thanks,
Thomas

Attachments

text/html attachment: test.html

Received on Wednesday, 28 May 2008 03:36:56 UTC