Re: Error message for invalid UTF-8 overlong forms should be improved from Jukka K. Korpela on 2008-05-28 (www-validator@w3.org from May 2008)

From: Jukka K. Korpela <jkorpela@cs.tut.fi>
Date: Wed, 28 May 2008 10:40:54 +0300
To: "Thomas Rutter" <tom@thomasrutter.com>, <www-validator@w3.org>
Message-ID: <00fd01c8c096$2b2b6230$0500000a@DOCENDO>
Thomas Rutter wrote:

> When validating a page containing a UTF-8 overlong form of an ASCII
> character (in this example \x3A), I get the following error message:
>
>> Sorry, I am unable to validate this document because on line 1 it
>> contained one or more bytes that I cannot interpret as utf-8 (in
>> other words, the bytes found are not valid values in the specified
>> Character Encoding). Please check both the content of the file and
>> the character encoding indication.
>>
>> The error was: utf8 "\xC1" does not map to Unicode

The message looks clear to me: it is correct and about as understandable 
as it can be within the limits of correctness. The only thing I 
would change is "I cannot interpret", which would be better in the passive 
voice: "cannot be interpreted", since it is not a matter of the abilities 
of a specific program, the validator.

"UTF-8 overlong form" is a misnomer. There is only one UTF-8 encoded 
representation of each Unicode code point, including those in the ASCII 
range. An "overlong form" is simply an error, possibly resulting from an 
incorrect algorithm for encoding, possibly something else.
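
As an illustration of that point, here is a minimal sketch in Python. Python 
is chosen only for the example and is not claimed to be what the validator 
runs; the variable names are made up. It uses only the built-in str.encode 
and bytes.decode with their default strict error handling.

```python
# The one and only UTF-8 encoding of U+006A ("j") is the single octet 0x6A.
valid = "j".encode("utf-8")

# The two octets discussed in this thread; a strict decoder rejects them.
malformed = b"\xc1\xaa"

rejected = False
offending_octet = None
try:
    malformed.decode("utf-8")
except UnicodeDecodeError as err:
    rejected = True
    # err.start is the offset of the first octet the decoder could not accept.
    offending_octet = malformed[err.start]

print(valid, rejected, offending_octet)
```

A strict decoder never gets as far as computing a code point from such a 
sequence; it stops at the first octet.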

> Please find attached a simplified test-case.

It is a base64-encoded piece of data, not a document - it even lacks a 
doctype declaration.

> Please note that the
> test case is UTF-8, and opening it in most text editors will cause the
> overlong form to be silently corrected.

I don't know what you expect the data to contain, but the text editors 
that I use don't correct "overlong forms".

> Let's hope that doesn't
> happen in the transmission of this email.

Uploading a full document on a web server and posting the URL would save 
us from guessing.

> A UTF-8 overlong form is a UTF-8 character represented using more
> bytes than is necessary.

No, it is just malformed data.

> The validator parses both bytes of the character, which in this
> example are \xC1\xAA.  It converts this to the numerical
> representation \x3A (the lowercase 'j'),

What makes you think so? If I test a page containing \xC1 \xAA and 
declared as UTF-8, the validator simply reports as quoted above, so it 
only reports the first offending octet. The idea, I guess, is that this 
is a low-level data error that should be inspected using a text editor 
suitable for the job, rather than a markup validator. The data would 
equally be in error if claimed to be UTF-8 encoded plain text.
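
The kind of check described here can be sketched as follows. This is a 
hypothetical helper written in Python for illustration, not the validator's 
actual code: the function name first_invalid_octet and its return shape are 
assumptions. It reports only the first octet that is not valid UTF-8, 
together with its line number and offset.

```python
def first_invalid_octet(data: bytes):
    """Return (line, offset, octet) for the first octet that cannot be
    decoded as UTF-8, or None if the data is well-formed."""
    try:
        data.decode("utf-8")  # strict error handling is the default
    except UnicodeDecodeError as err:
        # Count newlines before the failure to get a 1-based line number.
        line = data.count(b"\n", 0, err.start) + 1
        return line, err.start, data[err.start]
    return None


print(first_invalid_octet(b"<p>\xc1\xaa</p>"))
print(first_invalid_octet("<p>\u00c1</p>".encode("utf-8")))
```

The check works on octets alone, so it applies equally to markup and to 
plain text declared as UTF-8.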

> However, the error message given by the validator is that the
> document contains code point \xC1 and that this is not a valid Unicode
> character.

No, it says that the _octet_ (called "byte" for apparent reasons) \xC1 
cannot be interpreted. In the error message, the part
The error was: utf8 "\xC1" does not map to Unicode
is perhaps somewhat misleading and bears part of the blame here. The 
octet \xC1 simply isn't UTF-8. A better formulation would be the 
following:
Specific data: "\xC1" is not allowed in utf-8.

> \xC1 does map to a valid unicode character
> (capital A with an accent).

No, it does not. The octet \xC1 means nothing in UTF-8. In ISO-8859-1, it 
means U+00C1, but that's a different issue.
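
The difference between the two encodings can be checked with a short Python 
sketch (illustrative only; the variable names are made up, and only the 
standard "iso-8859-1" and "utf-8" codecs are used). It also shows that the 
lead octet of a well-formed two-octet UTF-8 sequence is never lower than 
\xC2, which is why \xC1 cannot start anything.

```python
octet = b"\xc1"

# In ISO-8859-1 each octet maps to the code point with the same number.
as_latin1 = octet.decode("iso-8859-1")

# The well-formed UTF-8 encoding of that character uses two other octets.
utf8_of_char = as_latin1.encode("utf-8")

# The same single octet is rejected when the data is declared as UTF-8.
try:
    octet.decode("utf-8")
    utf8_ok = True
except UnicodeDecodeError:
    utf8_ok = False

# Smallest lead octet over all code points that need two octets in UTF-8.
smallest_lead = min(chr(cp).encode("utf-8")[0] for cp in range(0x80, 0x800))

print(hex(ord(as_latin1)), utf8_of_char, utf8_ok, hex(smallest_lead))
```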

> The error message caused considerable
> frustration in finding what was wrong with the character.

I can understand it, but the problem is in the software that produced 
the malformed data. The validator simply reports what's wrong at the 
UTF-8 level, which it needs to check before being able to read 
characters.

> A more helpful error message would state:
>
>> The error was: utf8 Illegal overlong form "\xC1\x3A"

No, "overlong form" is not a commonly understood concept, it is 
misleading in implying that the data would actually be UTF-8, and it is 
particularly misleading in contexts where the offending data is there 
for some quite different reason, e.g. because someone mistakenly 
submits ISO-8859-1 data as UTF-8 (fairly common) or the data just 
contains spurious octets caused by some software error.

Jukka K. Korpela ("Yucca")
http://www.cs.tut.fi/~jkorpela/
Received on Wednesday, 28 May 2008 07:41:23 UTC