Re: Error message for invalid UTF-8 overlong forms should be improved

> Uploading a full document on a web server and posting the URL would save 
> us from guessing.
>   
Please see two test cases here:

http://arcticforest.com/tmp/test-EDA080.html
http://arcticforest.com/tmp/test-C1AA.html

For the first, the validator sees the bytes \xED\xA0\x80 and complains 
"utf8 "\xD800" does not map to Unicode".
For the second, the validator sees the bytes \xC1\xAA and complains 
"utf8 "\xC1" does not map to Unicode".

Given the differences I believe one of the two has to be incorrect.  In 
the first, the validator is complaining about \xD800, which is a numeric 
representation after 'decoding' three bytes as UTF-8.  In the second, 
the validator is complaining about \xC1, which is not the numeric 
representation after decoding anything but is rather the first octet 
encountered which is not part of a valid UTF-8 character.

It was my interpretation that the incorrect one was the second one.  I 
believe the error messages should be more like:

1: "utf8 "\xD800" does not map to Unicode".
2: "octet "\xC1" is not the start of a UTF-8 character".
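
To make the proposed distinction concrete, here is a minimal sketch in Python 3. It is not the validator's actual code; the function name `describe_first_error` and the two extra messages (incomplete and overlong sequences) are my own assumptions, added only so the function covers every case.

```python
from typing import Optional


def describe_first_error(data: bytes) -> Optional[str]:
    """Return a message for the first UTF-8 problem in data, or None if clean."""
    i = 0
    while i < len(data):
        lead = data[i]
        if lead < 0x80:  # plain ASCII octet
            i += 1
            continue
        if 0xC2 <= lead <= 0xDF:
            length, value = 2, lead & 0x1F
        elif 0xE0 <= lead <= 0xEF:
            length, value = 3, lead & 0x0F
        elif 0xF0 <= lead <= 0xF4:
            length, value = 4, lead & 0x07
        else:
            # 0x80-0xC1 and 0xF5-0xFF can never start a UTF-8 character,
            # so report the raw octet and say that it is an octet.
            return 'octet "\\x%02X" is not the start of a UTF-8 character' % lead
        tail = data[i + 1:i + length]
        if len(tail) != length - 1 or any((b & 0xC0) != 0x80 for b in tail):
            # Hypothetical wording, not taken from the validator.
            return 'octet "\\x%02X" starts an incomplete UTF-8 sequence' % lead
        for b in tail:
            value = (value << 6) | (b & 0x3F)
        if value < {2: 0x80, 3: 0x800, 4: 0x10000}[length]:
            # Hypothetical wording, not taken from the validator.
            return 'octet "\\x%02X" starts an overlong UTF-8 sequence' % lead
        if 0xD800 <= value <= 0xDFFF or value > 0x10FFFF:
            # Well-formed bit pattern, but the decoded value is not a character.
            return 'utf8 "\\x%X" does not map to Unicode' % value
        i += length
    return None


print(describe_first_error(b"\xED\xA0\x80"))
print(describe_first_error(b"\xC1\xAA"))
```

With this sketch the first test case reports the decoded value \xD800 and the second reports the octet \xC1, so the reader can tell from the wording which kind of number is being quoted.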
>
> I don't know what you expect the data to contain, but the text editors 
> that I use don't correct "overlong forms".
>   
I included that as a side note.  The text editor I used was PSPad, which 
kept silently correcting \xC1\xAA to \x6A (the letter j) without letting 
me know, making testing difficult.  More concerning was that tools such 
as the WDG validator also silently converted it to another character and 
did not complain or count it as an error.  But that is of course a 
separate issue.
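
For anyone wondering where the "j" comes from, the following Python 3 sketch applies the UTF-8 bit arithmetic with no validity checks at all, which is presumably what a lenient tool does (I have not seen PSPad's or the WDG validator's code, so that is an assumption). The helper names `naive_decode` and `strict_rejects` are mine.

```python
def naive_decode(seq: bytes) -> int:
    """Decode one UTF-8-shaped sequence without rejecting anything invalid."""
    lead = seq[0]
    if lead < 0x80:
        return lead
    if lead >> 5 == 0b110:        # 110xxxxx: two-octet form
        value = lead & 0x1F
    elif lead >> 4 == 0b1110:     # 1110xxxx: three-octet form
        value = lead & 0x0F
    elif lead >> 3 == 0b11110:    # 11110xxx: four-octet form
        value = lead & 0x07
    else:
        raise ValueError("not a lead octet")
    for b in seq[1:]:
        value = (value << 6) | (b & 0x3F)  # append six payload bits
    return value


def strict_rejects(seq: bytes) -> bool:
    """True if Python's own strict UTF-8 decoder refuses the sequence."""
    try:
        seq.decode("utf-8")
    except UnicodeDecodeError:
        return True
    return False


print(hex(naive_decode(b"\xC1\xAA")), chr(naive_decode(b"\xC1\xAA")))  # 0x6a j
print(hex(naive_decode(b"\xED\xA0\x80")))                              # 0xd800
print(strict_rejects(b"\xC1\xAA"), strict_rejects(b"\xED\xA0\x80"))    # True True
```

A strict decoder refuses both sequences, while the lenient arithmetic quietly turns the overlong form into an ordinary "j".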
>
> What makes you think so? If I test a page containing \xC1 \xAA and 
> declared as UTF-8, the validator simply reports as quoted above, so it 
> only reports the first offending octet. The idea, I guess, is that this 
> is low-level data error that should be inspected using a text editor 
> suitable for the job, rather than a markup validator. The data would 
> equally be in error if claimed to be UTF-8 encoded plain text.
>   
Essentially, the first error was that the parser decoded the UTF-8 bytes 
to the value \xD800, which is not a valid Unicode character, and the 
second error was that the parser found that the octet \xC1 cannot be the 
first byte of a valid UTF-8 sequence.  The error messages don't make this 
distinction.

My complaint was this.  The first error message told me I had an invalid 
\xD800, so I had to find where my application was adding the UTF-8 
representation of that, i.e. \xED\xA0\x80.  The second error message told 
me I had an invalid \xC1, but when I searched for where my application 
was adding the UTF-8 representation of this, i.e. \xC3\x81, I did not 
find it.  Instead, I discovered that the second error message had the 
same form but was caused by a different condition.  This time it was 
referring to the raw octet \xC1, which was the first byte of the 
malformed sequence.

The second message could be misunderstood to mean that the document 
contains the decoded value \xC1 (which would be encoded as the bytes 
\xC3\x81) and that this value is invalid in Unicode.
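
The byte sequences I went searching for can be reproduced with a short Python 3 sketch. The helper name `utf8_octets` is mine, and the `surrogatepass` error handler is used only so that the surrogate \xD800 can be serialised for illustration; a strict encoder refuses it.

```python
def utf8_octets(code_point: int) -> str:
    """Return the UTF-8 encoding of a code point as \\xNN escapes."""
    raw = chr(code_point).encode("utf-8", errors="surrogatepass")
    return "".join("\\x%02X" % b for b in raw)


print(utf8_octets(0xC1))    # \xC3\x81
print(utf8_octets(0xD800))  # \xED\xA0\x80
```

So a message quoting "\xC1" sends the reader looking for \xC3\x81, which is not what the document contains.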
>> \xC1 does map to a valid unicode character
>> (capital A with an accent).
>>     
>
> No it does not. The octet \xC1 means nothing in UTF-8. In ISO-8859-1, it 
> means U+00C1, but that's a different issue.
>   
Sorry, I realise of course you are correct, and yes it's a different issue :)

Thanks,
Thomas

Received on Wednesday, 28 May 2008 11:18:53 UTC