Re: Error message for invalid UTF-8 overlong forms should be improved

Thomas Rutter wrote:

> http://arcticforest.com/tmp/test-EDA080.html
> http://arcticforest.com/tmp/test-C1AA.html
>
> For the first, the validator sees the bytes \xED\xA0\x80 and complains
> "utf8 "\xD800" does not map to Unicode".
> For the second, the validator sees the bytes \xC1\xAA and complains
> "utf8 "\xC1" does not map to Unicode".

That is indeed inconsistent, and the more I think about it, the more
misleading this "utf8 "\x..." does not map to Unicode" message looks.
It is difficult to express concisely that data that has been declared or
assumed to be utf-8 encoded violates the rules of utf-8 and thus cannot
be interpreted as characters. But the current formulation is misleading
and even plain wrong, at least in the first case.

The Unicode Standard uses the term "ill-formed" for code unit
sequences (definition D84), and "ill-formed byte sequence" might be an
appropriate expression, since it can be understood fairly intuitively,
too. (Of course, the word "byte" is better known to most people than the
more exact "octet".)

> Given the differences I believe one of the two has to be incorrect.

I would say that the first one is definitely wrong, and both are 
misleading. I'm afraid it might be nontrivial to fix this, since it's 
perhaps not just a matter of rewriting the error message. Maybe the 
validator does not "remember" the original data (the bytes) when it 
starts reporting the error.

> In the first, the validator is complaining about \xD800, which is a
> numeric representation after 'decoding' three bytes as UTF-8.

We might describe things that way from the perspective of the internal 
logic of the software, but the three bytes _cannot_ be decoded as UTF-8.
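
To make the "would be" concrete, here is a minimal Python sketch of the
bit arithmetic that a decoder without the surrogate restriction would
apply to a three-byte sequence. The function name naive_decode_3byte is
made up for illustration, and this is not the validator's code; it only
applies the layout 1110xxxx 10xxxxxx 10xxxxxx blindly.

```python
def naive_decode_3byte(data: bytes) -> int:
    """Apply the three-byte UTF-8 bit layout with no validity checks.

    Hypothetical helper for illustration only; a real decoder must also
    reject surrogates and overlong forms.
    """
    b0, b1, b2 = data
    return ((b0 & 0x0F) << 12) | ((b1 & 0x3F) << 6) | (b2 & 0x3F)


code_point = naive_decode_3byte(b"\xed\xa0\x80")
print(f"U+{code_point:04X}")  # U+D800

# A conforming decoder rejects the same bytes outright.
try:
    b"\xed\xa0\x80".decode("utf-8")
except UnicodeDecodeError as err:
    print("rejected:", err.reason)
```

So the value D800 only ever exists as an intermediate result inside
such a decoder, never in the data itself.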

> In the second,
> the validator is complaining about \xC1, which is not the numeric
> representation after decoding anything but is rather the first octet
> encountered which is not part of a valid UTF-8 character.

Right. And the validator should refer to the actual data, the bytes, and
not to something like \xD800, which would be the decoded code point if
utf-8 did not have a specific restriction.
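
As a rough illustration of a message that refers to the bytes, the
following Python sketch reports the offset where decoding fails and
shows the next few bytes in hex. The function name, the wording of the
message and the "context" parameter are assumptions of mine, not
anything the validator provides; the bytes shown are simply the first
few bytes from the point of failure, not a precise delimitation of the
ill-formed sequence.

```python
from typing import Optional


def describe_ill_formed(data: bytes, context: int = 3) -> Optional[str]:
    """Return a byte-oriented error message, or None if data is valid UTF-8."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError as err:
        # err.start is the offset of the first byte that could not be decoded.
        chunk = data[err.start:err.start + context]
        shown = " ".join(f"{b:02X}" for b in chunk)
        return f"ill-formed byte sequence at offset {err.start}: {shown}"
    return None


for sample in (b"\xed\xa0\x80", b"\xc1\xaa", b"plain ASCII"):
    print(describe_ill_formed(sample))
```

With this approach both test cases are reported in the same terms, as
bytes that appear in the document.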

Maybe it could _additionally_ suggest some possible reasons for the data 
error, but I think that's beyond the scope of a markup validator. Maybe 
the error message could refer to some utility or source of information 
for analyzing ill-formed data. But the data might actually be 
well-formed in practice, just in an encoding different from utf-8; yet, 
when declared or assumed as utf-8, it should be reported as ill-formed.
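
To illustrate that last point, this small Python sketch tries a few
encodings on the same bytes. The function name and the list of
encodings are arbitrary choices for the example. Note that iso-8859-1
accepts any byte sequence, so a successful decode there proves nothing
by itself; it only shows what the author may have intended.

```python
def decodings(data: bytes, encodings=("utf-8", "iso-8859-1", "cp1252")):
    """Map each encoding that accepts the data to the decoded text."""
    results = {}
    for name in encodings:
        try:
            results[name] = data.decode(name)
        except UnicodeDecodeError:
            # The data is ill-formed in this encoding; leave it out.
            pass
    return results


for name, text in decodings(b"\xc1\xaa").items():
    print(name, [f"U+{ord(ch):04X}" for ch in text])
```

For the bytes \xC1\xAA, utf-8 is absent from the result, while the two
8-bit encodings yield U+00C1 and U+00AA.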

> My complaint was that for the first error message, it told me I had an
> invalid \xD800 and I had to find where my application was adding the
> UTF-8 representation of that, ie \xED\xA0\x80.  When I encountered the
> second error message, it told me I had an invalid \xC1 but when I
> searched for where my application was adding the UTF-8 representation
> of this, ie \xC3\x81, but I did not find it.

That's indeed confusing, especially since the data does not contain
\xD800 at all, just a byte sequence that _would_ represent that code
point _if_ a certain rule were removed from the definition of utf-8.
Moreover, it is inappropriate to say that \xD800 (as a code point) "does
not map to Unicode", since it is a Unicode code point, just one that is
never assigned to a character (more specifically, a surrogate code
point, a fairly confusing term). It would be appropriate to say that it
does not map to a Unicode character.
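
The status of D800 as a code point can be checked with Python's
unicodedata module, as in the sketch below. The helper is_surrogate is
a name made up for this example; the range D800..DFFF is the surrogate
range defined by the Unicode Standard.

```python
import unicodedata


def is_surrogate(code_point: int) -> bool:
    """True for code points in the surrogate range D800..DFFF."""
    return 0xD800 <= code_point <= 0xDFFF


print(is_surrogate(0xD800))               # True
print(unicodedata.category(chr(0xD800)))  # Cs

# The code point exists, but it cannot be encoded as utf-8.
try:
    chr(0xD800).encode("utf-8")
except UnicodeEncodeError as err:
    print("cannot encode:", err.reason)
```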

Jukka K. Korpela ("Yucca")
http://www.cs.tut.fi/~jkorpela/ 

Received on Thursday, 29 May 2008 06:15:25 UTC