- From: Jukka K. Korpela <jkorpela@cs.tut.fi>
- Date: Thu, 29 May 2008 09:14:40 +0300
- To: "Thomas Rutter" <tom@thomasrutter.com>
- Cc: <www-validator@w3.org>
Thomas Rutter wrote:

> http://arcticforest.com/tmp/test-EDA080.html
> http://arcticforest.com/tmp/test-C1AA.html
>
> For the first, the validator sees the bytes \xED\xA0\x80 and complains
> "utf8 "\xD800" does not map to Unicode".
> For the second, the validator sees the bytes \xC1\xAA and complains
> "utf8 "\xC1" does not map to Unicode".

That's inconsistent indeed, and the more I think about it, the more misleading this "utf8 "\x..." does not map to Unicode" wording looks. It is difficult to express concisely that data declared or assumed to be utf-8 encoded violates the rules of utf-8 and thus cannot be interpreted as characters. But the current formulation is misleading and, at least in the first case, plainly wrong.

The Unicode Standard uses the term "ill-formed" for code unit sequences (D84), and "ill-formed byte sequence" might be an appropriate expression, since it can also be understood fairly intuitively. (Of course, the word "byte" is better known to most people than the more exact "octet".)

> Given the differences I believe one of the two has to be incorrect.

I would say that the first one is definitely wrong, and both are misleading. I'm afraid it might be nontrivial to fix this, since it is perhaps not just a matter of rewriting the error message. Maybe the validator does not "remember" the original data (the bytes) by the time it reports the error.

> In the first, the validator is complaining about \xD800, which is a
> numeric representation after 'decoding' three bytes as UTF-8.

We might describe things that way from the perspective of the internal logic of the software, but the three bytes _cannot_ be decoded as UTF-8.

> In the second,
> the validator is complaining about \xC1, which is not the numeric
> representation after decoding anything but is rather the first octet
> encountered which is not part of a valid UTF-8 character.

Right. And the validator should refer to the actual data, the bytes, and not to something like \xD800, which is what the bytes would represent if utf-8 did not have a specific restriction.

Maybe it could _additionally_ suggest some possible reasons for the data error, but I think that is beyond the scope of a markup validator. Maybe the error message could refer to some utility or source of information for analyzing ill-formed data. But the data might actually be well-formed in practice, just in an encoding other than utf-8; yet, when declared or assumed to be utf-8, it should be reported as ill-formed.

> My complaint was that for the first error message, it told me I had an
> invalid \xD800 and I had to find where my application was adding the
> UTF-8 representation of that, ie \xED\xA0\x80. When I encountered the
> second error message, it told me I had an invalid \xC1, but when I
> searched for where my application was adding the UTF-8 representation
> of this, ie \xC3\x81, I did not find it.

That is indeed confusing, especially since the data does not contain \xD800 at all, just a byte sequence that _would_ represent that code point _if_ a certain rule were removed from the definition of utf-8.

Moreover, it is inappropriate to say that \xD800 (as a code point) "does not map to Unicode", since it is a Unicode code point; it is just defined never to represent a character (more specifically, it is a surrogate code point, itself a fairly confusing term). It would be appropriate to say that it does not map to a Unicode character.

Jukka K. Korpela ("Yucca")
http://www.cs.tut.fi/~jkorpela/
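
For readers who want to reproduce the distinction, here is a minimal sketch using Python 3's strict UTF-8 codec. The byte values are the ones discussed in the thread; the choice of Python is purely illustrative and says nothing about how the validator itself is implemented.

    # The two byte sequences discussed in this thread.
    seq1 = b"\xED\xA0\x80"  # would encode the surrogate code point U+D800
                            # if utf-8 allowed surrogates
    seq2 = b"\xC1\xAA"      # 0xC1 can never start a well-formed utf-8
                            # sequence (it would be an overlong encoding)

    for seq in (seq1, seq2):
        try:
            seq.decode("utf-8")
        except UnicodeDecodeError as err:
            print(seq, "is ill-formed:", err.reason)

    # U+D800 is a Unicode code point, but it does not map to a character,
    # so a strict utf-8 encoder refuses it as well.
    try:
        "\ud800".encode("utf-8")
    except UnicodeEncodeError as err:
        print("U+D800 cannot be encoded:", err.reason)

Both byte sequences are rejected as ill-formed, and even the code point U+D800 itself cannot be encoded in utf-8, which is why "does not map to a Unicode character" would be the more accurate phrasing.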
Received on Thursday, 29 May 2008 06:15:25 UTC