- From: Thomas Rutter <tom@thomasrutter.com>
- Date: Wed, 28 May 2008 21:17:50 +1000
- To: "Jukka K. Korpela" <jkorpela@cs.tut.fi>
- CC: www-validator@w3.org
- Message-ID: <483D3F5E.2050809@thomasrutter.com>
> Uploading a full document on a web server and posting the URL would save
> us from guessing.

Please see two test cases here:

http://arcticforest.com/tmp/test-EDA080.html
http://arcticforest.com/tmp/test-C1AA.html

For the first, the validator sees the bytes \xED\xA0\x80 and complains:
"utf8 "\xD800" does not map to Unicode".

For the second, the validator sees the bytes \xC1\xAA and complains:
"utf8 "\xC1" does not map to Unicode".

Given the difference, I believe one of the two has to be incorrect. In the
first case, the validator is complaining about \xD800, a numeric value
obtained after decoding three bytes as UTF-8. In the second, it is
complaining about \xC1, which is not the numeric value obtained by decoding
anything; it is simply the first octet encountered that is not part of a
valid UTF-8 sequence.

My interpretation was that the second one is incorrect. I believe the error
messages should read more like:

1: "utf8 "\xD800" does not map to Unicode".
2: "octet "\xC1" is not the start of a UTF-8 character".

> I don't know what you expect the data to contain, but the text editors
> that I use don't correct "overlong forms".

I included that as a side note. The text editor I used was PSPad, which
kept silently correcting \xC1\xAA to \x6A (the letter j) without letting
me know, making testing difficult. More concerning was that tools such as
the WDG validator also silently converted it to another character and did
not complain or count it as an error. But that is of course a separate
issue.

> What makes you think so? If I test a page containing \xC1 \xAA and
> declared as UTF-8, the validator simply reports as quoted above, so it
> only reports the first offending octet. The idea, I guess, is that this
> is a low-level data error that should be inspected using a text editor
> suitable for the job, rather than a markup validator. The data would
> equally be in error if claimed to be UTF-8 encoded plain text.
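As an aside, the two failure modes in the test cases above can be reproduced with any strict UTF-8 decoder. A minimal sketch in Python (not the validator's own code, which these messages suggest comes from Perl's decoder) shows that it reports a different reason for each failure, which is exactly the distinction the validator's messages blur:

```python
# Decode the two test-case byte sequences as UTF-8.
# \xED\xA0\x80 is the would-be encoding of the surrogate U+D800;
# \xC1\xAA is an overlong form whose lead octet is never valid.
for raw in (b'\xed\xa0\x80', b'\xc1\xaa'):
    try:
        raw.decode('utf-8')
    except UnicodeDecodeError as err:
        # err.reason distinguishes the two cases.
        print(raw.hex(), '->', err.reason)
```

Both sequences are rejected, but with distinct reasons (an out-of-range continuation byte versus an invalid start byte), rather than one conflated message.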
Essentially, the first error was that the parser found that the decoded
UTF-8 value \xD800 was not a valid Unicode character, and the second error
was that the parser found that the octet \xC1 was not the first byte of a
valid UTF-8 sequence. The error messages don't make this distinction.

My complaint was that the first error message told me I had an invalid
\xD800, so I had to find where my application was adding the UTF-8
representation of that value, i.e. \xED\xA0\x80. When I encountered the
second error message, it told me I had an invalid \xC1, but when I
searched for where my application was adding the UTF-8 representation of
that, i.e. \xC3\x81, I did not find it. Instead, I discovered that the
second message, though worded the same, was caused by a different
condition: this time it referred to the octet \xC1 itself, the first byte
of the offending sequence. The second message could be misunderstood to
mean that the document contains the character \xC1 (which would be encoded
as the bytes \xC3\x81) and that this character is invalid in Unicode.

>> \xC1 does map to a valid unicode character
>> (capital A with an accent).

> No it does not. The octet \xC1 means nothing in UTF-8. In ISO-8859-1,
> it means U+00C1, but that's a different issue.

Sorry, I realise of course you are correct, and yes, it's a different
issue :)

Thanks,
Thomas
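The point settled at the end of the exchange can be checked directly with any UTF-8 implementation; for instance, in Python (an illustrative sketch):

```python
# U+00C1 (LATIN CAPITAL LETTER A WITH ACUTE) is a valid Unicode
# character; its UTF-8 encoding is the two-byte sequence C3 81.
encoded = '\u00C1'.encode('utf-8')
print(encoded)  # b'\xc3\x81'

# The lone octet C1 means U+00C1 only under ISO-8859-1; as UTF-8
# it is not the encoding of anything.
print(b'\xc1'.decode('iso-8859-1'))
```

So a report naming "\xC1" is ambiguous between "the character U+00C1 appeared where it is not allowed" and "the raw octet \xC1 appeared where a UTF-8 lead byte was expected"; only the latter can actually occur in UTF-8 input.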
Received on Wednesday, 28 May 2008 11:18:53 UTC