- From: Jukka K. Korpela <jkorpela@cs.tut.fi>
- Date: Wed, 28 May 2008 10:40:54 +0300
- To: "Thomas Rutter" <tom@thomasrutter.com>, <www-validator@w3.org>
Thomas Rutter wrote:

> When validating a page containing a UTF-8 overlong form of an ASCII
> character (in this example \x3A), I get the following error message:
>
>> Sorry, I am unable to validate this document because on line 1 it
>> contained one or more bytes that I cannot interpret as utf-8 (in
>> other words, the bytes found are not valid values in the specified
>> Character Encoding). Please check both the content of the file and
>> the character encoding indication.
>>
>> The error was: utf8 "\xC1" does not map to Unicode

The message looks clear to me, both correct and about as understandable as it can be, within the limitations of correctness. The only thing I would change is "I cannot interpret", which would be better in the passive voice: "cannot be interpreted", since it is not a matter of the abilities of a specific program, the validator.

"UTF-8 overlong form" is a misnomer. There is only one UTF-8 encoded representation of each Unicode code point, including those in the ASCII range. An "overlong form" is simply an error, possibly resulting from an incorrect algorithm for encoding, possibly something else.

> Please find attached a simplified test-case.

It is a base64-encoded piece of data, not a document - it even lacks a doctype declaration.

> Please note that the
> test case is UTF-8, and opening it in most text editors will cause the
> overlong form to be silently corrected.

I don't know what you expect the data to contain, but the text editors that I use don't correct "overlong forms".

> Let's hope that doesn't
> happen in the transmission of this email.

Uploading a full document on a web server and posting the URL would save us from guessing.

> A UTF-8 overlong form is a UTF-8 character represented using more
> bytes than is necessary.

No, it is just malformed data.

> The validator parses both bytes of the character, which in this
> example are \xC1\xAA.
> It converts this to the numerical
> representation \x3A (the lowercase 'j'),

What makes you think so? If I test a page containing \xC1 \xAA and declared as UTF-8, the validator simply reports as quoted above, so it only reports the first offending octet. The idea, I guess, is that this is a low-level data error that should be inspected using a text editor suitable for the job, rather than a markup validator. The data would equally be in error if claimed to be UTF-8 encoded plain text.

> However, the error message given by the validator is that the
> document contains code point \xC1 and that this is not a valid Unicode
> character.

No, it says that the _octet_ (called "byte" for apparent reasons) \xC1 cannot be interpreted. In the error message, the part

  The error was: utf8 "\xC1" does not map to Unicode

is perhaps somewhat misleading and takes part of the guilt here. The octet \xC1 simply isn't UTF-8. A better formulation would be the following:

  Specific data: "\xC1" is not allowed in utf-8.

> \xC1 does map to a valid unicode character
> (capital A with an accent).

No, it does not. The octet \xC1 means nothing in UTF-8. In ISO-8859-1, it means U+00C1, but that's a different issue.

> The error message caused considerable
> frustration in finding what was wrong with the character.

I can understand it, but the problem is in the software that produced the malformed data. The validator simply reports what's wrong at the UTF-8 level, which it needs to check before being able to read characters.

> A more helpful error message would state:
>
>> The error was: utf8 Illegal overlong form "\xC1\x3A"

No, "overlong form" is not a commonly understood concept, it is misleading in implying that the data would actually be UTF-8, and it is particularly misleading in contexts where the offending data is there for some quite different reason, e.g.
because someone mistakenly submits ISO-8859-1 data as UTF-8 (fairly common) or the data just contains spurious octets caused by some software error.

Jukka K. Korpela ("Yucca")
http://www.cs.tut.fi/~jkorpela/
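
For readers who want to reproduce the point about the octet \xC1, here is a minimal sketch. It assumes Python 3 and its built-in strict UTF-8 decoder, which has nothing to do with the validator's own code. The helper names `first_invalid_utf8_octet` and `naive_two_octet_value` are hypothetical, introduced only for this illustration.

```python
def first_invalid_utf8_octet(data: bytes):
    """Return (offset, octet) for the first octet that Python's strict
    UTF-8 decoder rejects, or None if the data is well-formed UTF-8."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError as err:
        return err.start, data[err.start]
    return None


def naive_two_octet_value(lead: int, trail: int) -> int:
    """Combine the payload bits of a two-octet sequence WITHOUT checking
    validity. Shown only to illustrate what a lenient decoder would
    compute; a conforming decoder must not do this."""
    return ((lead & 0x1F) << 6) | (trail & 0x3F)


sample = b"\xC1\xAA"

# The strict decoder stops at the first octet: \xC1 can never occur in UTF-8.
print(first_invalid_utf8_octet(sample))

# A lenient decoder would arrive at 0x6A ('j') for these two octets.
print(hex(naive_two_octet_value(0xC1, 0xAA)))

# Read as ISO-8859-1, the same octet \xC1 is U+00C1, a different matter.
print(hex(ord(b"\xC1".decode("iso-8859-1"))))
```

The strict decoder flags offset 0 and the octet \xC1, much as the validator's message does, and never looks at the octets as a pair.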
Received on Wednesday, 28 May 2008 07:41:23 UTC