Re: Invalid Bytes for Charset from Jukka K. Korpela on 2008-11-14 (www-validator@w3.org from November 2008)

From: Jukka K. Korpela <jkorpela@cs.tut.fi>
Date: Fri, 14 Nov 2008 17:28:09 +0200
To: <www-validator@w3.org>
Message-ID: <F4C9F0E5D58C4442B621D349319F5FDD@JukanPC>

Michael Adams wrote:

[ discussing the error message ...]
>>> The error was: utf8 "\x80" does not map to Unicode

> \x80 is illegal as a first byte in unicode.

First of all, this relates to UTF-8 encoding only.

Second, you're right in the sense that byte 80 is not allowed as the first
byte of the encoding of a character in UTF-8. I was confused when I wrote
that it must be _followed_ by a byte pattern of a specific kind; instead, it
must appear _within_ a byte combination of a certain kind, i.e. as a
continuation byte after a suitable lead byte.
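
To illustrate, here is a minimal sketch in Python (my choice for
illustration only; I am not assuming anything about how the validator itself
is implemented). The helper name decodes_as_utf8 is made up for this example.
It shows byte 80 being rejected on its own but accepted as the second byte of
a two-byte sequence.

```python
# Byte 0x80 has the bit pattern 10xxxxxx, which UTF-8 reserves for
# continuation bytes, so it can never start the encoding of a character.
def decodes_as_utf8(data: bytes) -> bool:
    """Return True if data is well-formed UTF-8 (illustrative helper)."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


lone = b"\x80"        # continuation byte with no lead byte before it
paired = b"\xc2\x80"  # lead byte C2 followed by continuation byte 80

print(decodes_as_utf8(lone))             # False
print(decodes_as_utf8(paired))           # True
print(hex(ord(paired.decode("utf-8"))))  # 0x80, i.e. U+0080
```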

Anyway, the error message is wrong. The byte 80 occurring in a UTF-8 data
stream surely "maps to Unicode" as part of valid byte patterns. A correct
error message would be "The error was: Byte 80 (hexadecimal) found in
purported UTF-8 data in a context where it is not allowed." This is fairly
generic, of course, but I suppose the error message pattern is generic as
well, so we cannot assume that it's about occurrences as first bytes only.
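
As a rough sketch of how such a message could be produced, assuming Python
and its built-in UTF-8 decoder rather than whatever the validator actually
uses: the function name describe_utf8_error and the added "at offset" detail
are my own inventions for illustration.

```python
def describe_utf8_error(data: bytes) -> str:
    """Return a message naming the first offending byte, or '' if valid.

    Hypothetical helper for illustration only.
    """
    try:
        data.decode("utf-8")
    except UnicodeDecodeError as err:
        bad = data[err.start]  # first byte the decoder could not accept
        return (
            f"Byte {bad:02X} (hexadecimal) found at offset {err.start} in "
            "purported UTF-8 data in a context where it is not allowed."
        )
    return ""


print(describe_utf8_error(b"abc\x80def"))
```

The wording deliberately says nothing about first bytes, so the same pattern
covers any position in which the decoder rejects a byte.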

-- 
Yucca, http://www.cs.tut.fi/~jkorpela/

Received on Friday, 14 November 2008 15:29:06 UTC