
Re: Invalid Bytes for Charset

From: Jukka K. Korpela <jkorpela@cs.tut.fi>
Date: Fri, 14 Nov 2008 17:28:09 +0200
Message-ID: <F4C9F0E5D58C4442B621D349319F5FDD@JukanPC>
To: <www-validator@w3.org>

Michael Adams wrote:

[ discussing the error message ...]
>>> The error was: utf8 "\x80" does not map to Unicode

> \x80 is illegal as a first byte in unicode.

First of all, this relates to UTF-8 encoding only.

Second, you're right in the sense that byte 80 is not allowed as the first 
byte of the encoding of a character in UTF-8. I was confused when I wrote that 
it must be _followed_ by a byte pattern of a specific kind; instead, it must 
appear _within_ a byte combination of a certain kind.
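
To illustrate the point, here is a minimal Python sketch (my own 
illustration, not anything the validator runs; the helper names 
is_continuation and decodes are made up for this example). Byte 80 has the 
bit pattern 10xxxxxx, which UTF-8 reserves for continuation bytes, so it is 
rejected on its own but accepted after a suitable lead byte such as C2:

```python
def is_continuation(byte: int) -> bool:
    """True if the byte has the form 10xxxxxx (a UTF-8 continuation byte)."""
    return (byte & 0xC0) == 0x80


def decodes(data: bytes) -> bool:
    """True if the data is well-formed UTF-8 according to Python's decoder."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


lone_80 = b"\x80"        # byte 80 in first-byte position: not allowed
within = b"\xc2\x80"     # lead byte C2 followed by 80: encodes U+0080

print(is_continuation(0x80))  # byte 80 is a continuation byte
print(decodes(lone_80))       # the lone byte is rejected
print(decodes(within))        # the two-byte combination is accepted
```

So the same byte is an error or perfectly legal depending only on its 
context.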

Anyway, the error message is wrong. The byte 80 occurring in a UTF-8 data 
stream surely "maps to Unicode" as part of byte patterns. A correct error 
message would be "The error was: Byte 80 (hexadecimal) found in purported 
UTF-8 data in a context where it is not allowed." This is fairly generic of 
course, but I suppose the error message pattern is generic as well, so we 
cannot assume that it's about occurrences as first bytes only.
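
As a rough sketch of how such a generic message could be produced, the 
following Python function (hypothetical; the name describe_utf8_error and 
the added offset are my assumptions, and the validator itself does not use 
this code) reports the first offending byte without claiming anything about 
its position within a character:

```python
from typing import Optional


def describe_utf8_error(data: bytes) -> Optional[str]:
    """Return a generic message for the first invalid byte, or None if valid."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError as err:
        # err.start is the index of the first byte the decoder rejected.
        bad = data[err.start]
        return (
            f"Byte {bad:02X} (hexadecimal) found in purported UTF-8 data "
            f"at offset {err.start} in a context where it is not allowed."
        )
    return None


print(describe_utf8_error(b"abc\x80def"))
print(describe_utf8_error(b"\xc2\x80"))
```

The first call reports byte 80 at offset 3; the second returns None, 
because there the byte 80 is part of a valid two-byte combination.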

Yucca, http://www.cs.tut.fi/~jkorpela/ 
Received on Friday, 14 November 2008 15:29:06 UTC
