Re: Invalid Bytes for Charset

Michael Adams wrote:

[ discussing the error message ...]
>>> The error was: utf8 "\x80" does not map to Unicode

> \x80 is illegal as a first byte in unicode.

First of all, this relates to the UTF-8 encoding only, not to Unicode in general.

Second, you're right in the sense that byte 80 is not allowed as the first 
byte of the encoding of a character in UTF-8. I was confused when I wrote that 
it must be _followed_ by a byte pattern of a specific kind; instead, it must 
appear _within_ a byte combination of a certain kind.
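
To illustrate the distinction, here is a minimal sketch in Python. It is not 
the software that produced the error above, and the helper name 
classify_utf8_byte is made up for this example. It shows that byte 80 decodes 
fine as the second byte of the pair C2 80 but is rejected where a first byte 
is expected.

```python
def classify_utf8_byte(b: int) -> str:
    """Classify one byte value by the role it can play in UTF-8."""
    if b <= 0x7F:
        return "single-byte (ASCII)"
    if 0x80 <= b <= 0xBF:
        return "continuation"      # only valid after a lead byte
    if 0xC2 <= b <= 0xF4:
        return "lead"              # starts a 2-, 3- or 4-byte sequence
    return "never valid"           # C0, C1 and F5..FF

# Byte 80 *within* a two-byte combination: C2 80 encodes U+0080.
inside = bytes([0xC2, 0x80]).decode("utf-8")

# Byte 80 where a first byte is expected: the decoder refuses it.
try:
    bytes([0x80]).decode("utf-8")
    lone_ok = True
except UnicodeDecodeError:
    lone_ok = False
```

Here classify_utf8_byte(0x80) gives "continuation", which is the role byte 80 
has whenever it is allowed at all.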

Anyway, the error message is wrong. The byte 80 occurring in a UTF-8 data 
stream surely "maps to Unicode" as part of byte patterns. A correct error 
message would be "The error was: Byte 80 (hexadecimal) found in purported 
UTF-8 data in a context where it is not allowed." This is fairly generic of 
course, but I suppose the error message pattern is generic as well, so we 
cannot assume that it's about occurrences as first bytes only.
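
As a rough sketch of how such a message could be produced, assuming Python's 
built-in UTF-8 decoder rather than whatever library generated the original 
error: the function name describe_utf8_error and the added offset are my own 
inventions for illustration. Note that for truncated sequences the byte 
reported is the lead byte of the incomplete sequence, so the wording is still 
only approximate.

```python
from typing import Optional


def describe_utf8_error(data: bytes) -> Optional[str]:
    """Return None if data is valid UTF-8, else a generic description
    of the first byte the decoder rejected."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError as e:
        bad = data[e.start]  # first byte of the rejected span
        return (
            f"Byte {bad:02X} (hexadecimal) found in purported UTF-8 data "
            f"in a context where it is not allowed (offset {e.start})."
        )
    return None
```

For example, describe_utf8_error(b"abc\x80") reports byte 80 at offset 3, 
while describe_utf8_error(b"\xc2\x80") returns None because there byte 80 
appears in an allowed context.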

-- 
Yucca, http://www.cs.tut.fi/~jkorpela/ 

Received on Friday, 14 November 2008 15:29:06 UTC