Re: Invalid Bytes for Charset from Michael Adams on 2008-11-14 (www-validator@w3.org from November 2008)

From: Michael Adams <linux_mike@paradise.net.nz>
Date: Fri, 14 Nov 2008 22:32:25 +1300
To: www-validator@w3.org
Message-id: <20081114223225.25c20498.linux_mike@paradise.net.nz>
On Thu, 13 Nov 2008 20:03:24 +0200
Came this utterance formulated by Jukka K. Korpela to my mailbox:

> 
> Jeremy Meyers wrote:
> 
> > When validating my website (www.softlord.com) I get an error
> >
> > Sorry, I am unable to validate this document because on line 1156
> > it contained one or more bytes that I cannot interpret as utf-8 (in
> > other words, the bytes found are not valid values in the specified
> > Character Encoding). Please check both the content of the file and
> > the character encoding indication.
> >
> > The error was: utf8 "\x80" does not map to Unicode
> 
> When I try to validate http://www.softlord.com I get 57 error
> messages, none of which is like the one you describe. Besides, the
> document contains only 1138 lines.
> 
> I suppose you are referring to validation results for some other page
> or for an older version of the page.
> 
> > Would it be possible to have the validator still show the source
> > that it is checking if "show source" is checked, so that I might
> > identify and remove the offending characters?
> 
> Such issues have been discussed in the www-validator list, and people
> (well, at least I) have expressed the opinion that such things are a
> job for a general character data checker. After all, data errors where
> octets do not map to characters in the specified encoding are at a
> completely different level than markup validation. They would be
> errors even if the data were treated, say, as plain text.
> 
> The error message you quote is a real one in the sense that it is
> sometimes issued by the validator. It is symptomatic that in its
> attempt to report a character data encoding error, the validator fails
> miserably. It claims that one or more bytes found "are not valid
> values in the specified Character Encoding". This is definitely
> incorrect when the encoding is UTF-8. Any byte except C0, C1, and F5
> through FF may appear in UTF-8 data. Thus, 80 hexadecimal (or "\x80"
> as the message puts it) must not be judged as invalid. What is wrong
> is a _combination_ of bytes. An 80 byte must be followed by a certain
> pattern of bytes. In this case, it is probably followed by a byte in
> the ASCII range (00 to 7F), and _this_ violates the UTF-8
> specification.
> 

\x80 is illegal as a first byte in UTF-8. Everything in the following
table applies to the first byte only:

binary    hex range  description
0xxxxxxx  00-7F      normal US-ASCII (with no trailing bytes)
10xxxxxx  80-BF      illegal as a first byte (continuation bytes only)
110xxxxx  C0-DF      indicates two byte code with 5 bits of data
1110xxxx  E0-EF      indicates three byte code with 4 bits of data
11110xxx  F0-F7      indicates four byte code with 3 bits of data
11111xxx  F8-FF      illegal

(Within these ranges, C0, C1 and F5-F7 are also never valid, as the
quoted message already notes.)

So the number of leading ones before the first zero gives the total
number of bytes in the encoded character, first byte included. All
subsequent bytes have the binary form 10xxxxxx, i.e. hex 80-BF.
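
As a rough sketch only, the table above and the original request (finding
the offending byte) could be expressed in Python along these lines. The
function names sequence_length and first_invalid_byte are made up for
illustration; the second function simply leans on Python's own strict
UTF-8 decoder rather than on anything the validator does.

```python
def sequence_length(lead: int) -> int:
    """Total length announced by a first byte, following the table above.

    Returns 0 when the byte cannot start a UTF-8 sequence at all.
    Note: this mirrors the bit patterns only; a strict decoder will
    additionally reject C0, C1 and F5-F7.
    """
    if lead <= 0x7F:
        return 1  # 0xxxxxxx: US-ASCII, no trailing bytes
    if 0xC0 <= lead <= 0xDF:
        return 2  # 110xxxxx
    if 0xE0 <= lead <= 0xEF:
        return 3  # 1110xxxx
    if 0xF0 <= lead <= 0xF7:
        return 4  # 11110xxx
    return 0      # 80-BF (continuation only) or F8-FF (illegal)


def first_invalid_byte(data: bytes):
    """Return (line, offset, byte) for the first byte Python's strict
    UTF-8 decoder rejects, or None if the data decodes cleanly.

    line is 1-based; offset is the 0-based position in the whole input.
    """
    try:
        data.decode("utf-8")
    except UnicodeDecodeError as err:
        line = data.count(b"\n", 0, err.start) + 1
        return line, err.start, data[err.start]
    return None


if __name__ == "__main__":
    sample = b"ok\nbad \x80 here"
    print(sequence_length(0x80))       # 0
    print(first_invalid_byte(sample))  # (2, 7, 128)
```

Run against the bytes of a saved copy of the page (opened in binary
mode), something like this should point at the line holding the stray
\x80, assuming the file on disk matches what the server actually sends.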

-- 
Michael

All shall be well, and all shall be well, and all manner of things shall
be well

 - Julian of Norwich 1342 - 1416
Received on Friday, 14 November 2008 09:29:50 UTC