Re: Invalid Bytes for Charset from Jukka K. Korpela on 2008-11-13 (www-validator@w3.org from November 2008)

From: Jukka K. Korpela <jkorpela@cs.tut.fi>
Date: Thu, 13 Nov 2008 20:03:24 +0200
To: "Jeremy Meyers" <jeremy.meyers@sonybmg.com>, <www-validator@w3.org>
Message-ID: <D83AFA6C87414295A8F3979237DE3790@JukanPC>

Jeremy Meyers wrote:

> When validating my website (www.softlord.com) I get an error
>
> Sorry, I am unable to validate this document because on line 1156  it
> contained one or more bytes that I cannot interpret as utf-8  (in
> other words, the bytes found are not valid values in the specified
> Character Encoding). Please check both the content of the file and
> the character encoding indication.
>
> The error was: utf8 "\x80" does not map to Unicode

When I try to validate http://www.softlord.com I get 57 error messages, none 
of which is like the one you describe. Besides, the document contains only 
1138 lines.

I suppose you are referring to validation results for some other page or for 
an older version of the page.

> Would it be possible to have the validator still show the source that
> it is checking if "show source" is checked, so that I might identify
> and remove the offending characters?

Such issues have been discussed in the www-validator list, and people (well, 
at least I) have expressed the opinion that such things are a job for a 
general character data checker. After all, data errors where octets do not 
map to characters in the specified encoding are at a completely different 
level than markup validation. They would be errors even if the data were 
treated, say, as plain text.

The error message you quote is a real one in the sense that it is sometimes 
issued by the validator. It is symptomatic that in its attempt to report a 
character data encoding error, the validator fails miserably. It claims that 
one or more bytes found "are not valid values in the specified Character 
Encoding". This is definitely incorrect when the encoding is UTF-8. Any byte 
except C0, C1, and F5 through FF may appear in UTF-8 data. Thus, 80 
hexadecimal (or "\x80" as the message puts it) must not be judged as 
invalid. What is wrong is a _combination_ of bytes. An 80 byte is a 
continuation byte, so it must be preceded by a suitable lead byte (C2 
through F4), possibly with other continuation bytes in between. In this 
case, it is probably preceded by a byte in the ASCII range (00 to 7F), and 
_this_ violates the UTF-8 specification.
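
To illustrate the point about combinations of bytes, here is a minimal 
Python sketch. The function names utf8_byte_role and is_valid_utf8 are made 
up for this illustration, and the check simply relies on Python's built-in 
strict UTF-8 decoder; it is not how the validator works.

```python
def utf8_byte_role(b: int) -> str:
    """Classify one byte value (0..255) by the role it can play in UTF-8."""
    if b <= 0x7F:
        return "ascii"          # 00..7F: a complete character on its own
    if b <= 0xBF:
        return "continuation"   # 80..BF: only valid after a lead byte
    if b in (0xC0, 0xC1) or b >= 0xF5:
        return "never valid"    # C0, C1, F5..FF: cannot appear at all
    return "lead"               # C2..F4: starts a multi-byte sequence


def is_valid_utf8(data: bytes) -> bool:
    """Return True if the whole byte string decodes as strict UTF-8."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


print(utf8_byte_role(0x80))            # the byte from the error message
print(is_valid_utf8(b"\xc2\x80"))      # 80 after the lead byte C2
print(is_valid_utf8(b"\xe2\x82\xac"))  # the euro sign encoded in UTF-8
print(is_valid_utf8(b"a\x80"))         # 80 directly after an ASCII byte
```

So the same byte 80 is accepted or rejected depending on its context, which 
is why a message about a single "invalid" byte is misleading.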

So character data checking as such is not a job for a markup validator, but 
the validator pages might suggest suitable tools, generally, and especially 
in error situations that relate to encodings. The page
http://billposer.org/Software/unidesc.html
(which I found through the Unicode Consortium pages) contains some promising 
tools. Maybe the validator pages could host a simple UTF-8 checker? After 
all, not everyone can install and run a nice tool written in C...
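
As a rough idea of what such a simple checker might do, here is a sketch in 
Python. The name find_invalid_utf8 and the sample data are hypothetical, and 
the approach (asking Python's strict UTF-8 decoder where it fails, then 
resuming after the failure) is just one assumption about how this could be 
done. It reports the line and the byte column of each problem, which is what 
is needed for locating the offending data in an editor.

```python
def find_invalid_utf8(data: bytes):
    """Return a list of (line, column, hex) for each byte run that
    Python's strict UTF-8 decoder rejects. Line and column are 1-based;
    the column counts bytes, not characters."""
    problems = []
    pos = 0
    while pos < len(data):
        try:
            data[pos:].decode("utf-8")
            break  # the rest of the data is valid
        except UnicodeDecodeError as err:
            # err.start and err.end are relative to the slice data[pos:]
            start = pos + err.start
            end = pos + err.end
            line = data.count(b"\n", 0, start) + 1
            line_start = data.rfind(b"\n", 0, start) + 1
            column = start - line_start + 1
            problems.append((line, column, data[start:end].hex()))
            pos = end  # resume checking after the rejected bytes
    return problems


# Hypothetical sample: a stray 80 byte on the second line.
sample = b"ok\nprice: \x80 5\n"
for line, column, hexbytes in find_invalid_utf8(sample):
    print(f"line {line}, byte column {column}: {hexbytes}")
```

For a real page, the bytes would be read from a file opened in binary mode, 
so that no decoding happens before the check.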

> Viewing the source via a web
> browser doesn't necessarily leave me with exactly the same line that
> the validator is looking at in the same position

Generally, you should use a good editor, not a web browser, for such 
purposes. But you could use e.g. the View Source functionality of Firefox. 
It gives access to a line by its line number, among other things.

-- 
Yucca, http://www.cs.tut.fi/~jkorpela/

Received on Thursday, 13 November 2008 18:04:19 UTC