- From: Jukka K. Korpela <jkorpela@cs.tut.fi>
- Date: Thu, 13 Nov 2008 20:03:24 +0200
- To: "Jeremy Meyers" <jeremy.meyers@sonybmg.com>, <www-validator@w3.org>
Jeremy Meyers wrote:

> When validating my website (www.softlord.com) I get an error
>
> Sorry, I am unable to validate this document because on line 1156 it
> contained one or more bytes that I cannot interpret as utf-8 (in
> other words, the bytes found are not valid values in the specified
> Character Encoding). Please check both the content of the file and
> the character encoding indication.
>
> The error was: utf8 "\x80" does not map to Unicode

When I try to validate http://www.softlord.com I get 57 error messages,
none of which is like the one you describe. Besides, the document
contains only 1138 lines. I suppose you are referring to validation
results for some other page or for an older version of the page.

> Would it be possible to have the validator still show the source that
> it is checking if "show source" is checked, so that I might identify
> and remove the offending characters?

Such issues have been discussed in the www-validator list, and people
(well, at least I) have expressed the opinion that such things are a job
for a general character data checker. After all, data errors where
octets do not map to characters in the specified encoding are at a
completely different level than markup validation. They would be errors
even if the data were treated, say, as plain text.

The error message you quote is a real one in the sense that it is
sometimes issued by the validator. It is symptomatic that in its attempt
to report a character data encoding error, the validator fails
miserably. It claims that one or more bytes found "are not valid values
in the specified Character Encoding". This is definitely incorrect when
the encoding is UTF-8. Any byte except C0, C1, and F5 through FF may
appear in UTF-8 data. Thus, 80 hexadecimal (or "\x80" as the message
puts it) must not be judged as invalid in itself. What is wrong is a
_combination_ of bytes. An 80 byte is a continuation byte, so it must be
preceded by a suitable lead byte (possibly with other continuation bytes
in between). In this case, it is probably preceded by a byte in the
ASCII range (00 to 7F), and _this_ violates the UTF-8 specification.

So character data checking as such is not a job for a markup validator,
but the validator pages might suggest suitable tools, generally, and
especially in error situations that relate to encodings. The page
http://billposer.org/Software/unidesc.html
(which I found through the Unicode Consortium pages) contains some
promising tools. Maybe the validator pages could host a simple UTF-8
checker? After all, not everyone can install and run a nice tool written
in C...

> Viewing the source via a web
> browser doesn't necessarily leave me with exactly the same line that
> the validator is looking at in the same position

Generally, you should use a good editor, not a web browser, for such
purposes. But you could use e.g. the View Source functionality of
Firefox. It gives access to a line by its line number, among other
things.

--
Yucca, http://www.cs.tut.fi/~jkorpela/
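
As a rough illustration of the kind of simple UTF-8 checker suggested above, here is a minimal sketch in Python. The function name find_invalid_utf8 and the (line, column, byte) report format are assumptions made for this example. They are not part of the validator or of the unidesc tools. The sketch relies only on Python's built-in UTF-8 decoder and reports where decoding fails, so that the offending bytes can be located by line number.

```python
def find_invalid_utf8(data: bytes):
    """Return a list of (line, column, byte_value) tuples, one for each
    position at which Python's UTF-8 decoder rejects the data.

    Line and column are 1-based; the column counts bytes, not characters.
    The name and the report format are illustrative assumptions.
    """
    problems = []
    pos = 0
    while pos < len(data):
        try:
            # Try to decode everything from the current position onward.
            data[pos:].decode("utf-8")
            break  # the rest of the data is well-formed
        except UnicodeDecodeError as err:
            # err.start is relative to the slice that was being decoded.
            bad = pos + err.start
            line = data.count(b"\n", 0, bad) + 1
            line_start = data.rfind(b"\n", 0, bad) + 1  # 0 if no newline yet
            problems.append((line, bad - line_start + 1, data[bad]))
            pos = bad + 1  # skip the offending byte and keep looking
    return problems


if __name__ == "__main__":
    # A stray 80 byte after ASCII text is reported ...
    for line, col, byte in find_invalid_utf8(b"ok\nabc\x80def\n"):
        print(f"line {line}, byte column {col}: unexpected byte {byte:02X}")
    # ... but the same byte is fine after a suitable lead byte (C2 80 is U+0080).
    print(find_invalid_utf8(b"\xc2\x80"))
```

To check a saved copy of a page, one could read it in binary mode, e.g. find_invalid_utf8(open("page.html", "rb").read()), where page.html is a hypothetical file name.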
Received on Thursday, 13 November 2008 18:04:19 UTC