- From: Michael Adams <linux_mike@paradise.net.nz>
- Date: Fri, 14 Nov 2008 22:32:25 +1300
- To: www-validator@w3.org
On Thu, 13 Nov 2008 20:03:24 +0200
Came this utterance formulated by Jukka K. Korpela to my mailbox:

>
> Jeremy Meyers wrote:
>
> > When validating my website (www.softlord.com) I get an error
> >
> > Sorry, I am unable to validate this document because on line 1156
> > it contained one or more bytes that I cannot interpret as utf-8 (in
> > other words, the bytes found are not valid values in the specified
> > Character Encoding). Please check both the content of the file and
> > the character encoding indication.
> >
> > The error was: utf8 "\x80" does not map to Unicode
>
> When I try to validate http://www.softlord.com I get 57 error
> messages, none of which is like the one you describe. Besides, the
> document contains only 1138 lines.
>
> I suppose you are referring to validation results for some other page
> or for an older version of the page.
>
> > Would it be possible to have the validator still show the source
> > that it is checking if "show source" is checked, so that I might
> > identify and remove the offending characters?
>
> Such issues have been discussed in the www-validator list, and people
> (well, at least I) have expressed the opinion that such things are a
> job for a general character data checker. After all, data errors where
> octets do not map to characters in the specified encoding are at a
> completely different level than markup validation. They would be
> errors even if the data were treated, say, as plain text.
>
> The error message you quote is a real one in the sense that it is
> sometimes issued by the validator. It is symptomatic that in its
> attempt to report character data encoding error, the validator fails
> miserably. It claims that one or more bytes found "are not valid
> values in the specified Character Encoding". This is definitely
> incorrect when the encoding is UTF-8. Any byte except C0, C1, and F5
> through FF may appear in UTF-8 data.
> Thus, 80 hexadecimal (or "\x80"
> as the message puts it) must not be judged as invalid. What is wrong
> is a _combination_ of bytes. A 80 byte must be followed by a certain
> pattern of bytes. In this case, it is probably followed by a byte in
> the ASCII range (00 to 7F), and _this_ violates the UTF-8
> specification.
>

\x80 is illegal as a first byte in UTF-8. All the following are for the
first byte only:

binary    hex range  description
0xxxxxxx  00-7F      normal US-ASCII (with no trailing bytes)
10xxxxxx  80-BF      illegal
110xxxxx  C0-DF      indicates two byte code with 5 bits of data
1110xxxx  E0-EF      indicates three byte code with 4 bits of data
11110xxx  F0-F7      indicates four byte code with 3 bits of data
11111xxx  F8-FF      illegal

So the number of leading ones, followed by a zero, indicates the total
number of bytes that make up the encoded character. All subsequent code
bytes are in the binary format 10xxxxxx (80-BF).

-- 
Michael

All shall be well, and all shall be well, and all manner of things shall
be well

 - Julian of Norwich 1342 - 1416
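To illustrate the first-byte table above, and the original request to
locate the offending bytes, here is a minimal Python sketch. It is not
the validator's code. The names classify_lead_byte and first_bad_byte
are made up for this example, and the classifier follows only the bit
patterns in the table, so it does not apply the extra C0, C1 and F5-FF
restrictions mentioned in the quoted text. The second function relies on
Python's own UTF-8 decoder, which does apply them.

```python
from typing import Optional, Tuple


def classify_lead_byte(b: int) -> str:
    """Classify a byte by the bit pattern it has as the FIRST byte of a sequence."""
    if not 0 <= b <= 0xFF:
        raise ValueError("not a byte value")
    if b <= 0x7F:
        return "ascii"         # 0xxxxxxx: single byte, no trailing bytes
    if b <= 0xBF:
        return "continuation"  # 10xxxxxx: only legal as a trailing byte
    if b <= 0xDF:
        return "lead-2"        # 110xxxxx: starts a two byte code
    if b <= 0xEF:
        return "lead-3"        # 1110xxxx: starts a three byte code
    if b <= 0xF7:
        return "lead-4"        # 11110xxx: starts a four byte code
    return "illegal"           # 11111xxx


def first_bad_byte(data: bytes) -> Optional[Tuple[int, int]]:
    """Return (byte offset, 1-based line number) of the first UTF-8 error, or None."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError as err:
        # err.start is the offset at which the decoder gave up.
        line = data.count(b"\n", 0, err.start) + 1
        return err.start, line
    return None


if __name__ == "__main__":
    print(classify_lead_byte(0x80))
    print(first_bad_byte(b"line one\nabc\x80def"))
```

Run against the raw bytes of a page (for example, a file opened in
binary mode), first_bad_byte gives an offset and line to inspect by
hand.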
Received on Friday, 14 November 2008 09:29:50 UTC