- From: Jukka K. Korpela <jkorpela@cs.tut.fi>
- Date: Tue, 6 Feb 2007 10:48:49 +0200 (EET)
- To: Keith Hopper <kh@waikato.ac.nz>
- cc: www-validator@w3.org
On Fri, 2 Feb 2007, Keith Hopper wrote:

> The validator output for this contains such gems as -
>
> <span class="msg">non SGML character number 146</span>.<pre><code
> class="input">If you<strong title="Position where error was
> detected.">X</strong><re unsure about the </code></pre>
>
> in which the 'X' character was the Unicode code point U+00C2.

Actually the validator's report, encoded in UTF-8, seems to contain the octet C2 followed by the octet 3C. This is a combination that must not occur in UTF-8, so anything you see is just a browser's (largely unplanned) error processing. What I see is two small rectangles. If I manually tell my browser to interpret the report as ISO-8859-1 encoded, C2 gets interpreted as A with circumflex and 3C as "<", so the markup is valid but the information content is wrong. To put it simply, the validator is not able to handle character encodings properly in a case like this.

> Notice after the strong element end tag the additionally '<' tag start
> character.

What I see there, after manually switching to ISO-8859-1, is the right single quotation mark, which is rather tragicomic. But your mileage may vary according to your browser and its settings for overriding encoding information.

> I submit this is a validator error.

I think similar problems have been reported before, fairly long ago, so I'm afraid the problem is too deep in the validator's code to be fixed in a simple way, and I'm afraid there's little hope of a rewrite of the validator. For comparison, the WDG validator http://www.htmlhelp.com/tools/validator/ issues a much cleaner report.

P.S. I guess you have figured out what the error on the page being validated is: it contains a byte (octet) that is reserved for control characters in ISO-8859-1. The suggestions in the validator's report are useful, though they don't seem to mention the _simplest_ fix: change the encoding in the <meta> tag from iso-8859-1 to windows-1252. This is a pragmatic approach that works fairly well, but purists frown upon it, for good reasons. Then again, the page is XHTML 1.0 served as text/html, which is both practically and theoretically pointless, so a little trickery and hackery can't add much to the confusion.

--
Jukka "Yucca" Korpela, http://www.cs.tut.fi/~jkorpela/
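
The byte-level claims in the message above can be checked with a short Python sketch. It is only an illustration and not part of the original message: the variable names are made up here, and it assumes that Python's built-in "iso-8859-1", "windows-1252" and "utf-8" codecs behave like the encodings discussed. It says nothing about how the validator itself produced the bad octets.

```python
# Byte 146 (hex 92), the "non SGML character number 146" from the report.
raw = bytes([0x92])

# Under ISO-8859-1 the byte maps to U+0092, a C1 control character.
as_latin1 = raw.decode("iso-8859-1")

# Under windows-1252 the same byte is U+2019, the right single quotation mark.
as_cp1252 = raw.decode("windows-1252")

# The octet pair C2 3C seen in the validator's report is not valid UTF-8:
# C2 is a lead byte and must be followed by a continuation byte (80..BF).
report_fragment = bytes([0xC2, 0x3C])
try:
    report_fragment.decode("utf-8")
    utf8_ok = True
except UnicodeDecodeError:
    utf8_ok = False

# Read as ISO-8859-1, the same pair becomes A with circumflex followed by "<".
fallback = report_fragment.decode("iso-8859-1")

print(hex(ord(as_latin1)), hex(ord(as_cp1252)), utf8_ok, repr(fallback))
```

Decoding with windows-1252 is what makes the suggested <meta> change work: the byte stops being a control character and becomes the punctuation mark the page author presumably intended.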
Received on Tuesday, 6 February 2007 08:49:08 UTC