Re: Invalid html from validator from Jukka K. Korpela on 2007-02-06 (www-validator@w3.org from February 2007)

From: Jukka K. Korpela <jkorpela@cs.tut.fi>
Date: Tue, 6 Feb 2007 10:48:49 +0200 (EET)
To: Keith Hopper <kh@waikato.ac.nz>
cc: www-validator@w3.org
Message-ID: <Pine.GSO.4.64.0702061027220.25492@mustatilhi.cs.tut.fi>

On Fri, 2 Feb 2007, Keith Hopper wrote:

>     The validator output for this contains such gems as -
>
>        <span class="msg">non SGML character number 146</span>.<pre><code
> class="input">If you<strong title="Position where error was
> detected.">X</strong><re unsure about the </code></pre>
>
> in which the 'X' character was the Unicode code point U+00C2.

Actually the validator's report, encoded in UTF-8, seems to contain the 
octet C2 followed by the octet 3C. This is a combination that must not 
occur in UTF-8, so anything you see is just a browser's (largely 
unplanned) error processing. What I see is two small rectangles. If I 
manually tell my browser to interpret the report as ISO-8859-1 encoded, 
C2 gets interpreted as A with circumflex and 3C as "<", so the markup 
is valid but the information content is wrong.
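
As a rough illustration of the point above, here is a minimal Python sketch 
(not taken from the validator; the variable names are made up for this 
example). It checks that the octet pair C2 3C is rejected by a strict UTF-8 
decoder but decodes without complaint as ISO-8859-1:

```python
# The two octets reported above: 0xC2 followed by 0x3C.
octets = bytes([0xC2, 0x3C])

# 0xC2 is a UTF-8 lead byte that must be followed by a continuation
# byte in the range 0x80-0xBF; 0x3C is not one, so strict decoding fails.
try:
    octets.decode("utf-8")
    utf8_ok = True
except UnicodeDecodeError:
    utf8_ok = False

# ISO-8859-1 assigns a character to every octet, so this always succeeds.
as_latin1 = octets.decode("iso-8859-1")

print(utf8_ok)          # False
print(repr(as_latin1))  # 'Â<'
```

A browser is of course not a strict decoder; what it displays for such a 
sequence depends on its own error recovery.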

To put it simply, the validator is not able to handle character encodings 
properly in a case like this.

>     Notice after the strong element end tag the additionally '<' tag start
> character.

What I see there, after manually switching to ISO-8859-1, is the right 
single quotation mark, which is rather tragicomic. But your mileage may 
vary according to your browser and settings for overriding encoding 
information.

>     I submit this is a validator error.

I think similar problems have been reported before, fairly long ago, so 
I'm afraid the problem is too deep in the validator's code to be fixed in 
a simple way, and there's little hope of a rewrite of the validator.

For comparison, the WDG validator
http://www.htmlhelp.com/tools/validator/
issues a much cleaner report.

P.S. I guess you have figured out what the error on the page being 
validated is: it contains a byte (octet) that is reserved for control 
characters in ISO-8859-1. The suggestions in the validator's report are 
useful, though they don't seem to mention the _simplest_ fix: change the 
encoding in the <meta> tag from iso-8859-1 to windows-1252. This is a 
pragmatic approach that works fairly well, but purists frown upon it, for 
good reasons. Then again, the page is XHTML 1.0 served as text/html, which 
is both practically and theoretically pointless, so a little trickery and 
hackery can't add much to the confusion.
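
To make the difference between the two encodings concrete, here is a small 
Python sketch (an illustration only; the variable names are hypothetical). 
It decodes octet 146 both ways. The last line shows how the resulting 
control character is encoded in UTF-8, which may be where the stray C2 in 
the report comes from, though that is a guess about the validator's 
internals:

```python
import unicodedata

octet = bytes([146])  # 0x92, the "non SGML character number 146"

# In ISO-8859-1 the octet maps to U+0092, a C1 control character.
latin1_char = octet.decode("iso-8859-1")

# In windows-1252 the same octet is the right single quotation mark.
cp1252_char = octet.decode("windows-1252")

print(unicodedata.category(latin1_char))  # Cc
print(unicodedata.name(cp1252_char))      # RIGHT SINGLE QUOTATION MARK

# U+0092 takes two octets in UTF-8: C2 followed by 92.
print(latin1_char.encode("utf-8").hex())  # c292
```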

-- 
Jukka "Yucca" Korpela, http://www.cs.tut.fi/~jkorpela/

Received on Tuesday, 6 February 2007 08:49:08 UTC