Re: utf-8 validation help from Jukka K. Korpela on 2006-08-30 (www-validator@w3.org from August 2006)

From: Jukka K. Korpela <jkorpela@cs.tut.fi>
Date: Wed, 30 Aug 2006 19:32:08 +0300 (EEST)
To: www-validator@w3.org
cc: Hugh Topping <hughtopping@gmail.com>
Message-ID: <Pine.GSO.4.64.0608301916001.15009@mustatilhi.cs.tut.fi>

On Wed, 30 Aug 2006, David Dorward wrote:

>>    What are the bits of (X)HTML in my code that the validator cannot
>>    interpret as utf-8
>
> Looks like your pound signs.

And accented e's (e with acute accent, é). Those are the only non-ASCII 
characters in the document.

>>    and how can I change the code to enable the validator
>>    to interpret it?
>
> By configuring the server to send a content-type with the correct
> encoding information in it (looks like you are using ISO-8859-1) or
> configuring your editor to save in UTF-8.

Well, the server _should_ send the encoding information, and if it 
doesn't, at least the page should have a <meta> tag with such info. I'm 
not sure how saving as UTF-8 would help _without_ that; it would make the 
situation worse, since browsers probably don't default the encoding to 
UTF-8.

There's a problem with the validator, really. It gets a document without 
a charset parameter in a Content-Type header and without a <meta> tag as a substitute
for that. It then interprets, with no explicit note about having done 
that, the document as being UTF-8 encoded, then issues error messages 
based on this (mis)interpretation. The validator can _know_, and actually 
knows, that the encoding cannot be UTF-8, since the data is malformed if 
interpreted that way. So why does it issue detailed complaints?
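To illustrate why the data is malformed when read as UTF-8, here is a 
minimal Python sketch. The sample string and the function name 
is_valid_utf8 are my own illustrations, not anything taken from the 
validator's code.

```python
def is_valid_utf8(data: bytes) -> bool:
    """Return True if the bytes form well-formed UTF-8."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


# Hypothetical sample text containing the two non-ASCII characters
# discussed in this thread: e with acute accent and the pound sign.
sample = "Caf\u00e9: \u00a35"

# In ISO-8859-1 each of them is a single byte (0xE9 and 0xA3).
latin1_bytes = sample.encode("iso-8859-1")

# In UTF-8 each of them takes two bytes.
utf8_bytes = sample.encode("utf-8")

print(is_valid_utf8(latin1_bytes))
print(is_valid_utf8(utf8_bytes))
```

The byte 0xE9 announces a three-byte sequence in UTF-8 and must be 
followed by continuation bytes, and 0xA3 is a continuation byte that 
cannot start a character. So an ISO-8859-1 document with these 
characters cannot pass as UTF-8.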

The author can use the interface that lets him select the encoding 
manually, setting it to ISO-8859-1. This will start validation proper, 
resulting in about 52 more or less real and useful error messages.

The validator _should_ report that the encoding is not declared; it 
_could_ also proceed with some guess on the encoding, but only if it 
explicitly says so - hopefully with a disclaimer that says that it often 
guesses it all wrong.
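The behaviour I have in mind could be sketched roughly as follows, in 
Python. This is only an illustration under my own assumptions: the 
function name pick_encoding, the simplified regular expressions, the 
wording of the notes and the fallback to ISO-8859-1 are all 
hypothetical, and this is not how the validator is actually written.

```python
import re


def pick_encoding(content_type, body):
    """Return (encoding, notes) for a document.

    content_type: value of the Content-Type header, or None.
    body: the raw document bytes.
    """
    notes = []

    # 1. A charset parameter in the Content-Type header wins.
    m = re.search(r'charset\s*=\s*"?([\w.:-]+)', content_type or "", re.I)
    if m:
        return m.group(1).lower(), notes

    # 2. Otherwise look for a <meta> declaration near the start
    #    (a deliberately simplified pattern, not a real HTML parser).
    m = re.search(rb'<meta[^>]*charset\s*=\s*["\']?([\w.:-]+)',
                  body[:1024], re.I)
    if m:
        return m.group(1).decode("ascii").lower(), notes

    # 3. Nothing declared: say so, then guess, and say that too.
    notes.append("No character encoding declared.")
    try:
        body.decode("utf-8")
        guess = "utf-8"
    except UnicodeDecodeError:
        # Assumption: fall back to ISO-8859-1 when UTF-8 is impossible.
        guess = "iso-8859-1"
    notes.append("Guessing " + guess + "; this guess may well be wrong.")
    return guess, notes


encoding, notes = pick_encoding("text/html", b"<p>Price: \xa35</p>")
print(encoding)
for note in notes:
    print(note)
```

The point is only the order of events: the missing declaration is 
reported first, and any guess is labelled as a guess before other 
messages are issued.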

Incidentally, when only the pound sign and the e with acute are used 
beyond the ASCII repertoire, and only in a few occurrences, the author 
_could_ use &pound; and &eacute; for them and avoid the misleading 
messages. But he _should_ really decide on the encoding and declare it.
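If one wanted to do that replacement mechanically, a small Python 
sketch could look like this. The names ENTITIES and to_entities and the 
sample string are hypothetical; only the two entity references 
themselves come from HTML.

```python
# Map the two non-ASCII characters used in the document to the
# corresponding HTML entity references.
ENTITIES = {
    "\u00a3": "&pound;",   # pound sign
    "\u00e9": "&eacute;",  # e with acute accent
}


def to_entities(text: str) -> str:
    """Replace the pound sign and e-acute with entity references."""
    for char, entity in ENTITIES.items():
        text = text.replace(char, entity)
    return text


converted = to_entities("Caf\u00e9: \u00a35")
print(converted)

# The result is pure ASCII, so this raises no exception.
converted.encode("ascii")
```

The output contains only ASCII characters, so it reads the same whether 
it is interpreted as ISO-8859-1 or as UTF-8.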

P.S. Changing pages from HTML to XHTML is wasted time, or worse. If you 
wish, use XHTML 1.0 with Appendix C kludges for _new_ pages.

-- 
Jukka "Yucca" Korpela, http://www.cs.tut.fi/~jkorpela/

Received on Wednesday, 30 August 2006 16:32:25 UTC