- From: Jukka K. Korpela <jkorpela@cs.tut.fi>
- Date: Wed, 30 Aug 2006 19:32:08 +0300 (EEST)
- To: www-validator@w3.org
- cc: Hugh Topping <hughtopping@gmail.com>
- Message-ID: <Pine.GSO.4.64.0608301916001.15009@mustatilhi.cs.tut.fi>
On Wed, 30 Aug 2006, David Dorward wrote:

>> What are the bits of (X)HTML in my code that the validator cannot
>> interpret as utf-8
>
> Looks like your pound signs.

And accented e's (e with acute accent, é). Those are the only non-ASCII
characters in the document.

>> and how can I change the code to enable the validator
>> to interpret it?
>
> By configuring the server to send a content-type with the correct
> encoding information in it (looks like you are using ISO-8859-1) or
> configuring your editor to save in UTF-8.

Well, the server _should_ send the encoding information, and if it
doesn't, at least the page should have a <meta> tag with such info.
I'm not sure how saving as UTF-8 would help _without_ that; it would
make the situation worse, since browsers probably don't default the
encoding to UTF-8.

There's a problem with the validator, really. It gets a document
without a charset parameter in a Content-Type header and without a
<meta> substitute for that. It then interprets the document as being
UTF-8 encoded, with no explicit note about having done that, and
issues error messages based on this (mis)interpretation. The validator
can _know_, and actually knows, that the encoding cannot be UTF-8,
since the data is malformed if interpreted that way. So why does it
issue detailed complaints?

The author can use the interface that lets him select the encoding
manually, setting it to ISO-8859-1. This will start validation proper,
resulting in about 52 more or less real and useful error messages.

The validator _should_ report that the encoding is not declared; it
_could_ also proceed with some guess on the encoding, but only if it
explicitly says so - hopefully with a disclaimer that says that it
often guesses it all wrong.

Incidentally, when only the pound sign and the e with acute are used
beyond the ASCII repertoire, and only in a few occurrences, the author
_could_ use &pound; and &eacute; for them and avoid the misleading
messages. But he _should_ really decide on the encoding and declare it.

P.S. Changing pages from HTML to XHTML is wasted time, or worse. If you
wish, use XHTML 1.0 with Appendix C kludges for _new_ pages.

-- 
Jukka "Yucca" Korpela, http://www.cs.tut.fi/~jkorpela/
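
To illustrate the point above that a checker can tell when undeclared
data cannot be UTF-8, here is a minimal Python sketch. It is not the
validator's actual code; the function name assess_encoding and the
sample page are hypothetical, chosen only for illustration. It assumes
the page was saved as ISO-8859-1 with no declared encoding.

```python
from typing import Optional, Tuple


def assess_encoding(data: bytes,
                    declared: Optional[str] = None) -> Tuple[Optional[str], str]:
    """Return (encoding to use, or None) and a message for the author.

    Hypothetical helper: it only shows the decision described in the
    message, not how any real validator is implemented.
    """
    if declared:
        return declared, "Using declared encoding %s." % declared
    try:
        # bytes.decode is strict by default and raises on malformed input.
        data.decode("utf-8")
    except UnicodeDecodeError as err:
        return None, (
            "No encoding declared, and the data is not valid UTF-8 "
            "(byte 0x%02X at offset %d)." % (data[err.start], err.start)
        )
    # Well-formed as UTF-8, but this is still only a guess; say so.
    return "utf-8", "No encoding declared; guessing UTF-8, which may be wrong."


# A page with a pound sign and an e-acute, saved as ISO-8859-1.
latin1_page = "<p>Price: \u00a35 at the caf\u00e9</p>".encode("iso-8859-1")

print(assess_encoding(latin1_page))                         # undeclared
print(assess_encoding(latin1_page, declared="iso-8859-1"))  # declared
```

With the encoding declared, the same bytes are accepted as they are;
without it, the sketch reports the missing declaration instead of
treating each non-ASCII byte as a separate markup error.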
Received on Wednesday, 30 August 2006 16:32:25 UTC