Re: utf-8 validation help from Jukka K. Korpela on 2006-08-31 (www-validator@w3.org from August 2006)

From: Jukka K. Korpela <jkorpela@cs.tut.fi>
Date: Thu, 31 Aug 2006 09:04:48 +0300 (EEST)
To: www-validator@w3.org
Message-ID: <Pine.GSO.4.64.0608310845040.17873@mustatilhi.cs.tut.fi>
On Thu, 31 Aug 2006, Frank Ellermann wrote:

> Jukka K. Korpela wrote:
>
>> Well, the server _should_ send the encoding information
>
> For many users of the online validator that won't fly, they
> have some Web space somewhere, with a http server that won't
> let them create dot-files (or ignores them, same effect).

They might also be using "free" web space on a server that adds some code 
on each page sent, making it invalid. There's a lot that can go wrong, and 
if you want a fix, you may even need to change the server. The charset 
issue is, however, much less serious.

>> at least the page should have a <meta> tag with such info.
>
> Yes, for HTML

Which is what people should use as the distribution format of www 
pages, _especially_ on a server that does not let them control the HTTP 
headers in any way.

> The OP now picked
> <?xml version="1.0" encoding="utf-8"?>, for popular browsers
> that should work (and it certainly works for the validator).

For some values of "work". It throws the most popular browser into "quirks 
mode", since its doctype sniffing decides that the author wants that (i.e. 
simulation of errors in old versions of IE) if there is _anything_ before 
the doctype declaration. Is _this_ what people want when they move to 
XHTML?

(The OP also changed the non-ASCII characters to entities, so character 
encoding is not really an issue any more. The document is in practice in 
ASCII encoding, though ASCII data can trivially be declared and processed 
as UTF-8.)
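
To illustrate that last point, here is a small Python sketch (the sample 
markup is made up for the example, and this is not anything the validator 
itself runs): data consisting only of ASCII octets decodes to the same 
text whether it is labelled ASCII or UTF-8.

```python
# Hypothetical sample: markup that uses an entity instead of a raw
# non-ASCII character.
data = b"<p>Caf&eacute;</p>"

# Every octet is below 0x80, so the data is pure ASCII.
is_ascii = all(octet < 0x80 for octet in data)

# ASCII is a subset of UTF-8, so both decodings give identical text.
as_ascii = data.decode("ascii")
as_utf8 = data.decode("utf-8")
same = as_ascii == as_utf8
```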

>> The validator can _know_, and actually knows, that the
>> encoding cannot be UTF-8, since the data is malformed if
>> interpreted that way.
>
> What should it do,

First and foremost, it should report an error when it can decide that it 
cannot perform its job reliably. The simplest approach would be to stop 
there, hopefully with some message that helps the user fix the error
and try again. This would probably reduce confusion, since if the 
validator tries to do something based on guesswork, then
a) the guess may be wrong, with misleading results, and
b) users tend to overlook an error message they don't immediately
    understand and look at the "validation report proper" instead.
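
A minimal sketch of this "report and stop" approach in Python follows. 
The function name check_utf8 and the wording of the message are invented 
for illustration; this is not how the validator is actually implemented.

```python
def check_utf8(data: bytes):
    """Return (text, None) if data is well-formed UTF-8, else (None, message)."""
    try:
        # Strict decoding raises on the first malformed sequence: no guessing.
        return data.decode("utf-8", errors="strict"), None
    except UnicodeDecodeError as err:
        message = (
            f"Cannot validate: octet 0x{data[err.start]:02X} at offset "
            f"{err.start} is not valid UTF-8. "
            "Declare the actual encoding and try again."
        )
        return None, message


# "ä" saved as windows-1252 or ISO-8859-1 is the single octet 0xE4,
# which is malformed when the data is read as UTF-8.
text, error = check_utf8(b"p\xe4iv\xe4")
```

The point of returning no text at all on failure is that nothing 
downstream can produce a "validation report proper" from guessed data.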

> try Latin-1 if UTF-8 fails, windows-1252 if
> it still doesn't work ?

If you make a guess, windows-1252 is surely a more practical guess than 
iso-8859-1. But a validator shouldn't be a guessing tool. It has a very 
specific and rather technical job to do.
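
The practical difference between the two guesses shows up in octets 0x80 
to 0x9F. A short Python sketch, using Python's codec names "latin-1" (for 
iso-8859-1) and "cp1252" (for windows-1252) and made-up sample data:

```python
# Octets 0x93 and 0x94 around a word, as typically produced by
# Windows software for "smart" quotation marks.
data = b"\x93quoted\x94"

# As iso-8859-1, they become the C1 control characters U+0093 and U+0094.
as_latin1 = data.decode("latin-1")

# As windows-1252, they become the curly quotes U+201C and U+201D.
as_cp1252 = data.decode("cp1252")
```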

> Or just ignore the spurious octets
> reporting a summary "missing charset declaration, certainly no
> UTF-8" ?

It surely mustn't just ignore "the spurious octets" when it cannot possibly 
know what the encoding is. Without information about encoding, every 
octet is spurious.

There's a very practical reason why a user of the validator should know 
that there is no encoding information - and should know it _first_, and 
perhaps as the _only_ thing that the validator says until the error has 
been fixed. If you put an HTML document with no encoding information on a 
web server that sends no encoding information in HTTP headers, then user 
agents can and will interpret it in different ways, creating much more 
confusion than a casual markup error does. Typically, a browser uses the 
encoding that was _last_ manually chosen by its user (via View/Encoding 
command or something like that) or the default encoding as specified in 
the browser settings. An author often fails to see this problem since _he_ 
has windows-1252 as his browser default and hasn't visited pages that 
forced him to change the encoding manually.
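
The effect can be simulated with a few lines of Python. The sample word 
and the list of "browser defaults" are arbitrary choices for the sake of 
the example; the codec names are Python's own.

```python
# "päivä" as saved in windows-1252, served with no encoding information.
data = b"p\xe4iv\xe4"

# Hypothetical browser defaults: each reader's browser applies one of these.
defaults = ["cp1252", "iso8859_5", "koi8_r"]

# The same octets, interpreted under each default.
renderings = {name: data.decode(name) for name in defaults}
for name, text in renderings.items():
    print(f"{name}: {text}")
```

Only the reader whose default happens to match the author's sees the 
intended word; the others see Cyrillic letters in place of the "ä".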

-- 
Jukka "Yucca" Korpela, http://www.cs.tut.fi/~jkorpela/
Received on Thursday, 31 August 2006 06:05:13 UTC