- From: Jukka K. Korpela <jkorpela@cs.tut.fi>
- Date: Thu, 31 Aug 2006 09:04:48 +0300 (EEST)
- To: www-validator@w3.org
On Thu, 31 Aug 2006, Frank Ellermann wrote:

> Jukka K. Korpela wrote:
>
>> Well, the server _should_ send the encoding information
>
> For many users of the online validator that won't fly, they
> have some Web space somewhere, with a http server that won't
> let them create dot-files (or ignores them, same effect).

They might also be using "free" web space on a server that adds some
code to each page sent, making it invalid. There's a lot that can go
wrong, and if you want a fix, you may even need to change the server.
The charset issue is, however, much less serious.

>> at least the page should have a <meta> tag with such info.
>
> Yes, for HTML

Which is what people should use as the distribution format of www
pages, _especially_ on a server that does not let them control the
HTTP headers in any way. (Both ways of declaring the encoding are
shown at the end of this message.)

> The OP now picked
> <?xml version="1.0" encoding="utf-8"?>, for popular browsers
> that should work (and it certainly works for the validator).

For some values of "work". It throws the most popular browser into
"quirks mode", since its doctype sniffing decides that the author
wants that (i.e., simulation of errors in old versions of IE) if
there is _anything_ before the doctype declaration (illustrated at
the end of this message). Is _this_ what people want when they move
to XHTML?

(The OP also changed the non-ASCII characters to entities, so
character encoding is not really an issue any more. The document is
in practice in ASCII encoding, though ASCII data can trivially be
declared and processed as UTF-8.)

>> The validator can _know_, and actually knows, that the
>> encoding cannot be UTF-8, since the data is malformed if
>> interpreted that way.
>
> What should it do,

First and foremost, it should report an error when it can decide that
it cannot perform its job reliably. The simplest approach would be to
stop there, hopefully with some message that helps the user fix the
error and try again. This would probably reduce confusion, since if
the validator tries to do something based on guesswork, then
a) the guess may be wrong, with misleading results
b) users tend to overlook an error message they don't immediately
   understand and look at the "validation report proper" instead.

> try Latin-1 if UTF-8 fails, windows-1252 if
> it still doesn't work ?

If you make a guess, windows-1252 is surely a more practical guess
than iso-8859-1. But a validator shouldn't be a guessing tool. It has
a very specific and rather technical job to do. (A sketch of the kind
of check I mean appears at the end of this message.)

> Or just ignore the spurious octets
> reporting a summary "missing charset declaration, certainly no
> UTF-8" ?

It surely mustn't just ignore "the spurious octets" when it cannot
possibly know what the encoding is. Without information about the
encoding, every octet is spurious.

There's a very practical reason why a user of the validator should
know that there is no encoding information - and should know it
_first_, and perhaps as the _only_ thing that the validator says
until the error has been fixed. If you put an HTML document with no
encoding information on a web server that sends no encoding
information in HTTP headers, then user agents can and will interpret
it in different ways, creating much more confusion than a casual
markup error does. Typically, a browser uses the encoding that was
_last_ manually chosen by its user (via a View/Encoding command or
something like that) or the default encoding as specified in the
browser settings. An author often fails to see this problem, since
_he_ has windows-1252 as his browser default and hasn't visited pages
that forced him to change the encoding manually.
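To make the declaration mechanisms concrete, here is what they look
like; utf-8 appears as the charset value only as an example. The HTTP
header is the authoritative one and is produced by the server:

    Content-Type: text/html; charset=utf-8

The <meta> element is the in-document substitute for HTML:

    <meta http-equiv="Content-Type"
          content="text/html; charset=utf-8">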
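As an illustration of the doctype sniffing problem: a document that
begins as follows is rendered in quirks mode by IE 6, because the XML
declaration precedes the doctype declaration. Remove the first line,
and the same doctype triggers standards mode:

    <?xml version="1.0" encoding="utf-8"?>
    <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
        "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">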
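Finally, the sketch of the check I mentioned. This is not the
validator's actual code, just a few lines of Python to show that
"cannot be UTF-8" is a mechanical decision, not a guess:

    # Sketch only: can these octets possibly be UTF-8?
    # A real validator would also report the position, to help the
    # user find and fix the error.
    def could_be_utf8(octets):
        try:
            octets.decode("utf-8")
            return True
        except UnicodeDecodeError as err:
            print("Error: not UTF-8, malformed sequence at byte %d"
                  % err.start)
            return False

    # Octet 0xE4 is a-with-dieresis in iso-8859-1 and windows-1252,
    # but read as UTF-8 it starts a malformed sequence here.
    could_be_utf8(b"p\xe4iv\xe4")

The decoding either succeeds or fails, deterministically; there is
nothing to guess about whether the data can _be_ UTF-8.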
--
Jukka "Yucca" Korpela, http://www.cs.tut.fi/~jkorpela/