- From: Frank Ellermann <nobody@xyzzy.claranet.de>
- Date: Thu, 31 Aug 2006 10:40:31 +0200
- To: www-validator@w3.org
Jukka K. Korpela wrote:

> They might also be using "free" web space on a server that
> adds some code on each page sent, making it invalid.

Yes, that would be a hopeless case.  But RFC 2616 is more tolerant
with respect to the HTTP header.  If the choice is "no info" vs.
"wrong info" I pick the former - some of my plain text files are
pc-multilingual-850+euro, and no decent Web server could get this
right without direct instructions.

> The charset issue is however much less serious

This got a MAY, a SHOULD, and two MUSTs in section 3.4.1 of RFC 2616.
And probably my browser belongs to the "unfortunately" cases.  Tough.
At least this mess is limited to HTTP/1.0, so it can't confuse the
validator.

>>> at least the page should have a <meta> tag with such info.

>> Yes, for HTML

> Which is what people should use as the distribution format of
> www pages, _especially_ on a server that does not let them
> control the HTTP headers in any way.

The OP wants XHTML, and I have used it for years now; so far no
argument against XHTML 1.0 Transitional has convinced me.  I saw
your points against XHTML 1.1 some days ago (about name and map),
and that was convincing.  Unfortunately some 1.1 features make
sense, but "visible with any browser" is my top priority.

>> The OP now picked <?xml version="1.0" encoding="utf-8"?>, [...]

> It throws the most popular browser into "quirks mode", since
> its doctype sniffing decides that the author wants that (i.e.
> simulation of errors in old versions in IE) if there is
> _anything_ before the doctype declaration.  Is _this_ what
> people want when they move to XHTML?

I use this only on one experimental page; for ordinary XHTML pages
I use a meta declaration.  We discussed that last year: you told me
that this page might fool the validator, but won't work elsewhere.

The mainstream has already decided that this is all hopeless, and
uses some wiki-cum-XHTML dialects.  What you get is XHTML, with
some warts like the odd MediaWiki <pre>, but not too bad.

> The OP also changed the non-ASCII characters to entities, so
> character encoding is not really an issue any more.  The
> document is in practice in ASCII encoding, though ASCII data
> can trivially be declared and processed as UTF-8.

Yes, I use ASCII almost everywhere, with windows-1252 only if I
"must" for wannabe-backwards-compatible Euros.  This "wannabe"
results in "visible with many browsers", unfortunately not all.

> The simplest approach would be to stop there, hopefully with
> some message that helps the user fix the error and try again.

Not ideal if the page is really UTF-8 and "there" means "before the
body element".  If "there" means "at the first UTF-8 error" it's an
idea.  Again not ideal if the page is really Latin-1: those users
then need three validation steps (assuming that they can fix all
reported errors, and that there is another problem, not only the
missing charset declaration).

> This would probably reduce confusion, since if the validator
> tries to do something based on a guesswork, then [...]

ACK, this could get completely out of hand for really obscure
"charsets" like SCSU.  Probably the validator doesn't support it
and is missing some interesting cases in its test suite.

> If you make a guess, windows-1252 is surely a more practical
> guess than iso-8859-1.  But a validator shouldn't be a
> guessing tool.  It has a very specific and rather technical
> job to do.

It also has a "typical" audience.  Folks discussing the fine points
of charset recognition are untypical; an assumption of "unknown
legacy SBCS with 0x80 .. 0xFF all allowed" could make sense.
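As a rough sketch of what such a fallback could look like (illustrative
only, not how the W3C validator actually works, and the function name is
made up): try strict UTF-8 first, remember where it fails, and only then
fall back to a permissive single-byte reading so that further markup
errors can still be reported in the same run.

  # Sketch only: strict UTF-8 first, then a permissive SBCS fallback.
  def tentative_decode(raw: bytes):
      try:
          return raw.decode("utf-8"), "clean UTF-8"
      except UnicodeDecodeError as err:
          first_bad = err.start          # offset of the first bad octet
      # "Unknown legacy SBCS, 0x80 .. 0xFF all allowed": windows-1252 is
      # the practical guess; errors="replace" covers the few octets
      # (0x81, 0x8D, 0x8F, 0x90, 0x9D) that it leaves undefined.
      text = raw.decode("windows-1252", errors="replace")
      return text, ("not UTF-8 (first bad octet at byte %d), "
                    "tentatively read as windows-1252" % first_bad)

  sample = b"<p>price: 5 \x80</p>"       # 0x80 is the euro sign in cp1252
  text, note = tentative_decode(sample)
  print(note)                            # reports offset 12, cp1252 fallback

Either way the user still gets the "no charset information" error; the
fallback only decides how many of the other errors can be reported in
one pass.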
For windows-1252 that is not quite the case (a few octets in 0x80 ..
0x9F are undefined); it could pick any other obscure charset where it
does hold.

> It surely mustn't just ignore "the spurious octets" when it
> cannot possibly know what the encoding is.  Without
> information about encoding, every octet is spurious.

The idea would be to avoid a flood of bogus error messages based on a
wrong guess, but also to _try_ to report other errors as long as that
is plausible.  Otherwise you get the effect that users need three
validation steps instead of two.

In reality we "know" that it's 99% windows-1252.  I recall two articles
here in the last three years, one asking for codepage 437, another
about an unregistered Mac charset.  Anything else here about charset
issues was windows-1252.

> If you put an HTML document with no encoding information on a
> web server that sends no encoding information in HTTP
> headers, then user agents can and will interpret it in
> different ways

Sure, nobody proposed to report no error at all if the "guess" approach
finds no other error.  No "tentatively valid", it's INVALID.  The only
question is how best to report it to users.

Frank
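As a footnote to the windows-1252 vs. iso-8859-1 point above: the
practical difference is in the octets 0x80 .. 0x9F, which are C1
control characters in iso-8859-1 but mostly printable characters in
windows-1252.  A two-line illustration (Python used here only for
demonstration):

  sample = bytes([0x80, 0x85, 0x91, 0x92, 0x96])
  print(sample.decode("iso-8859-1"))    # U+0080..U+0096, invisible C1 controls
  print(sample.decode("windows-1252"))  # euro sign, ellipsis, curly quotes, en dash

Pages that actually use these octets almost always mean the
windows-1252 characters, which is why it is the more practical guess.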
Received on Thursday, 31 August 2006 08:45:42 UTC