Re: utf-8 validation help

Jukka K. Korpela wrote:
 
> They might also be using "free" web space on a server that
> adds some code on each page sent, making it invalid.

Yes, that would be a hopeless case.  But RFC 2616 is more
tolerant with respect to the HTTP header.  If the choice is
"no info" vs. "wrong info" I pick the former: some of my plain
text files are pc-multilingual-850+euro, and no decent Web
server could get this right without direct instructions.
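A quick Python sketch (untested, the URL is of course just a
placeholder) to see what charset info a server actually
announces:

  from urllib.request import urlopen

  # Print the Content-Type header and its charset parameter, if any;
  # get_content_charset() returns None when the server sends no info.
  with urlopen("http://example.org/some-page.html") as resp:
      print(resp.headers.get("Content-Type", ""))
      print(resp.headers.get_content_charset())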

> The charset issue is however much less serious

This got a MAY, a SHOULD, and two MUSTs in 3.4.1 of RFC 2616.
And my browser probably belongs to the "unfortunately" cases.
Tough.  At least this mess is limited to HTTP/1.0, so it can't
confuse the validator.

>>> at least the page should have a <meta> tag with such info.
>> Yes, for HTML
> Which is what people should use as the distribution format of
> www pages, _especially_ on a server that does not let them
> control the HTTP headers in any way.

The OP wants XHTML, and I have used it for years now; so far
no argument against XHTML 1.0 Transitional has convinced me.
I saw your points against XHTML 1.1 some days ago (about name
and map), and those were convincing.  Unfortunately, some 1.1
features make sense, but "visible with any browser" is my top
priority.
   
>> The OP now picked <?xml version="1.0" encoding="utf-8"?>,
[...]
> It throws the most popular browser into "quirks mode", since
> its doctype sniffing decides that the author wants that (i.e.
> simulation of errors in old versions in IE) if there is
> _anything_ before the doctype declaration. Is _this_ what
> people want when they move to XHTML?

I use this only on one experimental page; for ordinary XHTML
pages I use a meta declaration.  We discussed that last year,
when you told me that this page might fool the validator but
won't work elsewhere.
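
For the record, a rough Python sketch of that sniffing pitfall
(the file name is made up), checking whether anything at all
precedes the doctype:

  def doctype_comes_first(path):
      # Rough check: treats any leading bytes (XML prolog, comments,
      # even whitespace) as "something before the doctype"; the real
      # sniffing rules are murkier.
      with open(path, "rb") as f:
          head = f.read(512)
      return head.upper().startswith(b"<!DOCTYPE")

  # False for a page starting with <?xml version="1.0" encoding="utf-8"?>
  print(doctype_comes_first("page.xhtml"))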

The mainstream has already decided that this is all hopeless
and uses various wiki-cum-XHTML dialects.  What you get is
XHTML with some warts, like the odd MediaWiki <pre>, but it's
not too bad.

> The OP also changed the non-ASCII characters to entities, so
> character encoding is not really an issue any more. The
> document is in practice in ASCII encoding, though ASCII data
> can trivially be declared and processed as UTF-8.

Yes, I use ASCII almost everywhere, with windows-1252 only
where I "must" have a wannabe-backwards-compatible Euro sign.
This "wannabe" results in "visible with many browsers",
unfortunately not all.

> The simplest approach would be to stop there, hopefully with
> some message that helps the user fix the error and try again.

Not ideal if the page is really UTF-8 and "there" means
"before the body element".  If "there" means "first UTF-8
error" it's an idea.  Again not ideal if the page is really
Latin-1; those users then need three validation steps
(assuming that they can fix all reported errors, and that
there is another problem besides the missing charset
declaration).
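
For "first UTF-8 error", a minimal sketch (Python) of what I
mean, reporting the octet offset of the first invalid
sequence:

  def first_utf8_error(octets):
      try:
          octets.decode("utf-8")
          return None               # valid UTF-8 (plain ASCII included)
      except UnicodeDecodeError as err:
          return err.start          # offset of the first bad sequence

  print(first_utf8_error(b"caf\xe9 au lait"))   # 3, a lone Latin-1 0xE9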

> This would probably reduce confusion, since if the validator
> tries to do something based on a guesswork, then
[...]
ACK, this could get completely out of hand for really obscure
"charsets" like SCSU.  The validator probably doesn't support
it and is missing some interesting cases in its test suite.

> If you make a guess, windows-1252 is surely a more practical
> guess than iso-8859-1. But a validator shouldn't be a
> guessing tool. It has a very specific and rather technical
> job to do.

It also has a "typical" audience.  Folks discussing the fine
points of charset recognition are untypical; an assumption of
"unknown legacy SBCS with 0x80..0xFF all allowed" could make
sense.  For windows-1252 that's not quite true (a few of those
octets are undefined), so it could pick any other obscure
charset where all of them are allowed.

> It surely mustn't just ignore "the spurious octets" when it
> cannot possibly know what the encoding is. Without
> information about encoding, every octet is spurious.

The idea would be to avoid a flood of bogus error messages
based on a wrong guess, but also _try_ to report other errors
as long as that's plausible.  Otherwise you get the effect
that users need three validation steps instead of two.
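
Roughly this idea, not claiming that's how the validator works
(windows-1252 and latin-1 are just the obvious guinea pigs
here):

  def permissive_decode(octets):
      # Keep parsing with a single-byte guess so other markup errors
      # can still be reported; the page stays INVALID without a
      # declared charset no matter what this returns.
      try:
          return octets.decode("windows-1252")   # the "practical" guess ...
      except UnicodeDecodeError:
          # ... but it has a few undefined octets; latin-1 maps all 256
          return octets.decode("latin-1")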

In reality we "know" that it's 99% windows-1252.  I recall two
articles here in the last three years, one asking for codepage
437, another about an unregistered Mac charset.  Everything
else here about charset issues was windows-1252.

> If you put an HTML document with no encoding information on a
> web server that sends no encoding information in HTTP
> headers, then user agents can and will interpret it in
> different ways

Sure, nobody proposed reporting no error at all if the "guess"
approach finds no other error.  No "tentatively valid"; it's
INVALID.  The only question is how best to report it to users.

Frank

Received on Thursday, 31 August 2006 08:45:42 UTC