Re: charset=us-ascii mandatory? from Frank Ellermann on 2007-05-11 (www-validator@w3.org from May 2007)

From: Frank Ellermann <nobody@xyzzy.claranet.de>
Date: Fri, 11 May 2007 16:53:26 +0200
To: www-validator@w3.org
Message-ID: <46448366.24C3@xyzzy.claranet.de>

Jukka K. Korpela wrote:

> In practice, documents that fail to declare their encoding mostly use
> windows-1252.

I agree with everything else you said, but I'm not sure about this conclusion:

Some documents without declaration might be Latin-1 or ASCII.  No big
issue: the visible characters of ASCII are a proper subset of the
visible characters of Latin-1, which in the same sense are a proper
subset of windows-1252.
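
The subset relation can be checked with a short sketch.  It assumes
Python's bundled codecs ("ascii", "latin-1", "cp1252") are a fair stand-in
for the charsets named above, and it takes "visible" to mean "not a
control character"; the helper name visible_chars is only illustrative.

```python
import unicodedata

def visible_chars(codec):
    """Return the non-control characters that `codec` decodes from one byte."""
    chars = set()
    for b in range(256):
        try:
            ch = bytes([b]).decode(codec)
        except UnicodeDecodeError:
            continue  # byte is not assigned in this codec
        if unicodedata.category(ch) != "Cc":  # skip C0/C1 control characters
            chars.add(ch)
    return chars

ascii_visible = visible_chars("ascii")
latin1_visible = visible_chars("latin-1")
cp1252_visible = visible_chars("cp1252")

# On sets, `<` tests for a proper subset.
print(ascii_visible < latin1_visible < cp1252_visible)
# Characters found only in windows-1252 (euro sign, curly quotes, ...).
print(sorted(cp1252_visible - latin1_visible))
```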

But if authors manage to create an ASCII or Latin-1 document which is
later mutilated into windows-1252 by a dubious editor (human or tool),
they might prefer to get a clear "invalid" from the validator, not
only a warning.  Some really old browsers didn't support windows-1252,
or claimed to support Latin-1 when they meant windows-1252, and that's
messy.  What should another tool do if the user tries to forward this
UNKNOWN-8BIT mess as mail?

Another problem with "default windows-1252" is that it would "accept"
(with a warning but no error) many other UNKNOWN-8BIT charsets: codepage
437, 850, 858, Mac Roman, etc. would all match "windows-1252".
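
A minimal sketch of that problem, assuming Python's codecs and a made-up
sample word: text written in another 8-bit charset decodes as
windows-1252 without any error, it just comes out wrong.  The function
name misread_as_cp1252 is hypothetical.  One caveat: Python's "cp1252"
codec leaves five octets (81, 8D, 8F, 90, 9D) unassigned, so a few inputs
(e.g. codepage 437 "ü", octet 81) would raise an error here.

```python
def misread_as_cp1252(text, source_codec):
    """Encode `text` in `source_codec`, then decode the octets as windows-1252."""
    raw = text.encode(source_codec)
    return raw.decode("cp1252")  # no error for the octets used below

sample = "Größe"  # illustrative sample, not taken from a real document
for source in ("cp437", "cp850", "cp858", "mac_roman"):
    print(source, repr(misread_as_cp1252(sample, source)))
```

Every decode succeeds, so a validator that silently assumes windows-1252
has no way to notice that the result is not what the author wrote.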

> If you are a web browser and you think (either on the basis of charset
> declaration or your settings or even your educated guess) that 
> ISO-8859-1 encoding is to be used to interpret a document, what will
> you do when you encounter a character in the range 80..9F? Right, you
> interpret them by windows-1252, often by doing nothing special

My browser is lazy, it lets me guess what to do (and it won't allow me
to guess UTF-8 or UTF-anything, but I digress).  Under your conditions
I could guess that I'm looking at a Cyrillic document (KOI8-R or
windows-1251 or similar).
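
To illustrate why the guess matters, here is a small sketch (Python
codecs assumed, sample word invented for the example): the same octets
decode cleanly under several 8-bit charsets, and only one reading is the
intended one.

```python
# Octets of a KOI8-R document; the sample word is illustrative only.
raw = "Привет".encode("koi8_r")

# Each guess decodes without error, so nothing flags the wrong ones.
guesses = {codec: raw.decode(codec) for codec in ("koi8_r", "cp1251", "cp1252")}

for codec, text in guesses.items():
    print(f"{codec:8} {text!r}")
```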

> users don't really want to see messages like "octet 80 encountered in
> a document declared to be ISO-8859-1"

I want that.  It took me about a year here until I understood the issue,
and replaced all &#128; bogeys by octet 128 declared as windows-1252,
but it was precisely what I wanted.  An old W3C validator version let
me get away with &#128; (before 9-11, years ago), and that was wrong.
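
The difference between the two can be sketched as follows, assuming
Python's "cp1252" codec for windows-1252.  A numeric character reference
names a Unicode code point, and code point 128 is a C1 control character;
octet 128 in windows-1252 is the euro sign, whose code point is 8364.

```python
import unicodedata

# What &#128; literally refers to: Unicode code point 128 (U+0080).
code_point_128 = chr(128)
print(unicodedata.category(code_point_128))  # "Cc" means control character

# What octet 128 means in a document declared as windows-1252.
euro = bytes([128]).decode("cp1252")
print(unicodedata.name(euro))

# The numeric reference that does name this character.
correct_reference = f"&#{ord(euro)};"
print(correct_reference)
```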

> Using UTF-8 as the default implies that in most cases, if the document
> contains octets outside the ASCII range, they will be reported by the
> validator as data errors (malformed UTF-8 data).

Yes, a nice feature of UTF-8, it doesn't permit too much nonsense (at
least the "new" STD 63 / RFC 3629 version of UTF-8 isn't permissive).
If somebody invests time into "W3C validator charset issues" I'd hope 
it's in supporting more correctly declared and registered charsets.
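
A rough sketch of that strictness, using Python's built-in UTF-8 decoder
as a stand-in for an RFC 3629 decoder (the function name is_valid_utf8 is
only illustrative): stray Latin-1 octets, overlong forms, encoded
surrogates and the old 5-octet sequences are all rejected.

```python
def is_valid_utf8(data):
    """Return True if `data` decodes as strict UTF-8, else False."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True

print(is_valid_utf8("café".encode("utf-8")))      # proper UTF-8
print(is_valid_utf8(b"caf\xe9"))                  # Latin-1 octet E9
print(is_valid_utf8(b"\xc0\xaf"))                 # overlong encoding of "/"
print(is_valid_utf8(b"\xed\xa0\x80"))             # encoded surrogate U+D800
print(is_valid_utf8(b"\xf8\x88\x80\x80\x80"))     # 5-octet form, pre-RFC 3629
```

Only the first call reports valid data; the others are the kind of
"nonsense" a UTF-8 default would surface as errors.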

Frank

Received on Friday, 11 May 2007 14:56:24 UTC