- From: Frank Ellermann <nobody@xyzzy.claranet.de>
- Date: Fri, 11 May 2007 16:53:26 +0200
- To: www-validator@w3.org
Jukka K. Korpela wrote:

> In practice, documents that fail to declare their encoding mostly use
> windows-1252.

Agreeing with everything else you said, I'm not sure about this
conclusion: Some documents without declaration might be Latin-1 or ASCII.
No big issue, the visible characters of ASCII are a proper subset of the
visible characters of Latin-1, and that's in the same sense a proper
subset of windows-1252.

But if authors manage to create an ASCII or Latin-1 document which is
later mutilated into windows-1252 by a dubious editor (human or tool),
they might prefer to get a clear "invalid" from the validator, not only
a warning.

Some really old browsers didn't support windows-1252, or claimed to
support Latin-1 when they meant windows-1252, and that's messy. What
should another tool do if the user tries to forward this UNKNOWN-8BIT
mess as mail?

Another problem with "default windows-1252" is that it would "accept"
(warning but no error) many other UNKNOWN-8BIT charsets. Codepage 437,
850, 858, Mac Roman, etc. etc., they all would match "windows-1252".

> If you are a web browser and you think (either on the basis of charset
> declaration or your settings or even your educated guess) that
> ISO-8859-1 encoding is to be used to interpret a document, what will
> you do when you encounter a character in the range 80..9F? Right, you
> interpret them by windows-1252, often by doing nothing special

My browser is lazy, it lets me guess what to do (and it won't allow me
to guess UTF-8 or UTF-anything, but I digress). Under your conditions I
could guess that I'm looking at a Cyrillic document (KOI8-R or 1251 or
similar).

> users don't really want to see messages like "octet 80 encountered in
> a document declared to be ISO-8859-1"

I want that. It took me about a year here until I understood the issue,
and replaced all € bogeys by octet 128 declared as windows-1252, but it
was precisely what I wanted.
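
As a rough illustration of the two points above (octet 0x80 meaning different things under ISO-8859-1 and windows-1252, and a "default windows-1252" silently accepting text from other 8-bit codepages), here is a minimal Python sketch. It uses only Python's built-in codecs and says nothing about how the W3C validator works internally; the sample string and the helper name `decodes_cleanly` are made up for the example.

```python
# Illustration only: Python's built-in codecs, not the validator's logic.

# The same octet 0x80 (decimal 128) under the two labels.
octet = b"\x80"
as_latin1 = octet.decode("iso-8859-1")    # C1 control character U+0080
as_cp1252 = octet.decode("windows-1252")  # EURO SIGN U+20AC

# Hypothetical sample: "caf\u00e9" written by a DOS tool in codepage 437,
# where e-acute is octet 0x82.
dos_bytes = "caf\u00e9".encode("cp437")


def decodes_cleanly(data: bytes, charset: str) -> bool:
    """Return True if every octet in data is defined in charset."""
    try:
        data.decode(charset)
        return True
    except UnicodeDecodeError:
        return False


# Guessing windows-1252 raises no error for the codepage 437 octets,
# it just yields a different character (U+201A instead of U+00E9).
misread = dos_bytes.decode("windows-1252")

print(repr(as_latin1), repr(as_cp1252))
print(decodes_cleanly(dos_bytes, "windows-1252"), repr(misread))
print(decodes_cleanly(dos_bytes, "ascii"))
```

A clean decode therefore says little about whether the guess was right: nearly every octet is defined in windows-1252, so most UNKNOWN-8BIT input passes, while a stricter label such as ASCII rejects it outright.
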
An old W3C validator version let me get away with € (before 9-11, years
ago), and that was wrong.

> Using UTF-8 as the default implies that in most cases, if the document
> contains octets outside the ASCII range, they will be reported by the
> validator as data errors (malformed UTF-8 data).

Yes, a nice feature of UTF-8, it doesn't permit too much nonsense (at
least the "new" STD 63 / RFC 3629 version of UTF-8 isn't permissive).

If somebody invests time into "W3C validator charset issues" I'd hope
it's in supporting more correctly declared and registered charsets.

Frank
Received on Friday, 11 May 2007 14:56:24 UTC