Re: charset=us-ascii mandatory? from Jukka K. Korpela on 2007-05-13 (www-validator@w3.org from May 2007)

From: Jukka K. Korpela <jkorpela@cs.tut.fi>
Date: Sun, 13 May 2007 10:40:54 +0300 (EEST)
To: www-validator@w3.org
Message-ID: <Pine.GSO.4.64.0705131017380.16580@hopeatilhi.cs.tut.fi>

On Fri, 11 May 2007, Frank Ellermann wrote:

> But if authors manage to create an ASCII or Latin-1 document which is
> later mutilated into windows-1252 by a dubious editor (human or tool),
> they might prefer to get a clear "invalid" from the validator, not
> only a warning.

It would be incorrect to report an error in the absence of character 
encoding information. You don't know what the encoding is or is supposed 
to be (though you may have guesses, perhaps even very probable guesses), 
so you cannot know that the document is invalid.

> Another problem with "default windows-1252" is that it would "accept"
> (warning but no error) many other UNKNOWN-8BIT charsets.  Codepage 437,
> 850, 858, MAC Roman, etc. etc., they all would match "windows-1252".

That would be fine. Actually, "default windows-1252" is not even the most 
permissive, so I will change my proposal.

The point is that in a _validator_, all characters beyond the ASCII 
repertoire are just data that may appear as character data content (or in 
CDATA attribute values). Without character encoding information, you 
cannot know how to interpret them, but neither need you know that, since 
you are a validator. Well, you would need to analyze whether the octets 
represent characters in the document character set, but you cannot do that 
when you don't know the encoding. You can just tell that you were not able 
to check that.

>> users don't really want to see messages like "octet 80 encountered in
>> a document declared to be ISO-8859-1"
>
> I want that.  It took me about a year here until I understood the issue,
> and replaced all &#128; bogeys by octet 128 declared as windows-1252,
> but it was precisely what I wanted.  An old W3C validator version let
> me get away with &#128; (before 9-11, years ago), and that was wrong.

Reporting &#128; is a different issue, since its meaning does not depend 
on the document's character encoding.
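
As an illustration of that distinction (a Python sketch that is not part of the original message; the codec names "cp1252", "cp437" and "latin-1" are Python's own labels), the reference &#128; always denotes code position 128 of the document character set, whereas the raw octet 0x80 denotes different characters depending on the encoding assumed:

```python
import unicodedata

# A numeric character reference such as &#128; names code position 128 in
# the document character set (ISO 10646), whatever the transfer encoding is.
ncr_char = chr(128)
ncr_category = unicodedata.category(ncr_char)  # "Cc": a C1 control character

# The raw octet 0x80, by contrast, depends entirely on the declared encoding.
octet = bytes([0x80])
as_cp1252 = octet.decode("cp1252")   # EURO SIGN
as_cp437 = octet.decode("cp437")     # LATIN CAPITAL LETTER C WITH CEDILLA
as_latin1 = octet.decode("latin-1")  # the C1 control character U+0080
```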

What's relevant in this discussion is that if you use octet 128 at all, 
consciously or unconsciously, you need to get the response that the 
encoding needs to be declared. Neither an error message nor a warning is 
really adequate here. Rather, an error message of a different category or 
level is needed: the user needs to know that validation proper cannot be 
carried out due to lack of sufficient information. So it's comparable to 
reporting a data transfer error.

>> Using UTF-8 as the default implies that in most cases, if the document
>> contains octets outside the ASCII range, they will be reported by the
>> validator as data errors (malformed UTF-8 data).
>
> Yes, a nice feature of UTF-8, it doesn't permit too much nonsense

The problem here is that there would often be a large number of completely 
misleading data error messages. You take almost any document containing 
non-ASCII data, submit it to validation without character encoding 
information, and there would be a message about the majority of non-ASCII 
characters, one message per character.
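
The effect can be sketched in Python (an illustration only, not part of the original message; count_malformed_utf8 is a hypothetical helper and the sample text is made up). It counts how many separate decoding errors a strict UTF-8 decoder would hit in ISO-8859-1 data:

```python
def count_malformed_utf8(data: bytes) -> int:
    """Count the positions at which strict UTF-8 decoding fails."""
    errors = 0
    pos = 0
    while pos < len(data):
        try:
            data[pos:].decode("utf-8")
            break  # the rest of the data is well-formed UTF-8
        except UnicodeDecodeError as exc:
            errors += 1
            # exc.end is relative to the slice; skip past the bad sequence.
            pos += exc.end
    return errors


# Finnish sample text with seven non-ASCII letters, encoded as ISO-8859-1.
sample = "Hyvää päivää, Jyväskylä".encode("latin-1")
error_count = count_malformed_utf8(sample)
```

For this sample, every non-ASCII letter yields its own error, which is the "one message per character" behaviour described above. The same text encoded as UTF-8 yields none.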

My modified proposal is:

When a document is submitted to validation and its character encoding 
cannot be deduced in any of the ways defined (message headers, meta tags, 
defined defaults), then
1) a data error message is issued, preferably before any other message,
    explaining that validation cannot be carried out due to lack of
    character encoding (charset information), with a link to a document
    explaining this in detail
2) that message is followed by a note explaining that the validation
    process is performed under some assumptions (to be listed next or in
    a linked document)
3) a further explanation is given that emphasizes that character data
    in the document cannot be tested and that the document should be
    submitted to validation after selecting and specifying an encoding
4) validation is then started with an assumed character encoding where
    a) octets 0 to 7F are interpreted as ASCII
    b) octets 80 to FF are not interpreted at all but assumed to constitute
    non-ASCII character data
    (This is different from UNKNOWN-8BIT, which is completely agnostic
    about the interpretation.)

Item 4 could be omitted. After all, the user _should_ do as explained in 
item 3.
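
One possible way to realize the assumed encoding of item 4 is sketched below in Python. This is an assumption for illustration, not something the proposal prescribes: the function names are hypothetical, and the sketch relies on Python's "surrogateescape" error handler, which maps each octet 80 to FF to an opaque placeholder code point (U+DC80 to U+DCFF) while decoding octets 0 to 7F as ASCII.

```python
def decode_assumed(data: bytes) -> str:
    # Octets 00-7F are read as ASCII; each octet 80-FF becomes a lone
    # surrogate (U+DC80..U+DCFF) that stands for "some non-ASCII data".
    return data.decode("ascii", errors="surrogateescape")


def is_uninterpreted(ch: str) -> bool:
    # True for the placeholders produced from octets 80-FF.
    return "\udc80" <= ch <= "\udcff"


def encode_back(text: str) -> bytes:
    # The mapping is reversible, so the original octets are not lost.
    return text.encode("ascii", errors="surrogateescape")


raw = b"<p>caf\xe9 \x80</p>"
text = decode_assumed(raw)
uninterpreted_count = sum(is_uninterpreted(ch) for ch in text)
```

With such a decoding, markup (which is ASCII) can still be parsed, and the placeholders can be treated as character data without any claim about which characters they are.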

-- 
Jukka "Yucca" Korpela, http://www.cs.tut.fi/~jkorpela/
Received on Sunday, 13 May 2007 07:41:02 UTC