- From: Jukka K. Korpela <jkorpela@cs.tut.fi>
- Date: Sun, 13 May 2007 10:40:54 +0300 (EEST)
- To: www-validator@w3.org
On Fri, 11 May 2007, Frank Ellermann wrote:

> But if authors manage to create an ASCII or Latin-1 document which is
> later mutilated into windows-1252 by a dubious editor (human or tool),
> they might prefer to get a clear "invalid" from the validator, not
> only a warning.

It would be incorrect to report an error in the absence of character
encoding information. You don't know what the encoding is or is supposed
to be (though you may have guesses, perhaps even very probable guesses),
so you cannot know that the document is invalid.

> Another problem with "default windows-1252" is that it would "accept"
> (warning but no error) many other UNKNOWN-8BIT charsets. Codepage 437,
> 850, 858, MAC Roman, etc. etc., they all would match "windows-1252".

That would be fine. Actually, "default windows-1252" is not even the most
permissive, so I will change my proposal.

The point is that in a _validator_, all characters beyond the ASCII
repertoire are just data that may appear as character data content (or in
CDATA attribute values). Without character encoding information, you
cannot know how to interpret them, but neither need you know that, since
you are a validator. Well, you would need to analyze whether the octets
represent characters in the document character set, but you cannot do
that when you don't know the encoding. You can just tell that you were
not able to check that.

>> users don't really want to see messages like "octet 80 encountered in
>> a document declared to be ISO-8859-1"
>
> I want that. It took me about a year here until I understood the issue,
> and replaced all &#128; bogeys by octet 128 declared as windows-1252,
> but it was precisely what I wanted. An old W3C validator version let
> me get away with &#128; (before 9-11, years ago), and that was wrong.

Reporting &#128; is a different issue, since its meaning does not depend
on the document's character encoding.
What's relevant in this discussion is that if you use octet 128 at all,
consciously or unconsciously, you need to get the response that the
encoding needs to be declared. Neither an error message nor a warning is
really adequate here. Rather, an error message of a different category or
level is needed: the user needs to know that validation proper cannot be
carried out due to lack of sufficient information. So it's comparable to
reporting a data transfer error.

>> Using UTF-8 as the default implies that in most cases, if the document
>> contains octets outside the ASCII range, they will be reported by the
>> validator as data errors (malformed UTF-8 data).
>
> Yes, a nice feature of UTF-8, it doesn't permit too much nonsense

The problem here is that there would often be a large number of
completely misleading data error messages. You take almost any document
containing non-ASCII data, submit it to validation without character
encoding information, and there would be a message about the majority of
non-ASCII characters, one message per character.
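To illustrate the "one message per character" effect, here is a minimal Python sketch. It is not taken from any validator's code; the function name `count_utf8_errors` and the sample text are hypothetical. It counts how many separate malformed-sequence reports a strict UTF-8 decoder produces for a short ISO-8859-1 document.

```python
def count_utf8_errors(data: bytes) -> int:
    """Count the separate malformed-sequence errors found when
    decoding `data` strictly as UTF-8 (hypothetical helper)."""
    errors = 0
    pos = 0
    while pos < len(data):
        try:
            # Decode the remainder; success means no further errors.
            data[pos:].decode("utf-8")
            break
        except UnicodeDecodeError as exc:
            errors += 1
            # exc.end is relative to the slice: skip the bad octets.
            pos += exc.end
    return errors


# A Finnish greeting with five non-ASCII letters, stored as ISO-8859-1.
sample = "Hyvää päivää".encode("iso-8859-1")
print(count_utf8_errors(sample))
```

Under a UTF-8 default, each of the five "ä" octets in this sample triggers its own error, although the document is perfectly good ISO-8859-1 that merely lacks a declaration.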
My modified proposal is:

When a document is submitted to validation so that its character encoding
cannot be deduced in any of the ways defined (message headers, meta tags,
defined defaults), then

1) a data error message is issued, preferably before any other message,
   explaining that validation cannot be carried out due to lack of
   character encoding (charset) information, with a link to a document
   explaining this in detail
2) that message is followed by a note explaining that the validation
   process is performed under some assumptions (to be listed next or in
   a linked document)
3) a further explanation is given that emphasizes that character data in
   the document cannot be tested and that the document should be
   submitted to validation after selecting and specifying an encoding
4) validation is then started with an assumed character encoding where
   a) octets 0 to 7F are interpreted as ASCII
   b) octets 80 to FF are not interpreted at all but assumed to
      constitute non-ASCII character data
   (This is different from UNKNOWN-8BIT, which is completely agnostic
   about the interpretation.)

Item 4 could be omitted. After all, the user _should_ do as explained in
item 3.

--
Jukka "Yucca" Korpela, http://www.cs.tut.fi/~jkorpela/
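The four items of the proposal could be sketched roughly as follows in Python. This is only an illustration under stated assumptions, not how any existing validator works: the names `Report`, `PLACEHOLDER` and `prepare_for_validation` are hypothetical, the message texts are invented, and representing each uninterpreted octet by U+FFFD is just one possible way to model "non-ASCII character data".

```python
from dataclasses import dataclass, field
from typing import List, Optional

# Hypothetical stand-in for one octet in the range 80..FF that is
# treated as "some non-ASCII character data" without interpretation.
PLACEHOLDER = "\ufffd"


@dataclass
class Report:
    messages: List[str] = field(default_factory=list)
    text: str = ""          # what the markup parser would receive
    opaque_octets: int = 0  # octets 80..FF left uninterpreted


def prepare_for_validation(data: bytes,
                           declared_encoding: Optional[str]) -> Report:
    report = Report()
    if declared_encoding is not None:
        # Encoding is known: normal validation can proceed.
        report.text = data.decode(declared_encoding)
        return report

    # Items 1-3: messages issued before any other message.
    report.messages.append(
        "Data error: validation cannot be carried out, because no "
        "character encoding (charset) information was found.")
    report.messages.append(
        "Note: the following results were produced under assumptions "
        "about the encoding.")
    report.messages.append(
        "Character data could not be tested. Specify an encoding and "
        "submit the document again.")

    # Item 4: octets 0..7F are ASCII, octets 80..FF stay opaque.
    chars = []
    for octet in data:
        if octet <= 0x7F:
            chars.append(chr(octet))
        else:
            chars.append(PLACEHOLDER)
            report.opaque_octets += 1
    report.text = "".join(chars)
    return report


result = prepare_for_validation(b"<p>caf\xe9 \x80</p>", None)
print(result.messages[0])
print(result.opaque_octets)
```

Because the markup itself is ASCII, the element structure can still be checked in this mode. Only the character data check is skipped, which is what item 3 tells the user.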
Received on Sunday, 13 May 2007 07:41:02 UTC