- From: Jukka K. Korpela <jkorpela@cs.tut.fi>
- Date: Mon, 5 May 2008 08:54:30 +0300
- To: <www-validator@w3.org>
olivier Thereaux wrote:

> On 2-May-08, at 11:09 PM, Andreas Prilop wrote:
>> With UTF-8 or Windows-1252 assumed, the W3C validator simply gives up
>> and does nothing
>>
>> "Sorry! This document can not be checked."
>>
>> when it finds some byte (or byte sequence) that it cannot
>> interpret as Windows-1252 or UTF-8.
>
> Which is why the validator was patched to try latin-1, after utf-8 and
> win-1252.

That sounds odd, since any octet that is acceptable in an iso-8859-1 encoded HTML document is also acceptable, with the same meaning, in a windows-1252 encoded HTML document, but not vice versa. I guess your idea is that in iso-8859-1, octets 80 to 9F (hex.) are defined as the control characters U+0080 to U+009F, as opposed to being partly graphic characters, partly undefined in windows-1252.

However, in practical terms, we know that the "fallback" guess is virtually always wrong. The author of the document, or the person who submitted it to validation, didn't really want to have it interpreted as iso-8859-1 but probably as some 8-bit encoding where the problematic octets denote graphic characters, e.g. KOI8-R or MacRoman.

> http://qa-dev.w3.org/wmvs/HEAD/

Doesn't work. I tested my http://www.cs.tut.fi/~jkorpela/chars/test.htm and I get "Tentatively passed, 3 warning(s)", which is just wrong, because the document contains octets not defined in the assumed encoding, windows-1252. Besides, I don't see any warnings, and there should be 5 error messages for undefined octets (at least if we interpret windows-1252 as defined by the mapping table at the Unicode.org site; that table is cited in the IANA registration of windows-1252).

The test version seems to allow any octets in the range 80 to 9F when the assumed encoding is iso-8859-1. The normal validator.w3.org version correctly reports them as errors.

Besides, "tentatively passed" is obscure if not outright wrong. Pass, or do not pass; there is no try. It might make sense to say "cannot be validated, but would validate if declared as windows-1252".

>> The W3C validator just reports "non SGML character number ...",
>> which is still better than to sit there and to do nothing.
>
> Arguably. For experts in SGML and markup languages, yes, "non SGML
> character" is an obvious sign of an encoding issue. For most people,
> however, "non SGML character number" is gibberish, whereas "sorry,
> there is a problem because I could not determine the encoding of your
> document" is somewhat understandable.

I agree. But in addition to that, it _is_ better to sit there and do nothing, after having told the user what the problem is, than to give _wrong_, or at least misleading or purely guesswork-based, messages.

There is so little to be gained by guessing. The user would still have to specify the encoding and revalidate, so why not just tell him to do that? It actually _saves_ time, since the user does not have to deal with spurious and obscure messages.

One might say that _sometimes_ the guess is correct and helps the user select the encoding. But I don't think that's relevant. The validator could simply tell the user that if he does not know what the intended encoding is, he could try such-and-such encodings.

I think the discussions have shown that there is no satisfactory way to process a document in validation without knowing its encoding, at least as specified by the user of the validator. The situation should just be reported as an error condition that prevents _even trying validation_.

Jukka K. Korpela ("Yucca")
http://www.cs.tut.fi/~jkorpela/
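
P.S. To make the octet arithmetic above concrete, here is a minimal Python 3 sketch (Python is used only as an illustration here; this is not code from the validator). CPython's "cp1252" codec is built from the same Unicode.org mapping table that the IANA registration cites, so in strict mode it rejects exactly the five undefined octets:

    # Decode each octet in the range 80 to 9F hex. under both encodings.
    # iso-8859-1 assigns all of them to the control characters
    # U+0080..U+009F; windows-1252 assigns graphic characters to 27 of
    # them and leaves five (0x81, 0x8D, 0x8F, 0x90, 0x9D) undefined.
    for octet in range(0x80, 0xA0):
        raw = bytes([octet])
        as_latin1 = raw.decode("iso-8859-1")      # never fails: a C1 control
        try:
            ch = raw.decode("cp1252")             # strict mode by default
            cp1252_desc = "U+%04X %s" % (ord(ch), ch)
        except UnicodeDecodeError:
            cp1252_desc = "undefined"
        print("0x%02X  iso-8859-1: U+%04X (control)  windows-1252: %s"
              % (octet, ord(as_latin1), cp1252_desc))

The five "undefined" lines in the output are the five octets for which error messages should be issued for test.htm.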
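
The same approach verifies the superset claim over the whole octet space: every octet that is acceptable (that is, non-control) under iso-8859-1 means the very same character under windows-1252, while windows-1252 alone assigns graphic characters to 27 of the C1 positions. Again only a sketch, relying on Python's unicodedata module to classify control characters:

    import unicodedata

    identical, cp1252_only = 0, []
    for octet in range(256):
        raw = bytes([octet])
        ch_8859 = raw.decode("iso-8859-1")
        if unicodedata.category(ch_8859) == "Cc":      # a control character
            try:
                if unicodedata.category(raw.decode("cp1252")) != "Cc":
                    cp1252_only.append(octet)          # graphic in cp1252 only
            except UnicodeDecodeError:
                pass                                   # one of the five undefined
        else:
            assert raw.decode("cp1252") == ch_8859     # same meaning in both
            identical += 1

    print("octets meaning the same character in both:", identical)   # 191
    print("graphic only in windows-1252:",
          ["0x%02X" % o for o in cp1252_only])         # 27 octets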
Received on Monday, 5 May 2008 05:56:42 UTC