- From: Jukka K. Korpela <jkorpela@cs.tut.fi>
- Date: Mon, 5 May 2008 08:54:30 +0300
- To: <www-validator@w3.org>
olivier Thereaux wrote:
> On 2-May-08, at 11:09 PM, Andreas Prilop wrote:
>> With UTF-8 or Windows-1252 assumed, the W3C validator simply gives up
>> and does nothing
>>
>> "Sorry! This document can not be checked."
>>
>> when it finds some byte (or byte sequence) that it cannot
>> interpret as Windows-1252 or UTF-8.
>
> Which is why the validator was patched to try latin-1, after utf-8 and
> win-1252.
That sounds odd, since any octet that is acceptable in an iso-8859-1
encoded HTML document is also acceptable, with the same meaning, in a
windows-1252 encoded HTML document, but not vice versa. So falling back
to latin-1 cannot turn a document that failed as windows-1252 into a
valid one.
I guess your idea is that in iso-8859-1, octets 80 to 9F (hex.) are
defined as control characters U+0080 to U+009F, as opposed to being
partly graphic characters, partly undefined in windows-1252. However, in
practical terms, we know that the "fallback" guess is virtually always
wrong. The author of the document, or the person who submitted it to
validation, didn't really want to have it interpreted as iso-8859-1 but
probably as some 8-bit encoding where the problematic octets denote
graphic characters, e.g. KOI8-R or MacRoman.
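To make the octet situation concrete, here is a minimal sketch (in
Python, purely as an illustration; it is not the validator's own code)
that decodes each octet in the range 80 to 9F under both encodings:

# Decode each octet 80..9F as windows-1252 and as iso-8859-1. Python's
# cp1252 codec follows the Unicode.org mapping table, so the five
# undefined octets raise an error; iso-8859-1 maps every octet, but
# 80..9F become the C1 controls U+0080..U+009F, which HTML does not
# allow anyway.
for byte in range(0x80, 0xA0):
    raw = bytes([byte])
    try:
        as_cp1252 = repr(raw.decode("windows-1252"))
    except UnicodeDecodeError:
        as_cp1252 = "undefined"            # octets 81, 8D, 8F, 90, 9D
    as_latin1 = repr(raw.decode("iso-8859-1"))
    print(f"{byte:02X}  windows-1252: {as_cp1252:14} iso-8859-1: {as_latin1}")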
> http://qa-dev.w3.org/wmvs/HEAD/
Doesn't work. I tested my
http://www.cs.tut.fi/~jkorpela/chars/test.htm
and I get "Tentatively passed, 3 warning(s) ", which is just wrong,
because it contains octets not defined in the assumed encoding
windows-1252. Besides, I don't see any warnings, and there should be 5
error messages for undefined octets (at least if we interpret
windows-1252 as defined by the mapping table at the Unicode.org site;
that table is cited in the IANA registration of windows-1252).
The test version seems to allow any octets in the range 80 to 9F for
iso-8859-1. The normal validator.w3.org version correctly reports them
as errors.
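What the normal version checks can be sketched roughly as follows (in
Python, as an illustration only; this is my simplification of the HTML
4.01 SGML declaration, not the validator's actual logic): after
transcoding, code points that HTML 4.01 excludes must still be flagged,
and octets 80 to 9F decoded as iso-8859-1 fall exactly in that range.

# Find characters the HTML 4.01 SGML declaration excludes (simplified:
# C0 controls other than TAB/LF/CR, plus DEL and the C1 range).
def non_sgml_chars(text):
    return [(i, ord(c)) for i, c in enumerate(text)
            if (ord(c) < 0x20 and c not in "\t\n\r")
            or 0x7F <= ord(c) <= 0x9F]

# Octet 94 is a curly quote in windows-1252 but U+0094 in iso-8859-1.
print(non_sgml_chars(b"caf\x94".decode("iso-8859-1")))   # [(3, 148)]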
Besides, "tentatively passed" is obscure if not outright wrong. Pass, or
do not pass; there is no try. It might make sense to say "cannot be
validated, but would validate if declared as windows-1252".
>> The W3C validator just reports "non SGML character number ...",
>> which is still better than to sit there and to do nothing.
>
> Arguably. For experts in SGML and markup languages, yes, "non SGML
> character" is an obvious sign of an encoding issue. For most people,
> however, "non SGML character number" is gibberish, whereas "sorry,
> there is a problem because I could not determine the encoding of your
> document" is somewhat understandable.
I agree. But in addition to that, it _is_ better to sit there and do
nothing, after having told what the problem is, than to give _wrong_, or
at least misleading, purely guesswork-based messages.
There is very little to be gained by guessing. The user would still
have to specify the encoding and revalidate. So why not just tell him to
do that? It actually _saves_ time, since the user does not have to deal
with spurious and obscure messages.
One might say that _sometimes_ the guess is correct and helps the user
select the encoding. But I don't think that's relevant. The validator
could simply tell the user that if he does not know what the intended
encoding is, he could try such-and-such encodings.
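That alternative could look like this (a sketch in Python, illustrative
only; the candidate list is hypothetical, and decoding cleanly still
says nothing about validity):

# Report which candidate encodings would even decode the document's
# octets without error, as a hint for the user to pick from.
CANDIDATES = ["utf-8", "windows-1252", "iso-8859-15", "koi8-r", "mac-roman"]

def plausible_encodings(octets):
    hits = []
    for name in CANDIDATES:
        try:
            octets.decode(name)
        except UnicodeDecodeError:
            continue
        hits.append(name)
    return hits

# Octet 8B is not valid UTF-8 here, but every 8-bit candidate accepts it.
print(plausible_encodings(b"na\x8bve"))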
I think the discussions have shown that there is no satisfactory way to
process a document in validation without knowing its encoding, at least
as specified by the user of the validator. The situation should just be
reported as an error condition that prevents _even trying validation_.
Jukka K. Korpela ("Yucca")
http://www.cs.tut.fi/~jkorpela/