Re: Fallback to UTF-8 from Jukka K. Korpela on 2008-05-05 (www-validator@w3.org from May 2008)

From: Jukka K. Korpela <jkorpela@cs.tut.fi>
Date: Mon, 5 May 2008 08:54:30 +0300
To: <www-validator@w3.org>
Message-ID: <003e01c8ae74$bf2ac630$0500000a@DOCENDO>
olivier Thereaux wrote:

> On 2-May-08, at 11:09 PM, Andreas Prilop wrote:
>> With UTF-8 or Windows-1252 assumed, the W3C validator simply gives up
>> and does nothing
>>
>>   "Sorry! This document can not be checked."
>>
>> when it finds some byte (or byte sequence) that it cannot
>> interpret as Windows-1252 or UTF-8.
>
> Which is why the validator was patched to try latin-1, after utf-8 and
> win-1252.

That sounds odd, since any octet that is acceptable in a windows-1252 
encoded HTML document is also acceptable, with the same meaning, in an 
iso-8859-1 encoded HTML document, but not vice versa.

I guess your idea is that in iso-8859-1, octets 80 to 9F (hex.) are 
defined as control characters U+0080 to U+009F, as opposed to being 
partly graphic characters, partly undefined in windows-1252. However, in 
practical terms, we know that the "fallback" guess is virtually always 
wrong. The author of the document, or the person who submitted it to 
validation, didn't really want to have it interpreted as iso-8859-1 but 
probably as some 8-bit encoding where the problematic octets denote 
graphic characters, e.g. KOI8-R or MacRoman.
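
To illustrate the point about octets 80 to 9F, here is a minimal sketch 
in Python. It assumes that Python's built-in codecs "iso-8859-1", 
"koi8_r" and "mac_roman" reflect the respective encodings; the function 
name is made up for this example and is not part of the validator.

```python
import unicodedata

def c1_range_categories(encoding):
    """Map each octet 0x80-0x9F to the Unicode general category of the
    character it decodes to, or None if the codec leaves it undefined."""
    result = {}
    for octet in range(0x80, 0xA0):
        try:
            char = bytes([octet]).decode(encoding)
        except UnicodeDecodeError:
            result[octet] = None
        else:
            result[octet] = unicodedata.category(char)
    return result

for name in ("iso-8859-1", "koi8_r", "mac_roman"):
    cats = c1_range_categories(name)
    # "Cc" is the general category of control characters.
    controls = sum(1 for c in cats.values() if c == "Cc")
    print(f"{name}: {controls} of {len(cats)} octets are control characters")
```

Under these assumptions, every octet in the range is a control 
character in iso-8859-1 and none is in the other two encodings, so a 
document that uses such octets as text was most likely not meant as 
iso-8859-1.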

> http://qa-dev.w3.org/wmvs/HEAD/

Doesn't work. I tested my
http://www.cs.tut.fi/~jkorpela/chars/test.htm
and I get "Tentatively passed, 3 warning(s) ", which is just wrong, 
because it contains octets not defined in the assumed encoding 
windows-1252. Besides, I don't see any warnings, and there should be 5 
error messages for undefined octets (at least if we interpret 
windows-1252 as defined by the mapping table at the Unicode.org site; 
that table is cited in the IANA registration of windows-1252).
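
The undefined octets can be listed with a short sketch like the 
following. It assumes that Python's "cp1252" codec follows the 
Unicode.org mapping table; the helper name is hypothetical and has 
nothing to do with the validator's own code.

```python
def undefined_octets(encoding):
    """Return the octets 0x00-0xFF that the codec refuses to decode."""
    missing = []
    for octet in range(256):
        try:
            bytes([octet]).decode(encoding)
        except UnicodeDecodeError:
            missing.append(octet)
    return missing

# Octets with no mapping in windows-1252, shown in hex.
print([f"{o:02X}" for o in undefined_octets("cp1252")])
# iso-8859-1 maps every octet, so this list is empty.
print(undefined_octets("iso-8859-1"))
```

With that codec, the first list has five entries (81, 8D, 8F, 90 and 
9D), while the second is empty.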

The test version seems to allow any octets in the range 80 to 9F for 
iso-8859-1. The normal validator.w3.org version correctly reports them 
as errors.

Besides, "tentatively passed" is obscure if not outright wrong. Pass, or 
do not pass; there is no try. It might make sense to say "cannot be 
validated, but would validate if declared as windows-1252".

>> The W3C validator just reports "non SGML character number ...",
>> which is still better than to sit there and to do nothing.
>
> Arguably. For experts in SGML and markup languages, yes, "non SGML
> character" is an obvious sign of an encoding issue. For most people,
> however, "non SGML character number" is gibberish, whereas "sorry,
> there is a problem because I could not determine the encoding of your
> document" is somewhat understandable.

I agree. But in addition to that, it _is_ better to sit there and to do 
nothing, after having told what the problem is, than to give _wrong_ or 
at least misleading or purely guesswork-based messages.

There is so little to be won by making guesses. The user would still 
have to specify the encoding and revalidate. So why not just tell him to 
do that? It actually _saves_ time, since the user does not have to deal 
with spurious and obscure messages.

One might say that _sometimes_ the guess is correct and helps the user to 
select the encoding. But I don't think that's relevant. The validator 
could simply tell the user that if he does not know what the intended 
encoding is, he could try such-and-such encodings.

I think the discussions have shown that there is no satisfactory way to 
process a document in validation without knowing its encoding, at least 
as specified by the user of the validator. The situation should just be 
reported as an error condition that prevents _even trying validation_.
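
As a rough sketch of that behaviour in Python: decode strictly in the 
encoding the user specified, and refuse to go on otherwise. The names 
CannotValidate and decode_or_refuse are invented for this illustration; 
this is not how the validator is actually implemented.

```python
class CannotValidate(Exception):
    """Raised when validation must not even be attempted."""

def decode_or_refuse(data, declared):
    """Decode data strictly in the declared encoding, with no fallback guess."""
    if declared is None:
        raise CannotValidate(
            "The encoding of the document is not known. "
            "Specify the encoding and revalidate."
        )
    try:
        return data.decode(declared, errors="strict")
    except UnicodeDecodeError as exc:
        # exc.start is the offset of the first octet that could not be decoded.
        raise CannotValidate(
            f"Octet 0x{data[exc.start]:02X} at offset {exc.start} "
            f"is not valid in {declared}. "
            "Specify the correct encoding and revalidate."
        ) from exc

try:
    decode_or_refuse(b"a\x81b", "windows-1252")
except CannotValidate as err:
    print(err)
```

The only output in the failure case is a single message that tells the 
user what to do next; no markup errors are reported on the basis of a 
guessed encoding.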

Jukka K. Korpela ("Yucca")
http://www.cs.tut.fi/~jkorpela/
Received on Monday, 5 May 2008 05:56:42 UTC