Re: Fallback to UTF-8

olivier Thereaux wrote:

> On 24-Apr-08, at 11:38 PM, Andreas Prilop wrote:
>> Which kind of patch do you mean?
>> I just ask to change the default from UTF-8 to ISO-8859-1.
>
> In a few years developing various software projects, I have learned to
> be very wary of the word "just". Any occurrence of a suggestion that a
> software "just has to do this or that" usually means a lot of
> complexity and difficulty for whoever actually has to implement it.

Usually perhaps, but this particular question is about the default 
encoding assumed by the validator. If the code is at least minimally 
structured, this should indeed be _implementable_ as a very simple fix, 
essentially changing one definition of a parameter or a constant. 
Whether it is the _right_ fix is a different issue, but it's still fair 
to say "I just ask" as a reaction to a request for a patch.

As far as I can see, your comments deal with the desirability of the 
suggested fix, not with any technical difficulty in implementing it.
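
Just to be concrete about the scale of change I have in mind: in a 
hypothetical validator written in, say, Python (the real validator is 
organized differently, and these names are mine, not its), the whole 
fix could amount to something like

    # Hypothetical sketch only; not the validator's actual code or names.
    FALLBACK_ENCODING = "iso-8859-1"   # previously "utf-8"; used only when
                                       # no encoding has been declared

    def effective_encoding(declared_encoding):
        """Use the declared encoding if there is one, else the fallback."""
        return declared_encoding or FALLBACK_ENCODING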

> That said, as the long thread has shown, there are a number of
> candidates for default:
> * utf-8, because it is the future-looking encoding, also appropriate
> for most international content.

"Future-looking" does not apply here. We are dealing with an _error 
condition_: the encoding has not been specified, and you (we) have 
decided that the validator should make a guess, as an extra comfort to 
the user.

> It is also what authors are strongly
> encouraged to use today, and as such, the validator is a tool that
> should favor this practice.

In the given situation, the validator should encourage the user to 
specify the encoding, not to use any particular encoding. In practice, 
utf-8 is hardly the most common encoding. Moreover, assuming it will 
generally produce spurious, misleading error messages whenever the 
actual encoding is neither utf-8 nor compatible with it (like US-Ascii).
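
To illustrate at the byte level (Python here only as a convenient 
notation; the octet values are the point): an ordinary iso-8859-1 octet 
is rejected outright when the data is force-read as utf-8, and that 
rejection is exactly what turns into misleading error messages.

    data = b"caf\xe9"                  # 0xE9 is e-with-acute in iso-8859-1
    print(data.decode("iso-8859-1"))   # decodes without complaint
    try:
        data.decode("utf-8")           # 0xE9 does not form a valid utf-8 sequence here
    except UnicodeDecodeError as err:
        print(err)                     # the kind of spurious "error" the user is shown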

> * windows-1252, which appears to be a safe default for a lot of
> content on the web today, and which the HTML5 specification suggests
> as a fallback for UAs trying to parse legacy content

It is not safe at all, since several of its code positions are 
undefined. This means that character data will be pointlessly reported 
as erroneous, even though the validator has no idea what the intended 
encoding was, except that it wasn't what the validator guessed.
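
For the record (again Python, just to show the mapping): the octets 
0x81, 0x8D, 0x8F, 0x90 and 0x9D have no assigned character in 
windows-1252, so any document that uses them in its real encoding gets 
flagged for nothing.

    # These octets are unassigned in windows-1252.
    for octet in (0x81, 0x8D, 0x8F, 0x90, 0x9D):
        try:
            bytes([octet]).decode("windows-1252")
        except UnicodeDecodeError:
            print(hex(octet), "maps to no character in windows-1252")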

Reporting bad data characters is not relevant to markup validation, 
unless the encoding is known so that the report can be _correct_, and 
here it isn't. Therefore, the guess should maximize the probability of 
interpreting octets correctly if they can denote markup-significant 
characters and minimize the probability of reporting anything about any 
other characters (since we cannot know what characters are denoted by 
the octets in the data).

> * iso-8859-1, not because it's a proper encoding for most languages,
> but because it has (unfortunately) been set as default in a number of
> specifications.

The current HTML specifications explicitly abjure such a default, and 
when validating something assumed to be HTML, shouldn't HTML specs trump 
any other spec? This doesn't mean you mustn't make a guess on 
iso-8859-1, but you would need to have some _specific grounds_ for it. 
The relevant argument would be the assumption that it is common. But is 
it? There are lots of documents _declared_ to be in iso-8859-1 but 
actually in windows-1252.

> We can either argue forever on which default is the right one (as
> parts of this thread - and many a sterile discussion before -  have
> shown, alas)

None of them is right, though each for partly different reasons, so it 
is understandable that the discussion would be endless.

> or have implementations try the three.

That's just pointless. Why those three? Why would you take the extra 
trouble of making such a guess, only to reject a document because it 
happens to be in one of the many 8-bit encodings that use octets which 
are malformed in every encoding you tried?

Exactly what is wrong with the idea of assuming that octets 0...127 
(decimal) have their Ascii meanings and that the other octets are data 
characters in some unknown encoding? Not knowing what those data 
characters are does not harm validation at all.
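
As a sketch of that idea (hypothetical code, not a proposed patch): 
decode with any encoding that leaves the Ascii range alone and assigns 
_some_ character to every octet, so that the markup can be parsed, and 
simply refrain from passing judgement on the non-Ascii data characters.

    def decode_for_validation(octets):
        # iso-8859-1 is chosen here only because it is Ascii-transparent
        # and assigns a character to every octet - not as a claim about
        # what the data "really" is.
        return octets.decode("iso-8859-1")

    src = b'<p title="caf\xe9">N\xe4kemiin</p>'
    text = decode_for_validation(src)
    # Every markup-significant character (<, >, =, the quotes, the tag
    # names) is in the Ascii range and is interpreted correctly; the rest
    # are just data characters whose identity we make no claim about.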

When you echo the source in error messages, you cannot know what those 
data characters are, so you should simply leave the encoding 
unspecified. This matches the properties of the document being 
validated. The user's browser will then apply its own default (and 
allow the user to change the encoding).
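
Concretely (a hypothetical sketch of the report output, nothing more): 
send the report without a charset parameter and echo the source octets 
untouched, and the browser's default encoding, plus its encoding menu, 
do the rest.

    original = b"<p>caf\xe9</p>"                # octets as received; encoding unknown
    headers = [("Content-Type", "text/html")]   # deliberately no ;charset=... parameter
    body = b"<h2>Source listing</h2><pre>" + original + b"</pre>"
    # A browser displaying this report applies its own default encoding
    # (and lets the user switch it), just as it would for the document
    # being validated.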

The odds are that the user is validating his own document, or a 
document created in his authoring environment, and the reason he hasn't 
noticed the effects of the missing encoding information is that his 
browser happens to interpret the data in the way that suits the 
document. The same will then happen with the validator's report.

Actually, this makes it _too_ probable that the guess is correct: 
there's the risk that the user misses his mistake! To counter this risk 
of a correct guess, the validator should state the problem loud and 
clear. Well, it sort of does that now, but the information about 
encoding guessing obscures it. (Calling the lack of encoding info a 
"Potential Issue" is misleading.)

The validator should first and foremost say that the document cannot be 
validated and that character encoding information is needed to make 
validation possible. It might be best to stop there, but if you're 
going to make a guess and proceed, then assume just what you _need_ to 
assume, namely the encoding of the markup-significant data.

Jukka K. Korpela ("Yucca")
http://www.cs.tut.fi/~jkorpela/ 

Received on Monday, 28 April 2008 06:05:54 UTC