- From: Jukka K. Korpela <jkorpela@cs.tut.fi>
- Date: Mon, 28 Apr 2008 09:05:24 +0300
- To: "W3C Validator Community" <www-validator@w3.org>
olivier Thereaux wrote:

> On 24-Apr-08, at 11:38 PM, Andreas Prilop wrote:
>> Which kind of patch do you mean?
>> I just ask to change the default from UTF-8 to ISO-8859-1.
>
> In a few years developing various software projects, I have learned to
> be very wary of the word "just". Any occurrence of a suggestion that
> software "just has to do this or that" usually means a lot of
> complexity and difficulty for whoever actually has to implement it.

Usually perhaps, but this particular question is about the default
encoding assumed by the validator. If the code is at least minimally
structured, this should indeed be _implementable_ as a very simple fix,
essentially changing one definition of a parameter or a constant.
Whether it is the _right_ fix is a different issue, but it is still
fair to say "I just ask" in reaction to a request for a patch. As far
as I can see, your comments deal with the desirability of the suggested
fix, not with any technical difficulty in implementing it.

> That said, as the long thread has shown, there are a number of
> candidates for default:
> * utf-8, because it is the future-looking encoding, also appropriate
> for most international content.

Future-looking does not apply here. We are dealing with an _error
condition_: the encoding has not been specified, and you (we) have
decided that the validator should make a guess, as extra comfort to the
user.

> It is also what authors are strongly
> encouraged to use today, and as such, the validator is a tool that
> should favor this practice.

In the given situation, the validator should encourage the user to
specify the encoding, not to use a particular encoding. In practice,
utf-8 is hardly the most common encoding. Moreover, assuming it will
generally result in spurious, misleading error messages whenever the
actual encoding is not utf-8 or compatible with it (like US-Ascii).

> * windows-1252, which appears to be a safe default for a lot of
> content on the web today, and which the HTML5 specification suggests
> as a fallback for UAs trying to parse legacy content

It is not safe at all, since several code positions are undefined (a
small decoding experiment below illustrates this). This pointlessly
means that character data will be reported as erroneous, even though
the validator has no idea what the intended encoding was, except that
it wasn't what the validator guessed. Reporting bad data characters is
not relevant to markup validation unless the encoding is known, so that
the report can be _correct_, and here it isn't. Therefore, the guess
should maximize the probability of interpreting octets correctly when
they can denote markup-significant characters and minimize the
probability of reporting anything about other characters (since we
cannot know what characters the octets in the data denote).

> * iso-8859-1, not because it's a proper encoding for most languages,
> but because it has (unfortunately) been set as default in a number of
> specifications.

The current HTML specifications explicitly abjure such a default, and
when validating something assumed to be HTML, shouldn't HTML specs
trump any other spec? This doesn't mean you mustn't guess iso-8859-1,
but you would need to have some _specific grounds_ for it. The relevant
argument would be the assumption that it is common. But is it? There
are lots of documents _declared_ to be in iso-8859-1 but actually in
windows-1252.
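To make the failure modes concrete, here is a small decoding
experiment, sketched in Python. It is my own illustration, not the
validator's code, and the sample bytes are hypothetical:

    # Illustration only: how each candidate default guess behaves on
    # typical legacy 8-bit data.
    SAMPLE = b"caf\xe9 au lait"              # iso-8859-1/windows-1252 bytes
    CP1252_HOLES = b"\x81\x8d\x8f\x90\x9d"   # undefined in windows-1252

    def try_decode(data, encoding):
        try:
            return repr(data.decode(encoding))
        except UnicodeDecodeError as exc:
            return "error: %s" % exc.reason

    # utf-8 guess: spurious errors on practically any legacy document.
    print(try_decode(SAMPLE, "utf-8"))         # error: invalid continuation byte
    # windows-1252 guess: five code positions are simply undefined.
    print(try_decode(CP1252_HOLES, "cp1252"))  # error: character maps to <undefined>
    # iso-8859-1 guess: never fails, so a wrong guess passes silently.
    print(try_decode(SAMPLE, "iso-8859-1"))    # 'café au lait'

Only the error behaviour differs between the guesses; none of them
tells you what the document actually is.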
> We can either argue forever on which default is the right one (as
> parts of this thread - and many a sterile discussion before - have
> shown, alas)

None of them is right, though partly for different reasons, so it is
understandable that the discussion would be endless.

> or have implementations try the three.

That's just pointless. Why those three? Why would you take extra
trouble to make such a guess and ultimately reject a document just
because it is in one of the many 8-bit encodings that happen to have
characters in positions that make them malformed in every encoding you
tried?

Exactly what is wrong with the idea of assuming octets 0...127 decimal
to have their Ascii meanings and other octets to constitute data
characters in some unknown encoding? Not knowing what those data
characters are does not harm validation at all. When you echo source in
error messages, you cannot know what those data characters are. So you
should simply leave the encoding unspecified. This matches the
properties of the document being validated. The user's browser will
then apply its own default (and allow the user to change the encoding).

The odds are that the user is validating his own document, or a
document created in his authoring environment, and the reason he hasn't
observed the effects of the lack of encoding information is that his
browser interprets the data in the way that suits the document. The
same will then happen to the validator's report. Actually, this makes
it _too_ probable that the guess is correct: there's the risk that the
user misses his mistake!

To deal with this problem of a correct guess, the validator should
express the problem loud and clear. Well, it sort of does that now, but
the information about encoding guessing obscures this. (Calling the
lack of encoding info a "Potential Issue" is misleading.) The validator
should first and foremost say that the document cannot be validated and
that character encoding information is needed to make validation
possible. It might be best to stop there, but if you're going to make a
guess and proceed, then assume just what you _need_ to assume, namely
the encoding of markup-significant data (see the sketch in the
postscript below).

Jukka K. Korpela ("Yucca")
http://www.cs.tut.fi/~jkorpela/
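P.S. For concreteness, here is a rough sketch of the "Ascii meanings
for octets 0...127, opaque data elsewhere" approach, in Python. It is
purely illustrative and of my own making; the function name and
interface are invented, and this is not how the validator is actually
implemented:

    # Split a byte stream into Ascii chunks, which may carry markup,
    # and opaque non-Ascii chunks, whose encoding stays unknown.
    def split_markup_safe(data):
        buf, ascii_mode = bytearray(), True
        for b in data:
            if (b < 0x80) != ascii_mode:
                if buf:
                    yield ascii_mode, bytes(buf)
                buf, ascii_mode = bytearray(), (b < 0x80)
            buf.append(b)
        if buf:
            yield ascii_mode, bytes(buf)

    for is_ascii, chunk in split_markup_safe(b"<p>caf\xe9</p>"):
        # Ascii chunks can be parsed as markup; the rest is echoed
        # back untouched, with no encoding claimed.
        print(is_ascii, chunk.decode("ascii") if is_ascii else chunk)

A validator working along these lines could check the markup in the
Ascii chunks and echo the other octets back exactly as they are,
leaving their interpretation to the user's browser, precisely because
no encoding has been declared.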
Received on Monday, 28 April 2008 06:05:54 UTC