- From: Jukka K. Korpela <jkorpela@cs.tut.fi>
- Date: Thu, 8 Dec 2005 19:58:06 +0200 (EET)
- To: www-validator@w3.org
On Thu, 8 Dec 2005, Andreas Prilop wrote:

> Why does the validator assume UTF-8 in the first place?

I would say that it is within its rights as a user agent when it does that, though the decision is not a good one in practice:

"To sum up, conforming user agents must observe the following priorities when determining a document's character encoding (from highest priority to lowest):

1. An HTTP "charset" parameter in a "Content-Type" field.
2. A META declaration with "http-equiv" set to "Content-Type" and a value set for "charset".
3. The charset attribute set on an element that designates an external resource.

In addition to this list of priorities, the user agent may use heuristics and user settings. For example, many user agents use a heuristic to distinguish the various encodings used for Japanese text. Also, user agents typically have a user-definable, local default character encoding which they apply in the absence of other indicators."

Source: http://www.w3.org/TR/REC-html40/charset.html#h-5.2.2

Apparently the validator uses UTF-8 as the implied default. The choice is impractical, though, since the real encoding is most probably not UTF-8 for HTML 4.01 documents. When no encoding is declared, the real encoding is probably some 8-bit encoding; ISO-8859-1 was once defined as the overall default (though it no longer is, and the de facto default was, and still largely is, windows-1252).

> IMHO it would be more helpful
> (a) to say "No Character Encoding Found!" or
> (b) to take a charset that fits (in this case ISO-8859-1).

or (c) to imply windows-1256 (Windows Arabic), because it has all code positions (00 to FF hexadecimal) assigned, so there would be no spurious error messages about "undefined characters". When the source is echoed with windows-1256 declared, its appearance would make it rather obvious to the user that something is wrong at the character level, in a typical case (assuming that some characters > 7F are used and the real encoding does not happen to be windows-1256).

If the validator implies ISO-8859-1 or windows-1252, for example, there would often be messages about non-SGML characters (if the validator works properly with the encoding), and these messages might be rather misleading: they would report some octets as errors and others not, rather arbitrarily. The logical alternative is to imply US-ASCII and report all non-ASCII octets as errors (undefined characters), but that is less practical.

Probably (a) would be better, though the message could be formulated in a more adequate and balanced way, e.g. "The character encoding was not declared for the document. Therefore, no validation was performed. Please declare the encoding as described at [insert suitable link here] and try again."

I would not recommend using the 'Encoding' menu. People may find it and use it, but it is not the proper way in any normal situation. Even if the user cannot (or does not know how to) affect HTTP headers, he can use a <meta> tag (a minimal example is sketched below), which helps in actual use of the document instead of just helping to "pass validation" without solving the problem. (People may have documents with undeclared encoding for testing purposes, but such people can be expected to know what they are doing and to find the 'Encoding' menu all by themselves if they find it useful.)

-- 
Jukka "Yucca" Korpela, http://www.cs.tut.fi/~jkorpela/
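For illustration, a minimal sketch of such a <meta> declaration in an HTML 4.01 document. The charset value ISO-8859-1 here is an assumption for the example and should name whatever encoding the file is actually saved in; where the author can influence the server configuration, the same charset parameter belongs in the HTTP Content-Type header as well.

  <!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN"
      "http://www.w3.org/TR/html4/strict.dtd">
  <html>
  <head>
  <!-- Ideally the server also sends the header:
       Content-Type: text/html; charset=ISO-8859-1 -->
  <meta http-equiv="Content-Type" content="text/html; charset=ISO-8859-1">
  <title>Encoding declaration example</title>
  </head>
  <body>
  <p>Text in the declared encoding, e.g. "Häkkinen".</p>
  </body>
  </html>

With such a declaration in place, the validator (like any other user agent) finds the encoding by rule 2 of the priority list quoted above, and no guessing or 'Encoding' menu override is needed.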
Received on Thursday, 8 December 2005 17:58:24 UTC