Re: Document without charset

On Thu, 8 Dec 2005, Andreas Prilop wrote:

> Why does the validator assume UTF-8 in the first place?

I would say that it is within its rights as a user agent when it does 
that, though the decision is not a good one in practice:

"To sum up, conforming user agents must observe the following priorities 
when determining a document's character encoding (from highest priority to 
lowest):

1. An HTTP "charset" parameter in a "Content-Type" field.
2. A META declaration with "http-equiv" set to "Content-Type" and a value 
set for "charset".
3. The charset attribute set on an element that designates an external 
resource.
In addition to this list of priorities, the user agent may use heuristics 
and user settings. For example, many user agents use a heuristic to 
distinguish the various encodings used for Japanese text. Also, user 
agents typically have a user-definable, local default character encoding 
which they apply in the absence of other indicators."

Source: http://www.w3.org/TR/REC-html40/charset.html#h-5.2.2

Apparently the validator uses UTF-8 as the implied default.
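
To make the priority order concrete, here is a rough sketch in Python 
(pick_charset is my own illustrative function with deliberately naive 
parsing; it is not the validator's actual code, and step 3 is omitted):

import re

def pick_charset(http_content_type=None, html_source=''):
    """Illustrative only: the HTML 4.01 priority order, parsed naively."""
    # 1. An HTTP "charset" parameter in a "Content-Type" field
    if http_content_type:
        m = re.search(r'charset\s*=\s*([\w.-]+)', http_content_type, re.I)
        if m:
            return m.group(1)
    # 2. A META declaration with http-equiv="Content-Type" and a charset value
    m = re.search(r'<meta[^>]*http-equiv[^>]*content-type[^>]*'
                  r'charset\s*=\s*([\w.-]+)', html_source, re.I)
    if m:
        return m.group(1)
    # 3. A charset attribute on an element designating an external resource
    #    (e.g. <a charset="...">) -- omitted here for brevity.
    # Beyond that: heuristics or a local default; the validator apparently
    # falls back to UTF-8 at this point.
    return 'UTF-8'

print(pick_charset('text/html; charset=ISO-8859-1'))
print(pick_charset(None, '<meta http-equiv="Content-Type" '
                         'content="text/html; charset=windows-1252">'))
print(pick_charset())   # nothing declared: UTF-8 implied

With a real header such as "Content-Type: text/html; charset=ISO-8859-1", 
step 1 settles the matter before the markup is even looked at.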

The choice is impractical, though, since for HTML 4.01 documents the real 
encoding is most probably not UTF-8. When no encoding is declared, the 
real encoding is probably some 8-bit encoding; ISO-8859-1 was once defined 
as the overall default (though it no longer is, and the de facto default 
has been, and still largely is, windows-1252).

> IMHO it would be more helpful
> (a) to say "No Character Encoding Found!"  or
> (b) to take a charset that fits (in this case ISO-8859-1).

or
(c) to imply windows-1256 (Windows Arabic), because it has all code
     positions (00 to FF hex.) assigned, so there will be no spurious
     error messages about some "undefined characters".

If the validator echoed the source with windows-1256 declared, the 
appearance would make it rather obvious to the user that something is 
wrong at the character level, at least in the typical case (assuming that 
some characters > 7F are used and the real encoding does not happen to be 
windows-1256).
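
The full-assignment point is easy to check, again using Python's codecs 
purely as an illustration: every octet from 00 to FF decodes under 
windows-1256, whereas windows-1252, for instance, leaves a handful of 
positions unassigned.

all_octets = bytes(range(256))

# windows-1256 assigns every code position, so this never fails:
print(len(all_octets.decode('windows-1256')))   # 256, no "undefined characters"

# windows-1252 leaves 81, 8D, 8E, 8F and 9D (hex) unassigned:
try:
    all_octets.decode('windows-1252')
except UnicodeDecodeError as err:
    print('undefined octet:', hex(all_octets[err.start]))   # 0x81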

If the validator implied ISO-8859-1 or windows-1252, for example, there 
would often be messages about non-SGML characters (assuming the validator 
handles the encoding properly), and these messages might be rather 
misleading: they would report some octets as errors and others not, 
rather arbitrarily.

The logical alternative is to imply US-ASCII and report all non-ASCII 
octets as errors (undefined characters), but it's less practical.
Probably (a) would be better, though the message could be formulated
in a more adequate and balanced way, e.g.

"The character encoding was not declared for the document.
Therefore, no validation was performed. Please declare the encoding as
described at [insert suitable link here] and try again."

I would not recommend using the 'Encoding' menu. People may find it and 
use it, but it's not the proper way in any normal situation. Even if the 
user cannot (or does not know how to) affect HTTP headers, he can
use a <meta> tag, which helps in actual use of the document, instead
of just helping to "pass validation" without solving the problem.
(People may have documents with undeclared encoding for testing purposes, 
but such people can be expected to know what they are doing and find the
'Encoding' menu all by themselves if they find it useful.)

-- 
Jukka "Yucca" Korpela, http://www.cs.tut.fi/~jkorpela/
