- From: Jukka K. Korpela <jkorpela@cs.tut.fi>
- Date: Mon, 7 May 2007 22:59:07 +0300 (EEST)
- To: www-validator@w3.org
On Mon, 7 May 2007, olivier Thereaux wrote:

> I'm curious, why windows-1252? How would this platform-dependent charset be
> more appropriate as a fallback than the universal unicode?

First a brief history: The HTML 2.0 specification required that the ISO-8859-1 encoding be supported by user agents. There was no requirement to support any other encoding. Moreover, ISO-8859-1 is the default encoding in HTTP.

The HTML 2.0 spec was somewhat vague. It allowed the use of a charset parameter for the text/html type, with the following note:

   The default value is outside the scope of this specification; but
   for example, the default is `US-ASCII' in the context of MIME mail,
   and `ISO-8859-1' in the context of HTTP [HTTP].

HTML 3.2 did not address the issue. HTML 4.0 explicitly denied any default, saying:

   "The HTTP protocol ([RFC2068], section 3.7.1) mentions ISO-8859-1 as
   a default character encoding when the "charset" parameter is absent
   from the "Content-Type" header field. In practice, this
   recommendation has proved useless because some servers don't allow a
   "charset" parameter to be sent, and others may not be configured to
   send the parameter. Therefore, user agents must not assume any
   default value for the "charset" parameter."

It then prescribes how to determine the encoding, and finally adds: "In addition to this list of priorities, the user agent may use heuristics and user settings." Sounds vague, yes. But this was re-confirmed in HTML 4.01.

What this boils down to is that if the encoding has not been specified, the user agent should make an educated guess, but in practice browsers just use whatever happens to be set as the default in their settings. That default is much more probably ISO-8859-1 than UTF-8.

In practice, documents that fail to declare their encoding mostly use windows-1252. The reason is that this encoding has been the de facto default on the Web for over a decade.

If you are a web browser and you think (on the basis of a charset declaration, your settings, or even an educated guess) that the ISO-8859-1 encoding is to be used to interpret a document, what will you do when you encounter characters in the range 80..9F? Right, you interpret them as windows-1252, often by doing nothing special - you just treat them as 8-bit quantities, and your libraries and environment often handle them automatically that way. If they don't, you should take care of it yourself, since you will then handle many documents the way the author meant, and you lose nothing (except potential error detection and reporting, but users don't really want to see messages like "octet 80 encountered in a document declared to be ISO-8859-1").

Using ISO-8859-1 as the default would be almost as good as windows-1252, but using the latter also handles documents that use the code positions 80..9F, and it does not affect the interpretation of ISO-8859-1 encoded documents at all.

Using UTF-8 as the default implies that in most cases, if the document contains octets outside the ASCII range, they will be reported by the validator as data errors (malformed UTF-8 data). The reason is that in the vast majority of cases, the real encoding is windows-1252 or some other 8-bit encoding.

There is real confusion emerging these days, since people mix ISO-8859-1 and UTF-8 data, e.g. by joining data from different sources in different encodings. In such situations, using UTF-8 as the default would help to detect the problem in validation, because it would much more often result in data errors.
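To make the difference concrete, here is a minimal sketch (in Python, using only its standard codecs; the byte string is an invented illustration, not anything from the validator) of how the same undeclared octets fare under the three candidate defaults:

    # Typical bytes from a windows-1252 document: 0x93 and 0x94 are the
    # curly double quotes.
    data = b'He said \x93hello\x94 to me.'

    # windows-1252: yields the smart quotes the author meant.
    print(data.decode('windows-1252'))    # He said “hello” to me.

    # ISO-8859-1: never fails, but maps 80..9F to invisible C1 control
    # characters, so the quotes are silently lost.
    print(repr(data.decode('iso-8859-1')))

    # UTF-8: the same octets are malformed, so a UTF-8 default turns an
    # undeclared 8-bit encoding into a hard, reportable data error.
    try:
        data.decode('utf-8')
    except UnicodeDecodeError as err:
        print('malformed UTF-8:', err)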
But I don't think this alone justifies the use of UTF-8 as the default.

-- 
Jukka "Yucca" Korpela, http://www.cs.tut.fi/~jkorpela/
Received on Monday, 7 May 2007 19:59:10 UTC