- From: Jukka K. Korpela <jkorpela@cs.tut.fi>
- Date: Mon, 7 May 2007 22:59:07 +0300 (EEST)
- To: www-validator@w3.org
On Mon, 7 May 2007, olivier Thereaux wrote:

> I'm curious, why windows-1252? How would this platform-dependent charset be
> more appropriate as a fallback than the universal unicode?

First a brief history: The HTML 2.0 specification required that the ISO-8859-1 encoding be supported by user agents. There was no requirement to support any other encoding. Moreover, ISO-8859-1 is the default encoding in HTTP.

The HTML 2.0 spec was somewhat vague. It allowed the use of a charset parameter for the text/html type, with the following note:

   The default value is outside the scope of this specification; but
   for example, the default is `US-ASCII' in the context of MIME mail,
   and `ISO-8859-1' in the context of HTTP [HTTP].

HTML 3.2 did not address the issue. HTML 4.0 explicitly denied any default, saying:

   "The HTTP protocol ([RFC2068], section 3.7.1) mentions ISO-8859-1 as
   a default character encoding when the "charset" parameter is absent
   from the "Content-Type" header field. In practice, this
   recommendation has proved useless because some servers don't allow a
   "charset" parameter to be sent, and others may not be configured to
   send the parameter. Therefore, user agents must not assume any
   default value for the "charset" parameter."

It then prescribes how to determine the encoding, and finally adds: "In addition to this list of priorities, the user agent may use heuristics and user settings." Sounds vague, yes. But this was re-confirmed in HTML 4.01.

What this boils down to is that if the encoding has not been specified, the user agent should make an educated guess, but in practice browsers just use whatever happens to be set as the default in their settings. That default is much more probably ISO-8859-1 than UTF-8.

In practice, documents that fail to declare their encoding mostly use windows-1252. The reason is that this encoding has been the de facto default on the Web for over a decade.

If you are a web browser and you think (on the basis of a charset declaration, your settings, or even an educated guess) that the ISO-8859-1 encoding is to be used to interpret a document, what will you do when you encounter characters in the range 80..9F? Right, you interpret them as windows-1252, often by doing nothing special - you just treat them as 8-bit quantities, and your libraries and environment often handle them automatically that way. If they don't, you should take care of it yourself, since you will then handle many documents the way the author meant, and you lose nothing (except potential error detection and reporting, but users don't really want to see messages like "octet 80 encountered in a document declared to be ISO-8859-1").

Using ISO-8859-1 as the default would be almost as good as windows-1252, but using the latter also handles documents that use the code positions 80..9F, and it does not affect the interpretation of ISO-8859-1 encoded documents at all.

Using UTF-8 as the default implies that in most cases, if the document contains octets outside the ASCII range, they will be reported by the validator as data errors (malformed UTF-8 data). The reason is that in the vast majority of cases, the real encoding is windows-1252 or some other 8-bit encoding.

There is real confusion emerging these days, since people mix ISO-8859-1 and UTF-8 data, e.g. by joining data from different sources in different encodings. In such situations, using UTF-8 as the default would help to detect the problem in validation, because it would much more often result in data errors.
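To make the difference concrete, here is a minimal sketch (in Python, using only its standard codecs; the byte string is an invented illustration, not anything from the validator) of how the same undeclared octets fare under the three candidate defaults:

    # Typical bytes from a windows-1252 document: 0x93 and 0x94 are the
    # curly double quotes.
    data = b'He said \x93hello\x94 to me.'

    # windows-1252: yields the smart quotes the author meant.
    print(data.decode('windows-1252'))    # He said “hello” to me.

    # ISO-8859-1: never fails, but maps 80..9F to invisible C1 control
    # characters, so the quotes are silently lost.
    print(repr(data.decode('iso-8859-1')))

    # UTF-8: the same octets are malformed, so a UTF-8 default turns an
    # undeclared 8-bit encoding into a hard, reportable data error.
    try:
        data.decode('utf-8')
    except UnicodeDecodeError as err:
        print('malformed UTF-8:', err)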
But I don't think this alone justifies the use of UTF-8 as the default.

-- 
Jukka "Yucca" Korpela, http://www.cs.tut.fi/~jkorpela/
Received on Monday, 7 May 2007 19:59:10 UTC