- From: David Woolley <david@djwhome.demon.co.uk>
- Date: Sat, 21 Feb 2004 11:41:07 +0000 (GMT)
- To: www-style@w3.org
> determined as ISO-8859-1 (HTTP) or US-ASCII (MIME) (and in fact, a
> processor that chooses to adhere to CSS must violate HTTP/MIME...)

In reality, an unspecified character set in HTTP, for HTML, has meant
Windows-1252 in the USA and Big5 in Taiwan for a very long time (even
iso-8859-1 in a meta element can mean this, and, in the Taiwan case,
windows-1252 in a meta element probably also has this meaning[1]).

HTML 4.01 overrides HTTP and says that no inference should be drawn
from the lack of an HTTP character set specification; this has more to
do with authors' non-use of HTTP metadata, for reasons discussed in
another article.

>
> > 3) If neither the header nor looking for U+FEFF or @charset yield an
> > encoding, but this style sheet was loaded because a document
>
> I am strictly opposed to this rule, it is confusing, it is inconsistent
> with other specification, it is /not implementable/, and it yields in
> inconsistent results.

Without this rule, the vast majority of documents that don't have
naturally UTF-8 compatible style sheets will become invalid. No browser
developer interested in a non-US market can sensibly reject such
documents. The only mitigating factors are that, at least outside the
USA and Western Europe, web page authors tend to use English for IDs,
classes, and even URLs; that content: is implemented only in minority
browsers and is not well known; and that misinterpreting personal names
in comments doesn't cause real problems for browsers.

> .björn { color: white }

This case is resolved by assuming the same character set as the
referring document; it doesn't actually matter if that character set is
wrong, for fixed-length, ASCII-compatible character sets, and this case
probably recovers even for UTF-8.
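To illustrate why a wrong but shared guess is harmless for the
.björn case, here is a minimal Python sketch. The byte strings, the
helper name names_match, and the particular wrong guesses (iso-8859-5,
cp1252) are assumptions made for the demonstration; they do not
describe how any particular browser is implemented.

```python
# Hypothetical scenario: the author saved both the HTML page and the
# style sheet as ISO-8859-1 but declared no character set anywhere.
NAME = "bj\u00f6rn"  # "björn", written with an escape to keep this file ASCII

html_class_bytes = NAME.encode("iso-8859-1")
css_selector_bytes = NAME.encode("iso-8859-1")


def names_match(html_bytes, css_bytes, html_charset, css_charset):
    """Decode both identifiers with the guessed charsets and compare them."""
    return html_bytes.decode(html_charset) == css_bytes.decode(css_charset)


# Same wrong guess for both documents: the decoded text is wrong in the
# same way on both sides, so the selector still matches the class.
same_wrong_guess = names_match(
    html_class_bytes, css_selector_bytes, "iso-8859-5", "iso-8859-5"
)

# Different guesses for the two documents: the selector no longer matches.
mixed_guess = names_match(
    html_class_bytes, css_selector_bytes, "iso-8859-1", "iso-8859-5"
)

# UTF-8 bytes misread as a single-byte charset on both sides also match,
# because each byte is mapped consistently.
utf8_bytes = NAME.encode("utf-8")
utf8_as_cp1252 = names_match(utf8_bytes, utf8_bytes, "cp1252", "cp1252")

print(same_wrong_guess, mixed_guess, utf8_as_cp1252)
```

The point of the sketch is only that matching depends on both sides
being decoded the same way, not on the guess being correct.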
> .bj\0000f6rn { background-color: black }

This case was written by an I18N-aware user, and there is a relatively
good chance that they did identify the character sets explicitly,
although legacy considerations mean that they may still not have
@charset, and HTTP metadata considerations mean that the HTTP header
doesn't specify it.

Things will eventually change, but unlike the situation where
Scandinavian email used to use the local variant of ISO 646, even
though the email standards said it could only be ASCII, web pages with
useful content can have great longevity. (The move to MIME for
Scandinavian email was the result of newer, US-authored email clients
being MIME based and using Windows-1252 or ISO 8859/1, rather than any
desire on the part of the Scandinavians to abandon their old character
code variant.)

[1] For non-style-sheet reasons, I was trying to find an example of a
Gujarati (44,000,000 speakers) page that would display out of the box
on Windows XP and didn't use downloaded (misrepresented) glyph sets;
Windows XP has a Unicode and ISCII coded Gujarati font. I've so far
failed, but one page had windows-1252 on the frameset page[2] and both
x-user-defined and windows-1252 (two meta elements) on the frame page,
and presumably used a glyph set font. The presence of windows-1252 was
clearly the result of using Front Page, which used to use Windows-1252
undeclared, but now inserts it in a way that unsophisticated i18n users
don't understand how to change. One page used macintosh, which I'm sure
is a misrepresentation of a very specific Apple-defined character set.
Another used x-user-defined and included IE format downloadable fonts;
again, presumably glyph sets. I may have hit a pure windows-1252
misrepresentation. I'm still looking for a page encoded according to
Unicode or a national standard.
[2] This is another reason for not using frames; browsers normally only provide character set overrides and feedback for the page as a whole, even if correctly specified character sets can differ between frames.
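
As a side note on the escaped selector quoted earlier
(.bj\0000f6rn): because the escape consists only of ASCII characters,
it decodes to the same text under any ASCII-compatible character set.
The following Python sketch illustrates this. The function name
unescape_css_ident and the list of encodings are assumptions for the
demonstration, and the regular expression is a simplification of CSS
escape handling (hex escapes only), not a complete tokenizer.

```python
import re

# A CSS hex escape: a backslash, one to six hex digits, and an optional
# single whitespace character that terminates the escape.
_HEX_ESCAPE = re.compile(r"\\([0-9a-fA-F]{1,6})[ \t\r\n\f]?")


def unescape_css_ident(text):
    """Replace CSS hex escapes in text with the characters they denote."""
    return _HEX_ESCAPE.sub(lambda m: chr(int(m.group(1), 16)), text)


# The selector from the quoted example, as raw bytes (all ASCII).
raw = b".bj\\0000f6rn"

# Decode with several ASCII-compatible guesses, then resolve the escape.
decoded = {
    enc: unescape_css_ident(raw.decode(enc))
    for enc in ("ascii", "iso-8859-1", "cp1252", "utf-8", "big5")
}

for enc, value in decoded.items():
    print(enc, value)
```

Whichever of these character sets is guessed, the selector resolves to
the same identifier, which is why the escaped form does not depend on
the fallback rule under discussion.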
Received on Saturday, 21 February 2004 06:55:57 UTC