Re: [CSS21] response to issue 115 (and 44) from David Woolley on 2004-02-21 (www-style@w3.org from February 2004)

From: David Woolley <david@djwhome.demon.co.uk>
Date: Sat, 21 Feb 2004 11:41:07 +0000 (GMT)
To: www-style@w3.org
Message-Id: <200402211141.i1LBf7903726@djwhome.demon.co.uk>
> determined as ISO-8859-1 (HTTP) or US-ASCII (MIME) (and in fact, a
> processor that chooses to adhere to CSS must violate HTTP/MIME...)

In reality, an unspecified character set in HTTP, for HTML, has meant
Windows 1252 in the USA and Big5 in Taiwan, for a very long time (and
even iso-8859-1 in a meta element can mean this, and in the Taiwan
case, Windows-1252 in a meta element probably also has this meaning[1]).
HTML 4.01 overrides HTTP and says that no inference should be
drawn from the lack of an HTTP character set specification; this is
more to do with the non-use of HTTP metadata by authors for reasons
discussed in another article.

> 
> > 3) If neither the header nor looking for U+FEFF or @charset yield an
> >    encoding, but this style sheet was loaded because a document
> 
> I am strictly opposed to this rule, it is confusing, it is inconsistent
> with other specification, it is /not implementable/, and it yields in
> inconsistent results. 

Without this rule, the vast majority of documents that don't have
naturally UTF-8 compatible style sheets will become invalid.  No browser
developer interested in a non-US market can sensibly reject such
documents.  The only mitigating factors are that, at least outside
USA and Western Europe, web page authors tend to use English for
IDs, classes, and even URLs; that content: is only implemented in
minority browsers and is not well known; and that misinterpreting
personal names in comments doesn't cause real problems for browsers.
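
For concreteness, the fallback order under discussion can be sketched
roughly as below.  This is only an illustration under stated assumptions:
the function name, the labels it returns, and the simplified @charset
pattern (which only works for ASCII-compatible encodings) are all
hypothetical, not text from any specification.

```python
import codecs
import re

# Simplified, hypothetical pattern: only recognises @charset when the
# style sheet bytes are ASCII-compatible at the very start of the file.
_CHARSET_RULE = re.compile(rb'@charset "([A-Za-z0-9._:-]+)";')


def sniff_stylesheet_encoding(raw, http_charset=None, referrer_charset=None):
    """Return (encoding, source) for a style sheet, trying in order:
    the HTTP header, a UTF-8 BOM (U+FEFF), an @charset rule, and finally
    the character set of the referring document."""
    if http_charset:
        return http_charset, "http"
    if raw.startswith(codecs.BOM_UTF8):
        return "utf-8", "bom"
    match = _CHARSET_RULE.match(raw)
    if match:
        return match.group(1).decode("ascii"), "@charset"
    if referrer_charset:
        # The contested rule: inherit from the document that linked us.
        return referrer_charset, "referrer"
    # Nothing to go on; what happens here is left open in this sketch.
    return None, "unknown"


print(sniff_stylesheet_encoding(b".a { color: red }", referrer_charset="big5"))
```

Without the "referrer" step, an undeclared Big5 or Windows-1252 style
sheet falls straight through to the last branch.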

>       .björn { color: white }

This case is resolved by assuming the same character set as the
referring document; for fixed-length, ASCII-compatible character sets
it doesn't actually matter if that character set is wrong, and this
case probably recovers even for UTF-8.
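
A small sketch of why a wrong guess is harmless here, assuming the
document and the style sheet really are ISO-8859-1 and the browser
decodes both with the same (possibly wrong) character set.  The helper
names and the tiny regular expressions are hypothetical and stand in
for a real parser.

```python
import re

TRUE_ENCODING = "iso-8859-1"
CSS_BYTES = ".björn { color: white }".encode(TRUE_ENCODING)
HTML_BYTES = '<p class="björn">x</p>'.encode(TRUE_ENCODING)


def class_in_css(css_bytes, charset):
    # The ASCII syntax characters ('.', '{') survive any ASCII-compatible
    # decoding, so the selector can still be located.
    text = css_bytes.decode(charset, errors="replace")
    return re.search(r"\.([^\s{]+)", text).group(1)


def class_in_html(html_bytes, charset):
    text = html_bytes.decode(charset, errors="replace")
    return re.search(r'class="([^"]+)"', text).group(1)


def selector_matches(assumed):
    """Decode both resources with the same assumed charset and compare."""
    return class_in_css(CSS_BYTES, assumed) == class_in_html(HTML_BYTES, assumed)


for guess in ("iso-8859-1", "koi8-r", "utf-8"):
    print(guess, repr(class_in_css(CSS_BYTES, guess)), selector_matches(guess))
```

With the KOI8-R guess the class name is mis-decoded, but identically on
both sides, so the selector still matches.  With the UTF-8 guess the
stray byte becomes U+FFFD on both sides; that also matches, though
distinct names could collide, which is one reason to say only
"probably recovers".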

>       .bj\0000f6rn { background-color: black }

This case was written by an I18N-aware user, and there is a relatively
good chance that they did identify the character set explicitly, although
legacy considerations mean that they may still not have @charset, and
HTTP metadata considerations mean that the HTTP header may not specify it.
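
The escaped form is pure ASCII, so it decodes the same way under any
ASCII-compatible character set.  A minimal sketch of resolving such an
escape follows; the function name is hypothetical, and treating CR LF as
a single terminating whitespace is deliberately left out.

```python
import re

# Backslash, one to six hex digits, then at most one optional whitespace
# character that terminates the escape (simplified from CSS 2.1).
_HEX_ESCAPE = re.compile(r"\\([0-9a-fA-F]{1,6})[ \t\r\n\f]?")


def unescape_css_ident(ident):
    """Replace hexadecimal escapes in an identifier with the characters
    they denote."""
    return _HEX_ESCAPE.sub(lambda m: chr(int(m.group(1), 16)), ident)


print(unescape_css_ident(r"bj\0000f6rn"))
```

Under this sketch, the selector above names the same class as the
unescaped .björn, whatever encoding the style sheet is read in.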

Things will eventually change, but unlike the situation where Scandinavian
email used to use the local variant of ISO 646, even though the email
standards said it could only be ASCII, web pages with useful content can
have great longevity.  (The move to MIME for Scandinavian email was the
result of newer, US authored, email clients being MIME based and using
Windows 1252 or ISO 8859/1, rather than any desire on the part of the
Scandinavians to abandon their old character code variant.)

[1] for non-style sheet reasons, I was trying to find an example of a
Gujarati (44,000,000 speakers) page that would display out of the box
on Windows XP and didn't use downloaded (mis-represented) glyph sets;
Windows XP has a Unicode and ISCII coded Gujarati font.

I've so far failed, but one had windows-1252 on the frameset page[2]
and both x-user-defined and windows-1252 (two meta elements) on the frame
page, and presumably used a glyph set font.  The presence of windows-1252
was clearly the result of using Front Page, which used to use Windows-1252
undeclared, but now inserts the declaration in a way that unsophisticated
i18n users don't understand how to change.

One used macintosh, which I'm sure is a misrepresentation of a very
specific Apple defined character set.

Another used x-user-defined and included IE format downloadable fonts;
again presumably glyph sets.

I may have hit a pure windows-1252 misrepresentation.

I'm still looking for a page encoded according to Unicode or a national
standard.

[2] This is another reason for not using frames; browsers normally only
provide character set overrides and feedback for the page as a whole,
even if correctly specified character sets can differ between frames.
Received on Saturday, 21 February 2004 06:55:57 UTC