- From: Henri Sivonen <hsivonen@iki.fi>
- Date: Wed, 25 Feb 2004 18:24:07 +0200
- To: Jungshik Shin <jshin@i18nl10n.com>
- Cc: www-style@w3.org
On Feb 24, 2004, at 10:25, Jungshik Shin wrote:

> On Tue, 24 Feb 2004, Henri Sivonen wrote:
>
>> On Feb 23, 2004, at 22:27, Jungshik Shin wrote:
>>
>>>>>> 4) If all else fails, assume UTF-8.
>
>> comments contain non-ASCII bytes that don't form valid UTF-8
>> sequences, the CSS spec needs to require either a recovering UTF-8
>> decoder or a default encoding that otherwise makes all byte streams
>> valid.
>
> Note that '#4' was the last resort. Assuming the character
> encoding of linking documents usually works (when stylesheets are
> associated with html/xml documents).

Except it doesn't work when the content and the style sheet come from different workflows, which is likely to happen in Europe when a content management system uses UTF-8 but the style sheets are authored in legacy text editors and contain non-ASCII in comments such as copyright notices.

The case with opera.com and Mozilla illustrates that it is unsafe to *guess* UTF-8 (even from a linking document) and use a draconian UTF-8 decoder. (I don't think using a draconian UTF-8 decoder is a problem when the encoding has been declared explicitly or there is a UTF-8 BOM.)

Considering that effort has recently been expended in order to make every character stream parseable in a predictable way by a CSS parser, it would seem illogical to mandate a heuristic that is likely to render some real-world *byte* streams unparseable.

I think there are two ways around the problem:

1) Requiring the use of a recovering UTF-8 decoder. Security reasons are usually cited for not doing this; I don't know what those particular reasons are.

2) Specifying that in the absence of character encoding information, ASCII must be taken as ASCII and bytes with the high bit set must map to some placeholder. The situation with MIME suggests that if a spec says ASCII must be assumed, implementors assume windows-1252 anyway in the hope that the guess is sometimes right. Mapping all the high-bit bytes to a placeholder, on the other hand, is guaranteed to be the wrong guess.

Anyway, even if we forget the suggestion of defaulting to windows-1252, the problem remains that it is unsafe both to *guess* an encoding that has interbyte dependencies and to enforce those interbyte dependencies by rejecting the style sheet, if the goal is to be able to process the non-bogotic parts of semi-bogotic style sheets (which I understand is the goal of making all *character* streams parseable in a predictable way beyond the original forward-compatible parsing rules).
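To make the difference concrete, here is a minimal Python sketch (illustrative only; the sample style sheet bytes are made up, and this is not the behavior of any particular implementation) comparing a draconian UTF-8 decoder with the two options above:

    # Made-up example: 0xA9 is a windows-1252 copyright sign and is not
    # valid UTF-8 on its own.
    sheet = b"/* \xa9 2004 Example Oy */ body { color: red }"

    # Draconian UTF-8 decoding: the whole style sheet is rejected.
    try:
        sheet.decode("utf-8")
    except UnicodeDecodeError:
        print("style sheet dropped")

    # Option 1, a recovering UTF-8 decoder: the invalid sequence becomes
    # U+FFFD, the comment still tokenizes and is discarded, the rule survives.
    print(sheet.decode("utf-8", errors="replace"))

    # Option 2, ASCII with a placeholder for high-bit bytes: same effect on
    # this sheet, but guaranteed to mangle intentional non-ASCII elsewhere.
    print("".join(chr(b) if b < 0x80 else "\ufffd" for b in sheet))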
>>> True in Western Europe and most other parts of the world. Not true in
>>> Japan, China and Korea. I'm not talking about comments here. A number
>>> of stylesheets list font-family names in Chinese, Japanese and Korean
>>> in legacy encodings (GB2312, Big5, Shift_JIS, EUC-JP, EUC-KR, etc).
>>
>> So why on earth don't they label their style sheets with the
>> appropriate character encoding label? The UTF-8 default guess does not
>> help at all with GB2312, Big5, Shift_JIS, EUC-JP, EUC-KR, etc.
>
> As already pointed out by others, for exactly the same reason as
> many Western European stylesheets are not properly tagged as in
> 'ISO-8859-1' or 'Windows-1252' even though they have non-ASCII
> characters, although in comments.

It's not the same. Using non-ASCII in a non-comment part of the style sheet without a character encoding label and expecting things to magically sort themselves out is significantly more unreasonable than expecting comments to be gracefully discarded even when they contain non-ASCII. After all, the latter works if the parser assumes any superset of ASCII that does not have interbyte dependencies.
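For instance (again only an illustrative Python sketch with made-up bytes), ISO-8859-1 is one such superset: it has no interbyte dependencies and assigns a character to every byte value, so decoding never fails and the misread comment still gets discarded by the tokenizer:

    # Made-up bytes: a comment in some legacy multi-byte encoding.
    sheet = b"/* \xb0\xa1 */ p { margin: 0 }"
    # ISO-8859-1 covers all 256 byte values with no interbyte dependencies,
    # so this never raises; the comment is mojibake but still discardable.
    print(sheet.decode("iso-8859-1"))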
-- 
Henri Sivonen
hsivonen@iki.fi
http://iki.fi/hsivonen/

Received on Wednesday, 25 February 2004 11:25:17 UTC