Re: [CSS21] response to issue 115 (and 44)

On Feb 24, 2004, at 10:25, Jungshik Shin wrote:

> On Tue, 24 Feb 2004, Henri Sivonen wrote:
>
>> On Feb 23, 2004, at 22:27, Jungshik Shin wrote:
>>
>>>>>>  4) If all else fails, assume UTF-8.
>
>> comments contain non-ASCII bytes that don't form valid UTF-8 
>> sequences,
>> the CSS spec needs to require either a recovering UTF-8 decoder or a
>> default encoding that otherwise makes all byte streams valid.
>
>   Note that '#4' was the last resort. Assuming the character
> encoding of linking documents usually works (when stylesheets are
> associated with html/xml documents).

Except it doesn't work when the content and the style sheet come from 
different workflows, which is likely to happen in Europe when a content 
management system uses UTF-8 but the style sheets are authored in 
legacy text editors and contain non-ASCII in comments such as copyright 
notices.

The case with opera.com and Mozilla illustrates that it is unsafe to 
*guess* UTF-8 (even from a linking document) and use a draconian UTF-8 
decoder. (I don't think using a draconian UTF-8 decoder is a problem 
when the encoding has been declared explicitly or there is a UTF-8 
BOM.)
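
To make the difference concrete, here is a minimal Python sketch of my 
own (the style sheet bytes are made up, not from opera.com) of a 
draconian versus a recovering UTF-8 decoder:

    # A windows-1252 copyright sign (0xA9) in a comment is not a valid
    # UTF-8 sequence, so a draconian decoder rejects the whole sheet
    # while a recovering decoder substitutes U+FFFD and moves on.
    sheet = b"/* \xa9 2004 Example Oy */ body { color: black }"

    try:
        sheet.decode("utf-8")                       # draconian
    except UnicodeDecodeError as err:
        print("whole sheet rejected:", err)

    print(sheet.decode("utf-8", errors="replace"))  # recovering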

Considering that effort has recently been expended to make every 
character stream parseable in a predictable way by a CSS parser, it 
would seem illogical to mandate a heuristic that is likely to render 
some real-world *byte* streams unparseable.

I think there are two ways around the problem:
1) Requiring the use of a recovering UTF-8 decoder. Security reasons 
are usually cited as a reason for not doing this; I don't know what 
the particular security reasons are.
2) Specifying that in the absence of character encoding information, 
ASCII bytes must be taken as ASCII and bytes with the high bit set 
must map to some placeholder.

The situation with MIME suggests that if a spec says ASCII must be 
assumed, implementors assume windows-1252 anyway in the hope that the 
guess is sometimes right. Mapping all the high-bit bytes to a 
placeholder, on the other hand, is guaranteed to be the wrong guess.
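
As an illustration of option 2 (a sketch of my own, not proposed spec 
text), the mapping would amount to something like this:

    # Keep ASCII bytes as-is, map every byte with the high bit set to
    # a placeholder character (U+FFFD here).
    def ascii_with_placeholders(data: bytes) -> str:
        return "".join(chr(b) if b < 0x80 else "\ufffd" for b in data)

    print(ascii_with_placeholders(b"/* \xa9 2004 */ h1 { color: red }"))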

Anyway, if we forget the suggestion of defaulting to windows-1252, the 
problem remains that it is unsafe both to *guess* an encoding that has 
interbyte dependencies and to enforce those interbyte dependencies by 
rejecting the style sheet, if the goal is to be able to process the 
non-bogotic parts of semi-bogotic style sheets (which I understand is 
the goal of making all *character* streams parseable in a predictable 
way, beyond the original forward-compatible parsing rules).

>>> True in Western Europe and most other parts of the world. Not true in
>>> Japan, China and Korea. I'm not talking about comments here. A number
>>> of stylesheets list font-family names in Chinese, Japanese and Korean
>>> in legacy
>>> encodings (GB2312, Big5, Shift_JIS, EUC-JP, EUC-KR, etc).
>>
>> So why on earth don't they label their style sheets with the
>> appropriate character encoding label? The UTF-8 default guess does not
>> help at all with GB2312, Big5, Shift_JIS, EUC-JP, EUC-KR, etc.
>
>   As already pointed out by others, for exactly the same reason as
> many Western European stylesheets are not properly tagged as in
> 'ISO-8859-1' or 'Windows-1252' even though they have non-ASCII 
> characters, albeit only in comments.

It's not the same. Using non-ASCII in a non-comment part of the style 
sheet without a character encoding label and expecting things to 
magically sort themselves out is significantly more unreasonable than 
expecting comments to be gracefully discarded even when they contain 
non-ASCII. After all, the latter works if the parser assumes any 
superset of ASCII that does not have interbyte dependencies.
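
For example (a sketch of mine, arbitrarily picking latin-1 as the 
ASCII superset without interbyte dependencies):

    import re

    # Decoding as latin-1 cannot fail, so a comment containing
    # arbitrary non-ASCII bytes is discarded gracefully by the parser.
    sheet = b"/* \x8e\xa9\xff */ p { font-family: serif }"
    text = sheet.decode("latin-1")
    print(re.sub(r"/\*.*?\*/", "", text, flags=re.DOTALL))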

-- 
Henri Sivonen
hsivonen@iki.fi
http://iki.fi/hsivonen/

Received on Wednesday, 25 February 2004 11:25:17 UTC