Re: [CSS21] response to issue 115 (and 44) from Henri Sivonen on 2004-02-24 (www-style@w3.org from February 2004)

From: Henri Sivonen <hsivonen@iki.fi>
Date: Tue, 24 Feb 2004 08:55:04 +0200
To: Jungshik Shin <jshin@i18nl10n.com>
Cc: www-style@w3.org
Message-Id: <5FDF01E3-6696-11D8-B2C2-003065B8CF0E@iki.fi>

On Feb 23, 2004, at 22:27, Jungshik Shin wrote:

> On Mon, 23 Feb 2004, Jukka K. Korpela wrote:
>
>> On Mon, 23 Feb 2004, Henri Sivonen wrote:
>>
>>> On Feb 21, 2004, at 00:26, Bert Bos wrote:
>>>
>>>>  4) If all else fails, assume UTF-8.
>>>
>>> Why not windows-1252 (with the few undefined bytes mapped to
>>> *something* so that all byte streams can be converted to some
>>> "characters")?
>
>   Why not? Because there are a lot of stylesheets in encodings other
> than Windows-1252.

Yes, and the right thing to do is to label them as such.

> If you don't like UTF-8, you'd better ask for
> ISO-646:IRV.

I like UTF-8. My point is that it is unlikely for unlabeled data to be 
UTF-8 by chance. Although UTF-8 makes the most sense as the One True 
Encoding, UTF-8 is not the best guess when guessing what a bozo who 
uses non-ASCII without a label might use.

Anyway, the UTF-8 default combined with a Draconian UTF-8 decoder is 
not a workable solution for existing unlabeled style sheets. To address 
the case where the style sheet is ASCII except for comments and the 
comments contain non-ASCII bytes that don't form valid UTF-8 sequences, 
the CSS spec needs to require either a recovering UTF-8 decoder or a 
default encoding that otherwise makes all byte streams valid.
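
A minimal sketch of the two options, assuming Python's built-in codecs
stand in for a browser's decoder. The sample bytes and the function names
are made up for illustration. Note that Python's own "cp1252" codec
leaves a few bytes (such as 0x81) undefined, so "latin-1" is used here as
the encoding in which every byte stream is valid.

```python
# Hypothetical unlabeled style sheet: ASCII rules plus a comment that
# contains the byte 0xE4 ("ä" in ISO-8859-1), which is not valid UTF-8 here.
SHEET = b"/* p\xe4\xe4otsikko */\nh1 { color: red }\n"


def strict_utf8_ok(data: bytes) -> bool:
    """Draconian decoder: any malformed sequence rejects the whole sheet."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


def decode_recovering(data: bytes) -> str:
    """Recovering decoder: malformed bytes become U+FFFD, the rest survives."""
    return data.decode("utf-8", errors="replace")


def decode_total(data: bytes) -> str:
    """Fallback encoding in which every possible byte maps to a character."""
    return data.decode("latin-1")


def cp1252_has_gaps() -> bool:
    """Python's cp1252 codec rejects the undefined byte 0x81."""
    try:
        bytes([0x81]).decode("cp1252")
        return False
    except UnicodeDecodeError:
        return True


if __name__ == "__main__":
    print("strict UTF-8 accepts sheet:", strict_utf8_ok(SHEET))
    print("recovering:", decode_recovering(SHEET))
    print("total:", decode_total(SHEET))
```

Either non-strict approach keeps the "h1" rule usable; only the strict
decoder throws away a style sheet whose sole problem is in a comment.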

>>> Anyway, it's just
>>> plain stupid to use non-ASCII outside comments in a style sheet that
>>> doesn't have a character encoding label and doesn't have a BOM, so in
>>> the relatively rare cases where this heuristic fails, the author would
>>> have only him/herself to blame.
>>
>> Indeed. And currently most style sheets contain Ascii only.
>
>   True in Western Europe and most other parts of the world. Not true in
> Japan, China and Korea. I'm not talking about comments here. A number
> of stylesheets list font-family names in Chinese, Japanese and Korean in
> legacy encodings (GB2312, Big5, Shift_JIS, EUC-JP, EUC-KR, etc).

So why on earth don't they label their style sheets with the 
appropriate character encoding label? The UTF-8 default guess does not 
help at all with GB2312, Big5, Shift_JIS, EUC-JP, EUC-KR, etc.

For the cases you're using as counterexamples to windows-1252, 
UTF-8 is a wrong guess, too.
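
To make that concrete, here is a small sketch, again assuming Python's
codecs as a stand-in for a real decoder. The font name and the helper
name try_decode are only illustrative choices. It encodes a Korean font
family name in EUC-KR and then tries each guess.

```python
from typing import Optional

# "Gulim", a Korean font family name, as it would appear in a legacy
# EUC-KR style sheet (illustrative choice of name).
FONT_NAME = "굴림"
RAW = FONT_NAME.encode("euc_kr")


def try_decode(data: bytes, encoding: str) -> Optional[str]:
    """Return the decoded text, or None if the bytes are invalid in encoding."""
    try:
        return data.decode(encoding)
    except UnicodeDecodeError:
        return None


as_utf8 = try_decode(RAW, "utf-8")      # the proposed default guess
as_cp1252 = try_decode(RAW, "cp1252")   # the alternative guess
as_labeled = try_decode(RAW, "euc_kr")  # what a correct label would give

if __name__ == "__main__":
    print("utf-8:  ", as_utf8)
    print("cp1252: ", as_cp1252)
    print("euc-kr: ", as_labeled)
```

Neither guess recovers the font name: strict UTF-8 rejects the bytes
and windows-1252 yields mojibake. Only the correct label works.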

-- 
Henri Sivonen
hsivonen@iki.fi
http://iki.fi/hsivonen/

Received on Tuesday, 24 February 2004 01:56:13 UTC