- From: Henri Sivonen <hsivonen@iki.fi>
- Date: Tue, 24 Feb 2004 08:55:04 +0200
- To: Jungshik Shin <jshin@i18nl10n.com>
- Cc: www-style@w3.org
On Feb 23, 2004, at 22:27, Jungshik Shin wrote:

> On Mon, 23 Feb 2004, Jukka K. Korpela wrote:
>
>> On Mon, 23 Feb 2004, Henri Sivonen wrote:
>>
>>> On Feb 21, 2004, at 00:26, Bert Bos wrote:
>>>
>>>> 4) If all else fails, assume UTF-8.
>>>
>>> Why not windows-1252 (with the few undefined bytes mapped to
>>> *something* so that all byte streams can be converted some
>>> "characters")?
>
> Why not? Because there are a lot of stylesheets in encodings other
> than Windows-1252.

Yes, and the right thing to do is to label them as such.

> If you don't like UTF-8, you'd better ask for ISO-646:IRV.

I like UTF-8. My point is that it is unlikely for unlabeled data to be UTF-8 by chance. Although UTF-8 makes the most sense as the One True Encoding, UTF-8 is not the best guess when guessing what a bozo who uses non-ASCII without a label might use.

Anyway, the UTF-8 default combined with a Draconian UTF-8 decoder is not a workable solution for existing unlabeled style sheets. To address the case where the style sheet is ASCII except for comments and the comments contain non-ASCII bytes that don't form valid UTF-8 sequences, the CSS spec needs to require either a recovering UTF-8 decoder or a default encoding that otherwise makes all byte streams valid.

>>> Anyway, it's just plain stupid to use non-ASCII outside comments
>>> in a style sheet that doesn't have a character encoding label and
>>> doesn't have a BOM, so in the relatively rare cases where this
>>> heuristic fails, the author would have only him/herself to blame.
>>
>> Indeed. And currently most style sheets contain Ascii only.
>
> True in Western Europe and most other parts of the world. Not true in
> Japan, China and Korea. I'm not talking about comments here. A number
> of stylesheets list font-family names in Chinese, Japanese and Korean
> in legacy encodings (GB2312, Big5, Shift_JIS, EUC-JP, EUC-KR, etc).

So why on earth don't they label their style sheets with the appropriate character encoding label? The UTF-8 default guess does not help at all with GB2312, Big5, Shift_JIS, EUC-JP, EUC-KR, etc. For the cases you're using as the counter examples for windows-1252, UTF-8 is a wrong guess, too.

-- 
Henri Sivonen
hsivonen@iki.fi
http://iki.fi/hsivonen/

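
To make the three options discussed above concrete, here is a rough Python sketch, not anything the CSS draft specifies. The function names and the two sample byte strings are made up for illustration. It assumes that "Draconian" means rejecting the whole stream on the first malformed sequence, that "recovering" means substituting U+FFFD, and that the undefined windows-1252 bytes are also mapped to U+FFFD (the thread only says "*something*").

```python
def decode_draconian(data: bytes) -> str:
    """Strict UTF-8: raises UnicodeDecodeError on the first malformed sequence."""
    return data.decode("utf-8", errors="strict")


def decode_recovering(data: bytes) -> str:
    """Recovering UTF-8: malformed sequences become U+FFFD, the rest survives."""
    return data.decode("utf-8", errors="replace")


def decode_windows_1252(data: bytes) -> str:
    """windows-1252 guess. Python's cp1252 codec leaves a few bytes undefined
    (0x81, 0x8D, 0x8F, 0x90, 0x9D); mapping them to U+FFFD is an assumption
    made here so that every byte stream decodes to some characters."""
    return data.decode("cp1252", errors="replace")


def is_valid_utf8(data: bytes) -> bool:
    """True if the Draconian decoder would accept the stream."""
    try:
        decode_draconian(data)
    except UnicodeDecodeError:
        return False
    return True


# Hypothetical unlabeled sheet: ASCII rules, windows-1252 bytes only in a comment.
SHEET = b"/* p\xe4\xe4otsikko */\nh1 { color: red }\n"

# Hypothetical unlabeled sheet with a Korean font name in EUC-KR.
LEGACY_RULE = 'h1 { font-family: "굴림" }'.encode("euc_kr")

if __name__ == "__main__":
    print(is_valid_utf8(SHEET))          # the Draconian decoder rejects the sheet
    print(decode_recovering(SHEET))      # the rule survives, the comment is damaged
    print(decode_windows_1252(SHEET))    # the whole sheet decodes
    print(is_valid_utf8(LEGACY_RULE))    # the UTF-8 guess fails for EUC-KR as well
```

With the first sample, only the comment is affected under the recovering decoder or the windows-1252 guess, while the strict decoder loses the entire sheet. The second sample illustrates the closing point: bytes in a legacy CJK encoding are not accepted by a strict UTF-8 decoder either, so a label is still needed.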
Received on Tuesday, 24 February 2004 01:56:13 UTC