- From: Martin Duerst <duerst@w3.org>
- Date: Sun, 21 Mar 2004 09:24:24 -0500
- To: Tim Bray <tbray@textuality.com>
- Cc: www-tag@w3.org <www-tag@w3.org>
Hello Tim,

This is just a personal reply.

At 15:08 04/03/05 -0800, Tim Bray wrote:
>> http://www.w3.org/TR/2004/WD-charmod-20040225

>C016 [S] When designing a new protocol, format or API,
>specifications SHOULD mandate a unique character encoding.
>
>This is controversial. I think in general this is reasonable, with the
>single exception of doing what XML did and blessing both UTF-8 and
>UTF-16. The problem with a single encoding is that it forces people to
>choose between being Java/C# friendly (UTF-16) and C/C++ friendly
>(UTF-8).

Later on, you in fact seem to agree with this point.

>Furthermore it's trivially easy to distinguish between UTF-8 and UTF-16 if
>you specify a BOM. But I think that if I were defining the next CSS or
>equivalent I'd like to be able to say "UTF-8 or UTF-16" without feeling guilty.

Most has already been said in the follow-on thread. Just a few more
points:

1) CSS may be a rather bad example. We just had discussions with the
   CSS WG at the Tech Plenary, and we understood that non-ASCII text
   is much rarer in CSS than in XML (mostly font names and class names).

2) [This is my main point] The advantages of having a single,
   one-and-only encoding for a format are huge. Most people have become
   too used to ASCII to realize the advantage it created for the US
   computer industry. The sooner we get from the mess with all these
   different charsets back to a single encoding, the better.
   The IETF has a saying: "zero, one, many", which indicates that
   'two' is often

   So as a designer, you should first try to use a single encoding.
   If you have good reasons for not doing so, you don't have to
   feel guilty.

3) Your point about 'easier to output' in
   http://www.w3.org/mid/261266D4-712C-11D8-95ED-000A95A51C9E@textuality.com
   does have some flaws. It's clearly easier to output, but then
   why only UTF-8 and UTF-16? Some C programs internally work in
   UTF-32,... And whoever gets your output may have to convert it
   back to whatever they want internally.
   If you know where your output goes, and it's a single place,
   your choice of output may be right. If you don't know, or it's
   multiple places, chances are that you steal more cycles from
   business logic than you saved in the first place.
   In addition, knowing exactly what to produce, or what to expect,
   often helps optimize these pieces carefully.

>I don't see anywhere that it recommends that if you're using UTF-16 you
>always use a BOM, and that seems like a basic good practice, particularly
>if you're going to allow either UTF8 or UTF-16.

The BOM came up in several of our review contexts recently, and we
have had a lot of discussions about it. We have come to the conclusion
that we don't have consensus yet, nor is there conclusive practical
evidence on most aspects of it, to put something definitive into the
character model. We have therefore not included anything about the BOM
in the character model, in order to move on. But we have been
discussing how to come to conclusions, and how to document them,
separately (on which point we also haven't come to any conclusions
yet :-).

Regards, Martin.
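
To illustrate the quoted remark that UTF-8 and UTF-16 are easy to tell apart when a BOM is present, here is a minimal sketch in Python. It is not part of the original message or of the character model draft. The function names `sniff_bom` and `decode_with_bom` are hypothetical, and the sketch assumes a format that permits only UTF-8 and UTF-16 (a UTF-32 little-endian BOM begins with the same two bytes as the UTF-16 little-endian one, so it would be misdetected here).

```python
import codecs

def sniff_bom(data: bytes):
    """Return (encoding, bom_length) based on a leading BOM, or (None, 0).

    Assumption: only UTF-8 and UTF-16 are allowed by the format, so
    UTF-32 BOMs are deliberately not considered.
    """
    if data.startswith(codecs.BOM_UTF8):        # EF BB BF
        return "utf-8", len(codecs.BOM_UTF8)
    if data.startswith(codecs.BOM_UTF16_BE):    # FE FF
        return "utf-16-be", len(codecs.BOM_UTF16_BE)
    if data.startswith(codecs.BOM_UTF16_LE):    # FF FE
        return "utf-16-le", len(codecs.BOM_UTF16_LE)
    return None, 0

def decode_with_bom(data: bytes, default: str = "utf-8") -> str:
    """Decode data, using the BOM if present and `default` otherwise."""
    encoding, skip = sniff_bom(data)
    # Strip the BOM bytes so they do not show up as U+FEFF in the text.
    return data[skip:].decode(encoding or default)
```

The fallback to a default encoding when no BOM is found is exactly the case on which the message says there is no consensus yet, so the `default` parameter should be read as a placeholder for whatever a given specification decides.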
Received on Sunday, 21 March 2004 09:29:13 UTC