Re: Reviewed charmod fundamentals from Martin Duerst on 2004-03-21 (www-tag@w3.org from March 2004)

From: Martin Duerst <duerst@w3.org>
Date: Sun, 21 Mar 2004 09:24:24 -0500
To: Tim Bray <tbray@textuality.com>
Cc: www-tag@w3.org <www-tag@w3.org>
Message-Id: <4.2.0.58.J.20040321085559.05c88348@localhost>

Hello Tim,

This is just a personal reply.

At 15:08 04/03/05 -0800, Tim Bray wrote:
>>  http://www.w3.org/TR/2004/WD-charmod-20040225

>C016   [S]   When  designing a new protocol, format or API, 
>specifications  SHOULD mandate a unique character encoding.
>
>This is controversial.  I think in general this is reasonable, with the 
>single exception of doing what XML did and blessing both UTF-8 and 
>UTF-16.  The problem with a single encoding is that it forces people to 
>choose between being Java/C# friendly (UTF-16) and C/C++ friendly 
>(UTF-8).  Later on, you in fact seem to agree with this point.
>Furthermore it's trivially easy to distinguish between UTF-8 and UTF-16 if 
>you specify a BOM.  But I think that if I were defining the next CSS or 
>equivalent I'd like to be able to say "UTF-8 or UTF-16" without feeling guilty.

Most has already been said in the follow-on thread. Just a few more points:

1) CSS may be a rather bad example. We just had discussions with the
    CSS WG at the Tech Plenary, and we understood that non-ASCII text
    is much rarer in CSS than in XML (mostly font names and class names).

2) [This is my main point]
    The advantages of having a single one-and-only encoding for a format
    are huge. Most people have become so used to ASCII that they no
    longer realize the advantage it created for the US computer industry.
    The sooner we get from the mess with all these different charsets
    back to a single encoding, the better. The IETF has a saying:
    "zero, one, many", which indicates that 'two' is often a poor choice.
    So as a designer, you should first try to use a single encoding.
    If you have good reasons for not doing so, you don't have to feel guilty.

3) Your point about 'easier to output' in
    http://www.w3.org/mid/261266D4-712C-11D8-95ED-000A95A51C9E@textuality.com
    does have some flaws. It's clearly easier to output, but then why only
    UTF-8 and UTF-16? Some C programs internally work in UTF-32,...
    And whoever gets your output may have to convert it back to whatever
    they want internally. If you know where your output goes, and
    it's a single place, your choice may be right. If you don't know,
    or it's multiple places, chances are that you steal more cycles
    from business logic than you saved in the first place. In addition,
    knowing exactly what to produce, or what to expect, often helps
    optimize these pieces carefully.
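
As a rough illustration of the conversion step in point 3, here is a minimal Python sketch. The function name `transcode` and the sample text are made up for this example, and it assumes the producer writes UTF-16 (little-endian) while the receiver works in UTF-8 internally:

```python
def transcode(data: bytes, source: str, target: str) -> bytes:
    """Decode bytes from one encoding and re-encode them in another."""
    return data.decode(source).encode(target)


# Hypothetical producer: emits UTF-16 because that is its internal form.
text = "na\u00efve caf\u00e9"
wire = text.encode("utf-16-le")

# Hypothetical receiver: works in UTF-8, so it pays for a conversion
# on every message, whatever the producer saved on output.
internal = transcode(wire, "utf-16-le", "utf-8")
```

With a single mandated encoding, only one side (at most) needs such a step, and it always knows which one.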


>I don't see anywhere that it recommends that if you're using UTF-16 you 
>always use a BOM, and that seems like a basic good practice, particularly 
>if you're going to allow either UTF8 or UTF-16.

The BOM came up in several of our recent reviews,
and we have had a lot of discussions about it, but we have
come to the conclusion that we have neither consensus
nor conclusive practical evidence on most aspects
of it, so we cannot yet put something definitive into the character model.

We have therefore not included anything about the BOM in
the character model, in order to move on. But we have
been discussing how to come to conclusions, and how to
document them, separately (on which point we also haven't
come to any conclusions yet :-).
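
For reference, the kind of BOM check under discussion can be sketched as follows. This is only an illustration in Python, not something the character model specifies. The function name `sniff_bom` is made up, and the sketch assumes the format allows only UTF-8 and UTF-16 (the UTF-32 little-endian BOM begins with the same two bytes as the UTF-16 little-endian one, so it would need an extra check):

```python
import codecs
from typing import Optional


def sniff_bom(data: bytes) -> Optional[str]:
    """Return the encoding implied by a leading BOM, or None if there is none.

    Assumes only UTF-8 and UTF-16 are permitted by the format.
    """
    if data.startswith(codecs.BOM_UTF8):      # EF BB BF
        return "utf-8"
    if data.startswith(codecs.BOM_UTF16_BE):  # FE FF
        return "utf-16-be"
    if data.startswith(codecs.BOM_UTF16_LE):  # FF FE
        return "utf-16-le"
    return None
```

A result of None leaves the question open: the data may be BOM-less UTF-8, BOM-less UTF-16, or something else, which is exactly where a format needs a stated rule.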

Regards,   Martin.
Received on Sunday, 21 March 2004 09:29:13 UTC