
Re: character encoding schemes go from character*->byte*

From: Martin Duerst <duerst@w3.org>
Date: Wed, 18 Dec 2002 06:40:23 +0900
To: Dan Connolly <connolly@w3.org>, www-i18n-comments@w3.org
Cc: w3c-i18n-ig@w3.org

Hello Dan,

Many thanks for your comments on the Character Model.

Because our last call period is long closed, we do
not plan to list your comment as an issue in our
last call disposition. But the W3C I18N WG
Core Task Force has looked briefly at your mail at
its teleconf today and has actioned me to write back
to you.

However, I plan to take up the issue of terminology
streamlining again at the next teleconf, to make sure
that we can increase the clarity and consistency of our document.

At 08:33 02/12/17 -0600, Dan Connolly wrote:

>In another working group, I was just going to
>cite the definition of a character encoding scheme,
>but I see it's wrong in the charmod spec:
>"A CES is a mapping of the code units of a CEF into well-defined
>sequences of bytes"
>  -- http://www.w3.org/TR/charmod/#sec-Digital
>  http://www.w3.org/TR/2002/WD-charmod-20020430/#sec-Digital
>No, a character encoding scheme maps a sequence of
>characters to a sequence of bytes.
>This goes back at least as far as HTML 4.0:
>"The "charset" parameter identifies a character encoding, which is a
>method of converting a sequence of bytes into a sequence of characters."
>  -- http://www.w3.org/TR/html401/charset.html

We do not see any particular conflict here. There is a clear difference
between "character encoding" and "character encoding scheme". The latter
is taken from [...];
the former probably goes back to RFC 2070 or so. The latter refers to
a very particular part of the overall process of character encoding,
the former to the process as a whole.
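
To illustrate the distinction, here is a minimal sketch in Python. It is
not taken from the Character Model; the function names
utf16_code_units and serialize_code_units are hypothetical, and UTF-16
is chosen only as a convenient example. The first function stands in
for a character encoding form (characters to code units), the second
for a character encoding scheme in the narrow sense (code units to
bytes). A "character encoding" in the HTML 4 sense covers both steps.

```python
import struct


def utf16_code_units(text):
    """CEF step (illustrative): map characters to 16-bit code units."""
    units = []
    for ch in text:
        cp = ord(ch)
        if cp < 0x10000:
            units.append(cp)
        else:
            # Characters beyond the BMP become a surrogate pair.
            cp -= 0x10000
            units.append(0xD800 + (cp >> 10))
            units.append(0xDC00 + (cp & 0x3FF))
    return units


def serialize_code_units(units, big_endian=True):
    """CES step (illustrative): map 16-bit code units to bytes."""
    fmt = (">" if big_endian else "<") + "H" * len(units)
    return struct.pack(fmt, *units)


text = "A\U00010000"
units = utf16_code_units(text)
be_bytes = serialize_code_units(units, big_endian=True)
le_bytes = serialize_code_units(units, big_endian=False)

# The two steps together should agree with Python's built-in codecs.
assert be_bytes == text.encode("utf-16-be")
assert le_bytes == text.encode("utf-16-le")
```

The same code units give different byte sequences depending on the
byte order chosen in the second step, which is the part the CES
definition is about.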

There are some places in the Character Model that may be confusing
(though you don't cite them). For example,
http://www.w3.org/TR/2002/WD-charmod-20020430/#sec-UniqueEncoding has:

    [S] Specifications SHOULD avoid using the terms 'character set' and
    'charset' to refer to a character encoding, except when the latter
    is used to refer to the MIME charset parameter or its IANA-registered
    values. The terms 'character encoding', 'character encoding form' or
    'character encoding scheme' are RECOMMENDED.

I propose that we either take out 'character encoding form' and
'character encoding scheme' here, or that we be more specific in
which cases which term should be used.

There are also a few other cases where 'encoding scheme' is used
colloquially; we should also review these for clarity.

>It's technically arbitrary whether it goes from byte*
>to character* or the other way around, but the
>use of 'encoding' in the name strongly suggests
>encoding characters, i.e. going from characters
>to bytes.

In actual fact, it's technically not arbitrary.
There are quite a few character encodings where the
char*->byte* mapping allows choices. This applies in particular
to the iso-2022-based encodings. This is the
reason why the formal definition is 'the wrong way
round'. Nevertheless, it would be extremely weird for
anybody to change to calling these things 'decodings'.
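
As a small illustration of such a choice, here is a sketch in Python.
It assumes the "iso2022_jp" codec that ships with Python's standard
library, and the byte sequences are examples I picked, not anything
from the Character Model. Several different byte sequences decode to
the same one-character string, because the escape sequence that
designates ASCII may be present or absent without changing the result.

```python
CODEC = "iso2022_jp"       # codec name in Python's standard library
ESC_ASCII = b"\x1b(B"      # escape sequence designating ASCII

# Three different byte sequences that all represent the string "A".
candidates = [
    b"A",                          # ASCII is the initial state
    ESC_ASCII + b"A",              # redundant designation up front
    ESC_ASCII + b"A" + ESC_ASCII,  # redundant designation at both ends
]

decoded = [raw.decode(CODEC) for raw in candidates]

# byte* -> char* is single-valued: every candidate decodes to "A".
assert all(text == "A" for text in decoded)

# char* -> byte* is not: there are several valid byte sequences.
assert len(set(candidates)) == 3

# An encoder has to pick one of them; Python's picks the shortest here.
canonical = "A".encode(CODEC)
assert canonical == b"A"
```

The decoding direction gives one answer per input, while the encoding
direction requires the encoder to choose among alternatives.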

>The charmod spec gets the specification of IANA charsets
>right, indirectly...
>"A CES, together with the CCSes it is used with, is
>identified by an IANA charset identifier."
>but it's not nearly so mathematically precise as just
>saying that IANA charsets identify character encoding
>schemes, and character encoding schemes are
>(invertible) functions from character sequences
>to byte sequences.

As explained above, it doesn't say that these are functions
because they are not functions.

>Please fix.

We don't think that you have raised any actual issues.
We will look at the issues that we have found 'en passant'
at our next teleconference.

Regards,    Martin.

>Dan Connolly, W3C http://www.w3.org/People/Connolly/
Received on Tuesday, 17 December 2002 16:41:20 UTC
