Re: character encoding schemes go from character*->byte* from Martin Duerst on 2002-12-17 (www-i18n-comments@w3.org from December 2002)

From: Martin Duerst <duerst@w3.org>
Date: Wed, 18 Dec 2002 07:56:09 +0900
To: Dan Connolly <connolly@w3.org>
Cc: www-i18n-comments@w3.org, w3c-i18n-ig@w3.org
Message-Id: <4.2.0.58.J.20021218073543.069c3410@localhost>
At 16:06 02/12/17 -0600, Dan Connolly wrote:
>On Tue, 2002-12-17 at 15:40, Martin Duerst wrote:
> > Hello Dan,
> >
> > Many thanks for your comments on the Character Model.
>
>Hi... thanks for getting back to me quickly.
>
>I hope you have time to explain a bit more; I'm
>still a little confused...

ok, I'll try.


> > We do not see any particular conflict here. There is a clear difference
> > between "character encoding" and "character encoding scheme". The latter
> > is taken from
> > http://www.unicode.org/unicode/reports/tr17/#Character%20Encoding%20Model
> > the former probably goes back to RFC 2070 or so. The latter refers to
> > a very particular part of the overall process of character encoding,
> > the former to the overall thing.
>
>Which term names the class that utf-8 and us-ascii fall into?

You most probably should use 'character encoding', which denotes
the overall mapping between char* and byte*. The other terms
such as 'character encoding scheme' and 'character encoding form'
are really only relevant when looking at the details of that process.
In particular, 'character encoding form' was specifically created
for UTF-16, which can neither be explained as a sequence of bytes
(in particular in memory where endianness issues are invisible)
nor as a sequence of naturally (binary) represented integers.
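
To make the form/scheme split concrete, here is a minimal Python sketch
(the helper names utf16_code_units and serialize are made up for this
illustration, they are not from TR17). The first step turns characters
into 16-bit code units, which is the 'character encoding form'; only the
second step, which has to pick a byte order, produces bytes.

```python
import struct

def utf16_code_units(text):
    """Encoding form: map each character to one or two 16-bit code units."""
    units = []
    for ch in text:
        cp = ord(ch)
        if cp < 0x10000:
            units.append(cp)
        else:
            # Characters above U+FFFF become a surrogate pair.
            cp -= 0x10000
            units.append(0xD800 | (cp >> 10))
            units.append(0xDC00 | (cp & 0x3FF))
    return units

def serialize(units, byte_order):
    """Encoding scheme: write code units as bytes; '>' is big-endian, '<' little-endian."""
    return struct.pack(f"{byte_order}{len(units)}H", *units)

text = "A\U0001D11E"                 # 'A' followed by U+1D11E, which is outside the BMP
units = utf16_code_units(text)       # [0x41, 0xD834, 0xDD1E]
big = serialize(units, ">")          # same bytes as text.encode("utf-16-be")
little = serialize(units, "<")       # same bytes as text.encode("utf-16-le")
```

The list of code units is the same in both cases; the two byte sequences
differ only because of the byte order chosen in the second step.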


> > >It's technically arbitrary whether it goes from byte*
> > >to character* or the other way around, but the
> > >use of 'encoding' in the name strongly suggests
> > >encoding characters, i.e. going from characters
> > >to bytes.
> >
> > In actual fact, it's technically not arbitrary.
> > There are quite a few character encodings where the
> > char*->byte* mapping allows choices.
>
>Ah; OK. Please note that in the spec.

Ok, we'll try to do so.


> >  This applies in particular
> > to the iso-2022-based encodings. This is the
> > reason why the formal definition is 'the wrong way
> > round'. Nevertheless, it would be extremely weird for
> > anybody to change to calling these things 'decodings'.
> >
> >
> >
> > >The charmod spec gets the specification of IANA charsets
> > >right, indirectly...
> > >
> > >"A CES, together with the CCSes it is used with, is
> > >identified by an IANA charset identifier."
> > >
> > >but it's not nearly so mathematically precise as just
> > >saying that IANA charsets identify character encoding
> > >schemes, and character encoding schemes are
> > >(invertible) functions from character sequences
> > >to byte sequences.
> >
> > As explained above, it doesn't say that these are functions
> > because they are not functions.
>
>They are functions when they go the "wrong" way,
>from byte sequences to character
>sequences, no?

On paper, yes. In actual practice, there are small differences
between the conversions implemented by each of the major vendors/
technologies. But this is different from the 'right' way, where
in some cases, the actual specification allows different choices.
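
A small Python sketch of the asymmetry, assuming the standard library's
iso2022_jp codec as a stand-in for an iso-2022-based encoding (the helper
names are made up for this illustration). Several different byte
sequences decode to the same character sequence, so byte*->char* is a
(many-to-one) function, while char*->byte* leaves the encoder a choice.

```python
# ESC ( B designates ASCII. ASCII is already the initial state,
# so prefixing it to pure-ASCII data is redundant but permitted.
ESC_ASCII = b"\x1b(B"

def ascii_spellings():
    plain = "abc".encode("iso2022_jp")
    padded = ESC_ASCII + plain
    return plain, padded

def kanji_spellings():
    # One character, wrapped in its own "switch to JIS X 0208" and
    # "switch back to ASCII" escape sequences.
    one = "\u4e9c".encode("iso2022_jp")
    # The codec's own output for two characters: one switch in, one switch out.
    compact = "\u4e9c\u4e9c".encode("iso2022_jp")
    # Equally decodable alternative: switch in and out around each character.
    verbose = one + one
    return compact, verbose

for first, second in (ascii_spellings(), kanji_spellings()):
    assert first != second                                           # different bytes
    assert first.decode("iso2022_jp") == second.decode("iso2022_jp") # same characters
```

Which of the byte sequences an encoder emits is up to the implementation;
the decoder does not care.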


> > >Please fix.
> >
> > We don't think that you have raised any actual issues.
>
>Well, I need a term for the class that utf-8 and us-ascii
>are in; i.e. mappings between character sequences
>and byte sequences. Please let me know which term
>you've chosen for that concept. I thought it
>was "character encoding scheme", from RFC2070 and
>such.

I just reviewed RFC 2070. There are indeed cases where it
uses 'character encoding scheme', in particular in the
abstract. But later, it seems to be quite clear, in particular
when it says:

    The term "charset" in MIME is used to designate a character encoding,
    rather than merely a coded character set as the term may suggest.  A
    character encoding is a mapping (possibly many-to-one) of sequences
    of octets to sequences of characters taken from one or more character
    repertoires.

Thinking back, it's easily possible that different authors
used somewhat different terminology :-).

Please also note http://www.ietf.org/rfc/rfc2130.txt, which
uses mostly the same distinction between 'coded character set'
and 'character encoding scheme' as in
http://www.unicode.org/unicode/reports/tr17/, but then
confuses things with statements such as

    This report recommends the use of ISO 10646 as the default Coded
    Character Set, and UTF-8 as the default Character Encoding Scheme in
    the creation of new protocols...

Here 'UTF-8' refers to a generic method of variable-length
encoding from 32-bit integers to byte sequences, which in theory
could be applied to any kind of Coded Character Set.
However, UTF-8 has fortunately never been used in combination with
anything other than ISO 10646, so that it now always stands for the
specific variable-length encoding of ISO 10646/Unicode, and it is
in this sense (i.e. as a 'character encoding' rather than as a
'character encoding scheme') that it is used in the 'charset'
parameter or the 'encoding' pseudo-attribute.
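
As a rough illustration of the "generic method" reading, the following
Python sketch maps a bare integer to a UTF-8-style byte sequence without
asking which Coded Character Set the integer comes from. The function
name utf8_style_encode is made up here, and the sketch assumes the
original UTF-8 design with up to six bytes, covering values below 2**31.
For integers that happen to be Unicode code points it agrees with
Python's own 'utf-8' codec.

```python
def utf8_style_encode(n):
    """Encode an integer 0 <= n < 2**31 using the original UTF-8 bit layout."""
    if not 0 <= n < 2**31:
        raise ValueError("value out of range for this sketch")
    if n < 0x80:
        return bytes([n])            # one byte, same as ASCII
    # (payload bits available, total length in bytes, leading-byte marker)
    layouts = ((11, 2, 0xC0), (16, 3, 0xE0), (21, 4, 0xF0),
               (26, 5, 0xF8), (31, 6, 0xFC))
    for bits, length, lead in layouts:
        if n < (1 << bits):
            break
    tail = []
    for _ in range(length - 1):
        tail.append(0x80 | (n & 0x3F))   # continuation byte: 10xxxxxx
        n >>= 6
    # What is left of n fits into the free bits of the leading byte.
    return bytes([lead | n] + tail[::-1])

# For Unicode code points the result matches the 'utf-8' character encoding.
for cp in (0x41, 0xE9, 0x20AC, 0x1F600):
    assert utf8_style_encode(cp) == chr(cp).encode("utf-8")
```

Values beyond the Unicode range, such as 0x7FFFFFFF, still get a byte
sequence from this function, but they no longer correspond to anything
that the 'utf-8' charset label identifies.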

Regards,    Martin.
Received on Tuesday, 17 December 2002 17:56:27 UTC