Re: some IANA registrations look like repertoires not charsets? from Mark Davis on 2002-08-29 (ietf-charsets@w3.org from July to September 2002)

From: Mark Davis <mark.davis@us.ibm.com>
Date: Thu, 29 Aug 2002 14:56:14 -0700
To: ned.freed@mrochek.com
Cc: charsets <ietf-charsets@iana.org>, Markus Scherer <markus.scherer@jtcsv.com>
Message-id: <OFDC096F88.3B568AE4-ON88256C24.0076F569@us.ibm.com>
> "there be a mapping from octets to characters".

I fail to understand this response. I don't mean this rhetorically; there
is obviously history behind this that I am unaware of.

Logically a CES is a mapping between octets and characters (both
directions); when you are generating data you are logically mapping one
way; when you are interpreting you are mapping the other.

Whenever you have a mapping from some set of sequences of octets to
characters, then you can also derive a mapping from some subset of
characters to a set of sequences of octets, and vice versa. The only odd
cases are when the original mapping takes two sequences of octets to the
same character (or, in the other direction, takes one character to two
different sequences of octets); when you derive the reverse mapping you
have to decide which is the preferred mapping and which is just a fallback.
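
As a rough illustration of deriving the reverse mapping, here is a minimal Python sketch. The table `DECODE`, the function `derive_encoder`, and the "shortest sequence wins" default are all hypothetical and invented for this example; they are not taken from any real charset or registration.

```python
# Hypothetical toy decoding table (octet sequences -> characters).
# b"\x81\x40" is an invented second encoding of "A", used only to
# show the "odd case" where two octet sequences decode to one character.
DECODE = {
    b"\x41": "A",
    b"\x42": "B",
    b"\x81\x40": "A",
}


def derive_encoder(decode_table, preferred=None):
    """Invert a decoding table into (encode_table, fallbacks).

    encode_table maps each character to its one preferred octet sequence.
    fallbacks maps each character to the remaining, decode-only sequences.
    """
    preferred = preferred or {}
    candidates = {}
    for octets, char in decode_table.items():
        candidates.setdefault(char, []).append(octets)

    encode_table, fallbacks = {}, {}
    for char, seqs in candidates.items():
        # Default policy (an assumption): shortest sequence wins, ties
        # broken by byte order. An explicit preference overrides it.
        seqs.sort(key=lambda s: (len(s), s))
        chosen = preferred.get(char, seqs[0])
        if chosen not in seqs:
            raise ValueError(f"{chosen!r} does not decode to {char!r}")
        encode_table[char] = chosen
        fallbacks[char] = [s for s in seqs if s != chosen]
    return encode_table, fallbacks


encode_table, fallbacks = derive_encoder(DECODE)
```

With this toy table, "A" encodes to the single octet 0x41, while the two-octet sequence is kept only as a fallback that is accepted on decoding. Passing `preferred={"A": b"\x81\x40"}` reverses that choice.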

Mark
___
mark.davis@us.ibm.com
IBM, MS 50-2/B11, 5600 Cottle Rd, SJ CA 95193
(408) 256-3148
fax: (408) 256-0799



                                                                                                                         
From:     ned.freed@mrochek.com
Date:     2002.08.29 11:21
To:       Markus Scherer <markus.scherer@jtcsv.com>
cc:       charsets <ietf-charsets@iana.org>
Subject:  Re: some IANA registrations look like repertoires not charsets?



> ned.freed@mrochek.com wrote:

> > Assuming the intent really was to register repertoires seems like a
> > stretch to me.

> I believe that is possible. I am trying to figure out what the intent
> was. I am not saying that we must assume right away that these names are
> not charsets. The reference to ISO 10646 collections and IBM GCSGIDs
> however _suggests_ that these are just repertoires.

And I respectfully suggest that pondering the intent of such registrations
is not a useful way to spend our time.

> > > Without any specified encoding scheme, they would not qualify as
> > > charsets.

> > It isn't particularly relevant to the matter at hand, but the fact of
> > the matter is that a charset doesn't require an encoding scheme. The
> > requirement is instead that there be a mapping from octets to
> > characters. Whether this is implemented by means of a CCS/CES pair or
> > something else is up to the

> An encoding scheme is nothing but an algorithm for going from bytes to
> characters. "a charset doesn't require an encoding scheme" and "there be
> a mapping from octets to characters" are therefore contradictory.

I knew when I started it was a waste of time to point this out. I'll waste
everyone's time with one more response on this and then I promise I'll shut
up.

Anyway, a character encoding scheme is a mapping from characters to octets,
not the other way around.

> Without an encoding scheme, there is no way to decode a byte stream.

> > registration. Charsets like iso-2022-jp certainly don't consist of a
> > single CCS/CES pair.

> We all know that a number of charsets combine one CES with multiple
> CCSes. Without that CES you would not have a charset, though. We could
> argue if there is one CES with sub-CESes or a CES with CEFs (a little
> like debating ISO/OSI vs. TCP stack), but at the minimum you need that
> one lowest-level CES to dissect the byte stream into meaningful units.

I repeat: A charset is defined as mapping from octets to characters. This
may be done in a variety of ways, including but not limited to CCS/CES
pairs. You may like the CCS/CES concept, and it is undeniably useful and
perhaps even the preferred method for specifying charsets. But it isn't
what a charset is defined to be.
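
To make the iso-2022-jp case concrete, a small Python sketch using the standard library's `iso2022_jp` codec may help. It only shows that one octet stream switches between coded character sets by escape sequence; the sample string and variable names are chosen for illustration, and nothing here is taken from a registration.

```python
# Sample text mixing ASCII with Japanese (chosen only for illustration).
text = "abc こんにちは"

# Map characters to octets, then octets back to characters.
encoded = text.encode("iso2022_jp")
decoded = encoded.decode("iso2022_jp")

# ESC $ B designates JIS X 0208; before it the stream is plain ASCII.
ESC_JIS_X_0208 = b"\x1b$B"

uses_ascii = encoded.startswith(b"abc ")
switches_to_jis = ESC_JIS_X_0208 in encoded
round_trips = decoded == text
```

The leading octets are plain ASCII, and the escape sequence then switches the same stream to a different coded character set, so the charset as a whole is one mapping from octets to characters rather than a single CCS/CES pair.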

> It is of course possible that the IANA character-sets list is supposed to
> list not only things that are "charsets" but also CCSes and CEFs and
> repertoires.

No it isn't. It is supposed to list charsets. End of story. This has been
debated at enormous length in the past, it is how the current definition of
a charset was arrived at, and it is not going to be revisited now.

> If so, then please add clarifying text to the top of the list document,
> and appropriate classification to at least non-charset entries.

Not going to happen.

> > More likely it was assumed the encoding was implied by the
> > registration.

> That would be good and valid, and I am trying to ascertain what encoding
> if any was implied.

And I am saying that this is a waste of time.

> > In any case, past attempts to clean up the registry haven't been
> > successful.
> > And given that actual use of any of this junk is unlikely to exist, it
> > hasn't proved to be sufficiently problematic to force the issue.

> That is a sad statement. It puts a big disclaimer onto the IANA charset
> list that diminishes its value, in my opinion.

Which you're obviously entitled to. I don't agree, and even if I did it
doesn't change the situation any.

                                                 Ned
Received on Thursday, 29 August 2002 17:58:12 UTC