Re: some IANA registrations look like repertoires not charsets? from Kenneth Whistler on 2002-08-29 (ietf-charsets@w3.org from July to September 2002)

From: Kenneth Whistler <kenw@sybase.com>
Date: Thu, 29 Aug 2002 15:58:44 -0700 (PDT)
To: ned.freed@mrochek.com
Cc: ietf-charsets@iana.org
Message-id: <200208292258.PAA13447@birdie.sybase.com>

Ned,

I was going to stay out of this, but I am a little troubled by
the brushoff you appear to be giving to Marcus' concerns.
I understand that "cleaning up the IANA charset registry" is
a black hole for effort, with a marginal benefit-to-effort
tradeoff. But when an IBM character mapping specialist points
out registrations of IBM-related repertoires that can only be
defective as registrations, it seems a relatively small task
to mark them as such in the registry, so that other people
don't trip over them.

> And I respectfully suggest that pondering the intent of such registrations is
> not a useful way to spend our time.
> 

> I knew when I started it was a waste of time to point this out. I'll waste
> everyone's time with one more response on this and then I promise I'll shut up.
> 
> Anyway, a character encoding scheme is a mapping from characters to octets, not
> the other way around.
> 

I understand the distinction you are making here. The charset registry
defines labels that allow a protocol to identify a byte stream and
then, in principle, using whatever mechanism is associated with that
registration, to decode that byte stream into a sequence of characters.
Period. It takes no position on how characters are to be mapped into
octets, or on the generic issues of mapping tables, round-trip mapping,
and so on.

> I repeat: A charset is defined as mapping from octets to characters. 

The problem, of course, is that for the IBM repertoires that Marcus
pointed out, in particular, there can be *no* mapping from octets to
characters -- a bare repertoire assigns no octets to its characters,
so the mapping is inherently and completely undefined. These have
to be defective registrations.

Even with the ISO-10646 registrations, there is a problem if no
indication of encoding scheme is associated with the registration,
since in those cases, too, the mapping *from* octets to characters
is ambiguous -- hence, unusable. Suppose I label some octets
as charset=ISO-10646-Unicode-Latin1. Then are the octets:

   0xC3 0xB0

to be mapped to <small-eth> (likely), or to <Hangul-SSYEOM> (and
regarded as a data error, since outside the repertoire), or to
<Hangul-NAELH> (also a data error), or even to <A-tilde, degree sign>?

Without further explicit designation that UTF-8 is involved, I'd only
be guessing, and I'd be better off with character encoding heuristics
than charset labels. In fact, given the "Latin1" part of the name,
I'd speculate that most implementations would be more likely to
turn this into character hash by decoding it as Latin-1 than to
derive the probably correct answer.
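
The four readings of those two octets can be checked mechanically.
What follows is a minimal Python sketch, not part of any registration:
the function name `interpretations` and the list of candidate encoding
schemes are illustrative assumptions, chosen to match the four guesses
above (UTF-8, UTF-16 in either byte order, and Latin-1).

```python
import unicodedata

# Candidate encoding schemes a receiver might guess for an under-specified
# label. This list is an assumption made for illustration only.
CANDIDATES = ("utf-8", "utf-16-be", "utf-16-le", "latin-1")


def interpretations(octets: bytes) -> dict:
    """Map each candidate encoding scheme to the character names it yields.

    Raises UnicodeDecodeError if the octets are ill-formed under a scheme,
    and ValueError if a decoded character has no Unicode name.
    """
    result = {}
    for encoding in CANDIDATES:
        text = octets.decode(encoding)
        result[encoding] = [unicodedata.name(ch) for ch in text]
    return result


if __name__ == "__main__":
    for encoding, names in interpretations(bytes([0xC3, 0xB0])).items():
        print(f"{encoding:10} {names}")
```

Each scheme gives a different answer for the same octets. Note that
both the UTF-8 reading and the Latin-1 reading stay inside the Latin-1
repertoire, so the repertoire restriction in the label does nothing to
choose between them.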

So it seems to me that Marcus is right about this one as well -- it
is simply a defective registration and ought to be marked as such
to warn people off it.

> 
> > If so, then please add clarifying text to the top of the list document, and
> > appropriate classification to at least non-charset entries.
> 
> Not going to happen.

Why not?

Presence of a registration in a list of things designated as "charsets"
doesn't mean that it actually is a well-defined charset, by the
definition of charset you and the registry are using. Why not make
at least a minimal effort to specify those which don't even rise
to this level of well-formedness? That is different from trying to
classify entries as useful versus useless on grounds of widespread
implementation or any other such criteria.
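
As a rough illustration of how small that minimal effort could be,
here is a Python sketch. Everything in it is hypothetical: the
`RegistryEntry` type, the `flag_defective` helper, and the label
`EXAMPLE-REPERTOIRE-ONLY` are invented for the example and do not
describe how the IANA registry is actually maintained. The only test
applied is whether an entry has any octets-to-characters mapping at all.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class RegistryEntry:
    """Hypothetical model of one registry entry."""
    label: str
    # Octets-to-characters mapping, or None when the registration
    # names only a repertoire and defines no such mapping.
    decode: Optional[Callable[[bytes], str]]

    @property
    def well_defined(self) -> bool:
        return self.decode is not None


def flag_defective(entries: List[RegistryEntry]) -> List[str]:
    """Return the labels of entries that define no decoding at all."""
    return [e.label for e in entries if not e.well_defined]


if __name__ == "__main__":
    entries = [
        RegistryEntry("ISO-8859-1", lambda b: b.decode("latin-1")),
        RegistryEntry("EXAMPLE-REPERTOIRE-ONLY", None),  # invented label
    ]
    print(flag_defective(entries))
```

The check says nothing about whether an entry is widely implemented or
useful, only whether it meets the registry's own definition of a charset.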

Failing to make *some* effort here in effect changes the definition
of charset from:

"a mapping from octets to characters"

to:

"an entry in this registry, which may or may not have a mapping
 from octets to characters"

--Ken
Received on Thursday, 29 August 2002 18:59:35 UTC