- From: Kenneth Whistler <kenw@sybase.com>
- Date: Thu, 29 Aug 2002 15:58:44 -0700 (PDT)
- To: ned.freed@mrochek.com
- Cc: ietf-charsets@iana.org
Ned,

I was going to stay out of this, but I am a little troubled by the brushoff you appear to be giving to Marcus' concerns. I understand that "cleaning up the IANA charset registry" is a blackhole for effort, and has a marginal benefit to effort tradeoff, but when an IBM character mapping specialist brings to your attention identification of registrations of IBM related repertoires that can only be defective as registrations, it seems a relatively small task to mark them as such in the registry, so that other people don't trip over them.

> And I respectfully suggest that pondering the intent of such registrations is
> not a useful way to spend our time.
>
> I knew when I started it was a waste of time to point this out. I'll waste
> everyone's time with one more response on this and then I promise I'll shut up.
>
> Anyway, a character encoding scheme is a mapping from characters to octets, not
> the other way around.

I understand the distinction you are making here. The charset registry defines labels that allow a protocol to identify a byte stream and then, in principle, using whatever mechanism is associated with that registration, to decode that byte stream into a sequence of characters. Period. It takes no position on how characters are to be mapped into octets, or on the generic issues of mapping tables, round-trip mapping, and so on.

> I repeat: A charset is defined as mapping from octets to characters.

The problem, of course, is that for those IBM repertoires, in particular, that Marcus pointed out, there can be *no* mapping from octets to characters -- it is inherently and completely undefined. These have to be defective registrations.

Even with the ISO-10646 registrations, there is a problem if some indication of encoding scheme is not associated with the registration, since then in those cases, as well, the mapping *from* octets to characters is ambiguous -- hence, unusable.

Suppose I label some octets as charset=ISO-10646-Unicode-Latin1. Then are the octets:

0xC3 0xB0

to be mapped to <small-eth> (likely), or to <Hangul-SSYEOM> (and regarded as a data error, since outside the repertoire), or to <Hangul-NAELH> (also a data error), or even to <A-tilde, degree sign>?

Without further explicit designation that UTF-8 is involved, I'd only be guessing, and I'd be better off with character encoding heuristics than charset labels. In fact, given the "Latin1" part of the name, I'd speculate that most implementations would be more likely to turn this into character hash as Latin-1 than derive the probably correct answer.

So it seems to me that Marcus is right about this one as well -- it is simply a defective registration and ought to be marked as such to warn people off it.

> > If so, then please add clarifying text to the top of the list document, and
> > appropriate classification to at least non-charset entries.
>
> Not going to happen.

Why not? Presence of a registration in a list of things designated as "charsets" doesn't mean that it actually is a well-defined charset, by the definition of charset you and the registry are using. Why not make at least a minimal effort to specify those which don't even rise to this level of well-formedness? That is different from trying to make a classification of useful versus useless entries on grounds of widespread implementation or any other such criteria, for example.

Failing to make *some* effort here in effect changes the definition of charset from:

"a mapping from octets to characters"

to:

"an entry in this registry, which may or may not have a mapping from octets to characters"

--Ken
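
The 0xC3 0xB0 ambiguity described above can be reproduced with a short Python sketch. This is an illustration only, not anything the registry or any implementation prescribes: the list of candidate encodings and the helper name `candidate_readings` are assumptions chosen to match the four readings named in the message (UTF-8, big-endian and little-endian 16-bit forms, and Latin-1).

```python
import unicodedata

# Hypothetical set of encoding schemes a receiver might guess at when the
# charset label does not say which one is in use.
CANDIDATES = ["utf-8", "utf-16-be", "utf-16-le", "latin-1"]


def candidate_readings(octets):
    """Decode the same octets under each candidate scheme.

    Returns a dict mapping the codec name to the list of Unicode character
    names obtained; schemes that cannot decode the octets are skipped.
    """
    readings = {}
    for encoding in CANDIDATES:
        try:
            text = octets.decode(encoding)
        except UnicodeDecodeError:
            continue
        # Fall back to a U+XXXX label for characters that have no name.
        readings[encoding] = [
            unicodedata.name(ch, "U+%04X" % ord(ch)) for ch in text
        ]
    return readings


if __name__ == "__main__":
    for encoding, names in candidate_readings(b"\xC3\xB0").items():
        print(encoding, "->", names)
```

All four candidates decode the two octets without error, each to different characters, so the octets alone give no way to pick among them; only a label that also identifies the encoding scheme does.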
Received on Thursday, 29 August 2002 18:59:35 UTC