- From: Martin Duerst <duerst@w3.org>
- Date: Fri, 30 Aug 2002 10:59:40 +0900
- To: Kenneth Whistler <kenw@sybase.com>, ned.freed@mrochek.com
- Cc: ietf-charsets@iana.org
At 15:58 02/08/29 -0700, Kenneth Whistler wrote:

>Even with the ISO-10646 registrations, there is a problem if some
>indication of encoding scheme is not associated with the registration,
>since then in those cases, as well, the mapping *from* octets to
>characters is ambiguous -- hence, unusable. Suppose I label some octets
>as charset=ISO-10646-Unicode-Latin1.

Looking at that registration, it says:

Name: ISO-10646-Unicode-Latin1
MIBenum: 1003
Source: ISO Latin-1 subset of Unicode. Basic Latin and Latin-1
        Supplement = collections 1 and 2. See ISO 10646,
        Appendix A. See RFC 1815.
Alias: csUnicodeLatin1
Alias: ISO-10646

Reading RFC 1815, it says:

>>>>
Description of "ISO-10646"

ISO-10646 is profiled to be the most basic part of the family of
encodings based on ISO 10646 and contains the following minimal
graphic characters:

collection number and name    positions    further restriction
------------------------------------------------------------------
1 BASIC LATIN                 0020-007E
2 LATIN-1 SUPPLEMENT          00A0-00FF

C0 and C1 control characters may also be used as specified in the
section 16 of ISO 10646.

The text with "ISO-10646" encodes text in 16 bit big endian form.

As no combining characters are included, "ISO-10646" can be used with
applications at implementation level 1.

Left-to-right directionality should be used.

The encoding is implemented by Windows/NT.

For practical communication, use of "ISO-10646" is discouraged.
"ISO-8859-1" [RFC1345] should be used instead.
>>>>

So it is clearly defined as big endian UCS-2 (or UTF-16).

The problem with this registration is not that it isn't a 'charset'.
Rather, the problem is that the IETF allowed such a registration and the
accompanying RFC to go through, even though
1) even the author says that he discourages use of his definitions,
2) nobody else was/is really interested, and
3) having one of the clearest opponents of ISO 10646 squat on
   the label 'ISO-10646' is highly problematic.
But maybe that was the easiest way to deal with a well-known troublemaker.

>Without further explicit designation that UTF-8 is involved, I'd only
>be guessing, and I'd be better off with character encoding heuristics
>than charset labels. In fact, given the "Latin1" part of the name,
>I'd speculate that most implementations would be more likely to
>turn this into character hash as Latin-1 than derive the probably
>correct answer.

I hope most applications will just say
"'charset' ISO-10646-Unicode-Latin1 unknown".
In this case, that's the right thing to do. There is no requirement
to implement all charsets, nor is there a requirement to implement
all aliases.

> > > If so, then please add clarifying text to the top of the list document, and
> > > appropriate classification to at least non-charset entries.
> >
> > Not going to happen.

I think that for those registrations that we find define only
a repertoire, it is clearly appropriate to update the registration.
I think the best way to do this would be to write a registration
request that can be used to update the registration. That request can
then be discussed here on this list like new registrations, and the
registry can be updated once consensus is reached.

Of course, if it's about IBM-related registrations, it may be best
if the update request comes from a specialist at IBM.

Regards, Martin.
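
To make the two technical points of the message concrete, here is a minimal Python sketch. It reads the RFC 1815 text quoted above as "16-bit big-endian code units restricted to collections 1 and 2", and it rejects charset labels that an application has not chosen to implement. The function names, the `SUPPORTED` set, and the `allow_controls` flag are hypothetical and appear in neither the registry nor the RFC. Taking the C0 and C1 controls to be 0000-001F and 0080-009F is likewise an assumption of this sketch.

```python
# Hypothetical illustration only; not an implementation from the registry or RFC 1815.

# Graphic-character ranges quoted from RFC 1815 (collections 1 and 2).
GRAPHIC_RANGES = [(0x0020, 0x007E), (0x00A0, 0x00FF)]
# Assumed C0/C1 control ranges (the RFC says controls "may also be used").
CONTROL_RANGES = [(0x0000, 0x001F), (0x0080, 0x009F)]

# Assumption: the charsets this example application chooses to implement.
SUPPORTED = {"utf-8", "iso-8859-1"}


def decode_iso10646_profile(octets: bytes, allow_controls: bool = False) -> str:
    """Decode octets as 16-bit big-endian code units limited to the profile."""
    if len(octets) % 2 != 0:
        raise ValueError("odd number of octets; not a sequence of 16-bit units")
    ranges = GRAPHIC_RANGES + (CONTROL_RANGES if allow_controls else [])
    chars = []
    for i in range(0, len(octets), 2):
        code = int.from_bytes(octets[i:i + 2], "big")
        if not any(lo <= code <= hi for lo, hi in ranges):
            raise ValueError(f"U+{code:04X} is outside the profiled repertoire")
        chars.append(chr(code))
    return "".join(chars)


def resolve_charset(label: str) -> str:
    """Return the normalized label, or fail loudly if it is not implemented."""
    name = label.strip().lower()
    if name not in SUPPORTED:
        raise LookupError(f"'charset' {label} unknown")
    return name
```

With this sketch, `decode_iso10646_profile(b"\x00A\x00\xe9")` returns the two characters `A` and `é`. Decoding the same four octets as Latin-1 instead yields four characters with NULs interleaved, which is the "character hash" the quoted text warns about. Calling `resolve_charset("ISO-10646-Unicode-Latin1")` raises `LookupError` rather than guessing at a decoding.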
Received on Thursday, 29 August 2002 22:21:13 UTC