Re: some IANA registrations look like repertoires not charsets? from Martin Duerst on 2002-08-30 (ietf-charsets@w3.org from July to September 2002)

From: Martin Duerst <duerst@w3.org>
Date: Fri, 30 Aug 2002 10:59:40 +0900
To: Kenneth Whistler <kenw@sybase.com>, ned.freed@mrochek.com
Cc: ietf-charsets@iana.org
Message-id: <4.2.0.58.J.20020830090201.02955e70@localhost>

At 15:58 02/08/29 -0700, Kenneth Whistler wrote:
>Even with the ISO-10646 registrations, there is a problem if some
>indication of encoding scheme is not associated with the registration,
>since then in those cases, as well, the mapping *from* octets to
>characters is ambiguous -- hence, unusable. Suppose I label some octets
>as charset=ISO-10646-Unicode-Latin1.

Looking at that registration, it says:

Name: ISO-10646-Unicode-Latin1
MIBenum: 1003
Source: ISO Latin-1 subset of Unicode. Basic Latin and Latin-1
          Supplement  = collections 1 and 2.  See ISO 10646,
          Appendix A.  See RFC 1815.
Alias: csUnicodeLatin1
Alias: ISO-10646

Reading RFC 1815, it says:

 >>>>
Description of "ISO-10646"

    ISO-10646 is profiled to be the most basic part of the family of
    encodings based on ISO 10646 and contains the following minimal
    graphic characters:

       collection number and name      positions      further restriction
       ------------------------------------------------------------------
       1 BASIC LATIN                   0020-007E
       2 LATIN-1 SUPPLEMENT            00A0-00FF

    C0 and C1 control characters may also be used as specified in the
    section 16 of ISO 10646.

    The text with "ISO-10646" encodes text in 16 bit big endian form.

    As no combining characters are included, "ISO-10646" can be used with
    applications at implementation level 1.

    Left-to-right directionality should be used.

    The encoding is implemented by Windows/NT.

    For practical communication, use of "ISO-10646" is discouraged.
    "ISO-8859-1" [RFC1345] should be used instead.
 >>>>

So it is clearly defined as big endian UCS-2 (or UTF-16).
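
To make that concrete, here is a minimal sketch (mine, not part of the
registration or of RFC 1815) of what a receiver would have to do with
octets carrying this label. The function name decode_iso10646_latin1 is
hypothetical, and the sketch assumes, per the description above, 16-bit
big-endian code units limited to the range 0000-00FF:

```python
def decode_iso10646_latin1(octets: bytes) -> str:
    """Decode 16-bit big-endian code units, restricted to U+0000..U+00FF.

    Assumption: this follows the RFC 1815 description of "ISO-10646"
    (Basic Latin, Latin-1 Supplement, plus C0/C1 controls).
    """
    if len(octets) % 2 != 0:
        raise ValueError("odd number of octets; not 16-bit data")
    chars = []
    for i in range(0, len(octets), 2):
        # Each character is two octets, most significant octet first.
        code = int.from_bytes(octets[i:i + 2], "big")
        if code > 0xFF:
            raise ValueError(f"U+{code:04X} is outside the Latin-1 repertoire")
        chars.append(chr(code))
    return "".join(chars)


sample = "café".encode("utf-16-be")
print(decode_iso10646_latin1(sample))   # decoded as 16-bit big endian
print(repr(sample.decode("latin-1")))   # misread as one octet per character
```

The second print shows what happens when the same octets are treated as
ISO-8859-1: every other character comes out as NUL.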

The problem with this registration is not that it isn't a 'charset',
it's rather that the IETF allowed such a registration and the accompanying
RFC to go through, though 1) even the author says that he discourages
use of his definitions, 2) nobody else was/is really interested, and
3) having one of the clearest opponents of ISO 10646 squat on the
label 'ISO-10646' is highly problematic.

But maybe that was the easiest way to deal with a well-known
troublemaker.

>Without further explicit designation that UTF-8 is involved, I'd only
>be guessing, and I'd be better off with character encoding heuristics
>than charset labels. In fact, given the "Latin1" part of the name,
>I'd speculate that most implementations would be more likely to
>turn this into character hash as Latin-1 than derive the probably
>correct answer.

I hope most applications will just say
    "'charset' ISO-10646-Unicode-Latin1 unknown"
In this case, that's the right thing to do. There is no requirement
to implement all charsets, nor is there a requirement to
implement all aliases.
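
As a sketch of that behaviour (an illustration only; the table SUPPORTED
and the function decode_labelled are hypothetical names, and the set of
labels is an arbitrary assumption), a receiver can keep an explicit list
of the labels it implements and refuse everything else rather than guess:

```python
# Labels this hypothetical application chooses to implement, mapped to
# the names of Python's built-in codecs.
SUPPORTED = {
    "utf-8": "utf-8",
    "iso-8859-1": "latin-1",
    "us-ascii": "ascii",
}


def decode_labelled(octets: bytes, label: str) -> str:
    """Decode octets only if the charset label is one we implement."""
    codec = SUPPORTED.get(label.strip().lower())
    if codec is None:
        # No guessing from parts of the name such as "Latin1".
        raise LookupError(f"'charset' {label} unknown")
    return octets.decode(codec)


print(decode_labelled(b"caf\xe9", "ISO-8859-1"))
try:
    decode_labelled(b"\x00c\x00a", "ISO-10646-Unicode-Latin1")
except LookupError as err:
    print(err)
```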

> > > If so, then please add clarifying text to the top of the list
> > > document, and appropriate classification to at least non-charset
> > > entries.
> >
> > Not going to happen.

I think that for those registrations that we find out define
only a repertoire, it is clearly appropriate to update the
registration. I think the best way to do this would be to
write a registration request that can be used to update the
registration. That can then be discussed here on this list
like new registrations, and the registry be updated once
consensus is reached.

Of course if it's about IBM-related registrations, it may
be best if the update request comes from a specialist at
IBM.

Regards,    Martin.

Received on Thursday, 29 August 2002 22:21:13 UTC