Re: proprietary charset identifiers from Jim Melton on 2002-06-14 (www-i18n-comments@w3.org from June 2002)

From: Jim Melton <jim.melton@acm.org>
Date: Thu, 13 Jun 2002 21:04:57 -0600
To: Martin Duerst <duerst@w3.org>
Cc: jim.melton@acm.org (Jim Melton), www-i18n-comments@w3.org, w3c-i18n-ig@w3.org, w3c-xml-query-wg@w3.org
Message-Id: <4.3.2.7.2.20020613204831.067d9bc0@gmstimap.oraclecorp.com>

Martin,

Thanks for the note; my response is delayed partly because of travel and 
partly because I wanted to meet with the Functions & Operators people to 
see if the subject came up (it didn't).

At 02:08 PM 2002-06-05 +0900 Wednesday, Martin Duerst wrote:
>Hello Jim, dear XML Query WG,
>
>We discussed this comment of yours at our teleconference
>yesterday, and I was actioned to convey our decision to you.
>
>At 18:39 02/05/31 +0900, Jim Melton wrote:
>
>>This is a last call comment from Jim Melton (jim.melton@acm.org) on
>>the Character Model for the World Wide Web 1.0
>>(http://www.w3.org/TR/2002/WD-charmod-20020430/).
>>
>>Semi-structured version of the comment:
>>
>>Submitted by: Jim Melton (jim.melton@acm.org)
>>Submitted on behalf of (maybe empty): W3C XML Query Working Group
>>Comment type: editorial
>>Chapter/section the comment applies to: 3.2 Digital Encoding of Characters
>>The comment will be visible to: public
>>Comment title: proprietary charset identifiers
>>Comment:
>>Section 3.2, "Digital Encoding of Characters", list element 4, contains 
>>the phrase "... is identified by an IANA charset identifier."
>>
>>In fact, there are a great many CESes that are identified by charset 
>>identifiers that are not assigned by IANA at all, but that are "created" 
>>by proprietary means (e.g., corporations).  The Character Model 
>>specification must not prohibit the use of CESes identified by charset 
>>identifiers assigned through other means.
>>
>>To correct this, simply change "...is identified by an IANA charset 
>>identifier." to "...is identified by a unique identifier, such as an IANA 
>>charset identifier."
>
>However, working on the details today, I discovered that
>it may be better to request a clarification from you first.

I'll try...assuming *I* understand enough to clarify ;^)

>You request that section 3.2 mentions other identifiers for
>character encodings than those registered by IANA. But
>Section 3.2 just mentions the labels as part of the overall
>model. Details of what encodings to use or not to use,
>and what labels to use for them, are given in Section 3.6.2
>(http://www.w3.org/TR/2002/WD-charmod-20020430/#sec-EncodingIdent).
>
>Section 3.6.2 also has a very strong emphasis on IANA labels,
>because using labels from a single registry is the only way
>to avoid conflicts, and the IANA registry is the registry
>used on the Internet (and the Web is part of the Internet).

I understand this much, at least.

>Given this, can you please clarify whether the Query WG meant that:
>
>a) changing "...is identified by an IANA charset identifier." to
>    "...is identified by a unique identifier, such as an IANA
>    charset identifier." is appropriate in Section 3.2 because
>    this is a general discussion, and any set of unique identifiers
>    could do, and specifics are discussed in 3.6.2.
>
>b) The change was intended to make sure that encoding identifiers
>    other than those registered with IANA would conform to the
>    character model; Section 3.6.2 would have to be changed, too.
>
>Yesterday, we forgot about 3.6.2, but assumed the intent of b).
>If b) is your intent, please find our answer below. If your intent
>was a), or something else, we will have to reconsider your comment.

*My* intent, when I drafted the comment for discussion by the Query WG, was 
more nearly b) than a), but I am reluctant to commit to b) (and not for the 
reason that you reject the comment based on assumption b)!).  [I emphasized 
"*My*" because I did not hear this discussed with the Query WG and do not 
wish to infer their intent.]  Instead of writing a bunch of text here, I'll 
respond as we proceed below:

><assumption value='b)'>
>First, please note that your classification of this comment was
>'editorial', but we have decided to reclassify it as 'substantial'.
>
>Second, we have decided to reject this comment, based on the
>following reasons:
>
>- IANA charset identifiers (except for those starting with x-) are
>   guaranteed to be unique. Adding any other set(s) of identifiers
>   to the IANA identifiers very quickly removes this guarantee.
>   Because of that, your proposed change can either be seen as an
>   unnecessary addition, putting in more words but, under careful
>   analysis, not saying anything different, or it can be misunderstood
>   by readers to guarantee some uniqueness when indeed such a guarantee
>   is not possible.
>   [If you know about some trick to guarantee uniqueness among different
>    sets of identifiers, then we sure would like to know.]

Of course, without a registry, there is no ability to *guarantee* 
uniqueness.  However, in certain application situations (because of known 
scope to environments, for example) it is possible to ensure (even 
guarantee) uniqueness without having to resort to an *external* registry 
such as IANA.  For example, for documents that are used (privately) in a 
single enterprise, where it is known that the enterprise has guaranteed 
such uniqueness, the services of an external entity such as IANA are not 
needed or even useful.
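
A minimal Python sketch of the scoping point, using hypothetical labels and 
encoding names that do not come from any real registry: labels in one 
enterprise's private table stay unique within that enterprise, and a 
conflict can appear only when separately maintained tables are combined. 
Labels are compared case-insensitively, as charset names are.

```python
def merge_label_tables(*tables):
    """Merge label -> encoding tables; return (merged, conflicts).

    A conflict is a label (compared case-insensitively) that different
    tables map to different encodings.
    """
    merged, conflicts = {}, {}
    for table in tables:
        for label, encoding in table.items():
            key = label.lower()
            # Keep the first mapping seen; record any later disagreement.
            previous = merged.setdefault(key, encoding)
            if previous != encoding:
                conflicts.setdefault(key, {previous}).add(encoding)
    return merged, conflicts


# Hypothetical private tables for two unrelated enterprises.
enterprise_a = {"x-corp-kanji": "vendor-a-kanji-v1"}
enterprise_b = {"X-Corp-Kanji": "vendor-b-kanji-v3"}

# Within one enterprise's scope there is nothing to collide with.
_, conflicts_alone = merge_label_tables(enterprise_a)

# Combining the two scopes exposes the clash on the shared label.
_, conflicts_combined = merge_label_tables(enterprise_a, enterprise_b)
```

Here `conflicts_alone` is empty, while `conflicts_combined` reports the 
shared label with both encodings, which is the situation a single external 
registry is meant to prevent.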

>- IANA does not 'assign' identifiers, it just registers them.

Of course you are correct; my words were careless, even though I understand 
the distinction and the situation.  Apologies!

>   Anybody can apply for registration. A few years ago, there was
>   a tendency to restrict registration to widely used/usable encodings,
>   but this led to the de facto use of many unregistered encodings
>   with an x- prefix. Registration practice has changed to be very
>   liberal now, while making sure that each registration notes duly
>   whether the encoding in practice is suitable for the use on the
>   Internet at large. If any corporation represented in the XML
>   Query WG or elsewhere uses encodings that are not registered
>   with IANA, we strongly recommend registering them.

Recommend away.  It will *never* be the case that 100% of all such 
encodings in use by every enterprise on the planet are registered.  And 
those enterprises *will* find uses for XML in their environments.  One of 
my (and others') difficulties with some assumptions in and behind the 
character model is that no use of XML is deemed "valid" unless it adheres 
to a potentially very large set of restrictions.  That is not, IMHO, the 
way to make XML maximally used or useful, although it is certainly 
appropriate to urge whenever true "world-wide" use of data and applications 
is planned.

>- The IANA registry already contains registrations for many (some
>   even say too many) proprietary encodings. Indeed, the majority
>   of encodings registered are proprietary encodings rather than
>   encodings defined by standards organizations. There is quite
>   some chance that your encoding is already registered. Please check.

I have.  They're not.  And if "some...say too many", that doesn't sound 
like the community at large really wants to see more and more private 
encodings being registered.  In fact, some of my employer's encodings are 
registered and some are not.  Will they all be, some day?  Who knows?  It's 
not really a priority (especially in this economy!).  But our customers 
still want to use the encodings.

An important point that I'm trying to make is this: The more restrictions 
that are placed on "applications" (broadest sense) in order to be 
"conforming", and the more those restrictions are viewed as "rules for the 
sake of having rules" instead of adding real value, the more people will 
choose not to *claim* conformance...after which they will quit following 
even the rules that make sense.  Balance!  That's the key!

>- The IANA registry already contains many (some even say too many)
>   aliases for most encodings. There is quite some chance that the
>   identifier used inside your corporation is already an alias.

Many are.  Some are not.  The same statements made above apply.

>Please tell us, at your earliest convenience, whether you are
>satisfied with our decision or not. If not, please provide
>additional rationale.
></assumption>

Unfortunately, I missed the Query WG teleconference this week due to my 
travel, so we didn't have a chance to discuss the subject (at least not 
with me participating).  However, I assure you that I am not satisfied with 
the decision and I will recommend (with, I expect, considerable support) 
that the Query WG respond that it is not satisfied.

This is a really important issue, along with the question of who/when for 
normalization, over which we continue to disagree.  I hope devoutly that we 
can reach a common ground whereby the I18n's very important task of making 
the Web truly accessible to all is balanced with the database-oriented 
Query and Schema vendors' need to satisfy their real-world customer 
requirements.

Thanks very much for continuing the dialog,
    Jim
========================================================================
Jim Melton --- Editor of ISO/IEC 9075-* (SQL)     Phone: +1.801.942.0144
Oracle Corporation            Oracle Email: mailto:jim.melton@oracle.com
1930 Viscounti Drive          Standards email: mailto:jim.melton@acm.org
Sandy, UT 84093-1063              Personal email: mailto:jim@melton.name
USA                                                Fax : +1.801.942.3345
========================================================================
=  Facts are facts.  However, any opinions expressed are the opinions  =
=  only of myself and may or may not reflect the opinions of anybody   =
=  else with whom I may or may not have discussed the issues at hand.  =
========================================================================
Received on Friday, 14 June 2002 03:08:14 UTC