Re: 'x-' prefix on charset names from Martin Duerst on 2002-10-22 (www-international@w3.org from October to December 2002)

From: Martin Duerst <duerst@w3.org>
Date: Wed, 23 Oct 2002 07:49:41 +0900
To: Dan Chiba <dan.chiba@oracle.com>
Cc: www-international@w3.org, www-i18n-comments@w3.org
Message-Id: <4.2.0.58.J.20021023073419.039c2a10@localhost>

Hello Dan,

At 11:33 02/10/22 -0700, Dan Chiba wrote:
>Hello Martin,
>
>Thank you very much for your clarification. Could you comment
>in a little more detail, please?
>
>There is no doubt that unregistered names should see only limited
>use and that other options are encouraged. Having said that, if
>one needed to opt for option c, would W3C suggest using a raw
>unregistered name rather than a name prefixed with x-?

If you really have to use an unregistered name, then you should
use a name starting with x-. If you don't do that, you risk that
somebody else also uses the same name (you have that risk with
x-, too), AND that this other person gets his/her name registered,
at which point you'll be in serious trouble.
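
A minimal sketch of that rule in Python, assuming a small hand-picked
sample of registered names (REGISTERED_SAMPLE is hypothetical and far
smaller than the real IANA registry; classify_label and private_label
are illustrative names, not part of any specification or library):

```python
# Hypothetical sample of registered names; the real IANA registry is much larger.
REGISTERED_SAMPLE = {"utf-8", "iso-8859-1", "shift_jis", "euc-jp", "us-ascii"}


def classify_label(label, registered=REGISTERED_SAMPLE):
    """Classify a charset label as 'registered', 'x-private' or 'unregistered'."""
    name = label.strip().lower()  # charset names are compared case-insensitively
    if name in registered:
        return "registered"
    if name.startswith("x-"):
        return "x-private"
    # A raw unregistered name: the risky case, because somebody else
    # may later register the same name with a different meaning.
    return "unregistered"


def private_label(base):
    """Return an x- prefixed label for a charset that is not registered."""
    name = base.strip().lower()
    return name if name.startswith("x-") else "x-" + name
```

With this sketch, classify_label("hiero-test-1") reports "unregistered",
and private_label("hiero-test-1") gives "x-hiero-test-1", the form that
cannot collide with a later registration as long as registered names do
not start with x-.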


>Major specifications such as HTTP and XML do not completely
>prohibit using unregistered names or arbitrary names.

Yes, they are a bit vague in this area. We hope that
this will change over time.

But please note that it's not very easy to test whether e.g.
an XML implementation does the right thing. First, it doesn't
have to understand all MIME registered encodings, so it's
allowed to tell you 'unknown encoding' even e.g. for 'iso-8859-1'.
Second, you would have to find an encoding name that it accepts
but that is not registered. You could try with millions of
randomly generated labels and not hit one. The only reasonable
way to find a problem is to look at the documentation or at
the source code (which will usually not be available).
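
To illustrate why random probing is unlikely to help, here is a rough
sketch that uses Python's own codec registry as a stand-in for "an
implementation" (an assumption made for illustration; an XML parser
would have to be probed through its own interface). The helper names
accepts and count_random_hits are hypothetical:

```python
import codecs
import random
import string


def accepts(label):
    """Return True if this Python build recognizes the encoding label."""
    try:
        codecs.lookup(label)
        return True
    except LookupError:
        return False


def count_random_hits(tries=1000, length=8, seed=0):
    """Probe the codec registry with random lowercase labels and count hits."""
    rng = random.Random(seed)  # seeded so that runs are repeatable
    hits = 0
    for _ in range(tries):
        label = "".join(rng.choice(string.ascii_lowercase) for _ in range(length))
        if accepts(label):
            hits += 1
    return hits
```

A call such as count_random_hits(1000) is expected to find few or no
accepted labels, which says nothing about whether the implementation
accepts some unregistered name; only the documentation or the source
can answer that.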


>I was
>not sure if the description I cited implies recommending
>option c1 rather than c2, in addition to more preferred
>options like a and b.
>
>  Option
>  c1. Use an arbitrary charset name without an x- prefix
>  c2. Use the 'x-' convention

Taking your citation in a bit more context, I can see what you
mean (from
http://www.w3.org/TR/2002/WD-charmod-20020430/#sec-EncodingIdent):

[S] If the unique encoding approach is not taken, specifications SHOULD 
mandate the use of the IANA charset registry names, and in particular the 
names identified in the registry as 'MIME preferred names', to designate 
character encodings in protocols, data formats and APIs.  [S] The 'x-' 
convention for unregistered character encoding names SHOULD NOT be used, 
having led to abuse in the past.

What you are saying is that you are in a situation where you
can respect neither the SHOULD in the first sentence nor the SHOULD
NOT in the second sentence, and it's unclear which one of them
is stronger. I propose that we have a look at this in the WG
and make clear that in such a case, using x- is better than
not using x-.

Regards,    Martin.


>Regards,
>-Dan
>
>Martin Duerst wrote:
> >
> > Hello Dan,
> >
> > Many thanks for your question.
> >
> > At 14:11 02/10/21 -0700, Dan Chiba wrote:
> >
> > >Hello,
> > >
> > >I have a question regarding the 'x-' convention used to
> > >indicate that a charset is not registered at the IANA registry.
> > >Is it prohibited to use an unregistered charset at one's own risk?
> > >
> > >According to the latest CharMod paper, the convention is
> > >discouraged as follows (Excerpt from Section 3.6.2):
> > >
> > >   [S] The 'x-' convention for unregistered character encoding
> > >   names SHOULD NOT be used, having led to abuse in the past.
> > >   ('x-' was used for character encodings that were widely used,
> > >   even long after there was an official registration.)
> > >
> > >My question is about the intent of this. If an unregistered
> > >charset is used, you are forced to avoid the convention
> > >for compliance. I think there are good reasons to avoid it, but
> > >what should be the options to take?
> > >
> > >Among the following viable alternatives that I can think of, I
> > >understand W3C is in the position of recommending options a and b.
> > >
> > >  a. Use a registered charset instead (May or may not be feasible)
> > >  b. Get the charset registered (May take time)
> > >  c. Use the unregistered charset (Need bilateral agreement)
> > >
> > >It is not clear to me if W3C intends to prohibit option c. Could
> > >somebody clarify the intent, please?
> >
> > I think your reading of what the Character Model says is correct.
> > Option c) is not completely prohibited, but I think the cases
> > where it could be used are very limited. I can imagine the
> > following:
> >
> > - Some researchers are working on an encoding for Egyptian Hieroglyphs.
> >    They want to work out the details before registering. So they
> >    create something like x-hiero-test-1, x-hiero-test-2, and so on.
> >    Once they think they know what they need, they register it, and
> >    use the registered name.
> >
> > - A company wants to test their software with dummy data, and dummy
> >    'charset's, e.g. to check how they can upgrade their software to
> >    deal with new 'charset's. In this case, using x-dummy-1,... would
> >    come in handy.
> >
> > There may be other, similar cases. But in general, go for a) or b).
> > b) may indeed take some time, but it can be as short as two weeks
> > (a minimum period of 2 weeks is necessary to give everybody a
> > chance to comment).
> >
> > Regards,   Martin.
Received on Tuesday, 22 October 2002 20:22:44 UTC