Re: 'x-' prefix on charset names

Martin, 

Martin Duerst wrote:
> 
> Hello Dan,
> 
> At 11:33 02/10/22 -0700, Dan Chiba wrote:
> >Hello Martin,
> >
> >Thank you very much for your clarification. Could I have your
> >comments in a little more detail, please?
> >
> >There is no doubt about limited use of unregistered names and
> >it is encouraged to use other options, but having said that, if
> >one needed to opt for option c, would W3C suggest using a raw
> >unregistered name rather than a name prefixed with x-?
> 
> If you really have to use an unregistered name, then you should
> use a name starting with x-. If you don't do that, you risk that
> somebody else also uses the same name (you have that risk with
> x-, too), AND that this other person gets his/her name registered,
> at which point you'll be in serious trouble.

Yes, that situation is indeed confusing. 
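
To make the distinction concrete, here is a rough sketch in Python of 
how a label could be sorted into the three cases discussed here. It is 
illustrative only: REGISTERED_SAMPLE is a tiny hand-picked sample, not 
the IANA registry, and classify_label is a name made up for this example. 

```python
# Illustrative sample only; the authoritative list is the IANA charset registry.
REGISTERED_SAMPLE = {"utf-8", "us-ascii", "iso-8859-1", "shift_jis", "euc-jp"}


def classify_label(label, registered=REGISTERED_SAMPLE):
    """Return 'registered', 'x-prefixed', or 'raw-unregistered' for a label."""
    name = label.strip().lower()  # charset names are compared case-insensitively
    if name in registered:
        return "registered"
    if name.startswith("x-"):
        # Unregistered, but marked as such (option c2).
        return "x-prefixed"
    # Unregistered and unmarked (option c1): may collide with a later registration.
    return "raw-unregistered"


for label in ["UTF-8", "x-hiero-test-1", "hiero-test-1"]:
    print(label, "->", classify_label(label))
```

Only the last case can silently change meaning if somebody later 
registers the same name. 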

> >Major specifications such as HTTP and XML do not completely
> >prohibit using unregistered names or arbitrary names.
> 
> Yes, they are a bit vague in this area. We hope that
> this will change over time.

Yes, and that has already led to unregistered names being allowed. 
For example, I think it is common for an XML processor to support 
non-IANA names, such as Java encoding names. I do not think that is 
bad in itself: it is convenient and, although discouraged, still 
conformant with the current standards. The real problem is a 
collision of names. Fortunately, every standard respects the 
registry and discourages the use of unregistered charsets. For the 
sake of interoperability, people favor registered names, so I think 
the trend will move toward the recommendation of CharMod. 
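
As a small illustration of how one might see which labels a given 
decoder accepts, here is a sketch that uses Python's codecs.lookup as a 
stand-in for an XML processor's encoding table. That substitution is my 
assumption, and accepted_labels is a hypothetical helper name. 

```python
import codecs


def accepted_labels(labels, lookup=codecs.lookup):
    """Return the labels that `lookup` resolves without raising LookupError."""
    accepted = []
    for label in labels:
        try:
            lookup(label)
        except LookupError:
            continue  # this decoder does not know the label
        accepted.append(label)
    return accepted


print(accepted_labels(["utf-8", "iso-8859-1", "x-no-such-charset-12345"]))
```

Any label in the result that is absent from the IANA registry is one 
the decoder accepts beyond the registered set. 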

> But please note that it's not very easy to test whether e.g.
> an XML implementation does the right thing. First, it doesn't
> have to understand all MIME registered encodings, so it's
> allowed to tell you 'unknown encoding' even e.g. for 'iso-8859-1'.
> Second, you would have to find an encoding name that it accepts
> but that is not registered. You could try with millions of
> random-generated labels and not hit one. The only reasonable
> way to find a problem is to look at the documentation or at
> the source code (which will usually not be available).

To avoid this confusing situation, I think a charset should never 
be registered at IANA under a name identical to that of a widely 
used unregistered charset (e.g. from Java or other vendors) if its 
definition differs from that of the charset in wide use. 

> >I was
> >not sure if the description I cited implies recommending
> >option c1 rather than c2, in addition to more preferred
> >options like a and b.
> >
> >  Option
> >  c1. Use an arbitrary charset name without an x- prefix
> >  c2. Use the 'x-' convention
> 
> Taking your citation in a bit more context, I can see what you
> mean (from
> http://www.w3.org/TR/2002/WD-charmod-20020430/#sec-EncodingIdent):
> 
> [S] If the unique encoding approach is not taken, specifications SHOULD
> mandate the use of the IANA charset registry names, and in particular the
> names identified in the registry as 'MIME preferred names', to designate
> character encodings in protocols, data formats and APIs.  [S] The 'x-'
> convention for unregistered character encoding names SHOULD NOT be used,
> having led to abuse in the past.
> 
> What you are saying is that you are in a situation where you
> can't respect the SHOULD in the first sentence nor the SHOULD
> NOT in the second sentence, and it's unclear which one of them
> is stronger. I propose that we have a look at this in the WG
> and make clear that in such a case, using x- is better than
> not using x-.

Yes, that is precisely what I meant. Thank you very much for 
taking my question to the WG. 

Regards,
-Dan

> Regards,    Martin.
> 
> >Regards,
> >-Dan
> >
> >Martin Duerst wrote:
> > >
> > > Hello Dan,
> > >
> > > Many thanks for your question.
> > >
> > > At 14:11 02/10/21 -0700, Dan Chiba wrote:
> > >
> > > >Hello,
> > > >
> > > >I have a question regarding the 'x-' convention used to
> > > >indicate that a charset is not registered at the IANA registry.
> > > >Is it prohibited to use an unregistered charset at one's own risk?
> > > >
> > > >According to the latest CharMod paper, the convention is
> > > >discouraged as follows (Excerpt from Section 3.6.2):
> > > >
> > > >   [S] The 'x-' convention for unregistered character encoding
> > > >   names SHOULD NOT be used, having led to abuse in the past.
> > > >   ('x-' was used for character encodings that were widely used,
> > > >   even long after there was an official registration.)
> > > >
> > > >My question is about the intent of this. If an unregistered
> > > >charset is used, one is forced to avoid the convention
> > > >for compliance. I think there are good reasons to avoid it, but
> > > >what options should one take?
> > > >
> > > >Among the following viable alternatives that I can think of, I
> > > >understand W3C is in the position of recommending options a and b.
> > > >
> > > >  a. Use a registered charset instead (May or may not be feasible)
> > > >  b. Get the charset registered (May take time)
> > > >  c. Use the unregistered charset (Need bilateral agreement)
> > > >
> > > >It is not clear to me if W3C intends to prohibit option c. Could
> > > >somebody clarify the intent, please?
> > >
> > > I think your reading of what the Character Model says is correct.
> > > > Option c) is not completely prohibited, but I think the cases
> > > where it could be used are very limited. I can imagine the
> > > following:
> > >
> > > - Some researchers are working on an encoding for Egyptian Hieroglyphs.
> > >    They want to work out the details before registering. So they
> > >    create something like x-hiero-test-1, x-hiero-test-2, and so on.
> > >    Once they think they know what they need, they register it, and
> > >    use the registered name.
> > >
> > > - A company wants to test their software with dummy data, and dummy
> > >    'charset's, e.g. to check how they can upgrade their software to
> > >    deal with new 'charset's. In this case, using x-dummy-1,... would
> > >    come in handy.
> > >
> > > There may be other, similar cases. But in general, go for a) or b).
> > > b) may indeed take some time, but it can be as short as two weeks
> > > (a minimum period of 2 weeks is necessary to give everybody a
> > > chance to comment).
> > >
> > > Regards,   Martin.

Received on Thursday, 24 October 2002 23:25:14 UTC