
Re: [wmvs] do we still need charset.cfg to list the "acceptable" character encodings?

From: Martin Duerst <duerst@it.aoyama.ac.jp>
Date: Thu, 24 May 2007 17:57:23 +0900
Message-Id: <6.0.0.20.2.20070524172802.0391a390@localhost>
To: Terje Bless <link@pobox.com>
Cc: QA-dev Dev <public-qa-dev@w3.org>, Bjoern Hoehrmann <derhoermi@gmx.net>

At 16:40 07/05/24, Terje Bless wrote:
>link@pobox.com (Terje Bless) wrote:
>
>>ot@w3.org (olivier Thereaux) wrote:
>>
>>>Sounds reasonable, but what's the policy? And where does it come from?
>>
>>The policy is that nothing that's not registered with IANA will be
>>accepted, and it comes from me. :-)

Yes. This was what motivated the creation of charset.cfg.

>To elaborate somewhat[0];
>
>charset.cfg is an implementation artifact and reflects limited tools.
>
>The planned “ideal” way for this to work was that charset.cfg be replaced with the actual IANA registry[1] such that what we whitelist is not what we happen to have had time to find and stuff in a config file, but what's actually registered.

Well, this is a nice idea in theory. The problem is that
a) Said registry is what you want to rely on when you can rely on it,
   but on occasion, it's not a good idea to rely on it. There are many
   registrations which just register parts of an encoding (a simple,
   maybe not actual, example, would be just the right (msb set) part of
   Latin-1). There are also occasionally cases where practices are very
   dominant, to the extent that going strictly with the registry would
   create too many complaints. This reflects the fact that we don't
   currently have any process for deprecating something from the
   registry.
b) There are cases, or there may be cases, where the transcoding implementation
   we use does not use the same labels as those in the registry, or uses
   these labels, but in a slightly suboptimal way. Therefore, charset.cfg
   (if I remember correctly) provides not only a list for checking, but
   actually a mapping.
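
The "list plus mapping" role described in b) could be sketched roughly as
follows. This is only an illustration: the table entries and the function
name are hypothetical and are not taken from the real charset.cfg.

```python
# Hypothetical table: declared label (lowercased) -> label handed to the
# transcoder. The entries are illustrative, not the real charset.cfg.
CHARSET_MAP = {
    "utf-8": "UTF-8",
    "iso-8859-1": "ISO-8859-1",
    "shift_jis": "SHIFT_JIS",
    "windows-1252": "CP1252",  # transcoder uses a different label
}


def transcoder_label(declared):
    """Return the transcoder's label for a declared charset.

    Returns None when the label is not in the table, i.e. not accepted.
    """
    return CHARSET_MAP.get(declared.strip().lower())
```

A single table covers both jobs: a missing key means "not acceptable", and
the value is whatever the transcoding implementation wants to be called with.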

>The IANA registry contains information on preferred MIME name etc. based on which we could emit warnings for non-preferred names.

Yes indeed. We should not make the validator willy-nilly accept any 'alias'
that happens to be accepted by the underlying implementation. GNU iconv,
for example, seems to have a policy of "if you ever see a label used for
something, add it". That's the only way to explain why with
    iconv -l | wc
I get 1144 on a Fedora box, and still 401 on cygwin (your mileage may vary).
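
The distinction between a preferred MIME name, a registered alias, and a
label that is not registered at all could be sketched like this. The table
is a small hand-picked excerpt for illustration; the function name and the
status strings are assumptions, and a real version would be driven by the
parsed registry rather than a literal dict.

```python
# Illustrative excerpt: registered label (lowercased) -> preferred MIME name.
# A real implementation would build this from the IANA registry.
PREFERRED = {
    "iso-8859-1": "ISO-8859-1",
    "latin1": "ISO-8859-1",
    "l1": "ISO-8859-1",
    "ibm819": "ISO-8859-1",
    "utf-8": "UTF-8",
}


def check_label(label):
    """Classify a declared charset label.

    Returns (status, preferred) where status is one of
    "ok", "non-preferred" or "unregistered".
    """
    key = label.strip().lower()
    preferred = PREFERRED.get(key)
    if preferred is None:
        # Not in the registry excerpt; the caller decides how to treat this.
        return ("unregistered", None)
    if key != preferred.lower():
        # Registered alias, but not the preferred MIME name.
        return ("non-preferred", preferred)
    return ("ok", preferred)
```

Whether "unregistered" then becomes a fatal error or only a warning is left
to the caller, since that policy question is still open.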

>Whether an unregistered encoding is a fatal error or a warning is debateable.

Yes. It's definitely an interoperability hazard, at least in the general case.


Regards,    Martin.

>A “charset.cfg” may still be needed, but then only for “exception” purposes such as bitching about vendor-specific charsets or usage boo boos (the -I variants and some Thai encodings, IIRC).
>
>
>
>[0] ― See <http://swhack.com/logs/2007-05-24#T07-12-02>.
>
>[1] ― Literally by parsing
>       <http://www.iana.org/assignments/character-sets>
>       instead of “charset.cfg”.
>
>-- 
>I have lobbied for the update and improvement of SGML. I've done it for years.
>I consider it the jewel for which XML is a setting.  It does deserve a bit of
>polishing now and then.                                        
>-- Len Bullard


#-#-#  Martin J. Du"rst, Assoc. Professor, Aoyama Gakuin University
#-#-#  http://www.sw.it.aoyama.ac.jp       mailto:duerst@it.aoyama.ac.jp     
Received on Friday, 25 May 2007 06:59:04 GMT
