Re: Proposal for additional Aliases to IANA registry of character sets from Martin Duerst on 2002-08-07 (ietf-charsets@w3.org from July to September 2002)

From: Martin Duerst <duerst@w3.org>
Date: Wed, 07 Aug 2002 10:54:01 +0900
To: Mark Davis <mark.davis@us.ibm.com>, ned.freed@mrochek.com
Cc: Chris.Newman@Sun.COM, ietf-charsets@iana.org, Uma Umamaheswaran <umavs@ca.ibm.com>
Message-id: <4.2.0.58.J.20020807103235.0461b328@localhost>

Hello Mark,

I agree with you that the IANA registry plays an important role,
in particular in the context of XML.

However, I think it's important to carefully distinguish registration
of not yet registered character encodings on the one hand, and addition
of aliases on the other hand.

At 13:29 02/08/06 -0700, Mark Davis wrote:

>For better or worse, the IANA registry is used as a central repository of 
>names for character set mappings. In particular, the XML Standard 
>(http://www.w3.org/TR/REC-xml) is driving 
>the registration of many encodings:

More exactly, http://www.w3.org/TR/REC-xml#charencoding


>4.3.3 Character Encoding in Entities
>...
>
>It is recommended that character encodings registered (as charsets) with 
>the Internet Assigned Numbers Authority 
><http://www.w3.org/TR/REC-xml#IANA>[IANA-CHARSETS], other than those just 
>listed, be referred to using their registered names; other encodings 
>should use names starting with an "x-" prefix. XML processors should match 
>character encoding names in a case-insensitive way and should either 
>interpret an IANA-registered name as the encoding registered at IANA for 
>that name or treat it as unknown (processors are, of course, not required 
>to support all IANA-registered encodings).
>...

Just before the text you cite, we find:

 >>>>
In an encoding declaration, the values "UTF-8", "UTF-16", "ISO-10646-UCS-2",
and "ISO-10646-UCS-4" should be used for the various encodings and
transformations of Unicode / ISO/IEC 10646, the values "ISO-8859-1",
"ISO-8859-2", ... "ISO-8859-n" (where n is the part number) should be used
for the parts of ISO 8859, and the values "ISO-2022-JP", "Shift_JIS", and
"EUC-JP" should be used for the various encoded forms of JIS X-0208-1997.
 >>>>

This makes it very clear that there is no need, for XML, to register
additional aliases, because XML already says that the MIME preferred
names should be used.

Of course, XML does not require an XML processor to understand any
character encoding except UTF-8 and UTF-16 (the latter always with a BOM).
Even the support of US-ASCII or iso-8859-1 is not required.

The XML Recommendation is not exactly clear on the following point:
If an XML processor accepts a particular encoding, is it required to
accept that encoding under all the aliases registered with IANA, or
is it okay to only accept some of the names, but not others?
For example, is an XML processor allowed to accept an XML document
starting with
     <?xml version='1.0' encoding='iso-8859-1' ?>
but reject one starting with
     <?xml version='1.0' encoding='IBM819' ?>
My answer to this question, for practical purposes, would very clearly
be YES. My guess is that many XML parsers actually exhibit such
behavior. If there are people who, based on the current language,
would claim otherwise, or if there is a feeling that this should
better be clarified, then I will propose an erratum to the XML
Core Working Group.
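
To make the behavior in question concrete, here is a minimal Python sketch
of a hypothetical processor that matches encoding names case-insensitively
but knows ISO 8859-1 only under its MIME preferred name. The table
SUPPORTED and the functions declared_encoding and resolve are names made up
for this illustration; they are not taken from any real XML parser.

```python
import re

# Assumption: this hypothetical processor supports three encodings, each
# under a single name. Keys are lowercased so lookups are case-insensitive.
SUPPORTED = {
    "utf-8": "utf-8",
    "utf-16": "utf-16",
    "iso-8859-1": "latin-1",   # MIME preferred name only; no IBM819 alias
}

# Matches the start of an XML declaration and captures the encoding name.
# Group 1 and group 2 are the quote characters, group 3 is the name.
_DECL = re.compile(
    r"""^<\?xml\s+version\s*=\s*(['"])1\.[0-9]+\1\s+encoding\s*=\s*(['"])([A-Za-z][A-Za-z0-9._-]*)\2"""
)


def declared_encoding(prolog):
    """Return the encoding name from an XML declaration, or None."""
    m = _DECL.match(prolog)
    return m.group(3) if m else None


def resolve(name):
    """Return a Python codec name, or None if this processor treats the name as unknown."""
    return SUPPORTED.get(name.lower())


for doc in ("<?xml version='1.0' encoding='iso-8859-1' ?>",
            "<?xml version='1.0' encoding='IBM819' ?>"):
    name = declared_encoding(doc)
    codec = resolve(name)
    print(name, "accepted" if codec else "rejected as unknown")
```

Under this sketch the first document is accepted and the second is
rejected, although both names are registered at IANA for the same
encoding. Whether such a processor conforms is exactly the open point.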


Regards,    Martin.

Received on Wednesday, 7 August 2002 04:02:54 UTC