- From: Martin Duerst <duerst@w3.org>
- Date: Wed, 07 Aug 2002 10:54:01 +0900
- To: Mark Davis <mark.davis@us.ibm.com>, ned.freed@mrochek.com
- Cc: Chris.Newman@Sun.COM, ietf-charsets@iana.org, Uma Umamaheswaran <umavs@ca.ibm.com>
Hello Mark,

I agree with you that the IANA registry plays an important role, in particular in the context of XML. However, I think it's important to carefully distinguish registration of not yet registered character encodings on the one hand, and addition of aliases on the other hand.

At 13:29 02/08/06 -0700, Mark Davis wrote:
>For better or worse, the IANA registry is used as a central repository of
>names for character set mappings. In particular, the XML Standard
>(http://www.w3.org/TR/REC-xml) is driving
>the registration of many encodings:

more exactly, http://www.w3.org/TR/REC-xml#charencoding

>4.3.3 Character Encoding in Entities
>...
>
>It is recommended that character encodings registered (as charsets) with
>the Internet Assigned Numbers Authority
>[IANA-CHARSETS] <http://www.w3.org/TR/REC-xml#IANA>, other than those just
>listed, be referred to using their registered names; other encodings
>should use names starting with an "x-" prefix. XML processors should match
>character encoding names in a case-insensitive way and should either
>interpret an IANA-registered name as the encoding registered at IANA for
>that name or treat it as unknown (processors are, of course, not required
>to support all IANA-registered encodings).
>...

Just before the text you cite, we find:

>>>>
In an encoding declaration, the values "UTF-8", "UTF-16", "ISO-10646-UCS-2", and "ISO-10646-UCS-4" should be used for the various encodings and transformations of Unicode / ISO/IEC 10646, the values "ISO-8859-1", "ISO-8859-2", ... "ISO-8859-n" (where n is the part number) should be used for the parts of ISO 8859, and the values "ISO-2022-JP", "Shift_JIS", and "EUC-JP" should be used for the various encoded forms of JIS X-0208-1997.
>>>>

This makes it very clear that there is no need, for XML, to register additional aliases, because XML already says that the MIME preferred names should be used.
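
To make the quoted matching rule concrete, here is a minimal sketch in Python of a processor that compares encoding names case-insensitively and treats any name it does not list as unknown. The table SUPPORTED and the function resolve_encoding are hypothetical names chosen for illustration; they come from neither the XML Recommendation nor any real parser. The sketch takes no position on whether a processor may reject a registered alias of an encoding it otherwise supports; it simply looks up whatever names its table happens to contain.

```python
# Hypothetical table of the encoding names this illustrative processor
# supports, keyed by lowercased name and mapped to a Python codec name.
# Only preferred names are listed; registered aliases are deliberately absent.
SUPPORTED = {
    "utf-8": "utf-8",
    "utf-16": "utf-16",
    "iso-8859-1": "iso-8859-1",
    "shift_jis": "shift_jis",
    "euc-jp": "euc-jp",
}


def resolve_encoding(declared):
    """Return the codec name for a declared encoding, or None if unknown.

    Matching is case-insensitive, as the quoted text recommends.
    """
    return SUPPORTED.get(declared.strip().lower())


if __name__ == "__main__":
    for name in ("ISO-8859-1", "iso-8859-1", "Shift_JIS", "IBM819"):
        print(name, "->", resolve_encoding(name))
```

With this table, "ISO-8859-1" and "iso-8859-1" resolve to the same codec, while the alias "IBM819" is treated as unknown because it is not listed.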
Of course, XML does not require an XML processor to understand any character encoding except UTF-8 and UTF-16 (the latter always with a BOM). Even the support of US-ASCII or iso-8859-1 is not required.

The XML Recommendation is not exactly clear on the following point: If an XML processor accepts a particular encoding, is it required to accept that encoding under all the aliases registered with IANA, or is it okay to only accept some of the names, but not others? For example, is an XML processor allowed to accept an XML document starting with

<?xml version='1.0' encoding='iso-8859-1' ?>

but reject one starting with

<?xml version='1.0' encoding='IBM819' ?>

My answer to this question, for practical purposes, would very clearly be YES. My guess is that many XML parsers actually exhibit such behavior.

If there are people who, based on the current language, would claim otherwise, or if there is a feeling that this should better be clarified, then I will propose an erratum to the XML Core Working Group.

Regards, Martin.
Received on Wednesday, 7 August 2002 04:02:54 UTC