Re: UTF-16BE/LE,... (was: Re: I18N issues with the XML Specification)

From: Rick Jelliffe <ricko@gate.sinica.edu.tw>
Date: Fri, 14 Apr 2000 02:53:24 +0800 (CST)
To: w3c-i18n-ig@w3.org
cc: xml-editor@w3.org, w3c-xml-core-wg@w3.org
Message-ID: <Pine.GSO.4.21.0004140227390.10685-100000@gate>

On Wed, 12 Apr 2000, Paul Hoffman / IMC wrote:

> They are similar to the UTF-16 charset, but they have different rules. 
> UTF-16 is an encoding, not a charset. All three charsets start with the 
> UTF-16 transformation format, then add rules to make them charsets.

Now this is a really interesting comment. It manages to have
	UTF-16 the encoding not a charset
	UTF-16 the transformation format not a charset
	UTF-16 the charset (transformation format + added rules)
	UTF-16LE the charset (transformation format + added rules)
	UTF-16BE the charset (transformation format + added rules)
and, of course, none of them are the same as
	Unicode, the character set

The W3C I18n WG recently said "we should use 'charset' instead of
'encoding' since it will confuse fewer people, for the XML infoset".

At the least, Paul's use of "encoding is not charset" is opposite
to W3C's use of "encoding is charset". 

(And we also have "UTF-16 the thing that the WG meant when it put
XML together" and "UTF-16 the thing that IANA meant at the time
when XML was put together". IMHO, when constructing an erratum,
it is the former that is key to figuring out what to do.)

My guess is Paul means, in order,
	UTF-16 the generic encoding as used in wide characters, i.e., in a
	UTF-16 the generic name for saving Unicode in 16tets
	UTF-16 the charset which may have a BOM
	UTF-16LE the charset which must have no BOM
	UTF-16BE the charset which must have no BOM
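
To make the BOM distinction in the last three items concrete, here is a
minimal sketch in Python. It only shows how Python's own codecs treat the
three labels; it is an illustration, not a statement of what the XML
Recommendation or any RFC requires. The sample string is an assumption
chosen for the example.

```python
import codecs

text = "<a/>"  # hypothetical sample document content

# "utf-16": Python's codec writes a BOM first, then code units in the
# platform's native byte order.
with_bom = text.encode("utf-16")
has_bom = with_bom[:2] in (codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)

# "utf-16-le" / "utf-16-be": the label fixes the byte order, so the
# codec writes no BOM.
le = text.encode("utf-16-le")
be = text.encode("utf-16-be")

# Decoding with the generic label consumes the BOM.
decoded_generic = with_bom.decode("utf-16")

# Decoding BOM-prefixed bytes with a fixed-order label does not strip
# the BOM; U+FEFF is kept as an ordinary character in the result.
decoded_be = (codecs.BOM_UTF16_BE + be).decode("utf-16-be")
```

In this sketch the generic label round-trips the text cleanly, while the
fixed-order label leaves U+FEFF at the start of the decoded string, which
is the practical difference between "may have a BOM" and "must have no BOM".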

I disagree that encoding in XML corresponds to any of those exactly:
	XML encoding is implementation neutral about use inside a program
	XML encoding parameter is not generic but completely specific
	XML encoding does not have a requirement to fit in with any
		RFC that intrudes into the area of what we are allowed
		to put inside data (outside transmission control
		characters); also, it is an assertion about the
		character encoding used: it must be the writer's
		choice whether or not it is good form according to
		any RFC or ISO standard

Rick Jelliffe
Academia Sinica	

Received on Thursday, 13 April 2000 14:53:48 UTC
