Re: UTF-16BL/LE,... (was: Re: I18N issues with the XML Specification from Martin J. Duerst on 2000-04-14 (xml-editor@w3.org from April to June 2000)

From: Martin J. Duerst <duerst@w3.org>
Date: Fri, 14 Apr 2000 18:29:11 +0900
To: Rick Jelliffe <ricko@gate.sinica.edu.tw>, w3c-i18n-ig@w3.org
Cc: xml-editor@w3.org, w3c-xml-core-wg@w3.org
Message-Id: <4.2.0.58.J.20000414182038.02f82e70@sh.w3.mag.keio.ac.jp>

At 00/04/14 02:53 +0800, Rick Jelliffe wrote:
>On Wed, 12 Apr 2000, Paul Hoffman / IMC wrote:
>
> > They are similar to the UTF-16 charset, but they have different rules.
> > UTF-16 is an encoding, not a charset. All three charsets start with the
> > UTF-16 transformation format, then add rules to make them charsets.
>
>Now this is a really interesting comment. It manages to have
>         UTF-16 the encoding not a charset
>         UTF-16 the transformation format not a charset
>         UTF-16 the charset (transformation format + added rules)
>         UTF-16LE the charset (transformation format + added rules)
>         UTF-16BE the charset (transformation format + added rules)
>and, of course, none of them are the same as
>         Unicode, the character set
>
>The W3C I18n WG recently said "we should use 'charset' instead of
>'encoding' since it will confuse less people, for the XML infoset".

Well, yes, in the XML decl/text decl, there is a 'charset' parameter,
but the pseudoattribute's name is 'encoding'. Because the value
of this parameter comes from the possible values defined by
IETF and registered by IANA, it's a 'charset' and not some
other kind of concept that may be called 'encoding'.
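
To illustrate why it matters that the label is read as a 'charset': the three labels discussed in this thread imply different byte-level rules. The following is only a minimal sketch. It assumes that Python's codec names "utf-16", "utf-16-le" and "utf-16-be" behave like the charsets UTF-16, UTF-16LE and UTF-16BE; the variable names are made up for the example.

```python
import codecs

text = "A"  # U+0041

# "UTF-16": the byte order is signalled by a leading BOM
# (Python's encoder always writes one, in the platform's byte order).
with_bom = text.encode("utf-16")
has_bom = with_bom[:2] in (codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)

# "UTF-16LE" / "UTF-16BE": the byte order is fixed by the label,
# and the encoder writes no BOM.
le = text.encode("utf-16-le")
be = text.encode("utf-16-be")

# The same bytes read differently under different labels: under
# "UTF-16LE" the bytes FF FE are not a signature but the character U+FEFF.
data = codecs.BOM_UTF16_LE + le
as_utf16 = data.decode("utf-16")
as_utf16le = data.decode("utf-16-le")
```

In this sketch, decoding under "utf-16" consumes the BOM, while decoding under "utf-16-le" keeps U+FEFF as the first character of the text.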


>At the least, Paul's use of "encoding is not charset" is opposite
>to W3C's use of "encoding is charset".

We never defined the term 'encoding' in the WG.


>(And we also have, "UTF-16 the thing that the WG meant when it put
>XML together" and "UTF-16 the thing that IANA meant at the time
>when XML was put together". IMHO when constructing an errata,
>it is that former one which is key to figuring out what to do. )

Yes, definitely. However, Tim Bray said that this is not
clear, which means that we are working on a clarification
rather than a real erratum.


>My guess is Paul means, in order,
>         UTF-16 the generic encoding as used in wide characters, i.e., in a
>                 program
>         UTF-16 the generic name for saving Unicode in 16tets
>         UTF-16 the charset which may have a BOM
>         UTF-16LE the charset which must have no BOM
>         UTF-16BE the charset which must have no BOM
>
>I disagree that encoding in XML corresponds to any of those exactly:
>         XML encoding is implementation neutral about use inside a program
>         XML encoding parameter is not generic but completely specific
>         XML encoding does not have a requirement to fit in with any
>                 RFC that intrudes into the area of what we are allowed
>                 to put inside data (outside transmission control
>                 characters); also, it is an assertion about the
>                 character encoding used: it must be the writer's
>                 choice whether or not it is good form according to
>                 any RFC or ISO standard

I don't understand what you are saying here. Most of the grammar
in XML is not the writer's choice. Writers are not allowed to
start an attribute with a single quote and end it with a double
quote. There is no formal check on the value of the 'encoding'
attribute in the XML decl/text decl, but that doesn't mean
that what to put there is the writer's choice.

It looks to me as if you disagree with that, but can you
explain more clearly how and why?
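
The contrast between the quoting rule and the 'encoding' value can be tried out with any XML parser. The following is only a rough sketch, assuming Python's xml.etree.ElementTree (which uses expat); the helper is_well_formed and the sample documents are invented for the illustration.

```python
import xml.etree.ElementTree as ET

def is_well_formed(doc: bytes) -> bool:
    """Return True if the parser accepts doc, False on a parse error."""
    try:
        ET.fromstring(doc)
        return True
    except ET.ParseError:
        return False

# Attribute value opened with a single quote, "closed" with a double quote:
# the grammar forbids this, so the parser rejects the document.
mismatched = b"<a b='x\"/>"
matched = b"<a b='x'/>"

# The bytes C3 A9 are UTF-8 for U+00E9, but the declaration says ISO-8859-1.
# The parser cannot check the label against the writer's intent; it simply
# decodes the bytes as labelled and yields two different characters.
mislabeled = b'<?xml version="1.0" encoding="ISO-8859-1"?><a>\xc3\xa9</a>'
decoded = ET.fromstring(mislabeled).text
```

The mislabeled document is accepted without complaint, yet the reader does not get the character the writer stored, which is one way to see that a label can be wrong even though no formal check catches it.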


Regards,   Martin.

Received on Friday, 14 April 2000 06:04:03 UTC