Re: character sets - a summary and a proposal

On Mon, 16 Sep 1996 14:40:16 GMT Gavin Nicol said:
>>Q1 should there be any minimal function required of all conforming XML
>>systems, any coded character set or character encoding they are all
>>required to accept as input, whether across the net or from disk?
>
>The coded character set should be ISO 10646. I am willing to accept
>UTF-8 as required (I argued exactly that position on HTML-WG a long
>time ago).

I'm confused.  Where I come from, a 'coded character set' is a mapping
between a set of characters and a set of bit patterns, not necessarily
all of the same length.  (That is, I use the terms 'character set' and
'coded character set' as they are typically defined in the ISO character
set standards.  I try, unsuccessfully, to avoid the bare term 'character
set' precisely because SC 18 uses it to mean what SC 2 means by 'coded
character set'.)

Under that definition, if 'the coded character set should be ISO 10646',
then we should not accept JIS X 0208, Shift-JIS, EUC, ISO 8859, etc.,
because they are different coded character sets.  Their character sets,
a.k.a. character repertoires, happen to be subsets of that of ISO 10646
and of Unicode, but that does not make them the same coded character
set, nor encodings of it.  It just means that translation into 10646 or
one of its encodings is not inherently lossy.
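(To make the difference concrete: one and the same character, say
HIRAGANA LETTER A, which ISO 10646 codes as U+3042, is given a
different bit pattern by each of these coded character sets.  A small
illustrative sketch in Python, mine rather than anything out of the
standards:

    ch = "\u3042"   # HIRAGANA LETTER A, one member of the repertoire
    for enc in ("shift_jis", "euc_jp", "utf-8", "utf-16-be"):
        print(enc, ch.encode(enc).hex())
    # shift_jis  82a0
    # euc_jp     a4a2
    # utf-8      e38182
    # utf-16-be  3042

Four codings of the same repertoire member, of which only the last two
are encodings *of* ISO 10646.)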

So there seems to me to be an inherent contradiction between saying
'the coded character set should be ISO 10646' and saying 'we should
allow XML documents to be in Shift-JIS or EUC or ...', which I thought
you said elsewhere in your posting.  I doubt that the contradiction is
real, but I don't know how to resolve it; that is, I don't understand
what your position is.

>>Q3 if XML systems may accept different sets of input formats (whether
>>or not these sets overlap), can we ensure interoperability
>>in some way, or is that a lost cause?
>
>Interoperability is something to be greatly desired, and in fact, the
>primary reason I got involved in HTML I18N was precisely
>that. However, I do not believe that at this time, we can get to a
>point where all XML systems will be able to process all XML
>documents. At some point in the future (3-5 years), perhaps. Now, no.

This scares the pants off me.  In 3-5 years, if XML is widely
adopted, it will be *impossible* to impose interoperability in the
form of required support for data streams in UTF-8 or UTF-16 or
whatever, because by then there will be legacy systems and legacy
data to be protected.  The only way to achieve such uniformity is to
impose it at the outset, when there is no XML legacy data and we
have a free hand.  Failing to ensure interoperability while we have
a free hand is not a good sign for our ability to achieve it later,
when our hands are tied by systems that have made use of whatever
freedom the spec gives them now.

>>Note on autodetection of character sets.
>
>Autodetection fails abysmally as soon as you get more than a few
>encodings.

Correct; that's why the proposal (a) limits itself to the cases of
UCS-4, UTF-16 / UCS-2, ISO 646 and compatible encodings (including
UTF-8), and EBCDIC, and (b) requires an explicit label in the entity
regardless.  Autodetection in general fails when there are more than
a few encodings; we agree on that.  The question in my mind is: does
this particular proposal for autodetection fail for the particular,
limited set of encodings just described?
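To show what I mean: the detection needed here amounts to a glance at
the first few octets of the entity.  A rough sketch in Python; the
byte tests are my reconstruction of the cases listed above, not text
from the proposal, and I ignore some corner cases (e.g. a UCS-4
little-endian byte order mark):

    def detect_family(first4: bytes) -> str:
        # Guess the encoding *family* from the first four octets; the
        # mandatory explicit label then names the exact encoding.
        if first4.startswith(b"\x00\x00\x00\x3c"):   # '<' in UCS-4, big-endian
            return "UCS-4, big-endian"
        if first4.startswith(b"\x3c\x00\x00\x00"):   # '<' in UCS-4, little-endian
            return "UCS-4, little-endian"
        if first4.startswith(b"\xfe\xff"):           # byte order mark
            return "UTF-16/UCS-2, big-endian"
        if first4.startswith(b"\xff\xfe"):
            return "UTF-16/UCS-2, little-endian"
        if first4.startswith(b"\x00\x3c"):           # '<' in UCS-2, no BOM
            return "UTF-16/UCS-2, big-endian"
        if first4.startswith(b"\x3c\x00"):
            return "UTF-16/UCS-2, little-endian"
        if first4.startswith(b"\x3c\x3f"):           # '<?' in ISO 646 / UTF-8
            return "ISO 646-compatible, incl. UTF-8"
        if first4.startswith(b"\x4c\x6f"):           # '<?' in EBCDIC
            return "EBCDIC"
        return "unrecognized: fall back on the explicit label"

For this small, deliberately limited set, the first two to four octets
suffice to tell the families apart; the explicit label in the entity
does the rest.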

>  5) Client/server interaction will initially be primarily in the native
>     encoding.
>  6) Over time, a transition will be made to UTF-8/UTF-16 (ie. as more
>     and better tools become available).
>
>We should recognise, and accept this.

Hmm.  By analogy with this, HTML could have started by allowing any
existing 7-bit national character set, as well as the proprietary
8-bit character sets, and hoped that with time all HTML users would
migrate to ISO 8859-1.

As it is, HTML started by prescribing 8859-1, which is a blessing
because it allows users to protest against servers that serve data
in native PC or native Mac format.  Things are bad enough as they are;
would they really be better if sending data in the Mac's proprietary
coded character set were legal HTML?

On the whole, I think that HTML did the right thing in pressing for
a move from 7 to 8 bits.  And I think XML should do the same in
pressing for a move to 16 bits.  It is much easier to loosen XML
restrictions in later revisions than to tighten them and break
legacy data.

-C. M. Sperberg-McQueen
