Re: character sets - a summary and a proposal from Gavin Nicol on 1996-09-16 (w3c-sgml-wg@w3.org from September 1996)

From: Gavin Nicol <gtn@ebt.com>
Date: Mon, 16 Sep 1996 19:52:03 GMT
To: U35395@UICVM.CC.UIC.EDU
CC: w3c-sgml-wg@w3.org
Message-Id: <199609161952.TAA15719@wiley.EBT.COM>

>I'm confused.  Where I come from, 'coded character set' is a mapping
>between a set of characters and a set of bit patterns, not necessarily
>the same length. 

Well, this is not exactly the way I'd define it. Anyway, we can say
that there is a group of characters that each can be identified by a
unique integer.

>Under that definition, if 'the coded character set should be ISO 10646',
>then we should not accept JIS 0208, Shift-JIS, EUC, ISO 8859, etc.,

Well, you see, here you show a fundamental misunderstanding. JIS 0208
is a coded character set; SHIFT-JIS is an encoding, as is EUC. ISO
8859-1 is both a coded character set and an encoding.

The characters that SHIFT-JIS, EUC, and ISO 8859-1 map to are also
present in ISO 10646. An encoding does *not* necessarily map to the
integer for a character, though in practice most do. As such, so long
as an encoding identifies the *characters* in ISO 10646, all is well
from a theoretical standpoint (we can even use BCTF). 

In actual fact, this is all just verbiage and hair-splitting. The
important point is that you can use these encodings in an ISO 10646
based system.
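
To make the distinction concrete, here is a minimal sketch in Python.
Python and its codec names ("shift_jis", "euc_jp", and so on) are
assumptions of this illustration, not part of the original discussion.
It takes one character, looks up its ISO 10646 integer, and shows that
several encodings produce different byte sequences which all identify
that same character.

```python
# One character, one ISO 10646 integer, several encodings.
ch = "\u6f22"  # a kanji that is present in both JIS X 0208 and ISO 10646

code_point = ord(ch)  # the integer ISO 10646 assigns to this character

# Each encoding yields a different byte sequence for the same character.
encoded = {
    name: ch.encode(name)
    for name in ("shift_jis", "euc_jp", "utf-8", "utf-16-be")
}

# Decoding any of them recovers the same character, hence the same integer.
decoded = {name: data.decode(name) for name, data in encoded.items()}

for name, data in encoded.items():
    print(f"{name:10} bytes={data.hex()}  character=U+{ord(decoded[name]):04X}")
```

Only the UTF-16 bytes coincide with the integer itself; the Shift-JIS
and EUC bytes do not, yet they identify the same character, which is
all an ISO 10646 based system needs.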

>>Interoperability is something to be greatly desired, and in fact, the
>>primary reason I got involved in HTML I18N was precisely
>>that. However, I do not believe that at this time, we can get to a
>>point where all XML systems will be able to process all XML
>>documents. At some point in the future (3-5 years), perhaps. Now, no.
>
>This scares the pants off me.  In 3-5 years, if XML is widely
>adopted, it will be *impossible* to impose interoperability in the
>form of required support for data streams in UTF-8 or UTF-16 or
>whatever, because by then there will be legacy systems and legacy
>data to be protected.  The only way to achieve such uniformity is by
>imposing it at the outset, when there is no XML legacy data, and we
>have a free hand.  Failing to ensure interoperability when we have
>a free hand is not a good sign for our ability to achieve it later
>when our hands will be tied by systems which have made use of whatever
>freedom the spec gives them now.

If this is true, then we are doomed to failure. I do not think it is
true, however. In the 3-5 intervening years, we will see a drastic
improvement in tools, and they will be the answer to the
interoperability problems we will suffer initially.

Anyway, as I said before, interoperability is a dream unless you state
precisely the domain in which you wish interoperability to occur. For
example, the statement "all XML parsers will be able to parse all XML
documents" is much more reasonable than "all XML systems will be able
to process all XML documents".

>>Autodetection fails abysmally as soon as you get more than a few
>>encodings.
>
>Correct; that's why the proposal (a) limits itself to the cases of
>UCS-4, UTF-16 / UCS-2, ISO 646 and compatible encodings including
>UTF-8, and EBCDIC, and (b) requires an explicit label in the entity
>regardless.  

I do not think this is necessary. Storage/transmission systems should
provide labelling for maximum interoperability. Autodetection is an
artifact of a poorly designed system.
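
As a rough illustration of labelling done by the transmission layer,
the sketch below reads the charset parameter from a MIME-style
Content-Type label and uses it to decode the body. The helper name
decode_labelled and the sample label are hypothetical; the parsing
relies on Python's standard email.message module.

```python
from email.message import Message


def decode_labelled(content_type: str, body: bytes) -> str:
    """Decode body using the charset parameter of a Content-Type label."""
    msg = Message()
    msg["Content-Type"] = content_type
    charset = msg.get_content_charset()  # lowercased charset value, or None
    if charset is None:
        # Without a label the receiver is reduced to guessing.
        raise ValueError("no charset label supplied")
    return body.decode(charset)


# Hypothetical usage: the sender labels Shift-JIS data explicitly.
payload = "\u6f22\u5b57".encode("shift_jis")
text = decode_labelled("text/html; charset=Shift_JIS", payload)
print(text == "\u6f22\u5b57")
```

With the label present, the receiver never has to inspect the bytes to
work out which encoding was used.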

>Autodetection in general fails when there are more than
>a few encodings.  We agree.  The question in my mind is, does this
>particular proposal for autodetection fail when we have the set of
>encodings described?

No, but then the encodings you select are arbitrary. If we added 3 or
4 more common encodings, it would fail.
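
Both halves of this exchange can be sketched in Python. The first
function guesses an encoding family from the leading bytes, under the
assumption (made here for illustration only) that every entity begins
with the two characters "<?". The second shows why the approach stops
scaling: inside the ISO 646 compatible family, the same bytes decode
without error under several encodings. The function names and the
candidate list are hypothetical.

```python
def sniff_family(head: bytes) -> str:
    """Guess the encoding family, assuming the entity starts with '<?'."""
    if head.startswith(b"\x00\x00\x00\x3c"):
        return "UCS-4 big-endian"
    if head.startswith(b"\x3c\x00\x00\x00"):
        return "UCS-4 little-endian"
    if head.startswith(b"\x00\x3c\x00\x3f"):
        return "UTF-16/UCS-2 big-endian"
    if head.startswith(b"\x3c\x00\x3f\x00"):
        return "UTF-16/UCS-2 little-endian"
    if head.startswith(b"\x3c\x3f"):
        return "ISO 646 compatible"  # ASCII, UTF-8, ISO 8859, EUC, ...
    if head.startswith(b"\x4c\x6f"):
        return "EBCDIC"
    return "unknown"


def plausible_encodings(data: bytes, candidates: list[str]) -> list[str]:
    """Return every candidate that decodes data without an error."""
    ok = []
    for name in candidates:
        try:
            data.decode(name)
        except UnicodeDecodeError:
            continue
        ok.append(name)
    return ok


# The small set of families is distinguishable from the first bytes.
print(sniff_family("<?".encode("utf-16-le")))
print(sniff_family("<?".encode("cp037")))

# Within one family, the bytes alone no longer settle the question.
euc_bytes = "\u6f22\u5b57".encode("euc_jp")
print(plausible_encodings(euc_bytes, ["euc_jp", "latin-1", "shift_jis", "utf-8"]))
```

The sniffing step only narrows the choice to a family; telling EUC
from ISO 8859-1 or Shift-JIS still needs a label or a heuristic.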

>>  5) Client/server interaction will initially be primarily in the native
>>     encoding.
>>  6) Over time, a transition will be made to UTF-8/UTF-16 (ie. as more
>>     and better tools become available).
>>
>>We should recognise, and accept this.
>
>Hmm.  By analogy with this, HTML could have started by allowing any
>existing 7-bit national character set, as well as proprietary
>8-bit character sets, and hoped that eventually, with time, all HTML
>users would migrate to ISO 8859-1.
>
>As it is, it started by prescribing 8859-1, which is a blessing
>because it allows users to protest against servers which serve data
>in native PC or native Mac format.  Things are bad enough as they are;
>would they really be better if sending data in Mac's proprietary
>coded character set were legal HTML?
>
>On the whole, I think that HTML did the right thing in pressing for
>a move from 7 to 8 bits.  And I think XML should do the same in
>pressing for a move to 16 bits.  It is much easier to loosen XML
>restrictions in later revisions than to tighten them and break
>legacy data.

HTML met the initial needs of the WWW with the ISO 8859-1 restriction
(the syntax limited it to ISO 8859-1). This will change to ISO 10646
(the HTML I18N draft has now become Proposed Standard). Before the
I18N draft, people were kludging things together in a non-conformant
manner. With the I18N draft, they kludge things together in a
conformant manner. Initially, HTTP and clients provided few I18N
capabilities (poor labelling, poor forms support, etc.); now, at least,
the *standards* are moving in the right direction. HTML has been driven
by necessity more than by design.

With XML, we can design the system to support I18N, but provide
flexibility in implementation so that we can basically *repeat* what
happened to HTML, but this time by design.
Received on Monday, 16 September 1996 15:53:13 UTC