Re: XML character sets: a proposal from Gavin Nicol on 1996-09-13 (w3c-sgml-wg@w3.org from September 1996)

From: Gavin Nicol <gtn@ebt.com>
Date: Fri, 13 Sep 1996 19:44:16 GMT
To: U35395@UICVM.CC.UIC.EDU
CC: bbauma1@cs.umbc.edu, w3c-sgml-wg@w3.org
Message-Id: <199609131944.TAA01978@wiley.EBT.COM>

>Can you expound on this a bit?  What character encodings do you
>currently use, for what texts, that won't fit into Unicode?  Do you
>really have encodings that can't even be handled by putting the
>characters you need into the private use area of the BMP?  If you do,
>I'd really like to know more about it.

Some of the Chinese encodings have thousands of characters not in
Unicode. Then we also have live languages that still do not even have
a computerised form yet.

>I'm having trouble thinking of serious applications that meet the
>standard you appear to be setting, i.e. that do not restrict their
>character sets in any way.

It's not so much a matter of the application being unrestricted, as
the *standard* giving enough freedom of movement to allow applications
to be built that would generally fall outsude the mainstream use (for
which UNICODE will generally suffice). As I noted, we can fix the
coded character set to ISO 10646, and we have the BMP, which provides
us with enoguh room to accomodate these other applications, provided
we do not restrict the list of acceptable encodings.

As for a serious application: something I have long wanted to do is to
build a system in which characters could be registered, along with
references to common code points (something like TEI WSD) so that one
could build "virtual" character repertoires and "virtual" coded
character sets. The system would also have glyph server technology so
that an application (or perhaps an OS) could retrieve the glyph images
as appropriate. You can kind of fake this at the moment with SDATA
entities. 

>C compilers and other language processors do not accept source code in
>arbitrary coded character sets; nor do editors and word processors, nor
>do Web browsers. 

This is just a limitation in the implementations. They could if they
wished. 

>The internationalized versions of Mosaic I have seen and heard about
>do accept more than one coded character set, but they are *not*
>extensible, in the sense of allowing run-time additions to their
>capabilities by the end-user.  They are extensible in the sense of
>allowing programmers of sufficient skill to recompile them after
>tinkering with the character-handling code.

Sure. In the future, I expect JAVA to provide a nice way for people to
actually support *arbitrary* encodings, and at some point, a way to
render arbitrary character sets. I have been doing some work on this
on the side.

Eventually, I would like to see coded character sets as such
disappear.  

>than to tell them "To handle your unusual writing systems in XML, recode
>the lexical scanner, recompile, and invoke the XML parser."

In a well designed system, this is unecessary.

I should note that I have never said that I wish to *require* that all
XML parsers be open-ended. I have no problem at all with seeing Latin
1 XML systems, and SJIS XML systems, though I expect most will
actually be UNICODE based.

>If one really, really needs arbitrary coded character sets, why not
>use Real SGML?

Because it's not necessary to do so.

Received on Friday, 13 September 1996 15:46:04 UTC