Re: XML character sets: a proposal

To: Todd Bauman <bbauma1@cs.umbc.edu>, W3C SGML Working Group <w3c-sgml-wg@w3.org>
Subject: Re: XML character sets: a proposal
From: Michael Sperberg-McQueen <U35395@UICVM.CC.UIC.EDU>
Date: Fri, 13 Sep 96 10:02:31 CDT
From w3c-sgml-wg-request@www10.w3.org Fri Sep 13 11:22:37 1996
In-Reply-To: Message of Thu, 12 Sep 1996 22:11:25 -0400 from Todd Bauman
Organization: ACH/ACL/ALLC Text Encoding Initiative

On Thu, 12 Sep 1996 22:11:25 -0400 Todd Bauman said:
>Even live languages.  I've got some of these documents, and I would
>hate to see XML disallow the character encodings I need to use.
>Gavin's right,

Can you expound on this a bit?  What character encodings do you
currently use, for what texts, that won't fit into Unicode?  Do you
really have encodings that can't even be handled by putting the
characters you need into the private use area of the BMP?  If you do,
I'd really like to know more about it.

>using UTF-8 as a default and / or suggested encoding and including it
>in a reference implementation is one thing.  Prohibiting the use of
>other character encodings is too restrictive.  Whether through MIME and
>/ or through FSI's, XML has to be extensible in this regard.

I'm having trouble thinking of serious applications that meet the
standard you appear to be setting, i.e. that do not restrict their
character sets in any way.

C compilers and other language processors do not accept source code in
arbitrary coded character sets; nor do editors and word processors, nor
do Web browsers.  Emacs does pretty well, on X, with character sets
represented by fonts in the X library.  I don't have high hopes for any
users who need it to handle EBCDIC all of a sudden.  The
internationalized versions of Mosaic I have seen and heard about do
accept more than one coded character set, but they are *not* extensible,
in the sense of allowing run-time additions to their capabilities by the
end-user.  They are extensible in the sense of allowing programmers of
sufficient skill to recompile them after tinkering with the
character-handling code.

On the whole, it seems to me simpler to tell users "To handle your
unusual writing systems in XML, translate your documents into Unicode
(using the private-use area if you need to) and invoke the XML parser"
than to tell them "To handle your unusual writing systems in XML, recode
the lexical scanner, recompile, and invoke the XML parser."
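
To make the first option concrete, here is a minimal sketch of such a
translation step, written in Python purely for illustration.  Everything
in it is an assumption: the legacy encoding is a hypothetical 8-bit one
in which bytes below 0x80 are ASCII, and the table LEGACY_TO_PUA and the
function names are invented for the example.  The only fixed fact is
that the private-use area of the BMP runs from U+E000 to U+F8FF.

```python
# Sketch only: the legacy encoding and its mapping table are hypothetical.
PUA_START = 0xE000  # first code point of the BMP private-use area
PUA_END = 0xF8FF    # last code point of the BMP private-use area

# Assumed mapping from legacy byte values (characters with no standard
# Unicode code point) to private-use code points chosen by the user.
LEGACY_TO_PUA = {
    0x80: PUA_START + 0,
    0x81: PUA_START + 1,
}


def legacy_to_unicode(data: bytes) -> str:
    """Translate bytes in the hypothetical legacy encoding to Unicode."""
    out = []
    for b in data:
        if b < 0x80:
            # The ASCII range maps straight across.
            out.append(chr(b))
        elif b in LEGACY_TO_PUA:
            # Unusual characters go into the private-use area.
            out.append(chr(LEGACY_TO_PUA[b]))
        else:
            raise ValueError(f"no mapping for legacy byte 0x{b:02X}")
    return "".join(out)


def legacy_to_utf8(data: bytes) -> bytes:
    """Produce UTF-8 that could then be handed to an XML parser."""
    return legacy_to_unicode(data).encode("utf-8")
```

The point of the sketch is that the table lives in a separate
translation step under the user's control, and the parser itself never
has to learn about it.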

>> I think it would be quite hard to guarantee that all XML systems will
>> be able to meaningfully interpret any arbitrary XML document
>> anyway.
>
>You're not kidding. Even basic rendering in a browser can be quite
>difficult.

? Even with a style sheet?  Perhaps you and Gavin have higher hopes for
'meaningful interpretation' than I do in the first place, but I am
having trouble imagining *any* level of interpretation that won't become
a lot more complex if the parser must adjust at run time to
character sets unknown and unimagined at compile time.

It seems to me that allowing arbitrary coded character sets really
pushes us over a line between something simple and something that may
possibly still be tractable but is surely no longer simple.  If Unicode
is not enough, then a finite and small set of alternate coded character
sets can be defined as legal input.  Allowing arbitrary parse-time
extension is not the way to keep XML simple to implement.
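
A rough sketch of what "a finite and small set" might look like in a
processor, again in Python and again under stated assumptions: the
particular encodings in ALLOWED are my own illustrative choice, not a
proposal, and decode_input is a hypothetical name.  The standard
library's codecs.lookup is used only to normalize the declared name.

```python
import codecs

# Illustrative closed set of accepted encodings (an assumption, not a
# proposal).  Names are in the normalized form codecs.lookup() returns.
ALLOWED = frozenset({"utf-8", "utf-16", "iso8859-1"})


def decode_input(data: bytes, declared: str) -> str:
    """Decode document bytes only if the declared encoding is in the fixed set."""
    try:
        name = codecs.lookup(declared).name  # normalizes aliases like "UTF8"
    except LookupError:
        raise ValueError(f"unknown encoding: {declared!r}")
    if name not in ALLOWED:
        raise ValueError(f"encoding not permitted: {declared!r}")
    return data.decode(name)
```

With a closed list like this, the set of scanners a conforming
processor must contain is known at compile time; an EBCDIC code page
such as "cp037" is simply rejected instead of requiring run-time
extension.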

If one really, really needs arbitrary coded character sets, why not
use Real SGML?

-C. M. Sperberg-McQueen

Follow-Ups:

Re: XML character sets: a proposal

From: Gavin Nicol <gtn@ebt.com>