Re: XML character sets: a proposal
> On Thu, 12 Sep 1996 22:11:25 -0400 Todd Bauman said:
> >Even live languages. I've got some of these documents, and I would
> >hate to see XML disallow the character encodings I need to use.
> >Gavin's right,
> Can you expound on this a bit? What character encodings do you
> currently use, for what texts, that won't fit into Unicode? Do you
> really have encodings that can't even be handled by putting the
> characters you need into the private use area of the BMP? If you do,
> I'd really like to know more about it.
I stand corrected. I do use the private use area for this. I was simply
thinking about all of the odd, nonstandard 8-bit character encodings
and matching fonts that I have to employ to get these languages through
existing tools. Of course one problem with the private use area is that
> >using UTF-8 as a default and / or suggested encoding and including it
> >in a reference implementation is one thing. Prohibiting the use of
> >other character encodings is too restrictive. Whether through MIME and
> >/ or through FSI's, XML has to be extensible in this regard.
> I'm having trouble thinking of serious applications that meet the
> standard you appear to be setting, i.e. that do not restrict their
> character sets in any way.
It's not that I want an unrestricted number of character sets; I just
want to be free to use different encodings of that set, and a
standard way to inform others that I am doing this. Specifying one
or two encodings is too restrictive.
1. Many people like the encodings that they currently use, have the
tools to work with them, and won't be changing any time soon.
2. UTF-8 / UTF-16 are terribly inefficient encodings for a large
number of languages. They require 2 or 3 bytes per character when an
alternate encoding would require only one. UTF-8 is particularly
offensive with its blatant western bias. No one is going to use
these inefficient encodings when they have large amounts of information
to store / transmit and they are paying for the bandwidth.
Moreover, many of the languages that UTF-8 bloats in size by two or
three times are those used by countries that have access to the worst computer and
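To put rough numbers on the size argument in point 2, here is a small sketch that encodes the same short strings in a legacy 8-bit encoding, UTF-8, and UTF-16. The sample words and the choice of legacy codecs (ISO 8859-7, KOI8-R, TIS-620) are my own assumptions for illustration, not encodings named above, and real documents with ASCII markup mixed in would show a smaller ratio.

```python
# Compare the encoded size of the same text in a legacy 8-bit encoding,
# UTF-8, and UTF-16. The sample strings and codecs are illustrative only.
samples = {
    "Greek": ("λογος", "iso8859_7"),
    "Russian": ("Москва", "koi8_r"),
    "Thai": ("ภาษาไทย", "tis_620"),
}


def encoded_sizes(text, legacy_codec):
    """Return the character count and the byte count under each encoding."""
    return {
        "chars": len(text),
        "legacy": len(text.encode(legacy_codec)),
        "utf8": len(text.encode("utf-8")),
        # "utf-16-be" avoids counting the two-byte byte order mark.
        "utf16": len(text.encode("utf-16-be")),
    }


for name, (text, codec) in samples.items():
    print(name, encoded_sizes(text, codec))
```

For these samples the 8-bit encodings use one byte per character, UTF-16 uses two, and UTF-8 uses two for Greek and Cyrillic and three for Thai.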
> C compilers and other language processors do not accept source code in
> arbitrary coded character sets; nor do editors and word processors, nor
> do Web browsers. Emacs does pretty well, on X, with character sets
> represented by fonts in the X library. I don't have high hopes for any
> users who need it to handle EBCDIC all of a sudden. The
> internationalized versions of Mosaic I have seen and heard about do
> accept more than one coded character set, but they are *not* extensible,
> in the sense of allowing run-time additions to their capabilities by the
> end-user. They are extensible in the sense of allowing programmers of
> sufficient skill to recompile them after tinkering with the
> character-handling code.
I would say that this is a poor design.
I don't want end-users to be able to add support for encodings, only
programmers. But I would like the following:
1. The code that needs to be changed is isolated from the parser and
the rest of the application.
2. When I'm done, I can still claim that I have an XML application.
3. I can communicate to other software that I am using an alternate
encoding for my information.
4. The parser-application API is isolated from any encoding changes.
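A minimal sketch of what points 1 and 4 could look like in practice, assuming a registry of decoders that sits outside the parser. Every name here (`register_decoder`, `decode`, `count_start_tags`) is hypothetical and invented for illustration; none of it comes from any XML draft or existing tool.

```python
# Hypothetical sketch: decoders live in a registry that the parser never sees.
from typing import Callable, Dict, List

Decoder = Callable[[bytes], List[int]]  # raw bytes -> Unicode code points

_DECODERS: Dict[str, Decoder] = {}


def register_decoder(label: str, decoder: Decoder) -> None:
    """A programmer adds an encoding here; the parser code is untouched."""
    _DECODERS[label.lower()] = decoder


def decode(label: str, data: bytes) -> List[int]:
    """Look up the decoder by label and turn bytes into code points."""
    try:
        return _DECODERS[label.lower()](data)
    except KeyError:
        raise LookupError(f"no decoder registered for {label!r}") from None


def count_start_tags(code_points: List[int]) -> int:
    """Stand-in for a parser: it sees code points only, never bytes."""
    lt, slash = ord("<"), ord("/")
    return sum(
        1
        for i, cp in enumerate(code_points[:-1])
        if cp == lt and code_points[i + 1] != slash
    )


register_decoder("utf-8", lambda b: [ord(c) for c in b.decode("utf-8")])
register_decoder("koi8-r", lambda b: [ord(c) for c in b.decode("koi8_r")])

doc = "<p>Москва</p>"
# Two different byte streams, one identical code point sequence for the parser.
assert decode("utf-8", doc.encode("utf-8")) == decode("koi8-r", doc.encode("koi8_r"))
print(count_start_tags(decode("koi8-r", doc.encode("koi8_r"))))
```

Adding a new encoding means one more `register_decoder` call; the stand-in parser and its interface to the application do not change.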
> On the whole, it seems to me simpler to tell users "To handle your
> unusual writing systems in XML, translate your documents into Unicode
> (using the private-use area if you need to) and invoke the XML parser"
> than to tell them "To handle your unusual writing systems in XML, recode
> the lexical scanner, recompile, and invoke the XML parser."
> >> I think it would be quite hard to guarantee that all XML systems will
> >> be able to meaningfully interpret any arbitrary XML document
> >> anyway.
> >Your not kidding. Even basic rendering in a browser can be quite
> ? Even with a style sheet? Perhaps you and Gavin have higher hopes for
> 'meaningful interpretation' than I do in the first place, but I am
> having trouble imagining *any* level of interpretation that won't become
> a lot more complex if the parser must adjust at run time to
> character sets unknown and unimagined at compile time.
I am not a DSSSL expert (nor really even an amateur) so I cannot
attest to its capabilities. I was simply referring to the way ISO
10646 decomposes characters. This makes the mapping from code point
to glyph non-trivial. Multiple ISO 10646 characters may need to be
combined to get the composite that is actually displayed. This is
further complicated by languages such as Arabic, in which glyphs change
depending on their proximity to other characters. Browsers capable
of doing this correctly for all languages are difficult to build and
will not exist for a while (if ever). There is simply no
commercial market for supporting languages like Burmese (which is one
of those languages that is not yet in ISO 10646). As soon as
the font mess is straightened out it will of course be possible to
do this rendering at the server, create the correct glyphs, map them
into the private use area, send a custom font and at least get the
browser to display it.
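As a rough illustration of why code point to glyph is not one-to-one, the sketch below uses Python's `unicodedata` module as a stand-in for what a renderer has to know. The helper `presentation_forms` is hypothetical, and the compatibility "presentation form" code points it inspects are only a convenient way to show that one Arabic letter has several contextual shapes; a real renderer would pick the shape from the neighbouring characters rather than from these code points.

```python
import unicodedata

# Two code points (base letter + combining acute accent) that display
# as a single composite glyph.
decomposed = "e\u0301"
composed = unicodedata.normalize("NFC", decomposed)
print(len(decomposed), len(composed))


def presentation_forms(base):
    """Map form name -> character for the contextual shapes of `base`.

    Scans the Arabic Presentation Forms-B block (U+FE70..U+FEFF) for
    characters whose compatibility decomposition is exactly `base`.
    """
    target = f"{ord(base):04X}"
    forms = {}
    for cp in range(0xFE70, 0xFF00):
        parts = unicodedata.decomposition(chr(cp)).split()
        if len(parts) == 2 and parts[1] == target:
            forms[parts[0].strip("<>")] = chr(cp)
    return forms


# ARABIC LETTER BEH (U+0628) has four positional shapes.
for form, ch in presentation_forms("\u0628").items():
    print(form, f"U+{ord(ch):04X}", unicodedata.name(ch))
```

The first part shows two characters collapsing into one displayed unit; the second shows one character fanning out into isolated, final, initial, and medial shapes.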
> It seems to me that allowing arbitrary coded character sets really
> pushes us over a line between something simple and something that may
> possibly still be tractable but is surely no longer simple. If Unicode
> is not enough, then a finite and small set of alternate coded character
> sets can be defined as legal input. Allowing arbitrary parse-time
> extension is not the way to keep XML simple to implement.
I always make the distinction between the parser, the entity manager
and the storage manager. The parser sees only UCS-4. It is the storage manager
that needs to be concerned with character encoding, not the parser. I
just want a way to add a storage manager to XML to
support other encodings, and have a standard way to record in a
data stream (possibly outside of SGML) that a specific encoding is being used.
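One way this split could look, sketched under the assumption that the encoding is recorded outside the document in a MIME `Content-Type` header. The function names `declared_charset` and `storage_manager`, the UTF-8 default, and the media type in the usage lines are all hypothetical choices of mine.

```python
from email.message import Message
from typing import List


def declared_charset(content_type: str, default: str = "utf-8") -> str:
    """Read the charset parameter from a MIME Content-Type value."""
    msg = Message()
    msg["Content-Type"] = content_type
    # get_content_charset() returns the lowercased charset, or None if absent.
    return msg.get_content_charset() or default


def storage_manager(raw: bytes, content_type: str) -> List[int]:
    """Decode raw bytes using the declared encoding.

    The result is a list of code points (conceptually UCS-4), which is
    all the parser would ever be handed.
    """
    text = raw.decode(declared_charset(content_type))
    return [ord(ch) for ch in text]


# The label travels with the data stream, not inside the document itself.
raw = "<p>Москва</p>".encode("koi8_r")
code_points = storage_manager(raw, "text/x-example; charset=KOI8-R")
print(code_points[:3])
```

Only `storage_manager` knows that the bytes were KOI8-R; anything downstream works on integers and needs no change when another encoding is added.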
> If one really, really needs arbitrary coded character sets, why not
> use Real SGML?
1. Due to product availability / price considerations.
2. Due to the increased performance of XML software over its more
feature-laden counterpart.
B. Todd Bauman
University of Maryland, Baltimore County