Re: XML character sets: a proposal from bbauma1@cs.umbc.edu on 1996-09-14 (w3c-sgml-wg@w3.org from September 1996)

From: <bbauma1@cs.umbc.edu>
Date: Sat, 14 Sep 1996 10:24:50 +0000
To: Michael Sperberg-McQueen <U35395@UICVM.CC.UIC.EDU>
CC: w3c-sgml-wg@w3.org
Message-Id: <199609141433.KAA28169@algol.cs.umbc.edu>
> On Thu, 12 Sep 1996 22:11:25 -0400 Todd Bauman said:
> >Even live languages.  I've got some of these documents, and I would
> >hate to see XML disallow the character encodings I need to use.
> >Gavin's right,
> 
> Can you expound on this a bit?  What character encodings do you
> currently use, for what texts, that won't fit into Unicode?  Do you
> really have encodings that can't even be handled by putting the
> characters you need into the private use area of the BMP?  If you do,
> I'd really like to know more about it.

I stand corrected.  I do use the private use area for this.  I was
simply thinking about all of the odd nonstandard 8-bit character
encodings and matching fonts that I have to employ to get these
languages through existing tools.  Of course, one problem with the
private use area is that it's private.
> 
> >using UTF-8 as a default and / or suggested encoding and including it
> >in a reference implementation is one thing.  Prohibiting the use of
> >other character encodings is too restrictive.  Whether through MIME and
> >/ or through FSI's, XML has to be extensible in this regard.
> 
> I'm having trouble thinking of serious applications that meet the
> standard you appear to be setting, i.e. that do not restrict their
> character sets in any way.

It's not that I want an unrestricted number of character sets.  I just
want to be free to use different encodings of that set, and I want a
standard way to inform others that I am doing this.  Specifying only
one or two encodings, such as UTF-8 and / or UTF-16, is too restrictive.

1. Many people like the encodings that they currently use, have the 
tools to work with them, and won't be changing any time soon.

2. UTF-8 / UTF-16 are terribly inefficient encodings for a large 
number of languages.  They require 2 or 3 bytes per character when an 
alternate encoding would require only one.  UTF-8 is particularly 
offensive with its blatant western bias.  No one is going to use 
these inefficient encodings when they have large amounts of information 
to store / transmit and they are paying for the bandwidth.  
Moreover, many of the languages that UTF-8 bloats to two or three 
times their size are those used by countries that have access to the 
worst computer and communications technology.
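
As a rough illustration of point 2, the following minimal Python sketch
counts the bytes needed for the same short words in an 8-bit encoding,
in UTF-8 and in UTF-16.  The sample words and the three 8-bit codecs
(ISO 8859-7, KOI8-R, TIS-620) are assumptions chosen for illustration;
they are not encodings named in this thread.

```python
# Compare the storage cost of one string in a legacy 8-bit encoding
# against UTF-8 and UTF-16.  Sample words and legacy codecs are
# illustrative assumptions only.
SAMPLES = [
    ("Greek", "ΕΛΛΑΔΑ", "iso8859_7"),
    ("Russian", "Россия", "koi8_r"),
    ("Thai", "ภาษาไทย", "tis_620"),
]


def byte_costs(text, legacy_codec):
    """Return the encoded size in bytes, keyed by encoding name."""
    return {
        legacy_codec: len(text.encode(legacy_codec)),
        "utf-8": len(text.encode("utf-8")),
        # utf-16-be is used so the 2-byte byte-order mark is not counted
        "utf-16": len(text.encode("utf-16-be")),
    }


for language, text, codec in SAMPLES:
    print(language, len(text), "characters:", byte_costs(text, codec))
```

For these samples the 8-bit encodings use one byte per character, while
UTF-8 uses two bytes per character for the Greek and Russian words and
three for the Thai word.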
 
> C compilers and other language processors do not accept source code in
> arbitrary coded character sets; nor do editors and word processors, nor
> do Web browsers.  Emacs does pretty well, on X, with character sets
> represented by fonts in the X library.  I don't have high hopes for any
> users who need it to handle EBCDIC all of a sudden.  The
> internationalized versions of Mosaic I have seen and heard about do
> accept more than one coded character set, but they are *not* extensible,
> in the sense of allowing run-time additions to their capabilities by the
> end-user.  They are extensible in the sense of allowing programmers of
> sufficient skill to recompile them after tinkering with the
> character-handling code.

I would say that this is a poor design. 

I don't want end-users to be able to add support for encodings, only
programmers.  But I would like the following:

1. The code that needs to be changed is isolated from the parser and 
the rest of the application.
2. When I'm done I can still claim that I have an XML application.
3. I can communicate to other software that I am using an alternate 
encoding for my information.
4. The parser - application API is isolated from any encoding changes 
I make.
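
The four points above can be sketched with Python's codec registry,
where a programmer adds an encoding in one place and the parser never
changes.  This is only a sketch under stated assumptions: the encoding
name `x_demo_8bit`, its byte-to-private-use mapping, and the `parse`
stand-in are all hypothetical.

```python
import codecs

# Hypothetical 8-bit encoding: bytes 0x00-0x7F are ASCII, bytes 0x80-0xFF
# map to private-use code points U+E000-U+E07F.  Name and mapping are
# assumptions made up for this example.
CODEC_NAME = "x_demo_8bit"
DECODING_TABLE = "".join(
    chr(b) if b < 0x80 else chr(0xE000 + b - 0x80) for b in range(256)
)
ENCODING_TABLE = codecs.charmap_build(DECODING_TABLE)


def _decode(data, errors="strict"):
    return codecs.charmap_decode(data, errors, DECODING_TABLE)


def _encode(text, errors="strict"):
    return codecs.charmap_encode(text, errors, ENCODING_TABLE)


def _search(name):
    # Called by the codec machinery with the requested encoding name.
    if name == CODEC_NAME:
        return codecs.CodecInfo(encode=_encode, decode=_decode, name=CODEC_NAME)
    return None


# All encoding-specific code lives above this line.
codecs.register(_search)


def parse(text):
    """Stand-in for the parser: it receives characters, never raw bytes."""
    return text.split("<")


raw = b"<doc>\x80\x81</doc>"
print(parse(raw.decode(CODEC_NAME)))
```

The `parse` function has the same signature whichever encoding the
bytes arrived in, which is the isolation asked for in points 1 and 4.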

> 
> On the whole, it seems to me simpler to tell users "To handle your
> unusual writing systems in XML, translate your documents into Unicode
> (using the private-use area if you need to) and invoke the XML parser"
> than to tell them "To handle your unusual writing systems in XML, recode
> the lexical scanner, recompile, and invoke the XML parser."
> 

> >> I think it would be quite hard to guarantee that all XML systems will
> >> be able to meaningfully interpret any arbitrary XML document
> >> anyway.
> >
> >Your not kidding. Even basic rendering in a browser can be quite
> >difficult.
> 
> ? Even with a style sheet?  Perhaps you and Gavin have higher hopes for
> 'meaningful interpretation' than I do in the first place, but I am
> having trouble imagining *any* level of interpretation that won't become
> a lot more complex if the parser must adjust at run time to
> character sets unknown and unimagined at compile time.
> 
I am not a DSSSL expert (nor really even an amateur) so I cannot 
attest to its capabilities.  I was simply referring to the way ISO 
10646 decomposes characters.  This makes the mapping from code point 
to glyph non-trivial.  Multiple ISO 10646 characters may need to be 
combined to get the composite that is actually displayed.  This is 
further complicated by languages such as Arabic, in which glyphs change 
depending on their proximity to other characters.  Browsers capable 
of doing this correctly for all languages are difficult to build and 
will not exist for a while (if ever).  There is simply no 
commercial market for supporting languages like Burmese (which is one 
of those languages that is not yet in ISO 10646).  As soon as 
the font mess is straightened out it will of course be possible to 
do this rendering at the server, create the correct glyphs, map them 
into the private use area, send a custom font and at least get the 
browser to display it.
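
A small Python sketch of why code point to glyph is not one-to-one.
It assumes the present-day Unicode character database exposed by
Python's `unicodedata` module, which is not the 1996 ISO 10646 tables,
so treat it as an illustration of the idea only.

```python
import unicodedata

# One displayed character can be made of several coded characters.
decomposed = "e\u0301"  # LATIN SMALL LETTER E + COMBINING ACUTE ACCENT
composed = unicodedata.normalize("NFC", decomposed)
print(len(decomposed), "code points compose to", len(composed))

# Arabic BEH (U+0628) is drawn with a different glyph depending on its
# neighbours.  Unicode lists the four shapes as compatibility
# "presentation forms" that all decompose to the same letter.
for code_point in (0xFE8F, 0xFE90, 0xFE91, 0xFE92):
    form = chr(code_point)
    print(unicodedata.name(form), "->", unicodedata.decomposition(form))
```

A renderer has to make both choices (which characters to combine, and
which contextual shape to draw) before it can pick a glyph.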

> It seems to me that allowing arbitrary coded character sets really
> pushes us over a line between something simple and something that may
> possibly still be tractable but is surely no longer simple.  If Unicode
> is not enough, then a finite and small set of alternate coded character
> sets can be defined as legal input.  Allowing arbitrary parse-time
> extension is not the way to keep XML simple to implement.
> 

I always make the distinction between the parser, the entity manager 
and the storage manager.  The parser sees only UCS-4.  It is the 
storage manager that needs to be concerned with character encoding, 
not the parser.  I just want a way to add a storage manager to XML to 
support other encodings, and have a standard way to record in a 
data stream (possibly outside of SGML) that a specific encoding is being used.
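
A minimal sketch of that split, assuming the encoding is recorded
outside the document as a MIME `charset` parameter.  The function names
and the `text/xml` media type are hypothetical; the point is only that
the parser stand-in receives integer code points and never sees bytes
or an encoding name.

```python
from email.message import Message


def charset_from_content_type(header_value, default="utf-8"):
    """Read the charset parameter from a MIME Content-Type value."""
    msg = Message()
    msg["Content-Type"] = header_value
    return msg.get_content_charset(default)


def storage_manager(raw, content_type):
    """Decode raw bytes using the externally declared encoding.

    Returns a list of integer code points, standing in for UCS-4.
    """
    encoding = charset_from_content_type(content_type)
    return [ord(ch) for ch in raw.decode(encoding)]


def count_start_tags(code_points):
    """Stand-in parser: works on code points only."""
    lt, slash = ord("<"), ord("/")
    return sum(
        1
        for i, cp in enumerate(code_points[:-1])
        if cp == lt and code_points[i + 1] != slash
    )


document = "<p>Россия</p>"
as_koi8 = storage_manager(document.encode("koi8_r"), "text/xml; charset=KOI8-R")
as_utf8 = storage_manager(document.encode("utf-8"), "text/xml; charset=UTF-8")
print(as_koi8 == as_utf8, count_start_tags(as_koi8))
```

Both byte streams reach the parser as the same sequence of code points,
so supporting another encoding means touching only the storage manager.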

> If one really, really needs arbitrary coded character sets, why not
> use Real SGML?

1. Due to product availability / price considerations.
2. Due to the increased performance of XML software over its more 
feature-laden counterpart.




B. Todd Bauman
Graduate Student
University of Maryland, Baltimore County
Received on Saturday, 14 September 1996 10:33:43 UTC