- From: <bbauma1@cs.umbc.edu>
- Date: Sat, 14 Sep 1996 10:24:50 +0000
- To: Michael Sperberg-McQueen <U35395@UICVM.CC.UIC.EDU>
- CC: w3c-sgml-wg@w3.org
> On Thu, 12 Sep 1996 22:11:25 -0400 Todd Bauman said:
> >Even live languages. I've got some of these documents, and I would
> >hate to see XML disallow the character encodings I need to use.
> >Gavin's right,
>
> Can you expound on this a bit? What character encodings do you
> currently use, for what texts, that won't fit into Unicode? Do you
> really have encodings that can't even be handled by putting the
> characters you need into the private use area of the BMP? If you do,
> I'd really like to know more about it.

I stand corrected. I do use the private use area for this. I was
simply thinking about all of the odd nonstandard 8-bit character
encodings and matching fonts that I have to employ to get these
languages through existing tools. Of course, one problem with the
private use area is that it's private.

>
> >using UTF-8 as a default and / or suggested encoding and including it
> >in a reference implementation is one thing. Prohibiting the use of
> >other character encodings is too restrictive. Whether through MIME and
> >/ or through FSI's, XML has to be extensible in this regard.
>
> I'm having trouble thinking of serious applications that meet the
> standard you appear to be setting, i.e. that do not restrict their
> character sets in any way.
> It's not that I want to have an
> unrestricted character set; it's that I want to have a way to inform
> others that I am employing a particular character set encoding.
> Specifying 1 or 2 such encodings such as UTF-8 and / or UTF-16 is
> too restrictive.

It's not that I want an unrestricted number of character sets. I just
want to be free to use different encodings of that set, and a standard
way to inform others that I am doing this. Specifying one or two
encodings is too restrictive.

1. Many people like the encodings that they currently use, have the
   tools to work with them, and won't be changing anytime soon.

2. UTF-8 / UTF-16 are terribly inefficient encodings for a large
   number of languages.
   They require 2 or 3 bytes per character (2 in UTF-16, 2 or 3 in
   UTF-8) when an alternate encoding would require only one. UTF-8 is
   particularly offensive with its blatant western bias. No one is
   going to use these inefficient encodings when they have large
   amounts of information to store / transmit and they are paying for
   the bandwidth. Moreover, many of the languages that UTF-8 bloats in
   size by two or three times are those used by countries that have
   access to the worst computer and communications technology.

> C compilers and other language processors do not accept source code in
> arbitrary coded character sets; nor do editors and word processors, nor
> do Web browsers. Emacs does pretty well, on X, with character sets
> represented by fonts in the X library. I don't have high hopes for any
> users who need it to handle EBCDIC all of a sudden. The
> internationalized versions of Mosaic I have seen and heard about do
> accept more than one coded character set, but they are *not* extensible,
> in the sense of allowing run-time additions to their capabilities by the
> end-user. They are extensible in the sense of allowing programmers of
> sufficient skill to recompile them after tinkering with the
> character-handling code.

I would say that this is a poor design. I don't want end-users to be
able to add support for encodings, only programmers. But I would like
the following:

1. The code that needs to be changed should be isolated from the
   parser and the rest of the application.

2. When I'm done, I can still claim that I have an XML application.

3. I can communicate to other software that I am using an alternate
   encoding for my information.

4. The parser - application API is isolated from any encoding changes
   I make.
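
To make points 1 through 4 concrete, here is a minimal sketch in
Python. It is only an illustration under my own assumptions: the names
StorageManager, register, read and parse are hypothetical and come
from no XML draft or existing tool, and KOI8-R is picked merely as an
example of an 8-bit encoding people already use. The decoding code
lives in one place, the encoding is named by a label that could be
carried alongside the data, and the stand-in parser only ever sees
code points.

```python
class StorageManager:
    """Hypothetical layer that turns stored bytes into code points."""

    def __init__(self):
        self._decoders = {}  # encoding label -> function(bytes) -> str

    def register(self, label, decode):
        """Add support for one encoding without touching the parser."""
        self._decoders[label.lower()] = decode

    def read(self, raw, label):
        """Decode raw bytes according to the declared encoding label."""
        try:
            decode = self._decoders[label.lower()]
        except KeyError:
            raise LookupError("no decoder registered for %r" % label)
        return [ord(ch) for ch in decode(raw)]


def parse(code_points):
    """Stand-in for the parser: it only ever sees code points."""
    return "".join(chr(cp) for cp in code_points)


manager = StorageManager()
manager.register("utf-8", lambda raw: raw.decode("utf-8"))
manager.register("koi8-r", lambda raw: raw.decode("koi8_r"))

text = "Привет"
as_koi8 = text.encode("koi8_r")  # one byte per Cyrillic letter
as_utf8 = text.encode("utf-8")   # two bytes per Cyrillic letter

# Both byte streams reach the parser as the same code points.
same = parse(manager.read(as_koi8, "koi8-r")) == parse(manager.read(as_utf8, "utf-8"))
print(len(as_koi8), len(as_utf8), same)
```

Run as written, this prints 6 12 True: the same six letters take twice
the space in UTF-8, yet the parser cannot tell which encoding was
stored. Supporting another encoding means one more register call and
no change to parse.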
>
> On the whole, it seems to me simpler to tell users "To handle your
> unusual writing systems in XML, translate your documents into Unicode
> (using the private-use area if you need to) and invoke the XML parser"
> than to tell them "To handle your unusual writing systems in XML, recode
> the lexical scanner, recompile, and invoke the XML parser."
>
> >> I think it would be quite hard to guarantee that all XML systems will
> >> be able to meaningfully interpret any arbitrary XML document
> >> anyway.
> >
> >Your not kidding. Even basic rendering in a browser can be quite
> >difficult.
>
> ? Even with a style sheet? Perhaps you and Gavin have higher hopes for
> 'meaningful interpretation' than I do in the first place, but I am
> having trouble imagining *any* level of interpretation that won't become
> a lot more complex if the parser must adjust at run time to
> character sets unknown and unimagined at compile time.
>

I am not a DSSSL expert (nor really even an amateur), so I cannot
attest to its capabilities. I was simply referring to the way ISO
10646 decomposes characters. This makes the mapping from code point to
glyph non-trivial. Multiple ISO 10646 characters may need to be
combined to get the composite that is actually displayed. This is
further complicated by languages such as Arabic, in which glyphs
change depending on their proximity to other characters. Browsers
capable of doing this correctly for all languages are difficult to
build and will not exist for a while (if ever). There is simply no
commercial market for supporting languages like Burmese (which is one
of those languages that is not yet in ISO 10646). As soon as the font
mess is straightened out, it will of course be possible to do this
rendering at the server: create the correct glyphs, map them into the
private use area, send a custom font, and at least get the browser to
display it.
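
A small sketch of the code point to glyph problem, using Python's
standard unicodedata module (it follows Unicode, whose repertoire is
kept in step with ISO 10646). The helper names display_units and
is_private_use are hypothetical. Grouping a base character with the
combining marks that follow it is a deliberate simplification: it does
nothing for contextual shaping of the Arabic kind, which needs far
more than this.

```python
import unicodedata


def display_units(text):
    """Group each base character with the combining marks after it."""
    units = []
    for ch in text:
        if units and unicodedata.combining(ch) != 0:
            units[-1] += ch  # attach the mark to the preceding base
        else:
            units.append(ch)
    return units


def is_private_use(ch):
    """True if the code point is in a private use area (category Co)."""
    return unicodedata.category(ch) == "Co"


decomposed = "e\u0301tude"  # 'e' followed by COMBINING ACUTE ACCENT
units = display_units(decomposed)
composed = unicodedata.normalize("NFC", decomposed)

print(len(decomposed), len(units), len(composed))
print(is_private_use("\ue000"))
```

Six stored code points yield five units to display, so a renderer
cannot map code points to glyphs one for one. The last line checks
that U+E000 falls in the private use area, which is where
server-created glyphs of the kind described above would have to go.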
> It seems to me that allowing arbitrary coded character sets really
> pushes us over a line between something simple and something that may
> possibly still be tractable but is surely no longer simple. If Unicode
> is not enough, then a finite and small set of alternate coded character
> sets can be defined as legal input. Allowing arbitrary parse-time
> extension is not the way to keep XML simple to implement.
>

I always make the distinction between the parser, the entity manager
and the storage manager. The parser sees only UCS-4. It is the storage
manager that needs to be concerned with character encoding, not the
parser. I just want a way to add a storage manager to XML to support
other encodings, and a standard way to record in a data stream
(possibly outside of SGML) that a specific encoding is being used.

> If one really, really needs arbitrary coded character sets, why not
> use Real SGML?

1. Due to product availability / price considerations.

2. Due to the increased performance of XML software over its more
   feature-laden counterpart.

B. Todd Bauman
Graduate Student
University of Maryland, Baltimore County
Received on Saturday, 14 September 1996 10:33:43 UTC