- From: Gavin Nicol <gtn@ebt.com>
- Date: Fri, 13 Sep 1996 19:44:16 GMT
- To: U35395@UICVM.CC.UIC.EDU
- CC: bbauma1@cs.umbc.edu, w3c-sgml-wg@w3.org
>Can you expound on this a bit? What character encodings do you >currently use, for what texts, that won't fit into Unicode? Do you >really have encodings that can't even be handled by putting the >characters you need into the private use area of the BMP? If you do, >I'd really like to know more about it. Some of the Chinese encodings have thousands of characters not in Unicode. Then we also have live languages that still do not even have a computerised form yet. >I'm having trouble thinking of serious applications that meet the >standard you appear to be setting, i.e. that do not restrict their >character sets in any way. It's not so much a matter of the application being unrestricted, as the *standard* giving enough freedom of movement to allow applications to be built that would generally fall outsude the mainstream use (for which UNICODE will generally suffice). As I noted, we can fix the coded character set to ISO 10646, and we have the BMP, which provides us with enoguh room to accomodate these other applications, provided we do not restrict the list of acceptable encodings. As for a serious application: something I have long wanted to do is to build a system in which characters could be registered, along with references to common code points (something like TEI WSD) so that one could build "virtual" character repertoires and "virtual" coded character sets. The system would also have glyph server technology so that an application (or perhaps an OS) could retrieve the glyph images as appropriate. You can kind of fake this at the moment with SDATA entities. >C compilers and other language processors do not accept source code in >arbitrary coded character sets; nor do editors and word processors, nor >do Web browsers. This is just a limitation in the implementations. They could if they wished. >The internationalized versions of Mosaic I have seen and heard about >do accept more than one coded character set, but they are *not* >extensible, in the sense of allowing run-time additions to their >capabilities by the end-user. They are extensible in the sense of >allowing programmers of sufficient skill to recompile them after >tinkering with the character-handling code. Sure. In the future, I expect JAVA to provide a nice way for people to actually support *arbitrary* encodings, and at some point, a way to render arbitrary character sets. I have been doing some work on this on the side. Eventually, I would like to see coded character sets as such disappear. >than to tell them "To handle your unusual writing systems in XML, recode >the lexical scanner, recompile, and invoke the XML parser." In a well designed system, this is unecessary. I should note that I have never said that I wish to *require* that all XML parsers be open-ended. I have no problem at all with seeing Latin 1 XML systems, and SJIS XML systems, though I expect most will actually be UNICODE based. >If one really, really needs arbitrary coded character sets, why not >use Real SGML? Because it's not necessary to do so.
Received on Friday, 13 September 1996 15:46:04 UTC