- From: Rick Jelliffe <ricko@allette.com.au>
- Date: Mon, 30 Sep 1996 19:09:25 +1000 (EST)
- To: lee@sq.com
- Cc: Charles@sgmlsource.com, paul@arbortext.com, w3c-sgml-wg@w3.org
On Wed, 25 Sep 1996 lee@sq.com wrote: > I am probably being obtuse here, but how would I refer to (for example) > a symbol that has no Unicode or ISO10646 code point? For example, suppose > I need an r-cedilla... in SGML I've used SDATA entities for that. Unicode 0327 hex is a non-spacing cedilla. So if you were using SDATA entities the reference could look like r&U0327; . Or you could use the numeric character reference using decimal. There is little point attempting to adopt Unicode without allowing composed character sequences like this, since many scripts need it (e.g. Indic scripts). Which is not to say that alternative forms should be encouraged where they are possible: for Latin diacriticals I think it would be good to have a policy in XML favouring precomposed characters, but this can't be policy for many other scripts in Unicode/ISO10646. There is a very interesting (I am no expert, but it seems sensible to me) proposal being put to Unicode at the moment to add a large (8000 characters?) set of compound primitives for Han ideographs to ISO 10646/Unicode. These (in conjunction with 3 operators 'next-to', 'under' and 'contains') allow rendering of tens of thousands of extra characters, with a fixed derivation. The proposal I have heard have so far got up to 49000 characters derivable with this set, without finding any characters not representable. (If 49 000 seems a lot, I was told by a representative of a big Japanese company that their typesetting system uses 200 000 characters, by which he meant something in between 'character' and 'glyph' I hope:-) The idea is basically to convert an 'unbounded set' into a 'bounded set': they are not the radicals directly, but are based on a statistical decomposition of characters based on their commonness and on maximum desirable sequence lengths. Of special interest to this XML group is that if these are added to Unicode/ISO10646, it means that SDATA entities (and user-defined characters?) would not be required as a feature in XML for Han ideograph needs of CJK. Rick Jelliffe http://www.allette.com.au/allette/ricko email: ricko@allette.com.au ================================================================ Allette Systems http://www.allette.com.au email: info@allette.com.au 10/91 York St, 2000, phone: +61 2 9262 4777 Sydney, Australia fax: +61 2 9262 4774 ================================================================
Received on Monday, 30 September 1996 07:01:36 UTC