Characters (Re: questions about entities and entity declarations) from Rick Jelliffe on 1996-09-30 (w3c-sgml-wg@w3.org from September 1996)

From: Rick Jelliffe <ricko@allette.com.au>
Date: Mon, 30 Sep 1996 19:09:25 +1000 (EST)
To: lee@sq.com
Cc: Charles@sgmlsource.com, paul@arbortext.com, w3c-sgml-wg@w3.org
Message-Id: <Pine.ULT.3.90.960930183533.6247C-100000@chuckd.allette.com.au>

On Wed, 25 Sep 1996 lee@sq.com wrote:

> I am probably being obtuse here, but how would I refer to (for example)
> a symbol that has no Unicode or ISO10646 code point?  For example, suppose
> I need an r-cedilla... in SGML I've used SDATA entities for that.

Unicode 0327 hex is a non-spacing cedilla. So if you were using SDATA
entities the reference could look like  r&U0327; . Or you could use the
numeric character reference using decimal. 

There is little point attempting to adopt Unicode without allowing
composed character sequences like this, since many scripts need it (e.g.
Indic scripts).  Which is not to say that alternative forms should be
encouraged where they are possible: for Latin diacriticals I think
it would be good to have a policy in XML favouring precomposed characters,
but this can't be policy for many other scripts in Unicode/ISO10646.

There is a very interesting (I am no expert, but it seems sensible to me)
proposal being put to Unicode at the moment to add a large (8000
characters?) set of compound primitives for Han ideographs to ISO
10646/Unicode. These (in conjunction with 3 operators 'next-to', 'under'
and 'contains') allow rendering of tens of thousands of extra characters,
with a fixed derivation.  The proposal I have heard have so far got up to
49000 characters derivable with this set, without finding any characters
not representable.  (If 49 000 seems a lot, I was told by a representative
of a big Japanese company that their typesetting system uses 200 000 
characters, by which he meant something in between 'character' and 'glyph'
I hope:-)

The idea is basically to convert an 'unbounded set' into a 'bounded set':
they are not the radicals directly, but are based on a statistical
decomposition of characters based on their commonness and on maximum
desirable sequence lengths. Of special interest to this XML group is that
if these are added to Unicode/ISO10646, it means that SDATA entities (and
user-defined characters?) would not be required as a feature in XML for
Han ideograph needs of CJK. 

Rick Jelliffe            http://www.allette.com.au/allette/ricko
                         email: ricko@allette.com.au
================================================================
Allette Systems          http://www.allette.com.au
                         email: info@allette.com.au
10/91 York St, 2000,     phone: +61 2 9262 4777
Sydney, Australia        fax:   +61 2 9262 4774
================================================================

Received on Monday, 30 September 1996 07:01:36 UTC