Re: "Open Interchange" from Rick Jelliffe on 1997-06-03 (w3c-sgml-wg@w3.org from June 1997)

From: Rick Jelliffe <ricko@allette.com.au>
Date: Tue, 3 Jun 1997 20:16:15 +1000
To: "Murata Makoto" <murata@apsdc.ksp.fujixerox.co.jp>, <w3c-sgml-wg@w3.org>
Message-Id: <199706031015.UAA17482@jawa.chilli.net.au>

> From: Murata Makoto <murata@apsdc.ksp.fujixerox.co.jp>
 
> Hashigo-Daga, etc., full text search will become impossible.

Which is why we must have a better character model. 

In XML we already have:

* Characters encoded directly in the storage encoding of the document;

* Characters given as numeric character references to ISO 10646;  

* Characters given as entity references to a predefined set.

What XML needs is a way to include other characters such that enough information to be useful travels with
them. I propose that we need markup to:

1) give the character an identifier;
2) give the character a name;
3) nominate an equivalent ISO 10646 character for purposes of searching, sorting and simple display, per locale;
4) give a URL for the glyph, under some protocol (what?).

The basic problem is that character sets do not adequately tackle the character variant issue. (The exception, apparently, is a
Taiwanese character set that is organised as tables of variants.)  But we can (and should) build this into XML.

In SGML, the best way to do this is probably something like this:

<!ENTITY hishigo-daga SYSTEM 
	NDATA XML-char 
		[ xml-role="CHARACTER"
		xml-char-name="JAPANESE HAN IDEOGRAPH HISHIGODAGA"
 		xml-equiv="ja &#3002; ko &#3033;"
		xml-class="letter" 
		href="gttp://w3.org/glyphs/japanese.font#12" ]
>

Which would go in the prolog I guess. (Or an equivalent form using elements or PIs.)

To give an example that might make more sense to British-derived readers:  

<!ENTITY Mac SYSTEM 
	NDATA XML-char 
		[ xml-role="CHARACTER"
		xml-char-name="Ligature for Mac in Scottish Names"
 		xml-equiv="Mac"
		xml-class="letter" 
		href="gttp://w3.org.uk/glyphs/scotland.font#3" ]
>

which means: use the "M<sup>c</sup>" glyph if you can retrieve it; otherwise, for sorting etc., use "Mac".
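
To make the intended behaviour concrete, here is a rough sketch in Python of how an application might act on such declarations. It is only an illustration under assumptions: the names CharDecl, parse_equiv, resolve and expand_for_search are hypothetical, the "*" key for a locale-independent equivalent is my own convention, and fetching the glyph is left to a caller-supplied function because the protocol is still an open question.

```python
import re
from dataclasses import dataclass
from typing import Callable, Dict, Optional, Tuple


def parse_equiv(spec: str) -> Dict[str, str]:
    """Parse an xml-equiv value into a {locale: equivalent} map.

    "ja X ko Y" gives {"ja": "X", "ko": "Y"}. A single token such as
    "Mac" is stored under "*" (assumed convention: applies to any locale).
    """
    tokens = spec.split()
    if len(tokens) == 1:
        return {"*": tokens[0]}
    if len(tokens) % 2 != 0:
        raise ValueError(f"expected locale/value pairs, got: {spec!r}")
    return dict(zip(tokens[0::2], tokens[1::2]))


@dataclass
class CharDecl:
    """One declared character: identifier, name, equivalents, glyph location."""
    ident: str
    name: str
    equiv: Dict[str, str]
    href: str

    def fallback(self, locale: str) -> Optional[str]:
        # Prefer the locale-specific equivalent, then the default one.
        return self.equiv.get(locale, self.equiv.get("*"))


def resolve(decl: CharDecl, locale: str,
            fetch_glyph: Callable[[str], Optional[bytes]]
            ) -> Tuple[str, object]:
    """For display: use the glyph if it can be retrieved, else the equivalent."""
    data = fetch_glyph(decl.href)
    if data is not None:
        return ("glyph", data)
    return ("text", decl.fallback(locale))


# Matches general entity references such as &Mac;
ENTITY_REF = re.compile(r"&([A-Za-z][\w.-]*);")


def expand_for_search(text: str, decls: Dict[str, CharDecl],
                      locale: str) -> str:
    """For searching and sorting: replace declared characters by equivalents."""
    def repl(match):
        decl = decls.get(match.group(1))
        fb = decl.fallback(locale) if decl else None
        # Leave unknown references (or ones with no equivalent) untouched.
        return fb if fb is not None else match.group(0)
    return ENTITY_REF.sub(repl, text)


mac = CharDecl(
    ident="Mac",
    name="Ligature for Mac in Scottish Names",
    equiv=parse_equiv("Mac"),
    href="gttp://w3.org.uk/glyphs/scotland.font#3",
)
names = ["&Mac;Donald", "Mbeki", "Mabbott"]
ordered = sorted(names, key=lambda n: expand_for_search(n, {"Mac": mac}, "en"))
```

With this, a full-text index would store "MacDonald" for "&Mac;Donald", while a display program that can fetch the glyph would show the ligature instead.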

Rick Jelliffe

Received on Tuesday, 3 June 1997 06:15:46 UTC