- From: Rick Jelliffe <ricko@allette.com.au>
- Date: Thu, 10 Apr 1997 19:46:46 +1000
- To: Murata Makoto <murata@apsdc.ksp.fujixerox.co.jp>
- CC: w3c-sgml-wg@w3.org
Murata Makoto wrote:

> JIS started to design two new character code standards. Although
> these will be used together with JIS X 0208, some of the characters in
> these new standards are more important than some of the JIS X 0208
> characters, said Prof. Shibano. In three years or so, JIS will propose
> to ISO these new standards as part of ISO 10646. The BMP is not likely
> to include them.

W3C XML people should be careful not to panic on discovering that
Unicode 2.0 is not the ultimate character set! That is no reason not to
adopt it in XML 1.0 today.

Especially in East and South East Asia, standards-making bodies and
academics are still making a great effort to figure out exactly which
characters are needed, including by statistical methods. There can never
be a character set that contains *ALL* 'Han' ideographs, for the simple
reason that new ones are being invented all the time, especially in
Taiwan. (And in any case, for some kinds of scholarly, historical
material, the glyph/character distinction may not be completely
helpful.)

This breakdown is why I think the XML group needs to add a fourth item
to its agenda, to be dealt with last: a distributed font or glyph
service for XML.

Background
----------

The problem of Han ideographs is that they are an unbounded set. It is
up to CJK national bodies to add important and common characters to
ISO 10646 and to their various regional character sets. But the rarer
characters cannot be handled this way: it is not logistically feasible.

Two methods of circumventing this have been proposed. Both suggest an
embedded layer of encoding on top of characters, to decentralise
character definition towards the creators of the documents:

* defer the problem by using SGML SDATA entities to refer to the
  characters (i.e.
  the SPREAD entities, or the Electronic Buddhist Text Initiative's
  KanjiBase glyph set): this is inappropriate to XML, which, unlike
  SGML, is aimed at fully resolved documents suitable for immediate use;

* embed some unique character sequence that also describes the glyph in
  terms of its components (Prof Hsieh from Academia Sinica's proposal to
  ISO 10646): this is promising, but is a thing for the future.

Proposal
--------

I think we need to say that the central difference between characters
and glyphs is that characters can be searched on directly from the XML
text without a knowledge of the DTD.

Using this definition, we can deem any (and only) characters in the
ISO 10646 BMP to be XML characters. (They will be either character codes
or numeric character references.)

Next, we can say that a character that is not in the ISO 10646 BMP is
not an XML character. It must be marked up in some other way: it is a
glyph reference.

The appropriate way to mark up a glyph which has no corresponding
character in ISO 10646 is using (a reference to an entity containing?)
an empty element that nominates a particular font and code point, for
example:

  <p>blah blah <char font="font://www.blah/dame-sans" code="223" alt="缾"/> blah</p>

The central advantage of doing this is that it allows XML users the
freedom to sidestep the standardisation process. If they need a
particular glyph, they don't need to wait for various standards bodies
to agree it is a useful character, and then for international bodies to
see whether it is really just a glyph variant and so is already present
in a unified character, and then for font makers to make and distribute
the correct fonts, etc.

This is not just a CJK issue: glyph references may be welcomed by
mathematicians, as well as page designers who want to include corporate
logos in text. (The inline IMG element is the HTML ancestor for this, of
course.)
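
As a rough illustration of the character/glyph split described above,
here is a minimal Python sketch. It assumes the <char> element and its
font, code and alt attributes exactly as in the example markup; the
function names glyph_refs and searchable_text are hypothetical, and the
font URI is the example's placeholder, not a real service.

```python
import xml.etree.ElementTree as ET

# Sample markup taken from the example above. The font URI and code point
# are illustrative placeholders, not a real font service.
SAMPLE = ('<p>blah blah <char font="font://www.blah/dame-sans" '
          'code="223" alt="缾"/> blah</p>')


def glyph_refs(xml_text):
    """Return (font, code, alt) for every <char> glyph reference.

    Hypothetical helper: assumes each <char> carries all three attributes.
    """
    root = ET.fromstring(xml_text)
    return [(el.get("font"), int(el.get("code")), el.get("alt"))
            for el in root.iter("char")]


def searchable_text(xml_text):
    """Return only the character data, i.e. what a search that knows
    nothing about the DTD can match; glyph references contribute nothing."""
    root = ET.fromstring(xml_text)
    return "".join(root.itertext())


if __name__ == "__main__":
    print(glyph_refs(SAMPLE))
    print(repr(searchable_text(SAMPLE)))
```

The point of the sketch is only that the glyph reference is ordinary
markup: a renderer can look up the nominated font and code point, while
a plain text search sees just the surrounding characters.
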
Facsimile and historical users may welcome glyph references too, as
might someone who just wants to add a fancy drop capital.

I think it will be hard to get XML vendors to make their products truly
internationalised (i.e. adopt ISO 10646 numeric character references
regardless of their regional character set) unless we strictly limit XML
to ISO 10646 BMP/Unicode 2.0 and provide a good mechanism to deal with
exceptions.

Unicode is well-promulgated and accessible: character-set-level support
for characters extra to Unicode really misses the need of XML users,
IMHO: for strange and rare or currently non-standard characters, it is
the ability to locate and display the glyph that is most important.

Summary
-------

ISO 10646 BMP is enough for XML characters. But people legitimately need
more. A mechanism to let them do it themselves is appropriate (and fits
into the WWW idiom). So part of XML should be a simple glyph service
system, allowing people who create documents to add extra glyphs as
needed.

-Rick Jelliffe
Received on Thursday, 10 April 1997 05:41:25 UTC