New work item for XML group ? (Re: Comments on 31 March spec) from Rick Jelliffe on 1997-04-10 (w3c-sgml-wg@w3.org from April 1997)

From: Rick Jelliffe <ricko@allette.com.au>
Date: Thu, 10 Apr 1997 19:46:46 +1000
To: Murata Makoto <murata@apsdc.ksp.fujixerox.co.jp>
CC: w3c-sgml-wg@w3.org
Message-ID: <334CB706.2256@allette.com.au>

Murata Makoto wrote:

> JIS started to design two new character code standards. Although
> these will be used together with JIS X 0208, some of the characters in these
> new standards are more important than some of the JIS X 0208 characters,
> said Prof. Shibano. In three years or so, JIS will propose to ISO these
> new standards as part of ISO 10646. The BMP is not likely to include them.

W3C XML people should be careful not to panic on discovering that
Unicode 2.0 is not the ultimate character set! That is no reason
not to adopt it in XML 1.0 today.

Especially in East and South East Asia, there is still a great effort by
standards-making bodies and academics to figure out exactly what
characters are needed, including using statistical methods.

There can never be a character set that contains *ALL* 'Han' ideographs,
for the simple reason that new ones are being invented all the time,
especially in Taiwan. (And in any case, for some kinds of scholarly,
historical material, the glyph/character distinction may not be
completely helpful.)

This breakdown is why I think the XML group needs to add a fourth item
to its agenda, to be dealt with last, and that is a distributed font or
glyph service for XML.

Background
----------
The problem of Han ideographs is that they are an unbounded set. It is
up to CJK national bodies to add important and common characters to ISO
10646 and to their various regional character sets. But the more rare
characters cannot be represented using this method: it is not feasible or
practical for logistical reasons.

Two methods of circumventing this have been proposed. They both suggest
an embedded layer of encoding on top of characters, to decentralise
character definition towards the creators of the documents:

* defer the problem by using SGML SDATA entities to refer to the
characters (i.e. the SPREAD entities, or the Electronic Buddhist Text
Initiative's KanjiBase glyph set): this is inappropriate to XML, which is
aimed at fully resolved documents suitable for immediate use, unlike SGML;

* embed some unique character sequence that also describes the glyph
in terms of its components (Prof Hsieh from Academia Sinica's proposal
to ISO 10646): this is promising, but is a thing for the future.

Proposal
--------
I think we need to say that the central difference between characters
and glyphs is that characters can be searched on directly from the XML
text without a knowledge of the DTD.

Using this definition, we can deem any (and only) characters
in ISO 10646 BMP to be XML characters. (They will be either character
codes or numeric character references.)
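
To make the "searchable without the DTD" point concrete, here is a
minimal sketch in Python using the standard xml.etree.ElementTree
parser. The sample documents and the helper name contains_char are my
own illustrations, not part of any XML draft. The same BMP character is
written once directly and once as a numeric character reference, and a
plain text search finds it in both cases with no DTD involved.

```python
import xml.etree.ElementTree as ET

# The same character written two ways: directly, and as a numeric
# character reference (U+4E2D, a common Han ideograph in the BMP).
direct = ET.fromstring("<p>\u4e2d</p>")
by_reference = ET.fromstring("<p>&#x4E2D;</p>")

def contains_char(element, ch):
    """Search all text under element for ch; no DTD is consulted."""
    return ch in "".join(element.itertext())
```

Both documents carry no DOCTYPE at all, yet contains_char(direct, "\u4e2d")
and contains_char(by_reference, "\u4e2d") can be answered by the parser
alone.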

Next, we can say that a character that is not in ISO 10646 BMP is not an
XML character. It must be marked up in some other way: it is a
glyph reference.
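
The rule in the last two paragraphs amounts to a range check on the
code point. A rough sketch, assuming only that the BMP is the range
0000 to FFFF of ISO 10646; the function names are hypothetical, and the
sketch ignores finer points such as unassigned positions and the
surrogate block:

```python
BMP_LAST = 0xFFFF  # last code point of the Basic Multilingual Plane

def in_bmp(ch):
    """True if the single character ch lies in the ISO 10646 BMP."""
    return ord(ch) <= BMP_LAST

def needs_glyph_reference(text):
    """Return the characters in text that fall outside the BMP.

    Under the proposal these would not be XML characters and would
    have to be marked up as glyph references instead.
    """
    return [ch for ch in text if not in_bmp(ch)]
```

A glyph with no ISO 10646 code at all never reaches this check, of
course; it has to be authored as a glyph reference in the first place.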

The appropriate way to mark up a glyph which has no corresponding
character in ISO 10646 is using (a reference to an entity containing?)
an empty element that nominates a particular font and code point, for
example:
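
As one possible shape for such markup, and how a processor might pick it
out, here is a sketch in Python. The element name "glyph", the attribute
names "font" and "code", the font name and the code value are all
assumptions made up for illustration; nothing here is defined by any
XML draft.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup: the element name "glyph" and the attributes
# "font" and "code" are illustrative assumptions, not specified anywhere.
SAMPLE = (
    '<p>The rare form <glyph font="ExampleHanFont" code="E001"/>'
    ' appears once.</p>'
)

def glyph_references(xml_text):
    """Return a (font, code point) pair for each empty glyph element."""
    root = ET.fromstring(xml_text)
    return [
        (g.get("font"), int(g.get("code"), 16))  # code is read as hex
        for g in root.iter("glyph")
    ]
```

Because the glyph is an element and not a character, a plain text
search over the paragraph skips it, which is exactly the
character/glyph split proposed above.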

The central advantage of doing this is that it allows XML users the
freedom to sidestep the standardisation process. If they need a
particular glyph, they don't need to wait for various standards bodies
to agree it is a useful character, and then for international bodies to
see whether it is really just a glyph variant and so is already present
in a unified character, and then for font makers to make and distribute
the correct fonts, etc.

This is not just a CJK issue: glyph references may be welcomed by
mathematicians, as well as page designers who want to include
corporate logos in text. (The inline IMG element is the HTML ancestor
for this, of course.) Not to mention facsimile and historical users,
or even someone who wants to add a fancy drop capital, perhaps.

I think there will be enough difficulty trying to get XML vendors to make
their products truly internationalised (i.e. adopt ISO 10646 numeric
character references regardless of their regional character set)
unless we strictly limit XML to ISO 10646 BMP/Unicode 2.0 and provide
a good mechanism to deal with exceptions. Unicode is well-promulgated
and accessible: character-set-level support for characters extra to
Unicode really misses the need of XML users, IMHO: for strange and
rare or currently non-standard characters, it is the ability to
locate and display the glyph that is most important.

Summary
-------
ISO 10646 BMP is enough for XML characters. But people legitimately need
more. A mechanism to let them do it themselves is appropriate (and fits
into the WWW idiom).

So part of XML should be a simple glyph service system, allowing
people who create documents to add extra glyphs as needed.

-Rick Jelliffe

Received on Thursday, 10 April 1997 05:41:25 UTC