Re: Comments on XML Part 1 from Japanese experts from Rick Jelliffe on 1997-06-03 (w3c-sgml-wg@w3.org from June 1997)

From: Rick Jelliffe <ricko@allette.com.au>
Date: Tue, 3 Jun 1997 18:20:56 +1000
To: <w3c-sgml-wg@w3.org>, "Tim Bray" <tbray@textuality.com>
Message-Id: <199706030841.SAA15497@jawa.chilli.net.au>
> From: Tim Bray <tbray@textuality.com>
 
> >2. Hankaku katakana characters
> >(1) Proposal
> >Hankaku katakana characters should be used through character
> >references only. 

I agree with the Japanese: as a policy (but not a restriction) ISO 10646 compatibility zone characters should be entered using
numeric character references. And any XML processor is free to replace any compatibility zone characters it finds with numeric
character references.
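
A minimal sketch of that replacement, in Python. The range U+FF61 to U+FF9F (the halfwidth katakana and their punctuation, as mapped from JIS X 0201) and the function name are assumptions for illustration only; a real processor would choose its own list of compatibility characters.

```python
# Hypothetical helper: rewrite halfwidth katakana as numeric character references.
# Assumption: the characters to escape are U+FF61..U+FF9F (JIS X 0201 katakana range).
HALFWIDTH_START = 0xFF61
HALFWIDTH_END = 0xFF9F


def escape_halfwidth_katakana(text: str) -> str:
    """Return text with each halfwidth katakana replaced by &#xNNNN;."""
    out = []
    for ch in text:
        cp = ord(ch)
        if HALFWIDTH_START <= cp <= HALFWIDTH_END:
            out.append(f"&#x{cp:X};")  # hexadecimal numeric character reference
        else:
            out.append(ch)  # everything else passes through untouched
    return "".join(out)
```

Fullwidth katakana such as U+30A2 are left alone; only the halfwidth forms are rewritten.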

I think there is an important larger issue here. A concept I developed for ERCS was the "greatest common denominator" character
set. (Everyone hates this name, especially mathematicians! It really means "the intersection set of each character repertoire of
each coded character set".)

The characters that can be used in a document directly (rather than  by entity references, or in XML's case by numeric character
references) are the characters that are found on every system.  (I see Balise implements this "greatest common denominator"
checking now, under the name "sanity checking".)  

In practice, this means that, for Japanese documents, it may be prudent to restrict the characters
used *directly* to those present in Shift-JIS AND EUC AND Unicode, which means JIS 208, and not
JIS 212.  
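
A rough sketch of such a "greatest common denominator" check, in Python. It assumes that Python's built-in `shift_jis` and `euc_jp` codecs are an acceptable stand-in for the repertoires of the target systems; the function names are made up for illustration. Note that this test is looser than "JIS 208 only": it may still accept characters, such as the halfwidth katakana, that both encodings happen to be able to carry.

```python
# Assumption: the repertoire of each system is approximated by a Python codec.
CODECS = ("shift_jis", "euc_jp")


def in_common_repertoire(ch: str, codecs=CODECS) -> bool:
    """True if ch can be encoded in every one of the given codecs."""
    for codec in codecs:
        try:
            ch.encode(codec)
        except UnicodeEncodeError:
            return False
    return True


def chars_needing_references(text: str, codecs=CODECS) -> list:
    """Sorted list of distinct characters that fall outside the intersection."""
    return sorted({ch for ch in text if not in_common_repertoire(ch, codecs)})
```

Anything returned by `chars_needing_references` would be a candidate for entry as a numeric character reference rather than directly.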

However, the issue is whether XML should adopt and enforce such a thing. In my view, if you do adopt the principle of "native
language markup" (as XML has), documents outside the West will also need some "greatest common denominator" discipline. 

However, I don't think that hardcoding a "greatest common denominator" discipline into XML is warranted, in that it is a document-
and system-dependent category: it will change for every document and over time as technology develops.

So I think the XML naming rules should remain pretty much as they are, and not restrict anything more. 

We need an XML annex on localization. The rest of the world will not go away!!

It should start off by appropriating Gavin Nicol, et al's internet draft on Internationalisation.  Then, as additional sections, it
should have particular locale-dependent sections. For example, for Japan it should mention that it strongly recommends that all
non-JIS 208 characters should be entered with numeric character references. 

> I note that the hankaku katakana are part of the compatibility block
> in the Unicode standard.  It seems to me that it might be a good idea
> that we should recommend that for XML documents, no Unicode compatibility
> characters should be used.  This would cover the halfwidth katakana
> and a whole bunch of other problematic characters.

I disagree with the Japanese and Tim. We should adopt ISO 10646/Unicode 2.0 as it is.  The original ISO 10646 did not have these
compatibility characters. But it was found that they were needed in some situations. So Japanese ISO10646 experts requested that
they be put back in: let's not rewrite ISO 10646; let's follow Unicode as much as we possibly can. 

To follow ISO 10646/Unicode: let's allow hankaku katakana, but *strongly* deprecate them. It is not XML's job to tell users which
characters they cannot use in their documents. A note should be added to the XML default SGML declaration.

> >Omit 65382-65391 from the definition of NAMESTRT (Appendix A).
> >Omit [#xFF66-#xFF6F] from BaseChar [74].
> 
> I agree with this... even if we are stuck with these things,
> we don't want them showing up in tag names.

I agree with the Japanese & Tim. The compatibility zone characters have no business in names. Native language markup (i.e. where "the
user can mark up the document using the customary words and symbols of their language") does not require it.
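
As an illustration only, here is a small Python check for the range quoted in the proposal (65382-65391, i.e. #xFF66-#xFF6F). The function name is hypothetical, and the check looks at every character of a name rather than reproducing the actual BaseChar or NAMESTRT productions.

```python
# Range taken from the quoted proposal: 65382..65391 == 0xFF66..0xFF6F.
EXCLUDED_FIRST = 0xFF66
EXCLUDED_LAST = 0xFF6F


def excluded_chars_in_name(name: str) -> list:
    """Return the characters of name that fall in the excluded range, in order."""
    return [ch for ch in name if EXCLUDED_FIRST <= ord(ch) <= EXCLUDED_LAST]
```

An empty result means the name uses none of the characters the proposal would omit.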

> >3. JIS X 0212 characters in EUC_JP
> >(1) Proposal
> >Direct use of JIS X 0212 characters in EUC-JP should be
> >prohibited.  In other words, the control function SS3 of 
> >EUC_JP should be disallowed in XML.
> 
> As noted above, it seems to me that we should not, in the XML spec
> itself, make rules governing how people use ASCII or EBCDIC or 
> ISO-Latin or JIS - with the sole exception that the characters should
> be Unicode characters.  I note from page 105 of the Unicode book
> that JIS 212 was one of the standards used to harvest characters for
> the CJK area in Unicode.  Perhaps this is material for another XML-related
> spec.

I agree with Tim. See above. 

> >3. Shift_JIS
> >(1) Proposal
> >As the definition of Shift_JIS, Appendix 1 of JIS X 0208-1997
> >should be referenced.
> 
> I do not agree with this.  The whole idea is that in an XML text 
> entity, you can use any encoding scheme you want as long as you
> can map the characters to Unicode.  We do not give references for
> ISO-Latin or ASCII or anything except Unicode.

I agree with Tim.

> >4. Ideographic space character
> >(1) Proposal
> >The ideographic space character should not be considered as
> >a white space character.
> 
> This is obviously a difficult decision.  In my work in Japan
> (in the area of full-text search) I was told that the fullwidth
> space should be treated as a space for purposes of searching.
> In the internationalization TC to SGML, how is this handled?

Annex J "Extended Naming Rules" gives syntax, and does not allocate any characters.

In the ERCS project, which led to Annex J and to the naming rules XML uses, this issue was controversial, like the
half-width/full-width issue. My original motto was that "if two things look the same, they should act the same": SGML should be
able to be debugged visually, and you shouldn't have to resort to an octal debugger or change font to pick up a markup error. 

So I strongly recommend that ideographic spaces (indeed, all spaces) should be valid white-space. However, I certainly also agree
that a comment should be put in that they are *strongly* deprecated in markup, and that they can be replaced by a space (two??) at
any time by any XML process. A comment should be added to the XML default SGML declaration.
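
A sketch of that replacement in Python. Whether one ASCII space or two should stand in for an ideographic space is left open above, so the `width` parameter is an assumption, as is the function name.

```python
IDEOGRAPHIC_SPACE = "\u3000"  # the fullwidth (ideographic) space


def replace_ideographic_space(markup: str, width: int = 1) -> str:
    """Replace each ideographic space with `width` ordinary spaces."""
    return markup.replace(IDEOGRAPHIC_SPACE, " " * width)
```

For example, `replace_ideographic_space(tag_text, width=2)` would keep the visual width roughly the same in a fixed-pitch font.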


> >5. The private use areas
> >(1) Proposal
> >The private use areas of Unicode should not be used in XML.
> >(3) Background
> >XML is primarily intended for open interchange over the
> >Internet.  If the private use areas of Unicode are 
> >used in XML, we will have incompatibility problems.
> 
> But there are a lot of characters in the world that are not
> in Unicode.  Particularly in the areas of mathematics, chemistry,
> and so on?  If we can't use the private-use area, how can we
> use these at all?  I agree that it will require co-ordination 
> between sender and receiver to make this work.

I agree with the Japanese here. The appropriate way to handle extra characters is by entity references or by some special element
invoking a networked retrieval, not by private use characters (or any other mechanism that compromises Unicode's 1 character = 16
bits).  A note should be added to the XML default SGML declaration.

I don't think the Japanese are proposing that XML documents should be restricted to Unicode, merely that the private use area
mechanism is not the appropriate means of encoding characters that fall outside it. 

I see that the font companies are talking about some mechanism of delivering fonts over the net. Does anyone know if this is just a
"Type-on-call"-on-the-web pay service?  As (Gavin &) I have said before, XML needs some mechanism for downloading extra glyphs. 

-ricko
Received on Tuesday, 3 June 1997 04:42:09 UTC