Re: Comments on XML Part 1 from Japanese experts from Tim Bray on 1997-05-28 (w3c-sgml-wg@w3.org from May 1997)

From: Tim Bray <tbray@textuality.com>
Date: Wed, 28 May 1997 08:46:19 -0700
To: w3c-sgml-wg@w3.org
Message-Id: <3.0.32.19970528084511.00811190@pop.intergate.bc.ca>
At 10:48 AM 5/28/97 +0900, Murata Makoto wrote:

First of all, thanks to our Japanese colleagues for their comments.

Several of us are quite nervous about writing guidelines into the XML
spec as to how different national groups should use their own native
character repertoires and encoding schemes.  I'd like to make a 
proposal: we should take such guidelines and publish them as separate
documents in the XML series.  For example, we could have a document
entitled "Recommendations for the use of Japanese text in XML
documents", as part 5 (or 6, or whatever we're up to now) in the
series.  Now that the W3C has a branch at Keio U., would it be possible
to bring them into the process?

>2. Hankaku katakana characters
>(1) Proposal
>Hankaku katakana characters should be used through character
>references only.

I note that the hankaku katakana are part of the compatibility block
in the Unicode standard.  It seems to me that it might be a good idea
to recommend that no Unicode compatibility characters be used in XML
documents.  This would cover the halfwidth katakana and a whole bunch
of other problematic characters.
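
A minimal sketch, in Python, of how a "no compatibility characters"
recommendation might be checked.  It assumes that "compatibility
character" can be approximated by "has a compatibility decomposition in
the Unicode Character Database" as exposed by Python's unicodedata
module; that is a stand-in, not the definition in the Unicode book, and
the function name is hypothetical.

```python
import unicodedata


def compatibility_chars(text):
    """Return (index, char) pairs for characters with a compatibility decomposition.

    Assumption: a decomposition string starting with a "<tag>" (for example
    "<narrow> 30AB" for halfwidth katakana KA, U+FF76) marks a compatibility
    mapping; canonical decompositions carry no such tag.
    """
    found = []
    for index, ch in enumerate(text):
        if unicodedata.decomposition(ch).startswith("<"):
            found.append((index, ch))
    return found


if __name__ == "__main__":
    # Ordinary katakana KA followed by halfwidth katakana KA.
    sample = "\u30ab\uff76"
    for index, ch in compatibility_chars(sample):
        print(index, f"U+{ord(ch):04X}", unicodedata.name(ch, "?"))
```

A check like this flags the halfwidth katakana without having to list
them by hand, at the cost of depending on whichever version of the
character database the checker ships with.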

>Omit 65382-65391 from the definition of NAMESTRT (Appendix A).
>Omit [#xFF66-#xFF6F] from BaseChar [74].

I agree with this... even if we are stuck with these things,
we don't want them showing up in tag names.
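
For reference, the decimal range 65382-65391 and the hex range
#xFF66-#xFF6F in the two proposals are the same code points.  A small
Python sketch of the check a processor would make on a tag name (the
function names are hypothetical, and only the excluded range is
modelled, not the rest of the name-character productions):

```python
# The range the proposal would drop from NAMESTRT / BaseChar.
EXCLUDED_FIRST = 0xFF66  # decimal 65382
EXCLUDED_LAST = 0xFF6F   # decimal 65391


def in_proposed_exclusion(ch):
    """True if ch falls in the range proposed for removal."""
    return EXCLUDED_FIRST <= ord(ch) <= EXCLUDED_LAST


def name_uses_excluded(name):
    """True if any character of a candidate tag name is in the excluded range."""
    return any(in_proposed_exclusion(ch) for ch in name)


if __name__ == "__main__":
    print(name_uses_excluded("title"))
    print(name_uses_excluded("\uff66title"))  # starts with HALFWIDTH KATAKANA LETTER WO
```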

>3. JIS X 0212 characters in EUC_JP
>(1) Proposal
>Direct use of JIS X 0212 characters in EUC-JP should be
>prohibited.  In other words, the control function SS3 of 
>EUC_JP should be disallowed in XML.

As noted above, it seems to me that we should not, in the XML spec
itself, make rules governing how people use ASCII or EBCDIC or 
ISO-Latin or JIS - with the sole exception that the characters should
be Unicode characters.  I note from page 105 of the Unicode book
that JIS X 0212 was one of the standards used to harvest characters for
the CJK area in Unicode.  Perhaps this is material for another XML-related
spec.
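
If such a rule did end up in a separate document, checking it would be
mechanical.  A rough Python sketch, assuming the input is well-formed
EUC-JP, in which SS3 is the byte 0x8F followed by two bytes of a JIS X
0212 character and SS2 is 0x8E followed by one halfwidth katakana byte
(the function name is hypothetical):

```python
SS2 = 0x8E  # single shift 2: next byte is a halfwidth katakana
SS3 = 0x8F  # single shift 3: next two bytes are a JIS X 0212 character


def ss3_offsets(data):
    """Return the byte offsets at which SS3 occurs in EUC-JP data.

    Assumes well-formed input; malformed sequences are not diagnosed.
    """
    offsets = []
    i = 0
    while i < len(data):
        byte = data[i]
        if byte == SS3:
            offsets.append(i)
            i += 3  # SS3 plus a two-byte character
        elif byte == SS2 or byte >= 0xA1:
            i += 2  # SS2 plus one byte, or a two-byte JIS X 0208 character
        else:
            i += 1  # ASCII or other single byte
    return offsets


if __name__ == "__main__":
    data = b"A" + bytes([0xB0, 0xA1]) + bytes([SS3, 0xB0, 0xA1])
    print(ss3_offsets(data))
```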

>3. Shift_JIS
>(1) Proposal
>As the definition of Shift_JIS, Appendix 1 of JIS X 0208-1997
>should be referenced.

I do not agree with this.  The whole idea is that in an XML text 
entity, you can use any encoding scheme you want as long as you
can map the characters to Unicode.  We do not give references for
ISO-Latin or ASCII or anything except Unicode.
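
To illustrate the point, here is a short Python sketch using the
"shift_jis" and "euc_jp" codecs from Python's standard library (one
particular implementation's mapping tables, which is exactly the kind
of detail the spec would not pin down).  The bytes on disk differ, but
both entities map to the same Unicode characters:

```python
def to_code_points(data, encoding):
    """Decode bytes in the given encoding and return the Unicode code points."""
    return [ord(ch) for ch in data.decode(encoding)]


if __name__ == "__main__":
    text = "\u65e5\u672c\u8a9e"  # the three characters of the word "Japanese"

    sjis_bytes = text.encode("shift_jis")
    euc_bytes = text.encode("euc_jp")

    print(sjis_bytes != euc_bytes)  # different byte sequences
    print(to_code_points(sjis_bytes, "shift_jis") ==
          to_code_points(euc_bytes, "euc_jp"))  # same characters
```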

>4. Ideographic space character
>(1) Proposal
>The ideographic space character should not be considered as
>a white space character.

This is obviously a difficult decision.  In my work in Japan
(in the area of full-text search) I was told that the fullwidth
space should be treated as a space for purposes of searching.
In the internationalization TC to SGML, how is this handled?
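
The two views can be put side by side in a small Python sketch.  The
four-character set below is the white space definition used by the XML
1.0 Recommendation as eventually published; whether it matches the
draft under discussion is not stated in this message.  The search side
simply uses Python's own Unicode-aware split, and the function names
are hypothetical.

```python
# White space as a markup parser might define it: space, tab, CR, LF.
XML_WHITESPACE = {"\u0020", "\u0009", "\u000D", "\u000A"}

IDEOGRAPHIC_SPACE = "\u3000"  # the fullwidth (ideographic) space


def is_xml_space(ch):
    """True if ch is white space for markup purposes under the set above."""
    return ch in XML_WHITESPACE


def search_tokens(text):
    """Split text for searching; str.split() treats U+3000 as a separator."""
    return text.split()


if __name__ == "__main__":
    text = "\u6771\u4eac" + IDEOGRAPHIC_SPACE + "\u5927\u962a"
    print(is_xml_space(IDEOGRAPHIC_SPACE))
    print(search_tokens(text))
```

Nothing forces the two answers to agree: a parser can refuse to treat
the character as markup white space while a search engine downstream
still breaks words on it.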

>5. The private use areas
>(1) Proposal
>The private use areas of Unicode should not be used in XML.
>(3) Background
>XML is primarily intended for open interchange over the
>Internet.  If the private use areas of Unicode are 
>used in XML, we will have incompatibility problems.

But there are a lot of characters in the world that are not
in Unicode, particularly in the areas of mathematics, chemistry,
and so on.  If we can't use the private-use area, how can we
use these at all?  I agree that it will require co-ordination
between sender and receiver to make this work.
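
A receiver that wants to know when such co-ordination is needed could
at least detect the characters.  A minimal Python sketch, assuming the
general category "Co" from the unicodedata module is an acceptable test
for "private use" (the function name is hypothetical):

```python
import unicodedata


def private_use_chars(text):
    """Return (index, "U+XXXX") pairs for private-use characters in text."""
    return [
        (index, f"U+{ord(ch):04X}")
        for index, ch in enumerate(text)
        if unicodedata.category(ch) == "Co"  # Co = private use
    ]


if __name__ == "__main__":
    # U+E000 stands in for a symbol agreed on privately by sender and receiver.
    print(private_use_chars("H\ue000O"))
```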

Cheers, Tim Bray
tbray@textuality.com http://www.textuality.com/ +1-604-708-9592
Received on Wednesday, 28 May 1997 11:48:10 UTC