Re: Comments on XML Part 1 from Japanese experts from Gavin Nicol on 1997-05-28 (w3c-sgml-wg@w3.org from May 1997)

From: Gavin Nicol <gtn@eps.inso.com>
Date: Wed, 28 May 1997 09:54:03 -0400
To: murata@apsdc.ksp.fujixerox.co.jp
CC: w3c-sgml-wg@w3.org
Message-Id: <199705281354.JAA01956@nathaniel.ebt>
>Comments on XML Part 1 from Japanese experts
...
>2. Hankaku katakana characters
>
>(1) Proposal
>
>Hankaku katakana characters should be used through character
>references only.
...
>Internet experts in Japan commonly have the belief that
>hankaku katakana characters shall not be used.  The only
>encoding method for transmitting Japanese e-mail and news
>articles, namely ISO_2022_JP, does not allow hankaku katakana
>characters.  Thus, XML documents containing hankaku katakana
>characters will lead to loss of information or data
>corruption, when they are transmitted via e-mail or news.

I do not mind losing hankaku katakana characters, but am slightly
worried about MSDOS and MS-WIndows based programs that might generate
XML. For example, if MS-Word had the ability to create XML, then we
might run into some problems. I know that hankaku are becoming less
frequently used, so I am not sure that forbidding them makes
sense... over time they will not be used anyway.

>3. JIS X 0212 characters in EUC_JP
>
>(1) Proposal
>
>Direct use of JIS X 0212 characters in EUC-JP should be
>prohibited.  In other words, the control function SS3 of 
>EUC_JP should be disallowed in XML.

For this, like the above, I can understand the rationale, but I do not
think that the XML language specification should mention anything
about encodings, beyond the fact that any encoding whose character
repertoire is a subset of that in the SGML declaration can be used.

>(3) Background
>
>JIS X 0212 is rarely used in Japan.  First, JIS X 0212
>characters cannot be represented by Shift_JIS, and are thus
>not supported by most of the personal computers.  Second,
>JIS X 0212 characters cannot be represented by ISO_2022_JP,
>and thus cannot be transmitted via e-mail or news.  Third, a
>control function SS3 is required to capture JIS X 0212
>characters in EUC_JP, but many tools do not support this
>control function.

The fact that most tools do not support JIS X 212 is probably
sufficient to make it seldom used. I do not think we need explicit
language for this.

>3. Shift_JIS
>
>(1) Proposal
>
>As the definition of Shift_JIS, Appendix 1 of JIS X 0208-1997
>should be referenced.

Again, I do not think the XML specification should even attempt to
enumerate the possible encodings (with the possible exception of the
Unicode ecodings).

>4. Ideographic space character
>
>(1) Proposal
>
>The ideographic space character should not be considered as
>a white space character.
>
>(2) Proposed revision
>
>Revise [1] as below:
>	[1] S::= (#X0020 | #X0009 | #X000d | #X000a)+
>
>Remove the below line from the SGML declaration.
>
>$B!!!!!!!!!!!!(BITAB$B!!(BSEPCHAR$B!!(B12288$B!!(B--$B!!(Bideographic$B!!(Bspace$B!!(B--
>
>(3) Background
>
>In Japan, this character is recognized as a fixed-width
>"Kanji" character.  There is no consensus that the ideographic
>space character is a delimiter.  Rather, most people assume
>that it is not.

I disagree. I have seen *many* Japanese people become *very* confused
when they include (accidently or not) an ideographics space character
into their DTD or whatever, and have a parser fail. 

Also, I do not think that most people recognise it as a fixed-wdith
kanji character at all. A long time ago now, I spent a considerable
amount of time talking to SGML *users* about whether they would like
hankaku katakana, and ideographic spaces "normalised", and how they
would use them. The results of that survey were inconclusive at
best... some people thought normalisation was a great idea, and other
felt that they really didn't want it.

>5. The private use areas
>
>(1) Proposal
>
>The private use areas of Unicode should not be used in XML.

The private use area, and surrogates are critical for certain
applications. 

>(3) Background
>
>XML is primarily intended for open interchange over the
>Internet.  If the private use areas of Unicode are 
>used in XML, we will have incompatibility problems.

This is based upon the denial of the need. The need is there, but the
infrastructure is not.
Received on Wednesday, 28 May 1997 09:54:53 UTC