Comments on XML Part 1 from Japanese experts

	
	28 May, 1997

	MURATA Makoto (LAST First)
	Yushi Komachi
	HIYAMA Masayuki (LAST First)
	Yasuhiro Okui



1. Introduction

We have carefully studied XML Part 1 to examine whether XML
can handle Japanese documents.  We believe that XML Part 1 is
well designed, and deeply appreciate the effort of the W3C
SGML ERB and W3C SGML WG.

However, we believe that some small changes are useful and
even desirable for compatibility and interoperability with
the Internet and computing environments.  We cordially
request that the ERB consider these proposed changes.

None of our proposals modifies the basic principles of XML.
The proposals are minor and are relevant only to non-ASCII
XML documents.


2. Hankaku katakana characters

(1) Proposal

Hankaku katakana characters should be used through character
references only.

(2) Proposed revision 

In "4.3.3 Character Encoding in Entities", introduce a
paragraph as below:

	Direct use of hankaku katakana characters
	(#xFF66-#xFF9F) is disallowed in any encoding method.
	For backward compatibility with JIS X 0201, hankaku
	katakana characters may be referenced via character
	references.

Omit 65382-65391 (that is, #xFF66-#xFF6F) from the definition
of NAMESTRT (Appendix A).

Omit [#xFF66-#xFF6F] from BaseChar [74].

(3) Background

Hankaku katakana characters exist in Unicode 2.0 only for
backward compatibility with JIS X 0201 (1 byte).  Since JIS X
0208 (2 byte) already contains all katakana characters,
hankaku katakana characters give the same characters a
second, duplicate encoding.

Internet experts in Japan commonly hold that hankaku katakana
characters should not be used.  The only encoding method used
for transmitting Japanese e-mail and news articles, namely
ISO_2022_JP, does not allow hankaku katakana characters.
Thus, XML documents containing hankaku katakana characters
will suffer loss of information or data corruption when they
are transmitted via e-mail or news.
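
The transmission problem can be illustrated with a short
Python sketch.  It assumes that Python's "iso2022_jp" codec is
a fair stand-in for the ISO_2022_JP of this section; the
helper name can_encode is hypothetical.

```python
def can_encode(text: str, encoding: str) -> bool:
    """Return True if every character of text is representable in encoding."""
    try:
        text.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False


ZENKAKU_A = "\u30a2"  # KATAKANA LETTER A, from JIS X 0208
HANKAKU_A = "\uff71"  # HALFWIDTH KATAKANA LETTER A, from JIS X 0201

# The full-width form is expected to pass; the hankaku form is not.
for label, ch in (("zenkaku", ZENKAKU_A), ("hankaku", HANKAKU_A)):
    print(label, can_encode(ch, "iso2022_jp"))
```

A mail or news gateway built on such a codec has to reject,
drop, or silently convert the hankaku form.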

Another common encoding method, namely EUC_JP, can represent
hankaku katakana characters, but only awkwardly: the control
function SS2 is required solely for this purpose.  As a
result, many software tools on Unix do not support hankaku
katakana characters.

Furthermore, the recently revised JIS X 0208 discourages the
use of hankaku katakana characters.  It says that they should
be used only for backward compatibility, and that the next
version of JIS X 0208 is likely to omit hankaku katakana
characters.

In some (rare) cases, hankaku katakana characters might be
required for legacy documents.  Fortunately, XML has
character references.  Character references avoid loss of
information and data corruption, and do not depend on
hard-to-implement control functions.
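
As a minimal sketch of this approach, the following Python
code rewrites directly used hankaku katakana characters as
character references.  The range #xFF66-#xFF9F is taken from
the proposed revision above; the function name
to_character_references is hypothetical.

```python
import re

# Range named in the proposed revision: #xFF66-#xFF9F.
HANKAKU_KATAKANA = re.compile("[\uff66-\uff9f]")


def to_character_references(text: str) -> str:
    """Replace each hankaku katakana character with a &#x...; reference."""
    return HANKAKU_KATAKANA.sub(
        lambda m: "&#x{:X};".format(ord(m.group())), text
    )


# Two hankaku katakana characters followed by their full-width forms.
sample = "\uff76\uff85 and \u30ab\u30ca"
converted = to_character_references(sample)
print(converted)

# The converted text no longer contains hankaku katakana, so it can be
# encoded with Python's "iso2022_jp" codec without raising an error.
converted.encode("iso2022_jp")
```

Full-width katakana and all other characters pass through
unchanged; only the hankaku range is rewritten.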


3. JIS X 0212 characters in EUC_JP

(1) Proposal

Direct use of JIS X 0212 characters in EUC_JP should be
prohibited.  In other words, the control function SS3 of
EUC_JP should be disallowed in XML.

(2) Proposed revision

In "4.3.3 Character Encoding in Entities", introduce a
paragraph as below:

	Control functions SS2 (hankaku katakana) and SS3
	(JIS X 0212) shall not be used for EUC_JP.
	For backward compatibility, JIS X 0212 characters 
	can be represented by character references.
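
A checker for this rule could be sketched in Python as below.
The sketch assumes the input is well-formed EUC_JP, in which
the byte values 0x8E (SS2) and 0x8F (SS3) occur only as
single-shift bytes, because all other multi-byte characters
use bytes in the range 0xA1-0xFE.  The function name
find_single_shifts is hypothetical.

```python
from typing import List, Tuple

SS2 = 0x8E  # introduces a hankaku katakana character
SS3 = 0x8F  # introduces a JIS X 0212 character


def find_single_shifts(data: bytes) -> List[Tuple[int, str]]:
    """Return (offset, name) for every SS2 or SS3 byte in EUC_JP data."""
    names = {SS2: "SS2", SS3: "SS3"}
    return [(i, names[b]) for i, b in enumerate(data) if b in names]


# Full-width katakana A needs no control function.
print(find_single_shifts("\u30a2".encode("euc_jp")))
# Hankaku katakana A is encoded with a leading SS2 byte.
print(find_single_shifts("\uff71".encode("euc_jp")))
```

An XML processor following the proposal would report an
error whenever the returned list is non-empty.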

(3) Background

JIS X 0212 is rarely used in Japan.  First, JIS X 0212
characters cannot be represented in Shift_JIS, and are thus
not supported by most personal computers.  Second, JIS X 0212
characters cannot be represented in ISO_2022_JP, and thus
cannot be transmitted via e-mail or news.  Third, the control
function SS3 is required to represent JIS X 0212 characters
in EUC_JP, but many tools do not support this control
function.
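
The three points can be checked roughly with Python's
codecs.  The sketch assumes that U+4E02 is a JIS X 0212
character absent from JIS X 0208, and that Python's
"shift_jis", "iso2022_jp" and "euc_jp" codecs approximate the
encodings discussed here; the helper name try_encode is
hypothetical.

```python
# Assumed to be a kanji of JIS X 0212 that is not in JIS X 0208.
JIS_X_0212_CHAR = "\u4e02"


def try_encode(text, encoding):
    """Return the encoded bytes, or None if text is not representable."""
    try:
        return text.encode(encoding)
    except UnicodeEncodeError:
        return None


results = {
    enc: try_encode(JIS_X_0212_CHAR, enc)
    for enc in ("shift_jis", "iso2022_jp", "euc_jp")
}

for enc, encoded in results.items():
    print(enc, "not representable" if encoded is None else encoded.hex())
```

Where the character is representable at all, the EUC_JP form
is three bytes long and begins with the SS3 byte 0x8F.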

JIS X 0212 is recognized as a failure by the JIS committee,
as the origin of most of the "characters" in JIS X 0212 is
unclear.  Some of the committee members even want to withdraw
JIS X 0212.  The committee plans to develop two alternative
standards, and to encode some of the JIS X 0212 characters a
second time in these new standards.


4. Shift_JIS

(1) Proposal

Appendix 1 of JIS X 0208-1997 should be referenced as the
definition of Shift_JIS.

(2) Proposed revision

In "4.3.3 Character Encoding in Entities", add a paragraph as
below:

	The definition of Shift_JIS is given by Appendix 1
	of JIS X 0208.

Add a reference as below:

	JIS X 0208
		Japanese Industrial Standard, "7-bit and
		8-bit double byte coded Kanji sets for
		information interchange",
		January 1997

(3) Background

The Shift_JIS encoding was proprietary.  Different companies
extended it slightly by adding different characters, and the
resulting variants have caused incompatibility problems.

Fortunately, Appendix 1 of the newly revised JIS X 0208
standardizes this encoding method. 
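
One such incompatibility can be illustrated in Python.  The
sketch uses "cp932" (Microsoft's extended variant) as an
example of a vendor extension and Python's "shift_jis" codec
as a stand-in for the unextended encoding; whether that codec
matches Appendix 1 of JIS X 0208 exactly is an assumption.
The helper name encode_or_none is hypothetical.

```python
CIRCLED_ONE = "\u2460"  # CIRCLED DIGIT ONE, a vendor-added character


def encode_or_none(text, encoding):
    """Return the encoded bytes, or None if text is not representable."""
    try:
        return text.encode(encoding)
    except UnicodeEncodeError:
        return None


extended = encode_or_none(CIRCLED_ONE, "cp932")
unextended = encode_or_none(CIRCLED_ONE, "shift_jis")

print("cp932:    ", extended)
print("shift_jis:", unextended)
```

A document that relies on the vendor code point cannot be
decoded reliably by software that implements only the
standardized definition.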


5. Ideographic space character

(1) Proposal

The ideographic space character should not be treated as a
white space character.

(2) Proposed revision

Revise [1] as below:

	[1] S ::= (#x0020 | #x0009 | #x000D | #x000A)+

Remove the following line from the SGML declaration.

      ITAB SEPCHAR 12288 -- ideographic space --

(3) Background

In Japan, this character is recognized as a fixed-width
"Kanji" character.  There is no consensus that the ideographic
space character is a delimiter.  Rather, most people assume
that it is not.
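
The effect of the revised production [1] can be sketched in
Python.  The regular expression mirrors the four characters
listed in the proposed revision; the names S_PRODUCTION and
is_xml_white_space are hypothetical.

```python
import re

# Revised production [1]: only space, tab, carriage return and line feed.
S_PRODUCTION = re.compile("[\u0020\u0009\u000d\u000a]+")

IDEOGRAPHIC_SPACE = "\u3000"


def is_xml_white_space(text: str) -> bool:
    """Return True if text consists only of characters allowed by S."""
    return S_PRODUCTION.fullmatch(text) is not None


print(is_xml_white_space(" \t\r\n"))
print(is_xml_white_space(IDEOGRAPHIC_SPACE))

# Splitting on S keeps the ideographic space inside the first token.
print(S_PRODUCTION.split("a\u3000b c"))
```

Note that Python's own str.isspace() does treat U+3000 as
white space, so a processor following the proposal cannot
rely on such a general-purpose test.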


6. The private use areas

(1) Proposal

The private use areas of Unicode should not be used in XML.

(2) Proposed revision

In "2.3 Characters", change "Users may extend ... " to "Users
may not extend ...".

(3) Background

XML is primarily intended for open interchange over the
Internet.  If the private use areas of Unicode are 
used in XML, we will have incompatibility problems.
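
A check for private use characters might look like the
Python sketch below.  It relies on the Unicode general
category "Co" (private use) as reported by the standard
unicodedata module; the function name
private_use_characters is hypothetical.

```python
import unicodedata


def private_use_characters(text: str):
    """Return (offset, code point) for each private use character in text."""
    return [
        (i, "U+{:04X}".format(ord(ch)))
        for i, ch in enumerate(text)
        if unicodedata.category(ch) == "Co"
    ]


print(private_use_characters("abc"))
print(private_use_characters("a\ue000b"))
```

A processor following the proposal would reject any document
for which the returned list is non-empty.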

Makoto

Received on Tuesday, 27 May 1997 21:47:39 UTC