- From: Murata Makoto <murata@apsdc.ksp.fujixerox.co.jp>
- Date: Wed, 28 May 1997 10:48:56 +0900
- To: w3c-sgml-wg@w3.org
Comments on XML Part 1 from Japanese experts

28 May, 1997

MURATA Makoto (Last first)
Yushi Komachi
HIYAMA Masayuki (LAST First)
Yasuhiro Okui

1. Introduction

We carefully studied XML Part 1 to examine whether XML can handle Japanese documents. We believe that XML Part 1 is well designed, and deeply appreciate the effort of the W3C SGML ERB and W3C SGML WG.

However, we believe that some small changes are useful and even desirable for compatibility and interoperability with the Internet and computing environments. We cordially request that the ERB consider these proposed changes.

None of our proposals modify the basic principles of XML. These proposals are rather minor and are only relevant to non-ASCII XML documents.

2. Hankaku katakana characters

(1) Proposal

Hankaku katakana characters should be used through character references only.

(2) Proposed revision

In "4.3.3 Character Encoding in Entities", introduce a paragraph as below:

    Direct use of hankaku katakana characters (#xFF66-#xFF9F) is disallowed in any encoding method. For backward compatibility with JIS X 0201, hankaku katakana characters may be referenced via character references.

Omit 65382-65391 from the definition of NAMESTRT (Appendix A).

Omit [#xFF66-#xFF6F] from BaseChar [74].

(3) Background

Hankaku katakana characters exist in Unicode 2.0 only for backward compatibility with JIS X 0201 (1 byte). Since JIS X 0208 (2 byte) has all of the katakana characters, hankaku katakana characters cause duplicate encoding.

Internet experts in Japan commonly believe that hankaku katakana characters shall not be used.

The only encoding method for transmitting Japanese e-mail and news articles, namely ISO_2022_JP, does not allow hankaku katakana characters. Thus, XML documents containing hankaku katakana characters will suffer loss of information or data corruption when they are transmitted via e-mail or news.
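As an illustration of the proposal in section 2, the following is a minimal Python sketch of how a tool might rewrite directly encoded hankaku katakana as character references before interchange. The function name `escape_hankaku_katakana` is hypothetical, and the sketch assumes hexadecimal character references of the form `&#xFF76;`; it is not part of the proposed revision itself.

```python
import re

# Hankaku katakana range named in the proposal: #xFF66-#xFF9F.
HANKAKU_KATAKANA = re.compile("[\uFF66-\uFF9F]")


def escape_hankaku_katakana(text: str) -> str:
    """Replace each hankaku katakana character with a hexadecimal
    character reference; all other characters are left untouched."""
    return HANKAKU_KATAKANA.sub(
        lambda m: "&#x{:X};".format(ord(m.group())), text
    )


if __name__ == "__main__":
    # Four hankaku katakana characters followed by two ordinary
    # (JIS X 0208) katakana, which must pass through unchanged.
    sample = "\uFF76\uFF80\uFF76\uFF85 \u30AB\u30BF"
    print(escape_hankaku_katakana(sample))
```

Only characters in the stated range are rewritten, so ordinary katakana and all other text remain directly encoded.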
Another common encoding method, namely EUC_JP, is able to represent hankaku katakana characters, but just barely. A control function, SS2, is required only for this purpose. As a result, many software tools on Unix do not support hankaku katakana characters.

Furthermore, the recently revised JIS X 0208 discourages the use of hankaku katakana characters. It says that they should only be used for backward compatibility, and that the next version of JIS X 0208 is likely to omit hankaku katakana characters.

In some (rare) cases, hankaku katakana characters might be required for legacy documents. Fortunately, XML has character references. They are free from loss of information, data corruption, and hard-to-implement control functions.

3. JIS X 0212 characters in EUC_JP

(1) Proposal

Direct use of JIS X 0212 characters in EUC_JP should be prohibited. In other words, the control function SS3 of EUC_JP should be disallowed in XML.

(2) Proposed revision

In "4.3.3 Character Encoding in Entities", introduce a paragraph as below:

    Control functions SS2 (hankaku katakana) and SS3 (JIS X 0212) shall not be used for EUC_JP. For backward compatibility, JIS X 0212 characters can be represented by character references.

(3) Background

JIS X 0212 is rarely used in Japan. First, JIS X 0212 characters cannot be represented in Shift_JIS, and are thus not supported by most personal computers. Second, JIS X 0212 characters cannot be represented in ISO_2022_JP, and thus cannot be transmitted via e-mail or news. Third, the control function SS3 is required to represent JIS X 0212 characters in EUC_JP, but many tools do not support this control function.

JIS X 0212 is recognized as a failure by the JIS committee, as the origin of most of the "characters" in JIS X 0212 is unclear. Some of the committee members even want to cancel JIS X 0212. The committee plans to develop two alternative standards, and to doubly encode some of the JIS X 0212 characters in these new standards.

3. Shift_JIS

(1) Proposal

Appendix 1 of JIS X 0208-1997 should be referenced as the definition of Shift_JIS.

(2) Proposed revision

In "4.3.3 Character Encoding in Entities", add a paragraph as below:

    The definition of Shift_JIS is given by Appendix 1 of JIS X 0208.

Add a reference as below:

    JIS X 0208 Japanese Industrial Standard, "7-bit and 8-bit double byte coded Kanji sets for information interchange", January 1997

(3) Background

The Shift_JIS encoding was proprietary. Different companies extended it slightly by introducing different characters. Thus, we have had incompatibility problems. Fortunately, Appendix 1 of the newly revised JIS X 0208 standardizes this encoding method.

4. Ideographic space character

(1) Proposal

The ideographic space character should not be considered a white space character.

(2) Proposed revision

Revise [1] as below:

    [1] S ::= (#X0020 | #X0009 | #X000d | #X000a)+

Remove the below line from the SGML declaration.

    ITAB SEPCHAR 12288 -- ideographic space --

(3) Background

In Japan, this character is recognized as a fixed-width "Kanji" character. There is no consensus that the ideographic space character is a delimiter. Rather, most people assume that it is not.

5. The private use areas

(1) Proposal

The private use areas of Unicode should not be used in XML.

(2) Proposed revision

In "2.3 Characters", change "Users may extend ... " to "Users may not extend ...".

(3) Background

XML is primarily intended for open interchange over the Internet. If the private use areas of Unicode are used in XML, we will have incompatibility problems.

Makoto
Received on Tuesday, 27 May 1997 21:47:39 UTC