- From: Tim Bray <tbray@textuality.com>
- Date: Wed, 28 May 1997 08:46:19 -0700
- To: w3c-sgml-wg@w3.org
At 10:48 AM 5/28/97 +0900, Murata Makoto wrote: First of all, thanks to our Japanese colleagues for their comments. Several of us are quite nervous about writing guidelines into the XML spec as to how different national groups should use their own native character repertoires and encoding schemes. I'd like to make a proposal: we should take such guidelines and publish them as separate documents in the XML series. For example, we could have a document entitled "Recommendations for the use of Japanese text in XML docuements", as part 5 (or 6, or whatever we're up to now) in the series. Now that the W3C has a branch at Keio U., would it be possible to bring them into the process? >2. Hankaku katakana characters >(1) Proposal >Hankaku katakana characters should be used through character >references only. I note that the hankaku katakana are part of the compatibility block in the Unicode standard. It seems to me that it might be a good idea that we should recommend that for XML documents, no Unicode compatibility characters should be used. This would cover the halfwidth katakana and a whole bunch of other problematic characters. >Omit 65382-65391 from the definition of NAMESTRT (Appendix A). >Omit [#xFF66-#xFF6F] from BaseChar [74]. I agree with this... even if we are stuck with these things, we don't want them showing up in tag names. >3. JIS X 0212 characters in EUC_JP >(1) Proposal >Direct use of JIS X 0212 characters in EUC-JP should be >prohibited. In other words, the control function SS3 of >EUC_JP should be disallowed in XML. As noted above, it seems to me that we should not, in the XML spec itself, make rules governing how people use ASCII or EBCDIC or ISO-Latin or JIS - with the sole exception that the characters should be Unicode characters. I note from page 105 of the Unicode book that JIS 212 was one of the standards used to harvest characters for the CJK area in Unicode. Perhaps this is material for another XML-related spec. >3. Shift_JIS >(1) Proposal >As the definition of Shift_JIS, Appendix 1 of JIS X 0208-1997 >should be referenced. I do not agree with this. The whole idea is that in an XML text entity, you can use any encoding scheme you want as long as you can map the characters to Unicode. We do not give references for ISO-Latin or ASCII or anything except Unicode. >4. Ideographic space character >(1) Proposal >The ideographic space character should not be considered as >a white space character. This is obviously a difficult decision. In my work in Japan (in the area of full-text search) I was told that the fullwidth space should be treated as a space for purposes of searching. In the internationalization TC to SGML, how is this handled? >5. The private use areas >(1) Proposal >The private use areas of Unicode should not be used in XML. >(3) Background >XML is primarily intended for open interchange over the >Internet. If the private use areas of Unicode are >used in XML, we will have incompatibility problems. But there are a lot of characters in the world that are not in Unicode. Particularly in the areas of mathematics, chemistry, and so on? If we can't use the private-use area, how can we use these at all? I agree that it will require co-ordination between sender and receiver to make this work. Cheers, Tim Bray tbray@textuality.com http://www.textuality.com/ +1-604-708-9592
Received on Wednesday, 28 May 1997 11:48:10 UTC