Re: Comments on XML Part 1 from Japanese experts from Murata Makoto on 1997-05-29 (w3c-sgml-wg@w3.org from May 1997)

From: Murata Makoto <murata@apsdc.ksp.fujixerox.co.jp>
Date: Thu, 29 May 1997 11:31:40 +0900
To: Tim Bray <tbray@textuality.com>
Cc: w3c-sgml-wg@w3.org
Message-Id: <9705290231.AA00308@lute.apsdc.ksp.fujixerox.co.jp>

Tim Bray writes:
>At 10:48 AM 5/28/97 +0900, Murata Makoto wrote:
>
>First of all, thanks to our Japanese colleagues for their comments.

Thanks.

>Several of us are quite nervous about writing guidelines into the XML
>spec as to how different national groups should use their own native
>character repertoires and encoding schemes.  

This concern is understandable.

>I'd like to make a 
>proposal: we should take such guidelines and publish them as separate
>documents in the XML series.  For example, we could have a document
>entitled "Recommendations for the use of Japanese text in XML
>documents", as part 5 (or 6, or whatever we're up to now) in the
>series.  

I should speak with my colleagues, but this might be a good idea.

>Now that the W3C has a branch at Keio U., would it be possible
>to bring them into the process?

I do not know yet.  Historically, people at Japanese universities 
have had little interest in SGML or structured documents. 

>>2. Hankaku katakana characters
>>(1) Proposal
>>Hankaku katakana characters should be used through character
>>references only.
>
>I note that the hankaku katakana are part of the compatibility block
>in the Unicode standard.  It seems to me that it might be a good idea
>that we should recommend that for XML documents, no Unicode compatibility
>characters should be used.  This would cover the halfwidth katakana
>and a whole bunch of other problematic characters.

This is probably a good idea.
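
One way to check a document for such characters is sketched below.  It
assumes Python and its standard unicodedata module, and it approximates
"compatibility character" as "character with a compatibility decomposition"
(a tagged mapping such as <narrow>).  That approximation is an assumption,
not a definition taken from this thread, and the function names are
hypothetical.

```python
import unicodedata


def is_compatibility_char(ch):
    """Return True if ch has a compatibility decomposition.

    Compatibility decompositions in the Unicode database carry a tag in
    angle brackets, e.g. "<narrow> 30AB" for a halfwidth katakana letter.
    Canonical decompositions have no tag, so they are not flagged here.
    """
    return unicodedata.decomposition(ch).startswith("<")


def find_compatibility_chars(text):
    """Return (offset, character) pairs for every flagged character."""
    return [(i, ch) for i, ch in enumerate(text) if is_compatibility_char(ch)]
```

A checker built this way would flag U+FF76 (halfwidth KA) but not U+30AB
(the ordinary katakana KA).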

>>Omit 65382-65391 from the definition of NAMESTRT (Appendix A).
>>Omit [#xFF66-#xFF6F] from BaseChar [74].
>
>I agree with this... even if we are stuck with these things,
>we don't want them showing up in tag names.

Exactly!
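
For reference, the decimal range 65382-65391 and the hexadecimal range
#xFF66-#xFF6F in the proposal are the same sixteen... rather, the same ten
code points.  A minimal sketch of a check for names that use them follows,
assuming Python; the function names are hypothetical, and the check covers
only this one exclusion, not the full name grammar.

```python
# The range the proposal would omit from NAMESTRT / BaseChar.
# 0xFF66 == 65382 and 0xFF6F == 65391, so both notations agree.
EXCLUDED_HALFWIDTH = range(0xFF66, 0xFF6F + 1)


def is_excluded(ch):
    """Return True if ch falls in the range the proposal would omit."""
    return ord(ch) in EXCLUDED_HALFWIDTH


def name_uses_excluded_chars(name):
    """Return True if any character of name falls in the omitted range."""
    return any(is_excluded(ch) for ch in name)
```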

>>3. JIS X 0212 characters in EUC_JP
>>(1) Proposal
>>Direct use of JIS X 0212 characters in EUC-JP should be
>>prohibited.  In other words, the control function SS3 of 
>>EUC_JP should be disallowed in XML.
>
>As noted above, it seems to me that we should not, in the XML spec
>itself, make rules governing how people use ASCII or EBCDIC or 
>ISO-Latin or JIS - with the sole exception that the characters should
>be Unicode characters.

Again, this is understandable.
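
For readers unfamiliar with EUC-JP: SS3 is the byte 0x8F, which introduces
a two-byte JIS X 0212 character, and SS2 is the byte 0x8E, which introduces
a one-byte JIS X 0201 katakana.  A rough sketch of how a tool might detect
SS3 in a byte stream follows, assuming Python and well-formed EUC-JP input;
the function name is hypothetical.

```python
SS2 = 0x8E  # single shift 2: followed by one JIS X 0201 katakana byte
SS3 = 0x8F  # single shift 3: followed by two JIS X 0212 bytes


def uses_jis_x_0212(data):
    """Return True if the EUC-JP byte string contains an SS3 sequence.

    The scan walks character by character so that each byte is looked at
    only in lead-byte position.  Input is assumed to be well-formed.
    """
    i = 0
    while i < len(data):
        b = data[i]
        if b == SS3:
            return True
        if b == SS2 or 0xA1 <= b <= 0xFE:
            i += 2  # SS2 + one byte, or a two-byte JIS X 0208 character
        else:
            i += 1  # single-byte (ASCII range) character
    return False
```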

>>3. Shift_JIS
>>(1) Proposal
>>As the definition of Shift_JIS, Appendix 1 of JIS X 0208-1997
>>should be referenced.
>
>I do not agree with this.  The whole idea is that in an XML text 
>entity, you can use any encoding scheme you want as long as you
>can map the characters to Unicode.  We do not give references for
>ISO-Latin or ASCII or anything except Unicode.

OK.
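
The principle Tim describes (any encoding is acceptable as long as its
characters map to Unicode) can be illustrated roughly as follows.  The
sketch assumes Python's built-in codecs, whose Shift_JIS mapping is not
necessarily the one defined in JIS X 0208-1997; the function name is
hypothetical.

```python
def to_unicode(raw, encoding):
    """Map the bytes of an entity to Unicode text using the named codec."""
    return raw.decode(encoding)


# The same three-character text ("nihongo"), given as code points.
TEXT = "\u65e5\u672c\u8a9e"

# Different byte sequences on the wire, the same characters after mapping.
SJIS_BYTES = TEXT.encode("shift_jis")
EUC_BYTES = TEXT.encode("euc_jp")
```

Both byte strings differ from each other, yet to_unicode() returns the same
text for each when given the matching encoding name.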

>>4. Ideographic space character
>>(1) Proposal
>>The ideographic space character should not be considered as
>>a white space character.
>
>This is obviously a difficult decision.  In my work in Japan
>(in the area of full-text search) I was told that the fullwidth
>space should be treated as a space for purposes of searching.
>In the internationalization TC to SGML, how is this handled?

Yes, this is a difficult decision.

Programmers think that treating the ideographic space as a delimiter is a 
crazy idea.  I do not think any programming language allows the 
ideographic space character as a delimiter.  At least, cc on Unix does 
not, and neither does nsgmls.

Some non-programmers think the ideographic space is merely another space 
character.  We should allow the ideographic space as part of PCDATA, and 
full-text search engines should handle it as a delimiter.  Nevertheless, 
we feel that the ideographic space should be disallowed as a delimiter 
between <!DOCTYPE and the DTD name.
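
The distinction can be sketched as two separator sets: one for markup and
one for text search.  This is only an illustration, assuming Python; the
constant and function names are hypothetical, and the markup set shown
(space, tab, CR, LF) is an assumption about what a restricted separator
would contain.

```python
IDEOGRAPHIC_SPACE = "\u3000"

# Separators accepted between markup tokens, e.g. after <!DOCTYPE.
MARKUP_SEPARATORS = " \t\r\n"

# Separators a full-text search engine might use on character data.
TEXT_SEPARATORS = MARKUP_SEPARATORS + IDEOGRAPHIC_SPACE


def split_on(text, separators):
    """Split text on any character in separators, dropping empty tokens."""
    tokens, current = [], []
    for ch in text:
        if ch in separators:
            if current:
                tokens.append("".join(current))
                current = []
        else:
            current.append(ch)
    if current:
        tokens.append("".join(current))
    return tokens
```

With these sets, an ideographic space inside character data separates two
search terms, while the same character after <!DOCTYPE does not separate
the keyword from the name that follows it.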

As for Gavin's reply, I do not understand his position.  In his mail (Thu, 16 
Jan), Gavin wrote:

>Currently "S" is defined with ideographic space as a component. As
>ideographic space is *not* part of ASCII, I would recommend using a
>production other than "S" as the separator in the XMLdecl; something
>limited to just space and tabs (though it will be easy for authors in
>Japan to make a mistake here, it's probably a reasonable tradeoff).

Tim Bray writes:
>>5. The private use areas
>>(1) Proposal
>>The private use areas of Unicode should not be used in XML.
>>(3) Background
>>XML is primarily intended for open interchange over the
>>Internet.  If the private use areas of Unicode are 
>>used in XML, we will have incompatibility problems.
>
>But there are a lot of characters in the world that are not
>in Unicode.  Particularly in the areas of mathematics, chemistry,
>and so on?  If we can't use the private-use area, how can we
>use these at all?  I agree that it will require co-ordination 
>between sender and receiver to make this work.

We are well aware of the need for gaiji characters.  What we are against 
is the use of gaiji characters in open interchange via XML over 
the Internet.  Put another way, documents containing characters 
from the private use areas are not XML, but XML-like.
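
A tool that wanted to enforce this could flag private-use characters along
the following lines.  The sketch assumes Python and its standard
unicodedata module, where private-use code points have the general
category "Co"; the function name is hypothetical.

```python
import unicodedata


def private_use_chars(text):
    """Return (offset, "U+XXXX") pairs for each private-use character."""
    return [
        (i, "U+%04X" % ord(ch))
        for i, ch in enumerate(text)
        if unicodedata.category(ch) == "Co"
    ]
```

An empty result would mean the text contains no characters from the
private use areas.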

Makoto
 
Fuji Xerox Information Systems
 
Tel: 044-812-7230   Fax: 044-812-7231
E-mail: murata@apsdc.ksp.fujixerox.co.jp
Received on Wednesday, 28 May 1997 22:31:03 UTC