Re: Comments on 31 March spec

[Murata Makoto]
> Christopher R. Maden writes:
> >Appendix A: "... depending on the character set..."  No, no, NO!!!
> >ALL XML documents *must* have the same character set, in the SGML
> >sense, or the numeric character references are trash.  They may have
> >different encodings or BCTFs, but the *character set* is ALWAYS the
> >same.  This prose must be cleaned up ASAP, or we'll be haunted by
> >incompatible applications for XML's entire brief life.
> To get rid of this problem, I would like to disallow number
> references to two byte characters.  Does anybody have any problems?
> I don't think Japanese have any problems.

For all-Japanese documents you won't, but what if you want to refer to
Anton Dvorak or Bialystok, and spell them correctly using carons and
slashed l's?
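
As a minimal sketch of why such names need numeric references, the Python
below writes them into an ASCII-only document. The helper name
to_ascii_with_refs is hypothetical; the "xmlcharrefreplace" error handler
is a standard Python codec feature that emits decimal references to
ISO 10646 / Unicode code points.

```python
# Hypothetical helper: keep a document in plain ASCII and express every
# other character as a decimal numeric character reference.
def to_ascii_with_refs(text: str) -> str:
    return text.encode("ascii", "xmlcharrefreplace").decode("ascii")

dvorak = to_ascii_with_refs("Dvo\u0159\u00e1k")      # r with caron, a with acute
bialystok = to_ascii_with_refs("Bia\u0142ystok")     # l with stroke

print(dvorak)
print(bialystok)
```

The numbers that come out (345, 225, 322) are 10646 code points, not byte
values in any particular encoding, so they are not tied to "two byte"
storage at all.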

But it's not just multi-byte characters that are a problem.  If I
write an XML document in ISO 8859-1 here on my UNIX system and refer
to less-than with a numeric reference, and then Michael FTPs it in
ASCII mode to his IBM VM/ESA mainframe using EBCDIC, it will be
re-encoded.  If his system takes the local encoding as the character
set, my less-than reference will now resolve to a different character.
The XML system (or SGML-system-in-XML-mode) needs to be told that this
is still a document whose character set is ISO 10646, but whose
encoding is now EBCDIC instead of ISO 8859-1.
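
The transfer can be simulated roughly as follows. This is only a sketch:
Python's "cp037" codec stands in for the mainframe's EBCDIC code page,
which is an assumption (the actual code page on a given VM/ESA system
may differ), and the sample text is made up.

```python
# The document as written on the UNIX side, with a numeric reference
# to less-than (code point 60).
original = "if a &#60; b"
on_unix = original.encode("latin-1")

# An ASCII-mode FTP transfer re-encodes every character for the target.
on_vm = on_unix.decode("latin-1").encode("cp037")

# The bytes differ, but the characters (including the digits "60"
# inside the reference) are the same.
assert on_unix != on_vm
assert on_vm.decode("cp037") == original

# A literal less-than sits at a different byte value in each encoding.
assert "<".encode("latin-1") == b"\x3c"
assert "<".encode("cp037") == b"\x4c"

# Wrong reading: treat 60 as a byte value in the local encoding.
wrong = bytes([60]).decode("cp037")
# Right reading: treat 60 as an ISO 10646 code point.
right = chr(60)
assert right == "<" and wrong != "<"
```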

Numeric character references must *always* refer to 10646 code points,
and in the SGML sense that means that the document character set must
always be ISO 10646.  Encodings or BCTFs change; the data does not (and
cannot!).
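
A rough sketch of that rule, assuming a hypothetical resolve_refs helper
that handles decimal references only: the encoding is used solely to turn
bytes into characters, and the reference number is then always read as a
10646 code point.

```python
import re

_DECIMAL_REF = re.compile(r"&#([0-9]+);")

def resolve_refs(raw: bytes, encoding: str) -> str:
    """Decode raw bytes, then expand decimal numeric character references.

    The encoding affects only the bytes-to-characters step; the number
    in a reference is interpreted as a code point regardless.
    """
    text = raw.decode(encoding)
    return _DECIMAL_REF.sub(lambda m: chr(int(m.group(1))), text)

source = "x &#60; y, Dvo&#345;ak"
results = {
    enc: resolve_refs(source.encode(enc), enc)
    for enc in ("latin-1", "cp037", "utf-16")
}

# The same characters come out whatever encoding carried the document.
assert len(set(results.values())) == 1
```

With this reading, re-encoding the file in transit changes nothing about
what the references mean.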

Christopher R. Maden                  One Richmond Square
DynaText SIT Technical Support        Providence, RI 02906 USA
Inso Corporation                      +1.401.421.9550 (voice)
Electronic Publishing Solutions       +1.401.521.2030 (facsimile)

Received on Friday, 4 April 1997 13:28:57 UTC