Re: Comments on 31 March spec
> Christopher R. Maden writes:
> >Appendix A: "... depending on the character set..." No, no, NO!!!
> >ALL XML documents *must* have the same character set, in the SGML
> >sense, or the numeric character references are trash. They may have
> >different encodings or BCTFs, but the *character set* is ALWAYS the
> >same. This prose must be cleaned up ASAP, or we'll be haunted by
> >incompatible applications for XML's entire brief life.
> To get rid of this problem, I would like to disallow number
> references to two byte characters. Does anybody have any problems?
> I don't think Japanese have any problems.
For all-Japanese documents you won't, but what if you want to refer to
Anton Dvorak or Bialystok, and spell them correctly using carons and
But it's not just multi-byte characters that are a problem. If I
write an XML document in ISO 8859-1 here on my UNIX system and refer
to less-than with a numeric reference, and then Michael FTPs it in
ASCII mode to his IBM VM/ESA mainframe using EBCDIC, it will be
re-encoded. Read against the EBCDIC code table, my less-than
reference will now be a different character.
The XML system (or SGML-system-in-XML-mode) needs to be told that this
is still a document whose character set is ISO 10646, but whose
encoding is now EBCDIC instead of ISO 8859-1.
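
To make the less-than case concrete, here is a minimal sketch in Python. It assumes the `cp037` codec as a stand-in for whatever EBCDIC code page the mainframe actually uses, and `html.unescape` as a stand-in for an XML parser's handling of numeric references; the `resolve` helper and the constant names are hypothetical, chosen only for this illustration.

```python
import html

LESS_THAN = "<"          # U+003C in ISO 10646
REFERENCE = "&#60;"      # decimal 60 == 0x3C, the 10646 code point

# The same character has different byte values in the two encodings.
latin1_bytes = LESS_THAN.encode("iso-8859-1")   # b'\x3c'
ebcdic_bytes = LESS_THAN.encode("cp037")        # b'\x4c' (assumed code page)


def resolve(raw: bytes, encoding: str) -> str:
    """Decode the bytes with the declared encoding, then resolve numeric
    references against ISO 10646 code points (what the text argues for)."""
    return html.unescape(raw.decode(encoding))


# The document as written on UNIX, and the same document after an
# ASCII-mode transfer has re-encoded every byte into EBCDIC.
on_unix = REFERENCE.encode("iso-8859-1")
on_mainframe = on_unix.decode("iso-8859-1").encode("cp037")

# Resolved against 10646, the reference survives the re-encoding.
print(resolve(on_unix, "iso-8859-1") == LESS_THAN)
print(resolve(on_mainframe, "cp037") == LESS_THAN)

# Resolved against the local code table instead, position 60 in EBCDIC
# is some other character entirely.
wrong = bytes([60]).decode("cp037")
print(wrong == LESS_THAN)
```

The bytes on disk change during the transfer, but the reference only keeps its meaning if the number 60 is always looked up in ISO 10646 rather than in the encoding currently in use.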
Numeric character references must *always* refer to 10646 code points,
and in the SGML sense that means that the document character set must
always be ISO 10646. Encodings or BCTFs change; data does not (and
Christopher R. Maden              One Richmond Square
DynaText SIT Technical Support    Providence, RI 02906 USA
Inso Corporation                  +1.401.421.9550 (voice)
Electronic Publishing Solutions   +1.401.521.2030 (facsimile)