- From: Christopher R. Maden <crm@eps.inso.com>
- Date: Fri, 4 Apr 1997 17:51:16 GMT
- To: w3c-sgml-wg@w3.org
[Murata Makoto] > Christopher R. Maden writes: > >Appendix A: "... depending on the character set..." No, no, NO!!! > >ALL XML documents *must* have the same character set, in the SGML > >sense, or the numeric character references are trash. They may have > >different encodings or BCTFs, but the *character set* is ALWAYS the > >same. This prose must be cleaned up ASAP, or we'll be haunted by > >incompatible applications for XML's entire brief life. > > To get rid of this problem, I would like to disallow number > references to two byte characters. Does anybody have any problems? > I don't think Japanese have any problems. For all-Japanese documents you won't, but what if you want to refer to Anton Dvorak or Bialystok, and spell them correctly using carons and slashed l's? But it's not just multi-byte characters that are a problem. If I write an XML document in ISO 8859-1 here on my UNIX system and refer to less-than with a numeric reference, and then Michael FTPs it in ASCII mode to his IBM VM/ESA mainframe using EBCDIC, it will be re-encoded. My less-than reference will now be a different character. The XML system (or SGML-system-in-XML-mode) needs to be told that this is still a document whose character set is ISO 10646, but whose encoding is now EBCDIC instead of ISO 8859-1. Numeric character references must *always* refer to 10646 code points, and in the SGML sense that means that the document character set must always be ISO 10646. Encodings or BCTFs change; data does not (and can not!) -Chris -- Christopher R. Maden One Richmond Square DynaText SIT Technical Support Providence, RI 02906 USA Inso Corporation +1.401.421.9550 (voice) Electronic Publishing Solutions +1.401.521.2030 (facsimile)
Received on Friday, 4 April 1997 13:28:57 UTC