- From: Rick Jelliffe <ricko@topologi.com>
- Date: Sat, 12 Apr 2003 06:50:27 +1000
- To: <www-tag@w3.org>
Paul Grosso <pgrosso@arbortext.com> wrote:

> So my question is, since one will probably have to do even more for the
> kind of reliability you want, why leave in this one incompatibility? Is
> the cost of breaking backward compatibility with XML 1.0 worth the benefit,
> given that you've just admitted you still don't have your bullet-proof
> reliability?

I will answer in two ways.

Breaking Compatibility
----------------------

First, on the "cost of breaking backwards compatibility with XML 1.0".

As you know, Unicode 1.0 reserved the C1 range but did not assign any
characters to it. At the time XML 1.0 was released, the control characters
had no semantics. Since Unicode 3.0, the C1 control codes do have
semantics: those of ISO 6429. That means that people who used those codes
with different semantics are not conforming to Unicode 3.0.

The XML 1.1 revision's purpose is to align XML with Unicode 3.1 and future
versions. So anyone who has used those characters with different semantics
is not conforming with Unicode 3.n, and we don't need to support them.

I note that it does not necessarily break compatibility with
implementations. For example, MSXML 4 (as used in my company's freebie
validator for WXS, Schematron, RELAX NG, etc.) barfed if faced with C1
controls. It was acting correctly in this, because the presence of a
literal control character in a text stream is either a sign of an error
(e.g. EOT) or of some non-textual use (e.g., what would BS be doing in a
document?). (The validator rejected the controls not because of what
Unicode allowed or disallowed, but because they were inappropriate when
found in an 8859-1 data stream, I would say.)

Just about the very first support question on our validator, in Oct 2001,
was from someone who had a Euro at 0x80 in data labelled 8859-1: they
reported it as a bug. Redundant-code error-detection works.

See http://lists.w3.org/Archives/Public/xml-editor/2001OctDec/0004.html
for more info on C1 control characters.
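A minimal sketch (mine, not from the original exchange) of the redundancy
check described above: bytes 0x80-0x9F decode to C1 control characters
under ISO 8859-1, so their presence in data labelled 8859-1 almost always
signals a mislabelled encoding, such as Windows-1252, whose Euro sign
occupies 0x80.

```python
def find_c1_suspects(data: bytes) -> list[int]:
    """Return offsets of bytes in the C1 range (0x80-0x9F).

    In data labelled ISO 8859-1, these decode to rarely-legitimate
    control characters, so hits are strong evidence of mislabelling.
    """
    return [i for i, b in enumerate(data) if 0x80 <= b <= 0x9F]

# A Euro sign encoded as Windows-1252 but (mis)labelled ISO 8859-1:
sample = "price: 100\u20ac".encode("cp1252")
offsets = find_c1_suspects(sample)
if offsets:
    print(f"suspect C1 bytes at offsets {offsets}: likely mislabelled encoding")
```

The check is cheap (one comparison per byte) and needs no knowledge of
what the true encoding was, which is exactly why it catches the
support-question case above.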
See http://lists.xml.org/archives/xml-dev/200109/msg00259.html for more
info on C0 control characters.

See http://www.xml.com/pub/a/2002/09/18/euroxml.html for discussion of the
Euro, especially the box "How Could XML 1.1 Help?"

All or Nothing
--------------

Next, the issue that it is still not enough.

I have explained already in a previous post that error-detection by
exploiting code redundancy and a checksum (xml:md5) are applicable in
different cases. Having one reduces the need for the other, but they don't
cover exactly the same cases. Code redundancy is probably more likely to
catch human error (editing, a programmer using the default encoding to
read or write, data coming from a corrupt database), while an xml:md5
would more likely catch system errors (e.g. corrupting transcoding, or
processors that work byte-by-byte but make incorrect assumptions).

(The same goes for restricting name characters: it can find things that
code redundancy will not. Somewhere I gave an example of this with the
Greek 8859-? character encoding mislabelled as 8859-1. However, the XML
Core WG does not want to utilize redundancy in this way, so the C1
controls are the only game in town.)

Now, as I have pointed out, using code redundancy will not catch any
errors where two encodings have common feasible code sequences that don't
involve the C1 range: for example, ISO 8859-2 mislabelled as ISO 8859-1.
The only way to attempt to detect those is through name checking.

(And, to flog a dead horse, it is completely spurious to say that we
cannot make use of allocated code points because we need to be
future-compatible: it is ludicrous to think that the Unicode Consortium
will drop letters out of the Greek alphabet, or change ISO 8859-1 so that
"multiply" becomes a letter. Future-proofing XML against Unicode evolution
does not imply that existing allocated characters cannot be used for
redundancy checking at the character level: it is only unallocated
character positions that XML 1.1 needs to be open to.)
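To make the division of labour concrete, here is a hypothetical sketch of
the checksum idea. Note that xml:md5 was a proposal under discussion, not
a standardized attribute, and the function name and fixed-encoding choice
here are my own assumptions: the digest is computed over the character
content in one agreed encoding (UTF-8 below), so a corrupting transcode or
byte-level mangling anywhere downstream changes the digest even when every
resulting byte happens to be a feasible code sequence.

```python
import hashlib

def content_md5(text: str) -> str:
    """Digest of character content over a fixed encoding (UTF-8),
    so the result is independent of whatever byte encoding the
    document happened to travel in."""
    return hashlib.md5(text.encode("utf-8")).hexdigest()

# Sender records the digest (imagine it carried in an xml:md5 attribute):
sent_digest = content_md5("Prix: 100\u20ac")

# ... document passes through transcoders, gateways, databases ...

# Receiver recomputes over the decoded characters; a mismatch flags
# corruption that code redundancy alone could not have detected.
received_text = "Prix: 100\u20ac"
assert content_md5(received_text) == sent_digest
```

This is the sense in which the two mechanisms overlap without coinciding:
the range check needs no out-of-band data but only catches infeasible
bytes, while the checksum catches any alteration but requires the sender
to have attached it.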
What code redundancy will find is where a proprietary extension to an ISO
standard character set has been used but labelled as the ISO set, and many
encoding issues for CJK. See
http://www.topologi.com/resources/XML_Naming_Rules.html for some details.

All-or-nothing is not the choice, and there is no need to railroad
ourselves into thinking it is. The choice is some-or-nothing. The XML Core
WG should discover and maintain the strengths of XML. Perhaps the TAG has
a role in figuring out the robustness objectives for XML if it is to be
used for important data transfers.

Checking for characters in the C1 range only involves a small range-check:
it is hard to imagine any other low-hanging fruit hanging so low.

Cheers
Rick Jelliffe
Received on Friday, 11 April 2003 16:46:32 UTC