- From: Rick Jelliffe <ricko@allette.com.au>
- Date: Fri, 11 Apr 2003 19:20:57 +1000
- To: <www-tag@w3.org>
Paul Grosso wrote:

> The XML Core WG has not resolved this open issue yet, so I for one
> wouldn't mind understanding this better.

CR seems a bit late for this.

> I am unclear on the benefits of this. In exchange for making some
> well-formed XML 1.0 documents no longer well-formed XML 1.1, what
> exactly are we getting? I gather the answer is greater "encoding
> error detection," that is, the ability to reject yet more documents.

Which part don't you understand?

I have provided the XML Core WG with examples of which encoding pairs would be affected and to what extent, showing that it is applicable in common cases, notably including CP1252 (which includes the Euro) mislabelled as ISO8859-1.[1]

I have provided the XML Core WG with a formula to estimate the probability of encoding errors being detected, showing that we can expect it to be effective for the encoding pairs for which it is applicable.

Why do I have to go over this again? The WG did not find any holes in the reasoning last time.

I think the real problem here is the feeling that there should be some other layer under XML that looks after this kind of thing: that XML should not be complicated by things that don't belong to abstract characters. But there is not: XML is the Johnny-on-the-spot. An XML processor is presented with bytes, not characters, so it is XML's responsibility to make sure the translation from bytes to characters is robust.

It comes down to whether XML should be robust enough for mission-critical applications. (Actually, I wonder whether, even with literal C1s banned, XML is reliable enough for "life-threatening" applications without something like Liam's suggested xml:md5, if the document contains any non-ASCII literals and is not in UTF-16.)

Another reason XML should do it is that DBMS vendors have shown an extreme disinclination to test the encoding of data coming in.
It is a great failing in integrity-checking that only becomes apparent when you have more than a single regional character encoding to cope with, but it is understandable because of fears about benchmarking, given that most people are only dealing with their in-house data, and most houses are in one locale. (Whether users might not prefer reliability is another matter.) I have seen databases corrupted because of this. XML is well-placed to take DBMS off the hook here.

XML can, nothing else can, we need it, it is possible, therefore XML should.

Have any users requested of the XML Core WG that XML be made less reliable?

Cheers
Rick Jelliffe

[1] A more likely thing, given the advent of the Euro in CP1252 (ANSI) but not in 8859-1. See http://www.xml.com/pub/a/2002/09/18/euroxml.html
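
For readers who want to see the CP1252-mislabelled-as-ISO8859-1 case concretely, here is a minimal sketch in Python. It is not the formula or the examples sent to the WG. The function name find_literal_c1 and the sample document are hypothetical, and treating the whole range U+0080 to U+009F as suspect is an assumption made for illustration; the exact set of restricted characters is for the spec to decide.

```python
# Assumption: treat every code point in the C1 block as suspect.
C1_FIRST, C1_LAST = 0x80, 0x9F


def find_literal_c1(data: bytes, declared_encoding: str = "iso-8859-1"):
    """Decode data with the declared encoding and return a list of
    (character offset, code point) pairs for literal C1 controls."""
    text = data.decode(declared_encoding)
    return [
        (offset, ord(ch))
        for offset, ch in enumerate(text)
        if C1_FIRST <= ord(ch) <= C1_LAST
    ]


# A document written in CP1252 (the Euro sign is byte 0x80 there)
# but labelled ISO-8859-1, which has no Euro sign.
doc = (
    '<?xml version="1.1" encoding="ISO-8859-1"?>'
    "<price>\u20ac10</price>"
).encode("cp1252")

# Decoded as labelled, byte 0x80 becomes the C1 control U+0080,
# so the mislabelling shows up as a literal C1 character.
mislabelled_hits = find_literal_c1(doc, "iso-8859-1")

# Decoded with the encoding actually used, the same byte is U+20AC.
correct_hits = find_literal_c1(doc, "cp1252")
```

With literal C1 characters disallowed, a processor that trusts the ISO-8859-1 label would reject this document at the Euro sign instead of silently passing a control character downstream.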
Received on Friday, 11 April 2003 05:17:02 UTC