Re: internet media types and encoding

At 19:20 2003-04-11 +1000, Rick Jelliffe wrote:

>Paul Grosso wrote
>
>> The XML Core WG has not resolved this open issue yet, so I for one
>> wouldn't mind understanding this better.
>
>CR seems a bit late for this.

CR is for collecting implementation and use feedback.  We got feedback
from several quarters suggesting that it was important for XML 1.0 to
be a subset of XML 1.1.

In fact, the CR request [1] (member only) explicitly called out this
issue and requested feedback on it [2].

>> I am unclear on the benefits of this.  In exchange for making some
>> well-formed XML 1.0 documents no longer well-formed XML 1.1, what
>> exactly are we getting?  I gather the answer is greater "encoding
>> error detection," that is, the ability to reject yet more documents.
>
>Which part don't you understand? 

I don't understand why catching mistakenly labeled documents is so
important that it's worth breaking backward compatibility with XML 1.0.
[I'm speaking personally, not for the XML Core WG.]

> I have provided the XML Core WG
>with examples of which encoding pairs would be affected and to what extent,
>which show that it is applicable in common cases, notably including CP1252
>(includes Euro) mislabelled as ISO-8859-1.[1]  I have provided the XML
>Core WG with a formula to estimate the probability of mislabelled encodings
>being detected, which shows we can expect it to be effective for the encoding
>pairs for which it is applicable.
>
>Why do I have to go over this again?  The WG did not find any holes in the 
>reasoning last time. 

No one is discussing holes in logic.  We're discussing cost/benefit tradeoffs.
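
For concreteness, the CP1252-labelled-as-ISO-8859-1 case can be sketched as
follows.  This is only an illustration in Python, not anything circulated in
the WG: the function name c1_positions is hypothetical, and the range
#x7F-#x9F is taken from the CR text quoted in [2].

```python
def c1_positions(data, declared="iso-8859-1"):
    """Decode bytes with the declared encoding and return the offsets of
    any literal characters in the range U+007F-U+009F."""
    text = data.decode(declared)
    return [i for i, ch in enumerate(text) if 0x7F <= ord(ch) <= 0x9F]


# A fragment containing a Euro sign, saved as CP1252 (Euro is byte 0x80).
euro_doc = "<price>\u20ac5</price>".encode("cp1252")

# Mislabelled as ISO-8859-1: byte 0x80 decodes to the C1 control U+0080,
# which the proposed XML 1.1 rule would reject as a literal character.
print(c1_positions(euro_doc))            # [7]

# Correctly labelled: byte 0x80 decodes to U+20AC, nothing is flagged.
print(c1_positions(euro_doc, "cp1252"))  # []
```

The check only fires when the document actually uses a byte in 0x80-0x9F;
bytes 0xA0-0xFF decode identically in both encodings, so a mislabelled
document that avoids the lower range would pass unnoticed.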


>I think the real problem here is the feeling that there should be some other
>layer under XML that looks after this kind of thing: that XML should not be 
>complicated by things that don't belong to abstract characters. 
>
>But there is not: XML is the Johnny-on-the-spot. An XML processor is presented 
>with bytes, not characters, so it is XML's responsibility to make sure the 
>translation from bytes to characters is robust.  It comes down to whether XML 
>should be robust enough for mission critical applications.
>
>(Actually,  I wonder whether even with literal C1s banned, XML is not reliable enough 
>for "life-threatening" applications without something like Liam's suggested xml:md5, 
>if the document contains any non-ASCII literals and is not in UTF-16.)

Right, there were other things you (and others) requested that we did not
do in XML 1.1 because they would have broken backward compatibility even
more.  So you don't now have the "reliability you want."

So my question is, since one will probably have to do even more for the
kind of reliability you want, why leave in this one incompatibility?  Is
the cost of breaking backward compatibility with XML 1.0 worth the benefit,
given that you've just admitted you still don't have your bullet-proof
reliability?

>Another reason XML should do it is because DBMS vendors have shown an
>extreme disinclination to test the encoding of incoming data.  It is a
>great failing in integrity-checking that only becomes apparent when you don't
>have a single regional character encoding to cope with, but it is understandable
>because of fears about benchmarking, given that most people are dealing only
>with their inhouse data, and most houses are in one locale.  (Whether users might
>not prefer reliability is another matter.)  I have seen databases corrupted because
>of this.  XML is well-placed to take DBMS off the hook here.
>
>XML can, nothing else can, we need it, it is possible, therefore XML should.
>Have any users requested to the XML Core WG that XML should be
>made less reliable?

No, but we're not talking about making it less reliable, we're talking
about leaving it as reliable (in this area) as it is currently in XML 1.0.

And many users have requested backward compatibility with XML 1.0.

paul

p.s.  Despite my arguments, I'm still not sure what the right answer is.
But personally, I'd like to hear cost/benefit analyses from folks on both
sides, or this will likely be decided merely by intensity of discussion.


[1] http://lists.w3.org/Archives/Member/chairs/2002JulSep/0128

[2] To quote from [1]:

The removal of direct representation of control characters in the range
#x7F-#x9F represents a change in well-formedness. That is, well-formed
XML 1.0 documents which contain these characters do not become
well-formed XML 1.1 documents simply by changing their version number.
Occurrences of control characters must also be converted to numeric
character references.

As a criterion for exiting CR, the XML Core WG will collect evidence
substantiating (or contradicting) our opinion that:

1) converting characters in the #x7F-#x9F range to numeric
   character references while updating XML 1.0 documents to XML 1.1 does
   not represent a significant obstacle to adoption of XML 1.1;
2) there are no significant scenarios where converting characters
   in the #x7F-#x9F range to numeric character references is impractical or
   impossible;
3) that the benefits of this change to the proper detection of
   character encoding represent a significant improvement in
   interoperability.
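
The conversion the quoted text describes, replacing literal characters in
#x7F-#x9F with numeric character references, might look roughly like the
Python sketch below.  The name escape_c1 is hypothetical, and the sketch
operates on already-decoded text; it does not update the version number in
the XML declaration.

```python
import re

# Literal characters in the range U+007F-U+009F.
_C1 = re.compile("[\x7f-\x9f]")


def escape_c1(text):
    """Replace each literal U+007F-U+009F character with a hexadecimal
    numeric character reference such as &#x85;."""
    return _C1.sub(lambda m: "&#x%X;" % ord(m.group()), text)


print(escape_c1("line one\x85line two"))  # line one&#x85;line two
```

A blind substitution like this is only safe in character data and attribute
values.  Character references are not recognized inside comments, processing
instructions, or CDATA sections, so occurrences there would need separate
handling, which bears on criterion 2 above.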

Received on Friday, 11 April 2003 10:36:51 UTC