Re: internet media types and encoding

From: Rick Jelliffe <ricko@allette.com.au>
Date: Fri, 11 Apr 2003 19:20:57 +1000
Message-ID: <018701c3000b$a8f6c250$4bc8a8c0@AlletteSystems.com>
To: <www-tag@w3.org>

Paul Grosso wrote:

> The XML Core WG has not resolved this open issue yet, so I for one
> wouldn't mind understanding this better.

CR seems a bit late for this.

> I am unclear on the benefits of this.  In exchange for making some
> well-formed XML 1.0 documents no longer well-formed XML 1.1, what
> exactly are we getting?  I gather the answer is greater "encoding
> error detection," that is, the ability to reject yet more documents.

Which part don't you understand?  I have provided the XML Core WG
with examples of which encoding pairs would be affected and to what extent,
showing that it is applicable in common cases, notably CP1252
(which includes the Euro) mislabelled as ISO 8859-1.[1]  I have provided the XML
Core WG with a formula to estimate the probability of encoding errors being detected,
showing that we can expect it to be effective for the encoding pairs to which
it is applicable.
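The CP1252-mislabelled-as-ISO-8859-1 case can be sketched in a few lines. This is my own illustration of the general idea, not the WG's formula: bytes in the C1 range (0x80-0x9F) are valid ISO 8859-1 code points but are almost never intended, whereas in CP1252 those same bytes carry characters such as the Euro sign, so their presence in data labelled ISO 8859-1 is a strong hint of mislabelling. The function name is hypothetical.

```python
def looks_mislabelled_as_latin1(data: bytes) -> bool:
    """Heuristic sketch: data labelled ISO 8859-1 that contains bytes in
    the C1 control range 0x80-0x9F is probably really CP1252 (or some
    other superset encoding), since literal C1 controls are rarely meant."""
    return any(0x80 <= b <= 0x9F for b in data)

# The Euro sign is byte 0x80 in CP1252 but a C1 control in ISO 8859-1.
euro_text = "price: 100\u20ac".encode("cp1252")
print(looks_mislabelled_as_latin1(euro_text))   # detected
print(looks_mislabelled_as_latin1(b"plain ascii"))
```

A processor that rejects literal C1s gets this detection for free; that is the "encoding error detection" being traded for.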

Why do I have to go over this again?  The WG did not find any holes in the 
reasoning last time. 

I think the real problem here is the feeling that there should be some other
layer under XML that looks after this kind of thing: that XML should not be 
complicated by things that don't belong to abstract characters. 

But there is none: XML is the Johnny-on-the-spot. An XML processor is presented
with bytes, not characters, so it is XML's responsibility to make sure the 
translation from bytes to characters is robust.  It comes down to whether XML 
should be robust enough for mission critical applications.

(Actually, I wonder whether, even with literal C1s banned, XML is reliable enough
for "life-threatening" applications without something like Liam's suggested xml:md5,
if the document contains any non-ASCII literals and is not in UTF-16.)
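A digest-based check in the spirit of the xml:md5 suggestion might look like the following. This is a hypothetical sketch of my own, not a description of any actual proposal: the idea is simply that publishing a digest of the exact byte stream lets a receiver detect silent corruption or re-transcoding that an encoding declaration alone cannot.

```python
import hashlib

# Hypothetical integrity check: digest the document's exact bytes so a
# receiver can verify the byte stream arrived untouched.  Re-encoding
# the same characters in a different encoding changes the digest.
doc = b'<?xml version="1.0" encoding="ISO-8859-1"?><p>caf\xe9</p>'
digest = hashlib.md5(doc).hexdigest()
print(digest)

# Transcoding the same text to UTF-8 yields different bytes, hence a
# different digest -- the mismatch is what the check would catch.
transcoded = doc.decode("latin-1").encode("utf-8")
print(hashlib.md5(transcoded).hexdigest())
```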

Another reason XML should do it is that DBMS vendors have shown an
extreme disinclination to test the encoding of incoming data.  It is a
great failing in integrity-checking that only becomes apparent when you don't
have a single regional character encoding to cope with, but it is understandable
given fears about benchmarking, since most people are dealing only
with their in-house data, and most houses are in one locale.  (Whether users might
not prefer reliability is another matter.)  I have seen databases corrupted because
of this.  XML is well placed to take DBMSs off the hook here.

XML can, nothing else can, we need it, it is possible, therefore XML should.
Have any users requested to the XML Core WG that XML should be
made less reliable?

Cheers
Rick Jelliffe

[1] All the more likely given the advent of the Euro in CP1252 (ANSI) but
not in 8859-1. See http://www.xml.com/pub/a/2002/09/18/euroxml.html
Received on Friday, 11 April 2003 05:17:02 GMT
