Re: internet media types and encoding from Rick Jelliffe on 2003-04-09 (www-tag@w3.org from April 2003)

From: Rick Jelliffe <ricko@topologi.com>
Date: Wed, 9 Apr 2003 20:26:51 +1000
To: "Chris Lilley" <chris@w3.org>
Cc: <www-tag@w3.org>
Message-ID: <03a901c2fe82$89169f80$4bc8a8c0@AlletteSystems.com>

From: "Chris Lilley" <chris@w3.org>

> I really, really want to avoid the situation where an XML file is well
> formed over the wire but ceases to be well formed when the server or
> other backend, filesystem-based processor manipulates it because the
> charset parameter is not present and the encoding declaration is
> wrong.

Yes, it is increasingly important.  It would be great to deprecate text/xml.
I don't know whether it will be possible to "stop" people or not, but it is
worth a try.

> Transcoding proxies do exactly that - make XML documents not well
> formed. The solution is to stop the dumb proxies breaking documents
> and if you can't stop them, then just don't use text/xml.

For a robust system, every layer needs 1) formats accurately labelled
so that the next layer can dispatch and dissect them, and 2) mechanisms
to test that the label is feasible for the data found.

For example, in Internet protocols a packet not only says "I am UDP",
it also provides a checksum that can be used to verify it. Above the XML
level, an element information set not only says "I am element x in
namespace y", but we can also have a schema to validate it. For robustness,
labelling needs to be paired with verification, even if the verification is
statistical or optional.
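
As a rough illustration of pairing a label with a check, here is a
minimal Python sketch. It is not how UDP or XML actually do it: the
wrap and verify helpers and the dict layout are hypothetical, and CRC-32
merely stands in for whatever checksum a real layer would use.

```python
import zlib

def wrap(label: str, payload: bytes) -> dict:
    """Attach a label and a CRC-32 checksum to a payload (hypothetical layout)."""
    return {"label": label, "crc32": zlib.crc32(payload), "payload": payload}

def verify(message: dict) -> bool:
    """Check that the payload still matches the checksum it was sent with."""
    return zlib.crc32(message["payload"]) == message["crc32"]

msg = wrap("application/xml", b"<doc/>")
ok_before = verify(msg)          # payload untouched

msg["payload"] = b"<dog/>"       # simulate a one-byte corruption in transit
ok_after = verify(msg)           # the label is unchanged, but the check fails
```

The label alone would still say "application/xml" after the corruption;
only the check notices that the data no longer matches.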

There are a handful of methods typically available for verification 
(error-detection): notably checksums, parsing and redundant codes.[1] 

XML 1.0 advanced textual formats by providing a workable labelling
mechanism for encoding. But we need a verification mechanism too:
as we go up the protocol stack, XML is something of a weak link.

For encoding error-detection, XML 1.1 takes one small step backwards 
(by opening up the characters used in names) but then takes a very large 
step forwards (by not allowing most C1 control characters directly). 
(The C1 controls are roughly U+0080-U+009F: reserving these is enough
to detect many common encoding errors, in particular mislabelling
character sets such as Big 5 or Win 1252 "ANSI" as ISO 8859-1.)

It is not enough to huff and puff...oops...deprecate text/xml!  In concert
with deprecation, XML needs to reserve enough redundant Unicode code points
in critical unused areas so that XML processors can detect as many
character-encoding-labelling errors as they can. This is also true with
application/xml*.

I hope the TAG will encourage the XML Core WG to improve and not
dump the C1 restrictions proposed in XML 1.1.

Cheers
Rick Jelliffe

[1] For the specific meaning of redundant code see for example
 http://www.fb9dv.uni-duisburg.de/education/fce1/material/codes.pdf

Received on Wednesday, 9 April 2003 06:23:08 UTC