- From: Chris Lilley <chris@w3.org>
- Date: Wed, 9 Apr 2003 22:11:33 +0200
- To: "Rick Jelliffe" <ricko@topologi.com>
- CC: www-tag@w3.org
On Wednesday, April 9, 2003, 12:26:51 PM, Rick wrote:

RJ> From: "Chris Lilley" <chris@w3.org>

>> I really, really want to avoid the situation where an XML file is well
>> formed over the wire but ceases to be well formed when the server or
>> other backend, filesystem-based processor manipulates it because the
>> charset parameter is not present and the encoding declaration is
>> wrong.

RJ> Yes, it is increasingly important. It would be great to deprecate
RJ> text/xml. I don't know whether it will be possible to "stop"
RJ> people or not, but worth a try.

Well, clearly you can't stop them by yelling STOP, but you can reduce it
greatly just by making clear exactly what the drawbacks are, in detail.

>> Transcoding proxies do exactly that - make XML documents not well
>> formed. The solution is to stop the dumb proxies breaking documents,
>> and if you can't stop them, then just don't use text/xml.

RJ> For a robust system, every layer needs to have 1) formats accurately
RJ> labelled for the next layer to dispatch and dissect and 2) mechanisms
RJ> to test that the label is feasible for the data found.

Yes.

RJ> For example, in Internet protocols, not only does a packet say "I am UDP",
RJ> it also provides a checksum that can be used to verify. Above the XML
RJ> level, not only does an element information set say "I am element x in
RJ> namespace y", but we can also have a schema to validate it. For robustness,
RJ> labelling needs to be paired with verification, even if the verification is
RJ> statistical or optional.

RJ> There are a handful of methods typically available for verification
RJ> (error-detection): notably checksums, parsing and redundant codes.[1]
RJ> XML 1.0 advanced textual formats by providing a workable labelling
RJ> mechanism for encoding. But we need a verification mechanism too:
RJ> when we go up the protocol stacks, XML is somewhat of a weak link.

xml:md5 ?

RJ> For encoding error-detection, XML 1.1 takes one small step backwards
RJ> (by opening up the characters used in names) but then takes a very large
RJ> step forwards (by not allowing most C1 control characters directly).
RJ> (The C1 controls are roughly U+0080-U+009F: reserving these is enough
RJ> to detect many common encoding errors, in particular mislabelling
RJ> character sets -- such as Big 5 or Win 1252 "ANSI" -- as ISO 8859-1.)

I hear you. There was a talk at XML 2002 USA where some guy was talking
about compression for XML messages, giving statistics on entropy and so on
to show how much redundancy had been removed (using XMill with considerable
schema-based fine-tuning knowledge of the data types of attributes to get
the best compression). I pointed out that this reduction in message size
came at a cost in terms of being able to detect, and correct for, errors,
which might be useful on a noisy communications link such as a battlefield.

RJ> It is not enough to huff and puff...oops...deprecate text/xml! In
RJ> concert with deprecation, XML needs to reserve enough redundant
RJ> Unicode code points in critical unused areas so that XML
RJ> processors can detect as many character-encoding-labelling errors
RJ> as they can. This is also true with application/xml*.

Detect, or correct?

RJ> I hope the TAG will encourage the XML Core WG to improve and not
RJ> dump the C1 restrictions proposed in XML 1.1.

I wasn't aware that it was in danger of being dumped.
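To make that "well formed over the wire, broken on disk" failure concrete, here is a quick sketch in Python -- purely illustrative, the function names are made up -- of the consistency check a server or proxy could run: compare the charset parameter it is about to send against the encoding declaration actually present in the entity body.

```python
# Sketch only: compares an HTTP charset parameter against the XML encoding
# declaration in the entity body. Names are hypothetical, not from any spec.
import re

DECL_RE = re.compile(rb'^<\?xml[^>]*encoding=["\']([A-Za-z0-9._-]+)["\']')

def declared_encoding(body: bytes) -> str | None:
    """Return the encoding named in the XML declaration, if any."""
    m = DECL_RE.match(body)
    return m.group(1).decode('ascii') if m else None

def labels_agree(body: bytes, charset_param: str | None) -> bool:
    """True if the external charset label and the internal declaration
    cannot contradict each other once the charset parameter is stripped."""
    decl = declared_encoding(body)
    if charset_param is None:
        # No external label: the declaration (or the UTF-8/UTF-16 default) rules.
        return True
    if decl is None:
        # Charset parameter but no declaration: stripping the header later
        # silently changes the effective encoding to the default.
        return charset_param.lower() in ('utf-8', 'utf-16')
    return charset_param.lower() == decl.lower()

# A transcoding proxy that re-encodes the bytes but rewrites only the charset
# parameter leaves the declaration wrong -- exactly the breakage discussed above.
body = '<?xml version="1.0" encoding="iso-8859-1"?><p>caf\u00e9</p>'.encode('iso-8859-1')
recoded = body.decode('iso-8859-1').encode('utf-8')   # declaration not updated
print(labels_agree(body, 'iso-8859-1'))   # True
print(labels_agree(recoded, 'utf-8'))     # False: labels now contradict
```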
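The "xml:md5" above is only a rhetorical aside; no such attribute exists. As a sketch of the label-plus-verification pairing Rick describes, one could digest the decoded character content (not the raw bytes), so any faithful re-encoding still verifies while a mislabelled one does not:

```python
# Sketch of the "label plus checksum" idea. "xml:md5" is not a real attribute;
# the digest here is computed over characters, so correct transcoding survives.
import hashlib

def content_digest(body: bytes, encoding: str) -> str:
    """MD5 of the decoded characters, normalised to UTF-8 for hashing."""
    text = body.decode(encoding)
    return hashlib.md5(text.encode('utf-8')).hexdigest()

original = '<doc>caf\u00e9</doc>'.encode('iso-8859-1')
label = content_digest(original, 'iso-8859-1')

# A correct transcoding to UTF-8 still verifies ...
ok = original.decode('iso-8859-1').encode('utf-8')
print(content_digest(ok, 'utf-8') == label)          # True

# ... but reading the recoded bytes under the stale label does not.
print(content_digest(ok, 'iso-8859-1') == label)     # False
```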
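And the C1 point lends itself to an equally small heuristic check, along these lines (again just an illustration, not anything any parser is required to do): text that decodes cleanly as ISO 8859-1 but contains code points in U+0080-U+009F is very likely mislabelled Windows-1252 "ANSI" data.

```python
# Heuristic sketch of the error-detection the C1 restriction enables:
# C1 controls have no business in marked-up text, so their presence after an
# ISO 8859-1 decode usually means the charset label is wrong.
C1 = set(range(0x80, 0xA0))

def looks_mislabelled_latin1(body: bytes) -> bool:
    """True if an ISO 8859-1 decode yields any C1 control character."""
    return any(ord(ch) in C1 for ch in body.decode('iso-8859-1'))

win1252 = '<p>\u201csmart quotes\u201d</p>'.encode('windows-1252')
print(looks_mislabelled_latin1(win1252))                      # True
print(looks_mislabelled_latin1(b'<p>plain Latin-1 text</p>')) # False
```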
It's abundantly clear that all versions of Unicode from 1.0 to the 4.0 beta have said, and continue to say, that U+0080 to U+009F are control codes, not printable characters (and further, they say which codes they are, and none of them have any business being in a markup language).

--
Chris                          mailto:chris@w3.org
Received on Wednesday, 9 April 2003 16:11:45 UTC