Re: internet media types and encoding

On Wednesday, April 9, 2003, 12:26:51 PM, Rick wrote:

RJ> From: "Chris Lilley" <chris@w3.org>

>> I really, really want to avoid the situation where an XML file is well
>> formed over the wire but ceases to be well formed when the server or
>> other backend, filesystem-based processor manipulates it because the
>> charset parameter is not present and the encoding declaration is
>> wrong.

RJ> Yes, it is increasingly important. It would be great to deprecate
RJ> text/xml. I don't know whether it will be possible to "stop"
RJ> people or not, but worth a try.

Well, clearly you can't stop them by yelling STOP, but you can reduce
it greatly just by making clear, in detail, exactly what the drawbacks
are.
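
To make the drawback concrete, here is a rough Python sketch (purely
illustrative, not anyone's production code) of the failure mode: the
bytes get transcoded somewhere along the way, the encoding declaration
inside the document does not, and well-formedness is gone.

    import xml.etree.ElementTree as ET

    original = '<?xml version="1.0" encoding="utf-8"?><p>caf\u00e9</p>'
    wire_bytes = original.encode("utf-8")       # well formed as sent

    # A proxy serving text/xml with no charset parameter recodes the
    # bytes to ISO 8859-1 but leaves encoding="utf-8" untouched.
    transcoded = wire_bytes.decode("utf-8").encode("iso-8859-1")

    try:
        ET.fromstring(transcoded)   # the parser trusts the declaration
    except ET.ParseError as err:
        print("no longer well formed:", err)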

>> Transcoding proxies do exactly that: make XML documents not well
>> formed. The solution is to stop the dumb proxies from breaking
>> documents, and if you can't stop them, then just don't use text/xml.

RJ> For a robust system, every layer needs to have 1) formats accurately
RJ> labelled for the next layer to dispatch and dissect, and 2) mechanisms
RJ> to test that the label is feasible for the data found.

Yes.
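
In the simplest case, "the label is feasible for the data found" just
means the bytes actually decode under the declared label. A toy Python
check (the function name is invented for illustration):

    def label_is_feasible(raw_bytes, declared_encoding):
        # Crude feasibility test: do the bytes decode at all under the label?
        try:
            raw_bytes.decode(declared_encoding)
            return True
        except (UnicodeDecodeError, LookupError):
            return False

    print(label_is_feasible(b"caf\xc3\xa9", "utf-8"))   # True
    print(label_is_feasible(b"caf\xe9", "utf-8"))       # False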

RJ> For example, in Internet protocols not only does a packet say "I am UDP" 
RJ> it also provides a checksum that can be used to verify. Above the XML 
RJ> level, not only does an element information set say "I am element x in 
RJ> namespace y" but we also can have a schema to validate it. For robustness, 
RJ> labelling needs to be paired with verification even if the verification is 
RJ> statistical or optional.

RJ> There are a handful of methods typically available for verification 
RJ> (error-detection): notably checksums, parsing and redundant codes.[1] 

RJ> XML 1.0 advanced textual formats by providing a workable labelling
RJ> mechanism for encoding. But we need a verification mechanism too:
RJ> as we go up the protocol stack, XML is somewhat of a weak link.

xml:md5 ?
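
Half joking, but a digest could indeed be carried and checked. A toy
Python sketch of the idea (nothing like this exists in any XML spec,
and the attribute name here is invented):

    import hashlib
    import xml.etree.ElementTree as ET

    def digest(text):
        return hashlib.md5(text.encode("utf-8")).hexdigest()

    payload = "caf\u00e9"
    sent = '<msg md5="%s">%s</msg>' % (digest(payload), payload)

    # ... transport, possibly through careless intermediaries ...

    received = ET.fromstring(sent)
    ok = received.get("md5") == digest(received.text or "")
    print("intact" if ok else "mangled in transit")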

RJ> For encoding error-detection, XML 1.1 takes one small step backwards 
RJ> (by opening up the characters used in names) but then takes a very large 
RJ> step forwards (by not allowing most C1 control characters directly). 
RJ> (The C1 controls are roughly U+0080-U+009F: reserving these is enough
RJ> to detect many common encoding errors, in particular mislabelling
RJ> character sets --such as Big 5 or Win 1252 "ANSI"-- as ISO 8859-1.)

I hear you. There was a talk at XML 2002 USA where some guy was
talking about compression for XML messages, giving entropy statistics
and so on to show how much redundancy had been removed (using XMill
with considerable schema-based fine tuning, based on knowledge of the
data types of attributes, to get the best compression). I pointed out
that this reduction in message size came at a cost: the ability to
detect, and correct for, errors, which might be useful over a noisy
communications link such as a battlefield.

RJ> It is not enough to huff and puff...oops...deprecate text/xml! In
RJ> concert with deprecation, XML needs to reserve enough redundant
RJ> Unicode code points in critical unused areas so that XML
RJ> processors can detect as many character-encoding-labelling errors
RJ> as they can. This is also true of application/xml*.

Detect, or correct?

RJ> I hope the TAG will encourage the XML Core WG to improve and not
RJ> dump the C1 restrictions proposed in XML 1.1.

I wasn't aware that it was in danger of being dumped.

It's abundantly clear that all versions of Unicode, from 1.0 through
the 4.0 beta, have said and continue to say that U+0080 to U+009F are
control codes, not printable characters (and further, they say which
control codes they are, and none of them has any business being in a
markup language).
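
A quick illustrative check (my own sketch, not from any spec) of why
reserving that range pays off: text that decodes cleanly as ISO 8859-1
but contains C1 code points was almost certainly something else, very
often Windows-1252, that got mislabelled.

    import unicodedata

    def looks_mislabelled_latin1(raw_bytes):
        text = raw_bytes.decode("iso-8859-1")  # never fails: every byte maps
        return any("\u0080" <= ch <= "\u009f" for ch in text)

    # Curly quotes in Windows-1252 are bytes 0x93/0x94, which land
    # squarely in the C1 range when read as ISO 8859-1.
    sample = "\u201cquoted\u201d".encode("windows-1252")
    print(looks_mislabelled_latin1(sample))     # True
    print(unicodedata.category("\x93"))         # 'Cc', i.e. a control code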

-- 
 Chris                            mailto:chris@w3.org

Received on Wednesday, 9 April 2003 16:11:45 UTC