Re: 9 July 2003 draft of "Client handling of MIME headers" available

The subject under debate (I think; maybe I'm wrong & that's why we're 
having trouble syncing up) is what we ought to say about the use of the 
charset parameter accompanying data served as XML, by which we mean 
*/xml or */*+xml.

Roy T. Fielding wrote:

> I have more problems with "reasonably modern" browsers than the ones
> that simply follow the standards.  XML doesn't define how applications
> are expected to process the content within elements and attributes.
> XML does not prevent someone from implementing client-side scripting
> within XML elements.  Therefore, the only way that XML can enable
> auto-selection of character encodings without opening a security hole
> is by requiring that they always be processed in the same way by both
> generators and recipients of messages.  You'll have to make that process
> a normative requirement.

I believe that XML does in fact ensure that they are processed in the 
same way.  Check out the coverage of character encodings in 
http://www.w3.org/TR/REC-xml, in particular 4.3.3 (#charencoding) and 
the #sec-guessing appendix.  It is possible, with some effort, to "fool" 
an XML processor with a bogus encoding declaration, but (unless I'm 
missing something) not for anything but ASCII characters.
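For concreteness, the #sec-guessing appendix boils down to sniffing the first few bytes of the entity: byte-order marks first, then the byte pattern of a leading "<?xml" declaration. Here's a minimal sketch of that scheme (the function name is mine, and this handles only the common UTF-8/UTF-16/EBCDIC cases, not every variant the appendix lists):

```python
import re


def detect_xml_encoding(data: bytes) -> str:
    """Guess the encoding of an XML document from its first bytes,
    roughly following Appendix F of the XML 1.0 Recommendation."""
    # A byte-order mark is decisive.
    if data.startswith(b"\xef\xbb\xbf"):
        return "utf-8"
    if data.startswith(b"\xfe\xff"):
        return "utf-16be"
    if data.startswith(b"\xff\xfe"):
        return "utf-16le"
    # No BOM: look at the byte pattern of a leading "<?xml".
    if data.startswith(b"\x00\x3c\x00\x3f"):
        return "utf-16be"
    if data.startswith(b"\x3c\x00\x3f\x00"):
        return "utf-16le"
    if data.startswith(b"\x3c\x3f\x78\x6d"):
        # ASCII-compatible family: now read the encoding declaration
        # itself, if any, to pin down the exact encoding.
        header = data[:100].decode("ascii", errors="replace")
        m = re.search(r'encoding=["\']([A-Za-z0-9._-]+)["\']', header)
        return m.group(1).lower() if m else "utf-8"
    if data.startswith(b"\x4c\x6f\xa7\x94"):
        return "ebcdic-family"
    # Spec default when nothing else applies.
    return "utf-8"
```

Note that in the ASCII-compatible case the sniffing only narrows things down to a family; the encoding declaration inside the document does the rest. That is why a bogus declaration can only "fool" the processor within that family.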

So at the moment, I can't visualize a security vulnerability that would 
occur as a result of charset settings <emph>in the case where data is 
served as XML and given to an XML processor</emph>.  Of course, I 
wouldn't be that surprised if someone could dream up a counter-example. 
But I couldn't, and I tried.

> 
>  In short, the
> exceptions listed in that section are neither needed nor desirable:
> the media type is authoritative and that's all there is to it. 

I agree on authoritativeness, but not with the rest of the sentence.  If 
the XML processor's auto-detection of the character encoding disagrees with 
the media type metadata, then that is an error condition and the agent 
MUST report an error.  Since an XML processor's autodetection of 
encoding is infinitely more likely to be correct than, for example, an 
Apache server's guess based on file extension, local policies, and so 
on, the best solution is to do as we say and *not* provide the charset 
parameter <emph>for data served as XML</emph>, unless the server is 
*really sure* it knows what the encoding is, and to recognize that in this 
case the information is purely redundant; it may be of interest to 
intermediate entities such as caches and proxies (although it's not 
obvious to me how) but it can never be of positive utility to the 
receiving agent, if the receiving agent is an XML processor.
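The receiving agent's side of that rule is a one-step check: compare the charset parameter (if any) against what the processor auto-detected, and treat disagreement as the error condition described above. A minimal sketch, with names of my own choosing (charset names are compared case-insensitively, since they are defined to be case-insensitive):

```python
def check_charset_consistency(http_charset, detected_encoding):
    """Compare the charset parameter from the Content-Type header,
    if one was sent, against the encoding the XML processor
    auto-detected; raise on disagreement, per the argument above."""
    if http_charset is None:
        # No parameter sent: the processor's detection stands alone.
        return
    if http_charset.lower() != detected_encoding.lower():
        raise ValueError(
            f"charset parameter {http_charset!r} disagrees with "
            f"auto-detected encoding {detected_encoding!r}")
```

When the names agree, the parameter carried no information the processor didn't already have, which is the "purely redundant" point.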

So why can't we say that?

> If you
> don't want to allow servers the freedom to be efficient, then do not
> allow the charset parameter on application/*xml.

I'd go for that, and extend the ban to application/*+xml, but I see no 
reason why this would decrease efficiency.

-- 
Cheers, Tim Bray
         (ongoing fragmented essay: http://www.tbray.org/ongoing/)

Received on Thursday, 10 July 2003 12:40:14 UTC