Re: Can HTTP content-type charset disagree with its contents XML encoding? from Chris Lilley on 2003-04-29 (www-international@w3.org from April to June 2003)

From: Chris Lilley <chris@w3.org>
Date: Tue, 29 Apr 2003 20:18:15 +0200
To: "Kurosaka, Teruhiko" <Teruhiko.Kurosaka@iona.com>
CC: "Www-International (E-mail)" <www-international@w3.org>
Message-ID: <637641046.20030429201815@w3.org>

On Tuesday, April 29, 2003, 7:35:53 PM, Teruhiko wrote:

KT> Chris,
KT> Thank you for your reply.

KT> Could you be so kind as to quote the relevant sections of the XML and HTTP specs?
KT> The XML spec does not seem to me to address this situation.

The XML spec defers to the MIME registration for the XML media types:
http://www.ietf.org/rfc/rfc3023.txt
It is in that specification that the precedence is defined (and other
unfortunate things, such as a mandatory default of US-ASCII for
text/xml when no charset is provided in the HTTP headers, regardless
of what the XML encoding declaration says).
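
To make the precedence concrete, here is a minimal Python sketch of how
a receiver might pick the encoding under the RFC 3023 rules. The
function names are hypothetical, and it makes simplifying assumptions:
only text/xml and text/xml-external-parsed-entity get the US-ASCII
default, every other type is treated like application/xml, and BOM and
UTF-16 detection are left out.

```python
import re
from email.message import Message

# Matches the encoding pseudo-attribute of an XML declaration.
# Assumption: the document starts in an ASCII-compatible encoding.
_XML_DECL_ENCODING = re.compile(
    rb'<\?xml\s[^>]*?encoding\s*=\s*["\']([A-Za-z][A-Za-z0-9._-]*)["\']'
)

# Types that get the mandatory US-ASCII default (simplified list).
_TEXT_XML_TYPES = {"text/xml", "text/xml-external-parsed-entity"}


def declared_xml_encoding(body: bytes):
    """Return the encoding named in the XML declaration, or None."""
    match = _XML_DECL_ENCODING.match(body)
    return match.group(1).decode("ascii").lower() if match else None


def effective_encoding(content_type: str, body: bytes) -> str:
    """Pick the encoding a receiver would use (hypothetical helper)."""
    msg = Message()
    msg["Content-Type"] = content_type
    charset = msg.get_content_charset()  # lower-cased, or None if absent

    if charset:
        # A charset parameter overrides the XML encoding declaration.
        return charset
    if msg.get_content_type() in _TEXT_XML_TYPES:
        # No charset on text/xml: US-ASCII, whatever the XML says.
        return "us-ascii"
    # No charset on application/xml: fall back to the XML declaration,
    # and to UTF-8 when there is none (BOM sniffing omitted here).
    return declared_xml_encoding(body) or "utf-8"
```

With the SOAP example from your earlier message, the charset="utf-8"
parameter wins over encoding="shift_jis"; the same body served as
text/xml with no charset would be treated as US-ASCII.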

This is very bad. As a member of the TAG I find this very broken,
architecturally speaking. Tim Bray agrees, and I have proposed wording
in the architecture document that spells this out.

There are known problems with charset in the text/* media types, such
as a mandatory fallback to text/plain;charset=us-ascii. The solution
is to deprecate text/xml and have a charset-free application/xml,
using the nicely defined XML mechanism to declare the encoding in all
circumstances, rather than dragging the problems from text/* into the
hitherto unaffected other media types.
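
As a sketch of what the charset-free approach might look like in
practice (Python, with hypothetical helper names; the regular
expression assumes an ASCII-compatible encoding and ignores BOMs), the
sender states the encoding only in the XML declaration, and the
receiver can recover it with no HTTP headers at all:

```python
import re

# Matches the encoding pseudo-attribute of an XML declaration.
_XML_DECL_ENCODING = re.compile(
    rb'<\?xml\s[^>]*?encoding\s*=\s*["\']([A-Za-z][A-Za-z0-9._-]*)["\']'
)


def build_xml_payload(root_markup: str, encoding: str = "utf-8"):
    """Return (headers, body) with the encoding stated once, in the XML."""
    declaration = f'<?xml version="1.0" encoding="{encoding}"?>\n'
    body = (declaration + root_markup).encode(encoding)
    # Deliberately no charset parameter: nothing to contradict the body.
    headers = {"Content-Type": "application/xml"}
    return headers, body


def decode_without_headers(body: bytes) -> str:
    """Decode using only the document itself, e.g. after saving to disk."""
    match = _XML_DECL_ENCODING.match(body)
    encoding = match.group(1).decode("ascii") if match else "utf-8"
    return body.decode(encoding)


headers, body = build_xml_payload(
    "<GoodsName>\u65e5\u672c\u8a9e</GoodsName>", "shift_jis"
)
text = decode_without_headers(body)
```

Because the bytes and the declaration are produced together, the
document stays self-describing when it is stored or processed away
from the HTTP exchange.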

KT> Anyway, shouldn't this practice be explicitly forbidden for any type of
KT> content (HTML, XML etc.) that has its own mechanism of encoding
KT> identification?

Yes, of course it should. I am glad that you agree.

KT> -kuro

>> -----Original Message-----
>> From: Chris Lilley [mailto:chris@w3.org]
>> KT> I came across an article that shows an example of a SOAP message
>> KT> in its List 1:
>> KT> http://www.atmarkit.co.jp/fxml/tanpatsu/21websvc/websvc02.html
>> KT> (The article is in Japanese but the List 1 contains only ASCII text
>> KT> except one line within <m:GoodsName> element.)
>> 
>> KT> In this example, the HTTP level header says the contents is
>> KT> in UTF-8:
>> 
>> KT> Content-Type: application/soap-xml; charset="utf-8"
>> 
>> KT> But the XML document which is the contents of this HTTP request
>> KT> claims that the contents is in Shift_JIS as in:
>> KT>  <?xml version="1.0" encoding="shift_jis"?>
>> 
>> KT> I am puzzled.  Does anyone know:
>> 
>> KT> (1) Is this legal?
>> 
>> Unfortunately yes. It's a really bad idea, because the message
>> immediately becomes not well formed as soon as the HTTP headers go
>> away.
>> 
>> KT> (2) If it is legal, which declaration is supposed to win? I.e. should
>> KT> the contents be in UTF-8 encoding or Shift_JIS encoding in this
>> KT> example?
>> 
>> The HTTP headers win.
>> 
>> The only corner case where this sort of thing could be generated is
>> when an xml-unaware program has converted the encoding from one to
>> another, and somehow knows enough to convey this to the server in some
>> undocumented, server defined way but does not know enough to convey it
>> to the xml processors in a documented, well defined way by updating
>> the encoding declaration.
>> 
>> And to support this use case, the HTTP headers are defined to override
>> the encoding declaration in the XML.
>> 
>> Not so much of a problem with transient, over the wire information
>> such as SOAP messages, but much more of a problem for other, longer
>> lived xml information which is frequently processed on the server
>> side, from the local filestore, and also processed on the client
>> side, for example saved and looked at later. In both these situations
>> there is no http header information and the self-describing nature of
>> XML is compromised - the XML is not well formed!
>> 
>> Of course, the correct solution is to not put duplicate and
>> contradictory encoding information in the HTTP headers, but rather to
>> say that programs which make XML content not well formed are broken
>> and should be fixed.

-- 
 Chris                            mailto:chris@w3.org

Received on Tuesday, 29 April 2003 14:18:35 UTC