
Re: Can HTTP content-type charset disagree with its contents XML encoding?

From: Chris Lilley <chris@w3.org>
Date: Tue, 29 Apr 2003 20:18:15 +0200
Message-ID: <637641046.20030429201815@w3.org>
To: "Kurosaka, Teruhiko" <Teruhiko.Kurosaka@iona.com>
CC: "Www-International (E-mail)" <www-international@w3.org>

On Tuesday, April 29, 2003, 7:35:53 PM, Teruhiko wrote:

KT> Chris,
KT> Thank you for your reply.

KT> Could you be so kind as to quote the relevant sections of the XML and
KT> HTTP specs? The XML spec does not seem to me to address this situation.

The XML spec defers to the MIME registration for the XML media type.
It is in that specification that the precedence is defined (and other
unfortunate things, such as a mandatory default of US-ASCII when no
charset is provided in the HTTP headers, regardless of what the XML
encoding declaration says).
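
As a rough illustration of that precedence, here is a minimal Python
sketch. It is only my reading of the rules described above, not the
normative algorithm: the function name effective_encoding is made up
for this example, the US-ASCII default is applied to every text/* type
as an assumption, byte order marks are ignored, and the final UTF-8
fallback is the XML default for a document with no declaration.

```python
import re
from email.message import Message

# Matches the encoding pseudo-attribute of an XML declaration at the very
# start of the document, e.g. <?xml version="1.0" encoding="shift_jis"?>
_XML_DECL = re.compile(
    rb'^<\?xml[^>]*?encoding\s*=\s*["\']([A-Za-z][A-Za-z0-9._-]*)["\']'
)


def effective_encoding(content_type: str, body: bytes) -> str:
    """Hypothetical helper: which encoding a receiver would assume."""
    msg = Message()
    msg["Content-Type"] = content_type

    # 1. An explicit charset parameter in the HTTP header wins.
    charset = msg.get_param("charset")
    if isinstance(charset, str) and charset:
        return charset.lower()

    # 2. No charset on a text/* type: US-ASCII, whatever the XML says
    #    (assumption: applied here to all text/* types).
    if msg.get_content_type().startswith("text/"):
        return "us-ascii"

    # 3. Otherwise fall back to the XML encoding declaration, if any.
    match = _XML_DECL.match(body)
    if match:
        return match.group(1).decode("ascii").lower()

    # 4. No declaration (and no BOM handling in this sketch): UTF-8.
    return "utf-8"


body = b'<?xml version="1.0" encoding="shift_jis"?><doc/>'
print(effective_encoding('application/soap-xml; charset="utf-8"', body))
print(effective_encoding("text/xml", body))
print(effective_encoding("application/xml", body))
```

With the header from the SOAP example quoted below, the first call
reports utf-8 even though the document itself declares shift_jis.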

This is very bad. As a member of the TAG I find this very broken,
architecturally speaking. Tim Bray agrees, and I have proposed wording
in the architecture document that spells this out.

There are known problems with charset in the text/* media types, such
as a mandatory fallback to text/plain;charset=us-ascii. The solution
is to deprecate text/xml and have a charset-free application/xml,
using the nicely defined xml mechanism to declare the encoding in all
circumstances, rather than dragging the problems from text/* into the
hitherto unaffected other media types.

KT> Anyway, shouldn't this practice be explicitly forbidden for any type of
KT> content (HTML, XML etc.) that has its own mechanism of encoding
KT> identification?

Yes, of course it should. I am glad that you agree.

KT> -kuro

>> -----Original Message-----
>> From: Chris Lilley [mailto:chris@w3.org]
>> KT> I came across an article that shows an example of a SOAP message
>> KT> in its List 1:
>> KT> http://www.atmarkit.co.jp/fxml/tanpatsu/21websvc/websvc02.html
>> KT> (The article is in Japanese but the List 1 contains only ASCII text
>> KT> except one line within <m:GoodsName> element.)
>> KT> In this example, the HTTP level header says the content is
>> KT> in UTF-8:
>> KT> Content-Type: application/soap-xml; charset="utf-8"
>> KT> But the XML document which is the content of this HTTP request
>> KT> claims that the content is in Shift_JIS, as in:
>> KT>  <?xml version="1.0" encoding="shift_jis"?>
>> KT> I am puzzled.  Does anyone know:
>> KT> (1) Is this legal?
>> Unfortunately yes. It's a really bad idea, because the message
>> immediately becomes not well formed as soon as the http headers go
>> away.
>> KT> (2) If it is legal, which declaration is supposed to win? I.e. should
>> KT> the contents be in UTF-8 encoding or Shift_JIS encoding in this
>> KT> example?
>> The http headers win.
>> The only corner case where this sort of thing could be generated is
>> when an xml-unaware program has converted the encoding from one to
>> another, and somehow knows enough to convey this to the server in some
>> undocumented, server defined way but does not know enough to convey it
>> to the xml processors in a documented, well defined way by updating
>> the encoding declaration.
>> And to support this use case, the HTTP headers are defined to override
>> the encoding declaration in the XML.
>> Not so much of a problem with transient, over the wire information
>> such as SOAP messages, but much more of a problem for other, longer
>> lived xml information which is frequently processed on the server
>> side, from the local filestore, and also processed on the client
>> side, for example saved and looked at later. In both these situations
>> there is no http header information and the self-describing nature of
>> XML is compromised - the XML is not well formed!
>> Of course, the correct solution is to not put duplicate and
>> contradictory encoding information in the http headers, but rather to
>> say that programs which make xml content not well formed are broken and
>> should be fixed.
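
The "headers go away" failure described in the quoted message can be
sketched in a few lines of Python. This is an illustration only: the
helper name survives_without_headers and the sample document are
hypothetical, and it checks nothing more than whether the bytes still
decode to the intended text when the XML declaration is the only
encoding information left.

```python
def survives_without_headers(raw: bytes, declared: str, intended: str) -> bool:
    """Hypothetical check: decode using only the declared encoding."""
    try:
        return raw.decode(declared) == intended
    except UnicodeDecodeError:
        return False


# Sample document (made up): the declaration says shift_jis.
text = '<?xml version="1.0" encoding="shift_jis"?><GoodsName>商品</GoodsName>'

# On the wire the bytes are really UTF-8, as the HTTP header claimed.
wire = text.encode("utf-8")

# With the header still available, the receiver decodes as UTF-8.
print(survives_without_headers(wire, "utf-8", text))

# Saved to disk, only the declaration remains, and it is wrong.
print(survives_without_headers(wire, "shift_jis", text))

# A consistent file, where bytes and declaration agree, is unaffected.
print(survives_without_headers(text.encode("shift_jis"), "shift_jis", text))
```

Only the middle case fails, and it is exactly the case where the
header and the encoding declaration contradicted each other.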

 Chris                            mailto:chris@w3.org
Received on Tuesday, 29 April 2003 14:18:35 UTC
