- From: Chris Lilley <chris@w3.org>
- Date: Tue, 29 Apr 2003 20:18:15 +0200
- To: "Kurosaka, Teruhiko" <Teruhiko.Kurosaka@iona.com>
- CC: "Www-International (E-mail)" <www-international@w3.org>
On Tuesday, April 29, 2003, 7:35:53 PM, Teruhiko wrote:

KT> Chris,
KT> Thank you for your reply.
KT> Could you be so kind to quote the relevant sections of XML and HTTP spec ?
KT> XML spec does not seem to address this situation to me.

The XML spec defers to the MIME registration for the XML media types:

http://www.ietf.org/rfc/rfc3023.txt

It is in that specification that the precedence is defined (and other
unfortunate things, such as a mandatory default of US-ASCII when no
charset is provided in the HTTP headers, regardless of what the XML
encoding declaration says). This is very bad.

As a member of the TAG I find this very broken, architecturally
speaking. Tim Bray agrees, and I have proposed wording in the
architecture document that spells this out.

There are known problems with charset in the text/* media types, such
as a mandatory fallback to text/plain;charset=us-ascii. The solution is
to deprecate text/xml and have a charset-free application/xml, using
the nicely defined XML mechanism to declare the encoding in all
circumstances, rather than dragging the problems from text/* into the
hitherto unaffected other media types.

KT> Anyway, shouldn't this practice be explicitly forbidden for any types of
KT> contents (HTML, XML etc.) that have their own mechanism of encoding
KT> identification?

Yes, of course it should. I am glad that you agree.

KT> -kuro

>> -----Original Message-----
>> From: Chris Lilley [mailto:chris@w3.org]
>>
>> KT> I came across an article that shows an example of a SOAP message
>> KT> in its List 1:
>> KT> http://www.atmarkit.co.jp/fxml/tanpatsu/21websvc/websvc02.html
>> KT> (The article is in Japanese but the List 1 contains only ASCII text
>> KT> except one line within <m:GoodsName> element.)
>>
>> KT> In this example, the HTTP level header says the contents is
>> KT> in UTF-8:
>>
>> KT> Content-Type: application/soap-xml; charset="utf-8"
>>
>> KT> But the XML document which is the contents of this HTTP request
>> KT> claims that the contents is in Shift_JIS as in:
>> KT> <?xml version="1.0" encoding="shift_jis"?>
>>
>> KT> I am puzzled. Does anyone know:
>>
>> KT> (1) Is this legal?
>>
>> Unfortunately yes. It's a really bad idea, because the message
>> immediately becomes not well formed as soon as the HTTP headers go
>> away.
>>
>> KT> (2) If it is legal, which declaration is supposed to win? I.e. should
>> KT> the contents be in UTF-8 encoding or Shift_JIS encoding in this
>> KT> example?
>>
>> The HTTP headers win.
>>
>> The only corner case where this sort of thing could be generated is
>> when an XML-unaware program has converted the encoding from one to
>> another, and somehow knows enough to convey this to the server in some
>> undocumented, server-defined way but does not know enough to convey it
>> to the XML processors in a documented, well-defined way by updating
>> the encoding declaration.
>>
>> And to support this use case, the HTTP headers are defined to override
>> the encoding declaration in the XML.
>>
>> Not so much of a problem with transient, over-the-wire information
>> such as SOAP messages, but much more of a problem for other, longer
>> lived XML information which is frequently processed on the server
>> side, from the local filestore, and also processed on the client
>> side, for example saved and looked at later. In both these situations
>> there is no HTTP header information and the self-describing nature of
>> XML is compromised: the XML is not well formed!
>>
>> Of course, the correct solution is to not put duplicate and
>> contradictory encoding information in the HTTP headers, but rather to
>> say that programs which make XML content not well formed are broken
>> and should be fixed.

--
 Chris                            mailto:chris@w3.org
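
For readers who want to see the precedence described above in concrete form, here is a minimal Python sketch. It is an illustration under stated assumptions, not a conforming RFC 3023 implementation: the function names `declared_encoding` and `effective_encoding` are hypothetical, the declaration sniffing only handles ASCII-compatible encodings, and any media type outside text/* is treated like application/xml.

```python
import re
from email.message import Message
from typing import Optional

# Matches an encoding declaration at the very start of an ASCII-compatible document.
_DECL = re.compile(rb'^<\?xml[^>]*?\bencoding\s*=\s*["\']([A-Za-z][A-Za-z0-9._-]*)["\']')


def declared_encoding(body: bytes) -> str:
    """Encoding according to the document itself: BOM, then declaration, else UTF-8."""
    if body.startswith(b"\xef\xbb\xbf"):
        return "utf-8"
    if body.startswith((b"\xff\xfe", b"\xfe\xff")):
        return "utf-16"
    match = _DECL.match(body)
    return match.group(1).decode("ascii").lower() if match else "utf-8"


def effective_encoding(content_type: Optional[str], body: bytes) -> str:
    """Encoding a receiver would use, given an optional Content-Type header value."""
    if content_type is None:
        # No HTTP headers (e.g. a file read from the local filestore):
        # only the XML's own mechanism is available.
        return declared_encoding(body)
    msg = Message()
    msg["Content-Type"] = content_type
    charset = msg.get_content_charset()
    if charset:
        # An explicit charset parameter overrides the XML declaration.
        return charset
    if msg.get_content_maintype() == "text":
        # text/* without a charset falls back to US-ASCII.
        return "us-ascii"
    # Assumption: other types without a charset defer to the document.
    return declared_encoding(body)
```

Applied to the SOAP example from the quoted message, the header value `application/soap-xml; charset="utf-8"` yields `utf-8`, while the same bytes read with no header at all yield `shift_jis`. That mismatch is exactly why the document stops being well formed once the HTTP headers go away.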
Received on Tuesday, 29 April 2003 14:18:35 UTC