Re: Can HTTP content-type charset disagree with its contents XML encoding?

On Tuesday, April 29, 2003, 3:24:57 AM, Teruhiko wrote:


KT> I came across an article that shows an example of a SOAP message
KT> in its List 1:
KT> http://www.atmarkit.co.jp/fxml/tanpatsu/21websvc/websvc02.html
KT> (The article is in Japanese but the List 1 contains only ASCII text
KT> except one line within <m:GoodsName> element.)

KT> In this example, the HTTP level header says the contents is
KT> in UTF-8:

KT> Content-Type: application/soap-xml; charset="utf-8"

KT> But the XML document which is the contents of this HTTP request
KT> claims that the contents is in Shift_JIS as in:
KT>  <?xml version="1.0" encoding="shift_jis"?>

KT> I am puzzled.  Does anyone know:

KT> (1) Is this legal?

Unfortunately, yes. It's a really bad idea, because the message
immediately stops being well-formed as soon as the HTTP headers go
away.

KT> (2) If it is legal, which declaration is supposed to win? I.e. should
KT> the contents be in UTF-8 encoding or Shift_JIS encoding in this
KT> example?

The HTTP headers win.

The only corner case where this sort of thing could be generated is
when an XML-unaware program has converted the document from one
encoding to another, and somehow knows enough to convey this to the
server in some undocumented, server-defined way, but does not know
enough to convey it to XML processors in the documented, well-defined
way: by updating the encoding declaration.

And to support this use case, the HTTP headers are defined to override
the encoding declaration in the XML.
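
A rough sketch of that precedence rule, in Python. The function names
are hypothetical, the regular expression only handles ASCII-compatible
encodings, and the final fallback to UTF-8 is a simplification that
ignores byte-order-mark detection:

```python
import re
from email.message import Message

# Matches the encoding pseudo-attribute of an XML declaration at the very
# start of the body. Assumes an ASCII-compatible encoding.
_XML_DECL = re.compile(
    rb'^<\?xml[^>]*?\bencoding\s*=\s*["\']([A-Za-z][A-Za-z0-9._-]*)["\']'
)

def charset_from_header(content_type):
    """Return the charset parameter of a Content-Type value, or None."""
    msg = Message()
    msg["Content-Type"] = content_type
    return msg.get_content_charset()

def charset_from_declaration(body):
    """Return the encoding named in the XML declaration, or None."""
    m = _XML_DECL.match(body)
    return m.group(1).decode("ascii").lower() if m else None

def effective_encoding(content_type, body):
    """The HTTP charset wins; the declaration is only a fallback."""
    return (charset_from_header(content_type)
            or charset_from_declaration(body)
            or "utf-8")

body = b'<?xml version="1.0" encoding="shift_jis"?><a/>'
print(effective_encoding('application/soap-xml; charset="utf-8"', body))  # utf-8
print(effective_encoding('application/soap-xml', body))  # shift_jis
```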

This is not so much of a problem for transient, over-the-wire
information such as SOAP messages, but it is much more of a problem
for other, longer-lived XML information, which is frequently processed
on the server side from the local filestore, and also processed on the
client side, for example when saved and looked at later. In both these
situations there is no HTTP header information, and the
self-describing nature of XML is compromised: the XML is not
well-formed!
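
To illustrate the saved-file case, here is a small Python sketch. The
element content and helper names are made up for the example, and plain
byte decoding stands in for a real XML parser (which would typically
report an error or produce wrong characters):

```python
# Hypothetical element content: two Japanese characters.
GOODS = "\u5546\u54c1"

# The bytes on the wire are UTF-8, but the declaration claims Shift_JIS.
wire_bytes = (
    '<?xml version="1.0" encoding="shift_jis"?>'
    f"<GoodsName>{GOODS}</GoodsName>"
).encode("utf-8")

def read_with_header(data, charset):
    """Over HTTP: the charset from the Content-Type header is used."""
    return data.decode(charset)

def read_from_disk(data):
    """After saving: only the (wrong) encoding declaration is left."""
    return data.decode("shift_jis", errors="replace")

# With the header the text survives; without it, it does not.
print(GOODS in read_with_header(wire_bytes, "utf-8"))  # True
print(GOODS in read_from_disk(wire_bytes))             # False
```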

Of course, the correct solution is to not put duplicate and
contradictory encoding information in the HTTP headers, but rather to
say that programs which make XML content not well-formed are broken
and should be fixed.


KT> T. "Kuro" Kurosaka
KT> Internationalization Architect
KT> teruhiko.kurosaka@iona.com
KT> -------------------------------------------------------
KT> IONA Technologies
KT> 2350 Mission College Blvd. Suite 650
KT> Santa Clara, CA 95054
KT> Tel: (408) 350 9684/9500 
KT> Fax: (408) 350 9501
KT> -------------------------------------------------------
KT> Making Software Work Together TM



-- 
 Chris                            mailto:chris@w3.org

Received on Tuesday, 29 April 2003 07:48:08 UTC