RE: Can HTTP content-type charset disagree with its contents XML encoding?

Hi Kuro,

We can discuss it at our next meeting, although I would suggest that W3C-I18N Core TF would probably be a more appropriate venue.

Speaking strictly for myself, I agree that this is really and truly broken as a design.

In point of fact, I'm not sure that the HTTP header actually wins in practice, since in at least some cases the XML parser/processor receives the bytestream separately from the HTTP transfer mechanism. Mislabeled or unlabeled content that isn't converted from a bytestream to a character representation beforehand survives the transfer, and then the XML declaration probably "wins". Of course, the fact that the two can conflict at all is a problem.
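
To make the conflict concrete, here is a rough Python sketch of how a receiver might notice the disagreement before handing the bytes to a parser. The function names are my own invention for illustration, and the sketch assumes an ASCII-compatible encoding with at most a UTF-8 byte order mark in front of the XML declaration; it does not attempt full XML encoding autodetection.

```python
import codecs
import re
from email.message import Message

# Matches the encoding pseudo-attribute of an XML declaration at the start
# of the body. Assumption: the declaration itself is readable as ASCII.
_XML_DECL = re.compile(
    rb'^<\?xml[^>]*?encoding\s*=\s*["\']([A-Za-z][A-Za-z0-9._-]*)["\']'
)


def http_charset(content_type):
    """Return the charset parameter of a Content-Type value, or None."""
    msg = Message()
    msg["Content-Type"] = content_type
    return msg.get_content_charset()


def xml_declared_encoding(body):
    """Return the encoding named in the XML declaration, or None."""
    if body.startswith(codecs.BOM_UTF8):
        body = body[len(codecs.BOM_UTF8):]
    match = _XML_DECL.match(body)
    return match.group(1).decode("ascii").lower() if match else None


def _canonical(name):
    """Normalize aliases such as 'utf8' and 'UTF-8' to one spelling."""
    try:
        return codecs.lookup(name).name
    except LookupError:
        return name.lower()


def charsets_conflict(content_type, body):
    """True only when both labels are present and name different encodings."""
    from_header = http_charset(content_type)
    from_decl = xml_declared_encoding(body)
    if from_header is None or from_decl is None:
        return False
    return _canonical(from_header) != _canonical(from_decl)
```

A receiver that gets True back here is in exactly the situation Kuro describes: the two labels disagree, and which one "wins" depends on whether the HTTP headers are still available to the processor.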

In terms of Web services, though, this isn't generally a problem. The media type for a SOAP message is commonly 'application/soap+xml'. I quote from the SOAP 1.2 Primer:

When placing SOAP messages in HTTP bodies, the HTTP Content-type header must be chosen as "application/soap+xml". (The optional charset parameter, which can take the value of "utf-8" or "utf-16", is shown in this example, but if it is absent the character set rules for freestanding [XML 1.0] apply to the body of the HTTP request.)

Which is located here:

http://www.w3.org/TR/2003/PR-soap12-part0-20030507/#L26866
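
As a sketch of the rule in that quote, the following Python function picks the encoding for an 'application/soap+xml' body: the charset parameter if present, otherwise a simplified version of the freestanding XML rules (byte order mark, then encoding declaration, then the UTF-8 default). The function name is hypothetical, and the fallback is deliberately incomplete: it does not cover UTF-32 or EBCDIC, for example.

```python
import codecs
import re
from email.message import Message

# Encoding pseudo-attribute of an XML declaration, assuming the declaration
# is readable as ASCII (true for UTF-8 and most legacy single-byte encodings).
_XML_DECL = re.compile(
    rb'^<\?xml[^>]*?encoding\s*=\s*["\']([A-Za-z][A-Za-z0-9._-]*)["\']'
)


def soap_body_encoding(content_type, body):
    """Choose the encoding for a SOAP 1.2 message carried in an HTTP body."""
    msg = Message()
    msg["Content-Type"] = content_type
    charset = msg.get_content_charset()
    if charset:
        # The optional charset parameter is present, so use it.
        return charset

    # No charset parameter: fall back to (simplified) freestanding XML rules.
    if body.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        return "utf-16"
    if body.startswith(codecs.BOM_UTF8):
        body = body[len(codecs.BOM_UTF8):]
    match = _XML_DECL.match(body)
    if match:
        return match.group(1).decode("ascii").lower()
    # Neither a byte order mark nor an encoding declaration: XML assumes UTF-8.
    return "utf-8"
```

The point of the design is that the header and the XML declaration are never both consulted, so for this media type the two cannot silently override each other.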

Best Regards,

Addison

Addison P. Phillips
Director, Globalization Architecture
webMethods, Inc.

+1 408.962.5487 (phone)  +1 408.210.3569 (mobile)
-------------------------------------------------
Internationalization is an architecture.
It is not a feature. 

Chair, W3C-I18N-WG Web Services Task Force
To participate see http://www.w3.org/International/ws 

> -----Original Message-----
> From: public-i18n-ws-request@w3.org 
> [mailto:public-i18n-ws-request@w3.org]On Behalf Of Kurosaka, Teruhiko
> Sent: Wednesday, May 14, 2003 2:38 PM
> To: Public-I18n-Ws (E-mail)
> Subject: Re: Can HTTP content-type charset disagree with its 
> contents XML encoding?
> 
> 
> 
> A few weeks ago, I sent out a question to www-international@w3.org
> under this subject.  The question was, when sending out XML over HTTP,
> whether it is legal to put a different encoding in HTTP
> Content-Type; charset= than that in the encoding attribute of the
> XML declaration, and if so, which encoding should be applied in
> interpreting the XML packet.
> 
> To this posting, Chris Lilley <mailto:chris@w3.org> replied; I quote
> his reply at the bottom.
> He essentially says
> (1) Best practice is to use the media type application/xml without
>       a charset attribute.
> (2) Currently, having conflicting declarations is legal and the
>       charset declared in the HTTP header should be used.
> (3) He agrees this is a bad practice that should be prohibited.
> 
> I wonder if the members of the WS Task Force agree with this opinion,
> and if we need to take any further action.
> 
> 
> Quotes from Chris' reply follow:
> ----------------------------------------------------------------------
> KT> (1) Is this legal?
> 
> Unfortunately yes. It's a really bad idea, because the message
> immediately becomes not well formed as soon as the http headers go
> away.
> 
> KT> (2) If it is legal, which declaration is supposed to win? I.e. should
> KT> the contents be in UTF-8 encoding or Shift_JIS encoding in this
> KT> example?
> 
> The http headers win.
> ----------------------------------------------------------------------
> KT> Could you be so kind as to quote the relevant sections of the XML
> KT> and HTTP specs?
> KT> The XML spec does not seem to address this situation to me.
> 
> The XML spec defers to the mime registration for the XML media type.
> http://www.ietf.org/rfc/rfc3023.txt
> It is in that specification that the precedence is defined (and other
> unfortunate things, such as a mandatory default of US-ASCII when no
> charset is provided in the HTTP, regardless of what the XML encoding
> declaration says).
> 
> This is very bad. As a member of the TAG I find this very broken,
> architecturally speaking. Tim Bray agrees, and I have proposed wording
> in the architecture document that spells this out.
> 
> There are known problems with charset in the text/* media types, such
> as a mandatory fallback to text/plain;charset=us-ascii. The solution
> is to deprecate text/xml and have a charset-free application/xml,
> using the nicely defined xml mechanism to declare the encoding in all
> circumstances, rather than dragging the problems from text/* into the
> hitherto unaffected other media types.
> 
> KT> Anyway, shouldn't this practice be explicitly forbidden for any
> KT> types of contents (HTML, XML etc.) that have their own mechanism
> KT> of encoding identification?
> 
> Yes, of course it should. I am glad that you agree.
> ----------------------------------------------------------------------
> 
> ----
> T. "Kuro" Kurosaka, Internationalization Architect
> IONA Technologies, Santa Clara, CA USA / +1 408 350-9684 
> Visit i18n.iona.com for up-to-date i18n information. (IONA Internal)

Received on Wednesday, 14 May 2003 15:22:46 UTC