Re: Can HTTP content-type charset disagree with its contents XML encoding?

A few weeks ago, I sent out a question to www-international@w3.org
under this subject.  The question was, when sending out XML over HTTP,
whether it is legal to put a different encoding in HTTP Content-Type; charset=
thatn that in the encoding attribute of the XML declaration, and if so, which
encoding should be applied in interpreting the XML packet.

To this posting, Chris Lilley <mailto: chris@w3.org>replied, which I quote in the bottom.
He essentially says
(1) Best practice is to use the media type application/xml without charset attribute
(2) Currently, having conflicting declarations is legal and the charset declared in HTTP 
      header should be used.
(3) He agrees this is a bad practice should be prohibited.

I wonder if the member of WS Task Force agree with this opinion,
and if we need to take any further action.


Quotes from Chris' reply follow:
----------------------------------------------------------------------
KT> (1) Is this legal?

Unfortunately yes. Its a really bad idea, because the message
immediately becomes not well formed as soon as the http headers go
away.

KT> (2) If it is legal, which declaration is supposed to wins? I.e. should
KT> the contents be in UTF-8 encoding or Shift_JIS encoding in this
KT> example?

The http headers win.
----------------------------------------------------------------------
KT> Could you be so kind to quote the relevant sections of XML and HTTP spec ?
KT> XML spec does not seem to address this situation to me.

The XML spec defers to the mime registration for the XML media type.
http://www.ietf.org/rfc/rfc3023.txt
It is in that specification that the precedence is defined (and other
unfortunate things, such as a mandatory default of US-ASCII when no
charset is provided in the HTTP, regardless of what the XML encoding
declaration says).

This is very bad. As a member of the TAG I find this very broken,
architecturally speaking. Tim Bray agrees, and I have proposed wording
in the architecture document that spells this out.

There are know problems with charset in the text/* media types, such
as a mandatory fallback to text/plain;charset=us-ascii. The solution
is to deprecate text/xml and have a charset-free application/xml,
using the nicely defined xml mechanism to declare the encoding in all
circumstances, rather than dragging the problems from text/* into the
hitherto unaffected other media types.

KT> Anyway, shouldn't this practice be explicitly forbidden for any types of
KT> contents (HTML, XML etc.) that have their own mechanism of encoding 
KT> identification?

Yes, of course it should. I am glad that you agree
----------------------------------------------------------------------

----
T. "Kuro" Kurosaka, Internationalization Architect
IONA Technologies, Santa Clara, CA USA / +1 408 350-9684 
Visit i18n.iona.com for up-to-date i18n information. (IONA Internal)

Received on Wednesday, 14 May 2003 14:38:13 UTC