RE: charset use in SOAP 1.2 from Addison Phillips [wM] on 2003-05-30 (public-i18n-ws@w3.org from May 2003)

From: Addison Phillips [wM] <aphillips@webmethods.com>
Date: Thu, 29 May 2003 17:25:36 -0700
To: "Kurosaka, Teruhiko" <Teruhiko.Kurosaka@iona.com>, "Public-I18n-Ws (E-mail)" <public-i18n-ws@w3.org>
Message-ID: <PNEHIBAMBMLHDMJDDFLHMEFIGLAA.aphillips@webmethods.com>
Hi Kuro,

First, the intent is that all SOAP messages should be encoded using UTF-8 or
UTF-16. That doesn't mean that they actually are or that utf-8/utf-16 are
the only valid values here. You *can* create SOAP messages in another
encoding.

The application/soap+xml media type is defined here:
http://www.w3.org/TR/2003/PR-soap12-part2-20030507/#ietf-draft

That definition in turn refers to RFC 3023:
http://www.ietf.org/rfc/rfc3023.txt  (see Page 9).

RFC 3023 says (and I quote):

<quot>
      Although listed as an optional parameter, the use of the charset
      parameter is STRONGLY RECOMMENDED, since this information can be
      used by XML processors to determine authoritatively the charset of
      the XML MIME entity.  The charset parameter can also be used to
      provide protocol-specific operations, such as charset-based
      content negotiation in HTTP.

      "utf-8" [RFC2279] and "utf-16" [RFC2781] are the recommended
      values, representing the UTF-8 and UTF-16 charsets, respectively.
      These charsets are preferred since they are supported by all
      conforming processors of [XML].

      If an application/xml entity is received where the charset
      parameter is omitted, no information is being provided about the
      charset by the MIME Content-Type header.  Conforming XML
      processors MUST follow the requirements in section 4.3.3 of [XML]
      that directly address this contingency.  However, MIME processors
      that are not XML processors SHOULD NOT assume a default charset if
      the charset parameter is omitted from an application/xml entity.
</quot>
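
To make the precedence in that passage concrete, here is a rough sketch in
Python of how a receiver might pick the charset: the Content-Type charset
parameter wins when present, otherwise it falls back to looking at the XML
entity itself. The function names are my own invention, and the fallback is
only a simplified subset of the autodetection described in [XML] (it checks
for a byte order mark and an ASCII-compatible encoding declaration, and
ignores cases such as UTF-16 without a BOM). Treat it as an illustration
under those assumptions, not as a conforming implementation.

```python
import re
from email.message import Message
from typing import Optional

# Matches encoding="..." inside a leading XML declaration.
# Assumption: the declaration is written in an ASCII-compatible encoding.
_XML_DECL_ENCODING = re.compile(
    rb'^<\?xml[^>]*?encoding\s*=\s*["\']([A-Za-z][A-Za-z0-9._-]*)["\']'
)


def charset_from_content_type(content_type: str) -> Optional[str]:
    """Return the lowercased charset parameter, or None if it is absent."""
    msg = Message()
    msg["Content-Type"] = content_type
    return msg.get_content_charset()


def sniff_xml_encoding(body: bytes) -> str:
    """Simplified fallback used when no charset parameter was supplied."""
    if body.startswith(b"\xef\xbb\xbf"):
        return "utf-8"
    if body.startswith((b"\xfe\xff", b"\xff\xfe")):
        return "utf-16"
    match = _XML_DECL_ENCODING.match(body)
    if match:
        return match.group(1).decode("ascii").lower()
    # No BOM and no encoding declaration: XML treats the entity as UTF-8.
    return "utf-8"


def effective_charset(content_type: str, body: bytes) -> str:
    """The charset parameter is authoritative; otherwise inspect the XML."""
    return charset_from_content_type(content_type) or sniff_xml_encoding(body)
```

For example, effective_charset("application/soap+xml; charset=utf-16", body)
returns "utf-16" even if the XML declaration inside body names a different
encoding, because the MIME parameter takes precedence.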

Generally, it is a bad idea to create SOAP documents in anything other than
a Unicode encoding (UTF-8 or UTF-16), because there is no guarantee that the
receiving SOAP Processor supports any other encoding.

The Content-Type application/soap+xml is NOT a text/* type for a reason.
For one thing, it gets you out of the US-ASCII default that applies to
text/* types.

A Content-Transfer-Encoding is generally applied to data external to the
actual content encoding and content processing. You are probably familiar
with MIME, which is used for transporting non-ASCII data. Generally speaking,
in both HTTP and SMTP, the actual SOAP document arrives as an attachment and
is MIME encoded. You don't have to deal with this, as it is part of the
transport stack.
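
As an illustration of the transport stack doing this work, here is a small
sketch using Python's standard email package. The addresses, the subject,
the n:name namespace and the envelope content are all made up for the
example; the point is only that the library applies the base64
Content-Transfer-Encoding and adds the header itself, so UTF-8 content can
cross a 7-bit channel without the application touching the bytes.

```python
from email.message import EmailMessage

# Hypothetical SOAP envelope with non-ASCII characters, encoded as UTF-8.
SOAP_BODY = (
    '<?xml version="1.0" encoding="utf-8"?>'
    '<env:Envelope xmlns:env="http://www.w3.org/2003/05/soap-envelope">'
    '<env:Body><n:name xmlns:n="http://example.org/names">'
    'Åke Jógvan Øyvind</n:name></env:Body></env:Envelope>'
).encode("utf-8")


def build_soap_mail(body: bytes) -> EmailMessage:
    """Wrap a SOAP document in a mail message with an explicit charset."""
    msg = EmailMessage()
    msg["From"] = "sender@example.org"      # placeholder address
    msg["To"] = "receiver@example.org"      # placeholder address
    msg["Subject"] = "SOAP request"
    # The library base64-encodes the payload and sets the
    # Content-Transfer-Encoding header; params adds charset to Content-Type.
    msg.set_content(
        body,
        maintype="application",
        subtype="soap+xml",
        cte="base64",
        params={"charset": "utf-8"},
    )
    return msg


message = build_soap_mail(SOAP_BODY)
wire_bytes = message.as_bytes()   # what would actually go over SMTP
```

On the receiving side, message.get_content() undoes the transfer encoding
and hands back the original UTF-8 bytes, which is the sense in which the
application does not have to deal with it.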

Best Regards,

Addison

> -----Original Message-----
> From: public-i18n-ws-request@w3.org
> [mailto:public-i18n-ws-request@w3.org]On Behalf Of Kurosaka, Teruhiko
> Sent: Thursday, May 29, 2003 4:44 PM
> To: Public-I18n-Ws (E-mail)
> Subject: charset use in SOAP 1.2
>
>
>
> I'm browsing the final spec of SOAP 1.2 Primer
> http://www.w3.org/TR/2003/PR-soap12-part0-20030507/
> and I have two questions.
>
> Near
> http://www.w3.org/TR/2003/PR-soap12-part0-20030507/#L26866
> 	(thank you, Addison, for the pointer)
> it reads:
>
>  > When placing SOAP messages in HTTP bodies, the HTTP Content-type header
>  > must be chosen as "application/soap+xml". (The optional charset parameter,
>  > which can take the value of "utf-8" or "utf-16", is shown in this example,
>  > but if it is absent the character set rules for freestanding [XML 1.0]
>  > apply to the body of the HTTP request.)
>
> The note in the paren seems to be saying that only UTF-8 and UTF-16
> are the valid encodings that can be used in SOAP (over HTTP).  Does anyone
> know if this is their intent, and if so, why?
>
>
> On the other hand, SOAP-over-SMTP examples
> http://www.w3.org/TR/2003/PR-soap12-part0-20030507/#Example14
> http://www.w3.org/TR/2003/PR-soap12-part0-20030507/#Example31
> 			(This is actually for Example 15.)
> do not use the charset attribute in the Content-Type header.  According to
> Section 5.2 of RFC 2045
> http://www.ietf.org/rfc/rfc2045.txt?number=2045
> lack of charset info means US-ASCII.  Perhaps one can argue that the
> US-ASCII default only applies to text/* content, and application/soap+xml
> has a different rule, which I may have overlooked.
> But I also notice that both examples lack a Content-Transfer-Encoding
> header.  Section 6.1 of RFC 2045 also mentions that lack of
> Content-Transfer-Encoding header means 7-bit channel.  Yet the body part
> in both examples contains non-ASCII characters in the <n:name> element,
> which probably requires 8bit channel or some sort of transfer encoding.
>
> Is this an oversight that we should point out, or am I misunderstanding
> something?
>
>
> T. "Kuro" Kurosaka
> Internationalization Architect
> teruhiko.kurosaka@iona.com
> -------------------------------------------------------
> IONA Technologies
> 2350 Mission College Blvd. Suite 650
> Santa Clara, CA 95054
> Tel: (408) 350 9684/9500
> Fax: (408) 350 9501
> -------------------------------------------------------
> Making Software Work Together TM
>
>
Received on Thursday, 29 May 2003 20:25:38 UTC