Re: Requesting a revision of RFC3023 from Addison Phillips [wM] on 2003-09-22 (www-tag@w3.org from September 2003)

From: Addison Phillips [wM] <aphillips@webmethods.com>
Date: Mon, 22 Sep 2003 16:50:02 -0400
To: "WWW-Tag" <www-tag@w3.org>, ietf-xml-mime@imc.org
Cc: "Addison Phillips" <aphillips@webmethods.com>
Message-Id: <4.2.0.58.J.20030922164833.03c388b0@localhost>

Hi Martin, et al,

Having tracked the thread down and read it, I feel I can contribute
something to it. This is a common and not very fun problem that our
implementations encounter frequently with XML documents transmitted to
our system by other products.

SOAP 1.2 recommends the use of application/soap+xml as the media type
(it is not strictly required, see section 7.1.4 of [SOAP12-PART2], but it is
pretty close to a requirement for HTTP). Noah is correct that charset is
optional. In the absence of charset, the application/xml and
application/*+xml types default to the encoding embedded in the XML document
itself, which I think is generally seen to be the desirable way to go.
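
To make the two defaults concrete, here is a minimal Python sketch of how a
receiver might pick the charset rule from the Content-Type header alone. The
function name and the "from-xml-content" marker are made up for illustration;
only the defaulting behaviour comes from RFC3023.

```python
from email.message import Message


def charset_rule(content_type: str) -> str:
    """Return the charset a receiver would apply for an XML media type.

    Hypothetical helper: returns the explicit charset parameter if present,
    "us-ascii" for text/* types (the RFC3023 default), and the marker
    "from-xml-content" for application/* types, meaning "use the BOM or
    encoding declaration inside the document".
    """
    msg = Message()
    msg["Content-Type"] = content_type
    charset = msg.get_param("charset")
    if isinstance(charset, str) and charset:
        return charset.lower()
    if msg.get_content_type().startswith("text/"):
        return "us-ascii"
    return "from-xml-content"


print(charset_rule("application/soap+xml"))
print(charset_rule("text/xml"))
print(charset_rule('text/xml; charset="UTF-8"'))
```

The point of the sketch is only that the same unlabeled bytes are treated
very differently depending on whether the top-level type is text or
application.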

Implementations of SOAP versions earlier than 1.2 use a variety of media
types, including text/xml, depending on the transport, etc.

The problem with changing RFC3023 is that there are a number of
implementations out there that adhere to the exact letter of the RFCs
involved (3023/2045/2046/etc.). I seem to recall that there are
implementations that require the charset parameter or that forcibly filter
the data to ASCII (converting every byte with the 8th bit set to the '?'
character). Consequently, many senders forcibly emit charset parameters
just to get the right results from these receivers.
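
As a rough illustration of the filtering behaviour described above (a sketch
of the effect, not any particular product's code; the function name is
invented), this is what forcing unlabeled UTF-8 content to ASCII does:

```python
def force_to_ascii(data: bytes) -> bytes:
    """Replace every byte that has the 8th bit set with '?' (0x3F)."""
    return bytes(b if b < 0x80 else 0x3F for b in data)


document = "<name>José</name>".encode("utf-8")
print(force_to_ascii(document))  # b'<name>Jos??</name>'
```

The single character "é" is two bytes in UTF-8, so it comes out as two
question marks and the original text cannot be recovered.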

Therefore, unless absolutely forbidden, implementations would still have to
support the use of charset with both media types. And I don't see how we can
forbid the use of the charset parameter, given the need for
interoperability with extant sensitive systems.

It would be nice if text/xml could be modified, since it is quite common to
get un-charset-labeled content that really is NOT US-ASCII. Since one can
always detect that a data stream is not US-ASCII, it has always seemed a bit
odd to me that the RFCs require the data to be destroyed when there is clear
evidence that one is losing something. I understand the reasoning, but I
think there is a difference between saying that omission of a charset
parameter invites data corruption (e.g. the MIME or XML processor is not
required to look at the XML content and thus MAY use US-ASCII to interpret
the data) and insisting on it (e.g. the MIME or XML processor is
required to interpret the data using 7-bit US-ASCII).
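
The "clear evidence" is cheap to obtain. A minimal sketch, with hypothetical
function names, of a receiver that notices non-US-ASCII bytes and refuses to
decode rather than silently destroying them:

```python
def looks_like_us_ascii(data: bytes) -> bool:
    """True if no byte has the 8th bit set."""
    return all(b < 0x80 for b in data)


def decode_unlabeled_text_xml(data: bytes) -> str:
    """Decode as US-ASCII, but raise instead of corrupting the content."""
    if not looks_like_us_ascii(data):
        raise ValueError("unlabeled text/xml entity contains non-US-ASCII bytes")
    return data.decode("ascii")


print(looks_like_us_ascii(b"<a/>"))
print(looks_like_us_ascii("<a>é</a>".encode("iso-8859-1")))
```

What the receiver does after detecting the problem (reject, sniff the XML,
or fall back) is exactly the policy question under discussion here.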

Perhaps we should focus on the semantics of charset not being present,
instead of focusing on forbidding/requiring charset itself. Consider this
paragraph of RFC3023:

<snip>
       Conformant with [RFC2046], if a text/xml entity is received with
       the charset parameter omitted, MIME processors and XML processors
       MUST use the default charset value of "us-ascii"[ASCII].  In cases
       where the XML MIME entity is transmitted via HTTP, the default
       charset value is still "us-ascii".
</snip>

This could be changed to something more friendly, like:

<snip>
       If a text/xml entity is received with
       the charset parameter omitted, MIME processors and XML processors
       MAY attempt to detect the charset from the XML content itself. Such
       detection MUST follow the requirements of section 4.3.3 of [XML].
       MIME and XML processors that do not attempt or are unable to detect
       the charset using this requirement MUST use US-ASCII (or UTF-8????)...
etc. and so forth...
</snip>

This allows receivers the leeway to detect errant senders (while leaving
errant senders of text/xml as non-conforming). This seems like a reasonable
compromise to me.
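
A minimal Python sketch of what a receiver following the suggested wording
might do. Everything here is an assumption for illustration: the function
names are invented, the sniffing covers only the UTF-8 and UTF-16 byte order
marks and an encoding declaration in an ASCII-compatible encoding (full
detection per section 4.3.3 and Appendix F of [XML] handles more cases), and
the fallback is a parameter because the draft text leaves US-ASCII versus
UTF-8 open.

```python
import codecs
import re

# Encoding declaration at the very start of an ASCII-compatible entity.
_ENCODING_DECL = re.compile(
    rb"""^<\?xml[^>]*?encoding\s*=\s*["']([A-Za-z][A-Za-z0-9._-]*)["']"""
)


def sniff_xml_charset(data: bytes):
    """Return a charset detected from the XML content, or None."""
    if data.startswith(codecs.BOM_UTF8):
        return "utf-8"
    if data.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        return "utf-16"
    match = _ENCODING_DECL.match(data)
    if match:
        return match.group(1).decode("ascii").lower()
    return None


def charset_for_text_xml(charset_param, data: bytes, fallback="us-ascii"):
    """Explicit charset wins; otherwise the receiver MAY sniff, else fall back."""
    if charset_param:
        return charset_param.lower()
    return sniff_xml_charset(data) or fallback


body = b'<?xml version="1.0" encoding="ISO-8859-1"?><a/>'
print(charset_for_text_xml(None, body))
```

A sender that omits charset on non-ASCII text/xml is still non-conforming
under this sketch; the receiver is merely allowed to recover.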

[SOAP12-PART2] http://www.w3.org/TR/2003/PR-soap12-part2-20030507

Just my two cents. Best Regards,

Addison

Addison P. Phillips
Director, Globalization Architecture
webMethods | Delivering Global Business Visibility

432 Lakeside Drive, Sunnyvale, CA, USA
+1 408.962.5487 (office)  +1 408.210.3569 (mobile)
mailto:aphillips@webmethods.com

Chair, W3C-I18N-WG, Web Services Task Force
http://www.w3.org/International/ws

Internationalization is an architecture.
It is not a feature.

 > -----Original Message-----
 > From: Martin Duerst [mailto:duerst@w3.org]
 > Sent: Friday, September 19, 2003 11:05 AM
 > To: Addison Phillips
 > Subject: Fwd: Re: Requesting a revision of RFC3023
 >
 >
 > Hello Addison,
 >
 > This is from two lists (ietf-xml-mime@imc.org, WWW-Tag <www-tag@w3.org>).
 > Re SOAP, I guess you might have some answer. If yes, can you send it to
 > those lists or to me for forwarding?
 >
 > Regards,    Martin.
 >
 >
 > >To: MURATA Makoto <murata@hokkaido.email.ne.jp>
 > >Cc: ietf-xml-mime@imc.org, WWW-Tag <www-tag@w3.org>
 > >Subject: Re: Requesting a revision of RFC3023
 > >From: noah_mendelsohn@us.ibm.com
 > >Date: Fri, 19 Sep 2003 11:04:11 -0400
 > >Sender: owner-ietf-xml-mime@mail.imc.org
 > >List-Archive: <http://www.imc.org/ietf-xml-mime/mail-archive/>
 > >List-ID: <ietf-xml-mime.imc.org>
 > >List-Unsubscribe: <mailto:ietf-xml-mime-request@imc.org?body=unsubscribe>
 >
 > >Murata Makoto writes:
 > >
 > > >> I believe that SOAP implementations use the
 > > >> charset parameter.  If we remove the charset
 > > >> parameter, we will make them non-conformant.
 > >
 > >This is not my area of expertise, but I note that the HTTP binding [1]
 > >provided by SOAP 1.2 Recommendation uses an application/soap+xml media
 > >type, a definition of which is at [2] (I believe it is working its way
 > >through the formal registration process.)  My reading is that the
 > >definition lists charset as optional, and makes clear that its proper use
 > >is to be found in RFC 3023.
 > >
 > >I am not aware of what typical implementations of SOAP 1.1 or SOAP 1.2
 > >are doing, but the 1.2 spec at least seems to list it as optional.
 > >Again, I'm not expert in this stuff and am not offering an opinion, but
 > >I thought the links might be helpful.
 > >
 > >[1] http://www.w3.org/TR/soap12-part2/#soapinhttp
 > >[2] http://www.w3.org/TR/soap12-part2/#ietf-draft
 > >
 > >------------------------------------------------------------------
 > >Noah Mendelsohn                              Voice: 1-617-693-4036
 > >IBM Corporation                                Fax: 1-617-693-8676
 > >One Rogers Street
 > >Cambridge, MA 02142
 > >------------------------------------------------------------------
 > >
 > >MURATA Makoto <murata@hokkaido.email.ne.jp>
 > >Sent by: www-tag-request@w3.org
 > >09/19/03 08:10 AM
 > >
 > >
 > >         To:     ietf-xml-mime@imc.org, WWW-Tag <www-tag@w3.org>
 > >         cc:     (bcc: Noah Mendelsohn/Cambridge/IBM)
 > >         Subject:        Re: Requesting a revision of RFC3023
 > >
 > >
 > >
 > >
 > >On Fri, 19 Sep 2003 03:50:11 +0200
 > >Bjoern Hoehrmann <derhoermi@gmx.net> wrote:
 > >
 > > > You want to change something that has been STRONGLY RECOMMENDED for
 > > > over five years to (ideally) MUST NOT just because it could cause
 > > > trouble when used improperly or with broken implementations. Today I
 > > > am good with web standards if I use the charset parameter, tomorrow I
 > > > am bad with web standards if I do. What's next on #W3C? Use tables for
 > > > layout because people could get CSS wrong and old browsers get some
 > > > CSS wrong? I don't think this leads anywhere.
 > >
 > >I believe that SOAP implementations use the charset parameter.  If we
 > >remove the charset parameter, we will make them non-conformant.
 > >
 > >Cheers,
 > >
 > >--
 > >MURATA Makoto <murata@hokkaido.email.ne.jp>
 > >
 > >
Received on Monday, 22 September 2003 16:52:57 UTC