RE: Requesting a revision of RFC3023 from Addison Phillips [wM] on 2003-09-24 (www-tag@w3.org from September 2003)

From: Addison Phillips [wM] <aphillips@webmethods.com>
Date: Wed, 24 Sep 2003 13:20:38 -0700
To: "MURATA Makoto" <murata@hokkaido.email.ne.jp>
Cc: "WWW-Tag" <www-tag@w3.org>, <ietf-xml-mime@imc.org>
Message-ID: <PNEHIBAMBMLHDMJDDFLHMEHOHAAA.aphillips@webmethods.com>
Hi Makoto,

Thanks for your response! I'm happy to contribute as best I can.
>
> This issue was discussed when RFC 2376 was developed.  I recall
> that my then-co-author (E. Whitehead) proposed exactly the same
> thing, but
> that proposal was not agreed.

I think it will keep coming back until the policy is changed.
Too many XML implementers are frustrated by it.
>
> In my understanding, MIME people in IETF would like to keep the charset
> parameter of text/* authoritative, since a number of mail
> programs rely only on
> the charset parameter for text/*.

I don't think it is a problem for the charset parameter *when it is present*
to be authoritative. I think the real problem here is that it is also
considered to be authoritative when it is NOT present--overriding other
potential means for identifying content encoding that are available. The
lack of information should not take precedence over positive information in
the content when the implementation can take advantage of said information.

That doesn't mean that implementations can't have an outlet for choosing
not to inspect the content, only that implementations that are able to
interpret the content should be allowed to do so.
>
> However, as people correctly pointed out, omission of the charset
> parameter of text/xml is typically caused by the fact that authors
> cannot change the configuration of WWW servers.  For this reason, W3C
> recommendations for CSS and HTML say something similar to your
> suggestion,
> but the IETF RFC for CSS does not say anything about the default.

Then each standard or RFC should be modified over time to include such
information, either directly or by reference to parent standards. For
example, SOAP does not include any statement whatsoever about the charset to
use in a SOAP message. XML 1.0 and the RFC for the media type contain that
information (and in the case of the media type, IIRC that is by reference to
RFC 3023). But certainly Ur-standards, like CSS, XML, HTML, and so on should
contain a default (and not leave unannounced values open to interpretation).

> The IETF RFC 2854 for HTML says the default is US-ASCII (MIME) or
> 8859-1 (HTTP), but also says that "the actual default is either a
> corporate character encoding or character encodings widely deployed in a
> certain national or regional community"

That last statement is quite sloppy, in my opinion. It basically says: the
default is ASCII/8859-1, except when it isn't (i.e. you cannot rely on there
being a default). This amounts to saying that the content should be
inspected by the receiver or user agent as best as possible.

I think it is reasonable (if occasionally a problem) to preserve the
authority of the charset parameter when it is present in the media type.

I think that when it is not present, it should be permissible to detect the
encoding from the content. This suggests that implementations which cannot
determine the charset when transmitting content should be discouraged from
emitting the charset parameter ("guessing").
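
To make the proposed precedence concrete, here is a minimal sketch in
Python. It is only an illustration of the ordering argued for above, not
text from any RFC: the function names are made up for this example, the
content sniffing looks only at a byte order mark and the XML encoding
declaration, and the final default is left as a parameter (shown here as
us-ascii, the current text/xml default).

```python
import codecs
import re
from email.message import Message

# Hypothetical helper pattern: matches an XML declaration that names an encoding.
_XML_DECL = re.compile(
    rb'^<\?xml[^>]*?encoding\s*=\s*["\']([A-Za-z][A-Za-z0-9._-]*)["\']'
)

def charset_from_header(content_type):
    """Return the charset parameter of a Content-Type value, or None if absent."""
    msg = Message()
    msg["Content-Type"] = content_type
    return msg.get_content_charset()

def charset_from_content(body):
    """Look for positive evidence in the bytes: a BOM or an XML declaration."""
    if body.startswith(codecs.BOM_UTF8):
        return "utf-8"
    if body.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        return "utf-16"
    match = _XML_DECL.match(body)
    if match:
        return match.group(1).decode("ascii").lower()
    return None

def choose_charset(content_type, body, default="us-ascii"):
    """Header charset wins when present; otherwise content; otherwise default."""
    return (
        charset_from_header(content_type)
        or charset_from_content(body)
        or default
    )
```

With this ordering, a present charset parameter still overrides the XML
declaration, so existing receivers that never look at the content behave
exactly as before; only the "parameter missing" case changes.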

Finally, I think that the various default encodings should be modified and
harmonized in a reasonable way.

Personally I think that references to US-ASCII as a default should be
replaced with UTF-8 (which is a superset of US-ASCII) wherever possible.
Since UTF-8 is highly patterned, it is also highly detectable. It could be
followed by a reasonable fallback (possibly US-ASCII or ISO-8859-1: at least
from 8859-1 you can reconstruct the original bytes). The existing RFC3023
contains the counterargument to this, the logic of which I don't dispute.
I'm merely suggesting that a change in policy here might be a better "move
towards the future".
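
A small sketch of the "UTF-8 first, then a fallback" idea, again in Python
and again only an illustration (the function name is hypothetical). It
relies on two properties: a strict UTF-8 decoder rejects most byte
sequences that were not produced as UTF-8, and ISO-8859-1 maps every byte
value to a character, so the original bytes can always be recovered.

```python
def decode_with_fallback(data):
    """Try strict UTF-8 first; fall back to ISO-8859-1, which never fails."""
    try:
        return data.decode("utf-8"), "utf-8"
    except UnicodeDecodeError:
        # Every byte 0x00-0xFF maps to exactly one character in ISO-8859-1,
        # so the text can be re-encoded to the original bytes later.
        return data.decode("iso-8859-1"), "iso-8859-1"

text, used = decode_with_fallback(b"caf\xe9")    # not valid UTF-8
original = text.encode("iso-8859-1")             # byte-for-byte round trip
```

Pure US-ASCII input passes the UTF-8 branch unchanged, which is why
swapping the default would not disturb content that is valid today.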

So to reiterate, I think that the RFC should be modified to try to avoid
the deliberate loss of data (as when using US-ASCII with data that contains
non-US-ASCII sequences) where it is possible for the implementation to cope.
This move won't generate any new incompatibilities for existing
implementations that do not grope the content. I understand that MIME
applications in particular cannot always avoid this kind of data degradation
and that's okay with me, so long as it isn't "wrong" to detect the encoding
when the charset value is *missing*.

I'll probably come to regret that paragraph ;-).

Best Regards,

Addison

Addison P. Phillips
Director, Globalization Architecture
webMethods | Delivering Global Business Visibility

432 Lakeside Drive, Sunnyvale, CA, USA
+1 408.962.5487 (office)  +1 408.210.3569 (mobile)
mailto:aphillips@webmethods.com

Chair, W3C-I18N-WG, Web Services Task Force
http://www.w3.org/International/ws

Internationalization is an architecture.
It is not a feature.
Received on Wednesday, 24 September 2003 16:22:14 UTC