- From: Addison Phillips [wM] <aphillips@webmethods.com>
- Date: Wed, 24 Sep 2003 13:20:38 -0700
- To: "MURATA Makoto" <murata@hokkaido.email.ne.jp>
- Cc: "WWW-Tag" <www-tag@w3.org>, <ietf-xml-mime@imc.org>
Hi Makoto,

Thanks for your response! I'm happy to contribute as best I can.

> This issue was discussed when RFC 2376 was developed. I recall
> that my then-co-author (E. Whitehead) proposed exactly the same
> thing, but that proposal was not agreed.

I think it will keep coming back until the policy is changed. Too many XML implementers are frustrated by it.

> In my understanding, MIME people in IETF would like to keep the charset
> parameter of text/* authoritative, since a number of mail programs rely
> only on the charset parameter for text/*.

I don't think it is a problem for the charset parameter to be authoritative *when it is present*. The real problem is that it is also considered authoritative when it is NOT present--overriding other means of identifying the content's encoding that are available. The absence of information should not take precedence over positive information in the content when the implementation can take advantage of that information. That doesn't mean there can't be an outlet for implementations that choose not to interpret the content, only that implementations that can interpret the content should be allowed to do so.

> However, as people correctly pointed out, omission of the charset
> parameter of text/xml is typically caused by the fact that authors
> cannot change the configuration of WWW servers. For this reason, W3C
> recommendations for CSS and HTML say something similar to your
> suggestion, but the IETF RFC for CSS does not say anything about the
> default.

Then each standard or RFC should be modified over time to include such information, either directly or by reference to parent standards. For example, SOAP does not include any statement whatsoever about the charset to use in a SOAP message; XML 1.0 and the RFC for the media type contain that information (and in the case of the media type, IIRC, by reference to RFC 3023).
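As a concrete illustration of the "positive information in the content" available to a receiver when no charset parameter arrives, here is a minimal Python sketch of BOM and encoding-declaration sniffing for an XML entity. The function name is mine, not any standard library's API, and the heuristics only loosely follow the non-normative autodetection appendix of XML 1.0; a production detector would handle many more cases:

```python
import re

# A few well-known byte-order marks and the encodings they signal.
BOMS = [
    (b"\xef\xbb\xbf", "utf-8"),
    (b"\xff\xfe", "utf-16-le"),
    (b"\xfe\xff", "utf-16-be"),
]

def sniff_xml_encoding(data: bytes) -> str:
    """Guess an XML entity's encoding from its BOM or encoding declaration."""
    for bom, enc in BOMS:
        if data.startswith(bom):
            return enc
    # ASCII-compatible case: look for <?xml ... encoding="..."?> up front.
    m = re.match(rb'<\?xml[^>]*encoding=["\']([A-Za-z0-9._-]+)["\']', data)
    if m:
        return m.group(1).decode("ascii").lower()
    return "utf-8"  # XML's own default when no BOM or declaration is present
```

The point is simply that the content itself carries enough signal that "no charset parameter" need not mean "no information".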
But certainly Ur-standards, like CSS, XML, and HTML, should contain a default (and not leave unannounced values open to interpretation).

> The IETF RFC 2854 for HTML says the default is US-ASCII (MIME) or
> 8859-1 (HTTP), but also says that "the actual default is either a
> corporate character encoding or character encodings widely deployed in
> a certain national or regional community"

That last statement is quite sloppy, in my opinion. It basically says: the default is ASCII/8859, except when it isn't (i.e. you cannot rely on there being a default). This amounts to saying that the receiver or user agent should introspect the content as best it can.

I think it is reasonable (if occasionally a problem) to preserve the authority of the charset tag when it is present in the media type. When it is not present, it should be permissible to detect the encoding from the content. This suggests that implementations which cannot determine the charset when transmitting content should be discouraged from emitting the charset parameter ("guessing").

Finally, I think the various default encodings should be modified and harmonized in a reasonable way. Personally, I think references to US-ASCII as a default should be replaced with UTF-8 (a superset of US-ASCII) wherever possible. Since UTF-8 is highly patterned, it is also highly detectable. It could be followed by a reasonable fallback (possibly US-ASCII or ISO 8859-1: at least from 8859-1 you can reconstruct the bytes). The existing RFC 3023 contains the counter-argument, the logic of which I don't dispute; I'm merely suggesting that a change in policy here might be a better "move towards the future".

So, to reiterate: I think the RFC should be modified to avoid the deliberate loss of data (as when US-ASCII is applied to data containing non-US-ASCII sequences) where it is possible for the implementation to cope.
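The UTF-8-then-8859-1 ordering suggested above can be sketched in a few lines of Python. This is a hypothetical helper (nothing any RFC defines): strict UTF-8 decoding rejects malformed byte sequences, which is what makes UTF-8 "detectable", while ISO 8859-1 accepts every byte with a 1:1 mapping, so the original bytes remain reconstructible:

```python
def decode_untagged(data: bytes) -> tuple[str, str]:
    """Return (text, encoding) for content that arrived with no charset."""
    try:
        # Strict UTF-8 decoding fails fast on byte sequences that do not
        # follow UTF-8's lead-byte/continuation-byte pattern.
        return data.decode("utf-8"), "utf-8"
    except UnicodeDecodeError:
        # ISO 8859-1 never fails: each byte maps to exactly one code point,
        # so the raw bytes can always be recovered from the text.
        return data.decode("iso-8859-1"), "iso-8859-1"
```

A stream that happens to be pure US-ASCII is valid UTF-8 anyway, so this ordering never misclassifies ASCII content.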
This move won't generate any new incompatibilities for existing implementations that do not grope the content. I understand that MIME applications in particular cannot always avoid this kind of data degradation, and that's okay with me, so long as it isn't "wrong" to detect *missing* charset values. I'll probably come to regret that paragraph ;-).

Best Regards,

Addison

Addison P. Phillips
Director, Globalization Architecture
webMethods | Delivering Global Business Visibility

432 Lakeside Drive, Sunnyvale, CA, USA
+1 408.962.5487 (office)
+1 408.210.3569 (mobile)
mailto:aphillips@webmethods.com

Chair, W3C-I18N-WG, Web Services Task Force
http://www.w3.org/International/ws

Internationalization is an architecture. It is not a feature.
Received on Wednesday, 24 September 2003 16:22:14 UTC