- From: Chris Lilley <chris@w3.org>
- Date: Thu, 30 Oct 2003 22:18:52 +0100
- To: MURATA Makoto <murata@hokkaido.email.ne.jp>
- Cc: www-tag@w3.org
On Thursday, October 30, 2003, 12:33:01 AM, MURATA wrote:

(snipping all the 'we agree' parts)

>> Firstly, the charset parameter and the xml encoding declaration should
>> never differ, because otherwise the document is only well formed in
>> transit and not when processed on the server or when saved to the
>> client.

MM> To make progress, let's say that there is a concern about such
MM> differences and that, if the recipient saves the document in a
MM> file without rewriting the encoding declaration, the result is a
MM> broken XML document.

Very well. Let us note that it is also broken when sitting on the
server. This is a concern firstly because server-side processing of XML
is very common, and secondly because we would not like to encourage the
practice where authoring tools do not correctly label the encoding of an
XML file but rely on a server to 'fix it up' for them.

>> MM> - Server implementers or Server Managers SHOULD NOT specify the
>> MM> default value of the charset
>> MM> parameter of text/xml, application/xml,
>> MM> Text/xml-external-parsed-entity,
>> MM> Application/xml-external-parsed-entity, */*+xml, or
>> MM> Application/xml-dtd, unless they can guarantee that
>> MM> that default value is correct for all MIME entities of these media
>> MM> types.
>>
>> Which, it is possible to show, they cannot do in the general case.

MM> Although WWW servers send many XML documents, protocol
MM> implementations (e.g., SOAP) also send XML documents. It is easy
MM> for them to correctly specify the charset parameter.

Especially if they can reliably read it from the xml encoding
declaration.

MM> I think that
MM> future implementations of trackback should correctly specify the
MM> charset parameter. (At present, people unfortunately use
MM> "application/x-www-form-urlencoded" without providing any info,
MM> which causes lots of data corruption in Japan.)

I am aware of that problem, and raised it several times during the
'when to use GET' discussions.
Further down the line, an application/xform-iriencoded might help there;
for now, PUT solves the problem. (Should this really be a media type and
not a content-transfer-encoding? But I digress.)

>> Perhaps the framework and scheme should be pointed to?

MM> I can imagine that this starts a heated discussion and a
MM> significant delay.

I can imagine that too, although so far there has only been one post on
the subject, from Paul Grosso.

MM> I know that the XML Core WG would like to
MM> register XPointer as fragment identifiers, but has W3C agreed on
MM> this? (I'm just asking.)

It's not clear that XML Core wants to do this, and Paul indicated that
guiding people towards framework and scheme (in other words, a method to
construct fragment identifiers for different media types) would be the
correct thing to do. I agree that trying to standardize even such a
simple thing as barenames for all of XML brings up the xml:id dependency
and would hinder short-term progress.

>> MM> 4) Possible reasons for not providing the charset parameter for
>> MM> specialized media types
>>
>> MM> I think that "This media type is utf-8 only and thus does not need any
>> MM> mechanism to identify the charset" is a perfectly good reason, since
>> MM> "UTF-8 only" is a generic principle. This should be mentioned in the
>> MM> I-D.
>>
>> Its one specific reason. Its not enough though. Why should an XML file
>> that can be either UTF-8 or UTF-16 need a charset parameter? It offers
>> no useful or additional information. All XML processors handle both
>> charsets as a conformance requirement.

MM> In general, I do not want to undermine the only generic mechanism
MM> (the charset parameter) without establishing an alternative.

I concur, but believe that XML has already established an alternative,
and the +xml convention allows an implementation to determine that an
unrecognized media type is in XML.

MM> Something limited
MM> to XML is not generic to me.
True, but it is generic to all of XML, which is a non-negligible slice of
current and expected media type registrations. Were this not the case,
there would have been no value in a +xml convention.

MM> Furthermore, "UTF-16LE" or "UTF16-BE" are preferred
MM> by RFC 2781

Since it talks about 'adding a BOM', I believe that their usage of
"labelling as UTF-16" does not mean "by a server". Rather, it seems to be
saying that text without a BOM should not be labelled as UTF-16 but must
be labelled to indicate its byte order. I am sure we would both agree
with this.

In the usual case, a BOM is present. RFC 2781 states:

> An exception to the "SHOULD" rule of using "UTF-16BE" or "UTF-16LE"
> would occur with document formats that mandate a BOM in UTF-16 text,
> thereby requiring the use of the "UTF-16" tag only.

I would interpret XML 1.0 without an encoding declaration as being such
a system, and thus being conformant to RFC 2781.

MM> but they are not mandatory in XML. However, to make progress,
MM> I am willing to mention "UTF-8 or UTF-16 only" as a possible reason
MM> together with my concern above.

That would be fine.

>> The rest of this sounds good; but I think we need fuller discussion on
>> the use of a charset parameter that disagrees with what the XML
>> encoding declaration says. This is clearly harmful, yet seems to be
>> encouraged.

MM> As I suggested above, let's say that there is a concern about it
MM> and see how people feel about it.

That seems workable. Are you editing in RFC 2629 XML?

--
Chris mailto:chris@w3.org
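
To make the concern about a charset parameter that disagrees with the
encoding declaration concrete, here is a minimal sketch in Python. It is
not taken from any draft or implementation discussed in this thread; the
function names internal_encoding and charset_agrees are hypothetical, and
the detection is deliberately simplified (UTF-8 and UTF-16 BOMs plus an
ASCII-readable encoding declaration only, no UTF-32 or EBCDIC handling).

```python
import codecs
import re
from typing import Optional

# Matches an encoding declaration at the very start of an ASCII-compatible entity.
_DECL = re.compile(
    rb'^<\?xml[^>]*?\sencoding\s*=\s*["\']([A-Za-z][A-Za-z0-9._-]*)["\']'
)


def internal_encoding(body: bytes) -> str:
    """Return the encoding the XML entity claims for itself (simplified)."""
    # A UTF-16 BOM means the declaration itself is not ASCII-readable,
    # so report "utf-16" without looking further.
    if body.startswith((codecs.BOM_UTF16_BE, codecs.BOM_UTF16_LE)):
        return "utf-16"
    if body.startswith(codecs.BOM_UTF8):
        body = body[len(codecs.BOM_UTF8):]
    match = _DECL.match(body)
    if match:
        return match.group(1).decode("ascii").lower()
    # No BOM and no declaration: XML 1.0 treats the entity as UTF-8.
    return "utf-8"


def _normalize(name: str) -> str:
    """Fold charset aliases together where Python knows the codec."""
    try:
        return codecs.lookup(name.strip()).name
    except LookupError:
        return name.strip().lower()


def charset_agrees(charset: Optional[str], body: bytes) -> bool:
    """True if the external charset parameter matches the internal label.

    An omitted charset parameter (None) cannot disagree with anything.
    """
    if charset is None:
        return True
    return _normalize(charset) == _normalize(internal_encoding(body))


if __name__ == "__main__":
    doc = b'<?xml version="1.0" encoding="ISO-8859-1"?><a/>'
    print(charset_agrees("utf-8", doc))       # server default overrides the label
    print(charset_agrees("iso-8859-1", doc))  # label and parameter match
```

A server or protocol implementation that derives the charset parameter
from internal_encoding, rather than from a site-wide default, would by
construction never produce the disagreement described above.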
Received on Thursday, 30 October 2003 16:19:14 UTC