- From: Chris Lilley <chris@w3.org>
- Date: Mon, 7 Apr 2003 21:05:07 +0200
- To: www-tag@w3.org
Hello,

The following language is proposed for section "4.2. Processing model" of the Web Architecture document. It also includes a lead-in sentence to be added to the introductory material of the overall section "4. Representations".

http://www.w3.org/2001/tag/webarch/

This discharges the action item:

Action CL 2003/0127: Draft language for arch doc that takes language from internet media type registration, propose for arch doc, include sentiment of TB's second sentence from CP10.

For reference, CP 10 from Tim Bray states:

> Agents which receive a resource representation accompanied by an
> Internet Media Type MUST interpret the representation according to
> the semantics of that Media Type and other header information.
> Servers which generate representations MUST not generate Media Types
> and other header information (for example charsets) unless there is
> certainty that the headers are correct.

Taking that second sentence, the current media type registration document,

RFC 2048, Multipurpose Internet Mail Extensions (MIME) Part Four: Registration Procedures
http://www.zvon.org/tmRFC/RFC2048/Output/
http://www.ietf.org/rfc/rfc2048.txt

says nothing about charset. Encoding is mentioned, but not in that sense (it is mentioned as content transfer encodings such as base64 or quoted-printable).

RFC 2046, Multipurpose Internet Mail Extensions (MIME) Part Two: Media Types
http://www.zvon.org/tmRFC/RFC2046/Output/
http://www.ietf.org/rfc/rfc2046.txt

discusses encodings (charset, in IETF parlance) solely in the context of text/* types. For existing, non "+xml" types, such as image, video, application, model and so on, there is thus no mention of charset parameters and no problem.

The use of charset in the +xml RFC for application/*+xml is exceptional.

RFC 3023, XML Media Types
http://www.zvon.org/tmRFC/RFC3023/Output/
http://www.ietf.org/rfc/rfc3023.txt

states in section "3.1. Text/xml Registration" that the MIME charset is authoritative, even when not present, in which case a default of US-ASCII must be enforced. Since the defaults for XML when an encoding declaration is omitted are UTF-8 and UTF-16, this is a problem, and RFC 3023 discusses it in some detail in the text/xml registration. This implies that, should it ever be desirable to do both server-side (non-networked) processing and networked HTTP processing, one must either set an explicit XML encoding declaration of US-ASCII (and ensure that it is true, using numeric character references or entities for all code points above 127) or, alternatively, not use text/xml. The latter course seems the recommended one.

In section "3.2. Application/xml Registration" RFC 3023 also lists, unusually, a charset parameter, declares it authoritative in all cases over HTTP even when not present, and attempts to impose a rewriting rule on clients that receive and store application/xml content whose encoding declaration differs from that of the MIME charset parameter. Correct processing in the incredibly frequent case where the server does XML processing before the file is sent is not addressed. Clearly, for this to work, and since there is no MIME information in this local processing, the XML encoding declaration must give the correct answer (if omitted, the encoding must be UTF-8 or UTF-16).

Furthermore, section "7. A Naming Convention for XML-Based Media Types" states:

> Registrations for new XML-based media types under top-level types
> other than "text" SHOULD, in specifying the charset parameter and
> encoding considerations, define them as: "Same as [charset parameter
> / encoding considerations] of application/xml as specified in RFC
> 3023."
> The use of the charset parameter is STRONGLY RECOMMENDED, since this
> information can be used by XML processors to determine
> authoritatively the charset of the XML MIME entity.

This broadens the scope of the damage to all XML-related media type registrations.
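
To make the text/xml mismatch described above concrete, here is a minimal Python sketch. It assumes only the two defaults named in this message (US-ASCII for text/xml with no charset parameter, UTF-8 for XML with no encoding declaration and no UTF-16 byte order mark); the helper name and the sample documents are made up for illustration.

```python
# Illustrative only: the two defaults are taken from the discussion above;
# the helper name and the sample documents are hypothetical.
XML_DEFAULT = "utf-8"          # XML with no encoding declaration (and no UTF-16 BOM)
TEXT_XML_DEFAULT = "us-ascii"  # text/xml served with no charset parameter


def decodes_cleanly(data: bytes, encoding: str) -> bool:
    """Return True if `data` is a valid byte sequence in `encoding`."""
    try:
        data.decode(encoding)
        return True
    except UnicodeDecodeError:
        return False


# A document saved as UTF-8 with no encoding declaration.
doc = '<?xml version="1.0"?><p>caf\u00e9</p>'.encode("utf-8")

# Local (non-networked) processing applies the XML default.
local_ok = decodes_cleanly(doc, XML_DEFAULT)

# Networked processing as text/xml without a charset applies US-ASCII.
networked_ok = decodes_cleanly(doc, TEXT_XML_DEFAULT)

# The workaround mentioned above: declare US-ASCII and escape everything
# above code point 127 as a numeric character reference.
escaped = '<?xml version="1.0" encoding="US-ASCII"?><p>caf&#233;</p>'.encode("ascii")
escaped_ok = decodes_cleanly(escaped, TEXT_XML_DEFAULT)
```

Under these assumptions the same bytes are acceptable locally but not over the network, while the escaped variant is acceptable in both settings.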
The TAG has already noted in its finding "Internet Media Type registration, consistency of use"

http://www.w3.org/2001/tag/2002/0129-mime

> Thus there is no ambiguity when the charset is omitted, and the
> STRONGLY RECOMMENDED injunction to use the charset is misplaced for
> application/xml and for non-text "+xml" types. Consequently, for XML
> representations, server-side applications SHOULD only supply a
> charset header when there is complete certainty as to the encoding
> in use. Otherwise, an error will cause a perfectly usable
> representation to be rejected by an architecturally sound client.

It further notes that incorrect labelling of the encoding is harmful and that XML content types already include mandatory encoding information. It would be harmful for different sources of encoding information to be used in different circumstances (local processing on client or on server, as opposed to networked processing between client and server); having multiple sources of the same information clearly leads to situations where these get out of sync, and is to be avoided.

So, most of the wording suggested below is already backed up by an existing finding, and the scope is limited to XML media types only. Other types have no problem (apart from all text/* types).

In general, for issue contentTypeOverride-24, the TAG is tending towards a position that overriding MIME information (and HTTP headers in general) is a bad thing; see

http://www.w3.org/2001/tag/ilist#contentTypeOverride-24

It has also been noted in discussions that content creators have control over their content but do not have control over the setup of the server they are using. Unlike the media type itself, which is widely communicated to the server by the filename extension, there are no widespread naming conventions for indicating the encoding used and thus no way that authoring tools can generate content that will automatically cause the correct charset parameter to be created regardless of the server used.
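
As a rough illustration of the finding's advice that a server should supply a charset only when it is certain of the encoding, the check could look something like the Python sketch below. The function names are hypothetical, the declaration parsing is deliberately simplified (ASCII-compatible encodings and UTF-16 byte order marks only), and treating a disagreement as "omit the charset" is my assumption, not something the finding spells out.

```python
import codecs
import re
from typing import Optional

# Simplified: matches the encoding pseudo-attribute of an XML declaration
# at the very start of an ASCII-compatible byte stream.
_DECL = re.compile(
    rb'^<\?xml[^>]*?encoding\s*=\s*["\']([A-Za-z][A-Za-z0-9._-]*)["\']'
)


def declared_encoding(data: bytes) -> str:
    """Encoding the XML content itself announces (hypothetical helper)."""
    if data.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        return "utf-16"
    if data.startswith(codecs.BOM_UTF8):
        data = data[len(codecs.BOM_UTF8):]
    match = _DECL.match(data)
    # With no declaration and no UTF-16 BOM, XML defaults to UTF-8.
    return match.group(1).decode("ascii").lower() if match else "utf-8"


def charset_to_send(data: bytes, configured: Optional[str]) -> Optional[str]:
    """Return the charset for Content-Type, or None to omit the parameter.

    `configured` is whatever the server setup would otherwise emit.
    """
    if configured is None:
        return None
    try:
        # Compare canonical codec names so aliases such as
        # "latin-1" and "ISO-8859-1" count as the same encoding.
        same = (codecs.lookup(configured).name
                == codecs.lookup(declared_encoding(data)).name)
    except LookupError:
        same = False  # unknown label: there is no certainty, so stay silent
    return configured if same else None
```

For example, a server configured to send charset=utf-8 would omit the parameter for a document declaring ISO-8859-1, leaving the client to read the encoding declaration itself.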
If the TAG says on the one hand that content creators and clients are better placed to determine the encoding used (so don't believe the encoding produced by the server) and on the other hand that the media type and other information sent by the server must not or should not be overridden, then there is a contradiction.

The resolution of this is to state, as CP 10 suggests and as already given in a previous finding, that for non-text */*+xml types a charset parameter must not be generated unless it is consistent with the encoding information given by the XML encoding declaration and, further, that for application/xml and for non-text */*+xml the absence of a charset parameter means "use the XML encoding declaration". In such a case, the finding for contentTypeOverride-24 could confidently propose that overriding MIME information, particularly charset information, is always bad and that the information, where provided, must always be used.

In consequence, we should:

- request a modification of the media type registration to discourage registration of new text/* types unless all the requirements for text/* are met, including the encoding constraints (including mandatory charset where no charset parameter is provided), and to encourage registration of new non-text/* types instead.

- request a modification of the +XML media type registration to state that encoding information MUST NOT be supplied by the server unless it is known to be correct and agrees with any internal encoding information in the content.

- warn against the use of text/xml for any content that is not strictly US-ASCII encoded.

- request a new definition of the application/xml and */*+xml media types that removes the unfortunate and unenforceable charset wording and replaces it with wording that allows interoperability and consistent processing for server-side, client-side and client-server uses of XML without content rewriting.
  This might be a new RFC to obsolete RFC 3023 or, alternatively, might be standards-track work from the XML Core working group, for example. It would change the meaning of the absence of a charset parameter for application/xml and thus for non-text */*+xml.

Since a desire has been expressed to

a) keep the architecture document short
b) give detailed explanation for architecture, stating not just the what but also the why

I propose to open a new issue on this, issue the explanatory material above as a finding (the why), add the material below (the what) to the arch document, and close the issue.

Note that the text below assumes that the changes to the XML media type registration definition have been made.

----- 8< -----------

Because of character encoding fallbacks when there is no MIME charset parameter, use of text/* media types must ensure that the content is encoded in US-ASCII unless it can be guaranteed that all servers will serve this content with the correct charset parameter. Registration of new text/* types is strongly discouraged, for the same reason.

For application/xml and non-text "+xml" media types, charset parameters must not be produced unless they give the same information as the XML encoding declaration would give.

----- 8< -----------

--
Chris mailto:chris@w3.org
Received on Monday, 7 April 2003 15:05:13 UTC