XML media types, charset, TAG findings

Hello all,

In the approved TAG finding

Internet Media Type registration, consistency of use
TAG Finding 3 June 2002 (Revised 4 September 2002)

a specific criticism of RFC 3023 is raised
3. Consistency in Communicating Character Encoding

and the conclusion is

>> Thus there is no ambiguity when the charset is omitted, and the
>> STRONGLY RECOMMENDED injunction to use the charset is misplaced for
>> application/xml and for non-text "+xml" types. Consequently, for XML
>> representations, server-side applications SHOULD only supply a
>> charset header when there is complete certainty as to the encoding in
>> use. Otherwise, an error will cause a perfectly usable representation
>> to be rejected by an architecturally sound client.

>> We recommend that section 7.1 of [RFC3023] be amended to something
>> like the following:

>> The use of the charset parameter, when the charset is reliably known
>> and agrees with the encoding declaration, is RECOMMENDED, since this
>> information can be used by non-XML processors to determine
>> authoritatively the charset of the XML MIME entity.

This is further backed up by another approved TAG finding

Authoritative Metadata
TAG Finding 25 February 2004

4.2 Self-describing data and Risk of Inconsistency

>> Representation providers SHOULD NOT in general specify the character
>> encoding for XML data in protocol headers since the data is
>> self-describing.

However, the registration for application/xml still says

> Although listed as an optional parameter, the use of the charset
> parameter is STRONGLY RECOMMENDED, since this information can be used
> by XML processors to determine authoritatively the charset of the XML
> MIME entity. The charset parameter can also be used to provide
> protocol-specific operations, such as charset-based content
> negotiation in HTTP.

Since RFC 3023 was published, it has become clear that the +xml
convention has taken off. One consequence is that a transcoding proxy
can reliably distinguish XML from non-XML media types when it meets an
unknown media type.

Thus, it can know to either
a) leave it alone, or
b) transcode to another charset, at the same time fixing up the XML
encoding declaration

in the same way that it knows to not transcode, say, an image/gif from
Latin-1 to Shift-JIS.

Thus the generality argument (we want all encoding handled in the same
way) can be applied to all the +xml types.
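
To make the proxy's test concrete, here is a minimal sketch in Python.
The function name and the set of explicitly listed types are my own
illustrative assumptions, not text from RFC 3023 or any implementation;
the point is only that the +xml suffix lets the check succeed for
media types the proxy has never seen.

```python
# Types that are XML by name rather than by suffix (illustrative list).
XML_TYPES = {
    "text/xml",
    "application/xml",
    "text/xml-external-parsed-entity",
    "application/xml-external-parsed-entity",
}


def is_xml_media_type(content_type: str) -> bool:
    """Return True if the media type is XML by name or by the +xml suffix."""
    # Drop parameters such as "; charset=utf-8" and normalise case.
    media_type = content_type.split(";", 1)[0].strip().lower()
    if media_type in XML_TYPES:
        return True
    # Unknown types are still recognisable through the +xml convention.
    return media_type.endswith("+xml")
```

A proxy using such a test would treat image/svg+xml as XML and leave
image/gif alone, without needing a registry entry for either.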

Coupled with the deprecation of the text/xml and
text/xml-external-parsed-entity types (and thus insulation from the
particular encoding restrictions of text/*) we are now, in this revision
of the document, in a position to be a little stronger:

  The encoding declaration in an XML document and the charset (if
  provided) MUST be consistent.

This removes the requirement on all XML tools, from wget on up, to
rewrite XML instances when saving them to a local filestore so that
they remain well formed. Instead, no rewriting is required.

In consequence, the wording on the optional charset parameter should be

The main value of a charset parameter is as a duplicate copy of the
encoding in use; for use by non-XML processors (full text search
engines? content management systems?) and for use in content

Thus, I would like to see language in the specification that removes the
idea of charset as an override to the XML encoding declaration, and
instead talks of charset as an optional parameter that may have certain
uses and, if provided, MUST be consistent with the encoding declared by
the instance (BOM, encoding declaration, or absence thereof).
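
As a rough illustration of what "consistent" could mean in practice,
the Python sketch below compares a charset parameter against the
encoding the instance itself declares. The function names are
hypothetical, and the detection is deliberately simplified: it
recognises only UTF-8 and UTF-16 BOMs and ASCII-compatible encoding
declarations, and falls back to UTF-8 when neither is present. It is
not the full detection procedure of the XML Recommendation.

```python
import codecs
import re
from email.message import Message

# BOMs checked first; both UTF-16 byte orders are reported as "utf-16".
_BOMS = [
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_BE, "utf-16"),
    (codecs.BOM_UTF16_LE, "utf-16"),
]

# Encoding pseudo-attribute inside an XML declaration at the very start.
_DECL = re.compile(
    rb"""^<\?xml[^>]*?\sencoding\s*=\s*["']([A-Za-z][A-Za-z0-9._-]*)["']"""
)


def instance_encoding(data: bytes) -> str:
    """Encoding declared by the instance: BOM, declaration, or default."""
    for bom, name in _BOMS:
        if data.startswith(bom):
            return name
    match = _DECL.match(data)
    if match:
        return match.group(1).decode("ascii").lower()
    return "utf-8"  # absence of BOM and declaration (simplified)


def _canonical(name: str) -> str:
    """Map aliases such as latin-1 and ISO-8859-1 to one name."""
    try:
        return codecs.lookup(name).name
    except LookupError:
        return name.lower()


def charset_is_consistent(content_type: str, data: bytes) -> bool:
    """True if no charset is given, or it agrees with the instance."""
    header = Message()
    header["Content-Type"] = content_type
    charset = header.get_content_charset()
    if charset is None:
        return True  # the parameter is optional
    return _canonical(charset) == _canonical(instance_encoding(data))
```

Under this reading, a response labelled charset=utf-8 whose body
declares ISO-8859-1 is simply an error, rather than a case where the
header overrides the document.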

I am of course happy to propose specific text, but wanted reactions to
this first, to ensure all the editors are in agreement as to how to

Due to the interplay between the draft and the two TAG findings, I have
copied this to www-tag.

 Chris Lilley                    mailto:chris@w3.org
 Chair, W3C SVG Working Group
 Member, W3C Technical Architecture Group

Received on Thursday, 7 October 2004 14:44:24 UTC