internet media types and encoding

Hello,

The following language is proposed for section "4.2. Processing model"
of the Web Architecture document. It also includes a lead-in sentence
to be added to the overall Section "4. Representations" introductory
material.
http://www.w3.org/2001/tag/webarch/

This discharges the action item:

Action CL 2003/0127: Draft language for arch doc that takes language
from internet media type registration, propose for arch doc, include
sentiment of TB's second sentence from CP10.

For reference, CP 10 from Tim Bray states:
> Agents which receive a resource representation accompanied by an
> Internet Media Type MUST interpret the representation according to
> the semantics of that Media Type and other header information.
> Servers which generate representations MUST not generate Media Types
> and other header information (for example charsets) unless there is
> certainty that the headers are correct.

Taking that second sentence, the current media type registration
document,

RFC 2048, Multipurpose Internet Mail Extensions (MIME) Part Four:
Registration Procedures
http://www.zvon.org/tmRFC/RFC2048/Output/
http://www.ietf.org/rfc/rfc2048.txt

says nothing about charset. Encoding is mentioned, but not in that
sense (it is mentioned only as content transfer encodings such as
base64 or quoted-printable).

RFC 2046, Multipurpose Internet Mail Extensions (MIME) Part Two:
Media Types
http://www.zvon.org/tmRFC/RFC2046/Output/
http://www.ietf.org/rfc/rfc2046.txt

discusses encodings (charset, in IETF parlance) solely in the context
of text/* types.

For existing non-"+xml" types, such as image, video, application,
model and so on, there is thus no mention of charset parameters and no
problem. The use of charset in the +xml RFC for application/*+xml is
exceptional.

RFC 3023, XML Media Types
http://www.zvon.org/tmRFC/RFC3023/Output/
http://www.ietf.org/rfc/rfc3023.txt

states in section "3.1. Text/xml Registration" that the MIME charset is
authoritative, even when not present, in which case a default of
US-ASCII must be enforced. Since the defaults for XML in the case of
an encoding declaration being omitted are UTF-8 and UTF-16, this is a
problem, and RFC 3023 discusses this in some detail in the text/xml
registration.
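
To make the mismatch concrete, here is a minimal Python sketch. It is
not taken from any of the RFCs and the function names are hypothetical.
It assumes a UTF-8 document with no encoding declaration and no BOM,
and contrasts the US-ASCII default that RFC 3023 imposes on text/xml
with the UTF-8 default an XML processor would apply locally.

```python
# Sample document: UTF-8 bytes, no encoding declaration, no BOM.
doc = '<?xml version="1.0"?><p>caf\u00e9</p>'.encode("utf-8")


def decode_as_text_xml(body, charset=None):
    """Decode as a recipient of text/xml would under RFC 3023 section 3.1.

    The charset parameter is authoritative; when it is absent the
    recipient must assume US-ASCII.
    """
    return body.decode(charset or "us-ascii")


def decode_as_local_xml(body):
    """Decode as a local XML processor would (simplified assumption:
    no BOM and no encoding declaration, so the XML default is UTF-8)."""
    return body.decode("utf-8")


if __name__ == "__main__":
    print(decode_as_local_xml(doc))
    try:
        decode_as_text_xml(doc)
    except UnicodeDecodeError as exc:
        print("text/xml without charset fails:", exc.reason)
```

The same bytes are well-formed for local processing but undecodable
once served as text/xml without a charset parameter.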

This implies that if the same content is ever to be processed both
locally on the server (non-networked) and over the network via HTTP,
one must either set an explicit XML encoding declaration of US-ASCII
(and ensure that it is true, using numeric character references or
entities for all codepoints above 127) or, alternatively, not use
text/xml. The latter course seems the recommended one.
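
As an illustration of the first option, Python's built-in
"xmlcharrefreplace" error handler replaces every codepoint above 127
with a numeric character reference. This is only a sketch: the helper
name is made up, and the approach is only safe for character data and
attribute values, since character references are not recognised in
element names, comments or CDATA sections.

```python
def to_us_ascii_xml(text):
    """Encode XML character data as US-ASCII bytes.

    Codepoints above 127 become numeric character references
    (for example U+00E9 becomes &#233;).
    """
    return text.encode("ascii", errors="xmlcharrefreplace")


if __name__ == "__main__":
    print(to_us_ascii_xml("caf\u00e9"))
```

Content produced this way stays correct whether it is read with the
US-ASCII default of text/xml or as UTF-8 by a local XML processor,
because US-ASCII is a subset of UTF-8.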

In section "3.2. Application/xml Registration" RFC 3023 also lists,
unusually, a charset parameter, declares it authoritative in all cases
over HTTP even when not present, and attempts to impose a rewriting
rule on clients that receive and store application/xml content whose
encoding declaration differs from that of the MIME charset parameter.

Correct processing in the very frequent case where the server does XML
processing before the file is sent is not addressed. Clearly, for this
to work, and since there is no MIME information in this local
processing, the XML encoding declaration must give the correct answer
(if omitted, the encoding must be UTF-8 or UTF-16).
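
For illustration, the information available to such local processing
can be sketched in Python as follows. This is a simplified, hypothetical
helper, not the full autodetection procedure of the XML specification:
it assumes the document either starts with a UTF-16 BOM or is in an
ASCII-compatible encoding, and it does not handle UTF-32 or EBCDIC.

```python
import codecs
import re

# Matches an encoding pseudo-attribute inside a leading XML declaration.
_DECL = re.compile(
    rb'^<\?xml[^>]*?encoding\s*=\s*["\']([A-Za-z][A-Za-z0-9._-]*)["\']'
)


def sniff_xml_encoding(body):
    """Return the encoding implied by the document itself (lowercased).

    Simplified: UTF-16 is recognised only by its BOM; otherwise the
    encoding declaration is used, falling back to the UTF-8 default.
    """
    if body.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        return "utf-16"
    if body.startswith(codecs.BOM_UTF8):
        body = body[len(codecs.BOM_UTF8):]
    match = _DECL.match(body)
    if match:
        return match.group(1).decode("ascii").lower()
    return "utf-8"


if __name__ == "__main__":
    print(sniff_xml_encoding(b'<?xml version="1.0" encoding="ISO-8859-1"?><a/>'))
    print(sniff_xml_encoding(b"<a/>"))
```

Nothing in this procedure can consult a MIME charset parameter, which
is why the declaration (or its absence) has to be correct on its own.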

Furthermore, section "7. A Naming Convention for XML-Based Media
Types" states

> Registrations for new XML-based media types under top-level types
> other than "text" SHOULD, in specifying the charset parameter and
> encoding considerations, define them as: "Same as [charset parameter
> / encoding considerations] of application/xml as specified in RFC
> 3023."

> The use of the charset parameter is STRONGLY RECOMMENDED, since this
> information can be used by XML processors to determine
> authoritatively the charset of the XML MIME entity.

This broadens the scope of the damage to all XML-related media type
registrations.

The TAG has already noted in its finding
"Internet Media Type registration, consistency of use"
http://www.w3.org/2001/tag/2002/0129-mime

> Thus there is no ambiguity when the charset is omitted, and the
> STRONGLY RECOMMENDED injunction to use the charset is misplaced for
> application/xml and for non-text "+xml" types. Consequently, for XML
> representations, server-side applications SHOULD only supply a
> charset header when there is complete certainty as to the encoding
> in use. Otherwise, an error will cause a perfectly usable
> representation to be rejected by an architecturally sound client.

It further notes that incorrect labelling of the encoding is harmful
and that xml content types already include mandatory encoding
information. It would be harmful for different sources of encoding
information to be used in different circumstances (local processing
on client or on server, as opposed to networked processing between
client and server); having multiple sources of the same information
clearly leads to situations where these get out of sync and is to be
avoided.

So, most of the wording suggested below is already backed up by an
existing finding, and the scope is limited to XML media types only.
Other types have no problem (apart from all text/* types).

In general, for issue contentTypeOverride-24, the TAG is tending
towards a position that overriding MIME information (and HTTP headers
in general) is a bad thing, see
http://www.w3.org/2001/tag/ilist#contentTypeOverride-24

It has also been noted in discussions that content creators have
control over their content but do not have control over the setup of
the server they are using. Unlike media type itself, which is widely
communicated to the server by the filename extension, there are no
widespread naming conventions for indicating the encoding used and
thus no way that authoring tools can generate content that will
automatically cause the correct charset parameter to be created
regardless of the server used.

If the TAG says on the one hand that content creators and clients are
better placed to determine the encoding used (so don't believe the
encoding produced by the server) and on the other hand that the media
type and other information sent by the server must not or should not
be overridden, then there is a contradiction.

The resolution of this is to state, as CP 10 suggests and as already
given in a previous finding, that for non-text */*+xml types, a
charset parameter must not be generated unless it is consistent with the
encoding information given by the xml encoding declaration and
further, that for application/xml and for non-text */*+xml the absence
of a charset parameter means "use the XML encoding declaration".
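
A minimal Python sketch of this rule, for illustration only. The
function names and parameters are hypothetical, and the encoding given
by the XML encoding declaration is assumed to have been determined
already (UTF-8 if no declaration is present).

```python
from typing import Optional


def content_type_header(media_type: str,
                        configured_charset: Optional[str],
                        xml_encoding: str) -> str:
    """Server side: emit a charset parameter only when it agrees with
    the encoding given by the XML encoding declaration."""
    if configured_charset and configured_charset.lower() == xml_encoding.lower():
        return f"{media_type}; charset={configured_charset}"
    # Unknown or inconsistent: omit the parameter rather than mislabel.
    return media_type


def effective_encoding(charset_param: Optional[str], xml_encoding: str) -> str:
    """Client side: an absent charset parameter means
    'use the XML encoding declaration'."""
    return (charset_param or xml_encoding).lower()


if __name__ == "__main__":
    print(content_type_header("image/svg+xml", "UTF-8", "utf-8"))
    print(content_type_header("application/xml", "ISO-8859-1", "utf-8"))
    print(effective_encoding(None, "UTF-8"))
```

With both halves in place, a charset parameter that is present can
always be believed, and one that is absent never changes the answer
the document gives about itself.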

In such a case, the finding for contentTypeOverride-24 could
confidently propose that overriding MIME information, particularly
charset information, was always bad and that the information, where
provided, must always be used.

In consequence, we should:

- request a modification to the media type registration to discourage
registration of new text/* types unless all the requirements for text/*
are met, including the encoding constraints (including mandatory
charset where no charset parameter is provided) and to encourage
registration of new non text/* types instead.

- request a modification to the +XML media type registration to state
that encoding information MUST NOT be supplied by the server unless it
is known to be correct and agrees with any internal encoding
information in the content.

- warn against the use of text/xml for any content that is not
strictly US-ASCII encoded.

- request a new definition of the application/xml and */*+xml media
types that removes the unfortunate and unenforceable charset wording
and replaces it with wording that allows interoperability and
consistent processing for server-side, client-side and client-server
uses of XML without content rewriting. This might be a new RFC to
obsolete RFC 3023 or, alternatively, might be standards-track work
from the XML Core working group, for example. It would change the
meaning of the absence of a charset parameter for application/xml
and thus for non-text */*+xml.

Since a desire has been expressed to

a) keep the architecture document short
b) give detailed explanation for architecture, not just stating the
what but also the why

I propose to open a new issue on this, issue the explanatory material
above as a finding (the why), add the material below (the what) to the
arch document, and close the issue.

Note that the text below assumes that the changes to the xml media
type registration definition have been made.
          
----- 8< -----------

Because of character encoding fallbacks when there is no MIME charset
parameter, use of text/* media types must ensure that the content is
encoded in US-ASCII unless it can be guaranteed that all servers will
serve this content with the correct charset parameter. Registration of
new text/* types is strongly discouraged, for the same reason.

For application/xml and non-text "+xml" media types, charset
parameters must not be produced unless they give the same information
as the xml encoding declaration would give.

----- 8< -----------
          

-- 
 Chris                          mailto:chris@w3.org

Received on Monday, 7 April 2003 15:05:13 UTC