Re: Rough sketch for an I-D (a successor of RFC 3023) from Chris Lilley on 2003-10-30 (www-tag@w3.org from October 2003)

From: Chris Lilley <chris@w3.org>
Date: Thu, 30 Oct 2003 22:18:52 +0100
To: MURATA Makoto <murata@hokkaido.email.ne.jp>
Cc: www-tag@w3.org
Message-ID: <125520306.20031030221852@w3.org>

On Thursday, October 30, 2003, 12:33:01 AM, MURATA wrote:

(snipping all the 'we agree' parts)

>> Firstly, the charset parameter and the xml encoding declaration should
>> never differ, because otherwise the document is only well formed in
>> transit and not when processed on the server or when saved to the
>> client.

MM> To make progress, let's say that there is a concern about such
MM> differences and that, if the recipient saves the document in a
MM> file without rewriting the encoding declaration, the result is a
MM> broken XML document.

Very well.

Let us also note that it is broken while sitting on the server, too.
This is a concern firstly because server-side processing of XML is
very common, and secondly because we would not like to encourage the
practice where authoring tools do not correctly label the encoding of
an XML file but rely on a server to 'fix it up' for them.
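
To make the concern concrete, here is a minimal sketch in Python of the kind of check a server-side tool, or a client about to save a file, could run. The names declared_encoding and labels_agree are hypothetical. The regular expression only works for ASCII-compatible encodings (it cannot read a UTF-16 document), and the comparison is a plain case-insensitive match that ignores charset aliases.

```python
import re

# Matches an XML or text declaration at the very start of the entity and
# captures the value of its encoding pseudo-attribute.
# Assumption: the bytes are in an ASCII-compatible encoding.
_ENCODING_DECL = re.compile(
    rb'^<\?xml(?:\s+version\s*=\s*["\'][^"\']*["\'])?'
    rb'\s+encoding\s*=\s*["\']([A-Za-z][A-Za-z0-9._-]*)["\']'
)


def declared_encoding(body):
    """Return the lower-cased encoding named in the declaration, or None."""
    match = _ENCODING_DECL.match(body)
    return match.group(1).decode("ascii").lower() if match else None


def labels_agree(charset_param, body):
    """True unless the charset parameter and the declaration both exist and differ."""
    declared = declared_encoding(body)
    if charset_param is None or declared is None:
        return True  # only one label (or none): nothing to disagree with
    return charset_param.strip().lower() == declared
```

A document declared as ISO-8859-1 but served with charset=utf-8 makes labels_agree return False, which is exactly the case where the entity is only correctly labelled in transit.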

>> MM> - Server implementers or Server Managers SHOULD NOT specify the
>> MM>   default value of the charset
>> MM>   parameter of text/xml, application/xml,
>> MM>   text/xml-external-parsed-entity,
>> MM>   application/xml-external-parsed-entity, */*+xml, or
>> MM>   application/xml-dtd, unless they can guarantee that
>> MM>   that default value is correct for all MIME entities of these media
>> MM>   types.
>> 
>> Which, it is possible to show, they cannot do in the general case.

MM> Although WWW servers send many XML documents, protocol
MM> implementations (e.g., SOAP) also send XML documents. It is easy
MM> for them to correctly specify the charset parameter.

Especially if they can reliably read it from the XML encoding
declaration.

MM> I think that
MM> future implementations of trackback should correctly specify the
MM> charset parameter. (At present, people unfortunately use
MM> "application/x-www-form-urlencoded" without providing any info,
MM> which causes lots of data corruption in Japan.)

I am aware of that problem, and raised it during the 'when to use GET'
discussions several times.

Further down the line, an application/xform-iriencoded might help
there; for now, PUT solves the problem. (Should this really be a media
type and not a content-transfer-encoding? But I digress).


>> Perhaps the framework and scheme should be pointed to?

MM> I can imagine that this starts a heated discussion and a
MM> significant delay.

I can imagine that too, although so far there has only been one post
on the subject, from Paul Grosso.

MM> I know that the XML Core WG would like to
MM> register XPointer as fragment identifiers, but has W3C agreed on
MM> this? (I'm just asking.)

It's not clear that XML Core wants to do this, and Paul indicated that
guiding people towards framework and scheme (in other words, a method
to construct fragment identifiers, for different media types) would be
the correct thing to do.

I agree that trying to standardize even such a simple thing as
barenames for all of XML brings up the xml:id dependency and would
hinder short term progress.


>> MM> 4) Possible reasons for not providing the charset parameter for
>> MM> specialized media types
>> 
>> MM> I think that "This media type is utf-8 only and thus does not need any
>> MM> mechanism to identify the charset" is a perfectly good reason, since
>> MM> "UTF-8 only" is a generic principle.  This should be mentioned in the
>> MM> I-D.
>> 
>> It's one specific reason. It's not enough though. Why should an XML file
>> that can be either UTF-8 or UTF-16 need a charset parameter? It offers
>> no useful or additional information. All XML processors handle both
>> charsets as a conformance requirement.

MM> In general, I do not want to undermine the only generic mechanism
MM> (the charset parameter) without establishing an alternative.

I concur, but believe that XML has already established an alternative
and the +xml convention allows an implementation to determine that an
unrecognized media type is in XML.
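
As a rough sketch of that determination in Python: the function name is_xml_media_type and the set KNOWN_XML_TYPES are hypothetical, and the set lists only the generic XML types named earlier in this thread (leaving out application/xml-dtd, since a DTD is not itself an XML document).

```python
# Generic XML media types mentioned in this thread (assumed list, not exhaustive).
KNOWN_XML_TYPES = {
    "text/xml",
    "application/xml",
    "text/xml-external-parsed-entity",
    "application/xml-external-parsed-entity",
}


def is_xml_media_type(content_type):
    """Guess whether a Content-Type value names an XML-based media type."""
    # Drop any parameters (e.g. "; charset=utf-8"); type names are case-insensitive.
    essence = content_type.split(";", 1)[0].strip().lower()
    return essence in KNOWN_XML_TYPES or essence.endswith("+xml")
```

An implementation that has never seen a given type, say a made-up application/vnd.example+xml, can still recognize the suffix and hand the body to an XML processor, which then applies XML's own encoding detection.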

MM>  Something limited
MM> to XML is not generic to me.

True, but it is generic to all of XML, which is a non-negligible slice
of current and expected media type registrations. Were this not the
case, there would have been no value in a +xml convention.

MM>  Furthermore, "UTF-16LE" or "UTF-16BE" are preferred
MM> by RFC 2781

Since it talks about 'adding a BOM', I believe that its usage of
"labelling as UTF-16" does not mean "by a server". Rather, it seems to
be saying that text without a BOM should not be labelled as UTF-16 but
must be labelled to indicate its byte order. I am sure we would both
agree with this.

In the usual case, a BOM is present.

RFC 2781 states:

> An exception to the "SHOULD" rule of using "UTF-16BE" or "UTF-16LE"
> would occur with document formats that mandate a BOM in UTF-16 text,
> thereby requiring the use of the "UTF-16" tag only.

I would interpret XML 1.0 without an encoding declaration as being
such a system, and thus being conformant to RFC 2781.
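
A minimal sketch in Python of how a recipient might handle such an entity, one with no encoding declaration and no charset parameter. The function name sniff_unlabelled_xml is hypothetical, and the sketch deliberately ignores UTF-32 and other encodings that XML's autodetection can also recognize.

```python
import codecs


def sniff_unlabelled_xml(body):
    """Pick an encoding for XML bytes that carry no encoding declaration."""
    if body.startswith(codecs.BOM_UTF8):
        return "utf-8"
    # Either byte order is reported as plain "utf-16": the BOM itself says which.
    if body.startswith((codecs.BOM_UTF16_BE, codecs.BOM_UTF16_LE)):
        return "utf-16"
    # No BOM and no declaration: XML 1.0 says to treat the entity as UTF-8.
    return "utf-8"
```

Because the BOM carries the byte order, the generic "utf-16" label is enough here; Python's utf-16 codec likewise reads the BOM, picks the byte order and strips the mark when decoding.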


MM> but they are not mandatory in XML.  However, to make progress,
MM> I am willing to mention "UTF-8 or UTF-16 only" as a possible reason
MM> together with my concern above.

That would be fine.

>> The rest of this sounds good; but I think we need fuller discussion on
>> the use of a charset parameter that disagrees with what the XML
>> encoding declaration says. This is clearly harmful, yet seems to be
>> encouraged.

MM> As I suggested above, let's say that there is a concern about it 
MM> and see how people feel about it.

That seems workable.

Are you editing in RFC 2629 XML?

-- 
 Chris                            mailto:chris@w3.org
Received on Thursday, 30 October 2003 16:19:14 UTC