Regarding XML Proposed Erratum 71 from Martin Duerst on 2001-06-14 (xml-editor@w3.org from April to June 2001)

From: Martin Duerst <duerst@w3.org>
Date: Thu, 14 Jun 2001 13:14:44 +0900
To: Francois Yergeau <FYergeau@alis.com>, pgrosso@arbortext.com
Cc: xml-editor@w3.org, w3c-xml-core-wg@w3.org, w3c-i18n-ig@w3.org, connolly@w3.org
Message-Id: <4.2.0.58.J.20010614122745.063017f0@sh.w3.mag.keio.ac.jp>

Dear XML core WG,

By chance, I just discovered Proposed Erratum 71:

http://www.w3.org/XML/Group/2000/10/proposed-xml10-2e-errata#PE71

It is true that this is a bit vague in not saying who is
responsible for the escaping, but this has been fixed by
PE 51/E4 to say that the XML processor is responsible:

http://www.w3.org/XML/Group/2000/10/proposed-xml10-2e-errata#PE51
http://www.w3.org/XML/xml-V10-2e-errata#E4

This settles very clearly whether the escaping has
to be done by the creator of the document or by the recipient;
it is the recipient. This was already present in the first
edition of the XML Rec (see
http://www.w3.org/TR/1998/REC-xml-19980210#sec-external-ent),
it just got lost when the rules of how exactly to escape were
clarified (erratum 78
http://www.w3.org/XML/xml-19980210-errata#E78).
Erratum 78 is bogus only in the sense that it dropped an
important bit; Erratum 78 together with second-edition
erratum 4 is not bogus at all.

However, PE71 seems to be more about details of what
happens on the receiver side. Somebody wrote:

 >>>>
 > When (and
 >only when) the string matching the SystemLiteral terminal in the language
 >is interpreted as a URI reference, it may need to be escaped before passing
 >it around the web.  But this part of XML 1.0 cannot be saying that the
 >SystemLiteral in the XML file cannot contain certain characters, and there
 >is no reason to be doing URI ref escaping before passing the string to the
 >catalog resolver.  The escaping talked about in 4.2.2 of XML 1.0 should
 >happen only when the final URI is determined (after catalog resolution
 >and/or absolutization) just before being "sent out" to the server.
 >>>>

This seems to imply that a system reference is only to be interpreted
as a URI in certain contexts, after various other processing steps.

This is definitely not how the XML Spec is written. In particular,
it says:

 >>>>
The SystemLiteral is called the entity's system identifier. It is a URI
reference (as defined in [IETF RFC 2396], updated by [IETF RFC 2732]),
meant to be dereferenced to obtain input for the XML processor to construct
the entity's replacement text.
 >>>>

This does not mean that catalogs cannot be used, but it means that
if catalogs are used, they are treated as part of URI reference resolution,
similar to HTTP caches and proxies, and so on.

This then brings us back to the question of whether system literals
have to be escaped before being checked in the catalog or not.
I see two possible interpretations:
1) The XML processor has to do escaping before resolution, and
    catalog lookup is part of resolution, so escaping is before
    catalog lookup. (strict reading)
2) If catalogs can handle arbitrary characters, then catalog lookup
    can do unescaping, and if catalog lookup is built into the XML
    processor or otherwise tightly coupled, this can be shortcut.
    (intentional reading)
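
To make the difference between the two readings concrete, here is a
minimal Python sketch. Everything in it is an assumption made for
illustration: the catalog is modelled as a plain dictionary, the function
names and the example entries are hypothetical, and the SAFE string is
one reading of which ASCII characters section 4.2.2 leaves unescaped.

```python
from urllib.parse import quote, unquote

# ASCII punctuation assumed to stay unescaped (letters, digits and "_.-~"
# are always left alone by quote()).
SAFE = "!#$%&'()*+,/:;=?@[]"

def escape(system_literal):
    """Escape disallowed characters as %HH over their UTF-8 bytes."""
    return quote(system_literal, safe=SAFE)

def lookup_strict(system_literal, escaped_catalog):
    """Reading 1: escape first, then look up the escaped form."""
    escaped = escape(system_literal)
    return escaped_catalog.get(escaped, escaped)

def lookup_via_unescaping(system_literal, raw_catalog):
    """Reading 2, long way: processor escapes, catalog unescapes again."""
    escaped = escape(system_literal)
    return raw_catalog.get(unquote(escaped), escaped)

def lookup_shortcut(system_literal, raw_catalog):
    """Reading 2, shortcut: a tightly coupled catalog sees the raw string."""
    return raw_catalog.get(system_literal, escape(system_literal))

# Hypothetical system literal and catalog entries.
raw = "my dtds/caf\u00e9.dtd"
escaped_catalog = {"my%20dtds/caf%C3%A9.dtd": "file:///local/cafe.dtd"}
raw_catalog = {raw: "file:///local/cafe.dtd"}

print(lookup_strict(raw, escaped_catalog))
print(lookup_via_unescaping(raw, raw_catalog))
print(lookup_shortcut(raw, raw_catalog))
```

All three calls find the same catalog entry here. Note that the shortcut
is only equivalent to the long way when the system literal contains no
%HH sequences of its own, since unescaping would decode those as well.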

Anyway, this looks very much like an implementation detail that
probably shouldn't affect the XML spec very much.

In actual practice, the main problem I expect is that while
catalogs may be well defined to handle characters such as
spaces, it would be really important to make sure that
they can handle all the Unicode characters allowed in XML
in a consistent and stable way. If somebody can point me to
the relevant spec, I'd be glad to check it.
If arbitrary Unicode characters cannot be handled, then
I guess it would be better for the catalog spec to work
with fully escaped URIs, for consistency.
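
For reference, a "fully escaped" form can be sketched in a few lines of
Python, following the steps given in section 4.2.2 (convert each
disallowed character to UTF-8, then write each byte as %HH). The function
names are hypothetical, and the list of excluded ASCII characters is an
assumption based on RFC 2396 section 2.4.3 with "#", "%", "[" and "]"
left alone.

```python
# Assumed set of excluded printable ASCII characters (see lead-in).
EXCLUDED_ASCII = set(' <>"{}|\\^`')

def is_disallowed(ch):
    """True for control characters, DEL, non-ASCII, and excluded ASCII."""
    code = ord(ch)
    return code < 0x20 or code >= 0x7F or ch in EXCLUDED_ASCII

def fully_escape(system_literal):
    """Return the system literal with every disallowed character escaped."""
    pieces = []
    for ch in system_literal:
        if is_disallowed(ch):
            # Convert to UTF-8, then escape each resulting byte as %HH.
            pieces.append("".join("%{:02X}".format(b) for b in ch.encode("utf-8")))
        else:
            pieces.append(ch)
    return "".join(pieces)

print(fully_escape("caf\u00e9 menu.dtd"))   # caf%C3%A9%20menu.dtd
```

The result is pure ASCII, and because "%" itself is not escaped, applying
the function a second time leaves the string unchanged. That stability is
what would make the escaped form a consistent key for a catalog.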

Regards,   Martin.

Received on Thursday, 14 June 2001 00:15:19 UTC