- From: Martin Duerst <duerst@w3.org>
- Date: Thu, 14 Jun 2001 13:14:44 +0900
- To: Francois Yergeau <FYergeau@alis.com>, pgrosso@arbortext.com
- Cc: xml-editor@w3.org, w3c-xml-core-wg@w3.org, w3c-i18n-ig@w3.org, connolly@w3.org
Dear XML core WG,

By chance, I just discovered Proposed Erratum 71:
http://www.w3.org/XML/Group/2000/10/proposed-xml10-2e-errata#PE71

It is true that this is a bit vague in not saying who is responsible for the escaping, but this has been fixed by PE 51/E4 to say that the XML processor is responsible:
http://www.w3.org/XML/Group/2000/10/proposed-xml10-2e-errata#PE51
http://www.w3.org/XML/xml-V10-2e-errata#E4

This mainly very clearly answered whether the escaping has to be done by the creator of the document or by the recipient; it is the recipient. This was already present in the first edition of the XML Rec (see http://www.w3.org/TR/1998/REC-xml-19980210#sec-external-ent), it just got lost when the rules of how exactly to escape were clarified (erratum 78 http://www.w3.org/XML/xml-19980210-errata#E78). Erratum 78 is bogus only in the sense that it dropped some important bit; Erratum 78 together with the second edition erratum 4 is not bogus at all.

However, PE71 seems to be more about details of what happens on the receiver side. Somebody wrote:

>>>>
> When (and
>only when) the string matching the SystemLiteral terminal in the language
>is interpreted as a URI reference, it may need to be escaped before passing
>it around the web. But this part of XML 1.0 cannot be saying that the
>SystemLiteral in the XML file cannot contain certain characters, and there
>is no reason to be doing URI ref escaping before passing the string to the
>catalog resolver. The escaping talked about in 4.2.2 of XML 1.0 should
>happen only when the final URI is determined (after catalog resolution
>and/or absolutization) just before being "sent out" to the server.
>>>>

This seems to imply that a system reference is only to be interpreted as a URI in certain contexts, after various other processing steps. This is definitely not how the XML Spec is written. In particular, it says:

>>>>
The SystemLiteral is called the entity's system identifier. It is a URI reference (as defined in [IETF RFC 2396], updated by [IETF RFC 2732]), meant to be dereferenced to obtain input for the XML processor to construct the entity's replacement text.
>>>>

This does not mean that catalogs cannot be used, but it means that if catalogs are used, they are treated as part of URI reference resolution, similar to HTTP caches and proxies, and so on.

This then brings us back to the question of whether system literals have to be escaped before being checked in the catalog or not. I see two possible interpretations:

1) The XML processor has to do escaping before resolution, and catalog lookup is part of resolution, so escaping is before catalog lookup. (strict reading)

2) If catalogs can handle arbitrary characters, then catalog lookup can do unescaping, and if catalog lookup is built into the XML processor or otherwise tightly coupled, this can be shortcut. (intentional reading)

Anyway, this looks very much like an implementation detail that probably shouldn't affect the XML spec very much.

In actual practice, the main problem I expect is that while catalogs may be well defined to handle characters such as e.g. spaces, it would be really important to make sure that they can handle all the Unicode characters allowed in XML in a consistent and stable way. If somebody can point me to the relevant spec, I'm glad to give it a check. If arbitrary Unicode characters cannot be handled, then I guess it would be better for the catalog spec to work with fully escaped URIs, for consistency.

Regards,
Martin.
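
To make the two readings above concrete, here is a minimal Python sketch. It is an illustration under stated assumptions, not text from the XML spec or from any catalog spec. The escaping function assumes the disallowed set is every non-ASCII character plus the RFC 2396 excluded characters other than "#" and "%", with "[" and "]" allowed again by RFC 2732; each such character is converted to UTF-8 and every byte is written as %HH. The catalog is modelled as a plain dictionary, and the names escape_system_identifier, resolve_strict and resolve_intentional are hypothetical.

```python
# ASCII characters assumed to be disallowed in a URI reference:
# controls 0x00-0x1F, space (0x20), DEL, the delimiters < > " and the
# "unwise" characters { } | \ ^ `  ("#" and "%" are deliberately kept).
_DISALLOWED_ASCII = (
    {chr(c) for c in range(0x21)} | {"\x7f"} | set('<>"{}|\\^`')
)


def escape_system_identifier(sysid: str) -> str:
    """Escape disallowed characters as %HH over their UTF-8 bytes."""
    out = []
    for ch in sysid:
        if ord(ch) > 0x7F or ch in _DISALLOWED_ASCII:
            out.extend(f"%{b:02X}" for b in ch.encode("utf-8"))
        else:
            out.append(ch)
    return "".join(out)


def resolve_strict(sysid: str, catalog: dict) -> str:
    """Reading 1: escape first, then look up; catalog keys are escaped URIs."""
    escaped = escape_system_identifier(sysid)
    return catalog.get(escaped, escaped)


def resolve_intentional(sysid: str, catalog: dict) -> str:
    """Reading 2: the catalog matches raw characters; escape the result."""
    target = catalog.get(sysid, sysid)
    return escape_system_identifier(target)


if __name__ == "__main__":
    # Hypothetical catalogs holding the same mapping in both conventions.
    escaped_catalog = {"my%20doc.dtd": "file:///dtds/my%20doc.dtd"}
    raw_catalog = {"my doc.dtd": "file:///dtds/my doc.dtd"}
    print(resolve_strict("my doc.dtd", escaped_catalog))
    print(resolve_intentional("my doc.dtd", raw_catalog))
```

When the two catalogs describe the same mapping, both functions end up with the same escaped URI. The readings only diverge if the catalog cannot represent some characters consistently, which is the concern about arbitrary Unicode characters raised above.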
Received on Thursday, 14 June 2001 00:15:19 UTC