- From: Jon Bosak <bosak@atlantic-83.Eng.Sun.COM>
- Date: Fri, 31 Jan 1997 19:32:48 -0800
- To: w3c-sgml-wg@www10.w3.org
- cc: bosak@atlantic-83.Eng.Sun.COM
The inclusion of public identifiers in XML was requested by many WG members during the deliberations that led up to the November draft. Following a discussion of public identifiers initiated by a message to the WG from Tim Bray on December 11, at the beginning of January I invited Paul Grosso, Ken Holman, Paul Prescod, and David Durand to form an ad hoc study group to propose a set of specific changes to be made to the XML draft in order to add public identifiers. The question of whether XML should attempt to go beyond the syntax of public identifiers and specify a resolution mechanism was left entirely to the judgement of the group.

Since I have a vested interest in a particular FPI resolution scheme (the one that I helped to design for the documentation system that will be included in Solaris 2.6), I isolated myself from the deliberations of the group and know nothing more about their work than you will find reported below.

Attached are (a) a proposal from the group to use a catalog mechanism for the resolution of public identifiers, (b) a response to the proposal from James Clark, and (c) a response to James's comments by Paul Grosso. These documents are now placed before you for your consideration and comment.

Please note that the ERB has not discussed any details of the proposed solution beyond what you see here and has taken no position on it. Its status at this point is the same as any other suggestion made by members of the WG.

I would like to express my gratitude and convey the thanks of the ERB to Paul G., Paul P., Ken, and David for all the thought and care that went into this proposal.

Jon

========================================================================

Date: Wed, 29 Jan 97 11:53:54 CST
From: paul@arbortext.com (Paul Grosso)
Subject: Re: Latest XML Catalog draft proposal as of 19970127 15:00 CST

General notes:

1. In the XML WD1.0, a system identifier is a URL.
However, the current catalog draft chooses not to put this restriction on the "right hand side" of a catalog entry, though a URL continues to be the most obvious thing to use there. One of our subgroup does prefer to restrict the right hand side to URLs.

2. In this proposal, a public identifier is effectively just some minimum literal that is to be used to provide one initial level of indirection (e.g., via a catalog) before being subjected to whatever resolution process a system identifier would require. Since the XML SGML declaration has FORMAL NO, no mention of FPIs is made. Note also that no mention of URNs or FSIs was deemed necessary. Nothing in this proposed extension to XML either prohibits or prescribes the use of any of these concepts.

3. Since we want to define comments in catalogs to match XML comments, the bulk of the production for catComment is just left as "...".

4. We've placed no "one level only" restriction on the DELEGATE entry type. How to avoid undesirable recursion is left as an implementation issue.

5. Our four person sub-working group was split on whether to add the CATALOG and DELEGATE entry types. At least one of us wanted to have only PUBLIC and CATALOG. At least one wanted PUBLIC and DELEGATE and was unsure we needed CATALOG. We felt, however, that CATALOG and DELEGATE were at least worth writing up and sending to the full WG, and some of us recommend that they are both included in XML.

6. Note that the terms "catalog" and "catalog entry entity" (the latter is called "catalog entry file" in SGML Open TR 9401) are [hopefully] used in a precise manner. In particular, a "catalog" is a logical concept potentially composed of multiple "catalog entry entities". (In FSI terminology, a catalog can be represented as an "sos-sequence" which is a sequence of storage object specifications where each sos [storage object specification] points to a catalog entry entity.)

7. The productions for the catalog are numbered C1-C10 in this proposal; they should probably be renumbered to fit in with the rest of the productions when this proposal is incorporated into the XML spec. The productions numbered 70a-70e in this proposal are also expected to be renumbered in the revision of the XML spec.

---------

Section 4.2.2 "External Entities" in Working Draft 14-Nov-96

[68]  [[unchanged]]
[69]  ExternalID := 'PUBLIC' S PublicID ( S SystemID )?
                  | 'SYSTEM' (S SystemID)?
[70]  SystemLiteral := '"' [^"]* '"' | "'" [^']* "'"
[70a] PublicID := RestrictedLiteral
[70b] SystemID := SystemLiteral
[70c] RestrictedLiteral := '"' RestrictedLiteralChars '"'
                         | "'" RestrictedLiteralChars "'"
[70d] RestrictedLiteralChars := (Letter | Digit | S | SpecialChars)*
[70e] SpecialChars := ['()+,-./:=?]
[71]  [[unchanged]]

The PublicID is called the entity's public identifier. The SystemID that may follow the entity's public identifier and the SystemID that may follow the SYSTEM keyword are called the entity's system identifier.

A public identifier is a RestrictedLiteral that serves as the identifier of public text (i.e., text that is known beyond the context of a single document or system environment) or other shared information object.

If an entity has a public identifier, an XML processor may retrieve the content of the entity using any mapping of public identifier to system identifier that it wishes; in the interest of document portability, this specification describes a particular mapping that is recommended but not required. This mapping is accomplished by using an interpreted version of the public identifier as an index into an associated XML catalog--see the definition of the catalog in 4.3.1 for details. A given XML processor may attempt other mapping algorithms before or (if the catalog fails to produce a successful mapping) after accessing this catalog.
Regardless of the resolution process attempted, if the XML processor cannot successfully use or resolve the public identifier, and a system identifier follows the public identifier, the XML processor shall behave as if the system identifier was the only identifier supplied. If the XML processor cannot successfully use or resolve the public identifier, and a system identifier does not follow the public identifier, the XML processor shall behave as if a system identifier could not be resolved.

The system identifier is a URL, which may be used to retrieve... [[continues as in XML WD 1.0]].

- - - - - - - - - - - - - -

4.3.1 XML Catalog Specification

When an XML processor encounters an external entity defined by a public identifier, the processor may, but need not, determine the associated system identifier required for the tasks described in section 4.3 of this specification from a supplemental XML catalog.

An XML catalog is a logical concept potentially composed of multiple catalog entry entities. For example, a catalog could be specified by a list of file names or URLs where each file name or URL points to a catalog entry entity. The term "catalog" is used in general to refer to the logical thing doing the mapping as well as to refer specifically to the current ordered collection of catalog entry entities; the term "catalog entry entity" is used to refer specifically to a particular entity in the list of entities composing the catalog.

How an XML processor locates and accesses a catalog composed of one or more XML catalog entry entities is not prescribed by this XML specification.

[C1]  XMLCatalog := S? ( ( catComment | catEntry ) ( S ( catComment | catEntry ) )* )?
[C2]  catComment := '--*' ... '*--'
[C3]  catEntry := catPublic | catCatalog | catDelegate | catOtherEntry
[C4]  catPublic := 'PUBLIC' S PublicID S catSystemID
[C5]  catCatalog := 'CATALOG' S catSystemID
[C6]  catDelegate := 'DELEGATE' S PartialPublicID S catSystemID
[C7]  PartialPublicID := RestrictedLiteral
[C8]  catOtherEntry := catKeyword ( S SystemLiteral )+
[C9]  catKeyword := [^S"'./\<>] [^S./\<>]*
[C10] catSystemID := SystemLiteral

An example of several catalog entries follows:

    PUBLIC "ISO 8879:1986//ENTITIES Added Latin 1//EN" "iso-lat1.gml"
    PUBLIC "-//ACME//DTD Report//EN" "http://www.acme.com/dtds/report.dtd"
    PUBLIC "corporate legal boilerplate C312" "<corpdb>get_bplate('C312')"
    DELEGATE "-//SUN::SUNSOFT" "http://docs.sun.com/catalog"
    CATALOG "/home/pubs/catalog.soc"

The PublicID is the public identifier (see also production 70a) of the entity or notation. The catSystemID is a SystemLiteral that is either an entity's or notation's system identifier (such as a URL--see also production 70b) or it is some other SystemLiteral used to locate the entity, recognize the notation, or (in the case of CATALOG and DELEGATE entries) locate another catalog entry entity. An XML processor may choose to determine a system identifier from this SystemLiteral value. Users of an XML processor that determines system identifiers from the SystemLiteral value of the right hand side of a catalog entry must ensure that the syntax of that value is compatible with the processor in use.

Because public identifiers are used as "match indices" into catalogs, an "interpreted" value is defined that is used in the matching process. The public identifiers from the ExternalID production as well as the PublicID and PartialPublicID from the various catalog entries undergo this interpretation prior to any matching being attempted.
This interpretation consists of removing the surrounding quotes, deleting all leading and trailing white space, and replacing all embedded sequences of occurrences of white space (see production 1) with a single space (#x0020) character.

When the interpreted value of the public identifier of an ExternalID (production 69) exactly matches the interpreted value of the public identifier of a PUBLIC type catalog entry, that catalog entry associates its right hand side (catSystemID) with the given entity or notation. The first such exact match terminates the catalog lookup process so that any other types of catalog entries as well as any subsequent PUBLIC entries are ignored.

If there are no PUBLIC entries in a given catalog entry entity that result in an exact match with the interpreted value of the public identifier of the ExternalID, then all DELEGATE catalog entries in that catalog entry entity are considered. The process of comparing the interpreted value of the public identifier of the ExternalID with the PartialPublicID of the DELEGATE catalog entry, which is described in detail below, can be summarized as looking for PartialPublicIDs that are initial substring matches of the public identifier of the ExternalID. If this catalog entry entity produces any such matches, the right hand sides of all such matching entries are used, in order from longest PartialPublicID match to shortest, to generate a new complete logical catalog (i.e., a newly specified list of catalog entry entities) that replaces the current catalog. (If there are multiple occurrences of the same length match, they shall be ordered in the order of the matching DELEGATE entries in the catalog entry entity.)
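
The interpretation and exact PUBLIC matching steps described above can be sketched roughly as follows. This is only an illustration, not part of the proposal: the function names and the tuple layout of `entries` are made up for the example, white space is assumed to mean space, tab, carriage return and line feed, and DELEGATE handling is left out.

```python
import re

# Assumption: white space (production 1) is taken to be space, tab, CR, LF.
_WS_RUN = re.compile(r"[ \t\r\n]+")


def interpret(literal: str) -> str:
    """Return the interpreted value of a quoted RestrictedLiteral."""
    if len(literal) < 2 or literal[0] not in "\"'" or literal[-1] != literal[0]:
        raise ValueError(f"not a quoted literal: {literal!r}")
    body = literal[1:-1]              # remove the surrounding quotes
    body = body.strip(" \t\r\n")      # delete leading and trailing white space
    return _WS_RUN.sub(" ", body)     # collapse embedded runs to a single space


def lookup_public(public_id: str, entries):
    """Return the right hand side of the first exactly matching PUBLIC entry.

    `entries` is a hypothetical list of (keyword, left, right) tuples in
    catalog order, with `left` still quoted. Returns None when nothing matches.
    """
    wanted = interpret(public_id)
    for keyword, left, right in entries:
        if keyword == "PUBLIC" and interpret(left) == wanted:
            return right              # the first exact match ends the lookup
    return None
```

Because both sides are interpreted before comparison, a reference written as `"-//ACME//DTD   Report//EN"` would still match the catalog entry for `"-//ACME//DTD Report//EN"`.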
The catalog lookup process for this public identifier continues with this new (replacement) catalog, ignoring for the purposes of this public identifier any other entry types in the current catalog entry entity as well as any subsequent catalog entry entities that may have been part of the previous list of catalog entry entities. This newly defined catalog is then processed in much the same manner as if it had been the originally specified catalog; however, only the entity's public identifier is available for lookup--its entity name and system identifier (if any) are not available during lookup in any "delegated to" catalog. (This allows delegation to resolve public identifiers but not entity names using, for example, the ENTITY keyword defined in the SGML Open TR9401 catalog.) Lookup for subsequent public identifiers is unaffected by this process; that is, the effect of this replacement catalog holds only for the lookup of the current ExternalID's public identifier.

The exact process of comparing the interpreted value of the public identifier of the ExternalID with the interpreted value of the PartialPublicID of the DELEGATE entry specifies that only a subset of all initial substrings of the public identifier are tested against the PartialPublicID. The interpreted value of the public identifier is conceptually separated into parts delimited by one of the two separating token strings "//" and "::" (without the quotes). The separating tokens themselves are considered parts.
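
The part-splitting rule just described, together with the longest-match-first ordering of DELEGATE entries, might be sketched as follows. This is an illustration only: the function names and the `(partial_public_id, right_hand_side)` pair layout are hypothetical, the inputs are assumed to be already interpreted, and an empty PartialPublicID is assumed not to match.

```python
import re

# The two separating token strings; each separator counts as a part itself.
_SEPARATORS = re.compile(r"(//|::)")


def candidate_prefixes(public_id: str) -> list[str]:
    """Initial substrings of public_id that end on a part boundary."""
    parts = [p for p in _SEPARATORS.split(public_id) if p]  # drop null parts
    prefixes, current = [], ""
    for part in parts:
        current += part
        prefixes.append(current)
    return prefixes


def delegated_catalog(public_id: str, delegates) -> list[str]:
    """Order the right hand sides of all matching DELEGATE entries.

    `delegates` is a hypothetical list of (partial_public_id, right_hand_side)
    pairs in catalog order. The result is the replacement catalog list.
    """
    allowed = set(candidate_prefixes(public_id))
    matches = [(partial, rhs) for partial, rhs in delegates if partial in allowed]
    # list.sort is stable even with reverse=True, so matches of equal
    # length keep their original catalog order.
    matches.sort(key=lambda m: len(m[0]), reverse=True)
    return [rhs for _, rhs in matches]
```

Under these assumptions, a DELEGATE entry for `-//SUN::SUNSOFT` would match a made-up public identifier such as `-//SUN::SUNSOFT//DTD Example//EN`, while `-//SUN::SUN` would not, since it is an initial substring that does not end on a part boundary.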
For example, the string

    -//IETF::HTML-WG//DTD HTML 2.0//EN

is composed of nine parts; the following are the nine (non-null) PartialPublicID values that would cause a DELEGATE entry to match this public identifier:

    -
    -//
    -//IETF
    -//IETF::
    -//IETF::HTML-WG
    -//IETF::HTML-WG//
    -//IETF::HTML-WG//DTD HTML 2.0
    -//IETF::HTML-WG//DTD HTML 2.0//
    -//IETF::HTML-WG//DTD HTML 2.0//EN

If there are no PUBLIC or DELEGATE matches for the interpreted value of the public identifier of this ExternalID in the current catalog entry entity, match processing continues with the next catalog entry entity in the catalog list (if any).

The CATALOG entry can be used to insert new catalog entry entities into the current list of catalog entry entities. The right hand side of a CATALOG entry is used to locate another catalog entry entity that is read after the current catalog entry entity if the current catalog entry entity does not provide a match for the public identifier. Multiple CATALOG entries are allowed, and the referenced catalog entry entities will be inserted into the current catalog list in order. If multiple catalog entry entities were originally provided to this XML processor, the mere existence of a CATALOG entry does not prevent subsequent entities from being searched if the current catalog entry entity and those specified on CATALOG entries provide no match for the public identifier. Note that the effect of any CATALOG entry would occur only after all other entries in this catalog entry entity have been considered.

All catOtherEntry entries have no effect prescribed by this XML specification, though individual implementations may recognize and operate on any or all of them.

Date: Thu, 30 Jan 1997 10:35:34 +0700
From: James Clark <jjc@jclark.com>
Subject: Re: XML Catalog draft proposal

In general I think this is pretty good (though I still have doubts about the wisdom of supporting catalogs in XML).

>1. In the XML WD1.0, a system identifier is a URL. However, the
>   current catalog draft chooses not to put this restriction on
>   the "right hand side" of a catalog entry, though a URL continues
>   to be the most obvious thing to use there.

This is the only thing I strenuously object to. It makes no sense to me to treat the right hand side of a catalog entry any differently from a system identifier.

- It's confusing for the user for there to be two very similar contexts which allow different kinds of system identifier.

- If an implementation can support something in one context, there is no reason for it not to support it in the other.

- By allowing things in catalog RHSs that aren't allowed in system identifiers, you make it impossible to create a tool that "de-catalogs" something, by replacing all public identifiers with system identifiers obtained from the catalog.

James

Date: Thu, 30 Jan 97 10:57:43 CST
From: paul@arbortext.com (Paul Grosso)
Subject: Re: XML Catalog draft proposal

> From: James Clark <jjc@jclark.com>
>
> >1. In the XML WD1.0, a system identifier is a URL. However, the
> >   current catalog draft chooses not to put this restriction on
> >   the "right hand side" of a catalog entry, though a URL continues
> >   to be the most obvious thing to use there.
>
> This is the only thing I strenuously object to. It makes no sense to
> me to treat the right hand side of a catalog entry any differently
> from a system identifier.
>
> - It's confusing for the user for there to be two very similar
>   contexts which allow different kinds of system identifier.
>
> - If an implementation can support something in one context, there is
>   no reason for it not to support it in the other.
>
> - By allowing things in catalog RHSs that aren't allowed in system
>   identifiers, you make it impossible to create a tool that
>   "de-catalogs" something, by replacing all public identifiers with
>   system identifiers obtained from the catalog.

Our subgroup considered all those points, and one of us did end up feeling we should restrict the rhs of catalogs to URLs. (Personally, I think the way we should address all of James' concerns above is to un-restrict XML's system identifiers.)

Allow me to explain some of the concerns we discussed on this issue. It may be that the ERB will see a way to address these concerns.

1. It was a goal to allow the catalog to map public identifiers into URNs. (Do we want to open up the defn of sysids in XML to include URNs, or does the defn of URL already include URNs?)

2. It was a goal to allow XML catalogs to be used with SGML/XML repositories (aka databases). For example, it was felt important to allow an entry such as:

    PUBLIC "corporate legal boilerplate C312" "<corpdb>get_bplate('C312')"

   where the rhs is some database call (whether expressed in FSI syntax as in this example or not).

3. It was a goal (albeit a less important one) to allow catalogs to have FSIs on the rhs.

4. Though URLs remain the most obvious thing for a sysid to be, it might make sense to allow the market to decide what sysids will work. This is one place where it might be best to suggest that URLs give maximum interoperability, but to allow other things to develop. After all, any decent XML tool that finds a sysid that isn't a URL is going to try its best to figure it out anyway, so how is restricting it in the XML spec going to help?

One of the key reasons we opted to include the general notes and, in fact, made the one about the rhs note number 1 is precisely because we wanted to call attention to this fact that we hope and expect will be debated by the ERB and WG.

Our subgroup felt it made sense to provide draft wording that encompassed the most we felt the WG should consider under the assumption that making the proposal more restrictive was easier than going the other way. While the proposal reflects our general consensus, we all could see both sides to most issues.

Having presented our proposal, I'm quite sure that no one in our subgroup would feel it inappropriate for the ERB or WG to decide to restrict the rhs of catalog entries to URLs, especially if the concerns I list above are considered and either addressed or deemed beyond the scope of XML.

paul

Received on Friday, 31 January 1997 22:33:06 UTC