XML catalog draft

The inclusion of public identifiers in XML was requested by many WG
members during the deliberations that led up to the November draft.
Following a discussion of public identifiers initiated by a message to
the WG from Tim Bray on December 11, at the beginning of January I
invited Paul Grosso, Ken Holman, Paul Prescod, and David Durand to
form an ad hoc study group to propose a set of specific changes to be
made to the XML draft in order to add public identifiers.  The
question of whether XML should attempt to go beyond the syntax of
public identifiers and specify a resolution mechanism was left
entirely to the judgement of the group.

Since I have a vested interest in a particular FPI resolution scheme
(the one that I helped to design for the documentation system that
will be included in Solaris 2.6), I isolated myself from the
deliberations of the group and know nothing more about their work than
you will find reported below.

Attached are (a) a proposal from the group to use a catalog mechanism
for the resolution of public identifiers, (b) a response to the
proposal from James Clark, and (c) a response to James's comments by
Paul Grosso.  These documents are now placed before you for your
consideration and comment.

Please note that the ERB has not discussed any details of the proposed
solution beyond what you see here and has taken no position on it.
Its status at this point is the same as any other suggestion made by
members of the WG.

I would like to express my gratitude and convey the thanks of the ERB
to Paul G., Paul P., Ken, and David for all the thought and care that
went into this proposal.

Jon

========================================================================

Date: Wed, 29 Jan 97 11:53:54 CST
From: paul@arbortext.com (Paul Grosso)
Subject: Re: Latest XML Catalog draft proposal as of 19970127 15:00 CST

General notes:

1.  In the XML WD1.0, a system identifier is a URL.  However, the 
    current catalog draft chooses not to put this restriction on 
    the "right hand side" of a catalog entry, though a URL continues 
    to be the most obvious thing to use there.  One member of our
    subgroup does prefer to restrict the right hand side to URLs.

2.  In this proposal, a public identifier is effectively just some 
    minimum literal that is to be used to provide one initial level 
    of indirection (e.g., via a catalog) before being subjected to 
    whatever resolution process a system identifier would require.  
    Since the XML SGML declaration has FORMAL NO, no mention of FPIs 
    is made.  Note also that no mention of URNs or FSIs was deemed 
    necessary.  Nothing in this proposed extension to XML either 
    prohibits or prescribes the use of any of these concepts.

3.  Since we want to define comments in catalogs to match XML comments,
    the bulk of the production for catComment is just left as "...".

4.  We've placed no "one level only" restriction on the DELEGATE
    entry type.  How to avoid undesirable recursion is left as an
    implementation issue.

5.  Our four person sub-working group was split on whether to add 
    the CATALOG and DELEGATE entry types.  At least one of us wanted 
    to have only PUBLIC and CATALOG.  At least one wanted PUBLIC and 
    DELEGATE and was unsure we needed CATALOG.  We felt, however, that 
    CATALOG and DELEGATE were at least worth writing up and sending to the 
    full WG, and some of us recommend that both be included in XML.

6.  Note that the terms "catalog" and "catalog entry entity" (the
    latter is called "catalog entry file" in SGML Open TR 9401) are
    [hopefully] used in a precise manner.  In particular, a "catalog"
    is a logical concept potentially composed of multiple "catalog
    entry entities".  (In FSI terminology, a catalog can be
    represented as an "sos-sequence" which is a sequence of storage 
    object specifications where each sos [storage object specification]
    points to a catalog entry entity.)  

7.  The productions for the catalog are numbered C1-C10 in this proposal;
    they should probably be renumbered to fit in with the rest of the
    productions when this proposal is incorporated into the XML spec.
    The productions numbered 70a-70e in this proposal are also expected
    to be renumbered in the revision of the XML spec.

--------- Section 4.2.2 "External Entities" in Working Draft 14-Nov-96

[68]  [[unchanged]]
[69]  ExternalID := 'PUBLIC' S PublicID ( S SystemID )? |
                   'SYSTEM' (S SystemID)?
[70]  SystemLiteral := '"' [^"]* '"' | "'" [^']* "'"
[70a] PublicID   := RestrictedLiteral
[70b] SystemID   := SystemLiteral
[70c] RestrictedLiteral := 
	'"' RestrictedLiteralChars '"' | "'" RestrictedLiteralChars "'"
[70d] RestrictedLiteralChars := (Letter | Digit | S | SpecialChars)*
[70e] SpecialChars := ['()+,-./:=?]
[71]  [[unchanged]]

The PublicID is called the entity's public identifier.  The SystemID
that may follow the entity's public identifier and the SystemID that
may follow the SYSTEM keyword are called the entity's system
identifier.

A public identifier is a RestrictedLiteral that serves as the
identifier of public text (i.e., text that is known beyond the context
of a single document or system environment) or other shared information
object.

If an entity has a public identifier, an XML processor may retrieve the
content of the entity using any mapping of public identifier to system
identifier that it wishes; in the interest of document portability,
this specification describes a particular mapping that is recommended
but not required.  This mapping is accomplished by using an interpreted
version of the public identifier as an index into an associated XML
catalog--see the definition of the catalog in 4.3.1 for details.  A
given XML processor may attempt other mapping algorithms before or (if
the catalog fails to produce a successful mapping) after accessing this
catalog.

Regardless of the resolution process attempted, if the XML processor
cannot successfully use or resolve the public identifier, and a system
identifier follows the public identifier, the XML processor shall
behave as if the system identifier were the only identifier supplied.
If the XML processor cannot successfully use or resolve the public
identifier, and a system identifier does not follow the public
identifier, the XML processor shall behave as if a system identifier
could not be resolved.
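
The fallback rule in the paragraph above can be sketched roughly as
follows.  This is only an illustration: the function name and the
resolve_public callback are hypothetical stand-ins for whatever
mapping (catalog or otherwise) a processor chooses to use, and None
is used here to mean "could not be resolved".

```python
from typing import Callable, Optional


def effective_system_id(
    public_id: Optional[str],
    system_id: Optional[str],
    resolve_public: Callable[[str], Optional[str]],
) -> Optional[str]:
    """Pick the identifier a processor would go on to retrieve.

    resolve_public is a hypothetical callback that returns a system
    identifier for a public identifier, or None if it cannot map it.
    """
    if public_id is not None:
        resolved = resolve_public(public_id)
        if resolved is not None:
            return resolved
    # Public identifier absent or unresolved: behave as if only the
    # system identifier had been supplied.  A result of None means the
    # entity is treated like an unresolvable system identifier.
    return system_id


def no_mapping(public_id: str) -> Optional[str]:
    """A resolver that never finds anything, to exercise the fallback."""
    return None
```

With no_mapping as the resolver, an ExternalID carrying both
identifiers falls back to its system identifier, and one carrying
only a public identifier yields None.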

The system identifier is a URL, which may be used to retrieve...
[[continues as in XML WD 1.0]].

- - - - - - - - - - - - - -

4.3.1 XML Catalog Specification

When an XML processor encounters an external entity defined by a public
identifier, the processor may, but need not, determine the associated
system identifier required for the tasks described in section 4.3 of
this specification from a supplemental XML catalog.

An XML catalog is a logical concept potentially composed of multiple
catalog entry entities.  For example, a catalog could be specified by a
list of file names or URLs where each file name or URL points to a
catalog entry entity.  The term "catalog" is used in general to refer
to the logical thing doing the mapping as well as to refer specifically
to the current ordered collection of catalog entry entities; the term
"catalog entry entity" is used to refer specifically to a particular
entity in the list of entities composing the catalog.

How an XML processor locates and accesses a catalog composed of one
or more XML catalog entry entities is not prescribed by this XML
specification.

[C1]  XMLCatalog  := S? ( ( catComment | catEntry )
                        ( S ( catComment | catEntry ) )* )?
[C2]  catComment  := '--*' ... '*--'
[C3]  catEntry    := catPublic | catCatalog | catDelegate | catOtherEntry
[C4]  catPublic   := 'PUBLIC' S PublicID S catSystemID
[C5]  catCatalog  := 'CATALOG' S catSystemID 
[C6]  catDelegate := 'DELEGATE' S PartialPublicID S catSystemID
[C7]  PartialPublicID := RestrictedLiteral
[C8]  catOtherEntry := catKeyword ( S SystemLiteral )+
[C9]  catKeyword  := [^S"'./\<>] [^S./\<>]*
[C10] catSystemID := SystemLiteral

An example of several catalog entries follows:
  PUBLIC "ISO 8879:1986//ENTITIES Added Latin 1//EN" "iso-lat1.gml"
  PUBLIC "-//ACME//DTD Report//EN" "http://www.acme.com/dtds/report.dtd"
  PUBLIC "corporate legal boilerplate C312" "<corpdb>get_bplate('C312')"
  DELEGATE "-//SUN::SUNSOFT"     "http://docs.sun.com/catalog"
  CATALOG "/home/pubs/catalog.soc"
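
As a rough illustration of productions C3-C10, the entries above can
be split into keywords and literals with a small tokenizer.  This is
a sketch under simplifying assumptions, not a conforming parser: it
does not handle catComment (whose production is elided above), does
not check the character restrictions of catKeyword or
RestrictedLiteral, and the names TOKEN, SAMPLE and parse_entries are
made up for the example.

```python
import re

# A token is a double-quoted literal, a single-quoted literal, or a bare
# word (treated here as an entry keyword).  Comments are NOT handled.
TOKEN = re.compile(r'''"([^"]*)"|'([^']*)'|([^\s"']+)''')

SAMPLE = """
  PUBLIC "ISO 8879:1986//ENTITIES Added Latin 1//EN" "iso-lat1.gml"
  PUBLIC "-//ACME//DTD Report//EN" "http://www.acme.com/dtds/report.dtd"
  PUBLIC "corporate legal boilerplate C312" "<corpdb>get_bplate('C312')"
  DELEGATE "-//SUN::SUNSOFT"     "http://docs.sun.com/catalog"
  CATALOG "/home/pubs/catalog.soc"
"""


def parse_entries(text):
    """Return a list of (keyword, [literal, ...]) tuples."""
    entries = []
    for match in TOKEN.finditer(text):
        double_quoted, single_quoted, word = match.groups()
        if word is not None:
            # A bare word starts a new catalog entry.
            entries.append((word, []))
            continue
        if not entries:
            raise ValueError("literal found before any entry keyword")
        literal = double_quoted if double_quoted is not None else single_quoted
        entries[-1][1].append(literal)
    return entries


if __name__ == "__main__":
    for keyword, literals in parse_entries(SAMPLE):
        print(keyword, literals)
```

The literals are returned with their surrounding quotes already
removed but with internal white space untouched.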

The PublicID is the public identifier (see also production 70a) 
of the entity or notation.  

The catSystemID is a SystemLiteral that is either an entity's or 
notation's system identifier (such as a URL--see also production 70b) 
or it is some other SystemLiteral used to locate the entity, recognize 
the notation, or (in the case of CATALOG and DELEGATE entries) locate 
another catalog entry entity.  An XML processor may choose to determine 
a system identifier from this SystemLiteral value.  Users of an XML 
processor that determines system identifiers from the SystemLiteral 
value of the right hand side of a catalog entry must ensure that the 
syntax of that value is compatible with the processor in use.

Because public identifiers are used as "match indices" into catalogs,
an "interpreted" value is defined that is used in the matching process.
The public identifiers from the ExternalID production as well as the
PublicID and PartialPublicID from the various catalog entries undergo 
this interpretation prior to any matching being attempted.  This 
interpretation consists of removing the surrounding quotes, deleting
all leading and trailing white space, and replacing all embedded
sequences of occurrences of white space (see production 1) with a
single space (#x0020) character.
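
The interpretation step can be sketched as below.  The sketch
assumes that white space (production 1) means space, tab, carriage
return and line feed; if the draft's production includes other
characters, the pattern would need to be extended.  The function
name interpret is made up for the example.

```python
import re

# Assumed white space set for production 1: space, tab, CR, LF.
_WS_CHARS = " \t\r\n"
_WS_RUN = re.compile("[" + _WS_CHARS + "]+")


def interpret(literal):
    """Return the interpreted value of a quoted public identifier literal."""
    # Remove the surrounding quotes if the literal still carries them.
    if len(literal) >= 2 and literal[0] == literal[-1] and literal[0] in "\"'":
        literal = literal[1:-1]
    # Delete leading and trailing white space, then collapse each
    # embedded run of white space to a single space character.
    return _WS_RUN.sub(" ", literal.strip(_WS_CHARS))
```

Two literals that differ only in quoting style, line breaks or
runs of spaces therefore compare equal after interpretation.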

When the interpreted value of the public identifier of an ExternalID
(production 69) exactly matches the interpreted value of the public
identifier of a PUBLIC type catalog entry, that catalog entry
associates its right hand side (catSystemID) with the given entity or
notation.  The first such exact match terminates the catalog lookup
process so that any other types of catalog entries as well as any
subsequent PUBLIC entries are ignored.

If there are no PUBLIC entries in a given catalog entry entity that
result in an exact match with the interpreted value of the public
identifier of the ExternalID, then all DELEGATE catalog entries in 
that catalog entry entity are considered.  The process of comparing the
interpreted value of the public identifier of the ExternalID with the
PartialPublicID of the DELEGATE catalog entry, which is described in
detail below, can be summarized as looking for PartialPublicIDs
that are initial substring matches of the public identifier of the
ExternalID.  If this catalog entry entity produces any such matches,
the right hand sides of all such matching entries are used, in order
from longest PartialPublicID match to shortest, to generate a new
complete logical catalog (i.e., a newly specified list of catalog entry
entities) that replaces the current catalog.  (If there are multiple
occurrences of the same length match, they shall be ordered in the
order of the matching DELEGATE entries in the catalog entry entity.)

The catalog lookup process for this public identifier continues with
this new (replacement) catalog, ignoring for the purposes of this
public identifier any other entry types in the current catalog entry
entity as well as any subsequent catalog entry entities that may have
been part of the previous list of catalog entry entities.  This newly
defined catalog is then processed in much the same manner as if it had
been the originally specified catalog; however, only the entity's
public identifier is available for lookup--its entity name and system
identifier (if any) are not available during lookup in any "delegated
to" catalog.  (This allows delegation to resolve public identifiers but
not entity names using, for example, the ENTITY keyword defined in the
SGML Open TR9401 catalog.)  Lookup for subsequent public identifiers is
unaffected by this process; that is, the effect of this replacement
catalog holds only for the lookup of the current ExternalID's public 
identifier.

The exact process of comparing the interpreted value of the public
identifier of the ExternalID with the interpreted value of the
PartialPublicID of the DELEGATE entry specifies that only a subset of
all initial substrings of the public identifier are tested against the
PartialPublicID.  The interpreted value of the public identifier is
conceptually separated into parts delimited by one of the two separating
token strings "//" and "::" (without the quotes).  The separating tokens
themselves are considered parts.  For example, the string
	-//IETF::HTML-WG//DTD HTML 2.0//EN
is composed of nine parts; the following are the nine (non-null) 
PartialPublicID values that would cause a DELEGATE entry to match 
this public identifier:
	-
	-//
	-//IETF
	-//IETF::
	-//IETF::HTML-WG
	-//IETF::HTML-WG//
	-//IETF::HTML-WG//DTD HTML 2.0
	-//IETF::HTML-WG//DTD HTML 2.0//
	-//IETF::HTML-WG//DTD HTML 2.0//EN
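
The splitting into parts, and the list of PartialPublicID values
that match, can be reproduced with a short sketch.  The function
names are made up for the example, and the input is assumed to be
an already interpreted public identifier.

```python
import re


def parts(public_id):
    """Split on "//" and "::", keeping the separators as parts."""
    # The capturing group makes re.split return the separators too;
    # empty strings (null parts) are dropped.
    return [p for p in re.split(r"(//|::)", public_id) if p]


def delegate_prefixes(public_id):
    """Return every initial substring that ends on a part boundary."""
    prefixes = []
    so_far = ""
    for part in parts(public_id):
        so_far += part
        prefixes.append(so_far)
    return prefixes


if __name__ == "__main__":
    for prefix in delegate_prefixes("-//IETF::HTML-WG//DTD HTML 2.0//EN"):
        print(prefix)
```

A PartialPublicID such as "-//IE" is an initial substring of the
example identifier but does not end on a part boundary, so it is not
among the values produced and would not cause a match.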

If there are no PUBLIC or DELEGATE matches for the interpreted value 
of the public identifier of this ExternalID in the current catalog entry 
entity, match processing continues with the next catalog entry entity 
in the catalog list (if any).  The CATALOG entry can be used to 
insert new catalog entry entities into the current list of catalog 
entry entities.  The right hand side of a CATALOG entry is used to 
locate another catalog entry entity that is read after the current 
catalog entry entity if the current catalog entry entity does not 
provide a match for the public identifier.  Multiple CATALOG entries 
are allowed, and the referenced catalog entry entities will be inserted 
into the current catalog list in order.  If multiple catalog entry 
entities were originally provided to this XML processor, the mere 
existence of a CATALOG entry does not prevent subsequent entities 
from being searched if the current catalog entry entity and those 
specified on CATALOG entries provide no match for the public identifier.  
Note that the effect of any CATALOG entry would occur only after all 
other entries in this catalog entry entity have been considered.
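
Putting the PUBLIC, DELEGATE and CATALOG rules together, the lookup
for one public identifier might be sketched as follows.  Everything
here is illustrative: a catalog is modelled as a list of catalog
entry entities, each a list of (keyword, [literals]) tuples; the
load callback, the ENTITIES table, the file names and the max_steps
guard against runaway delegation are all assumptions of the sketch
(the proposal leaves recursion control to the implementation); and
white space is again assumed to be space, tab, CR and LF.

```python
import re
from itertools import accumulate


def interpret(value):
    """Trim and collapse white space (quotes assumed already removed)."""
    return re.sub(r"[ \t\r\n]+", " ", value.strip(" \t\r\n"))


def part_prefixes(public_id):
    """Initial substrings of public_id that end on a "//" or "::" boundary."""
    pieces = [p for p in re.split(r"(//|::)", public_id) if p]
    return set(accumulate(pieces))


def lookup(public_id, catalog, load, max_steps=100):
    """Return the matching right hand side, or None if nothing matches."""
    target = interpret(public_id)
    prefixes = part_prefixes(target)
    pending = list(catalog)  # catalog entry entities still to be searched
    steps = 0
    while pending:
        steps += 1
        if steps > max_steps:  # crude recursion guard, an assumption
            raise RuntimeError("catalog lookup exceeded max_steps")
        entity = pending.pop(0)
        # 1. The first exact PUBLIC match ends the lookup.
        for keyword, args in entity:
            if keyword == "PUBLIC" and interpret(args[0]) == target:
                return args[1]
        # 2. Matching DELEGATE entries replace the whole catalog, longest
        #    PartialPublicID first; the stable sort keeps ties in entry order.
        delegates = [args for keyword, args in entity
                     if keyword == "DELEGATE" and interpret(args[0]) in prefixes]
        if delegates:
            delegates.sort(key=lambda args: -len(interpret(args[0])))
            pending = [load(args[1]) for args in delegates]
            continue
        # 3. CATALOG entries are read next, ahead of any remaining entities.
        inserted = [load(args[0]) for keyword, args in entity
                    if keyword == "CATALOG"]
        pending = inserted + pending
    return None


# Hypothetical in-memory "storage" standing in for real entity retrieval.
ENTITIES = {
    "sun.cat": [("PUBLIC", ["-//SUN::SUNSOFT//DTD Manual//EN", "manual.dtd"])],
    "extra.cat": [("PUBLIC", ["-//ACME//DTD Memo//EN", "memo.dtd"])],
}
ROOT = [
    ("PUBLIC", ["-//ACME//DTD Report//EN", "http://www.acme.com/dtds/report.dtd"]),
    ("DELEGATE", ["-//SUN::SUNSOFT", "sun.cat"]),
    ("CATALOG", ["extra.cat"]),
]
```

Calling lookup(public_id, [ROOT], ENTITIES.__getitem__) resolves the
ACME report DTD directly, reaches manual.dtd only through the
DELEGATE entry, and reaches memo.dtd only through the CATALOG entry.
A public identifier that is delegated but not found in the
delegated-to catalog yields None rather than falling through to
extra.cat, since the delegated catalog replaces the original list.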

All catOtherEntry entries have no effect prescribed by this XML
specification, though individual implementations may recognize
and operate on any or all of them.




Date: Thu, 30 Jan 1997 10:35:34 +0700
From: James Clark <jjc@jclark.com>
Subject: Re: XML Catalog draft proposal

In general I think this is pretty good (though I still have doubts
about the wisdom of supporting catalogs in XML).

>1.  In the XML WD1.0, a system identifier is a URL.  However, the 
>    current catalog draft chooses not to put this restriction on 
>    the "right hand side" of a catalog entry, though a URL continues 
>    to be the most obvious thing to use there. 

This is the only thing I strenuously object to.  It makes no sense to
me to treat the right hand side of a catalog entry any differently
from a system identifier.

- It's confusing for the user for there to be two very similar
contexts which allow different kinds of system identifier.

- If an implementation can support something in one context, there is
no reason for it not to support it in the other.

- By allowing things in catalog RHSs that aren't allowed in system
identifiers, you make it impossible to create a tool that
"de-catalogs" something, by replacing all public identifiers with
system identifiers obtained from the catalog.

James



Date: Thu, 30 Jan 97 10:57:43 CST
From: paul@arbortext.com (Paul Grosso)
Subject: Re: XML Catalog draft proposal

> From: James Clark <jjc@jclark.com>
> 
> >1.  In the XML WD1.0, a system identifier is a URL.  However, the 
> >    current catalog draft chooses not to put this restriction on 
> >    the "right hand side" of a catalog entry, though a URL continues 
> >    to be the most obvious thing to use there. 
> 
> This is the only thing I strenuously object to.  It makes no sense to
> me to treat the right hand side of a catalog entry any differently
> from a system identifier.
> 
> - It's confusing for the user for there to be two very similar
> contexts which allow different kinds of system identifier.
> 
> - If an implementation can support something in one context, there is
> no reason for it not to support it in the other.
> 
> - By allowing things in catalog RHSs that aren't allowed in system
> identifiers, you make it impossible to create a tool that
> "de-catalogs" something, by replacing all public identifiers with
> system identifiers obtained from the catalog.

Our subgroup considered all those points, and one of us did end up
feeling we should restrict the rhs of catalogs to URLs.  (Personally,
I think the way we should address all of James' concerns above is to
un-restrict XML's system identifiers.)

Allow me to explain some of the concerns we discussed on this issue.
It may be that the ERB will see a way to address these concerns.

1.  It was a goal to allow the catalog to map public identifiers into
    URNs.  (Do we want to open up the defn of sysids in XML to include
    URNs, or does the defn of URL already include URNs?)

2.  It was a goal to allow XML catalogs to be used with SGML/XML
    repositories (aka databases).  For example, it was felt important
    to allow an entry such as:
   PUBLIC "corporate legal boilerplate C312" "<corpdb>get_bplate('C312')"
    where the rhs is some database call (whether expressed in FSI syntax
    as in this example or not).

3.  It was a goal (albeit a less important one) to allow catalogs to 
    have FSIs on the rhs. 

4.  Though URLs remain the most obvious thing for a sysid to be, it
    might make sense to allow the market to decide what sysids will
    work.  This is one place where it might be best to suggest that
    URLs give maximum interoperability, but to allow other things
    to develop.  After all, any decent XML tool that finds a sysid
    that isn't a URL is going to try its best to figure it out anyway,
    so how is restricting it in the XML spec going to help?

One of the key reasons we opted to include the general notes and, in
fact, made the one about the rhs note number 1 is precisely because we
wanted to call attention to this issue, which we hope and expect will be
debated by the ERB and WG.  Our subgroup felt it made sense to provide
draft wording that encompassed the most we felt the WG should consider
under the assumption that making the proposal more restrictive was
easier than going the other way.  While the proposal reflects our
general consensus, we all could see both sides to most issues.  Having
presented our proposal, I'm quite sure that no one in our subgroup
would feel it inappropriate for the ERB or WG to decide to restrict the
rhs of catalog entries to URLs, especially if the concerns I list above
are considered and either addressed or deemed beyond the scope of XML.

paul

Received on Friday, 31 January 1997 22:33:06 UTC