RFC: A Distributed Universal SGML/XML Catalogue Management System

Rationale
=========

Many applications today benefit from an SGML and/or XML Entity Catalogue
to dereference entities referenced by a Public Identifier.  For a
validating SGML parser this is an absolute requirement.  For any
SGML or XML parser it serves to enable entities such as DTDs and
modules to be resolved locally.

Hitherto, different packages and applications have each distributed their
own entity catalogues.  Examples are DocBook, HTML Validators, the OpenSP
parser, and operating system distros.  However, there is little
coordination between the distributors of these, and no common package on
which distributors can rely.  Even in tightly-controlled environments such
as the Debian packages, the W3C Validator includes its own Entity
Catalogue rather than relying on it being available as a dependency.

This situation should be rationalised to allow for an SGML and XML
catalogue to be a single package on which other packages can depend.
In this note, we propose a framework for managing such a package.

Goals
=====

* To maintain a Universal Catalogue.
* To provide an automated process for generating local installations of
  all or part of the Universal Catalogue.
* To minimise the effort and coordination required to ensure that the
  universal catalogues and local installations remain up-to-date.
  In particular, end-users should be offered a self-maintaining default
  installation that eliminates effort on their part altogether.
* To enable control of different parts of the catalogue to be delegated
  to the people/organisations responsible for them.

A loose analogy could be drawn to DNS.  But since immediate lookup of
[SG|X]ML entities is dealt with by SYSTEM ids, we only have to deal with
efficient caching of local copies of PUBLIC ids.  Entities are in
general long-lived, but by no means immutable (for example, the MathML 2
DTD modules have undergone several minor revisions).

Managing a Universal Catalogue
==============================

In principle, all organisations creating public identifiers should be
registered with ISO.  But this is not widely practised, and the present
chaotic situation indicates that it is not effectively meeting today's
needs.  We propose that a distributed architecture for automating
catalogue management is both feasible and preferable.

#### ISO registry: availability???

Our proposal envisages a central registry, cooperating with a set of
recognised repositories each managing its own entity catalogue locally.
For example, the W3C, WAP Forum and OASIS each manage their own catalogues
independently.  Likewise, different groups acting independently within
W3C are responsible for different areas such as HTML, MathML, SVG and
SMIL.  We propose that a universal catalogue will work best if
responsibility for each sub-catalogue is explicitly devolved to the
working group responsible for defining it.  The central registry will
serve merely to reference the responsible groups, in a manner somewhat
analogous to DNS.
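
As a rough illustration of the delegation idea, the sketch below reads the
owner field out of a formal public identifier and looks it up in a small
registry table.  The registry contents, the example.org URLs and the
function names are hypothetical and are not part of the proposal; only
the "prefix//owner//description//language" layout of public identifiers
is assumed.

```python
# Hypothetical registry: owner identifier -> URL of that owner's sub-catalogue.
# The URLs are placeholders, not real catalogue locations.
REGISTRY = {
    "W3C": "http://example.org/catalogues/w3c.xml",
    "OASIS": "http://example.org/catalogues/oasis.xml",
}


def owner_of(public_id):
    """Return the owner field of a formal public identifier, or None."""
    fields = [field.strip() for field in public_id.split("//")]
    if len(fields) < 3:
        return None  # does not look like a formal public identifier
    if fields[0] in ("+", "-"):
        # "+//" (registered) or "-//" (unregistered) owner prefix
        return fields[1] or None
    # ISO-owned identifiers carry no +/- prefix; the owner comes first
    return fields[0] or None


def delegate(public_id, registry=REGISTRY):
    """Return the sub-catalogue URL responsible for public_id, or None."""
    return registry.get(owner_of(public_id))
```

Here the central table holds nothing but pointers, so each owner remains
free to reorganise its own sub-catalogue without touching the registry.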

This is broadly in line with the registry already run by the ISO but
not widely used.  What our proposal adds is the availability of the
registry online in machine-readable format, and its integration with
catalogues maintained by each participating organisation.  It is
possible that tying the registry in to the distribution of markup
libraries and catalogues may in itself be an incentive for organisations
to register.

#### Implications for naming conventions?

Implementation
==============

Since the Universal Catalogue serves SGML and XML applications, it is
appropriate that it should itself be capable of implementation as an
SGML or XML application.  This is straightforward: all we need is a
DTD for declaring catalogues and catalogue entries, and a list of
entities defining catalogues maintained by the groups entrusted with
doing so.  This is then implemented by a program to fetch the data
required and write the catalogues.  Local installations may be
customised by selecting which entities to include, while package
maintainers can ship a standard configuration.
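
The final step, writing a local catalogue restricted to selected entities,
might look something like the sketch below.  It assumes the local
catalogue is written in the plain-text OASIS catalogue format (PUBLIC
lines, as read by parsers such as OpenSP) and that selection is done by
owner; the function name, the selection rule and the example paths are
illustrative assumptions rather than a description of the demonstrator.

```python
def render_catalogue(entries, wanted_owners=None):
    """Render {public id: local path} as OASIS-style PUBLIC catalogue lines.

    If wanted_owners is given, only identifiers whose owner field matches
    one of those names are included (a hypothetical selection rule).
    """
    lines = []
    for public_id, system_path in sorted(entries.items()):
        if wanted_owners is not None:
            if not any(f"//{owner}//" in public_id for owner in wanted_owners):
                continue  # entity not selected for this installation
        lines.append(f'PUBLIC "{public_id}" "{system_path}"')
    return "".join(line + "\n" for line in lines)


# Example entries; the local paths are made up for illustration.
EXAMPLE_ENTRIES = {
    "-//W3C//DTD HTML 4.01//EN": "html4/strict.dtd",
    "-//OASIS//DTD DocBook XML V4.2//EN": "docbook/docbookx.dtd",
}
```

Calling render_catalogue(EXAMPLE_ENTRIES, {"W3C"}) would keep only the
HTML entry, which is the kind of customisation a local installation
might apply.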

An implementation demonstrating the above, CatalogueManager, is available
at <URL:http://valet.webthing.com/catalogue/>.  It fetches the master
catalogue, DTD and Entities by HTTP.  It updates all defined entries,
but uses the HTTP If-Modified-Since header to avoid the overhead of
re-fetching anything that is already up-to-date in the local installation.
It can therefore be run regularly (e.g. monthly) with minimal overhead.
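
The conditional fetch can be sketched with the Python standard library as
follows.  This is a minimal illustration, not the demonstrator's code: it
assumes the modification time of the local file is a good enough value
for If-Modified-Since, and the function names are invented for this
example.

```python
import os
import urllib.error
import urllib.request
from email.utils import formatdate


def build_conditional_request(url, local_path):
    """Build a GET request that lets the server skip unchanged files."""
    request = urllib.request.Request(url)
    if os.path.exists(local_path):
        mtime = os.path.getmtime(local_path)
        # HTTP dates are always expressed in GMT.
        request.add_header("If-Modified-Since", formatdate(mtime, usegmt=True))
    return request


def refresh(url, local_path):
    """Fetch url into local_path; return True if the local copy was updated."""
    request = build_conditional_request(url, local_path)
    try:
        with urllib.request.urlopen(request) as response:
            data = response.read()
    except urllib.error.HTTPError as err:
        if err.code == 304:  # Not Modified: the local copy is current
            return False
        raise
    with open(local_path, "wb") as out:
        out.write(data)
    return True
```

A periodic job would simply call refresh() for each entry; unchanged
entities cost one small request and no download.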

CatalogueManager may be used as-is, but is intended as a proof-of-concept.
Non-technical issues such as how to delegate responsibility for different
sub-catalogues need to be addressed, and the file format used for
the demonstrator is likely to be subject to improvement.

Security
========

A package such as CatalogueManager that updates system files based on
third-party definitions has the potential to introduce malicious files.
It is strongly recommended that standard system security be used to
avoid serious consequences in the event of any of the sub-catalogues
being compromised.  CatalogueManager should run as a user with no
privilege to write to the local filesystem except within a designated
SGML/XML library area, such as /usr/local/share/sgmlib.
Distributors creating a package such as an RPM of CatalogueManager
should ensure their users' security.
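
In addition to filesystem permissions, an updater could itself refuse any
catalogue-supplied filename that would land outside the designated area.
The check below is a sketch under the assumption of a POSIX-style
filesystem; the function name is hypothetical and nothing here is claimed
to be present in CatalogueManager.

```python
import os


def is_inside(library_root, candidate):
    """Return True if candidate, taken relative to library_root, stays inside it."""
    root = os.path.realpath(library_root)
    # An absolute candidate replaces root in the join, and ".." segments or
    # symlinks are resolved by realpath, so both kinds of escape are detected.
    target = os.path.realpath(os.path.join(root, candidate))
    return os.path.commonpath([root, target]) == root
```

For example, with /usr/local/share/sgmlib as the root, a relative name
such as html4/strict.dtd is accepted, while ../../../etc/passwd or an
absolute /etc/passwd is rejected.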

A more inherently secure architecture would generate all local filenames
internally, and is probably preferable.  The current implementation serves
for backward compatibility until the proposal can be considered stable.
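
One way of generating local filenames internally is to derive them from a
hash of the public identifier, so that no remotely supplied string ever
reaches the filesystem.  The scheme below (SHA-256, a two-character
subdirectory) is only one possible choice, made up for illustration; the
proposal does not specify how such names would be generated.

```python
import hashlib
import os


def local_path_for(public_id, library_root, suffix=""):
    """Map a public identifier to a filename chosen entirely by the updater."""
    digest = hashlib.sha256(public_id.encode("utf-8")).hexdigest()
    # The digest contains only hexadecimal characters, so the result cannot
    # contain path separators or ".." whatever the identifier looks like.
    return os.path.join(library_root, digest[:2], digest + suffix)
```

The mapping is deterministic, so the catalogue writer and the fetcher can
each compute the same path independently from the identifier alone.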

Received on Tuesday, 9 September 2003 16:34:46 UTC