Re: XML catalog draft from Alex Milowski on 1997-02-09 (w3c-sgml-wg@w3.org from February 1997)

From: Alex Milowski <lex@www.copsol.com>
Date: Sun, 9 Feb 1997 14:29:39 -0600 (CST)
To: w3c-sgml-wg@w3.org
Message-Id: <199702092029.OAA10942@copsol.com>
Hello all,

I have been following the discussion of XML catalogs *very* close.  I'm
going to start this e-mail with something that many of you may not
agree with:

   Catalogs are *not* sufficient for manipulation and transmission of
   XML (or SGML) documents.

Let me regress,...

Once upon a not-to-distant-past, I designed and wrote a simple SGML
repository for a client.  They had to have something quick and nothing
in the industry really met their needs for the cost they would have to
spend on buying one.  So, I set out to build a repository.

I had some simple design constraints stemming from decisions on "standards"
for their authoring environment:

  1. Every entity had a unique name across all documents.
  2. Every public identifier is unique across all documents.
  3. Every entity name starts with a "owner" code.
  4. After the owner code, a code that identifies the entity use is
     required.

For example, if the owner was 'CSI' and the entity was a passage of legal
text, its use code was 'LNG' for 'language'.  Thus, the entity would
be named 'CSI.LNG.Something'.

In addition, we were moving from a file-base authoring environment and the
authors needed to be able to organize entities much like they had organized
files.  Thus, we needed a virtual directory structure that enforced unique
entity names across all entities in the repositories.

I created a simple object model that had a flat namespace for maintaining the
constraints outlined above and had other organizational constructs for
presenting a usable interface to a user.

Note: This company has 11,000 "entities" for 100 documents.

Now, once the repository was built, I had to import the data.  Well, I thought:
"Easy, I have an SGML Open catalog that has all of them defined."

Wrong.  It doesn't work.

First, everything with a public identifier can't be imported because there
is no entity name defined.  From SGML, this really doesn't matter.  That is,
I can have two entities x and y that have the same public identifier.  In
this model--environment constraint--that would be wrong (illegal).

We needed these constraints to maximize reuse and allow the ability for
questions to be asked like:  Who uses this entity?

Now, what does this have to do with catalogs?

I my mind (sometimes a scary place), if SGML Open catalogs can't do the
above, it begs the question of *why* it is not complete.  The answer--again, in
my mind--is that they are single-purpose.  Resolve this entity to something.

We have a greater issue in XML of solving the fundamental problem that
our documents exist as a collection of distinct information types with 
relationships (a hypertext relationship!), not single "files" or streams
that can cascade resolution schemes for finding the rest of the pieces.

Lets look at the simple case:

If my document is self-contained, does an XML processor "know" that it
doesn't need a catalog?  Maybe.

...more complex:

If I have a document which refers to an entity x, which has two possible
choices (configurations), what do I have to do to configure this?

If we keep expanding such questions, we can see that additional constructs
would have to be put in place to handle different problems such as databases.
We we have done is designed to the specific case when our problem is much
more broad.  We need to step back and see how we can orchestrate a broad
solution that is, to some extent, future proof (extendible).

In my thinking for this client I mentioned, we have already discussed the
idea of "meta-documents".  Documents that describe all the components
necessary and all the variations (if possible) such that an application
can load this meta-document.

The essential idea is this:

   The XML (SGML) document is at the same "class" or "level" as the
   other information components--style-sheets, graphics, transformations,
   etc.  It is a very important component in the system, but not
   very useful in a practical way without the other components.  Thus, it
   is *not* the starting point.

   The collective information about all the components necessary to
   handle, process, identify, etc. this document--the meta-document--is
   the first-class construct.  The meta-document is the starting point.

From this, I'm going to make some observations:

1. It is possible to define a known document type to encode such a 
   meta-document.  This is not necessary but it would seem unlikely to
   developed "yet-another-document-information-encoding" (YADIE).

2. The meta-document and its components could be encoded in one stream
   such that a request is made for the meta document "stream" and
   all information is shipped to the application.  This does *not*
   require that all components are packed into this stream.  It only
   requires that they be identified.  Anyone familiar with the concept of
   JAR files can see where this is coming from.

3. For simple documents one-time documents, packing everything into
   a stream is very useful.  The meta-document in this case may be very
   small--only a few bytes.

4. The need for "catalog location", "style-sheet location", etc. standards
   goes away and gets replaced with a uniform mechanism for describing
   component locations and relationships.

I have attached a very simple DTD for such a meta-document.  I wrote this
very quickly and it *doesn't* contain all the ideas that I have about this
subject but I wanted something concrete.

==============================================================================
R. Alexander Milowski     http://www.copsol.com/   alex@copsol.com
Copernican Solutions Incorporated                  (612) 379 - 3608
     
<!-- A simple DTD written in 5 minutes... be kind. -->
<!-- Note: More relationships could be specified in this
           DTD such as different "views" via different
           style-sheets. -->

<!ELEMENT XAR       - - (Header,Document*,Component,EntityDef*) >
<!ELEMENT Header    - - (Version) 
   -- Identifies the archive and version --
>

<!ELEMENT Document  - O EMPTY>
<!ATTLIST Document
   -- whatever needed to link to the appropriate component --
>


<!ELEMENT Component - O EMPTY
   -- Defines a component necessary for this document to be 
      processed. --
>
<!ATTLIST Component 
   -- type is a fixed set of "known" names that applications
      are expected to recognize --
   type   (stylesheet|catalog) #IMPLIED
   -- other is for transmitting other components specific to
      an application
   other  CDATA                #IMPLIED
   -- whatever else is necessary to locate the component --
>

<!ELEMENT EntityDef - - ((Public|Name),Component) >

<!ELEMENT (Public|Name) - - (#PCDATA) >
Received on Sunday, 9 February 1997 15:30:29 UTC