Re: XML catalog draft from lee@sq.com on 1997-02-07 (w3c-sgml-wg@w3.org from February 1997)

From: <lee@sq.com>
Date: Fri, 7 Feb 97 17:23:17 EST
To: w3c-sgml-wg@w3.org
Message-Id: <9702072223.AA01565@sqrex.sq.com>
[this is a long article because of the notes at the end on how Panorama
 actually fetches SGML OPEN CATALOG files, and some (by no means all)
 of the issues involved.
 Lee
]

Paul Prescod wrote:
> The proposal leaves the resolution mechanism up to the application as it
> should. 

No, it shouldn't.
I want something that works.  In the same way.  Everywhere.
That is what we all need.

There is no point saying the market will produce lots of competing
mechanisms and the best one will win.   They will all lose.

[...]
> > >  Either way, some means of associating
> > > catalogues or ilinksets with documents is required.  
> > Clearly -- otherwise we haven't solved the problem, but only made it
> > more complicated.  A way of getting from instance to catalog is needed.
> 
> I don't agree here. Catalogs are useful without a transmission mechanism.
I didn't say they weren't.  Nor did Terry.

> If I send you a file with <!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 3.2//EN">
> I have a feeling that your software will resolve it correctly.
Wrong.

And what if I put up on the Web an XML file using
	PUBLIC "-//Liam Quin//DTD b359//EN"
which the draft allows?
How are you going to resolve = find = use the DTD?

If you say that's not specified, please start again.

> On the other hand! I think that a mechanism for associating catalogs with 
> instances is useful, and important.
Which is what I said in my article, and all I said.

> I raised this in the catalog group, and
> we agreed that it was outside of our mandate and did not bother to discuss it
> further.  Some of us felt that it was something that the ERB should
> add when they integrate the catalog proposal with the XML spec.

Then I hope the ERB sends you all back to the drawing board.
What is the catalog _for_?  It is for turning a PUBLIC ID into a SYSTEM ID.
That is _all_ it is for.   You now have to solve the problem of finding
the CATALOG in the first place, or you have not solved the required
problem.
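
To make the catalog's single job concrete, here is a minimal sketch of a PUBLIC-to-SYSTEM lookup. It assumes catalog entries of the form PUBLIC "public id" "system id", handles only double-quoted literals, and ignores every other entry type. The function name parse_catalog and the file names html32.dtd and b359.dtd are hypothetical, chosen only for illustration.

```python
import re

# Matches entries of the form: PUBLIC "public id" "system id"
# (assumption: double-quoted literals only; other entry types are skipped)
_PUBLIC_ENTRY = re.compile(r'PUBLIC\s+"([^"]*)"\s+"([^"]*)"')


def parse_catalog(text):
    """Return a dict mapping each PUBLIC identifier to its SYSTEM identifier."""
    return {public_id: system_id
            for public_id, system_id in _PUBLIC_ENTRY.findall(text)}


# Hypothetical catalog contents; the system identifiers are made up.
catalog_text = '''
PUBLIC "-//W3C//DTD HTML 3.2//EN" "html32.dtd"
PUBLIC "-//Liam Quin//DTD b359//EN" "b359.dtd"
'''

catalog = parse_catalog(catalog_text)
print(catalog.get("-//Liam Quin//DTD b359//EN"))
```

The lookup itself is trivial; the sketch has nothing to say about where catalog_text comes from, which is exactly the unsolved part.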

If I say, I'd like a brandy but I don't have one, telling me, "you can get
brandy by pouring it out of a brandy bottle" doesn't help me very much;
it only frustrates me, because if I had a brandy bottle, I wouldn't have
said I didn't have any brandy.

If I say, I'd like a SYSTEM ID, telling me I can get one in a CATALOG file
that I don't have is similar, except I need a SYSTEM ID to get the CATALOG,
whereas I need money to get the brandy (usually -- anyone willing to swap
brandy for PUBLIC identifiers??)

> I planned to recommend a mechanism to the ERB independent from the catalog
> group, but could not decide between the Socat-way, with a file named
> "catalog" which is very convenient, but tromps on the user's filename
> space,

If SGML users are keeping files around called catalog.soc that are not
SGML OPEN catalogs, that's their problem.

> (perhaps less if the file was named xml-cat) or with a processing
> instruction, which is syntactically ugly and a little inconvenient to
> add to each document.

There are no files on the web -- a URL is not a filename.  Now, they
usually map into filenames, but given the following URL to an XML file,
how do I get to the catalog?
	http://some.where.com/cgi-bin/documentation/bk12/ch3;level=novice

hint: cgi-bin/documentation is a program, a CGI interface to a document
management system doing dynamic fragmentation of SGML/XML; I cannot store
CATALOG files in the database... (since they are not SGML) let's say.
This is a real, common example (but with a fake URL here!)

> > I will say right now that we spent a lot of effort on this topic for
> > SoftQuad Panorama, and didn't get it right in the 1st release.
> > 
> > It's still not perfect, but we have backward compatibility issues.
> > Let's do it right for XML.
> 
> What is "right"? Your experience with this issue will be useful to us.

(1) allow links from the doc to the DTD directly (no catalog) even if
    there is a PUBLIC ID (Pano does this -- you'll see why in a sec)

(2) allow a way of identifying a "base" URL of the current document, so
    that relative paths can work in links & sys-ids.  Allow this to be
    prepended to the file with no other processing (it would come before
    the <?-XML- ...?> header in this case, I expect) so that it can be
    done by a non-XML-aware proxy server or very simple CGI script.
    Panorama 2 uses a processing instruction for this -- we didn't have it
    for Panorama 1, and this was a big problem, as you couldn't get
    bookmarks and annotations working from GET-style search queries
    without it.

(3) use the same mechanism to link style sheets to instances as you use
    to link documents to instances.  Panorama uses a separate file,
    "entityrc", but I now wish that the information had all been in
    one place, e.g. "catalog".
    If you use public identifiers to link to style sheets, you will need to
    be able to give both a PUBLIC and a SYSTEM for the DTD as in (1), but
    you will need the SYSTEM identifier to override the PUBLIC one in the
    case where you don't actually have a catalog file.
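
The precedence described in (1) and (3) might be sketched as follows. This is an illustration under stated assumptions, not Panorama's actual logic: the catalog is modelled as a plain dict (or None when no catalog is available), the function name choose_system_id is hypothetical, and falling back to the SYSTEM identifier when a catalog exists but lacks the entry is an assumption the text does not spell out.

```python
def choose_system_id(public_id, system_id, catalog=None):
    """Pick the identifier to fetch for an external entity.

    catalog is a dict of PUBLIC id -> SYSTEM id, or None if no catalog
    file could be found for this document.
    """
    # With a catalog, the PUBLIC identifier is looked up first.
    if catalog is not None and public_id in catalog:
        return catalog[public_id]
    # No catalog (or, by assumption, no matching entry): the SYSTEM
    # identifier given in the document takes over.
    if system_id is not None:
        return system_id
    raise LookupError("cannot resolve %r: no catalog entry and no SYSTEM id"
                      % public_id)


# Hypothetical identifiers, for illustration only.
print(choose_system_id("-//Liam Quin//DTD b359//EN", "b359.dtd"))
```

A document that gives only a PUBLIC identifier and has no reachable catalog ends in the LookupError branch, which is the failure this item is warning about.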


(4) remember that you can't do file system probes.  The original CATALOG
    spec said that the filename for catalog was case insensitive.  Originally,
    because of its Windows heritage (despite the first version ("darc")
    being on Unix!), Panorama looked for CATALOG on the remote server.
    But more than half of all web servers are running Unix today, and
    the path parts of URLs are case sensitive.  We got so many support
    calls about this that today Panorama looks first for "catalog" and
    then for "CATALOG", but the failed probe does cause an obscure (to
    the user) error message on many systems.
    We never implemented the TR requirement of supporting Catalog, cAtalog,
    and so forth, as each one would take a separate HTTP transaction...
    This has been fixed (the TR was changed, as I recall), but it is
    best if you never have to look and see if a URL works or not.
    Sometimes, a URL probe might actually cost money -- e.g. if you're paying
    for documents -- or might require a password, or might simply fail
    silently with a zero-length "document" being returned, or a document
    being returned saying "the URL you requested was not found; please
    check your spelling..."!!!
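
The probe order described in (4) can be sketched like this. The sketch only builds the candidate URLs and does no network access. It assumes the catalog is looked for alongside the document, which the text does not state outright, and catalog_candidates is a hypothetical name.

```python
from urllib.parse import urljoin

# Names tried in order: lower case first, then upper case.
CATALOG_NAMES = ("catalog", "CATALOG")


def catalog_candidates(document_url):
    """Return the catalog URLs a client would have to probe, in order.

    Assumption: the catalog sits next to the document it describes.
    """
    return [urljoin(document_url, name) for name in CATALOG_NAMES]


# Uses the fake URL from the message; every entry is one HTTP transaction.
for url in catalog_candidates("http://www.sq.com/people/liam/ankle1.xml"):
    print(url)
```

Each extra spelling added to CATALOG_NAMES adds one more request that may fail, cost money, or return a misleading "not found" page, which is why an explicit link from the instance is preferable to any list of guesses.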

(5) allow an instance to indicate that no CATALOG exists, and to give
    all the information in some other way.  Same for style sheet linkage,
    whether using CATALOG or ENTITYRC (I hope not) or something else.
    This need follows from a combination of the need to support database
    queries and the inadvisability of trying to do something like
    file system probes.

(6) you need to be able to associate multiple style sheets with each
    instance (e.g. for printing), and possibly other things, such as
    Java programs, active table of contents definitions, metadata,
    collection information, location on navigational maps, and so forth.
    Public identifiers can be used as part of this, as can processing
    instructions.  However you do it, it's essential that the same files
    can be viewed locally with no web server and locally or remotely
    using a web server, as otherwise it's impossible to test them without
    putting them on a web server.  The best way to do this is to treat
    all system identifiers as partial URLs, relative to the file containing
    them.  This means that if you open
    	/users/liam/docs/barefoot/ankle1.xml
    and it refers to
        SYSTEM "walking.dtd"
    then an XML application ought to look for
        /users/liam/docs/barefoot/walking.dtd
    but if exactly the same unchanged bytestream had been downloaded as
        http://www.sq.com/people/liam/ankle1.xml (this is a fake URL)
    then the same XML application should resolve "walking.dtd" as
        http://www.sq.com/people/liam/walking.dtd
    and if it had been ftp://.... then the same procedure should be used.
    You have to consider what to do with a URL such as:
        http://...../ankle1.xml;version=3
    and
        SYSTEM "walking.dtd;version=2"
    where presumably we should look for
        http://...../walking.dtd;version=2
    and not try to apply both sets of MIME parameters.
    (the ; is a preferred alternative to using & in queries, too)
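
The relative resolution described in (6) can be sketched with Python's standard urllib.parse.urljoin, which is a later general-purpose implementation of relative URL resolution and not anything Panorama used. The function name resolve_system_id is hypothetical, and the parameterised example reuses the fake www.sq.com URL where the message elides the host.

```python
from urllib.parse import urljoin


def resolve_system_id(document_location, system_id):
    """Resolve a SYSTEM identifier relative to the document containing it.

    document_location may be a local path or a URL; the same rule
    applies to both.
    """
    return urljoin(document_location, system_id)


# Local file, no web server involved.
print(resolve_system_id("/users/liam/docs/barefoot/ankle1.xml",
                        "walking.dtd"))

# The same bytes fetched over HTTP (fake URL, as in the message).
print(resolve_system_id("http://www.sq.com/people/liam/ankle1.xml",
                        "walking.dtd"))

# Parameters on the document URL are not carried over to the DTD URL.
print(resolve_system_id("http://www.sq.com/people/liam/ankle1.xml;version=3",
                        "walking.dtd;version=2"))
```

In the last call only version=2 survives in the result, matching the suggestion above that the two sets of parameters should not be combined.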

Sorry, I have probably written too much already.

All of these issues need to be solved, or you won't end up with
interchangeable SGML on the Web.  I know.  I've been there.

I don't necessarily want to force the solutions we used on other people,
but the same issues do need to be addressed.

Lee
Received on Friday, 7 February 1997 17:23:28 UTC