Re: draft proposal for catalog resolution from Paul Prescod on 1997-03-31 (w3c-sgml-wg@w3.org from March 1997)

From: Paul Prescod <papresco@calum.csclub.uwaterloo.ca>
Date: Mon, 31 Mar 1997 08:03:43 -0500
To: w3c-sgml-wg@w3.org
Message-ID: <333FB62F.D03@csclub.uwaterloo.ca>

lee@sq.com wrote:
> If it is accepted as a minimum requirement for all XML processors,
> even DTD-less ones, we've probably lost our Dirty Perl Hacker.

A CATALOG is a line-delimited linearization of an associative array. This
is five lines of Perl code in the hands of any Perl hacker:

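open(my $catalog, '<', 'catalog') or die "cannot open catalog: $!";
while (my $line = <$catalog>) {
    # skip lines that are not PUBLIC entries so a stale $1/$2 is never reused
    next unless $line =~ /PUBLIC (.*) (.*)/;
    $mappings{$1} = $2;
}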

I'm not even good at Perl. I think regexp matching changed in one Perl
version, so you would probably want to explicitly exclude spaces in the
first (.*).
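
A slightly more defensive variant of the same idea, sketched in Python.
It assumes the quoted entry form used by SGML Open style catalogs
(PUBLIC "public id" "system id"), which lets a public identifier contain
spaces. The function name parse_catalog and the sample entries are made
up for illustration.

```python
import re

# Assumed line shape: PUBLIC "public id" "system id"
ENTRY = re.compile(r'^\s*PUBLIC\s+"([^"]*)"\s+"([^"]*)"')

def parse_catalog(lines):
    """Return a dict mapping public identifiers to system identifiers."""
    mappings = {}
    for line in lines:
        m = ENTRY.match(line)
        if m:  # ignore anything that is not a PUBLIC entry
            mappings[m.group(1)] = m.group(2)
    return mappings

# Hypothetical catalog contents, one entry per line.
sample = [
    'PUBLIC "-//W3C//DTD HTML 3.2//EN" "html32.dtd"\n',
    '-- not a PUBLIC entry --\n',
]
print(parse_catalog(sample))
```

Anchoring on the quotes instead of on spaces sidesteps the greedy (.*)
problem entirely.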

> If it is optional, we have an optional language feature.

Like public identifiers and alternate character encodings. Big deal.

> People hoping to put URNs in PUBLIC identifiers will have to check
> that it's OK not to have ! @ # % ^ & _ { } [ ] | \ ~ ` ; < > , in
> URNs, as they are forbidden in PUBLIC identifiers.  

When URNs exist we can specify whatever mapping we want from public
identifiers to URNs. It is just a simple language translation problem.

> Perhaps SGML
> could be changed here, as there doesn't seem any advantage to
> restricting the character set, and it's going to look odd to allow
> Kanji or Devanagari or accented Latin characters in SYSTEM IDs
> such as file names and URLs (URL internationalisation is in progress,
> but file: URLs are already OK in practice at least, and you can
> escape characters in URLs with %, a character not allowed in a
> PUBLIC Id) and have A-Za-z 0-9 and a little punctuation in PUBLIC
> identifiers, that are supposed to be more powerful.

SYSTEM identifiers can use any character available on the SYSTEM. PUBLIC
identifiers must be restricted to characters that we can presume are
PUBLICally available (available on any processing platform). I think
that it would be a good idea to loosen the PUBLIC identifier rules up a
little for SGML so that we can at least have a decent escape character
like "%" instead of, say, "?".

> So + for space in URLs, %dd for other characters in URLs, &#dddd;
> in text, and ?dddd? only in PUBLIC identifiers.  Still want it?

Since URNs are nonexistent, this seems to be essentially a non-problem.
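
For concreteness, here is a rough sketch in Python of what a ?dddd?
style escape for PUBLIC identifiers might look like. Nothing here is
specified anywhere: the decimal code point between question marks, the
allowed set (taken to be SGML minimum data, with "?" itself reserved as
the escape character), and the names escape_pubid and unescape_pubid
are all assumptions.

```python
import re
import string

# Assumed allowed set: letters, digits, space and ' ( ) + , - . / : =
# "?" is left out on purpose so it can serve as the escape delimiter.
ALLOWED = set(string.ascii_letters + string.digits + " '()+,-./:=")

def escape_pubid(text):
    """Replace every character outside ALLOWED with ?<decimal code point>?."""
    return "".join(c if c in ALLOWED else "?%d?" % ord(c) for c in text)

def unescape_pubid(text):
    """Undo escape_pubid by turning each ?<digits>? back into a character."""
    return re.sub(r"\?(\d+)\?", lambda m: chr(int(m.group(1))), text)

escaped = escape_pubid("a_b%20c")
print(escaped)
print(unescape_pubid(escaped))
```

Either way it is a mechanical translation, a few lines in any language.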
 
> Here are six ways of including a DTD fragment:
> 
> [1]
>     <!DOCTYPE xx % PUBLIC "yy">
> 
> [2]
>     <!DOCTYPE xx % SYSTEM "how to get yy">
> 
> [3]
>     <!DOCTYPE xx [
>         <!Entity yy % PUBLIC "yy">
>         %yy;
>     ]>
> 
> [4]
>     <!DOCTYPE xx [
>         <!Entity yy % SYSTEM "how to get yy">
>         %yy;
>     ]>
> 
> [5]
>     <!DOCTYPE xx [
>         <!Entity catalog % SYSTEM "how to get catalog.xml">
>         <!--* catalog.xml defines the yy entity *-->
>         %yy;
>     ]>
> 
> [6]
>     <!DOCTYPE xx [
>         <!Entity catalog % PUBLIC "catalog.xml">
>         <!--* catalog.xml defines the yy entity
>             * but this relies on external PUBLIC resolution to
>             * get our real XML catalog
>             *-->
>         %yy;
>     ]>

These are not equivalent. The first does not just include a DTD
fragment. It declares the document type of the document. The second
declares the location of a file that can be used to validate the
document (a DTD). The third and fourth have different semantics: for
instance, an HTML document created in either of these ways would not be a
valid HTML document according to the HTML and SGML specifications. The
first is the only valid way to declare HTML documents. The fifth and
sixth are levels of indirection you arbitrarily (it seems) decided to
add. You could add 10 levels (xx includes yy, which includes zz, which
includes aa) and claim we have ten more ways of doing the same thing.
 
 Paul Prescod
Received on Monday, 31 March 1997 08:06:22 UTC