Re: the return of the Public Identifier Question

On Thu, 20 Mar 1997 09:22:33 -0800 Terry Allen said:
>I don't want XML processors to handle resolution at all, I want them
>to send URLs to the systems generalized URL-fetching mechanism to
>be resolved.  I think what we are talking about is the process of
>determining what it is that is to be resolved, which identifier
>to choose ("how to manage indirection").

If a system offers that service, a processor can avail itself of it;
I didn't mean to suggest the spec would or should require implementors
of XML to re-implement a sockets and URL-fetching library ...

> ... I'm still confused about the user and
>the processor.
>What is the model of interaction between processor and application that
>the SGML ERB is implicitly relying on?  When I download an XML file

The ERB is explicitly trying to stay away from prescribing an
implementation model, so these questions are hard to answer.  I think
the answers to your questions are implicitly as follows, however
(personal opinions only):

>that contains a public identifier BEACH,
> - is it the a or the p that the file is directed to first?

An implementation choice.  It is certainly possible to imagine a
situation in which P handles all the entities and passes a grove or
some simple format (e.g. sgmls output format) to a downstream
application; in this case, P sees the file and passes to A some
representation of it.  It's equally possible to imagine A being in
the driver's seat, and A reading a file or data stream and periodically
passing pieces of it to P saying "Here, parse this and give me a grove
back".  In this case, A sees the file first.
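
To make the two arrangements concrete, here is a minimal sketch in
Python.  Nothing in it comes from the draft: toy_parse, processor_driven
and application_driven are made-up names, and the "grove" is just a list
of whitespace-separated tokens standing in for real parser output.

```python
def toy_parse(text):
    """Hypothetical stand-in for P: turn text into a trivial 'grove' (a token list)."""
    return text.split()


def processor_driven(document, application):
    """P sees the file first, then passes a representation of it to A."""
    grove = toy_parse(document)
    return application(grove)


def application_driven(pieces, parse):
    """A sees the data first and periodically hands pieces of it to P."""
    return [parse(piece) for piece in pieces]


# In the first call, A is simply a function that counts the nodes it is given.
count_from_p = processor_driven("<doc> hello </doc>", len)
groves_for_a = application_driven(["<doc>", "hello </doc>"], toy_parse)
```

Either arrangement delivers the same kind of information to A; the only
difference is who calls whom, which is the part left to implementors.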

> - if the a, does the a do something with BEACH before passing
>   the file to the p?

Implementation dependent.

> - is the string BEACH part of the p's output?

If I read clause 4.3 correctly, the answer here is *always* 'yes', at
least in the cases where 'BEACH' is the public identifier of an entity
to which reference is made in the document.  Actually, of course, 4.3
says this only of SYSTEM identifiers; I am assuming that rule 4 would be
extended to say "For an external entity, the processor must inform the
application of the entity's system identifier, if any, and public
identifier, if any."  If it's not so extended, then the answer to your
question is 'Maybe; if P wants it to be, yes, otherwise no.'

In some cases, P will have contracted with A to handle all entity
resolution; if BEACH is part of a public identifier on an entity
declaration, and the entity is referenced, then if I were writing A, I
would probably be willing to settle for the contents of the entity so
named.  Tim Bray, on the other hand, building a full-text index, will
want to see the identifier(s).  In other cases, P will have promised A
*not* to resolve entity references, and so A will need to have a way to
ask P to resolve a reference at an appropriate time (e.g. when the user
asks that this be done, or at load time if the publisher's policy so
wills it and A agrees to follow the policy); if clause 4.3 is modified
in what seems to me the natural way, then the way to do this is
prescribed:  P makes the string 'BEACH' available to A, probably along
with the system identifier given, if any.  Some parsers will probably
also hand A the storage object identifier they generate from 'BEACH',
but that's not written into 4.3 now.
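
A rough sketch of what such a report from P to A might look like,
assuming clause 4.3 were extended as described.  The class and field
names are hypothetical, the catalog is a plain dictionary invented for
the example, and the storage object identifier is the optional extra
that some parsers might add.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ExternalEntity:
    """What P tells A about an external entity (all field names hypothetical)."""
    name: str
    public_id: Optional[str] = None
    system_id: Optional[str] = None
    storage_object_id: Optional[str] = None  # optional extra, not required by 4.3


def report_entity(name, public_id, system_id, catalog=None):
    """Build P's report; the storage object id is set only if P has a mapping."""
    generated = (catalog or {}).get(public_id)
    return ExternalEntity(name, public_id, system_id, generated)


catalog = {"BEACH": "file:///entities/beach.xml"}  # made-up mapping
plain = report_entity("beach", "BEACH", "beach.xml")
extra = report_entity("beach", "BEACH", "beach.xml", catalog)
```

In both cases A receives the string 'BEACH' itself, so an application
that wants to do its own lookup still has what it needs.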

> - does the p resolve BEACH [or send BEACH out to a system utility
>   for resolution] as part of parsing?

I'm in favor of defining this as a responsibility of P, rather than of
A.  But I don't think a decision has been made.  And I'm not sure I want
to say *when* this has to happen; I think initial-parse-time and
link-traversal or entity-expansion time are both plausible.  And nothing
can stop A from doing what it likes with 'BEACH', including sending it
to a public-id-to-URL server to see what comes back.

> - does resolution mean converting BEACH to a URL, or (in the case of
>   PIs) converting it to a system id for comparison with cache contents?

XML 1.0 defines system IDs as URLs, so I do not understand this question.

> - does the p resolve BEACH as part of parsing only for certain
>   purposes described in the spec?

If 'BEACH' is given as the public identifier of an entity, and the
entity is referenced, then if P is responsible for expanding entity
references, P is responsible for finding the data stream named by
'BEACH', at entity-reference-expansion time.  Is this a trick question?

> - does the p resolve BEACH only if it is used as a system or public ID
>   and not if it is the target of a link?

I think XML 1.0 makes P responsible for expanding entity references,
either a priori or on demand from A.  But clause 4.3 rule 8 could also
be read as allowing A to handle it, and does not explicitly say that P
*must* provide the service on request, so perhaps 1.0 is underspecified.

Who translates public identifiers into system ids/URLs, and when, is
not yet specified, as far as I know.  I've been assuming above that
P handles it at least for entities referred to.  That seems to suggest
P could handle it for links, too.  Whether to translate from public id
to URL at parse time or at link traversal time seems to me best left
to P to work out.

One could, I think, decide otherwise and say it's all left to A:  P just
has to pass the identifiers through, leaving A to work out its own
salvation with catalog lookups, etc.  I don't think this is logically
untenable, but I do think it's a bad implementation strategy and a bad
decision.  If the author of A really wants to handle it all herself,
then she can do so, by ignoring P's services and doing her own work on
the public ids which P is required (can be required) to provide.

> - if the p passes BEACH to the a, does the a resolve BEACH directly,
>   without bothering the p, or does the a tell the p, "Go to the BEACH"?

If P is handling the TCP/IP port for A, A will ask P to go to the
BEACH.  If A controls the port, or chooses to use another port handler,
P needn't be involved.  I don't think XML 1.0 does or should constrain
this; I think both possibilities are and should be legal.

>| I'm also open to saying only that conforming processor must support the
>| MRM and may support other methods, and how they decide which to use is
>| to be decided by the designer, the implementor, the user, and anyone
>| else who horns in on the discussion, but is not constrained by XML.
>| (Sole difference:  implementations are not required to provide a
>| user-settable option saying "do it this way".)
>What are implementations?  If they are applications, as a user I

Processors.  Sorry, I'm out of practice being disciplined in my

>want one with all the knobs.  If they are processors, under what
>circs are they resolving these IDs (see list above)?

I like knobs, too.  But the WG and ERB may or may not feel that our
personal preferences suffice as a reason to require that XML processors
(and apps?) *always* provide *all* the knobs *as a condition of
conformance*.  Whether a knob should be required here is an open
question; I'm agnostic (as I said), and you haven't actually said
what you think XML should do, only what you hope apps will do.

>| > ...
>| >My point is that resolution (having power working) is not
>| >indirection (choosing among PG&E, windmills, solar power, etc.),
>| >and that any choice of method may result in failure of resolution.
>| I think this is true, but so universally true that I'm not sure I
>| can derive any consequences from it.  No matter what resolution
>| method we choose, it can fail.  If we don't choose one at all, but
>| leave the choice to implementors, it can still fail.  ...
>So will it fail worse with or without specification?  I think the
>other interoperability issues (see list of questions above) need
>answers, and those answers might inform the choice of what to do
>about public identifiers.

I don't think having or not having a required resolution method is
likely to affect the frequency or severity of resolution failure,
given that the publisher has made appropriate information available to
enable resolution.  So I still don't see that your point has a bearing
on our decision, indubitable though it may be.

Wait, hang on a moment.

In practice, specifying a Minimum Required Method of public-id
resolution will mean fewer failures, I think.  Here's the logic:

  - some failures will be due to network outages, permissions problems,
    etc.; the frequency of these is unaffected by the MRM/no-MRM choice,
    though it may be affected by the choice of a resolution mechanism
    (a mechanism that provides several levels of fallback is likely to
    fail less often than one with no fallbacks at all).

  - some failures will be due to unavailability of required information
    (e.g. failure to provide an appropriate SGML Open Catalog,
    failure to register public ID with the new Public-ID server,
    failure to install the correct version of the Sortes Vergilianae
    Name Resolver, ...); these will be more frequent if a publisher
    must provide the required information (always in effect a
    public:system map) in more than one form.  So specifying an MRM
    reduces the failure rate.

  - some failures will be due to inaccuracy of required information
    (e.g. provision of an SGML Open Catalog with bad entries,
    registration of the wrong public ID with the new Public-ID server,
    installation of the Sortes Vergilianae Name Resolver with the
    wrong config file, ...); these will be more frequent if a publisher
    must provide the required information in more than one form.
    So here too specifying an MRM reduces the failure rate.

OK.  I don't know if this is what you were driving at, but I've
convinced myself that the second and third laws of thermodynamics
do provide an argument for specifying an MRM.
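
The arithmetic behind the second and third cases can be sketched as
follows.  This is only an illustration under a loud assumption: each
form in which the publisher must supply the public:system map is missing
or wrong independently, with the same probability p.  The function name
and the figure 0.05 are invented for the example.

```python
def chance_something_is_missing(p, forms):
    """Probability that at least one of `forms` required forms is missing or wrong,
    assuming each form fails independently with probability `p`."""
    return 1 - (1 - p) ** forms


one_form = chance_something_is_missing(0.05, 1)     # about 0.05
three_forms = chance_something_is_missing(0.05, 3)  # about 0.14
```

Under that assumption, a publisher who has to maintain three forms is
nearly three times as likely to get at least one of them wrong as a
publisher who has to maintain only the one form an MRM would require.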

>| ...  I think of producing a
>| system identifier for a resource accessible to the processor as
>| constituting resolution of the public identifier; am I misusing the
>| term?
>I'm thinking of resolution as actually obtaining the thing identified.
>Merely producing a system identifier doesn't guarantee success
>(that is, it can leave the matter unresolved).  And if anything on the
>net is accessible to the processor if it wants to send a request for
>a URL, production of a system identifier is only part of the process
>(of course, if the thing is returned as a result of the request,
>a system identifier can be produced for it).

OK.  I have been silently assuming that once P or A has a system id,
issuing the network request for the actual resource is (a) not very
hard and (b) equally likely to succeed or fail no matter what method
we've used to identify the resource and/or translate from public id to
system id.  But if you want to reserve 'resolution' for actually
coming up with the data in your input buffer, that's OK with me.
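
The distinction between the two steps can be sketched like this.  It is
a toy under stated assumptions: the catalog is a plain dictionary, the
"network" is another dictionary, and the names map_to_system_id and
resolve are hypothetical.  The fetch step is passed in as a parameter,
so it can be whatever URL-fetching service the system already offers.

```python
def map_to_system_id(public_id, catalog):
    """Step 1: translate a public id to a system id; None if no mapping is known."""
    return catalog.get(public_id)


def resolve(public_id, catalog, fetch):
    """Step 2: hand the system id to whatever fetching service is available."""
    system_id = map_to_system_id(public_id, catalog)
    if system_id is None:
        raise LookupError("no system id known for " + repr(public_id))
    return fetch(system_id)  # may still fail for reasons unrelated to step 1


catalog = {"BEACH": "http://example.org/beach.xml"}      # made-up mapping
fake_net = {"http://example.org/beach.xml": "<beach/>"}  # stands in for the network
data = resolve("BEACH", catalog, fake_net.__getitem__)
```

A failure in step 1 is a failure of the mapping information; a failure
in step 2 is a failure of the fetch, whichever method produced the
system id.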

>Fair enough; I would suggest that the spec be worded such that the
>processor (if that's the piece involved) be allowed to hand off
>indirection management to another component of the user's system
>without loss of conformance.
>What matters is that indirection gets handled, not necessarily which piece
>of the machinery handles it.  To determine whether it is necessary
>to specify which piece of the machinery does the work, you have to
>articulate a model of the machinery or of its operations.

Yes.  The current draft spec tries to be both precise and general,
in defining "the required behavior of an XML processor in terms of
how it must read XML data and the information it must provide to
the application" -- and *not* in terms of who calls whom when.  I
think this leaves implementors free to organize the interaction of
A and P however they wish, while still making clear what P has to
do, when asked.  I think we should strive for the same Who and What,
not How and When, specificity in future revisions and in XML-Link.

-C. M. Sperberg-McQueen