Re: Relative URL draft 04 from Roy T. Fielding on 1995-01-24 (uri@w3.org from January 1995)

From: Roy T. Fielding <fielding@avron.ICS.UCI.EDU>
Date: Mon, 23 Jan 1995 19:47:50 -0800
To: Larry Masinter <masinter@parc.xerox.com>
Cc: uri@bunyip.com
Message-Id: <9501231947.aa01826@paris.ics.uci.edu>

Larry writes:

> The reason for putting 'establishing a base' in an appendix is that it
> isn't exhaustive. It doesn't define how to establish the base URL for
> all situations, nor can it. What's there now is a set of examples;
> some of them correspond to well-established practice (like the base
> for things transported by http), while others are things that you've
> made up while writing this document (e.g., establishing a base for
> multipart mail messages).

I disagree with all three sentences.  First, a standard does not need
to be exhaustive -- it needs to be complete.  Second, what this standard
defines is the order of precedence for ascertaining the base URL from:

   1) inside the content
   2) part of the message
   3) the retrieval context
   4) the default

Section 3.5 was just a special case of (3).

This order of precedence applies equally well to all Internet protocols,
and is thus complete from the standpoint of an Internet standard.
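
As a rough illustration of that order of precedence, here is a minimal Python sketch. The function name and its three parameters are hypothetical (they are not defined anywhere in the draft); each parameter stands for the base URL found at one layer, or None if that layer defines nothing.

```python
from typing import Optional


def establish_base_url(
    embedded: Optional[str] = None,           # (1) inside the content
    base_header: Optional[str] = None,        # (2) part of the message
    retrieval_context: Optional[str] = None,  # (3) the retrieval context
) -> str:
    """Return the first base URL that is defined, innermost layer first."""
    for candidate in (embedded, base_header, retrieval_context):
        if candidate:
            return candidate
    # (4) the default: no base is defined, represented as the empty string
    return ""
```

The point of the sketch is only that the caller never needs protocol-specific knowledge to pick among the layers; how each layer's value is obtained is a separate question.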

Third, there are known gaps in well-established practice that are
fixed by this specification, and at the same time it does not conflict with
any established practice.  If you have a problem with that, we might
as well shut down the WG.

In particular, the question of "what is the base URL of a component in
a multipart message?" is one that was discovered while defining the handling
of multipart documents for HTTP.  The solution I "made up" is both simple
and technically sound given the possible presence of multiple levels of
"base" headers (and thus multiple levels of retrieval context).  I added
that section to resolve any ambiguity regarding that situation.  If anyone
has a better solution, please let me know.  Ignoring the ambiguity is not
acceptable.  In an attempt to clarify it, I have changed Section 3 as follows:

======================================================================
3.  Establishing a Base URL

   The term "relative URL" implies that there exists some absolute
   "base URL" against which the relative reference is applied.  Indeed,
   the base URL is necessary to define the semantics of any embedded
   relative URLs; without it, a relative reference is meaningless.
   In order for relative URLs to be usable within a document, the base
   URL of that document must be known to the parser.

   The base URL of a document can be established in one of four ways,
   listed below in order of precedence.  The order of precedence can be
   thought of in terms of layers, where the innermost defined base URL
   has the highest precedence.  This can be visualized graphically as:

      .---------------------------------------------------------.
      |  .---------------------------------------------------.  |
      |  |  .---------------------------------------------.  |  |
      |  |  |  .---------------------------------------.  |  |  |
      |  |  |  |   (3.1) Base URL embedded in the      |  |  |  |
      |  |  |  |         document's content            |  |  |  |
      |  |  |  `---------------------------------------'  |  |  |
      |  |  |   (3.2) URL defined by a "Base" message     |  |  |
      |  |  |         header (or equivalent)              |  |  |
      |  |  `---------------------------------------------'  |  |
      |  |   (3.3) URL of the document's retrieval context   |  |
      |  `---------------------------------------------------'  |
      |   (3.4) Base URL = "" (undefined)                       |
      `---------------------------------------------------------'

3.1.  Base URL within Document Content

   Within certain document media types, the base URL of the document
   can be embedded within the content itself such that it can be
   readily obtained by a parser.  This can be useful for descriptive
   documents, such as tables of contents, which may be transmitted to
   others through protocols other than their usual retrieval context
   (e.g. E-Mail or USENET news).

   It is beyond the scope of this document to specify how, for each
   media type, the base URL can be embedded.  However, an example of
   how this is done for the Hypertext Markup Language (HTML) [3] is
   provided in an Appendix (Section 10).

3.2.  Base URL within Message Headers

   A second method for identifying the base URL of a document is to
   specify it within the message headers (or equivalent tagged
   metainformation) of the message enclosing the document.  For
   protocols that make use of message headers like those described in
   RFC 822 [5], it is recommended that the format of this header be:

      base-header  = "Base" ":" "<URL:" absoluteURL ">"

   where "Base" is case-insensitive.  For example, the header

      Base: <URL:http://www.ics.uci.edu/Test/a/b/c>

   would indicate that any relative URLs found within the document
   should be parsed relative to <URL:http://www.ics.uci.edu/Test/a/b/c>.
   Any whitespace (including that used for line folding) inside the
   angle brackets should be ignored.
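
To illustrate the header format (this is a sketch, not part of the draft text), one way to extract the URL from a single, already-unfolded or still-folded "Base" header line in Python is shown below. The function name is hypothetical, and treating the "URL:" prefix as case-insensitive is an assumption based on the usual RFC 822 convention for literal strings.

```python
from typing import Optional


def parse_base_header(line: str) -> Optional[str]:
    """Return the absolute URL from a Base header line, or None if no match."""
    name, sep, value = line.partition(":")
    if not sep or name.strip().lower() != "base":
        return None  # not a Base header ("Base" is case-insensitive)
    value = value.strip()
    if not (value.startswith("<") and value.endswith(">")):
        return None
    # Ignore all whitespace inside the angle brackets, including line folds.
    inner = "".join(value[1:-1].split())
    if inner[:4].upper() != "URL:":
        return None
    return inner[4:] or None
```

For the example header above, this returns the string http://www.ics.uci.edu/Test/a/b/c; a folded header yields the same result because the fold is removed along with any other whitespace.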

   Protocols which do not use the RFC 822 message header syntax, but
   which do allow some form of tagged metainformation to be included
   within messages, may define their own syntax for passing the base URL
   as part of a message.  Describing the syntax for all possible
   protocols is beyond the scope of this document.  It is assumed that
   user agents using such a protocol will be able to obtain the
   appropriate syntax from that protocol's specification.

   In situations where both an embedded base URL (as described in
   Section 3.1) and a base-header are present, the embedded base URL
   takes precedence.

3.3.  Base URL from the Retrieval Context

   If neither an embedded base URL nor a base-header is present, then,
   if a URL was used to retrieve the base document, that URL shall be
   considered the base URL.  Note that if the retrieval was the result
   of a redirected request, the last URL used (i.e., that which resulted
   in the actual retrieval of the document) is the base URL.

   Composite media types, such as the "multipart/*" and "message/*"
   media types defined by MIME (RFC 1521, [4]), require special
   processing in order to determine the retrieval context of an enclosed
   document.  For these types, the base URL of the composite entity
   must be determined first; this base is then considered the retrieval
   context for its component parts, and thus the base URL for any part
   that does not define its own base via one of the methods described
   in Sections 3.1 and 3.2.  This logic is applied recursively for
   component parts that are themselves composite entities.

   In other words, the retrieval context (Section 3.3) of a component
   part is the base URL of the composite entity of which it is a part.
   Thus, a composite entity can redefine the retrieval context of its
   component parts via inclusion of a base-header, and this redefinition
   applies recursively for a hierarchy of composite parts.  Note that
   this is not necessarily the same as defining the base URL of the
   components, since each component may include an embedded base URL
   or base-header that takes precedence over the retrieval context.
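
To illustrate the recursion (again a sketch, not part of the draft text), the following Python models an entity as a small tree. The Entity class, its field names, and assign_bases are hypothetical names introduced only for this example; a real MIME parser would supply the embedded base and base-header values.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Entity:
    embedded_base: Optional[str] = None  # Section 3.1, if the content defines one
    base_header: Optional[str] = None    # Section 3.2, if the headers define one
    parts: List["Entity"] = field(default_factory=list)  # empty unless composite


def assign_bases(entity: Entity, retrieval_context: str = "") -> List[str]:
    """Return base URLs depth-first, each entity listed before its parts."""
    # Precedence: embedded base, then base-header, then retrieval context.
    base = entity.embedded_base or entity.base_header or retrieval_context
    result = [base]
    for part in entity.parts:
        # The composite's base URL becomes the retrieval context of each part.
        result.extend(assign_bases(part, base))
    return result
```

A part with no base of its own inherits the enclosing entity's base, while a part with its own embedded base or base-header overrides it for itself and for anything nested beneath it.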

3.4.  Default Base URL

   If none of the conditions described in Sections 3.1 -- 3.3 apply,
   then the base URL is considered to be the empty string and all
   embedded URLs within that document are assumed to be absolute URLs.
   It is the responsibility of the distributor(s) of a document
   containing relative URLs to ensure that the base URL for that
   document can be established.  It must be emphasized that relative
   URLs cannot be used reliably in situations where the object's base
   URL is not well-defined.

======================================================================

> I don't believe that your assertion "The method of establishing a base
> must be part of the standard" holds up. You assert it, but you don't
> justify it. In any case, even if it must be part of 'a' standard, it
> isn't clear that it must be part of *this* standard, which defines the
> syntax and semantics of relative URLs.

I'm sorry, I thought that was clear.  The base URL defines the semantics
of all embedded relative URLs.  Without the base, all embedded relative
references are meaningless.  Obviously, it must be part of *this* standard.

> Two things: most importantly, the defined syntax for news and nntp
> URLs doesn't include any semantics for "/".  At best, you're left saying
> that a raw "<message-id>" is a relative URL to a "news:<message-id>"
> URL. The syntax for available groups doesn't allow you to say that
> applying ".." as a relative URL to "news:alt.binaries.parsers" would
> get you "news:alt.binaries".

Oops, sorry. I keep forgetting that news: URLs do not include the article
numbers found in libwww.  I'll move news to the paragraph above it.

> And using "../3" in
> "nntp://news.org:119/alt.binaries/12" doesn't seem particularly useful.

nntp URLs do use "/" as hierarchy, follow the generic-RL syntax, and
I know of several examples where relative URLs could be useful in such
circumstances.  In fact, nntp should be in the bottom group with http.
That still doesn't mean that they *must* be used -- only that there are
no inherent restrictions on their use.

> I hadn't really gone over your BNF, but I'm puzzled how:
> 
> !    absoluteURL = generic-RL | ( scheme ":" *( uchar | reserved ) )
>   
> +    generic-RL  = scheme ":" [ relativeURL ]
> + 
> 
> leads one to allow a relative URL as a kind of absoluteURL.

Eh?  No, it just defines the production rules for the generic-RL syntax.
Actually, it should be

       generic-RL  = scheme ":" relativeURL

but that's a technicality.  This is the same as saying:

       generic-RL  = ( scheme ":" "//" net_loc [ abs_path ] )
                   | ( scheme ":" "/"  rel_path             )
                   | ( scheme ":" [ path ] [ ";" params ] [ "?" query ] )

The dual use of production names is meaningful and corresponds to the
way existing parsers handle relative URLs as part of the generic-RL
parsing process.
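
As a rough sketch of what I mean, a parser can strip the scheme and then decide which of the three alternatives the remainder matches, exactly as it would for a relative URL. The Python below is illustrative only; the function name and the three form labels are hypothetical, and it does not validate the characters of each component.

```python
from typing import Tuple


def classify_generic_rl(url: str) -> Tuple[str, str, str]:
    """Split 'scheme:rest' and report which alternative 'rest' matches."""
    scheme, sep, rest = url.partition(":")
    if not sep or not scheme:
        raise ValueError("no scheme found")
    if rest.startswith("//"):
        form = "net_loc"   # scheme ":" "//" net_loc [ abs_path ]
    elif rest.startswith("/"):
        form = "abs_path"  # scheme ":" "/" rel_path
    else:
        form = "rel_path"  # scheme ":" [ path ] [ ";" params ] [ "?" query ]
    return scheme, form, rest
```

Everything after the scheme and colon is handed to the same code path that handles a relative reference, which is why the production names are shared.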


......Roy Fielding   ICS Grad Student, University of California, Irvine  USA
                                     <fielding@ics.uci.edu>
                     <URL:http://www.ics.uci.edu/dir/grad/Software/fielding>
Received on Monday, 23 January 1995 22:49:48 UTC