Re: Request for feedback: HTTP-based Resource Descriptor Discovery

Thanks for the feedback. It is extremely useful. Please note that I published a -01 revision last week which already addressed some of these concerns.

See my comments below.

On 1/29/09 6:56 AM, "Jonathan Rees" <jar@creativecommons.org> wrote:
> - Please do not say 'resource discovery' as this protocol is not about
>    discovering resources.  You have many alternatives that do not say
>    something that's confusing: 'descriptor resource discovery',
>    'description discovery', 'resource description discovery', etc.

This was already changed in -01 to 'descriptor discovery'.

> - I really wish we could say something stronger about the format of
>    the DR.  May I suggest that the DR be required to possess at least
>    one 'representation' that is either RDF/XML or convertible to
>    RDF/XML using GRDDL?

It is the job of the descriptor to be useful, not the job of the discovery spec to dictate its format...

> - I anticipate some confusion as to whether the link relates the
>    resource to the DR (as in the POWDER 'describedby' definition you
>    quote), the URI to the DR, or the URI to the DR's URI (as in the
>    second sentence of section 6).  In RDF, <resource> describedby <dr>
>    is most natural to write, but RDF semantics rules out the
>    possibility that this might say anything specific to a particular
>    URI naming the resource[*].  This protocol is an opportunity for the
>    URI owner to say things not only about the resource but about the
>    URI/resource binding itself, such as its authority, provenance, and
>    stability, and that will vary with URI, not resource, as each URI
>    might have a different "owner".

Much of this debate depends heavily on two questions:

- Are we discovering a URI Descriptor or Resource Descriptor?
- Is this protocol part of the network layer or the application layer?

I don't have full answers, but I am attempting as much as possible to create a Resource Descriptor discovery protocol, and I find positioning it closer to the application layer much easier to implement (since it can rely on a very narrow set of network-layer features).

The relationship between the URI used and the resource being discovered can be described simply as 'what we've got'. I am not sure how to say anything more useful in the spec.

> - The POWDER documentation gives a different URI for the describedby
>    relation than the one that you'd get by using the proposed
>    IANA-based relation registry.  It would be unfortunate if there
>    continued to be two URIs for the same thing, and you should work
>    with POWDER to settle on one or the other.  I would not make use use
>    of the link relation registry a requirement.

'rel' types across all methods will depend directly on the proposed registry defined by draft-nottingham-http-link-header. Per Phil Archer, the POWDER relation will be properly registered within this proposed IANA registry. Whatever draft-nottingham-http-link-header considers equivalent to the short name 'describedby' is acceptable for this purpose.
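
For illustration only (hypothetical URIs, and assuming 'describedby' ends up as the registered short name), the header form I have in mind would look something like:

    Link: <http://example.com/doc/descriptor>; rel="describedby"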

> - Editorial comment: On first reading I found the first set of bullets
>    in section 7 to be very mysterious.  They make no sense at all until
>    you've read the following text.  I suggest that (a) you list the
>    three methods before launching into the factors that go into
>    deciding between them; and (b) that the four bullets be more
>    specific - e.g. instead of saying it depends on document type (media
>    type), say that it depends on whether the resource has a
>    representation supporting the <link> element, and rather than saying
>    it depends on URI scheme, say that it depends on whether the scheme
>    is http(s) or something else.

Yep. I'm looking for ways to move parts of this to the introduction and to turn the rest into actionable items.

> - Bullet "HTTP Link header": "Limited to resources with an accessible
>    representation using the HTTP protocol [RFC2616], or..." -- while
>    you're not saying anything wrong here, I don't see what purpose the
>    part before the "or" serves, and I find it distracting.  I think you
>    should simply say:
>        "Limited to resources for
>        which an HTTP GET or HEAD request returns a non-5xx
>        HTTP response [RFC2616]."

This sounds reasonable.

>    The exact limitation you want to put on HTTP (2xx, 2xx+3xx,
>    2xx+3xx+4xx, or any) is debatable.  I think 3xx responses have to be
>    OK (see below), 4xx responses should be, and 5xx responses could be
>    although I don't think I would trust them.
>
>    If all HTTP responses can carry believable Link: headers, matters
>    are greatly simplified because you can just say that you can always
>    try the HTTP method - it is not limited in any way.

The difficulty is aligning the spec with existing expectations and making sure that it is always predictable. I am also trying to align the result codes with the common semantic expectation of what constitutes a valid representation for the resource identified by the URI being dereferenced.

Before I dive into a deep review of all the possible HTTP response codes, I'd like to ask a simple question: what actual use cases break, and what inefficiencies are created, by a strict limitation of the allowed response codes (200, 301, 302, 303, 307, 401)?

Discovery has to be non-intrusive (at least in its general-purpose elements), which seems to limit us to only GET and HEAD. There is nothing stopping an application from making a normative reference to this spec and then extending the allowed set of methods and response codes if it adds value to its use cases, but I can't come up with scenarios where this restriction actually breaks anything (that a follow-up HEAD can't solve).

With regard to the permitted HTTP response codes, I am having a hard time simply allowing whole sets (2xx, 3xx) because each one has codes that are unacceptable for this purpose.

1xx and 5xx are obviously out.

In the 2xx range:

* 200 OK - obviously useful.
* 201 Created - doesn't fit with the passive nature of the protocol (or GET/HEAD).
* 202 Accepted - implies something other than synchronous information retrieval. Not sure how a generic discovery library can handle this, or what it means when a reply to a GET/HEAD has Link headers present.
* 203 Non-Authoritative Information - I can see this being used, but should the spec call out the potential issues with trusting such information?
* 204 No Content - seems useful as it provides an updated metainformation view, but presents the issue of incomplete information.
* 205 Reset Content - no idea.
* 206 Partial Content - useful.

Given the above concerns, is it still appropriate for the spec to simply state that a 2xx response is valid? It is, after all, the responsibility of the application to implement HTTP correctly, which means it should be aware that each 2xx response has its own semantics. I'm ok with replacing every 200 in the spec with 2xx.

The 3xx range is harder to generalize because of existing expectations as to their semantic meaning. The problem, of course, is caused by the way this entire discovery protocol is defined. If this were a URI Descriptor Discovery protocol, a 3xx response would not be followed for the purpose of obtaining a descriptor. Instead, the Link headers on the 3xx response would be used and the Location header ignored.

Since this is trying to be a Resource Descriptor Discovery protocol, where the resource URI is simply the first breadcrumb, the effort to obtain the Link headers must follow the same rules as the effort to obtain a valid representation of the resource (which does not stop at the first 3xx response).

What I know is that we can't have both. It should not be a matter of opinion which Link header ends up being found or used for the purpose of descriptor discovery. I am wary of defining this in a generic way because there is already too much confusion about what exactly applications should do with each 3xx code.

For each 3xx code, this is how I believe the discovery of Link headers should be performed:

* 300 Multiple Choices - Link headers on the 300 response must be ignored. How to pick the desired representation is out of scope, but one has to be selected and retrieved (rinse and repeat until a 2xx code is received) and its Links used.
* 301 Moved Permanently - Repeat the process using the URI found in the Location header. Link headers on the 301 response must be ignored.
* 302 Found - same as 301.
* 303 See Other - the 303 Link headers are used and the URI found in the Location header is not used for discovery since the Location header points to a different resource.
* 304 Not Modified - does not seem to contain any relevant information, and I'm not sure what to do with any Link headers it may contain.
* 305 Use Proxy - same as 301 but following proxy rules.
* 307 Temporary Redirect - same as 301.

Even if we agree that 304 is not applicable to discovery, we still have a conflicting resolution between 303 and the rest of the response codes. I am open to expanding the allowed range, but will still need to be explicit about the difference between 303 and the rest (a rough sketch of the handling I have in mind follows).
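
To make that concrete, here is a rough client-side sketch (Python, illustrative only, not spec text; 300, 304 and 305 are left out, and the helper names are mine):

    import http.client
    from urllib.parse import urljoin, urlsplit

    MAX_HOPS = 5

    def head(uri):
        """Issue a single HEAD request without following redirects
        (http.client never follows them); return (status, headers)."""
        parts = urlsplit(uri)
        conn_cls = (http.client.HTTPSConnection if parts.scheme == "https"
                    else http.client.HTTPConnection)
        conn = conn_cls(parts.netloc)
        path = (parts.path or "/") + ("?" + parts.query if parts.query else "")
        conn.request("HEAD", path)
        resp = conn.getresponse()
        # Collapsing headers into a dict drops duplicate Link headers;
        # a real client would keep them all.
        headers = {k.lower(): v for k, v in resp.getheaders()}
        conn.close()
        return resp.status, headers

    def discover_links(uri, hops=0):
        """Find the Link header relevant for descriptor discovery,
        applying the per-3xx-code rules listed above."""
        if hops > MAX_HOPS:
            return None
        status, headers = head(uri)
        if 200 <= status < 300:
            return headers.get("link")          # 2xx: use these Links
        if status == 303:
            return headers.get("link")          # 303: use its Links; Location is a different resource
        if status in (301, 302, 307):
            location = headers.get("location")  # ignore Links on the redirect itself
            if location:
                return discover_links(urljoin(uri, location), hops + 1)
        return None                             # anything else: the Link header method yields nothing here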

The 4xx range is easier to deal with because, for the most part, from a discovery pov it is not about the resource but about the request. It represents a hurdle the client has to resolve in order to move past it and obtain a representation.

Without considering any real-world use cases, it is easy to simply dismiss all 4xx responses and declare that the Link header method has failed (other methods, such as the <LINK> element or Site-meta, should then be attempted). But at least one response code can greatly benefit from this discovery protocol: 401. In the context of a 401, Link headers can offer valuable information about how to get past it. Some people seem to suggest that a 404 can be used in a semantic fashion similar to a 303, but I'd rather stay out of that debate.

My assumption is that a 4xx response is not a valid representation of the resource and therefore cannot include Link headers relevant for finding the location of the resource descriptor. It is, however, a valid representation of the resource under very specific conditions, such as its access restrictions.

Even for the 401 use case, it is trivial to move discovery needs to the WWW-Authenticate response header. If the descriptor is directly related to getting past the 401 roadblock, it is probably more appropriate to let the security challenge define its own discovery mechanism rather than try to generalize it here.

After writing this, I'm inclined to remove all 4xx codes from the supported set, including 401. The rule I am following is: no representation, no descriptor.

---

Proposed resolution: allow 2xx, 3xx with different handling of 303 vs all others, leave 4xx undefined, and forbid 1xx and 5xx. Allowing the entire 2xx range will put the burden on the client to follow basic HTTP rules (and know what is not reasonable to expect in a reply to a GET/HEAD request).
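
In code form, the proposed resolution amounts to something like this (a sketch only; the action names are mine, not spec terminology):

    def link_header_action(status):
        """Map an HTTP status code to the discovery action proposed above."""
        if 200 <= status < 300 or status == 303:
            return "use-links"        # 303's Location identifies a different resource
        if 300 <= status < 400:
            return "follow-location"  # repeat discovery at the redirect target
        if 400 <= status < 500:
            return "undefined"        # left undefined by this spec
        return "fail"                 # 1xx and 5xx are out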

> - In TAG discussion the question arose as to why all three methods had
>    to produce the same descriptor resource location.

The language in -02 will be: "If more than one method is supported, all methods MUST produce the same set of resource descriptors." I have taken the more liberal approach.

> - Anywhere you mention 301 and 302 you should also add 307.

Yes. I will also make it clear that redirects should be obeyed when retrieving the HTML or Atom representation in the <LINK> element method.

> - The algorithm in 8.2 is one I strongly object to, as it does not permit
>    Link: on 30x responses, which IMO is a central Semantic Web use case.
>    Consider, for example, a "value added" URI for a document where a
>    301 response provides a Link: to useful metadata, and redirects to
>    the actual document.

See previous discussion. As you clearly demonstrated, it is hard to make generic statements about whole classes of responses (e.g. 3xx, 4xx). You also raise the question of what the descriptor is about: the resource or the URI. My issue with your approach is that it isn't really an interop spec, but a best-practice guide. All I care about is interop, even at the cost of eliminating potentially useful use cases. Note that your handling of 301 above is self-contradictory.

> - Your proposal to specify URI-to-DR-URI rewrites as
>    template="prefix{uri}suffix" is a good start, but I think that the
>    additional ability to specify match conditions on the input URI will
>    end up being important.  In one project I work on we're already
>    using the rule

Please review the current text [1] and let me know if it addresses all your use cases. I am well aware that it is incorrect in its handling of mailto URIs, since they do not have an authority component (a mistake corrected in -02).
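
For reference, the substitution I have in mind is nothing more elaborate than this sketch (illustrative; percent-encoding the whole URI before substitution is one possible answer to the quoting question raised below, not something the current text mandates):

    from urllib.parse import quote

    def expand(template, uri):
        """Expand a link-template of the form "prefix{uri}suffix" by
        percent-encoding the original URI and substituting it for {uri}."""
        return template.replace("{uri}", quote(uri, safe=""))

    # expand("http://example.com/describe?q={uri}", "http://example.com/doc?a=1&b=2")
    # -> "http://example.com/describe?q=http%3A%2F%2Fexample.com%2Fdoc%3Fa%3D1%26b%3D2"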

> - We need to be careful about quoting.  If a DR is meant to be found
>    via a CGI script invoked via a query URI (the link-template prefix
>    has a ? in it), and the original URI already contains significant
>    CGI characters like &, then an application could get into big
>    trouble.  This needs to be either handled directly somehow (I can't
>    imagine how), or left as a combination of a big scary disclaimer and
>    a security warning.

Can you provide examples?

> - I think you need to warn that this protocol should only be applied
>    to URIs not containing a fragment id.  If you allow fragment ids
>    you're going to get into serious problems with both quoting and
>    semantics.

I am not sure what to do here. Should the fragment be removed from the definition of 'uri' in the template vocabulary? That seems like the easiest solution (allowing it to be used explicitly with the 'fragment' variable).
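
A sketch of what I mean (variable names are illustrative, not the template vocabulary as currently written):

    from urllib.parse import urldefrag

    def template_variables(original_uri):
        """Strip the fragment before expansion, exposing it only through
        an explicit 'fragment' variable."""
        uri, fragment = urldefrag(original_uri)
        return {"uri": uri, "fragment": fragment}

    # template_variables("http://example.com/doc#sec2")
    # -> {"uri": "http://example.com/doc", "fragment": "sec2"}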

> [*] Footnote (not relevant unless you care about how RDF might
> interact with this discovery protocol): Suppose U1 and U2 both name
> (denote, identify, refer to, are interpreted to be, etc.) some
> resource R

Where is that established (that both refer to the same resource R)?

> and suppose that
>
>     <U1> describedby <DR1>.
>     <U2> describedby <DR2>.
>
> Then necessarily
>
>     <U1> describedby <DR2>.
>     <U2> describedby <DR1>.

Not without some other external information. We just had a couple of hours of debate on a similar topic at the XRI TC, namely whether multiple resources (via their representations) can point to the same descriptor URI, and whether doing so implies any kind of relationship between them. We decided that it is allowed, but that it does not imply any relationship between the resources pointing to the same descriptor URI.

In other words:

R1-URI --> RD-URI
R2-URI --> RD-URI

Means exactly the same as:

R1-URI --> RD1-URI
R2-URI --> RD2-URI

When the content of RD1 and RD2 is identical.

> - Under <link> element (section 7), please include XHTML along with
> HTML (this came up on a TAG telecon).

Ok.

> - I understand that we desire to stay away from a rigorous treatment
> of authentication, authority, and authorization, leaving that up
> either to risk acceptance or an orthogonal security infrastructure.
> However, we need to specify what the protocol's position is on
> attribution, in the situation where communication *is* secure and/or
> risks are accepted.

Why? Isn't this the role of the descriptor? This might be true for Link headers in general, but as used in this protocol, the only statement allowed is where to find 'information about' the resource. No other conclusion is to be drawn from the presence of a descriptor location.

> <link> has problems in this regard that Link: and site-meta don't.
> Although in the normal case a document speaks for the owner of the URI
> that names it, there are important cases where this doesn't hold. One
> is where the resource is obsolete, so that what it said before is no
> longer true. This is not just a mistake to be fixed as faithfully
> retaining unmodified old versions is often very important.

Not really. While it is not practical to obtain an atomic, point-in-time snapshot of all three methods for a single resource, in theory anyone archiving a representation in which the <LINK> element is to remain useful must also archive the outputs of the other methods. How outdated data is used is out of scope.

>  From a communication point of view, <link> is the best of the three
> methods to link to a DR since there is the least risk that it will get
> detached from the representation.

From an implementation pov, <LINK> in non-XML documents is the least desirable method since parsing HTML is notoriously awful. It was suggested that the spec require at least one method other than the HTML <LINK> element. I am seriously considering it (but doing so will violate the principles declared in the analysis appendix).
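
Even the well-formed case involves a fair amount of machinery. A minimal sketch using a standard HTML parser (Python's html.parser; the rel value 'describedby' is an assumption about what the spec will require, and real-world markup is far messier than this example):

    from html.parser import HTMLParser

    class DescribedByLinks(HTMLParser):
        """Collect href values from <link rel="describedby"> elements."""
        def __init__(self):
            super().__init__()
            self.hrefs = []
        def handle_starttag(self, tag, attrs):
            if tag != "link":
                return
            attrs = dict(attrs)
            rels = (attrs.get("rel") or "").lower().split()
            if "describedby" in rels and attrs.get("href"):
                self.hrefs.append(attrs["href"])

    p = DescribedByLinks()
    p.feed('<html><head><link rel="describedby" href="/meta"></head></html>')
    # p.hrefs -> ['/meta']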

> But your memo does talk about authority (here I
> think we mean what statements can be put in the mouths of what
> principals) as if it's a question it cares about. I think the problem
> of whether <link> speaks for the URI owner ought to be addressed
> somehow.

In -02 I am doing my best to remove any mention of 'authority' other than in relation to RFC 3986.

EHL

[1] http://tools.ietf.org/html/draft-hammer-discovery-01#section-8.3.2.1
