RE: Proposed disposition of Stuart Williams' comments on Metadata in URI 31 from Schleiff, Marty on 2006-09-18 (www-tag@w3.org from September 2006)

From: Schleiff, Marty <marty.schleiff@boeing.com>
Date: Mon, 18 Sep 2006 09:59:00 -0700
To: <noah_mendelsohn@us.ibm.com>, "Stuart Williams" <skw@hp.com>
Cc: <www-tag@w3.org>
Message-ID: <2C1C6A07EEDCB14ABBACAC793BF8BE9E02E96AB8@XCH-NW-6V2.nw.nos.boeing.com>

Hi All,

I found Metadata in URI pretty interesting. While it seems pretty
thorough in exploring metadata about a URI's intended resource, it
doesn't seem to address metadata about the URI itself. Helpful metadata
about the URI (not about the resource) might include things like claims
of persistence vs. one time pseudonym, normalization & matching rules,
ordering rules (e.g., for versions), on click behavior, resolvable or
not, protocol to use, etc.

I think URI schemes do a pretty good job of letting an application know
how to process a URI (http processing is different than https; ldap
processing is different than ftp; etc.). As the TAG is encouraging use
of just a single scheme (i.e., http) for all identifiers, it seems the
TAG should also provide direction on how to convey processing
instructions to relying applications. The only suggestion I have seen,
which I think came from a TAG member, and which I think is not very well
thought out, was something like the following:

    Don't do this:  <newScheme>://<stuff>
    Do do this:     http://<newSchemeOrganization>.org/<stuff>

Then a clever application, upon recognition of
"<newSchemeOrganization>.org", might know to interpret the URI as
specified by NewSchemeOrganization.  While I think this idea might be a
start, it doesn't go far enough. It doesn't answer the following:

1) It relies too much on tribal knowledge. How's an application supposed
to know that "<newSchemeOrganization>.org" is intended to convey
scheme-type information, while "<otherOrganization>.org" does not convey
scheme-type information? I think this could be resolved by introducing a
new DNS top level domain (maybe something like ".scheme" or ".spec")
specifically for the purpose of unambiguously indicating that the URI
has particular characteristics and meaning (e.g.,
http://<newSchemeOrganization>.spec/<stuff>).

2) What should be returned by "http://<newSchemeOrganization>.spec"?
Hopefully it would be a specification describing the special
characteristics of the URIs under this authority.

3) What if several companies collaborate on a new specification, but
there's no organization representing the collaboration of companies,
then what should go into the authority section?

4) TAG members frequently justify the use of a single scheme by claiming
it's expensive to introduce new schemes, and difficult for applications
to be taught how to process new schemes. I claim it would be just as
expensive and difficult to teach applications how to recognize the
various URI characteristics and semantics with a
"<newSchemeOrganization>.spec" approach, and more expensive with a
nebulous "<newSchemeOrganization>.org" approach, and even more expensive
with no approach at all.
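
To make the comparison in item 4 concrete, here is a minimal Python sketch of the two dispatch styles. It is only an illustration under stated assumptions: the ".spec" top level domain is the hypothetical one suggested in item 1, and the host "example-org.spec" and all handler names are invented for this example.

```python
from urllib.parse import urlsplit

# Hypothetical table of processing rules keyed by URI scheme.
SCHEME_HANDLERS = {"http": "generic-http", "ldap": "ldap-lookup"}

# Hypothetical table keyed by authorities under the proposed ".spec" TLD.
SPEC_AUTHORITY_HANDLERS = {"example-org.spec": "example-org-rules"}


def choose_handler(uri: str) -> str:
    """Pick a processing rule from the scheme, refined by a .spec authority."""
    parts = urlsplit(uri)
    host = (parts.hostname or "").lower()
    # Only the reserved suffix marks the authority as conveying
    # scheme-type information; any other host is treated as ordinary.
    if parts.scheme in ("http", "https") and host.endswith(".spec"):
        return SPEC_AUTHORITY_HANDLERS.get(host, "unknown-spec")
    return SCHEME_HANDLERS.get(parts.scheme, "unknown-scheme")
```

Either way the application needs a table entry for every new convention, which is why the cost of teaching it looks about the same in both approaches.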


Marty.Schleiff@boeing.com; CISSP
Associate Technical Fellow - Cyber Identity Specialist
Computing Security Infrastructure
(206) 679-5933

-----Original Message-----
From: noah_mendelsohn@us.ibm.com [mailto:noah_mendelsohn@us.ibm.com] 
Sent: Monday, September 18, 2006 7:34 AM
To: Stuart Williams
Cc: www-tag@w3.org
Subject: Proposed disposition of Stuart Williams' comments on Metadata
in URI 31

At its June 2006 F2F meeting in Amherst, MA, the TAG voted "to accept
http://www.w3.org/2001/tag/doc/metaDataInURI-31-20060609 contingent on
Noah finishing his TODO list to the satisfaction of Ed." [1]  Before I
could wrap up the finding, I received two sets of additional comments,
and I informed the TAG that I would delay publication until I had
reviewed and suggested dispositions for that new input.  In August, I
summarized my proposed responses to the comments received from Bjoern
Hoehrmann [2]. The purpose of this note is to describe the proposed
dispositions for the comments received from Stuart Williams [3,4].

As in my responses to Bjoern, I'm trying to strike a balance.  On the
one hand, I want to be responsive where there are important concerns.
On the other, we always have to pick a point where the TAG will say
"publish", and further comments can be considered as input to possible
revisions.  So, I've tried to respond to Stuart's comments with some
detail and care, but given their late arrival I am setting the bar a
little higher than I might normally in being open to significant redraft
of the findings.  I hope the following strikes a reasonable balance.

The following quotes are from Stuart's comments (the notes in [4])
followed by my comments.  In a moment I will send another note
announcing the posting of a new draft of the finding.  Any changes
proposed below are in that draft (which in fact is already on the Web at
the usual URIs.) 

Stuart's comment:

> The concept of authority wrt to URI is one which some have pushed back
> against. They have argued that the URI scheme itself is what states
> what a given URI identifies. Generally this is presented as an
> operationalised notion of what it means to 'identify' a resource. This
> view would likely also argue that RFC2616 'creates' all possible HTTP
> URIs.

The term is used at least occasionally in the WWW Architecture document,
and in some sense it's locally defined in the metaDataInURI-31 draft:

"The authority that creates a URI is responsible for assuring that it is
associated with the intended resource, "

I don't recall any other TAG members raising this in earlier reviews.
If other members of the TAG think it's worth the effort to come up with
a different approach, I'd be willing, but my vote is to leave this as
is.  I do see your point, but I'm just not convinced there's a problem.

-------

Commenting on:

> "Many URI schemes offer a flexible structure that can also be used to 
> carry additional information, called metadata, about the resource."

Stuart's comment:

> Do you have an example of such a scheme.
> I can't think of any!!!

Sure, the http scheme for example.  I can encode into URIs in that
scheme creation dates, directory hierarchies, file types, and all sorts
of things.  It doesn't provide a standard representation for any one of
those, but that's not the point:  it's a scheme that "can be used" to
carry such information.  Indeed, the subject of the finding is when it
should be used in that way, and when consumers of URIs should depend on
it having been used that way.

-------

Commenting on:

> "The first question is focused on people and software acting in the 
> role of or on behalf of a URI assignment authority (authorities) for 
> URI assignments within the scope of that authority. The other 
> questions are focused on people and software making use of URIs 
> assigned outside of their own authority (observers).

Stuart's comment:

> Whilst I'm conscious that this is either text that I wrote or similar,
> it is again couched in terms of authority, which I know some rejects.
> That said I think that there may be a crossing of layers here in that
> an operationalised view of what a given URI identifies has nothing to
> say about what a resource signifies.

As I said above, I'm OK with speaking of authorities.  On your 2nd
point, I don't see the finding text speaking in those terms, but even if
it did, I think there is a connection, insofar as the definition of a
URI scheme creates the (means by which an authority expresses an)
association between any particular URI and a resource.  If the
"operational" results aren't consistent with or reflective of that, then
I would say the system is misconfigured. For example, if I have in hand
an https URI, and I dereference it over a network that someone has
misconfigured to ignore all the integrity guarantees implied for the
association of an https URI with its resource, I may operationally get a
result that is not really for the resource.  That's true, but it's
because the system is not configured in a manner that reflects the
requirements of the URI scheme that it's supporting.  When things work
right, I think the operational results are reflective of the underlying
resource, at least in whatever sense that the URI scheme establishes
such an association.

I do have some concern with that paragraph, but mainly editorially.  I
think it's a bit clunky, but it seemed to have been in the finding since
before I was involved, and since it was saying things with which I
basically agree, I left it. 

Proposed resolution:  To deal with the clunkiness, I have reworded as: 
"The first question is primarily of concern to URI assignment
authorities, who must choose a suitable URI for each resource that they
control. The other questions are focused on people and software making
use of URIs, whether at the resource authority or elsewhere.  Of course
the questions are related, insofar as one reason for an authority to
encode metadata is for the benefit of resource users."

Stuart's comment:

> FWIW IIRC Roy on the other hand supported the notion of delegated 
> authority passed on downward from the URI spec to scheme specs, to 
> 'owners' of DNS names and so forth.

I'm comfortable with Roy's position.

-------

Commenting on:

> In this example, there is no normative specification that provides for
> determination of a media-type from URI suffixes, and the assignment
> authority has provided no documentation to license an inference of
> media-type from the URI. Martin's browser is in error, because it
> relies on URI metadata that is not covered by normative specifications
> and has not been documented by the assignment authority. A correctly
> written browser would have shown the faulty XML as text, or might
> conceivably have shown a warning about the apparent mismatch between
> the type inferred from the URI and the returned Content-Type. (Martin's
> browser is also ignoring TAG finding "Authoritative Metadata"
> [AUTHMETA], which mandates that the Content-Type HTTP header takes
> precedence even if type information had somehow been reliably encoded
> in the URI.)

Stuart's comment:

> Comment [skw4]: It is in error because it construes that there is 
> metadata intentionally placed in the URI when there is not.

Hmm.  You seem to be saying that we know conclusively that there is no
metadata in that URI, and I don't think that's the case.  In fact, there
may well have been metadata, even in the .xml suffix in question.  The
authority may have decided to use .xml as a suffix for anything that was
originally intended as xml, and in this case has extended that
convention to some buggy XML that is in fact not well formed.  I think
the draft on the finding is correct as it stands:  there may or may not
be metadata in the URI, but the point is we can't know whether it's
there or how to interpret it unless there are normative specifications
or documentation from the assignment authority.  I'm afraid I'm not
convinced on this one.
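
As a rough illustration of the rule the quoted paragraph describes, the following Python sketch lets the Content-Type header win and falls back to a suffix-derived type only when the assignment authority has documented one. The function name and its parameters are assumptions made for this example, not anything defined by the finding or by [AUTHMETA].

```python
from typing import Optional


def media_type_to_use(content_type: Optional[str],
                      documented_suffix_type: Optional[str] = None) -> Optional[str]:
    """Return the media type a client should act on.

    content_type: raw Content-Type header value, or None if none was sent.
    documented_suffix_type: a type inferred from the URI suffix, supplied
        only when the assignment authority has documented that convention.
    """
    if content_type:
        # The header is authoritative; drop parameters such as charset.
        return content_type.split(";", 1)[0].strip().lower()
    # No header: rely on the suffix only if it was documented.
    return documented_suffix_type
```

With this rule, a response served as "text/plain" from a URI ending in .xml is treated as text, whatever the suffix appears to say.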

-------

Stuart's comment:

> typo: reaons -> reasons

Fixed, thank you!

-------

Commenting on:

> There is certain metadata that Martin or his browser can reliably
> determine from the URI. For example, the URI conveys that the http
> scheme has been used, and that attempts to access the resource should
> be directed to the IP address returned from the DNS resolution of the
> string "example.org". These conclusions are licensed by normative
> specifications such as [URI] and [HTTP].

Stuart's comment:

> Comment [skw5]: Hmmmm I
> have always found this tricky. Wrt
> to say FTP URI scheme, the
> scheme tells you (in an operational
> style) what resource is identified -
> it is the resource that would
> provide the resulting
> representation *if* you did a
> particular bunch of things. The
> HTTP spec is the same. However,
> neither is a statement about HOW
> the resource should be accessed,
> only a statement of WHAT
> resource is identified. Ok. Yes,
> typically HTTP: would imply that
> access using http ought to be
> possible.

I've found it tricky too, witness my so far unsuccessful attempts to
tell just this story in the drafts on schemeProtocols.  The question
here is: does the paragraph as quoted above need fixing?  I certainly
think it's right that "the http scheme has been used", as that's covered
by normative specs.  I'm a little less clear on whether I've quite
correctly told the story in saying "that attempts to access the resource
should be directed to the IP address returned from the DNS resolution of
the string "example.org". These conclusions are licensed by normative
specifications such as [URI] and [HTTP]."

I'll ask other TAG members for their opinions, though I really don't
want to back into the whole schemeProtocols discussion.  If necessary,
I'll delete the offending parts of that paragraph.  Unless other TAG
members agree there's a problem, I propose to leave it.


-------

Commenting on:

> Good Practice: Avoid software dependencies on metadata in URIs.

Stuart's comment:

> Comment [skw6]: The tone of
> this seems to me to have a
> presumption that metadata *is*
> embedded in URIs, as opposed to
> "in some cases there happens to be
> metadata embedded in URIs".

The section in which this suggestion was made has been dropped.

> I find myself not wanting to allow
> that the things being cited here as
> metadata are infact metadata. I see
> them mostly as 'distinguishing'
> characteristics which have been
> encoded into URIs

That seems like metadata to me, except in the case where the information
in the URI happens to duplicate what's in the content, in which case
it's arguably "data" not "metadata". 

> principally for
> the purpose of generating unique,
> transcribable URIs, rather than
> with the intent that metadata be
> recoverable from the URI.

I'm not convinced that the motivations of the authority are what's
important.  It's often there.  When it is, or when it appears to be
there (see sections on guessing), it's tempting for clients to rely on
it.  This GPN is saying:  especially in software, don't do that.

Anyway, as noted above, the section has been dropped.

-------

Commenting on:

> that is the only one for which the URI authority has taken specific 
> responsibility.

Stuart's comment:

> Hmmm... I might argue that the
> same assignment authority is
> equally *responsible* for both
> URIs, however they have set no
> particular expectation wrt to the
> second URI (at least in the vicinity
> of Chicago - though who knows
> what might happen to be painted
> on the side of busses in Boston).

Good point.  He's responsible for the URI and the resource, he just
hasn't claimed that it has anything to do with the weather. 

Proposed resolution:

I've reworded that to:  "Bob has seen an advertisement listing just the
Chicago URI, and that is the only one that the URI authority has
warranted will be a useful weather report."

-------

Commenting on:

> Good Practice: Guess information from URIs only when the consequences 
> of an incorrect guess are acceptable.

Stuart suggests:

> Alternative formulation: "When guessing information from URIs be 
> robust to unexpected results."

Honestly, I don't like mine, but I'm afraid I don't like yours much
either.  This part of the finding has always suffered from a certain
circularity or obviousness, and I haven't found a great way to get to
the essence which is:  "Guessing has its downsides, but on balance it's
something people will do and often have good reasons for doing.  Watch
out for the obvious pitfalls."  Doesn't have quite the gravitas I'd
expect in a TAG finding, but I'll give it a little more thought.  Some
chance the original will survive, in part because I haven't come up with
better, in part because it was approved by the TAG, and in part because
I think it's time to ship this and while the above isn't quite up to my
standards, it's not telling anyone to do anything dangerous.

-------

Commenting on:

> Bob could, with this assurance, write his own software to construct 
> and use such URIs to retrieve weather reports.

Stuart writes:

> Ok... but
> Bob's software is also vulnerable
> to change *if* example.org change
> the way that they organise their
> URI space (modulo or not "Cool
> URIs..."). I think that this risks
> overstating the assurance that Bob
> has.

Well, he could just as well hang onto the form for a week, a month or a
year, fill it out, and hit the same problem.  You're right that given
the way browsers work, there's a social expectation that forms are
filled in promptly, but Cool URIs Don't Change, and I think that applies
to the ones with query strings too.  Anyway, ole Bob knows the nature of
the documentation he got (an HTML form), and if he's smart enough to
reverse engineer it to get the URI assignment policy, I bet he's smart
enough to make a guess as to whether the form is time sensitive.

-------

Commenting on:

> Assignment authorities may publish specifications detailing the 
> structure and semantics of the URIs they assign. Other users of those 
> URIs may use such specifications to infer information about resources 
> identified by URI assigned by that authority.

Stuart writes:

> Comment [skw10]:
> I think that the generation of
> unique identifiers is the more
> likely reason for embedding so-called
> metadata in a URI. I suspect
> that in general it is rarely the intent that the URI be parsed to 
> extract what some construe as embedded 'metadata'.
> I think the uniqueness driver
> should be introduced earlier, where
> sufficient static distinguishing
> characteristics are encoded into a
> URI in order to make it unique.

I suppose I'm less convinced than you that we need to get into the
motivations of the assignment authorities, but even if we did, I don't
share your assumptions.  Usually when I see a URI like:  
http://www.cnn.com/2006/WORLD/meast/08/14/carroll/index.html, which
happens to be an actual news report URI from CNN a few weeks ago, I
don't think they are just going for uniqueness.  GUIDs would be far
easier.  While they've presumably chosen the assignment for their own reasons,
it's a good guess as to what metadata they're encoding here, and I can
think of lots of reasons other than uniqueness that they would have done
so.  The very existence and widespread use of .htaccess files in Apache
suggests that metadata is encoded in URIs for reasons other than
uniqueness.  That being the case, I think it's appropriate that this
finding assumes that such metadata will often be there, or appear to be
there, and that it focusses mostly on when to encode it, and whether to
trust it.
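
To show what such a guess looks like in software, here is a small Python sketch that reads the CNN URI above as year, section, subsection, month, and day. That layout is purely an assumption drawn from this one URI; CNN has not documented it here, so this is exactly the kind of guessing the finding cautions about.

```python
from typing import Dict, Optional
from urllib.parse import urlsplit


def guess_news_metadata(uri: str) -> Optional[Dict[str, str]]:
    """Guess date and section from a path like /2006/WORLD/meast/08/14/..."""
    segments = [s for s in urlsplit(uri).path.split("/") if s]
    if len(segments) < 5:
        return None
    year, section, subsection, month, day = segments[:5]
    # Give up rather than invent a date when the pattern does not fit.
    if not (year.isdigit() and month.isdigit() and day.isdigit()):
        return None
    return {
        "date": f"{year}-{month}-{day}",
        "section": section,
        "subsection": subsection,
    }
```

A caller should treat a None result, or any result at all, as a hint rather than a fact unless the assignment authority documents the layout.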

-----------------------------

The draft finding says:

> Assignment authorities may publish specifications detailing the 
> structure and semantics of the URIs they assign. Other users of those 
> URIs may use such specifications to infer information about resources 
> identified by URI assigned by that authority.

Stuart writes:

> I think that given that such specifications may be subject to change, 
> there should be some caution suggested wrt the permanence of any 
> implied commitment on the part of the assignment authority.

As I noted earlier, Cool URIs don't change.  As far as I'm concerned,
the instant the assignment authority publishes the bindings for a family
of URIs, good practice is that the associations for those URIs be set
forever.  On the contrary, rather than warning of impermanence, I'd be
tempted to warn assignment authorities that such documentation does, per
Cool URIs, represent a perpetual commitment at least in principle.  As I
mentioned at the start of this note, I'm setting the bar pretty high on
making changes at this late point, as they are likely to generate more
debate and more delays.  Since I don't think the draft is "broken" I
propose to leave it.  Were I convinced to change it after all, my
starting position would be to add the warning to assignment authorities
that the commitment is perpetual.  Can you live with this resolution?

Thank you for the care with which you reviewed the latest drafts, and
for your patience in waiting for this response.  Please review the new
draft, and let me know whether you are comfortable with the resolutions
contained therein.  I expect this will be published as a TAG Finding
shortly.  Thank you!

Noah

[1] http://www.w3.org/2001/tag/2006/06/14-minutes.html#item01
[2] http://lists.w3.org/Archives/Public/www-tag/2006Aug/0069.html
[3] http://lists.w3.org/Archives/Public/www-tag/2006Jul/0026.html
[4] http://lists.w3.org/Archives/Public/www-archive/2006Jul/att-0009/metaDataInURI-31-skw-ann.pdf

--------------------------------------
Noah Mendelsohn
IBM Corporation
One Rogers Street
Cambridge, MA 02142
1-617-693-4036
--------------------------------------
Received on Monday, 18 September 2006 16:59:57 UTC