Re: discovering site metadata

On Fri, Feb 08, 2002 at 09:14:08AM -0500, Al Gilman wrote:
> The root URL for a domain is the commercial best practice for the
> page that is the default and generic entry point for the realm
> proper to one "corporate author" or virtual business entity.

I think it's much more than that; it's a resource that is the
hierarchical root of all other resources made available by that
authority.


> The HEAD request gets you metadata for that resource, be it the
> site or the first page on entering the site.

> And the metadata captured in the HTTP headers / SOAP envelope for
> the launch page includes generic terms and conditions for that page
> such as "general properties and conventions practiced on this
> site." That is not all there inline in the headers, of course, but
> there is a "more about me" reference that starts you down the path
> to wandering that patch of web.

The problem is that this requires two requests; one to get the
reference to 'more about me', then one to dereference it. It also
requires all metadata to be lumped into the 'more about me' resource,
which implies that either a) everyone has to agree on the format of
representations of it, or b) it produces links to the actual
application-specific resources.

What you end up with, with something like P3P, then, is

1. GET / ; response includes MoreAboutMe: /aboutme.xml
2. GET /aboutme.xml ; response includes
   "<p3p:policyref>/p3p.prf</p3p:policyref>"
3. GET /p3p.prf ; response points to the policy /policy.xml
4. GET /policy.xml

Whereas, using conneg, it's collapsed down to

1. GET / with Accept: application/p3p-prf+xml ; response points to
   the policy /policy.xml
2. GET /policy.xml
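
To make the request counts concrete, here is a minimal sketch that
walks both flows against a toy in-memory lookup table. Nothing in it
talks to a real server; the paths are the ones from the example above,
while SITE and fetch_policy are hypothetical names I made up for
illustration, and the table only records "which resource does this
response point to next".

```python
# Toy model of the two flows: maps (path, Accept media type) to the path
# that the response refers to next (None means "this is the policy").
# SITE and fetch_policy are illustrative names, not part of any protocol.
SITE = {
    ("/", "*/*"): "/aboutme.xml",                      # MoreAboutMe reference
    ("/aboutme.xml", "*/*"): "/p3p.prf",               # p3p:policyref element
    ("/p3p.prf", "*/*"): "/policy.xml",                # policy reference file
    ("/", "application/p3p-prf+xml"): "/policy.xml",   # negotiated on "/"
    ("/policy.xml", "*/*"): None,                      # the policy itself
}


def fetch_policy(first_accept):
    """Follow references from "/" to the policy; return the paths requested."""
    path, accept, requested = "/", first_accept, []
    while path is not None:
        requested.append(path)
        path = SITE[(path, accept)]
        accept = "*/*"  # only the first request uses content negotiation
    return requested


link_walk = fetch_policy("*/*")
conneg = fetch_policy("application/p3p-prf+xml")
print(len(link_walk), link_walk)
print(len(conneg), conneg)
```

The link-following walk issues four requests and the negotiated one
issues two, which is the whole of the saving being argued for here.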

I'd like to minimise the coordination required to specify such
metadata; there are a lot of subtle requirements for things like P3P
that may not be considered by a
lump-all-metadata-(references)-in-resource-x approach, which could
act as a bottleneck.

My current concerns with the conneg approach are that there is no URI
for the representation returned (unless the response is a redirect),
and that it needs to be clear that such requests may be equivalent to
P3P's 'safe zone'; i.e., no cookies are required or recorded, etc.


> You should neither have to descend the tree nor ascend the tree to
> get this information.  You should be able to get there following
> HyperReferences.  And the commercial best practice already gives
> you a 'forward' hyperlink HOME from essentially every fragment of
> the corpus that is the site content.  But the "more about me"
> reference in the headers should be populated on every page
> regardless.  And be context specific to that page where the service
> offeror has thought that through.  But the page-specific stuff would
> embed a forward-motion path on to the site-generic context
> conditions review should that be different,

I agree, and these mechanisms are seeing more use as UAs start to
support them (see Mozilla and LINKs). However, there are some
applications where they are not useful; these are the ones I'm
interested in. I've pasted the introduction from that I-D below.

1. Introduction

   An increasingly common requirement for Web technologies is to
   describe metadata about a group of resources, rather than just a
   single resource.

   The most commonly deployed solutions to this problem involve
   defining a 'well-known location' for a resource which describes a
   mapping of metadata to resources.

   For example:

   o  P3P[2] uses the resource on the path '/w3c/p3p.xml' as a
      'Policy Reference File', which maps privacy policies to
      different portions of the site.

   o  The Robot Exclusion[3] convention uses the path '/robots.txt'
      to direct automated agents as to which portions of the site are
      not to be visited.

   o  A recent proposal, WS-Inspection[4], uses a well-known location
      '/inspection.wsil' to aid in the location of Web Services.
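
   As a rough illustration of the convention (not taken from any of
   these specifications), a client can derive each well-known URL
   from any URL on the site by keeping the scheme and authority and
   replacing the path.  The WELL_KNOWN table and the well_known_url
   helper are hypothetical names; the paths are the ones listed
   above.

```python
from urllib.parse import urljoin

# Well-known paths named in the list above; the dictionary keys are
# arbitrary labels chosen for this sketch.
WELL_KNOWN = {
    "p3p": "/w3c/p3p.xml",
    "robots": "/robots.txt",
    "wsil": "/inspection.wsil",
}


def well_known_url(any_url, kind):
    """Return the well-known metadata URL for the site that serves any_url."""
    # Joining an absolute path keeps the scheme and authority of any_url
    # and discards its path, query and fragment.
    return urljoin(any_url, WELL_KNOWN[kind])


print(well_known_url("http://www.example.org/a/b.html?x=1", "p3p"))
print(well_known_url("http://www.example.net/deep/page", "robots"))
```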

   There are a number of reasons for the well-known location
   approach:

   o  Scoping - by defining a single metadata source that is tied to
      the URI authority, the metadata statements contained within can
      be considered authoritative for that site.  For example, the
      P3P Policy Reference File at http://www.example.org/w3c/p3p.xml
      is authoritative for www.example.org, because it is under the
      control of www.example.org.

   o  Efficiency - it is cumbersome to directly embed metadata in
      every representation of a resource produced, both because of
      the management overhead involved, and the byte bloat in the
      representations themselves.

   o  Flexibility - often, statements about a resource are separate
      in time from its representations.  Separating them allows them
      to be changed without affecting the resources themselves.

   o  Privacy - some metadata affects the way requests are made (or
      not made), bringing the requirement that the metadata is known
      beforehand.  For example, the metadata in
      http://www.example.net/robots.txt must be known before other
      requests can be made by a robot, because it might preclude
      them.
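
   The "known beforehand" requirement can be sketched with Python's
   standard urllib.robotparser module.  The robots.txt rules and the
   "ExampleBot" agent name below are made up for illustration, and
   the rules are parsed from a string rather than fetched, so this
   shows only the ordering constraint: the metadata is consulted
   first, and it decides whether a later request is made at all.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical contents of http://www.example.net/robots.txt; a real robot
# would have to retrieve this before requesting anything else on the site.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""

rules = RobotFileParser()
rules.parse(ROBOTS_TXT.splitlines())


def may_request(url, agent="ExampleBot"):
    """Return True if the already-loaded rules permit agent to request url."""
    return rules.can_fetch(agent, url)


print(may_request("http://www.example.net/private/report.html"))
print(may_request("http://www.example.net/index.html"))
```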

   However, use of a well-known location imposes the protocol
   designers' choice of identifiers on publishers' URI namespaces.
   The chosen identifier may not be easy to make available, depending
   on the nature of the server implementation, or it may be
   impractical to integrate the well-known location into content
   management systems.

   This document defines a mechanism whereby protocols can specify
   site metadata without tying it to a well-known location, by using
   mechanisms in HTTP [1].


-- 
Mark Nottingham
http://www.mnot.net/
 

Received on Friday, 8 February 2002 11:19:11 UTC