Re: discovering site metadata (more)

At 10:57 AM 2002-02-08, Mark Nottingham wrote:
>
>On Fri, Feb 08, 2002 at 09:14:08AM -0500, Al Gilman wrote:
>> The root URL for a domain is the commercial best practice for the
>> page that is the default and generic entry point for the realm
>> proper to one "corporate author" or virtual business entity.
>
>I think it's much more than that; it's a resource that is the
>hierarchical root of all other resources made available by that
>authority.
>
>
>> The HEAD request gets you metadata for that resource, be it the
>> site or the first page on entering the site.
>
>> And the metadata captured in the HTTP headers / SOAP envelope for
>> the launch page includes generic terms and conditions for that page
>> such as "general properties and conventions practiced on this
>> site." That is not all there inline in the headers, of course, but
>> there is a "more about me" reference that starts you down the path
>> to wandering that patch of web.
>
>The problem is that this requires two requests; one to get the
>reference to 'more about me', then one to dereference it. It also
>requires all metadata to be lumped into the 'more about me' resource,
>which implies that either a) everyone has to agree on the format of
>representations of it, or b) it produces links to the actual
>application-specific resources.

First: the network-transaction-count performance issue.

You are overlooking the fact that all markup is metadata.  The metainformation
that is of interest to most visitors is already carried in the semantics of the
markup in the body of the HTTP response, that is, in the representation of the
resource _on the face of it_.
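
For instance, a page along these lines (purely an illustrative sketch, not any
particular site's markup) already tells the visitor most of what they want to
know, right in the body:

   <html>
     <head>
       <title>Example Widgets - Catalog</title>
       <meta name="description" content="Catalog of widgets from Example, Inc.">
       <link rel="home" href="/">
       <link rel="help" href="/about/conventions.html">
     </head>
     <body>
       <h1>Catalog</h1>
       <p><a href="/">HOME</a> | <a href="/about/">About this site</a></p>
       ...
     </body>
   </html>

The title, the description, and the rel-typed links are all metadata, delivered
in-band with the representation itself.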

There will be further, more definite metadata that is of interest, but _to a
minority of accessors_.  So an extra HTTP round trip for that information is
appropriate.

The most natural way to link to this information of infrequent interest is by
reference, so that the network-transit cost of the indirection is appropriately
inverse to the frequency with which users follow that path, given that they
have any interest in the site at all.
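
Put as wire traffic, and borrowing the hypothetical MoreAboutMe header from
your own example below, the common case costs a single exchange, and only the
interested minority ever pays for the second:

   GET / HTTP/1.1
   Host: www.example.org

   HTTP/1.1 200 OK
   Content-Type: text/html
   MoreAboutMe: /aboutme.xml

   ...the markup in this body is the metadata most visitors need...

   (only the minority who care then follow the reference)

   GET /aboutme.xml HTTP/1.1
   Host: www.example.org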

Second: on the representation of the "more about me" information.  Not
everyone has to agree on the representation of everything that can be learned
from the next representation that comes down.  There has to be a universally
recognized core, of course.  The URI reference forms that core: it starts a
machinable process by which the accessor can unravel each representation in
turn, with each one carrying further allusions to the decoding aids it needs.
See the RDDL applications for examples.  There can be more layers beyond the
first type/subtype distinction implied in the "more about me" reference, in
the incremental access to contextual clauses and refinements of
interpretation.  The point is that at each stage the connections are
themselves connected; each step leads you to the next.
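
To make that concrete: something in the spirit of RDDL (the document shape and
the role URIs below are only illustrative assumptions, not a published
vocabulary) would let the "more about me" resource hand out typed pointers
onward, so the accessor can decide at each step whether it knows how to
proceed:

   GET /aboutme.xml HTTP/1.1
   Host: www.example.org

   HTTP/1.1 200 OK
   Content-Type: application/xml

   <about xmlns:rddl="http://www.rddl.org/"
          xmlns:xlink="http://www.w3.org/1999/xlink">
     <!-- a typed pointer to the privacy policy reference file -->
     <rddl:resource xlink:type="simple"
                    xlink:href="/p3p.prf"
                    xlink:role="http://www.w3.org/2002/01/P3Pv1"/>
     <!-- a typed pointer to the site's general conventions -->
     <rddl:resource xlink:type="simple"
                    xlink:href="/about/conventions.html"
                    xlink:role="http://www.w3.org/1999/xhtml"/>
   </about>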

Al

>
>What you end up with, with something like P3P, then, is
>
>1. GET / ; response includes MoreAboutMe: /aboutme.xml
>2. GET /aboutme.xml ; response includes
>   "<p3p:policyref>/p3p.prf</p3p:policyref>"
>3. GET /p3p.prf ; response points to the policy /policy.xml
>4. GET /policy.xml
>
>Whereas, using conneg, it's collapsed down to 
>
>1. GET / Accept: application/p3p-prf+xml ; response points to the
>   policy /policy.xml
>2. GET /policy.xml
>
>I'd like to minimise the coordination required to specify such
>metadata; there are a lot of subtle requirements for things like P3P
>that may not be considered by a
>lump-all-metadata-(references)-in-resource-x approach, which could
>act as a bottleneck.
>
>My current concerns with the conneg approach are that there is no URI
>for the representation returned (unless the response is a redirect),
>and that it needs to be clear that such requests may be equivalent to
>P3P's 'safe zone'; i.e., no cookies are required or recorded, etc.
>
>
>> You should neither have to descend the tree nor ascend the tree to
>> get this information.  You should be able to get there following
>> HyperReferences.  And the commercial best practice already gives
>> you a 'forward' hyperlink HOME from essentially every fragment of
>> the corpus that is the site content.  But the "more about me"
>> reference in the headers should be populated on every page
>> regardless.  And be context specific to that page where the service
>> offeror has thought that through.  But the page-specific stuff would
>> embed a forward-motion path on to the site-generic context
>> conditions review, should that be different.
>
>I agree, and these mechanisms are seeing more use, as UAs start to
>support them (see Mozilla and LINKs). However, there are some
>applications where they are not useful; these are the ones I'm
>interested in. I've pasted the introduction from that I-D below.
>
>1. Introduction
>
>   An increasingly common requirement for Web technologies is to
>   describe metadata about a group of resources, rather than just a
>   single resource.
>
>   The most commonly deployed solutions to this problem involve
>   defining a 'well-known location' for a resource which describes a
>   mapping of metadata to resources.
>
>   For example:
>
>   o  P3P[2] uses the resource on the path '/w3c/p3p.xml' as a
>      'Policy Reference File', which maps privacy policies to
>      different portions of the site.
>
>   o  The Robot Exclusion[3] convention uses the path '/robots.txt'
>      to direct automated agents as to which portions of the site are
>      not to be visited.
>
>   o  A recent proposal, WS-Inspection[4], uses a well-known location
>      '/inspection.wsil' to aid in the location of Web Services.
>
>   There are a number of reasons for the well-known location
>   approach:
>
>   o  Scoping - by defining a single metadata source that is tied to
>      the URI authority, the metadata statements contained within can
>      be considered authoritative for that site.  For example, the
>      P3P Policy Reference File at http://www.example.org/w3c/p3p.xml
>      is authoritative for www.example.org, because it is under the
>      control of www.example.org.
>
>   o  Efficiency - it is cumbersome to directly embed metadata in
>      every representation of a resource produced, both because of
>      the management overhead involved, and the byte bloat in the
>      representations themselves.
>
>   o  Flexibility - often, statements about a resource are separate
>      in time from its representations.  Separating them allows them
>      to be changed without affecting the resources themselves.
>
>   o  Privacy - some metadata affects the way requests are made (or
>      not made), bringing the requirement that the metadata is known
>      beforehand.  For example, the metadata in
>      http://www.example.net/robots.txt must be known before other
>      requests can be made by a robot, because it might preclude
>      them.
>
>   However, use of a well-known location imposes the protocol
>   designers' choice of identifiers on publishers' URI namespaces.
>   The chosen identifier may not be easy to make available, depending
>   on the nature of the server implementation, or it may be
>   impractical to integrate the well-known location into content
>   management systems.
>
>   This document defines a mechanism whereby protocols can specify
>   site metadata without tying it to a well-known location, by using
>   mechanisms in HTTP [1].
>
>
>-- 
>Mark Nottingham
>http://www.mnot.net/
> 
>  

Received on Friday, 8 February 2002 12:21:03 UTC