
Re: discovering site metadata

From: Al Gilman <asgilman@iamdigex.net>
Date: Fri, 08 Feb 2002 11:59:20 -0500
Message-Id: <200202081659.LAA964806@smtp1.mail.iamworld.net>
To: Mark Nottingham <mnot@mnot.net>
Cc: Dan Connolly <connolly@w3.org>, www-talk@w3.org
At 10:57 AM 2002-02-08 , Mark Nottingham wrote:
>On Fri, Feb 08, 2002 at 09:14:08AM -0500, Al Gilman wrote:
>> The root URL for a domain is the commercial best practice for the
>> page that is the default and generic entry point for the realm
>> proper to one "corporate author" or virtual business entity.
>I think it's much more than that; it's a resource that is the
>hierarchical root of all other resources made available by that

This is commercial custom but not part of the technical specification. 

Hierarchy in the namespace is a syntactic convenience as far as URIs are
concerned.  Reading the path segments as nested contexts is a good guess based
on practice, but it is not a requirement for using this form of URL.  Compare
with URL-encoded script parameters that use path segments rather than the
searchpart syntax.  The sense of the sequence of path segments is at the
discretion of the service offeror.
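
To illustrate the point about path segments versus the searchpart, here is a minimal Python sketch. The URLs, the parameter names `year` and `month`, and both helper functions are hypothetical; the only claim is that the same parameters can travel in either form, and that the positional reading of the path is the service's own convention.

```python
from urllib.parse import urlsplit, parse_qs

def params_from_query(url):
    """Read parameters from the searchpart (?key=value&...)."""
    return {k: v[0] for k, v in parse_qs(urlsplit(url).query).items()}

def params_from_path(url, names):
    """Read the same parameters from trailing path segments.

    The mapping from position to name (`names`) is the service's own
    convention; nothing in URI syntax supplies it.
    """
    segments = [s for s in urlsplit(url).path.split("/") if s]
    return dict(zip(names, segments[-len(names):]))

a = params_from_query("http://www.example.org/lookup?year=2002&month=02")
b = params_from_path("http://www.example.org/lookup/2002/02", ["year", "month"])
print(a == b)  # True: same parameters, different URL syntax
```

A client that assumed "/lookup/2002" is a parent context of "/lookup/2002/02" would be guessing; here the segments are just positional arguments.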

Publishing entities (people and pseudo-people such as corporations and other
ORGs) should have the capability to publish as one web presence arbitrary
collections of URIs.

A collection of data which is excised from the assets on hand at a server and
dispatched to a recipient as a representation of "a resource" needs more
"packaging for delivery" than just the Location reference.  Anything from the
context of that Location that matters should be pulled into the packaging (SOAP
envelope, HTTP headers) as an explicit reference.
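
As a rough sketch of "packaging for delivery", the Python below wraps a body in a header set that carries its context explicitly. `Content-Location` is a standard header; `MoreAboutMe` is the hypothetical header name used elsewhere in this thread, and the `package` function and URLs are assumptions made for illustration only.

```python
from email.message import Message

def package(body, location, more_about_me):
    """Wrap a representation with the context it needs once detached."""
    msg = Message()
    msg["Content-Location"] = location   # where this representation came from
    msg["MoreAboutMe"] = more_about_me   # hypothetical link to further metadata
    msg.set_payload(body)
    return msg

msg = package(
    "<html>...</html>",
    "http://www.example.org/products/widget",
    "http://www.example.org/aboutme.xml",
)
print(msg.as_string())
```

Once stored at the recipient, the message still says where it came from and where to learn more, without relying on any position in a site hierarchy.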

This requirement applies to any datagram, whether HTTP or MIME.  The datagrams
in HTTP are a clear fact of operational life.  Once sent, the data sent may
become a persistent record at the recipient's facility and need to be properly
_characterized_ as well as 'identified' as a result.

Meeting the requirement at the datagram level fills all 'site' requirements. 
And taking special measures for sites fails to meet the datagram requirements,
which are real and subsume site metadata.


>> The HEAD request gets you metadata for that resource, be it the
>> site or the first page on entering the site.
>> And the metadata captured in the HTTP headers / SOAP envelope for
>> the launch page includes generic terms and conditions for that page
>> such as "general properties and conventions practiced on this
>> site." That is not all there inline in the headers, of course, but
>> there is a "more about me" reference that starts you down the path
>> to wandering that patch of web.
>The problem is that this requires two requests; one to get the
>reference to 'more about me', then one to dereference it. It also
>requires all metadata to be lumped into the 'more about me' resource,
>which implies that either a) everyone has to agree on the format of
>representations of it, or b) it produces links to the actual
>application-specific resources.
>What you end up with, with something like P3P, then, is
>1. GET / ; response includes MoreAboutMe: /aboutme.xml
>2. GET /aboutme.xml ; response includes
>   "<p3p:policyref>/p3p.prf</p3p:policyref>"
>3. GET /p3p.prf ; response points to the policy /policy.xml
>4. GET /policy.xml
>Whereas, using conneg, it's collapsed down to 
>1. GET / Accept: application/p3p-prf+xml ; response points to the
>   policy /policy.xml
>2. GET /policy.xml
>I'd like to minimise the coordination required to specify such
>metadata; there are a lot of subtle requirements for things like P3P
>that may not be considered by a
>lump-all-metadata-(references)-in-resource-x approach, which could
>act as a bottleneck.
>My current concerns with the conneg approach are that there is no URI
>for the representation returned (unless the response is a redirect),
>and that it needs to be clear that such requests may be equivalent to
>P3P's 'safe zone'; i.e., no cookies are required or recorded, etc.
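
The two request sequences quoted above can be compared with a small simulation. This is only a sketch: the dictionaries stand in for a server, each path maps to the reference its response points at, and the paths are the ones from the quoted example.

```python
# Link-following approach: each response points at the next resource.
LINK_CHAIN = {
    "/": "/aboutme.xml",         # MoreAboutMe header
    "/aboutme.xml": "/p3p.prf",  # <p3p:policyref> element
    "/p3p.prf": "/policy.xml",   # policy reference file
    "/policy.xml": None,         # the policy itself
}

# Content negotiation: GET / with Accept: application/p3p-prf+xml
# returns the policy reference file directly.
CONNEG_CHAIN = {
    "/": "/policy.xml",
    "/policy.xml": None,
}

def count_requests(chain, start="/"):
    """Follow references from `start`; return the number of GETs issued."""
    requests = 0
    path = start
    while path is not None:
        requests += 1
        path = chain[path]
    return requests

print(count_requests(LINK_CHAIN))    # 4
print(count_requests(CONNEG_CHAIN))  # 2
```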
>> You should neither have to descend the tree nor ascend the tree to
>> get this information.  You should be able to get there following
>> HyperReferences.  And the commercial best practice already gives
>> you a 'forward' hyperlink HOME from essentially every fragment of
>> the corpus that is the site content.  But the "more about me"
>> reference in the headers should be populated on every page
>> regardless.  And be context specific to that page where the service
>> offeror has thought that through.  But the page-specific stuff would
>> embed a forward-motion path on to the site-generic context
>> conditions review should that be different,
>I agree, and these mechanisms are seeing more use, as UAs start to
>support them (see Mozilla and LINKs). However, there are some
>applications where they are not useful; these are the ones I'm
>interested in. I've pasted the introduction from that I-D below.
>1. Introduction
>   An increasingly common requirement for Web technologies is to
>   describe metadata about a group of resources, rather than just a
>   single resource.
>   The most commonly deployed solutions to this problem involve
>   defining a 'well-known location' for a resource which describes a
>   mapping of metadata to resources.
>   For example:
>   o  P3P[2] uses the resource on the path '/w3c/p3p.xml' as a
>      'Policy Reference File', which maps privacy policies to
>      different portions of the site.
>   o  The Robot Exclusion[3] convention uses the path '/robots.txt'
>      to direct automated agents as to which portions of the site are
>      not to be visited.
>   o  A recent proposal, WS-Inspection[4], uses a well-known location
>     '/inspection.wsil' to aid in the location of Web Services.
>   There are a number of reasons for the well-known location
>   approach:
>   o  Scoping - by defining a single metadata source that is tied to
>      the URI authority, the metadata statements contained within can
>      be considered authoritative for that site.  For example, the
>      P3P Policy Reference File at
>      is authoritative for www.example.org, because it is under the
>      control of www.example.org.
>   o  Efficiency - it is cumbersome to directly embed metadata in
>      every representation of a resource produced, both because of
>      the management overhead involved, and the byte bloat in the
>      representations themselves.
>   o  Flexibility - often, statements about a resource are separate
>      in time from its representations.  Separating them allows them
>      to be changed without affecting the resources themselves.
>   o  Privacy - some metadata affects the way requests are made (or
>      not made), bringing the requirement that the metadata is known
>      beforehand.  For example, the metadata in
>      http://www.example.net/robots.txt must be known before other
>      requests can be made by a robot, because it might preclude
>      them.
>   However, use of a well-known location imposes the protocol
>   designers' choice of identifiers into publishers' URI namespaces. 
>   The chosen identifier may not be easy to make available, depending
>   on the nature of the server implementation, or it may be
>   impractical to integrate the well-known location into content
>   management systems.
>   This document defines a mechanism whereby protocols can specify
>   site metadata without tying it to a well-known location, by using
>   mechanisms in the HTTP [1].
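
The well-known location convention described in the quoted introduction can be sketched as follows. The three paths come from the quoted text; the `well_known_url` helper and the sample URL are hypothetical. The sketch shows how the metadata URL is derived purely from the URI authority, which is what makes the scoping argument work and also what imposes a fixed name on the publisher's namespace.

```python
from urllib.parse import urlsplit, urlunsplit

# Well-known paths named in the quoted introduction.
WELL_KNOWN_PATHS = {
    "p3p": "/w3c/p3p.xml",
    "robots": "/robots.txt",
    "ws-inspection": "/inspection.wsil",
}

def well_known_url(resource_url, kind):
    """Return the site-metadata URL for the authority of `resource_url`."""
    parts = urlsplit(resource_url)
    # Keep scheme and authority; discard the resource's own path and query.
    return urlunsplit((parts.scheme, parts.netloc, WELL_KNOWN_PATHS[kind], "", ""))

print(well_known_url("http://www.example.net/a/b/page.html?x=1", "robots"))
# http://www.example.net/robots.txt
```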

>Mark Nottingham
Received on Friday, 8 February 2002 11:59:07 UTC
