siteData-36: strawman from Tim Bray on 2003-02-27 (www-tag@w3.org from February 2003)

From: Tim Bray <tbray@textuality.com>
Date: Thu, 27 Feb 2003 10:49:17 -0800
To: www-tag@w3.org
Message-ID: <3E5E5DAD.9020704@textuality.com>

I took an action item at the last TAG telecon to raise a strawman proposal. 
TBL launched this with his proposal at 
http://lists.w3.org/Archives/Public/www-tag/2003Feb/0093.html.  He 
outlines the problem and proposes a new HTTP header (the note says "HTTP 
tag" but that's a typo), but isn't quite explicit enough in 
acknowledging that we're inventing a new architectural thing, the notion 
of a "site".

Here's how I'd come at it.  Right now the web architecture doesn't have 
any formal notion of a "site", and software that tries to pretend it 
does by and large doesn't do a very good job (as the author of two 
large-scale web spiders I have bitter first-hand knowledge).  Things 
like /robots.txt that try to pretend that a host is a site have problems 
because, well, a host isn't always a site.

So let's introduce a formal notion of a "Web Site", which is a 
collection of Resources, each identified by URI.  A resource can be in 
more than one site - not an obvious choice, but it seems it would be 
hard to enforce a rule to the contrary.

Since a Web Site is an interesting and important thing, it ought to be a 
resource and ought to have a URI.  There is no point trying to write any 
rules about whether all the resources on a site ought to be on the same 
host or whether the site's URI should look like those of the resources.

Then you introduce a new HTTP header as TBL suggested.  I'd call it 
"Web-site" or just "Site".  Any server could, but need not, include this 
header in a response to a GET or HEAD request.
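
To make the client side concrete, here is a minimal sketch in Python of how a spider might pick up such a header. It is only an illustration: the header name "Web-site" is one of the two candidates above, the function name is made up, and resolving a relative value against the request URI is my assumption, not something the proposal specifies.

```python
from typing import Mapping, Optional
from urllib.parse import urljoin

# Hypothetical header name; the proposal suggests "Web-site" or just "Site".
SITE_HEADER = "Web-site"


def site_uri_from_headers(headers: Mapping[str, str],
                          request_url: str) -> Optional[str]:
    """Return the site URI advertised in the response headers, or None.

    Header names are compared case-insensitively, as HTTP requires.
    Assumption: a relative value is resolved against the request URI.
    """
    for name, value in headers.items():
        if name.lower() == SITE_HEADER.lower():
            return urljoin(request_url, value.strip())
    # The header is optional, so its absence is not an error.
    return None
```

A spider would call this on the headers of any GET or HEAD response and group resources by the URI it returns, rather than by host name.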

You could easily include the same information in the <head> of HTML 
documents, along the lines of

  <meta http-equiv="Web-site" content="http://example.com/site" />

Perhaps <link> would be better, or perhaps the HTML people might want to 
define new markup for the purpose.
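
For software that only sees the document body, the <meta> form above could be read with something like the following Python sketch. The class and function names are hypothetical, and it handles only the <meta http-equiv="Web-site"> form shown above, since no <link> syntax has been defined. It returns a list because a resource may belong to more than one site.

```python
from html.parser import HTMLParser
from typing import List


class SiteMetaParser(HTMLParser):
    """Collects the content of every <meta http-equiv="Web-site"> element."""

    def __init__(self) -> None:
        super().__init__()
        self.site_uris: List[str] = []

    def handle_starttag(self, tag, attrs):
        # Also called for self-closing tags such as <meta ... />.
        if tag != "meta":
            return
        # Attribute names arrive lowercased; values may be None.
        attr = {name: (value or "") for name, value in attrs}
        if attr.get("http-equiv", "").lower() == "web-site" and attr.get("content"):
            self.site_uris.append(attr["content"])


def site_uris_from_html(html: str) -> List[str]:
    """Return all site URIs declared in the given HTML text."""
    parser = SiteMetaParser()
    parser.feed(html)
    parser.close()
    return parser.site_uris
```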

Of course, this leads inevitably to the question of what is a useful 
representation for a site.  The kinds of stuff that could go there 
include robots info, language info, a favicon.ico equivalent, RSS info, 
P3P info, and so on.  Unlike the RDDL issues we've been discussing, I 
see little requirement for human readability, so this feels like a 
natural for a small (but extensible) RDF vocabulary; who cares if it's 
ugly.  The RDF assertions would mostly have as their subject the URI "" 
(the empty relative reference, i.e. the site document itself), which 
works well in this case.  -Tim

Received on Thursday, 27 February 2003 13:49:22 UTC