
Re: Proposed issue: site metadata hook

From: Paul Prescod <paul@prescod.net>
Date: Thu, 13 Feb 2003 19:36:41 -0800
Message-ID: <3E4C6449.7060305@prescod.net>
To: Seairth Jacobs <seairth@seairth.com>, www-tag@w3.org

Seairth Jacobs wrote:
>...
> 
> Any such hook might need to keep a few things in mind (imho):
> 
> 1) In the case of /robots.txt, /w3c/p3p.xml, and /favicon.ico, these can be
> easily maintained by even the least experienced person just by copying the
> appropriate file to the appropriate location.  That's it.  No other files,
> headers, server settings, etc. need to be touched.  Requiring people to do
> any more than this seems like an uphill battle.

True, but the end-user's workflow is a reflection of their available 
tools. Just as servers know that "index.html" is magical, they 
could/should know that "robots.txt" is magical. That makes the server 
vendors, not the user, the ones claiming part of the user's namespace, 
which is okay as long as the user can configure the server to use the 
namespace differently (just as you can turn off the magic handling of 
"index.html").
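To make the point concrete, here is a minimal sketch of that idea. All names are hypothetical, not any real server's API: the "magic" paths live in a default table, but the table is configuration, so the site owner can remap or disable any entry, just like DirectoryIndex lets you change the "index.html" magic.

```python
# Hypothetical sketch: the server, not the user, claims the magic paths,
# but the mapping is configurable so the user can reclaim the namespace.

DEFAULT_MAGIC = {
    "/robots.txt": "serve_robots_policy",
    "/favicon.ico": "serve_site_icon",
}

def resolve(path, overrides=None):
    """Return the handler name for a magic path, or None for a normal file.

    `overrides` lets the site owner remap a magic path or disable it
    entirely (by mapping it to None).
    """
    table = dict(DEFAULT_MAGIC)
    if overrides:
        table.update(overrides)
    return table.get(path)
```

With no overrides, `resolve("/robots.txt")` hits the vendor's default; an owner who wants "/robots.txt" to be an ordinary file just passes `{"/robots.txt": None}`.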

> 2) In the case of robots.txt, any hook that provides an added level of
> indirection will likely not be adopted.  For instance, if GoogleBot has to
> issue a HEAD /, then follow a URI (returned in the header) to get back an
> RDF document, then parse the document to find the location of the robots.txt
> file, then turn around and do this for every other site on the web it
> indexes, I'm guessing Google would continue on with the /robots.txt file.

How many "sites" do you think Google indexes versus pages? Also, Google 
doesn't have to do a HEAD. It more likely does a GET, because it is 95% 
likely to need the root homepage anyhow. If it finds a metadata URL, and 
that metadata happens to say "don't index me", then Google throws away 
the page it got.
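The crawl flow described above can be sketched roughly as follows. This is an illustration, not Google's actual logic, and the details are assumptions: the metadata URL is taken to arrive in an HTTP "Link" header, and `fetch_metadata` stands in for whatever retrieves and parses the metadata document.

```python
# Hypothetical sketch: one GET already fetched the homepage; if the
# response advertises a metadata URL and that metadata says "don't
# index", the already-fetched page is simply discarded -- no extra
# round trip happens in the common case.

def should_keep(headers, fetch_metadata):
    """Decide whether to index a fetched page.

    `headers` is the response-header dict from the GET;
    `fetch_metadata` is a callable (assumed interface) that retrieves
    the metadata document and returns it as a dict.
    """
    link = headers.get("Link", "")
    if 'rel="meta"' not in link:
        return True                      # no metadata hook: index as usual
    url = link.split(";")[0].strip("<> ")
    meta = fetch_metadata(url)
    return not meta.get("no-index", False)
```

The point of the sketch is that the indirection only costs a second request for the minority of sites that actually advertise metadata.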

Also, consider how many extra GETs Google must do today for non-existent 
robots.txt files. Surely there is a cost to that, and if more and more 
well-known metadata URIs are added, the system will fail to scale.
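A back-of-envelope calculation shows why this scales badly. Every number here is illustrative, not measured: each additional well-known URI means one probe per site per crawl, and most of those probes will just 404.

```python
# Illustrative arithmetic only -- none of these numbers are measured.
sites = 50_000_000          # assumed number of distinct sites crawled
well_known_uris = 5         # robots.txt, favicon.ico, p3p.xml, ... (assumed)
hit_rate = 0.2              # assumed fraction of probes that find a file

wasted_requests = sites * well_known_uris * (1 - hit_rate)
print(f"{wasted_requests:,.0f} extra 404 probes per full crawl")
```

The waste grows linearly with the number of well-known URIs, which is exactly why a single metadata hook is more attractive than one magic name per application.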

>...
> 3) How much trouble is this causing right now?  In theory, it makes sense
> that the owner of a domain should have full control over his identifiers and
> the resource(s) they point to.  In practice, though, how many people have
> had issues with this, especially compared to the number that haven't had an
> issue?

Personally, I would say it is a fairly major issue that robots.txt can 
only live at the root.

  Paul Prescod
Received on Thursday, 13 February 2003 22:37:40 GMT
