Re: Indexing extension from Martijn Koster on 1995-06-01 (ietf-http-wg@w3.org from April to June 1995)

From: Martijn Koster <m.koster@nexor.co.uk>
Date: Thu, 01 Jun 1995 08:59:21 +0100
To: Ted Hardie <hardie@merlot.arc.nasa.gov>
Cc: Harald.T.Alvestrand@uninett.no, http-wg%cuckoo.hpl.hp.com@hplb.hpl.hp.com, harvest-dvl@cs.colorado.edu, naic@nasa.gov, webmasters@nasa.gov
Message-Id: <199506011411.AA016645906@hplb.hpl.hp.com>
In message <199505311737.KAA04022@merlot.arc.nasa.gov>, Ted Hardie writes:

I'm probably going to have trouble expressing this clearly, but I'm a
bit worried about the host-security-restriction angle on this
problem. In your situation you have host-based access restriction (be
it DNS or IP), which is just one of the possible restrictions one
could encounter. Other authentication schemes may be in use, the
access restriction may depend on external factors such as server load
or time of day etc. I wouldn't like to see a hack that fixes only one
particular aspect of this problem, takes years to get implemented,
and then haunts us later. Of course I would like to see
progress.

I see your problem as one of a higher level of abstraction: you have
collections of resources which are intended for the general public,
and collections which are intended for a private audience. Even in the
case where you choose not to place access restrictions on the private
collection you may want to prevent wide publication of the
collection. My favourite real-life example is a bug database: you want
it online for customers, but you don't want to be known for the bugs
in your products because some indexer just ended up there!

You place an access restriction on a collection, which is a policy on
the collection, not a property of it. Deciding how the collection is
indexed should also be a policy decision, and be separate from other
policies such as access restrictions.

> On another note, several people have pointed out the existence of
> the Robot exclusion standard, and have suggested using a robots.txt
> at sites or in hierarchies that should not be indexed.  This is a
> fine temporary suggestion,

Well, the /robots.txt was designed as a temporary solution, so that
fits :-)

> but I think it is a bit inelegant, as it requires the maintainers of
> those pages to keep two different forms of access control--one for
> humans and one for local robots.

If that is your only worry I'm sure you could automate some of the
process, e.g. by generating the robots.txt from the acl files or
whatever.  One could even argue that the problem is in fact in the
access control implementation on your server: I'd much rather
configure my server in terms of document collections with policies
than in files (URLs) with access restrictions.
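
As a rough illustration of that automation, here is a minimal Python
sketch. It assumes the restricted URL prefixes have already been pulled
out of the server's access control files; that extraction step, the
function name and the example paths are all hypothetical, since acl
formats differ between servers.

```python
def robots_txt_from_restricted(prefixes, user_agent="*"):
    """Build /robots.txt text that disallows each restricted URL prefix."""
    lines = [f"User-agent: {user_agent}"]
    # De-duplicate and sort so the generated file is stable between runs.
    for prefix in sorted(set(prefixes)):
        lines.append(f"Disallow: {prefix}")
    return "\n".join(lines) + "\n"


# Hypothetical prefixes, standing in for whatever the acl files restrict.
restricted = ["/internal/", "/bugs/", "/internal/"]
print(robots_txt_from_restricted(restricted), end="")
```

Regenerating the file whenever the acl files change would keep the two
forms of control in step without maintaining them by hand.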

I see the separation of policies as a feature, not a bug, and think a
robots.txt style restriction may therefore be appropriate.

The added advantage is of course that you help out other robots, and
that Harvest support for robots.txt is quite easy to implement (they
should do it anyway if they haven't yet) and you need no server
changes.
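
To give an idea of how little the robot side needs, here is a sketch
using the robots.txt parser from the Python standard library
(urllib.robotparser). It is not how Harvest does or would do it; the
robot name, the rules and the URLs are made up for the example.

```python
from urllib.robotparser import RobotFileParser


def allowed_to_index(robots_txt, user_agent, url):
    """Return True if the given robots.txt text lets user_agent fetch url."""
    parser = RobotFileParser()
    # parse() takes the file as an iterable of lines.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# Hypothetical rules keeping every robot out of a bug database.
rules = "User-agent: *\nDisallow: /bugs/\n"
print(allowed_to_index(rules, "example-indexer",
                       "http://www.example.com/bugs/1234.html"))
print(allowed_to_index(rules, "example-indexer",
                       "http://www.example.com/products/index.html"))
```

The first check is refused and the second is allowed, with no change to
the server beyond publishing the file.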

> Perhaps a Pragma method of "request restrictions" would be the best
> idea; it would allow the server to determine whether to send a
> description of the restrictions to the browser (useful if the
> indexer wishes to use some logic to determine whether to index) or a
> simple "restrictions exist" reply.

> What do people think of using that Pragma method as a solution?

Sure, but that requires some enumeration of access control. My problem
with that is that there can be any number of access control policies
in force, which may not be easy to describe. How do you deal with
"access is OK outside the hours 9-5", "access is OK if a trusted
third-party authenticates you", "access is OK if you're over 18",
"access is OK unless you're annoying@wherever" etc? I don't think a
"restrictions exist" reply is really useful. Anyway, that discussion may
have a better place on an http-security list (there probably is one). I can
see some (different) use for such a facility, but think it's
non-trivial.

It seems to me that your actual requirement is selection of documents
for indexing purposes, for particular audiences. Maybe we should be
discussing how we go about asking servers things like "give me a list
of your X most important documents/meta-descriptions to include in a
public index", "list URL roots of documents not intended for a public
index (aka robots.txt)", or, to make it more scalable for large servers,
do a "Get this document If-index: public" or something. Yeah, I know,
also non-trivial :-).
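
Purely as a sketch of the idea, and not a proposal for actual syntax,
the server-side decision behind such queries could look like this in
Python. Everything here is an assumption: the policy table, the audience
names, the function names, and the choice that documents outside any
listed collection are indexable.

```python
# Hypothetical indexing policy: URL root -> audiences whose indexes may
# include documents under that root. Kept separate from access control.
INDEX_POLICY = {
    "/products/": {"public", "customers"},
    "/bugs/": {"customers"},
}


def may_index(path, audience, policy=INDEX_POLICY):
    """Answer an 'If-index: <audience>' style check for one document path."""
    # Use the longest matching root so nested collections can override.
    roots = [root for root in policy if path.startswith(root)]
    if not roots:
        return True  # assumption: unlisted documents are indexable
    return audience in policy[max(roots, key=len)]


def excluded_roots(audience, policy=INDEX_POLICY):
    """List URL roots not intended for the given audience's index."""
    return sorted(root for root, allowed in policy.items()
                  if audience not in allowed)


print(may_index("/bugs/1234.html", "public"))
print(may_index("/bugs/1234.html", "customers"))
print(excluded_roots("public"))
```

Note that the bug database stays readable by anyone here; only its
appearance in a public index is governed by the table.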

-- Martijn
__________
Internet: m.koster@nexor.co.uk
X-400: C=GB; A= ; P=Nexor; O=Nexor; S=koster; I=M
WWW: http://web.nexor.co.uk/mak/mak.html
Received on Thursday, 1 June 1995 07:17:32 UTC