Re: Indexing extension from Ted Hardie on 1995-06-01 (ietf-http-wg@w3.org from April to June 1995)

From: Ted Hardie <hardie@merlot.arc.nasa.gov>
Date: Thu, 1 Jun 1995 11:17:44 -0700 (PDT)
To: Martijn Koster <m.koster@nexor.co.uk>
Cc: http-wg%cuckoo.hpl.hp.com@hplb.hpl.hp.com, harvest-dvl@cs.colorado.edu
Message-Id: <199506011817.LAA05090@merlot.arc.nasa.gov>
> 
> In message <199505311737.KAA04022@merlot.arc.nasa.gov>, Ted Hardie writes:
> 
> I'm probably going to have trouble expressing this clearly, but I'm a
> bit worried about the host-security-restriction angle on this
> problem. In your situation you have host-based access restriction (be
> it DNS or IP), which is just one of the possible restrictions one
> could encounter. Other authentication schemes may be in use, the
> access restriction may depend on external factors such as server load
> or time of day etc. I wouldn't like to see a hack that fixes only one
> particular aspect of this problem, and will take years to get
> implemented, and will haunt us later. Of course I would like to see
> progress.
> 
> I see your problem as one of a higher level of abstraction: you have
> collections of resources which are intended for the general public,
> and collections which are intended for a private audience. Even in the
> case where you choose not to place access restrictions on the private
> collection you may want to prevent wide publication of the
> collection. My favourite real-life example is a bug database: you want
> it online for customers, but you don't want to be known for the bugs
> in your products because some indexer just ended up there!
> 
> You place an access restriction on a collection, which is a policy on
> the collection, not a property of it. Deciding how the collection is
> indexed should also be a policy decision, and be separate from other
> policies such as access restrictions.
> 
> > On another note, several people have pointed out the existence of
> > the Robot exclusion standard, and have suggested using a robots.txt
> > at sites or in hierarchies that should not be indexed.  This is a
> > fine temporary suggestion,
> 
> Well, the /robots.txt was designed as a temporary solution, so that
> fits :-)
> 
> > but I think it is a bit inelegant, as it requires the maintainers of
> > those pages to keep two different forms of access control--one for
> > humans and one for local robots.
> 
> If that is your only worry I'm sure you could automate some of the
> process, e.g. by generating the robots.txt from the acl files or
> whatever.  One could even argue that the problem is in fact in the
> access control implementation on your server: I'd much rather
> configure my server in terms of document collections with policies
> than in files (URLs) with access restrictions.
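
A minimal sketch of the automation suggested above, assuming the restricted collections can be extracted from the server's ACL files as a list of URL path prefixes. The function name `robots_txt_from_acl` and the example prefixes are hypothetical; how the prefixes are read from any particular server's ACL format is left out.

```python
def robots_txt_from_acl(restricted_prefixes, user_agent="*"):
    """Build robots.txt text that disallows every restricted path prefix."""
    lines = [f"User-agent: {user_agent}"]
    # De-duplicate and sort so the generated file is stable between runs.
    for prefix in sorted(set(restricted_prefixes)):
        lines.append(f"Disallow: {prefix}")
    if len(lines) == 1:
        # An empty Disallow line means nothing is excluded.
        lines.append("Disallow:")
    return "\n".join(lines) + "\n"


# Hypothetical prefixes, e.g. gathered from the server's ACL files.
acl_prefixes = ["/internal/", "/bugs/", "/internal/"]
robots_txt = robots_txt_from_acl(acl_prefixes)
```

Regenerating the file whenever the ACLs change would keep the two forms of control in step, although it ties the indexing policy to the access policy, which the surrounding text argues should stay separate.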
> 
> I see the separation of policies as a feature, not a bug, and think a
> robots.txt style restriction may therefore be appropriate.
> 
> The added advantage is of course that you help out other robots, and
> that Harvest support for robots.txt is quite easy to implement (they
> should do it anyway if they haven't yet) and you need no server
> changes.
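
To illustrate why robots.txt support is cheap for an indexer, here is a rough sketch using Python's standard `urllib.robotparser` module. It is not Harvest's code; the user-agent string, the URLs, and the helper `may_index` are assumptions made for the example.

```python
from urllib.robotparser import RobotFileParser

# Example robots.txt content; a real indexer would fetch /robots.txt first.
ROBOTS_TXT = """\
User-agent: *
Disallow: /bugs/
"""


def may_index(robots_txt, user_agent, url):
    """Return True if robots_txt permits user_agent to retrieve url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# Hypothetical indexer name and URLs.
blocked = may_index(ROBOTS_TXT, "example-gatherer", "http://example.org/bugs/42")
allowed = may_index(ROBOTS_TXT, "example-gatherer", "http://example.org/docs/index.html")
```

The check runs entirely on the indexer side, so no server change is needed beyond publishing the file.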
> 
> > Perhaps a Pragma method of "request restrictions" would be the best
> > idea; it would allow the server to determine whether to send a
> > description of the restrictions to the browser (useful if the
> > indexer wishes to use some logic to determine whether to index) or a
> > simple "restrictions exist" reply.
> 
> > What do people think of using that Pragma method as a solution?
> 
> Sure, but that requires some enumeration of access control. My problem
> with that is that there can be any number of access control policies
> in force, which may not be easy to describe. How do you deal with
> "access is OK outside the hours 9-5", "access is OK if a trusted
> third-party authenticates you", "access is OK if you're over 18",
> "access OK unless you're annoying@wherever" etc? I don't think a
> "restrictions exist" is really useful. Anyway, that discussion may
> have a better place on an http-security list (there probably is one). I can
> see some (different) use for such a facility, but think it's
> non-trivial.
> 
> It seems to me that your actual requirement is selection of documents
> for indexing purposes, for particular audiences. Maybe we should be
> discussing how we go about asking servers things like "give me a list
> of your X most important documents/meta-descriptions to include in a
> public index", "list URL roots of documents not intended for a public
> index (aka robots.txt)", or, to make it more scalable for large servers,
> do a "Get this document If-index: public" or something. Yeah, I know,
> also non-trivial :-).
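
A rough sketch of how a server might treat the "If-index: public" idea above, keeping the indexing policy separate from access control. Everything here is hypothetical: no such header is defined, and the `INDEX_POLICY` table, the default audience, and the function names are assumptions for illustration only.

```python
# Hypothetical per-collection indexing policy, keyed by URL path prefix.
INDEX_POLICY = {"/bugs/": "private", "/docs/": "public"}


def index_audience(path, policy=INDEX_POLICY, default="public"):
    """Return the index audience of the collection containing path."""
    # The longest matching prefix wins, so nested collections can override.
    matches = [prefix for prefix in policy if path.startswith(prefix)]
    if not matches:
        return default  # assumption: unlisted documents are publicly indexable
    return policy[max(matches, key=len)]


def should_serve(path, headers):
    """Decide whether to return the document for this request."""
    wanted = headers.get("If-index")
    if wanted is None:
        # Ordinary request: the indexing policy does not apply.
        return True
    return index_audience(path) == wanted
```

With this table, `should_serve("/bugs/1", {"If-index": "public"})` is false, while the same path requested without the header is still served, subject to whatever access restrictions apply separately.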
> 
> -- Martijn
> __________
> Internet: m.koster@nexor.co.uk
> X-400: C=GB; A= ; P=Nexor; O=Nexor; S=koster; I=M
> WWW: http://web.nexor.co.uk/mak/mak.html
>
Received on Thursday, 1 June 1995 11:17:18 UTC