- From: Ted Hardie <hardie@merlot.arc.nasa.gov>
- Date: Thu, 1 Jun 1995 11:17:44 -0700 (PDT)
- To: Martijn Koster <m.koster@nexor.co.uk>
- Cc: http-wg%cuckoo.hpl.hp.com@hplb.hpl.hp.com, harvest-dvl@cs.colorado.edu
>
> In message <199505311737.KAA04022@merlot.arc.nasa.gov>, Ted Hardie writes:
>
> I'm probably going to have trouble expressing this clearly, but I'm a
> bit worried about the host-security-restriction angle on this
> problem. In your situation you have host-based access restriction (be
> it DNS or IP), which is just one of the possible restrictions one
> could encounter. Other authentication schemes may be in use, the
> access restriction may depend on external factors such as server load
> or time of day etc. I wouldn't like to see a hack that fixes only one
> particular aspect of this problem, and will take years to get
> implemented, and will haunt us later. Of course I would like to see
> progress.
>
> I see your problem as one of a higher level of abstraction: you have
> collections of resources which are intended for the general public,
> and collections which are intended for a private audience. Even in the
> case where you choose not to place access restrictions on the private
> collection you may want to prevent wide publication of the
> collection. My favourite real-life example is a bug database: you want
> it online for customers, but you don't want to be known for the bugs
> in your products because some indexer just ended up there!
>
> You place an access restriction on a collection, which is a policy on
> the collection, not a property of it. Deciding how the collection is
> indexed should also be a policy decision, and be separate from other
> policies such as access restrictions.
>
> > On another note, several people have pointed out the existence of
> > the Robot exclusion standard, and have suggested using a robots.txt
> > at sites or in hierarchies that should not be indexed. This is a
> > fine temporary suggestion,
>
> Well, the /robots.txt was designed as a temporary solution, so that
> fits :-)
>
> > but I think it is a bit inelegant, as it requires the maintainers of
> > those pages to keep two different forms of access control--one for
> > humans and one for local robots.
>
> If that is your only worry I'm sure you could automate some of the
> process, e.g. by generating the robots.txt from the acl files or
> whatever. One could even argue that the problem is in fact in the
> access control implementation on your server: I'd much rather
> configure my server in terms of document collections with policies
> than in files (URLs) with access restrictions.
>
> I see the separation of policies as a feature, not a bug, and think a
> robots.txt style restriction may therefore be appropriate.
>
> The added advantage is of course that you help out other robots, and
> that Harvest support for robots.txt is quite easy to implement (they
> should do it anyway if they haven't yet) and you need no server
> changes.
>
> > Perhaps a Pragma method of "request restrictions" would be the best
> > idea; it would allow the server to determine whether to send a
> > description of the restrictions to the browser (useful if the
> > indexer wishes to use some logic to determine whether to index) or a
> > simple "restrictions exist" reply.
>
> > What do people think of using that Pragma method as a solution?
>
> Sure, but that requires some enumeration of access control. My problem
> with that is that there can be any number of access control policies
> in force, which may not be easy to describe. How do you deal with
> "access is OK outside the hours 9-5", "access is OK if a trusted
> third-party authenticates you", "access is OK if you're over 18",
> "access OK unless you're annoying@wherever" etc? I don't think a
> "restrictions exist" is really useful. Anyway, that discussion may
> have a better place on a http-security list (probably is one). I can
> see some (different) use for such a facility, but think it's
> non-trivial.
>
> It seems to me that your actual requirement is selection of documents
> for indexing purposes, for particular audiences. Maybe we should be
> discussing how we go about asking servers things like "give me a list
> of your X most important documents/meta-descriptions to include in a
> public index", "list URL roots of documents not intended for a public
> index (aka robots.txt)", or to make it more scalable for large server
> do a "Get this document If-index: public" or something. Yeah, I know,
> also non-trivial :-).
>
> -- Martijn
> __________
> Internet: m.koster@nexor.co.uk
> X-400: C=GB; A= ; P=Nexor; O=Nexor; S=koster; I=M
> WWW: http://web.nexor.co.uk/mak/mak.html
>
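Martijn's suggestion of generating the robots.txt from the acl files can be made concrete with a short sketch. Everything in it is illustrative: the acl.conf syntax (lines of the form "deny <url-prefix> <who>") is invented for the example, since the real access-control formats of servers of the period (CERN httpd, NCSA httpd) differ; only the User-agent/Disallow output follows the Robot Exclusion Protocol being discussed.

```python
# Sketch only: derive /robots.txt from a hypothetical ACL file so that
# human-facing access control and robot exclusion stay in sync.
# The "deny <url-prefix> <who>" syntax is an assumption made for this
# example, not any particular server's real configuration format.

def restricted_prefixes(acl_path):
    """Yield URL prefixes that the ACL file marks as restricted."""
    with open(acl_path) as acl:
        for line in acl:
            fields = line.split()
            if len(fields) >= 2 and fields[0] == "deny":
                yield fields[1]

def write_robots_txt(acl_path, out_path="robots.txt"):
    """Write a Robot Exclusion Protocol file covering the restricted prefixes."""
    with open(out_path, "w") as out:
        out.write("User-agent: *\n")
        for prefix in restricted_prefixes(acl_path):
            out.write("Disallow: %s\n" % prefix)

if __name__ == "__main__":
    write_robots_txt("acl.conf")
```

The point is only that the two forms of access control Ted is reluctant to maintain by hand need not be maintained twice; the robot-facing one can be derived from the human-facing one.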
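On the indexer side, Martijn's remark that Harvest support for robots.txt is quite easy to implement comes down to a small per-site check before fetching. A minimal sketch using Python's standard urllib.robotparser module (which long post-dates this exchange); the host name, robot name, and URLs are made-up examples:

```python
# Sketch: the check an indexing robot would make before fetching a URL.
from urllib.robotparser import RobotFileParser

parser = RobotFileParser()
parser.set_url("http://www.example.com/robots.txt")
parser.read()  # fetch and parse the site's exclusion file

for url in ("http://www.example.com/index.html",
            "http://www.example.com/bugs/4711.html"):
    if parser.can_fetch("ExampleGatherer/1.0", url):
        print("index:", url)
    else:
        print("skip (excluded by /robots.txt):", url)
```

A robot would typically fetch /robots.txt once per site and apply the same check to every candidate URL it gathers.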
Received on Thursday, 1 June 1995 11:17:18 UTC