From: Martijn Koster <m.koster@nexor.co.uk>
Date: Thu, 01 Jun 1995 08:59:21 +0100
To: Ted Hardie <hardie@merlot.arc.nasa.gov>
Cc: Harald.T.Alvestrand@uninett.no, http-wg%cuckoo.hpl.hp.com@hplb.hpl.hp.com, harvest-dvl@cs.colorado.edu, naic@nasa.gov, webmasters@nasa.gov

In message <199505311737.KAA04022@merlot.arc.nasa.gov>, Ted Hardie writes:

I'm probably going to have trouble expressing this clearly, but I'm a bit
worried about the host-security-restriction angle on this problem. In your
situation you have host-based access restriction (be it DNS or IP), which is
just one of the possible restrictions one could encounter. Other
authentication schemes may be in use, the access restriction may depend on
external factors such as server load or time of day, etc. I wouldn't like to
see a hack that fixes only one particular aspect of this problem, will take
years to get implemented, and will haunt us later. Of course I would like to
see progress.

I see your problem as one of a higher level of abstraction: you have
collections of resources which are intended for the general public, and
collections which are intended for a private audience. Even in the case
where you choose not to place access restrictions on the private collection,
you may want to prevent wide publication of the collection. My favourite
real-life example is a bug database: you want it online for customers, but
you don't want to become known for the bugs in your products just because
some indexer ended up there! You place an access restriction on a
collection, which is a policy on the collection, not a property of it.
Deciding how the collection is indexed should also be a policy decision,
separate from other policies such as access restrictions.

> On another note, several people have pointed out the existence of
> the Robot exclusion standard, and have suggested using a robots.txt
> at sites or in hierarchies that should not be indexed. This is a
> fine temporary suggestion,

Well, the /robots.txt was designed as a temporary solution, so that fits :-)

> but I think it is a bit inelegant, as it requires the maintainers of
> those pages to keep two different forms of access control--one for
> humans and one for local robots.

If that is your only worry, I'm sure you could automate some of the process,
e.g. by generating the robots.txt from the acl files or whatever. One could
even argue that the problem is in fact in the access control implementation
on your server: I'd much rather configure my server in terms of document
collections with policies than in files (URLs) with access restrictions. I
see the separation of policies as a feature, not a bug, and think a
robots.txt-style restriction may therefore be appropriate. The added
advantage is of course that you help out other robots, that Harvest support
for robots.txt is quite easy to implement (they should do it anyway if they
haven't yet), and that you need no server changes.

> Perhaps a Pragma method of "request restrictions" would be the best
> idea; it would allow the server to determine whether to send a
> description of the restrictions to the browser (useful if the
> indexer wishes to use some logic to determine whether to index) or a
> simple "restrictions exist" reply.
>
> What do people think of using that Pragma method as a solution?

Sure, but that requires some enumeration of access control. My problem with
that is that there can be any number of access control policies in force,
which may not be easy to describe. How do you deal with "access is OK
outside the hours 9-5", "access is OK if a trusted third party authenticates
you", "access is OK if you're over 18", "access OK unless you're
annoying@wherever", etc.? I don't think a bare "restrictions exist" reply is
really useful.
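
Coming back to the /robots.txt route for a moment, to make it concrete: a
site that wants to keep, say, its bug database out of public indexes only
has to serve a /robots.txt along the lines of

    User-agent: *
    Disallow: /bugs/

and generating that from an access configuration is a few lines of
scripting. Here is a rough sketch, assuming a made-up "private-paths" file
listing one restricted URL prefix per line (a real server's ACL format would
of course need its own parsing):

    # Rough sketch: build a /robots.txt from a list of private URL prefixes.
    # "private-paths" is a made-up input format: one URL prefix per line,
    # with blank lines and #-comments ignored.

    def generate_robots_txt(private_paths_file, robots_txt_file):
        with open(private_paths_file) as src:
            prefixes = [line.strip() for line in src
                        if line.strip() and not line.lstrip().startswith("#")]

        with open(robots_txt_file, "w") as out:
            out.write("# Generated from %s; do not edit by hand\n"
                      % private_paths_file)
            out.write("User-agent: *\n")            # applies to all robots
            for prefix in prefixes:
                out.write("Disallow: %s\n" % prefix)

    if __name__ == "__main__":
        generate_robots_txt("private-paths", "robots.txt")

Run that from cron or from whatever updates the access configuration and the
two "forms of access control" stay in step automatically.
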
Anyway, that discussion may have a better place on an http-security list
(there probably is one). I can see some (different) use for such a facility,
but think it's non-trivial.

It seems to me that your actual requirement is selection of documents for
indexing purposes, for particular audiences. Maybe we should be discussing
how we go about asking servers things like "give me a list of your X most
important documents/meta-descriptions to include in a public index", or
"list URL roots of documents not intended for a public index (aka
robots.txt)", or, to make it more scalable for a large server, do a "Get
this document If-index: public" or something. Yeah, I know, also
non-trivial :-).

-- Martijn
__________
Internet: m.koster@nexor.co.uk
X-400: C=GB; A= ; P=Nexor; O=Nexor; S=koster; I=M
WWW: http://web.nexor.co.uk/mak/mak.html