- From: Ted Hardie <hardie@merlot.arc.nasa.gov>
- Date: Fri, 26 May 1995 16:36:14 -0700 (PDT)
- To: http-wg%cuckoo.hpl.hp.com@hplb.hpl.hp.com, harvest-dvl@cs.colorado.edu, naic <naic@nasa.gov>
- Cc: webmasters@nasa.gov
*Intro:

I've been working with the Harvest development team over the past few days to try to resolve a problem with indexing our webspace, and I have come to the conclusion that what we want would be facilitated by a minor extension to the HTTP protocol (for 1.1 or later, of course). Having read through the current draft specs, though, I see several possible approaches to resolving the problem, and I'm not sure which one would be most likely to be accepted by the community. I have laid out the problem below, and I would appreciate any and all input into the best long-term resolution to the issue.

*Basis:

The NASA webmasters group would like to provide a search interface that would allow a user to search all of NASA's webspace for information. After studying different approaches, several sites began trying out the Harvest system of brokers/gatherers. Using Harvest, an individual host at each center would gather information from the other web sites at the center; this gathering was seeded with known hosts and extended by an n-hop reach, using Harvest as a "spider" or "worm". This provided a very useful, reasonably compact method of indexing the webspace at the center, and would, ultimately, have been able to provide the NASA-wide interface we want.

*The Problem:

Running the gatherer from within each center provided a good way to balance resource consumption among centers, but it also had the effect of making the gatherer a "local" user for access control purposes. Since the gatherer acts like a local browsing user, information which had been made available only to center personnel or NASA personnel was accessible to the gatherer. Outside users could see these resources when they searched, but could not retrieve the documents (they did have access, however, to the Summary Object Information about the resources, which often provided too much information for outside users). We are trying to create a searchable index for outside users, but we cannot currently determine easily which documents those users should see. Most http daemons currently seem to be configured to deny browsers access to the access control files themselves (NCSA's httpd, for example, by default eliminates .htaccess files from the list of files served); this means it is not possible for the gatherer to request the access file and eliminate items which should not be indexed.

*Short Term Solutions:

Run the gatherers on hosts likely to be denied access (outside the local domain or NASA domain, and outside likely networks), and keep the brokers on known hosts and nets. Alternatively, cease to run Harvest as a spider, and force webmasters to register which parts of their sites should be indexed and which should not.

*Long Term:

This problem probably faces many other institutions with large webspaces, even if they do not have a highly distributed set of web servers. To allow sites to present integrated search capabilities without compromising areas under site-based access restriction, it seems that we might want the http server to be able to report that access to material is restricted even if the current browser meets those restrictions. This is information that the server already has; the key would simply be making it available to the browser/indexer.
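For concreteness, the kind of restriction involved is typically an address-based per-directory access file along the following lines (NCSA-httpd-style syntax, written from memory; the domain is only illustrative):

    <Limit GET>
    order deny,allow
    deny from all
    allow from .nasa.gov
    </Limit>

A gatherer running on a host inside the allowed domain passes this check exactly as a local browser would, so it never learns that the directory is restricted; an outside browser is refused, but only at retrieval time, after the summary information has already been indexed.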
Several possible methods present themselves.

One way would be to have servers apply access restrictions based on the From: field as well as the address of the request, allowing spiders/gatherers to set an outside From: address and thereby gather only unrestricted documents. The From: check would have to be *in addition* to the address check, of course, to prevent spoofing in the other direction.

The server could also present browsers with a header reporting whether or not the document being served was restricted (and optionally, how). This seems like overkill, since few browsers would care about the restriction. If proposals for a rating system succeed, however, this could simply be folded into the header reporting the rating (Rating: would then include things like Local_Access as well as Adults_Only, and might be better named Restrictions:).

An Accept:-style header could also do the job; like Accept-encoding: and Accept-charset:, it would give the server information about what kinds of documents the browser wishes to see. The difference, of course, would be that the server would decide whether to respond based on its access control files rather than on the file being served. This method would allow indexers to configure their spiders to gather all available documents, only documents available to everyone, or only documents which did not bear a particular restriction (note that this would be most useful if there were some standardization of restriction classes).

It also seems that a Pragma directive could be defined for this; though the parallel with no-cache is not especially strong, this kind of unusual directive, "don't return if restrictions exist" (or "no-restrict" for simplicity's sake), might also need to be passed through each proxy. A rough sketch of what such an exchange might look like is appended as a postscript below.

*Directions?

Which of these methods seems most useful? What other methods might there be that would be more useful? Do other folks think that this problem even deserves to be addressed at this level, or should it be handled at some other level?

Any help, commentary, or suggestions appreciated.

Regards,
Ted Hardie
NAIC
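P.S. To make the header-based ideas a bit more concrete, here is a rough sketch of the kind of exchange I have in mind. None of these header names or values appear in any draft, and the path, host, and mail address are made up purely for illustration.

A gatherer that wants only unrestricted documents might send:

    GET /reviews/schedule.html HTTP/1.1
    From: gatherer@indexer.example.gov
    Pragma: no-restrict

A server that knows the document is covered by an access control file could then refuse it and label why:

    HTTP/1.1 403 Forbidden
    Restrictions: Local_Access

while an unrestricted document would come back as usual, optionally carrying an explicit label the indexer can record:

    HTTP/1.1 200 OK
    Restrictions: none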
Received on Friday, 26 May 1995 16:37:30 UTC