Indexing extension

*Intro:

I've been working with the Harvest development team over the past few
days to try to resolve a problem with indexing our webspace, and have
come to the conclusion that what we want would be facilitated by a
minor extension to the HTTP protocol (for 1.1 or later, of course).
Having read through the current draft specs, though, I see several
possible approaches to resolving the problem, and I'm not sure which
one would be most likely to be accepted by the community.  I have
laid out the problem below, and I would appreciate any and all input
into the best long-term resolution to the issue.

*Basis:

The NASA webmasters group would like to provide a search interface
that would allow a user to search all of NASA's webspace for information.
After studying different approaches, several sites began trying out
the Harvest system of brokers/gatherers.  Using Harvest, an individual
host at each center would gather information from other web sites
at the center; this gathering was seeded with known hosts and extended
by an n-hop reach, using Harvest as a "spider" or "worm".  This provided
a very useful, reasonably compact method of indexing the webspace at
the center, and would, ultimately, have been able to provide the
NASA wide interface we want.

*The Problem:

Running the gatherer from within each center provided a good way to
balance the resource consumption among centers, but also had the
effect of making the gatherer a "local" user for access control
purposes.  Since the gatherer acts like a local browsing user,
information which had been made available only to center personnel or
NASA personnel was accessible to the gatherer and so was indexed.
Outside users could see these resources when they searched, but could
not retrieve the documents themselves (they did have access, however,
to the Summary Object Information about the resources, which often
provided too much information for outside users).

We are trying to create a searchable index for outside users, but we
have no easy way to determine which documents those users should see.
Most http daemons currently seem to be configured to deny browsers
access to access control files (NCSA's httpd, for example, by default
eliminates the .htaccess files from the list of files served); this
means it is not possible for the gatherer to request the access file
and eliminate items which should not be indexed.
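
To illustrate: because the gatherer is a local client, it can retrieve
a restricted document, but its request for the access control file
that governs that document is typically refused, so it never learns
that the document was restricted.  A sketch of such an exchange (the
paths and version numbers here are purely illustrative):

    GET /private/plans.html HTTP/1.0
    User-Agent: Harvest/1.3

    HTTP/1.0 200 OK              (served, because the gatherer is "local")

    GET /private/.htaccess HTTP/1.0
    User-Agent: Harvest/1.3

    HTTP/1.0 403 Forbidden       (the access file itself is never served)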

*Short Term Solutions:

Run the gatherers on hosts likely to be denied access (outside the
local domain or NASA domain, and outside likely networks), keeping
the brokers on known hosts and nets.

Cease running Harvest as a spider, and require webmasters to register
which parts of their sites should be indexed and which should not.

*Long Term:

This problem probably faces many other institutions with large
webspaces, even if they do not have a highly distributed set of web
servers.  To allow sites to present integrated search capabilities
without compromising areas under site-based access restrictions, it
seems that we might want the http server to be able to report that
access to material is restricted even if the current
browser meets those restrictions.  This is information that the
server already has; the key would simply be making it available
to the browser/indexer.

Several possible methods present themselves.  One would simply be to
have servers apply access restrictions based on the From: field as
well as the address of the request (allowing spiders/gatherers to set
a bogus or outside From: address, which would limit the gathering to
unrestricted documents; the From: check would have to be applied *in
addition* to the address check, of course, to prevent spoofing in the
other direction).
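
A sketch of how that might look on the wire, assuming a server
configured to check From: as well as the requesting address (the
addresses and paths below are purely illustrative): a gatherer running
on a local host, but announcing an outside From: address, would be
refused restricted documents it could otherwise retrieve.

    GET /internal/plans.html HTTP/1.0
    From: gatherer@indexer.example.com
    User-Agent: Harvest/1.3

    HTTP/1.0 403 Forbidden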

The server could also present browsers with a header reporting whether
or not the document being served was restricted (and optionally,
how). This seems like overkill, since few browsers would care about
the restriction.  If proposals for a rating system succeed, however,
it could simply be moved into the header reporting the rating (Rating:
would then include things like Local_Access as well as Adults_Only,
and might be better named Restrictions:).  
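
As a sketch, assuming the header were named Restrictions: and
Local_Access were one of the defined restriction classes (both names
are placeholders only), a response might look like:

    HTTP/1.0 200 OK
    Content-Type: text/html
    Restrictions: Local_Access

    ...document body...

A gatherer seeing Restrictions: Local_Access could still build a local
index from the document, but drop it from the index offered to
outside users.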


An Accept: header could also do the job; like Accept-encoding: and
Accept-charset:, it would give the server information about what kinds
of information the browser wishes to see.  The difference, of course,
would be that the server would decide how to respond based on its
access control files rather than on the file being served.  This
method would allow indexers to configure their spiders to gather all
available documents, only documents available to everyone, or only
documents which did not bear a particular restriction (note that this
would be most useful if there were some standardization on restriction
classes).
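
A minimal sketch of such a request, assuming a hypothetical
Accept-restrictions: header whose "none" token means "only send
documents carrying no access restrictions" (both the header name and
the token are invented here for illustration):

    GET /docs/overview.html HTTP/1.0
    Accept: text/html, text/plain
    Accept-restrictions: none
    User-Agent: Harvest/1.3

A document covered by an access control file would then be refused
(e.g. with 403 Forbidden) even though the requesting host would
normally be allowed to see it.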

A Pragma directive could also be defined for this; though the parallel
with no-cache is not especially strong, this kind of unusual directive,
"don't return if restrictions exist" (or "no-restrict" for simplicity's
sake), might also need to be passed through each proxy.
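
In that scheme a gatherer's request might look like the following,
with each proxy in the chain passing the directive through to the
origin server (again, the directive name is only a suggestion):

    GET /docs/overview.html HTTP/1.0
    Pragma: no-restrict
    User-Agent: Harvest/1.3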

*Directions?

Which of these methods seems most useful?  What other methods might
there be which would be more useful?  Do other folks think that this
problem even deserves to be addressed at this level, or should it be
handled at some other level?

Any help, commentary, or suggestions appreciated,

				Regards,
					Ted Hardie
					NAIC
