Re: Indexing extension (fwd)

(Sorry for the double message; I fumble-fingered the first time
and sent too soon.  Anyway...)
 
Martijn,
 
You raise a lot of interesting issues, probably far more clearly than
I will manage to respond to them.  I've tried to break your message up
into chunks and respond to each.
 
 
 Martijn writes:
  M>I'm probably going to have trouble expressing this clearly, but I'm a
  M>bit worried about the host-security-restriction angle on this
  M>problem. In your situation you have host-based access restriction (be
  M>it DNS or IP), which is just one of the possible restrictions one
  M>could encounter. Other authentication schemes may be in use, the
  M>access restriction may depend on external factors such as server load
  M>or time of day etc. I wouldn't like to see a hack that fixes only one
  M>particular aspect of this problem, and will take years to get
  M>implemented, and will haunt us later. Of course I would like to see
  M>progress.
 
I certainly agree that any solution proposed should be applicable to
situations beyond host-based access restrictions.  I also see the
method I've currently proposed, using a pragma method to request
information about access restrictions as extensible.  Like you,
I don't think a response of "restrictions exist" is particlularly
useful; I only included it as a minimal response because some sites
would be skittish about reporting what the restrictions are.  A
useful response would be one which gave details of the restriction
and allowed the indexer to employ its heuristics to decide if it
wanted to index the data.
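 
To make the idea concrete, here is a rough sketch (Python, with
entirely hypothetical header names, since none of this is agreed on
anywhere) of how an indexer might ask a server to describe its
restrictions:
 
    # Minimal sketch of the proposed "request restrictions" exchange.
    # The header names ("Pragma: describe-restrictions" and
    # "Restrictions:") are placeholders, not part of any standard.
    import http.client
 
    def fetch_restrictions(host, path="/"):
        conn = http.client.HTTPConnection(host)
        # Ask the server to describe any access restrictions on path.
        conn.request("HEAD", path,
                     headers={"Pragma": "describe-restrictions"})
        response = conn.getresponse()
        # A cooperating server might answer with anything from a bare
        # "restrictions exist" to a detailed description; silence
        # (None) means it does not implement the extension.
        value = response.getheader("Restrictions")
        conn.close()
        return value
 
A HEAD request keeps the exchange cheap; the same Pragma could of
course ride along on an ordinary GET.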
 
 
  
  M>I see your problem as one of a higher level of abstraction: you have
  M>collections of resources which are intended for the general public,
  M>and collections which are intended for a private audience. Even in the
  M>case where you choose not to place access restrictions on the private
  M>collection you may want to prevent wide publication of the
  M>collection. My favourite real-life example is a bug database: you want
  M>it online for customers, but you don't want to be known for the bugs
  M>in your products because some indexer just ended up there!
  
  M>You place an access restriction on a collection, which is a policy on
  M>the collection, not a property of it. Deciding how the collection is
  M>indexed should also be a policy decision, and be separate from other
  M>policies such as access restrictions.
  
 
I'm not sure I totally agree with the above.  While I agree that some
sites will want to place additional policy restrictions on data
indexing, I think the accessibility of the actual data is crucial
information.  If access to the data is limited, the indexer may want
to know that even if it has the needed access, and even if no other
policy restrictions exist.
 
 
 
  Ted Hardie  writes:
   On another note, several people have pointed out the existence of
   the Robot exclusion standard, and have suggested using a robots.txt
   at sites or in hierarchies that should not be indexed.  This is a
   fine temporary suggestion,
  
  M>Well, the /robots.txt was designed as a temporary solution, so that
  M>fits :-)
  
   but I think it is a bit inelegant, as it requires the maintainers of
   those pages to keep two different forms of access control--one for
   humans and one for local robots.
  
  M>If that is your only worry I'm sure you could automate some of the
  M>process, e.g. by generating the robots.txt from the acl files or
  M>whatever.  One could even argue that the problem is in fact in the
  M>access control implementation on your server: I'd much rather
  M>configure my server in terms of document collections with policies
  M>than in files (URLs) with access restrictions.
  
  M>I see the separation of policies as a feature, not a bug, and think a
  M>robots.txt style restriction may therefore be appropriate.
  
  M>The added advantage is of course that you help out other robots, and
  M>that Harvest support for robots.txt is quite easy to implement (they
  M>should do it anyway if they haven't yet) and you need no server
  M>changes.
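 
(To show what Martijn has in mind, a single-server version of that
automation might look something like the sketch below; the ACL file
format, one "path host-pattern ..." line per entry, is invented
purely for illustration.)
 
    # Sketch: derive a robots.txt from a per-server ACL file.
    # The ACL format is made up for this example; real servers differ.
    def acl_to_robots(acl_path="access.conf", robots_path="robots.txt"):
        disallowed = []
        with open(acl_path) as acl:
            for line in acl:
                fields = line.split()
                if not fields or fields[0].startswith("#"):
                    continue
                path, hosts = fields[0], fields[1:]
                # Anything not open to every host stays out of the index.
                if hosts != ["*"]:
                    disallowed.append(path)
        with open(robots_path, "w") as robots:
            robots.write("User-agent: *\n")
            for path in disallowed:
                robots.write("Disallow: %s\n" % path)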
 
Actually, you can't automate this easily in our situation because of
the distributed nature of the web documents being indexed; there are
more than 50 web servers at the Ames Research Center alone, with a
similar number at each of the other NASA centers.  Nor is the problem
unique to NASA; several folks have written me to indicate that they
face the same problems of scale as we do, so I believe that a solution
that doesn't rely on the current methods is needed (or will be by the
time we can implement it).  
 
I believe that we could set this up in a way that would eventually
replace robots.txt, by allowing detailed responses to requests for
information about restrictions.  A response of something like "All
browsers.  No indexers" or "Password-authorized browsers.  Lycos,
webcrawler, and infoseek indexers" would pretty much handle what
robots.txt does now.  If there were no policy-level restrictions
given to the server, though, it should still be able to respond with
standard access information.
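 
In machine-readable form I imagine something along the lines of the
(purely hypothetical) syntax below, which an indexer could pick apart
without much effort:
 
    # Sketch of parsing a hypothetical restriction description such as
    #   "browsers=all; indexers=none"
    #   "browsers=password; indexers=lycos,webcrawler,infoseek"
    # The syntax is invented here for discussion, not proposed as final.
    def parse_restrictions(value):
        policy = {}
        for clause in value.split(";"):
            if "=" in clause:
                key, allowed = clause.split("=", 1)
                policy[key.strip()] = [v.strip()
                                       for v in allowed.split(",")]
        return policy
 
    def may_index(policy, robot_name):
        indexers = policy.get("indexers", ["all"])
        return "all" in indexers or robot_name in indexers
 
    # e.g. may_index(parse_restrictions(
    #          "browsers=password; indexers=lycos,webcrawler"), "lycos")
    # gives True, while the same call with "infoseek" gives False.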
 
  
   Perhaps a Pragma method of "request restrictions" would be the best
   idea; it would allow the server to determine whether to send a
   description of the restrictions to the browser (useful if the
   indexer wishes to use some logic to determine whether to index) or a
   simple "restrictions exist" reply.
  
   What do people think of using that Pragma method as a solution?
  
  M>Sure, but that requires some enumeration of access control. My problem
  M>with that is that there can be any number of access control policies
  M>in force, which may not be easy to describe. How do you deal with
  M>"access is OK outside the hours 9-5", "access is OK if a trusted
  M>third-party authenticates you", "access is OK if you're over 18",
  M>"access OK unless you're annoying@wherever" etc? I don't think a
  M>"restrictions exist" is really useful. Anyway, that discussion may
  M>have a better place on a http-security list (probably is one). I can
  M>see some (different) use for such a facility, but think it's
  M>non-trivial.
 
This would certainly require us to establish a syntax for responses,
but I don't think that the enumeration need be exhaustive.  If we
establish what kinds of information are found where, we can name
common likely responses and give guidelines on what else might be
used.  This would allow an indexer to "do the right thing" in most
cases, and kick the unknowns to a human for instructions (an indexer
might well have a default action defined as well, depending on its
design).
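 
In rough outline, I picture the indexer's decision going something
like this (again only a sketch, with an invented vocabulary of
responses):
 
    # Sketch of the "do the right thing, ask a human about the rest"
    # logic.  The response strings are invented; a real vocabulary
    # would come out of the guidelines discussed above.
    KNOWN_RESPONSES = {
        "none": "index",                # no restrictions reported
        "restrictions exist": "skip",   # minimal answer: be conservative
        "indexers=none": "skip",
    }
 
    def decide(response, default="refer-to-human"):
        if response is None:            # server doesn't play the game
            return "index"
        action = KNOWN_RESPONSES.get(response.strip().lower())
        if action is not None:
            return action
        # Unknown description: fall back to whatever the indexer's
        # operator configured, typically queueing it for a human.
        return default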

  
 M>It seems to me that your actual requirement is selection of documents
 M>for indexing purposes, for particular audiences. Maybe we should be
 M>discussing how we go about asking servers things like "give me a list
 M>of your X most important documents/meta-descriptions to include in a
 M>public index", "list URL roots of documents not intended for a public
 M>index (aka robots.txt)", or to make it more scalable for large servers
 M>do a "Get this document If-index: public" or something. Yeah, I know,
 M>also non-trivial :-).
 
Yes, this would be an interesting question, and if others want to expand
the discussion to include the more general question of meta-data, I'm willing;
I will remove some of the cc:'s on the message, though, as I suspect we would
be going outside the bounds of the defined interests of this group.
 
  
  -- Martijn
  __________
  Internet: m.koster@nexor.co.uk
  X-400: C=GB; A= ; P=Nexor; O=Nexor; S=koster; I=M
  WWW: http://web.nexor.co.uk/mak/mak.html
  
 
 
 
