
Alex Hopmann: Re: New response code

From: Shel Kaphan <sjk@amazon.com>
Date: Tue, 20 Feb 1996 08:09:34 -0800
Message-Id: <199602201609.IAA01210@bert.amazon.com>
To: http-wg-request%cuckoo.hpl.hp.com@hplb.hpl.hp.com
Cc: http-wg%cuckoo.hpl.hp.com@hplb.hpl.hp.com
http-wg-request@cuckoo.hpl.hp.com writes:
 > [ meant for the list -- ange ]
 > 
 > ------- Forwarded Message
 > 
 > Date:    Mon, 19 Feb 1996 10:17:11 -0800
 > From:    hopmann@holonet.net (Alex Hopmann)
 > To:      http-wg-request@cuckoo.hpl.hp.com
 > Subject: Re: New response code
 > 
 > Shel Kaphan wrote:
 > >Some applications would generate pages differently if they are being
 > >probed by a robot.  For instance, in applications that use URL
 > >encoding of session information (which will be with us until cookies
 > >take over completely)  it might be preferable not to generate session
 > >ids, or at least not new ones, for robots.
 > My hunch is that this is a bad idea. In general, sites should exclude
 > URLs that encode session information from robots. And the concept of
 > servers returning different results to robots is open to abuse: there
 > already are sites designed to detect robots from the more popular
 > search services and return some sort of document designed to match
 > everything. That way a site can lure people in without having anything
 > of interest to the browser. Because of this, in practice robot authors
 > are probably not going to want to identify themselves anyway (beyond
 > fetching the "robots.txt" file).
 > Alex Hopmann
 > ResNova Software, Inc.
 > hopmann@holonet.net
 > 
 > 
 > ------- End of Forwarded Message
 > 
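[The exclusion Alex suggests can be expressed directly in a site's robots.txt. A minimal sketch, assuming the site keeps its session-encoded URLs under one path prefix -- the /cgi-bin/session/ path here is hypothetical, and a site that scatters session IDs across all its URLs cannot be carved up this way:]

```
# Hypothetical layout: all session-encoded URLs live under one prefix.
# Compliant robots will skip that subtree but may still index the rest.
User-agent: *
Disallow: /cgi-bin/session/
```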

I'm sure you're right that there will be some co-evolutionary arms
races in this area no matter what.   The problem is that sites that
use session IDs are quite likely to use them pervasively, but it still
might make sense to index the pages on those sites.

Actually, I came up with this idea while talking to the AltaVista folks,
who are concerned about filling up their indexes with multiple copies of
pages that differ only in the session ID in the URL.  Not all sites
will have set up robots.txt files.  And the AltaVista guys are
also nice enough that they don't want to bog down servers with
multiple requests for pages they've already seen.  So the robot operators
have some self-interest in making this work.  And server operators
have some self-interest here too -- it doesn't do much good if robots
hold on to session IDs and repeatedly hand the same ones to new users.
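[The deduplication side of this can also be handled by the indexer itself: canonicalize each URL by stripping the session component before comparing it against pages already seen. A small sketch in (modern) Python, assuming the session ID travels as a query parameter named "sid" or "sessionid" -- both names are hypothetical, and real sites encode sessions in many other ways, including in the path itself:]

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

# Hypothetical parameter names; a real indexer would need a broader list
# or per-site heuristics.
SESSION_PARAMS = {"sid", "sessionid"}

def canonicalize(url):
    """Drop known session-ID query parameters so two URLs that differ
    only in the session ID map to the same index entry."""
    parts = urlsplit(url)
    kept = [(k, v) for k, v in parse_qsl(parts.query)
            if k.lower() not in SESSION_PARAMS]
    return urlunsplit((parts.scheme, parts.netloc, parts.path,
                       urlencode(kept), parts.fragment))
```

[With this, two fetches of the same page under different session IDs collapse to one entry, which addresses both the index bloat and the repeated-request problem.]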

If you have a better idea about how to deal with this, please let me know.
(The answer is not (yet) "use cookies").

--Shel
Received on Tuesday, 20 February 1996 08:15:36 EST
