Re: Robots

From: Jim Pitkow (pitkow@parc.xerox.com)
Date: Tue, Dec 08 1998


Date: Tue, 8 Dec 1998 01:00:09 PST
To: www-wca <www-wca@w3.org>
From: Jim Pitkow <pitkow@parc.xerox.com>
Message-Id: <98Dec8.010743pst."147472"@mailback.parc.xerox.com>
Subject: Re: Robots


The user agent field is a mess and something we should try to clean up in
redefining CLF (the Common Log Format).  I'd like to see the field include:
browser, version, OS, and agent type {real-time user (surfing),
user-initiated agent (get me these pages for off-line reading), autonomous
agent (search engine crawlers)}.
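
As a rough sketch (the type and field names here are my own invention, not
a proposal for the actual CLF syntax), the structured field might carry
something like:

    # Hypothetical structured agent field: browser, version, OS, and an
    # agent class separating live users from the two kinds of robots.
    from dataclasses import dataclass
    from enum import Enum

    class AgentClass(Enum):
        REAL_TIME_USER = "real-time user"        # surfing
        USER_INITIATED = "user-initiated agent"  # off-line reading fetches
        AUTONOMOUS = "autonomous agent"          # search engine crawlers

    @dataclass
    class AgentField:
        browser: str              # e.g. "Mozilla"
        version: str              # e.g. "4.05"
        os: str                   # e.g. "WinNT"
        agent_class: AgentClass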

Common techniques that I am aware of for identifying robots from log files
include (rough sketches of these follow the list):

* check if robots.txt was requested and treat all subsequent requests from
the same (host, user agent) tuple as robot requests (one can also perform
correlation analysis on the resulting paths to see if other agents that do
not request the robots.txt file fall into this category)

* look for abnormally long paths. It's easy enough to determine if an
exhaustive crawl is being performed, since each site has a magic number of
connected pages, as determined via a breadth-first search of the site.
For Xerox, the magic number used to be 102.

* look for disconnected paths, i.e., paths whose pages are not connected via
the explicit hyperlink topology of the site.  

* look for robot wordage in the user agent field ("crawler," "robot," etc.).
Depending upon the site, filtering out all agents that do not include
"Mozilla" or "MSIE" may exclude a significant segment of the user
population.

One thing we may find value in is creating a list of known robots using
automatic detection algorithms as part of the automatic recharacterization
process and publishing the list off the WCA pages.


>
>
>Hi Chris,
>
>this is a misunderstanding. What I mean is that robots show behavior
>that is completely different from human behavior and therefore they must
>be treated separately if we want to analyze user behavior. Separating
>human requests from robot requests is a difficult if not impossible task.
>In my analysis I use a crude heuristic: every user agent different from
>Mozilla is considered a robot. I know that there have been cases where
>robots used Mozilla as the name in the USER_AGENT header, and I know that
>there are many more browsers besides Netscape and MSIE. But it gives me a
>good estimate.
>
>volker 
>
>volker turau
>FH Wiesbaden Fachbereich Informatik 
>Tel.: +49-611-9495-205 FAX +49-611-9495-210
>http://www.informatik.fh-wiesbaden.de/~turau
>