Characterizing browsing patterns: problems encountered

From: Volker Turau
Date: Tue, Dec 15 1998

Date: Tue, 15 Dec 1998 12:23:14 +0000 (GMT)
From: Volker Turau <>
Message-ID: <Pine.HPP.3.96.981215121801.14523A-100000@dipl01>
Subject: Characterizing browsing patterns: problems encountered


On the basis of the log files of our web server I tried to characterize
browsing patterns. I came across the following problems:

-- Using log files alone it is hard to determine clickstreams (of the kind
supported by the Apache module mod_usertrack). I grouped requests on
the basis of IP addresses and time stamps. I defined two accesses as
related when they came from the same IP address, used the same client
software, and their access times differed by less than a constant
T. I regarded the equivalence classes of the transitive closure of this
relation as the user sessions. I know that this method has its weakness
(an IP address is not the same as a user). A session is considered a robot
session if it contains at least one access to robots.txt. On the basis of
these classes I am trying to do some long-term analysis.
What are better methods? 
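
A minimal sketch of the grouping described above, in Python. The record
layout (the Access tuple and its field names), the function names, and the
exact path "/robots.txt" are assumptions made for illustration, not taken
from any actual log format or tool. Within one (IP address, client software)
group the accesses are sorted by time and split wherever the gap to the
previous access is T or more; for time-ordered accesses this yields the same
classes as the transitive closure of the "related" relation.

```python
from collections import defaultdict
from typing import NamedTuple


class Access(NamedTuple):
    """One log entry; the field names are hypothetical."""
    ip: str
    agent: str    # client software (User-Agent string)
    time: float   # access time in seconds
    path: str     # requested path


def sessions(accesses, T):
    """Group accesses into sessions: same IP, same agent, gaps below T."""
    by_client = defaultdict(list)
    for a in accesses:
        by_client[(a.ip, a.agent)].append(a)

    result = []
    for group in by_client.values():
        group.sort(key=lambda a: a.time)
        current = [group[0]]
        for a in group[1:]:
            if a.time - current[-1].time < T:
                # Related to the previous access, so same session.
                current.append(a)
            else:
                # Gap of T or more: close the session and start a new one.
                result.append(current)
                current = [a]
        result.append(current)
    return result


def is_robot_session(session):
    """A session counts as a robot session if it touches robots.txt."""
    return any(a.path == "/robots.txt" for a in session)
```

The choice of T matters: a larger T merges separate visits from the same
address into one session, a smaller T splits a single slow visit into
several.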

-- The requests a server receives are only those that could not be
satisfied by caches in between the client and the server (e.g. in the
client software). Hence server log files only partially represent the
browsing characteristics of users; they are really server access patterns.
I think that real user behavior can only be obtained from the client
software, e.g. by logging the browser activities. Has this been done?

-- In order to find a correlation between user behavior and site structure
I needed the structure of the web site at access time. This was not always
available because the site structure had changed since.

Are there archives of server log files publicly available?

volker turau
FH Wiesbaden Fachbereich Informatik 
Tel.: +49-611-9495-205 FAX +49-611-9495-210