Re: Accurate user-based log file analysis from Brian Behlendorf on 1995-07-18 (www-talk@w3.org from July to August 1995)

From: Brian Behlendorf <brian@organic.com>
Date: Mon, 17 Jul 1995 21:22:58 -0700 (PDT)
To: Terry Myerson <tmyerson@iserver.interse.com>
Cc: www-talk@w3.org
Message-Id: <Pine.3.89.9507172026.C29440-0100000@eat.organic.com>

On Mon, 17 Jul 1995, Terry Myerson wrote:
> You are speaking to extremes. We have log files from over 100 organizations
> in our test suite. The data has been scrutinized, and indeed both accurate
> enough and extremely valuable.

I have no doubt the data's valuable, but the accuracy is what I'm 
questioning.  Could you elaborate in what way the accuracy was tested?

> We are indeed talking about user sessions, and not users. My usage of the term
> users was indeed a marketing decision, I apologize. But user sessions are still
> a much better statistic to base business decisions upon than hits or unique
> hostnames.

Agreed.  But PLEASE let's get the terminology right - when marketers talk 
about numbers, they are *not* talking about one-time sessions.  They want 
to know if Bob came back 20 times in 20 days or only once, which just 
looking at sessions can't tell you.

> >Could you elaborate on these DC's?  What can you key off of except 
> >hostnames from CLFF data?  
> 
> There are other DC's in there. 

Well, let's walk through the CLFF, and tell me where the other 
"distinguishing characteristics" are:

RFC931 identd information - only a couple sites will supply this, and it 
	introduces a huge latency on a server so most people turn it off

authenticated username - again, it's something the site has to enable at
	their expense, which most don't do

date/time - you can perform heuristics on the date in conjunction with 
	hostnames by arguing that a gap in time represents the end of
	one user and the beginning of a second - but that ignores the
	situation where someone follows a link *out* and then comes back 
	much later, and the more proxies are used the less useful it would
	be.  Our current estimates are that 20% of our accesses are coming 
	from behind proxy servers, and that number has been going quite 
	steadily upward.

request - you can lay out paths in a web site if you have a 
	directed-graph model of how the pages connect.  The analysis
	program must be able to inspect this hierarchy, using a robot or by 
	being able to read the HTML files themselves.  The more links on a
	page, the less useful this is - and you also don't know when people
	step *back*, which makes building Markov models very difficult.
	In short, chaining paths is also a very weak link.

Response status - most of the time it's either a success (200) or a 
	"not-modified" (304).  I don't see how you can determine from
	this whether the request represents a new user or not - if the
	object is first fetched from a given host and returns a 200, 
	and then 20 minutes later from the same host and gets a 304, what 
	does that mean?  Either 1) it's the same person refreshing that
	object, using a browser that implements caching, or 2) another user behind a 
	proxy server getting that object for the first time.  If the second 
	response is a 200, then it's from 1) the same user whose browser doesn't
	implement caching or 2) the same user whose browser implements caching
	but the object wasn't in the cache for whatever reason, or 3) a totally
	different user coming from behind a non-caching proxy.  

File size - insignificant.
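
To make the date/time heuristic above concrete, here is a minimal sketch
in Python. It is an illustration only, not how any particular analysis
tool works: the 30-minute timeout, the names parse and count_sessions,
and the sample log lines (including the host proxy.example.com) are all
assumptions made up for the example.

```python
import re
from datetime import datetime, timedelta

# Common Log Format: host ident authuser [date] "request" status bytes
CLF = re.compile(r'(\S+) (\S+) (\S+) \[([^\]]+)\] "([^"]*)" (\d{3}) (\S+)')

def parse(line):
    """Return (host, timestamp, request, status), or None if the line is malformed."""
    m = CLF.match(line)
    if m is None:
        return None
    host, _ident, _user, when, request, status, _size = m.groups()
    ts = datetime.strptime(when, "%d/%b/%Y:%H:%M:%S %z")
    return host, ts, request, int(status)

def count_sessions(lines, timeout=timedelta(minutes=30)):
    """Count 'sessions' by starting a new one whenever a host has been
    silent for longer than `timeout` (an assumed, arbitrary threshold)."""
    last_seen = {}   # host -> timestamp of its most recent request
    sessions = 0
    for line in lines:
        rec = parse(line)
        if rec is None:
            continue
        host, ts = rec[0], rec[1]
        prev = last_seen.get(host)
        if prev is None or ts - prev > timeout:
            sessions += 1
        last_seen[host] = ts
    return sessions

# Hypothetical log lines, all arriving from one proxy host.
SAMPLE = [
    'proxy.example.com - - [17/Jul/1995:21:00:00 -0700] "GET /index.html HTTP/1.0" 200 2048',
    'proxy.example.com - - [17/Jul/1995:21:05:00 -0700] "GET /index.html HTTP/1.0" 304 0',
    'proxy.example.com - - [17/Jul/1995:22:30:00 -0700] "GET /about.html HTTP/1.0" 200 1024',
]

print(count_sessions(SAMPLE))
```

On these three lines the heuristic reports two sessions, but nothing in
the log says whether one, two, or three different people were sitting
behind that proxy, which is exactly the ambiguity described in the
date/time and response entries above.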

So what's there?  

I apologize if I'm making a ruckus on this issue - but it's something 
we're heavily involved with as well, and I have to *constantly* 
*constantly* deal with clients and prospective clients who have been told 
by overeager marketers from other companies what can and can't be done in 
this and other technical arenas.  I don't doubt that there is a large 
market for a good analysis tool a la getstats or wwwstat or the other 
free analysis tools out there.  But I am very wary of unfulfillable 
claims which have been getting way too much press.  There are real 
solutions to this coming down the pipe that will give marketers a better 
idea of who's visiting them without having to guess or derive flaky 
heuristics that work one day and not the next, while still strongly 
protecting the user's right to privacy. 

I've said too much on the subject... next!

	Brian


--=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=--
brian@organic.com  brian@hyperreal.com  http://www.[hyperreal,organic].com/
Received on Tuesday, 18 July 1995 00:25:10 UTC