- From: Brian Behlendorf <brian@organic.com>
- Date: Mon, 17 Jul 1995 21:22:58 -0700 (PDT)
- To: Terry Myerson <tmyerson@iserver.interse.com>
- Cc: www-talk@w3.org
On Mon, 17 Jul 1995, Terry Myerson wrote:
> You are speaking to extremes. We have log files from over 100 organizations
> in our test suite. The data has been scrutinized, and indeed both accurate
> enough and extremely valuable.

I have no doubt the data's valuable, but the accuracy is what I'm questioning. Could you elaborate in what way the accuracy was tested?

> We are indeed talking about user sessions, and not users. My usage of the term
> users was indeed a marketing decision, I apologize. But user sessions are still
> a much better statistic to base business decisions upon than hits or unique
> hostnames.

Agreed. But PLEASE let's get the terminology right - when marketers talk about numbers, they are *not* talking about one-time sessions. They want to know if Bob came back 20 times in 20 days or only once, which just looking at sessions can't tell you.

> >Could you elaborate on these DC's? What can you key off of except
> >hostnames from CLFF data?
>
> There are other DC's in there.

Well, let's walk through the CLFF, and tell me where the other "distinguishing characteristics" are:

RFC931 identd information - only a couple of sites will supply this, and it introduces a huge latency on a server, so most people turn it off.

authenticated username - again, it's something the site has to enable at their expense, which most don't do.

date/time - you can perform heuristics on the date in conjunction with hostnames by arguing that a gap in time represents the end of one user and the beginning of a second - but that ignores the situation where someone follows a link *out* and then comes back much later, and the more proxies are used, the less useful it would be. Our current estimates are that 20% of our accesses are coming from behind proxy servers, and that number has been going quite steadily upward.

request - you can lay out paths in a web site if you have a directed-graph model of how the pages connect. The analysis program must be able to inspect this hierarchy, using a robot or by being able to read the HTML files themselves. The more links on a page, the less useful this is - and you also don't know when people step *back*, which makes building Markov models very difficult. In short, chaining paths is also a very weak link.

Error response - most of the time it's either a success (200) or a "not-modified" (304). I don't see how you can determine from this whether the request represents a new user or not - if the object is first fetched from a given host and returns a 200, and then 20 minutes later is fetched from the same host and gets a 304, what does that mean? Either 1) it's the same person refreshing that object, using a browser that implements caching, or 2) it's another user behind a proxy server getting that object for the first time. If the second response is a 200, then it's from 1) the same user whose browser doesn't implement caching, 2) the same user whose browser implements caching but the object wasn't in the cache for whatever reason, or 3) a totally different user coming from behind a non-caching proxy.

File size - insignificant.

So what's there?

I apologize if I'm making a ruckus on this issue - but it's something we're heavily involved with as well, and I have to *constantly* *constantly* deal with clients and prospective clients who have been told by overeager marketers from other companies what can and can't be done in this and other technical arenas. I don't doubt that there is a large market for a good analysis tool a la getstats or wwwstat or the other free analysis tools out there. But I am very wary of unfulfillable claims which have been getting way too much press. There are real solutions to this coming down the pipe that will give marketers a better idea of who's visiting them without having to guess or derive flaky heuristics that work one day and not the next, while still strongly protecting the user's right to privacy.
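
To make the date/time point concrete, here is a minimal sketch of the hostname-plus-time-gap heuristic described above. It is only an illustration, not anyone's actual product: the 30-minute gap, the regular expression, and the names parse_clf and count_sessions are all assumptions made for this example.

```python
import re
from datetime import datetime, timedelta

# Common Log Format fields: host ident authuser [date] "request" status bytes
CLF_LINE = re.compile(
    r'(?P<host>\S+) (?P<ident>\S+) (?P<user>\S+) '
    r'\[(?P<time>[^\]]+)\] "(?P<request>[^"]*)" '
    r'(?P<status>\d{3}) (?P<size>\S+)'
)


def parse_clf(line):
    """Parse one log line into a dict, or return None if it does not match."""
    m = CLF_LINE.match(line)
    if m is None:
        return None
    rec = m.groupdict()
    # Timestamps look like 17/Jul/1995:21:22:58 -0700
    rec["time"] = datetime.strptime(rec["time"], "%d/%b/%Y:%H:%M:%S %z")
    return rec


def count_sessions(lines, gap=timedelta(minutes=30)):
    """Count 'sessions' per hostname: a new one starts after a quiet gap.

    The gap length is an arbitrary assumption. Every user behind the same
    proxy hostname is merged together, which is the weakness discussed above.
    """
    records = [r for r in map(parse_clf, lines) if r is not None]
    last_seen = {}
    sessions = {}
    for rec in sorted(records, key=lambda r: r["time"]):
        host = rec["host"]
        prev = last_seen.get(host)
        if prev is None or rec["time"] - prev > gap:
            sessions[host] = sessions.get(host, 0) + 1
        last_seen[host] = rec["time"]
    return sessions
```

Note that a 200 followed five minutes later by a 304 from the same proxy hostname lands in one "session" here, even though it may well be two different people, and someone who follows a link out and returns an hour later is counted twice.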

I've said too much on the subject... next!

	Brian

--=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=--
brian@organic.com  brian@hyperreal.com  http://www.[hyperreal,organic].com/
Received on Tuesday, 18 July 1995 00:25:10 UTC