Summary of the WCA BOF at WWW8

From: Johan Hjelm (hjelm@w3.org)
Date: Tue, May 18 1999


Message-Id: <4.1.19990517185744.00bf5100@127.0.0.1>
Date: Tue, 18 May 1999 14:24:46 +0200
To: www-wca@w3.org
From: Johan Hjelm <hjelm@w3.org>
Subject: Summary of the WCA BOF at WWW8

Here is a summary of the WCA BOF at WWW8. 

Henrik presented the WCA organisation. This was followed by a brief
terminology walkthrough, to make sure we were all on the same level. A
gentleman who did not stay raised the issue of Nielsen rating web sites in
the same way they rate TV channels; according to their web site
(http://www.nielsen-netratings.com/), they actually insert a measurement
proxy into the network and log packet traces. I am aware of one other
company that does this, Tidningsstatistik in Sweden (http://www.tsrs.se/,
all in Swedish, unfortunately). It may be that we should contact them and
see if we can work together.

On to the meeting: the discussion was very good, bouncing around between
all sides (academic, industry, user). Here are a few of the issues we
discussed:
* Robots, user agents and clients. How do we identify malicious robots?
Patterns seem to be the only solution; from a practical standpoint they can
be used to insert "robot deterrents", e.g. second-long delays between
requests. Patterns can be determined from the number of requests per
session and the flow per unit of time. It would be a "business behavioral
definition". (A minimal detection sketch follows this list.)
* "Sites" are now not even organised over several servers in the same
domain (such as in load balancing); sometimes, elements gets fetched from
other domains, e.g. icons on the yahoo site; forms on other sites, etc.
This means that the only way to track a users behaviour is cross-site
analysis, which means that we must work with collections of logs from a
number of sites (or proxy logs?) Cross-site analysis is the only thing that
really can tell you how your site works. 
* Rate of change and rate of propagation of new standards (over the entire
web) would be good metrics, as would resource types and versions of user
agents. Things like the use of CSS, caching, robots.txt etc. are
interesting to measure.
* How content providers provide information is something that needs to be
found out (see rate of change). 
* There are considerable privacy issues around caching and logging (e.g.
transparent proxies), which we have discussed in the WG but still need to
take into account.
* Proxies and proxy hit metering would be a useful way of measuring web
traffic. How metering is used, how it is defeated, and whether it is used
to forward measurements are important things to find out.
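
As a concrete illustration of the pattern-based robot detection in the
first bullet, here is a minimal sketch in Python. The thresholds (requests
per session, sustained request rate) and the one-second deterrent delay are
my own illustrative assumptions, not values the BOF agreed on.

  import time
  from collections import defaultdict, deque

  # Illustrative thresholds -- assumptions, not agreed-upon values.
  MAX_REQUESTS_PER_SESSION = 500   # more than this looks like a crawl
  MAX_REQUESTS_PER_SECOND = 2.0    # sustained rate above this looks automated
  RATE_WINDOW_SECONDS = 60         # window over which the rate is computed
  DETERRENT_DELAY_SECONDS = 1.0    # the "second-long delay" deterrent

  session_requests = defaultdict(int)     # session id -> total requests
  recent_timestamps = defaultdict(deque)  # session id -> timestamps in window

  def looks_like_robot(session_id, now=None):
      """Classify a session as robot-like from its request pattern."""
      now = now if now is not None else time.time()
      session_requests[session_id] += 1
      window = recent_timestamps[session_id]
      window.append(now)
      # Drop timestamps that have fallen out of the rate window.
      while window and now - window[0] > RATE_WINDOW_SECONDS:
          window.popleft()
      rate = len(window) / RATE_WINDOW_SECONDS
      return (session_requests[session_id] > MAX_REQUESTS_PER_SESSION
              or rate > MAX_REQUESTS_PER_SECOND)

  def handle_request(session_id):
      """Insert a deterrent delay for sessions matching the robot pattern."""
      if looks_like_robot(session_id):
          time.sleep(DETERRENT_DELAY_SECONDS)
      # ... serve the request as usual ...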

Generally, it seems that we are now narrowing in on the specific things we
want to measure. We also find that the user-session-related issues are very
hard to analyse unless you use something that captures the entire user
session with the web as such, not just with a specific site or collection
of resources. This implies, in my opinion, that user sessions are better
measured with tooled clients or through proxy logs (e.g. from firewalls),
and that what we can measure on the web are the site/collection-specific
things, as well as the HTTP-server specifics.
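
To make the last point concrete, here is a minimal sketch of how user
sessions might be reconstructed from a proxy (or firewall) log rather than
from a single site's server log. The record format (client, timestamp, URL)
and the 30-minute inactivity cut-off are assumptions for illustration only.

  from datetime import datetime, timedelta

  # Assumption: each parsed log record is (client_id, timestamp, url), where
  # client_id might be an IP address or an authenticated proxy user.
  SESSION_GAP = timedelta(minutes=30)   # illustrative inactivity cut-off

  def sessions_from_proxy_log(records):
      """Group proxy-log records into per-client sessions across all sites."""
      sessions = []
      current = {}    # client_id -> requests in the client's open session
      last_seen = {}  # client_id -> timestamp of the client's previous request
      for client_id, ts, url in sorted(records, key=lambda r: r[1]):
          if client_id in last_seen and ts - last_seen[client_id] > SESSION_GAP:
              sessions.append((client_id, current.pop(client_id)))
          current.setdefault(client_id, []).append((ts, url))
          last_seen[client_id] = ts
      # Close any sessions still open at the end of the log.
      sessions.extend(current.items())
      return sessions

  # Example with hand-made records spanning two sites and one long pause:
  log = [
      ("10.0.0.1", datetime(1999, 5, 18, 14, 0), "http://www.example.com/"),
      ("10.0.0.1", datetime(1999, 5, 18, 14, 1), "http://icons.example.net/a.gif"),
      ("10.0.0.1", datetime(1999, 5, 18, 15, 0), "http://www.w3.org/"),
  ]
  for client, requests in sessions_from_proxy_log(log):
      print(client, len(requests), "requests")

Because the proxy sits between the user and the whole web, the resulting
sessions span sites, which is exactly what per-site server logs cannot show.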

Johan

************************************************************
                     Johan HJELM
       Ericsson Research, User Applications Group 
         Currently visiting engineer at the W3C
             The World Wide Web Consortium
                     hjelm@w3.org
   http://www.w3.org/People/W3Cpeople.html#Hjelm
    Fax +1-617-258 5999, Phone +1-617-253-9630
   MIT/LCS, 545 Tech. Sq. Cambridge MA 02139 USA 
        opinions are personal, always my own, 
  and not necessarily those of Ericsson or the W3C. 
============================================================