Responses to comments

From: Lavoie,Brian (lavoie@oclc.org)
Date: Wed, Mar 31 1999


Message-ID: <72B89459DD2BD211B5CD0000F840094E107562@oa3-server.dev.oclc.org>
From: "Lavoie,Brian" <lavoie@oclc.org>
To: "'www-wca@w3.org'" <www-wca@w3.org>
Date: Wed, 31 Mar 1999 13:31:47 -0500
Subject: Responses to comments

Thanks for your comments, Henryk. Sorry for the delay in replying - I ran up
against a deadline on a paper last week, so I had to put this on the
back burner.

Here are my responses:

>I don't like using "file" as a unit at all. Resource is really more
>flexible and we never really care about how that resource is stored
>internally. There is no reason why an object stored in a database can't be
>as (or more) static.

Resources and files are not intended to be synonymous, so we don't have to
sacrifice one term to use the other. As for the usefulness of the file unit,
I think we should care how information is stored, because the storage format
can have broader implications, such as client accessibility. But more
importantly, files are observable, well-defined entities that we can use as
building blocks to unambiguously define more complex Web entities, such as
Web resources, Web pages, etc.   

>A resource doesn't really consist of one or more files.

You're quite right - a resource is not necessarily a collection of files.
But if we want to use "resource" as a measurable concept we can apply to the
Web, we have to restrict our scope. Resource in its purest interpretation
doesn't get us very far: it includes Web pages, a weather report, an idea,
oil, timber ... We need to base it on something concrete, observable, and
related to the Web, which was why I was thinking about a resource as a
collection of HTTP-accessible files.     

>For better or worse - HTTP calls an instantiation for an "entity" which
>doesn't really say anything. However, I think we are better off sticking to
>what people already are familiar with - I have a definition which is a bit
>more general than the one HTTP uses (doesn't split "metadata" from "data"
>as in entity body and entity header).

That's fine - we'll go with the term people are already using.

>   Client
>   A software application that initiates network communication.
>I think that we here can be a little more specific, see for example:

We can use your definition. I don't think that we need to provide an
authoritative or detailed definition of a client (or server), however, since
this is a term used in a variety of contexts, and is certainly not limited
to Web characterization.

>The important thing is that "server" also covers the term "proxy" and other
>intermediaries.

Same thing here.

>I propose using these definitions instead for message, request, and
>response:

Most of the alterations I made to the definitions you suggest were to strip
out the examples from the definitions. Maybe we should have a separate
"example" section for each definition, especially if they are important for
clarifying the definition.

>It is inherent in the Web model that a user agent always issues requests on
>behalf of some human although it doesn't have to be directly. For example,
>a robot still behaves on behalf of the human starting it. I wouldn't make
>it a primitive. Instead I would like to define certain access patterns -
>there is no reason why a browser can't become a robot while filling a cache
>or using a robot to behave as a browser to download inlined images etc.

I disagree. I think that separating the human from the client can lead to
some useful analysis of each. Also, the user/client distinction is necessary
for distinguishing between clicks and requests, and explicit and implicit
clicks.

>The Web is really not limited to HTTP (nor HTML/XML for that matter). Those
>are just popular ways of implementing the Web - the Web is really the
>complete information space that can be referenced by URIs. That is,
>anything that is a resource is on the Web. When you think about it, this is
>really not limited to networked resources - however, this is how we
>normally think about them. Examples of non-networked URIs are phone
>addresses, for example:   As we have already defined a "resource", we don't
>have to change that definition.

Your point is well taken, but in my opinion, practical considerations have
to prevail here. According to your definition, essentially the entire
Internet, and even non-networked resources, could fall within the writ of
Web characterization, which I don't think is what we want. You may want to
check out some earlier e-mails that were exchanged on the mailing list on
the issue "What is the Web?". I think the consensus that emerged from that
dialogue is that what we're mainly interested in is information that is
accessed from Web servers - i.e., accessed via HTTP. Otherwise, we fall into
an infinite regress - everything has to be analyzed. We have to start
with some reasonable boundaries to our analysis - for example, we're not
interested in non-networked resources, so why try to accommodate them in the
definition? It is not necessary (or feasible) to capture every potential
aspect of the Web in our definition. We just need to make clear what we're
including in our analysis, and what we're not. That's not to say, of course,
that we can't amend the definition in the future as needs warrant.

>There are plenty of other formats that contain URIs - pdf, powerpoint, etc.
>It is not limited to HTML. Again, I don't think we have to say anything
>more than what we already have on resources.
    
Again, practical considerations are key. We're not analyzing everything with
a URI - we're confining ourselves to the Web (see my previous comment). The
term "Web-accessible resource" is intended to acknowledge the fact that
resources are available via non-HTTP protocols through links appearing in
Web resources. In this sense, we can view the Web as the "core" set of
resources, and Web-accessible Internet resources as a layer of additional
resources surrounding the core (and directly accessible from the core) and
stop there.


>   Web Clients
>   
>   Web Client
>   A client that can be used to access Web resources.
>
>We have already defined this as a "client" - it doesn't matter what
>protocol it is really speaking nor whether it has a human clicking on the
>mouse.

It actually does matter - the first definition of a client is a primitive:
in other words, a basic concept we use as a building block for our
Web-specific concepts. For Web characterization, we are only interested in
Web clients: clients as defined in the primitive, with the added
stipulation that they have the ability to access Web resources. Not all clients
can do this. Since Web resources are defined above as resources accessible
through HTTP, it does matter what protocol is spoken.
I agree that Web clients can be automated or user driven, but it is
important to make the distinction.

>Instead of defining "click" (which is also not very general - there are
>many other ways of initiating a request) then I think we are in fact
>already covered by the "web page" definition where we leave it to the user
>preferences and/or application capabilities to decide which links are
>dereferenced and which are not.

The click term is not intended to be general: it is intended to describe the
specific action of a human (user) manually requesting a resource via a link.
It is an elaboration (i.e., specific example) of a request.

>   Click-through Rate
>   Frequency with which a Web resource, identified by a URL, is clicked.
>Do you mean "Web page access rate"? That is, the (mean or distribution?)
>time between changing web pages? Again, I think we should avoid the term
>"click".

This question is probably best answered by Jim ...

>   User Session
>   A cohesive set of user clicks across one or more Web servers.
>
>What about "A set of Web pages accessed by continuous dereferencing of
>links contained within these web pages. A session is not limited to a
>single Web site"?

You would need to be more specific about what you mean by "continuous". I
guess one of the advantages of the other definition was that "cohesive" was
appropriately vague.

>   Episode
>   A subset of related user clicks that occur within a user session.
>How is that distinguished from a session?

A session is analogous to the time between when a user sits down at the
computer and the time when he gets up. On the other hand, within a single
session, a user can generate multiple clusters of logically related
requests: for example, in a session, a user might visit a number of sites
devoted to Java (episode 1). Then, while he is still online, he might check
the closing prices on his stocks (episode 2). Finally, he might buy some
books at Amazon.com (episode 3). I think the term is useful from a
conceptual standpoint, although it will be difficult to measure.
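
To make the session/episode distinction concrete, here is a minimal sketch
in Python. It is only an illustration, not a proposed measurement technique:
the Click type, the topic label, and the example hosts other than Amazon.com
are all hypothetical. The sketch simply assumes each click has already been
assigned a topic, which is exactly the part that would be difficult to
measure in practice.

```python
from dataclasses import dataclass
from itertools import groupby


@dataclass(frozen=True)
class Click:
    url: str
    topic: str  # hypothetical label; assigning it is the hard part


def episodes(session):
    """Split one user session (clicks in time order) into episodes.

    An episode here is a run of consecutive clicks sharing a topic label.
    """
    return [list(run) for _, run in groupby(session, key=lambda c: c.topic)]


# One user session: everything between sitting down and getting up.
session = [
    Click("http://java.example.org/tutorial", "java"),
    Click("http://applets.example.net/", "java"),
    Click("http://quotes.example.com/closing", "stocks"),
    Click("http://www.amazon.com/", "books"),
]

for number, episode in enumerate(episodes(session), start=1):
    print(number, [click.url for click in episode])
```

Under these assumptions the single session splits into three episodes
(Java sites, stock prices, book buying), matching the example above.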

>   Server Session
>   A collection of user clicks to a Web server during a user session.
>   Also called a visit.
>What about:
>	The part of a user session limited to a single Web site.

I wanted to capture the point that a Web site is not equivalent to a Web
server.
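
A rough Python sketch of why the two differ, under the loudly stated
assumption that a Web server can be identified by the host name in the URL
(which will not always hold). The function name and the example.com /
example.org hosts are hypothetical. Clicks from one user session are grouped
per server, so a site spread over two hosts yields two server sessions.

```python
from collections import defaultdict
from urllib.parse import urlsplit


def server_sessions(session_urls):
    """Group the clicked URLs of one user session by server host name.

    Assumption: one host name corresponds to one Web server.
    """
    visits = defaultdict(list)
    for url in session_urls:
        visits[urlsplit(url).hostname].append(url)
    return dict(visits)


# One user session touching a single "site" served from two hosts,
# plus one unrelated server.
clicks = [
    "http://www.example.com/",
    "http://search.example.com/query",
    "http://www.example.com/about.html",
    "http://www.example.org/",
]

for host, urls in server_sessions(clicks).items():
    print(host, len(urls))
```

Here the example.com site accounts for two server sessions
(www.example.com and search.example.com), which a site-based definition
would merge into one.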

>The following definitions are rather specific to HTTP - so maybe we should
>introduce a special section of HTTP "stuff" including the equivalent sizes
>for responses?

General comment on your comments:

I think a major difference between your definitions and mine is that yours
emphasize general interpretations of concepts, while mine focus on the
concept as it is currently manifested on the Web. Both perspectives have
advantages and disadvantages, and I tried to locate a happy medium between
the two when I composed some of these definitions. However, I finally
decided that in some circumstances, the Web-specific interpretations have to
prevail.
I think an important point to remember is that Web characterization boils
down to metrics, and metrics have to be based on observable, measurable
concepts. Therefore, the concepts we work with need to have an element of
concreteness to them that allows us to apply them to Web data. Now, this is
not to say that a concept should be defined by its measurement technique -
we had a discussion about that at the last conference call. But the concept
does have to be measurable, and as unambiguous as possible.


Thanks for your comments - let me know what you think.

Brian