Re: Responses to comments

From: Jim Pitkow (pitkow@parc.xerox.com)
Date: Fri, Apr 02 1999


Message-Id: <4.0.2.19990401204753.00a7f100@mailback.parc.xerox.com>
Date: Thu, 1 Apr 1999 21:31:15 PST
To: "Lavoie,Brian" <lavoie@oclc.org>, "'www-wca@w3.org'" <www-wca@w3.org>
From: Jim Pitkow <pitkow@parc.xerox.com>
Subject: Re: Responses to comments

At 10:31 AM 3/31/99 , Lavoie,Brian wrote:
>>A resource doesn't really consist of one or more files.
>
>You're quite right - a resource is not necessarily a collection of files.
>But if we want to use "resource" as a measurable concept we can apply to the
>Web, we have to restrict our scope. Resource in its purest interpretation
>doesn't get us very far: it includes Web pages, a weather report, an idea,
>oil, timber ... We need to base it on something concrete, observable, and
>related to the Web, which was why I was thinking about a resource as a
>collection of HTTP-accessible files.     

My view on this is that we need to define clearly both the abstract and
concrete if possible.  In the abstract, referring to a Web page as a
resource seems just as fine as referring to a streaming feed from a
particular source.  We may want to adopt the notion of composite resources
to refer to the fusion of more than one resource.

Files versus databases.  To me it seems that one can chase one's tail for a
long while asking: "If a file is stored in a database as a blob, is it
still a file?"  I do think we need the notion of a container (page) and
the corresponding elements that help constitute the page (images, style
sheets, etc.).  It'd be nice if these notions abstracted away whether the
content is stored on a file system or in a db.

>>   Client
>>   A software application that initiates network communication.
>>I think that we here can be a little more specific, see for example:
>
>We can use your definition. I don't think that we need to provide an
>authoritative or detailed definition of a client (or server), however, since
>this is a term used in a variety of contexts, and is certainly not limited
>to Web characterization.

In the larger scheme of things, doesn't the notion of client and server
disappear into a set of objects and methods?  Is a Marimba server that
initiates and sends updated information a client?  Is a server that pushes
invalidated pages to a set of proxies a client?  To me the notion of a
client implies a certain proximity to the user.  It also implies being a
consumer of information rather than a provider/supplier.

>>The important thing is that "server" also covers the term "proxy" and other
>>intermediaries.
>
>Same thing here.

Not quite sure what you mean by this, Henrik.  We treat the analysis of
proxies quite differently than servers.

>>I propose using these definitions instead for message, request, and
>>response:
>
>Most of the alterations I made to the definitions you suggest were to strip
>out the examples from the definitions. Maybe we should have a separate
>"example" section for each definition, especially if they are important for
>clarifying the definition.

I agree that a separate example/figure that the definitions optionally
referred to may be helpful.

>>It is inherent in the Web model that a user agent always issues requests on
>>behalf of some human although it doesn't have to be directly. For example,
>>a robot still behaves on behalf of the human starting it. I wouldn't make
>>it a primitive. Instead I would like to define certain access patterns -
>>there is no reason why a browser can't become a robot while filling a cache
>>or using a robot to behave as a browser to download inlined images etc.
>
>I disagree. I think that separating the human from the client can lead to
>some useful analysis of each. Also, the user/client distinction is necessary
>for distinguishing between clicks and requests, and explicit and implicit
>clicks.

I agree with Brian.  To me the issue boils down to autonomy.  A robot that
crawls the Web using a breadth-first search or so is quite different from a
user filling a cache for off-line reading and quite different from a user
clicking on each hyperlink to retrieve the content.  It'd be nice to come
up with a set of values/dimensions that encompasses this difference.

>>The Web is really not limited to HTTP (nor HTML/XML for that matter). Those
>>are just popular ways of implementing the Web - the Web is really the
>>complete information space that can be referenced by URIs. That is,
>>anything that is a resource is on the Web. When you think about it, this is
>>really not limited to networked resources - however, this is how we
>>normally think about them. Examples of non-networked URIs are phone
>>addresses, for example:   As we have already defined a "resource", we don't
>>have to change that definition.
>
>Your point is well taken, but in my opinion, practical considerations have
>to prevail here. According to your definition, essentially the entire
>Internet, and even non-networked resources, could fall within the writ of
>Web characterization, which I don't think is what we want. You may want to
>check out some earlier e-mails that were exchanged on the mailing list on
>the issue "What is the Web?". I think the consensus that emerged from that
>dialogue is that what we're mainly interested in is information that is
>accessed from Web servers - i.e., accessed via HTTP. Otherwise, we fall into
>an infinite regression - everything has to be analyzed. We have to start
>with some reasonable boundaries to our analysis - for example, we're not
>interested in non-networked resources, so why try to accommodate them in the
>definition? It is not necessary (or feasible) to capture every potential
>aspect of the Web in our definition. We just need to make clear what we're
>including in our analysis, and what we're not. That's not to say, of course,
>that we can't amend the definition in the future as needs warrant.

I agree. We characterize the Web but not email, Usenet, IRC, etc., which all
can be accessed via URIs.  Limiting the scope for the purposes of the WCA
is useful so long as we clearly state that these are not absolute
definitions - only relative to the interests of the WCA and can be expanded
as needed.

>>There are plenty of other formats that contain URIs - pdf, powerpoint, etc.
>>It is not limited to HTML. Again, I don't think we have to say anything
>>more than what we already have on resources.
>    
>Again, practical considerations are key. We're not analyzing everything with
>a URI - we're confining ourselves to the Web (see my previous comment). The
>term "Web-accessible resource" is intended to acknowledge the fact that
>resources are available via non-HTTP protocols through links appearing in
>Web resources. In this sense, we can view the Web as the "core" set of
>resources, and Web-accessible Internet resources as a layer of additional
>resources surrounding the core (and directly accessible from the core) and
>stop there.

I tend to like the superset/subset frame, where the WCA is initially
concerned with a subset of all Web-accessible information.

>>   Web Clients
>>   
>>   Web Client
>>   A client that can be used to access Web resources.
>>
>>We have already defined this as a "client" - it doesn't matter what
>>protocol it is really speaking nor whether it has a human clicking on the
>>mouse.
>
>It actually does matter - the first definition of a client is a primitive:
>in other words, a basic concept we use as a building block for our
>Web-specific concepts. For Web characterization, we are only interested in
>Web clients, which is a client as defined as a primitive, with the added
>stipulation that it has the ability to access Web resources. Not all clients
>can do this. Since Web resources are defined above as resources accessible
>through HTTP, it does matter what protocol is spoken.
>I agree that Web clients can be automated or user driven, but it is
>important to make the distinction.

I agree.  Others?

>>Instead of defining "click" (which is also not very general - there are
>>many other ways of initiating a request) then I think we are in fact
>>already covered by the "web page" definition where we leave it to the user
>>preferences and/or application capabilities to decide which links are
>>dereferenced and which are not.
>
>The click term is not intended to be general: it is intended to describe the
>specific action of a human (user) manually requesting a resource via a link.
>It is an elaboration (i.e., specific example) of a request.

I think we need to enumerate all the current ways of dereferencing links
(typing in, bookmarks/favorites, history, etc.) so that
we a) document current capabilities and b) can identify areas that have yet
to be characterized.

>>   Click-through Rate
>>   Frequency with which a Web resource, identified by a URL, is clicked.
>>Do you mean "Web page access rate"? That is, the (mean or distribution?)
>>time between changing web pages? Again, I think we should avoid the term
>>"click".
>
>This question is probably best answered by Jim ...

For this, I think we're referring to a tuple <a,b> where clicking on 'a'
results in 'b' being displayed.  In other terms, this is the transition
frequency, or the value of the edge, from node a to node b.  The sum of all
click-throughs/edges into a page is equal to the Web page access rate -
assuming you are able to count typing in, bookmarks, etc. as the first
element in the tuple.
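
A minimal sketch of the tuple idea, in Python.  The log format, the page
names, and the use of None as the first element for non-click arrivals
(typed URL, bookmark, history) are all assumptions made for illustration.
It produces raw counts; dividing by the observation period would turn
them into rates.

```python
from collections import Counter

# Hypothetical navigation log: (source, destination) pairs.
# A source of None stands for arrivals without a link click
# (typed URL, bookmark, history); this encoding is an assumption.
transitions = [
    (None, "/index"),
    ("/index", "/a"),
    ("/index", "/b"),
    ("/a", "/b"),
    ("/index", "/a"),
    (None, "/b"),
]

# Edge value: how often clicking on 'a' resulted in 'b' being displayed.
click_through = Counter(transitions)


def access_count(page):
    """Sum of all incoming edges for a page, non-click arrivals included."""
    return sum(n for (_, dst), n in click_through.items() if dst == page)


print(click_through[("/index", "/a")])  # 2
print(access_count("/b"))               # 3
```

The access count only matches the sum of edges because the None-source
arrivals are kept as edges; drop them and the two numbers diverge.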

>>   User Session
>>   A cohesive set of user clicks across one or more Web servers.
>>
>>What about "A set of Web pages accessed by continuous dereferencing of
>>links contained within these web pages. A session is not limited to a
>>single Web site"?
>
>You would need to be more specific about what you mean by "continuous". I
>guess one of the advantages of the other definition was that "cohesive" was
>appropriately vague.

To me it seems that the notion of a session is more temporal than anything
else right now.  A timeout is the only method I know that people frequently
use.  Martin's World Cup paper provides a systematic analysis of the
trade-offs of using different timeout periods.  Overall, I think the
measurement community seems to use 30 minutes for client-side requests.  The
server-side community uses something more like 3-5 minutes or less.  This
difference is very important, and neither value has reached consensus yet,
in my mind.
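
To make the effect of the timeout choice concrete, here is a rough
sketch in Python.  The function name and the request times are
hypothetical, and a simple inactivity gap is only one way a timeout
could be applied; the same request stream splits into a different
number of sessions under a 30-minute and a 5-minute cutoff.

```python
def split_sessions(timestamps, timeout_seconds):
    """Group request times (in seconds) into sessions.

    A new session starts whenever the gap since the previous
    request exceeds timeout_seconds.
    """
    sessions = []
    for t in sorted(timestamps):
        if sessions and t - sessions[-1][-1] <= timeout_seconds:
            sessions[-1].append(t)
        else:
            sessions.append([t])
    return sessions


# Hypothetical request times for one user, in seconds.
requests = [0, 60, 400, 2500, 2560]

client_view = split_sessions(requests, 30 * 60)  # 30-minute timeout
server_view = split_sessions(requests, 5 * 60)   # 5-minute timeout

print(len(client_view))  # 2
print(len(server_view))  # 3
```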

>>   Episode
>>   A subset of related user clicks that occur within a user session.
>>How is that distinguished from a session?

Sessions can have subsessions or episodes.  The notion of the hub and spoke
navigation pattern where a user starts from and returns to an index page is
an example of an episode within a session.  It may be best to use the term
cohesive here.

>A session is analogous to the time between when a user sits down at the
>computer and the time when he gets up. On the other hand, within a single

Not necessarily so.  I sit down at 10:00 and leave the chair at 12:00 but
have many sessions in between (I'm on the phone, doing other work, etc.).

>session, a user can generate multiple clusters of logically related
>requests: for example, in a session, a user might visit a number of sites
>devoted to Java (episode 1). Then, while he is still online, he might check
>the closing prices on his stocks (episode 2). Finally, he might buy some
>books at Amazon.com (episode 3). I think the term is useful from a
>conceptual standpoint, although it will be difficult to measure.

Yup.

>>   Server Session
>>   A collection of user clicks to a Web server during a user session.
>>   Also called a visit.
>>What about:
>>	The part of a user session limited to a single Web site.
>
>I wanted to capture the point that a Web site is not equivalent to a Web
>server.

An interesting distinction.  Is this made explicit in the primary
definitions? (I'm currently in the air and don't have access to the latest
term sheet.)

>General comment on your comments:
>
>I think a major difference between your definitions and mine is that yours
>emphasize general interpretations of concepts, while mine focus on the
>concept as it is currently manifested on the Web. Both perspectives have
>advantages and disadvantages, and I tried to locate a happy medium between
>the two when I composed some of these definitions. However, I finally
>decided that in some circumstances, the Web-specific interpretations have to
>prevail.
>I think an important point to remember is that Web characterization boils
>down to metrics, and metrics have to be based on observable, measurable
>concepts. Therefore, the concepts we work with need to have an element of
>concreteness to them that allows us to apply them to Web data. Now, this is
>not to say that a concept should be defined by its measurement technique -
>we had a discussion about that at the last conference call. But the concept
>does have to be measurable, and as unambiguous as possible.

I tend to agree with the measurable concept as well as a further refinement
to HTTP for the WCA (at this point in time).  I also agree that we need to
have the definitions broad and self-extensible enough to permit growth in
scope.  An upfront statement about scope and the preferred observable
system, along with clarifications in the appropriate definitions, may help.

Nice progress.