Re: Responses to comments

From: Henrik Frystyk Nielsen
Date: Thu, Apr 01 1999

Date: Thu, 01 Apr 1999 17:30:36 -0500
To: "Lavoie,Brian" <>, "''" <>
From: Henrik Frystyk Nielsen <>
Subject: Re: Responses to comments

At 13:31 3/31/99 -0500, Lavoie,Brian wrote:
>Thanks for your comments, Henryk. Sorry for the delay in replying - I ran up
>against a deadline on a paper last week, so I had to put this on the

Hi Brian,

No problem - thanks for your reply. To start with your comment at the end...

>General comment on your comments:
>I think a major difference between your definitions and mine is that yours
>emphasize general interpretations of concepts, while mine focus on the
>concept as it is currently manifested on the Web. Both perspectives have
>advantages and disadvantages, and I tried to locate a happy medium between
>the two when I composed some of these definitions. However, I finally
>decided that in some circumstances, the Web-specific interpretations have to
>I think an important point to remember is that Web characterization boils
>down to metrics, and metrics have to be based on observable, measurable
>concepts. Therefore, the concepts we work with need to have an element of
>concreteness to them that allows us to apply them to Web data. Now, this is
>not to say that a concept should be defined by its measurement technique -
>we had a discussion about that at the last conference call. But the concept
>does have to be measurable, and as unambiguous as possible.

I think you are absolutely right about this distinction - and you are
bringing up a lot of very good points. Let me try and argue for why I think
we should try and keep to the more abstract side of the road and why I
think project scoping and terminology should be separated.

It basically boils down to how we most efficiently communicate among
ourselves over time and with other groups: Specific implementations are
short-term - the architecture is long-term. Therefore, what I would like is
to start with a broad definition of terms and then limit our scope to
specifics when implementing what we are chartered to do for the moment.

The reason for this is that we want to involve as many people as possible,
both directly and indirectly - even those who maybe don't consider
themselves to be typical "Web users".

Terminology tends to be the primary interface between different groups of
people working on related things and I have many times seen the "they are
not quite talking about the same thing so we can't work together" reaction
based on different terms.

Maybe what we should do is to have a general terminology section and then a
current scope section describing what we consider ourselves capable of
handling.

>Here are my responses:
>>I don't like using "file" as a unit at all. Resource is really more
>>flexible and we never really care about how that resource is stored
>>internally. There is no reason why an object stored in a database can't be
>>as (or more) static.
>Resources and files are not intended to be synonymous, so we don't have to
>sacrifice one term to use the other. As for the usefulness of the file unit,
>I think we should care how information is stored, because the storage format
>can have broader implications, such as client accessibility. But more
>importantly, files are observable, well-defined entities that we can use as
>building blocks to unambiguously define more complex Web entities, such as
>Web resources, Web pages, etc.   

I think you are touching on two different problems here: What *can* be
observed is the often relatively slowly changing state of files compared to,
say, dynamic weather maps. The rate at which resources change, what I call
"state distribution", is indeed an interesting point to observe but there
is no direct relationship to files - files are not observable through a Web

By even thinking about files we immediately get into the problem of how to
handle server-side include files, for example, which are mostly files but
not quite. Similarly, there are lots of examples of long-term persistent
entries in a database.

>>A resource doesn't really consist of one or more files.
>You're quite right - a resource is not necessarily a collection of files.
>But if we want to use "resource" as a measurable concept we can apply to the
>Web, we have to restrict our scope. Resource in its purest interpretation
>doesn't get us very far: it includes Web pages, a weather report, an idea,
>oil, timber ... We need to base it on something concrete, observable, and
>related to the Web, which was why I was thinking about a resource as a
>collection of HTTP-accessible files.     

No, what we are observing is how resources react when we poke at them with
GET methods, for example. This is a very concrete measurement which we can
repeat at any point in time, independent of what the resource really is.

Depending on the application used to poke at the resource we may
dereference some of the links in the response we get back but how and why
we do this is a function of the type of application. Robots do it in one
way, a human user does it in another way etc.

>>I think that we here can be a little more specific, see for example:
>We can use your definition. I don't think that we need to provide an
>authoritative or detailed definition of a client (or server), however, since
>this is a term used in a variety of contexts, and is certainly not limited
>to Web characterization.

I agree, this is more a question of completeness but I think it is useful
to include, especially because a client doesn't have to be a GUI browser.

>Most of the alterations I made to the definitions you suggest were to strip
>out the examples from the definitions. Maybe we should have a separate
>"example" section for each definition, especially if they are important for
>clarifying the definition.

I guess it depends on the context but I can live with that. 

>>It is inherent in the Web model that a user agent always issues requests on
>>behalf of some human although it doesn't have to be directly. For example,
>>a robot still behaves on behalf of the human starting it. I wouldn't make
>>it a primitive. Instead I would like to define certain access patterns -
>>there is no reason why a browser can't become a robot while filling a cache
>>or using a robot to behave as a browser to download inlined images etc.
>I disagree. I think that separating the human from the client can lead to
>some useful analysis of each. Also, the user/client distinction is necessary
>for distinguishing between clicks and requests, and explicit and implicit

I think you are missing my point - I am not talking about separating the
user from the client but rather characterizing the client as a function of
what it does rather than what it claims to be.

For example, I have a robot which can emulate a GUI browser by downloading
all the inlined images and following certain links based on some constraint.
Hence it really acts like a human user and not a robot. On the other hand,
many GUI browsers can check a set of links automatically - essentially
acting like a robot.
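
One way to picture "characterizing the client by what it does" is the Python
sketch below. It is only an illustration under stated assumptions: the Request
record, its two fields, the 0.5 threshold and the labels are all hypothetical,
and a real characterization would need a much richer set of access patterns.

```python
from dataclasses import dataclass


@dataclass
class Request:
    method: str      # e.g. "GET" or "HEAD"
    is_inline: bool  # True if requested because a page embeds it (e.g. an image)


def access_pattern(requests, inline_threshold=0.5):
    """Label a request stream by its behaviour, ignoring what the client claims to be."""
    if not requests:
        return "unknown"
    inline_share = sum(r.is_inline for r in requests) / len(requests)
    # Hypothetical rule: streams dominated by inlined objects look like browsing.
    return "browser-like" if inline_share >= inline_threshold else "robot-like"
```

Under this rule a robot that fetches a page plus its inlined images is
labelled browser-like, while a GUI browser checking a list of links is
labelled robot-like.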

>>The Web is really not limited to HTTP (nor HTML/XML for that matter). Those
>>are just popular ways of implementing the Web - the Web is really the
>>complete information space that can be referenced by URIs. That is,
>>anything that is a resource is on the Web. When you think about it, this is
>>really not limited to networked resources - however, this is how we
>>normally think about them. Examples of non-networked URIs are phone
>>addresses, for example:
>>As we have already defined a "resource", we don't have to change that
>Your point is well taken, but in my opinion, practical considerations have
>to prevail here. According to your definition, essentially the entire
>Internet, and even non-networked resources, could fall within the writ of
>Web characterization, which I don't think is what we want.

You are right that we have to limit ourselves - I don't argue with that at
all. What I am arguing is that we shouldn't force the current scope on the
terminology but keep them separate.

> You may want to
>check out some earlier e-mails that were exchanged on the mailing list on
>the issue "What is the Web?". I think the consensus that emerged from that
>dialogue is that what we're mainly interested in is information that is
>accessed from Web servers - i.e., accessed via HTTP. Otherwise, we fall into
>an infinite regression - everything has to be analyzed. We have to start
>with some reasonable boundaries to our analysis - for example, we're not
>interested in non-networked resources, so why try to accommodate them in the
>definition? It is not necessary (or feasible) to capture every potential
>aspect of the Web in our definition. We just need to make clear what we're
>including in our analysis, and what we're not. That's not to say, of course,
>that we can't amend the definition in the future as needs warrant.

I am quite happy with this limitation. The situation that I want to avoid
is that we limit our terminology to dealing with HTML and hence
can't start considering style sheets without changing our terminology.

>>There are plenty of other formats that contain URIs - pdf, powerpoint, etc.
>>It is not limited to HTML. Again, I don't think we have to say anything
>>more than what we already have on resources.
>Again, practical considerations are key. We're not analyzing everything with
>a URI - we're confining ourselves to the Web (see my previous comment). The
>term "Web-accessible resource" is intended to acknowledge the fact that
>resources are available via non-HTTP protocols through links appearing in
>Web resources. In this sense, we can view the Web as the "core" set of
>resources, and Web-accessible Internet resources as a layer of additional
>resources surrounding the core (and directly accessible from the core) and
>stop there.
>>   Web Clients
>>   Web Client
>>   A client that can be used to access Web resources.
>>We have already defined this as a "client" - it doesn't matter what
>>protocol it is really speaking nor whether it has a human clicking on the
>It actually does matter - the first definition of a client is a primitive:
>in other words, a basic concept we use as a building block for our
>Web-specific concepts. For Web characterization, we are only interested in
>Web clients, which is a client as defined as a primitive, with the added
>stipulation that it has the ability to access Web resources. Not all clients
>can do this. Since Web resources are defined above as resources accessible
>through HTTP, it does matter what protocol is spoken.
>I agree that Web clients can be automated or user driven, but it is
>important to make the distinction.

Yes, but "Web" is not the right term to use for this limitation of scope.
We can call it an HTTP browser but not a Web client.

>>Instead of defining "click" (which is also not very general - there are
>>many other ways of initiating a request) then I think we are in fact
>>already covered by the "web page" definition where we leave it to the user
>>preferences and/or application capabilities to decide which links are
>>dereferenced and which are not.
>The click term is not intended to be general: it is intended to describe the
>specific action of a human (user) manually requesting a resource via a link.
>It is an elaboration (i.e., specific example) of a request.

I was actually thinking of human users who also have many ways of
activating the process of resolving an address - speech recognition etc. I
know that the WAI project (Web Accessibility Initiative) is very interested
in WCA and they are explicitly dealing with alternative ways of interacting
with a client.

>>   Click-through Rate
>>   Frequency with which a Web resource, identified by a URL, is clicked.
>>Do you mean "Web page access rate"? That is, the (mean or distribution?)
>>time between changing web pages? Again, I think we should avoid the term
>This question is probably best answered by Jim ...


>>   User Session
>>   A cohesive set of user clicks across one or more Web servers.
>>What about "A set of Web pages accessed by continuous dereferencing of
>>links contained within these web pages. A session is not limited to a
>>single Web site"?
>You would need to be more specific about what you mean by "continuous". I
>guess one of the advantages of the other definition was that "cohesive" was
>appropriately vague.

What I meant by "continuous" was that the documents have explicit links
to each other. That is, a user goes from A to B because there is a link in
A pointing to B.
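
As a rough sketch of this reading of "continuous", the Python function below
splits a sequence of visited pages into sessions, starting a new session
whenever the previous page has no explicit link to the next one. The function
name, the shape of the links mapping and the example URIs are assumptions
made for illustration only.

```python
def split_into_sessions(visits, links):
    """Group visited pages so that each step follows an explicit link.

    visits: page URIs in the order they were accessed.
    links:  mapping from a page URI to the set of URIs it links to.
    """
    sessions = []
    for page in visits:
        # Continue the current session only if the last page links to this one.
        if sessions and page in links.get(sessions[-1][-1], set()):
            sessions[-1].append(page)
        else:
            sessions.append([page])
    return sessions


# Hypothetical example: A links to B on another site, so the session spans both.
links = {
    "http://site-one.example/A": {"http://site-two.example/B"},
    "http://site-two.example/B": {"http://site-two.example/C"},
}
visits = [
    "http://site-one.example/A",
    "http://site-two.example/B",
    "http://site-two.example/C",
    "http://site-three.example/X",
]
sessions = split_into_sessions(visits, links)
```

Here the first three pages form one session across two sites, and the jump to
X, which nothing links to, starts a new one.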

>>   Episode
>>   A subset of related user clicks that occur within a user session.
>>How is that distinguished from a session?
>A session is analogous to the time between when a user sits down at the
>computer and the time when he gets up. On the other hand, within a single
>session, a user can generate multiple clusters of logically related
>requests: for example, in a session, a user might visit a number of sites
>devoted to Java (episode 1). Then, while he is still online, he might check
>the closing prices on his stocks (episode 2). Finally, he might buy some
>books at (episode 3). I think the term is useful from a
>conceptual standpoint, although it will be difficult to measure.
>>   Server Session
>>   A collection of user clicks to a Web server during a user session.
>>   Also called a visit.
>>What about:
>>	The part of a user session limited to a single Web site.
>I wanted to capture the point that a Web site is not equivalent to a Web

Yes, but the notion of "equivalent" is really hard to measure as two things
can be equal at different abstraction levels: for example, a document in
Danish and one in English may convey the same information, but unless we
have a machine-readable way of expressing this, it is almost impossible to
figure out automatically. This is likely to change over time, however, as
data models like RDF can be used to express such relationships.


Henrik Frystyk Nielsen
World Wide Web Consortium