HTTP and EARL

Hi,

I promised to post a note today, but I haven't found time 'til now.


I suspect the inclusion of HTTP header information in the current
draft may be motivated at least in part by my comments from the
previous incarnation of the WG, so I'll try and outline why
they're important.

Well-Specified Web Contents
---------------------------

The fundamental issue is one of identifying web content.
When we cite a URL, we are referring to a document which is
presumed to be relevant to whatever we are discussing at the time.
Our reference is based on the premise that what I see at the
URL is equivalent to what you see there.  The actual document
may be different, for example it may have been updated, or
the server may serve us different variants (e.g. languages)
according to our respective browser preferences.  Normally,
such differences should not materially affect the relevance
of the URL; if it does, then our use of the URL has failed.

But when we are making assertions of a technical nature about
the contents (including markup/structure) of webpages, this
kind of equivalence is inadequate.  An EARL assertion about
the English-language version of a page may not apply to a
French or German version, still less a Chinese, Russian
or Arabic page.  We need to specify web content more precisely.

We may take it as axiomatic that it is sufficient to record our
complete HTTP request, together with the time it was made (this
might exclude some cases such as randomly-generated pages from
the discussion).  On this assumption, the subject of our
assertions is then well-specified.

However, it is not necessary to record the entire request.
It is only necessary to record those headers which may
affect the response through content negotiation.  Any header
not involved in content negotiation will not affect the
response unless the server is violating HTTP, in which case
all bets are off (as in the randomised page).

The headers that (may) affect the content returned by a server
are identified by the server in a Vary: response header.
These are therefore the headers we need to record.
For example, if an HTTP response contains:

Vary: Accept-Language,Accept-Charset

then we need to record the values of the Accept-Language and
Accept-Charset headers we sent (or, if applicable, the fact
that we didn't send those headers at all).  We can ignore
other headers, though we should probably also record the
Vary: response header itself.
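
By way of illustration, here's a rough sketch (Python, purely
illustrative; the function name and record layout are my own, not
a proposal) of how a tool might pick out the headers it needs to keep:

    import urllib.request

    def negotiated_headers(url, request_headers):
        """Return the request headers named in the Vary: response
        header, plus the Vary: header itself."""
        req = urllib.request.Request(url, headers=request_headers)
        with urllib.request.urlopen(req) as resp:
            vary = resp.headers.get("Vary", "")
        record = {"Vary": vary}
        # A real tool would match header names case-insensitively, and
        # a Vary of "*" (response varies on more than request headers)
        # would need special treatment.
        for name in (h.strip() for h in vary.split(",") if h.strip()):
            # Keep the value we sent, or note that we sent nothing at all.
            record[name] = request_headers.get(name, "(header not sent)")
        return record

    # negotiated_headers("http://example.org/",
    #                    {"Accept-Language": "en-gb, en;q=0.8"})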

Request and Response Headers
----------------------------

During the telecon, someone mentioned recording headers such as
Content-Length.  At this point it was evident that they were
thinking of Response Headers, and we were at cross-purposes.
It is entirely reasonable for a tool to record response headers,
but it is not a requirement for well-specified contents.

Change Detection
----------------

I've discussed change detection before, and don't propose to go
into detail this evening.  I should just note that it is helpful
to be able to detect change.  Recording headers such as
Last-Modified, Content-MD5 or ETag may help, but none of these
is mandatory.  Alternatives such as computing and storing a checksum
of document contents will do the job just as well and more reliably.
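
For instance (a minimal sketch in Python, nothing more), the tool
could simply hash the bytes of the entity body it received:

    import hashlib
    import urllib.request

    def content_checksum(url):
        """Return a hex SHA-1 digest of the entity body retrieved from url."""
        with urllib.request.urlopen(url) as resp:
            body = resp.read()
        return hashlib.sha1(body).hexdigest()

Comparing the stored digest with one computed on a later retrieval
tells us whether the content has changed, whatever the server did
or didn't say in its headers.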

URL and HTTP
------------

The HTTP URL is a concatenation of different components:
        protocol (HTTP)
        Host, Port
        Path, Query String
This can be represented in more than one way.  If we follow
HTTP strictly, then the Host and Port are Connection information
(outside HTTP), whereas the Path and Query components form the
HTTP request line (within the HTTP protocol).  We may EITHER
record { URL } OR, equivalently, { Connection, Request Line }.
I'll discuss the structure on the basis of the latter, but
I expect people may prefer the former.
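
The equivalence is easy to see with a small sketch (Python's
standard urlsplit; the dictionary layout is just for illustration):

    from urllib.parse import urlsplit

    def split_url(url):
        """Decompose an HTTP URL into { Connection, Request Line } form."""
        parts = urlsplit(url)
        connection = {"host": parts.hostname, "port": parts.port or 80}
        path = parts.path or "/"
        if parts.query:
            path += "?" + parts.query
        return connection, "GET %s HTTP/1.1" % path

    # split_url("http://example.org:8080/docs/page?lang=en")
    #   -> ({'host': 'example.org', 'port': 8080},
    #       'GET /docs/page?lang=en HTTP/1.1')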

Representing HTTP in EARL
-------------------------

The primary purpose of representing an HTTP transaction in EARL is
to ensure that the subject is well-specified.  The second purpose
is to record data that may be useful.  I would suggest that perhaps
a tool making assertions about a page retrieved by HTTP:
  * MUST record transaction timestamp (this will normally also
    appear in the HTTP headers, but is not mandatory, so we
    can't rely on it)
  * MUST record the URL and/or connection+request line
  * MUST record HTTP request headers involved in content negotiation
  * MAY record any other HTTP headers

The anatomy of a record is something like:

record = { timestamp, connection, HTTP transaction, checksum* }
HTTP transaction = { Request, Response }
Request = { Request line, Request Header*, Request Entity? }
Response = { Status line, Response Header+, Response Entity? }

The Request Line is the latter part of the URL (see above).
Request Headers MAY be discarded, except for those relevant
to content negotiation (see above).

The Response Status Line contains an HTTP response code which
we SHOULD record in any system that makes assertions about
pages having status codes other than 200 (success).
Response Headers may be of interest and MAY be saved.

Request and Response entities are outside the scope of this note.
The response entity is of course more commonly known as the webpage
about which we are making assertions.
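
Putting the pieces together, a record-building routine might look
roughly like this (again a Python sketch with invented field names;
an RDF rendering is a separate question I'm postponing):

    import datetime
    import hashlib
    import urllib.request

    def make_record(url, request_headers):
        """Record one HTTP transaction well enough to support assertions."""
        req = urllib.request.Request(url, headers=request_headers)
        with urllib.request.urlopen(req) as resp:
            body = resp.read()
            vary = resp.headers.get("Vary", "")
            status = resp.status
            response_headers = dict(resp.headers.items())
        # Keep only the request headers involved in content negotiation.
        negotiated = {h.strip(): request_headers.get(h.strip(), "(not sent)")
                      for h in vary.split(",") if h.strip()}
        return {
            "timestamp": datetime.datetime.utcnow().isoformat() + "Z",  # MUST
            "url": url,                                                 # MUST
            "request-headers": negotiated,                              # MUST
            "vary": vary,
            "status": status,                                           # SHOULD
            "response-headers": response_headers,                       # MAY
            "checksum": hashlib.sha1(body).hexdigest(),                 # change detection
        }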

Since it's nearly midnight and I'm falling asleep, I'm going to
postpone looking at Annotea or thinking about representing anything
in RDF schema.  I hope this clarifies a bit why I think we need
to expand a little on the current draft schema to deal meaningfully
with HTTP.

-- 
Nick Kew
