- From: Nick Kew <nick@webthing.com>
- Date: Wed, 18 May 2005 23:53:02 +0100
- To: public-wai-ert@w3.org
Hi,

I promised to post a note today, but I haven't found time 'til now. I suspect the inclusion of HTTP header information in the current draft may be motivated at least in part by my comments from the previous incarnation of the WG, so I'll try and outline why they're important.

Well-Specified Web Contents
---------------------------

The fundamental issue is one of identifying web content. When we cite a URL, we are referring to a document which is presumed to be relevant to whatever we are discussing at the time. Our reference is based on the premise that what I see at the URL is equivalent to what you see there. The actual document may be different: for example, it may have been updated, or the server may serve us different variants (e.g. languages) according to our respective browser preferences. Normally, such differences should not materially affect the relevance of the URL; if they do, then our use of the URL has failed.

But when we are making assertions of a technical nature about the contents (including markup/structure) of webpages, this kind of equivalence is inadequate. An EARL assertion about the English-language version of a page may not apply to a French or German version, still less a Chinese, Russian or Arabic page. We need to specify web content more precisely.

We may take it as axiomatic that it is sufficient to record our complete HTTP request, together with the time it was made (this might exclude some cases, such as randomly-generated pages, from the discussion). On this assumption, the subject of our assertions is then well-specified.

However, it is not necessary to record the entire request. It is only necessary to record those headers which may affect the response through content negotiation. Any header not involved in content negotiation will not affect the response unless the server is violating HTTP, in which case all bets are off (as with the randomised page).

The headers that (may) affect the content returned by a server are identified by the server in a Vary: response header. These are therefore the headers we need to record. For example, if an HTTP response contains:

    Vary: Accept-Language,Accept-Charset

then we need to record the values of the Accept-Language and Accept-Charset headers we sent (or, if applicable, the fact that we didn't send those headers at all). We can ignore other headers, though we should probably also record the Vary: response header itself. (A short sketch of this appears at the end of this note.)

Request and Response Headers
----------------------------

During the telecon, someone mentioned recording headers such as Content-Length. At this point it was evident that they were thinking of response headers, and we were at cross-purposes. It is entirely reasonable for a tool to record response headers, but it is not a requirement for well-specified contents.

Change Detection
----------------

I've discussed change detection before, and don't propose to go into detail this evening. I should just note that it is helpful to be able to detect change. Recording headers such as Last-Modified, Content-MD5 or ETag may help, but none of these is mandatory. Alternatives such as computing and storing a checksum of the document contents will do the job just as well, and more reliably (again, see the sketches at the end of this note).

URL and HTTP
------------

The HTTP URL is a concatenation of different components:

    Protocol (HTTP)
    Host, Port
    Path, Query String

This can be represented in more than one way. If we follow HTTP strictly, then the Host and Port are connection information (outside HTTP), whereas the Path and Query components form the HTTP request line (within the HTTP protocol). We may EITHER record

    { URL }

OR equivalently

    { Connection, Request Line }

I'll discuss the structure on the basis of the latter, but I expect people may prefer the former.

Representing HTTP in EARL
-------------------------

The primary purpose of representing an HTTP transaction in EARL is to ensure that the subject is well-specified. The second purpose is to record data that may be useful. I would suggest that perhaps a tool making assertions about a page retrieved by HTTP:

 * MUST record the transaction timestamp (this will normally also appear in the HTTP headers, but is not mandatory, so we can't rely on it)
 * MUST record the URL and/or connection + request line
 * MUST record the HTTP request headers involved in content negotiation
 * MAY record any other HTTP headers

The anatomy of a record is something like:

    record           = { timestamp, connection, HTTP transaction, checksum* }
    HTTP transaction = { Request, Response }
    Request          = { Request Line, Request Header*, Request Entity? }
    Response         = { Status Line, Response Header+, Response Entity? }

The Request Line is the latter part of the URL (see above). Request headers MAY be discarded, except for those relevant to content negotiation (see above). The Response Status Line contains an HTTP response code, which we SHOULD record in any system that makes assertions about pages having status codes other than 200 (success). Response headers may be of interest and MAY be saved.

Request and response entities are outside the scope of this note. The response entity is of course more commonly known as the webpage about which we are making assertions.
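Before I close, three short sketches to make the above concrete. All are in Python and purely illustrative: the function and field names are mine, not anything in the draft schema. First, the content-negotiation point: keeping only the request headers named in the Vary: response header.

    def headers_to_record(request_headers, vary_header):
        """Keep only the request headers named in Vary:, noting absences."""
        # HTTP header names are case-insensitive, so compare accordingly.
        # (A value of "Vary: *" would mean recording the entire request.)
        sent = {name.lower(): value for name, value in request_headers.items()}
        kept = {}
        for name in (h.strip() for h in vary_header.split(",")):
            # Record the value we sent, or the fact that we sent no such header.
            kept[name] = sent.get(name.lower(), "(header not sent)")
        return kept

    sent_headers = {"Accept-Language": "en-gb", "User-Agent": "checker/0.1"}
    print(headers_to_record(sent_headers, "Accept-Language,Accept-Charset"))
    # {'Accept-Language': 'en-gb', 'Accept-Charset': '(header not sent)'}

We would store the result alongside the Vary: header itself; User-Agent, not being named in Vary:, is safely discarded.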
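Second, the checksum alternative from the Change Detection section. The digest algorithm is arbitrary; anything stable will do.

    import hashlib

    def entity_checksum(body):
        # Hash the response entity itself, rather than trusting
        # Last-Modified, ETag or Content-MD5, none of which is mandatory.
        return hashlib.sha1(body).hexdigest()

    stored = entity_checksum(b"<html>...</html>")
    # Later, re-fetch the page and compare digests to detect change.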
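Third, the record anatomy, with the standard library's urlsplit demonstrating the { URL } / { Connection, Request Line } equivalence: host and port are connection information, while path and query string form the request line. Again, this structure is a sketch of mine, not a proposal for the schema itself.

    from dataclasses import dataclass, field
    from datetime import datetime, timezone
    from urllib.parse import urlsplit

    @dataclass
    class HttpRecord:
        timestamp: datetime          # MUST: time of the transaction
        host: str                    # MUST: connection information,
        port: int                    #       outside HTTP proper
        request_line: str            # MUST: method + path + query string
        negotiation_headers: dict    # MUST: request headers named in Vary:
        status: int                  # SHOULD: the HTTP response code
        response_headers: dict = field(default_factory=dict)  # MAY
        checksum: str = ""           # optional, for change detection

    parts = urlsplit("http://www.example.org/page?lang=en")
    record = HttpRecord(
        timestamp=datetime.now(timezone.utc),
        host=parts.hostname,
        port=parts.port or 80,
        request_line="GET %s?%s HTTP/1.1" % (parts.path, parts.query),
        negotiation_headers={"Accept-Language": "en"},
        status=200,
    )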
Since it's nearly midnight and I'm falling asleep, I'm going to postpone looking at Annotea or thinking about representing anything in RDF schema. I hope this clarifies a bit why I think we need to expand a little on the current draft schema to deal meaningfully with HTTP.

--
Nick Kew
Received on Wednesday, 18 May 2005 22:51:58 UTC