RSS traces from Mark Nottingham on 2003-05-08 (ietf-http-wg@w3.org from April to June 2003)

From: Mark Nottingham <mnot@mnot.net>
Date: Wed, 7 May 2003 21:36:20 -0700
To: ietf-http-wg@w3.org
Cc: wrec@cs.utk.edu
Message-Id: <9DB30446-810E-11D7-82FF-000A27836A68@mnot.net>

I'm slowly building a collection of traces for access to a relatively 
new kind of Web content - RSS feeds.

If you're not familiar with it, RSS is a format that represents a list 
of items, each with its own title, description, link, and other 
metadata. Clients (often called "aggregators") periodically poll to 
build a view of the channel over time, adding new items in the 
representation to a local store. In this manner, one can keep abreast 
of news headlines and other chronologically-ordered lists.

This format is becoming more popular, both because of increasing 
support (sites like MSDN, the New York Times and CNN all have RSS 
feeds) and the arrival of "weblogs" (which also use RSS).

There are a number of interesting questions about RSS that come to 
mind, including:
   - what is the polling interval?
   - how common is validation?
   - what is the rate of change for the RSS?
   - what is the size of the feed?
   - what times of day does polling happen?
   - how self-similar is RSS traffic (is it "lumpy" around the top of 
the hour, for example?)
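
As a starting point for some of these questions, here is a minimal 
sketch in Python. It assumes the trace is in combined log format with 
hashed tokens in place of the client and URI fields. The names LOG_RE, 
parse and summarize are mine, not part of the trace or any tool. The 
share of 304 responses is only a rough proxy for validation, since a 
conditional request answered with a 200 looks like any other request in 
the log.

```python
import re
from collections import defaultdict
from datetime import datetime
from statistics import median

# Assumed layout: client ident user [time] "METHOD URI PROTO" status size ...
LOG_RE = re.compile(
    r'(?P<client>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<uri>\S+)[^"]*" (?P<status>\d{3}) (?P<size>\S+)'
)

def parse(lines):
    """Yield (client, uri, timestamp, status) for each line that matches."""
    for line in lines:
        m = LOG_RE.match(line)
        if m:
            ts = datetime.strptime(m.group("time"), "%d/%b/%Y:%H:%M:%S %z")
            yield m.group("client"), m.group("uri"), ts, int(m.group("status"))

def summarize(lines):
    """Estimate polling interval, 304 share, and requests per minute of the hour."""
    by_key = defaultdict(list)      # (client, uri) -> request times
    per_minute = [0] * 60           # histogram over minute-of-hour
    total = validated = 0
    for client, uri, ts, status in parse(lines):
        by_key[(client, uri)].append(ts)
        per_minute[ts.minute] += 1
        total += 1
        validated += status == 304
    gaps = []
    for times in by_key.values():
        times.sort()
        # Gap between consecutive requests by one client for one feed.
        gaps.extend((b - a).total_seconds() for a, b in zip(times, times[1:]))
    return {
        "median_poll_interval_s": median(gaps) if gaps else None,
        "validation_ratio": validated / total if total else 0.0,
        "requests_by_minute": per_minute,
    }
```

A spike in requests_by_minute at index 0 would be one sign of the 
top-of-the-hour lumpiness mentioned above.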

I suspect that RSS, because it is polled, is not at all typical Web 
traffic, and therefore places unusual requirements on Web servers and 
intermediaries. I also suspect that it may eventually require us to 
rethink distribution; invalidation and other approaches may become much 
more desirable, as opposed to polling.

Rather than keep all of the fun for myself (I have a day job), I've 
placed the traces on the Web for the greater enjoyment of the caching 
and traffic characterization community ;)  They are at:
   http://www.mnot.net/rss/traces/
(this will redirect to another site; please bookmark the URI above, in 
case it changes).

So far, I have one trace; it has been anonymized (combined log format 
with the client IP, ident, userinfo, URI, referer and user-agent fields 
hashed or half-hashed). If there is another format that's more 
suitable, please tell me.
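
For anyone who wants to prepare their own logs in a comparable way, 
here is a rough sketch of this kind of anonymization. It is not the 
script used for this trace: the keyed-hash scheme, the 16-character 
truncation, and the names LINE_RE, token and anonymize are all 
assumptions for illustration, and it does not attempt the "half-hashed" 
variant. It also assumes quoted fields contain no escaped quotes.

```python
import hashlib
import hmac
import re

# Combined log format: client ident user [time] "request" status size "referer" "agent"
LINE_RE = re.compile(
    r'(\S+) (\S+) (\S+) (\[[^\]]+\]) "(\S+) (\S+) ([^"]*)" '
    r'(\d{3}) (\S+) "([^"]*)" "([^"]*)"'
)

def token(value, key):
    """Replace a field with a keyed hash; leave the empty marker '-' alone."""
    if value == "-":
        return value
    digest = hmac.new(key, value.encode("utf-8"), hashlib.sha256).hexdigest()
    return digest[:16]  # truncation length is an arbitrary choice

def anonymize(line, key):
    """Return the line with identifying fields hashed, or None if it does not parse."""
    m = LINE_RE.match(line)
    if m is None:
        return None
    client, ident, user, when, method, uri, proto, status, size, ref, agent = m.groups()
    h = lambda v: token(v, key)
    # Timestamp, method, protocol, status and size pass through unchanged.
    return (f'{h(client)} {h(ident)} {h(user)} {when} '
            f'"{method} {h(uri)} {proto}" {status} {size} "{h(ref)}" "{h(agent)}"')
```

Because the same input always maps to the same token under one key, 
per-client and per-feed analysis still works on the output, while the 
original values are not recoverable without the key.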

This trace contains about 500,000 entries and represents a week's worth 
of access to an RSS "scraping" service; i.e., it's a Web site that 
processes other Web sites to produce a number of feeds. As such, it 
contains accesses to multiple feeds.

Please tell me if this is interesting/useful, and send along any 
results you come up with. I'm working on getting more traces; stay 
tuned (are any of the repositories - e.g., W3C WCA, Internet Traffic 
Archive - still active? Neither has seen anything new in quite some 
time...).

Regards,

Received on Thursday, 8 May 2003 00:36:31 UTC