Re: EARL, RDF, Interesting Examples and PubSub.com from Bob Wyman on 2004-02-15 (www-rdf-interest@w3.org from February 2004)

From: Bob Wyman <bob@wyman.us>
Date: Sun, 15 Feb 2004 14:44:56 -0500
To: Charles McCathieNevile <charles@w3.org>
Cc: Bob Wyman <bob@wyman.us>, "'Mansur Darlington'" <ensmjd@bath.ac.uk>, info@oilit.com, www-rdf-interest@w3.org, semanticweb@yahoogroups.com
Message-Id: <200402151944.AQS01339@ms8.verisignmail.com>

Charles McCathieNevile wrote:
> I hope that crawling RDF sites via seeAlso or
> something is reasonably feasible 
    seeAlso provides only unidirectional links and as such 
requires a great deal of configuration management to be 
useful. The reliance on URLs (URIs for resources) by seeAlso 
means that you can only find things which are known to the 
generator of the RDF which contains the seeAlso link. The 
result is a closed network of data.
    On the other hand, the method that I propose, relying on 
PubSub.com's content-based matching, does not have this 
constraint. Using the PubSub.com method, related chunks of 
RDF are collected together based on their content rather than 
the links established between them. The result is a more open 
and dynamic knowledge network.
    The content-based method I propose has the effect of 
making URI links bi-directional. By watching for and 
reporting referenced URIs, we provide the back links that 
are not natively supported in the system. Whenever someone 
references a URI among the more than 1 million blogs that we 
monitor, you will be provided with a link to the referencing 
site. Thus, a bi-directional network is constructed on the 
fly. The result is a much more open and dynamic knowledge 
network which does not require that any two RDF sites have 
explicit knowledge of each other's URIs. That is, "seeAlso" 
is not required in such a network; it is only useful in 
building constrained, closed networks.
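
    As a rough illustration of the back-link idea (a minimal 
sketch only, not PubSub.com's actual implementation), the 
Python below scans a set of monitored documents and records, 
for each URI mentioned, which documents mention it. The 
function name, the simplified URI pattern, and the 
example.org URLs are all assumptions made for illustration.

```python
import re
from collections import defaultdict

# Hypothetical, simplified URI pattern. A real system would parse the
# RDF/XML or feed content properly rather than rely on a regex.
URI_PATTERN = re.compile(r"""https?://[^\s"'<>]+""")


def build_backlinks(documents):
    """Map each referenced URI to the set of documents that mention it.

    `documents` maps a document's own URL to its text content.
    """
    backlinks = defaultdict(set)
    for source_url, text in documents.items():
        for uri in URI_PATTERN.findall(text):
            backlinks[uri].add(source_url)
    return backlinks


# Two made-up monitored documents; neither knows about the other.
documents = {
    "http://example.org/blog-a":
        '<rdf:Description rdf:about="http://example.org/resource/1"/>',
    "http://example.org/blog-b":
        "See http://example.org/resource/1 and http://example.org/resource/2",
}

backlinks = build_backlinks(documents)
for uri, sources in sorted(backlinks.items()):
    print(uri, "<-", sorted(sources))
```

    Neither document links to the other, yet both turn up as 
referrers of the same resource, which is the "on the fly" 
bi-directionality described above.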

> Federating these aggregators is of course an 
> interesting problem too.
    The problem of federation is an interesting one as well 
as one that has been studied extensively. Some of the best 
work on publish/subscribe federation can be seen in the work 
of Antonio Carzaniga on the Siena project (see: 
http://www.cs.colorado.edu/~carzanig/siena/index.html).
    However, while federation is a fascinating problem, the 
requirement for federation really only arises when the volume 
of data or subscriptions rises to the point where a 
centralized system can't keep up with it. Given that 
there isn't much RDF being produced today, we don't really 
have a need for federation at this time. It would make a 
great deal of sense to first experiment to discover the 
limits of non-federated systems and understand the dynamics 
of loosely coupled RDF networks before putting too much 
effort into building federated systems. Only in this manner 
can we really determine what the requirements are for a 
federated system. Currently, at PubSub.com, we are able to 
monitor over 1 million blogs yet the CPU of our matching 
engine is essentially idling most of the time (we never go 
over 4% or 5% of CPU). We've got plenty of capacity to 
allow a great deal of experimentation and learning to occur.
    Let's understand the problem and application domain 
before making it more complex.

> One of the interesting questions there is how to work
> across multiple annotea servers existing.
    Unfortunately, to create an annotation with Annotea 
today, I must have an account with a known Annotea server 
and submit my annotations to it. The result, 
of course, is that while there may be many Annotea servers 
that are interested in annotations that I create, I'm only 
going to submit to one or a subset of them -- probably the 
most "popular" servers since with RDF, the more data you 
have, the better. Thus, power laws will come to dominate and 
there will be a need to develop potentially complex protocols 
to share annotations between Annotea servers in order to 
spread the knowledge captured in the annotations. 
    On the other hand, if a content-based publish/subscribe 
service like PubSub.com is used to distribute and discover 
annotations, anyone can create an annotation and publish it 
simply by putting the annotation in some file that PubSub.com 
monitors. Then, any number of Annotea servers that are 
interested in aggregating these annotations can simply 
subscribe to them independently. The result is a 
significant reduction in the need for cross-server 
coordination and a reduction in the influence of power laws 
and network effects. (Of course, if servers wish to constrain 
whose annotations they handle, they can require that 
authentication tokens be included in the annotations.)
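
    The flow described above might be sketched as follows. 
This is a toy, in-memory model under stated assumptions, not 
the PubSub.com or Annotea API: the Subscriber class, the 
publish function, and the example URLs are hypothetical, and 
the authentication token is passed alongside the annotation 
rather than embedded in it, purely to keep the example short.

```python
import re
from dataclasses import dataclass, field
from typing import List, Optional, Set

# Hypothetical, simplified URI pattern (an assumption for illustration).
URI_PATTERN = re.compile(r"""https?://[^\s"'<>]+""")


@dataclass
class Subscriber:
    """Stands in for an annotation server with a content-based subscription."""
    name: str
    watched_uris: Set[str]
    required_token: Optional[str] = None  # None means "accept from anyone"
    inbox: List[str] = field(default_factory=list)


def publish(annotation: str, subscribers: List[Subscriber],
            token: Optional[str] = None) -> None:
    """Deliver an annotation to every subscriber whose watched URIs it mentions."""
    referenced = set(URI_PATTERN.findall(annotation))
    for sub in subscribers:
        if not (referenced & sub.watched_uris):
            continue  # content does not match this subscription
        if sub.required_token is not None and token != sub.required_token:
            continue  # this server restricts whose annotations it handles
        sub.inbox.append(annotation)


open_server = Subscriber("open", {"http://example.org/page"})
strict_server = Subscriber("strict", {"http://example.org/page"},
                           required_token="secret")

# The publisher never names a server; matching is on content alone.
publish("Comment on http://example.org/page : nice work",
        [open_server, strict_server])
publish("Signed comment on http://example.org/page",
        [open_server, strict_server], token="secret")
```

    The publisher needs no account with either server, and 
each server decides independently what to accept.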

> It seems to me an important part of this is a standard 
> way of querying for a bit of RDF
    We provide, at PubSub.com, a method of querying for data 
based on its content. Of particular interest for RDF is the 
ability to query based on the URIs that are referenced in the 
content. This simple mechanism is sufficient to handle a very 
wide range of requirements. I invite you to experiment with 
it so that we can determine its limits and determine, based 
on practical experience, how to provide richer facilities.

    bob wyman
Received on Sunday, 15 February 2004 14:45:01 UTC