- From: Bob Wyman <bob@wyman.us>
- Date: Sun, 15 Feb 2004 14:44:56 -0500
- To: Charles McCathieNevile <charles@w3.org>
- Cc: Bob Wyman <bob@wyman.us>, "'Mansur Darlington'" <ensmjd@bath.ac.uk>, info@oilit.com, www-rdf-interest@w3.org, semanticweb@yahoogroups.com
Charles McCathieNevile wrote:
> I hope that crawling RDF sites via seeAlso or
> something is reasonably feasible
seeAlso provides only unidirectional links and as such
requires a great deal of configuration management to be
useful. The reliance on URLs (URIs for resources) by seeAlso
means that you can only find things which are known to the
generator of the RDF which contains the seeAlso link. The
result is a closed network of data.
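To illustrate the point, here is a minimal sketch of a crawler that follows only forward seeAlso links. The function name `crawl_see_also` and the dictionary of links are hypothetical, made up for this example; no real crawler is being described.

```python
from collections import deque


def crawl_see_also(start, see_also):
    """Return every document reachable from `start` via forward links only.

    `see_also` maps a document URI to the URIs named in its seeAlso
    statements (hypothetical data, for illustration).
    """
    seen = {start}
    queue = deque([start])
    while queue:
        doc = queue.popleft()
        for target in see_also.get(doc, ()):
            if target not in seen:
                seen.add(target)
                queue.append(target)
    return seen


# "c" points at "a", but nothing points at "c", so a crawl that
# starts at "a" never discovers it.
links = {"a": ["b"], "b": [], "c": ["a"]}
reached = crawl_see_also("a", links)
```

Starting from "a", the crawl reaches "a" and "b" but never "c", even though "c" is clearly related to "a".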
On the other hand, the method that I propose, relying on
PubSub.com's content-based matching, does not have this
constraint. Using the PubSub.com method, related chunks of
RDF are collected together based on their content rather than
the links established between them. The result is a more open
and dynamic knowledge network.
The content-based method I propose has the effect of
making URI links bi-directional. By watching for and
reporting referenced URIs, we provide the back links that
are not natively supported in the system. Whenever someone
references a URI among the more than 1 million blogs that we
monitor, you will be provided with a link to the referencing
site. Thus, a bi-directional network is constructed on the
fly. The result is a much more open and dynamic knowledge
network which does not require that any two RDF sites have
explicit knowledge of each other's URIs; i.e., "seeAlso" is not
required in such a network. It is only useful in building
constrained, closed networks.
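The back-link idea can be sketched roughly as follows. This is an illustration only, not PubSub.com's actual implementation: the class `BacklinkIndex`, its method names, and the simple URI regular expression are all assumptions made for the example.

```python
import re
from collections import defaultdict

# Deliberately simple pattern for http(s) URIs; an assumption for this
# sketch, not a complete URI grammar.
URI_PATTERN = re.compile(r"https?://[^\s\"'<>]+")


class BacklinkIndex:
    """Records which monitored sources reference which URIs."""

    def __init__(self):
        self._referrers = defaultdict(set)

    def observe(self, source_uri, content):
        # Every URI found in the content yields a back link to the source.
        for target in URI_PATTERN.findall(content):
            self._referrers[target].add(source_uri)

    def backlinks(self, target_uri):
        # Sources that referenced target_uri, in sorted order.
        return sorted(self._referrers.get(target_uri, ()))


index = BacklinkIndex()
index.observe("http://blog-a.example/post1",
              'see <a href="http://example.org/thing">x</a>')
index.observe("http://blog-b.example/post9",
              'rdf:resource="http://example.org/thing"')
```

Neither blog needs to know about the other, and the owner of http://example.org/thing needs to know about neither; `index.backlinks("http://example.org/thing")` returns both referencing sources.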
> Federating these aggregators is of course an
> interesting problem too.
The problem of federation is an interesting one as well
as one that has been studied extensively. Some of the best
work on publish/subscribe federation can be seen in the work
of Antonio Carzaniga on the Siena project (see:
http://www.cs.colorado.edu/~carzanig/siena/index.html).
However, while federation is a fascinating problem, the
requirement for federation really only arises when the volume
of data or subscriptions rises to the point where a
centralized system can't keep up with it. Given that
there isn't much RDF being produced today, we don't really
have a need for federation at this time. It would make a
great deal of sense to first experiment to discover the
limits of non-federated systems and understand the dynamics
of loosely coupled RDF networks before putting too much
effort into building federated systems. Only in this manner
can we really determine what the requirements are for a
federated system. Currently, at PubSub.com, we are able to
monitor over 1 million blogs yet the CPU of our matching
engine is essentially idling most of the time (we never go
over 4% or 5% of CPU). We've got plenty of capacity to
allow a great deal of experimentation and learning to occur.
Let's understand the problem and application domain
before making it more complex.
> One of the interesting questions there is how to work
> across multiple annotea servers existing.
Unfortunately, Annotea use currently requires that for me
to create an annotation, I must have an account with a known
Annotea server and submit my annotations to it. The result,
of course, is that while there may be many Annotea servers
that are interested in annotations that I create, I'm only
going to submit to one or a subset of them -- probably the
most "popular" servers since with RDF, the more data you
have, the better. Thus, power laws will come to dominate and
there will be a need to develop potentially complex protocols
to share annotations between Annotea servers in order to
spread the knowledge captured in the annotations.
On the other hand, if a content-based publish/subscribe
service like PubSub.com is used to distribute and discover
annotations, anyone can create an annotation and publish it
simply by putting the annotation in some file that PubSub.com
monitors. Then, any number of Annotea servers that are
interested in aggregating these annotations can simply
subscribe to them independently. The result is a
significant reduction in the need for cross-server
coordination and a reduction in the influence of power laws
and network effects. (Of course, if servers wish to constrain
whose annotations they handle, they can require that
authentication tokens be included in the annotations.)
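A rough sketch of that arrangement follows. All names here (`ContentBroker`, `AnnotationServer`, and the `annotates` and `token` fields) are hypothetical and chosen for illustration; they are neither the Annotea protocol nor the PubSub.com interface.

```python
class ContentBroker:
    """Delivers each published annotation to every matching subscriber."""

    def __init__(self):
        self._subscriptions = []  # list of (watched_uri, callback)

    def subscribe(self, watched_uri, callback):
        self._subscriptions.append((watched_uri, callback))

    def publish(self, annotation):
        # Match on content (the annotated URI), not on who published it.
        delivered = 0
        for watched_uri, callback in self._subscriptions:
            if annotation.get("annotates") == watched_uri:
                callback(annotation)
                delivered += 1
        return delivered


class AnnotationServer:
    """Aggregates annotations, optionally only those with a known token."""

    def __init__(self, name, accepted_tokens=None):
        self.name = name
        self.accepted_tokens = accepted_tokens  # None means accept anyone
        self.store = []

    def receive(self, annotation):
        if (self.accepted_tokens is not None
                and annotation.get("token") not in self.accepted_tokens):
            return  # this server restricts whose annotations it handles
        self.store.append(annotation)


broker = ContentBroker()
open_server = AnnotationServer("open")
strict_server = AnnotationServer("strict", accepted_tokens={"s3cret"})
page = "http://example.org/page"
broker.subscribe(page, open_server.receive)
broker.subscribe(page, strict_server.receive)

broker.publish({"annotates": page, "body": "anonymous note"})
broker.publish({"annotates": page, "body": "signed note", "token": "s3cret"})
```

The publisher never addresses a particular server. Each server subscribes on its own and applies its own acceptance policy, so no server-to-server sharing protocol is needed in this sketch.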
> It seems to me an important part of this is a standard
> way of querying for a bit of RDF
We provide, at PubSub.com, a method of querying for data
based on its content. Of particular interest for RDF is the
ability to query based on the URIs that are referenced in the
content. This simple mechanism is sufficient to handle a very
wide range of requirements. I invite you to experiment with
it so that we can determine its limits and determine, based
on practical experience, how to provide richer facilities.
bob wyman
Received on Sunday, 15 February 2004 14:45:01 UTC