Re: P2P and RDF (was Re: API for querying a set of RDF graphs?)

Hi Reto,

Reto Bachmann-Gmuer wrote:
>> This would be quite nice indeed-- if it is achievable.
[complex query example snipped]
>>
>> If we match against each graph separately, of course such a query 
>> wouldn't be possible...
>>
>> On a p2p system, one idea that occurred to me was to use a distributed 
>> hashtable mapping (node occurring in graph) -> (location of graph). 
>> This would allow me to find graphs relevant to a query by asking the 
>> distributed hashtable for graphs containing the nodes from the query. 
>> Again, the problem seems more manageable if we only look at one graph 
>> at a time.
> 
> I think this mechanism could be integrated in the Graph/Model's 
> implementation to generally improve the speed of reifications and 
> frequent related queries.

So you're suggesting a Graph implementation that loads statements 
from the p2p network on demand, reifying them. On reflection, 
I think that this would be a good thing indeed, especially since you 
make a very strong point here:

> As far as the application layer is concerned, I think the power of RDF 
> in a P2P environment gets lost when the application separates different 
> "knowledge-worlds" too much.

Obviously, we would really like to be able to run queries over the union 
of all the information in the p2p network.
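
To make the on-demand idea above concrete, here is a rough sketch of 
what such a graph might look like (Python; the 'dht' lookup and 
'fetch_graph' interfaces are purely hypothetical -- my guess at what a 
node-to-graph index could offer, not anything that exists yet):

    # Sketch only: 'dht' maps an RDF node to the locations of graphs
    # mentioning it; 'fetch_graph' downloads a graph as a set of
    # (subject, predicate, object) triples.
    class OnDemandUnionGraph:
        def __init__(self, dht, fetch_graph):
            self.dht = dht
            self.fetch_graph = fetch_graph
            self.loaded = {}   # location -> triples already fetched

        def triples_about(self, node):
            """Pull in, on demand, every graph mentioning 'node' and
            return all triples involving it."""
            for location in self.dht.lookup(node):
                if location not in self.loaded:
                    self.loaded[location] = self.fetch_graph(location)
            return {t for triples in self.loaded.values()
                      for t in triples if node in t}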

However, I still don't know how to implement it in a scalable way. My 
simple hack seems to wreak havoc here. If my application wants to know 
all rdf:Properties ex:inventedBy ex:exampleInc which have an rdf:domain 
of foaf:Person and an rdf:range of ex:Animal, how do we avoid having to 
download all graphs containing statements about either rdf:Property or 
foaf:Person-- surely a much bigger set than the graphs we're actually 
interested in?

(If we look at each graph separately, we only need to download graphs 
containing *all* terms in the query (if it's purely conjunctive). Once 
we've found out that there are only 76 graphs mentioning ex:exampleInc, 
we can download those instead of downloading all 143'296 mentioning 
foaf:Person... Still not perfect, but maybe an improvement :) )
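
As a sketch of that candidate selection (same hypothetical dht.lookup 
as above), a purely conjunctive query would only need the intersection 
of the per-term lookups:

    # Only graphs mentioning *every* term of a purely conjunctive query
    # can contribute a match, so intersect the per-term lookups and
    # download nothing else.
    def candidate_graphs(dht, query_terms):
        postings = [set(dht.lookup(term)) for term in query_terms]
        if not postings:
            return set()
        postings.sort(key=len)      # start from the rarest term,
        candidates = postings[0]    # e.g. the 76 ex:exampleInc graphs
        for p in postings[1:]:
            candidates &= p
        return candidates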

Do you have ideas on how to create a scalable lookup-- i.e. one that only 
needs to look at graphs actually relevant to the query at hand?

> I think a P2P application should 
> essentially be able to adapt its own knowledge based on the peers it is 
> exposed to; practically, this means that the application doesn't (just) 
> store an index of the peers but focuses on conflicts and intersections 
> between the models. Intersections reinforce the belief in the 
> statements, contradictions weaken it.

This sounds interesting; on the other hand, I don't think the number of 
peers storing statements that agree with each other is necessarily a 
good measure: if organization X caches their home page blurb on all 
their five thousand machines, they'll have an edge towards everybody 
believing it :-)

(The number of trusted signers agreeing on a statement may be a better 
measure?)

I'm not too deeply interested in the details of this at the moment, 
btw-- it's an interesting challenge, but not necessary for getting the 
things I'm working on to run :-)

> By analyzing similarities between 
> the sets of conflicts and correspondences associated with each peer, an 
> application could determine which peers are most likely to give relevant 
> answers to a certain query (using collaborative filtering techniques 
> like e.g. http://www.movielens.umn.edu/).

BTW, there's been research on collaborative filtering in a p2p 
context, ensuring the privacy of the people submitting ratings:

     http://www.cs.berkeley.edu/~jfc/papers/02/SIGIR02.pdf
     http://www.cs.berkeley.edu/~jfc/papers/02/IEEESP02.pdf

>> Storm stores data in *blocks*, byte sequences not unlike files, but 
>> identified by a cryptographic content hash (making them immutable, 
>> since if you change a block's content, you have a different hash and 
>> thus a different block). This allows you to make a reference to one 
>> specific version of something, authenticable with the hash (this is 
>> potentially very useful for importing ontologies into an RDF graph). 
>> You can retrieve a block from any source-- some server or peer or your 
>> local computer or an email attachment etc.-- since you can always 
>> check the authenticity. Because different blocks have different names, 
>> to synchronize the data on two computers, you can simply copy all 
>> blocks that exist on only one of the two to the other of the two 
>> (convenient e.g. for users with both a laptop and a desktop). When you 
>> create a new version of something, the old versions don't get 
>> overwritten, because you create a new block without affecting the old 
>> ones. The list goes on :-)
> 
> I think this is a very good approach; you could use Freenet content-hash 
> URIs to identify the blocks.

We'll probably register our own URN namespace, among other reasons because 
we want to use 'real,' registered URIs. (We're also considering putting 
a MIME content type in the URI, so that a block served up through our 
system would be basically as useful as a file retrieved through HTTP, 
and allowing us to easily serve blocks through an HTTP proxy, too. Not 
yet decided, though-- some people I've contacted commented that MIME 
types do not belong in URIs.)
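
For illustration, the content-hash naming and the "copy whatever the 
other machine is missing" synchronization from the quoted paragraph 
come down to something like this (the hash function and the 
urn:x-storm form below are placeholders only -- as said, the namespace 
is neither registered nor decided):

    import hashlib

    # Placeholder URN form -- the real namespace (and whether a MIME
    # type belongs in the URI at all) is still undecided.
    def block_id(content, content_type="application/octet-stream"):
        return "urn:x-storm:%s,%s" % (
            content_type, hashlib.sha1(content).hexdigest())

    def synchronize(blocks_a, blocks_b):
        """Copy every block that exists on only one of two machines to
        the other. Since IDs are content hashes, the same ID always
        names the same bytes, so nothing gets overwritten."""
        for bid, data in blocks_a.items():
            blocks_b.setdefault(bid, data)
        for bid, data in blocks_b.items():
            blocks_a.setdefault(bid, data)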

> But am I right that this makes rdf-literals 
> obsolete for everything but small decimals?

Hm, why? :-)

> And how do you split the metadata into blocks?

Well, it depends very much on the application. How do you split metadata 
into files? :-)

>> So anyway, there are a number of reasons why we need to do powerful 
>> queries over a set of Storm blocks. For example, since we use hashes 
>> as the identifiers for blocks, we don't have file names as hints to 
>> humans about their content; instead, we'll use RDF metadata, stored in 
>> *other* blocks. As a second example, on top of the unchangeable 
>> blocks, we need to create a notion of updateable, versioned resources. 
>> We do this by creating metadata blocks saying e.g., "Block X is the 
>> newest version of resource Y as of 2003-03-20T22:29:25Z" and searching 
>> for the newest such statement.
> 
> I don't quite understand: isn't there a regression problem if the 
> metadata is itself contained in blocks? Or is at least the timestamp of 
> a block something external to the blocks?

A metadata block does not usually have a 'second-level' metadata block 
with information about the first metadata block, if you mean that; no, 
timestamps are not external to the blocks.

Rather, the metadata block described above would simply contain RDF 
triples like,

     _1  rdf:type              VersionAssignment
     _1  storm:resource        Y
     _1  storm:currentVersion  X
     _1  storm:timestamp       "2003-03-20T22:29:25Z"

So the timestamp would be attached to an RDF node inside the metadata block.
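
Finding the current version of Y then just means collecting all such 
version-assignment nodes for Y and taking the newest one -- roughly 
(sketch only; version_assignments stands in for whatever query 
mechanism we end up with):

    def current_version(resource, version_assignments):
        """version_assignments: (resource, version, timestamp) tuples
        harvested from metadata blocks like the one above."""
        relevant = [(ts, version)
                    for (res, version, ts) in version_assignments
                    if res == resource]
        if not relevant:
            return None
        # ISO 8601 UTC timestamps sort correctly as plain strings,
        # so max() picks the newest assignment.
        return max(relevant)[1]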

Does that answer the question? (Sorry, not sure whether I understood it 
right.)

>> (This design also provides a clean separation between changeable 
>> resources and individual versions of a resource, which are of course 
>> resources themselves.)
> 
> This would also be useful for mies: if an annotation is related not to 
> a URL but to its current content, the content should be cached and the 
> annotation related to the content-hash URL, but this is not for 
> version 0.1.

You're of course welcome to use Storm technology in a future version ;-)

Having a closer look at mies right now.

- Benja

Received on Saturday, 22 March 2003 17:09:12 UTC