Re: P2P and RDF (was Re: API for querying a set of RDF graphs?)

Hi Benja

>>> In a p2p system, one idea that occurred to me was to use a 
>>> distributed hashtable mapping (node occurring in graph) -> (location 
>>> of graph). This would allow me to find graphs relevant to a query by 
>>> asking the distributed hashtable for graphs containing the nodes 
>>> from the query. Again, the problem seems more manageable if we only 
>>> look at one graph at a time.
>> I think this mechanism could be integrated in the Graph/Model's 
>> implementation to generally improve the speed of reifications and 
>> frequent related queries.
> So you're suggesting to have a Graph implementation that loads 
> statements from the p2p network on demand, reifying them.
Actually, I thought you were talking about a locally "cached" subset of 
the peer's graph rather than the whole remote graph.
> On reflection, I think that this would be a good thing indeed, 
> especially since you make a very strong point here:
>> As far as the application layer is concerned, I think the power of 
>> RDF in a P2P environment gets lost when the application separates 
>> different "knowledge-worlds" too much.
>
> Obviously, we would really like to be able to run queries over the 
> union of all the information in the p2p network.
>
> However, I still don't know how to implement it in a scalable way. My 
> simple hack seems to wreak havoc here. If my application wants to know 
> all rdf:Properties ex:inventedBy ex:exampleInc which have an 
> rdf:domain of foaf:Person and an rdf:range of ex:Animal, how do we 
> avoid having to download all graphs containing statements about either 
> rdf:Property or foaf:Person-- surely a much bigger set than the graphs 
> we're actually interested in?
Certainly downloading all remote graphs does not seem to be feasible. 
Just forwarding the query to the peers, however, does not produce the 
same result (e.g. when one graph states that ex2:inspiredBy is a 
subproperty of ex:inventedBy and another graph merely uses 
ex2:inspiredBy). I think what we need is an application that in a 
first step queries the peers to get more knowledge about the terms 
involved in the query. With the result of this preparatory query the 
application can, on the one hand, guess which peers are competent in 
the domain and, on the other hand, refine the query by adding the 
premises that allow the queried host to make the inferences we would 
make when merging all the peers' graphs (e.g. tell the queried host 
that ex:inventedBy is a superproperty of ex2:inspiredBy).
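To make this concrete, here is a minimal sketch of such a two-step 
query in Python (the Peer class and its query() method are illustrative 
assumptions; to keep it short the sketch widens the query locally 
instead of shipping the premises to the peers):

class Peer:
    def __init__(self, graph):
        self.graph = graph  # a set of (subject, predicate, object) triples

    def query(self, pattern):
        s, p, o = pattern   # None acts as a wildcard
        return {t for t in self.graph
                if (s is None or t[0] == s)
                and (p is None or t[1] == p)
                and (o is None or t[2] == o)}

SUBPROP = "rdfs:subPropertyOf"

def query_with_premises(peers, pattern):
    # Step 1: preparatory query -- learn how the query's predicate relates
    # to other terms (here only subproperty links, as in the example above).
    premises = set()
    for peer in peers:
        premises |= peer.query((None, SUBPROP, pattern[1]))

    # Step 2: refine the query with the learned premises, so a peer that
    # only uses ex2:inspiredBy still contributes to an ex:inventedBy query.
    predicates = {pattern[1]} | {s for s, _, _ in premises}
    results = set()
    for peer in peers:
        for pred in predicates:
            results |= peer.query((pattern[0], pred, pattern[2]))
    return results

a = Peer({("ex2:inspiredBy", SUBPROP, "ex:inventedBy")})
b = Peer({("ex:Dog", "ex2:inspiredBy", "ex:exampleInc")})
print(query_with_premises([a, b], (None, "ex:inventedBy", "ex:exampleInc")))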

> (If we look at each graph separately, we only need to download graphs 
> containing *all* terms in the query (if it's purely conjunctive). Once 
> we've found out that there are only 76 graphs mentioning 
> ex:exampleInc, we can download those instead of downloading all 
> 143'296 mentioning foaf:Person... Still not perfect, but maybe an 
> improvement :) )
I guess a hashtable with all the nodes occurring in the peers' graphs 
could easily get much too big, especially if the data is stored in 
blocks which all have a URI. I'm thinking about the possibility of 
storing an indicator of a peer's competence as an attribute of every 
resource that is present in our graph anyway. This way we would avoid 
storing URIs that are unlikely to ever be involved in a query, while 
we could store more than just "knows about" / "doesn't know about" for 
the terms we deal with (e.g. "has 23 statements containing the term" 
or, more elaborately, "has 23 statements containing the term and 324 
statements containing closely related terms").

> Do you have ideas how to create a scalable lookup-- i.e. one that only 
> needs to look at graphs actually relevant to the query at hand?
Ideas and questions, but no algorithm! ;-)

>
>> I think a P2P application should essentially be able to adapt its own 
>> knowledge based on the peers it is exposed to; practically this means 
>> that the application doesn't (just) store an index of the peers but 
>> focuses on conflicts and intersections between the models. 
>> Intersections reinforce the belief in the statements; contradictions 
>> weaken it.
>
> This sounds interesting; on the other hand, I don't think the number 
> of peers storing statements that agree with each other is necessarily 
> a good measure: If organization X caches their home page blurb on all 
> their five thousand machines, they'll have an edge towards everybody 
> believing it :-)
>
> (Number of trusted signers agreeing on the statement may be a better 
> measure?)

Maybe something like:
beliefInStatement = AverageOf(trustInPeer * beliefInStatementByPeer) / 
AverageOf(trustInPeer * beliefInContradictionByPeer), where belief and 
trust are numbers between 0 and 1: 0 expresses belief in wrongness, 
0.5 indifference.
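Or, as a runnable sketch (the names and the 0.5 default for an unknown 
statement are my own assumptions):

def belief_in_statement(reports):
    # reports: a list of (trustInPeer, beliefInStatementByPeer,
    # beliefInContradictionByPeer) tuples, all values between 0 and 1
    # (0 = believed wrong, 0.5 = indifference).
    if not reports:
        return 0.5                   # no opinions at all: indifference
    support = sum(t * pro for t, pro, _ in reports) / len(reports)
    dissent = sum(t * con for t, _, con in reports) / len(reports)
    if dissent == 0:
        return 1.0                   # nothing speaks against the statement
    return support / dissent         # note: the ratio itself is not capped at 1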
>
> I'm not too deeply interested in the details of this at the moment, 
> btw-- it's an interesting challenge, but not necessary for getting the 
> things I'm working on to run :-)
I'll be back coding right after breakfast :-). Seriously, I believe 
that while some sort of trust metric is essential for any P2P 
application to be performant and robust against attacks, it is 
mandatory for semantic P2P applications: the application must have 
means to deal with ambiguity and nevertheless increase its knowledge.

>> By analyzing similarities between the sets of conflicts and 
>> correspondences associated with each peer, an application could 
>> determine which peers are most likely to give relevant answers to a 
>> certain query (using collaborative filtering techniques like e.g. 
>> http://www.movielens.umn.edu/).
>
> BTW, there's been research about collaborative filtering in a p2p 
> context, ensuring the privacy of the people submitting ratings:
>
>     http://www.cs.berkeley.edu/~jfc/papers/02/SIGIR02.pdf
>     http://www.cs.berkeley.edu/~jfc/papers/02/IEEESP02.pdf
Cool, I will have a look at them (right after coding ;-)).

>>> Storm stores data in *blocks*, byte sequences not unlike files, but 
>>> identified by a cryptographic content hash (making them immutable, 
>>> since if you change a block's content, you have a different hash and 
>>> thus a different block). This allows you to make a reference to one 
>>> specific version of something, authenticable with the hash (this is 
>>> potentially very useful for importing ontologies into an RDF graph). 
>>> You can retrieve a block from any source-- some server or peer or 
>>> your local computer or an email attachment etc.-- since you can 
>>> always check the authenticity. Because different blocks have 
>>> different names, to synchronize the data on two computers, you can 
>>> simply copy all blocks that exist on only one of the two to the 
>>> other of the two (convenient e.g. for users with both a laptop and a 
>>> desktop). When you create a new version of something, the old 
>>> versions don't get overwritten, because you create a new block not 
>>> affecting the old ones. The list goes on :-)
>> I think this is a very good approach; you could use Freenet 
>> content-hash URIs to identify the blocks.
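Something along these lines, I imagine (the urn:content-hash: prefix is 
only illustrative, and SHA-1/base32 are arbitrary choices):

import base64
import hashlib

def block_uri(block: bytes) -> str:
    # Name a block by the hash of its content; changing the content
    # changes the name, so a block is effectively immutable.
    digest = hashlib.sha1(block).digest()
    return "urn:content-hash:" + base64.b32encode(digest).decode("ascii")

def verify(block: bytes, uri: str) -> bool:
    # A block fetched from any source (peer, server, email attachment)
    # is authentic exactly when its hash reproduces the URI.
    return block_uri(block) == uri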
>
> We'll probably register our own URN namespace, among other goals 
> because we want to use 'real,' registered URIs. (We're also 
> considering putting a MIME content type in the URI, so that a block 
> served up through our system would be basically as useful as a file 
> retrieved through HTTP, and allowing us to easily serve blocks through 
> a HTTP proxy, too. Not yet decided, though-- some people I've 
> contacted commented that MIME types do not belong in URIs.)
Hmm, I don't see the analogy with HTTP, since HTTP URLs should not 
contain a content-type indicator but leave it to browser and server to 
negotiate the best deliverable content type. Of course your case is 
different, since your URI immutably references a sequence of bytes. 
Still, I strongly disagree with putting the MIME type into the URI, 
because the MIME type is meta-information for which I see no reason to 
be treated differently than other meta-information. It is at least 
theoretically possible that the same sequence of bytes (block) 
represents different contents when interpreted with a different MIME 
type/encoding; should the host then store the block twice? 
Higher-level applications should not use block URIs anyway, but should 
deal with an abstraction representing the content (as HTTP URLs 
should).

An example to be more explicit:
<urn:urn-5:G7Fj> <DC:title> "Ulisses"
<urn:urn-5:G7Fj> <DC:description> "bla bli"
<urn:urn-5:G7Fj> <ex:type> <ex:text>
<urn:urn-5:G7Fj> <ex:utf8-encoding> <urn:content-hash:jhKHUL7HK>
<urn:urn-5:G7Fj> <ex:latin1-encoding> <urn:content-hash:Dj&/fjkZRT68>
<urn:urn-5:lG5d> <ex:englishVersion> <urn:urn-5:G7Fj>
<urn:urn-5:lG5d> <ex:spanishVersion> <urn:urn-5:kA2L>
....

In this example, applications should reference "urn:urn-5:G7Fj" (which 
does not have a MIME type) rather than "urn:content-hash:Dj&/fjkZRT68" 
(which has a MIME type in a specific context) wherever possible; in 
many cases the higher abstraction "urn:urn-5:lG5d" can be used. While 
you can only deficiently use HTTP to serve a block, you could serve 
the URIs of both abstractions (urn:urn-5:G7Fj and urn:urn-5:lG5d) 
directly using HTTP 1.1 features.
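For example, a server could resolve the abstraction to a concrete 
block with ordinary content negotiation; the table below just mirrors 
the metadata above, nothing here is a real service:

VARIANTS = {
    "urn:urn-5:G7Fj": {                  # the English text, two encodings
        "utf-8":  "urn:content-hash:jhKHUL7HK",
        "latin1": "urn:content-hash:Dj&/fjkZRT68",
    },
}

def negotiate(urn, accept_charset):
    # Map an abstraction URN plus an Accept-Charset header to a block URI.
    encodings = VARIANTS[urn]
    for wanted in (c.strip().lower() for c in accept_charset.split(",")):
        if wanted in encodings:
            return encodings[wanted]
    return next(iter(encodings.values()))  # fall back to any available block

print(negotiate("urn:urn-5:G7Fj", "latin1, utf-8"))   # -> the latin1 block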

>> But am I right that this makes rdf-literals obsolete for everything 
>> but small decimals?
> Hm, why? :-)
Well, why use a literal if you can make a block out of it, shortening 
queries and unifying handling?
>
>> And how do you split the metadata into blocks?
> Well, depends very much on the application. How do you split metadata 
> into files? :-)
Not at all ;-). The splitting into files is rudimentarily represented 
meta-data; if you use RDF, the filesystem is a legacy application.

>>> So anyway, there are a number of reasons why we need to do powerful 
>>> queries over a set of Storm blocks. For example, since we use hashes 
>>> as the identifiers for blocks, we don't have file names as hints to 
>>> humans about their content; instead, we'll use RDF metadata, stored 
>>> in *other* blocks. As a second example, on top of the unchangeable 
>>> blocks, we need to create a notion of updateable, versioned 
>>> resources. We do this by creating metadata blocks saying e.g., 
>>> "Block X is the newest version of resource Y as of 
>>> 2003-03-20T22:29:25Z" and searching for the newest such statement.
>> I don't quite understand: isn't there a regression problem if the 
>> metadata is itself contained in blocks? Or is at least the timestamp 
>> of a block something external to the blocks?
>
> A metadata block does not usually have a 'second-level' metadata block 
> with information about the first metadata block, if you mean that;
Say you want to change the description of an entity, not just add a 
new one; I think you should then state about the other metadata block 
that it is wrong (in the era starting now ;-)).

> no, timestamps are not external to the blocks.
When the user synchronizes their laptop with the home PC, I guess the 
metadata may be contradictory; I thought that with an external 
timestamp such contradictions could be handled (the newer one is the 
right one). If the timestamp is part of the metadata, the application 
should probably enforce it (while generally giving the user the 
maximum power to make all sorts of metadata constructs).
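A minimal sketch of the "newest statement wins" rule I would expect 
the application to enforce (the dictionary layout is made up):

from datetime import datetime

def current_version(version_statements):
    # After synchronizing two machines the version statements may
    # disagree; the one with the latest embedded timestamp wins, the
    # older ones are kept but ignored.
    return max(version_statements, key=lambda s: s["as_of"])

statements = [
    {"resource": "Y", "block": "X1", "as_of": datetime(2003, 3, 20, 22, 29, 25)},
    {"resource": "Y", "block": "X2", "as_of": datetime(2003, 3, 22, 9, 10, 0)},
]
print(current_version(statements)["block"])   # -> X2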


cheers,
reto