- From: Benja Fallenstein <b.fallenstein@gmx.de>
- Date: Sat, 22 Mar 2003 23:08:49 +0100
- To: Reto Bachmann-Gmuer <reto@gmuer.ch>
- CC: www-rdf-interest@w3.org
Hi Reto,

Reto Bachmann-Gmuer wrote:
>> This would be quite nice indeed-- if it is achievable.

[complex query example snipped]

>> If we match against each graph separately, of course such a query
>> wouldn't be possible...
>>
>> On a p2p system, one idea that occurred to me was to use a distributed
>> hashtable mapping (node occurring in graph) -> (location of graph).
>> This would allow me to find graphs relevant to a query by asking the
>> distributed hashtable for graphs containing the nodes from the query.
>> Again, the problem seems more manageable if we only look at one graph
>> at a time.
>
> I think this mechanism could be integrated in the Graph/Model's
> implementation to generally improve the speed of reifications and
> frequent related queries.

So you're suggesting to have a Graph implementation that loads
statements from the p2p network on demand, reifying them. On
reflection, I think that this would be a good thing indeed, especially
since you make a very strong point here:

> As far as the application layer is concerned, I think the power of RDF
> in a P2P environment gets lost when the application separates different
> "knowledge-worlds" too much.

Obviously, we would really like to be able to run queries over the
union of all the information in the p2p network. However, I still don't
know how to implement it in a scalable way. My simple hack seems to
wreak havoc here. If my application wants to know all rdf:Properties
ex:inventedBy ex:exampleInc which have an rdf:domain of foaf:Person and
an rdf:range of ex:Animal, how do we avoid having to download all
graphs containing statements about either rdf:Property or foaf:Person--
surely a much bigger set than the graphs we're actually interested in?

(If we look at each graph separately, we only need to download graphs
containing *all* terms in the query (if it's purely conjunctive). Once
we've found out that there are only 76 graphs mentioning ex:exampleInc,
we can download those instead of downloading all 143'296 mentioning
foaf:Person... Still not perfect, but maybe an improvement :) )

Do you have ideas how to create a scalable lookup-- i.e., one that only
needs to look at graphs actually relevant to the query at hand? (A
rough sketch of the kind of per-graph lookup I have in mind is below,
after the quotes.)

> I think a P2P application should essentially be able to adapt its own
> knowledge based on the peers it is exposed to; practically this means
> that the application doesn't (just) store an index of the peers but
> focuses on conflicts and intersections between the models.
> Intersections reinforce the belief in the statements, contradictions
> weaken it.

This sounds interesting; on the other hand, I don't think the number of
peers storing statements that agree with each other is necessarily a
good measure: if organization X caches their home page blurb on all
their five thousand machines, they'll have an edge towards everybody
believing it :-) (The number of trusted signers agreeing on a statement
may be a better measure?)

I'm not too deeply interested in the details of this at the moment,
btw-- it's an interesting challenge, but not necessary for getting the
things I'm working on to run :-)

> Analyzing the similarities between the set of conflicts and
> correspondences associated with each peer, an application could
> determine which peers are most likely to give relevant answers to a
> certain query (using collaborative filtering techniques like e.g.
> http://www.movielens.umn.edu/).
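Coming back to the sketch I promised above: here's roughly what I mean
by the per-graph lookup, in Python. This is a sketch only--
dht.lookup() and store.download_graph() are made-up interfaces standing
in for whatever we'd actually build, not anything that exists today:

    # Rough sketch: dht.lookup() and store.download_graph() are invented
    # placeholder interfaces, not real APIs.

    def candidate_graphs(dht, query_terms):
        """Locations of graphs mentioning *every* term of a conjunctive query.

        dht.lookup(term) is assumed to return the set of graph locations
        indexed under that term, i.e. the distributed hashtable maps
        (node occurring in graph) -> (location of graph).
        """
        term_sets = sorted((dht.lookup(t) for t in query_terms), key=len)
        if not term_sets:
            return set()
        candidates = set(term_sets[0])    # start from the smallest set
        for other in term_sets[1:]:
            candidates &= other           # keep graphs containing all terms
            if not candidates:
                break                     # no graph mentions every term
        return candidates

    def run_query(dht, store, query_terms, match):
        # Download only the (hopefully few) candidate graphs and run the
        # query against each graph separately.
        for location in candidate_graphs(dht, query_terms):
            graph = store.download_graph(location)
            for result in match(graph):
                yield result

In the example above, this would start from the 76 graphs mentioning
ex:exampleInc rather than the 143'296 mentioning foaf:Person, and the
intersection can only shrink that set further.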
BTW, there's been research about collaborative filtering in a p2p
context, ensuring the privacy of the people submitting ratings:

http://www.cs.berkeley.edu/~jfc/papers/02/SIGIR02.pdf
http://www.cs.berkeley.edu/~jfc/papers/02/IEEESP02.pdf

>> Storm stores data in *blocks*, byte sequences not unlike files, but
>> identified by a cryptographic content hash (making them immutable,
>> since if you change a block's content, you have a different hash and
>> thus a different block). This allows you to make a reference to one
>> specific version of something, authenticable with the hash (this is
>> potentially very useful for importing ontologies into an RDF graph).
>> You can retrieve a block from any source-- some server or peer or your
>> local computer or an email attachment etc.-- since you can always
>> check the authenticity. Because different blocks have different names,
>> to synchronize the data on two computers, you can simply copy all
>> blocks that exist on only one of the two to the other of the two
>> (convenient e.g. for users with both a laptop and a desktop). When you
>> create a new version of something, the old versions don't get
>> overwritten, because you create a new block not affecting the old
>> ones. The list goes on :-)
>
> I think this is a very good approach; you could use Freenet
> content-hash URIs to identify the blocks.

We'll probably register our own URN namespace, among other reasons
because we want to use 'real,' registered URIs. (We're also considering
putting a MIME content type in the URI, so that a block served up
through our system would be basically as useful as a file retrieved
through HTTP, and allowing us to easily serve blocks through an HTTP
proxy, too. Not yet decided, though-- some people I've contacted
commented that MIME types do not belong in URIs.)

> But am I right that this makes rdf-literals obsolete for everything
> but small decimals?

Hm, why? :-)

> And how do you split the metadata in blocks

Well, that depends very much on the application. How do you split
metadata into files? :-)

>> So anyway, there are a number of reasons why we need to do powerful
>> queries over a set of Storm blocks. For example, since we use hashes
>> as the identifiers for blocks, we don't have file names as hints to
>> humans about their content; instead, we'll use RDF metadata, stored in
>> *other* blocks. As a second example, on top of the unchangeable
>> blocks, we need to create a notion of updateable, versioned resources.
>> We do this by creating metadata blocks saying e.g., "Block X is the
>> newest version of resource Y as of 2003-03-20T22:29:25Z" and searching
>> for the newest such statement.
>
> I don't quite understand: isn't there a regression problem if the
> metadata is itself contained in blocks? Or is at least the timestamp of
> a block something external to the blocks?

A metadata block does not usually have a 'second-level' metadata block
with information about the first metadata block, if you mean that; and
no, timestamps are not external to the blocks. Rather, the metadata
block described above would simply contain RDF triples like

    _1 rdf:type VersionAssignment
    _1 storm:resource Y
    _1 storm:currentVersion X
    _1 storm:timestamp "2003-03-20T22:29:25Z"

So the timestamp would be attached to an RDF node inside the metadata
block. Does that answer the question? (Sorry, not sure whether I
understood it right.)
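To make the "searching for the newest such statement" part a bit more
concrete, here is a rough Python sketch. The VersionAssignment record
and the function name are invented for illustration, and it assumes the
triples above have already been parsed out of the metadata blocks we
know about:

    # Rough sketch only: VersionAssignment and current_version_of() are
    # invented names; assume the version-assignment triples have already
    # been extracted from the metadata blocks somehow.

    from dataclasses import dataclass

    @dataclass
    class VersionAssignment:
        resource: str          # the versioned resource, Y above
        current_version: str   # the block claimed to be newest, X above
        timestamp: str         # e.g. "2003-03-20T22:29:25Z"

    def current_version_of(resource, assignments):
        """Pick the newest 'X is the current version of Y' claim for Y."""
        claims = [a for a in assignments if a.resource == resource]
        if not claims:
            return None
        # ISO 8601 timestamps in UTC compare correctly as plain strings,
        # so max() by timestamp picks the most recent claim.
        return max(claims, key=lambda a: a.timestamp).current_version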
>> (This design also provides a clean separation between changeable
>> resources and individual versions of a resource, which are of course
>> resources themselves.)
>
> This would be useful also for mies: if an annotation is not related to
> a url but to its current content, this should be cached and the
> annotation related to the content-hash-url. But this is not for
> version 0.1.

You're of course welcome to use Storm technology in a future version ;-)
Having a closer look at mies right now.

- Benja
Received on Saturday, 22 March 2003 17:09:12 UTC