- From: Benja Fallenstein <b.fallenstein@gmx.de>
- Date: Sat, 22 Mar 2003 23:08:49 +0100
- To: Reto Bachmann-Gmuer <reto@gmuer.ch>
- CC: www-rdf-interest@w3.org
Hi Reto,
Reto Bachmann-Gmuer wrote:
>> This would be quite nice indeed-- if it is achievable.
[complex query example snipped]
>>
>> If we match against each graph separately, of course such a query
>> wouldn't be possible...
>>
>> On a p2p system, one idea that occurred to me was to use a distributed
>> hashtable mapping (node occurring in graph) -> (location of graph).
>> This would allow me to find graphs relevant to a query by asking the
>> distributed hashtable for graphs containing the nodes from the query.
>> Again, the problem seems more manageable if we only look at one graph
>> at a time.
>
> I think this mechanism could be integrated into the Graph/Model's
> implementation to generally improve the speed of reifications and
> frequent related queries.
So you're suggesting a Graph implementation that loads statements
from the p2p network on demand, reifying them. On reflection,
I think that this would be a good thing indeed, especially since you
make a very strong point here:
> As far as the application layer is concerned, I think the power of RDF
> in a P2P environment gets lost when the application separates different
> "knowledge-worlds" too much.
Obviously, we would really like to be able to run queries over the union
of all the information in the p2p network.
However, I still don't know how to implement it in a scalable way. My
simple hack seems to wreak havoc here. If my application wants to know
all rdf:Properties ex:inventedBy ex:exampleInc which have an rdf:domain
of foaf:Person and an rdf:range of ex:Animal, how do we avoid having to
download all graphs containing statements about either rdf:Property or
foaf:Person-- surely a much bigger set than the graphs we're actually
interested in?
(If we look at each graph separately, we only need to download graphs
containing *all* terms in the query (if it's purely conjunctive). Once
we've found out that there are only 76 graphs mentioning ex:exampleInc,
we can download those instead of downloading all 143'296 mentioning
foaf:Person... Still not perfect, but maybe an improvement :) )
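To sketch what I mean (all names hypothetical, this is not any real
API): with a lookup(term) -> {locations of graphs mentioning term}
index in the distributed hashtable, a purely conjunctive query only
needs the intersection of the posting sets, roughly:

    # Hypothetical sketch: dht.lookup(term) returns the set of locations
    # of graphs that mention the term; none of this is a real library API.
    def candidate_graphs(dht, query_terms):
        # Locations of graphs mentioning *all* terms of a purely
        # conjunctive query -- only these need to be downloaded.
        terms = list(query_terms)
        if not terms:
            return set()
        result = set(dht.lookup(terms[0]))
        for term in terms[1:]:
            result &= set(dht.lookup(term))
            if not result:
                break  # no graph mentions every term
        return result

    # e.g. candidate_graphs(dht, ["ex:exampleInc", "foaf:Person", "ex:Animal"])

(Starting from the rarest term, if the index can report posting-list
sizes, would shrink the intermediate sets even further.)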
Do you have ideas on how to create a scalable lookup-- i.e. one that only
needs to look at graphs actually relevant to the query at hand?
> I think a P2P application should
> essentially be able to adapt its own knowledge based on the peers it is
> exposed to; practically, this means that the application doesn't (just)
> store an index of the peers but focuses on conflicts and intersections
> between the models. Intersections reinforce the belief in the
> statements; contradictions weaken it.
This sounds interesting; on the other hand, I don't think the number of
peers storing statements that agree with each other is necessarily a
good measure: If organization X caches their home page blurb on all
their five thousand machines, they'll have an edge towards everybody
believing it :-)
(Number of trusted signers agreeing on the statement may be a better
measure?)
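To illustrate the difference (toy sketch, all names made up): counting
distinct trusted signing keys instead of raw peers means that mirroring
a statement on five thousand machines buys no extra believability.

    # Toy sketch, all names made up: raw peer count vs. trusted signers.
    def peer_count(statement, peers_serving):
        # Easily inflated: one organization can run thousands of peers.
        return len(peers_serving.get(statement, set()))

    def trusted_signer_count(statement, signers_by_statement, trusted_keys):
        # Each trusted signing key (trusted_keys is a set) counts at most
        # once, however many peers replicate the statement.
        return len(set(signers_by_statement.get(statement, set())) & trusted_keys)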
I'm not too deeply interested in the details of this at the moment,
btw-- it's an interesting challenge, but not necessary for getting the
things I'm working on to run :-)
> By analyzing similarities between the sets of conflicts and
> correspondences associated with each peer, an application could determine
> which peers are most likely to give relevant answers to a certain query
> (using collaborative filtering techniques like e.g.
> http://www.movielens.umn.edu/).
BTW, there's been research about collaborative filtering in a p2p
context, ensuring the privacy of the people submitting ratings:
http://www.cs.berkeley.edu/~jfc/papers/02/SIGIR02.pdf
http://www.cs.berkeley.edu/~jfc/papers/02/IEEESP02.pdf
>> Storm stores data in *blocks*, byte sequences not unlike files, but
>> identified by a cryptographic content hash (making them immutable,
>> since if you change a block's content, you have a different hash and
>> thus a different block). This allows you to make a reference to one
>> specific version of something, authenticable with the hash (this is
>> potentially very useful for importing ontologies into an RDF graph).
>> You can retrieve a block from any source-- some server or peer or your
>> local computer or an email attachment etc.-- since you can always
>> check the authenticity. Because different blocks have different names,
>> to synchronize the data on two computers, you can simply copy all
>> blocks that exist on only one of the two to the other of the two
>> (convenient e.g. for users with both a laptop and a desktop). When you
>> create a new version of something, the old versions don't get
>> overwritten, because you create a new block without affecting the old
>> ones. The list goes on :-)
>
> I think this is a very good approach; you could use Freenet content-hash
> URIs to identify the blocks.
We'll probably register our own URN namespace, among other reasons because
we want to use 'real,' registered URIs. (We're also considering putting
a MIME content type in the URI, so that a block served up through our
system would be basically as useful as a file retrieved through HTTP,
and allowing us to easily serve blocks through an HTTP proxy, too. Not
yet decided, though-- some people I've contacted commented that MIME
types do not belong in URIs.)
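To make the block idea a bit more concrete, here's a rough sketch (not
our actual code; the URN prefix below is invented for illustration, not
our still-unregistered namespace) of a hash-named block store, including
the laptop/desktop synchronization mentioned above:

    # Rough sketch only; the URN prefix is made up for illustration.
    import hashlib

    URN_PREFIX = "urn:x-example-storm:sha1:"

    class BlockStore:
        def __init__(self):
            self.blocks = {}  # block id -> bytes

        def add(self, data):
            # A block's id is derived from its content, so changed
            # content means a different block -- old ones stay intact.
            block_id = URN_PREFIX + hashlib.sha1(data).hexdigest()
            self.blocks[block_id] = data
            return block_id

        @staticmethod
        def verify(block_id, data):
            # A block may come from any source (peer, server, email
            # attachment); accept it only if it matches its id.
            return block_id == URN_PREFIX + hashlib.sha1(data).hexdigest()

    def synchronize(a, b):
        # Copy every block that exists in only one of the two stores
        # to the other one (the laptop/desktop case).
        for block_id in set(a.blocks) - set(b.blocks):
            b.blocks[block_id] = a.blocks[block_id]
        for block_id in set(b.blocks) - set(a.blocks):
            a.blocks[block_id] = b.blocks[block_id]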
> But am I right that this makes RDF literals
> obsolete for everything but small decimals?
Hm, why? :-)
> And how do you split the metadata into blocks?
Well, that depends very much on the application. How do you split metadata
into files? :-)
>> So anyway, there are a number of reasons why we need to do powerful
>> queries over a set of Storm blocks. For example, since we use hashes
>> as the identifiers for blocks, we don't have file names as hints to
>> humans about their content; instead, we'll use RDF metadata, stored in
>> *other* blocks. As a second example, on top of the unchangeable
>> blocks, we need to create a notion of updateable, versioned resources.
>> We do this by creating metadata blocks saying e.g., "Block X is the
>> newest version of resource Y as of 2003-03-20T22:29:25Z" and searching
>> for the newest such statement.
>
> I don't quite understand: isn't there a regression problem if the
> metadata is itself contained in blocks? Or is at least the timestamp of
> a block something external to the blocks?
A metadata block does not usually have a 'second-level' metadata block
with information about the first metadata block, if you mean that; no,
timestamps are not external to the blocks.
Rather, the metadata block described above would simply contain RDF
triples like,
_1 rdf:type VersionAssignment
_1 storm:resource Y
_1 storm:currentVersion X
_1 storm:timestamp "2003-03-20T22:29:25Z"
So the timestamp would be attached to an RDF node inside the metadata block.
Does that answer the question? (Sorry, not sure whether I understood it
right.)
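(For what it's worth, the "search for the newest such statement" part
might look roughly like this -- the helper names are made up, the
properties are the ones from the triples above:)

    # Sketch only; 'assignments' would be the version-assignment nodes
    # parsed from the metadata blocks' RDF.
    def current_version(assignments, resource):
        # assignments: iterable of dicts like
        # {"resource": Y, "currentVersion": X,
        #  "timestamp": "2003-03-20T22:29:25Z"}
        relevant = [a for a in assignments if a["resource"] == resource]
        if not relevant:
            return None
        # ISO 8601 UTC timestamps sort correctly as plain strings.
        newest = max(relevant, key=lambda a: a["timestamp"])
        return newest["currentVersion"]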
>> (This design also provides a clean separation between changeable
>> resources and individual versions of a resource, which are of course
>> resources themselves.)
>
> This would also be useful for mies: if an annotation is related not to a
> URL but to its current content, the content should be cached and the
> annotation related to the content-hash URL-- but this is not for version 0.1.
You're of course welcome to use Storm technology in a future version ;-)
Having a closer look at mies right now.
- Benja