Storm blocks and metadata (Re: P2P and RDF)

Hi Reto,

[drifting towards off-topic, but leaving on www-rdf-interest for now 
because it still concerns the use of RDF]

Reto Bachmann-Gmuer wrote:
>>> I think this is a very good approach, you could use freenet 
>>> content-hash URIs to identify the blocks.
>>
>> We'll probably register our own URN namespace, among other goals 
>> because we want to use 'real,' registered URIs. (We're also 
>> considering putting a MIME content type in the URI, so that a block 
>> served up through our system would be basically as useful as a file 
>> retrieved through HTTP, and allowing us to easily serve blocks through 
>> an HTTP proxy, too. Not yet decided, though-- some people I've 
>> contacted commented that MIME types do not belong in URIs.)
> 
> hmm, I don't see the analogy with http, since http-urls should not 
> contain a content-type indicator but leave the task to browser and 
> server to negotiate the best content-type deliverable. Of course  your 
> case is different, since your uri immutably references a sequence of 
> bytes.

Yes, that would have been my argument. However, you make a good point 
below: If we refer to an RDF 'metadata' block containing the URI of the 
actual block, we can include references to alternative versions-- even 
allowing some degree of content negotiation. This is something I have to 
mull over :-)

> I strongly disagree with putting the mime-type into the url, 
> because the mime type is meta information for which I see no reason to 
> be treated differently than other meta-information,

It is necessary for the interpretation of the data we get; and it's 
usually easy to agree on (people won't too often assign different mime 
types to the same bytes). One thing about content hashes is, when two 
people put the same file into a hash-based system, they will use the 
same identifier for it. With MIME types, that's still pretty much true; 
with more elaborate metadata, it isn't.

Using the same identifier is important for queries like, "Which 
documents include this image?" If the three documents that use the image 
use three different kinds of IDs for it (because they refer to three 
different kinds of metadata), you're out of luck.
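
To make the point about shared identifiers concrete, here is a minimal sketch. The hash function (SHA-1), the base32 encoding, and the helper names `block_urn`, `typed_id` and `described_id` are assumptions for illustration only; `urn:foo:content-hash:` is just a placeholder prefix, not a registered namespace.

```python
import base64
import hashlib


def block_urn(data):
    """Hypothetical content-hash identifier: depends only on the bytes."""
    digest = hashlib.sha1(data).digest()
    token = base64.b32encode(digest).decode("ascii").lower()
    return "urn:foo:content-hash:" + token


def typed_id(data, mime_type):
    """Identifier that also carries a MIME type (assumed layout: a pair)."""
    return (mime_type, block_urn(data))


def described_id(data, mime_type, description):
    """Identifier that additionally carries free-form metadata."""
    return (mime_type, description, block_urn(data))


image = b"example image bytes"

# Two people adding the same file independently get the same identifier,
# and they will almost always agree on the MIME type as well.
alice = typed_id(image, "image/png")
bob = typed_id(image, "image/png")
ids_match_with_mime = alice == bob

# With more elaborate metadata they are likely to diverge, so a query
# keyed on the identifier no longer finds both references.
alice_rich = described_id(image, "image/png", "holiday photo")
bob_rich = described_id(image, "image/png", "a beach")
ids_match_with_description = alice_rich == bob_rich
```

Here `ids_match_with_mime` is true and `ids_match_with_description` is false, which is the situation where "Which documents include this image?" stops working.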

> rather theoretically is it possible that the same sequence of bytes (block) 
> represents different contents being interpreted with a different 
> mime-type/encoding, should the host then store the block twice?

Up to the host. Since it *is* rather unlikely, I don't think there would 
be big penalties to storing the block twice in this case. I wouldn't do 
it anyway, but for a different reason: Other systems do not include the 
MIME type in their hash-based identifiers, and we should be able to find 
blocks and serve them to those systems even when we do not know the MIME 
type.

> Higher level applications should not use block-uris anyway but deal with an 
> abstraction representing the content (like http urls should).

You mean as in, with content negotiation applied? You use a single URI 
which maps to different representations of the same resource?

> An example to be more explicit:
> <urn:urn-5:G7Fj> <DC:title> "Ulisses"
> <urn:urn-5:G7Fj> <DC:description> "bla bli"

This, for example, I would not include here. :-) Firstly, it is 
something I would want to be versioned independently: if I change the 
description of an image, that should not create a new version of the 
image. Secondly, I don't see a reason why the URI of the image would 
need to refer to this. Thirdly, I don't think the moment a file is put 
into the system-- and thus given its identifier-- is necessarily the 
time to create this kind of metadata. It would seem to hold up the task 
at hand. Rather, I'd like to be able to add it later on, and maybe 
someone else can do that even better than me-- like a librarian who has 
a professional background in describing things with metadata.

It seems like you could easily put this data in another block without 
losing much (assuming that the second block could be easily found 
through an appropriate query).

> <urn:urn-5:G7Fj> <ex:type> <ex:text>
> <urn:urn-5:G7Fj> <ex:utf8-encoding> <urn:content-hash: jhKHUL7HK>
> <urn:urn-5:G7Fj> <ex:latin1-encoding> <urn:content-hash: Dj&/fjkZRT68>
> <urn:urn-5:lG5d> <ex:englishVersion> <urn:urn-5:G7Fj>
> <urn:urn-5:lG5d> <ex:spanishVersion> <urn:urn-5:kA2L>

These, on the other hand, are very good cases, because they can be used 
by the computer in ways that require a certain level of trust: We want 
to retrieve only the data that the referrer intended to be retrieved, 
and we want to be able to check this cryptographically-- so this 
actually needs to be part of what we protect cryptographically.

One technical side note, though. We'd have two types of URIs, something 
like,

     urn:foo:content-hash:jv24kt5
     urn:foo:ref:rs53h85p

The first would be just a plain byte stream identified by a content 
hash. The second would be a content hash, too, but we'd know that the 
target should be interpreted as an RDF file with data like you give 
above. Now, when we retrieve this block, we need to know at which node 
we need to start looking to find the block we're interested in, so I 
think we'd need to write this as something like,

<urn:urn-5:G7Fj> <ex:type> <ex:text>
<urn:urn-5:G7Fj> <ex:utf8-encoding> <urn:content-hash: jhKHUL7HK>
<urn:urn-5:G7Fj> <ex:latin1-encoding> <urn:content-hash: Dj&/fjkZRT68>
<> <ex:englishVersion> <urn:urn-5:G7Fj>
<> <ex:spanishVersion> <urn:urn-5:kA2L>

I.e., "this resource" is <> (the empty URI reference) and we start 
traversing the graph from there.
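
A rough sketch of that traversal, using plain tuples rather than a real RDF library. The empty string stands in for the empty URI reference, the `ex:` properties are the made-up ones from the example, the stray spaces inside the content-hash URIs are dropped, and the helper names `objects` and `resolve` are hypothetical.

```python
THIS = ""  # stands in for <>, "this resource"

# The reference block from the example above, as (subject, predicate, object).
triples = [
    ("urn:urn-5:G7Fj", "ex:type", "ex:text"),
    ("urn:urn-5:G7Fj", "ex:utf8-encoding", "urn:content-hash:jhKHUL7HK"),
    ("urn:urn-5:G7Fj", "ex:latin1-encoding", "urn:content-hash:Dj&/fjkZRT68"),
    (THIS, "ex:englishVersion", "urn:urn-5:G7Fj"),
    (THIS, "ex:spanishVersion", "urn:urn-5:kA2L"),
]


def objects(graph, subject, predicate):
    """All objects of statements with the given subject and predicate."""
    return [o for s, p, o in graph if s == subject and p == predicate]


def resolve(graph, language_property, encoding_property):
    """Start at <>, follow the language link, then the encoding link.

    Returns the content-hash URI of a matching block, or None if the
    reference block does not describe such a representation.
    """
    for version in objects(graph, THIS, language_property):
        for block in objects(graph, version, encoding_property):
            return block
    return None


english_utf8 = resolve(triples, "ex:englishVersion", "ex:utf8-encoding")
spanish_utf8 = resolve(triples, "ex:spanishVersion", "ex:utf8-encoding")
```

With the data above, `english_utf8` is the UTF-8 content-hash URI, while `spanish_utf8` is `None` because this reference block says nothing about the encodings of the Spanish version.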

I found another use case for RDF metadata: Creative Commons licenses. It 
would make sense to me if this were part of the reference, allowing 
the computer to automatically conclude how data may be copied and used.

> In this example application should reference "urn:urn-5:G7Fj" (which 
> does not have a mime type) rather than "urn:content-hash: Dj&/fjkZRT68" 
> (which has a mime type in a specific context) wherever possible, in many 
> cases a higher abstraction "urn:urn-5:lG5d" can be used .

Um, using a urn-5 doesn't work since it's just a random number-- if we 
use just a random number, we cannot check whether the data we may 
retrieve from a p2p network is really what the person making the 
reference wanted us to see. We would need to use "urn:foo:ref:[blah]", 
which would be the above RDF data, from which we could then get the 
specific representation.
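
The check itself is simple, which is the whole attraction. A minimal sketch, assuming SHA-1 and base32 as the encoding (neither is decided) and using hypothetical helper names `urn_for` and `verify_block`:

```python
import base64
import hashlib

PREFIX = "urn:foo:content-hash:"  # placeholder namespace, not registered


def urn_for(data):
    """Derive the (assumed) content-hash URN for a byte string."""
    digest = hashlib.sha1(data).digest()
    return PREFIX + base64.b32encode(digest).decode("ascii").lower()


def verify_block(urn, data):
    """True only if the bytes received are the ones the URN refers to."""
    return urn.startswith(PREFIX) and urn == urn_for(data)


original = b"the block the referrer meant"
reference = urn_for(original)

genuine = verify_block(reference, original)
spoofed = verify_block(reference, b"something a malicious peer sent")
# A random identifier offers nothing to recompute, so it can never pass.
random_id = verify_block("urn:urn-5:G7Fj", original)
```

Only `genuine` is true. With a urn-5 style random number there is no function like `urn_for` to run over the received bytes, so any peer could answer with arbitrary data.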

> While you can 
> only deficiently use http to serve a block,

Why?

> you could serve the URIs of 
> both the abstractions (urn:urn-5:G7Fj and urn:urn-5:lG5d) directly using 
> HTTP 1.1 features.

(Again, you'd have to use hashes, or you could be arbitrarily spoofed.)

>>> But am I right that this makes rdf-literals obsolete for everything 
>>> but small decimals?
>>
>> Hm, why? :-)
> 
> well, why use a literal if you can make a block out of it, shortening 
> queries and unifying handling?

Ah, that depends on many factors. Speed is one; you may need to load a 
lot of blocks to get the data for all the literals in a graph. Also, if 
we store each block as a file on a file system, there are some file 
systems that perform badly when faced with a large number of really 
small files.

>>> And how do you split the metadata in blocks
>>
>> Well, depends very much on the application. How do you split metadata 
>> into files? :-)
> 
> Not at all ;-). Splitting into files is rudimentarily represented 
> meta-data, if you use RDF the filesystem is a legacy application.

Um, but if you put metadata on an http server, you split it too?

The rule of thumb is: Split it in units you would want to transfer 
independently. E.g. in Annotea, you would make one block = one 
annotation. When putting email into RDF, you might make one block = one 
email. You might want to put your FOAF data in one block. If you have 
metadata about many documents, you might make a metadata block for each 
document you process. If you publish your personal TV recommendations 
each week, you'd make one block each week.

Of course if the granularity doesn't fit the task at hand-- you want to 
send a friend all love story recommendations of the last year-- the 
computer can split up those blocks automatically and reassemble them in 
a different way. It's just that for many applications a certain 
granularity fits usage patterns pretty well-- for example, you'd most of 
the time transmit an annotation as a whole. Then, if you've downloaded 
an annotation once, you never need to download it again (that's one of 
the benefits of putting them in blocks, you can cache them indefinitely).

>>>> So anyway, there are a number of reasons why we need to do powerful 
>>>> queries over a set of Storm blocks. For example, since we use hashes 
>>>> as the identifiers for blocks, we don't have file names as hints to 
>>>> humans about their content; instead, we'll use RDF metadata, stored 
>>>> in *other* blocks. As a second example, on top of the unchangeable 
>>>> blocks, we need to create a notion of updateable, versioned 
>>>> resources. We do this by creating metadata blocks saying e.g., 
>>>> "Block X is the newest version of resource Y as of 
>>>> 2003-03-20T22:29:25Z" and searching for the newest such statement.
>>>
>>> I don't quite understand: isn't there a regression problem if the 
>>> metadata is itself contained in blocks? Or is at least the timestamp 
>>> of a block something external to the blocks?
>>
>> A metadata block does not usually have a 'second-level' metadata block 
>> with information about the first metadata block, if you mean that;
> 
> Say you want to change the description of an entity, not just add a new 
> one, I think you should tell about another metadata block that it is 
> wrong (in the era starting now ;-)).

"Not usually" just meant that *most* metadata blocks do not have a 
second-level metadata block, in case you were worried that we'd need an 
infinite number of metametametameta blocks otherwise :)

>> no, timestamps are not external to the blocks.
> 
> When the user synchronizes his laptop with the home-pc I guess the 
> metadata may be contradictory, I thought with an external timestamp 
> contradictions could be handled (the newer is the right one). If the 
> timestamp is part of the metadata the application should probably 
> enforce it (while generally giving the user the maximum power to make 
> all sorts of metadata constructs).

The timestamp is on the assertion, "Block X is the newest version of 
resource Y," and it gives the time when the user said X is the current 
version (i.e., when the user saved the document). If the user saves the 
document on the desktop, and then on the laptop, that would be different 
saves, made at different times, so the timestamps wouldn't be 
contradictory: they would simply be the timestamps of two different things.
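
As a sketch of how the "newest such statement" lookup might work: each save produces one assertion, and the current version is simply the one with the latest timestamp. The dictionary layout, the block names and the function names are made up for illustration; only the timestamp format is taken from the example quoted above.

```python
from datetime import datetime, timezone


def parse_timestamp(text):
    """Parse a timestamp such as 2003-03-20T22:29:25Z as UTC."""
    parsed = datetime.strptime(text, "%Y-%m-%dT%H:%M:%SZ")
    return parsed.replace(tzinfo=timezone.utc)


# One assertion per save: "block is the newest version of resource as of ...".
assertions = [
    {"resource": "Y", "block": "block-desktop", "as_of": "2003-03-20T22:29:25Z"},
    {"resource": "Y", "block": "block-laptop", "as_of": "2003-03-21T08:05:00Z"},
    {"resource": "Z", "block": "block-other", "as_of": "2003-03-22T10:00:00Z"},
]


def newest_version(statements, resource):
    """Return the block named by the most recent assertion about resource."""
    candidates = [a for a in statements if a["resource"] == resource]
    if not candidates:
        return None
    latest = max(candidates, key=lambda a: parse_timestamp(a["as_of"]))
    return latest["block"]


current_y = newest_version(assertions, "Y")
```

The two saves of `Y` do not contradict each other; they are two separate statements, and `current_y` is whichever was made last (here `block-laptop`).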

(There's another problem in this scenario, though: If the user edited a 
document independently on desktop and laptop, it wouldn't be nice if the 
version saved later would supersede the other one; rather, the changes 
from both should be merged. We actually use a slightly different system 
for synchronization of independent systems; instead of storing a 
timestamp, we store a list of obsoleted versions... but that's leading 
us astray here :-) )
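
For the curious, one way such a list of obsoleted versions could be used, purely as an assumption about the general idea and not a description of the actual implementation: a version is current if no other known version lists it as obsoleted, and more than one current version means independent edits that still need merging. The function name and version labels are hypothetical.

```python
def current_versions(obsoletes):
    """Versions that no other known version claims to obsolete.

    `obsoletes` maps each version to the set of versions it obsoletes.
    """
    obsoleted = set().union(*obsoletes.values())
    return set(obsoletes) - obsoleted


# Desktop and laptop both edited v1 independently.
known = {
    "v1": set(),
    "v2-desktop": {"v1"},
    "v3-laptop": {"v1"},
}
heads_before_merge = current_versions(known)

# A merged version obsoletes both of them.
known["v4-merged"] = {"v2-desktop", "v3-laptop"}
heads_after_merge = current_versions(known)
```

Before the merge there are two heads (`v2-desktop` and `v3-laptop`), so neither silently supersedes the other; after the merge only `v4-merged` remains.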

- Benja

Received on Tuesday, 25 March 2003 05:23:00 UTC