From: Dan Brickley <danbri@w3.org>
Date: Thu, 6 Apr 2000 06:53:12 -0400 (EDT)
To: "McBride, Brian" <bwm@hplb.hpl.hp.com>
Cc: "'www-rdf-interest@w3.org'" <www-rdf-interest@w3.org>
Hi Brian,

On Thu, 6 Apr 2000, McBride, Brian wrote:

> I've been thinking of implementing digest URIs but I can't get a good
> enough understanding from the mailing list archives to do so, so can
> someone help me out please:

Digest URIs are a proposal floating around on the mailing list; currently they have no status in the W3C RDF specifications. They may nevertheless be a useful technique for implementors; quite how they fit into the RDF picture is still up for discussion. Sergey has some notes at http://WWW-DB.Stanford.EDU/~melnik/rdf/api.html (although he uses a fictional URN scheme for urn:rdf:* identifiers, which I'm not persuaded by).

> 1) Which entities can have digest URIs?

This would be determined by a combination of the top-level URI scheme used (uuid:, foobar:, http:, doi:, handle:, etc.) and the policies operating over the subset of that URI space (e.g. urn:rdf: etc., if we had URNs) for naming entities.

I think the main proposal was to use computed URIs for 'RDF models' based on the abstract contents of the graph. I suspect further work is needed here on canonicalising the graph representation (e.g. treatment of language-tagged content). There was also some discussion of computed URIs for 'anonymous' or so-called 'no-name' resources, i.e. nodes that are 'mentioned in passing' in a chunk of XML/RDF without their URI being included in the markup. A related approach would be to use these digests as properties of resources instead of as identifiers...

> 2) What are they for?

Does the above help? Briefly, RDF applications benefit from a data model that allows for aggregation of data from multiple sources. Since we (try to) use URIs for node identifiers, RDF allows us to aggregate data simply by joining on uniquely identified nodes. So the idea behind using digests is that we can do more data joins, and therefore better data aggregation.

> 3) What does a digest URI denote?

I'm not aware of a specific URI scheme proposal for these, so I can't comment on this one.

> 4) What properties do they need to have?
>
> 5) I understand there is an algorithm for computing them given an RDF
> syntax representation of a model.

There are algorithms for doing this sort of thing given any blob of XML markup. I took Sergey's proposal to be operating over the abstract RDF data model:

    Currently, the model digest is an XOR of triple digests. A triple
    digest is computed as XOR with rotation on digests of the
    predicate, subject and object. This approach provides a
    straightforward way of digitally signing RDF content (as opposed
    to signing serialized RDF), facilitating the "Web of Trust"...

> Given a model stored in a database, I could serialise that many
> different ways. How do I compute digest URIs for a model stored in a
> database, or is that an unnecessary thing to do?

The 'triple digest' seems to be Sergey's approach. I'm not sure what we'd do about implied arcs in the graph, language tagging etc. to ensure we had a canonicalisation strategy before computing the triple and model digests. In other words, two models could be RDF-model-equivalent but have trivial differences in their actual storage (missing but implied rdf:type arcs, variations in the representation of XML literals, xml:lang etc.), giving them different triple/model digests.

Does this help? Sergey, was this a fair characterisation?
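For concreteness, here's a rough sketch in Python of the XOR-with-rotation scheme as I read it. It is untested, and the choice of MD5, the rotation amounts, and the node-to-string mapping are my own assumptions rather than part of Sergey's proposal:

    import hashlib

    DIGEST_BITS = 128                 # MD5-sized digests; any fixed-width hash would do
    MASK = (1 << DIGEST_BITS) - 1

    def node_digest(node):
        # Assumes each node has already been canonicalised to a string
        # (URI or literal value); that mapping is itself an open question.
        return int.from_bytes(hashlib.md5(node.encode('utf-8')).digest(), 'big')

    def rotate(x, n):
        # Rotate a DIGEST_BITS-wide integer left by n bits.
        n %= DIGEST_BITS
        return ((x << n) | (x >> (DIGEST_BITS - n))) & MASK

    def triple_digest(subject, predicate, obj):
        # "XOR with rotation": rotating each position by a different
        # amount keeps the digest sensitive to node position, so
        # (s, p, o) and (o, p, s) hash differently.
        return (node_digest(subject)
                ^ rotate(node_digest(predicate), 8)
                ^ rotate(node_digest(obj), 16))

    def model_digest(triples):
        # XOR over all triple digests: order-independent, so two stores
        # holding the same set of triples agree on the digest (provided
        # the triples themselves have been canonicalised).
        d = 0
        for s, p, o in triples:
            d ^= triple_digest(s, p, o)
        return d

    # Hypothetical example data:
    triples = [('http://example.org/doc',
                'http://purl.org/dc/elements/1.1/creator',
                'danbri')]
    print(hex(model_digest(triples)))

Note that XOR makes the model digest independent of triple ordering, which fits the set-of-triples model nicely, though it also means a triple asserted twice cancels itself out, and none of this addresses the canonicalisation problems above.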
There's a paper by Clifford Lynch in D-Lib Magazine, September 1999, that touches on similar areas; in particular, the issue of different layers of representation. We'd need to give some careful thought to canonicalisation of literal XML data, for example...

    http://www.dlib.org/dlib/september99/09lynch.html
    Canonicalization: A Fundamental Tool to Facilitate Preservation
    and Management of Digital Information

Brief excerpts:

    [...] For example, UNICODE, which is the underlying character set
    for a growing number of current storage standards, allows multiple
    bit streams to represent the same stream of logical characters. Any
    canonicalization of objects that include data represented in
    UNICODE must enforce a standard choice of encoding on that data.
    [...] Canonicalizations for other types of digital objects that
    have less clear formal models would seem to be a likely near term
    research area. For example, is it reasonable to think about an
    RTF-based or ASCII-based canonicalization for certain types of
    documents, or even about a hierarchy of such canonicalizations,
    with algorithms higher up in the hierarchy capturing more of the
    document's intrinsic meaning? This is likely to be difficult [...]
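To make the UNICODE point concrete: the same logical string can arrive as different codepoint sequences (a precomposed 'é' versus 'e' followed by a combining accent, say), so any literal canonicalisation step would have to normalise text before hashing it. A minimal sketch; the choice of normalisation form C here is my own assumption about what a canonicaliser might pick:

    import unicodedata

    def canonical_literal(text):
        # Normalise to one Unicode form (NFC) so that logically
        # identical literals produce identical byte streams.
        return unicodedata.normalize('NFC', text).encode('utf-8')

    a = 'caf\u00e9'    # precomposed e-acute, U+00E9
    b = 'cafe\u0301'   # 'e' followed by combining acute, U+0301
    assert a != b                                        # different codepoints...
    assert canonical_literal(a) == canonical_literal(b)  # ...same canonical bytes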
Dan

> [My assumption here is that the model was constructed in the database,
> and was not derived from a serialised RDF input stream]
>
> Brian McBride
> HPLabs

Received on Thursday, 6 April 2000 06:54:14 UTC