Re: Arguments against digest URIs from Sergey Melnik on 2000-01-19 (www-rdf-interest@w3.org from January 2000)

From: Sergey Melnik <melnik@db.stanford.edu>
Date: Wed, 19 Jan 2000 15:12:40 -0800
To: Jonas Liljegren <jonas@paranormal.o.se>
CC: RDF Interest Group <www-rdf-interest@w3.org>
Message-ID: <388644E8.B6050523@db.stanford.edu>

Jonas Liljegren wrote:
> 
> It's about digest URIs. There have come up a number of considerations
> against the use of digest URIs. Not only digest URIs. But any kind of
> algorithm for common URIs. That includes the x-pointer suggestion.
> 
> I was going to write a long summary of the issues and arguments for
> and against digest URIs. But I'm not up to it. So I just list the
> things that comes to mind.

That looks rather like a long summary! ;)

> The three things that needs a calculated URI is:
>   * model URI
>   * statement URI
>   * anonymous resource URI

(1) anonymous resource URI: definitely true
(2) statement URI: I believe so
(3) model URI: an optional goody

> I think that digest URIs is not the complete solution for the problems
> it tries to confront. A complete solution still has to incorporate
> more layers of metadata. It's better to just don't have digest URIs.

An incremental attempt to find the solution is better than nothing at
all ;) BTW, let me reiterate what the rationales for digest URIs are
in the first place.

(1) We need to refer to anonymous resources used by other people (or the
things that they represent). Currently, every RDF parser generates a
different URI for an anonymous resource, thus the only way for a third
party to speak about this resource would be to reify the context where
this resource was used. Under unlucky circumstances, you'd have to cite
the whole document (model). This is a very verbose solution and it
indeed requires an additional layer of metadata. That seems like overkill
to me. IMO, standard digests solve this problem in an elegant way.

(2) Statement URIs: a certain consensus has been reached on this list
that reified statements can adequately be treated as having unique,
context-free URIs. A cryptographic digest is just a convenient
abbreviation for these URIs.
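
To make point (2) concrete, here is a minimal Python sketch of a
context-free statement URI. It is not the algorithm under discussion:
the "urn:example:sha1:" prefix, the field order, the length prefixes and
the example.org names are all assumptions made for illustration.

```python
import hashlib

def _field(text: str) -> bytes:
    # Length-prefix each field so that ("ab", "c") and ("a", "bc")
    # cannot produce the same byte string.
    data = text.encode("utf-8")
    return str(len(data)).encode("ascii") + b":" + data

def statement_uri(subject: str, predicate: str, obj: str,
                  obj_is_literal: bool = False) -> str:
    # Hypothetical scheme: hash predicate, subject and object in a fixed
    # order, marking whether the object is a literal or a resource.
    kind = b"L" if obj_is_literal else b"R"
    payload = _field(predicate) + _field(subject) + kind + _field(obj)
    return "urn:example:sha1:" + hashlib.sha1(payload).hexdigest()

# Two parties computing the URI for the same statement, independently.
uri_a = statement_uri("http://example.org/people#RalphSwick",
                      "http://example.org/terms#worksAt",
                      "W3C", obj_is_literal=True)
uri_b = statement_uri("http://example.org/people#RalphSwick",
                      "http://example.org/terms#worksAt",
                      "W3C", obj_is_literal=True)
print(uri_a == uri_b)
```

Because the URI depends only on the three parts of the statement, anyone
who knows the statement can derive the same name without coordination.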

(3) Model URIs: having a digest for a model allows signing a set of
statements using public key technology. Note that given an algorithm to
compute a digest of a model, we sign the content itself rather than its
representation in some serialization syntax. I believe model digests
are a powerful lever for the Web of Trust.
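
As a rough illustration of point (3), the sketch below derives a model
URI from the set of statements rather than from any serialized form. The
combination rule (sort the per-statement digests, then hash their
concatenation) and the "urn:example:model-sha1:" prefix are assumptions,
not the algorithm discussed in this thread.

```python
import hashlib

def statement_digest(subject: str, predicate: str, obj: str) -> bytes:
    # Hash the three parts with a 4-byte length prefix on each, so that
    # field boundaries are unambiguous.
    h = hashlib.sha1()
    for part in (subject, predicate, obj):
        data = part.encode("utf-8")
        h.update(len(data).to_bytes(4, "big"))
        h.update(data)
    return h.digest()

def model_uri(statements) -> str:
    # A model is treated as a set: duplicates collapse and the order in
    # which statements were parsed does not matter.
    digests = sorted({statement_digest(*t) for t in statements})
    return "urn:example:model-sha1:" + hashlib.sha1(b"".join(digests)).hexdigest()

t1 = ("ex:Ralph", "ex:worksAt", "W3C")
t2 = ("ex:W3C", "ex:locatedAt", "MIT")
print(model_uri([t1, t2]) == model_uri([t2, t1]))
```

A signature computed over such a digest would cover the statements
themselves, whatever syntax they were shipped in.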

Now let me briefly address some of the issues you raised:

>  Higher threshold for implementation
>  -----------------------------------
> 
> There will hopefully be many implementations of RDF. Some will just be
> able to read a specifik form of the XML serialization. Other will be
> more generic. There is a point in not requireing too much from an
> implementation. MD5 or SHA-1 is maby not that hard to use, (there are
> support for both in Perl modules,) but it does limit the ways to
> implement RDF for a specific purpose.

Note that the digest algorithm for anonymous resources depends on the
serialization syntax, whereas statement URIs and model URIs do not. This
means that an application that uses a very simple and straightforward
serialization (without anonymous resources) need not know anything about
digests. Statement URIs and model URIs are still character strings,
aren't they?

> And you can't depend on digest URIs if not everyone is using them.

True. Hard to argue against it. If there are no standards,
interoperation is impossible.

>  URI aliases
>  -----------
> 
> What about URI aliases? Two URIs could be used do denote the same
> thing.  Persons often have a diffrent identifier for every membership
> register. There will have to be ways to express the relationships
> between resources, regardless of if it's about the same sort of
> statement, the same model or the same thing.
> 
> It's not enough to have a common algorithm to give unique identifiers
> for anonymous resources. You will still have to be able to say that two
> URIs is aliases for the same resource. So why not use this handling
> of aliases to handle other cases there you want to say that one URI
> for, say, a model is an alias for another URI.

This is correct and is definitely a requirement. Let me elaborate the
point I made above to clarify the problem. Imagine someone stated
something about an anonymous resource, say A, mentioned on his/her RDF
page. No doubt, you can pick some unused name, say B, use it throughout
your descriptions and state that B is equivalent to A. How do you refer
to A to state the equivalence? The only way to do that would be to say
"a resource used in this and that particular context". For example, A
could be "a resource that has an anonymous dc:Creator X, which belongs
to an anonymous organization Y, which is labelled 'W3C'". For anonymous
resources, you'd have to find the complete information (context) needed
to fully characterize it. This is exactly what the digest algorithm does
in a transparent fashion. So you don't have to quote the whole thing.
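
The idea of folding the context into a name can be sketched as follows.
This is only one conceivable construction, written for illustration; it
is not the algorithm referred to above (which works from the
serialization syntax), and the nested-dict representation, the property
names and the "urn:example:anon-sha1:" prefix are all hypothetical. It
also assumes the description contains no cycles.

```python
import hashlib

def describe(node) -> bytes:
    # A node is either a string (a URI or a literal) or a dict mapping
    # property names to nodes, standing for an anonymous resource.
    if isinstance(node, str):
        return b"V" + hashlib.sha1(node.encode("utf-8")).digest()
    h = hashlib.sha1(b"A")
    for prop in sorted(node):  # sorted, so the order of properties is irrelevant
        h.update(hashlib.sha1(prop.encode("utf-8")).digest())
        h.update(describe(node[prop]))
    return b"A" + h.digest()

def anonymous_uri(node: dict) -> str:
    # Drop the one-byte tag and render the 20-byte digest as hex.
    return "urn:example:anon-sha1:" + describe(node)[1:].hex()

# "A resource that has an anonymous dc:Creator X, which belongs to an
# anonymous organization Y, which is labelled 'W3C'".
resource_a = {"dc:Creator": {"ex:belongsTo": {"rdfs:label": "W3C"}}}
print(anonymous_uri(resource_a))
```

A third party who sees the same description arrives at the same URI and
can state "B is equivalent to A" without quoting the context.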

>  Value equivalence
>  -----------------
> 
> The digest URI for a triple is calculated on the actual string of
> bytes for the literal part. But the literal could be encoded in
> diffrent formats. Unicode, Latin1, or others.

Again, a case for standardization. In the current algorithm, the Unicode
representation is used.
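
A small Python illustration of why the encoding has to be fixed before
hashing. The choice of UTF-8 as "the Unicode representation" and the
helper name literal_digest are assumptions for this sketch.

```python
import hashlib

name = "Malm\u00f6"  # a literal containing one non-ASCII character

# The same literal as raw bytes in two different encodings.
latin1_bytes = name.encode("latin-1")
utf8_bytes = name.encode("utf-8")

# Hashing the raw bytes gives two different digests for one literal.
raw_differ = hashlib.sha1(latin1_bytes).digest() != hashlib.sha1(utf8_bytes).digest()

def literal_digest(raw: bytes, source_encoding: str) -> str:
    # Decode to Unicode text first, then hash one agreed encoding.
    text = raw.decode(source_encoding)
    return hashlib.sha1(text.encode("utf-8")).hexdigest()

print(raw_differ)
print(literal_digest(latin1_bytes, "latin-1") == literal_digest(utf8_bytes, "utf-8"))
```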

> The object of a statement could be considered the same even if it is
> diffrent in the byte for byte way. If the object is a person, it could
> be a literal with the name. But the name could be written in a couple
> of diffrent forms. It could also be an URI for the person, or an
> anonymous URI to a resource that specify the person by describing the
> first and last name separately, and maby giving them a type arc each.
> 
> You will have to be able to specify their equivalence. If you have a
> rule for digest URIs, you would have that way, and on top of that the
> more complete way to express equivalence. So why not skip the digest
> way and go for the complete soulution? (The complete solution would be
> to introduce more statements, containing metadata about the
> resources.)

Digest URIs are not meant to provide a general solution for specifying
or computing the equivalence of resources.

>  Not realy unique
>  ----------------
> 
> A digest is not guaranteed to be unique. There are a theoretical
> chanse that two diffrent things will get the same URI.  There would
> still have to be an extra layer for determining URI equivalence.

Legal issues are out of scope. For most other practical purposes,
a 160-bit (or X-bit) hash seems to be a good approximation of uniqueness.
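
For a sense of scale, the usual birthday bound can be evaluated directly.
The sketch assumes the hash behaves like a uniformly random function and
covers accidental collisions only, not deliberately constructed ones; the
figure of 10**12 items is an arbitrary example.

```python
def collision_probability(n_items: int, bits: int = 160) -> float:
    # Union (birthday) bound: at most n(n-1)/2 pairs, each colliding
    # with probability 2**-bits.
    return n_items * (n_items - 1) / 2 ** (bits + 1)

# Upper bound on the chance of any collision among a trillion digests.
p = collision_probability(10**12)
print(p)
```

Under these assumptions the bound for a trillion 160-bit digests is on
the order of 10^-25.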

>  The nature of the statement
>  ---------------------------
> 
> In a reification of a statement, every reification should be handled
> separately, as separate events. They have properties like source,
> time, probability and context of statement. Even if the statement in
> itself would have a unique URI, there would have to be separate URIs
> for every stating event.

I disagree with that. See also
http://lists.w3.org/Archives/Public/www-rdf-interest/1999Dec/0070.html

>  Version handling
>  ----------------
> 
> Statements, resources, literals and models will come in diffrent
> versions.  Some versions will be chronological. Other will be
> variations of the content, like different languages or different
> target groups.
> 
> There is many ways to handle new versions. Many applications would
> like to keep a statement URI, even if the object part of it changes.
> They would often like to keep the URI of a resource, even if its
> content changed. They would like to keep the URI of the model, even if
> new statements would be added.
> 
> Some applications would like to handle a history of versions, of
> statements in different times. Others would only concern temself with
> the present.
> 
> The use of digest URIs for statements and models will force every
> application to deal with history, and to deal with it in a way that
> could be incompatible with what is needed. I think that it would be
> better to let the version handling be a separate layer, that could be
> included or excluded, and that could evolve by itself to meet the
> needs.

Digest-based model URIs provide a way to refer to the RDF content
directly, rather than to the location of its serialization. No force,
please! ;) One can still refer to an RDF document using a URL. But its
contents may have changed... A version handling mechanism can be easily
built on top of digest-based URIs and URLs of models.
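
One way such a layer might look, as a sketch only: keep, for each
document URL, the list of content digests seen there over time. The
function names, the dictionary layout and the "urn:example:model-sha1:"
prefix are hypothetical, and model_uri is a simplified stand-in rather
than the algorithm discussed here.

```python
import hashlib

def model_uri(statements) -> str:
    # Simplified stand-in: hash the sorted, de-duplicated statements.
    h = hashlib.sha1()
    for triple in sorted(set(statements)):
        for part in triple:
            data = part.encode("utf-8")
            h.update(len(data).to_bytes(4, "big"))
            h.update(data)
    return "urn:example:model-sha1:" + h.hexdigest()

def record_version(history: dict, url: str, statements) -> str:
    # Append the content digest to the URL's history if it changed.
    uri = model_uri(statements)
    versions = history.setdefault(url, [])
    if not versions or versions[-1] != uri:
        versions.append(uri)
    return uri

history = {}
url = "http://example.org/staff.rdf"
v1 = record_version(history, url, [("ex:Ralph", "ex:worksAt", "W3C")])
v2 = record_version(history, url, [("ex:Ralph", "ex:worksAt", "W3C"),
                                   ("ex:Ralph", "ex:name", "Ralph Swick")])
print(len(history[url]))
```

The URL keeps naming "whatever is at this location", while each digest
URI keeps naming one fixed set of statements.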

>  Open / closed models
>  --------------------
> 
> How will you maintain metadata about a model, with digest URIs?  The
> metadata would have to be linked to the model. But every change in the
> model would modify the model URI.

This is exactly the intention. There is a difference between saying "I
trust the fact that Ralph Swick works at W3C" and saying "I trust
whatever information is contained in this page". The latter may be
appropriate for many cases, though.

>  Statements as models
>  --------------------
> 
> A model is a group of statements. we could reify a single statement,
> but you would maby more often like to say somethng about a group of
> statements.  This group could be given a explicit URI.  That would be
> the same thing as to give a explicit URI to a model. The grouping of
> the statement could be done on one site and used on other sites.
> 
> The handling of those things is something that belongs on a higher
> level. It's not something to be handled with digest URIs.

Why not? This is exactly how it works: you drop a bunch of statements
into an empty model and compute its digest-based URI. That gives an
explicit URI for a group of statements. I don't see any contradiction in
that.

>  ---  I have not summarized this in the they I intended from the
>       start.  The feeling of destroying the work of Sergey Melnik made
>       me loose my spirit...

Thanks for your comments and your concern, I promise not to erase my
code in despair ;)

> But... I suggest that we just skip this unique URI concern. The
> problems of aliases and version handling is a topic for another
> day. Not something that should go into either the core API nor the
> schema layer.
> 
> The generated URIs should all be based on your own namespace,
> guaranteed to be unique.

An application that does not care about the uniqueness etc. of generated
URIs need not bother. On the other hand, implementing this mechanism
lets other developers evaluate its usefulness in different application
scenarios.

Best,
Sergey
Received on Wednesday, 19 January 2000 18:05:44 UTC