Arguments against digest URIs from Jonas Liljegren on 2000-01-02 (www-rdf-interest@w3.org from January 2000)

From: Jonas Liljegren <jonas@paranormal.o.se>
Date: Sun, 02 Jan 2000 21:25:11 +0100
To: RDF Intrest Group <www-rdf-interest@w3.org>
CC: Sergey Melnik <melnik@DB.Stanford.EDU>
Message-ID: <386FB427.A3D2DBEF@paranormal.o.se>
I haven't read the list since December 18th. Some holidays got in
the way.  But more than that: I didn't want to write this text.
Sergey Melnik has done so much work...


It's about digest URIs. A number of considerations have come up
against the use of digest URIs, and not only digest URIs but any kind
of algorithm for common URIs. That includes the x-pointer suggestion.


I was going to write a long summary of the issues and arguments for
and against digest URIs. But I'm not up to it. So I'll just list the
things that come to mind.


The three things that need a calculated URI are:
  * model URI
  * statement URI
  * anonymous resource URI



I think that digest URIs are not the complete solution for the problems
they try to confront. A complete solution still has to incorporate
more layers of metadata. It's better to just not have digest URIs.



 Higher threshold for implementation
 -----------------------------------

There will hopefully be many implementations of RDF. Some will just be
able to read a specific form of the XML serialization. Others will be
more generic. There is a point in not requiring too much from an
implementation. MD5 or SHA-1 is maybe not that hard to use (there is
support for both in Perl modules), but it does limit the ways to
implement RDF for a specific purpose.

And you can't depend on digest URIs if not everyone is using them.


 URI aliases
 -----------

What about URI aliases? Two URIs could be used to denote the same
thing.  A person often has a different identifier in every membership
register. There will have to be ways to express the relationships
between resources, regardless of whether it's about the same sort of
statement, the same model or the same thing.

It's not enough to have a common algorithm to give unique identifiers
for anonymous resources. You will still have to be able to say that two
URIs are aliases for the same resource. So why not use this handling
of aliases to handle other cases where you want to say that one URI
for, say, a model is an alias for another URI?


 Value equivalence
 -----------------

The digest URI for a triple is calculated on the actual string of
bytes for the literal part. But the literal could be encoded in
different formats: Unicode, Latin-1, or others.
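
To illustrate the encoding problem, here is a minimal Python sketch.
It assumes, purely for illustration, that the digest is SHA-1 over the
raw bytes of the literal and that the URI has the form urn:sha1:<hex>.
Neither detail is taken from the actual digest URI proposal, and the
function name is hypothetical.

```python
import hashlib

def literal_digest_uri(literal: str, encoding: str) -> str:
    """Hypothetical digest URI: SHA-1 over the encoded bytes of a literal."""
    data = literal.encode(encoding)
    return "urn:sha1:" + hashlib.sha1(data).hexdigest()

name = "G\u00f6ran"  # the same abstract string in both cases
uri_utf8 = literal_digest_uri(name, "utf-8")
uri_latin1 = literal_digest_uri(name, "latin-1")

# The byte strings differ (o-umlaut is two bytes in UTF-8 and one byte
# in Latin-1), so the two digest URIs differ as well.
print(uri_utf8 == uri_latin1)  # False
```

A pure ASCII literal would give the same URI under both encodings, so
the problem only shows up for some values.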

The object of a statement could be considered the same even if it is
different byte for byte. If the object is a person, it could
be a literal with the name. But the name could be written in a couple
of different forms. It could also be a URI for the person, or an
anonymous URI for a resource that specifies the person by describing the
first and last name separately, and maybe giving them a type arc each.

You will have to be able to specify their equivalence. If you have a
rule for digest URIs, you would have that way, and on top of that the
more complete way to express equivalence. So why not skip the digest
way and go for the complete solution? (The complete solution would be
to introduce more statements, containing metadata about the
resources.)


 Not really unique
 -----------------

A digest is not guaranteed to be unique. There is a theoretical
chance that two different things will get the same URI.  There would
still have to be an extra layer for determining URI equivalence.
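
How large that theoretical chance is can be estimated with the usual
birthday-bound approximation. The sketch below assumes an idealized
digest whose values are uniformly distributed and ignores deliberate
attacks; the function name is hypothetical.

```python
import math

def collision_probability(n_items: int, digest_bits: int) -> float:
    """Birthday-bound approximation: p ~ 1 - exp(-n(n-1) / 2^(bits+1))."""
    exponent = n_items * (n_items - 1) / 2 ** (digest_bits + 1)
    # expm1 keeps precision when the exponent is extremely small.
    return -math.expm1(-exponent)

print(collision_probability(10**12, 128))  # MD5-sized digest
print(collision_probability(10**12, 160))  # SHA-1-sized digest
```

Under these assumptions the chance is very small for realistic numbers
of resources, but it is not zero.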


 The nature of the statement
 ---------------------------

In a reification of a statement, every reification should be handled
separately, as separate events. They have properties like source,
time, probability and context of statement. Even if the statement in
itself would have a unique URI, there would have to be separate URIs
for every stating event.

So why not add a little more data about the statement, and use that data
for handling equivalence between two statements? Equivalence could in
general be determined by examining the subject, predicate and object
properties, regardless of the URI representing the statement.
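
A minimal sketch of that idea in Python follows. The Reification class,
its field names and the example.org URIs are hypothetical and only
serve the illustration; they are not part of any RDF API.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Reification:
    uri: str        # URI of this particular stating event
    subject: str
    predicate: str
    object: str
    source: str     # who made the statement

def same_statement(a: Reification, b: Reification) -> bool:
    """Compare the stated triple only, ignoring the URIs of the events."""
    return (a.subject, a.predicate, a.object) == (b.subject, b.predicate, b.object)

first = Reification("http://example.org/site-a/st1",
                    "http://example.org/doc", "http://example.org/terms#author",
                    "Jonas", source="http://example.org/site-a")
second = Reification("http://example.org/site-b/st7",
                     "http://example.org/doc", "http://example.org/terms#author",
                     "Jonas", source="http://example.org/site-b")

print(first.uri == second.uri)        # False: two separate stating events
print(same_statement(first, second))  # True: same subject, predicate, object
```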


 URIs can be unknown
 -------------------

Many things could have official URIs. There would be cases where those
are not used in the XML serialization. This would result in the parser
or serializer generating an unofficial URI for that anonymous/unknown
resource. That would lead to two URIs for the same thing.

There will often be temporary URIs. They could also be used in queries
to denote an unknown entity that you would like to find a more proper
URI for.

An application with the ability to handle this will also be able to
handle URIs generated from XML serializations with anonymous
resources.  Thus, there is no need for a special algorithm for
generating the URIs for the anonymous resources.


 Version handling
 ----------------

Statements, resources, literals and models will come in different
versions.  Some versions will be chronological. Others will be
variations of the content, like different languages or different
target groups.

There are many ways to handle new versions. Many applications would
like to keep a statement URI, even if the object part of it changes.
They would often like to keep the URI of a resource, even if its
content changes. They would like to keep the URI of the model, even if
new statements are added.

Some applications would like to handle a history of versions, of
statements at different times. Others would only concern themselves with
the present.

The use of digest URIs for statements and models will force every
application to deal with history, and to deal with it in a way that
could be incompatible with what is needed. I think that it would be
better to let the version handling be a separate layer, that could be
included or excluded, and that could evolve by itself to meet the
needs.


 Open / closed models
 --------------------

How will you maintain metadata about a model, with digest URIs?  The
metadata would have to be linked to the model. But every change in the
model would modify the model URI.  The metadata would point to a
nonexistent resource. It would be even harder to embed the metadata in
the model itself. The metadata would depend on the model URI and the
model URI would depend on the metadata.
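
The circular dependency can be shown with a small Python sketch. It
assumes, for illustration only, that a model URI is SHA-1 over the
sorted triples joined by newlines; the real proposal may canonicalize
differently. The function name, the urn:sha1: form and the example.org
URIs are hypothetical.

```python
import hashlib

def model_digest_uri(statements) -> str:
    """Hypothetical model URI: SHA-1 over the sorted, newline-joined triples."""
    canonical = "\n".join(sorted(" ".join(triple) for triple in statements))
    return "urn:sha1:" + hashlib.sha1(canonical.encode("utf-8")).hexdigest()

model = {
    ("http://example.org/doc", "http://example.org/terms#author", "Jonas"),
}
uri_before = model_digest_uri(model)

# Metadata about the model, stated inside the model itself.
model.add((uri_before, "http://example.org/terms#created", "2000-01-02"))
uri_after = model_digest_uri(model)

# Adding the metadata changed the model, so the model URI changed too.
# The metadata triple still has the old URI as its subject.
print(uri_before == uri_after)  # False
```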


 Statements as models
 --------------------

A model is a group of statements. You could reify a single statement,
but you would maybe more often like to say something about a group of
statements.  This group could be given an explicit URI.  That would be
the same thing as giving an explicit URI to a model. The grouping of
the statements could be done on one site and used on other sites.

The handling of those things is something that belongs on a higher
level. It's not something to be handled with digest URIs.

 


 ---  I have not summarized this in the way I intended from the
      start.  The feeling of destroying the work of Sergey Melnik made
      me lose my spirit...


But... I suggest that we just skip this unique URI concern. The
problems of aliases and version handling are a topic for another
day. Not something that should go into either the core API or the
schema layer. 

The generated URIs should all be based on your own namespace,
guaranteed to be unique.
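
A minimal sketch of such a generator, assuming a simple counter under a
namespace that you control. The UriGenerator class and the example.org
namespace are hypothetical. Uniqueness here only holds within one
generator instance; persisting the counter between runs is left out.

```python
import itertools

class UriGenerator:
    """Hands out URIs under one namespace owned by the application."""

    def __init__(self, namespace: str):
        self.namespace = namespace
        self._counter = itertools.count(1)

    def new_uri(self, kind: str = "anon") -> str:
        # One shared counter, so no two generated URIs are ever equal.
        return f"{self.namespace}{kind}/{next(self._counter)}"

gen = UriGenerator("http://example.org/rdf/gen/")
a = gen.new_uri()
b = gen.new_uri("statement")
print(a)  # http://example.org/rdf/gen/anon/1
print(b)  # http://example.org/rdf/gen/statement/2
```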




(And now a hundred new emails await reading... :)

-- 
/ Jonas  -  http://paranormal.o.se/perl/proj/rdf/schema_editor/
Received on Sunday, 2 January 2000 15:25:44 UTC