Analysis of Anon Resources (long) from McBride, Brian on 2000-09-08 (www-rdf-interest@w3.org from September 2000)

From: McBride, Brian <bwm@hplb.hpl.hp.com>
Date: Fri, 8 Sep 2000 08:21:39 +0100
To: "RDF Interest (E-mail)" <www-rdf-interest@w3.org>
Message-ID: <5E13A1874524D411A876006008CD059FE7DA08@0-mail-1.hpl.hp.com>
This is an attempt to describe some of the issues surrounding
anonymous resources in the RDF models.  I am going to try to
set out the issues that I can see as clearly as I can,
and as far as possible, make no judgements about them.  This is
off the top of my head; it is not a summary of previous
discussions, though it clearly draws on some of them.

It has been suggested recently that there are four 'models' to
consider when we discuss RDF:

  o the abstract model, sometimes called the data model, or
    just the model.
  o the graphical model
  o the triple model
  o the xml serialization

I think of these as not being peers, but that the abstract
model is primary - it is THE rdf model, and the others are
representations of that model in different languages.  
I say this not to assert that this is the correct way to think
of things, but more to make my assumptions explicit.

Equivalence between different representations is determined
by whether they represent the same abstract model. The RDF
Model and Syntax spec provides no formal specification of
a language for representing triples. 

I hope the following description of the abstract model, so
far as it goes, will be common ground:

An RDF model is a directed graph.  It contains nodes
connected by directed arcs, i.e. arcs that have a specific
source and destination node.  The source node of an arc must
come from a set I will here call R.  The destination node 
of an arc must come from either the set called R or the set
called Literals.  Arcs are always labelled with a URI.
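
As a minimal sketch of the description above, the constraints on an arc's
source, destination and label might be written as follows.  The names
Graph, Literal, add_resource and add_arc are illustrative assumptions, not
taken from the M&S spec, and members of R are treated as opaque handles so
that the question of whether each one has a URI is left open.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Literal:
    """A member of the set Literals (hypothetical wrapper type)."""
    value: str


@dataclass
class Graph:
    """Directed graph whose arcs run from R to either R or Literals."""
    resources: set = field(default_factory=set)  # the set R, as opaque handles
    arcs: set = field(default_factory=set)       # (source, label URI, destination)

    def add_resource(self, node):
        self.resources.add(node)
        return node

    def add_arc(self, source, label_uri, dest):
        # The source node of an arc must come from R.
        if source not in self.resources:
            raise ValueError("source of an arc must be a member of R")
        # The destination must come from R or from Literals.
        if not isinstance(dest, Literal) and dest not in self.resources:
            raise ValueError("destination must be a member of R or a Literal")
        # Arcs are always labelled with a URI (only checked as a non-empty string).
        if not isinstance(label_uri, str) or not label_uri:
            raise ValueError("arcs are always labelled with a URI")
        self.arcs.add((source, label_uri, dest))


g = Graph()
foo = g.add_resource("http://example.org/foo")
g.add_arc(foo, "http://example.org/p1", Literal("bar1"))
```

A literal can be a destination but never a source, which is the only
asymmetry the description imposes.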

The issue at question here is whether all members of the set
R have a URI.

The model and syntax specification is at best unclear and at
worst inconsistent on this question.  Section 2.1 states 
"Resources are always named by URIs plus optional anchor
ids".  Section 2.1.1 has a graphical representation of a
model containing a node with no URI, and there are frequent
references to anonymous resources throughout the text.

It is therefore futile to try to resolve this question by
referring to individual portions of the specification.

How, then, can this question be answered?  There seem to me
to be the following options:

  o we can consider the spec as a whole, identify which 
    parts we think were unclear and reinterpret them to
    create a consistent interpretation.

  o we could ask the original authors what they meant and
    whether they still think that's right.

  o we could come to an independent resolution of what
    would be best.

The rest of this email is a discussion of the possible
solutions and their implications.

Some possible solutions

  o  All members of the set R must be given a URI by an
     application or parser

  o  Remove anonymous resources from the serialization
     - they were a mistake

  o  Invent a new class of URI, not URL's, not URN's
     but a scoped resource name.

  o  Some members of the set R do not have a URI.


All Members of R are Given a URI
================================

Implementation of API's such as I have been working on is
certainly easier.  So I, for one, like that (oops - I'm
not supposed to be judgemental).

Applications have to generate URI's for all resources, even
for insignificant resources such as are used to represent
compound values.

And in particular parsers have to generate URI's for all
anonymous resources they encounter in an XML input stream.  
And here I think is an important point of principle.

Any two parsers reading the same XML serialization should  
produce a representation of the same abstract model, i.e. 
a representation of the same graph.  This requires that
they have the same nodes with the same URI's.

Such generated URI's cannot reasonably be thought of as
URL's - they are not locators.  They must therefore be
URN's and there are some strict requirements on the 
behaviour of URN's i.e. they persist and the same URN
must never be used to represent two different resources
even over time.

How are parsers to do this?  Generating a unique name for 
each anonymous resource in a serialization is not hard.  
They must also ensure that they do not generate the same 
URI for different resources. 

One way to achieve these two requirements is to have the 
generated ID's be a function of the URI the parser used 
to access the XML serialization.  There are some 
difficulties with this.  A parser does not always have a 
URI for the source of the serialization.  The same 
serialisation may be accessed through different URI's - via 
a redirect - and this would result in different models.  
The same serialization might be copied and the copy, having
a different URI, would describe a different model.

These difficulties can be surmounted if there is a way for
a serialization to specify some key or base URI that will 
be used in the generation of anonymous URI's.  This could 
be accomplished without changing the current syntax by 
introducing a processing instruction.
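
A rough sketch of the difference, under stated assumptions: the function
generated_uri, the "urn:x-anon:" prefix, the use of SHA-1 and the
example.org URIs are all hypothetical, chosen only for illustration.  How
a serialization would actually declare its key (e.g. the processing
instruction) is not shown.

```python
import hashlib


def generated_uri(base, index):
    """Name the index-th anonymous resource of a serialization under `base`."""
    digest = hashlib.sha1(f"{base}#{index}".encode("utf-8")).hexdigest()
    return f"urn:x-anon:{digest[:16]}"  # hypothetical naming scheme


original = "http://example.org/data.rdf"
mirror = "http://mirror.example.org/data.rdf"

# The same serialization retrieved through two URIs: the first anonymous
# resource receives two different names, so the parsers disagree.
via_retrieval = (generated_uri(original, 0), generated_uri(mirror, 0))

# The same serialization carrying its own declared key: the name no longer
# depends on where the copy was read from.
declared_key = "http://example.org/keys/data-v1"
via_key = (generated_uri(declared_key, 0), generated_uri(declared_key, 0))
```

Here the two entries of via_retrieval differ while the two entries of
via_key are equal, which is the property the declared key is meant to buy.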

Another desirable feature (requirement?) is that the
URI's generated should not change under some transformations
of the serialization.  For example, if the ordering of the
statements in a serialization were changed in a way that
should not change the model being represented, then
the URI's generated by the parser should not change.
For example:

   <rdf:Description about="http:/foo">
       <bar:p1>
         <rdf:Description>
           <bar:p2>bar1</bar:p2>
         </rdf:Description>
       </bar:p1>
       <bar:p1>
         <rdf:Description>
           <bar:p2>bar2</bar:p2>
         </rdf:Description>
       </bar:p1>
   </rdf:Description>

Does this describe the same model as:

   <rdf:Description about="http:/foo">
       <bar:p1>
         <rdf:Description>
           <bar:p2>bar2</bar:p2>
         </rdf:Description>
       </bar:p1>
       <bar:p1>
         <rdf:Description>
           <bar:p2>bar1</bar:p2>
         </rdf:Description>
       </bar:p1>
   </rdf:Description>

Similarly, the serialization might change by
inserting or deleting parts.  Should the URI's
of those parts of the model unaffected by these
changes be allowed to change?
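
To make the reordering question concrete, here is a small sketch that
names the anonymous resources of the two serializations above in two
different ways.  Both naming schemes (the "anon:" prefix, by_position,
by_content) are assumptions made up for this illustration; neither is
Sergey's algorithm nor anything from the spec.

```python
import hashlib

# The anonymous nodes hanging off http:/foo via bar:p1, in document order.
# Each node is given as its list of (property, literal value) pairs.
doc_a = [[("bar:p2", "bar1")], [("bar:p2", "bar2")]]
doc_b = [[("bar:p2", "bar2")], [("bar:p2", "bar1")]]


def to_triples(doc, name_for):
    """Build the triple set, naming each anonymous node with name_for."""
    triples = set()
    for position, props in enumerate(doc):
        node = name_for(position, props)
        triples.add(("http:/foo", "bar:p1", node))
        for prop, value in props:
            triples.add((node, prop, value))
    return triples


def by_position(position, props):
    # Name depends only on where the node appears in the document.
    return f"anon:{position}"


def by_content(position, props):
    # Name depends only on the node's own properties, not on its position.
    key = repr(sorted(props)).encode("utf-8")
    return "anon:" + hashlib.sha1(key).hexdigest()[:12]


position_stable = to_triples(doc_a, by_position) == to_triples(doc_b, by_position)
content_stable = to_triples(doc_a, by_content) == to_triples(doc_b, by_content)
```

With positional names the two documents yield different graphs; with
content-derived names they yield the same one.  The content-derived scheme
has an obvious weakness of its own: two distinct anonymous nodes with
identical properties would be given the same name.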

It is not trivial to design an algorithm for
generating URI's which would have all these properties.  
Sergey has made good progress in this area, but the
last time I thought about it there was a difficulty
with the algorithm he was using at the time.

There is the issue of persistence over time.  Is a parser
allowed to generate the same URI for different resources
perhaps because the source XML serialization has been
edited?  URI rules would say no, I think.  URN's
definitely say no.  URL's would allow it, but 
it's hard to reconcile these generated names with the
concept of a locator.  Perhaps this problem could be
dealt with by dumping it back on the generator of the
RDF serialization.  If the serialization is changed
in such a way that it might result in the use of the
same URI for different resources, the base for the 
generation of URI's must be changed to a new unique
value.  It might be hard to explain to users when 
they must change the base and why.

Create a New Class of URI with Different Rules
==============================================

i.e. bend the definition of URI as it's getting in the
way.  This may be what DanB had in mind when he
suggested "var:..." format URI's a few months back.

var format URI's relax the URN constraint on persistence.
Within some scope, the definition of which is outside
the understanding of RDF processors, these URI's behave
like URN's.  It is up to the user, or his operations
folks, to manage the use of RDF and RDF processors
so that two uses of the same var to represent different
resources never meet.

Regard Anonymous Resources as a Mistake and Remove Them
=======================================================

This approach forces the generation of URI's back to
the generator of the RDF.  This generator should have
enough application knowledge to generate URI's that
really are URI's.

This approach will presumably break some of the RDF that
is out there.  How big a problem this is, I don't know.
If it would cause a problem for you, raise your hand now.

Some Members of R do not have a URI
===================================

This permits models to be constructed with nodes that
do not have a globally unique identifier.  There 
are entities which we might want to represent in an RDF
model that do not have a 'natural' URI - me, for example.
There are entities which an application designer may prefer
not to give a name to - e.g. compound values such as my
weight.

Applications are no longer forced to construct artificial
URI's for entities which have no natural URI's.

The XML serialization syntax and the graphical presentation
have defined means for representing nodes with no URI.  M&S
gives no formal description of a language for representing
triples, but it does include examples where anonymous nodes
are represented in a triple notation.  It is clear that it
is possible to design a representation of triples which can
distinguish between URI's and other names with a more limited
scope.

A key point to note is that there are graphs with anonymous
nodes that cannot be represented in the current XML syntax.
To ensure equivalence between the abstract model and the
syntax, the syntax must either be extended, or the use of
anonymous nodes in a model constrained.
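
The text above does not say which graphs are meant.  One case that seems
to qualify, offered here as an assumption, is a node with no URI that is
the destination of more than one arc: a nested rdf:Description can appear
under only one property, and without a URI there is nothing for a second
reference to point at.  A sketch of a check for that case follows; the
function name and the "_:" marker for nodes without a URI are illustrative
conventions only.

```python
from collections import Counter


def shared_anonymous_nodes(triples, is_anonymous):
    """Return anonymous nodes that are the destination of more than one arc."""
    counts = Counter(obj for (_subj, _prop, obj) in triples if is_anonymous(obj))
    return sorted(node for node, count in counts.items() if count > 1)


triples = [
    ("http://example.org/a", "ex:p", "_:x"),
    ("http://example.org/b", "ex:q", "_:x"),   # second arc into the same node
    ("_:x", "ex:value", "42"),
    ("http://example.org/a", "ex:r", "_:y"),   # _:y has a single incoming arc
]

shared = shared_anonymous_nodes(triples, lambda node: node.startswith("_:"))
```

In this sample only "_:x" is reported; "_:y" could still be written as a
nested description.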

An RDF processor can track the identity of an anonymous
resource whilst it remains within the processing scope of
that processor.  But if such a resource moves out of the
scope of the processor and comes back in, the processor has
no way to know it is the same resource.

So for example, consider an implementation of an RDF model,
i.e. a collection of statements, which contain references
to anonymous resources.  If such a model were written as
an XML serialization to a file in a way that preserved the
anonymity and then read back in again, the processor would
have no way to tell that the anonymous resources written out
were the same as the anonymous resources being read back in.
Take a simple example: a model with a resource representing me,
and a property linking me to an anonymous resource representing my
weight.  If I write this out to a file, read it back in and add it
to the same model, I end up with two weight properties.  Not
great.
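
The round trip can be sketched as follows.  The Model class, its write and
read methods, the "ex:" property names and the subject URI are all
hypothetical; the serialization is reduced to a list of triples in which
anonymous nodes carry only file-local labels.

```python
import itertools


def is_anon(node):
    return isinstance(node, tuple) and node[0] == "anon"


class Model:
    _fresh = itertools.count()  # source of processor-local identities

    def __init__(self):
        self.triples = set()

    def new_anonymous(self):
        return ("anon", next(Model._fresh))

    def add(self, subj, prop, obj):
        self.triples.add((subj, prop, obj))

    def write(self):
        """Serialize, replacing anonymous handles by file-local labels."""
        labels = {}

        def local(node):
            if is_anon(node):
                return ("local", labels.setdefault(node, len(labels)))
            return node

        ordered = sorted(self.triples, key=repr)
        return [(local(s), p, local(o)) for s, p, o in ordered]

    def read(self, serialized):
        """Add a serialization; each file-local label gets a fresh identity."""
        fresh = {}

        def revive(node):
            if isinstance(node, tuple) and node[0] == "local":
                if node not in fresh:
                    fresh[node] = self.new_anonymous()
                return fresh[node]
            return node

        for s, p, o in serialized:
            self.add(revive(s), p, revive(o))


me = "urn:x-example:brian"          # hypothetical name for the person
model = Model()
weight = model.new_anonymous()
model.add(me, "ex:weight", weight)
model.add(weight, "ex:units", "kg")

model.read(model.write())           # write out, read back into the same model
weights = [o for s, p, o in model.triples if s == me and p == "ex:weight"]
```

After the round trip, weights holds two distinct anonymous nodes, because
nothing in the serialization lets read() recognise the node it wrote.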

Whew!

Brian McBride
HPLabs
Received on Friday, 8 September 2000 03:21:44 UTC