- From: McBride, Brian <bwm@hplb.hpl.hp.com>
- Date: Fri, 8 Sep 2000 08:21:39 +0100
- To: "RDF Interest (E-mail)" <www-rdf-interest@w3.org>
This is an attempt to describe some of the issues surrounding anonymous resources in the RDF models. I am going to try to set out the issues that I can see as clearly as I can and, as far as possible, make no judgements about them. This is off the top of my head; it is not a summary of previous discussions, though it clearly draws on some of them.

It has been suggested recently that there are four 'models' to consider when we discuss RDF:

  o the abstract model, sometimes called the data model, or just the model
  o the graphical model
  o the triple model
  o the XML serialization

I think of these as not being peers, but that the abstract model is primary - it is THE RDF model, and the others are representations of that model in different languages. I say this not to assert that this is the correct way to think of things, but more to make my assumptions explicit. Equivalence between different representations is determined by whether they represent the same abstract model.

The RDF Model and Syntax spec provides no formal specification of a language for representing triples.

I hope the following description of the abstract model, so far as it goes, will be common ground:

An RDF model is a directed graph. It contains nodes connected by directed arcs, i.e. arcs that have a specific source and destination node. The source node of an arc must come from a set I will here call R. The destination node of an arc must come from either the set called R or the set called Literals. Arcs are always labelled with a URI.

The issue at question here is whether all members of the set R have a URI. The model and syntax specification is at best unclear and at worst inconsistent on this question. Section 2.1 states "Resources are always named by URIs plus optional anchor ids". Section 2.1.1 has a graphical representation of a model containing a node with no URI, and there are frequent references to anonymous resources throughout the text.
It is therefore futile to try to resolve this question by referring to individual portions of the specification. How, then, can this question be answered? There seem to me to be the following options:

  o we can consider the spec as a whole, identify which parts we think were unclear and reinterpret them to create a consistent interpretation.
  o we could ask the original authors what they meant and whether they still think that's right.
  o we could come to an independent resolution of what would be best.

The rest of this email is a discussion of the possible solutions and their implications.

Some possible solutions:

  o All members of the set R must be given a URI by an application or parser
  o Remove anonymous resources from the serialization - they were a mistake
  o Invent a new class of URI, not URL's, not URN's, but a scoped resource name
  o Some members of the set R do not have a URI

All Members of R are Given a URI
================================

Implementation of API's such as I have been working on is certainly easier. So I, for one, like that (oops - I'm not supposed to be being judgemental).

Applications have to generate URI's for all resources, even for insignificant resources such as are used to represent compound values. And in particular, parsers have to generate URI's for all anonymous resources they encounter in an XML input stream.

And here I think is an important point of principle. Any two parsers reading the same XML serialization should produce a representation of the same abstract model, i.e. a representation of the same graph. This requires that they have the same nodes with the same URI's.

Such generated URI's cannot reasonably be thought of as URL's - they are not locators. They must therefore be URN's, and there are some strict requirements on the behaviour of URN's, i.e. they persist and the same URN must never be used to represent two different resources, even over time.

How are parsers to do this?
Generating a unique name for each anonymous resource in a serialization is not hard. Parsers must also ensure that they do not generate the same URI for different resources. One way to achieve these two requirements is to have the generated ID's be a function of the URI the parser used to access the XML serialization.

There are some difficulties with this. A parser does not always have a URI for the source of the serialization. The same serialization may be accessed through different URI's - via a redirect - and this would result in different models. The same serialization might be copied and the copy, having a different URI, would describe a different model.

These difficulties can be surmounted if there is a way for a serialization to specify some key or base URI that will be used in the generation of anonymous URI's. This could be accomplished without changing the current syntax by introducing a processing instruction.

Another desirable feature (requirement?) is that the URI's generated should not change under some transformations of the serialization. For example, if the ordering of the statements in a serialization were changed in a way that should not change the model being represented, then the URI's generated by the parser should not change. For example:

  <rdf:Description about="http:/foo">
    <bar:p1>
      <rdf:Description>
        <bar:p2>bar1</bar:p2>
      </rdf:Description>
    </bar:p1>
    <bar:p1>
      <rdf:Description>
        <bar:p2>bar2</bar:p2>
      </rdf:Description>
    </bar:p1>
  </rdf:Description>

Does this describe the same model as the following?

  <rdf:Description about="http:/foo">
    <bar:p1>
      <rdf:Description>
        <bar:p2>bar2</bar:p2>
      </rdf:Description>
    </bar:p1>
    <bar:p1>
      <rdf:Description>
        <bar:p2>bar1</bar:p2>
      </rdf:Description>
    </bar:p1>
  </rdf:Description>

Similarly, the serialization might change by inserting or deleting parts. Should the URI's of those parts of the model unaffected by these changes be allowed to change?
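
As an illustration only, here is a minimal Python sketch of one way a parser might name anonymous resources so that the names survive reordering: each name is a hash of a base key plus the node's own sorted property/value pairs. The `#anon-` naming scheme, the `http://example.org/doc` base key and the helper names `anon_uri` and `triples` are all assumptions made for this sketch; none of them comes from the spec or from any existing parser.

```python
import hashlib


def anon_uri(base, properties):
    """Name an anonymous node from a base key and its (property, value) pairs."""
    # Sort the pairs so that statement order in the serialization has no effect.
    canonical = "\n".join(f"{p} {v}" for p, v in sorted(properties))
    digest = hashlib.sha1(f"{base}\n{canonical}".encode("utf-8")).hexdigest()[:16]
    # Hypothetical naming scheme: base key plus a content-derived fragment.
    return f"{base}#anon-{digest}"


def triples(base, subject, prop, anon_nodes):
    """Return the set of triples linking `subject` via `prop` to each anonymous node."""
    result = set()
    for properties in anon_nodes:
        node = anon_uri(base, properties)
        result.add((subject, prop, node))
        for p, v in properties:
            result.add((node, p, v))
    return result


BASE = "http://example.org/doc"  # hypothetical base key declared by the serialization

# The two orderings of the bar:p1 statements from the example.
first = triples(BASE, "http:/foo", "bar:p1",
                [[("bar:p2", "bar1")], [("bar:p2", "bar2")]])
second = triples(BASE, "http:/foo", "bar:p1",
                 [[("bar:p2", "bar2")], [("bar:p2", "bar1")]])

same_model = first == second
```

Under these assumptions `same_model` is true for the two serializations above. The sketch also shows why the problem is not solved: two anonymous resources with identical properties would collapse into one node, nested anonymous resources are not handled, and editing a value changes the generated URI.
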
It is not trivial to design an algorithm for generating URI's which would have all these properties. Sergey has made good progress in this area, but the last time I thought about it there was a difficulty with the algorithm he was using at the time.

There is the issue of persistence over time. Is a parser allowed to generate the same URI for different resources, perhaps because the source XML serialization has been edited? URI rules would say no, I think. URN's definitely say no. URL's would allow it, but it's hard to reconcile these generated names with the concept of a locator.

Perhaps this problem could be dealt with by dumping it back on the generator of the RDF serialization. If the serialization is changed in such a way that it might result in the use of the same URI for different resources, the base for the generation of URI's must be changed to a new unique value. It might be hard to explain to users when they must change the base and why.

Create a New Class of URI with Different Rules
==============================================

i.e. bend the definition of URI as it's getting in the way.

This may be what DanB had in mind when he suggested "var:..." format URI's a few months back. var format URI's relax the URN constraint on persistence. Within some scope, the definition of which is outside the understanding of RDF processors, these URI's behave like URN's. It is up to the user, or his operations folks, to manage the use of RDF and RDF processors so that two uses of the same var to represent different resources never meet.

Regard Anonymous Resources as a Mistake and Remove Them
=======================================================

This approach forces the generation of URI's back to the generator of the RDF. This generator should have enough application knowledge to generate URI's that really are URI's.

This approach will presumably break some of the RDF that is out there. How big a problem this is, I don't know.
If it would cause a problem for you, raise your hand now.

Some Members of R do not have a URI
===================================

This permits models to be constructed with nodes that do not have a globally unique identifier.

There are entities which we might want to represent in an RDF model but which do not have a 'natural' URI. Me, for example. There are entities which an application designer may prefer not to give a name to - e.g. compound values such as my weight.

Applications are no longer forced to construct artificial URI's for entities which have no natural URI's.

The XML serialization syntax and the graphical presentation have defined means for representing nodes with no URI. M&S gives no formal description of a language for representing triples, but it does include examples where anonymous nodes are represented in a triple notation. It is clear that it is possible to design a representation of triples which can distinguish between URI's and other names with a more limited scope.

A key point to note is that there are graphs with anonymous nodes that cannot be represented in the current XML syntax. To ensure equivalence between the abstract model and the syntax, the syntax must either be extended, or the use of anonymous nodes in a model constrained.

An RDF processor can track the identity of an anonymous resource whilst it remains within the processing scope of that processor. But if such a resource moves out of the scope of the processor and comes back in, the processor has no way to know it is the same resource.

So, for example, consider an implementation of an RDF model, i.e. a collection of statements, which contains references to anonymous resources. If such a model were written as an XML serialization to a file in a way that preserved the anonymity and then read back in again, the processor would have no way to tell that the anonymous resources written out were the same as the anonymous resources being read back in.

Take a simple example.
A model with a resource representing me, and a property linking me to an anonymous resource representing my weight. If I write this out to a file, read it back in and add it to the same model, I end up with two weight properties. Not great.

Whew!

Brian McBride
HPLabs
Received on Friday, 8 September 2000 03:21:44 UTC