- From: Walton, Paul (Exchange) <WaltonP@logica.com>
- Date: Fri, 13 Mar 1998 08:50:03 -0000
- To: "'www-rdf-comments'" <www-rdf-comments@w3.org>
Dear RDF editors,

Comments on WD-rdf-syntax-19980216

This note contains some comments on the RDF Model and Syntax working draft (WD-rdf-syntax-19980216). The comments are in three sections:
- identification of a number of issues with the core data model
- a proposal for resolving the issues
- a proposal for representing a particular RDF model as a labelled directed graph (in a slightly different way from the current proposal)

ISSUES

The terminology and description of the core data model are based on labelled directed graphs (called labelled digraphs from here). But there are problems in the draft associated with this approach.

1 The core data model does not define what a node is.
2 The use of node identifiers is unclear.
3 The core data model can be projected onto a labelled digraph representation, but the RDF structure proposed is richer than a labelled digraph can support.
4 It is not clear what constitutes node identity.

Taking these in turn...

1 The core data model does not define what a node is

The core data model references a set of nodes (called Nodes), but provides no mechanism for identifying elements of the set. Some subsets are defined; for example PropertyTypes is defined as a subset (although PropertyTypes is itself not defined). Specific elements (such as RDF:Seq, RDF:Alt) are defined. But there are no rules for determining membership.

2 The use of node identifiers is unclear

The XML syntax includes identifiers to refer to bags. But the core data model does not define the relationship between nodes and identifiers.

3 The core data model can be projected onto a labelled digraph representation, but the RDF structure proposed is richer than a labelled digraph can support

The difficulty here is definition 2 of the core data model ("There is a subset of Nodes called PropertyTypes"). This definition is needed to supply the richness of structure and expressive power that RDF is intended to support. But it implies that a label on a digraph is also a node.
Definition 3 ("There is a set of 3-tuples called Triples...") confirms that the initial elements of properties are nodes (by definition 2), and these are the elements which are used to label graphs. This is inconsistent with the definition of graphs, digraphs and labelled digraphs (and the labelling in graphs normally refers to the nodes).

It is undoubtedly useful to create labelled digraphs which represent RDF models, but it should be recognised that the representation loses some of the structure of RDF. In this note I have used the term "projection" to indicate the relationship, in the sense that an RDF model can be projected onto a labelled directed graph. In the proposal below I have suggested a mechanism for using a richer structure than labelled digraph (but still a projection) which incorporates labelled digraphs in a structure similar to that used in CASE tool diagrams.

4 It is not clear what constitutes node identity

Some nodes have identifiers (in the XML definition). Some atomic nodes (I'm assuming here that the atomic values as defined are intended as nodes as shown in the examples) have an associated string (which might be a URI or "John Smith" or whatever). It is clear that two nodes with the same identifier are the same node. But what about each of the following:

- If two atomic URI nodes have the same URI are they the same node? In some cases, for example in a description, they must be the same node. But there may be other cases in which authors need more flexibility. For example, there may be 2 descriptions of a resource, one for now, and one to represent some planned future state. They may use the same URI string, but the author may want to preserve a semantic description (since they are actually descriptions of different things).

- If two atomic string nodes have the same string are they the same node? This may vary according to the type of string.
Dates and numbers, for example (as in the case of RDF:n ordinals) are at one extreme, but there will be no intention that each occurrence of "John Smith" is the same node. In this case it may be up to the author.

- If two nodes have the same construction are they the same node? Again this should be up to the author.

These issues can be resolved in a number of ways. My favourite is described in outline below.

CORE DATA MODEL PROPOSAL

The core data model proposed here is intended to resolve the above issues while meeting the needs of RDF and staying consistent with the XML syntax. The intention is to give each node a clear identifier, and to show clearly what a node actually is.

Conventions used are:
{...} represents a set.
[x1, ... , xn] represents an ordered set. If x is an ordered set, then n(x) is the nth element.
| means logical OR.
<- means is an element of (this was the closest I could get to an epsilon!).

PropertyTypes, URIs, Strings, Ids are disjoint sets. From the point of view of the definitions they need to be disjoint. But in any instance they will be represented by particular vocabularies defined (perhaps) according to XML namespace proposals.

(Note: there are issues here about whether or not to differentiate between URIs and strings and how to represent different classes that the strings themselves represent. There is also an issue about how to handle a string which happens to be a real URI. Some of the answers may be implied by the property in which the string is contained. When this is not the case, the simplest answer may be to embed properties of the following form in the model:
"John Smith" -- RDF:InstanceOf --> WhateverClassWeWant)

AtomicNodeBodies = {x : x <- PropertyTypes | x <- URIs | x <- Strings}
Nodes = {[i, b] : i <- Ids, b <- NodeBodies}

If x is a node (ie if x <- Nodes) then i(x) is the id of the node (ie the first element of the pair) and b(x) is the body (ie the second element of the pair).
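
As an illustration only, the pair structure above can be sketched in Python. The class name Node, the use of integers for ids and the use of plain strings for atomic bodies are assumptions made for this sketch; they are not part of the proposal itself.

```python
from dataclasses import dataclass
from typing import Union

# Assumption: ids are integers; a body is either an atomic string
# or a tuple of ids (for the structured bodies defined later).
Body = Union[str, tuple]


@dataclass(frozen=True)
class Node:
    """A node is the ordered pair [i, b]: an id and a body."""
    id: int
    body: Body


def i(x: Node) -> int:
    """Return the id of the node (first element of the pair)."""
    return x.id


def b(x: Node) -> Body:
    """Return the body of the node (second element of the pair)."""
    return x.body


# Two occurrences of the same string can be kept as distinct nodes
# by giving them different ids; the choice is left to the author.
smith_a = Node(1, "John Smith")
smith_b = Node(2, "John Smith")
```

Here smith_a and smith_b share a body but carry different ids, which is one way an author could keep two people called "John Smith" apart.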
A property body contains 3 node ids of which the first is in the set PropertyTypes. This is the same order as in the current draft, but I agree with the other comment you have received on the draft that it is easier for the reader (although it makes no fundamental difference) if the property type is second.

PropertyBodies = {[j, k, l] : j = i(p), for some p <- PropertyTypes, k = i(x), l = i(y) for some x, y <- Nodes}
Properties = {[i, b] : i <- Ids, b <- PropertyBodies}

DescriptionBodies = {[j1,...jn]: 1<= n, jk = i(pk) for some pk <- Properties for 1 <= k <= n, 2(b(px)) = 2(b(py)) for 1 <= x, y <= n}
Descriptions = {[i, b] : i <- Ids, b <- DescriptionBodies}

The following shows how collections can be included in the model.

SequenceBodies = {[j1,...jn]: 1<= n, jk = i(pk) for some pk <- Nodes for 1 <= k <= n }
Sequence = {[i, b] : i <- Ids, b <- SequenceBodies}

Other collection types can be included using the same principles.

NodeBodies = {x : x <- AtomicNodeBodies | x <- PropertyBodies | x <- DescriptionBodies | x <- SequenceBodies }

Further types can be included as needed.

The definitions of Node and NodeBodies are circular, since a node body may reference nodes and vice versa. However since they are rooted in atomic node bodies there is no problem here.

This definition resolves the issues raised above. Nodes are clearly defined, and each has an id. Node identity becomes a straightforward issue. If x and y are nodes they are the same node if and only if i(x) = i(y). The author (or the tool the author is using) can decide how to allocate ids to nodes according to the semantics of the domain. The definition has also been loosened from the labelled digraph definition to reflect the richer structure of RDF.

Embedding ids in the data model does not necessarily imply that these should be exposed to the user (or in the XML beyond the current syntax proposal).

There are also some by-products of this definition.
Since all properties are defined to be nodes, there is no need for reification (but, of course, all of the extra reification properties can be included if that is useful). Also there is a ready-made mechanism for handling higher arity.

An example

We assume that the set of ids is the natural numbers. Other atomic bodies should be obvious from the context.

[1, Author]
[2, "Ora Lassila"]
[3, "http://www.w3.org/People/Lassila"]
[4, [1, 3, 2]]
This is a property: "http://www.w3.org/People/Lassila" >-- Author --> "Ora Lassila"

[5, LastModified]
[6, "19980203"]
[7, [5, 3, 6]]
This is a property: "http://www.w3.org/People/Lassila" >-- LastModified --> "19980203"

[8, [4, 7]]
This is a description of "http://www.w3.org/People/Lassila"

[9, "Joe Bloggs"]
[10, [1, 8, 9]]
This is a property: (The description of "http://www.w3.org/People/Lassila" with id 8) >-- Author --> "Joe Bloggs". This means that Joe Bloggs was the author of the description itself (containing the two properties above).

LABELLED DIGRAPH PROPOSAL

To represent a particular RDF model as a labelled digraph the link between the label and node needs to be broken. If an RDF node is both a digraph label and a digraph node then they have to be separately represented on the graph. The use of ids provides a mechanism for doing this and still maintaining the link between them. Where it is useful and relevant, the ids could be represented on the diagram. In a similar way, if an RDF node is most usefully presented in more than one digraph node, then the id will make the relationship clear.

The above core data model proposal exposes the tree structure of the core model in the same way as it is exposed in the XML syntax. Human beings are used to understanding tree structures and many familiar tools use tree structuring. In particular CASE tools often use hierarchical (tree structured) diagrams to represent complex models.
While recognising that RDF is not purely tree structured it is useful to be able to use familiar tools wherever possible, so long as the differences can be clearly highlighted. Therefore I suggest combining the use of labelled digraphs with the use of hierarchies (where appropriate) to represent projections of RDF models.

The following diagrams (based on the example above) indicate the ideas. These ideas are (I think) implied in figure 4 of the draft spec.

The first diagram represents the highest level. The box contains a brief note and also the id, and indicates that the user can drill down to see the next level.

    +------------------+
    |8                 |
    |Description of    | >-- Author --> "Joe Bloggs"
    |"http:// etc"     |
    |                  |
    +------------------+

The next diagram shows what happens when you drill down.

    +--------
    |8
    |Description
    |
    |  "http://www.w3.org/People/Lassila"
    |      >-- Author --> "Ora Lassila"
    |      |
    |      +- LastModified --> "19980203"
    |
    +--------

On the actual diagrams the nodes should be included in ellipses, and there should be facilities for adding ids as needed (and as described above).

A property can be represented as a node in a similar way (by putting a box or ellipse round it). A property node is very similar to a description node which contains one property, but in the definition the latter has an extra set of square brackets. But they can still be represented in a similar (but not identical) way.

The use of hierarchies provides an important element in supporting human comprehension. As RDF develops, in different domains conventions will develop about how to break the pure hierarchical structure and how best to represent the breaks on the diagrams. Obvious ideas are:
- providing tools to highlight elements with the same id
- providing diagrams which show the overall hierarchy and how it is broken by extra properties.
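
To make the drill-down and highlighting ideas more concrete, here is a rough Python sketch that uses the ids from the example earlier in this note. The dictionary layout and the function names label, show_property, drill_down and same_id_uses are assumptions made for illustration; a 3-tuple stands for a property body and a list of property ids stands for a description body.

```python
# Node store for the worked example: id -> body.
# Assumption: a 3-tuple is a property body (type id, subject id, value id)
# and a list is a description body (a list of property ids).
NODES = {
    1: "Author",
    2: "Ora Lassila",
    3: "http://www.w3.org/People/Lassila",
    4: (1, 3, 2),
    5: "LastModified",
    6: "19980203",
    7: (5, 3, 6),
    8: [4, 7],
    9: "Joe Bloggs",
    10: (1, 8, 9),
}


def label(node_id):
    """Top-level view of a node: atomic text, or a short note plus the id."""
    body = NODES[node_id]
    if isinstance(body, str):
        return body
    if isinstance(body, tuple):
        return f"[property {node_id}]"
    return f"[description {node_id}]"


def show_property(prop_id):
    """Render a property as 'subject >-- type --> value'."""
    type_id, subject_id, value_id = NODES[prop_id]
    return f"{label(subject_id)} >-- {label(type_id)} --> {label(value_id)}"


def drill_down(desc_id):
    """Expand a description into one line per contained property."""
    return [show_property(p) for p in NODES[desc_id]]


def same_id_uses(node_id):
    """List the ids of properties whose body refers to node_id."""
    return [k for k, body in NODES.items()
            if isinstance(body, tuple) and node_id in body]
```

Calling show_property(10) gives the top-level view with the description collapsed to its id, drill_down(8) expands that description into its two properties, and same_id_uses(3) lists the properties that share the node with id 3, which is the kind of information a highlighting tool would need.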
CONCLUSION

I realise that I have suggested a significant change to the RDF draft, but the change clarifies the overall model and resolves the issues I raised at the start. I hope you find it helpful, and I will be pleased to provide any other help I can.

Paul Walton
Received on Friday, 13 March 1998 03:50:52 UTC