Comments on the RDF core data model

Dear RDF editors,

Comments on WD-rdf-syntax-19980216

This note contains some comments on the RDF Model and Syntax working draft
(WD-rdf-syntax-19980216).  The comments are in three sections:
- identification of a number of issues with the core data model
- a proposal for resolving the issues
- a proposal for representing a particular RDF model as a labelled directed
graph (in a slightly different way from the current proposal)

ISSUES

The terminology and description of the core data model are based on labelled
directed graphs (called labelled digraphs from here).  But there are
problems in the draft associated with this approach.

1 The core data model does not define what a node is.
2 The use of node identifiers is unclear.
3 The core data model can be projected onto a labelled digraph
representation, but the RDF structure proposed is richer than a labelled
digraph can support.
4 It is not clear what constitutes node identity.

Taking these in turn...

1 The core data model does not define what a node is

The core data model references a set of nodes (called Nodes), but provides
no mechanism for identifying elements of the set.  Some subsets are defined;
for example PropertyTypes is defined as a subset (although PropertyTypes is
itself not defined).  Specific elements (such as RDF:Seq, RDF:Alt) are
defined.  But there are no rules for determining membership.

2 The use of node identifiers is unclear

The XML syntax includes identifiers to refer to bags.  But the core data
model does not define the relationship between nodes and identifiers.

3 The core data model can be projected onto a labelled digraph
representation, but the RDF structure proposed is richer than a labelled
digraph can support

The difficulty here is definition 2 of the core data model ("There is a
subset of Nodes called PropertyTypes").  This definition is needed to supply
the richness of structure and expressive power that RDF is intended to
support.  But it implies that a label on a digraph is also a node.
Definition 3 ("There is a set of 3-tuples called Triples...") confirms that
the initial elements of properties are nodes (by definition 2), and these
are the elements which are used to label graphs.  This is inconsistent with
the definition of graphs, digraphs and labelled digraphs (and the labelling
in graphs normally refers to the nodes).

It is undoubtedly useful to create labelled digraphs which represent RDF
models, but it should be recognised that the representation loses some of the
structure of RDF.  In this note I have used the term "projection" to
indicate the relationship, in the sense that an RDF model can be projected
onto a labelled directed graph.

In the proposal below I have suggested a mechanism for using a richer
structure than labelled digraph (but still a projection) which incorporates
labelled digraphs in a structure similar to that used in CASE tool diagrams.

4 It is not clear what constitutes node identity

Some nodes have identifiers (in the XML definition).  Some atomic nodes (I'm
assuming here that the atomic values as defined are intended as nodes as
shown in the examples) have an associated string (which might be a URI or
"John Smith" or whatever).

It is clear that two nodes with the same identifier are the same node.  But
what about each of the following:

- If two atomic URI nodes have the same URI are they the same node?  In some
cases, for example in a description, then they must be the same node.  But
there may be other cases in which authors need more flexibility.  For
example, there may be 2 descriptions of a resource, one for now, and one to
represent some planned future state.  They may use the same URI string, but
the author may want to preserve a semantic distinction (since they are
actually descriptions of different things).

- If two atomic string nodes have the same string are they the same node?
This may vary according to the type of string.  Dates and numbers (for
example the RDF:n ordinals) are at one extreme, where equal strings would
normally be the same node.  At the other extreme, there will be no intention
that each occurrence of "John Smith" is the same node.  In this case it may
be up to the author.

- If two nodes have the same construction are they the same node?  Again
this should be up to the author.

These issues can be resolved in a number of ways.  My favourite is described
in outline below.

CORE DATA MODEL PROPOSAL

The core data model proposed here is intended to resolve the above issues
while meeting the needs of RDF and staying consistent with the XML syntax.
The intention is to give each node a clear identifier, and to show clearly
what a node actually is.

Conventions used are:

{...} represents a set.
[x1, ... , xn] represents an ordered set.
If x is an ordered set, then n(x) is the nth element of x.
| means logical OR.
<- means is an element of (this was the closest I could get to an epsilon!).

PropertyTypes, URIs, Strings, Ids are disjoint sets.  From the point of view
of the definitions they need to be disjoint.  But in any instance they will
be represented by particular vocabularies defined (perhaps) according to XML
namespace proposals.

(Note: there are issues here about whether or not to differentiate between
URIs and strings and how to represent different classes that the strings
themselves represent.  There is also an issue about how to handle a string
which happens to be a real URI.  Some of the answers may be implied by the
property in which the string is contained.  When this is not the case, the
simplest answer may be to embed properties of the following form in the
model:  "John Smith" -- RDF:InstanceOf --> WhateverClassWeWant)

AtomicNodeBodies = {x : x <- PropertyTypes | x <- URIs | x <- Strings}

Nodes = {[i, b] : i <- Ids, b <- NodeBodies}

If x is a node (ie if x <- Nodes) then i(x) is the id of the node (ie the
first element of the pair) and b(x) is the body (ie the second element of
the pair).
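
As an illustration only, the pair structure and the accessors i(x) and b(x)
might be sketched in Python as follows.  The class names PropertyType, URI,
String and Node, and the use of integers as ids, are assumptions made for
the sketch.  Wrapping each atomic value in its own class is one possible way
to keep the three atomic sets disjoint.

```python
from dataclasses import dataclass


# Hypothetical wrappers: one class per atomic set keeps the sets disjoint,
# so the text "x" as a URI never equals the text "x" as a String.
@dataclass(frozen=True)
class PropertyType:
    name: str


@dataclass(frozen=True)
class URI:
    value: str


@dataclass(frozen=True)
class String:
    value: str


@dataclass(frozen=True)
class Node:
    id: int       # assumption: ids are integers
    body: object  # an atomic body here; compound bodies would also fit


def i(x: Node) -> int:
    """Return the id of a node (the first element of the pair)."""
    return x.id


def b(x: Node) -> object:
    """Return the body of a node (the second element of the pair)."""
    return x.body


n2 = Node(2, String("Ora Lassila"))
n3 = Node(3, URI("http://www.w3.org/People/Lassila"))
```

Here i(n2) is 2 and b(n3) is the URI body, while String("x") and URI("x")
compare as different bodies even though they wrap the same text.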

A property body contains 3 node ids of which the first is in the set
PropertyTypes.  This is the same order as in the current draft, but I agree
with the other comment you have received on the draft that it is easier for
the reader (although it makes no fundamental difference) if the property
type is second.

PropertyBodies = {[j, k, l] : j = i(p), for some p <- PropertyTypes, k =
i(x), l = i(y) for some x, y <- Nodes}

Properties = {[i, b] : i <- Ids, b <- PropertyBodies}

DescriptionBodies = {[j1,...,jn]: 1 <= n, jk = i(pk) for some pk <- Properties
for 1 <= k <= n, 2(b(px)) = 2(b(py)) for 1 <= x, y <= n}

Descriptions = {[i, b] : i <- Ids, b <- DescriptionBodies}

The following shows how collections can be included in the model.

SequenceBodies = {[j1,...jn]: 1<= n, jk = i(pk) for some pk <- Nodes  for 1
<= k <= n }

Sequences = {[i, b] : i <- Ids, b <- SequenceBodies}

Other collection types can be included using the same principles.

NodeBodies = {x : x <- AtomicNodeBodies | 
x <- PropertyBodies |
x <- DescriptionBodies |
x <- SequenceBodies }

Further types can be included as needed.

The definitions of Nodes and NodeBodies are circular, since a node body may
reference nodes and vice versa.  However since they are rooted in atomic
node bodies there is no problem here.

This definition resolves the issues raised above.  Nodes are clearly
defined, and each has an id.  Node identity becomes a straightforward issue.
If x and y are nodes they are the same node if and only if i(x) = i(y).  The
author (or the tool the author is using) can decide how to allocate ids to
nodes according to the semantics of the domain.  The definition has also
been loosened from the labelled digraph definition to reflect the richer
structure of RDF.
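
The identity rule (same node if and only if the ids are equal) can be
illustrated with a small sketch.  The Node class and the integer ids are
assumptions for the example.  The point is only that equality looks at the
id and ignores the body, so the author decides identity by how ids are
allocated.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Node:
    id: int
    # compare=False: the body takes no part in equality or hashing,
    # so two nodes are the same node exactly when their ids are equal.
    body: object = field(compare=False)


first_john = Node(2, "John Smith")
second_john = Node(9, "John Smith")  # same string, separately allocated id
first_again = Node(2, "John Smith")  # same id, hence the same node

# A set keeps one entry per distinct node, that is, per distinct id.
distinct = {first_john, second_john, first_again}
```

Here first_john and second_john are different nodes despite the identical
string, and the set holds two nodes rather than three.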

Embedding ids in the data model does not necessarily imply that these should
be exposed to the user (or in the XML beyond the current syntax proposal).  

There are also some by-products of this definition.  Since all properties
are defined to be nodes, there is no need for reification (but, of course,
all of the extra reification properties can be included if that is useful).
Also there is a ready-made mechanism for handling higher arity.

An example

We assume that the set of ids is the natural numbers.  Other atomic bodies
should be obvious from the context.

[1, Author]
[2, "Ora Lassila"]
[3, "http://www.w3.org/People/Lassila"]
[4, [1, 3, 2]]

This is a property: "http://www.w3.org/People/Lassila" >-- Author --> "Ora
Lassila"

[5, LastModified]
[6, "19980203"]
[7, [5, 3, 6]]

This is a property: "http://www.w3.org/People/Lassila" >-- LastModified -->
"19980203"

[8, {4, 7}]

This is a description of "http://www.w3.org/People/Lassila".

[9, "Joe Bloggs"]
[10, [1, 8, 9]]

This is a property: (The description of "http://www.w3.org/People/Lassila"
with id 8) >-- Author --> "Joe Bloggs".

This means that Joe Bloggs was the author of the description itself
(containing the two properties above).
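
For readers who prefer code, the example can be transcribed roughly as
follows.  This is only a sketch: the dict encoding, the PROPERTY_TYPES set
(used to tell property type names from plain strings) and the function
names label and render_property are assumptions, not part of the proposal.

```python
# The ten nodes of the example, keyed by id.
model = {
    1: "Author",
    2: "Ora Lassila",
    3: "http://www.w3.org/People/Lassila",
    4: (1, 3, 2),            # property body: three ids
    5: "LastModified",
    6: "19980203",
    7: (5, 3, 6),
    8: frozenset({4, 7}),    # description body: ids of its properties
    9: "Joe Bloggs",
    10: (1, 8, 9),
}

# Assumption: ids whose bodies are property types rather than strings.
PROPERTY_TYPES = {1, 5}


def label(model, node_id):
    """Return a short printable label for a node."""
    body = model[node_id]
    if isinstance(body, str):
        return body if node_id in PROPERTY_TYPES else f'"{body}"'
    return f"(node {node_id})"  # compound node: refer to it by id


def render_property(model, prop_id):
    """Render a property node in the arrow notation used in this note."""
    type_id, about_id, value_id = model[prop_id]
    return (
        f"{label(model, about_id)} >-- {label(model, type_id)} --> "
        f"{label(model, value_id)}"
    )
```

Under these assumptions render_property(model, 10) refers to the description
by its id, since node 8 is a compound node rather than a string.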

LABELLED DIGRAPH PROPOSAL

To represent a particular RDF model as a labelled digraph the link between
the label and node needs to be broken.  If an RDF node is both a digraph
label and a digraph node then they have to be separately represented on the
graph.  The use of ids provides a mechanism for doing this and still
maintaining the link between them.  Where it is useful and relevant, the ids
could be represented on the diagram.  In a similar way, if an RDF node is
most usefully presented in more than one digraph node, then the id will make
the relationship clear.
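
A projection of this kind might be sketched as below.  The dict encoding of
the model and the function name project are assumptions for illustration.
Each property node becomes one labelled edge, the property type appears only
as the label text, and the id of the property is carried on the edge so that
the link back to the RDF node is not lost.

```python
# Sample model keyed by id; 3-tuples are property bodies.
model = {
    1: "Author",
    2: "Ora Lassila",
    3: "http://www.w3.org/People/Lassila",
    4: (1, 3, 2),
    5: "LastModified",
    6: "19980203",
    7: (5, 3, 6),
    8: frozenset({4, 7}),
    9: "Joe Bloggs",
    10: (1, 8, 9),
}


def project(model):
    """Project a model onto (vertices, edges) of a labelled digraph.

    Each edge is (source id, label text, target id, property id).
    """
    edges = []
    for prop_id, body in sorted(model.items()):
        if isinstance(body, tuple) and len(body) == 3:
            type_id, source_id, target_id = body
            edges.append((source_id, model[type_id], target_id, prop_id))
    vertices = sorted({v for s, _, t, _ in edges for v in (s, t)})
    return vertices, edges


vertices, edges = project(model)
```

In this sketch the property types (ids 1 and 5) never appear as vertices,
which is the structure the projection loses, while the description (id 8)
does appear as a vertex because a property refers to it.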

The above core data model proposal exposes the tree structure of the core
model in the same way as it is exposed in the XML syntax.  Human beings are
used to understanding tree structures and many familiar tools use tree
structuring.  In particular CASE tools often use hierarchical (tree
structured) diagrams to represent complex models.  While recognising that
RDF is not purely tree structured it is useful to be able to use familiar
tools wherever possible, so long as the differences can be clearly
highlighted.

Therefore I suggest combining the use of labelled digraphs with the use of
hierarchies (where appropriate) to represent projections of RDF models.  The
following diagrams (based on the example above) indicate the ideas.  These
ideas are (I think) implied in figure 4 of the draft spec.

The first diagram represents the highest level.  The box contains a brief
note and also the id, and indicates that the user can drill down to see the
next level.

+------------------+
|8                 |
|Description of    |  >-- Author --> "Joe Bloggs"
|"http:// etc"     |
|                  |
+------------------+

The next diagram shows what happens when you drill down.

+--------
|8
|Description
|
| "http://www.w3.org/People/Lassila" 
| >-- Author --> "Ora Lassila"
|  |
| +- LastModified --> "19980203"
|
+--------


On the actual diagrams the nodes should be included in ellipses, and there
should be facilities for adding ids as needed (and as described above).
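
The two views (the collapsed box and the drill-down) could be produced from
the model along these lines.  This is a text-only sketch under my own
assumptions: the dict encoding, the function names subject_of, collapsed and
expanded, and the simplification that the values inside the description are
plain strings.

```python
# Sample model keyed by id; the frozenset is the description body.
model = {
    1: "Author",
    2: "Ora Lassila",
    3: "http://www.w3.org/People/Lassila",
    4: (1, 3, 2),
    5: "LastModified",
    6: "19980203",
    7: (5, 3, 6),
    8: frozenset({4, 7}),
}


def subject_of(model, desc_id):
    """Return the id of the node shared by the description's properties."""
    first_prop = min(model[desc_id])
    return model[first_prop][1]


def collapsed(model, desc_id):
    """One-line summary: the box shown at the highest level."""
    return f'[{desc_id}] Description of "{model[subject_of(model, desc_id)]}"'


def expanded(model, desc_id):
    """Drill-down view: the summary followed by one line per property."""
    lines = [collapsed(model, desc_id)]
    for prop_id in sorted(model[desc_id]):
        type_id, _, value_id = model[prop_id]
        # Assumes the value is a plain string, as in the example.
        lines.append(f'  >-- {model[type_id]} --> "{model[value_id]}"')
    return lines
```

Calling expanded(model, 8) gives the summary line followed by the Author and
LastModified lines, indented to show that they sit inside the description.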

A property can be represented as a node in a similar way (by putting a box
or ellipse round it).  A property node is very similar to a description node
which contains one property, but in the definition the latter has an extra
set of square brackets.  But they can still be represented in a similar (but
not identical) way.

The use of hierarchies provides an important element in supporting human
comprehension.  As RDF develops, conventions will develop in different
domains about how to break the pure hierarchical structure and how best to
represent the breaks on the diagrams.  Obvious ideas are:

- providing tools to highlight elements with the same id
- providing diagrams which show the overall hierarchy and how it is broken
by extra properties.

CONCLUSION

I realise that I have suggested a significant change to the RDF draft, but
the change clarifies the overall model and resolves the issues I raised at
the start.  I hope you find it helpful, and I will be pleased to provide any
other help I can.

Paul Walton

Received on Friday, 13 March 1998 03:50:52 UTC