On the semantics of RDF*

In the last call we had a "lively" discussion about the semantics of RDF* - my apologies for getting so agitated about this topic! So please let me explain in a calmer way why I think that the proposed semantics are such a problem (and no snarkiness ahead, I promise!).

I see two or three different possible semantics for RDF* plus some secondary or orthogonal aspects. This makes it a bit hard to assess where the proposed semantics lands, and why, and what could be done differently, and why and how.

RDF* is a proposal in a field that has no proper name but could be characterized as "make it easier to express complicated stuff in RDF", expanding the expressivity of the simplistic RDF triple in intuitive ways. The triple that simply relates S to O by P has well-known shortcomings when relations get more complicated, have more members or attributes, when statements need to be annotated etc. That field of proposed solutions is as crowded as it is diverse: RDF Standard Reification, rdf:value, Named Graphs, Singleton Properties, Design Patterns, specialized ontologies like PROV-O, Concise Bounded Descriptions, Property Graphs etc, and now also RDF*.

The problem space is also anything but easy to conceptualize. Take for example Bob buying a car. The car is used and not in good condition - a property of the car. Bob however needed a car badly - a contextual property of Bob. He paid cash with borrowed money - properties of buying, maybe also of Bob (who’s notoriously broke). The transaction took place last Wednesday - maybe also a property of buying, maybe rather of the whole affair. All things considered Bob is content and lucky that he now has a car again - a property of the buying, or of Bob, or of the whole transaction? Later he will tell his grandchildren how this car-buying activity changed his life as the first thing he did with the car was … etc - an anecdote err annotation on the whole car-buying event.
Most solutions proposed so far can’t cover all the subtle differences in meaning that this example tries to evoke. Most of them stuff everything in one place: reification always annotates the whole statement, rdf:value makes everything part of the object in the relation, Singleton Properties annotate the property etc. Property Graphs are rather rich in this respect as they can annotate subject, predicate and object individually - but not the whole statement. Actually they resemble n-ary relations in RDF much more than reification. In fact allowing blank nodes in predicate positions and making extensive use of rdf:value would be all it takes to model Property Graphs in RDF. RDF* however, while aiming to bring Property Graphs to RDF, is syntactically a form of reification.


On the most fundamental level we have two paradigms to model complex relations in RDF: n-ary relations and reification. In an attempt to illustrate these two paradigms one could say that n-ary relations refine a relation from the inside whereas reification comments on it from the outside. However inside and outside meet at the hull and in practice the difference is not always clear: in the example above Bob being lucky might be an attribute of Bob or an annotation to the relation - it depends on how you tell it, to whom, on what occasion, in what context. But there are rather clear-cut cases too: the car being pink is an inside detail of the relation - n-ary - whereas the date at which all this was documented in a blog post is an annotation from the outside - reification.


The concept of n-ary relations is relatively straightforward and they are the backbone of modelling in RDF: agglutinating and accumulating statements, adding aspect after aspect, detail after detail. It is all a flat space to reasoners, logically simple, but often an unstructured soup of triples and hard to navigate for users.

Reification however is a different beast. In principle it opens portals to hell err paradoxes but I’ve been pacified that this is not a problem here. What remains a problem is that reification can reify quite different things. The most important distinction is that between triples and occurrences:

- a triple (also known as "triple type") fits best into the universalist approach of RDF and its aim of integrating data on a world wide scale and in a decentralized fashion. A URI (as well as a literal) means the same everywhere and a reified triple - which is composed of URIs (and literals) - does so just the same.
Reifying a triple is semantically rather close to an n-ary relation which by its very nature operates in the standard realm of RDF: referentially transparent in a possible interpretation of the underlying "syntactic triples".

- an occurrence (also known as "triple occurrence") in contrast is concerned about a specific triple in some graph or document or any other self-contained set of triples. This is useful if one wants to record provenance, discuss viewpoints, capture flows of information etc.
Occurrences have a subcategory: per the set-based semantics of RDF there can be only one occurrence of a triple in a graph. That however is not naturally right and intuitive, as e.g. the Wikidata use case shows, but rather an artifact of the way RDF is formalized (sets seemed to be a good way to keep the model theory reasonably compact and simple). RDF Standard Reification caters for that as it can name different occurrences of the same triple (just introduce a further reification quadlet with a new subject). Verbosity aside, the problem of RDF Standard Reification is that it can’t name the graph/document/set that the triple occurs in - it is underspecified.
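
To illustrate the point about sets and quadlets, here is a toy sketch in Python. It is only an illustration under my own assumptions: the graph is a plain Python set of string tuples, and the helper `reify` and the blank node labels `_:occ1` and `_:occ2` are made up for the example.

```python
# Toy model (assumption): a graph is a Python set of (s, p, o) string tuples.
graph = set()
graph.add((":a", ":b", ":c"))
graph.add((":a", ":b", ":c"))  # set semantics: the second add changes nothing


def reify(graph, node, triple):
    """Add one RDF Standard Reification quadlet for `triple`, named by `node`."""
    s, p, o = triple
    graph.update({
        (node, "rdf:type", "rdf:Statement"),
        (node, "rdf:subject", s),
        (node, "rdf:predicate", p),
        (node, "rdf:object", o),
    })


# Two quadlets with different subjects name two occurrences of the same triple.
reify(graph, "_:occ1", (":a", ":b", ":c"))
reify(graph, "_:occ2", (":a", ":b", ":c"))

asserted = [t for t in graph if t == (":a", ":b", ":c")]
occurrences = sorted(
    s for (s, p, o) in graph if p == "rdf:type" and o == "rdf:Statement"
)
```

The asserted triple is present exactly once, while the two quadlets remain distinguishable by their subjects. Note that nothing in the quadlets says which graph or document the occurrence lives in, which is the underspecification mentioned above.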


An orthogonal perspective is that of referentially transparent or opaque reification. The proposed semantics takes the very unusual step of specifying the embedded triple to be referentially opaque. Referential transparency is otherwise one of the cornerstones of the semantics of RDF but the authors argue that it is important to know exactly the syntactic details of an annotated statement because otherwise they get lost in interpretation. So far the authors have been relatively unmoved by numerous expressions of astonishment and critique and by proposals to go about this in a slightly less disruptive way (see for example Antoine Zimmermann's proposal to define an appropriate literal datatype).
Formally this question is independent of the discussion about types and occurrences although in practice one wonders if the focus on details that RDF generally rather tries to unify away than to stress wouldn’t be more appropriate in the context of occurrences.
The practical problem with this semantics is that it doesn’t quite cover the prominent use cases. Generally people on the semantic web are not concerned with whether a URI is written all lowercase or not. On the contrary, they expect such differences to be ignored by the machinery they use. More importantly the semantic web is built on the foundation of the Non Unique Name Assumption. Facilitating interoperability by allowing different names to co-denote the same thing, and thereby avoiding the need for one centralized vocabulary, is absolutely essential for the semantic web to work. RDF* is meant and advertised to solve a very common problem on the semantic web: the lack of modelling primitives that are more expressive than the basic triple. Its syntax doesn’t hint at any problem, it rather suggests ready-to-go ease of use. So in any reasonable scenario, with the blessing of some standard and the support of quite some vendors, it will be used for all things semantic web, building on the NUNA among other basic principles. But the proposed semantics will fundamentally work against those very foundations.
There are only two possible outcomes: either a lot of effort will have to be put into evangelizing people NOT to use this fancy new tool so that just the well-informed use it to do the right Superman-ish thing, or we will get shitloads and shitloads of triples whose semantics simply don’t express what their creators want, expect and were told they would. That outlook is why I’m so impertinently asking for the proposed semantics to be changed.

There’s also a semantic problem that AFAICT hasn’t been discussed so far: the embedded triple is not really connected with the asserted triple it wants to reify precisely because it isn’t fully part of the interpretation but tries to keep one foot in the syntactic realm. Take the following example:
   :a :b :c .
   << :a :b :c >> :y :z .
Everything seems fine. Now some harmless processing step in some regular RDF toolchain changes the asserted triple to:
   :A :B :C .
No change in meaning as far as RDF is concerned. In RDF* however the embedded triple and its annotation do now refer to an unasserted triple. AFAICT this means that the connection to the asserted triple is lost, :A :B :C is not annotated anymore and we have a totally new graph. Maybe the annotation said "a :explosiveMaterial" or "a :fakeNews". To protect from such SNAFUS you’d actually have to make _sure_ that the embedded triple follows all syntactic changes in the asserted triple. Imagine that cost in coding, in actual computation, in lack of robustness. I promised above not to get snarky, so I won’t, but in all honesty I think that this is the point where the endeavour of the proposed semantics finally hits the wall, with force. This semantics can be employed and does work when special care is taken and the awareness is there that special care is indeed advised: documenting and juxtaposing claims, different viewpoints, arguments etc are such cases. But _not_ the general semantic web looking for an easier way to model n-ary relations and record provenance.
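
A toy sketch in Python may make the failure mode more tangible. It is not taken from any proposal or implementation: the `DENOTES` table stands in for an interpretation in which both spellings denote the same resources (the premise of the example above), and the two lookup functions are hypothetical helpers that contrast syntactic matching with matching on denotation.

```python
# Assumption: a toy interpretation in which :a/:A, :b/:B and :c/:C co-denote.
DENOTES = {
    ":a": "thing-a", ":A": "thing-a",
    ":b": "thing-b", ":B": "thing-b",
    ":c": "thing-c", ":C": "thing-c",
}


def interpret(triple):
    """Map each syntactic term of a triple to the thing it denotes."""
    return tuple(DENOTES[term] for term in triple)


def annotated_opaque(asserted, annotations):
    """Keep annotations whose embedded triple is syntactically asserted."""
    return {emb: ann for emb, ann in annotations.items() if emb in asserted}


def annotated_transparent(asserted, annotations):
    """Keep annotations whose embedded triple denotes an asserted triple."""
    meanings = {interpret(t) for t in asserted}
    return {
        emb: ann for emb, ann in annotations.items()
        if interpret(emb) in meanings
    }


# << :a :b :c >> :y :z .
annotations = {(":a", ":b", ":c"): (":y", ":z")}
before = {(":a", ":b", ":c")}   # asserted triple as originally written
after = {(":A", ":B", ":C")}    # same triple after the processing step
```

With syntactic matching the annotation is found for `before` but silently vanishes for `after`; with matching on denotation it survives the rewrite. Keeping the opaque variant correct would require every processing step to update the embedded triple in lockstep with the asserted one.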


In the past I have argued (with increasing aggressiveness, ahem) that the RDF* semantics should define reification as referring to an occurrence - just like RDF Standard Reification does - and default to the local graph. I still think that this would be the best solution as it hits a rather pragmatic sweet spot:
- most annotations can be understood as referring to a specific occurrence of a triple, not its general type. After all Bob from the example above probably will buy a car more than once in his life. At least that will seldom be outright wrong, and people are lazy and will probably use the same meta modelling idiom all the time, neglecting the subtle semantic differentiations between n-ary relations and reification, types and occurrences etc. So its semantics had better be on the safe side.
- provenance, the poster child occurrence use case, is an important use case - it even has its own vocabulary - and drove the seminal example of RDF*
- this would finally put a lid on the open pot of RDF Standard Reification semantics, plus do away with its verbosity.
IMO this is a pragmatic, middle-of-the-road, easy to implement design that gets the job done and makes people say "finally, yes, what was so hard about that?" (and maybe "thank you!").

In this semantics the embedded triple, defaulting to point to an occurrence in the local graph, would itself be syntactic sugar for an embedded quad
   << :a :b :c <> >>
the fourth element explicitly pointing to the local graph.
There could then be other variants that address an occurrence in another graph
   << :a :b :c :g >>
or even the Wikidata use case of multiple occurrences in one graph, local or otherwise
   << :a :b :c <>#1 >>
or somewhere else
   << :a :b :c :g#1 >>
etc, 
but also the triple type, e.g.
   << :a :b :c () >>
and the literal like triple of the proposed RDF* semantics, e.g.
   << :a :b :c "" >>
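
To summarize the variants in one place, here is a small Python sketch of how a parser might read the fourth element. Everything in it is hypothetical: the embedded-quad syntax is only my suggestion above, and the function `classify` and its return shape are invented for illustration.

```python
def classify(fourth=None):
    """Return (kind, graph, index) for the fourth element of an embedded quad.

    Hypothetical reading of the variants sketched above; `fourth` is the
    raw token as written, or None for a bare embedded triple.
    """
    if fourth is None:
        fourth = "<>"  # bare embedded triple: sugar for the local graph
    if fourth == "()":
        return ("triple type", None, None)
    if fourth == '""':
        return ("literal-like opaque triple", None, None)
    # Otherwise an occurrence: a graph name, optionally followed by #n.
    graph, _, index = fourth.partition("#")
    where = "local graph" if graph == "<>" else graph
    return ("occurrence", where, int(index) if index else None)
```

For example, `classify()` and `classify("<>")` both yield an occurrence in the local graph, and `classify(":g#1")` yields the occurrence numbered 1 in graph `:g`.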

Of course, syntactically it would be more straightforward to let the embedded triple refer to the referentially transparent triple and require a fourth element for anything more specific: <> for an occurrence in the local graph, an IRI for an occurrence in another graph, or "" for the literal-like referentially opaque type. I can live with that and I confess that it might be more intuitive in the long run. Reifying the triple type is however semantically so close to n-ary relations that I wonder if it is really a good idea to promote it that much. For that use case I'd actually favor something like a combination of rdf:value and blank nodes also in predicate position (as briefly discussed above). That would also be much nearer to the way Property Graphs work - if that is a criterion.


We had discussions about how representations with different semantics can be constructed from the base embedded triple. Pierre-Antoine made a proposal for occurrences along the following lines (not verbatim, sorry):
   _:b1 :occurrenceOf << :a :b :c >> ;
        :inGraph :g ;
        a rdf:Statement .
The reference to rdf:Statement can of course be omitted if :occurrenceOf is defined with the proper domain, so the minimal triple count is 2. 
My hunch is that a reference to the referentially transparent triple would best be constructed in the same way:
   _:b2 :tripleOf << :a :b :c >> ;
        a :Triple .
Again, the last statement can be omitted as the rdfs:domain of :tripleOf can convey that information, so it’s "just" 1 triple extra.

B.t.w: it is the responsibility of RDF* to define these properties and classes.

But more importantly, the bottom line is: it is possible to derive occurrences and triples from embedded triples but at a considerable cost. Without the syntactic extensions to embedded quads discussed above this really raises the question why the very niche use case of differentiating Morning star from Evening star (a.k.a. the Superman problem) gets the sugared sugar version with cream and cherry on top, while the standard case of triples and the probably most common case of occurrences have to make do with one or two extra triples. IMO this is a severely unbalanced design.

I’ll say it once again: I like the proposed semantics, I like N3 formulas, I like the possibility to introduce statements that are available to query but don’t drive entailments. That is a very practical addition in an advanced toolset. But this can't be the default semantics for RDF*. That would create a _lot_ of semantically unsound triples and we would actually be better off without a formal semantics at all than with one that we’ll effectively be forced to ignore, or to evaluate case by case - which would be quite a prospect.
So:
- either change the name of this project to RDFformula or similar and make it clear that it is not what RDF* aimed to be
- or change the default semantics to a more mainstream use case of triple or occurrence and make the literal-ish semantics a special case, 
   _:b3 :syntacticTripleOf << :a :b :c >> ;
         rdf:value ":a :b :c"^^:rdfLiteral .
You can tell by the property names that I still find it hard to describe what the RDF* semantics actually proposes. The break in the semantics discussed above is still not healed (as that's most probably impossible) but at least it would be much easier to communicate and handle appropriately.


A few hours ago the following remark surfaced in my Inbox:

> On 27. Jan 2021, at 10:26, Pierre-Antoine Champin <pierre-antoine.champin@ercim.eu> wrote:

> One more thing: I realized that in the email quoted above, I was mostly arguing about my assumption that RDF* triples should mostly behave like literals. I got convinced during the last call that this assumption was not shared by the group, so my other objections from that email are moot.


That would indeed be an unexpected but nonetheless very welcome development. Sorry if I’m beating a dead horse here but this mail, although written today, was in the making for some time and I can’t bring myself not to send it (and after all the work that I invested in this topic I feel entitled to a little overreaction ;-). I still hope that it will make things clearer.


Thomas


P.S.: and there have to be embedded graphs. I really see no reason why 
   << :a :b :c . :d :e :f . >> 
should not be possible. Isn’t Gregg already implementing them? (Cato…)

Received on Wednesday, 27 January 2021 15:44:08 UTC