Re: RDF* semantics from thomas lörtsch on 2019-08-14 (public-rdf-star@w3.org from August 2019)

From: thomas lörtsch <tl@rat.io>
Date: Wed, 14 Aug 2019 14:52:25 +0200
To: Olaf Hartig <olaf.hartig@liu.se>
Cc: "public-rdf-star@w3.org" <public-rdf-star@w3.org>, Patrick J Hayes <phayes@ihmc.us>
Message-Id: <D83DD483-0FB1-4E6A-9139-61BA84FA4C18@rat.io>
Hi Olaf, all,


sorry, but this is going to be a bit longer. The gist of it is:

RDF* is more than an RDF-standard-reification-style statement ID as it also describes or - in PG mode - even asserts the statement it identifies. But irrespective of such advanced properties it has to deal with some basic problems in the RDF reification semantics just as if it were a mere identifier.

Adding a datatype for RDF* identifiers to the RDF semantics is not enough to define its reification semantics. A semantically sound solution requires extending the reification semantics of RDF by a notion of the context in which a statement actually occurs. There’s no way around that as otherwise it’s impossible to honor RDF's distinction between abstract statement types and concrete statement occurrences/tokens.

Unfortunately RDF left this part of the reification semantics unspecified - it's a gap that has to be fixed by an extension to the RDF reification semantics. This extension can be specified in RDF* alone but if done right all of RDF would benefit from such a precedent.

Natural candidates for such contexts are documents, named graphs, nested RDF* statements, self contained snippets of RDF in any serialization. A sensible default semantics of reification would have a statement identifier identify a statement occurrence in the same context - something that would seem quite obvious but isn’t guaranteed under the current incomplete RDF reification semantics. 

Sound disambiguation through a definition of contexts and sensible local defaults are only half the battle. A complete solution would also require the ability to address a statement occurrence outside the current context. To that end a new syntax that combines statement id and context id has to be defined. Maybe that step is a bit too much for this project.

An RDF* snippet in PG mode is a shortcut that asserts that statement right away. This may cause trouble down the road as it violates the type/token distinction of RDF standard reification semantics and creates possibly messy entanglements. It can’t reference and annotate statement tokens in other contexts or the abstract statement type. It also makes it hard to annotate other annotations - something that Property Graphs can’t do but RDF can.

The question might come up why RDF* should carry the burden of fixing a gap that no one has bothered to fix for almost two decades: one reason is that it makes heavy use of reification so it’d better do so on a sound foundation, the other one being that indeed some have tried - e.g. Fluents and Singleton Properties - and the popularity of RDF* in RDF circles indicates widespread demand for a solution to this problem.

More details below.


> On 9. Aug 2019, at 11:23, Olaf Hartig <olaf.hartig@liu.se> wrote:
> 
> Dear Pat, all,
> 
> As promised in my initial email in this thread (see below), I have
> created a draft document that specifies how the definitions in "RDF 1.1
> Semantics" have to be extended to provide a model-theoretic semantics
> for RDF* graphs.
> 
> Please find the draft in the following github repo of the W3C RDF-DEV CG
> 
> https://github.com/w3c/EasierRDF
> 
> You can also read the latest version of the draft rendered directly from
> github (which means you don't have to download the repo first). To this
> end, use the following link.
> 
> http://htmlpreview.github.io/?https://github.com/w3c/EasierRDF/blob/master/RDFstar/RDFstarSemantics.html
> 
> I am looking forward to your feedback on this draft.


In the past I understood RDF* mainly as an alternative to the RDF standard reification quadlet, a better statement ID attribute. Nesting statements seemed like a corner case and I glossed over the ramifications of being able to identify and state a statement at the same time. I realized this through the 2 modes SA and PG that your semantics introduce.

Identification of statements with RDF* has the interesting property that it carries the statement it identifies on its sleeve. So no need to create an identifier before using it, no need to look up the statement identified through the identifier (or at least not as long as abstract statements, vulgo "unasserted assertions", stay out of the picture). This feels natural and intuitive. The downside is that the identifier itself can get long and unwieldy, and nested statements make it even longer and therefore unfit for involved use cases where e.g. provenance of triples has to be tracked across many steps of transformations. Still IMHO it is an interesting cross between identification and addressing and a good candidate for syntactic sugar in standard use cases. This is the SA mode. PG mode is more complex as it goes one step further and actually asserts the statement it identifies+addresses in one and the same step. The semantics of PG mode are based on the semantics of the more basic SA mode and I will concentrate on SA mode.

The SA and PG mode semantics that you propose are not sufficient to define the semantics of RDF* because they don’t tackle the problem of incomplete reification semantics in RDF.
Your RDF* semantics proposal concentrates on introducing a new datatype for its identifiers in RDF as RDF doesn’t allow literals in subject position. That’s okay although maybe a bit much IMHO given the few cases in which I see RDF* style identifiers as really useful. Couldn’t you as well make them URIs by prefixing them with some special namespace and leave the central parts of the RDF model theory untouched? That would shift the necessary changes to the reification part of the semantics where some machinery has to ensure that an RDF* identifier/URI is treated like a reification quadlet. Of course this would lose some of the ease of use of <<s p o>> snippets. Or, alternatively, extend the RDF semantics to allow literals in subject position - some people have been arguing for this for a long time. Well, those are just some ideas.
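
To make the URI idea a bit more tangible, here is a minimal Python sketch. The namespace http://example.org/rdfstar/id/ and the helper names uriify and quadlet are made up for illustration and are not part of any spec or of Olaf's draft; only rdf:Statement, rdf:subject, rdf:predicate and rdf:object are taken from the RDF reification vocabulary.

```python
from urllib.parse import quote

RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
# Hypothetical namespace for URI-ified RDF* identifiers (an assumption, not standardized).
STAR_NS = "http://example.org/rdfstar/id/"


def uriify(s, p, o):
    """Turn an <<s p o>> snippet into a URI by percent-encoding it under STAR_NS."""
    return STAR_NS + quote(f"{s} {p} {o}", safe="")


def quadlet(s, p, o):
    """Expand the identifier into the four triples of the standard reification quadlet."""
    sid = uriify(s, p, o)
    return [
        (sid, RDF + "type", RDF + "Statement"),
        (sid, RDF + "subject", s),
        (sid, RDF + "predicate", p),
        (sid, RDF + "object", o),
    ]


triples = quadlet("http://example.org/bob", "http://xmlns.com/foaf/0.1/age", '"23"')
```

The identifier still carries the statement on its sleeve, only in percent-encoded form, which illustrates the loss in ease of use mentioned above.
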
The part that really needs a change, or rather an extension, or much rather completion, is the RDF standard reification semantics itself. This is where the central problem of reification in RDF lies. The RDF reification semantics distinguishes abstract statement types and concrete statement occurrences, or 'types' and 'tokens' for short. This is a very useful approach as it lays the semantic foundation to attribute statement tokens relative to where they occur. However the spec only goes half the way: it describes the vocabulary to define abstract statements but stays mum about how to treat concrete statement occurrences. This is very surprising to the uninitiated and so annoying and counter-intuitive that it hurts the RDF reification mechanism severely. People complain about the verbose syntax, the reification quadlet, but the gap in the semantics is much worse.
The RDF reification semantics takes the natural approach that a statement token can only be defined through some context in which it occurs. That could intuitively be understood to be some document or named graph or some other self-contained snippet of RDF. RDF doesn’t standardize any concrete way to define or specify such a context. Therefore it is impossible to reify a concrete statement occurrence in RDF.
That needs to be fixed as without such a fix RDF* has no more model theoretic semantics than RDF standard reification, which is: none. But if we can fix it, it would benefit all approaches to reification and statement attribution in RDF, which could be quite valuable. While this is formally beyond the reach of RDF* the solution is rather obvious and RDF* could set a useful precedent by defining the semantics of reification through the RDF Semantics extension mechanism and in a way that all other syntaxes can easily adopt.
That’s my main point. I’ll get into more detail below. 


RDF* however doesn’t stop at carrying the statement that it identifies on its sleeve. In PG mode the statement identifier can be interpreted as the statement itself. I think that, although seemingly handy and elegant, this is not a good idea, for several reasons. 
RDF* in PG mode introduces a style of modelling that is rather orthogonal to the way RDF is modeled. Suddenly statements with a whole set of attributes - the important statements presumably - become encapsulated in RDF nodes and the RDF edges between those nodes only play second fiddle. This feels weird and like a strange indirection and is quite a departure from RDF style modelling. 
Another concern is that PG mode makes it much harder to annotate other statement annotations - something that SA mode can do easily but Property Graphs can’t do at all. Why give away that advantage? 
The burden that SA mode puts on the user doesn’t seem terribly heavy. While SA mode does indeed require explicitly stating a triple before being able to attribute it, this is still a big improvement over the syntactic verbosity of the RDF standard reification quadlet. It is also quite safe from accidental retrieval of abstract triples through some sloppy SPARQL query, although not as much as the RDF standard reification quadlet.
With regard to sound semantics however I fear that PG mode may introduce one shortcut too many. It has the potential to create another knot like the name/label mess in Named Graphs that later on will be almost impossible to disentangle (see Antoine Zimmermann's Note in the RDF 1.1 specs [0]). The problem is that merging definition and instantiation of a statement creates a type/token hybrid that might be impossible to disentangle later on. That however would make it hard if not impossible to disambiguate statements of the same type from different contexts. I haven’t completely thought this through yet but so far I'm really sceptical. Reification in RDF has been a contested topic since the beginnings of RDF standardization. It’s not without hope but one more half-baked, half-ambiguous standard in the area surely wouldn't help in the long run.
All that makes me think that PG mode goes too far and tries to be something that Property Graphs themselves already are, and much better at that. Different modes anyways tend to be a source of confusion and headaches and should only be introduced when really necessary. Perhaps it would be clearer to not make PG a mode within the RDF semantics but change the name to "RDF*/PG" or "rPG" and make it an extension to RDF (and RDF*)?

> 
> Thanks,
> Olaf
> 
> 
> On Mon, 2019-08-05 at 16:09 +0200, Olaf Hartig wrote:
>> Dear Pat, all,
>> 
>> Great to have you on board. Your help will be much appreciated!
>> 
>> In the following, I am responding to the points raised by Pat:
>> 
>> On Mon, 2019-07-08 at 09:04 -0700, Patrick J Hayes wrote:
>>> [...]
>>> I believe that  RDF* would benefit from being given a clear direct
>>> semantics of its own, rather than via a reduction mapping to RDF.
>> 
>> I agree. In fact, I have already started a draft that specifies how the
>> definitions in Sec.5 of the "RDF 1.1 Semantics" have to be extended to
>> provide a model-theoretic semantics for RDF* graphs. I will share once I
>> have cleaned it up.
>> 
>>> [...]
>>> RDF* seems to presume that making an assertion about a triple also
>>> thereby asserts the triple. This is not how reification was designed
>>> to work, and it is in violation of the description of the semantics
>>> of reification in the RDF specs. Thus, RDF* is currently not a correct
>>> modeling of RDF reification. This issue needs to be addressed and
>>> resolved.
>> 
>> Your observation is correct: The RDF*-to-RDF mapping (and the
>> corresponding SPARQL*-to-SPARQL mapping) as defined in my earlier papers
>> are based on the assumption that a nested RDF* triple t also asserts the
>> triple that is the subject or the object of t. Hence, my definitions of
>> these mappings use the RDF reification vocabulary to capture this
>> assumption. I do not think that using the RDF reification vocabulary in
>> this way is a violation of the RDF specs. Note that I have *not* been
>> using RDF* to model RDF reification (rather the other way around), and
>> indeed I agree that making the aforementioned assumption is not a
>> correct modeling of RDF reification.
>> 
>> At this point it may be helpful to emphasize that my initial perspective
>> on RDF*/SPARQL* (as reflected in the definitions in my papers) has been
>> influenced by discussions with triplestore vendors who were interested
>> in a practical, reification-like feature to capture and to query
>> statement-level annotations. The general intention was that this feature
>> would be used in a way like people use the notion of edge properties in
>> Property Graph databases (if you are not familiar with Property Graphs:
>> an "edge property" is a key-value pair associated with an edge in such a
>> graph). Then, the aforementioned assumption followed from this intention
>> because, in a Property Graph, to assign edge properties to an edge, the
>> edge must exist in the graph.
>> 
>> Having said that, I understand that there also are use cases for which
>> the aforementioned assumption is unsuitable; that is, use cases in which
>> asserting a nested RDF* triple t should *not* entail the assertion of
>> the triples that occur in t. My current idea for the RDF*/SPARQL*
>> approach to also cover such use cases is to introduce two different
>> modes of how RDF*/SPARQL* may be used: One of these modes explicitly
>> makes the aforementioned assumption; this is the mode that is captured
>> by the existing documents and it might be called the "Property Graph
>> mode" (PG mode, for short). The other mode does not make the assumption;
>> it might be called the "separate-assertions mode" (SA mode). It is not
>> difficult to adapt the existing definitions to capture this SA mode as
>> an alternative to the PG mode. Apparently, the model-theoretic semantics
>> for RDF* graphs will also differ depending on whether PG mode or SA mode
>> is used, and so will the semantics of SPARQL* update operations and of
>> SPARQL* queries.
>> 
>> As an example regarding the latter, consider an RDF* graph that contains
>> only the following nested triple (prefix declarations omitted).
>> 
>> ( (:bob, foaf:age, 23), dct:creator, :crawler1 )
>> 
>> Furthermore, assume the following SPARQL* query.
>> 
>> SELECT * WHERE { :bob foaf:age ?a }
>> 
>> In PG mode, the result of this query over the given RDF* graph consists
>> of a single solution mapping m with m(?a)=23. In contrast, in SA mode,
>> the query result is empty.
>> 
>> I am looking forward to comments on the idea to introduce these two
>> modes of usage.
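
The difference between the two modes in this example can be sketched in a few lines of Python. This is only an illustration under simple assumptions: nested triples are modelled as nested tuples, the query is evaluated by naive pattern matching, and the function names asserted and ages_of_bob are made up here. It is not how any SPARQL* engine is actually implemented.

```python
# A nested RDF* triple is modelled as a tuple whose subject or object may itself be a tuple.
BOB_AGE = (":bob", "foaf:age", 23)
graph = {(BOB_AGE, "dct:creator", ":crawler1")}


def asserted(graph, mode):
    """Return the triples that count as asserted under the given mode ("PG" or "SA")."""
    result = set(graph)
    if mode == "PG":
        # PG mode: every triple nested in subject or object position is asserted too.
        stack = list(graph)
        while stack:
            s, _, o = stack.pop()
            for term in (s, o):
                if isinstance(term, tuple) and term not in result:
                    result.add(term)
                    stack.append(term)
    return result


def ages_of_bob(graph, mode):
    """Naive evaluation of: SELECT * WHERE { :bob foaf:age ?a }"""
    return [o for (s, p, o) in asserted(graph, mode) if s == ":bob" and p == "foaf:age"]
```

Here ages_of_bob(graph, "PG") finds the nested triple while ages_of_bob(graph, "SA") finds nothing, because in SA mode the nested triple is only mentioned, not stated.
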


>>> I propose an idea for consideration by this community, to allow for
>>> meta-descriptions to apply to entire RDF graphs rather than restricted
>>> to single triples. The costs of this seem relatively small and the
>>> benefits quite great. 
>> 
>> While I do not know what exactly you aim to propose, what you write
>> sounds more related to a discussion of named graphs. The RDF*/SPARQL*
>> approach is explicitly focused on statement-level metadata rather than
>> graph-level metadata, which IMHO are orthogonal concerns.


The practical use of abstract statements is rather limited and so the whole discussion might seem quite theoretical. Most use cases of reification speak about a concrete occurrence of a statement to e.g. document its provenance. As long as this happens in the same context that the statement itself occurs in it might seem fine to not care about the difference between abstract and concrete statements. 
However for a complete solution to reification it is crucial to be able to disambiguate statements of the same type coming from different sources and to attribute statements from other sources without endorsing them. This kind of precision is indispensable to e.g. record how different actors came to the same conclusion/statement in different contexts.

Disambiguating statements depends on honoring the distinction between abstract types and concrete tokens of statements. Merging abstract and concrete statements like PG mode does may make it impossible to disentangle them afterwards and would then indeed be a violation of the RDF reification semantics.
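
A small Python sketch of what is at stake, with all class names, context URIs and crawler names invented for illustration: annotations attached to tokens keep two sources apart, while annotations attached to the bare type cannot.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StatementType:
    """An abstract statement: just subject, predicate, object."""
    s: str
    p: str
    o: str


@dataclass(frozen=True)
class StatementToken:
    """A concrete occurrence: a statement type plus the context it occurs in."""
    stmt: StatementType
    context: str  # e.g. a document or named graph URI (hypothetical values below)


claim = StatementType(":bob", "foaf:age", "23")
token_a = StatementToken(claim, "http://example.org/graphA")
token_b = StatementToken(claim, "http://example.org/graphB")

# Annotations attach to tokens, so the two sources stay distinguishable.
provenance = {token_a: ":crawler1", token_b: ":crawler2"}

# Keying by the type alone loses that distinction: the second source overwrites the first.
merged = {}
for token, source in provenance.items():
    merged[token.stmt] = source
```
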

The RDF Semantics are based on a notion of statement context but they don’t define what constitutes such a statement occurrence context or how it could be specified. They instead refer to out of band means (the RDF 1.0 Primer explains all this quite well - although that specific part didn’t make it into the 1.1 spec it is still valid [1]). This gap needs to be fixed. Natural candidates are a document, a named graph or some self contained snippet of RDF in which a statement actually occurs. Defaults may be defined or even a cascade of defaults e.g. first named graph, then document, then RDF* statement, then enclosing RDF snippet etc., maybe even a 'context' attribute.
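
Such a cascade of defaults could look roughly like the following Python sketch. The key names and the exact order (in particular putting an explicit 'context' attribute first) are my assumptions for the sake of the example, not something any spec defines.

```python
# Order of the cascade, from most to least specific; all key names are hypothetical.
CASCADE = [
    "context_attribute",
    "named_graph",
    "document",
    "rdf_star_statement",
    "enclosing_snippet",
]


def resolve_context(candidates, cascade=CASCADE):
    """Return (kind, identifier) for the first context kind in the cascade that is known."""
    for kind in cascade:
        ident = candidates.get(kind)
        if ident is not None:
            return kind, ident
    raise LookupError("no context known for this statement occurrence")


found = resolve_context({
    "document": "http://example.org/doc.ttl",
    "named_graph": "http://example.org/g1",
})
```

With both a document and a named graph known, the named graph wins because it comes earlier in the assumed order.
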

Attributing statements across contexts additionally requires a means to address a concrete occurrence through a combination of context identifier and abstract statement identifier. See some ideas below.

While almost everybody including the editor of the RDF Semantics seems to bitch about RDF standard reification I did indeed start to like it. Sure, the quadlet is verbose but how else would you describe a triple with triples? Syntactic sugar and some backend optimizations can easily get around that. The distinction between abstract and concrete statements however seems indeed very well designed to me. If only the reification semantics didn’t have that glaring omission of not specifying what actually constitutes the context of an occurrence...

RDF/XML is an interesting example for how syntactic sugar can make reification painless and natural: RDF/XML provides an ID attribute in lieu of the RDF reification quadlet. Despite all the other "features" of RDF/XML this one is really useful and a very intuitive implementation of syntactic sugar for reification. 
RDF/XML also provides a blueprint for intuitive semantics: if you refer by its ID to a statement that has already been stated in the same context and been given said ID, then it considers that statement as indeed stated, as an occurrence - which totally makes sense in the context of a (RDF/XML) document. Through this default all the fuss about abstract statements and statements in other contexts can safely be ignored in standard use cases. RDF* in SA mode can be understood quite similarly.
While RDF/XML was still focused on documents, later work in RDF shifted to database backed systems where SPARQL and its implementation of Named Graphs define the natural boundary of validity of a statement. Documents, Named Graphs, a possibly nested RDF* statement or other self contained snippets of RDF in any suitable serialization would all make excellent candidates for defining the context of a statement occurrence.
In that sense Named Graphs are indeed not orthogonal to but a variation of a nested RDF* statement. 


>> Best,
>> Olaf


To sum up and collect the above into some kind of proposal:

Extending/completing the RDF reification semantics to make them fit for RDF* as well as any other RDF serialization would require defining three things:
- how to constitute and identify the context of an occurrence
- how to identify an abstract statement type
- how to identify a concrete statement token in some context

1) statement context identification

A few obvious candidates are serializations in documents, Named Graphs, (nested) RDF* snippets. They all define the context of a statement occurrence in rather intuitive and natural ways. 
Mostly they have URIs, RDF* snippets are self-contained. 
Applications may of course make more or other arrangements, streams might be tricky, a dedicated 'context' attribute could be defined, fallback mechanisms from more specific to more general types of contexts might be useful etc. but this looks like a solid baseline.

2) abstract statement type identification

In principle a statement identifier can be anything:
- an arbitrary, maybe mnemonic string,
- an RDF* <<s p o>> element,
- a blank node,
- a compressed version of the statement, e.g. a ZIP archive,
- a hash of the statement, e.g. an MD5 hash
etc. 
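
A few of these options side by side, as a rough Python sketch. The function names are invented for illustration, and joining the three terms with single spaces before hashing is an assumption, not a canonical serialization.

```python
import hashlib
import uuid


def star_id(s, p, o):
    """RDF*-style identifier: the statement itself, readable without any lookup."""
    return f"<<{s} {p} {o}>>"


def hash_id(s, p, o):
    """MD5 hash of the statement: fixed length, reproducible, but not readable."""
    return hashlib.md5(f"{s} {p} {o}".encode("utf-8")).hexdigest()


def blank_node_id():
    """Blank node label: short, but arbitrary, so the statement has to be looked up."""
    return "_:" + uuid.uuid4().hex


flat = (":bob", "foaf:age", "23")
# A nested statement uses another statement's identifier as its subject.
nested_star = star_id(star_id(*flat), "dct:creator", ":crawler1")
nested_hash = hash_id(hash_id(*flat), "dct:creator", ":crawler1")
```

The nested RDF* identifier grows with every level of nesting while the hash stays at 32 characters, which is the usability trade-off between the two.
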

All these techniques have some up- and downsides which I don’t want to discuss extensively here. Arbitrary strings as well as RDF* snippets do, within the same document, have usability advantages. One of my gripes with RDF* is that in nested statements it isn’t so usable anymore as e.g. tracking provenance through chains of transformations would probably be rather painful. Mnemonic strings or blank nodes might have usability advantages over RDF* <<s p o>> elements in certain settings. RDF* surely has its merits but I’m not sure it’s worth the effort of adding it to the already extensive suite of serializations. IMO adding an ID attribute to other serializations like Turtle would make more sense and some of the support for RDF* comes from a wrong perception about what the problems of RDF standard reification really are. But well - I won’t argue against it much as it doesn’t hurt much either, and again: it sure has some merits, at least in SA mode.

The semantics of RDF requires the identifier to be a URI or a blank node. That makes it necessary either to tweak the core model theoretic semantics, as in Olaf's proposal or by allowing literals in subject position, or to make identifiers go through some URI-ification procedure.
When used as a statement identifier in an RDF reification quadlet all these techniques identify an abstract statement. The RDF* <<s p o>> element additionally is syntactic sugar for the RDF standard reification quadlet.
That settles the abstract case. Defining an occurrence needs further arrangements.


3) concrete statement token identification

For the case of concrete occurrences two cases have to be considered:
- identification of statements inline/within some context,
- identification across contexts. 

Within some context - e.g. some Named Graph, RDF/XML document, turtle snippet, RDF* element - any statement identifier that does not explicitly define the context it refers to can be considered to address an occurrence in the same context. This settles the semantics of any RDF* <<s p o>> snippet and (not that anybody would care) of the ID attribute in RDF/XML. It could also nicely be extended to ID attributes in other serializations like Turtle which IMHO would be really helpful. This is nice and intuitive as long as it happens within one and the same context. 

Identification of occurrences across contexts is more complicated. To attribute a statement that has been asserted in another graph/document/RDF* snippet we need to be able to refer to that statement via a combination of its abstract identifier and an identifier of the context it occurs in. 
I am not aware of any attempt so far to address concrete statement occurrences in some specific named graph or other context (that makes me wonder if I’m missing something very obvious but let me proceed anyway…).
As this represents a step of disambiguation beyond the standard use case an intuitive solution might be to put the identifier of the abstract statement in front (like in the default case of local statement attribution) and add a reference to the context (but only when needed to refer to an outside statement), e.g. 
 
 someStatementID?context=someNamedGraphURI
 <<s p o>>?context=someDocumentURI
 <<s p o>?<context-id>>
 etc

This is of course a syntactic issue and the '?context=...' and '?<...>' syntaxes are just examples of how it could be approached.
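
How such a reference could be taken apart, including the local default from above, is sketched below in Python. It only handles the hypothetical '?context=' variant, and the function name parse_ref is made up for this example.

```python
def parse_ref(ref, current_context):
    """Split 'statementID?context=contextID' into (statement_id, context_id).

    Without an explicit '?context=' part the reference addresses an occurrence
    in the current context, which is the proposed local default.
    """
    statement_id, sep, context_id = ref.partition("?context=")
    return statement_id, (context_id if sep else current_context)


here = "http://example.org/thisGraph"
local = parse_ref("<<s p o>>", here)
remote = parse_ref("<<s p o>>?context=http://example.org/someDocument", here)
```

A real syntax would of course have to make sure that the separator cannot occur inside the statement identifier itself, e.g. in a literal.
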

Interestingly this works just as well with Named Graphs as defined in RDF 1.1 that have no semantics at all. Irrespective of whether the name of the graph is a "real" name denoting the graph or a mere label that also denotes some other resource: it does _address_ the graph just fine and a combination of graph+statement-identifier makes it possible to attribute a statement occurrence (despite the fact that under the semantics of RDF 1.1 the graph itself can’t be attributed safely).


Although this is still quite sketchy in the details I hope the general direction is clear and convincing. Please comment!

Best,
Thomas


[0] http://www.w3.org/TR/2014/NOTE-rdf11-datasets-20140225/
[1] https://www.w3.org/TR/rdf-primer/#reification
Received on Wednesday, 14 August 2019 12:53:03 UTC