drop referentially opaque semantics in embedded triples

Hi all,

there are different aspects to consider when discussing the right semantics for RDF-star and the hardest thing about it is to not let this mail get too long. 
When I talk about the semantics of RDF-star I’m referring to what an embedded triple means, what meaning it tries to capture and convey, not how that meaning is formalized. Concerning the latter I’m barely able to follow the discussions between Pierre-Antoine and Peter. But that’s no reason not to look into what the proposed semantics actually means, what it stands for and how people might intuitively understand and use it. That exploration can get "philosophical" and some people hate that, but work in the area of knowledge representation inevitably has to introspect the ways we think and express ourselves - otherwise it would be just blindly stumbling forwards.


TL;DR:

- Embedded triples should mean exactly the same as asserted triples, with the one exception that they are not asserted. 
- Embedded triples should, in terms of semantics, swim with the referentially transparent flow of RDF and concentrate on their modelling prowess.
- The real value proposition of RDF-star is extending the expressive capabilities of RDF with a self-contained statement identifier and bridging the gap to Property Graphs.
- Quoting semantics is an orthogonal purpose that comes with its own (and relatively rare) use cases, and with a bag of problems. Quoting semantics and RDF-star embedded triples should not be intertwined. 
- Practically this is easy to do: drop referential opacity for terms in embedded triples from the proposed semantics. Done.


THE PROBLEM

Pierre-Antoine proposes to treat the terms in embedded triples as referentially opaque. That is a problem because it makes embedded triples silently swim against the flow of all other RDF around them, although they are envisioned as an integral part of modelling complex constructs in RDF. 
A lot of things that can’t be expressed in simple triples are considered candidates for RDF-star embedded triples. The hopes are high that embedded triples overcome the expressive limitations of the simplistic triple formalism: getting rid of cumbersome and verbose reification syntax, bringing property graph style modelling capabilities to RDF, avoiding the pain of repurposing named graphs to singleton graphs, having to endure less blank node soup when modelling n-ary relations. If those hopes are indeed to come true, embedded triples will be everywhere in RDF, they will become a ubiquitous syntactic structure.
Yet according to the proposed semantics the terms in embedded triples are not referentially transparent, they don’t co-denote like _all_ other RDF around them. Instead they are treated like syntactic structures, prior to denoting a resource in the realm of interpretation that RDF ordinarily operates in. Nowhere else in RDF can such a semantics be found. 
If terms in embedded triples don’t co-denote they interrupt the flow of operation in RDF, because embedded triples without obvious reason cease to make the connections that everything else in RDF makes. Procedures and applications will fail to work as expected, references will be lost, alternatives will stay hidden and unused. As this happens rather silently, the impression will be that "embedded triples don’t work", or that "formal semantics don’t work", or that "RDF doesn’t work" (as opposed to Property Graphs for example). Pick your favorite horror scenario.


REFERENTIAL TRANSPARENCY, CO-DENOTATION etc.

Referential transparency is not a coincidental feature of RDF, it is an essential building block in the semantic web architecture. Two terms may refer to the same thing in the realm of interpretation: they may co-denote. And if they co-denote, they can be used interchangeably: they are referentially transparent. 
The semantic web is designed to enable data integration in a decentralized fashion. It has no centralized vocabulary: different IRIs may refer to the same resource and one resource may have multiple identifiers. The concept of a car has different identifiers in WikiData, in some car makers association's vocabulary or in schema.org. One can refer to me by my name as a string, by my email address(es), my web site or what have you. This reflects a decentralized reality and it is often messy, but overall very useful. Indeed this ability to integrate data on a global scale without having to establish normative identifiers upfront is generally considered indispensable to the architecture of the semantic web and a prerequisite to its realization as a shared and decentralized global information system. It is the translation mechanism that saves the semantic tower of Babel from collapsing.
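
To make co-denotation a bit more tangible, here is a minimal sketch in Python. It is only an illustration under assumptions: the three car identifiers are made-up example.org placeholders, not the real WikiData or schema.org IRIs, and the union-find helper is just one simple way to track "these identifiers denote the same thing".

```python
class CoDenotation:
    """Tracks which identifiers denote the same resource (simple union-find)."""

    def __init__(self):
        self._parent = {}

    def find(self, term):
        """Return the representative identifier of the resource term denotes."""
        self._parent.setdefault(term, term)
        root = term
        while self._parent[root] != root:
            root = self._parent[root]
        while self._parent[term] != root:  # path compression
            self._parent[term], term = root, self._parent[term]
        return root

    def same_as(self, a, b):
        """Declare that a and b denote the same resource."""
        root_a, root_b = self.find(a), self.find(b)
        if root_a != root_b:
            self._parent[root_b] = root_a

    def co_denote(self, a, b):
        return self.find(a) == self.find(b)


# Hypothetical identifiers for "car" in three independent vocabularies.
WD_CAR = "http://example.org/wikidata/Car"
MAKERS_CAR = "http://example.org/carmakers/Automobile"
SCHEMA_CAR = "http://example.org/schema/Car"

cars = CoDenotation()
cars.same_as(WD_CAR, MAKERS_CAR)
cars.same_as(MAKERS_CAR, SCHEMA_CAR)

print(cars.co_denote(WD_CAR, SCHEMA_CAR))                          # True
print(cars.co_denote(WD_CAR, "http://example.org/schema/Bicycle"))  # False
```

No vocabulary had to be declared normative upfront: the equivalences are added after the fact and every identifier stays usable.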


REFERENTIAL OPACITY, CITATION AND QUOTATION SEMANTICS

Referential opacity OTOH is the equivalent to quoting in natural language. Putting a word in quotes, "like so", constrains its possible interpretations to exactly _that_ word. In natural communication we don’t use quotation very often as it demands care and precision and its focus can hinder conveying our message. For example
 Thomas called the proposed semantics nonsense.
conveys me expressing deep frustration with the proposed semantics in a telco a few weeks ago. On the other hand
 Thomas called the proposed semantics "nonsense".
puts the focus on the exact term I used, useful to discuss the (un-) appropriateness of my wording. These are different foci, sometimes different use cases. We also have another citation mechanism in natural language, indirect speech:
 Thomas said that the proposed semantics were nonsense.
That is what we normally use to introduce a position that we don’t want to (or can’t) document with a precision required for quoting and that maybe we also don’t want to endorse.
Embedded triples, as they are not asserted, are well suited to encode indirect speech. Quoting semantics however are much stronger and are a relatively rare use case not only in natural language but also on the semantic web. The proposed semantics however puts them front and center. It changes the meaning of embedded statements in subtle but profound ways. It leaves the predominant use case of referential transparency undefined.

An often given example for the usefulness of referential opacity is disambiguating references to the morning star from those to the evening star - although a modern age person with more knowledge may make them co-denote the planet Venus, thus failing to document certain beliefs of people at given times and locations. A similar example is the knowledge that Superman and Clark Kent refer to the same extraterrestrial being - a fictitious "fact" well known to the readers of Superman comics but not to the Lois Lane persona in said comic. More generally referential opacity allows one to document belief systems from a higher, more encompassing view point. This can sure be very useful but it doesn’t harmonize well with the concept and project of a globally shared knowledge space that the semantic web tries to advance. On the semantic web it is and will always be a special case, an outlier, even in the age of alternative facts.
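
The Superman case can be sketched in a few lines of Python. This is a toy model, not RDF-star itself: the identifiers, the alias table and the function names are all assumptions made for illustration, and an embedded triple is simply a Python tuple.

```python
# Hypothetical alias table: the reader knows that ClarkKent and Superman co-denote.
SAME = {"ClarkKent": "Superman"}


def canon(term):
    """Map a term to its canonical identifier, if one is known."""
    return SAME.get(term, term)


# What Lois believes, recorded as an unasserted (embedded) triple.
BELIEF = ("Superman", "can", "fly")


def lois_believes(triple, transparent):
    """Check whether the given triple matches the recorded belief."""
    if transparent:
        # Referentially transparent: compare what the terms denote.
        return tuple(map(canon, triple)) == tuple(map(canon, BELIEF))
    # Referentially opaque: compare the terms themselves.
    return triple == BELIEF


print(lois_believes(("ClarkKent", "can", "fly"), transparent=False))  # False
print(lois_believes(("ClarkKent", "can", "fly"), transparent=True))   # True
```

The opaque reading keeps Lois's ignorance intact, the transparent one attributes to her a belief about Clark Kent that she doesn't hold. That is the special case where opacity earns its keep.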


THERE BE DRAGONS

This is slippery and largely uncharted territory: more precision can always be understood in two different ways, as _additional_ detail or as _constraining_ specialization. The glass of quoting is half-full or half-empty, depending on perspective. The proposed semantics however doesn’t provide a means to disambiguate between the two. The more ubiquitous the use of embedded triples on the semantic web will become, the more problems will arise from this ambiguity. We know that blues too well already from all the confusion about interpreting range and domain as constraints or as axioms. The proposed semantics has the potential to incorporate this sort of dissonance into a basic modelling primitive.
For example: what happens when a term in an asserted triple gets replaced by a co-denoting term? Is an annotation on the corresponding embedded triple still valid? Does the embedded triple have to follow suit and replace the term too? But then why have referentially opaque semantics in the first place? RDF-star provides no rule and no guidance on how to proceed in such a very likely scenario.
A related problem is that the shorthand annotation syntax doesn’t explicate the embedded triple. Therefore with the shorthand syntax there is no chance for the embedded triple to diverge from the asserted triple through co-denotation. That is nice but doesn’t it also mean that the two syntaxes aren’t equivalent under the proposed semantics?
In general I try to not get too obsessed with semantic rabbit holes but this looks like a can of worms, and so far it hasn’t been explored by this CG at all. Even if I considered referential opacity a useful default semantics for embedded triples it is IMO not ripe for standardization.
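
The replacement scenario can be played through in a small Python sketch. Everything in it is assumed for illustration: the terms :alice, :bob, :robert, the :since annotation and the dictionary-based "store" are made up, and a real RDF-star store will of course look different.

```python
# Hypothetical alias table: :bob and :robert co-denote.
ALIASES = {":bob": ":robert"}


def canon(triple):
    """Rewrite every term of a triple to its canonical identifier."""
    return tuple(ALIASES.get(term, term) for term in triple)


asserted = {(":alice", ":knows", ":bob")}
# Annotations are keyed by the embedded triple they describe.
annotations = {(":alice", ":knows", ":bob"): {":since": "2019"}}

# A cleanup step replaces :bob by :robert, but only in the asserted triples.
asserted = {canon(triple) for triple in asserted}


def annotations_opaque(triple):
    """Look up annotations by exact term match (referentially opaque)."""
    return annotations.get(triple, {})


def annotations_transparent(triple):
    """Look up annotations modulo co-denotation (referentially transparent)."""
    merged = {}
    for embedded, props in annotations.items():
        if canon(embedded) == canon(triple):
            merged.update(props)
    return merged


for triple in asserted:
    print(annotations_opaque(triple))       # {}
    print(annotations_transparent(triple))  # {':since': '2019'}
```

Under the opaque lookup the annotation is not wrong, it is just silently no longer found.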


USE CASES

Looking through the use cases [0] it’s rather obvious that _none_ of them asked for referential opacity, but many could suffer from the lack of co-denotation (**). One use case, UniProt, explicitly asks for referential transparency. 

The use case that Pierre-Antoine gave in the recent Lotico presentation [1] was "explainable AI". Now that is a use case that perfectly fits the proposed semantics, but it is also extremely special and as close to the metal of the semantic web’s inferencing capabilities as it could possibly get. It is hardly a proof of general usefulness. It's also the kind of use case that up to now has been served by named graphs. It certainly isn’t trying to cover property graphs, the WikiData data model or other approaches to attributed relations.

This would not be such a problem if embedded triples were proposed and designed as a special solution to a special problem. But they are not: quite to the contrary they are positioned as the go-to modelling primitive in all cases where the simplistic triple falls short in expressive power. Olaf in said Lotico presentation compared embedded triples to RDF standard reification, named graphs and singleton properties and pointed out that notwithstanding other obvious differences the main distinction between RDF-star and those approaches is that RDF-star doesn’t need an extra identifier as the embedded triple is itself the identifier. I think that is a good description of the essence of RDF-star. However all those other approaches that RDF-star aspires to replace are firmly embedded in the referentially transparent semantics of RDF. So why is the RDF-star semantics taking the opposite direction?


A PRUDENT APPROACH?

Pierre-Antoine argues that the proposed semantics is a prudent approach as one can always add entailments later whereas one never can take them back once they are made. I have to disagree in two ways. 

First of all: everything that changes the flow of RDF without proper cause and explicit notion is dangerous. 
It doesn’t matter if the approach adds or holds back entailments: _changing_ the entailment regime is what causes trouble. If a semantics breaks well established and deeply founded common expectations regarding the entailments it enables, then that semantics has to provide a very good reason for such disruptive behaviour and its proponents have to be prepared to pro-actively and vividly educate people on it. The OWA is a good example of a case where it's unavoidable to break common expectations - and how hard it is to educate users and implementors on such a shift. The authors however skim over the topic and leave its resolution undefined. 
Applications on the semantic web rely on co-denotation and they take it for granted. Disrupting this expectation without clear warning and very comprehensible reason risks taking users aback and disappointing them. (Formal) semantics is hard enough for users to understand and follow. Unexpected changes in its mechanics are sure to only drive them further into the camp of those that question the use of formal semantics anyway, and who feel little incentive to spend another thought on them. Such SNAFU will only add to the large piles of unsound triples that we already have to deal with.

Second, the claim that switching to referentially transparent semantics later on would always be possible is really only half the truth. Of course nobody is forbidden to run inferences to one’s heart’s content, but we have to aim for standardized procedures and behaviours. To make co-denotation of embedded triples available in practice the authors would have to come up with principled ways to implement and express referential transparency per triple, per set of triples, per vocabulary, per application. They would have to provide guidance on when such a move to referential transparency is advisable and how to communicate it to users and integrators. Instead users are left on their own, without any attempt to specify this vital part. The referentially transparent semantics that we can expect to be implicit in the vast majority of published embedded triples is not in any way standardized.

Olaf presented a proof of concept of how to implement referential transparency in extension of the proposed semantics, essentially implementing entailments in SPARQL on a per property basis. One critique of this approach was that it was quite involved and didn’t seem to integrate well into established workflows. But it also was a rather ad-hoc solution on a per case basis, based on the assumption that embedded triples are such a new kind of resource that they wouldn’t be usable with all established properties anyway. Olaf suggested that one would have to decide per property and/or class if they can be used with embedded triples, and maybe would even have to update vocabularies. This argumentation is in stark contrast to the portrayal of embedded triples as self-contained statement identifiers that are otherwise quite comparable to identifiers used in reification, for named graphs etc. Those identifiers have always been usable with any property and classes whatsoever (*). So why shouldn’t the same be true for embedded triples?

It is a recurring and worrying theme in the argumentation of the authors that they like to put the burden of disambiguation on vocabularies. Shared vocabularies are an important and valuable asset on the semantic web. They take a lot of time to develop, to advertize, to learn and implement. A proliferation of properties is hurting the usability of the semantic web. We are not living in an ivory-tower world where users first consult upper ontologies before eventually minting another subtly differentiated class or relation. We have to strive towards rough consensus and running semantics. Modelling primitives have to support such pragmatism, not hinder it. The semantics of embedded triples can’t be allowed to spill into the realm of vocabularies.


WHAT’S MISSING:

The claim that the proposed semantics of referential opacity for embedded triples can be extended to referential transparency has to be investigated for a use case like UniProt. It needs to be shown how this is not only possible in ad-hoc and cumbersome ways but actually feasible on large scale, in day-to-day operations, with enough elegance and conciseness to be practical. Questions that need to be answered are how to:
- implement referential transparency as a default, without beforehand knowledge of properties and classes (and assuming there are near uncountably many)
- or, if that is too hard, at least present a method how to implement referential transparency in a principled, rule-based way
- express referential transparency per statement 
- convey those transparent semantics to other users so that data can be exchanged in semantically sound ways
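
To give the first of these questions a concrete shape, here is a deliberately naive sketch in Python. It applies co-denotation to every term regardless of property or class, using the two UniProt IRIs quoted in (**) below. The property names (ex:interactsWith, ex:evidence), the example resources and the tuple-based graph are assumptions for illustration only, and the sketch says nothing about scale or about how such behaviour would be standardized.

```python
PURL = "http://purl.uniprot.org/uniprot/P05067"
IDORG = "http://identifiers.org/uniprot/P05067"

# Stand-in for an owl:sameAs statement between the two datasets.
SAME_AS = {IDORG: PURL}


def canon(term):
    """Canonicalize a term; an embedded triple is modelled as a nested 3-tuple."""
    if isinstance(term, tuple):
        return tuple(canon(part) for part in term)
    return SAME_AS.get(term, term)


def find_annotations(graph, embedded):
    """Return (predicate, object) pairs attached to an embedded triple,
    matching modulo co-denotation and without any per-property knowledge."""
    target = canon(embedded)
    return [(p, o) for s, p, o in graph if canon(s) == target]


# One annotated statement, published with the purl.uniprot.org IRI.
graph = [
    ((PURL, "ex:interactsWith", "ex:someProtein"), "ex:evidence", "ex:someEvidence"),
]

# A related database asks with the identifiers.org IRI instead.
print(find_annotations(graph, (IDORG, "ex:interactsWith", "ex:someProtein")))
# [('ex:evidence', 'ex:someEvidence')]
```

Whether something along these lines is feasible in day-to-day operations, and how the result would be conveyed to other users, is exactly what still needs to be shown.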

Of course, if such a mechanism presents itself, and an elegant and concise one at that, the next question would still be why the tail is wagging the dog: why make the overwhelming majority of use cases jump through hoops instead of putting such burden on the rare cases that actually need referentially opaque quotation semantics?


SEPARATION OF CONCERNS

A better idea might be to follow the proven design principle of separation of concerns, and design and implement referentially opaque semantics independently of the syntactic feature of embedded triples. Quotation could be just as useful for individual terms (in the "nonsense" example above there is no reason why the identifier for Thomas shouldn’t co-denote) and for graphs (as theories are often more complex than just one triple). Quoting semantics can for example be implemented through a special type of literal, as Antoine Zimmermann proposed - separation of concerns and intuitive interface ("…") included. But that work should be tackled as a separate task as it’s quite orthogonal to the issue of embedded triples.
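
A rough Python sketch of what such a separation could look like: quoting becomes an explicit, per-term opt-in while everything else stays referentially transparent. Note that this is not Antoine Zimmermann's actual proposal, whose details are not spelled out here. The Quoted wrapper, the alias table and all identifiers are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Quoted:
    """A term to be taken literally; hypothetical stand-in for a quoting literal."""
    text: str


# Hypothetical alias table: these pairs co-denote.
ALIASES = {":tom": ":thomas", ":rubbish": ":nonsense"}


def canon(term):
    """Canonicalize ordinary terms, leave quoted terms untouched."""
    if isinstance(term, Quoted):
        return term
    return ALIASES.get(term, term)


def same_statement(a, b):
    """Compare two triples modulo co-denotation of their unquoted terms."""
    return tuple(map(canon, a)) == tuple(map(canon, b))


indirect = (":thomas", ":called", ":nonsense")
quoted = (":thomas", ":called", Quoted(":nonsense"))

# Indirect speech: every term co-denotes as usual.
print(same_statement(indirect, (":tom", ":called", ":rubbish")))         # True
# Quotation: the subject still co-denotes, the quoted word does not.
print(same_statement(quoted, (":tom", ":called", Quoted(":nonsense"))))  # True
print(same_statement(quoted, (":tom", ":called", Quoted(":rubbish"))))   # False
```

The point of the design is that opacity is visible at the exact term that needs it, and nothing else in the statement changes its behaviour.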


Best,
Thomas



[0] https://w3c.github.io/rdf-star/UCR/rdf-star-ucr.html
[1] http://www.lotico.com/index.php/Metadata_for_RDF_Statements:_The_RDF-star_Approach



(*) There is one exception to this rule: you can’t say anything about another triple that would make that triple false, as RDF has strictly monotonic semantics. If in doubt interpret an annotation not as constraining the subject, but as adding to the description - which again boils down to the question if the glass is half-empty or half-full. Apart from that however everything is fair game, and rightfully so. 



(**) RDF-star Use cases call for referential transparency

The use cases collected in RDF-star Use Cases and Requirements (Unofficial Draft 30 April 2021) [0] in their overwhelming majority call for referential transparency - mostly implicitly, sometimes even explicitly.

- Use case 3.1 describes the simple case where a relation is attributed with some similarity measure. Nothing hints at a need for strict quotation as the relation itself has standard co-denotation semantics.
- Use case 3.2 argues against "unnecessarily restrictive relations between interpretation and representation" and explicitly calls it a mistake to treat blank node labels as referentially opaque.
- Use case 3.3 again presents a standard RDF relation where nothing hints at the need to treat one of the involved nodes - :pavel, :worksAt and :Stardog - as quotes. 
- Use cases 3.4 (Meta-properties) and 3.5 (WikiData) are no different.
- Use case 3.6 (UniProt) explicitly calls for referential transparency. To quote: "At the same time we deal with a lot of renaming (IRIs) for the same thing. e.g. a related database might use http://identifiers.org/uniprot/P05067 instead of http://purl.uniprot.org/uniprot/P05067. And a owl:sameAs is used to merge these datasets. Our attributions/evidences should be found no matter which IRI is used."
- Use case 3.7 is concerned with data format issues to which the question at hand doesn’t apply.
- Use case 3.8 is concerned with shape definitions. As this is a rather technical use case which uses well defined terms from a very specific vocabulary (SHACL in this case) quoting semantics probably doesn’t hurt but isn’t a necessary requirement either (rather doubling down on what is already evident: when creating SHACL constraints, using terms from the SHACL vocabulary makes good sense).
- Use case 3.9 (attribute individual nodes) is again a standard RDF use case and doesn’t need referential opacity (and, b.t.w., isn’t met by RDF-star).
- Use case 3.10 (represent competing scenarios) explicitly calls for a choice depending on actual needs as "Both scenarios can benefit from referential transparency but also faithful reproduction of source information".
- Use case 3.11 (occurrences in other graphs) is concerned with an orthogonal issue.
- Use case 3.12 (commit history) would harmonize well with referential opacity, however it doesn’t profit from it either. This is also a very technical use case and very close to the metal.
- Use case 3.13 is rather concerned with the relation of RDF-star to RDF standard reification - which is of course referentially transparent, but still the use case seems irrelevant to the issue at hand.
- Use case 3.14 (time-sensitive data) describes attributions from an orthogonal perspective like time of access. Nothing hints at a need for referential opacity in the base data. Referential opacity would introduce a further orthogonal perspective and thus might be rather un-welcome. 
- Use case 3.15 (uncertainty representation) is again a case of adding orthogonal attributions to base data.
- Use case 3.16 (Compact Serialization of OWL Graphs) asks for a more compact representation of reification in OWL. It is safe to assume that the user also would still like to be able to make meaningful use of owl:sameAs and would therefore suffer from the lack of referential transparency in embedded triples.
- Use case 3.17.1 is concerned with the orthogonal aspect of annotating occurrences in other graphs. It’s unclear if it would benefit from referential opacity: occurrences do look like a more likely candidate for such a restrictive semantics, but not necessarily so.
- Use case 3.17.2 is concerned with administrative issues orthogonal to the question at hand.
- Use case 3.17.3 (graph-level metadata) is again concerned with occurrences. So it might or might not profit from referential opacity, but as the author doesn’t call for it, it is impossible to tell if he would be pleasantly or unpleasantly surprised by the lack of co-denotation.

To sum up: only one sole use case, 3.10, explicitly calls for referential opacity - as an option, not as a default! - and that use case was submitted by yours truly to make the case for an explicit selection mechanism. One use case, 3.12 (commit history), seems like a natural fit for referential opacity, but wouldn’t profit from it much. Some use cases like UniProt very explicitly call for referential transparency. For the vast majority we are left with our best judgement. It is safe to say however that _nobody_ called for this feature as the standard semantics. If we were counting, it would be about 3:0 against the proposed semantics, with about a dozen abstentions.

Received on Thursday, 6 May 2021 21:30:54 UTC