Why Option 1 from Thomas Lörtsch on 2024-02-13 (public-rdf-star-wg@w3.org from February 2024)

From: Thomas Lörtsch <tl@rat.io>
Date: Tue, 13 Feb 2024 12:38:23 +0100
To: RDF-star Working Group <public-rdf-star-wg@w3.org>
Message-Id: <2F982C7E-C474-4E6C-B018-89DF8BA31E54@rat.io>

Hi all.

tl;dr
I was asked to lay out the arguments for option 1. Well, it is the simplest solution, and it covers all the needs we agree on: the new syntaxes ensure wellformed and concise reification of occurrences(*), and a wellformedness-constraint allows data consumers to reject non-conforming uses of the mapping to RDF standard reification. It requires no complicated extensions of the RDF model theory and abstract syntax, but provides all the practical benefit. Extending option 1 towards the expressivity of option 2 is possible when the rare need arises, at no extra cost compared to option 2 itself. Some like the symmetry between option 2 and option 3, but option 3 suffers from the same problem as option 2: it adds an unnecessary reference to the type, reminiscent of the type-based approach that the CG proposal followed and that we decided to abandon in late 2023. A more concise alternative to option 3 is discussed at the end of this mail.

What we seem to agree on:

We seem to agree on user facing syntaxes, i.e. syntactic sugar for reification:
- annotation syntax in Turtle, e.g.
:s :p :o {| :x :y |} # an edge implicitly named by a blank node
:s :p :o {| :e | :x :y |} # an edge explicitly named by :e
- triple term syntax, e.g.
<< :s :p :o >> :x :y . # as above
<< :e | :s :p :o >> :x :y . # as above

We also seem to agree on which expressivity we strive for, namely that use cases ask for:
- occurrences, not types
(i.e. not the abstract statement, but the who/when/why/… etc are the focus of interest)
- referential transparency
(i.e. the syntax is only a means, its interpretation is crucial)

The two questions we need to decide:

The issue now is how to formalize all this and how to map it to RDF 1.0/1.1 triples. The table "Seeking Consensus" [1] describes the currently discussed alternatives. The proposals can be categorized along two axes, each with two groups. There’s the distinction between proposals that formalize occurrences versus those that formalize types, and there’s the distinction between proposals that extend the RDF model and abstract syntax with a triple term or edge, and those that restrain themselves to a wellformedness-enforcing semantics extension. However, one option is missing: a triple term à la option 3, but corresponding to option 1 w.r.t. its semantics. It’s called Option '3alt' in the following table (the layout is sure to be garbled in transmission, sorry in advance).

             || Semantics | Extend RDF model  || Occurrence | Type
             || extension | & abstract syntax ||            |
-------------||-----------|-------------------||------------|------
Option 1     ||     x     |                   ||     x      |
Option 2     ||     x     |                   ||            |  x
Option 3     ||           |         x         ||            |  x
Option 3alt  ||           |         x         ||     x      |

It is important to understand that these two issues are orthogonal: option 2 and 3 both start out from representing types of statements, but provide a formalization either as semantic extension (essentially a well-formedness constraint) or as a new term type in RDF model theory and abstract syntax. They nicely mirror each other, but they both re-introduce the concept of a triple *type* which we actually decided to drop in favor of occurrences.
Option 1 starts out from occurrences and argues that syntactic sugar and a semantic extension are all we need to ensure sound use of reification in practice. Proper support in syntaxes combined with a semantic extension that allows data consumers to reject ill-formed reifications provides all the practical guarantees without any change to the existing formalization of model theory and abstract syntax.
However, as some prefer the more involved approach, I outline Option '3alt' below: a 'triple occurrence term' << :e | :s :p :o >> that is symmetric to option 1 and an alternative to option 3’s 'triple type term' <<( :s :p :o )>>.
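
To make the idea of a consumer rejecting ill-formed reifications more tangible, here is a minimal Python sketch. It is only an illustration: the rule it checks (a reifying node must carry exactly one rdf:subject, one rdf:predicate and one rdf:object) is an assumption made for this sketch, not the wording of any proposed semantics extension, and the function name ill_formed_reifiers is hypothetical.

```python
from collections import defaultdict

REIFICATION_PROPS = ("rdf:subject", "rdf:predicate", "rdf:object")

def ill_formed_reifiers(triples):
    """Return the nodes that use reification properties, but not exactly one of each.

    ASSUMPTION: this "exactly one of each" rule is a stand-in for whatever
    wellformedness constraint the semantics extension would actually define.
    """
    seen = defaultdict(lambda: defaultdict(set))
    for s, p, o in triples:
        if p in REIFICATION_PROPS:
            seen[s][p].add(o)
    return {
        node
        for node, props in seen.items()
        if any(len(props.get(p, ())) != 1 for p in REIFICATION_PROPS)
    }

good = [
    (":e", "rdf:subject", ":s"),
    (":e", "rdf:predicate", ":p"),
    (":e", "rdf:object", ":o"),
    (":e", ":x", ":y"),
]
# :f lacks rdf:predicate and rdf:object, so a consumer could reject it.
bad = good + [(":f", "rdf:subject", ":s")]

print(ill_formed_reifiers(good))
print(ill_formed_reifiers(bad))
```

A consumer that enforces the constraint would refuse (or flag) any graph for which this set is non-empty; no change to the RDF model or abstract syntax is involved.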

Why option 1:

Option 1 in the table "Seeking Consensus" [1] has the benefit that it is the simplest of all proposals and at the same time covers the predominant use case: annotation of occurrences. Here <<:e | :s :p :o>> expands to

:e rdf:subject :s .
:e rdf:predicate :p .
:e rdf:object :o .
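
The expansion above is mechanical enough to write down as a small Python sketch. Triples are modelled as plain tuples of strings and the function name expand_option1 is hypothetical; nothing here is taken from an actual implementation.

```python
def expand_option1(e, s, p, o):
    """Expand <<e | s p o>> into the three reification triples of option 1."""
    return [
        (e, "rdf:subject", s),
        (e, "rdf:predicate", p),
        (e, "rdf:object", o),
    ]

triples = expand_option1(":e", ":s", ":p", ":o")
for t in triples:
    # Print each triple in an N-Triples-like form.
    print(" ".join(t), ".")
```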

This expansion can easily be extended by further triples towards option 2 if the need arises. Option 2 introduces an indirection between a triple and the identifier of its reification (here named :e):

:e rdf:nameOf [
rdf:subject :s ;
rdf:predicate :p ;
rdf:object :o
]

If one reformulates this slightly, removing some of the syntactic sugar, it becomes more apparent where the two options differ. This is again option 2:

:e rdf:nameOf _:e .
_:e rdf:subject :s .
_:e rdf:predicate :p .
_:e rdf:object :o .
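
As a rough Python sketch of the same expansion (tuples of strings as triples; the helper names fresh_bnode and expand_option2 and the blank node labels are made up for illustration):

```python
import itertools

_counter = itertools.count()

def fresh_bnode():
    """Mint a new blank node label (the labelling scheme is an assumption)."""
    return f"_:b{next(_counter)}"

def expand_option2(e, s, p, o):
    """Expand <<e | s p o>> as in option 2: one rdf:nameOf triple plus a
    reification triplet hanging off an intermediate blank node."""
    b = fresh_bnode()
    return [
        (e, "rdf:nameOf", b),
        (b, "rdf:subject", s),
        (b, "rdf:predicate", p),
        (b, "rdf:object", o),
    ]

triples = expand_option2(":e", ":s", ":p", ":o")
for t in triples:
    print(" ".join(t), ".")
```

The only things that differ from the option 1 expansion are the first triple and the subject of the remaining three.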

Obviously the difference to option 1 is actually pretty minimal: option 2 introduces an indirection via an intermediate blank node and a mandatory 'rdf:nameOf' relation to the identifier of an actual occurrence. Why this indirection, why the extra triple, why a blank node?

The indirection is supposed to ensure backwards compatibility with existing uses of reification because the range of rdf:nameOf is governed by a wellformedness restriction [2]. The blank node ensures locality of identification. The serialization using Turtle’s syntactic sugar does indeed make the construct look like a somehow closed entity, suggesting a safety from external interference and messy misuse. However, the equivalent N-triples serialization clarifies that the magic is constrained to an extra triple and a need for an existential reification *type* identifier.

This construct can easily be added to option 1 as well, but should it? The introduction of an intermediary node between reification and occurrence identifier adds a generally unneeded option to the basic formalism and thereby increases the danger of misunderstandings and diverging interpretations. We already spent two Task Force meetings discussing intricacies of the blank node that option 2 introduces to identify the triple type: is it in a functional relation to the triple it represents? Does it introduce referential opacity through the backdoor? Can it be leaned away? (Pierre-Antoine, Andy and I all at some point or another thought it can, but it can’t!) Is it a special kind of blank node that needs additional merging rules? Is it therefore a new term type? Etc. These discussions confirm to me that every extra moving part exponentially increases the risk of misunderstandings. Sound and practical semantics is best achieved through a minimum of constructs, maximally adapted to the predominant use cases. This is not only about triple count or ease of querying, it is also about what makes obvious sense and what adds more ambiguity rather than disambiguation. We decided to make the syntaxes refer to occurrences, not types. Introducing types into the formalization opens the can of worms again that we just hoped to have closed for good.

Two more arguments are brought forward in favor of the more involved construct of option 2. However, both can be realized in option 1 just as well, if the need actually arises, and without disadvantage to either the general or the special case.

One argument pro option 2 is that it caters for triple types. However, in those rare cases where one needs to refer to a statement as type, e.g. for aggregation purposes, adding a further statement to option 1, e.g.
:e rdfx:hasType :e_t .
can express that (of course an extra clause in the semantics extension may provide additional safety also to this special case). Now the triple count is merely on par with option 2, and this additional triple will seldom be needed as the use cases show.
On the other hand there is the risk that users will misuse option 2 by annotating the blank node identifier directly. In that case they will inadvertently use RDF standard reification semantics of occurrences although their intent might be to annotate the type. If the dear reader at this moment has trouble following: so will the users. An extension of option 1 by an extra 'rdfx:hasType' is in practice much safer.
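
The triple-count argument can be checked with a small Python sketch. It assumes triples as tuples of strings; 'rdfx:hasType' is the illustrative property from the example above, and the function name and the type_id parameter are hypothetical.

```python
def expand_option1(e, s, p, o, type_id=None):
    """Option 1 expansion, optionally extended by a link to a statement type."""
    triples = [
        (e, "rdf:subject", s),
        (e, "rdf:predicate", p),
        (e, "rdf:object", o),
    ]
    if type_id is not None:
        # Only emitted in the rare case where the type is actually needed.
        triples.append((e, "rdfx:hasType", type_id))
    return triples

plain = expand_option1(":e", ":s", ":p", ":o")
typed = expand_option1(":e", ":s", ":p", ":o", type_id=":e_t")

print(len(plain))  # 3 triples in the common case
print(len(typed))  # 4 triples, on par with the option 2 expansion
```

Note that annotations always attach to :e, the occurrence, so there is no intermediate node that could be annotated by accident.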

A second argument pro option 2 is that it allows for easy extension towards graphs, because the indirection introduced by ':e rdf:nameOf [ <reification triplet> ]' allows for many-to-many relationships. The same can again be achieved via an extra triple added to option 1, e.g.
:e rdfx:inGraph :g .
Now the triple count of options 1 and 2 is again on par, but again only if the need actually arises. It’s another question if the concept of constructing graphs from reifications has any future and should even be considered a useful argument.
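
Purely as an illustration of what such a graph extension could look like on top of option 1, here is a Python sketch that collects statements per graph. 'rdfx:inGraph' is the illustrative property from above; the function name and the data are hypothetical, and the sketch takes no position on whether this is a good idea.

```python
from collections import defaultdict

def graphs_from_reifications(triples):
    """Group reified statements by the graphs their occurrences are linked to."""
    parts = defaultdict(dict)    # occurrence -> {reification property: value}
    members = defaultdict(set)   # graph name -> occurrences linked to it
    for s, p, o in triples:
        if p in ("rdf:subject", "rdf:predicate", "rdf:object"):
            parts[s][p] = o
        elif p == "rdfx:inGraph":
            members[o].add(s)
    return {
        g: {
            (parts[e]["rdf:subject"], parts[e]["rdf:predicate"], parts[e]["rdf:object"])
            for e in occurrences
            if len(parts[e]) == 3  # skip occurrences without a complete triplet
        }
        for g, occurrences in members.items()
    }

data = [
    (":e1", "rdf:subject", ":s"), (":e1", "rdf:predicate", ":p"), (":e1", "rdf:object", ":o"),
    (":e2", "rdf:subject", ":a"), (":e2", "rdf:predicate", ":b"), (":e2", "rdf:object", ":c"),
    (":e1", "rdfx:inGraph", ":g"),
    (":e2", "rdfx:inGraph", ":g"),
    (":e1", "rdfx:inGraph", ":h"),  # the same occurrence may be linked to several graphs
]
graphs = graphs_from_reifications(data)
print(graphs)
```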

Why a simpler alternative to option 3:

So far I mainly argued about the differences between option 1 and 2. That however is mainly relevant if one agrees that no change to the model theoretic semantics and abstract syntax of RDF is needed, i.e. that there is no need to hardcode triple terms into the heart of RDF, but syntactic sugar in serializations and SPARQL and a supporting wellformedness-condition is all we need. However, some also seem to like the symmetry between option 2 and option 3, and that made me aware that option 3 is just as unnecessarily involved as option 2. Also option 3 re-introduces the triple type that we decided against when coming to a rough consensus at the end of last year. The type syntax
<<( :s :p :o )>>
was introduced in December 2023 to disambiguate the type from an implicitly named occurrence
<< :s :p :o >>
which in the new occurrence-focused consensus expands to
<< [] | :s :p :o >>
But we don’t have types anymore in the syntactic sugar, and therefore we also don’t need a primitive to disambiguate them from unnamed occurrences. Sure, a standalone occurrence in N-triples is not wellformed. E.g. the following
<< :e | :s :p :o >>
is just a term, not a complete statement. On the other hand its expanded representation
:e rdf:subject :s .
:e rdf:predicate :p .
:e rdf:object :o .
is wellformed. But is this discrepancy a problem? IMO it isn’t as reifications don’t make any sense on their own. Without further attributions they just say that some statement can exist. The syntactic sugar provided by the star-serializations, e.g. ':s :p :o {| :x :y |} ' and '<< :s :p :o >> :x :y', depends on actual annotations on a reification. Those serializations provide no self-contained reification. Consequently we also don’t need it in the form of an expansion of
<<:e | :s :p :o>>
to
:e rdf:nameOf <<( :s :p :o )>> .
as option 3 suggests.
We don’t need
<<( :s :p :o )>>
for anything other than backwards compatibility with the CG proposal, but we agreed to leave that behind. So we don’t need it at all. Just like option 2 adds one more triple to option 1 without compelling benefit, the expansion proposed in option 3, ':e rdf:nameOf <<( :s :p :o )>> .' only adds one more triple without practical use. Like the blank node introduced in option 2, this type syntax again risks that people annotate the type when they mean to annotate an occurrence. We should provide annotations to types as an explicit option, but not as a casual possibility. As shown above the triple count in the end is not worse, but the safety from inadvertent misuse is much higher this way round.
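
A small Python sketch may help to see what the option 3 expansion demands of the data model compared with the option 1 expansion. Triples are tuples, plain terms are strings, and a nested tuple stands in for the triple type term <<( :s :p :o )>>; all function names are hypothetical and the modelling is an assumption made for illustration only.

```python
def expand_option1(e, s, p, o):
    """Option 1: three triples over ordinary terms."""
    return [
        (e, "rdf:subject", s),
        (e, "rdf:predicate", p),
        (e, "rdf:object", o),
    ]

def expand_option3(e, s, p, o):
    """Option 3: one rdf:nameOf triple whose object is a triple type term."""
    # The nested tuple stands in for the new term type <<( s p o )>>.
    return [(e, "rdf:nameOf", (s, p, o))]

def uses_only_plain_terms(triples):
    """True if every term is an ordinary (string) term, i.e. no new term type is needed."""
    return all(isinstance(term, str) for t in triples for term in t)

opt1 = expand_option1(":e", ":s", ":p", ":o")
opt3 = expand_option3(":e", ":s", ":p", ":o")

print(uses_only_plain_terms(opt1))  # True
print(uses_only_plain_terms(opt3))  # False
```

In this toy model the option 1 expansion stays within existing terms, while the option 3 expansion only works once a new kind of term is admitted in object position.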

So if this WG decides to extend RDF model theory and abstract syntax with a new term type, then it should be focused on what we actually need, called here Option 3alt, a plain *triple occurrence term*:

<< :e | :s :p :o >>

and not another intermediate like <<( :s :p :o )>>, expanding to option 2 with its added indirection, resulting in unclear intuitions, diverging interpretations and inviting misuse.

Why neither option 3 nor that simpler alternative 3alt:

Hardcoding the triple term syntax - either for type or occurrence - into the RDF model theory and abstract syntax doesn’t bring any tangible benefit. Options 1 and 2 are perfectly well suited to describe and enforce the intended semantics. The rest is syntax and implementation detail: triple term syntaxes allow transferring reifications over the wire without the maintenance headaches that RDF standard reification quads incur. Implementations should be free to choose how they implement those terms: rather verbatim as some of the RDF-star implementations suggest, with named graphs as Dydra does, as RDF reification quads as some triple stores still do, or in any other way they come up with. The specification shouldn’t try to force implementors in any specific direction. A change to RDF model and abstract syntax should bring some real benefit and should be met with strong support in the community. A formalization of triple terms as proposed in option 3 has neither.

To make a final argument, the approach of option 1 - a combination of syntactic sugar and a semantics extension - IMO provides not only a much easier and more modular solution to our immediate problem, but can also serve as a blueprint to solve other problems in RDF that involve compounds of statements, especially lists. No matter what this WG decides w.r.t. triple terms, we will not have further introductions of new term types to model and abstract syntax anytime soon. It is just too deep a change to be done in a "living standard" way, almost casually. Introducing a semantics extension as an optional, although strongly suggested, constraint on new syntactic sugar (or even newly published data in existing syntaxes) seems to me like a mechanism much easier to agree on and implement.

Best,
Thomas

(*) Occurrences are also known as instances, tokens and edges. We should at some point decide on one of those names, but for the sake of this discussion they all are fine.

[1] https://htmlpreview.github.io/?https://github.com/w3c/rdf-star-wg/blob/main/docs/seeking-consensus-2024-01.html
[2] https://lists.w3.org/Archives/Public/public-rdf-star-wg/2024Jan/0138.html

Received on Tuesday, 13 February 2024 11:38:40 UTC