Addressing Problems from Niklas Lindström on 2023-12-06 (public-rdf-star-wg@w3.org from December 2023)

From: Niklas Lindström <lindstream@gmail.com>
Date: Wed, 6 Dec 2023 13:34:10 +0100
To: RDF-star Working Group <public-rdf-star-wg@w3.org>
Message-ID: <CADjV5jcVhQxG89Ak1w+CuJXzBPGepu71KTDRSOHHF+oJ9Rjk0w@mail.gmail.com>

Dear all,

We must address the problems shown with triple terms, from the point
of view of the use cases we must cater for in RDF.

I recently asked some short questions about triple terms [1] that are
yet to be answered.

But the most important question right now is:

Can we responsibly introduce triple types to RDF?

It is a radical change to RDF. It introduces a new term, for the first
time since RDF 1.0, which is to be used in both the subject and object
position of other triples.

A triple is no longer a three-tuple of primitives, but a recursively
defined tree structure, unlike IRIs and blank nodes (which are atomic
primitives) and literals (which are tuples of a lexical form, a datatype
IRI and, for the rdf:langString type, a language tag). Triple terms are
not sets of triples (i.e. RDF graphs), but triples that can themselves
contain triples. That is a more exotic graph data structure: edges
whose nodes can be edges (nested edges at that, not part of the outer
graph).
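
To make the data-structure point concrete, here is a minimal Python sketch of terms where a triple may itself appear in the subject or object position. The class names (`IRI`, `Literal`, `Triple`) and the `depth` helper are hypothetical illustrations, not part of any RDF library, and blank nodes are left out for brevity.

```python
from dataclasses import dataclass
from typing import Union


@dataclass(frozen=True)
class IRI:
    value: str  # atomic term


@dataclass(frozen=True)
class Literal:
    lexical: str   # lexical form
    datatype: IRI  # datatype IRI (language tag omitted in this sketch)


@dataclass(frozen=True)
class Triple:
    # Assumption for illustration: a triple term may be nested in the
    # subject or object position, which makes the structure recursive.
    subject: "Union[IRI, Triple]"
    predicate: IRI
    object: "Union[IRI, Literal, Triple]"


def depth(term) -> int:
    """Nesting depth: 0 for atomic terms and literals, 1 + deepest part for triples."""
    if isinstance(term, Triple):
        return 1 + max(depth(term.subject), depth(term.predicate), depth(term.object))
    return 0


inner = Triple(IRI("bob"), IRI("age"), Literal("23", IRI("xsd:integer")))
outer = Triple(IRI("alice"), IRI("claims"), inner)
nested = Triple(outer, IRI("statedBy"), IRI("carol"))
```

A flat triple has depth 1 here; each level of nesting adds one, so any consumer of such terms has to walk a tree rather than read three atomic slots.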

(Note that named triples, like named graphs, would be a way to avoid
this complexity.)

So a corollary question is: Can we responsibly add such a complex form
of *trees* as a *primitive* to RDF and SPARQL?

This is not even remotely close to a minimal change. It redefines the foundation.

This change will affect the entire RDF community. At this year's DCMI
conference, I got comments like: "adding this to RDF would turn it
into XML, in a bad way," and "why don't you define semantics for named
graphs instead?".

(Note that XML and HTML (and soon JSON) in RDF are fully opaque
literals; not data structures. Database extensions for SPARQL are free
to enable querying within those of course, but that's decidedly
outside of RDF territory, and rightfully so.)

As Ora said during the last telecon (2023-11-30), "we're not the 'we
like RDF'-social club". Certainly not. We're here to refine a
technology which has been used in the wild for more than 20 years.
Anyone can make things bigger and more complex. We have a
responsibility to keep RDF as simple as possible (cf. [2]).


## Triple Terms Do Not Work As Advertised

So triple terms must really be worth it then, to warrant this
complexity being irrevocably added to the core of RDF? That has not
been shown. On the contrary, it has been shown that triple terms do
*not* work for what they are purportedly introduced for (provenance
and qualification).

This all depends on use cases. *If* RDF-star is explicitly added for
talking about universal, abstract, recursive triple structures
themselves, then that *might* warrant something like this complexity.
I have asked for use cases for that, but haven't gotten any answers.
(I can imagine that this would open up for some rule based cases, like
what Notation 3 is being used for today, but without graph terms I
cannot see it being of much practical use.)

The RDF-star examples commonly seen, e.g. in the CG report and in the
GraphDB tutorial [3], basically all are about provenance and
qualification of some sort. You can easily, in each example, see how
the "seminal error" of the "seminal example" would be committed by
adding just one single temporal fact to the triple itself. These are
patterns from the CG report:

    << <s> :p <o> >> :accordingTo <someone> .

    << <s> :p <o> >> :statedBy <someone> .

    <alice> :claims << <bob> :age 23 >> .

One seminal error remains in example 9:

    :a :name "Alice" {| :statedBy :bob ; :recorded "2021-07-07"^^xsd:date |} .

The ones in the CG report that are not directly about such provenance
or qualification facts are either the attempted correction using
:occurrenceOf in example 8, or perhaps example 17 (if you ignore the
pending seminal error made by putting a dct:source on it):

    <<?c a owl:Class>> dct:source ?src ;
        :entailing <<?c a rdfs:Class>> .

Of course, this still errs, since it is an opaque universal structure,
and there is no way that entailment could have been done without
associating a *token* of that structure with a specific context, here
of semantic definition of the resource that owl:Class denotes (in the
OWL ontology).

So again, you *cannot* qualify a triple type, as in talking about the
richer context from which a triple was derived. Because that derived
triple is a token of its type.

And you cannot describe provenance about the type of a triple either.
It can be asserted in many contexts, each being a token occurrence of
the triple.
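
The type/token problem can be sketched in a few lines of Python. This is only an illustration under assumptions: the tuple encoding, the occurrence identifiers `occ1` and `occ2`, the second source `carol` and the second date are all invented for the example, and `occurrenceOf` merely echoes the property name used in the CG report.

```python
# The triple "type": a plain value, identical wherever it is asserted.
TRIPLE = ("bob", "age", "23")

# Annotations attached directly to the triple type.
on_type = [
    (TRIPLE, "statedBy", "alice"),
    (TRIPLE, "recorded", "2021-07-07"),
    (TRIPLE, "statedBy", "carol"),
    (TRIPLE, "recorded", "2023-12-06"),
]

# Annotations attached to occurrences (tokens) of that triple.
on_token = [
    ("occ1", "occurrenceOf", TRIPLE),
    ("occ1", "statedBy", "alice"),
    ("occ1", "recorded", "2021-07-07"),
    ("occ2", "occurrenceOf", TRIPLE),
    ("occ2", "statedBy", "carol"),
    ("occ2", "recorded", "2023-12-06"),
]


def dates_stated_by(annotations, source):
    """Return the 'recorded' values found on every subject that `source` stated."""
    subjects = {s for s, p, o in annotations if p == "statedBy" and o == source}
    return {o for s, p, o in annotations if p == "recorded" and s in subjects}


mixed_up = dates_stated_by(on_type, "alice")   # both dates come back
separate = dates_stated_by(on_token, "alice")  # only alice's date comes back
```

With annotations on the type, both sources share one subject, so the dates can no longer be matched to who stated what. With one node per occurrence, each statement keeps its own provenance.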

Already three years ago, in [4], Pierre-Antoine noted that RDF-star is
easily misused.

Yet, as shown above, the CG report didn't make that sufficiently
clear, as it still commits those errors. And RDF-star is already being
taught and promoted as working for provenance and qualification (as in
"adding metadata to existing relationships") [5]; and as a
"replacement" for reification, and/or named graphs, for detailed cases
(which smaller, "embedded" named graphs are already being used for,
not least in JSON-LD).

(Here is another, recent example noting that RDF-star doesn't work for
these cases: [6]. It tries to stay positive that it could be made to
work (with examples that do not actually work).)

These are all clear warning signs, if not outright invalidations of
the current design.


## We Need To Talk About Occurrences

A triple term in RDF-star, right now, is the abstract triple *without*
a context. Like a structured literal, opaquely composed of subject,
predicate and object terms.

The same triple can be derived from many different contexts. And it is
the triple *in a context* that needs to be talked about. That's an
instantiated occurrence.

A triple is a simplification of one or more, granular, contextual occurrences.

A triple can even mean different things in different contexts, but
that is an advanced case of multiple worlds (achieved with isolated
named graphs, or disjoint datasets of graphs).

For provenance, qualification and any kind of annotation about a used
triple, hypothetically or actually, we're *always* talking about such
an occurrence. The occurrence itself! Occurrences such as the ones we
make when we make assertions, when we build graphs that form
descriptions of things.

There is an interest in, and a set of use cases for, using RDF-star
for qualification (or even n-ary relations), due to not wanting to
invent new terms (e.g. [7]).

The most obvious cases are generic ("oversimplified") relationships,
such as `dct:relation`. Many of those are commonly qualified by
subproperties; but most properties are, from some perspective,
simplifications of a more granular state of affairs. And singleton
properties have (more or less) proven to be too complex to work
effectively here in practice.

To again quote Pierre-Antoine, here in [8]: "if a relationship was
initially thought to be 'simple' enough to be modeled as a predicate,
and turns out to be more complex (either because of some exceptional
cases, such as people changing name), then RDF-star provides a smooth
transition from the original modelling to a more detailed one."

See also the follow-up in [9] by Jerven Bolleman. The
`connected:by_road_to` is an intuitive example of there being
occurrences behind the simple triple (there are many roads that lead
between towns).

This is a viable, recurring use case. But, again, a triple (its
"type") is a *simplification* of a more granular context. And it is
obvious that you cannot let the simplification itself *denote* a
qualification of it.
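
As a rough sketch of "a triple is a simplification of more granular occurrences", the following Python derives simple `connected:by_road_to` triples from a list of road records. The record type, the town and road names and the lengths are all made up for illustration.

```python
from typing import NamedTuple


class RoadOccurrence(NamedTuple):
    # Hypothetical granular record behind a simple "connected by road" triple.
    road_id: str
    start: str
    end: str
    length_km: int


occurrences = [
    RoadOccurrence("road1", "townA", "townB", 12),
    RoadOccurrence("road2", "townA", "townB", 19),
    RoadOccurrence("road3", "townB", "townC", 7),
]


def simplify(occs):
    """Collapse granular occurrences into a set of simple triples."""
    return {(o.start, "connected:by_road_to", o.end) for o in occs}


def backing(triple, occs):
    """Return the occurrences that a given simple triple summarises."""
    s, _, o = triple
    return [occ for occ in occs if (occ.start, occ.end) == (s, o)]


simple_triples = simplify(occurrences)
a_to_b = ("townA", "connected:by_road_to", "townB")
lengths = [occ.length_km for occ in backing(a_to_b, occurrences)]
```

Two roads collapse into one triple, so a statement such as "length 12 km" has no sensible home on the triple itself; it belongs to one of the backing occurrences.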

This is a crucial feature of RDF (as opposed to LPGs), and only
through "backing", unasserted, described occurrences of a triple can
we achieve this in a simple, backwards-compatible manner. Both
reification and named graphs cater for that (albeit the latter only in
practice, as in theory it is undefined what it caters for).

The RDF-star CG report, however, adds fundamental complexity, but
*still* needs an indirected node for the occurrence. And provides
little guidance in doing so, and ample ways to forget to do so! Using
a universal type in the subject position is for making universal
claims. This is still an open issue [10], and shows the range of
problems introduced (and the difficulties of discussing them).

This was also shown by Ora in the Neptune use cases [11]. I cannot
understand how the current trajectory is acceptable when this document
showcases these exact problems?


## Named Graphs?

Let's not forget that named graphs have been used for provenance for a
decade now ([12], [13], [14], [15], [16]).

We have recently, a bit more collectively, explored the relationship
between some form of RDF-star and named graphs. We've seen that there
can be one, but some problems have been made clear. One problem is
about graph terms, having the same problems as triple terms (opacity
or not, type or token). The other was that since named graphs are
resource names paired with an RDF graph in an *undefined* way
(side-stepping but not solving those questions), it is not formally
possible to define what that pairing contextually means within a given
dataset.

I would argue that defining standard options for dataset semantics,
which the wider RDF community now has a decade of experience with and
is now asking for, is *not* adding complexity, and could help us out a
lot. It addresses the *challenging* task of consolidating what is out
there with something explicitly left undefined until we can do that.
Our charter may prevent us from shouldering that responsibility in the
current maintenance round of RDF (along with the *assumption* made
early on that named graphs cannot be used for more than one purpose at
once). But it certainly shouldn't make that work *harder* to do by
adding *new complexity*, which distracts and fragments practice and
effective interoperability.


## Any Other Way?

So should we really add this much new complexity, along with a note
stating that RDF is now harder to use, unless you have a clear
understanding of the type/token distinction? Or should we steer away
from triples as types and focus on means for occurrences of triples to
be more effectively described, to cater for easier provenance and "ad
hoc" qualification for them?

I am certainly in favour of the latter. As others also have, I've made
several attempts at addressing this, recently in [17].

Best regards,
Niklas

[1]: <https://lists.w3.org/Archives/Public/public-rdf-star-wg/2023Dec/0003.html>
[2]: <https://en.wikipedia.org/wiki/Rule_of_least_power>
[3]: <https://graphdb.ontotext.com/documentation/10.3/rdf-sparql-star.html>
[4]: <https://lists.w3.org/Archives/Public/public-rdf-star/2020Dec/0076.html>
[5]: <https://enterprise-knowledge.com/rdf-what-is-it-and-why-do-i-need-it/>
[6]: <https://medium.com/@dallemang/why-im-not-excited-about-rdf-star-5f1993fd0ead>
[7]: <https://github.com/w3c/rdf-ucr/wiki/RDF%E2%80%90star-for-Annotations-as-Miscellaneous-Marginalia#prov-o-qualification-versus-rdf-star-annotation>
[8]: <https://lists.w3.org/Archives/Public/public-rdf-star/2022Jan/0071.html>
[9]: <https://lists.w3.org/Archives/Public/public-rdf-star/2022Jan/0074.html>
[10]: <https://github.com/w3c/rdf-star/issues/169>
[11]: <https://lists.w3.org/Archives/Public/public-rdf-star/2021Dec/0001.html>
[12]: <https://patterns.dataincubator.org/book/named-graphs.html>
[13]: <https://docs.stardog.com/tutorials/rdf-graph-data-model#named-graphs>
[14]: <https://sven-lieber.org/en/2023/06/26/rdf-named-graphs/>
[15]: <https://cidoc-crm.org/Issue/ID-526-named-graph-usage-recommendations-guideline-document>
[16]: <https://arxiv.org/abs/2211.16195>
[17]: <https://lists.w3.org/Archives/Public/public-rdf-star-wg/2023Nov/0061.html>
Received on Wednesday, 6 December 2023 12:34:43 UTC