- From: Niklas Lindström <lindstream@gmail.com>
- Date: Wed, 6 Dec 2023 13:34:10 +0100
- To: RDF-star Working Group <public-rdf-star-wg@w3.org>
Dear all,

We must address the problems shown with triple terms, from the point of view of the use cases we must cater for in RDF.

I recently asked some short questions about triple terms [1] that are yet to be answered. But the most important question right now is:

Can we responsibly introduce triple types to RDF?

It is a radical change to RDF. It introduces a new kind of term, for the first time since RDF 1.0, which is to be used in both the subject and object position of other triples.

A triple is no longer a three-tuple of primitives, but a recursively defined tree structure. Triple terms are thus unlike IRIs and blank nodes (which are atomic primitives) and literals (which are tuples of a lexical form, a datatype IRI and, for the rdf:langString type, a language tag). They are not sets of triples (i.e. RDF graphs), but triples that can themselves contain triples. That is a more exotic graph data structure: edges whose nodes can be edges (nested edges at that, not part of the outer graph). (Note that named triples, like named graphs, would be a way to avoid this complexity.)

So a corollary question is:

Can we responsibly add such a complex form of *trees* as a *primitive* to RDF and SPARQL?

This is not even remotely close to a minimal change. It redefines the foundations.

This change will affect the entire RDF community. At this year's DCMI conference, I got comments like: "adding this to RDF would turn it into XML, in a bad way," and "why don't you define semantics for named graphs instead?".

(Note that XML and HTML (and soon JSON) in RDF are fully opaque literals; not data structures. Database extensions for SPARQL are free to enable querying within those of course, but that's decidedly outside of RDF territory, and rightfully so.)

Like Ora said during the last (231130) telecon, "we're not the 'we like RDF'-social club". Certainly not. We're here to refine a technology which has been used in the wild for more than 20 years. Anyone can make things bigger and more complex.
We have a responsibility to keep RDF as simple as possible (cf. [2]).

## Triple Terms Do Not Work As Advertised

So triple terms must really be worth it then, to warrant this complexity being irrevocably added to the core of RDF?

That has not been shown. On the contrary, it has been shown that triple terms do *not* work for what they are purportedly introduced for (provenance and qualification).

This all depends on use cases. *If* RDF-star is explicitly added for talking about universal, abstract, recursive triple structures themselves, then that *might* warrant something like this complexity. I have asked for use cases for that, but haven't gotten any answers. (I can imagine that this would open up for some rule-based cases, like what Notation 3 is being used for today, but without graph terms I cannot see it being of much practical use.)

The RDF-star examples commonly seen, e.g. in the CG report and in the GraphDB tutorial [3], are basically all about provenance and qualification of some sort. You can easily, in each example, see how the "seminal error" of the "seminal example" would be committed by adding just one single temporal fact to the triple itself.

These are patterns from the CG report:

    << <s> :p <o> >> :accordingTo <someone> .
    << <s> :p <o> >> :statedBy <someone> .
    <alice> :claims << <bob> :age 23 >> .

One seminal error remains in example 9:

    :a :name "Alice" {| :statedBy :bob ; :recorded "2021-07-07"^^xsd:date |} .

The ones in the CG report that are not directly about such provenance or qualification facts are either the attempted correction using :occurrenceOf in example 8, or perhaps example 17 (if you ignore the pending seminal error made by putting a dct:source on it):

    <<?c a owl:Class>> dct:source ?src ; :entailing <<?c a rdfs:Class>> .
Of course, this still errs, since it is an opaque universal structure, and there is no way that entailment could have been done without associating a *token* of that structure with a specific context, here the semantic definition of the resource that owl:Class denotes (in the OWL ontology).

So again, you *cannot* qualify a triple type, as in talking about the richer context from which a triple was derived. Because that derived triple is a token of its type.

And you cannot describe provenance about the type of a triple either. It can be asserted in many contexts, each being a token occurrence of the triple.

Already three years ago, in [4], Pierre-Antoine noted that RDF-star is easily misused. Yet, as shown above, the CG report didn't make that sufficiently clear, as it still commits those errors.

And RDF-star is already being taught and promoted as working for provenance and qualification (as in "adding metadata to existing relationships") [5]; and as a "replacement" for reification, and/or named graphs, for detailed cases (which smaller, "embedded" named graphs are already being used for, not least in JSON-LD).

(Here is another, recent example noting that RDF-star doesn't work for these cases: [6]. It tries to stay positive that it could be made to work (with examples that do not actually work).)

These are all clear warning signs, if not outright invalidations of the current design.

## We Need To Talk About Occurrences

A triple term in RDF-star, right now, is the abstract triple *without* a context. Like a structured literal, opaquely composed of subject, predicate and object terms.

The same triple can be derived from many different contexts. And it is the triple *in a context* that needs to be talked about. That's an instantiated occurrence. A triple is a simplification of one or more, granular, contextual occurrences.
A triple can even mean different things in different contexts, but that is an advanced case of multiple worlds (achieved with isolated named graphs, or disjoint datasets of graphs).

For provenance, qualification and any kind of annotation about a used triple, hypothetically or actually, we're *always* talking about such an occurrence. The occurrence itself! Occurrences such as the ones we make when we make assertions, when we build graphs that form descriptions of things.

There is an interest in, and a set of use cases for, using RDF-star for qualification (or even n-ary relations), due to not wanting to invent new terms (e.g. [7]). The most obvious cases are generic ("oversimplified") relationships, such as `dct:relation`. Many of those are commonly qualified by subproperties; but most properties are, from some perspective, simplifications of a more granular state of affairs. And singleton properties have (more or less) proven to be too complex to work effectively here in practice.

To again quote Pierre-Antoine, here in [8]:

"if a relationship was initially thought to be 'simple' enough to be modeled as a predicate, and turns out to be more complex (either because of some exceptional cases, such as people changing name), then RDF-star provides a smooth transition from the original modelling to a more detailed one."

See also the follow-up in [9] by Jerven Bolleman. The `connected:by_road_to` is an intuitive example of there being occurrences behind the simple triple (there are many roads that lead between towns).

This is a viable, recurring use case. But, again, a triple (its "type") is a *simplification* of a more granular context. And it is obvious that you cannot let the simplification itself *denote* a qualification of it. This is a crucial feature of RDF (as opposed to LPGs), and only through "backing", unasserted, described occurrences of a triple can we achieve this in a simple, backwards-compatible manner.
Both reification and named graphs cater for that (albeit the latter only in practice, as in theory it is undefined what it caters for).

The RDF-star CG report, however, adds fundamental complexity, but *still* needs an indirected node for the occurrence. It also provides little guidance in doing so, and ample ways to forget to do so! Using a universal type in the subject position is for making universal claims. This is still an open issue [10], and shows the range of problems introduced (and the difficulties of discussing them).

This was also shown by Ora in the Neptune use cases [11]. I cannot understand how the current trajectory is acceptable when this document showcases these exact problems.

## Named Graphs?

Let's not forget that named graphs have been used for provenance for a decade now ([12], [13], [14], [15], [16]).

We have recently, a bit more collectively, explored the relationship between some form of RDF-star and named graphs. We've seen that there can be one, but some problems have been made clear.

One problem is about graph terms, having the same problems as triple terms (opacity or not, type or token).

The other was that since named graphs are resource names paired with an RDF graph in an *undefined* way (side-stepping but not solving those questions), it is not formally possible to define what that pairing contextually means within a given dataset.

I would argue that defining standard options for dataset semantics, with which the wider RDF community now has a decade of experience and which it is now asking for, is *not* adding complexity, and could help us out a lot. It addresses the *challenging* task of consolidating what is out there with something explicitly left undefined until we can do that.

Our charter may prevent us from shouldering that responsibility in the current maintenance round of RDF (along with the *assumption* made early on that named graphs cannot be used for more than one purpose at once).
But it certainly shouldn't make that work *harder* to do by adding *new complexity*, which distracts and fragments practice and effective interoperability.

## Any Other Way?

So should we really add this much new complexity, along with a note stating that RDF is now harder to use, unless you have a clear understanding of the type/token distinction?

Or should we steer away from triples as types and focus on means for occurrences of triples to be more effectively described, to cater for easier provenance and "ad hoc" qualification for them?

I am certainly in favour of the latter. As others also have, I've made several attempts at addressing this, recently in [17].

Best regards,
Niklas

[1]: <https://lists.w3.org/Archives/Public/public-rdf-star-wg/2023Dec/0003.html>
[2]: <https://en.wikipedia.org/wiki/Rule_of_least_power>
[3]: <https://graphdb.ontotext.com/documentation/10.3/rdf-sparql-star.html>
[4]: <https://lists.w3.org/Archives/Public/public-rdf-star/2020Dec/0076.html>
[5]: <https://enterprise-knowledge.com/rdf-what-is-it-and-why-do-i-need-it/>
[6]: <https://medium.com/@dallemang/why-im-not-excited-about-rdf-star-5f1993fd0ead>
[7]: <https://github.com/w3c/rdf-ucr/wiki/RDF%E2%80%90star-for-Annotations-as-Miscellaneous-Marginalia#prov-o-qualification-versus-rdf-star-annotation>
[8]: <https://lists.w3.org/Archives/Public/public-rdf-star/2022Jan/0071.html>
[9]: <https://lists.w3.org/Archives/Public/public-rdf-star/2022Jan/0074.html>
[10]: <https://github.com/w3c/rdf-star/issues/169>
[11]: <https://lists.w3.org/Archives/Public/public-rdf-star/2021Dec/0001.html>
[12]: <https://patterns.dataincubator.org/book/named-graphs.html>
[13]: <https://docs.stardog.com/tutorials/rdf-graph-data-model#named-graphs>
[14]: <https://sven-lieber.org/en/2023/06/26/rdf-named-graphs/>
[15]: <https://cidoc-crm.org/Issue/ID-526-named-graph-usage-recommendations-guideline-document>
[16]: <https://arxiv.org/abs/2211.16195>
[17]: <https://lists.w3.org/Archives/Public/public-rdf-star-wg/2023Nov/0061.html>
Received on Wednesday, 6 December 2023 12:34:43 UTC