- From: Niklas Lindström <lindstream@gmail.com>
- Date: Thu, 30 Nov 2023 14:39:17 +0100
- To: RDF-star Working Group <public-rdf-star-wg@w3.org>
Dear all,

I actually think the current proposals are closer to each other than it might seem.

What Souri proposes with RDFn [1] is very close to what I was seeking with "bound" named graphs ([2], [3]). Both are "about tokens" (as in the same triple can be named by more than one identifier (blank node or IRI), which are considered distinct unless asserted to be the same).

But Souri proposes something valuable, which has been around in various guises before (e.g. in [4] and [5]), and I think is also alluded to by Peter in [6] (option 2,1,1, expanding to "the same central node").

Here is an attempt at consolidation of these various ideas, taking what the CG was seeking into account (and this time keeping all of its syntax).

## The Troubles of Describing Triples

Having triple terms as "types" has proven to be troublesome, both in theory and in practice. They are *universals* (like literals), and neither provenance nor qualification (our actual use cases) is about universals. Use cases describe instantiated occurrences of them, in various contexts (graphs).

Admittedly, these are *mainly* the asserted triples in the current graph (one unique s,p,o per g). So the "type" point of view is understandable, and in the simplest cases is all you see. But also "referenced" or "possible" triples come into view a lot; and they all are "identified by their singleton sets". Such referenced ("backing") triples also cater for the LPG cases; but can stay unasserted, in the background, without "polluting" RDF with multisets.

(It is not logically wrong to talk about universals directly, but it is rarely (if ever) the intent. RDF has this *cautious* design of disallowing literals in the subject position for this reason. To prevent users from "shooting themselves in the foot", if you will.)
## Consolidating Occurrences: Default Token Identifiers

This "auto-named triple" approach solves the disconnect, in that it "talks about tokens", without abandoning the effect of concentrating on a default triple in a graph in the simplest cases.

So, we can:

* Define a function (`tripleId`) that maps s,p,o to a unique identifier (blank node or IRI). This denotes a "default triple token", or, if you will, the triple occurrence *in a graph*.

## Options at Hand

Let's examine a case and some options. I'll use this example (not because it's my favorite, but because it is common, and also contains the "seminal error", which we "save ourselves from" by describing tokens):

    << <bob> foaf:birthday "1970-01-01" >> ex:certainty 0.9 ;
        dct:source <s1> .

This is the same "default triple token" throughout the graph, and the above is the same as:

    << <bob> foaf:birthday "1970-01-01" >> ex:certainty 0.9 .
    << <bob> foaf:birthday "1970-01-01" >> dct:source <s1> .

(Note: Of course the date should be `"1970-01-01"^^xsd:date`; it's omitted for brevity.)

For this syntax, we use `tripleId` to get a unique identifier from the syntactic triple term. Below we'll use a simple bnode id, `_:bb70`; but anything goes as long as it is unique, e.g. a hash-based bnode id like `_:gen6e16a579edbbf4dc3339be9415c39ea8`, an IRI like `<urn:tdb:2014:urn:md5:6e16a579edbbf4dc3339be9415c39ea8>` or a data-URL-variant thereof (no hash; terribly long).

## Option A: Reification

This can be used as the identifier of a simple reified statement:

    _:bb70 rdf:subject <bob> .
    _:bb70 rdf:predicate foaf:birthday .
    _:bb70 rdf:object "1970-01-01" .
    _:bb70 ex:certainty 0.9 .
    _:bb70 dct:source <s1> .

For the annotation shorthand:

    <bob> foaf:birthday "1970-01-01" {| ex:certainty 0.9 ;
                                        dct:source <s1> |} .

This could become:

    <bob> foaf:birthday "1970-01-01" .
    _:bb70 rdf:subject <bob> .
    _:bb70 rdf:predicate foaf:birthday .
    _:bb70 rdf:object "1970-01-01" .
    _:bb70 ex:certainty 0.9 .
    _:bb70 dct:source <s1> .
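
To make the `tripleId` idea and the Option A expansion concrete, here is a minimal Python sketch. It is only an illustration under stated assumptions: the names `triple_id` and `reify` are hypothetical, terms are passed as plain strings in Turtle-like form, the key is just the three terms joined by spaces (no canonicalization), and MD5 is used merely because the hash-based example above looks like an MD5 digest. It also assumes the annotations are attached to the same identifier that carries the `rdf:subject`/`rdf:predicate`/`rdf:object` triples.

```python
import hashlib

def triple_id(s: str, p: str, o: str) -> str:
    """Hypothetical tripleId: the same (s, p, o) always yields the same bnode label."""
    # Assumption: terms are already serialized consistently; no canonicalization is done.
    key = f"{s} {p} {o}".encode("utf-8")
    return "_:gen" + hashlib.md5(key).hexdigest()

def reify(s: str, p: str, o: str, annotations):
    """Option A expansion: reification triples plus annotations on the default token."""
    tid = triple_id(s, p, o)
    triples = [
        (tid, "rdf:subject", s),
        (tid, "rdf:predicate", p),
        (tid, "rdf:object", o),
    ]
    # Each annotation is a (predicate, object) pair describing the token.
    triples += [(tid, ap, ao) for ap, ao in annotations]
    return triples

if __name__ == "__main__":
    expanded = reify(
        "<bob>", "foaf:birthday", '"1970-01-01"',
        [("ex:certainty", "0.9"), ("dct:source", "<s1>")],
    )
    for subj, pred, obj in expanded:
        print(f"{subj} {pred} {obj} .")
```

Because the identifier depends only on s, p and o, describing the same triple term twice in a graph lands on the same token, which is the "default triple token" behaviour described above.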
We do want repeated annotations too (in some form):

    <bob> foaf:birthday "1970-01-01" {| dct:source <s1> ; ex:certainty 0.9 |},
                        "1970-01-01" {| dct:source <s2> ; ex:certainty 0.8 |} .

When there is more than one "referenced occurrence" like this, the auto-naming isn't used, since the reference triples "decohere". So we reasonably get regular blank nodes:

    <bob> foaf:birthday "1970-01-01" .
    _:b1 rdf:subject <bob> .
    _:b1 rdf:predicate foaf:birthday .
    _:b1 rdf:object "1970-01-01" .
    _:b1 dct:source <s1> .
    _:b1 ex:certainty 0.9 .
    _:b2 rdf:subject <bob> .
    _:b2 rdf:predicate foaf:birthday .
    _:b2 rdf:object "1970-01-01" .
    _:b2 dct:source <s2> .
    _:b2 ex:certainty 0.8 .

It could make sense to always use regular blank nodes for the annotation form; *or* to require explicit names for repetitions.

On that note, here is a form for explicitly named annotations:

    <bob> foaf:birthday "1970-01-01" {<#t1>} .
    <#t1> ex:certainty 0.9 ;
        dct:source <s1> .

In "terse" triples:

    <bob> foaf:birthday "1970-01-01" .
    <#t1> rdf:subject <bob> .
    <#t1> rdf:predicate foaf:birthday .
    <#t1> rdf:object "1970-01-01" .
    <#t1> ex:certainty 0.9 .
    <#t1> dct:source <s1> .

With this, we finally have a Turtle equivalent to RDF/XML statement annotations (used extensively in UniProt):

    <rdf:Description rdf:about="bob">
      <foaf:birthday rdf:ID="t1">1970-01-01</foaf:birthday>
    </rdf:Description>
    <rdf:Description rdf:about="#t1">
      <ex:certainty rdf:datatype="&xsd;double">0.9</ex:certainty>
      <dct:source rdf:resource="s1"/>
    </rdf:Description>

How do we "save ourselves from the seminal error" then, if triple terms are at least type-like? In this basic form we could just resort to reification; or triple terms could have an optional identifier, like:

    << _:b1 | <bob> foaf:birthday "1970-01-01" >> ex:certainty 0.9 ;
        dct:source <s1> .
    << _:b2 | <bob> foaf:birthday "1970-01-01" >> ex:certainty 0.8 ;
        dct:source <s2> .
Or (which I prefer) the completing object could be marked as "quoted" (I've previously used `--`, but it has been considered hard to spot):

    <bob> foaf:birthday << "1970-01-01" >> {| dct:source <s1> ; ex:certainty 0.9 |},
                        << "1970-01-01" >> {| dct:source <s2> ; ex:certainty 0.8 |} .

Exact syntax isn't important yet, only whether this is what we can converge upon or not.

For named graphs, this:

    <g1> {
        << <bob> foaf:birthday "1970-01-01" >> ex:certainty 0.9 .
    }

    <g2> {
        << <bob> foaf:birthday "1970-01-01" >> ex:certainty 0.8 .
    }

becomes, in "terse" quads:

    _:bb70 rdf:subject <bob> <g1> .
    _:bb70 rdf:predicate foaf:birthday <g1> .
    _:bb70 rdf:object "1970-01-01" <g1> .
    _:bb70 ex:certainty 0.9 <g1> .

    _:bb70 rdf:subject <bob> <g2> .
    _:bb70 rdf:predicate foaf:birthday <g2> .
    _:bb70 rdf:object "1970-01-01" <g2> .
    _:bb70 ex:certainty 0.8 <g2> .

Granted, given the reasoning above (an instantiated triple occurrence in a graph) it might make sense that `tripleId` mints different identifiers for different graphs. Annotation forms achieve that anyway though, and the above is simpler as is (*if* the *union* of the two graphs shares blank nodes, the certainty claims in them are in conflict (assuming such semantics for the property), which can be important information).

Of course, we're still at square one here. It's more *convenient* reification, but perhaps not *better*. While this could be all we need, let's look a bit further.

## Option B: Attempting Semantics for Datasets

What I've been aiming for is isolated (as in unasserted, from the open world point of view) named triple sets, bound to another "graph name resource" in a dataset.

I *tried* to base my approach on the open-ended options for dataset semantics, without touching the abstract syntax. This was not about giving all uses of named graphs fixed semantics, but about *opting in* to semantic datasets.
I thought this was respectful of what's out there, given what RDF 1.1 Concepts states [7]:

> RDF does not place any formal restrictions on what resource the graph name may denote, nor on the relationship between that resource and the graph. A discussion of different RDF dataset semantics can be found in [RDF11-DATASETS].

Given that, claiming that graph names mean nothing is only *one* of many possible interpretations. And while formal means for doing so are still undefined, I hoped they didn't have to be. Looking at RDF11-DATASETS [8]:

> A vocabulary specifically tailored for describing the intended dataset semantics could be defined in a future specification.

It suggests that it could be possible to define, through a description of the resource naming a graph, how the graph it is paired with is interpreted within a dataset. Its dataset semantics option 3.4 [9] is close to what I've attempted.

With such semantics for named graphs, in order not to break monotonicity, graphs must reasonably be explicitly "accepted" to be considered asserted in a union default graph [10].

So my option for the above was to *select*, out of band (in an implementation), a semantic dataset profile, in which named graphs are isolated unless accepted. (The simple act of loading them into graph names in a semantic graph store would "accept" the default graph here, but not the named graph.)

So our example simply becomes:

    _:bb70 ex:certainty 0.9 .
    _:bb70 dct:source <s1> .

    <bob> foaf:birthday "1970-01-01" _:bb70 .

And for scoping this (for graph store management), I proposed `rdfx:boundBy` to relate two graph name resources to ensure that the "bound" ones remain isolated, and "owned" by their binding resource (for atomic updates and deletes). So if we read the above into named graph `<g1>`, we get:

    _:bb70 rdfx:boundBy <g1> .

    _:bb70 ex:certainty 0.9 <g1> .
    _:bb70 dct:source <s1> <g1> .

    <bob> foaf:birthday "1970-01-01" _:bb70 .

*Of course* this is not an easy thing to formalize and get implemented.
It requires "semantic datasets", and is hard to get right (defining semantics by the presence of statements (without breaking monotonicity), requiring an explicit opt-in profile, etc). Thus I said it might be a tall order. Too tall, I've gathered.

So let's defer this option, and see if we can do something else *now* which does not prevent semantic datasets in the future.

## Option C: Explicit Abstract Syntax Instead

Another way to get isolated named triple sets is to make them explicit in the concepts and abstract syntax, but without adding new terms that regular users will come across (so neither the subject, predicate nor object positions of triples have access to anything novel).

This is drawing from Souri's RDFn *and* Andy's graph terms [11], in a kind of amalgam (or compromise).

* Define a new kind of quoted identifier. *Not* for general use, *only* for the fourth position in a quad.
* It is formed by a regular identifier (blank node id or IRI) and an optional graph name identifier. Formally: quoted(id=some-id, optional graph=some-graph).
* Triples named by this term are *not asserted*.

(It is called "quoted" here, but could of course be called e.g. "isolated" or "protected".)

Here I use this syntax for such "quoted identifiers" for something in a default graph (again, *only* usable in the fourth position of a quad):

    {_:bb70}

And this for a quoted identifier in a named graph `<g1>`:

    <g1>{_:bb70}

Structurally, it is related to typed literals. To a lesser extent it is reminiscent of the triple terms it replaces; the main difference being that this is not a recursive structure; and that the identifier "within" is a regular RDF identifier which is used in subjects and objects.

Here is the initial example in "terse pseudo-quads":

    <bob> foaf:birthday "1970-01-01" {_:bb70} .
    _:bb70 ex:certainty 0.9 .
    _:bb70 dct:source <s1> .

And for a triple description in a named graph:

    <g1> {
        << <bob> foaf:birthday "1970-01-01" >> ex:certainty 0.9 ;
            dct:source <s1> .
    }

In "terse pseudo-quads":

    <bob> foaf:birthday "1970-01-01" <g1>{_:bb70} .
    _:bb70 ex:certainty 0.9 <g1> .
    _:bb70 dct:source <s1> <g1> .

Of course, this can be considered as "quins in disguise". As such this option is *very* close to what RDFn proposes. The main difference is that not *all* triples are auto-named, only "RDF-star-described" ones, and that such names are always isolated triples, marked through "quoted" quad identifiers (fusing position 4 and 5 of RDFn).

Note: While this proposal requires a quad representation, it is not necessarily restricted to TriG (but to N-Quads and not N-Triples).

But since "statements about statements" is not basic RDF 101, it should be discussed. For provenance, this is related to named graphs, and should be explained alongside them. For "qualification", it is the *last* resort when you've got granular data but "run out of modelling options"; usually in a production scenario. In schema.org, we've got "impure" but pragmatic, triples-only options. In Wikidata, this is more interesting.

(For LPG usage, I've gotten the impression that semantics have a back seat, and putting raw data into "something" is more common practice. Not unlike some RDF usage in the wild; and that's fine. We just need to ensure that it's hard to "shoot yourself in the foot" with what we introduce.)

## What About Opacity?

Controlling opacity is left to a future semantics for datasets (as in [8], also thought of e.g. in [12]). For now, it depends on specific implementation options for the union default graph, and for what their inference engines take into account.

I think this is acceptable since the majority of collected use cases and examples rely on a practical transparent interpretation of triples, whether asserted or not. Also, if we "get closer" to named graphs, these options could work on asserted and "protected" triple sets alike.

## Future Convergence: Upgrading From Option C to B?
Option C is upgradable to semantic datasets, if such are eventually defined.

* The "quoted fourth term" can be made equal to an explicit graph semantics of that "wrapped" identifier. It is a syntactic marker that could be interpreted as a semantic declaration.
* With named annotations, we can also have named, isolated triple sets. It can still fall back to reification, but would require a relationship (e.g. `rdfx:triple`) from that named, isolated set to each isolated triple.
* There is a path towards graph terms as default names for graph "token" structures, using RDF C14N on its triple set (a `graphId` function along the lines of the above `tripleId` mapping function).

Thank you if you read this far!

Best regards,
Niklas

[1]: <https://lists.w3.org/Archives/Public/public-rdf-star-wg/2023Nov/0028.html>
[2]: <https://lists.w3.org/Archives/Public/public-rdf-star-wg/2023Nov/0026.html>
[3]: <https://lists.w3.org/Archives/Public/public-rdf-star-wg/2023Nov/0032.html>
[4]: <https://lists.w3.org/Archives/Public/public-rdf-star/2020Dec/0062.html>
[5]: <https://lists.w3.org/Archives/Public/public-rdf-star-wg/2023May/0063.html>
[6]: <https://lists.w3.org/Archives/Public/public-rdf-star-wg/2023Nov/0031.html>
[7]: <https://www.w3.org/TR/rdf11-concepts/#section-dataset>
[8]: <https://www.w3.org/TR/rdf11-datasets/#declaring>
[9]: <https://www.w3.org/TR/rdf11-datasets/#each-named-graph-defines-its-own-context>
[10]: <https://www.w3.org/TR/sparql11-service-description/#sd-uniondefaultgraph>
[11]: <https://lists.w3.org/Archives/Public/public-rdf-star-wg/2023Oct/0038.html>
[12]: <https://gist.github.com/niklasl/c22994e664663b6730613ecc1321c418#opacity-as-conditional-entailment>
Received on Thursday, 30 November 2023 13:39:51 UTC