Attempting Consolidation from Niklas Lindström on 2023-11-30 (public-rdf-star-wg@w3.org from November 2023)

From: Niklas Lindström <lindstream@gmail.com>
Date: Thu, 30 Nov 2023 14:39:17 +0100
To: RDF-star Working Group <public-rdf-star-wg@w3.org>
Message-ID: <CADjV5jdAesjj+Uk9PsJB+nWTFavQ8+GYP+CTE1a5rE5rzP13Fg@mail.gmail.com>

Dear all,

I actually think the current proposals are closer to each other than
it might seem.

What Souri proposes with RDFn [1] is very close to what I was seeking
with "bound" named graphs ([2], [3]). Both are "about tokens" (as in
the same triple can be named by more than one identifier (blank node
or IRI), which are considered distinct unless asserted to be the
same). But Souri proposes something valuable, which has been around in
various guises before (e.g. in [4] and [5]), and I think is also
alluded to by Peter in [6] (option 2,1,1, expanding to "the same
central node").

Here is an attempt at consolidation of these various ideas, taking
what the CG was seeking into account (and this time keeping all of its
syntax).


## The Troubles of Describing Triples

Having triple terms as "types" has proven troublesome, both in
theory and practice. They are *universals* (like literals), and
neither provenance nor qualification (our actual use cases) are about
universals. These cases describe instantiated occurrences of them, in
various contexts (graphs). Admittedly, these are *mainly* the asserted
triples in the current graph (one unique s,p,o per g). So the "type"
point of view is understandable, and in the simplest cases is all you
see. But also "referenced" or "possible" triples come into view a lot;
and they all are "identified by their singleton sets". Such referenced
("backing") triples also cater for the LPG cases; but can stay
unasserted, in the background, without "polluting" RDF with multisets.

(It is not logically wrong to talk about universals directly, but it
is rarely (if ever) the intent. RDF has this *cautious* design of
disallowing literals in the subject position for this reason. To
prevent users from "shooting themselves in the foot", if you will.)


## Consolidating Occurrences: Default Token Identifiers

This "auto-named triple" approach solves the disconnect, in that it
"talks about tokens", without abandoning the effect of concentrating
on a default triple in a graph in the simplest cases.

So, we can:

* Define a function (tripleId) that maps s,p,o to a unique identifier
(blank node or IRI). This denotes a "default triple token", or, if you
will, the triple occurrence *in a graph*.


## Options at Hand

Let's examine a case and some options. I'll use this example (not
because it's my favorite, but because it is common, and also contains
the "seminal error", which we "save ourselves from" by describing
tokens):

    << <bob> foaf:birthday "1970-01-01" >> ex:certainty 0.9 ;
        dct:source <s1> .

This is the same "default triple token" throughout the graph, and the
above is the same as:

    << <bob> foaf:birthday "1970-01-01" >> ex:certainty 0.9 .
    << <bob> foaf:birthday "1970-01-01" >> dct:source <s1> .

(Note: Of course the date should be `"1970-01-01"^^xsd:date`; it's
omitted for brevity.)

For this syntax, we use `tripleId` to get a unique identifier from the
syntactic triple term. Below we'll use a simple bnode id, `_:bb70`;
but anything goes as long as it is unique, e.g. a hash-based bnode id
like `_:gen6e16a579edbbf4dc3339be9415c39ea8`, an IRI like
`<urn:tdb:2014:urn:md5:6e16a579edbbf4dc3339be9415c39ea8>` or a
data-URL-variant thereof (no hash; terribly long).
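
As a rough sketch of what such a `tripleId` mapping might look like: the function below hashes a string form of the three terms and returns a blank node label. The name `triple_id`, the space-joined key, and the choice of MD5 are assumptions made for illustration only; nothing here is prescribed by the proposal, and the digest it produces is not claimed to match the example ids above.

```python
import hashlib

def triple_id(s: str, p: str, o: str) -> str:
    """Map a triple to a deterministic blank node label.

    Assumption: the terms are passed as already-serialized strings and
    the hash key is their space-joined form. Any stable scheme that
    gives distinct triples distinct ids would serve equally well.
    """
    key = f"{s} {p} {o} ."
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    return f"_:gen{digest}"

# The same s,p,o always yields the same identifier.
bob_birthday = ("<bob>", "foaf:birthday", '"1970-01-01"')
print(triple_id(*bob_birthday))
```

In practice, relative IRIs and prefixed names would have to be resolved to absolute IRIs before hashing, so that the same triple gets the same id regardless of how it was written.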

## Option A: Reification

This can be used as the identifier of a simple reified statement:

    _:bb70 rdf:subject <bob> .
    _:bb70 rdf:predicate foaf:birthday .
    _:bb70 rdf:object "1970-01-01" .

    _:bb70 ex:certainty 0.9 .
    _:bb70 dct:source <s1> .

For the annotation shorthand:

    <bob> foaf:birthday "1970-01-01" {| ex:certainty 0.9 ;
                                        dct:source <s1> |} .

This could become:

    <bob> foaf:birthday "1970-01-01" .

    _:bb70 rdf:subject <bob> .
    _:bb70 rdf:predicate foaf:birthday .
    _:bb70 rdf:object "1970-01-01" .

    _:bb70 ex:certainty 0.9 .
    _:bb70 dct:source <s1> .
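
The expansion of the annotation shorthand can be sketched in a few lines. This assumes that the annotations attach to the same node that carries the `rdf:subject`, `rdf:predicate` and `rdf:object` triples, and it represents triples as plain tuples of strings; the function names `reify` and `expand_annotation` are hypothetical.

```python
def reify(triple, annotations, node):
    """Return the reification triples for `triple`, plus annotations on `node`."""
    s, p, o = triple
    out = [
        (node, "rdf:subject", s),
        (node, "rdf:predicate", p),
        (node, "rdf:object", o),
    ]
    out += [(node, ap, ao) for ap, ao in annotations]
    return out

def expand_annotation(triple, annotations, node):
    """Annotation shorthand: assert the triple itself, then describe it."""
    return [triple] + reify(triple, annotations, node)

triple = ("<bob>", "foaf:birthday", '"1970-01-01"')
notes = [("ex:certainty", "0.9"), ("dct:source", "<s1>")]
for t in expand_annotation(triple, notes, "_:bb70"):
    print(" ".join(t), ".")
```

Calling `reify` alone corresponds to the `<< ... >>` form, where the described triple is not asserted.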

We do want repeated annotations too (in some form):

    <bob> foaf:birthday "1970-01-01" {| dct:source <s1> ;
                                        ex:certainty 0.9 |},
            "1970-01-01" {| dct:source <s2>;
                            ex:certainty 0.8 |} .

When there is more than one "referenced occurrence" like this, the
auto-naming isn't used, since the reference triples "decohere". So we
reasonably get regular blank nodes:

    <bob> foaf:birthday "1970-01-01" .

    _:b1 rdf:subject <bob> .
    _:b1 rdf:predicate foaf:birthday .
    _:b1 rdf:object "1970-01-01" .
    _:b1 dct:source <s1> .
    _:b1 ex:certainty 0.9 .

    _:b2 rdf:subject <bob> .
    _:b2 rdf:predicate foaf:birthday .
    _:b2 rdf:object "1970-01-01" .
    _:b2 dct:source <s2> .
    _:b2 ex:certainty 0.8 .
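
The naming rule described here (use the auto-generated name for a single annotation block, fall back to fresh blank nodes when there are several) might be sketched as follows. The helper names and the `_:b1`, `_:b2` labelling scheme are assumptions for illustration.

```python
from itertools import count

def name_occurrences(blocks, default_id):
    """Pick a subject node for each annotation block of one triple.

    One block: the default (auto-generated) id is used.
    Several blocks: each gets its own blank node, so they stay distinct.
    """
    if len(blocks) == 1:
        return [default_id]
    fresh = count(1)
    return [f"_:b{next(fresh)}" for _ in blocks]

def expand(triple, blocks, default_id):
    """Assert the triple once, then reify it once per annotation block."""
    s, p, o = triple
    out = [triple]
    for node, block in zip(name_occurrences(blocks, default_id), blocks):
        out += [(node, "rdf:subject", s),
                (node, "rdf:predicate", p),
                (node, "rdf:object", o)]
        out += [(node, ap, ao) for ap, ao in block]
    return out

triple = ("<bob>", "foaf:birthday", '"1970-01-01"')
blocks = [
    [("dct:source", "<s1>"), ("ex:certainty", "0.9")],
    [("dct:source", "<s2>"), ("ex:certainty", "0.8")],
]
for t in expand(triple, blocks, "_:bb70"):
    print(" ".join(t), ".")
```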

It could make sense to always use regular blank nodes for the
annotation form; *or* to require explicit names for repetitions.

On that note, here is a form for explicitly named annotations:

    <bob> foaf:birthday "1970-01-01" {<#t1>} .

    <#t1> ex:certainty 0.9;
        dct:source <s1> .

In "terse" triples:

    <bob> foaf:birthday "1970-01-01" .

    <#t1> rdf:subject <bob> .
    <#t1> rdf:predicate foaf:birthday .
    <#t1> rdf:object "1970-01-01" .
    <#t1> ex:certainty 0.9 .
    <#t1> dct:source <s1> .

With this, we finally have a Turtle equivalent to RDF/XML statement
annotations (used extensively in UniProt):

    <rdf:Description rdf:about="bob">
      <foaf:birthday rdf:ID="t1">1970-01-01</foaf:birthday>
    </rdf:Description>

    <rdf:Description rdf:ID="t1">
      <ex:certainty rdf:datatype="&xsd;double">0.9</ex:certainty>
      <dct:source rdf:resource="s1"/>
    </rdf:Description>

How do we "save ourselves from the seminal error" then, if triple
terms are at least type-like? In this basic form we could just resort
to reification; or triple terms could have an optional identifier,
like:

    << _:b1 | <bob> foaf:birthday "1970-01-01" >> ex:certainty 0.9 ;
        dct:source <s1> .
    << _:b2 | <bob> foaf:birthday "1970-01-01" >> ex:certainty 0.8 ;
        dct:source <s2> .

Or (which I prefer) the completing object could be marked as "quoted"
(I've previously used `--`, but it has been considered hard to spot):

    <bob> foaf:birthday << "1970-01-01" >> {| dct:source <s1> ;
                                        ex:certainty 0.9 |},
            << "1970-01-01" >> {| dct:source <s2>;
                            ex:certainty 0.8 |} .

Exact syntax isn't important yet, only whether this is what we can
converge upon or not.

For named graphs, this:

    <g1> {
        << <bob> foaf:birthday "1970-01-01" >> ex:certainty 0.9 .
    }
    <g2> {
        << <bob> foaf:birthday "1970-01-01" >> ex:certainty 0.8 .
    }

becomes, in "terse" quads:

    _:bb70 rdf:subject <bob> <g1> .
    _:bb70 rdf:predicate foaf:birthday <g1> .
    _:bb70 rdf:object "1970-01-01" <g1> .
    _:bb70 ex:certainty 0.9 <g1> .

    _:bb70 rdf:subject <bob> <g2> .
    _:bb70 rdf:predicate foaf:birthday <g2> .
    _:bb70 rdf:object "1970-01-01" <g2> .
    _:bb70 ex:certainty 0.8 <g2> .

Granted, given the reasoning above (an instantiated triple occurrence
in a graph) it might make sense for `tripleId` to mint different
identifiers for different graphs. Annotation forms achieve that anyway,
though, and the above is simpler as is. (*If* the two graphs share
blank nodes in their *union*, the certainty claims in them are in
conflict, assuming such semantics for the property, which can be
important information.)
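
To make the conflict concrete, here is a small sketch that merges the two graphs while keeping blank node labels shared. That sharing is an assumption: a standard RDF merge would rename blank nodes apart, in which case the two claims would land on different nodes. Quads are plain tuples and the function names are hypothetical.

```python
quads = [
    ("_:bb70", "ex:certainty", "0.9", "<g1>"),
    ("_:bb70", "ex:certainty", "0.8", "<g2>"),
]

def union_graph(quads):
    """Merge all named graphs into one triple set, keeping blank node labels."""
    return {(s, p, o) for s, p, o, _g in quads}

def values_of(triples, subject, predicate):
    """Collect every object stated for a subject and predicate."""
    return sorted(o for s, p, o in triples if s == subject and p == predicate)

merged = union_graph(quads)
print(values_of(merged, "_:bb70", "ex:certainty"))  # ['0.8', '0.9']
```

Two certainty values on one node is exactly the signal that the sources disagree, provided the property is meant to be single-valued.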

Of course, we're still at square one here. It's more *convenient*
reification, but perhaps not *better*. While this could be all we
need, let's look a bit further.


## Option B: Attempting Semantics for Datasets

What I've been aiming for is isolated (as in unasserted, from the open
world point of view) named triple sets, bound to another "graph name
resource" in a dataset.

I *tried* to base my approach on the open-ended options for dataset
semantics, without touching the abstract syntax. This was not about
giving all uses of named graphs fixed semantics, but about *opting in*
to semantic datasets. I thought this was respectful of what's out
there, given what RDF 1.1 Concepts states [7]:

> RDF does not place any formal restrictions on what resource the graph name may denote, nor on the relationship between that resource and the graph. A discussion of different RDF dataset semantics can be found in [RDF11-DATASETS].

Given that, claiming that graph names mean nothing is only *one* of
many possible interpretations. And while formal means for doing so are
still undefined, I hoped they didn't have to be. Looking at
RDF11-DATASETS [8]:

> A vocabulary specifically tailored for describing the intended dataset semantics could be defined in a future specification.

This suggests that it could be possible to define how a graph is
interpreted within a dataset by describing the resource that names it.
Option 3.4 of its dataset semantics [9] is
close to what I've attempted. With such semantics for named graphs, in
order not to break monotonicity, graphs must reasonably be explicitly
"accepted" to be considered asserted in a union default graph [10].

So my option for the above was to *select*, out of band (in an
implementation), a semantic dataset profile in which named graphs are
isolated unless accepted. (The simple act of loading them into graph
names in a semantic graph store would "accept" the default graph here,
but not the named graph.)
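
A minimal sketch of such an "isolated unless accepted" profile, using the birthday example with the described triple held in a graph named by `_:bb70`. The function name, the `accepted` parameter and the dict-of-sets dataset layout are all assumptions; no such profile is formally defined.

```python
def union_default_graph(default_graph, named_graphs, accepted):
    """Compute what counts as asserted under the assumed profile.

    The default graph is always asserted; a named graph contributes
    its triples only if its name is in `accepted`.
    """
    result = set(default_graph)
    for name, triples in named_graphs.items():
        if name in accepted:
            result |= set(triples)
    return result

BOB = ("<bob>", "foaf:birthday", '"1970-01-01"')
default_graph = {
    ("_:bb70", "ex:certainty", "0.9"),
    ("_:bb70", "dct:source", "<s1>"),
}
named_graphs = {"_:bb70": {BOB}}

isolated = union_default_graph(default_graph, named_graphs, accepted=set())
opted_in = union_default_graph(default_graph, named_graphs, accepted={"_:bb70"})
print(BOB in isolated, BOB in opted_in)  # False True
```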

So our example simply becomes:

    _:bb70 ex:certainty 0.9 .
    _:bb70 dct:source <s1> .

    <bob> foaf:birthday "1970-01-01" _:bb70 .

And for scoping this (for graph store management), I proposed
`rdfx:boundBy` to relate two graph name resources to ensure that the
"bound" ones remain isolated, and "owned" by their binding resource
(for atomic updates and deletes). So if we read the above into named
graph `<g1>`, we get:

    _:bb70 rdfx:boundBy <g1> .

    _:bb70 ex:certainty 0.9 <g1> .
    _:bb70 dct:source <s1> <g1> .

    <bob> foaf:birthday "1970-01-01" _:bb70 .

*Of course* this is not an easy thing to formalize and get
implemented. It requires "semantic datasets", and is hard to get right
(defining semantics by the presence of statements (without breaking
monotonicity), requiring an explicit opt-in profile, etc).

Thus I said it might be a tall order. Too tall, I've gathered. So
let's defer this option, and see if we can do something else *now*
which does not prevent semantic datasets in the future.


## Option C: Explicit Abstract Syntax Instead

Another way to get isolated named triple sets is to make them explicit
in the concepts and abstract syntax, but without adding new terms that
regular users will come across (so neither the subject, predicate nor
object positions of triples have access to anything novel).

This is drawing from Souri's RDFn *and* Andy's graph terms [11], in a
kind of amalgam (or compromise).

* Define a new kind of quoted identifier. *Not* for general use,
*only* for the fourth position in a quad.
* It is formed by a regular identifier (blank node id or IRI) and an
optional graph name identifier. Formally: quoted(id=some-id, optional
graph=some-graph).
* Triples named by this term are *not asserted*.

(It is called "quoted" here, but could of course be called e.g.
"isolated" or "protected".)

Here I use this syntax for such "quoted identifiers" for something in
a default graph (again, *only* usable in the fourth position of a
quad):

    {_:bb70}

And this for a quoted identifier in a named graph `<g1>`:

    <g1>{_:bb70}

Structurally, it is related to typed literals. To a lesser extent it
is reminiscent of the triple terms it replaces; the main difference
being that this is not a recursive structure; and that the identifier
"within" is a regular RDF identifier which is used in subjects and
objects.

Here is the initial example in "terse pseudo-quads":

    <bob> foaf:birthday "1970-01-01" {_:bb70} .
    _:bb70 ex:certainty 0.9 .
    _:bb70 dct:source <s1> .

And for a triple description in a named graph:

    <g1> {
      << <bob> foaf:birthday "1970-01-01" >> ex:certainty 0.9 ;
          dct:source <s1> .
    }

In "terse pseudo-quads":

    <bob> foaf:birthday "1970-01-01" <g1>{_:bb70} .
    _:bb70 ex:certainty 0.9 <g1> .
    _:bb70 dct:source <s1> <g1> .
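
The "quoted identifier" can be modelled as a small data structure, which also makes the "not asserted" rule easy to state. This is only a sketch: the class name `Quoted`, its field names and the `asserted` helper are assumptions, and the string form simply mirrors the pseudo-syntax used here.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class Quoted:
    """Fourth-position-only term: an identifier plus an optional graph name."""
    id: str
    graph: Optional[str] = None

    def __str__(self):
        # Renders as {_:bb70} or <g1>{_:bb70}
        return f"{self.graph or ''}{{{self.id}}}"

def asserted(quads):
    """Keep only quads whose fourth term is not a quoted identifier."""
    return [q for q in quads if not isinstance(q[3], Quoted)]

quads = [
    ("<bob>", "foaf:birthday", '"1970-01-01"', Quoted("_:bb70", "<g1>")),
    ("_:bb70", "ex:certainty", "0.9", "<g1>"),
    ("_:bb70", "dct:source", "<s1>", "<g1>"),
]
for s, p, o, g in quads:
    print(s, p, o, g, ".")
```

Because `Quoted` is a separate type that only ever appears in the fourth position, the subject, predicate and object positions remain ordinary RDF terms.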

Of course, this can be considered as "quins in disguise". As such this
option is *very* close to what RDFn proposes. The main difference is
that not *all* triples are auto-named, only "RDF-star-described" ones,
and that such names are always isolated triples, marked through
"quoted" quad identifiers (fusing position 4 and 5 of RDFn).

Note: While this proposal requires a quad representation, it is not
necessarily restricted to TriG (N-Quads would work too, but not
N-Triples). But since "statements about statements" is not basic RDF
101, it should be discussed. For provenance, this is related to named
graphs, and should be explained alongside them. For "qualification",
it is the *last* resort when you've got granular data but "run out of modelling
options"; usually in a production scenario. In schema.org, we've got
"impure" but pragmatic, triples-only options. In Wikidata, this is
more interesting.

(For LPG usage, I've gotten the impression that semantics take a back
seat, and putting raw data into "something" is more common practice.
Not unlike some RDF usage in the wild; and that's fine. We just need
to ensure that it's hard to "shoot yourself in the foot" with what we
introduce.)


## What About Opacity?

Controlling opacity is left to a future semantics for datasets (as in
[8]; see also e.g. [12]). For now, it depends on specific
implementation options for the union default graph, and for what their
inference engines take into account.

I think this is acceptable since the majority of collected use cases
and examples rely on a practical transparent interpretation of
triples, whether asserted or not. Also, if we "get closer" to
named graphs, these options could work on asserted and "protected"
triple sets alike.


## Future Convergence: Upgrading From Option C to B?

Option C is upgradable to semantic datasets, if such are eventually defined.

* The "quoted fourth term" can be made equal to an explicit graph
semantics of that "wrapped" identifier. It is a syntactic marker that
could be interpreted as a semantic declaration.

* With named annotations, we can also have named, isolated triple
sets. It can still fall back to reification, but would require a
relationship (e.g. `rdfx:triple`) from that named, isolated set to
each isolated triple.

* There is a path towards graph terms as default names for graph
"token" structures, using RDF C14N on its triple set (a `graphId`
function along the lines of the above `tripleId` mapping function).
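
The first point can be illustrated with a hypothetical rewrite from Option C pseudo-quads to Option B style quads: the identifier inside the quoted term becomes the graph name, and an enclosing graph (if any) is recorded with `rdfx:boundBy`. Modelling the quoted term as a `(graph, id)` tuple and the function name `upgrade` are assumptions made for this sketch.

```python
def upgrade(quad):
    """Rewrite one Option C pseudo-quad into Option B style quads.

    A quoted fourth term is modelled as a (graph, id) tuple. A plain
    graph name is a string, and None stands for the default graph.
    """
    s, p, o, g = quad
    if isinstance(g, tuple):
        graph, ident = g
        out = [(s, p, o, ident)]
        if graph is not None:
            # Record which graph "owns" the isolated triple set.
            out.append((ident, "rdfx:boundBy", graph, None))
        return out
    return [quad]

option_c = [
    ("<bob>", "foaf:birthday", '"1970-01-01"', ("<g1>", "_:bb70")),
    ("_:bb70", "ex:certainty", "0.9", "<g1>"),
]
for quad in option_c:
    for q in upgrade(quad):
        print(q)
```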


Thank you if you read this far!

Best regards,
Niklas

[1]: <https://lists.w3.org/Archives/Public/public-rdf-star-wg/2023Nov/0028.html>
[2]: <https://lists.w3.org/Archives/Public/public-rdf-star-wg/2023Nov/0026.html>
[3]: <https://lists.w3.org/Archives/Public/public-rdf-star-wg/2023Nov/0032.html>
[4]: <https://lists.w3.org/Archives/Public/public-rdf-star/2020Dec/0062.html>
[5]: <https://lists.w3.org/Archives/Public/public-rdf-star-wg/2023May/0063.html>
[6]: <https://lists.w3.org/Archives/Public/public-rdf-star-wg/2023Nov/0031.html>
[7]: <https://www.w3.org/TR/rdf11-concepts/#section-dataset>
[8]: <https://www.w3.org/TR/rdf11-datasets/#declaring>
[9]: <https://www.w3.org/TR/rdf11-datasets/#each-named-graph-defines-its-own-context>
[10]: <https://www.w3.org/TR/sparql11-service-description/#sd-uniondefaultgraph>
[11]: <https://lists.w3.org/Archives/Public/public-rdf-star-wg/2023Oct/0038.html>
[12]: <https://gist.github.com/niklasl/c22994e664663b6730613ecc1321c418#opacity-as-conditional-entailment>
Received on Thursday, 30 November 2023 13:39:51 UTC