Re: Attempting Consolidation from Thomas Lörtsch on 2023-11-30 (public-rdf-star-wg@w3.org from November 2023)

From: Thomas Lörtsch <tl@rat.io>
Date: Thu, 30 Nov 2023 17:36:06 +0100
To: Niklas Lindström <lindstream@gmail.com>
Cc: RDF-star Working Group <public-rdf-star-wg@w3.org>
Message-Id: <A98D3E0A-4EC0-4DEE-B6B2-B9413F2B3897@rat.io>
Hi Niklas,


I did of course read to the end ;-) but I’m top-posting for readability. I have also incorporated your corrections into the quoted response below.


I also think that some proposals are not too far from each other and that we can get to a coherent whole (better with graphs, but also with triples). Your version B comes pretty close to the nested graph proposal. Version C, however, your fallback, will probably not gain much acceptance from implementors because it requires introspecting the triple/graph identifier. I might be wrong, but I guess this is a no-go, and the standard-reification-based version A would not only be easier to implement but also more performant - and perfectly backwards compatible.


I like your attempts to firmly tie a token and its identifier together, like 
> <bob> foaf:birthday "1970-01-01" {<#t1>} .

and
>    << _:b1 | <bob> foaf:birthday "1970-01-01" >> ex:certainty 0.9 ;
>        dct:source <s1> .
>    << _:b2 | <bob> foaf:birthday "1970-01-01" >> ex:certainty 0.8 ;
>        dct:source <s2> .

Such tight coupling really helps with use cases like the one described in "3.1. Challenge #1: Edge Properties, Multiple Edge Instances, and Reification" in "The OneGraph Vision: Challenges of Breaking the Graph Model Lock-In", 2023, Lassila et al. [0] (which I also mentioned an hour ago in a response to Andy in another thread). I would like to propose yet another variant:

 <bob> foaf:birthday "1970-01-01" {| _:b1 | dct:source <s1> ;
                                            ex:certainty 0.9 |},
                                  {| _:b2 | dct:source <s2>;
                                            ex:certainty 0.8 |} .

However, those identifiers must be provided automatically; we cannot rely on users to take the extra step of defining them. In that respect I find the nested graph syntax more succinct:

  []{ :s :p :o }

One can’t omit the preceding name (blank node or explicit IRI) without running into a parsing error - and it’s not more keystrokes either. Please forgive the blatant self-advertisement, camouflaged as a suggestion.


Relegating opacity to future work seems a waste. Although I would not interpret the charter as explicitly asking for it - I just doubt that many people realized it was there in the first place - any mechanism that distinguishes 'accepted' triples from other triples already has everything in place to enable opacity, unassertedness and whatever else one might desire.


Best,
Thomas


[0] https://content.iospress.com/articles/semantic-web/sw223273


> On 30. Nov 2023, at 14:39, Niklas Lindström <lindstream@gmail.com> wrote:
> 
> Dear all,
> 
> I actually think the current proposals are closer to each other than
> it might seem.
> 
> What Souri proposes with RDFn [1] is very close to what I was seeking
> with "bound" named graphs ([2], [3]). Both are "about tokens" (as in
> the same triple can be named by more than one identifier (blank node
> or IRI), which are considered distinct unless asserted to be the
> same). But Souri proposes something valuable, which has been around in
> various guises before (e.g. in [4] and [5]), and I think is also
> alluded to by Peter in [6] (option 2,1,1, expanding to "the same
> central node").
> 
> Here is an attempt at consolidation of these various ideas, taking
> what the CG was seeking into account (and this time keeping all of its
> syntax).
> 
> 
> ## The Troubles of Describing Triples
> 
> Having triple terms as "types" has proven troublesome, both in
> theory and practice. They are *universals* (like literals), and
> neither provenance nor qualification (our actual use cases) are about
> universals. Cases describe instantiated occurrences of them, in
> various contexts (graphs). Admittedly, these are *mainly* the asserted
> triples in the current graph (one unique s,p,o per g). So the "type"
> point of view is understandable, and in the simplest cases is all you
> see. But also "referenced" or "possible" triples come into view a lot;
> and they all are "identified by their singleton sets". Such referenced
> ("backing") triples also cater for the LPG cases; but can stay
> unasserted, in the background, without "polluting" RDF with multisets.
> 
> (It is not logically wrong to talk about universals directly, but it
> is rarely (if ever) the intent. RDF has this *cautious* design of
> disallowing literals in the subject position for this reason. To
> prevent users from "shooting themselves in the foot", if you will.)
> 
> 
> ## Consolidating Occurrences: Default Token Identifiers
> 
> This "auto-named triple" approach solves the disconnect, in that it
> "talks about tokens", without abandoning the effect of concentrating
> on a default triple in a graph in the simplest cases.
> 
> So, we can:
> 
> * Define a function (tripleId) that maps s,p,o to a unique identifier
> (blank node or IRI). This denotes a "default triple token", or, if you
> will, the triple occurrence *in a graph*.
> 
> 
> ## Options at Hand
> 
> Let's examine a case and some options. I'll use this example (not
> because it's my favorite, but because it is common, and also contains
> the "seminal error", which we "save ourselves from" by describing
> tokens):
> 
>    << <bob> foaf:birthday "1970-01-01" >> ex:certainty 0.9 ;
>        dct:source <s1> .
> 
> This is the same "default triple token" throughout the graph, and the
> above is the same as:
> 
>    << <bob> foaf:birthday "1970-01-01" >> ex:certainty 0.9 .
>    << <bob> foaf:birthday "1970-01-01" >> dct:source <s1> .
> 
> (Note: Of course the date should be `"1970-01-01"^^xsd:date`; it's
> omitted for brevity.)
> 
> For this syntax, we use `tripleId` to get a unique identifier from the
> syntactic triple term. Below we'll use a simple bnode id, `_:bb70`;
> but anything goes as long as it is unique, e.g. a hash-based bnode id
> like `_:gen6e16a579edbbf4dc3339be9415c39ea8`, an IRI like
> `<urn:tdb:2014:urn:md5:6e16a579edbbf4dc3339be9415c39ea8>` or a
> data-URL-variant thereof (no hash; terribly long).
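
As a rough illustration of the hash-based variant, here is a minimal Python sketch of a `tripleId`-like function. The mail does not say what exactly gets hashed, so the input (the three terms spelled as one N-Triples-style line), the function name `triple_id` and the `_:gen` prefix are assumptions for illustration only; MD5 is used merely because the example identifiers above look like MD5 digests.

```python
import hashlib


def triple_id(s: str, p: str, o: str) -> str:
    """Map the spelling of s, p, o to a deterministic blank node label.

    Assumption: the terms are already in a canonical, N-Triples-like
    spelling; no normalization is attempted here.
    """
    line = f"{s} {p} {o} ."
    digest = hashlib.md5(line.encode("utf-8")).hexdigest()
    return f"_:gen{digest}"


# The same triple always maps to the same default token name ...
a = triple_id("<bob>", "<http://xmlns.com/foaf/0.1/birthday>", '"1970-01-01"')
b = triple_id("<bob>", "<http://xmlns.com/foaf/0.1/birthday>", '"1970-01-01"')
# ... while a different object yields a different name.
c = triple_id("<bob>", "<http://xmlns.com/foaf/0.1/birthday>", '"1970-01-02"')
print(a == b, a == c)
```

Equal s, p, o always yield the same label, so every mention of the triple in a graph lands on the same default token. A real mapping would also need a canonical spelling of the terms (absolute IRIs, normalized literals), which this sketch leaves out.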
> 
> ## Option A: Reification
> 
> This can be used as the identifier of a simple reified statement:
> 
>    _:bb70 rdf:subject <bob> .
>    _:bb70 rdf:predicate foaf:birthday .
>    _:bb70 rdf:object "1970-01-01" .
> 
>    _:bb70 ex:certainty 0.9 .
>    _:bb70 dct:source <s1> .
> 
> For the annotation shorthand:
> 
>    <bob> foaf:birthday "1970-01-01" {| ex:certainty 0.9 ;
>                                        dct:source <s1> |} .
> 
> This could become:
> 
>    <bob> foaf:birthday "1970-01-01" .
> 
>    _:bb70 rdf:subject <bob> .
>    _:bb70 rdf:predicate foaf:birthday .
>    _:bb70 rdf:object "1970-01-01" .
> 
>    _:bb70 ex:certainty 0.9 .
>    _:bb70 dct:source <s1> .
> 
> We do want repeated annotations too (in some form):
> 
>    <bob> foaf:birthday "1970-01-01" {| dct:source <s1> ;
>                                        ex:certainty 0.9 |},
>            "1970-01-01" {| dct:source <s2>;
>                            ex:certainty 0.8 |} .
> 
> When there is more than one "referenced occurrence" like this, the
> auto-naming isn't used, since the reference triples "decohere". So we
> reasonably get regular blank nodes:
> 
>    <bob> foaf:birthday "1970-01-01" .
> 
>    _:b1 rdf:subject <bob> .
>    _:b1 rdf:predicate foaf:birthday .
>    _:b1 rdf:object "1970-01-01" .
>    _:b1 dct:source <s1> .
>    _:b1 ex:certainty 0.9 .
> 
>    _:b2 rdf:subject <bob> .
>    _:b2 rdf:predicate foaf:birthday .
>    _:b2 rdf:object "1970-01-01" .
>    _:b2 dct:source <s2> .
>    _:b2 ex:certainty 0.8 .
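
The expansion described for Option A could be sketched in Python roughly as follows. This is only an illustration under assumptions: `expand` and `triple_id` are hypothetical names, terms are plain strings, and the rule "one annotation block reuses the default name, several blocks get regular blank nodes" is one reading of the paragraph above, not a settled design.

```python
import hashlib
from itertools import count


def triple_id(s, p, o):
    # Hypothetical default name: a hash of the triple's spelling (an assumption).
    return "_:gen" + hashlib.md5(f"{s} {p} {o} .".encode("utf-8")).hexdigest()


def expand(s, p, o, annotation_blocks):
    """Expand an annotated triple into the asserted triple plus reification triples."""
    out = [(s, p, o)]
    fresh = count(1)
    for block in annotation_blocks:
        # One block: reuse the default token name. Several: regular blank nodes.
        if len(annotation_blocks) == 1:
            name = triple_id(s, p, o)
        else:
            name = f"_:b{next(fresh)}"
        out += [(name, "rdf:subject", s),
                (name, "rdf:predicate", p),
                (name, "rdf:object", o)]
        out += [(name, ap, ao) for ap, ao in block]
    return out


triples = expand("<bob>", "foaf:birthday", '"1970-01-01"',
                 [[("dct:source", "<s1>"), ("ex:certainty", "0.9")],
                  [("dct:source", "<s2>"), ("ex:certainty", "0.8")]])
for t in triples:
    print(" ".join(t), ".")
```

With two blocks this prints the asserted triple followed by the `_:b1` and `_:b2` descriptions in the order listed above; with a single block the reification triples hang off the hash-based default name instead.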
> 
> It could make sense to always use regular blank nodes for the
> annotation form; *or* to require explicit names for repetitions.
> 
> On that note, here is a form for explicitly named annotations:
> 
>    <bob> foaf:birthday "1970-01-01" {<#t1>} .
> 
>    <#t1> ex:certainty 0.9;
>        dct:source <s1> .
> 
> In "terse" triples:
> 
>    <bob> foaf:birthday "1970-01-01" .
> 
>    <#t1> rdf:subject <bob> .
>    <#t1> rdf:predicate foaf:birthday .
>    <#t1> rdf:object "1970-01-01" .
>    <#t1> ex:certainty 0.9 .
>    <#t1> dct:source <s1> .
> 
> With this, we finally have a Turtle equivalent to RDF/XML statement
> annotations (used extensively in UniProt):
> 
>    <rdf:Description rdf:about="bob">
>      <foaf:birthday rdf:ID="t1">1970-01-01</foaf:birthday>
>    </rdf:Description>
> 
>    <rdf:Description rdf:ID="t1">
>      <ex:certainty rdf:datatype="&xsd;double">0.9</ex:certainty>
>      <dct:source rdf:resource="s1"/>
>    </rdf:Description>
> 
> How do we "save ourselves from the seminal error" then, if triple
> terms are at least type-like? In this basic form we could just resort
> to reification; or triple terms could have an optional identifier,
> like:
> 
>    << _:b1 | <bob> foaf:birthday "1970-01-01" >> ex:certainty 0.9 ;
>        dct:source <s1> .
>    << _:b2 | <bob> foaf:birthday "1970-01-01" >> ex:certainty 0.8 ;
>        dct:source <s2> .
> 
> Or (which I prefer) the completing object could be marked as "quoted"
> (I've previously used `--`, but it has been considered hard to spot):
> 
>    <bob> foaf:birthday << "1970-01-01" >> {| dct:source <s1> ;
>                                        ex:certainty 0.9 |},
>            << "1970-01-01" >> {| dct:source <s2>;
>                            ex:certainty 0.8 |} .
> 
> Exact syntax isn't important yet, only whether this is what we can
> converge upon or not.
> 
> For named graphs, this:
> 
>    <g1> {
>        << <bob> foaf:birthday "1970-01-01" >> ex:certainty 0.9 .
>    }
>    <g2> {
>        << <bob> foaf:birthday "1970-01-01" >> ex:certainty 0.8 .
>    }
> 
> becomes, in "terse" quads:
> 
>    _:bb70 rdf:subject <bob> <g1> .
>    _:bb70 rdf:predicate foaf:birthday <g1> .
>    _:bb70 rdf:object "1970-01-01" <g1> .
>    _:bb70 ex:certainty 0.9 <g1> .
> 
>    _:bb70 rdf:subject <bob> <g2> .
>    _:bb70 rdf:predicate foaf:birthday <g2> .
>    _:bb70 rdf:object "1970-01-01" <g2> .
>    _:bb70 ex:certainty 0.8 <g2> .
> 
> Granted, given the reasoning above (an instantiated triple occurrence
> in a graph) it might make sense that `tripleId` mint different
> identifiers for different graphs. Annotation forms achieve that anyway
> though, and the above is simpler as is (*if* the *union* of the two
> graphs share blank nodes, the certainty claims in them are in conflict
> (assuming such semantics for the property), which can be important
> information).
> 
> Of course, we're still at square one here. It's more *convenient*
> reification, but perhaps not *better*. While this could be all we
> need, let's look a bit further.
> 
> 
> ## Option B: Attempting Semantics for Datasets
> 
> What I've been aiming for is isolated (as in unasserted, from the open
> world point of view) named triple sets, bound to another "graph name
> resource" in a dataset.
> 
> I *tried* to base my approach on the open-ended options for dataset
> semantics, without touching the abstract syntax. This was not about
> giving all uses of named graphs fixed semantics, but about *opting in*
> to semantic datasets. I thought this was respectful of what's out
> there, given what RDF 1.1 Concepts states [7]:
> 
>> RDF does not place any formal restrictions on what resource the graph name may denote, nor on the relationship between that resource and the graph. A discussion of different RDF dataset semantics can be found in [RDF11-DATASETS].
> 
> Given that, claiming that graph names mean nothing is only *one* of
> many possible interpretations. And while formal means for doing so are
> still undefined, I hoped they didn't have to be. Looking at
> RDF11-DATASETS [8]:
> 
>> A vocabulary specifically tailored for describing the intended dataset semantics could be defined in a future specification.
> 
> It suggests that describing the resource that names a graph could
> define how the graph it is paired with is interpreted within a
> dataset. Its dataset semantics option 3.4 [9] is
> close to what I've attempted. With such semantics for named graphs, in
> order not to break monotonicity, graphs must reasonably be explicitly
> "accepted" to be considered asserted in a union default graph [10].
> 
> So my option for the above was *selecting*, out of band (in an
> implementation), a semantic dataset profile in which named graphs are
> isolated unless accepted. (The simple act of loading them into graph
> names in a semantic graph store would "accept" the default graph here,
> but not the named graph.)
> 
> So our example simply becomes:
> 
>    _:bb70 ex:certainty 0.9 .
>    _:bb70 dct:source <s1> .
> 
>    <bob> foaf:birthday "1970-01-01" _:bb70 .
> 
> And for scoping this (for graph store management), I proposed
> `rdfx:boundBy` to relate two graph name resources to ensure that the
> "bound" ones remain isolated, and "owned" by their binding resource
> (for atomic updates and deletes). So if we read the above into named
> graph `<g1>`, we get:
> 
>    _:bb70 rdfx:boundBy <g1> .
> 
>    _:bb70 ex:certainty 0.9 <g1> .
>    _:bb70 dct:source <s1> <g1> .
> 
>    <bob> foaf:birthday "1970-01-01" _:bb70 .
> 
> *Of course* this is not an easy thing to formalize and get
> implemented. It requires "semantic datasets", and is hard to get right
> (defining semantics by the presence of statements (without breaking
> monotonicity), requiring an explicit opt-in profile, etc).
> 
> Thus I said it might be a tall order. Too tall, I've gathered. So
> let's defer this option, and see if we can do something else *now*
> which does not prevent semantic datasets in the future.
> 
> 
> ## Option C: Explicit Abstract Syntax Instead
> 
> Another way to get isolated named triple sets is to make them explicit
> in the concepts and abstract syntax, but without adding new terms that
> regular users will come across (so neither the subject, predicate nor
> object positions of triples have access to anything novel).
> 
> This is drawing from Souri's RDFn *and* Andy's graph terms [11], in a
> kind of amalgam (or compromise).
> 
> * Define a new kind of quoted identifier. *Not* for general use,
> *only* for the fourth position in a quad.
> * It is formed by a regular identifier (blank node id or IRI) and an
> optional graph name identifier. Formally: quoted(id=some-id, optional
> graph=some-graph).
> * Triples named by this term are *not asserted*.
> 
> (It is called "quoted" here, but could of course be called e.g.
> "isolated" or "protected".)
> 
> Here I use this syntax for such "quoted identifiers" for something in
> a default graph (again, *only* usable in the fourth position of a
> quad):
> 
>    {_:bb70}
> 
> And this for a quoted identifier in a named graph `<g1>`:
> 
>    <g1>{_:bb70}
> 
> Structurally, it is related to typed literals. To a lesser extent it
> is reminiscent of the triple terms it replaces; the main difference
> being that this is not a recursive structure; and that the identifier
> "within" is a regular RDF identifier which is used in subjects and
> objects.
> 
> Here is the initial example in "terse pseudo-quads":
> 
>    <bob> foaf:birthday "1970-01-01" {_:bb70} .
>    _:bb70 ex:certainty 0.9 .
>    _:bb70 dct:source <s1> .
> 
> And for a triple description in a named graph:
> 
>    <g1> {
>      << <bob> foaf:birthday "1970-01-01" >> ex:certainty 0.9 ;
>          dct:source <s1> .
>    }
> 
> In "terse pseudo-quads":
> 
>    <bob> foaf:birthday "1970-01-01" <g1>{_:bb70} .
>    _:bb70 ex:certainty 0.9 <g1> .
>    _:bb70 dct:source <s1> <g1> .
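
To make the shape of the "quoted identifier" a bit more concrete, here is a small Python sketch that models quoted(id, optional graph) and prints the pseudo-quads above. The class name `Quoted`, the helper `describe` and the string-based terms are assumptions made for illustration; nothing here is a proposed API.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Quoted:
    """Hypothetical model of quoted(id=some-id, optional graph=some-graph)."""
    id: str
    graph: Optional[str] = None

    def __str__(self) -> str:
        # Default graph: {_:bb70}   Named graph: <g1>{_:bb70}
        return f"{self.graph or ''}{{{self.id}}}"


def describe(s, p, o, name, annotations, graph=None):
    """Return pseudo-quad lines: one unasserted triple plus its description."""
    lines = [f"{s} {p} {o} {Quoted(name, graph)} ."]
    suffix = f" {graph}" if graph else ""
    lines += [f"{name} {ap} {ao}{suffix} ." for ap, ao in annotations]
    return lines


quads = describe("<bob>", "foaf:birthday", '"1970-01-01"', "_:bb70",
                 [("ex:certainty", "0.9"), ("dct:source", "<s1>")],
                 graph="<g1>")
print("\n".join(quads))
```

The quoted term only ever shows up in the fourth position, while the plain identifier `_:bb70` is what the describing triples use as their subject. A consumer would have to look inside the fourth term to connect the two.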
> 
> Of course, this can be considered as "quins in disguise". As such this
> option is *very* close to what RDFn proposes. The main difference is
> that not *all* triples are auto-named, only "RDF-star-described" ones,
> and that such names are always isolated triples, marked through
> "quoted" quad identifiers (fusing position 4 and 5 of RDFn).
> 
> Note: While this proposal requires a quad representation, it is not
> necessarily restricted to TriG (but it does need N-Quads rather than N-Triples). But
> since "statements about statements" is not basic RDF 101, it should be
> discussed. For provenance, this is related to named graphs, and should
> be explained alongside them. For "qualification", it is the *last*
> resort when you've got granular data but "run out of modelling
> options"; usually in a production scenario. In schema.org, we've got
> "impure" but pragmatic, triples-only options. In Wikidata, this is
> more interesting.
> 
> (For LPG usage, I've gotten the impression that semantics have a back
> seat, and putting raw data into "something" is more common practice.
> Not unlike some RDF usage in the wild; and that's fine. We just need
> to ensure that it's hard to "shoot yourself in the foot" with what we
> introduce.)
> 
> 
> ## What About Opacity?
> 
> Controlling opacity is left to a future semantics for datasets (as in
> [8], also thought of e.g. in [12]). For now, it depends on specific
> implementation options for the union default graph, and for what their
> inference engines take into account.
> 
> I think this is acceptable since the majority of collected use cases
> and examples rely on a practical transparent interpretation of
> triples, whether asserted or not. Also, if we "get closer" to
> named graphs, these options could work on asserted and "protected"
> triple sets alike.
> 
> 
> ## Future Convergence: Upgrading From Option C to B?
> 
> Option C is upgradable to semantic datasets, if such are eventually defined.
> 
> * The "quoted fourth term" can be made equal to an explicit graph
> semantics of that "wrapped" identifier. It is a syntactic marker that
> could be interpreted as a semantic declaration.
> 
> * With named annotations, we can also have named, isolated triple
> sets. It can still fall back to reification, but would require a
> relationship (e.g. `rdfx:triple`) from that named, isolated set to
> each isolated triple.
> 
> * There is a path towards graph terms as default names for graph
> "token" structures, using RDF C14N on its triple set (a `graphId`
> function along the lines of the above `tripleId` mapping function).
> 
> 
> Thank you if you read this far!
> 
> Best regards,
> Niklas
> 
> [1]: <https://lists.w3.org/Archives/Public/public-rdf-star-wg/2023Nov/0028.html>
> [2]: <https://lists.w3.org/Archives/Public/public-rdf-star-wg/2023Nov/0026.html>
> [3]: <https://lists.w3.org/Archives/Public/public-rdf-star-wg/2023Nov/0032.html>
> [4]: <https://lists.w3.org/Archives/Public/public-rdf-star/2020Dec/0062.html>
> [5]: <https://lists.w3.org/Archives/Public/public-rdf-star-wg/2023May/0063.html>
> [6]: <https://lists.w3.org/Archives/Public/public-rdf-star-wg/2023Nov/0031.html>
> [7]: <https://www.w3.org/TR/rdf11-concepts/#section-dataset>
> [8]: <https://www.w3.org/TR/rdf11-datasets/#declaring>
> [9]: <https://www.w3.org/TR/rdf11-datasets/#each-named-graph-defines-its-own-context>
> [10]: <https://www.w3.org/TR/sparql11-service-description/#sd-uniondefaultgraph>
> [11]: <https://lists.w3.org/Archives/Public/public-rdf-star-wg/2023Oct/0038.html>
> [12]: <https://gist.github.com/niklasl/c22994e664663b6730613ecc1321c418#opacity-as-conditional-entailment>
>
Received on Thursday, 30 November 2023 16:36:18 UTC