- From: Niklas Lindström <lindstream@gmail.com>
- Date: Fri, 8 Dec 2023 15:47:29 +0100
- To: Adrian Gschwend <adrian.gschwend@zazuko.com>
- Cc: "public-rdf-star-wg@w3.org" <public-rdf-star-wg@w3.org>, Thomas Lörtsch <tl@rat.io>
Hi Adrian,

These are two great cases, and very important to keep just as viable and effective going forward. Thomas Tanon also mentioned RDF Sources in LDP [1] as another crucially important example of graph management in practice. And describing detailed triple provenance or background qualification in such scenarios should certainly be doable.

The latter case, using Named Graphs as "Documents", is basically also how our national library system [2] is implemented. It "cheats" in that regard, as it is not a quad store but a document store. We just store JSON-LD documents as is (in Postgres, but that is an under-the-hood implementation detail), normalized in such a way that each document describes two IRI-named things: itself, as a library Record, and its main entity, which it describes; along with, sometimes, plenty of bnodes, which are either "structured values", qualified relationships (such as contributions), or simply not-yet-disambiguated -- i.e. not yet linked -- related entities (agents, concepts, unknown works, etc.).

One might suspect that my perspective is skewed because of this, since I don't "care that much" about these as pure RDF triples, and work too much in terms of JSON-LD documents. Because as just documents, as Andy has pointed out, having small, "nested" named graphs (often but not always "blankly named") is easy; but that is not the same as storing triples under names in quads.

However, we certainly rely on this as proper RDF data in many ways. One is the linked data aspect, which we use to create an "embellished" view, following outgoing and some incoming links to generate a "not so concise and not so bounded" description ("providing useful data"). These denormalized views are JSON-LD-framed and indexed in Elasticsearch, and can also be accessed as TriG (as the default graph plus "snippets" of relevant embellished details; e.g. in [3], visualized in [4]). Another is indexing each record as a named graph in a graph store (we use Virtuoso for that, but that *should* be only an implementation detail) and exposing a SPARQL endpoint [5]. And because of the undefined relation between an RDF Source containing multiple graphs and the formal means for managing graphs in graph stores, I have steered clear of utilizing that other than in materialized views.

The concept and method I've been trying to convey for making these things work is to utilize the fact that an RDF Source can be one or more graphs, i.e. it can always be a dataset. And if we have means for not just reading the default graph from this source under this-or-that graph name, as part of the union default graph, but also saying "and place all other named graphs described in this source into 'appendices', owned by this named graph and not part of the asserted union graph", we can leverage quads for that under the hood.

Thus I think not all quads need to represent the same kind of resource (documents, or records). Some can be unasserted "citation graphs", "appendices", or any other of the unconstrained kinds of resources that the "graph name" can denote (including singleton sets, but also e.g. old versions, commit deltas, or opaque quotes isolated from entailment processing). And such resources can be linked to the "records" naming the graphs, as in "bound" by them, so that they fall under the same ACL rules, for instance. We "just" don't have any formal means of declaring that, yet.
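To make that a bit more concrete, here is a minimal TriG sketch of the idea. The :appendix property and :CitationGraph class, like all the IRIs, are made up purely for illustration; nothing here is a formalized vocabulary:

```trig
@prefix : <https://example.org/ns#> .
@prefix ex: <https://example.org/> .
@prefix dct: <http://purl.org/dc/terms/> .

# The record graph: asserted, and part of the union default graph.
ex:record1 {
    ex:record1 a :Record ;
        :mainEntity ex:entity1 ;
        # Hypothetical property binding an appendix graph to its record:
        :appendix ex:record1-cites .

    ex:entity1 a :Work ;
        :title "An Example Work" .

    # The appendix graph is described here, in its "binding" graph:
    ex:record1-cites a :CitationGraph ;
        dct:source <https://other.example/catalogue> .
}

# The appendix graph itself: owned by ex:record1 and *unasserted*, i.e.
# not part of the union default graph; here a singleton-set "citation"
# of a variant statement from another source.
ex:record1-cites {
    ex:entity1 :title "An Example Work (variant form)" .
}
```

The intent is that only ex:record1 contributes to the asserted union graph, while ex:record1-cites travels with the record for management purposes (ACLs, deletion, and so on) without being asserted.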
With formal means for those practices, we can describe these other kinds of "appendix graphs" in their binding graph, for detailed provenance and qualification, as in the sketch above. (And while I claim that this is not *adding* complexity -- since these practices, and others, already exist in the wild, due to the unconstrained ways we can use named graphs -- I readily admit that some of these practices are advanced. Just as detailed provenance and triple qualification are fairly advanced. (Especially since qualification should only be done once you've exhausted the option of using a more granular model and, e.g., deriving simple edges using owl:propertyChainAxiom entailments.) And I think that we may be able to formalize some of it and leverage that for the RDF-star use cases, in the process ideally paving the way for more approaches, such as annotating multiple triples from various contexts.)

But as I've said (e.g. in [6]), I'm not excluding other means, even reification, if that "order is too tall". I do think something quad-based can be made less obtrusive than adding to the core of RDF (the triples themselves), though, and a major part of why I attempt that is because triples, and Turtle, are *simple*. I don't think multi-edges belong in the core of RDF, at the simple triple level (that would be an even more radical change to the fundament). Nor meta-provenance, for that matter, since I think that is more on the level of how to "think in named graphs".

However, I think that multiple *contexts* related to simple, asserted graphs (as partial "overlays" if you will, or "circles with post-its on records", as I think of them in the library context) make for a workable, intuitive model of detailed fact provenance and ad-hoc/background qualification. (I've tinkered some with illustrating RDF-star data like that [7], [8].) And multi-edges then "emerge" from these added, small (often singleton-set) graphs derived from other contexts (sources, observations, underlying complex states of affairs). There is only one simple triple asserted; but these added, described, isolated "external" assertions ("citations" or "quotations") have the same effect, as they identify the same triple (its "type") -- without touching the simple RDF triple fundament.
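For instance, sticking with made-up IRIs and the same kind of hypothetical unasserted, bound graphs as in the sketch above:

```trig
@prefix : <https://example.org/ns#> .
@prefix ex: <https://example.org/> .
@prefix dct: <http://purl.org/dc/terms/> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .

# One simple triple, asserted in the record graph, along with
# descriptions of two unasserted "overlay" graphs bound to it:
ex:record2 {
    ex:alice :knows ex:bob .

    ex:record2-src1 dct:source <https://one.example/dump> ;
        dct:date "2021-04-01"^^xsd:date .

    ex:record2-src2 dct:source <https://two.example/observations> ;
        dct:date "2023-11-30"^^xsd:date .
}

# Each overlay "cites" the same triple (the same triple "type"); a
# multi-edge emerges from these annotations, while the simple
# ex:alice :knows ex:bob edge itself is asserted only once, above.
ex:record2-src1 {
    ex:alice :knows ex:bob .
}

ex:record2-src2 {
    ex:alice :knows ex:bob .
}
```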
All the best,
Niklas

[1]: <https://www.w3.org/TR/ldp/#dfn-linked-data-platform-rdf-source>
[2]: <https://github.com/libris/librisxl/>
[3]: <https://libris.kb.se/fxqnzf2r3t063cf/data.trig>
[4]: <https://niklasl.github.io/ldtr/demo/?url=https%3A//libris.kb.se/fxqnzf2r3t063cf/data.trig>
[5]: <https://libris.kb.se/sparql>
[6]: <https://lists.w3.org/Archives/Public/public-rdf-star-wg/2023Nov/0061.html>
[7]: <https://niklasl.github.io/ldtr/demo/?url=../test/data/prov-qualification-annotation.trig&edit=true>
[8]: <https://niklasl.github.io/ldtr/demo/?url=../test/data/lotr-annotated.trig&edit=true>

On Fri, Dec 8, 2023 at 11:57 AM Adrian Gschwend <adrian.gschwend@zazuko.com> wrote:

> Hi group,
>
> Thomas asked me to clarify how I use graphs. I employ them in various
> ways, but let me illustrate with two examples:
>
> ## Open Data Endpoint with Multiple Stakeholders & Named Graphs
>
> For the Swiss government, we operate a large Stardog instance. Multiple
> ETL pipelines from diverse stakeholders update graphs in this single
> endpoint, running at different frequencies. Access is managed via ACLs
> on the named graphs [1], utilizing Stardog's security layer based on
> Apache Shiro and role-based access control [2].
>
> Pipelines have the flexibility to choose their methods of writing into
> the triplestore, but most utilize the SPARQL Graph Store Protocol,
> typically writing to specific graphs through PUT operations, as the
> primary data source is often external.
>
> Each stakeholder is allocated one or more named graphs with write
> permissions. While most graphs are public and accessible via a generic
> read user, either as named graphs or through the union default graph,
> some are restricted for staging purposes and are not visible in the
> public default graph.
>
> Graph names follow a defined scheme, and user, role, and permission
> management is automated.
>
> I don't see how I could use a graph-based RDF-star model here. We never
> write quads; if I wanted to make an RDF-star statement, I would expect
> it to be part of a triple representation. If that were not the case, it
> would IMO break the design of this use case.
>
> In that regard, would a quad-based model not by definition break the
> SPARQL Graph Store Protocol?
>
> ## Named Graphs as "Documents"
>
> Another scenario, not directly mine but observed in two companies we
> collaborated with, involves treating RDF data as "documents". These
> companies do not use SPARQL directly. Instead, they load data into an
> Elastic/OpenSearch index for efficient reads, as the data is relatively
> static. Occasional writes are handled by middleware that updates the
> triplestore.
>
> In this model, the triplestore essentially functions as an RDF document
> store, with each document represented as a graph. These graphs group
> "key-values", which are then indexed as documents in Elastic,
> transforming each graph into an Elastic document.
>
> In both cases, I was restricted from creating additional graphs, as
> each graph was treated as a separate document.
>
> One could argue that using RDF-star in such a scenario might not make
> sense in the first place. But the same challenges as mentioned above
> would still apply, IMO.
>
> [1]: https://docs.stardog.com/operating-stardog/security/named-graph-security
> [2]: https://docs.stardog.com/operating-stardog/security/security-model
>
> regards
>
> Adrian
>
> --
> Adrian Gschwend
> CEO Zazuko GmbH, Biel, Switzerland
>
> Phone +41 32 510 60 31
> Email adrian.gschwend@zazuko.com
Received on Friday, 8 December 2023 14:48:02 UTC