Re: Against the notion of reification well-formed graph (i.e., atomicity) from Andy Seaborne on 2024-01-25 (public-rdf-star-wg@w3.org from January 2024)

From: Andy Seaborne <andy@apache.org>
Date: Thu, 25 Jan 2024 13:42:08 +0000
To: RDF-star Working Group <public-rdf-star-wg@w3.org>
Message-ID: <6a3cb95f-bf01-4224-995d-7962658b9681@apache.org>
On 25/01/2024 12:27, Thomas Lörtsch wrote:
> 
> 
>> On 25. Jan 2024, at 12:22, Peter F. Patel-Schneider <pfpschneider@gmail.com> wrote:
>> On 1/25/24 06:08, Andy Seaborne wrote:
>>>> But what does it *mean*? Optimizations should only be applied after we know that it means what we want it to mean.
>>> Agreed.
>>> We can start with our goals. "bloat" has been used in two senses : "visual bloat" and "size bloat".
> 
> You’re forgetting "term bloat"…
> 
>>> Is the WG addressing the size bloat issue?
> 
> IMO it can’t be addressed in N-Triples, as N-Triples is per definition a strictly triple-based serialization, with pretty atomic terms

A bit more complicated than that, especially for annotation usage.

Both in-memory and persistent storage systems largely share RDF terms 
that occur more than once in a graph, either as pointers or as "node ids", 
which are at most 16 bytes, often 8. A term is not stored as its whole 
string each time, so "rdf:subject" does not need a repeated copy of 
<http://www.w3.org/1999/02/22-rdf-syntax-ns#subject>.

The new term can be about the size of whatever a triple is, which, with 
existing approaches of term dictionaries and compression, is quite small.
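
A rough sketch of the node id point, in Python. The names (TermDictionary, 
encode, pack) and the example IRIs are made up for illustration and are not 
taken from any particular store; the 8-byte id width is just one of the 
sizes mentioned above.

```python
import struct

RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"


class TermDictionary:
    """Hypothetical term dictionary: each distinct term string is stored once."""

    def __init__(self):
        self._id_of = {}    # term string -> node id
        self._term_of = []  # node id -> term string

    def intern(self, term):
        """Return the node id for a term, allocating one on first sight."""
        node_id = self._id_of.get(term)
        if node_id is None:
            node_id = len(self._term_of)
            self._id_of[term] = node_id
            self._term_of.append(term)
        return node_id

    def lookup(self, node_id):
        return self._term_of[node_id]

    def __len__(self):
        return len(self._term_of)


def encode(triples, terms):
    """Replace every term in every triple by its node id."""
    return [tuple(terms.intern(t) for t in triple) for triple in triples]


def pack(encoded):
    """Store each triple as three unsigned 64-bit ids (24 bytes per triple)."""
    return b"".join(struct.pack("<3Q", *ids) for ids in encoded)


# Two reifications that share most of their terms.
triples = [
    ("_:r1", RDF + "subject", "<http://example/s>"),
    ("_:r1", RDF + "predicate", "<http://example/p>"),
    ("_:r1", RDF + "object", "<http://example/o>"),
    ("_:r2", RDF + "subject", "<http://example/s>"),
    ("_:r2", RDF + "predicate", "<http://example/p>"),
    ("_:r2", RDF + "object", "<http://example/o2>"),
]

terms = TermDictionary()
encoded = encode(triples, terms)
packed = pack(encoded)
print(len(terms), "distinct terms,", len(packed), "bytes of triples")
```

The long rdf:subject IRI is held once in the dictionary; every further use 
costs one id in a triple, not another copy of the string.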

> (language tags to literals being the acceptable exception). Its whole purpose is to ease processing by eliminating shortcuts. Everything you add to that - especially new term types that combine already defined atomic term types into more complex term types, e.g. triple terms - breaks this simplicity and straightforwardness.
> 
> B.t.w. "streaming" is an argument that has been brought forward a lot. Can you point me to any halfway concise treatise of the problems and practices of streaming RDF data? I’d like to understand how solving problems with reification by means of a triple term would relate to other issues. Would it be a decisive breakthrough, or more like a drop in the bucket?
> Because my hunch is that it’s rather the latter. And where does it end? What about list terms? What about CBD terms? Or even graph terms?
> 
> 
> I’ve been peeking into your draft proposal that you mentioned to Felix the other day, at
> https://github.com/afs/rdf-star-notes/blob/main/reif-atoms.md
> You give a list of 7 problems with RDF reification. Some of them (problems 4, 5 and 6) would be handled by a notion of wellformedness.

And it does not need to be checked across the whole graph.
It works with RDF merge.

> Problem 1 would be solved by the proposed annotation syntax. Problems 3 and 7, especially blank nodes split over multiple graphs when breaking a big graph into files of a more manageable size, are not specific to reification but a general problem.

Agreed - it is mentioned because the "Turtle syntax only" approach 
encourages blank nodes - which is probably a good choice for annotations.

7 is not just about blank nodes - it's dealing with finding groups of 
rdf:subject/rdf:predicate/rdf:object which is relevant for the 
optimization discussions.
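
To make the "finding groups" cost concrete, a small Python sketch. It 
assumes, purely for illustration, that a usable group is one reifier with 
exactly one value each for rdf:subject, rdf:predicate and rdf:object; the 
function names and example IRIs are hypothetical. Because the three triples 
of a group can be anywhere in the input, the sketch has to pass over every 
triple before it can say which groups are complete.

```python
from collections import defaultdict

RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
REIF_PROPS = (RDF + "subject", RDF + "predicate", RDF + "object")


def reification_groups(triples):
    """Collect rdf:subject/rdf:predicate/rdf:object values per reifier."""
    groups = defaultdict(lambda: defaultdict(list))
    for s, p, o in triples:
        if p in REIF_PROPS:
            groups[s][p].append(o)
    return groups


def complete_groups(triples):
    """Return {reifier: (s, p, o)} for groups with exactly one value per property.

    "Exactly one" is an assumption of this sketch, not a definition
    agreed by the WG.
    """
    result = {}
    for reifier, props in reification_groups(triples).items():
        if all(len(props.get(q, [])) == 1 for q in REIF_PROPS):
            result[reifier] = tuple(props[q][0] for q in REIF_PROPS)
    return result


# The parts of _:r1 are interleaved with other triples; _:r2 has no rdf:object.
example = [
    ("_:r1", RDF + "subject", "<http://example/s>"),
    ("_:r2", RDF + "subject", "<http://example/s>"),
    ("<http://example/x>", "<http://example/q>", "<http://example/y>"),
    ("_:r1", RDF + "object", "<http://example/o>"),
    ("_:r2", RDF + "predicate", "<http://example/p>"),
    ("_:r1", RDF + "predicate", "<http://example/p>"),
]

print(complete_groups(example))
```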

3 is the visual aspect within one large graph document.

N-Triples does not preserve proximity of input. It very much depends on 
the indexing.

> That leaves problem 2, verbosity in N-Triples, and that just comes with the terrain. There sometimes are more or less verbose ways to represent a complex type in straight triples - RDF Collections are much worse than RDF Containers - but RDF standard reification is not too bad in that respect.
> You mention a reification atom <<(s p o)>> as a possible addition to N-Triples. That seems like a slight variation of N-Triples-Star to me,

Sort of - it is similar at the abstract syntax level, but it's not the same 
semantics. Or if it is, that's by chance or by keeping close to reification.

There are several implementations of RDF-star-CG so it does suggest that 
a new term has some acceptance.

> and I’m not fundamentally opposed if it helps and doesn’t rely on a new term type. My question is: does it really help? And would you also add list atoms, CBD atoms, graph atoms?

Would I *like* list terms - yes! - but that's out of charter :-(

In the abstract syntax, the approach does leave open (does not block) 
graph terms. I think they bring additional challenges around entailment 
and "graph reification" that will take a long time to explore.

     Andy

> 
> Thomas
> 
> 
>>> Optimization is not just storage space (and the choices there change over the space of a few years at the moment) - it's also preserving the outcome of queries.
>>> What does SELECT (count(*) AS ?C) { ?s ?p ?o } return?
>>> or any query with a ?p.
> 
>>>      Andy
>>>>
>>>> I just realized that saying *at least* makes an implicit assumption about different terms in object position referring to the same entity in the realm of interpretation, i.e. a kind of owl:sameAs-ness. That may be way beyond what we want to fix, and insofar saying *exactly* might be the safer and more restrained definition.
>>>> Still it introduces a hint of opacity that I’m not happy with.
>>>>
>>>> Thomas
>>>>
>>>>> peter
>>>>>
>>>>
>>>>
>>
>
Received on Thursday, 25 January 2024 13:42:17 UTC