Re: Against the notion of reification well-formed graph (i.e., atomicity) from Thomas Lörtsch on 2024-01-25 (public-rdf-star-wg@w3.org from January 2024)

From: Thomas Lörtsch <tl@rat.io>
Date: Thu, 25 Jan 2024 15:29:37 +0100
To: Andy Seaborne <andy@apache.org>
Cc: RDF-star Working Group <public-rdf-star-wg@w3.org>
Message-Id: <0E948231-239A-4536-895A-4B98453C7831@rat.io>
> On 25. Jan 2024, at 14:42, Andy Seaborne <andy@apache.org> wrote:
> 
> 
> 
> On 25/01/2024 12:27, Thomas Lörtsch wrote:
>>> On 25. Jan 2024, at 12:22, Peter F. Patel-Schneider <pfpschneider@gmail.com> wrote:
>>> On 1/25/24 06:08, Andy Seaborne wrote:
>>>>> But what does it *mean*? Optimizations should only be applied after we know that it means what we want it to mean.
>>>> Agreed.
>>>> We can start with our goals. "bloat" has been used in two senses : "visual bloat" and "size bloat".
>> You’re forgetting "term bloat"…
>>>> Is the WG addressing the size bloat issue?
>> IMO it can’t be addressed in N-Triples, as N-Triples is by definition a strictly triple-based serialization, with pretty atomic terms
> 
> A bit more complicated than that, especially for annotation usage.
> 
> Both in-memory and persistent storage systems largely share RDF terms that occur more than once in a graph, either as pointers or "node ids" which are at most 16 bytes, often 8. The term is not the whole of the strings and "rdf:subject" does not need to be a repeated use of <http://www.w3.org/1999/02/22-rdf-syntax-ns#subject>.
> 
> The new term can be about the size of whatever a triple is which, with existing approaches of term dictionaries and compression, is quite small.

The topic was serialization to N-Triples, which is argued to be a very important aspect because of machine readability, stream reasoning, etc. You now answer with aspects of storage, but that is completely orthogonal.
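
To make the distinction concrete, here is a minimal sketch of the two views of the same data. The class name TermDictionary and the example IRIs are made up for illustration and do not reflect any particular store's layout. A store can intern each term once and refer to it by a small id, while an N-Triples document still spells out every term in full on every line.

```python
class TermDictionary:
    """Hypothetical term dictionary: maps each distinct RDF term to a small integer id."""

    def __init__(self):
        self._ids = {}
        self._terms = []

    def intern(self, term):
        # Store the term string once; later occurrences reuse the same id.
        if term not in self._ids:
            self._ids[term] = len(self._terms)
            self._terms.append(term)
        return self._ids[term]

    def lookup(self, term_id):
        return self._terms[term_id]


RDF_SUBJECT = "<http://www.w3.org/1999/02/22-rdf-syntax-ns#subject>"

# Three reifiers pointing at the same (made-up) subject.
triples = [("_:r%d" % i, RDF_SUBJECT, "<http://example.org/s>") for i in range(3)]

# Storage view: each triple becomes a tuple of ids, terms are stored once.
dictionary = TermDictionary()
encoded = [tuple(dictionary.intern(t) for t in triple) for triple in triples]

# Serialization view: N-Triples repeats the full IRI on every line.
ntriples = "".join("%s %s %s .\n" % triple for triple in triples)
```

In this sketch the full rdf:subject IRI appears once in the dictionary but three times in the serialized text, which is the size issue that remains on the N-Triples side regardless of how a store encodes terms.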

>> (language tags to literals being the acceptable exception). Its whole purpose is to ease processing by eliminating shortcuts. Everything you add to that - especially new term types that combine already defined atomic term types into more complex term types, e.g. triple terms - breaks this simplicity and straightforwardness.
>> B.t.w. "streaming" is an argument that has been brought forward a lot. Can you point me to any halfway concise treatise of the problems and practices of streaming RDF data? I’d like to understand how solving problems with reification by means of a triple term would relate to other issues. Would it be a decisive breakthrough, or more like a drop in the bucket?
>> Because my hunch is that it’s rather the latter. And where does it end? What about list terms? What about CBD terms? Or even graph terms?
>> I’ve been peeking into your draft proposal that you mentioned to Felix the other day, at
>> https://github.com/afs/rdf-star-notes/blob/main/reif-atoms.md
>> You give a list of 7 problems with RDF reification. Some of them (problems 4, 5 and 6) would be handled by a notion of wellformedness.
> 
> And it does not need to be checked across the whole graph.

"It" being the triple type|instance term I assume?
What is it that has to be checked across the whole graph?

I would imagine a (streaming) parser with an "only-accept-wellformed-reification" switch turned "on" that collects reification statements (e.g. a triple with predicate rdf:predicate) in a special data structure, writes completed reification quads (modulo the rdf:Statement declaration, probably) into the store until the end of file (or, in a streaming context, a certain timeout) is reached, and then drops the remaining incomplete reifications.

What about this would not solve the problem, or be itself problematic?
It doesn’t need to check a whole graph. Parsing has to be done once anyway. After that it’s the index that does the work. 
And, again: the new syntactic sugar should make incomplete reifications a pretty rare case.
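
A rough sketch of such a filter, under stated assumptions: triples arrive as plain (subject, predicate, object) string tuples, the function name filter_wellformed is made up, the rdf:type rdf:Statement triple is passed through like any other triple, and "end of file" is simply the end of the iterator. A timeout for the streaming case is left out, and a repeated rdf:subject on the same reifier just overwrites the earlier value, which a real wellformedness check would probably reject.

```python
from collections import defaultdict

RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
REIF_PREDICATES = {RDF + "subject", RDF + "predicate", RDF + "object"}


def filter_wellformed(triples):
    """Pass ordinary triples through immediately, emit a reification once
    its subject, predicate and object triples have all been seen, and
    silently drop reifications that are still incomplete at end of input."""
    pending = defaultdict(dict)  # reifier -> {reification predicate: value}
    for s, p, o in triples:
        if p not in REIF_PREDICATES:
            yield (s, p, o)
            continue
        parts = pending[s]
        parts[p] = o
        if len(parts) == 3:
            # Complete: hand the three triples on and forget the reifier.
            for reif_predicate in sorted(parts):
                yield (s, reif_predicate, parts[reif_predicate])
            del pending[s]
    # End of input: whatever is left in `pending` is incomplete and dropped.


stream = [
    ("_:r1", RDF + "subject", "ex:s"),
    ("ex:a", "ex:b", "ex:c"),
    ("_:r2", RDF + "predicate", "ex:p"),  # never completed
    ("_:r1", RDF + "predicate", "ex:p"),
    ("_:r1", RDF + "object", "ex:o"),
]
accepted = list(filter_wellformed(stream))
```

The only state kept is the set of reifications that are still open, so the whole graph never has to be held or rechecked.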

> It works with RDF merge.

What doesn’t?

>> Problem 1 would be solved by the proposed annotation syntax. Problems 3 and 7, especially blank nodes split over multiple graphs when breaking a big graph into files of a more manageable size, are not specific to reification but a general problem.
> 
> Agreed - it is mentioned because the "Turtle syntax only" approach encourages blank nodes - which is probably a good choice for annotations.
> 
> 7 is not just about blank nodes - it's dealing with finding groups of rdf:subject/rdf:predicate/rdf:object which is relevant for the optimization discussions.

You are again conflating serialization and implementation. We are not defining implementation and optimization techniques. As the example of Virtuoso shows, reification can be very performant, and it’s not too hard to figure out how.

> 3 is the visual aspect within one large graph document.
> 
> N-Triples does not preserve proximity of input. It very much depends on the indexing.

And N-Triples is neither optimized nor even designed for human consumption. Turtle is, and here the proposed annotation syntax provides a solution.

>> That leaves problem 2, verbosity in N-Triples, and that just comes with the terrain. There sometimes are more or less verbose ways to represent a complex type in straight triples - RDF Collections are much worse than RDF Containers - but RDF standard reification is not too bad in that respect.
>> You mention a reification atom <<(s p o)>> as a possible addition to N-Triples. That seems like a slight variation of N-Triples-Star to me,
> 
> Sort of - it is similar at the abstract syntax level, but it's not the same semantics. Or if it is, that's by chance or by keeping close to reification.
> 
> There are several implementations of RDF-star-CG so it does suggest that a new term has some acceptance.

You’re again mixing topics. Nobody argues that there is not some acceptance of a new term type. But is it a huge acceptance? Did all implementors like what they found? Did this WG find the CG proposal unproblematic and good to go? Or were there problems?

>> and I’m not fundamentally opposed if it helps and doesn’t rely on a new term type. My question is: does it really help? And would you also add list atoms, CBD atoms, graph atoms?
> 
> Would I *like* list terms - yes! -

And I would like graph terms, and those would naturally encompass triple terms and CBDs and even lists. So no term bloat!

> but that's out of charter :-(

Sure, it is. But we could think it through and then propose a new charter.

> In the abstract syntax, the approach does leave open (does not block) graph terms. I think they bring additional challenges around entailment and "graph reification" that will take a long time to explore.

Maybe a long time, maybe not so long. Anyway, until then I can live without a new term, but with some syntactic sugar and some thrust towards wellformedness.

>    Andy
> 
>> Thomas
>>>> Optimization is not just storage space (and the choices there change over the space of a few years at the moment) - it's also preserving the outcome of queries.
>>>> What does SELECT (count(*) AS ?C) { ?s ?p ?o } return?
>>>> or any query with a ?p.
>>>>     Andy
>>>>> 
>>>>> I just realized that saying *at least* makes an implicit assumption about different terms in object position referring to the same entity in the realm of interpretation, i.e. a kind of owl:sameAs-ness. That may be way beyond what we want to fix, and insofar saying *exactly* might be the safer and more restrained definition.
>>>>> Still it introduces a hint of opacity that I’m not happy with.
>>>>> 
>>>>> Thomas
>>>>> 
>>>>>> peter
>>>>>> 
>>>>> 
>>>>> 
>>> 
>
Received on Thursday, 25 January 2024 14:29:47 UTC