Re: N-triple size from Thomas Lörtsch on 2024-01-26 (public-rdf-star-wg@w3.org from January 2024)

From: Thomas Lörtsch <tl@rat.io>
Date: Fri, 26 Jan 2024 12:51:54 +0100
To: Andy Seaborne <andy@apache.org>
Cc: RDF-star Working Group <public-rdf-star-wg@w3.org>
Message-Id: <668694AD-8993-41A7-A9D6-8B13A27D1ADB@rat.io>
> On 26. Jan 2024, at 12:17, Andy Seaborne <andy@apache.org> wrote:
> 
> 
> 
> On 25/01/2024 14:29, Thomas Lörtsch wrote:
>>> On 25. Jan 2024, at 14:42, Andy Seaborne <andy@apache.org> wrote:
>>> 
>>> 
>>> 
>>> On 25/01/2024 12:27, Thomas Lörtsch wrote:
>>>>> On 25. Jan 2024, at 12:22, Peter F. Patel-Schneider <pfpschneider@gmail.com> wrote:
>>>>> On 1/25/24 06:08, Andy Seaborne wrote:
>>>>>>> But what does it *mean*? Optimizations should only be applied after we know that it means what we want it to mean.
>>>>>> Agreed.
>>>>>> We can start with our goals. "bloat" has been used in two senses : "visual bloat" and "size bloat".
>>>> You’re forgetting "term bloat"…
>>>>>> Is the WG addressing the size bloat issue?
>>>> IMO it can’t be addressed in N-Triples, as N-Triples is by definition a strictly triple-based serialization, with pretty atomic terms.
>>> 
>>> A bit more complicated than that, especially for annotation usage.
>>> 
>>> Both in-memory and persistent storage systems largely share RDF terms that occur more than once in a graph, either as pointers or as "node ids", which are at most 16 bytes, often 8. The term is not the whole of the string, and "rdf:subject" does not need a repeated use of <http://www.w3.org/1999/02/22-rdf-syntax-ns#subject>.
>>> 
>>> The new term can be about the size of whatever a triple is which, with existing approaches of term dictionaries and compression, is quite small.
>> The topic was serialization to N-Triples, which is argued to be a very important aspect because of machine readability, stream reasoning, etc. You now answer with aspects of storage, but that is completely orthogonal.

And now you change the subject of a new thread on streaming and continue a discussion from another thread. That does not help in discussing issues w.r.t. streaming, especially since you have quite often argued that streaming is an important reason why the RDF model needs to be extended with a new term type. In the mail that now has a good chance of being buried under this new topic, I explain why I doubt that. I will therefore resurrect that thread on streaming in a new mail, which adds cognitive load for everybody.

> The size bloat issue is about number of triples. Point 2.

There is no point 2 in this mail. The interested reader may consult https://lists.w3.org/Archives/Public/public-rdf-star-wg/2024Jan/0159.html or https://lists.w3.org/Archives/Public/public-rdf-star-wg/2024Jan/0165.html

> N-triples files exist for a purpose - to publish data in an easy to consume machine readable format.

Exactly, and that is why triple bloat for involved constructs like reification may be considered a necessary burden. Besides, there’s still the option to include triple terms as syntactic sugar in a to-be-defined N-Triples-terms syntax.

> The receiver stores the data in a triple store so they don't have to parse it every time they want to use the data.

Right, and the receiver is free to optimize storage in any way it sees fit, especially since the discussed notion of wellformedness would allow it to indicate that it doesn’t support malformed reifications.
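
To make the storage side concrete, here is a minimal sketch of the kind of term dictionary mentioned in the quoted text, where each distinct RDF term is stored once and triples only hold small integer "node ids". The class name, method names and example IRIs are hypothetical and only serve illustration; real triple stores will differ in many details.

```python
class TermDictionary:
    """Maps each distinct RDF term string to a small integer id (hypothetical helper)."""

    def __init__(self):
        self._ids = {}     # term string -> node id
        self._terms = []   # node id -> term string

    def encode(self, term):
        # Assign a fresh id only the first time a term is seen.
        if term not in self._ids:
            self._ids[term] = len(self._terms)
            self._terms.append(term)
        return self._ids[term]

    def decode(self, node_id):
        return self._terms[node_id]


RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"

# Hypothetical example data: two reifications sharing the same property IRIs.
triples = [
    ("<http://example.org/r1>", f"<{RDF}subject>", "<http://example.org/s>"),
    ("<http://example.org/r1>", f"<{RDF}predicate>", "<http://example.org/p>"),
    ("<http://example.org/r2>", f"<{RDF}subject>", "<http://example.org/s>"),
]

dictionary = TermDictionary()
encoded = [tuple(dictionary.encode(t) for t in triple) for triple in triples]

# The long rdf:subject IRI is stored once; the triples only repeat its id.
print(encoded)
```

How a store encodes terms internally is independent of how many triples the N-Triples file contains.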


> On N-triples format:
> 
> I don't know what "term bloat" is specifically referring to

It is referring to the introduction of new terms like the embedded/quoted/triple/descriptor term into the RDF model.

> but maybe the size of the descriptor term.

No, not my area of expertise.

> Compare like-for-like:
> 
> Yes, there is a large term in the document. (You can argue it's 3 terms in one term slot because it is a compound.)
> 
> The rdf:occurenceOf triple is one triple.
> 
> It replaces 3 triples in the "syntax" proposal and 4 in the "syntax+" proposal. 3 terms vs 9/13 terms used.
> 
> In terms of byte size, it is smaller than total bytes of the three reification triples - the URI for rdf:subject etc isn't written in the file.

That may well be, but it is an optimization that, as explained above, triple stores can decide for themselves whether or not to implement.
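
For readers who want to check the like-for-like comparison from the quoted text, the following is a rough sketch that serializes one statement both ways and compares line count and byte size. The example IRIs are made up, the "<< ... >>" delimiters are an assumed placeholder for a triple-term syntax that is still to be defined, and rdf:occurenceOf is spelled as in the quoted mail.

```python
RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"

# Hypothetical example terms; none of these IRIs come from the thread.
s = "<http://example.org/s>"
p = "<http://example.org/p>"
o = "<http://example.org/o>"
r = "<http://example.org/r>"

# Three reification-style triples that spell out subject, predicate and object.
reified = [
    f"{r} <{RDF}subject> {s} .",
    f"{r} <{RDF}predicate> {p} .",
    f"{r} <{RDF}object> {o} .",
]

# One triple whose object is a triple term (assumed placeholder syntax).
triple_term = [f"{r} <{RDF}occurenceOf> << {s} {p} {o} >> ."]


def size(lines):
    """Return (number of triples, bytes on disk including one newline per line)."""
    return len(lines), sum(len(line.encode("utf-8")) + 1 for line in lines)


print("reified:    ", size(reified))
print("triple term:", size(triple_term))
```

With these example terms the single triple comes out smaller in bytes, which matches the quoted claim; the ratio will of course vary with the length of the IRIs involved.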

Thomas

>    Andy
>
Received on Friday, 26 January 2024 11:52:05 UTC