N-triple size from Andy Seaborne on 2024-01-26 (public-rdf-star-wg@w3.org from January 2024)

From: Andy Seaborne <andy@apache.org>
Date: Fri, 26 Jan 2024 11:17:44 +0000
To: Thomas Lörtsch <tl@rat.io>
Cc: RDF-star Working Group <public-rdf-star-wg@w3.org>
Message-ID: <acb0f2c3-56b6-494d-ab1c-38be05586471@apache.org>

On 25/01/2024 14:29, Thomas Lörtsch wrote:
> 
> 
>> On 25. Jan 2024, at 14:42, Andy Seaborne <andy@apache.org> wrote:
>>
>>
>>
>> On 25/01/2024 12:27, Thomas Lörtsch wrote:
>>>> On 25. Jan 2024, at 12:22, Peter F. Patel-Schneider <pfpschneider@gmail.com> wrote:
>>>> On 1/25/24 06:08, Andy Seaborne wrote:
>>>>>> But what does it *mean*? Optimizations should only be applied after we know that it means what we want it to mean.
>>>>> Agreed.
>>>>> We can start with our goals. "bloat" has been used in two senses : "visual bloat" and "size bloat".
>>> You’re forgetting "term bloat"…
>>>>> Is the WG addressing the size bloat issue?
>>> IMO it can’t be addressed in N-Triples, as N-Triples is by definition a strictly triple-based serialization, with pretty atomic terms.
>>
>> A bit more complicated than that, especially for annotation usage.
>>
>> Both in-memory and persistent storage systems largely share RDF terms that occur more than once in a graph, either as pointers or as "node ids", which are at most 16 bytes, often 8. The term is not the whole of the string, and "rdf:subject" does not need to be a repeated use of <http://www.w3.org/1999/02/22-rdf-syntax-ns#subject>.
>>
>> The new term can be about the size of whatever a triple is, which, with existing approaches of term dictionaries and compression, is quite small.
> 
> The topic was serialization to N-Triples, which is argued to be a very important aspect because of machine readability, stream reasoning, etc. You now answer with aspects of storage, but that is completely orthogonal.

The size bloat issue is about the number of triples. Point 2.

N-Triples files exist for a purpose - to publish data in an 
easy-to-consume, machine-readable format. The receiver stores the data 
in a triple store so they don't have to parse it every time they want 
to use the data.

On the N-Triples format:

I don't know what "term bloat" specifically refers to, but it may be 
the size of the descriptor term.

Compare like-for-like:

Yes, there is a large term in the document. (You can argue it's 3 terms 
in one term slot because it is a compound.)

The rdf:occurenceOf triple is one triple.

It replaces 3 triples in the "syntax" proposal and 4 in the "syntax+" 
proposal. 3 terms vs 9/13 terms used.

In terms of byte size, it is smaller than the total bytes of the three 
reification triples - the URIs for rdf:subject etc. aren't written in 
the file.
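
A rough way to check the byte-size point is to write out both forms and 
measure them. This is only a sketch: the example.org IRIs, the blank 
node label, the `<<( ... )>>` spelling of the triple term and the full 
IRI assumed for rdf:occurenceOf are all illustrative assumptions, not 
agreed syntax, and only the three-triple reification form is shown.

```python
# Sketch: compare one triple-term triple against three reification triples.
# All IRIs, the blank node label and the <<( ... )>> spelling are assumptions.
RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"

S = "<http://example.org/s>"   # hypothetical subject
P = "<http://example.org/p>"   # hypothetical predicate
O = "<http://example.org/o>"   # hypothetical object
OCC = "_:b0"                   # hypothetical occurrence node

# One triple; its object slot holds the compound (triple) term.
triple_term_form = [
    f"{OCC} <{RDF}occurenceOf> <<( {S} {P} {O} )>> .",
]

# Classic reification: three triples, each spelling out an rdf: IRI.
reification_form = [
    f"{OCC} <{RDF}subject> {S} .",
    f"{OCC} <{RDF}predicate> {P} .",
    f"{OCC} <{RDF}object> {O} .",
]


def byte_size(lines):
    """Total UTF-8 bytes, counting one newline per line."""
    return sum(len(line.encode("utf-8")) + 1 for line in lines)


def term_slots(lines):
    """Each N-Triples line has three term slots (subject, predicate, object)."""
    return 3 * len(lines)


if __name__ == "__main__":
    for name, lines in [("triple term", triple_term_form),
                        ("reification", reification_form)]:
        print(name, term_slots(lines), "term slots,", byte_size(lines), "bytes")
```

The exact numbers depend on the IRIs chosen; the point is only that the 
reification form repeats the occurrence node and an rdf: IRI on every 
line.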

     Andy

Received on Friday, 26 January 2024 11:17:52 UTC