Re: No definition of line terminator for canonical N-Triples

> On Nov 23, 2022, at 3:18 PM, Gregory Williams <greg@evilfunhouse.com> wrote:
> 
> On Nov 23, 2022, at 2:00 PM, Gregg Kellogg <gregg@greggkellogg.net> wrote:
>> 
>> The RDF Dataset Canonicalization Working Group makes use of an inferred Canonical representation for N-Quads. Such a form is not actually defined in the N-Quads spec, but it can be inferred from the Canonical representation of N-Triples [1].
>> 
>> However, while the whitespace between components of a triple (subject, predicate, object, and terminating ‘.’) is defined, there is no discussion of the whitespace separating triples in a document. This is in line with the grammar production for triple [2], but there is no definition of what comprises an ntriplesDoc [3] in canonical form. The suggested change is to add that each triple MUST be terminated by a single newline (U+000A). Without such a change, a document with several forms of EOL token could be used and still be considered canonical, where EOL is [#xD#xA]+.
> 
> Good catch. I think the language around whitespace handling for canonical N-Triples could also be improved with respect to the EOL production:
> 
>> * The whitespace following subject, predicate, and object must be a single space, (U+0020). All other locations that allow whitespace must be empty.
> 
> The EOL production contains just whitespace, but clearly cannot be subject to this “must be empty” clause.

The current C14N text seems limited to the “triple” production [1], which does not include EOL. Of course, that’s part of the issue I raise: it _should_ also consider C14N for “ntriplesDoc” [2]. The Canonical N-Triples discussion should note that it is limited to the “triple” production, and add a further restriction on the “ntriplesDoc” production that limits EOL to just U+000A.
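
To make the proposed restriction concrete, here is a rough sketch in Python of what a checker for it might look like. The function name is hypothetical, and it assumes that a canonical document contains no blank lines or comments and that the last triple is also terminated by U+000A; none of that is stated in the spec today.

```python
def has_canonical_eol(doc: str) -> bool:
    """Check the proposed rule: every triple ends in exactly one U+000A.

    Assumption (not in the current spec): no blank lines, no comments,
    and the final triple is newline-terminated as well.
    """
    if doc == "":
        return True  # an empty document has no triples to terminate
    if "\r" in doc:
        # A raw U+000D can only come from EOL; inside literals it must be
        # written as the ECHAR \r, so any raw occurrence is non-canonical.
        return False
    if not doc.endswith("\n"):
        return False
    # Splitting on U+000A leaves one empty trailing piece. Any other empty
    # piece means EOL matched more than one newline character.
    lines = doc.split("\n")[:-1]
    return all(line != "" for line in lines)
```

This only looks at line terminators; it does not parse the triples themselves or check the whitespace within them.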

> Also on the topic of Canonical N-Triples, I have two questions:
> 
> 1. The spec says:
> 
>> HEX must use only uppercase letters ([A-F])
> 
> but as far as I can tell the HEX production is only used by UCHAR, and the spec also says:
> 
>> Characters must not be represented by UCHAR.
> 
> 
> Is this HEX requirement simply redundant?

Yes, it seems so; that restriction is meaningless.

> 2. The spec says:
> 
>> Within STRING_LITERAL_QUOTE, only the characters U+0022, U+005C, U+000A, U+000D are encoded using ECHAR. ECHAR must not be used for characters that are allowed directly in STRING_LITERAL_QUOTE.
> 
> 
> Does this really mean that control characters must be written directly without escaping or encoding (e.g. NULL, BELL, BACKSPACE, etc.)? While their use probably isn’t common in N-Triples documents, the idea of a canonical representation requiring these to be written directly strikes me as ill-advised, as it makes handling of this data more difficult (e.g. having to carefully handle NULL characters vs. NULL terminators, not being able to copy-paste data containing unprintable control characters, etc.).

It’s consistent, but many systems probably can’t represent NULL without escapes, or even natively represent NULL within a string, which would make literals containing NULL problematic. Boundary tests, whether for C14N or not, are clearly missing from the N-Triples test suite. Turtle has more character boundary tests, but it has no C14N limitations and those tests use UCHAR.

IMO, changing this would be more disruptive to implementations than the whitespace changes for C14N documents and not something that can be addressed easily in an erratum.
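
For illustration, a literal serializer that follows the quoted rule as written might look like the sketch below. The names are hypothetical and this is only my reading of the spec text: just U+0022, U+005C, U+000A and U+000D are written with ECHAR, and every other character, control characters included, is emitted directly.

```python
# ECHAR forms for the four characters the canonical rule names.
_ECHAR = {
    '"': '\\"',    # U+0022
    '\\': '\\\\',  # U+005C
    '\n': '\\n',   # U+000A
    '\r': '\\r',   # U+000D
}

def escape_canonical_literal(value: str) -> str:
    """Serialize a lexical form as a STRING_LITERAL_QUOTE per the quoted rule.

    Assumption: no UCHAR is ever produced, so characters such as NULL or TAB
    pass through unescaped, which is exactly the behavior under discussion.
    """
    return '"' + "".join(_ECHAR.get(ch, ch) for ch in value) + '"'
```

Passing a string that contains U+0000 returns a literal with the raw NULL embedded between the quotes, which shows why the rule is awkward for systems that treat NULL as a terminator.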

> Thanks,
> Greg

Gregg

[1] https://www.w3.org/TR/n-triples/#grammar-production-triple
[2] https://www.w3.org/TR/n-triples/#grammar-production-tntriplesdoc

Received on Thursday, 24 November 2022 21:19:01 UTC