Re: No definition of line terminator for canonical N-Triples from Gregory Williams on 2022-11-23 (public-rdf-comments@w3.org from November 2022)

From: Gregory Williams <greg@evilfunhouse.com>
Date: Wed, 23 Nov 2022 15:18:26 -0800
To: Gregg Kellogg <gregg@greggkellogg.net>
Cc: public-rdf-comments@w3.org
Message-Id: <2A8D0EB8-AC76-442D-8DAF-F132D07AA311@evilfunhouse.com>

On Nov 23, 2022, at 2:00 PM, Gregg Kellogg <gregg@greggkellogg.net> wrote:
> 
> The RDF Dataset Canonicalization Working Group makes use of an inferred Canonical representation for N-Quads. Such a form is not actually defined in the N-Quads spec, but it can be inferred from the Canonical representation of N-Triples [1].
> 
> However, while the whitespace between components of a triple (subject, predicate, object, and terminating ‘.’) is defined, there is no discussion of whitespace separating triples in a document. This is in line with the grammar production for triple [2], but there is no definition of what comprises an ntriplesDoc [3] in canonical form. The suggested change is to add that each triple MUST be terminated by a single newline (U+000A). Without such a change, a document with several forms of EOL token could be used and still be considered canonical, where EOL is [#xD#xA]+.

Good catch. I think the language around whitespace handling for canonical N-Triples could also be improved with respect to the EOL production:

> * The whitespace following subject, predicate, and object must be a single space, (U+0020). All other locations that allow whitespace must be empty.

The EOL production matches only whitespace (carriage returns and line feeds), but clearly cannot be subject to this “must be empty” clause, since triples would then have nothing separating them.
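
The rule under discussion (a single U+000A after each triple, with CR never used as a line terminator) can be sketched as a small check. This is only an illustration under assumptions: the function name `has_canonical_line_endings` is hypothetical, the rule is the one proposed in the quoted message rather than anything in the current spec, and the check looks only at line terminators, not at the triples themselves.

```python
def has_canonical_line_endings(doc: str) -> bool:
    """Return True if every line ends with exactly one LF (U+000A).

    Hypothetical helper: it encodes the *proposed* rule, not spec text.
    A raw CR or LF can never appear inside an N-Triples term (both must
    be escaped in string literals), so any raw CR or LF in the document
    belongs to an EOL token.
    """
    if not doc:
        return True  # an empty document has no triples to terminate
    if "\r" in doc:
        return False  # CR or CRLF terminators are not the single-LF form
    if not doc.endswith("\n"):
        return False  # the last triple is missing its terminator
    # Drop the final LF, then make sure no line is empty, which would
    # mean two or more consecutive LFs formed a single EOL token.
    return all(line != "" for line in doc[:-1].split("\n"))
```

Under this check, a document whose triples are separated by `\r\n` or by blank lines would be rejected even though the EOL production accepts both.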

Also on the topic of Canonical N-Triples, I have two questions:

1. The spec says:

> HEX must use only uppercase letters ([A-F])

but as far as I can tell the HEX production is only used by UCHAR, and the spec also says:

> Characters must not be represented by UCHAR.

Is this HEX requirement simply redundant?

2. The spec says:

> Within STRING_LITERAL_QUOTE, only the characters U+0022, U+005C, U+000A, U+000D are encoded using ECHAR. ECHAR must not be used for characters that are allowed directly in STRING_LITERAL_QUOTE.

Does this really mean that control characters (e.g. NULL, BELL, BACKSPACE) must be written directly, without escaping or encoding? While their use probably isn’t common in N-Triples documents, the idea of a canonical representation requiring these to be written directly strikes me as ill-advised, as it makes handling of this data more difficult (e.g. having to carefully distinguish NULL characters from NULL terminators, or not being able to copy-paste data containing unprintable control characters).
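
To make the concern concrete, here is a rough sketch of a serializer that follows the literal reading of the quoted rule. It is an assumption-laden illustration, not an implementation from the spec or from any library: the names `_ECHAR` and `canonical_string_literal` are hypothetical, and the sketch covers only the quoted lexical form of a literal, without any datatype or language tag.

```python
# The four characters that the quoted rule says are encoded using ECHAR:
# U+0022 ("), U+005C (\), U+000A (LF) and U+000D (CR).
_ECHAR = {
    '"': '\\"',
    "\\": "\\\\",
    "\n": "\\n",
    "\r": "\\r",
}


def canonical_string_literal(value: str) -> str:
    """Serialize `value` as a STRING_LITERAL_QUOTE under the literal reading.

    Hypothetical helper: every character outside the four above is copied
    through unchanged, including control characters such as NULL (U+0000)
    and BELL (U+0007), because UCHAR is disallowed in canonical form.
    """
    return '"' + "".join(_ECHAR.get(ch, ch) for ch in value) + '"'
```

With this reading, a value containing U+0000 produces output that itself contains a raw U+0000 byte, which is the handling difficulty described above.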

Thanks,
Greg

Received on Wednesday, 23 November 2022 23:18:41 UTC