Re: Escaped characters in RDF-1.1 N-Triples literals for Canonical documents from Peter Ansell on 2013-11-18 (public-rdf-comments@w3.org from November 2013)

From: Peter Ansell <ansell.peter@gmail.com>
Date: Tue, 19 Nov 2013 10:27:33 +1100
To: Andy Seaborne <andy@apache.org>
Cc: "public-rdf-comments@w3.org" <public-rdf-comments@w3.org>, michel dumontier <michel.dumontier@gmail.com>
Message-ID: <CAGYFOCQ98amjXBxm+GXPNkHL7VXEoy9=KvmaGng6uqJ3NcRM0Q@mail.gmail.com>
On 18 November 2013 22:09, Andy Seaborne <andy@apache.org> wrote:
> On 17/11/13 22:50, Peter Ansell wrote:
>>
>> The Conformance section (Section 4) of the RDF-1.1 N-Triples Candidate
>> Recommendation (05 November 2013) specifies that for a canonical
>> document [1] :
>>
>>      "Characters not allowed directly in STRING_LITERAL_QUOTE (U+0022,
>> U+005C, U+000A, U+000D) MUST use ECHAR not UCHAR. "
>>
>> However, the escape sequences in ECHAR do not seem to include U+005C "\"
>> [2]:
>>
>>      [153s] ECHAR ::= '\' [tbnrf"']
>>
>> That is, ECHAR defines escapes for \t \b \n \r \f \" \' , but it
>> doesn't appear that \\ is allowed for in that grammar. It could be
>> escaped using UCHAR as \u005C, but that seems to violate the canonical
>> rule that specifically mentions it.
>>
>> In addition, is it intentional that the list of characters mentioned
>> in the canonical section [1] does not include all of the characters
>> with escapes defined in ECHAR [2]? Should the characters that appear
>> in ECHAR [2] but not in the list in [1] be escaped using UCHAR in
>> Canonical documents or be represented using their raw UTF-8 values?
>>
>> Cheers,
>>
>> Peter
>>
>> [1] http://www.w3.org/TR/2013/CR-n-triples-20131105/#conformance
>> [2]
>> http://www.w3.org/TR/2013/CR-n-triples-20131105/#grammar-production-ECHAR
>>
>
> Hi Peter,
>
> Thanks for pointing that out.  It looks like a systematic bug in the
> tool chain that we failed to squash.
>
> I've recorded it on the WG comments:
>
> http://www.w3.org/2011/rdf-wg/wiki/CR_Comments
>
> This is not a formal response to your comment.
>
> I have fixed the documents (which is all subject to WG approval) as follows
> and if you are satisfied, please do send an early confirmation of dealing
> with your comment to your satisfaction.

The removal of the single quote from ECHAR for N-Triples and N-Quads
again complicates matters a little, as it requires going backwards,
essentially, to what the previous specifications had. The reason I was
looking into the grammar yesterday was to make Sesame able to parse
RDF-1.1 N-Triples Candidate Recommendation documents that include
escaped single quotes \' , as Michel Dumontier had already started
escaping single quotes for upcoming Bio2RDF N-Triples and N-Quads data
dumps based on the respective Candidate Recommendations. The previous
Sesame parsers failed because they adhered fairly strictly to the RDF
Test Cases specification, where there was a single way to represent
each character. I made the change in Git to allow future Sesame
releases to parse N-Triples documents that use \'.

However, I am not sure now whether I should remove that support before
it appears in a Sesame release and is relied on by users, given that
it will not appear in the next version--and the final version--of the
specification. The alternative is to examine whether it is simpler to
keep support for \' in the grammar as a useful addition for
compatibility with Turtle/TriG/SPARQL, even though it is not strictly
necessary given that only double quotes are used for surrounding
literals in N-Triples/N-Quads.
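
To make the trade-off concrete, here is a minimal Python sketch of the
lenient option. It is not Sesame's code: the names unescape_literal and
allow_single_quote are hypothetical, and the strict table simply
follows the revised N-Triples/N-Quads ECHAR proposed in Andy's reply.

```python
import re

# Letters allowed after '\' in the revised N-Triples/N-Quads ECHAR.
STRICT_ECHAR = {
    "t": "\t", "b": "\b", "n": "\n", "r": "\r", "f": "\f",
    '"': '"', "\\": "\\",
}

# UCHAR (\uXXXX or \UXXXXXXXX), or a backslash followed by any single
# character (or by nothing, for a dangling backslash at the end).
_ESCAPE = re.compile(
    r"\\(?:u([0-9A-Fa-f]{4})|U([0-9A-Fa-f]{8})|(.?))", re.DOTALL
)


def unescape_literal(body, allow_single_quote=True):
    """Decode the text between the double quotes of a literal."""
    table = dict(STRICT_ECHAR)
    if allow_single_quote:
        # Assumed leniency for documents written against the CR grammar.
        table["'"] = "'"

    def replace(match):
        hex4, hex8, char = match.groups()
        if hex4 or hex8:
            return chr(int(hex4 or hex8, 16))
        if char in table:
            return table[char]
        raise ValueError("unsupported escape sequence: \\" + char)

    return _ESCAPE.sub(replace, body)
```

With allow_single_quote=False the same function rejects \' with a
ValueError, which is the behaviour a parser following the revised
grammar strictly would have.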

> Changes:
>
> N-Triples and N-Quads:
>
> ECHAR   ::=     '\' [tbnrf"\]
>
> which does not include ' because strings can't use '-quoting in N-Triples
> and N-Quads and there is a desire to minimise the number of ways of writing
> the same thing.

That is fair, and it was the previous method before the Candidate
Recommendations were published, so it is not without precedent.

However, the main question for me at this stage is whether having a
minimal number of ways of writing single quotes is more beneficial
than reverting the change in either a follow-up CR or Proposed
Recommendation, given that some users have already started escaping
single quotes.

There are a few issues that may affect users in both the short-term
and the long-term. Although none of them are particularly convincing
on their own, together they may give the impression that it is more
beneficial to allow \' in N-Triples and N-Quads for consistency with
Turtle/TriG/SPARQL:

1) Removing the escaping of single quotes for N-Triples may make
N-Triples files created based on the CR unparsable
    * The workaround in this case is to use a Turtle parser as it
still allows single quote escaping

2) Removing the escaping of single quotes for N-Quads may make N-Quads
files created based on the CR unparsable
    * As TriG is not structurally compatible with N-Quads, per Richard
Smith's recent comments, there is no alternative for this case. I don't
see a need for the compatibility personally, as they have different
purposes, IMO. The reason I bring it up is that there would be no
alternative parser for the files created during the CR period once
current parsers again remove support for single-quote escaping.

3) For ease of reference N-Triples could be compatible with a simple
line-based version of Turtle and by relation SPARQL.
    * The difference in allowing single-quote escaping may make
N-Triples documents not a strict subset of Turtle, in that a valid
Turtle file with triples printed line by line (without long
literals/prefixed URIs/etc. but with escaping for all of the allowed
Turtle ECHAR escape sequences) may not be parsable by an N-Triples
parser that doesn't allow for single-quote escaping.

4) Given that the requirement that there be a single way to represent
everything has already been relaxed for N-Triples/N-Quads, would it be
suitable to specify that single quotes MAY be escaped, rather than
SHOULD/MUST be escaped, for Canonical N-Triples/Canonical N-Quads? In
particular, that would make it possible for parsers to still accept
documents produced using the RDF Test Cases Format and the previous
N-Quads specification without having a different grammar going
forward.
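
The compatibility concern in points 1 to 3 can be illustrated with a
small Python sketch. It models only STRING_LITERAL_QUOTE as a regular
expression, parameterised by the letters allowed in ECHAR; the helper
name string_literal_quote is hypothetical and this is not a full
N-Triples or Turtle parser.

```python
import re

UCHAR = r"\\u[0-9A-Fa-f]{4}|\\U[0-9A-Fa-f]{8}"


def string_literal_quote(echar_letters):
    """Regex for a double-quoted literal, given the ECHAR letters."""
    echar = r"\\[" + re.escape(echar_letters) + "]"
    # Any character except ", \, LF and CR may appear directly.
    return re.compile(r'"(?:[^"\\\n\r]|' + echar + "|" + UCHAR + r')*"')


# Revised N-Triples/N-Quads ECHAR: no single quote.
NT = string_literal_quote('tbnrf"\\')
# Turtle/TriG ECHAR: includes the single quote.
TTL = string_literal_quote('tbnrf"\'\\')

escaped = r'"it\'s"'   # literal text as written under the CR grammar
plain = '"it\'s"'      # the same value with the quote written directly

print(TTL.fullmatch(escaped) is not None)  # True
print(NT.fullmatch(escaped) is not None)   # False
print(NT.fullmatch(plain) is not None)     # True
```

Under these assumptions, a literal that is valid for the Turtle
pattern is rejected by the revised N-Triples pattern solely because of
the escaped single quote.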

> In addition, I've checked Turtle and TriG (Turtle already had a related fix
> recently) to put the characters in the same order because \" is confusing
> (it is not escaping a " in the grammar itself).
>
> ECHAR   ::=     '\' [tbnrf'"\]

The SPARQL Query grammar ECHAR already has all of these characters as
well, although it has the \" sequence, which is correct but confusing,
as you say.

> (Turtle and TriG have a ' as well)
>
> Links to the rule in the grammar in the editors' drafts:
>
> N-Triples:
>
> https://dvcs.w3.org/hg/rdf/raw-file/default/rdf-turtle/n-triples.html#grammar-production-ECHAR

The modification to ECHAR looks good to me, excepting the possibility
described above of adding an escape for ' again.

Would it be possible to further clarify the way in which characters
which are in ECHAR, but not *strictly* disallowed in
STRING_LITERAL_QUOTE, should be represented in Canonical N-Triples?
The last 2 rules in Section 4 seem to specify that ECHAR MUST *only*
be used for the 4 characters which are strictly disallowed in
STRING_LITERAL_QUOTE. It may not be obvious to readers why only 4 out
of the 7 escape sequences in ECHAR are used in the Canonical form, and
also why UCHAR was not used to escape them. The other 3 ECHAR
characters which have escape sequences are not directly referred to in
Canonical N-Triples, other than by the reference that they should be
directly included without using UCHAR.
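
As a sketch of how I currently read those two rules, the following
Python function escapes only the 4 characters that cannot appear
directly in STRING_LITERAL_QUOTE and writes everything else, including
tab, backspace and form feed, directly. The name canonical_literal is
hypothetical, and the treatment of the other 3 ECHAR characters is my
interpretation, which is exactly the part I would like clarified.

```python
# The 4 characters not allowed directly in STRING_LITERAL_QUOTE,
# mapped to their ECHAR forms (U+0022, U+005C, U+000A, U+000D).
CANONICAL_ECHAR = {
    '"': '\\"',
    "\\": "\\\\",
    "\n": "\\n",
    "\r": "\\r",
}


def canonical_literal(value):
    """Serialise a lexical value as a double-quoted literal.

    Assumption: every character outside CANONICAL_ECHAR is written
    directly, with no ECHAR or UCHAR escape.
    """
    body = "".join(CANONICAL_ECHAR.get(ch, ch) for ch in value)
    return '"' + body + '"'
```

Under this reading, a value containing a tab or a single quote is
written with those characters unescaped, while a newline becomes \n.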

> N-Quads:
>
> https://dvcs.w3.org/hg/rdf/raw-file/default/nquads/index.html#grammar-production-ECHAR

The modification to ECHAR looks good to me, excepting the possibility
described above of adding an escape for ' again.

The original N-Quads followed N-Triples (RDF Test Cases format) and
hence had a single way to represent each line. Could the RDF-1.1
N-Quads specification also include a Canonical N-Quads Document
specification that would again specify a single way to represent each
line?

This wasn't my original query for this thread, so feel free to open a
separate issue for this if it is easier to track it that way. It
shouldn't be too difficult once the N-Triples Canonical form is
finalised to copy it to N-Quads.

> Turtle:
>
> https://dvcs.w3.org/hg/rdf/raw-file/default/rdf-turtle/index.html#grammar-production-ECHAR

The modification to ECHAR looks good to me.

In relation to Canonical N-Triples, the section in the Turtle spec
describing string escape sequences may be useful as a template to make
it clear exactly which way to encode the remaining 3 N-Triples ECHAR
characters, as it specifically mentions each of the 8 Turtle escape
sequences (although Turtle has no Canonical form so it refers to them
as "traditionally" escaped which would need to change for Canonical
N-Triples).

https://dvcs.w3.org/hg/rdf/raw-file/default/rdf-turtle/index.html#sec-escapes

> TriG:
> https://dvcs.w3.org/hg/rdf/raw-file/default/trig/index.html#grammar-production-ECHAR

The modification to ECHAR looks good to me.

Thanks,

Peter
Received on Monday, 18 November 2013 23:28:00 UTC