Re: Escaped characters in RDF-1.1 N-Triples literals for Canonical documents from Andy Seaborne on 2013-11-28 (public-rdf-comments@w3.org from November 2013)

From: Andy Seaborne <andy@apache.org>
Date: Thu, 28 Nov 2013 18:00:15 +0000
To: Peter Ansell <ansell.peter@gmail.com>
CC: "public-rdf-comments@w3.org" <public-rdf-comments@w3.org>
Message-ID: <529784AF.1040907@apache.org>
To close this off:

 > There are a few issues that may affect users in both the short-term
 > and the long-term. Although none of them are particularly convincing
 > on their own, together they may give the impression that it is more
 > beneficial to allow \' in N-Triples and N-Quads for consistency with
 > Turtle/TriG/SPARQL:

Yes, on balance, the form in the CR documents where ' is allowed in ECHAR

ECHAR   ::=     '\' [tbnrf'"\]

was the preferred form in the working group.  It means that parser
tokenizing for N-Triples, N-Quads, Turtle and TriG is the same in this
area.
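
As a minimal sketch (not part of the specifications), the three ECHAR variants quoted in this thread can be compared with Python's standard re module. The variant labels and the helper names accepts and unescape are made up for this illustration, and UCHAR (\uXXXX) handling is left out.

```python
import re

# Character classes copied from the three ECHAR variants quoted in this thread.
# The dictionary keys are illustrative labels, not official names.
ECHAR_VARIANTS = {
    "CR 2013-11-05": r"""\\[tbnrf"']""",  # as published in the CR: no escape for backslash
    "interim fix": r"""\\[tbnrf"\\]""",   # backslash added, single quote removed
    "final": r"""\\[tbnrf'"\\]""",        # same character set as Turtle and TriG
}

# Character produced by each escape letter in the final variant.
ESCAPES = {
    "t": "\t", "b": "\b", "n": "\n", "r": "\r", "f": "\f",
    '"': '"', "'": "'", "\\": "\\",
}


def accepts(variant: str, escape: str) -> bool:
    """Return True if `escape` is a complete ECHAR under the named variant."""
    return re.fullmatch(ECHAR_VARIANTS[variant], escape) is not None


def unescape(body: str) -> str:
    """Replace every ECHAR sequence (final variant) with the character it denotes.

    UCHAR sequences are deliberately not handled in this sketch.
    """
    return re.sub(ECHAR_VARIANTS["final"], lambda m: ESCAPES[m.group(0)[1]], body)
```

With these definitions, accepts("final", "\\'") holds while accepts("interim fix", "\\'") does not, which is the difference at issue in this thread.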

Editors' drafts are up to date.

 Thanks,
 Andy

On 18/11/13 23:27, Peter Ansell wrote:
> On 18 November 2013 22:09, Andy Seaborne <andy@apache.org> wrote:
>> On 17/11/13 22:50, Peter Ansell wrote:
>>>
>>> The Conformance section (Section 4) of the RDF-1.1 N-Triples Candidate
>>> Recommendation (05 November 2013) specifies that for a canonical
>>> document [1] :
>>>
>>>       "Characters not allowed directly in STRING_LITERAL_QUOTE (U+0022,
>>> U+005C, U+000A, U+000D) MUST use ECHAR not UCHAR. "
>>>
>>> However, the escape sequences in ECHAR do not seem to include U+005C "\"
>>> [2]:
>>>
>>>       [153s] ECHAR ::= '\' [tbnrf"']
>>>
>>> That is, ECHAR defines escapes for \t \b \n \r \f \" \' , but it
>>> doesn't appear that \\ is allowed for in that grammar. It could be
>>> escaped using UCHAR as \u005C, but that seems to violate the canonical
>>> rule that specifically mentions it.
>>>
>>> In addition, is it intentional that the list of characters mentioned
>>> in the canonical section [1] does not include all of the characters
>>> with escapes defined in ECHAR [2]? Should the characters that appear
>>> in ECHAR [2] but not in the list in [1] be escaped using UCHAR in
>>> Canonical documents or be represented using their raw UTF-8 values?
>>>
>>> Cheers,
>>>
>>> Peter
>>>
>>> [1] http://www.w3.org/TR/2013/CR-n-triples-20131105/#conformance
>>> [2]
>>> http://www.w3.org/TR/2013/CR-n-triples-20131105/#grammar-production-ECHAR
>>>
>>
>> Hi Peter,
>>
>> Thanks for pointing that out.  It looks like a systematic bug in the tool
>> chain that we failed to squash.
>>
>> I've recorded it on the WG comments:
>>
>> http://www.w3.org/2011/rdf-wg/wiki/CR_Comments
>>
>> This is not a formal response to your comment.
>>
>> I have fixed the documents (all subject to WG approval) as follows. If you
>> are satisfied, please do send an early confirmation that your comment has
>> been dealt with to your satisfaction.
>
> The removal of the single quote from ECHAR for N-Triples and N-Quads
> again complicates matters a little, as it requires going backwards,
> essentially, to what the previous specifications had. The reason I was
> looking into the grammar yesterday was to make Sesame able to parse
> RDF-1.1 N-Triples Candidate Recommendation documents that include
> escaped single quotes \' , as Michel Dumontier had already started
> escaping single quotes for upcoming Bio2RDF N-Triples and N-Quads data
> dumps based on the respective Candidate Recommendations. The previous
> Sesame parsers failed as they were fairly strict to the RDF Test Cases
> specification where there was a single way to represent each
> character. I made the change in Git to allow future Sesame releases to
> parse N-Triples documents that use \'.
>
> However I am not sure now whether I should remove that support before
> it appears in a Sesame release and is relied on by users, given that
> it will not appear in the next version--and the final version--of the
> specification. The alternative is to examine whether it is simpler to
> keep support for \' in the grammar as a useful addition for
> compatibility with Turtle/TriG/SPARQL, even though it is not strictly
> necessary given that only double quotes are used for surrounding
> literals in N-Triples/N-Quads.
>
>> Changes:
>>
>> N-Triples and N-Quads:
>>
>> ECHAR   ::=     '\' [tbnrf"\]
>>
>> which does not include ' because strings can't use '-quoting in N-Triples
>> and N-Quads and there is a desire to minimise the number of ways of writing
>> the same thing.
>
> That is fair, and it was the previous method before the Candidate
> Recommendations were published, so it is not without precedent.
>
> However, the main question for me at this stage is whether having a
> minimal number of ways of writing single quotes is more beneficial
> than reverting the change in either a follow-up CR or
> Proposed Recommendation, given that some users have already started
> escaping single quotes.
>
> There are a few issues that may affect users in both the short-term
> and the long-term. Although none of them are particularly convincing
> on their own, together they may give the impression that it is more
> beneficial to allow \' in N-Triples and N-Quads for consistency with
> Turtle/TriG/SPARQL:
>
> 1) Removing the escaping of single quotes for N-Triples may make
> N-Triples files created based on the CR unparsable
>      * The workaround in this case is to use a Turtle parser as it
> still allows single quote escaping
>
> 2) Removing the escaping of single quotes for N-Quads may make N-Quads
> files created based on the CR unparsable
>      * As TriG is not structurally compatible with N-Quads, per Richard
> Smith's recent comments, there is no alternative for this case. I don't
> see a need for the compatibility personally, as they have different
> purposes, IMO. The reason I bring it up is that there would be no
> alternative parser for the files created during the CR period once
> current parsers again remove support for single-quote escaping.
>
> 3) For ease of reference N-Triples could be compatible with a simple
> line-based version of Turtle and by relation SPARQL.
>      * The difference in allowing single-quote escaping may make
> N-Triples documents not a strict-subset of Turtle, in that a valid
> Turtle file with triples printed line by line, (without long
> literals/prefixed URIs/etc. but with escaping for all of the allowed
> Turtle ECHAR escape sequences), may not be parsable by an N-Triples
> parser that didn't allow for single-quote escaping.
>
> 4) Given that the requirement that there be a single way to represent
> everything has already been relaxed for N-Triples/N-Quads, would it be
> suitable to specify that single quotes MAY be escaped, rather than
> SHOULD/MUST be escaped, for Canonical N-Triples/Canonical N-Quads? In
> particular, that would make it possible for parsers to still accept
> documents produced using the RDF Test Cases Format and the previous
> N-Quads specification without having a different grammar going
> forward.
>
>> In addition, I've checked Turtle and TriG (Turtle already had a related fix
>> recently) to put the characters in the same order because \" is confusing
>> (it is not escaping a " in the grammar itself).
>>
>> ECHAR   ::=     '\' [tbnrf'"\]
>
> The SPARQL Query grammar ECHAR already has all of these characters
> also, although it has the \" sequence which is correct but confusing,
> as you say.
>
>> (Turtle and TriG have a ' as well)
>>
>> Links to the rule in the grammar in the editors' drafts:
>>
>> N-Triples:
>>
>> https://dvcs.w3.org/hg/rdf/raw-file/default/rdf-turtle/n-triples.html#grammar-production-ECHAR
>
> The modification to ECHAR looks good to me, excepting the possibility
> described above of adding an escape for ' again.
>
> Would it be possible to further clarify the way in which characters
> which are in ECHAR, but not *strictly* disallowed in
> STRING_LITERAL_QUOTE, should be represented in Canonical N-Triples?
> The last 2 rules in Section 4 seem to specify that ECHAR MUST *only*
> be used for the 4 characters which are strictly disallowed in
> STRING_LITERAL_QUOTE. It may not be obvious to readers why only 4 out
> of the 7 escape sequences in ECHAR are used in the Canonical form, and
> also why UCHAR was not used to escape them. The other 3 ECHAR
> characters which have escape sequences are not directly referred to in
> Canonical N-Triples, other than by the reference that they should be
> directly included without using UCHAR.
>
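One possible reading of the Section 4 rules as quoted in this thread, sketched in Python: only U+0022, U+005C, U+000A and U+000D are written with ECHAR, and every other character is written directly. Treating tab, backspace and form feed as raw characters is an assumption taken from the question above, not a statement of what the final specification requires. The names CANONICAL_ECHAR and canonical_literal_body are hypothetical.

```python
# The four characters that the quoted Section 4 text says MUST use ECHAR.
CANONICAL_ECHAR = {
    '"': '\\"',     # U+0022 double quote
    "\\": "\\\\",   # U+005C backslash
    "\n": "\\n",    # U+000A line feed
    "\r": "\\r",    # U+000D carriage return
}


def canonical_literal_body(value: str) -> str:
    """Escape the body of a literal for the canonical form under this reading.

    Assumption: any character outside CANONICAL_ECHAR (including tab,
    backspace, form feed and the single quote) is emitted unchanged,
    with neither ECHAR nor UCHAR applied.
    """
    return "".join(CANONICAL_ECHAR.get(ch, ch) for ch in value)
```

Under this reading, canonical_literal_body("it's") is returned unchanged, whereas a double quote or backslash in the input gains a leading backslash.
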
>> N-Quads:
>>
>> https://dvcs.w3.org/hg/rdf/raw-file/default/nquads/index.html#grammar-production-ECHAR
>
> The modification to ECHAR looks good to me, excepting the possibility
> described above of adding an escape for ' again.
>
> The original N-Quads followed N-Triples (RDF Test Cases format) and
> hence had a single way to represent each line. Could the RDF-1.1
> N-Quads specification also include a Canonical N-Quads Document
> specification that would again specify a single way to represent each
> line?
>
> This wasn't my original query for this thread, so feel free to open a
> separate issue for this if it is easier to track it that way. It
> shouldn't be too difficult once the N-Triples Canonical form is
> finalised to copy it to N-Quads.
>
>> Turtle:
>>
>> https://dvcs.w3.org/hg/rdf/raw-file/default/rdf-turtle/index.html#grammar-production-ECHAR
>
> The modification to ECHAR looks good to me.
>
> In relation to Canonical N-Triples, the section in the Turtle spec
> describing string escape sequences may be useful as a template to make
> it clear exactly which way to encode the remaining 3 N-Triples ECHAR
> characters, as it specifically mentions each of the 8 Turtle escape
> sequences (although Turtle has no Canonical form so it refers to them
> as "traditionally" escaped which would need to change for Canonical
> N-Triples).
>
> https://dvcs.w3.org/hg/rdf/raw-file/default/rdf-turtle/index.html#sec-escapes
>
>> TriG:
>> https://dvcs.w3.org/hg/rdf/raw-file/default/trig/index.html#grammar-production-ECHAR
>
> The modification to ECHAR looks good to me.
>
> Thanks,
>
> Peter
>
Received on Thursday, 28 November 2013 18:00:45 UTC