Re: Escaped characters in RDF-1.1 N-Triples literals for Canonical documents from Peter Ansell on 2013-11-28 (public-rdf-comments@w3.org from November 2013)

From: Peter Ansell <ansell.peter@gmail.com>
Date: Fri, 29 Nov 2013 09:28:26 +1100
To: Andy Seaborne <andy@apache.org>
Cc: "public-rdf-comments@w3.org" <public-rdf-comments@w3.org>
Message-ID: <CAGYFOCQvAahB+PaGTgThMjYndcf5gZ1EDDHq97qZ=gFu6Cwy7w@mail.gmail.com>
Hi Andy,

Thanks for working through the ECHAR issue. Having ECHAR consistent
across the variations is definitely useful.

Could you also clarify the main issue that I am raising here about the
Canonical form for N-Triples and which characters are expected to be
escaped? It isn't completely clear whether "\" escapes should be used
for all 8 represented characters in ECHAR or just the 4 that are
disallowed in STRING_LITERAL_QUOTE [1].
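
To illustrate the distinction, here is a minimal sketch (not taken from any
existing parser) that approximates the CR grammar with Python regular
expressions. ECHAR, UCHAR and STRING_LITERAL_QUOTE mirror the grammar
productions, with ECHAR in the form that includes both ' and \;
is_string_literal_quote is a hypothetical helper name.

```python
import re

# ECHAR in the form '\' [tbnrf'"\] (assumed here; includes both ' and \).
ECHAR = r"\\[tbnrf'\"\\]"
# UCHAR: '\u' followed by 4 hex digits, or '\U' followed by 8 hex digits.
UCHAR = r"\\u[0-9A-Fa-f]{4}|\\U[0-9A-Fa-f]{8}"
# STRING_LITERAL_QUOTE: '"' ([^#x22#x5C#xA#xD] | ECHAR | UCHAR)* '"'
STRING_LITERAL_QUOTE = re.compile(
    r'"(?:[^\x22\x5C\x0A\x0D]|' + ECHAR + "|" + UCHAR + r')*"'
)


def is_string_literal_quote(token: str) -> bool:
    """Return True if the whole token matches STRING_LITERAL_QUOTE."""
    return STRING_LITERAL_QUOTE.fullmatch(token) is not None
```

Under this sketch a raw tab, backspace, form feed or single quote is accepted
inside the quotes, while a raw ", \, LF or CR is not. Those are the 4
characters that can only appear in escaped form.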

If only the 4 that are disallowed can be used, then one would need to
use the raw character, as the last rule in Canonical N-Triples says
that the \u form isn't allowed to be used either, but it isn't
completely clear whether that is the intention.

If the goal of Canonical N-Triples is to be fairly close to the RDF
Test Cases format (with the exception of UTF-8 being preferred over
\u|\U) then having at least the 5 escaped ASCII characters that it
specifies [2] (and hopefully all 8 from RDF-1.1 N-Triples ECHAR) may
make sense.
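
The three candidate readings can be compared side by side with a small
sketch. This is illustrative only: the table names and escape_literal are
hypothetical, and the sketch assumes every character outside the chosen table
is written raw (as UTF-8 on output), never as \u or \U.

```python
# All 8 ECHAR escapes from the RDF-1.1 N-Triples CR grammar.
ECHAR_ALL = {
    "\t": "\\t", "\b": "\\b", "\n": "\\n", "\r": "\\r", "\f": "\\f",
    '"': '\\"', "'": "\\'", "\\": "\\\\",
}
# Reading A: only the 4 characters disallowed in STRING_LITERAL_QUOTE.
DISALLOWED_ONLY = {c: ECHAR_ALL[c] for c in '"\\\n\r'}
# Reading B: the 5 ASCII escapes of the RDF Test Cases format.
TEST_CASES = {c: ECHAR_ALL[c] for c in '\t\n\r"\\'}


def escape_literal(value: str, table: dict) -> str:
    """Quote a lexical form, escaping only the characters listed in table."""
    return '"' + "".join(table.get(ch, ch) for ch in value) + '"'
```

With these tables a tab stays raw under reading A but becomes \t under
reading B and under the full ECHAR set, and a single quote is escaped only
under the full set. Each reading would produce a different canonical line for
the same literal, which is why the choice needs to be stated explicitly.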

Thanks,

Peter

[1] http://www.w3.org/TR/2013/CR-n-triples-20131105/#grammar-production-STRING_LITERAL_QUOTE
[2] http://www.w3.org/TR/rdf-testcases/#ntrip_strings

On 29 November 2013 05:00, Andy Seaborne <andy@apache.org> wrote:
> To close this off:
>
>
>> There are a few issues that may affect users in both the short-term
>> and the long-term. Although none of them are particularly convincing
>> on their own, together they may give the impression that it is more
>> beneficial to allow \' in N-Triples and N-Quads for consistency with
>> Turtle/TriG/SPARQL:
>
> Yes, on balance, the form in the CR documents where ' is allowed in ECHAR
>
> ECHAR   ::=     '\' [tbnrf'"\]
>
> was the preferred form in the working group.  It means that parser
> tokenizing for N-Triples, N-Quads, Turtle and TriG is the same in this
> area.
>
> Editors' drafts are up to date.
>
>         Thanks,
>         Andy
>
>
> On 18/11/13 23:27, Peter Ansell wrote:
>>
>> On 18 November 2013 22:09, Andy Seaborne <andy@apache.org> wrote:
>>>
>>> On 17/11/13 22:50, Peter Ansell wrote:
>>>>
>>>>
>>>> The Conformance section (Section 4) of the RDF-1.1 N-Triples Candidate
>>>> Recommendation (05 November 2013) specifies that for a canonical
>>>> document [1] :
>>>>
>>>>       "Characters not allowed directly in STRING_LITERAL_QUOTE (U+0022,
>>>> U+005C, U+000A, U+000D) MUST use ECHAR not UCHAR. "
>>>>
>>>> However, the escape sequences in ECHAR do not seem to include U+005C "\"
>>>> [2]:
>>>>
>>>>       [153s] ECHAR ::= '\' [tbnrf"']
>>>>
>>>> That is, ECHAR defines escapes for \t \b \n \r \f \" \' , but it
>>>> doesn't appear that \\ is allowed for in that grammar. It could be
>>>> escaped using UCHAR as \u005C, but that seems to violate the canonical
>>>> rule that specifically mentions it.
>>>>
>>>> In addition, is it intentional that the list of characters mentioned
>>>> in the canonical section [1] does not include all of the characters
>>>> with escapes defined in ECHAR [2]? Should the characters that appear
>>>> in ECHAR [2] but not in the list in [1] be escaped using UCHAR in
>>>> Canonical documents or be represented using their raw UTF-8 values?
>>>>
>>>> Cheers,
>>>>
>>>> Peter
>>>>
>>>> [1] http://www.w3.org/TR/2013/CR-n-triples-20131105/#conformance
>>>> [2] http://www.w3.org/TR/2013/CR-n-triples-20131105/#grammar-production-ECHAR
>>>>
>>>
>>> Hi Peter,
>>>
>>> Thanks for pointing that out.  It looks like a systematic bug in the tool
>>> chain that we failed to squash.
>>>
>>> I've recorded it on the WG comments:
>>>
>>> http://www.w3.org/2011/rdf-wg/wiki/CR_Comments
>>>
>>> This is not a formal response to your comment.
>>>
>>> I have fixed the documents (which is all subject to WG approval) as
>>> follows and if you are satisfied, please do send an early confirmation of
>>> dealing with your comment to your satisfaction.
>>
>>
>> The removal of the single quote from ECHAR for N-Triples and N-Quads
>> again complicates matters a little, as it requires going backwards,
>> essentially, to what the previous specifications had. The reason I was
>> looking into the grammar yesterday was to make Sesame able to parse
>> RDF-1.1 N-Triples Candidate Recommendation documents that include
>> escaped single quotes \' , as Michel Dumontier had already started
>> escaping single quotes for upcoming Bio2RDF N-Triples and N-Quads data
>> dumps based on the respective Candidate Recommendations. The previous
>> Sesame parsers failed as they were fairly strict to the RDF Test Cases
>> specification where there was a single way to represent each
>> character. I made the change in Git to allow future Sesame releases to
>> parse N-Triples documents that use \'.
>>
>> However I am not sure now whether I should remove that support before
>> it appears in a Sesame release and is relied on by users, given that
>> it will not appear in the next version--and the final version--of the
>> specification. The alternative is to examine whether it is simpler to
>> keep support for \' in the grammar as a useful addition for
>> compatibility with Turtle/TriG/SPARQL, even though it is not strictly
>> necessary given that only double quotes are used for surrounding
>> literals in N-Triples/N-Quads.
>>
>>> Changes:
>>>
>>> N-Triples and N-Quads:
>>>
>>> ECHAR   ::=     '\' [tbnrf"\]
>>>
>>> which does not include ' because strings can't use '-quoting in N-Triples
>>> and N-Quads and there is a desire to minimise the number of ways of
>>> writing the same thing.
>>
>>
>> That is fair, and it was the previous method before the Candidate
>> Recommendations were published, so it is not without precedent.
>>
>> However, the main question for me at this stage is whether having a
>> minimal number of ways of writing single quotes is more beneficial at
>> this stage than reverting the change in either a follow-up CR or
>> Proposed Recommendation, given that some users have already started
>> escaping single quotes.
>>
>> There are a few issues that may affect users in both the short-term
>> and the long-term. Although none of them are particularly convincing
>> on their own, together they may give the impression that it is more
>> beneficial to allow \' in N-Triples and N-Quads for consistency with
>> Turtle/TriG/SPARQL:
>>
>> 1) Removing the escaping of single quotes for N-Triples may make
>> N-Triples files created based on the CR unparsable
>>      * The workaround in this case is to use a Turtle parser as it
>> still allows single quote escaping
>>
>> 2) Removing the escaping of single quotes for N-Quads may make N-Quads
>> files created based on the CR unparsable
>>      * As TriG is not structurally compatible with N-Quads, per Richard
>> Smith's recent comments, there is no alternative for this case. I don't
>> see a need for the compatibility personally, as they have different
>> purposes, IMO. The reason I bring it up is that there would be no
>> alternative parser for the files created during the CR period once
>> current parsers again remove support for single-quote escaping.
>>
>> 3) For ease of reference N-Triples could be compatible with a simple
>> line-based version of Turtle and by relation SPARQL.
>>      * The difference in allowing single-quote escaping may make
>> N-Triples documents not a strict-subset of Turtle, in that a valid
>> Turtle file with triples printed line by line, (without long
>> literals/prefixed URIs/etc. but with escaping for all of the allowed
>> Turtle ECHAR escape sequences), may not be parsable by an N-Triples
>> parser that didn't allow for single-quote escaping.
>>
>> 4) Given that the requirement that there be a single way to represent
>> everything has already been relaxed for N-Triples/N-Quads, would it be
>> suitable to specify that single-quotes MAY be escaped, but not
>> SHOULD/MUST be escaped for Canonical N-Triples/Canonical N-Quads? In
>> particular, that would make it possible for parsers to still accept
>> documents produced using the RDF Test Cases Format and the previous
>> N-Quads specification without having a different grammar going
>> forward.
>>
>>> In addition, I've checked Turtle and TriG (Turtle already had a related
>>> fix recently) to put the characters in the same order because \" is
>>> confusing (it is not escaping a " in the grammar itself).
>>>
>>> ECHAR   ::=     '\' [tbnrf'"\]
>>
>>
>> The SPARQL Query grammar ECHAR already has all of these characters
>> also, although it has the \" sequence which is correct but confusing,
>> as you say.
>>
>>> (Turtle and TriG have a ' as well)
>>>
>>> Links to the rule in the grammar in the editors' drafts:
>>>
>>> N-Triples:
>>>
>>>
>>> https://dvcs.w3.org/hg/rdf/raw-file/default/rdf-turtle/n-triples.html#grammar-production-ECHAR
>>
>>
>> The modification to ECHAR looks good to me, excepting the possibility
>> described above of adding an escape for ' again.
>>
>> Would it be possible to further clarify the way in which characters
>> which are in ECHAR, but not *strictly* disallowed in
>> STRING_LITERAL_QUOTE, should be represented in Canonical N-Triples?
>> The last 2 rules in Section 4 seem to specify that ECHAR MUST *only*
>> be used for the 4 characters which are strictly disallowed in
>> STRING_LITERAL_QUOTE. It may not be obvious to readers why only 4 out
>> of the 7 escape sequences in ECHAR are used in the Canonical form, and
>> also why UCHAR was not used to escape them. The other 3 ECHAR
>> characters which have escape sequences are not directly referred to in
>> Canonical N-Triples, other than by the reference that they should be
>> directly included without using UCHAR.
>>
>>> N-Quads:
>>>
>>>
>>> https://dvcs.w3.org/hg/rdf/raw-file/default/nquads/index.html#grammar-production-ECHAR
>>
>>
>> The modification to ECHAR looks good to me, excepting the possibility
>> described above of adding an escape for ' again.
>>
>> The original N-Quads followed N-Triples (RDF Test Cases format) and
>> hence had a single way to represent each line. Could the RDF-1.1
>> N-Quads specification also include a Canonical N-Quads Document
>> specification that would again specify a single way to represent each
>> line?
>>
>> This wasn't my original query for this thread, so feel free to open a
>> separate issue for this if it is easier to track it that way. It
>> shouldn't be too difficult once the N-Triples Canonical form is
>> finalised to copy it to N-Quads.
>>
>>> Turtle:
>>>
>>>
>>> https://dvcs.w3.org/hg/rdf/raw-file/default/rdf-turtle/index.html#grammar-production-ECHAR
>>
>>
>> The modification to ECHAR looks good to me.
>>
>> In relation to Canonical N-Triples, the section in the Turtle spec
>> describing string escape sequences may be useful as a template to make
>> it clear exactly which way to encode the remaining 3 N-Triples ECHAR
>> characters, as it specifically mentions each of the 8 Turtle escape
>> sequences (although Turtle has no Canonical form so it refers to them
>> as "traditionally" escaped which would need to change for Canonical
>> N-Triples).
>>
>>
>> https://dvcs.w3.org/hg/rdf/raw-file/default/rdf-turtle/index.html#sec-escapes
>>
>>> TriG:
>>>
>>> https://dvcs.w3.org/hg/rdf/raw-file/default/trig/index.html#grammar-production-ECHAR
>>
>>
>> The modification to ECHAR looks good to me.
>>
>> Thanks,
>>
>> Peter
>>
>
Received on Thursday, 28 November 2013 22:28:53 UTC