Re: Escaped characters in RDF-1.1 N-Triples literals for Canonical documents

Hi Andy,

Thanks for working through the ECHAR issue. Having ECHAR consistent
across the variations is definitely useful.

Could you also clarify the main issue that I am raising here about the
Canonical form for N-Triples and which characters are expected to be
escaped? It isn't completely clear whether "\" escapes should be used
for all 8 represented characters in ECHAR or just the 4 that are
disallowed in STRING_LITERAL_QUOTE [1].

If only the 4 disallowed characters may be escaped, then the other 4
would have to appear as raw characters, since the last rule in
Canonical N-Triples says that the \u form isn't allowed either. It
isn't completely clear whether that is the intention.

If the goal of Canonical N-Triples is to be fairly close to the RDF
Test Cases format (with the exception of UTF-8 being preferred over
\u|\U) then having at least the 5 escaped ASCII characters that it
specifies [2] (and hopefully all 8 from RDF-1.1 N-Triples ECHAR) may
make sense.
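
To make the three readings concrete, here is a rough Python sketch. It
is only my own illustration: escape_literal and the three policy tables
are hypothetical names, not anything defined by the specifications, and
it assumes (per the reading above) that a Canonical writer never emits
\u or \U.

```python
# Illustrative sketch only: the policy tables and escape_literal() are
# hypothetical names, not part of any specification or library.

# Escape sequences from the RDF-1.1 N-Triples ECHAR production:
#   ECHAR ::= '\' [tbnrf'"\]
ECHAR_ALL_8 = {
    "\t": "\\t",
    "\b": "\\b",
    "\n": "\\n",
    "\r": "\\r",
    "\f": "\\f",
    "'": "\\'",
    '"': '\\"',
    "\\": "\\\\",
}

# Characters not allowed directly in STRING_LITERAL_QUOTE:
# U+0022, U+005C, U+000A, U+000D
DISALLOWED_4 = {c: ECHAR_ALL_8[c] for c in ('"', "\\", "\n", "\r")}

# The five escaped ASCII characters of the RDF Test Cases format:
# \\ \" \n \r \t
TEST_CASES_5 = {c: ECHAR_ALL_8[c] for c in ("\\", '"', "\n", "\r", "\t")}


def escape_literal(value, policy):
    """Escape the characters listed in `policy`; emit all others raw.

    Assumption: no \\u or \\U sequences are ever produced, so any
    character outside the policy table is written as-is.
    """
    return "".join(policy.get(ch, ch) for ch in value)


if __name__ == "__main__":
    sample = 'tab\there, a "quote" and it\'s done'
    for name, policy in (
        ("4 disallowed", DISALLOWED_4),
        ("5 test-cases", TEST_CASES_5),
        ("8 ECHAR", ECHAR_ALL_8),
    ):
        print(name, "->", escape_literal(sample, policy))
```

With the 4-character table a tab or a single quote is written raw; with
all 8, both are escaped. A Canonical definition would need to pick
exactly one of these tables for there to be a single way to write each
literal.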

Thanks,

Peter

[1] http://www.w3.org/TR/2013/CR-n-triples-20131105/#grammar-production-STRING_LITERAL_QUOTE
[2] http://www.w3.org/TR/rdf-testcases/#ntrip_strings

On 29 November 2013 05:00, Andy Seaborne <andy@apache.org> wrote:
> To close this off:
>
>
>> There are a few issues that may affect users in both the short-term
>> and the long-term. Although none of them are particularly convincing
>> on their own, together they may give the impression that it is more
>> beneficial to allow \' in N-Triples and N-Quads for consistency with
>> Turtle/TriG/SPARQL:
>
> Yes, on balance, the form in the CR documents where ' is allowed in ECHAR
>
> ECHAR   ::=     '\' [tbnrf'"\]
>
> was the preferred form in the working group.  It means that parser
> tokenizing for N-Triples, N-Quads, Turtle and TriG is the same in this
> area.
>
> Editors' drafts are up to date.
>
>         Thanks,
>         Andy
>
>
> On 18/11/13 23:27, Peter Ansell wrote:
>>
>> On 18 November 2013 22:09, Andy Seaborne <andy@apache.org> wrote:
>>>
>>> On 17/11/13 22:50, Peter Ansell wrote:
>>>>
>>>>
>>>> The Conformance section (Section 4) of the RDF-1.1 N-Triples Candidate
>>>> Recommendation (05 November 2013) specifies that for a canonical
>>>> document [1] :
>>>>
>>>>       "Characters not allowed directly in STRING_LITERAL_QUOTE (U+0022,
>>>> U+005C, U+000A, U+000D) MUST use ECHAR not UCHAR. "
>>>>
>>>> However, the escape sequences in ECHAR do not seem to include U+005C "\"
>>>> [2]:
>>>>
>>>>       [153s] ECHAR ::= '\' [tbnrf"']
>>>>
>>>> That is, ECHAR defines escapes for \t \b \n \r \f \" \' , but it
>>>> doesn't appear that \\ is allowed for in that grammar. It could be
>>>> escaped using UCHAR as \u005C, but that seems to violate the canonical
>>>> rule that specifically mentions it.
>>>>
>>>> In addition, is it intentional that the list of characters mentioned
>>>> in the canonical section [1] does not include all of the characters
>>>> with escapes defined in ECHAR [2]? Should the characters that appear
>>>> in ECHAR [2] but not in the list in [1] be escaped using UCHAR in
>>>> Canonical documents or be represented using their raw UTF-8 values?
>>>>
>>>> Cheers,
>>>>
>>>> Peter
>>>>
>>>> [1] http://www.w3.org/TR/2013/CR-n-triples-20131105/#conformance
>>>> [2] http://www.w3.org/TR/2013/CR-n-triples-20131105/#grammar-production-ECHAR
>>>>
>>>
>>> Hi Peter,
>>>
>>> Thanks for pointing that out.  It looks like a systematic bug in the
>>> tool chain that we failed to squash.
>>>
>>> I've recorded it on the WG comments:
>>>
>>> http://www.w3.org/2011/rdf-wg/wiki/CR_Comments
>>>
>>> This is not a formal response to your comment.
>>>
>>> I have fixed the documents (which is all subject to WG approval) as
>>> follows, and if you are satisfied, please do send an early confirmation
>>> that your comment has been dealt with to your satisfaction.
>>
>>
>> The removal of the single quote from ECHAR for N-Triples and N-Quads
>> again complicates matters a little, as it requires going backwards,
>> essentially, to what the previous specifications had. The reason I was
>> looking into the grammar yesterday was to make Sesame able to parse
>> RDF-1.1 N-Triples Candidate Recommendation documents that include
>> escaped single quotes \' , as Michel Dumontier had already started
>> escaping single quotes for upcoming Bio2RDF N-Triples and N-Quads data
>> dumps based on the respective Candidate Recommendations. The previous
>> Sesame parsers failed as they were fairly strict to the RDF Test Cases
>> specification where there was a single way to represent each
>> character. I made the change in Git to allow future Sesame releases to
>> parse N-Triples documents that use \'.
>>
>> However I am not sure now whether I should remove that support before
>> it appears in a Sesame release and is relied on by users, given that
>> it will not appear in the next version--and the final version--of the
>> specification. The alternative is to examine whether it is simpler to
>> keep support for \' in the grammar as a useful addition for
>> compatibility with Turtle/TriG/SPARQL, even though it is not strictly
>> necessary given that only double quotes are used for surrounding
>> literals in N-Triples/N-Quads.
>>
>>> Changes:
>>>
>>> N-Triples and N-Quads:
>>>
>>> ECHAR   ::=     '\' [tbnrf"\]
>>>
>>> which does not include ' because strings can't use '-quoting in N-Triples
>>> and N-Quads and there is a desire to minimise the number of ways of
>>> writing the same thing.
>>
>>
>> That is fair, and it was the previous method before the Candidate
>> Recommendations were published, so it is not without precedent.
>>
>> However, the main question for me at this stage is whether having a
>> minimal number of ways of writing single quotes is more beneficial
>> than reverting the change in either a follow-up CR or
>> Proposed Recommendation, given that some users have already started
>> escaping single quotes.
>>
>> There are a few issues that may affect users in both the short-term
>> and the long-term. Although none of them are particularly convincing
>> on their own, together they may give the impression that it is more
>> beneficial to allow \' in N-Triples and N-Quads for consistency with
>> Turtle/TriG/SPARQL:
>>
>> 1) Removing the escaping of single quotes for N-Triples may make
>> N-Triples files created based on the CR unparsable
>>      * The workaround in this case is to use a Turtle parser as it
>> still allows single quote escaping
>>
>> 2) Removing the escaping of single quotes for N-Quads may make N-Quads
>> files created based on the CR unparsable
>>      * As TriG is not structurally compatible with N-Quads, per Richard
>> Smith's recent comments, there is no alternative for this case. I don't
>> see a need for the compatibility personally, as they have different
>> purposes, IMO. The reason I bring it up is that there would be no
>> alternative parser for the files created during the CR period once
>> current parsers again remove support for single-quote escaping.
>>
>> 3) For ease of reference N-Triples could be compatible with a simple
>> line-based version of Turtle and by relation SPARQL.
>>      * The difference in allowing single-quote escaping may make
>> N-Triples documents not a strict-subset of Turtle, in that a valid
>> Turtle file with triples printed line by line, (without long
>> literals/prefixed URIs/etc. but with escaping for all of the allowed
>> Turtle ECHAR escape sequences), may not be parsable by an N-Triples
>> parser that didn't allow for single-quote escaping.
>>
>> 4) Given that the requirement that there be a single way to represent
>> everything has already been relaxed for N-Triples/N-Quads, would it be
>> suitable to specify that single-quotes MAY be escaped, but not
>> SHOULD/MUST be escaped for Canonical N-Triples/Canonical N-Quads? In
>> particular, that would make it possible for parsers to still accept
>> documents produced using the RDF Test Cases Format and the previous
>> N-Quads specification without having a different grammar going
>> forward.
>>
>>> In addition, I've checked Turtle and TriG (Turtle already had a related
>>> fix
>>> recently) to put the characters in the same order because \" is confusing
>>> (it is not escaping a " in the grammar itself).
>>>
>>> ECHAR   ::=     '\' [tbnrf'"\]
>>
>>
>> The SPARQL Query grammar ECHAR already has all of these characters
>> also, although it has the \" sequence which is correct but confusing,
>> as you say.
>>
>>> (Turtle and TriG have a ' as well)
>>>
>>> Links to the rule in the grammar in the editors' drafts:
>>>
>>> N-Triples:
>>>
>>>
>>> https://dvcs.w3.org/hg/rdf/raw-file/default/rdf-turtle/n-triples.html#grammar-production-ECHAR
>>
>>
>> The modification to ECHAR looks good to me, excepting the possibility
>> described above of adding an escape for ' again.
>>
>> Would it be possible to further clarify the way in which characters
>> which are in ECHAR, but not *strictly* disallowed in
>> STRING_LITERAL_QUOTE, should be represented in Canonical N-Triples?
>> The last 2 rules in Section 4 seem to specify that ECHAR MUST *only*
>> be used for the 4 characters which are strictly disallowed in
>> STRING_LITERAL_QUOTE. It may not be obvious to readers why only 4 out
>> of the 7 escape sequences in ECHAR are used in the Canonical form, and
>> also why UCHAR was not used to escape them. The other 3 ECHAR
>> characters which have escape sequences are not directly referred to in
>> Canonical N-Triples, other than by the reference that they should be
>> directly included without using UCHAR.
>>
>>> N-Quads:
>>>
>>>
>>> https://dvcs.w3.org/hg/rdf/raw-file/default/nquads/index.html#grammar-production-ECHAR
>>
>>
>> The modification to ECHAR looks good to me, excepting the possibility
>> described above of adding an escape for ' again.
>>
>> The original N-Quads followed N-Triples (RDF Test Cases format) and
>> hence had a single way to represent each line. Could the RDF-1.1
>> N-Quads specification also include a Canonical N-Quads Document
>> specification that would again specify a single way to represent each
>> line?
>>
>> This wasn't my original query for this thread, so feel free to open a
>> separate issue for this if it is easier to track it that way. It
>> shouldn't be too difficult once the N-Triples Canonical form is
>> finalised to copy it to N-Quads.
>>
>>> Turtle:
>>>
>>>
>>> https://dvcs.w3.org/hg/rdf/raw-file/default/rdf-turtle/index.html#grammar-production-ECHAR
>>
>>
>> The modification to ECHAR looks good to me.
>>
>> In relation to Canonical N-Triples, the section in the Turtle spec
>> describing string escape sequences may be useful as a template to make
>> it clear exactly which way to encode the remaining 3 N-Triples ECHAR
>> characters, as it specifically mentions each of the 8 Turtle escape
>> sequences (although Turtle has no Canonical form so it refers to them
>> as "traditionally" escaped which would need to change for Canonical
>> N-Triples).
>>
>>
>> https://dvcs.w3.org/hg/rdf/raw-file/default/rdf-turtle/index.html#sec-escapes
>>
>>> TriG:
>>>
>>> https://dvcs.w3.org/hg/rdf/raw-file/default/trig/index.html#grammar-production-ECHAR
>>
>>
>> The modification to ECHAR looks good to me.
>>
>> Thanks,
>>
>> Peter
>>
>

Received on Thursday, 28 November 2013 22:28:53 UTC