- From: Andy Seaborne <andy@apache.org>
- Date: Thu, 28 Nov 2013 18:00:15 +0000
- To: Peter Ansell <ansell.peter@gmail.com>
- CC: "public-rdf-comments@w3.org" <public-rdf-comments@w3.org>
To close this off:

> There are a few issues that may affect users in both the short-term
> and the long-term. Although none of them are particularly convincing
> on their own, together they may give the impression that it is more
> beneficial to allow \' in N-Triples and N-Quads for consistency with
> Turtle/TriG/SPARQL:

Yes, on balance, the form in the CR documents where ' is allowed in ECHAR

    ECHAR ::= '\' [tbnrf'"\]

was the preferred form in the working group. It means that parser
tokenizing for N-Triples, N-Quads, Turtle and TriG is the same in this
area.

Editors' drafts are up to date.

Thanks,
Andy

On 18/11/13 23:27, Peter Ansell wrote:
> On 18 November 2013 22:09, Andy Seaborne <andy@apache.org> wrote:
>> On 17/11/13 22:50, Peter Ansell wrote:
>>>
>>> The Conformance section (Section 4) of the RDF-1.1 N-Triples Candidate
>>> Recommendation (05 November 2013) specifies that for a canonical
>>> document [1]:
>>>
>>> "Characters not allowed directly in STRING_LITERAL_QUOTE (U+0022,
>>> U+005C, U+000A, U+000D) MUST use ECHAR not UCHAR. "
>>>
>>> However, the escape sequences in ECHAR do not seem to include U+005C "\"
>>> [2]:
>>>
>>>     [153s] ECHAR ::= '\' [tbnrf"']
>>>
>>> That is, ECHAR defines escapes for \t \b \n \r \f \" \' , but it
>>> doesn't appear that \\ is allowed for in that grammar. It could be
>>> escaped using UCHAR as \u005C, but that seems to violate the canonical
>>> rule that specifically mentions it.
>>>
>>> In addition, is it intentional that the list of characters mentioned
>>> in the canonical section [1] does not include all of the characters
>>> with escapes defined in ECHAR [2]? Should the characters that appear
>>> in ECHAR [2] but not in the list in [1] be escaped using UCHAR in
>>> Canonical documents or be represented using their raw UTF-8 values?
>>>
>>> Cheers,
>>>
>>> Peter
>>>
>>> [1] http://www.w3.org/TR/2013/CR-n-triples-20131105/#conformance
>>> [2] http://www.w3.org/TR/2013/CR-n-triples-20131105/#grammar-production-ECHAR
>>>
>>
>> Hi Peter,
>>
>> Thanks for pointing that out. It looks like a systematic bug in the tool
>> chain that we failed to squash.
>>
>> I've recorded it on the WG comments:
>>
>> http://www.w3.org/2011/rdf-wg/wiki/CR_Comments
>>
>> This is not a formal response to your comment.
>>
>> I have fixed the documents (which is all subject to WG approval) as follows
>> and if you are satisfied, please do send an early confirmation of dealing
>> with your comment to your satisfaction.
>
> The removal of the single quote from ECHAR for N-Triples and N-Quads
> again complicates matters a little, as it requires going backwards,
> essentially, to what the previous specifications had. The reason I was
> looking into the grammar yesterday was to make Sesame able to parse
> RDF-1.1 N-Triples Candidate Recommendation documents that include
> escaped single quotes \' , as Michel Dumontier had already started
> escaping single quotes for upcoming Bio2RDF N-Triples and N-Quads data
> dumps based on the respective Candidate Recommendations. The previous
> Sesame parsers failed as they were fairly strict to the RDF Test Cases
> specification where there was a single way to represent each
> character. I made the change in Git to allow future Sesame releases to
> parse N-Triples documents that use \'.
>
> However I am not sure now whether I should remove that support before
> it appears in a Sesame release and is relied on by users, given that
> it will not appear in the next version--and the final version--of the
> specification.
> The alternative is to examine whether it is simpler to
> keep support for \' in the grammar as a useful addition for
> compatibility with Turtle/TriG/SPARQL, even though it is not strictly
> necessary given that only double quotes are used for surrounding
> literals in N-Triples/N-Quads.
>
>> Changes:
>>
>> N-Triples and N-Quads:
>>
>>     ECHAR ::= '\' [tbnrf"\]
>>
>> which does not include ' because strings can't use '-quoting in N-Triples
>> and N-Quads and there is a desire to minimise the number of ways of writing
>> the same thing.
>
> That is fair, and it was the previous method before the Candidate
> Recommendations were published, so it is not without precedent.
>
> However, the main question for me at this stage is whether having a
> minimal number of ways of writing single quotes is more beneficial at
> this stage than reverting the change in either a follow-up CR or
> Proposed Recommendation, given that some users have already started
> escaping single quotes.
>
> There are a few issues that may affect users in both the short-term
> and the long-term. Although none of them are particularly convincing
> on their own, together they may give the impression that it is more
> beneficial to allow \' in N-Triples and N-Quads for consistency with
> Turtle/TriG/SPARQL:
>
> 1) Removing the escaping of single quotes for N-Triples may make
> N-Triples files created based on the CR unparsable
>    * The workaround in this case is to use a Turtle parser as it
> still allows single quote escaping
>
> 2) Removing the escaping of single quotes for N-Quads may make N-Quads
> files created based on the CR unparsable
>    * As TriG is not structurally compatible with N-Quads, per Richard
> Smiths recent comments, there is no alternative for this case. I don't
> see a need for the compatibility personally, as they have different
> purposes, IMO.
> The reason I bring it up is that there would be no
> alternative parser for the files created during the CR period once
> current parsers again remove support for single-quote escaping.
>
> 3) For ease of reference N-Triples could be compatible with a simple
> line-based version of Turtle and by relation SPARQL.
>    * The difference in allowing single-quote escaping may make
> N-Triples documents not a strict subset of Turtle, in that a valid
> Turtle file with triples printed line by line (without long
> literals/prefixed URIs/etc. but with escaping for all of the allowed
> Turtle ECHAR escape sequences) may not be parsable by an N-Triples
> parser that didn't allow for single-quote escaping.
>
> 5) Given that the requirement that there be a single way to represent
> everything has already been relaxed for N-Triples/N-Quads, would it be
> suitable to specify that single-quotes MAY be escaped, but not
> SHOULD/MUST be escaped for Canonical N-Triples/Canonical N-Quads? In
> particular, that would make it possible for parsers to still accept
> documents produced using the RDF Test Cases Format and the previous
> N-Quads specification without having a different grammar going
> forward.
>
>> In addition, I've checked Turtle and TriG (Turtle already had a related fix
>> recently) to put the characters in the same order because \" is confusing
>> (it is not escaping a " in the grammar itself).
>>
>>     ECHAR ::= '\' [tbnrf'"\]
>
> The SPARQL Query grammar ECHAR already has all of these characters
> also, although it has the \" sequence which is correct but confusing,
> as you say.
>
>> (Turtle and TriG have a ' as well)
>>
>> Links to the rule in the grammar in the editors' drafts:
>>
>> N-Triples:
>>
>> https://dvcs.w3.org/hg/rdf/raw-file/default/rdf-turtle/n-triples.html#grammar-production-ECHAR
>
> The modification to ECHAR looks good to me, excepting the possibility
> described above of adding an escape for ' again.
>
> Would it be possible to further clarify the way in which characters
> which are in ECHAR, but not *strictly* disallowed in
> STRING_LITERAL_QUOTE, should be represented in Canonical N-Triples?
> The last 2 rules in Section 4 seem to specify that ECHAR MUST *only*
> be used for the 4 characters which are strictly disallowed in
> STRING_LITERAL_QUOTE. It may not be obvious to readers why only 4 out
> of the 7 escape sequences in ECHAR are used in the Canonical form, and
> also why UCHAR was not used to escape them. The other 3 ECHAR
> characters which have escape sequences are not directly referred to in
> Canonical N-Triples, other than by the reference that they should be
> directly included without using UCHAR.
>
>> N-Quads:
>>
>> https://dvcs.w3.org/hg/rdf/raw-file/default/nquads/index.html#grammar-production-ECHAR
>
> The modification to ECHAR looks good to me, excepting the possibility
> described above of adding an escape for ' again.
>
> The original N-Quads followed N-Triples (RDF Test Cases format) and
> hence had a single way to represent each line. Could the RDF-1.1
> N-Quads specification also include a Canonical N-Quads Document
> specification that would again specify a single way to represent each
> line?
>
> This wasn't my original query for this thread, so feel free to open a
> separate issue for this if it is easier to track it that way. It
> shouldn't be too difficult once the N-Triples Canonical form is
> finalised to copy it to N-Quads.
>
>> Turtle:
>>
>> https://dvcs.w3.org/hg/rdf/raw-file/default/rdf-turtle/index.html#grammar-production-ECHAR
>
> The modification to ECHAR looks good to me.
>
> In relation to Canonical N-Triples, the section in the Turtle spec
> describing string escape sequences may be useful as a template to make
> it clear exactly which way to encode the remaining 3 N-Triples ECHAR
> characters, as it specifically mentions each of the 8 Turtle escape
> sequences (although Turtle has no Canonical form so it refers to them
> as "traditionally" escaped which would need to change for Canonical
> N-Triples).
>
> https://dvcs.w3.org/hg/rdf/raw-file/default/rdf-turtle/index.html#sec-escapes
>
>> TriG:
>> https://dvcs.w3.org/hg/rdf/raw-file/default/trig/index.html#grammar-production-ECHAR
>
> The modification to ECHAR looks good to me.
>
> Thanks,
>
> Peter
>
Received on Thursday, 28 November 2013 18:00:45 UTC
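
For readers who want to see what the two ECHAR variants discussed in this thread mean for a parser, here is a minimal Python sketch. It is not taken from Sesame, Jena, or any other real implementation. The names unescape_echar, escape_canonical and allow_single_quote are hypothetical, and UCHAR (\uXXXX and \UXXXXXXXX) handling is deliberately left out, so a real N-Triples reader would need more than this. The reader side accepts either ECHAR ::= '\' [tbnrf'"\] or the stricter form without '. The writer side follows the quoted Section 4 rule that U+0022, U+005C, U+000A and U+000D use ECHAR, and it assumes that every other character is written raw.

```python
import re

# Escapes shared by both ECHAR variants discussed in the thread: [tbnrf"\]
ECHAR_COMMON = {
    "t": "\t", "b": "\b", "n": "\n", "r": "\r", "f": "\f",
    '"': '"', "\\": "\\",
}
# Variant that also accepts \' (same tokenizing as Turtle and TriG)
ECHAR_WITH_SQUOTE = {**ECHAR_COMMON, "'": "'"}

# A backslash followed by at most one character; an empty group means a
# dangling backslash at the end of the literal body.
_ESCAPE = re.compile(r"\\(.?)", re.DOTALL)


def unescape_echar(body, allow_single_quote=True):
    """Decode ECHAR escapes in the body of a double-quoted literal.

    UCHAR escapes are not handled in this sketch and are rejected.
    """
    table = ECHAR_WITH_SQUOTE if allow_single_quote else ECHAR_COMMON

    def replace(match):
        char = match.group(1)
        if char not in table:
            raise ValueError("unsupported escape: \\" + char)
        return table[char]

    return _ESCAPE.sub(replace, body)


# Writer side: only the four characters named in the canonical rule are
# escaped; everything else is emitted as-is (an assumption of this sketch).
CANONICAL_ESCAPES = {'"': '\\"', "\\": "\\\\", "\n": "\\n", "\r": "\\r"}


def escape_canonical(value):
    """Escape a string for use inside a canonical double-quoted literal."""
    return "".join(CANONICAL_ESCAPES.get(char, char) for char in value)
```

With allow_single_quote=True the input it\'s decodes to it's, which is the case the Bio2RDF dumps mentioned above rely on. With allow_single_quote=False the same input raises ValueError, which is the failure mode the thread was concerned about.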