- From: Peter Ansell <ansell.peter@gmail.com>
- Date: Tue, 19 Nov 2013 10:27:33 +1100
- To: Andy Seaborne <andy@apache.org>
- Cc: "public-rdf-comments@w3.org" <public-rdf-comments@w3.org>, michel dumontier <michel.dumontier@gmail.com>
On 18 November 2013 22:09, Andy Seaborne <andy@apache.org> wrote:
> On 17/11/13 22:50, Peter Ansell wrote:
>>
>> The Conformance section (Section 4) of the RDF-1.1 N-Triples Candidate
>> Recommendation (05 November 2013) specifies that for a canonical
>> document [1] :
>>
>> "Characters not allowed directly in STRING_LITERAL_QUOTE (U+0022,
>> U+005C, U+000A, U+000D) MUST use ECHAR not UCHAR. "
>>
>> However, the escape sequences in ECHAR do not seem to include U+005C "\"
>> [2]:
>>
>> [153s] ECHAR ::= '\' [tbnrf"']
>>
>> That is, ECHAR defines escapes for \t \b \n \r \f \" \' , but it
>> doesn't appear that \\ is allowed for in that grammar. It could be
>> escaped using UCHAR as \u005C, but that seems to violate the canonical
>> rule that specifically mentions it.
>>
>> In addition, is it intentional that the list of characters mentioned
>> in the canonical section [1] does not include all of the characters
>> with escapes defined in ECHAR [2]? Should the characters that appear
>> in ECHAR [2] but not in the list in [1] be escaped using UCHAR in
>> Canonical documents or be represented using their raw UTF-8 values.
>>
>> Cheers,
>>
>> Peter
>>
>> [1] http://www.w3.org/TR/2013/CR-n-triples-20131105/#conformance
>> [2] http://www.w3.org/TR/2013/CR-n-triples-20131105/#grammar-production-ECHAR
>>
>
> Hi Peter,
>
> Thanks for pointing that out. It looks a systematic bug in the tool chain
> that we failed to squash.
>
> I've recorded it on the WG comments:
>
> http://www.w3.org/2011/rdf-wg/wiki/CR_Comments
>
> This is not a formal response to your comment.
>
> I have fixed the documents (which is all subject to WG approval) as follows
> and if you are satisfied, please do send an early confirmation of dealing
> with your comment to your satisfaction.

The removal of the single quote from ECHAR for N-Triples and N-Quads again complicates matters a little, as it essentially requires going back to what the previous specifications had.
The reason I was looking into the grammar yesterday was to make Sesame able to parse RDF-1.1 N-Triples Candidate Recommendation documents that include escaped single quotes \' , as Michel Dumontier had already started escaping single quotes for upcoming Bio2RDF N-Triples and N-Quads data dumps based on the respective Candidate Recommendations. The previous Sesame parsers failed because they adhered fairly strictly to the RDF Test Cases specification, where there was a single way to represent each character.

I made the change in Git to allow future Sesame releases to parse N-Triples documents that use \'. However, I am not sure now whether I should remove that support before it appears in a Sesame release and is relied on by users, given that it will not appear in the next version, or in the final version, of the specification.

The alternative is to examine whether it is simpler to keep support for \' in the grammar as a useful addition for compatibility with Turtle/TriG/SPARQL, even though it is not strictly necessary given that only double quotes are used for surrounding literals in N-Triples/N-Quads.

> Changes:
>
> N-Triples and N-Quads:
>
> ECHAR ::= '\' [tbnrf"\]
>
> which does not include ' because strings can't use '-quoting in N-Triples
> and N-Quads and there is a desire to minimise the number of ways of writing
> the same thing.

That is fair, and it was the previous method before the Candidate Recommendations were published, so it is not without precedent. However, the main question for me is whether having a minimal number of ways of writing single quotes is more beneficial at this stage than reverting the change in either a follow-up CR or Proposed Recommendation, given that some users have already started escaping single quotes.

There are a few issues that may affect users in both the short term and the long term.
Although none of them are particularly convincing on their own, together they may give the impression that it is more beneficial to allow \' in N-Triples and N-Quads for consistency with Turtle/TriG/SPARQL:

1) Removing the escaping of single quotes for N-Triples may make N-Triples files created based on the CR unparsable.
   * The workaround in this case is to use a Turtle parser, as it still allows single-quote escaping.

2) Removing the escaping of single quotes for N-Quads may make N-Quads files created based on the CR unparsable.
   * As TriG is not structurally compatible with N-Quads, per Richard Smiths recent comments, there is no alternative for this case. I don't see a need for the compatibility personally, as they have different purposes, IMO. The reason I bring it up is that there would be no alternative parser for the files created during the CR period once current parsers again remove support for single-quote escaping.

3) For ease of reference, N-Triples could be compatible with a simple line-based version of Turtle and, by relation, SPARQL.
   * The difference in allowing single-quote escaping may make N-Triples documents not a strict subset of Turtle, in that a valid Turtle file with triples printed line by line (without long literals, prefixed URIs, etc., but with escaping for all of the allowed Turtle ECHAR escape sequences) may not be parsable by an N-Triples parser that does not allow single-quote escaping.

4) Given that the requirement that there be a single way to represent everything has already been relaxed for N-Triples/N-Quads, would it be suitable to specify that single quotes MAY be escaped, but not SHOULD/MUST be escaped, for Canonical N-Triples/Canonical N-Quads? In particular, that would make it possible for parsers to still accept documents produced using the RDF Test Cases format and the previous N-Quads specification without having a different grammar going forward.
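
To make the last point above concrete, here is a minimal Python sketch of a reader that MAY accept \' while a canonical writer never emits it. This is only an illustration under stated assumptions, not Sesame's implementation: the names `unescape_literal`, `escape_canonical`, and the lookup tables are hypothetical, and the writer assumes the reading of Section 4 in which only U+0022, U+005C, U+000A and U+000D use ECHAR and every other character is written directly.

```python
import re

# Escapes in the revised N-Triples/N-Quads rule: ECHAR ::= '\' [tbnrf"\]
NT_ECHAR = {
    "t": "\t", "b": "\b", "n": "\n", "r": "\r", "f": "\f",
    '"': '"', "\\": "\\",
}

# Hypothetical lenient table: additionally accepts \' as written by
# producers that followed the Candidate Recommendation grammar.
LENIENT_ECHAR = {**NT_ECHAR, "'": "'"}

# A backslash followed by a UCHAR body, any single character, or nothing.
_ESCAPE = re.compile(r"\\(u[0-9A-Fa-f]{4}|U[0-9A-Fa-f]{8}|.?)", re.DOTALL)


def unescape_literal(body, lenient=True):
    """Decode the text between the double quotes of a literal."""
    table = LENIENT_ECHAR if lenient else NT_ECHAR

    def replace(match):
        escape = match.group(1)
        if len(escape) > 1:
            # UCHAR: \uXXXX or \UXXXXXXXX
            return chr(int(escape[1:], 16))
        if escape in table:
            return table[escape]
        raise ValueError("unsupported escape sequence: \\" + escape)

    return _ESCAPE.sub(replace, body)


# Assumed canonical rule: only these four characters are escaped on output.
CANONICAL_ECHAR = {'"': '\\"', "\\": "\\\\", "\n": "\\n", "\r": "\\r"}


def escape_canonical(value):
    """Encode a literal value for output; single quotes are never escaped."""
    return "".join(CANONICAL_ECHAR.get(ch, ch) for ch in value)
```

With `lenient=True` a file written during the CR period still parses, while `lenient=False` follows the revised grammar and rejects \'. Either way the writer produces a single representation.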

> In addition, I've checked Turtle and TriG (Turtle already had a related fix
> recently) to put the characters in the same order because \" is confusing
> (it is not escaping a " in the grammar itself).
>
> ECHAR ::= '\' [tbnrf'"\]

The SPARQL Query grammar ECHAR already has all of these characters also, although it has the \" sequence which is correct but confusing, as you say.

> (Turtle and TriG have a ' as well)
>
> Links to the rule in the grammar in the editors' drafts:
>
> N-Triples:
>
> https://dvcs.w3.org/hg/rdf/raw-file/default/rdf-turtle/n-triples.html#grammar-production-ECHAR

The modification to ECHAR looks good to me, excepting the possibility described above of adding an escape for ' again.

Would it be possible to further clarify the way in which characters which are in ECHAR, but not *strictly* disallowed in STRING_LITERAL_QUOTE, should be represented in Canonical N-Triples? The last 2 rules in Section 4 seem to specify that ECHAR MUST *only* be used for the 4 characters which are strictly disallowed in STRING_LITERAL_QUOTE. It may not be obvious to readers why only 4 out of the 7 escape sequences in ECHAR are used in the Canonical form, and also why UCHAR was not used to escape them. The other 3 ECHAR characters which have escape sequences are not directly referred to in Canonical N-Triples, other than by the implication that they should be included directly without using UCHAR.

> N-Quads:
>
> https://dvcs.w3.org/hg/rdf/raw-file/default/nquads/index.html#grammar-production-ECHAR

The modification to ECHAR looks good to me, excepting the possibility described above of adding an escape for ' again.

The original N-Quads followed N-Triples (RDF Test Cases format) and hence had a single way to represent each line. Could the RDF-1.1 N-Quads specification also include a Canonical N-Quads Document specification that would again specify a single way to represent each line?
This wasn't my original query for this thread, so feel free to open a separate issue for this if it is easier to track it that way. It shouldn't be too difficult once the N-Triples Canonical form is finalised to copy it to N-Quads.

> Turtle:
>
> https://dvcs.w3.org/hg/rdf/raw-file/default/rdf-turtle/index.html#grammar-production-ECHAR

The modification to ECHAR looks good to me.

In relation to Canonical N-Triples, the section in the Turtle spec describing string escape sequences may be useful as a template to make it clear exactly which way to encode the remaining 3 N-Triples ECHAR characters, as it specifically mentions each of the 8 Turtle escape sequences (although Turtle has no Canonical form so it refers to them as "traditionally" escaped which would need to change for Canonical N-Triples).

https://dvcs.w3.org/hg/rdf/raw-file/default/rdf-turtle/index.html#sec-escapes

> TriG:
> https://dvcs.w3.org/hg/rdf/raw-file/default/trig/index.html#grammar-production-ECHAR

The modification to ECHAR looks good to me.

Thanks,

Peter

Received on Monday, 18 November 2013 23:28:00 UTC