- From: Peter Ansell <ansell.peter@gmail.com>
- Date: Fri, 29 Nov 2013 09:28:26 +1100
- To: Andy Seaborne <andy@apache.org>
- Cc: "public-rdf-comments@w3.org" <public-rdf-comments@w3.org>
Hi Andy,

Thanks for working through the ECHAR issue. Having ECHAR consistent across the variations is definitely useful.

Could you also clarify the main issue that I am raising here about the Canonical form for N-Triples and which characters are expected to be escaped? It isn't completely clear whether "\" escapes should be used for all 8 represented characters in ECHAR or just the 4 that are disallowed in STRING_LITERAL_QUOTE [1]. If only the 4 that are disallowed can be used, then one would need to use the raw character, as the last rule in Canonical N-Triples says that the \u form isn't allowed to be used either, but it isn't completely clear whether that is the intention.

If the goal of Canonical N-Triples is to be fairly close to the RDF Test Cases format (with the exception of UTF-8 being preferred over \u|\U) then having at least the 5 escaped ASCII characters that it specifies [2] (and hopefully all 8 from RDF-1.1 N-Triples ECHAR) may make sense.

Thanks,

Peter

[1] http://www.w3.org/TR/2013/CR-n-triples-20131105/#grammar-production-STRING_LITERAL_QUOTE
[2] http://www.w3.org/TR/rdf-testcases/#ntrip_strings

On 29 November 2013 05:00, Andy Seaborne <andy@apache.org> wrote:
> To close this off:
>
>
>> There are a few issues that may affect users in both the short-term
>> and the long-term. Although none of them are particularly convincing
>> on their own, together they may give the impression that it is more
>> beneficial to allow \' in N-Triples and N-Quads for consistency with
>> Turtle/TriG/SPARQL:
>
> Yes, on balance, the form in the CR documents where ' is allowed in ECHAR
>
> ECHAR ::= '\' [tbnrf'"\]
>
> was the preferred form in the working group. It means that parser
> tokenizing for N-Triples, N-Quads, Turtle and Trig are the same in this
> area.
>
> Editors' drafts are up to date.
>
> Thanks,
> Andy
>
>
> On 18/11/13 23:27, Peter Ansell wrote:
>>
>> On 18 November 2013 22:09, Andy Seaborne <andy@apache.org> wrote:
>>>
>>> On 17/11/13 22:50, Peter Ansell wrote:
>>>>
>>>>
>>>> The Conformance section (Section 4) of the RDF-1.1 N-Triples Candidate
>>>> Recommendation (05 November 2013) specifies that for a canonical
>>>> document [1] :
>>>>
>>>> "Characters not allowed directly in STRING_LITERAL_QUOTE (U+0022,
>>>> U+005C, U+000A, U+000D) MUST use ECHAR not UCHAR. "
>>>>
>>>> However, the escape sequences in ECHAR do not seem to include U+005C "\"
>>>> [2]:
>>>>
>>>> [153s] ECHAR ::= '\' [tbnrf"']
>>>>
>>>> That is, ECHAR defines escapes for \t \b \n \r \f \" \' , but it
>>>> doesn't appear that \\ is allowed for in that grammar. It could be
>>>> escaped using UCHAR as \u005C, but that seems to violate the canonical
>>>> rule that specifically mentions it.
>>>>
>>>> In addition, is it intentional that the list of characters mentioned
>>>> in the canonical section [1] does not include all of the characters
>>>> with escapes defined in ECHAR [2]? Should the characters that appear
>>>> in ECHAR [2] but not in the list in [1] be escaped using UCHAR in
>>>> Canonical documents or be represented using their raw UTF-8 values?
>>>>
>>>> Cheers,
>>>>
>>>> Peter
>>>>
>>>> [1] http://www.w3.org/TR/2013/CR-n-triples-20131105/#conformance
>>>> [2] http://www.w3.org/TR/2013/CR-n-triples-20131105/#grammar-production-ECHAR
>>>>
>>>
>>> Hi Peter,
>>>
>>> Thanks for pointing that out. It looks like a systematic bug in the
>>> tool chain that we failed to squash.
>>>
>>> I've recorded it on the WG comments:
>>>
>>> http://www.w3.org/2011/rdf-wg/wiki/CR_Comments
>>>
>>> This is not a formal response to your comment.
>>>
>>> I have fixed the documents (which is all subject to WG approval) as
>>> follows and if you are satisfied, please do send an early confirmation
>>> of dealing with your comment to your satisfaction.
>>
>>
>> The removal of the single quote from ECHAR for N-Triples and N-Quads
>> again complicates matters a little, as it requires going backwards,
>> essentially, to what the previous specifications had. The reason I was
>> looking into the grammar yesterday was to make Sesame able to parse
>> RDF-1.1 N-Triples Candidate Recommendation documents that include
>> escaped single quotes \' , as Michel Dumontier had already started
>> escaping single quotes for upcoming Bio2RDF N-Triples and N-Quads data
>> dumps based on the respective Candidate Recommendations. The previous
>> Sesame parsers failed as they were fairly strict to the RDF Test Cases
>> specification where there was a single way to represent each
>> character. I made the change in Git to allow future Sesame releases to
>> parse N-Triples documents that use \'.
>>
>> However I am not sure now whether I should remove that support before
>> it appears in a Sesame release and is relied on by users, given that
>> it will not appear in the next version--and the final version--of the
>> specification. The alternative is to examine whether it is simpler to
>> keep support for \' in the grammar as a useful addition for
>> compatibility with Turtle/TriG/SPARQL, even though it is not strictly
>> necessary given that only double quotes are used for surrounding
>> literals in N-Triples/N-Quads.
>>
>>> Changes:
>>>
>>> N-Triples and N-Quads:
>>>
>>> ECHAR ::= '\' [tbnrf"\]
>>>
>>> which does not include ' because strings can't use '-quoting in N-Triples
>>> and N-Quads and there is a desire to minimise the number of ways of
>>> writing the same thing.
>>
>>
>> That is fair, and it was the previous method before the Candidate
>> Recommendations were published, so it is not without precedent.
>>
>> However, the main question for me at this stage is whether having a
>> minimal number of ways of writing single quotes is more beneficial at
>> this stage than reverting the change in either a follow-up CR or
>> Proposed Recommendation, given that some users have already started
>> escaping single quotes.
>>
>> There are a few issues that may affect users in both the short-term
>> and the long-term. Although none of them are particularly convincing
>> on their own, together they may give the impression that it is more
>> beneficial to allow \' in N-Triples and N-Quads for consistency with
>> Turtle/TriG/SPARQL:
>>
>> 1) Removing the escaping of single quotes for N-Triples may make
>> N-Triples files created based on the CR unparsable
>>    * The workaround in this case is to use a Turtle parser as it
>> still allows single quote escaping
>>
>> 2) Removing the escaping of single quotes for N-Quads may make N-Quads
>> files created based on the CR unparsable
>>    * As TriG is not structurally compatible with N-Quads, per Richard
>> Smiths recent comments, there is no alternative for this case. I don't
>> see a need for the compatibility personally, as they have different
>> purposes, IMO. The reason I bring it up is that there would be no
>> alternative parser for the files created during the CR period once
>> current parsers again remove support for single-quote escaping.
>>
>> 3) For ease of reference N-Triples could be compatible with a simple
>> line-based version of Turtle and by relation SPARQL.
>>    * The difference in allowing single-quote escaping may make
>> N-Triples documents not a strict-subset of Turtle, in that a valid
>> Turtle file with triples printed line by line, (without long
>> literals/prefixed URIs/etc. but with escaping for all of the allowed
>> Turtle ECHAR escape sequences), may not be parsable by an N-Triples
>> parser that didn't allow for single-quote escaping.
>>
>> 5) Given that the requirement that there be a single way to represent
>> everything has already been relaxed for N-Triples/N-Quads, would it be
>> suitable to specify that single-quotes MAY be escaped, but not
>> SHOULD/MUST be escaped for Canonical N-Triples/Canonical N-Quads? In
>> particular, that would make it possible for parsers to still accept
>> documents produced using the RDF Test Cases Format and the previous
>> N-Quads specification without having a different grammar going
>> forward.
>>
>>> In addition, I've checked Turtle and TriG (Turtle already had a related
>>> fix recently) to put the characters in the same order because \" is
>>> confusing (it is not escaping a " in the grammar itself).
>>>
>>> ECHAR ::= '\' [tbnrf'"\]
>>
>>
>> The SPARQL Query grammar ECHAR already has all of these characters
>> also, although it has the \" sequence which is correct but confusing,
>> as you say.
>>
>>> (Turtle and TriG have a ' as well)
>>>
>>> Links to the rule in the grammar in the editors' drafts:
>>>
>>> N-Triples:
>>>
>>>
>>> https://dvcs.w3.org/hg/rdf/raw-file/default/rdf-turtle/n-triples.html#grammar-production-ECHAR
>>
>>
>> The modification to ECHAR looks good to me, excepting the possibility
>> described above of adding an escape for ' again.
>>
>> Would it be possible to further clarify the way in which characters
>> which are in ECHAR, but not *strictly* disallowed in
>> STRING_LITERAL_QUOTE, should be represented in Canonical N-Triples?
>> The last 2 rules in Section 4 seem to specify that ECHAR MUST *only*
>> be used for the 4 characters which are strictly disallowed in
>> STRING_LITERAL_QUOTE. It may not be obvious to readers why only 4 out
>> of the 7 escape sequences in ECHAR are used in the Canonical form, and
>> also why UCHAR was not used to escape them.
>> The other 3 ECHAR
>> characters which have escape sequences are not directly referred to in
>> Canonical N-Triples, other than by the reference that they should be
>> directly included without using UCHAR.
>>
>>> N-Quads:
>>>
>>>
>>> https://dvcs.w3.org/hg/rdf/raw-file/default/nquads/index.html#grammar-production-ECHAR
>>
>>
>> The modification to ECHAR looks good to me, excepting the possibility
>> described above of adding an escape for ' again.
>>
>> The original N-Quads followed N-Triples (RDF Test Cases format) and
>> hence had a single way to represent each line. Could the RDF-1.1
>> N-Quads specification also include a Canonical N-Quads Document
>> specification that would again specify a single way to represent each
>> line?
>>
>> This wasn't my original query for this thread, so feel free to open a
>> separate issue for this if it is easier to track it that way. It
>> shouldn't be too difficult once the N-Triples Canonical form is
>> finalised to copy it to N-Quads.
>>
>>> Turtle:
>>>
>>>
>>> https://dvcs.w3.org/hg/rdf/raw-file/default/rdf-turtle/index.html#grammar-production-ECHAR
>>
>>
>> The modification to ECHAR looks good to me.
>>
>> In relation to Canonical N-Triples, the section in the Turtle spec
>> describing string escape sequences may be useful as a template to make
>> it clear exactly which way to encode the remaining 3 N-Triples ECHAR
>> characters, as it specifically mentions each of the 8 Turtle escape
>> sequences (although Turtle has no Canonical form so it refers to them
>> as "traditionally" escaped which would need to change for Canonical
>> N-Triples).
>>
>>
>> https://dvcs.w3.org/hg/rdf/raw-file/default/rdf-turtle/index.html#sec-escapes
>>
>>> TriG:
>>>
>>> https://dvcs.w3.org/hg/rdf/raw-file/default/trig/index.html#grammar-production-ECHAR
>>
>>
>> The modification to ECHAR looks good to me.
>>
>> Thanks,
>>
>> Peter
>>
>
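
For illustration only, the two readings of the canonical rule discussed in this thread could be sketched in Python as below. This is not taken from Sesame or from any specification text: the names escape_literal, REQUIRED_ESCAPES and OPTIONAL_ESCAPES are hypothetical, and the split assumes that the 4 characters listed in the quoted conformance rule must use ECHAR while \t, \b and \f are the 3 remaining escapes whose canonical treatment is the open question. Single quotes are left raw in both readings.

```python
# Hypothetical helper names; not part of any RDF library or specification.

# The 4 characters the quoted conformance rule says MUST use ECHAR:
# U+0022, U+005C, U+000A, U+000D.
REQUIRED_ESCAPES = {
    '"': '\\"',
    "\\": "\\\\",
    "\n": "\\n",
    "\r": "\\r",
}

# The 3 remaining escapes in the revised N-Triples ECHAR [tbnrf"\]
# that are not strictly disallowed in STRING_LITERAL_QUOTE.
OPTIONAL_ESCAPES = {
    "\t": "\\t",
    "\b": "\\b",
    "\f": "\\f",
}


def escape_literal(text: str, escape_optional: bool = False) -> str:
    """Escape the body of a double-quoted N-Triples literal.

    escape_optional=False: only the 4 required characters are escaped
    and everything else (including tab) is written raw.
    escape_optional=True: all 7 ECHAR sequences are used.
    """
    table = dict(REQUIRED_ESCAPES)
    if escape_optional:
        table.update(OPTIONAL_ESCAPES)
    # Characters without an entry in the table are emitted unchanged.
    return "".join(table.get(ch, ch) for ch in text)


sample = 'say "hi"\tback\\slash\n'
print(escape_literal(sample))
print(escape_literal(sample, escape_optional=True))
```

Under the first reading a tab inside a literal stays a raw tab; under the second it becomes \t. That difference is exactly what the canonical form would need to settle.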
Received on Thursday, 28 November 2013 22:28:53 UTC