- From: Peter F. Patel-Schneider <pfpschneider@gmail.com>
- Date: Wed, 28 Jun 2017 20:12:33 -0700
- To: Gregg Kellogg <gregg@greggkellogg.net>
- Cc: Wouter Beek <w.g.j.beek@vu.nl>, SW-forum Web <semantic-web@w3.org>
It is precisely because behaviour on invalid documents is undefined, as opposed to requiring that parsers signal an error on invalid documents, that N-Triples or Turtle parsers cannot be *overly* lenient. An N-Triples or Turtle parser that accepts invalid documents and turns them into RDF graphs is just being lenient, not overly lenient.

It is possible that whitespace might be required in an N-Triples document to prevent mis-recognition by a Turtle parser because there are more valid token sequences in Turtle than in N-Triples. I don't think that this is the case, but I haven't convinced myself that it isn't.

It is definitely the case that white space is needed in certain places in Turtle documents, but I think that this only happens in Turtle documents that are not N-Triples documents. For example, the white space is needed in ex:ab ex:ac ex:ad. but this is not a valid N-Triples triple.

It's not very clear what counts as mis-recognition in Turtle. Does it mean that there are multiple parses or that there are multiple ways to divide the input into terminals? For example, :::. can only be parsed one way as a Turtle triple but a greedy tokenizer could turn the three colons into a single iri. Note 3 in Section 6.5 does not adequately cover this kind of situation.

Peter F. Patel-Schneider
Nuance Communications

On 06/28/2017 07:30 PM, Gregg Kellogg wrote:
>> On Jun 28, 2017, at 5:31 PM, Peter F. Patel-Schneider <pfpschneider@gmail.com> wrote:
>>
>> It is not possible for N-Triples parsers to be overly lenient, nor is it
>> possible for Turtle parsers to be overly lenient. The Turtle specification
>> has a note in Section 5 on this point.
>
> This note indicates that parsing of non-conforming documents is undefined, not that it is not possible. The presence of numerous tests which include extra white-space would indicate that consuming this, at least, is not considered to be overly lenient.
> IMHO, that is intended to indicate how parsers may or may not recover from parser/tokenizer errors, if triples are produced up to the point the error is discovered. There certainly are parsers that attempt to perform error recovery and continue to generate triples, which is a real-world consideration for handling many large dumps (the previous Freebase dumps, for example).
>
>> However, even though everything you say below is true, it is still the case
>> that the grammar sections in both the N-Triples document and the Turtle
>> document are incorrect and need to be rewritten.
>
> Perhaps an erratum would be sufficient. This might just clarify what “whitespace” means so that it can include sequences of multiple whitespace tokens and where it may be optional. As you note, in N-Triples, whitespace between terminals is always optional (other than within literals).
>
>> It is also not clear that every valid N-Triples document is a valid Turtle
>> document.
>
> How is this not clear? N-Triples is certainly intended to be a strict subset of Turtle.
>
> Gregg
>
>> Peter F. Patel-Schneider
>> Nuance Communications
>>
>>
>> On 06/28/2017 04:48 PM, Gregg Kellogg wrote:
>>> Whitespace is typically taken to mean zero or more characters of whitespace. Note in the Change Log [1]:
>>>
>>>> White space rules defined outside of grammar, as in Turtle [2], although the N-Triples grammar restricts White space to tab or (tab U+0009 or space U+0020).
>>>
>>> If N-Triples parsers are overly lenient in allowing multiple whitespace characters between terminals, then by that logic, so are Turtle parsers.
>>>
>>> The restriction that terminals be separated by exactly a single whitespace is true for the Canonical form of N-Triples [3]. Tokenizers only require whitespace to distinguish two terminals that would otherwise be joined.
>>>
>>> Furthermore, there is a minimal whitespace test [4] that verifies that this is the intention of the working group.
>>>
>>> <http://example/s><http://example/p><http://example/o>.
>>> <http://example/s><http://example/p>"Alice".
>>> <http://example/s><http://example/p>_:o.
>>> _:s<http://example/p><http://example/o>.
>>> _:s<http://example/p>"Alice".
>>> _:s<http://example/p>_:bnode1.
>>>
>>> There is also the original N-Triples test [5] that contains many instances of terminals separated by multiple whitespace characters, for example:
>>>
>>> # spaces and tabs throughout:
>>> <http://example.org/resource3> <http://example.org/property> <http://example.org/resource2> .
>>>
>>> Gregg Kellogg
>>> gregg@greggkellogg.net
>>>
>>> [1] https://www.w3.org/TR/n-triples/#changes-between-last-call-working-draft-and-publication-as-note
>>> [2] https://www.w3.org/TR/turtle/#grammar-production-WS
>>> [3] https://www.w3.org/TR/n-triples/#canonical-ntriples
>>> [4] http://w3c.github.io/rdf-tests/ntriples/lantag_with_subtag.nt
>>>
>>>> On Jun 28, 2017, at 8:58 AM, Peter F. Patel-Schneider <pfpschneider@gmail.com> wrote:
>>>>
>>>> This means that all existing N-Triples parsers are lenient in that they
>>>> process documents that are not valid N-Triples documents. This, however, does
>>>> not make them too lenient as there is no requirement that an N-Triples
>>>> processor reject inputs that are not N-Triples documents.
>>>>
>>>> This does mean that Canonical N-Triples documents are not valid N-Triples
>>>> documents.
>>>>
>>>> peter
>>>>
>>>> PS: Of course what it really means is that the grammar section of the
>>>> N-Triples document needs to be changed.
>>>>
>>>>
>>>> On 06/28/2017 08:50 AM, Wouter Beek wrote:
>>>>>> So it seems to me that spaces are not allowed anywhere in [1] in N-Triples, i.e.,
>>>>>>
>>>>>> <x:y> <x:y> <x:y> .
>>>>>>
>>>>>> is not a valid N-Triples triple.
>>>>>
>>>>> I do follow your reasoning here, but this would mean that all existing
>>>>> N-Triples parsers are too lenient.
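
The point made in the thread that tokenizers only need whitespace to separate terminals that would otherwise be joined can be illustrated with a small sketch. The terminal patterns below are deliberately simplified assumptions for illustration (the real N-Triples grammar is stricter about IRI and blank node label characters, and language tags and datatypes are left out), and the name `tokenize` is hypothetical rather than taken from any existing parser.

```python
import re

# Simplified terminal patterns. These are assumptions for illustration only;
# they are not the productions of the N-Triples grammar.
TOKEN = re.compile(r'''
    (?P<iri><[^<>\s]*>)
  | (?P<bnode>_:[A-Za-z0-9]+)
  | (?P<literal>"(?:[^"\\]|\\.)*")
  | (?P<dot>\.)
  | (?P<ws>[ \t]+)
''', re.VERBOSE)


def tokenize(line):
    """Split one line into (kind, text) terminals, skipping spaces and tabs."""
    tokens = []
    pos = 0
    while pos < len(line):
        match = TOKEN.match(line, pos)
        if match is None:
            raise ValueError(f"unexpected character at offset {pos}: {line[pos]!r}")
        # Whitespace is consumed but never emitted, so zero, one, or many
        # spaces or tabs between terminals all give the same token list.
        if match.lastgroup != "ws":
            tokens.append((match.lastgroup, match.group()))
        pos = match.end()
    return tokens
```

Under these assumptions, `tokenize('_:s<http://example/p>"Alice".')` yields a blank node, an IRI, a literal, and the final dot, and adding extra spaces or tabs between the terminals gives the same token list. Every terminal here begins with a character that cannot continue the previous one, which is why no separator is needed in these lines.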
Received on Thursday, 29 June 2017 03:13:12 UTC