Fixing N-Quads and Turtle

Fixing the N-Quads and Turtle grammars is harder, or at least requires a
different approach, because white space is required in some places in these
languages.

I think that the grammar has to be stated something like:


A Turtle document is a Unicode[UNICODE] character string encoded in UTF-8
that can be recognized using the standard two-stage process of left-to-right
greedy tokenization followed by context-free parsing augmented with some
context-sensitive constraints.

The first stage turns the sequence of Unicode code points into a sequence of
tokens using left-to-right greedy tokenization with the following regular
expressions:

....

Note: Because the tokenization is left-to-right and greedy, 0.0 is turned
into a single DECIMAL token, not an INTEGER token followed by a DECIMAL
token.
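
As a minimal sketch of the greedy rule in the note above, the following
Python tries every pattern at the current position and keeps the longest
match. The token set is a deliberately small assumption (INTEGER, DECIMAL,
DOT, WS), not the full list of Turtle terminals, and the function name
tokenize is hypothetical.

```python
import re

# Assumed, simplified token patterns; the real terminal list is longer.
PATTERNS = [(name, re.compile(rx)) for name, rx in [
    ("INTEGER", r"[+-]?[0-9]+"),
    ("DECIMAL", r"[+-]?[0-9]*\.[0-9]+"),
    ("DOT", r"\."),
    ("WS", r"[ \t\r\n]+"),
]]


def tokenize(text):
    """Left-to-right greedy tokenization: longest match wins at each offset."""
    tokens = []
    pos = 0
    while pos < len(text):
        best = None  # (token name, end offset) of the longest match so far
        for name, rx in PATTERNS:
            m = rx.match(text, pos)
            if m and m.end() > pos and (best is None or m.end() > best[1]):
                best = (name, m.end())
        if best is None:
            raise ValueError(f"no token matches at offset {pos}")
        name, end = best
        tokens.append((name, text[pos:end]))
        pos = end
    return tokens


print(tokenize("0.0"))
print(tokenize("1 ."))
```

With these patterns, 0.0 comes out as one DECIMAL token, while "1 ." comes
out as INTEGER, WS, DOT, which is why the white space carries meaning.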

Note: Language tags are not limited to the recognized language tags of ???.
As a consequence, this stage treats strings like "hi"@prefix as a
language-tagged string and not a simple string followed by the start of a
directive.
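
To illustrate the note above, here is a small sketch showing that a
language tag is recognized purely by its shape, so @prefix qualifies. The
LANGTAG pattern follows the shape used in the Turtle grammar; the string
pattern is a simplified assumption (double quotes, no escapes), and the
helper name adjacent_langtag is hypothetical. The sketch does not settle how
a full tokenizer should rank LANGTAG against the @prefix keyword.

```python
import re

# LANGTAG shape: '@' [a-zA-Z]+ ('-' [a-zA-Z0-9]+)*
LANGTAG = re.compile(r"@[a-zA-Z]+(?:-[a-zA-Z0-9]+)*")
# Assumed, simplified string literal: double-quoted, no escapes.
SIMPLE_STRING = re.compile(r'"[^"\\\r\n]*"')


def adjacent_langtag(text):
    """Return (string, langtag) when a string is directly followed by a
    LANGTAG-shaped run of characters, otherwise None."""
    s = SIMPLE_STRING.match(text)
    if not s:
        return None
    t = LANGTAG.match(text, s.end())
    if not t:
        return None
    return (s.group(), t.group())


print(adjacent_langtag('"hi"@prefix'))
print(adjacent_langtag('"hi"@en-GB'))
```

Nothing in the pattern consults a registry of language tags, so "prefix" is
accepted as a tag just as "en-GB" is.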

The second stage takes the token sequence with the WS tokens removed and
attempts to parse it using the following BNF grammar:

.....

During this stage, the prefix of any prefixed name must have been declared
by a preceding prefixID or sparqlPrefix directive.
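
The context-sensitive constraint above can be sketched as a single pass over
an already-tokenized document. The token kinds PREFIX_DECL and PNAME and the
function name undeclared_prefixed_names are assumptions made for this
illustration; a real parser would apply the check while reducing prefixID,
sparqlPrefix, and prefixed-name productions.

```python
def undeclared_prefixed_names(tokens):
    """Return the prefixed names whose prefix was not declared earlier.

    tokens is a list of (kind, value) pairs. Assumed kinds:
      "PREFIX_DECL" with a value such as "ex:"  (from either directive form)
      "PNAME"       with a value such as "ex:thing"
    Any other kind is ignored.
    """
    declared = set()
    problems = []
    for kind, value in tokens:
        if kind == "PREFIX_DECL":
            declared.add(value)
        elif kind == "PNAME":
            # Everything up to and including the first colon is the prefix.
            prefix = value.split(":", 1)[0] + ":"
            if prefix not in declared:
                problems.append(value)
    return problems


doc = [
    ("PREFIX_DECL", "ex:"),
    ("PNAME", "ex:a"),
    ("PNAME", "foaf:name"),  # foaf: is never declared
]
print(undeclared_prefixed_names(doc))
```

Order matters in this sketch: a prefixed name that appears before its
declaration is reported, matching the "previous directive" wording.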



N-Quads can use a slightly simpler setup, as it has no context-sensitive
aspect.


peter

Received on Thursday, 29 June 2017 13:25:45 UTC