Several minor problems in the grammar for the functional-style syntax

Hello,

Yevgeny Kazakov is currently trying to implement the functional-style syntax at
our lab, and he has found a number of minor problems in our definitions. I
present below the problems, as well as the possible solutions. Most of the
problems are caused by the syntax of CURIE, which is defined like this:

curie := [[prefix] ":"] irelative-ref
prefix := NCName
NCName := defined by XML
irelative-ref := defined by the IRI spec
 

1. The CURIE spec is not clear about whether the prefix, the ":", and the
irelative-ref in a CURIE can be separated by whitespace. This makes parsing
CURIEs such as a:b:c ambiguous, as it is not clear whether one means
    a:b :c
or
    a :b:c.

This problem could be solved if we made the 'curie' production a terminal and
explicitly stated that it must not contain any whitespace.
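
To illustrate, here is a rough Python sketch of what treating 'curie' as a
whitespace-free terminal would mean. The NCNAME pattern is a simplified,
ASCII-only stand-in for the XML definition, "\S+" is a crude stand-in for
irelative-ref, and the function name read_curies is made up for this example.
The pattern also prefers to read a leading NCName plus ":" as the prefix, which
is a choice of this sketch and not something the CURIE spec mandates.

```python
import re

# Simplified, ASCII-only stand-in for XML's NCName (assumption of this sketch).
NCNAME = r"[A-Za-z_][A-Za-z0-9_.\-]*"

# One CURIE as a single terminal: [[prefix] ":"] reference, no whitespace inside.
# "\S+" is only a crude stand-in for irelative-ref.
CURIE_TERMINAL = re.compile(rf"(?:({NCNAME})?:)?(\S+)")

def read_curies(text):
    """Split on whitespace, then read each chunk as exactly one CURIE."""
    result = []
    for chunk in text.split():
        match = CURIE_TERMINAL.fullmatch(chunk)
        result.append((match.group(1), match.group(2)))
    return result
```

With this reading, whitespace alone decides where one CURIE ends and the next
begins, so "a:b :c" and "a :b:c" are two different inputs and "a:b:c" is a
single CURIE.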


2. We use @()^"=<>: as special characters in the spec -- that is, we use them as
stand-alone terminals. Ideally, we'd want the other terminals not to contain
these. This, however, is not the case: while NCName cannot contain any of these,
irelative-ref can contain the characters "@=():". The latter is quite
unfortunate: if you write 
   abc)
it is not clear whether the closing parenthesis is part of the irelative-ref or
not. This prevents the functional-style syntax from being tokenized correctly.

Another problem is that, because irelative-ref can contain :, we cannot
unambiguously parse the simple CURIE "a:b". One way of parsing it is as "a", ":",
and "b", but another way is to parse it as a simple irelative-ref with the value
"a:b".

We could fix these problems by changing the spec such that, in contrast to the
CURIE spec, we allow irelative-ref to be only NCName. In this way, no CURIE can
contain the dangerous characters, so we are fine. Furthermore, the grammar for
CURIE becomes NCName ":" NCName, and, since NCName cannot contain ":", we can
parse CURIEs correctly.
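
A minimal Python sketch of the restricted form NCName ":" NCName follows. As
before, NCNAME is an ASCII-only approximation of the XML definition, and the
function names parse_curie and leading_curie are hypothetical.

```python
import re

# Simplified, ASCII-only stand-in for XML's NCName (assumption of this sketch).
NCNAME = r"[A-Za-z_][A-Za-z0-9_.\-]*"

# Restricted CURIE: NCName ":" NCName, with no whitespace in between.
RESTRICTED_CURIE = re.compile(rf"({NCNAME}):({NCNAME})")

def parse_curie(text):
    """Return (prefix, local name) if text is exactly one restricted CURIE."""
    match = RESTRICTED_CURIE.fullmatch(text)
    return (match.group(1), match.group(2)) if match else None

def leading_curie(text):
    """Return the restricted CURIE at the start of text, or None."""
    match = RESTRICTED_CURIE.match(text)
    return match.group(0) if match else None
```

Because neither NCName can contain ":" or ")", "a:b" splits in only one way,
and in "a:abc)" the closing parenthesis is left outside the CURIE.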



3. There is an ambiguity between CURIE and nodeID: the string
    _:abc
can be parsed either as a single terminal matching the nodeID production, or as
three terminals "_" ":" "abc" matching the CURIE production. (Note that _ is a
valid NCName.)

To fix this, our version of the 'curie' production should prevent a CURIE
from starting with "_:". This is OK: the actual CURIE spec says that this type
of usage can be disallowed in a host language, and it explicitly mentions RDF.
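
The exclusion could be sketched in Python as below. This assumes the
NCName-only CURIE from problem 2 with an optional prefix, an ASCII-only
approximation of NCName, and a nodeID of the form "_:" followed by an
NCName-like name; the exact nodeID definition may differ, and the function name
classify is made up for this example.

```python
import re

# Simplified, ASCII-only stand-in for XML's NCName (assumption of this sketch).
NCNAME = r"[A-Za-z_][A-Za-z0-9_.\-]*"

# Assumed shape of a nodeID: "_:" followed by an NCName-like name.
NODE_ID = re.compile(rf"_:{NCNAME}")

# CURIE with the proposed restriction: the negative lookahead rejects "_:".
CURIE = re.compile(rf"(?!_:)(?:(?:{NCNAME})?:)?{NCNAME}")

def classify(token):
    """Return "nodeID", "curie", or None; the two patterns never overlap."""
    if NODE_ID.fullmatch(token):
        return "nodeID"
    if CURIE.fullmatch(token):
        return "curie"
    return None
```

With the lookahead in place, "_:abc" matches only the nodeID pattern, while a
name such as "_abc" (where "_" is not followed by ":") is still a valid CURIE.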


4. There is a general problem with the fact that our reserved words match the
'curie' production; for example, "ObjectUnionOf" is a perfectly valid CURIE
(even with the fixes outlined above). This is clearly a problem, as it means our
grammar is not LL(1); for example, to parse
    ObjectUnionOf( abc )
we need to look two tokens down the line (i.e., only after we see "(" do we know
that we must have been in the production for "ObjectUnionOf"). Perhaps our
grammar is such that, by increasing the lookahead, we can circumvent this
problem; however, I am not sure of that, and this is a really sketchy solution
that is very likely to cause problems in practice.

We can avoid this problem by saying that the 'curie' production MUST NOT match
any of the terminal symbols; that is, instead of using a CURIE that matches
one of the terminals, one MUST spell out such a CURIE as a full IRI (which is
enclosed in <> and is therefore fine).
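
In a tokenizer, this rule amounts to checking the reserved words before trying
the 'curie' pattern. A rough Python sketch, assuming the NCName-only CURIE with
an optional prefix and an ASCII-only approximation of NCName; KEYWORDS lists
only a few of the reserved words for illustration, and the function name
classify_token is hypothetical.

```python
import re

# Simplified, ASCII-only stand-in for XML's NCName (assumption of this sketch).
NCNAME = r"[A-Za-z_][A-Za-z0-9_.\-]*"
CURIE = re.compile(rf"(?:(?:{NCNAME})?:)?{NCNAME}")

# Illustrative subset of the reserved words, not the full list.
KEYWORDS = frozenset({"ObjectUnionOf", "ObjectIntersectionOf", "ObjectComplementOf"})

def classify_token(token):
    """Reserved words win over CURIEs; a full IRI in <> is never a keyword."""
    if token in KEYWORDS:
        return "keyword"
    if token.startswith("<") and token.endswith(">"):
        return "full-iri"
    if CURIE.fullmatch(token):
        return "curie"
    raise ValueError(f"unrecognized token: {token!r}")
```

The token "ObjectUnionOf" is then always a keyword, so a single token of
lookahead is enough, and an entity with that name has to be written as a full
IRI in angle brackets.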


5. It is currently unclear whether "quotedString" can contain CRLF. The current
definition seems to allow this, but Yevgeny was confused. We could perhaps just
add a clarification that says "yes, it is allowed".
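
For what it is worth, here is a small Python sketch of one possible reading of
"quotedString". It assumes the definition amounts to "any characters between
double quotes, where a quote or backslash inside must be escaped with a
backslash"; that reading is an assumption of this example, and the name
is_quoted_string is made up.

```python
import re

# Assumed reading: '"' ... '"', where " and \ inside appear only as \" and \\.
# The negated character class matches CR and LF like any other character.
QUOTED_STRING = re.compile(r'"(?:[^"\\]|\\["\\])*"')

def is_quoted_string(text):
    """True if text is exactly one quotedString under the assumed reading."""
    return QUOTED_STRING.fullmatch(text) is not None
```

Under this reading a string spanning several lines is accepted, which is the
behaviour the clarification would confirm.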


Please let me know how you feel about my proposals.
 
Regards,

	Boris

Received on Saturday, 21 March 2009 22:03:09 UTC