RE: Several minor problems in the grammar for the functional-style syntax from Boris Motik on 2009-03-22 (public-owl-wg@w3.org from March 2009)

From: Boris Motik <boris.motik@comlab.ox.ac.uk>
Date: Sun, 22 Mar 2009 09:26:44 -0000
To: "'Ivan Herman'" <ivan@w3.org>
Cc: "'W3C OWL Working Group'" <public-owl-wg@w3.org>
Message-ID: <BAA4EBDA4EA5420EAC063ABE9938F099@wolf>
Hello,

[snip]

> >                                                                           I
> > present below the problems, as well as the possible solutions. Most of the
> > problems are caused by the syntax of CURIE, which is defined like this:
> >
> > curie := [[prefix] ":"] irelative-ref
> > prefix := NCName
> > NCName := defined by XML
> > irelative-ref: defined by the IRI spec
> >
> >
> > 1. The CURIE spec is not clear regarding whether the prefix, :, and the
> > irelative-ref in a CURIE can be separated by a whitespace. This makes
> parsing
> > CURIEs such as a:b:c ambiguous, as it is not clear whether one means
> >     a:b :c
> > or
> >     a :b:c.
> >
> > This problem could be solved if we made the 'curie' production a terminal
> and
> > explicitly state that there should be no spaces in it.
> >
> 
> Isn't it correct that NCName cannot contain whitespace? Then my reading
> of the grammar above is that it is _not_ allowed to have whitespace there...
> 

It is true that NCName cannot contain whitespace, and this is not the issue
here. The question is whether whitespace is allowed *between* prefix, ":", and
irelative-ref -- that is, between NCName tokens. 

> >
> > 2. We use @()^"=<>: as special characters in the spec -- that is, we use
> them as
> > stand-alone terminals. Ideally, we'd want the other terminals not to contain
> > these. This, however, is not the case: while NCName cannot contain any of
> these,
> > irelative-ref can contain the characters "@=():". The latter is quite
> > unfortunate: if you write
> >    abc)
> > it is not clear whether the closing parenthesis is part of the irelative-ref
> or
> > not. This prevents the functional-style syntax from being tokenized
> correctly.
> >
> > Another problem is that, because irelative-ref can contain :, we cannot
> > unambiguously parse the simple CURIE "a:b". One way of parsing it is as "a",
> ":",
> > and "b", but another way is to parse it as a simple irelative-ref with the
> value
> > "a:b".
> >
> > We could fix these problems by changing the spec such that, in contrast to
> the
> > CURIE spec, we allow irelative-ref to be only NCName. In this way, no CURIE
> can
> > contain the dangerous characters, so we are fine. Furthermore, the grammar
> for
> > CURIE becomes NCName ":" NCName, and, since NCName cannot contain ":", we
> can
> > parse CURIEs correctly.
> >
> >
> 
> Ouch. I see the issue. This means that some valid URI-s like
> 
> http://www.w.w/#xpointer(id('a'))
> 
> (from http://www.w3.org/TR/xptr-framework/)
> 
> cannot be expressed as CURIES in the FS. It is not a huge deal, of
> course (we can always use explicit URI-s) but it is still a bit of a pain.
> 
> Just exploring an alternative: what if the way we modify the syntax is
> to disallow reference without a prefix? Ie, we could say:
> 
> curie := [prefix] ':' reference
> 
> What this means is that
> 
> Namespace(bla=http://www.w.w/#)
> bla:xpointer(id('a'))
> 
> is not a terminal because the prefix is there, so is
> 
> :xpointer(id('afasd'))
> 
> because the leading ':' is there (and the default namespace is used)
> and, finally,
> 
> xpointer(id('a'))
> 
> is a terminal because there is no prefix mechanism at all, ie, it is not
> a curie.
> 
> I believe that the CURIE spec should allow a host language to do that
> and, I believe, it does not at the moment. Maybe something to report back...
> 

This would solve the second part of the problem; however, it would not solve the
former part of the problem, which is due to the fact that CURIEs can contain the
special characters @()^"=<> that we use to delimit the syntax. Consider the
following axioms (there is deliberately no space between the axioms; we allow
this at the moment):

EquivalentClasses(A b:B)EquivalentClasses( c:C d:D )

Since CURIEs can contain (), we cannot use these characters to delineate parts
of our syntax, which makes parsing the above axioms ambiguous. One way to parse
them is to break the axioms at the first ); however, another way is to parse
these as "The classes identified by the CURIEs A, b:B)EquivalentClasses(, c:C,
and d:D are all equivalent".

In fact, the closing parenthesis is a CURIE as well; hence, parsing the simple
axiom

SubClassOf (a:A b:B)

is ambiguous: it is unclear whether the closing parenthesis is a part of the
CURIE or not; hence, the axiom might be either correct or a syntax error.

I believe that the only (sane) way of fixing this problem is to prohibit the
special characters @()^"=<> from occurring in the CURIEs. The only way to make
this happen is to strengthen the irelative-ref production. My proposal was to go
all the way down to NCName, which would make the grammar simpler.
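
To illustrate why restricting the reference part to NCName removes the
ambiguity, here is a minimal tokenizer sketch. It is only an illustration under
stated assumptions: NCNAME is a simplified, ASCII-only stand-in for the XML
NCName production, and the names TOKEN and tokenize are hypothetical, not part
of any spec or existing parser.

```python
import re

# Assumption: simplified ASCII-only approximation of the XML NCName production.
NCNAME = r"[A-Za-z_][A-Za-z0-9_.\-]*"

# CURIE with the reference part restricted to NCName: [[prefix] ":"] NCName.
# The whole CURIE is one terminal, so no whitespace may occur inside it.
CURIE = rf"(?:(?:{NCNAME})?:)?{NCNAME}"

# A token is either a CURIE-like name or one of the special characters.
TOKEN = re.compile(rf"\s*(?:(?P<curie>{CURIE})|(?P<special>[@()^\"=<>:]))")


def tokenize(text):
    """Split text into CURIE-like names and special characters."""
    text = text.rstrip()
    tokens = []
    pos = 0
    while pos < len(text):
        match = TOKEN.match(text, pos)
        if match is None:
            raise ValueError(f"unexpected character at offset {pos}")
        tokens.append(match.group("curie") or match.group("special"))
        pos = match.end()
    return tokens


print(tokenize("EquivalentClasses(A b:B)EquivalentClasses( c:C d:D )"))
print(tokenize("SubClassOf (a:A b:B)"))
```

Because none of the special characters can occur inside a name under this
restriction, the parentheses always come out as stand-alone tokens, and the two
adjacent axioms split at the first ")".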

We can send this feedback to the editors of the CURIE spec; however, I doubt
that they will be able to really come up with a magical solution: I really think
we will need to kick @()^"=<> out of CURIEs.



There is another potential issue with this solution: if we require CURIEs to be
qualified, can we then have the default namespace? I guess we can; we'd just
need to write :ABC instead of just ABC.

> >
> > 3. There is an ambiguity between CURIE and nodeID: the string
> >     _:abc
> > can be parsed either as a single terminal matching the nodeID production, or
> as
> > three terminals "_" ":" "abc" matching the CURIE production. (Note that _ is
> a
> > valid NCName.)
> >
> > To fix this, in our version of the 'curie' production we should prevent a
> CURIE
> > from starting with "_:". This is OK: the actual CURIE spec says that this type of
> > usage can be disallowed in a host language and they explicitly mention RDF.
> >
> 
> I am not sure I understand. In RDFa, for example, the curie production
> '_:X' is used for BNodes which is in line with our definition of nodeID.
> CURIE allows the definition of '_:' in a specific host language as we
> want. So what is the problem exactly?
> 

The problem is that our grammar uses two productions: one for CURIE and one for
nodeIDs. Since there is lexical overlap between the two, it is not clear
whether, for example, _:a is a CURIE or a nodeID. In order to obtain an
unambiguous grammar, we would need to strengthen our definitions such that the
CURIE production does not match _:a, but only nodeID does.
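
The overlap, and one possible way of removing it, can be sketched as follows.
This is only an illustration: NCNAME is a simplified ASCII-only stand-in for
the XML NCName production, nodeID is approximated as "_:" followed by an
NCName, and the pattern and function names are hypothetical.

```python
import re

# Assumption: simplified ASCII-only approximation of the XML NCName production.
NCNAME = r"[A-Za-z_][A-Za-z0-9_.\-]*"

# Assumption: nodeID approximated as "_:" followed by an NCName.
NODE_ID = re.compile(rf"_:{NCNAME}")

# Qualified CURIE as it stands: "_" is a valid NCName, so "_:a" matches.
CURIE_LOOSE = re.compile(rf"(?:{NCNAME})?:{NCNAME}")

# Strengthened CURIE: a negative lookahead forbids the leading "_:".
CURIE_STRICT = re.compile(rf"(?!_:)(?:{NCNAME})?:{NCNAME}")


def matching_productions(token, curie_pattern):
    """Return the names of the productions that match the whole token."""
    names = []
    if NODE_ID.fullmatch(token):
        names.append("nodeID")
    if curie_pattern.fullmatch(token):
        names.append("curie")
    return names


print(matching_productions("_:a", CURIE_LOOSE))   # both productions match
print(matching_productions("_:a", CURIE_STRICT))  # only nodeID matches
print(matching_productions("a:b", CURIE_STRICT))  # ordinary CURIEs still match
```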

> >
> > 4. There is a general problem with the fact that our reserved words match
> the
> > 'curie' production; for example, "ObjectUnionOf" is a perfectly valid CURIE
> > (even with the fixes outlined above). This is clearly a problem, as it makes
> our
> > grammar not be LL(1); for example, to parse
> >     ObjectUnionOf( abc )
> > we need to look two tokens down the line (i.e., only after you see "(" we
> know
> > that we must have been in the production for "ObjectUnionOf"). Perhaps our
> > grammar is such that, by increasing the lookahead, we can circumvent this
> > problem; however, I am not sure of that, and this is a really sketchy
> solution
> > that is very likely to cause problems in practice.
> >
> > We can avoid this problem by saying that the 'curie' production MUST NOT
> match
> > one of the terminal symbols; that is, instead of using a CURIE that matches
> > one of the terminals, one MUST spell out such a CURIE as a full IRI (which is
> > enclosed in <> and is therefore fine).
> >
> 
> Doesn't the approach of disallowing the reference alone solve this
> problem, too?
> 

If we prohibit the occurrence of the special characters, then requiring a CURIE
to contain ":" does indeed solve the problem, simply because an unqualified word
is not a CURIE.
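
A small sketch of this point, again under stated assumptions: NCNAME is a
simplified ASCII-only stand-in for the XML NCName production, KEYWORDS is a
small illustrative subset of the reserved words, and all names used here are
hypothetical.

```python
import re

# Assumption: simplified ASCII-only approximation of the XML NCName production.
NCNAME = r"[A-Za-z_][A-Za-z0-9_.\-]*"

# CURIE where the prefix and ":" are optional: [[prefix] ":"] NCName.
OPTIONAL_COLON_CURIE = re.compile(rf"(?:(?:{NCNAME})?:)?{NCNAME}")

# CURIE where ":" is mandatory: [prefix] ":" NCName.
QUALIFIED_CURIE = re.compile(rf"(?:{NCNAME})?:{NCNAME}")

# Illustrative subset of the reserved words of the functional-style syntax.
KEYWORDS = {"ObjectUnionOf", "SubClassOf", "EquivalentClasses"}


def is_ambiguous(word, curie_pattern):
    """True if a reserved word would also be accepted as a CURIE."""
    return word in KEYWORDS and curie_pattern.fullmatch(word) is not None


print(is_ambiguous("ObjectUnionOf", OPTIONAL_COLON_CURIE))  # keyword is also a CURIE
print(is_ambiguous("ObjectUnionOf", QUALIFIED_CURIE))       # keyword is not a CURIE
print(QUALIFIED_CURIE.fullmatch(":ObjectUnionOf") is not None)
```

With the mandatory ":", a class named like a reserved word can still be written
in the default namespace as :ObjectUnionOf, so a single token is enough to tell
a keyword from a CURIE.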

> 
> >
> > 5. It is currently unclear whether "quotedString" can contain CRLF. The
> current
> > definition seems to allow this, but Yevgeny was confused. We could perhaps
> just
> > add a clarification that says "yes, it is allowed".
> >
> >
> 
> Sure. Again, I would send this feedback to Shane and Mark.
> 

This part has nothing to do with CURIE, though: it has to do with the way we are
writing our own productions.

Regards,

	Boris

> Cheers
> 
> Ivan
> 
> > Please let me know how you feel about my proposals.
> >
> > Regards,
> >
> > 	Boris
> >
> >
> >
> 
> --
> 
> Ivan Herman, W3C Semantic Web Activity Lead
> Home: http://www.w3.org/People/Ivan/
> mobile: +31-641044153
> PGP Key: http://www.ivan-herman.net/pgpkey.html
> FOAF: http://www.ivan-herman.net/foaf.rdf
Received on Sunday, 22 March 2009 09:27:54 UTC