Comment on Path / PName clash and Turtle impact

Hello,

Apologies for sending this past the Last Call, but I have a comment
about the decision to combine PNames and Property Paths in SPARQL, and
about escaping PNames to resolve the clash this causes.

My perspective is mainly that of a Turtle user/implementer.  I ran
into this issue while updating my Turtle implementation[1] for the
latest spec, and found that an odd new rule has been added to the
grammar:

[163s] PN_LOCAL_ESC ::= '\\' ( '_' | '~' | '.' | '-' | '!' | '$' | '&' |
"'" | '(' | ')' | '*' | '+' | ',' | ';' | '=' | ':' | '/' | '?' | '#' |
'@' | '%' )

Unhappy with how ugly this is, and puzzled as to why such a specific,
seemingly arbitrary set of characters has been introduced as escapes in
PNames, I investigated.  It turns out this is from SPARQL, and the
escapes are to avoid clashing with Property Paths (hereafter just
"paths").

This seems like a problem to me: the Turtle specification now has a
strange and unpleasant grammar rule from a different specification, to
mesh with a concept that is meaningless in the context of a Turtle
document.  I do agree, though, that copy/paste compatibility between
statements in both languages is highly desirable.

My main point is about the method: I think escaping is a very poor way
of achieving this, and quotation is more appropriate.  Either Paths, or
PNames, should be quoted, or have a special leading character, to remove
this ambiguity.

Some cons of the current escaping scheme:

* Escaping is ugly, and difficult to work with.  Paths that include
pnames with special characters are difficult to read.

* Copying from other data sources that use these characters is
difficult, so much so that expecting a user to do this manually (i.e.
escape every character in the above list) is unrealistic and invites
errors.

* This effectively prevents future revisions of SPARQL from adding
anything to the path syntax.  If both of these specs become
recommendations, then Turtle (and the corresponding rules in SPARQL
itself) will have baked-in escapes specifically to work around path
syntax.  No new path syntax can be added, because any new special
character would break the rules for PNames, in both SPARQL and Turtle.

* The very existence of escaping implies there is a need to express
these characters in PNames.  However, this has been made tedious and
ugly to accommodate paths.  In my opinion, this is somewhat backwards.
Both languages should have a clean PName syntax.  Paths are a different
thing, and should be clearly designated as such.  Put another way,
property paths are not pnames, and crippling the pname syntax for paths
is a poor design when there are very simple alternative ways of
differentiating the two.
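
As a rough illustration of the second point above, here is a minimal
Python sketch of the escaping a user would have to perform by hand.  The
helper name escape_local is hypothetical, and the sketch assumes every
character in the PN_LOCAL_ESC list must be escaped wherever it occurs;
the real grammar may permit some of them unescaped in some positions.

```python
# Characters listed in the PN_LOCAL_ESC rule quoted earlier.
PN_LOCAL_ESC_CHARS = set("_~.-!$&'()*+,;=:/?#@%")


def escape_local(local):
    """Backslash-escape each listed character in a PName local part.

    Hypothetical helper: it escapes every listed character
    unconditionally, which is a simplifying assumption.
    """
    return "".join("\\" + c if c in PN_LOCAL_ESC_CHARS else c for c in local)


if __name__ == "__main__":
    for name in ("foo/bar/baz", "terms/a+b"):
        print(name, "->", escape_local(name))
```

Even for short names, nearly every separator picks up a backslash.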

Some pros of quoting, rather than escaping:

* Much easier to read.  Even in a purely SPARQL context, ignoring
Turtle, having a path be very clearly delineated is much simpler to read
than navigating a mess of escapes and trying to mentally parse what is
going on.

* Turtle is not 'infected' by this SPARQL-specific grammar
consideration, and both can use a simpler, more expressive, and more
friendly PName grammar.  SPARQL is not 'locked in' forevermore and is
free to update the path syntax in the future.

* Copy/paste compatibility with other data sources is much simpler,
since quoting is easy, unlike escaping.  It is also less error-prone,
since only the quote character needs special consideration.

* The grammars become cleaner, since Path rules and PName rules are
clearly distinct (though the former would refer to the latter).  The
PName rules do not need to take into consideration every character used
in the Path syntax, which is crucial since the PName rules must be in
Turtle as well.  The current PName rule is a symptom that different
types of tokens have not been properly distinguished.

* The PName rules would be far more (possibly entirely) compatible with
CURIEs, rather than extremely SPARQL-specific.

I am not sure exactly what to suggest in terms of syntax.  It seems most
in line with existing practice not to quote 'top-level' PNames, but
rather quote paths somehow.  This resolves the Turtle problems, but does
not resolve issues with PNames inside paths.  Here, it seems quoting is
best.  One proposal: paths always have a leading '/', and PNames within
paths are quoted with '[' and ']' (as in the CURIE spec).  Thus, the
example:

?x foaf:knows/foaf:name ?name .

Would become:

?x /[foaf:knows]/[foaf:name] ?name .

The quoting means the PNames are free to contain extended characters,
e.g. rather than the unwieldy:

?x eg:foo\/bar\/baz/eg:terms\/a\+b ?b .

You would have:

?x /[eg:foo/bar/baz]/[eg:terms/a+b] ?b .

Importantly, no quoting of PNames in any other context is necessary, and
no escaping of PNames is necessary at all, which is a significant win
for "copy-paste compatibility" (quoting could also be optional in
paths).
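
To show how little machinery the quoted form needs, here is a minimal
Python sketch that splits a path written in the proposed syntax into its
PNames.  It is only an illustration under stated assumptions: the
function name split_path is hypothetical, only simple '/' sequences are
handled (no other path operators), and quoting is treated as mandatory.

```python
import re

# One path step in the proposed syntax: '/' followed by a '['-quoted PName.
_STEP = re.compile(r"/\[([^\[\]]+)\]")


def split_path(path):
    """Return the PNames of a path such as '/[a:b]/[c:d]'.

    Raises ValueError if the text is not a sequence of quoted steps.
    """
    steps = []
    pos = 0
    while pos < len(path):
        match = _STEP.match(path, pos)
        if match is None:
            raise ValueError("not a quoted path step at offset %d" % pos)
        steps.append(match.group(1))
        pos = match.end()
    if not steps:
        raise ValueError("empty path")
    return steps


if __name__ == "__main__":
    print(split_path("/[eg:foo/bar/baz]/[eg:terms/a+b]"))
```

Only ']' ends a step, so the '/' and '+' inside the PNames need no
special treatment.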

The prefix character is analogous to the '?' used for variables.  This
works well, and is very simple, since a token that starts with a '?' is
clearly a variable, and there is no clashing.  Paths (indeed, any new
kind of token) should be similarly simple to distinguish.  A token that
starts with a '?' is a variable.  A token that starts with a '/' is a
property path.  Simple, consistent, extensible.
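
A minimal Python sketch of that dispatch, assuming tokens are separated
by whitespace and that the proposed leading '/' is adopted.  The
function name classify and the category labels are hypothetical, and a
real SPARQL lexer has many more token kinds than this.

```python
def classify(token):
    """Classify a token by its first character (illustrative only)."""
    if token.startswith("?"):
        return "variable"
    if token.startswith("/"):
        return "path"
    if token == ".":
        return "end"
    return "other"


if __name__ == "__main__":
    for tok in "?x /[foaf:knows]/[foaf:name] ?name .".split():
        print(tok, "->", classify(tok))
```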

Note these are just off-the-cuff examples; I have not thought much about
the best syntax.  Leading slash for paths and [] quoting as above may
not be the best choices for whatever reason; I am more interested in
highlighting the problem first.  If quoting in paths is not popular, I
wouldn't mind escaping *only in paths* - at least that doesn't wreck
Turtle.

In my opinion, this is a very serious issue.  I have a strong aversion
to implementing these PName escapes in Turtle, and consider the rule an
outright error.  Again, apologies for being late, but a more palatable
resolution to this problem would be a significant improvement, and
prevent future problems.

Thanks,

-dr

[1] http://drobilla.net/software/serd/

Received on Saturday, 18 February 2012 02:44:06 UTC