Increased lookahead requirements in the Turtle draft from David Robillard on 2013-02-17 (public-rdf-comments@w3.org from February 2013)

From: David Robillard <d@drobilla.net>
Date: Sun, 17 Feb 2013 17:43:53 -0500
To: public-rdf-comments@w3.org
Message-ID: <1361141033.16176.30.camel@verne.drobilla.net>

Hi,

I recently got a bug report from a user who encountered "Turtle" in the
wild with dots in prefixed names, which my parser does not yet support.
So, I looked at the draft with a view to implementing this.

Unfortunately it looks like a can of worms for a simple
recursive-descent parser.  The previous specification could be
implemented with 1 character of lookahead, but I don't think this one
can.

Since a PrefixedName can contain a dot, if the next character while
reading a PrefixedName is a dot, it is ambiguous whether that dot is
part of the PrefixedName or the end of a statement.  To determine this,
you need to check whether the character after the dot is a valid
PrefixedName character, and until this is known, neither the dot nor the
character after it can be 'eaten'.
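
To make the ambiguity concrete, here is a minimal sketch of the check
described above.  It is not taken from any real parser: is_name_char()
and read_local_name() are hypothetical names, and the character class is
a deliberately simplified stand-in for the draft's grammar (for example,
it does not try to handle runs of several dots).

```python
def is_name_char(c):
    # Assumption: a simplified stand-in for the draft's name character
    # classes (letters, digits, underscore, hyphen only).
    return c.isalnum() or c in "_-"


def read_local_name(text, pos):
    """Read a local name starting at text[pos]; return (name, new_pos)."""
    start = pos
    while pos < len(text):
        c = text[pos]
        if is_name_char(c):
            pos += 1
        elif c == "." and pos + 1 < len(text) and is_name_char(text[pos + 1]):
            # Second character of lookahead: the dot is followed by a name
            # character, so the dot is treated as part of the name.
            pos += 1
        else:
            # Either a statement-ending dot or some other delimiter; leave
            # it unconsumed for the caller.
            break
    return text[start:pos], pos


print(read_local_name("foo.bar .", 0))  # dot inside the name is kept
print(read_local_name("foo. ", 0))      # trailing dot is left for the caller
```

The key point is the `text[pos + 1]` access: the decision about the dot
cannot be made from the dot alone.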

The significance is that *1* character of "lookahead" isn't really
lookahead, you just need a peek().  Anything greater requires some kind
of real lookahead implementation, or at least some crafty case-specific
kludges to get around it.
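
As an illustration of what "real lookahead" might involve when reading
from a stream, here is a rough sketch of a reader that buffers upcoming
characters.  LookaheadReader is a hypothetical class written for this
example, not part of any existing parser; with only peek(1) the buffer
never holds more than one character, while peek(2) forces it to act as a
small queue.

```python
import io
from collections import deque


class LookaheadReader:
    """Hypothetical wrapper giving k-character lookahead over a text stream."""

    def __init__(self, stream):
        self.stream = stream
        self.buffer = deque()  # characters read from the stream but not yet eaten

    def peek(self, k=1):
        """Return the k-th upcoming character without eating it ('' at end)."""
        while len(self.buffer) < k:
            c = self.stream.read(1)
            if c == "":
                return ""
            self.buffer.append(c)
        return self.buffer[k - 1]

    def eat(self):
        """Consume and return the next character ('' at end of input)."""
        if self.buffer:
            return self.buffer.popleft()
        return self.stream.read(1)


reader = LookaheadReader(io.StringIO("a.b"))
reader.eat()                            # consume "a"
print(reader.peek(1), reader.peek(2))   # look at "." and "b" without eating
```

Every eat() now has to check the buffer first, which is one place where
the extra lookahead could cost some throughput.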

This is not necessarily a spec problem, and two-character lookahead is
not an onerous requirement in general, but compared to one character it
is.  I just thought it was worth mentioning that there is a considerable
new implementation requirement here.  I will have to pay a price in
throughput for this as well.

It's clear, though, that dots in prefixed names are desirable.  Ideally,
tokens, including the delimiters (i.e. '.' and ';'), would be
whitespace-delimited, so reading a PrefixedName would simply stop when
whitespace is encountered and this problem would not exist.  Perhaps not
realistic given existing practice, but it would certainly be nice.
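
For comparison, a sketch of how simple token reading could be under that
hypothetical whitespace-delimited syntax.  read_token() is an
illustrative name, and the sketch ignores tokens that would need their
own handling, such as quoted literals containing spaces.

```python
def read_token(text, pos):
    """Skip leading whitespace, then read up to the next whitespace."""
    while pos < len(text) and text[pos].isspace():
        pos += 1
    start = pos
    # Only the current character is inspected: no lookahead past it.
    while pos < len(text) and not text[pos].isspace():
        pos += 1
    return text[start:pos], pos


line = "ex:foo.bar ."
name, pos = read_token(line, 0)
dot, pos = read_token(line, pos)
print(name, dot)  # the name keeps its dot; the terminator is its own token
```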

Cheers,

-dr

Received on Sunday, 17 February 2013 22:44:21 UTC