Re: delimiters between tokens in turtle

On Thu, Feb 23, 2012 at 12:01 PM, Henry Story <> wrote:

> I can't quite work out what the delimiters between tokens are.

Well, that's because there are no delimiters between tokens, per se. You
can always use whitespace to separate tokens that would otherwise be parsed
as a single token, but this isn't necessary if there's only one way to
parse a given character sequence.

> The following seems to be correct N3 (cwm parses it)
>  @prefix : <>.
>  :me</knows>:her,:him.

Yes, this is correct N3 (and Turtle). Without getting into the gory details
of parsing theory, what the parser is doing is:

1. Starting with the first character in the input sequence, determine what
type of token it's looking at, by looking ahead as many characters as
necessary to eliminate any alternatives.
2. Once it determines the token type, it then goes back to the beginning of
the token and advances character by character, building up an internal
state along the way, until it reaches the end of the token or finds a
character that is not allowed as part of the current token.
3. If the current token is well-formed, then emit it and start over at (1)
to find the next token. If the token is malformed, throw an exception and
stop parsing.

So, for the second line of your example above, the parser guesses from
looking at the ':' that it's looking at a prefixed name (because no other
token can start with a ':').
It reads up to the '<', which is not allowed as part of a prefixed name, so
it emits ':me' as a prefixed name. Then, from the '<' it knows it's looking
at an IRI, so it reads forward to the '>' (which signals the end of an IRI
token) and emits '</knows>' as an IRI, and so on.
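
To make that concrete, here is a rough sketch in Python (not how cwm or any
real Turtle parser does it). The token names and the ASCII-only patterns
are my own simplifications, and a regex alternation stands in for the
"decide the token type, then scan to its end" steps. It also assumes, as in
your examples, that ':' cannot appear inside a local name.

```python
import re

# Simplified, ASCII-only token patterns (an assumption; the real grammar
# allows many more characters in names and IRIs).
TOKEN_SPEC = [
    ("IRIREF", r"<[^<>\s]*>"),
    ("PREFIX_KW", r"@prefix"),
    # Optional prefix, ':', then a local part. A run of '.' is only
    # consumed if a name character follows, so a name never ends in '.'.
    ("PNAME", r"(?:[A-Za-z][\w-]*)?:(?:[\w-]|\.+(?=[\w-]))*"),
    ("COMMA", r","),
    ("SEMI", r";"),
    ("DOT", r"\."),
    ("WS", r"\s+"),
]
TOKEN_RE = re.compile(
    "|".join(f"(?P<{name}>{pattern})" for name, pattern in TOKEN_SPEC)
)


def tokenize(text):
    """Return (type, lexeme) pairs; whitespace is matched but not emitted."""
    tokens = []
    pos = 0
    while pos < len(text):
        m = TOKEN_RE.match(text, pos)
        if m is None:
            # No token can start here: malformed input, stop.
            raise SyntaxError(f"unexpected {text[pos]!r} at offset {pos}")
        if m.lastgroup != "WS":
            tokens.append((m.lastgroup, m.group()))
        pos = m.end()
    return tokens


if __name__ == "__main__":
    for token in tokenize(":me</knows>:her,:him."):
        print(token)
```

Note that the whitespace pattern is just another alternative: it is only
needed where two adjacent tokens would otherwise run together.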

> cwm even is able to parse
>   @prefix foaf: <> .
>   :me foaf:knows:her.
> Anyway, it's not so easy to implement some of this.
> For some context, I have been working on this in Scala, and writing up
> some thoughts here:

Yes, parsers are notoriously difficult to implement. I'm no expert on
parsing theory, but I see that you're using a parser combinator to
implement yours. I think the grammar was designed to be parsed using more
traditional LL(1) parsers with a separate lexical analyzer to handle the
terminal productions. This might account for some of the difficulty that
you're having.

FWIW, I had trouble implementing the same PN_PREFIX rule that you cite
above using ANTLR, and had to use ANTLR's syntactic predicate feature to
work around the greediness. So I rewrote the rule as:

fragment PN_LOCAL_CHARS : '.' | PN_CHARS ;
fragment PN_CHARS_SEQ :
   ( // '.' is not allowed at the end -- only match it if it is
     // followed by another valid char
     ('.' PN_LOCAL_CHARS)=> '.'
   | PN_CHARS )* ;
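
If it helps to see the same idea outside ANTLR, the predicate behaves much
like a regex lookahead. Below is a small Python sketch; the PN_CHARS
character class is an ASCII stand-in I made up for illustration, and the
names GREEDY, GUARDED and local_part are hypothetical.

```python
import re

# Hypothetical ASCII stand-in for PN_CHARS; the real production is wider.
PN_CHARS = r"[A-Za-z0-9_-]"

# Naive version: '.' is always accepted, so a trailing '.' gets swallowed.
GREEDY = re.compile(rf"(?:\.|{PN_CHARS})*")

# Guarded version: '.' is accepted only when followed by '.' or PN_CHARS,
# which mirrors the ('.' PN_LOCAL_CHARS)=> '.' alternative in the rule.
GUARDED = re.compile(rf"(?:\.(?=\.|{PN_CHARS})|{PN_CHARS})*")


def local_part(pattern, text):
    """Return the longest prefix of text that the pattern accepts."""
    return pattern.match(text).group()


if __name__ == "__main__":
    print(local_part(GREEDY, "him."))   # the statement-ending '.' is eaten
    print(local_part(GUARDED, "him."))  # the '.' is left for the parser
    print(local_part(GUARDED, "a.b."))  # the inner '.' is kept
```

The lookahead checks what follows the '.' without consuming it, so the
final '.' stays available as the statement terminator.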


> Henry
> Social Web Architect

Received on Thursday, 23 February 2012 20:48:21 UTC