Re: Langtag in Turtle BNF from Andy Seaborne on 2012-02-26 (public-rdf-comments@w3.org from February 2012)

From: Andy Seaborne <andy.seaborne@epimorphics.com>
Date: Sun, 26 Feb 2012 15:05:04 +0000
To: public-rdf-comments@w3.org
Message-ID: <4F4A4A20.3050903@epimorphics.com>

On 24/02/12 14:45, Alex Hall wrote:
> On Fri, Feb 24, 2012 at 7:34 AM, Henry Story <henry.story@bblfish.net
> <mailto:henry.story@bblfish.net>> wrote:
>
>     In the current editors draft and spec we find
>
>     http://dvcs.w3.org/hg/rdf/raw-file/default/rdf-turtle/turtle.bnf
>
>     LANGTAG ::= BASE
>       | PREFIX
>       | "@" [a-zA-Z]+ ( "-" [a-zA-Z0-9]+ )*
>
>     BASE ::= "@base"
>
>     PREFIX ::= "@prefix"
>
>     RDFLiteral ::= String ( LANGTAG | ( "^^" IRIref ) )?
>
>
> Interesting... Note that a language tag in Turtle (as in SPARQL) is
> defined simply as '@' followed by one or more letters, with optional
> hyphenated alphanumeric segments. Under this definition, '@base' and
> '@prefix' are both valid language tags regardless of whether they are
> explicitly included in the LANGTAG production using their BASE and
> PREFIX rules.
>
> Now, I agree that it is confusing to have them included this way in the
> LANGTAG definition. They aren't there in SPARQL, and they probably
> shouldn't be in Turtle. My guess would be that this was transcribed
> directly from the input grammar for some parser generator, and BASE and
> PREFIX were added to LANGTAG to quiet some warnings about ambiguous tokens.

Yes - there would be a fight over @base as a directive and as a language
tag. The other way round would work - define directives as langtags
(!!!) and only allow two particular ones.  That is OK for machines, and
it would still be BNF, but it is less good for people reading the
grammar.  Parser generators often do allow a literal "@base" to be used,
meaning that exact string at that point, but that's not BNF.
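
To make the clash concrete, here is a rough Python sketch (not taken
from any real Turtle parser; the TOKEN_RULES table and classify() are
hypothetical names for illustration). It shows that the plain LANGTAG
pattern accepts "@base" and "@prefix", and that listing the directives
ahead of LANGTAG is one way a tokenizer can settle the fight:

```python
import re

# LANGTAG as quoted above, without the BASE | PREFIX alternatives.
LANGTAG = re.compile(r"@[a-zA-Z]+(?:-[a-zA-Z0-9]+)*")

# Hypothetical rule table: directives come first so they win the tie.
TOKEN_RULES = [
    ("BASE", re.compile(r"@base")),
    ("PREFIX", re.compile(r"@prefix")),
    ("LANGTAG", LANGTAG),
]


def classify(text):
    """Return the name of the first rule matching all of text, else None."""
    for name, pattern in TOKEN_RULES:
        if pattern.fullmatch(text):
            return name
    return None


# The unmodified LANGTAG pattern accepts both directive keywords.
print(bool(LANGTAG.fullmatch("@base")), bool(LANGTAG.fullmatch("@prefix")))
# Rule order decides what the tokenizer reports.
print(classify("@base"), classify("@prefix"), classify("@en-GB"))
```

The ordering lives in the tokenizer, not in the grammar, which is why it
cannot be expressed in plain BNF.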

They aren't in SPARQL because @base and @prefix aren't keywords there;
SPARQL writes BASE and PREFIX without the "@".

You could write a single token for a literal with its LANGTAG or
datatype, but it would be horrible (both the prefixed-name and the URI
forms of the datatype would need to be spelt out).  Putting the pieces
in the tokens and assembling the whole literal in the grammar is easier
for both machine and person.
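
A minimal Python sketch of that split, under simplifying assumptions:
the token patterns are cut down (no string escapes, no long strings, a
toy prefixed-name pattern), and tokenize() and parse_rdf_literal() are
made-up names. The tokenizer only produces pieces; the RDFLiteral rule
quoted above puts them together:

```python
import re

# Simplified, hypothetical token set; the real Turtle terminals are richer.
TOKEN = re.compile(r'''
    (?P<STRING>"[^"\\\n]*")
  | (?P<LANGTAG>@[a-zA-Z]+(?:-[a-zA-Z0-9]+)*)
  | (?P<HATHAT>\^\^)
  | (?P<IRIREF><[^<>\s]*>)
  | (?P<PNAME>[A-Za-z][A-Za-z0-9]*:[A-Za-z0-9_]*)
  | (?P<WS>\s+)
''', re.VERBOSE)


def tokenize(text):
    """Split text into (kind, lexeme) pairs, dropping whitespace."""
    pos, tokens = 0, []
    while pos < len(text):
        m = TOKEN.match(text, pos)
        if m is None:
            raise ValueError(f"unexpected input at offset {pos}")
        if m.lastgroup != "WS":
            tokens.append((m.lastgroup, m.group()))
        pos = m.end()
    return tokens


def parse_rdf_literal(tokens):
    """RDFLiteral ::= String ( LANGTAG | ( "^^" IRIref ) )?

    Returns (lexical form, language or None, datatype or None).
    """
    kinds = [kind for kind, _ in tokens]
    if not tokens or kinds[0] != "STRING":
        raise ValueError("expected a string")
    lexical = tokens[0][1][1:-1]          # strip the quotes
    if len(tokens) == 1:
        return (lexical, None, None)
    if kinds[1:] == ["LANGTAG"]:
        return (lexical, tokens[1][1][1:], None)   # strip the "@"
    if len(tokens) == 3 and kinds[1] == "HATHAT" and kinds[2] in ("IRIREF", "PNAME"):
        return (lexical, None, tokens[2][1])
    raise ValueError("malformed literal")


print(parse_rdf_literal(tokenize('"chat"@fr')))
print(parse_rdf_literal(tokenize('"1"^^xsd:integer')))
```

The datatype can be a prefixed name or a full IRI without the string
token needing to know about either.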

A trick would be to make the end of the lexical form a token of its
own, "@ or "^^ (+ internal whitespace) ... but then "a" is a problem.

BCP47 is a tricky pile of rules because the lengths of subitems
affect their meaning and the parsing rules.

But the language part:

language      = 2*3ALPHA ["-" extlang]
               / 4ALPHA
               / 5*8ALPHA

does allow "base" and "prefix" (reserved and registered language subtag 
respectively).
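
A quick way to check this, as a Python sketch: classify_language() is a
hypothetical helper, and the extlang pattern assumes the RFC 5646
definition extlang = 3ALPHA *2("-" 3ALPHA). It only tests the syntax of
the language part, not whether a subtag is actually in the registry:

```python
import re

# The three alternatives of the "language" production, tried in order.
ALTERNATIVES = [
    ('2*3ALPHA ["-" extlang]', re.compile(r"[A-Za-z]{2,3}(?:-[A-Za-z]{3}){0,3}")),
    ("4ALPHA", re.compile(r"[A-Za-z]{4}")),
    ("5*8ALPHA", re.compile(r"[A-Za-z]{5,8}")),
]


def classify_language(subtag):
    """Return the label of the alternative that matches subtag, else None."""
    for label, pattern in ALTERNATIVES:
        if pattern.fullmatch(subtag):
            return label
    return None


for candidate in ("en", "zh-cmn", "base", "prefix"):
    print(candidate, "->", classify_language(candidate))
```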

 Andy

Received on Sunday, 26 February 2012 15:05:31 UTC