Comment from RDF WG on CURIE Alignment with SPARQL and Turtle from Gavin Carothers on 2012-01-18 (public-rdfa-wg@w3.org from January 2012)

From: Gavin Carothers <gavin@carothers.name>
Date: Wed, 18 Jan 2012 08:27:27 -0800
To: public-rdfa-wg@w3.org
Message-ID: <CAPqY83wRH=AJztx5rZBjxY3Hwd5kk3eWPbwxLzMk=AeooE5xPw@mail.gmail.com>
Alignment with SPARQL and Turtle

Both Turtle and SPARQL provide a mechanism for writing shortened IRIs.
The stated design goals of CURIEs are:

   CURIEs are designed from the ground up to be used in attribute
values. QNames are designed for unambiguously naming elements and
attributes.
   CURIEs expand to IRIs, and any IRI can be represented by such an
expansion. QNames are treated as value pairs, but even if those pairs
are combined into a string, only a subset of IRIs can be represented.
   CURIEs can be used in non-XML grammars, and can even be used in
XML languages that do not support XML Namespaces. QNames are limited
to XML Namespace-aware XML Applications.

These exact same goals are met in Turtle and SPARQL using the concept
of Prefixed Names. Exactly how Prefixed Names and CURIEs differ for
end users is not very clear. Both use simple concatenation, both
work outside of XML, and neither is a value pair. The given example
isbn:0321154991 is a perfectly valid prefixed name in both Turtle and
SPARQL. In fact, all the example CURIEs in RDFa Core 1.1 and the RDFa
Primer are valid Prefixed Names (some would require language-specific
escaping in SPARQL or Turtle). However, there are some differences.
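
The "simple concatenation" shared by both mechanisms can be sketched
roughly as below. This is only an illustration: the function name
expand and the mapping of the isbn prefix to urn:isbn: are assumptions
made for the example, not something defined by RDFa, Turtle, or SPARQL.

```python
def expand(name, prefixes):
    """Expand 'prefix:reference' by concatenating the prefix IRI and the reference."""
    # Split on the first colon only; the reference itself may contain colons.
    prefix, sep, reference = name.partition(":")
    if not sep or prefix not in prefixes:
        raise ValueError(f"no known prefix in {name!r}")
    return prefixes[prefix] + reference


# Hypothetical prefix mapping, chosen only for illustration.
prefixes = {"isbn": "urn:isbn:"}
print(expand("isbn:0321154991", prefixes))  # urn:isbn:0321154991
```

The same function serves for a CURIE or a Prefixed Name; the two only
differ in which strings are allowed on either side of the colon.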


CURIE grammar

The grammar for CURIEs provided in RDFa Core 1.1

prefix      ::=   NCName

reference   ::=   irelative-ref (as defined in [RFC3987])

curie       ::=   [ [ prefix ] ':' ] reference

safe_curie  ::=   '[' [ [ prefix ] ':' ] reference ']'

The grammars for prefixed names are well tested and have many
implementations. The CURIE grammar does not seem to have ANY
implementations. In fact even implementing the seemingly simple
grammar from RDFa Core 1.1 is very complicated. The grammar references
two other grammars. First the XML Namespaces grammar for NCName, which
allows a wider range of tokens than prefixed names do. Exactly what
use cases those additional tokens are needed for is not clear. Some
examples:

_1: is a valid CURIE prefix but NOT a valid Prefixed Name prefix.
______: is a valid CURIE prefix but NOT a valid Prefixed Name prefix.

In fact, the only clear set of tokens allowed by CURIE and not by
Prefixed Names is prefixes with _ as the first character.
Given that _ as the first character is used in RDFa, Turtle, and
SPARQL to reference blank nodes, it seems unlikely that anyone uses _
as the first character in their prefix names.
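
The difference for the two example prefixes can be checked with a small
sketch. The regular expressions below are ASCII-only approximations
written for this illustration (both real grammars also admit many
non-ASCII characters), and the name classify is hypothetical.

```python
import re

# ASCII-only approximation of the XML Namespaces NCName production.
NCNAME = re.compile(r"[A-Za-z_][A-Za-z0-9._-]*")

# ASCII-only approximation of PN_PREFIX: must start with a letter,
# may contain '.', but may not end with '.'.
PN_PREFIX = re.compile(r"[A-Za-z](?:[A-Za-z0-9_.-]*[A-Za-z0-9_-])?")


def classify(prefix):
    """Return (valid as CURIE prefix, valid as Prefixed Name prefix)."""
    return (
        NCNAME.fullmatch(prefix) is not None,
        PN_PREFIX.fullmatch(prefix) is not None,
    )


for p in ["isbn", "_1", "______"]:
    print(p, classify(p))
```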

Moving on to the right hand side of the CURIE, the grammar gets much
more amusing. First off, the referenced grammar is not in the same form
as CURIE or XML, which use W3C EBNF. The IRI RFC uses ABNF. This leads
to complicated reading for humans, and no clear way to use any
automated tool to build a CURIE grammar. While the name
"irelative-ref" sounds like a relative IRI reference, the rule in
question is NOT limited to relative references. Host parts, IPv4 and
IPv6 segments are allowed as part of irelative-refs. This is not
exactly expected. Again what use case is served in allowing CURIEs
like:

{'prefix' : 'http://purl.org/example/'}

prefix://user:password[2001:0db8:85a3:0000:0000:8a2e:0370:7334]:8080/

These are very easy to confuse with normal IRIs. In general it seems
that the intent of CURIEs was to limit the right hand side to relative
references but that is not accomplished by using the "irelative-ref"
production from the IRI RFC.
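
A rough way to see that a reference without a scheme can still carry an
authority (host and port) is to split one with Python's standard
urllib.parse. This is only a sketch: urlsplit is not an RFC 3987
validator, and the helper name has_authority and the example.org
references are assumptions made for the illustration.

```python
from urllib.parse import urlsplit


def has_authority(reference):
    """True if the reference carries a host part despite having no scheme."""
    parts = urlsplit(reference)
    return parts.scheme == "" and parts.netloc != ""


# A network-path reference: no scheme, but a host and a port.
print(has_authority("//example.org:8080/path"))

# An IPv6 literal is accepted in the same position.
print(has_authority("//[2001:db8::1]:8080/"))

# The kind of reference CURIEs were presumably meant for.
print(has_authority("0321154991"))
```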

Recommendation:

Align RDFa with SPARQL and Turtle prefix names. It should be possible
to create a simpler grammar for "CURIEs"/Prefixed Names based on the
SPARQL and Turtle productions below:

[s157]  PN_CHARS      ::=  PN_CHARS_U | '-' | [0-9] | #x00B7 | [#x0300-#x036F] | [#x203F-#x2040]
[s158]  PN_PREFIX     ::=  PN_CHARS_BASE ((PN_CHARS|'.')* PN_CHARS)?
[s159]  PN_LOCAL      ::=  (PN_CHARS_U | [0-9] | PLX ) ( ( PN_CHARS | '.' | PLX )* ( PN_CHARS | PLX ) ) ?
[s160]  PLX           ::=  PERCENT | PN_LOCAL_ESC
[s161]  PERCENT       ::=  '%' HEX HEX
[s162]  HEX           ::=  [0-9] | [A-F] | [a-f]
[s163]  PN_LOCAL_ESC  ::=  '\' ( '_' | '~' | '.' | '-' | '!' | '$' | '&' | "'" | '(' | ')' | '*' | '+' | ',' | ';' | '=' | ':' | '/' | '?' | '#' | '@' | '%' )
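
As a rough illustration of how directly these productions translate
into something testable, the sketch below transcribes PN_LOCAL as
quoted in this message into a Python regular expression. It is
restricted to ASCII (the non-ASCII ranges of PN_CHARS_BASE and PN_CHARS
are left out), and the helper name is_pn_local is hypothetical.

```python
import re

HEX = r"[0-9A-Fa-f]"
PERCENT = rf"%{HEX}{HEX}"
# A backslash followed by one of the escapable punctuation characters.
PN_LOCAL_ESC = r"\\[_~.\-!$&'()*+,;=:/?#@%]"
PLX = rf"(?:{PERCENT}|{PN_LOCAL_ESC})"
PN_CHARS_U = r"[A-Za-z_]"      # ASCII subset only
PN_CHARS = r"[A-Za-z0-9_\-]"   # ASCII subset only

PN_LOCAL = re.compile(
    rf"(?:{PN_CHARS_U}|[0-9]|{PLX})"
    rf"(?:(?:{PN_CHARS}|\.|{PLX})*(?:{PN_CHARS}|{PLX}))?"
)


def is_pn_local(text):
    """True if text matches the ASCII subset of PN_LOCAL sketched above."""
    return PN_LOCAL.fullmatch(text) is not None


print(is_pn_local("0321154991"))  # digits may start a local name
print(is_pn_local("a/b"))         # '/' must be escaped
print(is_pn_local(r"a\/b"))       # escaped form
```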

Main differences are likely to be that most of the PN_LOCAL_ESC
characters should be allowed in RDFa grammar directly, as the host
languages (XML and HTML) provide for escaping mechanisms already.
While this would be a backwards-incompatible change, the real effect on
deployed data and software is likely to be low. I am unaware of any
RDFa implementation that uses the CURIE grammar as specified, and have
never encountered RDFa data in the wild that uses the odder values
the current grammar productions allow.

--Gavin
Received on Wednesday, 18 January 2012 16:27:56 UTC