A real problem with CURIEs and a proposal

Hello,

I've been investigating some of the minute details and issues
surrounding CURIEs, based on the discussion that recently cropped up
with ISSUE-125 [1].

It seems to me that the definition we currently have is flawed in one
more way, and quite crucially so.


## The Problem ##

As we already know, a bunch of Facebook OpenGraph properties are
expressed with CURIEs where the parts after the prefix themselves
contain colons. For instance, "video:actor:role", and
"my-og-app:podcast:url" as seen in the examples at [2]. (There are
also 13 such properties defined in <http://ogp.me/ns#>, e.g.
"og:image:width" and "og:video:height".)

We currently define CURIEs as:

    curie       ::=   [ [ prefix ] ':' ] reference
    reference   ::=   irelative-ref ; (as defined in [RFC3987])

Now, I may be too tired to see clearly, but if I read the definition
of irelative-ref in section 2.2 of RFC 3987 [3] correctly, it actually
prohibits such CURIEs!

Let me explain. I find these to be the relevant definitions in RFC 3987:

    irelative-ref  = irelative-part [ "?" iquery ] [ "#" ifragment ]

    irelative-part = "//" iauthority ipath-abempty
                   / ipath-absolute
                   / ipath-noscheme
                   / ipath-empty

    ipath-absolute = "/" [ isegment-nz *( "/" isegment ) ]
    ipath-noscheme = isegment-nz-nc *( "/" isegment )
    ipath-empty    = 0<ipchar>

    isegment-nz-nc = 1*( iunreserved / pct-encoded / sub-delims
                        / "@" )
                  ; non-zero-length segment without any colon ":"

If I interpret the ABNF [4] properly, given "og:image:width", I get
the following:

 * "og:" matches the prefix and ":", so we match "image:width" against
irelative-ref;
 * there is no "?" or "#" in that, so only irelative-part is considered;
 * it does not start with "//", so the first alternative ("//"
iauthority ipath-abempty) does not apply;
 * it does not start with "/", so it is not an ipath-absolute;
 * it contains a colon ":", so it is not an ipath-noscheme (does not
match isegment-nz-nc *( "/" isegment ));
 * it is not empty, so it is not an ipath-empty.

With no more alternatives in irelative-part, I conclude that
"og:image:width" is not a valid CURIE!

Please correct me if I'm wrong here! If not, it is quite evident that
we have to fix this (unless we are willing to break a widely deployed
de facto usage).
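
One way to double-check the reading above is to transcribe the quoted
rules into regular expressions and try each alternative of
irelative-part in turn. The following is only a rough sketch under
stated assumptions: the function name matching_alternative is made up
for illustration, ucschar is approximated as any code point from U+00A0
upward, iauthority is approximated as any run of characters other than
"/", "?" and "#", and iprivate is left out of iquery.

```python
import re

# Character classes after RFC 3987. ASSUMPTION: ucschar is approximated
# as "any code point from U+00A0 upward", which is looser than the RFC.
UNRESERVED = r"A-Za-z0-9\-._~\u00A0-\U0010FFFF"
SUB_DELIMS = r"!$&'()*+,;="
PCT = r"%[0-9A-Fa-f]{2}"

IPCHAR = rf"(?:[{UNRESERVED}{SUB_DELIMS}:@]|{PCT})"
IPCHAR_NC = rf"(?:[{UNRESERVED}{SUB_DELIMS}@]|{PCT})"  # ipchar without ":"

SEG = rf"{IPCHAR}*"            # isegment
SEG_NZ = rf"{IPCHAR}+"         # isegment-nz
SEG_NZ_NC = rf"{IPCHAR_NC}+"   # isegment-nz-nc
TAIL = rf"(?:/{SEG})*"         # *( "/" isegment )

# [ "?" iquery ] [ "#" ifragment ]  (ASSUMPTION: iprivate is omitted)
QUERY_FRAGMENT = rf"(?:\?(?:{IPCHAR}|[/?])*)?(?:#(?:{IPCHAR}|[/?])*)?"

# The four alternatives of irelative-part, in the order the RFC lists them.
# ASSUMPTION: iauthority is approximated as "anything up to /, ? or #".
IRELATIVE_PART = {
    "authority": rf"//[^/?#]*{TAIL}",
    "ipath-absolute": rf"/(?:{SEG_NZ}{TAIL})?",
    "ipath-noscheme": rf"{SEG_NZ_NC}{TAIL}",
    "ipath-empty": r"",
}


def matching_alternative(reference):
    """Return the name of the first irelative-part alternative under which
    `reference` is a complete irelative-ref, or None if none applies."""
    for name, pattern in IRELATIVE_PART.items():
        if re.fullmatch(pattern + QUERY_FRAGMENT, reference):
            return name
    return None


if __name__ == "__main__":
    for ref in ["image:width", "width", "Person/Doctor", "//example.org/x"]:
        print(repr(ref), "->", matching_alternative(ref))
```

With these approximations, "image:width" matches none of the four
alternatives, while "width" and "Person/Doctor" match as ipath-noscheme
and "//example.org/x" matches the authority form.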

Ironically, we *do* allow for CURIEs to begin with "//". This makes it
possible to use CURIEs *indistinguishable* from "normal" IRIs (using
authority and paths), as explained in ISSUE-125 (and in my old (dead
horse) ISSUE-90 [5]).


## The Proposal ##

We have the opportunity here to fix a lot of things. I propose to
define CURIEs along the lines of:

    curie           =   [ prefix ] ':' local
    prefix          =   PN_PREFIX; as defined in SPARQL 1.1 [6]
    local           =   (ipath-rootless / ipath-empty)
                            [ "?" iquery ] [ "#" ifragment ]

    ipath-rootless  = isegment-nz *( "/" isegment )
    isegment        = *ipchar
    isegment-nz     = 1*ipchar
    ipchar          = iunreserved / pct-encoded / sub-delims / ":"
                        / "@"

For comparison, this is the definition of the full IRI:

    IRI         = scheme ":" ihier-part [ "?" iquery ]
                         [ "#" ifragment ]

    ihier-part  = "//" iauthority ipath-abempty
                / ipath-absolute
                / ipath-rootless
                / ipath-empty


## The Consequences ##

This (if I'm awake enough) still allows for *all* the use cases that
have hitherto been put forward as needed. E.g.:

    schema:Person/Doctor
    og:video:height
    db:resource/Albert_Einstein
    ex:some?very=special#thing
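
As a sanity check, the proposed grammar can be approximated with a
regular expression and run against these examples. This is a sketch
only, not normative text: parse_curie is a hypothetical helper, the
prefix pattern is an ASCII-only approximation of SPARQL's PN_PREFIX,
ucschar is approximated as any code point from U+00A0 upward, and
iprivate is left out of iquery.

```python
import re

# Character classes after RFC 3987. ASSUMPTION: ucschar is approximated
# as "any code point from U+00A0 upward".
UNRESERVED = r"A-Za-z0-9\-._~\u00A0-\U0010FFFF"
SUB_DELIMS = r"!$&'()*+,;="
PCT = r"%[0-9A-Fa-f]{2}"
IPCHAR = rf"(?:[{UNRESERVED}{SUB_DELIMS}:@]|{PCT})"

# ASSUMPTION: ASCII-only stand-in for PN_PREFIX (a letter, then letters,
# digits, "_", "-" or ".", not ending in ".").
PREFIX = r"[A-Za-z](?:[A-Za-z0-9_.\-]*[A-Za-z0-9_\-])?"

# ipath-rootless = isegment-nz *( "/" isegment )
ROOTLESS = rf"{IPCHAR}+(?:/{IPCHAR}*)*"

# [ "?" iquery ] [ "#" ifragment ]  (ASSUMPTION: iprivate is omitted)
QUERY_FRAGMENT = rf"(?:\?(?:{IPCHAR}|[/?])*)?(?:#(?:{IPCHAR}|[/?])*)?"

# curie = [ prefix ] ":" local
# local = ( ipath-rootless / ipath-empty ) [ "?" iquery ] [ "#" ifragment ]
CURIE = re.compile(
    rf"(?P<prefix>{PREFIX})?:(?P<local>(?:{ROOTLESS})?{QUERY_FRAGMENT})"
)


def parse_curie(text):
    """Return (prefix, local) if `text` fits the proposed grammar, else None.
    The prefix is None when the CURIE starts with a bare ":"."""
    match = CURIE.fullmatch(text)
    if match is None:
        return None
    return match.group("prefix"), match.group("local")


if __name__ == "__main__":
    for text in [
        "schema:Person/Doctor",
        "og:video:height",
        "db:resource/Albert_Einstein",
        "ex:some?very=special#thing",
        "http://example.org/x",
    ]:
        print(text, "->", parse_curie(text))
```

Under these assumptions the four examples above all parse, with the
prefix ending at the first colon, while "http://example.org/x" is
rejected because the local part may not begin with "/".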

(While it is true that it would prevent the "hack" once presented as a
means of using full IRIs where RDFa 1.0 only allows CURIEs (by using
@xmlns:http="http:"), isn't that moot? Any processor affected by this
change in RDFa 1.1 should reasonably use RDFa 1.1 rules, where we now
allow such IRIs anywhere CURIEs are allowed. (And for that matter, I
don't recall any reports of actual usage of that.))

Most importantly, this completely eliminates the risk of confusing
CURIEs with normal IRIs. That is, IRIs with a scheme followed by "//",
an authority, and a path of segments (separated with "/"), followed by
optional "?" query and "#" fragment parts. These are the kinds of IRIs
that can be expressed in various relative forms and resolved against a
base IRI.

Looking at the list of official and common URI schemes at [7], I find
that of the 137 schemes, 71 (52%) are in the authority+path form. As
we know, the prevalent two on the web, http and https, are of this
kind (arguably the only relevant ones). I'd wager that we can expect
this form to stay prevalent on the web *even* if "http" were to be
eventually superseded. (I say so because relative paths are immensely
usable, and there is an abundance of code dealing with hierarchical
URL/URI resolution. Combined with the DNS-based authority model it's
reasonably here to stay.)

Note also the fact that "http" used as prefix has already turned up in
the wild, due to the HTTP Vocabulary Working Draft [8]. This has even
been used in the RDFa 1.1 Core spec itself (as I recently reported in
my review). To my knowledge, we have asked the ERT WG to change this,
but this has not yet happened. With this change, such a prefix would
no longer be a (technical) problem.

The other form is of the "opaque" IRIs (without an authority part and
possibly no "/" separated segments (i.e. "non-relativizable")).
Seemingly we've hitherto *unintentionally* prevented some of them
(e.g. urn: and tag: URIs); but at the price of the OpenGraph CURIEs.
There are some fairly well-known schemes in this group (official or
not), e.g.: mailto, tag, urn, doi, geo, tel, callto, news, xmpp, sip,
sms, bitcoin, gtalk, skype, spotify. Of these, "tag" and "geo" can be
found in prefix.cc. (I've previously mentioned that "geo" may be of
some concern for certain RDFa users [9].) But as we've already
concluded when resolving ISSUE-90, we argue that these will probably
not be used as prefixes, and will be quite uncommon as schemes of
subject or object IRIs in RDFa. Also, given that many IRIs using these
schemes already are reminiscent of CURIEs, and are of a rather
specialized nature, I'd imagine that it's easier for anyone coming
across such oddities to recognize the collision risk, should it ever
happen. We should still be very clear in the section about CURIEs
though, that prefixes overshadow schemes in IRIs of these forms, and
that we advise users to monitor the in-scope prefixes for any such
collision (along with the workaround accomplishable by using e.g.
@prefix="geo: geo:").
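
To illustrate how prefixes would overshadow schemes under this
proposal, here is a deliberately simplified sketch. The resolve
function and the example.org namespace IRIs are hypothetical and not
taken from any spec, and the only grammar rule it checks is that the
local part may not start with "/".

```python
def resolve(token, prefixes):
    """Expand `token` as a CURIE when its prefix is in scope and the local
    part does not start with "/"; otherwise return it unchanged as an IRI.

    ASSUMPTION: this ignores the rest of the proposed grammar and is not a
    full RDFa processing rule.
    """
    prefix, colon, local = token.partition(":")
    if colon and prefix in prefixes and not local.startswith("/"):
        return prefixes[prefix] + local
    return token


if __name__ == "__main__":
    # Hypothetical prefix mappings, for illustration only.
    in_scope = {
        "http": "http://example.org/http-vocab#",
        "geo": "http://example.org/geo-vocab#",
    }

    # "http" as a prefix no longer swallows ordinary http IRIs ...
    print(resolve("http://example.org/page", in_scope))
    # ... but still works as a CURIE prefix.
    print(resolve("http:Request", in_scope))

    # An opaque geo: IRI is overshadowed by the "geo" prefix ...
    print(resolve("geo:37.78,-122.39", in_scope))
    # ... unless the prefix is mapped onto itself, as in the workaround.
    print(resolve("geo:37.78,-122.39", {"geo": "geo:"}))
```

The last call shows why the @prefix="geo: geo:" workaround is enough:
the expansion reproduces the original IRI.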


## Summary ##

I sincerely hope that I have interpreted the ABNF correctly and
haven't raised the issue of OpenGraph CURIEs in error. And that I have
made a clear and satisfactory draft proposal for fixing both this and
the problems raised in ISSUE-125 (primarily the risk of confusing
CURIEs with normal IRIs).

Best regards,
Niklas

[1]: http://www.w3.org/2010/02/rdfa/track/issues/125
[2]: http://developers.facebook.com/docs/opengraph/objects/builtin/
[3]: http://tools.ietf.org/html/rfc3987#section-2.2
[4]: http://en.wikipedia.org/wiki/Augmented_Backus%E2%80%93Naur_Form
[5]: http://www.w3.org/2010/02/rdfa/track/issues/90
[6]: http://www.w3.org/TR/2012/WD-sparql11-query-20120105/#rPNAME_LN
[7]: http://en.wikipedia.org/wiki/URI_scheme
[8]: http://www.w3.org/TR/HTTP-in-RDF10/
[9]: http://lists.w3.org/Archives/Public/public-rdfa-wg/2011Aug/0039.html

Received on Tuesday, 24 January 2012 03:02:51 UTC