Re: A real problem with CURIEs and a proposal

From: Niklas Lindström <lindstream@gmail.com>
Date: Thu, 26 Jan 2012 03:47:20 +0100
Message-ID: <CADjV5jcEN=ZKWsuwoJjb6vYS3r-7BquBDh=H8gOd7tx0koUE9A@mail.gmail.com>
To: Ivan Herman <ivan@w3.org>
Cc: public-rdfa-wg <public-rdfa-wg@w3.org>, Gavin Carothers <gavin@carothers.name>

Hi Ivan,

Yes, I believe your proposal takes care of the new OpenGraph issue
without touching anything else. But it doesn't seem to address any of
the RDF WG concerns.

Given that we now need to change the CURIE definition anyway, I think
it is prudent to clarify what our situation is.

With your proposal, the lexical space of CURIEs is now a perfect
superset of the IRI lexical space. I.e. every full IRI can be a CURIE.
It was nearly that before of course, but it should be noted.

(Interestingly, the use of CURIEs is then entirely equated with just
using IRIs plus the prefix expansion mechanism, apart from the more
permissive syntax of prefixes compared to IRI schemes.)

As for backwards incompatibility, there may be two sides to that coin:

1. RDFa 1.1 allows IRIs and CURIEs to mix. For @about and @resource
this means that in RDFa 1.1, as opposed to 1.0 (and as opposed to
RDF/XML, Turtle and SPARQL), prefixes in scope (defined in parent
elements or in an out-of-band host language based initial context)
override anything that looks like a prefix. This expands, without
distinction, the prefix of any CURIE -- including schemes on those
lexically identical to IRIs -- to create longer IRIs. This we are all
aware of. I wonder though, isn't this an incompatibility, since there
are forms of RDFa 1.0 where IRIs in @about and @resource would be
changed into other IRIs in RDFa 1.1 (e.g. if there is an
@xmlns:http="..." declared)? Just to ensure we're on stable ground.

2. Currently the local part of a CURIE is allowed to start with "//".
If this were not allowed in RDFa 1.1, it would be formally
backwards-incompatible. But are there *any* existing or desirable uses
for this that would be blocked? The only example I've ever seen is in
RDFa 1.0, where if one binds a prefix to itself (@prefix="http:
http:"), one can use CURIEs in @typeof, @property, @rel and @rev.
Which is moot in RDFa 1.1 since full IRIs are allowed there anyway.
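
To make the two points above concrete, here is a minimal sketch of the expansion rule as I understand it. It is an illustration only, not the RDFa processing algorithm: the function name `expand` and the `shadowed` mapping (with its example.com vocabulary IRI) are made up for this example, and the sketch ignores safe CURIEs, terms and relative IRI resolution.

```python
def expand(value, prefixes):
    """Expand a CURIE-or-IRI value: an in-scope prefix wins over
    anything that merely looks like an IRI scheme."""
    prefix, sep, reference = value.partition(":")
    if sep and prefix in prefixes:
        return prefixes[prefix] + reference
    return value  # no matching prefix in scope: treat as a plain IRI


# Hypothetical prefix mappings, for illustration only.
plain = {"og": "http://ogp.me/ns#"}
shadowed = {"http": "http://example.com/vocab#"}  # like @xmlns:http="..."
identity = {"http": "http:"}                      # like @prefix="http: http:"

print(expand("og:video:height", plain))
# http://ogp.me/ns#video:height
print(expand("http://example.org/doc", plain))
# http://example.org/doc
print(expand("http://example.org/doc", shadowed))
# http://example.com/vocab#//example.org/doc
print(expand("http://example.org/doc", identity))
# http://example.org/doc
```

The third call is the case from point 1 (an IRI silently turned into another IRI), and the fourth is the self-binding from point 2.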

Whether or not to now take the opportunity to disallow CURIEs from
starting with "//" should be informed by answers to these questions,
directed to everybody:

* Would that change address the concerns of the RDF WG regarding
CURIEs being very easy to confuse with normal IRIs? I'd say yes in so
far as it disambiguates any normal, authority-based IRI from being a
potential CURIE.

* Does forbidding "//" from following the first ":" in CURIEs block
any actual or desirable CURIE usage? Gavin also said: "In general it
seems that the intent of CURIEs was to limit the right hand side to
relative references". I cannot find any use case. What is the general
opinion here?

* Everyone agrees on the existing, albeit arguably improbable, danger
of confusion and undesirable expansion of schemes. Is this problem so
minute in all conceivable scenarios that some prevention of this is
not reasonable, even if it doesn't prevent any current CURIE usage?

Now, I understand your feeling about addressing this at this point of
the process. My feeling is that this is the last chance we have to
reduce the danger of conflation (eliminating it in the case of IRIs of
the "scheme://" form). Of course, we should not let feelings dictate
what we do.

I've revised my proposal (C) below to mirror yours (B) in syntax, and
I altered it to only prevent the "//", nothing more (by adding
"ipath-absolute" to the choices). Thus it is equivalent to yours with
the exception of the construct choice: "//" iauthority ipath-abempty.

This makes our main options:

A. CURIEs today:

    curie       ::= [ [ prefix ] ':' ] reference
    reference   ::= irelative-ref

B. CURIEs supporting OpenGraph:

    curie       ::= [ [ prefix ] ':' ] reference
    reference   ::= ihier-part [ "?" iquery ] [ "#" ifragment ]

C. CURIEs supporting OpenGraph but not "prefix://":

    curie       ::= [ [ prefix ] ':' ] reference
    reference   ::= ( ipath-absolute / ipath-rootless / ipath-empty )
                        [ "?" iquery ] [ "#" ifragment ]

For comparison, this is the definition of IRI:

   IRI         = scheme ":" ihier-part [ "?" iquery ]
                        [ "#" ifragment ]

   ihier-part  = "//" iauthority ipath-abempty
               / ipath-absolute
               / ipath-rootless
               / ipath-empty
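
A rough way to compare options A, B and C is to approximate each "reference" production with a regular expression. The sketch below is only that: the character classes are deliberately simplified stand-ins for the RFC 3987 ones (no pct-encoding or iunreserved range checks, and the authority is not parsed), the names `OPTIONS` and `accepted_by` are invented here, and it assumes the value always carries a prefix.

```python
import re

# Simplified stand-ins for the RFC 3987 character classes (assumption).
SEG_NC = r"[^/?#:\s]"               # ipchar without ":" (isegment-nz-nc)
SEG = r"[^/?#\s]"                   # ipchar, ":" allowed
TAIL = r"(?:\?[^#\s]*)?(?:#\S*)?"   # [ "?" iquery ] [ "#" ifragment ]

AUTHORITY = rf"//[^/?#\s]*(?:/{SEG}*)*"   # "//" iauthority ipath-abempty
ABSOLUTE = rf"/(?:{SEG}+(?:/{SEG}*)*)?"   # ipath-absolute
NOSCHEME = rf"{SEG_NC}+(?:/{SEG}*)*"      # ipath-noscheme
ROOTLESS = rf"{SEG}+(?:/{SEG}*)*"         # ipath-rootless
EMPTY = r""                               # ipath-empty


def _reference(*paths):
    """Build a pattern for: ( path / path / ... ) plus the optional tail."""
    return re.compile("(?:" + "|".join(paths) + ")" + TAIL)


OPTIONS = {
    "A": _reference(AUTHORITY, ABSOLUTE, NOSCHEME, EMPTY),  # irelative-ref
    "B": _reference(AUTHORITY, ABSOLUTE, ROOTLESS, EMPTY),  # ihier-part
    "C": _reference(ABSOLUTE, ROOTLESS, EMPTY),             # no "//" form
}


def accepted_by(curie):
    """Return the options whose reference matches the text after the first ':'."""
    _prefix, _sep, reference = curie.partition(":")
    return [name for name, rx in OPTIONS.items() if rx.fullmatch(reference)]


print(accepted_by("og:video:width"))           # ['B', 'C']
print(accepted_by("schema:Person/Engineer"))   # ['A', 'B', 'C']
print(accepted_by("http://example.org/doc"))   # ['A', 'B']
print(accepted_by("ex:/absolute/path"))        # ['A', 'B', 'C']
```

Under these simplifications, only A rejects the OpenGraph form, and only C rejects a value that looks like an authority-based IRI, while a single leading "/" remains allowed in all three.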

Best regards,
Niklas


2012/1/25 Ivan Herman <ivan@w3.org>:
> (Gavin, welcome in our midst:-)
>
> Niklas,
>
> First of all, your issue with OpenGraph is indeed compelling. Whether we like it or not, it so happens that we have a major customer out there that uses an illegal CURIE (illegal in RDFa 1.0, that is) and we cannot ignore that nor can we force them to change that. This train is gone, so to say. I.e., I agree that RDFa 1.1 should try to accommodate this.
>
> However. As you yourself say below, your proposed changes conflate various different issues that are unrelated to the Facebook issue, namely the now notorious '//' starting character issue. As I said many times, at this point in the process I am not really happy touching that (though I can see the, albeit improbable, danger of confusion there), let alone the issue of incompatibilities with RDFa 1.0 (our charter obligation is to create incompatibilities only when there are really pressing issues or major market/user push to do so).
>
> Looking at the RFC[1], there is a simpler way to amend the definition of CURIE-s in our document, where the _only_ change is the fact that the ':' character would be allowed in the reference part. Indeed, here is a possible alternative for the CURIE definition:
>
> curie       ::=   [ [ prefix ] ':' ] reference
> reference   ::=   ihier-part [ "?" iquery ] [ "#" ifragment ] ; ('ihier-part', 'iquery' and 'ifragment' as defined in [RFC3987])
>
> By doing that change, the _only_ difference between the old definition of CURIE-s and the new one is the fact that ':' characters are allowed in the reference, ie, the OpenGraph CURIE-s become valid. I have put more details on this derivation in the Post Scriptum, if you want to check this (and somebody should, to be sure about it!).
>
> My conclusion is therefore that (a) yes, we have a problem, Niklas is right; (b) my proposed change addresses this and only this issue, does not create backward compatibility issues, and is also simpler in the spec :-). I would therefore prefer to go along this alternative.
>
> As for the discrepancy with Turtle: yes, I believe that this will be yet another difference between Turtle and RDFa, but Gavin should tell me whether I am wrong. However, the Facebook RDFa usage is, I am afraid, compelling enough that we have to do that and assume the differences.
>
> (Gavin, the other example that did come up on our call, is the fact that, for example, Schema.org has type URI-s of the form A/B/C, and it is imperative that RDFa can do something like schema:A/B/C...)
>
> Cheers
>
> Ivan
>
> P.S. Just to save you guys from going to the RFC document, here are the major points.
>
> The previous CURIE definition was
>
> curie       ::=   [ [ prefix ] ':' ] reference
> reference   ::=   irelative-ref ; (as defined in [RFC3987])
>
> and the RFC says:
>
> irelative-ref  = irelative-part [ "?" iquery ] [ "#" ifragment ]
> irelative-part = "//" iauthority ipath-abempty
>                 / ipath-absolute
>                 / ipath-noscheme  <-- !
>                 / ipath-empty
>
> Remember that arrow that I have put there (this is not in the original RFC)
>
> The new alternative refers to ihier-part, which is defined as:
>
> ihier-part     = "//" iauthority ipath-abempty
>                 / ipath-absolute
>                 / ipath-rootless   <-- !
>                 / ipath-empty
>
> Again the arrow is mine; this is indeed the only difference between ihier-part (used in the proposed new version of CURIE) and irelative-ref (used in the current version).
>
> Going down in the RFC, one finds:
>
> ipath-noscheme = isegment-nz-nc *( "/" isegment )
> ipath-rootless = isegment-nz *( "/" isegment )
>
> i.e., the only difference is that '-nc' suffix.
>
> The definitions of these two are not completely symmetric, because isegment-nz uses yet another indirection:
>
> isegment-nz    = 1*ipchar
> ipchar         = iunreserved / pct-encoded / sub-delims / ":" / "@"
>
> whereas isegment-nz-nc is defined directly:
>
> isegment-nz-nc = 1*( iunreserved / pct-encoded / sub-delims / "@" )
>
> But, as you can see, the _only_ difference between the two, at the end of the day, is that isegment-nz-nc disallows the ':' character from isegment-nz. QED, as they put it in mathematical proofs :-)
>
>
> [1] http://tools.ietf.org/html/rfc3987
>
>
> On Jan 25, 2012, at 01:50 , Niklas Lindström wrote:
>
>> Hi Ivan!
>>
>> (I'm CC:ing Gavin Carothers who raised ISSUE-125, since we're now
>> discussing whether this addresses those concerns at all.)
>>
>> 2012/1/24 Ivan Herman <ivan@w3.org>:
>>> Niklas,
>>>
>>> I think your analysis on the Open Graph protocol issue is correct.
>>
>> Good. Do you think that we should fix this? (I've been believing that
>> we do want that, even that most(?) of us thought that it was already
>> supported.) Of course, it is already the case today that since RDFa
>> 1.0 defines CURIEs like this as well, the OG usage is in fact invalid.
>> I don't know how many RDFa processors actually break on that though.
>> Since one could not mix (unsafe) CURIEs and IRIs in RDFa 1.0 I'd
>> expect most of them to just split on ":" and expand the prefix part.
>>
>>> My issue, however, is: if we go along the lines you propose, we are getting even further away from a compatibility with Turtle/SPARQL, an issue that has already been raised by the RDF WG. I am not sure what the best forum is for that.
>>
>> That was not my intention. :( The change *does* allow colon ":" in the
>> first segment of the local part now of course; explicitly in order to
>> support the OG form of CURIEs. This admittedly gets us further away.
>>
>> But by disallowing CURIEs to start with "prefix://", I hoped that it
>> would mitigate (if not fully address) one of the concerns that the
>> RDF-WG expressed, of confusing them with normal IRIs. As Gavin said:
>> "These are very easy to confuse with normal IRIs. In general it seems
>> that the intent of CURIEs was to limit the right hand side to relative
>> references but that is not accomplished by using the "irelative-ref"
>> production from the IRI RFC."
>>
>> So I set out to fulfill the goals of supporting CURIEs like:
>>
>>    og:video:width
>>    schema:Person/Engineer
>>    ex:some?very=special#thing
>>
>> while not allowing CURIEs of the forms like Gavin's example:
>>
>>    prefix://user:password[2001:0db8:85a3:0000:0000:8a2e:0370:7334]:8080/
>>
>> Nor any other IRI using the "//" authority path form (like http and https IRIs).
>>
>> I find five things to consider:
>>
>> 1. CURIEs do not currently allow e.g. "og:video:height". PNames don't
>> either. We however have RDFa in the wild using that form (both with
>> the original Open Graph Protocol using RDFa 1.0 and the new Open Graph
>> using RDFa 1.1 with @prefix).
>>
>> 2. CURIEs support lots of special characters in the local part; PNames
>> don't. The same reasoning as in 1 seems to apply, with our explicit
>> requirements being to support e.g. "schema:Person/Engineer" and
>> "db:resource/Albert_Einstein" (and "ex:some?very=special#thing", I
>> suppose).
>>
>> 3. CURIEs are allowed to be identical to full IRIs today (PNames most
>> definitely aren't). Gavin expressed concerns about this ("Host parts,
>> IPv4 and IPv6 segments") because they can be confused with normal
>> IRIs. I interpret that as meaning those with "//" and authority after
>> the scheme. I propose to not allow the CURIE local part to start with
>> "/" (and thus not "//").
>>
>> 4. CURIE prefixes, being defined as NCNames, allow some forms which
>> are not allowed in PName prefixes (e.g. prefixes starting with "_").
>> We may be able to use PN_PREFIX instead without breaking any real use
>> case.
>>
>> 5. I kept using the ABNF from the IRI RFC because CURIEs are based on
>> IRIs. The RDF WG asked us to use W3C EBNF. Provided that we should
>> address any of the above I'd gather that it is a sound request to do
>> so using EBNF.
>>
>> My hope is that if we were to address these, the RDF WG would find the
>> results satisfactory, even if the CURIE definition ends up as a
>> superset of PName.
>>
>> (Note that point 1 and 2 may also be of interest for the RDF WG
>> regarding PNames.)
>>
>> Best regards,
>> Niklas
>>
>> PS. You know that point 3 has vexed me, but please believe that I
>> don't want to reopen ISSUE-90. That suggested more invasive changes
>> which don't work with use cases as per above. I've absolutely accepted
>> that. I approached this based on ISSUE-125 along with the observation
>> of the Open Graph issue. Part of that suggested that point 3 is of
>> concern, and that it may be addressed without affecting our needs. I
>> want to keep the changes to a minimum while supporting as many
>> concerns as possible (usability and safety being the primary
>> objectives).
>>
>>
>>> Manu: will you be at the Coordination Group tomorrow? Maybe worth raising the issue there?
>>>
>>> ivan
>>>
>>> On Jan 24, 2012, at 04:01 , Niklas Lindström wrote:
>>>
>>>> Hello,
>>>>
>>>> I've been investigating some of the minute details and issues
>>>> surrounding CURIEs, based on the discussion that recently cropped up
>>>> with ISSUE-125 [1].
>>>>
>>>> It seems to me that the definition we currently have is flawed in one
>>>> more way, and quite crucially so.
>>>>
>>>>
>>>> ## The Problem ##
>>>>
>>>> As we already know, a bunch of Facebook OpenGraph properties are
>>>> expressed with CURIEs where the parts after the prefix themselves
>>>> contain colons. For instance, "video:actor:role", and
>>>> "my-og-app:podcast:url" as seen in the examples at [2]. (There are
>>>> also 13 such properties defined in <http://ogp.me/ns#>, e.g.
>>>> "og:image:width" and "og:video:height".)
>>>>
>>>> We currently define CURIEs as:
>>>>
>>>>    curie       ::=   [ [ prefix ] ':' ] reference
>>>>    reference   ::=   irelative-ref ; (as defined in [RFC3987])
>>>>
>>>> Now, I may be too tired to see clearly, but if I read the definition
>>>> of irelative-ref in section 2.2 of RFC 3987 [3] correctly, it actually
>>>> prohibits such CURIEs!
>>>>
>>>> Let me explain. I find these to be the relevant definitions in RFC 3987:
>>>>
>>>>    irelative-ref  = irelative-part [ "?" iquery ] [ "#" ifragment ]
>>>>
>>>>    irelative-part = "//" iauthority ipath-abempty
>>>>                   / ipath-absolute
>>>>                   / ipath-noscheme
>>>>                   / ipath-empty
>>>>
>>>>    ipath-absolute = "/" [ isegment-nz *( "/" isegment ) ]
>>>>    ipath-noscheme = isegment-nz-nc *( "/" isegment )
>>>>    ipath-empty    = 0<ipchar>
>>>>
>>>>    isegment-nz-nc = 1*( iunreserved / pct-encoded / sub-delims
>>>>                        / "@" )
>>>>                  ; non-zero-length segment without any colon ":"
>>>>
>>>> If I interpret the ABNF [4] properly, given "og:image:width", I get
>>>> the following:
>>>>
>>>> * "og:" matches the prefix and ":", so we match "image:width" against
>>>> irelative-ref;
>>>> * there is no "?" or "#" in that, so only irelative-part is considered;
>>>> * it does not start with "//", so we skip the following (iauthority
>>>> ipath-abempty) of the first alternative;
>>>> * it does not start with "/", so it is not an ipath-absolute;
>>>> * it contains a colon ":", so it is not an ipath-noscheme (does not
>>>> match isegment-nz-nc *( "/" isegment ));
>>>> * it is not empty, so it is not an ipath-empty.
>>>>
>>>> With no more alternatives in irelative-part, I conclude that
>>>> "og:image:width" is not a valid CURIE!
>>>>
>>>> Please correct me if I'm wrong here! If not, it is quite evident that
>>>> we have to fix this (lest we accept to break a widely deployed
>>>> de-facto usage).
>>>>
>>>> Ironically, we *do* allow for CURIEs to begin with "//". This makes it
>>>> possible to use CURIEs *indistinguishable* from "normal" IRIs (using
>>>> authority and paths), as explained in ISSUE-125 (and in my old (dead
>>>> horse) ISSUE-90 [5]).
>>>>
>>>>
>>>> ## The Proposal ##
>>>>
>>>> We have the opportunity here to fix a lot of things. I propose to
>>>> define CURIEs along the lines of:
>>>>
>>>>    curie           =   [ prefix ] ':' local
>>>>    prefix          =   PN_PREFIX; as defined in SPARQL 1.1 [6]
>>>>    local           =   (ipath-rootless / ipath-empty)
>>>>                            [ "?" iquery ] [ "#" ifragment ]
>>>>
>>>>    ipath-rootless  = isegment-nz *( "/" isegment )
>>>>    isegment        = *ipchar
>>>>    isegment-nz     = 1*ipchar
>>>>    ipchar          = iunreserved / pct-encoded / sub-delims / ":"
>>>>                        / "@"
>>>>
>>>> For comparison, this is the definition of the full IRI:
>>>>
>>>>    IRI         = scheme ":" ihier-part [ "?" iquery ]
>>>>                         [ "#" ifragment ]
>>>>
>>>>    ihier-part  = "//" iauthority ipath-abempty
>>>>                / ipath-absolute
>>>>                / ipath-rootless
>>>>                / ipath-empty
>>>>
>>>>
>>>> ## The Consequences ##
>>>>
>>>> This (if I'm awake enough) still allows for *all* the use cases that
>>>> have hitherto been put forward as needed. E.g.:
>>>>
>>>>    schema:Person/Doctor
>>>>    og:video:height
>>>>    db:resource/Albert_Einstein
>>>>    ex:some?very=special#thing
>>>>
>>>> (While it is true that it would prevent the "hack" once presented as a
>>>> means of using full IRIs where RDFa 1.0 only allows CURIEs (by using
>>>> @xmlns:http="http:"), isn't that moot? Any processor affected by this
>>>> change in RDFa 1.1 should reasonably use RDFa 1.1 rules, where we now
>>>> allow such IRIs anywhere CURIEs are allowed. (And for that matter, I
>>>> don't recall any reports of actual usage of that.))
>>>>
>>>> Most importantly, this completely eliminates the risk of confusing
>>>> CURIEs with normal IRIs. That is, IRIs with a scheme followed by "//",
>>>> an authority, and a path of segments (separated with "/"), followed by
>>>> optional "?" query and "#" fragment parts. These are the kinds of IRIs
>>>> that can be expressed in various relative forms and resolved against a
>>>> base IRI.
>>>>
>>>> Looking at the list of official and common URI schemes at [7], I find
>>>> that of the 137 schemes, 71 (52%) are in the authority+path form. As
>>>> we know, the prevalent two on the web, http and https, are of this
>>>> kind (arguably the only relevant ones). I'd wager that we can expect
>>>> this form to stay prevalent on the web *even* if "http" were to be
>>>> eventually superseded. (I say so because relative paths are immensely
>>>> usable, and there is an abundance of code dealing with hierarchical
>>>> URL/URI resolution. Combined with the DNS-based authority model it's
>>>> reasonably here to stay.)
>>>>
>>>> Note also the fact that "http" used as prefix has already turned up in
>>>> the wild, due to the HTTP Vocabulary Working Draft [8]. This has even
>>>> been used in the RDFa 1.1 Core spec itself (as I recently reported in
>>>> my review). To my knowledge, we have asked the ERT WG to change this,
>>>> but this has not yet happened. With this change, such a prefix would
>>>> no longer be a (technical) problem.
>>>>
>>>> The other form is of the "opaque" IRIs (without an authority part and
>>>> possibly no "/" separated segments (i.e. "non-relativizable")).
>>>> Seemingly we've hitherto *unintentionally* prevented some of them
>>>> (e.g. urn: and tag: URIs); but at the price of the OpenGraph CURIEs.
>>>> There are some fairly well-known schemes in this group (official or
>>>> not), e.g.: mailto, tag, urn, doi, geo, tel, callto, news, xmpp, sip,
>>>> sms, bitcoin, gtalk, skype, spotify. Of these, "tag" and "geo" can be
>>>> found in prefix.cc. (I've previously mentioned that "geo" may be of
>>>> some concern for certain RDFa users [9].) But as we've already
>>>> concluded when resolving ISSUE-90, we argue that these will probably
>>>> not be used as prefixes, and will be quite uncommon as schemes of
>>>> subject or object IRIs in RDFa. Also, given that many IRIs using these
>>>> schemes already are reminiscent of CURIEs, and are of a rather
>>>> specialized nature, I'd imagine that it's easier for anyone coming
>>>> across such oddities to recognize the collision risk, should it ever
>>>> happen. We should still be very clear in the section about CURIEs
>>>> though, that prefixes overshadow schemes in IRIs of these forms, and
>>>> that we advise users to monitor the in-scope prefixes for any such
>>>> collision (along with the workaround accomplishable by using e.g.
>>>> @prefix="geo: geo:").
>>>>
>>>>
>>>> ## Summary ##
>>>>
>>>> I sincerely hope that I have interpreted the ABNF correctly and
>>>> haven't raised the issue of OpenGraph CURIEs in error. And that I have
>>>> made a clear and satisfactory draft proposal for fixing both this and
>>>> the problems raised in ISSUE-125 (primarily the risk of confusing
>>>> CURIEs with normal IRIs).
>>>>
>>>> Best regards,
>>>> Niklas
>>>>
>>>> [1]: http://www.w3.org/2010/02/rdfa/track/issues/125
>>>> [2]: http://developers.facebook.com/docs/opengraph/objects/builtin/
>>>> [3]: http://tools.ietf.org/html/rfc3987#section-2.2
>>>> [4]: http://en.wikipedia.org/wiki/Augmented_Backus%E2%80%93Naur_Form
>>>> [5]: http://www.w3.org/2010/02/rdfa/track/issues/90
>>>> [6]: http://www.w3.org/TR/2012/WD-sparql11-query-20120105/#rPNAME_LN
>>>> [7]: http://en.wikipedia.org/wiki/URI_scheme
>>>> [8]: http://www.w3.org/TR/HTTP-in-RDF10/
>>>> [9]: http://lists.w3.org/Archives/Public/public-rdfa-wg/2011Aug/0039.html
>>>>
>>>
>>>
>>> ----
>>> Ivan Herman, W3C Semantic Web Activity Lead
>>> Home: http://www.w3.org/People/Ivan/
>>> mobile: +31-641044153
>>> FOAF: http://www.ivan-herman.net/foaf.rdf
>>>
>>>
>>>
>>>
>>>
>>
>
Received on Thursday, 26 January 2012 02:48:19 GMT
