Re: CURIEorURI Value Space Collisions

From: Niklas Lindström <lindstream@gmail.com>
Date: Sat, 16 Apr 2011 14:31:40 +0200
Message-ID: <BANLkTin21BMKYvTtVx1cPwBwpDe7SC63Cg@mail.gmail.com>
To: Shane McCarron <shane@aptest.com>
Cc: public-rdfa-wg@w3.org
Hi Shane!

On Sat, Apr 16, 2011 at 12:07 AM, Shane McCarron <shane@aptest.com> wrote:
> I guess I fail to appreciate the core problem here.  Are you worried that
> there will be a prefix declared in a framework (e.g., news:) that in some
> distant future becomes a real scheme, and that @about values that use that
> real scheme as a full URI will be introduced as content within that
> framework?  I can see this as a remote possibility, but I can't get too
> worked up about it.

Yes, I worry about that. And "news" has been a real scheme since at
least 1994 [1]. See the IANA URI Scheme registry [2], or its
corresponding Wikipedia article [3] (which I referenced in my original
post) for some 60+ schemes used in the wild. I know that only a
fraction of these are expected to be used as schemes in subjects and
objects in RDF, and for Linked Data, http is reasonably the only one
at present (well, and https). But we do not know anything for sure,
and change is sometimes rapid on the web.

Hitherto, prefix mechanisms in RDF formats have never extended the URI
space in any parsing context. They've been used in another value space
which on parsing yields URIs. With the path RDFa 1.1 is currently on,
it suddenly *extends* the URI space with a concept where one can
create "magic schemes" which are replaced with URIs to yield URIs. In
effect, schemes and prefixes are conflated. It is this mechanism I am
wary of.
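
To make the conflation concrete, here is a minimal Python sketch of a
simplified CURIEorURI resolution. The function name
resolve_curie_or_uri and the example.org URI are my own illustrations,
and the logic is a deliberate simplification, not the normative RDFa
1.1 processing rules:

```python
# Hypothetical, simplified CURIEorURI resolution: a value whose leading
# "name:" matches a declared prefix is treated as a CURIE; anything else
# is passed through untouched as a URI.

def resolve_curie_or_uri(value, prefixes):
    head, sep, rest = value.partition(":")
    if sep and head in prefixes:
        # A declared prefix wins, even if `head` is also a real URI scheme.
        return prefixes[head] + rest
    return value


prefixes = {
    "foaf": "http://xmlns.com/foaf/0.1/",
    # The prefix name recommended by the HTTP Vocabulary draft:
    "http": "http://www.w3.org/2006/http#",
}

print(resolve_curie_or_uri("foaf:Agent", prefixes))
print(resolve_curie_or_uri("http://example.org/doc", prefixes))
```

With "http" declared as a prefix, the second call no longer yields
http://example.org/doc but http://www.w3.org/2006/http#//example.org/doc.
The declared prefix has silently taken over the scheme.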

Since @about and @resource contain web identifiers, they should be (at
least fairly) insulated from the technicalities of the prefix
mechanism. (And note that people are now recommending the HTTP
Vocabulary to use "httpv" instead of "http". There is no guarantee
that an http extension won't come along using that as a scheme name.)

Please note that I am mainly talking about the use of
SafeCURIEorCURIEorURI here! See the RDFa 1.1 Core [4], where this is
used for @about and @resource. My opinion is that for declaring
subjects and objects, reverting to only SafeCURIEorURI is the safest,
but allowing something like SafeCURIEorQNameOrURI might be fairly safe.

I actually have less problem with the TERMorCURIEorAbsURI space for
@property, @rel etc. But that's because the use there requires
immediate awareness of which prefixes are declared (and partly because
I don't expect myself to use full URIs there; I just don't find that
author-friendly). The idea here (Mark's I think?) is that undeclared
prefixes are resolved to themselves. So I accept that the
TERMorCURIEorAbsURI construct is there to please those who will not
use prefixes at all. In the attributes where this is used in RDFa 1.1,
CURIEs are the norm.

(Sure, it's still complex to have different value spaces, both of them
allowing unsafe CURIEs and URIs (which is the case now in the RDFa 1.1
draft). I might prefer, for simplicity, that any space where unsafe
CURIEs and URIs are allowed to mix only allowed QName-compliant
CURIEs.)

One other problem is the reuse of the CURIEorURI definition in the
RDFa API, at least without properly making it clear that in such a
value, every prefix declaration works as a "magic scheme". I'd much
rather see SafeCURIEorQNameOrURI being used in such places -- again,
at least in subject and object positions.

I have yet to see CURIEs where prefixes are used as "magic (macro)
schemes", rather than as common QNames. Those have been enough for
basically all types and property shorthands so far (and I believe good
vocabulary design should continue to adhere to that). But I know that
CURIEs have their place, since the restrictions of QNames rule out
certain edge cases, and that's fine in a CURIE-only value space.

For potential conveniences (like "dbp-ont:Artist/xyz"), I would expect
people to be just fine using SafeCURIEs when intermingling these with
regular URIs. I like that shorthand practice, especially when it's
explicit (i.e. using the safe "[dbp-ont:Artist/xyz]" form).
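
As a rough sketch of how a SafeCURIEorQNameOrURI space could treat
such values (in Python; the function name, the ASCII-only pattern for
the local part, and the choice to return None for an undeclared prefix
in brackets are all my assumptions, not spec text):

```python
import re

# Assumed ASCII-only stand-in for a QName local part; the real NCName
# production also allows many non-ASCII characters.
QNAME_LOCAL = re.compile(r"[A-Za-z_][A-Za-z0-9_.-]*")


def resolve_safe_curie_or_qname_or_uri(value, prefixes):
    """Hypothetical resolver for a SafeCURIEorQNameOrURI value space."""
    if value.startswith("[") and value.endswith("]"):
        # Explicit safe CURIE: any reference is allowed inside the brackets.
        prefix, sep, reference = value[1:-1].partition(":")
        if sep and prefix in prefixes:
            return prefixes[prefix] + reference
        return None  # assumption: undeclared prefix means "ignore the value"
    prefix, sep, reference = value.partition(":")
    if sep and prefix in prefixes and QNAME_LOCAL.fullmatch(reference):
        # Unbracketed form: only QName-like references count as CURIEs.
        return prefixes[prefix] + reference
    return value  # everything else is left alone as a URI


prefixes = {
    "dbp-ont": "http://dbpedia.org/ontology/",
    "http": "http://www.w3.org/2006/http#",
}

print(resolve_safe_curie_or_qname_or_uri("[dbp-ont:Artist/xyz]", prefixes))
print(resolve_safe_curie_or_qname_or_uri("dbp-ont:Artist", prefixes))
print(resolve_safe_curie_or_qname_or_uri("http://example.org/doc", prefixes))
```

Here the bracketed hierarchical form and the plain QName both expand,
while http://example.org/doc stays a URI even though "http" is declared
as a prefix, because "//example.org/doc" is not QName-like.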

> There are lots of things that are going to evolve over
> time on the Internet. We cannot predict all of them.

But that is my point! That is why I am getting "worked up". ;) I know
that most people only use http, and it may be enough to caution people
not to declare http as a prefix (since the consequence of doing so in
RDFa 1.1 would be dire). But it is technically unsafe (as everybody
seems to agree). Being pedantic, and fearing the consequences of
conflating schemes and prefixes, I am driven to voice my concern. (Not
to mention that I dislike this conflation from a theoretical
perspective as well.)

> I would be open to a few things that could tighten this stuff up - reducing
> the possibility of misinterpretation.  In no particular order, we could do
> none or all of the following:
>  1. Restrict the use of schemes that are well known *at publication
>     time* from prefix declarations. E.g., declare that http, https,
>     mailto, etc. are all illegal as prefix names, and require that
>     conforming processors ignore them (or issue an error or issue a
>     warning - don't really mind).

-1. Doesn't solve the core issue. To me it only highlights the problem
(coming off a bit as "duct tape on the opened can of worms", if you'll
pardon my simile). Requiring RDFa parsers to take account of the
scheme registry (as said, containing some 60+ schemes at this time)
sounds bad to me, *especially* since prefixes should be completely
orthogonal to URI schemes. It will probably be to the detriment of
prefixes (which alas are already questioned, in spite of their IMO
apparent friendliness).
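
A small Python sketch of why a list frozen at publication time does
not help. The set contents, the function names and the "wxg" scheme
(the made-up example from my original post) are all hypothetical:

```python
# Assumed snapshot of "well known" schemes at spec publication time.
RESERVED_AT_PUBLICATION = frozenset(
    {"http", "https", "mailto", "news", "ftp", "tag"}
)


def accepted_prefixes(declared):
    """Drop declarations whose prefix name is in the frozen snapshot."""
    return {
        name: iri
        for name, iri in declared.items()
        if name not in RESERVED_AT_PUBLICATION
    }


def resolve(value, prefixes):
    """Simplified CURIEorURI resolution: a declared prefix always wins."""
    head, sep, rest = value.partition(":")
    return prefixes[head] + rest if sep and head in prefixes else value


declared = {
    "http": "http://www.w3.org/2006/http#",
    "wxg": "http://example.org/wxg#",  # hypothetical vocabulary prefix
}
accepted = accepted_prefixes(declared)

print(resolve("http://example.org/doc", accepted))
# If "wxg" later becomes a real scheme, its URIs are still rewritten:
print(resolve("wxg://example.com/page", accepted))
```

The "http" declaration is dropped, so http URIs survive, but "wxg"
passes the check and any later wxg: URI is still treated as a CURIE.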

>  2. Restrict the 'reference' portion pattern further, such that it
>     prohibits leading '//'.  There is no need I can imagine to permit
>     '//' at the beginning of a reference.  So if there were a string
>     like 'foaf://some/reference' it would not be treated as a CURIE,
>     but 'foaf:some/reference' or 'foaf:/some/reference' would still be
>     a CURIE.

-0/-1. It eliminates the risk for http/https, but not for schemes
like "mailto" (or "tag").
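
A quick Python sketch of that restriction, under the assumption that
resolution otherwise works as a plain prefix lookup. The function name
and the "mailto" and "tag" prefix mappings are hypothetical:

```python
def resolve_no_double_slash(value, prefixes):
    """Simplified CURIEorURI resolution that refuses references starting with '//'."""
    head, sep, rest = value.partition(":")
    if sep and head in prefixes and not rest.startswith("//"):
        return prefixes[head] + rest
    return value


prefixes = {
    "http": "http://www.w3.org/2006/http#",
    "mailto": "http://example.org/mail#",  # hypothetical mapping
    "tag": "http://example.org/tags/",     # hypothetical mapping
}

print(resolve_no_double_slash("http://example.org/doc", prefixes))
print(resolve_no_double_slash("mailto:someone@example.org", prefixes))
print(resolve_no_double_slash("tag:example.org,2011:thing", prefixes))
```

The http URI is now left alone, but the mailto and tag URIs have no
leading "//" and are still rewritten as if they were CURIEs.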

>  3. Encourage content authors to use prefix names that are unlikely to
>     ever be a scheme name (not sure this makes sense).  We still have
>     the NCNAME restriction on prefix names, but myprefix0 is an NCNAME
>     and my-prefix is an NCNAME.  And I can't imagine those ever being
>     a real scheme.

-1. This only indicates that there is a design flaw in the value space
definition (CURIEorURI).

>  4. Encourage content authors to eschew the use of prefixes and just
>     use full URIs (not sure this makes sense either).

-1. Giving up on prefixes would be terrible. I just want to see their
use made effortlessly safe.

As said, I am pedantic on this. I don't seem to be the only one with
this itch though, and I hope you understand my perspective.

Best regards,

[1]: http://tools.ietf.org/html/rfc1738
[2]: http://www.iana.org/assignments/uri-schemes.html
[3]: http://en.wikipedia.org/wiki/URI_scheme
[4]: http://www.w3.org/TR/rdfa-core/

> I don't think any of these steps ELIMINATE the possibility of
> misinterpretation.  But they surely won't hurt, and they are all completely
> consistent with the *intent* of CURIEs.
> Thoughts?
> On 4/15/2011 4:53 AM, Niklas Lindström wrote:
>> Hi Ivan!
>> 2011/4/15 Ivan Herman<ivan@w3.org>:
>>> True. But we would also lose possibly very useful features.
>>> I recently realized, to take an example, that the DBPedia concepts'
>>> ontology has some sort of a hierarchy. They use
>>> http://dbpedia.org/ontology/
>>> http://dbpedia.org/ontology/Artist/
>>> http://dbpedia.org/ontology/Film/
>>> etc.
>>> which would then be used, for example, on types. At the moment, one can
>>> define a prefix for ../ontology/ and then use something like
>>> dbp-ont:Artist/XXX instead of being forced to define a separate prefix for
>>> each sub-hierarchy.
>> Yes, I've thought about that too. But that would still be possible if
>> the CURIEorURI is changed to the RestrictedCURIEOrSafeCURIEorURI
>> (which I just suggested in reply to Mark -- i.e. where RestrictedCURIE
>> is defined as one of QName, or "isegment-nz-nc", or Nathan's
>> "path-absolute / ipath-noscheme / ipath-empty").
>>> B.t.w., on a separate comment: in my implementation I actually generate a
>>> warning if a URI is used with an unusual (ie, non-registered) scheme. In
>>> most cases this is the result of a misspelling in the prefix. I am not sure
>>> it is worth adding that to RDFa Core as a requirement, or just have this as a
>>> good practice for RDFa processors...
>> Warnings are useful, but I definitely don't think that an RDFa parser
>> should have to worry about the scheme registry. Neither should
>> authors. It will evolve independently of the implementation and use of
>> RDFa, and of the (very much decentralized) definition of prefixes for
>> vocabularies.
>> Best regards,
>> Niklas
>>> Ivan
>>> On Apr 13, 2011, at 19:09 , Nathan wrote:
>>>> That said, it would be a lot less ambiguous if CURIE didn't use
>>>> irelative-ref and instead used:
>>>>  reference ::= ipath-absolute / ipath-noscheme / ipath-empty
>>>> then at least, http://example.org/ would never be a CURIE, and a prefix
>>>> mapping for http: would never apply / confuse.
>>>> Best,
>>>> Nathan
>>>> Mark Birbeck wrote:
>>>>> Hi Niklas,
>>>>> Everything you say is true. :)
>>>>> However, the big change in the working group's thinking came when we
>>>>> decided that it was impossible to guarantee correct interpretation of
>>>>> strings of text based solely on their format, and so instead we should
>>>>> rely on the strings' contexts.
>>>>> By using context to aid in the interpretation of a string we get a lot
>>>>> more flexibility, and we can unambiguously work out what things like
>>>>> this mean:
>>>>>  foaf:Agent
>>>>> Without context it *looks* like all of the following:
>>>>>  * a string of text with no particular meaning;
>>>>>  * a QName;
>>>>>  * a CURIE;
>>>>>  * a relative URI using the 'foaf' scheme.
>>>>> However, we decided in the working group that if no prefix mapping for
>>>>> 'foaf' was defined in the context for this string, then the string was
>>>>> *by definition* not a CURIE.
>>>>> Whether it therefore becomes a string of text or a URI is a separate
>>>>> processing step, and nothing to do with CURIE processing, but by
>>>>> taking the approach we did in the CURIE processing layer we at least
>>>>> made it possible for 'foaf:Agent' to be interpreted as a URI.
>>>>> The converse also holds; if a mapping for 'foaf' is defined, then the
>>>>> string above is *by definition* a CURIE. Now whether some host
>>>>> language decides to interpret the string as a CURIE above a URI is up
>>>>> to that host language, but RDFa does so.
>>>>> Personally I was very pleased when we took the step to take context
>>>>> into account when interpreting strings. Until that point we were
>>>>> trying to achieve the impossible -- imagining that a string on its own
>>>>> could tell you everything about what it was. Now it's very easy to
>>>>> interpret both of these strings correctly:
>>>>>  foaf:Agent
>>>>>  http://www.w3.org/
>>>>> simply by using the context.
>>>>> Best regards,
>>>>> Mark
>>>>> 2011/4/11 Niklas Lindström<lindstream@gmail.com>:
>>>>>> Hi all!
>>>>>> Is it correct that the RDFa WG is currently recommending letting
>>>>>> CURIEs share the same value space as regular URIs, and so that any
>>>>>> prefix defined with the same value as a scheme, like "http", "https",
>>>>>> "news" etc. will change the URI for any absolute URI using those
>>>>>> schemes?
>>>>>> I remember worrying about this last year, but I haven't followed the
>>>>>> decision process in detail since then. It just worries me that letting
>>>>>> these things collide will blow up for anyone who happens to use at
>>>>>> least "http" or "https" as prefixes (perhaps rendering prefixes using
>>>>>> a tool, or getting them from a profile out of their control). Or
>>>>>> perhaps worse, people believing it safe to use anything but "http(s)"
>>>>>> as prefixes, which will work until something other than those two
>>>>>> comes along in the next 10 years or so. It might happen; and if it
>>>>>> does, it may quite probably be beyond the controls of RDFa specs and
>>>>>> tools.
>>>>>> (An example: some vocabulary "Wide Exceptional Graphs" becomes
>>>>>> popular, using "wxg" as a prefix. Then Google comes along with a new
>>>>>> wxg scheme ("Web Extended by Google"), and soon lots of resources are
>>>>>> linked with that instead of old "http". Or for that matter, that some
>>>>>> other scheme [3] becomes popular again for whatever reason.)
>>>>>> I vaguely recall the WG saying something about defining "http" as a
>>>>>> prefix is bad practice. But this turns up here and there, not least
>>>>>> since the HTTP Vocabulary Draft [1] (<http://www.w3.org/2006/http#>)
>>>>>> recommends it as a prefix. And I just ran across "http" as a prefix in
>>>>>> the Tabulator source as well [2].
>>>>>> While I understand that it is confusing to use it as a prefix, I am
>>>>>> not convinced that it is safe to combine the CURIE and URI value space
>>>>>> like this. At least not without a limit on the CURIEs allowed in the
>>>>>> joint CURIEorURI space. For instance, not allowing CURIEs in that
>>>>>> space to use anything after the prefix+':' other than say an
>>>>>> isegment-nz-nc from RFC 3987, or something to that effect (like a
>>>>>> "[A-Za-z0-9_.-]+" regexp).
>>>>>> If there was such a restriction on the format of CURIEs that are allowed in
>>>>>> the CURIEorURI mix (and that anything not matching it would be
>>>>>> considered a full URI), I would definitely sleep better. :)
>>>>>> Am I missing something crucial, or overly worried about the risk of
>>>>>> collisions?
>>>>>> Best regards,
>>>>>> Niklas
>>>>>> [1]: http://www.w3.org/TR/HTTP-in-RDF10/
>>>>>> [2]:
>>>>>> http://dig.csail.mit.edu/hg/tabulator/file/9a135feff10f/chrome/content/js/rdf/rdflib.js#l5644
>>>>>> [3]: http://en.wikipedia.org/wiki/URI_scheme
>>> ----
>>> Ivan Herman, W3C Semantic Web Activity Lead
>>> Home: http://www.w3.org/People/Ivan/
>>> mobile: +31-641044153
>>> PGP Key: http://www.ivan-herman.net/pgpkey.html
>>> FOAF: http://www.ivan-herman.net/foaf.rdf
> --
> Shane P. McCarron                          Phone: +1 763 786-8160 x120
> Managing Director                            Fax: +1 763 786-8180
> ApTest Minnesota                            Inet: shane@aptest.com
