Re: CURIEorURI Value Space Collisions from Ivan Herman on 2011-05-01 (public-rdfa-wg@w3.org from May 2011)

From: Ivan Herman <ivan@w3.org>
Date: Sun, 1 May 2011 18:46:51 +0200
To: Niklas Lindström <lindstream@gmail.com>
Cc: public-rdfa-wg <public-rdfa-wg@w3.org>
Message-Id: <54758072-577D-4381-A66A-0BC5B7FC7BB4@w3.org>

On May 1, 2011, at 17:30 , Niklas Lindström wrote:

> Hi folks!
> 
> Not sure if I overloaded you with this diatribe against CURIEorURI, or
> if you're already overloaded with work. ;) (I suppose the latter,
> since you plan to discuss this in an upcoming telecon, and I know
> there are a bunch of important items on the agenda.)

Indeed, we are busy getting the various APIs out of the door. Besides, the 2nd Last Call round has just ended, and we have delayed looking at those comments until we have them all in.


> 
> Nevertheless, I just ran into this piece of code:
> 
>    https://github.com/norcalrdf/pymantic/blob/master/pymantic/uri_schemes.py
> 
> containing the documentation:
> 
>    "A complete list of URI schemes registered as of Sept 26th, 2008,
> used when parsing CURIEs to differentiate explicit URIs from CURIEs."
> 
> The fact that it even exists is to me a clear indication of the
> problems of maintenance, and collision risks, we face when conflating
> URI schemes and prefixes. (Will every implementation use the same
> list? Will no one use a prefix in the list as it stands today? Is it
> acceptable that the range of allowed prefixes will shrink (albeit
> slowly) over time? And so on...)

I think I am repeating myself (and I have to emphasize that this is my personal reaction, not an official WG opinion, let alone a W3C opinion...): in this matter my concern is to weigh the probability of running into a real problem in practice against the ease with which HTML authors can write their code. I know from my own personal experience with RDFa 1.0 that the 'safe CURIE' approach is a drag: it is something I regularly forgot, and it looks pretty unnatural. We also know that there are people who do not want to use CURIEs. Finally, having the possibility to write, say, 

db:ontology/bla
db:resource/blob

where db stands for the DBpedia core vocabulary, is really useful and helpful: it avoids having to define separate prefixes for dbpedia.org/ontology and dbpedia.org/resource (these examples are real).
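
For illustration, a minimal Python sketch of how such a CURIE expands by plain concatenation. The function name expand_curie and the mapping of db to http://dbpedia.org/ are assumptions made for this example, not something taken from the RDFa drafts:

```python
def expand_curie(value, prefixes):
    """Expand 'prefix:reference' by concatenating the prefix's URI and the reference.

    Returns None when the value has no colon or the prefix is not declared.
    """
    prefix, sep, reference = value.partition(":")
    if not sep or prefix not in prefixes:
        return None
    return prefixes[prefix] + reference


# Assumed mapping: a single prefix covering the whole site.
prefixes = {"db": "http://dbpedia.org/"}

for curie in ("db:ontology/bla", "db:resource/blob"):
    print(curie, "->", expand_curie(curie, prefixes))
```

Because the reference part may contain '/', one declaration is enough for both sub-hierarchies; a QName-only reference would not allow that.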

Having an implementation store the list of registered schemes (a list that does _not_ shrink, because no scheme goes out of definition) and issue a warning when a prefix collides with a scheme works for me. Because a prefix definition takes precedence over the URI scheme (say, your wxg prefix), RDFa content produced today using that prefix remains valid and produces the same RDF graph even if at some point wxg becomes a URI scheme. If at that point somebody wants to use the wxg scheme, then another prefix will have to be chosen in newly produced RDFa content, but that does not seem to create huge problems...
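
A rough Python sketch of that behaviour, under stated assumptions: KNOWN_SCHEMES is a small illustrative subset (a real processor would carry the full IANA registry), and the names check_prefixes and resolve, as well as the example.org mapping, are hypothetical and not taken from any existing implementation:

```python
import warnings

# Illustrative subset only; not the full IANA URI scheme registry.
KNOWN_SCHEMES = frozenset({"http", "https", "mailto", "news", "tag", "urn", "ftp"})


def check_prefixes(prefixes):
    """Warn about declared prefixes that collide with a known URI scheme.

    Returns the colliding prefix names, sorted.
    """
    collisions = sorted(p for p in prefixes if p.lower() in KNOWN_SCHEMES)
    for p in collisions:
        warnings.warn(f"prefix '{p}' collides with a registered URI scheme")
    return collisions


def resolve(value, prefixes):
    """Resolve a CURIEorURI value; a declared prefix wins over a same-named scheme."""
    prefix, sep, reference = value.partition(":")
    if sep and prefix in prefixes:
        return prefixes[prefix] + reference  # treated as a CURIE
    return value  # no mapping for the prefix: treated as a URI, unchanged


# Hypothetical declaration of the 'wxg' prefix from the example in this thread.
prefixes = {"wxg": "http://example.org/wxg#"}
print(resolve("wxg:Thing", prefixes))           # expanded through the prefix
print(resolve("http://www.w3.org/", prefixes))  # left alone, no 'http' prefix
```

With this ordering, a document that declares wxg keeps yielding the same graph even if wxg is later registered as a scheme; the registry is only consulted to produce the warning.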

Again, this is my personal opinion. 

> 
> Anyway, pardon my nagging.

Niklas, please... this is not nagging, this is a regular technical concern.

Cheers

Ivan


> 
> Best regards,
> Niklas
> 
> 
> 
> 2011/4/16 Niklas Lindström <lindstream@gmail.com>:
>> Hi Shane!
>> 
>> On Sat, Apr 16, 2011 at 12:07 AM, Shane McCarron <shane@aptest.com> wrote:
>>> I guess I fail to appreciate the core problem here.  Are you worried that
>>> there will be a prefix declared in a framework (e.g., news:) that in some
>>> distant future becomes a real scheme, and that @about values that use that
>>> real scheme as a full URI will be introduced as content within that
>>> framework?  I can see this as a remote possibility, but I can't get too
>>> worked up about it.
>> 
>> Yes, I worry about that. And "news" has been a real scheme since at
>> least 1994 [1]. See the IANA URI Scheme registry [2], or its
>> corresponding Wikipedia article [3] (which I referenced in my original
>> post) for some 60+ schemes used in the wild. I know that but a
>> fragment of these are expected to be used as schemes in subjects and
>> objects in RDF, and for Linked Data, http is reasonably the only one
>> at present (well, and https). But we do not know anything for sure,
>> and change is sometimes rapid on the web.
>> 
>> Hitherto, prefix mechanisms in RDF formats have never extended the URI
>> space in any parsing context. They've been used in another value space
>> which on parsing yields URIs. With the path RDFa 1.1 is currently on,
>> it suddenly *extends* the URI space with a concept where one can
>> create "magic schemes" which are replaced with URIs to yield URIs. In
>> effect, schemes and prefixes are conflated. It is this mechanism I am
>> wary of.
>> 
>> Since @about and @resource contain web identifiers, they should be (at
>> least fairly) insulated from the technicalities of the prefix
>> mechanism. (And note that people are now recommending the HTTP
>> Vocabulary to use "httpv" instead of "http". There is no guarantee
>> that an http extension won't come along using that as a scheme name.)
>> 
>> Please note that I am mainly talking about the use of
>> SafeCURIEorCURIEorURI here! See the RDFa 1.1 Core [4], where this is
>> used for @about and @resource. My opinion is that for declaring
>> subjects and objects, reverting to only SafeCURIEorURI is the safest,
>> but allowing something like SafeCURIEorQNameOrURI might be fairly safe
>> too.
>> 
>> I actually have less problem with the TERMorCURIEorAbsURI space for
>> @property, @rel etc. But that's because the use there requires
>> immediate awareness of which prefixes are declared (and partly because
>> I don't expect myself to use full URI:s there; I just don't find that
>> author-friendly). The idea here (Mark's I think?) is that undeclared
>> prefixes are resolved to themselves. So I accept that the
>> TERMorCURIEorAbsURI construct is there to please those who will not
>> use prefixes at all. In the attributes where this is used in RDFa 1.1,
>> CURIEs are the norm.
>> 
>> (Sure, it's still complex to have different value spaces, both of them
>> allowing unsafe CURIEs and URIs (which is the case now in the RDFa 1.1
>> draft). I might prefer, for simplicity, that any space where unsafe
>> CURIEs and URIs are allowed to mix, only allowed QName-compliant
>> QNames.)
>> 
>> One other problem is the reuse of the CURIEorURI definition in the
>> RDFa API. At least not without properly making it clear that in such a
>> value, every prefix declaration works as a "magic scheme". I'd much
>> rather see SafeCURIEorQNameOrURI being used in such places -- again,
>> at least in subject and object positions.
>> 
>> I have yet to see CURIEs where prefixes are used as "magic (macro)
>> schemes", rather than as common QNames. Those have been enough for
>> basically all types and property shorthands so far (and I believe good
>> vocabulary design should continue to adhere to that). But I know that
>> CURIEs have their place, since the restrictions of QNames rule out
>> certain edge cases, and that's fine in a CURIE-only value space.
>> 
>> For potential conveniences (like "dbp-ont:Artist/xyz"), I would expect
>> people to be just fine using SafeCURIEs when intermingling these with
>> regular URIs. I like that shorthand practise, especially when it's
>> explicit (i.e. using the safe "[dbp-ont:Artist/xyz]" form).
>> 
>> 
>>> There are lots of things that are going to evolve over
>>> time on the Internet. We cannot predict all of them.
>> 
>> But that is my point! That is why I am getting "worked up". ;) I know
>> that most people only use http, and it may be enough to caution people
>> not to declare http as a prefix (since the consequence of doing so in
>> RDFa 1.1 would be dire). But it is technically unsafe (as everybody
>> seems to agree on). Being pedantic, and fearing the consequences of
>> conflating schemes and prefixes, I am driven to voice my concern. (Not
>> to mention that I dislike this conflation from a theoretical
>> perspective as well.)
>> 
>> 
>>> I would be open to a few things that could tighten this stuff up - reducing
>>> the possibility of misinterpretation.  In no particular order, we could do
>>> none or all of the following:
>>> 
>>>  1. Restrict the use of schemes that are well known *at publication
>>>     time* from prefix declarations. E.g., declare that http, https,
>>>     mailto, etc. are all illegal as prefix names, and require that
>>>     conforming processors ignore them (or issue an error or issue a
>>>     warning - don't really mind).
>> 
>> -1. Doesn't solve the core issue. To me it only highlights the problem
>> (coming off a bit as "duct tape on the opened can of worms", if you
>> pardon my simile). Requiring RDFa parsers to take account of the
>> scheme registry (as said, containing some 60+ ones at this time)
>> sounds bad to me, *especially* since prefixes should be completely
>> orthogonal to URI schemes. It will probably be to the detriment of
>> prefixes (which alas are already questioned, in spite of their IMO
>> apparent friendliness).
>> 
>> 
>>>  2. Restrict the 'reference' portion pattern further, such that it
>>>     prohibits leading '//'.  There is no need I can imagine to permit
>>>     '//' at the beginning of a reference.  So if there were a string
>>>     like 'foaf://some/reference' it would not be treated as a CURIE,
>>>     but 'foaf:some/reference' or 'foaf:/some/reference' would still be
>>>     a CURIE.
>> 
>> -0/-1. It eliminates the risk for http/https, but not against schemes
>> like "mailto" (or "tag").
>> 
>> 
>>>  3. Encourage content authors to use prefix names that are unlikely to
>>>     ever be a scheme name (not sure this makes sense).  We still have
>>>     the NCNAME restriction on prefix names, but myprefix0 is an NCNAME
>>>     and my-prefix is an NCNAME.  And I can't imagine those ever being
>>>     a real scheme.
>> 
>> -1. This only indicates that there is a design flaw in the value space
>> definition (CURIEorURI).
>> 
>> 
>>>  4. Encourage content authors to eschew the use of prefixes and just
>>>     use full URIs (not sure this makes sense either).
>> 
>> -1. Giving up on prefixes would be terrible. I just want to see their
>> use made effortlessly safe.
>> 
>> 
>> As said, I am pedantic on this. I don't seem to be the only one with
>> this itch though, and I hope you understand my perspective.
>> 
>> Best regards,
>> Niklas
>> 
>> 
>> [1]: http://tools.ietf.org/html/rfc1738
>> [2]: http://www.iana.org/assignments/uri-schemes.html
>> [3]: http://en.wikipedia.org/wiki/URI_scheme
>> [4]: http://www.w3.org/TR/rdfa-core/
>> 
>> 
>> 
>>> I don't think any of these steps ELIMINATE the possibility of
>>> misinterpretation.  But they surely won't hurt, and they are all completely
>>> consistent with the *intent* of CURIEs.
>>> 
>>> Thoughts?
>>> 
>>> On 4/15/2011 4:53 AM, Niklas Lindström wrote:
>>>> 
>>>> Hi Ivan!
>>>> 
>>>> 2011/4/15 Ivan Herman<ivan@w3.org>:
>>>>> 
>>>>> True. But we would also lose possibly very useful features.
>>>>> 
>>>>> I recently realized, to take an example, that the DBPedia concepts'
>>>>> ontology has some sort of a hierarchy. They use
>>>>> 
>>>>> http://dbpedia.org/ontology/
>>>>> http://dbpedia.org/ontology/Artist/
>>>>> http://dbpedia.org/ontology/Film/
>>>>> etc.
>>>>> 
>>>>> which would then be used, for example, on types. At the moment, one can
>>>>> define a prefix for ../ontology/ and then use something like
>>>>> dbp-ont:Artist/XXX instead of being forced to define a separate prefix for
>>>>> each sub-hierarchy.
>>>> 
>>>> Yes, I've thought about that too. But that would still be possible if
>>>> the CURIEorURI is changed to the RestrictedCURIEOrSafeCURIEorURI
>>>> (which I just suggested in reply to Mark -- i.e. where RestrictedCURIE
>>>> is defined as one of QName, or "isegment-nz-nc", or Nathan's
>>>> "path-absolute / ipath-noscheme / ipath-empty").
>>>> 
>>>> 
>>>>> B.t.w., on a separate comment: in my implementation I actually generate a
>>>>> warning if a URI is used with an unusual (ie, non-registered) scheme. In
>>>>> most cases this is the result of a misspelling in the prefix. I am not sure
>>>>> it is worth adding that RDFa Core as a requirement, or just have this as a
>>>>> good practice for RDFa processors...
>>>> 
>>>> Warnings are useful, but I definitely don't think that an RDFa parser
>>>> should have to worry about the scheme registry. Neither should
>>>> authors. It will evolve independently of the implementation and use of
>>>> RDFa, and of the (very much decentralized) definition of prefixes for
>>>> vocabularies.
>>>> 
>>>> 
>>>> Best regards,
>>>> Niklas
>>>> 
>>>> 
>>>> 
>>>>> Ivan
>>>>> 
>>>>> 
>>>>> 
>>>>> On Apr 13, 2011, at 19:09 , Nathan wrote:
>>>>> 
>>>>>> That said, it would be a lot less ambiguous if CURIE didn't use
>>>>>> irelative-ref and instead used:
>>>>>> 
>>>>>>  reference ::= ipath-absolute / ipath-noscheme / ipath-empty
>>>>>> 
>>>>>> then at least, http://example.org/ would never be a CURIE, and a prefix
>>>>>> mapping for http: would never apply / confuse.
>>>>>> 
>>>>>> Best,
>>>>>> 
>>>>>> Nathan
>>>>>> 
>>>>>> Mark Birbeck wrote:
>>>>>>> 
>>>>>>> Hi Niklas,
>>>>>>> Everything you say is true. :)
>>>>>>> However, the big change in the working group's thinking came when we
>>>>>>> decided that it was impossible to guarantee correct interpretation of
>>>>>>> strings of text based solely on their format, and so instead we should
>>>>>>> rely on the strings' contexts.
>>>>>>> By using context to aid in the interpretation of a string we get a lot
>>>>>>> more flexibility, and we can unambiguously work out what things like
>>>>>>> this mean:
>>>>>>>  foaf:Agent
>>>>>>> Without context it *looks* like all of the following:
>>>>>>>  * a string of text with no particular meaning;
>>>>>>>  * a QName;
>>>>>>>  * a CURIE;
>>>>>>>  * a relative URI using the 'foaf' scheme.
>>>>>>> However, we decided in the working group that if no prefix mapping for
>>>>>>> 'foaf' was defined in the context for this string, then the string was
>>>>>>> *by definition* not a CURIE.
>>>>>>> Whether it therefore becomes a string of text or a URI is a separate
>>>>>>> processing step, and nothing to do with CURIE processing, but by
>>>>>>> taking the approach we did in the CURIE processing layer we at least
>>>>>>> made it possible for 'foaf:Agent' to be interpreted as a URI.
>>>>>>> The converse also holds; if a mapping for 'foaf' is defined, then the
>>>>>>> string above is *by definition* a CURIE. Now whether some host
>>>>>>> language decides to interpret the string as a CURIE above a URI is up
>>>>>>> to that host language, but RDFa does so.
>>>>>>> Personally I was very pleased when we took the step to take context
>>>>>>> into account when interpreting strings. Until that point we were
>>>>>>> trying to achieve the impossible -- imagining that a string on its own
>>>>>>> could tell you everything about what it was. Now it's very easy to
>>>>>>> interpret both of these strings correctly:
>>>>>>>  foaf:Agent
>>>>>>>  http://www.w3.org/
>>>>>>> simply by using the context.
>>>>>>> Best regards,
>>>>>>> Mark
>>>>>>> 2011/4/11 Niklas Lindström<lindstream@gmail.com>:
>>>>>>>> 
>>>>>>>> Hi all!
>>>>>>>> 
>>>>>>>> Is it correct that the RDFa WG is currently recommending letting
>>>>>>>> CURIEs share the same value space as regular URIs, and so that any
>>>>>>>> prefix defined with the same value as a scheme, like "http", "https",
>>>>>>>> "news" etc. will change the URI for any absolute URI using those
>>>>>>>> schemes?
>>>>>>>> 
>>>>>>>> I remember worrying about this last year, but I haven't followed the
>>>>>>>> decision process in detail since then. It just worries me that letting
>>>>>>>> these things collide will blow up for anyone who happens to use at
>>>>>>>> least "http" or "https" as prefixes (perhaps rendering prefixes using
>>>>>>>> a tool, or getting them from a profile out of their control). Or
>>>>>>>> perhaps worse, people believing it safe to use anything but "http(s)"
>>>>>>>> as prefixes, which will work until something other than those two
>>>>>>>> comes along in the next 10 years or so. It might happen; and if it
>>>>>>>> does, it may quite probably be beyond the controls of RDFa specs and
>>>>>>>> tools.
>>>>>>>> 
>>>>>>>> (An example: some vocabulary "Wide Exceptional Graphs" becomes
>>>>>>>> popular, using "wxg" as a prefix. Then Google comes along with a new
>>>>>>>> wxg scheme ("Web Extended by Google"), and soon lots of resources are
>>>>>>>> linked with that instead of old "http". Or for that matter, that some
>>>>>>>> other scheme [3] becomes popular again for whatever reason.)
>>>>>>>> 
>>>>>>>> I vaguely recall the WG saying something about defining "http" as a
>>>>>>>> prefix is bad practise. But this turns up here and there, not least
>>>>>>>> since the HTTP Vocabulary Draft [1] (<http://www.w3.org/2006/http#>)
>>>>>>>> recommends it as a prefix. And I just ran across "http" as a prefix in
>>>>>>>> the Tabulator source as well [2].
>>>>>>>> 
>>>>>>>> While I understand that it is confusing to use it as a prefix, I am
>>>>>>>> not convinced that it is safe to combine the CURIE and URI value space
>>>>>>>> like this. At least not without a limit on the CURIEs allowed in the
>>>>>>>> joint CURIEorURI space. For instance, not allowing CURIEs in that
>>>>>>>> space to use anything after the prefix+':' other than say an
>>>>>>>> isegment-nz-nc from RFC 3987, or something to that effect (like a
>>>>>>>> "[A-Za-z0-9_.-]+" regexp).
>>>>>>>> 
>>>>>>>> If there were such a restriction on the format of CURIEs allowed in
>>>>>>>> the CURIEorURI mix (and that anything not matching it would be
>>>>>>>> considered a full URI), I would definitely sleep better. :)
>>>>>>>> 
>>>>>>>> Am I missing something crucial, or overly worried about the risk of
>>>>>>>> collisions?
>>>>>>>> 
>>>>>>>> Best regards,
>>>>>>>> Niklas
>>>>>>>> 
>>>>>>>> [1]: http://www.w3.org/TR/HTTP-in-RDF10/
>>>>>>>> [2]: http://dig.csail.mit.edu/hg/tabulator/file/9a135feff10f/chrome/content/js/rdf/rdflib.js#l5644
>>>>>>>> [3]: http://en.wikipedia.org/wiki/URI_scheme
>>>>>>>> 
>>>>>>>> 
>>>>>> 
>>>>> 
>>> 
>>> --
>>> Shane P. McCarron                          Phone: +1 763 786-8160 x120
>>> Managing Director                            Fax: +1 763 786-8180
>>> ApTest Minnesota                            Inet: shane@aptest.com
>>> 
>>> 
>>> 
>>> 
>> 
> 


----
Ivan Herman, W3C Semantic Web Activity Lead
Home: http://www.w3.org/People/Ivan/
mobile: +31-641044153
PGP Key: http://www.ivan-herman.net/pgpkey.html
FOAF: http://www.ivan-herman.net/foaf.rdf
Received on Sunday, 1 May 2011 16:45:43 UTC