Re: CURIEorURI Value Space Collisions from Ivan Herman on 2011-04-16 (public-rdfa-wg@w3.org from April 2011)

From: Ivan Herman <ivan@w3.org>
Date: Sat, 16 Apr 2011 09:34:53 +0200
To: Shane McCarron <shane@aptest.com>
Cc: public-rdfa-wg@w3.org
Message-Id: <899CA30D-6AA4-4EC6-B3AC-51523B133116@w3.org>
On Apr 16, 2011, at 24:07 , Shane McCarron wrote:

> I guess I fail to appreciate the core problem here.  Are you worried that there will be a prefix declared in a framework (e.g., news:) that in some distant future becomes a real scheme, and that @about values that use that real scheme as a full URI will be introduced as content within that framework?  I can see this as a remote possibility, but I can't get too worked up about it.  

Having thought about it since it first came up: neither do I...

> There are lots of things that are going to evolve over time on the Internet. We cannot predict all of them.
> 
> I would be open to a few things that could tighten this stuff up - reducing the possibility of misinterpretation.  In no particular order, we could do none or all of the following:
> 
>  1. Restrict the use of schemes that are well known *at publication
>     time* from prefix declarations. E.g., declare that http, https,
>     mailto, etc. are all illegal as prefix names, and require that
>     conforming processors ignore them (or issue an error or issue a
>     warning - don't really mind).

Warning: yes. And I would take the list of registered URI schemes. Making them illegal: why? If people want to hang themselves, who am I to go against their wish? :-)

>  2. Restrict the 'reference' portion pattern further, such that it
>     prohibits leading '//'.  There is no need I can imagine to permit
>     '//' at the beginning of a reference.  So if there were a string
>     like 'foaf://some/reference' it would not be treated as a CURIE,
>     but 'foaf:some/reference' or 'foaf:/some/reference' would still be
>     a CURIE.

As shown by the mailto: example, this would take care of only a fraction of schemes. It may have no effect on new ones defined 10 years from now...
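
To make the limitation concrete, here is a minimal Python sketch of the restriction in item 2. It is an illustration only, not text from any RDFa draft: the function name resolve_curie_or_uri, the prefix table, and the "mailto" mapping to example.org are all hypothetical.

```python
def resolve_curie_or_uri(value, prefixes):
    """Resolve a CURIEorURI value, refusing to treat 'prefix://...' as a CURIE."""
    prefix, sep, reference = value.partition(":")
    # Keep the value as a plain URI if there is no colon, the prefix is
    # not mapped, or the reference starts with '//' (the item 2 restriction).
    if not sep or prefix not in prefixes or reference.startswith("//"):
        return value
    return prefixes[prefix] + reference


# Hypothetical prefix mappings, chosen to provoke a clash with real schemes.
prefixes = {
    "foaf": "http://xmlns.com/foaf/0.1/",
    "http": "http://www.w3.org/2006/http#",
    "mailto": "http://example.org/mail#",
}

# Protected by the '//' rule: stays a full URI despite the 'http' prefix.
print(resolve_curie_or_uri("http://www.w3.org/", prefixes))
# An ordinary CURIE is still expanded.
print(resolve_curie_or_uri("foaf:Agent", prefixes))
# Not protected: mailto URIs have no '//', so the prefix mapping still wins.
print(resolve_curie_or_uri("mailto:ivan@w3.org", prefixes))
```

The last call shows the gap: a scheme whose URIs do not begin with '//' is still silently rewritten when a prefix of the same name is in scope.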

>  3. Encourage content authors to use prefix names that are unlikely to
>     ever be a scheme name (not sure this makes sense).  We still have
>     the NCNAME restriction on prefix names, but myprefix0 is an NCNAME
>     and my-prefix is an NCNAME.  And I can't imagine those ever being
>     a real scheme.
>  4. Encourage content authors to eschew the use of prefixes and just
>     use full URIs (not sure this makes sense either).
> 
> 
> I don't think any of these steps ELIMINATE the possibility of misinterpretation.  But they surely won't hurt, and they are all completely consistent with the *intent* of CURIEs.


I actually think that adding an explicit warning possibility to the document for a possible scheme clash should be enough. I would even go one step further and advise RDFa processors to refresh their list of watched schemes as new ones get registered.

Actually... there are even two different warnings that we may want to consider:

1. Warn if a prefix uses a registered scheme. I.e., warn for prefix="tel: http://bla". Though this is legal, it may create problems for users, so the advice is: do not do it.
2. Warn if a generated URI ref uses a non-registered scheme. I.e., if I create a URI ref with "foo:bar", this may signal that the author thinks the 'foo' prefix is defined though it is not.

Adding these should be enough in my view...
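
A rough Python sketch of these two checks, under stated assumptions: REGISTERED_SCHEMES is a small hand-picked sample rather than the actual registry, and the function names check_prefix and check_uri_ref are made up for illustration. A real processor would load, and periodically refresh, the full list of registered schemes.

```python
import re

# Assumption: a tiny sample standing in for the list of registered URI schemes.
REGISTERED_SCHEMES = {"http", "https", "mailto", "tel", "news", "ftp", "urn"}

# Scheme syntax from RFC 3986: a letter, then letters, digits, '+', '-' or '.'.
SCHEME_RE = re.compile(r"^([A-Za-z][A-Za-z0-9+.-]*):")


def check_prefix(prefix):
    """Warning 1: the declared prefix is also a registered scheme."""
    if prefix.lower() in REGISTERED_SCHEMES:
        return f"prefix '{prefix}' is also a registered URI scheme"
    return None


def check_uri_ref(uri_ref):
    """Warning 2: the generated URI ref uses a non-registered scheme."""
    match = SCHEME_RE.match(uri_ref)
    if match is None:
        return None  # no scheme at all, so nothing to check
    scheme = match.group(1).lower()
    if scheme not in REGISTERED_SCHEMES:
        return f"'{uri_ref}' uses unregistered scheme '{scheme}'; undefined prefix?"
    return None


print(check_prefix("tel"))                   # warning 1 fires
print(check_prefix("foaf"))                  # no warning
print(check_uri_ref("foo:bar"))              # warning 2 fires
print(check_uri_ref("http://www.w3.org/"))   # no warning
```

Both functions only report; neither rejects the input, in line with treating these cases as warnings rather than errors.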

Ivan

> 
> Thoughts?
> 
> On 4/15/2011 4:53 AM, Niklas Lindström wrote:
>> Hi Ivan!
>> 
>> 2011/4/15 Ivan Herman<ivan@w3.org>:
>>> True. But we would also lose possibly very useful features.
>>> 
>>> I recently realized, to take an example, that the DBPedia concepts' ontology has some sort of a hierarchy. They use
>>> 
>>> http://dbpedia.org/ontology/
>>> http://dbpedia.org/ontology/Artist/
>>> http://dbpedia.org/ontology/Film/
>>> etc.
>>> 
>>> which would then be used, for example, on types. At the moment, one can define a prefix for ../ontology/ and then use something like dbp-ont:Artist/XXX instead of being forced to define a separate prefix for each sub-hierarchy.
>> Yes, I've thought about that too. But that would still be possible if
>> the CURIEorURI is changed to the RestrictedCURIEOrSafeCURIEorURI
>> (which I just suggested in reply to Mark -- i.e. where RestrictedCURIE
>> is defined as one of QName, or "isegment-nz-nc", or Nathan's
>> "path-absolute / ipath-noscheme / ipath-empty").
>> 
>> 
>>> B.t.w., on a separate comment: in my implementation I actually generate a warning if a URI is used with an unusual (i.e., non-registered) scheme. In most cases this is the result of a misspelling in the prefix. I am not sure whether it is worth adding that to RDFa Core as a requirement, or just having this as a good practice for RDFa processors...
>> Warnings are useful, but I definitely don't think that an RDFa parser
>> should have to worry about the scheme registry. Neither should
>> authors. It will evolve independently of the implementation and use of
>> RDFa, and of the (very much decentralized) definition of prefixes for
>> vocabularies.
>> 
>> 
>> Best regards,
>> Niklas
>> 
>> 
>> 
>>> Ivan
>>> 
>>> 
>>> 
>>> On Apr 13, 2011, at 19:09 , Nathan wrote:
>>> 
>>>> That said, it would be a lot less ambiguous if CURIE didn't use irelative-ref and instead used:
>>>> 
>>>>  reference ::= ipath-absolute / ipath-noscheme / ipath-empty
>>>> 
>>>> then at least, http://example.org/ would never be a CURIE, and a prefix mapping for http: would never apply / confuse.
>>>> 
>>>> Best,
>>>> 
>>>> Nathan
>>>> 
>>>> Mark Birbeck wrote:
>>>>> Hi Niklas,
>>>>> Everything you say is true. :)
>>>>> However, the big change in the working group's thinking came when we
>>>>> decided that it was impossible to guarantee correct interpretation of
>>>>> strings of text based solely on their format, and so instead we should
>>>>> rely on the strings' contexts.
>>>>> By using context to aid in the interpretation of a string we get a lot
>>>>> more flexibility, and we can unambiguously work out what things like
>>>>> this mean:
>>>>>  foaf:Agent
>>>>> Without context it *looks* like all of the following:
>>>>>  * a string of text with no particular meaning;
>>>>>  * a QName;
>>>>>  * a CURIE;
>>>>>  * a relative URI using the 'foaf' scheme.
>>>>> However, we decided in the working group that if no prefix mapping for
>>>>> 'foaf' was defined in the context for this string, then the string was
>>>>> *by definition* not a CURIE.
>>>>> Whether it therefore becomes a string of text or a URI is a separate
>>>>> processing step, and nothing to do with CURIE processing, but by
>>>>> taking the approach we did in the CURIE processing layer we at least
>>>>> made it possible for 'foaf:Agent' to be interpreted as a URI.
>>>>> The converse also holds; if a mapping for 'foaf' is defined, then the
>>>>> string above is *by definition* a CURIE. Now whether some host
>>>>> language decides to interpret the string as a CURIE above a URI is up
>>>>> to that host language, but RDFa does so.
>>>>> Personally I was very pleased when we took the step to take context
>>>>> into account when interpreting strings. Until that point we were
>>>>> trying to achieve the impossible -- imagining that a string on its own
>>>>> could tell you everything about what it was. Now it's very easy to
>>>>> interpret both of these strings correctly:
>>>>>  foaf:Agent
>>>>>  http://www.w3.org/
>>>>> simply by using the context.
>>>>> Best regards,
>>>>> Mark
>>>>> 2011/4/11 Niklas Lindström<lindstream@gmail.com>:
>>>>>> Hi all!
>>>>>> 
>>>>>> Is it correct that the RDFa WG is currently recommending letting
>>>>>> CURIEs share the same value space as regular URIs, and so that any
>>>>>> prefix defined with the same value as a scheme, like "http", "https",
>>>>>> "news" etc. will change the URI for any absolute URI using those
>>>>>> schemes?
>>>>>> 
>>>>>> I remember worrying about this last year, but I haven't followed the
>>>>>> decision process in detail since then. It just worries me that letting
>>>>>> these things collide will blow up for anyone who happens to use at
>>>>>> least "http" or "https" as prefixes (perhaps rendering prefixes using
>>>>>> a tool, or getting them from a profile out of their control). Or
>>>>>> perhaps worse, people believing it safe to use anything but "http(s)"
>>>>>> as prefixes, which will work until something other than those two
>>>>>> comes along in the next 10 years or so. It might happen; and if it
>>>>>> does, it may quite probably be beyond the controls of RDFa specs and
>>>>>> tools.
>>>>>> 
>>>>>> (An example: some vocabulary "Wide Exceptional Graphs" becomes
>>>>>> popular, using "wxg" as a prefix. Then Google comes along with a new
>>>>>> wxg scheme ("Web Extended by Google"), and soon lots of resources are
>>>>>> linked with that instead of old "http". Or for that matter, that some
>>>>>> other scheme [3] becomes popular again for whatever reason.)
>>>>>> 
>>>>>> I vaguely recall the WG saying something about defining "http" as a
>>>>>> prefix is bad practise. But this turns up here and there, not least
>>>>>> since the HTTP Vocabulary Draft [1] (<http://www.w3.org/2006/http#>)
>>>>>> recommends it as a prefix. And I just ran across "http" as a prefix in
>>>>>> the Tabulator source as well [2].
>>>>>> 
>>>>>> While I understand that it is confusing to use it as a prefix, I am
>>>>>> not convinced that it is safe to combine the CURIE and URI value space
>>>>>> like this. At least not without a limit on the CURIEs allowed in the
>>>>>> joint CURIEorURI space. For instance, not allowing CURIEs in that
>>>>>> space to use anything after the prefix+':' other than say an
>>>>>> isegment-nz-nc from RFC 3987, or something to that effect (like a
>>>>>> "[A-Za-z0-9_.-]+" regexp).
>>>>>> 
>>>>>> If there was such a restriction on the format of CURIEs that are allowed in
>>>>>> the CURIEorURI mix (and that anything not matching it would be
>>>>>> considered a full URI), I would definitely sleep better. :)
>>>>>> 
>>>>>> Am I missing something crucial, or overly worried about the risk of collisions?
>>>>>> 
>>>>>> Best regards,
>>>>>> Niklas
>>>>>> 
>>>>>> [1]: http://www.w3.org/TR/HTTP-in-RDF10/
>>>>>> [2]: http://dig.csail.mit.edu/hg/tabulator/file/9a135feff10f/chrome/content/js/rdf/rdflib.js#l5644
>>>>>> [3]: http://en.wikipedia.org/wiki/URI_scheme
>>>>>> 
>>>>>> 
>>>> 
>>> 
>>> ----
>>> Ivan Herman, W3C Semantic Web Activity Lead
>>> Home: http://www.w3.org/People/Ivan/
>>> mobile: +31-641044153
>>> PGP Key: http://www.ivan-herman.net/pgpkey.html
>>> FOAF: http://www.ivan-herman.net/foaf.rdf
>>> 
>>> 
>>> 
>>> 
>>> 
>>> 
> 
> -- 
> Shane P. McCarron                          Phone: +1 763 786-8160 x120
> Managing Director                            Fax: +1 763 786-8180
> ApTest Minnesota                            Inet: shane@aptest.com
> 
> 
> 


----
Ivan Herman, W3C Semantic Web Activity Lead
Home: http://www.w3.org/People/Ivan/
mobile: +31-641044153
PGP Key: http://www.ivan-herman.net/pgpkey.html
FOAF: http://www.ivan-herman.net/foaf.rdf
Received on Saturday, 16 April 2011 07:34:22 UTC