Re: Official RDFa Response: ISSUE-90: CURIEorURI Value Space Collisions from Niklas Lindström on 2011-06-09 (public-rdfa-wg@w3.org from June 2011)

From: Niklas Lindström <lindstream@gmail.com>
Date: Thu, 9 Jun 2011 12:38:26 +0200
To: Manu Sporny <msporny@digitalbazaar.com>
Cc: RDFa WG <public-rdfa-wg@w3.org>
Message-ID: <BANLkTimDiP1W4srJSn0PQ79+ZAZyHoXfXA@mail.gmail.com>

Hi Manu!

Thank you likewise for your very thorough and thoughtful reply! I
certainly didn't take it as any kind of rejection of my thoughts, but
sincerely appreciated your feedback.

I feel very confident now that we've cleared up any misunderstandings
and that you fully understand my concerns. You made very good and
thoughtful points. I understand them all and I agree with your
reasoning. Thus, I won't further delve into those specifics, but
instead summarise my conclusions.

Given this, and having given all perspectives more thought, I'm ready
to accept the current situation as a viable compromise of concerns
(between theoretical safety and usability). As long as the reality of
the collision risk, albeit minor, between protocols and prefixes is
articulated and generally accepted, it seems fair. And I can see the
valid point of a coherent means of expression, i.e. not needing to use
SafeCURIEs in one place but not the other; as well as the
cumbersomeness of SafeCURIEs in general. I suppose one could even go
so far as to say that "CURIEorURI" is a "macro protocol" mechanism,
branding the "value space conflation" as a feature (admittedly
requiring certain care).
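
To make the conflation concrete, here is a simplified sketch (in Python)
of how a value in @about or @resource gets resolved under the rule
discussed in this thread: if the part before the first ":" is a defined
prefix, the value is a CURIE; otherwise it is an IRI. The function name,
the prefix map and the example.org namespace are hypothetical, and blank
nodes, the default prefix and relative IRI resolution are left out.

```python
def resolve_curie_or_uri(value, prefixes):
    """Simplified SafeCURIEorCURIEorURI resolution (illustrative only)."""
    # SafeCURIE: "[prefix:reference]" is only ever read as a CURIE.
    if value.startswith("[") and value.endswith("]"):
        prefix, sep, reference = value[1:-1].partition(":")
        if sep and prefix in prefixes:
            return prefixes[prefix] + reference
        return None  # sketch: an unresolvable SafeCURIE yields no value

    # Plain value: a defined prefix wins over a URI scheme of the same name.
    prefix, sep, reference = value.partition(":")
    if sep and prefix in prefixes:
        return prefixes[prefix] + reference
    return value  # no matching prefix, so the value is kept as an IRI


# Hypothetical prefix map in which "http" has been declared as a prefix.
prefixes = {"http": "http://example.org/vocab/http#"}

print(resolve_curie_or_uri("http://example.org/doc", prefixes))
# -> http://example.org/vocab/http#//example.org/doc  (the collision)
print(resolve_curie_or_uri("https://example.org/doc", prefixes))
# -> https://example.org/doc  (no "https" prefix, so it stays an IRI)
```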

I can also see how to manage the technical safety of a publishing
system. If such a system provides defined prefixes and a decentralised
way for users to enter URIs, it could either (as suggested) maintain a
whitelist of explicitly allowed protocols, or it could inspect URIs
for protocols being defined as prefixes and either disallow them or
simply cope by locally defining e.g. prefix="wxg: wxg:" where it is
used as such. The latter does indeed work as a workaround (though
cumbersome) if a collision becomes reality. But unless the future
holds a rampant growth of protocols, I can live with that. :)
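
A minimal sketch of those two safeguards, in Python. The whitelist
contents, the prefix map and all function names are assumptions made for
illustration; they are not taken from any RDFa processor or from the spec.
Prefixes and schemes are compared case-insensitively here, which is the
conservative choice.

```python
import re

# Scheme syntax per RFC 3986: a letter, then letters, digits, "+", "-" or ".".
SCHEME_RE = re.compile(r"^([A-Za-z][A-Za-z0-9+.\-]*):")

# Hypothetical whitelist of protocols the publishing system accepts.
ALLOWED_SCHEMES = {"http", "https", "mailto"}


def scheme_of(value):
    """Return the lower-cased scheme-like token before the first ':', or None."""
    match = SCHEME_RE.match(value)
    return match.group(1).lower() if match else None


def check_user_uri(value, prefixes, allowed=ALLOWED_SCHEMES):
    """Classify a user-entered URI meant for @about or @resource.

    Returns "relative", "collision", "disallowed" or "ok".
    """
    scheme = scheme_of(value)
    if scheme is None:
        return "relative"      # e.g. "#me" or "doc.html"
    if scheme in {p.lower() for p in prefixes}:
        return "collision"     # would be expanded as a CURIE
    if scheme not in allowed:
        return "disallowed"    # not on the whitelist
    return "ok"


def identity_prefix_override(scheme):
    """Value for a local @prefix that maps the scheme onto itself."""
    return f"{scheme}: {scheme}:"


# Hypothetical prefixes defined by the publishing system.
prefixes = {
    "foaf": "http://xmlns.com/foaf/0.1/",
    "wxg": "http://example.org/wxg#",
}

for uri in ["http://example.org/a", "wxg:some/thing", "sip:niklas@example.org"]:
    print(uri, "->", check_user_uri(uri, prefixes))
```

On a "collision" result the system can either reject the URI or emit
prefix="wxg: wxg:" (from identity_prefix_override("wxg")) on the element
where the URI is used, so that the expansion yields the original URI.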

While I don't find it perfect, real world technology rarely is. Nor
can I come up with a proposal which would resolve this in a better
way, given your defined usability goals. Sure,
RestrictedCURIEOrSafeCURIEorURI could provide more safety, but at a
complexity cost I fully understand would be considered too high. So
given that we have now shed good light on this issue and reasoned for
and against, complete with options, I am willing to accept the
resolution not to change the current use of SafeCURIEorCURIEorURI.

I do think it would be wise, as you say, to coordinate this with the
other Semantic Web groups, so that any eventual undiscovered pitfalls
or breaking of design patterns can be detected. And I hope that others
in the RDFa community have considered this situation (and are at
ease). I think that the awareness and means to handle this are
available, and that, overall, the situation is under control.

Explicitly for the record: I thus have no objections to closing
ISSUE-90 <http://www.w3.org/2010/02/rdfa/track/issues/90>.

Again, thank you very much! I think the future of RDFa 1.1 is bright.

Best regards,
Niklas
--
<http://neverspace.net/>




2011/6/1 Manu Sporny <msporny@digitalbazaar.com>:
> On 05/31/2011 11:19 AM, Niklas Lindström wrote:
>> Hi Manu, all!
>
> Hi Niklas - thanks for the very detailed and thoughtful reply. Responses
> below (they're not official, just trying to work out where we go from here).
>
>> I thank the working group for reviewing my issue!
>>
>> However, it seems I haven't quite gotten my point through. I didn't
>> propose to limit the lexical value space of CURIEs in general. It is
>> only the construct SafeCURIEorCURIEorURI I am concerned about. And
>> that is a new construct, hitherto used *only* in RDFa 1.1.
>
> Ah, I don't know if that was clear to the rest of the group. It was not
> clear to me, so thanks for clearing that up. That said, we have
> considered this issue in various guises over the development of RDFa
> 1.1. More explanation follows...
>
>> See my comments below. (I also elaborated on this in my reply [1]
>> during the discussion in April.)
>>
>>> We discussed this at length and found the following:
>>>
>>> 1. Limiting the CURIE to a regex arbitrarily limits the allowable
>>>   characters such that other use cases cannot be supported, such as
>>>   CURIE references containing "@" or ":" or any internationalized
>>>   character in them.
>>
>> As said above, I didn't propose to reduce the current lexical space of
>> CURIEs everywhere, only in CURIEorSafeCURIEorURI (and not necessarily
>> restricted with a regex; but e.g. redefined as QNameOrSafeCURIEorURI).
>> If one wants to use complex CURIEs there, e.g. "dpb:resource/Concept",
>> SafeCURIEs would work fine, just as before.
>
> Ah, ok - that is different than what we thought you were saying. Three
> points:
>
> 1. Establishing QName-like behavior would confuse authors further.
> 2. SafeCURIEs are rarely used in practice.
> 3. CURIEs are rarely used in @about and @resource.
>
> Point #1: I discussed this with the Editor today and both of us thought
> that the QName bit is a non-starter. The reason being that we have spent
> many years trying to convince people that CURIEs are not QNames - which
> they are not. There is even a section of the RDFa specification that
> details the difference:
>
> http://www.w3.org/TR/rdfa-core/#why-curies-and-not-qnames
>
> Introducing anything that is a QName or QName-like would confuse the
> issue. That said, we could use something like
> RestrictedCURIEOrSafeCURIEorURI, but as Ivan has pointed out, that is
> problematic as well. It also wouldn't solve the problem where you have
> schemes like "mailto" or "sip". So, we would restrict the potential
> input, but not solve the problem.
>
> Point #2: We have enough data now to know that people are not using
> SafeCURIEs. We don't know if they are not using them by accident, or if
> they are not using them because they don't know they exist. Personally,
> I hate safe CURIEs and think that many people don't even know that they
> exist. They are unnecessary in almost every use case imaginable and
> complicate RDFa implementations. I know others in the community feel the
> opposite way, but the bottom line is - the majority of RDFa documents
> out there do not use SafeCURIEs, so a rule like
> RestrictedCURIEOrSafeCURIEorURI would effectively boil down to
> RestrictedCURIEorURI in practice - which wouldn't solve the problem
> you're attempting to solve.
>
> That is, most of the markup in the wild would be wrong - the RDFa
> specification would deviate from how people use CURIEs.
>
> Point #3: Almost every case of RDFa that I have seen does not use CURIEs
> in @about or @resource - named bnodes are the rare exception and those
> are very seldom used. Hash-IRIs (relative IRIs) or absolute IRIs are the
> norm. I know your point is about when somebody defines an "http" prefix
> in their document - well, in that case the RDFa is wrong. It will happen
> - we know it will happen, but in almost every case, it will be sorted
> out. If it's not sorted out, the data will be invalid and nobody will
> use it. That is - one can easily recover from the error.
>
>>> 2. Limiting the character set still doesn't prevent false positives
>>>   for very simple schemes like SIP. For example, to prevent a
>>>   false positive for "sip:niklas@example.org", one would have to
>>>   limit the "@" in all CURIEs. However, there may be some vocabularies
>>>   that want to utilize the "@" sign. That is, we may think we know
>>>   which characters are important now, but all that must happen for
>>>   this approach to fail is that an Internet Scheme would appear that
>>>   uses a character in the list of acceptable characters - for example,
>>>   "-" or "."
>>
>> Again, my issue only concerns the collision-prone *mixing* of CURIEs
>> and URIs, where URIs are the norm. I do not find TERMorCURIEorAbsURI
>> nearly as problematic (as used for e.g. @rel and @property). That is
>> simply because there CURIEs are the norm and AbsURIs the exception
>> (since in RDFa 1.0, only CURIEs were allowed).
>
> There is a contingent of people, namely coming out of the WHATWG and
> Microformats communities that would disagree with you. They would argue
> that AbsURIs would be the norm and CURIEs would be the exception (at
> least, with the markup that they create). We should expect CURIEs and
> AbsURIs, when used in the same datatype, to conflict at some point.
> Again - the key here is how often that happens and how recoverable that
> error is.
>
> Granted there will be some cases where the author does not have a choice
> on which vocabularies are defined in the head of the document. In that
> case, they can always re-define "http" prefix to be "http:" like so:
> prefix="http: http:". It is a bit ridiculous, but even in the worst case
> scenario authors can still override the mappings they use.
>
> As for the case where the mapping is inserted into a profile that they
> don't control, after they have authored their documents, well - that's a
> problem. However, so is if one of the vocabularies that they were using
> is deleted from the profile. Profile modification is always a problem.
>
> Keep in mind that this problem /only/ happens when a vocabulary prefix
> is defined in a profile without the author's knowledge. We don't feel
> that that is going to happen very often, and as I said above, it's
> always correctable if it is detected.
>
>>> 3. CURIEs are not allowed in @href and @src, so the likelihood that
>>>   this will become a practical concern is lessened.
>>
>> I disagree. Since @about and @resource are fundamental to RDFa (indeed
>> needed in certain places), I don't see how the collision risk is
>> substantially reduced.
>
> I think we're miscommunicating. What I meant was this:
>
> Ignoring the other attributes, if we consider allowing CURIEs in @href,
> @src, @about and @resource - there would be four places where CURIEs
> could screw things up. Since we only allow CURIEs in @about and
> @resource, we halve the places where the issue can occur.
>
> Since people mainly use relative references (hash-URIs in @about), the
> potential for the problem to surface is further lessened. However, that
> doesn't mean that the problem can't occur. The most prevalent issue may
> be the declaration of a 'http' prefix. That could cause issues if
> "http:" IRIs are used in @about and @resource, but as I said before -
> people typically use relative IRIs in @about, which leaves @resource as
> our most common pain point for this problem. @resource is seldom used.
>
> Really, this is an issue when people use full IRIs. People typically,
> based on the data that we have seen to date, use IRIs in @about and
> @resource. People prefer to use relative IRIs in @about. People tend to
> not use @resource and use @href for the majority of cases.
>
> We have not done a complete study on RDFa usage in the field - this is
> just what the group has seen in general. Others may disagree with this
> overview of the situation.
>
>>> 4. There is no ambiguity as far as an RDFa Processor is concerned.
>>>   For example, if an "http" prefix is defined, then anything that
>>>   accepts a CURIE would expand the "http" prefix. That is, if the
>>>   prefix is defined, it is a CURIE. If it is not defined, it is an
>>>   IRI. Authors will discover this very quickly and vocabulary
>>>   maintainers are advised to avoid naming their vocabularies after
>>>   Internet Protocol Schemes.
>>
>> Certainly. But my issue didn't concern processing ambiguities. There's
>> a conflation of value spaces, which as you point out authors have to
>> be aware of and avoid. The need for the proposed advice is what
>> concerns me, since vocabulary prefix naming and URI schemes evolve
>> completely independent of each other!
>
> That is true, but unfortunately I don't think there is any way around
> this problem other than "authors have to be aware of the issue". They
> will become aware of it if they check the triples that are marked up on
> their page - there are good tools to allow them to do that at this
> point. If they generate bad triples, people will either complain, or not
> use their pages. This down-side is a design trade-off, and that's really
> what we're talking about here - the lesser of two evils.
>
> The RestrictedCURIE approach just won't work because of schemes like
> "sip" and "isbn". The alternative is to say "Safe CURIEs only!", which
> won't work because people don't use them.
>
> We are attempting to make authoring easier for web authors and because
> of that, we are knowingly introducing a situation where there might be a
> corner case where an author's markup generates triples that they were
> not intending to generate.
>
> However, the benefits that we reap are that plain CURIEs can be used in
> most cases without error - and this matches the practice of using plain
> CURIEs that we've seen out in the field.
>
>> I basically don't see how this shorthand feature can be warranted when
>> it leads to this conflation. Especially since SafeCURIEs have been
>> there for this use case all along. (Now in RDFa 1.1, it seems
>> SafeCURIEs are effectively a legacy.)
>>
>> I understand that e.g. the suggested QNameOrSafeCURIEorURI is more
>> complex to read though. This is because the tokens in the "local name"
>> determine if prefix expansion would be triggered. Given the choice,
>> I'd probably prefer to revert to just SafeCURIEorURI for @about and
>> @resource!
>>
>> Requiring users who want to use CURIEs in @about and @resource to
>> surround them with "[" and "]" just seems more wise to me than making
>> any prefix declared, perhaps in a profile beyond the author's control,
>> to expand in every subject and object supplied via these attributes,
>> for every RDFa 1.1 document created.
>
> I understand your reasoning and agree with it from a purely logical
> standpoint if we were to only consider the evidence you present above.
> However, doing that would go against established practice, which is also
> evidence that we should take into account. People wouldn't use it in the
> way that you intend people to use it.
>
>>> 5. It would create a backwards-incompatible change to RDFa. The
>>>   Working Group is not chartered to make this sort of change to
>>>   RDFa.
>>
>> On the contrary. The SafeCURIEorCURIEorURI construct did not exist in
>> RDFa 1.0. In fact, the current situation does actually introduce a
>> backwards-incompatible change. In RDFa 1.0, no prefix defined will be
>> expanded for values in @about and @resource starting with it.
>
> Hmm, this is an interesting point. I think we mention that if a @version
> is specified, the RDFa Processor /MUST/ conform to that version. This
> would mean that the backwards-incompatible change you mention never
> happens because the CURIE won't be expanded in RDFa 1.0 mode. I don't
> know if we state this clearly in the specification. However, we really
> should have a test case in the test suite to check this.
>
>>> In the end the group didn't think that limiting the value space of
>>> CURIEs would actually solve the problem you are concerned about. It may
>>> lessen the problem, theoretically, but nobody has demonstrated where
>>> this leads to a critical real-world problem with RDFa. In the worst
>>> case, the vocabulary prefix is changed in the RDFa document. In the end
>>> the Working Group decided to not place additional limitations on the
>>> value-space for CURIEs for the reasons listed above.
>>
>> I am quite aware that by just avoiding the definition of a handful of
>> common URI schemes (e.g. http, https, possibly ftp, mailto, sip), and
>> provided nothing new comes along, this is not much of a problem
>> today. But CURIEorSafeCURIEorURI is an issue of conflation (of
>> prefixes and schemes), and I want to emphasize that.
>>
>> I can only reiterate the risk of potential rise in popularity of some
>> other protocol than http(s) amongst linked data users, in combination
>> with definition of prefixes in e.g. profiles or publishing systems
>> beyond the author's immediate control. It is a small but complex
>> problem which could cause a lot of *dynamically published* RDFa to
>> become problematic in a year, or 5, or 10. And this might be hard to
>> detect, unless publishers monitor all protocols and prefixes used in
>> their publishing systems. If RDFa 1.1 is published with
>> SafeCURIEorCURIEorURI as it is now, this would be very hard to
>> rectify.
>
> As I said above - I don't think the group believes that it would be
> difficult to detect and rectify.
>
>> It only takes for *one* prefix in the (decentrally) growing list of
>> common prefixes to become a popular protocol for this to become a real
>> problem.
>
> I can guarantee you that if that starts happening, the entire community
> will jump on the vocabulary author and make sure that they make it clear
> that a different prefix should be used. We haven't done this yet for the
> 'http' scheme/vocabulary because we've never seen it become an issue.
> Nobody has actually reported this to be an issue to the community, either.
>
> Yes - it is theoretically possible - but it has yet to materialize.
>
> Perhaps what we should do is take an action as a community to make sure
> that services like http://prefix.cc/ clearly warn about the usage of
> Internet schemes. Perhaps we could also get many of the RDFa Processor
> authors to generate warnings when prefixes that are known Schemes are used.
>
>> As I also said in [1], I am also worried that this practice may be
>> carelessly adopted in other scenarios. Particularly RDF APIs, where
>> one may want to define lots of prefixes for authors' convenience, and
>> where it may very well be desirable to make statements about resources
>> identified with protocols other than http. (And we've already found
>> cases where "http" is defined as a prefix in code libraries.
>> Furthermore, it's not uncommon for prefixes to be automatically
>> generated.)
>
> I see these as bugs in the APIs and prefix-generation code. I know that
> you may see this differently, but there is no way to reliably prevent
> this problem with the evidence that we have before us.
>
>>> We discussed it during two telecons:
>>>
>>> http://www.w3.org/2010/02/rdfa/meetings/2011-05-05#CURIEorURI_Value_Space_Collisions
>>> http://www.w3.org/2010/02/rdfa/meetings/2011-05-19#ISSUE__2d_90__3a__CURIEorURI_Value_Space_Collisions
>>>
>>> The decision is recorded here:
>>>
>>> http://www.w3.org/2010/02/rdfa/meetings/2011-05-19#resolution_2
>>>
>>> Since this is an official Last Call response, could you please respond
>>> as soon as possible and let us know whether or not the Working Group has
>>> considered your request and responded accordingly. Please let us know if
>>> this is an acceptable outcome and whether you can live with the
>>> decision. Thank you for reviewing the RDFa specification and sending in
>>> your comments. :)
>>
>> I could live with it, if it comes to that. :) But I cannot really agree.
>>
>> Have you discussed this combination of CURIEorURI in e.g. the RDF
>> working group, or the RDF community in general? I'd be somewhat
>> surprised if I'm the only one feeling uneasy about it..
>
> It was a very long discussion in the RDFa 1.0 Working Group. We have not
> raised the issue with the RDF Working group. Perhaps we should raise it
> as a coordination issue for all the Semantic Web groups. You are not the
> only one that felt uneasy about this - we all did at first, but once we
> started seeing more RDFa usage patterns, we tended to get a bit less
> concerned about the potential issue.
>
>> I know this might seem like an innocent issue with few real-world
>> problems, but I hope I've made clearer my view of the potential risks
>> and the difficulties of managing them. I genuinely wish for RDFa 1.1 to
>> succeed, and I have the utmost respect for your work on it!
>
> Thank you again for the long and thoughtful response. Please don't take
> this e-mail as rejection of your thoughts or input. I think you make
> several very good and very valid points. We have weighed the risks
> versus the rewards and we think that the rewards far outweigh the risks.
> That said, we'll make it a point to discuss this during our upcoming
> telecon to make sure that the group still feels this way.
>
> And as always, if people from the RDFa community could weigh in - that
> would be great.
>
> -- manu
>
> --
> Manu Sporny (skype: msporny, twitter: manusporny)
> President/CEO - Digital Bazaar, Inc.
> blog: PaySwarm Developer Tools and Demo Released
> http://digitalbazaar.com/2011/05/05/payswarm-sandbox/
>
Received on Thursday, 9 June 2011 10:39:24 UTC