Re: ISSUE-143 (Prefixes too complicated): Use of prefixes is too complicated for a Web technology [RDFa 1.1 in HTML5] from Manu Sporny on 2012-11-06 (public-rdfa-wg@w3.org from November 2012)

From: Manu Sporny <msporny@digitalbazaar.com>
Date: Tue, 06 Nov 2012 13:27:46 -0500
To: "Tab Atkins Jr." <jackalmage@gmail.com>
CC: RDFa Working Group <public-rdfa-wg@w3.org>
Message-ID: <509956A2.3070403@digitalbazaar.com>
On 11/05/12 19:16, Tab Atkins Jr. wrote:
> As outlined in the original threads that introduced this issue,
> usage in the wild shows that authors very commonly author "invalid"
> markup which uses a common prefix without specifying the prefix.

For some value of "very commonly". One of the biggest problems with this
discussion is that we don't have good data on how common this is. That
said, the RDFa WG was concerned enough about this possibility that we
introduced the RDFa Initial Context in RDFa 1.1 (which pre-defines these
common prefixes that authors may forget to define in their documents):

http://www.w3.org/2011/rdfa-context/rdfa-1.1

Do you have any data to support the claim that this is a widespread
occurrence in RDFa documents? Something along the lines of this
analysis, showing that prefixes for vocabularies were not declared in
the document, would help:

http://webdatacommons.org/vocabulary-usage-analysis/index.html

> 2. The developers of consumers either *also* share this 
> misunderstanding, or just don't find it worthwhile to be correct
> when they can do just as well in practice by treating the prefix as 
> meaningful.  This suggests that there may be a real interoperability 
> danger if an author *properly* declares a prefix where the prefix is
> a common one, but the URL is to something other than what common use
> points to - in "correct" consumers the document will be interpreted
> as the author intended, but in many common consumers it will instead
> be misinterpreted to be using the common vocabulary rather than what
> the author intended.

We have not seen reports of this sort of widespread abuse. Is there new
data to demonstrate that there are consumers out there that are doing this?

> 3. In addition to the theoretical interop problem above, we have a 
> real interop problem already - many consumers will happily consume 
> pages that don't declare their prefix, as long as they use a 
> "well-known" prefix for it.  A conformant consumer, on the other 
> hand, would *not* do so, and would find no valid data on the pages.

As others in this thread have pointed out, it seems that you may not be
aware of the RDFa Initial Context feature introduced in RDFa 1.1. A
conformant consumer /would/ find data on pages where the author has
forgotten to declare a prefix that the Initial Context pre-defines.
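
To make that concrete, here is a minimal sketch of prefix resolution
with an initial context as a fallback. It is not the normative RDFa 1.1
processing algorithm; the names resolve_curie and INITIAL_CONTEXT are
made up for illustration, and the table is only a small excerpt of the
mappings published at the URL above.

```python
# Small excerpt of the RDFa 1.1 Initial Context (illustrative, not complete).
INITIAL_CONTEXT = {
    "dc": "http://purl.org/dc/terms/",
    "foaf": "http://xmlns.com/foaf/0.1/",
    "og": "http://ogp.me/ns#",
    "schema": "http://schema.org/",
}


def resolve_curie(curie, declared):
    """Expand 'prefix:reference' to a full URL, or return None.

    `declared` holds the prefix mappings the author wrote in the document.
    They take precedence; the initial context only fills in the gaps.
    """
    prefix, sep, reference = curie.partition(":")
    if not sep:
        return None  # not a CURIE at all
    mappings = {**INITIAL_CONTEXT, **declared}
    base = mappings.get(prefix)
    if base is None:
        return None  # unknown prefix: no data generated
    return base + reference


# The author forgot to declare 'foaf', but the consumer still finds data.
print(resolve_curie("foaf:name", {}))
# An explicit declaration still wins over the pre-defined mapping.
print(resolve_curie("dc:title", {"dc": "http://purl.org/dc/elements/1.1/"}))
```

Note that in this sketch the author's own declaration overrides the
pre-defined one, which is what keeps decentralized extensibility intact.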

> You  have to reverse-engineer the web to find out which prefixes
> need to be supported without a declaration, and what URL they should
> be bound to.

One of your suggestions below is that we pre-define common prefixes in
use. How are we supposed to do that if we don't do crawls of the Web to
understand the common prefixes in use? Is that what you mean by
"reverse-engineer the Web"?

If so, it has also been shown that this can be done. In fact, Yahoo and
Common Crawl did crawls of a section of their corpus to give us
statistically significant results which were then used to define the
RDFa Initial Context. If we were to adopt your option #1 below, we'd
have to do this.

> 1. Discover and document the common prefixes in use, define them to 
> always be bound to the URL they're commonly bound to, even without
> an actual declaration, and don't allow them to be bound to a URL
> other than that predefined one.

Is it implicit in your proposal above that we'd continue to support
decentralized extensibility, but only for prefixes that are not
pre-defined? If that's the case, what happens when a developer uses
'foo' today, but then the W3C decides to pre-define 'foo' in the future
to map to a different URL? Wouldn't that break the author's document?
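
A rough sketch of that failure mode, assuming a consumer that follows
option #1 and refuses to let pre-defined prefixes be rebound. The
function name resolve_locked and both example.* URLs are hypothetical;
nothing here comes from a spec.

```python
def resolve_locked(curie, declared, predefined):
    """Expand a CURIE where pre-defined prefixes cannot be overridden.

    `predefined` is the centrally maintained prefix table; `declared` is
    what the author wrote. Under option #1 the pre-defined entry wins.
    """
    prefix, sep, reference = curie.partition(":")
    if not sep:
        return None
    if prefix in predefined:
        return predefined[prefix] + reference  # author's mapping is ignored
    base = declared.get(prefix)
    return None if base is None else base + reference


author_declared = {"foo": "http://example.com/vocab#"}  # hypothetical URL

# Today: 'foo' is not pre-defined, so the author's mapping is used.
today = resolve_locked("foo:bar", author_declared, {})

# Later: 'foo' gets pre-defined to point somewhere else (hypothetical URL).
later = resolve_locked("foo:bar", author_declared,
                       {"foo": "http://example.org/other#"})

print(today)
print(later)
```

The document did not change between the two calls, but the data a
consumer extracts from it did.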

That said, this is an interesting proposal, and one that I think we
should implement *IF* there is data to back up the premise of the
argument. The data should show that:

1) Declaring the same prefix to point to two different URLs is a
   common practice on the Web, and
2) It leads to consumers misinterpreting the markup in a way that
   generates the wrong data.

I think #1 is true for the 'dc' prefix, but in that case, both
'elements' and 'terms' are largely compatible with one another. I can't
think of any other vocabulary where this holds true. Seeing data showing
that something like 'ogp' or 'schema' is commonly being mapped to two
different vocabulary URLs would be very compelling evidence.

We haven't seen any data to raise the concern that #2 is true. Do you
have data demonstrating this assertion?
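
For what it's worth, the kind of evidence I have in mind for #1 could be
gathered with something as simple as the sketch below. It assumes you
already have (prefix, URL) pairs extracted from a crawl; the function
name conflicting_prefixes, the min_share threshold, and the sample
counts are all invented for illustration and are not real crawl data.

```python
from collections import Counter, defaultdict


def conflicting_prefixes(declarations, min_share=0.05):
    """Return prefixes that are commonly bound to more than one URL.

    `declarations` is an iterable of (prefix, url) pairs, one per prefix
    declaration found in the crawl. A URL counts as "common" for a prefix
    if it accounts for at least `min_share` of that prefix's declarations.
    """
    counts = defaultdict(Counter)
    for prefix, url in declarations:
        counts[prefix][url] += 1

    conflicts = {}
    for prefix, urls in counts.items():
        total = sum(urls.values())
        common = {u: n for u, n in urls.items() if n / total >= min_share}
        if len(common) > 1:
            conflicts[prefix] = common
    return conflicts


# Made-up sample input, not real measurements.
sample = (
    [("dc", "http://purl.org/dc/elements/1.1/")] * 3
    + [("dc", "http://purl.org/dc/terms/")] * 2
    + [("og", "http://ogp.me/ns#")] * 4
)
print(conflicting_prefixes(sample))
```

A report like this over a real corpus would tell us whether 'dc' is an
isolated case or whether the problem is more general.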

> 2. Drop the indirection of prefixes entirely, and simply declare
> that prefixes themselves are meaningful.  Predefine the common
> prefixes in use.

This would effectively break any document that is using a "statistically
insignificant" prefix. It would also harm innovation in vocabularies as
the barrier to entry would be much higher. It would break decentralized
extensibility in RDFa. I think you realize all of this, but I just
wanted to make sure that others in the thread understand the
ramifications of a change like this.

> 1. If people adopted the convention of simply using their domain
> name (quite reasonable, I think, and likely more-or-less what people
> will naturally use anyway), it would convey the exact same meaning
> and uniqueness as a full URL, but with less typing - "http://foo.com"
> is 11 characters longer than "foo".

While that's true, it would also discourage people from putting much
thought into creating their vocabularies, or from publishing them, and
it would effectively produce data that is not portable at all across
websites.

> However, if #2 is for whatever reason unacceptable, #1 is the *bare 
> minimum* that needs to be done for the RDFa spec to document
> reality, such that a consumer can follow the spec and reasonably
> expect to correctly consume content already on the web.

Of your two proposals, I think #1 has the best chance of being adopted
(if not for RDFa 1.1, then for RDFa 2.0). The part that is missing is
the data that supports the premise of your argument.

-- manu

-- 
Manu Sporny (skype: msporny, twitter: manusporny)
President/CEO - Digital Bazaar, Inc.
blog: HTML5 and RDFa 1.1
http://manu.sporny.org/2012/html5-and-rdfa/
Received on Tuesday, 6 November 2012 18:28:44 UTC