RE: schema.org and proto-data, was Re: schema.org as reconstructed from the human-readable information at schema.org

Speaking from much bitter experience in the library community, "Dream on!"

We have books full of documentation on how to encode our content.  In fact, the people doing data entry on our systems have to have master's degrees in the content.  Nonetheless, our database is full of garbage.  Simple stuff is simple.  Hard stuff is subject to misinterpretation.  Nothing about schema.org is going to change that.

Ralph

From: Peter Patel-Schneider [mailto:pfpschneider@gmail.com]
Sent: Wednesday, October 30, 2013 2:57 PM
To: Martin Hepp
Cc: Guha; W3C Vocabularies
Subject: Re: schema.org and proto-data, was Re: schema.org as reconstructed from the human-readable information at schema.org

Absent both of these, it is certain that there is going to be a lot of seemingly random data and seemingly random processing.  My hope is that there will very soon be much better information available on schema.org that will help a lot.

peter

On Wed, Oct 30, 2013 at 8:24 AM, Martin Hepp <martin.hepp@unibw.de> wrote:
> Peter,
>
>  I don't think Martin implied that there was some kind of mystical, non-deterministic process involved in using schema.org markup or that it could only be consumed by major search players.
>
>  I believe what he was saying (and I concur) is that when you have millions of authors providing data, there are so many errors and misinterpretations (i.e., noise) that consuming it and constructing something meaningful out of it could be non-trivial. Expecting all authors to make the kind of subtle (but important to certain academic communities) distinctions might be too much.
>
> guha
Indeed, this is what I tried to say.

With "non-deterministic" I mean that schemas at Web scale do not "guarantee" the outcomes of computational operations over the respective instance data in any way near to how schemas in closed, controlled database settings do (at least in theory). Instead, they are limited to influencing the probabilities of the respective operations.

My main claim, outlined in my EKAW 2012 keynote [1], is that consuming data based on shared conceptual structures at Web scale is an inherently probabilistic undertaking: there are no guarantees about which results computational operations over the data will produce.

In essence, I claim that shared conceptual structures at Web scale do to data processing something similar to what Heisenberg's uncertainty principle [2] did to the world of physics.

For instance, the more precisely you define the semantics of a conceptual element, the less likely it becomes that the provider and the consumer of data associate exactly the same set of entities with that element.
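
To make this concrete with a small, hypothetical sketch (my own illustration, not something from the keynote; the records and the helper below are made up): suppose two sites both mark up what they consider "the same book" with schema.org's Book type, but one describes the abstract work and the other a specific edition. The answer a consumer gets to a simple question such as "how many distinct books are here?" then depends entirely on which interpretation it assumes:

    # Hypothetical illustration: two publishers use schema.org's Book type for
    # "the same book" but interpret the type differently (abstract work vs. one
    # concrete edition). The schema constrains, yet does not determine, what a
    # consumer computes from the data.

    site_a = {            # reads "Book" as the abstract work
        "@type": "Book",
        "name": "Moby-Dick",
        "author": "Herman Melville",
    }

    site_b = {            # reads "Book" as one concrete edition
        "@type": "Book",
        "name": "Moby-Dick; or, The Whale (2003 paperback edition)",
        "author": "Herman Melville",
        "isbn": "9780000000000",   # placeholder ISBN, purely illustrative
    }

    def count_distinct_books(items, identity):
        """Count 'distinct books' under a chosen identity criterion."""
        return len({identity(item) for item in items})

    items = [site_a, site_b]

    # Identity = (author, title before any subtitle/edition suffix):
    # both records collapse into a single work.
    as_works = count_distinct_books(
        items,
        identity=lambda b: (b["author"],
                            b["name"].split(";")[0].split("(")[0].strip()),
    )

    # Identity = ISBN (absent on site_a, so each record stands alone):
    # the records stay apart.
    as_editions = count_distinct_books(items, identity=lambda b: b.get("isbn", id(b)))

    print(as_works, as_editions)   # 1 2 -- same vocabulary, different "truths"

Neither consumer is wrong; the shared vocabulary simply does not pin down which of the two answers the data "means".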

I hope to elaborate on this a bit further in writing, but that is what I can contribute at this point.

Note that this view goes well beyond the ideas of "noise", "data quality issues", and "data provenance", because those terms are rooted in the notion of a controlled, relatively static setting, which the Web clearly is not.

Martin


[1] From Ontologies to Web Ontologies: Lessons learned from Conceptual Modeling for the WWW
Keynote talk at EKAW 2012, Galway, Ireland. https://vimeo.com/51152934
[2] http://en.wikipedia.org/wiki/Uncertainty_principle

Received on Wednesday, 30 October 2013 19:02:30 UTC