W3C home > Mailing lists > Public > public-vocabs@w3.org > October 2013

Re: schema.org and proto-data, was Re: schema.org as reconstructed from the human-readable information at schema.org

From: Martin Hepp <martin.hepp@unibw.de>
Date: Wed, 30 Oct 2013 16:24:48 +0100
Cc: Peter Patel-Schneider <pfpschneider@gmail.com>, W3C Vocabularies <public-vocabs@w3.org>
Message-Id: <2E6D773D-4E08-4C09-9F36-C325D2E34D45@unibw.de>
To: Guha <guha@google.com>
> Peter,
> 
>  I don't think Martin implied that there was some kind of mystical, non-deterministic process involved in using schema.org markup or that it could only be consumed by major search players.
> 
>  I believe what he was saying (and I concur) is that when you have millions of authors providing data, there are so many errors and misinterpretations (i.e., noise) that consuming it and constructing something meaningful out of it could be non-trivial. Expecting all authors to make the kind of subtle (but important to certain academic communities) distinctions might be too much.
> 
> guha

Indeed, this is what I tried to say.

With "non-deterministic" I mean that schemas at Web scale do not "guarantee" the outcomes of computational operations over the respective instance data anywhere near the way schemas in closed, controlled database settings do (at least in theory). Instead, they are limited to influencing the probabilities of those outcomes.

My main claim, outlined in my EKAW 2012 keynote [1] (video here: https://vimeo.com/51152934), is that consuming data based on shared conceptual structures at Web scale is a probabilistic setting: there are no guarantees about which results computational operations over the data will produce.
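A toy sketch of what this means in practice (the function and the sample values are my own illustration, not code from any actual consumer): even for something as simple as a schema.org price, a consumer can only apply heuristics that usually, but not always, recover the intended value.

```python
import re

def guess_price(raw):
    """Best-effort normalization of a schema.org price value.

    Returns a float or None. A wrong guess is always possible,
    e.g. for ambiguous separators like '1.299'.
    """
    s = raw.strip()
    s = re.sub(r"[^\d.,]", "", s)  # drop currency symbols, words, spaces
    if not s:
        return None
    if "," in s and "." in s:
        # Heuristic: whichever separator occurs last is the decimal mark.
        if s.rfind(",") > s.rfind("."):
            s = s.replace(".", "").replace(",", ".")
        else:
            s = s.replace(",", "")
    elif "," in s:
        # '1,299' is probably a thousands separator, '12,99' a decimal.
        head, _, tail = s.rpartition(",")
        s = head + tail if len(tail) == 3 else head + "." + tail
    return float(s)

for raw in ["EUR 12,99", "$1,299.00", "1.299,00 \u20ac", "ask us"]:
    print(raw, "->", guess_price(raw))
```

The point is not that the heuristics are clever, but that the result is only probably right: the schema constrains the interpretation without guaranteeing it.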

In essence, I claim that shared conceptual structures at Web scale do to data processing something similar to what Heisenberg's uncertainty principle [2] did to physics.

For instance, the more precisely you define the semantics of a conceptual element, the less likely it becomes that the provider and the consumer of data associate exactly the same set of entities with that type.
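This trade-off can be shown with a tiny simulation (entirely my own illustration, with invented parameters): two independent annotators classify the same entities against a type defined by k criteria, each criterion being judged correctly with probability 0.95. The more criteria the definition has, the lower the overlap between the two resulting entity sets.

```python
import random

random.seed(42)

def annotate(n_entities, k_criteria, p_correct=0.95):
    # An entity is tagged with the type only if the annotator judges
    # every one of the k criteria as satisfied.
    return {e for e in range(n_entities)
            if all(random.random() < p_correct for _ in range(k_criteria))}

for k in (1, 3, 10):
    a, b = annotate(1000, k), annotate(1000, k)
    jaccard = len(a & b) / len(a | b)  # agreement between the two annotators
    print(f"{k} criteria -> overlap {jaccard:.2f}")
```

The expected overlap falls from roughly 0.9 with one criterion to under 0.5 with ten: a more precise definition produces less, not more, agreement between independent minds.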

I hope to elaborate on that a little further in writing, but this is what I can contribute at this point.

Note that this view goes radically further than the idea of "noise", "data quality issues", and "data provenance", because those terms are rooted in the notion of a controlled, relatively static setting, which the Web is clearly not.

Martin


[1] From Ontologies to Web Ontologies: Lessons learned from Conceptual Modeling for the WWW
Keynote talk at EKAW 2012, Galway, Ireland. https://vimeo.com/51152934
[2] http://en.wikipedia.org/wiki/Uncertainty_principle

On Oct 29, 2013, at 5:13 PM, Guha wrote:

> Peter,
> 
>  I don't think Martin implied that there was some kind of mystical, non-deterministic process involved in using schema.org markup or that it could only be consumed by major search players.
> 
>  I believe what he was saying (and I concur) is that when you have millions of authors providing data, there are so many errors and misinterpretations (i.e., noise) that consuming it and constructing something meaningful out of it could be non-trivial. Expecting all authors to make the kind of subtle (but important to certain academic communities) distinctions might be too much.
> 
> guha
> 
> 
> On Mon, Oct 28, 2013 at 8:31 AM, Peter Patel-Schneider <pfpschneider@gmail.com> wrote:
> That's an awfully depressing view of schema.org.  Do you really mean to say that there is no role for small or medium players in this game at all, even if they are only interested in some of the data?  Just why do you say this?  Is it because there is something inherent in the data that requires this processing?  Is it because there is something inherent in the producers of the data that requires this processing?  Is it because there is something inherent in the current consumers of the data that requires this processing?  Is there something inherent in the schema.org specification that requires this processing?  Is there something that can be fixed that will allow small or medium players to consume schema.org data?
> 
> My hope here, and maybe it is a forlorn one, is precisely that more consumers can use the information that is being put into web pages using the schema.org setup.  Right now it appears that only the major search players know enough about schema.org to be able to consume the information.  Of course, I do think that conceptual clarity will help here, but I do realize that in an endeavor like this one there are always going to be problems with underspecified or incorrectly specified or incorrect data.  I don't think, however, that this prevents small and medium players from being consumers of schema.org data.
> 
> Knowledge representation, just like databases, has from the beginning been concerned with more than just simple notions of entailment and computational complexity, so I would not say that issues related to data quality and intent are outside of traditional knowledge representation.
> 
> peter
> 
> PS: Do you really mean to say that processing schema.org data requires non-deterministic computation?  What would require this?  What sorts of non-traditional computation are required?
> 
> 
> On Mon, Oct 28, 2013 at 2:46 AM, Martin Hepp <martin.hepp@unibw.de> wrote:
> Peter:
> Note that schema.org sits between millions of owners of data (Web masters) and large, centralized consumers of Big Data, who apply hundreds of heuristics before using the data.
> Schema.org is an interface between Webmaster minds, data structures in back-end RDBMS driving Web sites, and search engines (and maybe other types of consumers).
> 
> The whole environment heavily relies on
> 1. probabilistic processing
> 2. the quality of the understanding between the many minds (developers) in this ecosystem.
> 
> Traditional measures of conceptual clarity and guarantees / deterministic data processing are of very limited relevance in that setting.
> 
> For instance, if you introduce a conceptual distinction which is very valuable and justified from an expert's perspective, this may often not lead to more reliable data processing, since the element may be used more inconsistently among Web developers (or the distinctions may not be reliably represented in the databases driving the sites).
> 
> Looking at schema.org from the perspective of knowledge representation in the traditional sense is insufficient, IMO. You have to look at the data ecosystem as a whole.
> 
> Information exposed using schema.org meta-data is what I would call proto-data, not ready for direct consumption by deterministic computational operations.
> 
> Martin
>  
> 

--------------------------------------------------------
martin hepp
e-business & web science research group
universitaet der bundeswehr muenchen

e-mail:  hepp@ebusiness-unibw.org
phone:   +49-(0)89-6004-4217
fax:     +49-(0)89-6004-4620
www:     http://www.unibw.de/ebusiness/ (group)
         http://www.heppnetz.de/ (personal)
skype:   mfhepp 
twitter: mfhepp

Check out GoodRelations for E-Commerce on the Web of Linked Data!
=================================================================
* Project Main Page: http://purl.org/goodrelations/
Received on Wednesday, 30 October 2013 15:25:25 UTC

This archive was generated by hypermail 2.3.1 : Tuesday, 6 January 2015 21:29:32 UTC