Re: schema.org and proto-data, was Re: schema.org as reconstructed from the human-readable information at schema.org

I do hope that there is a reasonable method for extracting good information
from web pages marked up with schema.org content.  I realize that it may
be non-trivial to recover from common situations where the information is
not precisely stated.  However, I do hope that such extraction is possible
without requiring the full resources of a company that is in the business
of web search, at least if one is a little careful about which sources one
uses.
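
To make this concrete, here is a minimal sketch of the kind of extraction
I have in mind (standard-library Python only, and limited to JSON-LD
blocks; microdata and RDFa would need more machinery), so treat it as an
illustration rather than a complete consumer:

    # Illustrative sketch only: collect schema.org JSON-LD blocks from a
    # page using nothing beyond the Python standard library.
    import json
    from html.parser import HTMLParser
    from urllib.request import urlopen

    class JsonLdExtractor(HTMLParser):
        """Collects the contents of <script type="application/ld+json"> tags."""
        def __init__(self):
            super().__init__()
            self.in_jsonld = False
            self.blocks = []

        def handle_starttag(self, tag, attrs):
            # attrs is a list of (name, value) pairs with lowercased names.
            if tag == "script" and ("type", "application/ld+json") in attrs:
                self.in_jsonld = True
                self.blocks.append("")

        def handle_endtag(self, tag):
            if tag == "script":
                self.in_jsonld = False

        def handle_data(self, data):
            if self.in_jsonld:
                self.blocks[-1] += data

    def extract_jsonld(url):
        html = urlopen(url).read().decode("utf-8", errors="replace")
        parser = JsonLdExtractor()
        parser.feed(html)
        items = []
        for block in parser.blocks:
            try:
                items.append(json.loads(block))
            except ValueError:
                # Noisy markup is common in the wild; skip unparseable blocks.
                pass
        return items

Even something this small gets usable data out of well-behaved pages; the
hard part, as discussed below, is the noise.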

Martin's message appeared to be contrary to this hope, alluding, among
other things, to some unspecified non-deterministic processing, so I was
wondering why (he thinks that) this is the case.

peter

PS:  What subtle distinctions do you think are not going to be made in
schema.org markup?   There is the use of string identifiers instead of
actual items (which I wouldn't call a subtle distinction), but what else?
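
To illustrate that parenthetical (with a snippet of my own, not taken from
any real page), schema.org lets an author describe a book's author either
as a plain string or as an actual item:

    { "@type": "Book", "name": "Moby-Dick",
      "author": "Herman Melville" }

    { "@type": "Book", "name": "Moby-Dick",
      "author": { "@type": "Person", "name": "Herman Melville" } }

A consumer of the first form gets only a string and has to guess which
person is meant; the second form yields an item that can carry further
properties and be linked to other data.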


On Tue, Oct 29, 2013 at 9:13 AM, Guha <guha@google.com> wrote:

> Peter,
>
>  I don't think Martin implied that there was some kind of mystical,
> non-deterministic process involved in using schema.org markup or that it
> could only be consumed by major search players.
>
>  I believe what he was saying (and I concur) is that when you have
> millions of authors providing data, there are so many errors and
> misinterpretations (i.e., noise) that consuming it and constructing
> something meaningful out of it could be non-trivial. Expecting all authors
> to make the kind of subtle (but important to certain academic communities)
> distinctions might be too much.
>
> guha
>
>
> On Mon, Oct 28, 2013 at 8:31 AM, Peter Patel-Schneider <
> pfpschneider@gmail.com> wrote:
>
>> That's an awfully depressing view of schema.org.  Do you really mean to
>> say that there is no role for small or medium players in this game at all,
>> even if they are only interested in some of the data?  Just why do you say
>> this?  Is it because there is something inherent in the data that requires
>> this processing?  Is it because there is something inherent in the
>> producers of the data that requires this processing?  Is it because there
>> is something inherent in the current consumers of the data that requires
>> this processing?  Is there something inherent in the schema.org
>> specification that requires this processing?  Is there something that can
>> be fixed that will allow small or medium players to consume schema.org
>> data?
>>
>> My hope here, and maybe it is a forlorn one, is precisely that more
>> consumers can use the information that is being put into web pages using
>> the schema.org setup.  Right now it appears that only the major search
>> players know enough about schema.org to be able to consume the
>> information.  Of course, I do think that conceptual clarity will help here,
>> but I do realize that in an endeavor like this one there are always going
>> to be problems with underspecified or incorrectly specified or incorrect
>> data.  I don't think, however, that this prevents small and medium players
>> from being consumers of schema.org data.
>>
>> Knowledge representation, just like databases, has from the beginning
>> been concerned with more than just simple notions of entailment and
>> computational complexity, so I would not say that issues related to data
>> quality and intent are outside of traditional knowledge representation.
>>
>> peter
>>
>> PS: Do you really mean to say that processing schema.org data requires
>> non-deterministic computation?   What would require this?  What sorts of
>> non-traditional computation are required?
>>
>>
>> On Mon, Oct 28, 2013 at 2:46 AM, Martin Hepp <martin.hepp@unibw.de> wrote:
>>
>>> Peter:
>>> Note that schema.org sits between millions of owners of data
>>> (Webmasters) and large, centralized consumers of Big Data, who apply hundreds
>>> of heuristics before using the data.
>>> Schema.org is an interface between Webmaster minds, data structures in
>>> back-end RDBMS driving Web sites, and search engines (and maybe other types
>>> of consumers).
>>>
>>> The whole environment heavily relies on:
>>> 1. probabilistic processing
>>> 2. the quality of the understanding between the many minds (developers)
>>> in this ecosystem.
>>>
>>> Traditional measures of conceptual clarity, and guarantees of
>>> deterministic data processing, are of very limited relevance in that
>>> setting.
>>>
>>> For instance, if you introduce a conceptual distinction which is very
>>> valuable and justified from an expert's perspective, this may often not lead
>>> to more reliable data processing, since the element may be used more
>>> inconsistently among Web developers (or the distinctions may not be
>>> reliably represented in the databases driving the sites).
>>>
>>> Looking at schema.org from the perspective of knowledge representation
>>> in the traditional sense is insufficient, IMO. You have to look at the data
>>> ecosystem as a whole.
>>>
>>> Information exposed using schema.org meta-data is what I would call
>>> proto-data, not ready for direct consumption by deterministic computational
>>> operations.
>>>
>>> Martin
>>>
>>
>>
>
>

Received on Tuesday, 29 October 2013 18:14:48 UTC