Re: schema.org and proto-data, was Re: schema.org as reconstructed from the human-readable information at schema.org from Christian Bizer on 2013-10-29 (public-vocabs@w3.org from October 2013)

From: Christian Bizer <chris@bizer.de>
Date: Tue, 29 Oct 2013 20:57:26 +0100
To: Peter Patel-Schneider <pfpschneider@gmail.com>, Guha <guha@google.com>
Cc: Martin Hepp <martin.hepp@unibw.de>, W3C Vocabularies <public-vocabs@w3.org>
Message-ID: <20131029205726.118607pqwgxugy7a@staff.webmail.uni-mannheim.de>
Hi Peter,

if you want two concrete examples illustrating the "quality of the  
understanding between the many minds (developers) in this eco-system"  
Martin is talking about, here they are:

1. http://schema.org/JobPosting clearly states that the values of the  
property "hiringOrganization" should be of type "Organization".  
Nevertheless, 40% of the Schema.org JobPosting instances that we found  
on the Web contained a simple string as the value of this property.

2. http://schema.org/Product defines over 20 properties that can be  
used to describe products. Nevertheless, out of the 14,000 websites  
that we found to use this class, around 50% use only three properties  
to describe products: Name, Description and Image.
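
For anybody who wants to reproduce a figure like the one in point 2 on their own crawl, here is a minimal sketch. It assumes the extracted data has already been reduced to (site, property) pairs; the function name and the input shape are my own choices for illustration, not part of any Web Data Commons tooling.

```python
from collections import defaultdict


def share_of_sites_limited_to(pairs, allowed):
    """Return the fraction of sites whose used properties all lie in `allowed`.

    `pairs` is an iterable of (site, property) tuples (an assumed input shape).
    """
    props_by_site = defaultdict(set)
    for site, prop in pairs:
        props_by_site[site].add(prop)
    if not props_by_site:
        return 0.0
    # A site counts as "limited" if it uses nothing outside the allowed set.
    limited = sum(1 for props in props_by_site.values() if props <= allowed)
    return limited / len(props_by_site)


# Hypothetical toy data: one site uses only basic properties, one also uses offers.
pairs = [
    ("a.example", "name"),
    ("a.example", "image"),
    ("b.example", "name"),
    ("b.example", "offers"),
]
basic = {"name", "description", "image"}
share = share_of_sites_limited_to(pairs, basic)
```

On the toy data above, one of the two sites stays within the three basic properties, so the share is 0.5.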

The first flaw can easily be fixed when pre-processing Schema.org  
data. Fixing the second problem requires some more sophisticated NLP  
techniques to guess product features from the free text in the fields  
Name and Description, for instance if you want to do identity  
resolution in order to find out which of the 14,000 websites are  
offering a specific iPhone.
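
To make the first fix concrete, here is a minimal sketch of such a pre-processing step. It assumes each JobPosting has already been parsed into a plain dictionary keyed by property name (a JSON-LD-like shape); the function name and that representation are assumptions for illustration, not how any particular extractor delivers the data.

```python
def normalize_hiring_organization(job):
    """Return a copy of a JobPosting dict whose hiringOrganization,
    if given as a plain string, is wrapped in an Organization node."""
    fixed = dict(job)  # shallow copy, the input is left untouched
    value = fixed.get("hiringOrganization")
    if isinstance(value, str):
        name = value.strip()
        if name:
            # Treat the string as the organization's name.
            fixed["hiringOrganization"] = {"@type": "Organization", "name": name}
        else:
            # An empty string carries no information, so drop the property.
            del fixed["hiringOrganization"]
    return fixed


# Hypothetical example record with the string-valued property.
raw = {"@type": "JobPosting", "title": "Data Engineer", "hiringOrganization": " ACME Corp "}
clean = normalize_hiring_organization(raw)
```

Values that are already structured are passed through unchanged, so the step can be applied to every JobPosting without first checking which ones are affected.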

No magic and no non-deterministic algorithms, just the normal dirty  
stuff that makes data integration successful, with the small, but  
relevant difference that you know because of the markup that you are  
looking at product descriptions and not at arbitrary web pages.

If you (or anybody else) want to find out more about how schema.org  
is used in the wild, you can download 1.4 billion quads of Schema.org  
data originating from 140,000 websites from http://webdatacommons.org/  
or take a look at

https://github.com/lidingpku/iswc-archive/raw/master/paper/iswc-2013/82190017-deployment-of-rdfa-microdata-and-microformats-on-the-web-a-quantitative-analysis.pdf

which gives you some basic statistics about commonly used classes and  
properties.

So no need to be a major search engine to explore this space and get  
an understanding about the kind of knowledge modeling that is  
understood by average webmasters ;-)

Cheers,

Chris


Guha <guha@google.com> wrote:

> Peter,
>
>  I don't think Martin implied that there was some kind of mystical,
> non-deterministic process involved in using schema.org markup or that it
> could only be consumed by major search players.
>
>  I believe what he was saying (and I concur) is that when you have millions
> of authors providing data, there are so many errors and misinterpretations
> (i.e., noise) that consuming it and constructing something meaningful out
> of it could be non-trivial. Expecting all authors to make the kind of
> subtle (but important to certain academic communities) distinctions might
> be too much.
>
> guha
>
>
> On Mon, Oct 28, 2013 at 8:31 AM, Peter Patel-Schneider <
> pfpschneider@gmail.com> wrote:
>
>> That's an awfully depressing view of schema.org.  Do you really mean to
>> say that there is no role for small or medium players in this game at all,
>> even if they are only interested in some of the data?  Just why do you say
>> this?  Is it because there is something inherent in the data that requires
>> this processing?  Is it because there is something inherent in the
>> producers of the data that requires this processing?  Is it because there
>> is something inherent in the current consumers of the data that requires
>> this processing?  Is there something inherent in the schema.org
>> specification that requires this processing?  Is there something that can
>> be fixed that will allow small or medium players to consume schema.org data?
>>
>> My hope here, and maybe it is a forlorn one, is precisely that more
>> consumers can use the information that is being put into web pages using
>> the schema.org setup.  Right now it appears that only the major search
>> players know enough about schema.org to be able to consume the
>> information.  Of course, I do think that conceptual clarity will help here,
>> but I do realize that in an endeavor like this one there are always going
>> to be problems with underspecified or incorrectly specified or incorrect
>> data.  I don't think, however, that this prevents small and medium players
>> from being consumers of schema.org data.
>>
>> Knowledge representation, just like databases, has from the beginning been
>> concerned with more than just simple notions of entailment and
>> computational complexity, so I would not say that issues related to data
>> quality and intent are outside of traditional knowledge representation.
>>
>> peter
>>
>> PS: Do you really mean to say that processing schema.org data requires
>> non-deterministic computation?   What would require this?  What sorts of
>> non-traditional computation is required?
>>
>>
>> On Mon, Oct 28, 2013 at 2:46 AM, Martin Hepp <martin.hepp@unibw.de> wrote:
>>
>>> Peter:
>>> Note that schema.org sits between millions of owners of data (Web
>>> masters) and large, centralized consumers of Big Data, who apply hundreds
>>> of heuristics before using the data.
>>> Schema.org is an interface between Webmaster minds, data structures in
>>> back-end RDBMS driving Web sites, and search engines (and maybe other types
>>> of consumers).
>>>
>>> The whole environment heavily relies on
>>> 1. probabilistic processing
>>> 2. the quality of the understanding between the many minds (developers)
>>> in this eco-system.
>>>
>>> Traditional measures of conceptual clarity and guarantees / deterministic
>>> data processing are of very limited relevance in that setting.
>>>
>>> For instance, if you introduce a conceptual distinction which is very
>>> valuable and justified from an expert's perspective, this may often not lead
>>> to more reliable data processing, since the element may be used more
>>> inconsistently among Web developers (or the distinctions may not be
>>> reliably represented in the databases driving the sites).
>>>
>>> Looking at schema.org from the perspective of knowledge representation
>>> in the traditional sense is insufficient, IMO. You have to look at the data
>>> ecosystem as a whole.
>>>
>>> Information exposed using schema.org meta-data is what I would call
>>> proto-data, not ready for direct consumption by deterministic computational
>>> operations.
>>>
>>> Martin
>>>
>>
>>
>
Received on Tuesday, 29 October 2013 19:57:50 UTC