Re: schema.org and proto-data, was Re: schema.org as reconstructed from the human-readable information at schema.org

I could not have said it better ...

guha


On Tue, Oct 29, 2013 at 12:57 PM, Christian Bizer <chris@bizer.de> wrote:

> Hi Peter,
>
> if you want two concrete examples illustrating the "quality of the
> understanding between the many minds (developers) in this eco-system"
> Martin is talking about, here they are:
>
> 1. http://schema.org/JobPosting clearly states that the values of the
> property "hiringOrganization" should be of type "Organization".
> Nevertheless, 40% of the Schema.org JobPosting instances that we found
> on the Web contained a simple string as the value of this property.
>
> 2. http://schema.org/Product defines over 20 properties that can be used
> to describe products. Nevertheless, out of the 14,000 websites that we
> found to use this class, around 50% use only three properties to
> describe products: name, description, and image.
>
> The first flaw can easily be fixed when pre-processing Schema.org data.
> Fixing the second problem requires more sophisticated NLP techniques to
> guess product features from the free text in the name and description
> fields, for instance if you want to do identity resolution in order to
> find out which of the 14,000 websites are offering a specific iPhone.
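>
> To make this concrete, here is a minimal sketch (in Python, with
> hypothetical field names) of the kind of pre-processing I mean. The
> first function fixes the first flaw by wrapping stray string values in
> a proper Organization node; the second is only a crude placeholder for
> the NLP step, guessing an iPhone model from the free text:
>
>   import re
>
>   def fix_hiring_organization(job):
>       # First flaw: coerce a string-valued hiringOrganization into a
>       # proper Organization node.
>       org = job.get("hiringOrganization")
>       if isinstance(org, str):
>           job["hiringOrganization"] = {"@type": "Organization", "name": org}
>       return job
>
>   def guess_iphone_model(product):
>       # Second problem (crude regex heuristic; real NLP would do better):
>       # guess an iPhone model from the free-text name/description fields.
>       text = " ".join(filter(None, [product.get("name"), product.get("description")]))
>       match = re.search(r"iphone\s*(\d+\w*)", text, re.IGNORECASE)
>       return match.group(1) if match else None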
>
> No magic and no non-deterministic algorithms, just the normal dirty stuff
> that makes data integration successful, with the small, but relevant
> difference that you know because of the markup that you are looking at
> product descriptions and not at arbitrary web pages.
>
> If you (or anybody else) want to find out more about how schema.org is
> used in the wild, you can download 1.4 billion quads of Schema.org data
> originating from 140,000 websites from http://webdatacommons.org/ or take
> a look at
>
> https://github.com/lidingpku/iswc-archive/raw/master/paper/iswc-2013/82190017-deployment-of-rdfa-microdata-and-microformats-on-the-web-a-quantitative-analysis.pdf
>
> which gives you some basic statistics about commonly used classes and
> properties.
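>
> If you want to compute such statistics yourself, a minimal sketch
> (Python; the input file name is only a placeholder) that streams an
> N-Quads file and counts class and property usage could look like this:
>
>   import gzip
>   from collections import Counter
>
>   RDF_TYPE = "<http://www.w3.org/1999/02/22-rdf-syntax-ns#type>"
>   classes, properties = Counter(), Counter()
>
>   with gzip.open("schemaorg-quads.nq.gz", "rt", errors="replace") as f:
>       for line in f:
>           # subject, predicate, object, context; literal objects may
>           # contain spaces, but rdf:type objects are single URIs.
>           parts = line.split(None, 3)
>           if len(parts) < 3:
>               continue
>           predicate, obj = parts[1], parts[2]
>           if predicate == RDF_TYPE:
>               classes[obj] += 1
>           else:
>               properties[predicate] += 1
>
>   print(classes.most_common(10))
>   print(properties.most_common(10))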
>
> So no need to be a major search engine to explore this space and get an
> understanding of the kind of knowledge modeling that is understood by
> average webmasters ;-)
>
> Cheers,
>
> Chris
>
>
> Guha <guha@google.com> wrote:
>
>> Peter,
>>
>>  I don't think Martin implied that there was some kind of mystical,
>> non-deterministic process involved in using schema.org markup or that it
>> could only be consumed by major search players.
>>
>>  I believe what he was saying (and I concur) is that when you have
>> millions of authors providing data, there are so many errors and
>> misinterpretations (i.e., noise) that consuming it and constructing
>> something meaningful out of it could be non-trivial. Expecting all
>> authors to make the kinds of subtle (but important to certain academic
>> communities) distinctions might be too much.
>>
>> guha
>>
>>
>> On Mon, Oct 28, 2013 at 8:31 AM, Peter Patel-Schneider <
>> pfpschneider@gmail.com> wrote:
>>
>>> That's an awfully depressing view of schema.org.  Do you really mean
>>> to say that there is no role for small or medium players in this game
>>> at all, even if they are only interested in some of the data?  Just
>>> why do you say this?  Is it because there is something inherent in the
>>> data that requires this processing?  Is it because there is something
>>> inherent in the producers of the data that requires this processing?
>>> Is it because there is something inherent in the current consumers of
>>> the data that requires this processing?  Is there something inherent
>>> in the schema.org specification that requires this processing?  Is
>>> there something that can be fixed that will allow small or medium
>>> players to consume schema.org data?
>>>
>>> My hope here, and maybe it is a forlorn one, is precisely that more
>>> consumers can use the information that is being put into web pages using
>>> the schema.org setup.  Right now it appears that only the major search
>>> players know enough about schema.org to be able to consume the
>>> information.  Of course, I do think that conceptual clarity will help
>>> here, but I do realize that in an endeavor like this one there are
>>> always going to be problems with underspecified or incorrectly
>>> specified or incorrect data.  I don't think, however, that this
>>> prevents small and medium players from being consumers of schema.org
>>> data.
>>>
>>> Knowledge representation, just like databases, has from the beginning
>>> been concerned with more than just simple notions of entailment and
>>> computational complexity, so I would not say that issues related to
>>> data quality and intent are outside of traditional knowledge
>>> representation.
>>>
>>> peter
>>>
>>> PS: Do you really mean to say that processing schema.org data requires
>>> non-deterministic computation?  What would require this?  What sorts
>>> of non-traditional computation are required?
>>>
>>>
>>> On Mon, Oct 28, 2013 at 2:46 AM, Martin Hepp <martin.hepp@unibw.de>
>>> wrote:
>>>
>>>> Peter:
>>>> Note that schema.org sits between millions of owners of data
>>>> (Webmasters) and large, centralized consumers of Big Data, who apply
>>>> hundreds of heuristics before using the data.
>>>> Schema.org is an interface between Webmaster minds, data structures in
>>>> back-end RDBMS driving Web sites, and search engines (and maybe other
>>>> types of consumers).
>>>>
>>>> The whole environment heavily relies on
>>>> 1. probabilistic processing
>>>> 2. the quality of the understanding between the many minds (developers)
>>>> in this eco-system.
>>>>
>>>> Traditional measures of conceptual clarity and guarantees /
>>>> deterministic data processing are of very limited relevance in that
>>>> setting.
>>>>
>>>> For instance, if you introduce a conceptual distinction which is very
>>>> valuable and justified from an expert's perspective, this may often
>>>> not lead to more reliable data processing, since the element may be
>>>> used more inconsistently among Web developers (or the distinctions may
>>>> not be reliably represented in the databases driving the sites).
>>>>
>>>> Looking at schema.org from the perspective of knowledge representation
>>>> in the traditional sense is insufficient, IMO. You have to look at
>>>> the data ecosystem as a whole.
>>>>
>>>> Information exposed using schema.org meta-data is what I would call
>>>> proto-data, not ready for direct consumption by deterministic
>>>> computational operations.
>>>>
>>>> Martin
>>>>
>>>>
>>>
>>>
>>
>
>
>

Received on Tuesday, 29 October 2013 20:35:11 UTC