Re: schema.org and proto-data, was Re: schema.org as reconstructed from the human-readable information at schema.org from Adrian Giurca on 2013-10-30 (public-vocabs@w3.org from October 2013)

From: Adrian Giurca <giurca@tu-cottbus.de>
Date: Wed, 30 Oct 2013 13:37:35 +0100
To: Guha <guha@google.com>, Christian Bizer <chris@bizer.de>
CC: Peter Patel-Schneider <pfpschneider@gmail.com>, Martin Hepp <martin.hepp@unibw.de>, W3C Vocabularies <public-vocabs@w3.org>
Message-ID: <5270FD8F.5080707@tu-cottbus.de>
Dear Mr. Guha, Dear Mr. Bizer,

It is no surprise that people do not always use the Schema.org
properties properly. Sometimes it is a matter of learning, sometimes it
is about wrong design.
However, I wonder what an expert would advise me on improving the markup
below:

    <div itemscope itemtype="http://schema.org/JobPosting">
    ...
    posted by
    <span itemprop="hiringOrganization">
    <!-- <span itemprop="name"> -->
    John Doe and Associates
    <!-- </span> -->
    </span>
    </div>

40% of the web sites crawled by webdatacommons used a string. But what
about the other 60%?
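
For what it is worth, the pre-processing fix that Chris mentions below
for string-valued properties could look roughly like the following Python
sketch. It assumes the extracted item is already available as a plain
dictionary; the function name normalize_hiring_organization and the
dictionary layout are hypothetical and not part of any schema.org
tooling. A bare string is wrapped into an Organization whose name is that
string, and values that are already structured are left untouched.

```python
# Hypothetical sketch: items are plain dicts, as an extractor might emit them.
def normalize_hiring_organization(job_posting):
    """Return a copy in which a string-valued hiringOrganization is
    wrapped into an Organization item carrying that string as its name."""
    item = dict(job_posting)  # shallow copy; the input is not modified
    value = item.get("hiringOrganization")
    if isinstance(value, str):
        item["hiringOrganization"] = {
            "@type": "Organization",
            "name": value.strip(),
        }
    return item


posting = {
    "@type": "JobPosting",
    "hiringOrganization": "John Doe and Associates",
}
print(normalize_hiring_organization(posting)["hiringOrganization"])
```

This only covers the case where the string is meant as the organization
name; it cannot tell whether the author intended something else.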

Regards,
Adrian Giurca


On 10/29/2013 9:34 PM, Guha wrote:
> I could not have said it better ...
>
> guha
>
>
> On Tue, Oct 29, 2013 at 12:57 PM, Christian Bizer <chris@bizer.de 
> <mailto:chris@bizer.de>> wrote:
>
>     Hi Peter,
>
>     if you want two concrete examples illustrating the "quality of the
>     understanding between the many minds (developers) in this
>     eco-system" Martin is talking about, here they are:
>
>     1. http://schema.org/JobPosting clearly states that the values of
>     the property "hiringOrganization" should be of type
>     "Organization". Nevertheless, 40% of the Schema.org JobPosting
>     instances that we found on the Web contained a simple string as
>     the value of this property.
>
>     2. http://schema.org/Product defines over 20 properties that can
>     be used to describe products. Nevertheless, out of the 14,000
>     websites that we found to use this class, around 50% use only
>     three properties to describe products: Name, Description and Image.
>
>     The first flaw can easily be fixed when pre-processing Schema.org
>     data. Fixing the second problem requires some more sophisticated
>     NLP techniques to guess product features from the free text in the
>     fields Name and Description, for instance if you want to do
>     identity resolution in order to find out which of the 14,000
>     websites is offering a specific iPhone.
>
>     No magic and no non-deterministic algorithms, just the normal
>     dirty stuff that makes data integration successful, with the
>     small, but relevant difference that you know because of the markup
>     that you are looking at product descriptions and not at arbitrary
>     web pages.
>
>     If you (or anybody else) want to find out more about how
>     schema.org <http://schema.org> is used in the wild, you can
>     download 1.4 billion quads of Schema.org data originating from
>     140,000 websites from http://webdatacommons.org/ or take a look at
>
>     https://github.com/lidingpku/iswc-archive/raw/master/paper/iswc-2013/82190017-deployment-of-rdfa-microdata-and-microformats-on-the-web-a-quantitative-analysis.pdf
>
>     which gives you some basic statistics about commonly used classes
>     and properties.
>
>     So there is no need to be a major search engine to explore this
>     space and get an understanding of the kind of knowledge modeling
>     that is understood by average webmasters ;-)
>
>     Cheers,
>
>     Chris
>
>
>     Guha <guha@google.com <mailto:guha@google.com>> wrote:
>
>         Peter,
>
>          I don't think Martin implied that there was some kind of
>         mystical,
>         non-deterministic process involved in using schema.org
>         <http://schema.org> markup or that it
>         could only be consumed by major search players.
>
>          I believe what he was saying (and I concur) is that when you
>         have millions
>         of authors providing data, there are so many errors and
>         misinterpretations
>         (i.e., noise) that consuming it and constructing something
>         meaningful out
>         of it could be non-trivial. Expecting all authors to make the
>         kind of
>         subtle (but important to certain academic communities)
>         distinctions might
>         be too much.
>
>         guha
>
>
>         On Mon, Oct 28, 2013 at 8:31 AM, Peter Patel-Schneider <
>         pfpschneider@gmail.com <mailto:pfpschneider@gmail.com>> wrote:
>
>             That's an awfully depressing view of schema.org
>             <http://schema.org>.  Do you really mean to say that there
>             is no role for small or medium players in this game at all,
>             even if they are only interested in some of the data?  Just
>             why do you say this?  Is it because there is something
>             inherent in the data that requires this processing?  Is it
>             because there is something inherent in the producers of the
>             data that requires this processing?  Is it because there is
>             something inherent in the current consumers of the data that
>             requires this processing?  Is there something inherent in
>             the schema.org specification that requires this processing?
>             Is there something that can be fixed that will allow small
>             or medium players to consume schema.org data?
>
>             My hope here, and maybe it is a forlorn one, is precisely
>             that more
>             consumers can use the information that is being put into
>             web pages using
>             the schema.org <http://schema.org> setup.  Right now it
>             appears that only the major search
>             players know enough about schema.org <http://schema.org>
>             to be able to consume the
>             information.  Of course, I do think that conceptual
>             clarity will help here,
>             but I do realize that in an endeavor like this one there
>             are always going
>             to be problems with underspecified or incorrectly
>             specified or incorrect
>             data.  I don't think, however, that this prevents small
>             and medium players
>             from being consumers of schema.org <http://schema.org> data.
>
>             Knowledge representation, just like databases, has from
>             the beginning been
>             concerned with more than just simple notions of entailment and
>             computational complexity, so I would not say that issues
>             related to data
>             quality and intent are outside of traditional knowledge
>             representation.
>
>             peter
>
>             PS: Do you really mean to say that processing schema.org
>             <http://schema.org> data requires non-deterministic
>             computation?  What would require this?  What sort of
>             non-traditional computation is required?
>
>
>             On Mon, Oct 28, 2013 at 2:46 AM, Martin Hepp
>             <martin.hepp@unibw.de <mailto:martin.hepp@unibw.de>> wrote:
>
>                 Peter:
>                 Note that schema.org <http://schema.org> sits between
>                 millions of owners of data (Web
>                 masters) and large, centralized consumers of Big Data,
>                 who apply hundreds
>                 of heuristics before using the data.
>                 Schema.org is an interface between Webmaster minds,
>                 data structures in
>                 back-end RDBMS driving Web sites, and search engines
>                 (and maybe other types
>                 of consumers).
>
>                 The whole environment heavily relies on
>                 1. probabilistic processing
>                 2. the quality of the understanding between the many
>                 minds (developers)
>                 in this eco-system.
>
>                 Traditional measures of conceptual clarity and
>                 guarantees / deterministic
>                 data processing are of very limited relevance in that
>                 setting.
>
>                 For instance, if you introduce a conceptual
>                 distinction which is very valuable and justified from
>                 an expert's perspective, this may often not lead to
>                 more reliable data processing, since the element may
>                 be used more inconsistently among Web developers (or
>                 the distinctions may not be reliably represented in
>                 the databases driving the sites).
>
>                 Looking at schema.org <http://schema.org> from the
>                 perspective of knowledge representation
>                 in the traditional sense is insufficient, IMO. You
>                 have to look at the data
>                 ecosystem as a whole.
>
>                 Information exposed using schema.org
>                 <http://schema.org> meta-data is what I would call
>                 proto-data, not ready for direct consumption by
>                 deterministic computational
>                 operations.
>
>                 Martin
>
Received on Wednesday, 30 October 2013 12:38:32 UTC