Re: schema.org and proto-data, was Re: schema.org as reconstructed from the human-readable information at schema.org from Peter Patel-Schneider on 2013-10-29 (public-vocabs@w3.org from October 2013)

From: Peter Patel-Schneider <pfpschneider@gmail.com>
Date: Tue, 29 Oct 2013 16:22:54 -0700
To: Christian Bizer <chris@bizer.de>
Cc: Guha <guha@google.com>, Martin Hepp <martin.hepp@unibw.de>, W3C Vocabularies <public-vocabs@w3.org>
Message-ID: <CAMpDgVx30Tkf7cBfZEJFVDKGaNg9qpUk-2nAEZR-4LvtgGBkJg@mail.gmail.com>

The first kind of behaviour described below is, perhaps, a flaw, which can
be fairly easily fixed by turning the text into a simple item.  The main
wrinkle is whether the text becomes the name of the item or a description
of the item.
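
As a minimal sketch of that fix, assuming the extracted data has already
been loaded into JSON-like Python dicts (that representation, the helper
name promote_text_to_item, and the sample posting are all made up for
illustration; nothing here is defined by schema.org), the wrinkle shows up
as the text_property argument:

```python
from typing import Any


def promote_text_to_item(value: Any, expected_type: str,
                         text_property: str = "name") -> Any:
    """Wrap a bare string in a simple item of the expected type.

    text_property selects where the text goes ("name" or "description").
    Values that are not strings are returned unchanged.
    """
    if isinstance(value, str):
        return {"@type": expected_type, text_property: value.strip()}
    return value


# Hypothetical JobPosting whose hiringOrganization is a plain string.
posting = {
    "@type": "JobPosting",
    "title": "Data Engineer",
    "hiringOrganization": "Example Corp",
}

posting["hiringOrganization"] = promote_text_to_item(
    posting["hiringOrganization"], "Organization"
)
```

Passing text_property="description" instead gives the other reading of the
text; which of the two is the better default is exactly the open question.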

The second kind of behaviour is not a flaw at all, in that there is good
data on the pages.  Of course, one might want to improve on it by analyzing
the text on the page, but that isn't required.  And why stop
at the text in the schema.org property values?  As suggested elsewhere,
another way to improve the situation here is to have better examples so
that content providers can more easily produce better data.

The biggest issue with determining what works and what doesn't in
schema.org is that there is no real description of either the data model of
schema.org or the meaning (informal or formal) of data in this model.
Hopefully this will be forthcoming shortly.  (I have a short document on
what I would hope the result looks like.)

peter

On Tue, Oct 29, 2013 at 12:57 PM, Christian Bizer <chris@bizer.de> wrote:

> Hi Peter,
>
> if you want two concrete examples illustrating the "quality of the
> understanding between the many minds (developers) in this eco-system"
> Martin is talking about, here they are:
>
> 1. http://schema.org/JobPosting clearly states that the values of the
> property "hiringOrganization" should be of type "Organization".
> Nevertheless 40% of the Schema.org JobPosting instances that we found on
> the Web did contain a simple string as the value of this property.
>
> 2. http://schema.org/Product defines over 20 properties that can be used
> to describe products. Nevertheless out of the 14.000 websites that we found
> to use this class, around 50% only use three properties to describe
> products: Name, Description and Image.
>
> The first flaw can easily be fixed when pre-processing Schema.org data.
> Fixing the second problem requires some more sophisticated NLP techniques
> to guess product features from the free text in the fields Name and
> Description, for instance if you want to do identity resolution in order to
> find out which of the 14.000 websites is offering a specific iPhone.
>
> No magic and no non-deterministic algorithms, just the normal dirty stuff
> that makes data integration successful, with the small, but relevant
> difference that you know because of the markup that you are looking at
> product descriptions and not at arbitrary web pages.
>
> If you (or anybody else) want to find out more about how schema.org is
> used in the wild, you can download 1.4 billion quads of Schema.org data
> originating from 140,000 websites from http://webdatacommons.org/ or take
> a look at
>
> https://github.com/lidingpku/iswc-archive/raw/master/paper/iswc-2013/82190017-deployment-of-rdfa-microdata-and-microformats-on-the-web-a-quantitative-analysis.pdf
>
> which gives you some basic statistics about commonly used classes and
> properties.
>
> So no need to be a major search engine to explore this space and get an
> understanding about the kind of knowledge modeling that is understood by
> average webmasters ;-)
>
> Cheers,
>
> Chris
>

Received on Tuesday, 29 October 2013 23:23:23 UTC