Re: schema.org and proto-data, was Re: schema.org as reconstructed from the human-readable information at schema.org from Guha on 2013-10-30 (public-vocabs@w3.org from October 2013)

From: Guha <guha@google.com>
Date: Wed, 30 Oct 2013 08:35:41 -0700
To: Christian Bizer <chris@bizer.de>
Cc: Peter Patel-Schneider <pfpschneider@gmail.com>, Martin Hepp <martin.hepp@unibw.de>, W3C Vocabularies <public-vocabs@w3.org>
Message-ID: <CAPAGhv9nLemDbBAKw2QoE_XBqDhwiJL8h95MVDmKV8cbZPxi2A@mail.gmail.com>

On Wed, Oct 30, 2013 at 1:57 AM, Christian Bizer <chris@bizer.de> wrote:

> Hi Peter,
>
<text deleted>


> So the major driver for getting more structured data onto the Web is
> mainstream applications consuming it ...
>

Yes, yes, yes! This matters far, far more than anything else. This, and only
this, will move the needle.



> <text deleted>
>
> Cheers,
>
> Chris
>
> *From:* Peter Patel-Schneider [mailto:pfpschneider@gmail.com]
> *Sent:* Wednesday, 30 October 2013 00:23
> *To:* Christian Bizer
> *Cc:* Guha; Martin Hepp; W3C Vocabularies
> *Subject:* Re: schema.org and proto-data, was Re: schema.org as
> reconstructed from the human-readable information at schema.org
>
> The first kind of behaviour described below is, perhaps, a flaw, which can
> be fairly easily fixed by turning the text into a simple item.  The main
> wrinkle is whether the text becomes the name of the item or a description
> of the item.
>
> The second kind of behaviour is not a flaw at all, in that there is good
> data on the pages.  Of course, one might want to do better by analyzing
> the text on the page, but that isn't required.  And why stop
> at the text in the schema.org property values?  As suggested elsewhere,
> another way to improve the situation here is to have better examples so
> that content providers can more easily produce better data.
>
> The biggest issue with determining what works and what doesn't in
> schema.org is that there is no real description of either the data model
> of schema.org or the meaning (informal or formal) of data in this
> model.   Hopefully this will be forthcoming shortly.   (I have a short
> document on what I would hope that the result looks like.)
>
> peter
>
> On Tue, Oct 29, 2013 at 12:57 PM, Christian Bizer <chris@bizer.de> wrote:
>
> Hi Peter,
>
> if you want two concrete examples illustrating the "quality of the
> understanding between the many minds (developers) in this eco-system"
> Martin is talking about, here they are:
>
> 1. http://schema.org/JobPosting clearly states that the values of the
> property "hiringOrganization" should be of type "Organization".
> Nevertheless, 40% of the Schema.org JobPosting instances that we found on
> the Web contained a simple string as the value of this property.
>
> 2. http://schema.org/Product defines over 20 properties that can be used
> to describe products. Nevertheless, out of the 14,000 websites that we found
> to use this class, around 50% only use three properties to describe
> products: Name, Description and Image.
>
> The first flaw can easily be fixed when pre-processing Schema.org data.
> Fixing the second problem requires some more sophisticated NLP techniques
> to guess product features from the free text in the fields Name and
> Description, for instance if you want to do identity resolution in order to
> find out which of the 14,000 websites is offering a specific iPhone.
>
> No magic and no non-deterministic algorithms, just the normal dirty stuff
> that makes data integration successful, with the small, but relevant
> difference that you know because of the markup that you are looking at
> product descriptions and not at arbitrary web pages.
>
> If you (or anybody else) want to find out more about how schema.org is
> used in the wild, you can download 1.4 billion quads of Schema.org data
> originating from 140,000 websites from http://webdatacommons.org/ or take
> a look at
>
>
> https://github.com/lidingpku/iswc-archive/raw/master/paper/iswc-2013/82190017-deployment-of-rdfa-microdata-and-microformats-on-the-web-a-quantitative-analysis.pdf
>
> which gives you some basic statistics about commonly used classes and
> properties.
>
> So no need to be a major search engine to explore this space and get an
> understanding of the kind of knowledge modeling that is understood by
> average webmasters ;-)
>
> Cheers,
>
> Chris
>
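
The pre-processing fix discussed in the quoted thread for the first case (a
plain string where http://schema.org/JobPosting expects an Organization) could
look roughly like the following. This is only a minimal sketch under stated
assumptions: items are represented as Python dicts with a JSON-LD-style
"@type" key, and the names coerce_to_item and normalize_job_posting are
hypothetical, not part of any schema.org tooling. The text_property parameter
reflects the open question raised in the thread of whether the text should
become the name or the description of the new item.

```python
SCHEMA_ORG = "http://schema.org/"


def coerce_to_item(value, expected_type="Organization", text_property="name"):
    """Wrap a bare string in a minimal typed item; leave real items untouched.

    Assumption: items are plain dicts with a JSON-LD-style "@type" key.
    """
    if isinstance(value, str):
        return {"@type": SCHEMA_ORG + expected_type, text_property: value.strip()}
    return value


def normalize_job_posting(posting):
    """Return a shallow copy of `posting` whose hiringOrganization is an item."""
    fixed = dict(posting)
    if "hiringOrganization" in fixed:
        fixed["hiringOrganization"] = coerce_to_item(fixed["hiringOrganization"])
    return fixed


# Hypothetical input: a JobPosting whose hiringOrganization is a plain string.
raw = {
    "@type": SCHEMA_ORG + "JobPosting",
    "title": "Data Engineer",
    "hiringOrganization": " Example Corp ",
}

print(normalize_job_posting(raw)["hiringOrganization"])
```

Passing text_property="description" instead keeps the string as a description
rather than a name; which of the two is appropriate is left open in the
thread, so the sketch does not decide it.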

Received on Wednesday, 30 October 2013 15:36:12 UTC