- From: Guha <guha@google.com>
- Date: Wed, 30 Oct 2013 08:35:41 -0700
- To: Christian Bizer <chris@bizer.de>
- Cc: Peter Patel-Schneider <pfpschneider@gmail.com>, Martin Hepp <martin.hepp@unibw.de>, W3C Vocabularies <public-vocabs@w3.org>
- Message-ID: <CAPAGhv9nLemDbBAKw2QoE_XBqDhwiJL8h95MVDmKV8cbZPxi2A@mail.gmail.com>
On Wed, Oct 30, 2013 at 1:57 AM, Christian Bizer <chris@bizer.de> wrote:

> Hi Peter,
>
> <text deleted>
>
> So the major driver for getting more structured data onto the Web is
> mainstream applications consuming it ...

Yes Yes Yes! This matters far, far more than anything else. This, and only
this, will move the needle.

> <text deleted>
>
> Cheers,
>
> Chris
>
> From: Peter Patel-Schneider [mailto:pfpschneider@gmail.com]
> Sent: Wednesday, 30 October 2013 00:23
> To: Christian Bizer
> Cc: Guha; Martin Hepp; W3C Vocabularies
> Subject: Re: schema.org and proto-data, was Re: schema.org as
> reconstructed from the human-readable information at schema.org
>
> The first kind of behaviour described below is, perhaps, a flaw, which can
> be fairly easily fixed by turning the text into a simple item. The main
> wrinkle is whether the text becomes the name of the item or a description
> of the item.
>
> The second kind of behaviour is not a flaw at all, in that there is good
> data on the pages. Of course, one might want to do better by analyzing
> the text on the page, but that isn't required. And why stop
> at the text in the schema.org property values? As suggested elsewhere,
> another way to improve the situation here is to have better examples so
> that content providers can more easily produce better data.
>
> The biggest issue with determining what works and what doesn't in
> schema.org is that there is no real description of either the data model
> of schema.org or the meaning (informal or formal) of data in this
> model. Hopefully this will be forthcoming shortly.
> (I have a short document on what I would hope that the result looks like.)
>
> peter
>
> On Tue, Oct 29, 2013 at 12:57 PM, Christian Bizer <chris@bizer.de> wrote:
>
> Hi Peter,
>
> if you want two concrete examples illustrating the "quality of the
> understanding between the many minds (developers) in this eco-system"
> Martin is talking about, here they are:
>
> 1. http://schema.org/JobPosting clearly states that the values of the
>    property "hiringOrganization" should be of type "Organization".
>    Nevertheless, 40% of the Schema.org JobPosting instances that we found
>    on the Web contained a simple string as the value of this property.
>
> 2. http://schema.org/Product defines over 20 properties that can be used
>    to describe products. Nevertheless, out of the 14,000 websites that we
>    found to use this class, around 50% use only three properties to
>    describe products: Name, Description and Image.
>
> The first flaw can easily be fixed when pre-processing Schema.org data.
> Fixing the second problem requires some more sophisticated NLP techniques
> to guess product features from the free text in the fields Name and
> Description, for instance if you want to do identity resolution in order to
> find out which of the 14,000 websites is offering a specific iPhone.
>
> No magic and no non-deterministic algorithms, just the normal dirty stuff
> that makes data integration successful, with the small but relevant
> difference that you know because of the markup that you are looking at
> product descriptions and not at arbitrary web pages.
>
> If you (or anybody else) want to find out more about how schema.org is
> used in the wild, you can download 1.4 billion quads of Schema.org data
> originating from 140,000 websites from http://webdatacommons.org/ or take
> a look at
>
> https://github.com/lidingpku/iswc-archive/raw/master/paper/iswc-2013/82190017-deployment-of-rdfa-microdata-and-microformats-on-the-web-a-quantitative-analysis.pdf
>
> which gives you some basic statistics about commonly used classes and
> properties.
>
> So no need to be a major search engine to explore this space and get an
> understanding of the kind of knowledge modeling that is understood by
> average webmasters ;-)
>
> Cheers,
>
> Chris
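
The pre-processing fix discussed in the thread (turning a plain-string "hiringOrganization" value into a simple Organization item) could look roughly like the sketch below. It is only an illustration, not anything taken from the messages: it assumes the extracted markup has already been parsed into JSON-LD-style Python dicts with an "@type" key, and the function name normalize_hiring_organization and its text_as parameter are hypothetical. The text_as switch reflects the open question raised above of whether the text should become the name or the description of the item.

```python
def normalize_hiring_organization(item, text_as="name"):
    """Return a copy of a JobPosting dict whose hiringOrganization is an item.

    Assumption: items are JSON-LD-style dicts (e.g. {"@type": "JobPosting"}).
    `text_as` chooses whether a plain-text value becomes the "name" or the
    "description" of the generated Organization item.
    """
    if text_as not in ("name", "description"):
        raise ValueError("text_as must be 'name' or 'description'")

    fixed = dict(item)  # shallow copy, so the caller's dict is left untouched
    value = item.get("hiringOrganization")

    # Only plain strings need repair; well-formed items pass through as-is.
    if isinstance(value, str):
        fixed["hiringOrganization"] = {
            "@type": "Organization",
            text_as: value.strip(),
        }
    return fixed


# Hypothetical example input: a posting with a string-valued property.
posting = {
    "@type": "JobPosting",
    "title": "Data Engineer",
    "hiringOrganization": "  Example Corp ",
}
fixed_posting = normalize_hiring_organization(posting)
```

Returning a copy rather than mutating in place keeps the raw extracted data available, which may matter if the name-versus-description choice is revisited later.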
Received on Wednesday, 30 October 2013 15:36:12 UTC