- From: Adrian Giurca <giurca@tu-cottbus.de>
- Date: Wed, 30 Oct 2013 13:37:35 +0100
- To: Guha <guha@google.com>, Christian Bizer <chris@bizer.de>
- CC: Peter Patel-Schneider <pfpschneider@gmail.com>, Martin Hepp <martin.hepp@unibw.de>, W3C Vocabularies <public-vocabs@w3.org>
- Message-ID: <5270FD8F.5080707@tu-cottbus.de>
Dear Mr. Guha, Dear Mr. Bizer,

It is no surprise that people do not always use the Schema.org
properties properly. Sometimes it is a matter of learning, sometimes
it is a matter of flawed design. However, I wonder what an expert
would advise me to do to improve the markup below:

    <div itemscope itemtype="http://schema.org/JobPosting">
      ...
      posted by
      <span itemprop="hiringOrganization">
        <!-- <span itemprop="name"> -->
        John Doe and Associates
        <!-- </span> -->
      </span>
    </div>

40% of the web sites crawled by webdatacommons used a string. But what
about the other 60%?

Regards,
Adrian Giurca

On 10/29/2013 9:34 PM, Guha wrote:
> I could not have said it better ...
>
> guha
>
>
> On Tue, Oct 29, 2013 at 12:57 PM, Christian Bizer <chris@bizer.de> wrote:
>
>     Hi Peter,
>
>     if you want two concrete examples illustrating the "quality of the
>     understanding between the many minds (developers) in this
>     eco-system" Martin is talking about, here they are:
>
>     1. http://schema.org/JobPosting clearly states that the values of
>     the property "hiringOrganization" should be of type
>     "Organization". Nevertheless, 40% of the Schema.org JobPosting
>     instances that we found on the Web contained a simple string as
>     the value of this property.
>
>     2. http://schema.org/Product defines over 20 properties that can
>     be used to describe products. Nevertheless, out of the 14,000
>     websites that we found to use this class, around 50% use only
>     three properties to describe products: Name, Description and Image.
>
>     The first flaw can easily be fixed when pre-processing Schema.org
>     data. Fixing the second problem requires some more sophisticated
>     NLP techniques to guess product features from the free text in the
>     fields Name and Description, for instance if you want to do
>     identity resolution in order to find out which of the 14,000
>     websites is offering a specific iPhone.
>
>     No magic and no non-deterministic algorithms, just the normal
>     dirty stuff that makes data integration successful, with the
>     small but relevant difference that you know because of the markup
>     that you are looking at product descriptions and not at arbitrary
>     web pages.
>
>     If you (or anybody else) want to find out more about how
>     schema.org is used in the wild, you can download 1.4 billion quads
>     of Schema.org data originating from 140,000 websites from
>     http://webdatacommons.org/ or take a look at
>
>     https://github.com/lidingpku/iswc-archive/raw/master/paper/iswc-2013/82190017-deployment-of-rdfa-microdata-and-microformats-on-the-web-a-quantitative-analysis.pdf
>
>     which gives you some basic statistics about commonly used classes
>     and properties.
>
>     So no need to be a major search engine to explore this space and
>     get an understanding of the kind of knowledge modeling that is
>     understood by average webmasters ;-)
>
>     Cheers,
>
>     Chris
>
>
>     Guha <guha@google.com> wrote:
>
>         Peter,
>
>         I don't think Martin implied that there was some kind of
>         mystical, non-deterministic process involved in using
>         schema.org markup or that it could only be consumed by major
>         search players.
>
>         I believe what he was saying (and I concur) is that when you
>         have millions of authors providing data, there are so many
>         errors and misinterpretations (i.e., noise) that consuming it
>         and constructing something meaningful out of it could be
>         non-trivial. Expecting all authors to make the kind of subtle
>         (but important to certain academic communities) distinctions
>         might be too much.
>
>         guha
>
>
>         On Mon, Oct 28, 2013 at 8:31 AM, Peter Patel-Schneider
>         <pfpschneider@gmail.com> wrote:
>
>             That's an awfully depressing view of schema.org.
>             Do you really mean to say that there is no role for small
>             or medium players in this game at all, even if they are
>             only interested in some of the data? Just why do you say
>             this? Is it because there is something inherent in the
>             data that requires this processing? Is it because there is
>             something inherent in the producers of the data that
>             requires this processing? Is it because there is something
>             inherent in the current consumers of the data that
>             requires this processing? Is there something inherent in
>             the schema.org specification that requires this
>             processing? Is there something that can be fixed that will
>             allow small or medium players to consume schema.org data?
>
>             My hope here, and maybe it is a forlorn one, is precisely
>             that more consumers can use the information that is being
>             put into web pages using the schema.org setup. Right now
>             it appears that only the major search players know enough
>             about schema.org to be able to consume the information. Of
>             course, I do think that conceptual clarity will help here,
>             but I do realize that in an endeavor like this one there
>             are always going to be problems with underspecified or
>             incorrectly specified or incorrect data. I don't think,
>             however, that this prevents small and medium players from
>             being consumers of schema.org data.
>
>             Knowledge representation, just like databases, has from
>             the beginning been concerned with more than just simple
>             notions of entailment and computational complexity, so I
>             would not say that issues related to data quality and
>             intent are outside of traditional knowledge
>             representation.
>
>             peter
>
>             PS: Do you really mean to say that processing schema.org
>             data requires non-deterministic computation? What would
>             require this? What sorts of non-traditional computation
>             are required?
>
>
>             On Mon, Oct 28, 2013 at 2:46 AM, Martin Hepp
>             <martin.hepp@unibw.de> wrote:
>
>                 Peter:
>                 Note that schema.org sits between millions of owners
>                 of data (Webmasters) and large, centralized consumers
>                 of Big Data, who apply hundreds of heuristics before
>                 using the data.
>                 Schema.org is an interface between Webmaster minds,
>                 data structures in back-end RDBMS driving Web sites,
>                 and search engines (and maybe other types of
>                 consumers).
>
>                 The whole environment heavily relies on
>                 1. probabilistic processing
>                 2. the quality of the understanding between the many
>                 minds (developers) in this eco-system.
>
>                 Traditional measures of conceptual clarity and
>                 guarantees / deterministic data processing are of very
>                 limited relevance in that setting.
>
>                 For instance, if you introduce a conceptual
>                 distinction which is very valuable and justified from
>                 an expert's perspective, this may often not lead to
>                 more reliable data processing, since the element may
>                 be used more inconsistently among Web developers (or
>                 the distinctions may not be reliably represented in
>                 the databases driving the sites).
>
>                 Looking at schema.org from the perspective of
>                 knowledge representation in the traditional sense is
>                 insufficient, IMO. You have to look at the data
>                 ecosystem as a whole.
>
>                 Information exposed using schema.org meta-data is what
>                 I would call proto-data, not ready for direct
>                 consumption by deterministic computational operations.
>
>                 Martin
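
As an illustration of the pre-processing fix Chris mentions for string-valued hiringOrganization, here is a minimal Python sketch. It assumes the extracted microdata item is already available as a plain dict; the dict layout, the "type" and "name" keys, and the function name normalize_hiring_organization are hypothetical and not taken from any particular extraction library.

```python
from typing import Any, Dict

# Type URL that schema.org expects for the value of hiringOrganization.
ORG_TYPE = "http://schema.org/Organization"


def normalize_hiring_organization(item: Dict[str, Any]) -> Dict[str, Any]:
    """Return a shallow copy of `item` in which a plain-string
    hiringOrganization is wrapped as an Organization with a name.

    Assumption: `item` is a dict of property name -> value, as some
    hypothetical microdata extractor might produce.
    """
    fixed = dict(item)  # leave the caller's dict untouched
    value = fixed.get("hiringOrganization")
    if isinstance(value, str):
        # Collapse whitespace left over from the HTML source.
        name = " ".join(value.split())
        fixed["hiringOrganization"] = {"type": ORG_TYPE, "name": name}
    return fixed


posting = {
    "type": "http://schema.org/JobPosting",
    "hiringOrganization": "  John Doe and\n  Associates ",
}
print(normalize_hiring_organization(posting)["hiringOrganization"])
```

Values that are already structured (dicts) pass through unchanged, so the same function can be applied to every JobPosting item regardless of which markup style the site used.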
Received on Wednesday, 30 October 2013 12:38:32 UTC