- From: Guha <guha@google.com>
- Date: Tue, 29 Oct 2013 13:34:41 -0700
- To: Christian Bizer <chris@bizer.de>
- Cc: Peter Patel-Schneider <pfpschneider@gmail.com>, Martin Hepp <martin.hepp@unibw.de>, W3C Vocabularies <public-vocabs@w3.org>
- Message-ID: <CAPAGhv87SbR4_K2=cMyXMX6MZcdCjJ4VD4xgmG185A=ty0yKvg@mail.gmail.com>
I could not have said it better ...

guha

On Tue, Oct 29, 2013 at 12:57 PM, Christian Bizer <chris@bizer.de> wrote:

> Hi Peter,
>
> if you want two concrete examples illustrating the "quality of the
> understanding between the many minds (developers) in this eco-system"
> Martin is talking about, here they are:
>
> 1. http://schema.org/JobPosting clearly states that the values of the
> property "hiringOrganization" should be of type "Organization".
> Nevertheless, 40% of the Schema.org JobPosting instances that we found
> on the Web contained a simple string as the value of this property.
>
> 2. http://schema.org/Product defines over 20 properties that can be
> used to describe products. Nevertheless, of the 14,000 websites that we
> found to use this class, around 50% use only three properties to
> describe products: Name, Description, and Image.
>
> The first flaw can easily be fixed when pre-processing Schema.org data.
> Fixing the second problem requires more sophisticated NLP techniques to
> guess product features from the free text in the Name and Description
> fields, for instance if you want to do identity resolution in order to
> find out which of the 14,000 websites is offering a specific iPhone.
>
> No magic and no non-deterministic algorithms, just the normal dirty
> stuff that makes data integration successful, with the small but
> relevant difference that, because of the markup, you know you are
> looking at product descriptions and not at arbitrary web pages.
>
> If you (or anybody else) want to find out more about how schema.org is
> used in the wild, you can download 1.4 billion quads of Schema.org data
> originating from 140,000 websites from http://webdatacommons.org/ or
> take a look at
>
> https://github.com/lidingpku/iswc-archive/raw/master/paper/iswc-2013/82190017-deployment-of-rdfa-microdata-and-microformats-on-the-web-a-quantitative-analysis.pdf
>
> which gives you some basic statistics about commonly used classes and
> properties.
>
> So there is no need to be a major search engine to explore this space
> and get an understanding of the kind of knowledge modeling that is
> understood by average webmasters ;-)
>
> Cheers,
>
> Chris
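The string-to-Organization repair Chris describes in his first point could look something like the following minimal sketch, assuming the crawled items have already been parsed into JSON-LD-style dictionaries; the function name and the sample data are illustrative, not from the thread.

```python
def fix_hiring_organization(job_posting: dict) -> dict:
    """Wrap a bare-string hiringOrganization value in an Organization node."""
    value = job_posting.get("hiringOrganization")
    if isinstance(value, str):
        # Per Chris's numbers, roughly 40% of crawled JobPosting items
        # carry a plain string here instead of an Organization.
        job_posting["hiringOrganization"] = {
            "@type": "Organization",
            "name": value,
        }
    return job_posting

# Hypothetical example of the flawed markup being repaired:
posting = {"@type": "JobPosting", "title": "Engineer",
           "hiringOrganization": "Acme Corp"}
fix_hiring_organization(posting)
# hiringOrganization is now {"@type": "Organization", "name": "Acme Corp"}
```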
> Guha <guha@google.com> wrote:
>
>> Peter,
>>
>> I don't think Martin implied that there was some kind of mystical,
>> non-deterministic process involved in using schema.org markup or that
>> it could only be consumed by major search players.
>>
>> I believe what he was saying (and I concur) is that when you have
>> millions of authors providing data, there are so many errors and
>> misinterpretations (i.e., noise) that consuming it and constructing
>> something meaningful out of it can be non-trivial. Expecting all
>> authors to make the kind of subtle (but important to certain academic
>> communities) distinctions might be too much.
>>
>> guha
>>
>> On Mon, Oct 28, 2013 at 8:31 AM, Peter Patel-Schneider
>> <pfpschneider@gmail.com> wrote:
>>
>>> That's an awfully depressing view of schema.org. Do you really mean
>>> to say that there is no role for small or medium players in this game
>>> at all, even if they are only interested in some of the data? Just
>>> why do you say this? Is it because there is something inherent in the
>>> data that requires this processing? Is it because there is something
>>> inherent in the producers of the data that requires this processing?
>>> Is it because there is something inherent in the current consumers of
>>> the data that requires this processing? Is there something inherent
>>> in the schema.org specification that requires this processing? Is
>>> there something that can be fixed that will allow small or medium
>>> players to consume schema.org data?
>>>
>>> My hope here, and maybe it is a forlorn one, is precisely that more
>>> consumers can use the information that is being put into web pages
>>> using the schema.org setup. Right now it appears that only the major
>>> search players know enough about schema.org to be able to consume the
>>> information. Of course, I do think that conceptual clarity will help
>>> here, but I realize that in an endeavor like this one there are
>>> always going to be problems with underspecified, incorrectly
>>> specified, or incorrect data. I don't think, however, that this
>>> prevents small and medium players from being consumers of schema.org
>>> data.
>>>
>>> Knowledge representation, just like databases, has from the beginning
>>> been concerned with more than just simple notions of entailment and
>>> computational complexity, so I would not say that issues related to
>>> data quality and intent are outside of traditional knowledge
>>> representation.
>>>
>>> peter
>>>
>>> PS: Do you really mean to say that processing schema.org data
>>> requires non-deterministic computation? What would require this? What
>>> sorts of non-traditional computation are required?
>>>
>>> On Mon, Oct 28, 2013 at 2:46 AM, Martin Hepp <martin.hepp@unibw.de>
>>> wrote:
>>>
>>>> Peter:
>>>> Note that schema.org sits between millions of owners of data
>>>> (webmasters) and large, centralized consumers of Big Data, who apply
>>>> hundreds of heuristics before using the data. Schema.org is an
>>>> interface between webmaster minds, data structures in back-end RDBMSs
>>>> driving Web sites, and search engines (and maybe other types of
>>>> consumers).
>>>>
>>>> The whole environment heavily relies on
>>>> 1. probabilistic processing
>>>> 2. the quality of the understanding between the many minds
>>>> (developers) in this eco-system.
>>>>
>>>> Traditional measures of conceptual clarity and guarantees /
>>>> deterministic data processing are of very limited relevance in that
>>>> setting.
>>>>
>>>> For instance, if you introduce a conceptual distinction which is
>>>> very valuable and justified from an expert's perspective, this may
>>>> often not lead to more reliable data processing, since the element
>>>> may be used more inconsistently among Web developers (or the
>>>> distinctions may not be reliably represented in the databases
>>>> driving the sites).
>>>>
>>>> Looking at schema.org from the perspective of knowledge
>>>> representation in the traditional sense is insufficient, IMO. You
>>>> have to look at the data ecosystem as a whole.
>>>>
>>>> Information exposed using schema.org meta-data is what I would call
>>>> proto-data, not ready for direct consumption by deterministic
>>>> computational operations.
>>>>
>>>> Martin
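For anyone wanting to explore the WebDataCommons dump along the lines Chris suggests, a rough sketch of tallying which schema.org properties Product items actually use might look like the following. The file name is a placeholder for a downloaded sample; the naive whitespace split is good enough for the IRI/blank-node subject and predicate positions of N-Quads but not for full literal parsing, so a proper parser (e.g., rdflib with format="nquads") would be safer on real data.

```python
from collections import Counter

RDF_TYPE = "<http://www.w3.org/1999/02/22-rdf-syntax-ns#type>"
SCHEMA = "http://schema.org/"

def count_product_properties(path: str) -> Counter:
    """Tally schema.org predicates used on subjects typed schema:Product."""
    product_subjects = set()
    counts = Counter()

    # Pass 1: collect subjects declared to be schema:Product.
    # (Fine for a sample file; the full 1.4B-quad dump would need
    # a streaming or disk-backed approach.)
    with open(path, encoding="utf-8", errors="replace") as f:
        for line in f:
            parts = line.split(" ", 3)
            if (len(parts) >= 3 and parts[1] == RDF_TYPE
                    and parts[2] == f"<{SCHEMA}Product>"):
                product_subjects.add(parts[0])

    # Pass 2: count schema.org predicates on those subjects.
    with open(path, encoding="utf-8", errors="replace") as f:
        for line in f:
            parts = line.split(" ", 3)
            if (len(parts) >= 2 and parts[0] in product_subjects
                    and parts[1].startswith(f"<{SCHEMA}")):
                counts[parts[1].strip("<>")] += 1

    return counts

if __name__ == "__main__":
    # Hypothetical local sample of the WebDataCommons extraction.
    top = count_product_properties("schemaorg-sample.nq").most_common(10)
    for prop, n in top:
        print(f"{n:8d}  {prop}")
```

If Chris's figures hold, the output would be dominated by name, description, and image, which is exactly the sparsity his second example describes.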
Received on Tuesday, 29 October 2013 20:35:11 UTC