- From: Peter Patel-Schneider <pfpschneider@gmail.com>
- Date: Tue, 29 Oct 2013 11:14:21 -0700
- To: Guha <guha@google.com>
- Cc: Martin Hepp <martin.hepp@unibw.de>, W3C Vocabularies <public-vocabs@w3.org>
- Message-ID: <CAMpDgVyKVBMqZNK4p+-0FTEhtY=6ghwZSE_rw5-ShGYWr4Az6g@mail.gmail.com>
I do hope that there is a reasonable method for extracting good information from web pages marked up with schema.org content. I realize that it may be non-trivial to recover from common situations where the information is not precisely stated. However, I do hope that it is possible to extract good information from such pages without requiring the full resources of a company that is in the business of web search, at least if one is a little careful about which sources one uses. Martin's message appeared to be contrary to this hope, alluding, among other things, to some unspecified non-deterministic processing, so I was wondering why (he thinks that) this is the case.

peter

PS: What subtle distinctions do you think are not going to be made in schema.org markup? There is the use of string identifiers instead of actual items (which I wouldn't call a subtle difference), but what else?

On Tue, Oct 29, 2013 at 9:13 AM, Guha <guha@google.com> wrote:

> Peter,
>
> I don't think Martin implied that there was some kind of mystical, non-deterministic process involved in using schema.org markup, or that it could only be consumed by major search players.
>
> I believe what he was saying (and I concur) is that when you have millions of authors providing data, there are so many errors and misinterpretations (i.e., noise) that consuming it and constructing something meaningful out of it can be non-trivial. Expecting all authors to make the kinds of subtle distinctions that are important to certain academic communities might be too much.
>
> guha
>
> On Mon, Oct 28, 2013 at 8:31 AM, Peter Patel-Schneider <pfpschneider@gmail.com> wrote:
>
>> That's an awfully depressing view of schema.org. Do you really mean to say that there is no role for small or medium players in this game at all, even if they are only interested in some of the data? Just why do you say this? Is it because there is something inherent in the data that requires this processing? Is it because there is something inherent in the producers of the data that requires this processing? Is it because there is something inherent in the current consumers of the data that requires this processing? Is there something inherent in the schema.org specification that requires this processing? Is there something that can be fixed that will allow small or medium players to consume schema.org data?
>>
>> My hope here, and maybe it is a forlorn one, is precisely that more consumers can use the information that is being put into web pages using the schema.org setup. Right now it appears that only the major search players know enough about schema.org to be able to consume the information. Of course, I do think that conceptual clarity will help here, but I realize that in an endeavor like this one there are always going to be problems with underspecified, incorrectly specified, or simply incorrect data. I don't think, however, that this prevents small and medium players from being consumers of schema.org data.
>>
>> Knowledge representation, just like databases, has from the beginning been concerned with more than just simple notions of entailment and computational complexity, so I would not say that issues related to data quality and intent are outside of traditional knowledge representation.
>>
>> peter
>>
>> PS: Do you really mean to say that processing schema.org data requires non-deterministic computation? What would require this?
>> What sorts of non-traditional computation are required?
>>
>> On Mon, Oct 28, 2013 at 2:46 AM, Martin Hepp <martin.hepp@unibw.de> wrote:
>>
>>> Peter:
>>> Note that schema.org sits between millions of owners of data (Webmasters) and large, centralized consumers of Big Data, who apply hundreds of heuristics before using the data. Schema.org is an interface between Webmaster minds, the data structures in the back-end RDBMSs driving Web sites, and search engines (and maybe other types of consumers).
>>>
>>> The whole environment relies heavily on
>>> 1. probabilistic processing, and
>>> 2. the quality of the understanding among the many minds (developers) in this ecosystem.
>>>
>>> Traditional measures of conceptual clarity and guarantees of deterministic data processing are of very limited relevance in that setting.
>>>
>>> For instance, if you introduce a conceptual distinction that is very valuable and justified from an expert's perspective, this may often not lead to more reliable data processing, since the element may be used more inconsistently among Web developers (or the distinction may not be reliably represented in the databases driving the sites).
>>>
>>> Looking at schema.org from the perspective of knowledge representation in the traditional sense is insufficient, IMO. You have to look at the data ecosystem as a whole.
>>>
>>> Information exposed using schema.org meta-data is what I would call proto-data, not ready for direct consumption by deterministic computational operations.
>>>
>>> Martin
Received on Tuesday, 29 October 2013 18:14:48 UTC
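
Peter's PS about "string identifiers instead of actual items" is easy to see in practice: a schema.org property such as author may arrive as a nested item or as a bare string, and a consumer has to normalize both shapes before using the data. Below is a minimal Python sketch of such normalization; the sample records and the normalize_author helper are illustrative assumptions, not anything taken from the thread.

```python
# A minimal sketch, assuming JSON-LD-style records already parsed into
# Python dicts. In real schema.org data an "author" value may be a
# nested item or a bare string; junk values also occur (Guha's "noise").
# The sample records and this helper are hypothetical illustrations.

def normalize_author(value):
    """Coerce an author value into an item-shaped dict, or None."""
    if isinstance(value, str):
        # Bare string identifier: treat it as a name, wrap in a Person item.
        return {"@type": "Person", "name": value.strip()}
    if isinstance(value, dict):
        # Already an actual item; pass it through unchanged.
        return value
    # Unusable noise: drop it rather than guess (one tiny heuristic of
    # the many Martin says large consumers apply).
    return None

records = [
    {"@type": "Book", "name": "Example A", "author": "Jane Doe"},
    {"@type": "Book", "name": "Example B",
     "author": {"@type": "Person", "name": "John Roe"}},
    {"@type": "Book", "name": "Example C", "author": 42},  # junk markup
]

for rec in records:
    print(rec["name"], "->", normalize_author(rec.get("author")))
```

The point of the sketch is only that a small consumer can recover usable data with a handful of such deterministic rules, which is the question at issue in the thread.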