Re: The Vocabulary, Schema.org governance, etc. from martin.hepp@ebusiness-unibw.org on 2014-09-20 (public-vocabs@w3.org from September 2014)

From: <martin.hepp@ebusiness-unibw.org>
Date: Sat, 20 Sep 2014 10:02:23 +0200
To: Guha Guha <guha@google.com>
Cc: W3C Web Schemas Task Force <public-vocabs@w3.org>
Message-Id: <B5C90E3B-F0A9-47A9-9386-F92116C223E5@ebusiness-unibw.org>
Dear Guha:
Thanks a lot for this. A few quick comments:

First, my analysis of obstacles was not meant as a complaint, and it is clear that big search engines cannot reveal much about their data processing.
However, some discussions on this list regarding schema.org extensions get heated, because proposals that are optimized for the special case of search engine information extraction are judged from the perspective of data consumption in the context of the W3C Semantic Web vision and technology stack. A clear example is the property-values proposal. I am pretty confident it fits the data processing capabilities of the sponsors of schema.org. But there was a very heated discussion (maybe 50-100 emails) about it, which I still find unnecessary and which was very time consuming.

Second, even if Google/Yahoo/Microsoft/Yandex cannot reveal details of their data processing, it would help a lot if search engines tried their best to avoid any randomness in how they process schema.org data with respect to

1. competing syntaxes and 
2. alternative patterns in schema.org

in all of their developer tools (e.g. validators) and operational systems.

RDFa, Microdata, and JSON-LD should be supported equally, with no arbitrary differences between them. If you can or want to support only a subset of a syntax, or if a certain feature of a syntax is problematic for some reason, at least document that properly and in a generic fashion.
If schema.org allows multiple ways of expressing the same information, indicate the preferred variant or deprecate the other one in schema.org.
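
To illustrate what "no random differences" between syntaxes could mean in practice, here is a minimal sketch in Python. It is not how any search engine or validator actually works; the product data, the FlatMicrodataParser class, and the two helper functions are all made up for illustration, and the Microdata reader only handles a single flat item with text values. The point is simply that the same facts, written in JSON-LD and in Microdata, should come out identical after extraction.

```python
import json
from html.parser import HTMLParser

# Hypothetical example data: the same product in two syntaxes.
JSON_LD = """
{
  "@context": "http://schema.org",
  "@type": "Product",
  "name": "Example Widget",
  "sku": "W-100"
}
"""

MICRODATA = """
<div itemscope itemtype="http://schema.org/Product">
  <span itemprop="name">Example Widget</span>
  <span itemprop="sku">W-100</span>
</div>
"""


class FlatMicrodataParser(HTMLParser):
    """Illustrative only: collects text-valued itemprops of one non-nested item."""

    def __init__(self):
        super().__init__()
        self.item = {}
        self._prop = None  # itemprop of the most recently opened tag, if any

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if "itemtype" in attrs:
            # Keep only the local name, e.g. "Product".
            self.item["@type"] = attrs["itemtype"].rsplit("/", 1)[-1]
        self._prop = attrs.get("itemprop")

    def handle_data(self, data):
        if self._prop and data.strip():
            self.item[self._prop] = data.strip()
            self._prop = None


def from_json_ld(text):
    """Drop @context so only the type and the properties remain."""
    doc = json.loads(text)
    return {key: value for key, value in doc.items() if key != "@context"}


def from_microdata(text):
    parser = FlatMicrodataParser()
    parser.feed(text)
    return parser.item


# A consumer that treats both syntaxes equally should see the same item.
print(from_json_ld(JSON_LD) == from_microdata(MICRODATA))
```

A parity check of this kind, run over the documented patterns in each syntax, is the sort of thing that would make undocumented differences between tools visible.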

Developers spend a lot of time debugging data structures that are already valid according to the specifications until they also validate, at least in the Google Structured Data Testing Tool.

What we could consider doing is branching schema.org into schema.org CORE and schema.org FULL. schema.org FULL would be an open standard with broad coverage. schema.org CORE would be the subset that search engines officially endorse. I think it would help site owners to know which parts of schema.org are most relevant and which ones are there just for the sake of completeness.

You could counter, of course, that schema.org is schema.org CORE, and the ecosystem of Web vocabularies like GoodRelations, FOAF, SIOC, ... is schema.org FULL in my proposal ;-)


Best 
Martin

On 19 Sep 2014, at 19:25, Guha <guha@google.com> wrote:

> First, a heartfelt thanks for caring and being so passionate about this. I am really happy that we are having an open discussion about these matters. In that spirit, here are a few comments.
> 
> Schema.org does not claim or want to be *the* general web vocabulary. It is simply a vocabulary that a set of groups within four large consumers of structured data on the web agree upon. I helped start schema.org because the fragmentation of vocabularies and confusion amongst webmasters was severely holding back adoption inside Google (Bing, Yahoo, Yandex) and consequently amongst webmasters. We figured that agreeing on the small subset of vocabulary that mattered to us would improve things a lot. It does seem to be working, but we constantly have to keep our focus and not stray into areas that are not of short/medium-term focus for our companies. Indeed, we constantly find ourselves pulling back from more specialized areas. Having tried to build a "the" vocabulary once in Cyc, I am very wary of schema.org going down that road!
> 
> Schema.org is evolving not just in its vocabulary, but also in its governance model. We solicit and accept input from the broad community both on vocabulary and on other issues. In fact, the recent change in our TOS was motivated by issues raised by the community. I fully expect that there will be a number of further changes in the years to come.
> 
> Given the nature of web search and the effort expended by various 'search engine optimizers' in gaming search algorithms, we are unfortunately unable to discuss the details of our data processing. We welcome other consumers of this data and maybe some of them can be more explicit about how they use the data. I am very hopeful that there will be academic research projects that consume schema.org data in new applications. That will pave the path towards a well understood, documented model for consuming this data.
> 
> We encourage the creation of many vocabularies. We would love for there to be other vocabularies that get lots of adoption and as these vocabularies get adoption, the search engines will use them.
> 
> Thank you for being so understanding of our situation.
> 
> Guha
Received on Saturday, 20 September 2014 08:02:46 UTC