RE: proposal by Encyclopaedia Britannica from Cranmer, Paul on 2013-11-19 (public-vocabs@w3.org from November 2013)

From: Cranmer, Paul <PCranmer@eb.com>
Date: Tue, 19 Nov 2013 11:19:39 +0000
To: Martin Hepp <martin.hepp@ebusiness-unibw.org>
CC: Guha <guha@google.com>, "Hetrea, Carmen" <CHetrea@eb.com>, list <public-vocabs@w3.org>
Message-ID: <c6d8943719f04d78b4061aa41723caba@MAIL03.britannica.net>
Hi Martin,

I agree with everything that you have said, and I understand the limitations and difficulties involved in developing vocabularies that are versatile and appropriate for everyone.  The editors of Encyclopaedia Britannica have been seeking since 1768 to present an overview of knowledge that makes available the most significant and useful general information, which, of course, evolves constantly.  Therefore, we know how difficult and elusive an effective strategy is.  My concern with schema.org was not to criticize what has been done; its value is self-evident.  However, when we tried to apply it to our content, we realized that it was a poor fit and that we need a vocabulary geared toward general knowledge content in addition to what is already available.  It is simply our hope that by proposing the schema that we have, we can begin working with others to establish a workable general knowledge vocabulary for those of us who provide that type of content.  Please understand it in these terms and help us begin a productive discussion toward this end by adding your knowledge and insight to the mix.  Until now we have not had a forum for this kind of exchange.

Again I truly appreciate your taking the time to interact with me.

Paul


-----Original Message-----
From: Martin Hepp [mailto:martin.hepp@ebusiness-unibw.org] 
Sent: Monday, November 18, 2013 3:36 PM
To: Cranmer, Paul
Cc: Guha; Hetrea, Carmen; list
Subject: Re: proposal by Encyclopaedia Britannica

Dear Paul:
Thanks for your reply. First, I did not intend to discourage the general idea of improving schema.org by clean conceptual choices. 

My main point is that Web-scale vocabularies are interfaces that sit between

1. the millions of human minds of Web developers, actual Web sites, back-end databases that feed dynamic Web sites (including the people who created the schemas of those databases, and the people who ever added instance data), other actors and components, and
2. search engines and other computational services that consume the data,

and that creating such vocabularies is a non-trivial challenge that must consider many different dimensions.

In particular, there is a trade-off between 

- conceptual clarity (and beauty) on one hand (which improves access to and reuse of information, as we likely all agree on) and
- the ease, ability to, and reliability of using schema.org for the above mentioned audiences.

There are likely many factors influencing the quality of that interface besides pure conceptual clarity (in the sense of lasting, generically applicable distinctions / types), e.g.
- grounding in language (e.g. catchy terms that are easy to understand by native and non-native speakers) and
- a matching granularity to the data structures of back-end databases in Web sites (so that site owners can populate the schema from their existing instance data), and many more.

I think it is important to stress that having these many dimensions of building a good Web vocabulary in mind is not the same as making "quick-and-dirty" pragmatic choices. 

When schema.org does make compromises in comparison to existing elaborated conceptual structures (e.g. to upper ontologies like the BFO, http://www.ifomis.org/bfo/) or in comparison to elaborated distinctions from information management and information science, this is not necessarily because the people contributing have a lack of understanding of knowledge engineering or are acting in a short-sighted, solely business-driven fashion. 

However, people have tried for decades to "create ...[an] overview of knowledge before we start creating structures to manage some of its parts", and that could go on for many additional decades.

As far as I see it, the aim of schema.org is not to create the single, valid knowledge model for the universe, but first and foremost to provide a vocabulary for improving information extraction at the bottleneck between site content and automated information services - not limited to search engines, but currently with a strong focus on them. If, in passing, we can create a widely applicable conceptual model for other purposes, that would of course be a welcome additional benefit. But we should not overload the project with ambition. The task at hand is already difficult enough.

This is solely my personal view on the state of things in schema.org, of course.

Best

Martin



On Nov 18, 2013, at 9:25 PM, Cranmer, Paul wrote:

> Hi Martin,
> 
> I understand your points.  And I accept that your analysis of the audience for schema.org is not quite the same as the one I am addressing.  I am involved in information management from an information science viewpoint, and the things that I have learned over the years are reflected in the proposal that we have made.  
> 
> Regarding the use of the term metaphysical in contrast to physical, it was used for convenience to emphasize the distinction between what is a concept and what is an entity.  The fact that entities may not always be physical in the sense you note, is, of course, true.  So let me rephrase it to define an entity as something that exists beyond the mere conceptual.  In essence entities are the manifestations of concepts, though the concept may not have preceded the entity.  While I think my point was clear, I recognize the inadequacy of my phrasing.
> 
> I fully recognize that building a reliable machine-readable semantic representation of knowledge is not a simple challenge.  What is important to me is that we address it as the common denominator that makes the parts fit together whenever we develop semantic strategies. 
> 
> It appears that most of the interest has not been focused on general knowledge, as its application does not address directly or immediately the needs of Web sites to promote their content.  This is understandable.  However, the discussion should not be so focused on the parts that it does not also address them in the context of the whole.  Otherwise choices are made that will eventually need to be undone because they do not work when they must become part of the larger picture.  If we always work from the perspective of the whole, the choices we make concerning the parts become coherent and reusable as the picture broadens.
> 
> So, though I recognize the validity of what you say as it applies to immediate needs, I cannot refrain from hoping and proposing that we first create our overview of knowledge before we start creating structures to manage some of its parts.  I am not saying that no one is doing this, just that projects like schema.org reveal that not everyone in significant positions appears to be doing it.
> 
> I have no objection to separating the proposal into one that addresses extensions to schema.org and another that addresses how search engines can better use that information.  As you understood, the proposal was initially one to extend schema.org by creating a coexistent structure for general knowledge as opposed to focusing only on business and current happenings.  However, we came to propose our thoughts to others as we looked around and came to believe that what we have learned may be of value to the community at large.  We do not pretend to have the technical experience that many of those we are addressing have.  But we do have decades of experience with what it takes to manage information from a semantic perspective, and we sincerely wish to share that expertise with the Web community that we are now part of.  So please take what we have to say as something that may contain useful observations rather than a finished work that should be accepted.
> 
> Thanks again for taking the time to reply to me.
> 
> Paul
> 
> 
> -----Original Message-----
> From: Martin Hepp [mailto:martin.hepp@ebusiness-unibw.org]
> Sent: Monday, November 18, 2013 11:54 AM
> To: Cranmer, Paul
> Cc: Guha; Hetrea, Carmen; list
> Subject: Re: proposal by Encyclopaedia Britannica
> 
> Paul:
> Thanks for your reply. However, I think that there are a few misunderstandings.
> 
> First, schema.org is a Web vocabulary, not an ontology in the academic sense of the term. The nature of shared data structures at Web scale is only partly understood as of today, but we already know that it is not as simple a challenge as to build a "reliable machine-readable semantic representation of knowledge for the Semantic Web" as you state.
> 
> The fact that you need to use the term "metaphysical" to explain your proposal already indicates that there is a misfit between the audience you have in mind for your proposal and the audience who will actually use schema.org.
> 
> By the way, I disagree with the notion that entities are necessarily physical. Entities in the context of database systems can also be abstract things, at least since the initial work by Peter Chen on Entity-Relationship Modeling in 1976 [1].
> 
>> The more precision you are able to achieve in your system, the more 
>> flexibility your system will acquire in using the data it manages, 
>> since you can count on getting accurate results.]
> 
> As we all know, the Web is a complex socio-technical ecosystem. Precision of the vocabulary alone does not necessarily have any positive impact on the quality of the data. If people do not understand the specification easily and reliably, they may avoid certain conceptual elements (-> less data) or use the elements less reliably (-> lower data quality). This issue has been discussed in this forum at length recently. For more information on my take on this, see http://vimeo.com/51152934.
> 
> 
>> [The proposal was initially conceived as an expansion of schema.org 
>> and its application by Google, Bing, and Yahoo.  Since it became 
>> clear that the search engines would only accept one class for a given 
>> resource, we felt it is important to bring up the fact that this does 
>> not meet the need since many things by their very nature belong to 
>> more than one class.  We believe that these multiple class 
>> designations should be recognized by the engines and incorporated 
>> into their semantic representation of the Webpage content.]
> Without wanting to offend you: When you suggest more subtle conceptual 
> distinctions, then it would be nice for you to separate the two
> proposals you have at the conceptual level into
> 
> 1. A proposal to extend the conceptual model of schema.org and
> 2. A proposal to search engines to extend the consumption of Web data based on schema.org.
> 
> Since the schema.org sponsors do not, in general, discuss the usage of schema.org data in this forum (but rather in their individual, product-related forums), mainly the first issue is relevant here.
> 
> Best
> 
> Martin
> 
> 
> [1] http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.123.1085
> 
> On Nov 18, 2013, at 6:32 PM, Cranmer, Paul wrote:
> 
>> Martin,
>> 
>> Thanks for your feedback.  Please see my comments below in brackets.
>> 
>> -----Original Message-----
>> From: Martin Hepp [mailto:martin.hepp@ebusiness-unibw.org]
>> Sent: Monday, November 18, 2013 10:38 AM
>> To: Guha
>> Cc: Hetrea, Carmen; list; Cranmer, Paul
>> Subject: Re: proposal by Encyclopaedia Britannica
>> 
>> Carmen:
>> 
>> See below for a very quick feedback on part of your proposal:
>> 
>> On Nov 18, 2013, at 3:50 PM, Guha wrote:
>> 
>>> On Tue, Nov 12, 2013 at 9:45 AM, Hetrea, Carmen <CHetrea@eb.com> wrote:
>> ...
>>> SCHEMA.ORG ONTOLOGY EXPANSION - a PROPOSAL by Encyclopaedia 
>>> Britannica
>>> 
>>> 
>> ...
>>> 
>>> Proposal:
>>> 
>>> 1.    We propose top Class changes
>>> Top Class: SchemaOrgClass
>>> Two major Subclass divisions of information: Concept and Entity.
>> 
>> 
>> I agree that this makes sense from a knowledge representation perspective, but I have some doubts about whether this distinction actually improves the vocabulary for typical Webmasters.
>> In general, philosophically-grounded top-level distinctions can be difficult to apply by practitioners, which means that in the end, the quality of the data deteriorates.
>> For a *Web* vocabulary, ambiguous classes need not be a disadvantage since they can be reliably applied by publishers of data who are unable to apply the conceptual distinction reliably.
>> 
>> [First of all, the schema.org ontology is conceived to organize entities and has no place for organizing concepts. Since our primary concern is a consistent and reliable machine-readable semantic representation of knowledge for the Semantic Web, we arrived at our proposal primarily because we did not see any semantic ontologies that were concerned with representing general-knowledge content. 
>> 
>> The two top classes are essential for this purpose, because of the fundamental differences between concepts, which are metaphysical, and entities, which are physical.  Concepts can be related in semantic hierarchies that provide a useful skeleton for organizing information in the traditional vertically related broader concept, narrower concept, and the horizontally related concepts as has been addressed in SKOS.  We have added to this scenario also component relationships, i.e. concepts that effectively complete a picture though they are not semantically narrower concepts.
>> 
>> Entities, however, though definable by concepts, relate to each other 
>> multi-dimensionally both semantically and syntactically.  This 
>> distinction is simple and basic and can enable practitioners to 
>> manage their information more effectively.  While I understand your 
>> point that ambiguity appears to allow for a more flexible vocabulary, 
>> it risks undermining the very premise of semantics.  We have learned 
>> that ambiguous vocabulary ultimately limits the usefulness of the 
>> data it defines because it is inaccurate.  The more precision you are 
>> able to achieve in your system, the more flexibility your system will 
>> acquire in using the data it manages, since you can count on getting 
>> accurate results.]
>> 
>> 
>> 
>>> 2.    We propose to allow multiple class designation for a given Webpage
>>> This will allow content providers to classify both the subject matter and the delivery format individually as well as accommodate the fact that many general knowledge subjects can belong to more than one class. 
>>> 
>>>       For example, if a content provider offers an Article on an Event, both classes should be admissible and recognized by the semantic engine that reads the markup.  In the case of a Video showing Angkor Wat, a temple complex in Angkor, Cambodia, the subject is both a Place and a ManMadeObject (in this case a Temple) and the delivery format is Video.
>>> 
>> At the level of the vocabulary, it is already possible to expose information about multiple entities, so I may not understand your proposal correctly.
>> If you are referring to the problem that getting Rich Snippets is difficult if you mark-up multiple types of entities, then this is a different issue.
>> While this forum is not about the actual usage of schema.org by Google or any other search engine, and while I clearly do not claim to speak on behalf of any single search engine, the problem is that Rich Snippets or any other technique for summarizing page content for previews in the organic search results have to condense a whole page *to a single snippet*.
>> The currently dominating approach is to implement snippet types organized around a single, dominating type of object - e.g. products or events.
>> You may have seen that e.g. Google is sometimes already showing 
>> snippet types for pages that contain multiple objects, like the one 
>> attached (this one is not based on schema.org markup but other data 
>> sources, though),
>> 
>> [The proposal was initially conceived as an expansion of schema.org 
>> and its application by Google, Bing, and Yahoo.  Since it became 
>> clear that the search engines would only accept one class for a given 
>> resource, we felt it is important to bring up the fact that this does 
>> not meet the need since many things by their very nature belong to 
>> more than one class.  We believe that these multiple class 
>> designations should be recognized by the engines and incorporated 
>> into their semantic representation of the Webpage content.]
>> 
>> Sincerely,
>> 
>> Paul
>> 
> 
> --------------------------------------------------------
> martin hepp
> e-business & web science research group universitaet der bundeswehr 
> muenchen
> 
> e-mail:  hepp@ebusiness-unibw.org
> phone:   +49-(0)89-6004-4217
> fax:     +49-(0)89-6004-4620
> www:     http://www.unibw.de/ebusiness/ (group)
>         http://www.heppnetz.de/ (personal)
> skype:   mfhepp 
> twitter: mfhepp
> 
> Check out GoodRelations for E-Commerce on the Web of Linked Data!
> =================================================================
> * Project Main Page: http://purl.org/goodrelations/
> 
> 
> 

Received on Tuesday, 19 November 2013 11:20:10 UTC