- From: Dan Brickley <danbri@google.com>
- Date: Tue, 20 May 2014 17:16:04 +0100
- To: W3C Web Schemas Task Force <public-vocabs@w3.org>
On 20 May 2014 11:28, Jarno van Driel <jarnovandriel@gmail.com> wrote:
> Martin, I don't know if I completely agree about going to the product
> forum about this. I think I understand why you might say this, but in my
> thread about the working of WebPage (http://bit.ly/1jyFN0g), Jason Douglas
> said:
>
>> "That said, we probably do need a mechanism for indicating the "primary
>> entity" of a webpage when there is one. Current clients make up their own
>> heuristics for this, but I think it would be better to have an explicit way
>> of stating that."
>
> But this is not the main subject of this thread. Maybe a new thread to
> discuss the "primary entity" or continuation of the subject in the thread I
> already started is a better place.

This is very much in scope for public-vocabs and for schema.org discussions.

There are a few pieces to the puzzle, but the basic idea is simple. Schema.org allows a rich descriptive graph to be embedded in a Web page, which means we often have several entities mentioned; we'd like to know which one is the main one, if any.

Consider the second example in http://schema.org/MusicEvent to give us a concrete focus. It describes a 'MusicEvent' (a concert), whose 'location' is a 'Place'. The event lists multiple associated 'offers', each an 'Offer' with price, date and similar information. The event also lists two 'performer's, each a 'MusicGroup'.

There is nothing *intrinsically* primary about the event, the location, the offers or the musicians. This description is all the richer because it mentions multiple entities. If I were forced to pick one, I'd probably guess that the MusicEvent is the 'main' entity here, because the others feel slightly more like background information. But there's no need to leave this to guesswork. If this markup was on the homepage of the venue, that publisher might well consider the Place to be the main entity.
And if it was on an artist's homepage, they might want to mention the gig (perhaps alongside others) but indicate that the MusicGroup was the main thing.

The above sketches this in terms of embedded structured data, but we can also think of this in terms of capturing a very common pattern in Web content. Often Web pages _do_ have a focus on a single entity. If we add a property like mainEntity, it would give sites a way to make this focus explicit.

PROPOSAL:

1. We already have "about", "The subject matter of the content.", relating a CreativeWork to a Thing. This is enough to do what we need, if we add clarification and examples. I suggest the description should be updated to say: "A Thing that is the primary subject matter of this CreativeWork".

2. If we want a more SKOS-like, bibliographic and nuanced notion of 'subject', I suggest we adopt something like Dublin Core's 'subject' to do that work. (DC has "The topic of the resource." / "Typically, the subject will be represented using keywords, key phrases, or classification codes. Recommended best practice is to use a controlled vocabulary.", from http://purl.org/dc/terms/ )

The distinction: if we want to say "This document is about the entity Sweden" (i.e. the thing that is sameAs http://en.wikipedia.org/wiki/Sweden and http://www.freebase.com/m/0d0vqn), we would use http://schema.org/about, which tells us the main thing that the page is about.

But if we want to say "This document's topic is 'environmental impact of the decline of tin mining in Sweden in the 20th century'", we'd be going beyond "about" and would want a more bibliographic subject description, e.g. using DDC or UDC subject classification codes, SKOS etc. (This is a fictional example; I know nothing about tin mining in Sweden.)

My proposal then is that we break out these two use cases, and target 'about' more explicitly on the 'main entity' use case.

3.
Tweak http://schema.org/mentions

We should note that http://schema.org/mentions is a very similar notion to http://schema.org/about except that it allows multiple different entities to be referenced. Its current description is: "Indicates that the CreativeWork contains a reference to, but is not necessarily about a concept." I suggest rewording this in terms of entities/things, since we don't use 'concept' elsewhere: "Indicates that the CreativeWork contains a reference to, but is not necessarily about some particular thing."

4. http://schema.org/mainContentOfPage

We already have this strange-looking property. It addresses a different use case: it relates a WebPage to a part of that WebPage, "Indicates if this web page element is the main subject of the page." The wording is awkward. It should be something like "Indicates the main element within some Web page.", since the expected type is WebPageElement.

I'm not convinced that the various types we have under WebPageElement ("A web page element, like a table or an image") really work, but the important point here is that they address a different scenario. A WebPageElement is a piece of markup, like SiteNavigationElement, Table, WPAdBlock, WPFooter, WPHeader, WPSideBar. This is a different idea from the problem of finding the main *entity* that all this markup is describing.

HTML already has a <main> element, see https://developer.mozilla.org/en-US/docs/Web/HTML/Element/main

"The HTML <main> element represents the main content of the <body> of a document or application. The main content area consists of content that is directly related to, or expands upon the central topic of a document or the central functionality of an application. This content should be unique to the document, excluding any content that is repeated across a set of documents such as sidebars, navigation links, copyright information, site logos, and search forms (unless, of course, the document's main function is as a search form)."
I believe most of the use cases for mainContentOfPage are better addressed by <main>. However, <main> does not help us pick out a single highlighted entity: the main section of a Web page could still contain Microdata, RDFa or JSON-LD mentioning lots of different entities. It is sometimes useful to know that structured data markup comes from footers or boilerplate rather than the <main> section of a page, and it is probably worth including some examples of this on the schema.org site.

5. Avoiding ratholes

If we can please discuss this without slipping into discussion of http://www.w3.org/2001/tag/group/track/issues/14 I'd be happy. There are places in schema.org usage where we tolerate a URL for a WebPage being used in place of a URL that is more explicitly for the real-world entity itself. For example in http://schema.org/Person we write "<a href="http://www.xyz.edu/students/alicejones.html" itemprop="colleague">Alice Jones</a>". Clarifying the use of 'about' as above could help such pages state which real-world entity they are 'about'. This won't solve every issue around entity disambiguation, but it will improve the basic support we have within schema.org for stating such distinctions when we want to.

(Sorry this was such a long mail...)

Finally, let's also try not to get stuck on syntax issues at this stage. We'll have to find the best patterns in Microdata/RDFa and JSON-LD that we can for this, and it may sometimes be tricky. Here's an attempt at amending the MusicEvent example by adding a WebPage and 'about': https://gist.github.com/anonymous/cf7e24f6378b176aa010 . We might want to discuss a reverse property that could be expressed on the entity rather than the page, for example.

cheers,

Dan
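
As a rough illustration of the WebPage-plus-'about' pattern proposed above, here is a minimal sketch in Python that holds a JSON-LD-style graph as a plain dictionary and looks up the entity the page declares as its subject. This is not the content of the linked gist: the example.com URLs, the entity names, and the helpers main_entity and with_about are all hypothetical, and 'about' is read in the proposed "primary subject" sense rather than as settled schema.org semantics.

```python
import copy

# Hypothetical identifiers, made up for this sketch.
EVENT = "http://example.com/gigs/1#event"
VENUE = "http://example.com/venue#place"
BAND = "http://example.com/band#group"

# A JSON-LD-style graph: one WebPage plus the entities it describes.
page = {
    "@context": "http://schema.org",
    "@graph": [
        {
            "@type": "WebPage",
            "@id": "http://example.com/gigs/1",
            "about": {"@id": EVENT},                      # the main entity
            "mentions": [{"@id": VENUE}, {"@id": BAND}],  # background entities
        },
        {
            "@type": "MusicEvent",
            "@id": EVENT,
            "name": "Example Concert",
            "location": {"@id": VENUE},
            "performer": [{"@id": BAND}],
        },
        {"@type": "Place", "@id": VENUE, "name": "Example Venue"},
        {"@type": "MusicGroup", "@id": BAND, "name": "Example Band"},
    ],
}


def main_entity(doc):
    """Return the node the WebPage points at via 'about', or None."""
    nodes = {node["@id"]: node for node in doc["@graph"]}
    for node in doc["@graph"]:
        if node.get("@type") == "WebPage" and "about" in node:
            return nodes.get(node["about"]["@id"])
    return None


def with_about(doc, entity_id):
    """Return a copy of doc whose WebPage is 'about' a different entity."""
    new_doc = copy.deepcopy(doc)
    for node in new_doc["@graph"]:
        if node.get("@type") == "WebPage":
            node["about"] = {"@id": entity_id}
    return new_doc


print(main_entity(page)["@type"])                     # MusicEvent
print(main_entity(with_about(page, VENUE))["@type"])  # Place
```

The same graph serves both the event page and the venue homepage; only the 'about' link changes, which is the point of stating the main entity explicitly instead of leaving clients to guess.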
Received on Tuesday, 20 May 2014 16:16:33 UTC