Re: Indicating main entity / primaryTopic - proposal to use 'schema.org/about'

Dan:
Picking up an old thread with a new, related request: I think it would make the lives of ALL Web developers a LOT easier if the sponsors of schema.org could spend some effort to make sure that their parsers and consuming components treat all variants of valid schema.org equally. That is, if you properly follow the spec, you should be able to assume that the search engines understand your information, provided they process the respective type of information at all. My experience is that you need a lot of insider knowledge to design schema.org markup in a way that maximizes the understanding by search engines, e.g. with regard to

- syntaxes (RDFa, Microdata, JSON-LD, ...) and
- variants and alternatives in schema.org.

I know that this is difficult to implement at the level of four big corporations with hundreds or thousands of software components. Still, it would help to define schema.org-based test-cases that are used for automated testing. I once started something similar for GoodRelations at

    http://www.heppnetz.de/rdfa4google/testcases.html

But I think we need something like that for each major schema.org type in all relevant syntaxes, and for the more complex types, we will need variants (e.g. pricing for Offer).
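
To illustrate what such a test case might look like, here is a minimal sketch in Python. It only covers JSON-LD and the Offer pricing example; the helper names extract_price and check_variants_agree and the sample values are made up for illustration and do not reflect any search engine's actual parser.

```python
import json

# Two hypothetical JSON-LD variants that describe the same Offer.
VARIANT_DIRECT = json.loads("""
{
  "@context": "http://schema.org",
  "@type": "Offer",
  "price": "13.00",
  "priceCurrency": "USD"
}
""")

VARIANT_PRICE_SPEC = json.loads("""
{
  "@context": "http://schema.org",
  "@type": "Offer",
  "priceSpecification": {
    "@type": "PriceSpecification",
    "price": "13.00",
    "priceCurrency": "USD"
  }
}
""")


def extract_price(offer):
    """Return (price, currency) from either variant, or None if no price is given."""
    # Prefer the nested priceSpecification; fall back to the Offer itself.
    spec = offer.get("priceSpecification", offer)
    if "price" not in spec:
        return None
    return (str(spec["price"]), spec.get("priceCurrency"))


def check_variants_agree(variants):
    """A test case passes when every variant yields the same, non-empty extraction."""
    results = {extract_price(v) for v in variants}
    return len(results) == 1 and None not in results
```

A consumer that handles both variants passes check_variants_agree([VARIANT_DIRECT, VARIANT_PRICE_SPEC]); one that only reads the direct price property would fail it. The same pattern would have to be repeated for RDFa and Microdata input.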

Again, I think this is crucial for lowering the barrier to adoption, because schema.org would then be the authoritative guideline for developers. Currently, schema.org is only a starting point, and you need a lot of additional expertise and experience to apply it properly.


Best wishes / Mit freundlichen Grüßen

Martin Hepp

-------------------------------------------------------
martin hepp
e-business & web science research group
universitaet der bundeswehr muenchen

e-mail:  martin.hepp@unibw.de
phone:   +49-(0)89-6004-4217
fax:     +49-(0)89-6004-4620
www:     http://www.unibw.de/ebusiness/ (group)
         http://www.heppnetz.de/ (personal)
skype:   mfhepp 
twitter: mfhepp

Check out GoodRelations for E-Commerce on the Web of Linked Data!
=================================================================
* Project Main Page: http://purl.org/goodrelations/




On 20 May 2014, at 18:16, Dan Brickley <danbri@google.com> wrote:

> On 20 May 2014 11:28, Jarno van Driel <jarnovandriel@gmail.com> wrote:
>> Martin, I don't know if I completely agree about going to the product
>> forum about this. I think I understand why you might say this, but in my
>> thread about the working of WebPage (http://bit.ly/1jyFN0g), Jason Douglas
>> said:
>> 
>>> "That said, we probably do need a mechanism for indicating the "primary
>>> entity" of a webpage when there is one.  Current clients make up their own
>>> heuristics for this, but I think it would be better to have an explicit way
>>> of stating that."
>> 
>> But this is not the main subject of this thread. Maybe a new thread to
>> discuss the "primary entity" or continuation of the subject in the thread I
>> already started is a better place.
> 
> This is very much in scope for public-vocabs and for schema.org discussions.
> 
> There are a few pieces to the puzzle, but the basic idea is simple.
> Schema.org allows a rich descriptive graph to be embedded in a Web
> page, which means we often have several entities mentioned; we'd like
> to know which one is the main one, if any.
> 
> Consider the second example in http://schema.org/MusicEvent to give us
> a concrete focus.
> 
> It describes a 'MusicEvent' (a concert), whose 'location' is a
> 'Place'. The event lists multiple associated 'offers'; each 'Offer'
> with price/date etc. info. The event also lists two 'performer's, each
> a 'MusicGroup'.
> 
> There is nothing *intrinsically* primary about the event, the
> location, the offers or the musicians. This description is all the
> richer because it mentions multiple entities. If I was forced to pick
> one, I'd probably guess at the MusicEvent being the 'main' entity
> here, because the others feel slightly more like background
> information. But there's no need to leave this to guesswork. If this
> markup was on the homepage of the venue, that publisher might well
> consider the Place to be the main entity. And if it was on an artist's
> homepage, they might want to mention the gig (perhaps alongside
> others) but indicate that the MusicGroup was the main thing.
> 
> The above sketches this in terms of embedded structured data, but we
> can also think of this in terms of capturing a very common pattern in
> Web content. Often Web pages _do_ have a focus on a single entity. If
> we add a property like mainEntity, it would give sites a way to make
> this focus explicit.
> 
> PROPOSAL:
> 
> 1.
> We already have "about", "The subject matter of the content.",
> relating a CreativeWork to a Thing. This is enough to do what we need,
> if we add clarification and examples.
> 
> I suggest the description should be updated to say: "A Thing that is
> the primary subject matter of this CreativeWork".
> 
> 2.
> If we want a more SKOS-like, bibliographic and nuanced notion of
> 'subject', I suggest we adopt something like Dublin Core's 'subject'
> to do that work.
> 
> (DC has "The topic of the resource."/ "Typically, the subject will be
> represented using keywords, key phrases, or classification codes.
> Recommended best practice is to use a controlled vocabulary.", from
> http://purl.org/dc/terms/ )
> 
> The distinction:
> 
> If we want to say "This document is about the entity Sweden" (i.e. the
> thing that is sameAs http://en.wikipedia.org/wiki/Sweden,
> http://www.freebase.com/m/0d0vqn), we would use
> http://schema.org/about   ... i.e. this tells us the main thing that
> the page is about.
> 
> but
> 
> If we want to say "This document's topic is 'environmental impact of
> the decline of tin mining in Sweden in the 20th century'", we'd be
> going beyond "about" and would want a more bibliographic subject
> description, e.g. using DDC or UDC subject classification codes, SKOS
> etc.
> 
> (fictional example, I know nothing about tin mining in Sweden)
> 
> My proposal then is that we break out these two use cases, and target
> the 'about' more explicitly on the 'main entity' use case.
> 
> 3. Tweak http://schema.org/mentions
> 
> We should note that http://schema.org/mentions is a very similar
> notion to http://schema.org/about except that it allows multiple
> different entities to be referenced.
> 
> "Indicates that the CreativeWork contains a reference to, but is not
> necessarily about a concept."
> 
> I suggest rewording this in terms of entities/things, since we don't
> use 'concept' elsewhere:
> 
> "Indicates that the CreativeWork contains a reference to, but is not
> necessarily about some particular thing."
> 
> 4. http://schema.org/mainContentOfPage
> 
> We already have this strange-looking property. It addresses a
> different use case:
> 
> it relates a WebPage to a part of that WebPage,
> "Indicates if this web page element is the main subject of the page."
> 
> The wording is awkward. It should be something like "Indicates the
> main element within some Web page." since the expected type is
> WebPageElement.
> 
> I'm not convinced that the various types we have under WebPageElement
> ("A web page element, like a table or an image") really work, but the
> important point here is that they address a different scenario. A
> WebPageElement is a piece of markup, like SiteNavigationElement,
> Table, WPAdBlock, WPFooter, WPHeader, WPSideBar. This is a different
> idea to the problem of finding the main *entity* that all this markup
> is describing.
> 
> HTML already has a <main> element, see
> https://developer.mozilla.org/en-US/docs/Web/HTML/Element/main
> 
> "The HTML <main> element represents the main content of  the <body> of
> a document or application. The main content area consists of content
> that is directly related to, or expands upon the central topic of a
> document or the central functionality of an application. This content
> should be unique to the document, excluding any content that is
> repeated across a set of documents such as sidebars, navigation links,
> copyright information, site logos, and search forms (unless, of
> course, the document's main function is as a search form)."
> 
> I believe most of the use cases for mainContentOfPage are better
> addressed by <main>.
> 
> However <main> does not help us pick out a single highlighted entity:
> the main section of a Web page could still contain microdata/rdfa or
> json-ld mentioning lots of different entities.
> 
> It is useful sometimes to know that structured data markup comes from
> footers or boilerplate rather than the <main> section of a page, and
> it is probably worth including some examples of this on the schema.org
> site.
> 
> 
> 5. Avoiding ratholes
> 
> If we can please discuss this without slipping into discussion of
> http://www.w3.org/2001/tag/group/track/issues/14 I'd be happy. There
> are places in schema.org usage where we tolerate a URL for a WebPage
> being used in place of a URL that is more explicitly for the
> real-world entity itself. For example in http://schema.org/Person we
> write "<a href="http://www.xyz.edu/students/alicejones.html"
> itemprop="colleague">Alice Jones</a>".
> 
> Clarifying the use of 'about' as above could help such pages clarify
> which real world entity they are 'about'. This won't solve every issue
> around entity disambiguation, but it will improve the basic support we
> have within schema.org for stating such distinctions when we want to.
> 
> (Sorry this was such a long mail...)
> 
> Finally, let's also try not to get stuck on syntax issues at this
> stage. We'll have to find the best patterns in Microdata/RDFa and
> JSON-LD that we can for this, and it may sometimes be tricky. Here's
> an attempt at amending the MusicEvent example by adding a WebPage and
> 'about' - https://gist.github.com/anonymous/cf7e24f6378b176aa010 . We
> might want to discuss a reverse property that could be expressed on
> the entity rather than the page, for example.
> 
> cheers,
> 
> Dan
> 
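
The 'about' proposal in the quoted message can be sketched roughly as follows, using the MusicEvent scenario it describes. This is only an illustration under stated assumptions: the node identifiers, names, and the main_entity helper are hypothetical, the graph is written as plain Python dicts in a JSON-LD-like shape, and nothing here is an agreed schema.org pattern.

```python
# Hypothetical description of a page: the WebPage is 'about' the event and
# merely 'mentions' the venue and the band.
NODES = [
    {"@id": "#page", "@type": "WebPage",
     "about": {"@id": "#event"},
     "mentions": [{"@id": "#venue"}, {"@id": "#band"}]},
    {"@id": "#event", "@type": "MusicEvent", "name": "Example Concert",
     "location": {"@id": "#venue"}, "performer": [{"@id": "#band"}]},
    {"@id": "#venue", "@type": "Place", "name": "Example Hall"},
    {"@id": "#band", "@type": "MusicGroup", "name": "Example Band"},
]


def main_entity(nodes):
    """Return the node the WebPage says it is 'about', or None if unstated."""
    by_id = {node["@id"]: node for node in nodes}
    for node in nodes:
        if node.get("@type") == "WebPage" and "about" in node:
            return by_id.get(node["about"]["@id"])
    return None


# The same graph on the venue's homepage would point 'about' at the Place.
VENUE_NODES = [dict(NODES[0], about={"@id": "#venue"})] + NODES[1:]
```

With this, main_entity(NODES) yields the MusicEvent node and main_entity(VENUE_NODES) yields the Place node, while a graph without a WebPage/'about' statement yields None and leaves the consumer to its own heuristics.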

Received on Monday, 11 August 2014 20:36:50 UTC