Indicating main entity / primaryTopic - proposal to use 'schema.org/about'

From: Dan Brickley <danbri@google.com>
Date: Tue, 20 May 2014 17:16:04 +0100
Message-ID: <CAK-qy=6nzkvt9UZBOt6B5NQAdGpmLvNKj3MqBYXLA3e_zODSFg@mail.gmail.com>
To: W3C Web Schemas Task Force <public-vocabs@w3.org>
On 20 May 2014 11:28, Jarno van Driel <jarnovandriel@gmail.com> wrote:
> Martin, I don't know if I completely agree about going to the product
> forum about this. I think I understand why you might say this, but in my
> thread about the working of WebPage (http://bit.ly/1jyFN0g), Jason Douglas
> said:
>
>> "That said, we probably do need a mechanism for indicating the "primary
>> entity" of a webpage when there is one.  Current clients make up their own
>> heuristics for this, but I think it would be better to have an explicit way
>> of stating that."
>
> But this is not the main subject of this thread. Maybe a new thread to
> discuss the "primary entity" or continuation of the subject in the thread I
> already started is a better place.

This is very much in scope for public-vocabs and for schema.org discussions.

There are a few pieces to the puzzle, but the basic idea is simple.
Schema.org allows a rich descriptive graph to be embedded in a Web
page, which means we often have several entities mentioned; we'd like
to know which one is the main one, if any.

Consider the second example in http://schema.org/MusicEvent to give us
a concrete focus.

It describes a 'MusicEvent' (a concert), whose 'location' is a
'Place'. The event lists multiple associated 'offers'; each 'Offer'
with price/date etc. info. The event also lists two 'performer's, each
a 'MusicGroup'.

There is nothing *intrinsically* primary about the event, the
location, the offers or the musicians. This description is all the
richer because it mentions multiple entities. If I was forced to pick
one, I'd probably guess at the MusicEvent being the 'main' entity
here, because the others feel slightly more like background
information. But there's no need to leave this to guesswork. If this
markup was on the homepage of the venue, that publisher might well
consider the Place to be the main entity. And if it was on an artist's
homepage, they might want to mention the gig (perhaps alongside
others) but indicate that the MusicGroup was the main thing.

The above sketches this in terms of embedded structured data, but we
can also think of this in terms of capturing a very common pattern in
Web content. Often Web pages _do_ have a focus on a single entity. If
we add a property like mainEntity, it would give sites a way to make
this focus explicit.

PROPOSAL:

1.
We already have "about", "The subject matter of the content.",
relating a CreativeWork to a Thing. This is enough to do what we need,
if we add clarification and examples.

I suggest the description should be updated to say: "A Thing that is
the primary subject matter of this CreativeWork".
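
To make the 'main entity' reading of "about" concrete, here is a rough
sketch in Python of how a consumer might use it. It is only an
illustration, not schema.org tooling: the page description, the "@id"
values and the names in it are made up, and main_entity() is a
hypothetical helper.

```python
import json

# Hypothetical JSON-LD for a concert page. All ids and names are invented.
page = {
    "@context": "http://schema.org",
    "@graph": [
        {"@id": "#page", "@type": "WebPage", "about": {"@id": "#gig"}},
        {"@id": "#gig", "@type": "MusicEvent", "name": "Example Concert",
         "location": {"@id": "#venue"}, "performer": [{"@id": "#band"}]},
        {"@id": "#venue", "@type": "Place", "name": "Example Hall"},
        {"@id": "#band", "@type": "MusicGroup", "name": "Example Band"},
    ],
}


def main_entity(doc):
    """Return the node the WebPage says it is 'about', or None."""
    nodes = {n["@id"]: n for n in doc["@graph"] if "@id" in n}
    for node in doc["@graph"]:
        if node.get("@type") == "WebPage" and "about" in node:
            return nodes.get(node["about"]["@id"])
    return None  # no WebPage, or the page does not state a main entity


print(json.dumps(main_entity(page), indent=2))
```

The same graph on the venue's homepage would differ only in the
"about" value of the WebPage node (pointing at "#venue" instead), so
the publisher, not the consumer's heuristics, decides what is primary.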

2.
If we want a more SKOS-like, bibliographic and nuanced notion of
'subject', I suggest we adopt something like Dublin Core's 'subject'
to do that work.

(DC has "The topic of the resource."/ "Typically, the subject will be
represented using keywords, key phrases, or classification codes.
Recommended best practice is to use a controlled vocabulary.", from
http://purl.org/dc/terms/ )

The distinction:

If we want to say "This document is about the entity Sweden" (i.e. the
thing that is sameAs http://en.wikipedia.org/wiki/Sweden and
http://www.freebase.com/m/0d0vqn), we would use
http://schema.org/about, i.e. this tells us the main thing that
the page is about.

but

If we want to say "This document's topic is 'environmental impact of
the decline of tin mining in Sweden in the 20th century'", we'd be
going beyond "about" and would want a more bibliographic subject
description, e.g. using DDC or UDC subject classification codes, SKOS
etc.

(fictional example, I know nothing about tin mining in Sweden)

My proposal then is that we break out these two use cases, and target
the 'about' more explicitly on the 'main entity' use case.

3. Tweak http://schema.org/mentions

We should note that http://schema.org/mentions is a very similar
notion to http://schema.org/about except that it allows multiple
different entities to be referenced.

"Indicates that the CreativeWork contains a reference to, but is not
necessarily about a concept."

I suggest rewording this in terms of entities/things, since we don't
use 'concept' elsewhere:

"Indicates that the CreativeWork contains a reference to, but is not
necessarily about some particular thing."
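
As a small sketch of how the two properties could sit side by side
(Python again; the page data is invented and summarize() is a
hypothetical helper, not part of any schema.org library): a venue's
page is "about" the Place and merely "mentions" an upcoming gig and a
band.

```python
def as_list(value):
    """Normalise a missing, single or repeated property value to a list."""
    if value is None:
        return []
    return value if isinstance(value, list) else [value]


def summarize(page):
    """Split the things a page refers to into 'about' and 'mentions'."""
    return {
        "about": [t["name"] for t in as_list(page.get("about"))],
        "mentions": [t["name"] for t in as_list(page.get("mentions"))],
    }


# Hypothetical venue homepage: one main entity, several mentioned ones.
venue_page = {
    "@type": "WebPage",
    "about": {"@type": "Place", "name": "Example Hall"},
    "mentions": [
        {"@type": "MusicEvent", "name": "Example Concert"},
        {"@type": "MusicGroup", "name": "Example Band"},
    ],
}

print(summarize(venue_page))
```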

4. http://schema.org/mainContentOfPage

We already have this strange-looking property. It addresses a
different use case:

it relates a WebPage to a part of that WebPage, and is described as
"Indicates if this web page element is the main subject of the page."

The wording is awkward. It should be something like "Indicates the
main element within some Web page." since the expected type is
WebPageElement.

I'm not convinced that the various types we have under WebPageElement
("A web page element, like a table or an image") really work, but the
important point here is that they address a different scenario. A
WebPageElement is a piece of markup, like SiteNavigationElement,
Table, WPAdBlock, WPFooter, WPHeader, WPSideBar. This is a different
idea to the problem of finding the main *entity* that all this markup
is describing.

HTML already has a <main> element, see
https://developer.mozilla.org/en-US/docs/Web/HTML/Element/main

"The HTML <main> element represents the main content of  the <body> of
a document or application. The main content area consists of content
that is directly related to, or expands upon the central topic of a
document or the central functionality of an application. This content
should be unique to the document, excluding any content that is
repeated across a set of documents such as sidebars, navigation links,
copyright information, site logos, and search forms (unless, of
course, the document's main function is as a search form)."

I believe most of the use cases for mainContentOfPage are better
addressed by <main>.

However, <main> does not help us pick out a single highlighted entity:
the main section of a Web page could still contain Microdata/RDFa or
JSON-LD mentioning lots of different entities.

It is useful sometimes to know that structured data markup comes from
footers or boilerplate rather than the <main> section of a page, and
it is probably worth including some examples of this on the schema.org
site.
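
One way a consumer might tell the two apart is sketched below in
Python, using only the standard library's html.parser. The RegionTagger
class, the choice of region tags and the sample HTML are all
assumptions made for illustration; it only looks at Microdata
'itemscope' elements and does not handle RDFa or JSON-LD.

```python
from html.parser import HTMLParser

# Sectioning elements treated as "regions" in this sketch (an assumption).
REGIONS = {"main", "header", "footer", "nav", "aside"}


class RegionTagger(HTMLParser):
    """Record which page region each Microdata item starts in."""

    def __init__(self):
        super().__init__()
        self.stack = []  # currently open region elements, innermost last
        self.found = []  # (itemtype, region) pairs in document order

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag in REGIONS:
            self.stack.append(tag)
        if "itemscope" in attrs:
            region = self.stack[-1] if self.stack else "other"
            self.found.append((attrs.get("itemtype"), region))

    def handle_endtag(self, tag):
        if self.stack and self.stack[-1] == tag:
            self.stack.pop()


# Invented sample page: an event in <main>, publisher details in <footer>.
SAMPLE = """
<body>
<main><div itemscope itemtype="http://schema.org/MusicEvent">
  <span itemprop="name">Example Concert</span></div></main>
<footer><div itemscope itemtype="http://schema.org/Organization">
  <span itemprop="name">Example Site Ltd</span></div></footer>
</body>
"""

tagger = RegionTagger()
tagger.feed(SAMPLE)
for itemtype, region in tagger.found:
    print(region, itemtype)
```

This only says where markup sits in the page; it still cannot say
which of several entities inside <main> is the primary one.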


5. Avoiding ratholes

If we can please discuss this without slipping into discussion of
http://www.w3.org/2001/tag/group/track/issues/14 I'd be happy. There
are places in schema.org usage where we tolerate a URL for a WebPage
being used in place of a URL that is more explicitly for the
real-world entity itself. For example in http://schema.org/Person we
write "<a href="http://www.xyz.edu/students/alicejones.html"
itemprop="colleague">Alice Jones</a>".

Clarifying the use of 'about' as above could help such pages clarify
which real world entity they are 'about'. This won't solve every issue
around entity disambiguation, but it will improve the basic support we
have within schema.org for stating such distinctions when we want to.

(Sorry this was such a long mail...)

Finally, let's also try not to get stuck on syntax issues at this
stage. We'll have to find the best patterns in Microdata/RDFa and
JSON-LD that we can for this, and it may sometimes be tricky. Here's
an attempt at amending the MusicEvent example by adding a WebPage and
'about' - https://gist.github.com/anonymous/cf7e24f6378b176aa010 . We
might want to discuss a reverse property that could be expressed on
the entity rather than the page, for example.
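
As one possible shape for that, JSON-LD's "@reverse" keyword lets the
entity carry the statement instead of the page. The Python sketch
below is an assumption about how this might look, not an agreed
pattern: the example.org URLs are invented and pages_about() is a
hypothetical helper.

```python
def pages_about(entity):
    """List (page_id, entity_id) pairs stated via '@reverse' on the entity."""
    rev = entity.get("@reverse", {}).get("about", [])
    if isinstance(rev, dict):  # a single page rather than a list of pages
        rev = [rev]
    return [(p["@id"], entity["@id"]) for p in rev]


# Hypothetical entity-centric description: the event names the page
# that is 'about' it, rather than the page pointing at the event.
gig = {
    "@context": "http://schema.org",
    "@id": "http://example.org/gigs/1#event",
    "@type": "MusicEvent",
    "name": "Example Concert",
    "@reverse": {
        "about": {"@id": "http://example.org/gigs/1", "@type": "WebPage"}
    },
}

for page_id, entity_id in pages_about(gig):
    print(page_id, "about", entity_id)
```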

cheers,

Dan
Received on Tuesday, 20 May 2014 16:16:33 UTC
