Re: SoftwareApplication and the schema.org extension mechanism from Dan Brickley on 2012-01-02 (public-vocabs@w3.org from January 2012)

From: Dan Brickley <danbri@danbri.org>
Date: Mon, 2 Jan 2012 23:24:24 +0100
To: Jason Ronallo <jronallo@gmail.com>
Cc: public-vocabs <public-vocabs@w3.org>
Message-ID: <CAFNgM+a7ZJgJH=yEcHQE_uXr-bRQn73-WpZmvrHf4R-8MaLeXA@mail.gmail.com>
Hi Jason,

First up, sorry for sluggish replies here in the last few weeks. I see
there are quite a few questions asked that need responses or adding to
the issue tracker; I'll get to these threads this week.

On 20 December 2011 22:20, Jason Ronallo <jronallo@gmail.com> wrote:
> I was watching a video on using microdata and schema.org for Google
> Rich Snippets for software applications [1]. At first I was surprised
> that something for software applications wasn't already included in
> schema.org, but on reading a webmaster support page [2] it states that
> it is a Google extension of schema.org of the CreativeWork type. The
> suggestion is to use http://schema.org/SoftwareApplication as the
> itemtype. It seems from the schema.org Extension Mechanism page [3]
> that extending CreativeWork would more correctly be:
> http://schema.org/CreativeWork/SoftwareApplication

Yes, you're reading the documentation correctly. By writing
"http://schema.org/CreativeWork/SoftwareApplication" we pack into the
URL the assertion that some class called 'SoftwareApplication' is a
more specialized subclass of 'CreativeWork'. By using the
not-yet-agreed URL "http://schema.org/SoftwareApplication", the Google
Rich Snippets app leaves this hierarchical information out of the URL.

We should probably either (1) add something to
http://schema.org/docs/extension.html to note that it's OK to propose
new top level class terms (implicitly sub-classes of Thing) this way,
or (2) discourage people from proposing-by-inventing top level terms.
Deploying markup that hasn't been agreed always comes with some risk
and uncertainty, for the schema.org partner companies as well as for
the public at large. But it also gives a quick and readable way of
getting things out there easily.

Probably the best way to read a URL for an unknown top-level class
such as "http://schema.org/Xyz" is that it is asserting the existence
of a subclass of schema.org's root object '/', i.e. a shorthand for
saying it's a subclass of "Thing". (This is implied, since the class
'Thing' embraces everything, without exception). While the
'/'-extension mechanism allows URLs to include other class names, it
does not require that all relevant superclasses be in the
URL. For example, many classes around the topic of local businesses
are subclasses of both 'Place' and 'LocalBusiness' (e.g.
http://schema.org/RecyclingCenter). We can't expect to pack all that
information into the URL. So similarly for extensions, we make it
possible to include class hierarchy in the name, but it might not
always be essential.

On this reading of the extension mechanism, the URL
http://schema.org/Xyz is a shorthand for http://schema.org/Thing/Xyz

...therefore the proposal-by-use of
http://schema.org/SoftwareApplication was a kind of shorthand for
http://schema.org/Thing/SoftwareApplication. This is consistent with
SoftwareApplication being a sub-class of other more specific terms
(e.g. CreativeWork); it just doesn't emphasize it.
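
As a rough sketch of this reading (not anything schema.org itself
publishes), the class names packed into a URL can be pulled out of the
path, with 'Thing' supplied when it is left implicit. The function name
extension_chain is made up for illustration, and the result covers only
the superclasses that happen to be in the URL, not every superclass the
class may have.

```python
from typing import List
from urllib.parse import urlparse


def extension_chain(url: str) -> List[str]:
    """Return the class names found in a '/'-extended URL, root first.

    Hypothetical helper: it only reflects what is written in the URL path.
    """
    segments = [s for s in urlparse(url).path.split("/") if s]
    # On the reading above, the root '/' is shorthand for 'Thing',
    # so add it whenever the URL does not spell it out.
    if not segments or segments[0] != "Thing":
        segments.insert(0, "Thing")
    return segments


print(extension_chain("http://schema.org/SoftwareApplication"))
print(extension_chain("http://schema.org/CreativeWork/SoftwareApplication"))
```

Under this sketch the short and long forms differ only in how much of
the hierarchy they state, which is the point made above.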

> Am I not understanding the schema.org extension mechanism? Or can the
> schema.org partners just extend schema.org as they please without
> using the extension mechanism?

Anyone (schema.org partners or otherwise) can freely publish and/or
consume markup that extends the core schema, and anyone can use such
extensions.
Ideally, parties that propose the deployment of not-yet-agreed
extensions will be pluralistic and consume both the original design
plus whatever is subsequently agreed by a wider group.


> Or maybe this is a candidate for expanding the schema.org vocabulary,
> so rather than putting the extended forms of the URLs out there in the
> wild the choice was to just start off with what the URLs would be if
> they were a part of schema.org?

Yes, I think this extension is best understood as a proposal for
inclusion in the core. But there are also various other pragmatic
factors in play here:

Firstly, we've had feedback over the last few months that expresses
skepticism about the "/"-based extension mechanism and the extent to
which it's useful to emphasize it.

This is related to the discussions we've had about various technical
details of the Microdata and RDFa syntaxes, sometimes called
"distributed extensibility". Jeni Tennison has a good post on this at
http://www.jenitennison.com/blog/node/156 and from the comments there,
a handy example: if we see the URL "http://schema.org/Person/Minister"
how do we know whether it is meant to be a governmental minister or a
religious one? By extending a schema.org URL the proposer doesn't
really have anywhere obvious to describe in more detail what they're
proposing. Now *if* schema.org had classes defined for religious
minister, and for political/govt minister, the URL could be based on
those. But as we say in http://schema.org/docs/extension.html "the
variety and richness of structured data covering everything on the web
is much too rich for a single organization (like schema.org) to
completely cover". The work towards a simple RDFa Lite that fits
schema.org's simplicity goals (see
http://blog.schema.org/2011/11/using-rdfa-11-lite-with-schemaorg.html
) is relevant here, since RDFa emphasizes a more decentralised
approach, at the cost of seeing full URLs in the data. So for the
minister example, a party who wanted an extension class for the
political notion of 'Minister' could deploy something like
http://reference.data.gov.uk/def/central-government/Minister as a
class URL, alongside http://schema.org/Person. And then in that page
they could define (for people and machines) more clearly what they
intended. In some situations this works better than '/'-based
extension, in others it's more complex. There's no single simple right
answer here, but these design tradeoffs partly explain why the
"/"-based mechanism isn't the only option and why we might not be
pushing heavily for its use.

Secondly, by defining SoftwareApplication as a subclass only of
'Thing', the initial proposal is being non-committal, rather than
absolute. Instead of starting out by saying that each and every
SoftwareApplication will always also be a CreativeWork, it leaves the
door open for this to be said in the future, or for the classes to
only partially overlap. Figuring this out is part of the task of
reviewing the extension: can we think of counter-examples, e.g.
something that might be a software app but not usefully considered a
CreativeWork?

> If so, is there other public documentation of where the schema for
> software applications is being hashed out? It seems that there are
> some properties like license name which could help support discovery
> of open source software.

We can reasonably expect to see a revised proposal here (to this
group) from the team behind the Rich Snippets software app vocabulary,
based on the feedback and deployment they received on the initial
design. I don't know how long this will take. In the meantime it would
be absolutely great if other properties and classes around this topic
were discussed here; feel free to keep notes in the wiki, i.e.
http://www.w3.org/wiki/WebSchemas and nearby.

Quite a few opensource / free software projects have DOAP RDF
descriptions available, see http://trac.usefulinc.com/doap (or its
ancestor, https://launchpad.net/rdf ) so there might be something to
learn from (or directly use) there.

> Also the suggestion for the softwareApplicationCategory property is to
> use one of the supported software application types listed on a web
> page [4]. The recently updated JobPosting type [5] also appears to
> suggest that the value of the occupationalCategory should use an
> outside taxonomy [6]. Are there other examples in schema.org proper
> where one should choose from a list of types like this?

The general design is that we don't want to maintain lots of giant
lists of enumerated types within the core schema.org vocabulary. In
some places we've already got a few, but the aspiration is to point
off to externally maintained lists instead. And the challenge is to do
this in a way that promotes simple markup and easy adoption.

The JobPosting case illustrates the tradeoffs. On the one hand,
schema.org has a strong bias towards putting a lot of things in a
single flat vocabulary, rather than forcing publishers to remember 15
different overlapping namespaces to accomplish fairly simple tasks. On
the other hand, we can't put *everything* in there! And often that
external info isn't (yet...) available in nice structured form, or
even in HTML. So to take the case of
http://www.onetcenter.org/taxonomy.html it seems the bulk content we
care about is currently up there in a PDF - see
http://www.onetcenter.org/dl_files/Taxonomy2010_AppA.pdf - and
consists of pairs of local codes + textual labels. Now if each of
those codes had a nice HTML page (maybe with RDFa/Microdata embedded)
it might just be plausible to expect publishers of job descriptions to
point to the page. But even then, it might be more realistic to hope
only for the pairing of the code and label to appear as a textual
property. For example "11-3071.02 Storage and Distribution Managers".
To this mix we can add some additional design goals: I18N/L10N, e.g.
if we want a Spanish description of the entry for 11-3071.02, as well as
inclusion of enumerated vocabularies defined by other authorities ---
for example in the EU the forthcoming ESCO work,
http://ec.europa.eu/social/main.jsp?langId=en&catId=89&newsId=852
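
To make the "code and label as a textual property" idea concrete, here
is a small sketch of how a consumer might split such a value back into
its two parts. The code pattern (two digits, hyphen, four digits, dot,
two digits) is an assumption generalised from the single example above,
and the names OccupationEntry and parse_entry are invented for
illustration.

```python
import re
from typing import NamedTuple, Optional


class OccupationEntry(NamedTuple):
    code: str
    label: str


# Assumed code shape, inferred only from "11-3071.02" in the text above.
_ENTRY = re.compile(r"^\s*(\d{2}-\d{4}\.\d{2})\s+(.+?)\s*$")


def parse_entry(text: str) -> Optional[OccupationEntry]:
    """Split 'CODE Label' into its parts, or return None if it doesn't fit."""
    match = _ENTRY.match(text)
    if match is None:
        return None
    return OccupationEntry(code=match.group(1), label=match.group(2))


entry = parse_entry("11-3071.02 Storage and Distribution Managers")
print(entry)
```

A value that does not start with a recognisable code yields None, so a
consumer could fall back to treating it as a plain free-text label.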

As far as SoftwareApplication is concerned, the enumeration is cast as
a set of types, and perhaps it might remain an extension understood
only by Rich Snippets even if the rest becomes consensual core. For
the Jobs case, the controlled values may work better as controlled
property values than as classes (though the distinction is ultimately
arbitrary).

> It seems as if this is another kind of extension mechanism that
> schema.org has to manage with some of the vocabulary maintenance
> falling to organizations outside of schema.org. Is this likely to be a
> recurring pattern?

Yes, exactly. It is clear that:

1. We can't put all vocabulary into schema.org - at some point we need
to define attachment points (e.g. superclass but not detail)
2. Mainstream publishers of schema.org markup have quite limited
ability to handle markup complexity or make complex technical choices
3. Casually pointing off to a list of external vocabs makes it harder
to navigate schema.org documentation and figure out what's needed

> How can content authors and tool creators best keep up with cases
> where the suggestion is made to use a value from an outside list for a
> particular property?

There are some layers here.

I think the first is for the basic modeling and markup idioms to be
clarified and described properly. The case where some party defines a
set of types that aren't agreed within the schema.org core; the case
where textual or URL property values are defined externally. How these
both look in Microdata and RDFa. How the external vocabularies are
documented; on their own site, and potentially also somewhere on
schema.org. How URL-based and textual value-based idioms work, and
inter-relate to each other. How to handle multi-lingual labels, etc.

Beyond that, there is value in making sure machine-friendly versions
of each set of values can be easily found (from schema.org, and from
our Wiki here). We can also start simply, without creating any giant
over-arching frameworks. So for example if anyone has extracted CSV,
JSON or anything more computer friendly than PDF from
http://www.onetcenter.org/dl_files/Taxonomy2010_AppA.pdf ...  do
please link those from http://www.w3.org/wiki/JobPostingSchema to save
others doing the same work. There is particular value also in defining
bridges to large and widely-used enumerations such as
Wikipedia/DBpedia and Freebase. And on top of all this, some scope for
tool sharing and re-use. For example there are a few tools around that
auto-complete against large vocabularies; e.g. Freebase Suggest -
http://www.freebase.com/docs/suggest - and for UI-oriented authoring
tools this approach is worth investigating. For publishers who are
bulk-publishing from a database rather than hand-authoring, other
tooling would be more relevant. Again the Freebase folk have been busy
( see http://code.google.com/p/google-refine/ "Google Refine is a
power tool for working with messy data, cleaning it up, transforming
it from one format into another, extending it with web services, and
linking it to databases like Freebase.") but I've not yet looked at
direct application of this to schema.org scenarios.
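
For anyone who does pull code/label pairs out of the PDF, a minimal
sketch of publishing them in the more computer-friendly forms mentioned
above might look like this. It uses only the Python standard library;
the column names "code" and "label" and the function names are
arbitrary choices for illustration, and the single row is the one
example quoted earlier in this mail.

```python
import csv
import io
import json
from typing import Iterable, Tuple

Pair = Tuple[str, str]


def to_json(pairs: Iterable[Pair]) -> str:
    """Serialise (code, label) pairs as a JSON array of objects."""
    return json.dumps([{"code": c, "label": l} for c, l in pairs], indent=2)


def to_csv(pairs: Iterable[Pair]) -> str:
    """Serialise (code, label) pairs as CSV text with a header row."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["code", "label"])
    writer.writerows(pairs)
    return buffer.getvalue()


pairs = [("11-3071.02", "Storage and Distribution Managers")]
print(to_json(pairs))
print(to_csv(pairs))
```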

Excuse the length and delayed reply, hope this is useful.

cheers,

Dan

> Jason Ronallo
>
> [1] http://www.youtube.com/watch?v=Yc8CQoWrsE0&feature=BFa&list=SP3107CD6C86454FE3&lf=list_related
> [2] http://support.google.com/webmasters/bin/answer.py?hl=en&answer=1645432
> [3] http://www.schema.org/docs/extension.html
> [4] http://support.google.com/webmasters/bin/answer.py?hl=en&answer=1645527
> [5] http://www.schema.org/JobPosting
> [6] http://www.onetcenter.org/taxonomy.html
>
>
Received on Tuesday, 3 January 2012 02:26:58 UTC