- From: Dan Brickley <danbri@danbri.org>
- Date: Mon, 2 Jan 2012 23:24:24 +0100
- To: Jason Ronallo <jronallo@gmail.com>
- Cc: public-vocabs <public-vocabs@w3.org>
Hi Jason,

First up, sorry for sluggish replies here in the last few weeks. I see
there are quite a few questions asked that need responses or adding to
the issue tracker; I'll get to these threads this week.

On 20 December 2011 22:20, Jason Ronallo <jronallo@gmail.com> wrote:
> I was watching a video on using microdata and schema.org for Google
> Rich Snippets for software applications [1]. At first I was surprised
> that something for software applications wasn't already included in
> schema.org, but on reading a webmaster support page [2] it states that
> it is a Google extension of schema.org of the CreativeWork type. The
> suggestion is to use http://schema.org/SoftwareApplication as the
> itemtype. It seems from the schema.org Extension Mechanism page [3]
> that extending CreativeWork would more correctly be:
> http://schema.org/CreativeWork/SoftwareApplication

Yes, you're reading the documentation correctly. By writing
"http://schema.org/CreativeWork/SoftwareApplication" we pack into the
URL the assertion that some class called 'SoftwareApplication' is a
more specialized subclass of 'CreativeWork'. By using the
not-yet-agreed URL "http://schema.org/SoftwareApplication", the Google
Rich Snippets app leaves this hierarchical information out of the URL.

We should probably either (1) add something to
http://schema.org/docs/extension.html to note that it's OK to propose
new top level class terms (implicitly sub-classes of Thing) this way,
or (2) discourage people from proposing-by-inventing top level terms.

Deploying markup that hasn't been agreed always comes with some risk
and uncertainty, for the schema.org partner companies as well as for
the public at large. But it also gives a quick and readable way of
getting things out there easily.

Probably the best way to read a URL for an unknown top-level class
such as "http://schema.org/Xyz" is that it is asserting the existence
of a subclass of schema.org's root object '/', i.e.
a shorthand for saying it's a subclass of "Thing". (This is implied,
since the class 'Thing' embraces everything, without exception.)

While the '/'-extension mechanism allows URLs to include other class
names, it does not require that all relevant superclasses *must* be in
the URL. For example, many classes around the topic of local
businesses are subclasses of both 'Place' and 'LocalBusiness' (e.g.
http://schema.org/RecyclingCenter). We can't expect to pack all that
information into the URL. So similarly for extensions, we make it
possible to include class hierarchy in the name, but it might not
always be essential.

On this reading of the extension mechanism, the URL
http://schema.org/Xyz is a shorthand for http://schema.org/Thing/Xyz
...therefore the proposal-by-use of
http://schema.org/SoftwareApplication was a kind of shorthand for
http://schema.org/Thing/SoftwareApplication. This is consistent with
SoftwareApplication being a sub-class of other more specific terms
(e.g. CreativeWork); it just doesn't emphasize it.

> Am I not understanding the schema.org extension mechanism? Or can the
> schema.org partners just extend schema.org as they please without
> using the extension mechanism?

Anyone (schema.org partners or otherwise) can publish and/or consume
markup that extends the core schema freely, and anyone can use those
extensions. Ideally, parties that propose the deployment of
not-yet-agreed extensions will be pluralistic and consume both the
original design and whatever is subsequently agreed by a wider group.

> Or maybe this is a candidate for
> expanding the schema.org vocabulary, so rather than putting the
> extended forms of the URLs out there in the wild the choice was to
> just starting off with what the URLs would be if they were a part of
> schema.org?

Yes, I think this extension is best understood as a proposal for
inclusion in the core.
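
As a rough illustration of the reading of '/'-extended URLs described
earlier in this message (http://schema.org/Xyz as a shorthand for
http://schema.org/Thing/Xyz), here is a minimal Python sketch. The
function name class_chain is hypothetical and not part of any
schema.org tooling; it only shows how a consumer might recover the
class names that a URL chooses to state.

```python
from urllib.parse import urlparse


def class_chain(type_url):
    """Return the class names packed into a '/'-extended type URL.

    Hypothetical helper: assumes every non-empty path segment is a
    class name and that the root object '/' stands for 'Thing'.
    """
    path = urlparse(type_url).path
    names = [segment for segment in path.split("/") if segment]
    # A URL that does not start at 'Thing' is read as implicitly
    # hanging off it, so make that explicit.
    if not names or names[0] != "Thing":
        names.insert(0, "Thing")
    return names


if __name__ == "__main__":
    for url in (
        "http://schema.org/SoftwareApplication",
        "http://schema.org/CreativeWork/SoftwareApplication",
    ):
        print(url, "->", " > ".join(class_chain(url)))
```

The chain lists only the superclasses the URL mentions; as noted
above, a class may have other superclasses that the URL leaves out.
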
But there are also various other pragmatic factors in place here:

Firstly, we've had feedback over the last few months that expresses
skepticism about the "/"-based extension mechanism and the extent to
which it's useful to emphasize it. This is related to the discussions
we've had about various technical details of the Microdata and RDFa
syntaxes, sometimes called "distributed extensibility". Jeni Tennison
has a good post on this at http://www.jenitennison.com/blog/node/156
and from the comments there, a handy example: if we see the URL
"http://schema.org/Person/Minister" how do we know whether it is meant
to be a governmental minister or a religious one? By extending a
schema.org URL the proposer doesn't really have anywhere obvious to
describe in more detail what they're proposing. Now *if* schema.org
had classes defined for religious minister, and for political/govt
minister, the URL could be based on those. But as we say in
http://schema.org/docs/extension.html "the variety and richness of
structured data covering everything on the web is much too rich for a
single organization (like schema.org) to completely cover".

The work towards a simple RDFa Lite that fits schema.org's simplicity
goals (see
http://blog.schema.org/2011/11/using-rdfa-11-lite-with-schemaorg.html )
is relevant here, since RDFa emphasizes a more decentralised approach,
at the cost of seeing full URLs in the data. So for the minister
example, a party who wanted an extension class for the political
notion of 'Minister' could deploy something like
http://reference.data.gov.uk/def/central-government/Minister as a
class URL, alongside http://schema.org/Person. And then in that page
they could define (for people and machines) more clearly what they
intended. In some situations this works better than '/'-based
extension; in others it's more complex.
There's no single simple right answer here, but these design tradeoffs
partly explain why the "/"-based mechanism isn't the only option and
why we might not be pushing heavily for its use.

Secondly, by defining SoftwareApplication as a subclass only of
'Thing', the initial proposal is being non-committal, rather than
absolute. Instead of starting out by saying that each and every
SoftwareApplication will always also be a CreativeWork, it leaves the
door open for this to be said in the future, or for the classes to
only partially overlap. Figuring this out is part of the task of
reviewing the extension: can we think of counter-examples, e.g.
something that might be a software app but not usefully considered a
CreativeWork, etc.

> If so, is there other public documentation of where the
> schema for software applications is being hashed out? It seems that
> there are some properties like license name which could help support
> discovery of open source software.

We can reasonably expect to see a revised proposal here (to this
group) from the team behind the Rich Snippets software app vocabulary,
based on the feedback they received and the deployment experience with
the initial design. I don't know how long this will take. In the
meantime it would be absolutely great if other properties and classes
around this topic were discussed here; feel free to keep notes in the
wiki, i.e. http://www.w3.org/wiki/WebSchemas and nearby.

Quite a few open source / free software projects have DOAP RDF
descriptions available, see http://trac.usefulinc.com/doap (or its
ancestor, https://launchpad.net/rdf ) so there might be something to
learn from (or directly use) there.

> Also the suggestion for the softwareApplicationCategory property is to
> use one of the supported software application types listed on a web
> page [4]. The recently updated JobPosting type [5] also appears to
> suggest that the value of the occupationalCategory should use an
> outside taxonomy [6].
> Are there other examples in schema.org proper
> where one should choose from a list of types like this?

The general design is that we don't want to maintain lots of giant
lists of enumerated types within the core schema.org vocabulary. In
some places we've already got a few, but the aspiration is to point
off to externally maintained lists instead. And the challenge is to do
this in a way that promotes simple markup and easy adoption.

The JobPosting case illustrates the tradeoffs. On the one hand,
schema.org has a strong bias towards putting a lot of things in a
single flat vocabulary, rather than forcing publishers to remember 15
different overlapping namespaces to accomplish fairly simple tasks. On
the other hand, we can't put *everything* in there! And often that
external info isn't (yet...) available in nice structured form, or
even in HTML.

So to take the case of http://www.onetcenter.org/taxonomy.html it
seems the bulk content we care about is currently up there in a PDF
(see http://www.onetcenter.org/dl_files/Taxonomy2010_AppA.pdf) and
consists of pairs of local codes + textual labels. Now if each of
those codes had a nice HTML page (maybe with RDFa/Microdata embedded)
it might just be plausible to expect publishers of job descriptions to
point to the page. But even then, it might be more realistic to hope
only for the pairing of the code and label to appear as a textual
property. For example "11-3071.02 Storage and Distribution Managers".

To this mix we can add some additional design goals: I18N/L10N, e.g.
if we want a Spanish description of the entry for 11-3071.02, as well
as inclusion of enumerated vocabularies defined by other authorities,
for example in the EU the forthcoming ESCO work,
http://ec.europa.eu/social/main.jsp?langId=en&catId=89&newsId=852

As far as SoftwareApplication is concerned, the enumeration is cast as
a set of types, and perhaps it might remain an extension understood
only by Rich Snippets even if the rest becomes consensual core.
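
To make the code-plus-label idiom mentioned above for the O*NET list a
little more concrete, here is a small Python sketch of how a consumer
might split such a textual value back into its two parts. The code
pattern (two digits, hyphen, four digits, dot, two digits) is an
assumption based only on the single example "11-3071.02 Storage and
Distribution Managers"; the helper name split_code_label is
hypothetical.

```python
import re

# Assumed shape of a code, inferred from the one example in this
# message; the real taxonomy may allow other forms.
CODE_LABEL = re.compile(r"^(\d{2}-\d{4}\.\d{2})\s+(.+)$")


def split_code_label(value):
    """Split 'CODE Label text' into (code, label), or return None."""
    match = CODE_LABEL.match(value.strip())
    if match is None:
        return None
    return match.group(1), match.group(2)


if __name__ == "__main__":
    print(split_code_label("11-3071.02 Storage and Distribution Managers"))
```

A value with no recognisable code simply yields None, so a consumer
can still fall back to treating the whole string as a plain label.
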
For the Jobs case, the controlled values may work better as controlled
property values than as classes (though the distinction is ultimately
arbitrary).

> It seems as if this is another kind of extension mechanism that
> schema.org has to manage with some of the vocabulary maintenance
> falling to organizations outside of schema.org. Is this likely to be
> a recurring pattern?

Yes, exactly. It is clear that:

1. We can't put all vocabulary into schema.org - at some point we
   need to define attachment points (e.g. superclass but not detail)
2. Mainstream publishers of schema.org markup have quite limited
   ability to handle markup complexity or make complex technical
   choices
3. Casually pointing off to a list of external vocabs makes it harder
   to navigate schema.org documentation and figure out what's needed

> How can content authors and tool creators best keep up with
> cases where the suggestion is made to use a value from an outside
> list for a particular property?

There are some layers here. I think the first is for the basic
modeling and markup idioms to be clarified and described properly.
That covers:

- the case where some party defines a set of types that aren't agreed
  within the schema.org core
- the case where textual or URL property values are defined externally
- how these both look in Microdata and RDFa
- how the external vocabularies are documented; on their own site, and
  potentially also somewhere on schema.org
- how URL-based and textual value-based idioms work, and inter-relate
  to each other
- how to handle multi-lingual labels, etc.

Beyond that, there is value in making sure machine-friendly versions
of each set of values can be easily found (from schema.org, and from
our Wiki here). We can also start simply, without creating any giant
over-arching frameworks. So for example if anyone has extracted CSV,
JSON or anything more computer friendly than PDF from
http://www.onetcenter.org/dl_files/Taxonomy2010_AppA.pdf ...
do please link those from http://www.w3.org/wiki/JobPostingSchema to
save others doing the same work.

There is particular value also in defining bridges to large and
widely-used enumerations such as Wikipedia/DBpedia and Freebase.

And on top of all this, some scope for tool sharing and re-use. For
example there are a few tools around that auto-complete against large
vocabularies; e.g. Freebase Suggest -
http://www.freebase.com/docs/suggest - and for UI-oriented authoring
tools this approach is worth investigating. For publishers who are
bulk-publishing from a database rather than hand-authoring, other
tooling would be more relevant. Again the Freebase folk have been busy
(see http://code.google.com/p/google-refine/ "Google Refine is a power
tool for working with messy data, cleaning it up, transforming it from
one format into another, extending it with web services, and linking
it to databases like Freebase.") but I've not yet looked at direct
application of this to schema.org scenarios.

Excuse the length and delayed reply, hope this is useful.

cheers,

Dan

> Jason Ronallo
>
> [1] http://www.youtube.com/watch?v=Yc8CQoWrsE0&feature=BFa&list=SP3107CD6C86454FE3&lf=list_related
> [2] http://support.google.com/webmasters/bin/answer.py?hl=en&answer=1645432
> [3] http://www.schema.org/docs/extension.html
> [4] http://support.google.com/webmasters/bin/answer.py?hl=en&answer=1645527
> [5] http://www.schema.org/JobPosting
> [6] http://www.onetcenter.org/taxonomy.html
Received on Tuesday, 3 January 2012 02:26:58 UTC