FW: [schemaorg/schemaorg] Update to: Core Types to Support the Discovery of Life Sciences Resources (#2711)

Hi All,

Please see the below for details of the response from Dan on our request to merge our first collection of types into Schema.org.

I think this is a hugely positive step forward. Hopefully inclusion in Pending will encourage more people both to deploy the markup and to build applications that rely on our proposed types.

Best regards

Alasdair

--
Alasdair J G Gray
Associate Professor in Computer Science,
School of Mathematical and Computer Sciences
Heriot-Watt University, Edinburgh, UK.

Email: A.J.G.Gray@hw.ac.uk<mailto:A.J.G.Gray@hw.ac.uk>
Web: http://www.macs.hw.ac.uk/~ajg33

ORCID: http://orcid.org/0000-0002-5711-4872

Office: Earl Mountbatten Building 1.39
Twitter: @gray_alasdair


Heriot-Watt is a global University, as a result my working hours may not be your working hours. Do not feel pressure to reply to this email outside your working hours.


To arrange a meeting: https://outlook.office365.com/owa/calendar/AlasdairGray@heriotwatt.onmicrosoft.com/bookings/


From: "notifications@github.com" <notifications@github.com>
Reply to: schemaorg/schemaorg <reply+AAIWUENVPLBNZFYILBZRFFF6OGZWDEVBNHHCT3QWYQ@reply.github.com>
Date: Thursday, 1 April 2021 at 15:02
To: schemaorg/schemaorg <schemaorg@noreply.github.com>
Cc: Alasdair Gray <A.J.G.Gray@hw.ac.uk>, "mention@noreply.github.com" <mention@noreply.github.com>
Subject: Re: [schemaorg/schemaorg] Update to: Core Types to Support the Discovery of Life Sciences Resources (#2711)


Short version: My sense is that we should get these terms into Pending, with a view towards them becoming part of core schema.org as evidence of data-consuming applications is collected. Based on the experience of the last few years, we should also expand our notion of "data-consuming applications" to cover developer- and data-scientist-facing applications, such as public open data knowledge graphs. I believe the bioschemas schemas have great potential, but we have work to do yet to determine quite what level of detail is going to prove appropriate for this kind of vocabulary.

Next steps: I've asked @RichardWallis<https://github.com/RichardWallis> to take a look at some minor fixes to the PR, to mark these terms as part of the Pending area of schema.org, and remove any conflicts (e.g. SchemaExamples/schemaexamples.py needs removing).

Status and Context and expectation setting

When the Bioschemas activity was first suggested we (Schema.org leads) were initially wary of bringing Schema.org into an area where there were a great number of existing scientific and research data ontologies, unless there was a serious prospect of the schemas being used in substantive user-benefitting applications that could guide our decision making. For general consumer topics (reviews, ratings, photos, etc.) Schema.org as a unifying vocabulary made clear sense and was guided by user-facing applications. As we touched on deeper scientific topics where many levels of detail are potentially applicable, the territory felt different.

I spoke about this at the Elixir<https://elixir-europe.org/events/elixir-all-hands-2016> 2016 All Hands, and in particular emphasized that it could be counterproductive to add this kind of vocabulary with the expectation of it primarily being used in general web search engine product features. We didn't want life-science site publishers to be disappointed if they added the markup to their sites and did not subsequently feel they were benefitting from having done so (e.g. in the Google case, by the markup being used by one of the features in Google Search's list of structured data features<https://developers.google.com/search/docs/guides/search-gallery>). And I didn't want to run into people at conferences a few years later and be told "we added all this markup to our site and it hasn't done us any good at all!".

Although these considerations apply to all schema.org additions, Bioschemas was an effort to move Schema.org towards covering scientific concepts and data structures in more detail than we had approached before. Schema.org has always focussed on schemas that are used, in the sense of consumed/interpreted by products, in user-facing features and applications. Without this, it is difficult to judge appropriate levels of detail, and it can be difficult for publishers to justify the effort of adding the markup.

The expectation originally was that the bioschemas project would work equally on the data-publishing and the data-consumption sides of making these schemas part of a healthy ecosystem. I think what we've seen is a lot more success on the former side than on the latter (and that is no fault of any individual or group who has been part of the bioschemas effort).

Pending

By bringing these terms into schema.org's Pending area, schema.org (per our standard documentation) sets the following expectations:

The Pending Section is a staging area for work-in-progress terms which have yet to be accepted into the core vocabulary. Pending terms are subject to change and should be used with caution.
Implementors and publishers are cautioned that terms in the pending extension may lack consensus and that terminology and definitions could still change significantly after community and steering group review. Consumers of schema.org data who encourage use of such terms are strongly encouraged to update implementations and documentation to track any evolving changes, and to share early implementation feedback with the wider community.

This is loosely analogous to language W3C uses for Working Drafts<https://www.w3.org/2020/Process-20200915/#RecsWD>, and I highlight it here because it is important to acknowledge that the bioschemas vocabulary has been the product of a significant and expert-informed process over the last few years, and in particular it has been created, amended and developed in collaboration with many authoritative publishers of bioinformatics / lifesciences data.

It may be that the vocabulary in its schema.org incarnation will evolve further, but readers arriving here without knowledge of its origins should know that there have been substantial and long-running, expert-led collaborations<https://bioschemas.org/meetings/> leading to these designs.

Our challenge now will be to address any technical and usability integration issues between these schemas and the rest of Schema.org, and to move the focus towards data-consuming applications, so that we can understand whether the level of detail, definitions, properties proposed here are sufficient to meet the needs of user-facing applications.

The Bioschemas project provides some supporting tooling<https://bioschemas.org/software/>, and there are other opensource tools (e.g. Gleaner.io<https://gleaner.io/>, Schemarama<https://github.com/google/schemarama>) that may be helpful to those developing applications.

Schema.org for Knowledge Graph Exchange

As we look to support the use of schema.org data in new and interesting areas, we should also take care to be open-minded about what counts as "using" Schema.org in a data-consuming application.

For example, at Google we made some investigations<https://github.com/google/schemarama/tree/main/kgx/wikidata/bioschemas> into whether Schema.org extended with Bioschemas is sufficiently expressive to capture a useful "knowledge graph for lifesciences<https://elifesciences.org/articles/52614>" subset extracted from Wikidata.org. Would such a database be a user-facing use of the data, or a workflow / infrastructural step towards an environment where user-facing applications could eventually be created? It is a little of both. While we can declare developers to be a kind of user we care about, these kinds of generic application do not always provide guidance that can help scope and shape schema design.

Such "knowledge graph exchange" scenarios for using Schema.org-based data are part of a larger trend. For example:
·         Yago<https://yago-knowledge.org/>, which converts Wikidata to use Schema.org vocabulary.
·         Ozymandias<https://iphylo.blogspot.com/2018/08/ozymandias-biodiversity-knowledge-graph.html>, "a biodiversity knowledge graph of Australian taxa and taxonomic publications".
·         Springer Nature's SciGraph<https://researchdata.springernature.com/posts/45943-sn-scigraph-latest-release-patents-clinical-trials-and-many-new-features>, "collates information from across the research landscape, i.e. the things, documents, people, places and relations of importance to the science and scholarly domain."
·         DataCommons.org<https://datacommons.org/>, "Datacommons.org is an open knowledge repository hosted by Google that provides a unified view across multiple public datasets, combining economic, scientific and other open datasets into an integrated data graph." (wikipedia<https://en.wikipedia.org/wiki/Datacommons.org>, github<https://github.com/datacommonsorg/>).

I believe we should as a project explicitly declare these kinds of open data sharing, "knowledge graph exchange" initiatives as being amongst the kinds of data-consuming application that justify additions and changes to Schema.org. They are very much in the spirit of the project, but some thought is needed on how to operationalize this.

This doesn't mean that just spinning up an RDF database with some test data in would be sufficient; rather that we would be acknowledging data scientists, developers and others who work with data as being important user constituencies. Just as schema.org serves non-technical search engine end-users who are looking for jobs, recipes, reviews, events, datasets or fact checks on the various search engines, it can also support developers and data scientists who work with aggregations of schema.org data. As the DataCommons.org site says,

We cleaned and processed the data so you don't have to. Data about particular entities are aggregated from different sources for a unified view.

This kind of service (provided also by Wikidata et al.) can add huge value and help others meet the needs of their users.

The clarification to be made here is that our exit criteria for moving terms out of "Pending" status into the Schema.org core vocabulary should consider public, opendata knowledge graph use (SPARQL/RDF, Property Graphs, etc.) as important evidence towards demonstrating the usefulness of schema.org schema designs.

To @stain<https://github.com/stain>'s point, it is true that we have been a little blocked at Schema.org in terms of knowing how to handle the Bioschemas proposals, since they do make significant amounts of great data accessible via schema.org markup, even if the data-consuming applications we collectively anticipated back in 2016 have yet to emerge.

Schema.org in the past has suffered from "build it and they'll come" optimism, and contains a number of schema designs which lack substantive data-consuming implementations. This is why we introduced the notion of "pending<https://schema.org/docs/howwework.html#pending>", so that there is an opportunity to surface potentially valuable schema designs, while also flagging up that we believe there may be possible tweaks ahead as data-consuming implementations surface.

If we clarify "user-facing, data-consuming application" to include open data-sharing "knowledge graph" systems like Wikidata, Yago, SN SciGraph, Ozymandias, Data Commons, I believe this opens up a roadmap for bringing Bioschemas (and similar proposals) into Schema.org, without setting unrealistic expectations about the schema details being used. In particular it gives us a new focal point for articulating questions about the user needs being met by schema designs; we can ask about the kinds of queries supported by the combination of these schemas with opendata that uses the schemas.

Framed in this way I'm a lot more comfortable bringing these schemas into Pending, as it gives a plausible path for progressing things further. @AlasdairGray<https://github.com/AlasdairGray> et al., does that work for you?

—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub<https://github.com/schemaorg/schemaorg/pull/2711#issuecomment-811929818>, or unsubscribe<https://github.com/notifications/unsubscribe-auth/AAIWUEOCEGGYCIR467YTVXTTGR4GDANCNFSM4RQHGDSQ>.


Received on Thursday, 1 April 2021 14:20:08 UTC