Re: Protein representation with a Bioschemas context from Anders Riutta on 2017-11-09 (public-bioschemas@w3.org from November 2017)

From: Anders Riutta <anders.riutta@gladstone.ucsf.edu>
Date: Thu, 9 Nov 2017 14:11:58 -0800
To: "public-bioschemas@w3.org" <public-bioschemas@w3.org>
Message-ID: <CAJEHyTm5cr6Uwi6yF-NU-hPk1CcRfjXP1BOAWn+sRKzr-PrdrQ@mail.gmail.com>
Hi,

I share Carol's hesitation to mint new IRIs if they'll be exactMatches for
existing IRIs (this xkcd comic <https://xkcd.com/927/> is cited so often as
to be a cliche, but there is some truth to it).

I also like Leyla's idea of focusing our efforts on creating one or more
shared JSON-LD contexts that reflect the consensus of the Bioschemas
community. The terms in this context or contexts can have a consistent
casing convention of our choosing, and the IRIs can be exclusively
third-party, pre-existing IRIs.

> We have to select terms from existing ontologies, i.e. we will be
> selecting one ontology over another

Alasdair's concern above is justified, because doing this can be sensitive,
and it can be tricky to accommodate the subtle variations in meaning that
different sub-communities attach to certain terms, especially when there
are multiple IRIs with a 98% overlap in meaning but a 2% difference that is
just enough to prevent them from being exactMatches. However, judiciously
endorsing selected IRIs as well thought out and reflective of existing
practice could actually be a significant source of value that the
Bioschemas community could provide, because it would be an efficient and
transparent process for recognizing and forming consensus. Note that we
don't have to choose one ontology in toto over another; we can pick and
choose terms from multiple ontologies, as appropriate.

In regard to Leyla's options 1 and 2, it seems there are two concerns: how
to mark up an existing API vs. a newly created one. For a new API, the
creators can easily use both the *terms* and *IRIs* from Bioschemas by
formatting their JSON like this
<https://github.com/ariutta/specifications/blob/ariutta-demo/Protein/examples/ProteinEntityNew.json>,
where "http://bioschemas.org/context.jsonld" would point to something like this
file
<https://github.com/ariutta/specifications/blob/ariutta-demo/context.jsonld>.
For a pre-existing API, the terms usually cannot be changed, but the
creators can still integrate with Bioschemas by adding a new JSON-LD
context that maps their terms to the *IRIs* endorsed by Bioschemas. That
could look roughly like this
<https://github.com/ariutta/specifications/blob/ariutta-demo/Protein/examples/ProteinEntityPreexisting.json>.
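
To make the two cases concrete, here is a minimal Python sketch. It is not taken from the linked files: the IRIs under http://example.org/ are placeholders, the contexts are written inline instead of pointing at a hosted Bioschemas context, and the pre-existing API's term names ("EnzymeProtein", "comesFromGene") are borrowed from Leyla's example of non-recommended aliases. The helper iris_used is hypothetical and only handles flat documents.

```python
# Placeholder IRIs standing in for whatever pre-existing IRIs Bioschemas endorses.
PROTEIN = "http://example.org/vocab/Protein"
TRANSCRIBED_FROM = "http://example.org/vocab/transcribedFrom"

# New API: uses both the Bioschemas terms and the endorsed IRIs.
new_api_doc = {
    "@context": {"Protein": PROTEIN, "transcribedFrom": TRANSCRIBED_FROM},
    "@type": "Protein",
    "transcribedFrom": "http://example.org/gene/G1",
}

# Pre-existing API: keeps its own terms, but maps them to the same IRIs.
preexisting_api_doc = {
    "@context": {"EnzymeProtein": PROTEIN, "comesFromGene": TRANSCRIBED_FROM},
    "@type": "EnzymeProtein",
    "comesFromGene": "http://example.org/gene/G1",
}


def iris_used(doc):
    """Return the set of IRIs a flat document refers to through its context."""
    ctx = doc["@context"]
    iris = {ctx[key] for key in doc if not key.startswith("@")}
    iris.add(ctx[doc["@type"]])
    return iris
```

Although the two documents use different terms, iris_used returns the same set of IRIs for both, which is the sense in which the pre-existing API is still integrated with Bioschemas.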

Creating one or more Bioschemas JSON-LD contexts, each made up of our
preferred terms mapped to pre-existing IRIs, will support a gradual
convergence of both terms and IRIs in our community. New APIs can match
both our terms and endorsed IRIs, while pre-existing APIs can keep their
terms but use our endorsed IRIs. The output from these pre-existing APIs
can be transformed to match Bioschemas terms by expanding with the JSON-LD
context specified by the API creators and then compacting with the
Bioschemas context.
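
The expand-then-compact step can be sketched in Python as follows. This is a deliberately simplified stand-in for the real JSON-LD expansion and compaction algorithms: it only handles flat term-to-IRI maps and keeps values as-is. The IRIs, the term names and the functions expand and compact are all assumptions made for illustration; a real implementation would use a JSON-LD processor.

```python
# Placeholder IRIs; the real ones would be the pre-existing IRIs Bioschemas endorses.
PROTEIN = "http://example.org/vocab/Protein"
TRANSCRIBED_FROM = "http://example.org/vocab/transcribedFrom"

# Hypothetical Bioschemas context: preferred terms mapped to endorsed IRIs.
BIOSCHEMAS_CONTEXT = {"Protein": PROTEIN, "transcribedFrom": TRANSCRIBED_FROM}

# Output of a pre-existing API that keeps its own terms.
api_output = {
    "@context": {"EnzymeProtein": PROTEIN, "comesFromGene": TRANSCRIBED_FROM},
    "@id": "http://example.org/protein/P1",
    "@type": "EnzymeProtein",
    "comesFromGene": "http://example.org/gene/G1",
}


def expand(doc):
    """Replace each term with its IRI, using the document's own context."""
    ctx = doc["@context"]
    out = {}
    for key, value in doc.items():
        if key == "@context":
            continue  # the context is consumed by expansion
        if key == "@type":
            out[key] = ctx.get(value, value)
        else:
            out[ctx.get(key, key)] = value
    return out


def compact(expanded, context):
    """Replace each IRI with the term that the given context assigns to it."""
    inverse = {iri: term for term, iri in context.items()}
    out = {"@context": context}
    for key, value in expanded.items():
        if key == "@type":
            out[key] = inverse.get(value, value)
        else:
            out[inverse.get(key, key)] = value
    return out


converted = compact(expand(api_output), BIOSCHEMAS_CONTEXT)
```

After the round trip, converted uses "Protein" and "transcribedFrom" in place of the API's own terms, while every IRI stays the same.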

The Bioschemas JSON-LD context could be a single file, or it could be a
combined context, where "http://bioschemas.org/context.jsonld" might point
to a collection of contexts like this:

[
  "http://schema.org/",
  "http://bioschemas.org/Protein/context.jsonld",
  "http://bioschemas.org/LabProtocol/context.jsonld",
  ...
]
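
As a rough illustration of how such a combined context behaves, JSON-LD processes an array of contexts in order, and a later term definition overrides an earlier one. The Python sketch below mimics that with already-loaded dictionaries instead of remote URLs. The per-profile contexts, their http://example.org/ IRIs and the combine helper are assumptions made for illustration; only http://schema.org/name is a real IRI.

```python
# Stand-ins for the contents of each context file (hypothetical except schema.org/name).
SCHEMA_ORG_SUBSET = {"name": "http://schema.org/name"}
PROTEIN_CONTEXT = {"Protein": "http://example.org/vocab/Protein"}
LAB_PROTOCOL_CONTEXT = {"LabProtocol": "http://example.org/vocab/LabProtocol"}


def combine(contexts):
    """Merge flat contexts in order; later definitions override earlier ones."""
    merged = {}
    for ctx in contexts:
        merged.update(ctx)
    return merged


combined = combine([SCHEMA_ORG_SUBSET, PROTEIN_CONTEXT, LAB_PROTOCOL_CONTEXT])
```

Because order matters, where schema.org sits in the list decides whose definition wins if two contexts define the same term.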

Regards,
Anders

On Thu, Nov 9, 2017 at 8:00 AM, Leyla Garcia <ljgarcia@ebi.ac.uk> wrote:

> Hi,
>
> On 09/11/2017 14:25, Gray, Alasdair J G wrote:
>
> Hi
>
> Unless I’m mistaken, the decision is now about presentation. Options 1 and
> 2 are both equivalent when you expand them out.
>
> Mmm, maybe it is not just about presentation. In order to make things
> easier for tools and validators, we would need to agree on a set of
> predefined aliases. Let's suppose Bioschemas recommends the aliases
> "Protein" and "transcribedFrom" but a markup uses "EnzymeProtein" and
> "comesFromGene". The Bioschemas validation and tools would not know what to
> do with those "unknown" aliases.
>
> If we do not want to impose any predefined aliases, then yes, the two
> options are the same. And then Bioschemas tools and validators will need to
> come up with a strategy to figure out what corresponds to one profile or
> the other and when two different aliases refer to the same concept.
>
> Regards,
>
>
>
> I personally like 2 as it makes the json-ld very readable and also
> explicitly declares where each property is from.
> https://github.com/BioSchemas/specifications/blob/master/Protein/examples/ProteinEntity-with-context.json
>
> Alasdair
>
> On 9 Nov 2017, at 13:44, Leyla Garcia <ljgarcia@ebi.ac.uk> wrote:
>
> Hi,
>
> In that case, our options are reduced to:
>
> 1. eliminate the long property names by introducing shorthands in the
> context, something like the latest commit to my example
> <https://github.com/BioSchemas/specifications/blob/master/PhysicalEntity/examples/BioChemEntityAlt-min+rec.jsonld>
>
> 2. use a context with predefined aliases linking to the ontology
> preferred by the data provider (see
> https://github.com/BioSchemas/specifications/blob/master/Protein/examples/ProteinEntity-with-context.json).
>
> Any preferences?
>
> Regards,
>
> On 09/11/2017 13:31, Carole Goble wrote:
>
>
>
> Bioschemas has already been publicly accused of running parallel ontology
> efforts and we were very clear that we were not going to reinvent
> ontologies.
> So I am very reluctant to do so.
>
> Rafa and I are already involved in a Researchschemas initiative with EOSC
> and have lines of enquiry for joining up with initiatives in biodiversity
> and geosciences. It would be good if we didn’t end up with many, many
> parallel activities, but instead with a converged one.
>
> Carole
>
>
>
>
> Sent from my iPhone by
> Professor Carole Goble
> The University of Manchester
> UK
>
> On 9 Nov 2017, at 11:13, Leyla Garcia <ljgarcia@ebi.ac.uk> wrote:
>
> Hi all,
>
> On 09/11/2017 10:47, Gray, Alasdair J G wrote:
>
> Hi All,
>
> Leyla, thanks for providing a concrete example from which we can base our
> discussions.
>
> Points in favour of Leyla’s proposal:
> - Properties and types defined in Bioschemas namespace
> - json-ld validates using the structured data markup tool
>
> It will depend on whether schema.org comes before (validates but all
> schema terms are moved to the bioschemas namespace) or after (does not
> validate but the namespaces are correctly preserved)
>
> - We don’t need to choose one ontology over another
>
> Points against Leyla’s proposal
> - We are minting our own ontology terms
>
>
> We can avoid that by using a context with just predefined aliases (see
> https://github.com/BioSchemas/specifications/blob/master/Protein/examples/ProteinEntity-with-context.json).
> But then, Google does not know anything about all those possible types
> that could be associated with the aliases.
>
> Minting our own terms (I would not say ontology) makes things easier as
> Google would need to know only schema.org and Bioschemas. BUT, then maybe
> Google does not want to open that door as Bioschemas would become a somewhat
> parallel vocabulary and other projects/groups might want to do something
> similar... OR maybe Google will prefer all to be moved as proper types to
> schema.org.
>
> Also, keep in mind that schema.org mints terms already covered by
> ontologies. Citations for instance are covered by the Bibliographic
> Ontology (BIBO) and the Semantic Publishing and Referencing (SPAR)
> ontologies.
>
> Regards,
>
>
> We can of course eliminate the long property names by introducing
> shorthands in the context, something like the latest commit to my example
> <https://github.com/BioSchemas/specifications/blob/master/PhysicalEntity/examples/BioChemEntityAlt-min+rec.jsonld>
> This could be expanded to something similar to the full context that Leyla
> used, but instead of creating new Bioschemas terms, we would reuse terms
> from existing ontologies.
>
> Points in favour of Alasdair's proposal:
> - We are not minting our own ontology terms
> - With a full context like Leyla’s, the example would validate
>
> Points against Alasdair's proposal
> - We have to select terms from existing ontologies, i.e. we will be
> selecting one ontology over another
>
> Ultimately, with all these proposals someone adopting will need to edit
> the same number of characters, and we should highlight somehow what these
> are.
>
> I think we are in broad agreement that we can move away from using the
> additionalProperties.
>
> What we still need to determine is whether we are going to mint terms in the
> Bioschemas namespace (that could eventually be pushed to schema.org) or
> select terms from existing ontologies. Opinions on this last point please.
>
> Alasdair
>
> Alasdair J G Gray
> Fellow of the Higher Education Academy
> Assistant Professor in Computer Science,
> School of Mathematical and Computer Sciences
> (Athena SWAN Bronze Award)
> Heriot-Watt University, Edinburgh UK.
>
> Email: A.J.G.Gray@hw.ac.uk
> Web: http://www.macs.hw.ac.uk/~ajg33
> ORCID: http://orcid.org/0000-0002-5711-4872
> Office: Earl Mountbatten Building 1.39
> Twitter: @gray_alasdair
Received on Thursday, 9 November 2017 22:12:28 UTC